pandas Archives - Creatronix

2021 – Advent of code – Day 2

Jörn — Fri, 03 Dec 2021 11:23:34 +0000

Part 1

Today the puzzle got a bit trickier than Day 1.

The submarine seems to already have a planned course (your puzzle input). You should probably figure out where it's going. For example:

    forward 5
    down 5
    forward 8
    up 3
    down 8
    forward 2

Your horizontal position and depth both start at 0. The steps above would then modify them as follows:

    forward 5 adds 5 to your horizontal position, a total of 5.
    down 5 adds 5 to your depth, resulting in a value of 5.
    forward 8 adds 8 to your horizontal position, a total of 13.
    up 3 decreases your depth by 3, resulting in a value of 2.
    down 8 adds 8 to your depth, resulting in a value of 10.
    forward 2 adds 2 to your horizontal position, a total of 15.

After following these instructions, you would have a horizontal position of 15 and a depth of 10. (Multiplying these together produces 150.)

Calculate the horizontal position and depth you would have after following the planned course. What do you get if you multiply your final horizontal position by your final depth?

So Pandas here we go again:

import pandas as pd

df = pd.read_csv("./aoc_day_02_data.txt", delimiter=" ", header=None)
df.columns = ["command", "value"]

Alright, reading in the data and naming the columns are the same steps as yesterday. Now we have two columns.

	0	1
0	forward	5
1	down	5
2	forward	8
3	up	3
4	down	8
5	forward	2

horizontal = df[df['command']=="forward"]["value"].sum()

The horizontal value can be calculated with the sum function when we filter the data frame to rows where the command is “forward”.

depth = df[df['command']=="down"]["value"].sum() - df[df['command']=="up"]["value"].sum()

The depth can be calculated by summing the down and up commands separately and subtracting the up total from the down total.

Now we have to multiply the depth and the horizontal position to get the solution:

position = depth * horizontal
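A quick way to sanity-check this approach is to run it on the sample course from the puzzle text. The sketch below is only an illustration: it inlines the sample with io.StringIO instead of reading a file, and it uses groupby to get all three sums in one pass rather than filtering three times.

```python
import io

import pandas as pd

# The sample course from the puzzle text, inlined so the sketch runs without a file.
sample = io.StringIO("forward 5\ndown 5\nforward 8\nup 3\ndown 8\nforward 2\n")
df = pd.read_csv(sample, delimiter=" ", header=None, names=["command", "value"])

# One sum per command; .get() falls back to 0 if a command never occurs.
totals = df.groupby("command")["value"].sum()

horizontal = int(totals.get("forward", 0))
depth = int(totals.get("down", 0) - totals.get("up", 0))
print(horizontal, depth, horizontal * depth)  # 15 10 150
```

The result matches the 150 from the puzzle description.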

Part 2

    down X increases your aim by X units.
    up X decreases your aim by X units.
    forward X does two things:
        It increases your horizontal position by X units.
        It increases your depth by your aim multiplied by X.

Now, the above example does something different:

    forward 5 adds 5 to your horizontal position, a total of 5. Because your aim is 0, your depth does not change.
    down 5 adds 5 to your aim, resulting in a value of 5.
    forward 8 adds 8 to your horizontal position, a total of 13. Because your aim is 5, your depth increases by 8*5=40.
    up 3 decreases your aim by 3, resulting in a value of 2.
    down 8 adds 8 to your aim, resulting in a value of 10.
    forward 2 adds 2 to your horizontal position, a total of 15. Because your aim is 10, your depth increases by 2*10=20 to a total of 60.

After following these new instructions, you would have a horizontal position of 15 and a depth of 60. 
(Multiplying these produces 900.)

Using this new interpretation of the commands, calculate the horizontal position and depth you would have after following the planned course. 
What do you get if you multiply your final horizontal position by your final depth?

To get an overview I simplified the table

I had a hard time doing this with pandas, so vanilla Python to the rescue:

if __name__ == '__main__':

    with open("./aoc_day_02_data.txt") as file:
        lines = file.readlines()
        lines = [line.rstrip() for line in lines]

    horizontal = 0
    current_aim = 0
    depth = 0
    for line in lines:
        print(line)
        command, value = line.split(" ")
        value = int(value)
        if command == "forward":
            horizontal += value
            depth += value * current_aim
        if command == "down":
            current_aim += value
        if command == "up":
            current_aim += value * -1

    print(f"horizontal: {horizontal}")
    print(f"depth: {depth}")
    print(horizontal * depth)
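To check the loop against the worked example, it helps to wrap it in a function that takes the lines as an argument. This is a minimal sketch; the name follow_course is made up for this illustration, and the sample lines are the ones from the puzzle text.

```python
def follow_course(lines):
    """Return (horizontal, depth) after applying the aim-based rules."""
    horizontal = aim = depth = 0
    for line in lines:
        command, value = line.split()
        value = int(value)
        if command == "forward":
            horizontal += value
            depth += value * aim
        elif command == "down":
            aim += value
        elif command == "up":
            aim -= value
    return horizontal, depth


sample = ["forward 5", "down 5", "forward 8", "up 3", "down 8", "forward 2"]
horizontal, depth = follow_course(sample)
print(horizontal, depth, horizontal * depth)  # 15 60 900
```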

Update

I’ve figured out how to do it with pandas as well:

import pandas as pd

df = pd.read_csv("./aoc_day_02_test_data.txt", delimiter=" ",header=None)
df.columns = ["command", "value"]

horizontal = df[df['command']=="forward"]["value"].sum()

df.loc[df['command']=="up", "value"] = df.loc[df['command']=="up", "value"].mul(-1)
df["aim"] = 0

df.loc[df['command']!="forward", "aim"] = df[df['command']!="forward"]["value"]
df["current_aim"] = df["aim"].cumsum()

df.loc[df['command']=="forward", "depth"] = df[df['command']=="forward"]["value"] * df[df['command']=="forward"]["current_aim"]
depth = df["depth"].sum()
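Here is the same idea as a compact sketch, run on the sample course so the result can be compared with the 900 from the puzzle text. It is one possible variant, not the only way: it builds the signed aim change with Series.where instead of overwriting the value column, and it inlines the sample via io.StringIO.

```python
import io

import pandas as pd

sample = io.StringIO("forward 5\ndown 5\nforward 8\nup 3\ndown 8\nforward 2\n")
df = pd.read_csv(sample, delimiter=" ", header=None, names=["command", "value"])

is_forward = df["command"] == "forward"

# Signed aim change per row: +value for down, -value for up, 0 for forward.
aim_change = df["value"].where(df["command"] == "down", -df["value"])
aim_change = aim_change.where(~is_forward, 0)

# The running total of the changes is the aim in effect at each row.
aim = aim_change.cumsum()

horizontal = int(df.loc[is_forward, "value"].sum())
depth = int((df["value"] * aim)[is_forward].sum())
print(horizontal, depth, horizontal * depth)  # 15 60 900
```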

The post 2021 – Advent of code – Day 2 appeared first on Creatronix.

2021 – Advent of code – Day 1

Jörn — Thu, 02 Dec 2021 09:30:22 +0000

I haven’t participated in Advent of Code before, but I’ve always been curious.

What is advent of code?

It’s an Advent calendar for programmers. You get 25 challenges, one per day starting December 1st. Caveat: you have to solve the first part of a day’s puzzle to unlock its second part 🙂

Day 1 Challenge – Part 1

On the first day your first task is to count how many times a value is bigger than its predecessor. They give us some sample data:

199 N/A
200 bigger
208 bigger
210 bigger
200 smaller
207 bigger
240 bigger
269 bigger
260 smaller
263 bigger

When we count the rows marked as bigger, we get seven.

The actual data contains 2000 rows. This isn’t exactly big data, but I wanted to dust off my pandas skills, so here we go:

Let’s look at the data

import pandas as pd

df = pd.read_csv("./aoc_day_01_data.txt", header=None)
df.describe()

With the read_csv() function we can read in our data file and convert it into a data frame. It’s important to hand over header=None. Otherwise pandas assumes the first row is a column header.

df.describe() gives us summary statistics for the column: count, mean, standard deviation, minimum, quartiles and maximum.

Because we want to reference the columns by name, we add a column header:

df.columns = ["original"]

To compare the nth cell with its (n+1)th neighbour, we add a new column with the values shifted up by one row:

df['shifted'] = df['original'].shift(-1)

The output looks like this:

	original	shifted
0	159	158.0
1	158	174.0
2	174	196.0
3	196	197.0
4	197	194.0
…	…	…
1995	8538	8543.0
1996	8543	8545.0
1997	8545	8557.0
1998	8557	8568.0
1999	8568	NaN

We add another column where we place the value True when the value from the current row in the shifted column is bigger than in the original column:

df['increased'] = (df['shifted'] > df['original'])

Now it starts to look like the sample data from the introduction:

	original	shifted	increased
0	159	158.0	False
1	158	174.0	True
2	174	196.0	True
3	196	197.0	True
4	197	194.0	False
…	…	…	…
1995	8538	8543.0	True
1996	8543	8545.0	True
1997	8545	8557.0	True
1998	8557	8568.0	True
1999	8568	NaN	False

The last thing we have to do is count how many times True occurs:

true_count = df['increased'].sum()

which gives us “1583”

This is a bit of a hack because it relies on True being treated as 1 and False as 0 when summing.
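The hack is on fairly solid ground, though: in Python, bool is a subclass of int, so True and False take part in arithmetic as 1 and 0. A tiny illustration with a made-up list of flags:

```python
flags = [False, True, True, True, False]

# bool is a subclass of int, so booleans can be added up like numbers.
print(isinstance(True, int))  # True
print(True + True)            # 2
print(sum(flags))             # 3
```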

A more elegant solution is to use value_counts:

df['increased'].value_counts(dropna=False)

Now the output is:

True     1583
False     417
Name: increased, dtype: int64

And 1583 is the number we are looking for. This earned us our first golden star and unlocked the second part of the challenge:
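To double-check the approach, the same steps can be run on the ten sample values from the puzzle, where the expected answer is seven. The sketch below also shows diff() as an alternative to the shifted column; that variant is my addition, not part of the notebook above.

```python
import pandas as pd

depths = pd.Series([199, 200, 208, 210, 200, 207, 240, 269, 260, 263])

# Same idea as above: line each value up with its successor and compare.
increased = depths.shift(-1) > depths
print(increased.sum())  # 7

# Alternative: diff() is "value minus predecessor", positive means bigger.
increased_via_diff = depths.diff() > 0
print(increased_via_diff.sum())  # 7
```

Comparisons against NaN evaluate to False, so the row without a neighbour is never counted.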

Part 2

The second part is a bit more challenging because we have to sum up three adjacent values and compare each sum with the sum of the next window, which starts one row later.

199  A       
200  A B     
208  A B C   
210    B C D
200  E   C D
207  E F   D
240  E F G
269    F G H
260      G H
263        H

I created a new notebook and started like part 1 with reading the data and naming the first column:

import pandas as pd

df = pd.read_csv("./aoc_day_01_data.txt", header=None)
df.columns = ["original"]

To add the sum of three values to the row of the first value, we use the following code:

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
df["rolling_sum"] = df.original.rolling(window=indexer).sum()

This demonstrates the power of Pandas once more: you have integrated sliding window functions!

The rest is the same as in part one: shift, compare and count.

df['shifted_rs'] = df['rolling_sum'].shift(-1)
df['increased_rs'] = (df['shifted_rs'] > df['rolling_sum'])
true_count = df['increased_rs'].sum()
true_count
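Run on the sample values, the rolling approach should report five increases. The sketch below checks that and adds a shortcut that is my own observation rather than part of the notebook: neighbouring windows share two values, so comparing their sums boils down to comparing the value that enters with the value that drops out.

```python
import pandas as pd

depths = pd.Series([199, 200, 208, 210, 200, 207, 240, 269, 260, 263])

# Forward-looking window: each row gets the sum of itself and the next two rows.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
rolling_sum = depths.rolling(window=indexer).sum()

increased = rolling_sum.shift(-1) > rolling_sum
print(increased.sum())  # 5

# Shortcut: window i+1 beats window i exactly when depths[i+3] > depths[i].
shortcut = depths.shift(-3) > depths
print(shortcut.sum())  # 5
```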

As a little finger exercise I did the same with vanilla Python:

data = []
with open("./aoc_day_01_test_data.txt") as f:
    for line in f:
        data.append(int(line.rstrip()))

triplet_sums = []

for i, v in enumerate(data):
    if i < (len(data) - 2):
        triplet_sum = data[i] + data[i+1] + data[i+2]
        triplet_sums.append(triplet_sum)
print(triplet_sums)

sums_larger_than_previous_sums = 0
for i, v in enumerate(triplet_sums):
    if i < (len(triplet_sums) - 1):
        if triplet_sums[i] < triplet_sums[i+1]:
            sums_larger_than_previous_sums += 1

print(sums_larger_than_previous_sums)

Which works but is less elegant.
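The vanilla version can be tightened a bit with zip, which pairs up neighbouring elements without index arithmetic. A small sketch on the sample values (the data list is inlined here instead of being read from the file):

```python
data = [199, 200, 208, 210, 200, 207, 240, 269, 260, 263]

# zip stops at the shortest input, so this yields every full triplet.
triplet_sums = [a + b + c for a, b, c in zip(data, data[1:], data[2:])]

# Pair each sum with its successor and count the increases.
increases = sum(
    later > earlier for earlier, later in zip(triplet_sums, triplet_sums[1:])
)
print(triplet_sums)
print(increases)  # 5
```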

Stay tuned for more!


Pandas Cheat Sheet

Jörn — Fri, 05 Mar 2021 13:17:19 +0000

If you are new to pandas, feel free to read Introduction to Pandas.

I’ve assembled some pandas code snippets.

Reading Data

Reading CSV

import pandas as pd

# read from csv
df = pd.read_csv("path_to_file")

This also works for plain text files; the file suffix is ignored.

The default delimiter for comma-separated value files is the comma. If you have data with another delimiter, you can specify it via:

delimiter=";"

If your data has no header you can pass header=None into the function

df = pd.read_csv("./aoc_day_01_data.txt", header=None)
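To see what header=None changes, here is a small sketch that reads the same three lines twice. The data is inlined with io.StringIO so the example runs without a file; that part is just a convenience for this illustration.

```python
import io

import pandas as pd

raw = "199\n200\n208\n"

# Default: the first line is consumed as the column header.
with_default = pd.read_csv(io.StringIO(raw))
# header=None: every line is data, columns are numbered from 0.
without_header = pd.read_csv(io.StringIO(raw), header=None)

print(len(with_default), list(with_default.columns))      # 2 ['199']
print(len(without_header), list(without_header.columns))  # 3 [0]
```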

With skiprows you can start reading in at any row

skiprows=8

Sometimes you need to alter the encoding as well:

encoding="cp1252"
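These options can be combined in a single call. The sketch below is an assumption-laden toy: the two preamble lines and the column names are invented for the example, and the data comes from io.StringIO, so encoding is left out (it only matters when pandas decodes a file itself).

```python
import io

import pandas as pd

# Two made-up preamble lines, then semicolon-separated data without a header.
raw = "exported by some tool\nunits: metres\nforward;5\ndown;5\n"

df = pd.read_csv(
    io.StringIO(raw),
    delimiter=";",          # values are separated by semicolons
    skiprows=2,             # ignore the two preamble lines
    header=None,            # the first remaining line is data, not a header
    names=["command", "value"],
)
print(df)
```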

Reading Excel

You can read Excel files as well, but you need to install openpyxl first:

pip install openpyxl

df = pd.read_excel("./my_excel_sheet.xlsx")

With sheet_name you can select the individual sheet:

sheet_name="my_sheet_1"

Inspecting data

Basic information

df.describe()

Length

len(df)

showing entries

df.head()

df.tail(10)

Indexing

df['A']

gives you column A

iloc gives you entries based on numerical index

#      [row, column]
df.iloc[0,   0]

#     [row, column]
df.loc[:, :]

Data Cleaning

Dropping columns

del df["column_name"]

Renaming columns

df.columns = ["new_column_name", ...]

Comparing columns

df['increased'] = (df['shifted'] > df['original'])

Shifting columns

df['shifted'] = df['original'].shift(-1)

Splitting

Splitting strings into individual columns

df = pd.DataFrame(df["original"].str.split('').tolist())

Counting and Calculating

Summing columns

df["value"].sum()

Cumulative sum

df["aim"].cumsum()

Rolling sum

indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
df["rolling_sum"] = df.original.rolling(window=indexer).sum()

Counting value occurrences

df['increased'].value_counts(dropna=False)

Counting occurrences for all columns

df = pd.concat([df[column].value_counts() for column in df], axis = 1)
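A small sketch of what this produces, on a made-up two-column frame. I pass keys= to concat so the result columns are labelled with the original column names regardless of the pandas version; that argument is my addition to the snippet above.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["x", "z", "z"]})

# One value_counts() Series per column, glued together side by side.
counts = pd.concat(
    [df[column].value_counts() for column in df],
    axis=1,
    keys=list(df.columns),
)
print(counts)
```

Values that never occur in a column show up as NaN in that column, which is why the counts come back as floats.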

Convert column to datetime

df["date"] = pd.to_datetime(df["date"])

Convert datetime to minutes since midnight

df["msm"] = df["date"].dt.hour * 60 + df["date"].dt.minute
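Put together on two sample timestamps, the conversion looks like this. The sketch assigns whole columns with df["..."] = ..., which replaces the column and therefore reliably gives it a datetime dtype that the .dt accessor can work with.

```python
import pandas as pd

df = pd.DataFrame({"date": ["2021-12-03 11:23:34", "2021-12-02 09:30:22"]})

# Parse the strings into real timestamps.
df["date"] = pd.to_datetime(df["date"])

# Minutes since midnight: full hours times 60 plus the minutes, seconds dropped.
df["msm"] = df["date"].dt.hour * 60 + df["date"].dt.minute
print(df["msm"].tolist())  # [683, 570]
```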


Introduction to Pandas

Jörn — Wed, 01 Aug 2018 05:30:41 +0000

pandas is a data analysis library. Together with NumPy and Matplotlib it is part of the Python data science stack.

You can install it via

pip install pandas

Working with real data

The data set we are using is the astronauts data set from Kaggle:

Download Data Set NASA Astronauts from Kaggle

During this introduction we want to answer the following questions:

Which American astronaut has spent the most time in space?
What university has produced the most astronauts?
What subject did the most astronauts major in at college?
Have most astronauts served in the military?
What is the most common rank they achieved?

Basic Usage


pandas is often aliased with “pd”

import pandas as pd

Dataframes

df = pd.read_csv("./astronauts.csv")

A dataframe is the most versatile data structure in pandas. You can think of it as an Excel sheet with columns and rows.

You can get an overview of the dataframe values with

df.describe()

With the len function you can get the number of rows in the dataset

len(df)

which gives us 357 astronauts

The columns property gives you the names of the individual columns

df.columns

The method head() gives you the first n (default=5) entries:

df.head()

whereas the tail method gives you the last n entries

df.tail(10)

With the iloc indexer you get entries directly by their integer position

#      [row, column]
df.iloc[0,   0]

The loc indexer is another way to access a dataframe, this time by row and column labels.

The colon is used as a “select *” for rows or columns

#     [row, column]
df.loc[:, :]
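The difference between the two is easiest to see on a frame whose row labels are not just 0, 1, 2. The names and years below are invented for this sketch and have nothing to do with the astronauts data set.

```python
import pandas as pd

# Made-up toy data with string row labels.
df = pd.DataFrame(
    {"Name": ["Ann", "Bob", "Cleo"], "Year": [1990, 1995, 2000]},
    index=["a", "b", "c"],
)

print(df.iloc[0, 0])               # Ann   (first row, first column by position)
print(df.loc["b", "Year"])         # 1995  (row label "b", column label "Year")
print(df.loc[:, "Name"].tolist())  # ['Ann', 'Bob', 'Cleo']  (all rows, one column)
print(df.loc[:, :].equals(df))     # True  (colon on both axes selects everything)
```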

Some analytics

Which American astronaut has spent the most time in space?

most_time_in_space = df.sort_values(by="Space Flight (hr)", ascending=False).head(1)
most_time_in_space[['Name', 'Space Flight (hr)']]

Sorting the dataframe can be done with sort_values. For this question we sort by Space Flight (hr). Because we want the most hours, we have to sort in descending order, which translates to ascending=False.
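As a side note, pandas also offers nlargest, which returns the top rows without sorting the whole frame. A sketch on a made-up toy frame (the names and hours are placeholders, not values from the data set) showing that both routes pick the same row:

```python
import pandas as pd

# Placeholder data, only the column names mirror the astronauts data set.
toy = pd.DataFrame(
    {
        "Name": ["A. Example", "B. Sample", "C. Placeholder"],
        "Space Flight (hr)": [100, 4500, 320],
    }
)

top_sorted = toy.sort_values(by="Space Flight (hr)", ascending=False).head(1)
top_nlargest = toy.nlargest(1, "Space Flight (hr)")

print(top_sorted[["Name", "Space Flight (hr)"]])
print(top_nlargest.equals(top_sorted))  # True
```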

head(1) gives us the correct answer:

Jeffrey N. Williams. He spent 12818 hours (534 days) in space.

Have you heard of him? Unsung hero!

Note: the dataset was last updated in 2017. As of 2019 Peggy Whitson is the American who has spent the most time in space.

She has spent more than 665 days in space!

What university has produced the most astronauts?

The method value_counts is used to count the number of occurrences of unique values

df['Alma Mater'].value_counts().head(1)

The US Naval Academy produced 12 astronauts.

What subject did the most astronauts major in at college?

df['Undergraduate Major'].value_counts().head(1)

The same here: use the value_counts method on the Undergraduate Major column.
The answer is Physics: 35 astronauts studied physics in college.
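Why head(1) works here: value_counts sorts its result by count in descending order and leaves out missing values by default, so the first entry is the most frequent one. A toy sketch with invented majors:

```python
import pandas as pd

# Invented values, including one missing entry.
majors = pd.Series(["Physics", "Chemistry", "Physics", None, "Mathematics"])

counts = majors.value_counts()
print(counts.head(1))

# idxmax() gives just the label of the most frequent value.
print(counts.idxmax(), counts.max())  # Physics 2
```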

Have most astronauts served in the military?

The count method returns the number of entries which are neither null nor NaN

astronauts_with_military_rank = df['Military Rank'].count()
astronauts_with_military_rank

207 astronauts have a military rank.

percentage_astronauts_served = astronauts_with_military_rank / len(df)
percentage_astronauts_served

58% served in the military.
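The same calculation on a tiny made-up column shows how count skips missing values. The second variant, notna().mean(), is an alternative I find handy; it yields the share in one step because True counts as 1 and False as 0.

```python
import pandas as pd

# Invented ranks; None stands for astronauts without a military rank.
ranks = pd.Series(["Colonel", None, "Captain", None, "Colonel"])

served = ranks.count()         # 3, missing values are skipped
share = served / len(ranks)    # 3 of 5 entries
print(served, share)           # 3 0.6

# Alternative: the mean of the "is not missing" flags is the same share.
share_via_mean = ranks.notna().mean()
print(share_via_mean)
```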

Which is the most common rank?

df['Military Rank'].value_counts()

which gives us 94 Colonels.

You can find this code as a Jupyter notebook on GitHub:

https://github.com/jboegeholz/introduction_to_pandas/blob/master/astronauts_with_pandas.ipynb
