<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>pandas Archives - Creatronix</title>
	<atom:link href="https://creatronix.de/tag/pandas/feed/" rel="self" type="application/rss+xml" />
	<link>https://creatronix.de/tag/pandas/</link>
	<description>My adventures in code &#38; business</description>
	<lastBuildDate>Mon, 06 Jan 2025 09:39:08 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>2021 – Advent of code – Day 2</title>
		<link>https://creatronix.de/2021-advent-of-code-day-2/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Fri, 03 Dec 2021 11:23:34 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[advent of code]]></category>
		<category><![CDATA[day 2]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://creatronix.de/?p=4296</guid>

					<description><![CDATA[<p>Part 1 Today the puzzle got a bit trickier than Day 1. The submarine seems to already have a planned course (your puzzle input). You should probably figure out where it's going. For example: forward 5 down 5 forward 8 up 3 down 8 forward 2 Your horizontal position and depth both start at 0.&#8230;</p>
<p>The post <a href="https://creatronix.de/2021-advent-of-code-day-2/">2021 – Advent of code – Day 2</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Part 1</h2>
<p>Today the puzzle got a bit trickier than <a href="https://creatronix.de/2021-advent-of-code-day-1/">Day 1</a>.</p>
<pre>The submarine seems to already have a planned course (your puzzle input). You should probably figure out where it's going. For example:

    forward 5
    down 5
    forward 8
    up 3
    down 8
    forward 2

Your horizontal position and depth both start at 0. The steps above would then modify them as follows:

    forward 5 adds 5 to your horizontal position, a total of 5.
    down 5 adds 5 to your depth, resulting in a value of 5.
    forward 8 adds 8 to your horizontal position, a total of 13.
    up 3 decreases your depth by 3, resulting in a value of 2.
    down 8 adds 8 to your depth, resulting in a value of 10.
    forward 2 adds 2 to your horizontal position, a total of 15.

After following these instructions, you would have a horizontal position of 15 and a depth of 10. (Multiplying these together produces 150.)

Calculate the horizontal position and depth you would have after following the planned course. What do you get if you multiply your final horizontal position by your final depth?</pre>
<p>So Pandas here we go again:<span id="more-4296"></span></p>
<pre>import pandas as pd

df = pd.read_csv("./aoc_day_02_data.txt", delimiter=" ", header=None)
df.columns = ["command", "value"]</pre>
<p>Alright, reading in the data and naming the columns are the same steps as yesterday. Now we have two columns.</p>
<table class="dataframe" border="1">
<thead>
<tr>
<th></th>
<th>command</th>
<th>value</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>forward</td>
<td>5</td>
</tr>
<tr>
<th>1</th>
<td>down</td>
<td>5</td>
</tr>
<tr>
<th>2</th>
<td>forward</td>
<td>8</td>
</tr>
<tr>
<th>3</th>
<td>up</td>
<td>3</td>
</tr>
<tr>
<th>4</th>
<td>down</td>
<td>8</td>
</tr>
<tr>
<th>5</th>
<td>forward</td>
<td>2</td>
</tr>
</tbody>
</table>
<pre>horizontal = df[df['command']=="forward"]["value"].sum()</pre>
<p>The horizontal value can be calculated with the sum function once we filter the data frame to the rows where the command is &#8220;forward&#8221;.</p>
<pre>depth = df[df['command']=="down"]["value"].sum() - df[df['command']=="up"]["value"].sum()</pre>
<p>The depth can be calculated by summing up the down and up commands separately and subtracting the up sum from the down sum.</p>
<p>Now we have to multiply the depth and the horizontal position to get the solution:</p>
<pre>position = depth * horizontal</pre>
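<p>To check the Part 1 approach against the example from the puzzle text, here is a minimal sketch. It assumes the six sample commands are inlined as a string (the name SAMPLE is made up for this sketch) instead of being read from the data file, and it uses a groupby as one possible alternative to the three separate filters:</p>

```python
import io

import pandas as pd

# The six example commands from the puzzle text, inlined so the sketch runs as-is.
SAMPLE = "forward 5\ndown 5\nforward 8\nup 3\ndown 8\nforward 2\n"

df = pd.read_csv(io.StringIO(SAMPLE), delimiter=" ", header=None)
df.columns = ["command", "value"]

# One sum per command; a command that never occurs counts as 0.
totals = df.groupby("command")["value"].sum()
horizontal = totals.get("forward", 0)
depth = totals.get("down", 0) - totals.get("up", 0)

print(horizontal, depth, horizontal * depth)  # 15 10 150
```

<p>The result matches the 150 given in the puzzle description.</p>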
<h2>Part 2</h2>
<pre>    down X increases your aim by X units.
    up X decreases your aim by X units.
    forward X does two things:
        It increases your horizontal position by X units.
        It increases your depth by your aim multiplied by X.

Now, the above example does something different:

    forward 5 adds 5 to your horizontal position, a total of 5. Because your aim is 0, your depth does not change.
    down 5 adds 5 to your aim, resulting in a value of 5.
    forward 8 adds 8 to your horizontal position, a total of 13. Because your aim is 5, your depth increases by 8*5=40.
    up 3 decreases your aim by 3, resulting in a value of 2.
    down 8 adds 8 to your aim, resulting in a value of 10.
    forward 2 adds 2 to your horizontal position, a total of 15. Because your aim is 10, your depth increases by 2*10=20 to a total of 60.

After following these new instructions, you would have a horizontal position of 15 and a depth of 60. 
(Multiplying these produces 900.)

Using this new interpretation of the commands, calculate the horizontal position and depth you would have after following the planned course. 
What do you get if you multiply your final horizontal position by your final depth?</pre>
<p>To get an overview I simplified the table</p>
<pre>     a   d
f 5  0   0
d 5  5
f 8  5  40
u 3  2
d 8 10
f 2 10  20</pre>
<p>I had a hard time doing this with pandas, so vanilla Python to the rescue:</p>
<pre>if __name__ == '__main__':

    # read the course, one "command value" pair per line
    with open("./aoc_day_02_data.txt") as file:
        lines = [line.rstrip() for line in file]

    horizontal = 0
    current_aim = 0
    depth = 0
    for line in lines:
        command, value = line.split(" ")
        value = int(value)
        if command == "forward":
            # moving forward also changes the depth by aim * distance
            horizontal += value
            depth += value * current_aim
        elif command == "down":
            current_aim += value
        elif command == "up":
            current_aim -= value

    print(f"horizontal: {horizontal}")
    print(f"depth: {depth}")
    print(horizontal * depth)</pre>
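<p>The same loop can be wrapped in a function so it can be checked against the example from the puzzle text. This is only a sketch: the function name follow_course and the inlined sample list are assumptions made for illustration, not part of the original solution.</p>

```python
def follow_course(lines):
    """Return (horizontal, depth) after applying the aim-based rules."""
    horizontal = aim = depth = 0
    for line in lines:
        command, value = line.split()
        value = int(value)
        if command == "forward":
            # forward moves the submarine and dives by aim * distance
            horizontal += value
            depth += aim * value
        elif command == "down":
            aim += value
        elif command == "up":
            aim -= value
    return horizontal, depth


# The six example commands from the puzzle text.
sample = ["forward 5", "down 5", "forward 8", "up 3", "down 8", "forward 2"]
horizontal, depth = follow_course(sample)
print(horizontal, depth, horizontal * depth)  # 15 60 900
```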
<h2>Update</h2>
<p>I&#8217;ve figured out how to do it with Pandas as well:</p>
<pre>import pandas as pd

df = pd.read_csv("./aoc_day_02_test_data.txt", delimiter=" ", header=None)
df.columns = ["command", "value"]

is_forward = df["command"] == "forward"
horizontal = df.loc[is_forward, "value"].sum()

# "up" lowers the aim, so flip the sign of its values
df.loc[df["command"] == "up", "value"] *= -1

# only "down" and "up" change the aim; "forward" rows contribute 0
df["aim"] = 0
df.loc[~is_forward, "aim"] = df.loc[~is_forward, "value"]
df["current_aim"] = df["aim"].cumsum()

# each "forward" row adds value * current aim to the depth
df.loc[is_forward, "depth"] = df.loc[is_forward, "value"] * df.loc[is_forward, "current_aim"]
depth = df["depth"].sum()
print(horizontal * depth)</pre>
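<p>A more compact variant of the same cumulative-sum idea, again only a sketch: it assumes the example commands are inlined as a string (SAMPLE is a made-up name) and maps each command to a sign instead of editing the value column in place.</p>

```python
import io

import pandas as pd

# The six example commands from the puzzle text, inlined so the sketch runs as-is.
SAMPLE = "forward 5\ndown 5\nforward 8\nup 3\ndown 8\nforward 2\n"

df = pd.read_csv(io.StringIO(SAMPLE), delimiter=" ", header=None,
                 names=["command", "value"])

# Signed aim change per row: +value for down, -value for up, 0 for forward.
sign = df["command"].map({"down": 1, "up": -1, "forward": 0})
df["current_aim"] = (sign * df["value"]).cumsum()

# Only forward rows move the submarine and change the depth.
forward = df[df["command"] == "forward"]
horizontal = forward["value"].sum()
depth = (forward["value"] * forward["current_aim"]).sum()

print(horizontal, depth, horizontal * depth)  # 15 60 900
```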
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>2021 &#8211; Advent of code &#8211; Day 1</title>
		<link>https://creatronix.de/2021-advent-of-code-day-1/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Thu, 02 Dec 2021 09:30:22 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[advent of code]]></category>
		<category><![CDATA[challenge]]></category>
		<category><![CDATA[coding]]></category>
		<category><![CDATA[jupyter]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">https://creatronix.de/?p=4285</guid>

					<description><![CDATA[<p>I haven&#8217;t participated in the advent of code before, but I&#8217;ve always been curious. What is advent of code? It&#8217;s an advent calendar for programmers. You get 25 challenges starting December 1st. Caveat: you have to solve the challenge to be eligible for the next day&#8217;s challenge 🙂 Day 1 Challenge &#8211; Part 1 On the&#8230;</p>
<p>The post <a href="https://creatronix.de/2021-advent-of-code-day-1/">2021 &#8211; Advent of code &#8211; Day 1</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I haven&#8217;t participated in the <a href="https://adventofcode.com/2021/day/1">advent of code</a> before, but I&#8217;ve always been curious.</p>
<h2>What is advent of code?</h2>
<p>It&#8217;s an advent calendar for programmers. You get 25 challenges starting December 1st. Caveat: you have to solve the challenge to be eligible for the next day&#8217;s challenge 🙂<span id="more-4285"></span></p>
<h2>Day 1 Challenge &#8211; Part 1</h2>
<p>On the first day your first task is to count how many times a value is bigger than its predecessor. They give us some sample data:</p>
<pre>199 N/A
200 <strong>bigger</strong>
208 <strong>bigger</strong>
210 <strong>bigger</strong>
200 smaller
207 <strong>bigger</strong>
240 <strong>bigger</strong>
269 <strong>bigger</strong>
260 smaller
263 <strong>bigger</strong></pre>
<p>Counting the rows marked as bigger gives us seven.</p>
<p>The actual data contains 2000 rows. This isn&#8217;t exactly big data but I wanted to dust off my Pandas skills, so here we go:</p>
<p>Let&#8217;s look at the data</p>
<pre>import pandas as pd

df = pd.read_csv("./aoc_day_01_data.txt", header=None)
df.describe</pre>
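<p>With the read_csv() function we can read in our data file and convert it into a data frame. It&#8217;s important to pass header=None. Otherwise pandas assumes the first row is a column header.</p>
<p>Note that df.describe without parentheses does not call the method. The notebook just prints the bound method, whose representation happens to include the whole data frame:</p>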
<pre class="console_text">&lt;bound method NDFrame.describe of          0
0      159
1      158
2      174
3      196
4      197
...    ...
1995  8538
1996  8543
1997  8545
1998  8557
1999  8568

[2000 rows x 1 columns]&gt;</pre>
<p>Because we want to reference the columns by name we add a column header</p>
<pre>df.columns = ["original"]</pre>
<p>To compare the nth cell with its neighbour n+1, we add a new column with the values shifted up by one row:</p>
<pre>df['shifted'] = df['original'].shift(-1)</pre>
<p>The output looks like this:</p>
<table class="dataframe" border="1">
<thead>
<tr>
<th></th>
<th>original</th>
<th>shifted</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>159</td>
<td><strong>158</strong>.0</td>
</tr>
<tr>
<th>1</th>
<td><strong>158</strong></td>
<td>174.0</td>
</tr>
<tr>
<th>2</th>
<td>174</td>
<td>196.0</td>
</tr>
<tr>
<th>3</th>
<td>196</td>
<td>197.0</td>
</tr>
<tr>
<th>4</th>
<td>197</td>
<td>194.0</td>
</tr>
<tr>
<th>&#8230;</th>
<td>&#8230;</td>
<td>&#8230;</td>
</tr>
<tr>
<th>1995</th>
<td>8538</td>
<td>8543.0</td>
</tr>
<tr>
<th>1996</th>
<td>8543</td>
<td>8545.0</td>
</tr>
<tr>
<th>1997</th>
<td>8545</td>
<td>8557.0</td>
</tr>
<tr>
<th>1998</th>
<td>8557</td>
<td>8568.0</td>
</tr>
<tr>
<th>1999</th>
<td>8568</td>
<td>NaN</td>
</tr>
</tbody>
</table>
<p>We add another column where we place the value True when the value from the current row in the shifted column is bigger than in the original column:</p>
<pre>df['increased'] = (df['shifted'] &gt; df['original'])</pre>
<p>Now it starts to look like the sample data from the introduction:</p>
<table class="dataframe" border="1">
<thead>
<tr>
<th></th>
<th>original</th>
<th>shifted</th>
<th>increased</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>159</td>
<td>158.0</td>
<td>False</td>
</tr>
<tr>
<th>1</th>
<td>158</td>
<td>174.0</td>
<td>True</td>
</tr>
<tr>
<th>2</th>
<td>174</td>
<td>196.0</td>
<td>True</td>
</tr>
<tr>
<th>3</th>
<td>196</td>
<td>197.0</td>
<td>True</td>
</tr>
<tr>
<th>4</th>
<td>197</td>
<td>194.0</td>
<td>False</td>
</tr>
<tr>
<th>&#8230;</th>
<td>&#8230;</td>
<td>&#8230;</td>
<td>&#8230;</td>
</tr>
<tr>
<th>1995</th>
<td>8538</td>
<td>8543.0</td>
<td>True</td>
</tr>
<tr>
<th>1996</th>
<td>8543</td>
<td>8545.0</td>
<td>True</td>
</tr>
<tr>
<th>1997</th>
<td>8545</td>
<td>8557.0</td>
<td>True</td>
</tr>
<tr>
<th>1998</th>
<td>8557</td>
<td>8568.0</td>
<td>True</td>
</tr>
<tr>
<th>1999</th>
<td>8568</td>
<td>NaN</td>
<td>False</td>
</tr>
</tbody>
</table>
<p>The last thing we have to do is count how many times True occurs:</p>
<pre>true_count = df['increased'].sum()</pre>
<p>which gives us &#8220;1583&#8221;.</p>
<p>This can look like a bit of a hack because it relies on True counting as 1 and False as 0 when summed.</p>
<p>A more elegant solution is to use value_counts:</p>
<pre>df['increased'].value_counts(dropna=False)</pre>
<p>Now the output is:</p>
<pre class="console_text">True     1583
False     417
Name: increased, dtype: int64</pre>
<p>And 1583 is the number we are looking for. This earned us our first golden star and unlocked the second part of the challenge:</p>
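<p>The shift, compare and count steps can be tried on the ten sample values from the introduction. This is a minimal sketch with the sample inlined instead of read from a file. The diff() variant at the end is an alternative that is not used in the post itself.</p>

```python
import pandas as pd

# The ten sample measurements from the puzzle text.
df = pd.DataFrame({"original": [199, 200, 208, 210, 200, 207, 240, 269, 260, 263]})

# Shift the next row's value up, then compare row by row.
df["shifted"] = df["original"].shift(-1)
df["increased"] = df["shifted"] > df["original"]

# Summing booleans counts the True values.
count_via_sum = int(df["increased"].sum())

# Alternative: diff() subtracts the previous row, so positive means "bigger".
count_via_diff = int((df["original"].diff() > 0).sum())

print(count_via_sum, count_via_diff)  # 7 7
```

<p>Both variants ignore the missing neighbour at the edge, because a comparison with NaN evaluates to False.</p>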
<h2>Part 2</h2>
<p>The second part is a bit more challenging because we have to sum up three adjacent values and compare them to the next three values.</p>
<pre>199  A       
200  A B     
208  A B C   
210    B C D
200  E   C D
207  E F   D
240  E F G
269    F G H
260      G H
263        H</pre>
<p>I created a new notebook and started like part 1 with reading the data and naming the first column</p>
<pre>import pandas as pd

df = pd.read_csv("./aoc_day_01_data.txt", header=None)
df.columns = ["original"]</pre>
<p>To add the sum of three values to the row of the first value we use the following code</p>
<pre>indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
df["rolling_sum"] = df.original.rolling(window=indexer).sum()</pre>
<p>This demonstrates the power of Pandas once more: it has built-in sliding window functions!</p>
<p>The rest is the same as in part one: &#8220;shift, compare and count&#8221;.</p>
<pre>df['shifted_rs'] = df['rolling_sum'].shift(-1)
df['increased_rs'] = (df['shifted_rs'] &gt; df['rolling_sum'])
true_count = df['increased_rs'].sum()
true_count</pre>
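<p>Here is the forward-looking window applied to the ten sample values, as a sketch. Two things are assumptions of this sketch and not taken from the post: min_periods=3 is passed explicitly so that incomplete windows at the end become NaN regardless of defaults, and the shortcut at the end is an alternative based on the observation that neighbouring windows share two values.</p>

```python
import pandas as pd

# The ten sample measurements from the puzzle text.
s = pd.Series([199, 200, 208, 210, 200, 207, 240, 269, 260, 263])

# Sum each value with the two values that follow it.
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
rolling_sum = s.rolling(window=indexer, min_periods=3).sum()

# Shift, compare and count, exactly as in part one.
count_rolling = int((rolling_sum.shift(-1) > rolling_sum).sum())

# Shortcut: adjacent windows share two values, so it is enough to compare
# the value that enters the window (three rows ahead) with the one that leaves.
count_shortcut = int((s.shift(-3) > s).sum())

print(count_rolling, count_shortcut)  # 5 5
```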
<p>As a little finger exercise I did the same with vanilla Python:</p>
<pre>data = []
with open("./aoc_day_01_test_data.txt") as f:
    for line in f:
        data.append(int(line.rstrip()))

triplet_sums = []

for i, v in enumerate(data):
    if i &lt; (len(data) - 2):
        triplet_sum = data[i] + data[i+1] + data[i+2]
        triplet_sums.append(triplet_sum)
print(triplet_sums)

sums_larger_than_previous_sums = 0
for i, v in enumerate(triplet_sums):
    if i &lt; (len(triplet_sums) - 1):
        if triplet_sums[i] &lt; triplet_sums[i+1]:
            sums_larger_than_previous_sums += 1

print(sums_larger_than_previous_sums)</pre>
<p>Which works but is less elegant.</p>
<p>Stay tuned for more!</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Pandas Cheat Sheet</title>
		<link>https://creatronix.de/pandas-cheat-sheet/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Fri, 05 Mar 2021 13:17:19 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[cheat sheet]]></category>
		<category><![CDATA[csv]]></category>
		<category><![CDATA[cumsum]]></category>
		<category><![CDATA[cumulative sum]]></category>
		<category><![CDATA[datetime]]></category>
		<category><![CDATA[delimiter]]></category>
		<category><![CDATA[dropping columns]]></category>
		<category><![CDATA[encoding]]></category>
		<category><![CDATA[excel]]></category>
		<category><![CDATA[head tail]]></category>
		<category><![CDATA[header]]></category>
		<category><![CDATA[iloc]]></category>
		<category><![CDATA[indexing]]></category>
		<category><![CDATA[loc]]></category>
		<category><![CDATA[pandas]]></category>
		<category><![CDATA[renaming columns]]></category>
		<category><![CDATA[rolling sum]]></category>
		<category><![CDATA[shifting]]></category>
		<category><![CDATA[snippets]]></category>
		<category><![CDATA[splitting strings]]></category>
		<category><![CDATA[value_counts]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=3112</guid>

					<description><![CDATA[<p>If you are new to Pandas feel free to read Introduction to Pandas I&#8217;ve assembled some pandas code snippets Reading Data Reading CSV import pandas as pd # read from csv df = pd.read_csv("path_to_file") Can also be textfiles. file suffix is ignored The default limiter for comma separated value files is the comma. If you&#8230;</p>
<p>The post <a href="https://creatronix.de/pandas-cheat-sheet/">Pandas Cheat Sheet</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>If you are new to Pandas feel free to read <a href="https://creatronix.de/introduction-to-pandas/">Introduction to Pandas</a></p>
<p>I&#8217;ve assembled some pandas code snippets</p>
<h2>Reading Data</h2>
<h3>Reading CSV</h3>
<pre>import pandas as pd

# read from csv
df = pd.read_csv("path_to_file")</pre>
<p>This also works for plain text files; the file suffix is ignored.</p>
<p>The default delimiter for comma separated value files is the comma. If you have data with another delimiter you can specify it via:</p>
<pre>delimiter=";"</pre>
<p>If your data has no header you can pass header=None into the function</p>
<pre>df = pd.read_csv("./aoc_day_01_data.txt", header=None)</pre>
<p>With skiprows you can start reading in at any row</p>
<pre>skiprows=8</pre>
<p>Sometimes you need to alter the encoding as well:</p>
<pre>encoding="cp1252"</pre>
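<p>The options above can be combined in one call. A minimal sketch, assuming a made-up semicolon-separated text with two leading comment lines; it is wrapped in io.StringIO so the example runs without a file on disk:</p>

```python
import io

import pandas as pd

# Hypothetical input: two lines to skip, then semicolon-separated data without a header.
raw = "exported by some tool\nsecond comment line\n1;2\n3;4\n"

df = pd.read_csv(
    io.StringIO(raw),
    delimiter=";",   # values are separated by semicolons
    header=None,     # the data has no header row
    skiprows=2,      # ignore the first two lines
)

print(df.shape)  # (2, 2)
```

<p>When reading from a real file path, encoding="cp1252" can be added to the same call.</p>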
<h3>Reading Excel</h3>
<p>You can read Excel files as well, but you need to install openpyxl first:</p>
<pre>pip install openpyxl</pre>
<pre>df = pd.read_excel("./my_excel_sheet.xlsx")</pre>
<p>With sheet_name you can select the individual sheet:</p>
<pre>sheet_name="my_sheet_1"</pre>
<h2>Inspecting data</h2>
<h3>Basic information</h3>
<pre>df.describe()</pre>
<h3>Length</h3>
<pre>len(df)</pre>
<h3>Showing entries</h3>
<pre>df.head()</pre>
<pre>df.tail(10)</pre>
<h3>Indexing</h3>
<pre>df['A']</pre>
<p>gives you column A</p>
<p>iloc gives you entries based on their integer position; loc does the same based on labels.</p>
<pre>#      [row, column]
df.iloc[0,   0]</pre>
<pre>#     [row, column]
df.loc[:, :]</pre>
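<p>The difference between iloc and loc shows up once the index is not a plain 0..n range. A small sketch with made-up data and a string index:</p>

```python
import pandas as pd

# Toy data frame with string row labels instead of the default integer index.
df = pd.DataFrame({"A": [10, 20, 30], "B": [1.5, 2.5, 3.5]}, index=["x", "y", "z"])

by_position = df.iloc[0, 0]      # first row, first column
by_label = df.loc["y", "B"]      # row labelled "y", column "B"
everything = df.loc[:, :]        # the colon selects all rows / all columns

print(by_position, by_label, everything.shape)  # 10 2.5 (3, 2)
```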
<h2>Data Cleaning</h2>
<h3>Dropping columns</h3>
<pre>del df["column_name"]</pre>
<h3>Renaming columns</h3>
<pre>df.columns = ["new_column_name", ...]</pre>
<h3>Comparing columns</h3>
<pre>df['increased'] = (df['shifted'] &gt; df['original'])</pre>
<h3>Shifting columns</h3>
<pre>df['shifted'] = df['original'].shift(-1)</pre>
<h2>Splitting</h2>
<h3>Splitting strings into individual columns</h3>
<pre>df = pd.DataFrame(df["original"].str.split('').tolist())</pre>
<h2>Counting and Calculating</h2>
<h3>Summing columns</h3>
<pre>df["value"].sum()</pre>
<h3>Cumulative sum</h3>
<pre>df["aim"].cumsum()</pre>
<h3>Rolling sum</h3>
<pre>indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=3)
df["rolling_sum"] = df.original.rolling(window=indexer).sum()</pre>
<h3>Counting value occurrences</h3>
<pre>df['increased'].value_counts(dropna=False)</pre>
<h3>Counting occurrences for all columns</h3>
<pre>df = pd.concat([df[column].value_counts() for column in df], axis = 1)</pre>
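<p>A small sketch of the per-column counting with made-up data. Passing keys=df.columns is an addition of this sketch, not part of the snippet above; it names the result columns explicitly, which avoids depending on how a given pandas version names the counted Series.</p>

```python
import pandas as pd

# Toy data: two columns sharing some of the same values.
df = pd.DataFrame({"a": ["x", "y", "x"], "b": ["y", "y", "z"]})

# One value_counts() per column, placed side by side; the row index is the
# union of all values, and a value missing from a column shows up as NaN.
counts = pd.concat([df[column].value_counts() for column in df],
                   axis=1, keys=df.columns)

print(counts.loc["x", "a"], counts.loc["y", "b"])
```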
<h3>Convert column to datetime</h3>
<pre>df['date'] = pd.to_datetime(df['date'])</pre>
<h3>Convert datetime to minutes since midnight</h3>
<pre>df['msm'] = df['date'].dt.hour * 60 + df['date'].dt.minute</pre>
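<p>Both datetime snippets chained together, as a sketch with two made-up timestamps. The column is assigned with df["date"] = ..., an assumption of this sketch, so that the column is replaced and takes on the datetime dtype.</p>

```python
import pandas as pd

# Toy data: timestamps stored as strings.
df = pd.DataFrame({"date": ["2021-03-05 13:17:19", "2021-03-05 00:05:00"]})

# Convert the string column to datetime.
df["date"] = pd.to_datetime(df["date"])

# Minutes since midnight: full hours times 60 plus the remaining minutes.
df["msm"] = df["date"].dt.hour * 60 + df["date"].dt.minute

print(df["msm"].tolist())  # [797, 5]
```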
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Introduction to Pandas</title>
		<link>https://creatronix.de/introduction-to-pandas/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Wed, 01 Aug 2018 05:30:41 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[astronaut-yearbook]]></category>
		<category><![CDATA[astronauts]]></category>
		<category><![CDATA[kaggle]]></category>
		<category><![CDATA[pandas]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1507</guid>

					<description><![CDATA[<p>Pandas is a data analyzing tool. Together with numpy and matplotlib it is part of the data science stack You can install it via pip install pandas Working with real data The data set we are using is the astronauts data set from kaggle: Download Data Set NASA Astronauts from Kaggle During this introduction we&#8230;</p>
<p>The post <a href="https://creatronix.de/introduction-to-pandas/">Introduction to Pandas</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Pandas is a data analysis tool. Together with numpy and matplotlib it is part of the Python data science stack.</p>
<p>You can install it via</p>
<pre>pip install pandas</pre>
<h2>Working with real data</h2>
<p>The data set we are using is the astronauts data set from kaggle:</p>
<p>Download Data Set NASA Astronauts from Kaggle</p>
<p>During this introduction we want to answer the following questions</p>
<ul>
<li>Which American astronaut has spent the most time in space?</li>
<li>What university has produced the most astronauts?</li>
<li>What subject did the most astronauts major in at college?</li>
<li>Have most astronauts served in the military?</li>
<li>What is the most common rank they achieved?</li>
</ul>
<h2>Basic Usage</h2>
<p>pandas is often aliased with &#8220;pd&#8221;</p>
<pre>import pandas as pd
</pre>
<h3>Dataframes</h3>
<pre>df = pd.read_csv("./astronauts.csv")</pre>
<p>A dataframe is the most versatile data structure in pandas. You can think of it as an Excel sheet with columns and rows.</p>
<p>You can get an overview of the dataframe values with</p>
<pre>df.describe()</pre>
<p>With the len function you can get the number of rows in the dataset</p>
<pre>len(df)</pre>
<p>which gives us 357 astronauts</p>
<p>The columns property gives you the names of the individual columns</p>
<pre>df.columns</pre>
<p>The head() method gives you the first n (default 5) entries:</p>
<pre>df.head()</pre>
<p>whereas the tail method gives you the last n entries</p>
<pre>df.tail(10)</pre>
<p>With the iloc indexer you access entries directly by their integer position:</p>
<pre>#      [row, column]
df.iloc[0,   0]</pre>
<p>The loc indexer is another way to access a dataframe, this time by label.</p>
<p>The colon is used as a &#8220;select *&#8221; for rows or columns:</p>
<pre>#     [row, column]
df.loc[:, :]</pre>
<h2>Some analytics</h2>
<h3>Which American astronaut has spent the most time in space?</h3>
<pre>most_time_in_space = df.sort_values(by="Space Flight (hr)", ascending=False).head(1)
most_time_in_space[['Name', 'Space Flight (hr)']]</pre>
<p>Sorting the dataframe can be done with sort_values. For this question we sort by Space Flight (hr). Because we want the most hours, we have to sort in descending order, which translates to ascending=False.</p>
<p>head(1) gives us the correct answer:</p>
<p>Jeffrey N. Williams. He spent 12818 hours (534 days) in space.</p>
<p>Have you heard of him? Unsung hero!</p>
<p>Note: the dataset was last updated in 2017. As of 2019 <a href="https://www.insider.com/nasa-astronauts-time-in-space-record-2019-3">Peggy Whitson is the American who has spent the most time in space.</a></p>
<p>She has spent more than 665 days in space!</p>
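<p>The sort-and-take-the-first pattern can be tried without the Kaggle file. The sketch below uses a tiny made-up data frame (names and hours are placeholders, not values from the dataset) and also shows nlargest, which is an alternative not used in the post:</p>

```python
import pandas as pd

# Placeholder rows, not taken from the astronauts data set.
df = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
    "Space Flight (hr)": [120, 4500, 980],
})

# Sort descending and keep the first row.
top_by_sort = df.sort_values(by="Space Flight (hr)", ascending=False).head(1)

# Alternative: ask directly for the row with the largest value.
top_by_nlargest = df.nlargest(1, "Space Flight (hr)")

print(top_by_sort[["Name", "Space Flight (hr)"]])
```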
<h3>What university has produced the most astronauts?</h3>
<p>The method value_counts is used to count the number of occurrences of unique values</p>
<pre>df['Alma Mater'].value_counts().head(1)</pre>
<p class="console_text">The US Naval Academy produced 12 astronauts.</p>
<h3>What subject did the most astronauts major in at college?</h3>
<pre>df['Undergraduate Major'].value_counts().head(1)</pre>
<p>The same here: use the value_counts method on the Undergraduate Major column.<br />
The answer is Physics: 35 astronauts studied physics in college.</p>
<h3>Have most astronauts served in the military?</h3>
<p>The count method returns the number of entries that are neither null nor NaN</p>
<pre>astronauts_with_military_rank = df['Military Rank'].count()
astronauts_with_military_rank</pre>
<p>207 astronauts have a military rank.</p>
<pre>percentage_astronauts_served = astronauts_with_military_rank / len(df)
percentage_astronauts_served</pre>
<p>58% served in the military.</p>
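<p>How count skips missing values can be seen on a small made-up column (the five rows below are placeholders, not rows from the dataset). The notna().mean() line is an alternative way to get the same share in one step:</p>

```python
import pandas as pd

# Placeholder column with two missing ranks.
df = pd.DataFrame({"Military Rank": ["Colonel", None, "Captain", None, "Colonel"]})

# count() ignores missing entries, len() does not.
with_rank = df["Military Rank"].count()
share = with_rank / len(df)

# Alternative: the mean of a boolean mask is the share of True values.
share_alt = df["Military Rank"].notna().mean()

print(with_rank, share, share_alt)  # 3 0.6 0.6
```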
<h3>Which is the most common rank?</h3>
<pre>df['Military Rank'].value_counts()</pre>
<p>which gives us 94 Colonels.</p>
<p>You can find this code as a Jupyter notebook on GitHub:</p>
<p><a href="https://github.com/jboegeholz/introduction_to_pandas/blob/master/astronauts_with_pandas.ipynb">https://github.com/jboegeholz/introduction_to_pandas/blob/master/astronauts_with_pandas.ipynb</a></p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
