pandas is a data analysis library for Python. Together with NumPy and Matplotlib, it is part of the Python data science stack.
You can install it via:
pip install pandas
Working with real data
The data set used here is the NASA Astronauts data set from Kaggle:
Download Data Set NASA Astronauts from Kaggle
In this introduction we want to answer the following questions:
- Which American astronaut has spent the most time in space?
- What university has produced the most astronauts?
- What subject did the most astronauts major in at college?
- Have most astronauts served in the military?
- What is the most common rank they achieved?
Basic Usage
Installation
pip install pandas
By convention, pandas is imported under the alias pd:
import pandas as pd
Dataframes
df = pd.read_csv("./astronauts.csv")
A dataframe is the most versatile data structure in pandas. You can think of it as an Excel sheet with rows and columns.
The describe() method gives you summary statistics (count, mean, min, max and so on) for the numeric columns:
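If you do not have the CSV file at hand, here is a minimal sketch of the same loading step that reads from an in-memory string instead. The three rows and their values are invented for illustration; only the column names are borrowed from the astronauts data set.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the names and numbers are made up.
csv_text = """Name,Space Flight (hr),Military Rank
Astronaut A,100,Colonel
Astronaut B,250,
Astronaut C,50,Captain
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (number of rows, number of columns)
print(sample)
```

The empty field in the second row is read as a missing value (NaN), which matters later when counting military ranks.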
df.describe()
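As a small sketch of what describe() covers, the following uses an invented toy dataframe (not the real data). By default only numeric columns are summarized; passing include="all" adds the non-numeric ones.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
        "Space Flight (hr)": [100, 250, 50],
    }
)

# Default: statistics for numeric columns only.
numeric_summary = toy.describe()

# include="all" also summarizes text columns (count, unique, top, freq).
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```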
With the built-in len function you get the number of rows in the data set:
len(df)
which gives us 357 astronauts.
The columns attribute gives you the names of the individual columns:
df.columns
The head() method gives you the first n (default 5) rows:
df.head()
whereas the tail() method gives you the last n rows:
df.tail(10)
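A quick sketch of both methods on a ten-row toy dataframe (the column name n is just a placeholder):

```python
import pandas as pd

# Ten rows numbered 0 to 9, purely for illustration.
toy = pd.DataFrame({"n": range(10)})

first_three = toy.head(3)   # rows 0, 1, 2
last_two = toy.tail(2)      # rows 8, 9
default_head = toy.head()   # no argument: first 5 rows

print(first_three)
print(last_two)
```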
With the iloc indexer you access entries by their integer position:
# [row, column]
df.iloc[0, 0]
The loc indexer is another way to access a dataframe. It selects by row and column labels instead of positions.
A colon acts like a “select *” for rows or columns:
# [row, column]
df.loc[:, :]
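To make the difference between position-based and label-based access visible, here is a sketch on a toy dataframe with a non-default index. The index values 10, 20, 30 and the rows themselves are invented for illustration.

```python
import pandas as pd

# Toy data with explicit row labels so positions and labels differ.
toy = pd.DataFrame(
    {
        "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
        "Space Flight (hr)": [100, 250, 50],
    },
    index=[10, 20, 30],
)

# iloc: integer positions, [row, column]
first_name_by_position = toy.iloc[0, 0]

# loc: labels, [row label, column name]
first_name_by_label = toy.loc[10, "Name"]

# A colon selects everything along that axis.
hours_column = toy.loc[:, "Space Flight (hr)"]
first_two_rows = toy.iloc[:2, :]

print(first_name_by_position, first_name_by_label)
print(hours_column)
```

Both lookups return the same cell here; with the default 0, 1, 2, ... index, row positions and row labels happen to coincide.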
Some analytics
Which American astronaut has spent the most time in space?
most_time_in_space = df.sort_values(by="Space Flight (hr)", ascending=False).head(1)
most_time_in_space[['Name', 'Space Flight (hr)']]
Sorting the dataframe is done with sort_values. For this question we sort by the Space Flight (hr) column. Because we want the most hours first, we have to sort in descending order, which translates to ascending=False.
head(1) then gives us the answer:
Jeffrey N. Williams. He spent 12818 hours (534 days) in space.
Have you heard of him? An unsung hero!
Note: the data set was last updated in 2017. As of 2019, Peggy Whitson is the American who has spent the most time in space.
She has spent more than 665 days in space!
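The sort-then-head pattern can be tried on a toy dataframe as sketched below (rows invented for illustration). The sketch also shows nlargest, a pandas shortcut for the same idea, and the hours-to-days conversion used above.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame(
    {
        "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
        "Space Flight (hr)": [100, 250, 50],
    }
)

# Sort descending and keep the first row.
top_by_sort = toy.sort_values(by="Space Flight (hr)", ascending=False).head(1)

# Alternative: nlargest returns the n rows with the largest values.
top_by_nlargest = toy.nlargest(1, "Space Flight (hr)")

print(top_by_sort[["Name", "Space Flight (hr)"]])

# Converting the 12818 hours from the text into whole days.
days = round(12818 / 24)
print(days)  # 534
```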
What university has produced the most astronauts?
The value_counts method counts the number of occurrences of each unique value and sorts the result from most to least frequent:
df['Alma Mater'].value_counts().head(1)
The US Naval Academy produced 12 astronauts.
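Here is a sketch of value_counts on a small invented Series, including how missing values are treated. The university names are placeholders, not real data.

```python
import pandas as pd

# Placeholder values; None stands for a missing entry.
alma_mater = pd.Series(
    ["University X", "University Y", "University X", None,
     "University Z", "University X"]
)

# Missing values are left out by default.
counts = alma_mater.value_counts()

# dropna=False keeps them as their own category.
counts_with_missing = alma_mater.value_counts(dropna=False)

print(counts.head(1))
```

Because the result is sorted in descending order, head(1) is the most frequent value.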
What subject did the most astronauts major in at college?
df['Undergraduate Major'].value_counts().head(1)
The same approach works here: use the value_counts method on the Undergraduate Major column.
The answer is Physics: 35 astronauts studied physics in college.
Have most astronauts served in the military?
The count method returns the number of entries that are not missing (neither None nor NaN):
astronauts_with_military_rank = df['Military Rank'].count()
astronauts_with_military_rank
207 astronauts have a military rank.
percentage_astronauts_served = astronauts_with_military_rank / len(df)
percentage_astronauts_served
The result is a fraction of about 0.58, so 58% served in the military.
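The same calculation can be sketched on a toy Series (values invented for illustration). The notna().mean() variant is an alternative way to get the share of non-missing entries in one step; the last line re-checks the 58% figure from the numbers quoted above.

```python
import pandas as pd

# Toy data: None marks an astronaut without a military rank.
ranks = pd.Series(["Colonel", None, "Captain", "Colonel", None])

with_rank = ranks.count()         # non-missing entries
share = with_rank / len(ranks)    # fraction between 0 and 1

# Alternative: notna() gives True/False, and the mean of booleans
# is the fraction of True values.
share_alt = ranks.notna().mean()

print(with_rank, share, share_alt)

# Check with the figures from the text: 207 of 357 astronauts.
print(round(207 / 357 * 100))  # 58
```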
Which is the most common rank?
df['Military Rank'].value_counts()
The first entry of the result is the most common rank: Colonel, held by 94 astronauts.
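If you only want the top rank rather than the full table, one possible sketch is to combine value_counts with idxmax and max. The Series below is invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration.
ranks = pd.Series(["Colonel", "Captain", "Colonel", None, "Commander"])

rank_counts = ranks.value_counts()

# idxmax returns the label with the highest count, max the count itself.
most_common_rank = rank_counts.idxmax()
most_common_count = rank_counts.max()

print(most_common_rank, most_common_count)
```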
You can find this code as a Jupyter notebook on GitHub:
https://github.com/jboegeholz/introduction_to_pandas/blob/master/astronauts_with_pandas.ipynb