Introduction to Pandas

pandas is a data analysis library for Python. Together with NumPy and Matplotlib, it is part of the Python data science stack.

You can install it via

pip install pandas

Working with real data

The data set we use is the NASA astronauts data set from Kaggle:

Download Data Set NASA Astronauts from Kaggle

During this introduction we want to answer the following questions:

  • Which American astronaut has spent the most time in space?
  • What university has produced the most astronauts?
  • What subject did the most astronauts major in at college?
  • Have most astronauts served in the military?
  • What is the most common rank they achieved?

Basic Usage

pandas is conventionally imported under the alias “pd”:

import pandas as pd

Dataframes

df = pd.read_csv("./astronauts.csv")

A DataFrame is the most versatile data structure in pandas. You can think of it as an Excel sheet with rows and named columns. Here, read_csv loads the CSV file into a DataFrame.
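To see what read_csv produces without downloading anything, here is a minimal sketch that reads a tiny CSV from a string. The three rows are made up for illustration; only the column names are taken from this article.

```python
import io

import pandas as pd

# Made-up sample rows (not real data); column names mirror the ones used in this article.
csv_text = """Name,Space Flight (hr),Military Rank
Astronaut A,100,Colonel
Astronaut B,250,
Astronaut C,75,Captain
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # one data type per column
```

The empty field in the second row is read as a missing value (NaN), which matters later when counting military ranks.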

You can get summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns with

df.describe()
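As a small sketch of what describe() returns, the following uses a made-up three-row DataFrame (hypothetical values, not taken from the data set). By default only numeric columns are summarised; include="all" adds the others.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sample_df = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
    "Space Flight (hr)": [100, 250, 75],
})

# Default: statistics for numeric columns only.
summary = sample_df.describe()
print(summary)

# include="all" also summarises non-numeric columns (count, unique, top, freq).
full_summary = sample_df.describe(include="all")
print(full_summary)
```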

With the built-in len function you can get the number of rows in the dataset

len(df)

which gives us 357 astronauts.

The columns attribute gives you the names of the individual columns

df.columns
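The row count and the column names can also be read together from the shape attribute. A minimal sketch on a made-up two-row DataFrame (the university names are placeholders):

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sample_df = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B"],
    "Alma Mater": ["University X", "University Y"],
})

n_rows = len(sample_df)                        # number of rows
n_rows_from_shape, n_cols = sample_df.shape    # (rows, columns) tuple
column_names = list(sample_df.columns)         # Index converted to a plain list

print(n_rows, n_cols, column_names)
```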

The head() method gives you the first n rows (default n=5):

df.head()

whereas the tail() method gives you the last n rows:

df.tail(10)
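A quick sketch of both methods on a made-up ten-row DataFrame, so the selected rows are easy to verify:

```python
import pandas as pd

# Hypothetical data: a single column holding the numbers 0 to 9.
sample_df = pd.DataFrame({"value": range(10)})

first_five = sample_df.head()    # default n=5: rows 0 to 4
last_three = sample_df.tail(3)   # rows 7 to 9

print(first_five)
print(last_three)
```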

With the iloc indexer you access entries directly by their integer position

#      [row, column]
df.iloc[0,   0]
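Besides single cells, iloc also accepts slices. The following sketch shows a few common forms on made-up data (names and hours are hypothetical):

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sample_df = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
    "Space Flight (hr)": [100, 250, 75],
})

first_cell = sample_df.iloc[0, 0]      # single value: row 0, column 0
first_row = sample_df.iloc[0]          # whole first row as a Series
first_two_rows = sample_df.iloc[:2]    # rows 0 and 1 (end is exclusive)
second_column = sample_df.iloc[:, 1]   # all rows of column 1 as a Series

print(first_cell)
```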

The loc indexer is another way to access a DataFrame. Where iloc selects by integer position, loc selects by row and column labels.

A bare colon acts as a “select *” for rows or columns

#     [row, column]
df.loc[:, :]
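To make the difference to iloc visible, this sketch gives a made-up DataFrame string row labels (the labels "a", "b", "c" are an assumption for illustration; the astronauts DataFrame has a default integer index):

```python
import pandas as pd

# Hypothetical sample data with explicit string row labels.
sample_df = pd.DataFrame(
    {
        "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
        "Space Flight (hr)": [100, 250, 75],
    },
    index=["a", "b", "c"],
)

one_cell = sample_df.loc["b", "Name"]    # row label "b", column label "Name"
name_column = sample_df.loc[:, "Name"]   # all rows, one column
label_slice = sample_df.loc["a":"b"]     # label slices include both endpoints

# A boolean mask selects the rows where the condition is True.
long_flights = sample_df.loc[sample_df["Space Flight (hr)"] > 90, ["Name"]]

print(long_flights)
```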

Some analytics

Which American astronaut has spent the most time in space?

most_time_in_space = df.sort_values(by="Space Flight (hr)", ascending=False).head(1)
most_time_in_space[['Name', 'Space Flight (hr)']]

A DataFrame can be sorted with sort_values. For this question we sort by Space Flight (hr). Because we want the most hours, we sort in descending order, which translates to ascending=False.

head(1) gives us the correct answer:

Jeffrey N. Williams. He spent 12818 hours (534 days) in space.

Have you heard of him? Unsung hero!

Note: the dataset was last updated in 2017. As of 2019, Peggy Whitson is the American who has spent the most time in space.

She has spent more than 665 days in space!
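The sorting approach above can be sketched on made-up data, together with two alternatives that avoid sorting the whole DataFrame (nlargest and idxmax). The names and hours are hypothetical; only the 12818 hours used for the day conversion is the figure quoted above.

```python
import pandas as pd

# Hypothetical sample data for illustration only.
sample_df = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
    "Space Flight (hr)": [100, 250, 75],
})

# Approach used in this article: sort descending, then take the first row.
top = sample_df.sort_values(by="Space Flight (hr)", ascending=False).head(1)

# Alternative 1: nlargest returns the n rows with the largest values.
top_alt = sample_df.nlargest(1, "Space Flight (hr)")

# Alternative 2: idxmax gives the row label of the maximum value.
top_name = sample_df.loc[sample_df["Space Flight (hr)"].idxmax(), "Name"]

# Converting the hours quoted above into days.
days_in_space = 12818 / 24

print(top_name, round(days_in_space))
```

All three approaches return the same row here; with ties, they may pick different rows.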

What university has produced the most astronauts?

The value_counts method counts the number of occurrences of each unique value and sorts the result in descending order

df['Alma Mater'].value_counts().head(1)

The US Naval Academy produced 12 astronauts.
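A minimal sketch of value_counts on a made-up Series (the university names are placeholders). Missing values are left out of the counts by default.

```python
import pandas as pd

# Hypothetical sample data; None stands for a missing entry.
alma_mater = pd.Series(["University X", "University Y", "University X", None])

counts = alma_mater.value_counts()   # sorted by count, descending; NaN excluded

most_common_university = counts.index[0]   # label of the first (largest) entry
most_common_count = counts.iloc[0]         # its count

print(counts.head(1))
```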

What subject did the most astronauts major in at college?

df['Undergraduate Major'].value_counts().head(1)

The same approach works here: use the value_counts method on the Undergraduate Major column.
The answer is Physics: 35 astronauts studied physics in college.
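One caveat: value_counts treats each cell as a single value. If a cell were to list several majors in one string, it would be counted as its own category. The sketch below assumes such cells are separated by "; ", which is an assumption for illustration and not something this article verifies for the data set. It splits the strings and counts each part separately.

```python
import pandas as pd

# Hypothetical sample data; the "; " separator is an assumption.
majors = pd.Series(["Physics; Mathematics", "Physics", "Chemistry", None])

# Split each cell into a list, then give every list element its own row.
exploded = majors.str.split("; ").explode()

split_counts = exploded.value_counts()   # missing values are excluded

print(split_counts)
```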

Have most astronauts served in the military?

The count method returns the number of entries that are not missing (None or NaN)

astronauts_with_military_rank = df['Military Rank'].count()
astronauts_with_military_rank

207 astronauts have a military rank.
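The difference between count and len is easiest to see on a small made-up Series with missing entries (the values are hypothetical):

```python
import pandas as pd

# Hypothetical sample data; None marks astronauts without a military rank.
ranks = pd.Series(["Colonel", None, "Captain", None])

non_missing = ranks.count()      # entries that are not missing
total = len(ranks)               # all entries, missing or not
missing = ranks.isna().sum()     # entries that are missing

print(non_missing, total, missing)
```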

percentage_astronauts_served = astronauts_with_military_rank / len(df)
percentage_astronauts_served

About 58% of the astronauts have a military rank, so most of them served in the military.
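The same share can be computed in one step with notna().mean(), since the mean of a boolean Series is the fraction of True values. This sketch uses made-up ranks, then formats the figures quoted above (207 of 357) as a percentage.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing rank.
ranks = pd.Series(["Colonel", None, "Captain", None])

# notna() yields True/False per entry; the mean is the share of True values.
share_with_rank = ranks.notna().mean()

# Formatting the figures from this article as a percentage.
share_in_article = 207 / 357
formatted = f"{share_in_article:.0%}"

print(share_with_rank, formatted)
```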

Which is the most common rank?

df['Military Rank'].value_counts()

The counts are sorted in descending order, so the first entry is the most common rank: Colonel, with 94 astronauts.
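If only the most common value is needed, it can be extracted directly instead of reading it off the printed counts. A minimal sketch on made-up ranks:

```python
import pandas as pd

# Hypothetical sample data; None marks a missing rank.
ranks = pd.Series(["Colonel", "Captain", "Colonel", None, "Commander"])

counts = ranks.value_counts()

most_common_rank = counts.idxmax()   # label with the highest count
most_common_count = counts.max()     # the count itself

# mode() returns the most frequent value(s) and ignores missing entries.
mode_rank = ranks.mode()[0]

print(most_common_rank, most_common_count)
```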

You can find this code as a Jupyter notebook on GitHub:

https://github.com/jboegeholz/introduction_to_pandas/blob/master/astronauts_with_pandas.ipynb