Introduction to matplotlib – Part 2

When you finished reading part 1 of the introduction, you might have wondered how to draw more than one line or curve into one plot. I will show you now.

To make it a bit more interesting we generate two functions: sine and cosine. We generate our x-values with numpy’s linspace function:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2*np.pi)

sin = np.sin(x)
cos = np.cos(x)

plt.plot(x, sin, color='b')
plt.plot(x, cos, color='r')

You can plot two or more curves by repeatedly calling the plot method.

That’s fine as long as the individual plots share the same axis labels and value ranges. If they don’t, you can give each curve its own subplot:


fig = plt.figure()
p1 = fig.add_subplot(2, 1, 1)
p2 = fig.add_subplot(2, 1, 2)
p1.plot(x, sin, c='b')
p2.plot(x, cos, c='r')

The add_subplot method allows us to put many plots into one “parent” plot, also known as a figure. The arguments are (number_of_rows, number_of_columns, position_in_the_grid). In this example we have 2 rows in 1 column; sine is in the first position, cosine in the second.

When you have a 2 by 2 grid, the positions are numbered row by row, from left to right:

fig = plt.figure()
p1 = fig.add_subplot(221)
p2 = fig.add_subplot(222)
p3 = fig.add_subplot(223)
p4 = fig.add_subplot(224)
p1.plot(x, sin, c='b')
p2.plot(x, cos, c='r')
p3.plot(x, -sin, c='g')
p4.plot(x, -cos, c='y')
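
To make the numbering explicit, here is a minimal sketch in plain Python. The helper subplot_position is hypothetical (it is not part of matplotlib); it only mirrors how the third argument of add_subplot maps to a row and a column. Note that add_subplot(221) is shorthand for add_subplot(2, 2, 1).

```python
def subplot_position(n_rows, n_cols, index):
    """Return the zero-based (row, column) for a 1-based subplot index.

    Hypothetical helper for illustration only; matplotlib does this
    bookkeeping internally when you call add_subplot.
    """
    if not 1 <= index <= n_rows * n_cols:
        raise ValueError("index must be between 1 and n_rows * n_cols")
    # Positions are counted along a row first, then down to the next row.
    return divmod(index - 1, n_cols)


for index in range(1, 5):
    row, col = subplot_position(2, 2, index)
    print(f"add_subplot(2, 2, {index}) -> row {row}, column {col}")
```

So in the 2 by 2 example, sine and cosine share the top row, while -sine and -cosine share the bottom row.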

The code is available as a Jupyter Notebook on my GitHub.

10 things I didn’t know about Data Science a year ago

In my article “My personal road map for learning data science in 2018” I wrote about how I try to tackle the data science knowledge sphere. Since 2018 is slowly coming to an end, I think it is time for a little wrap-up.

What are the things I learned about Data Science in 2018? Here we go:

1. The difference between Data Science, Machine Learning, Deep Learning and AI


Lesson 10: Feature Scaling

What is Feature Scaling?

Feature Scaling is an important pre-processing step for some machine learning algorithms.

Imagine you have three friends of whom you know the individual weight and height.

You would like to deduce Chris’ T-shirt size from Cameron’s and Sarah’s by looking at the height and weight.

Name      Height in m   Weight in kg   T-shirt size
Sarah     1.58          52             Small
Cameron   1.79          79             Large
Chris     1.86          64             ?

One way you could determine the shirt size is to just add up the weight and the height of each friend.
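
The sums described above can be worked out in a few lines. This is only a sketch of the naive approach with the three rows from the table; the variable names are my own.

```python
# (height in m, weight in kg) for each friend, taken from the table above
friends = {
    "Sarah": (1.58, 52),
    "Cameron": (1.79, 79),
    "Chris": (1.86, 64),
}

# Naive approach: add the raw height and weight without any scaling.
sums = {
    name: round(height + weight, 2)
    for name, (height, weight) in friends.items()
}

for name, total in sums.items():
    print(f"{name}: {total}")
```

Notice that the weight contributes far more to each sum than the height, simply because its numbers are larger. That imbalance is what feature scaling is meant to address.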

Receiver Operating Characteristic

ROC Curve

Like Precision and Recall, which we already introduced, the ROC curve is another way of looking at the quality of classification algorithms.

ROC stands for Receiver Operating Characteristic.

The ROC curve is created by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis at various threshold settings.

You already know the TPR as recall or sensitivity.

The false positive rate is defined as FPR = FP / (FP + TN)
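
To see how the points of the curve come about, here is a small sketch in plain Python that computes FPR and TPR at a few thresholds. The labels, scores and thresholds are made up for illustration, and the function name rates_at_threshold is my own.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (FPR, TPR) when every score >= threshold counts as positive.

    Assumes y_true contains at least one 0 and one 1, otherwise one of
    the denominators would be zero.
    """
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif label == 0:
            tn += 1
        else:
            fn += 1
    fpr = fp / (fp + tn)  # FPR = FP / (FP + TN)
    tpr = tp / (tp + fn)  # TPR = TP / (TP + FN), also known as recall
    return fpr, tpr


# Made-up example data: two negatives and two positives
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

roc_points = [rates_at_threshold(y_true, scores, t) for t in (0.8, 0.4, 0.35, 0.1)]
print(roc_points)
```

Each threshold yields one (FPR, TPR) pair; plotting these pairs gives the ROC curve.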


ROC curves have a big advantage: they are insensitive to changes in class distribution.


from sklearn.metrics import roc_curve
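
A minimal usage sketch of roc_curve, assuming scikit-learn and numpy are installed. The labels and scores are made up for illustration; in practice the scores would come from your classifier.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Made-up example data: true labels and the scores a classifier assigned
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve returns the false positive rates, the true positive rates
# and the thresholds at which they were computed.
fpr, tpr, thresholds = roc_curve(y_true, scores)

for f, t in zip(fpr, tpr):
    print(f"FPR={f:.2f}  TPR={t:.2f}")
```

The fpr and tpr arrays can be passed straight to plt.plot(fpr, tpr) to draw the curve.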

Data Science: Pandas

Pandas is a data analysis tool. Together with numpy and matplotlib it is part of the data science stack.

You can install it via:

pip install pandas

Working with real data

The data set we are using is the astronauts data set from Kaggle:

Download Data Set NASA Astronauts from Kaggle

During this introduction we want to answer the following questions:

  • Which American astronaut has spent the most time in space?
  • What university has produced the most astronauts?
  • What subject did the most astronauts major in at college?
  • Have most astronauts served in the military? What rank did they achieve?

Basic Usage

import pandas as pd

astronaut_data = pd.read_csv("./astronauts.csv")

With the len function you can get the number of rows in the dataset: len(astronaut_data) gives us 357 astronauts.

The columns property, astronaut_data.columns, gives you the names of the individual columns.
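
If you don’t have the CSV file at hand, both calls can be tried on a small stand-in DataFrame. The rows below are hypothetical placeholders, not values from the Kaggle data set; only the column names are taken from it.

```python
import pandas as pd

# Hypothetical stand-in rows; the real file has 357 rows and more columns.
sample = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
    "Space Flight (hr)": [100, 2500, 730],
})

n_rows = len(sample)                  # number of rows in the DataFrame
column_names = list(sample.columns)   # column labels as a plain list

print(n_rows)
print(column_names)
```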


The head() method gives you the first five entries by default, whereas the tail(n) method gives you the last n entries.
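
A quick sketch of head and tail on a stand-in DataFrame with eight hypothetical rows (the names are placeholders, not real data):

```python
import pandas as pd

# Eight hypothetical rows so that head() and tail() return different slices
sample = pd.DataFrame({"Name": [f"Astronaut {i}" for i in range(1, 9)]})

first_five = sample.head()   # no argument: the first 5 rows
last_two = sample.tail(2)    # the last 2 rows

print(first_five)
print(last_two)
```

Both methods accept the number of rows as an argument, so head(1) returns only the first row.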


With the iloc indexer you can access entries directly by their integer position, for example astronaut_data.iloc[0] for the first row.
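
Here is a short sketch of the most common iloc patterns, again on hypothetical stand-in rows rather than the real data set:

```python
import pandas as pd

# Hypothetical stand-in rows, not values from the Kaggle data set
sample = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
    "Space Flight (hr)": [100, 2500, 730],
})

first_row = sample.iloc[0]       # a single row, returned as a Series
middle_rows = sample.iloc[1:3]   # rows at positions 1 and 2 (end is exclusive)
single_value = sample.iloc[0, 1] # row 0, column 1

print(first_row["Name"])
print(len(middle_rows))
print(single_value)
```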


Which American astronaut has spent the most time in space?

most_time_in_space = astronaut_data.sort_values(by="Space Flight (hr)", ascending=False).head(1)
most_time_in_space[['Name', 'Space Flight (hr)']]

Sorting the DataFrame can be done with sort_values. For this question we sort by Space Flight (hr). Because we want the most hours, we have to sort in descending order, which translates to ascending=False.

head(1) gives us the correct answer:

Jeffrey N. Williams. He spent 12818 hours (534 days) in space.

Have you heard of him? Unsung hero!

Note: the dataset was last updated in 2017. As of 2019 Peggy Whitson is the American who has spent the most time in space.

She has spent more than 665 days in space!
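
The sort-then-head pattern used for this question can be tried on stand-in data as well. The three rows are hypothetical placeholders; only the conversion at the end uses the 12818 hours from the answer above.

```python
import pandas as pd

# Hypothetical stand-in rows, not values from the Kaggle data set
sample = pd.DataFrame({
    "Name": ["Astronaut A", "Astronaut B", "Astronaut C"],
    "Space Flight (hr)": [100, 2500, 730],
})

# Sort descending by flight hours and keep only the top row.
longest = sample.sort_values(by="Space Flight (hr)", ascending=False).head(1)
print(longest[["Name", "Space Flight (hr)"]])

# Converting the hours from the real answer into days
days = 12818 / 24
print(int(days))
```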

What university has produced the most astronauts?

The value_counts method counts the number of occurrences of each unique value:

astronaut_data['Alma Mater'].value_counts().head(1)

The US Naval Academy produced 12 astronauts.
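
To see what value_counts returns, here is a sketch on a hypothetical column. The university names are placeholders, not entries from the data set.

```python
import pandas as pd

# Hypothetical stand-in column; None marks a missing entry
sample = pd.DataFrame({
    "Alma Mater": [
        "University X", "University Y", "University X",
        None, "University Z", "University X",
    ],
})

# value_counts sorts by frequency (most common first) and skips missing
# values by default.
counts = sample["Alma Mater"].value_counts()
top = counts.head(1)

print(top)
```

The result is a Series whose index holds the unique values and whose values hold the counts, which is why head(1) gives you the most frequent entry.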

What subject did the most astronauts major in at college?

astronaut_data['Undergraduate Major'].value_counts().head(1)

The same applies here: use the value_counts method on the Undergraduate Major column.
The answer is Physics: 35 astronauts studied physics in college.

Have most astronauts served in the military?

The count method returns the number of entries that are not null (NaN):

astronaut_data['Military Rank'].count()

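In this case 207 astronauts have a military rank. That is about 58% of the 357 astronauts in the dataset, so the answer is yes. The most common rank is: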

astronaut_data['Military Rank'].value_counts().head(1)

which gives us 94 Colonels.
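
The difference between count and len is easy to miss, so here is a small sketch on a hypothetical column with missing entries (the values are placeholders, not the real data):

```python
import pandas as pd

# Hypothetical stand-in column; None marks astronauts without a rank
ranks = pd.Series(
    ["Colonel", None, "Captain", "Colonel", None],
    name="Military Rank",
)

with_rank = ranks.count()        # counts only the non-missing entries
total = len(ranks)               # counts every row, missing or not
share = with_rank / total        # fraction of rows that have a rank

print(with_rank, total, share)
```

Dividing count by len gives the share of rows with a value, which is how you can check whether “most” really applies.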

Classification: Precision and Recall

In the realm of data science you’ll sooner or later encounter the terms “Precision” and “Recall”. But what do they mean?


Living together with little kids, you very often run into classification issues:

My daughter really likes dogs, so seeing a dog is something positive. When she sees a normal dog, e.g. a Labrador, and proclaims “Look, there is a dog!”, that’s a True Positive (TP).