## 10 things I didn’t know about Data Science a year ago

In my article “My personal road map for learning data science in 2018” I wrote about how I try to tackle the data science knowledge sphere. Since 2018 is slowly coming to an end, I think it is time for a little wrap-up.

What are the things I learned about Data Science in 2018? Here we go:

## 1. The difference between Data Science, Machine Learning, Deep Learning and AI

## Introduction to matplotlib

matplotlib is the workhorse of data science visualization. Its pyplot module gives us MATLAB-like plots.

The most basic plot is drawn with the `plot` function.
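A minimal sketch of such a basic plot might look like the following. The data values, axis labels and the output file name `basic_plot.png` are assumptions made for illustration and are not taken from the original article.

```python
import matplotlib.pyplot as plt

# Illustrative data (assumed): the squares of the first few integers
x = [0, 1, 2, 3, 4]
y = [value ** 2 for value in x]

# plot() connects the points with a line and returns a list of Line2D objects
lines = plt.plot(x, y)
plt.xlabel("x")
plt.ylabel("x squared")

# Save to a file; in an interactive session plt.show() opens a window instead
plt.savefig("basic_plot.png")
```

Passing only `y` also works; in that case matplotlib uses the index of each value as its x coordinate.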

## Scatterplot with matplotlib

If you are already familiar with the basic plot from the introduction to matplotlib, here is another type of plot used in data science.

A very basic visualization is the scatter plot:

```python
import numpy as np
import matplotlib.pyplot as plt

# Draw N random points in the unit square
N = 100
x = np.random.rand(N)
y = np.random.rand(N)

plt.scatter(x, y)
plt.show()
```

## What is Feature Scaling?

Feature scaling is an important pre-processing step for some machine learning algorithms.

Imagine you have three friends whose weight and height you know.

You would like to deduce Christian’s t-shirt size from David’s and Julia’s by looking at their height and weight.

| Name | Height in m | Weight in kg | T-Shirt size |
|---|---|---|---|
| Julia | 1.58 | 52 | Small |
| David | 1.79 | 79 | Large |
| Christian | 1.86 | 64 | ? |

One way you could determine the shirt size is to just add up the weight and the height of each friend.
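The naive sum can be sketched in a few lines of Python using the values from the table. The second half rescales both features to the range 0 to 1 (min-max scaling) before adding them. Min-max scaling is only one possible scaling method and is an assumption here, as are the helper names `min_max` and `closest`.

```python
# Height in m and weight in kg, taken from the table above
friends = {
    "Julia": (1.58, 52.0),
    "David": (1.79, 79.0),
    "Christian": (1.86, 64.0),
}

# Naive approach: add height and weight directly
raw_sum = {name: height + weight for name, (height, weight) in friends.items()}


def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


names = list(friends)
scaled_heights = min_max([friends[n][0] for n in names])
scaled_weights = min_max([friends[n][1] for n in names])
scaled_sum = {n: h + w for n, h, w in zip(names, scaled_heights, scaled_weights)}


def closest(scores, target="Christian"):
    """Return the friend whose score is nearest to the target's score."""
    others = [n for n in scores if n != target]
    return min(others, key=lambda n: abs(scores[n] - scores[target]))


print(raw_sum)
print("closest by raw sum:   ", closest(raw_sum))
print("closest by scaled sum:", closest(scaled_sum))
```

With the raw sums, the weight in kilograms dominates the height in metres, so Christian ends up closer to Julia even though he is the tallest of the three. After scaling, both features contribute on the same scale and Christian ends up closer to David.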

## ROC Curve

Now that we have introduced Precision and Recall, the ROC curve is another way of looking at the quality of classification algorithms.

ROC stands for Receiver Operating Characteristic.

The ROC curve is created by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis at various threshold settings.

You already know the TPR as recall or sensitivity.
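To make the construction concrete, here is a minimal sketch that computes the points of a ROC curve with numpy. The function name `roc_points`, the labels and the scores are made up for illustration; a sample counts as positive when its score is greater than or equal to the threshold.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr, thresholds) for binary labels and classifier scores."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)

    # Every distinct score is a threshold, from high to low.
    # The leading inf yields the starting point (0, 0): nothing predicted positive.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))

    positives = y_true.sum()
    negatives = (~y_true).sum()

    fpr, tpr = [], []
    for threshold in thresholds:
        predicted = scores >= threshold
        tpr.append((predicted & y_true).sum() / positives)   # TP / (TP + FN)
        fpr.append((predicted & ~y_true).sum() / negatives)  # FP / (FP + TN)
    return np.array(fpr), np.array(tpr), thresholds


# Illustrative labels and scores (assumed)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_points(y_true, scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```

Plotting `fpr` on the x-axis against `tpr` on the y-axis gives the curve, which runs from (0, 0) to (1, 1).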

## Introduction to Pandas

Pandas is a data analysis tool. Together with numpy and matplotlib, it is part of the data science stack.

You can install it via:

`pip install pandas`
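Once it is installed, a first look at pandas could be as simple as the following sketch. The small table and its column names are assumptions chosen for illustration.

```python
import pandas as pd

# A small, made-up table: one row per person, one column per attribute
df = pd.DataFrame(
    {
        "name": ["Julia", "David", "Christian"],
        "height_m": [1.58, 1.79, 1.86],
        "weight_kg": [52, 79, 64],
    }
)

print(df.head())               # first rows of the table
print(df.describe())           # summary statistics for the numeric columns
print(df["height_m"].mean())   # a single column behaves like a labelled array
```

The `DataFrame` is the central pandas data structure: a table whose columns can be selected by name.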

## Working with real data

The data set we are using is the astronauts data set from Kaggle:

## Curriculum Vitae for Data Scientists

Applying for a data scientist job offer? Tired of writing the same old curriculum vitae?

## Intro to OpenCV with Python

To work with OpenCV from Python, you need to install it first:

`pip install opencv-python`

After importing cv2, we can work with images directly like so:

```import cv2