10 things I didn’t know about Data Science a year ago

In my article My personal road map for learning data science in 2018 I wrote about how I try to tackle the data science knowledge sphere. As 2018 is slowly coming to an end, I think it is time for a little wrap-up.

What are the things I learned about Data Science in 2018? Here we go:

1. The difference between Data Science, Machine Learning, Deep Learning and AI

Continue reading “10 things I didn’t know about Data Science a year ago”

Lesson 10: Feature Scaling

What is Feature Scaling?

Feature Scaling is an important pre-processing step for some machine learning algorithms.

Imagine you have three friends of whom you know the individual weight and height.

You would like to deduce Chris’ T-shirt size from Cameron’s and Sarah’s by looking at their heights and weights.

Name      Height in m   Weight in kg   T-shirt size
Sarah     1.58          52             Small
Cameron   1.79          79             Large
Chris     1.86          64             ?

One way you could determine the shirt size is to just add up the weight and the height of each friend. You would get: Continue reading “Lesson 10: Feature Scaling”
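One common remedy is to rescale every feature to the same range before combining them. As a minimal sketch (the numbers are the ones from the table above; the scaler choice is an assumption, not necessarily the one used in the lesson), scikit-learn’s MinMaxScaler maps each feature to [0, 1]:

```python
from sklearn.preprocessing import MinMaxScaler

# Heights in m and weights in kg for Sarah, Cameron and Chris
features = [[1.58, 52.0],
            [1.79, 79.0],
            [1.86, 64.0]]

# MinMaxScaler rescales each column independently:
# x' = (x - x_min) / (x_max - x_min)
scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)
print(scaled)
```

After scaling, height and weight both live in [0, 1], so neither feature dominates just because of its unit.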

Receiver Operating Characteristic

ROC Curve

Besides Precision and Recall, which we already introduced, the ROC curve is another way of looking at the quality of classification algorithms.

ROC stands for Receiver Operating Characteristic.

The ROC curve is created by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis at various threshold settings.

You already know the TPR as recall or sensitivity.

The false positive rate is defined as FPR = FP / (FP + TN)


ROC curves have a big advantage: they are insensitive to changes in class distribution.


from sklearn.metrics import roc_curve
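As a small sketch of how that import is used (the labels and scores below are made up for illustration), `roc_curve` returns one (FPR, TPR) pair per threshold:

```python
from sklearn.metrics import roc_curve
import numpy as np

# Hypothetical ground-truth labels and classifier scores
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Sweep over the score thresholds and collect the rates
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr)  # false positive rate, FP / (FP + TN), per threshold
print(tpr)  # true positive rate (recall), per threshold
```

Plotting `tpr` against `fpr` gives the ROC curve; a perfect classifier hugs the top-left corner, while random guessing follows the diagonal.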

Classification: Precision and Recall

In the realms of Data Science you’ll sooner or later encounter the terms “Precision” and “Recall”. But what do they mean?


Living together with little kids, you very often run into classification issues:

My daughter really likes dogs, so seeing a dog is something positive. When she sees a normal dog, e.g. a Labrador, she proclaims: “Look, there is a dog!”

That’s a True Positive (TP). Continue reading “Classification: Precision and Recall”
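In code, both metrics fall out of counting those cases. A minimal sketch, with made-up labels continuing the dog example (1 = dog, 0 = not a dog):

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical data: what actually walked by vs. what my daughter proclaimed
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]

# Precision = TP / (TP + FP): of everything called a dog, how much was a dog?
p = precision_score(y_true, y_pred)

# Recall = TP / (TP + FN): of all actual dogs, how many did she spot?
r = recall_score(y_true, y_pred)

print(p, r)
```

Here she scores 3 true positives, 1 false positive (something she wrongly called a dog) and 1 false negative (a dog she missed), so both precision and recall come out at 3/4.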

Lesson 4: Decision Trees

from sklearn.tree import DecisionTreeClassifier

# min_samples_split=40: a node needs at least 40 samples to be split further;
# larger values yield simpler trees and help against overfitting
clf = DecisionTreeClassifier(min_samples_split=40)
clf.fit(features_train, labels_train)

UD120 – Intro to Machine Learning

One part of my bucket list for 2018 is finishing the Udacity Course UD120: Intro to Machine Learning.

The hosts of this course are Sebastian Thrun, ex-Google X and founder of Udacity, and Katie Malone, creator of the Linear Digressions podcast.

The course consists of 17 lessons. Every lesson has a couple of hours of video and lots and lots of quizzes in it.

  • [x] Lesson 1: Only introduction 🙂
  • [x] Lesson 2: Naive Bayes
  • [x] Lesson 3: Support Vector Machines
  • [x] Lesson 4: Decision Trees
  • [x] Lesson 5: Choose your own algorithm
  • [x] Lesson 6: Datasets and questions
  • [x] Lesson 7: Regression
  • [x] Lesson 8: Outliers
  • [x] Lesson 9: Clustering
  • [x] Lesson 10: Feature Scaling
  • [ ] Lesson 11: Text Learning
  • [ ] Lesson 12: Feature Selection
  • [ ] Lesson 13: PCA
  • [ ] Lesson 14: Validation
  • [ ] Lesson 15: Evaluation Metrics
  • [ ] Lesson 16: Tying it all together
  • [ ] Lesson 17: Final project