Classification: Precision and Recall

In the realm of data science you’ll sooner or later encounter the terms “Precision” and “Recall”. But what do they mean? Clarification: Living together with little kids, you very often run into classification issues: My daughter really likes dogs, so seeing a dog is something positive. When she sees a normal dog, e.g. a…
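As a rough illustration of the two terms: precision is the share of "dog" calls that really were dogs, and recall is the share of real dogs that were called out. The sketch below uses the standard definitions with made-up observations; the function name `precision_recall` and the sample data are assumptions for illustration, not taken from the post.

```python
def precision_recall(actual, predicted, positive="dog"):
    """Return (precision, recall) for one positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical observations: what the animal really was vs. what was called out.
actual = ["dog", "dog", "cat", "dog", "cat", "cat"]
predicted = ["dog", "cat", "dog", "dog", "cat", "dog"]

precision, recall = precision_recall(actual, predicted)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

With this sample, two of the four "dog" calls are correct (precision 0.50) and two of the three real dogs are found (recall 0.67).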

UD120 – Intro to Machine Learning

One part of my bucket list for 2018 is finishing the Udacity course UD120: Intro to Machine Learning. The hosts of this course are Sebastian Thrun, formerly of Google X and founder of Udacity, and Katie Malone, creator of the Linear Digressions podcast. The course consists of 17 lessons. Every lesson has a couple of hours of video…

Lesson 3: Support Vector Machines

The term Support Vector Machines, or SVM, is a bit misleading. It is just a name for a very clever algorithm invented by two Russians, Vladimir Vapnik and Alexey Chervonenkis, in the 1960s. SVMs are used for classification and regression. For classification, an SVM finds the hyperplane that best separates two classes of data, meaning the one that keeps the largest distance to the nearest points of each class.
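To make "separates both classes best" a little more concrete, here is a minimal sketch that compares two hand-picked hyperplanes by their margin, the quantity an SVM maximizes. It does not train an SVM; the helper names, the toy points, and the two candidate lines are all assumptions chosen for illustration.

```python
import math


def signed_distance(point, w, b):
    """Signed distance of a point from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, point)) + b) / norm


def margin(points, labels, w, b):
    """Smallest distance of any point to the hyperplane.

    The result is negative if at least one point lies on the wrong side.
    """
    return min(label * signed_distance(p, w, b) for p, label in zip(points, labels))


# Hypothetical 2D toy data with class labels -1 and +1.
points = [(1.0, 1.0), (2.0, 2.0), (4.0, 4.0), (5.0, 5.0)]
labels = [-1, -1, +1, +1]

# Two hypothetical candidate hyperplanes (lines in 2D); both separate the classes.
candidate_a = ((1.0, 1.0), -6.0)  # x + y = 6, centred between the classes
candidate_b = ((1.0, 1.0), -7.0)  # x + y = 7, closer to the +1 class

margin_a = margin(points, labels, *candidate_a)
margin_b = margin(points, labels, *candidate_b)
print(f"margin A: {margin_a:.3f}, margin B: {margin_b:.3f}")
```

Both lines classify every point correctly, but candidate A keeps more distance to the nearest points, so it is the better separator in the SVM sense.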

Linear Algebra with numpy – Part 1

NumPy is a package for scientific computing in Python. It is blazing fast due to its implementation in C. It is often used together with pandas, matplotlib and Jupyter notebooks. Together, these packages are often referred to as the data science stack. Installation: You can install NumPy via pip: pip install numpy. Basic Usage: In the data science…

The Normal Distribution

Diving deeper into data science, I started to brush up on my knowledge of math, especially statistics. The Mother of all Distributions: The normal distribution was formulated by Carl Friedrich Gauß in 1809 and can be implemented in Python as follows: def normal_distribution(x, mu=0, sigma=1): sqrt_two_pi = math.sqrt(2*math.pi) return math.exp(-(x-mu)**2 / 2 / sigma**2)…
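The excerpt above breaks off in the middle of the function. A runnable sketch might look like the following; the division by `sqrt_two_pi * sigma` is the normalization from the standard density formula and is an assumption about how the original continues, as is the added `import math`.

```python
import math


def normal_distribution(x, mu=0, sigma=1):
    """Probability density of the normal distribution at x."""
    sqrt_two_pi = math.sqrt(2 * math.pi)
    # Assumed completion: normalise by sqrt(2*pi) * sigma (standard density formula).
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)


peak = normal_distribution(0)  # density at the mean of the standard normal
print(round(peak, 4))
```

For the standard normal distribution (mu=0, sigma=1) the density at the mean is 1 / sqrt(2*pi), roughly 0.3989, and the curve is symmetric around mu.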

Data Science: Cross-Validation

To validate your model, you need to split your data into a training and a test data set. More training data means a better model; more test data means better validation. But because the amount of data available to train and test the model is limited, you have to decide on the ratio of training vs test data…
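One common way around this trade-off is k-fold cross-validation: the data is cut into k parts, and each part serves once as the test set while the rest is used for training. The sketch below only produces the index splits; the function name `k_fold_indices` is hypothetical, and it assumes the samples are already in random order (in practice they are usually shuffled first).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample ends up in a test set exactly once, so all of the data is used for both training and validation across the k runs.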

Introduction to Jupyter Notebook

JuPyteR: Do you know the feeling of already being late to the party when encountering something new? But when you actually start telling others about it, you realize that it is not common knowledge at all, e.g. Jupyter Notebooks. What is a Jupyter notebook? In my own words: a browser-based, document-oriented, command-line-style…

Data Science Datasets: Iris flower data set

The Iris flower data set or Fisher’s Iris data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. This…

Data Science Overview

Questions: Data science tries to answer one of the following questions: Classification -> “Is it A or B?” Clustering -> “Are there groups which belong together?” Regression -> “How will it develop in the future?” Association -> “What often happens together?” There are two ways to tackle these problem domains with machine learning:…