<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Bayes theorem Archives - Creatronix</title>
	<atom:link href="https://creatronix.de/tag/bayes-theorem/feed/" rel="self" type="application/rss+xml" />
	<link>https://creatronix.de/tag/bayes-theorem/</link>
	<description>My adventures in code &#38; business</description>
	<lastBuildDate>Fri, 27 Feb 2026 11:15:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>10 things I didn&#8217;t know about Data Science a year ago</title>
		<link>https://creatronix.de/10-things-i-didnt-know-about-data-science-a-year-ago/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Mon, 12 Nov 2018 08:42:26 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[ai]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[matplotlib]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[numpy]]></category>
		<category><![CDATA[opencv]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=2269</guid>

					<description><![CDATA[<p>In my article My personal road map for learning data science in 2018 I wrote about how I try to tackle the data science knowledge sphere. Due to the fact that 2018 is slowly coming to an end I think it is time for a little wrap up. What are the things I learned about&#8230;</p>
<p>The post <a href="https://creatronix.de/10-things-i-didnt-know-about-data-science-a-year-ago/">10 things I didn&#8217;t know about Data Science a year ago</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In my article <a href="https://creatronix.de/my-personal-road-map-for-learning-data-science/">My personal road map for learning data science in 2018</a> I wrote about how I try to tackle the data science knowledge sphere. Due to the fact that 2018 is slowly coming to an end I think it is time for a little wrap up.</p>
<p>What are the things I learned about Data Science in 2018? Here we go:</p>
<h2>The difference between Data Science, Machine Learning, Deep Learning and AI</h2>
<p><img fetchpriority="high" decoding="async" class="alignnone size-full wp-image-2276" src="https://creatronix.de/wp-content/uploads/2018/10/data_science_vs_ml.png" alt="" width="514" height="392" srcset="https://creatronix.de/wp-content/uploads/2018/10/data_science_vs_ml.png 514w, https://creatronix.de/wp-content/uploads/2018/10/data_science_vs_ml-300x229.png 300w" sizes="(max-width: 514px) 100vw, 514px" /></p>
<p>A picture says more than a thousand words.</p>
<h2>The difference between supervised and unsupervised learning</h2>
<p><em>Supervised Learning</em></p>
<p>You have training and test data with <strong>labels</strong>. A label tells you, for example, to which class a certain data item belongs. Imagine you have images of pets and the labels are the names of the pets.</p>
<p><em>Unsupervised Learning</em></p>
<p>Your data doesn&#8217;t have labels. Your algorithm, e.g. k-means clustering, needs to figure out a structure given only the data.</p>
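<p>The unsupervised case can be illustrated with a tiny k-means sketch. This is my own minimal example (not from any of the linked articles), using numpy and made-up 1-D data with two obvious clusters:</p>

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.abs(points[:, None] - centroids[None, :])
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean()
    return centroids, labels

# two obvious clusters around 0 and 10 -- no labels given to the algorithm
data = np.array([0.1, 0.3, -0.2, 9.8, 10.1, 10.3])
centroids, labels = kmeans(data, k=2)
```

<p>The algorithm recovers the two groups purely from the structure of the data, which is exactly what &#8220;no labels&#8221; means in practice.</p>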
<h2>The areas of applied machine learning</h2>
<p>are described here: <a href="https://creatronix.de/the-essence-of-machine-learning/">The Essence of Machine Learning </a>and <a href="https://creatronix.de/data-science-overview/">Data Science Overview</a></p>
<h2>Bayes Theorem</h2>
<p>In my article <a href="https://creatronix.de/bayes-theorem/">Bayes theorem</a> I elaborated on the <strong>base rate fallacy</strong>, and in <a href="https://creatronix.de/lesson-2-naive-bayes/">naive bayes</a> I recapped the second lesson from Udacity&#8217;s <a href="https://creatronix.de/ud120-intro-to-machine-learning/">UD120 Intro to Machine Learning</a>.</p>
<h2>Precision and Recall and ROC</h2>
<p>In my article <a href="https://creatronix.de/classification-precision-and-recall/">classification: precision and recall</a> I wrote about different useful measures to evaluate the quality of a supervised learning algorithm.</p>
<p>In <a href="https://creatronix.de/receiver-operating-characteristic/">Receiver Operating Characteristic</a> I wrote about another useful measure, the ROC curve.</p>
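<p>Precision and recall boil down to two simple ratios. Here is a quick sketch with a hypothetical confusion-matrix count (the numbers are made up for illustration, not taken from the articles above):</p>

```python
# Hypothetical counts from a binary classifier's confusion matrix
tp, fp, fn = 80, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # of all positive predictions, how many were right?
recall = tp / (tp + fn)     # of all actual positives, how many did we find?

print(round(precision, 3))  # 0.889
print(round(recall, 3))     # 0.8
```

<p>A classifier can score high on one measure and poorly on the other, which is why it pays to look at both.</p>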
<h2>Visualization with matplotlib</h2>
<p>Matplotlib is a really good starting point for visualization. I wrote about it in <a href="https://creatronix.de/introduction-to-matplotlib/">Introduction to matplotlib</a>, <a href="https://creatronix.de/introduction-to-matplotlib-part-2/">Matplotlib &#8211; Part 2</a>, <a href="https://creatronix.de/scatterplot-with-matplotlib/">Scatterplot with matplotlib</a></p>
<h2>Math with numpy</h2>
<p>I wrote some articles about the usage of numpy, but I only scratched the surface of this mighty library:</p>
<ul>
<li><a href="https://creatronix.de/linear-algebra-with-numpy-part-1/">Linear Algebra with numpy &#8211; Part 1</a></li>
<li><a href="https://creatronix.de/numpy-random-choice/">numpy random choice</a></li>
<li><a href="https://creatronix.de/numpy-linspace-function/">Numpy linspace function</a></li>
</ul>
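<p>The two helpers covered in the linked posts fit in a few lines (a small sketch of <code>np.linspace</code> and <code>np.random.choice</code>):</p>

```python
import numpy as np

# np.linspace: evenly spaced values over an interval (endpoint included)
xs = np.linspace(0.0, 1.0, num=5)
print(xs)  # [0.   0.25 0.5  0.75 1.  ]

# np.random.choice: sample from an array; replace=False gives distinct picks
np.random.seed(42)
sample = np.random.choice(xs, size=3, replace=False)
```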
<h2>Image manipulation with OpenCV</h2>
<p><a href="https://creatronix.de/intro-to-opencv-with-python/">Intro to OpenCV with Python</a></p>
<h2>JuPyter Notebooks</h2>
<p>Sometimes I love them, sometimes I hate them. I wrote an <a href="https://creatronix.de/introduction-to-jupyter-notebook/">Introduction to JuPyter Notebook</a>.</p>
<h2>Podcasts</h2>
<p>In 2018 I&#8217;ve listened to a bunch of great podcasts on iTunes:</p>
<ul>
<li><a href="https://lineardigressions.com/">Linear digressions</a></li>
<li><a href="https://lexfridman.com/ai/">MIT Lex Fridman</a></li>
<li><a href="https://itunes.apple.com/de/podcast/self-driving-cars-dr-lance-eliot-podcast-series/id1330558096?mt=2">Dr. Lance Eliot</a></li>
</ul>
<p>&nbsp;</p>
<p>The post <a href="https://creatronix.de/10-things-i-didnt-know-about-data-science-a-year-ago/">10 things I didn&#8217;t know about Data Science a year ago</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bayes’ Theorem</title>
		<link>https://creatronix.de/bayes-theorem/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Sun, 03 Dec 2017 16:26:26 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[base rate fallacy]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://creatronix.de/?p=1171</guid>

					<description><![CDATA[<p>Imagine that you come home from a party and you are stopped by the police. They ask you to take a drug test and you accept. The test result is positive. You are guilty. But wait a minute! Is it really that simple? In Germany about 2.8 million people consume weed on a regular basis,&#8230;</p>
<p>The post <a href="https://creatronix.de/bayes-theorem/">Bayes’ Theorem</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Imagine that you come home from a party and you are stopped by the police. They ask you to take a drug test and you accept. The test result is positive. You are guilty.</p>
<p>But wait a minute! Is it really that simple?</p>
<p>In Germany about 2.8 million people consume weed on a regular basis, that&#8217;s about 3.5% of the population.</p>
<p>Let&#8217;s say D means a person is a drug user who consumes weed regularly, and ¬D means a person who consumes no weed. So the chance that a randomly picked person is a regular consumer is P(D) = 0.035.</p>
<p>Because you either take drugs or you don&#8217;t, the remaining part must be non-consumers: P(¬D) = 1 &#8211; P(D) = 0.965.</p>
<p>The accuracy of the drug test is about 92%. So let&#8217;s assume that there are 8% false positives and 8% false negatives as well.</p>
<p>The chance that the test result is positive if a person actually takes drugs is P(+|D) = 0.92, but you also get a positive result in 8% of the cases where a person doesn&#8217;t take drugs: P(+|¬D) = 0.08. These are called &#8220;false positives&#8221;.</p>
<p>When a person doesn&#8217;t take drugs, the test will be negative in 92% of all cases: P(-|¬D) = 0.92. And of course a test can also be negative even if a person takes drugs: P(-|D) = 0.08. These are called &#8220;false negatives&#8221;. Got it?</p>
<p>What comes next?</p>
<h2>Combined Probabilities</h2>
<p>Knowing the success and error rates of the test and the relative distribution of drug consumers we can calculate the combined probabilities:</p>
<ul>
<li>P(+, D) = 0.035 * 0.92 = 0.0322
<ul>
<li>Think: The test is <strong>positive AND</strong> the person is a drug user</li>
</ul>
</li>
<li>P(+, ¬D) = 0.965 * 0.08 = 0.0772
<ul>
<li>Think: The test is <strong>positive AND</strong> the person is <strong>NOT</strong> a drug user</li>
</ul>
</li>
<li>P(-, D) = 0.035 * 0.08 = 0.0028
<ul>
<li>Think: The test is <strong>negative AND</strong> the person is a drug user</li>
</ul>
</li>
<li>P(-, ¬D) = 0.965 * 0.92 = 0.8878
<ul>
<li>Think: The test is <strong>negative AND</strong> the person is <strong>NOT</strong> a drug user</li>
</ul>
</li>
</ul>
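<p>As a sanity check, the four joint probabilities can be computed in a few lines of Python (a plain sketch of the arithmetic above, using the numbers from this post):</p>

```python
p_d = 0.035                # P(D): regular consumer
p_not_d = 1 - p_d          # P(¬D): non-consumer
p_pos_given_d = 0.92       # P(+|D): true positive rate
p_pos_given_not_d = 0.08   # P(+|¬D): false positive rate

p_pos_and_d = p_d * p_pos_given_d                     # P(+, D)
p_pos_and_not_d = p_not_d * p_pos_given_not_d         # P(+, ¬D)
p_neg_and_d = p_d * (1 - p_pos_given_d)               # P(-, D)
p_neg_and_not_d = p_not_d * (1 - p_pos_given_not_d)   # P(-, ¬D)
```

<p>Note that the four joint probabilities cover every possible outcome, so they must sum to 1.</p>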
<h2>Bayes Theorem</h2>
<p>P(A|B) = P(B|A) * P(A) / P(B)</p>
<p>In our case we are interested in the probability of a person being a drug addict given the test is positive. That means:</p>
<p>P(D | +) = P(+ | D) * P(D) / P(+) = P(+ | D) * P(D) / ( P(+, D) + P(+, ¬D) )</p>
<p>= 0.92 * 0.035 / (0.0322 + 0.0772) = <strong>0.294</strong></p>
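<p>The whole calculation fits in a few lines of Python (a plain sketch of the arithmetic above, no libraries needed):</p>

```python
p_d, p_not_d = 0.035, 0.965          # P(D), P(¬D)
p_pos_given_d = 0.92                 # P(+|D)
p_pos_given_not_d = 0.08             # P(+|¬D)

# total probability of a positive test: P(+) = P(+, D) + P(+, ¬D)
p_pos = p_d * p_pos_given_d + p_not_d * p_pos_given_not_d

# Bayes theorem: P(D|+) = P(+|D) * P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos

print(round(p_d_given_pos, 3))  # 0.294
```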
<p>The outcome is quite interesting and mildly shocking: the probability that a person who tests positive actually is a drug user is only around 29%, or less than one third!</p>
<p>Why is this so counterintuitive when the test states an accuracy of 92%? That is the so-called <strong>base rate fallacy</strong>. We have to take into account that only 3.5% of the population actually take drugs, so the 8% false positives from the much larger non-consuming group outnumber the true positives from the small consuming group.</p>
<p>Further reading about drug tests:</p>
<p><a href="https://www.webmd.com/drug-medication/news/20100528/drug-tests-often-trigger-false-positives#1">Drug tests generally produce false-positive results in 5% to 10% of cases and false negatives in 10% to 15% of cases, new research shows.</a></p>
<p>The post <a href="https://creatronix.de/bayes-theorem/">Bayes’ Theorem</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
