<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>data science Archives - Creatronix</title>
	<atom:link href="https://creatronix.de/tag/data-science/feed/" rel="self" type="application/rss+xml" />
	<link>https://creatronix.de/tag/data-science/</link>
	<description>My adventures in code &#38; business</description>
	<lastBuildDate>Fri, 27 Feb 2026 11:15:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>10 things I didn&#8217;t know about Data Science a year ago</title>
		<link>https://creatronix.de/10-things-i-didnt-know-about-data-science-a-year-ago/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Mon, 12 Nov 2018 08:42:26 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[ai]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[matplotlib]]></category>
		<category><![CDATA[naive bayes]]></category>
		<category><![CDATA[numpy]]></category>
		<category><![CDATA[opencv]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=2269</guid>

					<description><![CDATA[<p>In my article My personal road map for learning data science in 2018 I wrote about how I try to tackle the data science knowledge sphere. As 2018 is slowly coming to an end, I think it is time for a little wrap-up. What are the things I learned about&#8230;</p>
<p>The post <a href="https://creatronix.de/10-things-i-didnt-know-about-data-science-a-year-ago/">10 things I didn&#8217;t know about Data Science a year ago</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In my article <a href="https://creatronix.de/my-personal-road-map-for-learning-data-science/">My personal road map for learning data science in 2018</a> I wrote about how I try to tackle the data science knowledge sphere. As 2018 is slowly coming to an end, I think it is time for a little wrap-up.</p>
<p>What are the things I learned about Data Science in 2018? Here we go:</p>
<h2>The difference between Data Science, Machine Learning, Deep Learning and AI</h2>
<p><img fetchpriority="high" decoding="async" class="alignnone size-full wp-image-2276" src="https://creatronix.de/wp-content/uploads/2018/10/data_science_vs_ml.png" alt="" width="514" height="392" srcset="https://creatronix.de/wp-content/uploads/2018/10/data_science_vs_ml.png 514w, https://creatronix.de/wp-content/uploads/2018/10/data_science_vs_ml-300x229.png 300w" sizes="(max-width: 514px) 100vw, 514px" /></p>
<p>A picture says more than a thousand words.</p>
<h2>The difference between supervised and unsupervised learning</h2>
<p><em>Supervised Learning</em></p>
<p>You have training and test data with <strong>labels</strong>. A label tells you, for example, to which class a certain data item belongs. Imagine you have images of pets and the labels are the names of the pets.</p>
<p><em>Unsupervised Learning</em></p>
<p>Your data doesn&#8217;t have labels. Your algorithm, e.g. k-means clustering, needs to figure out a structure given only the data.</p>
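<p>A minimal sketch of the difference, using scikit-learn and made-up toy data: the classifier gets features <strong>and</strong> labels, while k-means gets only the features:</p>
<pre>import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X = np.array([[1.0], [1.2], [8.0], [8.3]])

# supervised: features AND labels
y = ["cat", "cat", "dog", "dog"]
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[1.1]]))  # -> ['cat']

# unsupervised: only the features, the structure is inferred
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)</pre>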
<h2>The areas of applied machine learning</h2>
<p>These are described in <a href="https://creatronix.de/the-essence-of-machine-learning/">The Essence of Machine Learning</a> and <a href="https://creatronix.de/data-science-overview/">Data Science Overview</a>.</p>
<h2>Bayes Theorem</h2>
<p>In my article <a href="https://creatronix.de/bayes-theorem/">Bayes theorem</a> I elaborated on the <strong>base rate fallacy</strong>, and in <a href="https://creatronix.de/lesson-2-naive-bayes/">naive bayes</a> I recapped the second lesson from udacity&#8217;s <a href="https://creatronix.de/ud120-intro-to-machine-learning/">UD120 Intro to Machine Learning</a>.</p>
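<p>To illustrate the base rate fallacy numerically (the numbers here are made up): even with a quite accurate test, a rare condition with 1% prevalence leads to a surprisingly low probability of actually being sick given a positive result:</p>
<pre># made-up numbers: 1% prevalence, 99% sensitivity, 5% false positive rate
p_disease = 0.01
p_pos_given_disease = 0.99      # sensitivity
p_pos_given_healthy = 0.05      # false positive rate

# total probability of a positive test
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes theorem: P(disease | positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # -> 0.167</pre>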
<h2>Precision and Recall and ROC</h2>
<p>In my article <a href="https://creatronix.de/classification-precision-and-recall/">classification: precision and recall</a> I wrote about different useful measures to evaluate the quality of a supervised learning algorithm.</p>
<p>In <a href="https://creatronix.de/receiver-operating-characteristic/">Receiver Operating Characteristic</a> I wrote about another useful measure, the ROC curve.</p>
<h2>Visualization with matplotlib</h2>
<p>Matplotlib is a really good starting point for visualization. I wrote about it in <a href="https://creatronix.de/introduction-to-matplotlib/">Introduction to matplotlib</a>, <a href="https://creatronix.de/introduction-to-matplotlib-part-2/">Matplotlib &#8211; Part 2</a>, <a href="https://creatronix.de/scatterplot-with-matplotlib/">Scatterplot with matplotlib</a></p>
<h2>Math with numpy</h2>
<p>I wrote some articles about the usage of numpy but only scratched the surface of this mighty library:</p>
<ul>
<li><a href="https://creatronix.de/linear-algebra-with-numpy-part-1/">Linear Algebra with numpy &#8211; Part 1</a></li>
<li><a href="https://creatronix.de/numpy-random-choice/">numpy random choice</a></li>
<li><a href="https://creatronix.de/numpy-linspace-function/">Numpy linspace function</a></li>
</ul>
<h2>Image manipulation with OpenCV</h2>
<p><a href="https://creatronix.de/intro-to-opencv-with-python/">Intro to OpenCV with Python</a></p>
<h2>JuPyter Notebooks</h2>
<p>Sometimes I love them, sometimes I hate them. I wrote an <a href="https://creatronix.de/introduction-to-jupyter-notebook/">Introduction to JuPyter Notebook</a>.</p>
<h2>Podcasts</h2>
<p>In 2018 I&#8217;ve listened to a bunch of great podcasts on iTunes:</p>
<ul>
<li><a href="https://lineardigressions.com/">Linear digressions</a></li>
<li><a href="https://lexfridman.com/ai/">MIT Lex Fridman</a></li>
<li><a href="https://itunes.apple.com/de/podcast/self-driving-cars-dr-lance-eliot-podcast-series/id1330558096?mt=2">Dr. Lance Eliot</a></li>
</ul>
<p>&nbsp;</p>
<p>The post <a href="https://creatronix.de/10-things-i-didnt-know-about-data-science-a-year-ago/">10 things I didn&#8217;t know about Data Science a year ago</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Classification: Precision and Recall</title>
		<link>https://creatronix.de/classification-precision-and-recall/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Thu, 28 Jun 2018 15:10:00 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[accuracy]]></category>
		<category><![CDATA[cat]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[dog]]></category>
		<category><![CDATA[f1 score]]></category>
		<category><![CDATA[false positive]]></category>
		<category><![CDATA[precision]]></category>
		<category><![CDATA[recall]]></category>
		<category><![CDATA[true positive]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1649</guid>

					<description><![CDATA[<p>In the realms of Data Science you&#8217;ll encounter sooner or later the terms &#8220;Precision&#8221; and &#8220;Recall&#8221;. But what do they mean? Clarification Living together with little kids you very often run into classification issues: My daughter really likes dogs, so seeing a dog is something positive. When she sees a normal dog e.g. a&#8230;</p>
<p>The post <a href="https://creatronix.de/classification-precision-and-recall/">Classification: Precision and Recall</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>In the realms of Data Science you&#8217;ll encounter sooner or later the terms &#8220;Precision&#8221; and &#8220;Recall&#8221;. But what do they mean?</p>
<p><img decoding="async" id="im" src="https://i.imgflip.com/3opnpn.jpg" alt="Two Buttons Meme | Recall; Precision | image tagged in memes,two buttons | made w/ Imgflip meme maker" /></p>
<h2>Clarification</h2>
<p>Living together with little kids, you very often run into classification issues:</p>
<p>My daughter really likes dogs, so seeing a dog is something positive. When she sees a normal dog, e.g. a Labrador, and proclaims: &#8220;Look, there is a dog!&#8221;, that&#8217;s a <strong>True Positive (TP)</strong>.</p>
<p>If she sees a fat cat and proclaims: &#8220;Look at the dog!&#8221;, we call it a <strong>False Positive (FP)</strong>, because her assumption of a positive outcome (a dog!) was false. A false positive is also called a Type 1 error.</p>
<p>If I point at a small dog, e.g. a Chihuahua, and say: &#8220;Look at the dog!&#8221; and she cries: &#8220;This is not a dog!&#8221; although it is one, we call that a <strong>False Negative (FN)</strong>. A false negative is also called a Type 2 error.</p>
<p>And last but not least, if I show her a bird and we agree on the bird not being a dog, we have a <strong>True Negative (TN)</strong>.</p>
<p>This neat little matrix shows all of them in context:<br />
<img decoding="async" class="alignnone size-full wp-image-1669" src="https://creatronix.de/wp-content/uploads/2018/06/precision_and_recall.png" alt="" width="479" height="480" srcset="https://creatronix.de/wp-content/uploads/2018/06/precision_and_recall.png 479w, https://creatronix.de/wp-content/uploads/2018/06/precision_and_recall-150x150.png 150w, https://creatronix.de/wp-content/uploads/2018/06/precision_and_recall-300x300.png 300w, https://creatronix.de/wp-content/uploads/2018/06/precision_and_recall-100x100.png 100w" sizes="(max-width: 479px) 100vw, 479px" /></p>
<h2>Precision and Recall</h2>
<p>If I show my daughter twenty pictures of cats and dogs (8 cat pictures and 12 dog pictures) and she identifies 10 as dogs, but among those 10 there are actually 2 cats, her precision is 8 / (8 + 2) = 4/5 or 80%.</p>
<p><strong>Precision = <span style="color: #ff0000;">TP</span> / (<span style="color: #339966;">TP + FP</span>)</strong></p>
<p><img decoding="async" class="alignnone size-full wp-image-1672" src="https://creatronix.de/wp-content/uploads/2018/06/precision.png" alt="" width="479" height="480" srcset="https://creatronix.de/wp-content/uploads/2018/06/precision.png 479w, https://creatronix.de/wp-content/uploads/2018/06/precision-150x150.png 150w, https://creatronix.de/wp-content/uploads/2018/06/precision-300x300.png 300w, https://creatronix.de/wp-content/uploads/2018/06/precision-100x100.png 100w" sizes="(max-width: 479px) 100vw, 479px" /></p>
<p>Knowing that there are actually 12 dog pictures and she missed 4 of them (false negatives), her recall is 8 / (8 + 4) = 2/3 or roughly 67%.</p>
<p><strong>Recall = <span style="color: #ff0000;">TP</span> / (<span style="color: #339966;">TP + FN</span>)</strong></p>
<p><img decoding="async" class="alignnone size-full wp-image-1673" src="https://creatronix.de/wp-content/uploads/2018/06/recall.png" alt="" width="479" height="480" srcset="https://creatronix.de/wp-content/uploads/2018/06/recall.png 479w, https://creatronix.de/wp-content/uploads/2018/06/recall-150x150.png 150w, https://creatronix.de/wp-content/uploads/2018/06/recall-300x300.png 300w, https://creatronix.de/wp-content/uploads/2018/06/recall-100x100.png 100w" sizes="(max-width: 479px) 100vw, 479px" /></p>
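<p>The picture example can be checked with a few lines of Python (counts taken from the example above):</p>
<pre># 12 dog pictures; she labels 10 of them as "dog": 8 real dogs + 2 cats
tp = 8   # dogs correctly identified as dogs
fp = 2   # cats wrongly identified as dogs
fn = 4   # dogs she missed

precision = tp / (tp + fp)   # -> 0.8
recall = tp / (tp + fn)      # -> 0.666...
print(precision, recall)</pre>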
<p>Which measure is more important?</p>
<p>It depends:</p>
<p>If you&#8217;re a dog lover, a high precision is better; if you are afraid of dogs and want to avoid them, a higher recall is better 🙂</p>
<h3>Different terms</h3>
<p>Precision is also called <strong>Positive Predictive Value (PPV)</strong></p>
<p>Recall is also often called:</p>
<ul>
<li>True positive rate</li>
<li>Sensitivity</li>
<li>Probability of detection</li>
</ul>
<h2>Other interesting measures</h2>
<h3>Accuracy</h3>
<p><strong>ACC = (<span style="color: #ff0000;">TP + TN</span>) / (<span style="color: #339966;">TP + FP + TN + FN</span>)</strong></p>
<p><img decoding="async" class="alignnone size-full wp-image-1674" src="https://creatronix.de/wp-content/uploads/2018/06/accuracy.png" alt="" width="479" height="480" srcset="https://creatronix.de/wp-content/uploads/2018/06/accuracy.png 479w, https://creatronix.de/wp-content/uploads/2018/06/accuracy-150x150.png 150w, https://creatronix.de/wp-content/uploads/2018/06/accuracy-300x300.png 300w, https://creatronix.de/wp-content/uploads/2018/06/accuracy-100x100.png 100w" sizes="(max-width: 479px) 100vw, 479px" /></p>
<h3>F1-Score</h3>
<p>You can combine Precision and Recall into a measure called the F1-Score. It is the harmonic mean of precision and recall:</p>
<p><strong>F1 = 2 / (1/Precision + 1/Recall)</strong></p>
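<p>Plugged into the formula with the numbers from the picture example (precision 0.8, recall 2/3):</p>
<pre>precision = 0.8
recall = 8 / 12   # ~0.667

# harmonic mean of precision and recall
f1 = 2 / (1 / precision + 1 / recall)
print(round(f1, 3))  # -> 0.727</pre>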
<h3>Scikit-Learn</h3>
<p>scikit-learn being a one-stop-shop for data scientists does of course offer functions for calculating precision and recall:</p>
<pre>from sklearn.metrics import precision_score

y_true = ["dog", "dog", "not-a-dog", "not-a-dog", "dog", "dog"]
y_pred = ["dog", "not-a-dog", "dog", "not-a-dog", "dog", "not-a-dog"]

print(precision_score(y_true, y_pred, pos_label="dog"))</pre>
<p>Let&#8217;s assume we trained a binary classifier which can tell us &#8220;dog&#8221; or &#8220;not-a-dog&#8221;.</p>
<p>In this example the precision is 0.666 or ~67%, because in two thirds of the cases the algorithm was right when it predicted a dog.</p>
<pre>from sklearn.metrics import recall_score

print(recall_score(y_true, y_pred, pos_label="dog"))</pre>
<p>The recall was just 0.5 or 50% because out of 4 dogs it identified just 2 correctly as dogs.</p>
<pre>from sklearn.metrics import accuracy_score

print(accuracy_score(y_true, y_pred))</pre>
<p>The accuracy was also just 50% because out of 6 items it made only 3 correct predictions.</p>
<pre>from sklearn.metrics import f1_score

print(f1_score(y_true, y_pred, pos_label="dog"))</pre>
<p>The F1 score is ~0.57 &#8211; between the recall of 0.5 and the precision of 0.666.</p>
<p>What other scores do you encounter? &#8211; stay tuned for the next episode 🙂</p>
<p>&nbsp;</p>
<p>The post <a href="https://creatronix.de/classification-precision-and-recall/">Classification: Precision and Recall</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Lesson 2: Naive Bayes</title>
		<link>https://creatronix.de/lesson-2-naive-bayes/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Tue, 19 Jun 2018 06:29:16 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[naive bayes]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1388</guid>

					<description><![CDATA[<p>Lesson 2 of the Udacity Course UD120 &#8211; Intro to Machine Learning deals with Naive Bayes classification. Mini project For the mini project you should fork https://github.com/udacity/ud120-projects and clone it. It is recommended to install a 64-bit Python 2.7 version because ML means heavy data processing and can easily eat up more than 2GB of&#8230;</p>
<p>The post <a href="https://creatronix.de/lesson-2-naive-bayes/">Lesson 2: Naive Bayes</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Lesson 2 of the Udacity Course UD120 &#8211; Intro to Machine Learning deals with Naive Bayes classification.</p>
<h2>Mini project</h2>
<p>For the mini project you should fork <a href="https://github.com/udacity/ud120-projects">https://github.com/udacity/ud120-projects</a> and clone it. It is recommended to install a 64-bit Python 2.7 version because ML means heavy data processing and can easily eat up more than 2 GB of memory.</p>
<h3>Dependencies</h3>
<p>After cloning the repo I would recommend setting up a venv and installing the requirements:</p>
<ul>
<li>sklearn</li>
<li>numpy</li>
<li>scipy</li>
<li>matplotlib</li>
</ul>
<h3>The Code</h3>
<p>The code itself is pretty straightforward:</p>
<ul>
<li>Instantiate the classifier</li>
<li>Train (fit) the Classifier</li>
<li>Predict</li>
<li>Calculate accuracy</li>
</ul>
<pre>from time import time
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# training
print("Start training")
t0 = time()
clf = GaussianNB()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")

# prediction
print("start predicting")
t0 = time()
prediction = clf.predict(features_test)
print("predict time:", round(time() - t0, 3), "s")

# accuracy
print("Calculating accuracy")
accuracy = accuracy_score(labels_test, prediction)
print("Accuracy calculated, and the accuracy is", accuracy)</pre>
<p>The output on my machine:</p>
<pre>training time: 1.762 s
start predicting
predict time: 0.286 s
Calculating accuracy
Accuracy calculated, and the accuracy is 0.9732650739476678</pre>
<p>The simple Gaussian Naive Bayes is pretty accurate at 97.3%.</p>
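<p>The snippet above relies on the preprocessed email data from the course repo. As a self-contained alternative you can run the same workflow on scikit-learn&#8217;s bundled Iris dataset (note: the accuracy will differ from the email example):</p>
<pre>from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# load a toy dataset and split it into training and test data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# instantiate, fit, predict, evaluate
clf = GaussianNB()
clf.fit(X_train, y_train)
accuracy = accuracy_score(y_test, clf.predict(X_test))
print(accuracy)</pre>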
<p>The post <a href="https://creatronix.de/lesson-2-naive-bayes/">Lesson 2: Naive Bayes</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Linear Algebra with numpy</title>
		<link>https://creatronix.de/linear-algebra-with-numpy/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Fri, 04 May 2018 11:42:22 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[numpy]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1314</guid>

					<description><![CDATA[<p>Numpy is a package for scientific computing in Python. It is blazing fast due to its implementation in C. It is often used together with pandas, matplotlib and Jupyter notebooks. Often these packages are referred to as the datascience stack. Installation You can install numpy via pip pip install numpy Basic Usage In the datascience&#8230;</p>
<p>The post <a href="https://creatronix.de/linear-algebra-with-numpy/">Linear Algebra with numpy</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Numpy is a package for scientific computing in Python. It is blazing fast due to its implementation in C.</p>
<p>It is often used together with <a href="https://creatronix.de/introduction-to-pandas/">pandas</a>, <a href="https://creatronix.de/introduction-to-matplotlib/">matplotlib</a> and <a href="https://creatronix.de/introduction-to-jupyter-notebook/">Jupyter</a> notebooks. Together these packages are often referred to as the data science stack.</p>
<h2>Installation</h2>
<p>You can install numpy via pip</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-bash" data-lang="Bash"><code>pip install numpy</code></pre>
</div>
<h2>Basic Usage</h2>
<p>In the data science world numpy is often imported like this:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-bash" data-lang="Bash"><code>import numpy as np</code></pre>
</div>
<p>The &#8220;as&#8221; keyword defines a so-called alias. Now you can use structures from numpy by referencing them with &#8220;np&#8221; instead of the whole name.</p>
<p>Think &#8220;abbreviation&#8221;.</p>
<h3>n-dimensional array</h3>
<p>The most important data structure is ndarray, which is short for n-dimensional array.</p>
<p>You can convert a list to a numpy array with the array function:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>my_list = [1, 2, 3, 4] 
my_array = np.array(my_list)</code></pre>
</div>
<p>You can also convert an array back to a list with the tolist method:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>my_new_list = my_array.tolist()</code></pre>
</div>
<p>You can retrieve the dimensionality of an array with the ndim property:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>my_array.ndim</code></pre>
</div>
<p>and get the number of elements along each dimension with the shape property:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>my_array.shape</code></pre>
</div>
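<p>For example, for a 2-dimensional array:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])
print(m.ndim)    # 2 dimensions
print(m.shape)   # (2, 3): 2 rows, 3 columns</code></pre>
</div>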
<h2>Vector arithmetic</h2>
<h3>Addition / Subtraction</h3>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>a = np.array([1, 2, 3, 4]) 
b = np.array([4, 3, 2, 1]) 
a + b 
array([5, 5, 5, 5]) 

a - b 
array([-3, -1, 1, 3])</code></pre>
</div>
<h3>Scalar Multiplication</h3>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>a = np.array([1, 2, 3, 4]) 
a * 3 

array([3, 6, 9, 12])</code></pre>
</div>
<p>To see why it is charming to use numpy&#8217;s arrays for this operation, you have to consider the alternative:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>c = [1,2,3,4] 
d = [x * 3 for x in c]</code></pre>
</div>
<h3>Dot Product</h3>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>a = np.array([1,2,3,4]) 
b = np.array([4,3,2,1]) 
a.dot(b) 

20 # 1*4 + 2*3 + 3*2 + 4*1</code></pre>
</div>
<p>Learn more about numpy:</p>
<p><a href="https://creatronix.de/numpy-random-choice/">numpy random choice</a></p>
<p><a href="https://creatronix.de/numpy-linspace-function/">Numpy linspace function</a></p>
<p><a href="https://github.com/jboegeholz/introduction_to_numpy/blob/master/01_numpy_arrays.ipynb">Project on github</a></p>
<p>The post <a href="https://creatronix.de/linear-algebra-with-numpy/">Linear Algebra with numpy</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Introduction to Jupyter Notebook</title>
		<link>https://creatronix.de/introduction-to-jupyter-notebook/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Wed, 25 Apr 2018 09:17:48 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[Tools]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[jupyter]]></category>
		<category><![CDATA[notebook]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1214</guid>

					<description><![CDATA[<p>JuPyteR Do You know the feeling of being already late to a party when encountering something new? But when you actually start telling others about it, you realize that it is not too common knowledge at all, e.g. Jupyter Notebooks. What is a Jupyter notebook? In my own words: a browser-based document-oriented command line style&#8230;</p>
<p>The post <a href="https://creatronix.de/introduction-to-jupyter-notebook/">Introduction to Jupyter Notebook</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>JuPyteR</h2>
<h2><img decoding="async" class="wp-image-2253 size-full" src="https://creatronix.de/wp-content/uploads/2018/10/img_5986.jpg" width="500" height="627" srcset="https://creatronix.de/wp-content/uploads/2018/10/img_5986.jpg 500w, https://creatronix.de/wp-content/uploads/2018/10/img_5986-239x300.jpg 239w" sizes="(max-width: 500px) 100vw, 500px" /></h2>
<p>Do you know the feeling of being late to a party when encountering something new?</p>
<p>But when you actually start telling others about it, you realize that it is not common knowledge at all &#8211; e.g. <a href="http://jupyter.org/">Jupyter Notebooks</a>.</p>
<p>What is a Jupyter notebook?</p>
<p>In my own words: a browser-based, document-oriented, command-line-style exploration tool for <strong>Ju</strong>lia, <strong>Py</strong>thon and <strong>R</strong> &#8211; hence the name JuPyteR.</p>
<p>Ok, let&#8217;s break it down:</p>
<h3>Browser-based</h3>
<p>JuPyter follows a client-server concept: you edit your code in a web form in the browser and send the input of a cell to the server backend for execution; the server sends back a response which is rendered in your browser.</p>
<h3><img decoding="async" src="https://creatronix.de/wp-content/uploads/2018/04/iris_jupyter-1024x402.png" alt="" class="alignnone size-large wp-image-5952" width="1024" height="402" srcset="https://creatronix.de/wp-content/uploads/2018/04/iris_jupyter-1024x402.png 1024w, https://creatronix.de/wp-content/uploads/2018/04/iris_jupyter-300x118.png 300w, https://creatronix.de/wp-content/uploads/2018/04/iris_jupyter-768x302.png 768w, https://creatronix.de/wp-content/uploads/2018/04/iris_jupyter-1536x603.png 1536w, https://creatronix.de/wp-content/uploads/2018/04/iris_jupyter.png 1858w" sizes="(max-width: 1024px) 100vw, 1024px" /></h3>
<h3>Document-oriented</h3>
<p>One great aspect of JuPyter is that you can enrich your code with headlines and markdown, so that you have a single document containing code, the results of the code execution and documentation.</p>
<p><a href="http://jupyter.org/"><img decoding="async" src="https://creatronix.de/wp-content/uploads/2018/04/markdown_jupyter-1024x780.png" alt="" class="alignnone size-large wp-image-5965" width="1024" height="780" srcset="https://creatronix.de/wp-content/uploads/2018/04/markdown_jupyter-1024x780.png 1024w, https://creatronix.de/wp-content/uploads/2018/04/markdown_jupyter-300x229.png 300w, https://creatronix.de/wp-content/uploads/2018/04/markdown_jupyter-768x585.png 768w, https://creatronix.de/wp-content/uploads/2018/04/markdown_jupyter.png 1482w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></p>
<h2>Installation and Run</h2>
<p>If you already have a Python installation you can use either pip or pipenv to install JuPyter:</p>
<h3>Pip</h3>
<pre>pip install jupyter</pre>
<h3>Pipenv</h3>
<pre>pipenv install jupyter</pre>
<p>After installation you can start it on the console with:</p>
<pre>jupyter notebook</pre>
<p>An alternative way is to use the <a href="https://www.anaconda.com/download/">anaconda distribution</a>.</p>
<h2>Disadvantages</h2>
<p>One big drawback (when your background is software development) is that you don&#8217;t have code completion.</p>
<p>Another disadvantage: modularization of your code is not easy.</p>
<p>Versioning is an issue as well: because a Jupyter notebook&#8217;s JSON file contains code as well as generated artifacts like plots, every re-run of a notebook changes the file, and the diff is not easily comprehensible.</p>
<h2>PyCharm Integration</h2>
<p>For the code completion issue JetBrains comes to the rescue: the PyCharm IDE has an integrated JuPyter editor which supports code completion.</p>
<h2><img decoding="async" src="https://creatronix.de/wp-content/uploads/2018/04/pycharm_jupyter-1024x351.png" alt="" class="alignnone size-large wp-image-5967" width="1024" height="351" srcset="https://creatronix.de/wp-content/uploads/2018/04/pycharm_jupyter-1024x351.png 1024w, https://creatronix.de/wp-content/uploads/2018/04/pycharm_jupyter-300x103.png 300w, https://creatronix.de/wp-content/uploads/2018/04/pycharm_jupyter-768x263.png 768w, https://creatronix.de/wp-content/uploads/2018/04/pycharm_jupyter-1536x526.png 1536w, https://creatronix.de/wp-content/uploads/2018/04/pycharm_jupyter-2048x701.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></h2>
<h2>Useful Keyboard Shortcuts</h2>
<ul>
<li>Cmd + Shift + P (macOS) / Ctrl + Shift + P (Linux and Windows): open the command palette</li>
<li>Ctrl + Enter: run cell</li>
<li>Alt + Enter: run cell and insert new cell below</li>
</ul>
<p><a href="https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/">https://www.dataquest.io/blog/jupyter-notebook-tips-tricks-shortcuts/</a></p>
<p>The post <a href="https://creatronix.de/introduction-to-jupyter-notebook/">Introduction to Jupyter Notebook</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Data Science Datasets: Iris flower data set</title>
		<link>https://creatronix.de/data-science-datasets-iris-flower-data-set/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Wed, 25 Apr 2018 08:55:12 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[iris flower data set]]></category>
		<category><![CDATA[scikit-learn]]></category>
		<category><![CDATA[sklearn]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1373</guid>

					<description><![CDATA[<p>Motivation When you are going to learn some data science the acquisition of data is often the first step. To get you started scikit-learn comes with a bunch of so-called &#8220;toy datasets&#8221;. One of them is the Iris dataset. Prerequisites &#38; Imports Besides scikit-learn we will use pandas for data handling and matplotlib with&#8230;</p>
<p>The post <a href="https://creatronix.de/data-science-datasets-iris-flower-data-set/">Data Science Datasets: Iris flower data set</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Motivation</h2>
<p>When you are going to learn some data science, the acquisition of data is often the first step.</p>
<p>To get you started scikit-learn comes with a bunch of so-called &#8220;toy datasets&#8221;. One of them is the Iris dataset.</p>
<h2>Prerequisites &amp; Imports</h2>
<p>Besides scikit-learn we will use <a href="https://creatronix.de/introduction-to-pandas/">pandas</a> for data handling and <a href="https://creatronix.de/introduction-to-matplotlib/">matplotlib</a> with seaborn for visualization. So let&#8217;s install them:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-bash" data-lang="Bash"><code>pip install scikit-learn pandas seaborn matplotlib</code></pre>
</div>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>from sklearn import datasets
import seaborn as sns
import pandas as pd
sns.set_palette('husl')
import matplotlib.pyplot as plt
%matplotlib inline</code></pre>
</div>
<h2>Iris data set</h2>
<p>The Iris flower data set or Fisher&#8217;s Iris data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines.</p>
<p>It is sometimes called Anderson&#8217;s <i>Iris</i> data set because Edgar Anderson collected the data to quantify the morphological variation of <i>Iris</i> flowers of three related species.</p>
<p>This data set can be imported from scikit-learn like the following:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>iris = datasets.load_iris() 
</code></pre>
</div>
<div>
<h2>Convert to Pandas Dataframe</h2>
</div>
<p>To work with the dataset we convert it into a pandas dataframe.</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>df = pd.DataFrame(
    iris['data'],
    columns=iris['feature_names']
)
df['species'] = iris['target']
df['species'] = df['species'].map({
    0 : 'Iris-setosa',
    1 : 'Iris-versicolor',
    2 : 'Iris-virginica'
})</code></pre>
</div>
<div>
<h2>Data visualization</h2>
<p>Seaborn has a nice way to visualize data for exploration with the pairplot function.</p>
<p>It takes every feature and compares it pairwise with every other feature:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code>g = sns.pairplot(df, hue='species', markers='+')
plt.show()</code></pre>
</div>
</div>
<h2><img decoding="async" src="https://creatronix.de/wp-content/uploads/2018/04/iris_sns_pairplot-1024x888.png" alt="" class="alignnone size-large wp-image-5972" width="1024" height="888" srcset="https://creatronix.de/wp-content/uploads/2018/04/iris_sns_pairplot-1024x888.png 1024w, https://creatronix.de/wp-content/uploads/2018/04/iris_sns_pairplot-300x260.png 300w, https://creatronix.de/wp-content/uploads/2018/04/iris_sns_pairplot-768x666.png 768w, https://creatronix.de/wp-content/uploads/2018/04/iris_sns_pairplot.png 1137w" sizes="(max-width: 1024px) 100vw, 1024px" /></h2>
<h2>Further Reading</h2>
<p><a href="https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset">https://scikit-learn.org/stable/datasets/toy_dataset.html#iris-plants-dataset</a></p>
<p><a href="https://www.kaggle.com/code/jchen2186/machine-learning-with-iris-dataset">https://www.kaggle.com/code/jchen2186/machine-learning-with-iris-dataset</a></p>
<p><a href="https://creatronix.de/introduction-to-jupyter-notebook/">Introduction to Jupyter Notebook</a></p>
<p><a href="https://creatronix.de/introduction-to-pandas/">Introduction to Pandas</a></p>
<p><a href="https://creatronix.de/pandas-cheat-sheet/">Pandas Cheat Sheet</a></p>
<p><a href="https://creatronix.de/introduction-to-matplotlib/">Introduction to matplotlib</a></p>
<p>The post <a href="https://creatronix.de/data-science-datasets-iris-flower-data-set/">Data Science Datasets: Iris flower data set</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Data Science Overview</title>
		<link>https://creatronix.de/data-science-overview/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Wed, 07 Mar 2018 09:40:28 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[Supervised Learning]]></category>
		<category><![CDATA[Unsupervised Learning]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1150</guid>

					<description><![CDATA[<p>Questions Data Science tries to answer one of the following questions: Classification -&#62; &#8220;Is it A or B?&#8221; Clustering -&#62; &#8220;Are there groups which belong together?&#8221; Regression -&#62; &#8220;How will it develop in the future?&#8221; Association -&#62; &#8220;What is happening very often together?&#8221; There are two ways to tackle these problem domains with machine learning:&#8230;</p>
<p>The post <a href="https://creatronix.de/data-science-overview/">Data Science Overview</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<h2>Questions</h2>
<p>Data Science tries to answer one of the following questions:</p>
<ul>
<li>Classification -&gt; &#8220;Is it A or B?&#8221;</li>
<li>Clustering -&gt; &#8220;Are there groups which belong together?&#8221;</li>
<li>Regression -&gt; &#8220;How will it develop in the future?&#8221;</li>
<li>Association -&gt; &#8220;What is happening very often together?&#8221;</li>
</ul>
<p>There are two ways to tackle these problem domains with machine learning:</p>
<ol>
<li>Supervised Learning</li>
<li>Unsupervised Learning</li>
</ol>
<h2>Supervised Learning</h2>
<p>You have training and test data with <strong>labels</strong>. A label tells you, for example, to which class a certain data item belongs. Imagine you have images of pets and the labels are the names of the pets.</p>
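<p>A minimal sketch of the supervised setting on scikit-learn&#8217;s Iris data; my choice of a k-nearest-neighbors classifier here is just one example among many:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code># Supervised learning: fit a classifier on labeled data, evaluate on held-out data
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)  # features and labels
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(accuracy)  # typically above 0.9 on this easy data set</code></pre>
</div>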
<h2>Unsupervised Learning</h2>
<p>Your data doesn&#8217;t have labels. An algorithm such as k-means clustering has to figure out a structure given only the data.</p>
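<p>A minimal sketch of the unsupervised setting, letting k-means search for three clusters in the Iris measurements without ever seeing the labels:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code># Unsupervised learning: k-means gets only the data, no labels
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X, _ = load_iris(return_X_y=True)  # the labels are deliberately ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.labels_[:10])            # cluster index assigned to the first samples
print(kmeans.cluster_centers_.shape)  # (3, 4): one center per cluster</code></pre>
</div>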
<p><iframe title="[S1E2] Back to The Future | 5 Minutes With Ingo" width="1200" height="675" src="https://www.youtube.com/embed/zDxh1dEt_Mo?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe></p>
<p>The post <a href="https://creatronix.de/data-science-overview/">Data Science Overview</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>My personal roadmap for learning data science in 2018</title>
		<link>https://creatronix.de/my-personal-road-map-for-learning-data-science/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Wed, 13 Dec 2017 14:05:14 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[Self-Improvement & Personal Finance]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[machine learning]]></category>
		<category><![CDATA[new year's resolution]]></category>
		<category><![CDATA[numpy]]></category>
		<category><![CDATA[Python]]></category>
		<category><![CDATA[road map]]></category>
		<guid isPermaLink="false">http://creatronix.de/?p=1177</guid>

					<description><![CDATA[<p>I got confused by all the buzzwords: data science, machine learning, deep learning, neural nets, artificial intelligence, big data, and so on and so on. As an engineer I like to put some structure to the chaos. Inspired by Roadmap: How to Learn Machine Learning in 6 Months and Tetiana Ivanova &#8211; How to become&#8230;</p>
<p>The post <a href="https://creatronix.de/my-personal-road-map-for-learning-data-science/">My personal roadmap for learning data science in 2018</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>I got confused by all the buzzwords: data science, machine learning, deep learning, neural nets, artificial intelligence, big data, and so on and so on.</p>
<p><img decoding="async" class="alignnone size-full wp-image-1253" src="https://creatronix.de/wp-content/uploads/2017/12/normal_distribution_3.png" alt="" width="487" height="469" srcset="https://creatronix.de/wp-content/uploads/2017/12/normal_distribution_3.png 487w, https://creatronix.de/wp-content/uploads/2017/12/normal_distribution_3-300x289.png 300w" sizes="(max-width: 487px) 100vw, 487px" /></p>
<p>As an engineer I like to put some structure to the chaos. Inspired by <a href="https://youtu.be/MOdlp1d0PNA"><span id="eow-title" class="watch-title" dir="ltr" title="Roadmap: How to Learn Machine Learning in 6 Months">Roadmap: How to Learn Machine Learning in 6 Months</span></a> and <a href="https://youtu.be/rIofV14c0tc"><span id="eow-title" class="watch-title" dir="ltr" title="Tetiana Ivanova - How to become a Data Scientist in 6 months a hacker’s approach to career planning">Tetiana Ivanova &#8211; How to become a Data Scientist in 6 months: a hacker&#8217;s approach to career planning</span></a> I built my own learning roadmap for this year:<br />
So 2018 will be all about data science. Having heard about the <a href="http://jarche.com/pkm/">Personal Knowledge Mastery</a> concept at SWEC17, I am going to tackle the learning process on several levels.</p>
<h2>Watch the Pros</h2>
<p>Thanks to open courseware there are tons of awesome university courses online, e.g.:</p>
<p><a href="https://youtu.be/C1lhuz6pZC0">MIT 6.0002 Introduction to Computational Thinking and Data Science</a></p>
<h2>Learn the tools</h2>
<p>There is already a whole bunch of tools that we can consider part of a standard data science stack. Because my main language is Python, the focus is mostly on Python modules.</p>
<ul>
<li><a href="https://creatronix.de/introduction-to-jupyter-notebook/">Jupyter Notebook</a></li>
<li><a href="https://creatronix.de/linear-algebra-with-numpy-part-1/">numpy</a></li>
<li>pandas</li>
<li><a href="https://seaborn.pydata.org/">seaborn</a></li>
<li><a href="https://bokeh.pydata.org/en/latest/">bokeh</a></li>
<li><a href="http://holoviews.org/">holoviews</a></li>
<li><a href="http://scikit-learn.org/stable/">scikit-learn</a></li>
<li><a href="https://keras.io/">keras</a> / <a href="https://www.tensorflow.org/">TensorFlow</a></li>
<li>Tableau</li>
</ul>
<h2>Finishing Udacity / Udemy courses</h2>
<p>To brush up my python skills and my knowledge of basic computer science I will finish some already started online courses:</p>
<ul>
<li>[  ] <a href="https://creatronix.de/ud120-intro-to-machine-learning/">Introduction to Machine Learning</a></li>
<li>[  ] Python Bootcamp</li>
<li>[  ] Algorithms and Data Structures</li>
<li>[  ] Introduction to Artificial Intelligence</li>
<li>[  ] <a href="https://classroom.udacity.com/courses/ud810/">Introduction to computer vision</a></li>
<li>[  ] <a href="https://classroom.udacity.com/courses/cs373">Artificial Intelligence for Robotics</a></li>
</ul>
<h2>Reading data science books</h2>
<p>To get a broad overview I bought two books on DS / ML:</p>
<ul>
<li>[  ] Data Science from Scratch</li>
<li>[  ] Hands on Machine Learning</li>
</ul>
<h2>Do Exercises on Kaggle</h2>
<ul>
<li>[x] Create Account at Kaggle</li>
<li>[  ] Do first exercise</li>
<li>[  ] Participate in a contest</li>
</ul>
<h2>Visit Meetups about Data Science</h2>
<p>[  ] Visit <a href="https://www.meetup.com/de-DE/Nuernberg-Big-Data/?_af_cid=Nuernberg-Big-Data">Big Data Meetup Events</a></p>
<h2>Add some Peer Pressure</h2>
<p>My brother-in-law and I teamed up and built a WhatsApp learning &amp; exchange group. We currently have four members.</p>
<h2>Write Blog Articles</h2>
<p>I will try to incorporate some of the stuff I&#8217;ve learned into blog articles.</p>
<p>I have already written:</p>
<ul>
<li><a href="https://creatronix.de/bayes-theorem-part-1/">Bayes’ Theorem Part 1</a></li>
<li><a href="https://creatronix.de/data-science-overview/">Data Science Overview</a></li>
<li><a href="https://creatronix.de/classification-precision-and-recall/">Classification: Precision and Recall</a></li>
<li><a href="https://creatronix.de/confusion-matrix/">Confusion Matrix</a></li>
<li><a href="https://creatronix.de/ud120-intro-to-machine-learning/">UD120 Intro to Machine Learning</a></li>
<li><a href="https://creatronix.de/lesson-2-naive-bayes/">Lesson 2: Naive Bayes</a></li>
<li><a href="https://creatronix.de/lesson3-support-vector-machines/">Lesson 3: Support Vector Machines</a></li>
</ul>
<p>So stay tuned!</p>
<p>The post <a href="https://creatronix.de/my-personal-road-map-for-learning-data-science/">My personal roadmap for learning data science in 2018</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Bayes’ Theorem</title>
		<link>https://creatronix.de/bayes-theorem/</link>
		
		<dc:creator><![CDATA[Jörn]]></dc:creator>
		<pubDate>Sun, 03 Dec 2017 16:26:26 +0000</pubDate>
				<category><![CDATA[Data Science & SQL]]></category>
		<category><![CDATA[base rate fallacy]]></category>
		<category><![CDATA[Bayes theorem]]></category>
		<category><![CDATA[data science]]></category>
		<guid isPermaLink="false">https://creatronix.de/?p=1171</guid>

					<description><![CDATA[<p>Imagine that you come home from a party and you are stopped by the police. They ask you to take a drug test and you accept. The test result is positive. You are guilty. But wait a minute! Is it really that simple? In Germany about 2.8 million people consume weed on a regular basis,&#8230;</p>
<p>The post <a href="https://creatronix.de/bayes-theorem/">Bayes’ Theorem</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></description>
										<content:encoded><![CDATA[<p>Imagine that you come home from a party and you are stopped by the police. They ask you to take a drug test and you accept. The test result is positive. You are guilty.</p>
<p>But wait a minute! Is it really that simple?</p>
<p>In Germany about 2.8 million people consume weed on a regular basis, that&#8217;s about 3.5% of the population.</p>
<p>Let&#8217;s say D stands for a drug user who consumes weed regularly and ¬D for a person who consumes no weed. The chance that a randomly picked person is a drug user is P(D) = 0.035.</p>
<p>Because you either take drugs or you don&#8217;t, the remaining part must be the non-drug takers: P(¬D) = 1 &#8211; P(D) = 0.965.</p>
<p>The accuracy of a drug test is about 92%. So let&#8217;s assume that there are 8% false positives and 8% false negatives as well.</p>
<p>The chance that the test result is positive if a person actually takes drugs is P(+|D) = 0.92. But you also get a positive result in 8% of all cases in which a person doesn&#8217;t take drugs: P(+|¬D) = 0.08. These are called &#8220;false positives&#8221;.</p>
<p>When a person doesn&#8217;t take drugs, the test will be negative in 92% of all cases: P(-|¬D) = 0.92. And of course a test can also be negative even if a person takes drugs: P(-|D) = 0.08. These are called &#8220;false negatives&#8221;. Got it?</p>
<p>What comes next?</p>
<h2>Combined Probabilities</h2>
<p>Knowing the success and error rates of the test and the relative distribution of drug consumers we can calculate the combined probabilities:</p>
<ul>
<li>P(+, D) = 0.035 * 0.92 = 0.0322
<ul>
<li>Think: The test is <strong>positive AND</strong> the person is a drug user</li>
</ul>
</li>
<li>P(+, ¬D) = 0.965 * 0.08 = 0.0772
<ul>
<li>Think: The test is <strong>positive AND</strong> the person is <strong>NOT</strong> a drug user</li>
</ul>
</li>
<li>P(-, D) = 0.035 * 0.08 = 0.0028
<ul>
<li>Think: The test is <strong>negative AND</strong> the person is a drug user</li>
</ul>
</li>
<li>P(-, ¬D) = 0.965 * 0.92 = 0.8878
<ul>
<li>Think: The test is <strong>negative AND</strong> the person is <strong>NOT</strong> a drug user</li>
</ul>
</li>
</ul>
<h2>Bayes&#8217; Theorem</h2>
<p>P(A|B) = P(B|A) * P(A) / P(B)</p>
<p>In our case we are interested in the probability of a person being a drug addict given the test is positive. That means:</p>
<p>P(D | +) = P(+ | D) * P(D) / P(+) = P(+ | D) * P(D) / ( P(+, D) + P(+, ¬D) )</p>
<p>= 0.92 * 0.035 / (0.0322 + 0.0772) = <strong>0.294</strong></p>
<p>The outcome is quite interesting and mildly shocking: the probability that a person who tested positive is actually a drug user is only around 29%, less than one third!</p>
<p>Why is this so counterintuitive when the test states an accuracy of 92%? This is the so-called <strong>base rate fallacy</strong>: we have to take into account that only 3.5% of the population actually take drugs.</p>
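<p>The whole calculation fits in a few lines of Python, using the numbers from above:</p>
<div class="hcb_wrap">
<pre class="prism line-numbers lang-python" data-lang="Python"><code># Bayes' theorem for the drug test example
p_d = 0.035               # P(D): prior probability of being a regular consumer
p_not_d = 1 - p_d         # P(¬D)
p_pos_given_d = 0.92      # P(+|D): true positive rate
p_pos_given_not_d = 0.08  # P(+|¬D): false positive rate

# P(+) via the law of total probability
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * p_not_d

# P(D|+) = P(+|D) * P(D) / P(+)
p_d_given_pos = p_pos_given_d * p_d / p_pos
print(round(p_d_given_pos, 3))  # 0.294</code></pre>
</div>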
<h2>Further reading about drug tests</h2>
<p><a href="https://www.webmd.com/drug-medication/news/20100528/drug-tests-often-trigger-false-positives#1">Drug tests generally produce false-positive results in 5% to 10% of cases and false negatives in 10% to 15% of cases, new research shows.</a></p>
<p>The post <a href="https://creatronix.de/bayes-theorem/">Bayes’ Theorem</a> appeared first on <a href="https://creatronix.de">Creatronix</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
