Lesson 4: Decision Trees

from sklearn.tree import DecisionTreeClassifier

# min_samples_split is the minimum number of samples a node needs before it may be split further
clf = DecisionTreeClassifier(min_samples_split=40)
clf.fit(features_train, labels_train)
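
Evaluating the tree follows the same fit / predict / accuracy pattern as the other lessons; features_test and labels_test are assumed to come from the mini project's preprocessing:

from sklearn.metrics import accuracy_score

prediction = clf.predict(features_test)
print("accuracy:", accuracy_score(labels_test, prediction))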

UD120 – Intro to Machine Learning

One item on my bucket list for 2018 was finishing the Udacity Course UD120: Intro to Machine Learning.

The hosts of this course are Sebastian Thrun, ex-Google X and founder of Udacity, and Katie Malone, creator of the Linear Digressions podcast.

The course consists of 17 lessons. Every lesson has a couple of hours of video and lots and lots of quizzes in it.

  • [x] Lesson 1: Only introduction 🙂
  • [x] Lesson 2: Naive Bayes
  • [x] Lesson 3: Support Vector Machines
  • [x] Lesson 4: Decision Trees
  • [x] Lesson 5: Choose your own algorithm
  • Lesson 6: Datasets and questions
  • Lesson 7: Regression
  • Lesson 8: Outliers
  • Lesson 9: Clustering
  • Lesson 10: Feature Scaling
  • Lesson 11: Text Learning
  • Lesson 12: Feature Selection
  • Lesson 13: PCA
  • Lesson 14: Validation
  • Lesson 15: Evaluation Metrics
  • Lesson 16: Tying it all together
  • Lesson 17: Final project

Lesson 2: Naive Bayes

Lesson 2 of the Udacity Course UD120 – Intro to Machine Learning deals with Naive Bayes classification.

Mini project

For the mini project you should fork https://github.com/udacity/ud120-projects and clone it. It is recommended to install a 64-bit Python 2.7 version, because ML involves heavy data processing and can easily eat up more than 2 GB of memory.

Dependencies

After cloning the repo I would recommend setting up a venv and installing the requirements (a setup sketch follows the list):

  • sklearn
  • numpy
  • scipy
  • matplotlib
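
A minimal setup sketch, assuming a Unix-like shell (on Windows the activate script lives in venv\Scripts) and that you clone your own fork of the repo:

# clone the mini project repo (use your own fork's URL)
git clone https://github.com/udacity/ud120-projects.git
cd ud120-projects

# create and activate an isolated environment
pip install virtualenv
virtualenv venv
source venv/bin/activate

# install the dependencies listed above
pip install scikit-learn numpy scipy matplotlib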

The Code

The code itself is pretty straightforward:

  • Instantiate the classifier
  • Train (fit) the Classifier
  • Predict
  • Calculate accuracy

from time import time

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# features_train, labels_train, features_test and labels_test
# come from the mini project's preprocessing step

# training
print("Start training")
t0 = time()
clf = GaussianNB()
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")

# prediction
print("start predicting")
t0 = time()
prediction = clf.predict(features_test)
print("predict time:", round(time() - t0, 3), "s")

# accuracy
print("Calculating accuracy")
accuracy = accuracy_score(labels_test, prediction)
print("Accuracy calculated, and the accuracy is", accuracy)

The output on my machine:

training time: 1.762 s
start predicting
predict time: 0.286 s
Calculating accuracy
Accuracy calculated, and the accuracy is 0.9732650739476678

The simple Gaussian Naive Bayes classifier is pretty accurate at 97.3%.

Lesson 3: Support Vector Machines

The term Support Vector Machine, or SVM, is a bit misleading. It is just the name for a very clever algorithm invented by two Russian mathematicians, Vladimir Vapnik and Alexey Chervonenkis, in the 1960s. SVMs are used for classification and regression.

print("Start training")
t0 = time()
clf = svm.SVC(kernel="linear")
clf.fit(features_train, labels_train)
print("training time:", round(time() - t0, 3), "s")

print("start predicting")
t0 = time()
prediction = clf.predict(features_test)
print("predict time:", round(time() - t0, 3), "s")

# accuracy
print("Calculating accuracy")
accuracy = accuracy_score(labels_test, prediction)
print("Accuracy calculated, and the accuracy is", accuracy)

When timing the training of the SVC, it’s astonishing how long it takes: around 2.5 minutes at 98.4% accuracy.

As an alternative you can use:

from sklearn.svm import LinearSVC

clf = LinearSVC(loss='hinge')

It gets you a result in 0.3 seconds with the same accuracy.

What’s the difference? SVC is built on libsvm, whose training time grows roughly quadratically with the number of samples, while LinearSVC is built on liblinear, which scales much better for linear kernels on large data sets.

Parameter tuning

With the initial SVC we can play around with the parameters “C” and “kernel”.
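
A rough sketch of what such tuning can look like; the RBF kernel and the C values here are just illustrative choices, and features_train / labels_train are assumed to come from the mini project's preprocessing:

from sklearn import svm
from sklearn.metrics import accuracy_score

# try an RBF kernel with increasingly large C values
# (larger C means less regularization, i.e. a more complex decision boundary)
for c in (10.0, 100.0, 1000.0, 10000.0):
    clf = svm.SVC(kernel="rbf", C=c)
    clf.fit(features_train, labels_train)
    prediction = clf.predict(features_test)
    print("C =", c, "accuracy =", accuracy_score(labels_test, prediction))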

Kernels

 

Helpful videos:

  • Ingo’s Deep Dive
  • SVM MIT
  • SVM Siraj Raval

Linear Algebra with numpy – Part 1

Numpy is a package for scientific computing in Python.

Vector arithmetic

Declaration

import numpy as np

a = np.array([1, 2, 3, 4])
print(a)  # [1 2 3 4]

Addition / Subtraction

a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
a + b
array([5, 5, 5, 5])
a - b
array([-3, -1,  1,  3])

Scalar Multiplication

a = np.array([1,2,3,4])
a * 3
array([ 3,  6,  9, 12])

To see why it is convenient to use numpy’s array for this operation, you have to consider the alternative with a plain Python list:

c = [1,2,3,4]
d = [x * 3 for x in c]

Dot Product

a = np.array([1,2,3,4]) 
b = np.array([4,3,2,1])

a.dot(b)

20 # 1*4 + 2*3 + 3*2 + 4*1

Stay tuned for more algebraic stuff with numpy!

Python pip and virtualenv

After working for a couple of years with Python and external dependencies, I've run into the same kinds of problems again and again.

Bad habits

Say you have a global Python installation, e.g. under C:\Python27 on Windows. When you start working on your first Python project you want to use external packages, and you encounter pip as the dependency management tool (pip has been bundled with the Python installation since 2.7.9 / 3.4). So far so good.

But you keep installing all the packages into your global Python installation.

JavaScript: dot vs bracket notation

While linting my code, jshint gave me the “hint” that I should prefer dot notation over bracket notation:

"testcase": data.finding["testcase"], [‘testcase’] is better written in dot notation.

What is that?

  • Accessing members with “.” is called “dot notation”.
  • Accessing them with [] is called “bracket notation”.
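
For illustration, here is the property access from the snippet above written both ways (the surrounding object literal is assumed):

// bracket notation: the property name is a string, so it can be dynamic
// or contain characters that are not valid in an identifier
"testcase": data.finding["testcase"],

// dot notation: shorter, and what jshint prefers when the property name
// is a fixed, valid identifier
"testcase": data.finding.testcase,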

 

The Normal Distribution

Diving deeper into data science, I started to brush up my knowledge of math, especially statistics.

The Mother of all Distributions

The normal distribution was formulated by Carl Friedrich Gauß in the early 19th century and can be implemented in Python as follows:

import math

def normal_distribution(x, mu=0, sigma=1):
    # density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi))
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)
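
As a quick sanity check, the density of the standard normal at x = 0 should be 1 / sqrt(2π) ≈ 0.3989:

print(normal_distribution(0))  # ≈ 0.3989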

Data Science: Cross-Validation

For validating your model you need to split your data into a training and a test data set.

More training data means a better model, more test data means better validation.

But because the amount of data to train/test the model is limited, you have to decide on the ratio of training to test data when splitting.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
print(iris.data.shape, iris.target.shape)  # (150, 4) (150,)

Sample a training set while holding out 40% of the data for testing:

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)
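
Continuing the example, a minimal sketch of how the held-out 40% can be used to validate a model (the linear SVC and C=1 are just illustrative choices):

clf = svm.SVC(kernel="linear", C=1)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test set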

Helpful links:

http://scikit-learn.org/stable/modules/cross_validation.html

Five Minutes with Ingo: Cross Validation