## Too confused by the confusion matrix?

Let me bring some clarity into this topic!

# Category: Data Science

## Confusion Matrix

## numpy random choice

## Classification: Precision and Recall

## Clarification

## Lesson 4: Decision Trees

## UD120 – Intro to Machine Learning

## Lesson 2: Naive Bayes

## Lesson 3: Support Vector Machines

## Linear Algebra with numpy – Part 1

## Vector arithmetic

## The Normal Distribution

## The Mother of all Distributions

## Data Science: Cross-Validation

The Adventures of Dash Daring in Code & Music & Business

With NumPy you can easily create test data with `random_integers` and `randint`.

    numpy.random.randint(low, high=None, size=None, dtype='l')
    numpy.random.random_integers(low, high=None, size=None)

`random_integers` includes the upper boundary `high`, while `randint` excludes it. Note that `random_integers` has been deprecated since NumPy 1.11 in favor of `randint`.
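The boundary behaviour of `randint` is easy to check empirically. The sketch below simulates a six-sided die; the seed, range, and sample size are illustrative choices, not taken from the original post. Because `randint` excludes `high`, the call passes 7 to get values from 1 through 6.

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable (arbitrary choice)

# Simulated die rolls: low=1 is included, high=7 is excluded
rolls = np.random.randint(1, 7, size=1000)

print("smallest roll:", rolls.min())
print("largest roll:", rolls.max())
```

To get the inclusive behaviour of `random_integers(low, high)` from `randint`, pass `high + 1` as the upper bound.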

In the realm of data science you’ll sooner or later encounter the terms “Precision” and “Recall”. But what do they mean?

Living together with little kids, you very often run into classification issues:

My daughter really likes dogs, so seeing a dog is something positive. Suppose she sees a normal dog, e.g. a Labrador, and proclaims: “Look, there is a dog!”

That’s a **True Positive (TP)**.
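For orientation, precision and recall are both computed from counts such as the true positives above. The sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The counts are made-up numbers for illustration, and false positives (FP) and false negatives (FN) are terms this excerpt has not introduced yet.

```python
def precision(tp, fp):
    """Share of "there is a dog!" calls that really were dogs."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual dogs that were called out as dogs."""
    return tp / (tp + fn)


# Hypothetical counts from an afternoon walk (illustrative only)
tp, fp, fn = 8, 2, 4

p = precision(tp, fp)  # 8 / (8 + 2)
r = recall(tp, fn)     # 8 / (8 + 4)
print(f"precision={p:.2f}, recall={r:.2f}")
```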

    from sklearn.tree import DecisionTreeClassifier

    clf = DecisionTreeClassifier(min_samples_split=40)
    clf.fit(features_train, labels_train)
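The snippet above relies on `features_train` and `labels_train` from the course material. As a self-contained sketch, the following substitutes scikit-learn's bundled iris data set for the course data; the data set, split ratio, and `random_state` values are assumptions made for illustration. With `min_samples_split=40`, a node is only split further if it holds at least 40 samples, which keeps the tree shallow.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: iris instead of the course's data set (assumption)
X, y = load_iris(return_X_y=True)
features_train, features_test, labels_train, labels_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Nodes with fewer than 40 samples are not split any further
clf = DecisionTreeClassifier(min_samples_split=40, random_state=42)
clf.fit(features_train, labels_train)

accuracy = clf.score(features_test, labels_test)
print(f"test accuracy: {accuracy:.3f}")
```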

One part of my bucket list for 2018 was finishing the Udacity Course UD120: Intro to Machine Learning.

The hosts of this course are Sebastian Thrun, formerly of Google X and founder of Udacity, and Katie Malone, creator of the Linear Digressions podcast.

The course consists of 17 lessons. Every lesson has a couple of hours of video and lots and lots of quizzes in it.

- [x] Lesson 1: Only introduction 🙂
- [x] Lesson 2: Naive Bayes
- [x] Lesson 3: Support Vector Machines
- [x] Lesson 4: Decision Trees
- [x] Lesson 5: Choose your own algorithm
- [ ] Lesson 6: Datasets and questions
- [ ] Lesson 7: Regression
- Lesson 8: Outliers
- Lesson 9: Clustering
- Lesson 10: Feature Scaling
- Lesson 11: Text Learning
- Lesson 12: Feature Selection
- Lesson 13: PCA
- Lesson 14: Validation
- Lesson 15: Evaluation Metrics
- Lesson 16: Tying it all together
- Lesson 17: Final project

Lesson 2 of the Udacity Course UD120 – Intro to Machine Learning deals with Naive Bayes classification.

The term Support Vector Machine (SVM) is a bit misleading. It is just a name for a very clever algorithm invented by two Russians, Vladimir Vapnik and Alexey Chervonenkis, in the 1960s. SVMs are used for classification and regression.

NumPy is a package for scientific computing in Python.

Declaration

    import numpy as np

    a = np.array([1, 2, 3, 4])
    print(a)  # [1 2 3 4]
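Once vectors are declared as arrays, basic vector arithmetic works element by element. The following is a minimal sketch; the second vector `b` and the chosen operations are illustrative assumptions, not taken from the original post.

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])  # hypothetical second vector

vector_sum = a + b   # element-wise addition
scaled = 2 * a       # multiplication by a scalar
dot = a.dot(b)       # dot product: 1*10 + 2*20 + 3*30 + 4*40

print(vector_sum)  # [11 22 33 44]
print(scaled)      # [2 4 6 8]
print(dot)         # 300
```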

Diving deeper into data science, I started to brush up my knowledge of math, especially statistics.

The normal distribution was formulated by Carl Friedrich Gauß in 18XX and can be implemented in Python as follows:

    import math

    def normal_distribution(x, mu=0, sigma=1):
        """Probability density of the normal distribution at x."""
        sqrt_two_pi = math.sqrt(2 * math.pi)
        # sigma belongs in the denominator together with sqrt(2 * pi)
        return math.exp(-(x - mu) ** 2 / 2 / sigma ** 2) / (sqrt_two_pi * sigma)
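A density function like this can be sanity-checked against Python's standard library. The sketch below is a self-contained restatement in which the normalizing constant is written as 1 / (sigma * sqrt(2π)); the name `normal_pdf` and the test values are illustrative assumptions. It compares the result with `statistics.NormalDist` (available since Python 3.8). A non-unit sigma is used on purpose, because with sigma = 1 a misplaced sigma in the normalization would go unnoticed.

```python
import math
from statistics import NormalDist  # standard library, Python 3.8+


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and std dev sigma at x."""
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return math.exp(exponent) / (sigma * math.sqrt(2 * math.pi))


# Arbitrary test point with sigma != 1
x, mu, sigma = 1.5, 1.0, 2.0
ours = normal_pdf(x, mu, sigma)
reference = NormalDist(mu, sigma).pdf(x)

print(math.isclose(ours, reference))
```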

To validate your model, you need to split your data into a training set and a test set.

More training data generally means a better model; more test data means a more reliable validation.

But because the amount of data available is limited, you have to decide on the ratio in which to split it between training and testing.

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn import datasets
    from sklearn import svm

    iris = datasets.load_iris()
    iris.data.shape, iris.target.shape

Sample a training set while holding out 40% of the data for testing:

    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=0.4, random_state=0)

    from sklearn.model_selection import cross_val_score

    clf = svm.SVC(kernel='linear', C=1)
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)
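Put together, the hold-out split and the cross-validation call can be run end to end. This is a minimal sketch that reuses the same iris data and linear SVM; fitting and scoring on the hold-out split and the summary printout are added steps not shown in the excerpt. With `cv=5`, the data is divided into five folds and each fold serves once as the test set, so no data is permanently withheld from training.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score, train_test_split

iris = datasets.load_iris()

# Single hold-out split: train on 60% of the data, test on the remaining 40%
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
clf = svm.SVC(kernel='linear', C=1)
holdout_accuracy = clf.fit(X_train, y_train).score(X_test, y_test)

# 5-fold cross-validation: one accuracy score per fold
scores = cross_val_score(
    svm.SVC(kernel='linear', C=1), iris.data, iris.target, cv=5)

print(f"hold-out accuracy: {holdout_accuracy:.2f}")
print(f"cross-validation accuracy: {scores.mean():.2f} (+/- {scores.std():.2f})")
```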

Five Minutes with Ingo: Cross Validation