Linear Algebra with numpy – Part 1

Numpy is a package for scientific computing in Python.

Vector arithmetic

Declaration

import numpy as np

a = np.array([1,2,3,4])
print(a)
[1 2 3 4]

Addition / Subtraction

a = np.array([1,2,3,4])
b = np.array([4,3,2,1])
a + b
array([5, 5, 5, 5])
a - b
array([-3, -1,  1,  3])

Scalar Multiplication

a = np.array([1,2,3,4])
a * 3
array([ 3,  6,  9, 12])

To see why it is so convenient to use numpy’s arrays for these operations, you have to consider the alternative with plain Python lists:

c = [1,2,3,4]
d = [x * 3 for x in c]

Dot Product

a = np.array([1,2,3,4]) 
b = np.array([4,3,2,1])

a.dot(b)

20 # 1*4 + 2*3 + 3*2 + 4*1
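By the way, a.dot(b) is only one of several equivalent spellings; np.dot and, since Python 3.5, the @ operator compute the same thing:

np.dot(a, b)  # 20, function form
a @ b         # 20, matrix-multiplication operator (Python 3.5+)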

Stay tuned for more algebraic stuff with numpy!

The Normal Distribution

Diving deeper into data science, I started to brush up my knowledge of math, especially statistics.

The Mother of all Distributions

The normal distribution was formulated by Carl Friedrich Gauß in the early 19th century and can be implemented in Python as follows:

import math

def normal_distribution(x, mu=0, sigma=1):
    """Probability density of the normal distribution with mean mu and standard deviation sigma."""
    sqrt_two_pi = math.sqrt(2 * math.pi)
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (sqrt_two_pi * sigma)
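As a quick sanity check: the standard normal density at x = 0 should equal 1/sqrt(2*pi) ≈ 0.3989. If you have scipy installed, scipy.stats.norm.pdf returns the same value:

normal_distribution(0)
0.3989422804014327

from scipy.stats import norm
norm.pdf(0)
0.3989422804014327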

Data Science: Cross-Validation

To validate your model, you need to split your data into a training set and a test set.

More training data means a better model; more test data means better validation.

But because the amount of available data is limited, you have to decide on the ratio in which to split it into training and test data.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
iris.data.shape, iris.target.shape
((150, 4), (150,))

Sample a training set while holding out 40% of the data for testing:

X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0)
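With the split in place, you can train a model on the training part and validate it on the held-out part. The svm import above suggests a support vector classifier, so here is a minimal sketch in the spirit of the scikit-learn documentation linked below:

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)  # accuracy on the 40% held-out test data

To avoid committing to one particular split, cross-validation repeats this on several train/test partitions:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores.mean()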

Helpful links:

http://scikit-learn.org/stable/modules/cross_validation.html

Five Minutes with Ingo: Cross Validation

 

Data Science Datasets: Iris flower data set

The Iris flower data set or Fisher’s Iris data set became a typical test case for many statistical classification techniques in machine learning such as support vector machines. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.

This data set can be imported from scikit-learn as follows:

from sklearn import datasets

iris = datasets.load_iris()
iris.data.shape, iris.target.shape
((150, 4), (150,))
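Besides data and target, the object returned by load_iris exposes a few attributes that are helpful for getting to know the data set:

iris.feature_names  # sepal/petal length and width, in cm
iris.target_names   # array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
iris.data[:3]       # the first three rows of measurements
iris.target[:3]     # their class labels: array([0, 0, 0])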

Data Science Overview

Data Science tries to answer one of the following questions (a rough mapping to concrete tools follows the list):

  • Classification -> “Is it A or B?”
  • Clustering -> “Are there groups which belong together?”
  • Regression -> “How will it develop in the future?”
  • Association -> “What frequently happens together?”
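To make these categories concrete, here is my own illustration of which tools typically address each question; the mapping is not part of any canonical list:

from sklearn.svm import SVC                         # Classification: "Is it A or B?"
from sklearn.cluster import KMeans                  # Clustering: groups that belong together
from sklearn.linear_model import LinearRegression   # Regression: predicting a continuous value
# Association ("What frequently happens together?") is usually tackled with
# association-rule mining, e.g. the apriori implementation in the mlxtend package.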

SQL-Basics: Create – Read – Update – Delete

This episode is about the basic statements needed to create, read, update and delete data in a database system.

Let’s assume we work as data scientists for Knight Industries. We want to help the Foundation for Law and Government keep track of our operatives.

We decide to use a classic relational database management system or RDBMS. In order to explore Database Management Systems we can either install one locally or we can use an online tool like SQLFiddle.

To interact with an RDBMS we use SQL – the Structured Query Language.

As the name says, SQL (pronounced either “S-Q-L” or “sequel”) is used to write structured queries. Think of “conversations” when you think of “queries”.

So, let’s fire up SQLFiddle.
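As a self-contained preview of the four CRUD statements, here is a sketch using Python’s built-in sqlite3 module instead of SQLFiddle. The operatives table and its columns are hypothetical stand-ins for illustration, not taken from the original post:

import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# CREATE: define a table and add a row
cur.execute("CREATE TABLE operatives (id INTEGER PRIMARY KEY, name TEXT, vehicle TEXT)")
cur.execute("INSERT INTO operatives (name, vehicle) VALUES (?, ?)", ("Michael Knight", "KITT"))

# READ: query the data back
cur.execute("SELECT id, name, vehicle FROM operatives")
print(cur.fetchall())  # [(1, 'Michael Knight', 'KITT')]

# UPDATE: change an existing row
cur.execute("UPDATE operatives SET vehicle = ? WHERE name = ?", ("KARR", "Michael Knight"))

# DELETE: remove the row again
cur.execute("DELETE FROM operatives WHERE name = ?", ("Michael Knight",))

conn.close()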

My personal road map for learning data science in 2018

I got confused by all the buzzwords: data science, machine learning, deep learning, neural nets, artificial intelligence, big data, and so on and so on.

As an engineer I like to put some structure to the chaos. Inspired by “Roadmap: How to Learn Machine Learning in 6 Months” and Tetiana Ivanova’s “How to become a Data Scientist in 6 months – a hacker’s approach to career planning”, I built my own learning road map for this year.