SQLite3: Python and SQL

Everything we did in the last articles was a dry run because we only used SQLFiddle. So let’s start with a real database: SQLite.

SQLite is a file-based RDBMS and can be used, for example, for websites. The official docs say:

“SQLite works great as the database engine for most low to medium traffic websites (which is to say, most websites). [..] Generally speaking, any site that gets fewer than 100K hits/day should work fine with SQLite.”

Because Knight Industries is neither Google, Amazon, nor Facebook, we can definitely use SQLite.

Creating and connecting to a database

In Python it is pretty easy to connect to a SQLite database:

from sqlite3 import connect
db_connection = connect('knight_industries.db')

If the file knight_industries.db does not exist, it will be created automagically. A nice little feature of the sqlite3 library.

But be careful: if you already have a database file and you mess up the path in the connect statement, you will wonder why you cannot access your data, because a new, empty file is created silently.
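One way to guard against this is to open the file through a URI with mode=rw, which tells SQLite to open an existing file read-write but never to create one. The following is only a minimal sketch: the helper name connect_existing and the deliberately misspelled file name are made up for illustration.

```python
import os
import sqlite3
import tempfile
from pathlib import Path


def connect_existing(path):
    """Open an existing SQLite file; fail instead of creating a new one.

    Hypothetical helper, not part of the sqlite3 library.
    """
    # mode=rw opens read-write but, unlike the default, never creates the file
    uri = Path(path).resolve().as_uri() + '?mode=rw'
    return sqlite3.connect(uri, uri=True)


# Demonstration in a throwaway directory with a misspelled file name
demo_dir = tempfile.mkdtemp()
missing_path = os.path.join(demo_dir, 'knight_industrys.db')

try:
    connect_existing(missing_path)
    opened = True
except sqlite3.OperationalError:
    # SQLite reports that it is unable to open the database file
    opened = False

print(opened)
print(os.path.exists(missing_path))
```

With a wrong path you now get an error right away instead of a silently created empty database.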

cursor = db_connection.cursor()
cursor.execute('''CREATE TABLE operatives (id INTEGER, name TEXT, birthday DATE)''')
# SQL string literals belong in single quotes; double quotes denote identifiers
cursor.execute('''INSERT INTO operatives (id, name, birthday)
                  VALUES (1, 'Michael Arthur Long', '1949-01-09')''')
db_connection.commit()  # write the insert to the database file

cursor.execute('''SELECT * FROM operatives''')
print(cursor.fetchone())
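If you want to try the steps above without leaving a file behind, here is a self-contained sketch. It differs from the code above in two ways that are my own choices: it uses the special name ':memory:' for a database that lives only in RAM, and it passes the values through '?' placeholders instead of writing them into the SQL string.

```python
import sqlite3

# ':memory:' keeps the database in RAM, so no file is created
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute('CREATE TABLE operatives (id INTEGER, name TEXT, birthday DATE)')

# '?' placeholders let sqlite3 take care of quoting the values
cursor.execute(
    'INSERT INTO operatives (id, name, birthday) VALUES (?, ?, ?)',
    (1, 'Michael Arthur Long', '1949-01-09'),
)
connection.commit()

cursor.execute('SELECT * FROM operatives')
row = cursor.fetchone()
print(row)  # (1, 'Michael Arthur Long', '1949-01-09')
connection.close()
```

fetchone() returns the row as a plain tuple. The birthday comes back as a string because no type detection was requested when connecting.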

Introduction to matplotlib – Part 3


After laying the foundation in Introduction to matplotlib and Introduction to matplotlib – Part 2, I want to show you another important chart type.

Bar Charts

A bar chart is useful for showing total values over time, e.g. the revenue of a company.

import matplotlib.pyplot as plt

years = (2017, 2018, 2019)
revenue = (5000, 7000, 9000)
plt.bar(years, revenue, width=0.35)  # one bar per year
plt.title("Revenue over years")
plt.show()


Linear Regression with sklearn – cheat sheet

# import and instantiate model
from sklearn.linear_model import LinearRegression
model = LinearRegression()

# prepare training data (features must be 2-D, hence the list of column names)
features_train = df_train.loc[:, ['feature_name']]
target_train = df_train.loc[:, 'target_name']

# fit (train) model and print coefficient and intercept
model.fit(features_train, target_train)
print(model.coef_, model.intercept_)

# calculate model quality on the training data
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

target_prediction = model.predict(features_train)
print(mean_squared_error(target_train, target_prediction))
print(r2_score(target_train, target_prediction))

# test predictions on held-out data (df_test is assumed to hold the test split)
features_test = df_test.loc[:, ['feature_name']]
target_test = df_test.loc[:, 'target_name']
target_prediction_test = model.predict(features_test)
print(mean_squared_error(target_test, target_prediction_test))
print(r2_score(target_test, target_prediction_test))

10 things I didn’t know about Data Science a year ago

In my article My personal road map for learning data science in 2018, I wrote about how I try to tackle the data science knowledge sphere. Because 2018 is slowly coming to an end, I think it is time for a little wrap-up.

What are the things I learned about Data Science in 2018? Here we go:

1. The difference between Data Science, Machine Learning, Deep Learning and AI


Feature Scaling

What is Feature Scaling?

Feature Scaling is an important pre-processing step for some machine learning algorithms.

Imagine you have three friends of whom you know the individual weight and height.

You would like to deduce Christian’s T-shirt size from David’s and Julia’s by looking at their height and weight.

Name       Height in m  Weight in kg  T-Shirt size
Julia      1.58         52            Small
David      1.79         79            Large
Christian  1.86         64            ?

One way you could determine the shirt size is to just add up the weight and the height of each friend.
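To see what this naive approach does with the numbers from the table, here is a small sketch. The variable names are made up for illustration, and the "closest sum wins" rule is just my assumption about how one would read a size off the sums.

```python
# (height in m, weight in kg) taken from the table above
friends = {
    'Julia': (1.58, 52),
    'David': (1.79, 79),
    'Christian': (1.86, 64),
}

# Naive score: simply add height and weight
naive_score = {
    name: round(height + weight, 2)
    for name, (height, weight) in friends.items()
}
print(naive_score)

# Assumed rule: Christian gets the size of the friend with the closest sum
closest = min(
    ('Julia', 'David'),
    key=lambda name: abs(naive_score[name] - naive_score['Christian']),
)
print(closest)
```

The sums are driven almost entirely by the weight, because the weights are numerically much larger than the heights. Christian's sum ends up closer to Julia's than to David's, although he is the tallest of the three.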

Receiver Operating Characteristic

ROC Curve

Having already introduced Precision and Recall, we now look at the ROC curve, which is another way of judging the quality of classification algorithms.

ROC stands for Receiver Operating Characteristic.

The ROC curve is created by plotting the true positive rate (TPR) on the y-axis against the false positive rate (FPR) on the x-axis at various threshold settings.

You already know the TPR as recall or sensitivity.
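The points of such a curve can be computed by hand. The following is a minimal sketch in plain Python: the function name roc_points, the labels, the scores and the thresholds are all invented for illustration, and a sample counts as predicted positive when its score is at least the threshold.

```python
def roc_points(labels, scores, thresholds):
    """Return one (FPR, TPR) pair per threshold.

    labels: 1 for positive, 0 for negative samples.
    scores: classifier scores, higher means "more likely positive".
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for threshold in thresholds:
        predicted = [score >= threshold for score in scores]
        true_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        false_pos = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        # TPR = TP / all positives, FPR = FP / all negatives
        points.append((false_pos / negatives, true_pos / positives))
    return points


labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(labels, scores, thresholds=[1.0, 0.75, 0.5, 0.0])
for fpr, tpr in points:
    print(round(fpr, 2), round(tpr, 2))
```

Lowering the threshold moves the point from (0, 0), where nothing is classified as positive, towards (1, 1), where everything is.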
