Motivation
Cross-validation is a technique for assessing how well a machine learning model generalizes to unseen data.
To validate your model, you split the available data into a training set and a test set.
-----------------------------------------------
|                              |              |
|        training data         |  test data   |
|                              |              |
-----------------------------------------------
More training data generally yields a better model, while more test data gives a more reliable estimate of its accuracy.
But because the amount of available data is limited, you have to decide on the ratio in which to split it into training and test data.
Example
Prerequisite
You need to install scikit-learn:
pip install scikit-learn
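Note that the package is installed as scikit-learn but imported under the name sklearn.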
Iris dataset
Let’s take the (in-)famous iris data set:
from sklearn import datasets
iris = datasets.load_iris()
iris.data.shape, iris.target.shape
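The shapes should come out as (150, 4) and (150,): 150 samples with 4 features each, plus one class label per sample.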
Let’s train a model. For this example we use a support vector machine (SVM) with a linear kernel:
from sklearn import svm
clf = svm.SVC(kernel='linear', C=1).fit(iris.data, iris.target)
clf.score(iris.data, iris.target)
In this case we take the complete data set as training data. We don’t hold back any test data to check the accuracy of the model, so it shouldn’t surprise us that the accuracy on the training data itself is almost 100%.
Train-Test-Split
To independently check the accuracy of your model, you can hold back a certain amount of your data for testing purposes.
sklearn takes care of that with the train_test_split function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
test_size is the fraction of the data reserved for testing; 0.4 means 40% of the samples end up in the test set.
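Note that train_test_split shuffles the data randomly, so the score will vary slightly from run to run. If you need a reproducible split, or want to keep the class proportions identical in both sets, you can pass the optional random_state and stratify arguments. A minimal sketch:

from sklearn.model_selection import train_test_split

# fix the random seed for a reproducible split and preserve the
# class distribution in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0, stratify=iris.target)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

With a single split, the held-back test data is also never used for training. Cross-validation addresses this by repeating the train/test procedure on several different splits and averaging the scores. sklearn provides cross_val_score for this; a minimal sketch, reusing the clf and iris objects from above:

from sklearn.model_selection import cross_val_score

# evaluate the model on 5 different train/test splits (5-fold cross-validation)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores.mean(), scores.std()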
Further info