Motivation
Cross-validation is a technique for assessing how well a machine learning model generalizes to unseen data.
To validate your model, you split the available data into a training set and a test set.
-----------------------------------------------
|                              |              |
|        training data         |  test data   |
|                              |              |
-----------------------------------------------
More training data generally yields a better model, while more test data gives a more reliable estimate of its accuracy.
But because the amount of available data is limited, you have to decide on the ratio in which to split it into training and test data.
Example
Prerequisite
You need to install scikit-learn:
pip install scikit-learn
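Note that the package is installed as scikit-learn but imported under the name sklearn.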
Iris dataset
Let’s take the (in-)famous iris data set:
from sklearn import datasets
iris = datasets.load_iris()
iris.data.shape, iris.target.shape
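The shapes should come out as (150, 4) and (150,): 150 samples with 4 features each, plus one class label per sample.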
Let’s train a model. For this example we use a support vector machine (SVM) with a linear kernel:
from sklearn import svm
clf = svm.SVC(kernel='linear', C=1).fit(iris.data, iris.target)
clf.score(iris.data, iris.target)
In this case we take the complete data set as training data. We don’t hold back any test data to check the accuracy of the model, so it shouldn’t surprise us that the accuracy on the training data itself is almost 100%.
Train-Test-Split
To independently check the accuracy of your model, you can hold back a certain amount of your data for testing purposes.
sklearn takes care of that with the train_test_split function:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4)
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)
test_size is the fraction of the data reserved for testing; 0.4 means 40% of the samples end up in the test set.
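Note that train_test_split shuffles the data randomly, so the score will vary slightly from run to run. If you need a reproducible split, or want to keep the class proportions identical in both sets, you can pass the optional random_state and stratify arguments. A minimal sketch:

from sklearn.model_selection import train_test_split

# fix the random seed for a reproducible split and preserve the
# class distribution in both the training and the test set
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0, stratify=iris.target)

clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)
clf.score(X_test, y_test)

With a single split, the held-back test data is also never used for training. Cross-validation addresses this by repeating the train/test procedure on several different splits and averaging the scores. sklearn provides cross_val_score for this; a minimal sketch, reusing the clf and iris objects from above:

from sklearn.model_selection import cross_val_score

# evaluate the model on 5 different train/test splits (5-fold cross-validation)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
scores.mean(), scores.std()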
Further info