Feature Scaling

What is Feature Scaling?

Feature Scaling is an important pre-processing step for some machine learning algorithms.

Imagine you have three friends of whom you know the individual weight and height.

You would like to deduce Christian’s  t-shirt size from David’s and Julia’s by looking at the height and weight.

Name Height in m Weight in kg T-Shirt size
Julia 1.58 52 Small
David 1.79 79 Large
Christian 1.86 64 ?

One way You could determine the shirt size is to just add up the weight and the height of each friend. You would get:

Name Height + weight T-Shirt size
Julia 53.58 Small
David 80.79 Large
Christian 65.86

Because Christian’s height + weight number is nearer to Julia’s number than to David’s, Christian should wear a small T-Shirt. What?

Feature Scaling Formula

x’ = (x – xmin) / (xmax – xmin)

 

Feature min max
Height 1.58 1.86
Weight 52 79
Name Scaled Height Scaled Weight Combined Scaled Height + Weight T-Shirt size
Julia 0 0 0 Small
David 0.75 1 1.75 Large
Christian 1 0.44 1.44 ?

If we look at the combined scaled properties we see that Christian’s value now is closer to David’s so we deduce that Christian shall wear a large shirt as well.

Implementing feature scaling in python

As a little coding practice we can implement a feature scaling algorithm in Python:

def feature_scaling(arr):
    ret_arr = []
    min_val = min(arr)
    max_val = max(arr)
    if min_val == max_val:
        raise ZeroDivisionError()
    for f in arr:
        f = (f - min_val) / float((max_val - min_val))
        ret_arr.append(f)
    return ret_arr

MinMaxScaler from sklearn

Instead of writing our own feature scaler we can we should use the MinMaxScaler from sklearn. It works with numpy arrays by default.

from sklearn.preprocessing import MinMaxScaler
import numpy as np

weights = np.array([[52.0], [79.0], [64.0]])
scaler = MinMaxScaler()
rescaled_weight = scaler.fit_transform(weights)
print(rescaled_weight)

Affected Algorithms

Which algorithms are affected by non-properly scaled features?

SVM and k-means are algorithms which are affected. SVM for example calculates distances and when two features differ dramatically in value range, the feature with the greater range will dominate the other. (As seen when adding kilograms and meters)