Feature Scaling

What is Feature Scaling?

Feature Scaling is an important pre-processing step for some machine learning algorithms.

Imagine you have three friends whose height and weight you know.

You would like to deduce Christian’s T-shirt size from David’s and Julia’s by looking at their height and weight.

Name	Height in m	Weight in kg	T-Shirt size
Julia	1.58	52	Small
David	1.79	79	Large
Christian	1.86	64	?

One way you could determine the shirt size is to simply add up the height and the weight of each friend. You would get:

Name	Height + weight	T-Shirt size
Julia	53.58	Small
David	80.79	Large
Christian	65.86	?

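Because Christian’s height + weight number (65.86) is nearer to Julia’s (53.58) than to David’s (80.79), Christian should wear a small T-shirt. What? That cannot be right: Christian is the tallest of the three. The weight in kilograms is so much larger than the height in meters that the height barely influences the sum.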
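The naive approach above can be reproduced in a few lines of Python. This is only an illustrative sketch: the `friends` dictionary and the `naive_score` helper are names made up here and do not come from any library.

```python
# Height in m and weight in kg, taken from the table above.
friends = {
    "Julia": (1.58, 52.0),
    "David": (1.79, 79.0),
    "Christian": (1.86, 64.0),
}

def naive_score(name):
    """Add height and weight without any scaling (hypothetical helper)."""
    height, weight = friends[name]
    return height + weight

target = naive_score("Christian")
# Pick the friend with a known size whose score is closest to Christian's.
closest = min(["Julia", "David"], key=lambda n: abs(naive_score(n) - target))
print(closest)  # Julia: the weight in kg swamps the height in m
```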

Feature Scaling Formula

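To give every feature the same influence, we can rescale each one to the range from 0 to 1 (min-max scaling):

x’ = (x – x_min) / (x_max – x_min)

Here x_min and x_max are the smallest and the largest value of that feature. For our three friends, these are: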

Feature	min	max
Height	1.58	1.86
Weight	52	79

Name	Scaled Height	Scaled Weight	Combined Scaled Height + Weight	T-Shirt size
Julia	0	0	0	Small
David	0.75	1	1.75	Large
Christian	1	0.44	1.44	?

If we look at the combined scaled values, we see that Christian’s value (1.44) is now closer to David’s (1.75) than to Julia’s (0), so we deduce that Christian should wear a large shirt as well.
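The scaled table can be recomputed with a short script. This is a minimal sketch using only the numbers from the example; the `min_max` helper and the variable names are assumptions chosen for illustration.

```python
heights = {"Julia": 1.58, "David": 1.79, "Christian": 1.86}
weights = {"Julia": 52.0, "David": 79.0, "Christian": 64.0}

def min_max(values):
    """Rescale a dict of numbers to the range [0, 1] (hypothetical helper)."""
    lo, hi = min(values.values()), max(values.values())
    return {name: (v - lo) / (hi - lo) for name, v in values.items()}

scaled_h = min_max(heights)
scaled_w = min_max(weights)
# Sum of the two scaled features, as in the last table.
combined = {name: scaled_h[name] + scaled_w[name] for name in heights}

for name, score in combined.items():
    print(f"{name}: {score:.2f}")

# Friend with a known size whose combined value is closest to Christian's.
closest = min(["Julia", "David"],
              key=lambda n: abs(combined[n] - combined["Christian"]))
print(closest)
```

With the scaled values, the closest friend is David rather than Julia, which matches the deduction in the text.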

Implementing feature scaling in Python

As a little coding practice we can implement a feature scaling algorithm in Python:

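def feature_scaling(arr):
    """Rescale the values in arr to the range [0, 1] (min-max scaling)."""
    min_val = min(arr)
    max_val = max(arr)
    if min_val == max_val:
        # All values are equal, so the formula would divide by zero.
        raise ValueError("cannot scale: all values are identical")
    value_range = float(max_val - min_val)
    # Apply x' = (x - x_min) / (x_max - x_min) to every value.
    return [(f - min_val) / value_range for f in arr]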
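The same idea can also be written without an explicit loop. The following is a sketch of a vectorized variant, assuming NumPy is installed; `feature_scaling_np` is a hypothetical name, not an existing NumPy function.

```python
import numpy as np

def feature_scaling_np(values):
    """Min-max scale a 1-D sequence with NumPy (illustrative helper)."""
    arr = np.asarray(values, dtype=float)
    span = arr.max() - arr.min()
    if span == 0:
        # Identical values would lead to a division by zero.
        raise ValueError("cannot scale: all values are identical")
    # The subtraction and division are applied element-wise.
    return (arr - arr.min()) / span

# Weights of Julia, David and Christian in kg.
result = feature_scaling_np([52, 79, 64])
print(result)
```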

MinMaxScaler from sklearn

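Instead of writing our own feature scaler, we should use the MinMaxScaler from sklearn. It works with NumPy arrays and expects a 2-D array with one row per sample and one column per feature, which is why each weight below is wrapped in its own list.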

from sklearn.preprocessing import MinMaxScaler
import numpy as np

weights = np.array([[52.0], [79.0], [64.0]])
scaler = MinMaxScaler()
rescaled_weight = scaler.fit_transform(weights)
print(rescaled_weight)
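MinMaxScaler scales every column on its own, so height and weight can be passed in together. The sketch below assumes scikit-learn and NumPy are installed; the `features` array simply repeats the table from the beginning of the post, and `data_min_` and `data_max_` are the per-column values the scaler stores after fitting.

```python
from sklearn.preprocessing import MinMaxScaler
import numpy as np

# One row per friend; columns are height in m and weight in kg.
features = np.array([
    [1.58, 52.0],  # Julia
    [1.79, 79.0],  # David
    [1.86, 64.0],  # Christian
])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(features)  # each column is scaled independently

print(scaler.data_min_)  # per-column minimum found during fit
print(scaler.data_max_)  # per-column maximum found during fit
print(scaled.round(2))
```

Once fitted, the same scaler can be reused on new rows with `scaler.transform(...)`, so that later data is scaled with the minimum and maximum learned here.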

Affected Algorithms

Which algorithms are affected by improperly scaled features?

SVM and k-means are two examples. Both work with distances between data points, and when two features differ dramatically in value range, the feature with the greater range will dominate the other (as seen above when adding kilograms and meters).
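To make the effect on distances concrete, here is a small sketch that compares Euclidean distances before and after scaling. It uses `math.dist` from the standard library (Python 3.8 or newer); the `scale` helper is hypothetical and hard-codes the minimum and maximum from the example table.

```python
import math

# (height in m, weight in kg) from the example table.
julia, david, christian = (1.58, 52.0), (1.79, 79.0), (1.86, 64.0)

# Unscaled: the height difference barely contributes to the distance.
print(math.dist(christian, julia))
print(math.dist(christian, david))

def scale(point):
    """Min-max scale one point with the table's min and max (hypothetical helper)."""
    height, weight = point
    return ((height - 1.58) / (1.86 - 1.58), (weight - 52.0) / (79.0 - 52.0))

# Scaled: both features now vary over a comparable range.
print(math.dist(scale(christian), scale(julia)))
print(math.dist(scale(christian), scale(david)))
```

Unscaled, Christian is nearer to Julia because the distance is almost entirely the weight difference. After scaling, he is nearer to David.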
