What is Feature Scaling?
Feature Scaling is an important pre-processing step for some machine learning algorithms.
Imagine you have three friends whose height and weight you know.
You would like to deduce Christian’s T-shirt size from David’s and Julia’s by looking at their height and weight.
Name | Height in m | Weight in kg | T-Shirt size |
---|---|---|---|
Julia | 1.58 | 52 | Small |
David | 1.79 | 79 | Large |
Christian | 1.86 | 64 | ? |
One way you could determine the shirt size is to simply add up the weight and the height of each friend. You would get:
Name | Height + weight | T-Shirt size |
---|---|---|
Julia | 53.58 | Small |
David | 80.79 | Large |
Christian | 65.86 | ? |
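Because Christian’s height + weight sum is closer to Julia’s than to David’s, Christian should wear a small T-shirt. What? That cannot be right: Christian is the tallest of the three. The sum is dominated by the weight, because the values in kilograms are far larger than the values in meters, so the height barely counts.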
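The naive comparison can be reproduced in a few lines of Python. This is only a minimal sketch: the dictionary `friends` and the helper `nearest_by_sum` are names made up for this illustration.

```python
# Heights in m and weights in kg, taken from the table above.
friends = {
    "Julia": (1.58, 52.0, "Small"),
    "David": (1.79, 79.0, "Large"),
}
christian = (1.86, 64.0)


def nearest_by_sum(person, known):
    """Return the shirt size of the known friend whose height + weight is closest."""
    target = person[0] + person[1]
    closest = min(known.values(), key=lambda f: abs((f[0] + f[1]) - target))
    return closest[2]


naive_size = nearest_by_sum(christian, friends)
print(naive_size)  # the weight in kg swamps the height in m
```

The difference in weight (12 kg to Julia, 15 kg to David) decides the outcome on its own; the height differences of a few centimeters are too small to matter.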
Feature Scaling Formula
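Min-max scaling maps every feature to the range from 0 to 1, where x_min and x_max are the smallest and largest values of that feature:
x' = (x - x_min) / (x_max - x_min)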
Feature | min | max |
---|---|---|
Height | 1.58 | 1.86 |
Weight | 52 | 79 |
Name | Scaled Height | Scaled Weight | Combined Scaled Height + Weight | T-Shirt size |
---|---|---|---|---|
Julia | 0 | 0 | 0 | Small |
David | 0.75 | 1 | 1.75 | Large |
Christian | 1 | 0.44 | 1.44 | ? |
If we look at the combined scaled values, we see that Christian’s value is now closer to David’s, so we deduce that Christian should wear a large shirt as well.
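The numbers in the table can be checked by applying the formula to each value directly. The sketch below assumes the minimum and maximum per feature are already known; the helper `scale` and the constant names are illustrative only.

```python
def scale(x, x_min, x_max):
    """Apply x' = (x - x_min) / (x_max - x_min) to a single value."""
    return (x - x_min) / (x_max - x_min)


# Minimum and maximum per feature, from the table above.
HEIGHT_MIN, HEIGHT_MAX = 1.58, 1.86  # m
WEIGHT_MIN, WEIGHT_MAX = 52.0, 79.0  # kg

people = {
    "Julia": (1.58, 52.0),
    "David": (1.79, 79.0),
    "Christian": (1.86, 64.0),
}

# Sum of the two scaled features for each person.
combined = {
    name: scale(h, HEIGHT_MIN, HEIGHT_MAX) + scale(w, WEIGHT_MIN, WEIGHT_MAX)
    for name, (h, w) in people.items()
}

for name, value in combined.items():
    print(f"{name}: {value:.2f}")
```

After scaling, both features contribute on the same 0 to 1 scale, so Christian's height now counts as much as his weight.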
Implementing feature scaling in Python
As a little coding practice we can implement a feature scaling algorithm in Python:
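    def feature_scaling(arr):
        ret_arr = []
        min_val = min(arr)
        max_val = max(arr)
        # If all values are equal, the denominator would be zero.
        if min_val == max_val:
            raise ZeroDivisionError()
        for f in arr:
            # float() forces true division under Python 2 as well.
            f = (f - min_val) / float(max_val - min_val)
            ret_arr.append(f)
        return ret_arr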
MinMaxScaler from sklearn
Instead of writing our own feature scaler, we can (and should) use the MinMaxScaler from sklearn. It works with NumPy arrays by default.
    from sklearn.preprocessing import MinMaxScaler
    import numpy as np

    # One column (weight in kg), one row per person: Julia, David, Christian
    weights = np.array([[52.0], [79.0], [64.0]])
    scaler = MinMaxScaler()
    rescaled_weight = scaler.fit_transform(weights)
    print(rescaled_weight)
Affected Algorithms
Which algorithms are affected by improperly scaled features?
SVM and k-means are two examples. Both rely on distances between data points, and when two features differ dramatically in value range, the feature with the greater range dominates the other (as seen when adding kilograms and meters).
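The same effect shows up with Euclidean distance, the kind of measure such algorithms work with. The following is a rough sketch using the three friends from above rather than a real SVM or k-means run; `euclidean` and `min_max_columns` are helper names invented for this example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_columns(rows):
    """Min-max scale every column (feature) of a list of rows to [0, 1]."""
    columns = list(zip(*rows))
    scaled_columns = [
        [(v - min(col)) / (max(col) - min(col)) for v in col]
        for col in columns
    ]
    return [tuple(row) for row in zip(*scaled_columns)]


# (height in m, weight in kg)
julia, david, christian = (1.58, 52.0), (1.79, 79.0), (1.86, 64.0)

# Distances on the raw features: the weight column dominates.
raw_to_julia = euclidean(christian, julia)
raw_to_david = euclidean(christian, david)

# Distances after scaling both features to [0, 1].
s_julia, s_david, s_christian = min_max_columns([julia, david, christian])
scaled_to_julia = euclidean(s_christian, s_julia)
scaled_to_david = euclidean(s_christian, s_david)

print("raw:   ", "Julia" if raw_to_julia < raw_to_david else "David")
print("scaled:", "Julia" if scaled_to_julia < scaled_to_david else "David")
```

On the raw values Christian's nearest neighbor is Julia, because the distance is almost entirely the weight difference. After scaling, David is the nearest neighbor.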