Data Science Datasets: Iris flower data set

Motivation

When you start learning data science, acquiring data is often the first step.

To get you started, scikit-learn ships with several so-called “toy datasets”. One of them is the Iris dataset.

Prerequisites & Imports

Besides scikit-learn, we will use pandas for data handling and matplotlib with seaborn for visualization. Install them with pip:

pip install scikit-learn pandas seaborn matplotlib

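import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn import datasets

# Use the 'husl' color palette for all seaborn plots
sns.set_palette('husl')

# Jupyter/IPython magic: render plots inline (omit this line in a plain Python script)
%matplotlib inline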

Iris data set

The Iris flower data set, also known as Fisher’s Iris data set, has become a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species.

This data set can be loaded from scikit-learn as follows:

iris = datasets.load_iris()
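
Before converting anything, it can help to look at what was loaded. The following is a minimal sketch, assuming a reasonably recent scikit-learn version; it only reads fields of the object returned by load_iris, which can be accessed like a dictionary.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The returned object can be accessed like a dictionary
print(sorted(iris.keys()))

# 'data' is a 2-D array: one row per flower, one column per measurement
n_samples, n_features = iris['data'].shape
print(n_samples, n_features)

# Column names of the measurements and names of the three species
feature_names = list(iris['feature_names'])
target_names = list(iris['target_names'])
print(feature_names)
print(target_names)
```

The 'target' field holds the species as integer codes that index into 'target_names'.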

Convert to pandas DataFrame

To work with the dataset more conveniently, we convert it into a pandas DataFrame and replace the numeric species codes with readable names.

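# One row per flower, one column per measurement
df = pd.DataFrame(
    iris['data'],
    columns=iris['feature_names']
)
# Add the species column and map the integer codes to names
df['species'] = iris['target']
df['species'] = df['species'].map({
    0: 'Iris-setosa',
    1: 'Iris-versicolor',
    2: 'Iris-virginica'
})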
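
A quick sanity check after the conversion might look like the sketch below. It rebuilds the same DataFrame so that it runs on its own, then checks the shape, the number of rows per species, and whether the mapping left any missing values. The variable names counts and missing are only illustrative.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Same conversion as above, repeated so this snippet is self-contained
df = pd.DataFrame(iris['data'], columns=iris['feature_names'])
df['species'] = iris['target']
df['species'] = df['species'].map({
    0: 'Iris-setosa',
    1: 'Iris-versicolor',
    2: 'Iris-virginica'
})

# First rows, to confirm the column layout
print(df.head())

# Number of rows per species
counts = df['species'].value_counts()
print(counts)

# A code missing from the mapping would show up as NaN here
missing = int(df.isna().sum().sum())
print(missing)
```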

Data visualization

Seaborn offers a convenient way to explore data visually with the pairplot function.

It plots every feature against every other feature, one pair per panel, and colors the points by species:

g = sns.pairplot(df, hue='species', markers='+')
plt.show()
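
To make the pairwise comparison concrete, the sketch below lists the feature pairs that appear in such a grid. It does not use seaborn. It only enumerates combinations of the feature names with itertools, as an illustration of which panels to expect.

```python
from itertools import combinations

from sklearn import datasets

iris = datasets.load_iris()
features = iris['feature_names']

# The grid has one row and one column per feature
n_panels = len(features) ** 2

# Each unordered pair of distinct features appears twice in the grid,
# once above and once below the diagonal
pairs = list(combinations(features, 2))
for x, y in pairs:
    print(f'{x}  vs  {y}')

print(n_panels, len(pairs))
```

With four features, the grid has 16 panels: 6 distinct feature pairs shown twice (mirrored across the diagonal), plus 4 diagonal panels that show each feature's own distribution.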