Iris

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils

2. Analysis

In this section we analyse the iris dataset. We start by reading the accompanying data docs. The documentation gives us a good starting point: it describes the schema of the dataset (4 numerical features and 1 categorical feature), states that there are no missing values, and provides descriptive statistics for the features. Still, there is no harm in verifying these ourselves!

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

iris = pd.read_csv('../data/data/iris.csv')
iris.head()
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa

Next we look at the features and their dtypes.

iris.columns
Index(['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class'], dtype='object')
iris.dtypes
sepal-length    float64
sepal-width     float64
petal-length    float64
petal-width     float64
class            object
dtype: object

All features except class are numerical and continuous. Since class is our target, we convert it to a category dtype before further exploration.

categorical_features = ['class']

iris = utils.to_categorical(iris, categorical_features)
iris.dtypes
sepal-length     float64
sepal-width      float64
petal-length     float64
petal-width      float64
class           category
dtype: object
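
The helper utils.to_categorical lives in this project's src package and its body is not shown here. As a rough sketch only, assuming it does nothing more than cast the listed columns to the category dtype and return a new frame, a stand-in could look like this (the function below is hypothetical, not the project's actual code):

```python
import pandas as pd


def to_categorical(df, columns):
    """Hypothetical stand-in for utils.to_categorical: cast columns to category."""
    # astype with a dict returns a new DataFrame and leaves the input untouched
    return df.astype({col: 'category' for col in columns})


# tiny frame built from rows shown in the head() output above
sample = pd.DataFrame({
    'petal-length': [1.4, 1.4, 1.3],
    'class': ['Iris-setosa', 'Iris-setosa', 'Iris-setosa'],
})
converted = to_categorical(sample, ['class'])
print(converted.dtypes)
```

Returning a new frame rather than mutating in place matches how the notebook reassigns iris to the helper's result.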

Next, descriptive statistics.

iris.describe(include='all')
        sepal-length  sepal-width  petal-length  petal-width        class
count     150.000000   150.000000    150.000000   150.000000          150
unique           NaN          NaN           NaN          NaN            3
top              NaN          NaN           NaN          NaN  Iris-setosa
freq             NaN          NaN           NaN          NaN           50
mean        5.843333     3.054000      3.758667     1.198667          NaN
std         0.828066     0.433594      1.764420     0.763161          NaN
min         4.300000     2.000000      1.000000     0.100000          NaN
25%         5.100000     2.800000      1.600000     0.300000          NaN
50%         5.800000     3.000000      4.350000     1.300000          NaN
75%         6.400000     3.300000      5.100000     1.800000          NaN
max         7.900000     4.400000      6.900000     2.500000          NaN

We have 150 examples and, since every column has a count of 150, no missing values. There are 3 unique classes, each with 50 examples. Next, we run the usual checks: missing values, duplicate rows and correlations in the dataset.

iris.isna().any()
sepal-length    False
sepal-width     False
petal-length    False
petal-width     False
class           False
dtype: bool
iris[iris.duplicated()].shape
(3, 5)

It looks like there are 3 duplicate rows, so let's investigate them. By default duplicated() flags only the repeats, not the first occurrence. Since we want to see every row that belongs to a duplicate group, we pass keep=False.

iris[iris.duplicated(keep=False)]
     sepal-length  sepal-width  petal-length  petal-width           class
9             4.9          3.1           1.5          0.1     Iris-setosa
34            4.9          3.1           1.5          0.1     Iris-setosa
37            4.9          3.1           1.5          0.1     Iris-setosa
101           5.8          2.7           5.1          1.9  Iris-virginica
142           5.8          2.7           5.1          1.9  Iris-virginica
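
To make the effect of the keep argument concrete, here is a small sketch on a toy frame (the data is made up for illustration and is not taken from iris). With the default keep='first' only the repeats are flagged, which is why the first count was 3 while five rows appear above.

```python
import pandas as pd

# toy frame: rows 0, 1 and 3 are identical, row 2 is unique
df = pd.DataFrame({'a': [1, 1, 2, 1], 'b': ['x', 'x', 'y', 'x']})

first = df.duplicated()             # default keep='first': flags repeats only
every = df.duplicated(keep=False)   # flags every member of a duplicate group

print(first.tolist())   # [False, True, False, True]
print(every.tolist())   # [True, True, False, True]

# drop_duplicates() uses the same default and keeps the first occurrence
print(len(df.drop_duplicates()))   # 2
```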

Okay, let's drop the duplicate rows, keeping only the first occurrence of each, before further analysis.

iris = iris.drop_duplicates()
iris.shape
(147, 5)

Next we check the correlations between the numerical features.

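# restrict to the numeric columns: newer pandas versions raise on the non-numeric class column
corr = iris.select_dtypes('number').corr()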
name = 'heatmap@iris--corr.png'
utils.corr(corr, name)
name

Several features are positively correlated with one another, which may present an opportunity for feature selection.
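
One possible way to follow up on the feature selection idea is to list the feature pairs whose absolute correlation exceeds a cut-off. The sketch below is only an illustration: the correlated_pairs helper and the 0.8 threshold are assumptions, and the data is synthetic rather than the iris measurements.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for pairs with |r| above threshold."""
    corr = df.select_dtypes('number').corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):   # upper triangle only, skip the diagonal
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# synthetic data, not the iris measurements
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = pd.DataFrame({
    'x': x,
    'x_scaled': 2 * x + rng.normal(scale=0.1, size=200),  # nearly collinear with x
    'noise': rng.normal(size=200),                         # unrelated to x
})
print(correlated_pairs(toy))
```

On the real data the same function could be called on iris directly; one feature from each reported pair would then be a candidate for removal.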

2.2. Distributions & relational analysis

In this section we analyse the distributions and relations of the features. Let's start with a histogram of all features to see the distribution of the values. We additionally overlay the continuous features with a kde plot.

name = 'histplot@iris.png'
fig, axs = plt.subplots(1, 5, figsize=(20, 5), sharey=True)

sns.histplot(data=iris, x='sepal-length', kde=True, ax=axs[0])
sns.histplot(data=iris, x='sepal-width', kde=True, ax=axs[1])
sns.histplot(data=iris, x='petal-length', kde=True, ax=axs[2])
sns.histplot(data=iris, x='petal-width', kde=True, ax=axs[3])
sns.histplot(data=iris, x='class', discrete=True, ax=axs[4])

fig.savefig(name)
name

histplot@iris.png

sepal-{length,width} look approximately normally distributed, whereas petal-{length,width} are bimodal. We may want to consider standardising the data for models that are sensitive to feature scale. The classes are nearly balanced: dropping the duplicates left 48 setosa, 50 versicolor and 49 virginica examples.
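
Standardising here means rescaling each feature to zero mean and unit standard deviation (a z-score). A minimal sketch using plain pandas on a few petal-length values copied from the rows shown earlier; the choice of pandas over a dedicated scaler is an assumption made for illustration:

```python
import pandas as pd

# a few petal-length values taken from the rows shown earlier
values = pd.DataFrame({'petal-length': [1.4, 1.3, 1.5, 5.1]})

# z-score: subtract the column mean and divide by the column standard deviation
standardised = (values - values.mean()) / values.std()

print(standardised.mean().round(6))
print(standardised.std().round(6))
```

Note that DataFrame.std() uses the sample standard deviation (ddof=1) by default, so the result differs slightly from scalers that divide by the population standard deviation.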

Let's look at the distribution of the features per class. Since the number of examples per class is no longer equal after dropping the duplicates, we normalise each class's histogram independently.

name = 'histplot@iris--hue:class.png'
fig, axs = plt.subplots(1, 4, figsize=(20, 5), sharey=True)

sns.histplot(data=iris, x='sepal-length', hue='class', stat='density', common_norm=False, ax=axs[0])
sns.histplot(data=iris, x='sepal-width', hue='class', stat='density', common_norm=False, ax=axs[1])
sns.histplot(data=iris, x='petal-length', hue='class', stat='density', common_norm=False, ax=axs[2])
sns.histplot(data=iris, x='petal-width', hue='class', stat='density', common_norm=False, ax=axs[3])

fig.savefig(name)
name

histplot@iris--hue:class.png
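
With stat='density' and common_norm=False, each class's histogram is normalised on its own, so the bars of every class cover an area of 1 regardless of how many examples the class has. The sketch below reproduces that idea with numpy on synthetic groups of unequal size (the group names, sizes and distributions are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic groups of unequal size, standing in for per-class feature values
groups = {
    'small': rng.normal(0.0, 1.0, size=48),
    'large': rng.normal(3.0, 1.0, size=50),
}

bins = np.linspace(-5, 9, 29)   # shared bin edges so the groups are comparable
areas = {}
for label, values in groups.items():
    density, edges = np.histogram(values, bins=bins, density=True)
    areas[label] = float(np.sum(density * np.diff(edges)))  # area under the bars

print(areas)
```

Both areas come out as 1 (up to floating point error), which is what makes the per-class shapes directly comparable.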

I experimented with several variations for visualising the data (altering the multiple and element arguments of histplot) but did not find one that made it easy to derive insights from the histograms. As an alternative, let's look at the kdeplot instead.

name = 'kdeplot@iris--hue:class.png'
fig, axs = plt.subplots(1, 4, figsize=(20, 5), sharey=True)

sns.kdeplot(data=iris, x='sepal-length', hue='class', common_norm=False, ax=axs[0])
sns.kdeplot(data=iris, x='sepal-width', hue='class', common_norm=False, ax=axs[1])
sns.kdeplot(data=iris, x='petal-length', hue='class', common_norm=False, ax=axs[2])
sns.kdeplot(data=iris, x='petal-width', hue='class', common_norm=False, ax=axs[3])

fig.savefig(name)
name

kdeplot@iris--hue:class.png

The kdeplot is a bit easier to interpret. Within each class, every feature looks approximately normally distributed, but the class distributions are not centred on the same values. The last two figures also explain the bimodality we observed for petal-{length,width} in the first histogram: the pooled distribution mixes class-conditional distributions with well-separated centres.
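
The different centres can also be checked numerically with a per-class mean. The sketch below uses only the handful of petal-length rows shown earlier (the head() output and the duplicated rows), so the numbers are illustrative and not the class means of the full dataset:

```python
import pandas as pd

# rows copied from the head() and duplicated() outputs above
sample = pd.DataFrame({
    'petal-length': [1.4, 1.4, 1.3, 1.5, 1.4, 5.1, 5.1],
    'class': ['Iris-setosa'] * 5 + ['Iris-virginica'] * 2,
})

# one centre per class; pooling the classes would mix these into one distribution
centres = sample.groupby('class')['petal-length'].mean()
print(centres)
```

On the full frame the same groupby call would give one centre per class for every feature.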

Next, let's look at the scatter plots for all pairs of features.

name = 'pairplot@iris.png'
g = sns.pairplot(data=iris, diag_kind=None, corner=True)
g.savefig(name)
name

pairplot@iris.png

We can see two distinct clusters in all plots. Some plots also show a trend that suggests a linear relationship between the two features involved. We may want to investigate these specific pairs in more detail, or compare them against the correlation matrix. Let's look at the same pairplot, this time conditioned on the class.

name = 'pairplot@iris--hue:class.png'
g = sns.pairplot(data=iris, diag_kind=None, hue='class', corner=True)
g.savefig(name)
name

pairplot@iris--hue:class.png

We note that the setosa class is linearly separable from the other two classes, which should make it easy for a model to identify. The versicolor and virginica classes, however, overlap and are harder to distinguish from one another.
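
Linear separability means a single linear rule is enough to pick out setosa. As a hedged illustration, the sketch below applies a one-feature threshold to the few rows shown earlier in this document. The cut-off of 2.5 on petal-length is an assumed value chosen for illustration and should be checked against the pairplot before being relied on.

```python
import pandas as pd

# rows copied from the head() and duplicated() outputs above
sample = pd.DataFrame({
    'petal-length': [1.4, 1.4, 1.3, 1.5, 1.4, 5.1, 5.1],
    'class': ['Iris-setosa'] * 5 + ['Iris-virginica'] * 2,
})

THRESHOLD = 2.5  # assumed cut-off, for illustration only

predicted_setosa = sample['petal-length'] < THRESHOLD
actual_setosa = sample['class'] == 'Iris-setosa'

# True when the threshold rule agrees with the label on every sample row
print(bool((predicted_setosa == actual_setosa).all()))
```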

Created: 2021-10-22 Fri 21:59