Iris
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils
2. Analysis
In this section we analyse the iris dataset. We start by reading the accompanying data docs. The documentation gives us a good starting point: it tells us the schema of the dataset (4 numerical features and 1 categorical feature), states that there are no missing values, and provides descriptive statistics for the features. Still, no harm in verifying these ourselves!
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
iris = pd.read_csv('../data/data/iris.csv')
iris.head()
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
Next we look at the features and their dtypes.
iris.columns
Index(['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class'], dtype='object')
iris.dtypes
sepal-length    float64
sepal-width     float64
petal-length    float64
petal-width     float64
class            object
dtype: object
All four measurement features are numerical and continuous. Since class is our target, we convert it to a category dtype before further exploration.
categorical_features = ['class']
iris = utils.to_categorical(iris, categorical_features)
iris.dtypes
sepal-length     float64
sepal-width      float64
petal-length     float64
petal-width      float64
class           category
dtype: object
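For reference, utils.to_categorical is a project helper imported in the init section; presumably it is a thin wrapper around astype('category'). A minimal sketch, assuming that behaviour:

def to_categorical(df, columns):
    # Return a copy of df with the given columns converted to the 'category' dtype.
    df = df.copy()
    for col in columns:
        df[col] = df[col].astype('category')
    return df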
Next, descriptive statistics.
iris.describe(include='all')
        sepal-length  sepal-width  petal-length  petal-width        class
count     150.000000   150.000000    150.000000   150.000000          150
unique           NaN          NaN           NaN          NaN            3
top              NaN          NaN           NaN          NaN  Iris-setosa
freq             NaN          NaN           NaN          NaN           50
mean        5.843333     3.054000      3.758667     1.198667          NaN
std         0.828066     0.433594      1.764420     0.763161          NaN
min         4.300000     2.000000      1.000000     0.100000          NaN
25%         5.100000     2.800000      1.600000     0.300000          NaN
50%         5.800000     3.000000      4.350000     1.300000          NaN
75%         6.400000     3.300000      5.100000     1.800000          NaN
max         7.900000     4.400000      6.900000     2.500000          NaN
We have 150 examples, with no missing values. There are 3 unique classes, each with 50 examples. Next, we check the usual stuff: missing values, duplicate values and correlations in the dataset.
iris.isna().any()
sepal-length    False
sepal-width     False
petal-length    False
petal-width     False
class           False
dtype: bool
iris[iris.duplicated()].shape
(3, 5)
Looks like there are 3 duplicate rows; let's investigate them! Since we want to see all rows that are duplicated, we pass the keep argument to duplicated() and set it to False.
iris[iris.duplicated(keep=False)]
     sepal-length  sepal-width  petal-length  petal-width           class
9             4.9          3.1           1.5          0.1     Iris-setosa
34            4.9          3.1           1.5          0.1     Iris-setosa
37            4.9          3.1           1.5          0.1     Iris-setosa
101           5.8          2.7           5.1          1.9  Iris-virginica
142           5.8          2.7           5.1          1.9  Iris-virginica
Okay, let's drop the duplicate rows (keeping only the first occurrence of each) before further analysis.
iris = iris.drop_duplicates()
iris.shape
(147, 5)
Next we check the correlations between the numerical features.
corr = iris.corr()
name = 'heatmap@iris--corr.png'
utils.corr(corr, name)
name
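Like to_categorical, utils.corr is a project helper; presumably it renders the correlation matrix with sns.heatmap and saves the figure under the given name. A minimal sketch under that assumption:

def corr_heatmap(corr, name):
    # Plot the correlation matrix as an annotated heatmap and save it to disk.
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', vmin=-1, vmax=1, square=True, ax=ax)
    fig.savefig(name)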
Several features are positively correlated with one another; this may present an opportunity for feature selection.
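As a rough illustration of that feature-selection angle (a sketch, reusing the corr DataFrame computed above), we could rank the feature pairs by the strength of their correlation:

# Keep only the upper triangle (each pair once), then sort pairs by absolute correlation.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
pairs.reindex(pairs.abs().sort_values(ascending=False).index)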
2.2. Distributions & relational analysis
In this section we analyse the distributions and relationships of the features. Let's start with a histogram of each feature to see the distribution of its values. We additionally overlay the continuous features with a KDE plot.
name = 'histplot@iris.png'
fig, axs = plt.subplots(1, 5, figsize=(20, 5), sharey=True)
sns.histplot(data=iris, x='sepal-length', kde=True, ax=axs[0])
sns.histplot(data=iris, x='sepal-width', kde=True, ax=axs[1])
sns.histplot(data=iris, x='petal-length', kde=True, ax=axs[2])
sns.histplot(data=iris, x='petal-width', kde=True, ax=axs[3])
sns.histplot(data=iris, x='class', discrete=True, ax=axs[4])
fig.savefig(name)
name
sepal-{length,width} are approximately normally distributed, whereas petal-{length,width} are bimodal. We may want to consider standardising the data for certain models. The classes have a roughly equal number of examples each.
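A minimal sketch of what such standardisation could look like (assuming scikit-learn is available; the column list mirrors the features above):

from sklearn.preprocessing import StandardScaler

numerical_features = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']
scaler = StandardScaler()
iris_scaled = iris.copy()
iris_scaled[numerical_features] = scaler.fit_transform(iris[numerical_features])
iris_scaled[numerical_features].describe()  # means ~0, standard deviations ~1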
Let's look at the distribution of the features per class. Since the numbers of examples per class are no longer exactly equal after dropping duplicates, we normalise the histograms.
name = 'histplot@iris--hue:class.png'
fig, axs = plt.subplots(1, 4, figsize=(20, 5), sharey=True)
sns.histplot(data=iris, x='sepal-length', hue='class', stat='density', common_norm=False, ax=axs[0])
sns.histplot(data=iris, x='sepal-width', hue='class', stat='density', common_norm=False, ax=axs[1])
sns.histplot(data=iris, x='petal-length', hue='class', stat='density', common_norm=False, ax=axs[2])
sns.histplot(data=iris, x='petal-width', hue='class', stat='density', common_norm=False, ax=axs[3])
fig.savefig(name)
name
I experimented with several permutations for visualising the data (altering the multiple and element arguments of histplot), but did not find one that was ideal for deriving insights from the histograms. Let's look at the kdeplot instead.
name = 'kdeplot@iris--hue:class.png'
fig, axs = plt.subplots(1, 4, figsize=(20, 5), sharey=True)
sns.kdeplot(data=iris, x='sepal-length', hue='class', common_norm=False, ax=axs[0])
sns.kdeplot(data=iris, x='sepal-width', hue='class', common_norm=False, ax=axs[1])
sns.kdeplot(data=iris, x='petal-length', hue='class', common_norm=False, ax=axs[2])
sns.kdeplot(data=iris, x='petal-width', hue='class', common_norm=False, ax=axs[3])
fig.savefig(name)
name
The kdeplot is a bit easier to interpret. We note that all features are approximately normally distributed within each class, but they are not centred around the same values. The last two panels also explain why we observed bimodality for petal-{length,width} in the previous figure.
Next, let's look at the scatter plots for all pairs of features.
name = 'pairplot@iris.png'
g = sns.pairplot(data=iris, diag_kind=None, corner=True)
g.savefig(name)
name
We can see two distinct clusters in all plots. We can also see trends in some of the panels indicating the presence of a linear relationship between certain pairs of features. We may want to investigate these pairs in more detail, or compare against the correlation matrix. Let's look at the same pairplot, this time conditioned on class.
name = 'pairplot@iris--hue:class.png'
g = sns.pairplot(data=iris, diag_kind=None, hue='class', corner=True)
g.savefig(name)
name
We note that the setosa class is linearly separable from the other two classes. This should make it easier for a model to learn. The versicolour and virginica classes, however, are not as easy to distinguish from one another.
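As a quick sanity check of the separability observation (a sketch; it only looks at a single feature), we can verify that the petal-length ranges of setosa and the other two classes do not overlap:

# True if a single threshold on petal-length already isolates setosa.
is_setosa = iris['class'] == 'Iris-setosa'
iris.loc[is_setosa, 'petal-length'].max() < iris.loc[~is_setosa, 'petal-length'].min()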