Wine
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils
2. Analysis
In this section we analyse the wine dataset. We start by reading the accompanying data docs. The docs are not particularly useful: they give us the names of the features but no information on the target. They do give some preliminary information about the distribution of the classes, which are approximately equal. Looks like we have to dig into the data ourselves!
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions for any data analysis task.
wine = pd.read_csv('../data/data/wine.csv')
wine.head()
Let's take a look at the features next along with their dtypes.
wine.columns
wine.dtypes
All features are numerical and continuous. Since label is our target, we should convert it to a category before we proceed any further.
categorical_features = ['label']
wine = utils.to_categorical(wine, categorical_features)
wine.dtypes
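The source of utils.to_categorical is not shown here, so the following is only a minimal sketch of what such a helper might do, assuming it casts the listed columns to the pandas category dtype and returns the frame. The function body and the toy values are assumptions for illustration.

```python
import pandas as pd


def to_categorical(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical stand-in for utils.to_categorical.

    Returns a copy of df with the given columns cast to the category dtype.
    """
    out = df.copy()
    for column in columns:
        out[column] = out[column].astype("category")
    return out


# toy frame standing in for the wine data (values are made up)
toy = pd.DataFrame({"alcohol": [13.2, 12.4, 13.9], "label": [0, 1, 2]})
toy = to_categorical(toy, ["label"])
print(toy.dtypes)
```

Working on a copy keeps the original frame untouched, which makes it safe to re-run the cell.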
Next, we check the descriptive statistics.
wine.describe(include='all')
Next, we check the usual stuff: missing values, duplicate values and correlations in the dataset.
wine.isna().any()
wine[wine.duplicated()].shape
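The two checks above return a boolean per column and the shape of the duplicated rows. As a small sketch on made-up data (not the wine dataset), the same questions can also be answered with counts, which are easier to read when something does turn up.

```python
import numpy as np
import pandas as pd

# toy frame with one missing value and one repeated row (values are made up)
toy = pd.DataFrame(
    {
        "alcohol": [13.2, 12.4, np.nan, 12.4],
        "ash": [2.1, 2.3, 2.5, 2.3],
    }
)

missing_per_column = toy.isna().sum()       # number of NaNs in each column
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print(missing_per_column)
print(n_duplicates)
```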
No missing or duplicate values. Next we check the correlations between the numerical features. Since utils.corr is an internal helper that we have defined, it clears the plot memory itself, so no explicit cleanup is needed here.
corr = wine.corr()
name = 'heatmap@wine--corr.png'
utils.corr(corr, name)
name
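The implementation of utils.corr is not part of this document. A rough sketch of a helper with the behaviour described (draw a heatmap, save it, release the figure) could look like the following. The name corr_heatmap, the colour settings, and the toy data are assumptions; the sketch also selects the numerical columns explicitly so that the categorical label stays out of the correlation matrix.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, write to file only
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


def corr_heatmap(corr: pd.DataFrame, path: str) -> None:
    """Hypothetical stand-in for utils.corr: draw, save, then close the figure."""
    fig, ax = plt.subplots(figsize=(6, 5))
    sns.heatmap(corr, vmin=-1, vmax=1, cmap="coolwarm", annot=True, ax=ax)
    fig.savefig(path, bbox_inches="tight")
    plt.close(fig)  # release the figure so repeated calls do not pile up in memory


# toy numerical data standing in for the wine features (random values)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=["a", "b", "c"])
toy["label"] = pd.Categorical(rng.integers(0, 3, size=50))

# keep only numerical columns so the categorical label is excluded
toy_corr = toy.select_dtypes(include="number").corr()

out_path = os.path.join(tempfile.mkdtemp(), "heatmap@toy--corr.png")
corr_heatmap(toy_corr, out_path)
```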
Next we conduct a relational analysis of the features. To start, we can check the distribution of the features w.r.t the labels.
2.2. Distributions & relational analysis
In this section we analyse the distributions and relations of the features. We start by plotting a pairplot of all features. By default, the pairplot uses a scatterplot for the off-diagonal axes, but here we use a lineplot instead.
name = 'pairplot@wine--lineplot-kdeplot.png'
g = sns.PairGrid(data=wine, corner=True)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.lineplot)
g.savefig(name)
utils.clear_seaborn_memory(g, plt)
name
This was not as useful as I had hoped. Let's try the scatterplot next for the off-diagonal axes but keep the kdeplot for the diagonal.
name = 'pairplot@wine--scatterplot-kdeplot.png'
g = sns.pairplot(data=wine, corner=True, diag_kind='kde')
g.savefig(name)
utils.clear_seaborn_memory(g, plt)
name
name = 'pairplot@wine--scatterplot-kdeplot-hue.png'
g = sns.pairplot(data=wine, corner=True, diag_kind='kde', hue='label')
g.savefig(name)
utils.clear_seaborn_memory(g, plt)
name
numerical_features = list(wine.columns)
numerical_features.remove('label')
name = 'boxplot@wine--label-all.png'
g = sns.PairGrid(
    data=wine,
    y_vars='label',
    x_vars=numerical_features
)
g.map(sns.boxplot)
g.savefig(name)
name
The distribution per class is similar across the features. There are outliers in the dataset.
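To put a number on the outliers seen in the boxplots, one option is to count the points that fall outside 1.5 times the interquartile range, which is the default whisker rule of the boxplot. This is a sketch on made-up values, not a result from the wine data; count_iqr_outliers is a name introduced here for illustration.

```python
import pandas as pd


def count_iqr_outliers(values: pd.Series, whis: float = 1.5) -> int:
    """Count points outside [Q1 - whis * IQR, Q3 + whis * IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    outside = (values < q1 - whis * iqr) | (values > q3 + whis * iqr)
    return int(outside.sum())


# toy data (made-up values): class 0 contains one extreme point
toy = pd.DataFrame(
    {
        "alcohol": [12.0, 12.5, 13.0, 13.5, 14.0, 25.0, 11.0, 11.5, 12.0, 12.5],
        "label": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    }
)

# one count per class for a single feature
outliers_per_class = toy.groupby("label")["alcohol"].apply(count_iqr_outliers)
print(outliers_per_class)
```

The same function could be applied to every numerical column in turn to get a per-feature, per-class table.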
It would be better to see a density plot (kdeplot) for all features next, both with and without label. I find the pairplot a
# NOTE: `name` is not reassigned in this cell, so it reuses the filename from the previous cell
g = sns.PairGrid(data=wine)
g.map_diag(sns.kdeplot)
g.savefig(name)
name
Let's check the distribution of the label column next.
name = 'displot@wine--label.png'
g = sns.displot(data=wine, x=wine.label)
g.savefig(name)
name
We have more-or-less the same number of examples per class.
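The class balance can also be read off as numbers rather than from the plot. A minimal sketch using value_counts, on made-up labels rather than the wine data:

```python
import pandas as pd

# toy labels standing in for wine['label'] (made-up values)
labels = pd.Series([0, 0, 1, 1, 2, 2, 2], name="label").astype("category")

counts = labels.value_counts().sort_index()                # examples per class
shares = labels.value_counts(normalize=True).sort_index()  # fraction per class

print(counts)
print(shares.round(3))
```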