Wine

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils

2. Analysis

In this section we analyse the wine dataset. We start by reading the accompanying data docs, which are not particularly useful: they give us the names of the features but no information on the target. They do give some preliminary information about the distribution of the classes, which are approximately equal in size. Looks like we have to dig into the data ourselves!

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions for any data analysis task.

wine = pd.read_csv('../data/data/wine.csv')
wine.head()

Let's take a look at the features next along with their dtypes.

wine.columns
wine.dtypes

All features are numerical and continuous. Since label is our target, we convert it to a categorical dtype before proceeding any further.

categorical_features = ['label']

wine = utils.to_categorical(wine, categorical_features)
wine.dtypes
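The helper utils.to_categorical is not shown in this document. A minimal sketch of what such a helper could look like, assuming it simply casts the listed columns to the pandas category dtype and returns a new frame. The function body, the column name feature_a and the toy values are assumptions for illustration, not the project's actual code.

```python
import pandas as pd


def to_categorical(df, columns):
    """Hypothetical stand-in for utils.to_categorical: cast columns to 'category'."""
    out = df.copy()  # leave the caller's frame untouched
    for col in columns:
        out[col] = out[col].astype('category')
    return out


# Toy frame; the values are made up for illustration.
toy = pd.DataFrame({'feature_a': [13.2, 12.4, 14.1], 'label': [0, 1, 0]})
converted = to_categorical(toy, ['label'])
print(converted.dtypes)
```

Returning a copy keeps the conversion free of side effects, which matches the way the result is reassigned to wine above.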

Next, we check the descriptive statistics.

wine.describe(include='all')

Next, we check the usual stuff: missing values, duplicate values and correlations in the dataset.

wine.isna().any()
wine[wine.duplicated()].shape
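The same two checks can also be reduced to counts, which are easier to read than a boolean Series and a shape tuple. A small sketch on a made-up frame (the column names and values are assumptions, not taken from wine.csv):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value and one duplicated row.
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, 2.0, np.nan],
    'feature_b': [0.5, 0.7, 0.7, 0.9],
})

missing_per_column = toy.isna().sum()        # number of NaNs in each column
n_duplicates = int(toy.duplicated().sum())   # rows that repeat an earlier row
print(missing_per_column)
print(n_duplicates)
```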

No missing or duplicate values. Next we check the correlations between the numerical features. utils.corr is a helper we defined ourselves; it clears the plot memory internally, so no explicit cleanup is needed here.

corr = wine.corr()
name = 'heatmap@wine--corr.png'
utils.corr(corr, name)
name
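A heatmap is good for an overview, but a ranked list of feature pairs is sometimes easier to scan. One possible sketch, assuming a correlation matrix like the one returned by DataFrame.corr. The toy columns a, b and c are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with a
    'c': [1.0, -1.0, 1.0, -1.0],
})
corr = toy.corr()

# Keep only the strict upper triangle so that each pair appears once
# and the trivial self-correlations on the diagonal are dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```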

Next we conduct a relational analysis of the features. To start, we check the distribution of the features with respect to the labels.

2.2. Distributions & relational analysis

In this section we analyse the distributions and relations of the features. We start by plotting a pairplot of all features. By default, the pairplot uses a scatterplot for the off-diagonal axes, but here we use a lineplot instead.

name = 'pairplot@wine--lineplot-kdeplot.png'

g = sns.PairGrid(data=wine, corner=True)
g.map_diag(sns.kdeplot)
g.map_offdiag(sns.lineplot)
g.savefig(name)
utils.clear_seaborn_memory(g, plt)
name
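The helper utils.clear_seaborn_memory is not shown in this document. A rough sketch of what it might do, assuming its job is to release the figure held by a seaborn grid and close anything still registered with pyplot. The function name clear_figure_memory and its body are hypothetical. The sketch reads the grid's .figure attribute, which older seaborn releases expose as .fig instead.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, as in the setup above
import matplotlib.pyplot as plt
from types import SimpleNamespace


def clear_figure_memory(grid, plt_module):
    """Hypothetical stand-in for utils.clear_seaborn_memory."""
    fig = getattr(grid, 'figure', None)
    if fig is not None:
        fig.clf()                # drop the axes and artists held by the figure
        plt_module.close(fig)    # deregister the figure from pyplot
    plt_module.close('all')      # close anything else that is still open


fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1])
grid_like = SimpleNamespace(figure=fig)  # mimics a seaborn grid's .figure attribute
clear_figure_memory(grid_like, plt)
print(plt.get_fignums())
```

Closing figures explicitly matters when many large pairplots are produced in one session, since pyplot keeps every figure alive until it is closed.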

This was not as useful as I had hoped. Let's try the scatterplot next for the off-diagonal axes but keep the kdeplot for the diagonal.

name = 'pairplot@wine--scatterplot-kdeplot.png'

g = sns.pairplot(data=wine, corner=True, diag_kind='kde')
g.savefig(name)
utils.clear_seaborn_memory(g, plt)
name

Next we colour the same pairplot by label.

name = 'pairplot@wine--scatterplot-kdeplot-hue.png'

g = sns.pairplot(data=wine, corner=True, diag_kind='kde', hue='label')
g.savefig(name)
utils.clear_seaborn_memory(g, plt)
name

Next we draw a boxplot of every numerical feature against the label.

numerical_features = list(wine.columns)
numerical_features.remove('label')

name = 'boxplot@wine--label-all.png'

g = sns.PairGrid(
    data=wine,
    y_vars='label',
    x_vars=numerical_features
)
g.map(sns.boxplot)
g.savefig(name)
name

The distribution per class is similar across the features. There are outliers in the dataset.
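To put a number on the outliers, one option is to apply the same rule a boxplot uses by default: points further than 1.5 times the interquartile range from the quartiles are drawn as outliers. A minimal sketch on made-up data. The helper count_outliers and the toy columns are assumptions, and the counts here pool all classes rather than splitting by label as the boxplots do.

```python
import pandas as pd


def count_outliers(series, whis=1.5):
    """Count values outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return int(((series < lower) | (series > upper)).sum())


# Toy frame; feature_a contains one extreme value.
toy = pd.DataFrame({
    'feature_a': [1.0, 2.0, 3.0, 4.0, 5.0, 50.0],
    'feature_b': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    'label': [0, 0, 0, 1, 1, 1],
})
numerical = [c for c in toy.columns if c != 'label']
outliers = {c: count_outliers(toy[c]) for c in numerical}
print(outliers)
```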

It would be better to look at a kdeplot of every feature next, both with and without the label. I find the pairplot a


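# Assumed filename: the original assignment is missing, and reusing `name`
# would overwrite the boxplot image saved above.
name = 'pairgrid@wine--kdeplot.png'

g = sns.PairGrid(data=wine)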
g.map_diag(sns.kdeplot)
g.savefig(name)
name

Let's check the distribution of the label column next.

name = 'displot@wine--label.png'

g = sns.displot(data=wine, x='label')
g.savefig(name)
name

We have more-or-less the same number of examples per class.
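The plot can be backed up with the raw counts. A short sketch, assuming a label column like the one in wine; the toy labels below are made up and do not reflect the real class sizes.

```python
import pandas as pd

# Toy labels for illustration only.
toy = pd.DataFrame({'label': [0, 0, 1, 1, 2, 2, 2]})

counts = toy['label'].value_counts().sort_index()                # examples per class
shares = toy['label'].value_counts(normalize=True).sort_index()  # fraction per class
imbalance_ratio = counts.max() / counts.min()                    # 1.0 means perfectly balanced
print(counts)
print(shares.round(2))
print(imbalance_ratio)
```

A ratio close to 1 supports the reading of the plot; a much larger value would suggest stratified splits or class weights later on.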

Created: 2021-10-22 Fri 22:02