Abalone

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg')  # select the non-interactive backend before pyplot is imported; figures are saved as PNGs
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the abalone dataset. We start by reading the accompanying data documentation, which is complete and informative. One interesting observation is that the sex field takes three values: male, female and infant. The documentation states that there are no missing values, but we should verify that. Another observation is that the task is to predict the age of the abalone. This raises a question: do we treat it as a classification or a regression task?

2.1. Preliminary analysis

We start by answering our initial set of questions.

abalone = pd.read_csv('../data/data/abalone.csv')
abalone.head()
  sex  length  diameter  height  whole-weight  shucked-weight  viscera-weight  \
0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010   
1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485   
2   F   0.530     0.420   0.135        0.6770          0.2565          0.1415   
3   M   0.440     0.365   0.125        0.5160          0.2155          0.1140   
4   I   0.330     0.255   0.080        0.2050          0.0895          0.0395   

   shell-weight  rings  
0         0.150     15  
1         0.070      7  
2         0.210      9  
3         0.155     10  
4         0.055      7  
abalone.describe(include='all')
         sex       length     diameter       height  whole-weight  \
count   4177  4177.000000  4177.000000  4177.000000   4177.000000   
unique     3          NaN          NaN          NaN           NaN   
top        M          NaN          NaN          NaN           NaN   
freq    1528          NaN          NaN          NaN           NaN   
mean     NaN     0.523992     0.407881     0.139516      0.828742   
std      NaN     0.120093     0.099240     0.041827      0.490389   
min      NaN     0.075000     0.055000     0.000000      0.002000   
25%      NaN     0.450000     0.350000     0.115000      0.441500   
50%      NaN     0.545000     0.425000     0.140000      0.799500   
75%      NaN     0.615000     0.480000     0.165000      1.153000   
max      NaN     0.815000     0.650000     1.130000      2.825500   

        shucked-weight  viscera-weight  shell-weight        rings  
count      4177.000000     4177.000000   4177.000000  4177.000000  
unique             NaN             NaN           NaN          NaN  
top                NaN             NaN           NaN          NaN  
freq               NaN             NaN           NaN          NaN  
mean          0.359367        0.180594      0.238831     9.933684  
std           0.221963        0.109614      0.139203     3.224169  
min           0.001000        0.000500      0.001500     1.000000  
25%           0.186000        0.093500      0.130000     8.000000  
50%           0.336000        0.171000      0.234000     9.000000  
75%           0.502000        0.253000      0.329000    11.000000  
max           1.488000        0.760000      1.005000    29.000000  
abalone.shape
(4177, 9)
abalone.dtypes
sex                object
length            float64
diameter          float64
height            float64
whole-weight      float64
shucked-weight    float64
viscera-weight    float64
shell-weight      float64
rings               int64
dtype: object
abalone = utils.to_categorical(abalone, ['sex'])
abalone.dtypes
sex               category
length             float64
diameter           float64
height             float64
whole-weight       float64
shucked-weight     float64
viscera-weight     float64
shell-weight       float64
rings                int64
dtype: object
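
The helper utils.to_categorical comes from this project's src package and its source is not shown here. As a minimal sketch of what such a helper might do, assuming it simply casts the listed columns to the pandas category dtype (the function body and the demo frame below are illustrative, not the project's actual code):

```python
import pandas as pd


def to_categorical(df, columns):
    """Return a copy of df with the given columns cast to the 'category' dtype.

    Hypothetical stand-in for the project's utils.to_categorical.
    """
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype('category')
    return out


# Tiny stand-in frame; the values mirror the first rows shown above.
demo = pd.DataFrame({'sex': ['M', 'M', 'F', 'M', 'I'],
                     'rings': [15, 7, 9, 10, 7]})
demo = to_categorical(demo, ['sex'])
print(demo.dtypes)
```

Returning a copy rather than modifying the frame in place matches the way the result is reassigned to abalone above.
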
abalone.isna().any()
sex               False
length            False
diameter          False
height            False
whole-weight      False
shucked-weight    False
viscera-weight    False
shell-weight      False
rings             False
dtype: bool
abalone[abalone.duplicated()].shape
(0, 9)
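
The missing-value and duplicate checks above can be bundled into one helper, together with a look at non-positive measurements (the summary reports a minimum height of 0, which may deserve a closer look). The sketch below is illustrative only: sanity_report and the demo frame are hypothetical names and values, not part of the project.

```python
import pandas as pd


def sanity_report(df):
    """Summarise missing values, duplicate rows and non-positive numeric entries."""
    numeric = df.select_dtypes(include='number')
    non_positive = (numeric <= 0).sum()
    return {
        'missing': int(df.isna().sum().sum()),
        'duplicates': int(df.duplicated().sum()),
        # keep only the columns that actually contain values <= 0
        'non_positive': {col: int(n) for col, n in non_positive.items() if n > 0},
    }


# Made-up rows; one of them has a zero height on purpose.
demo = pd.DataFrame({
    'sex': ['M', 'F', 'I', 'I'],
    'height': [0.095, 0.135, 0.0, 0.080],
    'rings': [15, 9, 8, 7],
})
report = sanity_report(demo)
print(report)
```

In the notebook this would be called as sanity_report(abalone).
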
name = 'heatmap@abalone--corr.png'
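# restrict to numeric columns: 'sex' is categorical and recent pandas versions refuse to correlate it
corr = abalone.select_dtypes(include='number').corr()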

plotter.corr(corr, name)
name

heatmap@abalone--corr.png

All numerical features are positively correlated with one another. Since highly correlated features carry largely redundant information, this may present an opportunity for feature selection. Unfortunately, no single feature is strongly correlated with rings, so we may need to engineer new features to help the model learn.
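
To read the correlations with the target directly rather than from the heatmap, the rings column of the correlation matrix can be sorted. The sketch below uses synthetic data (the demo frame and its coefficients are made up for illustration); in the notebook one would start from abalone instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the abalone frame; values are illustrative only.
rng = np.random.default_rng(0)
n = 200
length = rng.uniform(0.1, 0.8, n)
demo = pd.DataFrame({
    'sex': pd.Categorical(rng.choice(['M', 'F', 'I'], n)),
    'length': length,
    'diameter': 0.8 * length + rng.normal(0, 0.01, n),
    'rings': np.round(5 + 10 * length + rng.normal(0, 2, n)).astype(int),
})

# keep numeric columns only so the categorical 'sex' column is not passed to corr()
corr = demo.select_dtypes(include='number').corr()

# correlation of each feature with the target, strongest first
target_corr = corr['rings'].drop('rings').sort_values(ascending=False)
print(target_corr)
```
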

2.2. Distributional analysis

In this section we analyse the distributions of the features. Let's start with a histogram of the rings feature.

name = 'histplot@abalone--rings.png'

fig, ax = plt.subplots()
sns.histplot(data=abalone, x='rings', kde=True, ax=ax)
fig.savefig(name)
name

histplot@abalone--rings.png
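
The histogram bears on the question raised at the start of the analysis: classification or regression? One quick input to that decision is how many distinct ring counts exist and how many of them are rare (the summary above shows rings ranging from 1 to 29). A sketch on a small made-up series, standing in for abalone['rings']:

```python
import pandas as pd

# Hypothetical ring counts standing in for abalone['rings'].
rings = pd.Series([7, 8, 8, 9, 9, 9, 10, 10, 11, 15, 29], name='rings')

counts = rings.value_counts().sort_index()
n_classes = rings.nunique()

# ring counts seen only once would be hard to learn (or to stratify on) as class labels
rare = counts[counts < 2].index.tolist()

print(f'{n_classes} distinct ring counts, rare ones: {rare}')
```

Many sparsely populated classes would be one argument for treating rings as a numeric target. This is a heuristic, not a conclusion drawn from the data.
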

2.3. Relational analysis

In this section we analyse the relationships between the features and rings, and among the features themselves.

name = 'scatterplot@abalone--numerical-rings--hue:rings.png'

numerical_features = ['diameter',
                      'height',
                      'whole-weight',
                      'shucked-weight',
                      'viscera-weight',
                      'shell-weight']

g = sns.pairplot(data=abalone, x_vars=numerical_features, y_vars=['rings'], hue='rings')
g.savefig(name)
name

scatterplot@abalone--numerical-rings--hue:rings.png

We observe that abalones with a larger diameter and height generally have a higher number of rings. For the remaining features, the number of rings appears evenly spread, with no clear trend. We can study these relationships better using a lineplot.

name = 'lineplot@abalone--numerical-rings.png'

g = sns.PairGrid(data=abalone, x_vars=numerical_features, y_vars=['rings'])
g.map(sns.lineplot)
g.savefig(name)
name

lineplot@abalone--numerical-rings.png

The linear relationship between {diameter, height} and rings is evident from the lineplot. Next, we check the pairwise relations between the features.
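
A visual impression of linearity can be backed by a number. One simple option is to fit a straight line and report the coefficient of determination. The sketch below runs on synthetic data (the diameter and rings arrays and their coefficients are made up for illustration); in the notebook the arrays would come from abalone['diameter'] and abalone['rings'].

```python
import numpy as np

# Synthetic stand-ins for the diameter and rings columns; values are illustrative.
rng = np.random.default_rng(0)
diameter = rng.uniform(0.1, 0.65, 300)
rings = 3 + 17 * diameter + rng.normal(0, 2, 300)

# least-squares straight line: rings ~ slope * diameter + intercept
slope, intercept = np.polyfit(diameter, rings, deg=1)
predicted = slope * diameter + intercept

# R^2 = 1 - residual sum of squares / total sum of squares
ss_res = np.sum((rings - predicted) ** 2)
ss_tot = np.sum((rings - rings.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f'slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.2f}')
```

For a single predictor, this R^2 equals the squared Pearson correlation, so it ties back to the correlation heatmap from the preliminary analysis.
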

name = 'pairplot@abalone.png'

g = sns.pairplot(data=abalone, vars=numerical_features, corner=True)
g.savefig(name)
name

pairplot@abalone.png

All numerical features are linearly related to one another.

2.4. Categorical analysis

We have one categorical feature in the dataset, namely sex. Let's condition rings on sex to see whether the number of rings varies across the sexes.

name = 'histplot@abalone--hue:sex.png'

fig, ax = plt.subplots()
sns.histplot(data=abalone, x='rings', hue='sex', bins=20, multiple='fill', ax=ax)
fig.savefig(name)
name

histplot@abalone--hue:sex.png

We observe similar distributions for males and females. Infants mostly have a lower number of rings.
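
The same comparison can be made numerically with a per-sex summary of rings. The sketch below uses a small made-up frame (the demo values are illustrative, not taken from the dataset); in the notebook the groupby would be applied to abalone.

```python
import pandas as pd

# Made-up rows standing in for the abalone frame.
demo = pd.DataFrame({
    'sex': pd.Categorical(['M', 'M', 'F', 'F', 'I', 'I', 'I']),
    'rings': [10, 12, 11, 13, 5, 6, 7],
})

# observed=True restricts the result to categories that actually occur
summary = demo.groupby('sex', observed=True)['rings'].agg(['count', 'mean', 'median'])
print(summary)
```
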

Created: 2021-10-22 Fri 21:57