Abalone
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the abalone dataset. We start by reading
the accompanying data docs. The documentation is complete and
informative. One interesting observation is that the sex field can
have 3 values, namely male, female and infant.
The documentation says there are no missing values; however, we should
verify that. Another observation is that the task is to predict the
age of the abalone. The question is whether we treat this as a
classification or a regression task.
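One way to inform the classification-versus-regression question is to look at how many distinct values the target takes and how sparsely some of them are populated. The following is a minimal sketch on a tiny hand-made frame; the values are illustrative and not taken from the dataset, and `summarise_target` and its `min_count` parameter are hypothetical names introduced here.

```python
import pandas as pd


def summarise_target(target: pd.Series, min_count: int = 2) -> dict:
    """Summarise an integer target to help choose between regression and classification."""
    counts = target.value_counts().sort_index()
    return {
        "n_unique": int(counts.size),
        "min": int(target.min()),
        "max": int(target.max()),
        # values seen fewer than `min_count` times are hard to learn as separate classes
        "rare_values": counts[counts < min_count].index.tolist(),
    }


# Illustrative values only; not the real abalone data.
toy = pd.DataFrame({"rings": [7, 7, 9, 9, 9, 10, 15, 29]})
summary = summarise_target(toy["rings"])
print(summary)
```

A wide, ordered range with many thinly populated values would lean towards regression, while a handful of well-populated values would make classification more plausible.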
2.1. Preliminary analysis
We start by answering our initial set of questions.
abalone = pd.read_csv('../data/data/abalone.csv')
abalone.head()
  sex  length  diameter  height  whole-weight  shucked-weight  viscera-weight  shell-weight  rings
0   M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1   M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2   F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3   M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4   I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7
abalone.describe(include='all')
         sex       length     diameter       height  whole-weight  \
count   4177  4177.000000  4177.000000  4177.000000   4177.000000
unique     3          NaN          NaN          NaN           NaN
top        M          NaN          NaN          NaN           NaN
freq    1528          NaN          NaN          NaN           NaN
mean     NaN     0.523992     0.407881     0.139516      0.828742
std      NaN     0.120093     0.099240     0.041827      0.490389
min      NaN     0.075000     0.055000     0.000000      0.002000
25%      NaN     0.450000     0.350000     0.115000      0.441500
50%      NaN     0.545000     0.425000     0.140000      0.799500
75%      NaN     0.615000     0.480000     0.165000      1.153000
max      NaN     0.815000     0.650000     1.130000      2.825500

        shucked-weight  viscera-weight  shell-weight        rings
count      4177.000000     4177.000000   4177.000000  4177.000000
unique             NaN             NaN           NaN          NaN
top                NaN             NaN           NaN          NaN
freq               NaN             NaN           NaN          NaN
mean          0.359367        0.180594      0.238831     9.933684
std           0.221963        0.109614      0.139203     3.224169
min           0.001000        0.000500      0.001500     1.000000
25%           0.186000        0.093500      0.130000     8.000000
50%           0.336000        0.171000      0.234000     9.000000
75%           0.502000        0.253000      0.329000    11.000000
max           1.488000        0.760000      1.005000    29.000000
abalone.shape
(4177, 9)
abalone.dtypes
sex                object
length            float64
diameter          float64
height            float64
whole-weight      float64
shucked-weight    float64
viscera-weight    float64
shell-weight      float64
rings               int64
dtype: object
abalone = utils.to_categorical(abalone, ['sex'])
abalone.dtypes
sex               category
length             float64
diameter           float64
height             float64
whole-weight       float64
shucked-weight     float64
viscera-weight     float64
shell-weight       float64
rings                int64
dtype: object
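The helper `utils.to_categorical` belongs to this project and its implementation is not shown here. Judging only from the dtype change above, it presumably casts the listed columns to the pandas `category` dtype. A hypothetical stand-in, assuming that is all it does, could look like this:

```python
import pandas as pd


def to_categorical(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Hypothetical stand-in for utils.to_categorical: cast columns to 'category'."""
    out = df.copy()  # leave the caller's frame untouched
    for col in columns:
        out[col] = out[col].astype("category")
    return out


# Illustrative rows only.
toy = pd.DataFrame({"sex": ["M", "F", "I", "M"], "rings": [15, 9, 7, 10]})
converted = to_categorical(toy, ["sex"])
print(converted.dtypes)
```

Working on a copy is a design choice of this sketch; the real helper may well modify the frame in place.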
abalone.isna().any()
sex               False
length            False
diameter          False
height            False
whole-weight      False
shucked-weight    False
viscera-weight    False
shell-weight      False
rings             False
dtype: bool
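Note that `isna()` only detects explicit NaN values. The summary statistics above report a minimum height of 0, which is not a plausible physical measurement, so zeros may deserve a separate check. A minimal sketch on an illustrative frame (the rows and the `measurement_cols` list are made up for the example):

```python
import pandas as pd

# Illustrative rows only; the zero height is planted to demonstrate the check.
toy = pd.DataFrame({
    "length": [0.455, 0.350, 0.430],
    "height": [0.095, 0.000, 0.090],
    "rings": [15, 7, 8],
})

measurement_cols = ["length", "height"]

# number of zero entries per measurement column
zero_counts = (toy[measurement_cols] == 0).sum()

# rows where at least one measurement is zero
suspect_rows = toy[(toy[measurement_cols] == 0).any(axis=1)]

print(zero_counts)
print(suspect_rows)
```

Whether such rows should be dropped, imputed or kept is a separate decision that this sketch does not make.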
abalone[abalone.duplicated()].shape
(0, 9)
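name = 'heatmap@abalone--corr.png'
# restrict to numeric columns; recent pandas versions raise on the categorical sex column
corr = abalone.corr(numeric_only=True)
plotter.corr(corr, name)
name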
All numerical features are positively correlated with one another.
This may present an opportunity to do some feature selection, since
highly correlated features carry largely redundant information.
Unfortunately, no feature is strongly correlated with rings. In other
words, we may need to perform feature engineering to help the model
learn better.
2.2. Distributional analysis
In this section we analyse the distributions of the features. Let's
start with a histogram of the rings feature.
name = 'histplot@abalone--rings.png'
fig, ax = plt.subplots()
sns.histplot(data=abalone, x='rings', kde=True, ax=ax)
fig.savefig(name)
name
2.3. Relational analysis
In this section we analyse the relationships of the features.
name = 'scatterplot@abalone--numerical-rings--hue:rings.png'
numerical_features = ['diameter', 'height', 'whole-weight',
                      'shucked-weight', 'viscera-weight', 'shell-weight']
g = sns.pairplot(data=abalone, x_vars=numerical_features,
                 y_vars=['rings'], hue='rings')
g.savefig(name)
name
We observe that abalones with a larger diameter and height generally
have a higher number of rings. The number of rings seems to be evenly
distributed across the remaining features. We can study this
relationship better using a lineplot.
name = 'lineplot@abalone--numerical-rings.png'
g = sns.PairGrid(data=abalone, x_vars=numerical_features, y_vars=['rings'])
g.map(sns.lineplot)
g.savefig(name)
name
The roughly linear relationship between diameter, height and rings is
visible in the lineplot. Next, we check the pairwise relations between
the features.
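For context, `sns.lineplot` aggregates observations that share the same x value and draws their mean (with a confidence band by default), so each line is essentially a per-value mean of rings. The same aggregation can be sketched directly in pandas; the frame below is illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative values only: repeated diameters with different ring counts.
toy = pd.DataFrame({
    "diameter": [0.30, 0.30, 0.40, 0.40, 0.50],
    "rings": [6, 8, 9, 11, 12],
})

# mean number of rings for each distinct diameter, sorted by diameter
mean_rings = toy.groupby("diameter")["rings"].mean()
print(mean_rings)
```

Inspecting the aggregated values this way makes it easier to judge how close to a straight line the trend really is.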
name = 'pairplot@abalone.png'
g = sns.pairplot(data=abalone, vars=numerical_features, corner=True)
g.savefig(name)
name
All numerical features are linearly related to one another.
2.4. Categorical analysis
We have one categorical feature in the dataset, namely sex. Let's
condition rings on sex to see if the number of rings varies across
the sexes.
name = 'histplot@abalone--hue:sex.png'
fig, ax = plt.subplots()
sns.histplot(data=abalone, x='rings', hue='sex', bins=20,
             multiple='fill', ax=ax)
fig.savefig(name)
name
We observe similar distributions for the male and female sexes.
Infants primarily have a lower number of rings.
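The visual comparison can be complemented with per-group summary statistics. The following is a minimal sketch using a small made-up frame; the values are illustrative only and do not come from the dataset.

```python
import pandas as pd

# Illustrative values only; not the real abalone data.
toy = pd.DataFrame({
    "sex": ["M", "M", "F", "F", "I", "I"],
    "rings": [10, 12, 11, 11, 6, 8],
})

# count, mean and median of rings for each sex
by_sex = toy.groupby("sex")["rings"].agg(["count", "mean", "median"])
print(by_sex)
```

On the real data, similar means and medians for M and F alongside lower values for I would support the reading of the histogram.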