Meta

This document contains the data analysis of metadata.csv.

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from context import matplotlib as mpl
from src import utils, plotter

# turn this on when ready to save high quality images
# use lower quality for viewing images inline
# mpl.rcParams['savefig.dpi'] = 300

2. Conversion to long-form

The dataset is stored in wide-form; however, long-form is more suitable for data analysis.

meta = pd.read_csv('../data/metadata.csv')
meta.head()
   dataset  miss-null  miss-sp-val  miss-bin  red-dup  red-uid  red-corr  \
0  abalone        NaN          NaN       NaN      NaN      NaN       1.0   
1    adult        NaN          1.0       NaN      NaN      NaN       1.0   
2   airbnb        1.0          NaN       NaN      NaN      1.0       NaN   
3  avocado        NaN          NaN       NaN      NaN      1.0       1.0   
4  bitcoin        1.0          NaN       NaN      NaN      NaN       1.0   

   str-num  str-human  str-sanitise  cat-bin  cat-hierarchy  misc-sensitive  \
0      NaN        NaN           1.0      NaN            1.0             NaN   
1      NaN        NaN           1.0      1.0            1.0             1.0   
2      NaN        NaN           1.0      NaN            NaN             NaN   
3      NaN        NaN           NaN      NaN            NaN             NaN   
4      NaN        NaN           NaN      NaN            NaN             NaN   

   misc-unit  misc-balance  
0        NaN           NaN  
1        1.0           1.0  
2        1.0           NaN  
3        1.0           NaN  
4        1.0           NaN  
meta.dtypes
dataset            object
miss-null         float64
miss-sp-val       float64
miss-bin          float64
red-dup           float64
red-uid           float64
red-corr          float64
str-num           float64
str-human         float64
str-sanitise      float64
cat-bin           float64
cat-hierarchy     float64
misc-sensitive    float64
misc-unit         float64
misc-balance      float64
dtype: object

Each smell is a separate column, so the dataset is in "wide-form". It will be better to have it in "long-form", with one row per dataset and smell. Let's do that.

smell_features = list(meta.columns)
smell_features.remove('dataset')

meta_long = pd.melt(meta,
                    id_vars=['dataset'],
                    value_vars=smell_features,
                    var_name='smell')
meta_long.head()
   dataset      smell  value
0  abalone  miss-null    NaN
1    adult  miss-null    NaN
2   airbnb  miss-null    1.0
3  avocado  miss-null    NaN
4  bitcoin  miss-null    1.0

We only need the rows where value is 1.

meta_long = meta_long[meta_long['value'].notna()]
meta_long.isna().any()
dataset    False
smell      False
value      False
dtype: bool

Let's drop value since we don't need it anymore.

meta_long = meta_long.drop('value', axis='columns')
meta_long.head()
         dataset      smell
2         airbnb  miss-null
4        bitcoin  miss-null
6          comic  miss-null
7  covid-vaccine  miss-null
9     earthquake  miss-null
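
The three steps above (melt, keep the non-null rows, drop value) can also be written as one method chain. The following is a minimal sketch on a small made-up table, not the real metadata.csv; the name long_df and the toy values are assumptions for illustration only.

```python
import pandas as pd

nan = float('nan')

# Toy wide-form table: 1.0 marks an observed smell, NaN means it is absent.
wide = pd.DataFrame({
    'dataset': ['abalone', 'adult', 'airbnb'],
    'miss-null': [nan, nan, 1.0],
    'red-corr': [1.0, 1.0, nan],
})

long_df = (
    wide.melt(id_vars='dataset', var_name='smell')  # one row per (dataset, smell)
        .dropna(subset=['value'])                   # keep observed smells only
        .drop(columns='value')                      # the flag is now redundant
        .reset_index(drop=True)
)
print(long_df)
```

Leaving out value_vars makes melt use every column that is not listed in id_vars, which matches what the explicit smell_features list does in the cell above.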

Let's add the smell group to the examples.

meta_long['group'] = meta_long['smell'].str.extract(r'^(.*)-.*$')
meta_long['group'].value_counts()
red        33
cat        17
misc       14
miss       12
str        12
miss-sp     1
Name: group, dtype: int64

The pattern is greedy, so miss-sp-val ends up in a miss-sp group instead of miss. Let's fix that.

meta_long['group'] = meta_long['group'].replace(to_replace='miss-sp', value='miss')
meta_long['group'].value_counts()
red     33
cat     17
misc    14
miss    13
str     12
Name: group, dtype: int64
# use this block to write file to disk
meta_long.to_csv('../data/metadata_long.csv', index=False)
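
The stray miss-sp group comes from the greedy `.*` in the pattern, which captures everything up to the last hyphen. As a possible alternative (not what the analysis above does), a pattern that stops at the first hyphen would make the replace step unnecessary. The sketch below compares both on a few smell names; the variable names are illustrative.

```python
import pandas as pd

smells = pd.Series(['miss-null', 'miss-sp-val', 'red-corr', 'cat-hierarchy'])

# Pattern used above: the greedy group captures up to the LAST hyphen.
greedy = smells.str.extract(r'^(.*)-.*$')[0]

# Alternative: capture only the characters before the FIRST hyphen.
first = smells.str.extract(r'^([^-]+)-')[0]

print(pd.DataFrame({'smell': smells, 'greedy': greedy, 'first': first}))
```

This only works as long as no group name itself contains a hyphen, which holds for the five groups seen here.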

3. Analysis

In this section we analyse the metadata_long.csv dataset which we generated earlier.

Let's start by loading the data.

meta = pd.read_csv('../data/metadata_long.csv')
meta.head()
         dataset      smell group
0         airbnb  miss-null  miss
1        bitcoin  miss-null  miss
2          comic  miss-null  miss
3  covid-vaccine  miss-null  miss
4     earthquake  miss-null  miss

Let's look at the distribution of the smells first.

meta['smell'].value_counts()
red-corr          19
cat-hierarchy     12
miss-null         11
red-uid           11
misc-unit          9
str-num            5
str-sanitise       5
cat-bin            5
red-dup            3
misc-balance       3
str-human          2
misc-sensitive     2
miss-sp-val        1
miss-bin           1
Name: smell, dtype: int64
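
If relative frequencies are more useful than raw counts, value_counts accepts normalize=True. The following is a small sketch on made-up values rather than the real data; the names counts, shares and summary are assumptions for illustration.

```python
import pandas as pd

# Toy column of smell occurrences.
smells = pd.Series(['red-corr', 'red-corr', 'miss-null', 'red-uid'], name='smell')

counts = smells.value_counts()                # absolute counts
shares = smells.value_counts(normalize=True)  # fractions that sum to 1

# Both Series share the same index, so they align when combined.
summary = pd.DataFrame({'count': counts, 'share': shares})
print(summary)
```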

And the same information in a plot for the report.

name = 'countplot@meta--smells.png'

fig, ax = plt.subplots()
ax.tick_params(axis='x', labelrotation=90)
sns.countplot(data=meta,
              x='smell',
              ax=ax,
              order=meta['smell'].value_counts().index) # sort by count
fig.tight_layout()
fig.savefig(name)
name

countplot@meta--smells.png

Next, let's look at the distribution of the smell groups.

meta['group'].value_counts()
red     33
cat     17
misc    14
miss    13
str     12
Name: group, dtype: int64
name = 'countplot@meta--smells-group.png'

fig, ax = plt.subplots()
sns.countplot(data=meta,
              x='group',
              ax=ax,
              order=meta['group'].value_counts().index) # sort by count
fig.tight_layout()
fig.savefig(name)
name

countplot@meta--smells-group.png

Let's look at the distribution of the smells within each group.

name = 'countplot@meta--smells-hue:group.png'

fig, axs = plt.subplots(1, 5, figsize=(15, 5), sharey=True)

# one panel per smell group, in the same left-to-right order as before
for ax, group in zip(axs, ['red', 'cat', 'misc', 'miss', 'str']):
    sns.countplot(data=meta[meta['group'].eq(group)],
                  x='group',
                  hue='smell',
                  ax=ax)

fig.tight_layout()
fig.savefig(name)
name

countplot@meta--smells-hue:group.png

Let's look at the distribution of smell groups within each dataset that was analysed.

name = 'countplot@meta--smells-group-dataset.png'

g = sns.catplot(data=meta,
                col='dataset',
                col_wrap=5,
                kind='count',
                x='group',
                sharey=True,
                sharex=True)
g.fig.savefig(name)
name

countplot@meta--smells-group-dataset.png

The above plot is too dense. Let's see if we can visualise the above information in a more condensed manner.

name = 'jointplot@meta--dataset-smells.png'

g = sns.JointGrid(data=meta,
                  x='dataset',
                  y='smell')
g.plot(sns.histplot, sns.countplot)
g.ax_joint.tick_params(axis='x', labelrotation=90)
g.fig.tight_layout()
g.fig.savefig(name)
name

jointplot@meta--dataset-smells.png

Let's add a hue using group while we are at it.

name = 'jointplot@meta--dataset-smells-hue:group.png'

g = sns.JointGrid(data=meta,
                  x='dataset',
                  y='smell')
g.plot_joint(sns.histplot, data=meta, hue='group')
g.plot_marginals(sns.countplot)
g.ax_joint.tick_params(axis='x', labelrotation=90)
g.fig.tight_layout()
g.fig.savefig(name)
name

jointplot@meta--dataset-smells-hue:group.png
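
A contingency table is another compact way to relate datasets to smell groups. The sketch below uses pd.crosstab on a made-up frame with the same dataset and group columns as metadata_long.csv; the values are not taken from the real data.

```python
import pandas as pd

# Toy long-form data: one row per observed smell.
toy = pd.DataFrame({
    'dataset': ['adult', 'adult', 'adult', 'airbnb', 'airbnb'],
    'group':   ['red', 'cat', 'cat', 'miss', 'red'],
})

# Rows are datasets, columns are smell groups, cells are occurrence counts.
table = pd.crosstab(toy['dataset'], toy['group'])
print(table)
```

Such a table could also be drawn with sns.heatmap(table, annot=True) if a figure is preferred over a printed frame.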

Date: 2021-12-04 Sat 00:00

Created: 2021-12-08 Wed 21:16