Meta
This document contains the data analysis of metadata.csv.
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from context import matplotlib as mpl
from src import utils, plotter

# turn this on when ready to save high quality images
# use lower quality for viewing images inline
# mpl.rcParams['savefig.dpi'] = 300
2. Conversion to long-form
The dataset is stored in wide-form; however, long-form is more suitable for data analysis.
meta = pd.read_csv('../data/metadata.csv')
meta.head()
   dataset  miss-null  miss-sp-val  miss-bin  red-dup  red-uid  red-corr  \
0  abalone        NaN          NaN       NaN      NaN      NaN       1.0
1    adult        NaN          1.0       NaN      NaN      NaN       1.0
2   airbnb        1.0          NaN       NaN      NaN      1.0       NaN
3  avocado        NaN          NaN       NaN      NaN      1.0       1.0
4  bitcoin        1.0          NaN       NaN      NaN      NaN       1.0

   str-num  str-human  str-sanitise  cat-bin  cat-hierarchy  misc-sensitive  \
0      NaN        NaN           1.0      NaN            1.0             NaN
1      NaN        NaN           1.0      1.0            1.0             1.0
2      NaN        NaN           1.0      NaN            NaN             NaN
3      NaN        NaN           NaN      NaN            NaN             NaN
4      NaN        NaN           NaN      NaN            NaN             NaN

   misc-unit  misc-balance
0        NaN           NaN
1        1.0           1.0
2        1.0           NaN
3        1.0           NaN
4        1.0           NaN
meta.dtypes
dataset            object
miss-null         float64
miss-sp-val       float64
miss-bin          float64
red-dup           float64
red-uid           float64
red-corr          float64
str-num           float64
str-human         float64
str-sanitise      float64
cat-bin           float64
cat-hierarchy     float64
misc-sensitive    float64
misc-unit         float64
misc-balance      float64
dtype: object
The dataset is in "wide-form", but it would be better to have it in "long-form". Let's do that.
smell_features = list(meta.columns)
smell_features.remove('dataset')
meta_long = pd.melt(meta, id_vars=['dataset'], value_vars=smell_features,
                    var_name='smell')
meta_long.head()
   dataset      smell  value
0  abalone  miss-null    NaN
1    adult  miss-null    NaN
2   airbnb  miss-null    1.0
3  avocado  miss-null    NaN
4  bitcoin  miss-null    1.0
We only need the rows where value is 1.
meta_long = meta_long[meta_long['value'].notna()]
meta_long.isna().any()
dataset    False
smell      False
value      False
dtype: bool
Let's drop value since we don't need it anymore.
meta_long = meta_long.drop('value', axis='columns')
meta_long.head()
         dataset      smell
2         airbnb  miss-null
4        bitcoin  miss-null
6          comic  miss-null
7  covid-vaccine  miss-null
9     earthquake  miss-null
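To see the whole wide-to-long conversion in one place, here is a minimal sketch on a toy frame. The frame `wide` and its three datasets are made up for illustration and are not taken from metadata.csv; the steps mirror the ones used on `meta`.

```python
import numpy as np
import pandas as pd

# Toy wide-form frame (hypothetical data): one row per dataset,
# one column per smell, 1.0 where the smell was observed.
wide = pd.DataFrame({
    'dataset': ['a', 'b', 'c'],
    'miss-null': [1.0, np.nan, 1.0],
    'red-corr': [np.nan, 1.0, 1.0],
})

# Melt every non-id column into (smell, value) pairs.
tidy = wide.melt(id_vars=['dataset'], var_name='smell')

# Keep only the observed smells, then drop the now-constant value column.
tidy = tidy[tidy['value'].notna()].drop(columns='value').reset_index(drop=True)
print(tidy)
```

Each remaining row is one (dataset, smell) observation, which is the shape that the count plots later in this document expect.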
Let's add the smell group to the examples.
meta_long['group'] = meta_long['smell'].str.extract(r'^(.*)-.*$')
meta_long['group'].value_counts()
red        33
cat        17
misc       14
miss       12
str        12
miss-sp     1
Name: group, dtype: int64
meta_long['group'] = meta_long['group'].replace(to_replace='miss-sp', value='miss')
meta_long['group'].value_counts()
red     33
cat     17
misc    14
miss    13
str     12
Name: group, dtype: int64
# use this block to write file to disk
meta_long.to_csv('../data/metadata_long.csv', index=False)
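The stray miss-sp group shows up because the greedy `(.*)` in the pattern matches up to the last hyphen, so miss-sp-val yields miss-sp. As a possible alternative (a sketch, not what was run above), capturing everything before the first hyphen would avoid the manual replace. The series `smells` below is a small hand-picked sample of smell names.

```python
import pandas as pd

# A few smell names from the metadata, including the three-part one.
smells = pd.Series(['miss-null', 'miss-sp-val', 'red-corr', 'cat-hierarchy'])

# Greedy capture, as used above: the group ends at the LAST hyphen.
greedy = smells.str.extract(r'^(.*)-.*$')[0]

# Alternative: the group ends at the FIRST hyphen.
first = smells.str.extract(r'^([^-]+)-')[0]

print(pd.DataFrame({'smell': smells, 'greedy': greedy, 'first': first}))
```

This only works as long as no group name itself contains a hyphen, which holds for the five groups seen here.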
3. Analysis
In this section we analyse the metadata_long.csv dataset which we generated earlier.
First, let's load the long-form data.
meta = pd.read_csv('../data/metadata_long.csv')
meta.head()
         dataset      smell group
0         airbnb  miss-null  miss
1        bitcoin  miss-null  miss
2          comic  miss-null  miss
3  covid-vaccine  miss-null  miss
4     earthquake  miss-null  miss
Let's look at the distribution of the smells first.
meta['smell'].value_counts()
red-corr          19
cat-hierarchy     12
miss-null         11
red-uid           11
misc-unit          9
str-num            5
str-sanitise       5
cat-bin            5
red-dup            3
misc-balance       3
str-human          2
misc-sensitive     2
miss-sp-val        1
miss-bin           1
Name: smell, dtype: int64
And the same information in a plot for the report.
name = 'countplot@meta--smells.png'
fig, ax = plt.subplots()
ax.tick_params(axis='x', labelrotation=90)
sns.countplot(data=meta, x='smell', ax=ax,
              order=meta['smell'].value_counts().index)  # sort by count
fig.tight_layout()
fig.savefig(name)
name
Next, let's look at the distribution of the smell groups.
meta['group'].value_counts()
red     33
cat     17
misc    14
miss    13
str     12
Name: group, dtype: int64
name = 'countplot@meta--smells-group.png'
fig, ax = plt.subplots()
sns.countplot(data=meta, x='group', ax=ax,
              order=meta['group'].value_counts().index)  # sort by count
fig.tight_layout()
fig.savefig(name)
name
Let's look at the distribution of the smells within each group.
name = 'countplot@meta--smells-hue:group.png'
fig, axs = plt.subplots(1, 5, figsize=(15, 5), sharey=True)
data = meta[meta['group'].eq('red')]
sns.countplot(data=data, x='group', hue='smell', ax=axs[0],
              order=data['group'].value_counts().index)
sns.countplot(data=meta[meta['group'].eq('cat')], x='group', hue='smell', ax=axs[1])
sns.countplot(data=meta[meta['group'].eq('misc')], x='group', hue='smell', ax=axs[2])
sns.countplot(data=meta[meta['group'].eq('miss')], x='group', hue='smell', ax=axs[3])
sns.countplot(data=meta[meta['group'].eq('str')], x='group', hue='smell', ax=axs[4])
fig.tight_layout()
fig.savefig(name)
name
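The numbers behind a per-group plot like this can also be printed directly. Below is a minimal sketch using a toy frame; `toy` and its rows are invented for illustration, with the same columns as metadata_long.csv.

```python
import pandas as pd

# Toy long-form frame (hypothetical rows).
toy = pd.DataFrame({
    'dataset': ['a', 'b', 'c', 'a', 'b'],
    'smell': ['red-corr', 'red-corr', 'red-uid', 'cat-bin', 'cat-bin'],
    'group': ['red', 'red', 'red', 'cat', 'cat'],
})

# Number of occurrences of each smell within its group.
per_group = toy.groupby(['group', 'smell']).size()
print(per_group)
```

Running the same `groupby` on `meta` should give one count per bar in the figure, which is handy for checking the plot against the raw data.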
Let's look at the distribution of smell groups within each dataset that was analysed.
name = 'countplot@meta--smells-group-dataset.png'
g = sns.catplot(data=meta, col='dataset', col_wrap=5, kind='count',
                x='group', sharey=True, sharex=True)
g.fig.savefig(name)
name
The above plot is too dense. Let's see if we can visualise the above information in a more condensed manner.
name = 'jointplot@meta--dataset-smells.png'
g = sns.JointGrid(data=meta, x='dataset', y='smell')
g.plot(sns.histplot, sns.countplot)
g.ax_joint.tick_params(axis='x', labelrotation=90)
g.fig.tight_layout()
g.fig.savefig(name)
name
Let's add a hue using group while we are at it.
name = 'jointplot@meta--dataset-smells-hue:group.png'
g = sns.JointGrid(data=meta, x='dataset', y='smell')
g.plot_joint(sns.histplot, data=meta, hue='group')
g.plot_marginals(sns.countplot)
g.ax_joint.tick_params(axis='x', labelrotation=90)
g.fig.tight_layout()
g.fig.savefig(name)
name
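The dataset-by-smell counts shown in the joint plots can also be inspected as a plain table. This is a minimal sketch with `pd.crosstab` on a toy frame; `toy` and its rows are invented for illustration and are not part of the analysis above.

```python
import pandas as pd

# Toy long-form frame (hypothetical rows).
toy = pd.DataFrame({
    'dataset': ['a', 'a', 'b', 'c', 'c'],
    'smell': ['red-corr', 'cat-bin', 'red-corr', 'red-corr', 'red-uid'],
})

# Rows are datasets, columns are smells, cells are occurrence counts.
counts = pd.crosstab(toy['dataset'], toy['smell'])
print(counts)

# Row and column sums play the role of the marginal plots.
print(counts.sum(axis='columns'))
print(counts.sum(axis='index'))
```

Applied to `meta`, the same call would give one row per analysed dataset, which may be easier to scan than the figure when exact counts matter.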