Heart
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the heart dataset. We start by reading the
accompanying data docs. The dataset was downloaded from Kaggle, so the
docs are hosted on the website itself. They are not that helpful,
giving us little more than the column names.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
heart = pd.read_csv('../data/data/heart.csv')
heart.head()

   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0
1   37    1   2       130   250    0        1      187      0      3.5      0
2   41    0   1       130   204    0        0      172      0      1.4      2
3   56    1   1       120   236    0        1      178      0      0.8      2
4   57    0   0       120   354    0        1      163      1      0.6      2

   ca  thal  target
0   0     1       1
1   0     2       1
2   0     2       1
3   0     2       1
4   0     2       1
Next we look at the features and their dtypes.
heart.columns
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'], dtype='object')
heart.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

The names of the columns are a bit cryptic, but the docs contain more descriptive names. The following mapping of the column names to their descriptive names should help.
column | description |
---|---|
age | |
sex | |
cp | chest pain type (4 values) |
trestbps | resting blood pressure |
chol | serum cholesterol |
fbs | fasting blood sugar |
restecg | resting electrocardiographic results (values 0, 1, 2) |
thalach | max heart rate achieved |
exang | exercise induced angina |
oldpeak | ST depression induced by exercise relative to rest |
slope | the slope of the peak exercise ST segment |
ca | number of major vessels colored by fluoroscopy (0-3) |
thal | 3=normal; 6=fixed defect; 7=reversible defect |
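The mapping in the table can also be applied to the data itself. The following is a minimal sketch, assuming we want snake_case versions of the descriptive names; the exact spellings in `descriptive_names` are our own choice and do not come from the docs. The sample frame holds only the first two rows of a few columns.

```python
import pandas as pd

# Descriptions come from the table above; the snake_case spellings are
# an assumption made for this example.
descriptive_names = {
    'cp': 'chest_pain_type',
    'trestbps': 'resting_blood_pressure',
    'chol': 'serum_cholesterol',
    'fbs': 'fasting_blood_sugar',
    'restecg': 'resting_ecg_results',
    'thalach': 'max_heart_rate',
    'exang': 'exercise_induced_angina',
    'oldpeak': 'st_depression',
    'slope': 'st_slope',
    'ca': 'major_vessels',
}

# First two rows of the dataset, restricted to three columns.
sample = pd.DataFrame({'age': [63, 37], 'cp': [3, 2], 'chol': [233, 250]})

# Columns missing from the mapping (such as age) keep their names.
renamed = sample.rename(columns=descriptive_names)
print(renamed.columns.tolist())
```

We keep the short names in the rest of this analysis so that the code matches the original CSV header.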
We note that cp, restecg, ca & thal are categorical (they take
specific, discrete values) but are represented as numbers, which
implies a hierarchy amongst their values. This is a gotcha. We should
convert them to the category dtype prior to further analysis. Even
with the descriptive names, some features such as oldpeak, slope and
thal remain unclear. This highlights the need for better data
documentation.
In this case, the task is binary classification (patient gets a heart
attack or not), thus target is either a 0 or a 1. We should one-hot
encode that before training. The sex feature is categorical but
represented numerically. Training a model with this representation may
bias the model through an unwanted hierarchy, so the feature should be
one-hot encoded. Another problem is that without proper docs, we don't
know which sex is represented by which digit.
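As an illustration of the one-hot encoding mentioned above, here is a minimal sketch using `pd.get_dummies`. It runs on the first three rows of the dataset only, and the `sex_` prefix is our choice, not something prescribed by the docs.

```python
import pandas as pd

# First three rows of the dataset, restricted to two columns.
sample = pd.DataFrame({'age': [63, 37, 41], 'sex': [1, 1, 0]})

# Replace the single numeric sex column with one indicator column per value,
# so that no ordering between the two values is implied.
encoded = pd.get_dummies(sample, columns=['sex'], prefix='sex')
print(encoded.columns.tolist())
```

The indicator columns carry no order, which is the property we want for a feature whose digits are arbitrary labels.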
Let's convert the categorical features to category dtype prior to
further analysis.
categorical_features = ['sex', 'cp', 'restecg', 'ca', 'thal', 'target']
heart = utils.to_categorical(heart, categorical_features)
heart.dtypes

age            int64
sex         category
cp          category
trestbps       int64
chol           int64
fbs            int64
restecg     category
thalach        int64
exang          int64
oldpeak      float64
slope          int64
ca          category
thal        category
target      category
dtype: object
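The `utils.to_categorical` helper lives in the project's own `src` package and is not shown here. The following is a hypothetical stand-in, assuming the helper simply casts the listed columns with `astype('category')`; the real implementation may differ.

```python
import pandas as pd

def to_categorical(df, columns):
    """Return a copy of df with the given columns cast to the category dtype.

    Hypothetical stand-in for utils.to_categorical, which is not shown.
    """
    converted = df.copy()
    for column in columns:
        converted[column] = converted[column].astype('category')
    return converted

# First three rows of the dataset, restricted to three columns.
sample = pd.DataFrame({'age': [63, 37, 41], 'sex': [1, 1, 0], 'cp': [3, 2, 1]})
sample = to_categorical(sample, ['sex', 'cp'])
print(sample.dtypes)
```

Working on a copy keeps the caller's frame untouched, which matches the reassignment style `heart = utils.to_categorical(...)` used above.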
Let's look at the descriptive statistics next.
heart.describe(include='all')
               age    sex     cp    trestbps        chol         fbs  restecg  \
count   303.000000  303.0  303.0  303.000000  303.000000  303.000000    303.0
unique         NaN    2.0    4.0         NaN         NaN         NaN      3.0
top            NaN    1.0    0.0         NaN         NaN         NaN      1.0
freq           NaN  207.0  143.0         NaN         NaN         NaN    152.0
mean     54.366337    NaN    NaN  131.623762  246.264026    0.148515      NaN
std       9.082101    NaN    NaN   17.538143   51.830751    0.356198      NaN
min      29.000000    NaN    NaN   94.000000  126.000000    0.000000      NaN
25%      47.500000    NaN    NaN  120.000000  211.000000    0.000000      NaN
50%      55.000000    NaN    NaN  130.000000  240.000000    0.000000      NaN
75%      61.000000    NaN    NaN  140.000000  274.500000    0.000000      NaN
max      77.000000    NaN    NaN  200.000000  564.000000    1.000000      NaN

           thalach       exang     oldpeak       slope     ca   thal  target
count   303.000000  303.000000  303.000000  303.000000  303.0  303.0   303.0
unique         NaN         NaN         NaN         NaN    5.0    4.0     2.0
top            NaN         NaN         NaN         NaN    0.0    2.0     1.0
freq           NaN         NaN         NaN         NaN  175.0  166.0   165.0
mean    149.646865    0.326733    1.039604    1.399340    NaN    NaN     NaN
std      22.905161    0.469794    1.161075    0.616226    NaN    NaN     NaN
min      71.000000    0.000000    0.000000    0.000000    NaN    NaN     NaN
25%     133.500000    0.000000    0.000000    1.000000    NaN    NaN     NaN
50%     153.000000    0.000000    0.800000    1.000000    NaN    NaN     NaN
75%     166.000000    1.000000    1.600000    2.000000    NaN    NaN     NaN
max     202.000000    1.000000    6.200000    2.000000    NaN    NaN     NaN

Two discrepancies are observed. The docs say that ca & thal contain 4
and 3 unique values respectively. However, the descriptive statistics
show 5 & 4 unique values respectively. Let's investigate.
heart[['ca']].value_counts()
ca
0     175
1      65
2      38
3      20
4       5
dtype: int64

The docs say the values should be in the range of [0,3], however that
seems to be incorrect as we also have 4.
heart[['thal']].value_counts()
thal
2       166
3       117
1        18
0         2
dtype: int64

The docs say the values should be one of {3, 6, 7}, however that is
incorrect: we have values in the range of [0,3]. Without proper docs,
it's difficult to make sense of this discrepancy; resolving it requires
consultation with a domain expert or further research.
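The comparison between documented and observed values can be automated. Below is a minimal sketch: `documented` encodes the value sets stated in the docs, `unexpected_values` is a hypothetical helper name, and the toy frame holds one row per value seen in the counts above rather than the full dataset.

```python
import pandas as pd

# Value sets as stated in the dataset docs.
documented = {'ca': {0, 1, 2, 3}, 'thal': {3, 6, 7}}

def unexpected_values(df, documented):
    """Map each column to the sorted observed values that the docs do not list."""
    report = {}
    for column, allowed in documented.items():
        observed = set(df[column].unique().tolist())
        extra = sorted(observed - allowed)
        if extra:
            report[column] = extra
    return report

# Toy frame covering every value observed above (not the real row counts).
toy = pd.DataFrame({'ca': [0, 1, 2, 3, 4], 'thal': [2, 3, 1, 0, 2]})
print(unexpected_values(toy, documented))
```

A check like this only tells us that the docs and the data disagree; it cannot tell us which of the two is right.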
Let's look at missing values & duplicates next.
heart.isna().any()
age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
slope       False
ca          False
thal        False
target      False
dtype: bool
heart[heart.duplicated()].shape
(1, 14)
There is one duplicate row, which we should drop prior to further analysis.
heart = heart.drop_duplicates()
heart.shape
(302, 14)
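The same checks can be reduced to two counts, which is handy when they need to be asserted in a script. This is a small sketch on made-up rows (the last row repeats the first), not on the heart dataset itself.

```python
import pandas as pd

# Made-up rows for illustration: the last one repeats the first.
toy = pd.DataFrame({'age': [63, 37, 63], 'chol': [233, 250, 233]})

n_missing = int(toy.isna().sum().sum())    # total number of missing cells
n_duplicates = int(toy.duplicated().sum()) # rows identical to an earlier row

# drop_duplicates keeps the first occurrence of each repeated row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_missing, n_duplicates, deduplicated.shape)
```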
Finally, we check for correlations amongst features.
corr = heart.corr()
name = 'heatmap@heart--corr.png'
plotter.corr(corr, name)
name
There are several features that are positively correlated with one another. They should be investigated further and may present an opportunity for feature selection.
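To follow up on the heatmap, the correlated pairs can be listed explicitly. The sketch below is an assumption-laden example: `correlated_pairs` is a hypothetical helper, the 0.5 threshold is arbitrary, and the data is synthetic.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.5):
    """Return feature pairs whose absolute Pearson correlation meets the threshold."""
    corr = df.select_dtypes('number').corr()
    # Keep the strict upper triangle so each pair appears once and the
    # diagonal (self-correlation) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs.abs() >= threshold]

# Synthetic data: b is an exact multiple of a, c is only weakly related to a.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 1, 4, 2, 3],
})
pairs = correlated_pairs(toy)
print(pairs)
```

Using the absolute value means strongly negative correlations are reported too, which matters for feature selection just as much as positive ones.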
2.2. Distribution analysis
In this section we analyse the distributions of the features. Let's start with a histogram of all features to see the distribution of the values. We additionally overlay the continuous features with a kde plot. Since there are a lot of features, we separate the numerical from the categorical features into two plots.
name = 'histplot@heart--numerical.png'
fig, axs = plt.subplots(2, 4, figsize=(15, 10), sharey=True)
sns.histplot(data=heart, x='age', kde=True, ax=axs[0, 0])
sns.histplot(data=heart, x='trestbps', kde=True, ax=axs[0, 1])
sns.histplot(data=heart, x='chol', kde=True, ax=axs[0, 2])
sns.histplot(data=heart, x='fbs', kde=True, ax=axs[0, 3])
sns.histplot(data=heart, x='thalach', kde=True, ax=axs[1, 0])
sns.histplot(data=heart, x='exang', kde=True, ax=axs[1, 1])
sns.histplot(data=heart, x='oldpeak', kde=True, ax=axs[1, 2])
sns.histplot(data=heart, x='slope', kde=True, ax=axs[1, 3])
fig.savefig(name)
name

age, trestbps, chol & thalach appear roughly normally distributed
while oldpeak is skewed. Interestingly, fbs, exang & slope take
discrete values (verified using the value_counts() method). However,
without documentation, we cannot be sure whether they are categorical
in nature or numerical features with discrete values (and our dataset
simply contains examples with those specific values!).
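The per-column value_counts() check can be generalised by counting unique values for every column at once. This is a minimal sketch on the first five rows of the dataset; `low_cardinality` is a hypothetical helper and the cut-off of 3 is an arbitrary assumption.

```python
import pandas as pd

def low_cardinality(df, max_unique=3):
    """Return {column: number of unique values} for columns at or below max_unique."""
    counts = df.nunique()
    return counts[counts <= max_unique].to_dict()

# First five rows of the dataset, restricted to five columns.
sample = pd.DataFrame({
    'chol': [233, 250, 204, 236, 354],
    'fbs': [1, 0, 0, 0, 0],
    'exang': [0, 0, 0, 0, 1],
    'oldpeak': [2.3, 3.5, 1.4, 0.8, 0.6],
    'slope': [0, 0, 2, 2, 2],
})
print(low_cardinality(sample))
```

A low count flags a column as a candidate for closer inspection; it does not settle whether the feature is categorical.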
A box plot is often also useful for exploring the distribution, so
let's do that for the numerical features. We exclude fbs, exang &
slope since they have discrete values and a boxplot is not that useful
for them (verified manually). We also generate the plots individually:
the scales of the features vary, so with a shared y axis some of the
plots are squished!
name = 'boxplot@heart--numerical.png'
numerical_features = ['age', 'trestbps', 'chol', 'thalach', 'oldpeak']
fig, ax = plt.subplots(1, 5, figsize=(20, 5))
for idx, feature in enumerate(numerical_features):
    sns.boxplot(data=heart, y=feature, ax=ax[idx])
fig.savefig(name)
name
The individual boxplots are easier to interpret and we see that a few features have outliers which should be dealt with.
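The points a boxplot draws beyond its whiskers can also be extracted numerically. The sketch below assumes the common 1.5 × IQR rule, which is seaborn's default whisker length; `iqr_outliers` is a hypothetical helper, and the series combines the first five chol values with the maximum reported by describe().

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    below = series < q1 - k * iqr
    above = series > q3 + k * iqr
    return series[below | above]

# First five chol values from the dataset plus its maximum (564).
chol = pd.Series([233, 250, 204, 236, 354, 564], name='chol')
print(iqr_outliers(chol))
```

Whether such values are errors or genuine extreme readings is a domain question, so flagging them is only the first step in dealing with them.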
And finally the histogram for the categorical features.
name = 'histplot@heart--categorical.png'
fig, axs = plt.subplots(2, 3, figsize=(15, 10), sharey=True)
sns.histplot(data=heart, x='sex', discrete=True, ax=axs[0, 0])
sns.histplot(data=heart, x='cp', discrete=True, ax=axs[0, 1])
sns.histplot(data=heart, x='restecg', discrete=True, ax=axs[0, 2])
sns.histplot(data=heart, x='ca', discrete=True, ax=axs[1, 0])
sns.histplot(data=heart, x='thal', discrete=True, ax=axs[1, 1])
sns.histplot(data=heart, x='target', discrete=True, ax=axs[1, 2])
fig.savefig(name)
name

The number of examples per class is approximately equal. We have an
uneven distribution in sex, however without proper documentation we
cannot determine which gender 1 represents. sex is usually considered
a sensitive feature to include. However, in this case we have domain
knowledge that men are more prone to heart attacks than women, so it's
a bit unclear whether this feature is sensitive in this context. The
rest of the features need to be conditioned on target to derive
meaningful insights.
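One lightweight way to condition a categorical feature on target is a normalised cross-tabulation. The following sketch uses made-up rows purely to show the mechanics; the proportions it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    'sex': [1, 1, 0, 1, 0, 0],
    'target': [1, 1, 1, 0, 0, 0],
})

# normalize='index' turns each row into proportions, so every sex value
# gets its own distribution over target and the rows sum to 1.
rates = pd.crosstab(toy['sex'], toy['target'], normalize='index')
print(rates)
```

Row-wise normalisation removes the effect of the uneven sex distribution, which raw counts would otherwise carry into the comparison.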
2.3. Relational analysis
In this section we analyse the relationships of the features. Let's start with a scatterplot of the numerical features.
name = 'pairplot@heart.png'
g = sns.pairplot(data=heart[numerical_features], diag_kind=None, corner=True)
g.fig.savefig(name)
name

There are no discernible patterns in the plots besides the ones in
plots against oldpeak, but this is expected since we have many
examples where oldpeak is 0. Next we can condition on target & sex.
name = 'pairplot@heart--hue:target.png'
g = sns.pairplot(data=heart, vars=numerical_features, diag_kind=None, corner=True, hue='target')
g.fig.savefig(name)
name

name = 'pairplot@heart--hue:sex.png'
g = sns.pairplot(data=heart, vars=numerical_features, diag_kind=None, corner=True, hue='sex')
g.fig.savefig(name)
name
The conditioned plots are not that useful.
2.4. Categorical analysis
In this section we analyse the categorical features of the dataset. To
start, let's look at the distribution of age within target.
name = 'catplot@heart--target-age--hue:sex.png'
fig, ax = plt.subplots(2, 3, figsize=(15, 10), sharey=True)
sns.swarmplot(data=heart, x='target', y='age', ax=ax[0, 0])
sns.boxplot(data=heart, x='target', y='age', ax=ax[0, 1])
sns.violinplot(data=heart, x='target', y='age', ax=ax[0, 2])
sns.swarmplot(data=heart, x='target', y='age', ax=ax[1, 0], hue='sex', dodge=True)
sns.boxplot(data=heart, x='target', y='age', ax=ax[1, 1], hue='sex', dodge=True)
sns.violinplot(data=heart, x='target', y='age', ax=ax[1, 2], hue='sex', dodge=True)
fig.savefig(name)
name
I have intentionally made 3 different types of categorical plots, to
compare and contrast the types. On the top row we see a swarmplot,
boxplot and violinplot of age within the two targets. On the second
row we see the same plots, but this time they are conditioned on sex.
From the first row, we can see that the chance of getting a heart
attack is higher in sex 0 between the ages of 55 and 65, whereas in
sex 1 this range is larger: 40-70 years. When we condition on sex, we
see that the chance of getting a heart attack is the same for both
sexes.
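The age ranges read off the plots can be cross-checked with a grouped summary. This is a minimal sketch: the first five rows come from the dataset, while the three target 0 rows are made up so that every group is populated, so the numbers only illustrate the mechanics.

```python
import pandas as pd

toy = pd.DataFrame({
    # First five values are the first five rows of the dataset;
    # the last three rows are made up for illustration.
    'age': [63, 37, 41, 56, 57, 67, 62, 53],
    'sex': [1, 1, 0, 1, 0, 1, 0, 1],
    'target': [1, 1, 1, 1, 1, 0, 0, 0],
})

# One row per (target, sex) combination with a few age statistics,
# mirroring what the conditioned boxplots show visually.
summary = toy.groupby(['target', 'sex'])['age'].agg(['count', 'min', 'median', 'max'])
print(summary)
```

Run on the full dataset, a table like this gives exact quartile-style figures to set beside the ranges estimated by eye from the plots.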