Heart

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the heart dataset. We start by reading the accompanying data docs. The dataset was downloaded from Kaggle, so the docs are hosted on the website itself. They are not that helpful, giving us only the column names.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

heart = pd.read_csv('../data/data/heart.csv')
heart.head()
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  \
0   63    1   3       145   233    1        0      150      0      2.3      0   
1   37    1   2       130   250    0        1      187      0      3.5      0   
2   41    0   1       130   204    0        0      172      0      1.4      2   
3   56    1   1       120   236    0        1      178      0      0.8      2   
4   57    0   0       120   354    0        1      163      1      0.6      2   

   ca  thal  target  
0   0     1       1  
1   0     2       1  
2   0     2       1  
3   0     2       1  
4   0     2       1  

Next we look at the features and their dtypes.

heart.columns
Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
heart.dtypes
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object

The names of the columns are a bit cryptic, but the docs contain more descriptive names. The following mapping of the column names to their descriptive names should help.

column    description
age
sex
cp        chest pain type (4 values)
trestbps  resting blood pressure
chol      serum cholesterol
fbs       fasting blood sugar
restecg   resting electrocardiographic results (values 0, 1, 2)
thalach   max heart rate achieved
exang     exercise induced angina
oldpeak   ST depression induced by exercise relative to rest
slope     the slope of the peak exercise ST segment
ca        number of major vessels colored by fluoroscopy (0-3)
thal      3=normal; 6=fixed defect; 7=reversible defect

We note that cp, restecg, ca & thal are categorical (they take specific, discrete values). This is a gotcha: because they are represented as numbers, they appear to have a hierarchy amongst one another. We should convert them to category dtype prior to further analysis. Even with the descriptive names, some features such as oldpeak, slope and thal remain unclear. This highlights the need for better data documentation.

In this case, the task is binary classification (the patient gets a heart attack or not), thus target is either a 0 or a 1. We should one-hot encode that before training. The sex feature is categorical but represented numerically. Training a model with this representation may bias the model through unwanted hierarchy, so the feature should be one-hot encoded. Another problem is that, without proper docs, we don't know which sex is represented by which digit.
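A minimal sketch of the one-hot encoding mentioned above, assuming pandas' get_dummies is an acceptable way to do it (the original analysis does not say which encoder it uses). The values are the sex, cp and target columns of the five rows printed by heart.head().

```python
import pandas as pd

# sex, cp and target values of the five rows shown by heart.head()
sample = pd.DataFrame({
    'sex': [1, 1, 0, 1, 0],
    'cp': [3, 2, 1, 1, 0],
    'target': [1, 1, 1, 1, 1],
})

# One indicator column per observed sex value, so no ordering is implied.
# Untouched columns come first, the new indicator columns are appended.
encoded = pd.get_dummies(sample, columns=['sex'], prefix='sex')
print(encoded.columns.tolist())
```

The same call accepts several column names at once, so cp or thal could be encoded in the same step if that turns out to be appropriate.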

Let's convert the categorical features to category dtype prior to further analysis.

categorical_features = ['sex', 'cp', 'restecg', 'ca', 'thal', 'target']

heart = utils.to_categorical(heart, categorical_features)
heart.dtypes
age            int64
sex         category
cp          category
trestbps       int64
chol           int64
fbs            int64
restecg     category
thalach        int64
exang          int64
oldpeak      float64
slope          int64
ca          category
thal        category
target      category
dtype: object
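
The utils.to_categorical helper lives in the project's src package and is not shown here. A hypothetical stand-in, assuming it does nothing more than cast the listed columns to category dtype, could look like this (the name to_categorical_sketch is made up for illustration):

```python
import pandas as pd


def to_categorical_sketch(df, columns):
    """Return a copy of df with the given columns cast to category dtype.

    Hypothetical stand-in for utils.to_categorical, whose real
    implementation is not part of this document.
    """
    return df.astype({column: 'category' for column in columns})


# age, sex and cp values of the five rows shown by heart.head()
sample = pd.DataFrame({
    'age': [63, 37, 41, 56, 57],
    'sex': [1, 1, 0, 1, 0],
    'cp': [3, 2, 1, 1, 0],
})
converted = to_categorical_sketch(sample, ['sex', 'cp'])
print(converted.dtypes)
```

Because astype returns a new frame, the input is left untouched, which matches the way the result is reassigned to heart above.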

Let's look at the descriptive statistics next.

heart.describe(include='all')
               age    sex     cp    trestbps        chol         fbs  restecg  \
count   303.000000  303.0  303.0  303.000000  303.000000  303.000000    303.0   
unique         NaN    2.0    4.0         NaN         NaN         NaN      3.0   
top            NaN    1.0    0.0         NaN         NaN         NaN      1.0   
freq           NaN  207.0  143.0         NaN         NaN         NaN    152.0   
mean     54.366337    NaN    NaN  131.623762  246.264026    0.148515      NaN   
std       9.082101    NaN    NaN   17.538143   51.830751    0.356198      NaN   
min      29.000000    NaN    NaN   94.000000  126.000000    0.000000      NaN   
25%      47.500000    NaN    NaN  120.000000  211.000000    0.000000      NaN   
50%      55.000000    NaN    NaN  130.000000  240.000000    0.000000      NaN   
75%      61.000000    NaN    NaN  140.000000  274.500000    0.000000      NaN   
max      77.000000    NaN    NaN  200.000000  564.000000    1.000000      NaN   

           thalach       exang     oldpeak       slope     ca   thal  target  
count   303.000000  303.000000  303.000000  303.000000  303.0  303.0   303.0  
unique         NaN         NaN         NaN         NaN    5.0    4.0     2.0  
top            NaN         NaN         NaN         NaN    0.0    2.0     1.0  
freq           NaN         NaN         NaN         NaN  175.0  166.0   165.0  
mean    149.646865    0.326733    1.039604    1.399340    NaN    NaN     NaN  
std      22.905161    0.469794    1.161075    0.616226    NaN    NaN     NaN  
min      71.000000    0.000000    0.000000    0.000000    NaN    NaN     NaN  
25%     133.500000    0.000000    0.000000    1.000000    NaN    NaN     NaN  
50%     153.000000    0.000000    0.800000    1.000000    NaN    NaN     NaN  
75%     166.000000    1.000000    1.600000    2.000000    NaN    NaN     NaN  
max     202.000000    1.000000    6.200000    2.000000    NaN    NaN     NaN  

Two discrepancies are observed. The docs say that ca & thal contain 4 and 3 unique values respectively. However, the descriptive statistics show 5 & 4 unique values respectively. Let's investigate.

heart[['ca']].value_counts()
ca
0     175
1      65
2      38
3      20
4       5
dtype: int64

The docs say the values should be in the range [0,3], but that seems to be incorrect since the value 4 also occurs (5 examples).

heart[['thal']].value_counts()
thal
2       166
3       117
1        18
0         2
dtype: int64

The docs say the values should be one of {3, 6, 7}, but that is incorrect: we have values in the range [0,3]. Without proper docs, it's difficult to make sense of this discrepancy; resolving it requires consultation with a domain expert or further research.
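
The comparison between documented and observed values can also be automated. The sketch below is one possible way to do it; the helper name undocumented_values is hypothetical, and the sample frame only lists the distinct values that appear in the value_counts() outputs above, not the full dataset.

```python
import pandas as pd


def undocumented_values(df, documented):
    """Map each column to the sorted observed values the docs do not list."""
    report = {}
    for column, allowed in documented.items():
        observed = set(df[column].dropna().unique().tolist())
        report[column] = sorted(observed - set(allowed))
    return report


# Value sets as stated in the dataset docs
documented = {'ca': {0, 1, 2, 3}, 'thal': {3, 6, 7}}

# Distinct values seen in the value_counts() outputs (one row per ca value)
sample = pd.DataFrame({
    'ca': [0, 1, 2, 3, 4],
    'thal': [2, 3, 1, 0, 2],
})
report = undocumented_values(sample, documented)
print(report)
```

Running a check like this right after loading makes mismatches between docs and data visible before any modelling decision depends on them.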

Let's look at missing values & duplicates next.

heart.isna().any()
age         False
sex         False
cp          False
trestbps    False
chol        False
fbs         False
restecg     False
thalach     False
exang       False
oldpeak     False
slope       False
ca          False
thal        False
target      False
dtype: bool
heart[heart.duplicated()].shape
(1, 14)

There is one duplicate row; we should drop it prior to further analysis.

heart = heart.drop_duplicates()
heart.shape
(302, 14)
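
Before dropping duplicates it can help to look at every copy of a repeated row, not only the later occurrence that duplicated() flags by default. A small sketch with keep=False follows; the frame is illustrative, and its last row is repeated on purpose rather than taken from the dataset.

```python
import pandas as pd

# Illustrative rows only; the last row deliberately repeats the second one.
sample = pd.DataFrame({
    'age': [63, 37, 41, 37],
    'sex': [1, 1, 0, 1],
    'chol': [233, 250, 204, 250],
})

# keep=False marks every member of a duplicate group, not just the repeats
all_copies = sample[sample.duplicated(keep=False)]
print(all_copies)

# drop_duplicates keeps the first occurrence by default
deduplicated = sample.drop_duplicates().reset_index(drop=True)
print(deduplicated.shape)
```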

Finally, we check for correlations amongst features.

corr = heart.corr()
name = 'heatmap@heart--corr.png'
plotter.corr(corr, name)
name

heatmap@heart--corr.png

There are several features that are positively correlated with one another. They should be investigated further and may present an opportunity for feature selection.
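
To follow up on the heatmap, the correlated pairs can be listed explicitly. This is a sketch, not the project's method: the helper strongest_pairs and the 0.3 threshold are assumptions chosen for illustration, and the toy frame is made up so that the result is easy to check by eye.

```python
import numpy as np
import pandas as pd


def strongest_pairs(corr, threshold=0.3):
    """Return feature pairs with |correlation| >= threshold, strongest first."""
    # Keep the strict upper triangle so each pair is listed once and the
    # diagonal (every feature against itself) is ignored.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    pairs = pairs[pairs.abs() >= threshold]
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Made-up toy data, NOT the heart dataset: b is a multiple of a, c is unrelated.
toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [1, -1, 0, -1, 1],
})
pairs = strongest_pairs(toy.corr())
print(pairs)
```

Filtering on the absolute value keeps strong negative correlations in the list as well, which a visual scan of the heatmap can miss.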

2.2. Distribution analysis

In this section we analyse the distributions of the features. Let's start with a histogram of all features to see the distribution of the values. We additionally overlay the continuous features with a kde plot. Since there are a lot of features, we separate the numerical from the categorical features into two plots.

name = 'histplot@heart--numerical.png'

fig, axs = plt.subplots(2, 4, figsize=(15, 10), sharey=True)

sns.histplot(data=heart, x='age', kde=True, ax=axs[0, 0])
sns.histplot(data=heart, x='trestbps', kde=True, ax=axs[0, 1])
sns.histplot(data=heart, x='chol', kde=True, ax=axs[0, 2])
sns.histplot(data=heart, x='fbs', kde=True, ax=axs[0, 3])

sns.histplot(data=heart, x='thalach', kde=True, ax=axs[1, 0])
sns.histplot(data=heart, x='exang', kde=True, ax=axs[1, 1])
sns.histplot(data=heart, x='oldpeak', kde=True, ax=axs[1, 2])
sns.histplot(data=heart, x='slope', kde=True, ax=axs[1, 3])

fig.savefig(name)
name

histplot@heart--numerical.png

age, trestbps, chol & thalach are approximately normally distributed while oldpeak is skewed. Interestingly, fbs, exang & slope seem to have discrete values (verified using the value_counts() method). However, without documentation, we cannot be sure if they are categorical in nature or numerical features with discrete values (and our dataset simply contains examples with those specific values!).
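
The visual impressions above can be backed by two quick numbers per column: the count of distinct values and the sample skewness. The sketch uses a few columns from the five rows printed by heart.head(); five rows are far too few to judge a distribution, so the output only illustrates the mechanics.

```python
import pandas as pd

# A few columns of the five rows shown by heart.head()
sample = pd.DataFrame({
    'age': [63, 37, 41, 56, 57],
    'fbs': [1, 0, 0, 0, 0],
    'exang': [0, 0, 0, 0, 1],
    'slope': [0, 0, 2, 2, 2],
    'oldpeak': [2.3, 3.5, 1.4, 0.8, 0.6],
})

summary = pd.DataFrame({
    'n_unique': sample.nunique(),  # few distinct values hints at a discrete feature
    'skew': sample.skew(),         # positive means a longer right tail
})
print(summary)
```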

A box plot is often also useful for exploring the distribution, so let's do that for the numerical features. However, we exclude fbs, exang & slope since they have discrete values and a boxplot is not that useful for them (verified manually). We also generate the plots individually, without a shared y axis: the scales of the features vary, so with a shared y axis some of the plots are squished!

name = 'boxplot@heart--numerical.png'
numerical_features = ['age',
                      'trestbps',
                      'chol',
                      'thalach',
                      'oldpeak']

fig, ax = plt.subplots(1, 5, figsize=(20, 5))

for idx, feature in enumerate(numerical_features):
    sns.boxplot(data=heart, y=feature, ax=ax[idx])

fig.savefig(name)
name

boxplot@heart--numerical.png

The individual boxplots are easier to interpret and we see that a few features have outliers which should be dealt with.
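
One common way to list the points that a box plot draws beyond its whiskers is the 1.5 x IQR rule. The sketch below assumes that rule; the helper name iqr_outliers is hypothetical, and the values are made up apart from 564, which is the maximum of chol reported by describe() above.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    below = series < q1 - k * iqr
    above = series > q3 + k * iqr
    return series[below | above]


# Made-up cholesterol-like values; only 564 comes from the describe() output.
chol = pd.Series([200, 210, 220, 230, 240, 564], name='chol')
outliers = iqr_outliers(chol)
print(outliers)
```

Whether such points should be removed, capped or kept is a separate decision that depends on whether they are measurement errors or genuine extreme patients.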

And finally the histogram for the categorical features.

name = 'histplot@heart--categorical.png'

fig, axs = plt.subplots(2, 3, figsize=(15, 10), sharey=True)

sns.histplot(data=heart, x='sex', discrete=True, ax=axs[0, 0])
sns.histplot(data=heart, x='cp', discrete=True, ax=axs[0, 1])
sns.histplot(data=heart, x='restecg', discrete=True, ax=axs[0, 2])

sns.histplot(data=heart, x='ca', discrete=True, ax=axs[1, 0])
sns.histplot(data=heart, x='thal', discrete=True, ax=axs[1, 1])
sns.histplot(data=heart, x='target', discrete=True, ax=axs[1, 2])

fig.savefig(name)
name

histplot@heart--categorical.png

The number of examples per class is approximately equal. We have an uneven distribution in sex, however without proper documentation we cannot determine which sex the value 1 represents. sex is usually considered a sensitive feature to include. However, in this case we have domain knowledge that men are more prone to heart attacks than women, so it's a bit unclear whether this feature is sensitive in this context. The rest of the features need to be conditioned on target to derive meaningful insights.
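
Class balance and the conditioning on target can also be read off as proportions. The sketch below uses value_counts(normalize=True) and pd.crosstab on a made-up toy frame (not the heart dataset), chosen so the fractions are easy to verify by hand.

```python
import pandas as pd

# Made-up toy data, NOT the heart dataset
sample = pd.DataFrame({
    'sex': [0, 0, 1, 1, 1, 1],
    'target': [1, 0, 1, 0, 0, 0],
})

# Share of examples per class
balance = sample['target'].value_counts(normalize=True)
print(balance)

# Share of each target value within each sex (every row sums to 1)
by_sex = pd.crosstab(sample['sex'], sample['target'], normalize='index')
print(by_sex)
```

Normalising by row makes groups of different sizes comparable, which matters here because one sex has roughly twice as many examples as the other.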

2.3. Relational analysis

In this section we analyse the relationships of the features. Let's start with a scatterplot of the numerical features.

name = 'pairplot@heart.png'

g = sns.pairplot(data=heart[numerical_features], diag_kind=None, corner=True)
g.fig.savefig(name)
name

pairplot@heart.png

There are no discernible patterns in the plots besides the ones in the plots against oldpeak, but this is expected since we have many examples where oldpeak is 0. Next we can condition on target & sex.

name = 'pairplot@heart--hue:target.png'

g = sns.pairplot(data=heart,
                 vars=numerical_features,
                 diag_kind=None,
                 corner=True,
                 hue='target')
g.fig.savefig(name)
name

pairplot@heart--hue:target.png

name = 'pairplot@heart--hue:sex.png'

g = sns.pairplot(data=heart,
                 vars=numerical_features,
                 diag_kind=None,
                 corner=True,
                 hue='sex')
g.fig.savefig(name)
name

pairplot@heart--hue:sex.png

The conditioned plots are not that useful.

2.4. Categorical analysis

In this section we analyse the categorical features of the dataset. To start, let's look at the distribution of age within target.

name = 'catplot@heart--target-age--hue:sex.png'

fig, ax = plt.subplots(2, 3, figsize=(15, 10), sharey=True)

sns.swarmplot(data=heart, x='target', y='age', ax=ax[0, 0])
sns.boxplot(data=heart, x='target', y='age', ax=ax[0, 1])
sns.violinplot(data=heart, x='target', y='age', ax=ax[0, 2])

sns.swarmplot(data=heart, x='target', y='age', ax=ax[1, 0], hue='sex', dodge=True)
sns.boxplot(data=heart, x='target', y='age', ax=ax[1, 1], hue='sex', dodge=True)
sns.violinplot(data=heart, x='target', y='age', ax=ax[1, 2], hue='sex', dodge=True)

fig.savefig(name)
name

catplot@heart--target-age--hue:sex.png

I have intentionally made three different types of categorical plots, to compare and contrast amongst the types. On the top row we see a swarmplot, boxplot and violinplot of age within the two targets. On the second row we see the same plots, but this time they are conditioned on sex. The first row is not split by sex, so it can only be read per target: the examples in target 0 are concentrated in the 55-65 age range, whereas in target 1 this range is larger: 40-70 years. When we condition on sex, we see that the chance of getting a heart attack is the same for both sexes.
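
The age ranges read off the plots can be cross-checked numerically with a groupby. The sketch below uses made-up ages (not the heart dataset) that merely mimic the ranges quoted above, first grouped by target and then by target and sex.

```python
import pandas as pd

# Made-up toy data, NOT the heart dataset
sample = pd.DataFrame({
    'target': [0, 0, 0, 0, 1, 1, 1, 1],
    'sex': [0, 1, 0, 1, 0, 1, 0, 1],
    'age': [55, 58, 60, 65, 40, 50, 60, 70],
})

stats = ['min', 'median', 'max']

# Corresponds to the top row of plots: age within each target
by_target = sample.groupby('target')['age'].agg(stats)
print(by_target)

# Corresponds to the bottom row: age within each target, split by sex
by_target_and_sex = sample.groupby(['target', 'sex'])['age'].agg(stats)
print(by_target_and_sex)
```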

Date: 2021-10-15 Fri 00:00

Created: 2021-10-22 Fri 21:48