Suicide

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the suicide dataset. We start by reading the dataset documentation, which is not very informative.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

suicide = pd.read_csv(utils.data_path('suicide.csv'))
suicide.head()
   country  year     sex          age  suicides_no  population  \
0  Albania  1987    male  15-24 years           21      312900   
1  Albania  1987    male  35-54 years           16      308000   
2  Albania  1987  female  15-24 years           14      289700   
3  Albania  1987    male    75+ years            1       21800   
4  Albania  1987    male  25-34 years            9      274300   

   suicides/100k pop country-year  HDI for year  gdp_for_year ($)   \
0               6.71  Albania1987           NaN      2,156,624,900   
1               5.19  Albania1987           NaN      2,156,624,900   
2               4.83  Albania1987           NaN      2,156,624,900   
3               4.59  Albania1987           NaN      2,156,624,900   
4               3.28  Albania1987           NaN      2,156,624,900   

   gdp_per_capita ($)       generation  
0                 796     Generation X  
1                 796           Silent  
2                 796     Generation X  
3                 796  G.I. Generation  
4                 796          Boomers  
suicide.shape
27820 12
suicide.dtypes
country                object
year                    int64
sex                    object
age                    object
suicides_no             int64
population              int64
suicides/100k pop     float64
country-year           object
HDI for year          float64
 gdp_for_year ($)      object
gdp_per_capita ($)      int64
generation             object
dtype: object

We have a mix of categorical and numerical features. Note that gdp_for_year ($) is stored as object and its column name carries stray whitespace. Let's investigate the features in more detail, and clean up the column names for simplicity.

columns = suicide.columns
columns = columns.str.strip()
columns = columns.str.lower()
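# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
columns = columns.str.replace(r'[^\w\d]+', '_', regex=True)
columns = columns.str.replace(r'_+$', '', regex=True)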
suicide.columns = columns
suicide.columns
Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population',
       'suicides_100k_pop', 'country_year', 'hdi_for_year', 'gdp_for_year',
       'gdp_per_capita', 'generation'],
      dtype='object')

2.1.1. Handling categorical features

country, sex, age & generation are all categorical features, and should be converted to category dtype.

categorical_features = ['country',
                        'sex',
                        'age',
                        'generation']

categorical = suicide[categorical_features].copy()  # work on a copy to avoid SettingWithCopyWarning

for column in categorical.columns:
    categorical[column] = categorical[column].str.strip().astype('category')

categorical.describe(include='all')
          country     sex          age    generation
count       27820   27820        27820         27820
unique        101       2            6             6
top     Mauritius  female  15-24 years  Generation X
freq          382   13910         4642          6408

We should one-hot encode sex, since it is a sensitive feature and there is no hierarchy among its values. age, on the other hand, can be label encoded, since there is an implicit order among its values. It is not yet clear how best to handle the geographical information in country.

suicide[categorical_features] = categorical
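
The encoding choices described above could be sketched as follows. This is a minimal illustration on a toy frame, not the encoding used later in the analysis. The ordering in `age_order` is an assumption (two of the six bands, `5-14 years` and `55-74 years`, do not appear in the rows printed earlier), and the `age_code` column name is hypothetical.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female'],
    'age': ['15-24 years', '75+ years', '5-14 years', '35-54 years'],
})

# Assumed ordering of the six age bands, youngest to oldest.
age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']
age_dtype = pd.CategoricalDtype(categories=age_order, ordered=True)

# Label (ordinal) encoding: each band becomes its position in age_order.
df['age_code'] = df['age'].astype(age_dtype).cat.codes

# One-hot encoding: one indicator column per value of sex.
sex_dummies = pd.get_dummies(df['sex'], prefix='sex', dtype=int)
encoded = pd.concat([df, sex_dummies], axis='columns')

print(encoded)
```

Using an ordered `CategoricalDtype` keeps the order explicit rather than relying on alphabetical sorting, which would place `5-14 years` after `35-54 years`.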

2.1.2. Handling redundant features

country_year (originally country-year) duplicates information already held in the country and year features, so we can drop it.

suicide = suicide.drop('country_year', axis='columns')
suicide.shape
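
Before dropping the column, it may be worth confirming that it really is redundant. The sketch below uses a toy frame with illustrative values and assumes the combined key is simply the country name followed by the year, as in `Albania1987`.

```python
import pandas as pd

# Toy rows mirroring the columns of interest (values are illustrative only).
df = pd.DataFrame({
    'country': ['Albania', 'Albania', 'Aruba'],
    'year': [1987, 1988, 1995],
    'country_year': ['Albania1987', 'Albania1988', 'Aruba1995'],
})

# Rebuild the combined key from its parts; astype(str) also handles category dtype.
rebuilt = df['country'].astype(str) + df['year'].astype(str)

# True only if every row of country_year is fully determined by country and year.
is_redundant = bool(rebuilt.eq(df['country_year']).all())
print(is_redundant)
```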

2.1.3. Handling gdp_for_year

This is a numerical feature represented as text, in a human-friendly format with separators. Monetary values are commonly written in one of two formats: the American format ('.' as the decimal separator and ',' as the thousands separator) or the European format ('.' as the thousands separator and ',' as the decimal separator). Without documentation, it is unclear which format is used, so we inspect the values.

suicide['gdp_for_year']
0         2,156,624,900
1         2,156,624,900
2         2,156,624,900
3         2,156,624,900
4         2,156,624,900
              ...      
27815    63,067,077,179
27816    63,067,077,179
27817    63,067,077,179
27818    63,067,077,179
27819    63,067,077,179
Name: gdp_for_year, Length: 27820, dtype: object
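
Rather than judging the format from a handful of printed rows, the whole column can be tested against a pattern. The sketch below is a minimal check, assuming the values are plain strings; the helper name `all_us_grouped` is hypothetical.

```python
import re

# Matches integers grouped in threes with commas, e.g. "2,156,624,900".
US_GROUPED_INT = re.compile(r'^\d{1,3}(,\d{3})*$')


def all_us_grouped(values):
    """Return True if every value is a comma-grouped integer with no decimals."""
    return all(US_GROUPED_INT.match(v.strip()) is not None for v in values)


sample = ['2,156,624,900', '63,067,077,179']
print(all_us_grouped(sample))        # American-style grouping
print(all_us_grouped(['1.234,56']))  # European-style value does not match
```

A value with several commas cannot be a decimal number, so a column that passes this check throughout uses ',' as the thousands separator. On the real column, the same pattern could be applied with `Series.str.fullmatch`.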

The values appear to be whole numbers (no decimals), so the commas are thousands separators. Let's strip them and convert this feature to float dtype.

gdp_for_year = suicide['gdp_for_year']
gdp_for_year = gdp_for_year.str.strip()
gdp_for_year = gdp_for_year.str.replace(',', '')
gdp_for_year = gdp_for_year.astype('float')
suicide['gdp_for_year'] = gdp_for_year
suicide['gdp_for_year'].describe()
count    2.782000e+04
mean     4.455810e+11
std      1.453610e+12
min      4.691962e+07
25%      8.985353e+09
50%      4.811469e+10
75%      2.602024e+11
max      1.812071e+13
Name: gdp_for_year, dtype: float64
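
As an alternative to cleaning the strings after loading, `pd.read_csv` can parse the separators directly through its `thousands` parameter. The sketch below uses a small in-memory CSV with a simplified header rather than the real file.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for suicide.csv (header simplified for the example).
raw = io.StringIO(
    'row,gdp_for_year\n'
    '0,"2,156,624,900"\n'
    '27815,"63,067,077,179"\n'
)

# thousands=',' tells the parser to treat commas inside numbers as grouping marks.
df = pd.read_csv(raw, thousands=',')
print(df['gdp_for_year'].dtype)
print(df['gdp_for_year'].tolist())
```

Note that `thousands` applies to every column in the file, so this is only safe when no other column uses commas with a different meaning.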

2.1.4. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics next.

suicide.describe(include='all')
          country          year     sex          age   suicides_no  \
count       27820  27820.000000   27820        27820  27820.000000   
unique        101           NaN       2            6           NaN   
top     Mauritius           NaN  female  15-24 years           NaN   
freq          382           NaN   13910         4642           NaN   
mean          NaN   2001.258375     NaN          NaN    242.574407   
std           NaN      8.469055     NaN          NaN    902.047917   
min           NaN   1985.000000     NaN          NaN      0.000000   
25%           NaN   1995.000000     NaN          NaN      3.000000   
50%           NaN   2002.000000     NaN          NaN     25.000000   
75%           NaN   2008.000000     NaN          NaN    131.000000   
max           NaN   2016.000000     NaN          NaN  22338.000000   

          population  suicides_100k_pop country_year  hdi_for_year  \
count   2.782000e+04       27820.000000        27820   8364.000000   
unique           NaN                NaN         2321           NaN   
top              NaN                NaN  Albania1987           NaN   
freq             NaN                NaN           12           NaN   
mean    1.844794e+06          12.816097          NaN      0.776601   
std     3.911779e+06          18.961511          NaN      0.093367   
min     2.780000e+02           0.000000          NaN      0.483000   
25%     9.749850e+04           0.920000          NaN      0.713000   
50%     4.301500e+05           5.990000          NaN      0.779000   
75%     1.486143e+06          16.620000          NaN      0.855000   
max     4.380521e+07         224.970000          NaN      0.944000   

        gdp_for_year  gdp_per_capita    generation  
count   2.782000e+04    27820.000000         27820  
unique           NaN             NaN             6  
top              NaN             NaN  Generation X  
freq             NaN             NaN          6408  
mean    4.455810e+11    16866.464414           NaN  
std     1.453610e+12    18887.576472           NaN  
min     4.691962e+07      251.000000           NaN  
25%     8.985353e+09     3447.000000           NaN  
50%     4.811469e+10     9372.000000           NaN  
75%     2.602024e+11    24874.000000           NaN  
max     1.812071e+13   126352.000000           NaN  

hdi_for_year is the only column with missing values: only 8,364 of the 27,820 rows (about 30%) have a value. We can either drop the column or consider imputation.
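
Both options can be sketched on a toy frame. The values below are illustrative only, and the per-country median is just one possible imputation strategy, chosen here as an assumption rather than a recommendation.

```python
import pandas as pd

# Toy rows (illustrative values only) with gaps in hdi_for_year.
df = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'hdi_for_year': [0.5, None, 0.75, 0.25, None],
})

# Share of missing values per column.
missing_share = df.isna().mean()
print(missing_share)

# Option 1: drop the sparse column entirely.
dropped = df.drop(columns='hdi_for_year')

# Option 2: fill gaps with the median of the same country (a hypothetical choice).
country_median = df.groupby('country')['hdi_for_year'].transform('median')
imputed = df['hdi_for_year'].fillna(country_median)
print(imputed.tolist())
```

With roughly 70% of the real column missing, any imputed values would dominate the feature, which is an argument for dropping it.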

Let's check for duplicates next.

suicide[suicide.duplicated(keep=False)].shape
0 12
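
The check above only finds rows that are identical in every column. A stricter variant looks for repeated keys instead. The sketch below assumes that (country, year, sex, age) should identify a row uniquely, which is an assumption about the dataset rather than something its documentation states.

```python
import pandas as pd

# Toy rows (illustrative only): the last row repeats an earlier key with a different count.
df = pd.DataFrame({
    'country': ['A', 'A', 'A'],
    'year': [1987, 1987, 1987],
    'sex': ['male', 'female', 'male'],
    'age': ['15-24 years', '15-24 years', '15-24 years'],
    'suicides_no': [21, 14, 20],
})

# Assumed natural key of the dataset.
key = ['country', 'year', 'sex', 'age']

# Rows identical in every column versus rows sharing the same key.
exact_dupes = df[df.duplicated(keep=False)]
key_dupes = df[df.duplicated(subset=key, keep=False)]
print(len(exact_dupes), len(key_dupes))
```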

2.1.5. Correlations

Let's look at the correlations between the numerical features.

name = 'heatmap@suicide--corr.png'
corr = suicide.corr(numeric_only=True)  # skip non-numeric columns; needed on pandas >= 2.0
plotter.corr(corr, name)
name

heatmap@suicide--corr.png

Several features are positively correlated with one another.
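
To see which pairs stand out without reading them off the heatmap, the correlation matrix can be flattened into a ranked list. The sketch below uses a toy frame with made-up columns `x`, `y` and `z`; on the real data, `corr` would be the matrix computed above.

```python
from itertools import combinations

import pandas as pd

# Toy numeric frame (illustrative only): y is a multiple of x, z moves the other way.
df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],
    'z': [5, 3, 4, 1, 2],
})
corr = df.corr()

# Visit each unordered pair of columns once, skipping the diagonal.
pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]

# Strongest relationships first, regardless of sign.
pairs.sort(key=lambda pair: abs(pair[2]), reverse=True)

for a, b, r in pairs:
    print(f'{a} ~ {b}: {r:+.2f}')
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a scan for warm colours on the heatmap can miss.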

Date: 2021-11-05 Fri 00:00

Created: 2021-11-05 Fri 14:30