Suicide
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the suicide dataset. We start by reading the data documentation, which is not very informative.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
suicide = pd.read_csv(utils.data_path('suicide.csv'))
suicide.head()
   country  year     sex          age  suicides_no  population  \
0  Albania  1987    male  15-24 years           21      312900
1  Albania  1987    male  35-54 years           16      308000
2  Albania  1987  female  15-24 years           14      289700
3  Albania  1987    male    75+ years            1       21800
4  Albania  1987    male  25-34 years            9      274300

   suicides/100k pop country-year  HDI for year gdp_for_year ($)  \
0               6.71  Albania1987           NaN    2,156,624,900
1               5.19  Albania1987           NaN    2,156,624,900
2               4.83  Albania1987           NaN    2,156,624,900
3               4.59  Albania1987           NaN    2,156,624,900
4               3.28  Albania1987           NaN    2,156,624,900

   gdp_per_capita ($)       generation
0                 796     Generation X
1                 796           Silent
2                 796     Generation X
3                 796  G.I. Generation
4                 796          Boomers
suicide.shape
(27820, 12)
suicide.dtypes
country                object
year                    int64
sex                    object
age                    object
suicides_no             int64
population              int64
suicides/100k pop     float64
country-year           object
HDI for year          float64
gdp_for_year ($)       object
gdp_per_capita ($)      int64
generation             object
dtype: object
We have a mix of categorical & numerical features. Let's investigate the features in more detail. Let's also clean up the column names for simplicity.
columns = suicide.columns
columns = columns.str.strip()
columns = columns.str.lower()
# Collapse each run of non-word characters into a single underscore.
columns = columns.str.replace(r'[^\w\d]+', '_', regex=True)
# Drop any trailing underscores left behind, e.g. by ' ($)'.
columns = columns.str.replace(r'_+$', '', regex=True)
suicide.columns = columns
suicide.columns
Index(['country', 'year', 'sex', 'age', 'suicides_no', 'population', 'suicides_100k_pop', 'country_year', 'hdi_for_year', 'gdp_for_year', 'gdp_per_capita', 'generation'], dtype='object')
2.1.1. Handling categorical features
country, sex, age & generation are all categorical features and should be converted to the category dtype.
categorical_features = ['country', 'sex', 'age', 'generation']
# Work on a copy so the assignments below do not write into a slice of suicide.
categorical = suicide[categorical_features].copy()
for column in categorical.columns:
    categorical[column] = categorical[column].str.strip().astype('category')
categorical.describe(include='all')
          country     sex          age    generation
count       27820   27820        27820         27820
unique        101       2            6             6
top     Mauritius  female  15-24 years  Generation X
freq          382   13910         4642          6408
We should one-hot encode sex, since it is a sensitive feature and its values have no hierarchy. age, on the other hand, can be label encoded, since its values have an implicit order. It is not yet clear how best to handle the geographical information in country.
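The two encodings mentioned above can be sketched on a toy frame. This is only an illustration, not the encoding used later in the analysis: the frame `toy` is made up, and the full list of age groups in `age_order` is an assumption (the preview shows only four of the six labels, so '5-14 years' and '55-74 years' are guessed).

```python
import pandas as pd

# Toy frame reusing the dataset's column names; the rows are made up.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'male'],
    'age': ['15-24 years', '75+ years', '5-14 years'],
})

# Assumed ordering of the six age groups, youngest to oldest.
age_order = ['5-14 years', '15-24 years', '25-34 years',
             '35-54 years', '55-74 years', '75+ years']

# One-hot encode sex: one indicator column per value, no implied order.
encoded = pd.get_dummies(toy, columns=['sex'], dtype='int64')

# Label encode age: ordered categorical, then integer codes 0..5.
age = pd.Categorical(encoded['age'], categories=age_order, ordered=True)
encoded['age_code'] = age.codes
```

The integer codes preserve the order of the age groups, whereas the indicator columns for sex carry no ordering at all.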
suicide[categorical_features] = categorical
2.1.2. Handling redundant features
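country_year (originally country-year) simply concatenates country and year, both of which are already available as separate features, so we can drop it.
# Use the cleaned name; 'country-year' no longer exists after the renaming above.
suicide = suicide.drop('country_year', axis='columns')
suicide.shape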
2.1.3. Handling gdp_for_year
This is a numerical feature that is stored as text, in a human-friendly format with separators. Monetary amounts are commonly written in one of two formats: the American format ('.' for decimals and ',' as the thousands separator) or the European format ('.' as the thousands separator and ',' for decimals). Without documentation, it can be unclear which format a dataset follows.
suicide['gdp_for_year']
0         2,156,624,900
1         2,156,624,900
2         2,156,624,900
3         2,156,624,900
4         2,156,624,900
              ...
27815    63,067,077,179
27816    63,067,077,179
27817    63,067,077,179
27818    63,067,077,179
27819    63,067,077,179
Name: gdp_for_year, Length: 27820, dtype: object
The values appear to be whole numbers (no decimal part), with ',' as the thousands separator, which matches the American format. Let's convert this feature to the float dtype.
gdp_for_year = suicide['gdp_for_year']
gdp_for_year = gdp_for_year.str.strip()
gdp_for_year = gdp_for_year.str.replace(',', '')
gdp_for_year = gdp_for_year.astype('float')
suicide['gdp_for_year'] = gdp_for_year
suicide['gdp_for_year'].describe()
count    2.782000e+04
mean     4.455810e+11
std      1.453610e+12
min      4.691962e+07
25%      8.985353e+09
50%      4.811469e+10
75%      2.602024e+11
max      1.812071e+13
Name: gdp_for_year, dtype: float64
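To make the difference between the two formats concrete, here is a minimal sketch of a parser that handles both. The helper `parse_amount` and its `style` argument are hypothetical names introduced for illustration; they are not part of pandas or of this project's utils module.

```python
def parse_amount(text, style='us'):
    """Convert a formatted amount such as '2,156,624,900' to a float.

    style='us': ',' separates thousands and '.' marks decimals.
    style='eu': '.' separates thousands and ',' marks decimals.
    """
    text = text.strip()
    if style == 'us':
        # Remove thousands separators; '.' is already the decimal mark.
        return float(text.replace(',', ''))
    if style == 'eu':
        # Remove thousands separators, then turn ',' into a decimal point.
        return float(text.replace('.', '').replace(',', '.'))
    raise ValueError(f'unknown style: {style!r}')


us_value = parse_amount(' 2,156,624,900 ')
eu_value = parse_amount('2.156.624.900,50', style='eu')
```

When the format is known up front, pd.read_csv can also do this at load time through its thousands and decimal parameters.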
2.1.4. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
suicide.describe(include='all')
          country          year     sex          age   suicides_no  \
count       27820  27820.000000   27820        27820  27820.000000
unique        101           NaN       2            6           NaN
top     Mauritius           NaN  female  15-24 years           NaN
freq          382           NaN   13910         4642           NaN
mean          NaN   2001.258375     NaN          NaN    242.574407
std           NaN      8.469055     NaN          NaN    902.047917
min           NaN   1985.000000     NaN          NaN      0.000000
25%           NaN   1995.000000     NaN          NaN      3.000000
50%           NaN   2002.000000     NaN          NaN     25.000000
75%           NaN   2008.000000     NaN          NaN    131.000000
max           NaN   2016.000000     NaN          NaN  22338.000000

          population  suicides_100k_pop country_year  hdi_for_year  \
count   2.782000e+04       27820.000000        27820   8364.000000
unique           NaN                NaN         2321           NaN
top              NaN                NaN  Albania1987           NaN
freq             NaN                NaN           12           NaN
mean    1.844794e+06          12.816097          NaN      0.776601
std     3.911779e+06          18.961511          NaN      0.093367
min     2.780000e+02           0.000000          NaN      0.483000
25%     9.749850e+04           0.920000          NaN      0.713000
50%     4.301500e+05           5.990000          NaN      0.779000
75%     1.486143e+06          16.620000          NaN      0.855000
max     4.380521e+07         224.970000          NaN      0.944000

        gdp_for_year  gdp_per_capita    generation
count   2.782000e+04    27820.000000         27820
unique           NaN             NaN             6
top              NaN             NaN  Generation X
freq             NaN             NaN          6408
mean    4.455810e+11    16866.464414           NaN
std     1.453610e+12    18887.576472           NaN
min     4.691962e+07      251.000000           NaN
25%     8.985353e+09     3447.000000           NaN
50%     4.811469e+10     9372.000000           NaN
75%     2.602024e+11    24874.000000           NaN
max     1.812071e+13   126352.000000           NaN
hdi_for_year is the only column with missing values: only 8364 of its 27820 entries (about 30%) are populated. We can either drop the column or consider imputation.
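Both options can be sketched on a toy frame. This is an illustration under assumptions, not a decision for this dataset: the values in `toy` are made up, and imputing with the per-country median is just one possible strategy.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the dataset; values are illustrative only.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B'],
    'hdi_for_year': [0.70, np.nan, 0.74, np.nan, np.nan],
})

# Share of missing values per column.
missing_share = toy.isna().mean()

# Option 1: drop the sparse column entirely.
dropped = toy.drop(columns='hdi_for_year')

# Option 2: impute with the median of each country, where one exists.
country_median = toy.groupby('country')['hdi_for_year'].transform('median')
imputed = toy['hdi_for_year'].fillna(country_median)
```

Country B has no observed value at all, so its entries stay missing after imputation. With roughly 70% of hdi_for_year missing, the same situation is likely to arise for some countries in the real data.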
Let's check for duplicates next.
suicide[suicide.duplicated(keep=False)].shape
(0, 12)
2.1.5. Correlations
Let's look at the correlations between the numerical features.
name = 'heatmap@suicide--corr.png'
# Restrict to numeric columns; recent pandas versions raise on non-numeric ones.
corr = suicide.corr(numeric_only=True)
plotter.corr(corr, name)
name
Several of the numerical features are positively correlated with one another.
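To see which pairs stand out without reading them off the heatmap, the correlation matrix can be flattened into a ranked list. The sketch below uses a made-up frame `toy`, and the threshold of 0.8 is an arbitrary choice for illustration.

```python
import numpy as np
import pandas as pd

# Toy numeric frame; values are illustrative only.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with a
    'c': [4.0, 1.0, 3.0, 2.0],
})

corr = toy.corr()

# Keep the upper triangle only, so each pair appears once and the
# diagonal (self-correlation) is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Pairs whose absolute correlation exceeds the (arbitrary) threshold.
threshold = 0.8
strong = pairs[pairs.abs() > threshold].sort_values(ascending=False)
```

Applied to the real data, the same steps would start from suicide.corr(numeric_only=True) instead of toy.corr().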