Happiness
Table of Contents
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
The original dataset comes in several files, each containing data for a particular year. Let's combine them into a single dataset prior to further analysis.
happiness15 = pd.read_csv('../data/data/happiness-2015.csv')
happiness16 = pd.read_csv('../data/data/happiness-2016.csv')
happiness17 = pd.read_csv('../data/data/happiness-2017.csv')
happiness18 = pd.read_csv('../data/data/happiness-2018.csv')
happiness19 = pd.read_csv('../data/data/happiness-2019.csv')
Let's explore what we are dealing with here.
happiness15.dtypes
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Standard Error                   float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object
happiness16.dtypes
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object
happiness17.dtypes
Country                           object
Happiness.Rank                     int64
Happiness.Score                  float64
Whisker.high                     float64
Whisker.low                      float64
Economy..GDP.per.Capita.         float64
Family                           float64
Health..Life.Expectancy.         float64
Freedom                          float64
Generosity                       float64
Trust..Government.Corruption.    float64
Dystopia.Residual                float64
dtype: object
happiness18.dtypes
Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object
happiness19.dtypes
Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object
We have a problem. The columns are not named consistently, which makes it hard to tell which ones correspond across years. A more pressing issue is that the sets of columns differ between the files, so combining them into a single dataset necessarily introduces missing values. We cannot simply drop those values (that would be equivalent to keeping only the columns shared by every year), so we would have to handle them using imputation.
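A minimal sketch of what such a merge looks like, using toy one-row frames in place of the real files. The rename maps cover only a few columns, and the target names ('location', 'rank', 'score', 'gdp') plus the 'year' marker are choices of this sketch, not from the original code:

```python
import pandas as pd

# Partial rename maps, built from the dtype listings above (sketch only).
rename_2015 = {'Country': 'location', 'Happiness Rank': 'rank',
               'Happiness Score': 'score',
               'Economy (GDP per Capita)': 'gdp'}
rename_2019 = {'Country or region': 'location', 'Overall rank': 'rank',
               'Score': 'score', 'GDP per capita': 'gdp'}

# Toy stand-ins for happiness15/happiness19 with their year-specific columns.
h15 = pd.DataFrame({'Country': ['Norway'], 'Happiness Rank': [4],
                    'Happiness Score': [7.522],
                    'Economy (GDP per Capita)': [1.459], 'Family': [1.125]})
h19 = pd.DataFrame({'Country or region': ['Norway'], 'Overall rank': [3],
                    'Score': [7.554], 'GDP per capita': [1.488],
                    'Social support': [1.582]})

combined = pd.concat(
    [h15.rename(columns=rename_2015).assign(year=2015),
     h19.rename(columns=rename_2019).assign(year=2019)],
    ignore_index=True, sort=False)

# Columns present in only one year become NaN in the other year's rows,
# which is exactly where imputation would be needed.
print(combined[['location', 'year', 'Family', 'Social support']])
```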
This shows that a poor data collection process can lead to additional technical debt further downstream.
2. Analysis
In this section we analyse the happiness dataset. Note that the original dataset consists of several CSV files (2015 through 2019); however, for this analysis we only consider the 2019 file.
We start by reading the accompanying data docs, which provide some useful information about the dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
happiness = pd.read_csv('../data/data/happiness-2019.csv')
happiness.head()
   Overall rank Country or region  Score  GDP per capita  Social support  \
0             1           Finland  7.769           1.340           1.587
1             2           Denmark  7.600           1.383           1.573
2             3            Norway  7.554           1.488           1.582
3             4           Iceland  7.494           1.380           1.624
4             5       Netherlands  7.488           1.396           1.522

   Healthy life expectancy  Freedom to make life choices  Generosity  \
0                    0.986                         0.596       0.153
1                    0.996                         0.592       0.252
2                    1.028                         0.603       0.271
3                    1.026                         0.591       0.354
4                    0.999                         0.557       0.322

   Perceptions of corruption
0                      0.393
1                      0.410
2                      0.341
3                      0.118
4                      0.298
Let's change the column names for simplicity before we proceed with the analysis.
mapper = {'Overall rank': 'rank',
          'Country or region': 'location',
          'Score': 'score',
          'GDP per capita': 'gdp',
          'Social support': 'support',
          'Healthy life expectancy': 'life_expectancy',
          'Freedom to make life choices': 'freedom',
          'Generosity': 'generosity',
          'Perceptions of corruption': 'corruption'}
happiness = happiness.rename(mapper=mapper, axis='columns')
happiness.columns
Index(['rank', 'location', 'score', 'gdp', 'support', 'life_expectancy',
       'freedom', 'generosity', 'corruption'],
      dtype='object')
happiness.shape
(156, 9)
happiness.dtypes
rank                 int64
location            object
score              float64
gdp                float64
support            float64
life_expectancy    float64
freedom            float64
generosity         float64
corruption         float64
dtype: object
We primarily have numerical features; the exception is location, which is categorical. We should convert it to the category dtype and sanitise the strings.
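That clean-up can be sketched as follows (the sample values are illustrative, not from the dataset):

```python
import pandas as pd

# Illustrative location strings with stray whitespace.
s = pd.Series([' Finland', 'Denmark ', 'Norway'])
s = s.str.strip()          # sanitise: drop leading/trailing whitespace
s = s.astype('category')   # compact dtype for a repeated-label column
print(s.cat.categories)
```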
Let's look at the descriptive statistics, missing values & duplicates next.
happiness.describe(include='all')
              rank location       score         gdp     support  \
count   156.000000      156  156.000000  156.000000  156.000000
unique         NaN      156         NaN         NaN         NaN
top            NaN  Finland         NaN         NaN         NaN
freq           NaN        1         NaN         NaN         NaN
mean     78.500000      NaN    5.407096    0.905147    1.208814
std      45.177428      NaN    1.113120    0.398389    0.299191
min       1.000000      NaN    2.853000    0.000000    0.000000
25%      39.750000      NaN    4.544500    0.602750    1.055750
50%      78.500000      NaN    5.379500    0.960000    1.271500
75%     117.250000      NaN    6.184500    1.232500    1.452500
max     156.000000      NaN    7.769000    1.684000    1.624000

        life_expectancy     freedom  generosity  corruption
count        156.000000  156.000000  156.000000  156.000000
unique              NaN         NaN         NaN         NaN
top                 NaN         NaN         NaN         NaN
freq                NaN         NaN         NaN         NaN
mean           0.725244    0.392571    0.184846    0.110603
std            0.242124    0.143289    0.095254    0.094538
min            0.000000    0.000000    0.000000    0.000000
25%            0.547750    0.308000    0.108750    0.047000
50%            0.789000    0.417000    0.177500    0.085500
75%            0.881750    0.507250    0.248250    0.141250
max            1.141000    0.631000    0.566000    0.453000
happiness.isna().any()
rank               False
location           False
score              False
gdp                False
support            False
life_expectancy    False
freedom            False
generosity         False
corruption         False
dtype: bool
happiness[happiness.duplicated()].shape
(0, 9)
And finally the correlations.
name = 'heatmap@happiness--corr.png'
# numeric_only avoids a TypeError on the non-numeric location column
corr = happiness.corr(numeric_only=True)
plotter.corr(corr, name)
name
Some of the features are positively correlated.
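One way to make that observation precise is to rank the feature pairs by correlation strength. The frame below is a synthetic stand-in for the numerical columns; on the real data the same pair extraction would run on `happiness.corr(numeric_only=True)`:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: life_expectancy is built to correlate with gdp,
# generosity is independent noise.
rng = np.random.default_rng(0)
gdp = rng.normal(size=100)
df = pd.DataFrame({
    'gdp': gdp,
    'life_expectancy': 0.8 * gdp + rng.normal(scale=0.3, size=100),
    'generosity': rng.normal(size=100),
})

corr = df.corr()
cols = list(corr.columns)
# Upper triangle only: each unordered pair once, diagonal excluded.
pairs = {(a, b): corr.loc[a, b]
         for i, a in enumerate(cols) for b in cols[i + 1:]}
for (a, b), r in sorted(pairs.items(), key=lambda kv: -abs(kv[1])):
    print(f'{a} ~ {b}: r = {r:+.2f}')
```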
2.1.1. Handling location
Let's convert it to category dtype and investigate.
happiness['location'] = happiness['location'].astype('category')
happiness['location'].value_counts()
Afghanistan                1
Palestinian Territories    1
Nicaragua                  1
Niger                      1
Nigeria                    1
                          ..
Guinea                     1
Haiti                      1
Honduras                   1
Hong Kong                  1
Zimbabwe                   1
Name: location, Length: 156, dtype: int64
happiness['location'].cat.categories
Index(['Afghanistan', 'Albania', 'Algeria', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh',
       ...
       'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=156)
Looks like all the values are unique and there are no leading/trailing whitespaces.
2.2. Distributional analysis
In this section we analyse the distribution of the features. Let's consider the numerical features in particular.
name = 'histplot@happiness--numerical.png'
numerical_features = ['score', 'gdp', 'support', 'life_expectancy',
                      'freedom', 'generosity', 'corruption']
fig, axs = plt.subplots(1, 7, figsize=(20, 5), sharey=True)
for idx, feature in enumerate(numerical_features):
    sns.histplot(data=happiness, x=feature, kde=True, ax=axs[idx])
fig.savefig(name)
name
The distributions are broadly bell-shaped, but several of the features are noticeably skewed.
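Skewness can be quantified rather than eyeballed: `DataFrame.skew()` returns the sample skewness per column. The sketch below uses synthetic data; on the real dataset the call would be `happiness[numerical_features].skew()`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({
    'symmetric': rng.normal(size=1000),          # skewness near 0
    'right_skewed': rng.exponential(size=1000),  # skewness near 2
})
print(df.skew())
```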
2.3. Relational analysis
In this section we will analyse the relationships amongst the features.
name = 'pairplot@happiness--scatterplot-kdeplot.png'
g = sns.pairplot(data=happiness, diag_kind='kde', corner=True)
g.savefig(name)
name
The pairplot corroborates the correlation matrix: some feature pairs are clearly linearly related, while others show no apparent relationship.
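For a linearly related pair, the relationship can be summarised with a least-squares fit. The sketch below uses synthetic stand-ins for gdp and score; the slope and intercept are properties of the toy data, not estimates from the real dataset:

```python
import numpy as np

# Synthetic gdp/score pair with a known linear relationship plus noise.
rng = np.random.default_rng(2)
gdp = rng.uniform(0.0, 1.7, size=156)
score = 2.8 + 2.2 * gdp + rng.normal(scale=0.4, size=156)

# Degree-1 polynomial fit: score ~= slope * gdp + intercept.
slope, intercept = np.polyfit(gdp, score, deg=1)
print(f'score ~= {slope:.2f} * gdp + {intercept:.2f}')
```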