Happiness

Table of Contents

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import matplotlib
matplotlib.use('Agg') # select a non-interactive backend before pyplot is imported, so figures are written to png files
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

The original dataset comes in several files, each containing data for a particular year. Let's combine them into a single dataset prior to further analysis.

happiness15 = pd.read_csv('../data/data/happiness-2015.csv')
happiness16 = pd.read_csv('../data/data/happiness-2016.csv')
happiness17 = pd.read_csv('../data/data/happiness-2017.csv')
happiness18 = pd.read_csv('../data/data/happiness-2018.csv')
happiness19 = pd.read_csv('../data/data/happiness-2019.csv')

Let's explore what we are dealing with here.

happiness15.dtypes
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Standard Error                   float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object
happiness16.dtypes
Country                           object
Region                            object
Happiness Rank                     int64
Happiness Score                  float64
Lower Confidence Interval        float64
Upper Confidence Interval        float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
dtype: object
happiness17.dtypes
Country                           object
Happiness.Rank                     int64
Happiness.Score                  float64
Whisker.high                     float64
Whisker.low                      float64
Economy..GDP.per.Capita.         float64
Family                           float64
Health..Life.Expectancy.         float64
Freedom                          float64
Generosity                       float64
Trust..Government.Corruption.    float64
Dystopia.Residual                float64
dtype: object
happiness18.dtypes
Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object
happiness19.dtypes
Overall rank                      int64
Country or region                object
Score                           float64
GDP per capita                  float64
Social support                  float64
Healthy life expectancy         float64
Freedom to make life choices    float64
Generosity                      float64
Perceptions of corruption       float64
dtype: object

We have a problem. The columns are not named consistently across files, which makes it hard to tell which ones measure the same quantity. A more pressing issue is that the sets of columns themselves differ between years. To analyse all years together we have to combine the files into a single dataset, which will contain missing values. We cannot simply drop the incomplete columns (that would be equivalent to keeping only the columns common to every year), so we would have to handle the missing values through imputation.

This shows how a poor data-collection process can create additional technical debt further downstream.
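To illustrate, the yearly files can be mapped onto a shared schema and concatenated, at which point the missing columns surface as NaN. A minimal sketch using two tiny made-up frames that mimic the 2015 and 2019 schemas (the column mapping is abbreviated; on the real data each `happiness-<year>.csv` would be read and renamed the same way):

```python
import pandas as pd

# Two tiny frames mimicking the 2015 and 2019 schemas (values are made up).
h15 = pd.DataFrame({'Country': ['Finland'], 'Happiness Score': [7.587],
                    'Standard Error': [0.03]})
h19 = pd.DataFrame({'Country or region': ['Finland'], 'Score': [7.769]})

# Map each year's column names onto one shared schema.
common = {'Country': 'location', 'Country or region': 'location',
          'Happiness Score': 'score', 'Score': 'score'}

frames = []
for year, df in [(2015, h15), (2019, h19)]:
    df = df.rename(columns=common)
    df['year'] = year
    frames.append(df)

# Columns absent in a given year (here 'Standard Error' in 2019)
# come out as NaN -- which is where the imputation problem arises.
combined = pd.concat(frames, ignore_index=True, sort=False)
print(combined)
```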

2. Analysis

In this section we analyse the happiness dataset. Note that the original data comes as several CSV files (2015 through 2019); for this analysis we only consider the 2019 file.

We start by reading the accompanying data docs, which provide some useful information about the dataset.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

happiness = pd.read_csv('../data/data/happiness-2019.csv')
happiness.head()
   Overall rank Country or region  Score  GDP per capita  Social support  \
0             1           Finland  7.769           1.340           1.587   
1             2           Denmark  7.600           1.383           1.573   
2             3            Norway  7.554           1.488           1.582   
3             4           Iceland  7.494           1.380           1.624   
4             5       Netherlands  7.488           1.396           1.522   

   Healthy life expectancy  Freedom to make life choices  Generosity  \
0                    0.986                         0.596       0.153   
1                    0.996                         0.592       0.252   
2                    1.028                         0.603       0.271   
3                    1.026                         0.591       0.354   
4                    0.999                         0.557       0.322   

   Perceptions of corruption  
0                      0.393  
1                      0.410  
2                      0.341  
3                      0.118  
4                      0.298  

Let's change the column names for simplicity before we proceed with the analysis.

mapper = {'Overall rank': 'rank',
          'Country or region': 'location',
          'Score': 'score',
          'GDP per capita': 'gdp',
          'Social support': 'support',
          'Healthy life expectancy': 'life_expectancy',
          'Freedom to make life choices': 'freedom',
          'Generosity': 'generosity',
          'Perceptions of corruption': 'corruption'}

happiness = happiness.rename(mapper=mapper, axis='columns')
happiness.columns
Index(['rank', 'location', 'score', 'gdp', 'support', 'life_expectancy',
       'freedom', 'generosity', 'corruption'],
      dtype='object')
happiness.shape
(156, 9)
happiness.dtypes
rank                 int64
location            object
score              float64
gdp                float64
support            float64
life_expectancy    float64
freedom            float64
generosity         float64
corruption         float64
dtype: object

We primarily have numerical features; location is the only categorical one. We should convert it to the category dtype and sanitise the strings.

Let's look at the descriptive statistics, missing values, and duplicates next.

happiness.describe(include='all')
              rank location       score         gdp     support  \
count   156.000000      156  156.000000  156.000000  156.000000   
unique         NaN      156         NaN         NaN         NaN   
top            NaN  Finland         NaN         NaN         NaN   
freq           NaN        1         NaN         NaN         NaN   
mean     78.500000      NaN    5.407096    0.905147    1.208814   
std      45.177428      NaN    1.113120    0.398389    0.299191   
min       1.000000      NaN    2.853000    0.000000    0.000000   
25%      39.750000      NaN    4.544500    0.602750    1.055750   
50%      78.500000      NaN    5.379500    0.960000    1.271500   
75%     117.250000      NaN    6.184500    1.232500    1.452500   
max     156.000000      NaN    7.769000    1.684000    1.624000   

        life_expectancy     freedom  generosity  corruption  
count        156.000000  156.000000  156.000000  156.000000  
unique              NaN         NaN         NaN         NaN  
top                 NaN         NaN         NaN         NaN  
freq                NaN         NaN         NaN         NaN  
mean           0.725244    0.392571    0.184846    0.110603  
std            0.242124    0.143289    0.095254    0.094538  
min            0.000000    0.000000    0.000000    0.000000  
25%            0.547750    0.308000    0.108750    0.047000  
50%            0.789000    0.417000    0.177500    0.085500  
75%            0.881750    0.507250    0.248250    0.141250  
max            1.141000    0.631000    0.566000    0.453000  
happiness.isna().any()
rank               False
location           False
score              False
gdp                False
support            False
life_expectancy    False
freedom            False
generosity         False
corruption         False
dtype: bool
happiness[happiness.duplicated()].shape
(0, 9)

And finally the correlations.

name = 'heatmap@happiness--corr.png'
corr = happiness.corr()
plotter.corr(corr, name)
name

heatmap@happiness--corr.png

Several pairs of features are positively correlated; in particular, score appears strongly correlated with gdp, support, and life_expectancy.
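Which features correlate most strongly with score can be read off by sorting the score column of the correlation matrix. A sketch on a small made-up frame (on the real data the equivalent would be happiness.corr()['score'], sorted the same way):

```python
import pandas as pd

# Stand-in frame with made-up values; gdp tracks score closely by construction.
df = pd.DataFrame({'score': [7.8, 7.6, 5.4, 3.2, 2.9],
                   'gdp': [1.34, 1.38, 0.90, 0.35, 0.31],
                   'generosity': [0.15, 0.25, 0.18, 0.20, 0.16]})

# Correlation of every feature with score, strongest first.
corr_with_score = (df.corr()['score']
                     .drop('score')
                     .sort_values(ascending=False))
print(corr_with_score)
```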

2.1.1. Handling location

Let's convert location to the category dtype and investigate.

happiness['location'] = happiness['location'].astype('category')
happiness['location'].value_counts()
Afghanistan                1
Palestinian Territories    1
Nicaragua                  1
Niger                      1
Nigeria                    1
                          ..
Guinea                     1
Haiti                      1
Honduras                   1
Hong Kong                  1
Zimbabwe                   1
Name: location, Length: 156, dtype: int64
happiness['location'].cat.categories
Index(['Afghanistan', 'Albania', 'Algeria', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain', 'Bangladesh',
       ...
       'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=156)

Looks like all the values are unique and there is no leading or trailing whitespace.
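Had there been stray whitespace or inconsistent casing, a minimal clean-up before the category conversion might have looked like this (the sample values are made up for illustration):

```python
import pandas as pd

# Deliberately messy location strings.
s = pd.Series(['  Finland', 'denmark ', 'Norway'])

# Strip surrounding whitespace, normalise casing, then convert to category.
clean = s.str.strip().str.title().astype('category')
print(list(clean))
```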

2.2. Distributional analysis

In this section we analyse the distribution of the features. Let's consider the numerical features in particular.

name = 'histplot@happiness--numerical.png'
numerical_features = ['score',
                      'gdp',
                      'support',
                      'life_expectancy',
                      'freedom',
                      'generosity',
                      'corruption']

fig, axs = plt.subplots(1, 7, figsize=(20, 5), sharey=True)
for idx, feature in enumerate(numerical_features):
    sns.histplot(data=happiness, x=feature, kde=True, ax=axs[idx])
fig.savefig(name)
name

histplot@happiness--numerical.png

The distributions are roughly unimodal and bell-shaped, but some features (notably generosity and corruption) are right-skewed.
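The visual impression of skew can be quantified with the sample skewness that pandas provides. A sketch on synthetic data contrasting a symmetric and a right-skewed sample (on the real dataset the equivalent call would be happiness[numerical_features].skew()):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# A symmetric (normal) and a right-skewed (exponential) sample for comparison;
# skewness near 0 means symmetric, clearly positive means a long right tail.
demo = pd.DataFrame({'symmetric': rng.normal(0.5, 0.1, 1000),
                     'right_skewed': rng.exponential(0.2, 1000)})
print(demo.skew())
```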

2.3. Relational analysis

In this section we will analyse the relationships amongst the features.

name = 'pairplot@happiness--scatterplot-kdeplot.png'

g = sns.pairplot(data=happiness, diag_kind='kde', corner=True)
g.savefig(name)
name

pairplot@happiness--scatterplot-kdeplot.png

The pairplot corroborates the correlation matrix: some feature pairs are clearly linearly related, while others show no obvious relationship.

Date: 2021-10-21 Thu 00:00

Created: 2021-10-22 Fri 21:54