Adult

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the adult dataset. We start by reading the accompanying data documentation, which is informative.

2.1. Preliminary analysis

We start by answering our initial set of questions.

adult = pd.read_csv('../data/data/adult.csv')
adult.head()
   age          workclass  fnlwgt   education  education-num  \
0   39          State-gov   77516   Bachelors             13   
1   50   Self-emp-not-inc   83311   Bachelors             13   
2   38            Private  215646     HS-grad              9   
3   53            Private  234721        11th              7   
4   28            Private  338409   Bachelors             13   

        marital-status          occupation    relationship    race      sex  \
0        Never-married        Adm-clerical   Not-in-family   White     Male   
1   Married-civ-spouse     Exec-managerial         Husband   White     Male   
2             Divorced   Handlers-cleaners   Not-in-family   White     Male   
3   Married-civ-spouse   Handlers-cleaners         Husband   Black     Male   
4   Married-civ-spouse      Prof-specialty            Wife   Black   Female   

   capital-gain  capital-loss  hours-per-week  native-country   class  
0          2174             0              40   United-States   <=50K  
1             0             0              13   United-States   <=50K  
2             0             0              40   United-States   <=50K  
3             0             0              40   United-States   <=50K  
4             0             0              40            Cuba   <=50K  
adult.shape
32561 15
adult.dtypes
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
class             object
dtype: object

We check for missing values next. The docs mention that missing values are represented by ? in the dataset.

adult.isna().any()
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
class             False
dtype: bool
adult.eq('?').any()
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
class             False
dtype: bool
adult.workclass.value_counts()
 Private             22696
 Self-emp-not-inc     2541
 Local-gov            2093
 ?                    1836
 State-gov            1298
 Self-emp-inc         1116
 Federal-gov           960
 Without-pay            14
 Never-worked            7
Name: workclass, dtype: int64

The exact comparison with adult.eq('?') does not catch any ?, yet value_counts() for a specific feature shows that ? does exist in the dataset. Why is that?

adult.workclass.astype('category').cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private',
       ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'],
      dtype='object')

Damn 'em notorious whitespaces! Every value carries a leading space, so ' ?' never equals '?'. There are no missing values in the numerical features, but the categorical ones may still contain them. Let's fix that.

categorical_features = ['workclass',
                        'education',
                        'occupation',
                        'relationship',
                        'race',
                        'sex',
                        'native-country',
                        'class']

for feature in categorical_features:
    adult[feature] = adult[feature].str.strip()
adult.workclass.astype('category').cat.categories
Index(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private',
       'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'],
      dtype='object')
adult.eq('?').any()
age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
class             False
dtype: bool
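
The before/after behaviour seen above can be reproduced on a toy frame. This is only a minimal sketch: the `toy` data and the `string_cols` list are made up for illustration and are not part of the adult dataset.

```python
import pandas as pd

# Toy frame mimicking the raw file: every string value carries a leading space.
toy = pd.DataFrame({
    'age': [39, 50, 38],
    'workclass': [' State-gov', ' ?', ' Private'],
})

# Exact comparison misses ' ?' because of the leading space.
before = toy.eq('?').any()

# Strip whitespace from the string columns only, then compare again.
string_cols = ['workclass']  # hypothetical list, analogous to categorical_features
cleaned = toy.copy()
for col in string_cols:
    cleaned[col] = cleaned[col].str.strip()
after = cleaned.eq('?').any()

# Per-column count of the placeholder, handy for judging how much data is affected.
counts = cleaned.eq('?').sum()
```

Using `.sum()` instead of `.any()` gives the number of placeholder values per column rather than a yes/no answer.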
adult = adult.replace('?', value=np.nan)
adult.isna().any()
age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
class             False
dtype: bool
adult = adult.dropna()
adult.shape
30162 15
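
As an alternative, the whitespace stripping and the ? replacement could probably be pushed into the read itself. The sketch below uses the `skipinitialspace` and `na_values` parameters of `pd.read_csv` on a small in-memory CSV. The three-column sample is invented for illustration; whether this reproduces the exact row counts above on the real file has not been checked here.

```python
import io
import pandas as pd

# In-memory stand-in for the raw file: values are preceded by a space,
# and missing entries are written as '?'.
raw = io.StringIO(
    "age, workclass, class\n"
    "39, State-gov, <=50K\n"
    "50, ?, <=50K\n"
    "38, Private, >50K\n"
)

# skipinitialspace drops the space after each delimiter,
# na_values turns the '?' placeholder into NaN while parsing.
toy = pd.read_csv(raw, skipinitialspace=True, na_values='?')

# Rows with any missing value can then be dropped as before.
cleaned = toy.dropna()
```

This keeps the cleaning in one place, at the cost of hiding the whitespace issue that the step-by-step version makes visible.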

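Let's also convert the categorical features to category dtype. Note that marital-status is not listed in categorical_features, so it keeps its object dtype below.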

adult = utils.to_categorical(adult, categorical_features)
adult.dtypes
age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status      object
occupation        category
relationship      category
race              category
sex               category
capital-gain         int64
capital-loss         int64
hours-per-week       int64
native-country    category
class             category
dtype: object
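
The `utils.to_categorical` helper lives in the project's `src` package and is not shown here. A minimal sketch of what such a helper might look like, assuming it simply casts the listed columns and returns a new frame (the body below is a guess, not the project's actual code):

```python
import pandas as pd

def to_categorical(df, features):
    """Return a copy of df with the given columns cast to category dtype.

    Hypothetical stand-in for utils.to_categorical; the real helper may differ.
    """
    out = df.copy()
    for feature in features:
        out[feature] = out[feature].astype('category')
    return out

toy = pd.DataFrame({
    'age': [39, 50],
    'sex': ['Male', 'Female'],
    'race': ['White', 'Black'],
})
converted = to_categorical(toy, ['sex'])
```

Columns that are not listed are left untouched, and the input frame is not modified.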

Let's check for duplicates next.

adult[adult.duplicated()].shape
23 15

It looks like there are 23 duplicate rows; let's drop those.

adult = adult.drop_duplicates()
adult.shape
30139 15
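
For reference, a small sketch of how `duplicated()` and `drop_duplicates()` count rows, using an invented toy frame. By default only the second and later copies of a row are flagged, which is why the number of rows dropped equals the number reported by `duplicated()`.

```python
import pandas as pd

toy = pd.DataFrame({
    'age': [39, 39, 50, 39],
    'workclass': ['Private', 'Private', 'State-gov', 'Private'],
})

# Default keep='first': the first occurrence is not flagged, later copies are.
n_dupes = int(toy.duplicated().sum())

# Dropping duplicates keeps one copy of each distinct row.
deduped = toy.drop_duplicates()

# keep=False flags every copy, useful for inspecting the duplicated rows.
all_copies = toy[toy.duplicated(keep=False)]
```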

Finally, we check for correlations among the numerical features.

name = 'heatmap@adult--corr.png'
corr = adult.select_dtypes(include='number').corr()  # numeric columns only
plotter.corr(corr, name)
name

heatmap@adult--corr.png

The heatmap shows both positive and negative correlations amongst the numerical features. This is an opportunity to experiment with feature selection and drop features which don't bring anything to the table. For this analysis, however, we don't do anything further.
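
If one did want to follow up on feature selection, a possible starting point is to list the feature pairs whose absolute correlation exceeds some cut-off. The sketch below uses an invented toy frame and an arbitrary `threshold` of 0.9; neither comes from the analysis above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],   # exactly 2 * a, so perfectly correlated with a
    'c': [5, 3, 4, 1, 2],
    'label': ['x', 'y', 'x', 'y', 'x'],  # non-numeric, excluded below
})

# Pearson correlation over the numeric columns only.
corr = toy.select_dtypes(include='number').corr()

# Keep the strict upper triangle so each pair appears once and the
# diagonal (self-correlation) is ignored.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the (arbitrary) threshold are candidates for dropping one feature.
threshold = 0.9
strong = pairs[pairs.abs() > threshold]
```

Here `strong` is a Series indexed by feature pairs; which member of a pair to drop remains a judgement call.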

2.2. Distributional analysis

2.3. Relational analysis

2.4. Categorical analysis

Date: 2021-10-19 Tue 00:00

Created: 2021-10-22 Fri 21:58