Adult
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the adult dataset. We start by reading the accompanying data documentation, which is informative.
2.1. Preliminary analysis
We start by answering our initial set of questions.
adult = pd.read_csv('../data/data/adult.csv')
adult.head()

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13
1   50  Self-emp-not-inc   83311  Bachelors             13
2   38           Private  215646    HS-grad              9
3   53           Private  234721       11th              7
4   28           Private  338409  Bachelors             13

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male
1  Married-civ-spouse    Exec-managerial        Husband  White    Male
2            Divorced  Handlers-cleaners  Not-in-family  White    Male
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female

   capital-gain  capital-loss  hours-per-week  native-country  class
0          2174             0              40   United-States  <=50K
1             0             0              13   United-States  <=50K
2             0             0              40   United-States  <=50K
3             0             0              40   United-States  <=50K
4             0             0              40            Cuba  <=50K
adult.shape
(32561, 15)
adult.dtypes
age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
class             object
dtype: object

We check for missing values next. The docs mention that missing values are represented by ? in the dataset.
adult.isna().any()
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
class             False
dtype: bool
adult[adult.eq('?')].any()
age               False
workclass         False
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation        False
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country    False
class             False
dtype: bool
adult.workclass.value_counts()
Private             22696
Self-emp-not-inc     2541
Local-gov            2093
?                    1836
State-gov            1298
Self-emp-inc         1116
Federal-gov           960
Without-pay            14
Never-worked            7
Name: workclass, dtype: int64

adult.eq('?') does not catch ?, but if we inspect value_counts() for a specific feature, we see that ? does exist in the dataset. Why is that?
adult.workclass.astype('category').cat.categories
Index([' ?', ' Federal-gov', ' Local-gov', ' Never-worked', ' Private', ' Self-emp-inc', ' Self-emp-not-inc', ' State-gov', ' Without-pay'], dtype='object')
Damn those notorious whitespaces! Every categorical value carries a leading space, so ' ?' never equals '?'. That means we don't have any missing values in the numerical features, but the categorical ones may still contain them. Let's fix that.

categorical_features = ['workclass', 'education', 'occupation', 'relationship',
                        'race', 'sex', 'native-country', 'class']
for feature in categorical_features:
    # remove the leading/trailing whitespace around each value
    adult[feature] = adult[feature].str.strip()
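The mismatch is easy to reproduce on a toy CSV. The sketch below is not part of the original notebook: it assumes a small inline sample in place of adult.csv, and shows that `read_csv` can also deal with both problems at load time through its `skipinitialspace` and `na_values` parameters.

```python
import io
import pandas as pd

# Toy stand-in for adult.csv (assumption): each value after a comma has a leading space.
raw = "age,workclass\n39, State-gov\n50, ?\n38, Private\n"

# Default parsing keeps the leading space, so ' ?' never equals '?'.
naive = pd.read_csv(io.StringIO(raw))
naive_found = bool(naive['workclass'].eq('?').any())

# skipinitialspace drops spaces after the delimiter; na_values maps '?' to NaN.
clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True, na_values='?')
missing = int(clean['workclass'].isna().sum())

print(naive_found, missing)
```

Whether to clean at load time or afterwards with `str.strip()` is a matter of taste; stripping afterwards, as done here, keeps the raw values visible for inspection first.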
adult.workclass.astype('category').cat.categories
Index(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'], dtype='object')
adult[adult.eq('?')].any()
age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
class             False
dtype: bool
adult = adult.replace('?', value=np.nan)
adult.isna().any()
age               False
workclass          True
fnlwgt            False
education         False
education-num     False
marital-status    False
occupation         True
relationship      False
race              False
sex               False
capital-gain      False
capital-loss      False
hours-per-week    False
native-country     True
class             False
dtype: bool

adult = adult.dropna()
adult.shape

(30162, 15)
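Dropping rows takes the dataset from 32561 to 30162 rows, so 2399 rows are lost. Before dropping, it can help to see how the missing values are spread across columns. The following is a minimal sketch on a toy frame (an assumption, standing in for adult after the ? values were replaced with NaN):

```python
import numpy as np
import pandas as pd

# Toy frame (assumption) standing in for adult after '?' was replaced with NaN.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['State-gov', np.nan, 'Private', np.nan],
    'occupation': ['Adm-clerical', np.nan, 'Sales', 'Tech-support'],
})

# Missing values per column, and rows with at least one missing value.
missing_per_column = toy.isna().sum()
rows_with_missing = int(toy.isna().any(axis=1).sum())

# dropna() removes exactly those rows.
dropped = len(toy) - len(toy.dropna())

print(missing_per_column.to_dict(), rows_with_missing, dropped)
```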
Let's also convert the categorical features to category dtype.
adult = utils.to_categorical(adult, categorical_features)
adult.dtypes

age                  int64
workclass         category
fnlwgt               int64
education         category
education-num        int64
marital-status      object
occupation        category
relationship      category
race              category
sex               category
capital-gain         int64
capital-loss         int64
hours-per-week       int64
native-country    category
class             category
dtype: object
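The implementation of `utils.to_categorical` is not shown in this notebook. As a rough guess at what such a helper might look like, here is a hypothetical stand-in built on `DataFrame.astype`; the real helper may differ.

```python
import pandas as pd

def to_categorical(df, features):
    """Hypothetical stand-in for utils.to_categorical: cast the listed columns to category."""
    # astype with a column -> dtype mapping returns a new frame; df itself is not modified.
    return df.astype({feature: 'category' for feature in features})

# Toy frame (assumption) with one numerical and two categorical columns.
toy = pd.DataFrame({'age': [39, 50], 'sex': ['Male', 'Female'], 'race': ['White', 'Black']})
converted = to_categorical(toy, ['sex', 'race'])

print(converted.dtypes)
```

Columns that are not named in the list keep their dtype, which is why marital-status, absent from categorical_features, still shows up as object in the output above.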
Let's check for duplicates next.
adult[adult.duplicated()].shape
(23, 15)

Looks like there are 23 duplicated rows; let's drop those.

adult = adult.drop_duplicates()
adult.shape

(30139, 15)
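Note that `duplicated()` only flags the repeats, not the first occurrence of each row, which is why 30162 rows minus 23 flagged rows gives the 30139 rows that remain. A small sketch on a toy frame (an assumption, not the adult data) makes the `keep` parameter explicit:

```python
import pandas as pd

# Toy frame (assumption): the row (39, 'Private') appears three times.
toy = pd.DataFrame({
    'age': [39, 39, 50, 39],
    'workclass': ['Private', 'Private', 'Private', 'Private'],
})

# duplicated() marks every repeat after the first occurrence (keep='first' is the default).
repeats = int(toy.duplicated().sum())

# keep=False marks all members of a duplicated group, including the first.
all_members = int(toy.duplicated(keep=False).sum())

# drop_duplicates() keeps one copy of each distinct row.
deduped = toy.drop_duplicates()

print(repeats, all_members, len(deduped))
```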
Finally we check for correlations in numerical features.
name = 'heatmap@adult--corr.png'
corr = adult.corr()
plotter.corr(corr, name)
name

We see more strong positive and negative correlations amongst the numerical features. This is an opportunity to experiment with feature selection and drop features that don't bring anything to the table. For this analysis, however, we don't take this any further.
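For reference, one common way to act on a correlation matrix is to drop one feature from each highly correlated pair. The sketch below is not something this analysis does: the helper name `highly_correlated`, the 0.9 threshold, and the toy frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def highly_correlated(df, threshold=0.9):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Toy numerical frame (assumption).
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [2.0, 4.0, 6.0, 8.0, 10.0],  # exact multiple of 'a'
    'c': [5.0, 1.0, 4.0, 2.0, 3.0],   # only weakly related to 'a'
})

to_drop = highly_correlated(toy)
reduced = toy.drop(columns=to_drop)

print(to_drop, list(reduced.columns))
```

The threshold is a judgment call; a stricter or looser value changes which features survive, so it is worth checking the result against the heatmap.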