Student
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the student dataset. We start by reading the accompanying data docs. The documentation is sparse: it provides no context on the features, how the data was collected, or what the schema should look like.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
student = pd.read_csv('../data/data/student.csv')
student.head()

   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard
1  female        group C                some college      standard
2  female        group B             master's degree      standard
3    male        group A          associate's degree  free/reduced
4    male        group C                some college      standard

  test preparation course  math score  reading score  writing score
0                    none          72             72             74
1               completed          69             90             88
2                    none          90             95             93
3                    none          47             57             44
4                    none          76             78             75
student.shape
(1000, 8)
student.columns
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course', 'math score', 'reading score', 'writing score'], dtype='object')
student.dtypes
gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object
We have a mix of categorical and numerical features. Let's examine the categorical features in more depth next. But before that, let's rename the columns for simplicity.
mapper = {'gender': 'gender',
          'race/ethnicity': 'race',
          'parental level of education': 'parent_education',
          'lunch': 'lunch',
          'test preparation course': 'test_prep',
          'math score': 'math_score',
          'reading score': 'reading_score',
          'writing score': 'writing_score'}
student = student.rename(mapper=mapper, axis='columns')
student.columns
Index(['gender', 'race', 'parent_education', 'lunch', 'test_prep', 'math_score', 'reading_score', 'writing_score'], dtype='object')
2.1.1. Handling gender, race/ethnicity, parental level of education, lunch & test preparation course
All of the above features are categorical. Let's start with gender.
student['gender'] = student['gender'].astype('category')
student['gender'].value_counts()

female    518
male      482
Name: gender, dtype: int64
The two classes are roughly balanced (518 vs. 482), which is good. However, we do need to one-hot encode the feature so as to avoid implying any hierarchy amongst the values.
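One way to one-hot encode nominal features such as these is `pd.get_dummies`. The snippet below is a minimal sketch on a small toy frame (the rows are illustrative, not the real dataset), and it is only one of several possible encoders.

```python
import pandas as pd

# Toy stand-in for the student frame; the values are illustrative only.
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female'],
    'race': ['group B', 'group C', 'group A'],
})

# Each category becomes its own indicator column, so no order is implied.
encoded = pd.get_dummies(toy, columns=['gender', 'race'])
print(encoded.columns.tolist())
```

The same call could be applied to the full frame once all categorical columns have been cleaned.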
We look at race/ethnicity next.
student['race'] = student['race'].astype('category')
student['race'].value_counts()

group C    319
group D    262
group B    190
group E    140
group A     89
Name: race, dtype: int64
The ethnic groups are anonymised, which is good. One-hot encoding is required again to avoid implying any hierarchy amongst the values.
We look at parent_education next.
student['parent_education'] = student['parent_education'].astype('category')
student['parent_education'].value_counts()

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parent_education, dtype: int64
Some of the categories overlap and can be binned together. For instance, we treat high school and some high school as the same level. For this analysis, we also assume that a person who went to college obtained a BSc degree, so some college is merged into bachelors. The values have a natural order, so we may want to use a label (ordinal) encoding rather than one-hot encoding here.
student['parent_education'] = student['parent_education'].str.strip()
student['parent_education'] = student['parent_education'].replace(to_replace='some college', value='bachelors')
student['parent_education'] = student['parent_education'].replace(to_replace="associate's degree", value='associates')
student['parent_education'] = student['parent_education'].replace(to_replace='some high school', value='high school')
student['parent_education'] = student['parent_education'].replace(to_replace="bachelor's degree", value='bachelors')
student['parent_education'] = student['parent_education'].replace(to_replace="master's degree", value='masters')
student['parent_education'].value_counts()

high school    375
bachelors      344
associates     222
masters         59
Name: parent_education, dtype: int64
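The hierarchy in parent_education could be captured with an ordered categorical. The sketch below assumes the ordering high school < associates < bachelors < masters, which is our assumption and not something the data docs state, and it uses a toy series in place of the real column.

```python
import pandas as pd

# Toy stand-in for the binned parent_education column (illustrative values).
parent_education = pd.Series(
    ['high school', 'bachelors', 'associates', 'masters', 'high school'])

# Assumed ordering, from lowest to highest level of education.
levels = ['high school', 'associates', 'bachelors', 'masters']
ordered = pd.CategoricalDtype(categories=levels, ordered=True)

# .cat.codes maps each value to its position in `levels`.
encoded = parent_education.astype(ordered).cat.codes
print(encoded.tolist())
```

Any value missing from `levels` would be encoded as -1, so it is worth checking for that code before modelling.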
Let's look at lunch next.
student['lunch'] = student['lunch'].astype('category')
student['lunch'].value_counts()

standard        645
free/reduced    355
Name: lunch, dtype: int64

student['lunch'] = student['lunch'].str.strip()
student['lunch'] = student['lunch'].replace(to_replace='free/reduced', value='free')
student['lunch'].value_counts()

standard    645
free        355
Name: lunch, dtype: int64
It's difficult to determine if there is hierarchy amongst the values here.
Finally, let's look at test_prep.
student['test_prep'] = student['test_prep'].astype('category')
student['test_prep'].value_counts()

none         642
completed    358
Name: test_prep, dtype: int64
Both lunch & test_prep take only two values. Should we represent them simply as binary features, or does a categorical representation still make sense?
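If we do opt for binary features, a minimal sketch could look like the following. The column names `lunch_standard` and `test_prep_completed` are hypothetical, and the toy frame stands in for the real data.

```python
import pandas as pd

# Toy stand-in for the two-valued columns (illustrative values).
toy = pd.DataFrame({
    'lunch': ['standard', 'free', 'standard'],
    'test_prep': ['none', 'completed', 'none'],
})

# 1 marks the named condition, 0 marks the other value.
toy['lunch_standard'] = (toy['lunch'] == 'standard').astype(int)
toy['test_prep_completed'] = (toy['test_prep'] == 'completed').astype(int)
print(toy)
```

For a two-valued feature this carries the same information as one-hot encoding with one column dropped.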
Let's look at the descriptive statistics next.
student.describe(include='all')
        gender     race  parent_education     lunch  test_prep  math_score  \
count     1000     1000              1000      1000       1000  1000.00000
unique       2        5                 4         2          2         NaN
top     female  group C       high school  standard       none         NaN
freq       518      319               375       645        642         NaN
mean       NaN      NaN               NaN       NaN        NaN    66.08900
std        NaN      NaN               NaN       NaN        NaN    15.16308
min        NaN      NaN               NaN       NaN        NaN     0.00000
25%        NaN      NaN               NaN       NaN        NaN    57.00000
50%        NaN      NaN               NaN       NaN        NaN    66.00000
75%        NaN      NaN               NaN       NaN        NaN    77.00000
max        NaN      NaN               NaN       NaN        NaN   100.00000

        reading_score  writing_score
count     1000.000000    1000.000000
unique            NaN            NaN
top               NaN            NaN
freq              NaN            NaN
mean        69.169000      68.054000
std         14.600192      15.195657
min         17.000000      10.000000
25%         59.000000      57.750000
50%         70.000000      69.000000
75%         79.000000      79.000000
max        100.000000     100.000000
Next, we check for missing values and duplicate rows.
student.isna().any()
gender              False
race                False
parent_education    False
lunch               False
test_prep           False
math_score          False
reading_score       False
writing_score       False
dtype: bool
student[student.duplicated()].shape
(0, 8)
And lastly, the correlations.
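name = 'heatmap@student--corr.png'
# restrict to the numerical columns; newer pandas versions raise on non-numeric columns
corr = student.select_dtypes(include='number').corr()
plotter.corr(corr, name)
name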
reading_score and writing_score are positively correlated with one another.
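Because the frame mixes categorical and numerical columns, the correlation is only meaningful for the score columns. The sketch below shows one way to restrict the computation to them, using four rows taken from the head of the dataset. It illustrates the mechanics only and is not a substitute for the full correlation matrix.

```python
import pandas as pd

# Four rows from student.head(), enough to illustrate the call.
toy = pd.DataFrame({
    'gender': ['female', 'male', 'female', 'male'],
    'math_score': [72, 47, 90, 76],
    'reading_score': [72, 57, 95, 78],
    'writing_score': [74, 44, 93, 75],
})

# Keep only the numerical columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include='number').corr()
print(corr.round(2))
```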
2.2. Distributional analysis
In this section we will analyse the distributions of the features. As always, let's start with the histograms of the categorical and numerical features.
name = 'histplot@student--numerical.png'
numerical_features = ['math_score', 'reading_score', 'writing_score']
fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
for idx, feature in enumerate(numerical_features):
    sns.histplot(data=student, x=feature, kde=True, ax=axs[idx])
fig.savefig(name)
name

name = 'histplot@student--categorical.png'
categorical_features = ['gender', 'race', 'parent_education', 'lunch', 'test_prep']
fig, axs = plt.subplots(1, 5, figsize=(15, 5), sharey=True)
for idx, feature in enumerate(categorical_features):
    sns.histplot(data=student, x=feature, discrete=True, ax=axs[idx])
fig.savefig(name)
name
Nothing particularly interesting here. The numerical features appear approximately normally distributed.
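The visual impression of normality could be backed up with summary statistics. The sketch below computes skewness and excess kurtosis with pandas on synthetic scores (drawn from a normal distribution purely for illustration, not taken from the dataset). For roughly normal data both values should be close to zero.

```python
import numpy as np
import pandas as pd

# Synthetic scores for illustration only; replace with the real score columns.
rng = np.random.default_rng(0)
scores = pd.DataFrame({'math_score': rng.normal(66, 15, size=1000)})

# pandas reports sample skewness and excess kurtosis (0 for a normal distribution).
summary = pd.DataFrame({
    'skew': scores.skew(),
    'excess_kurtosis': scores.kurt(),
})
print(summary)
```

On the real data, a clearly negative skew would point to a tail of low scores that the histograms may understate.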
2.3. Relational analysis
In this section we will analyse the relationships of the numerical features. For such a small dataset, a pairplot is sufficient.
name = 'pairplot@student--scatterplot-kdeplot.png'
g = sns.pairplot(data=student, diag_kind='kde', corner=True)
g.savefig(name)
name
There is a roughly linear relationship between each pair of numerical features.