Student

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the student dataset. We start by reading the accompanying data docs. The documentation is sparse: it provides no context on the features, how the data was collected, or what the schema should look like.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

student = pd.read_csv('../data/data/student.csv')
student.head()
   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
3                    none          47             57             44  
4                    none          76             78             75  
student.shape
(1000, 8)
student.columns
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course', 'math score', 'reading score',
       'writing score'],
      dtype='object')
student.dtypes
gender                         object
race/ethnicity                 object
parental level of education    object
lunch                          object
test preparation course        object
math score                      int64
reading score                   int64
writing score                   int64
dtype: object
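
Since the documentation gives no schema, one option is to write the observed schema down as explicit checks, so that a changed file fails loudly. The following is a minimal sketch under assumptions: `EXPECTED_COLUMNS`, `SCORE_COLUMNS` and `check_schema` are hypothetical names, the 0 to 100 score range is an assumption rather than something the data docs state, and the toy frame reuses the first two rows shown by `student.head()`.

```python
import pandas as pd

# Hypothetical schema, transcribed from the columns printed above.
EXPECTED_COLUMNS = ['gender', 'race/ethnicity', 'parental level of education',
                    'lunch', 'test preparation course', 'math score',
                    'reading score', 'writing score']
SCORE_COLUMNS = ['math score', 'reading score', 'writing score']


def check_schema(df):
    """Return a list of human-readable schema problems (empty if none)."""
    problems = []
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        problems.append(f'missing columns: {missing}')
    for col in SCORE_COLUMNS:
        if col not in df.columns:
            continue
        if not pd.api.types.is_integer_dtype(df[col]):
            problems.append(f'{col}: expected an integer dtype, got {df[col].dtype}')
        elif not df[col].between(0, 100).all():
            # the 0-100 range is an assumption, not a documented constraint
            problems.append(f'{col}: values outside the assumed 0-100 range')
    return problems


# First two rows of student.head(), used purely as a toy input.
toy = pd.DataFrame({
    'gender': ['female', 'female'],
    'race/ethnicity': ['group B', 'group C'],
    'parental level of education': ["bachelor's degree", 'some college'],
    'lunch': ['standard', 'standard'],
    'test preparation course': ['none', 'completed'],
    'math score': [72, 69],
    'reading score': [72, 90],
    'writing score': [74, 88],
})

problems = check_schema(toy)
print(problems)
```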

We have a mix of categorical and numerical features. Let's examine the categorical features in more depth next. But before that, let's rename the columns for simplicity.

mapper = {'gender': 'gender',
          'race/ethnicity': 'race',
          'parental level of education': 'parent_education',
          'lunch': 'lunch',
          'test preparation course': 'test_prep',
          'math score': 'math_score',
          'reading score': 'reading_score',
          'writing score': 'writing_score'}

student = student.rename(mapper=mapper, axis='columns')
student.columns
Index(['gender', 'race', 'parent_education', 'lunch', 'test_prep',
       'math_score', 'reading_score', 'writing_score'],
      dtype='object')

2.1.1. Handling gender, race/ethnicity, parental level of education, lunch & test preparation course

All the above features are categorical. Let's start with gender.

student['gender'] = student['gender'].astype('category')
student['gender'].value_counts()
female    518
male      482
Name: gender, dtype: int64

The two classes are roughly balanced (518 vs 482), which is good. However, we should one-hot encode the feature so that a model does not infer an ordering between the values.

We look at race/ethnicity next.

student['race'] = student['race'].astype('category')
student['race'].value_counts()
group C    319
group D    262
group B    190
group E    140
group A     89
Name: race, dtype: int64

The ethnic groups are anonymised, which is good. One-hot encoding is required here too, so that no ordering is implied between the groups.
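
As a rough illustration of the one-hot encoding mentioned for gender and race, the sketch below uses `pd.get_dummies` on a toy frame built from the first five rows of `student.head()`. The names `toy` and `encoded` are made up for this example, and this is only one possible way to do the encoding.

```python
import pandas as pd

# Toy frame: gender and race of the first five rows of the dataset.
toy = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'race': ['group B', 'group C', 'group B', 'group A', 'group C'],
})

# One indicator column per level; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=['gender', 'race'], dtype=int)
print(encoded)
```

Each row has exactly one 1 among the indicator columns of a given feature, so no level ranks above another. For a two-level feature such as gender, passing `drop_first=True` would keep a single 0/1 column instead.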

We look at parent_education next.

student['parent_education'] = student['parent_education'].astype('category')
student['parent_education'].value_counts()
some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parent_education, dtype: int64

Some of the categories are close enough to be binned together. For instance, we treat high school & some high school as one level. For this analysis, we also assume that a person who went to college obtained a BSc degree, so some college is merged with bachelor's degree. The values are ordered, so an ordinal (label) encoding may suit this feature better than one-hot encoding.

# strip stray whitespace, then bin the six levels into four
student['parent_education'] = student['parent_education'].str.strip().replace({
    'some college': 'bachelors',
    "bachelor's degree": 'bachelors',
    "associate's degree": 'associates',
    'some high school': 'high school',
    "master's degree": 'masters',
})
student['parent_education'].value_counts()
high school    375
bachelors      344
associates     222
masters         59
Name: parent_education, dtype: int64
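
The ordinal encoding suggested for parent_education could look like the sketch below, which uses an ordered `pd.Categorical` and its integer codes. The ordering in `EDUCATION_ORDER` is an assumption made for this example (the data docs do not define one), and the toy values are the binned levels of the first five rows.

```python
import pandas as pd

# Assumed ordering, lowest to highest; not taken from the data docs.
EDUCATION_ORDER = ['high school', 'associates', 'bachelors', 'masters']

# Binned parent_education of the first five rows of the dataset.
toy = pd.Series(['bachelors', 'bachelors', 'masters', 'associates', 'bachelors'],
                name='parent_education')

# An ordered categorical keeps the labels and records their rank.
ordered = pd.Categorical(toy, categories=EDUCATION_ORDER, ordered=True)

# .codes gives the position of each value in EDUCATION_ORDER (0-based).
codes = pd.Series(ordered.codes, index=toy.index, name='parent_education_code')
print(codes.tolist())
```

Any value that is not listed in `EDUCATION_ORDER` receives the code -1, which makes unexpected labels easy to spot.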

Let's look at lunch next.

student['lunch'] = student['lunch'].astype('category')
student['lunch'].value_counts()
standard        645
free/reduced    355
Name: lunch, dtype: int64
student['lunch'] = student['lunch'].str.strip()
student['lunch'] = student['lunch'].replace(to_replace='free/reduced', value='free')
student['lunch'].value_counts()
standard    645
free        355
Name: lunch, dtype: int64

It's difficult to determine if there is hierarchy amongst the values here.

Finally, let's look at test_prep.

student['test_prep'] = student['test_prep'].astype('category')
student['test_prep'].value_counts()
none         642
completed    358
Name: test_prep, dtype: int64

Both lunch & test_prep have only two levels. Should we represent them simply as binary (0/1) features, or does a categorical representation still make sense?
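
One possible answer is to store each two-level feature as a single 0/1 column. The sketch below shows this on the first five rows of the dataset. The column names `lunch_standard` and `test_prep_completed` are hypothetical, and the choice of which level maps to 1 is arbitrary.

```python
import pandas as pd

# lunch (after the free/reduced -> free rename) and test_prep, first five rows.
toy = pd.DataFrame({
    'lunch': ['standard', 'standard', 'standard', 'free', 'standard'],
    'test_prep': ['none', 'completed', 'none', 'none', 'none'],
})

# Compare against one level and cast the boolean result to 0/1.
binary = pd.DataFrame({
    'lunch_standard': (toy['lunch'] == 'standard').astype(int),
    'test_prep_completed': (toy['test_prep'] == 'completed').astype(int),
})
print(binary)
```

For a feature with exactly two levels this carries the same information as one-hot encoding with one column dropped.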

Let's look at the descriptive statistics next.

student.describe(include='all')
        gender     race parent_education     lunch test_prep  math_score  \
count     1000     1000             1000      1000      1000  1000.00000   
unique       2        5                4         2         2         NaN   
top     female  group C      high school  standard      none         NaN   
freq       518      319              375       645       642         NaN   
mean       NaN      NaN              NaN       NaN       NaN    66.08900   
std        NaN      NaN              NaN       NaN       NaN    15.16308   
min        NaN      NaN              NaN       NaN       NaN     0.00000   
25%        NaN      NaN              NaN       NaN       NaN    57.00000   
50%        NaN      NaN              NaN       NaN       NaN    66.00000   
75%        NaN      NaN              NaN       NaN       NaN    77.00000   
max        NaN      NaN              NaN       NaN       NaN   100.00000   

        reading_score  writing_score  
count     1000.000000    1000.000000  
unique            NaN            NaN  
top               NaN            NaN  
freq              NaN            NaN  
mean        69.169000      68.054000  
std         14.600192      15.195657  
min         17.000000      10.000000  
25%         59.000000      57.750000  
50%         70.000000      69.000000  
75%         79.000000      79.000000  
max        100.000000     100.000000  

Next, we check for missing values and duplicate rows.

student.isna().any()
gender              False
race                False
parent_education    False
lunch               False
test_prep           False
math_score          False
reading_score       False
writing_score       False
dtype: bool
student[student.duplicated()].shape
(0, 8)
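
The checks above only say whether anything is missing or duplicated. When the answer is yes, counts are usually more useful. The sketch below shows the counting variant on a made-up toy frame that deliberately contains one gap and one repeated row. It is not the student data, which has neither.

```python
import numpy as np
import pandas as pd

# Made-up toy data with one missing value and one duplicated row.
toy = pd.DataFrame({
    'math_score': [72, 69, np.nan, 72],
    'reading_score': [72, 90, 95, 72],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print(n_duplicates)
```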

And lastly, the correlations.

name = 'heatmap@student--corr.png'
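# restrict to the numeric columns; pandas >= 2.0 raises on non-numeric ones
corr = student.select_dtypes(include='number').corr()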

plotter.corr(corr, name)
name

heatmap@student--corr.png

reading_score and writing_score are positively correlated with one another.
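
For reference, `DataFrame.corr` computes pairwise Pearson correlations by default, and only numeric columns can take part. The sketch below illustrates this on the five rows shown by `student.head()`. Five rows are far too few to draw conclusions from, so the numbers are purely illustrative, and the heatmap helper `plotter.corr` is project code that is not reproduced here.

```python
import pandas as pd

# First five rows of the dataset, with one non-numeric column kept on purpose.
toy = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'math_score': [72, 69, 90, 47, 76],
    'reading_score': [72, 90, 95, 57, 78],
    'writing_score': [74, 88, 93, 44, 75],
})

# Keep the numeric columns only, then compute the Pearson correlation matrix.
numeric = toy.select_dtypes(include='number')
corr = numeric.corr(method='pearson')
print(corr.round(2))
```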

2.2. Distributional analysis

In this section we will analyse the distributions of the features. As always, let's start with the histograms of the categorical and numerical features.

name = 'histplot@student--numerical.png'
numerical_features = ['math_score',
                      'reading_score',
                      'writing_score']

fig, axs = plt.subplots(1, 3, figsize=(15, 5), sharey=True)
for idx, feature in enumerate(numerical_features):
    sns.histplot(data=student, x=feature, kde=True, ax=axs[idx])
fig.savefig(name)
name

histplot@student--numerical.png

name = 'histplot@student--categorical.png'
categorical_features = ['gender',
                        'race',
                        'parent_education',
                        'lunch',
                        'test_prep']

fig, axs = plt.subplots(1, 5, figsize=(15, 5), sharey=True)
for idx, feature in enumerate(categorical_features):
    sns.histplot(data=student, x=feature, discrete=True, ax=axs[idx])
fig.savefig(name)
name

histplot@student--categorical.png

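Nothing particularly interesting. The numerical features look approximately normally distributed, although this is a visual judgement from the histograms rather than the result of a formal test.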
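
A quick numeric complement to the histograms is to look at sample skewness and excess kurtosis, both of which are close to 0 for normally distributed data. The sketch below runs on a synthetic normal sample, not on the student data. Its mean and standard deviation are chosen to resemble the math_score values reported by `describe`, and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Synthetic sample (NOT the student data), seeded for reproducibility.
rng = np.random.default_rng(0)
synthetic = pd.Series(rng.normal(loc=66, scale=15, size=1000),
                      name='synthetic_score')

# Both statistics are near 0 for a normal distribution.
skewness = synthetic.skew()
excess_kurtosis = synthetic.kurt()

print(f'skewness={skewness:.3f}, excess kurtosis={excess_kurtosis:.3f}')
```

Applied to the real columns, e.g. `student['math_score'].skew()`, clearly non-zero values would be a reason to revisit the normality assumption.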

2.3. Relational analysis

In this section we will analyse the relationships of the numerical features. For such a small dataset, a pairplot is sufficient.

name = 'pairplot@student--scatterplot-kdeplot.png'
g = sns.pairplot(data=student, diag_kind='kde', corner=True)
g.savefig(name)
name

pairplot@student--scatterplot-kdeplot.png

The pairplot suggests a roughly linear relationship between each pair of numerical features.
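
A linear relationship read off a scatterplot can be quantified by fitting a straight line and checking how much variance it explains. The sketch below does this with `np.polyfit` for reading and writing scores, using only the five rows shown by `student.head()`. The sample is far too small to be conclusive and only demonstrates the mechanics.

```python
import numpy as np

# reading_score and writing_score of the first five rows of the dataset.
reading = np.array([72, 90, 95, 57, 78])
writing = np.array([74, 88, 93, 44, 75])

# Least-squares fit of writing = slope * reading + intercept.
slope, intercept = np.polyfit(reading, writing, deg=1)

# For a one-variable linear fit, R^2 equals the squared Pearson correlation.
r = np.corrcoef(reading, writing)[0, 1]
r_squared = r ** 2

print(f'slope={slope:.3f}, intercept={intercept:.3f}, R^2={r_squared:.3f}')
```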

Date: 2021-10-21 Thu 00:00

Created: 2021-10-22 Fri 22:00