Insurance

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend; figures are written to PNG files
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to show a df.info()-style summary instead of a truncated table for large frames
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the insurance dataset. We start by reading the accompanying data docs, which describe the features.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

insurance = pd.read_csv('../data/data/insurance.csv')
insurance.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
insurance.shape
1338 7
insurance.dtypes
age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object

sex is a sensitive feature, which we should one-hot encode before modelling. sex, smoker and region are categorical, so we strip stray whitespace from them and convert them to the category dtype.

insurance['sex'] = insurance['sex'].str.strip().astype('category')
insurance['smoker'] = insurance['smoker'].str.strip().astype('category')
insurance['region'] = insurance['region'].str.strip().astype('category')
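
The one-hot encoding mentioned above is not applied in this section. A minimal sketch of how it could be done with pd.get_dummies, assuming a small made-up frame in place of the real dataset (the values and the names toy and encoded are illustrative, not part of the original analysis):

```python
import pandas as pd

# Toy stand-in for the insurance frame; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [19, 18, 28],
    'sex': [' female', 'male ', 'male'],
    'smoker': ['yes', 'no', 'no'],
})

# Same cleaning step as above: strip whitespace, then use the category dtype.
toy['sex'] = toy['sex'].str.strip().astype('category')

# One-hot encode `sex`: the original column is replaced by one indicator
# column per category, appended after the remaining columns.
encoded = pd.get_dummies(toy, columns=['sex'], prefix='sex')
print(encoded.columns.tolist())
```

Whether to keep both indicator columns or drop one (drop_first=True) depends on the model used later; that choice is left open here.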

Let's look at the descriptive statistics, missing values and duplicates next.

insurance.describe(include='all')
                age   sex          bmi     children smoker     region  \
count   1338.000000  1338  1338.000000  1338.000000   1338       1338   
unique          NaN     2          NaN          NaN      2          4   
top             NaN  male          NaN          NaN     no  southeast   
freq            NaN   676          NaN          NaN   1064        364   
mean      39.207025   NaN    30.663397     1.094918    NaN        NaN   
std       14.049960   NaN     6.098187     1.205493    NaN        NaN   
min       18.000000   NaN    15.960000     0.000000    NaN        NaN   
25%       27.000000   NaN    26.296250     0.000000    NaN        NaN   
50%       39.000000   NaN    30.400000     1.000000    NaN        NaN   
75%       51.000000   NaN    34.693750     2.000000    NaN        NaN   
max       64.000000   NaN    53.130000     5.000000    NaN        NaN   

             charges  
count    1338.000000  
unique           NaN  
top              NaN  
freq             NaN  
mean    13270.422265  
std     12110.011237  
min      1121.873900  
25%      4740.287150  
50%      9382.033000  
75%     16639.912515  
max     63770.428010  
insurance.isna().any()
age         False
sex         False
bmi         False
children    False
smoker      False
region      False
charges     False
dtype: bool
insurance[insurance.duplicated()].shape
1 7

One row is flagged as a duplicate; let's investigate.

insurance[insurance.duplicated(keep=False)]
     age   sex    bmi  children smoker     region    charges
195   19  male  30.59         0     no  northwest  1639.5631
581   19  male  30.59         0     no  northwest  1639.5631
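
The first call reports a single row while the second shows two because of the keep argument of duplicated. A small sketch of the difference, assuming a made-up frame (toy and the variable names are illustrative only):

```python
import pandas as pd

# Toy frame with one repeated row; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [19, 33, 19],
    'bmi': [30.59, 22.705, 30.59],
})

# Default keep='first': only the later occurrence is flagged.
later_only = toy.duplicated()
# keep=False: every member of a duplicate group is flagged.
all_copies = toy.duplicated(keep=False)

print(later_only.tolist())
print(all_copies.tolist())

# drop_duplicates() also defaults to keep='first' and preserves the
# original index labels of the rows it keeps.
deduped = toy.drop_duplicates()
print(deduped.index.tolist())
```

Because the index labels are preserved, a reset_index(drop=True) may be worth adding after dropping rows if later code relies on a contiguous index.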

Let's drop the duplicate and only keep the first instance.

insurance = insurance.drop_duplicates()
insurance.shape
1337 7

Let's look at the correlations next.

name = 'heatmap@insurance--corr.png'
# corr() is only defined for numeric columns, so select them explicitly
corr = insurance.select_dtypes('number').corr()

plotter.corr(corr, name)
name

heatmap@insurance--corr.png

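bmi and age are both positively correlated with charges. The categorical features (sex, smoker and region) are not part of this correlation matrix.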
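
To read the relationship with charges off the matrix directly rather than from the heatmap, the charges column can be sorted. A minimal sketch, assuming a made-up frame (toy and with_target are illustrative names, and the numbers are not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the insurance frame; the values are made up for illustration.
toy = pd.DataFrame({
    'age': [20, 30, 40, 50],
    'bmi': [22.0, 31.0, 27.0, 35.0],
    'smoker': pd.Categorical(['no', 'yes', 'no', 'yes']),
    'charges': [2000.0, 20000.0, 7000.0, 30000.0],
})

# Keep only the numeric columns; the categorical `smoker` column is excluded.
numeric = toy.select_dtypes(include='number')
corr = numeric.corr()  # Pearson correlation by default

# Correlation of each numeric feature with the target, strongest first.
with_target = corr['charges'].drop('charges').sort_values(ascending=False)
print(with_target)
```

Pearson correlation only captures linear association, so a low value here does not by itself rule out a relationship with charges.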

Date: 2021-10-26 Tue 00:00

Created: 2021-10-27 Wed 13:05