Insurance
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the insurance dataset. We start by reading the accompanying data docs, which provide some information regarding the features.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
insurance = pd.read_csv('../data/data/insurance.csv')
insurance.head()
   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520
insurance.shape
(1338, 7)
insurance.dtypes
age           int64
sex          object
bmi         float64
children      int64
smoker       object
region       object
charges     float64
dtype: object
sex is a sensitive feature, which we should one-hot encode. sex, smoker & region are categorical, so they should be converted to the category dtype.
insurance['sex'] = insurance['sex'].str.strip().astype('category')
insurance['smoker'] = insurance['smoker'].str.strip().astype('category')
insurance['region'] = insurance['region'].str.strip().astype('category')
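The one-hot encoding of sex mentioned above is not carried out in this section. A minimal sketch of how it could be done with pd.get_dummies follows; the toy frame and the names toy and encoded are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy frame reusing two column names from the dataset.
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
})
toy["sex"] = toy["sex"].str.strip().astype("category")

# One indicator column per category level; dtype=int yields 0/1 rather than booleans.
encoded = pd.get_dummies(toy, columns=["sex"], dtype=int)
print(encoded)
```

The same call applied to the full frame would replace sex with the columns sex_female and sex_male, leaving the other columns untouched.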
Let's look at the descriptive statistics, missing values and duplicates next.
insurance.describe(include='all')
                age   sex          bmi     children smoker     region       charges
count   1338.000000  1338  1338.000000  1338.000000   1338       1338   1338.000000
unique          NaN     2          NaN          NaN      2          4           NaN
top             NaN  male          NaN          NaN     no  southeast           NaN
freq            NaN   676          NaN          NaN   1064        364           NaN
mean      39.207025   NaN    30.663397     1.094918    NaN        NaN  13270.422265
std       14.049960   NaN     6.098187     1.205493    NaN        NaN  12110.011237
min       18.000000   NaN    15.960000     0.000000    NaN        NaN   1121.873900
25%       27.000000   NaN    26.296250     0.000000    NaN        NaN   4740.287150
50%       39.000000   NaN    30.400000     1.000000    NaN        NaN   9382.033000
75%       51.000000   NaN    34.693750     2.000000    NaN        NaN  16639.912515
max       64.000000   NaN    53.130000     5.000000    NaN        NaN  63770.428010
insurance.isna().any()
age         False
sex         False
bmi         False
children    False
smoker      False
region      False
charges     False
dtype: bool
insurance[insurance.duplicated()].shape
(1, 7)
We have one duplicated row; let's investigate.
insurance[insurance.duplicated(keep=False)]
     age   sex    bmi  children smoker     region    charges
195   19  male  30.59         0     no  northwest  1639.5631
581   19  male  30.59         0     no  northwest  1639.5631
Let's drop the duplicate and only keep the first instance.
insurance = insurance.drop_duplicates()
insurance.shape
(1337, 7)
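The duplicate handling relies on how the keep argument changes what duplicated() flags. A small sketch on a hypothetical toy frame (the names toy, n_repeats, all_copies and deduped are illustrative) shows the difference:

```python
import pandas as pd

# Hypothetical toy frame with one repeated row.
toy = pd.DataFrame({
    "age": [19, 19, 33],
    "bmi": [30.59, 30.59, 22.705],
})

# Default keep='first': only the later copies are flagged.
n_repeats = int(toy.duplicated().sum())

# keep=False: every row belonging to a duplicate group is flagged.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_repeats, len(all_copies), deduped.shape)
```

This is why the first check reports a single row while the investigation step shows two: the former counts only the repeat, the latter shows both copies.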
Let's look at the correlations next.
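name = 'heatmap@insurance--corr.png'
# restrict to numeric columns; pandas 2.0 and later raise on the category columns otherwise
corr = insurance.select_dtypes(include='number').corr()
plotter.corr(corr, name)
name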
bmi & age are positively correlated with charges.
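To read the correlations with charges as numbers rather than off the heatmap, a sketch along the following lines could help. It uses only the five rows printed by insurance.head() earlier, so the resulting values are not representative of the full dataset; the names sample and charges_corr are illustrative assumptions.

```python
import pandas as pd

# The five rows shown by insurance.head(); far too few for real conclusions.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

# Pearson correlation over the numeric columns only.
corr = sample.select_dtypes(include="number").corr()

# Correlation of each numeric feature with charges, strongest positive first.
charges_corr = corr["charges"].drop("charges").sort_values(ascending=False)
print(charges_corr)
```

Running the same two lines on the full insurance frame would give the figures behind the heatmap.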