Covid Vaccine Manufacturer
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the covid-vaccine-manufacturer dataset. This dataset is part of the covid world progress dataset available on Kaggle (specifically, we analyse the country_vaccinations_by_manufacturer.csv file, which we have renamed to covid-vaccine-manufacturer.csv here for simplicity).
We start by reading the data docs, which provide useful information on the features of the dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
covid = pd.read_csv(utils.data_path('covid-vaccine-manufacturer.csv'))
covid.head()
  location        date             vaccine  total_vaccinations
0  Austria  2021-01-08     Johnson&Johnson                   0
1  Austria  2021-01-08             Moderna                   0
2  Austria  2021-01-08  Oxford/AstraZeneca                   0
3  Austria  2021-01-08     Pfizer/BioNTech               31201
4  Austria  2021-01-15     Johnson&Johnson                   0
covid.shape
(19168, 4)
covid.dtypes
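location              object
date                  object
vaccine               object
total_vaccinations     int64
dtype: object
A nice and simple dataset with categorical (location, vaccine), datetime (date) and numerical (total_vaccinations) features, although the first three are currently loaded as object. Let's look at each of them in turn.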
2.1.1. Handling location & vaccine
These are categorical features, so let's convert them to the category dtype.
categorical_features = ['location', 'vaccine']
# Work on a copy so the assignments below do not trigger SettingWithCopyWarning
categorical = covid[categorical_features].copy()
for column in categorical.columns:
    # Strip stray whitespace before converting to the category dtype
    categorical[column] = categorical[column].str.strip().astype('category')
categorical.describe(include='all')
              location          vaccine
count            19168            19168
unique              38                8
top     European Union  Pfizer/BioNTech
freq              2128             5415
categorical['location'].cat.categories
Index(['Austria', 'Belgium', 'Bulgaria', 'Chile', 'Croatia', 'Cyprus',
       'Czechia', 'Denmark', 'Ecuador', 'Estonia', 'European Union',
       'Finland', 'France', 'Germany', 'Hong Kong', 'Hungary', 'Iceland',
       'Ireland', 'Italy', 'Japan', 'Latvia', 'Liechtenstein', 'Lithuania',
       'Luxembourg', 'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal',
       'Romania', 'Slovakia', 'Slovenia', 'South Korea', 'Spain', 'Sweden',
       'Switzerland', 'United States', 'Uruguay'],
      dtype='object')
The locations are individual countries and territories, plus the European Union aggregate, so there is no obvious way to bin the values for location.
categorical['vaccine'].cat.categories
Index(['CanSino', 'Johnson&Johnson', 'Moderna', 'Oxford/AstraZeneca',
       'Pfizer/BioNTech', 'Sinopharm/Beijing', 'Sinovac', 'Sputnik V'],
      dtype='object')
covid[categorical_features] = categorical
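The conversion above can be tried on a small toy frame. The sketch below is only an illustration under stated assumptions: the `toy` frame and its values are made up (they are not rows from the dataset), and the padded strings are there purely to show why stripping whitespace before `astype('category')` matters.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame({
    "location": [" Austria", "Austria ", "Chile"],
    "vaccine": ["Moderna", "Moderna", "Sinovac "],
    "total_vaccinations": [0, 10, 20],
})

categorical_features = ["location", "vaccine"]
for column in categorical_features:
    # Strip padding first so " Austria" and "Austria " end up as one category.
    toy[column] = toy[column].str.strip().astype("category")

location_categories = list(toy["location"].cat.categories)
vaccine_categories = list(toy["vaccine"].cat.categories)
```

Without the `str.strip()` call, the two padded spellings of Austria would be kept as separate categories.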
2.1.2. Handling date
The date column should be of datetime dtype.
covid['date'] = pd.to_datetime(covid['date'])
covid['date']
0       2021-01-08
1       2021-01-08
2       2021-01-08
3       2021-01-08
4       2021-01-15
           ...
19163   2021-10-26
19164   2021-10-26
19165   2021-10-26
19166   2021-10-26
19167   2021-10-26
Name: date, Length: 19168, dtype: datetime64[ns]
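As a minimal sketch of what the datetime conversion buys us, the example below parses a few date strings and then uses them for date arithmetic. The `raw_dates` series is a hypothetical sample in the same YYYY-MM-DD layout as the dataset, and passing an explicit `format` is an optional choice made here, not something the analysis above does.

```python
import pandas as pd

# Hypothetical sample of date strings in the dataset's YYYY-MM-DD layout.
raw_dates = pd.Series(["2021-01-08", "2021-01-15", "2021-10-26"], name="date")

# An explicit format makes parsing strict: a value in another layout
# raises an error instead of being guessed at.
dates = pd.to_datetime(raw_dates, format="%Y-%m-%d")

# Once parsed, the column supports date arithmetic and the .dt accessor.
span_days = (dates.max() - dates.min()).days
weekdays = dates.dt.day_name().tolist()
```

With the column left as object, neither the subtraction nor the `.dt` accessor would work.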
2.1.3. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
covid.describe(include='all')
              location                 date          vaccine  total_vaccinations
count            19168                19168            19168        1.916800e+04
unique              38                  317                8                 NaN
top     European Union  2021-06-11 00:00:00  Pfizer/BioNTech                 NaN
freq              2128                  147             5415                 NaN
first              NaN  2020-12-04 00:00:00              NaN                 NaN
last               NaN  2021-10-26 00:00:00              NaN                 NaN
mean               NaN                  NaN              NaN        1.143473e+07
std                NaN                  NaN              NaN        3.913491e+07
min                NaN                  NaN              NaN        0.000000e+00
25%                NaN                  NaN              NaN        7.058900e+04
50%                NaN                  NaN              NaN        7.747065e+05
75%                NaN                  NaN              NaN        5.173990e+06
max                NaN                  NaN              NaN        4.126639e+08
Every column has a count of 19168, which matches the number of rows, so there are no missing values. Let's look at duplicates next.
covid[covid.duplicated()].shape
(0, 4)
No duplicates.
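The two checks in this section can also be made explicit rather than read off the describe output. The sketch below assumes a made-up `toy` frame that deliberately contains one missing value and one repeated row, so that both checks have something to find; the real dataset has neither.

```python
import pandas as pd

# Toy frame with one missing vaccine and one fully repeated row (rows 0 and 1).
toy = pd.DataFrame({
    "location": ["Austria", "Austria", "Chile", "Chile"],
    "vaccine": ["Moderna", "Moderna", "Sinovac", None],
    "total_vaccinations": [10, 10, 20, 30],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Number of rows that repeat an earlier row in every column.
n_duplicates = int(toy.duplicated().sum())
```

Applied to the covid frame, both counts would be expected to come out as zero, in line with the results above.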
2.1.4. Correlations
There is only one numerical feature so checking for correlations does not make sense.
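To make this concrete, here is a small sketch on a made-up `toy` frame with the same column layout (the values are illustrative, not taken from the dataset). Selecting the numerical columns leaves a single column, so the correlation matrix is a trivial 1x1 table.

```python
import pandas as pd

# Toy frame with the same columns as the dataset; values are illustrative.
toy = pd.DataFrame({
    "location": ["Austria", "Austria", "Chile"],
    "vaccine": ["Moderna", "Pfizer/BioNTech", "Sinovac"],
    "total_vaccinations": [0, 31201, 500],
})

# Only total_vaccinations is numerical.
numeric = toy.select_dtypes(include="number")

# The only entry is the correlation of the column with itself.
corr = numeric.corr()
```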