Covid Vaccine Manufacturer

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the covid-vaccine-manufacturer dataset, which is part of the covid world progress dataset available on Kaggle. Specifically, we analyse the country_vaccinations_by_manufacturer.csv file, which we have renamed here for simplicity.

We start by reading the data docs, which provide useful information on the features of the dataset.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

covid = pd.read_csv(utils.data_path('covid-vaccine-manufacturer.csv'))
covid.head()
  location        date             vaccine  total_vaccinations
0  Austria  2021-01-08     Johnson&Johnson                   0
1  Austria  2021-01-08             Moderna                   0
2  Austria  2021-01-08  Oxford/AstraZeneca                   0
3  Austria  2021-01-08     Pfizer/BioNTech               31201
4  Austria  2021-01-15     Johnson&Johnson                   0
covid.shape
(19168, 4)
covid.dtypes
location              object
date                  object
vaccine               object
total_vaccinations     int64
dtype: object

This is a simple dataset with two categorical features, one date feature (currently stored as object) and one numerical feature. Let's look at them individually.

2.1.1. Handling location & vaccine

These are categorical features so let's convert them to category dtype.

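categorical_features = ['location',
                        'vaccine']
# Work on an explicit copy so the assignments below do not trigger
# pandas' SettingWithCopyWarning.
categorical = covid[categorical_features].copy()

for column in categorical.columns:
    # Strip stray whitespace before converting to the category dtype.
    categorical[column] = categorical[column].str.strip().astype('category')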

categorical.describe(include='all')
              location          vaccine
count            19168            19168
unique              38                8
top     European Union  Pfizer/BioNTech
freq              2128             5415
categorical['location'].cat.categories
Index(['Austria', 'Belgium', 'Bulgaria', 'Chile', 'Croatia', 'Cyprus',
       'Czechia', 'Denmark', 'Ecuador', 'Estonia', 'European Union', 'Finland',
       'France', 'Germany', 'Hong Kong', 'Hungary', 'Iceland', 'Ireland',
       'Italy', 'Japan', 'Latvia', 'Liechtenstein', 'Lithuania', 'Luxembourg',
       'Malta', 'Netherlands', 'Norway', 'Poland', 'Portugal', 'Romania',
       'Slovakia', 'Slovenia', 'South Korea', 'Spain', 'Sweden', 'Switzerland',
       'United States', 'Uruguay'],
      dtype='object')

The locations are individual countries and territories, plus the European Union aggregate, so there is no obvious way to bin them.

categorical['vaccine'].cat.categories
Index(['CanSino', 'Johnson&Johnson', 'Moderna', 'Oxford/AstraZeneca',
       'Pfizer/BioNTech', 'Sinopharm/Beijing', 'Sinovac', 'Sputnik V'],
      dtype='object')
covid[categorical_features] = categorical
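The conversion above can be sketched on a small stand-in frame. The toy values below are illustrative assumptions, not rows from the dataset; the point is only to show why stripping whitespace before the conversion matters.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    'location': [' Austria', 'Austria ', 'Chile'],
    'vaccine': ['Moderna', 'Moderna ', 'Sinovac'],
    'total_vaccinations': [0, 10, 20],
})

categorical_features = ['location', 'vaccine']

for column in categorical_features:
    # Strip first so ' Austria' and 'Austria ' collapse into one category.
    toy[column] = toy[column].str.strip().astype('category')

n_locations = len(toy['location'].cat.categories)
n_vaccines = len(toy['vaccine'].cat.categories)
```

Without the strip step, the padded strings would be treated as distinct categories and inflate the unique counts.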

2.1.2. Handling date

The date column is currently stored as object, so let's convert it to the datetime dtype.

covid['date'] = pd.to_datetime(covid['date'])
covid['date']
0       2021-01-08
1       2021-01-08
2       2021-01-08
3       2021-01-08
4       2021-01-15
           ...    
19163   2021-10-26
19164   2021-10-26
19165   2021-10-26
19166   2021-10-26
19167   2021-10-26
Name: date, Length: 19168, dtype: datetime64[ns]
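As a hedged variation on the conversion above, the expected layout can be stated explicitly and unparseable entries turned into NaT rather than raising. The sample strings below are assumptions for illustration; the '%Y-%m-%d' format is inferred from the values shown in the output.

```python
import pandas as pd

# Two valid dates in the layout seen above, plus one deliberately bad entry.
dates = pd.Series(['2021-01-08', '2021-01-15', 'not a date'])

# format documents the expected layout; errors='coerce' maps
# unparseable entries to NaT instead of raising an exception.
parsed = pd.to_datetime(dates, format='%Y-%m-%d', errors='coerce')

n_bad = int(parsed.isna().sum())
first_date, last_date = parsed.min(), parsed.max()
```

Counting the NaT values afterwards gives a quick check that nothing was silently dropped during parsing.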

2.1.3. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics next.

covid.describe(include='all')
              location                 date          vaccine  \
count            19168                19168            19168   
unique              38                  317                8   
top     European Union  2021-06-11 00:00:00  Pfizer/BioNTech   
freq              2128                  147             5415   
first              NaN  2020-12-04 00:00:00              NaN   
last               NaN  2021-10-26 00:00:00              NaN   
mean               NaN                  NaN              NaN   
std                NaN                  NaN              NaN   
min                NaN                  NaN              NaN   
25%                NaN                  NaN              NaN   
50%                NaN                  NaN              NaN   
75%                NaN                  NaN              NaN   
max                NaN                  NaN              NaN   

        total_vaccinations  
count         1.916800e+04  
unique                 NaN  
top                    NaN  
freq                   NaN  
first                  NaN  
last                   NaN  
mean          1.143473e+07  
std           3.913491e+07  
min           0.000000e+00  
25%           7.058900e+04  
50%           7.747065e+05  
75%           5.173990e+06  
max           4.126639e+08  

Every column has a count of 19168, which matches the number of rows, so there are no missing values. Let's look at duplicates next.

covid[covid.duplicated()].shape
(0, 4)

No duplicates.
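Both checks can also be made explicitly rather than read off the summary table. The following is a minimal sketch on a made-up frame; the choice of location and vaccine as a key for the subset check is an assumption for illustration only.

```python
import pandas as pd

# Toy frame with one missing value and one fully duplicated row.
toy = pd.DataFrame({
    'location': ['Austria', 'Austria', 'Chile', 'Chile'],
    'vaccine': ['Moderna', 'Moderna', 'Sinovac', None],
    'total_vaccinations': [0, 0, 20, 30],
})

# Missing values per column.
missing_per_column = toy.isna().sum()

# Rows that repeat an earlier row across all columns.
n_duplicates = int(toy.duplicated().sum())

# Rows that repeat an earlier row on a chosen key only.
n_key_duplicates = int(toy.duplicated(subset=['location', 'vaccine']).sum())
```

By default duplicated() compares whole rows, so a key-based check with subset can catch repeated records whose values differ.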

2.1.4. Correlations

There is only one numerical feature, so checking for correlations does not make sense.
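To illustrate the point, here is a small sketch on made-up values: with a single numerical column, the correlation matrix reduces to one cell holding the column's correlation with itself.

```python
import pandas as pd

# Toy frame with one categorical and one numerical column (values are made up).
toy = pd.DataFrame({
    'vaccine': ['Moderna', 'Sinovac', 'Moderna'],
    'total_vaccinations': [0, 10, 30],
})

# Keep only the numerical columns before computing correlations.
numeric = toy.select_dtypes(include='number')
corr = numeric.corr()
```

The result is a 1x1 matrix, which carries no information about relationships between features.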

Date: 2021-11-03 Wed 00:00

Created: 2021-11-03 Wed 16:17