Covid Vaccine

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the covid-vaccine dataset. This dataset is part of the covid world progress dataset available on Kaggle (specifically, we analyse the country_vaccinations.csv file, which we have renamed to covid-vaccine.csv for simplicity).

We start by reading the data docs, which provide useful information on the features of the dataset.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

covid = pd.read_csv(utils.data_path('covid-vaccine.csv'))
covid.head()
       country iso_code        date  total_vaccinations  people_vaccinated  \
0  Afghanistan      AFG  2021-02-22                 0.0                0.0   
1  Afghanistan      AFG  2021-02-23                 NaN                NaN   
2  Afghanistan      AFG  2021-02-24                 NaN                NaN   
3  Afghanistan      AFG  2021-02-25                 NaN                NaN   
4  Afghanistan      AFG  2021-02-26                 NaN                NaN   

   people_fully_vaccinated  daily_vaccinations_raw  daily_vaccinations  \
0                      NaN                     NaN                 NaN   
1                      NaN                     NaN              1367.0   
2                      NaN                     NaN              1367.0   
3                      NaN                     NaN              1367.0   
4                      NaN                     NaN              1367.0   

   total_vaccinations_per_hundred  people_vaccinated_per_hundred  \
0                             0.0                            0.0   
1                             NaN                            NaN   
2                             NaN                            NaN   
3                             NaN                            NaN   
4                             NaN                            NaN   

   people_fully_vaccinated_per_hundred  daily_vaccinations_per_million  \
0                                  NaN                             NaN   
1                                  NaN                            34.0   
2                                  NaN                            34.0   
3                                  NaN                            34.0   
4                                  NaN                            34.0   

                                                                  vaccines  \
0  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing   
1  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing   
2  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing   
3  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing   
4  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing   

                 source_name            source_website  
0  World Health Organization  https://covid19.who.int/  
1  World Health Organization  https://covid19.who.int/  
2  World Health Organization  https://covid19.who.int/  
3  World Health Organization  https://covid19.who.int/  
4  World Health Organization  https://covid19.who.int/  
covid.shape
(53595, 15)
covid.dtypes
country                                 object
iso_code                                object
date                                    object
total_vaccinations                     float64
people_vaccinated                      float64
people_fully_vaccinated                float64
daily_vaccinations_raw                 float64
daily_vaccinations                     float64
total_vaccinations_per_hundred         float64
people_vaccinated_per_hundred          float64
people_fully_vaccinated_per_hundred    float64
daily_vaccinations_per_million         float64
vaccines                                object
source_name                             object
source_website                          object
dtype: object

We have a mix of text, datetime, categorical & numerical features. Let's explore them individually.

2.1.1. Handling redundant columns

iso_code is simply an abbreviation of country, so we can drop it. source_website is also not useful (from an ML perspective).

drop_features = ['iso_code',
                 'source_website']
covid = covid.drop(drop_features, axis='columns')
covid.shape
(53595, 13)
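
Before dropping iso_code, it may be worth confirming that it really carries no information beyond country. The following is a minimal sketch on a toy frame (the rows are illustrative, not taken from the full dataset) that checks whether the country-to-code mapping is one-to-one; on the real data the same two groupby calls would be applied to covid before the drop.

```python
import pandas as pd

# Toy stand-in for the real dataset; values are illustrative only.
toy = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Albania', 'Albania'],
    'iso_code': ['AFG', 'AFG', 'ALB', 'ALB'],
})

# Number of distinct codes per country and distinct countries per code.
codes_per_country = toy.groupby('country')['iso_code'].nunique()
countries_per_code = toy.groupby('iso_code')['country'].nunique()

# True only when the mapping is one-to-one in both directions.
is_one_to_one = bool((codes_per_country == 1).all()
                     and (countries_per_code == 1).all())
print(is_one_to_one)
```

If the check fails on the real data, iso_code would hold extra information and dropping it would deserve a second look.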

2.1.2. Handling categorical features

country & source_name are categorical features, so let's convert them to the category dtype.

categorical_features = ['country',
                        'source_name']
# Copy so the assignments below modify an independent frame, not a view of covid.
categorical = covid[categorical_features].copy()

for column in categorical.columns:
    categorical[column] = categorical[column].str.strip().astype('category')

categorical.describe(include='all')
        country                source_name
count     53595                      53595
unique      223                         82
top     Denmark  World Health Organization
freq        329                      13132
categorical['country'].cat.categories
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Anguilla',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       ...
       'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Wales',
       'Wallis and Futuna', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=223)
categorical['source_name'].cat.categories
Index(['Africa Centres for Disease Control and Prevention',
       'COVID-19 Malta Public Health Response Team',
       'COVID-19 Vaccine Information Platform',
       'Centers for Disease Control and Prevention',
       'Costa Rican Social Security Fund',
       'Department of Health via ABS-CBN Investigative and Research Group',
       'Directorate General for Health via Data Science for Social Good',
       'Directorate General of Health Services', 'Directorate of Health',
       'Extraordinary commissioner for the Covid-19 emergency',
       'Federal Office of Public Health',
       'Finnish Institute for Health and Welfare', 'Government of Andorra',
       'Government of Aruba', 'Government of Australia via covidlive.com.au',
       'Government of Azerbaijan', 'Government of Curacao',
       'Government of Ecuador via Ecuacovid', 'Government of Gibraltar',
       'Government of Greenland', 'Government of Guernsey',
       'Government of Hong Kong', 'Government of Hungary',
       'Government of India', 'Government of Israel', 'Government of Jersey',
       'Government of Jordan', 'Government of Kazakhstan',
       'Government of Lebanon', 'Government of Luxembourg',
       'Government of Macao', 'Government of Malaysia',
       'Government of Montenegro', 'Government of North Macedonia',
       'Government of Romania via datelazi.ro', 'Government of Saint Helena',
       'Government of Serbia', 'Government of Suriname',
       'Government of Thailand', 'Government of Uzbekistan',
       'Government of Vietnam', 'Government of Zambia',
       'Government of the Faeroe Islands',
       'Government of the Falkland Islands', 'Government of the Netherlands',
       'Government of the United Kingdom', 'Heath Service Executive',
       'Isle of Man Government',
       'Korea Centers for Disease Control and Prevention',
       'Ministerio de Ciencia, Tecnología, Conocimiento e Innovación',
       'Ministerio de Salud via github.com/jmcastagnetto/covid-19-peru-vacunas',
       'Ministry of Health', 'Ministry of Health via ikon.mn',
       'Ministry of Health via vacuna.uy',
       'Ministry of Health's Epidemiology Unit', 'Ministry of Public Health',
       'National Center for Disease Control and Public Health',
       'National Command and Operation Centre', 'National Council',
       'National Emergency Crisis and Disaster Management Authority',
       'National Health Board', 'National Health Commission',
       'National Health Security Agency', 'National Health Service',
       'National Institute of Public Health via covid-19.sledilnik.org',
       'Norwegian Institute of Public Health',
       'Official data from local governments via gogov.ru',
       'Official data from provinces via covid19tracker.ca',
       'Pan American Health Organization', 'Presidency of the Maldives',
       'Prime Minister's Office', 'Public Health Agency of Sweden',
       'Public Health France', 'Public Health Institute',
       'Robert Koch Institut', 'SPC Public Health Division',
       'Saudi Health Council', 'Sciensano', 'Secretary of Health',
       'Statens Serum Institute', 'Taiwan Centers for Disease Control',
       'World Health Organization'],
      dtype='object')

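Some source_name values combine more than one piece of information (for example 'Government of Romania via datelazi.ro'), which makes this feature closer to text than to a clean category. We may want to extract numerical features from it.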
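
One possible way to do this, sketched below, is to split each value on the ' via ' separator that appears in several of the categories listed above. The column names source, via and has_intermediary are hypothetical, and treating ' via ' as the only separator is an assumption about the data rather than something the data docs state.

```python
import pandas as pd

# Three values copied from the category listing above.
source_name = pd.Series([
    'World Health Organization',
    'Government of Romania via datelazi.ro',
    'Ministry of Health via vacuna.uy',
])

# Split once on ' via '; rows without an intermediary get a missing value.
parts = source_name.str.split(' via ', n=1, expand=True)
parts.columns = ['source', 'via']  # hypothetical column names

# Binary indicator: 1 when the data was relayed through an intermediary.
parts['has_intermediary'] = parts['via'].notna().astype(int)
print(parts)
```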

2.1.3. Handling text features

vaccines contains several comma-separated values, so we may want to extract numerical features from it. One option is to create a column for each type of vaccine, with a value of 0 or 1 (giving us several binary features).
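
The binary encoding described above can be sketched with pandas' Series.str.get_dummies. The two strings are copied from the outputs in this notebook; the separator ', ' is assumed to be used consistently throughout the column.

```python
import pandas as pd

# Two vaccine combinations that appear in the notebook output.
vaccines = pd.Series([
    'Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing',
    'Johnson&Johnson, Moderna, Oxford/AstraZeneca, Pfizer/BioNTech',
])

# One 0/1 column per vaccine name, splitting each row on ', '.
vaccine_flags = vaccines.str.get_dummies(sep=', ')
print(vaccine_flags)
```

On the full dataset the resulting frame could be joined back onto covid in place of the original vaccines column.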

2.1.4. Handling date

This column holds timestamps stored as strings, so it should be converted to the datetime dtype.

covid['date'] = pd.to_datetime(covid['date'])
covid['date']
0       2021-02-22
1       2021-02-23
2       2021-02-24
3       2021-02-25
4       2021-02-26
           ...    
53590   2021-10-22
53591   2021-10-23
53592   2021-10-24
53593   2021-10-25
53594   2021-10-26
Name: date, Length: 53595, dtype: datetime64[ns]

2.1.5. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics next.

covid.describe(include='all')
        country                 date  total_vaccinations  people_vaccinated  \
count     53595                53595        2.905600e+04       2.754000e+04   
unique      223                  330                 NaN                NaN   
top     Denmark  2021-08-14 00:00:00                 NaN                NaN   
freq        329                  219                 NaN                NaN   
first       NaN  2020-12-01 00:00:00                 NaN                NaN   
last        NaN  2021-10-26 00:00:00                 NaN                NaN   
mean        NaN                  NaN        2.502017e+07       1.045555e+07   
std         NaN                  NaN        1.387474e+08       4.142794e+07   
min         NaN                  NaN        0.000000e+00       0.000000e+00   
25%         NaN                  NaN        2.735060e+05       2.062085e+05   
50%         NaN                  NaN        1.863784e+06       1.193043e+06   
75%         NaN                  NaN        9.407311e+06       5.452138e+06   
max         NaN                  NaN        2.251339e+09       1.100842e+09   

        people_fully_vaccinated  daily_vaccinations_raw  daily_vaccinations  \
count              2.462300e+04            2.383500e+04        5.333000e+04   
unique                      NaN                     NaN                 NaN   
top                         NaN                     NaN                 NaN   
freq                        NaN                     NaN                 NaN   
first                       NaN                     NaN                 NaN   
last                        NaN                     NaN                 NaN   
mean               6.988481e+06            2.550971e+05        1.301540e+05   
std                2.625649e+07            1.258528e+06        8.344678e+05   
min                1.000000e+00            0.000000e+00        0.000000e+00   
25%                1.085115e+05            5.023000e+03        9.370000e+02   
50%                8.255940e+05            2.486200e+04        7.039000e+03   
75%                4.288808e+06            1.164200e+05        4.178350e+04   
max                1.067621e+09            2.474100e+07        2.242429e+07   

        total_vaccinations_per_hundred  people_vaccinated_per_hundred  \
count                     29056.000000                   27540.000000   
unique                             NaN                            NaN   
top                                NaN                            NaN   
freq                               NaN                            NaN   
first                              NaN                            NaN   
last                               NaN                            NaN   
mean                         51.107199                      30.162527   
std                          49.470788                      25.993621   
min                           0.000000                       0.000000   
25%                           6.890000                       5.390000   
50%                          34.510000                      24.150000   
75%                          87.440000                      53.040000   
max                         259.330000                     120.350000   

        people_fully_vaccinated_per_hundred  daily_vaccinations_per_million  \
count                          24623.000000                    53330.000000   
unique                                  NaN                             NaN   
top                                     NaN                             NaN   
freq                                    NaN                             NaN   
first                                   NaN                             NaN   
last                                    NaN                             NaN   
mean                              23.426875                     3522.202756   
std                               23.591298                     4290.717192   
min                                0.000000                        0.000000   
25%                                2.920000                      588.000000   
50%                               14.440000                     2243.000000   
75%                               40.855000                     5185.750000   
max                              118.140000                   117497.000000   

                                                             vaccines  \
count                                                           53595   
unique                                                             73   
top     Johnson&Johnson, Moderna, Oxford/AstraZeneca, Pfizer/BioNTech   
freq                                                             7102   
first                                                             NaN   
last                                                              NaN   
mean                                                              NaN   
std                                                               NaN   
min                                                               NaN   
25%                                                               NaN   
50%                                                               NaN   
75%                                                               NaN   
max                                                               NaN   

                      source_name  
count                       53595  
unique                         82  
top     World Health Organization  
freq                        13132  
first                         NaN  
last                          NaN  
mean                          NaN  
std                           NaN  
min                           NaN  
25%                           NaN  
50%                           NaN  
75%                           NaN  
max                           NaN  

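Going by the counts above, roughly half of the values in most numerical columns are missing (only daily_vaccinations and daily_vaccinations_per_million are nearly complete). Let's investigate this a bit more.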

numerical_features = ['total_vaccinations',
                      'people_vaccinated',
                      'people_fully_vaccinated',
                      'daily_vaccinations_raw',
                      'daily_vaccinations',
                      'total_vaccinations_per_hundred',
                      'people_vaccinated_per_hundred',
                      'people_fully_vaccinated_per_hundred',
                      'daily_vaccinations_per_million']
numerical = covid[numerical_features]
numerical
       total_vaccinations  people_vaccinated  people_fully_vaccinated  \
0                     0.0                0.0                      NaN   
1                     NaN                NaN                      NaN   
2                     NaN                NaN                      NaN   
3                     NaN                NaN                      NaN   
4                     NaN                NaN                      NaN   
...                   ...                ...                      ...   
53590           5814790.0          3271638.0                2543152.0   
53591           5826876.0          3276746.0                2550130.0   
53592           5836363.0          3281618.0                2554745.0   
53593           5848934.0          3287996.0                2560938.0   
53594           5866629.0          3294687.0                2571942.0   

       daily_vaccinations_raw  daily_vaccinations  \
0                         NaN                 NaN   
1                         NaN              1367.0   
2                         NaN              1367.0   
3                         NaN              1367.0   
4                         NaN              1367.0   
...                       ...                 ...   
53590                 17507.0             20007.0   
53591                 12086.0             19845.0   
53592                  9487.0             19241.0   
53593                 12571.0             15135.0   
53594                 17695.0             15697.0   

       total_vaccinations_per_hundred  people_vaccinated_per_hundred  \
0                                0.00                           0.00   
1                                 NaN                            NaN   
2                                 NaN                            NaN   
3                                 NaN                            NaN   
4                                 NaN                            NaN   
...                               ...                            ...   
53590                           38.53                          21.68   
53591                           38.61                          21.71   
53592                           38.67                          21.74   
53593                           38.75                          21.79   
53594                           38.87                          21.83   

       people_fully_vaccinated_per_hundred  daily_vaccinations_per_million  
0                                      NaN                             NaN  
1                                      NaN                            34.0  
2                                      NaN                            34.0  
3                                      NaN                            34.0  
4                                      NaN                            34.0  
...                                    ...                             ...  
53590                                16.85                          1326.0  
53591                                16.90                          1315.0  
53592                                16.93                          1275.0  
53593                                16.97                          1003.0  
53594                                17.04                          1040.0  

[53595 rows x 9 columns]
numerical.isna().sum()
total_vaccinations                     24539
people_vaccinated                      26055
people_fully_vaccinated                28972
daily_vaccinations_raw                 29760
daily_vaccinations                       265
total_vaccinations_per_hundred         24539
people_vaccinated_per_hundred          26055
people_fully_vaccinated_per_hundred    28972
daily_vaccinations_per_million           265
dtype: int64
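
Raw counts are easier to compare as percentages. A small sketch on toy data (the values are illustrative): the mean of the boolean isna() mask gives the fraction missing per column, and the same expression would apply to the numerical frame above.

```python
import numpy as np
import pandas as pd

# Toy frame with known gaps; values are illustrative only.
toy = pd.DataFrame({
    'total_vaccinations': [0.0, np.nan, np.nan, 8200.0],
    'daily_vaccinations': [np.nan, 1367.0, 1367.0, 1367.0],
})

# isna() gives booleans; their mean is the fraction missing per column.
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct)
```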

Let's investigate total_vaccinations individually.

total_vaccinations = covid[['country', 'date', 'total_vaccinations']]
total_vaccinations.head(50)
        country       date  total_vaccinations
0   Afghanistan 2021-02-22                 0.0
1   Afghanistan 2021-02-23                 NaN
2   Afghanistan 2021-02-24                 NaN
3   Afghanistan 2021-02-25                 NaN
4   Afghanistan 2021-02-26                 NaN
5   Afghanistan 2021-02-27                 NaN
6   Afghanistan 2021-02-28              8200.0
7   Afghanistan 2021-03-01                 NaN
8   Afghanistan 2021-03-02                 NaN
9   Afghanistan 2021-03-03                 NaN
10  Afghanistan 2021-03-04                 NaN
11  Afghanistan 2021-03-05                 NaN
12  Afghanistan 2021-03-06                 NaN
13  Afghanistan 2021-03-07                 NaN
14  Afghanistan 2021-03-08                 NaN
15  Afghanistan 2021-03-09                 NaN
16  Afghanistan 2021-03-10                 NaN
17  Afghanistan 2021-03-11                 NaN
18  Afghanistan 2021-03-12                 NaN
19  Afghanistan 2021-03-13                 NaN
20  Afghanistan 2021-03-14                 NaN
21  Afghanistan 2021-03-15                 NaN
22  Afghanistan 2021-03-16             54000.0
23  Afghanistan 2021-03-17                 NaN
24  Afghanistan 2021-03-18                 NaN
25  Afghanistan 2021-03-19                 NaN
26  Afghanistan 2021-03-20                 NaN
27  Afghanistan 2021-03-21                 NaN
28  Afghanistan 2021-03-22                 NaN
29  Afghanistan 2021-03-23                 NaN
30  Afghanistan 2021-03-24                 NaN
31  Afghanistan 2021-03-25                 NaN
32  Afghanistan 2021-03-26                 NaN
33  Afghanistan 2021-03-27                 NaN
34  Afghanistan 2021-03-28                 NaN
35  Afghanistan 2021-03-29                 NaN
36  Afghanistan 2021-03-30                 NaN
37  Afghanistan 2021-03-31                 NaN
38  Afghanistan 2021-04-01                 NaN
39  Afghanistan 2021-04-02                 NaN
40  Afghanistan 2021-04-03                 NaN
41  Afghanistan 2021-04-04                 NaN
42  Afghanistan 2021-04-05                 NaN
43  Afghanistan 2021-04-06                 NaN
44  Afghanistan 2021-04-07            120000.0
45  Afghanistan 2021-04-08                 NaN
46  Afghanistan 2021-04-09                 NaN
47  Afghanistan 2021-04-10                 NaN
48  Afghanistan 2021-04-11                 NaN
49  Afghanistan 2021-04-12                 NaN
total_vaccinations[total_vaccinations['total_vaccinations'].notna()]
           country       date  total_vaccinations
0      Afghanistan 2021-02-22                 0.0
6      Afghanistan 2021-02-28              8200.0
22     Afghanistan 2021-03-16             54000.0
44     Afghanistan 2021-04-07            120000.0
59     Afghanistan 2021-04-22            240000.0
...            ...        ...                 ...
53590     Zimbabwe 2021-10-22           5814790.0
53591     Zimbabwe 2021-10-23           5826876.0
53592     Zimbabwe 2021-10-24           5836363.0
53593     Zimbabwe 2021-10-25           5848934.0
53594     Zimbabwe 2021-10-26           5866629.0

[29056 rows x 3 columns]

It looks like the total_vaccinations column is populated only periodically, with the cumulative sum up to that date. Since a lot of values are missing we would have to impute them, which adds to technical debt. It's unclear what a good imputation strategy would be in this case.
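
As one candidate only (not a recommendation), a cumulative series can be forward-filled within each country, carrying the last reported total forward until the next report. The sketch below uses toy data and a hypothetical column name total_ffill.

```python
import numpy as np
import pandas as pd

# Toy data: two countries with sparse cumulative reports.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'total_vaccinations': [0.0, np.nan, np.nan, 8200.0, np.nan, 100.0, np.nan],
})

# Forward-fill within each country so values never leak across countries.
# Rows before a country's first report stay missing.
toy['total_ffill'] = toy.groupby('country')['total_vaccinations'].ffill()
print(toy)
```

Forward-filling keeps the series non-decreasing but flattens the growth between reports; interpolating between reports is an alternative with the opposite trade-off, since it invents intermediate values.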

Let's check for duplicate rows (even though an exact-match check is of limited use when so many values are missing).

covid[covid.duplicated(keep=False)].shape
(0, 13)
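
A check on the natural key may be more informative than a full-row comparison. The sketch below assumes that each (country, date) pair should appear at most once, which is an assumption about the dataset rather than something stated in the data docs; the toy rows are illustrative.

```python
import pandas as pd

# Toy data: country B has two rows for the same date.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2021-02-22', '2021-02-23',
                            '2021-02-22', '2021-02-22']),
    'total_vaccinations': [0.0, None, 10.0, 12.0],
})

# Flag every row whose (country, date) key occurs more than once.
key_dupes = toy[toy.duplicated(subset=['country', 'date'], keep=False)]
print(key_dupes.shape)
```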

2.1.6. Correlations

Let's look at the correlation between the numerical features next.

name = 'heatmap@covid-vaccine--corr.png'
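# Restrict to the numerical columns; recent pandas versions raise an error
# when corr() is called on a frame that still contains non-numeric columns.
corr = covid[numerical_features].corr()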
plotter.corr(corr, name)
name

heatmap@covid-vaccine--corr.png

Some of the features show high positive correlation.
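
To make this statement concrete, the highly correlated pairs can be listed directly from the correlation matrix. The sketch below uses toy data and an arbitrary threshold of 0.9 (an assumption, not a value taken from the analysis).

```python
import numpy as np
import pandas as pd

# Toy data: b is a scaled copy of a, c is unrelated.
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0],
    'b': [2.0, 4.0, 6.0, 8.0],
    'c': [1.0, -1.0, 1.0, -1.0],
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Long format: one row per (feature, feature) pair, then filter.
pairs = upper.stack()
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```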

Date: 2021-11-03 Wed 00:00

Created: 2021-11-03 Wed 17:35