Covid Vaccine
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the covid-vaccine dataset. This dataset is part of the covid world progress dataset available on Kaggle (specifically, we analyse the country_vaccinations.csv file, which we have renamed here for simplicity).
We start by reading the data docs, which provide useful information on the features of the dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
covid = pd.read_csv(utils.data_path('covid-vaccine.csv'))
covid.head()
       country iso_code        date  total_vaccinations  people_vaccinated  \
0  Afghanistan      AFG  2021-02-22                 0.0                0.0
1  Afghanistan      AFG  2021-02-23                 NaN                NaN
2  Afghanistan      AFG  2021-02-24                 NaN                NaN
3  Afghanistan      AFG  2021-02-25                 NaN                NaN
4  Afghanistan      AFG  2021-02-26                 NaN                NaN

   people_fully_vaccinated  daily_vaccinations_raw  daily_vaccinations  \
0                      NaN                     NaN                 NaN
1                      NaN                     NaN              1367.0
2                      NaN                     NaN              1367.0
3                      NaN                     NaN              1367.0
4                      NaN                     NaN              1367.0

   total_vaccinations_per_hundred  people_vaccinated_per_hundred  \
0                             0.0                            0.0
1                             NaN                            NaN
2                             NaN                            NaN
3                             NaN                            NaN
4                             NaN                            NaN

   people_fully_vaccinated_per_hundred  daily_vaccinations_per_million  \
0                                  NaN                             NaN
1                                  NaN                            34.0
2                                  NaN                            34.0
3                                  NaN                            34.0
4                                  NaN                            34.0

                                                                  vaccines  \
0  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing
1  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing
2  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing
3  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing
4  Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing

                 source_name            source_website
0  World Health Organization  https://covid19.who.int/
1  World Health Organization  https://covid19.who.int/
2  World Health Organization  https://covid19.who.int/
3  World Health Organization  https://covid19.who.int/
4  World Health Organization  https://covid19.who.int/
covid.shape
(53595, 15)
covid.dtypes
country                                 object
iso_code                                object
date                                    object
total_vaccinations                     float64
people_vaccinated                      float64
people_fully_vaccinated                float64
daily_vaccinations_raw                 float64
daily_vaccinations                     float64
total_vaccinations_per_hundred         float64
people_vaccinated_per_hundred          float64
people_fully_vaccinated_per_hundred    float64
daily_vaccinations_per_million         float64
vaccines                                object
source_name                             object
source_website                          object
dtype: object
We have a mix of text, datetime, categorical & numerical features. Let's explore them individually.
2.1.1. Handling redundant columns
iso_code is simply an abbreviation of country, so we can drop it. source_website is also not useful (from an ML perspective).
drop_features = ['iso_code', 'source_website']
covid = covid.drop(drop_features, axis='columns')
covid.shape
(53595, 13)
2.1.2. Handling categorical features
country & source_name are categorical features, so let's convert them to category dtype.
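categorical_features = ['country', 'source_name']
# work on a copy so the column assignments below do not touch a view of covid
categorical = covid[categorical_features].copy()
for column in categorical.columns:
    # strip surrounding whitespace before converting to category dtype
    categorical[column] = categorical[column].str.strip().astype('category')
categorical.describe(include='all')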
        country                source_name
count     53595                      53595
unique      223                         82
top     Denmark  World Health Organization
freq        329                      13132
categorical['country'].cat.categories
Index(['Afghanistan', 'Albania', 'Algeria', 'Andorra', 'Angola', 'Anguilla',
       'Antigua and Barbuda', 'Argentina', 'Armenia', 'Aruba',
       ...
       'Uruguay', 'Uzbekistan', 'Vanuatu', 'Venezuela', 'Vietnam', 'Wales',
       'Wallis and Futuna', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', length=223)
categorical['source_name'].cat.categories
Index(['Africa Centres for Disease Control and Prevention',
       'COVID-19 Malta Public Health Response Team',
       'COVID-19 Vaccine Information Platform',
       'Centers for Disease Control and Prevention',
       'Costa Rican Social Security Fund',
       'Department of Health via ABS-CBN Investigative and Research Group',
       'Directorate General for Health via Data Science for Social Good',
       'Directorate General of Health Services', 'Directorate of Health',
       'Extraordinary commissioner for the Covid-19 emergency',
       'Federal Office of Public Health',
       'Finnish Institute for Health and Welfare', 'Government of Andorra',
       'Government of Aruba', 'Government of Australia via covidlive.com.au',
       'Government of Azerbaijan', 'Government of Curacao',
       'Government of Ecuador via Ecuacovid', 'Government of Gibraltar',
       'Government of Greenland', 'Government of Guernsey',
       'Government of Hong Kong', 'Government of Hungary',
       'Government of India', 'Government of Israel', 'Government of Jersey',
       'Government of Jordan', 'Government of Kazakhstan',
       'Government of Lebanon', 'Government of Luxembourg',
       'Government of Macao', 'Government of Malaysia',
       'Government of Montenegro', 'Government of North Macedonia',
       'Government of Romania via datelazi.ro', 'Government of Saint Helena',
       'Government of Serbia', 'Government of Suriname',
       'Government of Thailand', 'Government of Uzbekistan',
       'Government of Vietnam', 'Government of Zambia',
       'Government of the Faeroe Islands',
       'Government of the Falkland Islands', 'Government of the Netherlands',
       'Government of the United Kingdom', 'Heath Service Executive',
       'Isle of Man Government',
       'Korea Centers for Disease Control and Prevention',
       'Ministerio de Ciencia, Tecnología, Conocimiento e Innovación',
       'Ministerio de Salud via github.com/jmcastagnetto/covid-19-peru-vacunas',
       'Ministry of Health', 'Ministry of Health via ikon.mn',
       'Ministry of Health via vacuna.uy',
       'Ministry of Health's Epidemiology Unit', 'Ministry of Public Health',
       'National Center for Disease Control and Public Health',
       'National Command and Operation Centre', 'National Council',
       'National Emergency Crisis and Disaster Management Authority',
       'National Health Board', 'National Health Commission',
       'National Health Security Agency', 'National Health Service',
       'National Institute of Public Health via covid-19.sledilnik.org',
       'Norwegian Institute of Public Health',
       'Official data from local governments via gogov.ru',
       'Official data from provinces via covid19tracker.ca',
       'Pan American Health Organization', 'Presidency of the Maldives',
       'Prime Minister's Office', 'Public Health Agency of Sweden',
       'Public Health France', 'Public Health Institute',
       'Robert Koch Institut', 'SPC Public Health Division',
       'Saudi Health Council', 'Sciensano', 'Secretary of Health',
       'Statens Serum Institute', 'Taiwan Centers for Disease Control',
       'World Health Organization'],
      dtype='object')
source_name may contain more than one value (several entries have the form "X via Y"), which makes this feature closer to text than to a clean categorical. We may want to extract numerical features from it.
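As a rough illustration of what such an extraction could look like, the sketch below splits a few source_name values (copied from the category list) on the " via " separator. The column names publisher, intermediary and has_intermediary are hypothetical, and treating " via " as the only separator is an assumption that has not been checked against the full dataset.

```python
import pandas as pd

# A few source_name values taken from the category list; toy data only.
toy = pd.Series(
    [
        'World Health Organization',
        'Ministry of Health via vacuna.uy',
        'Government of Romania via datelazi.ro',
    ],
    name='source_name',
)

# Split at most once on ' via ' (assumed separator) into two columns.
parts = toy.str.split(' via ', n=1, expand=True)
parts.columns = ['publisher', 'intermediary']  # hypothetical names

# Binary feature: 1 if the data was relayed through an intermediary.
parts['has_intermediary'] = parts['intermediary'].notna().astype(int)
print(parts)
```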
2.1.3. Handling text features
vaccines contains several comma-separated values, so we may want to extract numerical features from it. An example would be to create a column for each type of vaccine, with a value of 0 or 1 (thus we can extract several binary features from here).
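A minimal sketch of the one-column-per-vaccine idea, using pandas' Series.str.get_dummies on toy data. The two strings are copied from the outputs in this notebook; the ', ' separator is assumed to hold for every row of the real column.

```python
import pandas as pd

# Toy stand-in for the vaccines column.
toy = pd.DataFrame({
    'vaccines': [
        'Johnson&Johnson, Oxford/AstraZeneca, Pfizer/BioNTech, Sinopharm/Beijing',
        'Johnson&Johnson, Moderna, Oxford/AstraZeneca, Pfizer/BioNTech',
    ]
})

# Split on ', ' and build one 0/1 indicator column per distinct vaccine.
vaccine_flags = toy['vaccines'].str.get_dummies(sep=', ')
print(vaccine_flags)
```

The resulting frame could be joined back onto the original one with pd.concat(..., axis='columns') if the indicators turn out to be useful.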
2.1.4. Handling date
The date column is a timestamp stored as a string (object dtype), so it should be converted to datetime dtype.
covid['date'] = pd.to_datetime(covid['date'])
covid['date']
0       2021-02-22
1       2021-02-23
2       2021-02-24
3       2021-02-25
4       2021-02-26
           ...
53590   2021-10-22
53591   2021-10-23
53592   2021-10-24
53593   2021-10-25
53594   2021-10-26
Name: date, Length: 53595, dtype: datetime64[ns]
2.1.5. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
covid.describe(include='all')
        country                 date  total_vaccinations  people_vaccinated  \
count     53595                53595        2.905600e+04       2.754000e+04
unique      223                  330                 NaN                NaN
top     Denmark  2021-08-14 00:00:00                 NaN                NaN
freq        329                  219                 NaN                NaN
first       NaN  2020-12-01 00:00:00                 NaN                NaN
last        NaN  2021-10-26 00:00:00                 NaN                NaN
mean        NaN                  NaN        2.502017e+07       1.045555e+07
std         NaN                  NaN        1.387474e+08       4.142794e+07
min         NaN                  NaN        0.000000e+00       0.000000e+00
25%         NaN                  NaN        2.735060e+05       2.062085e+05
50%         NaN                  NaN        1.863784e+06       1.193043e+06
75%         NaN                  NaN        9.407311e+06       5.452138e+06
max         NaN                  NaN        2.251339e+09       1.100842e+09

        people_fully_vaccinated  daily_vaccinations_raw  daily_vaccinations  \
count              2.462300e+04            2.383500e+04        5.333000e+04
unique                      NaN                     NaN                 NaN
top                         NaN                     NaN                 NaN
freq                        NaN                     NaN                 NaN
first                       NaN                     NaN                 NaN
last                        NaN                     NaN                 NaN
mean               6.988481e+06            2.550971e+05        1.301540e+05
std                2.625649e+07            1.258528e+06        8.344678e+05
min                1.000000e+00            0.000000e+00        0.000000e+00
25%                1.085115e+05            5.023000e+03        9.370000e+02
50%                8.255940e+05            2.486200e+04        7.039000e+03
75%                4.288808e+06            1.164200e+05        4.178350e+04
max                1.067621e+09            2.474100e+07        2.242429e+07

        total_vaccinations_per_hundred  people_vaccinated_per_hundred  \
count                     29056.000000                   27540.000000
unique                             NaN                            NaN
top                                NaN                            NaN
freq                               NaN                            NaN
first                              NaN                            NaN
last                               NaN                            NaN
mean                         51.107199                      30.162527
std                          49.470788                      25.993621
min                           0.000000                       0.000000
25%                           6.890000                       5.390000
50%                          34.510000                      24.150000
75%                          87.440000                      53.040000
max                         259.330000                     120.350000

        people_fully_vaccinated_per_hundred  daily_vaccinations_per_million  \
count                          24623.000000                    53330.000000
unique                                  NaN                             NaN
top                                     NaN                             NaN
freq                                    NaN                             NaN
first                                   NaN                             NaN
last                                    NaN                             NaN
mean                              23.426875                     3522.202756
std                               23.591298                     4290.717192
min                                0.000000                        0.000000
25%                                2.920000                      588.000000
50%                               14.440000                     2243.000000
75%                               40.855000                     5185.750000
max                              118.140000                   117497.000000

                                                             vaccines  \
count                                                           53595
unique                                                             73
top     Johnson&Johnson, Moderna, Oxford/AstraZeneca, Pfizer/BioNTech
freq                                                             7102
first                                                             NaN
last                                                              NaN
mean                                                              NaN
std                                                               NaN
min                                                               NaN
25%                                                               NaN
50%                                                               NaN
75%                                                               NaN
max                                                               NaN

                      source_name
count                       53595
unique                         82
top     World Health Organization
freq                        13132
first                         NaN
last                          NaN
mean                          NaN
std                           NaN
min                           NaN
25%                           NaN
50%                           NaN
75%                           NaN
max                           NaN
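Roughly half of the values are missing in most of the numerical features (daily_vaccinations and daily_vaccinations_per_million are the exceptions, with 53330 of 53595 values present). Let's investigate this a bit more.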
numerical_features = ['total_vaccinations',
                      'people_vaccinated',
                      'people_fully_vaccinated',
                      'daily_vaccinations_raw',
                      'daily_vaccinations',
                      'total_vaccinations_per_hundred',
                      'people_vaccinated_per_hundred',
                      'people_fully_vaccinated_per_hundred',
                      'daily_vaccinations_per_million']
numerical = covid[numerical_features]
numerical
       total_vaccinations  people_vaccinated  people_fully_vaccinated  \
0                     0.0                0.0                      NaN
1                     NaN                NaN                      NaN
2                     NaN                NaN                      NaN
3                     NaN                NaN                      NaN
4                     NaN                NaN                      NaN
...                   ...                ...                      ...
53590           5814790.0          3271638.0                2543152.0
53591           5826876.0          3276746.0                2550130.0
53592           5836363.0          3281618.0                2554745.0
53593           5848934.0          3287996.0                2560938.0
53594           5866629.0          3294687.0                2571942.0

       daily_vaccinations_raw  daily_vaccinations  \
0                         NaN                 NaN
1                         NaN              1367.0
2                         NaN              1367.0
3                         NaN              1367.0
4                         NaN              1367.0
...                       ...                 ...
53590                 17507.0             20007.0
53591                 12086.0             19845.0
53592                  9487.0             19241.0
53593                 12571.0             15135.0
53594                 17695.0             15697.0

       total_vaccinations_per_hundred  people_vaccinated_per_hundred  \
0                                0.00                           0.00
1                                 NaN                            NaN
2                                 NaN                            NaN
3                                 NaN                            NaN
4                                 NaN                            NaN
...                               ...                            ...
53590                           38.53                          21.68
53591                           38.61                          21.71
53592                           38.67                          21.74
53593                           38.75                          21.79
53594                           38.87                          21.83

       people_fully_vaccinated_per_hundred  daily_vaccinations_per_million
0                                      NaN                             NaN
1                                      NaN                            34.0
2                                      NaN                            34.0
3                                      NaN                            34.0
4                                      NaN                            34.0
...                                    ...                             ...
53590                                16.85                          1326.0
53591                                16.90                          1315.0
53592                                16.93                          1275.0
53593                                16.97                          1003.0
53594                                17.04                          1040.0

[53595 rows x 9 columns]
numerical.isna().sum()
total_vaccinations                     24539
people_vaccinated                      26055
people_fully_vaccinated                28972
daily_vaccinations_raw                 29760
daily_vaccinations                       265
total_vaccinations_per_hundred         24539
people_vaccinated_per_hundred          26055
people_fully_vaccinated_per_hundred    28972
daily_vaccinations_per_million           265
dtype: int64
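Raw counts are easier to compare as fractions of the number of rows. The sketch below shows one way to get them with isna().mean(), on a small toy frame rather than the real data, so the printed numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the real columns.
toy = pd.DataFrame({
    'total_vaccinations': [0.0, np.nan, np.nan, 8200.0],
    'daily_vaccinations': [np.nan, 1367.0, 1367.0, 1367.0],
})

# The mean of a boolean mask is the fraction of True values,
# i.e. the share of missing entries per column.
missing_fraction = toy.isna().mean().sort_values(ascending=False)
print(missing_fraction)
```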
Let's investigate total_vaccinations individually.
total_vaccinations = covid[['country', 'date', 'total_vaccinations']]
total_vaccinations.head(50)
        country       date  total_vaccinations
0   Afghanistan 2021-02-22                 0.0
1   Afghanistan 2021-02-23                 NaN
2   Afghanistan 2021-02-24                 NaN
3   Afghanistan 2021-02-25                 NaN
4   Afghanistan 2021-02-26                 NaN
5   Afghanistan 2021-02-27                 NaN
6   Afghanistan 2021-02-28              8200.0
7   Afghanistan 2021-03-01                 NaN
8   Afghanistan 2021-03-02                 NaN
9   Afghanistan 2021-03-03                 NaN
10  Afghanistan 2021-03-04                 NaN
11  Afghanistan 2021-03-05                 NaN
12  Afghanistan 2021-03-06                 NaN
13  Afghanistan 2021-03-07                 NaN
14  Afghanistan 2021-03-08                 NaN
15  Afghanistan 2021-03-09                 NaN
16  Afghanistan 2021-03-10                 NaN
17  Afghanistan 2021-03-11                 NaN
18  Afghanistan 2021-03-12                 NaN
19  Afghanistan 2021-03-13                 NaN
20  Afghanistan 2021-03-14                 NaN
21  Afghanistan 2021-03-15                 NaN
22  Afghanistan 2021-03-16             54000.0
23  Afghanistan 2021-03-17                 NaN
24  Afghanistan 2021-03-18                 NaN
25  Afghanistan 2021-03-19                 NaN
26  Afghanistan 2021-03-20                 NaN
27  Afghanistan 2021-03-21                 NaN
28  Afghanistan 2021-03-22                 NaN
29  Afghanistan 2021-03-23                 NaN
30  Afghanistan 2021-03-24                 NaN
31  Afghanistan 2021-03-25                 NaN
32  Afghanistan 2021-03-26                 NaN
33  Afghanistan 2021-03-27                 NaN
34  Afghanistan 2021-03-28                 NaN
35  Afghanistan 2021-03-29                 NaN
36  Afghanistan 2021-03-30                 NaN
37  Afghanistan 2021-03-31                 NaN
38  Afghanistan 2021-04-01                 NaN
39  Afghanistan 2021-04-02                 NaN
40  Afghanistan 2021-04-03                 NaN
41  Afghanistan 2021-04-04                 NaN
42  Afghanistan 2021-04-05                 NaN
43  Afghanistan 2021-04-06                 NaN
44  Afghanistan 2021-04-07            120000.0
45  Afghanistan 2021-04-08                 NaN
46  Afghanistan 2021-04-09                 NaN
47  Afghanistan 2021-04-10                 NaN
48  Afghanistan 2021-04-11                 NaN
49  Afghanistan 2021-04-12                 NaN
total_vaccinations[total_vaccinations['total_vaccinations'].notna()]
           country       date  total_vaccinations
0      Afghanistan 2021-02-22                 0.0
6      Afghanistan 2021-02-28              8200.0
22     Afghanistan 2021-03-16             54000.0
44     Afghanistan 2021-04-07            120000.0
59     Afghanistan 2021-04-22            240000.0
...            ...        ...                 ...
53590     Zimbabwe 2021-10-22           5814790.0
53591     Zimbabwe 2021-10-23           5826876.0
53592     Zimbabwe 2021-10-24           5836363.0
53593     Zimbabwe 2021-10-25           5848934.0
53594     Zimbabwe 2021-10-26           5866629.0

[29056 rows x 3 columns]
It looks like the total_vaccinations column is populated periodically with the cumulative sum. Since a lot of values are missing, we would have to impute them, which adds to technical debt. It's unclear what a good imputation strategy would be in this case.
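To make the trade-off concrete, here is a small sketch of two candidate strategies for a cumulative column, applied per country on toy data: carrying the last reported total forward, and linearly interpolating between reported totals. Neither is claimed to be the right choice for this dataset; the toy values and the column names total_ffill and total_linear are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy cumulative series for two hypothetical countries.
toy = pd.DataFrame({
    'country': ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'date': pd.to_datetime([
        '2021-02-22', '2021-02-23', '2021-02-24', '2021-02-25',
        '2021-02-22', '2021-02-23', '2021-02-24',
    ]),
    'total_vaccinations': [0.0, np.nan, np.nan, 300.0, np.nan, 50.0, np.nan],
})

# Imputation must not leak values across countries, so sort and group first.
toy = toy.sort_values(['country', 'date'])
grouped = toy.groupby('country')['total_vaccinations']

# Option 1: carry the last reported cumulative total forward.
toy['total_ffill'] = grouped.ffill()

# Option 2: linear interpolation, only between two reported totals
# (limit_area='inside' leaves leading and trailing gaps as NaN).
toy['total_linear'] = grouped.transform(
    lambda s: s.interpolate(method='linear', limit_area='inside')
)
print(toy)
```

Forward fill understates progress between reports, while interpolation assumes a steady daily rate; both keep the cumulative series non-decreasing as long as the reported totals are.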
Let's check for duplicates (even though it doesn't make sense since so many values are missing).
covid[covid.duplicated(keep=False)].shape
(0, 13)
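A full-row comparison can miss rows that share a key but differ in some measurement. As a complementary check, the sketch below looks for duplicates on the (country, date) pair only. Treating that pair as the key of the dataset is an assumption, and the toy frame is made up.

```python
import pandas as pd

# Toy frame: the first two rows share country and date but differ in value.
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B'],
    'date': pd.to_datetime(['2021-02-22', '2021-02-22', '2021-02-22', '2021-02-23']),
    'total_vaccinations': [0.0, 10.0, 5.0, 6.0],
})

# Full-row duplicates: none here, because the values differ.
full_dupes = toy[toy.duplicated(keep=False)]

# Key-level duplicates: rows repeating the same (country, date) pair.
key_dupes = toy[toy.duplicated(subset=['country', 'date'], keep=False)]

print(full_dupes.shape, key_dupes.shape)
```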
2.1.6. Correlations
Let's look at the correlation between the numerical features next.
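name = 'heatmap@covid-vaccine--corr.png'
# restrict to the numerical columns; recent pandas versions raise on non-numeric columns
corr = covid[numerical_features].corr()
plotter.corr(corr, name)
name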
Some of the features show high positive correlation.
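To name the strongly correlated pairs rather than read them off the heatmap, one option is to keep the upper triangle of the correlation matrix and filter by a threshold. The sketch below does this on toy data; the 0.9 cut-off is an arbitrary choice for illustration, and the toy values are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy numerical frame; two columns move together, one does not.
toy = pd.DataFrame({
    'total_vaccinations': [0.0, 10.0, 20.0, 30.0, 40.0],
    'people_vaccinated': [0.0, 6.0, 11.0, 17.0, 22.0],
    'daily_vaccinations': [5.0, 1.0, 4.0, 2.0, 3.0],
})
corr = toy.corr()

# Keep each pair once: mask everything except the strict upper triangle.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

threshold = 0.9  # arbitrary cut-off, chosen only for this example
high = pairs[pairs.abs() > threshold].sort_values(ascending=False)
print(high)
```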