Netflix
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils
2. Analysis
In this section we analyse the netflix dataset. We start by reading
the accompanying data docs, which do not give any information about
the dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
netflix = pd.read_csv('../data/data/netflix.csv')
netflix.head()
   show_id     type                  title         director  \
0       s1    Movie   Dick Johnson Is Dead  Kirsten Johnson
1       s2  TV Show          Blood & Water              NaN
2       s3  TV Show              Ganglands  Julien Leclercq
3       s4  TV Show  Jailbirds New Orleans              NaN
4       s5  TV Show           Kota Factory              NaN

   cast  \
0  NaN
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera
3  NaN
4  Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar

         country          date_added  release_year  rating   duration  \
0  United States  September 25, 2021          2020   PG-13     90 min
1   South Africa  September 24, 2021          2021   TV-MA  2 Seasons
2            NaN  September 24, 2021          2021   TV-MA   1 Season
3            NaN  September 24, 2021          2021   TV-MA   1 Season
4          India  September 24, 2021          2021   TV-MA  2 Seasons

   listed_in  \
0  Documentaries
1  International TV Shows, TV Dramas, TV Mysteries
2  Crime TV Shows, International TV Shows, TV Action & Adventure
3  Docuseries, Reality TV
4  International TV Shows, Romantic TV Shows, TV Comedies

   description
0  As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.
1  After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.
2  To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.
3  Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.
4  In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.
We have an interesting mix of features: type, country and rating are
clearly categorical. We encounter a new type of numerical feature in
our analysis: date_added & release_year are timestamps, which require
special attention.
duration is a tricky one: it contains numerical data, but represented
in a human-friendly format, so we have to extract the absolute
quantity by some form of transformation.
director is also interesting. My intuition tells me it is a
categorical feature, as it can hold distinct values but has no
hierarchy. We can also treat it as text.
listed_in is also categorical, but each entry contains several values
and we need to think of a way to extract this information such that
it's useful for an ML model. cast presents the same dilemma: it's
categorical but with several values.
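One common way to make such multi-valued categorical fields usable by a model is a multi-hot (indicator) encoding: one 0/1 column per distinct label. The following is a minimal plain-Python sketch of that idea, not part of the original analysis. The function name multi_hot is hypothetical, the sample strings are copied from the listed_in column shown above, and it assumes every value is a non-missing, comma-separated string.

```python
def multi_hot(values, sep=","):
    """Turn comma-separated label strings into one 0/1 row per input string."""
    # Split each string into a set of stripped, non-empty labels.
    label_sets = [
        {part.strip() for part in value.split(sep) if part.strip()}
        for value in values
    ]
    # A sorted vocabulary keeps the column order deterministic.
    vocabulary = sorted(set().union(*label_sets))
    rows = [[int(label in labels) for label in vocabulary] for labels in label_sets]
    return vocabulary, rows


# Sample values copied from the listed_in column shown above.
sample = [
    "Documentaries",
    "International TV Shows, TV Dramas, TV Mysteries",
    "Docuseries, Reality TV",
]
vocabulary, rows = multi_hot(sample)
print(vocabulary)
for value, row in zip(sample, rows):
    print(row, value)
```

On the DataFrame itself, pandas offers Series.str.get_dummies(sep=', '), which builds a comparable indicator table directly; the hand-written version is only meant to make the transformation explicit.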
description & title are purely text. Although there are several
interesting things we can do with textual data, this is beyond the
scope of this project and thus they are ignored in this analysis.
netflix.shape
(8807, 12)
netflix.columns
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description'], dtype='object')
netflix.dtypes
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
Let's drop the columns we don't need and correct the dtypes before further analysis.
netflix = netflix.drop(['show_id', 'title', 'description'], axis='columns')
netflix.shape
(8807, 9)
2.1.1. Handling type, director, country & rating
These are categorical features and should be stored as the category
dtype. Let's start with type.
netflix['type'] = netflix['type'].astype('category')
netflix['type'].value_counts()
Movie      6131
TV Show    2676
Name: type, dtype: int64
We look at director next. This feature is interesting because we could also treat it as text (in which case we would exclude it from this analysis). However, assuming that a film/show can only have one director and that there is no quantifiable hierarchy amongst directors, I think we can treat it as categorical. We first check that every show really has only a single director.
directors = netflix[['director']].dropna()  # otherwise len throws error on missing values
directors['director'] = directors['director'].str.strip()
# by observing the dataset I noticed that multiple values are
# separated with comma.
directors['count'] = directors['director'].str.split(pat=',').apply(len)
directors.head()
                        director  count
0                Kirsten Johnson      1
2                Julien Leclercq      1
5                  Mike Flanagan      1
6  Robert Cullen, José Luis Ucha      2
7                   Haile Gerima      1
We can already see that our assumption of a single director per show is incorrect. Let's investigate a bit more.
directors[directors['count'].gt(1)]
                                                           director  count
6                                     Robert Cullen, José Luis Ucha      2
16                    Pedro de Echave García, Pablo Azorín Williams      2
23                                          Alex Woo, Stanley Moore      2
30           Ashwiny Iyer Tiwari, Abhishek Chaubey, Saket Chaudhary      3
68             Hanns-Bruno Kammertöns, Vanessa Nöcker, Michael Wech      3
...                                                             ...    ...
8727                                      Ritu Sarin, Tenzing Sonam      2
8728                                Heidi Brandenburg, Mathew Orzel      2
8737                         Milla Harrison-Hansley, Alicky Sussman      2
8739                                    Frank Capra, Anatole Litvak      2
8765  Jovanka Vuckovic, Annie Clark, Roxanne Benjamin, Karyn Kusama      4

[614 rows x 2 columns]
This makes director the same as cast. We could do interesting analysis
with them; for instance, we could find the most popular directors/cast
members (overall and per country). However, I don't think we will
discover any new smell here. Therefore we drop director and cast at
this point.
netflix = netflix.drop(['director', 'cast'], axis='columns')
netflix.shape
(8807, 7)
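Although director and cast are dropped here, the "most popular directors" idea mentioned above only needs the comma-separated strings to be split into individual names and counted. A minimal sketch in plain Python, under the assumption that names are always separated by commas (as observed in the outputs above); count_names is a hypothetical helper and None stands in for a missing value:

```python
from collections import Counter


def count_names(values, sep=","):
    """Count individual names across comma-separated strings, skipping missing values."""
    counts = Counter()
    for value in values:
        if not isinstance(value, str):  # e.g. NaN/None for a missing director
            continue
        counts.update(name.strip() for name in value.split(sep) if name.strip())
    return counts


# Sample values copied from the director outputs shown above; None stands in for NaN.
sample = [
    "Kirsten Johnson",
    None,
    "Robert Cullen, José Luis Ucha",
    "Alex Woo, Stanley Moore",
]
counts = count_names(sample)
print(counts.most_common(3))
```

Applied to the full column (before it is dropped), counts.most_common(10) would list the ten most frequent directors. In pandas, a similar result can be obtained by splitting with Series.str.split, calling explode and then value_counts.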
Next, let's look at country.
netflix['country'] = netflix['country'].astype('category')
netflix['country'].value_counts()
United States                                                                          2818
India                                                                                   972
United Kingdom                                                                          419
Japan                                                                                   245
South Korea                                                                             199
                                                                                       ...
Ireland, Canada, Luxembourg, United States, United Kingdom, Philippines, India            1
Ireland, Canada, United Kingdom, United States                                            1
Ireland, Canada, United States, United Kingdom                                            1
Ireland, France, Iceland, United States, Mexico, Belgium, United Kingdom, Hong Kong       1
Zimbabwe                                                                                  1
Name: country, Length: 748, dtype: int64
Looks like country also contains multiple values. This could perhaps
indicate the countries in which the show was released. Unfortunately,
we cannot do much with this feature in this analysis, as it is again
effectively text.
netflix = netflix.drop(['country'], axis='columns')
And finally, we look at rating.
netflix['rating'] = netflix['rating'].astype('category')
netflix['rating'].value_counts()
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
UR             3
NC-17          3
74 min         1
84 min         1
66 min         1
Name: rating, dtype: int64
Without docs/domain knowledge it's difficult to verify the validity of these ratings. The last 3 look especially suspicious: they resemble duration values rather than ratings.
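The suspicious entries can be picked out mechanically rather than by eye. The sketch below is an illustration, not part of the original analysis: it flags rating values that match a "<number> min" pattern. The regular expression and the helper name misplaced_durations are assumptions, and the list of ratings is copied from the value_counts() output above.

```python
import re

# Matches strings such as "74 min": digits, optional whitespace, then "min".
DURATION_LIKE = re.compile(r"^\d+\s*min$")


def misplaced_durations(ratings):
    """Return the rating values that look like a duration rather than a rating."""
    return [r for r in ratings if isinstance(r, str) and DURATION_LIKE.match(r.strip())]


# Distinct values copied from the value_counts() output above.
ratings = ["TV-MA", "TV-14", "TV-PG", "R", "PG-13", "TV-Y7", "TV-Y", "PG", "TV-G",
           "NR", "G", "TV-Y7-FV", "UR", "NC-17", "74 min", "84 min", "66 min"]
suspicious = misplaced_durations(ratings)
print(suspicious)
```

Whether these rows should be dropped, or their values moved to the duration column, cannot be decided from the pattern alone; the check only locates them.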
2.1.2. Handling date_added & release_year
These two features contain date and/or time information so we should fix their dtype.
# work on an explicit copy so the assignments below don't trigger SettingWithCopyWarning
dates = netflix[['date_added', 'release_year']].copy()
dates.head()
           date_added  release_year
0  September 25, 2021          2020
1  September 24, 2021          2021
2  September 24, 2021          2021
3  September 24, 2021          2021
4  September 24, 2021          2021
# there are leading/trailing whitespaces here
dates['date_added'] = pd.to_datetime(dates['date_added'].str.strip(), format='%B %d, %Y')
dates['release_year'] = pd.to_datetime(dates['release_year'], format='%Y')
dates.head()
   date_added  release_year
0  2021-09-25    2020-01-01
1  2021-09-24    2021-01-01
2  2021-09-24    2021-01-01
3  2021-09-24    2021-01-01
4  2021-09-24    2021-01-01
dates.dtypes
date_added      datetime64[ns]
release_year    datetime64[ns]
dtype: object
netflix[['date_added', 'release_year']] = dates
netflix.head()
      type  date_added  release_year  rating   duration  \
0    Movie  2021-09-25    2020-01-01   PG-13     90 min
1  TV Show  2021-09-24    2021-01-01   TV-MA  2 Seasons
2  TV Show  2021-09-24    2021-01-01   TV-MA   1 Season
3  TV Show  2021-09-24    2021-01-01   TV-MA   1 Season
4  TV Show  2021-09-24    2021-01-01   TV-MA  2 Seasons

   listed_in
0  Documentaries
1  International TV Shows, TV Dramas, TV Mysteries
2  Crime TV Shows, International TV Shows, TV Action & Adventure
3  Docuseries, Reality TV
4  International TV Shows, Romantic TV Shows, TV Comedies
Visualisation ideas here include observing in which year most shows were released, and which month across years has the most releases.
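Both ideas reduce to counting rows per year and per calendar month before plotting. A minimal plain-Python sketch of those counts, using only the five (date_added, release_year) pairs from the head() output above as sample data; the variable names are illustrative:

```python
from collections import Counter
from datetime import datetime

# Sample (date_added, release_year) pairs copied from the head() output above.
sample = [
    ("September 25, 2021", 2020),
    ("September 24, 2021", 2021),
    ("September 24, 2021", 2021),
    ("September 24, 2021", 2021),
    ("September 24, 2021", 2021),
]

# Parse the human-readable dates with the same format string used above.
added = [datetime.strptime(text.strip(), "%B %d, %Y") for text, _ in sample]

# Number of shows per release year, and per calendar month of addition
# (months are pooled across years, so 9 means "any September").
per_release_year = Counter(year for _, year in sample)
per_month_added = Counter(date.month for date in added)

print(per_release_year.most_common(1))
print(per_month_added.most_common(1))
```

On the converted DataFrame, the equivalent counts would presumably come from the .dt accessor, for example netflix['date_added'].dt.month.value_counts(), which could then be passed to a bar plot.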
2.1.3. Handling duration
This feature should be represented as a timedelta dtype since it
represents a duration of time. However, it is stored in a
human-friendly format and, without documentation or domain expertise,
it is not possible to convert values such as '2 Seasons' into an
absolute quantity of time (such as hours or minutes).
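A partial step that needs no domain knowledge is to separate the number from its unit, so that movies (minutes) and shows (seasons) can at least be compared within their own group. The sketch below is an assumption-laden illustration rather than the author's method: parse_duration is a hypothetical helper, and it assumes the only units are "min", "Season" and "Seasons", which are the ones visible in the head() output above.

```python
import re

# "<number> <unit>" where the unit is "min", "Season" or "Seasons".
DURATION = re.compile(r"^(\d+)\s+(min|Seasons?)$")


def parse_duration(text):
    """Split e.g. '90 min' into (90, 'min'); return None if the format is unexpected."""
    if not isinstance(text, str):  # missing values such as NaN/None
        return None
    match = DURATION.match(text.strip())
    if match is None:
        return None
    value, unit = match.groups()
    # Normalise "Seasons" to "Season" so the unit takes only two distinct values.
    return int(value), unit.rstrip("s")


# Sample values copied from the duration column shown above; None stands in for NaN.
sample = ["90 min", "2 Seasons", "1 Season", None]
parsed = [parse_duration(text) for text in sample]
print(parsed)
```

This does not solve the conversion of seasons into hours or minutes; it only makes the numeric part available. On the DataFrame, the function could be applied with netflix['duration'].map(parse_duration).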
2.1.4. Handling listed_in
This is similar to director & cast and is thus not considered in this
analysis.
netflix = netflix.drop(['listed_in'], axis='columns')
netflix.shape
(8807, 5)
2.2. Distributional analysis
Unfortunately we are not left with many features to produce insightful visualisations with. We can check the distributions of the remaining features, but I don't think we will discover any new smells.