Netflix

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
matplotlib.use('Agg') # non-interactive backend, produce pngs instead
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils

2. Analysis

In this section we analyse the Netflix dataset. We started by reading the accompanying data docs, which do not give any information about the dataset.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

netflix = pd.read_csv('../data/data/netflix.csv')
netflix.head()
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water              NaN   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans              NaN   
4      s5  TV Show           Kota Factory              NaN   

                                                                                                                                                                                                                                                                                                              cast  \
0                                                                                                                                                                                                                                                                                                              NaN   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thabang Molaba, Dillon Windvogel, Natasha Thahane, Arno Greeff, Xolile Tshabalala, Getmore Sithole, Cindy Mahlangu, Ryle De Morny, Greteli Fincham, Sello Maake Ka-Ncube, Odwa Gwanya, Mekaila Mathys, Sandi Schultz, Duane Williams, Shamilla Miller, Patrick Mofokeng   
2                                                                                                                                                              Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabiha Akkari, Sofia Lesaffre, Salim Kechiouche, Noureddine Farihi, Geert Van Rampelberg, Bakary Diombera   
3                                                                                                                                                                                                                                                                                                              NaN   
4                                                                                                                                                                                                         Mayur More, Jitendra Kumar, Ranjan Raj, Alam Khan, Ahsaas Channa, Revathi Pillai, Urvi Singh, Arun Kumar   

         country          date_added  release_year rating   duration  \
0  United States  September 25, 2021          2020  PG-13     90 min   
1   South Africa  September 24, 2021          2021  TV-MA  2 Seasons   
2            NaN  September 24, 2021          2021  TV-MA   1 Season   
3            NaN  September 24, 2021          2021  TV-MA   1 Season   
4          India  September 24, 2021          2021  TV-MA  2 Seasons   

                                                       listed_in  \
0                                                  Documentaries   
1                International TV Shows, TV Dramas, TV Mysteries   
2  Crime TV Shows, International TV Shows, TV Action & Adventure   
3                                         Docuseries, Reality TV   
4         International TV Shows, Romantic TV Shows, TV Comedies   

                                                                                                                                                description  
0  As her father nears the end of his life, filmmaker Kirsten Johnson stages his death in inventive and comical ways to help them both face the inevitable.  
1       After crossing paths at a party, a Cape Town teen sets out to prove whether a private-school swimming star is her sister who was abducted at birth.  
2        To protect his family from a powerful drug lord, skilled thief Mehdi and his expert team of robbers are pulled into a violent and deadly turf war.  
3       Feuds, flirtations and toilet talk go down among the incarcerated women at the Orleans Justice Center in New Orleans on this gritty reality series.  
4  In a city of coaching centers known to train India’s finest collegiate minds, an earnest but unexceptional student and his friends navigate campus life.  

We have an interesting mix of features: type, country & rating are clearly categorical. We also encounter a new kind of numerical feature: date_added & release_year are timestamps, which require special attention.

duration is a tricky one: it contains numerical data, but represented in a human-friendly format. We have to extract the absolute quantity through some form of transformation.

director is also interesting. My intuition tells me it is a categorical feature, as it holds distinct values with no hierarchy. We can also treat it as text.

listed_in is also categorical, but each entry contains several values, so we need to think of a way to extract this information such that it's useful for an ML model. cast presents the same dilemma: it's categorical but with several values per entry.
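
One common way to make a multi-valued categorical feature usable by a model is a multi-hot encoding: one indicator column per distinct value. The following is only a sketch on three rows copied from the head() output above. It assumes the values are always separated by a comma followed by a space, which has not been verified on the full dataset.

```python
import pandas as pd

# Toy sample copied from the head() output above, not the full dataset.
listed_in = pd.Series([
    "Documentaries",
    "International TV Shows, TV Dramas, TV Mysteries",
    "Docuseries, Reality TV",
])

# One indicator column per genre; a row may have several 1s.
# Assumption: values are separated by exactly ", ".
genres = listed_in.str.get_dummies(sep=", ")
print(genres)
```

The same call would apply to cast, although the number of resulting columns there is likely to be very large.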

description & title are purely text. Although there are several interesting things we can do with textual data, this is beyond the scope of this project, so both features are ignored in this analysis.

netflix.shape
8807 12
netflix.columns
Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')
netflix.dtypes
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object

Let's drop the columns we don't need and correct the dtypes before further analysis.

netflix = netflix.drop(['show_id', 'title', 'description'], axis='columns')
netflix.shape
8807 9

2.1.1. Handling type, director, country & rating

These are categorical features and should be stored as the category dtype. Let's start with type.

netflix['type'] = netflix['type'].astype('category')
netflix['type'].value_counts()
Movie      6131
TV Show    2676
Name: type, dtype: int64

We look at director next. We could treat this feature as text, in which case we would exclude it from this analysis. However, assuming that a film/show has only one director and that there is no quantifiable hierarchy amongst directors, I think we can also treat it as categorical. We first check that every show really has a single director.

directors = netflix[['director']].dropna() # otherwise len throws error on missing values
directors['director'] = directors['director'].str.strip()
# by observing the dataset I noticed that multiple values are
# separated with comma.
directors['count'] = directors['director'].str.split(pat=',').apply(len)
directors.head()
                        director  count
0                Kirsten Johnson      1
2                Julien Leclercq      1
5                  Mike Flanagan      1
6  Robert Cullen, José Luis Ucha      2
7                   Haile Gerima      1

We can already see that our assumption of a single director per show is incorrect. Let's investigate a bit more.

directors[directors['count'].gt(1)]
                                                           director  count
6                                     Robert Cullen, José Luis Ucha      2
16                    Pedro de Echave García, Pablo Azorín Williams      2
23                                          Alex Woo, Stanley Moore      2
30           Ashwiny Iyer Tiwari, Abhishek Chaubey, Saket Chaudhary      3
68             Hanns-Bruno Kammertöns, Vanessa Nöcker, Michael Wech      3
...                                                             ...    ...
8727                                      Ritu Sarin, Tenzing Sonam      2
8728                                Heidi Brandenburg, Mathew Orzel      2
8737                         Milla Harrison-Hansley, Alicky Sussman      2
8739                                    Frank Capra, Anatole Litvak      2
8765  Jovanka Vuckovic, Annie Clark, Roxanne Benjamin, Karyn Kusama      4

[614 rows x 2 columns]

This makes director the same as cast. We could do interesting analysis with them; for instance, we could find the most popular directors/cast members (overall and per country). However, I don't think we will discover any new smell here. Therefore we drop director and cast at this point.
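
For reference, the popularity count mentioned above could be sketched as follows, even though it is not pursued in this analysis. The sample is a toy one: the names come from the output above, but the repeated 'Kirsten Johnson' row is hypothetical and only added so that the counts differ.

```python
import pandas as pd

# Toy sample; the last row is a hypothetical duplicate for illustration.
directors = pd.Series([
    "Kirsten Johnson",
    "Robert Cullen, José Luis Ucha",
    "Alex Woo, Stanley Moore",
    "Kirsten Johnson",
])

per_director = (
    directors.str.split(",")  # one list of names per show
    .explode()                # one row per (show, director) pair
    .str.strip()              # remove the space left after each comma
    .value_counts()           # number of shows per director
)
print(per_director)
```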

netflix = netflix.drop(['director', 'cast'], axis='columns')
netflix.shape
8807 7

Next, let's look at country.

netflix['country'] = netflix['country'].astype('category')
netflix['country'].value_counts()
United States                                                                          2818
India                                                                                   972
United Kingdom                                                                          419
Japan                                                                                   245
South Korea                                                                             199
                                                                                       ... 
Ireland, Canada, Luxembourg, United States, United Kingdom, Philippines, India            1
Ireland, Canada, United Kingdom, United States                                            1
Ireland, Canada, United States, United Kingdom                                            1
Ireland, France, Iceland, United States, Mexico, Belgium, United Kingdom, Hong Kong       1
Zimbabwe                                                                                  1
Name: country, Length: 748, dtype: int64

It looks like country also contains multiple values. These could be the countries in which the show was released, but without documentation we cannot be sure. Unfortunately, we cannot do much with this feature in this analysis, as it is again text.

netflix = netflix.drop(['country'], axis='columns')

And finally, we look at rating.

netflix['rating'] = netflix['rating'].astype('category')
netflix['rating'].value_counts()
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
UR             3
NC-17          3
74 min         1
84 min         1
66 min         1
Name: rating, dtype: int64

Without docs/domain knowledge it's difficult to verify the validity of these ratings. The last 3 look especially suspicious: they have the form of duration values (such as '74 min') rather than ratings.
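
A minimal sketch of how such values could be flagged, assuming the suspicious entries always follow the pattern '<number> min'. The sample below is a toy one built from the values listed above. Replacing the flagged entries with missing values is just one possible treatment, not something done in this analysis.

```python
import pandas as pd

# Toy sample mixing valid-looking ratings with the suspicious values.
ratings = pd.Series(["TV-MA", "PG-13", "74 min", "R", "84 min", "66 min"])

# Flag entries that consist entirely of digits followed by ' min'.
looks_like_duration = ratings.str.fullmatch(r"\d+ min")
suspicious = ratings[looks_like_duration]

# One possible treatment: mark the flagged entries as missing.
cleaned = ratings.mask(looks_like_duration)
print(suspicious)
```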

2.1.2. Handling date_added & release_year

These two features contain date and/or time information so we should fix their dtype.

dates = netflix[['date_added', 'release_year']].copy() # work on a copy to avoid SettingWithCopyWarning below
dates.head()
           date_added  release_year
0  September 25, 2021          2020
1  September 24, 2021          2021
2  September 24, 2021          2021
3  September 24, 2021          2021
4  September 24, 2021          2021
# there are leading/trailing whitespaces here
dates['date_added'] = pd.to_datetime(dates['date_added'].str.strip(), format='%B %d, %Y')
dates['release_year'] = pd.to_datetime(dates['release_year'], format='%Y')
dates.head()
  date_added release_year
0 2021-09-25   2020-01-01
1 2021-09-24   2021-01-01
2 2021-09-24   2021-01-01
3 2021-09-24   2021-01-01
4 2021-09-24   2021-01-01
dates.dtypes
date_added      datetime64[ns]
release_year    datetime64[ns]
dtype: object
netflix[['date_added', 'release_year']] = dates
netflix.head()
      type date_added release_year rating   duration  \
0    Movie 2021-09-25   2020-01-01  PG-13     90 min   
1  TV Show 2021-09-24   2021-01-01  TV-MA  2 Seasons   
2  TV Show 2021-09-24   2021-01-01  TV-MA   1 Season   
3  TV Show 2021-09-24   2021-01-01  TV-MA   1 Season   
4  TV Show 2021-09-24   2021-01-01  TV-MA  2 Seasons   

                                                       listed_in  
0                                                  Documentaries  
1                International TV Shows, TV Dramas, TV Mysteries  
2  Crime TV Shows, International TV Shows, TV Action & Adventure  
3                                         Docuseries, Reality TV  
4         International TV Shows, Romantic TV Shows, TV Comedies  

Visualisation ideas here include observing in which year most shows were released, and which month across years has the most releases.
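
The counts behind both ideas can be sketched with the .dt accessor. The frame below is rebuilt from the five rows shown above so that the example is self-contained; it is not the full dataset.

```python
import pandas as pd

# The five rows shown in the head() output above.
dates = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["September 25, 2021"] + ["September 24, 2021"] * 4,
        format="%B %d, %Y",
    ),
    "release_year": pd.to_datetime(["2020"] + ["2021"] * 4, format="%Y"),
})

# Number of shows released per year, in chronological order.
per_year = dates["release_year"].dt.year.value_counts().sort_index()

# Number of shows added per calendar month, pooled across years.
per_month = dates["date_added"].dt.month_name().value_counts()

print(per_year)
print(per_month)
```

Either result can be passed to a bar plot, for example with per_year.plot.bar().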

2.1.3. Handling duration

This feature should be represented as a timedelta dtype since it represents a duration of time. However, it is represented in a human-friendly format and without documentation or domain expertise, it is not possible to convert values such as '2 seasons' into an absolute quantity of time (such as in hours or minutes).
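
What we can do without further documentation is separate the number from its unit. The sketch below assumes that every value has the form '<number> min', '<number> Season' or '<number> Seasons', which has only been observed in the rows shown earlier. Only the minute values are converted to a timedelta; seasons stay as plain counts.

```python
import pandas as pd

# Toy sample based on the values shown earlier, plus one missing value.
duration = pd.Series(["90 min", "2 Seasons", "1 Season", None])

# Split each entry into a numeric value and a unit ('min' or 'Season').
parts = duration.str.extract(r"^(?P<value>\d+)\s+(?P<unit>min|Season)s?$")
parts["value"] = pd.to_numeric(parts["value"])

# Minutes have an absolute meaning, so they can become timedeltas.
is_minutes = parts["unit"].eq("min")
movie_length = pd.to_timedelta(parts.loc[is_minutes, "value"], unit="min")

print(parts)
print(movie_length)
```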

2.1.4. Handling listed_in

This is similar to director & cast and is thus not considered in this analysis.

netflix = netflix.drop(['listed_in'], axis='columns')
netflix.shape
8807 5

2.2. Distributional analysis

Unfortunately, we are not left with many features to produce insightful visualisations with. We can check the distributions of the remaining features, but I don't think we will discover any new smells.
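
As an illustration of what such a check could look like for the two categorical features, here is a sketch that cross-tabulates type against rating. It uses only the five rows shown at the start of this analysis, so the counts say nothing about the full dataset.

```python
import pandas as pd

# The five rows shown in the head() output at the start of the analysis.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "TV Show", "TV Show"],
    "rating": ["PG-13", "TV-MA", "TV-MA", "TV-MA", "TV-MA"],
})

# Joint frequency table of the two categorical features.
table = pd.crosstab(sample["type"], sample["rating"])
print(table)
```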

Date: 2021-10-20 Wed 00:00

Created: 2021-10-22 Fri 21:54