Video Game Sales

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the vgsales dataset. We start by reading the accompanying data docs. They are not very informative: they describe the columns and give the total number of examples in the dataset. They also state that the data was generated by scraping a website.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

vgsales = pd.read_csv('../data/data/vgsales.csv')
vgsales.head()
   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo   
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo   
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo   
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo   
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo   

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales  
0     41.49     29.02      3.77         8.46         82.74  
1     29.08      3.58      6.81         0.77         40.24  
2     15.85     12.88      3.79         3.31         35.82  
3     15.75     11.01      3.28         2.96         33.00  
4     11.27      8.89     10.22         1.00         31.37  

Next we look at the features and their dtypes.

vgsales.columns
Index(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')
vgsales.dtypes
Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object

The Name feature would be interesting to analyse, but it requires feature extraction to obtain meaningful numerical features (such as n-grams). This project focuses on the numerical features, so we drop the column here. The Rank feature is ordered and may prove valuable for certain ML tasks, but since we don't have a specific ML task here we drop it as well.

vgsales = vgsales.drop(labels=['Rank', 'Name'], axis='columns')
vgsales.head()
  Platform    Year         Genre Publisher  NA_Sales  EU_Sales  JP_Sales  \
0      Wii  2006.0        Sports  Nintendo     41.49     29.02      3.77   
1      NES  1985.0      Platform  Nintendo     29.08      3.58      6.81   
2      Wii  2008.0        Racing  Nintendo     15.85     12.88      3.79   
3      Wii  2009.0        Sports  Nintendo     15.75     11.01      3.28   
4       GB  1996.0  Role-Playing  Nintendo     11.27      8.89     10.22   

   Other_Sales  Global_Sales  
0         8.46         82.74  
1         0.77         40.24  
2         3.31         35.82  
3         2.96         33.00  
4         1.00         31.37  

We will convert the categorical features Platform, Genre and Publisher to the category dtype and Year to an int dtype, after dropping missing and duplicate values.

Let's check for missing values and duplicates next.

vgsales.isna().any()
Platform        False
Year             True
Genre           False
Publisher        True
NA_Sales        False
EU_Sales        False
JP_Sales        False
Other_Sales     False
Global_Sales    False
dtype: bool
vgsales = vgsales.dropna()
vgsales.shape
(16291, 9)
vgsales[vgsales.duplicated()].shape
(252, 9)

Lots of duplicates! Let's drop 'em.

vgsales = vgsales.drop_duplicates()
vgsales.shape
(16039, 9)
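
The missing-value and duplicate checks above can also report counts rather than flags. Here is a minimal sketch on a small made-up frame (the `toy` data and the variable names are assumptions for illustration, not the real vgsales rows):

```python
import numpy as np
import pandas as pd

# Toy stand-in for vgsales; the values are made up for illustration.
toy = pd.DataFrame({
    'Platform': ['Wii', 'NES', 'Wii', 'Wii', 'GB'],
    'Year': [2006.0, 1985.0, np.nan, 2006.0, 1996.0],
    'Publisher': ['Nintendo', 'Nintendo', 'Nintendo', 'Nintendo', None],
    'Global_Sales': [82.74, 40.24, 35.82, 82.74, 31.37],
})

# Number of missing values per column, instead of a True/False flag.
missing_counts = toy.isna().sum()

# Drop incomplete rows, then count exact duplicates among the remaining rows.
complete = toy.dropna()
n_duplicates = int(complete.duplicated().sum())
deduplicated = complete.drop_duplicates()
```

One caveat worth keeping in mind: Rank and Name were dropped before this check, so a row flagged as a duplicate may be a different game that happens to share all the remaining values.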

Let's fix year dtype next and save it as int.

vgsales['Year'] = vgsales['Year'].astype('int')
vgsales.dtypes
Platform         object
Year              int64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object
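
The dtypes above still list Platform, Genre and Publisher as object. The conversion to the category dtype mentioned earlier could be sketched as follows; the `toy` frame is a made-up stand-in, and this is one possible way to do it rather than the code used in the project:

```python
import pandas as pd

# Toy stand-in for vgsales; the values are made up for illustration.
toy = pd.DataFrame({
    'Platform': ['Wii', 'NES', 'Wii'],
    'Year': [2006.0, 1985.0, 2008.0],
    'Genre': ['Sports', 'Platform', 'Racing'],
    'Publisher': ['Nintendo', 'Nintendo', 'Nintendo'],
})

# Convert the three categorical columns in one call.
categorical = ['Platform', 'Genre', 'Publisher']
toy = toy.astype({column: 'category' for column in categorical})

# Year has no missing values here, so the float -> int cast is safe.
toy['Year'] = toy['Year'].astype('int')
```

With the category dtype, `describe(include='all')` still reports unique, top and freq for these columns, and memory use typically drops when a column has few distinct values.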

Let's check the descriptive statistics next.

vgsales.describe(include='all')
       Platform          Year   Genre        Publisher      NA_Sales  \
count     16039  16039.000000   16039            16039  16039.000000   
unique       31           NaN      12              576           NaN   
top          DS           NaN  Action  Electronic Arts           NaN   
freq       2099           NaN    3190             1332           NaN   
mean        NaN   2006.363863     NaN              NaN      0.269521   
std         NaN      5.840891     NaN              NaN      0.828258   
min         NaN   1980.000000     NaN              NaN      0.000000   
25%         NaN   2003.000000     NaN              NaN      0.000000   
50%         NaN   2007.000000     NaN              NaN      0.080000   
75%         NaN   2010.000000     NaN              NaN      0.240000   
max         NaN   2020.000000     NaN              NaN     41.490000   

            EU_Sales      JP_Sales   Other_Sales  Global_Sales  
count   16039.000000  16039.000000  16039.000000  16039.000000  
unique           NaN           NaN           NaN           NaN  
top              NaN           NaN           NaN           NaN  
freq             NaN           NaN           NaN           NaN  
mean        0.149934      0.079711      0.049156      0.548595  
std         0.512974      0.314210      0.191478      1.578372  
min         0.000000      0.000000      0.000000      0.010000  
25%         0.000000      0.000000      0.000000      0.060000  
50%         0.020000      0.000000      0.010000      0.180000  
75%         0.110000      0.040000      0.040000      0.490000  
max        29.020000     10.220000     10.570000     82.740000  

Most of the features are numerical; three are categorical. Finally, let's look at the correlations.

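# restrict to numeric columns: recent pandas versions raise on non-numeric columns in corr()
corr = vgsales.select_dtypes(include='number').corr()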
name = 'heatmap@vgsales--corr.png'
plotter.corr(corr, name)
name

heatmap@vgsales--corr.png

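The *_Sales features are positively correlated with one another. This is expected: Global_Sales aggregates the regional figures, and a title that sells well in one region tends to sell well in others. Because of this redundancy, some of these columns could be dropped in a feature selection step.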
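
One reason for the strong correlations can be checked directly. The sketch below uses only the five rows printed by vgsales.head() earlier and tests whether Global_Sales matches the sum of the regional columns; the variable names are assumptions for illustration, and five rows do not prove that this holds for the whole dataset:

```python
import pandas as pd

# The five rows printed by vgsales.head() above.
head = pd.DataFrame({
    'NA_Sales': [41.49, 29.08, 15.85, 15.75, 11.27],
    'EU_Sales': [29.02, 3.58, 12.88, 11.01, 8.89],
    'JP_Sales': [3.77, 6.81, 3.79, 3.28, 10.22],
    'Other_Sales': [8.46, 0.77, 3.31, 2.96, 1.00],
    'Global_Sales': [82.74, 40.24, 35.82, 33.00, 31.37],
})

regional = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']

# Row-wise sum of the regional columns.
regional_sum = head[regional].sum(axis='columns')

# Largest absolute difference between that sum and Global_Sales.
max_gap = (regional_sum - head['Global_Sales']).abs().max()
```

On these rows the gap stays within rounding error of the two-decimal figures. If the same holds on the full frame, Global_Sales carries no information beyond the regional columns.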

2.2. Distribution analysis

In this section we analyse the distributions of the features. Let's start with histograms of the *_Sales features. We keep them in the same figure so they can be compared side by side; note that plt.subplots does not share the y axis across panels unless sharey=True is passed.

name = 'histplot@vgsales--sales.png'
sales_features = ['NA_Sales',
                  'EU_Sales',
                  'JP_Sales',
                  'Other_Sales',
                  'Global_Sales']

fig, axs = plt.subplots(1, 5, figsize=(20, 5))

for idx, feature in enumerate(sales_features):
    sns.histplot(data=vgsales, x=feature, bins=5, ax=axs[idx])

fig.savefig(name)
name

histplot@vgsales--sales.png

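The histplot is not very useful here. The y-axis scale is so high most likely because the sales features are heavily right-skewed: the descriptive statistics above show a 75th percentile of 0.49 for Global_Sales against a maximum of 82.74, so with only 5 equal-width bins almost all rows land in the first bin. Let's try a kdeplot instead.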

name = 'kdeplot@vgsales--sales.png'
sales_features = ['NA_Sales',
                  'EU_Sales',
                  'JP_Sales',
                  'Other_Sales',
                  'Global_Sales']

fig, axs = plt.subplots(1, 5, figsize=(20, 5))

for idx, feature in enumerate(sales_features):
    sns.kdeplot(data=vgsales, x=feature, ax=axs[idx])

fig.savefig(name)
name

kdeplot@vgsales--sales.png
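
To see why a handful of equal-width bins says so little about heavily skewed data, here is a minimal sketch on synthetic log-normal values. The data, the seed and the variable names are assumptions for illustration, not the vgsales columns:

```python
import numpy as np

# Synthetic right-skewed "sales": most values are small, a few are very large.
rng = np.random.default_rng(0)
sales = rng.lognormal(mean=-1.5, sigma=1.5, size=10_000)

# Five equal-width bins on the raw scale, as in the histplot call above.
raw_counts, raw_edges = np.histogram(sales, bins=5)
raw_first_bin_share = raw_counts[0] / raw_counts.sum()

# Five equal-width bins on a log10 scale spread the same data far more evenly.
# This needs strictly positive values; columns containing zeros would need
# something like np.log1p instead.
log_counts, log_edges = np.histogram(np.log10(sales), bins=5)
log_largest_bin_share = log_counts.max() / log_counts.sum()
```

On the raw scale nearly every sample falls into the first bin, which is what produces one very tall bar. seaborn's histplot also accepts a log_scale argument for the same purpose, though zero sales values would have to be handled before using it.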

Date: 2021-10-18 Mon 00:00

Created: 2021-10-22 Fri 22:00