Video Game Sales
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt
matplotlib.use('Agg')  # non-interactive backend, produce pngs instead

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the vgsales dataset. We start by reading
the accompanying data docs. The docs are not very informative: they
provide a description of the columns along with a count of the total
examples in the dataset, and state that the data was generated by
scraping a website.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
vgsales = pd.read_csv('../data/data/vgsales.csv')
vgsales.head()
   Rank                      Name Platform    Year         Genre Publisher  \
0     1                Wii Sports      Wii  2006.0        Sports  Nintendo
1     2         Super Mario Bros.      NES  1985.0      Platform  Nintendo
2     3            Mario Kart Wii      Wii  2008.0        Racing  Nintendo
3     4         Wii Sports Resort      Wii  2009.0        Sports  Nintendo
4     5  Pokemon Red/Pokemon Blue       GB  1996.0  Role-Playing  Nintendo

   NA_Sales  EU_Sales  JP_Sales  Other_Sales  Global_Sales
0     41.49     29.02      3.77         8.46         82.74
1     29.08      3.58      6.81         0.77         40.24
2     15.85     12.88      3.79         3.31         35.82
3     15.75     11.01      3.28         2.96         33.00
4     11.27      8.89     10.22         1.00         31.37
Next we look at the features and their dtypes.
vgsales.columns
Index(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')
vgsales.dtypes
Rank              int64
Name             object
Platform         object
Year            float64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object
The name feature would be interesting to analyse, but it requires
some feature extraction to derive meaningful numerical features (such
as n-grams). Since this project focuses only on numerical features,
we drop the column here. The rank feature is ordered and may prove
valuable for certain ML tasks, but we can drop it as well since we
don't have a specific ML task here.
vgsales = vgsales.drop(labels=['Rank', 'Name'], axis='columns')
vgsales.head()
  Platform    Year         Genre Publisher  NA_Sales  EU_Sales  JP_Sales  \
0      Wii  2006.0        Sports  Nintendo     41.49     29.02      3.77
1      NES  1985.0      Platform  Nintendo     29.08      3.58      6.81
2      Wii  2008.0        Racing  Nintendo     15.85     12.88      3.79
3      Wii  2009.0        Sports  Nintendo     15.75     11.01      3.28
4       GB  1996.0  Role-Playing  Nintendo     11.27      8.89     10.22

   Other_Sales  Global_Sales
0         8.46         82.74
1         0.77         40.24
2         3.31         35.82
3         2.96         33.00
4         1.00         31.37
We convert the categorical features platform, genre & publisher to
the category dtype, and year to int (after dropping missing and
duplicate values).
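The category cast itself isn't shown in the listings below, so here is a minimal sketch of both conversions. It uses a tiny in-memory frame standing in for vgsales, since the CSV isn't bundled with this document:

```python
import pandas as pd

# Toy frame with the same columns we plan to convert in vgsales.
df = pd.DataFrame({
    'Platform': ['Wii', 'NES', 'Wii'],
    'Genre': ['Sports', 'Platform', 'Racing'],
    'Publisher': ['Nintendo', 'Nintendo', 'Nintendo'],
    'Year': [2006.0, 1985.0, 2008.0],
})

# Cast the categorical features to the memory-efficient 'category' dtype.
for col in ['Platform', 'Genre', 'Publisher']:
    df[col] = df[col].astype('category')

# Year can only become int once NaNs are gone: int64 cannot hold NaN,
# which is why the dropna() step has to happen first on the real data.
df['Year'] = df['Year'].astype('int')

print(df.dtypes)
```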
Let's check for missing values and duplicates next.
vgsales.isna().any()
Platform        False
Year             True
Genre           False
Publisher        True
NA_Sales        False
EU_Sales        False
JP_Sales        False
Other_Sales     False
Global_Sales    False
dtype: bool
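isna().any() only flags which columns contain missing values; before dropping rows it can help to quantify them as well. A quick sketch on a toy frame (the real counts come from vgsales):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the two vgsales columns that have missing data.
df = pd.DataFrame({
    'Year': [2006.0, np.nan, 2008.0, np.nan],
    'Publisher': ['Nintendo', 'Sega', None, 'Nintendo'],
    'Global_Sales': [82.74, 40.24, 35.82, 33.00],
})

# Per-column counts of missing values, not just a boolean flag.
missing = df.isna().sum()
print(missing)

# Rows dropna() would remove: any row with at least one NaN.
n_dropped = len(df) - len(df.dropna())
print(n_dropped)
```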
vgsales = vgsales.dropna()
vgsales.shape
(16291, 9)
vgsales[vgsales.duplicated()].shape
(252, 9)
Lots of duplicates! Let's drop 'em.
vgsales = vgsales.drop_duplicates()
vgsales.shape
(16039, 9)
Let's fix the year dtype next and save it as int.
vgsales[['Year']] = vgsales[['Year']].astype('int')
vgsales.dtypes
Platform         object
Year              int64
Genre            object
Publisher        object
NA_Sales        float64
EU_Sales        float64
JP_Sales        float64
Other_Sales     float64
Global_Sales    float64
dtype: object
Let's check the descriptive statistics next.
vgsales.describe(include='all')
       Platform          Year   Genre        Publisher      NA_Sales  \
count     16039  16039.000000   16039            16039  16039.000000
unique       31           NaN      12              576           NaN
top          DS           NaN  Action  Electronic Arts           NaN
freq       2099           NaN    3190             1332           NaN
mean        NaN   2006.363863     NaN              NaN      0.269521
std         NaN      5.840891     NaN              NaN      0.828258
min         NaN   1980.000000     NaN              NaN      0.000000
25%         NaN   2003.000000     NaN              NaN      0.000000
50%         NaN   2007.000000     NaN              NaN      0.080000
75%         NaN   2010.000000     NaN              NaN      0.240000
max         NaN   2020.000000     NaN              NaN     41.490000

            EU_Sales      JP_Sales   Other_Sales  Global_Sales
count   16039.000000  16039.000000  16039.000000  16039.000000
unique           NaN           NaN           NaN           NaN
top              NaN           NaN           NaN           NaN
freq             NaN           NaN           NaN           NaN
mean        0.149934      0.079711      0.049156      0.548595
std         0.512974      0.314210      0.191478      1.578372
min         0.000000      0.000000      0.000000      0.010000
25%         0.000000      0.000000      0.000000      0.060000
50%         0.020000      0.000000      0.010000      0.180000
75%         0.110000      0.040000      0.040000      0.490000
max        29.020000     10.220000     10.570000     82.740000
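The summary already hints that the sales columns are heavily right-skewed: for Global_Sales the mean (0.55) sits far above the median (0.18). A quick sanity check of that signature, using a made-up right-skewed series rather than the real column:

```python
import pandas as pd

# Illustrative right-skewed sales figures; the real values live in vgsales.
s = pd.Series([0.01, 0.03, 0.06, 0.18, 0.49, 1.2, 5.0, 82.74])

# With right skew the mean is pulled above the median by the long tail,
# and the sample skewness is positive.
print(s.mean(), s.median(), s.skew())
```

This matters later: skew this strong dominates histogram bin counts, which is exactly what we run into in section 2.2.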
We have mostly numerical features, plus three categorical ones. And finally, the correlations.
corr = vgsales.corr(numeric_only=True)  # object columns can't be correlated
name = 'heatmap@vgsales--corr.png'
plotter.corr(corr, name)
name
The *_sales features are positively correlated with one another, which
is expected: Global_Sales is essentially the sum of the regional sales
columns. This suggests there is room for feature selection.
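One way to make the feature-selection remark concrete is to check how close Global_Sales is to the sum of the regional columns; if the residual is near zero, the column is redundant and a candidate for dropping. A sketch on two rows copied from the head() output above:

```python
import pandas as pd

# Two rows shaped like vgsales; the full frame is loaded from the CSV.
df = pd.DataFrame({
    'NA_Sales': [41.49, 29.08],
    'EU_Sales': [29.02, 3.58],
    'JP_Sales': [3.77, 6.81],
    'Other_Sales': [8.46, 0.77],
    'Global_Sales': [82.74, 40.24],
})

regional = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales']
# Largest absolute gap between Global_Sales and the regional sum.
residual = (df['Global_Sales'] - df[regional].sum(axis=1)).abs().max()
print(residual)
```

On the real data small residuals remain because the published figures are rounded to two decimals, so a tolerance rather than exact equality is the right check.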
2.2. Distribution analysis
In this section we analyse the distributions of the features. Let's
start with a histogram of each feature. We keep the *_sales features
in the same figure, since it is useful for them to share the same
y-axis scale.
name = 'histplot@vgsales--sales.png'
sales_features = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
fig, axs = plt.subplots(1, 5, figsize=(20, 5))
for idx, feature in enumerate(sales_features):
    sns.histplot(data=vgsales, x=feature, bins=5, ax=axs[idx])
fig.savefig(name)
name
The histplot is not very useful here: the sales distributions are so heavily right-skewed that nearly every observation lands in the first bin, which is why the y-axis counts are so high. A kdeplot gives a better view of the shape.
name = 'kdeplot@vgsales--sales.png'
sales_features = ['NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales']
fig, axs = plt.subplots(1, 5, figsize=(20, 5))
for idx, feature in enumerate(sales_features):
    sns.kdeplot(data=vgsales, x=feature, ax=axs[idx])
fig.savefig(name)
name