Playstore
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the playstore dataset. Unfortunately, there are no data docs for this dataset. Please note that the original dataset consists of two csv files. However, one of them contains user reviews (i.e. text data), so for this analysis we only consider googleplaystore.csv.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
playstore = pd.read_csv('../data/data/playstore.csv')
playstore.head()

                                                  App        Category  Rating  \
0      Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1
1                                 Coloring book moana  ART_AND_DESIGN     3.9
2  U Launcher Lite – FREE Live Cool Themes, Hide Apps  ART_AND_DESIGN     4.7
3                               Sketch - Draw & Paint  ART_AND_DESIGN     4.5
4               Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3

  Reviews  Size     Installs  Type Price Content Rating  \
0     159   19M      10,000+  Free     0       Everyone
1     967   14M     500,000+  Free     0       Everyone
2   87510  8.7M   5,000,000+  Free     0       Everyone
3  215644   25M  50,000,000+  Free     0           Teen
4     967  2.8M     100,000+  Free     0       Everyone

                      Genres      Last Updated         Current Ver  \
0               Art & Design   January 7, 2018               1.0.0
1  Art & Design;Pretend Play  January 15, 2018               2.0.0
2               Art & Design    August 1, 2018               1.2.4
3               Art & Design      June 8, 2018  Varies with device
4    Art & Design;Creativity     June 20, 2018                 1.1

    Android Ver
0  4.0.3 and up
1  4.0.3 and up
2  4.0.3 and up
3    4.2 and up
4    4.4 and up
playstore.shape
10841 | 13 |
playstore.dtypes
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object
We can see lots of interesting features (and potential smells)! To start, let's rename the columns for simplicity.
playstore.columns = playstore.columns.str.strip().str.lower().str.replace(' ', '_')
playstore.columns
Index(['app', 'category', 'rating', 'reviews', 'size', 'installs', 'type', 'price', 'content_rating', 'genres', 'last_updated', 'current_ver', 'android_ver'], dtype='object')
Let's look at the descriptive statistics, missing & duplicates next.
playstore.describe(include='all')
app category rating reviews size installs \ count 10841 10841 9367.000000 10841 10841 10841 unique 9660 34 NaN 6002 462 22 top ROBLOX FAMILY NaN 0 Varies with device 1,000,000+ freq 9 1972 NaN 596 1695 1579 mean NaN NaN 4.193338 NaN NaN NaN std NaN NaN 0.537431 NaN NaN NaN min NaN NaN 1.000000 NaN NaN NaN 25% NaN NaN 4.000000 NaN NaN NaN 50% NaN NaN 4.300000 NaN NaN NaN 75% NaN NaN 4.500000 NaN NaN NaN max NaN NaN 19.000000 NaN NaN NaN type price content_rating genres last_updated \ count 10840 10841 10840 10841 10841 unique 3 93 6 120 1378 top Free 0 Everyone Tools August 3, 2018 freq 10039 10040 8714 842 326 mean NaN NaN NaN NaN NaN std NaN NaN NaN NaN NaN min NaN NaN NaN NaN NaN 25% NaN NaN NaN NaN NaN 50% NaN NaN NaN NaN NaN 75% NaN NaN NaN NaN NaN max NaN NaN NaN NaN NaN current_ver android_ver count 10833 10838 unique 2832 33 top Varies with device 4.1 and up freq 1459 2451 mean NaN NaN std NaN NaN min NaN NaN 25% NaN NaN 50% NaN NaN 75% NaN NaN max NaN NaN
playstore.isna().any()
app               False
category          False
rating             True
reviews           False
size              False
installs          False
type               True
price             False
content_rating     True
genres            False
last_updated      False
current_ver        True
android_ver        True
dtype: bool
We have some missing values, let's investigate further.
# show the rows with at least one missing value
playstore[playstore.isna().any(axis='columns')]
app category rating reviews \ 15 Learn To Draw Kawaii Characters ART_AND_DESIGN 3.2 55 23 Mcqueen Coloring pages ART_AND_DESIGN NaN 61 113 Wrinkles and rejuvenation BEAUTY NaN 182 123 Manicure - nail design BEAUTY NaN 119 126 Skin Care and Natural Beauty BEAUTY NaN 654 ... ... ... ... ... 10824 Cardio-FR MEDICAL NaN 67 10825 Naruto & Boruto FR SOCIAL NaN 7 10831 payermonstationnement.fr MAPS_AND_NAVIGATION NaN 38 10835 FR Forms BUSINESS NaN 0 10838 Parkinson Exercices FR MEDICAL NaN 3 size installs type price content_rating \ 15 2.7M 5,000+ Free 0 Everyone 23 7.0M 100,000+ Free 0 Everyone 113 5.7M 100,000+ Free 0 Everyone 10+ 123 3.7M 50,000+ Free 0 Everyone 126 7.4M 100,000+ Free 0 Teen ... ... ... ... ... ... 10824 82M 10,000+ Free 0 Everyone 10825 7.7M 100+ Free 0 Teen 10831 9.8M 5,000+ Free 0 Everyone 10835 9.6M 10+ Free 0 Everyone 10838 9.5M 1,000+ Free 0 Everyone genres last_updated current_ver \ 15 Art & Design June 6, 2018 NaN 23 Art & Design;Action & Adventure March 7, 2018 1.0.0 113 Beauty September 20, 2017 8.0 123 Beauty July 23, 2018 1.3 126 Beauty July 17, 2018 1.15 ... ... ... ... 10824 Medical July 31, 2018 2.2.2 10825 Social February 2, 2018 1.0 10831 Maps & Navigation June 13, 2018 2.0.148.0 10835 Business September 29, 2016 1.1.5 10838 Medical January 20, 2017 1.0 android_ver 15 4.2 and up 23 4.1 and up 113 3.0 and up 123 4.1 and up 126 4.1 and up ... ... 10824 4.4 and up 10825 4.0 and up 10831 4.0 and up 10835 4.0 and up 10838 2.2 and up [1481 rows x 13 columns]
We have 1481 rows with at least one missing value. For this dataset, imputation is a potential solution (for instance, we can impute type based on price, and rating based on reviews, by making some assumptions of course). However, for this analysis we simply drop them.
playstore = playstore.dropna()
playstore.shape
9360 | 13 |
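For reference, the first imputation idea mentioned above (filling in type from price) could be sketched as follows. This is only an illustration on a made-up toy frame, and it rests on the assumption that an app is free exactly when its raw price string is '0'. The names toy and inferred_type are hypothetical and not part of this analysis.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, mimicking the raw (string) price column.
toy = pd.DataFrame({
    'type': ['Free', np.nan, 'Paid', np.nan],
    'price': ['0', '0', '$4.99', '$1.99'],
})

# Assumption: price '0' means the app is free, anything else means paid.
inferred_type = pd.Series(
    np.where(toy['price'].eq('0'), 'Free', 'Paid'),
    index=toy.index,
)

# Only fill the gaps; observed values are left untouched.
toy['type'] = toy['type'].fillna(inferred_type)
print(toy)
```

Imputing rating from reviews would need a stronger modelling assumption, so it is left out of this sketch.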
playstore[playstore.duplicated()].shape
474 | 13 |
We have duplicates as well, let's investigate further.
duplicates = playstore[playstore.duplicated(keep=False)]
duplicates
app category \ 164 Ebook Reader BOOKS_AND_REFERENCE 192 Docs To Go™ Free Office Suite BUSINESS 193 Google My Business BUSINESS 204 Box BUSINESS 213 ZOOM Cloud Meetings BUSINESS ... ... ... 8643 Wunderlist: To-Do List & Tasks PRODUCTIVITY 8654 TickTick: To Do List with Reminder, Day Planner PRODUCTIVITY 8658 ColorNote Notepad Notes PRODUCTIVITY 10049 Airway Ex - Intubate. Anesthetize. Train. MEDICAL 10768 AAFP MEDICAL rating reviews size installs type price \ 164 4.1 85842 37M 5,000,000+ Free 0 192 4.1 217730 Varies with device 50,000,000+ Free 0 193 4.4 70991 Varies with device 5,000,000+ Free 0 204 4.2 159872 Varies with device 10,000,000+ Free 0 213 4.4 31614 37M 10,000,000+ Free 0 ... ... ... ... ... ... ... 8643 4.6 404610 Varies with device 10,000,000+ Free 0 8654 4.6 25370 Varies with device 1,000,000+ Free 0 8658 4.6 2401017 Varies with device 100,000,000+ Free 0 10049 4.3 123 86M 10,000+ Free 0 10768 3.8 63 24M 10,000+ Free 0 content_rating genres last_updated current_ver \ 164 Everyone Books & Reference June 25, 2018 5.0.6 192 Everyone Business April 2, 2018 Varies with device 193 Everyone Business July 24, 2018 2.19.0.204537701 204 Everyone Business July 31, 2018 Varies with device 213 Everyone Business July 20, 2018 4.1.28165.0716 ... ... ... ... ... 8643 Everyone Productivity April 6, 2018 Varies with device 8654 Everyone Productivity August 6, 2018 Varies with device 8658 Everyone Productivity June 27, 2018 Varies with device 10049 Everyone Medical June 1, 2018 0.6.88 10768 Everyone Medical June 22, 2018 2.3.1 android_ver 164 4.0 and up 192 Varies with device 193 4.4 and up 204 Varies with device 213 4.0 and up ... ... 8643 Varies with device 8654 Varies with device 8658 Varies with device 10049 5.0 and up 10768 5.0 and up [876 rows x 13 columns]
It's a bit difficult to observe the duplicates like this. We can group by the name of the apps and investigate further.
# assign returns a new dataframe, which avoids writing into a slice of playstore
duplicates = duplicates.assign(app=duplicates['app'].str.strip().str.lower().str.replace(' ', '_'))
duplicates['app'].value_counts()
cbs_sports_app_-_scores,_news,_stats_&_watch_live 7 espn 6 bleacher_report:_sports_news,_scores,_&_highlights 5 thescore:_live_sports_scores,_news,_stats_&_videos 5 watchespn 4 .. fitbit_coach 2 nike_training_club_-_workouts_&_fitness_plans 2 run_with_map_my_run 2 fabulous:_motivate_me!_meditate,_relax,_sleep 2 newsroom:_news_worth_sharing 2 Name: app, Length: 392, dtype: int64
Okay, let's investigate the ESPN rows in further detail.
duplicates[duplicates['app'].eq('espn')]
       app category  rating reviews                size     installs  type  \
2959  espn   SPORTS     4.2  521138  Varies with device  10,000,000+  Free
3010  espn   SPORTS     4.2  521138  Varies with device  10,000,000+  Free
3018  espn   SPORTS     4.2  521138  Varies with device  10,000,000+  Free
3048  espn   SPORTS     4.2  521140  Varies with device  10,000,000+  Free
3060  espn   SPORTS     4.2  521140  Varies with device  10,000,000+  Free
3072  espn   SPORTS     4.2  521140  Varies with device  10,000,000+  Free

     price content_rating  genres   last_updated         current_ver  \
2959     0   Everyone 10+  Sports  July 19, 2018  Varies with device
3010     0   Everyone 10+  Sports  July 19, 2018  Varies with device
3018     0   Everyone 10+  Sports  July 19, 2018  Varies with device
3048     0   Everyone 10+  Sports  July 19, 2018  Varies with device
3060     0   Everyone 10+  Sports  July 19, 2018  Varies with device
3072     0   Everyone 10+  Sports  July 19, 2018  Varies with device

     android_ver
2959  5.0 and up
3010  5.0 and up
3018  5.0 and up
3048  5.0 and up
3060  5.0 and up
3072  5.0 and up

It's difficult to reach a conclusive decision on how to handle the duplicates. In the case of ESPN the rows do look like duplicates (although the reviews count differs slightly for the last 3 rows). We also assume here that app names are unique, but in reality the same app may appear with its name written in a different format. Let's drop the exact duplicates for this analysis using the pandas built-in drop_duplicates method.
playstore = playstore.drop_duplicates()
playstore.shape
8886 | 13 |
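Note that drop_duplicates only removes rows that match on every column, so near-duplicates such as the ESPN rows with slightly different review counts survive. A stricter alternative, sketched below on made-up data, is to deduplicate on a normalised app name and keep the row with the most reviews. The column app_key and the keep-the-largest rule are assumptions for illustration, and reviews is taken to be numeric already.

```python
import pandas as pd

# Toy rows with made-up values; 'reviews' is assumed to be numeric here.
toy = pd.DataFrame({
    'app': ['ESPN', 'ESPN', 'espn ', 'Box'],
    'reviews': [521138, 521140, 521140, 159872],
})

# Normalise the name so that casing and stray whitespace do not matter.
app_key = toy['app'].str.strip().str.lower()

deduped = (
    toy.assign(app_key=app_key)
       .sort_values('reviews', ascending=False, kind='mergesort')
       .drop_duplicates(subset='app_key', keep='first')  # keep the most-reviewed row
       .drop(columns='app_key')
       .sort_index()
)
print(deduped)
```

Whether the most-reviewed row is the right one to keep depends on how the data was scraped, which is not documented for this dataset.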
Okay, let's consider each of the features individually and investigate them further next. We do this prior to checking the correlations in this dataset because the object dtype features actually contain numerical values which we need to extract manually.
2.1.1. Handling app
This is the name of the app. We could consider extracting numerical features from it; however, for this analysis we drop this feature.
playstore = playstore.drop(['app'], axis='columns')
playstore.shape
8886 | 12 |
2.1.2. Handling rating, reviews, size, installs & price
These are all numerical features, but most of them are stored with the object dtype (rating is already float64). Let's investigate them individually and make the necessary transformations.
playstore['rating'] = playstore['rating'].astype('float')
playstore['rating'].describe()

count    8886.000000
mean        4.187959
std         0.522428
min         1.000000
25%         4.000000
50%         4.300000
75%         4.500000
max         5.000000
Name: rating, dtype: float64
Looks like we have a lot of examples with a high rating; this bias should be kept in mind if we decide to perform any ML tasks. Let's look at reviews next.
playstore['reviews'] = playstore['reviews'].astype('int')
playstore['reviews'].describe()

count    8.886000e+03
mean     4.730928e+05
std      2.906007e+06
min      1.000000e+00
25%      1.640000e+02
50%      4.723000e+03
75%      7.131325e+04
max      7.815831e+07
Name: reviews, dtype: float64
The high numbers are expected since the feature denotes the number of reviews an app got.
Let's look at size next. This is a numerical value, but represented in a more human-friendly string format. On one hand, we could say that a proper data collection process was not followed (leading to technical debt). However, the data was scraped from a website, so the technical debt angle doesn't really apply. The conclusion here is that the technical debt smell only applies in certain contexts. That said, documentation explaining the source of the dataset should have been provided, so the missing documentation smell is still valid for this dataset.
There are several transformations to be done here:
- Strip the letter denoting the unit (kilobytes, megabytes, etc.)
- Fix the unit at the column level; for this analysis we want to denote the size in megabytes.
- Convert the dtype of the column to float.
To start, let's see if all app sizes are in megabytes or not.
size = playstore[['size']].copy()  # NOTE: this is a dataframe; copy so we do not write into a slice

# This regex was a fluke and happened to work on the very first
# attempt. In reality, I did not expect the values to not contain any
# digits!
size['value'] = size['size'].str.extract(r'[\d.]*(.*)')
size.head()

   size value
0   19M     M
1   14M     M
2  8.7M     M
3   25M     M
4  2.8M     M
size['value'].value_counts()
M                     7162
Varies with device    1468
k                      256
Name: value, dtype: int64

Note that the counts add up to the expected number of rows (7162 + 1468 + 256 = 8886).
Looks like we don't have an absolute size for some (denoted by the Varies with device value) and others are represented in kilobytes. Let's make the necessary transformations.
size['size'] = size['size'].replace(to_replace='Varies with device', value=np.nan)
size.isna().any()

size      True
value    False
dtype: bool

size['size'] = size['size'].str.lower()
size['size'] = size['size'].str.rstrip(to_strip='m')
size['size'] = size['size'].str.rstrip(to_strip='k')
size['size'] = size['size'].astype('float')
size['size'].describe()

count    7418.000000
mean       37.592100
std        94.998132
min         1.000000
25%         5.900000
50%        16.000000
75%        37.000000
max       994.000000
Name: size, dtype: float64

Finally, we convert kilobytes to megabytes.
size_kb = size[size['value'].eq('k')].copy()
size_kb['size'] = size_kb['size'] / 1024  # 1024 KB is 1 MB
size_kb
size value 58 0.196289 k 209 0.022461 k 384 0.077148 k 450 0.115234 k 458 0.678711 k ... ... ... 10732 0.456055 k 10755 0.660156 k 10763 0.539062 k 10832 0.568359 k 10833 0.604492 k [256 rows x 2 columns]
size[size['value'].eq('k')] = size_kb
size['size'].describe()

count    7418.000000
mean       22.760481
std        23.439539
min         0.008301
25%         5.100000
50%        14.000000
75%        33.000000
max       100.000000
Name: size, dtype: float64

The mean has reduced (looks like the conversion worked). However, the minimum is now about 0.008 MB (roughly 8.5 KB), which is suspiciously small for an app and may be worth a closer look. Let's merge with the main dataset and drop the missing values we introduced in size!
playstore['size'] = size['size']
playstore = playstore.dropna()
playstore.shape
7418 | 12 |
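The size transformations above can also be condensed into a single helper. The sketch below is one possible way to do it, not the code used in this analysis: size_to_mb is a hypothetical name, and it assumes that every valid value is a number followed by a single k or M, with everything else (such as Varies with device) mapped to NaN.

```python
import numpy as np
import pandas as pd

def size_to_mb(sizes: pd.Series) -> pd.Series:
    """Convert strings such as '19M' or '201k' to megabytes; other values become NaN."""
    # Group 0 is the numeric part, group 1 is the unit letter.
    parts = sizes.str.strip().str.extract(r'^([\d.]+)\s*([kKmM])$')
    number = pd.to_numeric(parts[0], errors='coerce')
    # 1024 KB per MB, the same factor as in the conversion above.
    factor = parts[1].str.lower().map({'k': 1 / 1024, 'm': 1.0})
    return number * factor

# Made-up sample values in the same format as the size column.
toy = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])
mb = size_to_mb(toy)
print(mb)
```

Anchoring the regex with ^ and $ means unexpected formats surface as NaN instead of being silently parsed.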
Next, let's look at installs. Another tricky one. We could strip the plus sign and the commas and convert it to an int or a float. However, the + makes things complicated. For instance, 100+ means at least 100 but fewer than 500 (the next bin is 500+). Thus we have discrete bins, so we should treat this feature as categorical (with a natural ordering).
playstore['installs'].value_counts()
1,000,000+        1229
100,000+          1003
10,000+            948
10,000,000+        762
1,000+             674
5,000,000+         493
500,000+           470
50,000+            431
5,000+             413
100+               297
500+               195
100,000,000+       192
50,000,000+        144
10+                 67
50+                 56
500,000,000+        24
5+                   9
1,000,000,000+       8
1+                   3
Name: installs, dtype: int64

playstore['installs'] = playstore['installs'].str.strip()
playstore['installs'] = playstore['installs'].astype('category')
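A plain category dtype does not record the ordering of the bins. If the hierarchy matters for later steps, an ordered categorical is one option. The sketch below uses made-up values and derives the order from the numeric lower bound of each label; the variable names are illustrative only.

```python
import pandas as pd

# Made-up sample in the same format as the installs column.
installs = pd.Series(['1,000+', '100+', '5,000,000+', '500+', '100+'])

# Sort the distinct labels by their numeric lower bound.
labels = sorted(
    installs.unique(),
    key=lambda s: int(s.rstrip('+').replace(',', '')),
)

# An ordered categorical keeps the labels but knows their order.
dtype = pd.CategoricalDtype(categories=labels, ordered=True)
ordered = installs.astype(dtype)

print(ordered.cat.codes.tolist())  # integer rank of each bin
print(ordered.max())               # largest bin present
```

With an ordered dtype, comparisons such as ordered >= '500+' and sorting follow the bin order and not the alphabetical order of the labels.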
And finally let's look at price.
playstore['price'].value_counts()
0          6873
$0.99       103
$2.99        98
$4.99        61
$1.99        53
           ...
$6.49         1
$1.29         1
$299.99       1
$379.99       1
$1.20         1
Name: price, Length: 68, dtype: int64

Let's remove the $ character and convert it to a float dtype. I assumed that the values were preceded only by the $ character. In reality, there may have been one or more other characters (for example a currency expressed in words). In such a case, we would have used a regex to perform the sanitisation.
playstore['price'] = playstore['price'].str.lstrip(to_strip='$')
playstore['price'] = playstore['price'].astype('float')
playstore['price'].describe()

count    7418.000000
mean        1.117168
std        17.715707
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       400.000000
Name: price, dtype: float64
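The regex fallback mentioned above could look roughly like the following. The input values are invented to show messier formats than the ones actually present in this dataset, and the pattern simply takes the first number it finds in each string.

```python
import pandas as pd

# Invented examples of messier price strings (not from the dataset).
raw = pd.Series(['0', '$0.99', 'USD 2.99', '4.99 dollars', 'Free'])

# Pull out the first integer or decimal number; no match gives NaN.
number = raw.str.extract(r'(\d+(?:\.\d+)?)', expand=False)
price = pd.to_numeric(number, errors='coerce')
print(price)
```

This does not handle thousands separators or decimal commas; those would need an explicit rule for the locale in question.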
2.1.3. Handling category, type, content_rating & genres
These are categorical features, let's start with category.
playstore['category'].value_counts()
FAMILY 1590 GAME 959 TOOLS 633 PERSONALIZATION 277 MEDICAL 277 LIFESTYLE 273 FINANCE 263 SPORTS 232 PRODUCTIVITY 231 BUSINESS 225 PHOTOGRAPHY 225 COMMUNICATION 206 HEALTH_AND_FITNESS 199 SOCIAL 170 NEWS_AND_MAGAZINES 162 SHOPPING 159 TRAVEL_AND_LOCAL 147 BOOKS_AND_REFERENCE 143 DATING 141 VIDEO_PLAYERS 116 MAPS_AND_NAVIGATION 95 EDUCATION 95 FOOD_AND_DRINK 82 ENTERTAINMENT 67 AUTO_AND_VEHICLES 63 LIBRARIES_AND_DEMO 61 ART_AND_DESIGN 58 WEATHER 51 HOUSE_AND_HOME 50 COMICS 49 PARENTING 44 EVENTS 38 BEAUTY 37 Name: category, dtype: int64
The category labels are all distinct (there are no near-duplicate spellings), so we can lowercase them and convert to the category dtype.
playstore['category'] = playstore['category'].str.lower().astype('category')
playstore['category'].describe()

count       7418
unique        33
top       family
freq        1590
Name: category, dtype: object

We look at type next.
playstore['type'].value_counts()
Free    6873
Paid     545
Name: type, dtype: int64

This is essentially a binary representation of the price feature. It should be label encoded (free vs. paid gives a clear two-level ordering).
playstore['type'] = playstore['type'].str.strip().str.lower().astype('category')
playstore['type'].describe()

count     7418
unique       2
top       free
freq      6873
Name: type, dtype: object
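As a small illustration of the label encoding idea, the sketch below maps free to 0 and paid to 1 and checks that the result agrees with price. The toy values are made up, and type_code is a hypothetical column name.

```python
import pandas as pd

# Made-up rows, with type and price already cleaned as above.
toy = pd.DataFrame({
    'type': ['free', 'paid', 'free'],
    'price': [0.0, 4.99, 0.0],
})

# Fix the category order explicitly so that free -> 0 and paid -> 1.
dtype = pd.CategoricalDtype(categories=['free', 'paid'], ordered=True)
toy['type_code'] = toy['type'].astype(dtype).cat.codes

# Sanity check: an app is paid exactly when its price is positive.
consistent = (toy['type_code'].eq(1) == toy['price'].gt(0)).all()
print(toy)
print(consistent)
```

If the check holds on the full dataset, type carries no information beyond price and one of the two could be dropped.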
We look at content_rating next.
playstore['content_rating'].value_counts()

Everyone           5952
Teen                832
Mature 17+          332
Everyone 10+        299
Adults only 18+       2
Unrated               1
Name: content_rating, dtype: int64

playstore['content_rating'] = playstore['content_rating'].str.strip().str.lower().astype('category')
playstore['content_rating'].describe()

count         7418
unique           6
top       everyone
freq          5952
Name: content_rating, dtype: object

And finally genres.
playstore['genres'].value_counts()
Tools 633 Entertainment 428 Education 404 Action 318 Personalization 277 ... Card;Brain Games 1 Lifestyle;Pretend Play 1 Education;Brain Games 1 Comics;Creativity 1 Strategy;Creativity 1 Name: genres, Length: 112, dtype: int64
We encounter something similar to what we have seen in the netflix dataset, specifically for the director feature: an app may have several genres. This feature needs to be one-hot encoded if we decide to treat it as categorical; otherwise we need to extract numerical features from it if we treat it as text. For this analysis, however, we drop it.
playstore = playstore.drop(['genres'], axis='columns')
playstore.shape
7418 | 11 |
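Had we kept genres, the multi-valued one-hot encoding described above could be sketched with the pandas string accessor. The values below are taken from the first rows shown earlier plus Tools; the variable names are illustrative.

```python
import pandas as pd

# A few genre strings in the same ';'-separated format as the dataset.
genres = pd.Series(['Art & Design', 'Art & Design;Pretend Play', 'Tools'])

# One indicator column per distinct genre; an app can have several 1s.
one_hot = genres.str.get_dummies(sep=';')
print(one_hot)
```

With 112 distinct raw values in the column, the number of indicator columns would be smaller than that, since combined labels split into their parts.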
2.1.4. Handling last_updated
This feature contains a timestamp so we convert it to the datetime dtype.
playstore['last_updated'] = pd.to_datetime(playstore['last_updated'].str.strip(), format='%B %d, %Y')
playstore['last_updated'].head()

0   2018-01-07
1   2018-01-15
2   2018-08-01
3   2018-06-08
4   2018-06-20
Name: last_updated, dtype: datetime64[ns]
2.1.5. Handling current_ver & android_ver
These are again tricky. Technically the version number of an app is numerical, but it is not exactly an int or a float. There is also the notion of minor updates and patches. Another point of interest is that some examples have the Varies with device value.
For this analysis we only keep the major version (i.e. the first digit) and drop the rest. Note that capturing a single digit truncates multi-digit major versions (e.g. 10.2 becomes 1). The Varies with device value becomes NaN with this transformation; for this analysis we drop those rows.
playstore['major_ver'] = playstore['current_ver'].str.extract(r'^(\d)')
playstore[['current_ver', 'major_ver']]
current_ver major_ver 0 1.0.0 1 1 2.0.0 2 2 1.2.4 1 3 Varies with device NaN 4 1.1 1 ... ... ... 10833 0.8 0 10834 1.0.0 1 10836 1.48 1 10837 1.0 1 10840 Varies with device NaN [7418 rows x 2 columns]
playstore['major_android_ver'] = playstore['android_ver'].str.extract(r'^(\d)')
playstore[['major_android_ver', 'android_ver']]
major_android_ver android_ver 0 4 4.0.3 and up 1 4 4.0.3 and up 2 4 4.0.3 and up 3 4 4.2 and up 4 4 4.4 and up ... ... ... 10833 2 2.2 and up 10834 4 4.1 and up 10836 4 4.1 and up 10837 4 4.1 and up 10840 NaN Varies with device [7418 rows x 2 columns]
Let's drop the old columns and the rows with missing values, and convert the new version columns to the int dtype.
playstore = playstore.drop(['android_ver', 'current_ver'], axis='columns')
playstore = playstore.dropna()
playstore.shape
7234 | 11 |
version_features = ['major_android_ver', 'major_ver']
playstore[version_features] = playstore[version_features].astype('int')
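The pattern ^(\d) used above captures exactly one digit. The sketch below, on made-up version strings, contrasts it with ^(\d+), which captures the whole leading number. This is an alternative for illustration and not what the analysis above uses.

```python
import pandas as pd

# Made-up version strings, including a two-digit major version.
versions = pd.Series(['1.2.4', '10.3', 'Varies with device', '4.0.3 and up'])

# Single-digit capture, as used above: '10.3' is truncated to '1'.
first_digit = versions.str.extract(r'^(\d)', expand=False)

# Multi-digit capture keeps the full major version.
major = versions.str.extract(r'^(\d+)', expand=False)

# Nullable integer dtype keeps the missing value without dropping the row.
major_int = pd.to_numeric(major).astype('Int64')
print(first_digit.tolist())
print(major_int)
```

Both patterns yield NaN for the same rows, so switching between them would not change which rows are dropped.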
2.1.6. Correlations
Finally, we can check the correlation amongst the numerical features. Let's print a summary of the columns to jog our memory!
playstore.dtypes

category                   category
rating                      float64
reviews                       int64
size                        float64
installs                   category
type                       category
price                       float64
content_rating             category
last_updated         datetime64[ns]
major_ver                     int64
major_android_ver             int64
dtype: object

name = 'heatmap@playstore--corr.png'
# restrict to numerical columns; newer pandas versions no longer drop non-numeric columns silently
corr = playstore.select_dtypes(include='number').corr()
plotter.corr(corr, name)
name
There is some positive correlation between size and reviews; the rest are not correlated with one another.
2.2. Distributional analysis
In this section we analyse the distributions of the features. Let's start with histograms of the numerical features.
name = 'histplot@playstore--numerical.png'
fig, axs = plt.subplots(2, 3, figsize=(15, 10))
sns.histplot(data=playstore, x='rating', kde=True, ax=axs[0, 0])
sns.histplot(data=playstore, x='reviews', kde=True, ax=axs[0, 1])
sns.histplot(data=playstore, x='size', kde=True, ax=axs[0, 2])
sns.histplot(data=playstore, x='price', kde=True, ax=axs[1, 0])
sns.histplot(data=playstore, x='major_ver', kde=True, ax=axs[1, 1])
sns.histplot(data=playstore, x='major_android_ver', kde=True, ax=axs[1, 2])
fig.savefig(name)
name
The distributions of rating and size are skewed. In this dataset most apps have a rating of 4 or higher. Many apps also seem to be less than 20 MB in size.
The histogram is not a useful visualisation for reviews and price. reviews is a count with a very long right tail (in the summary above the median was about 4.7 thousand while the maximum was about 78 million), so on a linear axis nearly all of the mass collapses into the first bin. As for price, we have many apps which are free or priced at $0.99.
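One way to make a long-tailed count such as reviews readable is to bin it on a log scale. The sketch below uses a handful of values loosely based on the summary statistics shown earlier (minimum, quartiles and maximum) and counts how many fall into each power-of-ten bin; it is an illustration and not part of the plots above.

```python
import numpy as np
import pandas as pd

# Values loosely based on the earlier describe() output for reviews.
reviews = pd.Series([1, 164, 4723, 71313, 78158306])

# log10 turns multiplicative gaps into additive ones.
log_reviews = np.log10(reviews)

# One bin per power of ten: [1, 10), [10, 100), ..., [10^7, 10^8].
counts, edges = np.histogram(log_reviews, bins=np.arange(0, 9))
print(counts.tolist())
```

seaborn's histplot also accepts a log_scale argument, which may achieve a similar effect directly in the plot.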
major_ver & major_android_ver seem to have discrete values, so we may wish to consider them as categorical features.
Let's investigate the categorical features next. Since category & installs have many values, we use separate plots for them.
name = 'histplot@playstore--category.png'
fig, ax = plt.subplots(figsize=(20, 10))
sns.histplot(data=playstore, x='category', discrete=True, ax=ax)
plt.xticks(rotation=90)
plt.tight_layout()
fig.savefig(name)
name
The top three categories are family, game & tools.
name = 'histplot@playstore--installs.png'
fig, ax = plt.subplots(figsize=(20, 10))
sns.histplot(data=playstore, x='installs', discrete=True, ax=ax)
plt.xticks(rotation=90)
plt.tight_layout()
fig.savefig(name)
name

This plot is not that informative; however, we should be able to find some nice insights in the categorical analysis section. Let's look at type & content_rating next.
name = 'histplot@playstore--categorical.png'
categorical_features = ['type', 'content_rating']
fig, axs = plt.subplots(1, 2)
for idx, feature in enumerate(categorical_features):
    sns.histplot(data=playstore, y=feature, discrete=True, ax=axs[idx])
plt.tight_layout()
fig.savefig(name)
name