Playstore

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the playstore dataset. Unfortunately, there are no data docs for this dataset. Note that the original dataset consists of two csv files. One of them contains user reviews (i.e. text data), so for this analysis we only consider googleplaystore.csv.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

playstore = pd.read_csv('../data/data/playstore.csv')
playstore.head()
                                                  App        Category  Rating  \
0      Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
1                                 Coloring book moana  ART_AND_DESIGN     3.9   
2  U Launcher Lite – FREE Live Cool Themes, Hide Apps  ART_AND_DESIGN     4.7   
3                               Sketch - Draw & Paint  ART_AND_DESIGN     4.5   
4               Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3   

  Reviews  Size     Installs  Type Price Content Rating  \
0     159   19M      10,000+  Free     0       Everyone   
1     967   14M     500,000+  Free     0       Everyone   
2   87510  8.7M   5,000,000+  Free     0       Everyone   
3  215644   25M  50,000,000+  Free     0           Teen   
4     967  2.8M     100,000+  Free     0       Everyone   

                      Genres      Last Updated         Current Ver  \
0               Art & Design   January 7, 2018               1.0.0   
1  Art & Design;Pretend Play  January 15, 2018               2.0.0   
2               Art & Design    August 1, 2018               1.2.4   
3               Art & Design      June 8, 2018  Varies with device   
4    Art & Design;Creativity     June 20, 2018                 1.1   

    Android Ver  
0  4.0.3 and up  
1  4.0.3 and up  
2  4.0.3 and up  
3    4.2 and up  
4    4.4 and up  
playstore.shape
10841 13
playstore.dtypes
App                object
Category           object
Rating            float64
Reviews            object
Size               object
Installs           object
Type               object
Price              object
Content Rating     object
Genres             object
Last Updated       object
Current Ver        object
Android Ver        object
dtype: object

We can see lots of interesting features (and potential smells)! To start, let's rename the columns for simplicity.

playstore.columns = playstore.columns.str.strip().str.lower().str.replace(' ', '_')
playstore.columns
Index(['app', 'category', 'rating', 'reviews', 'size', 'installs', 'type',
       'price', 'content_rating', 'genres', 'last_updated', 'current_ver',
       'android_ver'],
      dtype='object')

Let's look at the descriptive statistics, missing values and duplicates next.

playstore.describe(include='all')
           app category       rating reviews                size    installs  \
count    10841    10841  9367.000000   10841               10841       10841   
unique    9660       34          NaN    6002                 462          22   
top     ROBLOX   FAMILY          NaN       0  Varies with device  1,000,000+   
freq         9     1972          NaN     596                1695        1579   
mean       NaN      NaN     4.193338     NaN                 NaN         NaN   
std        NaN      NaN     0.537431     NaN                 NaN         NaN   
min        NaN      NaN     1.000000     NaN                 NaN         NaN   
25%        NaN      NaN     4.000000     NaN                 NaN         NaN   
50%        NaN      NaN     4.300000     NaN                 NaN         NaN   
75%        NaN      NaN     4.500000     NaN                 NaN         NaN   
max        NaN      NaN    19.000000     NaN                 NaN         NaN   

         type  price content_rating genres    last_updated  \
count   10840  10841          10840  10841           10841   
unique      3     93              6    120            1378   
top      Free      0       Everyone  Tools  August 3, 2018   
freq    10039  10040           8714    842             326   
mean      NaN    NaN            NaN    NaN             NaN   
std       NaN    NaN            NaN    NaN             NaN   
min       NaN    NaN            NaN    NaN             NaN   
25%       NaN    NaN            NaN    NaN             NaN   
50%       NaN    NaN            NaN    NaN             NaN   
75%       NaN    NaN            NaN    NaN             NaN   
max       NaN    NaN            NaN    NaN             NaN   

               current_ver android_ver  
count                10833       10838  
unique                2832          33  
top     Varies with device  4.1 and up  
freq                  1459        2451  
mean                   NaN         NaN  
std                    NaN         NaN  
min                    NaN         NaN  
25%                    NaN         NaN  
50%                    NaN         NaN  
75%                    NaN         NaN  
max                    NaN         NaN  
playstore.isna().any()
app               False
category          False
rating             True
reviews           False
size              False
installs          False
type               True
price             False
content_rating     True
genres            False
last_updated      False
current_ver        True
android_ver        True
dtype: bool

We have some missing values, let's investigate further.

# show the rows with at least one missing value
playstore[playstore.isna().any(axis='columns')]
                                   app             category  rating reviews  \
15     Learn To Draw Kawaii Characters       ART_AND_DESIGN     3.2      55   
23              Mcqueen Coloring pages       ART_AND_DESIGN     NaN      61   
113          Wrinkles and rejuvenation               BEAUTY     NaN     182   
123             Manicure - nail design               BEAUTY     NaN     119   
126       Skin Care and Natural Beauty               BEAUTY     NaN     654   
...                                ...                  ...     ...     ...   
10824                        Cardio-FR              MEDICAL     NaN      67   
10825               Naruto & Boruto FR               SOCIAL     NaN       7   
10831         payermonstationnement.fr  MAPS_AND_NAVIGATION     NaN      38   
10835                         FR Forms             BUSINESS     NaN       0   
10838           Parkinson Exercices FR              MEDICAL     NaN       3   

       size  installs  type price content_rating  \
15     2.7M    5,000+  Free     0       Everyone   
23     7.0M  100,000+  Free     0       Everyone   
113    5.7M  100,000+  Free     0   Everyone 10+   
123    3.7M   50,000+  Free     0       Everyone   
126    7.4M  100,000+  Free     0           Teen   
...     ...       ...   ...   ...            ...   
10824   82M   10,000+  Free     0       Everyone   
10825  7.7M      100+  Free     0           Teen   
10831  9.8M    5,000+  Free     0       Everyone   
10835  9.6M       10+  Free     0       Everyone   
10838  9.5M    1,000+  Free     0       Everyone   

                                genres        last_updated current_ver  \
15                        Art & Design        June 6, 2018         NaN   
23     Art & Design;Action & Adventure       March 7, 2018       1.0.0   
113                             Beauty  September 20, 2017         8.0   
123                             Beauty       July 23, 2018         1.3   
126                             Beauty       July 17, 2018        1.15   
...                                ...                 ...         ...   
10824                          Medical       July 31, 2018       2.2.2   
10825                           Social    February 2, 2018         1.0   
10831                Maps & Navigation       June 13, 2018   2.0.148.0   
10835                         Business  September 29, 2016       1.1.5   
10838                          Medical    January 20, 2017         1.0   

      android_ver  
15     4.2 and up  
23     4.1 and up  
113    3.0 and up  
123    4.1 and up  
126    4.1 and up  
...           ...  
10824  4.4 and up  
10825  4.0 and up  
10831  4.0 and up  
10835  4.0 and up  
10838  2.2 and up  

[1481 rows x 13 columns]

1481 rows contain at least one missing value. For this dataset, imputation is a potential solution (for instance, we could impute type from price and rating from reviews, under some assumptions of course). However, for this analysis we simply drop them.

playstore = playstore.dropna()
playstore.shape
9360 13
playstore[playstore.duplicated()].shape
474 13
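
As an aside, the imputation idea mentioned above could look roughly like the sketch below. It uses a small made-up frame rather than the real data, and it assumes that a price of 0 implies a free app and that the median rating is an acceptable fill value (a simpler stand-in for deriving rating from reviews). Both are assumptions, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
toy = pd.DataFrame({
    'type': ['Free', None, 'Paid', None],
    'price': ['0', '0', '$2.99', '$0.99'],
    'rating': [4.0, np.nan, 3.0, 5.0],
})

# Assumption: a price of '0' means the app is free, anything else is paid.
inferred_type = pd.Series(np.where(toy['price'].eq('0'), 'Free', 'Paid'),
                          index=toy.index)
toy['type'] = toy['type'].fillna(inferred_type)

# Assumption: the median rating is a reasonable stand-in for a missing one.
toy['rating'] = toy['rating'].fillna(toy['rating'].median())
```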

We have duplicates as well, let's investigate further.

duplicates = playstore[playstore.duplicated(keep=False)].copy() # copy so we can modify it safely
duplicates
                                                   app             category  \
164                                       Ebook Reader  BOOKS_AND_REFERENCE   
192                      Docs To Go™ Free Office Suite             BUSINESS   
193                                 Google My Business             BUSINESS   
204                                                Box             BUSINESS   
213                                ZOOM Cloud Meetings             BUSINESS   
...                                                ...                  ...   
8643                    Wunderlist: To-Do List & Tasks         PRODUCTIVITY   
8654   TickTick: To Do List with Reminder, Day Planner         PRODUCTIVITY   
8658                           ColorNote Notepad Notes         PRODUCTIVITY   
10049        Airway Ex - Intubate. Anesthetize. Train.              MEDICAL   
10768                                             AAFP              MEDICAL   

       rating  reviews                size      installs  type price  \
164       4.1    85842                 37M    5,000,000+  Free     0   
192       4.1   217730  Varies with device   50,000,000+  Free     0   
193       4.4    70991  Varies with device    5,000,000+  Free     0   
204       4.2   159872  Varies with device   10,000,000+  Free     0   
213       4.4    31614                 37M   10,000,000+  Free     0   
...       ...      ...                 ...           ...   ...   ...   
8643      4.6   404610  Varies with device   10,000,000+  Free     0   
8654      4.6    25370  Varies with device    1,000,000+  Free     0   
8658      4.6  2401017  Varies with device  100,000,000+  Free     0   
10049     4.3      123                 86M       10,000+  Free     0   
10768     3.8       63                 24M       10,000+  Free     0   

      content_rating             genres    last_updated         current_ver  \
164         Everyone  Books & Reference   June 25, 2018               5.0.6   
192         Everyone           Business   April 2, 2018  Varies with device   
193         Everyone           Business   July 24, 2018    2.19.0.204537701   
204         Everyone           Business   July 31, 2018  Varies with device   
213         Everyone           Business   July 20, 2018      4.1.28165.0716   
...              ...                ...             ...                 ...   
8643        Everyone       Productivity   April 6, 2018  Varies with device   
8654        Everyone       Productivity  August 6, 2018  Varies with device   
8658        Everyone       Productivity   June 27, 2018  Varies with device   
10049       Everyone            Medical    June 1, 2018              0.6.88   
10768       Everyone            Medical   June 22, 2018               2.3.1   

              android_ver  
164            4.0 and up  
192    Varies with device  
193            4.4 and up  
204    Varies with device  
213            4.0 and up  
...                   ...  
8643   Varies with device  
8654   Varies with device  
8658   Varies with device  
10049          5.0 and up  
10768          5.0 and up  

[876 rows x 13 columns]

It's a bit difficult to observe the duplicates like this. We can group by the name of the apps and investigate further.

duplicates['app'] = duplicates['app'].str.strip().str.lower().str.replace(' ', '_')
duplicates['app'].value_counts()
cbs_sports_app_-_scores,_news,_stats_&_watch_live     7
espn                                                  6
bleacher_report:_sports_news,_scores,_&_highlights    5
thescore:_live_sports_scores,_news,_stats_&_videos    5
watchespn                                             4
                                                     ..
fitbit_coach                                          2
nike_training_club_-_workouts_&_fitness_plans         2
run_with_map_my_run                                   2
fabulous:_motivate_me!_meditate,_relax,_sleep         2
newsroom:_news_worth_sharing                          2
Name: app, Length: 392, dtype: int64

Okay, let's investigate the ESPN rows in further detail.

duplicates[duplicates['app'].eq('espn')]
       app category  rating reviews                size     installs  type  \
2959  espn   SPORTS     4.2  521138  Varies with device  10,000,000+  Free   
3010  espn   SPORTS     4.2  521138  Varies with device  10,000,000+  Free   
3018  espn   SPORTS     4.2  521138  Varies with device  10,000,000+  Free   
3048  espn   SPORTS     4.2  521140  Varies with device  10,000,000+  Free   
3060  espn   SPORTS     4.2  521140  Varies with device  10,000,000+  Free   
3072  espn   SPORTS     4.2  521140  Varies with device  10,000,000+  Free   

     price content_rating  genres   last_updated         current_ver  \
2959     0   Everyone 10+  Sports  July 19, 2018  Varies with device   
3010     0   Everyone 10+  Sports  July 19, 2018  Varies with device   
3018     0   Everyone 10+  Sports  July 19, 2018  Varies with device   
3048     0   Everyone 10+  Sports  July 19, 2018  Varies with device   
3060     0   Everyone 10+  Sports  July 19, 2018  Varies with device   
3072     0   Everyone 10+  Sports  July 19, 2018  Varies with device   

     android_ver  
2959  5.0 and up  
3010  5.0 and up  
3018  5.0 and up  
3048  5.0 and up  
3060  5.0 and up  
3072  5.0 and up  

It's difficult to reach a conclusive decision on how to handle the duplicates. In the case of ESPN the rows do look like duplicates (although the review count differs slightly for the last 3 rows). We also assume here that app names are unique, but in reality the same app may appear under a differently formatted name. Let's drop the duplicates for this analysis using pandas' built-in method, which only removes rows that are identical in every column.

playstore = playstore.drop_duplicates()
playstore.shape
8886 13
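
To make the behaviour concrete: drop_duplicates without arguments only removes rows that match in every column, so rows like the ESPN ones, which differ only in reviews, survive as separate entries. The sketch below uses a made-up frame. Deduplicating on app alone and keeping the row with the most reviews is one possible policy, not what this analysis does.

```python
import pandas as pd

# Toy frame (made up) mimicking the ESPN situation above.
toy = pd.DataFrame({
    'app': ['ESPN', 'ESPN', 'ESPN', 'Box'],
    'reviews': [521138, 521138, 521140, 159872],
})

# Exact-match deduplication: ESPN rows with different review counts both remain.
exact = toy.drop_duplicates()

# Hypothetical alternative: one row per app name, keeping the highest review count.
per_app = (toy.sort_values('reviews', ascending=False)
              .drop_duplicates(subset='app')
              .sort_index())
```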

Okay, let's consider each of the features individually and investigate them further. We do this before checking the correlations in this dataset because several object dtype features actually contain numerical values, which we need to extract manually.

2.1.1. Handling app

This is the name of the app. We could consider extracting numerical features from it; however, for this analysis we drop this feature.

playstore = playstore.drop(['app'], axis='columns')
playstore.shape
8886 12

2.1.2. Handling rating, reviews, size, installs & price

These are all numerical features, but most of them are stored with the object dtype (rating is already float64). Let's investigate them individually and make the necessary transformations.

playstore['rating'] = playstore['rating'].astype('float')
playstore['rating'].describe()
count    8886.000000
mean        4.187959
std         0.522428
min         1.000000
25%         4.000000
50%         4.300000
75%         4.500000
max         5.000000
Name: rating, dtype: float64

Looks like we have a lot of examples with a high rating. This bias should be kept in mind if we decide to perform any ML tasks. Let's look at reviews next.

playstore['reviews'] = playstore['reviews'].astype('int')
playstore['reviews'].describe()
count    8.886000e+03
mean     4.730928e+05
std      2.906007e+06
min      1.000000e+00
25%      1.640000e+02
50%      4.723000e+03
75%      7.131325e+04
max      7.815831e+07
Name: reviews, dtype: float64

The high numbers are expected since the feature denotes the number of reviews an app got.

Let's look at size next. This is a numerical value stored in a human-friendly string format. On one hand, one could argue that a proper data collection process was not followed (leading to technical debt). On the other hand, the data was scraped from a website, so the technical debt argument does not really hold; the smell only applies in certain contexts. However, documentation explaining the source of the dataset should have been provided, and in that respect the smell is still valid for this dataset.

There are several transformations to be done here:

  1. Strip the letter denoting the unit (kilobytes, megabytes, etc.)
  2. Fix the unit at the column level; for this analysis we want to express the size in megabytes.
  3. Convert the dtype of the column to float.

To start, let's see whether all app sizes are in megabytes.

size = playstore[['size']].copy() # NOTE: this is a dataframe; copy so we can modify it safely
# This regex was a fluke and happened to work on the very first
# attempt. In reality, I did not expect some values to contain no
# digits at all!
size['value'] = size['size'].str.extract(r'[\d.]*(.*)', expand=False)
size.head()
   size value
0   19M     M
1   14M     M
2  8.7M     M
3   25M     M
4  2.8M     M
size['value'].value_counts()
M                     7162
Varies with device    1468
k                      256
Name: value, dtype: int64

The counts add up to 8886 (7162 + 1468 + 256), which matches the number of rows in the dataset.

Looks like we don't have an absolute size for some (denoted by the Varies with device value) and others are represented in kilobytes. Let's make the necessary transformations.

size['size'] = size['size'].replace(to_replace='Varies with device', value=np.nan)
size.isna().any()
size      True
value    False
dtype: bool
size['size'] = size['size'].str.lower()
size['size'] = size['size'].str.rstrip(to_strip='m')
size['size'] = size['size'].str.rstrip(to_strip='k')
size['size'] = size['size'].astype('float')
size['size'].describe()
count    7418.000000
mean       37.592100
std        94.998132
min         1.000000
25%         5.900000
50%        16.000000
75%        37.000000
max       994.000000
Name: size, dtype: float64

Finally, we convert kilobytes to megabytes.

size_kb = size[size['value'].eq('k')].copy() # copy so we can modify it safely
size_kb['size'] = size_kb['size'] / 1024 # 1024 kB is 1 MB
size_kb
           size value
58     0.196289     k
209    0.022461     k
384    0.077148     k
450    0.115234     k
458    0.678711     k
...         ...   ...
10732  0.456055     k
10755  0.660156     k
10763  0.539062     k
10832  0.568359     k
10833  0.604492     k

[256 rows x 2 columns]
size[size['value'].eq('k')] = size_kb
size['size'].describe()
count    7418.000000
mean       22.760481
std        23.439539
min         0.008301
25%         5.100000
50%        14.000000
75%        33.000000
max       100.000000
Name: size, dtype: float64
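
The same conversion can be sketched in a single pass by splitting each string into a number and a unit and multiplying by a per-unit factor. The helper name size_to_mb and the sample values are made up for illustration; the factor of 1024 kB per MB follows the conversion used above.

```python
import numpy as np
import pandas as pd

def size_to_mb(sizes: pd.Series) -> pd.Series:
    """Convert strings such as '19M' or '201k' to megabytes; anything else becomes NaN."""
    # Split into a numeric part and a unit letter; non-matching rows yield NaN.
    parts = sizes.str.strip().str.extract(r'^(?P<number>[\d.]+)(?P<unit>[kKmM])$')
    # Multiplier that brings each unit to megabytes.
    factor = parts['unit'].str.lower().map({'k': 1 / 1024, 'm': 1.0})
    return pd.to_numeric(parts['number']) * factor

sample = pd.Series(['19M', '8.7M', '201k', 'Varies with device'])
converted = size_to_mb(sample)
```

Keeping the unit handling in a lookup table means a further unit (for example gigabytes) would only need one more entry in the mapping and in the regex.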

The mean has dropped, which suggests the conversion worked. However, the minimum is now about 0.008 MB (roughly 8.5 kB), which is suspiciously small for an app and may indicate outliers. Let's merge with the main dataset and drop the missing values we introduced in size!

playstore['size'] = size['size']
playstore = playstore.dropna()
playstore.shape
7418 12

Next, let's look at installs, another tricky one. We could strip the plus sign from the end, remove the commas and convert the values to an int or a float. However, the + makes things complicated. Judging by the values below, 100+ means at least 100 installs but fewer than 500 (the next bin). Thus we have discrete bins, so we should consider this feature categorical, with an inherent order.

playstore['installs'].value_counts()
1,000,000+        1229
100,000+          1003
10,000+            948
10,000,000+        762
1,000+             674
5,000,000+         493
500,000+           470
50,000+            431
5,000+             413
100+               297
500+               195
100,000,000+       192
50,000,000+        144
10+                 67
50+                 56
500,000,000+        24
5+                   9
1,000,000,000+       8
1+                   3
Name: installs, dtype: int64
playstore['installs'] = playstore['installs'].str.strip()
playstore['installs'] = playstore['installs'].astype('category')
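
Note that astype('category') creates an unordered categorical, and sorting the raw strings would not follow the numeric order of the bins. A minimal sketch of an ordered version follows, using a made-up sample. Ranking the bins by their numeric lower bound is an assumption about how they should be ordered.

```python
import pandas as pd

# Made-up sample of install bins.
installs = pd.Series(['1,000+', '10+', '500+', '100+', '10+'])

# Numeric lower bound of each bin, e.g. '1,000+' -> 1000.
lower_bound = installs.str.replace(r'[,+]', '', regex=True).astype('int')

# Order the bin labels by their lower bound and build an ordered categorical.
ordered_labels = (pd.DataFrame({'label': installs, 'bound': lower_bound})
                    .drop_duplicates()
                    .sort_values('bound')['label']
                    .tolist())
installs_cat = pd.Categorical(installs, categories=ordered_labels, ordered=True)
```

With an ordered categorical, comparisons such as installs_cat < '500+' and calls such as installs_cat.max() respect the bin order.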

And finally let's look at price.

playstore['price'].value_counts()
0          6873
$0.99       103
$2.99        98
$4.99        61
$1.99        53
           ... 
$6.49         1
$1.29         1
$299.99       1
$379.99       1
$1.20         1
Name: price, Length: 68, dtype: int64

Let's remove the $ character and convert the column to a float dtype. I assumed that the values were preceded by only the $. In reality, there may have been one or more characters (currency expressed in words) or different symbols. In such a case, we would have used a regex to perform the sanitation.

playstore['price'] = playstore['price'].str.lstrip(to_strip='$')
playstore['price'] = playstore['price'].astype('float')
playstore['price'].describe()
count    7418.000000
mean        1.117168
std        17.715707
min         0.000000
25%         0.000000
50%         0.000000
75%         0.000000
max       400.000000
Name: price, dtype: float64
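
For messier price strings, the regex-based clean-up mentioned above might look like the sketch below. The sample values (a currency word, a thousands separator) are invented to illustrate the idea; the output above does not suggest that they occur in this dataset.

```python
import pandas as pd

# Invented examples of messy price strings.
raw = pd.Series(['0', '$4.99', 'USD 1.20', '$1,299.00'])

# Drop everything except digits and the decimal point, then convert.
# Assumption: commas are thousands separators, never decimal marks.
price = pd.to_numeric(raw.str.replace(r'[^\d.]', '', regex=True))
```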

2.1.3. Handling category, type, content_rating & genres

These are categorical features, let's start with category.

playstore['category'].value_counts()
FAMILY                 1590
GAME                    959
TOOLS                   633
PERSONALIZATION         277
MEDICAL                 277
LIFESTYLE               273
FINANCE                 263
SPORTS                  232
PRODUCTIVITY            231
BUSINESS                225
PHOTOGRAPHY             225
COMMUNICATION           206
HEALTH_AND_FITNESS      199
SOCIAL                  170
NEWS_AND_MAGAZINES      162
SHOPPING                159
TRAVEL_AND_LOCAL        147
BOOKS_AND_REFERENCE     143
DATING                  141
VIDEO_PLAYERS           116
MAPS_AND_NAVIGATION      95
EDUCATION                95
FOOD_AND_DRINK           82
ENTERTAINMENT            67
AUTO_AND_VEHICLES        63
LIBRARIES_AND_DEMO       61
ART_AND_DESIGN           58
WEATHER                  51
HOUSE_AND_HOME           50
COMICS                   49
PARENTING                44
EVENTS                   38
BEAUTY                   37
Name: category, dtype: int64

The category labels are all distinct (no label appears under two different spellings), so we can lowercase them and convert to the category dtype.

playstore['category'] = playstore['category'].str.lower().astype('category')
playstore['category'].describe()
count       7418
unique        33
top       family
freq        1590
Name: category, dtype: object

We look at type next.

playstore['type'].value_counts()
Free    6873
Paid     545
Name: type, dtype: int64

This is essentially a binary representation of the price feature. It should be label encoded, since the two values have a clear hierarchy.

playstore['type'] = playstore['type'].str.strip().str.lower().astype('category')
playstore['type'].describe()
count     7418
unique       2
top       free
freq      6873
Name: type, dtype: object
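
A label encoding along those lines could be sketched as follows. Mapping free to 0 and paid to 1 is an assumed ordering, and the sample is made up.

```python
import pandas as pd

# Made-up sample of the type feature.
app_type = pd.Series(['free', 'paid', 'free'], dtype='category')

# Assumed ordering: free < paid.
app_type = app_type.cat.set_categories(['free', 'paid'], ordered=True)

# Integer codes follow the category order: free -> 0, paid -> 1.
type_code = app_type.cat.codes
```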

We look at content_rating next.

playstore['content_rating'].value_counts()
Everyone           5952
Teen                832
Mature 17+          332
Everyone 10+        299
Adults only 18+       2
Unrated               1
Name: content_rating, dtype: int64
playstore['content_rating'] = playstore['content_rating'].str.strip().str.lower().astype('category')
playstore['content_rating'].describe()
count         7418
unique           6
top       everyone
freq          5952
Name: content_rating, dtype: object

And finally genres.

playstore['genres'].value_counts()
Tools                     633
Entertainment             428
Education                 404
Action                    318
Personalization           277
                         ... 
Card;Brain Games            1
Lifestyle;Pretend Play      1
Education;Brain Games       1
Comics;Creativity           1
Strategy;Creativity         1
Name: genres, Length: 112, dtype: int64

We encounter something similar to what we have seen in the Netflix dataset, specifically for the director feature: an app may have several genres. This feature needs to be one-hot encoded if we decide to treat it as categorical; otherwise, if we consider it as text, we need to extract numerical features from it. For this analysis, however, we drop it.

playstore = playstore.drop(['genres'], axis='columns')
playstore.shape
7418 11
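
Had we kept genres, the multi-valued entries could be turned into indicator columns. The sketch below uses three values taken from the output above and assumes that ; is the only separator between genres.

```python
import pandas as pd

genres = pd.Series(['Art & Design', 'Art & Design;Pretend Play', 'Tools'])

# One indicator column per genre; an app with several genres gets several 1s.
genre_indicators = genres.str.get_dummies(sep=';')
```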

2.1.4. Handling last_updated

This feature contains a timestamp so we convert it to the datetime dtype.

playstore['last_updated'] = pd.to_datetime(playstore['last_updated'].str.strip(), format='%B %d, %Y')
playstore['last_updated'].head()
0   2018-01-07
1   2018-01-15
2   2018-08-01
3   2018-06-08
4   2018-06-20
Name: last_updated, dtype: datetime64[ns]

2.1.5. Handling current_ver & android_ver

These are again tricky. Technically the version number of an app is numerical, but it is not exactly an int or a float. There is also the notion of minor updates and patches. Another point of interest is that some examples have the Varies with device value.

For this analysis we only keep the major version (i.e. the first digit) and drop the rest. The Varies with device value becomes NaN with this transformation; we drop those rows.

playstore['major_ver'] = playstore['current_ver'].str.extract(r'^(\d)')
playstore[['current_ver', 'major_ver']]
              current_ver major_ver
0                   1.0.0         1
1                   2.0.0         2
2                   1.2.4         1
3      Varies with device       NaN
4                     1.1         1
...                   ...       ...
10833                 0.8         0
10834               1.0.0         1
10836                1.48         1
10837                 1.0         1
10840  Varies with device       NaN

[7418 rows x 2 columns]
playstore['major_android_ver'] = playstore['android_ver'].str.extract(r'^(\d)')
playstore[['major_android_ver', 'android_ver']]
      major_android_ver         android_ver
0                     4        4.0.3 and up
1                     4        4.0.3 and up
2                     4        4.0.3 and up
3                     4          4.2 and up
4                     4          4.4 and up
...                 ...                 ...
10833                 2          2.2 and up
10834                 4          4.1 and up
10836                 4          4.1 and up
10837                 4          4.1 and up
10840               NaN  Varies with device

[7418 rows x 2 columns]
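
One caveat with the pattern ^(\d): it captures exactly one digit, so a version such as 10.2.1 would be recorded as major version 1. The sketch below contrasts it with a variant that captures all leading digits. The sample values are invented, and the analysis itself keeps the single-digit pattern.

```python
import pandas as pd

# Invented version strings.
versions = pd.Series(['1.2.4', '10.2.1', 'Varies with device'])

# Pattern used in this analysis: first digit only.
single_digit = versions.str.extract(r'^(\d)', expand=False)

# Variant: all leading digits, so '10.2.1' yields '10'.
all_digits = versions.str.extract(r'^(\d+)', expand=False)
```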

Let's drop the old columns and the rows with missing values, and convert the two new version columns to the int dtype.

playstore = playstore.drop(['android_ver', 'current_ver'], axis='columns')
playstore = playstore.dropna()
playstore.shape
7234 11
version_features = ['major_android_ver',
                    'major_ver']
playstore[version_features] = playstore[version_features].astype('int')

2.1.6. Correlations

Finally, we can check the correlation amongst the numerical features. Let's print a summary of the columns to jog our memory!

playstore.dtypes
category                   category
rating                      float64
reviews                       int64
size                        float64
installs                   category
type                       category
price                       float64
content_rating             category
last_updated         datetime64[ns]
major_ver                     int64
major_android_ver             int64
dtype: object
name = 'heatmap@playstore--corr.png'
corr = playstore.select_dtypes('number').corr() # restrict to numerical columns explicitly
plotter.corr(corr, name)
name

heatmap@playstore--corr.png

There is some positive correlation between size and reviews; the remaining features show little correlation with one another.

2.2. Distributional analysis

In this section we analyse the distributions of the features. Let's start with a histogram of each numerical feature.

name = 'histplot@playstore--numerical.png'

fig, axs = plt.subplots(2, 3, figsize=(15, 10))
sns.histplot(data=playstore, x='rating', kde=True, ax=axs[0, 0])
sns.histplot(data=playstore, x='reviews', kde=True, ax=axs[0, 1])
sns.histplot(data=playstore, x='size', kde=True, ax=axs[0, 2])

sns.histplot(data=playstore, x='price', kde=True, ax=axs[1, 0])
sns.histplot(data=playstore, x='major_ver', kde=True, ax=axs[1, 1])
sns.histplot(data=playstore, x='major_android_ver', kde=True, ax=axs[1, 2])
fig.savefig(name)
name

histplot@playstore--numerical.png

The distributions of rating and size are skewed. In this dataset most apps have a rating of 4 or higher. Many apps also seem to be less than 20 MB in size.

The histogram is not a useful visualisation for reviews and price. reviews is extremely right-skewed: in the summary statistics above the median is in the thousands while the maximum is in the tens of millions, so on a linear axis almost all apps fall into the first bin. As for price, most apps are free, and 0.99 is the most common price among the paid ones.
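
One way to make such a long-tailed count readable is to bin it on a log scale. The sketch below uses made-up review counts and plain NumPy; the bin count and range are arbitrary choices for illustration.

```python
import numpy as np

# Made-up review counts spanning several orders of magnitude.
reviews = np.array([1, 12, 150, 4_700, 71_000, 2_400_000, 78_000_000])

# Linear bins: almost everything lands in the first bin.
linear_counts, _ = np.histogram(reviews, bins=4)

# log10 compresses the long right tail; each unit is one order of magnitude.
log_reviews = np.log10(reviews)
log_counts, edges = np.histogram(log_reviews, bins=4, range=(0, 8))
```

Here the linear bins put six of the seven values into the first bin, whereas the log-scaled bins spread them out. seaborn's histplot also accepts a log_scale argument that achieves a similar effect directly in the plot.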

major_ver & major_android_ver seem to have discrete values, we may wish to consider them as categorical features.

Let's investigate the categorical features next. Since category & installs have many values, we use separate plots for them.

name = 'histplot@playstore--category.png'

fig, ax = plt.subplots(figsize=(20, 10))
sns.histplot(data=playstore, x='category', discrete=True, ax=ax)
plt.xticks(rotation=90)
plt.tight_layout()
fig.savefig(name)
name

histplot@playstore--category.png

The top three categories are family, game & tools.

name = 'histplot@playstore--installs.png'

fig, ax = plt.subplots(figsize=(20, 10))
sns.histplot(data=playstore, x='installs', discrete=True, ax=ax)
plt.xticks(rotation=90)
plt.tight_layout()
fig.savefig(name)
name

histplot@playstore--installs.png

This plot is not that informative; however, we should be able to find some nice insights in the categorical analysis section. Let's look at type & content_rating next.

name = 'histplot@playstore--categorical.png'
categorical_features = ['type',
                        'content_rating']

fig, axs = plt.subplots(1, 2)
for idx, feature in enumerate(categorical_features):
    sns.histplot(data=playstore, y=feature, discrete=True, ax=axs[idx])

plt.tight_layout()
fig.savefig(name)
name

histplot@playstore--categorical.png

Date: 2021-10-22 Fri 00:00

Created: 2021-10-23 Sat 20:01