Avocado

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the avocado dataset. We start by reading the data docs, which describe the source of the dataset and the meaning of its attributes.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

avocado = pd.read_csv('../data/data/avocado.csv')
avocado.head()
   Unnamed: 0        Date  AveragePrice  Total Volume     4046       4225  \
0           0  2015-12-27          1.33      64236.62  1036.74   54454.85   
1           1  2015-12-20          1.35      54876.98   674.28   44638.81   
2           2  2015-12-13          0.93     118220.22   794.70  109149.67   
3           3  2015-12-06          1.08      78992.15  1132.00   71976.41   
4           4  2015-11-29          1.28      51039.60   941.48   43838.39   

     4770  Total Bags  Small Bags  Large Bags  XLarge Bags          type  \
0   48.16     8696.87     8603.62       93.25          0.0  conventional   
1   58.33     9505.56     9408.07       97.49          0.0  conventional   
2  130.50     8145.35     8042.21      103.14          0.0  conventional   
3   72.58     5811.16     5677.40      133.76          0.0  conventional   
4   75.78     6183.95     5986.26      197.69          0.0  conventional   

   year  region  
0  2015  Albany  
1  2015  Albany  
2  2015  Albany  
3  2015  Albany  
4  2015  Albany  

Let's drop the Unnamed: 0 column (it only duplicates the row index that pandas already provides) and rename the remaining columns to lower snake case for simplicity.

avocado = avocado.drop('Unnamed: 0', axis='columns')
avocado = avocado.rename(mapper={'AveragePrice': 'Average Price'}, axis='columns')
avocado.columns = avocado.columns.str.lower().str.replace(' ', '_')
avocado.columns

Index(['date', 'average_price', 'total_volume', '4046', '4225', '4770',
       'total_bags', 'small_bags', 'large_bags', 'xlarge_bags', 'type', 'year',
       'region'],
      dtype='object')
avocado.shape
18249 13
avocado.dtypes
date              object
average_price    float64
total_volume     float64
4046             float64
4225             float64
4770             float64
total_bags       float64
small_bags       float64
large_bags       float64
xlarge_bags      float64
type              object
year               int64
region            object
dtype: object

date should be converted to datetime dtype, and type & region to category dtype. total_volume, 4046, 4225, 4770, total_bags, small_bags, large_bags & xlarge_bags are stored as floats but represent counts (number of bags & avocados sold), so they should be converted to int dtype.

2.1.1. Handling date

It should be datetime dtype.

avocado['date'] = pd.to_datetime(avocado['date'].str.strip())
avocado['date']
0       2015-12-27
1       2015-12-20
2       2015-12-13
3       2015-12-06
4       2015-11-29
           ...    
18244   2018-02-04
18245   2018-01-28
18246   2018-01-21
18247   2018-01-14
18248   2018-01-07
Name: date, Length: 18249, dtype: datetime64[ns]
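
As a small aside, the sketch below shows one way to make the date parsing stricter. It is not part of the original analysis: the toy values are made up, and the explicit format string assumes the dates are always written as YYYY-MM-DD, as in the rows shown above. With errors='coerce', values that do not parse become NaT instead of raising, so they can be counted.

```python
import pandas as pd

# Toy stand-in for the raw 'date' column (values are made up for illustration).
raw = pd.Series([' 2015-12-27', '2015-12-20 ', 'not a date'])

# Strip stray whitespace, then parse with an explicit format.
# errors='coerce' turns unparseable values into NaT instead of raising.
parsed = pd.to_datetime(raw.str.strip(), format='%Y-%m-%d', errors='coerce')

# Number of values that failed to parse.
n_bad = int(parsed.isna().sum())
print(parsed)
print('unparseable values:', n_bad)
```

On the real column, a count of zero would suggest that every value matched the assumed format.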

2.1.2. Handling type & region

They should be category dtype.

avocado['type'] = avocado['type'].str.strip().astype('category')
avocado['region'] = avocado['region'].str.strip().astype('category')
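
To illustrate why category dtype suits columns with few distinct values, here is a minimal sketch on a made-up series (the values and sizes are assumptions, not taken from the dataset). It compares the memory footprint before and after the conversion and lists the categories.

```python
import pandas as pd

# Made-up column with only two distinct values repeated many times.
region = pd.Series(['Albany', 'Boston', 'Albany', 'Boston'] * 1000)

# A categorical stores each distinct value once plus a small integer code per row.
as_category = region.astype('category')

object_bytes = region.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
categories = list(as_category.cat.categories)

print('before:', object_bytes, 'bytes')
print('after: ', category_bytes, 'bytes')
print('categories:', categories)
```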

2.1.3. Handling total_volume, 4046, 4225, 4770, total_bags, small_bags, large_bags & xlarge_bags

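They should all be int dtype. Note that astype('int') truncates the fractional part rather than rounding, as the output below shows (64236.62 becomes 64236).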

numerical_features = ['total_volume',
                      '4046',
                      '4225',
                      '4770',
                      'total_bags',
                      'small_bags',
                      'large_bags',
                      'xlarge_bags']

for feature in numerical_features:
    avocado[feature] = avocado[feature].astype('int')

avocado[numerical_features]
       total_volume  4046    4225  4770  total_bags  small_bags  large_bags  \
0             64236  1036   54454    48        8696        8603          93   
1             54876   674   44638    58        9505        9408          97   
2            118220   794  109149   130        8145        8042         103   
3             78992  1132   71976    72        5811        5677         133   
4             51039   941   43838    75        6183        5986         197   
...             ...   ...     ...   ...         ...         ...         ...   
18244         17074  2046    1529     0       13498       13066         431   
18245         13888  1191    3431     0        9264        8940         324   
18246         13766  1191    2452   727        9394        9351          42   
18247         16205  1527    2981   727       10969       10919          50   
18248         17489  2894    2356   224       12014       11988          26   

       xlarge_bags  
0                0  
1                0  
2                0  
3                0  
4                0  
...            ...  
18244            0  
18245            0  
18246            0  
18247            0  
18248            0  

[18249 rows x 8 columns]
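
The difference between truncating and rounding can be seen in the following sketch, which uses three values taken from the first row of the raw data shown earlier. Rounding before the cast is an alternative, not what the analysis above does.

```python
import pandas as pd

# Values from the first row of the raw data: total_volume, total_bags, large_bags.
values = pd.Series([64236.62, 8696.87, 93.25])

# Casting directly drops the fractional part (truncation towards zero).
truncated = values.astype('int')

# Rounding first maps each value to the nearest whole number before the cast.
rounded = values.round().astype('int')

print(truncated.tolist())
print(rounded.tolist())
```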

2.1.4. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics, missing values & duplicates next.

avocado.describe(include='all')
                       date  average_price  total_volume          4046  \
count                 18249   18249.000000  1.824900e+04  1.824900e+04   
unique                  169            NaN           NaN           NaN   
top     2015-12-27 00:00:00            NaN           NaN           NaN   
freq                    108            NaN           NaN           NaN   
first   2015-01-04 00:00:00            NaN           NaN           NaN   
last    2018-03-25 00:00:00            NaN           NaN           NaN   
mean                    NaN       1.405978  8.506435e+05  2.930079e+05   
std                     NaN       0.402677  3.453545e+06  1.264989e+06   
min                     NaN       0.440000  8.400000e+01  0.000000e+00   
25%                     NaN       1.100000  1.083800e+04  8.540000e+02   
50%                     NaN       1.370000  1.073760e+05  8.645000e+03   
75%                     NaN       1.660000  4.329620e+05  1.110200e+05   
max                     NaN       3.250000  6.250565e+07  2.274362e+07   

                4225          4770    total_bags    small_bags    large_bags  \
count   1.824900e+04  1.824900e+04  1.824900e+04  1.824900e+04  1.824900e+04   
unique           NaN           NaN           NaN           NaN           NaN   
top              NaN           NaN           NaN           NaN           NaN   
freq             NaN           NaN           NaN           NaN           NaN   
first            NaN           NaN           NaN           NaN           NaN   
last             NaN           NaN           NaN           NaN           NaN   
mean    2.951541e+05  2.283940e+04  2.396387e+05  1.821942e+05  5.433767e+04   
std     1.204120e+06  1.074640e+05  9.862424e+05  7.461785e+05  2.439659e+05   
min     0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00   
25%     3.008000e+03  0.000000e+00  5.088000e+03  2.849000e+03  1.270000e+02   
50%     2.906100e+04  1.840000e+02  3.974300e+04  2.636200e+04  2.647000e+03   
75%     1.502060e+05  6.243000e+03  1.107830e+05  8.333700e+04  2.202900e+04   
max     2.047057e+07  2.546439e+06  1.937313e+07  1.338459e+07  5.719096e+06   

          xlarge_bags          type          year  region  
count    18249.000000         18249  18249.000000   18249  
unique            NaN             2           NaN      54  
top               NaN  conventional           NaN  Albany  
freq              NaN          9126           NaN     338  
first             NaN           NaN           NaN     NaN  
last              NaN           NaN           NaN     NaN  
mean      3106.279029           NaN   2016.147899     NaN  
std      17692.837485           NaN      0.939938     NaN  
min          0.000000           NaN   2015.000000     NaN  
25%          0.000000           NaN   2015.000000     NaN  
50%          0.000000           NaN   2016.000000     NaN  
75%        132.000000           NaN   2017.000000     NaN  
max     551693.000000           NaN   2018.000000     NaN  
avocado.isna().any()
date             False
average_price    False
total_volume     False
4046             False
4225             False
4770             False
total_bags       False
small_bags       False
large_bags       False
xlarge_bags      False
type             False
year             False
region           False
dtype: bool
avocado[avocado.duplicated()].shape
0 13
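
The checks above report whether anything is missing or duplicated. When something does turn up, counts are usually more informative than booleans. The sketch below shows this on a made-up frame (the toy data is an assumption and deliberately contains one missing value and one duplicated row).

```python
import pandas as pd

# Made-up frame with one missing price and one fully duplicated row.
toy = pd.DataFrame({
    'region': ['Albany', 'Albany', 'Boston', 'Boston'],
    'average_price': [1.33, 1.33, float('nan'), 0.93],
})

# Number of missing values per column.
missing_per_column = toy.isna().sum()

# Number of rows that repeat an earlier row exactly.
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print('duplicated rows:', n_duplicates)
```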

2.1.5. Correlations

Let's look at the correlations of the numerical features next.

name = 'heatmap@avocado--corr.png'
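# restrict to numeric columns; newer pandas versions raise on non-numeric columns in corr()
corr = avocado.select_dtypes(include='number').corr()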

plotter.corr(corr, name)
name

heatmap@avocado--corr.png

The numerical features are positively correlated with one another.
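
For readers without access to the plotter helper, the following sketch shows how a correlation matrix of the numerical columns could be computed on a made-up frame. The toy values are assumptions chosen to be positively related, and select_dtypes is used here to leave out the non-numeric column before calling corr.

```python
import pandas as pd

# Made-up frame: two numeric columns that grow together and one text column.
toy = pd.DataFrame({
    'total_volume': [100, 200, 300, 400],
    'total_bags': [10, 25, 28, 45],
    'region': ['Albany', 'Albany', 'Boston', 'Boston'],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes(include='number').corr()

print(corr)
```

The resulting matrix is symmetric with ones on the diagonal, and it is the kind of input a heatmap function would take.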

Date: 2021-10-27 Wed 00:00

Created: 2021-10-27 Wed 13:04