Avocado
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the avocado dataset. We start by reading the data docs, which describe the source of the dataset and its attributes.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
avocado = pd.read_csv('../data/data/avocado.csv')
avocado.head()
   Unnamed: 0        Date  AveragePrice  Total Volume     4046       4225  \
0           0  2015-12-27          1.33      64236.62  1036.74   54454.85
1           1  2015-12-20          1.35      54876.98   674.28   44638.81
2           2  2015-12-13          0.93     118220.22   794.70  109149.67
3           3  2015-12-06          1.08      78992.15  1132.00   71976.41
4           4  2015-11-29          1.28      51039.60   941.48   43838.39

     4770  Total Bags  Small Bags  Large Bags  XLarge Bags          type  \
0   48.16     8696.87     8603.62       93.25          0.0  conventional
1   58.33     9505.56     9408.07       97.49          0.0  conventional
2  130.50     8145.35     8042.21      103.14          0.0  conventional
3   72.58     5811.16     5677.40      133.76          0.0  conventional
4   75.78     6183.95     5986.26      197.69          0.0  conventional

   year  region
0  2015  Albany
1  2015  Albany
2  2015  Albany
3  2015  Albany
4  2015  Albany
Let's drop the column containing the unique ids (pandas already provides a row index) and rename the columns for simplicity.
avocado = avocado.drop('Unnamed: 0', axis='columns')
avocado = avocado.rename(mapper={'AveragePrice': 'Average Price'}, axis='columns')
avocado.columns = avocado.columns.str.lower().str.replace(' ', '_')
avocado.columns
Index(['date', 'average_price', 'total_volume', '4046', '4225', '4770', 'total_bags', 'small_bags', 'large_bags', 'xlarge_bags', 'type', 'year', 'region'], dtype='object')
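As an alternative to dropping the id column after loading, the first CSV column can be read directly as the index. The sketch below is a minimal illustration on a tiny in-memory CSV (the sample rows are copied from the head() output, and the empty first header is an assumption based on pandas naming the column 'Unnamed: 0'); it is not the notebook's own loading code.

```python
from io import StringIO

import pandas as pd

# Tiny stand-in for avocado.csv; the first header is empty, which is why
# pandas would otherwise name that column 'Unnamed: 0'.
raw = StringIO(
    ",Date,AveragePrice,Total Volume,XLarge Bags\n"
    "0,2015-12-27,1.33,64236.62,0.0\n"
    "1,2015-12-20,1.35,54876.98,0.0\n"
)

# index_col=0 uses the id column as the row index instead of a data column.
df = pd.read_csv(raw, index_col=0)

# Same normalisation as above: split 'AveragePrice', then snake_case everything.
df = df.rename(columns={'AveragePrice': 'Average Price'})
df.columns = df.columns.str.lower().str.replace(' ', '_')

print(list(df.columns))
```

Either approach yields the same column names; reading with index_col=0 simply avoids the extra drop step.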
avocado.shape
(18249, 13)
avocado.dtypes
date              object
average_price    float64
total_volume     float64
4046             float64
4225             float64
4770             float64
total_bags       float64
small_bags       float64
large_bags       float64
xlarge_bags      float64
type              object
year               int64
region            object
dtype: object
date should be converted to datetime dtype, and type & region to category dtype. total_volume, 4046, 4225, 4770, total_bags, small_bags, large_bags & xlarge_bags are floats but represent absolute counts (number of bags & avocados sold); they should be rounded and converted to int dtype.
2.1.1. Handling date
It should be datetime dtype.
avocado['date'] = pd.to_datetime(avocado['date'].str.strip())
avocado['date']
0       2015-12-27
1       2015-12-20
2       2015-12-13
3       2015-12-06
4       2015-11-29
           ...
18244   2018-02-04
18245   2018-01-28
18246   2018-01-21
18247   2018-01-14
18248   2018-01-07
Name: date, Length: 18249, dtype: datetime64[ns]
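The conversion above lets pandas infer the date format. A slightly stricter variant, sketched here on a toy frame (the rows are made up from values seen in the output), passes an explicit format and then checks that the parsed year agrees with the existing year column. The consistency check is an optional extra, not something the original analysis performs.

```python
import pandas as pd

# Toy stand-in for the real frame; stray whitespace is added on purpose.
df = pd.DataFrame({
    'date': ['2015-12-27', ' 2015-12-20', '2018-01-07 '],
    'year': [2015, 2015, 2018],
})

# An explicit format fails loudly on values that do not match YYYY-MM-DD.
df['date'] = pd.to_datetime(df['date'].str.strip(), format='%Y-%m-%d')

# Optional sanity check: the parsed year should match the 'year' column.
years_agree = bool((df['date'].dt.year == df['year']).all())
print(df['date'].dtype, years_agree)
```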
2.1.2. Handling type & region
They should be category dtype.
avocado['type'] = avocado['type'].str.strip().astype('category')
avocado['region'] = avocado['region'].str.strip().astype('category')
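To illustrate what the category conversion buys, the following sketch uses a made-up series of type labels (not the real column) and compares the result with the raw strings. Stripping whitespace first matters, since ' organic' and 'organic' would otherwise become separate categories.

```python
import pandas as pd

# Made-up labels with inconsistent whitespace, repeated to mimic a long column.
raw = pd.Series(['conventional', ' organic', 'conventional ', 'organic'] * 1000)

# Strip first so that whitespace variants collapse into a single category.
as_category = raw.str.strip().astype('category')

print(as_category.cat.categories.tolist())
# Categories store each distinct label once plus small integer codes per row.
print(raw.memory_usage(deep=True), as_category.memory_usage(deep=True))
```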
2.1.3. Handling total_volume, 4046, 4225, 4770, total_bags, small_bags, large_bags & xlarge_bags
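They should all be int dtype. Note that astype('int') truncates the fractional part rather than rounding it (64236.62 becomes 64236 below).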
numerical_features = ['total_volume', '4046', '4225', '4770', 'total_bags',
                      'small_bags', 'large_bags', 'xlarge_bags']
for feature in numerical_features:
    avocado[feature] = avocado[feature].astype('int')
avocado[numerical_features]
       total_volume  4046    4225  4770  total_bags  small_bags  large_bags  \
0             64236  1036   54454    48        8696        8603          93
1             54876   674   44638    58        9505        9408          97
2            118220   794  109149   130        8145        8042         103
3             78992  1132   71976    72        5811        5677         133
4             51039   941   43838    75        6183        5986         197
...             ...   ...     ...   ...         ...         ...         ...
18244         17074  2046    1529     0       13498       13066         431
18245         13888  1191    3431     0        9264        8940         324
18246         13766  1191    2452   727        9394        9351          42
18247         16205  1527    2981   727       10969       10919          50
18248         17489  2894    2356   224       12014       11988          26

       xlarge_bags
0                0
1                0
2                0
3                0
4                0
...            ...
18244            0
18245            0
18246            0
18247            0
18248            0

[18249 rows x 8 columns]
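Casting floats with astype('int') drops the fractional part, so the values above are truncated rather than rounded. If rounding to the nearest integer is what is wanted, one option is to call round() before the cast. The sketch below contrasts the two on the first three total_volume values from the head() output; it illustrates the difference and is not a change to the analysis itself.

```python
import pandas as pd

# First three total_volume values from the head() output.
volumes = pd.Series([64236.62, 54876.98, 118220.22])

truncated = volumes.astype('int')        # drops the fractional part
rounded = volumes.round().astype('int')  # rounds to the nearest integer first

print(truncated.tolist())  # [64236, 54876, 118220]
print(rounded.tolist())    # [64237, 54877, 118220]
```

The same pattern works on several columns at once, for example `df[cols] = df[cols].round().astype('int')`, which also avoids the explicit loop.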
2.1.4. Descriptive statistics, missing & duplicates
Next, let's look at the descriptive statistics, missing values & duplicates.
avocado.describe(include='all')
                       date  average_price  total_volume          4046  \
count                 18249   18249.000000  1.824900e+04  1.824900e+04
unique                  169            NaN           NaN           NaN
top     2015-12-27 00:00:00            NaN           NaN           NaN
freq                    108            NaN           NaN           NaN
first   2015-01-04 00:00:00            NaN           NaN           NaN
last    2018-03-25 00:00:00            NaN           NaN           NaN
mean                    NaN       1.405978  8.506435e+05  2.930079e+05
std                     NaN       0.402677  3.453545e+06  1.264989e+06
min                     NaN       0.440000  8.400000e+01  0.000000e+00
25%                     NaN       1.100000  1.083800e+04  8.540000e+02
50%                     NaN       1.370000  1.073760e+05  8.645000e+03
75%                     NaN       1.660000  4.329620e+05  1.110200e+05
max                     NaN       3.250000  6.250565e+07  2.274362e+07

                4225          4770    total_bags    small_bags    large_bags  \
count   1.824900e+04  1.824900e+04  1.824900e+04  1.824900e+04  1.824900e+04
unique           NaN           NaN           NaN           NaN           NaN
top              NaN           NaN           NaN           NaN           NaN
freq             NaN           NaN           NaN           NaN           NaN
first            NaN           NaN           NaN           NaN           NaN
last             NaN           NaN           NaN           NaN           NaN
mean    2.951541e+05  2.283940e+04  2.396387e+05  1.821942e+05  5.433767e+04
std     1.204120e+06  1.074640e+05  9.862424e+05  7.461785e+05  2.439659e+05
min     0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00  0.000000e+00
25%     3.008000e+03  0.000000e+00  5.088000e+03  2.849000e+03  1.270000e+02
50%     2.906100e+04  1.840000e+02  3.974300e+04  2.636200e+04  2.647000e+03
75%     1.502060e+05  6.243000e+03  1.107830e+05  8.333700e+04  2.202900e+04
max     2.047057e+07  2.546439e+06  1.937313e+07  1.338459e+07  5.719096e+06

          xlarge_bags          type          year  region
count    18249.000000         18249  18249.000000   18249
unique            NaN             2           NaN      54
top               NaN  conventional           NaN  Albany
freq              NaN          9126           NaN     338
first             NaN           NaN           NaN     NaN
last              NaN           NaN           NaN     NaN
mean      3106.279029           NaN   2016.147899     NaN
std      17692.837485           NaN      0.939938     NaN
min          0.000000           NaN   2015.000000     NaN
25%          0.000000           NaN   2015.000000     NaN
50%          0.000000           NaN   2016.000000     NaN
75%        132.000000           NaN   2017.000000     NaN
max     551693.000000           NaN   2018.000000     NaN
avocado.isna().any()
date             False
average_price    False
total_volume     False
4046             False
4225             False
4770             False
total_bags       False
small_bags       False
large_bags       False
xlarge_bags      False
type             False
year             False
region           False
dtype: bool
avocado[avocado.duplicated()].shape
(0, 13)
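The checks above only say whether any value is missing and how many duplicated rows exist. When a dataset does contain such rows, per-column counts are often more informative. The sketch below shows that variant on a small made-up frame (the values are illustrative only and do not come from the avocado data, which has neither missing values nor duplicates).

```python
import numpy as np
import pandas as pd

# Made-up frame with one duplicated row and one row of missing values.
df = pd.DataFrame({
    'region': ['Albany', 'Albany', 'Boston', None],
    'average_price': [1.33, 1.33, 1.10, np.nan],
})

# Number of missing values per column instead of a plain True/False flag.
missing_per_column = df.isna().sum()

# duplicated() marks every repeat of an earlier row; summing counts them.
n_duplicates = int(df.duplicated().sum())

print(missing_per_column.to_dict(), n_duplicates)
```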
2.1.5. Correlations
Let's look at the correlations of the numerical features next.
name = 'heatmap@avocado--corr.png'
corr = avocado.corr()
plotter.corr(corr, name)
name
The numerical features are positively correlated with one another.
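plotter.corr is a project helper whose implementation is not shown here, so the sketch below covers only the computation of the matrix. It assumes a frame that mixes numeric and categorical columns (the rows are the first five from the output above, with a constant type column added) and restricts the correlation to numeric columns explicitly, since newer pandas releases no longer silently drop non-numeric columns in corr(). The pair listing at the end is an illustrative extra.

```python
import numpy as np
import pandas as pd

# First five rows of three count columns, plus a categorical column.
df = pd.DataFrame({
    'total_volume': [64236, 54876, 118220, 78992, 51039],
    '4225': [54454, 44638, 109149, 71976, 43838],
    'small_bags': [8603, 9408, 8042, 5677, 5986],
    'type': pd.Categorical(['conventional'] * 5),
})

# Select numeric columns explicitly so the call works across pandas versions.
corr = df.select_dtypes(include='number').corr()

# Keep each pair once by masking everything except the strict lower triangle.
mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

print(pairs)
```

The resulting corr matrix can then be handed to any heatmap routine, such as seaborn's heatmap function. Five rows are far too few to say anything about the full dataset, so the printed values here are only a demonstration of the mechanics.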