Bitcoin
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the bitcoin dataset. We start by reading the accompanying data docs, which unfortunately are not that helpful. They mention that it is a timeseries dataset with missing values (resulting in jumps in the timeseries).
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
bitcoin = pd.read_csv('../data/data/bitcoin.csv')
bitcoin.head()

    Timestamp  Open  High   Low  Close  Volume_(BTC)  Volume_(Currency)  \
0  1325317920  4.39  4.39  4.39   4.39      0.455581                2.0
1  1325317980   NaN   NaN   NaN    NaN           NaN                NaN
2  1325318040   NaN   NaN   NaN    NaN           NaN                NaN
3  1325318100   NaN   NaN   NaN    NaN           NaN                NaN
4  1325318160   NaN   NaN   NaN    NaN           NaN                NaN

   Weighted_Price
0            4.39
1             NaN
2             NaN
3             NaN
4             NaN
Let's rename the Volume_* features and lowercase all column names for simplicity.
mapper = {'Volume_(BTC)': 'Volume_btc',
          'Volume_(Currency)': 'Volume_currency'}
bitcoin = bitcoin.rename(mapper=mapper, axis='columns')
bitcoin.columns = bitcoin.columns.str.lower()
bitcoin.columns
Index(['timestamp', 'open', 'high', 'low', 'close', 'volume_btc', 'volume_currency', 'weighted_price'], dtype='object')
bitcoin.shape
(4857377, 8)
bitcoin.dtypes
timestamp            int64
open               float64
high               float64
low                float64
close              float64
volume_btc         float64
volume_currency    float64
weighted_price     float64
dtype: object
We only need to convert the timestamp feature to datetime dtype.
bitcoin['timestamp'] = pd.to_datetime(bitcoin['timestamp'], unit='s')
bitcoin['timestamp']

0         2011-12-31 07:52:00
1         2011-12-31 07:53:00
2         2011-12-31 07:54:00
3         2011-12-31 07:55:00
4         2011-12-31 07:56:00
                  ...
4857372   2021-03-30 23:56:00
4857373   2021-03-30 23:57:00
4857374   2021-03-30 23:58:00
4857375   2021-03-30 23:59:00
4857376   2021-03-31 00:00:00
Name: timestamp, Length: 4857377, dtype: datetime64[ns]
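With the timestamps converted, the "jumps" mentioned in the data docs can be checked directly. The following is a minimal sketch on toy data, assuming the expected spacing is one minute (as the head of the dataset suggests). The helper name find_gaps is hypothetical and not part of src.

```python
import pandas as pd

# Toy stand-in for bitcoin['timestamp']: one-minute steps with one jump.
timestamps = pd.Series(pd.to_datetime([
    "2011-12-31 07:52:00",
    "2011-12-31 07:53:00",
    "2011-12-31 07:54:00",
    "2011-12-31 07:58:00",  # three minutes are absent before this row
    "2011-12-31 07:59:00",
]))


def find_gaps(ts, expected=pd.Timedelta(minutes=1)):
    """Return rows whose distance to the previous timestamp exceeds `expected`.

    `expected` is an assumption about the sampling interval, not something
    stated in the data docs.
    """
    deltas = ts.diff()       # first entry is NaT and never counts as a gap
    mask = deltas > expected
    return pd.DataFrame({"timestamp": ts[mask], "gap": deltas[mask]})


gaps = find_gaps(timestamps)
print(gaps)
```

On the real data, the same call would be find_gaps(bitcoin['timestamp']). An empty result would mean the jumps show up only as NaN rows, not as absent timestamps.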
Let's look at the descriptive statistics next, followed by missing values and duplicates.
bitcoin.describe(include='all')
                  timestamp          open          high           low  \
count               4857377  3.613769e+06  3.613769e+06  3.613769e+06
unique              4857377           NaN           NaN           NaN
top     2011-12-31 07:52:00           NaN           NaN           NaN
freq                      1           NaN           NaN           NaN
first   2011-12-31 07:52:00           NaN           NaN           NaN
last    2021-03-31 00:00:00           NaN           NaN           NaN
mean                    NaN  6.009024e+03  6.013357e+03  6.004488e+03
std                     NaN  8.996247e+03  9.003521e+03  8.988778e+03
min                     NaN  3.800000e+00  3.800000e+00  1.500000e+00
25%                     NaN  4.438600e+02  4.440000e+02  4.435200e+02
50%                     NaN  3.596970e+03  3.598190e+03  3.595620e+03
75%                     NaN  8.627270e+03  8.632980e+03  8.621090e+03
max                     NaN  6.176356e+04  6.178183e+04  6.167355e+04

               close    volume_btc  volume_currency  weighted_price
count   3.613769e+06  3.613769e+06     3.613769e+06    3.613769e+06
unique           NaN           NaN              NaN             NaN
top              NaN           NaN              NaN             NaN
freq             NaN           NaN              NaN             NaN
first            NaN           NaN              NaN             NaN
last             NaN           NaN              NaN             NaN
mean    6.009014e+03  9.323249e+00     4.176284e+04    6.008935e+03
std     8.996360e+03  3.054989e+01     1.518248e+05    8.995992e+03
min     1.500000e+00  0.000000e+00     0.000000e+00    3.800000e+00
25%     4.438600e+02  4.097759e-01     4.521422e+02    4.438306e+02
50%     3.597000e+03  1.979811e+00     3.810124e+03    3.596804e+03
75%     8.627160e+03  7.278216e+00     2.569821e+04    8.627637e+03
max     6.178180e+04  5.853852e+03     1.390067e+07    6.171621e+04
bitcoin.isna().any()
timestamp          False
open                True
high                True
low                 True
close               True
volume_btc          True
volume_currency     True
weighted_price      True
dtype: bool
We have missing values; let's investigate.
bitcoin[bitcoin.isna().any(axis='columns')].shape
(1243608, 8)
Approximately a quarter of the rows (1,243,608 of 4,857,377, about 25.6%) contain missing values. This is a large share, and rather than dropping these rows we may want to consider imputation. For this analysis, we drop them.
bitcoin = bitcoin.dropna()
bitcoin.shape
(3613769, 8)
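As an alternative to dropping the rows, the imputation mentioned above could look roughly like the sketch below. It rests on a loud assumption that the data docs do not confirm: that a NaN row means no trades happened in that minute, so the last known price can be carried forward and the volume set to zero. The helper name impute_no_trade and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking two of the cleaned columns.
toy = pd.DataFrame({
    "close": [4.39, np.nan, np.nan, 4.50],
    "volume_btc": [0.455581, np.nan, np.nan, 1.0],
})


def impute_no_trade(df):
    """Impute under the ASSUMPTION that NaN rows are minutes without trades."""
    out = df.copy()
    # Carry the last observed price forward.
    out["close"] = out["close"].ffill()
    # No trades means zero traded volume.
    out["volume_btc"] = out["volume_btc"].fillna(0.0)
    return out


imputed = impute_no_trade(toy)
print(imputed)
```

Forward filling keeps the timeseries evenly spaced, but it also creates long flat stretches of prices, which would distort statistics such as the standard deviation.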
bitcoin[bitcoin.duplicated()].shape
(0, 8)
Let's look at correlations next.
name = 'heatmap@bitcoin--corr.png'
corr = bitcoin.corr()
plotter.corr(corr, name)
name
Several features have a high positive correlation with one another.
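To make this observation concrete without reading values off the heatmap, the correlated pairs can be listed from the correlation matrix itself. This is a minimal sketch on toy data. The helper name high_corr_pairs and the 0.9 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(corr, threshold=0.9):
    """List feature pairs whose absolute correlation is at least `threshold`."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal of self-correlations is excluded.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    # Masked entries are NaN (if present) and fail the comparison.
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)


# Toy frame: open and close move together, volume does not.
toy = pd.DataFrame({
    "open": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "close": [1.1, 2.0, 3.1, 3.9, 5.0, 6.1],
    "volume_btc": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
})
pairs = high_corr_pairs(toy.corr())
print(pairs)
```

On the real data, the call would be high_corr_pairs(corr) with the corr computed above.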
There are several visualisation techniques that can be employed for timeseries data (in addition to the ones we have been using in our analysis). However, I doubt we will find any new smells.
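Should we want to try such visualisations, plotting several million one-minute rows directly is rarely readable, so a common first step is to downsample. The sketch below uses toy data and aggregates to daily values. The choice of a daily frequency and of the last close and summed volume as aggregates is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in: three days of one-minute rows.
idx = pd.date_range("2021-03-29", periods=3 * 24 * 60, freq="min")
toy = pd.DataFrame({
    "timestamp": idx,
    "close": range(len(idx)),
    "volume_btc": 1.0,
})

# Aggregate to one row per day: last close of the day, total traded volume.
daily = (
    toy.set_index("timestamp")
       .resample("D")
       .agg({"close": "last", "volume_btc": "sum"})
)
print(daily)
```

The resulting frame is small enough to pass to a line plot, for example daily['close'].plot(), or to a rolling mean.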