Bitcoin

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the bitcoin dataset. We start by reading the accompanying data docs, which unfortunately are not that helpful. They do mention that it is a timeseries dataset with missing values (resulting in jumps in the timeseries).

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

bitcoin = pd.read_csv('../data/data/bitcoin.csv')
bitcoin.head()
    Timestamp  Open  High   Low  Close  Volume_(BTC)  Volume_(Currency)  \
0  1325317920  4.39  4.39  4.39   4.39      0.455581                2.0   
1  1325317980   NaN   NaN   NaN    NaN           NaN                NaN   
2  1325318040   NaN   NaN   NaN    NaN           NaN                NaN   
3  1325318100   NaN   NaN   NaN    NaN           NaN                NaN   
4  1325318160   NaN   NaN   NaN    NaN           NaN                NaN   

   Weighted_Price  
0            4.39  
1             NaN  
2             NaN  
3             NaN  
4             NaN  

Let's rename the Volume_* features for simplicity.

mapper = {'Volume_(BTC)': 'Volume_btc',
          'Volume_(Currency)': 'Volume_currency'}

bitcoin = bitcoin.rename(mapper=mapper, axis='columns')
bitcoin.columns = bitcoin.columns.str.lower()
bitcoin.columns
Index(['timestamp', 'open', 'high', 'low', 'close', 'volume_btc',
       'volume_currency', 'weighted_price'],
      dtype='object')
bitcoin.shape
(4857377, 8)
bitcoin.dtypes
timestamp            int64
open               float64
high               float64
low                float64
close              float64
volume_btc         float64
volume_currency    float64
weighted_price     float64
dtype: object

We only need to convert the timestamp feature to datetime dtype.

bitcoin['timestamp'] = pd.to_datetime(bitcoin['timestamp'], unit='s')
bitcoin['timestamp']
0         2011-12-31 07:52:00
1         2011-12-31 07:53:00
2         2011-12-31 07:54:00
3         2011-12-31 07:55:00
4         2011-12-31 07:56:00
                  ...        
4857372   2021-03-30 23:56:00
4857373   2021-03-30 23:57:00
4857374   2021-03-30 23:58:00
4857375   2021-03-30 23:59:00
4857376   2021-03-31 00:00:00
Name: timestamp, Length: 4857377, dtype: datetime64[ns]
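
As a quick sanity check on the conversion, the sketch below converts the first three raw values shown by bitcoin.head(). With unit='s', pandas reads the integers as seconds since the Unix epoch and returns timezone-naive values in UTC wall time; passing utc=True is an optional extra (not used in this analysis) that makes the timezone explicit.

```python
import pandas as pd

# First three raw timestamps from the dataset (epoch seconds).
raw = pd.Series([1325317920, 1325317980, 1325318040])

# Same call as in the analysis: tz-naive result, values are UTC wall time.
naive = pd.to_datetime(raw, unit="s")

# Optional variant (an assumption, not part of the analysis): tz-aware UTC.
aware = pd.to_datetime(raw, unit="s", utc=True)

print(naive.iloc[0])  # 2011-12-31 07:52:00
print(aware.dt.tz)

# Consecutive rows should be exactly one minute apart.
step = naive.diff().dropna()
print((step == pd.Timedelta(minutes=1)).all())
```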

Next, let's look at the descriptive statistics, then check for missing values and duplicates.

bitcoin.describe(include='all')
                  timestamp          open          high           low  \
count               4857377  3.613769e+06  3.613769e+06  3.613769e+06   
unique              4857377           NaN           NaN           NaN   
top     2011-12-31 07:52:00           NaN           NaN           NaN   
freq                      1           NaN           NaN           NaN   
first   2011-12-31 07:52:00           NaN           NaN           NaN   
last    2021-03-31 00:00:00           NaN           NaN           NaN   
mean                    NaN  6.009024e+03  6.013357e+03  6.004488e+03   
std                     NaN  8.996247e+03  9.003521e+03  8.988778e+03   
min                     NaN  3.800000e+00  3.800000e+00  1.500000e+00   
25%                     NaN  4.438600e+02  4.440000e+02  4.435200e+02   
50%                     NaN  3.596970e+03  3.598190e+03  3.595620e+03   
75%                     NaN  8.627270e+03  8.632980e+03  8.621090e+03   
max                     NaN  6.176356e+04  6.178183e+04  6.167355e+04   

               close    volume_btc  volume_currency  weighted_price  
count   3.613769e+06  3.613769e+06     3.613769e+06    3.613769e+06  
unique           NaN           NaN              NaN             NaN  
top              NaN           NaN              NaN             NaN  
freq             NaN           NaN              NaN             NaN  
first            NaN           NaN              NaN             NaN  
last             NaN           NaN              NaN             NaN  
mean    6.009014e+03  9.323249e+00     4.176284e+04    6.008935e+03  
std     8.996360e+03  3.054989e+01     1.518248e+05    8.995992e+03  
min     1.500000e+00  0.000000e+00     0.000000e+00    3.800000e+00  
25%     4.438600e+02  4.097759e-01     4.521422e+02    4.438306e+02  
50%     3.597000e+03  1.979811e+00     3.810124e+03    3.596804e+03  
75%     8.627160e+03  7.278216e+00     2.569821e+04    8.627637e+03  
max     6.178180e+04  5.853852e+03     1.390067e+07    6.171621e+04  
bitcoin.isna().any()
timestamp          False
open                True
high                True
low                 True
close               True
volume_btc          True
volume_currency     True
weighted_price      True
dtype: bool

We have missing values; let's investigate.

bitcoin[bitcoin.isna().any(axis='columns')].shape
(1243608, 8)

Roughly a quarter of the rows (1,243,608 of 4,857,377, about 25.6%) contain missing values. That is a large share, so imputation may be preferable to dropping the rows. For this analysis, however, we simply drop them.
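
For reference, here is a minimal sketch of what imputation could look like. It rests on an assumption the data docs do not confirm: that a row of NaNs is a minute in which no trade happened. Under that assumption the volume is zero and the price stays at the last known close. The frame and the helper impute_no_trade_minutes are hypothetical and use made-up values; they are not part of src.

```python
import numpy as np
import pandas as pd

# Tiny synthetic frame that mimics the dataset's layout (values are made up).
idx = pd.date_range("2021-01-01 00:00", periods=5, freq="min")
demo = pd.DataFrame({
    "timestamp": idx,
    "open": [10.0, np.nan, np.nan, 12.0, 12.5],
    "high": [10.5, np.nan, np.nan, 12.5, 13.0],
    "low": [9.5, np.nan, np.nan, 11.5, 12.0],
    "close": [10.2, np.nan, np.nan, 12.4, 12.8],
    "volume_btc": [1.0, np.nan, np.nan, 2.0, 0.5],
})


def impute_no_trade_minutes(df):
    """Fill NaN rows, ASSUMING each one is a minute without trades."""
    out = df.copy()
    # No trades means zero traded volume.
    out["volume_btc"] = out["volume_btc"].fillna(0.0)
    # Carry the last known close forward into the empty minutes...
    out["close"] = out["close"].ffill()
    # ...and reuse it for the other price columns of those minutes.
    for col in ["open", "high", "low"]:
        out[col] = out[col].fillna(out["close"])
    return out


imputed = impute_no_trade_minutes(demo)

# Share of rows with at least one missing value, as computed in the analysis.
missing_share = demo.isna().any(axis="columns").mean()
print(f"rows with missing values: {missing_share:.1%}")
print(imputed)
```

Forward filling is only one option; if the gaps turn out to be outages of the data feed rather than quiet minutes, interpolation or leaving the gaps in place may be more honest.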

bitcoin = bitcoin.dropna()
bitcoin.shape
(3613769, 8)
bitcoin[bitcoin.duplicated()].shape
(0, 8)
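
Dropping the rows with missing values leaves the jumps in the timeseries that the data docs mention. A minimal sketch of how those jumps could be located is shown below. The helper find_gaps is hypothetical, the timestamps are synthetic, and the one-minute expected spacing is taken from the timestamp output above.

```python
import pandas as pd

# Synthetic minute-level timestamps with two jumps (values are made up).
ts = pd.Series(pd.to_datetime([
    "2021-01-01 00:00", "2021-01-01 00:01", "2021-01-01 00:05",
    "2021-01-01 00:06", "2021-01-01 00:10",
]), name="timestamp")


def find_gaps(timestamps, expected=pd.Timedelta(minutes=1)):
    """Return start, end and length of every jump larger than `expected`."""
    ordered = timestamps.sort_values().reset_index(drop=True)
    delta = ordered.diff()      # time since the previous row (NaT for row 0)
    is_gap = delta > expected   # NaT compares as False
    return pd.DataFrame({
        "gap_start": ordered.shift(1)[is_gap],
        "gap_end": ordered[is_gap],
        "length": delta[is_gap],
    }).reset_index(drop=True)


gaps = find_gaps(ts)
print(gaps)
```

On the real data this would be called as find_gaps(bitcoin['timestamp']); the distribution of the length column then shows whether the jumps are mostly short or include long outages.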

Let's look at correlations next.

name = 'heatmap@bitcoin--corr.png'
corr = bitcoin.corr()

plotter.corr(corr, name)
name

heatmap@bitcoin--corr.png

Several features have a high positive correlation with one another.
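
To make "high" concrete without reading values off the heatmap, the correlated pairs can be listed directly. The sketch below is one way to do that; the helper correlated_pairs and the 0.9 threshold are assumptions for illustration, and the data is synthetic (three near-identical price columns plus an unrelated volume column).

```python
import numpy as np
import pandas as pd

# Synthetic data: a random-walk price with tiny noise, plus unrelated volume.
rng = np.random.default_rng(0)
base = np.cumsum(rng.normal(size=500)) + 100
demo = pd.DataFrame({
    "open": base,
    "close": base + rng.normal(scale=0.01, size=500),
    "weighted_price": base + rng.normal(scale=0.01, size=500),
    "volume_btc": rng.exponential(size=500),
})


def correlated_pairs(df, threshold=0.9):
    """List feature pairs whose absolute Pearson correlation >= threshold."""
    corr = df.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries (if stack kept any) fail the comparison and drop out here.
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)


pairs = correlated_pairs(demo)
print(pairs)
```

Applied to the numeric columns of bitcoin, this should surface the price columns (open, high, low, close, weighted_price), which is expected since they all describe the same minute of trading.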

There are several visualisation techniques that can be employed for timeseries data (in addition to the ones we have been using in our analysis). However, I doubt we will find any new smells.
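
As an illustration of one such technique, minute-level bars are usually aggregated before plotting, since almost ten years of one-minute data is too dense to read. The sketch below resamples synthetic minute bars to daily bars with pandas; the aggregation choices (first open, max high, min low, last close, summed volume) are the conventional ones for this kind of data and are an assumption here, not something taken from the data docs.

```python
import numpy as np
import pandas as pd

# Synthetic minute bars covering two full days (values are made up).
idx = pd.date_range("2021-01-01", periods=2 * 24 * 60, freq="min")
rng = np.random.default_rng(1)
close = 100 + np.cumsum(rng.normal(scale=0.1, size=len(idx)))
minute = pd.DataFrame({
    "timestamp": idx,
    "open": close,
    "high": close + 0.05,
    "low": close - 0.05,
    "close": close,
    "volume_btc": rng.exponential(size=len(idx)),
})

# Aggregate to one bar per calendar day.
daily = (
    minute.set_index("timestamp")
    .resample("D")
    .agg({"open": "first", "high": "max", "low": "min",
          "close": "last", "volume_btc": "sum"})
)
print(daily)
```

The resulting frame is small enough to pass to a line plot or a rolling-mean overlay.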

Date: 2021-10-26 Tue 00:00

Created: 2021-10-26 Tue 12:45