Earthquake
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the earthquake dataset. We start by reading the accompanying data docs. The docs provide some preliminary information about the features of the dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
earthquake = pd.read_csv(utils.data_path('earthquake.csv'))
earthquake.head()

         Date      Time  Latitude  Longitude        Type  Depth  Depth Error  \
0  01/02/1965  13:44:18    19.246    145.616  Earthquake  131.6          NaN
1  01/04/1965  11:29:49     1.863    127.352  Earthquake   80.0          NaN
2  01/05/1965  18:05:58   -20.579   -173.972  Earthquake   20.0          NaN
3  01/08/1965  18:49:43   -59.076    -23.557  Earthquake   15.0          NaN
4  01/09/1965  13:32:50    11.938    126.427  Earthquake   15.0          NaN

   Depth Seismic Stations  Magnitude Magnitude Type  Magnitude Error  \
0                     NaN        6.0             MW              NaN
1                     NaN        5.8             MW              NaN
2                     NaN        6.2             MW              NaN
3                     NaN        5.8             MW              NaN
4                     NaN        5.8             MW              NaN

   Magnitude Seismic Stations  Azimuthal Gap  Horizontal Distance  \
0                         NaN            NaN                  NaN
1                         NaN            NaN                  NaN
2                         NaN            NaN                  NaN
3                         NaN            NaN                  NaN
4                         NaN            NaN                  NaN

   Horizontal Error  Root Mean Square            ID  Source Location Source  \
0               NaN               NaN  ISCGEM860706  ISCGEM          ISCGEM
1               NaN               NaN  ISCGEM860737  ISCGEM          ISCGEM
2               NaN               NaN  ISCGEM860762  ISCGEM          ISCGEM
3               NaN               NaN  ISCGEM860856  ISCGEM          ISCGEM
4               NaN               NaN  ISCGEM860890  ISCGEM          ISCGEM

  Magnitude Source     Status
0           ISCGEM  Automatic
1           ISCGEM  Automatic
2           ISCGEM  Automatic
3           ISCGEM  Automatic
4           ISCGEM  Automatic
earthquake.shape
(23412, 21)
earthquake.dtypes
Date                           object
Time                           object
Latitude                      float64
Longitude                     float64
Type                           object
Depth                         float64
Depth Error                   float64
Depth Seismic Stations        float64
Magnitude                     float64
Magnitude Type                 object
Magnitude Error               float64
Magnitude Seismic Stations    float64
Azimuthal Gap                 float64
Horizontal Distance           float64
Horizontal Error              float64
Root Mean Square              float64
ID                             object
Source                         object
Location Source                object
Magnitude Source               object
Status                         object
dtype: object
We have a mix of categorical, datetime and numerical features in the dataset. Let's observe them individually, but first, we will rename the columns for simplicity.
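# Normalise the column names: trim, lower-case, and replace whitespace runs with underscores.
columns = earthquake.columns
columns = columns.str.strip()
columns = columns.str.lower()
# A raw string avoids the invalid-escape warning, and regex=True is needed
# from pandas 2.0 on, where str.replace matches literally by default.
columns = columns.str.replace(r'\s+', '_', regex=True)
earthquake.columns = columns
earthquake.columns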
Index(['date', 'time', 'latitude', 'longitude', 'type', 'depth', 'depth_error',
       'depth_seismic_stations', 'magnitude', 'magnitude_type',
       'magnitude_error', 'magnitude_seismic_stations', 'azimuthal_gap',
       'horizontal_distance', 'horizontal_error', 'root_mean_square', 'id',
       'source', 'location_source', 'magnitude_source', 'status'],
      dtype='object')
2.1.1. Handling id
Let's verify that this feature contains unique values and if so, drop it.
earthquake['id'].str.strip().astype('category').describe(include='all')
count          23412
unique         23412
top       AK11232962
freq               1
Name: id, dtype: object
earthquake = earthquake.drop('id', axis='columns')
earthquake.shape
(23412, 20)
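The uniqueness check above can also be expressed as a direct boolean test before dropping the column. The following is a minimal sketch on a hypothetical three-row stand-in for the id column (the values are borrowed from the rows shown earlier, with a trailing space added to one of them for illustration); it is not part of the original analysis.

```python
import pandas as pd

# Hypothetical stand-in for earthquake['id']; one value carries a trailing space on purpose.
ids = pd.Series(['ISCGEM860706', 'ISCGEM860737 ', 'ISCGEM860762'], name='id')

# Strip stray whitespace first so that 'X' and 'X ' are not counted as distinct values.
cleaned = ids.str.strip()

# True when no value repeats, i.e. the column behaves like a row identifier.
is_identifier = cleaned.is_unique

# Number of values that repeat an earlier one (0 for a true identifier).
n_duplicates = int(cleaned.duplicated().sum())

print(is_identifier, n_duplicates)
```

A column that passes this test carries no information beyond the row index, which is the justification for dropping it.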
2.1.2. Handling date & time
These features contain timestamps, so they should be converted to the datetime dtype. One transformation we can consider is to merge the two into a single column, since pandas can represent all components of a timestamp in one datetime value.
We note that date uses a human-friendly format; however, the docs do not specify whether the format is mm/dd/yyyy or dd/mm/yyyy.
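One way to resolve this ambiguity empirically is to parse the column under both candidate formats and count how many values fail under each. The sketch below uses a small hypothetical sample rather than the real column; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical sample; in the notebook this would be earthquake['date'].
dates = pd.Series(['01/02/1965', '03/11/2011', '12/28/2016', '12/30/2016'])

# Parse under each candidate format; values that do not fit the format become NaT.
as_mdy = pd.to_datetime(dates, format='%m/%d/%Y', errors='coerce')
as_dmy = pd.to_datetime(dates, format='%d/%m/%Y', errors='coerce')

# The format with no failures is the plausible one.
failures = {
    'mm/dd/yyyy': int(as_mdy.isna().sum()),
    'dd/mm/yyyy': int(as_dmy.isna().sum()),
}
print(failures)
```

Any value whose second component exceeds 12 can only be valid as mm/dd/yyyy, so a non-zero failure count under one format settles the question for the whole column, provided the column uses a single format throughout.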
# Work on a copy so the assignments below do not raise SettingWithCopyWarning.
datetime = earthquake[['date', 'time']].copy()
# Note: infer_datetime_format is deprecated as of pandas 2.0, where format inference is the default.
datetime['date'] = pd.to_datetime(datetime['date'], infer_datetime_format=True)
datetime['time'] = pd.to_datetime(datetime['time'], infer_datetime_format=True)
datetime

                      date                 time
0      1965-01-02 00:00:00  2021-11-03 13:44:18
1      1965-01-04 00:00:00  2021-11-03 11:29:49
2      1965-01-05 00:00:00  2021-11-03 18:05:58
3      1965-01-08 00:00:00  2021-11-03 18:49:43
4      1965-01-09 00:00:00  2021-11-03 13:32:50
...                    ...                  ...
23407  2016-12-28 00:00:00  2021-11-03 08:22:12
23408  2016-12-28 00:00:00  2021-11-03 09:13:47
23409  2016-12-28 00:00:00  2021-11-03 12:38:51
23410  2016-12-29 00:00:00  2021-11-03 22:30:19
23411  2016-12-30 00:00:00  2021-11-03 20:08:28

[23412 rows x 2 columns]
As seen above, pandas assigns a default time (midnight) when a timestamp contains no time information and, conversely, assigns the current date when it contains no date information. This is another motivation for combining the two features into one. However, for this analysis, the separate columns will do.
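For reference, combining the two features could look like the following sketch. It assumes the dates are in mm/dd/yyyy order and the times in HH:MM:SS, neither of which the docs confirm; the frame and the timestamp column name are hypothetical.

```python
import pandas as pd

# Hypothetical two-row sample with the same layout as the date and time columns.
sample = pd.DataFrame({
    'date': ['01/02/1965', '01/04/1965'],
    'time': ['13:44:18', '11:29:49'],
})

# Assumption: dates are mm/dd/yyyy and times are HH:MM:SS.
# Concatenating the strings first means pandas never has to invent a date or a time.
sample['timestamp'] = pd.to_datetime(
    sample['date'].str.strip() + ' ' + sample['time'].str.strip(),
    format='%m/%d/%Y %H:%M:%S',
    errors='coerce',  # values that do not match the format become NaT instead of raising
)

# The original string columns are now redundant.
sample = sample.drop(columns=['date', 'time'])
print(sample)
```

With errors='coerce', any row that does not match the assumed format shows up as NaT, so counting the missing values in the new column doubles as a check of the format assumption.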
2.1.3. Handling categorical features
type, magnitude_type, source, location_source, magnitude_source & status are all categorical; let's convert them to the category dtype.
categorical_features = ['type', 'magnitude_type', 'source', 'location_source', 'magnitude_source', 'status']
# Work on a copy so the assignments below do not raise SettingWithCopyWarning.
categorical = earthquake[categorical_features].copy()
for column in categorical.columns:
    # Strip stray whitespace before casting so that 'US' and 'US ' map to one category.
    categorical[column] = categorical[column].str.strip().astype('category')
categorical.describe(include='all')

              type magnitude_type source location_source magnitude_source  \
count        23412          23409  23412           23412            23412
unique           4             10     13              48               24
top     Earthquake             MW     US              US               US
freq         23232           7722  20630           20350            10458

          status
count      23412
unique         2
top     Reviewed
freq       20773
We have three missing values in magnitude_type (23409 of 23412 rows are populated).
earthquake[categorical_features] = categorical
2.1.4. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
earthquake.describe(include='all')
              date      time      latitude     longitude        type  \
count        23412     23412  23412.000000  23412.000000       23412
unique       12401     20472           NaN           NaN           4
top     03/11/2011  02:56:58           NaN           NaN  Earthquake
freq           128         5           NaN           NaN       23232
mean           NaN       NaN      1.679033     39.639961         NaN
std            NaN       NaN     30.113183    125.511959         NaN
min            NaN       NaN    -77.080000   -179.997000         NaN
25%            NaN       NaN    -18.653000    -76.349750         NaN
50%            NaN       NaN     -3.568500    103.982000         NaN
75%            NaN       NaN     26.190750    145.026250         NaN
max            NaN       NaN     86.005000    179.998000         NaN

               depth  depth_error  depth_seismic_stations     magnitude  \
count   23412.000000  4461.000000             7097.000000  23412.000000
unique           NaN          NaN                     NaN           NaN
top              NaN          NaN                     NaN           NaN
freq             NaN          NaN                     NaN           NaN
mean       70.767911     4.993115              275.364098      5.882531
std       122.651898     4.875184              162.141631      0.423066
min        -1.100000     0.000000                0.000000      5.500000
25%        14.522500     1.800000              146.000000      5.600000
50%        33.000000     3.500000              255.000000      5.700000
75%        54.000000     6.300000              384.000000      6.000000
max       700.000000    91.295000              934.000000      9.100000

       magnitude_type  magnitude_error  magnitude_seismic_stations  \
count           23409       327.000000                 2564.000000
unique             10              NaN                         NaN
top                MW              NaN                         NaN
freq             7722              NaN                         NaN
mean              NaN         0.071820                   48.944618
std               NaN         0.051466                   62.943106
min               NaN         0.000000                    0.000000
25%               NaN         0.046000                   10.000000
50%               NaN         0.059000                   28.000000
75%               NaN         0.075500                   66.000000
max               NaN         0.410000                  821.000000

        azimuthal_gap  horizontal_distance  horizontal_error  \
count     7299.000000          1604.000000       1156.000000
unique            NaN                  NaN               NaN
top               NaN                  NaN               NaN
freq              NaN                  NaN               NaN
mean        44.163532             3.992660          7.662759
std         32.141486             5.377262         10.430396
min          0.000000             0.004505          0.085000
25%         24.100000             0.968750          5.300000
50%         36.000000             2.319500          6.700000
75%         54.000000             4.724500          8.100000
max        360.000000            37.874000         99.000000

        root_mean_square source location_source magnitude_source    status
count       17352.000000  23412           23412            23412     23412
unique               NaN     13              48               24         2
top                  NaN     US              US               US  Reviewed
freq                 NaN  20630           20350            10458     20773
mean            1.022784    NaN             NaN              NaN       NaN
std             0.188545    NaN             NaN              NaN       NaN
min             0.000000    NaN             NaN              NaN       NaN
25%             0.900000    NaN             NaN              NaN       NaN
50%             1.000000    NaN             NaN              NaN       NaN
75%             1.130000    NaN             NaN              NaN       NaN
max             3.440000    NaN             NaN              NaN       NaN
We note that many of the numerical features have a large share of missing values; magnitude_error, for example, is populated in only 327 of the 23412 rows. Dropping them is not an option, so imputation would have to be considered, which leads to technical debt.
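The share of missing values per column, and the number of fully duplicated rows, can be quantified directly. The sketch below runs on a small hypothetical frame that reuses three of the dataset's column names; the values themselves are illustrative and the variable names are assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; in the notebook this would be the earthquake frame.
sample = pd.DataFrame({
    'depth': [131.6, 80.0, 20.0, 20.0],
    'depth_error': [np.nan, np.nan, 1.5, 1.5],
    'magnitude': [6.0, 5.8, 6.2, 6.2],
})

# Fraction of missing values per column, worst column first.
missing_share = sample.isna().mean().sort_values(ascending=False)

# Rows that are exact copies of an earlier row.
n_duplicate_rows = int(sample.duplicated().sum())

print(missing_share)
print(n_duplicate_rows)
```

Sorting by the missing share makes it easy to see which features would need imputation and which are sparse enough that imputed values would dominate the observed ones.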
We can't continue with the analysis without significant pre-processing of the dataset, something beyond the scope of this analysis.