Earthquake

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the earthquake dataset. We start by reading the accompanying data docs. The docs provide some preliminary information about the features of the dataset.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

earthquake = pd.read_csv(utils.data_path('earthquake.csv'))
earthquake.head()
         Date      Time  Latitude  Longitude        Type  Depth  Depth Error  \
0  01/02/1965  13:44:18    19.246    145.616  Earthquake  131.6          NaN   
1  01/04/1965  11:29:49     1.863    127.352  Earthquake   80.0          NaN   
2  01/05/1965  18:05:58   -20.579   -173.972  Earthquake   20.0          NaN   
3  01/08/1965  18:49:43   -59.076    -23.557  Earthquake   15.0          NaN   
4  01/09/1965  13:32:50    11.938    126.427  Earthquake   15.0          NaN   

   Depth Seismic Stations  Magnitude Magnitude Type  Magnitude Error  \
0                     NaN        6.0             MW              NaN   
1                     NaN        5.8             MW              NaN   
2                     NaN        6.2             MW              NaN   
3                     NaN        5.8             MW              NaN   
4                     NaN        5.8             MW              NaN   

   Magnitude Seismic Stations  Azimuthal Gap  Horizontal Distance  \
0                         NaN            NaN                  NaN   
1                         NaN            NaN                  NaN   
2                         NaN            NaN                  NaN   
3                         NaN            NaN                  NaN   
4                         NaN            NaN                  NaN   

   Horizontal Error  Root Mean Square            ID  Source Location Source  \
0               NaN               NaN  ISCGEM860706  ISCGEM          ISCGEM   
1               NaN               NaN  ISCGEM860737  ISCGEM          ISCGEM   
2               NaN               NaN  ISCGEM860762  ISCGEM          ISCGEM   
3               NaN               NaN  ISCGEM860856  ISCGEM          ISCGEM   
4               NaN               NaN  ISCGEM860890  ISCGEM          ISCGEM   

  Magnitude Source     Status  
0           ISCGEM  Automatic  
1           ISCGEM  Automatic  
2           ISCGEM  Automatic  
3           ISCGEM  Automatic  
4           ISCGEM  Automatic  
earthquake.shape
(23412, 21)
earthquake.dtypes
Date                           object
Time                           object
Latitude                      float64
Longitude                     float64
Type                           object
Depth                         float64
Depth Error                   float64
Depth Seismic Stations        float64
Magnitude                     float64
Magnitude Type                 object
Magnitude Error               float64
Magnitude Seismic Stations    float64
Azimuthal Gap                 float64
Horizontal Distance           float64
Horizontal Error              float64
Root Mean Square              float64
ID                             object
Source                         object
Location Source                object
Magnitude Source               object
Status                         object
dtype: object

We have a mix of categorical, datetime and numerical features in the dataset. Let's look at them individually, but first we rename the columns for simplicity.

columns = earthquake.columns
columns = columns.str.strip()
columns = columns.str.lower()
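# regex=True is required in pandas >= 2.0, where str.replace treats the pattern literally by default
columns = columns.str.replace(r'\s+', '_', regex=True)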
earthquake.columns = columns
earthquake.columns
Index(['date', 'time', 'latitude', 'longitude', 'type', 'depth', 'depth_error',
       'depth_seismic_stations', 'magnitude', 'magnitude_type',
       'magnitude_error', 'magnitude_seismic_stations', 'azimuthal_gap',
       'horizontal_distance', 'horizontal_error', 'root_mean_square', 'id',
       'source', 'location_source', 'magnitude_source', 'status'],
      dtype='object')

2.1.1. Handling id

Let's verify that this feature contains unique values and if so, drop it.

earthquake['id'].str.strip().astype('category').describe(include='all')
count          23412
unique         23412
top       AK11232962
freq               1
Name: id, dtype: object
earthquake = earthquake.drop('id', axis='columns')
earthquake.shape
(23412, 20)
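
The check-then-drop step above can also be written with `Series.is_unique`. This is a minimal sketch on a toy frame; the `toy` frame and its values are illustrative, not the real dataset.

```python
import pandas as pd

# Toy frame standing in for the earthquake data (values are illustrative).
toy = pd.DataFrame({
    'id': ['ISCGEM860706', 'ISCGEM860737 ', 'ISCGEM860762'],
    'magnitude': [6.0, 5.8, 6.2],
})

# Strip stray whitespace first so that 'A' and 'A ' are not counted as distinct ids.
ids = toy['id'].str.strip()

# Drop the column only if every id is unique, i.e. it carries no grouping information.
if ids.is_unique:
    toy = toy.drop(columns='id')

print(toy.columns.tolist())
```

Guarding the drop with the uniqueness check keeps the column around if the assumption ever fails on a different extract of the data.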

2.1.2. Handling date & time

These features contain timestamps, so they should be converted to a datetime dtype. One transformation to consider is merging the two into a single column, since a pandas datetime holds both the date and the time components.

The dates are in a human-friendly format, but the docs do not specify whether it is mm/dd/yyyy or dd/mm/yyyy.
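
One way to resolve the ambiguity is to look at the data itself: a component that ever exceeds 12 cannot be a month. The sketch below applies this to a toy sample; the sample values and the names `first_can_be_month`, `second_can_be_month` and `date_format` are assumptions made for illustration.

```python
import pandas as pd

# Toy sample shaped like the 'date' column (values are illustrative).
dates = pd.Series(['01/02/1965', '03/11/2011', '12/28/2016'])

# Split into the slash-separated components and convert them to integers.
parts = dates.str.split('/', expand=True).astype(int)
first, second = parts[0], parts[1]

# A position can only hold the month if it never exceeds 12.
first_can_be_month = bool((first <= 12).all())
second_can_be_month = bool((second <= 12).all())

if first_can_be_month and not second_can_be_month:
    date_format = '%m/%d/%Y'
elif second_can_be_month and not first_can_be_month:
    date_format = '%d/%m/%Y'
else:
    date_format = None  # the sample is ambiguous or inconsistent

print(date_format)
```

If the result is `None`, the sample does not settle the question and the data source would have to be consulted.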

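# Work on an explicit copy to avoid pandas' SettingWithCopyWarning.
datetime = earthquake[['date', 'time']].copy()
# Note: infer_datetime_format is deprecated since pandas 2.0, where format inference is the default.
datetime['date'] = pd.to_datetime(datetime['date'], infer_datetime_format=True)
datetime['time'] = pd.to_datetime(datetime['time'], infer_datetime_format=True)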
datetime
                      date                 time
0      1965-01-02 00:00:00  2021-11-03 13:44:18
1      1965-01-04 00:00:00  2021-11-03 11:29:49
2      1965-01-05 00:00:00  2021-11-03 18:05:58
3      1965-01-08 00:00:00  2021-11-03 18:49:43
4      1965-01-09 00:00:00  2021-11-03 13:32:50
...                    ...                  ...
23407  2016-12-28 00:00:00  2021-11-03 08:22:12
23408  2016-12-28 00:00:00  2021-11-03 09:13:47
23409  2016-12-28 00:00:00  2021-11-03 12:38:51
23410  2016-12-29 00:00:00  2021-11-03 22:30:19
23411  2016-12-30 00:00:00  2021-11-03 20:08:28

[23412 rows x 2 columns]

As seen above, pandas assigns a default time (midnight) when the string contains no time information, and the current date when it contains no date information. This is another motivation for combining the two features into one. For this analysis, however, the separate columns will do.
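
Should the two columns be merged later, a minimal sketch could look like the following. It assumes the mm/dd/yyyy format (which the docs do not confirm) and uses a toy frame; the `timestamp` column name is hypothetical.

```python
import pandas as pd

# Toy frame with the same two string columns (values are illustrative).
toy = pd.DataFrame({
    'date': ['01/02/1965', '01/04/1965'],
    'time': ['13:44:18', '11:29:49'],
})

# Concatenate the strings and parse them once with an explicit format.
# The mm/dd/yyyy format is an assumption; errors='coerce' turns rows that
# do not match the format into NaT instead of raising.
toy['timestamp'] = pd.to_datetime(
    toy['date'].str.strip() + ' ' + toy['time'].str.strip(),
    format='%m/%d/%Y %H:%M:%S',
    errors='coerce',
)

# The original string columns are now redundant.
toy = toy.drop(columns=['date', 'time'])
print(toy)
```

Parsing the combined string avoids the placeholder date and placeholder time seen in the output above, and counting the `NaT` values afterwards shows how many rows did not match the assumed format.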

2.1.3. Handling categorical features

type, magnitude_type, source, location_source, magnitude_source & status are all categorical. Let's convert them to the category dtype.

categorical_features = ['type',
                        'magnitude_type',
                        'source',
                        'location_source',
                        'magnitude_source',
                        'status']
# Work on an explicit copy to avoid pandas' SettingWithCopyWarning.
categorical = earthquake[categorical_features].copy()

for column in categorical.columns:
    categorical[column] = categorical[column].str.strip().astype('category')

categorical.describe(include='all')
              type magnitude_type source location_source magnitude_source  \
count        23412          23409  23412           23412            23412   
unique           4             10     13              48               24   
top     Earthquake             MW     US              US               US   
freq         23232           7722  20630           20350            10458   

          status  
count      23412  
unique         2  
top     Reviewed  
freq       20773  

magnitude_type has three missing values (23409 non-null entries out of 23412).

earthquake[categorical_features] = categorical
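
To see the missing entries of a categorical column next to its category frequencies, `value_counts(dropna=False)` can be used. A minimal sketch on a toy series (the values are illustrative):

```python
import pandas as pd

# Toy categorical series with one missing entry (values are illustrative).
toy = pd.Series(['MW', 'MB', None, 'MW'], name='magnitude_type').astype('category')

# Number of missing entries in the column.
n_missing = int(toy.isna().sum())

# Frequency of each category, with missing values reported as their own row.
counts = toy.value_counts(dropna=False)

print(n_missing)
print(counts)
```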

2.1.4. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics next.

earthquake.describe(include='all')
              date      time      latitude     longitude        type  \
count        23412     23412  23412.000000  23412.000000       23412   
unique       12401     20472           NaN           NaN           4   
top     03/11/2011  02:56:58           NaN           NaN  Earthquake   
freq           128         5           NaN           NaN       23232   
mean           NaN       NaN      1.679033     39.639961         NaN   
std            NaN       NaN     30.113183    125.511959         NaN   
min            NaN       NaN    -77.080000   -179.997000         NaN   
25%            NaN       NaN    -18.653000    -76.349750         NaN   
50%            NaN       NaN     -3.568500    103.982000         NaN   
75%            NaN       NaN     26.190750    145.026250         NaN   
max            NaN       NaN     86.005000    179.998000         NaN   

               depth  depth_error  depth_seismic_stations     magnitude  \
count   23412.000000  4461.000000             7097.000000  23412.000000   
unique           NaN          NaN                     NaN           NaN   
top              NaN          NaN                     NaN           NaN   
freq             NaN          NaN                     NaN           NaN   
mean       70.767911     4.993115              275.364098      5.882531   
std       122.651898     4.875184              162.141631      0.423066   
min        -1.100000     0.000000                0.000000      5.500000   
25%        14.522500     1.800000              146.000000      5.600000   
50%        33.000000     3.500000              255.000000      5.700000   
75%        54.000000     6.300000              384.000000      6.000000   
max       700.000000    91.295000              934.000000      9.100000   

       magnitude_type  magnitude_error  magnitude_seismic_stations  \
count           23409       327.000000                 2564.000000   
unique             10              NaN                         NaN   
top                MW              NaN                         NaN   
freq             7722              NaN                         NaN   
mean              NaN         0.071820                   48.944618   
std               NaN         0.051466                   62.943106   
min               NaN         0.000000                    0.000000   
25%               NaN         0.046000                   10.000000   
50%               NaN         0.059000                   28.000000   
75%               NaN         0.075500                   66.000000   
max               NaN         0.410000                  821.000000   

        azimuthal_gap  horizontal_distance  horizontal_error  \
count     7299.000000          1604.000000       1156.000000   
unique            NaN                  NaN               NaN   
top               NaN                  NaN               NaN   
freq              NaN                  NaN               NaN   
mean        44.163532             3.992660          7.662759   
std         32.141486             5.377262         10.430396   
min          0.000000             0.004505          0.085000   
25%         24.100000             0.968750          5.300000   
50%         36.000000             2.319500          6.700000   
75%         54.000000             4.724500          8.100000   
max        360.000000            37.874000         99.000000   

        root_mean_square source location_source magnitude_source    status  
count       17352.000000  23412           23412            23412     23412  
unique               NaN     13              48               24         2  
top                  NaN     US              US               US  Reviewed  
freq                 NaN  20630           20350            10458     20773  
mean            1.022784    NaN             NaN              NaN       NaN  
std             0.188545    NaN             NaN              NaN       NaN  
min             0.000000    NaN             NaN              NaN       NaN  
25%             0.900000    NaN             NaN              NaN       NaN  
50%             1.000000    NaN             NaN              NaN       NaN  
75%             1.130000    NaN             NaN              NaN       NaN  
max             3.440000    NaN             NaN              NaN       NaN  

We note that most numerical features have many missing values; magnitude_error, for example, has only 327 non-null entries out of 23412. Dropping the affected rows is not an option, so imputation must be considered, which adds technical debt.
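
The extent of the problem can be quantified per column, together with the duplicate check named in this section's title. The sketch below uses a toy frame with illustrative values; `missing_share` and `n_duplicates` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values and one duplicated row (values are illustrative).
toy = pd.DataFrame({
    'depth': [131.6, 80.0, 20.0, 20.0],
    'depth_error': [np.nan, np.nan, 1.8, 1.8],
    'magnitude': [6.0, 5.8, 6.2, 6.2],
})

# Fraction of missing values per column, worst column first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(missing_share)
print(n_duplicates)
```

Sorting by the missing fraction makes it easy to separate columns that are nearly empty from those where imputation is plausible.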

We can't continue without significant pre-processing of the dataset, which is beyond the scope of this analysis.

Date: 2021-11-03 Wed 00:00

Created: 2021-11-03 Wed 14:34