Fraud

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the creditcard dataset. We start by reading the data docs, which provide some background on it.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

fraud = pd.read_csv(utils.data_path('fraud.csv'))
fraud.head()
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9       V10       V11       V12       V13       V14  \
0  0.098698  0.363787  0.090794 -0.551600 -0.617801 -0.991390 -0.311169   
1  0.085102 -0.255425 -0.166974  1.612727  1.065235  0.489095 -0.143772   
2  0.247676 -1.514654  0.207643  0.624501  0.066084  0.717293 -0.165946   
3  0.377436 -1.387024 -0.054952 -0.226487  0.178228  0.507757 -0.287924   
4 -0.270533  0.817739  0.753074 -0.822843  0.538196  1.345852 -1.119670   

        V15       V16       V17       V18       V19       V20       V21  \
0  1.468177 -0.470401  0.207971  0.025791  0.403993  0.251412 -0.018307   
1  0.635558  0.463917 -0.114805 -0.183361 -0.145783 -0.069083 -0.225775   
2  2.345865 -2.890083  1.109969 -0.121359 -2.261857  0.524980  0.247998   
3 -0.631418 -1.059647 -0.684093  1.965775 -1.232622 -0.208038 -0.108300   
4  0.175121 -0.451449 -0.237033 -0.038195  0.803487  0.408542 -0.009431   

        V22       V23       V24       V25       V26       V27       V28  \
0  0.277838 -0.110474  0.066928  0.128539 -0.189115  0.133558 -0.021053   
1 -0.638672  0.101288 -0.339846  0.167170  0.125895 -0.008983  0.014724   
2  0.771679  0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752   
3  0.005274 -0.190321 -1.175575  0.647376 -0.221929  0.062723  0.061458   
4  0.798278 -0.137458  0.141267 -0.206010  0.502292  0.219422  0.215153   

   Amount  Class  
0  149.62      0  
1    2.69      0  
2  378.66      0  
3  123.50      0  
4   69.99      0  
fraud.shape
(284807, 31)
fraud.dtypes
Time      float64
V1        float64
V2        float64
V3        float64
V4        float64
V5        float64
V6        float64
V7        float64
V8        float64
V9        float64
V10       float64
V11       float64
V12       float64
V13       float64
V14       float64
V15       float64
V16       float64
V17       float64
V18       float64
V19       float64
V20       float64
V21       float64
V22       float64
V23       float64
V24       float64
V25       float64
V26       float64
V27       float64
V28       float64
Amount    float64
Class       int64
dtype: object

Let's rename the columns for simplicity.

columns = fraud.columns
columns = columns.str.strip()
columns = columns.str.lower()
fraud.columns = columns
fraud.columns
Index(['time', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10',
       'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18', 'v19', 'v20',
       'v21', 'v22', 'v23', 'v24', 'v25', 'v26', 'v27', 'v28', 'amount',
       'class'],
      dtype='object')

2.1.1. Handling datetime features

The time feature denotes the number of seconds elapsed since the first transaction in the dataset. Although this is datetime information, we don't necessarily gain anything from converting it to a datetime dtype.
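If a cyclical feature were ever wanted, the seconds counter could be folded into an hour-of-day without any datetime conversion. A minimal sketch on a synthetic stand-in for time (the real column spans roughly two days, up to 172792 s):

```python
import pandas as pd

# Synthetic stand-in for the real `time` column: seconds since the first
# transaction (the actual data spans about two days, up to 172792 s).
time_s = pd.Series([0.0, 3600.0, 90000.0, 172792.0])

# Hour of day, wrapping past midnight on the second day.
hour = (time_s // 3600) % 24
```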

2.1.2. Handling numerical features

We don't know the unit for amount. The docs say the data is for European countries, so the currency may vary. It would be ideal to have all amounts in a single currency, both for consistency and to help the ML model learn.
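Whatever the currency, the amounts are heavily right-skewed (a handful of very large values against a median in the tens, as the descriptive statistics confirm), so a log transform is a common pre-processing option; a minimal sketch on synthetic amounts, not a prescribed part of this pipeline:

```python
import numpy as np
import pandas as pd

# Synthetic amounts mimicking the heavy right skew of the real column
# (values here echo figures visible in the head()/describe() output).
amount = pd.Series([0.0, 2.69, 22.0, 149.62, 25691.16])

# log1p rather than log, since zero-value amounts occur in the data.
log_amount = np.log1p(amount)
```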

2.1.3. Handling categorical features

Let's convert class to categorical for the analysis; we can switch back to the numerical representation for training.

fraud['class'] = fraud['class'].astype('category')

2.1.4. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics next.

fraud.describe(include='all')
                 time            v1            v2            v3            v4  \
count   284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
unique            NaN           NaN           NaN           NaN           NaN   
top               NaN           NaN           NaN           NaN           NaN   
freq              NaN           NaN           NaN           NaN           NaN   
mean     94813.859575  1.168375e-15  3.416908e-16 -1.379537e-15  2.074095e-15   
std      47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min          0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%      54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%      84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%     139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max     172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                  v5            v6            v7            v8            v9  \
count   2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
unique           NaN           NaN           NaN           NaN           NaN   
top              NaN           NaN           NaN           NaN           NaN   
freq             NaN           NaN           NaN           NaN           NaN   
mean    9.604066e-16  1.487313e-15 -5.556467e-16  1.213481e-16 -2.406331e-15   
std     1.380247e+00  1.332271e+00  1.237094e+00  1.194353e+00  1.098632e+00   
min    -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01   
25%    -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01   
50%    -5.433583e-02 -2.741871e-01  4.010308e-02  2.235804e-02 -5.142873e-02   
75%     6.119264e-01  3.985649e-01  5.704361e-01  3.273459e-01  5.971390e-01   
max     3.480167e+01  7.330163e+01  1.205895e+02  2.000721e+01  1.559499e+01   

                 v10           v11           v12           v13           v14  \
count   2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
unique           NaN           NaN           NaN           NaN           NaN   
top              NaN           NaN           NaN           NaN           NaN   
freq             NaN           NaN           NaN           NaN           NaN   
mean    2.239053e-15  1.673327e-15 -1.247012e-15  8.190001e-16  1.207294e-15   
std     1.088850e+00  1.020713e+00  9.992014e-01  9.952742e-01  9.585956e-01   
min    -2.458826e+01 -4.797473e+00 -1.868371e+01 -5.791881e+00 -1.921433e+01   
25%    -5.354257e-01 -7.624942e-01 -4.055715e-01 -6.485393e-01 -4.255740e-01   
50%    -9.291738e-02 -3.275735e-02  1.400326e-01 -1.356806e-02  5.060132e-02   
75%     4.539234e-01  7.395934e-01  6.182380e-01  6.625050e-01  4.931498e-01   
max     2.374514e+01  1.201891e+01  7.848392e+00  7.126883e+00  1.052677e+01   

                 v15           v16           v17           v18           v19  \
count   2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
unique           NaN           NaN           NaN           NaN           NaN   
top              NaN           NaN           NaN           NaN           NaN   
freq             NaN           NaN           NaN           NaN           NaN   
mean    4.887456e-15  1.437716e-15 -3.772171e-16  9.564149e-16  1.039917e-15   
std     9.153160e-01  8.762529e-01  8.493371e-01  8.381762e-01  8.140405e-01   
min    -4.498945e+00 -1.412985e+01 -2.516280e+01 -9.498746e+00 -7.213527e+00   
25%    -5.828843e-01 -4.680368e-01 -4.837483e-01 -4.988498e-01 -4.562989e-01   
50%     4.807155e-02  6.641332e-02 -6.567575e-02 -3.636312e-03  3.734823e-03   
75%     6.488208e-01  5.232963e-01  3.996750e-01  5.008067e-01  4.589494e-01   
max     8.877742e+00  1.731511e+01  9.253526e+00  5.041069e+00  5.591971e+00   

                 v20           v21           v22           v23           v24  \
count   2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
unique           NaN           NaN           NaN           NaN           NaN   
top              NaN           NaN           NaN           NaN           NaN   
freq             NaN           NaN           NaN           NaN           NaN   
mean    6.406204e-16  1.654067e-16 -3.568593e-16  2.578648e-16  4.473266e-15   
std     7.709250e-01  7.345240e-01  7.257016e-01  6.244603e-01  6.056471e-01   
min    -5.449772e+01 -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00   
25%    -2.117214e-01 -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01   
50%    -6.248109e-02 -2.945017e-02  6.781943e-03 -1.119293e-02  4.097606e-02   
75%     1.330408e-01  1.863772e-01  5.285536e-01  1.476421e-01  4.395266e-01   
max     3.942090e+01  2.720284e+01  1.050309e+01  2.252841e+01  4.584549e+00   

                 v25           v26           v27           v28         amount  \
count   2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  284807.000000   
unique           NaN           NaN           NaN           NaN            NaN   
top              NaN           NaN           NaN           NaN            NaN   
freq             NaN           NaN           NaN           NaN            NaN   
mean    5.340915e-16  1.683437e-15 -3.660091e-16 -1.227390e-16      88.349619   
std     5.212781e-01  4.822270e-01  4.036325e-01  3.300833e-01     250.120109   
min    -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01       0.000000   
25%    -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02       5.600000   
50%     1.659350e-02 -5.213911e-02  1.342146e-03  1.124383e-02      22.000000   
75%     3.507156e-01  2.409522e-01  9.104512e-02  7.827995e-02      77.165000   
max     7.519589e+00  3.517346e+00  3.161220e+01  3.384781e+01   25691.160000   

           class  
count   284807.0  
unique       2.0  
top          0.0  
freq    284315.0  
mean         NaN  
std          NaN  
min          NaN  
25%          NaN  
50%          NaN  
75%          NaN  
max          NaN  

There is a gross imbalance in class; corrective strategies such as class weighting or resampling will have to be considered.
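The imbalance can be quantified straight from the describe output above: 284315 of the 284807 rows are class 0, leaving 492 fraud cases. A self-contained check of that arithmetic:

```python
import pandas as pd

# Class counts taken from the describe() output above
# (freq of the top class 0 is 284315 out of 284807 rows).
counts = pd.Series({0: 284315, 1: 284807 - 284315})

# Fraction of the positive (fraud) class.
fraud_rate = counts[1] / counts.sum()
print(f'{fraud_rate:.4%}')  # roughly 0.17% of transactions are fraud
```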

The counts above all equal the number of rows, so there are no missing values; let's check for duplicates next.

fraud[fraud.duplicated(keep=False)]
            time        v1        v2        v3        v4        v5        v6  \
32          26.0 -0.529912  0.873892  1.347247  0.145457  0.414209  0.100223   
33          26.0 -0.529912  0.873892  1.347247  0.145457  0.414209  0.100223   
34          26.0 -0.535388  0.865268  1.351076  0.147575  0.433680  0.086983   
35          26.0 -0.535388  0.865268  1.351076  0.147575  0.433680  0.086983   
112         74.0  1.038370  0.127486  0.184456  1.109950  0.441699  0.945283   
...          ...       ...       ...       ...       ...       ...       ...   
283485  171627.0 -1.457978  1.378203  0.811515 -0.603760 -0.711883 -0.471672   
284190  172233.0 -2.667936  3.160505 -3.355984  1.007845 -0.377397 -0.109730   
284191  172233.0 -2.667936  3.160505 -3.355984  1.007845 -0.377397 -0.109730   
284192  172233.0 -2.691642  3.123168 -3.339407  1.017018 -0.293095 -0.167054   
284193  172233.0 -2.691642  3.123168 -3.339407  1.017018 -0.293095 -0.167054   

              v7        v8        v9       v10       v11       v12       v13  \
32      0.711206  0.176066 -0.286717 -0.484688  0.872490  0.851636 -0.571745   
33      0.711206  0.176066 -0.286717 -0.484688  0.872490  0.851636 -0.571745   
34      0.693039  0.179742 -0.285642 -0.482474  0.871800  0.853447 -0.571822   
35      0.693039  0.179742 -0.285642 -0.482474  0.871800  0.853447 -0.571822   
112    -0.036715  0.350995  0.118950 -0.243289  0.578063  0.674730 -0.534231   
...          ...       ...       ...       ...       ...       ...       ...   
283485 -0.282535  0.880654  0.052808 -0.830603 -1.191774  0.942870  1.372621   
284190 -0.667233  2.309700 -1.639306 -1.449823 -0.508930  0.600035 -0.627313   
284191 -0.667233  2.309700 -1.639306 -1.449823 -0.508930  0.600035 -0.627313   
284192 -0.745886  2.325616 -1.634651 -1.440241 -0.511918  0.607878 -0.627645   
284193 -0.745886  2.325616 -1.634651 -1.440241 -0.511918  0.607878 -0.627645   

             v14       v15       v16       v17       v18       v19       v20  \
32      0.100974 -1.519772 -0.284376 -0.310524 -0.404248 -0.823374 -0.290348   
33      0.100974 -1.519772 -0.284376 -0.310524 -0.404248 -0.823374 -0.290348   
34      0.102252 -1.519991 -0.285912 -0.309633 -0.403902 -0.823743 -0.283264   
35      0.102252 -1.519991 -0.285912 -0.309633 -0.403902 -0.823743 -0.283264   
112     0.446601  1.122885 -1.768001  1.241157 -2.449500 -1.747255 -0.335520   
...          ...       ...       ...       ...       ...       ...       ...   
283485 -0.037988 -0.208490  0.321883 -0.205951 -0.025225 -0.468427  0.023667   
284190  1.017499 -0.887384  0.420096  1.856497  1.315099  1.096112 -0.821707   
284191  1.017499 -0.887384  0.420096  1.856497  1.315099  1.096112 -0.821707   
284192  1.023032 -0.888334  0.413444  1.860351  1.316597  1.094512 -0.791037   
284193  1.023032 -0.888334  0.413444  1.860351  1.316597  1.094512 -0.791037   

             v21       v22       v23       v24       v25       v26       v27  \
32      0.046949  0.208105 -0.185548  0.001031  0.098816 -0.552904 -0.073288   
33      0.046949  0.208105 -0.185548  0.001031  0.098816 -0.552904 -0.073288   
34      0.049526  0.206537 -0.187108  0.000753  0.098117 -0.553471 -0.078306   
35      0.049526  0.206537 -0.187108  0.000753  0.098117 -0.553471 -0.078306   
112     0.102520  0.605089  0.023092 -0.626463  0.479120 -0.166937  0.081247   
...          ...       ...       ...       ...       ...       ...       ...   
283485  0.284205  0.949659 -0.216949  0.083250  0.044944  0.639933  0.219432   
284190  0.391483  0.266536 -0.079853 -0.096395  0.086719 -0.451128 -1.183743   
284191  0.391483  0.266536 -0.079853 -0.096395  0.086719 -0.451128 -1.183743   
284192  0.402639  0.259746 -0.086606 -0.097597  0.083693 -0.453584 -1.205466   
284193  0.402639  0.259746 -0.086606 -0.097597  0.083693 -0.453584 -1.205466   

             v28  amount class  
32      0.023307    6.14     0  
33      0.023307    6.14     0  
34      0.025427    1.77     0  
35      0.025427    1.77     0  
112     0.001192    1.18     0  
...          ...     ...   ...  
283485  0.116772   11.93     0  
284190 -0.222200   55.66     0  
284191 -0.222200   55.66     0  
284192 -0.213020   36.74     0  
284193 -0.213020   36.74     0  

[1854 rows x 31 columns]

The duplicates are plausible: time has only second-level resolution, so several transactions can land in the same second, and some of these happen to match exactly across the PCA features and amount as well. No action required.
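Should deduplication be wanted later (say, before a train/test split, to keep an identical row from landing in both sides), drop_duplicates would cover it; a sketch on a toy frame mirroring the pattern above:

```python
import pandas as pd

# Toy frame with one exact duplicate row, like rows 32/33 above.
df = pd.DataFrame({'time': [26.0, 26.0, 74.0],
                   'amount': [6.14, 6.14, 1.18]})

# keep=False flags every row that belongs to a duplicate group.
n_dup_rows = df.duplicated(keep=False).sum()

# How removal would look, if it were ever desired.
deduped = df.drop_duplicates()
```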

2.1.5. Correlations

Let's look at the correlations between the numerical features.

name = 'heatmap@fraud--corr.png'
corr = fraud.corr()
plotter.corr(corr, name)
name

heatmap@fraud--corr.png

Nothing unexpected: the PCA features are mutually uncorrelated by construction, and only amount and time show any correlation with them.
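Since plotter.corr is a project-local helper, here is a pandas-only sketch of the same idea, ranking absolute correlations against amount; the data and column names are synthetic stand-ins, built so that v1 correlates with amount:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
v1 = rng.normal(size=200)
df = pd.DataFrame({
    'v1': v1,
    'v2': rng.normal(size=200),               # independent noise
    'amount': 2 * v1 + rng.normal(size=200),  # built to correlate with v1
})

# Absolute correlations with `amount`, strongest first, excluding itself.
top = df.corr()['amount'].drop('amount').abs().sort_values(ascending=False)
```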

Date: 2021-11-05 Fri 00:00

Created: 2021-11-05 Fri 15:28