Fraud
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the creditcard dataset. We start by reading the data docs, which give us some information regarding the dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
fraud = pd.read_csv(utils.data_path('fraud.csv'))
fraud.head()
   Time        V1        V2        V3        V4  ...       V27       V28  Amount  Class
0   0.0 -1.359807 -0.072781  2.536347  1.378155  ...  0.133558 -0.021053  149.62      0
1   0.0  1.191857  0.266151  0.166480  0.448154  ... -0.008983  0.014724    2.69      0
2   1.0 -1.358354 -1.340163  1.773209  0.379780  ... -0.055353 -0.059752  378.66      0
3   1.0 -0.966272 -0.185226  1.792993 -0.863291  ...  0.062723  0.061458  123.50      0
4   2.0 -1.158233  0.877737  1.548718  0.403034  ...  0.219422  0.215153   69.99      0

[5 rows x 31 columns]
fraud.shape
(284807, 31)
fraud.dtypes
Time      float64
V1        float64
V2        float64
V3        float64
           ...
V27       float64
V28       float64
Amount    float64
Class       int64
Length: 31, dtype: object
Let's rename the columns for simplicity.
columns = fraud.columns
columns = columns.str.strip()
columns = columns.str.lower()
fraud.columns = columns
fraud.columns
Index(['time', 'v1', 'v2', 'v3', 'v4', 'v5', 'v6', 'v7', 'v8', 'v9', 'v10', 'v11', 'v12', 'v13', 'v14', 'v15', 'v16', 'v17', 'v18', 'v19', 'v20', 'v21', 'v22', 'v23', 'v24', 'v25', 'v26', 'v27', 'v28', 'amount', 'class'], dtype='object')
2.1.1. Handling datetime features
The time feature denotes the number of seconds elapsed since the first transaction in the dataset. Although this is datetime information, we don't necessarily gain anything from converting it to a datetime dtype.
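That said, if a time-of-day view is wanted later, the elapsed seconds can be folded into an hour-of-day feature without any dtype change. The sketch below is illustrative only; the hour column is hypothetical and not used elsewhere in this analysis.

# Sketch only: derive an hour-of-day feature from the elapsed seconds.
# Since 'time' is relative to the first transaction, the true clock offset
# is unknown; this gives the hour within each 24h block, not the real hour.
fraud['hour'] = (fraud['time'] % (24 * 3600)) // 3600

# Equivalent, using a timedelta representation:
elapsed = pd.to_timedelta(fraud['time'], unit='s')
fraud['hour'] = elapsed.dt.components.hours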
2.1.2. Handling numerical features
We don't know the unit for amount. The docs say that the data is for all European countries, so the currency may vary. It would be ideal to have the amounts in a single currency, both for consistency and to help the ML model learn.
2.1.3. Handling categorical features
Let's convert class to categorical for the analysis; we can continue using the numerical representation for training.
fraud['class'] = fraud['class'].astype('category')
2.1.4. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
fraud.describe(include='all')
                time            v1  ...         amount     class
count  284807.000000  2.848070e+05  ...  284807.000000  284807.0
unique           NaN           NaN  ...            NaN       2.0
top              NaN           NaN  ...            NaN       0.0
freq             NaN           NaN  ...            NaN  284315.0
mean    94813.859575  1.168375e-15  ...      88.349619       NaN
std     47488.145955  1.958696e+00  ...     250.120109       NaN
min         0.000000 -5.640751e+01  ...       0.000000       NaN
25%     54201.500000 -9.203734e-01  ...       5.600000       NaN
50%     84692.000000  1.810880e-02  ...      22.000000       NaN
75%    139320.500000  1.315642e+00  ...      77.165000       NaN
max    172792.000000  2.454930e+00  ...   25691.160000       NaN

[11 rows x 31 columns]
There is a gross imbalance in class: only 492 of the 284,807 transactions (roughly 0.17%) are fraudulent, so corrective strategies have to be considered.
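To quantify the imbalance before choosing a strategy, here is a minimal sketch; the figures in the comments follow from the describe output above, and the strategies listed are general options rather than a choice made here.

# Quantify the class imbalance.
print(fraud['class'].value_counts())                # per describe(): 284315 vs 492
print(fraud['class'].value_counts(normalize=True))  # roughly 0.17% fraudulent

# Candidate corrective strategies (not applied here): stratified splits,
# class weights (e.g. class_weight='balanced' in scikit-learn estimators),
# or resampling (under-/over-sampling, SMOTE).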
There are no missing values, so let's check for duplicates next.
fraud[fraud.duplicated(keep=False)]
            time        v1        v2  ...  amount  class
32          26.0 -0.529912  0.873892  ...    6.14      0
33          26.0 -0.529912  0.873892  ...    6.14      0
34          26.0 -0.535388  0.865268  ...    1.77      0
35          26.0 -0.535388  0.865268  ...    1.77      0
112         74.0  1.038370  0.127486  ...    1.18      0
...          ...       ...       ...  ...     ...    ...
283485  171627.0 -1.457978  1.378203  ...   11.93      0
284190  172233.0 -2.667936  3.160505  ...   55.66      0
284191  172233.0 -2.667936  3.160505  ...   55.66      0
284192  172233.0 -2.691642  3.123168  ...   36.74      0
284193  172233.0 -2.691642  3.123168  ...   36.74      0

[1854 rows x 31 columns]
The duplicates are expected, since there can be several transactions per second with near-identical values in the PCA features. No action is required.
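For reference, should deduplication be wanted later, counting and dropping exact duplicates is a one-liner each (a sketch only, not applied here).

# Sketch only: count exact duplicate rows (beyond the first occurrence).
fraud.duplicated().sum()
# fraud = fraud.drop_duplicates()  # not applied, per the note above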
2.1.5. Correlations
Let's look at the correlations between the numerical features.
name = 'heatmap@fraud--corr.png'
corr = fraud.corr()
plotter.corr(corr, name)
name
Nothing unexpected; amount and time show some correlation with the PCA features, which are, by construction, uncorrelated with each other.
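To back the heatmap with numbers, the same correlation matrix can be inspected directly. A minimal sketch, assuming corr from the previous cell contains the time, amount and v1 to v28 columns:

# List the PCA components most correlated with amount and time,
# reusing the corr matrix computed for the heatmap above.
pca_cols = [c for c in corr.columns if c.startswith('v')]
for col in ['amount', 'time']:
    strongest = corr.loc[pca_cols, col].abs().sort_values(ascending=False)
    print(col, 'strongest correlations with the PCA components:')
    print(strongest.head())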