Breast Cancer

1. Init

Let's start by importing the necessary modules.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg') # non-interactive backend, produce pngs instead
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter

2. Analysis

In this section we analyse the breast-cancer dataset. We start by reading the accompanying data docs, which are quite informative: they cover the data collection process, the features, the ML classification task and other useful details (such as the absence of missing values).

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

cancer = pd.read_csv('../data/data/breast-cancer.csv')
cancer.head()
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0   
1    842517         M        20.57         17.77          132.90     1326.0   
2  84300903         M        19.69         21.25          130.00     1203.0   
3  84348301         M        11.42         20.38           77.58      386.1   
4  84358402         M        20.29         14.34          135.10     1297.0   

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710   
1          0.08474           0.07864          0.0869              0.07017   
2          0.10960           0.15990          0.1974              0.12790   
3          0.14250           0.28390          0.2414              0.10520   
4          0.10030           0.13280          0.1980              0.10430   

   symmetry_mean  fractal_dimension_mean  radius_se  texture_se  perimeter_se  \
0         0.2419                 0.07871     1.0950      0.9053         8.589   
1         0.1812                 0.05667     0.5435      0.7339         3.398   
2         0.2069                 0.05999     0.7456      0.7869         4.585   
3         0.2597                 0.09744     0.4956      1.1560         3.445   
4         0.1809                 0.05883     0.7572      0.7813         5.438   

   area_se  smoothness_se  compactness_se  concavity_se  concave points_se  \
0   153.40       0.006399         0.04904       0.05373            0.01587   
1    74.08       0.005225         0.01308       0.01860            0.01340   
2    94.03       0.006150         0.04006       0.03832            0.02058   
3    27.23       0.009110         0.07458       0.05661            0.01867   
4    94.44       0.011490         0.02461       0.05688            0.01885   

   symmetry_se  fractal_dimension_se  radius_worst  texture_worst  \
0      0.03003              0.006193         25.38          17.33   
1      0.01389              0.003532         24.99          23.41   
2      0.02250              0.004571         23.57          25.53   
3      0.05963              0.009208         14.91          26.50   
4      0.01756              0.005115         22.54          16.67   

   perimeter_worst  area_worst  smoothness_worst  compactness_worst  \
0           184.60      2019.0            0.1622             0.6656   
1           158.80      1956.0            0.1238             0.1866   
2           152.50      1709.0            0.1444             0.4245   
3            98.87       567.7            0.2098             0.8663   
4           152.20      1575.0            0.1374             0.2050   

   concavity_worst  concave points_worst  symmetry_worst  \
0           0.7119                0.2654          0.4601   
1           0.2416                0.1860          0.2750   
2           0.4504                0.2430          0.3613   
3           0.6869                0.2575          0.6638   
4           0.4000                0.1625          0.2364   

   fractal_dimension_worst  Unnamed: 32  
0                  0.11890          NaN  
1                  0.08902          NaN  
2                  0.08758          NaN  
3                  0.17300          NaN  
4                  0.07678          NaN  

The docs do not give further details on the features and what they mean. For instance, there are several *_se features, which could stand for standard error, but we cannot be certain without proper documentation. The docs also fail to mention the units in which the features are recorded. Often, we want the features to be represented in the same unit of measure.
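To see the naming pattern more clearly, the column names can be grouped by their suffix. The snippet below is a minimal sketch that uses a handful of names copied from the output above instead of the full frame. It only shows that each measurement appears with a `mean`, `se` and `worst` variant; it does not settle what `se` actually stands for.

```python
from collections import defaultdict

# A few column names copied from the head() output above.
columns = [
    'radius_mean', 'texture_mean', 'concave points_mean',
    'radius_se', 'texture_se', 'concave points_se',
    'radius_worst', 'texture_worst', 'concave points_worst',
]

groups = defaultdict(list)
for col in columns:
    # rpartition splits on the LAST underscore, so a name such as
    # 'fractal_dimension_se' would give base 'fractal_dimension', suffix 'se'.
    base, _, suffix = col.rpartition('_')
    groups[suffix].append(base)

for suffix, bases in groups.items():
    print(suffix, bases)
```

With the real data, `columns` could be replaced by `cancer.columns` once the non-feature columns have been removed.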

cancer.shape
569 33
cancer.dtypes
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Unnamed: 32                float64
dtype: object
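All features are floats, but the head() output already shows that they live on very different scales (area_mean is in the hundreds or thousands, smoothness_mean is around 0.1). Since the units are undocumented, one common workaround is to standardise each column to zero mean and unit standard deviation. The following is a sketch on five rows copied from the output above, not a step taken in this analysis.

```python
import pandas as pd

# Five rows copied from the head() output above, restricted to two columns.
sample = pd.DataFrame({
    'area_mean': [1001.0, 1326.0, 1203.0, 386.1, 1297.0],
    'smoothness_mean': [0.11840, 0.08474, 0.10960, 0.14250, 0.10030],
})

# z-score: subtract the column mean, then divide by the column standard deviation.
# pandas uses the sample standard deviation (ddof=1) by default.
standardised = (sample - sample.mean()) / sample.std()
print(standardised.round(3))
```

If the data is later split into training and test sets, the mean and standard deviation would normally be computed on the training portion only.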

The id can be dropped since it's just a unique identifier (does not add any new information to our model).

cancer = cancer.drop('id', axis='columns')
cancer.shape
569 32
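Before discarding an identifier it can be worth confirming that it really is unique, since repeated ids could indicate several samples from the same source. A minimal sketch on a toy frame (the three ids are copied from the head() output; the frame itself is made up for illustration):

```python
import pandas as pd

# Toy frame with the same layout as the first two columns of the dataset.
toy = pd.DataFrame({
    'id': [842302, 842517, 84300903],
    'diagnosis': ['M', 'M', 'M'],
})

# is_unique is True when no value repeats, i.e. the column is a pure identifier.
id_is_unique = toy['id'].is_unique
if id_is_unique:
    toy = toy.drop('id', axis='columns')

print(id_is_unique, toy.columns.tolist())
```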

There is an unnamed feature, Unnamed: 32, which we should investigate further.

cancer.columns
Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')
cancer['Unnamed: 32'].describe()
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: Unnamed: 32, dtype: float64
cancer['Unnamed: 32'].value_counts()
Series([], Name: Unnamed: 32, dtype: int64)

The column seems to be empty (probably an error in the csv itself). Let's inspect the csv directly.

head ../data/data/breast-cancer.csv

It looks like there is an empty column in the csv, so let's drop it.

cancer = cancer.drop('Unnamed: 32', axis='columns')
cancer.shape
569 31
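One common cause of such a column is a trailing comma at the end of each line, which pandas reads as an extra, nameless field. Whether that is what happened in this file is an assumption; the sketch below reproduces the effect on a hypothetical three-line csv and removes every all-empty column in one step.

```python
import io
import pandas as pd

# Hypothetical miniature csv in which every line ends with a trailing comma.
raw = "id,diagnosis,radius_mean,\n1,M,17.99,\n2,B,11.42,\n"

df = pd.read_csv(io.StringIO(raw))
before = df.columns.tolist()
print(before)  # the nameless fourth field is labelled 'Unnamed: 3'

# Drop every column that contains only missing values.
df = df.dropna(axis='columns', how='all')
print(df.columns.tolist())
```

Compared with dropping the column by name, `dropna(axis='columns', how='all')` does not depend on the position of the empty field.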

Our target is diagnosis so it should be represented as a category dtype.

cancer['diagnosis'] = cancer['diagnosis'].str.strip().astype('category')
cancer['diagnosis'].value_counts()
B    357
M    212
Name: diagnosis, dtype: int64
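The counts above correspond to roughly 63% benign and 37% malignant samples (357/569 and 212/569). As a small sketch on toy labels, the same proportions can be obtained with `value_counts(normalize=True)`, and the category codes give a numeric encoding of the target should a model need one.

```python
import pandas as pd

# Toy labels, one with stray whitespace; the real column holds 357 'B' and 212 'M'.
diagnosis = pd.Series(['M', 'B', 'B', ' M', 'B'], name='diagnosis')
diagnosis = diagnosis.str.strip().astype('category')

# Relative class frequencies show how imbalanced the target is.
proportions = diagnosis.value_counts(normalize=True)
print(proportions)

# Categories are sorted, so 'B' gets code 0 and 'M' gets code 1.
codes = diagnosis.cat.codes
print(codes.tolist())
```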

Let's look at the descriptive statistics, missing values and duplicates next.

cancer.describe(include='all')
       diagnosis  radius_mean  texture_mean  perimeter_mean    area_mean  \
count        569   569.000000    569.000000      569.000000   569.000000   
unique         2          NaN           NaN             NaN          NaN   
top            B          NaN           NaN             NaN          NaN   
freq         357          NaN           NaN             NaN          NaN   
mean         NaN    14.127292     19.289649       91.969033   654.889104   
std          NaN     3.524049      4.301036       24.298981   351.914129   
min          NaN     6.981000      9.710000       43.790000   143.500000   
25%          NaN    11.700000     16.170000       75.170000   420.300000   
50%          NaN    13.370000     18.840000       86.240000   551.100000   
75%          NaN    15.780000     21.800000      104.100000   782.700000   
max          NaN    28.110000     39.280000      188.500000  2501.000000   

        smoothness_mean  compactness_mean  concavity_mean  \
count        569.000000        569.000000      569.000000   
unique              NaN               NaN             NaN   
top                 NaN               NaN             NaN   
freq                NaN               NaN             NaN   
mean           0.096360          0.104341        0.088799   
std            0.014064          0.052813        0.079720   
min            0.052630          0.019380        0.000000   
25%            0.086370          0.064920        0.029560   
50%            0.095870          0.092630        0.061540   
75%            0.105300          0.130400        0.130700   
max            0.163400          0.345400        0.426800   

        concave points_mean  symmetry_mean  fractal_dimension_mean  \
count            569.000000     569.000000              569.000000   
unique                  NaN            NaN                     NaN   
top                     NaN            NaN                     NaN   
freq                    NaN            NaN                     NaN   
mean               0.048919       0.181162                0.062798   
std                0.038803       0.027414                0.007060   
min                0.000000       0.106000                0.049960   
25%                0.020310       0.161900                0.057700   
50%                0.033500       0.179200                0.061540   
75%                0.074000       0.195700                0.066120   
max                0.201200       0.304000                0.097440   

         radius_se  texture_se  perimeter_se     area_se  smoothness_se  \
count   569.000000  569.000000    569.000000  569.000000     569.000000   
unique         NaN         NaN           NaN         NaN            NaN   
top            NaN         NaN           NaN         NaN            NaN   
freq           NaN         NaN           NaN         NaN            NaN   
mean      0.405172    1.216853      2.866059   40.337079       0.007041   
std       0.277313    0.551648      2.021855   45.491006       0.003003   
min       0.111500    0.360200      0.757000    6.802000       0.001713   
25%       0.232400    0.833900      1.606000   17.850000       0.005169   
50%       0.324200    1.108000      2.287000   24.530000       0.006380   
75%       0.478900    1.474000      3.357000   45.190000       0.008146   
max       2.873000    4.885000     21.980000  542.200000       0.031130   

        compactness_se  concavity_se  concave points_se  symmetry_se  \
count       569.000000    569.000000         569.000000   569.000000   
unique             NaN           NaN                NaN          NaN   
top                NaN           NaN                NaN          NaN   
freq               NaN           NaN                NaN          NaN   
mean          0.025478      0.031894           0.011796     0.020542   
std           0.017908      0.030186           0.006170     0.008266   
min           0.002252      0.000000           0.000000     0.007882   
25%           0.013080      0.015090           0.007638     0.015160   
50%           0.020450      0.025890           0.010930     0.018730   
75%           0.032450      0.042050           0.014710     0.023480   
max           0.135400      0.396000           0.052790     0.078950   

        fractal_dimension_se  radius_worst  texture_worst  perimeter_worst  \
count             569.000000    569.000000     569.000000       569.000000   
unique                   NaN           NaN            NaN              NaN   
top                      NaN           NaN            NaN              NaN   
freq                     NaN           NaN            NaN              NaN   
mean                0.003795     16.269190      25.677223       107.261213   
std                 0.002646      4.833242       6.146258        33.602542   
min                 0.000895      7.930000      12.020000        50.410000   
25%                 0.002248     13.010000      21.080000        84.110000   
50%                 0.003187     14.970000      25.410000        97.660000   
75%                 0.004558     18.790000      29.720000       125.400000   
max                 0.029840     36.040000      49.540000       251.200000   

         area_worst  smoothness_worst  compactness_worst  concavity_worst  \
count    569.000000        569.000000         569.000000       569.000000   
unique          NaN               NaN                NaN              NaN   
top             NaN               NaN                NaN              NaN   
freq            NaN               NaN                NaN              NaN   
mean     880.583128          0.132369           0.254265         0.272188   
std      569.356993          0.022832           0.157336         0.208624   
min      185.200000          0.071170           0.027290         0.000000   
25%      515.300000          0.116600           0.147200         0.114500   
50%      686.500000          0.131300           0.211900         0.226700   
75%     1084.000000          0.146000           0.339100         0.382900   
max     4254.000000          0.222600           1.058000         1.252000   

        concave points_worst  symmetry_worst  fractal_dimension_worst  
count             569.000000      569.000000               569.000000  
unique                   NaN             NaN                      NaN  
top                      NaN             NaN                      NaN  
freq                     NaN             NaN                      NaN  
mean                0.114606        0.290076                 0.083946  
std                 0.065732        0.061867                 0.018061  
min                 0.000000        0.156500                 0.055040  
25%                 0.064930        0.250400                 0.071460  
50%                 0.099930        0.282200                 0.080040  
75%                 0.161400        0.317900                 0.092080  
max                 0.291000        0.663800                 0.207500  
cancer.isna().any()
diagnosis                  False
radius_mean                False
texture_mean               False
perimeter_mean             False
area_mean                  False
smoothness_mean            False
compactness_mean           False
concavity_mean             False
concave points_mean        False
symmetry_mean              False
fractal_dimension_mean     False
radius_se                  False
texture_se                 False
perimeter_se               False
area_se                    False
smoothness_se              False
compactness_se             False
concavity_se               False
concave points_se          False
symmetry_se                False
fractal_dimension_se       False
radius_worst               False
texture_worst              False
perimeter_worst            False
area_worst                 False
smoothness_worst           False
compactness_worst          False
concavity_worst            False
concave points_worst       False
symmetry_worst             False
fractal_dimension_worst    False
dtype: bool
cancer[cancer.duplicated()].shape
0 31
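Both checks can also be condensed into two numbers, which is easier to read than a per-column listing when the frame is wide. A minimal sketch on a toy frame that deliberately contains one missing value and one duplicated row:

```python
import numpy as np
import pandas as pd

# Toy frame: row 2 repeats row 1, and the last row has a missing radius.
toy = pd.DataFrame({
    'radius_mean': [17.99, 20.57, 20.57, np.nan],
    'texture_mean': [10.38, 17.77, 17.77, 21.25],
})

n_missing = int(toy.isna().sum().sum())     # total number of NaN cells
n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row

print(n_missing, n_duplicates)
```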

There are no duplicates or missing values. Let's check the correlation next.

name = 'heatmap@breast-cancer--corr.png'
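# correlate the numeric features only; newer pandas versions raise an error
# if the categorical target is passed to corr() implicitly
corr = cancer.select_dtypes('number').corr()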

plotter.corr(corr, name)
name

heatmap@breast-cancer--corr.png

Some features are strongly positively correlated (radius, perimeter and area, for example, are geometrically related), so feature selection may be possible.
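To turn the heatmap into a concrete list of candidates, the correlation matrix can be filtered for pairs above a threshold. The helper below is a sketch: `correlated_pairs` and the 0.9 cut-off are illustrative choices, not part of the `plotter` module used above, and the toy frame stands in for the real data.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes('number').corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN left over from the masking.
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy frame: y is almost a multiple of x, z is only weakly related to either.
toy = pd.DataFrame({
    'x': [1.0, 2.0, 3.0, 4.0, 5.0],
    'y': [2.0, 4.0, 6.0, 8.0, 10.5],
    'z': [5.0, 1.0, 4.0, 2.0, 3.0],
})

pairs = correlated_pairs(toy)
print(pairs)
```

For each reported pair, one of the two features is a candidate for removal; which one to keep remains a modelling decision.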

Date: 2021-10-26 Tue 00:00

Created: 2021-10-27 Wed 13:05