Breast Cancer
1. Init
Let's start by importing the necessary modules.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib
import matplotlib.pyplot as plt

matplotlib.use('Agg')  # non-interactive backend, produce pngs instead

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
# uncomment the following line to prevent truncated output
# pd.set_option('display.large_repr', 'info')

from context import src
from src import utils, plotter
2. Analysis
In this section we analyse the breast-cancer dataset. We start by
reading the accompanying data docs. The docs are quite informative: they
touch upon the data collection process, the features, the ML
classification task, and other useful information (such as the absence
of missing values).
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
cancer = pd.read_csv('../data/data/breast-cancer.csv')
cancer.head()
         id diagnosis  radius_mean  texture_mean  perimeter_mean  area_mean  \
0    842302         M        17.99         10.38          122.80     1001.0
1    842517         M        20.57         17.77          132.90     1326.0
2  84300903         M        19.69         21.25          130.00     1203.0
3  84348301         M        11.42         20.38           77.58      386.1
4  84358402         M        20.29         14.34          135.10     1297.0

   smoothness_mean  compactness_mean  concavity_mean  concave points_mean  \
0          0.11840           0.27760          0.3001              0.14710
1          0.08474           0.07864          0.0869              0.07017
2          0.10960           0.15990          0.1974              0.12790
3          0.14250           0.28390          0.2414              0.10520
4          0.10030           0.13280          0.1980              0.10430

   symmetry_mean  fractal_dimension_mean  radius_se  texture_se  perimeter_se  \
0         0.2419                 0.07871     1.0950      0.9053         8.589
1         0.1812                 0.05667     0.5435      0.7339         3.398
2         0.2069                 0.05999     0.7456      0.7869         4.585
3         0.2597                 0.09744     0.4956      1.1560         3.445
4         0.1809                 0.05883     0.7572      0.7813         5.438

   area_se  smoothness_se  compactness_se  concavity_se  concave points_se  \
0   153.40       0.006399         0.04904       0.05373            0.01587
1    74.08       0.005225         0.01308       0.01860            0.01340
2    94.03       0.006150         0.04006       0.03832            0.02058
3    27.23       0.009110         0.07458       0.05661            0.01867
4    94.44       0.011490         0.02461       0.05688            0.01885

   symmetry_se  fractal_dimension_se  radius_worst  texture_worst  \
0      0.03003              0.006193         25.38          17.33
1      0.01389              0.003532         24.99          23.41
2      0.02250              0.004571         23.57          25.53
3      0.05963              0.009208         14.91          26.50
4      0.01756              0.005115         22.54          16.67

   perimeter_worst  area_worst  smoothness_worst  compactness_worst  \
0           184.60      2019.0            0.1622             0.6656
1           158.80      1956.0            0.1238             0.1866
2           152.50      1709.0            0.1444             0.4245
3            98.87       567.7            0.2098             0.8663
4           152.20      1575.0            0.1374             0.2050

   concavity_worst  concave points_worst  symmetry_worst  \
0           0.7119                0.2654          0.4601
1           0.2416                0.1860          0.2750
2           0.4504                0.2430          0.3613
3           0.6869                0.2575          0.6638
4           0.4000                0.1625          0.2364

   fractal_dimension_worst  Unnamed: 32
0                  0.11890          NaN
1                  0.08902          NaN
2                  0.08758          NaN
3                  0.17300          NaN
4                  0.07678          NaN
However, the docs give few details on what the individual features mean.
For instance, there are several *_se features, which could stand for
standard error, but we cannot be certain without proper documentation.
Another problem is that the docs do not mention the units in which the
features are recorded. Often, we want the features to be represented in
the same unit of measure.
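Even without documented units, the features can be brought onto a common scale before modelling. The following is a minimal sketch, assuming z-score standardisation suits the downstream model (the docs do not prescribe any scaling). The small `sample` frame is a hypothetical stand-in that reuses three rows from the output above.

```python
import pandas as pd

# Three rows copied from the cancer.head() output above (illustration only).
sample = pd.DataFrame({
    'radius_mean': [17.99, 20.57, 19.69],
    'area_mean': [1001.0, 1326.0, 1203.0],
    'smoothness_mean': [0.11840, 0.08474, 0.10960],
})

# z-score: subtract the column mean and divide by the column standard
# deviation, so each feature ends up with mean 0 and sample std 1.
standardised = (sample - sample.mean()) / sample.std()
print(standardised.round(3))
```

After this step the columns are unitless and comparable in magnitude, which matters for distance-based or gradient-based models.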
cancer.shape
(569, 33)
cancer.dtypes
id                           int64
diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
Unnamed: 32                float64
dtype: object
The id column can be dropped since it is just a unique identifier and
adds no information for a model to learn from.
cancer = cancer.drop('id', axis='columns')
cancer.shape
(569, 32)
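Before dropping an identifier column, it can be worth confirming that it really is unique per row. A minimal sketch of such a check, using only the five ids visible in the head output above (the `ids` series is a hypothetical stand-in for `cancer['id']`):

```python
import pandas as pd

# The five ids shown in the cancer.head() output above (illustration only).
ids = pd.Series([842302, 842517, 84300903, 84348301, 84358402], name='id')

# True when no value repeats, i.e. the column behaves like a key.
print(ids.is_unique)

# Equivalent check: one distinct value per row.
print(ids.nunique() == len(ids))
```

If the check failed on the full dataset, repeated ids might indicate repeated measurements of the same patient, which would be worth knowing before splitting the data.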
There is an unnamed feature, Unnamed: 32, which we should investigate
further.
cancer.columns
Index(['diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean',
       'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_worst', 'texture_worst',
       'perimeter_worst', 'area_worst', 'smoothness_worst',
       'compactness_worst', 'concavity_worst', 'concave points_worst',
       'symmetry_worst', 'fractal_dimension_worst', 'Unnamed: 32'],
      dtype='object')
cancer['Unnamed: 32'].describe()
count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: Unnamed: 32, dtype: float64
cancer['Unnamed: 32'].value_counts()
Series([], Name: Unnamed: 32, dtype: int64)
The column seems to be empty (probably an error in the CSV itself). Let's investigate the CSV directly.
head ../data/data/breast-cancer.csv
It looks like there is an empty column in the CSV, so let's drop it.
cancer = cancer.drop('Unnamed: 32', axis='columns')
cancer.shape
(569, 31)
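One common cause of such a column is a trailing comma at the end of every line, although the source does not confirm that this is what happened here. The sketch below reproduces the effect on a hypothetical two-row CSV and shows a way to detect and drop all-empty columns without hard-coding their names.

```python
import io
import pandas as pd

# Hypothetical two-row CSV; every line ends with a trailing comma.
raw = "id,diagnosis,radius_mean,\n842302,M,17.99,\n842517,M,20.57,\n"
df = pd.read_csv(io.StringIO(raw))
print(df.columns.tolist())

# Names of the columns that contain nothing but NaN.
empty_cols = df.columns[df.isna().all()].tolist()
print(empty_cols)

# Drop every all-NaN column in one step.
df = df.dropna(axis='columns', how='all')
print(df.shape)
```

Dropping by emptiness rather than by name keeps the cleaning step working if the column position changes in a later version of the file.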
Our target is diagnosis, so it should be represented with the category
dtype.
cancer['diagnosis'] = cancer['diagnosis'].str.strip().astype('category')
cancer['diagnosis'].value_counts()
B    357
M    212
Name: diagnosis, dtype: int64
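The counts above also tell us how balanced the two classes are, which is useful to know for a classification task. A small sketch that turns the reported counts into proportions (the `counts` series is rebuilt by hand from the output above; on the real frame, `value_counts(normalize=True)` would give the same numbers):

```python
import pandas as pd

# Class counts taken from the value_counts() output above.
counts = pd.Series({'B': 357, 'M': 212}, name='diagnosis')

# Share of each class among all 569 samples.
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 63% of the samples are B and 37% are M, so the classes are moderately imbalanced.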
Let's look at the descriptive statistics, missing values, and duplicates next.
cancer.describe(include='all')
       diagnosis  radius_mean  texture_mean  perimeter_mean    area_mean  \
count        569   569.000000    569.000000      569.000000   569.000000
unique         2          NaN           NaN             NaN          NaN
top            B          NaN           NaN             NaN          NaN
freq         357          NaN           NaN             NaN          NaN
mean         NaN    14.127292     19.289649       91.969033   654.889104
std          NaN     3.524049      4.301036       24.298981   351.914129
min          NaN     6.981000      9.710000       43.790000   143.500000
25%          NaN    11.700000     16.170000       75.170000   420.300000
50%          NaN    13.370000     18.840000       86.240000   551.100000
75%          NaN    15.780000     21.800000      104.100000   782.700000
max          NaN    28.110000     39.280000      188.500000  2501.000000

        smoothness_mean  compactness_mean  concavity_mean  \
count        569.000000        569.000000      569.000000
unique              NaN               NaN             NaN
top                 NaN               NaN             NaN
freq                NaN               NaN             NaN
mean           0.096360          0.104341        0.088799
std            0.014064          0.052813        0.079720
min            0.052630          0.019380        0.000000
25%            0.086370          0.064920        0.029560
50%            0.095870          0.092630        0.061540
75%            0.105300          0.130400        0.130700
max            0.163400          0.345400        0.426800

        concave points_mean  symmetry_mean  fractal_dimension_mean  \
count            569.000000     569.000000              569.000000
unique                  NaN            NaN                     NaN
top                     NaN            NaN                     NaN
freq                    NaN            NaN                     NaN
mean               0.048919       0.181162                0.062798
std                0.038803       0.027414                0.007060
min                0.000000       0.106000                0.049960
25%                0.020310       0.161900                0.057700
50%                0.033500       0.179200                0.061540
75%                0.074000       0.195700                0.066120
max                0.201200       0.304000                0.097440

         radius_se  texture_se  perimeter_se     area_se  smoothness_se  \
count   569.000000  569.000000    569.000000  569.000000     569.000000
unique         NaN         NaN           NaN         NaN            NaN
top            NaN         NaN           NaN         NaN            NaN
freq           NaN         NaN           NaN         NaN            NaN
mean      0.405172    1.216853      2.866059   40.337079       0.007041
std       0.277313    0.551648      2.021855   45.491006       0.003003
min       0.111500    0.360200      0.757000    6.802000       0.001713
25%       0.232400    0.833900      1.606000   17.850000       0.005169
50%       0.324200    1.108000      2.287000   24.530000       0.006380
75%       0.478900    1.474000      3.357000   45.190000       0.008146
max       2.873000    4.885000     21.980000  542.200000       0.031130

        compactness_se  concavity_se  concave points_se  symmetry_se  \
count       569.000000    569.000000         569.000000   569.000000
unique             NaN           NaN                NaN          NaN
top                NaN           NaN                NaN          NaN
freq               NaN           NaN                NaN          NaN
mean          0.025478      0.031894           0.011796     0.020542
std           0.017908      0.030186           0.006170     0.008266
min           0.002252      0.000000           0.000000     0.007882
25%           0.013080      0.015090           0.007638     0.015160
50%           0.020450      0.025890           0.010930     0.018730
75%           0.032450      0.042050           0.014710     0.023480
max           0.135400      0.396000           0.052790     0.078950

        fractal_dimension_se  radius_worst  texture_worst  perimeter_worst  \
count             569.000000    569.000000     569.000000       569.000000
unique                   NaN           NaN            NaN              NaN
top                      NaN           NaN            NaN              NaN
freq                     NaN           NaN            NaN              NaN
mean                0.003795     16.269190      25.677223       107.261213
std                 0.002646      4.833242       6.146258        33.602542
min                 0.000895      7.930000      12.020000        50.410000
25%                 0.002248     13.010000      21.080000        84.110000
50%                 0.003187     14.970000      25.410000        97.660000
75%                 0.004558     18.790000      29.720000       125.400000
max                 0.029840     36.040000      49.540000       251.200000

         area_worst  smoothness_worst  compactness_worst  concavity_worst  \
count    569.000000        569.000000         569.000000       569.000000
unique          NaN               NaN                NaN              NaN
top             NaN               NaN                NaN              NaN
freq            NaN               NaN                NaN              NaN
mean     880.583128          0.132369           0.254265         0.272188
std      569.356993          0.022832           0.157336         0.208624
min      185.200000          0.071170           0.027290         0.000000
25%      515.300000          0.116600           0.147200         0.114500
50%      686.500000          0.131300           0.211900         0.226700
75%     1084.000000          0.146000           0.339100         0.382900
max     4254.000000          0.222600           1.058000         1.252000

        concave points_worst  symmetry_worst  fractal_dimension_worst
count             569.000000      569.000000               569.000000
unique                   NaN             NaN                      NaN
top                      NaN             NaN                      NaN
freq                     NaN             NaN                      NaN
mean                0.114606        0.290076                 0.083946
std                 0.065732        0.061867                 0.018061
min                 0.000000        0.156500                 0.055040
25%                 0.064930        0.250400                 0.071460
50%                 0.099930        0.282200                 0.080040
75%                 0.161400        0.317900                 0.092080
max                 0.291000        0.663800                 0.207500
cancer.isna().any()
diagnosis                  False
radius_mean                False
texture_mean               False
perimeter_mean             False
area_mean                  False
smoothness_mean            False
compactness_mean           False
concavity_mean             False
concave points_mean        False
symmetry_mean              False
fractal_dimension_mean     False
radius_se                  False
texture_se                 False
perimeter_se               False
area_se                    False
smoothness_se              False
compactness_se             False
concavity_se               False
concave points_se          False
symmetry_se                False
fractal_dimension_se       False
radius_worst               False
texture_worst              False
perimeter_worst            False
area_worst                 False
smoothness_worst           False
compactness_worst          False
concavity_worst            False
concave points_worst       False
symmetry_worst             False
fractal_dimension_worst    False
dtype: bool
cancer[cancer.duplicated()].shape
(0, 31)
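The two checks above can also be condensed into scalar counts, which are easier to read than a per-column listing. A minimal sketch on a hypothetical toy frame (`toy` is invented for illustration and deliberately contains one missing value and one duplicated row):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one NaN and one repeated row.
toy = pd.DataFrame({
    'diagnosis': ['M', 'B', 'B', 'M'],
    'radius_mean': [17.99, 11.42, 11.42, np.nan],
})

# Total number of missing cells across all columns.
n_missing = int(toy.isna().sum().sum())

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(n_missing, n_duplicates)
```

On the cancer frame both counts would be expected to be zero, in line with the outputs above.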
There are no duplicates or missing values. Let's check the correlation next.
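name = 'heatmap@breast-cancer--corr.png'
# diagnosis is categorical, so restrict the correlation to the numeric
# columns (numeric_only requires pandas >= 1.5)
corr = cancer.corr(numeric_only=True)
plotter.corr(corr, name)
name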
Some features are strongly positively correlated with each other, so feature selection may be possible.
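One common heuristic for this kind of feature selection is to drop one feature from every highly correlated pair. The sketch below is an assumption-laden illustration, not the method used in this analysis: the `toy` frame, its column names, and the 0.9 `threshold` are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
radius = rng.uniform(7, 28, size=200)

# Hypothetical toy data: perimeter and area are derived from radius,
# texture is independent noise.
toy = pd.DataFrame({
    'radius': radius,
    'perimeter': 2 * np.pi * radius,
    'area': np.pi * radius ** 2,
    'texture': rng.uniform(10, 40, size=200),
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair is considered once
# and the diagonal (self-correlation) is ignored.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off, tune for the task at hand
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]

reduced = toy.drop(columns=to_drop)
print(to_drop)
print(reduced.columns.tolist())
```

The heuristic keeps the first feature of each correlated group and discards the later ones, so the column order influences which features survive.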