Permit
Table of Contents
- 1. Init
- 2. Analysis
- 2.1. Preliminary analysis
- 2.1.1. Handling unwanted columns
- 2.1.2. Handling datetime features
- 2.1.3. Handling sensitive information
- 2.1.4. Handling structural_notification & tidf_compliance
- 2.1.5. Handling estimated_cost & revised_cost
- 2.1.6. Handling existing_use
- 2.1.7. Handling existing_units & proposed_units
- 2.1.8. Handling categorical features
- 2.1.9. Handling plansets & supervisor_district
- 2.1.10. Descriptive statistics, missing & duplicates
- 2.1.11. Correlations
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the permit dataset. We start by reading the accompanying data docs. The docs provide some background information regarding the value of the data, but nothing on how the data was collected. An Excel file containing a description of all the columns is also included.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
permit = pd.read_csv(utils.data_path('permit.csv'))
permit.head()
Permit Number Permit Type Permit Type Definition \ 0 201505065519 4 sign - erect 1 201604195146 4 sign - erect 2 201605278609 3 additions alterations or repairs 3 201611072166 8 otc alterations permit 4 201611283529 6 demolitions Permit Creation Date Block Lot Street Number Street Number Suffix \ 0 05/06/2015 0326 023 140 NaN 1 04/19/2016 0306 007 440 NaN 2 05/27/2016 0595 203 1647 NaN 3 11/07/2016 0156 011 1230 NaN 4 11/28/2016 0342 001 950 NaN Street Name Street Suffix Unit Unit Suffix \ 0 Ellis St NaN NaN 1 Geary St 0.0 NaN 2 Pacific Av NaN NaN 3 Pacific Av 0.0 NaN 4 Market St NaN NaN Description \ 0 ground fl facade: to erect illuminated, electric, wall, single faced sign. n/a for maher ordinance 155-13. 1 remove (e) awning and associated signs. 2 installation of separating wall 3 repair dryrot & stucco at front of bldg. 4 demolish retail/office/commercial 3-story building Current Status Current Status Date Filed Date Issued Date Completed Date \ 0 expired 12/21/2017 05/06/2015 11/09/2015 NaN 1 issued 08/03/2017 04/19/2016 08/03/2017 NaN 2 withdrawn 09/26/2017 05/27/2016 NaN NaN 3 complete 07/24/2017 11/07/2016 07/18/2017 07/24/2017 4 issued 12/01/2017 11/28/2016 12/01/2017 NaN First Construction Document Date Structural Notification \ 0 11/09/2015 NaN 1 08/03/2017 NaN 2 NaN NaN 3 07/18/2017 NaN 4 11/20/2017 NaN Number of Existing Stories Number of Proposed Stories \ 0 6.0 NaN 1 7.0 NaN 2 6.0 6.0 3 2.0 2.0 4 3.0 NaN Voluntary Soft-Story Retrofit Fire Only Permit Permit Expiration Date \ 0 NaN NaN 11/03/2016 1 NaN NaN 12/03/2017 2 NaN NaN NaN 3 NaN NaN 07/13/2018 4 NaN NaN 12/01/2018 Estimated Cost Revised Cost Existing Use Existing Units \ 0 4000.0 4000.0 tourist hotel/motel 143.0 1 1.0 500.0 tourist hotel/motel NaN 2 20000.0 NaN retail sales 39.0 3 2000.0 2000.0 1 family dwelling 1.0 4 100000.0 100000.0 retail sales NaN Proposed Use Proposed Units Plansets TIDF Compliance \ 0 NaN NaN 2.0 NaN 1 NaN NaN 2.0 NaN 2 retail sales 39.0 2.0 NaN 3 1 family dwelling 1.0 2.0 NaN 4 NaN NaN 2.0 NaN Existing Construction Type Existing Construction Type Description \ 0 3.0 constr type 3 1 3.0 constr type 3 2 1.0 constr type 1 3 5.0 wood frame (5) 4 3.0 constr type 3 Proposed Construction Type Proposed Construction Type Description \ 0 NaN NaN 1 NaN NaN 2 1.0 constr type 1 3 5.0 wood frame (5) 4 NaN NaN Site Permit Supervisor District Neighborhoods - Analysis Boundaries \ 0 NaN 3.0 Tenderloin 1 NaN 3.0 Tenderloin 2 NaN 3.0 Russian Hill 3 NaN 3.0 Nob Hill 4 NaN 6.0 Tenderloin Zipcode Location Record ID 0 94102.0 (37.785719256680785, -122.40852313194863) 1380611233945 1 94102.0 (37.78733980600732, -122.41063199757738) 1420164406718 2 94109.0 (37.7946573324287, -122.42232562979227) 1424856504716 3 94109.0 (37.79595867909168, -122.41557405519474) 1443574295566 4 94102.0 (37.78315261897309, -122.40950883997789) 144548169992
We note that there are several columns with a variety of dtypes. Let's rename the columns for simplicity prior to further analysis.
columns = permit.columns
columns = columns.str.replace(r'[\s-]+', '_', regex=True)
columns = columns.str.strip()
columns = columns.str.lower()
permit.columns = columns
permit.columns
Index(['permit_number', 'permit_type', 'permit_type_definition', 'permit_creation_date', 'block', 'lot', 'street_number', 'street_number_suffix', 'street_name', 'street_suffix', 'unit', 'unit_suffix', 'description', 'current_status', 'current_status_date', 'filed_date', 'issued_date', 'completed_date', 'first_construction_document_date', 'structural_notification', 'number_of_existing_stories', 'number_of_proposed_stories', 'voluntary_soft_story_retrofit', 'fire_only_permit', 'permit_expiration_date', 'estimated_cost', 'revised_cost', 'existing_use', 'existing_units', 'proposed_use', 'proposed_units', 'plansets', 'tidf_compliance', 'existing_construction_type', 'existing_construction_type_description', 'proposed_construction_type', 'proposed_construction_type_description', 'site_permit', 'supervisor_district', 'neighborhoods_analysis_boundaries', 'zipcode', 'location', 'record_id'], dtype='object')
permit.dtypes
permit_number object permit_type int64 permit_type_definition object permit_creation_date object block object lot object street_number int64 street_number_suffix object street_name object street_suffix object unit float64 unit_suffix object description object current_status object current_status_date object filed_date object issued_date object completed_date object first_construction_document_date object structural_notification object number_of_existing_stories float64 number_of_proposed_stories float64 voluntary_soft_story_retrofit object fire_only_permit object permit_expiration_date object estimated_cost float64 revised_cost float64 existing_use object existing_units float64 proposed_use object proposed_units float64 plansets float64 tidf_compliance object existing_construction_type float64 existing_construction_type_description object proposed_construction_type float64 proposed_construction_type_description object site_permit object supervisor_district float64 neighborhoods_analysis_boundaries object zipcode float64 location object record_id int64 dtype: object
2.1.1. Handling unwanted columns
We can drop permit_number & record_id since they are just unique identifiers and do not add anything new to the model. We can also drop permit_type_definition & {existing_construction,proposed_construction}_type_description, since {permit,existing_construction,proposed_construction}_type are their numerical equivalents. description is a text feature; we may want to extract useful numerical features from it, but for this analysis we drop the column.
drop_columns = ['permit_number', 'record_id', 'permit_type_definition',
                'existing_construction_type_description',
                'proposed_construction_type_description', 'description']
permit = permit.drop(drop_columns, axis='columns')
permit.shape
(198900, 37)
2.1.2. Handling datetime features
We have several datetime features which should be converted to the datetime dtype.
datetime_features = ['permit_creation_date', 'current_status_date', 'filed_date',
                     'issued_date', 'completed_date',
                     'first_construction_document_date', 'permit_expiration_date']
permit[datetime_features]
permit_creation_date current_status_date filed_date issued_date \ 0 05/06/2015 12/21/2017 05/06/2015 11/09/2015 1 04/19/2016 08/03/2017 04/19/2016 08/03/2017 2 05/27/2016 09/26/2017 05/27/2016 NaN 3 11/07/2016 07/24/2017 11/07/2016 07/18/2017 4 11/28/2016 12/01/2017 11/28/2016 12/01/2017 ... ... ... ... ... 198895 12/05/2017 12/05/2017 12/05/2017 12/05/2017 198896 12/05/2017 12/06/2017 12/05/2017 12/06/2017 198897 12/06/2017 12/06/2017 12/06/2017 12/06/2017 198898 12/06/2017 12/06/2017 12/06/2017 12/06/2017 198899 12/07/2017 12/07/2017 12/07/2017 12/07/2017 completed_date first_construction_document_date permit_expiration_date 0 NaN 11/09/2015 11/03/2016 1 NaN 08/03/2017 12/03/2017 2 NaN NaN NaN 3 07/24/2017 07/18/2017 07/13/2018 4 NaN 11/20/2017 12/01/2018 ... ... ... ... 198895 NaN 12/05/2017 NaN 198896 NaN 12/06/2017 04/06/2018 198897 NaN 12/06/2017 NaN 198898 NaN 12/06/2017 NaN 198899 NaN 12/07/2017 NaN [198900 rows x 7 columns]
We can see that the timestamps are in a custom, "human friendly" format. The docs do not state whether the dates are in dd/mm/yyyy or mm/dd/yyyy format. Since the data was collected in the United States, we can assume mm/dd/yyyy. However, we would still have to verify this manually, which adds unnecessary technical debt.
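A cheap sanity check, shown here only as a sketch, can rule out mm/dd/yyyy (though not confirm it): a month component can never exceed 12. filed_date is used below purely as a representative column.

# If the first date component ever exceeds 12 it cannot be a month, which
# would rule out the assumed mm/dd/yyyy format. filed_date is just an example.
first_component = permit['filed_date'].dropna().str.split('/').str[0].astype(int)
(first_component > 12).any()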
permit[datetime_features].dtypes
permit_creation_date                object
current_status_date                 object
filed_date                          object
issued_date                         object
completed_date                      object
first_construction_document_date    object
permit_expiration_date              object
dtype: object
datetime = permit[datetime_features]
datetime[datetime.isna().any(axis='columns')]
permit_creation_date current_status_date filed_date issued_date \ 0 05/06/2015 12/21/2017 05/06/2015 11/09/2015 1 04/19/2016 08/03/2017 04/19/2016 08/03/2017 2 05/27/2016 09/26/2017 05/27/2016 NaN 4 11/28/2016 12/01/2017 11/28/2016 12/01/2017 5 06/14/2017 07/06/2017 06/14/2017 07/06/2017 ... ... ... ... ... 198895 12/05/2017 12/05/2017 12/05/2017 12/05/2017 198896 12/05/2017 12/06/2017 12/05/2017 12/06/2017 198897 12/06/2017 12/06/2017 12/06/2017 12/06/2017 198898 12/06/2017 12/06/2017 12/06/2017 12/06/2017 198899 12/07/2017 12/07/2017 12/07/2017 12/07/2017 completed_date first_construction_document_date permit_expiration_date 0 NaN 11/09/2015 11/03/2016 1 NaN 08/03/2017 12/03/2017 2 NaN NaN NaN 4 NaN 11/20/2017 12/01/2018 5 NaN 07/06/2017 07/01/2018 ... ... ... ... 198895 NaN 12/05/2017 NaN 198896 NaN 12/06/2017 04/06/2018 198897 NaN 12/06/2017 NaN 198898 NaN 12/06/2017 NaN 198899 NaN 12/07/2017 NaN [101766 rows x 7 columns]
Roughly 50% of the rows are missing at least one of the datetime features, and dropping those rows would seriously impede the model's ability to learn. Unfortunately, imputation is beyond the scope of this analysis, so we leave these features unconverted for now; the cell below shows the conversion we would apply once the missing values are handled.
# This is what we would execute after handling the missing values.
# pd.to_datetime works on one column at a time, so it is applied per feature.
permit[datetime_features] = permit[datetime_features].apply(pd.to_datetime, format='%m/%d/%Y')
permit[datetime_features]
2.1.3. Handling sensitive information
We have several features related to the address of a resident (down to the unit number and coordinates!). This is highly sensitive data which should not leak. Depending on the task, we can either consider anonymisation or simply drop these features.
Even the description feature (which was dropped earlier) contains highly sensitive information. Imagine that a house needs its roof repaired and construction is ongoing; this information may be exploited (for instance, for burglary or to claim squatter's rights).
sensitive_features = ['block', 'lot', 'street_number', 'street_number_suffix',
                      'unit', 'unit_suffix', 'zipcode', 'location']
permit[sensitive_features]
block lot street_number street_number_suffix unit unit_suffix \ 0 0326 023 140 NaN NaN NaN 1 0306 007 440 NaN 0.0 NaN 2 0595 203 1647 NaN NaN NaN 3 0156 011 1230 NaN 0.0 NaN 4 0342 001 950 NaN NaN NaN ... ... ... ... ... ... ... 198895 0113 017A 1228 NaN NaN NaN 198896 0271 014 580 NaN NaN NaN 198897 4318 019 1568 NaN NaN NaN 198898 0298 029 795 NaN NaN NaN 198899 0160 006 838 NaN NaN NaN zipcode location 0 94102.0 (37.785719256680785, -122.40852313194863) 1 94102.0 (37.78733980600732, -122.41063199757738) 2 94109.0 (37.7946573324287, -122.42232562979227) 3 94109.0 (37.79595867909168, -122.41557405519474) 4 94102.0 (37.78315261897309, -122.40950883997789) ... ... ... 198895 NaN NaN 198896 NaN NaN 198897 NaN NaN 198898 NaN NaN 198899 NaN NaN [198900 rows x 8 columns]
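If we wanted to keep some location signal, one possible treatment is sketched below under the assumption that the downstream task only needs coarse geography; the variable names are illustrative, and coordinate rounding or hashing would be reasonable alternatives.

# Hypothetical coarse anonymisation: drop everything that pins down an exact
# address and keep only zipcode as a coarse location signal. street_name &
# street_suffix could arguably be added to this list as well.
address_features = ['block', 'lot', 'street_number', 'street_number_suffix',
                    'unit', 'unit_suffix', 'location']
anonymised = permit.drop(address_features, axis='columns')
anonymised.shape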
2.1.4. Handling structural_notification & tidf_compliance
The structural_notification & tidf_compliance features are interesting. On the surface they appear to contain a lot of missing values. However, consulting the docs we realise that a missing value represents a no in this case, i.e. these are binary features.
binary = permit[['structural_notification', 'tidf_compliance']]
binary.isna().any()
structural_notification    True
tidf_compliance            True
dtype: bool
binary['structural_notification'].str.strip().value_counts()
Y    6922
Name: structural_notification, dtype: int64
binary['tidf_compliance'].str.strip().value_counts()
Y    1
P    1
Name: tidf_compliance, dtype: int64
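For completeness, had the coverage been better, a minimal encoding (assuming, as the docs suggest, that a missing value means no) could look like the sketch below; it is hypothetical since we drop these features next.

# Hypothetical encoding: map the non-missing marker ('Y') to True and NaN to
# False. tidf_compliance also contains a 'P' value whose meaning is unclear,
# so only structural_notification is shown here.
encoded = binary['structural_notification'].notna()
encoded.value_counts()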
In this case, however, over 95% of the values are missing (structural_notification has only 6,922 non-missing entries out of 198,900 rows, and tidf_compliance has just 2), so it is perhaps best to drop both features.
permit = permit.drop(['structural_notification', 'tidf_compliance'], axis='columns')
permit.shape
(198900, 35)
2.1.5. Handling estimated_cost & revised_cost
estimated_cost & revised_cost represent an amount, but the docs don't specify the currency. This is an instance of the missing/unknown unit smell. Since the dataset consists of building permits from San Francisco, a safe guess is that the amounts are in USD.
cost_features = ['estimated_cost', 'revised_cost']
cost = permit[cost_features]
cost
        estimated_cost  revised_cost
0               4000.0        4000.0
1                  1.0         500.0
2              20000.0           NaN
3               2000.0        2000.0
4             100000.0      100000.0
...                ...           ...
198895             NaN           1.0
198896          5000.0        5000.0
198897             NaN           1.0
198898             NaN           1.0
198899             NaN           1.0

[198900 rows x 2 columns]
cost.dtypes
estimated_cost    float64
revised_cost      float64
dtype: object
2.1.6. Handling existing_use
existing_use requires further analysis; it looks like a text feature, but it can perhaps be converted to a categorical one.
existing_use = permit['existing_use']
existing_use = existing_use.str.strip()
existing_use = existing_use.str.lower()
existing_use = existing_use.str.replace(r'\s+', '_', regex=True)
existing_use.unique()
['tourist_hotel/motel' 'retail_sales' '1_family_dwelling' 'apartments' nan '2_family_dwelling' 'church' 'storage_shed' 'office' 'vacant_lot' 'food/beverage_hndlng' 'residential_hotel' 'filling/service_stn' 'workshop_commercial' 'clinics-medic/dental' 'misc_group_residns.' 'hospital' 'club' 'barber/beauty_salon' 'warehouse,no_frnitur' 'school' 'artist_live/work' 'manufacturing' 'garment_shops' 'public_assmbly_other' 'auto_repairs' 'lending_institution' 'museum' 'warehouse,_furniture' 'prkng_garage/private' 'antenna' 'health_studios_&_gym' 'massage_parlor' 'printing_plant' 'parking_lot' 'workshop_residential' 'power_plant' 'tower' 'mortuary' 'animal_sale_or_care' 'laundry/laundromat' 'nite_club' 'paint_store' 'recreation_bldg' 'theater' 'prkng_garage/public' 'sign' 'phone_xchnge/equip' 'dance_hall' 'sfpd_or_sffd_station' 'storage_tanks' 'muni_carbarn' 'stadium' 'automobile_sales' 'fence/retaining_wall' 'radio_&_tv_stations' 'social_care_facility' 'amusement_center' 'day_care_home_gt_12' 'moving_&_storage' 'dry_cleaners' 'day_care_home_7_-_12' 'chemical_processing' 'accessory_cottage' 'day_care,_non-res' 'nursing_home_non_amb' 'wholesale_sales' 'library' 'nursery(floral)' 'day_care_center' 'nursing_home_gt_6' 'sewage_plant' 'convalescent_home' 'greenhouse' 'adult_entertainment' 'muni_driver_restroom' 'sound_studio' 'dairies/dairy_equip.' 'christmas_tree_lot' 'bath_house' 'jail' "prson'l_svc_tutor" 'r-3(dwg)_nursing' 'car_wash' 'roofing_materials' 'orphanage' 'ambulance_service' 'meat/produce_marts' 'building_materials' 'temple' 'swimming_pool' 'day_care_home_lt_7' 'nursing_home_lte_6' 'child_care']
If we go the categorical route, we need to heavily simplify the values; alternatively, we can treat it as a text feature and extract appropriate features from it.
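As an illustration of the categorical route, a coarse grouping could look like the sketch below; the buckets and keyword lists are assumptions for demonstration, not an official taxonomy.

# Hypothetical coarse grouping of the cleaned existing_use values into a
# handful of buckets. The keyword lists are illustrative only.
def coarse_use(value):
    if pd.isna(value):
        return np.nan
    if 'dwelling' in value or 'apartment' in value or 'residns' in value:
        return 'residential'
    if 'retail' in value or 'office' in value or 'hotel' in value:
        return 'commercial'
    return 'other'

existing_use.map(coarse_use).value_counts(dropna=False)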
2.1.7. Handling existing_units & proposed_units
The documentation for existing_units & proposed_units is vague: it says "existing number of units", but what exactly is a unit?
units_features = ['existing_units', 'proposed_units']
units = permit[units_features]
units
        existing_units  proposed_units
0                143.0             NaN
1                  NaN             NaN
2                 39.0            39.0
3                  1.0             1.0
4                  NaN             NaN
...                ...             ...
198895             NaN             NaN
198896             4.0             4.0
198897             NaN             NaN
198898             NaN             NaN
198899             NaN             NaN

[198900 rows x 2 columns]
Unfortunately we can't do much without proper documentation or domain expertise.
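One weak signal we could still extract without knowing the exact definition is whether the permit changes the number of units at all; this is a hypothetical derived feature, shown only as a sketch.

# Hypothetical feature: does the permit change the unit count? Only defined
# where both values are present; NaN elsewhere.
units_changed = (units['proposed_units'] - units['existing_units']).ne(0)
units_changed = units_changed.where(units.notna().all(axis='columns'))
units_changed.value_counts(dropna=False)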
2.1.8. Handling categorical features
We have several categorical features in the dataset.
categorical_features = ['proposed_use', 'neighborhoods_analysis_boundaries',
                        'current_status', 'voluntary_soft_story_retrofit',
                        'fire_only_permit', 'site_permit']
categorical = permit[categorical_features]
categorical
proposed_use neighborhoods_analysis_boundaries current_status \ 0 NaN Tenderloin expired 1 NaN Tenderloin issued 2 retail sales Russian Hill withdrawn 3 1 family dwelling Nob Hill complete 4 NaN Tenderloin issued ... ... ... ... 198895 NaN NaN issued 198896 apartments NaN issued 198897 NaN NaN issued 198898 NaN NaN issued 198899 NaN NaN issued voluntary_soft_story_retrofit fire_only_permit site_permit 0 NaN NaN NaN 1 NaN NaN NaN 2 NaN NaN NaN 3 NaN NaN NaN 4 NaN NaN NaN ... ... ... ... 198895 NaN NaN NaN 198896 NaN Y NaN 198897 NaN NaN NaN 198898 NaN NaN NaN 198899 NaN NaN NaN [198900 rows x 6 columns]
Let's analyse proposed_use & neighborhoods_analysis_boundaries
further.
proposed_use = categorical['proposed_use']
proposed_use = proposed_use.str.strip()
proposed_use = proposed_use.str.lower()
proposed_use = proposed_use.str.replace(r'\s+', '_', regex=True)
proposed_use.unique()
[nan 'retail_sales' '1_family_dwelling' 'apartments' '2_family_dwelling' 'church' 'vacant_lot' 'office' 'tourist_hotel/motel' 'school' 'filling/service_stn' 'food/beverage_hndlng' 'residential_hotel' 'storage_shed' 'clinics-medic/dental' 'misc_group_residns.' 'club' 'hospital' 'barber/beauty_salon' 'warehouse,no_frnitur' 'artist_live/work' 'museum' 'lending_institution' 'garment_shops' 'child_care' 'auto_repairs' 'manufacturing' 'day_care_home_lt_7' 'workshop_commercial' 'warehouse,_furniture' 'prkng_garage/private' 'antenna' 'health_studios_&_gym' 'massage_parlor' 'printing_plant' 'parking_lot' 'workshop_residential' 'power_plant' 'tower' 'sfpd_or_sffd_station' 'mortuary' 'animal_sale_or_care' 'fence/retaining_wall' 'nite_club' 'paint_store' 'recreation_bldg' 'theater' 'nursery(floral)' 'prkng_garage/public' 'sign' 'phone_xchnge/equip' 'dance_hall' 'storage_tanks' 'muni_carbarn' 'automobile_sales' 'day_care_center' 'public_assmbly_other' 'greenhouse' 'library' 'radio_&_tv_stations' 'social_care_facility' 'laundry/laundromat' 'accessory_cottage' 'amusement_center' 'day_care_home_gt_12' 'muni_driver_restroom' 'day_care_home_7_-_12' 'moving_&_storage' 'dry_cleaners' 'chemical_processing' 'day_care,_non-res' 'nursing_home_non_amb' 'wholesale_sales' 'stadium' 'bath_house' 'nursing_home_gt_6' 'sewage_plant' 'convalescent_home' 'adult_entertainment' "prson'l_svc_tutor" 'dairies/dairy_equip.' 'christmas_tree_lot' 'jail' 'r-3(dwg)_nursing' 'sound_studio' 'car_wash' 'roofing_materials' 'orphanage' 'swimming_pool' 'ambulance_service' 'not_applicable' 'building_materials' 'meat/produce_marts' 'temple' 'nursing_home_lte_6']
proposed_use is similar to existing_use; we should consider a similar strategy to extract meaningful features from it.
neighborhoods = categorical['neighborhoods_analysis_boundaries']
neighborhoods = neighborhoods.str.strip()
neighborhoods = neighborhoods.str.lower()
neighborhoods = neighborhoods.str.replace(r'\s+', '_', regex=True)
neighborhoods.unique()
['tenderloin' 'russian_hill' 'nob_hill' 'potrero_hill' 'inner_sunset' 'bayview_hunters_point' 'lone_mountain/usf' 'haight_ashbury' 'castro/upper_market' 'hayes_valley' 'noe_valley' 'pacific_heights' 'chinatown' 'financial_district/south_beach' 'marina' 'mission' 'sunset/parkside' 'outer_richmond' 'western_addition' 'bernal_heights' 'inner_richmond' 'oceanview/merced/ingleside' 'outer_mission' 'portola' 'mission_bay' 'visitacion_valley' 'presidio_heights' 'west_of_twin_peaks' 'south_of_market' 'excelsior' 'north_beach' 'glen_park' 'treasure_island' 'twin_peaks' 'lincoln_park' nan 'japantown' 'lakeshore' 'seacliff' 'golden_gate_park' 'presidio' 'mclaren_park']
neighborhoods.value_counts()
financial_district/south_beach 21816 mission 14681 sunset/parkside 10207 west_of_twin_peaks 8739 castro/upper_market 8527 pacific_heights 8508 marina 8244 outer_richmond 7854 noe_valley 7844 south_of_market 7572 bernal_heights 6067 nob_hill 6009 haight_ashbury 5798 inner_sunset 5776 bayview_hunters_point 5669 russian_hill 5495 hayes_valley 5489 tenderloin 4783 inner_richmond 4458 potrero_hill 4293 presidio_heights 4084 north_beach 4054 western_addition 3867 chinatown 3765 lone_mountain/usf 3358 excelsior 3072 oceanview/merced/ingleside 2654 glen_park 2637 mission_bay 2287 outer_mission 2242 twin_peaks 1702 portola 1433 lakeshore 1308 seacliff 992 visitacion_valley 900 japantown 700 treasure_island 81 golden_gate_park 64 presidio 51 lincoln_park 49 mclaren_park 46 Name: neighborhoods_analysis_boundaries, dtype: int64
This feature should be categorical. There may also be an implicit ordering (some neighborhoods are considered more desirable than others), so we may want to use label (ordinal) encoding as opposed to one-hot encoding.
permit['neighborhoods_analysis_boundaries'] = neighborhoods.astype('category')
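Both encodings are cheap to sketch; the snippet below is illustrative only, and the right choice depends on the downstream model.

# Label/ordinal encoding: the integer codes follow the category order, which
# here is simply alphabetical unless we supply a meaningful ranking ourselves.
label_encoded = permit['neighborhoods_analysis_boundaries'].cat.codes

# One-hot encoding: one indicator column per neighbourhood.
one_hot = pd.get_dummies(permit['neighborhoods_analysis_boundaries'],
                         prefix='neighborhood')
label_encoded.head(), one_hot.shape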
The remaining features (such as current_status, fire_only_permit & voluntary_soft_story_retrofit, to name a few) have missing values, but the meaning of those missing values is unclear due to the lack of proper documentation. For example, a missing value in fire_only_permit may indicate a no, or it may simply mean the information was not recorded.
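A quick look at how much data is actually missing in each of them can at least inform a keep-or-drop decision; this is a simple descriptive check, not a resolution of the ambiguity.

# Fraction of missing values per remaining categorical feature.
categorical[['current_status', 'voluntary_soft_story_retrofit',
             'fire_only_permit', 'site_permit']].isna().mean().sort_values()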
2.1.9. Handling plansets & supervisor_district
Documentation regarding plansets & supervisor_district is lacking; without domain knowledge we don't know what the values in these features indicate.
permit[['plansets', 'supervisor_district']]
        plansets  supervisor_district
0            2.0                  3.0
1            2.0                  3.0
2            2.0                  3.0
3            2.0                  3.0
4            2.0                  6.0
...          ...                  ...
198895       NaN                  NaN
198896       2.0                  NaN
198897       NaN                  NaN
198898       NaN                  NaN
198899       NaN                  NaN

[198900 rows x 2 columns]
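Short of domain knowledge, the most we can do is a descriptive peek at their cardinality and range, as in the sketch below.

# Cardinality and value range of the two undocumented features.
undocumented = permit[['plansets', 'supervisor_district']]
undocumented.agg(['nunique', 'min', 'max'])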
2.1.10. Descriptive statistics, missing & duplicates
The prior sections indicated that the dataset requires a lot of pre-processing effort, which is beyond the scope of this analysis. Checking for missing values & duplicates therefore won't lead to meaningful insights (we already know there are lots of missing values!). Without the pre-processing, insights from the descriptive statistics may also be misleading.
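For reference, a follow-up pass would start with checks along these lines; the sketch is shown without output since the numbers would not be meaningful yet.

# Per-column missing fraction, duplicate row count and descriptive statistics,
# to be revisited once the pre-processing above has been done.
permit.isna().mean().sort_values(ascending=False)
permit.duplicated().sum()
permit.describe(include='all')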
2.1.11. Correlations
We skip the correlation analysis for the same reasons as above.
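Were we to run it, the starting point would likely be a simple heatmap over the numerical features; the sketch below is illustrative only, as the result is not meaningful before the missing values and dtypes are dealt with.

# Correlation heatmap over the numerical features.
corr = permit.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, fmt='.2f')
plt.show()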