Permit

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the permit dataset. We start by reading the accompanying data docs, which give some background on the value of the data but say nothing about how the data was collected. An Excel file describing all the columns is also included.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

permit = pd.read_csv(utils.data_path('permit.csv'))
permit.head()
  Permit Number  Permit Type            Permit Type Definition  \
0  201505065519            4                      sign - erect   
1  201604195146            4                      sign - erect   
2  201605278609            3  additions alterations or repairs   
3  201611072166            8            otc alterations permit   
4  201611283529            6                       demolitions   

  Permit Creation Date Block  Lot  Street Number Street Number Suffix  \
0           05/06/2015  0326  023            140                  NaN   
1           04/19/2016  0306  007            440                  NaN   
2           05/27/2016  0595  203           1647                  NaN   
3           11/07/2016  0156  011           1230                  NaN   
4           11/28/2016  0342  001            950                  NaN   

  Street Name Street Suffix  Unit Unit Suffix  \
0       Ellis            St   NaN         NaN   
1       Geary            St   0.0         NaN   
2     Pacific            Av   NaN         NaN   
3     Pacific            Av   0.0         NaN   
4      Market            St   NaN         NaN   

                                                                                                  Description  \
0  ground fl facade: to erect illuminated, electric, wall, single faced sign. n/a for maher ordinance 155-13.   
1                                                                     remove (e) awning and associated signs.   
2                                                                             installation of separating wall   
3                                                                    repair dryrot & stucco at front of bldg.   
4                                                          demolish retail/office/commercial 3-story building   

  Current Status Current Status Date  Filed Date Issued Date Completed Date  \
0        expired          12/21/2017  05/06/2015  11/09/2015            NaN   
1         issued          08/03/2017  04/19/2016  08/03/2017            NaN   
2      withdrawn          09/26/2017  05/27/2016         NaN            NaN   
3       complete          07/24/2017  11/07/2016  07/18/2017     07/24/2017   
4         issued          12/01/2017  11/28/2016  12/01/2017            NaN   

  First Construction Document Date Structural Notification  \
0                       11/09/2015                     NaN   
1                       08/03/2017                     NaN   
2                              NaN                     NaN   
3                       07/18/2017                     NaN   
4                       11/20/2017                     NaN   

   Number of Existing Stories  Number of Proposed Stories  \
0                         6.0                         NaN   
1                         7.0                         NaN   
2                         6.0                         6.0   
3                         2.0                         2.0   
4                         3.0                         NaN   

  Voluntary Soft-Story Retrofit Fire Only Permit Permit Expiration Date  \
0                           NaN              NaN             11/03/2016   
1                           NaN              NaN             12/03/2017   
2                           NaN              NaN                    NaN   
3                           NaN              NaN             07/13/2018   
4                           NaN              NaN             12/01/2018   

   Estimated Cost  Revised Cost         Existing Use  Existing Units  \
0          4000.0        4000.0  tourist hotel/motel           143.0   
1             1.0         500.0  tourist hotel/motel             NaN   
2         20000.0           NaN         retail sales            39.0   
3          2000.0        2000.0    1 family dwelling             1.0   
4        100000.0      100000.0         retail sales             NaN   

        Proposed Use  Proposed Units  Plansets TIDF Compliance  \
0                NaN             NaN       2.0             NaN   
1                NaN             NaN       2.0             NaN   
2       retail sales            39.0       2.0             NaN   
3  1 family dwelling             1.0       2.0             NaN   
4                NaN             NaN       2.0             NaN   

   Existing Construction Type Existing Construction Type Description  \
0                         3.0                          constr type 3   
1                         3.0                          constr type 3   
2                         1.0                          constr type 1   
3                         5.0                         wood frame (5)   
4                         3.0                          constr type 3   

   Proposed Construction Type Proposed Construction Type Description  \
0                         NaN                                    NaN   
1                         NaN                                    NaN   
2                         1.0                          constr type 1   
3                         5.0                         wood frame (5)   
4                         NaN                                    NaN   

  Site Permit  Supervisor District Neighborhoods - Analysis Boundaries  \
0         NaN                  3.0                          Tenderloin   
1         NaN                  3.0                          Tenderloin   
2         NaN                  3.0                        Russian Hill   
3         NaN                  3.0                            Nob Hill   
4         NaN                  6.0                          Tenderloin   

   Zipcode                                   Location      Record ID  
0  94102.0  (37.785719256680785, -122.40852313194863)  1380611233945  
1  94102.0   (37.78733980600732, -122.41063199757738)  1420164406718  
2  94109.0    (37.7946573324287, -122.42232562979227)  1424856504716  
3  94109.0   (37.79595867909168, -122.41557405519474)  1443574295566  
4  94102.0   (37.78315261897309, -122.40950883997789)   144548169992  

We note that there are several columns with a variety of dtypes. Let's rename the columns for simplicity prior to further analysis.

columns = permit.columns
columns = columns.str.strip()
columns = columns.str.replace(r'[\s-]+', '_', regex=True)
columns = columns.str.lower()
permit.columns = columns
permit.columns
Index(['permit_number', 'permit_type', 'permit_type_definition',
       'permit_creation_date', 'block', 'lot', 'street_number',
       'street_number_suffix', 'street_name', 'street_suffix', 'unit',
       'unit_suffix', 'description', 'current_status', 'current_status_date',
       'filed_date', 'issued_date', 'completed_date',
       'first_construction_document_date', 'structural_notification',
       'number_of_existing_stories', 'number_of_proposed_stories',
       'voluntary_soft_story_retrofit', 'fire_only_permit',
       'permit_expiration_date', 'estimated_cost', 'revised_cost',
       'existing_use', 'existing_units', 'proposed_use', 'proposed_units',
       'plansets', 'tidf_compliance', 'existing_construction_type',
       'existing_construction_type_description', 'proposed_construction_type',
       'proposed_construction_type_description', 'site_permit',
       'supervisor_district', 'neighborhoods_analysis_boundaries', 'zipcode',
       'location', 'record_id'],
      dtype='object')
permit.dtypes
permit_number                              object
permit_type                                 int64
permit_type_definition                     object
permit_creation_date                       object
block                                      object
lot                                        object
street_number                               int64
street_number_suffix                       object
street_name                                object
street_suffix                              object
unit                                      float64
unit_suffix                                object
description                                object
current_status                             object
current_status_date                        object
filed_date                                 object
issued_date                                object
completed_date                             object
first_construction_document_date           object
structural_notification                    object
number_of_existing_stories                float64
number_of_proposed_stories                float64
voluntary_soft_story_retrofit              object
fire_only_permit                           object
permit_expiration_date                     object
estimated_cost                            float64
revised_cost                              float64
existing_use                               object
existing_units                            float64
proposed_use                               object
proposed_units                            float64
plansets                                  float64
tidf_compliance                            object
existing_construction_type                float64
existing_construction_type_description     object
proposed_construction_type                float64
proposed_construction_type_description     object
site_permit                                object
supervisor_district                       float64
neighborhoods_analysis_boundaries          object
zipcode                                   float64
location                                   object
record_id                                   int64
dtype: object
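
The renaming steps above can be wrapped in a small helper, sketched here on an empty toy frame. The name normalise_columns is hypothetical (it is not part of src.utils), and the uniqueness check is an added safeguard, since collapsing whitespace and hyphens could in principle map two different headers onto the same name.

```python
import pandas as pd


def normalise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with snake_case column names (hypothetical helper)."""
    columns = df.columns.str.strip()                           # trim outer whitespace first
    columns = columns.str.replace(r'[\s-]+', '_', regex=True)  # whitespace/hyphen runs -> '_'
    columns = columns.str.lower()
    if not columns.is_unique:
        raise ValueError('normalised column names collide')
    renamed = df.copy()
    renamed.columns = columns
    return renamed


toy = pd.DataFrame(columns=['Permit Number',
                            'Voluntary Soft-Story Retrofit',
                            'Neighborhoods - Analysis Boundaries'])
renamed = normalise_columns(toy)
print(list(renamed.columns))
```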

2.1.1. Handling unwanted columns

We can drop permit_number & record_id since they are just unique identifiers and add no information for a model.

We can also drop permit_type_definition & {existing_construction,proposed_construction}_type_description since {permit,existing_construction,proposed_construction}_type are the numerical equivalents.

description is a text feature. We may want to extract useful numerical features from it, but for this analysis we drop the column.

drop_columns = ['permit_number',
                'record_id',
                'permit_type_definition',
                'existing_construction_type_description',
                'proposed_construction_type_description',
                'description']
permit = permit.drop(drop_columns, axis='columns')
permit.shape
(198900, 37)
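
Before discarding the description columns, it may be worth confirming that each numeric code maps to exactly one description and that the identifiers really are unique. A minimal sketch on a hand-made frame: the rows are illustrative (loosely modelled on the head of the dataset) and is_one_to_one is a hypothetical helper.

```python
import pandas as pd

toy = pd.DataFrame({
    'record_id': [101, 102, 103, 104],          # illustrative identifiers
    'permit_type': [4, 4, 3, 8],
    'permit_type_definition': ['sign - erect',
                               'sign - erect',
                               'additions alterations or repairs',
                               'otc alterations permit'],
})


def is_one_to_one(df: pd.DataFrame, code: str, description: str) -> bool:
    """True if every code has exactly one description and vice versa."""
    pairs = df[[code, description]].dropna().drop_duplicates()
    return bool(pairs[code].is_unique and pairs[description].is_unique)


ids_unique = bool(toy['record_id'].is_unique)
codes_match = is_one_to_one(toy, 'permit_type', 'permit_type_definition')
print(ids_unique, codes_match)
```

If the second check fails on the real data, the description column carries information that the numeric code does not, and dropping it loses something.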

2.1.2. Handling datetime features

We have several datetime features which should be converted to datetime dtype.

datetime_features = ['permit_creation_date',
                     'current_status_date',
                     'filed_date',
                     'issued_date',
                     'completed_date',
                     'first_construction_document_date',
                     'permit_expiration_date']
permit[datetime_features]
       permit_creation_date current_status_date  filed_date issued_date  \
0                05/06/2015          12/21/2017  05/06/2015  11/09/2015   
1                04/19/2016          08/03/2017  04/19/2016  08/03/2017   
2                05/27/2016          09/26/2017  05/27/2016         NaN   
3                11/07/2016          07/24/2017  11/07/2016  07/18/2017   
4                11/28/2016          12/01/2017  11/28/2016  12/01/2017   
...                     ...                 ...         ...         ...   
198895           12/05/2017          12/05/2017  12/05/2017  12/05/2017   
198896           12/05/2017          12/06/2017  12/05/2017  12/06/2017   
198897           12/06/2017          12/06/2017  12/06/2017  12/06/2017   
198898           12/06/2017          12/06/2017  12/06/2017  12/06/2017   
198899           12/07/2017          12/07/2017  12/07/2017  12/07/2017   

       completed_date first_construction_document_date permit_expiration_date  
0                 NaN                       11/09/2015             11/03/2016  
1                 NaN                       08/03/2017             12/03/2017  
2                 NaN                              NaN                    NaN  
3          07/24/2017                       07/18/2017             07/13/2018  
4                 NaN                       11/20/2017             12/01/2018  
...               ...                              ...                    ...  
198895            NaN                       12/05/2017                    NaN  
198896            NaN                       12/06/2017             04/06/2018  
198897            NaN                       12/06/2017                    NaN  
198898            NaN                       12/06/2017                    NaN  
198899            NaN                       12/07/2017                    NaN  

[198900 rows x 7 columns]

We can see that the timestamps are stored as strings in a custom, "human friendly" format. The docs fail to mention whether the dates are in dd/mm/yyyy or mm/dd/yyyy format. Since this data was collected in the United States, we can assume mm/dd/yyyy, and values such as 05/27/2016 are consistent with that. However, having to verify the format manually is unnecessary technical debt.
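
The format can also be checked against the data itself: a field that ever exceeds 12 cannot be a month. A minimal sketch using a few values from the table above (infer_day_position is a hypothetical helper):

```python
import pandas as pd

dates = pd.Series(['05/06/2015', '04/19/2016', '05/27/2016', None])


def infer_day_position(values: pd.Series) -> str:
    """Guess which of the first two '/'-separated fields holds the day."""
    parts = values.dropna().str.split('/', expand=True).astype(int)
    first_gt_12 = bool((parts[0] > 12).any())
    second_gt_12 = bool((parts[1] > 12).any())
    if first_gt_12 and not second_gt_12:
        return 'dd/mm/yyyy'
    if second_gt_12 and not first_gt_12:
        return 'mm/dd/yyyy'
    return 'ambiguous'   # no field exceeds 12, or both do


detected = infer_day_position(dates)
print(detected)
```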

permit[datetime_features].dtypes
permit_creation_date                object
current_status_date                 object
filed_date                          object
issued_date                         object
completed_date                      object
first_construction_document_date    object
permit_expiration_date              object
dtype: object
datetime = permit[datetime_features]
datetime[datetime.isna().any(axis='columns')]
       permit_creation_date current_status_date  filed_date issued_date  \
0                05/06/2015          12/21/2017  05/06/2015  11/09/2015   
1                04/19/2016          08/03/2017  04/19/2016  08/03/2017   
2                05/27/2016          09/26/2017  05/27/2016         NaN   
4                11/28/2016          12/01/2017  11/28/2016  12/01/2017   
5                06/14/2017          07/06/2017  06/14/2017  07/06/2017   
...                     ...                 ...         ...         ...   
198895           12/05/2017          12/05/2017  12/05/2017  12/05/2017   
198896           12/05/2017          12/06/2017  12/05/2017  12/06/2017   
198897           12/06/2017          12/06/2017  12/06/2017  12/06/2017   
198898           12/06/2017          12/06/2017  12/06/2017  12/06/2017   
198899           12/07/2017          12/07/2017  12/07/2017  12/07/2017   

       completed_date first_construction_document_date permit_expiration_date  
0                 NaN                       11/09/2015             11/03/2016  
1                 NaN                       08/03/2017             12/03/2017  
2                 NaN                              NaN                    NaN  
4                 NaN                       11/20/2017             12/01/2018  
5                 NaN                       07/06/2017             07/01/2018  
...               ...                              ...                    ...  
198895            NaN                       12/05/2017                    NaN  
198896            NaN                       12/06/2017             04/06/2018  
198897            NaN                       12/06/2017                    NaN  
198898            NaN                       12/06/2017                    NaN  
198899            NaN                       12/07/2017                    NaN  

[101766 rows x 7 columns]

Roughly 51% of the rows (101,766 of 198,900) have at least one missing datetime value, so dropping them would seriously impede a model's ability to learn. Imputation is beyond the scope of this analysis, so we leave the missing values untouched and postpone the conversion to datetime.
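
To see where the gaps are concentrated, the share of missing values can be computed per column. A minimal sketch on the first five rows shown above (on the full dataset the same calls would be applied to permit[datetime_features]):

```python
import pandas as pd

sample = pd.DataFrame({
    'filed_date': ['05/06/2015', '04/19/2016', '05/27/2016', '11/07/2016', '11/28/2016'],
    'issued_date': ['11/09/2015', '08/03/2017', None, '07/18/2017', '12/01/2017'],
    'completed_date': [None, None, None, '07/24/2017', None],
})

# Fraction of missing values in each column, worst first.
missing_share = sample.isna().mean().sort_values(ascending=False)

# Fraction of rows with at least one missing value.
rows_with_gap = sample.isna().any(axis='columns').mean()

print(missing_share)
print(rows_with_gap)
```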

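# This is what we would have executed after handling the missing values.
# pd.to_datetime on a whole DataFrame tries to assemble dates from year/month/day columns, so we convert column by column.
permit[datetime_features] = permit[datetime_features].apply(pd.to_datetime, format='%m/%d/%Y')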
permit[datetime_features]
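
Note that pandas itself does not strictly require the gaps to be filled first: pd.to_datetime maps missing entries to NaT instead of raising. A minimal sketch on a single column, with values copied from issued_date above:

```python
import pandas as pd

issued = pd.Series(['11/09/2015', '08/03/2017', None], name='issued_date')

# Missing entries come back as NaT; the rest are parsed as mm/dd/yyyy.
converted = pd.to_datetime(issued, format='%m/%d/%Y')
print(converted)
```

Whether NaT is acceptable downstream is a separate question that depends on the model and the features derived from these dates.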

2.1.3. Handling sensitive information

We have several features related to the address of a resident (down to the unit number and coordinates!). This is highly sensitive data which should not leak. Depending on the task, we can either consider anonymisation or simply drop these features.

Even the description feature (which was dropped earlier) contains highly sensitive information. Imagine that a house needs its roof repaired and construction is ongoing. This information may be exploited (for instance for robbery or for claiming squatter's rights).

sensitive_features = ['block',
                      'lot',
                      'street_number',
                      'street_number_suffix',
                      'unit',
                      'unit_suffix',
                      'zipcode',
                      'location']
permit[sensitive_features]
       block   lot  street_number street_number_suffix  unit unit_suffix  \
0       0326   023            140                  NaN   NaN         NaN   
1       0306   007            440                  NaN   0.0         NaN   
2       0595   203           1647                  NaN   NaN         NaN   
3       0156   011           1230                  NaN   0.0         NaN   
4       0342   001            950                  NaN   NaN         NaN   
...      ...   ...            ...                  ...   ...         ...   
198895  0113  017A           1228                  NaN   NaN         NaN   
198896  0271   014            580                  NaN   NaN         NaN   
198897  4318   019           1568                  NaN   NaN         NaN   
198898  0298   029            795                  NaN   NaN         NaN   
198899  0160   006            838                  NaN   NaN         NaN   

        zipcode                                   location  
0       94102.0  (37.785719256680785, -122.40852313194863)  
1       94102.0   (37.78733980600732, -122.41063199757738)  
2       94109.0    (37.7946573324287, -122.42232562979227)  
3       94109.0   (37.79595867909168, -122.41557405519474)  
4       94102.0   (37.78315261897309, -122.40950883997789)  
...         ...                                        ...  
198895      NaN                                        NaN  
198896      NaN                                        NaN  
198897      NaN                                        NaN  
198898      NaN                                        NaN  
198899      NaN                                        NaN  

[198900 rows x 8 columns]
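
If anonymisation is preferred over dropping, one possible approach is to coarsen the coordinates and replace parcel identifiers with salted hashes. This is a minimal sketch only: the rounding precision, the salt and both helper names are assumptions, and salted hashing of low-entropy identifiers is pseudonymisation, not a guarantee of anonymity.

```python
import ast
import hashlib

import pandas as pd

sample = pd.DataFrame({
    'block': ['0326', '0306'],
    'lot': ['023', '007'],
    'location': ['(37.785719256680785, -122.40852313194863)', None],
})

SALT = 'replace-with-a-secret-value'  # assumption: stored outside version control


def coarsen_location(value, decimals=2):
    """Parse a '(lat, lon)' string and round it; two decimals is roughly 1 km."""
    if pd.isna(value):
        return None
    lat, lon = ast.literal_eval(value)
    return (round(lat, decimals), round(lon, decimals))


def pseudonymise(value):
    """Replace an identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((SALT + value).encode('utf-8')).hexdigest()[:12]


anonymised = pd.DataFrame({
    'parcel_id': (sample['block'] + '-' + sample['lot']).map(pseudonymise),
    'location': sample['location'].map(coarsen_location),
})
print(anonymised)
```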

2.1.4. Handling structural_notification & tidf_compliance

The structural_notification & tidf_compliance features are interesting. On the surface they appear to contain a lot of missing values. However, consulting the docs, we realise that a missing value represents a "no" in this case, i.e. these are binary features.

binary = permit[['structural_notification', 'tidf_compliance']]
binary.isna().any()
structural_notification    True
tidf_compliance            True
dtype: bool
binary['structural_notification'].str.strip().value_counts()
Y    6922
Name: structural_notification, dtype: int64
binary['tidf_compliance'].str.strip().value_counts()
Y    1
P    1
Name: tidf_compliance, dtype: int64

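In this case, however, more than 96% of the values in structural_notification and all but two of the values in tidf_compliance are missing, so it is perhaps best to drop both features.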

permit = permit.drop(['structural_notification', 'tidf_compliance'], axis='columns')
permit.shape
(198900, 35)
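
Had the features been kept, the "missing means no" reading from the docs could have been encoded as an explicit 0/1 flag. A minimal sketch on illustrative values for structural_notification (the stray leading space is made up to show why the strip is there):

```python
import pandas as pd

structural = pd.Series(['Y', None, ' Y', None], name='structural_notification')

# 'Y' becomes 1; anything else, including missing, becomes 0.
flag = structural.str.strip().eq('Y').astype(int)
print(flag.tolist())
```

The same approach would not transfer directly to tidf_compliance, which also contains a P value whose meaning would need to be looked up first.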

2.1.5. Handling estimated_cost & revised_cost

estimated_cost & revised_cost represent amounts, but the docs don't specify the currency. This can be linked to the missing/unknown unit smell. Since the dataset consists of building permits from San Francisco, a safe guess is that the amounts are in USD.

cost_features = ['estimated_cost',
                 'revised_cost']
cost = permit[cost_features]
cost
        estimated_cost  revised_cost
0               4000.0        4000.0
1                  1.0         500.0
2              20000.0           NaN
3               2000.0        2000.0
4             100000.0      100000.0
...                ...           ...
198895             NaN           1.0
198896          5000.0        5000.0
198897             NaN           1.0
198898             NaN           1.0
198899             NaN           1.0

[198900 rows x 2 columns]
cost.dtypes
estimated_cost    float64
revised_cost      float64
dtype: object
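
One lightweight way to make the assumed unit explicit is to carry it in the column name. A sketch on the first rows shown above, where the _usd suffix encodes our assumption, not anything stated in the docs:

```python
import pandas as pd

cost = pd.DataFrame({'estimated_cost': [4000.0, 1.0, 20000.0],
                     'revised_cost': [4000.0, 500.0, None]})

# Assumption: amounts are in US dollars (not confirmed by the documentation).
cost = cost.rename(columns=lambda name: f'{name}_usd')
print(list(cost.columns))
```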

2.1.6. Handling existing_use

existing_use requires further analysis: it looks like a text feature but can perhaps be converted to a categorical one.

existing_use = permit['existing_use']
existing_use = existing_use.str.strip()
existing_use = existing_use.str.lower()
existing_use = existing_use.str.replace(r'\s+', '_', regex=True)
existing_use.unique()
['tourist_hotel/motel' 'retail_sales' '1_family_dwelling' 'apartments' nan
 '2_family_dwelling' 'church' 'storage_shed' 'office' 'vacant_lot'
 'food/beverage_hndlng' 'residential_hotel' 'filling/service_stn'
 'workshop_commercial' 'clinics-medic/dental' 'misc_group_residns.'
 'hospital' 'club' 'barber/beauty_salon' 'warehouse,no_frnitur' 'school'
 'artist_live/work' 'manufacturing' 'garment_shops' 'public_assmbly_other'
 'auto_repairs' 'lending_institution' 'museum' 'warehouse,_furniture'
 'prkng_garage/private' 'antenna' 'health_studios_&_gym' 'massage_parlor'
 'printing_plant' 'parking_lot' 'workshop_residential' 'power_plant'
 'tower' 'mortuary' 'animal_sale_or_care' 'laundry/laundromat' 'nite_club'
 'paint_store' 'recreation_bldg' 'theater' 'prkng_garage/public' 'sign'
 'phone_xchnge/equip' 'dance_hall' 'sfpd_or_sffd_station' 'storage_tanks'
 'muni_carbarn' 'stadium' 'automobile_sales' 'fence/retaining_wall'
 'radio_&_tv_stations' 'social_care_facility' 'amusement_center'
 'day_care_home_gt_12' 'moving_&_storage' 'dry_cleaners'
 'day_care_home_7_-_12' 'chemical_processing' 'accessory_cottage'
 'day_care,_non-res' 'nursing_home_non_amb' 'wholesale_sales' 'library'
 'nursery(floral)' 'day_care_center' 'nursing_home_gt_6' 'sewage_plant'
 'convalescent_home' 'greenhouse' 'adult_entertainment'
 'muni_driver_restroom' 'sound_studio' 'dairies/dairy_equip.'
 'christmas_tree_lot' 'bath_house' 'jail' "prson'l_svc_tutor"
 'r-3(dwg)_nursing' 'car_wash' 'roofing_materials' 'orphanage'
 'ambulance_service' 'meat/produce_marts' 'building_materials' 'temple'
 'swimming_pool' 'day_care_home_lt_7' 'nursing_home_lte_6' 'child_care']

If we go the categorical way, we need to heavily simplify the values; otherwise we can treat it as a text feature and extract appropriate features.
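
One way to simplify the values for a categorical treatment is to keep the frequent categories and pool the rest. A minimal sketch on made-up counts: the threshold of 2 and the label other are arbitrary choices, and pool_rare is a hypothetical helper.

```python
import pandas as pd

existing_use = pd.Series(['apartments', 'apartments', 'office', 'office',
                          '1_family_dwelling', 'jail', None])


def pool_rare(values: pd.Series, min_count: int, other: str = 'other') -> pd.Series:
    """Replace categories seen fewer than min_count times; missing stays missing."""
    counts = values.value_counts()
    frequent = counts[counts >= min_count].index
    keep = values.isin(frequent) | values.isna()
    return values.where(keep, other)


pooled = pool_rare(existing_use, min_count=2).astype('category')
print(pooled.cat.categories.tolist())
```

Pooling by frequency ignores meaning, so a hand-made grouping (for example residential, commercial, industrial) may serve better if domain knowledge is available.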

2.1.7. Handling existing_units & proposed_units

The documentation for existing_units & proposed_units is vague: it says "existing number of units", but what are units?

units_features = ['existing_units',
                  'proposed_units']
units = permit[units_features]
units
        existing_units  proposed_units
0                143.0             NaN
1                  NaN             NaN
2                 39.0            39.0
3                  1.0             1.0
4                  NaN             NaN
...                ...             ...
198895             NaN             NaN
198896             4.0             4.0
198897             NaN             NaN
198898             NaN             NaN
198899             NaN             NaN

[198900 rows x 2 columns]

Unfortunately we can't do much without proper documentation or domain expertise.

2.1.8. Handling categorical features

We have several categorical features in the dataset.

categorical_features = ['proposed_use',
                        'neighborhoods_analysis_boundaries',
                        'current_status',
                        'voluntary_soft_story_retrofit',
                        'fire_only_permit',
                        'site_permit']
categorical = permit[categorical_features]
categorical
             proposed_use neighborhoods_analysis_boundaries current_status  \
0                     NaN                        Tenderloin        expired   
1                     NaN                        Tenderloin         issued   
2            retail sales                      Russian Hill      withdrawn   
3       1 family dwelling                          Nob Hill       complete   
4                     NaN                        Tenderloin         issued   
...                   ...                               ...            ...   
198895                NaN                               NaN         issued   
198896         apartments                               NaN         issued   
198897                NaN                               NaN         issued   
198898                NaN                               NaN         issued   
198899                NaN                               NaN         issued   

       voluntary_soft_story_retrofit fire_only_permit site_permit  
0                                NaN              NaN         NaN  
1                                NaN              NaN         NaN  
2                                NaN              NaN         NaN  
3                                NaN              NaN         NaN  
4                                NaN              NaN         NaN  
...                              ...              ...         ...  
198895                           NaN              NaN         NaN  
198896                           NaN                Y         NaN  
198897                           NaN              NaN         NaN  
198898                           NaN              NaN         NaN  
198899                           NaN              NaN         NaN  

[198900 rows x 6 columns]

Let's analyse proposed_use & neighborhoods_analysis_boundaries further.

proposed_use = categorical['proposed_use']
proposed_use = proposed_use.str.strip()
proposed_use = proposed_use.str.lower()
proposed_use = proposed_use.str.replace(r'\s+', '_', regex=True)
proposed_use.unique()
[nan 'retail_sales' '1_family_dwelling' 'apartments' '2_family_dwelling'
 'church' 'vacant_lot' 'office' 'tourist_hotel/motel' 'school'
 'filling/service_stn' 'food/beverage_hndlng' 'residential_hotel'
 'storage_shed' 'clinics-medic/dental' 'misc_group_residns.' 'club'
 'hospital' 'barber/beauty_salon' 'warehouse,no_frnitur'
 'artist_live/work' 'museum' 'lending_institution' 'garment_shops'
 'child_care' 'auto_repairs' 'manufacturing' 'day_care_home_lt_7'
 'workshop_commercial' 'warehouse,_furniture' 'prkng_garage/private'
 'antenna' 'health_studios_&_gym' 'massage_parlor' 'printing_plant'
 'parking_lot' 'workshop_residential' 'power_plant' 'tower'
 'sfpd_or_sffd_station' 'mortuary' 'animal_sale_or_care'
 'fence/retaining_wall' 'nite_club' 'paint_store' 'recreation_bldg'
 'theater' 'nursery(floral)' 'prkng_garage/public' 'sign'
 'phone_xchnge/equip' 'dance_hall' 'storage_tanks' 'muni_carbarn'
 'automobile_sales' 'day_care_center' 'public_assmbly_other' 'greenhouse'
 'library' 'radio_&_tv_stations' 'social_care_facility'
 'laundry/laundromat' 'accessory_cottage' 'amusement_center'
 'day_care_home_gt_12' 'muni_driver_restroom' 'day_care_home_7_-_12'
 'moving_&_storage' 'dry_cleaners' 'chemical_processing'
 'day_care,_non-res' 'nursing_home_non_amb' 'wholesale_sales' 'stadium'
 'bath_house' 'nursing_home_gt_6' 'sewage_plant' 'convalescent_home'
 'adult_entertainment' "prson'l_svc_tutor" 'dairies/dairy_equip.'
 'christmas_tree_lot' 'jail' 'r-3(dwg)_nursing' 'sound_studio' 'car_wash'
 'roofing_materials' 'orphanage' 'swimming_pool' 'ambulance_service'
 'not_applicable' 'building_materials' 'meat/produce_marts' 'temple'
 'nursing_home_lte_6']

proposed_use is similar to existing_use, so we should consider a similar strategy to extract meaningful features here.

neighborhoods = categorical['neighborhoods_analysis_boundaries']
neighborhoods = neighborhoods.str.strip()
neighborhoods = neighborhoods.str.lower()
neighborhoods = neighborhoods.str.replace(r'\s+', '_', regex=True)
neighborhoods.unique()
['tenderloin' 'russian_hill' 'nob_hill' 'potrero_hill' 'inner_sunset'
 'bayview_hunters_point' 'lone_mountain/usf' 'haight_ashbury'
 'castro/upper_market' 'hayes_valley' 'noe_valley' 'pacific_heights'
 'chinatown' 'financial_district/south_beach' 'marina' 'mission'
 'sunset/parkside' 'outer_richmond' 'western_addition' 'bernal_heights'
 'inner_richmond' 'oceanview/merced/ingleside' 'outer_mission' 'portola'
 'mission_bay' 'visitacion_valley' 'presidio_heights' 'west_of_twin_peaks'
 'south_of_market' 'excelsior' 'north_beach' 'glen_park' 'treasure_island'
 'twin_peaks' 'lincoln_park' nan 'japantown' 'lakeshore' 'seacliff'
 'golden_gate_park' 'presidio' 'mclaren_park']
neighborhoods.value_counts()
financial_district/south_beach    21816
mission                           14681
sunset/parkside                   10207
west_of_twin_peaks                 8739
castro/upper_market                8527
pacific_heights                    8508
marina                             8244
outer_richmond                     7854
noe_valley                         7844
south_of_market                    7572
bernal_heights                     6067
nob_hill                           6009
haight_ashbury                     5798
inner_sunset                       5776
bayview_hunters_point              5669
russian_hill                       5495
hayes_valley                       5489
tenderloin                         4783
inner_richmond                     4458
potrero_hill                       4293
presidio_heights                   4084
north_beach                        4054
western_addition                   3867
chinatown                          3765
lone_mountain/usf                  3358
excelsior                          3072
oceanview/merced/ingleside         2654
glen_park                          2637
mission_bay                        2287
outer_mission                      2242
twin_peaks                         1702
portola                            1433
lakeshore                          1308
seacliff                            992
visitacion_valley                   900
japantown                           700
treasure_island                      81
golden_gate_park                     64
presidio                             51
lincoln_park                         49
mclaren_park                         46
Name: neighborhoods_analysis_boundaries, dtype: int64

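This feature should be categorical. There may be some hierarchy (some neighborhoods are better than others), in which case we may want to use label (ordinal) encoding as opposed to one-hot encoding. Integer codes imply an order, though, so the ranking would have to be chosen deliberately and not left to the alphabetical default.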

permit['neighborhoods_analysis_boundaries'] = neighborhoods.astype('category')
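
The two encodings can be contrasted on a toy column. The order used for the ordinal codes below is an arbitrary placeholder, not a ranking derived from the data:

```python
import pandas as pd

neighborhoods = pd.Series(['tenderloin', 'nob_hill', 'russian_hill', 'tenderloin'])

# Ordinal encoding: integer codes follow an explicitly supplied order.
order = ['tenderloin', 'nob_hill', 'russian_hill']   # placeholder order
ordinal = pd.Categorical(neighborhoods, categories=order, ordered=True).codes

# One-hot encoding: one indicator column per category, no order implied.
one_hot = pd.get_dummies(neighborhoods)

print(ordinal.tolist())
print(one_hot.columns.tolist())
```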

The remaining features (such as current_status, fire_only_permit & voluntary_soft_story_retrofit, to name a few) have missing values, but their meaning is unclear due to the lack of proper documentation. For example, a missing value in fire_only_permit may indicate a "no" or may indicate that we don't have the information.
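
Counting values with the missing entries included at least shows which levels occur alongside NaN, even though it cannot settle what NaN means. A minimal sketch with values taken from the fire_only_permit column shown above:

```python
import pandas as pd

fire_only = pd.Series([None, 'Y', None, None, None], name='fire_only_permit')

# dropna=False keeps the missing entries as their own row in the counts.
counts = fire_only.value_counts(dropna=False)
print(counts)
```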

2.1.9. Handling plansets & supervisor_district

Documentation regarding plansets & supervisor_district is lacking; without domain knowledge we don't know what the values in these features indicate.

permit[['plansets', 'supervisor_district']]
        plansets  supervisor_district
0            2.0                  3.0
1            2.0                  3.0
2            2.0                  3.0
3            2.0                  3.0
4            2.0                  6.0
...          ...                  ...
198895       NaN                  NaN
198896       2.0                  NaN
198897       NaN                  NaN
198898       NaN                  NaN
198899       NaN                  NaN

[198900 rows x 2 columns]

2.1.10. Descriptive statistics, missing & duplicates

The prior sections indicated that the dataset requires a lot of pre-processing effort which is beyond the scope of this analysis. Thus checking for missing & duplicates won't lead to meaningful insights (we know there are lots of missing values!). Without the pre-processing, insights from the descriptive statistics may also be misleading.
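
For reference, once the pre-processing is in place these checks are short. A sketch on a toy frame with illustrative values:

```python
import pandas as pd

toy = pd.DataFrame({'permit_type': [4, 4, 3, 4],
                    'estimated_cost': [4000.0, 4000.0, 20000.0, None]})

n_duplicates = int(toy.duplicated().sum())   # rows identical to an earlier row
summary = toy.describe()                     # missing values are skipped per column
print(n_duplicates)
print(summary)
```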

2.1.11. Correlations

We skip the correlation analysis for the same reason as above.

Date: 2021-11-01 Mon 00:00

Created: 2021-11-02 Tue 13:35