Airbnb

1. Init

Let's start by importing the necessary modules.

from context import src
from context import pd, np, sns, plt
from src import utils, plotter

2. Analysis

In this section we analyse the airbnb dataset. We start by reading the data docs, which do not provide much information about the dataset.

2.1. Preliminary analysis

We start by loading the dataset and answering our initial set of questions.

airbnb = pd.read_csv(utils.data_path('airbnb.csv'))
airbnb.head()
     id                                              name  host_id  \
0  2539                Clean & quiet apt home by the park     2787   
1  2595                             Skylit Midtown Castle     2845   
2  3647               THE VILLAGE OF HARLEM....NEW YORK !     4632   
3  3831                   Cozy Entire Floor of Brownstone     4869   
4  5022  Entire Apt: Spacious Studio/Loft by central park     7192   

     host_name neighbourhood_group neighbourhood  latitude  longitude  \
0         John            Brooklyn    Kensington  40.64749  -73.97237   
1     Jennifer           Manhattan       Midtown  40.75362  -73.98377   
2    Elisabeth           Manhattan        Harlem  40.80902  -73.94190   
3  LisaRoxanne            Brooklyn  Clinton Hill  40.68514  -73.95976   
4        Laura           Manhattan   East Harlem  40.79851  -73.94399   

         room_type  price  minimum_nights  number_of_reviews last_review  \
0     Private room    149               1                  9  2018-10-19   
1  Entire home/apt    225               1                 45  2019-05-21   
2     Private room    150               3                  0         NaN   
3  Entire home/apt     89               1                270  2019-07-05   
4  Entire home/apt     80              10                  9  2018-11-19   

   reviews_per_month  calculated_host_listings_count  availability_365  
0               0.21                               6               365  
1               0.38                               2               355  
2                NaN                               1               365  
3               4.64                               1               194  
4               0.10                               1                 0  

We have a mix of text, numerical and categorical features. Let's take a closer look at them.

airbnb.shape
(48895, 16)
airbnb.dtypes
id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object

2.1.1. Handling redundant columns

So far, we have been dropping columns such as id & host_id since they contain unique identifiers, which do not give the model any new information to learn from. However, I did not consider the case where we may have several rows from the same entity (in this example the host, who may have multiple properties to rent out). Perhaps there is a correlation between the number of properties and the other numerical features? Let's investigate.

identifier_columns = ['id',
                      'host_id']
# Copy so that adding columns below does not trigger SettingWithCopyWarning.
identifiers = airbnb[identifier_columns].copy()
identifiers[identifiers.duplicated(keep=False)]
Empty DataFrame
Columns: [id, host_id]
Index: []

We don't have any truly duplicate examples (same listing with the same host). But we may still have different listings from the same host, or the same listing from different hosts (which would be incorrect, i.e. an outlier)! Let's examine them separately next.

identifiers['dup_id'] = identifiers.duplicated('id', keep=False)
identifiers['dup_host_id'] = identifiers.duplicated('host_id', keep=False)
identifiers
             id   host_id  dup_id  dup_host_id
0          2539      2787   False         True
1          2595      2845   False         True
2          3647      4632   False        False
3          3831      4869   False        False
4          5022      7192   False        False
...         ...       ...     ...          ...
48890  36484665   8232441   False         True
48891  36485057   6570630   False         True
48892  36485431  23492952   False        False
48893  36485609  30985759   False         True
48894  36487245  68119814   False        False

[48895 rows x 4 columns]
identifiers['dup_id'].any()
False
identifiers['dup_host_id'].any()
True

We don't have any duplicate id, which means that we do not have a listing with multiple hosts (this is good), but we do have hosts with multiple listings (this is normal). Perhaps a useful feature to extract would be the number of listings per host, but this is beyond the scope of this analysis. For now, we drop the identifier columns.
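Although extracting that feature is out of scope here, a minimal sketch of how it could be computed is given below. It runs on a made-up toy frame rather than the real data; the column name host_listings is hypothetical, and the dataset's own calculated_host_listings_count column appears to carry the same information.

```python
import pandas as pd

# Toy stand-in for the identifier columns (values are made up for illustration).
toy = pd.DataFrame({
    'id':      [1, 2, 3, 4, 5],
    'host_id': [10, 10, 20, 30, 10],
})

# Count the listings that share each host_id and broadcast the count back to every row.
# 'host_listings' is a hypothetical column name.
toy['host_listings'] = toy.groupby('host_id')['id'].transform('count')
print(toy)
```

Using transform (rather than a plain aggregation) keeps one value per listing, so the result can be assigned straight back as a column.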

airbnb = airbnb.drop(identifier_columns, axis='columns')
airbnb.shape
(48895, 14)

2.1.2. Handling text features

The name feature could yield several interesting numerical features (for instance, we could analyse the most common words and check whether they are positively correlated with the number of reviews). For this analysis, however, we drop it.
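For reference, a rough sketch of the word-frequency idea mentioned above, applied to a handful of names copied from the head() output. The tokenisation rule (lower-cased runs of ASCII letters) is an assumption made for illustration and would need revisiting for names in other scripts.

```python
from collections import Counter

import pandas as pd

# A few listing names from the head() output above, plus one missing value.
names = pd.Series([
    'Clean & quiet apt home by the park',
    'Skylit Midtown Castle',
    'Cozy Entire Floor of Brownstone',
    'Entire Apt: Spacious Studio/Loft by central park',
    None,
])

# Lower-case, extract alphabetic tokens, and flatten into one Series of words.
tokens = names.dropna().str.lower().str.findall(r'[a-z]+').explode()
word_counts = Counter(tokens)
print(word_counts.most_common(3))
```

An indicator column per frequent word could then be correlated with number_of_reviews.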

airbnb = airbnb.drop('name', axis='columns')
airbnb.shape
(48895, 13)

The host_name feature is worth investigating. For instance, what is the most common name? And is there any relationship between the common names and the number of reviews their listing(s) get?

airbnb['host_name'] = airbnb['host_name'].str.strip().astype('category')
airbnb['host_name'].cat.categories
Index([''Cil', '(Ari) HENRY LEE', '(Email hidden by Airbnb)', '(Mary) Haiy',
       '-TheQueensCornerLot', '0123', '2018Serenity', '371', '475',
       '5 Star Stays',
       ...
       '辣辣', '铀 Yuli', '青明', '韦达', '馨惠', '단비', '빈나', '소정', '진', '현선'],
      dtype='object', length=11452)
airbnb['host_name'].value_counts()
Michael         417
David           403
Sonder (NYC)    327
John            294
Alex            279
               ... 
Jerbean           1
Jerald            1
Jeonghoon         1
Jeny              1
현선                1
Name: host_name, Length: 11452, dtype: int64

The popular names are what we would expect (for a North American country). It does look like we will have to perform some processing, as there are names with special characters and in several different scripts.
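One possible first pass at that processing is sketched below. The helper normalise_name is hypothetical (it is not part of src.utils), and the choice of Unicode NFKC normalisation, quote stripping and case folding is an assumption about what "processing" should mean here.

```python
import unicodedata

import pandas as pd


def normalise_name(name):
    """Hypothetical helper: NFKC-normalise, trim and case-fold a host name."""
    if pd.isna(name):
        return name
    text = unicodedata.normalize('NFKC', str(name))
    # Collapse runs of whitespace into single spaces.
    text = ' '.join(text.split())
    # Strip stray quotes and spaces from both ends, then ignore case.
    return text.strip('\'" ').casefold()


hosts = pd.Series(["'Cil", '  Michael ', 'MICHAEL', '현선', None])
cleaned = hosts.map(normalise_name)
print(cleaned.tolist())
```

Names in non-Latin scripts pass through unchanged, so this only merges variants that differ by case, spacing or stray quotes.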

2.1.3. Handling categorical features

neighbourhood_group, neighbourhood & room_type are categorical, so let's convert them to the category dtype.

categorical_features = ['neighbourhood_group',
                        'neighbourhood',
                        'room_type']

for feature in categorical_features:
    airbnb[feature] = airbnb[feature].str.strip().astype('category')
airbnb['neighbourhood_group'].value_counts()
Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: neighbourhood_group, dtype: int64
airbnb['neighbourhood'].value_counts()
Williamsburg          3920
Bedford-Stuyvesant    3714
Harlem                2658
Bushwick              2465
Upper West Side       1971
                      ... 
Richmondtown             1
Willowbrook              1
Fort Wadsworth           1
New Dorp                 1
Woodrow                  1
Name: neighbourhood, Length: 221, dtype: int64
airbnb['room_type'].value_counts()
Entire home/apt    25409
Private room       22326
Shared room         1160
Name: room_type, dtype: int64

2.1.4. Handling datetime features

last_review should be converted to datetime dtype.

airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])
airbnb['last_review']
0       2018-10-19
1       2019-05-21
2              NaT
3       2019-07-05
4       2018-11-19
           ...    
48890          NaT
48891          NaT
48892          NaT
48893          NaT
48894          NaT
Name: last_review, Length: 48895, dtype: datetime64[ns]
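The conversion above lets pandas infer the date format. A slightly more defensive variant, sketched here on toy values, states the format explicitly and coerces unparseable entries to NaT. The '%Y-%m-%d' format is an assumption based on the values visible in the output, and the 'not a date' entry is invented to show the coercion.

```python
import pandas as pd

raw = pd.Series(['2018-10-19', '2019-05-21', None, 'not a date'])

# An explicit format avoids format guessing; errors='coerce' turns anything
# that cannot be parsed into NaT instead of raising an exception.
parsed = pd.to_datetime(raw, format='%Y-%m-%d', errors='coerce')
print(parsed)
```

Note that coercing hides malformed values among the genuinely missing ones, so it is worth comparing the NaT count before and after.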

2.1.5. Handling numerical features

The docs do specify that the price is in USD.

The calculated_host_listings_count is the number of listings the host has (the feature we suggested extracting earlier).

2.1.6. Descriptive statistics, missing & duplicates

Let's look at the descriptive statistics next.

airbnb.describe(include='all')
       host_name neighbourhood_group neighbourhood      latitude  \
count      48874               48895         48895  48895.000000   
unique     11452                   5           221           NaN   
top      Michael           Manhattan  Williamsburg           NaN   
freq         417               21661          3920           NaN   
first        NaN                 NaN           NaN           NaN   
last         NaN                 NaN           NaN           NaN   
mean         NaN                 NaN           NaN     40.728949   
std          NaN                 NaN           NaN      0.054530   
min          NaN                 NaN           NaN     40.499790   
25%          NaN                 NaN           NaN     40.690100   
50%          NaN                 NaN           NaN     40.723070   
75%          NaN                 NaN           NaN     40.763115   
max          NaN                 NaN           NaN     40.913060   

           longitude        room_type         price  minimum_nights  \
count   48895.000000            48895  48895.000000    48895.000000   
unique           NaN                3           NaN             NaN   
top              NaN  Entire home/apt           NaN             NaN   
freq             NaN            25409           NaN             NaN   
first            NaN              NaN           NaN             NaN   
last             NaN              NaN           NaN             NaN   
mean      -73.952170              NaN    152.720687        7.029962   
std         0.046157              NaN    240.154170       20.510550   
min       -74.244420              NaN      0.000000        1.000000   
25%       -73.983070              NaN     69.000000        1.000000   
50%       -73.955680              NaN    106.000000        3.000000   
75%       -73.936275              NaN    175.000000        5.000000   
max       -73.712990              NaN  10000.000000     1250.000000   

        number_of_reviews          last_review  reviews_per_month  \
count        48895.000000                38843       38843.000000   
unique                NaN                 1764                NaN   
top                   NaN  2019-06-23 00:00:00                NaN   
freq                  NaN                 1413                NaN   
first                 NaN  2011-03-28 00:00:00                NaN   
last                  NaN  2019-07-08 00:00:00                NaN   
mean            23.274466                  NaN           1.373221   
std             44.550582                  NaN           1.680442   
min              0.000000                  NaN           0.010000   
25%              1.000000                  NaN           0.190000   
50%              5.000000                  NaN           0.720000   
75%             24.000000                  NaN           2.020000   
max            629.000000                  NaN          58.500000   

        calculated_host_listings_count  availability_365  
count                     48895.000000      48895.000000  
unique                             NaN               NaN  
top                                NaN               NaN  
freq                               NaN               NaN  
first                              NaN               NaN  
last                               NaN               NaN  
mean                          7.143982        112.781327  
std                          32.952519        131.622289  
min                           1.000000          0.000000  
25%                           1.000000          0.000000  
50%                           1.000000         45.000000  
75%                           2.000000        227.000000  
max                         327.000000        365.000000  

Next, we check for missing values.

airbnb.isna().any()
host_name                          True
neighbourhood_group               False
neighbourhood                     False
latitude                          False
longitude                         False
room_type                         False
price                             False
minimum_nights                    False
number_of_reviews                 False
last_review                        True
reviews_per_month                  True
calculated_host_listings_count    False
availability_365                  False
dtype: bool

Let's investigate how much data is missing in total and per column.

airbnb[airbnb.isna().any(axis='columns')]
             host_name neighbourhood_group       neighbourhood  latitude  \
2            Elisabeth           Manhattan              Harlem  40.80902   
19                Sing           Manhattan         East Harlem  40.79685   
26     Claude & Sophie           Manhattan              Inwood  40.86754   
36                  Vt            Brooklyn  Bedford-Stuyvesant  40.68876   
38             Harriet            Brooklyn            Flatbush  40.63702   
...                ...                 ...                 ...       ...   
48890          Sabrina            Brooklyn  Bedford-Stuyvesant  40.67853   
48891          Marisol            Brooklyn            Bushwick  40.70184   
48892    Ilgar & Aysel           Manhattan              Harlem  40.81475   
48893              Taz           Manhattan      Hell's Kitchen  40.75751   
48894       Christophe           Manhattan      Hell's Kitchen  40.76404   

       longitude        room_type  price  minimum_nights  number_of_reviews  \
2      -73.94190     Private room    150               3                  0   
19     -73.94872  Entire home/apt    190               7                  0   
26     -73.92639     Private room     80               4                  0   
36     -73.94312     Private room     35              60                  0   
38     -73.96327     Private room    150               1                  0   
...          ...              ...    ...             ...                ...   
48890  -73.94995     Private room     70               2                  0   
48891  -73.93317     Private room     40               4                  0   
48892  -73.94867  Entire home/apt    115              10                  0   
48893  -73.99112      Shared room     55               1                  0   
48894  -73.98933     Private room     90               7                  0   

      last_review  reviews_per_month  calculated_host_listings_count  \
2             NaT                NaN                               1   
19            NaT                NaN                               2   
26            NaT                NaN                               1   
36            NaT                NaN                               1   
38            NaT                NaN                               1   
...           ...                ...                             ...   
48890         NaT                NaN                               2   
48891         NaT                NaN                               2   
48892         NaT                NaN                               1   
48893         NaT                NaN                               6   
48894         NaT                NaN                               1   

       availability_365  
2                   365  
19                  249  
26                    0  
36                  365  
38                  365  
...                 ...  
48890                 9  
48891                36  
48892                27  
48893                 2  
48894                23  

[10068 rows x 13 columns]
airbnb[airbnb['host_name'].isna()].shape
(21, 13)
airbnb[airbnb['last_review'].isna()].shape
(10052, 13)
airbnb[airbnb['reviews_per_month'].isna()].shape
(10052, 13)
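The per-column counts above can also be obtained in one go. The sketch below shows the idea on a small made-up frame; the n_missing and fraction column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (made up for illustration).
toy = pd.DataFrame({
    'host_name': ['John', None, 'Laura', 'Sing'],
    'price': [149, 225, 150, 89],
    'reviews_per_month': [0.21, np.nan, np.nan, 4.64],
})

# Per-column count and share of missing values.
missing = pd.DataFrame({
    'n_missing': toy.isna().sum(),
    'fraction': toy.isna().mean(),
})
# Number of rows with at least one missing value.
rows_with_any_missing = int(toy.isna().any(axis='columns').sum())

print(missing)
print(rows_with_any_missing)
```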

Most of the missing data is in the last_review & reviews_per_month features (10052 rows each, roughly 21% of the dataset). We may wish to drop these two columns, as imputing will lead to technical debt.
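Before dropping them, it may be worth checking why the values are missing. In the rows displayed above, last_review and reviews_per_month are missing wherever number_of_reviews is 0. The sketch below tests that pattern on toy data; whether it holds for the full dataset is an assumption that has not been verified here.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the pattern seen in the output above.
toy = pd.DataFrame({
    'number_of_reviews': [9, 45, 0, 270, 0],
    'reviews_per_month': [0.21, 0.38, np.nan, 4.64, np.nan],
})

no_reviews = toy['number_of_reviews'].eq(0)
missing_rate = toy['reviews_per_month'].isna()

# True only if the rate is missing exactly for the listings without reviews.
missing_only_when_unreviewed = bool((no_reviews == missing_rate).all())
print(missing_only_when_unreviewed)

# If that holds, 0 is a natural fill value for the rate (an alternative to dropping it).
filled = toy['reviews_per_month'].fillna(0)
```

last_review has no such natural fill value, so dropping it may remain the simpler option.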

Finally, let's check for duplicates.

airbnb[airbnb.duplicated()].shape
(0, 13)

2.1.7. Correlations

Let's check the correlations between the numerical features.

name = 'heatmap@airbnb--corr.png'
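# Restrict to numeric columns; newer pandas versions no longer drop the others silently.
corr = airbnb.select_dtypes('number').corr()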
plotter.corr(corr, name)
name

heatmap@airbnb--corr.png

There are no significant positive correlations except between reviews_per_month & number_of_reviews (which is expected).
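To back up a reading of the heatmap with numbers, the correlation matrix can be flattened into a ranked list of feature pairs. The sketch below uses a small made-up frame and plain pandas/numpy instead of the project's plotter module.

```python
import numpy as np
import pandas as pd

# Toy numerical features (values are illustrative only).
toy = pd.DataFrame({
    'number_of_reviews': [9, 45, 0, 270, 9],
    'reviews_per_month': [0.21, 0.38, 0.0, 4.64, 0.10],
    'price': [149, 225, 150, 89, 80],
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears once and the diagonal is dropped.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (feature, feature) -> correlation and rank by absolute value.
pairs = upper.stack().dropna().sort_values(key=lambda s: s.abs(), ascending=False)
print(pairs)
```

Ranking by absolute value also surfaces strong negative correlations, which are easy to miss on a heatmap.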

Date: 2021-11-04 Thu 00:00

Created: 2021-11-05 Fri 12:14