Airbnb
1. Init
Let's start by importing the necessary modules.
from context import src
from context import pd, np, sns, plt
from src import utils, plotter
2. Analysis
In this section we analyse the airbnb dataset. We start by reading
the data docs, which do not provide much information about the
dataset.
2.1. Preliminary analysis
We start by loading the dataset and answering our initial set of questions.
airbnb = pd.read_csv(utils.data_path('airbnb.csv'))
airbnb.head()
     id                                              name  host_id  \
0  2539                Clean & quiet apt home by the park     2787
1  2595                             Skylit Midtown Castle     2845
2  3647               THE VILLAGE OF HARLEM....NEW YORK !     4632
3  3831                   Cozy Entire Floor of Brownstone     4869
4  5022  Entire Apt: Spacious Studio/Loft by central park     7192

     host_name  neighbourhood_group  neighbourhood  latitude  longitude  \
0         John             Brooklyn     Kensington  40.64749  -73.97237
1     Jennifer            Manhattan        Midtown  40.75362  -73.98377
2    Elisabeth            Manhattan         Harlem  40.80902  -73.94190
3  LisaRoxanne             Brooklyn   Clinton Hill  40.68514  -73.95976
4        Laura            Manhattan    East Harlem  40.79851  -73.94399

         room_type  price  minimum_nights  number_of_reviews  last_review  \
0     Private room    149               1                  9   2018-10-19
1  Entire home/apt    225               1                 45   2019-05-21
2     Private room    150               3                  0          NaN
3  Entire home/apt     89               1                270   2019-07-05
4  Entire home/apt     80              10                  9   2018-11-19

   reviews_per_month  calculated_host_listings_count  availability_365
0               0.21                               6               365
1               0.38                               2               355
2                NaN                               1               365
3               4.64                               1               194
4               0.10                               1                 0
We have a mix of text, numerical and categorical features. Let's take a closer look at them.
airbnb.shape
(48895, 16)
airbnb.dtypes
id                                  int64
name                               object
host_id                             int64
host_name                          object
neighbourhood_group                object
neighbourhood                      object
latitude                          float64
longitude                         float64
room_type                          object
price                               int64
minimum_nights                      int64
number_of_reviews                   int64
last_review                        object
reviews_per_month                 float64
calculated_host_listings_count      int64
availability_365                    int64
dtype: object
2.1.1. Handling redundant columns
So far, we have been dropping columns such as id & host_id since
they contain unique identifiers, which do not give the model any new
information to learn from. However, I did not consider the case where
we may have data from the same entity (in this example the host, who
may have multiple properties to rent out). Perhaps there is a
correlation between the number of properties and the other numerical
features? Let's investigate.
identifier_columns = ['id', 'host_id']
# copy so that adding columns later does not write into a slice of airbnb
identifiers = airbnb[identifier_columns].copy()
identifiers[identifiers.duplicated(keep=False)]
Empty DataFrame
Columns: [id, host_id]
Index: []
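The check above relies on the keep argument of duplicated. As a minimal sketch of what it changes, here is the same call on a toy frame (the values are made up for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy identifiers (hypothetical values, not from airbnb.csv).
toy = pd.DataFrame({
    'id':      [1, 2, 3, 3],
    'host_id': [10, 10, 20, 20],
})

# Default keep='first': only the second and later copies of a row are flagged.
later_copies = toy.duplicated()

# keep=False: every row that belongs to a duplicated group is flagged.
all_copies = toy.duplicated(keep=False)

# Passing a column name restricts the comparison to that column,
# which is how repeated hosts can be found.
repeated_hosts = toy.duplicated('host_id', keep=False)

print(later_copies.tolist())
print(all_copies.tolist())
print(repeated_hosts.tolist())
```

With keep=False the filter returns all copies of a duplicated row, which makes them easier to inspect side by side.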
We don't have any "truly" duplicate examples (the same listing with the same host). But we may still have different listings from the same host, or the same listing under different hosts (which would be incorrect, or an outlier)! Let's examine them separately next.
identifiers['dup_id'] = identifiers.duplicated('id', keep=False)
identifiers['dup_host_id'] = identifiers.duplicated('host_id', keep=False)
identifiers
             id   host_id  dup_id  dup_host_id
0          2539      2787   False         True
1          2595      2845   False         True
2          3647      4632   False        False
3          3831      4869   False        False
4          5022      7192   False        False
...         ...       ...     ...          ...
48890  36484665   8232441   False         True
48891  36485057   6570630   False         True
48892  36485431  23492952   False        False
48893  36485609  30985759   False         True
48894  36487245  68119814   False        False

[48895 rows x 4 columns]
identifiers['dup_id'].any()
False
identifiers['dup_host_id'].any()
True
We don't have any duplicate id, which means that we do not have a
listing with multiple hosts (this is good), but we do have hosts with
multiple listings (this is normal). Perhaps a useful feature to
extract would be the number of listings per host, but this is
beyond the scope of this analysis. For now, we drop the identifier
columns.
airbnb = airbnb.drop(identifier_columns, axis='columns')
airbnb.shape
(48895, 14)
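For reference, the per-host listing count mentioned above could be derived roughly as follows. This is only a sketch on a toy frame with made-up values; the column name host_listings is hypothetical, and on the real data it would have to run before the identifier columns are dropped.

```python
import pandas as pd

# Toy listings (hypothetical values, not from airbnb.csv).
toy = pd.DataFrame({
    'id':      [1, 2, 3, 4, 5],
    'host_id': [10, 10, 20, 30, 10],
})

# Count the rows that share each host_id and broadcast that count
# back onto every row of the group.
toy['host_listings'] = toy.groupby('host_id')['id'].transform('count')

print(toy)
```

Using transform instead of an aggregate keeps the result aligned with the original rows, so it can be stored directly as a new column.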
2.1.2. Handling text features
The name feature may yield several interesting numerical features
(for instance, we could analyse the most common words and whether
they are positively correlated with the number of reviews). For this
analysis, however, we drop it.
airbnb = airbnb.drop('name', axis='columns')
airbnb.shape
(48895, 13)
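Although we skip it here, the word analysis suggested above might look something like the sketch below. The toy names and review counts are invented, the simple regular-expression tokeniser is an assumption (it ignores non-Latin scripts), and the choice of the word 'cozy' is arbitrary.

```python
import pandas as pd

# Toy listings (hypothetical values, not from airbnb.csv).
toy = pd.DataFrame({
    'name': ['Cozy room near park', 'Cozy loft',
             'Sunny loft near subway', 'Quiet studio'],
    'number_of_reviews': [40, 30, 5, 1],
})

# Lower-case each name and split it into alphabetic tokens;
# missing names become empty strings and therefore empty token lists.
tokens = toy['name'].fillna('').str.lower().str.findall(r'[a-z]+')

# One row per (listing, word), then count how often each word occurs.
word_counts = tokens.explode().value_counts()

# Binary indicator: does the listing name contain the word 'cozy'?
has_cozy = tokens.apply(lambda words: 'cozy' in words)

# Pearson correlation between the indicator and the review count.
cozy_corr = has_cozy.astype(int).corr(toy['number_of_reviews'])

print(word_counts.head())
print(cozy_corr)
```

The same indicator could be built for each of the most frequent words to see which ones, if any, move with the number of reviews.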
The host_name feature is worth investigating. For instance, what is
the most common name? And is there any relationship between the common
names and the number of reviews their listing(s) get?
airbnb['host_name'] = airbnb['host_name'].str.strip().astype('category')
airbnb['host_name'].cat.categories
Index([''Cil', '(Ari) HENRY LEE', '(Email hidden by Airbnb)', '(Mary) Haiy', '-TheQueensCornerLot', '0123', '2018Serenity', '371', '475', '5 Star Stays', ... '辣辣', '铀 Yuli', '青明', '韦达', '馨惠', '단비', '빈나', '소정', '진', '현선'], dtype='object', length=11452)
airbnb['host_name'].value_counts()
Michael         417
David           403
Sonder (NYC)    327
John            294
Alex            279
               ...
Jerbean           1
Jerald            1
Jeonghoon         1
Jeny              1
현선                1
Name: host_name, Length: 11452, dtype: int64
Popular names are what we would expect (for a North American country). It does look like we have to perform some processing as there are names with special characters and mixed languages.
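One possible first step for that processing is to normalise the Unicode form, whitespace and case of each name before counting. The sketch below is an assumption about what such cleaning could look like, not something done in this analysis; normalise_name is a hypothetical helper and the sample names are made up apart from being in the style of the output above.

```python
import unicodedata

import pandas as pd


def normalise_name(name):
    """Return a canonical form of a host name (hypothetical helper)."""
    # NFKC folds compatibility characters (e.g. full-width letters)
    # into their standard equivalents.
    text = unicodedata.normalize('NFKC', name)
    # Collapse runs of whitespace and ignore case differences.
    return ' '.join(text.split()).casefold()


# Toy names: three spellings of the same name plus a non-Latin one.
toy = pd.Series(['Michael', ' michael ', 'ＭＩＣＨＡＥＬ', '현선'])

# na_action='ignore' leaves missing names untouched.
cleaned = toy.map(normalise_name, na_action='ignore')

print(cleaned.value_counts())
```

Names in other scripts are kept as they are; only variants of the same string are merged.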
2.1.3. Handling categorical features
neighbourhood_group, neighbourhood & room_type are categorical, so
let's convert them to the category dtype.
categorical_features = ['neighbourhood_group', 'neighbourhood', 'room_type']
for feature in categorical_features:
    airbnb[feature] = airbnb[feature].str.strip().astype('category')
airbnb['neighbourhood_group'].value_counts()
Manhattan        21661
Brooklyn         20104
Queens            5666
Bronx             1091
Staten Island      373
Name: neighbourhood_group, dtype: int64
airbnb['neighbourhood'].value_counts()
Williamsburg          3920
Bedford-Stuyvesant    3714
Harlem                2658
Bushwick              2465
Upper West Side       1971
                      ...
Richmondtown             1
Willowbrook              1
Fort Wadsworth           1
New Dorp                 1
Woodrow                  1
Name: neighbourhood, Length: 221, dtype: int64
airbnb['room_type'].value_counts()
Entire home/apt    25409
Private room       22326
Shared room         1160
Name: room_type, dtype: int64
2.1.4. Handling datetime features
last_review should be converted to the datetime dtype.
airbnb['last_review'] = pd.to_datetime(airbnb['last_review'])
airbnb['last_review']
0       2018-10-19
1       2019-05-21
2              NaT
3       2019-07-05
4       2018-11-19
           ...
48890          NaT
48891          NaT
48892          NaT
48893          NaT
48894          NaT
Name: last_review, Length: 48895, dtype: datetime64[ns]
2.1.5. Handling numerical features
The docs do specify that the price is in USD.
The calculated_host_listings_count is the number of listings the
host has (the feature we suggested extracting earlier).
2.1.6. Descriptive statistics, missing & duplicates
Let's look at the descriptive statistics next.
airbnb.describe(include='all')
        host_name  neighbourhood_group  neighbourhood      latitude  \
count       48874                48895          48895  48895.000000
unique      11452                    5            221           NaN
top       Michael            Manhattan   Williamsburg           NaN
freq          417                21661           3920           NaN
first         NaN                  NaN            NaN           NaN
last          NaN                  NaN            NaN           NaN
mean          NaN                  NaN            NaN     40.728949
std           NaN                  NaN            NaN      0.054530
min           NaN                  NaN            NaN     40.499790
25%           NaN                  NaN            NaN     40.690100
50%           NaN                  NaN            NaN     40.723070
75%           NaN                  NaN            NaN     40.763115
max           NaN                  NaN            NaN     40.913060

           longitude        room_type         price  minimum_nights  \
count   48895.000000            48895  48895.000000    48895.000000
unique           NaN                3           NaN             NaN
top              NaN  Entire home/apt           NaN             NaN
freq             NaN            25409           NaN             NaN
first            NaN              NaN           NaN             NaN
last             NaN              NaN           NaN             NaN
mean      -73.952170              NaN    152.720687        7.029962
std         0.046157              NaN    240.154170       20.510550
min       -74.244420              NaN      0.000000        1.000000
25%       -73.983070              NaN     69.000000        1.000000
50%       -73.955680              NaN    106.000000        3.000000
75%       -73.936275              NaN    175.000000        5.000000
max       -73.712990              NaN  10000.000000     1250.000000

        number_of_reviews          last_review  reviews_per_month  \
count        48895.000000                38843       38843.000000
unique                NaN                 1764                NaN
top                   NaN  2019-06-23 00:00:00                NaN
freq                  NaN                 1413                NaN
first                 NaN  2011-03-28 00:00:00                NaN
last                  NaN  2019-07-08 00:00:00                NaN
mean            23.274466                  NaN           1.373221
std             44.550582                  NaN           1.680442
min              0.000000                  NaN           0.010000
25%              1.000000                  NaN           0.190000
50%              5.000000                  NaN           0.720000
75%             24.000000                  NaN           2.020000
max            629.000000                  NaN          58.500000

        calculated_host_listings_count  availability_365
count                     48895.000000      48895.000000
unique                             NaN               NaN
top                                NaN               NaN
freq                               NaN               NaN
first                              NaN               NaN
last                               NaN               NaN
mean                          7.143982        112.781327
std                          32.952519        131.622289
min                           1.000000          0.000000
25%                           1.000000          0.000000
50%                           1.000000         45.000000
75%                           2.000000        227.000000
max                         327.000000        365.000000
And check for missing values next.
airbnb.isna().any()
host_name                          True
neighbourhood_group               False
neighbourhood                     False
latitude                          False
longitude                         False
room_type                         False
price                             False
minimum_nights                    False
number_of_reviews                 False
last_review                        True
reviews_per_month                  True
calculated_host_listings_count    False
availability_365                  False
dtype: bool
Let's investigate how much data is missing in total and per column.
airbnb[airbnb.isna().any(axis='columns')]
             host_name  neighbourhood_group       neighbourhood  latitude  \
2            Elisabeth            Manhattan              Harlem  40.80902
19                Sing            Manhattan         East Harlem  40.79685
26     Claude & Sophie            Manhattan              Inwood  40.86754
36                  Vt             Brooklyn  Bedford-Stuyvesant  40.68876
38             Harriet             Brooklyn            Flatbush  40.63702
...                ...                  ...                 ...       ...
48890          Sabrina             Brooklyn  Bedford-Stuyvesant  40.67853
48891          Marisol             Brooklyn            Bushwick  40.70184
48892    Ilgar & Aysel            Manhattan              Harlem  40.81475
48893              Taz            Manhattan      Hell's Kitchen  40.75751
48894       Christophe            Manhattan      Hell's Kitchen  40.76404

       longitude        room_type  price  minimum_nights  number_of_reviews  \
2      -73.94190     Private room    150               3                  0
19     -73.94872  Entire home/apt    190               7                  0
26     -73.92639     Private room     80               4                  0
36     -73.94312     Private room     35              60                  0
38     -73.96327     Private room    150               1                  0
...          ...              ...    ...             ...                ...
48890  -73.94995     Private room     70               2                  0
48891  -73.93317     Private room     40               4                  0
48892  -73.94867  Entire home/apt    115              10                  0
48893  -73.99112      Shared room     55               1                  0
48894  -73.98933     Private room     90               7                  0

       last_review  reviews_per_month  calculated_host_listings_count  \
2              NaT                NaN                               1
19             NaT                NaN                               2
26             NaT                NaN                               1
36             NaT                NaN                               1
38             NaT                NaN                               1
...            ...                ...                             ...
48890          NaT                NaN                               2
48891          NaT                NaN                               2
48892          NaT                NaN                               1
48893          NaT                NaN                               6
48894          NaT                NaN                               1

       availability_365
2                   365
19                  249
26                    0
36                  365
38                  365
...                 ...
48890                 9
48891                36
48892                27
48893                 2
48894                23

[10068 rows x 13 columns]
airbnb[airbnb['host_name'].isna()].shape
(21, 13)
airbnb[airbnb['last_review'].isna()].shape
(10052, 13)
airbnb[airbnb['reviews_per_month'].isna()].shape
(10052, 13)
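Most of the missing data is in the last_review & reviews_per_month
features: each is missing in 10052 of the 48895 rows (about 21%). We
may wish to drop these two columns, as imputing them would add
preprocessing logic that has to be maintained (technical debt).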
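The per-column counts above can also be collected in one table, and it may be worth checking whether reviews_per_month is missing exactly where number_of_reviews is 0 before deciding. The sketch below uses a toy frame with made-up values; filling with 0 is shown only as an assumption about one possible alternative to dropping the column, not as a step of this analysis.

```python
import numpy as np
import pandas as pd

# Toy listings (hypothetical values, not from airbnb.csv).
toy = pd.DataFrame({
    'number_of_reviews': [9, 0, 270, 0],
    'reviews_per_month': [0.21, np.nan, 4.64, np.nan],
})

# Count and share of missing values per column.
missing = pd.DataFrame({
    'count': toy.isna().sum(),
    'share': toy.isna().mean(),
})

# Is the rate missing exactly for the listings without reviews?
only_unreviewed = (
    toy['reviews_per_month'].isna() == (toy['number_of_reviews'] == 0)
).all()

# If so, 0 is a natural value for the rate (an assumption, see above).
filled = toy['reviews_per_month'].fillna(0)

print(missing)
print(only_unreviewed)
```

If the check holds on the real data, the missing rates carry no information beyond number_of_reviews, which would support either choice.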
Finally, let's check for duplicates.
airbnb[airbnb.duplicated()].shape
(0, 13)
2.1.7. Correlations
Let's check the correlations between the numerical features.
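name = 'heatmap@airbnb--corr.png'
# restrict to the numerical columns; category and datetime columns are not correlated here
corr = airbnb.select_dtypes('number').corr()
plotter.corr(corr, name)
name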
There are no significant positive correlations except between
reviews_per_month & number_of_reviews (which is expected).
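Besides reading the heatmap, the strongest pairs can be listed directly from the correlation matrix. This is a minimal sketch on a toy frame with made-up values, not the output of the analysis above:

```python
import numpy as np
import pandas as pd

# Toy numerical features (hypothetical values, not from airbnb.csv).
toy = pd.DataFrame({
    'number_of_reviews': [9, 45, 0, 270, 9],
    'reviews_per_month': [0.21, 0.38, 0.0, 4.64, 0.10],
    'price': [149, 225, 150, 89, 80],
})

corr = toy.corr()

# Keep only the upper triangle (k=1 excludes the diagonal)
# so that each pair of features appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to a Series indexed by (feature, feature) and drop the masked cells.
pairs = upper.stack().dropna()

# Order the pairs by absolute correlation, strongest first.
pairs = pairs.reindex(pairs.abs().sort_values(ascending=False).index)

print(pairs)
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a quick look at the heatmap colours can miss.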