<- yaml::read_yaml("settings.yaml")$data_dir
data_dir if (data_dir != "." && data_dir != ".." && !dir.exists(data_dir)) {
dir.create(data_dir, recursive = TRUE)
}
Case study on uncovering impact factors of housing prices in Beijing with HGWR
In this document, we are going to show the reproducible codes for the case study on uncovering impact factors of housing prices in Beijing in the manuscript entitled “A Backfitting Maximum Likelihood Estimator for Hierarchical and Geographically Weighted Regression Modelling, with a case study of house prices in Beijing”.
This document is created with Quarto. The Experiment_CaseStudy.qmd
is the source file. Please open it with Quarto-supported tools, such as VS Code or RStudio.
Environment setup
Please install the following tools:
After installing R, please run an R console and install packages with the following codes:
install.packages(c("tidyverse", "sf", "sp", "GWmodel", "hgwrr", "lme4", "yaml", "jsonlite", "plot3D"))
Data preparing
The housing price data are collected from Kaggle. The POI and boundary data are collected from the Internet. To enhance the restrictions and relations on these data, they are arranged in a relational database. We only include a subset of them in this case study. After the process, the data are stored in the files shown in the following table.
file | content |
---|---|
data_housing_sel.rda |
Selected housing price data. |
data_poi_sel.rda |
Selected points of interest (POI) data. |
data_boundary_beijing_sel.rda |
Selected administration boundary of Beijing. |
data_title_map.json |
Mapping from original names to standardised names. |
We assume that the data is put in the directory specified by the configuration item data_dir
in the file settings.yml
with default value .
(current working directory). Please change it to anywhere you like.
The POI and boundary data can be read with the following codes.
library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
load(file.path(data_dir, "data_poi_sel.rda"))
<- st_read(file.path(data_dir, "data_boundary_beijing_sel.gpkg")) beijing
Reading layer `data_boundary_beijing_sel' from data source
`/Users/yigong/Library/CloudStorage/OneDrive-UniversityofBristol/Documents/PhD/Paper1_Backfitting-Maximum-Likelihood-HGWR/Submit_IJGIS/codes/CaseStudy/data_boundary_beijing_sel.gpkg'
using driver `GPKG'
Simple feature collection with 6 features and 29 fields
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 116.0426 ymin: 39.76114 xmax: 116.6394 ymax: 40.15974
Geodetic CRS: WGS 84
Now, read the housing price data with the following codes. These codes will also transform the original column names into standardised names used in the manuscript.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.0 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
<- jsonlite::read_json(file.path(data_dir, "data_title_map.json"))
name_map load(file.path(data_dir, "data_housing_sel.rda"))
<- rename_with(housing_sel, function (old_names) {
housing_sel map_chr(old_names, ~ ifelse(.x %in% names(name_map), name_map[[.x]], .x))
%>% collect()
}) glimpse(housing_sel)
Rows: 22,314
Columns: 39
$ id <chr> "101100984689", "101100997292", "101101012403", "10…
$ cid <chr> "1111027375873", "1111027381888", "1111027376313", …
$ price <dbl> 10.92448, 10.86228, 10.79538, 10.90060, 11.06312, 1…
$ Area <dbl> -0.46645323, -0.48291075, 1.13454809, 0.46495757, 0…
$ Bedroom <dbl> 1, 1, 2, 1, 2, 2, 2, 1, 0, 2, 1, 0, 1, 0, 2, 0, 0, …
$ living <dbl> 0, 0, 1, 0, 0, 1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ kitchen <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Bathroom <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Ladder.Ratio <dbl> -0.15384071, 0.81109077, 0.08883667, 0.81109077, -0…
$ Property.5.Year <dbl> 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, …
$ Subway <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, …
$ Floor.Low <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, …
$ Floor.High <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, …
$ Floor.Ground <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ Floor.Top <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ d.Company <dbl> -0.321549261, -1.065342320, 0.137085661, -0.7072714…
$ d.Supermarket <dbl> 0.27262352, -0.33977228, -0.49576061, 0.12606195, -…
$ d.Market <dbl> 0.49522008, -0.24245150, -0.25168088, 0.40077819, 0…
$ d.ShoppingCenter <dbl> 0.82355904, 0.28767512, -0.43217285, 2.26604909, 1.…
$ d.GasStation <dbl> 0.914053505, 0.413437991, 0.705057642, -0.051095050…
$ d.FuneralFacility <dbl> 0.7104053, -0.8517368, 0.6804280, 0.1255448, 1.4275…
$ d.Government <dbl> 0.18504605, -0.11576049, 0.21973005, 1.06237081, 1.…
$ d.ConventionCenter <dbl> 0.5802384, 0.8474398, 0.7790081, 2.0376407, 1.01561…
$ d.Research <dbl> 0.6318045457, 0.2976099156, 0.2786768833, 0.7571709…
$ d.ChineseFood <dbl> 1.48197572, -1.25526245, 0.33605912, 0.75165536, 0.…
$ d.ForeignFood <dbl> 0.268580751, 0.596907301, 0.494118919, 0.308745673,…
$ d.FastFood <dbl> -1.83754488, -1.52949665, -0.07074024, -0.01767842,…
$ d.Café <dbl> 0.23416240, 0.67234707, 0.88708938, 2.44182147, 1.4…
$ d.Gym <dbl> 1.334189211, -1.408758921, 0.455478600, 1.134141366…
$ d.PrimarySchool <dbl> -2.32110465, 0.07025931, -0.45596906, 0.13885671, 0…
$ d.HighSchool <dbl> -0.60137603, -0.04861809, 0.91222048, 0.45773404, 0…
$ d.University <dbl> 0.190391924, 0.400790819, -0.029698187, 0.005672968…
$ d.ParkingEntrances <dbl> 1.18345943, 1.04642500, 0.36101942, 0.26778683, 0.2…
$ d.Subway <dbl> 1.23154097, 1.15214605, 2.55482717, 1.89728666, 1.8…
$ d.TrainStation <dbl> 0.3514576, 0.9246838, 0.9767235, -0.3619107, -0.219…
$ d.ChargingStation <dbl> 0.59643630, -0.63678418, 1.22316787, -0.69094611, 0…
$ d.Park <dbl> 1.37136093, 0.01479307, -0.10894317, 1.57932664, 0.…
$ d.ScenicSpot <dbl> 1.30341958, 1.30693312, 1.20824403, 0.75254668, 0.4…
$ geom <POINT [°]> POINT (116.1974 39.91271), POINT (116.2211 39…
Their spatial distributions can be viewed via the following codes (Figure 6 in the manuscript).
library(plot3D)
<- st_drop_geometry(housing_sel) %>%
housing_complex_xyz select(id, cid, price) %>%
cbind(st_coordinates(st_transform(housing_sel, 4498))) %>%
group_by(cid) %>%
mutate(cindex = 1:length(id), price = round(exp(price))) %>%
ungroup() %>%
arrange(cid, cindex)
<- st_transform(beijing, 4498)
boundary_cgcs <- st_bbox(st_bbox(boundary_cgcs))
boundary_bbox <- boundary_bbox[c(1, 3)] + c(14000, -9000)
boundary_xlim <- boundary_bbox[c(2, 4)] + c(5000, -3000)
boundary_ylim perspbox(boundary_xlim, boundary_ylim, c(0, max(housing_complex_xyz$cindex)), bty = "n", theta = 10, phi = 20)
walk(boundary_cgcs$geom, function(geom) {
with(as.data.frame(st_coordinates(geom)), {
polygon3D(X, Y, rep(0, length(X)), add = T, col = "transparent", border = "black")
})
})with(housing_complex_xyz, points3D(X, Y, cindex * 1.08, colvar = price, add = T, pch = 16))
Model calibration
Next, we will calibrate three models: GWR, HLM, and HGWR.
HGWR
To calibrate an HGWR model, we need to get the group list as a column in the data. And also get the coordinates of all groups. Then, calibrate the HGWR model with the processed data.
library(hgwrr)
<- complex_sel %>% select(cid) %>% arrange(cid) %>% mutate(group = row_number())
group_info <- st_transform(group_info, 4498)
group_info <- left_join(st_drop_geometry(housing_sel), st_drop_geometry(group_info), by = "cid") %>%
housing_data_hgwr arrange(group)
<- hgwr(
model_hgwr ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
price 5.Year + Floor.Low + Subway +
Property.+ Ladder.Ratio + Floor.Top | group),
(Area data = housing_data_hgwr,
local.fixed = c(
"d.TrainStation", "d.ChineseFood", "d.Café", "d.FuneralFacility"
), coords = st_coordinates(group_info),
kernel = "bisquared",
alpha = 0.001
)summary(model_hgwr)
Hierarchical and geographically weighted regression model
=========================================================
Formula:
price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
Property.5.Year + Floor.Low + Subway + (Area + Ladder.Ratio +
Floor.Top | group)
Method: Back-fitting and Maximum likelihood
Data: housing_data_hgwr
Diagnostics
-----------
Rsquared
0.876075
Scaled residuals
----------------
Min 1Q Median 3Q Max
-1.580566 -0.055777 0.000425 0.058776 0.480169
Other Information
-----------------
Number of Obs: 22314
Groups: group , 882
The coefficient estimates can be visualized with the following codes (Figure 7 in the manuscript).
library(tmap)
<- coef(model_hgwr) %>%
coef_hgwr left_join(st_drop_geometry(group_info), by = "group") %>%
left_join(select(complex_sel, cid), ., by = "cid")
<- tm_shape(beijing) + tm_polygons(col = "white")
base_map with(model_hgwr$effects, c(local.fixed, random, "Intercept")) %>% map(., function(x) {
+ tm_shape(coef_hgwr) +
base_map tm_dots(col = x, size = 0.2, mid = 0, title = x, n = 5) +
tm_layout(
frame = F,
outer.margins = 0,
legend.title.size = 2,
legend.text.size = 1.4,
legend.just = "right",
legend.format = list(
text.align = "right",
text.separator = ",",
flag = " -",
scientific = TRUE,
format = "f"
),legend.outside = T,
legend.outside.size = 0.3
+
) tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("right", "bottom"))
})
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
[[7]]
[[8]]
Coefficient estimates from HGWR model.
GWR
Because the function gwr.basic()
in package GWmodel only accepts data of Spatial*DataFrame
, we need to do data conversion. And then we can calibrate the GWR model.
Because the data size is quite large, it would take a long time (possibly several hours), especially on macOS. We highly recommend readers do not run these codes on their own computers. Instead, run it on a high-performance server or use the results provided by us.
library(GWmodel)
<- as_Spatial(st_transform(housing_sel, 4498))
housing_data <- bw.gwr(
model_gwr_bw ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
price 5.Year + Floor.Low + Subway +
Property.+ Ladder.Ratio + Floor.Top,
Area data = housing_data,
approach = "AIC",
adaptive = TRUE,
kernel = "bisquare",
parallel.method = ifelse(Sys.info()['sysname'] == "Darwin", FALSE, "omp"),
parallel.arg = ifelse(Sys.info()['sysname'] == "Darwin", 0, 12)
)<- gwr.basic(
model_gwr ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
price 5.Year + Floor.Low + Subway +
Property.+ Ladder.Ratio + Floor.Top,
Area data = housing_data,
bw = model_gwr_bw,
adaptive = TRUE,
kernel = "bisquare",
parallel.method = ifelse(Sys.info()['sysname'] == "Darwin", FALSE, "omp"),
parallel.arg = ifelse(Sys.info()['sysname'] == "Darwin", 0, 12)
)save(model_gwr, file = file.path(data_dir, "model_gwr.rda"))
model_gwr
Estimates of some coefficient estimates are visualized with the following codes (Figure 8 in the manuscript):
load(file.path(data_dir, "model_gwr.rda"))
<- model_gwr$SDF@data %>%
coef_gwr cbind(select(st_drop_geometry(housing_sel), cid)) %>%
group_by(cid) %>%
summarise_all(mean) %>%
left_join(group_info, ., by = "cid")
map(c("d.FuneralFacility", "d.TrainStation", "d.Café"), function (x) {
+ tm_shape(coef_gwr) +
base_map tm_dots(col = x, title = x, size = 0.2, mid = 0) +
tm_layout(
frame = F,
outer.margins = 0,
legend.title.size = 2,
legend.text.size = 1.4,
legend.just = "right",
legend.format = list(
text.align = "right",
text.separator = ",",
flag = " -",
scientific = TRUE,
format = "f"
),legend.position = c("left", "top")
+
) tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("right", "bottom"))
})
[[1]]
[[2]]
[[3]]
Coefficient estimates of some effects from GWR.
HLM
The HLM is calibrated with the following codes.
library(lme4)
<- left_join(st_drop_geometry(housing_sel), st_drop_geometry(group_info), by = "cid") %>%
housing_data_hlm arrange(group)
<- lmer(
model_hlm ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
price 5.Year + Floor.Low + Subway +
Property.+ Ladder.Ratio + Floor.Top | group),
(Area data = housing_data_hlm
)summary(model_hlm)
Linear mixed model fit by REML ['lmerMod']
Formula:
price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
Property.5.Year + Floor.Low + Subway + (Area + Ladder.Ratio +
Floor.Top | group)
Data: housing_data_hlm
REML criterion at convergence: -30828.1
Scaled residuals:
Min 1Q Median 3Q Max
-15.0338 -0.5367 0.0030 0.5760 4.7532
Random effects:
Groups Name Variance Std.Dev. Corr
group (Intercept) 0.055850 0.23633
Area 0.004808 0.06934 0.10
Ladder.Ratio 0.001039 0.03223 -0.12 -0.43
Floor.Top 0.001579 0.03974 0.18 0.57 -0.24
Residual 0.011237 0.10600
Number of obs: 22314, groups: group, 882
Fixed effects:
Estimate Std. Error t value
(Intercept) 11.220678 0.008836 1269.840
d.FuneralFacility 0.003988 0.008767 0.455
d.ChineseFood 0.033715 0.008865 3.803
d.Café -0.067453 0.008238 -8.188
d.TrainStation -0.056104 0.008454 -6.636
Property.5.Year 0.003339 0.001591 2.099
Floor.Low -0.010547 0.001839 -5.736
Subway 0.008769 0.003372 2.600
Correlation of Fixed Effects:
(Intr) d.FnrF d.ChnF d.Café d.TrnS Pr.5.Y Flr.Lw
d.FnrlFclty -0.154
d.ChineseFd -0.018 -0.101
d.Café -0.147 0.000 -0.252
d.TrainSttn -0.163 -0.163 -0.113 -0.096
Prprty.5.Yr -0.104 0.002 0.003 -0.001 -0.001
Floor.Low -0.047 -0.003 0.000 0.001 -0.001 0.008
Subway -0.263 0.014 0.018 0.041 0.004 -0.018 -0.006
Because estimates of fixed effects are constant across the data set, we only visualise estimates of random effects (Figure 9 in the manuscript).
<- cbind(st_drop_geometry(group_info), coef(model_hlm)$group) %>%
coef_hlm left_join(select(complex_sel, cid), ., by = "cid")
c("Area", "Ladder.Ratio", "Floor.Top") %>% map(function(x) {
+ tm_shape(coef_hlm) +
base_map tm_dots(col = x, size = 0.1, mid = 0, title = x) +
tm_layout(
frame = F,
outer.margins = 0,
legend.title.size = 2,
legend.text.size = 1.4,
legend.just = "right",
legend.format = list(
text.align = "right",
text.separator = ",",
flag = " -",
scientific = TRUE,
format = "f"
),legend.position = c("left", "top")
+
) tm_compass(position = c("right", "top")) +
tm_scale_bar(position = c("right", "bottom"))
})
[[1]]
[[2]]
[[3]]
Coefficient estimates of random effects from HLM.
Comparing estimates of random effects from HLM and HGWR
In the manuscript, a comparison between estimates of random effects from HLM and HGWR is included. It is carried out by the following codes (Figure 10).
list(HGWR = coef(model_hgwr), HLM = coef(model_hlm)$group) %>%
map(~ select(.x, Area, Ladder.Ratio, Floor.Top)) %>%
map_dfr(~ rowid_to_column(.x), .id = "Model") %>%
pivot_longer(Floor.Top:Area, names_to = "Coefficient", values_to = "Value") %>%
pivot_wider(id_cols = rowid:Coefficient, names_from = Model, values_from = Value) %>%
ggplot(aes(HLM, HGWR)) +
geom_point() +
facet_wrap(~ Coefficient, labeller = ) +
stat_smooth(method = "lm") +
theme_bw() +
theme(text = element_text(size = 24))
`geom_smooth()` using formula = 'y ~ x'
list(HGWR = coef(model_hgwr), HLM = coef(model_hlm)$group) %>%
map(~ select(.x, Area, Ladder.Ratio, Floor.Top)) %>%
pmap_dfc(~ abs((..1 - ..2))) %>%
pivot_longer(Area:Floor.Top, names_to = "Coefficient", values_to = "Absolute.Error") %>%
ggplot(aes(Coefficient, Absolute.Error)) +
geom_boxplot() +
theme_bw() +
theme(text = element_text(size = 24))
Comparison of random effect estimates from HGWR and HLM.
Summary
This document shows how to reproduce the case study of the HGWR model with housing price data in Beijing. All necessary provided codes are reproducible to recreate Figure 6-10 in the manuscript. Intermediate results are also provided for double check.