Case study on uncovering impact factors of housing prices in Beijing with HGWR

In this document, we are going to show the reproducible codes for the case study on uncovering impact factors of housing prices in Beijing in the manuscript entitled “A Backfitting Maximum Likelihood Estimator for Hierarchical and Geographically Weighted Regression Modelling, with a case study of house prices in Beijing”.

This document is created with Quarto. The Experiment_CaseStudy.qmd is the source file. Please open it with Quarto-supported tools, such as VS Code or RStudio.

Environment setup

Please install the following tools:

After installing R, please run an R console and install packages with the following codes:

install.packages(c("tidyverse", "sf", "sp", "GWmodel", "hgwrr", "lme4", "yaml", "jsonlite", "plot3D"))

Data preparing

The housing price data are collected from Kaggle. The POI and boundary data are collected from the Internet. To enhance the restrictions and relations on these data, they are arranged in a relational database. We only include a subset of them in this case study. After the process, the data are stored in the files shown in the following table.

file content
data_housing_sel.rda Selected housing price data.
data_poi_sel.rda Selected points of interest (POI) data.
data_boundary_beijing_sel.rda Selected administration boundary of Beijing.
data_title_map.json Mapping from original names to standardised names.

We assume that the data is put in the directory specified by the configuration item data_dir in the file settings.yml with default value . (current working directory). Please change it to anywhere you like.

data_dir <- yaml::read_yaml("settings.yaml")$data_dir
if (data_dir != "." && data_dir != ".." && !dir.exists(data_dir)) {
    dir.create(data_dir, recursive = TRUE)
}

The POI and boundary data can be read with the following codes.

library(sf)
Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
load(file.path(data_dir, "data_poi_sel.rda"))
beijing <- st_read(file.path(data_dir, "data_boundary_beijing_sel.gpkg"))
Reading layer `data_boundary_beijing_sel' from data source 
  `/Users/yigong/Library/CloudStorage/OneDrive-UniversityofBristol/Documents/PhD/Paper1_Backfitting-Maximum-Likelihood-HGWR/Submit_IJGIS/codes/CaseStudy/data_boundary_beijing_sel.gpkg' 
  using driver `GPKG'
Simple feature collection with 6 features and 29 fields
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: 116.0426 ymin: 39.76114 xmax: 116.6394 ymax: 40.15974
Geodetic CRS:  WGS 84

Now, read the housing price data with the following codes. These codes will also transform the original column names into standardised names used in the manuscript.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
name_map <- jsonlite::read_json(file.path(data_dir, "data_title_map.json"))
load(file.path(data_dir, "data_housing_sel.rda"))
housing_sel <- rename_with(housing_sel, function (old_names) {
    map_chr(old_names, ~ ifelse(.x %in% names(name_map), name_map[[.x]], .x))
}) %>% collect()
glimpse(housing_sel)
Rows: 22,314
Columns: 39
$ id                 <chr> "101100984689", "101100997292", "101101012403", "10…
$ cid                <chr> "1111027375873", "1111027381888", "1111027376313", …
$ price              <dbl> 10.92448, 10.86228, 10.79538, 10.90060, 11.06312, 1…
$ Area               <dbl> -0.46645323, -0.48291075, 1.13454809, 0.46495757, 0…
$ Bedroom            <dbl> 1, 1, 2, 1, 2, 2, 2, 1, 0, 2, 1, 0, 1, 0, 2, 0, 0, …
$ living             <dbl> 0, 0, 1, 0, 0, 1, 0, 0, -1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ kitchen            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Bathroom           <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ Ladder.Ratio       <dbl> -0.15384071, 0.81109077, 0.08883667, 0.81109077, -0…
$ Property.5.Year    <dbl> 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, …
$ Subway             <dbl> 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, …
$ Floor.Low          <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, …
$ Floor.High         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, …
$ Floor.Ground       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, …
$ Floor.Top          <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ d.Company          <dbl> -0.321549261, -1.065342320, 0.137085661, -0.7072714…
$ d.Supermarket      <dbl> 0.27262352, -0.33977228, -0.49576061, 0.12606195, -…
$ d.Market           <dbl> 0.49522008, -0.24245150, -0.25168088, 0.40077819, 0…
$ d.ShoppingCenter   <dbl> 0.82355904, 0.28767512, -0.43217285, 2.26604909, 1.…
$ d.GasStation       <dbl> 0.914053505, 0.413437991, 0.705057642, -0.051095050…
$ d.FuneralFacility  <dbl> 0.7104053, -0.8517368, 0.6804280, 0.1255448, 1.4275…
$ d.Government       <dbl> 0.18504605, -0.11576049, 0.21973005, 1.06237081, 1.…
$ d.ConventionCenter <dbl> 0.5802384, 0.8474398, 0.7790081, 2.0376407, 1.01561…
$ d.Research         <dbl> 0.6318045457, 0.2976099156, 0.2786768833, 0.7571709…
$ d.ChineseFood      <dbl> 1.48197572, -1.25526245, 0.33605912, 0.75165536, 0.…
$ d.ForeignFood      <dbl> 0.268580751, 0.596907301, 0.494118919, 0.308745673,…
$ d.FastFood         <dbl> -1.83754488, -1.52949665, -0.07074024, -0.01767842,…
$ d.Café             <dbl> 0.23416240, 0.67234707, 0.88708938, 2.44182147, 1.4…
$ d.Gym              <dbl> 1.334189211, -1.408758921, 0.455478600, 1.134141366…
$ d.PrimarySchool    <dbl> -2.32110465, 0.07025931, -0.45596906, 0.13885671, 0…
$ d.HighSchool       <dbl> -0.60137603, -0.04861809, 0.91222048, 0.45773404, 0…
$ d.University       <dbl> 0.190391924, 0.400790819, -0.029698187, 0.005672968…
$ d.ParkingEntrances <dbl> 1.18345943, 1.04642500, 0.36101942, 0.26778683, 0.2…
$ d.Subway           <dbl> 1.23154097, 1.15214605, 2.55482717, 1.89728666, 1.8…
$ d.TrainStation     <dbl> 0.3514576, 0.9246838, 0.9767235, -0.3619107, -0.219…
$ d.ChargingStation  <dbl> 0.59643630, -0.63678418, 1.22316787, -0.69094611, 0…
$ d.Park             <dbl> 1.37136093, 0.01479307, -0.10894317, 1.57932664, 0.…
$ d.ScenicSpot       <dbl> 1.30341958, 1.30693312, 1.20824403, 0.75254668, 0.4…
$ geom               <POINT [°]> POINT (116.1974 39.91271), POINT (116.2211 39…

Their spatial distributions can be viewed via the following codes (Figure 6 in the manuscript).

library(plot3D)
housing_complex_xyz <- st_drop_geometry(housing_sel) %>%
    select(id, cid, price) %>%
    cbind(st_coordinates(st_transform(housing_sel, 4498))) %>%
    group_by(cid) %>%
    mutate(cindex = 1:length(id), price = round(exp(price))) %>%
    ungroup() %>%
    arrange(cid, cindex)
boundary_cgcs <- st_transform(beijing, 4498)
boundary_bbox <- st_bbox(st_bbox(boundary_cgcs))
boundary_xlim <- boundary_bbox[c(1, 3)] + c(14000, -9000)
boundary_ylim <- boundary_bbox[c(2, 4)] + c(5000, -3000)
perspbox(boundary_xlim, boundary_ylim, c(0, max(housing_complex_xyz$cindex)), bty = "n", theta = 10, phi = 20)
walk(boundary_cgcs$geom, function(geom) {
    with(as.data.frame(st_coordinates(geom)), {
        polygon3D(X, Y, rep(0, length(X)), add = T, col = "transparent", border = "black")
    })
})
with(housing_complex_xyz, points3D(X, Y, cindex * 1.08, colvar = price, add = T, pch = 16))

Model calibration

Next, we will calibrate three models: GWR, HLM, and HGWR.

HGWR

To calibrate an HGWR model, we need to get the group list as a column in the data. And also get the coordinates of all groups. Then, calibrate the HGWR model with the processed data.

library(hgwrr)
group_info <- complex_sel %>% select(cid) %>% arrange(cid) %>% mutate(group = row_number())
group_info <- st_transform(group_info, 4498)
housing_data_hgwr <- left_join(st_drop_geometry(housing_sel), st_drop_geometry(group_info), by = "cid") %>%
    arrange(group)
model_hgwr <- hgwr(
    price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
            Property.5.Year + Floor.Low + Subway +
            (Area + Ladder.Ratio + Floor.Top | group),
    data = housing_data_hgwr, 
    local.fixed = c(
        "d.TrainStation", "d.ChineseFood", "d.Café", "d.FuneralFacility"
    ), 
    coords = st_coordinates(group_info),
    kernel = "bisquared",
    alpha = 0.001
)
summary(model_hgwr)
Hierarchical and geographically weighted regression model
=========================================================
Formula: 
price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +  
    Property.5.Year + Floor.Low + Subway + (Area + Ladder.Ratio +  
    Floor.Top | group)
 Method: Back-fitting and Maximum likelihood
   Data: housing_data_hgwr

Diagnostics
-----------
 Rsquared 
 0.876075 

Scaled residuals
----------------
       Min         1Q    Median        3Q       Max 
 -1.580566  -0.055777  0.000425  0.058776  0.480169 

Other Information
-----------------
Number of Obs: 22314
       Groups: group , 882

The coefficient estimates can be visualized with the following codes (Figure 7 in the manuscript).

library(tmap)
coef_hgwr <- coef(model_hgwr) %>%
    left_join(st_drop_geometry(group_info), by = "group") %>%
    left_join(select(complex_sel, cid), ., by = "cid")
base_map <- tm_shape(beijing) + tm_polygons(col = "white")
with(model_hgwr$effects, c(local.fixed, random, "Intercept")) %>% map(., function(x) {
    base_map + tm_shape(coef_hgwr) +
        tm_dots(col = x, size = 0.2, mid = 0, title = x, n = 5) +
        tm_layout(
            frame = F,
            outer.margins = 0,
            legend.title.size = 2,
            legend.text.size = 1.4,
            legend.just = "right",
            legend.format = list(
                text.align = "right",
                text.separator = ",",
                flag = " -",
                scientific = TRUE,
                format = "f"
            ),
            legend.outside = T,
            legend.outside.size = 0.3
        ) +
        tm_compass(position = c("right", "top")) +
        tm_scale_bar(position = c("right", "bottom"))
})
[[1]]

[[2]]

[[3]]

[[4]]

[[5]]

[[6]]

[[7]]

[[8]]

d.FuneralFacility

d.ChineseFood

d.Café

d.TrainStation

Area

Ladder.Ratio

Floor.Top

Intercept

Coefficient estimates from HGWR model.

GWR

Because the function gwr.basic() in package GWmodel only accepts data of Spatial*DataFrame, we need to do data conversion. And then we can calibrate the GWR model.

Important

Because the data size is quite large, it would take a long time (possibly several hours), especially on macOS. We highly recommend readers do not run these codes on their own computers. Instead, run it on a high-performance server or use the results provided by us.

library(GWmodel)
housing_data <- as_Spatial(st_transform(housing_sel, 4498))
model_gwr_bw <- bw.gwr(
    price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
            Property.5.Year + Floor.Low + Subway +
            Area + Ladder.Ratio + Floor.Top,
    data = housing_data,
    approach = "AIC",
    adaptive = TRUE,
    kernel = "bisquare",
    parallel.method = ifelse(Sys.info()['sysname'] == "Darwin", FALSE, "omp"),
    parallel.arg = ifelse(Sys.info()['sysname'] == "Darwin", 0, 12)
)
model_gwr <- gwr.basic(
    price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
            Property.5.Year + Floor.Low + Subway +
            Area + Ladder.Ratio + Floor.Top,
    data = housing_data,
    bw = model_gwr_bw,
    adaptive = TRUE, 
    kernel = "bisquare",
    parallel.method = ifelse(Sys.info()['sysname'] == "Darwin", FALSE, "omp"),
    parallel.arg = ifelse(Sys.info()['sysname'] == "Darwin", 0, 12)
)
save(model_gwr, file = file.path(data_dir, "model_gwr.rda"))
model_gwr

Estimates of some coefficient estimates are visualized with the following codes (Figure 8 in the manuscript):

load(file.path(data_dir, "model_gwr.rda"))
coef_gwr <- model_gwr$SDF@data %>%
    cbind(select(st_drop_geometry(housing_sel), cid)) %>%
    group_by(cid) %>%
    summarise_all(mean) %>%
    left_join(group_info, ., by = "cid")
map(c("d.FuneralFacility", "d.TrainStation", "d.Café"), function (x) {
    base_map + tm_shape(coef_gwr) +
        tm_dots(col = x, title = x, size = 0.2, mid = 0) +
        tm_layout(
            frame = F,
            outer.margins = 0,
            legend.title.size = 2,
            legend.text.size = 1.4,
            legend.just = "right",
            legend.format = list(
                text.align = "right",
                text.separator = ",",
                flag = " -",
                scientific = TRUE,
                format = "f"
            ),
            legend.position = c("left", "top")
        ) +
        tm_compass(position = c("right", "top")) +
        tm_scale_bar(position = c("right", "bottom"))
})
[[1]]

[[2]]

[[3]]

d.FuneralFacility

d.TrainStation

d.Café

Coefficient estimates of some effects from GWR.

HLM

The HLM is calibrated with the following codes.

library(lme4)
housing_data_hlm <- left_join(st_drop_geometry(housing_sel), st_drop_geometry(group_info), by = "cid") %>%
    arrange(group)
model_hlm <- lmer(
    price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +
            Property.5.Year + Floor.Low + Subway +
            (Area + Ladder.Ratio + Floor.Top | group),
    data = housing_data_hlm
)
summary(model_hlm)
Linear mixed model fit by REML ['lmerMod']
Formula: 
price ~ d.FuneralFacility + d.ChineseFood + d.Café + d.TrainStation +  
    Property.5.Year + Floor.Low + Subway + (Area + Ladder.Ratio +  
    Floor.Top | group)
   Data: housing_data_hlm

REML criterion at convergence: -30828.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-15.0338  -0.5367   0.0030   0.5760   4.7532 

Random effects:
 Groups   Name         Variance Std.Dev. Corr             
 group    (Intercept)  0.055850 0.23633                   
          Area         0.004808 0.06934   0.10            
          Ladder.Ratio 0.001039 0.03223  -0.12 -0.43      
          Floor.Top    0.001579 0.03974   0.18  0.57 -0.24
 Residual              0.011237 0.10600                   
Number of obs: 22314, groups:  group, 882

Fixed effects:
                   Estimate Std. Error  t value
(Intercept)       11.220678   0.008836 1269.840
d.FuneralFacility  0.003988   0.008767    0.455
d.ChineseFood      0.033715   0.008865    3.803
d.Café           -0.067453   0.008238   -8.188
d.TrainStation    -0.056104   0.008454   -6.636
Property.5.Year    0.003339   0.001591    2.099
Floor.Low         -0.010547   0.001839   -5.736
Subway             0.008769   0.003372    2.600

Correlation of Fixed Effects:
            (Intr) d.FnrF d.ChnF d.Café d.TrnS Pr.5.Y Flr.Lw
d.FnrlFclty -0.154                                           
d.ChineseFd -0.018 -0.101                                    
d.Café     -0.147  0.000 -0.252                             
d.TrainSttn -0.163 -0.163 -0.113 -0.096                      
Prprty.5.Yr -0.104  0.002  0.003 -0.001  -0.001              
Floor.Low   -0.047 -0.003  0.000  0.001  -0.001  0.008       
Subway      -0.263  0.014  0.018  0.041   0.004 -0.018 -0.006

Because estimates of fixed effects are constant across the data set, we only visualise estimates of random effects (Figure 9 in the manuscript).

coef_hlm <- cbind(st_drop_geometry(group_info), coef(model_hlm)$group) %>%
    left_join(select(complex_sel, cid), ., by = "cid")
c("Area", "Ladder.Ratio", "Floor.Top") %>% map(function(x) {
    base_map + tm_shape(coef_hlm) +
        tm_dots(col = x, size = 0.1, mid = 0, title = x) +
        tm_layout(
            frame = F,
            outer.margins = 0,
            legend.title.size = 2,
            legend.text.size = 1.4,
            legend.just = "right",
            legend.format = list(
                text.align = "right",
                text.separator = ",",
                flag = " -",
                scientific = TRUE,
                format = "f"
            ),
            legend.position = c("left", "top")
        ) +
        tm_compass(position = c("right", "top")) +
        tm_scale_bar(position = c("right", "bottom"))
})
[[1]]

[[2]]

[[3]]

Area

Ladder.Ratio

Floor.Top

Coefficient estimates of random effects from HLM.

Comparing estimates of random effects from HLM and HGWR

In the manuscript, a comparison between estimates of random effects from HLM and HGWR is included. It is carried out by the following codes (Figure 10).

list(HGWR = coef(model_hgwr), HLM = coef(model_hlm)$group) %>%
    map(~ select(.x, Area, Ladder.Ratio, Floor.Top)) %>%
    map_dfr(~ rowid_to_column(.x), .id = "Model") %>%
    pivot_longer(Floor.Top:Area, names_to = "Coefficient", values_to = "Value") %>%
    pivot_wider(id_cols = rowid:Coefficient, names_from = Model, values_from = Value) %>%
    ggplot(aes(HLM, HGWR)) + 
        geom_point() + 
        facet_wrap(~ Coefficient, labeller = ) + 
        stat_smooth(method = "lm") +
        theme_bw() +
        theme(text = element_text(size = 24))
`geom_smooth()` using formula = 'y ~ x'
list(HGWR = coef(model_hgwr), HLM = coef(model_hlm)$group) %>%
    map(~ select(.x, Area, Ladder.Ratio, Floor.Top)) %>%
    pmap_dfc(~ abs((..1 - ..2))) %>%
    pivot_longer(Area:Floor.Top, names_to = "Coefficient", values_to = "Absolute.Error") %>%
    ggplot(aes(Coefficient, Absolute.Error)) +
        geom_boxplot() + 
        theme_bw() +
        theme(text = element_text(size = 24))

Estimate comparison

Absolute difference

Comparison of random effect estimates from HGWR and HLM.

Summary

This document shows how to reproduce the case study of the HGWR model with housing price data in Beijing. All necessary provided codes are reproducible to recreate Figure 6-10 in the manuscript. Intermediate results are also provided for double check.