A maps-to-maps approach for simulating urban land expansion based on convolutional long short-term memory neural networks

Abstract Cellular automata (CA) have been prevalently used for the simulation of urban land change. However, how to effectively learn the spatial-temporal dynamics of urban development from time-series data remain an important challenge for CA-based models. To address this issue, we propose a new model for the simulation of urban development based on convolutional long short-term memory (ConvLSTM) neural networks. The core of the proposed model is a sequence of vanilla ConvLSTM cells integrated with the modules of channel attention and contextual embedding. Compared with conventional CA-based models, the proposed ConvLSTM model is more advanced in that it can better leverage the open access annual urban land maps to capture simultaneously the spatial structure and the temporal dependency of historical urban development, and further predict multiple maps of annual development for subsequent years (i.e., Maps-to-Maps). The performance of the ConvLSTM model is evaluated through the case studies in China’s three mega-urban regions, and ConvLSTM outperforms other state-of-the-art deep learning architectures at both the pixel level and the coarser grid level. The results also suggest the satisfactory transferability of ConvLSTM in that the model trained in one mega-urban region can be successfully re-used in others without fine tuning.


Introduction
Urban land expansion is the direct outcome of urbanization.Globally, population growth was found the major driver of urban land expansion during 1970-2014, while the importance of Gross Domestic Product growth became greater after 2000 (Mahtta et al. 2022).Due to its far-reaching impacts, such as arable land losses (Sumbo et al. 2023), biodiversity declines (Simkin et al. 2022), and urban heat island effects (Chatterjee and Majumdar 2022), researchers have been devoting their efforts to the modeling of urban land change.Many studies focus on mapping urban land change through remotely sensed imagery and advanced detection methods (Aslam et al. 2023).In particular, the power of cloud computing has been exploited along with the increasing volume of satellite images to generate urban land maps that are globally covered, high-resolution, and annually available over a long span of time (Liu et al. 2020, Huang et al. 2022, Zhou et al. 2018, Gong et al. 2020).These data products are fundamental to the development of urban land change models, which have been considered helpful for understanding the associated impacts of urban land expansion at various regions and scales (Chen et al. 2020, Tong and Feng 2020, Gao and O'Neill 2020, Zhang et al. 2023b).
Many of the contemporary urban land change models are developed based on cellular automata (CA) because of their ability to simulate spatial phenomena with simple mechanisms (D'Acci and Batty 2019).When CA are applied to the simulation of urban land change, the major components of a CA-based model usually include a module to evaluate development suitability considering a set of spatial variables (or driving factors), a neighborhood component to model the local interaction between neighboring land units, and a stochastic component to account for the effects of unpredictable perturbation (Roodposhti et al. 2020).How these components operate is controlled by the mechanisms represented in the form of either if-then rules or functions that can be calibrated using spatial data.A variety of methods have been proposed for model calibration.They range from geostatistical methods (Gao et al. 2020), heuristic methods (Cao et al. 2019, Modiri et al. 2023), tree-based algorithms (Shafizadeh-Moghadam et al. 2021), to the more recent deep learning methods such as convolutional neural networks (CNN) (Zhai et al. 2020, Wang et al. 2022, Guan et al. 2023).
Despite the different calibration methods applied, most of the contemporary CAbased models are calibrated using a pair of urban land maps at two points in time: the initial T 0 and the ending T 1 at which some change has occurred, and validated using the urban land map at T 2 (e.g.Addae and Dragi� cevi� c (2022)).A few models are calibrated using only the T 0 urban land map, such as FLUS (Liu et al. 2017).Therefore, the calibration methods used in these models merely captures the spatial dynamics underlying the process of urban land expansion, but fail to interpret the (spatial-)temporal information from multi-temporal or time-series urban land maps.This is not trivial because ignoring the temporal dimension would cause misunderstanding of the process of urban land expansion and reduce the reliability of the long-run projections of urban land change (Li et al. 2021).
To address this issue, several methods have been developed to incorporate multitemporal or time-series urban land data for model calibration.For instance, Chen et al. (2016) used survival analysis to capture the temporal trend of the spatial variables' effects by incorporating four urban land maps with an interval of five years.The major weakness of their approach is that only the constant trend over time (i.e., increase or decrease) can be obtained, neglecting the potential temporal fluctuations.Li et al. (2020) proposed to adjust the neighborhood component of a Logistic-CA model by assigning higher weights for the neighboring cells that were developed closer to present time.However, the weights were determined mainly based on expert knowledge and difficult to calibrate with empirical data.Yang et al. (2023) focused on delineating the spatial-temporal trend of urban land demand with a isotropic Gaussian function, but ignored the spatial-temporal information associated with the neighbourhood component and the spatial variables, which is inherent in time-series urban land data.Overall, the representative studies mentioned above cannot fully address the challenge of capturing the spatial-temporal dynamics of urban land expansion.
More recently, deep learning methods have emerged as a new approach for modeling urban land expansion with multi-temporal or time-series data.For instance, Qian et al. (2020) developed a CNN-based method to learn the effects of spatial variables on urban land expansion from multi-temporal urban land maps through convolution operations.Geng et al. (2022) adopted the similar approach but only three urban land maps were used with an interval of five years.Despite the use of multi-temporal data in Qian et al. (2020) and Geng et al. (2022), they were only stacked as the input while the temporal relationships of the data could not be effectively captured, because the architecture of CNN lacks a component to do that.Another approach of deep learning, namely long short-term memory (LSTM) (Graves 2012), had demonstrated promising performance in the tasks of sequence prediction, hence it is applicable to the spatialtemporal modeling of long-run urban land change.Despite the success of LSTM in many fields, such as natural language processing (Merity et al. 2018) and financial market forecasting (Moghar and Hamiche 2020), the development of LSTM-based models for simulating urban land change are still at its infancy.Liu et al. (2021) applied the LSTM method for the simulation of urban expansion, but the power of LSTM could not be fully exploited because only three urban land maps (with an interval of ten years) were used.Xing et al. (2020) combined CNN and LSTM (i.e., CNN-LSTM) to develop a CA-based simulation model of urban development, in which the CNN component is for learning the effects of spatial variables and the LSTM component is for learning the temporal relationships from the input time-series urban land maps.A similar CNN-LSTM-CA model was also developed by Zhou et al. (2023).
This study proposes a new model for the simulation of urban land expansion based on the method of convolutional long short-term memory (ConvLSTM) (Shi et al. 2015).ConvLSTM extends the conventional LSTM methods by including the convolutional operations for encoding spatial-temporal information and making prediction.Therefore, compared with the conventional CNN-LSTM method that capture the spatial information and the temporal information using two separating networks (i.e., CNN and LSTM, respectively), ConvLSTM has a more concise architecture that can learn and fuse the spatial-temporal information in the same network (Masrur and Yu 2023).Moreover, unlike the conventional LSTM methods that only yield the prediction at one step ahead, ConvLSTM allows for outputting a sequence of future results (i.e., multiple timesteps).All these characteristics make ConvLSTM more suitable than the conventional methods to the challenging simulation of long-run urban land change.
In this study, we further enhance ConvLSTM with the modules of channel attention (Woo et al. 2018) and contextual embedding (Chai et al. 2022).Because of its conciseness and efficiency, channel attention has been widely used for weighting the importance of features learned by deep learning models (Cai and Chen 2021).Therefore, we use channel attention to enhance the ability of the original ConvLSTM to estimate the effects of spatial variables.Contextual embedding, however, is originally designed to improve the contextual interactions when applying LSTM to solve the problems of spatial-temporal prediction (Chai et al. 2022), and hence is feasible to combine with ConvLSTM for simulating urban land expansion.By this means, the simulation model uses a sequence of urban land maps as the input and yields a sequence of predicted urban land maps as the output (i.e., a Maps-to-Maps manner).The proposed ConvLSTM model was evaluated and compared with several other models through the simulations of urban land expansion in China's Guangdong-Hong Kong-Macau Great Bay Area (GBA).The transferability of the ConvLSTM model was also examined through additional experiments of simulations for Yangtze River Delta (YRD) and Beijing-Tianjin-Tangshan (BTT).

Study areas and data
The mega-urban regions of GBA, YRD, and BTT have the highest economic and political status in China, and are considered in the Fourteenth Five-Year Plan as the key regions for China's long-run development.During the past three decades, however, the rapid expansion of urban land in these three mega-urban regions has caused the increasing pressure on natural resources and environments.Therefore, accurate simulation of urban development in these regions is not only useful to understand the historical urban dynamics but also important to explore the consequences of future development.In this study, the performance of the proposed model is first evaluated using the GBA data, and the geographic transferability of the proposed model is further assessed using the collective data of all three mega-urban regions.
The urban land maps of the study area were obtained from the 30-m raster dataset of global impervious surface area (GISA 2.0) (Huang et al. 2022).This dataset was developed based on the global Landsat images spanning from 1985 to 2019, and provides annual urban land cover with higher accuracy than other 30-m datasets.In this study, the annual urban land maps of the study areas during the periods of 2000-2017 were used for the GBA simulation and the evaluation of transferability.Specifically, the 2000-2005 maps were used to trained the proposed model to produce the simulated 2006-2011 maps, and the independent 2012-2017 maps were used to evaluate the performance of the trained model in terms of prediction.
Besides the urban land maps, the data of DEM, road networks, urban facility, and the positions of representative urban centers were also used.The DEM data was derived from the recently released Forest and Buildings removed Copernicus DEM (FABDEM) with a spatial resolution of 30 m (Hawker et al. 2022).The road networks data was collected from OpenStreetMap, which provides the major roads of the study areas.The urban facility considered include airports, railway stations, and accesses to highways.The representative urban centers include county centers and town centers.Data of urban facility and representative urban centers were acquired by digitalizing the official maps of the study area.With these data, several spatial variables were generated, including slope, distance to the nearest major road, distance to the nearest airport, distance to the nearest railway station, distance to the nearest access to highways, distance to the nearest county center, and distance to the nearest town center.All these spatial variables represent the situation of year 2020 and have a 30-m resolution.The selection of these variables is aligned with those in previous studies (Chen et al. 2019b, Rienow et al. 2021, Roodposhti et al. 2020).Note that the spatial variables are held constant throughout the simulation of annual urban growth.Such a treatment is indeed a compromise due to the lack of multi-temporal urban facility and road network data.Recent studies also applied constant spatial variables for urban simulation using multi-temporal or time-series urban land maps (Wang et al. 2022, Zhou et al. 2023, Qian et al. 2020).

Overview of the model structure
The proposed model simulates urban development through a Maps-to-Maps manner.This means that the model receives multiple, annually consecutive urban land maps for years T 1 , … , T n as the inputs and outputs a sequence of annually consecutive urban land maps for the future years T nþ1 , … , T nþm .To fulfil this approach, a ConvLSTM neural network integrated with the modules of channel attention and contextual embedding is developed using a set of spatial variables (Table 1).Specifically, the network is composed of a sequence of ConvLSTM cells, each of which is a vanilla ConvLSTM cell (Shi et al. 2015) (see Section 2.2.4) enhanced by channel attention (Woo et al. 2018) and contextual embedding (Chai et al. 2022) (see Section 2.2.2 and 2.2.3, respectively) (Figure 2).The number of the ConvLSTM cells that compose the model is n þ m -1, where n refers to the number of input years and m is the number of output years.
For the first n ConvLSTM cells, they are used for learning the spatial-temporal dynamics of urban land expansion for the years T 1 , … , T n .Specifically, the first ConvLSTM cell receives the data tile X 1 as its input to generate the cell state c 1 and the hidden state h 1 .Here X 1 is the spatial subset (64 � 64 or 1.92 km � 1.92 km) of the urban land map and the spatial variables, while c 1 and h 1 refer to the long-term and the short-term memories, respectively (Yu et al. 2019) (see Section 2.2.4).The resulting c 1 and h 1 along with the data tile X 2 are used as the input of the second ConvLSTM cell to generate c 2 and h 2 , and so forth until the nth ConvLSTM cell.The resulting h n of the nth ConvLSTM cell is regarded as the probability of urban land development (referred to as probability tile for short) at the year T nþ1 , which is further used (instead of the urban land map at the year T nþ1 ) with the spatial variables to compose the data tile X nþ1 for the next ConvLSTM cell (n þ 1).The (n þ 1)th ConvLSTM cell runs in   the same way to generate c nþ1 and h nþ1 (as the probability tile at the year T nþ2 ), and so on until the (n þ m -1)th cell, which outputs the probability tile at the year T nþm .
Finally, with the resulting probability tiles, urban development for the years T nþ1 , … , T nþm can be simulated by the means of, for example, value-based segmentation (see Section 2.2.5).
Figure 3 demonstrates the details of a ConvLSTM cell.As aforementioned, a ConvLSTM cell is created based on the integration of the vanilla ConvLSTM cell and the modules of channel attention and contextual embedding.Here the channel attention module is used to capture the effects of each spatial variable, while the contextual embedding model is used to infer where urban development would be most likely to occur.The processing within a ConvLSTM cell can be summarized as follows: where ChA(), CE(), and VConvLSTM() correspond to the operations of the channel attention module, the contextual embedding module, and the vanilla ConvLSTM cell.Specifically, the data tile X t at the year T t passes through the channel attention module, generating the output CX t .The resulting CX t along with h t-1 become the input of the contextual embedding, and the corresponding outputs along with c t-1 are used to generate c t and h t based on the operation of the vanilla ConvLSTM cell.The details of the calculations are discussed in the following sections 2.2.2-2.2.4.With the annual urban land maps and the spatial variables, the ConvLSTM model is first trained to simulate the annual urban development for the period of 2006-2011 with the actual 2000-2005 data as the input.After that the trained model is further evaluated through its prediction of the annual urban development for the period of 2012-2017 with the actual 2006-2011 data as the input.Therefore, the number of ConvLSTM cells is 11 (i.e., n ¼ m ¼ 6).Note that the values of six for both n and m is set considering the balance between the amount of spatial-temporal data used by the model and the computation costs.When n is set with greater values, more years of data can be utilized by the model and hence richer spatial-temporal information acquired, but in turn the computation costs would increase.Similarly, when m is set with greater values, more computation time is required to produce the simulations.Also, there is no need to always set n and m equal.

Channel attention module
The channel attention module consists of three components: the global max pooling operation, the global average pooling operation, and the shared weight multi-layer perceptron (MLP) (Woo et al. 2018).Here a channel refers to a specific component of the input data tile (e.g. the urban land map or a certain spatial variable).The purpose of using the channel attention module is to capture the importance of each component in the input data tile X t and output its weighted form CX t .Figure 4 depicts the workflow of the channel attention module.The input data tile X t , which is also called the input feature map, is first transformed separately by the operations of global max pooling and global average pooling: where channel i refers to the ith channel of X t , and mp i and ap i are the respective results of the global max pooling and the global average pooling for channel i .By this means, the input data tile X t with (1 þ k) channels (including the urban land map and k spatial variables) is transformed to two vectors (V max and V avg ) of size 1 After the operations of global max pooling and global average pooling, the resulting vectors V max and V avg are used as the inputs to train a shared weight MLP to predict urban development at the year t þ 1.This is for obtaining the weights that represent the importance of each channel of X t .The shared weight MLP consists of three fully connected layers l 1 , l 2 , and l 3 with the corresponding sizes of 1 The parameters to optimize include the weight vector W and the bias vector b.The back-propagation method to update the two parameters is based on the gradients derived from the difference between the model outputs and the observed urban land maps.It should be noted that the back propagation only starts after the forward propagation stages are finished in all ConvLSTM cells' MLPs.More details of such a training approach can be found in Woo et al. (2018).After that the weights for V max and V avg are calculated as follows: where W 1 , W 2 , and W 3 are the optimized weights in l 1 , l 2 , and l 3 , and b 1 , b 2 , and b 3 are the optimized bias in l 1 , l 2 , and l 3 ; and r() is a sigmoid activation function.The resulting W max and W avg are then combined as a single weight vector W X through the following calculation: Finally, the input data tile X t is then weighted through the calculation of Hadamard product:

Contextual embedding module
This module (Chai et al. 2022) is used to obtain the contextual information that favors the prediction of urban land expansion using the weighted data tile CX t and h t -1 .
The structure of the contextual embedding module is shown in Figure 5(a).The major components of this module are the two convolutional layers that process the input CX t and h t -1 .By this means, the module can capture the interaction between CX t and h t -1 , which is useful to enhance the model's performance in sequence prediction (Melis et al. 2019).Specifically, the input h t -1 passes through the first convolutional layer Conv h and is further transformed by using a sigmoid activation function, yielding the output h t -1, conv : where W h, conv and b h, conv are the convolution kernel and bias, respectively.The resulting h t -1,conv is then combined with the input CX t through the calculation of Hadamard product, and the results are further used as the input of the second convolutional layer Conv X to generate h 0 t−1 : where W X, conv and b X, conv refer to the kernel and bias, respectively, for Conv X .The training processes to update W h, conv , W X, conv , b h, conv , and b X, conv are similar to that of the shared weight MLP.See Chai et al ( 2022) for more details.

Vanilla ConvLSTM cell
The vanilla ConvLSTM cell receives the outputs of the contextual embedding module, based on which the cell state c t and the hidden state h t are generated for the current cell (Figure 5(b)).The vanilla ConvLSTM cell consists of four gates that work sequentially.They include a forget gate, two input gates, and an output gate (Yu et al. 2019).All these gates are developed using convolutional layers.Specifically, the forget gate determines how much information obtained from the previous ConvLSTM cell should be discarded: where f t represents the proportion of information to discard, W f and b f are the convolutional kernel and the bias for the convolutional layer Conv f .With the information of the previous ConvLSTM cell being adjusted by f t , the cell state of the current cell c t -1 is then updated through: where i t and g t are the outputs of the two input gates; W i and b i are the convolutional kernels and bias for Conv i , and W g and b g are the convolutional kernels and bias for Conv g .Here the two input gates are used to generate the information that is used to replace the discarded one with the input CX 0 t and h 0 t−1 : Finally, an output gate is implemented to produce the hidden state of the current cell h t using the input and the resulting c t : where o t is the result of the output gate; W o and b o are the convolutional kernels and bias for Conv o .

Simulation of urban development
The direct outputs of the ConvLSTM model are the annual probability tiles for the years T nþ1 , … , T nþm .Before the simulation of urban development, for each year T, the resulting probability tiles are merged together to form the complete spatial layer of urban development probability.Then the value-based segmentation approach (Feng andTong 2020, Guan et al. 2023) is used to determine the pixels that are developed at the year T. Specifically, for the year T, the non-urban pixels prohibited to develop (e.g., ecologically sensitive areas) are removed.In our case, open water and land units with slope values greater than 25 degrees are considered as ecologically sensitive areas that are not allowed for development.The probability values of the remaining non-urban pixels are then adjusted by random perturbation: where c is a randomly generated value ranging from 0 to 1.The incorporation of the random perturbantion is to represent the unpredictable impacts of certain stochastic factors on urban development (Roodposhti et al. 2020).
After that the set of non-urban cells at the year T are ranked in descending order according to P i, adj , denoting as N T, P .With the number of new urban pixels S T being determined for the year T, the first S T pixels are selected from N T, P as the newly developed urban pixels.An identical simulation process is applied for the year T þ 1 and so forth for the remaining years.For the simulation of historical urban land expansion, the values of S T for each year are set according to the actual annual urban land maps.It should be noted that besides the value-based segmentation approach, others such as the patch-based simulation strategy (Chen 2022) are also feasible to use based on the resulting probability layers.Nevertheless, we only applied the value-based segmentation approach because testing different simulation methods is not the focus of this study.
The simulations can be validated based on the frequently used indicator of 'Figureof-Merit' (FoM) (Pontius Jr 2018), which measures the performance of the model in accurately predicting the position of change.Higher values of FoM suggest the better performance of the model under examined.The calculation of FoM is based on the pixel-by-pixel map comparison between the referenced change and the simulated change from T 0 to T 1 .
Furthermore, the agreements between the referenced change and the simulated change can also be compared at coarser scales to tolerate the model's prediction error at the pixel level.The comparison is made by aggregating the amounts of the referenced change and the simulated change to the units that are larger than individual pixels, such as moving windows.In this study, we used the moving windows of 1.5 km � 1.5 km, 3 km � 3 km, and 6 km � 6 km to produce the focal sum of newly developed urban land for the references and the simulations, respectively.Regression analysis was then run to reveal how much the simulated change can explain the referenced change.Such a moving-window analysis is a variant of the prevalently used map comparison methods (Hagen-Zanker 2006, Meentemeyer et al. 2013, Pijanowski et al. 2014) that complement the pixel-by-pixel validation methods with the information of agreement between the simulated and the referenced local composition of land changes.

Implementation and results
The proposed model is tested through the simulation of historical urban development in the GBA during 2012-2017.Furthermore, the proposed model is compared against three additional models, namely FLUS, LSTM-CA, and CNN-LSTM-CA.FLUS (Liu et al. 2017) features the use of artificial neural networks to estimate development probability, but lacking the ability to handle time series urban land maps.LSTM-CA and CNN-LSTM-CA are developed more recently and represent the current trend of using deep learning techniques to enhance urban CA models with time-series data.
Specifically, LSTM-CA is developed based on the algorithm of LSTM (Staudemeyer and Morris 2019) and focuses on detecting the temporal trend of urban development at the pixel level, considering the sequential states (i.e., the land use type) along the timeline and the effects of spatial variables.CNN-LSTM-CA is developed following the architecture proposed by Xing et al. (2020).Indeed, CNN-LSTM-CA is composed of two neural networks, i.e., a CNN and a LSTM network, or can be considered as a LSTM-CA model enhanced by adding a CNN.Here CNN is for estimating the effects of spatial variables at the neighborhood level with the neighborhood size equal to that of the kernels used for convolution.By this means, the spatial information learned by CNN is further combined with the temporal trend captured by LSTM to generate the development probability.Compared with CNN-LSTM-CA, the proposed ConvLSTM model is different in that the convolution operations are embedded directly in the ConvLSTM cells rather than implemented in a separating network as CNN-LSTM-CA does.In that sense, ConvLSTM can simultaneously learn spatial-temporal information from data in the same network.Given such differences, which architecture performs better in the simulation of urban land expansion can be identified through experiments (see Section 3.2).
Section 3.1 explains the training of the ConvLSTM model, while Section 3.2 discusses the differences of performance among different models.Finally, the transferability of the ConvLSTM model is assessed through the simulations of urban development in another two mega-urban regions, namely YRD and BTT, besides the GBA (Section 3.3).

Settings of model training
The ConvLSTM model was first trained to simulate the urban development during 2006-2011 by using the information of 2000-2005.A total of 6000 data tiles were collected, with 5000 tiles as the training set and the remaining 1000 tiles as the testing set.The initial learning rate was set as 5 � 10 −5 .The loss function of mean-squarederror and the AdamW optimizer were adopted.The batch size was set as eight, and the number of epochs was set as 50.The teacher forcing strategy was also used to facilitate the training process with the value of teacher forced samples decays per epoch being 1.5% (Yang and Liu 2020).The training of the ConvLSTM model was implemented using a single NVIDIA RTX 4090 GPU and PyTorch 1.10.

Model performance and comparison
We first trained several models to evaluate the influence of the kernel size.We tested the kernel sizes of 3 � 3, 5 � 5, 7 � 7, and 9 � 9, and the results were only marginally different (Supplementary Table S1).To balance the performance and computation costs, we chose the kernel size of 5 because the results were already satisfactory.
The proposed ConvLSTM model was compared against FLUS, LSTM-CA, and CNN-LSTM-CA.Due to the differences of model structure, the data used for calibration for these three models are different.For FLUS, only the 2011 data was used, while for LSTM-CA and CNN-LSTM-CA, the data for the years 2005, 2006, 2007, 2008, 2009, 2010, and 2011 were used.The FoM values for FLUS, LSTM-CA, CNN-LSTM-CA, and the ConvLSTM model are summarized in Table 2.It should be noted that the base year (T 0 ) for the calculation of FoM is 2011 for all simulations.The results clearly show that FLUS attains smaller FoM values than the other three models, suggesting that without the input of time-series urban land maps, the simulation models are less successful to capture the actual dynamics of urban land expansion.The LSTM-CA model is capable of learning temporal information from time-series urban land maps, and hence achieves greater FoM values than FLUS by approximately 1-3%.CNN-LSTM-CA further improves the FoM values by 1-2% as compared against those of LSTM-CA for the years 2014-2017.The proposed ConvLSTM model, however, achieves the highest FoM values throughout the entire timeline of simulations, which are even 5%-12% more than those of CNN-LSTM-CA.This could be owing to the more advanced architecture of ConvLSTM that can fuse the spatial-temporal information in the same network.
We also carried out additional experiments to test whether the choice of periods for model training and running would affect the performance.In those experiments, we trained the ConvLSTM model with the annual urban land maps for the years 1998-2009 and used it to predict the annual urban land expansion for the subsequent decade (i.e., 2010-2019).The FoM values of the 2010-2019 simulations are provided in Supplementary Table S2.The results suggest that ConvLSTM again significantly outperform FLUS, LSTM-CA, and CNN-LSTM-CA.Therefore, the choice of periods (n) for model training and the number of time steps (m) for prediction do not affect the better performance of the ConvLSTM model, as compared with the other models.
Figure 6 shows the results of map comparison between the referenced and the simulated urban land expansion from 2011 to 2017 for the four models.An evident difference can be observed between the results of ConvLSTM and the others.That is, the results of ConvLSTM demonstrate more mixture of hits (i.e., pixels referenced changed and simulated changed), misses (i.e., pixels referenced changed and simulated unchanged), and false alarms (i.e., pixels referenced unchanged and simulated changed), as highlighted by the cyan dashed circles in Figure 6.These results imply that although the ConvLSTM model cannot always predict the precise locations of change, the predictions are already close to where the referenced change occur.For the other three models, however, the misses and the false alarms are more likely to repel each other.In other words, if the four models are evaluated at coarser scales, better agreements would still be observed in the results of ConvLSTM rather than in those of the other three models.This is confirmed by Figure 7, which shows the comparison between the amounts of simulated and referenced new urban land during 2012-2017.They were obtained by using the focal sum operation with a moving window of 1.5 km � 1.5 km.At such a scale, FLUS can explain only 46% of the referenced new urban land, while LSTM-CA and

Note:
The 2011 urban land map is used as the base year data.For instance, the FoM for year 2016 is the FoM calculated based on the simulated change and the referenced change from 2011 to 2016.
CNN-LSTM-CA perform better than FLUS with the explanatory power of 59% and 55%, respectively.For these three models, the histograms clearly demonstrate large disparities between the distributions of the simulated and the referenced new urban land (Figure 7).By contrast, the ConvLSTM model not only has greater explanatory power (70%), but also generates a distribution closer to the referenced one, as compared with those yielded by FLUS, LSTM-CA, and CNN-LSTM-CA.Similar analysis was also conducted using the moving windows of 3 km � 3 km (Supplementary Figure S1) and 6 km � 6 km (Supplementary Figure S2).Despite the improvements of R 2 and distribution agreements for all models due to upscaling, the ConvLSTM model (R 2 ¼ 0.80 and 0.87, respectively) still outperforms FLUS (R 2 ¼ 0.61 and 0.71, respectively), LSTM-CA (R 2 ¼ 0.73 and 0.83, respectively), and CNN-LSTM-CA (R 2 ¼ 0.69 and 0.82, respectively).
For the proposed ConvLSTM model, the module of channel attention allows for understanding the importance of each spatial variable in the form of vectors.We exported and transformed the vectors to spatial data, and the results are shown in Supplementary Figure S4.The mean values of importance reflect that D2T (0.71), D2H (0.62), and D2RS (0.61) are the most influential spatial variables, suggesting that proximity to local urban centers and important infrastructure (e.g.highways and railway stations) are the major factors of urban land expansion in the GBA.Slope and D2NR (0.57 and 0.56, respectively) are ranked the 4th and 5th important spatial variables, while D2A and D2C (0.46 and 0.45, respectively) are the least important ones.

Geographic transferability
Although the proposed ConvLSTM model performs better than the urban CA models, the computation costs of deploying and training ConvLSTM are relatively high.Therefore, for saving the computation costs, it is more desirable if the model trained in one geographic region can be reused in another region.This largely depends on the transferability of the ConvLSTM model.To evaluate the transferability, we conducted several experiments in which the ConvLSTM model is trained using the data of a certain mega-urban region and applied in all three mega-urban regions.
Specifically, the ConvLSTM model is first trained separately using the data of GBA, YRD, and BTT.For convenience, the resulting models are referred to as the GBA model, the YRD model, and the BTT model.These three models are consistently trained using the data of 2000-2011, of which the 2000-2005 data are used as input to simulate the urban development in the period of 2006-2011.The number of tiles is 6000, with 5000 tiles for training and 1000 tiles for testing.After that the trained models are directly applied to the simulations in all three mega-urban regions.For example, the GBA model is applied to the simulations in both YRD and BTT besides the GBA, and so forth for the YRD model and the BTT model.All simulations are validated using the FoM indicator, of which the results form a 3 � 3 matrix for every year from 2012 to 2017 (Figure 8).
The results show that the models trained by local data perform slightly better than the ones transferred directly from the other two mega-urban regions.The spatial patterns generated by the three models are also similar (Figure 9).For the same megaurban region, the simulations yielded by the three models show high agreements, as revealed by Figure 10.For the GBA case, land units (i.e., pixels) where at least two models predict new development account for over 61% of the total land units of predicted change.This proportion is even higher in the cases of YRD and BTT, both of which are 67%.Notably, the proportions of pixels that the three models consistently predict new development are the highest in all three cases, which are 38.0%for GBA, 43.1% for YRD, and 45.9% for BTT.These results suggest that without any fine-tunings or modifications, the proposed ConvLSTM model trained in one mega-urban region still can work reasonably well in other mega-urban regions.This could be related to the similar dynamics of urbanization in the three selected mega-urban regions so that the information learned in one mega-urban region can be immediately used in the others.

Ablation experiments
Ablation experiments were conducted to demonstrate the effectiveness of the components used to form the proposed ConvLSTM model.The experiments used all three

Discussion
The results of the case study in the GBA reveal the significant improvements of simulation accuracy for models that are capable of learning spatial-temporal information    2), confirming the effectiveness of the proposed architecture.Moreover, the proposed ConvLSTM presents a distinct way for modeling urban land expansion.Conventional CA-based urban simulation models are developed based on the idea of self-organizing, and operate by iterating the constant local rules that specify how neighborhood interaction and spatial variables affect urban land change (Tobler 1979, Phipps 1989).An important issue in conventional CA-based models is how to align the iterations to the actual time.Most of the existing studies, if not all, use a compromised representation of temporal dimension with a fixed value of iteration count, which is defined according to some constraints such as land demand.Due to such a technical implementation, the effect of path-dependence arises and becomes one of the sources of simulation errors (Brown et al. 2005).
Additionally, in conventional CA-based urban simulation models, neighborhood interaction is often represented in the form of development density (i.e., counting the number of land units that are developed within a given neighborhood), and the neighboring land units are either equally weighted or weighted by the distance from the neighborhood center (or its variants) (Wang et al. 2021, Roodposhti et al. 2020, Liao et al. 2016, Zhang et al. 2023a).It is questionable to use the traditional one-fits-all density calculation to represent the neighborhood interaction in different spatial contexts.
For the proposed ConvLSTM model, however, the temporal dimension is explicit and matches the actual timeline.In this study, we had used ConvLSTM to process the annual urban land maps and make prediction annually.Building upon this feature, the spatial-temporal relationships between urban land data can be learnt and implicitly represented by ConvLSTM.In particular, the multiple convolution operations in ConvLSTM have two major advantages over traditional representation of neighborhood interaction: First, compared with the traditional uniform calculation of neighborhood development density, the multiple convolution operations are more advanced because it can inexplicitly capture the different aspects of neighborhood interaction in addition to development density.Second, the contributions of neighbors (i.e., the weights of the kernels; see Section 2.2.4) are directly learnt from timeseries urban land maps and hence more reliable than using the traditional equal weights or distance-adjusted weights that are manually defined a priori.Nevertheless, compared with the traditional density calculation, the multiple convolutions are less intuitive to understand how neighborhood interaction affects urban development.
Another notable feature of the ConvLSTM model is its promising transferability among the selected mega-urban regions, as shown by Figures 8-10.Nevertheless, the examination of transferability for the ConvLSTM model is still a preliminary analysis at the current stage.More cases of modeling should be considered in future study, particularly those small-and medium-size urbanized areas in different countries.Moreover, the proposed ConvLSTM model still cannot effectively addressed the issue of enclave simulations (see Supplementary Figure S3), which is important for modeling urban growth in fast developing regions.The integration of ConvLSTM and a patchbased simulation approach (Chen et al. 2019a, Yang et al. 2023) perhaps can alleviate this problem, which can be fulfilled in future studies.Finally, it is feasible to use the ConvLSTM model to provide finer-resolution (e.g. 30 m) projections of future urban development at the global scale, thereby facilitating the understanding of global change and its impacts.

Conclusions
This study proposes a novel ConvLSTM model for the simulation of urban land expansion based on the integration of channel attention, contextual embedding, and vanilla ConvLSTM cells.Compared with conventional CA-based models, the proposed model is more advanced in that it can better leverage the open access annual urban land maps to learn simultaneously the spatial structure and the temporal dependency of historical urban development, and further generate the annual predictions of future development for multiple years, i.e., a Maps-to-Maps manner.
The results of the case study in the GBA show that the ConvLSTM model outperforms several state-of-the-art urban CA models, including FLUS, LSTM-CA, and CNN-LSTM-CA.Compared with these models, ConvLSTM features the highest agreements at the pixel-level and the grid-levels with the resolutions ranging from 1.5 km to 6 km.Furthermore, the ConvLSTM model also shows satisfactory transferability.The experiments in the mega-urban regions of GBA, YRD, and BTT indicate that the ConvLSTM model trained in one mega-urban region can be conveniently applied in the other two without further tuning.Future studies will enhance the examination of the model's transferability by increasing the diversity of case study areas (e.g.urban areas with different sizes and geographic contexts), and extend the applications of the model to wider fields such as global change.

Figure 2 .
Figure 2. The structure of the ConvLSTM neural network for predicting urban land expansion.

Figure 3 .
Figure 3.The structure of the proposed ConvLSTM cell.

Figure 4 .
Figure 4.The workflow of the channel attention module.

Figure 5 .
Figure 5.The structures of (a) the contextual embedding module and (b) the vanilla ConvLSTM cell.

Figure 6 .
Figure 6.Spatial distributions of hits and errors of the simulated urban land expansion from 2011 to 2017.

Figure 7 .
Figure 7. Comparing the simulated and the referenced urban land expansion at the scale of 1.5 km (Sample size ¼ 10000).

Figure 8 .
Figure 8. FoM values of the ConvLSTM models in different experiments.

Figure 9 .
Figure 9. Representative examples of the urban land expansion simulations from 2011 to 2017 generated by the transferred ConvLSTM models.

Figure 10 .
Figure 10.Agreement among the simulations yielded by the GBA model, the YRD model, and the BTT model.

Table 1 .
Spatial variables used for simulating urban land expansion.

Table 2 .
FoM values for the models of FLUS, LSTM-CA, CNN-LSTM-CA and ConvLSTM in the simulations of urban land expansion from 2012 to 2017.

Table 3 .
FoM values of the ablation experiments using the GBA, YRD, and BTT data.series urban land maps, such as LSTM-CA, CNN-LSTM-CA, and ConvLSTM.Compared with CNN-LSTM-CA, which uses two separating networks of CNN and LSTM to learn separately the spatial and temporal information from data, the ConvLSTM model embeds the convolution operation to each of the vanilla ConvLSTM cells and hence can simultaneously obtain the spatial and temporal information from data in the same network.The results of model comparison clearly show that ConvLSTM significantly outperforms CNN-LSTM-CA by 5%-12% in terms of FoM (Table GBAfrom time-