Urban traffic flow prediction: a dynamic temporal graph network considering missing values

Abstract Accurate traffic flow prediction on the urban road network is an indispensable function of Intelligent Transportation Systems (ITS) and is of great significance for urban traffic planning. However, current traffic flow prediction methods still face many challenges, such as missing values and dynamic spatial relationships in traffic flow. In this study, a dynamic temporal graph neural network considering missing values (D-TGNM) is proposed for traffic flow prediction. First, inspired by Bidirectional Encoder Representations from Transformers (BERT), we extend the classic BERT model into a variant, called Traffic BERT, that learns the dynamic spatial associations on the road structure. Second, we propose a temporal graph neural network considering missing values (TGNM) to mine traffic flow patterns in missing-data scenarios for traffic flow prediction. Finally, the proposed D-TGNM model is obtained by integrating the dynamic spatial associations learned by Traffic BERT into the TGNM model. To train D-TGNM, we design a novel loss function that considers both the missing-values problem and the prediction problem in traffic flow. The proposed model was validated on an actual traffic dataset collected in Wuhan, China. Experimental results show that D-TGNM achieves good prediction results under four missing-data scenarios (15% random missing, 15% block missing, 30% random missing, and 30% block missing), and outperforms ten existing state-of-the-art baselines.


Introduction
With the continuous increase of urban vehicles, traffic congestion has gradually become a common problem in almost all modern metropolises and has seriously disrupted the normal travel of humans (Y. Wang et al. 2019, Lin et al. 2020, Shi et al. 2021). Traffic flow prediction technology, a fundamental objective of the intelligent transportation system (ITS), can dynamically guide traffic flow based on the traffic state predicted by a model, which is of great significance for urban traffic planning (Fang et al. 2021, F. Zhou et al. 2021a). In recent years, the rapid development of sensors has provided essential data sources for traffic flow prediction (J. Yu et al. 2020, P. Wang et al. 2022a). However, due to technical problems in data collection, missing data is common in real-world applications, which severely limits the use of traffic data (Chen et al. 2020, D. Xu et al. 2020, Furtlehner et al. 2022, P. Wang et al. 2022a).
In the early years, some scholars simply established prediction models without considering the missing values in traffic flow. Such classical prediction models mainly include the spatiotemporal k-nearest-neighbor model (ST-KNN) (S. Wu et al. 2014, B. Yu et al. 2016a), the spatiotemporal residual network (ST-ResNet) (J. Zhang et al. 2017), and the HIDLST network (Ren et al. 2020). In addition, considering that the traffic road network is a non-Euclidean data structure in nature, graph convolutional networks (GCNs) have been applied to traffic flow modeling and achieved state-of-the-art (SOTA) prediction performance (Kipf and Welling 2017); examples include the temporal graph convolutional network (T-GCN) (Zhao et al. 2020), the spatiotemporal graph convolutional network (ST-GCN) (B. Yu et al. 2018), and the residual graph convolutional long short-term memory network (RGC-LSTM) (Y. Zhang et al. 2020). Although the above models have achieved good prediction results, they still have shortcomings. Specifically, they cannot deal with missing values directly but instead convert data with missing values into complete data through preprocessing, using one of two methods: estimating the missing data before constructing the prediction model, or deleting the time series with missing data (Yang et al. 2021). These preprocessing methods not only increase the computational complexity of the prediction model, but may also leave insufficient training data to obtain a reliable model. In addition, traffic flow collected in the real world may contain multiple missing patterns (characteristics of the missing patterns are described in Supplementary Appendix A), which also seriously restrict the performance of prediction models (Chen et al. 2020, D. Xu et al. 2020, Furtlehner et al. 2022, P. Wang et al. 2022b).
In recent years, some scholars have tried to establish traffic prediction models considering missing values, which directly use raw data to predict the future traffic state without preprocessing the missing values. For example, Che et al. (2018) proposed the gated recurrent unit with a decay mechanism (GRU-D), and Tian et al. (2018) proposed the long short-term memory network with missing data (LSTM-M). However, GRU-D and LSTM-M are pure time series models that ignore the spatial patterns of traffic flow. In addition, matrix and tensor factorization models also provide a solution for traffic flow prediction based on missing data (H.-F. Yu et al. 2016b, Chen and Sun 2022). However, matrix and tensor factorization models are mainly designed for Euclidean structures, limiting their modeling ability on non-Euclidean datasets. In recent years, relevant scholars have also applied GCNs to traffic flow modeling with missing data and obtained good prediction results, such as the spectral graph Markov network (SGMN) (Cui et al. 2020) and the heterogeneous spatiotemporal graph convolution network (RIHGCN) (Zhong et al. 2021). However, the SGMN and RIHGCN models rely heavily on a predefined adjacency matrix. In the natural traffic environment, due to the influence of signal lights and traffic events, the spatial association between roads is not fixed but changes dynamically with time. Therefore, a predefined adjacency matrix cannot accurately describe the spatial association between roads (Z. Wu et al. 2021, F. Zhou et al. 2021a).
In general, current traffic flow forecasting methods still face two challenges. One is that missing data affects the performance of the prediction model; the other is that a fixed adjacency matrix cannot accurately describe the dynamic spatial associations on the traffic network. This study proposes a novel solution to address these two challenges. Specifically, a dynamic temporal graph network considering missing values (D-TGNM) is proposed for traffic flow prediction. The main contributions are summarized as follows:

1. We propose a novel temporal graph neural network considering missing values (TGNM) for traffic flow prediction under missing-data scenarios. In the TGNM model, we design a missing-data processing component that captures the missing patterns in traffic flow automatically. Supported by this component, the TGNM model does not need to impute the missing values in traffic flow in advance, but directly captures the complex spatiotemporal relationships in traffic flow for prediction.

2. Inspired by Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al. 2019, Bao et al. 2021), we develop a novel self-supervised learning method called Traffic BERT, which enables the TGNM model to capture dynamic spatial relationships in traffic flow.

3. A novel loss function is proposed to optimize the parameters of the proposed model. The loss function divides the optimization objective into a temporal-based imputation task, a spatial-based imputation task, and a prediction task. That is, it considers not only the missing-values problem but also the traffic flow prediction problem.

Related works
In this section, we systematically review the work related to this study. Existing traffic flow prediction methods can be roughly divided into two categories: spatiotemporal prediction based on complete data and spatiotemporal prediction based on incomplete data, which are discussed in Sections 2.1 and 2.2, respectively. Here, complete data refers to data without missing values, and incomplete data refers to data containing missing values.

Spatiotemporal prediction models based on complete data
As prediction models based on complete data struggle to deal with missing values, such modeling is often divided into multiple stages. More specifically, relevant scholars first used imputation methods, such as matrix factorization (Asif et al. 2016, H.-F. Yu et al. 2016b), tensor factorization (Chen et al. 2019, Chen and Sun 2022), and spatiotemporal interpolation (S. Cheng and Lu 2017, S. Cheng et al. 2020), to estimate the missing values in traffic flow, and then modeled the complete traffic flow to predict the future traffic state (Liu et al. 2018, Ge et al. 2019, Q. Li et al. 2020). For example, S. Cheng and Lu (2017) adopted a two-step spatiotemporal interpolation (ST-2SMR) to impute the missing values of traffic flow, and proposed the adaptive spatiotemporal k-nearest neighbor (adaptive-STKNN) (S. Cheng et al. 2018) and the dynamic spatiotemporal k-nearest neighbor (D-STKNN) (S. Cheng et al. 2021) to model the complete traffic flow. P. Wang et al. (2021) used linear interpolation (P. Cai et al. 2016) to impute the missing values in the dataset, and proposed an improved deep belief network (Improved-DBN) to predict the traffic state at 30, 60, and 120-min intervals. In addition, some scholars directly deleted the incomplete data series and modeled the remaining complete traffic flow data. For instance, J. Zhang et al. (2017) proposed the spatiotemporal residual network (ST-ResNet) to forecast the inflow and outflow of every city region. L. Cai et al. (2020) proposed a novel deep learning framework called Traffic Transformer to capture the continuity and periodicity of time series and to model spatial dependency. Although the above models have achieved good prediction performance, they still have shortcomings. More specifically, vehicles can only travel along road networks, which manifest non-Euclidean geometry and topology structures. The above models are mainly designed for Euclidean structured datasets, limiting their modeling ability for non-Euclidean datasets (Z. Wu et al. 2021, J. Zhou et al. 2021b).
Fortunately, the rapid development of graph convolutional neural networks (GCNs) provides a solution for non-Euclidean data modeling (Y. Zhang et al. 2020, M. Li et al. 2021, Yi et al. 2021). As mentioned above, the traffic road network is a non-Euclidean data structure in nature, so GCNs have naturally been applied to traffic flow modeling and achieved SOTA performance (Y. Zhang et al. 2020, K. Zhang et al. 2021a, S. Zhang et al. 2022). For example, Zhao et al. (2020) proposed a temporal graph convolutional network (T-GCN), which uses the GRU and GCN models to mine the temporal patterns and spatial patterns of traffic flow, respectively. B. Yu et al. (2018) integrated the one-dimensional CNN into the GCN and proposed a spatiotemporal graph convolutional network (ST-GCN) to mine the spatiotemporal patterns of traffic flow. K. Zhang et al. (2021a) integrated the temporal convolutional network (TCN) into the GCN and proposed a novel graph attention temporal convolutional network (GATCN) to predict the future traffic state. T. Zhou et al. (2022) proposed an attention-based hybrid spatiotemporal model for citywide traffic flow forecasting, accounting for the spatial and feature heterogeneity of traffic flows. Although these graph-based prediction models can handle non-Euclidean data, most of them can only make predictions based on complete data: they either estimate the missing data before constructing the prediction model or delete the time series with missing data. The former adds an extra computational burden, and the imputation accuracy directly affects the performance of the prediction model (S. Cheng et al. 2018, 2019, 2021), while the latter may leave insufficient training data and fail to obtain reliable traffic flow patterns (L. Cai et al. 2020, Y. Zhang et al. 2020, Yi et al. 2021).

Spatiotemporal prediction models based on incomplete data
Compared with prediction models based on complete data, prediction models based on incomplete data integrate missing patterns into the model and directly use raw data to predict the future traffic state. For example, Che et al. (2018) proposed the GRU-D model based on the gated recurrent unit (GRU), which effectively integrates masking and time intervals into the deep model architecture to capture the long-term time dependence in missing time series. Tian et al. (2018) integrated the multi-scale time information of traffic flow into the long short-term memory network (LSTM) and proposed LSTM-M to model missing traffic flow. Although the GRU-D and LSTM-M models can model missing traffic sequences, they still have deficiencies. Specifically, they are simple time series models that ignore the impact of spatial information on traffic flow prediction (Ermagun and Levinson 2018, Medrano and Aznarte 2021). In addition, matrix/tensor factorization models provide a natural solution for traffic flow prediction with missing data. For instance, H.-F. Yu et al. (2016b) proposed the temporal regularized matrix factorization model (TRMF) for traffic flow prediction with missing data, based on a rolling prediction scheme and the traditional matrix factorization model. To improve the nonlinear fitting ability of matrix factorization, Yang et al. (2021) integrated LSTM and the graph Laplacian (GL) into the solution of the matrix factorization and proposed the LSTM-GL-ReMF model for traffic flow prediction with missing data.
Considering the non-Euclidean structure of the traffic road network, relevant scholars have also applied GCNs to traffic flow prediction with missing values. For example, Cui et al. (2020) proposed a novel spectral graph Markov network (SGMN), which gradually infers the missing data and predicts the future traffic state by defining a Markov process on the graph. Zhong et al. (2021) proposed a heterogeneous spatiotemporal graph convolution network (RIHGCN) for traffic forecasting with missing values. However, compared with prediction models based on complete data, limited research has explored graph-based prediction models considering missing values. In addition, the SGMN and RIHGCN models are spectral GCNs, which rely heavily on a predefined static adjacency matrix. In other words, the spatial relationship between road segments is constant in the SGMN and RIHGCN models. However, in actual scenes, the spatial relationships between road segments often change dynamically over time, which makes it difficult for a static matrix to capture dynamic spatial relationships (Diao et al. 2019, M. Xu et al. 2021).
Therefore, to address the above two challenges (dynamic spatial associations and missing values), we propose a dynamic temporal graph network considering missing values (D-TGNM) for traffic flow prediction. The D-TGNM model does not rely on a predefined static adjacency matrix but captures the dynamic associations in traffic flow. In addition, the D-TGNM model can also predict traffic flow under missing-data scenarios.

Preliminaries and problem definitions
In this section, we first introduce the definitions of several key concepts, and then present the traffic flow prediction problem based on the definitions.
Definition 1 (Traffic Road Network). As shown in Figure 1(a), a traffic road network can be abstracted as a graph structure $G = \langle V, E, A \rangle$, where $V = \{v_i\}_{i=1}^{n}$ represents the $n$ nodes in the graph $G$ (such as $n$ intersections or detectors), and $E$ represents the set of edges in $G$, i.e., the relationships between nodes. For simplicity, the connection relationships between nodes can be represented by an association matrix $A \in \mathbb{R}^{n \times n}$. Note: the connection relationship between nodes is not the same as the topological relationship in the physical world, but an implicit relationship dynamically learned from the states of the nodes.
Definition 2 (Traffic State). As shown in Figure 1(b), the traffic state on the entire road network $G$ within a specific time window can be expressed as a spatiotemporal state matrix $X \in \mathbb{R}^{n \times g}$, where $x_i^t$ represents the traffic state of node $v_i$ in the time window $s_t$, and $x^t = \{x_i^t\}_{i=1}^{n} \in \mathbb{R}^{n \times 1}$ represents the spatial sequence of all nodes in the time window $s_t$.

Definition 3 (Zero-one Matrix). The zero-one matrix $M \in \mathbb{R}^{n \times g}$ is a matrix containing only the elements 0 and 1, which is used to distinguish missing values from observed values in the spatiotemporal state matrix $X$. If $M_i^t = 0$, the traffic state of node $v_i$ in the time window $s_t$ is missing, i.e., $x_i^t = \varnothing$. Similarly, $m^t = M^t \in \mathbb{R}^{n \times 1}$ distinguishes missing values from observed values in the spatial sequence $x^t$.

Definition 4 (Dynamic Temporal Graph Sequence). A dynamic temporal graph sequence $DTGS = \{G^t\}_{t=1}^{m}$ represents a graph sequence whose graph information changes over time. Specifically, both the traffic state and the relationships between nodes change over time, i.e., $G^t = \langle V, E^t, A^t, x^t, m^t \rangle$. As shown in Figure 1(c), $G^t$ is the graph information in the time window $s_t$, where $x^t$ and $m^t$ have the same meanings as in Definitions 2 and 3, and $E^t$ and $A^t$ represent the associations between nodes in the time window $s_t$, i.e., the relationships between nodes also change over time.

Our work aims to build a function $F(\cdot)$ that mines the spatiotemporal correlations of traffic flow from a dynamic temporal graph sequence with missing values to forecast future traffic flow accurately. Given a dynamic temporal graph sequence, the modeling process is shown in Formula (1):

$$\{\hat{x}^t\}_{t=g+1}^{g+\Delta t} = F\big(\{G^t\}_{t=1}^{g};\, \Theta\big) \qquad (1)$$
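As a concrete illustration of Definitions 2 and 3, the following sketch (our own, numpy-based; the variable names are not from the paper) builds a toy spatiotemporal state matrix $X$ and its zero-one matrix $M$ from raw readings in which missing entries are NaN:

```python
import numpy as np

# Hypothetical toy data: n = 3 nodes, g = 4 time windows.
# NaN marks a missing traffic-state reading.
X_raw = np.array([
    [12.0, 15.0, np.nan, 10.0],
    [ 8.0, np.nan, np.nan,  9.0],
    [20.0, 22.0, 21.0, np.nan],
])

# Zero-one matrix M (Definition 3): 1 = observed, 0 = missing.
M = (~np.isnan(X_raw)).astype(int)

# Zero out missing entries so that X * M keeps only observed values.
X = np.nan_to_num(X_raw, nan=0.0)

print(M.tolist())  # [[1, 1, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0]]
```

Each column of `X` is one spatial sequence $x^t$, and the matching column of `M` is its mask $m^t$; node 2's run of NaNs in windows 2-3 is the block-missing pattern discussed later, while the isolated NaNs are random missing.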
where $\{G^t\}_{t=1}^{g}$ represents the historical dynamic temporal graph sequence with missing values; $\{\hat{x}^t\}_{t=g+1}^{g+\Delta t}$ represents the predicted future traffic state sequence; $F(\cdot)$ represents the prediction model proposed in this study, i.e., the D-TGNM model; $\Delta t$ represents the prediction step ($\Delta t = 1$ denotes single-step prediction and $\Delta t > 1$ denotes multi-step prediction); and $\Theta$ indicates the learnable parameters of the model.

Methodology
In this section, we describe the proposed D-TGNM model for traffic flow prediction considering missing values. The structure of D-TGNM is presented in Figure 2, and its components are introduced in Sections 4.1-4.3. First, we construct a dynamic graph sequence to describe the traffic states of the studied road network. Then, to address the two challenges in the dynamic graph sequence (i.e., dynamic spatial associations and missing values), we propose a novel self-supervised learning method called Traffic BERT and a novel temporal graph network considering missing values (TGNM), respectively. Finally, the proposed D-TGNM model is obtained by integrating the dynamic spatial associations learned by Traffic BERT into the TGNM model. More specifically, the dynamic graph sequence is used as the input of Traffic BERT to obtain the dynamic spatial associations in traffic flow; then, the dynamic graph sequence and the dynamic spatial associations are used as the input of the D-TGNM model to obtain the final output. To train D-TGNM, we design a novel loss function, which considers both the missing-values problem and the prediction problem in traffic flow.

Construction of dynamic association matrix
Due to signal timing and real-time traffic states, the spatial associations between road segments are often dynamic rather than static, i.e., the spatial associations between roads change over time (Diao et al. 2019). For example, when a signal light is red, vehicles are prohibited from passing, even if there is a topological connection between the two roads in the physical world. However, existing graph convolutional networks mostly use a predefined static matrix to encode the spatial associations on the road network, ignoring the dynamics of spatial associations (Zhang et al. 2022). Compared with explicit static spatial associations, implicit dynamic spatial associations may more accurately describe the spatial relationships between roads. Therefore, we attempt to integrate the dynamic spatial associations into graph convolutional networks to further improve the accuracy of traffic flow prediction. Inspired by BERT, we extend the classic BERT model to learn the implicit dynamic spatial associations on the traffic road network (referred to as Traffic BERT).
The BERT model is a pre-training model for natural language processing, which learns the implicit semantic associations between words based on text data (Y. Zhang et al. 2021b; Y. Zhang et al. 2022). By analogy with the traffic environment, the Traffic BERT model learns the implicit semantic associations between spatial locations based on the spatial sequences (Definition 2) of traffic states, i.e., the implicit spatial associations. Compared with the traditional BERT model, the Traffic BERT model has two main differences. First, Traffic BERT processes continuous traffic flow data rather than discrete text data. Second, Traffic BERT focuses on the implicit spatial associations (i.e., the attention matrix) between traffic flows rather than on the output of the model, i.e., Traffic BERT explicitly uses the attention matrix to describe the similarity between different nodes (intersections or detectors). As shown in Figure 3, the Traffic BERT model is composed of two stages: model training and dynamic matrix generation. The training stage is used to optimize the parameters of the Traffic BERT model, and the dynamic matrix generation stage is used to obtain the dynamic spatial association matrix corresponding to a spatial sequence.
As shown in Figure 3, the spatial sequence containing missing values passes through state embedding, position embedding, and $L$ encoders in turn to produce the final output. In the Traffic BERT model, the encoder is the key to obtaining the implicit dynamic spatial associations on the traffic network. Taking the spatial sequence $x^t$ of the $t$-th time window as an example, the operation of a single encoder is shown in Formulas (2) and (3):

$$O^t = \big[U_1^t \,\|\, U_2^t \,\|\, \ldots \,\|\, U_p^t\big]\, W^O \qquad (2)$$

$$U_p^t = T_p^t V_p^t, \qquad T_p^t = \mathrm{softmax}\!\left(\frac{Q_p^t (K_p^t)^{\top}}{\sqrt{d_k}}\right) \qquad (3)$$

where $I^t \in \mathbb{R}^{n \times d_e}$ and $O^t \in \mathbb{R}^{n \times d_e}$ respectively represent the input and output of the encoder ($d_e$ is the matrix dimension required by the input and output of the encoder; in general, the output of one encoder serves as the input of the next); $m^t \in \mathbb{R}^{n \times 1}$ identifies the observed and missing values in $x^t$, and $x^t \odot m^t$ means that only the observed data are input into the Traffic BERT model; $\hat{x}^t \in \mathbb{R}^{n \times 1}$ represents the output of the Traffic BERT model, which is used to calculate the loss function; $U_1^t, U_2^t, \ldots, U_p^t \in \mathbb{R}^{n \times d_k}$ are the results of passing the input $I^t$ through the multi-head attention mechanism ($d_k$ is the matrix dimension required by the multi-head attention mechanism); and in Formula (3), $T_p^t \in \mathbb{R}^{n \times n}$ represents the attention matrix of the $p$-th head, which is used to obtain the dynamic spatial association matrix of the spatial sequence.

The Traffic BERT model is optimized with the masked reconstruction loss in Formula (4):

$$L_{\mathrm{BERT}} = \big\| \big(x^t - \hat{x}^t\big) \odot \big(1 - m^t\big) \big\|^2 \qquad (4)$$

where $W$ represents the learnable parameters of Traffic BERT; $\odot$ represents the Hadamard product; and $(1 - m^t)$ means that only the output values at the missing positions are used to optimize the parameters of Traffic BERT. As shown in Figure 4, from a spatial perspective, the essence of the Traffic BERT model is to use potential spatial associations to estimate the missing values in the spatial sequence. Therefore, optimizing the Traffic BERT model amounts to learning the potential spatial associations in the spatial sequence.
After obtaining the optimized Traffic BERT model, we obtain the final dynamic spatial association matrix by fusing the multi-head attention matrices in the encoder. Although the Traffic BERT model contains $L$ encoder layers, we fuse only the attention matrices of the last encoder layer to facilitate calculation. Taking the spatial sequence $x^t$ of the $t$-th time window as an example, the fusion process is shown in Formula (5):

$$A^t = \frac{1}{p} \sum_{i=1}^{p} T_i^t \qquad (5)$$
where $A^t \in \mathbb{R}^{n \times n}$ represents the implicit spatial associations in the spatial sequence $x^t \in \mathbb{R}^{n \times 1}$ and is mainly used for subsequent traffic flow prediction; $T_i^t \in \mathbb{R}^{n \times n}$ represents the similarity matrix of the $i$-th head, calculated in the same way as in Formula (3).
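A minimal numpy sketch of how an encoder of this kind can yield a dynamic association matrix, assuming standard scaled dot-product attention for Formula (3) and an average of the head matrices for Formula (5); the embedding, weights, and dimensions are illustrative stand-ins, not the paper's trained values:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d_e, p = 5, 8, 2              # nodes, embedding size, attention heads
d_k = d_e // p

x_t = rng.random((n, 1))         # spatial sequence of one time window
m_t = np.ones((n, 1))
m_t[2] = 0.0                     # node 2 is missing

# State embedding of the masked input (x ⊙ m); a stand-in for the
# paper's state + position embedding.
W_e = rng.random((1, d_e))
I_t = (x_t * m_t) @ W_e          # encoder input, shape (n, d_e)

heads = []
for _ in range(p):
    W_q, W_k = rng.random((d_e, d_k)), rng.random((d_e, d_k))
    Q, K = I_t @ W_q, I_t @ W_k
    T = softmax(Q @ K.T / np.sqrt(d_k))   # per-head attention matrix, (n, n)
    heads.append(T)

# Head fusion (Formula 5 style): average the p attention matrices
# of the last encoder layer into one dynamic association matrix A^t.
A_t = np.mean(heads, axis=0)
print(A_t.shape)                 # (5, 5)
```

Because each head's attention matrix is row-stochastic, the averaged $A^t$ is as well, so each row can be read as a distribution of attention over the other nodes.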

Construction of the TGNM
By replacing the adjacency matrix in traditional GCNs with the dynamic association matrix obtained by Traffic BERT, traditional GCNs can capture the dynamic spatial associations in traffic flow. Compared with a static adjacency matrix, the time-varying dynamic association matrix is more suitable for describing the spatial associations in the graph. However, GCNs integrating Traffic BERT still have two shortcomings. First, traditional GCNs are mainly designed for spatial data, and it is difficult for them to mine the temporal patterns in a dynamic temporal graph sequence (Rossi et al. 2020). Second, traditional GCNs cannot effectively deal with missing values in traffic flow data (Cui et al. 2020). To address these shortcomings, we propose a novel temporal graph network considering missing values (TGNM) for traffic flow modeling.
The TGNM model defines graph convolution operations under time constraints from the perspective of neighborhood feature aggregation. As shown in Figure 5, inspired by recurrent neural networks, we define two memory states for each node in the graph. Based on these memory states, the basic idea of the TGNM model can be described in three steps. First, the temporal memory state $H^{t:tm}$ of the $t$-th time window is obtained from $H^{t-1:tm}$ and $G^t$. Then, the spatial memory state $H^{t:sm}$ of the $t$-th time window is obtained from $H^{t-1:sm}$ and $G^t$. Finally, $H^{t:tm}$ and $H^{t:sm}$ are integrated to obtain the final prediction results. The advantage of the TGNM model is that it captures the temporal and spatial dependence of graph nodes at the same time. By analogy with the traffic environment, the traffic state of the current road is affected not only by its own historical states, but also by the historical states of its adjacent roads.

Forward propagation of the TGNM
Figure 6 describes the forward propagation process of the TGNM model in detail. The TGNM model iteratively updates the spatial memory state and the temporal memory state of the previous moment to obtain the final memory state. As the missing pattern is automatically captured before the traffic flow enters the neuron interior (discussed later), the neuron interior operates directly on complete data. In addition, since there may be long-term dependence in temporal graph sequences, the gating mechanism of the GRU (Chung et al. 2014) is integrated into the neuron interior. Taking the $t$-th time window as an example, the forward propagation process of the neuron interior is shown in Formula (6).
where $\hat{x}_i^{t+1} \in \mathbb{R}^{1 \times 1}$ represents the predicted value of node $v_i$ in the $(t+1)$-th time window; $x_i^t \in \mathbb{R}^{1 \times 1}$ represents the observed (or pre-processed) value of node $v_i$ in the $t$-th time window; $x_j^t \in \mathbb{R}^{1 \times 1}$ represents the observed (or pre-processed) value of node $v_j$ in the $t$-th time window; $h_i^{t-1:tm} \in \mathbb{R}^{1 \times d_h}$ represents the temporal memory state of node $v_i$ in the $(t-1)$-th time window; $h_i^{t-1:sm} \in \mathbb{R}^{1 \times d_h}$ represents the spatial memory state of node $v_i$ in the $(t-1)$-th time window; $[\cdot \,\|\, \cdot]$ represents the vector concatenation function; $\sigma(\cdot)$ represents the sigmoid activation function; $\Gamma(v_i)^t$ represents the graph convolution operation of node $v_i$ in the $t$-th time window; $A^t$ represents the spatial association matrix of the graph structure (when $A^t$ is dynamically generated by Traffic BERT, the TGNM model becomes the D-TGNM model); $N_i^t$ represents the set of spatial neighbors of node $v_i$ in the $t$-th time window; and $W_{out} \in \mathbb{R}^{2d_h \times 1}$ indicates the learnable parameters of the TGNM model.
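Omitting the GRU gating for brevity, the following numpy sketch shows one TGNM-style step; the weight shapes follow the dimensions above, but the exact update equations are our simplification of Formula (6), not the paper's full cell:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, d_h = 4, 6                      # nodes, memory dimension

x_t = rng.random((n, 1))           # current (complete or imputed) node states
H_tm = rng.random((n, d_h))        # previous temporal memory state
H_sm = rng.random((n, d_h))        # previous spatial memory state

A_t = rng.random((n, n))
A_t /= A_t.sum(axis=1, keepdims=True)   # row-normalized association matrix

W_t = rng.random((1 + d_h, d_h))   # temporal-update weights
W_s = rng.random((1 + d_h, d_h))   # spatial-update weights
W_out = rng.random((2 * d_h, 1))   # output weights (2*d_h -> 1)

# Temporal memory: each node is updated from its own state and history.
H_tm_new = sigmoid(np.hstack([x_t, H_tm]) @ W_t)

# Spatial memory: aggregate neighbour states weighted by A^t
# (the graph convolution step), then update the memory.
H_sm_new = sigmoid(np.hstack([A_t @ x_t, H_sm]) @ W_s)

# Prediction for t+1: fuse the two memories through W_out.
x_next = np.hstack([H_tm_new, H_sm_new]) @ W_out
print(x_next.shape)                # (4, 1)
```

The concatenation $[H^{t:tm} \,\|\, H^{t:sm}]$ multiplied by $W_{out} \in \mathbb{R}^{2d_h \times 1}$ mirrors how the two memory states are integrated into a single prediction per node.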

Handling missing values in TGNM
As mentioned above, the neuron interior of the TGNM is mainly used to capture the spatiotemporal correlations in the complete (or pre-processed) traffic flow, and does not process the missing values themselves. To handle missing values in traffic flow, we design a component that mines the missing patterns in traffic flow automatically and iteratively imputes the missing values. As shown in Figure 7, the motivation for the missing-pattern mining component comes from two observations. First, in single-step prediction of traffic flow, the contribution of temporal correlation to the prediction results is more significant than that of spatial correlation. Second, as the number of prediction steps increases (i.e., multi-step prediction), the contribution of temporal correlation to the prediction results decreases, while the contribution of spatial correlation increases.
Single-step prediction and multi-step prediction correspond to random missing and block missing, respectively. Specifically, a random missing value in the traffic flow can be directly imputed from the temporal memory state $H^{t-1:tm}$ of the previous time, while a block missing value should further introduce the spatial memory state $H^{t-1:sm}$ to improve the imputation result. To enable the TGNM model to identify the missing patterns in traffic flow, we introduce an auxiliary quantity $c^t = \{c_i^t\}_{i=1}^{n}$, whose calculation is shown in Formula (7):

$$c_i^t = \begin{cases} 1, & m_i^{t-1} = 1 \\ c_i^{t-1} + 1, & m_i^{t-1} = 0 \end{cases} \qquad (7)$$
where $c_i^t$ represents the number of time steps between node $v_i$ in the $t$-th time window and its nearest observed value; when $c_i^t = 1$ and $m_i^t = 0$, the missing pattern of node $v_i$ in the $t$-th time window is likely to be random missing; when $c_i^t > 1$ and $m_i^t = 0$, the missing pattern is likely to be block missing.
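As a small illustration of the auxiliary quantity (our own example, not from the paper), the following sketch computes $c^t$ for a single node from its zero-one mask:

```python
import numpy as np

# Zero-one mask over g = 6 time windows for one node:
# 1 = observed, 0 = missing (hypothetical example).
m = np.array([1, 0, 1, 0, 0, 1])

# c[t]: number of time steps since the nearest observed value
# (c[t] = 1 right after an observation; it grows inside a gap).
c = np.zeros(len(m), dtype=int)
for t in range(1, len(m)):
    c[t] = 1 if m[t - 1] == 1 else c[t - 1] + 1

print(c)  # [0 1 2 1 2 3]
```

Here the missing window at $t = 1$ has $c = 1$ (an isolated, random-style gap), while the window at $t = 4$ has $c = 2$ because it sits inside a longer block of missing values.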
Based on $c^t$, we further define the process for handling missing values. The greater $c^t$ is, the greater the probability of using the spatial memory state to estimate the missing values; the smaller $c^t$ is, the greater the probability of using the temporal memory state. Based on this principle, the calculation for handling missing values is shown in Formula (8):

$$\hat{x}^{t:tm} = H^{t-1:tm} W_{out}^{tm}, \quad \hat{x}^{t:sm} = H^{t-1:sm} W_{out}^{sm}, \quad s^t = \exp\!\big(-\max(0,\, c^t W_s)\big), \quad \hat{x}^t = s^t \odot \hat{x}^{t:tm} + \big(1 - s^t\big) \odot \hat{x}^{t:sm} \qquad (8)$$
where $\hat{x}^{t:tm} \in \mathbb{R}^{n \times 1}$ represents the prediction values of all nodes obtained from the temporal memory state; $\hat{x}^{t:sm} \in \mathbb{R}^{n \times 1}$ represents the prediction values of all nodes obtained from the spatial memory state; $\exp(\cdot)$ denotes the exponential function; $\max$ denotes the maximum-value function; and $s^t \in \mathbb{R}^{n \times 1}$ represents the missing-pattern probability calculated from $c^t$. When $s^t$ approaches 1, the missing pattern approaches random missing, and when $s^t$ approaches 0, the missing pattern approaches block missing. $W_{out}^{tm} \in \mathbb{R}^{d_h \times 1}$, $W_{out}^{sm} \in \mathbb{R}^{d_h \times 1}$, and $W_s \in \mathbb{R}^{1 \times 1}$ respectively represent the learnable parameters used in missing-value processing.
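A hypothetical sketch of the blending step, assuming the GRU-D-style exponential decay that the symbols $\exp$, $\max$, and $W_s$ suggest; the parameter value and predictions are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
c_t = np.array([[1.0], [1.0], [3.0], [5.0]])  # steps since last observation
W_s = 0.5                                      # learnable decay weight (scalar)

# Missing-pattern probability: s -> 1 for small c (random missing),
# s -> 0 for large c (block missing).
s_t = np.exp(-np.maximum(0.0, W_s * c_t))

x_tm = rng.random((n, 1))   # predictions from the temporal memory state
x_sm = rng.random((n, 1))   # predictions from the spatial memory state

# Blend: trust the temporal branch for random missing,
# the spatial branch for block missing.
x_hat = s_t * x_tm + (1.0 - s_t) * x_sm
```

With this form, a node that was observed one step ago leans mostly on its temporal branch, while a node deep inside a missing block leans mostly on its spatially aggregated neighbours.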

Optimization of TGNM
In the process of model optimization, a time-dependent step $l_d$ is set to reduce the computational complexity. That is, the future traffic flow $\hat{x}^{t+1}$ is predicted from the traffic flow $\{x^j\}_{j=t-l_d+1}^{t}$ of the previous $l_d$ time windows. Theoretically, the prediction model can be obtained by minimizing the squared loss between the actual traffic state $x^{t+1}$ and the predicted traffic state $\hat{x}^{t+1}$. However, optimizing only this squared loss would ignore the impact of missing values on prediction performance. Therefore, we design a loss function that takes into account the impact of missing values on the prediction results.
As shown in Figure 8, the loss function proposed in this study is composed of three parts: the temporal-based imputation task loss, the spatial-based imputation task loss, and the prediction task loss. The prediction task loss is used to ensure the accuracy of the prediction results, while the imputation task losses account for the influence of missing values on the prediction results. The proposed loss function is shown in Formula (9):

$$L = \sum_{i=1}^{l_d} \left[ \big(x^{t+1-i} - \hat{x}^{t+1-i:tm}\big)^2 + \big(x^{t+1-i} - \hat{x}^{t+1-i:sm}\big)^2 \right] + 2 l_d \big(x^{t+1} - \hat{x}^{t+1}\big)^2 \qquad (9)$$
where $(x_{t+1-i} - \hat{x}_{t+1-i:tm})^2$ represents the temporal-based imputation task loss; $(x_{t+1-i} - \hat{x}_{t+1-i:sm})^2$ represents the spatial-based imputation task loss; and $(x_{t+1} - \hat{x}_{t+1})^2$ represents the prediction task loss. To guarantee equal weighting of the three loss parts, we enlarge the prediction task loss by $2 l_d$ times. Note that it is more common to add a set of adjustable hyperparameters as weighting factors for each loss; however, additional hyperparameters increase the difficulty of model calibration, so to simplify the calibration process we set only one hyperparameter.
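The three-part loss can be sketched as follows. The array names and the sum-of-squares reduction are assumptions for illustration; only the three terms and the $2 l_d$ scaling come from the text above.

```python
import numpy as np

def tgnm_loss(x_hist, x_tm, x_sm, x_next, x_next_hat, l_d):
    """Three-part loss: temporal imputation + spatial imputation + prediction.

    x_hist, x_tm, x_sm: (l_d, n, 1) ground truth and the two imputation
    streams over the previous l_d windows; x_next, x_next_hat: (n, 1).
    """
    loss_tm = np.sum((x_hist - x_tm) ** 2)          # temporal-based imputation
    loss_sm = np.sum((x_hist - x_sm) ** 2)          # spatial-based imputation
    loss_pred = np.sum((x_next - x_next_hat) ** 2)  # prediction task
    # the prediction term is enlarged by 2 * l_d to balance the three parts
    return loss_tm + loss_sm + 2 * l_d * loss_pred
```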

Relationship between TGNM and D-TGNM
In this subsection, we further describe the relationship between TGNM and D-TGNM.
As shown in Figure 9, the D-TGNM model can be regarded as a special case of the TGNM model. Specifically, the adjacency relationships between nodes in the D-TGNM model are dynamically generated by Traffic BERT, while the adjacency relationships between nodes in the TGNM model can be given by a static adjacency matrix, such as a spatial topological adjacency matrix or a spatial distance matrix. In other words, the Traffic BERT model enables the D-TGNM model to capture the dynamic spatial associations in traffic flow. In addition, although the TGNM model is a part of the D-TGNM model, the TGNM model is essentially an independent component for traffic flow prediction considering missing values. The advantage of designing TGNM as an independent component is that, when the TGNM model does not rely on Traffic BERT, it can still run with a static spatial association matrix, such as a fixed spatial distance matrix or spatial topological adjacency matrix.

Training and algorithm
17: obtain results $\{\hat{x}_j\}_{j=t-l_d}^{t+1}$ by Formulas (6), (7), and (8)
18: find W by minimizing Formula (9)
19: until $M_{DTGNM}$ converges
20: output the learned models $M_{TrafficBERT}$ and $M_{DTGNM}$

Experimental results and discussions

Data preparation

Data sources
Traffic flow data collected in Wuhan, China, was used to evaluate the prediction performance of the D-TGNM model. The traffic flow data was obtained by Automatic Vehicle Identification (AVI) technology, which automatically identifies the spatial coordinates of vehicles through photos taken by cameras. We counted the traffic flow of a single camera at intervals of 5 min, i.e., the time window size is 5 min. The period of the traffic flow data was from 1 March 2021 to 28 March 2021. Figure 10 shows the spatial distribution of the experimental cameras. In this study, a total of 71 experimental cameras were selected, i.e., we built a graph with 71 nodes (each camera represents a graph node, and the association between any two cameras is determined by Traffic BERT). Table 1 shows the traffic information counted by each camera. Each record contains the unique identification ID of the camera, the time window, the latitude and longitude of the camera, and the traffic flow in the time window.

Data preprocessing
To support this work, we further preprocessed the traffic volume data as follows: for the two missing types (random missing and block missing), part of the traffic states were deleted at 15% and 30% missing rates based on artificial experience. Figure 11 shows the traffic flow information for three days after manual processing, and the details of the missing traffic flow are described in Supplementary Appendix A.3. The manually processed data were divided into training samples and test samples: following the 80-20 criterion, the training samples account for 80% and the test samples for 20%.
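The two artificial missing scenarios can be sketched as mask generators. This is an illustrative reconstruction, not the paper's procedure: the block length and the placement of missing runs are assumptions, and a mask value of 0 marks a deleted entry.

```python
import numpy as np

def random_missing_mask(n_nodes, n_steps, rate, rng):
    # each (node, time window) entry is dropped independently
    # with probability `rate`
    return (rng.random((n_nodes, n_steps)) >= rate).astype(float)

def block_missing_mask(n_nodes, n_steps, rate, block_len, rng):
    # drop contiguous runs of `block_len` time windows on random nodes
    # until roughly `rate` of all entries are missing
    mask = np.ones((n_nodes, n_steps))
    target = rate * mask.size
    while mask.size - mask.sum() < target:
        i = rng.integers(n_nodes)
        t0 = rng.integers(n_steps - block_len + 1)
        mask[i, t0:t0 + block_len] = 0.0
    return mask
```

The block generator overshoots the target rate by at most one block, so the realized missing rate stays close to the nominal 15% or 30%.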

Evaluation metrics
In traffic flow forecasting, a key issue is how to evaluate the performance of the forecasting model. In this study, the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Mean Absolute Percentage Error (MAPE) were used as quantitative indicators to verify the prediction accuracy of the proposed model (S. Cheng et al. 2021). The calculation methods of MAE, RMSE, and MAPE are shown in Formulas (10), (11), and (12).

$$\mathrm{MAE} = \frac{1}{n\,\Delta t}\sum_{t=1}^{\Delta t}\sum_{i=1}^{n}\left|x_t^i - \hat{x}_t^i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n\,\Delta t}\sum_{t=1}^{\Delta t}\sum_{i=1}^{n}\left(x_t^i - \hat{x}_t^i\right)^2}$$

$$\mathrm{MAPE} = \frac{1}{n\,\Delta t}\sum_{t=1}^{\Delta t}\sum_{i=1}^{n}\left|\frac{x_t^i - \hat{x}_t^i}{x_t^i}\right| \times 100\%$$
where $x_t^i$ represents the ground truth of node $v_i$ in the $t$-th time window in the future; $\hat{x}_t^i$ represents the predicted traffic state of node $v_i$ in the $t$-th time window in the future; $n$ represents the total number of nodes in the graph; and $\Delta t$ represents the prediction step.
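For reference, the three metrics can be implemented directly over flattened prediction arrays. The small epsilon guarding against division by zero in MAPE is an implementation detail assumed here, not stated in the paper.

```python
import numpy as np

def mae(y, y_hat):
    # mean absolute error over all nodes and prediction steps
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    # root mean square error
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mape(y, y_hat, eps=1e-8):
    # mean absolute percentage error, expressed as a percentage;
    # eps avoids division by zero for zero-flow windows (an assumption)
    return np.mean(np.abs((y - y_hat) / (y + eps))) * 100.0
```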

Estimation of hyper-parameters
In this study, the historical traffic flow was processed on a PC (CPU: Intel(R) Xeon(R) E-2224G @ 3.50 GHz, memory: 16.0 GB). Moreover, we built our model with PyTorch and Python 3.7 on a Graphics Processing Unit (GPU) platform with 24 GB of GPU memory.
The hyper-parameters of the D-TGNM model mainly include the number of encoders $L$ and the number of attention heads $p$ in the Traffic BERT component, and the time-dependent step $l_d$ in the TGNM component. As Traffic BERT and TGNM are two independent components, we calibrate the hyper-parameters of only one component at a time. In the process of model training, the control variable method is used to obtain the optimal combination of hyperparameters, where the value range of $L$ is [1, 8], the value range of $p$ is [1, 7], and the value range of $l_d$ is [1, 10]. Figure 12 shows the process of parameter calibration in the missing scenario. The results show that as $L$ and $p$ increase, the MAE of Traffic BERT first decreases and then stabilizes; when $L = 6$ and $p = 4$, the Traffic BERT component achieves better accuracy. Similarly, as $l_d$ increases, the MAE of TGNM also first decreases and then stabilizes; when $l_d = 7$, the TGNM component obtains a reasonable accuracy.
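The control variable (one-factor-at-a-time) calibration described above can be sketched generically. Here `evaluate` stands in for a full train-and-validate run returning the validation MAE, and the default starting values are assumptions.

```python
def calibrate(evaluate, ranges, defaults):
    """Tune one hyper-parameter at a time while the others stay fixed."""
    best = dict(defaults)
    for name, values in ranges.items():
        # score each candidate value of this hyper-parameter
        scores = {v: evaluate({**best, name: v}) for v in values}
        best[name] = min(scores, key=scores.get)  # keep the lowest-MAE value
    return best
```

With the ranges from the paper ($L$ in [1, 8], $p$ in [1, 7], $l_d$ in [1, 10]) this performs 8 + 7 + 10 evaluations instead of the 560 a full grid search would require, at the cost of ignoring interactions between hyper-parameters.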

Comparison with baselines
Classical statistical methods usually perform worse than data-driven methods on traffic forecasting tasks, due to their inability to handle complex spatiotemporal information (Fang et al. 2021, Yi et al. 2021). We therefore directly compare D-TGNM with popular data-driven methods. In this study, the D-TGNM model was compared with ten baseline methods, which can be roughly divided into four categories. The first category is the ST-KNN model (S. Wu et al. 2014, B. Yu et al. 2016a), a shallow machine learning method that does not consider missing values. The second category includes the TCN (Bai et al. 2018), T-GCN (Zhao et al. 2020), and ST-GCN (B. Yu et al. 2018) models, which are deep learning models that do not consider missing values. The third category includes the TRMF (H.-F. Yu et al. 2016b), BTMF (Chen and Sun 2022), and BTTF (Chen and Sun 2022) methods, which are shallow machine learning models that consider missing values. The fourth category includes the GRU-D (Che et al. 2018), LSTM-M (Tian et al. 2018), and SGMN (Cui et al. 2020) methods, which are deep learning models that consider missing values.

Comparison results on complete data
In this section, we compare the prediction results of the D-TGNM model and the baseline methods on the complete data. The complete data represents the dataset after the naturally missing values have been imputed, and the comparison results are shown in Table 2. On the complete data, the difference in prediction accuracy among the four categories of models is relatively small, and the prediction accuracy of the deep learning models is slightly higher than that of the machine learning models. In addition, among the deep learning models, the prediction performance of graph-based spatiotemporal models such as T-GCN, ST-GCN, and SGMN is lower than that of temporal prediction models such as TCN, GRU-D, and LSTM-M. There are two main reasons for these counterintuitive results. First, the T-GCN, ST-GCN, and SGMN models are graph convolution networks based on a static graph structure. When the static graph structure struggles to describe the spatial relationships in traffic flow, the graph convolution may introduce additional errors, making the prediction performance lower than that of a simple temporal network model. Second, the T-GCN, ST-GCN, and SGMN models integrate graph convolution in series, making it more challenging to mine the nonlinear relationships in traffic flow data. Compared with the baseline methods, the D-TGNM model not only considers the dynamic spatial relationships in traffic flow data, but also integrates graph convolution in parallel. Therefore, the D-TGNM model achieves the highest prediction performance on the complete data.

Comparison results on incomplete data
In this section, we further compare the prediction results of the D-TGNM model and the baseline methods on the incomplete data. The incomplete data represents the dataset after values have been artificially removed from the complete data. Due to space limitations, we only show the comparison results of one-step prediction and three-step prediction. Tables 3 and 4 show the comparison of prediction performance under the missing scenarios. When data are missing, prediction models that do not consider missing values, such as TCN, ST-KNN, T-GCN, and ST-GCN, perform poorly, while TRMF, BTMF, BTTF, GRU-D, LSTM-M, and SGMN, which consider missing values, perform better. In addition, across the different missing data scenarios, the prediction performance of the BTTF, GRU-D, and LSTM-M models is relatively stable, while the prediction performance of the TRMF, BTMF, and SGMN models is greatly affected by the missing data scenario. Compared with the TRMF, BTMF, and SGMN models, the D-TGNM model also has stable prediction performance and the highest prediction accuracy. Overall, the D-TGNM model has obvious advantages over the baseline methods.

Qualitative analysis of prediction results
In addition to the quantitative analysis, in this section scatter plots are used to qualitatively describe the prediction performance of the D-TGNM model. Figure 13 visually shows the difference between the predicted and actual values under the four missing data scenarios. Combined with the quantitative analysis in Tables 3 and 4, it can be found that although the prediction performance differs across the four missing data scenarios, the actual values are always close to the predicted values, i.e., the D-TGNM model can accurately predict changes in traffic flow over time. The results further prove that the D-TGNM model retains good predictive performance under missing data scenarios.

Impact of dynamic association matrix
In the D-TGNM model, we use a dynamic matrix instead of a static matrix to describe the implicit spatial associations in traffic flow. Therefore, we analyze the impact of the dynamic association matrix on the prediction performance. Table 5 shows the impact of the dynamic spatial matrix on prediction performance, where TGNM represents the model using the static spatial matrix, (R) represents the random missing data, and (B) represents the block missing data. The results show that introducing the dynamic spatial matrix improves the prediction ability of the model. Figure 14 further shows the correlation coefficient of the dynamic spatial associations estimated from the missing and non-missing data. The results show that the spatial associations estimated at different times change dynamically over time, proving the rationality of modeling dynamic spatial associations in the proposed method. In addition, the results show that the ability of Traffic BERT to identify spatial associations declines under block missing, which also explains the lower prediction performance of the model under block missing.

Impact of loss function
In the optimization process, the loss function consists of the imputation task losses and the prediction task loss. Therefore, in this section, we analyze the impact of the loss function on the prediction performance of the model. Table 6 shows the impact of the loss function on the prediction results. The results show that the prediction performance of the D-TGNM-P model is lower than that of the D-TGNM-I model, and the prediction performance of the D-TGNM-I model is lower than that of D-TGNM. That is, the D-TGNM model integrating multiple loss tasks has the optimal prediction ability, which further proves the necessity of introducing multiple loss tasks. In addition, the results show that if the impact of missing values on the prediction results is not considered, i.e., only the prediction task loss is optimized, the worst prediction results are obtained.

Analysis of incremental learning
In real-world applications, the spatial structure of sensors on urban road networks is not static. In general, the number of sensors on urban road networks tends to increase gradually over time. When the number of sensors increases, GCNs based on a predefined matrix often need to be re-trained, which greatly wastes computing resources. Compared with GCNs based on a predefined matrix, the D-TGNM model does not need to be re-trained, but can be fine-tuned to regain the optimal prediction ability. That is, the D-TGNM model has the ability of incremental learning.

First, the actual traffic dataset collected in Wuhan, China, was used to verify the prediction performance of D-TGNM. Experimental results showed that D-TGNM still achieved good prediction results under four missing data scenarios (15% random missing, 15% block missing, 30% random missing, and 30% block missing), and outperformed ten existing baselines. Second, we tested the impact of the dynamic matrix and the loss function on the prediction accuracy of D-TGNM, further proving that the proposed method is suitable for traffic flow prediction with missing values. The limitations of this study are as follows: (1) We only verified the prediction performance of the D-TGNM model under two missing rates (15% and 30%), and did not analyze the upper and lower bounds of the missing rate within which the model performance remains acceptable; (2) The D-TGNM model needs to calculate the dynamic matrix for each forward propagation, which is computationally expensive; (3) The D-TGNM model is essentially a general prediction model considering missing values, but we only used traffic datasets to verify the imputation performance of the proposed model. Given the above problems, future work will focus on the following aspects. First, we will try to determine the missing rate bounds within which the D-TGNM model performance remains acceptable. Second, the dynamic matrix in the time window can be calculated in advance to reduce the computational complexity of the model. Finally, multi-source data, such as air quality data and meteorological data, will be further collected to improve the imputation performance of the D-TGNM model.

Figure 1 .
Figure 1. Preliminary definitions: (a) the traffic road network can be abstracted as a graph structure, (b) the traffic state on the entire road network can be expressed as a spatiotemporal state matrix, and (c) the studied traffic flow prediction task under missing data.
and $W_{k:p}$ respectively represent the learnable parameters, which are mainly used for dimension alignment in the calculation process; $\mathrm{Concat}(\cdot)$ represents the matrix concatenation function; $\mathrm{gelu}(\cdot)$ represents the Gaussian Error Linear Unit activation function; and $\mathrm{softmax}(\cdot)$ represents the normalized exponential function. After obtaining the output $\hat{x}_t$, the parameters of the Traffic BERT model can be optimized by minimizing the mean square error between the output values $\hat{x}_t$ and the actual values $x_t$. The optimization process is shown in Formula (4).

Figure 3 .
Figure 3. Illustration of the Traffic BERT.

Figure 4 .
Figure 4. Optimization of the Traffic BERT: the essence of the Traffic BERT model is to use potential spatial associations to estimate missing values in the spatial sequence.

Figure 5 .
Figure 5. Illustration of memory state in the TGNM model: $H_{t:tm} = \{h_i^{t:tm}\}_{i=1}^{n}$ represents the memory state in the time dimension, and $H_{t:sm} = \{h_i^{t:sm}\}_{i=1}^{n}$ represents the memory state in the space dimension. The red dashed line indicates the temporal dependency of the target node, and the blue dashed line indicates the spatial dependency of the target node.

Figure 6 .
Figure 6. Forward propagation illustration of the TGNM model: the traffic flow containing missing values enters the neuron through the missing value processing, thereby updating the spatial memory state and temporal memory state of the previous moment.

Figure 7 .
Figure 7. Motivation for missing pattern mining component.

Figure 8 .
Figure 8. Loss function illustration of the TGNM: the loss function is mainly composed of the temporal-based imputation task loss, the spatial-based imputation task loss, and the prediction task loss.
//train Traffic BERT model
7: initialize the parameters W of Traffic BERT
8: repeat
9: randomly select a training instance $D_b^{TrafficBERT}$ from $D^{TrafficBERT}$
10: obtain results $\hat{x}_t$ by Formulas (2) and (3) with $L$ and $p$
11: find W by minimizing Formula (4)
12: until $M_{TrafficBERT}$ converges
//train D-TGNM model
13: initialize the parameters W of D-TGNM
14: repeat
15: randomly select a training instance $D_b^{DTGNM}$ from $D^{DTGNM}$
16: obtain dynamic matrices $\{\hat{A}_{t-l_d-1}, \hat{A}_{t-l_d}, \ldots, \hat{A}_t\}$ by $M_{TrafficBERT}$ and $\{x_j\}_{j=t-l_d-1}^{t}$

Figure 12 .
Figure 12. Impact of parameters $L$, $p$, and $l_d$ on the prediction performance.

Figure 14 .
Figure 14. Correlation coefficient of the dynamic spatial relationship estimated from the missing data and the non-missing data.

Figure 15.
Figure 15. Illustration of incremental learning of the D-TGNM model: (a) regional division, (b) Traffic BERT component, and (c) TGNM component.

Table 1 .
Traffic volume sample of single camera.

Table 2 .
Comparison results (in MAE/RMSE/MAPE) of D-TGNM and baseline methods without missing data.

Table 3.
Comparison results (in MAE/RMSE/MAPE) of D-TGNM and baseline methods on random missing data.

Table 4.
Comparison results (in MAE/RMSE/MAPE) of D-TGNM and baseline methods on block missing data.

Table 5 .
Impact of dynamic association matrix on prediction results (in MAE/RMSE/MAPE).

Table 6 .
Impact of loss function on prediction results (in MAE/RMSE/MAPE).