Detecting spatiotemporal propagation patterns of traffic congestion from fine-grained vehicle trajectory data

Abstract
Traffic congestion on a road segment typically begins as a small-scale spatiotemporal event that can then propagate throughout a road network and produce large-scale disruptions to a transportation system. In current techniques for the analysis of network flow, data is often aggregated to relatively large (e.g. 5 min) discrete time steps that obscure the small-scale spatiotemporal interactions that drive larger-scale dynamics. We propose a new method that handles fine-grained data to better capture those dynamics. Propagation patterns of traffic congestion are represented as spatiotemporally connected events. Each event is captured as a time series at the temporal resolution of the available trajectory data and at the spatial resolution of the network edge. The spatiotemporal propagation patterns of traffic congestion are captured using Dynamic Time Warping and represented as a set of directed acyclic graphs of spatiotemporal events. Results from this method are compared to an existing method using fine-grained data derived from an agent-based model of traffic simulation. Our method outperforms the existing method. Our method also successfully detects congestion propagation patterns that were reported by media news using sparse real-world data derived from taxis.


Introduction
Traffic congestion can become a drain on economic growth. Schrank et al. (2019) estimated that costs attributable to traffic congestion in the USA would grow from $166 billion in 2017 to $200 billion in 2025. The increasing availability of big spatiotemporal traffic data enables real-time monitoring of traffic congestion in road networks, which could be used to mitigate costs (Gidófalvi 2015, Xu et al. 2016, D'Andrea and Marcelloni 2017). Although the time of congestion events on individual road segments is often known, the relationships among these events are not captured. Traffic congestion on downstream road segments can cause congestion on upstream segments when traffic inflow exceeds outflow over a sufficiently long period. A better understanding of how small-scale patterns propagate in space and time to inhibit traffic patterns over larger scales will help city managers as they prioritize solutions to traffic problems (e.g. to detect and resolve urban traffic bottlenecks) (Yue et al. 2018, Li et al. 2020). The representation, capture and analysis of such patterns, referred to here as Spatiotemporal Propagation Patterns of Traffic Congestion (SPPTC), is the goal of this research. As shown in Figure 1, from 8:05am to 8:20am, congestion on $e_0$ and $e_1$ propagates in multiple directions: $e_0 \to e_2 \to e_4$, $e_1 \to e_2 \to e_4$, and $e_1 \to e_3$. Though important, designing automated detection methods of SPPTC faces three significant challenges: (1) acquiring data at a spatiotemporal resolution capable of capturing traffic dynamics, such as mapping GPS points to the road network (Greenfeld 2002, Lou et al. 2009), (2) developing data structures and algorithms that support the identification and analysis of SPPTC and (3) evaluating the correctness of identified SPPTC. We aim to address the second and third challenges.
A variety of approaches have been developed to solve the SPPTC detection problem but are often constrained by the coarse resolution of the spatiotemporal data on which they are built. Past studies often aggregated vehicle trajectory data into large discrete time steps (e.g. 20 min) to reduce data sparsity and capture meaningful congestion events (i.e. those lasting long enough to be of interest to decision-makers) (e.g. Liu et al. 2011, Liang et al. 2017). However, this ignores spatiotemporal dynamics occurring within each time step, which may result in inaccurate pattern detections. In addition, detected SPPTC were often considered verified based on the time of occurrence or nearby social gathering events (e.g. concerts) (e.g. Liu et al. 2011, Liang et al. 2017, Nguyen et al. 2017). Such information, however, only indicates a possibility of traffic congestion. As a result, the SPPTC detected by those methods are often inaccurate or unconvincing. Overall, our work makes two contributions to the literature. First, we propose a new data-driven method designed to analyse vehicle trajectory data at high temporal resolution for better detection of SPPTC. The problem of detecting SPPTC is formulated as a computational problem of detecting directed acyclic spatiotemporal graphs from a road network and GPS data. Our approach to this problem is to quantify the time span and associated intensity of a congestion event on a road segment, detect propagation relationships among congestion events via Dynamic Time Warping (DTW), and build spatiotemporal graphs that capture SPPTC. Second, we present a comprehensive evaluation process that rigorously tests our method. Error metrics (i.e. accuracy, precision, recall and F1-score) are used to quantify the correctness of detection.
Due to the difficulty of obtaining confirmed traffic congestion propagation in the real world, an Agent-Based Model (ABM) is used to produce vehicle trajectories based on pre-defined locations and time spans of congestion propagation patterns. The results obtained from our method are compared to those derived from a state-of-the-art approach in Sui and Zhang (2021), the only existing method that can detect time spans of congestion events and thus is comparable to our method. We also investigate the impact of data sparsity (i.e. only a small subset of all traffic being observed), because it is a common problem in the context of real-world traffic data analysis. Our method outperforms the method of Sui and Zhang (2021). Finally, we present case studies of SPPTC which are detected from real-world data and supported by media news.
We organize the structure of this paper as follows: Section 2 reviews related work; Section 3 illustrates our method to detect SPPTC; Section 4 evaluates this method using data from an ABM as well as the real world; Section 5 concludes the study and discusses future work.

Literature review
Automated detection and prediction of traffic congestion patterns have been extensively studied. Related literature can be categorized as presented in Table 1. Our work lies in the cell with the thickened border. The following review discusses studies within that particular cell in detail.

2.1. The existing methods related to detecting SPPTC
From a GIScience perspective, the key steps to detect SPPTC include (1) the development of a formal definition that facilitates the storage and analysis of detected SPPTC; (2) the extraction of congestion events positioned in space and time; (3) the identification of propagation relationships between congestion events. The following review discusses related work for each step. Past studies represented SPPTC as a subgraph of a given road network (Chen et al. 2018) or a set of connected spatiotemporal events of traffic congestion (Liang et al. 2017, Nguyen et al. 2017, Sui and Zhang 2021). The former representation is easy to visualize, but it can produce a cyclic propagation pattern (e.g. ring) where the start of propagation is unclear. The latter representation treats each congestion event as a node and each propagation relationship as an edge between two events. The connected events are represented as either spatiotemporal trees (Liang et al. 2017, Nguyen et al. 2017) or paths (Sui and Zhang 2021). This representation works well for estimating the impact of each congestion event on upstream road segments. However, detected patterns overlap when propagation converges to the same upstream segment. To address those issues, we represent SPPTC as directed acyclic spatiotemporal graphs.
Past studies detecting SPPTC typically represented congestion as a binary variable defined by a state variable of traffic dynamics relative to a predefined threshold value. For example, in Nguyen et al. (2017), a road segment was congested when its travel time was higher than a user-defined percentile of all historical values. Traffic speed (Liang et al. 2017, Chen et al. 2018), travel time (Nguyen et al. 2017, Sui and Zhang 2021) and traffic volume (El-Sayed and Thandavarayan 2018) have all been used in this context. For each discrete time step, data were aggregated to identify whether there was a congestion event. This reduces the impact of data sparsity and filters out inconsequential changes in traffic flow (e.g. normal disruptions in flow resulting from traffic signals). The lengths of these time steps varied (e.g. 5 min (Nguyen et al. 2017, Chen et al. 2018, El-Sayed and Thandavarayan 2018), 10 min (Sui and Zhang 2021) and 20 min (Liang et al. 2017)). However, studies of the Modifiable Temporal Unit Problem (MTUP) have shown that the size of the temporal unit can greatly affect some results (e.g. spatiotemporal clusters of crimes (Cheng and Adepeju 2014) and eccentricity of human daily movement (Zhao et al. 2019)). In addition, the Courant-Friedrichs-Lewy criterion suggests the temporal resolution of analysis should be high enough to capture the fastest speed of the phenomenon of interest (Tang 2006, Tang and Bennett 2010). Many existing methods do not adapt well to the analysis of high temporal resolution traffic data because the duration of congestion is not addressed. This issue is partially addressed by Sui and Zhang (2021), who treat congestion over multiple consecutive time steps as a single event. This, however, can result in many fragmented congestion events when data are missing for short periods. Our method attempts to overcome this issue.
A common way to determine the propagation relationship between two congestion events is based on the difference in congestion time. This assumes propagation events happen between adjacent road segments. A propagation relationship is assumed to exist if, for example, the upstream segment is congested at time step $t_{up}$, the downstream segment is congested at time step $t_{down}$ and: $t_{up} = t_{down}$ (El-Sayed and Thandavarayan 2018), $t_{up} - t_{down} = 1$ (Liu et al. 2011, Nguyen et al. 2017), or $t_{up} - t_{down} \le 5$ (Chen et al. 2018). The corresponding SPPTC can be found using a brute-force search (Chen et al. 2018, El-Sayed and Thandavarayan 2018) or a Breadth First Search (BFS) (Nguyen et al. 2017) on congestion events. Sui and Zhang (2021) considered a propagation relationship to exist if there was at least one vehicle visiting two adjacent congested road segments during a specified period. When data is extremely sparse and the adjacency of road segments cannot be assumed, the method of Liang et al. (2017) is useful. This approach estimates the most likely propagation relationship among all congestion events, with the probability of propagation weighted by network distance and the difference in congestion occurrence time. However, the maximum number of propagation relationships is required as an additional input to this method. Our method differs from existing methods by analysing propagation relationships via DTW, a technique commonly used to measure differences between time-series patterns (Müller 2007, Li and Zhu 2021).

Evaluating detected SPPTC
Validating detected SPPTC has proved to be a challenging task. Nguyen et al. (2017)  Obtaining ground-truth data for a robust evaluation remains challenging. To calculate a complete set of error metrics, data are needed for both the presence and absence of SPPTC, which is likely to be context dependent (e.g. by weather, time and season). Therefore, long-term (e.g. multi-month) high-resolution traffic data would be needed across a road network. Publicly available traffic data currently lack the spatial and temporal resolution needed to drive such a validation exercise. Some have attempted to use New York City (NYC) taxi data (Anon 2021) but such data only contain the origin and destination of taxi trips and, thus, suffer from a lack of sufficient detail (Yazici et al. 2012, Yazici et al. 2013, Kamga and Yazıcı 2014, Yazici et al. 2017a, 2017b). Other researchers relied on videos or images of traffic dynamics (Eamthanakul et al. 2017, Wang et al. 2018, Gao et al. 2021) but these data have limited network coverage. Large-scale SPPTC detection would require cameras monitoring traffic dynamics on a significant number of road segments, with additional challenges from weather conditions and equipment failure. Therefore, we use an ABM to simulate GPS trajectory data of pre-defined spatiotemporal congestion propagation patterns for correctness evaluation. Simulated data are often used in traffic studies for different evaluative purposes, such as the detection of traffic congestion on road segments (D'Andrea and Marcelloni 2017, Shi et al. 2021) and reducing travel time via re-routing (Smith et al. 2014, El-Sayed and Thandavarayan 2018). Our work uses an ABM to evaluate the correctness of detected propagation relationships.

Formal definitions related to SPPTC
While our application domain here is urban traffic, movement across networks is a common generalizable problem. Following common practice, the road network is represented as a directed graph $G(V, E)$, where $e_i \in E$ is a road segment and $e_i.v_s, e_i.v_e \in V$ are the start and end points of $e_i$. $\forall e_i \in E$, $e_i$ has one direction. A bidirectional road segment is represented as two edges in opposite directions. To fully capture the spatiotemporal dynamics of SPPTC, we set the size of a discrete time step $t$ to be the same as the temporal resolution of the available vehicle trajectory data (e.g. data from an ABM or a real-world GPS receiver), which is the expected time to obtain a location point of a vehicle. For real-world GPS data, this is usually the modal time gap between consecutive GPS points (e.g. $t.len = 3$ s). The trajectory of vehicle $i$ consists of an ordered set of tuples $\langle spd_i, e_j, t_k \rangle$, where $spd_i$ is the vehicle's average travelling speed on the segment $e_j$ at time step $t_k$. Given the trajectory data, we estimate traffic speed as follows:

Definition 1 Traffic speed: $spd(e_j, t_k)$ is the overall traffic speed of the road segment $e_j$ at time step $t_k$:
$$spd(e_j, t_k) = \frac{1}{|VEH_{j,k}|} \sum_{i \in VEH_{j,k}} spd_i$$
where $VEH_{j,k}$ refers to the set of all the vehicles on $e_j$ at $t_k$.

To estimate traffic congestion on a road segment $e_j$, we need its free-flow traffic speed $spd_{freeflow}(e_j)$, which is often defined as the 85th percentile of the non-peak hours (i.e. 9.00am to 4.00pm, 10.00pm to 6.00am) traffic speed $spd(e_j, t_k)$ (Lomax et al. 1999). For the ABM, it is the average travelling speed of vehicles simulated in unrestricted traffic conditions. As congestion on a road segment can last for many time steps, we define a congestion event as follows:

Definition 2 Spatiotemporal congestion event: $cng_a = \langle e_j, T_a \rangle$ is a spatiotemporal congestion event, where $T_a$ is the time span of the event on $e_j$. $T_a = [t_s, t_e]$, where $t_s$ is the start time step and $t_e$ is the end time step.
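The two quantities defined above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the overall traffic speed is the arithmetic mean of the vehicle speeds on a segment at a time step, and the record layout, the function names `traffic_speed` and `free_flow_speed`, and the choice of interpolation method for the percentile are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, quantiles


def traffic_speed(records):
    """Map (edge, step) -> mean speed of the vehicles observed there.

    `records` is an iterable of (vehicle_id, speed, edge, step) tuples,
    a hypothetical layout for the trajectory tuples <spd_i, e_j, t_k>.
    """
    grouped = defaultdict(list)
    for _vehicle, speed, edge, step in records:
        grouped[(edge, step)].append(speed)
    return {key: mean(speeds) for key, speeds in grouped.items()}


def free_flow_speed(off_peak_speeds, percentile=85):
    """Percentile (default 85th) of the off-peak traffic speeds of one segment."""
    # quantiles(..., n=100) returns the 99 cut points P1..P99.
    return quantiles(off_peak_speeds, n=100, method="inclusive")[percentile - 1]


records = [
    ("veh1", 10.0, "e0", 0),
    ("veh2", 14.0, "e0", 0),
    ("veh1", 8.0, "e0", 1),
]
speeds = traffic_speed(records)
```

Segment and time-step pairs without any observation are simply absent from the returned mapping, so a caller has to decide separately how to treat missing data.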
The set of all traffic congestion events in the road network $G(V, E)$ is $CNG$. If a congestion event propagates to nearby road segments and the new events keep propagating further, larger-scale traffic congestion will appear. To capture the SPPTC causing such larger-scale congestion, as in the example shown in Figure 2, we model the SPPTC as follows:

Definition 3 Congestion Propagation Graph (CPG): $G' = (V', E')$ is a graph that captures the spatiotemporal footprint of propagation among a set of congestion events, where $v'_a \in V'$ is a congestion event and $V' \subset CNG$. $\forall e'_k \in E'$, $e'_k$ is the propagation relationship from a congestion event $v'_a = \langle e_i, T_a \rangle$ to another congestion event $v'_b = \langle e_j, T_b \rangle$, and $T_a.t_s < T_b.t_s < T_a.t_e$.

The set of all CPGs is $S$. No two different CPGs overlap; otherwise, the two CPGs are subsets of a larger CPG. Overall, the input data for detecting SPPTC are a road network $G(V, E)$ and vehicle trajectory data over a time span $T_{data}$. The output is the set $S$ of all the CPGs during a time span $T_{obj}$, where $T_{obj} \subset T_{data}$. The objective of this computational problem is to increase the correctness of detection.
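One possible in-memory representation of Definitions 2 and 3 is sketched below. The class names `CongestionEvent` and `CPG`, the method `add_propagation`, and the use of integer time steps are illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CongestionEvent:
    """A congestion event <e_j, T_a> with T_a = [t_start, t_end]."""
    edge: str
    t_start: int
    t_end: int


@dataclass
class CPG:
    """A congestion propagation graph: events as nodes, propagations as edges."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_propagation(self, src, dst):
        # Temporal ordering required by Definition 3:
        # the later event must start strictly within the earlier one.
        if not (src.t_start < dst.t_start < src.t_end):
            raise ValueError("violates the temporal order of Definition 3")
        self.nodes.update((src, dst))
        self.edges.append((src, dst))


# Hypothetical events, with time steps counted in seconds.
a = CongestionEvent("e0", 300, 1200)
b = CongestionEvent("e2", 480, 1200)
cpg = CPG()
cpg.add_propagation(a, b)
```

Because the start time strictly increases along every edge, a graph built this way cannot contain a cycle, which is why the representation is acyclic by construction.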

Extracting space, time and traffic dynamics of congestion events
There are multiple approaches to determine congestion on a road segment at a particular time step based on traffic speed. Three common approaches are: (1) traffic speed is less than a fixed absolute value (Liang et al. 2017); (2) traffic speed is less than a predetermined percentage of a reference speed (e.g. the speed limit (D'Andrea and Marcelloni 2017) or the free-flow speed (Lomax 1997, Rao and Rao 2012)); (3) the distribution of traffic speeds is statistically significantly lower than the historical distribution as determined by a likelihood ratio test (Anbaroğlu et al. 2015). The first approach cannot address the heterogeneity of traffic speeds on different road segments (e.g. motorways and residential streets). The third approach requires a substantial number of observations to fit a probability distribution of traffic speed for each road segment at each time step, and it also assumes the same form of probability distribution (e.g. Gaussian, Lognormal, or Gamma) over all road segments and time steps. Because of the small time step and the heterogeneous traffic speeds on different segments, we use the second approach to quantify the congestion intensity on road segment $e_j$ at time step $t_k$ as a score $CngScore(e_j, t_k)$ (Equation (1)).
$$CngScore(e_j, t_k) = 1 - \frac{spd(e_j, t_k)}{0.5 \times spd_{freeflow}(e_j)} \qquad (1)$$
According to the speed reduction index (Lomax 1997, Rao and Rao 2012), congestion occurs when traffic speed is less than or equal to $0.5 \times spd_{freeflow}(e_j)$. If $CngScore(e_j, t_k) > 0$, then $e_j$ is congested at $t_k$. $CngScore(e_j, t_k)$ varies within $[-1, 1]$, unless the traffic speeds of most vehicles are higher than the free-flow speed, which is uncommon. If $spd(e_j, t_k)$ is unavailable at $t_k$, $spd(e_j, t_k)$ is assumed to be the same as $spd_{freeflow}(e_j)$, and thus $CngScore(e_j, t_k) = -1$.

Figure 2. CPG of the example SPPTC in Figure 1.
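Equation (1) and its missing-data rule translate directly into a small function. The name `cng_score` and the use of `None` to mark an unavailable speed are assumptions made for this sketch.

```python
def cng_score(speed, free_flow_speed):
    """Congestion score of Equation (1) for one segment and time step.

    `speed` may be None when no vehicle was observed; the free-flow
    speed is then substituted, which yields a score of -1.
    """
    if speed is None:
        speed = free_flow_speed
    return 1.0 - speed / (0.5 * free_flow_speed)


# Worked values for a segment with a free-flow speed of 20 m/s.
standstill = cng_score(0.0, 20.0)    # fully congested
threshold = cng_score(10.0, 20.0)    # exactly half the free-flow speed
free_flow = cng_score(20.0, 20.0)    # unrestricted traffic
missing = cng_score(None, 20.0)      # no observation at this time step
```

The four values are 1.0, 0.0, -1.0 and -1.0, which shows that the score is positive only when the speed falls below half of the free-flow speed.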
As road congestion intensity cannot vary abruptly, we use a smoothing operator to account for traffic conditions at nearby times. For any $t_k$, the weight $w_i$ of a nearby $t_i$ is determined by a Gaussian kernel with $\sigma = 1$ min (Equation (2)). The bandwidth is $\alpha$ s ($|t_i - t_k| \le \alpha/2$). The weights are normalized (i.e. $w'_i = w_i / \sum w_i$) to compute the smoothed congestion score at $t_k$. Given the smoothed congestion score at each time step, there is a congestion event $cng_a = \langle e_j, T_a \rangle$ on $e_j$, where $\forall t_k \in T_a$, $CngScore(e_j, t_k) > 0$. For travellers and city managers, a congestion event must last long enough to be of interest. In addition, missing vehicle trajectory data between two long congestion events may cause a small temporal gap that should be accounted for. To address these two issues, we implement the following post-processing procedures to determine the time spans of congestion events on a road segment:
1. For each $e_j$, find all the congestion time spans, such that for each time span $T_a$: $\forall t_k \in T_a$, $CngScore(e_j, t_k) > 0$.
2. Find all the time gaps between consecutive congestion time spans, such that the time length of each gap is smaller than both (a) $\gamma$ s and (b) the total time length of its adjacent congestion time spans.
3. Merge these time gaps with the congestion time spans contiguous to them to form the congestion events.
4. Discard any congestion event whose time length is smaller than $\gamma$ s.
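The smoothing and the four post-processing steps above can be sketched as follows. This is a hedged illustration: the Gaussian weight $\exp(-(t_i - t_k)^2 / (2\sigma^2))$ is the standard kernel form and is assumed here, gaps are compared against the original (unmerged) neighbouring spans, and the function names and arguments are hypothetical.

```python
import math


def smooth_scores(scores, step_len, alpha, sigma=60.0):
    """Gaussian-weighted average of scores within |t_i - t_k| <= alpha / 2.

    `step_len`, `alpha` and `sigma` are in seconds.
    """
    half = int((alpha / 2) // step_len)  # steps on each side of t_k
    smoothed = []
    for k in range(len(scores)):
        lo, hi = max(0, k - half), min(len(scores) - 1, k + half)
        weights = [math.exp(-(((i - k) * step_len) ** 2) / (2 * sigma ** 2))
                   for i in range(lo, hi + 1)]
        total = sum(weights)
        smoothed.append(
            sum(w * s for w, s in zip(weights, scores[lo:hi + 1])) / total)
    return smoothed


def congestion_events(scores, step_len, gamma):
    """Return (start, end) step indices of congestion events on one segment."""
    # Step 1: maximal runs of time steps with a positive score.
    spans, start = [], None
    for k, score in enumerate(scores):
        if score > 0 and start is None:
            start = k
        elif score <= 0 and start is not None:
            spans.append((start, k - 1))
            start = None
    if start is not None:
        spans.append((start, len(scores) - 1))
    # Steps 2-3: merge spans across gaps shorter than gamma and shorter
    # than the combined length of the two spans adjacent to the gap.
    merged = [list(spans[0])] if spans else []
    for prev, cur in zip(spans, spans[1:]):
        gap = (cur[0] - prev[1] - 1) * step_len
        adjacent = (prev[1] - prev[0] + 1 + cur[1] - cur[0] + 1) * step_len
        if gap < gamma and gap < adjacent:
            merged[-1][1] = cur[1]
        else:
            merged.append(list(cur))
    # Step 4: discard events shorter than gamma.
    return [(s, e) for s, e in merged if (e - s + 1) * step_len >= gamma]


# Three spans with 60 s steps: the 60 s gap is bridged, the 300 s gap is not,
# and the trailing 60 s span is too short to be kept when gamma = 180 s.
raw = [1.0] * 3 + [-1.0] + [1.0] * 3 + [-1.0] * 5 + [1.0]
events = congestion_events(raw, step_len=60, gamma=180)
```

In this example the first two spans are merged into a single event covering steps 0 to 6, while the last span is discarded.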

Identifying the propagation relationships between congestion events
For a propagation relationship to exist from one congestion event $cng_a = \langle e_i, T_a \rangle$ to another $cng_b = \langle e_j, T_b \rangle$, $e_i$ and $e_j$ should be topologically adjacent, and the congestion events should be temporally ordered as in Definition 3. In addition, the time-series patterns of road congestion intensity represented by $CngScore$ at $e_i$ and $e_j$ during each congestion event should be morphologically similar. Therefore, there is a propagation relationship from $cng_a = \langle e_i, T_a \rangle$ to $cng_b = \langle e_j, T_b \rangle$ if (1) $e_i.v_s = e_j.v_e$; (2) $T_a.t_s < T_b.t_s < T_a.t_e$; and (3) the difference between the time-series patterns of $CngScore$ in $cng_a$ and $cng_b$ is smaller than $\beta$. The difference between the time-series patterns is measured by Dynamic Time Warping (DTW) (Müller 2007). DTW accounts for the time lag of pattern variability between two time series of congestion intensity. Figure 3 illustrates the DTW method. Assuming each time series has a cursor starting from the beginning $t_0$, each point in a time series must be matched to the first unmatched point or the last matched point in the other time series. Each match generates a score measuring the distance between two points. In this study, the distance measurement is the absolute difference between the $CngScore$ of those two points. The minimum sum of distances from all the matches is taken as the result. The DTW distance is further averaged by the total number of matches to bound the total distance measurement in $[0, 2]$, assuming the speeds of most vehicles are no higher than the free-flow speed.
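A compact dynamic-programming sketch of the averaged DTW distance described above is given here. It is an illustration under stated assumptions: the function name `dtw_average_distance` is hypothetical, both series must be non-empty, and when several alignments share the minimum cost the one with fewer matches is used, a tie-breaking rule the text does not specify.

```python
def dtw_average_distance(a, b):
    """Mean |a_i - b_j| over the matches of a minimum-cost DTW alignment."""
    n, m = len(a), len(b)
    inf = float("inf")
    # table[i][j] = (accumulated cost, number of matches) for a[:i] and b[:j]
    table = [[(inf, 0)] * (m + 1) for _ in range(n + 1)]
    table[0][0] = (0.0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Each point is matched either to the next unmatched point or to
            # the last matched point of the other series.
            cost, count = min(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1])
            table[i][j] = (cost + abs(a[i - 1] - b[j - 1]), count + 1)
    total, matches = table[n][m]
    return total / matches


# The second series repeats the pattern of the first with a one-step lag.
lagged = dtw_average_distance([0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0, 0.0])
# Scores at opposite ends of the [-1, 1] range give the upper bound.
opposite = dtw_average_distance([1.0, 1.0], [-1.0, -1.0])
```

The lagged pair has distance 0.0 because DTW absorbs the time shift, whereas the opposite pair reaches the upper bound of 2.0.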
For two congestion events $cng_a = \langle e_i, T_a \rangle$ and $cng_b = \langle e_j, T_b \rangle$, only a part of the time series of $cng_a$ or $cng_b$ can be used for the DTW distance measurement when there is a significant difference in time length (e.g. Figure 4). To address this issue, the time windows used for DTW measurements are $[\max(T_a.t_s, T_b.t_s - T_b.len), \min(T_a.t_e, T_b.t_e)]$ for $cng_a$ and $[T_b.t_s, \min(T_b.t_e, T_a.t_e + T_a.len)]$ for $cng_b$. Given the detected spatiotemporal nodes and edges, we build CPGs via a Breadth First Search (BFS) on event nodes with no parent. In Algorithm 1, from line 1 to line 7, each congestion event node without propagation from downstream is treated as a source node of a CPG. After line 8, the algorithm repeatedly checks whether existing CPGs can be expanded via their leaf nodes. These CPGs are expanded in lines 9-15. CPGs sharing the same nodes are merged in lines 16-20, and CPGs that cannot be expanded further are stored as output in lines 21-23.
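The window computation and the graph construction can be sketched as below. This is not a line-by-line rendering of Algorithm 1: it starts a BFS from every event without a parent and lets the search cross edges in both directions, so CPGs that share nodes end up merged automatically. The names `dtw_windows` and `build_cpgs`, and the assumption that $T.len = t_e - t_s$, are hypothetical.

```python
from collections import defaultdict, deque


def dtw_windows(a_start, a_end, b_start, b_end):
    """Time windows of cng_a and cng_b used for the DTW measurement."""
    a_len, b_len = a_end - a_start, b_end - b_start
    win_a = (max(a_start, b_start - b_len), min(a_end, b_end))
    win_b = (b_start, min(b_end, a_end + a_len))
    return win_a, win_b


def build_cpgs(events, propagations):
    """Group events into CPGs, returned as (node set, edge list) pairs."""
    neighbours = defaultdict(set)
    has_parent = set()
    for src, dst in propagations:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
        has_parent.add(dst)
    seen, cpgs = set(), []
    for source in events:
        # Start only from unvisited source events that take part in
        # at least one propagation relationship.
        if source in has_parent or source in seen or source not in neighbours:
            continue
        component, queue = {source}, deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node] - component:
                component.add(nxt)
                queue.append(nxt)
        seen |= component
        edges = [(s, d) for s, d in propagations if s in component]
        cpgs.append((component, edges))
    return cpgs


# Propagation relationships of the Figure 1 example, plus one extra pair and
# one isolated event (labelled by segment for brevity).
events = ["e0", "e1", "e2", "e3", "e4", "e5", "e6", "e9"]
propagations = [("e0", "e2"), ("e2", "e4"), ("e1", "e2"),
                ("e1", "e3"), ("e5", "e6")]
cpgs = build_cpgs(events, propagations)
windows = dtw_windows(0, 100, 80, 130)
```

The example yields two CPGs: one containing the five events of Figure 1 and one containing the extra pair, while the isolated event belongs to no CPG.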

Evaluation
The evaluation of our method contains two parts: correctness and real-world cases. The metrics measuring correctness (details are in Section 4.1.1) quantify the difference between the detected congestion propagation relationships (from Section 3.3) and pre-defined propagation relationships. We use SUMO (Lopez et al. 2018) as the ABM to obtain vehicle GPS trajectory data associated with pre-defined propagation relationships in a road network. These high spatiotemporal resolution data of simulated vehicle trajectories lack the problems of noise and missing data common in most existing real-world datasets and, thus, provide a useful dataset for evaluating correctness. We also detect SPPTC from real-world vehicle GPS data provided by DiDi (Anon 2018) to examine the usefulness of our method in the context of significant noise (e.g. GPS sampling and map-matching errors). We present case studies of detected SPPTC which are verified by media news.

Evaluation settings
We simulate traffic congestion propagation in three different areas of the road network from Codeca et al. (2017), with a set of traffic scenarios in each area (Figure 5).
The complexity of traffic dynamics varies across different sets of scenarios. The first set simulates congestion propagation on urban local streets. All the vehicles travel in the same direction and all involved segments are fully congested. The second set simulates propagation from a local ring road onto a ramp and motorway. Vehicles on the motorway travel in different directions (i.e. staying on the motorway or leaving the motorway). Congestion on the motorway exists only in the right lane and the associated propagation ($e_5 \to e_6$) can be difficult to detect. The third set simulates propagation at an intersection. All vehicles travel straight through the intersection. This set aims to test whether the causality can be accurately detected. For example, the congestion on $e_{11}$ is caused by $e_{10}$ but not $e_{12}$. A subset of road segments is used for validation because some segments are used for dispatching vehicles or producing congestion on other segments.
For each propagation relationship, we pre-define the road segment $e_j$ and time span $T_a = [t_s, t_e]$ for every associated congestion event $cng_a = \langle e_j, T_a \rangle$. For each no-propagation relationship, we pre-define the road segments to ensure there is no congestion propagation during the entire simulation period (e.g. $e_9 \to e_6$ due to smooth traffic on $e_9$, and $e_{12} \to e_{11}$ due to causality). Each road segment involved in a congestion propagation relationship is congested for at least 5 min. Across different scenarios within the same set, the segments of congestion stay the same, but the time steps of congestion are different. Every scenario has 10 different perturbations created by sampling parameters of driver behaviour from pre-defined distributions. There is a total of 170 propagation relationships and 80 no-propagation relationships among all the scenario perturbations. SUMO outputs simulated vehicle GPS trajectories every second. For details of the simulations, please refer to Supplementary material A. To evaluate whether our method improves the detection of congestion propagation, we compare the results to those generated from the method by Sui and Zhang (2021). As far as we know, it is the only existing SPPTC detection method that identifies the duration of traffic congestion events and, thus, is most comparable to our method. For a detailed explanation of how the baseline method is implemented, please refer to Supplementary material B. The baseline method is evaluated at multiple time steps: 1 s (same as our proposed method), 180 s, and 300 s. The 180 s and 300 s time steps are used for the baseline because the method in Sui and Zhang (2021) was originally run with large time steps. Our method does not use these two time steps because it is designed to detect SPPTC using small time steps.
The metrics for evaluation include accuracy, precision, recall and F1-score, which are shown in Equations (3)-(6). If a detected propagation relationship is correct in segments and time (including start and end time steps) of both upstream and downstream congestion events, there is a True Positive (TP). If a detected propagation relationship is correct in segments but wrong in time, there is a False Positive (FP). There is also an FP if a detected propagation relationship is wrong in segments (i.e. matching a no-propagation relationship). If a pre-defined propagation relationship cannot be found among the detected relationships, there is a False Negative (FN). If a pre-defined no-propagation relationship cannot be found among the detected relationships, there is a True Negative (TN). For details, please refer to Supplementary material C.
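For reference, the four metrics follow their standard definitions in terms of TP, FP, FN and TN counts. The sketch below assumes those standard formulas; the function name `error_metrics`, the convention of returning 0.0 when a denominator is zero, and the example counts are all illustrative and are not results from the paper.

```python
def error_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1-score from confusion counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    # Fall back to 0.0 when a ratio is undefined (an assumed convention).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up counts, used only to illustrate the arithmetic.
metrics = error_metrics(tp=8, fp=2, fn=2, tn=8)
```

With these counts every metric equals 0.8, since 16 of the 20 decisions are correct and both precision and recall are 8/10.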
Currently, it is difficult to obtain vehicle GPS trajectory data in the real world. Therefore, we evaluate the effect of data sparsity on the results. For each GPS trajectory dataset, we randomly draw a subset of all the simulated vehicles along each route and use their trajectories as input to our method. There are two different ranges of draws: (1) from 20% to 90%, in increments of 10%, and (2) from 5% to 13%, in increments of 1%. For each percent value, 50 random draws are performed. The sampling rate of currently available real-world vehicle GPS data is around 10%, but it is expected to increase as technology improves in the future. Therefore, we use two different ranges of the sampling rate.
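The per-route random draw described above might look like the following sketch. The mapping from route to vehicle identifiers, the function name `sample_vehicles`, the rounding rule, and the fixed seed are assumptions made for illustration.

```python
import random


def sample_vehicles(vehicles_by_route, rate, seed=None):
    """Randomly keep a share `rate` of the vehicles on every route."""
    rng = random.Random(seed)
    sampled = set()
    for vehicles in vehicles_by_route.values():
        # Assumed rounding rule: nearest integer number of vehicles.
        k = round(rate * len(vehicles))
        sampled.update(rng.sample(sorted(vehicles), k))
    return sampled


# Hypothetical routes with 10 and 20 simulated vehicles.
routes = {
    "route_a": [f"a{i}" for i in range(10)],
    "route_b": [f"b{i}" for i in range(20)],
}
subset = sample_vehicles(routes, rate=0.2, seed=1)
```

Sampling route by route keeps the share of observed vehicles the same on every route, so a sparse draw cannot leave one route entirely unobserved by chance while another stays fully covered.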
We also evaluate the effects of the parameters embedded in our method on the results. The parameter $\alpha$ (i.e. the bandwidth of the smoothing kernel in Equation (2)) on each road segment is set at $2 \times$ free-flow travel time, 60 s, 120 s, or 180 s. The parameter $\beta$ (i.e. the maximum difference of time-series patterns of congestion intensity to identify a propagation relationship between two congestion events) is set at 0.3, 0.5, or 0.7. The parameter $\gamma$ (i.e. the minimum time length of a congestion event and the minimum length of the time gap between two congestion events) is set at 180 s, 300 s or 420 s.

Results and discussion
The results (Figures 6 and 7) show our method always outperforms the baseline method across all metrics. The baseline performs better with a 180 s time step than with either 1 s or 300 s. The baseline performs worse with a 1 s time step because of the large number of fragmented congestion events and the metric it uses to determine propagation (see Supplementary material B). Our method uses the smoothing and post-processing procedures (Section 3.2) to prevent congestion events from being fragmented, which enhances the method's ability to detect the correct start and end time steps of congestion events. Furthermore, our method uses DTW-based time-series analysis (Section 3.3) to determine propagation relationships, instead of simply assuming the existence of a propagation relationship based on whether there is a vehicle travelling between two congested road segments during a specific period. Therefore, our method has far fewer FPs than the baseline method, leading to significantly better precision. With a better ability to capture the time of congestion events, our method also has more TPs, resulting in better recall. Overall, these lead to better accuracy and F1-score. Regarding the impact of parameters embedded in our method, $\alpha = 2 \times$ free-flow travel time has the best overall performance among all the values of $\alpha$. It performs slightly better than $\alpha = 60$ s, and both values perform clearly better than $\alpha = 120$ s and $\alpha = 180$ s. Most of the road segments have a $2 \times$ free-flow travel time of < 40 s, and half of the road segments have a $2 \times$ free-flow travel time of < 20 s. This result suggests a large bandwidth that overly smooths the data is not beneficial.

Figure 6. Performance of our method and baseline when 20-100% data are given. Each row of charts uses the same evaluation metric. Each column varies one of the user-defined parameters and fixes the other two parameters. All charts have the same x-axis.
As for parameter $\beta$, $\beta = 0.7$ and $\beta = 0.5$ have similar performance, while $\beta = 0.3$ has the worst performance. A higher $\beta$ allows our method to detect more propagation relationships. The result suggests $\beta = 0.5$ is sufficiently large for our method. Regarding parameter $\gamma$, $\gamma = 180$ s is always the worst. $\gamma = 420$ s has better precision while $\gamma = 300$ s has better recall. As all congestion events of pre-defined propagation relationships last for at least 5 min, $\gamma = 420$ s excludes congestion events between 5 min and 7 min, leading to overall fewer TPs and more FNs. $\gamma = 420$ s also excludes some congestion events having wrong time steps in scenario 6, resulting in fewer FPs than $\gamma = 300$ s. Overall, $\gamma = 420$ s and $\gamma = 300$ s have similar performance, but $\gamma = 300$ s has a noticeably better F1-score when data is highly sparse.
A major limitation of our method is that the performance degrades rapidly as data sparsity increases. With high data sparsity, our method fails to detect many of the congestion events produced by perturbing the scenarios because there is often an insufficient number of time steps associated with $CngScore > 0$ on a congested road segment. For example, in Figure 8, our method can detect the propagation from $e_0$ to $e_1$ given 100% of the data but fails to do so in a random draw of 20% of the data. Given 20% of the data, the congestion event on $e_1$ cannot be detected because the detected duration of the congestion event is less than $\gamma = 300$ s. Another limitation of our method is the failure to distinguish causality between congestion events (e.g. the FP detection $e_{12} \to e_{11}$ in Figure 5). It is worth mentioning that partially congested road segments (e.g. only one out of three lanes is congested) are more likely to be detected using smaller time steps. For example, the congestion event on $e_6$ (Figure 5) in scenario 7 can be detected by our method and the baseline method with a 1 s time step, but the baseline method with a 5 min time step cannot detect this congestion event.

Dataset and settings
We derive the vehicle trajectory data from taxi GPS data captured in Chengdu, China, covering November 2016. According to Pang et al. (2016), there were about 130 million instances of DiDi ridesharing over 3 months. Further, Pang et al. (2016) suggested that 6 out of every 10 people used DiDi ridesharing. According to Chengdu Statistic Bureau, NBS Survey Office in Chengdu and Chengdu Statistical Association (2017), there were 8.94 million employed individuals in 2016. Assuming every employed person commutes twice daily, the penetration rate of DiDi data is about 7.9% in Chengdu. Given the result in Figure 7 and such data sparsity, the accuracy and F1-score of our method may be lower than desired. However, the spatial heterogeneity of vehicle trajectory data coverage is high (e.g. Figure 9(a)), so data sparsity is likely to vary by location. We run our proposed method on real-world GPS data to see whether it can detect some SPPTC in the real world.
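The penetration rate quoted above can be reproduced with simple arithmetic. The number of days in the 3-month period is not stated in the text; the value of 92 days used below is an assumption that happens to reproduce the reported figure, and the function name `penetration_rate` is hypothetical.

```python
def penetration_rate(rides, employed, trips_per_person_per_day, days):
    """Share of all assumed commuting trips that are covered by ridesharing records."""
    total_trips = employed * trips_per_person_per_day * days
    return rides / total_trips


rides = 130e6      # ridesharing instances over 3 months (Pang et al. 2016)
employed = 8.94e6  # employed individuals in Chengdu in 2016
days = 92          # ASSUMPTION: length of the 3-month period, not given in the text

rate = penetration_rate(rides, employed, trips_per_person_per_day=2, days=days)
rate_percent = round(rate * 100, 1)
```

Under a 90-day assumption the same formula gives about 8.1%, so the estimate is only mildly sensitive to the exact length of the period.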
The GPS location of a vehicle is obtained every 3 s, and thus we set the time step of our method to 3 s for this dataset. There are about 538 million GPS points, 1 million vehicles and 4 million trajectories in the processed road network (Figure 9(a)). The network was processed from OpenStreetMap and projected to WGS 84 / UTM zone 48N. It contains all segments from motorways to secondary roads. Lower-level segments (e.g. tertiary and residential) are excluded due to insufficient samples. There are 1666 road segments in total. We map-matched all vehicle GPS points using the methods from Greenfeld (2002) and Lou et al. (2009). The parameter setting of our method is a = 2 × free-flow travel time, b = 0.5 and c = 300 s. As short road segments (e.g. intersections) have few GPS data points, road segments shorter than 30 m are excluded from propagation detection and the topology of adjacent segments is adjusted accordingly (Figure 9(b)). The data used to retrieve free-flow traffic speed are from 1st November to 23rd November. SPPTC are detected from 24th November to 30th November.
Figures 10 and 11 show two CPGs detected on 29th November (Monday). Both graphs occur around the morning rush hours, and traffic congestion in both areas was reported in the news (Chengdu Daily 2016). In Figure 10, the plots of traffic dynamics show that congestion in segment A mainly occurs between 8:15am and 8:35am and congestion in segment C mainly occurs between 8:17am and 8:28am. Segment B is short, and its traffic dynamics fluctuate considerably due to the small number of vehicles. As segment B is short, the traffic dynamics in segment A and segment C are similar (i.e. congestion mostly occurs in the first 2/3 of the time span). Congestion in segment A starts earlier but congestion in segment C resolves faster. One possible reason is a significant decrease in incoming traffic to C → B → A. The CPG may be caused by high traffic demand to multiple schools.
There is a high school (yellow polygon) near the CPG. Side-way parking and the dropping off of passengers may contribute to the congestion. In addition, to the east, there are an elementary school and a middle school close to the study area.

Results and discussion
In Figure 11, congestion in segment D lasts much longer than in the other segments. The congestion pattern on segment D between 7:00am and 8:10am is similar to the pattern on segment E between 7:55am and 8:35am. The traffic dynamics in segments F and G fluctuate. Overall, the traffic dynamics indicate a low intensity of traffic congestion in this area, and traffic mostly travels from segment E to segment D. There is a large company (red polygon to the north of segment D) and a large mall (grey polygon to the north of segment G) in this area, which contribute to the traffic demand. In addition, to the west, there are two schools close to this area. Figure 12 presents a CPG with two root segments (H and J). In this CPG, there are three propagation paths: J → K, H → K and H → L → M → N. All road segments are local streets underneath highways. There are three congestion events in segment K, and the third event is associated with congestion events in both segments H and J. Segment H is congested for most of the duration of the congestion event, while segment J is congested for three short periods (about 5 min each). The timetable of trains (Anon 2016) indicates that multiple trains arrive at or depart from the station in this area around noon. Traffic demand to the train station, together with pick-up and drop-off activities, may be the cause of the CPG.
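The structure of the CPG in Figure 12 can be made concrete with a small sketch that stores the propagation relationships listed above as directed edges, finds the root segments and enumerates the root-to-leaf propagation paths. The edge list is taken from the three paths named in the text; the function name `propagation_paths` and the adjacency-list representation are assumptions, not the data structure used in this study.

```python
from collections import defaultdict


def propagation_paths(edges):
    """Return (roots, paths) for a directed acyclic graph given as (source, target) pairs."""
    successors = defaultdict(list)
    nodes, has_predecessor = set(), set()
    for u, v in edges:
        successors[u].append(v)
        nodes.update((u, v))
        has_predecessor.add(v)
    # Roots are segments whose congestion is not explained by any other segment.
    roots = sorted(nodes - has_predecessor)
    paths = []

    def walk(node, path):
        if not successors.get(node):  # leaf: no further propagation
            paths.append(path)
            return
        for nxt in successors[node]:
            walk(nxt, path + [nxt])

    for root in roots:
        walk(root, [root])
    return roots, paths


# Propagation relationships of the CPG described for Figure 12.
edges = [("J", "K"), ("H", "K"), ("H", "L"), ("L", "M"), ("M", "N")]
roots, paths = propagation_paths(edges)
```

Here `roots` contains H and J, and `paths` contains H → K, H → L → M → N and J → K, matching the three propagation paths described in the text.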

Conclusion and future work
We present a new data-driven method to capture SPPTC. Due to the difficulty of obtaining adequate ground-truth congestion propagation patterns, the evaluations of detected SPPTC in past studies are limited. We address the issue of evaluating correctness (Section 2.2) by controlling the location and time of congestion propagation in a road network and obtaining GPS trajectory data via the traffic simulation of the ABM. Previous methods do not adapt well to high temporal resolution analysis of traffic data, and thus cannot capture the complex spatiotemporal process that produces congestion propagation, because of the challenges of capturing the time spans of congestion events and the associated propagation relationships. Our method is designed explicitly for high temporal resolution data and aims to address those issues (Section 2.1). We formally represent SPPTC as directed acyclic graphs, use smoothing and post-processing procedures to better capture the period of congestion events, and identify propagation relationships between congestion events by time-series analysis. The results show that our method is better at capturing SPPTC than the baseline method tested. Although our method still outperforms the baseline method, the correctness of its detections degrades with sparse data. With the improvement of sensor technology and the development of smart cities, data sparsity in real-world traffic data is expected to become less of a limitation. Currently, one possible way to increase the penetration rate of real-world traffic data is to combine data from multiple sources, such as ridesharing, navigation services and public transit. Identifying the correct time span of congestion events given sparse data remains a very challenging task. The results also show that determining propagation based simply on network topological relationships produces false positive results. Future work will address these limitations.
For example, we will account for the driving directions of vehicles to improve the method's ability to capture the spatial characteristics of congestion propagation.