Clock Offset Estimation for Systems With Asymmetric Packet Delays

This paper proposes a new clock offset estimation that mitigates unwanted link asymmetry for precise clock synchronization. The main contribution is to address the primary and traditional design issue of the IEEE 1588 standard precision time protocol (PTP), which estimates clock offset under the assumption that the delays of exchanged packets are symmetric. To mitigate the issue, we focus on the fact that PTP measures asymmetry variation through the derivatives of its timestamps with respect to the time step. By exploiting the measurement of the variation, the proposed approach defines the asymmetry in the form of a linear differential equation (LDE) and leverages the LDE to define and exclude asymmetry-induced errors. Additionally, we clearly derive the state transition of the asymmetry. Subsequently, we derive a novel state-space model from our approach. The model describes PTP clock offset estimation perfectly, allowing optimal clock offset estimation. We verify the theoretical validity of the proposed method with real data. Our approach improves PTP accuracy by more than thousand times and achieves an accuracy at the level of tens to hundreds of nanoseconds on an asymmetric communication link. Our approach realizes an accuracy comparable to that of PTPv2, without the cost of specialized hardware.

in the networked systems for automation, control, testing, and measurement. For example, many industry standards [3], [4], [5], [6] have specified the use of PTP to satisfy the requirement of synchronization accuracy at the nanoseconds level. Specifically, the IEEE 802.1AS [3] standard applies PTP for audio and video networking in a virtually bridged local area network (LAN). Additionally, the LXI industrial standard [4] applies PTP to synchronize distributed measurement systems over the Ethernet. The IEC 61850 power industry communications include the IEC/IEEE 61850-9-3 [5] PTP profile for power utility automation. Furthermore, the SMPTE [6] motion imaging standard suggests applying PTP to synchronizing broadcasting equipment replacing systems.
One primary and traditional issue of PTP lies in the design that the protocol uses an approximately calculated packet delay for its clock offset estimation. The protocol assumes that the delays of the exchanged packets in a master-slave architecture are symmetric, i.e., the transit time of a packet communicated from the master to the slave is equal to the transit time of the packet conveyed from the slave to the master. However, in real environments, this assumption does not hold true as delays are asymmetric. Further, the asymmetry can intractably vary over time and be magnified with congestion or increasing distance from the communication link. The disagreement between assumption and reality causes an error in the clock offset estimation. This practical error limits the use of PTP and increases the cost of expanding the PTP network.
Several studies [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] have attempted to address the aforementioned design issue of PTP. For instance, PTPv2 [7] introduced the transparent clock (TC), which is implemented inside specialized PTPv2-enabled networking devices to record the residence times of the PTP packets in the devices. PTPv2 applies these residence times to the offset estimations to reduce the errors due to asymmetry. PTPv2.1 [18] includes an optional specification for asymmetry calibration. For bidirectional media, such as a single-mode fiber, which provides an almost constant relationship between the transmissions in both directions, the calibration uses the known relative difference between the two transmission times to compensate for the asymmetry-induced errors. The most popular PTP stacks, Linux PTP [14] and PTPd [15], implement packet delay filters such as moving median, averaging, or infinite impulse response (IIR) to minimize the effects of the fluctuating asymmetry. Moreover, Karthik et al. [8] proposed a method to mitigate the asymmetry of queuing delays of packets by approximating these delays using the Gaussian mixture probabilistic model and a variant of the expectationmaximization (EM) algorithm. The authors also showed that their method converges to a lower bound of the clock synchronization, which is derived using the invariant decision theory [17]. Exel et al. [9] reported that the asymmetry of a link delay occurs by the different twist rates of the Ethernet cable pairs used for transmission and reception; thus, they configured links to ensure that the transmission path of the PTP packets used the same cable pairs. Giorgi et al. [10] proposed software-based packet sampler with an oversampling strategy, which filters out packets with delays exceeding a certain value to reduce the possibility of asymmetry. The reliable range of packet delay is derived with the soft-min machine learning function and weight function defined by the Boltzmann distribution. Puttnies et al. [19] presented the concept of PTP-linear programming (PTP-LP). Assuming that the upper and lower bounds of the slave clock can be linearly approximated, PTP-LP explores these bounds with constraints defined by sets of timestamps and estimates the value of the slave clock by averaging the explored bounds. Geng et al. [20] proposed an approach called HUYGENS. HUYGENS first exchanges large amounts of packets, and rejects impure packets with queuing delay, random jitter, and timestamping noise. It then uses a support vector machine with pure packets to estimate clock offset and skew. Finally, it constructs a reference spanning tree (RST) from the network graph, and performs a collective algebraic loop correction of nodes in the RST to improve the accuracy.
In this paper, we therefore propose a new offset estimation method that mitigates the negative impact of asymmetry. Based on the fact that the PTP timestamps represent the intractable asymmetry variations, we define the mathematical model of the asymmetry. Specifically, to achieve robustness against errors due to asymmetry, we first define the asymmetry in the form of a linear differential equation (LDE) using the derivatives of the timestamps with respect to the time step in the PTP architecture. Then, to improve the clock synchronization accuracy, we leverage the LDE to define and exclude errors due to the asymmetry. The novelty of our proposed approach is that it directly models the asymmetry itself, rather than estimates the asymmetry based on the packet delay values or probability distributions as in [8], [16], and [17]. On the comprehension that a linear combination of the derivatives represents the measurement of asymmetry variation, our approach exploits the derivatives to explore the asymmetry and to track asymmetry-induced errors; the use of the derivatives ensures the ability to accurately describe the dynamically changing asymmetry due to the nature of measurement. Thus, the approach achieves accuracy of the order of tens or hundreds of nanoseconds in a software-based manner even under heavy traffic environments.
We also propose a novel state-space model that is completely consistent with PTP. The virtue of the state-space analysis is the integrated allowance of optimal estimations through stochastic models, such as the Kalman [23] or ARIMA [24]. Previous studies [25], [26], [27], [28], [29], [30] have proposed state-space models for clock synchronization. However, the effect of asymmetry has not been fully considered or resolved in these approaches. For example, the Giorgi state-space model [26] simply assumes that the error from asymmetry constitutes a zero-mean Gaussian variable and no longer deals with the asymmetry. Unlike other studies, we clearly derive the state transition of the asymmetry and extend it to the impact of asymmetry. Furthermore, a complete state-space model of PTP is derived by considering its clock offset estimation as an observation. To the best of our knowledge, this is a pioneering work on the PTP that considers the state-space of the asymmetry.
The strength of our approach is threefold. First, this approach is a purely software-based method and does not assume any application-specific condition or require specialized hardware support; thus, it can be deployed on any system without significant effort. Second, our approach maintains the PTP architecture without requiring additional packet exchanges. In clock synchronization scenarios, it is important to reduce unexpected packet exchange overheads to ensure periodicity of the clock offset estimation. Therefore, our approach retains the PTP architecture to minimize any additional overhead while maintaining all the advantages of PTP and the ability to be combined with any PTPbased approach. Finally, the proposed approach is designed to be capable of handling the asymmetries of total packet delays. The packet delay is composed of nodal processing, queuing, transmission, and propagation delays. The packet delay considered in our model includes all these delays occurring between the timestamps on the master and slave nodes.
The remainder of this paper is organized as follows. Section II discusses the terminologies and elaborates on PTP. Section III describes the limitations of the PTP. Section IV presents the proposed approach, state-space model, and an optimal estimator. Section V provides the evaluation setups and results. Section VI shows a case study of our approach. Finally, Section VII concludes our paper. Table I provides notations and symbols adopted in this paper. Unless further specified, plain letters, boldface letters, and boldface capital letters denote scalars, column vectors, and matrices, respectively.

A. Terminology
Clock offset refers to the time difference between different clocks. Suppose t denotes a referenced master clock time 1 and C(t) denotes a local slave clock time. The clock offset θ(t) The goal of clock synchronization is to estimate and reduce such clock offsets. In general, a clock synchronization process is periodically repeated because the offset estimation is not error-free, and the clock offset varies with time.
Clock skew refers to a phenomenon in which clocks operate at different frequencies, even if their designs and settings are the same. Suppose that both clocks are set to a nominal frequency. In practice, their frequencies are not the same as the nominal frequency, and vary over time. The dimensionless clock skew, γ(t), between the frequencies is defined by where f m and f c denote the clock frequencies of the master and slave, respectively. Clock drift is the accumulated discrepancies between the clocks, which originates from clock skew. The drift caused in a time duration from t i to t j , τ | tj ti , is defined as The relationship between the clock offset θ(t), and the clock drift τ | t 0 is given as where θ(0) denotes an initial offset. One-way packet delay (OWPD) is the the total time required for a packet to be transmitted from a source and arrived at the destination; it includes all possible delays in the network, such as nodal processing, queuing, transmission, and propagation delays. Suppose D s and D d denote the OWPDs of packets from a source to a destination and vice versa, respectively. These OWPDs are given as where t 1 and t 3 are the transmitted times and t 2 and t 4 are the received times of the packets, as shown in Fig. 1. Degree of asymmetry between two packet delays is the difference between the packet delays transmitted in opposite directions: Estimation error is the difference between an estimated value and a true value. For example, the error of an offset estimation is given as whereθ(t) denotes an estimated value of θ(t).

B. IEEE 1588 PTP
PTP is a two-way packet-exchange-based protocol for synchronizing local slave clocks with their referenced master clocks. We herein summarize the core aspects of the end-toend synchronization of PTP, which can be applied without any hardware constraints.
The synchronization between the master and slave clocks is achieved as shown in Fig. 1. To remove the clock offset at t 1 , θ(t 1 ), the master and slave exchange timestamping packets, using which, the slave estimates its offset from the master clock and adjusts the value along the following steps: • Step 1: At time t 1 , the master sends a sync message to the slave. After transmitting the sync message, the master measures timestamp t 1 and subsequently sends a followup message with a timestamp t 1 . • Step 2: After receiving sync and follow-up messages, the slave measures the local timestamp C(t 2 ) and calculates C(t 2 ) − t 1 , which is equal to the sum of the clock offset and packet delay between the master and slave, θ(t 1 ) + D s . • Step 3: The slave sends a delay-req message to the master, measures the local clock timestamp C(t 3 ), and subsequently sends a follow-up message. • Step 4: After receiving the delay-req, the master measures the arrival time t 4 and sends a delay-resp message with a timestamp t 4 . • Step 5: Once the slave receives a delay-resp, it estimateŝ D s andθ(t 1 ) using (8) and adjusts its clock byθ(t 1 ). Note that the PTP estimatesD s using a simple averaging under the assumption that D s = D d .

III. PROBLEM STATEMENT
One of the primary and traditional problems of PTP is that it uses a simple approach such as averaging to estimateD s and θ(t 1 ) under the assumption that the OWPDs are symmetric (D s = D d ) as in (8b). In the real world, the assumption is often violated because the transit times of packets in the master-slave and reverse slave-master paths can be different (D s = D d then λ = 0). Inside this disagreement between assumption and reality, a nonzero asymmetry, λ, occurs and causes critical estimation error εθ in (8). Fig. 2 shows the asymmetry, λ k (= D d,k − D s,k ), and offset estimation error, εθ ,k (= εθ(t 1,k )), of the IEEE 1588 PTP in a real environment (the base environment in Section V), where the rightmost subscript k denotes the repetition number of PTP packet exchange. As shown, λ k can extend up to several tens of milliseconds. Moreover, the asymmetry is nonlinear and changes dynamically over time. As presented, λ k directly affects εθ ,k . The asymmetry-induced estimation error may extend up to tens of milliseconds, even though there is only one hop in the communication link and no network congestion.
The asymmetry and the induced error are intractable. This is because packet delay has multiple components, and their asymmetries are caused by different factors. For example, the sync and delay-req messages can experience different nodal processing delays because of the inconsistent time consumption of processes including routing table lookup. Their queuing delays also can be different because they are affected by network conditions such as the congestion caused when messages are communicated. Transmission and propagation delays may not be symmetric owing to the different data rate or cable twist rate. In actual environments, various factors, including these, may occur in combination, and continue to change over time. As a result, the asymmetry and error become nonlinear, unpredictable, and non-stationary variables with respect to the time.
This design issue, i.e., asymmetry-induced error, has not been fully addressed since PTP was first introduced in 2002. The reasons why this issue is challenging are as follows: i) it is difficult to mathematically analyze the asymmetry due to the non-linearity, unpredictability, and non-stationarity, and ii) the relationship between the asymmetry and error is unknown.

IV. PROPOSED APPROACH
To address the design issue of PTP definitively, we propose an enhanced clock offset estimation method and focus on three questions, 1) What model can accurately represent the variations of the intractable asymmetry? 2) How can we define the relation between the estimation error of (8a) and defined asymmetry? 3) How can precise clock offset estimation be achieved using the relation? To explore answers to these questions, we define a discrete linear differential equation (LDE) with respect to k for the asymmetry and relation between asymmetry and offset estimation error to design a clock offset estimation robust to delay asymmetry (Section IV-A). Later, we provide a concrete example of the estimation by constructing a state-space model that is completely consistent with PTP (Section IV-B). Finally, based on the presented state-space model, we obtain an optimal clock offset estimation using the Kalman filter and a Bayesian hyper-parameter inference (Section IV-C).
The core concept of the proposed approach exists in the LDE in Section IV-A. The LDE plays a key role in reducing errors resulting from the intractable asymmetry. Applying the computational principle of the LDE is known to be effective in managing non-stationary processes; the principle is applied to compute the current value of a variable based on its recent differentials or values [31]. In the LDE, we focus on representing Δλ k = λ k − λ k−1 in a numerical manner using the PTP timestamp measurements rather than approaching the problem via model analysis, because finding a general model of Δλ k is difficult owing to the non-linear and unpredictable properties. We benefit from the advantages of the given measurements to simplify the description of Δλ k , thereby linearizing the random relationship between the sequential asymmetries.
The use of the timestamp measurements is beneficial in solving the practical PTP design problem, in which the randomness of asymmetry and clock skew are intertwined. The LDE can determine Δλ k without randomness if the timestamp measurements are provided in an ideal case without clock skew. We demonstrate this as Corollary IV.1 in Section IV-A. Inspired by Corollary IV.1, our LDE-based approach using measurements attempts to approximately reduce the problem space to only consider the clock skew that contaminates the measurements. To consider the randomness of the clock skew, we subsequently derive Corollary IV.2 in Section IV-A, adopt a predefined clock skew estimator, configure the observation error, and construct a state-space model [31] in Section IV-B.

A. Asymmetry, Error, and Estimation
First, we derive an LDE for the asymmetry; Equation (6) implies that an LDE for the degree of asymmetry λ k can be derived from the LDEs of the OWPDs of the sync and delayreq signals. Therefore, we start by building the LDEs for D s,k and D d,k as where Δ denotes the numerical differential, e.g.
Note that reference timestamps t 2,k and t 3,k are not available in practice, irrespective of whether the clock skew exists, because D s,k and D d,k are unknown.
Corollary IV.1: In an ideal case without clock skew, the LDE of λ k can be directly determined using with only the given timestamps. From (11), in cases with negligible clock skew, we can approximate the LDE for λ k as For example, if we use atomic clocks, (12) can be adopted.
In cases with non-negligible clock skews, for example, if we use oven-controlled or temperature-compensated crystal oscillators, we cannot directly process (10) because the drift terms are not approximated to zero and reference timestamps Proof: The derivation uses the relation between D s,k and D s,k in (8b). If we replace C(t 2,k ) and is given by (15).
Finally, we design a clock offset estimation robust to the asymmetry for the PTP based on the above. The new clock offset estimation, which takes the error of (15) and LDE of (10) into account, is designed as (16).
The intuitive estimation (16), however, cannot be directly used in real implementation because εθ ,k includes drift terms that should be estimated through a stochastic model. Therefore, we propose to expand (16) to (17) where g(•) denotes an estimator. The use of a general function g(•) implies that our LDEs and εθ ,k can be incorporated into various stochastic models. Next, to provide a concrete and practical example, we construct a state-space model.

B. State-Space Model of PTP
The presented model suggests a method to address the state-space related to asymmetry and is completely consistent with PTP. Consequently, the proposed state-space model establishes the basis for optimal clock offset estimations on asymmetric communication links via stochastic models, as shown in the next section.
In the design of the observation model, our approach considers the estimate of (8a) as an observation of PTP and uses (15) to describe the observation error of (8a). In the design of the state and error transitions, the model adopts a predefined clock skew model [32] and the LDE of (14). For the transition noise, the model obtains an analyticity from the properties of an unbiased clock skew estimator analyzed in [32].
To avoid repeated explanations, we consider environments with non-negligible clock skews. Therefore, we use ζ k rather than λ k without loss of generality.
Eventually, we can have a state-space model for the proposed approach as follows: where the transition of ζ k is derived by where εθ ,k describes the remaining observation errors except for − 1 2 ε Δζ,k . Note that ε Δζ,k can be replaced with ε Δζ,k when ζ k is observable. The hidden states, x k , control inputs, u k , and transition noise, w k are defined as Note that concrete models of the state transition matrices, A k and B k , and the transition noises εθ ,k , εγ ,k , and εζ ,k are determined by clock or clock skew models. For example, if one of the models in [32], [33], and [34] is adopted, the transition noises are postulated to follow the normal distribution.
In this paper, because we use the models in [32], the matrices and noises are derived by and respectively, under the assumptions that η(t)dt is a standard Wiener process [35] and the density function η(t) follows the independently and identically distributed (i.i.d.) Gaussian of where (σ η ) 2 is the step variance of random walk skew; for the detailed proof, see Appendix A.1-(i), (ii), and (iii) in Zhong et al. [32].

C. Optimal Clock Offset Estimation
We provide the optimal clock offset estimation robust to delay asymmetry based on our state-space model. The optimal estimation uses the Linear Kalman Filter (LKF) [23] for state estimation and a Bayesian algorithm for hyper-parameter inference.
In order to use the LKF, adopting the symbols {Q, R, S} commonly used in state-space models, we define the relationships of the noises w k and v k as where Cov(•) denotes the covariance function of •, which ensure the independence between the noise terms w k and v k [36]. The independence condition is a fundamental requirement for optimal estimation based on the state-space model, such as the Kalman-filter-based estimation. To guarantee the required condition, (23a) handles the effects of S k in the state-space model by introducing the terms , and (23a), respectively: In our model, S k = 0 owing to the term ζ k in (18a). Moreover, the uncertainty of ζ k introduced at k-th step for the given ζ k−1 stems from Δζ k where ε Δζ,k includes the uncertainties of γ k−1 and εζ ,k related to clock skew, as seen in (18d). Hence, the total offset observation error, which is the first component of v k , is correlated with εθ ,k and εγ ,k because the transition errors are also related to clock skew. (23a) removes the effects of the correlation.
We provide a reference to the LKF process and its complexity analysis. Extensive literary works are available for the LKF since it has been successfully used in many fields, including networks, measurements, and robotics. The LKF process and analysis are summarized in [38].
In brief, in the LKF, the posterior for given historical observations, p(x k |y 1:k , σ), is given as The LKF produces optimal estimates adaptively to minimize the mean-squared error (MSE) under the Gauss-Markov assumption. The LKF is lightweight in terms of computation and memory. The complexity of the LKF is dominated by a few matrix inversions and it does not require storing estimates or data records owing to the Markovian property. Our hyper-parameter inference is inspired by the integrated nested Laplace approximation (INLA) [39], [40] and a sequential Bayesian inference [41]. The inference uses an approximation of the marginal posterior of σ, p(σ|y 1:k ) ∝ p(x 1:k , y 1:k , σ)/ p(x 1:k |y 1:k , σ), and a grid search.
We first use the posterior of the LKF in (24) to construct the denominator p(x 1:k |y 1:k , σ). Then, we use the Newton-Raphson method [42] to determine the modeσ = argmax σ p(σ|y 1:k ). Moreover, to avoid the local optima problem, we search a grid of σ around the mode via σ(z) =σ + U Λ 1 2 z, where U ΛU T is the eigendecomposition for the inverse of Hessian and z is a standardized multivariate Gaussian variable. We test whether the mode is a local optimum by evaluating p(σ(z)|y 1:k ).
Note that the grid search is performed only once in the startup phase for the evaluation noted in Section V. In the evaluation, we attempt to focus on a lightweight design to ensure that the approach is practical. Therefore, after the startup phase, the inference is periodically performed using the approximation based on the LKF and Newton-Raphson method for recent N observations, where N is empirically derived value. The use of the Newton-Raphson method is allowed because the LKF has a closed form. The Hessian matrix is computed from the model error 1:k defined in (18c) and numerator of (24).
Complexity Analysis. We execute a fine-grained grid search for the grid points M ≥ 8 × The inference, which is performed in both of the startup and synchronization phases, has a complexity of O(L 3 ) for each iteration, where L is the dimension of y k or v k . From a theoretical point of view, the convergence of the iteration is quadratic because it is based on the Newton-Raphson method. From another empirical point of view, the inference conservatively requires a few iterations less than twenty to locateσ in the environments without system faults. Note that the inference starts from the optimal point found by the grid search, and {σ η , σ εγ ,k } are bounded or slowly vary depending on temperature, voltage, and chip aging.

V. EVALUATION
This section presents the evaluation of the proposed approach for clock offset estimation. The main purpose of the evaluation is to determine whether the approach addresses the design issue of PTP and clock offset estimation error εθ ,k caused by the asymmetry λ k of the OWPDs, D s,k and D d,k . Because our main approach proposed in Section IV-A has a general form, the evaluations are conducted with the novel optimal estimation of Algorithm 1 presented above.
We compare the performances of our approach with those of PTP [1], previous approaches [8], [14], [19], [26], and PTPv2 [7]. Comparisons with PTP and previous approaches show how well our approach addresses the asymmetry-induced error. Comparison with PTPv2 shows the potential of our approach as an alternative to TC, which is currently the most powerful means to improve PTP. TC has the disadvantage of requiring a compatible network switch and is hence expensive. Thus, if our algorithmic method can replace PTPv2 that requires TC-enabled hardware, it can offer considerable benefits to the industrial field.
The comparisons are conducted in eleven different network environments, including baseline environment as a stable closed network. The other ten environments are configured with different network loads, traffic patterns, and packet size distributions.

A. Evaluation Suite
We build an evaluation suite, as illustrated in Fig. 3, consisting of two development boards (Freescale P2020RDB-PCA [43]), one each for the master and slave nodes, a PTPv2enabled switch (CISCO IE-3000-4TC-E [44]) implementing TC, an oscilloscope (Tektronix MSO 5034b [45]) for measurement, and a desktop computer (PC with 3.2GHz 8 cores of Intel Xeon W and 32GB 2666MHz DDR4 ECC memory) to control the network load.
The master and slave nodes are connected via a wired network to form a closed local area network (LAN). For clock synchronization, the master and slave nodes exchange PTP packets once every second. The PTPv2-enabled switch records the residence times of sync and delay-req packets in the correction field of the respective follow-up packets by the default end-to-end TC configuration [7]. The oscilloscope monitors clock offset θ(t) and clock skew γ(t), and reportsγ k to the slave node. For the measurements, the probes of the oscilloscope are connected to the master and slave to capture the periodic pulses generated from the clocks. The PC sends network traffics to control noise on the packet delays between master and slave nodes.

B. Evaluation Setup 1) Timestamping:
We use MAC and PHY hardware timestamps supported by the P2020RDB-PCA [46], [47]. For the timestamps, the GNU/Linux kernel v3.12.37 of QorIQ SDK v1.9 [48] is installed. The device driver gianfar enables the timestampings by default, which refer to the value of a 64-bit timer counter composed of two 32-bit registers, ETSEC_1588_TMR_CNT_H/L. The timer counter is set to operate at 200 MHz and increments by five (as set by TCLK_PERIOD) to ensure granularity on nanoseconds level.

2) Monitoring:
We capture the periodic pulses generated from the counters of the master and slave to monitor clock offset and clock skew. The P2020RDB-PCA contains two fixedperiod event registers (ETSEC_1588_TMR_FIPERa, a = {1,2}) for two separate periodic pulse generations. For clock offset monitoring, we set the ETSEC_1588_TMR_FIPER1s of both the master and slave to 10 9 to generate periodic pulses per second (PPS). For clock skew monitoring, we set the ETSEC_1588_TMR_FIPER2s of both the master and slave to 2.5 × 10 4 to generate a periodic pulse per twenty-five microseconds. To capture the pulses, the probes are connected to the output pins of the J6 on the P2020RDB-PCAs. The TekScope preinstalled on MSO 5034b is set to measure the delay between the master and slave pulses as well as the momentary frequencies of the pulses. The monitoring requires a minimum sampling rate, 1 GS/sec (gigasamples per second), for nanosecond precision, and the MSO 5034b is capable of sampling at most 20 GS/sec.
3) Reporting: Our approach uses two observations,θ k andγ k . As described in Section IV,θ k is reported by PTP. Forγ k , 2 we use the instrument control toolbox of MATLAB on the MSO 5034b and the secure shell protocol (SSH) to automatically report the measurements to the slave. The MSO 5034b is set to reportγ k at least three times per second with a sampling rate of 5 GS/sec. The slave is programmed to use the most recent observations; the observation errors due to the query time and reporting latency are dealt with by εγ ,k .
The profiled probability density functions (PDFs) of the absolute asymmetry under an ideal network environment and the ten different network load configurations are plotted in Fig. 4. Note that in an ideal environment, which is marked as Baseline in Fig. 4, only the PTP packets are exchanged and thereby the asymmetry is minimized. Fig. 4(a) shows the distribution of the asymmetry in the SV settings. As shown, the distribution of asymmetry value is widened as the network load increases. Under a heavily loaded network, the asymmetry values do not fall within a certain level, thus making it difficult to effectively eliminate it. Fig. 4(b) shows the asymmetry distribution in the ATM1 settings. The distribution of asymmetry values is also wider compared with the ideal environment. Contrary to our expectations, the asymmetry distribution is much narrower than that of SV settings. This implies that asymmetry is more affected by the network traffic profile (e.g., packet size) at the moment the PTP packets are exchanged, rather than the long-term network load. The results were similar in the ATM2 setting, then the ATM2 result is omitted for the sake of clarity.

5) Parameter Configurations:
In our approach, in the startup phase of the clock synchronization, the initial values of the hyper-parameters σ η , σ εθ ,k , and σ εγ ,k are inferred by the method described in Section IV-C. In our experiment, the inference is performed for 30 min. During the clock synchronization phase, the hyper-parameters are periodically updated every 5 min based on the previous 5 min of data with approximation of the marginal posterior. In the startup phase, the initial states are configured using PPS-based offline measurements. The manual configuration of the initial states is thus feasible because PTP is mostly implemented in LANs.

1) Representative Results:
We first provide the results evaluated in the baseline network environment and the harshest network environments, SV-100, ATM1-8020, and ATM2-8020. Table II lists the statistical results of the one-hour clock offset estimation (3600 repetitions of the clock synchronization) of PTP [1], previous approaches (Linux PTP [14] and Karthik et al. [8]), PTPv2 [7], and our approach. We measured the absolute offset estimation error |εθ ,k |. To ensure reliability of our results, one-hour clock offset estimation experiment was repeated fifteen times, and we calculated the overall means of the average, variance, and maximum values.
Our clock offset estimation approach improved the accuracy of PTP by more than a thousand times in all the network The results indicate that our approach significantly improved the accuracy of PTP while maintaining the PTP architecture, without any additional packet exchanges.
Our approach also outperformed the Linux PTP [14] and Karthik et al. [8]. For the latest Linux PTP v3.1, we set the default configuration to use the built-in moving median (MM) filter and proportional & integral (PI) control servo. The Linux PTP showed slightly better performance than PTP because it implements simple techniques to react against practical conditions along with the delay filter and servo. For evaluating approach of Karthik et al. [8], we implemented the approach as reported by the authors without any modification. Subsequently, we used pre-profiling to determine the hyperparameters, including the number of modes and reset time. The approach adopted the Gaussian mixture model to assume the stochastic distributions with multiple modes of multiple parameters such as mean, standard deviation of the queuing delay, and asymmetry. The approach also applied a variant of the expectation & maximization inference for the recursive estimation of the parameters. As a result, the approach exhibited a better performance than those of PTP and Linux PTP. However, all the previous approaches without supports from the PTPv2-enabled switch did not provide accuracy at a nanoseconds level, unlike our approach.
Moreover, the performance of our approach was comparable to that of PTPv2. Our approach and PTPv2 both showed tens and hundreds of nanoseconds level accuracy. The accuracy of our approach was similar (Baseline, ATM1-8020, ATM2-8020) with or slightly better (SV-100) than that of PTPv2. In terms of variance and maximum values, our approach had more robust performance than PTPv2. Overall, our approach had the capability to achieve a competitive advantage over PTPv2 without incurring additional costs from specialized hardware. The reasons why this was possible are stated in Section V-C.3.
We also implemented and evaluated other approaches [19], [26]. However, we intentionally omitted the results from this paper because our evaluations revealed that these approaches performed worse than PTP. Instead, we share our implementation experience and evaluation results on a Github page [52] for the readers who wish to employ or improve these approaches.
2) Results With Various Network Environments: Next, we evaluated the estimation errors in eleven different network environments. Based on the order of error, we divided the evaluated methodologies into two groups; {PTPv2, Proposed approach} with nanoseconds (under 10 −6 sec) and {PTP, Karthik et al., Linux PTP} with microseconds order (over 10 −6 sec). Results of two groups are shown in Fig. 5 In the upper graph of Fig. 5, which plots the results obtained from {PTPv2, Proposed approach}, three interesting aspects are observed. First of all, in all the cases of SV, the proposed approach provides better and stable performance (less estimation error) than PTPv2. Meanwhile, the PTPv2 error was relatively large and it rapidly increases with network load. In the SV cases, as the network load increases, asymmetry of total packet delay other than the residence time also increases. While PTPv2 only mitigates the asymmetry of residence time, our approach deals with the asymmetry of total packet delays in a stochastic manner, thus mitigate the disturbance and outperformed PTPv2.
Second, our approach offered comparable performance to PTPv2 in the ATM cases. The traffic models of ATM cases have realistic properties, as stated in [50], [51]. In such realistic environments, our approach achieved tens of nanoseconds of accuracy in a software-based manner. In the ATM cases, PTPv2 slightly outperformed our approach, unlike in Baseline and SV cases. We analyzed that it comes from the noise introduced to the observational parameters when the burst traffic occurs.
Last, variations in packet sizes may cause harsh conditions for the precise clock synchronization. In the ATM cases, as the network load rate increases, the error of PTPv2 slightly increased. (5.08 × 10 −8 , 5.34 × 10 −8 , 5.52 × 10 −8 ) and (4.79 × 10 −8 , 5.34 × 10 −8 , 6.35 × 10 −8 ) in (4020, 6020, 8020) of ATM1 and ATM2, respectively. On the other hand, in the SV cases, the error was rapidly increased and amplified up to 8.21 × 10 −7 . For reconfirmation, we conducted additional experiments. We simply fixed the packet size to 512 bytes and configured the rest of the environments are the same as the SV. In this additional experiments, the PTPv2 errors were in the original SV cases with varying packet sizes. From these results, we concluded that streams composed of various packet sizes can be negatively critical to the offset estimation.
In the lower graph of Fig. 5, the results obtained from {PTP, Linux PTP} and the {Karthik et al.} are plotted along the left and right axes with different scales of the units. Fig. 5 reveals that Linux PTP and the approach of Karthik et al. performs better than PTP in all the network environments. Linux PTP possessed a relatively large error of approximately two hundred microseconds. The results were predictable because MM cannot fully handle a completely random packet delay, including queuing delay, and the PI controller was not supported with a full system model and fine tuning. The approach reported by Karthik et al. exhibited better performance than those of PTP and Linux PTP. The estimation error was more than thirty times lower than that of PTP, and more than twenty times lower than that of Linux PTP. Although the performance of the approach degraded slightly in several network conditions such as SV-75 and ATM2-4020, the estimation error was maintained at less than ten microseconds in all the network environments.
3) Detailed Comparison With PTPv2: Here, we present comparisons with PTPv2 based on the results in the Baseline and SV-100 network environment. For comparison of long-term performance, additional 15 (total 30) evaluations were conducted.
Baseline: The results of 30 trials of one-hour clock offset estimations are illustrated with the box plot in Fig. 6. As shown in Figs. 6(a) and 6(b), our approach showed good overall performance compared to PTPv2. In most trials, the averages of the estimation errors were less than 5.00 × 10 −8 sec, as indicated by the central values of the boxes, and the variances were low, as indicated by the short heights of the boxes. Moreover, in most cases, there were a small number of outliers and maximum estimation errors of no more than 10 −7 sec. As shown in Fig. 6(b), PTPv2 showed performance results similar to those reported since 2008. PTPv2 had stable performance in terms of average estimation errors with low values of less than 5.00 × 10 −8 sec. However, the variances of the estimation errors were relatively high, there were too many outliers, and the maximum estimation errors were close to or greater than 2.00 × 10 −7 sec. Overall, in most trials, our approach showed more robust performance than PTPv2. While the average estimation errors in both approaches were less than 5.00 × 10 −8 sec in terms of variance, outliers, and maximum values, our approach slightly outperformed PTPv2.
SV-100: To evaluate the performance under a heavily loaded network, we compared the results under the Baseline and SV-100 network environments. We chose the best examples among the results, then illustrated the absolute estimation errors |εθ ,k | over time in Fig. 7.
As shown in Fig. 7(a), our approach and PTPv2 showed different response patterns in Baseline. In our approach, the estimation errors along the time axis were sequentially correlated. However, in PTPv2, the estimation errors were independently distributed regardless of the time. As shown in Fig. 7(b), this difference was magnified in SV-100. Our approach had a small number of outliers and relatively narrow peak-to-peak performance. However, PTPv2 had a large number of outliers with severely magnified estimation errors.
These differences in patterns come from the principles of the offset estimation methods. Our approach uses the LDE of ζ k to handle asymmetry-induced errors. The LDE of ζ k is highly effective because it includes u ζ,k that measures variations of the asymmetry. Note that u ζ,k dramatically reduces the uncertainties of the variations and retains only the uncertainty from the clock skew of the measurement, which is managed by the unbiased skew estimator in our model. Moreover, in the estimation of the hidden states, we adopted an adaptive maximum a posteriori (MAP) scheme with the LKF to minimize the MSE. Estimation of the MAP is based on prior knowledge and results; therefore, our approach does not suddenly produce large erroneous results. Consequently, our approach tends to have a small number of outliers. However, PTPv2 newly estimates the offset in each clock synchronization phase independently from the previous estimation, and the results have little correlation with the order of estimations. Therefore, the range of errors changes immediately according to the current network condition. In an idle network, as shown in Fig. 7(a), the range is narrow. However, when traffic is high on the network, the range is enlarged as shown in Fig. 7(b), even up to tens of microseconds.
In summary, our approach and PTPv2 showed similar performances, as shown in Table II and Fig. 6. However, as shown in Fig. 7, our approach exhibited more stable performance than PTPv2 in a dynamically changing network environment.

4) Effect of ζ k :
To provide more information about the key factor of our approach, ζ k , we analyzed the effect of ζ k using the model error k of (18c). The model error k reflects the reliability of the state-space model and enables approximation of the marginal posterior p(Θ|y 1:k ) as mentioned in Section IV-C. Fig. 8 illustrates the model error k . Without taking account of ζ k , as shown in the left panels, the model error has a number of nonlinear components of the millisecond level. Considering of ζ k , as shown in the right panels, the model error only had components with a magnitude of less than 2.00 × 10 −8 sec. This performance improvement was achieved using u ζ,k and the unbiased skew estimator to address the impact of asymmetry. Moreover, in the latter case, the model error approximately followed a zero-mean Gaussian. This is important for optimal estimation based on the LKF under a Gaussian assumption.
The results in Fig. 8 indirectly represent a comparison of our state-space model with the previously proposed Giorgi state-space model [26] of PTP. Without ζ k , our model is the same as that of Giorgi. The only difference is the design of the noise terms, where Giorgi's model adopts the Allan variance [53], while we use ζ k and the unbiased clock skew estimator. The Giorgi's LKF however showed low performance in our evaluation; 7.29 × 10 −4 . This is because the Giorgi state-space model assumes that its k is zero mean Gaussian when setting the observation noise parameters. The assumption does not hold true unless the PTPv2-enabled device is allowed, as seen in the left panels of Fig. 8. Because the Kalman filtering [23] is highly sensitive to the noise parameters and requires strict parameter settings in general, this disagreement lowers the performance in a critical manner. On the other hand, our approach fills the gap, as seen in the right panels of Fig. 8. As shown, using ζ k , our state-space model improved upon Giorgi's model.

5) Computational Cost:
We measured the computational cost of the proposed approach on the evaluation suite. Generally, a low computational cost is an important requirement for clock synchronization because such synchronization often operates with other applications and repeats itself at short intervals. We provide a complexity analysis via the notation of O(•) at the end of Section IV-C. Herein, we report the run-time efficiency evaluated in terms of the execution time.
LKF: LKF operates on P2020RDB-PCA in the slave node. As summarized in Table III in [38], LKF involves only a few algebraic computations and does not use non-deterministic operations. Therefore, the execution time is short and stable. The average of the LKF execution time for each estimation was 5.11 × 10 −5 sec. The variance was 1.80 × 10 −11 sec 2 . Owing to the low cost, the approach was non-tightly executed within a packet-exchange interval of 1 sec.
Newton-Rahpson: The Newton-Raphson optimization for hyper-parameter inference also operates on P2020RDB-PCA in the synchronization phase. The optimization iteratively proceeds until a convergence point is reached. Therefore, its execution time is primarily affected by the number of iterations. In our evaluation, the average and the maximum number of iterations were 7 and 16, respectively. Most of the iterations were used for the inference of σ εθ ,k . {σ εγ ,k , σ η } slowly varied over time, and the inference of each variable required up to three iterations. As a result, the optimization required an average execution time of 3.09 × 10 −1 sec with a variance of 1.13 × 10 −3 sec 2 .
Grid search: The grid search was implemented on a PC. The search for the optimal initial condition (or prior) has a significant implication on the performance of Bayesian filters and inferences. We utilized the MATLAB to generate 8 × 10 7 random samples of z and perform the subsequent algebraic multiplication U Λ 1 2 z to obtain grid points. These required 3.61×10 2 sec. Because the grid search was expensive, we could not employ it in the synchronization phase. If that were possible, the proposed approach would have room for performance improvement.

VI. CASE STUDY
This section presents an application example of the proposed approach to a time-sensitive system. Time-sensitive systems such as distributed real-time, data processing, broadcasting, and factory automation systems often require clock synchronization accuracies at nanoseconds level. In Section V, we verify that the proposed approach satisfies the requirements even under harsh network conditions. Herein, we provide more solid clues to the applicability of our approach to time-sensitive systems in real networks, wherein traffic is not controlled, and multiple hops exist between nodes.
For the case study, we construct a distributed real-time system and employ the proposed approach for physical synchronization among distributed real-time schedulers. Physical synchronization can ensure the time predictability of distributed task scheduling [54].
A distributed real-time system consists of computing nodes connected via a network. In the system, a program comprises the execution of multiple tasks, and the tasks are distributed to each computing node. On each node, tasks are executed in parallel, and the nodes periodically cooperate to achieve a common goal. To that end, the real-time scheduler on each node strictly controls the execution of tasks. The scheduler recognizes the flow of time by referencing time sources to ensure that each task is executed according to a predefined schedule and completes execution following a predefined deadline.
The purpose of physical synchronization is to guarantee that the real-time schedulers of all nodes in a distributed system operate in synchronization by sharing a common notion of time in a master-slave structure. Specifically, it attempts to ensure that two time sources, a system tick counter and a system clock timer counter, referenced by the real-time scheduler of each slave node are synchronized with the master time sources. During real-time scheduling of each node, the tick count value is directly used to determine the start and end times of task execution. The tick count value is incremented by a timer interrupt, which occurs when the clock timer counter detects an elapsed time, which is 100 us in our case by default. From the perspective of the distributed system, the discrepancies between the tick count values can cause positive or negative delays in task scheduling. The offsets and speed differences in clock timer counters can induce different rates of tick counting, longer or shorter task execution times, and postponed or advanced task deadlines. Fig. 9(a) presents a simple example, wherein four real-time computing nodes comprising one master and three slave nodes constitute a distributed real-time system. These nodes execute the same program for the same sequence of multiple tasks for a common purpose such as synchronous broadcasting, swarm control, distributed computing, data fusion, fault-tolerant active-standby. Let us suppose that they execute a bridging program for high-resolution audio and video streaming that requires synchronization accuracies on the order of hundreds of nanoseconds. If the synchronization of the time sources is accurate and centralized synchronization is achieved, as seen in Slave2, all the nodes constituting the distributed system can satisfy the condition to simultaneously transmit desired streams for harmonization. However, as seen in Slave2 and Slave3, if the tick counter and clock timer counter, respectively, are not synchronized, physical synchronization fails, and the desired situation is not realized. Fig. 9(b) illustrates the simplified process of physical synchronization, which includes external and internal clock synchronization. The external synchronization is achieved by clock synchronization methods including the proposed approach, and is performed between the network clock timer counters of the master and slave nodes via Ethernet. The internal synchronization is proceeded in the interrupt service routine (ISR) of the timer interrupt, and occurs between the network clock timer counter and system time sources within the same node.
A. Environmental Setup 1) System: We built a distributed real-time system with four P2020RDB-PCAs to validate the application of the proposed approach to centralized synchronization. On P2020RDB-PCAs, we installed a real-time operating system [55], [56] and executed physical synchronization and programs for logical synchronization during the operating time from 1 PM to 7 PM. For the clock timer counters, we used ETSEC_ 1588_TMR_CNT_H/L and DEC, a decrementing counter of the e500v2 core, as the network clock timer counter and a system clock timer counter, respectively. The system tick counter was incremented by one in the ISR of the decrementer interrupt in a scenario wherein the internal synchronization modified the system tick counter and the decrementer auto-reload register (DECAR).
2) Network: We used an open network shared by hundreds of computing nodes for research and development. Traffic over the network was real, not controlled. To setup a multi-hop environment, we placed four networking devices connected to the internet between the master and slave nodes. We provide the statistical profile of the real network on a Github page [52].

B. Performance Measurement
We present the results of the proposed approach and PTP [1]. The results of other approaches [8], [10], [19] have been provided on a Github page [52]. Note that Linux PTP [14] was not implementable because it was designed for Linux systems. PTPv2 [7] was also excluded because all the networking devices had to be PTPv2 compatible, and configuration of the entire PTPv2 system was prohibitively expensive.
1) Physical Synchronization: We first present results of the external synchronization recorded by the following metrics: Average, Variance, and Maximum used in Section V. The results represent the performance of clock synchronization in a real-world network environment with uncontrolled traffic and multiple hops. In a real environment, the proposed approach demonstrated good performance. Although the distance between nodes and the experimental duration increased, its accuracy was maintained as that in Section V. For the metrics, the approach resulted in absolute errors of 7.54 × 10 −8 sec, 1.77 × 10 −15 sec 2 , 2.06 × 10 −7 sec. We analyze that the superior results were achievable because ζ k dealt with the asymmetry of a total packet delay. Conversely, compared with the results obtained in the evaluation, PTP demonstrated worse performance. For the same metrics, PTP has absolute errors of 2.98 × 10 −4 sec, 1.82 × 10 −6 sec 2 , and 1.96 × 10 −2 sec.
Moreover, we report the performance of synchronization between real-time schedulers following internal synchronization. For this, we used the following two metrics: the number of attempts required to modify the tick counter and the delays caused in task scheduling. In the experiment, the proposed approach did not induce any modifications to the tick counter. This implies that our approach can set up environments wherein the schedulers perceive tick counters as linearly monotonic. During scheduling, the approach caused an average delay of −1.97×10 −8 sec. The result implies that our approach can provide time predictability for task scheduling on the order of tens of nanoseconds in a distributed system. In the case of PTP, the average number of modifications and the average delay were 2134, 1.49 × 10 −5 sec, respectively. These results imply that PTP should not be used for scheduling synchronization. This is because the real-time scheduler selects the executed tasks by referencing the value, and the modification can ruin task scheduling.
2) Logical Synchronization: We determined the degradation in the computed efficiency resulting from logical synchronization. We used the period of the logical clock defined by G7 in [54] as a performance indicator. Generally, the length of the period is a factor affecting the computing efficiencies of distributed real-time systems. Similar to the low clock frequency of processing units, long periods of logical clocks could cause performance degradation or even faults in distributed real-time computing. Compared with the ideal case of G5 presented in [54], the proposed approach lengthened the period by 3.91 × 10 −7 sec from 4.46 × 10 −2 . From this result, we confirmed that we noted a performance penalty under 0.001% when the approach was applied to logical synchronization. Conversely, for the period extension and performance degradation, PTP registered 1.99 × 10 −4 and 44.8%, respectively.

VII. CONCLUSION
In this paper, we proposed a clock offset estimation robust to delay asymmetry for the PTP on asymmetric communication links. The PTP has a critical design flaw for estimating clock offsets under the assumption that the packet delays are symmetric. To address this, we derived an LDE to define the variation of the asymmetry and applied it to define and exclude the asymmetry-induced errors in the PTP. Moreover, to verify the approach, we implemented an optimal clock offset estimator using the proposed scheme and conducted performance evaluations. The estimator was constructed using an LKF and a Bayesian optimizer. Our approach was compared with the PTP and PTPv2 in environments with and without network loads. In the experiments, our approach mitigated most of the asymmetry-induced errors and improved the PTP by several orders of magnitude in terms of accuracy. Compared to the PTPv2, our approach showed performance improvements without incurring additional costs from specialized hardware.