Generalized Time Warping for Multi-modal Alignment of Human Motion

Temporal alignment of human motion has been a topic of recent interest due to its applications in animation, tele-rehabilitation and activity recognition, among others. This paper presents generalized time warping (GTW), an extension of dynamic time warping (DTW) for temporally aligning multi-modal sequences from multiple subjects performing similar activities. GTW addresses three major drawbacks of existing approaches based on DTW: (1) GTW provides a feature weighting layer to adapt different modalities (e.g., video and motion capture data); (2) GTW extends DTW by allowing a more flexible time warping as a combination of monotonic functions; (3) unlike DTW, which typically incurs quadratic cost, GTW has linear complexity. Experimental results demonstrate that GTW can efficiently solve the multi-modal temporal alignment problem and outperforms state-of-the-art DTW methods for temporal alignment of time series within the same modality.


Introduction
Alignment of time series is an important unsolved problem in many scientific disciplines. Some applications include speech recognition [23], curve matching [29], chromatographic and micro-array data analysis [18], activity recognition [14], temporal segmentation [35] and synthesis of human motion [13,22]. In particular, alignment of human motion has recently received increasing attention in computer vision and computer graphics. Major challenges for an accurate temporal alignment of human motion include modeling the differences in subjects' physical characteristics, viewpoint changes, motion style and speed of the action [30,31]. Unlike existing work, this paper addresses the challenging problem of multi-modal alignment of time series coming from different sensors where subjects are performing a similar activity. For instance, consider the problem illustrated in Fig. 1. How can we solve for the temporal correspondence between the frames of a video, the samples of motion capture data, and the accelerometer signal from different people kicking a ball?
Our work is motivated by recent success in extending dynamic time warping (DTW) for aligning human behavior. Zhou and De la Torre [34] proposed canonical time warping (CTW). CTW combines DTW with canonical correlation analysis to temporally align data of different dimensionality (e.g., motion capture and video). More recently, Gong and Medioni [7] proposed dynamic manifold warping (DMW), which extends CTW to incorporate more complex spatial transformations through manifold learning. However, CTW and DMW have three main limitations due to their reliance on DTW: (1) their computational complexity is quadratic in space and time; (2) they address the problem of aligning two sequences, and it is unclear how to extend them to the alignment of multiple sequences; (3) they compute the temporal alignment using DTW, which relies on dynamic programming to find the optimal path, and it is unclear how to adaptively constrain the temporal warping.
To overcome these limitations, this paper proposes generalized time warping (GTW), which allows an efficient and flexible alignment between two or more multi-dimensional time series of different modalities. GTW uses multi-set canonical correlation analysis to find the spatial transformations, and extends DTW by parameterizing the temporal warping as a combination of monotonic basis functions. Unlike existing DTW approaches based on dynamic programming, which usually have quadratic cost, GTW uses a Gauss-Newton algorithm that has linear complexity in the length of the sequence. Moreover, GTW can align several multi-modal time series.
The remainder of the paper is organized as follows. Section 2 reviews previous work on temporal alignment. Section 3 reviews previous work on DTW. Section 4 describes GTW. Section 5 illustrates the benefits of GTW on synthetic and real data.

Previous work
This section reviews previous work on temporal alignment of human motion. In particular, we discuss the challenges of aligning human motion from sensory data in the context of computer graphics and computer vision.
In the computer graphics literature, time warping of motion capture data has been a key component in many animation systems [32,16]. However, most existing techniques are challenged when placing stylistically different motions into correspondence. To account for the variations of human motion performed by different subjects, one popular strategy is to augment DTW with certain regression models. For instance, Hsu et al. [13] proposed to combine DTW with a space warping step, in which each individual degree of freedom from motion capture data can be scaled and translated. In [12], a weighted PCA algorithm [6] was used to find a low-dimensional embedding such that the stylistic part of gesture sequences can be removed. Although these methods yield promising alignment results for motion capture data, they have several limitations when the extracted features in the sequences are very noisy (e.g., video) or come from different modalities (e.g., video and motion capture).
In the computer vision literature, a challenge in sequence alignment is to build view-invariant representations. In a multi-camera setting, it has been shown that both 2-D homography and 3-D epipolar geometries can form a powerful cue for alignment of two or more sequences. For instance, homography-based constraints [1,2,21] have been shown to be useful to align sequences in a planar scene. In addition, the fundamental matrix [25,10] can be used to guide DTW to eliminate the distortion generated by the projection from 3-D to 2-D. Li and Chellappa [17] proposed a general framework for video alignment by optimizing various 2-D and 3-D constraints on a Riemannian manifold. Recent work [3] also illustrated the stability of the self-similarity matrix of actions under view changes. Building upon this observation, Junejo et al. [14] proposed a view-independent descriptor for video alignment using DTW. Observe that most existing works rely on some explicit or implicit estimation of the underlying camera geometry. Unlike these works, GTW is able to efficiently align semantically similar multi-modal sequences.

Dynamic time warping
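Given two time series, $\mathbf{X} = [\mathbf{x}_1, \cdots, \mathbf{x}_{n_x}] \in \mathbb{R}^{d \times n_x}$ and $\mathbf{Y} = [\mathbf{y}_1, \cdots, \mathbf{y}_{n_y}] \in \mathbb{R}^{d \times n_y}$, dynamic time warping [23] is a technique to align $\mathbf{X}$ and $\mathbf{Y}$ such that the following sum-of-squares cost is minimized [34]:
$$J_{dtw}(\mathbf{W}_x, \mathbf{W}_y) = \|\mathbf{X}\mathbf{W}_x - \mathbf{Y}\mathbf{W}_y\|_F^2, \tag{1}$$
where $\mathbf{W}_x \in \{0,1\}^{n_x \times l}$ and $\mathbf{W}_y \in \{0,1\}^{n_y \times l}$ are binary selection matrices that encode the warping paths. Recall that the optimal $l$ is automatically selected by the DTW algorithm. The warping paths, $\mathbf{p}_x \in \{1:n_x\}^l$ and $\mathbf{p}_y \in \{1:n_y\}^l$, denote the correspondence indexes between frames. For instance, the $i$-th frame in $\mathbf{X}$ and the $j$-th frame in $\mathbf{Y}$ are aligned iff there exist $p^x_t = i$ and $p^y_t = j$ for some $t$. In order to find the solution in polynomial time, the warping paths ($\mathbf{p}_x$ and $\mathbf{p}_y$) have to satisfy three constraints: (1) Boundary conditions: $[p^x_1, p^y_1] = [1, 1]$ and $[p^x_l, p^y_l] = [n_x, n_y]$. (2) Monotonicity: $t_1 \ge t_2 \Rightarrow p^x_{t_1} \ge p^x_{t_2}$ and $p^y_{t_1} \ge p^y_{t_2}$. (3) Continuity: $[p^x_t, p^y_t] - [p^x_{t-1}, p^y_{t-1}] \in \{[0,1], [1,0], [1,1]\}$. Notice that the choice of step size is not unique. For instance, replacing the step size by $\{[2,1], [1,2], [1,1]\}$ avoids the degenerate case in which a single frame of one sequence is assigned to many consecutive frames in the other sequence. See [23] for an extensive review of several modifications of DTW that control the warping paths.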
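As a concrete illustration of the dynamic-programming search described above, the following is a minimal sketch for 1-D sequences, assuming the step set {[1,0], [0,1], [1,1]} and a squared-difference frame cost. The function name `dtw` and the toy sequences are illustrative choices, not part of the original paper.

```python
def dtw(x, y):
    """Align 1-D sequences x and y with steps (1,0), (0,1), (1,1).

    Returns the minimal sum-of-squares cost and the warping paths
    p_x, p_y as 1-based frame indices.
    """
    nx, ny = len(x), len(y)
    inf = float("inf")
    # D[i][j] = best cost of aligning x[:i+1] with y[:j+1]
    D = [[inf] * ny for _ in range(nx)]
    for i in range(nx):
        for j in range(ny):
            d = (x[i] - y[j]) ** 2
            if i == 0 and j == 0:
                D[i][j] = d
                continue
            best = min(
                D[i - 1][j] if i > 0 else inf,
                D[i][j - 1] if j > 0 else inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            D[i][j] = d + best
    # Backtrack from the last cell to the first one.
    i, j = nx - 1, ny - 1
    px, py = [i + 1], [j + 1]
    while i > 0 or j > 0:
        candidates = []
        if i > 0 and j > 0:
            candidates.append((D[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((D[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((D[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        px.append(i + 1)
        py.append(j + 1)
    px.reverse()
    py.reverse()
    return D[nx - 1][ny - 1], px, py


cost, px, py = dtw([0.0, 1.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0, 1.0])
print(cost, px, py)
```

The table `D` has n_x × n_y entries, which is where the quadratic time and space cost of DTW comes from.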

Generalized time warping
Generally speaking, there are three major limitations of using DTW to align multi-modal and multi-dimensional time series: (1) DTW relies on dynamic programming (DP) to exhaustively search over all possible warping paths. This search has quadratic computational complexity, $O(n_x n_y)$, in both time and space, which can be restrictive when aligning long sequences. (2) A direct extension of DTW to align more than two sequences is usually infeasible due to the combinatorial explosion of possible warping paths. For instance, a DP-based alignment of $m$ sequences has to search a grid whose size grows exponentially with $m$. (3) As noted in the introduction, it is unclear how to adaptively constrain the temporal warping computed by DP.
To address these issues, this section proposes generalized time warping (GTW), a technique for efficient spatio-temporal alignment of multiple time series. To accommodate subject variability and to take into account the difference in the dimensionality of the signals, GTW uses multi-set canonical correlation analysis. To compensate for temporal changes, GTW extends DTW by incorporating a more flexible temporal warping parameterized by a set of monotonic basis functions. Unlike existing approaches based on DP with quadratic complexity, GTW efficiently optimizes the time warping function using a Gauss-Newton algorithm, which has linear complexity in the length of the sequence.
Notation: Bold capital letters denote a matrix $\mathbf{X}$, bold lower-case letters a column vector $\mathbf{x}$. $\mathbf{x}_i$ and $\mathbf{x}^{(i)}$ represent the $i$-th column and $i$-th row of the matrix $\mathbf{X}$, respectively. $x_{ij}$ denotes the scalar in the $i$-th row and $j$-th column of the matrix $\mathbf{X}$. All non-bold letters represent scalars. $\mathbf{1}_{m \times n}, \mathbf{0}_{m \times n} \in \mathbb{R}^{m \times n}$ are matrices of ones and zeros. $\mathbf{I}_n \in \mathbb{R}^{n \times n}$ is an identity matrix. $\|\mathbf{x}\|_p = \sqrt[p]{\sum_i |x_i|^p}$ denotes the $p$-norm. $\|\mathbf{X}\|_F^2 = \operatorname{tr}(\mathbf{X}^T \mathbf{X})$ designates the Frobenius norm. $\operatorname{vec}(\mathbf{X})$ denotes the vectorization of matrix $\mathbf{X}$. $\mathbf{X} \circ \mathbf{Y}$ is the Hadamard product of matrices. $\{i : j\}$ lists the integers $\{i, i+1, \cdots, j\}$.

Objective function
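Given a collection of $m$ time series, $\{\mathbf{X}_i\}_{i=1}^{m}$ with $\mathbf{X}_i \in \mathbb{R}^{d_i \times n_i}$, GTW seeks for each $\mathbf{X}_i$ a low-dimensional spatial embedding $\mathbf{V}_i \in \mathbb{R}^{d_i \times d}$ and a non-linear temporal transformation $\mathbf{W}_i \in \{0,1\}^{n_i \times l}$ such that the resulting sequence $\mathbf{V}_i^T \mathbf{X}_i \mathbf{W}_i \in \mathbb{R}^{d \times l}$ is well aligned with the others in the least-squares sense. In a nutshell, GTW minimizes the sum of pairwise distances:
$$J_{gtw} = \sum_{i=1}^{m} \sum_{j=1}^{m} \frac{1}{2} \|\mathbf{V}_i^T \mathbf{X}_i \mathbf{W}_i - \mathbf{V}_j^T \mathbf{X}_j \mathbf{W}_j\|_F^2 + \sum_{i=1}^{m} \big(\psi(\mathbf{W}_i) + \phi(\mathbf{V}_i)\big), \quad \text{s.t. } \mathbf{W}_i \in \Psi, \; \mathbf{V}_i \in \Phi, \tag{2}$$
where $\psi(\cdot)$ and $\phi(\cdot)$ are regularization functions, which bias the solution in the space of temporal transformations $\mathbf{W}_i$ and the embedding for spatial transformation $\mathbf{V}_i$, respectively. $\Psi$ and $\Phi$ represent the domains for $\mathbf{W}_i$ and $\mathbf{V}_i$. The explicit forms of $\Psi$ and $\Phi$ are discussed in the following sections.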
Generally speaking, optimizing $J_{gtw}$ (Eq. 2) is a non-convex optimization problem with respect to the alignment matrices ($\mathbf{W}_i$) and projection matrices ($\mathbf{V}_i$). We alternate between solving for $\mathbf{W}_i$ using a Gauss-Newton algorithm and optimally computing $\mathbf{V}_i$ using mCCA. These steps monotonically decrease $J_{gtw}$, and because the function is bounded below, the alternating scheme will converge to a critical point.

Parameterization of the temporal warping
To simplify the discussion, let us consider the temporal warping matrix $\mathbf{W} \in \{0,1\}^{n \times l}$ for a single sequence $\mathbf{X} \in \mathbb{R}^{d \times n}$. The DP-based approach to optimize $\mathbf{W}$ has a computational cost of $O(nl)$, which quickly becomes infeasible as the sequence length increases. In order to reduce the computational complexity and to provide a flexible way to control the warping path, GTW approximates the warping path $\mathbf{p} \in \{1:n\}^l$, which parameterizes the warping matrix $\mathbf{W}(\mathbf{p})$, as a linear combination of monotonic functions $\mathbf{q}_c \in [1, n]^l$, that is:
$$\mathbf{p} \approx \sum_{c=1}^{k} a_c \mathbf{q}_c = \mathbf{Q}\mathbf{a},$$
where $\mathbf{a} \in \mathbb{R}^k$ is the weight vector and $\mathbf{Q} = [\mathbf{q}_1, \cdots, \mathbf{q}_k] \in \mathbb{R}^{l \times k}$ is the basis set. Note that Fisher et al. [5] also used hyperbolic tangent functions as temporal bases, with the weights optimized using a non-negative least squares algorithm. GTW differs from Fisher et al. [5] in three aspects: (1) GTW allows aligning multi-dimensional time series that have different features, whereas Fisher et al. can only align one-dimensional time series. (2) Unlike [5], we use a more efficient eigendecomposition to solve CCA and quadratic programming (QP) to optimize the weights. (3) We propose to use a family of monotonic functions that allow for a more general warping (e.g., sub-sequence matching), together with constraints to regularize the solution.
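To make the parameterization concrete, here is a small sketch that builds a basis matrix Q from three monotonically increasing functions, forms the path p = Qa, and resamples a 1-D series along it. The particular basis shapes (linear, quadratic, and a rescaled hyperbolic tangent), the rounding of p to the nearest frame, and the names `basis` and `warp` are assumptions made for illustration only.

```python
import math


def basis(l, n):
    """Three monotonically increasing bases sampled at l steps, each in [1, n]."""
    ts = [t / (l - 1) for t in range(l)]  # normalized time in [0, 1]
    shapes = [
        lambda t: t,       # linear ramp
        lambda t: t ** 2,  # quadratic polynomial
        # tanh rescaled so that it runs from 0 (t = 0) to 1 (t = 1)
        lambda t: (math.tanh(4 * (t - 0.5)) / math.tanh(2) + 1) / 2,
    ]
    # Q is l-by-k: row t holds the k basis values at step t.
    return [[1 + (n - 1) * f(t) for f in shapes] for t in ts]


def warp(x, Q, a):
    """Resample the 1-D series x along p = Q a, rounded to the nearest frame."""
    p = [sum(q * w for q, w in zip(row, a)) for row in Q]
    idx = [min(max(int(math.floor(v + 0.5)), 1), len(x)) for v in p]
    return p, [x[i - 1] for i in idx]


x = [float(v) for v in range(1, 51)]  # 1-D series with n = 50 frames
Q = basis(l=70, n=50)
p, xw = warp(x, Q, a=[0.5, 0.25, 0.25])  # non-negative weights summing to 1
print(len(xw), p[0], p[-1])
```

Because every basis is non-decreasing and the weights are non-negative, the resulting path is non-decreasing as well; with weights that sum to one it starts at frame 1 and ends at frame n.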
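As in DTW, we incorporate the following constraints on the weights $\mathbf{a}$ to constrain the warping path $\mathbf{p} = \mathbf{Q}\mathbf{a}$.
Boundary conditions: We enforce the position of the first frame, $p_1 = \mathbf{q}^{(1)}\mathbf{a} \ge 1$, and the last frame, $p_l = \mathbf{q}^{(l)}\mathbf{a} \le n$, where $\mathbf{q}^{(1)} \in \mathbb{R}^{1 \times k}$ and $\mathbf{q}^{(l)} \in \mathbb{R}^{1 \times k}$ are the first and last rows of the basis matrix $\mathbf{Q} \in \mathbb{R}^{l \times k}$, respectively. In contrast to DTW, which imposes tight boundary conditions (i.e., $p_1 = 1$ and $p_l = n$), GTW relaxes the equalities to inequality constraints so that $\mathbf{p}$ may index only a sub-part of $\mathbf{X}$. This relaxation can be used for sub-sequence matching.
Monotonicity: We enforce $t_1 \le t_2 \Rightarrow p_{t_1} \le p_{t_2}$ by constraining the sign of the weights: $\acute{\mathbf{a}} \ge \mathbf{0}$. Notice that constraining the weights is a sufficient but not a necessary condition for monotonicity. See [24,26,33] for in-depth discussions on monotonic functions.
Continuity: To approximate the hard constraint on the step size (e.g., $p_t - p_{t-1} \in \{0, 1\}$), we penalize the curvature of the warping path, $\|\mathbf{F}\mathbf{Q}\mathbf{a}\|_2^2$, where $\mathbf{F} \in \mathbb{R}^{l \times l}$ is the 1st-order differential operator.
In summary, we constrain the warping path as:
$$\psi_a(\mathbf{a}) = \eta \|\mathbf{F}\mathbf{Q}\mathbf{a}\|_2^2, \qquad \Psi_a = \{\mathbf{a} \mid \mathbf{L}\mathbf{a} \le \mathbf{b}\}, \qquad \text{where } \mathbf{L} = \begin{bmatrix} [\,\mathbf{0}_{\acute{k} \times (k - \acute{k})} \;\; -\mathbf{I}_{\acute{k}}\,] \\ -\mathbf{q}^{(1)} \\ \mathbf{q}^{(l)} \end{bmatrix}, \quad \mathbf{b} = \begin{bmatrix} \mathbf{0}_{\acute{k}} \\ -1 \\ n \end{bmatrix}.$$
Therefore, given a basis set of $k$ monotone functions, all feasible weights belong to a polyhedron in $\mathbb{R}^k$ parameterized by $\mathbf{L} \in \mathbb{R}^{(\acute{k}+2) \times k}$ and $\mathbf{b} \in \mathbb{R}^{\acute{k}+2}$. For instance, Fig. 2b illustrates an example of a warping function (red solid line) as a combination of three monotone functions (blue dotted lines).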
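The polyhedral constraint La ≤ b can be assembled and checked with a few lines of code. The sketch below is an illustration under stated assumptions: the hypothetical parameter `n_free` counts leading weights that are left unconstrained in sign (for example, the weight of a constant basis), and the two-column basis in the usage example is made up for the demonstration.

```python
def constraints(Q, n, n_free=0):
    """Return (L, b) such that feasible weights satisfy L a <= b.

    Q      : l-by-k basis matrix given as a list of rows
    n      : length of the sequence indexed by the path
    n_free : number of leading weights without a sign constraint
             (hypothetical parameter, e.g. for a constant basis)
    """
    k = len(Q[0])
    L, b = [], []
    for c in range(n_free, k):  # monotonicity: -a_c <= 0
        L.append([-1.0 if j == c else 0.0 for j in range(k)])
        b.append(0.0)
    L.append([-q for q in Q[0]])  # first frame: -q(1) a <= -1
    b.append(-1.0)
    L.append(list(Q[-1]))         # last frame: q(l) a <= n
    b.append(float(n))
    return L, b


def feasible(L, b, a, tol=1e-9):
    """Check every row of L a <= b up to a small tolerance."""
    return all(
        sum(l * w for l, w in zip(row, a)) <= bi + tol
        for row, bi in zip(L, b)
    )


n, l = 50, 70
# Basis: a constant offset plus a ramp from 0 to n - 1.
Q = [[1.0, (n - 1) * t / (l - 1)] for t in range(l)]
L, b = constraints(Q, n, n_free=1)
print(feasible(L, b, [1.0, 1.0]))    # full-sequence warp, frames 1..50
print(feasible(L, b, [10.0, 0.5]))   # sub-sequence warp, frames 10..34.5
print(feasible(L, b, [1.0, -1.0]))   # decreasing path, violates monotonicity
```

The second weight vector shows the effect of the relaxed boundary conditions: the path covers only an interior portion of the sequence and is still feasible.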

Optimization of the temporal weights
Suppose that $k_i$ basis functions, $\mathbf{Q}_i \in \mathbb{R}^{l \times k_i}$, are associated with the $i$-th sequence $\mathbf{X}_i$. Then the optimization of $J_{gtw}$ (Eq. 2) with respect to the time warping parameter $\mathbf{W}_i$ minimizes:
$$\sum_{i=1}^{m} \sum_{j=1}^{m} \frac{1}{2} \|\mathbf{V}_i^T \mathbf{X}_i \mathbf{W}(\mathbf{Q}_i \mathbf{a}_i) - \mathbf{V}_j^T \mathbf{X}_j \mathbf{W}(\mathbf{Q}_j \mathbf{a}_j)\|_F^2 + \sum_{i=1}^{m} \eta_i \|\mathbf{F}_i \mathbf{Q}_i \mathbf{a}_i\|_2^2, \quad \text{s.t. } \mathbf{L}_i \mathbf{a}_i \le \mathbf{b}_i, \; \forall i \in \{1:m\}. \tag{3}$$
Notice that the constraints $\psi$ and $\Psi$ in Eq. 2 associated with the warping matrix $\mathbf{W}$ are replaced by the constraints $\psi_a$ and $\Psi_a$ associated with the weights $\mathbf{a}$.
To optimize Eq. 3, we use a Gauss-Newton method similar to the Lucas-Kanade framework [19] for image alignment: the nonlinear expression in Eq. 3 is linearized by a first-order Taylor approximation of the warped sequences around the current weights $\mathbf{a}_i$ (Eq. 4). Plugging Eq. 4 into Eq. 3 yields Eq. 5.
Minimizing Eq. 5 with respect to the weight increment $\boldsymbol{\delta}_i \in \mathbb{R}^{k_i}$ yields a quadratic programming problem (Eq. 6). In all our experiments, we initialize $\mathbf{a}_i$ by uniformly aligning the sequences (the curve GN-Init in Fig. 3b). The length of the warping path $l$ is usually set to $l = 1.1 \max_{i=1}^{m} n_i$. In practice, when the sequence length $n_i$ is very large, an additional pre-conditioner should be used to obtain a numerically stable solution. For instance, a normalized version of Eq. 6 can be minimized instead, in which the increment is rescaled by a scaling matrix $\mathbf{R}$. After solving this new quadratic optimization problem, we need to rescale the result as $\boldsymbol{\delta} \leftarrow \mathbf{R}^{-1} \boldsymbol{\delta}$. The computational complexity of the algorithm is $O(dlk + k^3)$.
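For readers less familiar with Gauss-Newton, the generic form of one constrained step can be written as follows. This is a textbook sketch rather than a reconstruction of Eqs. 4-6: the residual vector $\mathbf{r}(\mathbf{a})$ and its Jacobian $\mathbf{G}$ are placeholder symbols introduced here, while $\mathbf{L}$ and $\mathbf{b}$ are the constraint matrices defined for the warping weights.

```latex
\begin{aligned}
\mathbf{r}(\mathbf{a} + \boldsymbol{\delta}) &\approx \mathbf{r}(\mathbf{a}) + \mathbf{G}\boldsymbol{\delta},
\qquad \mathbf{G} = \frac{\partial \mathbf{r}}{\partial \mathbf{a}}, \\
\boldsymbol{\delta}^{\star} &= \arg\min_{\boldsymbol{\delta}} \; \tfrac{1}{2}\,\|\mathbf{r}(\mathbf{a}) + \mathbf{G}\boldsymbol{\delta}\|_2^2
\quad \text{s.t.} \quad \mathbf{L}(\mathbf{a} + \boldsymbol{\delta}) \le \mathbf{b}, \\
\mathbf{a} &\leftarrow \mathbf{a} + \boldsymbol{\delta}^{\star}.
\end{aligned}
```

Without the inequality constraints the minimizer has the closed form $\boldsymbol{\delta}^{\star} = -(\mathbf{G}^T\mathbf{G})^{-1}\mathbf{G}^T\mathbf{r}(\mathbf{a})$; with them, each iteration is a small QP whose number of unknowns equals the number of basis weights rather than the sequence length.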
As discussed in [23,28], various techniques have been proposed to accelerate and improve DTW. For instance, the Sakoe-Chiba band (DTW-SC) and the Itakura Parallelogram band (DTW-IP) reduce the complexity of the original DTW algorithm to $O(\beta n^2)$ by constraining the warping path, assuming $\beta < 1$. However, using a narrow band (a small $\beta$) might cut off potential warping space, leading to a sub-optimal solution. For instance, Fig. 3a shows an example of two 1-D time series and the alignment results calculated by different algorithms. The results computed by DTW-SC and DTW-IP are less accurate than the one computed by Gauss-Newton (GN). This is because both the SC and IP bands are over-constrained (Fig. 3b).
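The effect of an over-constrained band can be reproduced on a toy example. The sketch below restricts DTW to a Sakoe-Chiba style band around the diagonal; the band definition (a cell is allowed if it lies within β·n_y frames of the diagonal), the function name `dtw_band`, and the two toy sequences are assumptions made for illustration. For clarity the loop still visits every cell and skips those outside the band; an efficient implementation would iterate over the band only.

```python
def dtw_band(x, y, beta):
    """DTW cost restricted to a Sakoe-Chiba style band.

    A cell (i, j) is allowed only if j lies within beta * len(y) frames
    of the diagonal. beta = 1 recovers unconstrained DTW.
    """
    nx, ny = len(x), len(y)
    inf = float("inf")
    width = beta * ny
    D = [[inf] * ny for _ in range(nx)]
    for i in range(nx):
        diag = i * (ny - 1) / max(nx - 1, 1)  # diagonal position for row i
        for j in range(ny):
            if abs(j - diag) > width:
                continue  # outside the band
            d = (x[i] - y[j]) ** 2
            if i == 0 and j == 0:
                D[i][j] = d
                continue
            best = min(
                D[i - 1][j] if i > 0 else inf,
                D[i][j - 1] if j > 0 else inf,
                D[i - 1][j - 1] if i > 0 and j > 0 else inf,
            )
            D[i][j] = d + best
    return D[nx - 1][ny - 1]


# The same bump occurs early in x and late in y.
x = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
y = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0]
print(dtw_band(x, y, beta=1.0))  # unconstrained: the bumps can be matched
print(dtw_band(x, y, beta=0.1))  # narrow band: only the diagonal is allowed
```

With the narrow band the two bumps cannot be brought into correspondence, so the banded cost is strictly larger than the unconstrained one.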
To provide a quantitative evaluation, we synthetically generated 1-D sequences at 15 scales. For DTW-SC, we set the band width to $\beta = 0.1$. For GN, we varied $k$ among 6, 10 and 14 to investigate the effect of the number of bases. For each scale, we randomly generated 100 pairs of sequences. The error is computed with Eq. 9 and shown in Fig. 3c-d. DTW obtains the lowest error but takes the most time to compute, because it exhaustively searches the entire parameter space to find the global optimum. Both DTW-SC and DTW-IP need less time than DTW because they search a smaller space constrained by their respective bands. Empirically, DTW-IP is more accurate than DTW-SC on our synthetic dataset, because the global optimum is more likely to lie in the IP band than in the SC band. Compared to DTW, DTW-SC and DTW-IP, GN is more computationally efficient because it has linear complexity in the sequence length. Moreover, increasing the number of bases monotonically reduces the error.

Optimization of the spatial embedding
To optimize over $\mathbf{V}_i$ we use multi-set canonical correlation analysis (mCCA) [11], constraining the embedding $\mathbf{V}_i$ as in Eq. 7, where $\lambda_i \in [0, 1]$ is the regularization term. In the special case when $\lambda_i \to 1$, the constraint is equivalent to the one used in multi-set partial least squares (mPLS) [27]. Plugging Eq. 7 into Eq. 2 yields Eq. 8, whose optimal $\mathbf{V}$ can be computed in closed form using a generalized eigendecomposition, i.e., $\mathbf{C}\mathbf{V} = \mathbf{D}\mathbf{V}\boldsymbol{\Lambda}$. The dimension $d$ is selected to preserve 90% of the total correlation.
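One plausible reading of the 90% rule is sketched below. It assumes that the "total correlation" is measured by the sum of the generalized eigenvalues and that $d$ is the smallest number of leading eigenvalues whose cumulative sum reaches the requested fraction; both the criterion and the name `select_dim` are assumptions, and the eigenvalues in the example are made up.

```python
def select_dim(eigvals, keep=0.9):
    """Smallest d whose d largest eigenvalues hold at least `keep` of the total."""
    vals = sorted(eigvals, reverse=True)
    total = sum(vals)
    acc = 0.0
    for d, v in enumerate(vals, start=1):
        acc += v
        if acc >= keep * total:
            return d
    return len(vals)


# Hypothetical eigenvalues; the cumulative fractions are 0.5, 0.8, 0.95, 1.0.
print(select_dim([5.0, 3.0, 1.5, 0.5], keep=0.9))
```

The same helper applies to the PCA energy criterion used later for the video features, with `keep=0.99`.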

Experiments
This section compares GTW against state-of-the-art DTW approaches in three experimental settings: (1) aligning time series with known ground truth to provide a quantitative comparison, (2) aligning several video sequences of different people performing similar actions using different visual features for each sequence, and (3) aligning three sequences of different subjects performing a similar action recorded with different sensors (motion capture data, accelerometers and video).

Other methods for comparison
We compared GTW against several versions of Procrustes analysis [4], which are used as baselines.
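Procrustes dynamic time warping (pDTW): Procrustes analysis [4] has been extensively used for shape alignment. We propose a simple temporal extension, pDTW, which aligns multiple time series by minimizing $\sum_{i=1}^{m} \|\mathbf{X}_i \mathbf{W}_i - \frac{1}{m} \sum_{j=1}^{m} \mathbf{X}_j \mathbf{W}_j\|_F^2$. pDTW alternates between solving for the warping matrix $\mathbf{W}_i \in \{0,1\}^{n_i \times l}$ with a slightly modified DTW and computing the mean sequence $\frac{1}{m} \sum_{j=1}^{m} \mathbf{X}_j \mathbf{W}_j \in \mathbb{R}^{d \times l}$.
Procrustes derivative dynamic time warping (pDDTW): In order to make DTW invariant to translation, derivative dynamic time warping (DDTW) [15] uses the derivatives of the original features. Similar to pDTW, we combine DDTW with the Procrustes framework, minimizing the same type of objective on the derivative features computed with $\mathbf{F}_i \in \mathbb{R}^{n_i \times n_i}$, the 1st-order differential operator.
Procrustes iterative motion warping (pIMW): Similar to GTW, iterative motion warping (IMW) [13] alternates between time warping and spatial transformation to align two sequences. In our experiments, we extend IMW to align multiple sequences within the same Procrustes framework, where $\mathbf{A}_i, \mathbf{B}_i \in \mathbb{R}^{d \times n_i}$ are the scaling and translation parameters for the $i$-th sequence $\mathbf{X}_i$, respectively, and $\mathbf{F}_i^a, \mathbf{F}_i^b \in \mathbb{R}^{n_i \times n_i}$ are 1st-order differential operators that enforce a smooth change in the columns of $\mathbf{A}_i$ and $\mathbf{B}_i$.
To evaluate the time warping results, we propose to compute the difference between the warping matrix $\mathbf{W}_i^{alg} \in \{0,1\}^{n_i \times l_{alg}}$ given by the algorithm (e.g., GTW, pDTW, pDDTW, pIMW) and the ground truth $\mathbf{W}_i^{tru} \in \{0,1\}^{n_i \times l_{tru}}$. Recall that the number of warping steps can be different, i.e., $l_{alg} \ne l_{tru}$. Equivalently, we can compare the warping paths $\mathbf{P}_{alg} \in \mathbb{R}^{l_{alg} \times m}$ and $\mathbf{P}_{tru} \in \mathbb{R}^{l_{tru} \times m}$ (Eq. 9) through the curve distance
$$\operatorname{dist}(\mathbf{P}_1, \mathbf{P}_2) = \sum_{i=1}^{l_1} \min_{j=1}^{l_2} \|\mathbf{p}_1^{(i)} - \mathbf{p}_2^{(j)}\|_2,$$
where $\mathbf{p}_{alg}^{(i)} \in \mathbb{R}^{1 \times m}$ and $\mathbf{p}_{tru}^{(j)} \in \mathbb{R}^{1 \times m}$ are the $i$-th row of $\mathbf{P}_{alg}$ and the $j$-th row of $\mathbf{P}_{tru}$, respectively. Let us consider each warping path $\mathbf{P} \in \mathbb{R}^{l \times m}$ as a curve in $\mathbb{R}^m$ with $l$ points. Thus the term $\min_{j=1}^{l_2} \|\mathbf{p}_1^{(i)} - \mathbf{p}_2^{(j)}\|_2$ can be interpreted as the closest distance between the point $\mathbf{p}_1^{(i)}$ and the curve $\mathbf{P}_2$.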
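The point-to-curve distance can be computed directly from its interpretation: for every point of one path, take the distance to the closest point of the other path. In the sketch below, `dist` follows that description, while `alignment_error` symmetrizes it and divides by the total number of steps; that normalization, like the function names and the toy paths, is an assumption rather than the exact form of Eq. 9.

```python
import math


def dist(P1, P2):
    """Sum over the points of P1 of the distance to the closest point of P2."""
    return sum(min(math.dist(p, q) for q in P2) for p in P1)


def alignment_error(P_alg, P_tru):
    """Symmetrized distance divided by the total number of steps (assumed form)."""
    total = dist(P_alg, P_tru) + dist(P_tru, P_alg)
    return total / (len(P_alg) + len(P_tru))


# Two warping paths for m = 2 sequences with different numbers of steps.
P_tru = [(1, 1), (2, 2), (3, 3)]
P_alg = [(1, 1), (2, 2), (2, 3), (3, 3)]
print(dist(P_alg, P_tru), dist(P_tru, P_alg))
print(alignment_error(P_alg, P_tru))
```

Because the measure compares curves rather than matrices, it remains well defined when the two paths have different lengths.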

Synthetic dataset
In the first experiment we synthetically generated 3-D spatio-temporal signals (2-D in space and 1-D in time) to evaluate the performance of GTW. The first two spatial dimensions and the time dimension are generated as $\mathbf{U}_i^T (\mathbf{Z} + \mathbf{b}_i \mathbf{1}_l^T) \mathbf{M}_i \in \mathbb{R}^{2 \times n_i}$, where $\mathbf{Z} \in \mathbb{R}^{2 \times l}$ is a curve in two dimensions (Fig. 4a top), and $\mathbf{U}_i \in \mathbb{R}^{2 \times 2}$ and $\mathbf{b}_i \in \mathbb{R}^2$ are a randomly generated projection matrix and translation vector, respectively (Fig. 4a top). The binary matrix $\mathbf{M}_i \in \{0,1\}^{l \times n_i}$ is generated by randomly choosing $n_i \le l$ columns from $\mathbf{I}_l$ for temporal distortion (Fig. 4a bottom). The third spatial dimension $\mathbf{e}_i \in \mathbb{R}^{n_i}$ is generated with zero-mean Gaussian noise (Fig. 4b top). Notice that in the case of synthetic data we are able to obtain the ground-truth alignment matrix $\mathbf{W}_i^{tru} = \mathbf{M}_i^T$. The error between the ground truth and a given alignment $\mathbf{W}^{alg}$ is computed by Eq. 9 (Fig. 4g). We initialize all methods by uniformly aligning the three sequences, i.e., $\mathbf{p}_i = \operatorname{round}(\operatorname{linspace}(1, n_i, l))$, where round(·) and linspace(·) are MATLAB functions.
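For readers working outside MATLAB, the uniform initialization can be written with the Python standard library as follows. The helper name `uniform_path` is illustrative; halves are rounded up, which agrees with MATLAB's `round` for the positive values that occur here (Python's built-in `round` would round halves to the nearest even integer instead).

```python
import math


def uniform_path(n, l):
    """Python analogue of MATLAB round(linspace(1, n, l)), for l >= 2."""
    step = (n - 1) / (l - 1)
    # floor(v + 0.5) rounds halves up, as MATLAB's round does for positive v.
    return [int(math.floor(1 + t * step + 0.5)) for t in range(l)]


print(uniform_path(5, 9))   # a sequence of 5 frames stretched to 9 steps
```

Since $n_i \le l$, the path repeats some frames but always starts at frame 1, ends at frame $n_i$, and never decreases.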
We set the length of the latent sequence to $l = 300$. For GTW, we set $\eta_i = \lambda_i = 0$ and selected $d$ to preserve 90% of the total correlation. We selected three hyperbolic tangent and three polynomial functions as bases for the monotonic warping function. Fig. 4b-e show the spatio-temporal warping estimated by each algorithm. Fig. 4g shows $err_{alg}$ (Eq. 9) for 100 newly generated time series. As can be observed in Fig. 4e, GTW obtains the best performance. pDTW (Fig. 4b) fails in this case since the sequences have been distorted in space. pDDTW (Fig. 4c) cannot deal with this example because the feature derivatives do not capture the structure of the sequence well. pIMW (Fig. 4d) warps sequences towards the others by translating and re-scaling each frame in each dimension. Moreover, pIMW has more parameters ($\sum_{i=1}^{m} (l n_i + 2 d n_i)$) than GTW ($\sum_{i=1}^{m} (k_i + d d_i)$), and hence pIMW is more prone to over-fitting. Furthermore, pIMW tries to fit the noisy dimension (3rd spatial component), which biases the alignment in time, whereas GTW has a feature selection mechanism that effectively cancels the third dimension.

Aligning videos with different features
In the second experiment we applied GTW to align video sequences of different people performing a similar action. Each video is encoded using different visual features. The video sequences are taken from the Weizmann database [8], which contains 9 people performing 10 actions. To extract dynamic features from video, we extract the silhouette with background subtraction (Fig. 5a). We computed three popular shape features (Fig. 5b) for each 70-by-35 re-scaled mask image: (1) binary image, (2) Euclidean distance transform [20], and (3) solution of the Poisson equation [9]. In order to reduce the feature dimension (2450), we picked the top 123 principal components, which preserve 99% of the total energy. To evaluate the performance, we randomly selected three walking sequences, each of which was manually cropped into two cycles of human walking. The ground-truth alignment was approximated by running pDTW on the same features, which provided a visually accurate temporal alignment.
GTW was initialized with uniform alignment, and we used the parameter $\lambda = 0.1$. We used five hyperbolic tangent and five polynomial functions as the monotonic bases (Fig. 5f middle-top).
Fig. 5g shows $err_{alg}$ for 10 randomly generated sets of videos. Notice that neither pDTW (Fig. 5c) nor pDDTW (Fig. 5d) is able to align the videos because both of them lack the ability to solve for correspondence between signals of different nature. As observed from Fig. 5e, pIMW registers the top three components well in space; however, it overfits all the dimensions and thus obtains a biased time warping path. In contrast, GTW (Fig. 5f) warps the sequences accurately in both space and time. Fig. 5b illustrates the temporal correspondence found by GTW.

Multi-modal sequence alignment
This experiment applies GTW to align sequences of different people performing a similar activity but recorded with different sensors. We selected one motion capture sequence (Subject 12, Trial 29) from the CMU motion capture database and one video sequence (Eli, jacking) from the Weizmann database [8], and we collected the accelerometer signal of a subject performing a jacking exercise. Some instances of the multi-modal data can be seen in Fig. 6d. Observe that, to make the problem more challenging, while in the mocap (top row) and video (middle row) the two subjects are performing the same activity, in the accelerometer sequence (bottom row) the subject only moves one hand and not the legs. Even in this challenging scenario, GTW is able to solve for the temporal correspondence that maximizes the correlation between signals.
For the mocap data, we computed the quaternions for the 20 joints, resulting in a 60-dimensional feature vector that describes the body configuration. In the case of the Weizmann dataset, we computed the Euclidean distance transform as described earlier. The data from the accelerometers is collected along the X, Y and Z axes by an X6-2mini USB accelerometer (Fig. 6a) at a rate of 40 Hz. GTW was initialized by uniformly aligning the three sequences. We used five hyperbolic tangent and five polynomial functions as monotonic bases. Fig. 6b shows the first components of the three sequences projected separately by PCA. As shown in Fig. 6c, GTW found an accurate temporal correspondence between the three sequences. Unfortunately, we do not have ground truth for this experiment; however, visual inspection of the video suggests that the results are consistent with human labeling. Fig. 6d shows several frames that have been put in correspondence by GTW.

Conclusions
This paper describes GTW, a technique for temporally aligning multiple multi-modal sequences. The GTW algorithm offers a more flexible and efficient framework than state-of-the-art DTW algorithms because we parameterize the time warping function as a linear combination of monotonic bases.
Although GTW has shown promising preliminary results, there are still unresolved issues. First, the Gauss-Newton algorithm for time warping converges poorly in areas where the objective function $J_{gtw}$ is non-smooth. Second, GTW is subject to local minima. A well-known strategy to escape from local minima in image alignment is a coarse-to-fine approach, which could be adopted to optimize GTW at different temporal scales. Third, although the experiments show admissible time warping results with fixed bases, it is more desirable to automatically learn the monotonic bases. We plan to explore these issues in future work.

Figure 1. Temporal alignment of three sequences of different subjects kicking a ball recorded with different sensors (top row video, middle row motion capture and bottom row accelerometers).

Figure 2. Temporal warping function. (a) Six common choices for the monotonically increasing function $\mathbf{q}$. (b) An example of time warping $\mathbf{X}\mathbf{W}(\mathbf{Q}\mathbf{a}) \in \mathbb{R}^{1 \times 70}$ of a 1-D time series $\mathbf{X} \in \mathbb{R}^{1 \times 50}$. The warping function is a linear combination of three basis functions including a constant function and two monotonically increasing functions.

Figure 3. Comparison of temporal alignment algorithms. (a) An example of two 1-D time series (with 247 and 305 frames respectively) and the alignment results calculated using the ground truth, DTW, DTW constrained in the Sakoe-Chiba band (DTW-SC), DTW constrained in the Itakura Parallelogram band (DTW-IP) and Gauss-Newton (GN). (b) Comparison of different warping paths. GN-Init denotes the initial warping used for GN. SC-Bound and IP-Bound denote the boundaries of the SC band and IP band respectively. (c) Comparison of alignment errors. (d) Time.

Figure 4. Synthetic data. Time series $\mathbf{X}_i$, $i \in \{1:3\}$, are generated by (a) a spatio-temporal transformation $\mathbf{U}_i^T (\mathbf{Z} + \mathbf{b}_i \mathbf{1}_l^T) \mathbf{M}_i$ of the 2-D latent sequence $\mathbf{Z}$ and (b top) adding Gaussian noise $\mathbf{e}_i$ in the 3rd dimension. The spatio-temporal warping is computed by (b) pDTW, (c) pDDTW, (d) pIMW and (e) GTW, where the bases are shown in the top-right corner. (f) Comparison of different time warping techniques. (g) Mean and variance of the alignment error for different methods.

Figure 5. Example using multi-feature video data. (a) Original sequences after background subtraction. (b) Key frames aligned using GTW. The three sequences from top to bottom are represented as a binary image, Euclidean distance transform and solution of the Poisson equation respectively. (c) pDTW. (d) pDDTW. (e) pIMW. (f) GTW. (g) Comparison of different time warping techniques. (h) Mean and variance of the alignment error.

Figure 6.

In the DTW formulation, the warping matrix $\mathbf{W}(\mathbf{p}) \in \{0,1\}^{n \times l}$ sets $w_{p_t, t} = 1$ for $t \in \{1:l\}$ and zero otherwise. $l \ge \max(n_x, n_y)$ is the number of steps needed to align both signals.