Aligned Cluster Analysis for temporal segmentation of human motion

Temporal segmentation of human motion into actions is a crucial step for understanding and building computational models of human motion. Several issues contribute to the challenge of this task. These include the large variability in the temporal scale and periodicity of human actions, as well as the exponential nature of all possible movement combinations. We formulate the temporal segmentation problem as an extension of standard clustering algorithms. In particular, this paper proposes aligned cluster analysis (ACA), a robust method to temporally segment streams of motion capture data into actions. ACA extends standard kernel k-means clustering in two ways: (1) the cluster means contain a variable number of features, and (2) a dynamic time warping (DTW) kernel is used to achieve temporal invariance. Experimental results, reported on synthetic data and the Carnegie Mellon Motion Capture database, demonstrate its effectiveness.


Introduction
Over the past two decades, motion capture systems have become able to track and record human motion with high spatial and temporal resolution. The proliferation of motion databases calls for efficient techniques to index and build models of human motion. One key step toward understanding and building better models of human motion is to develop unsupervised algorithms for decomposing human motion into a set of actions [10]. The problem of factorizing human motion into actions, and more generally the temporal segmentation of human motion, remains unsolved in human motion analysis. The inherent difficulty of human motion segmentation stems from the large intra-person physical variability, the wide range of temporal scales, the irregularity in the periodicity of human actions, and the exponential nature of possible movement combinations. To partially address these problems, we formulate the temporal segmentation of human behavior as a temporal clustering problem. Fig. 1 illustrates the main goal of this paper: given a sequence of motion capture data, we find temporally coherent clusters of actions (e.g. walking, rotating, jumping). We propose Aligned Cluster Analysis (ACA), an extension of kernel k-means clustering that allows unsupervised clustering of temporal patterns. Compared to previous literature, our approach has multiple advantages: (1) The temporal granularity of the human action can be controlled by the user. (2) A robust temporal matching metric is defined by means of the Dynamic Time Alignment Kernel (DTAK) [20]. (3) The temporal segmentation problem is posed as a versatile energy minimization problem. An efficient coordinate descent algorithm solves ACA.
The remainder of the paper is organized as follows. Section 2 reviews previous work on temporal segmentation of motion capture data. Section 3 introduces theoretical foundations for ACA. Section 4 presents the details of preprocessing and initializing stages for human motion segmentation. Section 5 demonstrates experimental results on synthetic and motion captured data. Finally, Section 6 concludes the paper and outlines future work.

Previous work
This section describes previous work on temporal segmentation of human motion and standard clustering techniques.

Segmenting motion capture data
Temporal segmentation of human actions is an emerging topic in the field of computer vision. Zhong et al. [22] use a bipartite graph co-clustering algorithm to segment and detect unusual activities in video. Zelnik and Irani [21] define a flow based matching between actions. De la Torre et al. [7] propose a geometric-invariant clustering algorithm to decompose a stream of facial behavior into facial gestures. Unusual facial expressions can be detected through comparisons with clusters of facial gestures.
In the graphics literature, Barbic et al. [2] proposed an algorithm to decompose human motion into distinct actions by detecting sudden changes in the intrinsic dimensionality of the Principal Component Analysis (PCA) model. Jenkins et al. [8, 12] use the zero-velocity crossing points of the angular velocity to segment the stream of motion capture data. Li et al. [17] fit a mixture of linear dynamical systems to a sequence using a maximum likelihood approach. Each of the linear systems is considered a motion texton that can be used to synthesize new motion sequences. Recently, Beaudoin et al. [3] developed a string-based motif-finding algorithm which allows for a user-controlled compromise between motif length and the number of motions in a motif.
In the field of data mining, the problem of segmenting time series is well known [14]. Dynamic Time Warping (DTW) is one of the most popular measures between temporal sequences. The idea of dynamic warping has been successfully used in motion blending [6] and clip locating [16] to overcome high fluctuations of articulated movements. Due to the high cost of exact DTW matching, approximate distances, such as the Minimum Bounding Rectangles (MBR) [1], are applied to segment the trajectories. This paper differs from previous work in the vision, graphics and data-mining literature in the way in which the temporal segmentation problem is formulated. We propose to frame the temporal segmentation problem as an energy-based temporal clustering, providing an elegant mathematical solution via Dynamic Programming (DP).

k-means clustering and kernel extensions
Clustering refers to the partition of n data points into k disjoint clusters. Among various approaches to unsupervised clustering, k-means [11] is favored for its simplicity. k-means clustering splits a set of n objects into k groups by minimizing the within-cluster variation. That is, k-means clustering finds the partition of the data that is a local optimum of the energy function [7]:

$$J_{km} = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci} \, \| d_i - m_c \|_2^2 \quad \text{s.t. } G^T 1_k = 1_n \text{ and } g_{ij} \in \{0, 1\}, \tag{1}$$

where $d_i \in \mathbb{R}^{d \times 1}$ is a vector representing the i-th data point and $m_c$ is the geometric centroid of the data points for class c. $G \in \{0, 1\}^{k \times n}$ is a binary indicator matrix, such that $g_{ci} = 1$ if sample $d_i$ belongs to cluster c, and zero otherwise. A major limitation of the k-means algorithm is that it is optimal only when applied to spherical clusters. To overcome this limitation, kernel k-means [19] implicitly maps the data to a higher dimensional space using kernels. The kernel k-means minimizes:

$$J_{kkm} = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci} \, \| \phi(d_i) - m_c \|_2^2 = \sum_{c=1}^{k} \sum_{i=1}^{n} g_{ci} \, dist_c(d_i), \tag{2}$$

where $\phi(\cdot)$ is the mapping, $m_c$ is now the center of class c in the mapped space, and $dist_c(d_i)$ is the (squared) distance between the i-th point and the center of class c, i.e.

$$dist_c(d_i) = \kappa_{ii} - \frac{2}{n_c} \sum_{j=1}^{n} g_{cj} \, \kappa_{ij} + \frac{1}{n_c^2} \sum_{j_1=1}^{n} \sum_{j_2=1}^{n} g_{cj_1} g_{cj_2} \, \kappa_{j_1 j_2}, \tag{3}$$

where $n_c$ is the number of samples that belong to class c.

The kernel function $\kappa$ is defined as

$$\kappa_{ij} = \phi(d_i)^T \phi(d_j). \tag{4}$$

Similar to the first step in the k-means algorithm, the kernel k-means assigns the sample to the closest cluster:

$$g_{c_i i} = 1, \quad c_i = \arg\min_{c} \, dist_c(d_i).$$

It is worth noting that in kernel k-means there is no need to explicitly recompute the mean for each cluster.
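The assignment step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `kernel_kmeans_assign`, the list-of-lists kernel matrix `K`, and the toy linear kernel are all assumptions made for the example. It evaluates the three-term kernel expansion of the point-to-center distance (self-similarity, cross term, within-cluster term) and never forms the cluster means explicitly.

```python
def kernel_kmeans_assign(K, labels, k):
    """One assignment step of kernel k-means on a precomputed kernel matrix K.

    K      : n x n kernel matrix (list of lists), K[i][j] = kappa_ij
    labels : current cluster index (0..k-1) of each sample
    Returns the new labels; the cluster means are never computed explicitly.
    """
    n = len(K)
    members = [[j for j in range(n) if labels[j] == c] for c in range(k)]
    # Within-cluster term (1/n_c^2) * sum_{j1,j2} kappa_{j1 j2}, once per cluster.
    within = [
        sum(K[a][b] for a in idx for b in idx) / (len(idx) ** 2) if idx else None
        for idx in members
    ]
    new_labels = []
    for i in range(n):
        best_c, best_d = labels[i], float("inf")
        for c, idx in enumerate(members):
            if not idx:
                continue  # an empty cluster cannot attract samples
            cross = sum(K[i][j] for j in idx) / len(idx)
            d = K[i][i] - 2.0 * cross + within[c]
            if d < best_d:
                best_c, best_d = c, d
        new_labels.append(best_c)
    return new_labels


# Toy usage (hypothetical data): four 1-D points and a linear kernel.
points = [0.0, 0.1, 5.0, 5.1]
K = [[a * b for b in points] for a in points]
print(kernel_kmeans_assign(K, [0, 1, 0, 1], k=2))
```

With a linear kernel the distance reduces to the ordinary squared Euclidean distance to the cluster mean, which makes the toy case easy to check by hand.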

Temporal segmentation
In this section, we formulate the temporal segmentation problem as a clustering one.

Temporal segmentation with ACA
Given a sequence $X \in \mathbb{R}^{d \times n}$ of motion capture data with n frames, we want to decompose X into m disjoint segments, each of which corresponds to one of k actions (i.e. classes). The i-th segment, $Y_i \doteq X_{[s_i, s_{i+1})}$, is composed of the frames that begin at position $s_i$ and end at $s_{i+1} - 1$. We constrain the length of the segment, $w_i = s_{i+1} - s_i$, to the range $w_i \in [w_{min}, w_{max}]$, in order to control the temporal granularity of actions. A k-by-1 indicator vector $g_i$ is used to assign each segment to an action: $g_{ci} = 1$ if $Y_i$ belongs to class c, otherwise $g_{ci} = 0$.
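As a small illustration of this representation, the sketch below expands segment boundaries and segment labels into one label per frame while checking the length constraint. The function name `frame_labels` and the convention that the boundary list starts at 1 and ends at n + 1 are assumptions made for the example.

```python
def frame_labels(s, g, n, w_min, w_max):
    """Expand a segmentation into per-frame labels.

    s : segment start positions s_1, ..., s_{m+1} (1-based), assumed to
        satisfy s_1 = 1 and s_{m+1} = n + 1
    g : class index of each of the m segments
    """
    if s[0] != 1 or s[-1] != n + 1 or len(g) != len(s) - 1:
        raise ValueError("boundaries must cover frames 1..n exactly")
    labels = []
    for i, c in enumerate(g):
        w = s[i + 1] - s[i]  # segment length w_i = s_{i+1} - s_i
        if not (w_min <= w <= w_max):
            raise ValueError(f"segment {i + 1} has length {w}, outside the allowed range")
        labels.extend([c] * w)
    return labels


# Toy usage: 7 frames split into a 3-frame and a 4-frame segment.
print(frame_labels([1, 4, 8], [0, 1], n=7, w_min=2, w_max=4))
```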

Energy function for ACA
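There are two major challenges in framing temporal segmentation as a clustering problem: (1) modeling the temporal variability of human actions, and (2) defining a robust metric between temporal actions. To address these problems, ACA extends previous work on kernel k-means (eq. 2) by minimizing:

$$J_{ACA}(G, s) = \sum_{c=1}^{k} \sum_{i=1}^{m} g_{ci} \, dist_c(Y_i) \quad \text{s.t. } G^T 1_k = 1_m \text{ and } s_{i+1} - s_i \in [w_{min}, w_{max}], \tag{5}$$

where $dist_c(Y_i)$ is the distance between the segment $Y_i$ and the center of class c, computed from DTAK-based kernels between segments. It is worth pointing out the differences between ACA (eq. 5) and kernel k-means (eq. 2): (1) ACA clusters variable features, that is, each segment $Y_i$ might have a different number of frames, whereas standard kernel k-means has a fixed number of features (rows of $d_i$); (2) ACA uses a DTW-based kernel, the DTAK [20], to achieve temporal invariance.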
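To make the idea of a DTW-based kernel between segments of different lengths concrete, here is a minimal Python sketch of a dynamic time-alignment kernel. It assumes the recursion commonly given for this kernel: accumulate frame similarities along the best warping path with weights 1, 2, 1 for the vertical, diagonal and horizontal steps, then normalize by the sum of the two lengths. The Gaussian frame kernel, the value of `sigma`, and all function names are assumptions for illustration and may differ in detail from the formulation used for ACA.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Frame-level kernel between two feature vectors (an assumed choice)."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))


def dtak(X, Y, frame_kernel=gaussian_kernel):
    """Dynamic time-alignment kernel between two sequences of frames."""
    nx, ny = len(X), len(Y)
    neg_inf = float("-inf")
    # P[i][j]: best accumulated similarity aligning X[:i] with Y[:j].
    P = [[neg_inf] * (ny + 1) for _ in range(nx + 1)]
    P[0][0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            kij = frame_kernel(X[i - 1], Y[j - 1])
            P[i][j] = max(
                P[i - 1][j] + kij,            # advance in X only
                P[i - 1][j - 1] + 2.0 * kij,  # advance in both sequences
                P[i][j - 1] + kij,            # advance in Y only
            )
    return P[nx][ny] / (nx + ny)


# Toy usage: b is a time-warped copy of a, c is unrelated.
a = [(0.0,), (1.0,), (2.0,)]
b = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,)]
c = [(9.0,), (9.0,), (9.0,)]
print(dtak(a, b), dtak(a, c))
```

Because the step weights along any warping path sum to the total number of frames in both sequences, the normalized value stays at or below the largest frame-level similarity, and a sequence compared with a time-stretched copy of itself scores as high as with itself.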

Coordinate descent optimization
In this section, we describe a Dynamic Programming (DP)-based algorithm to perform coordinate descent to solve for ACA (i.e. G, s).
To optimize over s and G with DP, we introduce an auxiliary function,

$$L(u, v) \doteq \min_{G, \, s \, : \, X_{[u,v]}} J_{ACA},$$

to store the minimum cost $J_{ACA}$ over all segmentations of the subsequence $(x_u, x_{u+1}, \cdots, x_v)$. Observe that L(u, v) contains the minimum $J_{ACA}$ for the best s, G within the (u, v) range. Note that the function L depends on the range of segmentation in the sequence X, allowing us to use the traditional DP divide-and-conquer paradigm:

$$L(u, v) = \min_{u < i \le v} \big( L(u, i-1) + L(i, v) \big). \tag{8}$$

The above equation implies that the optimal decomposition of the subsequence $X_{[u,v]}$ is achieved only when the segmentations on both sides, $X_{[u,i-1]}$ and $X_{[i,v]}$, are optimal and their sum is minimal. Moreover, this recursive decomposition can be repeated until reaching a granular segment (i.e. one whose length w lies in the range $[w_{min}, w_{max}]$). In fact, the previous decomposition (eq. 8) is equivalent to:

$$L(v) = \min_{v - w_{max} < i \le v - w_{min} + 1} \Big( L(i-1) + \min_{g} \sum_{c=1}^{k} g_c \, dist_c(X_{[i,v]}) \Big), \tag{9}$$

where L(v) abbreviates L(1, v) and L(0) = 0. When v = n, L(n) is the optimal cost of the segmentation that we seek. The inner values i and g that lead to the minima are the head position and label for the last segment respectively. Eq. 9 unifies both point-based k-means and segment-based ACA clustering through the length constraint $[w_{min}, w_{max}]$. If $w_{min} = w_{max} = 1$, where each segment consists of one single frame, this is equivalent to kernel k-means. Based on the recursive equation (eq. 9), our algorithm uses a forward and a backward step to obtain $G^{new}, s^{new}$ from $G^{old}, s^{old}$:

1. Forward step: Scan from the beginning (v = 1) of the sequence to its end (v = n), see fig. 2. For each v, compute

$$L(v) = \min_{v - w_{max} < i \le v - w_{min} + 1} \big( L(i-1) + dist_{\hat{c}}(X_{[i,v]}) \big),$$

where $\hat{c}$ is the closest cluster of the previous segmentation $(G^{old}, s^{old})$ for the segment $X_{[i,v]}$. A record is kept of the optimal head position i and label $\hat{c}$ for each v.

2. Backward step: Trace back from the end of the sequence (v = n), using the recorded head positions and labels to recover the segments and obtain $G^{new}, s^{new}$.

Figure 2. Forward ACA step. To construct L(v) (v = 32), the starting position (i = 25) of the best segment (bold) is selected from the pool of candidates in the shadow area. The label for the segment is determined to be the closest model (triangle). The segments of three classes, which are marked in triangle, square and circle respectively, are the segmentation of the sequence in the last step.
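The forward scan and the back-tracking over recorded head positions and labels can be sketched as follows. This is a minimal illustration under stated assumptions: `segment_dp` and `seg_cost` are hypothetical names, and `seg_cost(i, v)` stands in for the distance of the candidate segment covering frames i..v to its closest cluster under the previous segmentation. In the toy usage that distance is replaced by a plain squared deviation from two fixed means, not by a DTAK-based distance.

```python
def segment_dp(n, w_min, w_max, seg_cost):
    """Segment frames 1..n by dynamic programming.

    seg_cost(i, v) -> (cost, label) for the candidate segment covering
    frames i..v (1-based, inclusive). Returns (total cost, segments), where
    each segment is a (head, tail, label) triple.
    """
    inf = float("inf")
    L = [inf] * (n + 1)      # L[v]: minimum cost of segmenting frames 1..v
    L[0] = 0.0
    head = [0] * (n + 1)     # best head position of the last segment ending at v
    label = [None] * (n + 1) # label of that segment

    # Forward step: scan v = 1..n over all admissible segment lengths.
    for v in range(1, n + 1):
        for w in range(w_min, min(w_max, v) + 1):
            i = v - w + 1
            if L[i - 1] == inf:
                continue  # the prefix 1..i-1 cannot be segmented
            cost, c = seg_cost(i, v)
            if L[i - 1] + cost < L[v]:
                L[v], head[v], label[v] = L[i - 1] + cost, i, c
    if L[n] == inf:
        raise ValueError("no segmentation satisfies the length constraints")

    # Backward step: trace the recorded heads from v = n down to the start.
    segments = []
    v = n
    while v > 0:
        segments.append((head[v], v, label[v]))
        v = head[v] - 1
    segments.reverse()
    return L[n], segments


# Toy usage with a stand-in cost: squared deviation from two fixed means.
x = [0, 0, 0, 5, 5, 5, 5]
means = [0.0, 5.0]


def toy_cost(i, v):
    frames = x[i - 1:v]
    costs = [sum((f - m) ** 2 for f in frames) for m in means]
    c = min(range(len(means)), key=costs.__getitem__)
    return costs[c], c


total, segments = segment_dp(len(x), 2, 4, toy_cost)
print(total, segments)
```

Setting both length bounds to 1 makes every frame its own segment, which mirrors the observation that the point-based clustering is a special case of the segment-based one.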

The Algorithm
In this section, we give further details on an efficient implementation of the DP algorithm.
The calculation of $dist_c(Y_i)$ in eq. 9 involves three components: $\kappa_{ii}$, $h_c$, and $f_{ci}$. The term $\kappa_{ii}$ is constant and does not affect the optimization of G, s. The latter two terms depend on DTAKs. However, storing all possible kernel pairs between segments is prohibitive in space, $O(n^2 w^2)$, and time, $O(n^2 w^4)$.
To make the algorithm efficient in practice, we need to reduce the computational cost of h c and f ci . First, it is possible to directly calculate the DTAK used for h c . Given a segmentation (G old , s old ), the number of segments are O( n m ), where the space and time needed for h c s is O( n 2 m 2 ) and O(n 2 ) respectively. For f ci s, we instead maintain a relatively smaller active kernel matrix, A, to reuse the previous calculations of DTAK (eq. 6) during the optimization.
The optimization process mainly consists of two algorithms, DP Search (alg. 1) and ActiveUpdate (alg. 2). DP Search starts with a forward pass that constructs L, storing the minimum value at each position, together with C and S, which store the corresponding labels and head positions respectively. The components of the updated parameters are then obtained by back-tracking. During processing, $A(v, w_v, j, w_j)$, which stores the kernel between the segments $X_{(v - w_v, v]}$ and $X_{[s^{old}_j, s^{old}_{j+1})}$, is updated from its neighbors according to the definition in eq. 6. In order to reuse space, a circularly-linked list of length $w_{max}$ is implemented to index the position of segment $X_{(v - w_v, v]}$, i.e. $pos_v = v \bmod w_{max}$.

Complexity analysis
Given a sequence with n frames and an average segment width w, we need $O(n^2)$ space to store the kernel matrix of frames. At the beginning of each step in the iterative procedure, the $h_c$ terms and the active matrix A are created as blocks of $O(n^2 / w^2)$ and $O(n w^2)$ respectively. As v increases, each evaluation takes $O(nw)$ to calculate the distance from $G^{old}, s^{old}$, and therefore it takes $O(n^2 w)$ to scan through the whole sequence. To sum up, the space complexity is $O(n^2)$, which makes it possible to process sequences with thousands of frames. The overall time complexity is $O(n^2 w t)$, where t is the number of iterative steps.

Segmentation on motion capture data
This section describes two strategies to scale ACA to segment large collections of motion capture data: (1) temporal reduction, and (2) good initialization.

Temporal reduction
It is not computationally practical to run ACA on large amounts of motion capture data. Because human motion is typically smooth, and recent work [4,9] has shown evidence that human motion is locally linear, it is possible to temporally reduce the number of frames without losing important information relevant for temporal segmentation. Following previous work on temporal segmentation [7], we first apply a clustering step to group frames into $k_{reduce}$ classes, and remove irrelevant consecutive frames within the same class.
We use the Carnegie Mellon Motion Capture database, which contains 149 subjects performing several activities. The motion capture system uses 41 markers per subject. Similar to the method of Barbic et al. [2], we only consider the 14 most informative joints out of 29. The 3-D Euler angles are transformed to 4-D quaternions to provide a smoother and continuous representation of motion. We apply the k-means [11] algorithm to cluster the frames into $k_{reduce}$ classes. A large $k_{reduce}$ is usually preferred to capture subtle human behavior. Fig. 3(a) shows the labels ($k_{reduce} = 20$) of a 10078-frame sequence (subject 86, trial 4), which contains seven distinct actions. Every 20 consecutive frames that belong to the same class are reduced to one frame (fig. 3(b)).
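The reduction rule in the last sentence can be sketched as below. This is one possible reading, stated as an assumption: within each run of consecutive frames sharing a k-means label, one frame is kept per block of up to 20 frames, so a run shorter than 20 frames still contributes one frame. The function name `temporal_reduce` and the choice to keep the first frame of each block are hypothetical.

```python
def temporal_reduce(labels, block=20):
    """Return the indices (0-based) of the frames kept after reduction.

    labels : per-frame cluster labels from the frame-level clustering step
    block  : number of consecutive same-label frames collapsed into one
    """
    keep = []
    run_start = 0
    for t in range(1, len(labels) + 1):
        # A run ends at the end of the sequence or when the label changes.
        if t == len(labels) or labels[t] != labels[run_start]:
            keep.extend(range(run_start, t, block))  # first frame of each block
            run_start = t
    return keep


# Toy usage: a 45-frame run, a 5-frame run, then a 20-frame run.
toy_labels = [0] * 45 + [1] * 5 + [0] * 20
print(temporal_reduce(toy_labels))
```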

ACA initialization
Minimizing ACA is a non-convex optimization problem, and the quality of the solution is highly sensitive to initial conditions [5]. In this section, we describe a coarse segmentation process based on spectral clustering methods that provides a good initialization for ACA.
Generally speaking, many human actions, such as walking or running, are periodic movements. This periodicity can be observed in the block structure of the similarity matrix between all the frames. Fig. 3(e) shows the similarity matrix of the 608 frames (after temporal reduction). The similarity matrix is computed by assigning 1 if the two frames belong to the same class given by k-means. To emphasize the frames that might belong to the same type of actions, we modified the similarity matrix by propagating the similarity between two temporally close frames that share the same cluster. More specifically, given two pairs of frames, x i1 , x j1 vs x i2 , x j2 , we define the new similarity κ as We use spectral clustering algorithms [7,18] to find an embedding where samples are easier to cluster. Notice that long (short) segments will be divided (merged) to satisfy the predefined length constraint [w min , w max ] for each of the actions. Fig. 3 (c) shows that this coarse initialization ( fig. 3 (d)) identifies meaningful segments.

Experimental results
In this section, several experiments on synthetic and real data evaluate the segmentation performance of ACA.

Synthetic data
In the first experiment, we synthetically generate a random 1-D sequence (fig. 4(a)) with four temporal clusters. The length of each segment is restricted to be between 10 and 15 samples (frames), and the value of each sample is a uniform random integer in the range [1,20]. Several artificial frames are randomly inserted (temporal noise) into the sequence. The parameter $p_{noise}$ controls the amount of noise. For instance, $p_{noise} = 0.2$ indicates that one noise frame might be inserted every 5 frames.
The ACA algorithm runs 10 times with random initialization, and the solution with minimum $J_{ACA}$ is selected. DTAK is constructed based on an exponential kernel $\kappa$ with $\sigma = \infty$ (fig. 4(c)). Observe that in this case, no temporal reduction or good initialization is used.
To quantify the segmentation accuracy, we need to compare the segmentation provided by ACA with the ground truth (see footnote 3). To compute the accuracy, we use a confusion matrix (fig. 4(e)) between the ACA and ground-truth segmentations. The confusion matrix is calculated as follows:

$$C(c_1, c_2) = \sum_{i} \sum_{j} g^{ACA}_{c_1 i} \, g^{truth}_{c_2 j} \, | Y^{ACA}_i \cap Y^{truth}_j |,$$

where $Y^{ACA}_i$ and $Y^{truth}_j$ are two segments given by the ACA algorithm and ground-truth data respectively, $g^{ACA}_{c_1 i}$ and $g^{truth}_{c_2 j}$ are the corresponding entries of their indicator matrices, and $|Y^{ACA}_i \cap Y^{truth}_j|$ denotes the number of frames they share. Fig. 4(e) illustrates the confusion matrix for the synthetic problem. The classical Hungarian algorithm [15] is applied in order to find the optimum solution for the cluster correspondence problem. Fig. 4(d) depicts the DTAK matrix for the segments given by the ACA algorithm. Good segmentations tend to have large within-class and low between-class connectivity. Fig. 4(f) shows the accuracy results of our algorithm for different levels of noise ($p_{noise}$ = 0.0-0.3). For each $p_{noise}$, we repeated the above generation of data 10 times to average the results.
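The evaluation described above can be sketched on per-frame labels. Counting the frames shared by segments of two classes is the same as counting frames that carry those two labels, so the confusion matrix is built frame by frame. Two assumptions are made here: accuracy is taken to be the fraction of frames that agree under the best one-to-one cluster correspondence, and that correspondence is found by brute force over permutations instead of the Hungarian algorithm, which is adequate only for a small number of clusters. All function names are hypothetical.

```python
from itertools import permutations


def confusion_matrix(pred, truth, k):
    """C[c1][c2] = number of frames labeled c1 by the method and c2 in the ground truth."""
    C = [[0] * k for _ in range(k)]
    for p, t in zip(pred, truth):
        C[p][t] += 1
    return C


def best_match_accuracy(C):
    """Fraction of frames matched under the best one-to-one label correspondence.

    Brute force over all permutations; a stand-in for the Hungarian algorithm.
    """
    k = len(C)
    total = sum(sum(row) for row in C)
    best = max(
        sum(C[c][perm[c]] for c in range(k)) for perm in permutations(range(k))
    )
    return best / total


# Toy usage: predicted labels are a relabeling of the truth with one wrong frame.
pred = [1, 1, 1, 0, 0, 2, 2, 2]
truth = [0, 0, 0, 1, 1, 2, 2, 1]
C = confusion_matrix(pred, truth, k=3)
print(C, best_match_accuracy(C))
```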

Motion capture data
In the second experiment, we choose the 15 sequences performed by subject 86, each of which is a combination of 10 natural actions (e.g. walking, punching, drinking, running). Typically each sequence contains 8000 frames (70 secs). Quaternions are used as features to group the frames into 20 clusters and reduce the length of the sequence as explained in section 4.1. The length for each activity ranges from 50 to 200 frames. After initializing with the algorithm described in section 4.2, ACA is optimized until convergence. For each sequence, ACA usually converges in 3-5 iterations. Each iteration took 30 seconds on average in unoptimized Matlab code on an Intel Core 2 Duo 2.4 GHz machine with 2 GB memory. Fig. 5 shows the segmentation obtained through ACA, manual labeling, and the method proposed by Barbic et al. [2] respectively, in four sequences. Different actions are marked with different colors. The black stripes in the human label sequences indicate areas where the judgments vary among labelers, while areas in the PCA bars indicate the 2-sec preparation period used for estimating the underlying quaternion distribution [2]. We should mention that the PCA approach works in an on-line procedure, while ACA is an off-line approach. Moreover, ACA identifies the distinct actions by providing a segmentation closer to the one provided by the human observer. In fact, the motions that are almost cyclic were more clearly detected (dark lines with the same color on both sides) by ACA than by PCA-based approaches or human labeling.

Footnote 3: Recall that for the synthetic data, the ground-truth segmentation is known in advance.

Conclusions
In this paper, we have presented ACA, an extension of kernel k-means for temporal segmentation. ACA combines standard vector-space approaches for clustering with Dynamic Time Alignment Kernel (DTAK) and Dynamic Programming (DP). The main contributions of our paper are: (1) formulation of temporal segmentation with ACA, (2) temporal reduction and initialization strategies for ACA, and (3) efficient computation of ACA. ACA has been applied to temporal decomposition of motion capture data into a set of actions, but it is a generic algorithm and can be applied to other data (e.g. facial expression, speech). Although ACA has shown promising preliminary results, there is still the need for algorithms to automatically select the optimal number of actions and avoid local minima in the optimization.

Acknowledgements
This work was partially supported by the National Science Foundation under Grant No. EEEC-0540865. The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.