Feature seeding for action recognition

Progress in action recognition has been in large part due to advances in the features that drive learning-based methods. However, the relative sparsity of training data and the risk of overfitting have made it difficult to directly search for good features. In this paper we suggest using synthetic data to search for robust features that can more easily take advantage of limited data, rather than using the synthetic data directly as a substitute for real data. We demonstrate that the features discovered by our selection method, which we call seeding, improve performance on an action classification task on real data, even though the synthetic data from which the features are seeded differs significantly from the real data, both in terms of appearance and the set of action classes.


Introduction
A human researcher who designs a feature has an almost insurmountable advantage over a learning algorithm: they can appeal to an intuition built over thousands of hours of direct experience with the world to decide which parts of the visual experience are important to consider and which are noise.
In contrast, an algorithm that attempts to select or learn ... can meet that requirement (see Fig. 1). What we demonstrate is that one can leverage observations of human actions obtained from one source to classify actions observed from another loosely related source, even if the two sets of actions differ. This transfer is possible because the two datasets are correlated, not necessarily in terms of specific actions but because both depict humans performing visually distinctive movements.
Figure 1. Our method uses motion features from synthetic data (left) to seed features that are effective for real data (right), even though the two data sets share no common actions and are very different in terms of appearance.
Figure 2. System overview: a pool of randomly generated features (a) is filtered, or seeded, on synthetic data (b) to produce a greatly reduced number of features (e) that are likely to be informative. We extract descriptors (e.g., trajectories) on real data (c), and these descriptors are fed through the seeded feature set to produce label vectors $q_i$, one per descriptor. These label vectors are then accumulated into a histogram H that represents the video clip.
In more concrete terms, many popular bag of visual words (BoW) techniques rely on quantizing descriptors computed from video; generally either simple unsupervised techniques such as k-means clustering [11,15,20,24] or hand-crafted quantization strategies [18] are used. Our suggested seeding can be seen as employing synthetic data to drive the selection of the quantization method itself.
The basic organization of our method can be seen in Fig. 2. First, a set of synthetic video clips is generated using motion capture data. These clips are generated in groups, where each group is an independent binary classification problem. Next, raw motion descriptors are extracted from the synthetic data pool in the form of trajectory snippets [18,19] and histogram of optical flow (HOF) descriptors around space-time interest points (STIP) [15]. Note that we are not proposing a complete system for action recognition; we consider only motion features in a simplified recognition framework in order to isolate the effects of our feature seeding.
Each clip produces many descriptors: trajectory descriptors produce on the order of 300 descriptors per frame of video, while STIP-HOF produces closer to 100 descriptors per frame. These descriptors are sampled to produce a candidate pool of features, where each feature is a radial basis function (RBF) classifier whose support vectors are randomly drawn from the descriptors. Then the synthetic data is used to rank features based on their aggregate classification performance across many groups of synthetic data. We denote the highly ranked features selected in this way as the seeded features. The seeded features can then be applied to real data and used as input to conventional machine learning techniques. For evaluation, we consider the seeded features in a standard bag-of-words framework, using linear SVMs as classifiers.

Related work
Our proposed technique is related to both domain adaptation and feature selection, but targets a different level of information transfer than either. Domain adaptation techniques can be powerful across limited and well-characterized domains, such as in [14]. However, the gains are often modest, and as the aptly titled work "frustratingly simple domain adaptation" by Daumé [10] shows, even simple techniques can outperform sophisticated domain adaptation methods. Likewise, transfer learning methods such as transductive SVMs [8] can provide modest benefits, but are often computationally expensive and often restricted to datasets with shared classes.
In terms of feature selection, our method falls firmly into the filtering category of the taxonomy of Guyon and Elisseeff [13], in which features are selected without knowing the target classifier. The choice of a filtering method rather than a wrapper is motivated by the larger risk of overfitting in wrapper methods [13,17,29]. We use a feature ranking technique [13,29] that is inspired by boosting-based methods [9,28,29]. However, since we do not assume that the specific task is known on the target data, we do not rank features by their performance on a single task, but instead on aggregate performance over a basket of independent randomly generated tasks.
Typical domain adaptation techniques assume that the specific task is the same between the source and target domains [2,10,14], and this assumption is the most common one in transfer learning as well [8,9,12]. That is to say, if the problem were action recognition, then these techniques would need the specific actions of interest to be matched across the domains. For example, Cao et al. perform cross-dataset action detection, using one dataset (KTH) [24] to improve performance on a related one (MSR Actions Dataset II) [4]. However, the particular actions that are detected are present in both datasets, and indeed MSR was explicitly constructed to share those actions with KTH. In contrast, our seeding technique requires no forward knowledge of what classes are present in the real data, and does not require any shared action classes at all between the synthetic and target tasks.
A related idea is that of learning from pseudo-tasks, as in Ahmed et al. [1], where the learning of mid-level features is regularized by penalizing features for poor performance on a set of artificially constructed pseudo-tasks. Our synthetic data can be seen as pseudo-tasks, but with the important distinction that our synthetic tasks force features to be good at the specific problem of human action recognition rather than the more general problem of image processing.
We select from a pool of features that are computed from raw descriptors, which may take many forms; we consider trajectory fragments [18] and space-time interest points [15]. These features are slightly modified Gaussian radial basis function (RBF) classifiers. We choose to use features of this form because RBF can approximate arbitrary functions [25], and random classifiers and features have, by themselves, shown benefits over other representations [6,23]. The choice of each feature as an independent classifier means that, after features have been seeded, only the selected features need to be computed on the test dataset. As our method is capable of drastically reducing the feature count (e.g., from 10,000 to 50 with no loss of accuracy), this results in greatly reduced computational and storage requirements, and simplifies subsequent stages in layered architectures.
Figure 3. Example frame from synthetic data. The synthetic video is abstract in appearance, but the movement of the armature man is derived from motion capture data. The texture does not match real data in appearance and serves only to provide descriptor extractors (e.g., trajectories derived from optical flow) with stable inputs.
There has been limited work on using synthetic data for action recognition. A number of approaches use synthesized silhouettes from motion capture to match actions (e.g., [7]), but these are limited to a constrained domain (where silhouettes can be reliably extracted) and require that the action being searched be present in the data set. Recent work with depth cameras [26] demonstrates the power of this type of approach when the synthetic model of the task is very good. Another line of work by Qureshi and Terzopoulos [22] uses synthetic crowds to tune sensor networks for surveillance. In still image analysis, Pinto et al. use a screening approach to select entire models from synthetic data [21]. In contrast, our approach selects generally good features that are useful across models.
We demonstrate that even crude, unmatched synthetic data that makes no attempt to directly mimic the target datasets can be used to improve performance, and furthermore, that this increase can be achieved by straightforward selection mechanisms. Additionally, seeding a quantization scheme from synthetic data outperforms both unsupervised and heuristic quantization schemes.

Descriptor extraction and handling
Our synthetic data can support a wide variety of motion descriptors, and in this paper we consider two different approaches: trajectory-based descriptors, and histogram of optical flow (HOF) descriptors as per Laptev et al. [15].
For the trajectory descriptors, points are tracked using a KLT tracker [3], with the number of tracked points capped at 300. Each trajectory is segmented into overlapping windows of ten frames, where each window or snippet is expressed as a set of coordinates in the image
\[ \{(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)\}, \tag{1} \]
where $T = 10$ is the number of frames in the overlapping window. This is converted into a relative representation, so that
\[ \{(dx_1, dy_1), (dx_2, dy_2), \ldots, (dx_{T-1}, dy_{T-1})\}, \tag{2} \]
where $dx_t = x_{t+1} - x_t$ and $dy_t = y_{t+1} - y_t$. This relative representation is the basic trajectory descriptor on which the k-means centers are computed. For input to RBF features we perform an additional normalization step to normalize the length of each link $(dx, dy)$ to 1. This normalization to discard magnitude information is similar to that used in other techniques [18,19].
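The trajectory descriptor construction can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the authors' code: a stride of one frame between overlapping windows and the handling of zero-length links are both assumptions, and the function names are hypothetical.

```python
import math


def snippets(track, T=10):
    """Split one tracked point's (x, y) path into overlapping windows of T frames.

    A stride of one frame is an assumption; the text only says "overlapping".
    """
    return [track[i:i + T] for i in range(len(track) - T + 1)]


def relative_descriptor(points):
    """Turn T absolute coordinates into T-1 displacements (dx_t, dy_t)."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(points, points[1:])]


def normalize_links(links, eps=1e-8):
    """Scale every link to unit length, discarding magnitude information.

    Zero-length links are left as (0, 0); this handling is an assumption.
    """
    out = []
    for dx, dy in links:
        norm = math.hypot(dx, dy)
        out.append((dx / norm, dy / norm) if norm > eps else (0.0, 0.0))
    return out


# Usage: a 12-frame track yields three 10-frame snippets,
# each with nine displacement links.
track = [(float(i), 2.0 * i) for i in range(12)]
windows = snippets(track)
rbf_input = normalize_links(relative_descriptor(windows[0]))
```

The unnormalized output of `relative_descriptor` corresponds to the representation used for k-means, while the normalized links correspond to the input described for the RBF features.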
For the histogram of optical flow (HOF) based motion descriptor, we use the Space-Time Interest Points of Laptev et al. [15] using a HOF descriptor around each point (STIP-HOF). This method finds sparse interest points in space and time, and computes HOF descriptors around each. Each descriptor found in this manner is a 90-dimensional vector.

Feature pool

Feature generation and evaluation
We consider a pool of candidate features, where intuitively each feature can be viewed as an RBF classifier. Formally, each feature evaluates a function of the form
\[ f_k(d) = \mathrm{clip}\Big( \sum_i w_{k,i} \exp\big( -\beta_{k,i} \lVert v_{k,i} - d \rVert^2 \big) \Big), \tag{3} \]
where $v_{k,i}$ is one "support vector" for the feature $f_k$, and $w_{k,i}$ and $\beta_{k,i}$ are the weight and beta for that support vector. The $\mathrm{clip}(\cdot)$ function clips the value to the range [0, 1], so that values less than zero (definite rejections) are thresholded to zero, while values above one (definite accepts) are thresholded to one and all other values are unchanged. Our experiments use RBF classifiers as features due to their generality, but in practice our method can employ any type of classifier.
Given this functional form, we generate a pool (in our case, of size 10,000) of features by randomly selecting support vectors from the synthetic dataset's descriptors. The weight associated with each support vector is chosen from a normal distribution $N(0, 1)$, and the $\beta$ associated with each support vector from a uniform distribution over the range [0, 10]. These parameters were chosen arbitrarily to generate a large range of variation in the classifiers. Example trajectory descriptors that might be accepted by these types of features can be seen in Fig. 4.
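A minimal Python sketch of pool generation follows. It assumes each feature is a clipped weighted sum of Gaussian RBF responses, as the description of support vectors, weights, betas, and clipping suggests. The number of support vectors per feature (`n_support`) is not given in the text and is a placeholder, and all function names are hypothetical.

```python
import math
import random


def make_feature(descriptors, n_support, rng):
    """Build one random feature as a list of (support_vector, weight, beta).

    Support vectors are sampled from the descriptors, weights from N(0, 1),
    and betas from U[0, 10], following the text.
    """
    return [(rng.choice(descriptors), rng.gauss(0.0, 1.0), rng.uniform(0.0, 10.0))
            for _ in range(n_support)]


def evaluate_feature(feature, d):
    """Clipped weighted sum of Gaussian RBF responses for descriptor d."""
    total = 0.0
    for v, w, beta in feature:
        sq_dist = sum((a - b) ** 2 for a, b in zip(v, d))
        total += w * math.exp(-beta * sq_dist)
    return min(1.0, max(0.0, total))  # clip to [0, 1]


def make_pool(descriptors, pool_size=10000, n_support=3, seed=0):
    """Generate the candidate pool; n_support=3 is an arbitrary assumption."""
    rng = random.Random(seed)
    return [make_feature(descriptors, n_support, rng) for _ in range(pool_size)]


# Usage with toy 2-D descriptors and a small pool.
toy_descriptors = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
pool = make_pool(toy_descriptors, pool_size=20)
responses = [evaluate_feature(f, [0.2, 0.8]) for f in pool]
```

Because the weights are drawn from a zero-mean distribution, many random features respond with exactly zero after clipping, which is one reason a large pool is useful before selection.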
The feature can also be seen as computing an intermediate representation $q(d)$ corresponding to a descriptor $d$, so that
\[ q(d) = [f_1(d), f_2(d), \ldots, f_K(d)], \tag{4} \]
where $K$ is the number of features. When the pool of features is evaluated, the "histogram" bin corresponding to a feature is evaluated according to
\[ b_k(D) = \sum_{d \in D} f_k(d), \tag{5} \]
where $D$ is a set of descriptors (e.g., all the descriptors computed from a given video). The entire histogram is expressed as
\[ h_D = [b_1(D), b_2(D), \ldots, b_K(D)] = \sum_{d_i \in D} q_i, \tag{6} \]
which is to say that the feature $f_k$ is treated as an indicator function for whether a descriptor belongs to label $k$, where a descriptor might have multiple labels, and where the labels a descriptor $d_i$ takes on are given in the vector $q_i$.
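The accumulation of label vectors into a per-video histogram can be sketched as follows. This is a minimal illustration that treats each feature as an arbitrary callable returning a value in [0, 1]; the function names and the toy features are hypothetical.

```python
def label_vector(features, d):
    """Soft label vector for one descriptor: one response per feature."""
    return [f(d) for f in features]


def histogram(features, descriptors):
    """Sum the label vectors of all descriptors in a video."""
    h = [0.0] * len(features)
    for d in descriptors:
        for k, value in enumerate(label_vector(features, d)):
            h[k] += value
    return h


# Usage with two toy features: a hard indicator and a constant soft response.
toy_features = [
    lambda d: 1.0 if d[0] > 0 else 0.0,
    lambda d: 0.5,
]
video_descriptors = [[1.0], [-1.0], [2.0]]
h = histogram(toy_features, video_descriptors)
```

Unlike nearest-neighbor quantization against k-means centers, a descriptor here can contribute to several bins, or to none.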

Feature seeding/filtering
Given the pool of features, we select for, or seed, a good set of features from the pool by rating them on a set of synthetic data. In practice, the seeding is similar to a single iteration of boosting, with the important difference that the seeding attempts to find features that work well across many different problems, rather than a single one.
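Let $P_n$ and $N_n$ correspond to the sets of descriptor sets (videos) in the positive and negative sample sets, respectively, of a synthetic group $n$. Then we can express the rating $a_{k,n}$ of a feature $k$ on group $n$ ($n = 1, \ldots, N$) as
\[ a_{k,n} = \max_t \frac{\sum_{D \in P_n} I(b_k(D) > t) + \sum_{D \in N_n} I(b_k(D) \le t)}{|P_n| + |N_n|}, \tag{7} \]
where $b_k(D)$ is the result of evaluating feature $k$ on descriptor set (video) $D$, $t$ is the stump threshold, and $I(\cdot)$ denotes the indicator function.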
Note that this is just the accuracy of a decision stump on the $b_k(D)$ values. We have also considered mutual information based rating, but we find that it has slightly worse performance, probably because the stump-classifier rating we use here is a better match for the final SVM classification. However, our method does not depend on any single rating metric, and it is straightforward to swap this metric for another. Now, we express the aggregate accuracy of a feature over all groups as
\[ A_k = g(\{a_{k,n} \mid n = 1, \ldots, N\}), \tag{8} \]
where $g(\cdot)$ is a function that operates on a set. In our case, we consider three possible aggregation functions $g$: $g_{\min}(X) = \min(X)$, $g_{\max}(X) = \max(X)$, and $g_{\text{avg}}(X) = \mathrm{mean}(X)$. Intuitively, $g_{\min}$ takes the worst-case performance of a feature against a collection of problems, $g_{\max}$ takes the best-case performance, and $g_{\text{avg}}$ takes the average case. Note that because the evaluation problems are randomly generated from a large motion capture database (see Section 3.4), it is unlikely that they will share any action classes in common with the target task. The goal is to select features that perform well against a variety of action recognition tasks (i.e., that can discriminate between different human actions).
Then we simply rank the features according to their $A_k$ values and select the top $s$ ranked ones. In practice, we use seeding to select the top $s = 50$ features from a pool of 10,000.
Given a set of training and test videos on real data, we compute histograms $h_D$, where each histogram is computed according to (Eqn. 6) over the reduced set of $s$ features. Then we simply train a linear SVM as the classifier.
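The rating, aggregation, and ranking steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes the stump may threshold in either direction (consistent with the remark in the feature statistics discussion that a feature scoring below 0.5 simply flips), and the names `stump_accuracy`, `AGGREGATORS`, and `seed_features` are hypothetical.

```python
def stump_accuracy(pos_values, neg_values):
    """Best accuracy of a single-threshold rule on per-video feature values.

    Tries every distinct value as a threshold and allows the decision
    direction to flip, so the result is never below 0.5 on balanced data.
    """
    labeled = [(v, 1) for v in pos_values] + [(v, 0) for v in neg_values]
    total = len(labeled)
    best = 0.0
    for t in [float("-inf")] + sorted({v for v, _ in labeled}):
        correct = sum(1 for v, y in labeled if (v > t) == (y == 1))
        acc = correct / total
        best = max(best, acc, 1.0 - acc)
    return best


# The three aggregation functions over a feature's per-group ratings.
AGGREGATORS = {
    "min": min,
    "max": max,
    "avg": lambda xs: sum(xs) / len(xs),
}


def seed_features(ratings, s=50, how="min"):
    """Return indices of the top-s features.

    ratings[k][n] is the rating of feature k on synthetic group n.
    """
    g = AGGREGATORS[how]
    scores = [g(row) for row in ratings]
    order = sorted(range(len(ratings)), key=lambda k: scores[k], reverse=True)
    return order[:s]


# Usage: three features rated on two groups.
toy_ratings = [[0.9, 0.5], [0.7, 0.7], [0.6, 0.95]]
robust_choice = seed_features(toy_ratings, s=1, how="min")
aggressive_choice = seed_features(toy_ratings, s=1, how="max")
```

In the toy example the worst-case criterion prefers the feature that is consistently decent, while the best-case criterion prefers the feature with a single very strong group.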

Synthetic data generation
In order to perform our feature seeding, we must be able to generate relatively large amounts of synthetic data. Since it is difficult to produce synthetic data that is comparable to real-world data in terms of raw pixel-level appearance, we concentrate on the simpler task of generating synthetic data that matches real-world data in terms of motion. We make no attempt to mimic real-world appearance: the human model in our synthetic data is an abstract armature (Fig. 3). However, in terms of motion it is a reasonable analog, since its motion is derived from human motion capture data (Fig. 1).

Synthetic data organization
The synthetic data is organized into groups of clips. Each group consists of a number of positive samples all generated from a single motion capture sequence, and a number of negative samples randomly drawn from the entire motion capture dataset. In this way, each task is an independent binary classification problem where the goal is to decide which clips belong to the action vs. a background of all other actions. We reiterate that the actions used in the synthetic data do not correspond to the actions used in the final classification task on real data. Since the synthetic actions are randomly chosen out of motion capture sequences, they may not correspond to easily named actions at all. The two sets of tasks are unmatched so that the seeded features can be used in any future classification task. Each clip is 90 frames long, and each group has 100 clips, corresponding to 50 positive samples and 50 negative samples. In this paper we use 20 groups, for a total of 2000 clips.
A clip is produced by moving a simple articulated human model according to the motion capture sequence, with some added distortions. The synthetic data is rendered at a resolution of 320×240 and a nominal framerate of 30 fps in order to match the MSR and UCF-YT datasets (see Sec. 4.1).
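The group structure can be made concrete with a short Python sketch. It only builds the bookkeeping (which motion capture sequence each clip is drawn from and its label); rendering is out of scope. Integer sequence identifiers, the dictionary layout, and the function name are assumptions for illustration.

```python
import random


def build_groups(sequence_ids, n_groups=20, n_pos=50, n_neg=50, seed=0):
    """Create independent binary tasks, one per group.

    Positives all come from one randomly chosen target sequence; negatives
    are drawn at random from the whole database, as described in the text.
    """
    rng = random.Random(seed)
    groups = []
    for _ in range(n_groups):
        target = rng.choice(sequence_ids)
        positives = [(target, 1) for _ in range(n_pos)]
        negatives = [(rng.choice(sequence_ids), 0) for _ in range(n_neg)]
        groups.append({"target": target, "clips": positives + negatives})
    return groups


# Usage: 20 groups of 100 clips over a database of 2500 sequences.
groups = build_groups(list(range(2500)))
total_clips = sum(len(g["clips"]) for g in groups)
```

With the default settings this reproduces the counts in the text: 20 groups of 100 clips, 2000 clips in total.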

Motion generation
The motion of the human model in the synthetic videos is produced by taking motion capture sequences from the CMU motion capture database [5] and adding temporal distortions and time-varying noise to the joint angles.
For each clip a motion capture file is chosen from the 2500 clips in the CMU motion capture database. If the clip is meant to be a positive example, then the motion capture file and approximate location within that file is given, and the starting frame is perturbed by approximately ±1s. If the clip is meant to be a negative example, a motion capture file is randomly chosen from the entire database, and a starting position within that file is randomly chosen.
Next, temporal distortion is added by introducing a temporal scaling factor (e.g., if the factor is 2.0, then the motion is sped up by a factor of two). Non-integral scaling factors are implemented by interpolating between frames of the motion capture file. Then, a random piece-wise linear function is used to dynamically adjust the temporal scaling factor of the rendered clip. In practice, we limit the random scaling factor to drift between 0.1 and 2.0. Consequently, the timing of a rendered clip differs from that of the base motion capture file in a complicated and nonlinear fashion.
A similar approach is used to add time-varying distortion to the joint angles. A random piece-wise linear function is generated for every degree of freedom for every joint in the armature, and this function is simply added to the joint angles obtained from the motion capture sequence. The magnitude of this distortion is ±0.3 radians.
We add several other distortions and randomizations to the synthetic data. The viewing angle is randomly chosen for each clip, as is the viewing distance. Additionally, the position of the actor/armature is randomized for each clip.
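The two distortions can be sketched in Python as follows. This is a minimal illustration under several assumptions that the text does not specify: the number of knots in each piece-wise linear function, uniform sampling of knot values, and plain linear interpolation of joint angles between motion capture frames. All function names are hypothetical.

```python
import random


def piecewise_linear(n_frames, n_knots, low, high, rng):
    """Sample a random piece-wise linear function at every output frame.

    Knot values are drawn uniformly from [low, high]; n_knots must be >= 2.
    """
    knots = [rng.uniform(low, high) for _ in range(n_knots)]
    values = []
    for i in range(n_frames):
        pos = i * (n_knots - 1) / max(n_frames - 1, 1)
        j = min(int(pos), n_knots - 2)
        frac = pos - j
        values.append(knots[j] * (1 - frac) + knots[j + 1] * frac)
    return values


def warped_times(n_frames, start=0.0, n_knots=4, seed=0):
    """Source-frame position of each rendered frame.

    The time-varying scaling factor (between 0.1 and 2.0) is accumulated,
    so the mapping from rendered time to source time is nonlinear.
    """
    rng = random.Random(seed)
    speeds = piecewise_linear(n_frames, n_knots, 0.1, 2.0, rng)
    t, times = start, []
    for s in speeds:
        times.append(t)
        t += s
    return times


def sample_pose(poses, t):
    """Linearly interpolate joint-angle vectors at fractional frame index t."""
    t = min(max(t, 0.0), len(poses) - 1)
    i = min(int(t), len(poses) - 2)
    frac = t - i
    return [a * (1 - frac) + b * frac for a, b in zip(poses[i], poses[i + 1])]


def joint_noise(n_frames, n_dof, magnitude=0.3, n_knots=4, seed=0):
    """One additive offset curve (in radians) per degree of freedom."""
    rng = random.Random(seed)
    return [piecewise_linear(n_frames, n_knots, -magnitude, magnitude, rng)
            for _ in range(n_dof)]


# Usage: a 90-frame clip sampled from a toy two-frame, one-joint sequence.
times = warped_times(90)
offsets = joint_noise(90, n_dof=1)
pose = sample_pose([[0.0], [10.0]], 0.25)
```

A rendered frame would combine the pieces by sampling the pose at the warped time and adding the offset for each degree of freedom at that frame.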
The lighting is also randomized between clips, because the effects and positions of shadows can have a significant effect on the extraction of feature trajectories.

Datasets
We evaluate our method on two standard datasets: the UCF YouTube "Actions in the Wild" dataset (UCF-YT) [16] and the Microsoft Research Action Dataset (MSR) [30]. The UCF-YT dataset is a straightforward forced-choice classification problem between 11 action classes (mostly various sports). The dataset contains 1600 videos of approximately 150 frames each, and we divide these videos into a training set of 1222 videos and a testing set of 378 videos. The UCF-YT dataset is further sub-divided into subsets of related videos (e.g., all from the same sports match, or sharing the same background); in order to avoid training/testing contamination from closely related videos, we employ a stratified training/testing split that places each subset either entirely in the training or the testing set.
The MSR Action Dataset consists of sixteen relatively long (approximately 1000 frames per video) videos in crowded environments. The videos are taken from relatively stationary cameras (there is some camera shake). The dataset only has three actions (clap, wave, and box), with each action occurring from 15 to 25 times across all videos. The actions may overlap. For evaluation we consider MSR to be three separate binary classification problems, i.e., clap vs. all, wave vs. all, and box vs. all, rather than a three-way forced choice because the actions overlap in several parts. Each problem has an equal number of negative samples drawn by randomly selecting segments that do not feature the action in question, so for example, the wave vs. all problem is a binary classification between the 24 positive examples of the wave action and 24 negative examples randomly drawn from the videos. Due to the limited amount of data in this set, evaluation is by leave-one-out cross validation.
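The idea of keeping each subset of related videos on one side of the split can be sketched as below. This is a simplified illustration: it assigns whole subsets at random and does not attempt the per-class balancing that a stratified split implies. The test fraction, data layout, and function name are assumptions.

```python
import random


def group_split(video_groups, test_fraction=0.25, seed=0):
    """Split videos so that every subset lands entirely in train or test.

    video_groups maps a video id to the id of its subset of related videos.
    """
    rng = random.Random(seed)
    subsets = sorted(set(video_groups.values()))
    rng.shuffle(subsets)
    n_test = max(1, round(len(subsets) * test_fraction))
    test_subsets = set(subsets[:n_test])
    train = [v for v, g in video_groups.items() if g not in test_subsets]
    test = [v for v, g in video_groups.items() if g in test_subsets]
    return train, test


# Usage: 40 toy videos in 10 subsets of 4 related videos each.
toy_groups = {"v%d" % i: i // 4 for i in range(40)}
train_ids, test_ids = group_split(toy_groups)
```

Splitting at the subset level rather than the video level prevents near-duplicate clips from appearing in both sets.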
As described earlier, for feature seeding we use a synthetic dataset, which consists of 2000 short videos; the "actions" in this dataset do not necessarily correspond to any of the action classes in either the UCF-YT or MSR datasets.

Feature statistics
A natural question to consider is how informative these RBF features are; that is, how likely is our seeding method to find useful features. Because the features are evaluated by treating them as stump classifiers, the worst an individual feature could do is 0.5 accuracy; any lower, and the classifier simply flips direction. Since there is noise in the data, a classifier that is uncorrelated with video content can still vary in value across videos, and this means that it is possible for it to obtain an accuracy better than 0.5 on the limited data simply by chance. If we were considering a single feature, then we could ignore this unlikely possibility, but with a pool of 10,000, statistically we can expect to see several such false positives.
It is easy to empirically estimate the false positive distribution by simply randomly permuting the labels of all of the test videos; in this way, a classifier cannot be legitimately correlated with the video labels, and the resulting distribution must be entirely due to false positives.
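The permutation procedure can be sketched in Python as follows. This is a minimal illustration with a single shuffle of the labels; the stump rating used here (threshold search with direction flipping) is an assumption about the exact rating, and the function names are hypothetical.

```python
import random


def best_stump_accuracy(values, labels):
    """Best single-threshold accuracy, allowing the direction to flip."""
    n = len(values)
    best = 0.0
    for t in [float("-inf")] + sorted(set(values)):
        acc = sum((v > t) == bool(y) for v, y in zip(values, labels)) / n
        best = max(best, acc, 1.0 - acc)
    return best


def null_accuracies(per_feature_values, labels, seed=0):
    """Rate every feature against randomly permuted video labels.

    After shuffling, no feature can be legitimately correlated with the
    labels, so any accuracy above 0.5 is a false positive.
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return [best_stump_accuracy(values, shuffled)
            for values in per_feature_values]


# Usage: one informative feature and one constant feature over ten videos.
labels = [1] * 5 + [0] * 5
feature_values = [
    [9.0, 8.0, 7.0, 6.0, 5.0, 0.0, 1.0, 2.0, 3.0, 4.0],
    [1.0] * 10,
]
real = [best_stump_accuracy(v, labels) for v in feature_values]
null = null_accuracies(feature_values, labels)
```

Comparing the histogram of `real` against the histogram of `null` over the whole pool gives the kind of comparison the text describes; the constant feature illustrates the zero-variance case that rates exactly 0.5.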
As can be seen in Fig. 5, the accuracy distribution of the real features is quite different from the false positive distribution. In particular, the real feature distribution is shifted to the right of the random distribution, indicating that there are more high-accuracy features than would be expected by chance, even in the worst-case scenario that the vast majority of the features are uninformative. Note that the false positive distribution takes on a log-normal type distribution, albeit with a spike at 0.5 corresponding to zero-variance features. The same test performed with the aggregation techniques produces similar results, indicating that the aggregation techniques also reveal informative features.

Comparison with other quantization methods
Since the goal of the proposed technique is to improve on the early quantization and accumulation steps of the bag-of-words model, a natural baseline against which to compare is the standard bag-of-words model consisting of k-means clustering followed by nearest neighbor quantization and histogram accumulation. Additionally, for the UCF-YT dataset we compare against a somewhat more sophisticated quantization technique, trajectons [18].
Our results on the UCF-YT dataset are shown in Table 1. Here the feature shows large gains over both k-means ... Even with a pairwise spatial relationship coding scheme, their technique achieves 47.7%, which is only slightly better than the performance of our independent features without any spatial information. Note that our performance with 50 seeded features matches that of running the entire candidate set of 10,000 features. Beyond the obvious computational and storage benefits of processing only 50 features instead of 10,000, methods that build on top of these quantized features will likely benefit from the reduced dimensionality (e.g., if pairwise relationships are considered, it is better to consider 50 × 50 rather than 10000 × 10000). While the "kitchen sink" approach of feeding all 10,000 classifiers into an SVM worked in this case (likely due to the resilience of linear SVMs against overfitting), other classifiers may not be as robust.
The results of this comparison on the MSR dataset are shown in Table 3. Overall, the feature selection posts relatively large gains in the $g_{\max}$ and $g_{\text{avg}}$ selection methods, while $g_{\min}$ remains largely the same as for k-means. For the individual classes, the selection method improves performance on the clap and box categories, while performance on wave is largely similar.
It is interesting that the selection techniques that perform well are exactly inverted between MSR and UCF-YT, with $g_{\max}$ and $g_{\text{avg}}$ performing well on MSR, while $g_{\min}$ performs well on UCF-YT. In practice, $g_{\text{avg}}$ works like a weaker $g_{\max}$, so it is unsurprising that its performance is similar to that of $g_{\max}$ on both datasets. Between $g_{\min}$ and $g_{\max}$, however, we suspect the difference is due to how similar the datasets are to the synthetic data that was used for feature selection. The MSR dataset is much more similar to the synthetic data than the UCF-YT dataset, which may explain why the more aggressive $g_{\max}$ selection performs better on the former while the more robust $g_{\min}$ selection performs best on the latter. More specifically, the MSR dataset has a fixed camera and simple human motions, which matches the cinematography of the synthetic data (albeit varying in the specific actions). By contrast, UCF-YT exhibits highly variable cinematography and includes non-human actions (e.g., horses and dogs) as well as actions with props (e.g., basketballs and bicycles).

Comparison of base descriptors
The results of a comparison of base descriptors (trajectories vs. STIP-HOF) are shown in Table 1. Overall, the performance of STIP-HOF features is worse than that of trajectory-based ones. However, note that the best selection method ($g_{\min}$) outperforms k-means for both STIP and trajectory features, and that $g_{\min}$ outperforms the other two methods for both features.

Comparison to unseeded RBF features
As an additional baseline we compare the performance of the features seeded from the synthetic data to that of random sets of features. The purpose of this baseline is to establish whether the gains seen with the classifier sets over k-means are due to the selection process, or whether the classifier-based features are inherently more informative than k-means histogram counts. As can be seen in Tables 1 and 3, the performance of random feature sets is very similar to that of codebooks produced by k-means, indicating that random classifier sets are by themselves only about as powerful as k-means codebooks. It is only after selection (either on the data itself, if there is enough, or on synthetic data) that significant gains are seen over the k-means baseline.

Comparison with feature selection on real data
We perform experiments using AdaBoost for feature selection on the MSR dataset (see Table 3).While boosting on the data itself improves performance on the clap action, the overall performance increase is modest, suggesting that when features are selected from the entire pool of 10,000 classifiers, boosting overfits.When the features are boosted from smaller subsets chosen at random, the overall performance is closer to that of unseeded features.However, the average performance of boosting on the real data is not much better than that of random subsets, and lower than that of seeded features.
Next, we evaluate the contribution of the synthetic data itself, in order to rule out the possibility that it is only the seeding technique (i.e., randomly partitioning the data into groups and then evaluating aggregate performance) that produces performance gains. We perform our feature seeding using the real training data as the seeding source. In order to mimic the structure of the synthetic data groups (one action class vs. everything else), we partition the UCF-YT training data into groups, where each consists of one action class vs. the remaining 10. We further randomly partition each group into five, for a total of 55 groups. We then perform the feature seeding. These results are shown in Table 2. Note that for every selection method (e.g., $g_{\min}$), the seeding from synthetic data outperforms the seeding from the real data. Additionally, the selection method $g_{\min}$ is the best regardless of the seeding source. Thus, the synthetic data itself plays an important role.
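The construction of the 55 real-data groups (11 one-vs-rest problems, each split into five random parts) can be sketched as follows. The text does not say exactly how each group is partitioned, so the independent shuffle per class used here is an assumption, as is the function name.

```python
import random


def one_vs_rest_groups(labels, n_parts=5, seed=0):
    """Build one-vs-rest binary groups and split each into n_parts.

    labels[i] is the action class of training video i. Each returned group
    is a list of (video_index, binary_label) pairs.
    """
    rng = random.Random(seed)
    groups = []
    for c in sorted(set(labels)):
        idx = list(range(len(labels)))
        rng.shuffle(idx)  # independent random partition per class (assumed)
        for p in range(n_parts):
            part = idx[p::n_parts]
            groups.append([(i, 1 if labels[i] == c else 0) for i in part])
    return groups


# Usage: 220 toy videos spread evenly over 11 classes.
toy_labels = [i % 11 for i in range(220)]
real_groups = one_vs_rest_groups(toy_labels)
```

Unlike the synthetic groups, these one-vs-rest groups are heavily imbalanced toward negatives (one class against ten), which may be worth keeping in mind when comparing the two seeding sources.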

Conclusion
In this paper we propose feature seeding, a novel approach for using synthetic data to improve action recognition.Since the synthetic data (1) does not match the appearance of real world video and (2) is not guaranteed to contain the same actions as the test datasets, it is difficult to apply traditional domain adaptation, feature selection, or transfer learning approaches.Nevertheless, we demonstrate that seeding, which is a feature ranking selection technique on appropriately organized data, significantly improves performance on real world data.Seeding outperforms both the popular k-means quantization method and a more sophisticated engineered quantization method, demonstrating that even in very different action datasets there are deep commonalities that can be exploited.
Tellingly, features seeded from synthetic data have better performance than those seeded from the real data, despite the similar sizes of the datasets, indicating that the synthetic data itself contributes to the success of the technique. This highlights the potential benefits of appropriately constructed synthetic data (i.e., where low-level descriptors are similar to real data and high levels of variation can be generated).
We believe that this general approach, in which synthetic data is used to select for robust algorithms, is an especially important avenue of exploration given the increasing demands placed on learning-based techniques and the sparsity of appropriately annotated data.Although the experiments presented here focus on the video action recognition domain, the proposed approach is broadly applicable to many learning-based vision tasks.

Figure 4. Examples of trajectory descriptors accepted by different classifier features. Some features represent simple concepts, such as leftward movement (a), or a quick jerk (b), while others do not correspond to anything intuitive (c). Given limited labeled data, (c) could be indicative of overfitting. Feature seeding allows us to confidently determine that the chosen features generalize well.

Figure 5. Accuracy distribution of RBF classifier features on synthetic data, compared with the expected number of false positives. Above accuracy 0.61, the majority of features are true positives. The difference between these two distributions is statistically significant to p < 0.001 according to the Kolmogorov-Smirnov test.

Table 1. Results on UCF YouTube dataset (motion features only).

Table 2. Comparison of seeding source on UCF YouTube.