Multi-label Discriminative Weakly-Supervised Human Activity Recognition and Localization

Abstract. Activity recognition in video has become increasingly important due to its many applications, ranging from in-home elder care, surveillance and human-computer interaction to automatic sports commentary. To date, most approaches to video rely on fully supervised settings that require time-consuming and error-prone manual labeling. Moreover, existing supervised approaches are typically tailored for classification, not detection problems (where the spatial and temporal support of the action has to be detected). Recently, weakly-supervised learning (WSL) approaches were able to learn discriminative classifiers while localizing the action in space and/or time using weak labels. However, existing approaches for WSL provide coarse localization in terms of spatial regions or spatio-temporal volumes. Moreover, it is unclear how to extend current approaches to the multi-label case that is common in practical applications. This paper proposes a matrix completion approach to the problem of WSL for multi-label learning for video. Our approach localizes non-rectangular spatio-temporal discriminative regions that are inferred by clustering regions of common texture and motion features. We illustrate how our approach improves on existing WSL and supervised learning techniques in three standard databases: Hollywood, UCF sports, and MSR-II.


Introduction
The idea of recognizing actions automatically from videos brims with potential. Solving it enables many tasks, including surveillance, human-computer interaction, patient monitoring, and automatic sports analysis. However, understanding actions in a video sequence remains a challenging problem for several reasons: (1) there is a large variability in imaging conditions, as well as in how different people perform an action; (2) background clutter and motion blur are common; (3) data arising from video is of high dimensionality; (4) obtaining ground truth labels for every individual action in every frame of a video is cumbersome. Previous works have addressed these issues by introducing different features [1,2], interest region detectors such as space-time volumes [3] or trajectories [4,5], and using different classifiers [2,6,7,8,9,10]. While these methods have improved recognition results, they may find correlations from background context and non-activity related regions, which result in a lack of interpretability of what is being learned. This motivates us to explore learning techniques that rely less on error-prone human annotations, and learn instead from captions describing the entire video.
In this paper, we propose a multi-label WSL approach to efficiently recognize activities and pinpoint their spatio-temporal location on unseen videos. Fig. 1 shows examples of our results on different datasets. We first extract spatio-temporal activity parts throughout the video. Then, we recognize the activity/activities present in the video, along with selecting the activity parts associated with each recognized activity.
Weakly-supervised learning (WSL) approaches such as multiple instance learning (MIL) [7,8,9,10] have eased the problems in labeling by localizing discriminative regions while learning the classifier. Instead of class labels, MIL defines labels for positive and negative bags, each containing several instances. All instances in negative bags are negative, but there is at least one positive instance in each positive bag, and the goal is to localize the positive instances (see Fig. 2(a)). Unfortunately, the MIL paradigm has two major drawbacks: first, it is non-trivial to extend it to multi-label settings [11]; second, it typically leads to multi-pass algorithms that alternate between classification and localization. This is especially cumbersome on videos, due to the high number of degrees of freedom in voxel/cuboid search. The MIL problem gets even harder if several instances have to occur together in a bag to form a positive sample. This is the case of action recognition, since activities are typically defined by a collection of spatio-temporal parts extracted from a video [5,7,12,13]. Thus, in order to provide accurate spatio-temporal localization, activity parts cannot be labeled individually, but rather must be selected coherently throughout the entire dataset.
We exploit the fact that instances from the same class usually organize themselves into clusters [14,15,16,17] and that low-rank matrix completion [38] can exploit low-rank subspaces to find relations between labels and features. Thus, we jointly cluster instances into subspaces (Fig. 2(b)) and label unknown instances consistently with the clustering, while keeping negative bag instances as negative (Fig. 2(c)). We demonstrate the effectiveness of our joint subspace clustering and classification in weakly-supervised multi-label learning for video activity recognition.

Related Work
Many researchers have addressed the problem of activity recognition in video sequences by using space-time interest points [1,20], dense trajectories [5] and discriminative space-time neighborhood features [21]. Some previous works have also targeted the problem of spatio-temporal action segmentation and recognition. Hoai et al. [22] recognized activities using a multi-class support vector machine (SVM) and inferred the temporal segments with dynamic programming. Lan et al. [8] trained a latent SVM with a number of labeled and fully annotated videos, but each video is assigned a single label. In [23], the authors propose weakly supervised video action classification using a similarity-constrained latent SVM. Tang et al. [24] use a variable-duration hidden Markov model to build a model for each video. Chen et al. [25] construct a space-time video graph and find the subgraph that maximizes an activity classifier's score. Siva et al. [10] extract potential action cuboids and use genetic algorithms to select the best potential cuboids to learn an SVM for recognition. In related work, [12] introduced spatio-temporal deformable part models for activity recognition and localization. Action localization is usually performed in the context of action detection, separate from the recognition phase (e.g., [26,27,28,29,30]). Raptis et al. [7] extract spatio-temporal structures by forming clusters of trajectories. A graphical model is used to recognize a collection of these clusters as a particular action. We share with [7] the use of action parts, but they use graph search to match action parts and incorporate fully supervised data, while we perform subspace clustering in a weakly-supervised setting. Ma et al. [31] use a two-level hierarchical model for activity localization, where each body part is associated with a rectangular box. They first perform a hierarchical segmentation of video frames and prune a candidate segment tree. Then they extract hierarchical space-time segments for activity recognition via separate codebooks for root and parts.
Multiple-instance learning was initially proposed in [32] for the WSL problem of predicting which configurations of a pharmaceutical drug are effective. Andrews et al. [33] formulated a maximum margin MIL based on Support Vector Machines, where sample labels are unobserved integer variables and the margin between these is maximized directly. These MIL methods result in non-convex optimization processes and thus are heavily dependent on initialization. WSL in computer vision has been extensively studied, for example by generating spatio-temporal masks for objects in images and videos [34] from partially tagged Internet and YouTube videos [35]. Since labeling video by annotating every single frame is a cumbersome task, several WSL models have been developed for activity recognition and event detection in videos (e.g., [8,31]). Tang et al. [17] propose a spatio-temporal transductive and inductive object segment annotation from weakly-tagged videos. Recently, several works have formulated the MIL and WSL problems as convex problems (e.g., [36,37]). In [36] the authors proposed a model based on calculating likelihood ratios of instances using Support Vector Regression and classifying the bags into positive and negative with a binary SVM.
Our work is most similar to [14] and [38]. [14] is a low-rank subspace segmentation algorithm and [38] a low-rank matrix completion (MC) framework for classification. We propose a method that intertwines these two to perform simultaneous recognition and localization in videos. In [38] each image is represented as a single column in the matrix, and localization is performed in the image plane by a bounding-box exhaustive search. In our method, however, each video is composed of several parts, and supervision is weak and only labels entire videos. Transduction and clustering alone do not suffice, but together they provide a selection that is coherent for all parts in the dataset. This global context means that selecting parts yields both space-time locations and activity labels.

Video Representation
In our method, each video in the dataset is treated as a collection of motion parts [5,7,12,13]. Following [5,7], videos are represented by features extracted from parts with dense motion trajectories. We perform a spatio-temporal segmentation to obtain volumetric regions that have similar visual and motion characteristics. Then, we extract trajectories using an optical flow tracker, and discard regions with little or no movement. Finally, we group trajectories with similar behavior into parts. Fig. 3 illustrates this process in a sample video from the HOHA dataset. Since trajectories are asynchronous and have different lengths, we define a distance that incorporates motion similarity and spatial closeness. For two trajectories $A$ and $B$ with points $x_A[t]$ and $x_B[t]$, we calculate their distance $d(A, B)$ on their temporal overlap $[\tau_1, \tau_2]$ from the positions and velocities of the trajectory points and from $\sigma_{[\tau_1,\tau_2]}$, the local optical flow variance in the interval $[\tau_1, \tau_2]$. In (1), the first term is a measure of spatial distance while the second estimates distance in motion and velocity. To group trajectories, we follow [7] and calculate the affinities between all pairs of trajectories in a video, forming an affinity matrix with entries $\omega(A, B) = \exp(-\eta\, d(A, B))$. A normalized-cut clustering is then used to group the trajectories, where Cattell's scree test is used to determine the appropriate number of clusters. Each trajectory group forms a part that may or may not be associated with the activities of interest. For instance, 23 parts appear in the video frame shown in Fig. 3. Each part is represented by a histogram of oriented gradients (HoG), optical flow (HoF) [1] and oriented edges in the motion boundaries (HoMB) [5]. These histograms are computed on a regular grid at three different scales. Each descriptor (HoG, HoF, HoMB) uses an independent dictionary, obtained by performing K-means on all the parts and quantizing each descriptor to its closest dictionary element in $\ell_2$ distance. The concatenation of all three histograms forms the group (part) descriptor, $h_k \in \mathbb{R}^n$. A video $V_i$ is described by concatenating the descriptors of its activity parts.
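The affinity and quantization steps described above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in the paper: the function names, the default value of `eta`, and the normalization of the histogram are assumptions, and the pairwise distances $d(A, B)$ are taken as given.

```python
import numpy as np

def affinity_matrix(dist, eta=1.0):
    """Map pairwise trajectory distances d(A, B) to affinities exp(-eta * d).

    `dist` is assumed to be a precomputed square distance matrix; `eta` is a
    free scale parameter (its value here is an assumption).
    """
    dist = np.asarray(dist, dtype=float)
    if dist.ndim != 2 or dist.shape[0] != dist.shape[1]:
        raise ValueError("dist must be a square matrix")
    return np.exp(-eta * dist)

def bow_histogram(descriptors, dictionary):
    """Quantize each descriptor to its nearest dictionary element (l2) and count.

    `dictionary` holds one K-means center per row. Normalizing the histogram
    to sum to one is an assumption made for this sketch.
    """
    descriptors = np.asarray(descriptors, dtype=float)
    dictionary = np.asarray(dictionary, dtype=float)
    # Squared l2 distances, shape (num_descriptors, num_centers).
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    hist = np.bincount(nearest, minlength=dictionary.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)
```

A part descriptor would then be the concatenation of three such histograms, one per descriptor type, each built with its own dictionary.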

Activity Recognition and Localization
In this section, we present our weakly-supervised learning algorithm for action recognition and localization in video sequences. In our problem, we have several training videos, each of which is labeled with one or more activities. However, no spatio-temporal information exists on where the activities occur. Our task is to classify whether unknown test videos contain those activities or not, and simultaneously localize them throughout the video. Our approach merges the advantages of two recently proposed low-rank models: subspace segmentation [14] clusters similar activity parts from all videos in the dataset, and a matrix completion classifier [38] determines the activity labels they belong to, such that the labeling is consistent throughout the entire dataset.
Let $m$ be the number of different activity classes, $n$ the dimensionality of the feature space, and $N_{tr}$, $N_{tst}$ the number of training and testing parts, respectively. For the classification task, we can define a matrix $D_0$ as

$$D_0 = \begin{bmatrix} D_Y \\ D_X \\ D_1 \end{bmatrix} = \begin{bmatrix} Y_{tr} & Y_{tst} \\ X_{tr} & X_{tst} \\ \mathbf{1}^\top & \mathbf{1}^\top \end{bmatrix}, \qquad (3)$$

where $Y_{tr} \in \mathbb{R}^{m \times N_{tr}}$ and $Y_{tst} \in \mathbb{R}^{m \times N_{tst}}$ are the training and test labels and $X_{tr} \in \mathbb{R}^{n \times N_{tr}}$ and $X_{tst} \in \mathbb{R}^{n \times N_{tst}}$ are the training and test feature vectors, respectively. Hence, $D_Y$, $D_X$ and $D_1$ denote the label, feature and last rows of $D$, respectively. As noted by Cabral et al. [38], if a linear classification model holds, $D_0$ is rank deficient. Therefore, classification can be posed as a matrix completion problem of filling the missing entries in $Y_{tst}$ such that the nuclear norm of $D_0$ (a convex approximation of its rank) is minimized. To deal with noise and outliers in the data, we can incorporate an error term $E_{mc}$ in the known feature and training label entries, and the classification process can be posed as finding the best $Y_{tst}$ and the error matrix $E_{mc}$ such that the rank of $D$ is minimized. As discussed in Section 3, each video $V_i$ is represented by the histograms of its activity parts. If labels were provided for each part in training, we could construct $D_0$ by setting each column to the features corresponding to one activity part and its respective $\{0, 1\}^m$ label vector. However, in our case supervision is weak and labels are only provided for entire videos. Thus, simply labeling parts with all class labels present in the video they originate from is insufficient for obtaining correct part-level classifications.
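The rank-deficiency argument can be checked numerically. The sketch below stacks labels, features and a row of ones as described above and verifies that, when labels come from a linear model, the stacked matrix has rank at most $n + 1$. The sizes, the random linear model (`W`, `b`) and the function names are assumptions chosen only for illustration.

```python
import numpy as np

def stack_label_feature_matrix(Y, X):
    """Stack labels, features and a row of ones: D = [Y; X; 1^T]."""
    Y = np.asarray(Y, dtype=float)
    X = np.asarray(X, dtype=float)
    if Y.shape[1] != X.shape[1]:
        raise ValueError("Y and X must have one column per part")
    ones = np.ones((1, X.shape[1]))
    return np.vstack([Y, X, ones])

def nuclear_norm(D):
    """Sum of singular values, the convex surrogate of rank."""
    return float(np.linalg.svd(D, compute_uv=False).sum())

# Toy check with hypothetical sizes: m labels, n features, N parts.
rng = np.random.default_rng(0)
m, n, N = 3, 5, 20
X = rng.standard_normal((n, N))
W = rng.standard_normal((m, n))    # assumed linear classifier weights
b = rng.standard_normal((m, 1))    # assumed bias
Y = W @ X + b                      # labels generated by a linear model
D = stack_label_feature_matrix(Y, X)
rank_D = np.linalg.matrix_rank(D)  # at most n + 1, although D has m + n + 1 rows
```

Because every label row is a linear combination of the feature rows and the row of ones, the label block adds nothing to the rank, which is what makes completing the unknown labels by rank minimization plausible.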
Instead, to identify the parts that comprise each activity class, we can also exploit the fact that activity parts from the same class likely cluster together. This can be formulated as a segmentation of feature vectors into low-rank subspaces, using a Low-Rank Representation (LRR) [14]. Since $D_X$ contains the feature vectors for all videos in the dataset, we can cluster activity parts by computing a low-rank similarity matrix $Z$, as

$$\min_{Z,\, E_{lrr}} \; \|Z\|_* + \lambda \|E_{lrr}\|_{2,1} \quad \text{s.t.} \quad D_X = D_X Z + E_{lrr}, \qquad (4)$$

where $E_{lrr}$ is the LRR [14] error matrix and $\lambda$ is a balancing parameter between low-rank and error fit. $Z$ is indicative of the similarity between each activity part in $D_X$ and thus can be used as an additional cue to weak supervision for classifying which parts constitute which activities. Using the similarity matrix $Z$, we can apply a clustering method such as Normalized Cuts to group similar activity parts in all train/test videos. The output of this clustering method is an $n_c \times N$ binary matrix $Q$, where $n_c$ is the number of clusters. Each row of $Q$ corresponds to one cluster, with $q_{ij} = 1$ if the $j$-th activity part belongs to the $i$-th cluster, and 0 otherwise. Below, we show that these matrix completion classification and subspace clustering steps can be done jointly, so that labels are consistent within clusters and vice-versa.
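To make the shape and meaning of $Q$ concrete, the following sketch builds the binary cluster indicator matrix from a vector of cluster assignments and forms a symmetric affinity from $Z$. It does not solve the LRR program or run Normalized Cuts; the cluster assignments are assumed to come from such a step, and symmetrizing as $|Z| + |Z^\top|$ is an assumed (though common) convention.

```python
import numpy as np

def lrr_affinity(Z):
    """Symmetric, non-negative affinity built from an LRR coefficient matrix Z.

    Using |Z| + |Z^T| is an assumption for this sketch; any symmetric
    affinity could be passed to the clustering step instead.
    """
    Z = np.asarray(Z, dtype=float)
    return np.abs(Z) + np.abs(Z.T)

def cluster_indicator(labels, n_clusters):
    """Build the n_c x N binary matrix Q with q_ij = 1 iff part j is in cluster i."""
    labels = np.asarray(labels, dtype=int)
    Q = np.zeros((n_clusters, labels.size), dtype=int)
    Q[labels, np.arange(labels.size)] = 1
    return Q
```

For example, `cluster_indicator([0, 2, 1, 0], 3)` places parts 0 and 3 in the first row, part 2 in the second and part 1 in the third, so every column of `Q` contains exactly one 1.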

Joint Classification and Clustering
With the matrix completion and subspace segmentation defined as above, we can simultaneously obtain a low-rank representation of the feature vector matrix $D_X$, and correct and complete the labels in $D_Y = [Y_{tr}, Y_{tst}]$. Our activity classification problem can be defined as minimizing the rank of $D$ for determining the part labels, while at the same time ensuring the labels are consistent with the clustering $Q$ obtained from the low-rank representation $Z$ of the parts $D_X$. If we define $\Omega_Y$ as the set of known label entries in $D_0$, this objective can be written as in (5), where $c_y(\cdot,\cdot)$ is a logistic loss function that penalizes entries of different classes, and $\gamma, \lambda, \rho_1, \rho_2$ are positive trade-off parameters. $k$ is the cluster most similar to label $i$, calculated as $k = \arg\min_{k=1,\dots,n_c} \sum_j c_y(d_{ij}, q_{kj})$.
With the objective in (5), the first term seeks a low-rank $D$ matrix so that labels can be expressed as a linear combination of features. The second establishes a low-rank representation $Z$ for subspace clustering. The third term controls the level of noise in the clustering. The fourth term nudges the labels in $D_Y$ in the direction suggested by the clustering $Q$, and the fifth term regularizes changes on known training labels $Y_{tr}$ in the matrix completion. Therefore, we are seeking a consensus between the clustering and classification outputs. The intersection of these two tasks is incorporated by the fourth term, where inconsistent clustering outputs and labels are penalized. The minimization process will aim towards agreement between the two and the fewest label changes in $Y_{tr}$. Also, notice that in the process of joint minimization, both classification and clustering tasks share the feature error matrix, resulting in fewer variables than when optimizing both objectives separately.
The objective in (5) can be optimized using the Alternating Direction Method of Multipliers (ADMM) [39]. When it converges, the labels in $Y_{tst}$ corresponding to each activity part indicate its action label(s), and the columns with that label are the parts associated with that specific activity. The step with the highest computational complexity in solving (5) with ADMM is an SVD of $D$, but scalable SVD/ADMM methods are currently being researched heavily [40].
As in $D_Y$, each instance is assigned a set of labels, each of which belongs to an independent activity class. This enables us to model multi-label MIL problems. Many previous works have explored the dependence among the labels [41,42], but when the labels are incomplete (weakly-supervised) the task is harder. As also explored in previous works [18,38], the low-rank assumption on the matrix $D$ corresponds to a linear dependence between the labels and the feature vectors. We evaluate our multi-label setting in weakly-supervised video activity recognition and localization.
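The cluster-matching rule $k = \arg\min_k \sum_j c_y(d_{ij}, q_{kj})$ can be illustrated as follows. The exact form of $c_y$ is not reproduced here, so this sketch assumes a generic logistic loss applied after mapping $\{0, 1\}$ entries to $\{-1, +1\}$; that choice, and the function names, are assumptions rather than the definition used in the paper.

```python
import numpy as np

def logistic_loss(d, q):
    """Assumed logistic loss: log(1 + exp(-s)) with s = (2d - 1)(2q - 1).

    Entries that agree (both 0 or both 1) give s = +1 and a small loss;
    entries that disagree give s = -1 and a larger loss.
    """
    s = (2.0 * np.asarray(d, dtype=float) - 1.0) * (2.0 * np.asarray(q, dtype=float) - 1.0)
    return np.logaddexp(0.0, -s)

def most_similar_cluster(D_Y, Q):
    """For each label row i of D_Y, return the cluster row k of Q with the lowest total loss."""
    D_Y = np.asarray(D_Y, dtype=float)  # shape (m, N)
    Q = np.asarray(Q, dtype=float)      # shape (n_c, N)
    # cost[i, k] = sum_j c_y(d_ij, q_kj), via broadcasting to (m, n_c, N).
    cost = logistic_loss(D_Y[:, None, :], Q[None, :, :]).sum(axis=2)
    return cost.argmin(axis=1)
```

With label rows `[[1, 1, 0, 0], [0, 0, 1, 1]]` and cluster rows `[[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 0]]`, the first label is matched to the second cluster and the second label to the first, since those rows agree entry by entry.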
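The SVD cost mentioned above typically arises because ADMM-style solvers for nuclear-norm objectives update the low-rank variable with a singular value thresholding step. The sketch below shows that single step in isolation, under the assumption that the solver uses this standard proximal operator; it is not the full update scheme for (5).

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Proximal operator of tau * nuclear norm: shrink singular values of M by tau."""
    U, s, Vt = np.linalg.svd(np.asarray(M, dtype=float), full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)  # soft-threshold the spectrum
    return (U * s_shrunk) @ Vt           # rebuild with the shrunken singular values
```

Each call costs one SVD of the input, which is why this step dominates the running time for a large $D$ and why scalable or parallel SVD routines matter.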

Experiments
To evaluate the proposed technique, we set up several experiments on synthetic and real datasets. Since our approach performs clustering and classification simultaneously, one might conceive that we could first run clustering and then use matrix completion for obtaining the labels. Thus, as a baseline, we derive a low-rank representation [14] of matrix $D_X$ and then run matrix completion while incorporating the feature error term in the matrix completion formulation (LRRMC). We also compare the performance of our method to using just the matrix completion (MC) of [38] for classification, as described in Sec. 4, to show that solely relying on a weakly supervised labeling for part classification does not work, and to the well-known MI-SVM [33] with an RBF kernel.
In each iteration of (5), we obtain the clustering $Q$ using $n_c = 2m$ clusters to account for intra-class variability, and use as parameters $\gamma = 0.9$, $\rho_1 = 1.5$, $\rho_2 \in \{10^{-3}, 10^{-2}, 10^{-1}, 1\}$. For experiments on activity recognition datasets, to ensure direct comparability with state-of-the-art methods, we follow the setup of [7] for obtaining and describing activity parts, as described in Sec. 3. Each part is represented by histogram of oriented gradients (HoG), histogram of optical flow (HoF) [1] and histogram of the oriented edges in the motion boundaries (HoMB) [5] descriptors, with 500, 500 and 300 dimensions, respectively.

Synthetic Data
First, in order to validate the proposed algorithm, we construct 10 independent subspaces of dimensionality 100 (as described in [14]). The first five subspaces form our desired positive classes and the second five, negative. We create 100 positive and 100 negative bags of size 10, and sample instances from the above subspaces. Positive bags, as in MIL, are composed of uniformly distributed positive and negative instances. We corrupt each sampled instance $x$ with probability $p$, by adding Gaussian noise with zero mean and variance $0.3\|x\|$. The performance of the proposed method is compared with LRRMC, MI-SVM and matrix completion (MC) [38], as illustrated in Fig. 4 for different probabilities of corruption and noise. Our method performs considerably better than the baselines as the noise level in the data increases. As mentioned in Sec. 4, MC yields worse results since it fully relies on the initial labeling, which is not accurate enough due to its weakly supervised nature. Our method performs a joint clustering and classification of the data and detects noise and outliers in both tasks collaboratively. In LRRMC these are done separately. Thus, our method deals better with noise in the data.
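The synthetic protocol above can be sketched as a small generator. Several details are assumptions made only so the example runs: the ambient dimension is taken as 100 with a hypothetical subspace rank of 5, random orthonormal bases stand in for the construction of [14], positive bags are forced to contain at least one positive instance, and the noise level is read as variance $0.3\|x\|$.

```python
import numpy as np

def make_subspace_bases(num_subspaces=10, ambient_dim=100, rank=5, rng=None):
    """Random orthonormal bases; generic draws give independent subspaces
    as long as num_subspaces * rank <= ambient_dim (rank is an assumption)."""
    rng = np.random.default_rng() if rng is None else rng
    bases = []
    for _ in range(num_subspaces):
        q, _r = np.linalg.qr(rng.standard_normal((ambient_dim, rank)))
        bases.append(q)
    return bases

def make_bag(bases, positive, rng, bag_size=10, num_pos=5):
    """Sample one bag; subspaces 0..num_pos-1 are positive, the rest negative."""
    num = len(bases)
    if positive:
        ids = rng.integers(0, num, size=bag_size)  # uniform mix of both kinds
        if not (ids < num_pos).any():              # MIL: at least one positive
            ids[rng.integers(bag_size)] = rng.integers(0, num_pos)
    else:
        ids = rng.integers(num_pos, num, size=bag_size)
    X = np.stack([bases[i] @ rng.standard_normal(bases[i].shape[1]) for i in ids], axis=1)
    return X, ids

def corrupt_instances(X, p, rng, scale=0.3):
    """With probability p per column, add zero-mean Gaussian noise of variance scale * ||x||."""
    X = X.copy()
    mask = rng.random(X.shape[1]) < p
    for j in np.flatnonzero(mask):
        var = scale * np.linalg.norm(X[:, j])
        X[:, j] += rng.normal(0.0, np.sqrt(var), size=X.shape[0])
    return X, mask
```

Calling `make_bag` 100 times with `positive=True` and 100 times with `positive=False`, then passing each bag through `corrupt_instances` for a chosen `p`, would give a dataset of the shape described in the text.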

Action recognition and localization
Three popular activity recognition datasets are used: the MSR-II [6], HOHA [1] and UCF sports [3] action datasets. The MSR-II action dataset contains 54 videos with three action categories: boxing, clapping and hand-waving. In this dataset, some of the videos contain multiple actions, some of which even occur at the same time. The HOHA (Hollywood1 Human Action) dataset contains 430 videos. Each video contains significant camera motion, rapid scene changes and occasionally significant clutter. Furthermore, actions in this dataset are performed in different conditions, and many actions are defined by the interactions between the subjects and/or objects. These factors make this dataset particularly challenging. The UCF sports dataset consists of 150 videos extracted from sports broadcasts. Videos in this dataset contain camera motion and many different lighting and capturing conditions, as well as large displacements in most of the actions, cluttered backgrounds, and large intra-class variability.
Recognition: Tests on each of the datasets have separate experimental settings to facilitate comparisons with reference methods. We compare our recognition model with state-of-the-art models reported in the literature and with the same baselines described in the synthetic tests of Sec. 5.1. The final classification step in our model is performed via a thresholding procedure, where labels above a common threshold are selected.
MSR-II dataset - For the experiments on this dataset, a two-to-one random division of all videos in the dataset creates the training and testing sets. This dataset contains videos with multiple actions happening in the video and, in some cases, being performed at the same time, which can challenge our multi-label classification framework. Some of the videos in this dataset contain several instances of all activities. Since we expect a single instance of each activity class in the video, the videos are split such that each video contains only one instance of each activity class, while allowing for several activities from different classes. Fig. 5 shows our per-class accuracy results compared to the MI-SVM model [33]. Table 1 shows the recognition accuracy results compared to state-of-the-art methods on this dataset. The supervision column shows the level of supervision used in the training phase: fully supervised methods know the spatio-temporal bounding boxes of activity locations, whereas weakly-supervised methods use only the label(s).
HOHA dataset - In this experiment the test set has 211 videos with 217 labels and the training set has 219 videos with 231 labels, all manually annotated [7]. Fig. 6 shows the per-class accuracy results for this dataset. This dataset is very challenging for activity recognition, due to the large amount of clutter and camera motion. Our approach is comparable with results from state-of-the-art methods designed specifically for this dataset, improving on them by a slight margin. Table 2 gives the overall accuracy results compared to some other methods on this dataset.
Fig. 6: Per-class recognition for HOHA dataset.
UCF Sports dataset - We split this dataset into 103 training and 47 test samples, following the setup described in [7,8]. This separation minimizes the strong correlation of background cues between the testing and training set [7]. Some results on this dataset report leave-one-out cross-validation (LOOCV) performance, which may take into account the similarity of the background instead of the activity itself. In this dataset the background is very similar for sports of the same kind, which affects the activity recognition rates. Fig. 7 depicts the per-class classification accuracy for this dataset. As shown, our method outperforms the BoW+SVM model in almost all classes. As shown in Table 3, the overall recognition rate of our method is also competitive with the state-of-the-art. The upper part of the table compares our results with state-of-the-art methods' reported results for the same training and testing dataset split. Our method outperforms all of these works. The lower part of the table shows results from works that use LOOCV, which generally achieve better results. Our split is much harder and the difference between the results is expected. Notwithstanding a more difficult test scenario, our results are still comparable to these works.
Spatio-temporal localization: The second function of our method is the spatio-temporal localization of the activity in the video sequence. In order to assess spatio-temporal localization directly against reported state-of-the-art methods, we employ three metrics: 1) intersection-over-union using the selected positive parts (IOU), 2) average precision (AP) of part classification based on ground truth spatio-temporal annotations, and 3) the localization score, defined as in [7]. The latter is the average, over frames, of the ratio between the set of points inside the annotated ground truth bounding box and the set of points of the selected trajectory group. If the detected activity part(s) throughout the video have at least a θ overlap with the annotated ground truth bounding box (score ≥ θ), the recognition/localization is considered correct. The results are compared to the state-of-the-art methods in the literature, using IOU, AP or localization score, where available. Tables 4, 5 and 6 show results on MSR-II, HOHA and UCF Sports datasets, respectively. Since [8] only provides localization results on a subset of frames, we also include results on this subset for comparison. The average recognition/localization accuracies for the experiments on the datasets as a function of θ are illustrated in Fig. 8. Some results are shown in Fig. 9.
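One way to read the localization score is as the per-frame fraction of selected trajectory points that fall inside the ground truth box, averaged over frames and compared against θ. The sketch below follows that reading; the box format `(x0, y0, x1, y1)`, the skipping of frames without selected points, and the function names are assumptions, and the authoritative definition is the one in [7].

```python
import numpy as np

def localization_score(points_per_frame, boxes_per_frame):
    """Average over frames of the fraction of selected points inside the box.

    points_per_frame: list of (num_points, 2) arrays of (x, y) coordinates.
    boxes_per_frame:  list of (x0, y0, x1, y1) ground truth boxes (assumed format).
    """
    ratios = []
    for pts, (x0, y0, x1, y1) in zip(points_per_frame, boxes_per_frame):
        pts = np.asarray(pts, dtype=float).reshape(-1, 2)
        if len(pts) == 0:
            continue  # assumption: frames without selected points are skipped
        inside = (
            (pts[:, 0] >= x0) & (pts[:, 0] <= x1)
            & (pts[:, 1] >= y0) & (pts[:, 1] <= y1)
        )
        ratios.append(inside.mean())
    return float(np.mean(ratios)) if ratios else 0.0

def is_correct(score, theta):
    """A detection counts as correct when its score reaches the overlap threshold."""
    return score >= theta
```

Sweeping `theta` over a range and averaging `is_correct` across test videos would produce an accuracy-versus-θ curve of the kind plotted in Fig. 8.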
Experimental results discussion: Our experiments show that the proposed joint process in (5) significantly improves results when compared to the baselines of MC and of performing the clustering and classification steps separately (LRRMC). We note that the multi-label nature of our method allows us to provide results for simultaneous actions on the MSR-II dataset, as seen in Fig. 9. An important note on the recognition results is that our method performed competitively even with methods focused solely on recognition (i.e., that do not perform any localization of the activity) and with methods that train on fully annotated datasets. This is despite the fact that, when using whole-frame or whole-video features for recognition, we are dealing with many outliers and significant noise. Furthermore, unlike many previous works, our model extracts the exact spatio-temporal segmentation of the activity, rather than a simple bounding box, cuboid or voxel representation. We improve the recognition results on all datasets, and also achieve good localization scores. We believe these could be improved further if more accurate spatio-temporal annotations in the datasets were used as ground truth instead of bounding boxes.
As can be seen, our method achieved better results than many state-of-the-art methods. This is due to two important properties of our method. First, our method deals with errors and outliers in the feature vectors and the labels. As can be seen in (5), we also extract the erroneous elements in the process of minimizing the matrix ranks. The errors for both LRR and MC are incorporated simultaneously, and tend to correct one another in the process. Second, our method labels the actions via transduction, which alone improves the results compared to inductive approaches. There are no separate train and test phases, and our approach incorporates activity parts and information from the whole dataset when minimizing the nuclear norm and deciding on the instance classes.

Conclusions
In this paper, we have proposed a low-rank formulation for weakly supervised learning and have applied it to the challenging problem of activity recognition. Our approach uses a simultaneous convex matrix completion and LRR subspace clustering framework to recover the labels for the test videos and localize the spatio-temporal extent of activities throughout each video. Interactions between the activity parts are globally modeled throughout the entire dataset using the subspace clustering procedure, while the matrix completion framework labels the activities, ensuring that labeling is consistent within clusters and vice-versa. Our experiments show this joint process significantly improves results when compared to performing clustering and classification steps separately. Moreover, it attains performance comparable to state-of-the-art methods for classification and localization in all three datasets tested.
Unlike typical MIL approaches, our method is naturally multi-label and is able to handle video sequences where several activity parts have to occur together in a bag to define an action, and where actions occur simultaneously in different spatial locations.
As a direction for future work, we intend to apply and develop incremental procedures for the training and testing and exploit parallel algorithms for the SVD operations needed to optimize (5), such as in [40], in order to decrease processing time.

Fig. 1: Our multi-label weakly-supervised approach recognizes activities and pinpoints their spatio-temporal location on unseen videos. This figure shows results on UCF Sports, HOHA and MSR-II datasets. Top: A sample frame and the extracted spatio-temporal activity parts. Bottom: Activities recognized and localized by our method.

Fig. 2: (a) Multiple instance learning has positive and negative bags, and the goal is to identify positive instances in positive bags. Instead, our approach (b) clusters the instances and (c) forces the labels to agree with the clustering output and bag labels.

Fig. 3: Left to right: Points tracked on a frame, extracted trajectories, trajectory groups.

Fig. 4: Accuracy comparison according to corruption probability p on synthetic data. This figure shows the means and standard deviations for three different runs.

Fig. 8: Average localization accuracy as a function of the localization overlap θ.

Table 1: Recognition results on MSR-II dataset. Cross dataset methods are trained on KTH dataset, which only contains actions with little background motion.

Table 3: Recognition results on UCF Sports. Upper part: Results with 103:47 dataset split. Lower part: Results with LOOCV.

Table 4: Action localization AP on the MSR-II dataset. Cross dataset methods are trained on KTH dataset, which only contains actions with little background motion.

Table 6: Average localization IOU on the UCF Sports dataset. Note that [26] and [8] use the bounding box annotations during training, while ours is weakly-supervised.