Multi-class active learning for video semantic feature extraction

Active learning has been demonstrated to be a useful tool for reducing human labeling effort in many multimedia applications, especially those handling large video collections. However, most previous work on active learning has focused only on binary classification, which greatly limits its applicability. We present a multi-class active learning approach which extends active learning from binary to multi-class classification using a unified representation with margin-based loss functions. Experimental results on the TREC03 semantic feature extraction task show that the proposed active learning approach works effectively even with a significantly reduced amount of labeled data.


INTRODUCTION
As the increasing amount of multimedia information drives demand for content-based access to video data, new challenges have emerged. Machine learning approaches are becoming more and more useful for improving the performance of multimedia information retrieval, but a large amount of human effort is still required to annotate the training data. Unfortunately, manually annotating training data is not only labor-intensive and time-consuming for large video archives, but also subject to human error. The easiest way to reduce the labeling effort is to ask a human to label some randomly selected data and automatically propagate the labels to the entire collection using a supervised learning algorithm. However, random sampling cannot always provide the most useful data to label and thus can waste a lot of human labeling effort. A better approach is to select non-random examples which, if labeled, will provide the most information to the learning algorithm. This motivated us to develop an incremental learning framework with interaction/supervision from a human, a framework known as active learning.
The effectiveness of active learning in reducing labeling cost has been demonstrated by previous work [1,2,3]. An active learner may begin with a pool of unlabeled data, select a set of unlabeled examples to be manually labeled, and learn from the newly obtained knowledge repetitively. This type of problem is also called "query learning" or "selective sampling" [1]. Typically, the unlabeled examples are selected either by minimizing the learner's expected error [1] or by maximizing information gain / version space reduction [2,3]. On the other hand, multi-class classification is widely applied in many areas of multimedia processing, such as face recognition, people identification, image categorization and semantic feature extraction from video. However, most previous studies on active learning have focused only on binary classification problems, and few of them address selection strategies in the context of multi-class classification, which greatly limits the applicability of active learning. In this paper, we present a multi-class active learning approach using a unified representation with margin-based loss functions to minimize human effort in the task of semantic feature extraction. Figure 1 illustrates the process of multi-class active learning: it enables a learning algorithm to select the most informative unlabeled data for all classes simultaneously instead of just for two binary classes. The experiments on the TREC03 feature extraction task indicate that our multi-class active learning approach works effectively even when the amount of training data is significantly reduced.

Fig. 1. Illustration of multi-class active learning (multi-class labeled pool, unlabeled pool, selection strategy, informative data, user labeling)

[IEEE International Conference on Multimedia and Expo (ICME), special session on "Active Learning on Multimedia Retrieval", Taipei, Taiwan, June 27-30, 2004.]
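The pool-based loop just described (select informative examples, query a human, retrain, repeat) can be sketched as follows. The learner here is a toy 1-D threshold classifier with uncertainty sampling, chosen only to keep the sketch self-contained; it stands in for the SVM-based learner used later in the paper.

```python
# Minimal sketch of a pool-based active learning loop. The toy 1-D
# threshold learner and the simulated oracle are illustrative stand-ins,
# not the paper's actual classifier.

def train(labeled):
    """Fit a threshold midway between the two class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0

def margin(threshold, x):
    """Distance to the decision boundary: small means informative."""
    return abs(x - threshold)

def active_learn(pool, oracle, seed, rounds):
    labeled, pool = list(seed), list(pool)
    for _ in range(rounds):
        t = train(labeled)
        # Select the unlabeled example closest to the boundary.
        x_star = min(pool, key=lambda x: margin(t, x))
        pool.remove(x_star)
        labeled.append((x_star, oracle(x_star)))  # ask the (simulated) human
    return train(labeled)

# Toy usage: true boundary at 5.0; the oracle labels accordingly.
oracle = lambda x: 1 if x >= 5.0 else 0
seed = [(1.0, 0), (9.0, 1)]
pool = [2.0, 4.4, 4.9, 5.2, 6.0, 8.0]
threshold = active_learn(pool, oracle, seed, rounds=4)
```

After four queries the learner has concentrated its labeling budget near the true boundary, which is exactly the behavior random sampling cannot guarantee.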

MULTI-CLASS ACTIVE LEARNING
We present a unified multi-class active learning framework in this section (more details of this algorithm can be found in [4]). In the following discussion, let X denote the domain of possible examples and Y be a finite set of classes. Formally, the learning algorithm takes a set of training examples (x_1, y_1), ..., (x_m, y_m) as input, where y_i \in Y is the label assigned to example x_i \in X. We pay particular attention to learning algorithms that attempt to minimize a margin-based loss function, called margin-based learning algorithms [5]. These include a large family of well-studied algorithms with different loss functions and minimization procedures, such as decision trees, logistic regression, support vector machines (SVMs) and AdaBoost. A margin-based learning algorithm minimizes the loss with respect to the margin, i.e. (1/m) \sum_{i=1}^m L(y_i f(x_i)), where L: R -> [0, \infty) is some loss function.
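The empirical margin loss above can be made concrete for the algorithms listed: hinge loss for SVMs, logistic loss for logistic regression, exponential loss for AdaBoost. The sketch below evaluates each on hypothetical labels y in {-1, +1} and classifier outputs f(x); the data values are illustrative, not from the paper.

```python
import math

# Margin-based losses behind the algorithms named above: hinge (SVM),
# logistic (logistic regression), exponential (AdaBoost). The (y, f(x))
# pairs below are hypothetical.

hinge = lambda z: max(0.0, 1.0 - z)
logistic = lambda z: math.log(1.0 + math.exp(-z))
exponential = lambda z: math.exp(-z)

def empirical_risk(loss, examples):
    """(1/m) * sum_i L(y_i * f(x_i)) over (y_i, f(x_i)) pairs."""
    return sum(loss(y * f) for y, f in examples) / len(examples)

# Labels y in {-1, +1} and real-valued outputs f(x); the two examples
# with y * f(x) < 0 are misclassified and dominate each loss.
data = [(+1, 2.0), (-1, -0.5), (+1, -0.2), (-1, 1.3)]
for name, L in [("hinge", hinge), ("logistic", logistic), ("exp", exponential)]:
    print(name, round(empirical_risk(L, data), 3))
```

All three losses are decreasing functions of the margin y f(x), which is what lets the framework treat them uniformly.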
First we extend margin-based learning algorithms to multi-class classification, since most of these algorithms were originally devised for binary classification. Allwein et al. [5] proposed a unified approach for decomposing a multi-class problem into a set of binary-class problems. Each decomposition can be represented by a coding matrix M \in {-1, 0, +1}^{k x l}, where each of the l columns defines one binary problem. Orthogonal to the problem of coding matrix selection, the learning algorithm has to assign each example to a predicted class \hat{y} \in Y given the outputs f_1, ..., f_l of the binary classifiers. We adopt a loss-based decoding scheme, where L is the same loss function used in training: the predicted label is \hat{y} = argmin_r \sum_{s=1}^l L(M_{rs} f_s(x)).

Optimal Multi-Class Active Learning
One of the most important components of active learning is the sample selection function, which aims to select a set of informative examples to label. Specifically, it is reasonable for a margin-based learning algorithm to use minimization of the training loss as the criterion [5]. Therefore the goal of an optimal active learner is to search for those unlabeled examples which, once labeled, minimize the expected loss on the data set.
Let P(y|x) be the conditional distribution over the label of an example x, and P(x) be the marginal distribution of x. The learner has been given a labeled training set D, and outputs an estimated loss L(M_y f_D(x)) for every x in the pool P; for brevity we write L(M_y f_D(x)) as L(f_D) in the following discussion. We can then write the expected risk of the learner as

  E(f_D) = \int_x \sum_{y \in Y} P(y|x) L(M_y f_D(x)) P(x) dx.   (1)

A multi-class active learner has to select a query set D^+ of unlabeled examples from the pool and ask a human for their labels. After every example x* in D^+ is given its label y* \in Y and added to the training set, an updated learner is trained on the training set D* = D \cup D^+. The optimal learner chooses the query set D^+_opt for which the updated learner yields the largest risk reduction,

  D^+_opt = argmax_{D^+} ( E(f_D) - E(f_{D \cup D^+}) ).   (2)

Because it is rather difficult to estimate the expected risk over the full distribution P(x), it is more feasible to measure the risk over the examples in the pool. The objective therefore becomes

  D^+_opt = argmax_{D^+} \sum_{x \in P} \sum_{y \in Y} P(y|x) ( L(M_y f_D(x)) - L(M_y f_{D \cup D^+}(x)) ).   (3)

In theory, maximizing (3) straightforwardly yields the optimal query set D^+. Unfortunately, in practice it is intractable to compute all 2^{|P|} possible combinations even in one round, let alone selecting unlabeled examples iteratively. One feasible simplification is to select only one unlabeled example at a time, which reduces the number of choices to |P|; moreover, many learning algorithms such as SVMs and Naive Bayes have efficient incremental learning procedures. However, even with these optimizations the computation is usually not affordable for many real-world applications, which leads us to develop approximate strategies with similar performance but significantly less computational effort.

Approximated Sample Selection Strategies
In this section, we describe several practical sample selection strategies as alternatives for optimal multi-class active learning. As noted before, it is computationally intensive to maximize (3) due to the re-learning of the classifiers to estimate the new expected risk. Moreover, because typically only a small number of labeled data are available for training, the estimation for È´Ý Üµ might be unreliable. To make multi-class classification more practical, we will use some simple heuristics to simplify our selection strategies.
The first step of the approximation is to eliminate the computation-intensive component that has to be re-estimated after adding data, namely the updated prediction function f_{D \cup D^+}. Based on two assumptions presented in [4], we can rewrite (3) into the following equation, which maximizes the expected risk of the current learner for a single example,

  x* = argmax_{x \in P} E_{y|x}[ L(M_y f_D(x)) ].   (4)

Substituting (1) into E_{y|x}[L(f_D)] in (4), we get

  x* = argmax_{x \in P} \sum_{y \in Y} P(y|x) L(M_y f_D(x)).   (5)

A probability estimation scheme needs to be provided for the conditional probability P(y|x). Note that the classification confidence appearing in the loss function is not a posterior probability, so we cannot directly treat the confidence as an estimate of P(y|x). To address this, we adopt the best worst case model, which has been suggested in several papers [2,1]. This model assumes that the predicted label is exactly the true label of the unlabeled data, i.e. P(y|x) = 1 if y is the predicted label and P(y|x) = 0 otherwise.
It therefore approximates the expected loss with the smallest loss among all the possible labels. Thus (5) can be rewritten as

  x* = argmax_{x \in P} L(M_{\hat{y}(x)} f_D(x)),   (6)

where \hat{y}(x) is the predicted label for example x. The rationale for this model is to choose the most ambiguous examples, i.e. those with the maximum expected loss under the predicted label. In the case of l = 1, where only binary classes are predicted, this strategy reduces to choosing the example x with max_x L(\hat{y}(x) f(x)). This can be interpreted as selecting the examples closest to the decision boundary, a common sample selection criterion in binary active learning tasks [3].
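The best worst case rule (6) combines naturally with loss-based decoding: for each unlabeled example, compute the decoding loss of every class, keep the smallest (the loss at the predicted label), and query the example for which that smallest loss is largest. The coding matrix and scores below are hypothetical, chosen to make the selection visible.

```python
# Sketch of best worst case selection: pick the pool example whose
# smallest per-class decoding loss is largest. Scores are hypothetical
# binary-classifier outputs f_s(x) for each pool example.

def hinge(z):
    return max(0.0, 1.0 - z)

def class_losses(M, scores, loss=hinge):
    """Total decoding loss sum_s L(M[r][s] * f_s(x)) for each class r."""
    return [sum(loss(M[r][s] * scores[s]) for s in range(len(scores)))
            for r in range(len(M))]

def best_worst_case_select(M, pool_scores):
    """Return the pool index maximizing min_y L(M_y f(x))."""
    ambiguity = [min(class_losses(M, s)) for s in pool_scores]
    return max(range(len(pool_scores)), key=ambiguity.__getitem__)

# One-against-all matrix for 3 classes.
M = [[1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
pool_scores = [
    [2.0, -2.0, -2.0],   # confidently class 0: near-zero best-class loss
    [0.1, 0.2, -0.1],    # ambiguous: every class keeps substantial loss
    [-1.5, 1.8, -1.9],   # confidently class 1
]
print(best_worst_case_select(M, pool_scores))  # picks the ambiguous one
```

Confident examples have a class whose decoding loss is essentially zero, so only genuinely ambiguous examples survive the `min` and win the `max`.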
Some of the work most relevant to our approach was done by Tong et al. [6]. Viewing the multi-class classification problem as an extension of the binary case, they propose a simple heuristic that selects the next unlabeled example minimizing the maximum model loss,

  x* = argmin_x max_y Area(V(D \cup {(x, y)})),

where V(.) denotes the version space. By approximating the size of the version space Area(V(D \cup {(x, y)})) with (1 + y f(x)) Area(V(D)) / 2, we can show that this heuristic is a special case of our best worst case model with loss function L(z) = log(1 / (1 + z)).

EXPERIMENTS
In this section, we describe experiments on semantic feature extraction using the development set of the TREC03 Video Track Feature Extraction Task [7] to demonstrate the effectiveness of our multi-class active learning approach. To construct a multi-class data set, we randomly sampled 893 video shots from 5 mutually exclusive categories using the labels provided by the common annotations, including 277 shots of studio settings, 215 shots of weather news, 123 shots of hockey events, 175 shots of basketball events and 103 shots of baseball events.
Low-level features including color, edge and face were generated to learn the semantic features. After dividing each image into 5*5 equal regions, the color features in each region are computed as color histograms for each separate channel of the HSV color space, using 16 bins for hue and 6 bins each for saturation and value. A Canny edge detector was applied to extract edges from the images. The edge histogram has a total of 73 bins: the first 72 bins represent the edge directions quantized at 5 degree intervals, and the last bin counts the pixels that do not contribute to any edge. Schneiderman's face detection algorithm [8] was used to extract frontal and profile faces. An RBF-kernel support vector machine (SVM) with parameter 0.05 was trained as the base classifier. In Figure 2, we compare the performance of supervised learning with random sampling, binary-class active learning which independently selects examples for each decomposed binary classifier using the Simple strategy [3], and the multi-class active learning approach with the best worst case model. The one-against-all coding scheme and the hinge loss function (1 - y f(x))_+ [5] are used. We started with 50 training examples and added 5 more training examples in each iteration. (For the sake of simplicity, we simulated the human labeling process using the complete, true data labels instead of asking a human for labels at each step.) The results show that the classification error rate of multi-class active learning decreased much faster and more consistently than that of both the random sampling strategy and binary active learning. Comparing the methods at 200 training examples, multi-class active learning achieved the lowest error rate, only half the error of random sampling.
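The region-based color histogram described above can be sketched as follows. The 5*5 grid and the 16/6/6 bin counts come from the paper; the input format (row-major RGB pixels in [0, 1]) and the absence of normalization are assumptions made for illustration.

```python
import colorsys

# Sketch of the region-based HSV color histogram: 5*5 regions,
# 16 hue bins, 6 saturation bins, 6 value bins per region.
# Input format and lack of normalization are illustrative assumptions.

H_BINS, S_BINS, V_BINS = 16, 6, 6
GRID = 5

def hsv_histograms(pixels, width, height):
    """pixels: row-major list of (r, g, b) in [0, 1]. Returns one
    (hue, sat, val) histogram triple per region, 25 in total."""
    regions = [([0] * H_BINS, [0] * S_BINS, [0] * V_BINS)
               for _ in range(GRID * GRID)]
    for i, (r, g, b) in enumerate(pixels):
        x, y = i % width, i // width
        region = (y * GRID // height) * GRID + (x * GRID // width)
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        hh, ss, vv = regions[region]
        hh[min(int(h * H_BINS), H_BINS - 1)] += 1  # hue in [0, 1)
        ss[min(int(s * S_BINS), S_BINS - 1)] += 1
        vv[min(int(v * V_BINS), V_BINS - 1)] += 1
    return regions

# Usage: a 10x10 solid red frame; every pixel lands in hue bin 0.
pixels = [(1.0, 0.0, 0.0)] * 100
regions = hsv_histograms(pixels, 10, 10)
```

Concatenating the 25 regional histograms gives a 25 * (16 + 6 + 6) = 700-dimensional color feature vector per key frame.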
From another point of view, the error rate of active learning was reduced to 24% with only 30 additional training examples, while random sampling required more than 150 additional training examples to achieve similar performance. Figure 3 shows the key frames of the video shots selected by the proposed selection strategy in the first 4 sampling iterations.

CONCLUSION
In this paper, we presented a multi-class active learning framework to reduce the human labeling effort for multi-class classification. Our experiments on semantic feature extraction demonstrate that an active learner with approximated sample selection strategies can achieve remarkably good performance with much less human labeling effort compared to supervised learning with random sampling.
However, it may be too early to claim that active learning definitely outperforms supervised learning with random sampling, because active learning requires users to wait for the learning algorithm to output classification results, which can take a long time when a large video collection is processed. A better cost function needs to be defined to take these issues into account. Another promising avenue for