Collaborative Multilabel Classification

Abstract In multilabel classification, strong label dependence is often present and available for exploiting, particularly word-to-word dependence among semantic labels. For such situations, we develop a collaborative-learning framework to predict class labels based on label-predictor pairs together with label-only data. For example, in image categorization and recognition, language expressions describe the content of an image, alongside a large number of words and phrases without associated images. This article proposes a new loss that quantifies partial correctness of false positive and false negative misclassifications arising from label similarities. Given this loss, we derive the Bayes rule, which captures label dependence through nonlinear classification. On this ground, we introduce a weighted random forest classifier for complete data and a stacking scheme that leverages additional labels to enhance supervised learning based on label-predictor pairs. Importantly, we decompose multilabel classification into a sequence of independent learning tasks, so that the computational complexity of our classifier is linear in the number of labels. Compared to existing classifiers that ignore label-only data, the proposed classifier enjoys this computational benefit while enabling the detection of novel labels absent from training, by exploiting label dependence and leveraging label-only data for higher accuracy. Theoretically, we show that the proposed method consistently reconstructs the Bayes performance, achieving the desired learning accuracy. Numerically, we demonstrate that the proposed method compares favorably, in terms of both the proposed and Hamming losses, against binary relevance and a regularized Ising classifier modeling conditional label dependence. Indeed, leveraging additional labels tends to improve supervised performance, especially when the training sample is not large, much as in semisupervised learning.
Finally, we demonstrate the utility of the proposed approach on the Microsoft COCO object detection challenge, PASCAL visual object classes challenge 2007, and Mediamill benchmark.


Introduction
In multilabel classification, semantic labels such as words and phrases present strong label dependence, characterized by word-to-word dependence. Such dependence can be unconditional or conditional on predictors. For example, "football" and "tennis" are two semantic labels that are unconditionally dependent, usually co-occurring in a sports event, yet conditionally independent given the equipment used in the sports. In such a situation, classification based on inadequate training examples often yields poor performance, given the complex structure of semantic labels and the presence of a large number of additional labels without predictors. For instance, in image captioning (Fang et al. 2015), a learner trains on image-caption pairs to describe the contents of an image, together with a large number of captions from news articles, which the learner can leverage to account for label dependence and enhance prediction. One central issue is how to leverage additional labels while accounting for label dependence to enhance supervised learning based on label-predictor pairs. Also, do label-only data provide relevant information for classification, much as unlabeled data do in semisupervised learning? In this article, we develop a classification framework, which we call collaborative multilabel classification, under which we build classifiers that leverage label dependence and additional labels to deliver higher predictive accuracy than supervised counterparts ignoring either additional labels or label dependence.
Given the large body of literature on multilabel classification focusing on nonsemantic labels, we restrict our discussion to the most relevant references. Ideally, one may perform independent binary classifications, one for each label, known as binary relevance (BR). The state-of-the-art binary relevance learning (Cox 1972; Carey, Zeger, and Diggle 1993; Pendergast et al. 1996; Breiman and Friedman 1997) is an ensemble of binary classifications that treats predicted values as predictors for one additional classification. However, binary relevance largely ignores label dependence. As illustrated in Breiman and Friedman (1997), component-wise dependencies are crucial to prediction in multi-response regression, indicating that a classification model must model label dependence. To account for unconditional dependence, the frequencies of label co-occurrences in training examples are used (Read et al. 2016). To account for conditional label dependence, Markov/Bayesian/dependence networks are employed for estimation of the dependence structure and for classification; see Guo and Gu (2011) for an excellent survey. For example, Godbole and Sarawagi (2004) constructed a classifier chain utilizing the frequencies of label co-occurrences in training examples to account for unconditional dependence, yet identifying a good chain is indeed challenging. Given the mounting cost of estimating the dependence structure, efforts to model label dependence have often gone unrewarded in the past. Recently, Cheng et al. (2014) used a sparse pseudolikelihood for an Ising model describing linear conditional dependence, a frequentist version of Guo and Gu (2011). Moreover, Hsu (2009) proposed a compressed sensing method exploring output sparsity. On a related topic, Wu et al. (2014) suggested a tree-based approach for hierarchical labels.
Despite progress, issues remain, particularly for modeling semantic labels and leveraging additional labels without predictors, as in image captioning (Fang et al. 2015).
The contributions of this article are five-fold. First, we develop a multilabel framework of collaborative classifiers for label-predictor pairs as well as additional labels, with a focus on semantic labels. One salient feature of the proposed classifier is its capability to detect a novel class absent from training by exploring conditional label dependence. Moreover, the classifier can improve supervised performance by leveraging additional labels, particularly when the training sample size is not large, as in semisupervised learning. Second, we develop nonlinear classifiers based on random forests (Breiman 2001) to capture label dependence, conditionally and unconditionally, integrating additional labels through stacking (Ting and Witten 1997), where the use of nonlinear classifiers is critical for learning conditional label dependence. Third, we introduce a weighted loss quantifying partial correctness of false positive and negative misclassifications due to label similarities, in contrast to the Hamming loss (Tsochantaridis et al. 2004), the Jaccard distance (Gjorgjioski, Kocev, and Džeroski 2011), and the subset (0-1) loss (Gjorgjioski, Kocev, and Džeroski 2011). In the literature, a hierarchical loss imposes partial correctness through the size of the offspring of a node when each label corresponds to a node in a hierarchy such as a tree (Cesa-Bianchi, Gentile, and Zaniboni 2006), which differs dramatically from our situation. Moreover, existing classifiers are typically designed under the Hamming or 0-1 loss, which does not target the Bayes rule under the proposed loss, and hence they tend to perform worse when the proposed loss is used for evaluation, as demonstrated by our simulations. Fourth, we develop a computational method that decomposes the joint learning task into independent learning of transformed labels, which dramatically reduces the computational cost.
As a result, the proposed method is more scalable than its competitors in its memory requirement. In simulations, we demonstrate that the proposed classifier outperforms binary relevance and a regularized pseudo-likelihood classifier under two evaluation loss functions, namely the proposed loss (3) and the Hamming loss; see Tables 1 and 2. Moreover, in three benchmark examples, it continues to outperform its competitors, including a ResNet50 convolutional neural network (CNN) deep learner (He 2016). Fifth, we establish consistent recovery of the Bayes performance by the proposed classifier in terms of the evaluation loss function and give conditions under which the proposed classifier detects novel labels.
This article is organized as follows. Section 2 presents the proposed loss function. Section 3 formulates a collaborative learning framework, develops a classifier to account for label dependence, and integrates additional labels through stacking; computationally, we develop a decomposable learning strategy that makes the computational complexity of our classifier linear in the label size. Section 4 investigates the theoretical properties of the proposed classifier. Section 5 examines its numerical performance and compares it with some strong competitors, namely binary relevance and an Ising-model classifier, through simulations, followed by an application to three benchmark examples. Finally, the appendix contains technical proofs (supplementary material).

Collaborative Classification
In multilabel classification, (X, Y) with X = (X_1, ..., X_q) and Y = (Y_1, ..., Y_p) is an input-output pair, where the labels Y_1, ..., Y_p present strong label dependence given X, and each label Y_j is coded as {−1, 1}; j = 1, ..., p. In addition, an independent sample of additional label observations {Z_j}_{j=n+1}^{n+m}, following the same distribution as {Y_i}_{i=1}^n, is available; typically, a small amount of complete data is accompanied by a large amount of additional labels, in that the number of additional labels m may greatly exceed the sample size n. This framework is what we call collaborative multilabel classification. Our primary objectives are to (i) leverage additional labels, (ii) exploit label dependence for higher classification accuracy, and (iii) detect novel labels.

A Novel Loss
In collaborative multilabel classification, we introduce a new loss to quantify partial correctness for false positive and negative misclassifications due to label similarities.
The accuracy of classification is measured by the generalization error Err(f) = E L(Y, f), where L(·, ·) is a loss function measuring the discrepancy between the observed Y and its prediction by f(x) = (f_1(x), ..., f_p(x)), a decision function vector with f_k(x) predicting Y_k, and E is the expectation with respect to (X, Y). Now we propose a nonnegative loss to quantify false positive and negative errors:

L(y, f) = Σ_{k=1}^p ( Σ_{l: y_l = −1} w_{+lk} I(f_k(x) > 0) + Σ_{l: y_l = +1} w_{−lk} I(f_k(x) ≤ 0) ) − Σ_{k=1}^p min( Σ_{l: y_l = +1} w_{−lk}, Σ_{l: y_l = −1} w_{+lk} ),   (1)

where w_{+lk} ≥ 0 and w_{−lk} ≥ 0 are the amounts of penalty for false positive and negative errors made by f_k(x) in predicting label y_l, and I is the indicator function. For false negatives, w_{−lk} > 0 when labels y_l and y_k are semantically similar, and w_{−kk} ≥ w_{−kl}; l ≠ k, when correct classification of y_k by f_k(x) is more important than that by f_l(x) for l ≠ k. For false positives, w_{+kl} is usually small and can be set to w_{+kl} = 0 without any loss; k ≠ l, particularly when the primary objective is to identify the presence of certain labels, as in image object detection. Note that the second term in Equation (1) ensures that L(y, f) = 0 corresponds to no error, because of Equation (3), as false positive and negative errors cannot occur simultaneously for the same label. For convenience, we may normalize the rows or columns of W_± so that the row or column sums equal 1. In contrast, the Hamming (symmetric difference) loss (Tsochantaridis et al. 2004), the Jaccard distance (Gjorgjioski, Kocev, and Džeroski 2011), and the subset (0-1) loss (Gjorgjioski, Kocev, and Džeroski 2011) neither discriminate between false positive and negative errors nor permit partial label correctness. Note that Equation (1) reduces to the Hamming loss if w_{±lk} = 0 when l ≠ k. As an illustration, consider a simple case of three semantic labels "football," "basketball," and "vehicle," where w_{+ll} = 1/3 and w_{+lk} = 0 otherwise; 1 ≤ l, k ≤ 3, and w_{−11} = w_{−12} = w_{−21} = w_{−22} = 1/6, w_{−33} = 1/3, and w_{−lk} = 0 otherwise.
In this case, the primary concern is the false positive error; misclassifying "football" as "basketball" incurs a penalty of 1/3, which is smaller than the loss of 1/2 from misclassifying "football" as "vehicle," in accordance with the degree of similarity. Hence, loss (1) appears more sensible than the aforementioned loss functions, which neither permit partial correctness nor discriminate between false negative and positive errors.
In practice, W_− = {w_{−kl}}_{1≤k,l≤p} may be estimated by a semantic similarity measure based on an independent sample. Whereas W_+ = p^{−1} I is sensible, particularly for our target application of object recognition, W_− is estimated by global vectors for word representation (GLOVE) (Pennington, Socher, and Manning 2014). Vector-space representations of this kind encode the semantic information of words as numerical vectors in Euclidean space. They are constructed so that semantically similar words have word vectors close to each other, making them an ideal tool for measuring word-to-word similarities. In particular, we propose to use the cosine similarity between the word vector representations of two labels to measure their similarity. Formally, denote by v_j the word vector representation of label j. We compute a similarity measure, for any two labels j and k, as

w_{−jk} ∝ cos(v_j, v_k) = v_j · v_k / (‖v_j‖ ‖v_k‖),   (2)

and normalize the rows of W_− so that each row sum equals 1. For the applications considered in this article, all labels have pretrained GLOVE word vector representations available. For other applications involving word labels beyond GLOVE, one could fine-tune the pretrained GLOVE models over the new corpus. Loss (1) is nonnegative and seems nondecomposable at first blush, unlike the Hamming loss L_H(y, f) (Tsochantaridis et al. 2004), which can be written as a sum of individual label loss functions. Surprisingly, we can decompose L(y, f) in (1) as

L(y, f) = Σ_{k=1}^p L_k(y, f_k), with L_k(y, f_k) = |δ_k(y)| I( Sign(δ_k(y)) ≠ Sign(f_k(x)) ),   (3)

where we have used δ_k(y) = Σ_{l: y_l = +1} w_{−lk} − Σ_{l: y_l = −1} w_{+lk}. The decomposition (3) has several consequences. First, it decomposes the overall generalization error with respect to each label classification: Err(f) = Σ_{k=1}^p E L_k(Y, f_k). Second, L_k is highly interpretable in that δ_k(y) is the aggregated misclassification error over false positives and negatives, determined by w_{+lk} and w_{−lk}. Moreover, L_k is a margin loss, a function of the functional margin δ_k(Y) f_k(X) for predicting the outcome of Sign(δ_k(Y)) by Sign(f_k(X)).
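As a small illustrative sketch of this construction, the similarity matrix and row normalization can be computed as follows; the three-dimensional vectors below are made-up stand-ins, not actual pretrained GLOVE embeddings.

```python
import numpy as np

# Toy stand-ins for pretrained GLOVE word vectors (hypothetical values;
# a real application would load pretrained embeddings instead).
vecs = {
    "football":   np.array([0.9, 0.8, 0.1]),
    "basketball": np.array([0.8, 0.9, 0.2]),
    "vehicle":    np.array([0.1, 0.2, 0.9]),
}

def cosine(u, v):
    # Cosine similarity between two word vectors, as in Equation (2).
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

labels = list(vecs)
S = np.array([[cosine(vecs[a], vecs[b]) for b in labels] for a in labels])

# Row-normalize so that each row of W_minus sums to 1.
W_minus = S / S.sum(axis=1, keepdims=True)
```

As intended, semantically close labels ("football", "basketball") receive a larger off-diagonal weight than distant ones ("football", "vehicle").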
Note, however, that even if Sign(δ_k(Y)) = Y_k for all 1 ≤ k ≤ p, minimization under the new loss is not equivalent to that under the Hamming loss. This is because the weights δ_k(Y) are not necessarily equal to each other across 1 ≤ k ≤ p.
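To make the worked example concrete, the following sketch evaluates the decomposed loss for the three labels above. The specific form of δ_k(y) used here, the aggregated false-negative weights of the positive labels minus the aggregated false-positive weights of the negative labels, is our reading of the decomposition and should be treated as an assumption; it does reproduce the penalties of 1/3 and 1/2 quoted above.

```python
import numpy as np

# Weight matrices from the "football"/"basketball"/"vehicle" example:
# W_plus penalizes false positives, W_minus false negatives.
W_plus = np.diag([1/3, 1/3, 1/3])
W_minus = np.zeros((3, 3))
W_minus[0, 0] = W_minus[0, 1] = W_minus[1, 0] = W_minus[1, 1] = 1/6
W_minus[2, 2] = 1/3

def delta(y):
    # Assumed form of delta_k(y): aggregated false-negative weight of the
    # positive labels minus aggregated false-positive weight of the
    # negative labels, for each column k.
    pos = (y == 1)
    return W_minus[pos].sum(axis=0) - W_plus[~pos].sum(axis=0)

def loss(y, f):
    # Margin form of the decomposed loss: label k contributes
    # |delta_k(y)| whenever Sign(f_k) disagrees with Sign(delta_k(y)).
    d = delta(y)
    return float(np.sum(np.abs(d) * (np.sign(d) != np.sign(f))))

y_true = np.array([1, -1, -1])                       # image shows "football"
to_basketball = loss(y_true, np.array([-1, 1, -1]))  # 1/3: similar label
to_vehicle = loss(y_true, np.array([-1, -1, 1]))     # 1/2: dissimilar label
```

A correct prediction incurs zero loss, while confusing "football" with the similar "basketball" is penalized less than confusing it with "vehicle".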

Multilabel Classification and Label Dependence
To predict the outcome of Y given X, we derive the Bayes rule under loss L in Equation (3), based on which our prediction rule is constructed. Specifically, Lemma 1 gives the Bayes decision function f* minimizing the generalization error under loss (1).

Lemma 1 (Bayes decision rule). Given loss L in Equation (3), the Bayes rule is f* = (f*_1, ..., f*_p) with

f*_k(x) = E(δ_k(Y) | X = x) = Σ_{l=1}^p ( w_{−lk} P(Y_l = +1 | X = x) − w_{+lk} P(Y_l = −1 | X = x) ); k = 1, ..., p,   (4)

and the predicted label vector is (Sign(f*_1(x)), ..., Sign(f*_p(x))).
In Equation (4), f*_k is a sum of weighted conditional probabilities, which reduces, up to scaling, to f*_k(x) = 2P(Y_k = 1 | x) − 1 in the case of the Hamming loss. However, the Bayes rule in Equation (4) typically differs from that under the Hamming loss provided that not all w_{±lj} = 0; l ≠ j. This means that a classifier constructed under the Hamming loss is not Fisher-consistent, that is, it fails to converge to the Bayes rule, in contrast to the proposed classifier, which is Fisher-consistent; cf. Theorem 2.
Lemma 1 suggests that label dependence needs to be leveraged for a classifier to perform well under loss L. Based on Equation (4), we propose predicting the kth label by Sign(f̂_k(x)), where f̂ = (f̂_1, ..., f̂_p) minimizes the empirical cost derived from Equation (3), after ignoring the constant Σ_{k=1}^p min( Σ_{l: y_l = +1} w_{−lk}, Σ_{l: y_l = −1} w_{+lk} ):

S(f) = n^{−1} Σ_{i=1}^n Σ_{k=1}^p |δ_k(Y_i)| I( Sign(δ_k(Y_i)) ≠ Sign(f_k(X_i)) ).   (5)

Now minimizing S(f) in f is equivalent to solving p independent optimizations, that is, for k = 1, ..., p,

f̂_k = argmin_{f_k ∈ F_k} n^{−1} Σ_{i=1}^n |δ_k(Y_i)| I( Sign(δ_k(Y_i)) ≠ Sign(f_k(X_i)) ),   (6)

reducing the complexity of parameter estimation from O(p²) (Liu and Shen 2006) to O(p) in the dimension p, where F_k is a class of candidate decision functions for f_k. In Equation (6), we solve a weighted binary classification problem for predicting Sign(δ_k(Y)), where the misclassification loss of the ith observation is weighted by |δ_k(Y_i)|. Note that the solution of Equation (6) is not unique; in practice, the indicator function is replaced by a large-margin surrogate to resolve this issue (Cortes and Vapnik 1995).
To estimate f_k in Equation (6), we employ a weighted version of random forest (Breiman 2001), owing to its parsimonious tree representations and capability for variable selection, which permit nonparametric classification with many variables. Random forest combines classification trees with bootstrap aggregation, or bagging. Bagging repeats B times the selection of a bootstrap sample, that is, a random sample with replacement from the training data, and fits a classification tree to each sample. In particular, for b = 1, ..., B, a classification tree f̂_kb is trained on each bootstrap sample. Then the label of an unseen x is predicted by averaging the predictions of the individual classification trees at x, f̂_k(x) = B^{−1} Σ_{b=1}^B f̂_kb(x). Note that the random forest suffers from a lesser degree of the curse of dimensionality with respect to the dimension of x, because each classification tree involves piecewise constants of low complexity.
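As a sketch of the weighted random-forest step in Equation (6), the snippet below uses scikit-learn's RandomForestClassifier as a stand-in for the article's R implementation, on synthetic data and with identity weight matrices W_± = p^{−1}I, so that δ_k(y) = y_k/p (a Hamming-like special case); all names and settings here are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic complete data: q predictors, p labels coded as -1/+1.
n, q, p = 200, 5, 3
X = rng.normal(size=(n, q))
Y = np.where(X[:, :p] + 0.3 * rng.normal(size=(n, p)) > 0, 1.0, -1.0)

# Assumed delta_k(Y) under identity weight matrices W_+ = W_- = I/p,
# which gives delta_k(y) = y_k / p.
delta = Y / p

# One weighted binary classification per label: the target is
# Sign(delta_k(Y)) and each observation is weighted by |delta_k(Y)|.
forests = []
for k in range(p):
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X, np.sign(delta[:, k]), sample_weight=np.abs(delta[:, k]))
    forests.append(clf)

preds = np.column_stack([clf.predict(X) for clf in forests])
train_acc = float((preds == Y).mean())
```

With nonuniform weight matrices, only the targets Sign(δ_k(Y_i)) and the weights |δ_k(Y_i)| change; the p fits remain independent, which is the source of the linear-in-p complexity.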

Prediction Based on Additional Labels
This section focuses on multilabel classification based on the additional labels (Z_i)_{i=n+1}^{n+m}, which is integrated with classification based on complete data through stacking in Section 3.3.
In the absence of an input x, to predict the label values of Z, we introduce a decision function vector g = (g_1, ..., g_p), one component for each label, which serves as a baseline without x. Then we propose a loss L(z, g) = Σ_{k=1}^p L_k(z, g_k) + Σ_{k=1}^p min( Σ_{l: z_l = +1} w_{−lk}, Σ_{l: z_l = −1} w_{+lk} ) based on (3), where z = (z_1, ..., z_p). This in turn yields our proposed cost function: for k = 1, ..., p, minimize in g_k

m^{−1} Σ_{i=n+1}^{n+m} |δ_k(Z_i)| I( Sign(δ_k(Z_i)) ≠ Sign(g_k) ).   (7)

Minimization of Equation (7) in g_k yields an estimated decision ĝ_k whose sign agrees with that of Σ_{i=n+1}^{n+m} δ_k(Z_i), for example, ĝ_k = Σ_{i=n+1}^{n+m} δ_k(Z_i) / Σ_{i=n+1}^{n+m} |δ_k(Z_i)|. Interestingly, ĝ_k is the normalized value of δ_k(·) aggregated over the additional observed labels.
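A minimal sketch of this label-only baseline, under the illustrative assumption of identity weight matrices (so that δ_k(z) = z_k/p) and a normalization by the aggregated magnitude of δ_k:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical additional label-only observations Z: m observations of
# p labels coded as -1/+1, with positives relatively rare.
m, p = 1000, 3
Z = rng.choice([-1.0, 1.0], size=(m, p), p=[0.7, 0.3])

# Assumed delta_k under identity weight matrices: delta_k(z) = z_k / p.
delta = Z / p

# Baseline decisions g_k: delta_k aggregated over the additional labels
# and normalized by the aggregated magnitude, so each g_k lies in [-1, 1].
g = delta.sum(axis=0) / np.abs(delta).sum(axis=0)
```

Because positives are rare in this toy sample, each baseline g_k is negative, so the label-only rule predicts −1 for every label; only the sign of g_k matters for the cost in (7), while its magnitude matters later for stacking.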
Note that the solution of Equation (7) is not unique. We use the additional labels and feature-only data to learn the weight matrices W_+ and W_−. In particular, we consider minimizing, jointly with respect to the discriminant function and both weight matrices, an objective combining two empirical risks, where E_1 and E_2 denote empirical expectations over the feature-only data and the additional labels, respectively. Note that the joint minimization over f and the weight matrices may be difficult if we use the random forest model for f, because the random forest software cannot handle the loss function in the second term (i.e., l(·, ·)). We can instead consider the following greedy approach.
We sample candidate weight-matrix pairs (W_+, W_−), fit the classifier for each candidate pair, and finally choose the pair that minimizes the objective. The idea is that we treat the weight matrices as tuning parameters (as opposed to model parameters).

Integration of Complete and Additional Labels
This section derives a stacking, or ensemble, method to further enhance the accuracy of prediction based on complete data by incorporating the prediction from additional labels. Specifically, we combine the two prediction functions in the form h_k(α, x) = α f̂_k(x) + (1 − α) ĝ_k, where α ∈ [0, 1] is a tuning parameter for model averaging. Our objective is to tune α so that the predictive accuracy of h_k is no less than that based on complete data alone.
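The grid search for α can be sketched as follows on hypothetical decision values; the validation data, the Hamming-type tuning criterion, and all constants are illustrative assumptions rather than the article's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical decision values on a validation set: f_val from the
# complete-data classifier, g from the label-only baseline.
n_val, p = 100, 3
f_val = rng.normal(size=(n_val, p))
g = rng.normal(size=p)
Y_val = np.where(f_val + 0.5 * rng.normal(size=(n_val, p)) > 0, 1.0, -1.0)

def err(pred, Y):
    # Hamming-type validation error of the sign predictions.
    return float((np.sign(pred) != Y).mean())

# Tune the model-averaging weight alpha over a uniform grid of [0, 1],
# mirroring the article's cross-validated grid search.
grid = np.linspace(0.0, 1.0, 1001)
errors = [err(a * f_val + (1.0 - a) * g, Y_val) for a in grid]
alpha = float(grid[int(np.argmin(errors))])
stacked_err = err(alpha * f_val + (1.0 - alpha) * g, Y_val)
```

Because α = 1 is on the grid, the tuned stacked classifier can never do worse than f̂ alone on the tuning set, which is the finite-sample counterpart of the asymptotic guarantee below.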
The next lemma says that the combined classifier asymptotically performs no worse than that based on complete data alone.

Lemma 2. Assume that the smoothness condition (12) holds for some constants a, γ ≥ 0, and suppose that c_N = O(N^{min(1/(2−γ), 1)}). Then the stacked classifier h(α̂, ·) asymptotically performs no worse than the classifier based on complete data alone. As a technical remark, Equation (12) is a commonly used smoothness assumption in the classification literature; see, for example, Shen et al. (2007). It is implied by the low-noise (margin) condition of Tsybakov (2004).

Nonlinear Learning and Label Dependence
This section presents a result arguing that pairwise label dependence is accounted for through nonlinear learning, even when such dependence is expressed in terms of linear conditional dependence. To see this, we examine the pairwise conditional dependence of Y_j and Y_k given X in an Ising model.
Lemma 3. Suppose that Y = (Y_1, ..., Y_p) follows a conditional Ising model given a predictor vector X, so that the probability density of Y given X = x is

P(Y = y | X = x) = Z(α(x))^{−1} exp( Σ_{j=1}^p α_jj(x) y_j + Σ_{1≤j<k≤p} α_jk(x) y_j y_k ),

where y = (y_1, ..., y_p), α(x) = (α_11(x), α_12(x), ..., α_pp(x)) is a p(p+1)/2-dimensional vector, and Z(α(x)) is the partition function. Then, for k = 1, ..., p, logit(P(Y_k = 1 | x)) can be written as

logit(P(Y_k = 1 | x)) = log( Σ_{y: y_k = +1} exp(H(y, x)) / Σ_{y: y_k = −1} exp(H(y, x)) ),   (15)

where H(y, x) denotes the exponent above. As indicated by Lemma 3, the conditional covariance of Y_j and Y_k given X = x, which is proportional to α_jk(x), enters the classification model (15) nonlinearly even if α_jk(x) is linear in x. This observation, together with the result of Lemma 1, suggests that our nonlinear representation of f_k(x) in Equation (6) is more appropriate than its linear counterpart for accounting for label dependence in any situation.

Detection of Novel Labels
When a label is absent from a training set, a classifier typically does not assign any instances to that label and thus fails to detect novel labels. In contrast, our classifier, as defined in Equation (6), is capable of detecting a novel class absent from training through label dependence. Theorem 1 gives such a result.
Theorem 1 (Detection of novel labels). Suppose that the kth label is absent from the training data, that is, Y_{ik} = −1; i = 1, ..., n. If there exists an i with 1 ≤ i ≤ n + m such that δ_k(Y_i) > 0, then there exists α* with 0 ≤ α* ≤ 1 such that h_k(α*, x_i) > 0 for some 1 ≤ i ≤ n; that is, class k can be detected.
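The condition of Theorem 1 can be illustrated numerically. Below, the false-positive weights are set to zero (which the text permits) and two similar labels share a false-negative weight; the form of δ_k used is our assumed reading of the decomposition. Although label 2 never appears positively, δ_2(y) > 0 whenever the similar label 1 is present, so the weighted target Sign(δ_2(y)) is positive and a classifier trained on it can assign instances to the unseen label.

```python
import numpy as np

# Hypothetical weights coupling two similar labels (say "football" and
# "basketball"); per the text, the false-positive weights may be zero.
W_plus = np.zeros((3, 3))
W_minus = np.zeros((3, 3))
W_minus[0, 0] = W_minus[0, 1] = W_minus[1, 0] = W_minus[1, 1] = 1/6
W_minus[2, 2] = 1/3

def delta(y):
    # Assumed form of delta_k(y) from the decomposition.
    pos = (y == 1)
    return W_minus[pos].sum(axis=0) - W_plus[~pos].sum(axis=0)

# A training observation where label 2 (index 1) is absent (-1) but the
# semantically similar label 1 (index 0) is present.
y = np.array([1, -1, -1])
d = delta(y)   # d[1] = 1/6 > 0 even though y[1] = -1
```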

Consistent Recovery of the Bayes Performance
The original random forests (Breiman 2001) relied on complex data-dependent mechanisms of selecting variables and cutting points, which makes it extremely difficult to analyze. As a result, its basic statistical properties remain not fully understood. To the best of our knowledge, most existing theoretical results focus on a simplified version of the original random forest algorithm so that statistical analysis is more tractable; see Biau and Scornet (2016) for a comprehensive review of recent theoretical developments. As such, we expand existing results to our new loss function based on a simplified version of random forests (Biau 2012).
More specifically, we analyze a simple random forest in Equation (6), a voting classifier of simple decision trees considered in Biau, Devroye, and Lugosi (2008) and Biau (2012). For a single tree, a coordinate of X = (X_1, ..., X_q) is chosen at each node, with the jth feature having probability p_nj ∈ (0, 1) of being selected, and the selected cell is split along the chosen variable at the midpoint of the chosen side.
A voting classifier f^(B)_k for the kth label is defined as the average of B independent tree classifiers, f̂^(B)_k(x) = B^{−1} Σ_{b=1}^B f̂_kb(x), where f̂_kb(x); b = 1, ..., B, are independent single trees with the same number of variable splits. Next, we generalize the consistency results of Biau, Devroye, and Lugosi (2008) and Biau (2012) under the Hamming loss to the loss L in Equation (3). Specifically, consistency of the voting classifier f^(B) means that its generalization error under L converges to the Bayes error as n → ∞. The next theorem establishes conditions under which our method (6) is consistent under Equation (1).
Theorem 2. Assume that the distribution of X is supported on [0, 1]^q. Moreover, assume that f_k(x) = E(δ_k(Y) | X = x); k = 1, ..., p, are uniformly L-Lipschitz continuous: |f_k(x) − f_k(x′)| ≤ L ‖x − x′‖ for all x, x′ ∈ [0, 1]^q, where L > 0 is a constant independent of p, q, and n. Let S denote the number of splits for each individual tree. Then, the voting random forest classifier f^(B) is consistent if S → ∞, p³S/n → 0, and min_{1≤j≤q} p_nj log S − 2 log p → ∞ as S → ∞.
It is worth mentioning that any classifier that is not Fisher-consistent is not consistent in the sense of Theorem 2. For example, a classifier constructed under the Hamming loss is inconsistent because the label dependence is ignored; see the discussion after Lemma 1.
If (p_n1, ..., p_nq) is chosen to split variables uniformly at random for each tree, that is, (p_n1, ..., p_nq) = (1/q, ..., 1/q), then the scaling condition requires that q = o(log n / log p), which means that the feature dimension can only grow more slowly than log n / log p. To handle a high-dimensional situation, the probabilities (p_n1, ..., p_nq) need to be chosen adaptively and the true classification model must be sparse. For example, as heuristically discussed in Biau (2012), if the target decision function E(δ_k(Y) | X) depends only on a subset S_0 of the q features, with q_0 = |S_0|, then p_nj can be chosen adaptively, using an independent validation set, to concentrate on S_0, so that the feature dimension q can grow much faster with n.

Numerical Examples
This section investigates several aspects of the proposed method, namely, (i) the operating characteristics of the classifier, (ii) the contribution of additional labels to the accuracy of prediction, and (iii) the capability of detecting partially observed labels. Importantly, we compare the proposed classifier with four state-of-the-art classifiers, including two binary relevance classifiers based on SVM (Cortes and Vapnik 1995), the unweighted random forest classifier (Breiman 2001), and a Pseudo-Ising classifier (Cheng et al. 2014) based on an Ising model. In the three benchmark examples, we also include a ResNet50 convolutional neural network (CNN) deep learner (He 2016) for comparison. The three binary relevance classifiers are denoted by linear-SVM, nonlinear-SVM, and CW, where the first two are separate linear and Gaussian-kernel SVM classifiers, one for each label, and CW is a refitting classifier using the fitted values from binary relevance based on linear SVM (Breiman 2001).
Numerical analysis is performed in R. For our training data, we first generate m + n paired observations, with n = 200, 500 and m = 2000. Furthermore, we consider p = 2, 10, 50, 100, 500, 1000 and q = 20, 50. Finally, we generate a test set with 2000 independent complete observations. Then the test error is computed under the loss (1) over the test data. For the proposed method integrating with additional labels, five-fold cross-validation is used for selection of the tuning parameter α over a set of 1000 uniform grid points of [0, 1].
As shown in Table 1, the proposed method outperforms all the competitors in terms of the test error across all the settings under our evaluation loss (3). The amounts of improvement over LSVM, NSVM, CW, and P-Ising range from 42.5% to 67.8%, from 14.9% to 61.8%, from 23.9% to 56.6%, and from 13.1% to 155.4%, respectively. Roughly, large improvements occur in challenging situations, particularly when either p or q increases while n is held fixed. Interestingly, the Pseudo-Ising classifier does not perform as well as expected, even though the data are generated from the Ising model. This is mainly because it uses a linear classification to account for linear label dependence; as indicated in Lemma 3, such dependence is only captured by a nonlinear method. Moreover, a pseudo-likelihood approach is less efficient when it does not target the evaluation loss. Interestingly, the proposed method continues to fare well even under other commonly used loss functions. For instance, as indicated in Table 2, the proposed method performs well under the Hamming loss, although the amount of improvement over the Pseudo-Ising classifier shrinks as p increases. Moreover, as shown in Table 3, the aforementioned results extend to the F_1-score, with higher F_1-scores than the competing methods across all situations, where F_1 = (1 + (FN + FP)/(2TP))^{−1}. In summary, the proposed method performs well even when the classification loss differs from the evaluation loss, which is attributed to the fact that label dependence has been adequately taken into account in the model.
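As a quick sanity check, the F_1 expression above is algebraically the familiar 2TP/(2TP + FP + FN); the toy label vectors below are made up for illustration only.

```python
import numpy as np

def f1_article(tp, fp, fn):
    # F1 as written in the text: (1 + (FN + FP) / (2 TP))^{-1}.
    return 1.0 / (1.0 + (fn + fp) / (2.0 * tp))

def f1_standard(tp, fp, fn):
    # The familiar equivalent form: 2 TP / (2 TP + FP + FN).
    return 2.0 * tp / (2.0 * tp + fp + fn)

# Toy -1/+1 label vectors for one label (illustrative only).
y_true = np.array([1, 1, 1, -1, -1, -1, 1, -1])
y_pred = np.array([1, -1, 1, -1, 1, -1, 1, -1])
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
fp = int(np.sum((y_true == -1) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == -1)))
```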
Concerning the contribution of additional labels, Table 5 suggests that our method improves on the classification performance of its counterpart with complete data alone consistently across all the settings, and the amount of improvement becomes large in difficult situations. The improvement is attributed primarily to the utilization of additional labels through stacking.
Concerning runtime, the proposed method runs faster than the other nonlinear methods in the more difficult setups and is slower than linear SVM, as suggested in Table 4. Importantly, the runtime of the proposed method is linear in the number of labels while it still accounts for label dependence, in contrast to binary relevance methods, which ignore label dependence.
Finally, to investigate the capability of the proposed method to detect novel labels, we consider the same simulation setup as before with n = 200, 500, p = 2, 10, 50, 100, 500, 1000, and q = 4. In particular, we set the first label to −1 in all cases in the training data. As suggested by Table 6, the TPR of the proposed method is strictly positive as long as the dimension p is not too large, but it decays as p increases because the level of difficulty escalates. This occurs in 4 out of 12 cases. This indicates that the proposed method is capable of detecting the novel label if the noise level is not too large, corroborating the result of Theorem 1. By comparison, all competitors are incapable of detecting novel labels, with TPR = 0 and FPR = 0.

Table 3. F_1-scores (standard errors in parentheses) of various methods based on 100 simulations for complete data, where LSVM, NSVM, P-Ising, CW, and Our denote linear SVM, nonlinear SVM, the Pseudo-Ising classifier (Cheng et al. 2014), Breiman's classifier combining binary-relevance linear SVMs, and the proposed method.

Benchmarks
This section demonstrates the utility of the proposed method and compares it with some state-of-the-art methods for multilabel classification on three benchmark examples.

Microsoft COCO Object Detection Challenge
We first examine the 2017 COCO Object Detection Task data in the Microsoft COCO object detection challenge dataset (Lin 2014), consisting of about 118,287 images with 80 object categories and about 500,000 disparate objects. In each image, each object is bounded by a box associated with one label, with multiple bounding boxes corresponding to multiple labels. Training, validation, and testing sets are available. Moreover, all images are labeled, so we do not consider additional labels in the full analysis. As the labels for the test set are unreleased, we use the training set for learning and the validation set for testing instead.
To apply a classifier, we extract image features by first applying a ResNet50 convolutional neural network (CNN) deep learner (He 2016); specifically, we extract the last layer of a trained ResNet50 model applied to the images. For the whole dataset of 118,287 images with p = 80 and q = 2056, we run our method on a machine with two twelve-core Xeon E5-2690 v3 2.6GHz CPUs and 128GB of memory. For the weight matrices, we set W_+ = p^{−1} I, with I the 80 × 80 identity matrix, and W_− is computed using Equation (2) based on the label-label similarity matrix obtained from GLOVE word vectors (Pennington, Socher, and Manning 2014). As suggested by Table 8, the proposed method yields an overall test error of 0.002 under loss (1) and 0.025 under the Hamming loss over the 80 categories, based on a test set of size 5000 supplied in the COCO data. The other competitors, however, require much more memory and are unable to run on this machine. By comparison, the proposed method scales to these data without a large memory requirement.
To compare with LSVM, NSVM, Pseudo-Ising, and the ResNet50 CNN classifier, we decrease the size of the original problem by randomly subsampling 6000 images in six categories: "person," "car," "motorcycle," "bus," "truck," and "train." The average numbers of positively labeled observations for the six categories are 3252, 621, 178, 200, 311, and 182. For the weight matrices, W_− becomes a 6 × 6 matrix, as defined in Table 7, while W_+ is the 6 × 6 identity matrix. To be more informative, we report separate test errors for each category, Err_k(f̂_k); k = 1, ..., 6, as well as the overall error Err(f̂), obtained by averaging over 100 random subsamples, where f̂ = (f̂_1, ..., f̂_6) is a classifier.
As suggested in Table 8, the proposed method outperforms the competitors substantially under loss L in Equation (3), with the smallest test error of 0.004 over six classes, compared with the corresponding test error of 0.002 for the full data over 80 classes; an increased training sample size does improve the accuracy of classification. Moreover, the amounts of improvement over LSVM, NSVM, and Pseudo-Ising are 825%, 150%, and 6775%. Interestingly, the proposed method continues to perform the best in the two largest classes, "person" and "car," and performs well for the other four small classes. However, the proposed method underperforms NSVM under the Hamming loss, although it outperforms the other methods for this reduced data. This is because the classification and evaluation losses differ.
[Table 8. Averaged test errors, with standard errors in parentheses, for the full COCO object detection image data and for 100 random subsamples involving six categories with average numbers of positively labeled observations 3252, 621, 178, 200, 311, and 182; methods compared include Pseudo-Ising (Cheng et al. 2014), a convolutional neural network method, and the proposed method. "NA" means that a routine cannot return a value.]
[Table 9. Test errors for Pseudo-Ising (Cheng et al. 2014), a convolutional neural network method, the proposed method using only complete data, and the proposed method integrating additional labels. The best performer is marked in bold.]
To study the impact of additional labels on classification, we take a random sample of 500 images with at least two objects within the six categories "person," "car," "motorcycle," "bus," "truck," and "train," and another sample of 100 images containing at most one object in these categories. The rest of the training samples are now treated as additional labels. As shown in Table 9, the proposed method leveraging paired data and additional labels significantly outperforms its counterpart without leveraging additional labels. Note that the improvement over the five competing methods without leveraging additional labels ranges from 280% to 575%. This suggests that additional labels indeed provide valuable information for prediction. The situation is analogous to semisupervised learning, in which a classifier leveraging unlabeled data usually improves over its supervised counterpart (Wang and Shen 2007). In a sense, the proposed method borrows external information from additional labels to enhance the performance of supervised learning, as in transfer learning (Pratt 1993).

PASCAL Visual Object Classes Challenge 2007
Next, we analyze a different image dataset from the PASCAL Visual Object Classes Challenge 2007 (Everingham et al. 2007), consisting of training and validation sets totaling 5011 images and a testing set of 4952 images. We apply all the methods to the combined training and validation sets for training and use the testing set for evaluation. The weight matrix for the proposed loss is generated as in (A) using the label-label similarity matrix computed from GloVe word vectors.
As suggested by Table 10, the proposed method continues to outperform its competitors under the proposed weighted loss. The amounts of improvement over LSVM, NSVM, Pseudo-Ising, and CNN are 200%, 16.67%, 5300%, and 66.67%. Moreover, the proposed method is the second best under both the Hamming loss and the F1-score: it is slightly worse than NSVM under the Hamming loss but better than NSVM under the F1-score, and slightly worse than Pseudo-Ising under the F1-score but better than Pseudo-Ising under the Hamming loss. Overall, the proposed method fares well across the three evaluation criteria.

Mediamill Benchmark
Finally, we analyze the Mediamill dataset, which was extracted from 85 hours of international news videos from the TRECVID 2005/2006 benchmark datasets (Snoek et al. 2006) and is publicly available at https://ivi.fnwi.uva.nl/isis/mediamill/challenge/data.php. This dataset includes 120 low-level visual and textual features, with labels drawn from a lexicon of 101 semantic concepts such as commercials, nature, and baseball. The training set comprises 30,993 samples, while the testing set has 12,914 samples. Again, the weight matrix for the proposed loss is generated as in (A) using the label-label similarity matrix computed from GloVe word vectors. There are no additional labels for this dataset.
For our analysis, we use the given training and testing sets for training and evaluation. Moreover, we examine all the aforementioned classifiers except CNN, because the generated features are not suited to a CNN.
As seen from Table 11, the proposed method again outperforms its competitor NSVM under the proposed weighted loss as well as the Hamming loss. The amounts of improvement over NSVM are about 49%, 4%, and 6% under the proposed weighted loss, the Hamming loss, and the F1-score, respectively. Note that LSVM and Pseudo-Ising did not complete within one week.
[Table 11. Test errors for NSVM, Pseudo-Ising (Cheng et al. 2014), a convolutional neural network method, and the proposed method. "NA" means that a method did not produce a result after running for one week.]