Set-Valued Support Vector Machine with Bounded Error Rates

Abstract This article concerns cautious classification models that are allowed to predict a set of class labels, or to abstain from making a prediction, when the uncertainty in the prediction is high. This set-valued classification approach is equivalent to the task of acceptance region learning, which aims to identify subsets of the input space, each of which is guaranteed to cover observations in a class with at least a predetermined probability. We propose to directly learn the acceptance regions through risk minimization, by making use of a truncated hinge loss and a constrained optimization framework. Collectively, our theoretical analyses show that these acceptance regions, with high probability, satisfy two properties simultaneously: (a) they cover each class with a noncoverage rate bounded from above; (b) they give the least ambiguous predictions among all acceptance regions satisfying (a). An efficient algorithm is developed and numerical studies are conducted using both simulated and real data. Supplementary materials for this article are available online.


Introduction
The advancement of statistics and machine learning is reshaping many fields. Increasingly, many critical decisions are made based on advanced statistical and machine learning methods, especially classification methods. It has therefore become important to make classification reliable and to avoid misclassification when it is known that the chance of misclassification is high. Standard classification methods often cannot meet this demand, partially because a standard classifier has the goal of minimizing the overall misclassification rate: it assigns a single class label to each observation regardless of the perceived high uncertainty for some observations. However, in practice, it is often the case that an accurate single-valued prediction is difficult or impossible to obtain for some observations, due to high uncertainty and lack of information. Moreover, in many applications, the consequence of misclassification for even one instance is too severe to bear for those who are affected. Examples include using classification methods to guide parole decisions, to evaluate school teachers (O'Neil 2016), and to diagnose cancers. In these high-stakes domains, it is safer and more appropriate for the classifier to return a set of the most plausible outcomes (e.g., class labels) for each observation and to leave the final decision to a human expert or a secondary model for validation. It is desirable that this prediction set contain the true class label with high probability. Moreover, one can expect that the classifier should not make predictions at all for observations about which it is highly unsure.
In this article, we propose a set-valued multicategory classification method based on the support vector machine approach. The size of the prediction set adapts to the confidence that the classifier has in each observation. When it has high confidence in an observation, a single class label may be given as the prediction; otherwise, multiple class labels will be reported. Rejection may be viewed as the extreme case in which all possible class labels are predicted for an observation. The standard classification method may be viewed as the special case in which the prediction set always contains exactly one label. Therefore, standard classification should ideally be used only when there is high confidence for all the observations; unfortunately, this is rarely the case in practice.
The main difference between standard and set-valued classification is that the latter can no longer be framed as an (unconstrained) minimization of the overall misclassification rate. Set-valued classification is best understood via the following tradeoff: the larger the prediction set, the more likely it contains the true class label, and yet the less information such a prediction carries. One way to precisely formulate this tradeoff is the acceptance region learning framework. Let the training data consist of independent and identically distributed pairs $(X_i, Y_i)$, $i = 1, 2, \dots, n$, from an unknown distribution $P$, with $X_i \in \mathcal{X} \subset \mathbb{R}^p$ and $Y_i \in \mathcal{Y} = \{1, \dots, k\}$. The goal of acceptance region learning is to identify acceptance regions $C_j \subset \mathcal{X}$, $j = 1, \dots, k$, one for each class, which satisfy certain coverage properties (see details in Section 2). Collectively, these acceptance regions are equivalent to a set-valued classifier $\phi : \mathcal{X} \to 2^{\mathcal{Y}}$, defined as $\phi(x) = \{j : x \in C_j\}$; namely, observation $x$ is predicted to be from the set of class labels consisting of all classes whose acceptance regions contain $x$. Conversely, given a set-valued classifier $\phi$, the equivalent acceptance regions are $C_j = \{x : j \in \phi(x)\}$, $j = 1, \dots, k$. Lei (2014) and Sadinle, Lei, and Wasserman (2017) defined acceptance regions using two competing quantities, confidence and efficiency. The notion of confidence is defined as the probability that the set $C_j$ ($j = 1, \dots, k$) covers a random observation from class $j$. The notion of efficiency is inversely related to the ambiguity, defined as the expected number of acceptance regions $C_j$ that contain a random observation (equivalently, the expected size of the prediction set for a random observation). As the confidence of the $C_j$'s increases, the efficiency decreases (i.e., the ambiguity increases). The Bayes-optimal acceptance regions minimize the ambiguity with the noncoverage rate for each $C_j$ constrained. It was shown (Sadinle, Lei, and Wasserman 2017) that the Bayes-optimal acceptance regions (or their equivalent set-valued classifier) are obtained through the conditional class probabilities $\eta_j(x) := P(Y = j \mid X = x)$. Sadinle, Lei, and Wasserman (2017) proposed to use the plug-in method to estimate this set-valued classifier, that is, to first estimate $\eta_j(x)$ using a consistent estimator, and then plug the estimate into the Bayes-optimal rule. The empirical performance of the resulting set-valued classifier depends highly on the estimation accuracy of $\eta(x)$. However, as pointed out by many authors (Wang, Shen, and Liu 2007; Fürnkranz and Hüllermeier 2010; Wu, Zhang, and Liu 2010), probability estimation can be more difficult than the prediction of the class label, especially for high-dimensional data. While the requirement on estimation accuracy is somewhat relaxed in the classification context, how accurate the probability estimation needs to be remains an open question.
In this article, we propose to estimate the acceptance regions and the equivalent set-valued classifier by minimizing an empirical risk based on the support vector machine (SVM; Scholkopf and Smola 2001), bypassing the step of estimating $\eta_j(x)$. This takes advantage of the strong prediction power of the SVM in both the linear and nonlinear cases. We show in theory the Fisher consistency; that is, the population minimizer of the proposed optimization is equivalent to the Bayes-optimal classifier. Moreover, in the finite-sample case, we show that the resulting classifier can control the noncoverage rates while minimizing the ambiguity.
A related problem is the Neyman-Pearson (NP) classification problem (Cannon et al. 2002; Rigollet and Tong 2011). Given a null hypothesis class, NP classification aims to identify an acceptance region for the null class which minimizes the probability that an observation from an alternative class falls into it (the Type II error), while controlling the chance that an observation from the null class is not covered by the region and is hence misclassified to the alternative (the Type I error). See Tong, Feng, and Zhao (2016) for a survey. The problem studied here can be regarded as solving $k$ NP classification problems jointly.
The problem of identifying acceptance regions and its connection with NP classification have attracted increasing attention from the statistics and machine learning communities. Dümbgen, Igl, and Munk (2008) framed it as a problem of computing general p-values for classification. Lei (2014) proposed a framework for the binary case; Sadinle, Lei, and Wasserman (2017) extended it to multicategory classification. Denis and Hebiri (2015) and Denis and Hebiri (2017) studied a dual problem, in which they minimized the overall noncoverage rates while controlling the ambiguity. Recently, Hechtlinger, Póczos, and Wasserman (2018) and Guan and Tibshirani (2019) generalized this problem to conduct outlier detection (that is, the abstention problem). In this article, we do not consider the abstention/outlier detection problem; in other words, we assume that there is no unseen class in the training data that might appear later in the test data.
Popular ways to achieve set-valued classification include classification with reject options and conformal learning. Unlike the constrained minimization framework considered in this article, classification-with-rejection methods often try to balance the ambiguity and the confidence using a weighted sum of the costs of misclassification and rejection, given a predetermined weight (in the binary case, an ambiguous prediction is the same as a rejection, while confidence is related to classification accuracy). The binary version of this problem has been extensively studied (Herbei and Wegkamp 2006; Bartlett and Wegkamp 2008; Yuan and Wegkamp 2010); Zhang, Wang, and Qiao (2017) studied the multicategory case. Conformal learning aims to find a set-valued prediction for each new observation with a guarantee on the probability that the prediction set contains its true class label (Shafer and Vovk 2008; Lei, Robins, and Wasserman 2013; Vovk et al. 2017; Lei et al. 2018). The approaches taken by Lei (2014) and Sadinle, Lei, and Wasserman (2017) may both be viewed as special cases of conformal learning.
The rest of this article is organized as follows. Section 2 gives an overview of the underlying problems. Our main algorithm is introduced in Section 3, followed by a study of the theoretical properties in Section 4. Section 5 presents numerical experiments. Concluding remarks are given in Section 6. Proofs are in the supplementary materials.

Background
Sadinle, Lei, and Wasserman (2017) extended the binary acceptance region learning problem (Lei 2014) to the multicategory case. Under this framework, one tries to balance the efficiency and the confidence of the acceptance regions. The efficiency can be measured by the ambiguity, defined as the expected cardinality of the set-valued prediction, $E(|\phi(X)|)$, where $|\cdot|$ denotes the cardinality of a set. Note that this is the same as $E\big(\sum_{j=1}^{k} \mathbf{1}\{X \in C_j\}\big)$, the expected number of acceptance regions that cover a random observation. The confidence refers to the requirement that each acceptance region $C_j$ must cover at least $(1-\alpha_j)100\%$ of the population in class $j$, that is, $P_j(C_j) \ge 1 - \alpha_j$, where $P_j(\cdot) := P(\cdot \mid Y = j)$ is the probability measure conditional on $Y = j$. Note that this constraint may be written as a class-specific classification accuracy guarantee for class $j$: $P_j(Y \in \phi(X)) \ge 1 - \alpha_j$. In summary, we minimize the ambiguity while maintaining the confidence by controlling the noncoverage rates (or the class-specific error rates),
$$\min_{\phi}\; E(|\phi(X)|), \quad \text{subject to} \quad 1 - P_j(C_j) \le \alpha_j, \; j \in \{1, \dots, k\}. \tag{1}$$
Here the $\alpha_j$'s are predetermined. For example, if one wants the set-valued classifier to correctly classify at least 95% of the population from class $j$, then she can set $\alpha_j = 0.05$.
Under certain continuity conditions and the assumption that P(Y = j) > 0 for all j's, Sadinle, Lei, and Wasserman (2017) gave the following Bayes acceptance regions as solutions to problem (1).
Definition 1 (Bayes acceptance regions). Given the $\alpha_j$'s, a solution to problem (1) is
$$C^*_j = \{x : \eta_j(x) \ge t_j\}, \quad j = 1, \dots, k,$$
where $t_j$ is chosen so that $P_j(\eta_j(X) \ge t_j) = 1 - \alpha_j$. Intuitively, each Bayes acceptance region contains all the observations for which the corresponding conditional class probability is large enough. In practice, Sadinle, Lei, and Wasserman (2017) suggested employing the plug-in principle: first obtain $\hat\eta_j$, an estimate of $\eta_j$, by methods such as penalized logistic regression or $k$-nearest neighbors; then estimate $t_j$ by the $n_j\alpha_j$th smallest value of $\{\hat\eta_j(x_{j,1}), \dots, \hat\eta_j(x_{j,n_j})\}$, where $x_{j,1}, \dots, x_{j,n_j}$ are the training data from class $j$. As a result, the estimated acceptance regions take the form $\hat C_j = \{x : \hat\eta_j(x) \ge \hat t_j\}$.
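To make the plug-in recipe above concrete, here is a minimal sketch in Python (our own illustration, not the implementation of Sadinle, Lei, and Wasserman (2017); it assumes a fitted probability estimator with a scikit-learn-style `predict_proba` method, and all function names are ours):

```python
import numpy as np

def plugin_set_classifier(prob_est, X_train, y_train, alphas):
    """Plug-in acceptance regions: threshold estimated class probabilities
    at class-specific empirical quantiles of the class-j training scores."""
    k = len(alphas)
    t_hat = np.empty(k)
    P_train = prob_est.predict_proba(X_train)   # n x k estimates of eta_j(x)
    for j in range(k):
        scores_j = P_train[y_train == j, j]     # eta_hat_j on class-j data
        # alpha_j-quantile, so roughly (1 - alpha_j) of class j is covered
        t_hat[j] = np.quantile(scores_j, alphas[j])

    def phi(X_new):
        """Set-valued prediction: labels (0..k-1) whose region contains x."""
        P_new = prob_est.predict_proba(X_new)
        return [set(np.where(P_new[i] >= t_hat)[0]) for i in range(len(X_new))]

    return phi
```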

Set-valued Multicategory Support Vector Machine
A fundamental challenge for the plug-in method is that in many contemporary data analyses, it is very difficult to estimate $\eta_j$ in the first place. In this work, we propose to solve (1) directly via a risk minimization procedure, avoiding the estimation of $\eta_j$. We introduce a general formulation in Section 3.1, and then focus on specifics in Sections 3.2 and 3.3.

Formulation
For a $k$-class problem, our set-valued classifier will be characterized by a vector-valued discriminant function $f : \mathcal{X} \to \mathbb{R}^{k-1}$ and a threshold $\varepsilon \in \mathbb{R}$. To obtain $f$, we adopt the angle-based classification method (Zhang and Liu 2013), which has been shown to be very effective and computationally efficient for large-scale multicategory classification in high-dimensional spaces. We first define $k$ unit vectors, $w_j \in \mathbb{R}^{k-1}$, $j = 1, \dots, k$, which form a regular simplex and sum to $\mathbf{0}$. Each vector represents a class, and they are equiangular from one another. One possible configuration of the $w_j$'s is
$$w_j = \begin{cases} (k-1)^{-1/2}\,\mathbf{1}, & j = 1,\\[4pt] -\dfrac{1+\sqrt{k}}{(k-1)^{3/2}}\,\mathbf{1} + \sqrt{\dfrac{k}{k-1}}\, e_{j-1}, & 2 \le j \le k, \end{cases} \tag{2}$$
where $\mathbf{1} \in \mathbb{R}^{k-1}$ is the vector of all ones and $e_j \in \mathbb{R}^{k-1}$ is the vector whose $j$th element is 1 and whose other elements are all 0. Figure 1 gives an illustration of this configuration for $k = 2$ and $k = 3$. The angle margin, defined as $\langle f(x), w_j\rangle$, measures the proximity of $f(x)$ to the vector $w_j$. A large angle margin indicates a small angle between the vectors $f(x)$ and $w_j$, and hence a close proximity between the observation $x$ and class $j$. We will conduct the optimization with respect to $f$ so that $\langle f(x), w_j\rangle$ is large for $j = y$ and small for $j \neq y$. Motivated by this intuition, we define the acceptance regions and the set-valued classifier to be
$$C_j = \{x : \langle f(x), w_j\rangle \ge -\varepsilon\}, \quad j = 1, \dots, k, \qquad \phi_{(f,\varepsilon)}(x) = \{j : \langle f(x), w_j\rangle \ge -\varepsilon\}. \tag{3}$$
Intuitively, the acceptance region for class $j$ consists of all those observations whose $f(x)$ is close enough to $w_j$. Re-expressing (1) in terms of $f$ and $\varepsilon$, we have
$$\min_{f,\,\varepsilon}\; E\Big[\sum_{j=1}^{k} \mathbf{1}\{\langle f(X), w_j\rangle \ge -\varepsilon\}\Big], \quad \text{subject to} \quad P_j\big(\langle f(X), w_j\rangle < -\varepsilon\big) \le \alpha_j, \; j = 1, \dots, k.$$
When the constraints attain equality at the minimizer, one can show that the minimizer coincides with the solution to the following modified optimization,
$$\min_{f,\,\varepsilon \ge 0}\; E\Big[\sum_{j=1}^{k} \mathbf{1}\{\langle f(X), w_j\rangle \ge -\varepsilon\}\Big], \quad \text{subject to} \quad P_j\big(\langle f(X), w_j\rangle < -\varepsilon\big) \le \alpha_j, \; j = 1, \dots, k. \tag{4}$$
Since the $w_j$'s sum to $\mathbf{0}$, we have $\sum_{j=1}^{k} \langle f(x), w_j\rangle = 0$ for any $x$. In this case, requiring $\varepsilon \ge 0$ implies that $\langle f(x), w_j\rangle \ge -\varepsilon$ for at least one $j$, and hence $|\phi(x)| \ge 1$. There has been previous work (Hechtlinger, Póczos, and Wasserman 2018; Guan and Tibshirani 2019) in which $\phi(x) = \emptyset$ may occur for some $x$, implying that the class label for $x$ has never been seen before. We do not consider this setting in this article: specifically, we assume that $\{1, 2, \dots, k\}$ are the only possible classes in the test data.
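As a quick sanity check on this construction, the following sketch (our own illustration, based on the configuration of the $w_j$'s displayed above) builds the simplex vertices and verifies that they have unit norm, sum to zero, and are equiangular with pairwise inner products $-1/(k-1)$:

```python
import numpy as np

def simplex_vertices(k):
    """k class-representative vectors in R^{k-1}: unit norm,
    sum to zero, pairwise inner product -1/(k-1)."""
    W = np.zeros((k, k - 1))
    W[0] = (k - 1) ** (-0.5) * np.ones(k - 1)
    for j in range(1, k):
        W[j] = -(1 + np.sqrt(k)) / (k - 1) ** 1.5 * np.ones(k - 1)
        W[j, j - 1] += np.sqrt(k / (k - 1))
    return W

W = simplex_vertices(4)
print(np.allclose(np.linalg.norm(W, axis=1), 1))   # unit norm
print(np.allclose(W.sum(axis=0), 0))               # sum to zero
print(np.round(W @ W.T, 3))                        # off-diagonals -1/(k-1)
```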
In practice, the indicator functions in both the objective and the constraints of (4) may cause difficulties for numerical optimization (Hoffgen, Simon, and Vanhorn 1995). A common practice is to replace the indicator function in the objective by a convex surrogate loss. Moreover, a stream of work on NP classification (Rigollet and Tong 2011) also suggests using a surrogate loss to bound noncoverage rates such as those in the constraints of (4). In general, it can be any decreasing surrogate loss used in the literature. Let $\ell_1$ and $\ell_2$ be the surrogate losses deployed in the objective and in the constraints, respectively. Our proposed set-valued classifier can be obtained by the following optimization,
$$\min_{f,\,\varepsilon \ge 0}\; E\Big[\sum_{j \ne Y} \ell_1\big(-\langle f(X), w_j\rangle - \varepsilon\big)\Big], \quad \text{subject to} \quad E\big[\ell_2\big(\langle f(X), w_j\rangle + \varepsilon\big) \,\big|\, Y = j\big] \le \alpha_j, \; j = 1, \dots, k. \tag{5}$$
Conceptually, the value $\langle f(x), w_j\rangle + \varepsilon$, which enters the argument of $\ell_1$ in the objective with a negative sign, measures the closeness between the observation $x$ and the $j$th acceptance region (the larger, the closer). Minimization of the objective leads to small values of $\langle f(x), w_j\rangle + \varepsilon$ for $j \ne y$ and a large value for $j = y$. When $\ell_1$ is the hinge loss, the new objective function resembles the loss function in multicategory SVM (Lee, Lin, and Wahba 2004), except for the important absence of the sum-to-zero constraint from our work, thanks to the use of the angle-based framework. In practice, given training data $\{(x_i, y_i),\, i = 1, \dots, n\}$, one solves the empirical version of (5),
$$\min_{f,\,\varepsilon \ge 0}\; \frac{1}{n}\sum_{i=1}^{n} \sum_{j \ne y_i} \ell_1\big(-\langle f(x_i), w_j\rangle - \varepsilon\big), \quad \text{subject to} \quad \frac{1}{n_j}\sum_{i:\, y_i = j} \ell_2\big(\langle f(x_i), w_j\rangle + \varepsilon\big) \le \alpha_j, \; j = 1, \dots, k, \quad J(f) \le s, \tag{6}$$
where $n_j$ is the subsample size for class $j$ and $J(f) \le s$ is a regularization constraint added to make the solutions identifiable.

Choices of Surrogate Loss in Objective and Constraints
The choice of the surrogate losses is an important issue. Ideally, the surrogate loss $\ell_1$ should enjoy the Fisher consistency property; on the other hand, an appropriately chosen $\ell_2$ should guarantee that each acceptance region covers its class with at least the promised rate.
We propose to use a truncated hinge loss for $\ell_1$ to achieve Fisher consistency. Define the hinge loss as $H(u) = (1-u)_+$ and the truncated hinge loss as $T(u) = (1-u)_+ - (-u)_+$, where $(a)_+ = \max\{a, 0\}$. The latter loss truncates the conventional hinge loss to have a height not exceeding 1. The blue solid line in the left panel of Figure 2 gives an illustration of this truncated loss, which can be regarded as the difference of two hinge-type loss functions (the dashed and dotted lines). Theorem 1 shows that with the truncated hinge loss, our proposed method is Fisher consistent.
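For concreteness, a small numerical sketch (ours) of the two losses follows; it verifies that the truncated hinge equals the hinge capped at height 1, that is, $T(u) = \min\{1, H(u)\}$:

```python
import numpy as np

H = lambda u: np.maximum(1 - u, 0)             # hinge loss (1 - u)_+
T = lambda u: H(u) - np.maximum(-u, 0)         # truncated hinge: (1-u)_+ - (-u)_+

u = np.linspace(-3, 3, 601)
print(np.allclose(T(u), np.minimum(1, H(u))))  # True: height capped at 1
```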
With a truncated loss, the resulting optimization is not convex, due to the nonconvexity of $T$. However, one can use the difference of convex functions (DC) algorithm (Le Thi Hoai and Tao 1997; Wu and Liu 2007). A brief description of this algorithm is given below.

Algorithm 1 (DC algorithm). To minimize an objective of the form $Q(\Theta) = Q_{\mathrm{vex}}(\Theta) + Q_{\mathrm{cav}}(\Theta)$, where $Q_{\mathrm{vex}}$ is convex and $Q_{\mathrm{cav}}$ is concave, initialize $\Theta^{(0)}$ and iterate
$$\Theta^{(t+1)} = \operatorname*{argmin}_{\Theta}\; Q_{\mathrm{vex}}(\Theta) + \big\langle \partial Q_{\mathrm{cav}}\big(\Theta^{(t)}\big),\, \Theta \big\rangle$$
until convergence. Here $\Theta$ stands for the parameters in $f$ and $\varepsilon$. This algorithm is an example of the Majorize-Minimization (MM) algorithm, as we replace $Q_{\mathrm{cav}}$ by its affine approximation in each iteration (Hunter and Lange 2004). The DC algorithm was used by Wu and Liu (2007) to build a Fisher consistent robust multicategory SVM.
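The following toy sketch (our own illustration, not the authors' implementation) shows this loop on a small binary problem, using the decomposition of the truncated hinge into a convex hinge part plus a concave part; a small ridge term is added to keep each convex subproblem well posed:

```python
import numpy as np
from scipy.optimize import minimize

# Toy DC loop: minimize sum_i T(m_i) + ridge, where m_i = y_i <x_i, theta>,
# T = H + C with H(u) = (1 - u)_+ convex and C(u) = -(-u)_+ concave.
rng = np.random.default_rng(0)
x, y = rng.normal(size=(50, 2)), rng.choice([-1.0, 1.0], size=50)

def Q_vex(theta):                  # convex part: hinge sum + small ridge
    m = y * (x @ theta)
    return np.maximum(1 - m, 0).sum() + 0.5 * theta @ theta

def grad_cav(theta):               # subgradient of C(theta) = -sum_i (-m_i)_+
    m = y * (x @ theta)
    return ((m < 0)[:, None] * (y[:, None] * x)).sum(axis=0)

theta = np.zeros(2)
for _ in range(20):                # replace the concave part by its tangent
    g = grad_cav(theta)
    res = minimize(lambda th: Q_vex(th) + g @ th, theta, method="Nelder-Mead")
    if np.allclose(res.x, theta, atol=1e-6):
        break
    theta = res.x
```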
Next we discuss the loss function in the constraints. We aim to bound the empirical noncoverage rate $(1/n_j)\sum_{i:\,y_i=j} \mathbf{1}\{\langle f(x_i), w_j\rangle < -\varepsilon\}$ through bounding the empirical risk under the surrogate loss, $(1/n_j)\sum_{i:\,y_i=j} \ell_2(\langle f(x_i), w_j\rangle + \varepsilon)$. The hinge loss may not be ideal for this purpose, because it may have a much greater value than the indicator $\mathbf{1}\{u < 0\}$, deteriorating the performance. For example, an observation with a very small functional margin $\langle f(x_i), w_j\rangle \ll 0$ will incur a large hinge loss and make the left-hand side of the inequality $(1/n_j)\sum_{i:\,y_i=j} \ell_2(\langle f(x_i), w_j\rangle + \varepsilon) \le \alpha_j$ very close to, or even exceed, the right-hand side, even though it is associated with only one instance of noncoverage. In general, using the hinge loss to bound the noncoverage in the constraints will lead to an overly conservative solution (a set-valued classifier that is too ambiguous). A potentially useful alternative is the truncated hinge loss, $\min\{1, H(\langle f(x_i), w_{y_i}\rangle + \varepsilon)\}$. However, the use of the (nonconvex) truncated hinge loss adds another layer of computational challenge. To mimic the truncated hinge loss, we propose to combine the hinge loss with an adaptive weight in an iterative algorithm. Observations are assigned weights, chosen to be $w_i = [\max\{1, H(\langle f(x_i), w_{y_i}\rangle + \varepsilon)\}]^{-1}$ based on the solution $(f, \varepsilon)$ from the previous iteration, which, when multiplied by the hinge loss, resembles the truncated hinge loss. See the right panel of Figure 2 for an illustration: the blue bold line stands for the weighted hinge loss, which is the result of multiplying the weight (red dotted) by the hinge loss (purple dashed); the weighted hinge loss is close to the indicator function (black two-dashed).
Our proposed set-valued SVM (SSVM) is $\hat\phi(x) \equiv \phi_{(\hat f, \hat\varepsilon)}(x) = \{j : \langle \hat f(x), w_j\rangle \ge -\hat\varepsilon\}$, where $(\hat f, \hat\varepsilon)$ is the final solution of an iterative algorithm. In each iteration, given the weights, we solve
$$\min_{f,\,\varepsilon \ge 0}\; \frac{1}{n}\sum_{i=1}^{n} \sum_{j \ne y_i} T\big(-\langle f(x_i), w_j\rangle - \varepsilon\big), \quad \text{subject to} \quad \frac{1}{n_j}\sum_{i:\, y_i = j} w_i\, H\big(\langle f(x_i), w_j\rangle + \varepsilon\big) \le \alpha_j, \; j = 1, \dots, k, \quad J(f) \le s. \tag{7}$$
In the initial step, $w_i \equiv 1$ for all $i$. In the subsequent steps, we define $w_i = [\max\{1, H(\langle f(x_i), w_{y_i}\rangle + \varepsilon)\}]^{-1}$, given $(f, \varepsilon)$ from the previous step. The algorithm stops when the solution converges or when the number of iterations reaches a preset maximum. Although there is no theoretical guarantee of convergence for this iterative algorithm, in our numerical studies it often converges after two or three iterations. Wu and Liu (2013) used a similar iterative idea in their adaptively weighted large margin classifiers for the purpose of robust classification.
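In code, the outer loop can be summarized as follows (our own sketch; `solve_weighted_ssvm` is a hypothetical stand-in for a solver of the constrained problem (7), returning the fitted parameters together with the own-class angle margins $\langle f(x_i), w_{y_i}\rangle$):

```python
import numpy as np

def adaptive_weight_loop(solve_weighted_ssvm, X, y, alphas,
                         max_iter=10, tol=1e-6):
    """Iteratively reweighted SSVM: weights mimic the truncated hinge."""
    n = len(y)
    w = np.ones(n)                                   # initial step: w_i = 1
    f_par, eps = None, None
    for _ in range(max_iter):
        f_new, eps_new, margins = solve_weighted_ssvm(X, y, w, alphas)
        # margins[i] = <f(x_i), w_{y_i}> under the current solution
        H = np.maximum(1 - (margins + eps_new), 0)   # hinge at the own class
        w_new = 1.0 / np.maximum(1.0, H)             # adaptive weights
        if f_par is not None and np.allclose(w_new, w, atol=tol):
            f_par, eps = f_new, eps_new
            break
        f_par, eps, w = f_new, eps_new, w_new
    return f_par, eps
```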

Implementation Algorithms
In this section we discuss the implementation algorithm, to be used in the numerical experiments in Section 5. For computational convenience, we move the constraint $J(f) \le s$ to the objective as an additional regularization term, and obtain
$$\min_{f,\,\varepsilon \ge 0}\; \frac{1}{n}\sum_{i=1}^{n}\sum_{j \ne y_i} T\big(-\langle f(x_i), w_j\rangle - \varepsilon\big) + \lambda J(f), \quad \text{subject to} \quad \frac{1}{n_j}\sum_{i:\,y_i=j} w_i\, H\big(\langle f(x_i), w_j\rangle + \varepsilon\big) \le \alpha_j,\; j = 1, \dots, k. \tag{8}$$
We will consider both linear and kernel learning. In linear learning, let $f(x) = B^T x + b$, where $B \in \mathbb{R}^{p \times (k-1)}$, $b \in \mathbb{R}^{k-1}$, and $\beta_q$ is the $q$th column of $B$. Following the standard routine in the SVM literature, we introduce slack variables $\xi_{i,j}$ for the hinge-like functions in the objective function, and slack variables $\eta_{i,j}$ for the hinge-like functions in the constraints. The entire algorithm entails two loops. In the outer loop, we update the weights for the constraints; in the inner loop, we use the DC algorithm given fixed values of the weights. At each iteration of the DC algorithm, we aim to solve the quadratic programming problem (9), in which $C = (2n\lambda)^{-1}$, and $c_{r,q}$, $c_q$, and $c$ are subgradients of the concave part of the objective with respect to $\beta_{r,q}$, $b_q$, and $\varepsilon$, evaluated at the parameter values from the previous iteration. Problem (9) is equivalent to a dual problem (10), which can be solved by many standard off-the-shelf quadratic programming routines. After obtaining the solution to $Z$, denoted as $\hat Z$, we have $\hat B = X^T(\hat Z \bullet Y)W^T + C$. Then we can plug $\hat B$ back into (9), which becomes a linear programming problem for $b$ and $\varepsilon$, solvable by standard routines.
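To illustrate one such inner subproblem, the sketch below (ours; it is not the authors' Cplex/lpsolve implementation, and it keeps the convex hinge loss in the objective in place of the truncated hinge, so it corresponds to the first DC iteration) solves a linear-learning version of (8) with the cvxpy modeling package:

```python
import cvxpy as cp
import numpy as np

def linear_ssvm_step(X, y, W, alphas, weights, lam=1.0):
    """One convex subproblem: hinge objective + weighted hinge constraints.
    X: n x p data, y: labels in {0,...,k-1}, W: k x (k-1) simplex vertices."""
    n, p = X.shape
    k = W.shape[0]
    B = cp.Variable((p, k - 1))
    b = cp.Variable(k - 1)
    eps = cp.Variable(nonneg=True)
    F = X @ B + np.ones((n, 1)) @ cp.reshape(b, (1, k - 1))  # rows f(x_i)
    M = F @ W.T                                              # n x k angle margins
    obj = 0
    cons = []
    for j in range(k):
        out = np.where(y != j)[0]
        # hinge surrogate for including x_i in the wrong class-j region
        obj += cp.sum(cp.pos(1 + M[out, j] + eps))
        own = np.where(y == j)[0]
        # weighted hinge bound on the class-j noncoverage rate
        cons.append(cp.sum(cp.multiply(weights[own],
                    cp.pos(1 - (M[own, j] + eps)))) <= alphas[j] * len(own))
    obj = obj / n + lam * cp.sum_squares(B)                  # ridge-type J(f)
    cp.Problem(cp.Minimize(obj), cons).solve()
    return B.value, b.value, eps.value
```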
The kernel trick is often used in SVMs like (7) to allow nonlinear classifiers. In the proposed method, $f$ is a vector of nonlinear functions $(f_q)_{q=1}^{k-1}$, where the $f_q$'s belong to the same reproducing kernel Hilbert space (RKHS) with respect to a positive definite kernel function $K(\cdot, \cdot)$. By the representer theorem (Kimeldorf and Wahba 1970), we can focus on functions of the form $f_q(x) = \sum_{r=1}^{n} \beta_{r,q} K(x_r, x) + b_q$, and the coefficient matrix $B$ now becomes $n \times (k-1)$. Then the dual problem at each iteration of the DC algorithm becomes (11). After the solution to (11) is found, the coefficients are recovered in terms of $K = [K(x_r, x_s)]$, an $n \times n$ matrix whose entries are the kernel function $K$ evaluated on pairs of observations from the dataset.
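Once the kernel coefficients are available, evaluating the set-valued prediction is straightforward. A minimal sketch (ours; it assumes a fitted $n \times (k-1)$ coefficient matrix `beta`, intercept `b`, and threshold `eps`, and uses the Gaussian kernel from Section 5) follows:

```python
import numpy as np

def rbf_kernel(A, B, rho=1.0):
    """Gaussian kernel exp(-||a - b||^2 / rho^2), as used in Section 5."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / rho ** 2)

def predict_sets(X_train, X_new, beta, b, eps, W, rho=1.0):
    """Set-valued prediction phi(x) = {j : <f(x), w_j> >= -eps}."""
    Knew = rbf_kernel(X_new, X_train, rho)   # m x n kernel evaluations
    F = Knew @ beta + b                      # m x (k-1): f_q(x) values
    margins = F @ W.T                        # m x k angle margins
    return [set(np.where(margins[i] >= -eps)[0]) for i in range(len(X_new))]
```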

Relation with Classification using Reject and Refine Options
Classification with reject and refine options (CRR) (Bartlett and Wegkamp 2008; Manwani et al. 2015; Zhang, Wang, and Qiao 2017) also allows set-valued classification. Bartlett and Wegkamp (2008) proposed a Fisher consistent surrogate loss in the binary case. Zhang, Wang, and Qiao (2017) extended CRR to the multicategory case and introduced the refine option, which allows a set-valued prediction whose size is between 1 and $k$.
Typically, CRR classifiers aim to balance the cost of misclassification and the cost of rejection. Some CRR work uses a weighted combination of both costs in the objective function; other work minimizes the misclassification rate subject to a budget of rejections. There is an underlying connection between the level of rejection allowed in CRR and the confidence achieved by a set-valued classifier. Although CRR may lead to set-valued predictions, the notion of confidence is not explicitly accounted for in the algorithm. The main motivation of the current work is the precise quantification of the confidence (or class-specific accuracy) of the set-valued classifier. To this end, one may view CRR and set-valued classification as dual problems to each other.

Theoretical Studies
In this section, we first study Fisher consistency in the set-valued classification setting. Then we bound the excess ambiguity by the excess surrogate ambiguity, in parallel to the excess risk bounds in Bartlett, Jordan, and McAuliffe (2006). Lastly, we study finite-sample bounds for the noncoverage rates and the excess ambiguity.

Fisher Consistency and Excess Risk Bound
We follow the same assumptions as in Sadinle, Lei, and Wasserman (2017). Assume the underlying distribution $P(X, Y)$ is absolutely continuous with respect to $\nu_X \times \nu_Y$, where $\nu_X$ is the Lebesgue measure on $\mathbb{R}^p$ and $\nu_Y$ is the counting measure on $\{1, \dots, k\}$. Moreover, assume $p_j$, the density function of the distribution of $X$ conditional on $Y = j$, is positive on $\mathcal{X}$. Let $\pi_j = P(Y = j)$ be the prior probability of class $j$ and assume $\pi_j > 0$. In addition, we assume $\eta_j(X)$ is a continuous random variable with $P(\eta_i(X) = \eta_j(X)) = 0$ for all pairs $i \ne j$.
Our first main result is the Fisher consistency of the surrogate loss function: the population minimizer of the surrogate loss, subject to the constraints $P_j(\langle f(X), w_j\rangle < -\varepsilon) \le \alpha_j$, $j = 1, \dots, k$, coincides with the Bayes solution given in Sadinle, Lei, and Wasserman (2017). The corresponding optimization is
$$\min_{f}\; E\Big[\sum_{j \ne Y} T\big(-\langle f(X), w_j\rangle - \varepsilon\big)\Big], \quad \text{subject to} \quad P_j\big(\langle f(X), w_j\rangle < -\varepsilon\big) \le \alpha_j, \; j = 1, \dots, k. \tag{12}$$
One subtlety here is that the true Bayes solution may involve a null set; that is, the union of all acceptance regions in the Bayes solution, $\cup_j C^*_j$, may not cover the whole feature space $\mathcal{X}$, or equivalently, $\phi^*(x)$ may be empty for some observations $x$. This may happen for relatively easy classification tasks in which data points from different classes are far away from each other; it may also happen when the noncoverage rates $\alpha_j$ are chosen to be large, so that the acceptance regions are relatively small. Note that in these cases, set-valued classification methods become less relevant, since a traditional classification method can meet the needs and perform just as well. Hence, to show Fisher consistency of the proposed method in settings relevant to set-valued classification, we consider the following assumption.
Under this assumption, the Bayes acceptance regions in Sadinle, Lei, and Wasserman (2017), as given in Definition 1, satisfy $\cup_{j=1}^{k} C^*_j = \mathcal{X}$.
Theorem 1. Under Assumption 1, for a fixed $\varepsilon \ge 0$, let $\mathcal{F}^*$ be the class of functions that solve (12). Then any $f^* \in \mathcal{F}^*$ satisfies, almost surely,
$$\{x : \langle f^*(x), w_j\rangle \ge -\varepsilon\} = \{x : \eta_j(x) \ge t_j\}, \quad j = 1, \dots, k,$$
where $t_j$ satisfies $P_j(\eta_j(X) \ge t_j) = 1 - \alpha_j$. Hence, $\phi_{(f^*, \varepsilon)}$ is equivalent to the Bayes acceptance regions in Definition 1; that is, the truncated hinge loss is Fisher consistent.
The next theorem bounds the excess risk (defined through the classification ambiguity) by the excess surrogate risk under the truncated hinge loss. The same bounding framework was proposed by Bartlett, Jordan, and McAuliffe (2006) and used by Wang and Qiao (2018).

Finite-Sample Properties for Error Rates and the Ambiguity
In this section, we discuss two properties of the proposed set-valued classifier (6) based on a finite sample. Our discussion focuses on kernel learning given a set of nonstochastic weights in the constraints. To simplify the theoretical analysis, we consider the case of equal weights, and assume that the sample size for each class is nonstochastic (i.e., fixed). In particular, instead of sampling $n$ points directly from $P(X, Y)$, we choose the sample size for each class and then sample from each subpopulation. The theorems in this section can be extended to unequal weights or stochastic weights. Let $\mathcal{H}_K(s) = \{f : \|f\|_{\mathcal{H}_K} \le s\}$ denote the ball of radius $s$ in the reproducing kernel Hilbert space (RKHS) of $(k-1)$-dimensional vector-valued functions.
Here $K$ is a positive definite kernel function which induces $\mathcal{H}_K$, and we assume that $\sup_x K(x, x) \le r$.

Theorem 3. Given the training data $\{X_1, X_2, \dots, X_{n_j} \mid Y = j\}$ from the $j$th class and a fixed $\varepsilon \ge 0$, with probability at least $1 - k\zeta$, every function $f \in \mathcal{H}_K(s)$ uniformly satisfies, for any $j$,
$$E\big[\ell\big(\langle f(X), w_j\rangle + \varepsilon\big) \,\big|\, Y = j\big] \;\le\; \frac{1}{n_j}\sum_{i=1}^{n_j} \ell\big(\langle f(X_i), w_j\rangle + \varepsilon\big) + 3T_{n_j}(\zeta) + Z(n_j),$$
where the probability is with respect to the distribution of the training data, the expectation on the left-hand side is with respect to a test observation $(X, Y)$, $\ell(\cdot)$ can be either the hinge loss $H(\cdot)$ or the truncated hinge loss $T(\cdot)$, and $T_{n_j}(\zeta)$ and $Z(n_j)$ are remainder terms of order $n_j^{-1/2}$.
Note that the left-hand side of the inequality in Theorem 3 satisfies $E[\ell(\langle f(X), w_j\rangle + \varepsilon) \mid Y = j] \ge P_j(\langle f(X), w_j\rangle < -\varepsilon)$, due to the definition of $\ell(\cdot)$. Together with this observation, Theorem 3 suggests a way to control the noncoverage rate for each class at a desired level, say $\alpha_j$. To this end, one should identify a data-dependent function $\hat f$, by solving (7) and searching for the tuning parameters properly, so that
$$\frac{1}{n_j}\sum_{i=1}^{n_j} \ell\big(\langle \hat f(X_i), w_j\rangle + \varepsilon\big) \le \alpha_j - 3T_{n_j}(\zeta) - Z(n_j).$$
This amounts to setting the right-hand side of the constraint in (7) to be slightly smaller than the desired level $\alpha_j$; after replacing $f$ in the inequality in Theorem 3 by $\hat f$, we can see that the left-hand side of the inequality, and hence $P_j(\langle \hat f(X), w_j\rangle < -\varepsilon \mid \mathcal{D})$, is bounded by $\alpha_j$. Note that the remainder terms $3T_{n_j}(\zeta) + Z(n_j)$ converge to 0 at the rate $n_j^{-1/2}$.
By setting the arbitrary $f$ to be the data-dependent $\hat f$, Theorem 3 implies the multiple-use validity in the sense of Dümbgen, Igl, and Munk (2008). Specifically, let $\hat C_j$ be the acceptance region for class $j$ induced by $\hat f$; we have that, with probability at least $1 - k\zeta$, $P(X \in \hat C_j \mid Y = j, \mathcal{D}) \equiv P_j(\langle \hat f(X), w_j\rangle \ge -\varepsilon \mid \mathcal{D}) \ge 1 - \alpha_j$ for each $j$. Hence, by letting $\zeta \to 0$ as $n_j \to \infty$, we can obtain the multiple-use validity of Dümbgen, Igl, and Munk (2008), in principle. Note that in practice one may implement a different calibration method to select $\alpha_j$; in our numerical studies, we use split-conformal calibration for all methods to achieve fair comparisons.
We note that this convergence rate does not depend on the dimension of the data, although it does depend on the number of classes. In contrast, the estimation error of probability estimation could quickly diverge as the dimension increases, which may undermine the performance of plug-in based methods in high-dimensional settings.
The next theorem quantifies the excess $T$-ambiguity based on a finite sample. We define $\mathcal{F}_\varepsilon(\alpha, \kappa, s)$, the class of functions in $\mathcal{H}_K(s)$ whose noncoverage rates are bounded by $\alpha_j$ less a small term $\kappa/\sqrt{n_j}$ for each class $j$, together with its empirical counterpart $\hat{\mathcal{F}}_\varepsilon(\alpha, \kappa, s)$, in which the population noncoverage rates are replaced by their empirical versions.

Theorem 4. For a fixed $\varepsilon$, let $\hat f$ be a solution of the optimization problem (14). Problem (14) is almost equivalent to (7), except that $\varepsilon$ is fixed and the nominal noncoverage rate is set to be $\alpha_j$ less a small quantity $\kappa/\sqrt{n_j}$. Part (i) of Theorem 4 has a similar implication to Theorem 3: if one imposes a more stringent constraint (that is, $\hat f \in \hat{\mathcal{F}}_\varepsilon(\alpha, \kappa, s)$, where the gap term $\kappa/\sqrt{n_j}$ vanishes as the sample size increases), then it is possible to make $E(H(\langle \hat f(X), w_j\rangle + \varepsilon) \mid Y = j)$ bounded by the desired rate $\alpha_j$. Part (ii) further shows that the $T$-ambiguity of our proposed method based on a finite sample converges to the $T$-ambiguity of the theoretically optimal classifier that minimizes the $T$-ambiguity subject to the true noncoverage rates being bounded. The difference between the two (that is, the excess $T$-ambiguity) is at most $2k\kappa n^{-1/2}$, which vanishes as the sample size grows and does not depend on the dimension. Although both Theorems 3 and 4 are stated for a fixed $\varepsilon$, which is usually unknown in real applications, the convergence rate does not depend on the value of $\varepsilon$. Hence, given a dataset, the proposed method can achieve at least the convergence rate shown in the theorems.
Remark 1. The convergence rate of the excess $T$-ambiguity for our proposed method is $O(n^{-1/2})$, whereas the plug-in method has a convergence rate of $O(\delta_n^{\gamma} + \log(n)\, n^{-1/2})$ for the excess ambiguity (Sadinle, Lei, and Wasserman 2017), where $\delta_n$ denotes the estimation error of the $\eta_j$ functions and $\gamma$ is the margin exponent in the low-noise margin condition of the underlying distribution of $\eta(X)$. While this is not an apples-to-apples comparison, our proposed method has a faster convergence rate and does not require the estimation of $\eta_j$. From the methodological perspective, our method does not require data splitting for the purpose of calibration, as is required by the plug-in method. In practice, we recommend using the proposed method when the dimension is high and the sample size is limited; this is the scenario in which the plug-in method may have difficulty estimating $\eta_j$ accurately.

Numerical Studies
In this section, we compare our confidence-based set-valued multicategory support vector machine (SSVM) method with various methods using the plug-in principle (Sadinle, Lei, and Wasserman 2017) on both simulated and real data. The baseline models include $L_2$-penalized logistic regression (Le Cessie and Van Houwelingen 1992; Zhang and Liu 2013), kernel logistic regression (Zhu and Hastie 2005), kNN (Altman 1992), random forest (Liaw and Wiener 2002), and MSVM (multicategory SVM) (Cortes and Vapnik 1995; Platt 1999; Lee, Lin, and Wahba 2004). MSVM does not directly provide an estimate of the probability, but provides a list of scores that preserve the order of the estimated probabilities. For the proposed SSVM model, we use the implementation that solves the optimization problem (8).
In the study, we use the solvers Cplex and lpsolve for the quadratic and linear programming problems arising in SSVM. For the other methods, we use the existing R packages glmnet, gelnet, class, randomForest, and e1071, and the solver provided in Lee, Lin, and Wahba (2004).

Simulations
We study the empirical performance of the proposed method on a variety of simulated data with different sample sizes. In each case, an independent tuning set with the same sample size as the training set is generated for parameter tuning. The test set has 10,000 observations for each class. We run each simulation 100 times and report the mean and standard error. Nominal noncoverage rates are set to 0.05.
We select the tuning parameter $C = (2n\lambda)^{-1}$ and the hyperparameters in kernel learning for the proposed SSVM method as follows. We search for the optimal $\rho$ in the Gaussian kernel $\exp(-\|x - y\|^2/\rho^2)$ over the grid $10^{\{-0.5, -0.25, 0, 0.25, 0.5\}}$, and the optimal degree for the polynomial kernel from $\{2, 3, 4\}$. For each fixed candidate hyperparameter, we choose $C$ from a grid of candidate values ranging from $10^{-4}$ to $10^{2}$ by the following two-step search scheme. We first do a rough search with a larger stride, $\{10^{-4}, 10^{-3.5}, \dots, 10^{2}\}$, and obtain the best parameter $C_1$. In the next step, we do a fine search over $C_1 \times \{10^{-0.5}, 10^{-0.4}, \dots, 10^{0.5}\}$. After that, we choose the optimal pair that gives the smallest tuning ambiguity among those whose tuning-set noncoverage rates are smaller than or equal to the nominal rate $\alpha_j$.
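A minimal sketch of this coarse-to-fine search (ours; `fit_and_score` is a hypothetical routine that fits the model for a given $C$ and returns the tuning-set ambiguity and noncoverage rates) is given below:

```python
import numpy as np

def two_step_search(fit_and_score, alphas):
    """Coarse-to-fine grid search for C, keeping only candidates whose
    tuning-set noncoverage rates are at most the nominal alphas."""
    def best_over(grid):
        feasible = []
        for C in grid:
            amb, noncov = fit_and_score(C)     # tuning ambiguity, rates
            if np.all(noncov <= alphas):
                feasible.append((amb, C))
        return min(feasible)[1] if feasible else grid[len(grid) // 2]

    C1 = best_over(10.0 ** np.arange(-4, 2.5, 0.5))            # coarse grid
    return best_over(C1 * 10.0 ** np.arange(-0.5, 0.6, 0.1))   # fine grid
```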
For the plug-in methods, we employ both one-versus-rest classification and multicategory classification to estimate the posterior probabilities $\eta_j$, as done in Sadinle, Lei, and Wasserman (2017). In one-versus-rest classification, we train $k$ separate classifiers, each classifying between class $j$ and all the other classes. All $k$ classifiers share the same tuning parameter. All the plug-in methods are tuned in the same way as SSVM, that is, by choosing the tuning parameter(s) that minimize the ambiguity among those satisfying the nominal noncoverage rates. For logistic regression and SVM, we use the same grids as SSVM when searching their tuning parameters. For random forest, we choose the best number of trees from {50, 100, 150, …, 300} and the subsampling rate for the number of variables from {0.05, 0.1, 0.2, …, 0.8}. For kNN, we choose the best $k$ from {6, 8, …, 40}.
To robustly control the error, we make use of the split-conformal inference approach (the so-called robust implementation) suggested in Lei (2014) for all the methods. We split the data into training and tuning sets. Using the training data, we first obtain an estimate of $\eta_j$ (by methods such as logistic regression, kNN, and random forest) or a monotone proxy of it so that the order is preserved (such as the scores in MSVM, or $\langle \hat f, w_j\rangle$, the $j$th angle margin in SSVM). For each class $j$, we choose the threshold $\hat t_j$ to be the $(\alpha_j \times 100)$th sample percentile of $\hat\eta_j(x)$ among the tuning data in class $j$, so that the noncoverage rates on the tuning set match the nominal rates. The estimated acceptance regions are defined as $\hat C_j = \{x : \hat\eta_j(x) \ge \hat t_j\}$ and, equivalently, the set-valued predictions as $\hat\phi(x) = \{j : \hat\eta_j(x) \ge \hat t_j\}$. Ideally, the plug-in procedure requires two extra datasets besides the training data: one to select the thresholds $\hat t_j$ and the other for hyperparameter tuning. However, to achieve a fair comparison with the proposed method, we use the tuning set for both purposes. This approach was introduced in Lei (2014) and works well in practice.
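A short sketch of this calibration step (ours; `scores_tune` plays the role of $\hat\eta_j$, or of any monotone proxy of it, evaluated on the tuning set) is shown below:

```python
import numpy as np

def conformal_thresholds(scores_tune, y_tune, alphas):
    """Class-wise thresholds: the (alpha_j x 100)th percentile of the
    class-j scores on the tuning set, one threshold per class."""
    k = scores_tune.shape[1]
    return np.array([np.percentile(scores_tune[y_tune == j, j],
                                   100 * alphas[j]) for j in range(k)])

def set_valued_predict(scores_test, t_hat):
    """phi(x) = {j : score_j(x) >= t_j}."""
    return [set(np.where(s >= t_hat)[0]) for s in scores_test]
```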
We include MSVM approaches whose discriminant functions are obtained either in the traditional one-versus-rest way or in the all-at-once multicategory way (Lee, Lin, and Wahba 2004). We induce acceptance regions from MSVM by thresholding in the same way described above. It is well known that SVM does not provide an accurate estimate of the posterior probabilities (Platt 1999). The comparison between these MSVM methods, which were not originally designed for set-valued classification, and our proposed method highlights that applying the robust implementation directly to either kind of MSVM does not produce a successful set-valued classifier; that is to say, the better performance of our method is attributable to factors beyond the use of the robust implementation scheme.
Because there are $k$ noncoverage rates and one ambiguity, making fair comparisons between methods is tricky: one method can have a small test-data ambiguity but higher test-data noncoverage rates. It would be unfair to claim that such a method is better simply because it has a smaller test-data ambiguity. To resolve this conflict, we further adjust the thresholds of each method after the initial training stage, so that the empirical test-data noncoverage rates of all the methods are aligned with the nominal noncoverage rate. As a result, the noncoverage rates for almost all methods are the same, and we only need to compare their test-data ambiguities (kNN and random forest have slightly smaller noncoverage rates because there are many ties exactly at the threshold). Given the same noncoverage rate, a smaller ambiguity means that the classifier performs better.
We consider three different simulation scenarios. In the first scenario, we compare the linear approaches (SSVM, naive SVM, and penalized logistic regression), while in the next two scenarios we consider nonlinear methods. In all cases, we add additional noisy dimensions to the data to test the robustness of all the methods. These noisy covariates are normally distributed with mean 0 and covariance $\Sigma = \operatorname{diag}(1/p)$, where $p$ is the total dimension of the data.
Example 1 (Linear model with nonlinear Bayes rule). In this scenario, we generate three normally distributed classes with different covariance matrices, as shown in the left panel of Figure 3. In particular, we have $X \mid Y = j \sim N(\mu_j, \Sigma_j)$. Letting $w_4 := w_1$, for $j = 1, 2, 3$, we set $\mu_j = \|w_j - w_{j+1}\|_2^{-1}\, (w_j - w_{j+1})$ and $\Sigma_j = S_j\, \operatorname{diag}(1, 0.2)\, S_j^T$, with $S_j = [\mu_j, \tilde\mu_j]$ and $\tilde\mu_j = [-\mu_{j,2}, \mu_{j,1}]^T$. Here the $w_j$'s are the class representative vectors in angle-based learning. The prior probabilities of all classes are the same. Lastly, we add eight dimensions of noisy covariates to the data. We compare linear SSVM with the plug-in methods based on $L_2$-penalized logistic regression and naive linear SVM.
Example 2 (Moderate dimensional uniform balls). We first generate two-dimensional data uniformly distributed in three disks with radius 2/3, as shown in the middle panel of Figure 3. The centers of the three disks are $c_1 = [\cos(0), \sin(0)]^T$, $c_2 = [\cos(2\pi/3), \sin(2\pi/3)]^T$, and $c_3 = [\cos(4\pi/3), \sin(4\pi/3)]^T$. We then contaminate each disk by relabeling 10% of the observations within each class to a different class, so that the Bayes acceptance region for each class includes its own disk and one of the other two disks. We then add 98 noisy covariates on top of the two-dimensional signal. We use the Gaussian kernel for all the kernel-based methods.
Example 3 (High-dimensional donut). We first generate data using radius-angle pairs $(r_i, \theta_i)$, where $\theta_i \sim \mathrm{Unif}[0, 2\pi]$ and $r_i$ is drawn uniformly from a class-specific interval ($[1.35, 2]$ for the outer class). We define the two-dimensional $X_i = (r_i \cos(\theta_i), r_i \sin(\theta_i))$, as shown in the right panel of Figure 3. We then add 398 noisy covariates on top of the two-dimensional signal. We use the polynomial kernel for all the kernel-based methods.
Simulation results are reported in Figure 4. In all three settings, the proposed method (denoted "ssvm") outperforms all the plug-in methods when the number of observations is small, and is comparable to the best plug-in method (logistic regression in Example 1, random forest in Example 2, and kNN in Example 3) when the sample size becomes large. The naive SVM method is significantly worse than the proposed method in all scenarios. The noncoverage rates (not shown here) of the SSVM, random forest, kernel logistic regression, and naive SVM methods are close to 0.05, while kNN has smaller noncoverage rates (due to technicalities such as many ties of $\hat\eta_j(x)$ exactly at the threshold).

Accuracy and Ambiguity Tradeoff
Although 0.05 is a popular noncoverage rate in practice, it is of interest to study the trade-off between ambiguity and noncoverage as the noncoverage rate varies. In this section, we compare the proposed method with the plug-in methods, as well as with the CRR method (Zhang, Wang, and Qiao 2017), under various noncoverage rates. In particular, we study which method has the smallest ambiguity under different noncoverage rates. We fix the training sample size at 40 for each class and vary the noncoverage rates from 0.025 to 0.2 for SSVM and the plug-in methods. We align the empirical noncoverage rates for SSVM and the plug-in methods and compare their ambiguities, as in the previous section. For the CRR classifier, we vary the rejection price $d$ and report the ambiguity and the average noncoverage rate over all the classes. The results are shown in Figure 5.
In Figure 5 (where "svmrr" stands for the SVM with reject and refine options), we can see that SSVM gives a much smaller ambiguity than the plug-in methods when the noncoverage rates are small. However, as the noncoverage rates grow, the gap between the proposed method and the plug-in methods becomes smaller. In the first example, SSVM is even outperformed by a certain plug-in method. This may not be surprising. One major advantage of the proposed method is that it incorporates the noncoverage rates into the risk minimization. In contrast, the discriminant functions of plug-in methods, such as logistic regression, are not affected by the choice of the noncoverage rate. As a result, when the noncoverage rates are small, our proposed method will optimize its discriminant function to accommodate them; when the noncoverage rates gradually grow larger, the effect of the noncoverage rate level becomes weaker, and the gaps between the proposed method and the plug-in methods vanish. When the noncoverage rate is set to a very large value, the coverage constraints are not active (that is, they no longer matter because they can be achieved by most classifiers easily), and therefore most of these methods perform similarly (as they do in the standard classification setting).

Real Data Analysis
We study the performance of the proposed method on a few benchmark datasets. We compare the proposed method SSVM with $L_2$-penalized logistic regression, kernel logistic regression, kNN, random forest, and MSVM. For the sake of brevity, we do not consider methods based on the One-versus-One or One-versus-Rest paradigm.
CNAE-9 Data: The CNAE-9 data (Ciarelli and Oliveira 2009) contain 1080 documents of free-text business descriptions of Brazilian companies from nine categories. Each document is represented as a vector, where the weight of each word is its frequency in the document. This dataset is highly sparse (99.22% of the matrix entries are zeros) and has 856 predictors. The dimension of the data is much larger than the number of observations. There are 120 observations for each class in the original dataset. We evenly split the observations into training, tuning, and test sets, which gives 360 documents for each set. The noncoverage rate is set to 0.05 for all the classes. We apply linear SSVM, and compare with linear logistic regression, random forest, kNN, and naive linear SVM on this dataset.
Zipcode Data: We conduct the comparison on the well-known handwritten zip code data (LeCun et al. 1989), widely used in the classification literature. The original dataset consists of 9298 $16 \times 16$ (hence 256 predictors) pixel images of handwritten digits, with predefined training and test sets. Lei (2014), Sadinle, Lei, and Wasserman (2017), and Wang and Qiao (2018) used the same dataset to illustrate set-valued classification. Following Lei (2014) and Wang and Qiao (2018), we only use the subset of the data containing digits {0, 6, 8, 9}. Previous studies (Shafer and Vovk 2008) pointed out that there were discrepancies between the training and test sets. In this study, we first mixed the training and test data and then randomly split them into new training, tuning, and test sets. The training and tuning sets both have sample size 400, with 100 from each class. Although Lei (2014) and Sadinle, Lei, and Wasserman (2017) set the nominal noncoverage rates to 0.05 in their studies, many nonlinear classifiers, such as SVM with a Gaussian kernel, can achieve this noncoverage rate without introducing any ambiguity. Therefore, we reduce the noncoverage rate to 0.01 for all classes to make the task more challenging.
We apply SSVM with a Gaussian kernel, and compare with kernel logistic regression with a Gaussian kernel, random forest, kNN, and naive SVM with a Gaussian kernel on this dataset.
Vehicle Data: The Vehicle dataset (Siebert 1987) can be found in the UCI Machine Learning Repository. It is a four-class classification task with 946 observations and 18 predictors in total. We discriminate between silhouettes of model cars, vans, and buses. We randomly split the data into training, tuning, and test sets. The training and tuning sets are both of size 200 (50 for each class), and the rest is used as the test set. We learn set-valued classifiers using both the proposed method and the plug-in methods, and evaluate the performance on the test set. The noncoverage rate is set to 0.04 for all the classes. We apply linear SSVM, and compare with linear logistic regression, random forest, kNN, and naive linear SVM on this dataset.
For each example, we repeat the splitting 100 times and report the mean and standard error. We show the results in Tables 1 and 2. In Table 1, we report the class-specific noncoverage rates. Ideally, they should be less than or equal to the nominal rates. The rows "Amb. Aligned" and "Ambiguity" show the ambiguity of the set-valued classifiers with and without aligning the noncoverage rates to the nominal rates using the test data. If all the empirical noncoverage rates matched the nominal rates, then one could simply compare the ambiguities. Unfortunately, this is rarely the case. Instead, it is fair to compare the ambiguities after aligning the test-data noncoverage rates: the smaller the ambiguity, the better. The effectiveness of the proposed method (SSVM) can be seen from the aligned ambiguity in Table 1. Overall, no single method is the best over all cases, but the proposed SSVM is either the best or comparable to the best plug-in method. On the low-dimensional dataset (Vehicle), SSVM outperforms the logistic regression, kNN, and naive MSVM methods and comes close to random forest. In relatively high-dimensional settings, SSVM's performance improves. In particular, it is slightly better than the best plug-in methods, random forest and naive MSVM, on the zip code data. On the high-dimensional CNAE-9 data, it is slightly better than logistic regression and significantly better than all the other plug-in methods.
Table 2 provides a different perspective on this study. It shows the proportions and accuracy of the set-valued predictions conditional on the size of the prediction set. It also shows the expected size of the prediction set and the overall accuracy for each method. Here we define accuracy as the probability that the true label is contained in the prediction set.
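For reference, the evaluation quantities reported in Tables 1 and 2 can be computed from set-valued predictions as in the following sketch (ours):

```python
import numpy as np

def evaluate(phi, y):
    """phi: list of predicted label sets; y: true labels (array).
    Returns ambiguity, overall accuracy, and class-wise noncoverage."""
    sizes = np.array([len(s) for s in phi])
    covered = np.array([yi in s for s, yi in zip(phi, y)])
    ambiguity = sizes.mean()                     # E|phi(X)|
    accuracy = covered.mean()                    # P(Y in phi(X))
    noncov = {j: 1 - covered[y == j].mean() for j in np.unique(y)}
    return ambiguity, accuracy, noncov
```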
The naive SVM method does not give successful acceptance regions on most of the datasets. Although the proposed method also uses the hinge loss as a surrogate, it performs much better. This illustrates the potential power of the proposed risk minimization framework, which explicitly incorporates the noncoverage consideration.

Conclusion
In this work, we propose to learn multicategory acceptance regions, and hence set-valued classifiers, via empirical risk minimization. We make use of a general large-margin framework for the learning task. It is important to choose appropriate surrogate losses for the proposed problem. In particular, we use the truncated hinge loss in the objective, with proven Fisher consistency, and the weighted hinge loss in the constraints to obtain a close approximation to the noncoverage rates. The angle-based learning approach is used to learn the classifier effectively in high-dimensional settings. Theoretical and numerical studies have shown the effectiveness of our approach in controlling the noncoverage rates and minimizing the ambiguity. Other surrogate losses can be considered in this framework as future work.
In our proposed framework of set-valued classification, we optimize the ambiguity while imposing constraints on the noncoverage rates (equivalently, the class-specific accuracies). A separate stream of research in the machine learning community (Denis and Hebiri 2015, 2017; Shekhar, Ghavamzadeh, and Javidi 2019) considers the paradigm in which one optimizes the accuracy with a constraint on how many ambiguous predictions (prediction sets with size greater than 1) can be made. It would be of interest to study our framework under this alternative paradigm in future work.

Figure 2. The left panel illustrates the truncated hinge loss. The right panel illustrates the proposed weight function and the resulting weighted hinge loss.

Figure 3. Scatterplots of the first two dimensions of the simulated data, with different colors showing the overlapping acceptance regions suggested by the SSVM method.

Figure 4. Empirical ambiguities in the three settings. Empirical noncoverage rates are aligned among the different methods and are not shown. SSVM has the smallest ambiguity.

Figure 5. Ambiguity under different noncoverage rates. The advantage of the proposed method is more obvious when the noncoverage rates are small.

Table 1. Rows annotated with $\alpha_j$ report the empirical noncoverage rate for each class. Rows annotated with "Ambiguity" give the ambiguity of each classifier, and "Amb. Aligned" gives the ambiguity with the empirical noncoverage rates aligned to the nominal rates by adjusting the thresholds. Numbers in parentheses are standard errors across 100 runs. SSVM has a performance comparable to the best plug-in method (logistic regression on the CNAE-9 dataset and random forest on the zip code and Vehicle datasets). Boldface indicates the smallest aligned ambiguity.

Table 2. The column $|\phi(X)|$ stands for the different cardinalities of the set-valued predictions. For each classifier, we report the proportion of predictions of each cardinality and the accuracy at each cardinality. The row $E(|\phi(X)|)$ gives the average cardinality of the set-valued predictions, which is the same as "Amb. Aligned" in Table 1. The row $P(Y \in \phi(X))$ gives the overall accuracy of each classifier; the accuracies are very similar after the alignment.