On Theoretically Optimal Ranking Functions in Bipartite Ranking

ABSTRACT This article investigates the theoretical relation between loss criteria and the optimal ranking functions driven by the criteria in bipartite ranking. In particular, the relation between area under the ROC curve (AUC) maximization and minimization of ranking risk under a convex loss is examined. We characterize general conditions for ranking-calibrated loss functions in a pairwise approach, and show that the best ranking functions under convex ranking-calibrated loss criteria produce the same ordering as the likelihood ratio of the positive category to the negative category over the instance space. The result illuminates the parallel between ranking and classification in general, and suggests the notion of consistency in ranking when convex ranking risk is minimized as in the RankBoost algorithm for instance. For a certain class of loss functions including the exponential loss and the binomial deviance, we specify the optimal ranking function explicitly in relation to the underlying probability distribution. In addition, we present an in-depth analysis of hinge loss optimization for ranking and point out that the RankSVM may produce potentially many ties or granularity in ranking scores due to the singularity of the hinge loss, which could result in ranking inconsistency. The theoretical findings are illustrated with numerical examples. Supplementary materials for this article are available online.


Introduction
How to order a set of given objects or instances reflecting their underlying utility, relevance, or quality is a long-standing problem. It has been a subject of interest in various fields, for example, the theory of choices or preferences in economics, and the theory of responses in psychometrics. Ranking as a statistical problem regards how to learn a real-valued scoring or ranking function from observed order relationships among the objects. Recently, the ranking problem has gained great interest in the machine learning community for information retrieval, web search, and collaborative filtering.
As a special form of ranking, bipartite ranking involves instances from two categories (say, positive or negative), and given observed instances from the two categories, the goal is to learn a ranking function that places positive instances ahead of negative instances. For example, in document retrieval, documents are categorized as either relevant or irrelevant, and from the observed documents one wants to find a ranking function over the document space that ranks relevant documents higher than irrelevant ones.
There exists notable similarity between bipartite ranking and binary classification. However, ranking aims at correct ordering of instances rather than correct prediction of the categories associated with them. This distinction is clear in the loss criterion used to measure the error of a ranking function as opposed to classification error; the former is the bipartite ranking loss indicating the misordering of a pair of instances with known preference or order relationship while the latter is the misclassification loss. As a result, while a given discriminant function for classification can be used as a ranking function, specification of a threshold is not needed for ranking.
The performance of a ranking function is closely connected to the so-called receiver operating characteristic (ROC) curve of the function, which has been used in radiology, psychological diagnostics, pattern recognition, and medical decision making. Minimization of the expected bipartite ranking loss is shown to be equivalent to maximization of the area under the ROC curve (AUC) by using the link between the AUC criterion and the Wilcoxon-Mann-Whitney statistic as in Hanley and McNeil (1982). Cortes and Mohri (2004) further investigated the relation between the AUC and the classification accuracy, contrasting the two problems.
On the other hand, the similarity has prompted a host of applications of classification techniques such as boosting and support vector machines to ranking problems. For example, RankBoost proposed by Freund et al. (2003) is an adaptation of AdaBoost to combine preferences. Application of the large margin principle in classification to ranking has led to the procedures that aim to maximize the AUC by minimizing the ranking risk under a convex surrogate function of the bipartite ranking loss for computational efficiency. See Herbrich, Graepel, and Obermayer (2000); Joachims (2002); Brefeld and Scheffer (2005), and Rakotomamonjy (2004) for optimization of AUC by support vector learning.
Theoretical developments in the literature regarding bipartite ranking in part center around the convergence of the empirical ranking risk to the minimal risk achievable within the class of ranking functions as an application of the standard learning theory for generalization bounds. See Agarwal and Niyogi (2005) and Clémençon, Lugosi, and Vayatis (2008).
With particular focus on the Bayes ranking consistency, we investigate the theoretical relation between a loss criterion used for pairwise ranking and the optimal ranking function driven by the criterion in this article. Motivated by considerable developments of ranking algorithms and procedures in connection with classification, we examine how the optimal ranking functions defined by a family of convex surrogate loss criteria are related to the underlying probability distribution for data, and identify the explicit form of optimal ranking functions for some loss criteria. In doing so, we draw a parallel between binary classification and bipartite ranking and establish the minimal notion of consistency in ranking, namely, ranking calibration, analogous to the notion of classification calibration or Fisher consistency, which has been studied quite extensively in the classification literature (see, e.g., Zhang 2004; Bartlett, Jordan, and McAuliffe 2006). At about the same time as an earlier version of this work, Gao and Zhou (2015) independently examined similar issues regarding the difference between classification and ranking consistency when surrogate loss functions are used. Kotlowski, Dembczynski, and Hüllermeier (2011) and Agarwal (2013) also studied ranking consistency in connection with classification by examining the risk bounds for bipartite ranking when discriminant functions from binary classification are directly used as ranking functions.
In this article, we employ a pairwise ranking approach, which is standard in learning to rank. Starting from the fact that the theoretically optimal ordering over the instance space is determined by the likelihood ratio of the positive category to the negative category, we show that the best ranking functions under some convex loss criteria produce the same ordering. In particular, the RankBoost algorithm with the exponential loss is shown to target a half of the log-likelihood ratio on the population level. This result reveals the theoretical relationship between the ranking function from RankBoost and the discriminant function from AdaBoost. Rudin and Schapire (2009) arrived at a qualitatively similar conclusion through an algebraic proof of the finite sample equivalence of AdaBoost and RankBoost algorithms. The binomial deviance loss used in RankNet (Burges et al. 2005) also preserves the optimal ordering for ranking consistency. Further, the result suggests that discriminant functions for classification that are order-preserving transformations of the likelihood ratio (e.g., logit function) can be used as a consistent ranking function in general. We establish general conditions for ranking-calibrated losses, and show that they are stricter than those conditions for classification calibration as optimal ranking requires more information about the underlying conditional probability (a transformation of the likelihood ratio) than classification. Some classification-calibrated loss functions are also ranking-calibrated (e.g., exponential loss and binomial deviance loss), but the hinge loss for support vector ranking is a notable exception among the commonly used loss functions. 
We provide an in-depth analysis of the scoring function under hinge loss in this article and prove that the support vector ranking with the hinge loss may produce potentially many ties or granularity in ranking scores due to the singularity of the loss, and this could result in ranking inconsistency.
As a related work, drawing on the connection to convex risk minimization in binary classification, Clémençon, Lugosi, and Vayatis (2008) considered a statistical framework that transforms bipartite ranking into a pairwise binary classification problem. By minimizing empirical convex risk functionals in the framework, they studied ranking rules that specify preference between a pair of instances instead of real-valued ranking or scoring functions in our approach. However, a ranking rule or the associated function that induces the rule, in general, does not define a ranking function consistently in the form of pairwise difference as assumed in bipartite ranking. This fact yields significantly different results for the two formulations. Equivalence between the two formulations depends largely on the loss. We specify the condition for loss such that the equivalence holds (see Theorem 7), which implies that not all ranking-calibrated loss functions result in such equivalence. This result renders the relevance of the pairwise binary classification approach rather limited in practice because standard ranking algorithms such as RankBoost, RankNet, and RankSVM are designed to produce a scoring function instead of a ranking rule. Moreover, the pairwise binary classification approach does not make the important distinction between classification calibration and ranking calibration for loss functions. Theorem 3 and Section 3.3 highlight the difference in the two approaches. Gao and Zhou (2015) also borrowed the framework for classification calibration in Bartlett, Jordan, and McAuliffe (2006) for bipartite ranking. Focusing mainly on the analysis of ranking risk, they provided conditions for surrogate loss for AUC consistency, but with no attention to the form of the optimal scoring functions as in our work. 
Duchi, Mackey, and Jordan (2010) examined ranking consistency of general loss functions defined over preference graphs in label ranking and showed negative results about convex loss functions. Similarly, Calauzènes, Usunier, and Gallinari (2012) established nonexistence of calibrated convex surrogate losses with respect to such evaluation metrics as the average precision and the expected reciprocal rank. These negative results are largely due to the generality of label ranking in contrast to bipartite ranking.
The rest of this article is organized as follows. Section 2 introduces the problem setting and specifies the best ranking function that maximizes the AUC. The properties of the theoretically optimal ranking functions under convex loss criteria are discussed together with conditions for ranking calibration, and the form of the optimal ranking function for a certain class of loss criteria is specified in Section 3 along with analysis of support vector ranking. The general relation between the optimal bipartite ranking function and the optimal ranking rule in pairwise binary classification approach is investigated in Section 4, followed by numerical illustration of the results in Section 5. Conclusion and further discussions are in Section 6. Technical proofs are in the supplement.

Basic Setting
Let X denote the space of objects or instances we want to rank. Suppose each object X ∈ X comes from one of two categories, positive or negative, Y = {1, −1}, and let Y indicate the category associated with X. Training data for ranking consist of independent pairs of (X, Y) from X × Y. Suppose the dataset has n_+ positive objects {x_i}_{i=1}^{n_+} and n_− negative ones {x′_j}_{j=1}^{n_−}. From the data, we want to learn a ranking function such that positive objects are generally ranked higher than negative objects. A ranking function is a real-valued function defined on X, f : X → R, whose values determine the ordering of instances. For each pair of a positive object x and a negative object x′, the bipartite ranking loss of f is defined as l_0(f; x, x′) = I(f(x) < f(x′)) + (1/2) I(f(x) = f(x′)), where I(·) is the indicator function. Note that the loss is invariant under any order-preserving transformation of f. The best ranking function can then be defined as the function f minimizing the empirical ranking risk over the training data by considering all pairs of positive and negative instances from the data.
The empirical ranking risk R_{n_+,n_−}(f) is given by one minus the AUC of f, and thus minimization of the ranking risk is equivalent to maximization of the AUC.
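This identity is easy to verify numerically. The following sketch (ours, with hypothetical scores, not from the article) computes both quantities by brute force over all positive-negative pairs, counting ties as 1/2:

```python
# A minimal numerical check (our illustration) that the empirical ranking
# risk equals one minus the AUC computed via the Wilcoxon-Mann-Whitney
# statistic; ties are counted as 1/2 in both quantities.

def empirical_ranking_risk(pos_scores, neg_scores):
    """Average bipartite ranking loss over all positive-negative pairs."""
    bad = 0.0
    for s in pos_scores:
        for t in neg_scores:
            if s < t:
                bad += 1.0
            elif s == t:
                bad += 0.5
    return bad / (len(pos_scores) * len(neg_scores))

def auc(pos_scores, neg_scores):
    """Wilcoxon-Mann-Whitney estimate of the AUC."""
    win = 0.0
    for s in pos_scores:
        for t in neg_scores:
            if s > t:
                win += 1.0
            elif s == t:
                win += 0.5
    return win / (len(pos_scores) * len(neg_scores))

pos = [2.0, 1.1, 0.3, 0.3]   # hypothetical scores of positive instances
neg = [0.3, -0.2, -1.5]      # hypothetical scores of negative instances
assert abs(empirical_ranking_risk(pos, neg) + auc(pos, neg) - 1.0) < 1e-12
```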

Optimal Ranking Function
Theoretically, the AUC of a ranking function is the probability that the function ranks a positive instance higher than a negative instance when they are drawn at random. Casting the AUC maximization problem in the context of statistical inference, consider hypothesis testing of H_0: Y = −1 versus H_a: Y = 1 based on a "test statistic" f(x). For a critical value r, the test rejects H_0 if f(x) > r, and retains H_0 otherwise. Then the size of the test is P(f(X) > r | Y = −1), which is, in fact, the theoretical false positive rate, and the power of the test is P(f(X) > r | Y = 1), which is the theoretical true positive rate of f. Hence, the relationship between the false positive rate and the true positive rate of a ranking function f is the same as that between the size and the power of a test based on f in statistical hypothesis testing. This dual interpretation of ranking also appears in Clémençon, Lugosi, and Vayatis (2008). By the Neyman-Pearson lemma, the most powerful test at any fixed size is based on the likelihood ratio of x under the two hypotheses. This implies that the best ranking function maximizing the theoretical AUC is a function of the likelihood ratio.
Let g_+ be the pdf or pmf of X for the positive category, and let g_− be that for the negative category. For simplicity, we further assume that 0 < g_+(x) < ∞ and 0 < g_−(x) < ∞ for x ∈ X in this article. The following theorem states that the optimal ranking function for bipartite ranking is any order-preserving function of the likelihood ratio f*_0(x) ≡ g_+(x)/g_−(x). For notational convenience, let R_0(f) ≡ E[l_0(f; X, X′)] denote the ranking error rate of f under the bipartite ranking loss, where X and X′ are, respectively, a positive instance and a negative instance randomly drawn from the distributions with g_+ and g_−. The proof of the theorem is omitted here.
Theorem 1. For any ranking function f, R_0(f) ≥ R_0(f*_0), and the minimum is attained by any order-preserving transformation of f*_0.
To see the connection of ranking with classification, let π = P(Y = 1) and verify that the posterior probability P(Y = 1 | x) = π g_+(x)/(π g_+(x) + (1 − π) g_−(x)) = π f*_0(x)/(π f*_0(x) + 1 − π) is a monotonic transformation of f*_0(x). Indeed, through a different formulation of ranking, Clémençon, Lugosi, and Vayatis (2008) showed the equivalent result that a class of optimal ranking functions should be strictly increasing transformations of the posterior probability. The difference in the formulation is described in Section 4. This fact implies that those discriminant functions from classification methods estimating the posterior probability consistently, for example, logistic regression, may well be used as ranking functions for minimal ranking error in practice.
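A small check (ours, with hypothetical likelihood-ratio values) of the monotone relation between the posterior probability and the likelihood ratio:

```python
# Sketch (not in the article): with pi = P(Y = 1), the posterior
# P(Y=1|x) = pi*g+(x) / (pi*g+(x) + (1-pi)*g-(x)) can be written as
# pi*f0 / (pi*f0 + 1 - pi), an increasing function of the likelihood
# ratio f0(x) = g+(x)/g-(x); so ranking by the posterior and ranking
# by f0 are equivalent.

def posterior_from_ratio(f0, pi):
    """Posterior probability of the positive class as a function of f0."""
    return pi * f0 / (pi * f0 + (1.0 - pi))

ratios = [0.1, 0.5, 1.0, 2.0, 10.0]   # hypothetical values of g+/g-
post = [posterior_from_ratio(r, pi=0.3) for r in ratios]
assert all(p1 < p2 for p1, p2 in zip(post, post[1:]))  # order preserved
```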

Ranking with Convex Loss
Since minimization of the ranking error involves nonconvex optimization with the discontinuous l_0 loss, direct maximization of the AUC is not computationally advisable, just as direct minimization of the classification error under the misclassification loss is not. For the computational advantages of convex optimization, many researchers have applied successful classification algorithms such as boosting and support vector machines to ranking for AUC maximization by replacing the bipartite ranking loss with a convex surrogate loss.
In this section, we identify the form of the minimizers of convex ranking risks and examine the properties of the optimal ranking functions. Consider a nonnegative, nonincreasing convex loss function l : R → [0, ∞), and take l(f(x) − f(x′)) as a ranking loss, given a ranking function f and a pair of a positive instance x and a negative instance x′. For example, the RankBoost algorithm in Freund et al. (2003) takes the exponential loss, l(s) = exp(−s), for learning the best ranking function, and the support vector ranking in Herbrich, Graepel, and Obermayer (2000), Joachims (2002), and Brefeld and Scheffer (2005) takes the hinge loss, l(s) = (1 − s)_+.
To understand the relation between a convex loss function l used to define a ranking loss and the minimizer of the ranking risk on the population level, let R_l(f) ≡ E[l(f(X) − f(X′))] be the convex ranking risk and let f* be the optimal ranking function minimizing R_l(f) among all measurable functions f : X → R. When f* preserves the ordering of the likelihood ratio f*_0, we call the loss l ranking-calibrated, which is analogous to the notion of classification calibration of l for Bayes error consistency in classification.

Special Case
The following theorem states some special conditions on the loss function under which the theoretically best ranking function can be specified explicitly.
Theorem 2. Suppose that l is convex and differentiable, l′(s) < 0 for all s ∈ R, and l′(−s)/l′(s) = exp(s/α) for some positive constant α. Then the optimal ranking function minimizing R_l(f) is f*(x) = α log(g_+(x)/g_−(x)) + β, where β is an arbitrary constant.
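As a numerical illustration of Theorem 2 (our own sketch, with hypothetical pmfs on a three-point space, not the article's proof), minimizing the pairwise exponential risk by plain gradient descent recovers the increments of (1/2) log(g_+/g_−), the case α = 1/2:

```python
import math

# Numerical check of Theorem 2 for the exponential loss l(s) = exp(-s):
# on a three-point space, the minimizer of the pairwise risk
#   sum_{x,x'} g+(x) g-(x') exp(-(f(x) - f(x')))
# should match f*(x) = (1/2) log(g+(x)/g-(x)) up to an additive constant.

gp = [0.2, 0.3, 0.5]   # hypothetical pmf of the positive class
gm = [0.5, 0.3, 0.2]   # hypothetical pmf of the negative class

f = [0.0, 0.0, 0.0]
for _ in range(20000):                    # plain gradient descent
    A = sum(gp[i] * math.exp(-f[i]) for i in range(3))
    B = sum(gm[j] * math.exp(+f[j]) for j in range(3))
    grad = [-gp[i] * math.exp(-f[i]) * B + gm[i] * math.exp(f[i]) * A
            for i in range(3)]
    f = [f[i] - 0.05 * grad[i] for i in range(3)]

target = [0.5 * math.log(gp[i] / gm[i]) for i in range(3)]
# compare increments: the optimum is unique only up to a constant shift
for i in range(2):
    assert abs((f[i + 1] - f[i]) - (target[i + 1] - target[i])) < 1e-3
```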

General Case
To deal with a general loss l beyond those covered by Theorem 2, we consider general convex loss criteria. The next theorem specifies general conditions for ranking calibration, and states the general relation between the best ranking function f* under convex ranking loss criteria and the likelihood ratio g_+/g_− in terms of the relative ordering of a pair of instances when X is continuous.
Theorem 3. Suppose that l is convex, nonincreasing, and differentiable with l′(0) < 0. Then (i) the optimal ranking function f* under l preserves the order of the likelihood ratio g_+/g_− without ties; in particular, l is ranking-calibrated.
Remark 2. As an interesting example, consider l(s) = (1 − s)_+, the hinge loss in support vector ranking. It is differentiable at 0 with l′(0) = −1, but it has a singularity point at s = 1. Thus, Theorem 3 does not apply. In comparison, l(s) = [(1 − s)_+]², the squared hinge loss, is differentiable everywhere, and Theorem 3 (i) implies that the optimal ranking function f* under l preserves the order of the likelihood ratio without ties.
Remark 3. Theorem 3 (i) is equivalent to Theorem 2 in Gao and Zhou (2015), which states sufficient conditions for AUC consistency. The former describes ranking calibration in terms of the likelihood ratio g_+/g_−, while the latter does so in terms of the posterior probability, as in classification calibration.

Support Vector Ranking
To cover the hinge loss for support vector ranking, we resort to results in convex analysis (Rockafellar 1997). We mainly use the fact that a convex function is globally minimized at a point if and only if zero is contained in the subdifferential of the function at the point. The subdifferential of the hinge loss (1 − s)_+ at s = 1 is [−1, 0]. First, we illustrate potential ties in the optimal ranking function under the hinge loss with a toy example. We can derive explicitly the conditions under which ties can occur in this simple example.

Toy Example
Suppose X = {x*_1, x*_2, x*_3}, and without loss of generality, assume that g_+(x*_1)/g_−(x*_1) < g_+(x*_2)/g_−(x*_2) < g_+(x*_3)/g_−(x*_3) for the pmfs g_+ and g_− of X and X′. Let f* be a minimizer of the ranking risk under the hinge loss. Define the increments s_1 = f(x*_2) − f(x*_1) and s_2 = f(x*_3) − f(x*_2) for a ranking function f. Then we can express the risk R_l(f) in terms of s_1 and s_2. Depending on the size of the differences in the likelihood ratio g_+/g_− between distinct instances, the optimal increments s*_1 and s*_2 of f* can be identified explicitly. Table 1 summarizes the values of s*_1 and s*_2 under various scenarios specified by two constants a and b, which are defined and used to quantify the size of the differences in the likelihood ratio. See the proof of the result in the supplementary material. Note that for some values of a and b, s*_1 and s*_2 are not uniquely determined and neither is f*. Table 1 shows that the optimal increments for the support vector ranking function could be zero with the only exception of the case a < 0 and b < 0. Another notable fact is that the maximum increment is 1, which clearly stems from the singularity point of the hinge loss. Having at least one of s*_1 and s*_2 equal to zero means that the theoretically optimal ranking function f* produces ties for the pair x*_1 and x*_2 or the pair x*_2 and x*_3. Such ties make the ranking error rate of f* strictly greater than the minimal ranking error rate, and thus f* is not consistent with f*_0. This toy example demonstrates that, in general, ranking by risk minimization under the hinge loss could lead to inconsistency due to ties when the sample space is discrete.

Table 1. The optimal increments s*_1 and s*_2 for a hinge risk minimizer f*, depending on the signs of a and b.
To understand the ideal case when the optimal increments are both 1 and hence ties in ranking by f* do not occur, we examine the conditions a < 0 and b < 0, which can be expressed equivalently as inequalities between the likelihood ratios at x*_1, x*_2, and x*_3.
From these inequalities, a < 0 can be interpreted as a condition for the gap between the likelihood ratios g_−/g_+ at x*_1 and x*_2 being sufficiently large (more precisely, larger than g_−(x*_3)/g_+(x*_2)), and likewise, b < 0 means that the gap between the likelihood ratios g_+/g_− at x*_2 and x*_3 is large enough. In other words, when the elements in the sample space are sufficiently far apart in terms of their likelihood ratio, we expect theoretically optimal rankings by the function f* to be without any ties. Otherwise, f* could yield ties. Additional interpretation of the result in the table and an illustration of the partitions of the space of g_+ and g_− distributions according to the optimal increments (see Figure S5) can be found in the supplementary material.
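The toy example can be reproduced numerically. The sketch below (ours, with hypothetical pmfs, not the article's choices) minimizes the pairwise hinge risk over the increments s_1 and s_2 by a grid search; with well-separated likelihood ratios both optimal increments are 1, while with nearly equal ratios one increment collapses to 0, producing a tie:

```python
# Brute-force illustration of the toy example: minimize the pairwise
# hinge risk over the increments s1 = f(x2) - f(x1), s2 = f(x3) - f(x2)
# on a three-point space, with f(x1) fixed at 0.

def hinge_risk(s1, s2, gp, gm):
    """Pairwise hinge risk sum_{i,j} gp[i]*gm[j]*(1 - (f_i - f_j))_+."""
    f = [0.0, s1, s1 + s2]
    return sum(gp[i] * gm[j] * max(0.0, 1.0 - (f[i] - f[j]))
               for i in range(3) for j in range(3))

def best_increments(gp, gm):
    """Grid search for the risk-minimizing increments in [0, 1.5]."""
    grid = [k / 20 for k in range(31)]
    return min(((s1, s2) for s1 in grid for s2 in grid),
               key=lambda s: hinge_risk(s[0], s[1], gp, gm))

# well-separated likelihood ratios (1/7, 1, 7): both increments are 1
assert best_increments([0.1, 0.2, 0.7], [0.7, 0.2, 0.1]) == (1.0, 1.0)
# nearly equal likelihood ratios: one increment is 0, i.e., a tie in f*
assert best_increments([0.32, 0.34, 0.34], [0.36, 0.33, 0.31]) == (1.0, 0.0)
```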

General Properties
Using the constructive approach of identifying ranking functions with their increments in the toy example, we derive general properties of optimal ranking functions under the hinge loss.
As illustrated in the toy example, the ranking risk is minimized by a unique set of optimal increments except for some degenerate cases of negligible measure. Without loss of generality in a practical sense, we derive the following result under the assumption that the risk minimizer f* is unique (up to an additive constant).
The following theorem shows more specific results of optimal ranking under the hinge loss. They reveal the undesirable property of potential ties in ranking when the hinge loss is used, extending the phenomenon observed in the toy example to the general case. A detailed proof is given in the supplementary material.
Theorem 5. (i) For discrete X, if the elements are ordered by the likelihood ratio g_+/g_− so that the ratio is nondecreasing, then the increments of f* between consecutive elements cannot take any value other than 0 or 1. Thus, a version of f* is integer-valued.
(ii) For continuous X , there exists an integer-valued ranking function whose ranking risk is arbitrarily close to the minimum risk.

Integer-Valued Ranking Functions
As an implication of Theorem 5, it is sufficient to consider only integer-valued functions to find a risk minimizer f* under the hinge loss. Let K be the number of distinct values that f takes (possibly ∞), and define A_i(f) ≡ {x | f(x) = i}, slightly different from the definition in the proof of Theorem 5 (ii), so that the sets A_i(f) form a partition of X indexed by the values of f. Emphasizing the connection between the partition A(f) ≡ {A_i(f)} and f, we may identify f with A(f). To examine the effect of the number of distinct values K, or the number of steps (K − 1), on the minimal ranking risk, define F_K as the set of all integer-valued functions with (K − 1) steps only. Let R_K = inf_{f ∈ F_K} R_l(f) be the minimal risk achieved by ranking functions within F_K. The following results show that if the likelihood ratio g_+/g_− is unbounded, the ranking risk R_K is nonincreasing in K and strictly decreasing as long as g_− has a positive probability for diminishing tails of the likelihood ratio where it diverges to ∞. See Section 5 for a concrete example illustrating Theorem 6 and Corollary 1.
Corollary 1. If g_−(C_ε) > 0 for all ε > 0, then R_K > R_{K+1} for each K under the assumption of Theorem 6. Hence the optimal K is infinity.
Remark 4. By reversing the roles of g_+ and g_−, and redefining A_i accordingly, we can establish results similar to Theorem 6 and Corollary 1 when inf_{x∈X} g_+(x)/g_−(x) = 0.
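To make the monotonicity of R_K concrete, the following is our own numerical sketch (the Gaussian choice g_+ = N(1, 1) and g_− = N(−1, 1), for which the likelihood ratio is unbounded, and the grid of candidate cut points are our assumptions): it computes the minimal pairwise hinge risk of integer-valued ranking functions with K levels and shows it strictly decreasing in K.

```python
import math
from itertools import combinations

# Minimal pairwise hinge risk over K-level integer-valued ranking
# functions, for g+ = N(1, 1) and g- = N(-1, 1). Each function assigns
# level i on the interval between consecutive cut points.

Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # normal cdf

def level_probs(cuts, mean):
    """Probability mass of each level for a N(mean, 1) variable."""
    edges = [-math.inf] + list(cuts) + [math.inf]
    return [Phi(edges[k + 1] - mean) - Phi(edges[k] - mean)
            for k in range(len(edges) - 1)]

def hinge_risk(cuts):
    """E[(1 - (f(X+) - f(X-)))_+] for the step function defined by cuts."""
    p = level_probs(cuts, +1.0)      # positive-class level probabilities
    q = level_probs(cuts, -1.0)      # negative-class level probabilities
    return sum(p[i] * q[j] * max(0.0, 1.0 - (i - j))
               for i in range(len(p)) for j in range(len(q)))

grid = [k / 4 for k in range(-12, 13)]     # candidate cuts in [-3, 3]

def R(K):
    """Best achievable risk with K levels (K - 1 distinct cuts)."""
    return min(hinge_risk(c) for c in combinations(grid, K - 1))

# R_1 = 1 (constant function); the risk strictly decreases with K
assert R(2) < 1.0 and R(3) < R(2) and R(4) < R(3)
```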

Comments on Related Results
As a related work, Clémençon, Lugosi, and Vayatis (2008) provided a rigorous statistical framework for studying the ranking problem and also discussed convex risk minimization methods for ranking. We explain the connection between the two approaches and point out the differences. Their formulation considers a ranking rule r : X × X → {−1, 1} directly, instead of a ranking (or scoring) function f : X → R as in our formulation. If r(x, x′) = 1, then x is ranked higher than x′. A ranking rule r represents a partial order or preference between two instances while a ranking function f represents a total order over the instance space. A real-valued function h(x, x′) on X × X can induce a ranking rule via r(x, x′) ≡ sgn(h(x, x′)).
Covering more general ranking problems with a numerical response Y, for each independent and identically distributed pair of (X, Y) and (X′, Y′) from a distribution on X × R, they define a variable Z = (Y − Y′)/2, and consider X being better than X′ if Z > 0. Then by directly relating (X, X′) with sgn(Z), they transform the ranking problem to a pairwise binary classification problem and examine the implications of the formulation for ranking.
Note that in the transformed classification framework, the bipartite ranking loss corresponds to the 0-1 loss. As a result, when Y = 1 or −1, the best ranking rule r*(x, x′) is given by the Bayes decision rule φ*(x, x′) for the classification problem. Although it was not explicitly stated in the article, we can infer from φ*(x, x′) that the best ranking rule in bipartite ranking can be expressed as r*(x, x′) = sgn(g_+(x)g_−(x′) − g_−(x)g_+(x′)). Hence, with this different formulation we can arrive at the same conclusion as Theorem 1 that the theoretically optimal rankings over the instance space X are given by the likelihood ratio g_+/g_−.
As an extension, for the ranking rules minimizing convex risk functionals, Clémençon, Lugosi, and Vayatis (2008) invoked the results of Bartlett, Jordan, and McAuliffe (2006) on the consistency of classification with convex loss functions. Again, directly aiming at the optimal ranking rule rather than the scoring function, they discussed theoretical implications of minimizing the risk E[l(sgn(Z)h(X, X′))] for a convex loss l.
Considering only a positive instance X and a negative instance X′, we can describe the difference between our approach and theirs as whether one finds the optimal ranking rule induced by a real-valued function h, argmin_h E[l(h(X, X′))], or the optimal ranking function f, argmin_f E[l(f(X) − f(X′))]. In practice, ranking algorithms such as the RankBoost algorithm produce a ranking function, not a ranking rule, which makes our approach more natural and pertinent. More importantly, a ranking rule does not define a ranking function consistently in general, and Clémençon, Lugosi, and Vayatis (2008) overlooked this fact when applying the classification results to ranking.
On the other hand, in some special cases, if there exists f such that h*(x, x′) = f(x) − f(x′), h* can be used to specify the optimal f. Theorem 2 regards those special cases. For example, in the case of the exponential loss, its population minimizer in the classification problem is known to be (1/2) times the logit. Therefore, the best ranking rule r*(x, x′) is induced by h*(x, x′) = (1/2) log(g_+(x)g_−(x′)/(g_−(x)g_+(x′))) = f*(x) − f*(x′) with f*(x) = (1/2) log(g_+(x)/g_−(x)) + β, where β is an arbitrary constant. h* then identifies f* = (1/2) log(g_+/g_−) as the optimal ranking function, the same conclusion as Theorem 2.
The following theorem states that the conditions for the ranking loss in Theorem 2 are indeed necessary for the existence of a ranking (or scoring) function consistent with h*(x, x′), and therefore for the equivalence between the two formulations.
Theorem 7. Suppose that l is convex and differentiable, l′(s) < 0 for all s ∈ R, and l′(−s)/l′(s) is strictly increasing in s. Let h* be the optimal function on X × X minimizing E[l(h(X, X′))]. Then h* is of the form h*(x, x′) = f(x) − f(x′) for some function f on X if and only if l′(−s)/l′(s) = exp(s/α) for some positive constant α.
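The ratio condition in Theorems 2 and 7 can be checked numerically. The sketch below (ours) verifies that the exponential loss satisfies l′(−s)/l′(s) = exp(s/α) with α = 1/2 and the binomial deviance with α = 1, while the squared hinge loss does not satisfy the condition for any α:

```python
import math

# Numerical check of the condition l'(-s)/l'(s) = exp(s/alpha): the
# exponential loss satisfies it with alpha = 1/2, the binomial deviance
# with alpha = 1, while the squared hinge loss does not for any alpha.

def dexp(s):            # derivative of l(s) = exp(-s)
    return -math.exp(-s)

def ddev(s):            # derivative of l(s) = log(1 + exp(-s))
    return -1.0 / (1.0 + math.exp(s))

def dsqh(s):            # derivative of l(s) = ((1 - s)_+)^2
    return -2.0 * max(0.0, 1.0 - s)

for s in [0.1, 0.5, 1.3]:
    assert abs(dexp(-s) / dexp(s) - math.exp(s / 0.5)) < 1e-12
    assert abs(ddev(-s) / ddev(s) - math.exp(s / 1.0)) < 1e-12

# squared hinge: log of the ratio is not linear in s, so no alpha works
r = lambda s: dsqh(-s) / dsqh(s)
assert abs(math.log(r(0.2)) / 0.2 - math.log(r(0.4)) / 0.4) > 1e-3
```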
Remark 5. When the conditions in Theorem 7 hold for loss l, the two formulations for ranking are equivalent in the sense that h*(x, x′) = f*(x) − f*(x′) by Theorem 2. Note that the regret bound for bipartite ranking through h(·, ·) in Clémençon, Lugosi, and Vayatis (2008) (see p. 864) is based on the result in Bartlett, Jordan, and McAuliffe (2006) for binary classification, and its translation to a regret bound for ranking through a scoring function f requires this equivalence.
Remark 6. Kotlowski, Dembczynski, and Hüllermeier (2011) and Agarwal (2013) also assumed the relation between a pairwise ranking rule sgn(h(x, x′)) and a scoring function f(x) from binary classification as sgn(f(x) − f(x′)) in their regret bound analysis.

Remark 7. The assumption in Theorem 7 about l′(−s)/l′(s) is related to the necessary and sufficient condition for proper composite losses in Reid and Williamson (2010) for binary classification.
In contrast to those loss functions that yield the functional correspondence between h* and f*, other loss functions call for careful distinction between the two approaches. For example, the squared hinge loss l is ranking-calibrated (also classification-calibrated), but l′(−s)/l′(s) ≠ exp(s/α) for any positive constant α. Another case in point that illustrates the difference between the two approaches clearly is the hinge loss, l(s) = (1 − s)_+. Application of the well-known result about the population minimizer of the hinge loss gives h*(x, x′) = sgn(g_+(x)g_−(x′) − g_−(x)g_+(x′)). It is easy to argue that there exists no ranking function f with f(x) − f(x′) = h*(x, x′) in general. If such f existed, then for x_0 with g_+(x_0) = g_−(x_0), the ranking function would be given by f(x) = f(x_0) + sgn(g_+(x) − g_−(x)). However, this functional form leads to the equation sgn(g_+(x) − g_−(x)) − sgn(g_+(x′) − g_−(x′)) = sgn(g_+(x)g_−(x′) − g_−(x)g_+(x′)), which is not generally true.
In contrast, Theorem 4 implies at least that the optimal ranking function under the hinge loss preserves the order of the likelihood ratio, but not strictly, with some possible ties. Although the explicit form of f* may not be specified, Theorem 5 further shows that f* can exhibit the characteristics of a step function. As the toy example illustrates, such ties could lead to ranking inconsistency. An alternative proof of the ranking inconsistency of the hinge loss is in Gao and Zhou (2015) with a simple counterexample, which is essentially a special case of the setting in our toy example.
Theorem 3 states a sufficient condition for a convex loss function to be ranking-calibrated. Gao and Zhou (2015) provided a necessary condition for ranking consistency of a convex loss function l(·), namely, that it is differentiable at 0 with l′(0) < 0. We note that this necessary condition is identical to the necessary and sufficient condition for classification calibration of a convex loss in Theorem 2.1 of Bartlett, Jordan, and McAuliffe (2006), illuminating further the relation between classification and ranking. Gao and Zhou (2015) also studied the regret bounds for the exponential loss and the logistic loss and showed that the excess ranking risk R_0(f) − R_0(f*_0) is bounded by c times a power of the excess convex risk R_l(f) − R_l(f*) with some c > 0 for each loss. This result is reminiscent of the regret bounds in Bartlett, Jordan, and McAuliffe (2006) and Zhang (2004) for binary classification with surrogate loss and those in Agarwal (2013) for a pointwise approach to ranking via classification.

Simulation Study
To illustrate the theoretical results, we carried out a numerical experiment under a simple setting. With binary Y (1 or −1), the distribution of X for the positive category was set to N(1, 1), and that for the negative category was set to N(−1, 1). This setting yields log(g_+(x)/g_−(x)) = 2x. Thus, the theoretically best ranking function with minimum ranking error should be an order-preserving transformation of x. A training dataset of (X, Y) pairs was generated from the distributions with 2000 instances in each category (n_+ = n_− = 2000).
First, the RankBoost algorithm in Freund et al. (2003) was applied to the training sample by considering 2000 × 2000 positive and negative pairs. In the boosting algorithm for ranking, a weak learner is a ranking function whose performance in terms of the AUC is slightly better than random assignment. In our experiment, a stump f_θ(x) ≡ I(x > θ) or I(x ≤ θ) with a threshold θ ∈ R was used as a weak learner, and the threshold θ was taken from the observed values. At each iteration, a weak ranking function was chosen and added to the current ranking function with a weight determined to minimize the ranking risk over the positive and negative pairs from the training data. We iterated the boosting process 400 times to combine weak rankings and obtained the final ranking function f̂. It is depicted in the left panel of Figure 1. The f̂ in the figure is centered at zero on the y-axis. The dotted line is the theoretically optimal ranking function, f*(x) = (1/2) log(g+(x)/g−(x)) = x, under the exponential loss as indicated by Theorem 2. The centered ranking function from boosting appears to approximate f* closely, especially over [−2, 2], where the density is relatively high as marked by the rug plot of a subset of the observed values sampled at the rate of 1/20. The flat part of the function on either end is an artifact due to the form of the weak learners used in boosting. Increasing the number of iterations further did not change the visual appearance of the ranking function. In fact, after fewer than 20 iterations, the AUC values of boosted rankings over the training data became stable as shown in the right panel, and the changes afterward were only incremental.
Second, the AUC maximizing support vector machine (SVM) in Brefeld and Scheffer (2005) was applied to the training data. In general, the AUCSVM (also known as RankSVM) finds a ranking function f ∈ H_K minimizing (1/2)‖f‖²_{H_K} + C Σ_{i=1}^{n+} Σ_{j=1}^{n−} [1 − (f(x_i) − f(x_j))]_+, where C is a tuning parameter and H_K is a reproducing kernel Hilbert space with a kernel K. The solution f̂(x) takes the form of Σ_{i,j} c_{ij} (K(x_i, x) − K(x_j, x)). As the data involve four million pairs, which would make exact computation almost prohibitive, a clustering approach was proposed in the article for approximate computation. Since X is univariate in this example, we could streamline the clustering step by taking sample quantiles, instead of relying on general k-means clustering as suggested in the article. We first selected a certain number of quantiles of pairwise differences (x_i − x_j) for i = 1, . . . , n+ and j = 1, . . . , n−, and used only the corresponding pairs for an approximate solution. To allow a rich space with sufficiently local basis functions for approximation of the optimal ranking functions, the Gaussian kernel K(x, x′) = exp(−(x − x′)²/(2σ²)) with parameter σ² = 0.15 was used. To illuminate the implications of Theorem 5, we also considered a range of other sample sizes n = n+ = n− and tuning parameters C. Figure 2 shows approximate ranking functions f̂ (solid lines) obtained by the AUC maximizing SVM for some combinations of n and C. For approximation, we selected 400 pairs based on quantiles of the pairwise differences when n+ = n− = 30 and 1500 pairs when n+ = n− = 2000 or 3000. As expected from Theorem 5, the estimated ranking functions appear to approximate step functions increasing in x roughly over the region of high density [−2, 2] as indicated by the visible bumps. The reverse trend on either end is again an artifact due to the form of the Gaussian kernel used as a basis function and the fact that there are relatively few observations near the end.
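The quantile-based selection of pairs described above might be sketched as follows; the function name and the exact selection rule (evenly spaced quantile positions of the sorted pairwise differences) are our own simplification of the clustering step.

```python
import numpy as np

def quantile_pairs(x_pos, x_neg, n_pairs):
    """Pick positive-negative pairs whose differences x_i - x_j sit at
    evenly spaced sample quantiles of all pairwise differences, as a
    cheap univariate stand-in for the clustering approximation of
    Brefeld and Scheffer (2005)."""
    diffs = (x_pos[:, None] - x_neg[None, :]).ravel()
    order = np.argsort(diffs)
    # flat indices of the selected quantile positions
    picks = order[np.linspace(0, diffs.size - 1, n_pairs).astype(int)]
    i, j = np.unravel_index(picks, (x_pos.size, x_neg.size))
    return i, j
```

The selected index pairs (i, j) then define the reduced set of hinge terms [1 − (f(x_i) − f(x_j))]_+ actually used in the optimization.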
On the whole, the ranking functions attempt to provide the same ordering as the likelihood ratio as indicated by Theorem 4, albeit with potentially many ties. The dotted lines in Figure 2 are the optimal step functions that are theoretically identified when the numbers of steps are 1, 2, 4, and 5, respectively. How to characterize the optimal step functions in general is explained in the next subsection. We empirically chose the step function that matches each of the estimated ranking functions most closely in terms of the number of steps. For better alignment, we shifted the step function in each panel vertically so that the values of the pair of functions at x = 0 are identical.

The Optimal Step Functions for Ranking under Hinge Loss
Although there is no explicit expression for the optimal ranking function under the hinge loss in this case, Theorem 5 (ii) suggests that there exists an integer-valued function whose risk is arbitrarily close to the minimum. In an attempt to find the best ranking function among integer-valued functions given the number of steps K, we consider a step function of the form f_A(x) = Σ_{i=1}^{K+1} i I(a_{i−1} < x ≤ a_i), where a_1 < · · · < a_K are the jump points, a_0 = −∞, and a_{K+1} = ∞. Note that f_A is nondecreasing in x, as the likelihood ratio is.
Using (1) with A_i = (a_{i−1}, a_i], we can explicitly calculate the risk of f_A as R_l(f_A) = Σ_{i=1}^{K+1} Σ_{j=i}^{K+1} (1 + j − i)[G_+(a_i) − G_+(a_{i−1})][G_−(a_j) − G_−(a_{j−1})], where G_+ and G_− are the cdfs of X_+ and X_−, and Φ is the cdf of the standard normal distribution. Given K, a necessary condition for risk minimization is then g_−(a_i) G_+(a_{i+1}) = g_+(a_i)(1 − G_−(a_{i−1})) for i = 1, . . . , K. (2) In the normal setting, (2) simplifies to exp(−a_i) Φ(a_{i+1} − 1) = exp(a_i)(1 − Φ(a_{i−1} + 1)) for i = 1, . . . , K. (3) By solving these equations for the a_i, we can identify the jump discontinuities of the step function with minimal risk given the number of steps. Table S3 in the supplement displays the solutions to the equations for small K that are obtained numerically. For example, when K = 1, the optimal ranking function has a jump at a_1 = 0, which coincides with the decision boundary of the Bayes rule if the problem were treated as binary classification. The sequences displayed in the table reveal a symmetry in the solutions to (3), which can be verified analytically. See the supplement for the proof. The dotted theoretical functions in Figure 2 are identified by the jump discontinuities in Table S3 for some values of K.
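Equation (3) can be solved with elementary numerics; the following sketch (names and the bisection tolerance are ours) recovers a_1 = 0 for K = 1 and, for K = 2, exploits the symmetry a_2 = −a_1 under which both equations in (3) collapse to a single equation in a_1.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# K = 1: with a_0 = -inf and a_2 = inf, (3) reduces to
# exp(-a_1) = exp(a_1), i.e., a_1 = 0.

# K = 2: with a_2 = -a_1, both equations in (3) collapse to
# h(a) = exp(-a) * Phi(-a - 1) - exp(a) = 0 for a = a_1 < 0.
def h(a):
    return math.exp(-a) * Phi(-a - 1.0) - math.exp(a)

lo, hi = -1.0, -0.5            # h(lo) > 0 > h(hi): a root is bracketed
for _ in range(60):            # plain bisection
    mid = 0.5 * (lo + hi)
    if h(mid) > 0:
        lo = mid
    else:
        hi = mid
a1 = 0.5 * (lo + hi)
print(a1, -a1)                 # the jump points (a_1, a_2) for K = 2
```

The same recursion extends to larger K by solving the K coupled equations jointly, e.g., with a multivariate root finder.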
Since log(g_+(x)/g_−(x)) = 2x, by taking C_ε = {x | x > −(log ε)/2}, we can check that G_−(C_ε) = Φ((log ε)/2 − 1) > 0 for all ε > 0. Hence, the optimal K at the population level is infinite by Corollary 1. The limiting sequence of jump discontinuities as K → ∞ can be identified using the recursive relation in (3). The supplementary material includes discussion of how to characterize the limiting sequence numerically, and calculation of the theoretical ranking risk (or 1 − AUC) of the step function f_A and the hinge risk of a linear ranking function f(x) = cx with c > 0 for comparison with f_A. In this normal setting, the minimum risk of f_A is shown to exceed the "Bayes" ranking risk, clearly indicating the inconsistency of the RankSVM. Further, a multivariate extension of the comparisons between ranking procedures in the univariate normal setting can be found in the supplementary material.

Application to Movie-Lens Data
To examine the implications of the theoretical findings in real applications, we took one of the Movie-Lens datasets (GroupLens-Research 2006). The dataset consists of 100,000 ratings for 1682 movies by 943 users. The ratings are on a scale of 1 to 5. In addition to the ratings, the dataset includes information about the movies such as release dates and genres and some demographic information about the users, which can be used as predictors. Among the features of a movie, we used its release date (year, month, and day) and genres (a movie can be in several genres at once) as explanatory variables for ranking. We also included age, gender, and occupation as user-specific factors. For simplicity, in our analysis, ratings are taken as the sampling unit instead of movies or users as in collaborative filtering.
We first excluded 10 ratings with incomplete data. A quick examination of the scatterplot of the proportion of movies rated 1 or 2 versus the number of movies rated for each user revealed that six users (id: 13, 181, 405, 445, 655, 774) are outliers in one of the metrics; either extremely critical or having rated unusually many movies. Exercising caution in modeling typical patterns in ratings, we further excluded their ratings from our analysis, which led to a total of 97,139 ratings. To turn this rating prediction into bipartite ranking, we dichotomized the ratings into "high" (4 or 5) and "low" (3 or below), which yields 54,806 high ratings. We standardized the predictors before conducting numerical experiments.
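The dichotomization step might look as follows; the column names are illustrative, not the actual Movie-Lens schema, and the toy rows are made up.

```python
import pandas as pd

# Hypothetical schema for illustration only.
ratings = pd.DataFrame({
    "user_id":  [1, 1, 2, 3],
    "movie_id": [10, 20, 10, 30],
    "rating":   [5, 3, 4, 2],
})

# "High" = rating of 4 or 5 (label +1); "low" = 3 or below (label -1).
ratings["label"] = ratings["rating"].apply(lambda r: 1 if r >= 4 else -1)
print(ratings["label"].tolist())  # -> [1, -1, 1, -1]
```

Positive-negative training pairs are then formed by crossing instances labeled +1 with those labeled −1.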
We compared three methods: RankBoost, RankSVM, and LogitBoost (a boosting algorithm for logistic regression; see Friedman, Hastie, and Tibshirani 2000). We included LogitBoost, which can be taken as a pointwise approach to ranking, primarily to examine the relation between the ranking functions from the pointwise and pairwise approaches. We used the fact that the population minimizer of the binomial deviance for LogitBoost is the true logit, log(P(Y = 1|X = x)/P(Y = −1|X = x)) = log(g+(x)/g−(x)) + log(π/(1 − π)). See Zhang (2004) for reference. As in the simulation studies, stumps were generated by selecting a variable and a threshold from the observed values at random, and used as weak learners for boosting. To handle large data and at the same time to put the three procedures on an equal footing except for the loss function employed, we devised a boosting (as forward stagewise additive modeling) algorithm for the hinge loss, dubbed "HingeBoost" in this article. The loss criteria determine the weights attached to the weak learners in boosting, and thus they drive the main difference between RankBoost and HingeBoost in the numerical comparisons. Since the hinge loss is not strictly convex, depending on the weak learners used, the optimal weight may not be determined uniquely. For HingeBoost, weight optimization was done by grid search, and when multiple minima existed, the smallest value was chosen. In addition, we examined the effect of sample size and input space on ranking functions and their accuracy by varying the combination of variables used in ranking and the number of training pairs (n+ × n−) from small (200 × 200) to large (1000 × 1000).
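The weight-selection step of the stagewise procedure described above can be sketched as follows. This is our own minimal rendering of the grid search, not the exact "HingeBoost" implementation; the function name and grid are assumptions.

```python
import numpy as np

def hinge_step_weight(f_pos, f_neg, h_pos, h_neg,
                      grid=np.linspace(0.0, 2.0, 201)):
    """One stagewise step: given current scores f and a fixed weak
    learner's outputs h on positive/negative instances, pick the weight w
    minimizing the empirical pairwise hinge risk
        mean([1 - ((f_i + w h_i) - (f_j + w h_j))]_+)
    by grid search.  Because the hinge loss is not strictly convex, the
    minimizer may not be unique; ties are broken toward the smallest w."""
    d_f = f_pos[:, None] - f_neg[None, :]
    d_h = h_pos[:, None] - h_neg[None, :]
    risks = np.array([np.maximum(0.0, 1.0 - (d_f + w * d_h)).mean()
                      for w in grid])
    return grid[np.argmin(risks)]  # argmin returns the first minimizer
```

For example, with zero current scores and a stump that outputs 1 on every positive and 0 on every negative, the pairwise hinge risk is (1 − w)_+ and the smallest minimizer is w = 1.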
In each experiment, we first set aside test data (of 10^6 pairs) chosen at random from the Movie-Lens data for evaluation. The same test dataset was used across different training sample sizes and variable combinations. High-low pairs in training data were formed by selecting an equal number of cases from each category at random from the remaining data, and an additional 10^6 pairs were randomly chosen from the rest for determining the number of iterations in boosting by taking the given loss criterion as the corresponding validation criterion. Ranking accuracy was then evaluated over the test pairs. This process was repeated 50 times. In each replicate, test data as well as validation data were fixed and only training sample sizes and variable combinations were changed. Across replicates, test data and validation data were varied. Table 2 provides a summary of the results with the mean AUC value and the standard error in parentheses for each setting. As the training set size increases, the ranking accuracy generally increases for all three methods. Among the main variables, the movie release year turns out to be a stronger predictor than genres, the user's occupation, and age. In terms of ranking accuracy, LogitBoost and RankBoost performed better than HingeBoost in general, and the differences become more pronounced as the number of training pairs increases. LogitBoost produced the highest mean AUC value for each setting. A pointwise approach is known to be more advantageous than a pairwise approach in terms of computational complexity. Additional numerical studies under a similar setting also suggest that for small samples the former tends to produce more stable rankings than the latter. The results in Table 2 indicate that the excess ranking error of RankBoost in comparison with LogitBoost tends to diminish as the sample size increases.
Figure 3 depicts the main effect of the movie release year in the ranking functions when fitted to a training set of a million pairs by the three methods with all the variables. Here, the main effect means the additive component of the ranking scores attributed to the corresponding variable, and it is taken to be centered at zero. Each panel contains sample curves of the estimated main effect from 10 replicates, which are distinguished by different colors, and the solid black line indicates their mean curve.
Overall, old films in the dataset tend to be rated high. Probably, they are the films that have survived for several decades for good reasons. The estimated effect of the movie release year peaks around 1940-1950, which includes such film classics as Citizen Kane, Vertigo, Casablanca, Rear Window, and The Seventh Seal, just to name a few. The main effects from RankBoost and LogitBoost in the figure are very similar except for a scale factor of 2, as suggested by the theory. In contrast, HingeBoost provides a crude approximation to the ranking scores from the other two procedures, removing some fine details captured by the two. The granularity in the main effect of HingeBoost (clearly visible in the individual main-effect curves) is due to the singularity of the hinge loss and its particular preference toward integer-valued scores. When it is coupled with discrete weak learners such as stumps, the extent of granularity becomes stronger.
Figure 3. The main effect of the movie release year in the ranking functions fitted to a million training pairs by RankBoost, LogitBoost, and HingeBoost with stumps as base learners. Sample curves from 10 replicates are drawn with different colors in each panel, and the solid black line indicates the mean curve. The rug plots in red and green are for a random sample from the movie data with labels 1 and −1, respectively.
For comparison of the overall ranking scores from the three procedures, scatterplots of the scores from RankBoost and HingeBoost versus those from LogitBoost are shown in Figure 4 for a replicate. Two times the score of RankBoost is very close to the score of LogitBoost in the figure, empirically confirming the theoretical findings in this article. HingeBoost with the stumps as base learners produces integer-valued ranking scores. Additional analysis with Gaussian kernel functions as smooth weak learners can be found in the supplementary material.
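The factor-of-2 relation between the RankBoost and LogitBoost scores follows directly from the population minimizers stated earlier; a small check in the normal toy model (function names ours) makes it concrete.

```python
import math

# Exponential-loss (RankBoost) population minimizer in the normal toy model:
#   f_exp(x) = (1/2) log(g+(x)/g-(x)) = x.
def f_exp(x):
    return x

# Binomial-deviance (LogitBoost) population minimizer is the logit:
#   log(g+(x)/g-(x)) + log(pi/(1 - pi)) = 2x + log(pi/(1 - pi)),
# i.e., twice f_exp up to an additive constant -- the scale factor of 2
# seen in the scatterplots.
def logit(x, pi=0.5):          # balanced classes, as in the experiments
    return 2.0 * x + math.log(pi / (1.0 - pi))

for x in (-1.0, 0.0, 2.5):
    assert abs(logit(x) - 2.0 * f_exp(x)) < 1e-12
```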

Conclusion and Discussion
We have investigated the properties of the best ranking functions under convex loss criteria at the population level in bipartite ranking problems, and have specified general conditions for ranking-calibrated loss functions. Our results show that the best ranking functions under convex ranking-calibrated loss criteria produce the same ordering as the likelihood ratio. The best ranking function specified for a certain class of loss functions including the exponential loss provides justification for the boosting method for maximizing the AUC.
For the AUC maximizing SVM (or the RankSVM), the result points to the undesirable property of potential ties in ranking, which could lead to inconsistency. Numerical results confirm these theoretical findings. In particular, it was observed that the ranking scores from the RankSVM exhibit granularity. Our result offers a much improved understanding of the RankSVM, and at the same time, provides due caution that, contrary to current practice and the widespread belief in the utility of the hinge loss in machine learning, ranking with the hinge loss is not consistent.
As for practical implications of the theoretical findings about the RankSVM, we need to carefully examine the effect of some of the factors involved in the ranking algorithm on ranking scores. As observed in the numerical examples, weak learners in boosting and kernel parameters such as bandwidth for the Gaussian kernel or degree for polynomial kernels are expected to be critical in determining the extent of granularity in rankings. A systematic study will be necessary to understand the operational relation between the factors in the algorithm and notable features of the resulting ranking function. In practice, such knowledge of the relation can be used to minimize potential ties in ranking.
Study of the theoretical relation between a loss criterion and the optimal ranking function is important not only for understanding consistency, but also for appropriate modification of ranking procedures to achieve goals other than minimization of the overall ranking error. For example, many ranking applications in web search and recommender systems focus only on those instances ranked near the top. Clémençon and Vayatis (2009) investigated the relation between AUC maximization and optimization of linear rank statistics including the mean reciprocal rank. It is worth extending the current results to study the impact of certain modifications proposed for specific aims in ranking, and further developing a principled framework for proper modification.

Supplementary Materials
The online supplementary materials contain: (i) technical proofs of the theorems; (ii) proof of the result in Table 1; (iii) detailed theoretical analysis of the numerical result in Section 5.1; (iv) a multivariate extension of the simulation study in Section 5.1; (v) additional analysis of the Movie-Lens data with Gaussian kernels.