Boosting and Maximum Likelihood for Exponential Models

We derive an equivalence between AdaBoost and the dual of a convex optimization problem, showing that the only difference between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models is that the latter requires the model to be normalized to form a conditional probability distribution over labels. In addition to establishing a simple and easily understood connection between the two methods, this framework enables us to derive new regularization procedures for boosting that directly correspond to penalized maximum likelihood. Experiments on UCI datasets support our theoretical analysis and give additional insight into the relationship between boosting and logistic regression.


Introduction
Several recent papers in statistics and machine learning have been devoted to the relationship between boosting and more standard statistical procedures such as logistic regression. In spite of this activity, an easy-to-understand and clean connection between these different techniques has not emerged. Friedman, Hastie and Tibshirani [7] note the similarity between boosting and stepwise logistic regression procedures, and suggest a least-squares alternative, but view the loss functions of the two problems as different, leaving the precise relationship between boosting and maximum likelihood unresolved. Kivinen and Warmuth [8] note that boosting is a form of "entropy projection," and Lafferty [9] suggests the use of Bregman distances to approximate the exponential loss. Mason et al. [10] consider boosting algorithms as functional gradient descent, and Duffy and Helmbold [5] study various loss functions with respect to the PAC boosting property. More recently, Collins, Schapire and Singer [2] show how different Bregman distances precisely account for boosting and logistic regression, and use this framework to give the first convergence proof of AdaBoost. However, in this work the two methods are viewed as minimizing different loss functions. Moreover, the optimization problems are formulated in terms of a reference distribution consisting of the zero vector, rather than the empirical distribution of the data, making the interpretation of this use of Bregman distances problematic from a statistical point of view.
In this paper we present a very basic connection between boosting and maximum likelihood for exponential models through a simple convex optimization problem. In this setting, it is seen that the only difference between AdaBoost and maximum likelihood for exponential models, in particular logistic regression, is that the latter requires the model to be normalized to form a probability distribution. The two methods minimize the same extended Kullback-Leibler divergence objective function subject to the same feature constraints. Using information geometry, we show that projecting the exponential loss model onto the simplex of conditional probability distributions gives precisely the maximum likelihood exponential model with the specified sufficient statistics. In many cases of practical interest, the resulting models will be identical; in particular, as the number of features increases to fit the training data, the two methods will give the same classifiers. We note that throughout the paper we view boosting as a procedure for minimizing the exponential loss, using either parallel or sequential update algorithms as in [2], rather than as a forward stepwise procedure as presented in [7] or [6].
Given the recent interest in these techniques, it is striking that this connection has gone unobserved until now. However, in general there may be many ways of writing the constraints for a convex optimization problem, and many different settings of the Lagrange multipliers (or Kuhn-Tucker vectors) that represent identical solutions. The key to the connection we present here lies in the use of a particular non-standard presentation of the constraints. When viewed in this way, there is no need for special-purpose Bregman distances to give a unified account of boosting and maximum likelihood, as we only make use of the standard Kullback-Leibler divergence. But our analysis gives more than a formal framework for understanding old algorithms; it also leads to new algorithms for regularizing AdaBoost, which is required when the training data is noisy. In particular, we derive a regularization procedure for AdaBoost that directly corresponds to penalized maximum likelihood using a Gaussian prior. Experiments on UCI data support our theoretical analysis, demonstrate the effectiveness of the new regularization method, and give further insight into the relationship between boosting and maximum likelihood exponential models.

Notation
Let $\mathcal{X}$ and $\mathcal{Y}$ be finite sets. We denote by $\mathcal{M}$ the set of non-negative measures on $\mathcal{X} \times \mathcal{Y}$, and by $\Delta \subset \mathcal{M}$ the set of conditional probability distributions, $\Delta = \{ m \in \mathcal{M} : \sum_{y \in \mathcal{Y}} m(x,y) = 1 \text{ for each } x \in \mathcal{X} \}$. For $m \in \mathcal{M}$, we will overload the notation $m(x,y)$ and $m(y \mid x)$; the latter will be suggestive of a conditional probability distribution, but in general it need not be normalized. Let $f_j : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, $j = 1, \ldots, m$, be given functions, which we will refer to as features. These will correspond to the weak learners in boosting, and to the sufficient statistics in an exponential model. Suppose that we have data $\{(x_i, y_i)\}_{i=1}^n$ with empirical distribution $\tilde p(x,y)$ and marginal $\tilde p(x)$; thus, $\tilde p(x,y) = \frac{1}{n} \sum_{i=1}^n \delta(x, x_i)\,\delta(y, y_i)$. We assume, without loss of generality, that $\tilde p(x) > 0$ for all $x$. Throughout the paper, we assume (for notational convenience) that the training data is consistently labeled: if $x_i = x_j$ then $y_i = y_j$, so that each $x$ appearing in the data has a unique label, which we denote $\tilde y(x)$.
For most data sets of interest, each $x$ appears only once, so that the assumption trivially holds. However, if some $x$ appears more than once, we require that it is labeled consistently. We make this assumption mainly to correspond with the conventions used to present boosting algorithms; it is not essential to what follows.

Given $\theta \in \mathbb{R}^m$, we define the exponential model
$q_\theta(y \mid x) = q_0(y \mid x)\, e^{f_\theta(x,y)}$, where $f_\theta(x,y) = \sum_{j=1}^m \theta_j f_j(x,y)$ and $q_0 \in \mathcal{M}$ is a fixed default model. The maximum likelihood estimation problem is to determine parameters $\theta$ that maximize the conditional log-likelihood $\ell(\theta) = \sum_{x,y} \tilde p(x,y) \log q_\theta(y \mid x)$, or equivalently minimize the log loss $-\ell(\theta)$. The objective function to be minimized in the multi-label boosting algorithm AdaBoost.M2 [2] is the exponential loss, given by $E_{\mathrm{M2}}(\theta) = \sum_{i=1}^n \sum_{y \neq y_i} e^{f_\theta(x_i, y) - f_\theta(x_i, y_i)}$. As has often been noted, the log loss and the exponential loss are qualitatively different: the exponential loss grows exponentially with increasing negative "margin," while the log loss grows only linearly.
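In the binary case this qualitative difference is easy to see numerically. The following small sketch (ours, not from the paper) evaluates both losses as a function of the margin $y f_\theta(x)$:

```python
import math

def log_loss(margin):
    # Binary logistic log loss as a function of the margin y*f(x):
    # -log q(y|x) = log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

def exp_loss(margin):
    # Binary AdaBoost exponential loss: exp(-margin)
    return math.exp(-margin)

# For large negative margins the exponential loss explodes, while the
# log loss grows roughly linearly (its slope approaches 1).
for m in [2.0, 0.0, -2.0, -5.0]:
    print(m, log_loss(m), exp_loss(m))
```

At margin $-5$ the exponential loss already exceeds $100$ while the log loss is close to $5$, which is why the exponential loss is far more sensitive to badly misclassified (e.g., noisy) examples.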

Correspondence Between AdaBoost and Maximum Likelihood
Since we are working with unnormalized models, we make use of the extended conditional Kullback-Leibler divergence, or $I$-divergence,

$D(p \,\|\, q) = \sum_x \tilde p(x) \sum_y \left( p(y \mid x) \log \frac{p(y \mid x)}{q(y \mid x)} - p(y \mid x) + q(y \mid x) \right)$,

defined on $\mathcal{M} \times \mathcal{M}$ (possibly taking on the value $\infty$). Note that if $p(\cdot \mid x) \in \Delta$ and $q(\cdot \mid x) \in \Delta$, then the last two terms cancel and this becomes the more familiar KL divergence for probabilities. Let features $f = (f_1, \ldots, f_m)$ and a fixed default distribution $q_0 \in \mathcal{M}$ be given. We define the feasible set $\mathcal{F}(f, \tilde p) \subset \mathcal{M}$ as

$\mathcal{F}(f, \tilde p) = \left\{ p \in \mathcal{M} : \sum_x \tilde p(x) \sum_y p(y \mid x)\left( f_j(x,y) - f_j(x, \tilde y(x)) \right) = 0, \; j = 1, \ldots, m \right\}.$

Since $\tilde p \in \mathcal{F}(f, \tilde p)$, this set is non-empty. Note that under the consistent data assumption, we have $E_{\tilde p}[f_j] = \sum_x \tilde p(x) f_j(x, \tilde y(x))$. Consider now the following two convex optimization problems, labeled $P_1$ and $P_2$:

$(P_1)$: minimize $D(p \,\|\, q_0)$ subject to $p \in \mathcal{F}(f, \tilde p)$;
$(P_2)$: minimize $D(p \,\|\, q_0)$ subject to $p \in \mathcal{F}(f, \tilde p) \cap \Delta$.
Thus, problem $P_2$ differs from $P_1$ only in that the solution is required to be normalized.
As we'll show, the dual problem $P_1^*$ corresponds to AdaBoost, and the dual problem $P_2^*$ corresponds to maximum likelihood for exponential models.
This presentation of the constraints is the key to making the correspondence between AdaBoost and maximum likelihood. Note that the constraint $\sum_x \tilde p(x) \sum_y p(y \mid x) f_j(x,y) = E_{\tilde p}[f_j]$, which is the usual presentation of the constraints for maximum likelihood (as dual to maximum entropy), doesn't make sense for unnormalized models, since the two sides of the equation may not be "on the same scale." Note further that attempting to rescale by dividing by the mass of $p(\cdot \mid x)$ would yield nonlinear constraints.
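As a concrete aid, here is a small numerical sketch (ours; the function name and array layout, indexed $[x][y]$, are illustrative assumptions, not from the paper) of the extended $I$-divergence defined above:

```python
import numpy as np

def i_divergence(p, q, px):
    # Extended conditional KL (I-) divergence between nonnegative
    # measures p(y|x) and q(y|x), weighted by the empirical marginal px:
    #   D(p||q) = sum_x px[x] * sum_y ( p log(p/q) - p + q )
    p, q, px = np.asarray(p, float), np.asarray(q, float), np.asarray(px, float)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogpq = np.where(p > 0, p * np.log(p / q), 0.0)
    return float(np.sum(px[:, None] * (plogpq - p + q)))

# When both arguments are conditional distributions (each row sums to 1),
# the -p + q terms cancel and this reduces to the usual conditional KL.
```

Note that the divergence is well defined, and zero only at $p = q$, even for unnormalized rows; this is what allows the same objective to serve both $P_1$ and $P_2$.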
We now derive the dual problems formally; the following section gives a precise statement of the duality result. To derive the dual problem $P_1^*$, we calculate the Lagrangian as

$L(p, \theta) = D(p \,\|\, q_0) - \sum_{j=1}^m \theta_j \sum_x \tilde p(x) \sum_y p(y \mid x) \left( f_j(x,y) - f_j(x, \tilde y(x)) \right).$

Minimizing over $p$ gives $p(y \mid x) = q_0(y \mid x)\, e^{f_\theta(x,y) - f_\theta(x, \tilde y(x))}$, so that the dual function is

$h(\theta) = \inf_p L(p, \theta) = \sum_x \tilde p(x) \sum_y q_0(y \mid x) \left( 1 - e^{f_\theta(x,y) - f_\theta(x, \tilde y(x))} \right),$

and the dual problem is to determine $\arg\max_\theta h(\theta)$. To derive the dual for $P_2$, we simply add additional Lagrange multipliers $\mu_x$ for the constraints $\sum_y p(y \mid x) = 1$.

Special cases
It is now straightforward to derive various boosting and logistic regression problems as special cases of the above optimization problems.
Case 1: AdaBoost.M2. Take $q_0(y \mid x) = 1$ for all $x, y$. Then maximizing $h(\theta)$ is equivalent to $\arg\min_\theta \sum_{i=1}^n \sum_{y \neq y_i} e^{f_\theta(x_i, y) - f_\theta(x_i, y_i)}$, which is minimization of the exponential loss $E_{\mathrm{M2}}$ of AdaBoost.M2.

Case 2: Binary AdaBoost. In addition to the assumptions for the previous case, now assume that $\mathcal{Y} = \{-1, +1\}$, and take $f_j(x, y) = \frac{1}{2}\, y\, h_j(x)$. Then the dual problem is given by $\arg\min_\theta \sum_i e^{-y_i \sum_j \theta_j h_j(x_i)}$, which is the optimization problem of binary AdaBoost.
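A minimal sketch of the binary case (ours; the weak learners are abstracted into a fixed matrix of $\pm 1$ stump outputs, and the function names are our own): the sequential AdaBoost update below exactly minimizes the exponential loss in a single coordinate.

```python
import numpy as np

def adaboost_exp_loss(H, y, theta):
    # Exponential loss sum_i exp(-y_i * sum_j theta_j h_j(x_i)).
    # H: n x m matrix of weak-learner outputs h_j(x_i) in {-1, +1}.
    return float(np.sum(np.exp(-y * (H @ theta))))

def adaboost_step(H, y, theta, j):
    # One sequential AdaBoost update: exact coordinate-wise minimization
    # of the exponential loss (assumes weighted error eps is in (0, 1)).
    w = np.exp(-y * (H @ theta))          # current example weights
    agree = y == H[:, j]
    eps = np.sum(w[~agree]) / np.sum(w)   # weighted error of h_j
    theta = theta.copy()
    theta[j] += 0.5 * np.log((1 - eps) / eps)
    return theta
```

Each exact coordinate step is the familiar $\frac{1}{2}\log\frac{1-\epsilon}{\epsilon}$ update, so this is precisely the exponential-loss view of AdaBoost used throughout the paper, rather than a forward stepwise fitting procedure.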
Case 3: Maximum Likelihood for Exponential Models. In this case we take the same setup as for AdaBoost.M2, but add the additional normalization constraints $\sum_y p(y \mid x) = 1$ for each $x$. If these constraints are satisfied, then the other constraints take the usual form $\sum_x \tilde p(x) \sum_y p(y \mid x) f_j(x,y) = E_{\tilde p}[f_j]$, and the connecting equation becomes $q_\theta(y \mid x) = \frac{1}{Z_x}\, q_0(y \mid x)\, e^{f_\theta(x,y)}$, where $Z_x = \sum_y q_0(y \mid x)\, e^{f_\theta(x,y)}$ is the normalizing term; this corresponds to setting the Lagrange multiplier $\mu_x$ to the appropriate value. In this case, after a simple calculation the dual problem is seen to be $\arg\max_\theta \sum_{x,y} \tilde p(x,y) \log q_\theta(y \mid x)$, which corresponds to maximum likelihood for a conditional exponential model with sufficient statistics $f_j(x,y)$. We note that it is not necessary to scale the features by a constant factor here, as in [7]; the correspondence between logistic regression and boosting is direct.

Case 4: Logistic Regression. Returning to the case of binary AdaBoost, we see that when we add normalization constraints as above, the model is equivalent to binary logistic regression.

Duality
Let $Q_1$ and $Q_2$ be defined as the following exponential families:

$Q_1 = \left\{ q \in \mathcal{M} : q(y \mid x) = q_0(y \mid x)\, e^{f_\theta(x,y) - f_\theta(x, \tilde y(x))}, \; \theta \in \mathbb{R}^m \right\}$
$Q_2 = \left\{ q \in \Delta : q(y \mid x) = \tfrac{1}{Z_x}\, q_0(y \mid x)\, e^{f_\theta(x,y)}, \; \theta \in \mathbb{R}^m \right\}$

Thus $Q_1$ is unnormalized while $Q_2$ is normalized. We now define the boosting solution $q^{\mathrm{boost}}$ and maximum likelihood solution $q^{\mathrm{ml}}$ as

$q^{\mathrm{boost}} = \arg\min_{q \in \bar Q_1} D(\tilde p \,\|\, q), \qquad q^{\mathrm{ml}} = \arg\min_{q \in \bar Q_2} D(\tilde p \,\|\, q),$

where $\bar Q$ denotes the closure of the set $Q \subset \mathcal{M}$. The following theorem corresponds to Proposition 4 of [3] for the usual KL divergence; in [4] the duality theorem is proved for a general class of Bregman distances, including the extended KL divergence as a special case. Note that we do not work with divergences such as $D(0 \,\|\, q)$ as in [2], but rather $D(\tilde p \,\|\, q)$, which is more natural and interpretable from a statistical point of view.

Theorem. Suppose that $D(\tilde p \,\|\, q_0) < \infty$. Then $q^{\mathrm{boost}}$ and $q^{\mathrm{ml}}$ exist, are unique, and satisfy

$q^{\mathrm{boost}} = \arg\min_{p \in \mathcal{F}} D(p \,\|\, q_0), \qquad q^{\mathrm{ml}} = \arg\min_{p \in \mathcal{F} \cap \Delta} D(p \,\|\, q_0) = \arg\min_{p \in \mathcal{F} \cap \Delta} D(p \,\|\, q^{\mathrm{boost}}).$

Figure 1: Geometric view of the duality theorem. Minimizing the exponential loss finds the member of $Q_1$ that intersects the feasible set of measures satisfying the moment constraints (left). When we impose the additional constraint that each conditional distribution $q_\theta(\cdot \mid x)$ must be normalized, we introduce a Lagrange multiplier for each training example $x$, giving a higher-dimensional family $Q_1'$. By the duality theorem, projecting the exponential loss solution onto the intersection of the feasible set with the simplex gives the maximum likelihood solution.
This result has a simple geometric interpretation. The unnormalized exponential family $Q_1$ intersects the feasible set $\mathcal{F}$ of measures satisfying the constraints (1) at a single point.
The algorithms presented in [2] determine this point, which is the exponential loss solution $q^{\mathrm{boost}}$. On the other hand, maximum conditional likelihood estimation for an exponential model with the same features is equivalent to problem $P_2$, whose dual family $Q_1'$ is the exponential family with additional Lagrange multipliers, one for each normalization constraint. The feasible set for this problem is $\mathcal{F} \cap \Delta$. Since $\mathcal{F} \cap \Delta \subset \mathcal{F}$, by the Pythagorean equality we have that $q^{\mathrm{ml}} = \arg\min_{p \in \mathcal{F} \cap \Delta} D(p \,\|\, q^{\mathrm{boost}})$ (see Figure 1, right).

Regularization
Minimizing the exponential loss or the log loss on real data often fails to produce finite parameters. Specifically, this happens when for some feature $f_j$,

$f_j(x,y) - f_j(x, \tilde y(x)) \le 0$ for all $y$ and all $x$ with $\tilde p(x) > 0$   (3)

or

$f_j(x,y) - f_j(x, \tilde y(x)) \ge 0$ for all $y$ and all $x$ with $\tilde p(x) > 0$.

This is especially harmful since the features for which (3) holds are often the most important for the purpose of discrimination. Of course, even when (3) does not hold, models trained by maximum likelihood or the exponential loss can overfit the training data. A standard regularization technique in the case of maximum likelihood employs parameter priors in a Bayesian framework. See [11] for non-Bayesian alternatives in the context of boosting.
In terms of convex duality, parameter priors for the dual problem correspond to "potentials" on the constraint values in the primal problem. The case of a Gaussian prior on $\theta$, for example, corresponds to a quadratic potential on the constraint values in the primal problem.
We now consider primal problems over $(p, c)$, where $p \in \mathcal{M}$ and $c \in \mathbb{R}^m$ is a parameter vector that relaxes the original constraints. Define $\mathcal{F}(f, \tilde p, c) \subset \mathcal{M}$ as

$\mathcal{F}(f, \tilde p, c) = \left\{ p \in \mathcal{M} : \sum_x \tilde p(x) \sum_y p(y \mid x) \left( f_j(x,y) - f_j(x, \tilde y(x)) \right) = c_j, \; j = 1, \ldots, m \right\},$

and consider the primal problem $P_1^{\mathrm{reg}}$ given by

$(P_1^{\mathrm{reg}})$: minimize $D(p \,\|\, q_0) + U(c)$ subject to $p \in \mathcal{F}(f, \tilde p, c)$,

where $U : \mathbb{R}^m \to \mathbb{R}$ is a convex function whose minimum is at $0$. To derive the dual problem, the Lagrangian is calculated as before, now including the relaxation term; minimizing over $c$ contributes the convex conjugate of $U$, and for the quadratic potential $U(c) = \sum_j \frac{c_j^2}{2\sigma_j^2}$, corresponding to a Gaussian prior, the dual function becomes $h(\theta) - \sum_j \frac{\sigma_j^2 \theta_j^2}{2}$, (5), i.e., the exponential loss plus a quadratic penalty on the parameters. A sequential update rule for (5) incurs the small additional cost of solving a one-dimensional nonlinear equation by Newton-Raphson at every iteration. See [1] for a discussion of this technique in the context of exponential models in statistical language modeling.
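For binary AdaBoost with the Gaussian-prior penalty, the sequential coordinate step no longer has a closed form. The following is a sketch of the Newton-Raphson inner loop (ours; the variable names and the penalty form $(\theta_j + \delta)^2 / (2\sigma^2)$ are assumptions matching the quadratic potential above):

```python
import math

def regularized_step(w_plus, w_minus, theta_j, sigma2, iters=20):
    # Newton-Raphson for the 1-d update delta minimizing
    #   w_plus*exp(-delta) + w_minus*exp(delta) + (theta_j + delta)^2/(2*sigma2)
    # w_plus / w_minus: total exponential-loss weight of the examples that
    # weak learner h_j classifies correctly / incorrectly.
    delta = 0.0
    for _ in range(iters):
        g = (-w_plus * math.exp(-delta) + w_minus * math.exp(delta)
             + (theta_j + delta) / sigma2)          # first derivative
        gp = (w_plus * math.exp(-delta) + w_minus * math.exp(delta)
              + 1.0 / sigma2)                       # second derivative > 0
        delta -= g / gp
    return delta
```

With $\sigma^2 \to \infty$ this recovers the unregularized AdaBoost step $\frac{1}{2}\log(w_+/w_-)$, while small $\sigma^2$ shrinks the update toward zero, keeping parameters finite even when condition (3) holds.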

Experiments
We performed experiments on some of the UC Irvine datasets in order to investigate the relationship between boosting and maximum likelihood empirically.The weak learner was the decision stump FindAttrTest as described in [6], and the training set consisted of a randomly chosen 90% of the data.Table 1 shows experiments with regularized boosting.
Two boosting models are compared. The first model, $q_1$, was trained on 10 features generated by FindAttrTest, excluding features satisfying condition (3). Training was carried out using the parallel update method described in [2]. The second model, $q_2$, was trained using the exponential loss with quadratic regularization. Performance was measured using the conditional log-likelihood of the (normalized) models over the training and test sets, denoted $\ell_{\mathrm{train}}$ and $\ell_{\mathrm{test}}$, as well as the test error rate $\epsilon_{\mathrm{test}}$. The table entries were averaged over 10-fold cross validation.
For the weak learner FindAttrTest, only the Iris dataset produced features that satisfy (3). On average, 4 out of the 10 features were removed. As the flexibility of the weak learner is increased, (3) is expected to hold more often. On this dataset, regularization improves both the test set log-likelihood and the error rate considerably. In datasets where $q_1$ shows significant overfitting, regularization improves both the log-likelihood measure and the error rate. In cases of little overfitting (according to the log-likelihood measure), regularization improves the test set log-likelihood at the expense of the training set log-likelihood, however without affecting test set error. The duality result suggests a possible explanation for the better performance of boosting with respect to $\ell_{\mathrm{test}}$: the boosting model is less constrained due to the lack of normalization constraints, and therefore has a smaller $I$-divergence to the uniform model. This may be interpreted as a higher extended entropy, or a less concentrated conditional model.
However, as $\ell_{\mathrm{train}}(q^{\mathrm{ml}}) \to 0$, the two models come to agree (up to identifiability). It is easy to show that for any exponential model $q_\theta \in Q_2$,

$D(q^{\mathrm{ml}} \,\|\, q_\theta) = \ell_{\mathrm{train}}(q^{\mathrm{ml}}) - \ell_{\mathrm{train}}(q_\theta).$

By taking $q_\theta = q^{\mathrm{boost}}$ (normalized), it is seen that as the difference between $\ell_{\mathrm{train}}(q^{\mathrm{ml}})$ and $\ell_{\mathrm{train}}(q^{\mathrm{boost}})$ gets smaller, the divergence between the two models also gets smaller. The empirical results are consistent with the theoretical analysis. As the number of features is increased so that the training data is fit more closely, the model matches the empirical distribution $\tilde p$ and the normalizing term $Z_x$ becomes a constant. In this case, normalizing the boosting model $q^{\mathrm{boost}}$ does not violate the constraints, and results in the maximum likelihood model. The experimental results for other UCI datasets were very similar.
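The post-hoc normalization used in these comparisons is simple to state in code (a sketch, ours; the actual experimental pipeline is not reproduced here):

```python
import numpy as np

def normalize_rows(q):
    # Project an unnormalized conditional model q[x][y] onto the simplex
    # by per-x normalization; this is what "normalized after training"
    # means for the boosting model in the comparisons below.
    q = np.asarray(q, float)
    return q / q.sum(axis=1, keepdims=True)

def cond_log_likelihood(q, labels, px):
    # l = sum_x p~(x) * log q(y(x) | x) for consistently labeled data,
    # where labels[x] is the index of the observed label y(x).
    q = np.asarray(q, float)
    picked = q[np.arange(len(labels)), labels]
    return float(np.sum(np.asarray(px, float) * np.log(picked)))
```

When the fitted model's normalizer is (nearly) constant across $x$, this row normalization leaves the moment constraints (nearly) intact, which is the regime where the boosting and maximum likelihood solutions coincide.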

Figure 2: Comparison of AdaBoost and maximum likelihood for the Sonar dataset. The top row compares $\ell_{\mathrm{train}}(q^{\mathrm{ml}})$ to $\ell_{\mathrm{train}}(q^{\mathrm{boost}})$ (left) and $\ell_{\mathrm{train}}(q^{\mathrm{ml}})$ to $D(q^{\mathrm{ml}} \,\|\, q^{\mathrm{boost}})$ (right). The bottom row shows the relationship between $\ell_{\mathrm{test}}(q^{\mathrm{ml}})$ and $\ell_{\mathrm{test}}(q^{\mathrm{boost}})$ (left) and between $\epsilon_{\mathrm{test}}(q^{\mathrm{ml}})$ and $\epsilon_{\mathrm{test}}(q^{\mathrm{boost}})$ (right).

Table 1: Comparison of unregularized to regularized boosting. For both the regularized model $q_2$ and the unregularized model $q_1$, the first column shows training log-likelihood ($\ell_{\mathrm{train}}$), the second column shows test log-likelihood ($\ell_{\mathrm{test}}$), and the third column shows test error rate ($\epsilon_{\mathrm{test}}$). Regularization reduces the error rate in some cases, while it consistently improves the test set log-likelihood measure on all datasets. All entries were averaged using 10-fold cross validation.

Next we performed a set of experiments to test how much $q^{\mathrm{boost}}$ differs from $q^{\mathrm{ml}}$, where the boosting model is normalized (after training) to form a conditional probability distribution. For different experiments, FindAttrTest generated a different number of features (10-100), and the training set was selected randomly. The top row in Figure 2 shows, for the Sonar dataset, the relationship between $\ell_{\mathrm{train}}(q^{\mathrm{ml}})$ and $\ell_{\mathrm{train}}(q^{\mathrm{boost}})$, as well as between $\ell_{\mathrm{train}}(q^{\mathrm{ml}})$ and $D(q^{\mathrm{ml}} \,\|\, q^{\mathrm{boost}})$. As the number of features increases so that the training data is more closely fit ($\ell_{\mathrm{train}}(q^{\mathrm{ml}}) \to 0$), the boosting and maximum likelihood models become more similar, as measured by the KL divergence. This result does not hold when the model is unidentifiable, in which case the two parameter vectors can diverge in arbitrary directions. The bottom row in Figure 2 shows the relationship between the test set log-likelihoods, $\ell_{\mathrm{test}}(q^{\mathrm{ml}})$ and $\ell_{\mathrm{test}}(q^{\mathrm{boost}})$, together with the test set error rates $\epsilon_{\mathrm{test}}(q^{\mathrm{ml}})$ and $\epsilon_{\mathrm{test}}(q^{\mathrm{boost}})$. In these figures the test set was chosen to be 50% of the total data. In order to indicate the number of points at each error rate, each circle was shifted by a small random value to avoid points falling on top of each other. While the plots indicate that $\ell_{\mathrm{train}}(q^{\mathrm{ml}}) \ge \ell_{\mathrm{train}}(q^{\mathrm{boost}})$, as expected, on the test data the trend is reversed, so that $\ell_{\mathrm{test}}(q^{\mathrm{ml}}) \le \ell_{\mathrm{test}}(q^{\mathrm{boost}})$. Identical experiments on the Hepatitis, Glass and Promoters datasets gave similar results, omitted due to lack of space.