Multimodal Oriented Discriminant Analysis

Linear discriminant analysis (LDA) has been an active topic of research during the last century. However, the existing algorithms have several limitations when applied to visual data. LDA is only optimal for Gaussian distributed classes with equal covariance matrices, and only c − 1 features can be extracted. Moreover, LDA does not scale well to high dimensional data (over-fitting), and it cannot optimally handle multimodal distributions. In this paper, we introduce Multimodal Oriented Discriminant Analysis (MODA), an LDA extension which can overcome these drawbacks. A new formulation and several algorithmic novelties are proposed.


Introduction
Canonical Correlation Analysis (CCA), Independent Component Analysis (ICA), Linear Discriminant Analysis (LDA), and Principal Component Analysis (PCA) are some examples of subspace methods (SM) useful for classification, dimensionality reduction and data modeling. These methods have been actively researched by the statistics, neural networks, machine learning and vision communities during the last century. The modeling power of SM can be especially useful as the available data increases in features/samples, since there is a need for dimensionality reduction while preserving relevant attributes of the data. Another benefit of many subspace methods is that they can be computed as an eigenvalue or singular value type of problem, for which there are efficient numerical packages. An obvious drawback of SM is their linear assumptions; however, recent extensions based on kernel methods and latent variable models can overcome some of these limitations.

Appearing in Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, 2005. Copyright 2005 by the author(s)/owner(s).
Among several classification methods (e.g. Support Vector Machines, decision trees), LDA remains a powerful preliminary tool for dimensionality reduction that preserves discriminative features and avoids the "curse of dimensionality". However, there exist several limitations of current LDA techniques (Fukunaga, 1990; Hastie et al., 2001). LDA is optimal only in the case that all the classes are Gaussian distributed with equal covariances. Due to this assumption, the maximum number of features that can be extracted is c − 1, where c is the number of classes. Another common problem when dealing with high dimensional data is the small sample size problem (Yu & Yang, 2001), that is, the training set has more "dimensions" (pixels) than data samples¹. In this situation LDA overfits, and PCA techniques usually outperform LDA (Martinez & Kak, 2003). Moreover, the computational/storage requirements of computing LDA directly from covariance matrices are impractical. In this paper we introduce Multimodal Oriented Discriminant Analysis (MODA), a new low dimensional discriminative technique that is optimal for multimodal Gaussian classes with different covariances. MODA is able to efficiently deal with the small sample case and scales well to very high dimensional data, avoiding overfitting effects. There is no closed-form solution for the optimal values of MODA, and an iterative majorization scheme is proposed to search for a local optimum. Finally, a new view and formulation of LDA is introduced, which gives some new insights. Figure 1 illustrates the main purpose of this paper.

Linear Discriminant Analysis
The aim of LDA is to project the data into a lower dimensional space, so that the classes are as compact and as far from each other as possible. Many closed-form solutions for LDA are based on the following covariance matrices²:

S_b = (1/n) ∑_{i=1}^{c} n_i (m_i − m)(m_i − m)^T
S_w = (1/n) ∑_{i=1}^{c} ∑_{j ∈ class i} (d_j − m_i)(d_j − m_i)^T
S_t = (1/n) ∑_{j=1}^{n} (d_j − m)(d_j − m)^T

D ∈ R^{d×n} is the data matrix, where each column is a vectorized data sample. d denotes the number of features, n the number of samples and c the number of classes. m = (1/n) D 1_n is the mean vector over all the classes and m_i is the mean vector of class i. n_i is the number of samples of class i and ∑_{i=1}^{c} n_i = n. The covariance matrices can also be written in matrix form by means of projection matrices P_i (i.e. P_i^T = P_i and P_i^2 = P_i) and a dummy indicator matrix G ∈ R^{n×c} encoding the class membership of each sample. S_b is the between-class covariance matrix and represents the average of the distances between the class means. S_w is the within-class covariance matrix and is a measure of the average compactness of each class. Finally, S_t is the total covariance matrix. With the matrix expressions, it is straightforward to show that S_t = S_w + S_b. The upper bounds on the ranks of S_b, S_w and S_t are c − 1, n − c and n − 1 respectively.

¹ In this case the true dimensionality of the data is the number of samples, not the number of pixels.
² Bold capital letters denote a matrix D, bold lower-case letters a column vector d. d_j represents the j-th column of the matrix D. All non-bold letters represent scalar variables. diag is an operator which transforms a vector into a diagonal matrix. 1_k ∈ R^{k×1} is a vector of ones.
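The scatter matrices and the identity S_t = S_w + S_b can be checked numerically. The following is a minimal numpy sketch (not the paper's code; the toy data and class sizes are invented, and samples are stored as rows rather than columns for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: c = 3 classes, d = 5 features, unequal class sizes.
X = [rng.normal(loc=i, size=(10 + 5 * i, 5)) for i in range(3)]
D = np.vstack(X)                       # n x d (rows are samples here)
labels = np.concatenate([np.full(len(x), i) for i, x in enumerate(X)])
n, d = D.shape
m = D.mean(axis=0)                     # global mean

Sb = np.zeros((d, d))                  # between-class covariance
Sw = np.zeros((d, d))                  # within-class covariance
for i in range(3):
    Di = D[labels == i]
    ni = len(Di)
    mi = Di.mean(axis=0)
    Sb += (ni / n) * np.outer(mi - m, mi - m)
    Sw += (Di - mi).T @ (Di - mi) / n

St = (D - m).T @ (D - m) / n           # total covariance

# The decomposition St = Sw + Sb holds exactly.
print(np.allclose(St, Sw + Sb))        # True
```

With these normalizations (all three matrices divided by n) the identity holds exactly, which is a useful sanity check for any implementation.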
Rayleigh-like quotients are among the most popular LDA optimization criteria (Fukunaga, 1990). Some are:

max_B tr( (B^T S_2 B)^{-1} (B^T S_1 B) ),

where S_1 ∈ {S_b, S_b, S_t} and S_2 ∈ {S_w, S_t, S_w}. A closed-form solution to these optimization problems is given by the generalized eigenvalue problem S_1 B = S_2 B Λ. The generalized eigenvalue problem can be solved as a joint diagonalization, that is, finding a common basis B which simultaneously diagonalizes both matrices S_1 and S_2 (i.e. B^T S_2 B = I and B^T S_1 B = Λ).
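The joint diagonalization can be sketched in a few lines of numpy. This is an illustrative reduction of S_1 B = S_2 B Λ to an ordinary symmetric eigenproblem via a Cholesky whitening of S_2 (the synthetic S_1, S_2 stand in for S_b, S_w and are not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
# Synthetic symmetric matrices standing in for S_1 (e.g. S_b, low rank)
# and S_2 (e.g. S_w, full rank).
A = rng.normal(size=(d, 2))
S1 = A @ A.T
W = rng.normal(size=(d, d))
S2 = W @ W.T + np.eye(d)

# Reduce S_1 B = S_2 B Lambda to an ordinary symmetric eigenproblem:
# with S_2 = C C^T (Cholesky), solve eigh(C^{-1} S_1 C^{-T}) and map back.
C = np.linalg.cholesky(S2)
Ci = np.linalg.inv(C)
lam, V = np.linalg.eigh(Ci @ S1 @ Ci.T)
B = Ci.T @ V

# B jointly diagonalizes both matrices: B^T S_2 B = I and B^T S_1 B = Lambda.
print(np.allclose(B.T @ S2 @ B, np.eye(d)))     # True
print(np.allclose(B.T @ S1 @ B, np.diag(lam)))  # True
```

The discriminative projection is then given by the eigenvectors associated with the largest generalized eigenvalues.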

Oriented Discriminant Analysis
LDA is the optimal linear projection only in the case of Gaussian classes with equal covariance matrices (Campbell, 1984; Duda et al., 2001; Hastie et al., 2001), assuming enough training data. Fig. 2 shows a situation where two classes have almost orthogonal principal directions of their covariances and close means. In this pathological case, LDA chooses the worst possible discriminative direction, where the classes overlap (it is also numerically very unstable), whereas ODA finds a better projection. In general, this situation becomes more problematic as the number of classes increases.
In order to solve this problem, several authors have proposed extensions and new views of LDA. Campbell (Campbell, 1984) derives a maximum likelihood approach to discriminant analysis. Assuming that all the classes have equal covariance matrices, Campbell shows that LDA is equivalent to imposing that the class means lie in an l-dimensional subspace. Following this approach, Kumar and Andreou (Kumar & Andreou, 1998) proposed heteroscedastic discriminant analysis, which incorporates the estimation of the means and covariances in the low dimensional space. On the other hand, Saon et al. (Saon et al., 2000) define a new energy function J(B) to model the directionality of the data. Taking a different viewpoint, Hastie et al. (Hastie et al., 2001; Hastie et al., 1995) have proposed several LDA extensions by modifying the following regression problem:

min_{B,V} || V G^T − B^T D ||_F^2,   (2)

where V ∈ R^{k×c} is a scoring matrix (Hastie et al., 1995). Similarly, Gallinari et al. (Gallinari et al., 1991) have also shown the connection between LDA and regression by minimizing a related least-squares criterion. These approaches are appealing for several reasons. First, if the dummy matrix G contains 0's and 1's, the mapping gives a linear approximation of Bayes' posterior probability, and if g_ij = n_i/n it returns classical LDA. On the other hand, Hastie et al. (Hastie et al., 2001; Hastie et al., 1995) have modified eq. 2 to take into account more than linear functions: for instance, they replace B^T D by B f(D), where f maps the original data (similarly to kernel methods), introducing Flexible Discriminant Analysis (FDA), or they add regularization terms to eq. 2 (e.g. f_2(B)), introducing Penalized Discriminant Analysis (PDA). Although similar in spirit, our work differs in several aspects: we provide a new and probabilistic interpretation; we model the covariances directly in the original space rather than mapping the data to a higher dimensional space, where usually the parameters and a functional form of the kernels need to be chosen; our method scales naturally to very high dimensional data; and linear algorithms are developed to learn this model. We also show that ODA and MODA are consistent generalizations of LDA, whereas regression approaches have some limitations (Hastie et al., 2001; Gallinari et al., 1991) when the number of classes increases.

Maximizing Kullback-Leibler divergence.
In this section, we derive the optimal linear dimensionality reduction for Gaussian distributed classes with different covariances. A simple measure of distance between two Gaussian distributions N(x; µ_i, Σ_i) and N(x; µ_j, Σ_j) is given by the symmetric Kullback-Leibler (KL) divergence (Fukunaga, 1990):

KL_ij = (1/2) tr( Σ_i^{-1} Σ_j + Σ_j^{-1} Σ_i − 2 I_d ) + (1/2) (µ_i − µ_j)^T (Σ_i^{-1} + Σ_j^{-1}) (µ_i − µ_j).   (3)

We assume that each class i is modeled as a Gaussian N(µ_i, Σ_i), and the aim of ODA is to find a linear transformation B ∈ R^{d×k}, common to all the classes (i.e. N(B^T µ_i, B^T Σ_i B) ∀i), such that it maximizes the separability (Kullback-Leibler divergence) between the classes in the low dimensional space, that is:

B* = argmax_B ∑_{i=1}^{c} ∑_{j=1}^{c} KL_ij(B).   (4)

After some algebraic arrangements (de la Torre & Kanade, 2005), the previous equation can be expressed in a more compact and enlightening manner:

G(B) = − ∑_{i=1}^{c} ∑_{j=1}^{c} tr( (B^T Σ_i B)^{-1} B^T A_ij B ),  with A_ij = Σ_j + (µ_i − µ_j)(µ_i − µ_j)^T.   (5)

Observe that a negative sign is introduced for convenience; rather than searching for a maximum, a minimum of G(B) will be found.
Several interesting things are worth pointing out from eq. 5. If all the covariances are the same (i.e. Σ_i = Σ_w ∀i), eq. 5 reduces (up to constants) to tr( (B^T S_w B)^{-1} B^T S_b B ), which is exactly what LDA maximizes. ODA takes into account not just the distance between the means but also the orientation and magnitude of the covariances. In the LDA case, the number of extracted features cannot exceed the number of classes, because the rank of S_b is c − 1; however, ODA does not have this constraint and more features can be obtained. Unfortunately, due to the different normalization factors (B^T Σ_i B)^{-1}, eq. 5 does not have a closed-form solution in terms of an eigenvalue problem.
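The projected KL divergence of eqs. 3-4 is simple to compute. Below is a minimal numpy sketch (the two Gaussians and the projection direction are invented for illustration, not taken from the paper):

```python
import numpy as np

def sym_kl(mu_i, Si, mu_j, Sj):
    """Symmetric KL divergence between N(mu_i, Si) and N(mu_j, Sj)."""
    d = len(mu_i)
    Si_inv, Sj_inv = np.linalg.inv(Si), np.linalg.inv(Sj)
    dm = mu_i - mu_j
    return 0.5 * np.trace(Si_inv @ Sj + Sj_inv @ Si - 2 * np.eye(d)) \
         + 0.5 * dm @ (Si_inv + Sj_inv) @ dm

# Two 3-D Gaussians and a projection B onto k = 1 dimension.
mu1, mu2 = np.array([0., 0., 0.]), np.array([2., 0., 0.])
S1 = np.diag([1., 1., 1.])
S2 = np.diag([1., 4., 1.])
B = np.array([[1.], [0.], [0.]])

# Divergence in the projected space: N(B^T mu, B^T Sigma B).
kl_proj = sym_kl(B.T @ mu1, B.T @ S1 @ B, B.T @ mu2, B.T @ S2 @ B)
print(kl_proj)   # 4.0 for this particular choice of B
```

ODA searches over B so as to maximize the sum of such pairwise projected divergences, which is exactly what eq. 4 states.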

Multimodal Oriented Discriminant Analysis
In the previous section, it has been shown that ODA is the optimal linear transform for class separability in the case of Gaussian distributions with arbitrary (full rank) covariances. However, in many situations the class distributions are not Gaussian. For instance, it is likely that the manifold of the facial appearance of a person under different illumination, expression, and poses is highly non-Gaussian. In this section, MODA, an extension of ODA that is able to model multimodal classes, is described.
In order to model multimodal distributions, the training data for each class is first clustered using recent advances in multi-way normalized cuts (Yu & Shi, 2003).
Once the input space has been clustered for each class, eq. 5 is modified to maximize the distances between the clusters of different classes, that is:

G(B) = − ∑_{i=1}^{c} ∑_{j≠i} ∑_{r1 ∈ C_i} ∑_{r2 ∈ C_j} tr( (B^T Σ_i^{r1} B)^{-1} B^T A_ij^{r1r2} B ),  with A_ij^{r1r2} = Σ_j^{r2} + (µ_i^{r1} − µ_j^{r2})(µ_i^{r1} − µ_j^{r2})^T,   (6)

where µ_i^{r1} is the mean of the r_1 cluster of class i, and r_1 ∈ C_i sums over all the clusters belonging to class i. KL_ij^{r1r2} denotes the Kullback-Leibler divergence between the r_1 cluster of class i and the r_2 cluster of class j. Observe that MODA looks for a projection matrix B which maximizes the KL divergence between clusters of different classes, but it does not maximize the distance between the clusters of the same class.
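The structure of the MODA objective, summing projected divergences only over cluster pairs of different classes, can be sketched as follows. This is an illustrative toy (the XOR-like cluster layout, the helper names `sym_kl` and `moda_objective`, and the 2-D example are all invented for this sketch):

```python
import numpy as np

def sym_kl(mu_a, Sa, mu_b, Sb):
    """Symmetric KL divergence between two Gaussians."""
    d = len(mu_a)
    Sa_inv, Sb_inv = np.linalg.inv(Sa), np.linalg.inv(Sb)
    dm = mu_a - mu_b
    return 0.5 * np.trace(Sa_inv @ Sb + Sb_inv @ Sa - 2 * np.eye(d)) \
         + 0.5 * dm @ (Sa_inv + Sb_inv) @ dm

def moda_objective(clusters, B):
    """Sum of projected KL divergences over cluster pairs of *different* classes.

    clusters: list of (class_label, mean, covariance) triples, one per cluster.
    """
    total = 0.0
    for ci, mu_i, Si in clusters:
        for cj, mu_j, Sj in clusters:
            if ci != cj:     # same-class cluster pairs are excluded
                total += sym_kl(B.T @ mu_i, B.T @ Si @ B,
                                B.T @ mu_j, B.T @ Sj @ B)
    return total

# Two classes with two clusters each (an XOR-like layout in 2-D), k = 1.
I2 = np.eye(2)
clusters = [(0, np.array([ 3.,  3.]), I2), (0, np.array([-3., -3.]), I2),
            (1, np.array([ 3., -3.]), I2), (1, np.array([-3.,  3.]), I2)]
B = np.array([[1.], [1.]]) / np.sqrt(2)
print(moda_objective(clusters, B) > 0)   # True
```

MODA then searches for the B that maximizes this quantity (equivalently, minimizes G(B) in eq. 6).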
As in the case of ODA, there is no closed-form expression for the maximum of eq. 6. However, if all the covariances are the same (i.e. Σ_i^{r1} = Σ ∀ i, r_1), there exists a closed-form solution that can give a new insight into LDA. Hastie et al. (Hastie et al., 2001; Hastie et al., 1995) have introduced Mixture Discriminant Analysis (MDA) to overcome similar situations; however, MODA differs in several aspects. First, it uses spectral graph methods to cluster the data, because they accommodate high dimensional data better and are less prone to local minima in comparison with k-means type algorithms. Secondly, it is not clear how MDA is able to model Gaussian distributions with different high dimensional covariances. Finally, MDA implicitly constrains the clusters of the same class to be far from each other, whereas MODA does not have this constraint (e.g. fig. 4.b).

Bound optimization
Eq. 6 is hard to optimize; second-order gradient methods (e.g. Newton or conjugate gradient) do not scale well to huge matrices (e.g. B ∈ R^{d×l}). Moreover, the second derivative of eq. 6 is quite complex. In this section, we use a bound optimization method called iterative majorization (Heiser, 1997; Leeuw, 1994; Kiers, 1995), which is able to monotonically reduce the value of the energy function. Although this type of optimization technique is not common in the vision/learning community, it is very similar to Expectation Maximization (EM) type algorithms.

Iterative Majorization
Iterative majorization is a monotonically convergent method developed in the area of statistics (Heiser, 1997; Leeuw, 1994; Kiers, 1995), and it is able to solve relatively complicated problems in a straightforward manner. The main idea is to find, at each iteration, a surrogate function that is easier to minimize/maximize than the original (e.g. a quadratic function). The first step in minimizing G(B), eq. 6, is to find a function L(B) which majorizes G(B), that is, L(B) ≥ G(B) ∀B and L(B_0) = G(B_0), where B_0 is the current estimate. The function L(B) should be easier to minimize than G(B). A minimum of L(B), B_1, is guaranteed to decrease the energy of G(B). This is easy to show, since G(B_1) ≤ L(B_1) ≤ L(B_0) = G(B_0). This is called the "sandwich" inequality by De Leeuw (Leeuw, 1994). Each update of the majorization will improve the value of the function, and if the function is bounded below, the value of G(B) will decrease monotonically. Under these conditions the procedure is guaranteed to stop at a local optimum.
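The "sandwich" mechanism is easy to see on a scalar toy problem. The sketch below (not from the paper) minimizes f(x) = |x − 3| + x², whose minimum is at x = 0.5, by repeatedly minimizing the classical quadratic majorizer of the absolute value, |x − 3| ≤ (x − 3)²/(2w) + w/2 with w = |x_0 − 3|:

```python
# Objective with a non-smooth term: f(x) = |x - 3| + x^2, minimized at x = 0.5.
f = lambda x: abs(x - 3) + x ** 2

x = 0.0
values = [f(x)]
for _ in range(50):
    w = abs(x - 3)          # the current estimate fixes the majorizer
    # Quadratic majorizer: |x-3| <= (x-3)^2/(2w) + w/2, with equality at the
    # current x. Minimizing L(x) = (x-3)^2/(2w) + w/2 + x^2 in closed form:
    x = 3.0 / (1.0 + 2.0 * w)
    values.append(f(x))

# The sandwich inequality guarantees a monotonic decrease of f.
print(all(b <= a + 1e-12 for a, b in zip(values, values[1:])))  # True
print(round(x, 4))                                              # 0.5
```

Each surrogate touches f at the current iterate and lies above it everywhere else, so every surrogate minimization can only decrease f — the same argument that guarantees convergence of the majorization of eq. 6.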
Iterative majorization is very similar to EM (Neal & Hinton, 1998) type algorithms, which have been extensively used by the machine learning and computer vision communities. The EM algorithm is an iterative algorithm used to find a local maximum of the log likelihood log p(D|θ), where D is the data and θ are the parameters. Rather than maximizing the log likelihood directly, EM uses Jensen's inequality to find a lower bound, log p(D|θ) = log ∫ q(h) (p(D,h|θ)/q(h)) dh ≥ ∫ q(h) log (p(D,h|θ)/q(h)) dh, which holds for any distribution q(h). The Expectation step performs a functional approximation of this lower bound, that is, it finds the distribution q(h) which maximizes the bound and touches the log likelihood at the current parameter estimate θ_n. In fact, the optimal q(h) is the posterior probability of the latent/hidden variables given the data (i.e. p(h|D, θ_n)). The Maximization step maximizes the lower bound w.r.t. the parameters θ. The E-step in EM is thus equivalent to the construction of the majorizing function, and the M-step minimizes/maximizes this upper/lower bound.

Constructing a majorization function
In order to find a function which majorizes G(B), an inequality of the form ||X_i − Y_i||_F^2 ≥ 0 is used (Kiers, 1995), where it is assumed that the required factorizations of the matrices involved are possible. Expanding and rearranging this inequality, and summing the resulting bounds over all the terms of eq. 6, yields a function L(B) which majorizes G(B). The function L(B) is quadratic in B and hence easier to minimize. After rearranging terms, a necessary condition for the minimum of L(B) is:

∑_i T_i = ∑_i Σ_i B F_i,   (9)

where T_i and F_i are matrices computed from the current estimate B_0; see (J.R. Magnus, 1999) for matrix derivatives. Finding the solution of eq. 9 involves solving a system of linear equations. A closed-form solution could be achieved by vectorizing eq. 9 with Kronecker products; however, the resulting system would have dimensions (d·l) × (d·l), which is not efficient in either space or time. Instead, a gradient descent algorithm which minimizes

E(B) = || ∑_i T_i − ∑_i Σ_i B F_i ||_F^2   (10)

is used. Due to the huge number of unknowns (d × l), an effective and linear-time algorithm to solve for the optimum of eq. 10 is normalized gradient descent, B^{n+1} = B^n − η ∇E(B^n)/||∇E(B^n)||_F. After some derivation, it can be shown that the optimal step size η has a closed-form expression (de la Torre & Kanade, 2005).
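The Kronecker vectorization mentioned above, and the size blow-up it causes, can be checked directly with the standard identity vec(A X B) = (B^T ⊗ A) vec(X). A small numpy sketch (random matrices, small illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
d, l = 6, 3
Sigma = rng.normal(size=(d, d))
F = rng.normal(size=(l, l))
B = rng.normal(size=(d, l))

# vec(Sigma @ B @ F) = (F^T kron Sigma) vec(B), with column-major vec.
vec = lambda M: M.reshape(-1, order="F")
lhs = vec(Sigma @ B @ F)
rhs = np.kron(F.T, Sigma) @ vec(B)
print(np.allclose(lhs, rhs))             # True

# The vectorized system has (d*l) x (d*l) coefficients: 18 x 18 here, but
# astronomically large for image-sized d, hence the gradient descent.
print(np.kron(F.T, Sigma).shape)         # (18, 18)
```

For a 100 × 100 image (d = 10000) even a modest l makes the (d·l) × (d·l) coefficient matrix intractable, which motivates the normalized gradient descent of eq. 10.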

Dealing with high dimensional data
When applying any classifier to visual data, a major problem is the high dimensionality of the images. Several strategies are necessary to achieve good generalization, such as feature selection or dimensionality reduction techniques (PCA, LDA, etc.). In this context, LDA or MODA can be a good initial step to extract discriminative features. However, as is well known, dimensionality reduction techniques such as LDA that preserve discriminative power cannot handle very well the case n << d (more pixels than training data), which is the typical one. For instance, an image of 100 × 100 pixels corresponds to a feature vector of 10000 dimensions, which induces covariance matrices of size 10000 × 10000. To make the covariance full rank, at least 10000 independent samples would be necessary, and even then the estimate would be poor. In this scenario, working with huge covariance matrices presents two major problems: computational tractability (storage, efficiency and rank deficiency) and generalization.
To address the computational aspect, one straightforward approach is to observe that if d >> n, the true dimensionality of D ∈ R^{d×n} is n. Therefore, we can project onto the first n principal components without losing any discriminative power. Besides the computational aspects, the second and more important problem is the lack of generalization when too few samples are available. As noticed by Hughes (Hughes, 1968), increasing the dimensionality should enhance recognition performance (more information is added), but due to the lack of training data this rarely occurs. Fukunaga (Fukunaga & Hayes, 1989) studied the effects of finite data sets on linear and quadratic classifiers, and concluded that the number of samples should be proportional to the dimension for linear classifiers and to the square of the dimension for quadratic classifiers. In this case, LDA over-fits the data and does not generalize well to new samples. One way to understand over-fitting is to consider eq. 2. There are O(k × n) equations and O(d × k) unknowns (B)³. Without enough training data, eq. 2 is an underdetermined system of equations with infinitely many solutions. In other words, if there are more features than training samples, directly minimizing LDA will result in a dimensionality reduction that acts as an associative memory rather than learning anything (no regression is done), and no good generalization will be achieved.

³ Orthogonality of B is not assumed.
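The rank deficiency that underlies this over-fitting is easy to demonstrate. A short numpy sketch (synthetic Gaussian data, not the paper's experiments) shows that with n << d the within-class covariance has rank at most n − c and is therefore singular:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, c = 100, 30, 3                    # more "pixels" than samples
D = rng.normal(size=(d, n))             # columns are samples
labels = np.repeat(np.arange(c), n // c)

Sw = np.zeros((d, d))
for i in range(c):
    Di = D[:, labels == i]
    mi = Di.mean(axis=1, keepdims=True)
    Sw += (Di - mi) @ (Di - mi).T / n

# With n << d the within-class covariance is rank deficient (rank <= n - c),
# so it cannot be inverted and LDA-style criteria are ill-posed.
print(np.linalg.matrix_rank(Sw))        # at most n - c = 27, far below d = 100
```

Since S_w cannot be inverted, any direction in its (d − n + c)-dimensional null space makes the Rayleigh quotient blow up, which is exactly the over-fitting behaviour described above.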
In order to generalize better than LDA and not suffer from storage/computational requirements, our solution approximates each covariance matrix as a sum of outer products plus a scaled identity matrix, Σ_i ≈ U_i Λ_i U_i^T + σ_i^2 I_d. After optimizing the parameters, it can be shown (de la Torre & Kanade, 2005) that the optimal U_i and Λ_i are given by the leading eigenvectors and eigenvalues of the covariance matrix Σ_i, and σ_i^2 by the average of the remaining eigenvalues. The same expression could be derived using probabilistic PCA (Moghaddam & Pentland, 1997; Tipping & Bishop, 1999).
It is worthwhile to point out two important aspects of the previous factorization. Factorizing the covariance as a sum of outer products and a scaled identity matrix is an efficient (in space and time) manner to deal with the small sample case: storing or computing the full d × d covariance is not required. On the other hand, the original covariance has d(d + 1)/2 free parameters, and after the factorization the number of parameters is reduced to l(2d − l + 1)/2 (assuming orthogonality of U_i), so much less data is needed to estimate them and the model is less prone to over-fitting. Also, the spectral properties of the matrix are not altered: the eigenvectors of U_i Λ_i U_i^T + σ_i^2 I_d are the same as those of Σ_i, and the eigenvalues become λ_j + σ_i^2 in the directions of U_i and σ_i^2 elsewhere, where λ_j are the diagonal elements of Λ_i.
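The spectral claim can be verified numerically. The following numpy sketch builds the factorized covariance explicitly (only to check its spectrum; a real implementation would never form the d × d matrix) with toy values for Λ and σ²:

```python
import numpy as np

rng = np.random.default_rng(4)
d, l = 50, 5
U, _ = np.linalg.qr(rng.normal(size=(d, l)))   # orthonormal basis, d x l
lam = np.array([10., 8., 6., 4., 2.])          # leading eigenvalues (toy)
sigma2 = 0.5

# Factorized covariance: Sigma = U Lam U^T + sigma^2 I.
Sigma = U @ np.diag(lam) @ U.T + sigma2 * np.eye(d)
ev = np.sort(np.linalg.eigvalsh(Sigma))[::-1]

# Leading eigenvalues are lam_j + sigma^2; the remaining d - l equal sigma^2.
print(np.allclose(ev[:l], lam + sigma2))       # True
print(np.allclose(ev[l:], sigma2))             # True
```

In practice only U_i (d × l), Λ_i and σ_i² are stored, and inverses of the factorized covariance can be obtained cheaply (e.g. via the matrix inversion lemma) without ever touching a d × d matrix.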

Toy Problem
In order to verify that under ideal conditions ODA outperforms LDA, we tested ODA on a toy problem. 200 samples for five 20-dimensional (d = 20) Gaussian classes were generated. Each sample of class c was generated as y_i = B_c c + µ_c + n, where y_i ∈ R^{20×1}, B_c ∈ R^{20×7}, c ∼ N_7(0, I) and n ∼ N_20(0, 2I). The means of the classes are µ_1 = 4·1_20, µ_2 = 0_20, µ_3 = −4[0_10 1_10]^T, µ_4 = 4[1_10 0_10]^T and µ_5 = 4[1_5 0_5 1_5 0_5]^T. The bases B_c are random matrices, where each element has been generated from N(0, 5). A weak orthogonality between the covariance matrices (i.e. tr(B_i^T B_j) = 0 ∀ i ≠ j) is imposed with a Gram-Schmidt approach. The class covariances are approximated as U_i Λ_i U_i^T + σ_i^2 I, such that they preserve 90% of the energy. In the test set, a linear classifier is used; that is, a new sample d_i is projected into the subspace by x_i = B^T d_i and it is assigned to the class that minimizes (x_i − μ̂_i)^T Σ̂_i^{-1} (x_i − μ̂_i) + log |Σ̂_i|. It is well known that, in the case of a small number of samples, classical PCA can outperform LDA (Martinez & Kak, 2003). We ran the same experiment as before but with a feature size of 152 (i.e. d = 152) and just 40 samples per class. The results can be seen in table 2, where PCLDA stands for PCA+LDA (preserving 95% of the energy) and PCMODA for PCA+ODA. Even in the small sample case, ODA still outperforms all the other methods. Also, by projecting onto PCA, LDA avoids overfitting.
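The generative model for one class of the toy problem can be sketched as follows. This is an illustrative reconstruction, not the paper's code; in particular, the interpretation of N(0, 5) as variance 5 and of the noise covariance 2I as variance 2 per coordinate, and the choice of 200 samples per class, are assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n_samples = 20, 7, 200

# One class of the toy problem: y = B_c c + mu_c + n, with c ~ N_7(0, I)
# and n ~ N_20(0, 2I); B_c entries drawn from N(0, 5) as in the text.
mu_c = 4 * np.ones(d)                               # e.g. mu_1 = 4 * 1_20
B_c = rng.normal(0.0, np.sqrt(5.0), size=(d, k))
C = rng.normal(size=(k, n_samples))                 # latent coordinates
noise = rng.normal(0.0, np.sqrt(2.0), size=(d, n_samples))
Y = B_c @ C + mu_c[:, None] + noise                 # d x n samples
print(Y.shape)                                      # (20, 200)
```

Each class thus has covariance B_c B_c^T + 2I, so the classes differ in the orientation of their covariances, which is precisely the regime where ODA is expected to beat LDA.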

Face Recognition from Video
Face recognition is a classical pattern recognition problem that suffers from noise and a limited number of training samples, and the facial appearance under pose/illumination changes describes non-linear manifolds. These facts make face recognition a good candidate for MODA.
A face database has been collected using our omnidirectional meeting-capturing device (de la Torre et al., 2005). The training set consists of the data gathered on the first day under three different illumination conditions (varying lights in the recording room), scale and expression changes. We have around 500 images per person in the training set and a similar number for testing. The testing data consist of the recordings of the second day (a couple of weeks later) under similar conditions. Figure 6 illustrates the recognition performance using PCA, LDA and MODA; similarly, table 3 provides detailed numerical values for different numbers of bases.
In this experiment, each class has been clustered into two clusters to estimate B. Once B is calculated, nearest neighbour classification with the Euclidean distance is used. Several metrics were tested (e.g. Mahalanobis, Euclidean, cosine) and the Euclidean distance performed best in our experiments. For the same number of bases, MODA outperforms PCA/LDA. Also, observe that LDA can extract only c − 1 features (22 features), whereas MODA can extract many more. In this experiment, each sample is classified independently; however, using temporal information can greatly improve the recognition performance; refer to (de la Torre et al., 2005).

Figure 1. Projection onto a low dimensional space of face images for classification. Observe that the face distributions can be multimodal and with different covariances.
tr(A) denotes the trace of the matrix A and |A| the determinant. ||A||_F = sqrt(tr(A^T A)) = sqrt(tr(A A^T)) designates the Frobenius norm of a matrix. N_d(x; µ, Σ) indicates a d-dimensional Gaussian on the variable x with mean µ and covariance Σ.

J(B) = ∏_{i=1}^{c} ( |B^T S_b B| / |B^T Σ_i B| )^{n_i}, where Σ_i is the class covariance matrix and S_b the between-class scatter covariance matrix. Taking a different viewpoint, Hastie et al. have proposed several LDA extensions by modifying the following regression problem.

Figure 3. Four 3-dimensional Gaussians belonging to two classes (XOR problem). Each Gaussian has 30 samples generated with the same covariance. The means of the two classes are close to zero. Since the distribution of each class is multimodal and both classes have approximately the same mean, LDA cannot separate the classes well (fig. 4.a). Figure 4.b shows how MODA is able to separate both classes. The figures show the projection onto one dimension; the y-axis is the value of the projection and the x-axis is the sample number.
(x_i − μ̂_i)^T Σ̂_i^{-1} (x_i − μ̂_i) + log |Σ̂_i|, where μ̂_i and Σ̂_i are the low-dimensional estimates of the mean and class covariance.

Table 1. Average recognition rate of LDA and ODA over 50 trials. For each trial and each number of bases, the algorithm is run five times from different initial conditions (perturbing the LDA solution), and the best solution is chosen. As can be observed from table 1, ODA always outperforms LDA and is able to extract more features.