Facial Feature Detection with Optimal Pixel Reduction SVM

Automatic facial feature localization has been a long-standing challenge in computer vision. The difficulty stems from the large variation in the appearance of a face in an image, due to factors such as position, facial expression, pose, illumination, and background clutter. Support Vector Machines (SVMs) have been a popular statistical tool for facial feature detection. Traditional SVM approaches to facial feature detection typically extract features from images (e.g., multiband filters, SIFT features) and then learn the SVM parameters. Learning the features and the SVM parameters independently might result in a loss of information related to the classification process. This paper proposes an energy-based framework to jointly perform relevant feature weighting and SVM parameter learning. Preliminary experiments on standard face databases show a significant improvement in speed with our approach.


Introduction
Detection of facial features (e.g., eyes, nose) is a necessary step in a wide range of applications (e.g., face recognition, face tracking). Most successful approaches to facial feature detection frame the task as a classification or regression problem [11,14,17,18,22,31]. Traditional approaches for classification/regression follow a two-step process: (i) extracting features, (ii) building classifiers/regressors. Performing these two steps independently might result in a loss of information relevant to the classification/regression task.
Due to its importance, feature selection has been a central topic in a variety of fields including signal processing, computer vision, statistics, neural networks, pattern recognition, and machine learning. Traditionally, feature selection is performed independently of learning the classifier parameters [2-6, 12, 15, 20, 23, 24, 28, 30]. This paper extends previous work on feature selection and image classification by jointly learning optimal weighting of features (i.e., pixels) and SVM parameters.
Figure 1 illustrates the main point of the paper. Figure 1a displays a 17×29 rectangular patch around an eye. Figure 1b plots the ROC curve of a linear SVM using all available pixels inside the patch as features. Figure 1c displays a sparse set of 64 pixels chosen by our algorithm. These pixels and their weights are learned jointly with the SVM parameters. Using only 64 pixels (13% of the features), our SVM classifier produces an ROC curve (Fig. 1d) that is almost identical to the one shown in Figure 1b (using all pixels). Although the classification performance is not significantly better, using only 13% of the features leads to a dramatic increase in speed. Notably, most selected pixels are located around the edges of the eye, which is consistent with our intuition.
The rest of the paper is organized as follows. Sec. 2 reviews previous work on SVMs and feature extraction. Sec. 3 derives a normalized error function to jointly learn a parameterized kernel and the SVM parameters. Methods for learning feature weights in the input space and kernel space are provided in Sec. 4 and 5 respectively. Sec. 6 describes experiments on two standard face databases.

Previous work
This section reviews previous work on SVMs and feature construction for SVMs.

Support Vector Machines
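Given a set of training data $x_1, \ldots, x_n \in \Re^{d \times 1}$ with corresponding labels $y_1, \ldots, y_n \in \{-1, 1\}$, SVMs seek a separating hyperplane with maximum margin [26]:
$$\max_{w, b, M} \; M \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \geq M \;\; \forall i, \quad \|w\|_2 = 1. \tag{1}$$
Here, $M$ is the margin, $w$ is the normal vector of the hyperplane, and $\phi(\cdot)$ represents the mapping from the input space to the feature space. Substituting $w \leftarrow w/M$, $b \leftarrow b/M$, Eq. 1 is equivalent to:
$$\max_{w, b} \; \frac{1}{\|w\|_2} \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \geq 1 \;\; \forall i.$$
The above is equivalent to:
$$\min_{w, b} \; \frac{1}{2} \|w\|_2^2 \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \geq 1 \;\; \forall i.$$
Using a soft margin instead of a hard margin, we obtain the primal problem for SVMs:
$$\min_{w, b, \xi} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T \phi(x_i) + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i.$$
Here, $\{\xi_i\}_{i=1}^{n}$ are slack variables which allow for penalized constraint violations. $C$ is the parameter controlling the trade-off between a large margin and fewer constraint violations.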
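To make the soft-margin primal concrete, the following minimal sketch evaluates its objective for a given hyperplane. It assumes a linear kernel (the mapping is the identity), and the function name `primal_objective` and the toy data are hypothetical, not taken from the paper. For fixed $(w, b)$, the smallest feasible slack for each point is its hinge loss.

```python
def primal_objective(w, b, X, y, C):
    """Return (0.5*||w||^2 + C*sum(slacks), slacks) for a linear SVM."""
    # For fixed (w, b), the smallest feasible slack is max(0, 1 - y_i * f(x_i)).
    slacks = [
        max(0.0, 1.0 - yi * (sum(wk * xk for wk, xk in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    ]
    value = 0.5 * sum(wk * wk for wk in w) + C * sum(slacks)
    return value, slacks


# Toy data (illustrative only): the third point violates the margin.
X = [[2.0, 0.0], [0.0, 0.0], [1.2, 0.0]]
y = [1, -1, -1]
value, slacks = primal_objective([1.0, 0.0], -1.0, X, y, C=10.0)
```

A larger `C` makes the violating point dominate the objective, which is the trade-off described above.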

Feature construction in SVM
This section discusses previous work on selecting features for SVMs.
One popular technique for selecting features is RELIEF [19]. RELIEF assigns a weight to each feature based on the differences between the feature values of nearest-neighbor pairs. Cao et al. [5] further develop this method by learning feature weights in kernel space. This method is typically applied as a data processing step, independent of classifier construction. De la Torre and Vinyals [12] learn a subspace-parameterized Taylor series kernel expansion that effectively weights irrelevant pixels for classification with SVMs. Recently, there have also been several papers that learn kernel matrices for classification [10,16,21]. A popular approach is to define a parameterized family of kernel matrices and optimize the parameters to align with an ideal kernel. Another popular approach is to determine a desired property and learn a kernel which exhibits that property. In these approaches, the kernel is learned independently of the SVM parameters. This is the key difference between our proposed method and previous work.
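As a rough illustration of the nearest-neighbor idea behind RELIEF, the sketch below implements the basic two-class Relief update: a feature gains weight when it differs on the nearest other-class neighbor and loses weight when it differs on the nearest same-class neighbor. This basic form is an assumption, and the variant described in [19] may differ. The name `relief_weights` and the toy data are hypothetical, and features are assumed to be scaled to comparable ranges.

```python
def relief_weights(X, y):
    """Basic two-class Relief: average (miss difference - hit difference) per feature."""
    n, d = len(X), len(X[0])

    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    weights = [0.0] * d
    for i in range(n):
        same = [j for j in range(n) if j != i and y[j] == y[i]]
        other = [j for j in range(n) if y[j] != y[i]]
        hit = min(same, key=lambda j: sq_dist(X[i], X[j]))    # nearest same-class point
        miss = min(other, key=lambda j: sq_dist(X[i], X[j]))  # nearest other-class point
        for k in range(d):
            weights[k] += (abs(X[i][k] - X[miss][k]) - abs(X[i][k] - X[hit][k])) / n
    return weights


# Feature 0 separates the classes; feature 1 is noise.
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [-1, -1, 1, 1]
weights = relief_weights(X, y)
```

On this toy set the discriminative feature receives a positive weight and the noisy one a negative weight.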
To solve the problem of jointly learning the SVM parameters and kernel, Chapelle et al. [8] and Weston et al. [30] propose a method for choosing the SVM parameters, including the kernel parameters, by minimizing the Leave-One-Out Cross Validation (LOOCV) error. However, since the LOOCV error cannot be expressed analytically, they instead propose to minimize differentiable functions that are upper bounds of the LOOCV error. Mangasarian & Wild [23] introduce a modification to the SVM objective function and perform feature selection by repeatedly sweeping through all features, deciding whether to select or deselect each feature depending on which choice decreases the value of the objective function.
One way to select a subset of good features is to prune away unnecessary ones. Hermes and Buhmann [15] start by constructing an SVM classifier using all available features and recursively remove the feature whose removal has the least impact on the decision function. Similarly, Avidan [3] uses a greedy sequential forward selection method to find a subset of features and support vectors that approximate the SVM solution obtained using all available features.
To further constrain the SVM parameters, some authors propose modifying the objective function of SVMs by including regularization terms or constraints on the parameter w of SVMs. For example, Chan et al. [6] include two additional constraints on the L1 and L2 norms of w in the formulation of SVMs to achieve a sparse weight vector w.
Stoeckel & Fung [27] add a constraint on w so that the weight for each pixel depends not only on the pixel itself but also on its neighbors. Dundar et al. [13] add a regularization term on w in the objective function to encourage the decision function to produce similar results for neighboring pixels.

SVMs and parameterized kernels
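Suppose the mapping from the input space to the feature space can be parameterized by a parameter vector $p$, i.e., $\phi(x_i) = \phi(x_i, p)$. We would like to find a parameter vector $p$ and a separating hyperplane that have the largest margin. However, different values of $p$ correspond to different feature spaces, and since the margins in two different feature spaces cannot be directly compared, it is necessary to consider normalized margins. Let us define the normalized margin as the ratio of the margin to the square root of the sum of squared distances (in the feature space) between same-class data instances. In other words, the normalized margin is defined as:
$$\frac{M}{\sqrt{\sum_{i,j:\, y_i = y_j} \|\phi(x_i, p) - \phi(x_j, p)\|_2^2}}.$$
Observe that the normalized margin defined above is invariant to scale and translation in the feature space.
The problem of finding the parameter $p$ for the mapping and the parameters of the separating hyperplane that provide the largest normalized margin can be stated as:
$$\max_{w, b, p, M} \; \frac{M}{\sqrt{\sum_{i,j:\, y_i = y_j} \|\phi(x_i, p) - \phi(x_j, p)\|_2^2}} \quad \text{s.t.} \quad y_i (w^T \phi(x_i, p) + b) \geq M \;\; \forall i, \quad \|w\|_2 = 1.$$
Recall that if $p$ is fixed, finding the hyperplane with maximum normalized margin is equivalent to finding the hyperplane that maximizes the ordinary margin $M$.
The above is equivalent to:
$$\min_{w, b, p} \; \frac{1}{2} \|w\|_2^2 \sum_{i,j:\, y_i = y_j} \|\phi(x_i, p) - \phi(x_j, p)\|_2^2 \quad \text{s.t.} \quad y_i (w^T \phi(x_i, p) + b) \geq 1 \;\; \forall i.$$
Using a soft margin instead of a hard margin, we get:
$$\min_{w, b, p, \xi} \; \frac{1}{2} \|w\|_2^2 \sum_{i,j:\, y_i = y_j} \|\phi(x_i, p) - \phi(x_j, p)\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T \phi(x_i, p) + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i. \tag{9}$$
Here, $\{\xi_i\}_{i=1}^{n}$ are slack variables which allow for penalized constraint violations. $C$ is the parameter controlling the trade-off between having a large normalized margin and having fewer constraint violations.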
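The invariance of the normalized margin to scale and translation can be checked numerically. The sketch below is illustrative only: it takes points already mapped to the feature space, sums over ordered same-class pairs (a convention that only changes a constant factor), and uses hypothetical names and toy data.

```python
import math


def normalized_margin(Z, y, w, b):
    """Geometric margin of hyperplane (w, b) divided by the same-class scatter of Z."""
    norm_w = math.sqrt(sum(wk * wk for wk in w))
    margin = min(
        yi * (sum(wk * zk for wk, zk in zip(w, zi)) + b) for zi, yi in zip(Z, y)
    ) / norm_w
    scatter = sum(
        sum((u - v) ** 2 for u, v in zip(Z[i], Z[j]))
        for i in range(len(Z))
        for j in range(len(Z))
        if y[i] == y[j]
    )
    return margin / math.sqrt(scatter)


Z = [[2.0, 1.0], [3.0, 0.0], [-1.0, 0.5], [-2.0, -0.5]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], -0.5
base = normalized_margin(Z, y, w, b)

# Scale the feature space by s and translate it by t. The same hyperplane,
# expressed in the new coordinates, keeps w and has offset s*b - w.t.
s, t = 3.0, [1.0, -2.0]
Z_moved = [[s * zk + tk for zk, tk in zip(zi, t)] for zi in Z]
b_moved = s * b - sum(wk * tk for wk, tk in zip(w, t))
moved = normalized_margin(Z_moved, y, w, b_moved)
```

The plain margin grows by the factor `s`, while `base` and `moved` agree up to rounding.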

Learning feature weights
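Consider a mapping that assigns different weights to different features, $\phi(x_i, p) = \mathrm{diag}(p)^{1/2} x_i$, where $p = [p_1 \ldots p_d]^T$ are the feature weights and $p_k \geq 0 \;\; \forall k$. We have:
$$\varphi(p) = \sum_{i,j:\, y_i = y_j} \|\phi(x_i, p) - \phi(x_j, p)\|_2^2 = \sum_{k=1}^{d} p_k \sum_{i,j:\, y_i = y_j} (x_{ik} - x_{jk})^2.$$
Since $\varphi(p)$ is homogeneous in $p$, we can always scale $w$ and $p$ appropriately to get $\varphi(p) = 1$. Therefore Eq. (9) is equivalent to:
$$\min_{w, b, p, \xi} \; \frac{1}{2} \|w\|_2^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (w^T \mathrm{diag}(p)^{1/2} x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i, \quad \varphi(p) = 1, \;\; p_k \geq 0 \;\; \forall k. \tag{11}$$
Let $v = \mathrm{diag}(p)^{1/2} w$ and consider the function $g : \Re \times \Re_+ \to \Re$ defined by:
$$g(v, p) = \frac{v^2}{p} \;\; \text{if } p > 0, \qquad g(0, 0) = 0.$$
Eq. 11 is equivalent to:
$$\min_{v, b, p, \xi} \; \frac{1}{2} \sum_{k=1}^{d} g(v_k, p_k) + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i (v^T x_i + b) \geq 1 - \xi_i, \;\; \xi_i \geq 0 \;\; \forall i, \quad \varphi(p) = 1, \;\; p_k \geq 0 \;\; \forall k.$$
Since $g(\cdot, \cdot)$ is convex and $\varphi(p)$ is linear in $p$, the above optimization problem is also convex.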
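The change of variables in this section can be sanity-checked with a few lines of code. The sketch below is a hedged illustration, not the paper's implementation: it assumes the same-class scatter is summed over ordered pairs, assumes the substitution $v = \mathrm{diag}(p)^{1/2} w$, and treats $g(v, 0)$ as infinite for $v \neq 0$ (a boundary convention chosen here). All names and data are hypothetical.

```python
import math


def scatter_coefficients(X, y):
    """c_k = sum over same-class pairs of (x_ik - x_jk)^2, so the scatter is sum_k p_k * c_k."""
    n, d = len(X), len(X[0])
    return [
        sum((X[i][k] - X[j][k]) ** 2 for i in range(n) for j in range(n) if y[i] == y[j])
        for k in range(d)
    ]


def scatter(p, coeffs):
    """Same-class scatter in the weighted space; linear (hence homogeneous) in p."""
    return sum(pk * ck for pk, ck in zip(p, coeffs))


def g(v, p):
    """Quadratic-over-linear term v^2 / p; boundary handling is an assumption."""
    if p > 0:
        return v * v / p
    return 0.0 if v == 0 else float("inf")


X = [[0.0, 1.0], [2.0, 1.5], [5.0, 0.0], [6.0, 1.0]]
y = [1, 1, -1, -1]
coeffs = scatter_coefficients(X, y)

p = [4.0, 0.25]
w = [0.5, -2.0]
v = [math.sqrt(pk) * wk for pk, wk in zip(p, w)]  # v = diag(p)^(1/2) w

# Rescaling p to unit scatter is always possible because the scatter is linear in p.
p_unit = [pk / scatter(p, coeffs) for pk in p]

x = X[1]
decision_w = sum(wk * math.sqrt(pk) * xk for wk, pk, xk in zip(w, p, x))
decision_v = sum(vk * xk for vk, xk in zip(v, x))
```

The two decision values coincide, and the sum of `g(v_k, p_k)` equals the squared norm of `w`, which is why the problem can be restated in terms of `v` and `p`.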

Feature weighting in feature space
Let $X \in \Re^{d \times n}$ be the training data set and $X' \in \Re^{d \times m}$ be the testing data set. Let $\phi(X)$ denote $[\phi(x_1) \ldots \phi(x_n)]$.
The training kernel is $K_{train} = \phi(X)^T \phi(X)$, and the testing kernel is $K_{test} = \phi(X')^T \phi(X)$. [...] Based on these conditions, the corresponding training and testing kernels are: [...] Thus we have defined a feature mapping that induces the same training and testing kernels. Now, we can learn the feature weights as if the training data were $B K_{train}$ and the testing data were $B K_{test}^T$. If $K_{train}$ is singular, or if we want to reduce the number of dimensions of the feature space, we can take $B$ as $B = S_k^{-1/2} U_k^T$, where $K_{train} = U S U^T$ is the eigendecomposition of $K_{train}$. Here $U_k$ contains the first $k$ columns of $U$ (corresponding to the largest eigenvalues of $K_{train}$) and $S_k$ is the sub-matrix of $S$ containing the first $k$ columns and $k$ rows. In this case, the induced training kernel might not exactly match $K_{train}$, but it is the best rank-$k$ approximation.
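Part of the construction of $B$ is missing from this text, so the following derivation is a hedged reconstruction rather than the paper's own. It assumes the eigendecomposition $K_{train} = U S U^T$ with $U$ orthogonal, the testing kernel $K_{test} = \phi(X')^T \phi(X)$, and the full-rank choice $B = S^{-1/2} U^T$ for a nonsingular $K_{train}$; the hatted symbols are introduced here for illustration. Under these assumptions, treating $B K_{train}$ and $B K_{test}^T$ as data reproduces both kernels:

```latex
\begin{aligned}
\hat{\phi}(X)   &= B K_{train} = S^{-1/2} U^T U S U^T = S^{1/2} U^T, \\
\hat{K}_{train} &= \hat{\phi}(X)^T \hat{\phi}(X) = U S^{1/2} S^{1/2} U^T = K_{train}, \\
\hat{\phi}(X')  &= B K_{test}^T = S^{-1/2} U^T K_{test}^T, \\
\hat{K}_{test}  &= \hat{\phi}(X')^T \hat{\phi}(X) = K_{test} U S^{-1/2} S^{1/2} U^T = K_{test}.
\end{aligned}
```

With the truncated choice $B = S_k^{-1/2} U_k^T$, the same steps give $\hat{\phi}(X) = S_k^{1/2} U_k^T$ and $\hat{K}_{train} = U_k S_k U_k^T$, which is the best rank-$k$ approximation of $K_{train}$.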

Experiments
This section compares the performance of weighted SVMs and normal SVMs on two standard face databases.

Pose classification
We performed experiments on the CMU Face Images Data Set from the UCI machine learning repository [1]. The database contains 30×32 pixel facial images of 20 people under different expressions and poses. Some examples of faces from the database are given in Fig. 2. The classification task was to distinguish between two different poses: looking up and looking at the camera. Because the number of data instances in this database is small (only 312 faces), the experimental results were taken as the accuracy of 10-fold cross validation. We constructed four different SVM classifiers, namely linear SVM, linear weighted SVM, Gaussian SVM, and Gaussian weighted SVM. For all classifiers, we repeated the experiments for different values of the C parameter (and γ for Gaussian SVMs) and reported the best results. Table 1 shows the best results from all methods. Notably, weighted SVMs achieve similar classification accuracy while using a much smaller number of pixels and support vectors. Fig. 3 displays the pixels selected by applying our weighted SVM method.
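The 10-fold protocol can be sketched as follows. This is a generic illustration under stated assumptions, not the experimental code: the shuffling seed, the fold assignment, and the callback `train_and_score` (which would train a classifier on the training indices and return the number of correct predictions on the held-out indices) are all hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_accuracy(n, k, train_and_score, seed=0):
    """Pool correct predictions over all held-out folds and divide by n."""
    correct = 0
    for held_out in k_fold_indices(n, k, seed):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        correct += train_and_score(train, held_out)
    return correct / n


folds = k_fold_indices(312, 10)
# A dummy scorer that gets every held-out sample right, used only to exercise the loop.
accuracy = cross_validated_accuracy(312, 10, lambda train, test: len(test))
```

With 312 samples and 10 folds, two folds hold 32 samples and the rest hold 31.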

Eye detection
Following the approach of Everingham and Zisserman [14], we performed eye detection experiments on the gray-scale FERET database [25]. This database contains facial images of various subjects under different expressions and poses. All images have a 256×384 pixel resolution and limited lighting variation. Some images are associated with a set of four hand-labeled landmarks (Fig. 4a). Among the images with labeled landmarks, we extracted all 2963 available frontal faces for experiments. These images were further divided into disjoint training and testing sets (60% and 40% respectively).
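For training, we first performed Procrustes analysis [9] to align the landmarks w.r.t. the mean shape, removing rotation, translation, and scale variations. Positive training examples were obtained by sub-sampling 17 × 29 patches inside 27 × 47 rectangular regions around the left iris landmark of every training image. Similarly, negative examples were created by extracting rectangular patches around random points in the iris neighborhood. The neighborhood was defined as in Fig. 4b. Each patch was normalized by subtracting the mean intensity and dividing by the standard deviation.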
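The patch normalization step can be written in a few lines. The sketch below is a minimal illustration: it assumes the population standard deviation (dividing by the number of pixels) and maps a constant patch to zeros, neither of which is specified in the text. The name `normalize_patch` is hypothetical.

```python
import math


def normalize_patch(patch):
    """Subtract the mean intensity and divide by the standard deviation."""
    pixels = [value for row in patch for value in row]
    mean = sum(pixels) / len(pixels)
    std = math.sqrt(sum((value - mean) ** 2 for value in pixels) / len(pixels))
    if std == 0:
        # A constant patch carries no contrast; return zeros instead of dividing by zero.
        return [[0.0 for _ in row] for row in patch]
    return [[(value - mean) / std for value in row] for row in patch]


normalized = normalize_patch([[1.0, 2.0], [3.0, 4.0]])
flat = [value for row in normalized for value in row]
```

After normalization the patch has zero mean and unit variance, which removes global brightness and contrast differences between patches.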
For each training image, the OpenCV Viola-Jones face detector [29] was used to produce a square centered on the face. A linear regression predictor was implemented to approximate the iris landmark from the position and scale of the face detector's output [14].
We performed experiments with two different SVM classifiers, namely normal SVM and weighted SVM. For weighted SVM, we first applied the method described in Sec. 4 to learn the optimal pixel weights. Pixels with insignificant weights ($< 10^{-5}$) were discarded, and an SVM classifier was constructed based on the remaining pixels, taking their weights into account. Fig. 1c shows the locations of the 64 pixels (out of 493) chosen by our weighted SVM (cyan dots).
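The pruning step and the resulting sparse classifier can be sketched as below. How the weights are "taken into account" is not detailed in the text, so this example assumes each kept pixel is scaled by the square root of its weight, matching the mapping $\mathrm{diag}(p)^{1/2} x$ of Sec. 4. The function names, the toy weights, and the linear classifier parameters `w` and `b` are hypothetical.

```python
import math


def select_pixels(weights, threshold=1e-5):
    """Indices of pixels whose learned weight is not below the threshold."""
    return [k for k, p in enumerate(weights) if p >= threshold]


def weighted_features(x, weights, kept):
    """Scale each kept pixel of a flattened patch by the square root of its weight."""
    return [math.sqrt(weights[k]) * x[k] for k in kept]


def decision_value(x, weights, kept, w, b):
    """Linear SVM decision value computed on the kept, weighted pixels only."""
    z = weighted_features(x, weights, kept)
    return sum(wj * zj for wj, zj in zip(w, z)) + b


weights = [0.5, 0.0, 3e-6, 0.25]  # toy weights; two of them are insignificant
kept = select_pixels(weights)
score = decision_value([2.0, 9.0, 9.0, 4.0], weights, kept, w=[1.0, -1.0], b=0.5)
```

Only the kept pixels are read at test time, so the cost of a linear decision drops in proportion to the fraction of pixels retained (64 of 17 × 29 = 493 in the eye detection experiment).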
For each testing image, we used the previously learned linear regression to produce the first approximation of the iris position. A search window was placed around this initial guess. With a sliding window approach, the pixel with the highest SVM decision value was chosen as the final result for the localization of the iris.
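The search described above amounts to an argmax over candidate positions. In the sketch below, `decision_value` is a hypothetical callback returning the SVM decision value of the patch centered at a pixel, and the window half-sizes are illustrative parameters; the paper does not state the window size.

```python
def localize(decision_value, center, half_width, half_height):
    """Return the pixel in the search window with the highest decision value."""
    cx, cy = center
    best_score, best_point = None, None
    for y in range(cy - half_height, cy + half_height + 1):
        for x in range(cx - half_width, cx + half_width + 1):
            score = decision_value(x, y)
            if best_score is None or score > best_score:
                best_score, best_point = score, (x, y)
    return best_point


# Stand-in scoring function peaking at (12, 7); a real one would evaluate the SVM.
peak = localize(lambda x, y: -((x - 12) ** 2 + (y - 7) ** 2), (10, 8), 5, 5)
```

Because the classifier is evaluated once per candidate pixel, the cost of each evaluation directly determines localization speed.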
The performance of different algorithms was evaluated in two different ways. Figure 5 plots the proportion of successful localizations within a threshold (y-axis) against the localization error threshold (x-axis). The Euclidean distance from the ground truth landmark to the predicted iris location was normalized by the inter-ocular distance (distance between the two iris landmarks) to account for different scales. Compared with normal SVM, weighted SVM achieves similar performance while using a much smaller number of pixels.
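The error measure and the cumulative curve can be computed as in the following minimal sketch. The function names and the sample coordinates are hypothetical; `math.dist` requires Python 3.8 or later.

```python
import math


def normalized_error(predicted, truth, left_iris, right_iris):
    """Euclidean localization error divided by the inter-ocular distance."""
    return math.dist(predicted, truth) / math.dist(left_iris, right_iris)


def success_rate(errors, threshold):
    """Proportion of localizations whose normalized error is within the threshold."""
    return sum(error <= threshold for error in errors) / len(errors)


error = normalized_error((103, 104), (100, 100), (100, 100), (150, 100))
rate = success_rate([0.02, 0.05, 0.1, 0.3], 0.1)
```

Evaluating `success_rate` over a range of thresholds yields one curve per method, as in the comparison described above.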
Figure 5. Distance threshold versus the proportion of iris localizations within the threshold. The distance is taken as the Euclidean distance from the ground truth landmark to the predicted iris location, normalized by the inter-ocular distance. Weighted SVM performs as well as the other method while using far fewer pixels. The Regression curve is the result of using the initial guess produced by the linear regression predictor.
Figure 6. ROC curves of three different methods. Weighted SVM performs as well as normal SVM while using a much smaller number of pixels.
To analyze the trade-off between true detections and false alarms, we classified all pixels inside the search window and produced ROC curves (Fig. 6) by varying the threshold of the SVM classifier. The positively classified pixels were considered correct if they fell inside a square neighborhood around the true landmark. The size of this neighborhood was proportional to the inter-ocular distance of the subject (illustrated in Fig. 4c). As can be observed, the ROC curve produced by our weighted SVM is similar to the one produced by the standard SVM. However, the weighted SVM used only 13% of the available pixels.
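Sweeping the decision threshold to obtain an ROC curve can be sketched as follows. This is an illustrative implementation under simple assumptions: each pixel has a decision value and a boolean label saying whether it lies inside the correct neighborhood, both classes are present, and tied scores are not merged. The name `roc_points` and the toy data are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs for a descending threshold sweep."""
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    # Lowering the threshold past each score turns that pixel into a positive prediction.
    for score, label in sorted(zip(scores, labels), key=lambda pair: -pair[0]):
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


curve = roc_points([0.9, 0.8, 0.3, 0.1], [True, False, True, False])
```

Two classifiers with nearly identical curves offer the same trade-off between true detections and false alarms at every operating point.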
In our experiments, SVM classifiers were built using LibSVM [7]. The C parameter of SVMs and other parameters were tuned using cross validation.

Conclusion
In this paper, we have presented a method for jointly performing feature extraction and building SVM classifiers. Learning feature weights and parameters of SVM classifiers is formulated as a convex optimization problem. The method has been applied to solve two important computer vision problems: pose classification and facial feature detection. Experiments on standard face databases produce SVM classifiers that employ sparse sets of features while retaining classification performance.

Figure 1. (a) 17 × 29 rectangular patch used for eye detection. (b) ROC curve of a linear SVM classifier using all pixels as features. (c) 64 most discriminative pixels used by our SVM classifier that jointly optimizes pixel weighting and SVM parameters. (d) ROC curve of the learned SVM classifier, using only 64 pixels.

Figure 2. Examples of faces from the CMU Face Database.

Figure 4. (a) Example of four landmarks used in the FERET database. (b) Centers of negative training patches were sampled randomly inside the cyan region. (c) Region of correct classification: positively classified pixels were considered correct if they were located inside the square.

Table 1. Comparison of weighted SVMs and normal SVMs on the UCI CMU Face Images Data Set. The weighted SVMs (both linear and Gaussian) achieve similar accuracy rates while using far fewer features and support vectors.