Learning Image Alignment without Local Minima for Face Detection and Tracking

Active Appearance Models (AAMs) have been extensively used for face alignment during the last 20 years. While AAMs have numerous advantages relative to alternate approaches, they suffer from two major drawbacks: (i) AAMs are especially prone to local minima in the ﬁtting process; (ii) few if any of the local minima of the cost function correspond to acceptable solutions. To minimize these problems, this paper proposes a method to learn the ﬁtting cost function that explicitly optimizes that the local minima occur at and only at the places corresponding to the correct ﬁtting parameters. The paper explores two methods to parameterize the cost function: pixel weighting and subspace learning. Experiments on synthetic and real data show the effectiveness of our approach for face alignment.


Introduction
Since the early work of Sirovich and Kirby [23] parameterizing the human face using Principal Component Analysis (PCA) and the successful eigenfaces of Turk and Pentland [25], many computer vision researchers have used PCA techniques to construct linear models of optical flow, shape or gray level [4-6, 8, 12, 13, 16, 21].In particular, Active Appearance Models (AAMs) [8,12,13,19] have proven to be an appropriate statistical tool for modeling shape and appearance variation of faces in images.In AAMs, the appearance/shape models of objects are built by performing PCA on training data.Once PCA model has been built from training samples, finding the location/configuration of an object of interest in a testing image is achieved by minimizing a cost function w.r.t.some geometric transformation (motion) parameters; this is referred to as the fitting process.
Although widely used, AAMs have two major problems in the fitting process.First, they are especially prone to local minima.Second, often few, if any, of the local minima of the cost function correspond to acceptable solutions.the global minimum of this cost function is not at the desired position, the black dot of Fig. 1d, which corresponds to the correct landmarks' locations.These problems occur because the PCA model is constructed without considering the neighborhood of the correct motion parameters (parameters that correspond to ground truth landmarks of training data).The neighborhoods determine the local minima properties of the error surface, and should be taken into account while constructing the models.
In this paper, we propose to learn the cost function (i.e.appearance model) that has a local minimum at the "expected" location and no other local minima in its neighborhood.This is done by enforcing constraints on the gradients of the cost function at the desired location and its neighborhood.Fig. 1e,g plot the error surface and contours of the learned cost function.This cost function has a local minimum in the expected place (black dot of Fig. 1e), and no other local minima near by.

Previous work
Over the last decade, appearance models have become increasingly important in computer vision and graphics.In particular, Parameterized Appearance Models (PAMs) have been proven useful for alignment, detection, tracking, and face synthesis [5,6,8,12,13,16,19,21,26].This section reviews PAMs and gradient-based methods for the efficient alignment of high dimensional deformation models.

Parameterized Appearance Models (PAMs)
PAMs [5,6,8,12,16,21,26] build the objects' appearance/shape representation from the principal components of training data.Let d i ∈ ℜ m×1 (see notation 1 ) be the i th sample of a training set D ∈ ℜ m×n and U ∈ ℜ m×k the first k principal components [15].Once the model has been constructed (i.e.U is known), tracking/alignment is achieved by finding the motion parameter p that best aligns the data w.r.t. the subspace U, i.e.
Here x = [x 1 , y 1 , ...x l , y l ] T is the vector containing the coordinates of the pixels to track.f (x, p) is the function for geometric transformation; denote f (x, p) by [u 1 , v 1 , ..., u l , v l ] T .d is the image frame in consideration, 1 Bold uppercase letters denote matrices (e.g.D), bold lowercase letters denote column vectors (e.g.d).d j represents the j th column of the matrix D. d ij denotes the scalar in the row i th and column j th of the matrix D. Non-bold letters represent scalar variables.1 k ∈ ℜ k×1 is a column vector of ones.0 k ∈ ℜ k×1 is a column vector of zeros.I k ∈ ℜ k×k is the identity matrix.tr(D) = i d ii is the trace of square matrix D.
is the operator that extracts the diagonal of a square matrix or constructs a diagonal matrix from a vector.and d(f (x, p)) is the appearance vector of which the i th entry is the intensity of image d at pixel (u i , v i ).For affine and non-rigid transformations, (u i , v i ) relates to (x i , y i ) by: with [x s 1 , y s 1 , ...x s l , y s l ] T = x + U s c s , where U s is the nonrigid shape model learned by performing PCA on a set of registered shapes [7].a, c s are affine and non-rigid motion parameters respectively, and p = [a; c s ].

Optimization for PAMs
Given an image d, PAM tracking/alignment algorithms optimize (1).Due to the high dimensionality of the motion space, a standard approach to efficiently search over the parameter space is to use gradient-based methods [1,5,7,9,19,27].To compute the gradient of the cost function given in (1), it is common to use Taylor series expansion to approximate d(f (x, p + δp)) by d(f (x, p)) + J d (p)δp, where ∂p is the Jacobian of the image d w.r.t. to the motion parameter p [18].Once linearized, a standard approach is to use the Gauss-Newton method for optimization [3,5].Other approaches learn an approximation of the Jacobian matrix with linear [7] or non-linear [17,22] regression.
Over the last few years, several strategies for improving the fitting performance have been proposed.For examples, Black & Anandan [5] and Cootes & Taylor [7] proposed using multi-resolution schemes, Xiao et al [29] proposed using 3D models to constrain 2D solutions, De la Torre et al [10] proposed learning filters to build a multi-band rep- resentation that proven to achieve robustness to local minima in AAM fitting.Baker et al. [2] learned a PCA model invariant to rigid and non-rigid transformations to improve the PCA model.De la Torre and Nguyen [11] learned a non-linear PCA model invariantly to non-rigid transformations that is less prone to local minima.Recently, Wu et al [28] propose avoiding local minima in discriminative fit- ting of appearance models by learning boosting with ranked functions.Although these methods show significant performance improvement, they do not directly address the problem of learning a cost function with no local minima to fit AAMs.In this paper, we deliberately learn a cost function which has local minima at and only at the desired places.

Learning parameters of the cost functions
Gradient-based algorithms, such as the ones discussed in the previous section, might not converge to the correct location (i.e.correct motion parameters) for several reasons.First, gradient-based methods are susceptible to being stuck at local minima.Second, even when the optimizer converges to a global minimum, the global minimum might not correspond to the correct motion parameters.These two problems occur primarily because PCA has limited generalization capabilities to model appearance variation.This section proposes a method to learn cost functions that do not exhibit these two problems in training data.

A generic cost function for alignment
This section proposes a generic quadratic error function where many PAMs can be cast.The quadratic error function has the form: (3) Here A ∈ ℜ m×m and b ∈ ℜ m×1 are the fixed parameters of the function, and A is symmetric.This function is the general form of many cost functions used in the literature including Active Appearance Models [8], Eigentracking [5], and template tracking [18,20].For instance, consider the cost function given in (1).If p is fixed, the optimal c that minimizes (1) can be obtained using c = U T d(f (x, p)).Substituting this back into (1) and performing some basic algebra, ( 1) is equivalent to:

Desired properties of cost functions
As discussed previously, it is desirable that the cost function have minima at and only at the 'right' places.In this section, we deliberately address this need as an optimization problem over A and b.
Let {d i } n 1 be a set of training images containing the objects of interest (e.g.faces), and assume the landmarks for the object shapes are available (e.g.manually labeled facial landmarks as in Fig. 5a).Let s i be the vector containing the landmark coordinates of image d i .Given {s i } n 1 , we perform Procrustes analysis [7] and build the shape model as follows.First, the mean shape s = 1 n i s i is calculated.Second, we compute a i the affine parameter that best transforms s to s i , and let a −1 i be the inverse affine transformation of a i .Third, ŝi is obtained by applying the inverse affine transformation a −1 i on s i (warping toward the mean shape).Next, we perform PCA on {ŝ i − s} n i=1 to construct U s , a basis for non-rigid shape variation.We then compute c s i , the coefficients of ŝi − s w.r.t. the the basis U s .Finally, let p i = [a i ; c s i ], p i is the parameter of image d i w.r.t. to our shape model.Notably, the shape model and {p i } n 1 are derived independently of the appearance model.to vanish, i.e.

∂E(d
To learn a cost function that has few local minima, it is necessary to consider p i 's neighborhoods.Let Here lb is chosen such that N − i is a set of neighbor parameters that are very close to p i ; it is satisfactory for a fitting algorithm to converge to a point in N − i .ub is chosen so that the fitting algorithm is guaranteed to be initialized at a point in N i or N − i .In most applications, such ub exists.For example, for tracking problems, ub can be set to the maximum movement of the object being tracked between two consecutive frames.Fig. 2 depicts the relationship between N − i , N i , andN + i .For a gradient descent algorithm to converge to p i or a point close enough to p i , it is necessary that E(d i , .) have no local minima in N i .This implies that ∂E(di,p) ∂p does not vanish for p ∈ N i .Notably, it is not necessary to enforce similar constraints for p ∈ N − i ∪ N + i because of the way lb, ub are chosen.Another desirable property is that each iteration of gradient descent advances closer to the correct position.Because gradient descent walks against the gradient direction at every iteration, we would like the opposite direction of the gradient at point p ∈ N i to be similar to the optimal walking direction p i − p.This quantity can be measured as the projection of the walking direction onto the optimal direction.Fig. 3 illustrates the rationale of this requirement.This requirement leads to the constraints: Equations ( 4) and ( 5) specify the constraints for the ideal cost function.
The gradient of the function E(d, p) plays a fundamental role in the above two constraints.To compute the gradient ∂E(d,p) ∂p , it is common to use first order Taylor series expansion to approximate d(f (x, p + δp)) by d(f (x, p)) + J d (p)δp, where J d (p) = ∂d(f (x,p)) ∂p is the spatial intensity gradient of the image d w.r.t. to the motion parameter p [18].This yields: Substituting ( 6) into ( 4) and ( 5), we obtain a set of linear constraints over A and b.

Practical issues and alternative fitting methods
In practice, there is an issue regarding the optimization using steepest gradient descent: the small components of

∂E(d,p) ∂p
tend to be neglected in the optimization.This occurs due to the magnitude difference between some columns of J d (p).For example, in (2), the magnitudes of the Jacobians of d(f (x, p)) w.r.t. to a 1 , a 2 , a 4 , a 5 can be much larger than the magnitudes of the Jacobians of d(f (x, p)) w.r.t. to a 3 , a 6 .
To address this concern, we consider an alternative opti-mization strategy where the update rule at iteration k th is: The update rule of the above algorithm is a variant of Newton iteration.Intuitively, H d (p k ) is similar to the Hessian of E(d, p) at p k , and it acts as a normalization matrix for the gradient.This algorithm is indeed a reasonable optimization scheme for cost functions in which A is symmetric positive semidefinite with all eigenvalues less than or equal to 1.For the interest of space, we omit the proof in this paper.Similar to the case of gradient descent, requiring the incremental updates to vanish at only at the places corresponding to acceptable solutions yields the following constraints: However, these constraints might be too stringent.Therefore, we propose to relax the constraints to get the optimization problem: Here ∆ di (p i ) is required to be small instead of strictly zero, ξ i 's are slack variables for constraints in (9) which allows for penalized constraint violation.C is the parameter controlling the trade-off between having few local minima and having local minima at the right places.
A is also constrained to be a symmetric positive semidefinite matrix where eigenvalues are less than or equal one.Adding this constraint to the optimization problem and incorporating the idea of margin, we obtain: where H m denotes the set of all m × m symmetric matrices of which all eigenvalues are non-negative and less than or equal to one.M is the user-defined margin size.Since ∆ di (p) is linear in terms of A and b, this is a quadratic programming problem with linear constraints, provided the requirement A ∈ H m can be described by linear constraints.

Special cases and experiments
Sec. 3.3 proposes a method for learning generic A and b.However, in specific situations, A and b can be further parameterized.The benefits of further parameterization are threefold.First, the number of parameters to learn can be reduced.Second, the relationship between A and b can be established.Third, the constraint that A ∈ H m can be replaced by a set of linear constraints.This section provides the formulation for two special cases, namely weighted template alignment and weighted-basis AAM alignment.
Experimental results on Multi-PIE database [14] are included.This database consists of facial images of 337 subjects taken under different illuminations, expressions and poses.Each face is manually labeled with 68 landmarks, as shown in Fig. 5a.Images are down sampled to 120 × 160 pixels.

Weighted template alignment
As shown in Sec.3.1, template alignment is a special case of (3) in which A = I m , and b = −d ref .
In template alignment, pixels of the template are typically weighed equally.In this section, we show to how to learn the optimal pixel weighting to avoid local minima in image alignment.In particular, we illustrate results for face alignment.
Consider the weighted sum of squared differences: , where, w is the weight vector for the template's pixels.This cost function is equivalent to (3) with A = diag(w) and b = −diag(w)d ref .
The constraint A ∈ H m can be imposed by requiring 0 ≤ w i ≤ 1.Thus (11) becomes a quadratic programming problem with linear constraints over w.
For this experiment, we only use directly-illuminated frontal neutral faces.Each subject in the Multi-PIE database might have more than one image.However, to ensure the testing data is completely independent of the training data, we restrict our attention to at most one image per subject.Our dataset contains 217 images, 10 are selected for training, 65 are used for validation (parameter tuning), and the rest are reserved for testing.
The shape model is built as described in Sec.3.2.The final shape model requires 6 coefficients (only affine transformation) to describe a shape.For object appearance, we extract intensity values of pixels inside the region formed by the landmarks (Fig. 5c).
The template is taken to be the mean image of all images (Fig. 4a).Thus the task is to do template alignment Table 1.Alignment results of different methods for three different difficulty levels of testing data (PerMag).Initial is the initial amount of perturbation before running any alignment algorithm.
Best is the best result can be achieved by using affine transformation.Default is the method which give uniform weights for the pixels.The between an arbitrary image with the mean image.The template's weights are learned by optimizing (11) with the following parameter settings: M = 0.01, C = 1.To avoid N i being of infinite size, we restrict our attention to a set of 150 random samples from N i .The random samples are drawn by introducing random Gaussian perturbation to the correct shape parameter p i .Fig. 4b displays the learned weights; bright pixels mean higher weights.Notably, the pixels in the eye and mouth regions do not actually receive high weights.This is consistent with the intuition that cost function using high weights in area with high variability are more susceptible to local minima.Testing data are generated by randomly perturbing the components of p i , the correct shape parameters of test image d i .Perturbation amounts are generated from a zero mean Gaussian distribution with standard deviation P erM ag × [0.05 0.05 1 0.05 0.05 1] T .P erM ag controls the overall difficulty of the testing data.The relative perturbation amounts of shape coefficients are determined to simulate possible motion in tracking.Fig. 5b shows an example of shape perturbation, the ground truth landmarks are marked in red (circles), while the perturbed shape is shown in yellow (pluses).
Table 1 describes the experimental results with three difficulty levels of testing data (controlled by P erM ag).The performance of the learned cost function is compared with the default (uniform weight) cost function.The results show that the learned cost function outperforms the default function in all levels of perturbation.

Weighted-basis for AAM alignment
As shown in Sec.3.1, AAM alignment is a special case of (3) in which A = I m −UU T = I m − k 1 u i u T i , and b = 0. U is the set of k first eigenvectors from the total of K PCA basis of the training data subspace.k (≤ K) is usually chosen experimentally.In this section, we propose to use all K eigenvectors, but weigh them differently.Specifically, we learn A which has the form: (11) we get a quadratic programming problem with linear constraints over w.
From the Multi-PIE database, we only make use of the directly-illuminated frontal face images under five expressions (smile, disgust, squint, surprise and scream).Our dataset contains 1100 images, 400 are selected for training, 200 are used for validation (parameter tuning), and the rest are reserved for testing.
The shape model is built as described in Sec.3.2.The final shape model requires 10 coefficients (6 affine + 4 nonrigid) to describe a shape.For object appearance, we extract intensity values of pixels inside the patches located at the landmarks (Fig. 5d).
The training data is further divided into two subsets, one contains 300 images and the other contains 100 images.U is obtained by performing PCA on the subset of 300 images.The second subset is used to set up the optimization problem (11).For better generalization, ( 11) is constructed without using images in the first training subset.To avoid N i being of infinite size, we restrict our attention to a set of 200 random samples from N i .The random samples are drawn by introducing random Gaussian perturbation to the correct shape parameter p i .
Following the approach by Tsochantarisdis et al [24] for minimizing a quadratic function with an exponentially large number of linear constraints, we maintain a smaller subset of active constraints S and optimize (11) iteratively.We repeat the following steps for 10 iterations: (i) empty S; (ii) randomly choose 20 training images; (iii) for each chosen training image d i , find the 100 most violated constraints from N i and include them in S; (iv) run quadratic programming with the reduced set of constraints.Similar to the case of weighted template alignment, testing data are generated by randomly perturbing the components of p i , the correct shape parameters of test image d i .Perturbation amounts are generated from a zero mean Gaussian distribution with standard deviation P erM ag × [0.05 0.05 1 0.05 0.05 1 2 2 2 2] T .
Table 2 describes the experimental results with four difficulty levels of testing data (controlled by P erM ag).The performance of the learned cost function is compared with four other cost functions constructed using PCA with popular energy settings (70%, 80%, 90%, and 100%).As can be observed, when the amount of perturbation is small, PCA models with higher energy levels perform better.However, as the amount of pertubation increases, PCA models with lower energy levels perform better.This suggests that cost functions using fewer basis vectors have less local minima while cost functions using more basis vectors are more likely to have local minima at the 'right' places.Thus it is unclear what the energy for the PCA model should be.On the other hand, the learned cost function performs significantly better than the PCA models for most difficulty levels.Because the magnitude of b is related to the sensitivity of the alignment algorithm, we add a regularization term 0.1 * ||b|| 2  2 into the objective function of ( 11) to regulate b to be small.The values of other parameters C = 2, M = 0.01 are tuned using the validation set.

Conclusion
In this paper, we have proposed a method to learn a cost function for image alignment that avoids local minima in training.We directly address the problem of learning cost functions that have local minima at and only at the desired places.The task of learning a cost function is formulated as optimizing a quadratic function under linear constraints.
Encouraging results have been achieved for face alignment with template matching and AAM fitting.Further work needs to address how to select the most informative samples in the error surface to reduce the number of constraints to optimize.

Figure 1 .
Figure 1.Learning a better model for image alignment.a: Original image to align (b,c): mean and eigenvectors of the AAM.(d,f): surface and contour plot of the PCA model.It has many local minima; (e, g): our method learns a better error surface for alignment.This figure is best seen in color.

Figures
Figures 1d,f illustrate these problems.Fig. 1d plots the error surface constructed by translating the testing image (Fig. 1a) around the ground truth landmarks and computing the values of the cost function.The cost function measures the reconstruction error from the PCA model constructed from labeled training data.Fig.1fshows the contour plot of this error surface.As can be observed, any gradient-based optimization method is likely to get stuck at local minima, and will not converge to the global minimum.Moreover, Thus (1) is a special case of (3), with A = I m − UU T , and b = 0 m .For template tracking, the cost function is typically the sum of squared differences: ||d(f (x, p)) − d ref || 2 2 , where d ref is the reference template.This cost function is equivalent to: d(f (x, p)) T d(f (x, p)) − 2d T ref d(f (x, p)).Thus the cost function used in template tracking is also a special case of (3) with A = I m and b = −d ref .

Figure 2 .
Figure 2. Neighborhoods around the ground truth motion parameter pi (ret dot).N −i : region inside the orange circle; it is satisfactory for fitting algorithms to converge to this region.N + i : region outside the blue circle; alignment algorithm will not be initialized in this region.Ni: shaded region, region to enforce constraints on gradients.

Figure 3 .
Figure 3. pi: desired convergence location.Blue arrows: gradient vectors, red arrows: walking directions of gradient descent algorithm, orange arrows: optimal directions to the desired location.Performing gradient descent at p advances closer to pi while performing gradient descent at p ′ moves away from pi.

Figure 4 .
Figure 4. (a) the template (mean image) used in weighted template experiment.(b) the learned weights, brighter pixels mean higher weights.Interestingly, the eye and mouth regions do not receive high weights.

Figure 5 .
Figure 5. (a) example of landmarks associated with each face (red dots), (b) example of shape distortion (yellow pluses), (c) example of patches for appearance modeling.

Table 2 .
Alignment results of different methods for four different difficulty levels of testing data (PerMag).Initial is the initial amount of perturbation before running any alignment algorithm.PCA e% is the cost function constructed using PCA preserving e% of energy.The table shows the means and standard deviations of mis-alignment (average over 68 landmarks and over testing data).The unit for measurement is pixel.