Learning 3D Appearance Models from Video

Within the past few years, there has been great interest in face modeling for analysis (e.g. facial expression recognition) and synthesis (e.g. virtual avatars). Two primary approaches are appearance models (AM) and structure from motion (SFM). Although both have been studied extensively, each has limitations. We introduce a semi-automatic method for 3D facial appearance modeling from video that addresses these limitations. We propose four main novelties:

• A 3D generative facial appearance model integrates both structure and appearance.
• The model is learned in a semi-unsupervised manner from video sequences, greatly reducing the need for tedious manual pre-processing.
• A constrained flow-based stochastic sampling technique improves specificity in the learning process.
• In the appearance learning step, we automatically select the most representative images from the sequence. By doing so, we avoid biasing the linear model, speed up processing and enable more tractable computations.

Introduction
Within the past few years, there has been great interest in face modeling for analysis (e.g. facial expression recognition) and synthesis (e.g. virtual avatars). Among the various approaches to modeling 3D faces from video, two of the most popular are based on appearance models (AM) [2,4,8,9,17] and rigid/non-rigid structure from motion (SFM) [5,7,16,21]. While each has been studied extensively, both approaches suffer from several drawbacks. All SFM approaches have an implicit data conservation assumption in their formulation, since the correspondence problem is usually solved with classical trackers or flow techniques. In the face domain, this assumption is often violated: the face undergoes large changes in appearance due to variations in expression that may be either qualitative (e.g. blinking, appearance of the tongue, etc.) or quantitative (i.e. intensity change), which can seriously bias any parameter estimation.
While the AM approach overcomes the problem of appearance change by explicitly introducing linear variation of intensity and shape, it incurs other challenges. AM approaches do not necessarily decouple rigid and non-rigid motion in the fitting process, since a single shape basis models both. Moreover, AM approaches require a labeled training set to learn face appearance, and manual labeling of face images is tedious and prone to error. In this paper, we propose a generative model that is robust to intensity changes in appearance, takes into account structure and appearance, and learns model parameters in a semi-supervised manner. Fig. 1 illustrates the main idea of the paper.

Figure 1: Generative 3D Facial Appearance Model with Structure and Appearance.

Previous Work
It is beyond the scope of this paper to review all the work related to 3D face modeling. Nevertheless, we cite the most relevant literature. Several papers have used a relatively simple 3D model (e.g. cylinder [19], ellipsoid, etc.) and flow equations to recover 3D rigid head motion. These approaches model 3D rigid motion but capture the 3D shape of the face only crudely (relative to SFM and AM).
In the area of structure from motion (SFM), several authors have reported encouraging results. Torresani et al. [18] decouple rigid and non-rigid motion under orthographic projection. Chowdhury and Chellappa [7] construct a 3D model by inferring depth from flow. In a similar approach, but using feature correspondences and bundle adjustment, Zhang et al. [21] construct 3D models from a video in which the face rotates 180 degrees from profile to profile. Pighin et al. [16] model and animate 3D face models using SFM in multi-view images, solving the correspondence by hand. Brand [5] reports an SFM technique based on a new algebraic approach that accommodates uncertainty and is less prone to propagating errors.
Since active shape models/active appearance models [8] and morphable models [13] appeared, there has been much related work in the appearance/face domain. Vetter and Blanz [4] introduced morphable models learned from a Cyberscan, which take into account shape and texture. Romdhani and Vetter [17] have recently improved the fitting process (see [1] for efficient fitting). Black and Jepson [3] introduce an elegant formulation for continuous alignment w.r.t. a subspace. Cascia et al. [6] present a method that tracks 3D heads under changing illumination conditions by registering w.r.t. the eigenspace. While the AM approach has shown great performance, the various algorithms require training from hand-labeled samples, which is labor intensive and error prone.
Frey and Jojic [11] introduced an Expectation Maximization (EM) algorithm that learns several statistical models (e.g. PCA, mixture of Gaussians, etc.) that are invariant to geometric transformations. However, the complexity of the algorithm scales exponentially with the number of motion parameters. To solve this problem, De la Torre and Black [10] proposed an energy-function-based algorithm to learn the appearance model. Their algorithm achieves invariance to geometric transformations while remaining scalable. In related but independent work, Baker et al. [2] proposed a method to learn the AAM in an unsupervised fashion. Morency et al. [15] introduce an adaptive view-based appearance model, which is able to register w.r.t. previously selected prototypes. The method we present in this paper benefits from previous AM and SFM approaches by learning a structured appearance model in an unsupervised manner.

Generative Model for 3D Faces
In this section we describe a possible generative 3D facial appearance model that takes into account the structure, appearance and 3D motion.

From Generic 3D Structure to Person-Specific Models
We begin with a generic 3D head model (http://grail.cs.washington.edu/projects/realface/) and subsample it to make it more computationally tractable. To give a first estimate of the shape of the face, we select 30 points by hand in two orthogonal views. The mesh is then deformed using a radial basis function and an affine transformation that minimizes: [...] where [...] points of the mesh. $\mathbf{A} \in \mathbb{R}^{2 \times 3}$ contains an affine transformation, and $\mathbf{D}$ is a matrix in which each element is a Euclidean distance [16]. Once we have rescaled the X and Y axes, we do a [...]

Modeling Appearance Changes
Once the structure of the face is obtained, we construct the appearance model by mapping the 3D model into cylindrical coordinates. Figure 3.a shows how the mesh is projected into cylindrical coordinates, $y = Y$ and $x = \arctan(\alpha X / Z)$, where $\alpha$ is a parameter that adjusts the cylindrical projection. Figure 3.b shows the unwarped mesh.
Once we have unwarped the mesh, we map the texture from the image to the unwarped mesh, assuming perspective projection. Similar to previous work [10], we define four regions in the unwarped texture image, corresponding to the eyes, mouth, profiles and the rest of the face. Each region is modeled by a subspace of different dimensionality (figure 4.a). After the unwarped texture is obtained, it is mapped from the unwarped cylindrical parameter space to the 3D model by means of triangular patches [16] (figure 4.b).
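As a minimal sketch of the cylindrical mapping described above, the following Python function computes y = Y and x = arctan(αX/Z) for an array of mesh vertices. The function name, the (N, 3) array layout and the use of `arctan2` (which avoids a division by zero when Z = 0) are assumptions made for illustration; they are not taken from the original implementation.

```python
import numpy as np

def cylindrical_unwarp(vertices, alpha=1.0):
    """Map 3D vertices of shape (N, 3) to cylindrical coordinates (N, 2).

    Implements y = Y and x = arctan(alpha * X / Z). For Z > 0,
    arctan2(alpha * X, Z) equals arctan(alpha * X / Z); arctan2 is an
    assumption used here so that Z = 0 does not raise a division error.
    """
    vertices = np.asarray(vertices, dtype=float)
    X, Y, Z = vertices[:, 0], vertices[:, 1], vertices[:, 2]
    x = np.arctan2(alpha * X, Z)  # angle around the vertical axis
    y = Y                         # height is kept unchanged
    return np.column_stack([x, y])
```

Larger values of `alpha` stretch the horizontal axis of the unwarped texture, which is presumably the role of α in the text.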

Flow-Based Initialization
We use flow-based techniques to obtain a fast initial estimate of the rotational and translational components of the rigid motion of the head between frames [19]. However, flow techniques rely on the brightness constancy assumption and are well known to be noisy and ambiguous when recovering 3D information. To overcome these difficulties, we use robust statistics techniques [3] and approximate the average head depth with a simple 3D model. Candidate 3D models include cylindrical [19], ellipsoidal and anthropomorphic models. Within a coarse-to-fine iterative strategy, we minimize¹: [...] (1) where $\rho(x, \sigma) =$ [...]. Minimizing expression (1) is a non-linear estimation problem because of the robust function and the behavior of the motion parameters. To approximate the problem by a linear one, we linearize the motion variation and use the Iteratively Reweighted Least Squares (IRLS) algorithm [14,10]. Given an initial estimate of the motion parameters $\boldsymbol{\mu}_0$, a Gauss-Newton method can be applied by incrementally updating the parameters, solving the following approximate minimization problem: [...] (2) where $\mathbf{J}_t = \partial \mathbf{d}_t / \partial \boldsymbol{\mu}$ is the Jacobian matrix. See [10,19].
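The Gauss-Newton/IRLS scheme described above can be sketched in Python as follows. Because the definition of ρ and equations (1) and (2) are not legible in this text, the sketch assumes the Geman-McClure function ρ(r, σ) = r² / (r² + σ²), a common choice in robust statistics; the function names and the generic residual/Jacobian callbacks are likewise hypothetical, and the coarse-to-fine pyramid is omitted.

```python
import numpy as np

def geman_mcclure_weights(r, sigma):
    """IRLS weights psi(r)/r for the ASSUMED rho(r) = r^2 / (r^2 + sigma^2)."""
    return 2.0 * sigma**2 / (r**2 + sigma**2) ** 2

def irls_gauss_newton(residual_fn, jacobian_fn, mu0, sigma, n_iter=30):
    """Robustly minimize sum_i rho(r_i(mu), sigma) by Gauss-Newton with IRLS.

    residual_fn(mu) returns the residual vector r (one entry per pixel);
    jacobian_fn(mu) returns J = dr/dmu with shape (n_pixels, n_params).
    """
    mu = np.asarray(mu0, dtype=float).copy()
    for _ in range(n_iter):
        r = residual_fn(mu)
        J = jacobian_fn(mu)
        w = geman_mcclure_weights(r, sigma)  # large residuals get small weights
        JW = J * w[:, None]                  # rows of J scaled by the weights
        # Weighted normal equations: (J^T W J) delta = -J^T W r
        delta = np.linalg.solve(J.T @ JW, -JW.T @ r)
        mu += delta
    return mu
```

In a flow setting, `residual_fn` would return the brightness difference between the warped and the reference frame, and σ would typically be reduced as the estimate improves.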

Dimensionality Reduction
Dimensionality reduction is a common technique for filtering data and making algorithms more computationally tractable.
When processing long videos of the same person, the number of redundant facial expressions/poses becomes an issue for several reasons. Firstly, we do not necessarily have a uniform sampling of all possible facial expressions/poses. This biases the appearance learning algorithm towards better reconstructing the expressions with more samples. Secondly, and more importantly, the amount of data would make the stochastic algorithm very computationally expensive. To avoid these problems, once the images are registered, we find the most representative prototypes by clustering, using recent advances in multi-way normalized cuts [20]. Figure 6 shows 50 prototypes extracted from a sequence of 800 frames. Figure 7 shows some of the samples from a single cluster. We can observe that individual prototypes capture changes in expression/pose.

¹ Throughout this paper, we use the following notation: bold capital letters denote a matrix $\mathbf{D}$, and bold lower-case letters a column vector $\mathbf{d}$. $\mathbf{d}_j$ represents the $j$-th column of the matrix $\mathbf{D}$. $d_{ij}$ denotes the scalar in row $i$ and column $j$ of the matrix $\mathbf{D}$ and the $i$-th scalar element of the column vector $\mathbf{d}_j$. All non-bold letters represent scalar variables. $\|\mathbf{d}\|^2_{\mathbf{W}} = \mathbf{d}^T \mathbf{W} \mathbf{d}$ is a weighted norm of a vector $\mathbf{d}$. diag is an operator that transforms a vector into a diagonal matrix. $\mathbf{D}_1 \circ \mathbf{D}_2$ denotes the Hadamard (point-wise) product between two matrices/vectors of equal dimensions.

Stochastic Smoothing for Appearance Learning
The optical flow provides a first estimate of the rigid motion parameters, which can be biased by changes in facial expression, inaccuracies in the 3D model and linearization errors. To improve the estimate, compute the non-rigid motion parameters and build the appearance model, we use a smoothing particle filtering algorithm [12]. We pose the problem as inference in a general state-space model, described by $\mathbf{s}_t = g(\mathbf{s}_{t-1}, \mathbf{u}_t) + \boldsymbol{\beta}_t$ and $\mathbf{d}_t = h(\mathbf{s}_t) + \boldsymbol{\xi}_t$, where $\mathbf{d}_t$ is the vectorized observed image frame at time $t$. The hidden state $\mathbf{s}_t$ comprises $(\theta_x, \theta_y, \theta_z, t_x, t_y, t_z, \boldsymbol{\kappa})$, where $\boldsymbol{\kappa}$ are the non-rigid parameters (see section 6.1). $\mathbf{u}_t$ is the input, and $\boldsymbol{\beta}_t$ and $\boldsymbol{\xi}_t$ are samples from a noise distribution. $h$ is the measurement function and $g$ describes the dynamics of the system.
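To make the state-space formulation concrete, here is a minimal Python sketch of one forward step of a generic bootstrap particle filter for s_t = g(s_{t-1}) + β_t and d_t = h(s_t) + ξ_t. It assumes isotropic Gaussian noise for both β_t and ξ_t, omits the input u_t, and covers filtering only, not the backward smoothing pass of [12]; all names and parameters are hypothetical.

```python
import numpy as np

def particle_filter_step(particles, observation, g, h,
                         state_noise, obs_noise, rng):
    """One predict / weight / resample step of a bootstrap particle filter.

    particles: array (n_particles, state_dim); observation: array (obs_dim,).
    g and h act row-wise on arrays of states. Gaussian noise is an assumption.
    """
    # Predict: propagate each particle through the dynamics and add noise.
    predicted = g(particles) + rng.normal(0.0, state_noise, size=particles.shape)
    # Weight: Gaussian likelihood of the observation given each particle.
    residual = observation - h(predicted)
    log_w = -0.5 * np.sum(residual**2, axis=1) / obs_noise**2
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    w /= w.sum()
    # Resample: draw particles in proportion to their weights.
    idx = rng.choice(len(predicted), size=len(predicted), p=w)
    return predicted[idx]
```

In the setting of this paper, `h` would render the 3D model for a given state and `g` would encode the head dynamics; here both are left as arbitrary callables.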

Measurement Equation
The measurement equation expresses the fact that an image at time $t$, $\mathbf{d}_t$, is generated by a general non-linear function $h$ of $\mathbf{s}_t$. The likelihood of a particular sample of $\mathbf{s}_t$ is related to the image by: [...] where we define several operators. $NR(\mathbf{M}, \boldsymbol{\kappa})$ is an operator that takes the 3D mesh and deforms it according to the non-rigid parameters $\boldsymbol{\kappa}$, a vector of 3 parameters that modify the positions of the eyebrows, the mouth corners and the mandible aperture. $Proj$ is the perspective projection operator $[f_x X/Z - x_0,\; f_y Y/Z - y_0]$ applied to the visible triangles of the 3D mesh. Given the projected visible triangles, $Rec$ takes the image triangles, projects them into cylindrical coordinates and reconstructs the subspace as $\sum_{l=1}^{L} (\boldsymbol{\pi}^l_t \circ \mathbf{B}^l \mathbf{c}^l_t)$, where:
• $\boldsymbol{\pi}^l_t$: binary mask of layer $l$ at time $t$, which represents its spatial domain. Each element $\pi^l_{pt} \in \{0, 1\}$, and $\sum_l \pi^l_{pt} = 1 \;\forall p, t$. It is defined by hand.
• $\mathbf{c}^l_t$: coefficients whose linear combination of the basis $\mathbf{B}^l$ reconstructs the gray level of layer $l$.
• $\mathbf{B}^l$: appearance basis of layer $l$.
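The layered reconstruction defined above, a sum over layers of the mask times the basis applied to the coefficients, can be illustrated with a short Python sketch. The function name and the list-based layout of masks, bases and coefficients are assumptions for illustration only.

```python
import numpy as np

def reconstruct_layers(masks, bases, coeffs):
    """Return sum_l pi_l * (B_l @ c_l) for one frame.

    masks:  list of L binary vectors of shape (d,), summing to 1 per pixel.
    bases:  list of L matrices of shape (d, k_l), one subspace per layer.
    coeffs: list of L coefficient vectors of shape (k_l,).
    """
    image = np.zeros(masks[0].shape[0])
    for pi_l, B_l, c_l in zip(masks, bases, coeffs):
        # The Hadamard product with the mask keeps only the pixels of layer l.
        image += pi_l * (B_l @ c_l)
    return image
```

Because the masks partition the pixels, each pixel of the reconstructed image is explained by exactly one layer, so the layers may use subspaces of different dimensionality.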

State Equation
Eq. (4) describes the dynamical behavior of the hidden states of the dynamical system (the image sequence).
In the more general case, $g(\mathbf{s}_{t-1}, \mathbf{u}_t)$ is a non-linear transformation (e.g. a mixture of Gaussians, a multilayer perceptron network, ...).
The optical flow gives a first estimate of the 3D rigid parameters up to a scale factor, owing to the ambiguity between translation and depth. Although the flow estimate can be slightly biased, we use it to guide the search while sampling the posterior distribution of the state parameters. We combine both estimates (flow and temporal) with their covariances in an optimal Bayesian way: [...] where $\Sigma_d$ is the uncertainty coming from the dynamical system, $\mathbf{f}_t$ is the flow estimate of the rigid parameters, and $\Sigma_t$ is the uncertainty of the computed optical flow.
To estimate $\Sigma_t$, we run several iterations of Gauss-Newton with IRLS and, once it has converged, recompute the Jacobian $\mathbf{J}_t$ at the final parameter values $\mathbf{f}_t$ and construct a binary weighting matrix $\mathbf{W}_t$. An estimate of the uncertainty is then given by $\Sigma_t = \mathrm{trace}(\mathbf{W}_t)(\mathbf{J}_t^T \mathbf{W}_t \mathbf{J}_t)^{-1}$. $\mathbf{A}$ is a simple linear dynamical model, assumed to be a constant velocity model. Once the parameters are known, we sample from the multidimensional Gaussian to generate new samples.
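The combination equation is not legible in this text, so the following Python sketch assumes the standard product of two Gaussians: the temporal prediction (mean A s_{t-1}, covariance Σ_d) and the flow estimate (mean f_t, with its covariance). It also shows the covariance estimate trace(W)(JᵀWJ)⁻¹ for a binary diagonal W. Function and argument names are hypothetical.

```python
import numpy as np

def flow_covariance(J, w):
    """Return trace(W) * (J^T W J)^{-1} with W = diag(w), w a binary vector."""
    JW = J * w[:, None]  # rows of J scaled by the binary weights
    # For a diagonal W, trace(W) is simply the sum of the weights.
    return w.sum() * np.linalg.inv(J.T @ JW)

def fuse_gaussians(mean_d, cov_d, mean_f, cov_f):
    """Fuse two Gaussian estimates by multiplying their densities (ASSUMED form).

    mean_d, cov_d: temporal prediction, e.g. A @ s_prev and Sigma_d.
    mean_f, cov_f: flow estimate f_t and its covariance.
    """
    info_d = np.linalg.inv(cov_d)
    info_f = np.linalg.inv(cov_f)
    cov = np.linalg.inv(info_d + info_f)              # fused covariance
    mean = cov @ (info_d @ mean_d + info_f @ mean_f)  # precision-weighted mean
    return mean, cov
```

New particles could then be drawn with `np.random.default_rng().multivariate_normal(mean, cov, size=n)`, so that the less certain of the two estimates has less influence on where samples are placed.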

Deterministic Gradient Learning
Given a reasonable estimate of the rigid/non-rigid parameters over a set of $k$ frames, we unwarp the texture and compute an estimate of the subspace for each region of the face. For each unwarped frame, we have an image $\mathbf{p}_t \in \mathbb{R}^{k_t \times 1}$ and a weighting vector $\mathbf{w}_t \in \mathbb{R}^{k_t \times 1}$. We minimize [...] $\in \mathbb{R}^{k_t \times k}$ is a matrix such that $w_{ij} = 1$ if the pixel is visible and $w_{ij} = 0$ otherwise. $\mathbf{B}^l \in \mathbb{R}^{d \times k}$ is the set of $k$ basis vectors and $\mathbf{C}^l = [\mathbf{c}^l_1 \cdots \mathbf{c}^l_n] \in \mathbb{R}^{k \times n}$ is the set of coefficients for layer $l$. We recursively update the basis to preserve 85% of the energy. To optimize over $\mathbf{B}$ and $\mathbf{C}$, we use a two-step method that alternates between solving for $\mathbf{C}$ in closed form with $\mathbf{B}$ fixed and vice versa, until convergence. See [10] for more details.
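The alternating scheme can be sketched in Python as below. Since the objective itself is missing from this text, the sketch assumes a weighted least-squares cost, the sum over visible pixels of the squared difference between the data and the product BC, for a single layer. The function name, the small ridge term and the random initialization are assumptions, and the 85% energy criterion for choosing the number of basis vectors is not implemented.

```python
import numpy as np

def weighted_als(P, W, k, n_iter=50, seed=0):
    """Alternate closed-form updates of C and B for the ASSUMED cost
    || W * (P - B @ C) ||_F^2, where * is the element-wise product.

    P: (d, n) unwarped frames as columns; W: (d, n) binary visibility.
    Returns B of shape (d, k) and C of shape (k, n).
    """
    rng = np.random.default_rng(seed)
    d, n = P.shape
    B = rng.standard_normal((d, k))
    C = np.zeros((k, n))
    ridge = 1e-8 * np.eye(k)  # keeps the small linear systems well posed
    for _ in range(n_iter):
        # Step 1: B fixed, solve for the coefficients of every frame.
        for t in range(n):
            Bw = B * W[:, t:t + 1]
            C[:, t] = np.linalg.solve(Bw.T @ B + ridge, Bw.T @ P[:, t])
        # Step 2: C fixed, solve for every row (pixel) of the basis.
        for i in range(d):
            Cw = C * W[i]
            B[i] = np.linalg.solve(Cw @ C.T + ridge, Cw @ P[i])
    return B, C
```

Each step solves a small k-by-k linear system per frame or per pixel, so occluded pixels simply drop out of the corresponding normal equations.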

Experiments
Figure 8 shows some frames with the tracking results. The 3D mesh projected onto the images is shown, as well as the original rotated 3D mesh. The original sequence has approximately 800 frames from which, after tracking with flow (section 5) and clustering, 130 frames are selected. From the 130 frames, we take subsets of 15 frames, compute an appearance basis and run the smoothing for Condensation in order to register w.r.t. the subspace. We iterate the learning step and the smoothing algorithm until convergence. Good results have been achieved using 700 particles and, typically, 3 backward and forward runs of the smoothing process. The algorithm is implemented in non-optimized Matlab code and takes roughly 7 hours to process the original image sequence of 800 frames. This includes the optical flow, Condensation, smoothing for Condensation and the learning of the appearance model. Figure 9 shows the 3D mesh with the learned appearance model.

Figure 4: a) Texture mapped from one image to the unwarped cylinder. b) Two views of the texture map in 3D.

Figure 5: a) Example of tracking results with pose and facial expression changes.