A Closed-Form Solution to Non-Rigid Shape and Motion Recovery

Recovery of three dimensional (3D) shape and motion of non-static scenes from a monocular video sequence is important for applications like robot navigation and human computer interaction. If every point in the scene randomly moves, it is impossible to recover the non-rigid shapes. In practice, many non-rigid objects, e.g. the human face under various expressions, deform with certain structures. Their shapes can be regarded as a weighted combination of certain shape bases. Shape and motion recovery under such situations has attracted much interest. Previous work on this problem (Bregler, C., Hertzmann, A., and Biermann, H. 2000. In Proc. Int. Conf. Computer Vision and Pattern Recognition; Brand, M. 2001. In Proc. Int. Conf. Computer Vision and Pattern Recognition; Torresani, L., Yang, D., Alexander, G., and Bregler, C. 2001. In Proc. Int. Conf. Computer Vision and Pattern Recognition) utilized only orthonormality constraints on the camera rotations (rotation constraints). This paper proves that using only the rotation constraints results in ambiguous and invalid solutions. The ambiguity arises from the fact that the shape bases are not unique: an arbitrary linear transformation of the bases produces another set of eligible bases. To eliminate the ambiguity, we propose a set of novel constraints, basis constraints, which uniquely determine the shape bases. We prove that, under the weak-perspective projection model, enforcing both the basis and the rotation constraints leads to a closed-form solution to the problem of non-rigid shape and motion recovery. The accuracy and robustness of our closed-form solution are evaluated quantitatively on synthetic data and qualitatively on real video sequences.


Introduction
The many years of work in structure from motion have led to significant successes in the recovery of 3D shapes and motion estimates from 2D monocular videos to support modeling, rendering, visualization, and compression. Reliable systems exist for reconstructing the 3D geometry of static scenes. However, in the real world, most biological objects and natural scenes are flexible and often dynamic: faces carrying expressions, fingers bending, etc. Recovering the structure and motion of these non-rigid objects from a single-camera video stream is a challenging task. The effects of 3D rigid motion, i.e. camera rotation and translation, and non-rigid motion, like deforming and stretching, are coupled together in the image measurements. If every point on the objects deforms arbitrarily, it is impossible to reconstruct their shapes. In practice, many non-rigid objects, e.g. a face under various expressions or a scene consisting of static buildings and moving vehicles, deform regularly. Under such situations, the problem of shape and motion recovery is solvable.
One way to solve the problem is to use the application-specific models of non-rigid structure to constrain the deformation [2,3,5,8]. These methods model the non-rigid object shapes as weighted combinations of certain shape bases. For instance, the geometry of a face is represented as a weighted combination of shape bases that correspond to various facial deformations. The successes of these approaches suggest the advantage of basis representation of non-rigid shapes. However, such models are usually unknown and complicated. An offline training step is thus required to learn these models. In many applications, e.g. reconstruction of a scene consisting of a moving car and a static building, the models of the dynamic structure are often expensive or difficult to obtain.
Several approaches [6,12,4] were proposed to solve the problem from another direction. These methods do not require a prior model. Instead, they treat the model, i.e. the shape bases, as part of the unknowns to be solved. The goal of these approaches is to recover not only the non-rigid shape and motion, but also the shape model. They utilize only the orthonormality constraints on camera rotations (rotation constraints) to solve the problem. However, this paper proves that enforcing only the rotation constraints leads to ambiguous and invalid solutions. Previous approaches thus cannot guarantee the desired solution. They have to either require prior knowledge of shape and motion, e.g. constant speed [9], or resort to nonlinear optimization that involves a large number of variables and hence requires a good initial estimate [12,4].
Intuitively, the ambiguity of the solution obtained using only the rotation constraints arises from the non-uniqueness of the shape bases: a linear transformation of a set of shape bases is a new set of eligible bases. Once the bases are determined uniquely, the ambiguity is eliminated. Therefore, instead of imposing only the rotation constraints, we identify and introduce another set of constraints on the shape bases (basis constraints), which implicitly determine the bases uniquely. This paper proves that, under the weak-perspective projection model, when both the basis and rotation constraints are imposed, a closed-form solution to the problem of non-rigid shape and motion recovery is achieved. Accordingly we propose a factorization method that applies both metric constraints to compute the closed-form solution for the non-rigid shape, motion, and shape bases.

Previous Work
Recovering 3D object structure and motion from 2D image sequences has a rich history. Various approaches have been proposed for different applications. The discussion in this section will focus on the factorization techniques, which are closely related to our work.
The factorization method was first proposed by Tomasi and Kanade [11]. First it applies the rank constraint to factorize a set of feature locations tracked across the entire sequence.
Then it uses the orthonormality constraints on the rotation matrices to recover the scene structure and camera rotations in one step. This approach works under the orthographic projection model. Poelman and Kanade [10] extended it to work under the weak perspective and para-perspective projection models. Triggs [13] generalized the factorization method to the recovery of scene geometry and camera motion under the perspective projection model. These methods work only for static scenes.
For non-static scenes, Costeira and Kanade [7] proposed a factorization technique to recover the camera motion and shapes of multiple independently moving objects under the orthographic projection model. This technique factorizes the feature locations to compute a shape interaction matrix, then block-diagonalizes this matrix to segment different objects and recover their shapes and motions. Han and Kanade [9] introduced another factorization method to reconstruct a scene consisting of multiple objects, some of them static and the others moving along fixed directions and at constant speed. Wolf and Shashua [14] presented a more generalized solution to reconstructing the shapes that deform at constant velocity.
Bregler et al. [6] first introduced the representation of non-rigid shapes as weighted combinations of bases to reconstruct non-rigid shape and motion. Without assuming constant deformation speed, they proposed a sub-block re-ordering and factorization method to determine the shape bases, the combination coefficients of the bases, and the camera rotations simultaneously. This approach enforces only the rotation constraints. As we prove, the solution is inherently ambiguous and not optimal. To remedy the problem, Torresani and his colleagues [12] extended Bregler's method to a tri-linear optimization approach. At each step, two of the three types of unknowns, bases, coefficients, and rotations, are fixed and the remaining one is updated. Bregler's method is used to initialize the optimization process. Brand [4] proposed a similar non-linear optimization method that uses an extension of Bregler's method for initialization. Both non-linear optimization approaches still fail to impose the basis constraints, which is the essential reason that the method in [6] does not work well. Therefore they cannot guarantee the optimal solution either. Note that both optimization processes involve a large number of variables, e.g. the number of coefficients to be computed equals the product of the number of images and the number of shape bases. Their performance greatly relies on the quality of the initial estimates of this large number of unknowns, which is not easy to achieve.
We follow the representation of [3,6]. The non-rigid shape is represented as a weighted combination of K shape bases {B_i, i = 1,...,K}. The bases are 3 × P matrices controlling the deformation of P points. The 3D coordinate of point p at frame f is then

X_{fp} = \sum_{i=1}^{K} c_{fi} b_{ip},  f = 1,...,F, p = 1,...,P   (1)

where b_{ip} is the p-th column of B_i and c_{fi} is its combination coefficient at frame f. The image coordinate of X_{fp} under the weak-perspective projection model is

x_{fp} = s_f (R_f X_{fp} + t_f)   (2)

where R_f stands for the first two rows of the f-th camera rotation and t_f = [t_{fx} t_{fy}]^T is its translation relative to the world origin. s_f is the nonzero scalar of the weak-perspective projection.
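As a concrete illustration of this representation, here is a minimal NumPy sketch (our own hypothetical code; all variable names are assumptions, not from the paper) that composes a non-rigid 3D shape from K bases and projects it under the weak-perspective model:

```python
import numpy as np

rng = np.random.default_rng(0)
K, P = 2, 10                       # number of shape bases and feature points

# Shape bases B_i (3 x P) and the combination coefficients c_fi of one frame
bases = [rng.standard_normal((3, P)) for _ in range(K)]
coeffs = rng.standard_normal(K)

# Non-rigid 3D shape at frame f: X_f = sum_i c_fi * B_i
X_f = sum(c * B for c, B in zip(coeffs, bases))

# Weak-perspective projection: first two rows of a rotation matrix,
# a nonzero scale s_f, and a 2D translation t_f
theta = 0.3
R3 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
R_f = R3[:2]                       # 2 x 3
s_f = 1.5
t_f = np.array([[0.1], [0.2]])
x_f = s_f * (R_f @ X_f + t_f)      # 2 x P image coordinates
```

Stacking such 2 × P projections over F frames yields the measurement matrix used below.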
Replacing X_{fp} using Eq. (1) and absorbing s_f into c_{fi} and t_f,

x_{fp} = \sum_{i=1}^{K} c_{fi} R_f b_{ip} + t_f   (3)

Suppose the image coordinates of all P feature points across F frames are obtained. We form a 2F × P measurement matrix W by stacking all image coordinates. Then W = MB + T[1 ... 1], where M is a 2F × 3K scaled rotation matrix, B is a 3K × P bases matrix, and T is a 2F × 1 translation vector:

M = [c_{11}R_1 ... c_{1K}R_1; ... ; c_{F1}R_F ... c_{FK}R_F],  B = [B_1; ...; B_K],  T = [t_1; ...; t_F]   (4)

As in [9,6], we position the world origin at the scene center and compute the translation vector by averaging the image projections of all points. We then subtract it from W and obtain the registered measurement matrix W̃ = MB.
Since W̃ is the product of the 2F × 3K scaled rotation matrix M and the 3K × P shape bases matrix B, its rank is at most min{3K, 2F, P}. In practice, the frame number F and point number P are usually much larger than the basis number K. Thus the rank of W̃ is at most 3K and K is determined by K = rank(W̃)/3. We then perform SVD on W̃ to get the best possible rank-3K approximation of W̃ as M̃B̃. This decomposition is only determined up to a non-singular 3K × 3K linear transformation. The true scaled rotation matrix M and bases matrix B are of the form

M = M̃ · G,  B = G^{-1} · B̃   (5)

where G is called the corrective transformation matrix. Once G is determined, M and B are obtained and thus the rotations, shape bases, and combination coefficients are recovered. Since all the procedures above, except obtaining G, are standard and well-understood [3,6], the problem of non-rigid shape and motion recovery is now reduced to: given the measurement matrix W̃, how can we solve for the corrective transformation matrix G?
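The registration and rank-3K factorization steps above can be sketched as follows (a NumPy illustration under our own naming; the paper provides no code):

```python
import numpy as np

def register(W):
    """Subtract the per-frame translation, computed by averaging the
    image projections of all points (the row means of W)."""
    T = W.mean(axis=1, keepdims=True)
    return W - T, T

def factor_measurements(W_reg, K):
    """Rank-3K factorization of the registered measurement matrix
    W_reg (2F x P) via SVD: W_reg ~ M_tilde @ B_tilde.  The factors
    are determined only up to a non-singular 3K x 3K corrective
    transformation G, which remains the unknown to solve for."""
    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    r = 3 * K
    # Split the leading singular values evenly between the two factors
    M_tilde = U[:, :r] * np.sqrt(s[:r])          # 2F x 3K
    B_tilde = np.sqrt(s[:r])[:, None] * Vt[:r]   # 3K x P
    return M_tilde, B_tilde
```

For noiseless measurements the product M_tilde @ B_tilde reproduces the registered matrix exactly; with noise it is the best rank-3K approximation in the least-squares sense.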

Metric Constraints
In order to solve G, two types of metric constraints are available and should be imposed: rotation constraints and basis constraints. Using only the rotation constraints [6,4] leads to ambiguous solutions. Instead, imposing both constraints results in a closed-form solution.

Rotation Constraints
The orthonormality constraints on the rotation matrices are one of the most powerful metric constraints and they have been used in reconstructing the shape and motion for static objects [11,10], multiple moving objects [7,9], and non-rigid deforming objects [6,12,4].
According to Eq. (5), M_{2i-1:2i} = M̃_{2i-1:2i} G, where M̃_{2i-1:2i} represents the i-th two-row of M̃. Due to the orthonormality of the rotation matrices,

M̃_{2i-1:2i} Q M̃_{2i-1:2i}^T = M_{2i-1:2i} M_{2i-1:2i}^T = (\sum_{k=1}^{K} c_{ik}^2) I_{2×2}   (7)

where Q = GG^T and I_{2×2} is a 2 × 2 identity matrix. Since Q is symmetric, the number of unknowns in Q is (9K² + 3K)/2. Each diagonal 2 × 2 block of M̃QM̃^T yields two linear constraints on Q:

M̃_{2i-1} Q M̃_{2i-1}^T = M̃_{2i} Q M̃_{2i}^T   (8)
M̃_{2i-1} Q M̃_{2i}^T = 0   (9)

For F frames, we have 2F linear constraints on (9K² + 3K)/2 unknowns. It appears that, when we have enough images, i.e. 2F ≥ (9K² + 3K)/2, there will be enough constraints to solve Q via standard least-squares methods. However, this is not true in general. Many of these constraints are redundant. We will show later that, no matter how many frames or feature points are given, the linear constraints from Eq. (8) and Eq. (9) are not sufficient to determine Q.
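Since each quadratic form a Q b^T is linear in the entries of Q, the constraints of Eq. (8) and Eq. (9) can be assembled into a linear system on vec(Q). A possible sketch (our own helper, not the paper's code):

```python
import numpy as np

def rotation_constraint_rows(M_tilde):
    """Build the linear system on vec(Q) from the rotation constraints:
    for each frame i with rows a1, a2 of M_tilde,
        a1 Q a1^T - a2 Q a2^T = 0   and   a1 Q a2^T = 0.
    Each quadratic form a Q b^T equals vec(outer(a, b)) . vec(Q),
    so each frame contributes two rows to the system."""
    F2, n = M_tilde.shape
    rows = []
    for i in range(F2 // 2):
        a1, a2 = M_tilde[2 * i], M_tilde[2 * i + 1]
        rows.append(np.outer(a1, a1).ravel() - np.outer(a2, a2).ravel())
        rows.append(np.outer(a1, a2).ravel())
    return np.array(rows)          # (2F) x (3K)^2
```

With a true scaled rotation matrix, Q = I lies in the null space of this system, illustrating that the constraints are homogeneous and, as shown next, far from determining Q uniquely.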

Why are Rotation Constraints not Sufficient?
When the scene is static or deforms at constant velocity, the rotation constraints are sufficient to solve the corrective transformation matrix G [11,9,14]. However, when the scene deforms at varying speed, no matter how many images are given or how many feature points are tracked, the solutions of the constraints in Eq. (8) and Eq. (9) are inherently ambiguous. To describe the ambiguity, we need two definitions.

Definition 1.
A 3K × 3K symmetric matrix Y is called a block-skew-symmetric matrix, if each of its diagonal 3 × 3 blocks is a zero matrix and each of its off-diagonal 3 × 3 blocks is a skew-symmetric matrix.

Each off-diagonal block consists of 3 independent elements. Since Y is symmetric and has K(K − 1)/2 independent off-diagonal blocks, it totally includes 3K(K − 1)/2 independent elements.

Definition 2.
A 3K × 3K symmetric matrix Z is called a block-scaled-identity matrix, if each of its 3 × 3 blocks is a scaled identity matrix, i.e. Z_{ij} = λ_{ij} I_{3×3}, where λ_{ij} is the only variable.

Since Z is symmetric, the total number of variables in Z equals the number of independent blocks, K(K + 1)/2.

Theorem 1. Let H be the summation of Y and Z. Then Q = GHG^T is the general solution of the rotation constraints in Eq. (8) and Eq. (9), where G is the desired corrective transformation matrix.
Proof. Since G is a non-singular matrix, the solution Q of Eq. (8) and Eq. (9) can be represented as Q = GΛG^T. We need to prove that Λ must be of the form of H, i.e. the summation of Y and Z. According to Eq. (7),

M̃_{2i-1:2i} Q M̃_{2i-1:2i}^T = α_i I_{2×2}   (12)

where α_i is an unknown scalar. Divide Λ into 3 × 3 blocks, Λ_{kj} (k, j = 1,...,K). Combining Eq. (4) and Eq. (12),

R_i ( \sum_{k=1}^{K} ( c_{ik}^2 Λ_{kk} + \sum_{j=k+1}^{K} c_{ik} c_{ij} (Λ_{kj} + Λ_{kj}^T) ) ) R_i^T = α_i I_{2×2}   (13)

Denote the 3 × 3 symmetric matrix \sum_{k=1}^{K} (c_{ik}^2 Λ_{kk} + \sum_{j=k+1}^{K} c_{ik} c_{ij} (Λ_{kj} + Λ_{kj}^T)) by Γ_i. Let Γ̂_i be the homogeneous solution of Eq. (13), i.e. R_i Γ̂_i R_i^T = 0_{2×2}. Note that R_i consists of only the first two rows of the i-th rotation matrix. Let r_{i3} denote the third row. Due to the orthonormality constraints, Γ̂_i is determined by

Γ̂_i = β_i r_{i3}^T r_{i3}   (14)

where β_i is an arbitrary scalar. Apparently Γ_i = α_i I_{3×3} is a particular solution of Eq. (13). Therefore the general solution of Eq. (13) is

Γ_i = α_i I_{3×3} + β_i r_{i3}^T r_{i3}   (15)

where β_i is an arbitrary scalar. Since Q = GΛG^T is the general solution of the rotation constraints, Eq. (13) and Eq. (15) must be satisfied for any set of coefficients and rotations. If β_i for some frame i is not zero, then for another frame formed by the same coefficients but a different rotation, Eq. (15) and Eq. (14) are not satisfied. Therefore β_i has to be zero for every frame, i.e.

\sum_{k=1}^{K} (c_{ik}^2 Λ_{kk} + \sum_{j=k+1}^{K} c_{ik} c_{ij} (Λ_{kj} + Λ_{kj}^T)) = α_i I_{3×3}   (16)

Since Eq. (16) must be satisfied for any set of coefficients, the solution is
Λ_{kk} = λ_{kk} I_{3×3}   (17)
Λ_{kj} + Λ_{kj}^T = λ_{kj} I_{3×3}   (18)

where λ_{kk} and λ_{kj} are arbitrary scalars. According to Eq. (17), the diagonal block Λ_{kk} is a scaled identity matrix. Since the diagonal block of Z, Z_{kk}, is a scaled identity matrix and the diagonal block of Y, Y_{kk}, is a zero matrix, Λ_{kk} = Z_{kk} + Y_{kk}. Let Λ_{kjab}, a, b ∈ {1, 2, 3}, denote the elements of an off-diagonal block Λ_{kj}. Due to Eq. (18), the diagonal elements are Λ_{kj11} = Λ_{kj22} = Λ_{kj33} = λ_{kj}/2 and the off-diagonal elements satisfy Λ_{kj12} = −Λ_{kj21}, Λ_{kj13} = −Λ_{kj31}, and Λ_{kj23} = −Λ_{kj32}. Therefore Λ_{kj} equals the summation of a scaled identity block, Z_{kj}, and a skew-symmetric block, Y_{kj}. This concludes the proof: Λ equals H, the summation of a block-skew-symmetric matrix Y and a block-scaled-identity matrix Z, i.e. the general solution of the rotation constraints is Q = GHG^T.
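Theorem 1 can also be checked numerically: for any H = Y + Z, the matrix GHG^T satisfies the rotation constraints, so those constraints alone cannot distinguish H from the identity. A small self-contained sketch (our own construction, taking G = I for simplicity):

```python
import numpy as np

rng = np.random.default_rng(1)
K, F = 2, 5

def random_rotation(rng):
    # QR decomposition of a random matrix yields orthonormal rows
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    return q

# Scaled rotation matrix M (2F x 3K): two-rows are [c_i1 R_i ... c_iK R_i]
M = np.zeros((2 * F, 3 * K))
for i in range(F):
    Ri = random_rotation(rng)[:2]           # first two rows of rotation i
    c = rng.standard_normal(K)
    for k in range(K):
        M[2 * i:2 * i + 2, 3 * k:3 * k + 3] = c[k] * Ri

# H = Y + Z: Y block-skew-symmetric (zero diagonal blocks),
# Z block-scaled-identity
H = np.zeros((3 * K, 3 * K))
for k in range(K):
    for j in range(k, K):
        blk = rng.standard_normal() * np.eye(3)
        if j > k:                            # add a skew-symmetric part
            S = rng.standard_normal((3, 3))
            blk = blk + (S - S.T)
        H[3 * k:3 * k + 3, 3 * j:3 * j + 3] = blk
        H[3 * j:3 * j + 3, 3 * k:3 * k + 3] = blk.T

# Q = G H G^T (G = I) satisfies the rotation constraints: every
# diagonal 2x2 block of M H M^T is a scaled identity matrix
Q = M @ H @ M.T
for i in range(F):
    blk = Q[2 * i:2 * i + 2, 2 * i:2 * i + 2]
    assert abs(blk[0, 0] - blk[1, 1]) < 1e-9 and abs(blk[0, 1]) < 1e-9
```

The skew-symmetric parts cancel in the quadratic form because H is symmetric, which is exactly the mechanism behind the ambiguity.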

Since H consists of 2K² − K independent elements, 3K(K − 1)/2 from Y and K(K + 1)/2 from Z, the solution space of the rotation constraints has 2K² − K degrees of freedom.
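The count can be verified by a one-line computation (our own check, not the paper's code):

```python
# 3K(K-1)/2 independent elements from the block-skew-symmetric Y plus
# K(K+1)/2 from the block-scaled-identity Z sum to 2K^2 - K
for K in range(1, 20):
    assert 3 * K * (K - 1) // 2 + K * (K + 1) // 2 == 2 * K**2 - K
```

In particular, even in the rigid case K = 1 the rotation constraints leave one degree of freedom (a global scale), and the ambiguity grows quadratically with the number of bases.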

Basis Constraints
For static scenes, a variety of approaches [11,10,13] utilize only the rotation constraints and succeed in determining the correct solution. Now we are dealing with non-static scenes under a certain assumption on the non-rigidity, i.e. that the shape is representable as a direct combination of shape bases. Under such situations, enforcing only the rotation constraints results in a solution space that contains ambiguous and invalid solutions. Are there other constraints that we can use to determine the desired solution in this space? Intuitively, since the only difference between the non-rigid and the rigid situations is that the non-rigid shape deforms as a direct combination of a certain number of shape bases, can we impose constraints on the bases and thereby eliminate the ambiguity?
Since any non-singular linear transformation on the shape bases yields a new set of eligible bases, the bases and the corresponding combination coefficients are not unique. However, their composition, i.e. the non-rigid shape, is unique. Thus the bases and the coefficients depend on each other: once one of them is determined, the other is also determined. If we can obtain K frames that include independent shapes and treat those shapes as a set of bases, both the bases and the coefficients are determined uniquely. Without loss of generality, we assume the shapes in the first K frames are independent of each other. For any three-column of G, g_k, k = 1,...,K, according to Eq. (5),

If the first K shapes are not independent, we can find K frames in which the shapes are independent, by examining the singular values of their image projections. We then reorder the sequence by moving these K frames to the top.

Proof. First we prove that the rank of H̃_{ij} is less than 3. Due to Eq. (27) and the orthonormality constraints, where r̃_i = r_{i1} × r_{i2} and α_1 and α_2 are two arbitrary scalars:
- If both α_1 and α_2 are not equal to 0, the linear system H̃_{ij} x = r̃_i^T has at least two independent solutions, r̃_i^T/α_1 and r̃_i^T/α_2. Hence H̃_{ij} is not a non-singular matrix and its rank is less than its dimension, 3.
- If either α_1 or α_2 equals 0, say α_1, the linear system H̃_{ij} x = 0_{3×1} has at least one non-zero solution, r̃_i^T. H̃_{ij} is thus singular and its rank is less than 3.
Due to Lemma 1, we derive the following theorem.

Performance Evaluation
The performance of the closed-form solution is evaluated in a number of experiments. First, we compare its performance with that of previous work. Second, we evaluate its robustness and accuracy quantitatively on synthetic data. Third, we apply it to real image sequences to examine it qualitatively.

Comparison with Previous Work
Previous methods enforce only the rotation constraints and thus have limitations. [6] reorders and factorizes each two-row of M̃ to compute the coefficients and rotations. Then the rotation constraints are applied to compute a 3×3 corrective transformation G_s as in [11]. This process is equivalent to assuming the desired G is diag(G_s, ..., G_s). Whereas this assumption is correct for static scenes, it does not hold when the scene is non-rigid. Brand [4] extended [6] by applying the rotation constraints to compute a different corrective transformation for each three-column of M̃ independently. This is equivalent to assuming G is diag(G_s1, ..., G_sK), where the diagonal blocks are different. This assumption often does not hold, because M̃ can be an arbitrary linear transformation of the true M and its three-columns are usually mixed up. The regularization term that minimizes the deformation bases does not help much, since one can have arbitrarily small bases but large coefficients and achieve the same reconstruction. The tri-linear algorithm [12] does not assume a particular form of G, but involves a large number of unknowns, e.g. the number of coefficients is FK. It enforces only the rotation constraints, so many local optima exist. Its performance depends on the quality of the initial estimate, which is not easy to achieve, especially for such a large number of unknowns.
Let us demonstrate that the weakness of the above approaches actually results in erroneous solutions, even for a simple noiseless example. Figure 1 shows a scene consisting of a static cube and 3 moving points, marked as diamonds, triangles, and squares. The measurements are reconstructed by our closed-form solution, the method in [6], Brand's method [4], and the tri-linear method [12], the latter two after 4000 iterations. While our closed-form solution achieves the exact reconstruction, all three previous methods result in apparent reconstruction errors, even for such a simple and noiseless setting. Figure 2 demonstrates the reconstruction errors of the previous work on rotations, shapes, and image measurements. The errors are computed relative to the ground truth.

Quantitative Evaluation on Synthetic Data
Our approach is quantitatively evaluated on synthetic data. We evaluate the accuracy and robustness with respect to three factors: deformation strength, number of shape bases, and noise level. The deformation strength shows how close to rigid the shape is and is represented by the ratios of the powers (Frobenius norms) of the bases. A larger ratio means weaker deformation, i.e. the shape is closer to rigid. The number of shape bases represents how flexible the shape is. A bigger basis number means more control variables on the shape need to be solved for. Under noiseless situations, a good approach should provide the exact solution, no matter how strong the deformation is or how big the basis number is.
In real applications, the data are often contaminated by noise. Under such situations, a good method should be robust enough to provide reasonably accurate solutions, regardless of strong deformation or a big basis number. Assuming Gaussian white noise, we represent the noise strength level by the ratio between the standard deviation of the noise and the power of the measurement W. Under the same noise level, weaker deformation leads to better performance, since some deformation mode is more dominant and the noise relative to the dominant basis is weaker. When the powers of the bases are close to each other, a bigger basis number results in poorer performance, because the noise relative to each individual basis is stronger. In both experiments, when the noise level is 0%, the closed-form solution always recovers the exact rotations and shapes. When noise is present, it achieves reasonable accuracy, e.g. the maximum reconstruction error is less than 15% when the noise level is 20%. As analyzed above, under the same noise level, the performance gets better when the power ratio is larger and poorer when the basis number is bigger. Note that in all the experiments, the condition number of the linear system consisting of both basis constraints and rotation constraints has order of magnitude O(10) to O(10²), even if the basis number is big and the deformation is strong. Our closed-form solution is thus numerically stable.
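One plausible way to synthesize noise at a given level for such an evaluation is shown below; the exact normalization used in the paper's experiments is an assumption on our part:

```python
import numpy as np

def add_noise(W, level, rng):
    """Add Gaussian white noise whose overall power is (approximately)
    `level` times the power (Frobenius norm) of the measurements W.
    The per-entry standard deviation is scaled so that the expected
    Frobenius norm of the noise equals level * ||W||_F."""
    sigma = level * np.linalg.norm(W) / np.sqrt(W.size)
    return W + sigma * rng.standard_normal(W.shape)
```

Running the recovery on add_noise(W, 0.05, ...), add_noise(W, 0.10, ...), etc., reproduces the kind of noise-level sweep described above.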

Qualitative Evaluation on Real Video Sequences
We examined our approach qualitatively on a number of real video sequences. The first sequence was taken of an indoor scene by a handheld camera. Three objects, a car, a plane, and a toy person, moved along fixed directions and at varying speeds. The rest of the scene was static. The car and the person moved on the floor and the plane moved along a slope.
The scene structure was composed of two bases, one for the static objects and another for the moving objects. 32 feature points tracked across 18 images were used for reconstruction. Two of the images are shown in Figure 4.(a) and (d).
The rank of W̃ was estimated in such a way that after rank reduction at least 99% of the energy was kept. The basis number is then automatically determined by K = rank(W̃)/3. Figure 4.(b) and (e) show the images warped to a common view based on the reconstruction by the closed-form solution. The wireframes show the structure and the yellow lines show the trajectories of the moving objects up to the present frames. The reconstruction is consistent with our observation, e.g. the plane moved linearly on top of the slope. Figure 4.(c) and (f) show the reconstruction using Brand's method [4]. The shapes of the boxes are distorted and the plane is incorrectly located underneath the slope, as shown in the yellow circles.
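The rank-estimation rule described above (keep at least 99% of the energy after rank reduction) can be sketched as follows (our own helper names):

```python
import numpy as np

def estimate_rank(W, energy=0.99):
    """Smallest rank whose leading singular values retain the given
    fraction of the total energy (sum of squared singular values)."""
    s = np.linalg.svd(W, compute_uv=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(cum, energy) + 1)

# The basis number then follows as K = estimate_rank(W_reg) // 3
```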
The second sequence was taken of a human face by a static video camera. It consisted of 236 images and contained various facial expressions and head rotations. 68 feature points were manually picked in the first frame and then tracked automatically using the Active Appearance Model method [1]. Figure 5.(a) and (d) display two of the images with marked features. According to the shapes reconstructed by our method, we warp the images into a common view, as shown in Figure 5.(b) and (e). The corresponding 3D wireframe models shown in Figure 5.(c) and (f) demonstrate that non-rigid facial motions such as mouth opening and eye closure were recovered successfully. Note that the feature correspondence in these experiments was noisy, especially for the features on the sides of the face. The reconstruction demonstrates the robustness of our method.

Conclusion and Discussion
This paper proposes a closed-form solution to the problem of non-rigid shape and motion recovery from video, under the weak-perspective projection model. It makes three main contributions: first, we prove that enforcing only the rotation constraints results in ambiguous and invalid solutions; second, we identify and introduce the basis constraints; third, we prove that imposing both the rotation and basis constraints leads to a closed-form solution to non-rigid shape and motion recovery.
A deformation mode is degenerate, if it limits the shape to deform in a plane, i.e. the rank of the corresponding basis is less than 3. Such a case occurs in practice, e.g. if a scene contains only one moving object that moves along a straight line, the basis referring to the linear motion is degenerate, since the motion vector is of rank 1. Under degenerate situations, the basis constraints cannot determine the degenerate bases. As a result, the ambiguity of the rotation constraints cannot be completely eliminated and thus enforcing both metric constraints is insufficient to produce a closed-form solution. The degeneracy problem can be solved using an alternating linear optimization method.
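A simple degeneracy check on a recovered basis follows directly from this rank criterion (illustrative code of ours, not part of the paper's method):

```python
import numpy as np

def is_degenerate(basis, tol=1e-8):
    """A 3 x P shape basis is degenerate when its rank is below 3,
    i.e. it confines the deformation to a plane or a line (e.g. a
    single object moving along a straight line yields a rank-1 basis)."""
    s = np.linalg.svd(basis, compute_uv=False)
    return int((s > tol * s[0]).sum()) < 3
```

In practice such a test could decide whether the closed-form solution applies or whether the alternating linear optimization mentioned above is needed.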
In applications such as motion capture, the acquired data are usually compositions of the 3D non-rigid structures and their corresponding poses. One has to decouple the originally acquired data in order to capture the accurate 3D shapes. The proposed method can be easily extended to solve this problem.