Deep Non-Rigid Structure from Motion

Current non-rigid structure from motion (NRSfM) algorithms are mainly limited with respect to: (i) the number of images, and (ii) the type of shape variability they can handle. This has hampered the practical utility of NRSfM for many applications within vision. In this paper we propose a novel deep neural network to recover camera poses and 3D points solely from an ensemble of 2D image coordinates. The proposed neural network is mathematically interpretable as a multi-layer block sparse dictionary learning problem, and can handle problems of unprecedented scale and shape complexity. Extensive experiments demonstrate the impressive performance of our approach, which exhibits precision and robustness superior to all available state-of-the-art works by an order of magnitude. We further propose a quality measure (based on the network weights) which circumvents the need for 3D ground truth to ascertain the confidence we have in the reconstruction.


Introduction
Building an AI capable of inferring the 3D structure and pose of an object from a single image is a problem of immense importance. Training such a system using supervised learning requires a large number of labeled images; how to obtain these labels is currently an open problem for the vision community. Rendering [27] is problematic, as synthetic images seldom match the appearance and geometry of the objects we encounter in the real world. Hand annotation is preferable, but current strategies rely on associating the natural images with an external 3D dataset (e.g. ShapeNet [8], ModelNet [32]), which we refer to as 3D supervision. If the 3D shape dataset does not capture the variation we see in the imagery, then the problem is inherently ill-posed.
Non-Rigid Structure from Motion (NRSfM) offers computer vision a way out of this quandary: it recovers the pose and 3D structure of an object category solely from hand-annotated 2D landmarks, with no need for 3D supervision. Classically [6], the problem of NRSfM has been applied to objects that move non-rigidly over time, such as the human body and face. But NRSfM is not restricted to non-rigid objects; it can equally be applied to rigid objects whose object categories deform non-rigidly [19,2,30]. Consider, for example, the five objects in Figure 1 (top), instances from the visual object category "chair". Each object in isolation represents a rigid chair, but the set of all 3D shapes describing "chair" is non-rigid. In other words, each object instance can be modeled as a deformation from its category's general shape.

Figure 1: In this paper, we want to reconstruct 3D shapes solely from a sequence of annotated images (shown on top) with no need of 3D ground truth. Our proposed hierarchical sparse coding model and corresponding deep solution outperform the state of the art by an order of magnitude.
NRSfM is well known in the literature to be ill-posed due to non-rigidity. This has mainly been addressed by imposing additional shape priors, e.g. low rank [6,10], union-of-subspaces [36,2], and block sparsity [18,19]. However, low rank is only applicable to simple non-rigid objects with limited deformations, and union-of-subspaces relies heavily on frame clustering, which is itself still an open problem. Block sparsity (each shape can be represented by at most $K$ bases out of $L$) is considered one of the most promising assumptions in terms of covering broad shape variations. This is because sparsity can be thought of as a union of $\binom{L}{K}$ subspaces, where $L$ can be large when an over-complete dictionary is utilized. However, as pointed out by Kong et al. [18], searching for the best subspace among the $\binom{L}{K}$ candidates is extremely hard and not robust. Based on this observation, in this paper we propose a novel shape prior using hierarchical sparse coding. Compared to single-layer sparse coding, the additional layers are capable of controlling the number of subspaces by learning from data, such that invalid subspaces are removed while sufficient subspaces are retained for modeling shape variations. This insight is at the heart of our paper.

Contributions:
• We propose a novel shape prior based on hierarchical sparse coding and demonstrate that the 2D projections under orthogonal cameras can be represented by the hierarchical dictionaries in a block sparse way.

Related Work
Low-rank NRSfM: In rigid structure from motion, the rank of the 3D structure is fixed at three [29], since the 3D shape remains the same between frames. Based on this insight, Bregler et al. [6] first proposed that non-rigid 3D structure could be represented by a linear subspace of low rank. Dai et al. [10] developed this prior by proving, theoretically and practically, that the low-rank assumption alone is sufficient to address the ill-posedness of NRSfM with no need for additional priors. Combined with temporal smoothness, the low-rank assumption has also been extended to the temporal domain [3,15]. Despite its impressive success, the low-rank assumption has a major drawback: it is not capable of modeling complex shape variations.
Union-of-subspaces NRSfM: Inspired by the intuition that complex non-rigid deformations can be clustered into a sequence of simple motions, Zhu et al. [36] proposed to model non-rigid 3D structure by a union of local subspaces. This was later extended to the spatio-temporal domain [1] and applied to rigid object category reconstruction [2]. The major difficulties of union-of-subspaces are how to effectively cluster shape deformations purely from 2D observations, and how to estimate the affinity matrix when the number of frames is large, e.g. more than tens of thousands of frames.
Sparse NRSfM: The sparsity prior [18,35,19] is more generic than union-of-subspaces, since it is equivalent to the union of all possible local subspaces. One obvious advantage is that the large number of subspaces can model a much broader set of 3D structures. However, it is this large number that leads to the major drawback of the sparsity prior: searching for the best subspace is sensitive to noise and prone to falling into local minima. In this paper, we propose to resolve this contradiction by using hierarchical sparse coding.

Background
Sparse dictionary learning can be considered an unsupervised learning task and divided into two sub-problems: (i) dictionary learning, and (ii) sparse code recovery. Let us consider the sparse code recovery problem, where we estimate a sparse representation $z$ for a measurement vector $x$ given the dictionary $D$, i.e.

$\min_z \frac{1}{2}\|x - Dz\|_2^2 + \lambda \|z\|_1,$
where $\lambda$, related to the trust region, controls the sparsity of the recovered code. One classical algorithm to recover the sparse representation is the Iterative Shrinkage and Thresholding Algorithm (ISTA) [11,26,5]. ISTA iteratively executes two steps: it first uses the gradient of $\|x - Dz\|_2^2$ to update $z^{[i]}$ with step size $\alpha$, and then finds the closest sparse solution using an $\ell_1$ convex relaxation. It is well known in the literature that the second step has a closed-form solution using the soft thresholding operator. Therefore, ISTA can be summarized by the recursive equation

$z^{[i+1]} = h_\tau\big(z^{[i]} + \alpha D^\top (x - D z^{[i]})\big),$

where $h_\tau$ is a soft thresholding operator and $\tau$ is related to $\lambda$ for controlling sparsity. Recently, Papyan [25] proposed to use ISTA and sparse coding to reinterpret feed-forward neural networks. They argue that feed-forward passing a single-layer neural network $z = \mathrm{ReLU}(D^\top x - b)$ can be considered one iteration of ISTA when $z \ge 0$, $\alpha = 1$, and $\tau = b$. Based on this insight, the authors extend this interpretation to a feed-forward neural network with $n$ layers as executing a sequence of single-iteration ISTAs, serving as an approximate solution to the multi-layer sparse coding problem: find $\{z_i\}_{i=1}^n$ such that

$x = D_1 z_1,\quad z_{i-1} = D_i z_i \ (i = 2, \dots, n),\quad \|z_i\|_0 \le \lambda_i,$

where the bias terms $\{b_i\}_{i=1}^n$ (in a similar manner to $\tau$) are related to $\{\lambda_i\}_{i=1}^n$, adjusting the sparsity of the recovered codes. Furthermore, they reinterpret back-propagation through the deep neural network as learning the dictionaries $\{D_i\}_{i=1}^n$. This connection offers a novel breakthrough for understanding DNNs. In this paper, we extend it to the block sparse scenario and apply it to solving our NRSfM problem.
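As a concrete illustration, the ISTA recursion above can be sketched in a few lines of NumPy. This is a minimal sketch; the dictionary, step size, and threshold below are placeholders for illustration, not the ones used in the paper.

```python
import numpy as np

def soft_threshold(v, tau):
    # h_tau(v) = sign(v) * max(|v| - tau, 0), applied element-wise
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(x, D, tau=0.05, n_iter=500):
    # Iterate z <- h_{tau*alpha}(z + alpha * D^T (x - D z)), with step
    # size alpha = 1 / ||D||_2^2 so the gradient step is stable.
    alpha = 1.0 / np.linalg.norm(D, 2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = soft_threshold(z + alpha * D.T @ (x - D @ z), tau * alpha)
    return z
```

Setting $\alpha = 1$ and folding $\tau$ into a bias $b$ recovers the single-layer network $z = \mathrm{ReLU}(D^\top x - b)$ for non-negative codes, which is the reinterpretation of [25] described above.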

Deep Non-Rigid Structure from Motion
Under orthogonal projection, NRSfM deals with the problem of factorizing a 2D projection matrix $W \in \mathbb{R}^{p\times2}$ as the product of a 3D shape matrix $S \in \mathbb{R}^{p\times3}$ and a camera matrix $M \in \mathbb{R}^{3\times2}$. Formally, $W = SM$, where the $i$-th rows of $W$ and $S$ are the image and world coordinates of the $i$-th point. The goal of NRSfM is to recover simultaneously the shape $S$ and the camera $M$ for each projection $W$ in a given set $\mathbb{W}$ of 2D landmarks. In general NRSfM, including SfC, this set $\mathbb{W}$ could contain deformations of a non-rigid object or various instances of an object category.

Modeling via multi-layer sparse coding
To alleviate the ill-posedness of NRSfM while guaranteeing sufficient freedom of shape variation, we propose a novel prior on 3D shapes via multi-layer sparse coding: the vectorization $s$ of $S$ satisfies

$s = D_1\psi_1,\quad \psi_{i-1} = D_i\psi_i \ (i = 2,\dots,n),\quad \|\psi_i\|_0 \le \lambda_i, \qquad (9)$

where $D_1 \in \mathbb{R}^{3p\times k_1}, D_2 \in \mathbb{R}^{k_1\times k_2}, \ldots, D_n \in \mathbb{R}^{k_{n-1}\times k_n}$ are hierarchical dictionaries. In this prior, each non-rigid shape is represented by a sequence of hierarchical dictionaries and corresponding sparse codes. Each sparse code is determined by its lower-level neighbor and affects the next level. Clearly this hierarchy adds more parameters, and thus more freedom, into the system. We now show that it paradoxically results in a more constrained global dictionary and sparse code recovery.
More constrained code recovery: In a classical single-dictionary system, the constraint on the representation is element-wise sparsity, and the quality of its recovery entirely depends on the quality of the dictionary. In our multi-layer sparse coding model, the optimal code not only minimizes the difference between the measurement $s$ and $D_1\psi_1$ along with the sparsity regularization $\|\psi_1\|_0$, but must also satisfy the constraints from its subsequent representations. This additional joint inference imposes more constraints on code recovery, helps to control uniqueness, and therefore alleviates the heavy dependency on dictionary quality.
More constrained dictionary: When all equality constraints are satisfied, the multi-layer sparse coding model degenerates to a single-dictionary system: from Equation 9, by denoting $D^{(n)} = D_1 D_2 \cdots D_n$, we have $s = D^{(n)}\psi_n$. However, this differs from other single-dictionary models [36,37,18,19,34] in that a unique structure is imposed on $D^{(n)}$ [28]. The dictionary $D^{(n)}$ is composed hierarchically of simpler atoms: each column of $D^{(2)} = D_1 D_2$ is a linear combination of atoms in $D_1$, each column of $D^{(3)} = D^{(2)} D_3$ is a linear combination of atoms in $D^{(2)}$, and so on. Such a structure results in a more constrained global dictionary and potentially leads to higher quality, i.e. lower mutual coherence [14].
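The degenerate single-dictionary view can be checked numerically. The dimensions below are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D1 = rng.standard_normal((30, 20))  # plays the role of D_1 (3p x k1)
D2 = rng.standard_normal((20, 12))  # plays the role of D_2 (k1 x k2)
D3 = rng.standard_normal((12, 6))   # plays the role of D_3 (k2 x k3)

# The effective global dictionary is the product of the hierarchy.
D_global = D1 @ D2 @ D3             # D^(3)

# Every atom (column) of D^(3) is a linear combination of atoms of
# D^(2) = D1 D2, which are themselves combinations of atoms of D1.
D2_global = D1 @ D2
for j in range(D3.shape[1]):
    assert np.allclose(D_global[:, j], D2_global @ D3[:, j])
```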

Multi-layer block sparse coding
Given the proposed multi-layer sparse coding model, we now build a conduit from the proposed shape code to the 2D projected points. From Equation 9, we reshape the vector $s$ into a matrix $S \in \mathbb{R}^{p\times3}$ such that $S = D_1^\sharp(\psi_1 \otimes I_3)$, where $\otimes$ is the Kronecker product and $D_1^\sharp \in \mathbb{R}^{p\times3k_1}$ is a reshape of $D_1$ [10]. From linear algebra, it is well known that $AB \otimes I = (A \otimes I)(B \otimes I)$ for matrices $A$, $B$ and identity matrix $I$. Based on this lemma, we can derive that

$S = D_1^\sharp (D_2 \otimes I_3)(D_3 \otimes I_3)\cdots(D_n \otimes I_3)(\psi_n \otimes I_3). \qquad (10)$

Further, from Equation 7, right-multiplying the camera matrix $M \in \mathbb{R}^{3\times2}$ on both sides of Equation 10 gives

$W = D_1^\sharp\Psi_1,\quad \Psi_{i-1} = (D_i \otimes I_3)\Psi_i \ (i = 2,\dots,n),\quad \Psi_n = (\psi_n \otimes I_3)M,\quad \|\Psi_i\|_0^{(3\times2)} \le \lambda_i, \qquad (11)$

where $\|\cdot\|_0^{(3\times2)}$ divides the argument matrix into blocks of size $3\times2$ and counts the number of active blocks. Since $\psi_i$ has fewer than $\lambda_i$ active elements, $\Psi_i$ has fewer than $\lambda_i$ active blocks, i.e. $\Psi_i$ is block sparse. This derivation demonstrates that if the shape vector $s$ satisfies the multi-layer sparse coding prior of Equation 9, then its 2D projection $W$ must follow the multi-layer block sparse coding of Equation 11. We thereby interpret NRSfM as a hierarchical block sparse dictionary learning problem, i.e. factorizing $W$ as products of hierarchical dictionaries $\{D_i\}_{i=1}^n$ and block sparse coefficients.

Figure 2: Deep NRSfM architecture. The network can be divided into two parts, an encoder and a decoder, which are symmetric and share convolution kernels (i.e. dictionaries). The symbol $a\times b, c \to d$ refers to an operator using kernel size $a\times b$ with $c$ input channels and $d$ output channels.
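The lemma $AB \otimes I = (A \otimes I)(B \otimes I)$ that drives this derivation is easy to verify numerically; the matrix sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
B = rng.standard_normal((3, 2))
I3 = np.eye(3)

# (AB) kron I  ==  (A kron I)(B kron I): Kronecker products with the
# identity distribute over matrix multiplication (a special case of
# the mixed-product property).
lhs = np.kron(A @ B, I3)
rhs = np.kron(A, I3) @ np.kron(B, I3)
assert np.allclose(lhs, rhs)
```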

Block ISTA and DNN solution
Before solving the multi-layer block sparse coding problem in Equation 11, we first consider the single-layer problem:

$\min_{Z} \frac{1}{2}\|W - D^\sharp Z\|_F^2 \quad \text{s.t.} \quad \|Z\|_0^{(3\times2)} \le \lambda.$

Inspired by ISTA, we propose to solve this problem by iteratively executing the following two steps:

$V = Z^{[i]} + \alpha D^{\sharp\top}(W - D^\sharp Z^{[i]}),\qquad Z^{[i+1]} = \arg\min_{Z} \frac{1}{2}\|Z - V\|_F^2 + \tau \|Z\|_{F1}^{(3\times2)},$

where $\|\cdot\|_{F1}^{(3\times2)}$ is defined as the sum of the Frobenius norms of the $3\times2$ blocks, serving as a convex relaxation of the block sparsity constraint. It is derived in [13] that the second step has a closed-form solution computed for each block separately:

$Z_j^{[i+1]} = \frac{h_\tau(\|V_j\|_F)}{\|V_j\|_F}\,V_j,$

where the subscript $j$ denotes the $j$-th block and $h_\tau$ is a soft thresholding operator. However, soft thresholding the Frobenius norms of every block brings unnecessary computational complexity. We show in the supplementary material that an efficient approximation is

$Z_j^{[i+1]} = h_{b_j}(V_j),$

where $b_j$ is the threshold for the $j$-th block, controlling its sparsity. Based on this approximation, a single-iteration block ISTA with step size $\alpha = 1$ can be written as

$Z = h_b(D^{\sharp\top} W) = \mathrm{ReLU}(D^{\sharp\top} W - b),$

where $h_b$ uses the $j$-th element $b_j$ as the threshold for the $j$-th block, and the second equality holds when $Z$ is non-negative.
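The exact block thresholding step can be sketched in plain NumPy as follows; this illustrates the closed-form solution above (the paper's faster per-element approximation is not reproduced here):

```python
import numpy as np

def block_soft_threshold(V, tau, block=(3, 2)):
    # Shrink each 3x2 block by its Frobenius norm:
    #   Z_j = max(1 - tau / ||V_j||_F, 0) * V_j,
    # which zeroes out blocks whose norm falls below tau and shrinks
    # the rest, enforcing block sparsity.
    bh, bw = block
    Z = np.zeros_like(V)
    for i in range(0, V.shape[0], bh):
        for j in range(0, V.shape[1], bw):
            Vj = V[i:i + bh, j:j + bw]
            n = np.linalg.norm(Vj)
            if n > tau:
                Z[i:i + bh, j:j + bw] = (1.0 - tau / n) * Vj
    return Z
```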
Encoder: Recall from Section 3 that a feed-forward pass through a deep neural network can be considered a sequence of single ISTA iterations, and thus provides an approximate recovery of multi-layer sparse codes. We follow the same scheme: we first assume the multi-layer block sparse codes to be non-negative and then sequentially apply single-iteration block ISTA, i.e.

$\Psi_1 = \mathrm{ReLU}(D_1^{\sharp\top} W - b_1),\qquad \Psi_i = \mathrm{ReLU}\big((D_i \otimes I_3)^\top \Psi_{i-1} - b_i\big),\ i = 2,\dots,n,$

where the thresholds $b_1, \dots, b_n$ are learned, controlling the block sparsity. This learning is crucial: in previous NRSfM algorithms utilizing low-rank [10], subspace [36], or compressible [18] priors, the weight given to the prior (e.g. rank or sparsity) is hand-selected through a cumbersome cross-validation process. In our approach, this weighting is learned simultaneously with all other parameters, removing the need for any irksome cross-validation. These formulas compose the encoder of our proposed DNN.
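Ignoring the block structure for brevity, the encoder is a stack of ReLU layers with learned biases. A minimal sketch, with illustrative shapes and values:

```python
import numpy as np

def encoder(x, dictionaries, biases):
    # One non-negative single-iteration ISTA step per layer:
    #   z_i = ReLU(D_i^T z_{i-1} - b_i), with z_0 = x.
    z = x
    for D, b in zip(dictionaries, biases):
        z = np.maximum(D.T @ z - b, 0.0)
    return z
```

In the full model each intermediate code is the block code $\Psi_i$, the multiplication by $D_i^\top$ is realized as a 1×1 convolution, and each bias $b_i$ is learned jointly with the dictionaries.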
Decoder: Let us for now assume that we can extract the camera $M$ and the regular sparse hidden code $\psi_n$ from $\Psi_n$ by some functions, i.e. $M = \mathcal{F}(\Psi_n)$ and $\psi_n = \mathcal{G}(\Psi_n)$, which will be discussed in the next section. Then we can compute the 3D shape vector $s$ by

$\psi_{i-1} = \mathrm{ReLU}(D_i \psi_i - b_{i-1}),\ i = n,\dots,2,\qquad s = D_1\psi_1.$

Note that we preserve the ReLU and bias terms during decoding to further enforce sparsity and improve robustness. This portion forms the decoder of our DNN.
Variation of implementation: The Kronecker product with the identity matrix $I_3$ dramatically increases the time and space complexity of our approach. To eliminate it and make parameter sharing easier in modern deep learning environments (e.g. TensorFlow, PyTorch), we reshape the filters and features and show that the matrix multiplication in each step of the encoder and decoder can be equivalently computed via multi-channel $1\times1$ convolution ($*$) and transposed convolution ($*^\top$).
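This equivalence rests on the fact that a multi-channel 1×1 convolution mixes channels independently at each point, i.e. it is a per-point matrix multiplication. A small NumPy check, with a naive loop standing in for the convolution:

```python
import numpy as np

def conv1x1(feat, kernel):
    # feat: (p, c_in) features at p points; kernel: (c_in, c_out).
    # A 1x1 convolution applies the same channel-mixing matrix at
    # every point, which is exactly a matrix multiplication.
    return feat @ kernel

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 6))
kernel = rng.standard_normal((6, 4))

# A naive per-point loop matches the matmul formulation.
looped = np.stack([kernel.T @ feat[i] for i in range(8)])
assert np.allclose(conv1x1(feat, kernel), looped)
```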
Code and camera recovery: Estimating $\psi_n$ and $M$ from $\Psi_n$ is discussed in [18] and solved by a closed-form formula. Due to its differentiability, we could insert this solution directly into our pipeline. An alternative is an approximation: a fully connected layer mapping $\Psi_n$ to $\psi_n$, and a linear combination of the blocks of $\Psi_n$ to estimate $M$, where the fully connected layer parameters and combination coefficients are learned from data. In our experiments, we use the approximate solution and represent it via convolutions, as shown in Figure 2, for conciseness and to maintain proper dimensions. Since the approximation has no way to enforce the orthonormality constraint on the camera, we rely on the loss function to do so.
(Footnote 1: The filter dimension is height $\times$ width $\times$ #input channels $\times$ #output channels. The feature dimension is height $\times$ width $\times$ #channels.)

Loss function: The loss function must measure the reprojection error between the input 2D points $W$ and the reprojected 2D points $SM$, while simultaneously encouraging orthonormality of the estimated camera $M$. One solution is to use

Table 1: Quantitative comparison against state-of-the-art algorithms on the IKEA dataset (bed, chair, sofa, table), in normalized 3D error.
spectral norm regularization of $M$, because spectral norm minimization is the tightest convex relaxation of the orthonormality constraint [34]. An alternative is to hard-code the singular values of $M$ to be exactly one with the help of the Singular Value Decomposition (SVD). Even though SVD is in general non-differentiable, the numerical computation of SVD is differentiable, and most deep learning packages implement its gradients (e.g. PyTorch, TensorFlow). In our implementation and experiments, we use SVD to enforce the orthonormality constraint and a simple Frobenius norm to measure the reprojection error:

$\mathcal{L} = \|W - S\,U V^\top\|_F,$

where $U\Sigma V^\top = M$ is the SVD of the camera matrix.
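The SVD-based correction and the resulting loss can be sketched as follows, with NumPy standing in for an autodiff framework and variable names of our own choosing:

```python
import numpy as np

def orthonormalize_camera(M):
    # Replace the singular values of the 3x2 camera with exact ones:
    # M_hat = U V^T, where U S V^T = svd(M). The result has
    # orthonormal columns: M_hat^T M_hat = I_2.
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def reprojection_loss(W, S, M):
    # Frobenius-norm reprojection error using the corrected camera.
    return np.linalg.norm(W - S @ orthonormalize_camera(M))
```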

Experiments
We conduct extensive experiments to evaluate the performance of our deep solution for solving NRSfM and SfC problems. For quantitative evaluation, we follow the metric reported in [4,10,16,2], i.e. normalized mean 3D error. A detailed description of our architectures is in the supplementary material. Our implementation and processed data will be publicly accessible for future comparison.

SfC on IKEA furniture
We first apply our method to the IKEA furniture dataset [23,31], which contains four object categories: bed, chair, sofa, and table. For each object category, we employ all annotated 2D point clouds and augment them with 2K additional point clouds projected from the 3D ground truth using randomly generated orthonormal cameras. The errors evaluated on real images are reported in Table 1. One can observe that our method outperforms the baselines by an order of magnitude, clearly showing the superiority of our model. For qualitative evaluation, we randomly select a frame from each object category and show it in Figure 6 against the ground truth and baselines. The reconstructed landmarks effectively depict the 3D geometry of the objects, and our method is able to capture subtle geometric details.

Table 2: Quantitative evaluation on the PASCAL3D+ dataset, comparing against KSTA [16], BMM [10], CNS [20], MUS [2], NLO [12], RIKS [17], and SPS [18]. We conduct experiments on both original and noisy 2D annotations, listed in the upper and lower halves of the table respectively. The symbol '-' indicates that either the algorithm implementation or the data is missing. The shaded columns are errors using our processed data; the others are copied from Table 2 in [2]. Relative errors are computed with respect to our method, the most accurate solution, without noise perturbation. Our data and implementation will be publicly accessible for future comparison.

Sf C on PASCAL3D+
We then apply our method to the PASCAL3D+ dataset [33], which contains twelve object categories, each labeled with approximately eight 3D CAD models. To compare against more baselines, we follow the experimental setting reported in [2] and use the same normalized 3D error metric. We report our errors in Table 2, emphasized by shading, alongside the numbers copied from Table 2 in [2]. Note that the errors are not exactly reproduced, even when using the same dataset and algorithm implementations, because the data preparation details are missing. However, one can clearly see that our proposed method achieves extremely accurate reconstructions, with more than ten times smaller 3D error. This large margin makes the slight difference caused by data preparation even less noticeable, and clearly demonstrates the high precision of our proposed deep neural network as well as its superior robustness in noisy conditions.

Large-scale NRSfM on CMU MoCap
We finally apply our method to solving the problem of NRSfM using the CMU motion capture dataset (http://mocap.cs.cmu.edu/). We randomly select 10 subjects out of 144; for each subject, we concatenate 80% of the motions to form a large image collection and reserve the remaining 20% as unseen motions for testing generalization. Note that in this experiment, each subject contains more than ten thousand frames. We compare our method against state-of-the-art methods, summarized in Table 3. Due to the huge volume of frames, KSTA [16], BMM [10], MUS [2], and RIKS [17] all fail and are thus omitted from the table. We also report the normalized 3D error on unseen motions, labeled as UNSEEN. One can see that our method obtains impressive reconstruction performance and again outperforms the others on every sequence. Moreover, our network also generalizes well to unseen data, which improves its effectiveness in real-world applications. For qualitative evaluation, we randomly select a frame from each subject and render the reconstructed human skeleton in Figure 5. This visually verifies the impressive performance of our deep solution.

Table 3: Quantitative comparison on solving the large-scale NRSfM problem using the CMU MoCap dataset. Each subject contains more than ten thousand frames. Due to the huge volume of frames, KSTA [16], BMM [10], MUS [2], and RIKS [17] all fail and are thus omitted from the table. UNSEEN refers to the errors on motions that were not accessible during training, demonstrating the generalization of our proposed network, which is especially important in real-world applications.
Robustness analysis: To analyze the robustness of our method, we re-train the neural network for Subject 70 using projected points perturbed by Gaussian noise. The results are summarized in Figure 3. The noise ratio is defined as $\|\text{noise}\|_F / \|W\|_F$. One can see that the error increases slowly as noise of higher magnitude is added; even with up to 20% noise on the image coordinates, our method (in red) still achieves better reconstruction than the best baseline with no noise perturbation, CNS [20] (in green). This experiment clearly demonstrates the robustness of our model and its high accuracy against state-of-the-art works.

Figure 4: A scatter plot of the shape error ratio (in percent) against the final dictionary coherence, with a fitted line. The left plot comes from subject 05, the middle from subject 18, and the right from subject 64.

... reconstruction from sparse observations [9,24,21,22,7]. These two solutions make the central pipeline of our DNN easier to adapt to handling missing data.

Coherence as a guide
As explained in Section 4.1, every sparse code $\psi_i$ is constrained by its subsequent representation, and thus the quality of code recovery depends less on the quality of the corresponding dictionary. However, this does not apply to the final code $\psi_n$, making it the least constrained and the most dependent on the final dictionary $D_n$. From this perspective, the quality of the final dictionary, measured by its mutual coherence [14], can serve as a lower bound on the quality of the entire system. To verify this, we compute the error and coherence at fixed intervals during training in the NRSfM experiments. We consistently observe strong correlations between the 3D reconstruction error and the mutual coherence of the final dictionary; we plot this relationship in Figure 4. We thus propose to use the coherence of the final dictionary as a measure of model quality for guiding training, efficiently avoiding overfitting especially when 3D evaluation is not available. This improves the utility of our Deep NRSfM in future applications without 3D ground truth.
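Mutual coherence is cheap to monitor during training. A minimal implementation:

```python
import numpy as np

def mutual_coherence(D):
    # Largest absolute inner product between distinct l2-normalized
    # atoms (columns) of the dictionary D; 0 for an orthogonal
    # dictionary, 1 when two atoms are parallel.
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)
    np.fill_diagonal(G, 0.0)
    return G.max()
```

Evaluated on the final dictionary $D_n$ at fixed intervals, a rising coherence flags a degrading model even when no 3D ground truth is available.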

Conclusion
In this paper, we proposed multi-layer sparse coding as a novel prior for representing 3D non-rigid shapes and designed an innovative encoder-decoder neural network to solve the problem of NRSfM with no 3D supervision. The proposed DNN was derived by generalizing the classical sparse coding algorithm ISTA to the block sparse scenario, and its architecture is mathematically interpretable as an NRSfM multi-layer sparse dictionary learning problem. Extensive experiments demonstrated our superior performance against the state-of-the-art methods and our generalization to unseen data. Finally, we proposed to use the coherence of the final dictionary as a generalization measure, offering a practical way to avoid overfitting and to select the best model without 3D ground truth.

Figure 5: ... CNS [20], SPS [18], NLO [12]. Each column corresponds to the reconstructions of a frame randomly selected from each subject. Spheres are reconstructed landmarks, while bars are added for visualization. The 3D shapes are aligned to the ground truth by an orthonormal matrix.

Figure 6: Qualitative evaluation on the IKEA dataset. Landmarks projected by annotated cameras are omitted from the images. In each rendering, red cubes are reconstructed points, while the planes and bars are manually added for better visualization. Left to right: annotated image, ground truth, ours, RIKS [17], KSTA [16], NLO [12], SFC [19], CNS [20], BMM [10].