Shape and motion without depth

Inferring the depth and shape of remote objects and the camera motion from a sequence of images is possible in principle, but is an ill-conditioned problem when the objects are distant with respect to their size. This problem is overcome by inferring shape and motion without computing depth as an intermediate step. On a single epipolar plane, an image sequence can be represented by the F × P matrix of the image coordinates of P points tracked through F frames. It is shown that under orthographic projection this matrix is of rank three. Using this result, the authors develop a shape-and-motion algorithm based on singular value decomposition. The algorithm gives accurate results, without relying on any smoothness assumption for either shape or motion.


Introduction
In principle, the shape of an object can be computed from a sequence of images by estimating camera motion and depth, and inferring shape from the depth values.
However, when objects are distant from the camera, relative to their size, this computation is ill-conditioned. First, it is difficult to distinguish rotation from translation with adequate precision. Second, shape is computed from the small differences between large depth values.
These difficulties can be circumvented by inferring shape directly from variations in the distance between image features, without computing depth and camera translation as intermediate steps.
In this paper, we show how to infer shape and camera rotation from any number of features and frames, and reduce the computation to decomposing a matrix of image measurements.
The resulting algorithm, tested in simple situations, gives remarkably precise motion and shape estimates, without introducing smoothing effects into the result.
In 1979, Ullman proposed [Ullman, 1979] to compute shape and motion without going through depth. His first formulation assumed an orthographic projection model, and hence ignored the combined effects of depth and perspective distortion. He justified this simplification partly on the ground of mathematical tractability. The important point that computing depth leads to instability if the scene is remote did not receive all the emphasis it deserved.
Most of the work carrying out Ullman's proposal has concentrated on obtaining shape and motion with the minimal number of points and frames. These results are useful proofs of the existence of a solution. In this paper, we propose a way to incorporate any number of points and frames (greater than the minimum required) into the computation of shape and motion. For simplicity, we limit our consideration to one epipolar plane at a time, and assume that motion occurs in that plane. As a consequence, our images are single scanlines.
Our solution is based on two observations which, to our knowledge, have not appeared in the literature: under orthography, (1) the incidence relations among projection rays can be expressed as the degeneracy of the matrix of all the measurements; (2) the image coordinates of any two points in the epipolar plane trace an ellipse as the camera moves, if the coordinates are registered with respect to those of a third point.
Using these observations, we developed an algorithm that computes the shape of remote objects and the rotation of the camera. Since we use many closely spaced frames, the results are insensitive to noise, and the correspondence problem is simplified.
As an illustration of the theory, we used our algorithm to recover the shape of a one-dollar silver coin (about 4 cm in diameter) at 3.5 meters distance from a real camera with a long lens. The total rotation of the camera was 30 degrees around the coin (and in the midplane of the coin). The error in the computed camera rotation is always less than one degree, and that in the shape of the coin is less than one percent of its diameter. These errors are mostly due to perspective effects, for which corrections are possible (but not made here).
In the following, we introduce our scenario, summarize the results, and sketch the relations of our work with previous literature on the subject. Chapter 2 proves the two geometric observations above. Chapter 3 shows how to use them to decompose the measurement matrix into shape and camera rotation. The experimental results in chapter 4 show the ability of the algorithm to deal with jerky rotations without smoothing its output. The conclusion (chapter 5) compares direct shape algorithms with algorithms which base the computation of shape on that of depth, and shows the former to be superior for remote scenes.

The Scenario
We assume that the camera produces an orthographic projection, rather than a perspective one. The world is still, and the camera moves in a plane, where it can freely rotate and/or translate. P features are visible in a given scanline, parallel to the plane of motion. Since the frames are taken frequently, it is easy to track the features from frame to frame. As the camera moves, it is panned so as to keep the features in the field of view.
In every frame, the image coordinate of an additional reference point is subtracted from the image coordinates of the P points. After F frames, an F × P matrix m of image measurements is available. This matrix is the input to the algorithm. This is a rather artificial situation, but it approximates well what happens with a camera on an airplane, with suitable control mechanisms to align the camera scanlines with the direction of flight, and to keep the same object within the field of view. The farther away the objects are with respect to their size, the better the assumption of orthographic projection serves as an approximation.
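The construction of the measurement matrix can be illustrated with a small synthetic sketch. It is not part of the original implementation: the point positions, the translation values, and the sign convention of the projection (image coordinate = cos(angle) X + sin(angle) Z plus a translation) are assumptions made for illustration only. The sketch shows that subtracting the reference point removes the unknown camera translation from the measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

F, P = 20, 6                                   # frames and tracked points (illustrative sizes)
alpha = np.deg2rad(np.linspace(0.0, 30.0, F))  # frame angles, first frame at zero
c, s = np.cos(alpha), np.sin(alpha)

ref = np.array([0.3, -0.2])                    # reference point (X0, Z0), arbitrary
pts = ref[:, None] + rng.uniform(-1.0, 1.0, size=(2, P))  # rows: X and Z of the P points
t = rng.uniform(-5.0, 5.0, size=F)             # unknown camera translation along the scanline

def project(X, Z):
    """Orthographic image coordinate in every frame (assumed sign convention)."""
    return np.outer(c, X) + np.outer(s, Z) + t[:, None]

x = project(pts[0], pts[1])                    # F x P raw image coordinates
x0 = project(ref[:1], ref[1:])                 # F x 1 coordinates of the reference point
m = x - x0                                     # F x P registered measurement matrix

# Same matrix computed from coordinates relative to the reference, with no translation
rel = pts - ref[:, None]
m_expected = np.outer(c, rel[0]) + np.outer(s, rel[1])
print(m.shape, np.allclose(m, m_expected))
```

After registration, only the rotation angles and the point coordinates relative to the reference remain in m, which is why camera translation plays no role in the rest of the computation.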

The Results
This paper shows that if the measurements are noise-free, the measurement matrix m is highly degenerate (its rank is 2), and can be decomposed into the product of three smaller matrices: an F × 2 matrix $\rho$, which encodes camera rotation, a P × 2 matrix $\pi$, which encodes the positions of the world points, and a 2 × 2 diagonal matrix $\sigma$.
In reality, however, noise corrupts the measurements. The decomposition is still valid in an approximate sense, and $\sigma$ tells how reliable the decomposition is.
The matrix m is factored into $\rho$, $\pi$, and $\sigma$ by singular value decomposition [Golub and Reinsch, 1971], which is known to be efficient and numerically well behaved. If more points and frames are used than prescribed by equation-counting arguments (which require a minimum of three points, including the reference, and three frames), the effects of noise can be reduced.
The resulting shape and rotation algorithm is simple and efficient, and has been implemented and tested on small objects as distant as one hundred times their size (see chapter 4). The rotation errors are always smaller than one degree, and usually much smaller. The relative precision in the computed shape is of the order of the relative depth range, defined as the ratio between the size of the object along the optic rays and its distance from the camera.
The good performance of our algorithm derives from the fact that depth is not used as an intermediate result. For remote objects, the inference of depth is very sensitive to noise in the images, so that the quality of the depth estimates obtained by triangulation degrades as the relative depth range decreases. Consequently, the shape estimates worsen even faster, since the computation of shape from depth is itself ill-conditioned.
In our approach, instead, shape is related directly to the variations in the distances between image features from frame to frame. No triangulation is done, and the amount of camera translation becomes irrelevant.

Relations with Previous Work
Our goal is to compute camera motion and world point coordinates, relative to each other, from multiple frames.
In essence, our algorithm does what photogrammetrists for more than thirty years have known how to do by hand and with two frames at a time [Thompson, 1959]. Ullman proposed an automated solution to this problem eleven years ago [Ullman, 1979], and called it structure-from-motion.
Most of the initial efforts in this area have been devoted to finding closed-form solutions with a minimal or nearly-minimal number of points and/or frames (see, for instance, [Longuet-Higgins, 1981]).
In general, structure-from-motion is hard to solve. The major difficulty is the inherent sensitivity of the shape and motion results to noise in the image, especially when objects are distant. Performance degrades with reductions in the relative depth range. For instance, the algorithm presented in [Tsai and Huang, 1984] works very well for close objects, which is the intended goal of that paper, but the performance is likely to degrade when objects become more remote, and the relative depth range becomes smaller. If the images are noisy, few points and/or few frames give bad results, regardless of how good the math is.
The remedy is to use many frames and many points, exploiting redundancy to counteract noise. If frames are closely spaced, the correspondence problem is also easier to solve. This has been tried, with relatively good results, for the inference of depth when the motion of the camera is known. See for instance [Bolles et al., 1987] or [Matthies et al., 1989].
In [Spetsakis and Aloimonos, 1989], an interesting algorithm is presented for the case of unknown motion, using several frames and points and a perspective projection model. In spirit, our approach is akin to theirs: the projection lines of the same world point are a bundle (or pencil) of lines, and the resulting incidence relations between them allow casting the computation of shape and motion as a minimization problem. Our solution, however, does not recover depth or camera translation. We bypass this intermediate stage, and obtain a solution which is partial, but more reliable for remote scenes.

Chapter 2 The Decomposition Principles
In this chapter we introduce the two observations on which we base the computation of shape and motion. As we stated in the introduction, we consider only one scanline per frame, and assume that the camera moves in a plane parallel to the scanline.
In this plane, we define an orthogonal system of coordinates (X, Z), with the X axis along the scanline in the first frame. The origin of the system is a visible reference point on the object, as in figure 2.1.
The images are orthographic projections. Image points are registered by subtracting from their projections, $x_{fp}$, the projection of the reference point, $x_{f0}$:

$$m_{fp} = x_{fp} - x_{f0} . \tag{2.1}$$

There are P points, besides the reference point, and they are tracked through F frames. The registered measurements $m_{fp}$ can then be collected in an F × P matrix m. Registration is equivalent to translating every image along itself so that the reference point projects always to the same image location. In addition, we can translate every image along its projection rays so that it passes through the reference point. In summary, all images can be thought of as rotating around the reference point, as in figure 2.1.

From the figure, we see that the projection line of point p onto frame f is represented by the equation

$$c_f X + s_f Z = m_{fp}$$

where $c_f$ and $s_f$ are the cosine and sine of the angle $\alpha_f$ that frame f forms with frame 1 (thus, $\alpha_1 = 0$).
We now show two facts about the measurement matrix m. First, it is of rank 2. Second, given any two points p and q, the pairs $(m_{fp}, m_{fq})$ must be on an ellipse for all frames f = 1, ..., F.

The Rank Principle
Without noise, the rank of the measurement matrix m is two.
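All the projection lines of point p belong to a pencil, since they must pass through point p itself. Therefore, for any three frames f, g, h, the projection line equations for point p,

$$c_f X + s_f Z = m_{fp}, \qquad c_g X + s_g Z = m_{gp}, \qquad c_h X + s_h Z = m_{hp},$$

must have a common solution, so that

$$\begin{vmatrix} c_f & s_f & m_{fp} \\ c_g & s_g & m_{gp} \\ c_h & s_h & m_{hp} \end{vmatrix} = 0 . \tag{2.2}$$

If we now read the matrices by columns, the incidence equations mean that the three vectors $(c_f, c_g, c_h)^T$, $(s_f, s_g, s_h)^T$, and $(m_{fp}, m_{gp}, m_{hp})^T$ are linearly dependent. Since the cosine and sine vectors are the same for every point, the measurement vectors of any three points, restricted to these three frames, lie in the same two-dimensional space. Thus, any third order determinant extracted from the exact measurement matrix $m = [m_{fp}]$ is equal to zero: the rank of m is smaller than 3. In appendix A we prove that, unless all points are aligned, some 2 × 2 determinant extracted from the matrix m must be non-zero, so that the rank of m is exactly 2. The row space and the column space of m are two-dimensional.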
Geometrically, this result means that the rows of m (one row per frame), interpreted as points in a P-dimensional space, must lie on a plane through the origin, call it the frame plane. The same holds for the columns of m (one column per point), interpreted as points in an F-dimensional space.
Intuitively, the rank principle says that the F × P measurements are not unrelated: they could be described in a simpler way by giving F frame angles and P points, if only these were known. That this degeneracy takes on the form of a simple rank equation (rank(m) = 2) is due to the linear nature of the orthographic projection equation.
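The rank principle is easy to check numerically. The following is a minimal sketch on made-up data, assuming the projection convention in which the registered measurement is cos(angle) X + sin(angle) Z. It builds a noise-free measurement matrix for irregularly spaced frame angles and checks both its rank and one randomly chosen third order determinant.

```python
import numpy as np

rng = np.random.default_rng(1)
F, P = 15, 7                                             # illustrative sizes
alpha = np.deg2rad(np.sort(rng.uniform(0.0, 40.0, F)))   # irregular frame angles
X, Z = rng.uniform(-1.0, 1.0, size=(2, P))               # points relative to the reference

# Noise-free registered measurements under orthography (assumed sign convention)
m = np.outer(np.cos(alpha), X) + np.outer(np.sin(alpha), Z)

rank = np.linalg.matrix_rank(m, tol=1e-10)

# Every 3 x 3 minor should vanish up to rounding; check one picked at random
rows = rng.choice(F, 3, replace=False)
cols = rng.choice(P, 3, replace=False)
minor = np.linalg.det(m[np.ix_(rows, cols)])
print(rank, minor)
```

The explicit tolerance passed to `matrix_rank` is a choice made here so that rounding errors in the third singular value are not counted as rank.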

The Ellipse Principle
The registered image projections $(m_{fp}, m_{fq})$ of two points p and q lie on an ellipse for all frames f = 1, ..., F.
The incidence equations (2.2) hold for any triple of projection lines relative to the same point. Then, the rank of the F × 3 matrix

$$\begin{bmatrix} c_1 & s_1 & m_{1p} \\ \vdots & \vdots & \vdots \\ c_F & s_F & m_{Fp} \end{bmatrix}$$

is also two. Since this holds for any p, we conclude that the cosine and sine vectors c and s belong to the frame plane. Any two independent vectors $m_p$ and $m_q$ (they are independent if points p and q are not aligned with the reference point) span the frame plane, and there must be four numbers (for a given pair p and q) $\alpha_c^{(pq)}$, $\beta_c^{(pq)}$, $\alpha_s^{(pq)}$, $\beta_s^{(pq)}$ such that

$$c = \alpha_c^{(pq)} m_p + \beta_c^{(pq)} m_q , \qquad s = \alpha_s^{(pq)} m_p + \beta_s^{(pq)} m_q .$$
By squaring and adding these two vector equations component by component, we obtain the following F equations

$$\left[(\alpha_c^{(pq)})^2 + (\alpha_s^{(pq)})^2\right] m_{fp}^2 + \left[(\beta_c^{(pq)})^2 + (\beta_s^{(pq)})^2\right] m_{fq}^2 + 2\left[\alpha_c^{(pq)}\beta_c^{(pq)} + \alpha_s^{(pq)}\beta_s^{(pq)}\right] m_{fp}\, m_{fq} = 1$$

for p and q fixed and f = 1, ..., F. These equations say that all pairs $(m_{fp}, m_{fq})$ lie on the same ellipse, centered at the origin. We can draw P(P − 1)/2 ellipses, one for every pair of points p and q.

Intuitively, this can be understood by the following thought experiment: if the camera were to rotate at uniform angular velocity, the projection of each point would be a sinusoidal function of time. If two such sinusoids represent the orthogonal coordinates of a point moving on a plane, the point traces an ellipse. In fact, this is how Lissajous figures are drawn on an oscilloscope. Let us now remove the condition of constant camera rotation. The phase relation between the two sinusoids is preserved, because the two coordinates of each point on the ellipse refer to the same camera frame. Therefore, we obtain the same ellipse, but sampled at irregular intervals.
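The ellipse principle can be verified on synthetic data. The sketch below is only an illustration: the two point positions and the irregular frame angles are made up, and the projection convention (measurement = cos(angle) X + sin(angle) Z) is an assumption. It expresses the cosine and sine vectors in the basis formed by the two measurement columns, and then checks that the resulting quadratic form equals one in every frame.

```python
import numpy as np

rng = np.random.default_rng(2)
F = 25
alpha = np.deg2rad(np.sort(rng.uniform(0.0, 360.0, F)))  # irregularly spaced frame angles
c, s = np.cos(alpha), np.sin(alpha)

(Xp, Zp), (Xq, Zq) = (1.0, 0.4), (-0.3, 0.8)   # two points not aligned with the reference
m_p = c * Xp + s * Zp                           # column of m for point p
m_q = c * Xq + s * Zq                           # column of m for point q

# Express c and s in the basis (m_p, m_q): [m_p m_q] @ coeffs = [c s]
B = np.column_stack([m_p, m_q])
coeffs = np.linalg.lstsq(B, np.column_stack([c, s]), rcond=None)[0]
(a_c, a_s), (b_c, b_s) = coeffs                 # row 0: alpha_c, alpha_s; row 1: beta_c, beta_s

# Quadratic form obtained by squaring and adding the two vector equations
A2 = a_c**2 + a_s**2
B2 = b_c**2 + b_s**2
D = a_c * b_c + a_s * b_s
q = A2 * m_p**2 + B2 * m_q**2 + 2 * D * m_p * m_q
print(np.allclose(q, 1.0), A2 * B2 - D**2 > 0)
```

The second printed value checks that the conic is indeed an ellipse (positive discriminant), which holds whenever the two points are not aligned with the reference point.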

Chapter 3 The Algorithm: Dealing with Noise
When images are noisy, the measurement matrix m will not be exactly of rank 2. However, the rank principle can be extended to the case of noisy measurements. We do this by using the concept of Singular Value Decomposition (SVD) [Golub and Reinsch, 1971] to introduce the notion of approximate rank. The ellipse principle is also readily extended, by replacing interpolation (the points are on the ellipse) with fitting (the points are near an ellipse).
In this chapter, we examine these extensions, and show how to use the extended principles to compute shape and motion from a matrix of noisy image measurements.
Assuming that F > P, m can be decomposed [Golub and Reinsch, 1971] into an F × P matrix $\rho$, a diagonal P × P matrix $\sigma$, and a P × P matrix $\pi$, such that

$$m = \rho\,\sigma\,\pi^T , \qquad \rho^T\rho = \pi^T\pi = \pi\pi^T = I , \tag{3.1}$$

where I is the P × P identity matrix, and the singular values $\sigma_1, \ldots, \sigma_P$ (sorted in non-increasing order) are the diagonal entries of $\sigma$. This is called the Singular Value Decomposition (SVD) of the matrix m.
We can now restate the rank principle for noisy measurements.
The first two singular values of the noisy measurement matrix m are much greater than the others:

$$\sigma_1 \ge \sigma_2 \gg \sigma_3 \ge \cdots \ge \sigma_P . \tag{3.2}$$

Golub and Reinsch [Golub and Reinsch, 1971] give an efficient and well behaved algorithm to compute the decomposition. Consider now the matrix $\mu$ which is obtained by setting to zero all the singular values after $\sigma_2$ in the decomposition (3.1):

$$\mu = \sigma_1 \rho_1 \pi_1^T + \sigma_2 \rho_2 \pi_2^T , \tag{3.3}$$

where the first two columns of $\rho$ are denoted by $\rho_1$ and $\rho_2$, and the first two columns of $\pi$ are denoted by $\pi_1$ and $\pi_2$. It can be shown [Forsythe et al., 1977] that the 2-norm of the (matrix) difference between m and $\mu$ does not exceed $\sigma_3$. Hence, the value of $\sigma_3$ can serve to assess the quality of the approximation $m \approx \mu$. If equation (3.2) holds, we can expect $\mu$ to be a clean version of m, after removing noise in the least square error sense.
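The effect of truncating the decomposition can be sketched with NumPy's `numpy.linalg.svd`. This is an illustration on synthetic data, not the routine used in the experiments: the noise level, the matrix sizes, and the variable names `rho`, `sigma`, `pi_t`, and `mu` are choices made here to mirror the symbols in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
F, P = 60, 12                                   # F > P, as assumed in the text
alpha = np.deg2rad(np.linspace(0.0, 30.0, F))
X, Z = rng.uniform(-1.0, 1.0, size=(2, P))
m_exact = np.outer(np.cos(alpha), X) + np.outer(np.sin(alpha), Z)
m = m_exact + 1e-3 * rng.standard_normal((F, P))   # hypothetical noise level

# Thin SVD: m = rho @ diag(sigma) @ pi_t, singular values sorted in decreasing order
rho, sigma, pi_t = np.linalg.svd(m, full_matrices=False)

# Keep only the two largest singular values
mu = (rho[:, :2] * sigma[:2]) @ pi_t[:2]

residual = np.linalg.norm(m - mu, 2)            # spectral (2-)norm of the difference
print(sigma[:4], residual)
```

On data like these, the first two singular values stand well clear of the rest, and the spectral norm of the discarded part coincides with the third singular value, so that value is a convenient single-number check on how well the rank-2 model fits.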
The two vectors $\rho_1$ and $\rho_2$ are a basis for the frame plane, that is, for the column space of $\mu$. Then, we can apply the ellipse principle to these vectors, rather than to two columns $m_p$ and $m_q$ of the measurement matrix; the P(P − 1)/2 ellipses found in the previous chapter, one for every pair of points p and q, are now replaced by one ellipse, whose coefficients account for all of the measurements through the vectors $\rho_1$ and $\rho_2$:

$$(\alpha_c^2 + \alpha_s^2)\,\rho_{f1}^2 + (\beta_c^2 + \beta_s^2)\,\rho_{f2}^2 + 2(\alpha_c\beta_c + \alpha_s\beta_s)\,\rho_{f1}\rho_{f2} \approx 1 . \tag{3.4}$$

The reason for the approximate equality is that the two vectors $\rho_1$ and $\rho_2$ are a basis only for the best estimate of the frame plane, not for the true frame plane. Therefore, if we require the two vectors c and s to lie on the estimated frame plane, the normalization conditions $c_f^2 + s_f^2 = 1$ will hold only approximately.
The remaining steps needed to complete the solution are the following:

• find the coefficients $a^2 = \alpha_c^2 + \alpha_s^2$, $b^2 = \beta_c^2 + \beta_s^2$, and $d = \alpha_c\beta_c + \alpha_s\beta_s$ of the ellipse by solving the following overconstrained F × 3 system of equations in the least square error sense:

$$a^2 \rho_{f1}^2 + b^2 \rho_{f2}^2 + 2d\,\rho_{f1}\rho_{f2} = 1 , \qquad f = 1, \ldots, F ;$$

• from these coefficients, compute the vectors c and s. These are the two vectors on the frame plane which best satisfy the normalization conditions $c_f^2 + s_f^2 = 1$;

• find two vectors c' and s' that satisfy the normalization equations exactly, and that are as close as possible to c and s. Here, the correct metric is the Euclidean metric in the space of the measurements $m_{fp}$: we want to move from c to c' and from s to s' while perturbing the values of the measurements as little as possible. As shown in appendix B, this is equivalent to changing the vectors $\rho_1$ and $\rho_2$ into two new vectors $\rho'_1$ and $\rho'_2$ so as to minimize

$$\sum_{f=1}^{F} \left[ \sigma_1^2 (\rho'_{f1} - \rho_{f1})^2 + \sigma_2^2 (\rho'_{f2} - \rho_{f2})^2 \right]$$

subject to the normalization constraints. This is a simple Lagrange minimization problem. Its solution yields the cosines and sines of the frame angles $\alpha_f$, that is, the camera rotation;

• compute the coordinates $X_p$ and $Z_p$ of every object point p by finding the least square error intersection of all its projection lines. This is done in appendix C.

These steps have been implemented in a computer program, which was tested on several image sequences. The next chapter describes an illustrative experiment.
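The steps above can be sketched in a few lines of NumPy. This is a hedged illustration, not the program used for the experiments, and it departs from the text in two places that should be read as assumptions: the ambiguity left by the ellipse coefficients is resolved with a Cholesky factor followed by a rotation that places the first frame at angle zero, and the weighted Lagrange step is replaced by a plain renormalization of each pair (c, s). The function name `shape_and_rotation` and the synthetic data are hypothetical.

```python
import numpy as np

def shape_and_rotation(m):
    """Sketch of the steps above for an F x P registered measurement matrix m.

    Returns the frame angles (radians, first frame at 0) and a 2 x P array
    with the X and Z coordinates of the points relative to the reference.
    """
    # First two left singular vectors: a basis of the estimated frame plane
    rho, sigma, pi_t = np.linalg.svd(m, full_matrices=False)
    r1, r2 = rho[:, 0], rho[:, 1]

    # Least-squares ellipse fit: a2*r1^2 + b2*r2^2 + 2*d*r1*r2 = 1 in every frame
    A = np.column_stack([r1**2, r2**2, 2.0 * r1 * r2])
    a2, b2, d = np.linalg.lstsq(A, np.ones(len(r1)), rcond=None)[0]

    # Factor the ellipse matrix Q = M^T M, so that (c_f, s_f) = M @ (rho_f1, rho_f2)
    # has c_f^2 + s_f^2 close to 1. M is defined up to a rotation or reflection.
    Q = np.array([[a2, d], [d, b2]])
    M = np.linalg.cholesky(Q).T
    cs = rho[:, :2] @ M.T

    # Remove the rotation ambiguity by placing the first frame at angle zero
    th = np.arctan2(cs[0, 1], cs[0, 0])
    R = np.array([[np.cos(th), np.sin(th)], [-np.sin(th), np.cos(th)]])
    cs = cs @ R.T

    # ASSUMPTION: plain renormalization stands in for the weighted Lagrange step
    cs /= np.linalg.norm(cs, axis=1, keepdims=True)

    # Least-squares intersection of the projection lines c_f X + s_f Z = m_fp
    shape = np.linalg.lstsq(cs, m, rcond=None)[0]
    angles = np.unwrap(np.arctan2(cs[:, 1], cs[:, 0]))
    return angles, shape

# Noise-free synthetic check (all values are made up for illustration)
rng = np.random.default_rng(4)
alpha = np.deg2rad(np.linspace(0.0, 30.0, 31))
X, Z = rng.uniform(-1.0, 1.0, size=(2, 9))
m = np.outer(np.cos(alpha), X) + np.outer(np.sin(alpha), Z)
angles, shape = shape_and_rotation(m)
print(np.rad2deg(np.abs(angles[-1])))
```

Under orthography a scene and its mirror image produce the same measurements, so the sketch recovers the angles and the Z coordinates only up to a common sign; the check on the result should therefore compare absolute values.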

An Experiment
The purpose of the experiment described in this chapter is to illustrate the rank and ellipse principles, demonstrate the good quality of the results, and quantify the influence of perspective effects on the accuracy of the motion estimates.
The key parameter is the relative depth range, which we defined as the ratio of the object size along the projection rays and the distance between camera and object. The relative errors in the computed shape are of the same order as the relative depth range, and modeling inaccuracies that are small with respect to it can be ignored.
We put a one-dollar coin (about 4 cm in diameter) approximately 3.5 meters away from a Sony CCD camera with a 300 mm Tokina lens. Thus, the relative depth range was 4/350 ≈ 0.011. Figure 4.1 shows the setup.
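The arithmetic behind the relative depth range, and the reason it matters for depth-based methods, can be spelled out as follows. The coin size and distance are those quoted above; the one percent depth accuracy is a hypothetical figure introduced only to illustrate the scale of the problem.

```python
coin_diameter_cm = 4.0          # "about 4 cm" in the text
distance_cm = 350.0             # "approximately 3.5 meters"

relative_depth_range = coin_diameter_cm / distance_cm
print(f"relative depth range = {relative_depth_range:.4f}")

# HYPOTHETICAL: suppose a depth-based method estimated each depth to within 1 percent
assumed_depth_accuracy = 0.01
depth_error_cm = assumed_depth_accuracy * distance_cm
print(f"depth error = {depth_error_cm:.1f} cm, "
      f"or {depth_error_cm / coin_diameter_cm:.0%} of the coin diameter")
```

Under that assumption, an error that is small relative to the distance is nearly as large as the object itself, which is the sense in which shape computed from depth differences is ill-conditioned for remote objects.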
The camera was moved in the plane of the coin, so that only the edge of the coin was visible in every frame. The motion was roughly circular around a point in the vicinity of the coin. Only the rotation component was controlled with an accurate positioning mechanism, so that a precise reference was available for performance evaluation.
The edge of the coin was approximately aligned with the image scanlines, thus yielding easy-to-track image features (the thin vertical notches on the coin edge). The first 101 frames were taken in steps of 0.1 degrees between consecutive frames; after that, the velocity was doubled to 0.2 degrees per frame, and 100 more frames were taken; thus, the overall rotation was 30 degrees. The 201 scanlines are stacked together in figure 4.2, top to bottom. This figure is what is called an epipolar plane in [Bolles et al., 1987].
The image was filtered with a thirteen-tap finite impulse response approximation to a Laplacian of a Gaussian, and the zero crossings of the result (figure 4.3) were used as features in the experiment (104 crossings were found).
The rank principle is illustrated graphically by the similarity of figures 4.4 and 4.5. Figure 4.4 shows the crossings of figure 4.3 after registration (equation (2.1)). To obtain figure 4.5, we decomposed the matrix m representing the registered crossings, set to zero all the singular values except the first two, and reconstructed the measurement matrix from the first two columns of the SVD factors (equation (3.3)). The rank principle says that the only differences between figure 4.4 and figure 4.5, under orthography, are due to noise.
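The rotation schedule described above can be written down explicitly, which is convenient when comparing computed and reference angles. The array name `true_angles_deg` is a hypothetical choice; the step sizes and frame counts are those given in the text.

```python
import numpy as np

# First 101 frames: 0.1 degree steps; next 100 frames: 0.2 degree steps
slow = 0.1 * np.arange(101)                 # 0.0, 0.1, ..., 10.0
fast = slow[-1] + 0.2 * np.arange(1, 101)   # 10.2, 10.4, ..., 30.0
true_angles_deg = np.concatenate([slow, fast])

steps = np.diff(true_angles_deg)            # rotation between consecutive frames
print(len(true_angles_deg), true_angles_deg[-1], steps[99], steps[100])
```

The step between consecutive frames doubles right after the 101st frame, which is the abrupt velocity change that the motion estimate is expected to reproduce without smoothing.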
The singular values are plotted in figure 4.6; without noise, and if the projection were exactly orthographic, only the first two values would be different from zero. The third value ($\sigma_3$) reflects essentially the effect of perspective. Figure 4.7 illustrates the ellipse principle. It shows the points $(\rho_{f1}, \rho_{f2})$ from the left factor of the singular value decomposition of the measurement matrix m, and the best fit ellipse, as defined by equation (3.4).
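That the third singular value reflects perspective can be illustrated with a noise-free simulation. The setup below is hypothetical and only loosely modeled on the coin experiment: features lie on a circle of unit radius, a pinhole camera orbits the center of the circle, and image coordinates are scaled so that magnification is one at the center. The function name `singular_values` and all parameter values are assumptions.

```python
import numpy as np

def singular_values(distance, n_features=10, radius=1.0, total_deg=30.0, n_frames=31):
    """Singular values of the registered measurements of a circular object seen by
    a pinhole camera that orbits its center (hypothetical simulation setup)."""
    phi = 2.0 * np.pi * np.arange(n_features + 1) / (n_features + 1)
    X, Z = radius * np.cos(phi), radius * np.sin(phi)     # features on a circle
    alpha = np.deg2rad(np.linspace(0.0, total_deg, n_frames))
    c, s = np.cos(alpha)[:, None], np.sin(alpha)[:, None]

    lateral = c * X + s * Z                 # coordinate along the scanline
    depth = distance - s * X + c * Z        # coordinate along the optical axis
    x = distance * lateral / depth          # perspective projection

    m = x[:, 1:] - x[:, :1]                 # register against feature 0
    return np.linalg.svd(m, compute_uv=False)

ratios = []
for distance in (10.0, 100.0, 1000.0):      # in units of the object radius
    sv = singular_values(distance)
    ratios.append(sv[2] / sv[1])
    print(distance, ratios[-1])
```

As the object moves away, the projection approaches orthography and the third singular value shrinks relative to the second, so the measurement matrix gets closer to rank 2.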
In spite of perspective effects and unmodeled small variations in depth, the quality of both shape and motion results is remarkably good. Figure 4.8 shows the computed and the true rotation. The error is always smaller than one degree, and almost everywhere much smaller than that. The algorithm assumes no motion models, and does no smoothing. As a result, the sharp change in rotational velocity is preserved in the motion output. Figure 4.9 shows the shape results, and the best circular fit to them. The accuracy of shape is of the order of the relative depth range (1 percent), even if variations in depth during the motion of the camera were of the order of the coin size.
To get an idea of how perspective effects influence the accuracy of the results, we tested our algorithm on a sequence of simulated, noise-free images similar to those of our coin experiment. A circular object with 10 features is placed at various depths from the camera. For each depth, a pinhole camera moves and rotates by 30 degrees in 30 steps. Figure 4.10 plots the relative error in the total computed rotation as a function of the relative depth range. While algorithms based on depth give worse motion estimates as objects are moved farther away, our algorithm improves (for a constant total rotation angle), because it approximates orthography better and better.

Figure 4.2: The input to the algorithm; each scanline is a new frame, and represents the edge of a one-dollar coin seen from a new angle. In [Bolles et al., 1987], a figure like this is called an epipolar plane. We use it to recover shape and rotation, instead of depth given known motion.

Figure 4.10: The motion error due to perspective distortion decreases when the relative depth range becomes smaller. These results were obtained by simulating noise-free images of a circular object with 10 features, and a pinhole camera rotating by 30 degrees in 30 frames.

Conclusion: Depth versus Shape Algorithms
The algorithm presented in this paper infers the shape of remote objects and the rotation of the camera. It is a shape algorithm. It does not compute either depth or camera translation.
Algorithms such as the ones described in [Tsai and Huang, 1984], [Heel, 1989], [Spetsakis and Aloimonos, 1989], on the other hand, represent depth explicitly, and compute it from the image sequence. They are depth algorithms.
Depth algorithms give a more complete answer. They compute all components of motion, up to a scale factor, and the depth information they supply allows, in principle, computing shape as well.
However, depth algorithms do not work if objects are very distant from the camera with respect to their size. When the relative depth range is very small, as for instance in aerial cartography and reconnaissance, the values of depth are poorly constrained by the image sequence, and it is hard to distinguish rotation from translation.
In these situations, the completeness of depth algorithms is not only useless, but harmful. A shape algorithm gives a more stable and accurate answer, because it computes shape and camera rotation directly from image deformations. It does not use depth as an intermediate result, and it need not distinguish translation from rotation.
The results of this paper can be extended along four independent directions: accuracy, three-dimensionality, completeness, and efficiency. Accuracy can be increased by correcting for perspective effects. Once a good shape estimate has been computed, the solution can be perturbed with a steepest descent search to account for the slight divergence of projection rays in each frame. Furthermore, if relative changes in depth are large with respect to the relative depth range, looming effects must be estimated and accounted for. The algorithm can be extended to three dimensions. For obvious reasons of applicability, this is the direction we have chosen to pursue first in our future research.
Completeness: if a motion model is available, depth and translation can be estimated independently. Shape and rotation, computed by our algorithm, would be inputs to a separate depth and translation algorithm, possibly together with external motion information. Shape and depth are often several orders of magnitude apart. We have shown that they should be estimated separately, not that depth cannot be estimated.
Our implementation of the algorithm uses an efficient singular value decomposition routine. However, it treats a whole batch of frames at once. An incremental implementation would be more desirable. The feasibility of this is being investigated.

Furthermore, let $\psi_{fg}$ be the angle between frame f and frame g, measured counterclockwise from f to g (figure A.1).
Then, if $m_{fp}$ is the projection of point p onto frame f (after registration), we have

The Normalization Equations for a Noisy Measurement Matrix
In chapter 3, we computed the cosines $c_f$ and sines $s_f$ of the frame angles $\alpha_f$ in two steps. We first found those values of $c_f$ and $s_f$ that lie on the frame plane, and that best satisfy the normalization conditions $c_f^2 + s_f^2 = 1$. We then perturbed $c_f$ and $s_f$ into new values $c'_f$ and $s'_f$ that satisfy the normalization equations exactly, and that are as close as possible to $c_f$ and $s_f$.
We identified the correct metric for measuring the amount of this perturbation as the Euclidean metric in the space of the original, registered image measurements $m_{fp}$.
In this appendix, we show that solving this problem is equivalent to changing the first two columns $\rho_1$ and $\rho_2$ of the left factor of the singular value decomposition of the measurement matrix m into two new vectors $\rho'_1$ and $\rho'_2$ so as to minimize the sum

$$\sum_{f=1}^{F} \left[ \sigma_1^2 (\rho'_{f1} - \rho_{f1})^2 + \sigma_2^2 (\rho'_{f2} - \rho_{f2})^2 \right]$$

subject to the normalization constraints

$$(c'_f)^2 + (s'_f)^2 = 1 \qquad \text{for } f = 1, \ldots, F .$$
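From the definition of the clean measurement matrix $\mu$ (equation 3.3), we see that if we change $\rho_{f1}$ and/or $\rho_{f2}$, we alter only one row of $\mu$: in fact, the f-th row of that equation is

$$[\,\mu_{f1} \;\cdots\; \mu_{fP}\,] = \rho_{f1}\sigma_1\pi_1^T + \rho_{f2}\sigma_2\pi_2^T$$

or, in matrix notation,

$$[\,\mu_{f1} \;\cdots\; \mu_{fP}\,] = [\,\rho_{f1} \;\; \rho_{f2}\,] \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \end{bmatrix} \begin{bmatrix} \pi_1^T \\ \pi_2^T \end{bmatrix} .$$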
This is intuitive: the coefficients $\rho_{f1}$ and $\rho_{f2}$ concern only the measurements in frame number f, so it stands to reason that changing these coefficients affects only measurements in frame f.
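This locality can be checked numerically. The sketch below uses an arbitrary rank-2 matrix and an arbitrary perturbation, both made up for illustration, and the helper name `clean` is hypothetical. It perturbs the two left-factor coefficients of one frame, confirms that only that row of the rank-2 reconstruction changes, and compares the squared size of the change with the singular-value-weighted sum that this appendix minimizes.

```python
import numpy as np

rng = np.random.default_rng(5)
F, P = 12, 5
m = rng.standard_normal((F, 2)) @ rng.standard_normal((2, P))   # any rank-2 matrix will do
rho, sigma, pi_t = np.linalg.svd(m, full_matrices=False)

def clean(rho2):
    """Rank-2 reconstruction from the first two columns of the left factor."""
    return (rho2 * sigma[:2]) @ pi_t[:2]

f, e = 3, np.array([0.02, -0.05])       # perturb frame f by e = (e1, e2); arbitrary values
rho_new = rho[:, :2].copy()
rho_new[f] += e

delta = clean(rho_new) - clean(rho[:, :2])
changed_rows = np.flatnonzero(np.abs(delta).max(axis=1) > 1e-12)
lhs = np.sum(delta[f] ** 2)                      # squared change of the measurements
rhs = (sigma[0] * e[0]) ** 2 + (sigma[1] * e[1]) ** 2
print(changed_rows, lhs, rhs)
```

The two sides agree because the rows of the right factor are orthonormal, so distances between measurement rows reduce to distances between coefficient pairs once each coordinate is scaled by its singular value.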
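Then, a change $e^T = (e_1, e_2)$ in $(\rho_{f1}, \rho_{f2})$ results in a change $e_1\sigma_1\pi_1^T + e_2\sigma_2\pi_2^T$ in the f-th row of $\mu$, whose squared norm is $\sigma_1^2 e_1^2 + \sigma_2^2 e_2^2$ because $\pi_1$ and $\pi_2$ are orthonormal. As a consequence, we can almost use a Euclidean metric in the space of the points $(\rho_{f1}, \rho_{f2})$, except that the two coordinates must be scaled by $\sigma_1$ and $\sigma_2$. The problem of computing $(\rho'_{f1}, \rho'_{f2})$ from $(\rho_{f1}, \rho_{f2})$ is now easily stated: find the point $(\rho'_{f1}, \rho'_{f2})$ such that the norm of the vector $(\sigma_1(\rho'_{f1} - \rho_{f1}),\; \sigma_2(\rho'_{f2} - \rho_{f2}))$ is minimum, subject to the constraint $(c'_f)^2 + (s'_f)^2 = 1$.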
We can rewrite this constraint in terms of the vectors $\rho'_1$ and $\rho'_2$ by noticing that