The 3D MOSAIC scene understanding system: incremental reconstruction of 3D scenes for complex images

Abstract The 3D Mosaic system is a vision system that incrementally reconstructs complex 3D scenes from multiple images. The system encompasses several levels of the vision process, starting with images and ending with symbolic scene descriptions. This paper describes the various components of the system, including stereo analysis, monocular analysis, and construction and modification of the scene model. In addition, the representation of the scene model is described. This model is intended for tasks such as matching, display generation, planning paths through the scene, and making other decisions about the scene environment. Examples showing how the system is used to interpret complex aerial photographs of urban scenes are presented. Each view of the scene, which may be either a single image or a stereo pair, undergoes analysis which results in a 3D wire-frame description that represents portions of edges and vertices of objects. The model is a surface-based description constructed from the wire frames. With each successive view, the model is incrementally updated and gradually becomes more accurate and complete. Task-specific knowledge, involving block-shaped objects in an urban scene, is used to extract the wire frames and to construct and update the model.


Introduction
It is important for a general vision system to derive three-dimensional (3D) information about a given scene from images and store the information in a coherent manner so that it can be used for various matching, planning, and display tasks. The 3D Mosaic system is a vision system that incrementally acquires a 3D description (or model) of a complex scene from multiple images. This paper describes the system and presents examples of how it is used to interpret complex aerial photographs of urban scenes.
The paper is organized as follows. First, we present the motivation for our approach of incrementally acquiring the scene model, together with an overview of the system. Then we discuss the two components used to extract 3D information from the images: the stereo analysis and monocular analysis components. Next, we describe the representation of the scene model and present an example showing how the scene model is acquired. Finally, we show how information from a new view is incrementally combined with a current model.

The 3D MOSAIC System
The goal of the 3D Mosaic system is to obtain an understanding of the 3D configuration of surfaces and objects in a scene. The significance of this goal may be demonstrated by the following tasks.
2. 3D change detection. Change detection is the task of determining how the geometry and structure of a scene change over time. The conventional approach to this task involves comparing and detecting changes in images. However, because of different viewpoints and lighting conditions, changes in the images do not necessarily correspond to changes in the geometry and structure of the scene. If 3D scene descriptions were obtained from the images first, such descriptions could be compared in 3D to determine changes in the scene.
3. Simulating the appearance of the scene. If a 3D description of the scene were to be obtained, displays as seen from arbitrary viewpoints could be generated from it. This is useful for tasks such as familiarizing personnel with a given area, and flight planning by generating the scene appearance along hypothetical flight paths.
4. Robot navigation. Three-dimensional descriptions of complex environments may be used to make decisions dealing with path planning or determining which parts of the environment to analyze in more detail.
Note that to perform these tasks, a vision system must do more than classify images, segment them, or identify objects in them; it must be able to generate a 3D description of the scene. The 3D Mosaic system deals with complex, real-world scenes (e.g., Fig. 4). That is, the scenes contain many objects with a variety of shapes, the object surfaces have a variety of textures and reflectance characteristics, and the scenes are imaged under outdoor lighting conditions. Because of this complexity, there are many difficulties in interpreting the images, including:
1. Any particular image contains only partial information about the scene because many surfaces are occluded.
2. Even portions of the scene that are visible are often difficult to recover. For example, surfaces with dark shadows cast across them, or with highlights, may be difficult to interpret. Highly oblique surfaces may be difficult to analyze if their resolution in the image is poor. Such portions of the scene, therefore, may be recovered with errors and inconsistencies, or may not be recovered at all.
Our approach to the problems of complexity is to use multiple images obtained from multiple viewpoints.
This approach aids interpretation in two ways. First, surfaces occluded in one image may become visible in another. Second, features of surfaces that are difficult to analyze and interpret in one image (such as scene edges and texture) may become more apparent in another image because of different viewpoint and/or lighting conditions.

Incremental Approach
A large number of views will, in general, be required to obtain a fully accurate and complete description of a complex scene. Typically, not all of these views will be available simultaneously, and some may never become available. Many of them will only be obtained gradually through interaction with the scene environment. Our system must therefore have the ability to utilize partial descriptions and incrementally update them with new information whenever a new view happens to become available. As a practical example, consider a robot (perhaps a mobile ground robot or an automatically guided airplane) attempting to navigate through an unknown environment. The robot would sequentially acquire images of the environment as it moves about. Information derived from each new image would serve to update its internal model, and this partial model would be used to decide where to go next, or where to analyze in more detail.
We have adopted an approach in which the 3D scene model is incrementally acquired over the multiple views. The views of the scene are sequentially acquired and processed. Partial 3D information is derived from each view. The initial model is constructed from 3D information obtained from the first view, and represents an initial approximation of the scene. As each successive view is processed, the model is incrementally updated and gradually becomes more accurate and complete.
In our approach, the scene model plays the role of a central representation with two primary functions.
First, it incrementally accumulates information about the scene. Second, at any point in its development, it represents the current understanding of the scene. As such, it may be used for tasks such as matching, display generation, planning paths through the scene, and making other decisions about the scene environment. Two such tasks are important for the incremental acquisition process itself: (1) 3D information derived from a new view must be matched to the model so that updating can occur; and (2) higher-level components should be able to use the model to determine which parts of the scene to analyze in more detail, and from which viewpoints to take the next images.

Overview
A flowchart for the 3D Mosaic system, showing the major modules and data structures, is displayed in Fig. 1. The input is a new view of the scene, which may be either a stereo image pair or a single image. The stereo pair undergoes stereo analysis, while the single image undergoes monocular analysis. The purpose of these analyses is to obtain 3D scene features such as portions of surfaces, edges, and corners. The stereo analysis component currently matches junctions extracted from the two images and generates a sparse 3D wire-frame description of the scene. The monocular analysis component currently extracts linear structures from the image and converts these to 3D wire frames using task-specific assumptions. The wire frames, in turn, are used to construct and incrementally update the scene model.

Stereo Analysis
In our approach, rather than attempting to find matches for scene faces occluded in one of the images, we match face boundaries visible in both images. We do this by explicitly taking into account the way junction appearances change from one image to the other, using the knowledge that in urban scenes, roofs of buildings tend to be parallel to the ground plane, while walls tend to be perpendicular to this plane. Edges in the scene perpendicular to the ground will appear in each image to be directed towards the vertical vanishing point [Kender 83].
If a feature in an image lies on a roof, its appearance in the other image as a function of position along the epipolar line can be predicted if the normal to the ground plane is known. To see why, consider Fig. 2.

Figure 1: 3D Mosaic flowchart, showing major modules (boxes) and data structures (ellipses). The dashed lines represent components that have not yet been implemented; the solid lines represent components already implemented.

Suppose the junction P3P1P2 in image1 is given, and our goal is to predict the junction Q3Q1Q2 in image2. Choosing a position Q1 along the epipolar line corresponding to P1 fixes the 3-space position of the vertex V1. This uniquely determines the position of the plane parallel to the ground that contains V1. The 3-space positions of the points V2 and V3 can now be computed as the intersections of this plane with the rays corresponding to the points P2 and P3, respectively. Finally, the points Q2 and Q3 are uniquely determined as central projections of the points V2 and V3, respectively. Although this analysis is independent of the camera geometry relative to the scene, vertical aerial photography is in general more useful than oblique aerial photography because of the greater probability that an arbitrary junction in the image lies on a roof or on the ground; in oblique aerial photography, larger portions of horizontal surfaces would be occluded by vertical walls.

Figure 2: For junction P3P1P2, its appearance in image2 can be predicted as a function of position Q1 along the epipolar line. The normal to plane V3V1V2 must be known.
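The prediction just described reduces to intersecting viewing rays with a horizontal plane and reprojecting. The following is a minimal sketch of that computation, assuming simple pinhole cameras; the helper names (back_project1, project2) are illustrative stand-ins, not part of the original system.

    import numpy as np

    def intersect_ray_plane(origin, direction, n, d):
        # Return the point where the ray origin + t*direction meets the
        # plane n.X = d.
        t = (d - n @ origin) / (n @ direction)
        return origin + t * direction

    def predict_junction(p1, p2, p3, back_project1, project2, n, depth):
        # Predict the image-2 junction Q3-Q1-Q2 for the image-1 junction
        # P3-P1-P2, assuming all three vertices lie on a plane with normal n
        # (e.g., a roof parallel to the ground). Choosing 'depth' for V1
        # along P1's viewing ray corresponds to choosing Q1 along the
        # epipolar line.
        o1, d1 = back_project1(p1)          # viewing ray through P1
        V1 = o1 + depth * d1                # fix V1 along that ray
        d_plane = n @ V1                    # horizontal plane through V1
        o2, d2 = back_project1(p2)
        o3, d3 = back_project1(p3)
        V2 = intersect_ray_plane(o2, d2, n, d_plane)
        V3 = intersect_ray_plane(o3, d3, n, d_plane)
        # Central projection into image 2 yields the predicted junction.
        return project2(V1), project2(V2), project2(V3)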
Therefore, when an L junction is found in one image, it is initially assumed to arise from a corner of a roof, and its appearance in the other image can be predicted. When an ARROW or FORK junction is found, the leg of the junction directed towards the vertical vanishing point is initially assumed to arise from a scene edge perpendicular to the ground, while the other two legs are initially assumed to arise from scene edges lying on a roof or on the ground. Again, its appearance can be predicted.
Note that this analysis is valid not only for features lying on horizontal planes in the scene, but for any family of parallel planes.

Structural relationships between scene vertices are also used to aid in the matching. If two junctions in an image arise from scene vertices at the same height above the ground, the positions of the corresponding junctions in the other image, as a function of position along the epipolar line, can be predicted if the normal to the ground plane is known. This can be shown using arguments similar to those above. In Fig. 2, pretend that the points Pi, Qi, and Vi correspond to positions of separate junctions and vertices. For example, if P1 and P3 are two separate junctions in image1, then for some point Q1 on the epipolar line corresponding to P1, the position of the junction Q3, corresponding to P3, can be predicted if V1 and V3 are assumed to lie at the same height. We make the assumption that junctions close to one another in the image often correspond to vertices lying on top of the same building and therefore have approximately the same height. In this way, the configurations within the neighborhoods around junctions in the two images are used in the matching.
These matching techniques assume that the vector normal to the ground plane is known. To obtain this vector, we form a vector from the focal point to the vertical vanishing point.

Steps in Stereo Analysis
We now provide an example showing how the stereo analysis is performed on the stereo pair of images in Fig. 4.
Extracting lines. The first step in the stereo analysis is to extract linear features. A 3x3 Sobel operator is used to extract edge points.

Finding potential junction matches. We now want to match the junctions found in one image with those in the other. Let us consider how L junctions are matched. Each L junction is initially assumed to lie on a horizontal scene plane. The shape and orientation of its corresponding junction in the other image, as a function of position along the epipolar line, can therefore be predicted. Each L junction in the first image may thus usually be matched with several junctions in the second image that have, within tolerance, the predicted shape and orientation. However, we do not try to match only with junctions in the second image that have been previously found. Rather, for every point on the epipolar line (on the appropriate side of the infinity point), a search is made within a pre-specified window for lines that might correspond to the predicted junction. The requirements for two lines to form a junction are, however, more relaxed than during the initial junction search. We therefore improve feature detection in each image by using the features found in one image to predict features in the other image. The matching is performed in two directions, from the first image to the second, and vice versa.
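A minimal sketch of the 3x3 Sobel edge-point extraction mentioned at the start of this example follows; the threshold value is an illustrative assumption.

    import numpy as np

    SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    SOBEL_Y = SOBEL_X.T

    def sobel_edges(image, threshold=50.0):
        # Convolve a gray-scale image with the two 3x3 Sobel kernels and
        # threshold the gradient magnitude to obtain edge points.
        h, w = image.shape
        gx = np.zeros((h, w))
        gy = np.zeros((h, w))
        for i in range(1, h - 1):
            for j in range(1, w - 1):
                patch = image[i - 1:i + 2, j - 1:j + 2]
                gx[i, j] = np.sum(SOBEL_X * patch)
                gy[i, j] = np.sum(SOBEL_Y * patch)
        magnitude = np.hypot(gx, gy)
        # Edge points are pixels whose gradient magnitude exceeds the
        # threshold; the gradient direction is useful when linking edge
        # points into lines.
        return magnitude > threshold, np.arctan2(gy, gx)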
Searching for unique junction matches. At this point, each junction in one image is associated with a set of potentially matching junctions in the other image. The next step is to find the best of the potential matches, resulting in a single match for each junction. Two criteria are used in determining the best matches (a sketch of both cost measures follows the second criterion):
1. If the image intensities inside two potentially matching junctions are similar, the likelihood that they really match is increased. This is because the two junctions will often have similar intensities if they arise from the same face corner. To measure the degree of similarity, we compute the average intensities of regions along the two legs of the L junction in each image. As depicted in Fig. 9, let A and B be the average intensities of these regions in one image, and let A' and B' be the average intensities of the corresponding regions in the other image. The degree of similarity, called the local cost, is then defined in terms of the differences between A and A' and between B and B'. Similar intensities in the two junctions result in a small local cost, while diverse intensities result in a large local cost.
2. As described previously, if two junctions in an image arise from scene vertices that are at the same height, the relative positions of the corresponding junctions in the other image, as a function of position along the epipolar line, can be predicted. We use this to determine whether two sets of junction matches are consistent with one another. Suppose, in Fig. 10, that the junctions J1 and J2 in image1 arise from scene vertices that are at the same height. Suppose also that the junction matches (J1, J'1) and (J2, J'2) have been hypothesized. To measure the degree of consistency between these two sets of matches, we predict the position of the junction in image2 that corresponds to (say) J2. Let us refer to the predicted position as J''2. If the vector from J'1 to J''2 is (a1, b1) and the vector from J'1 to J'2 is (a2, b2), then the degree of consistency between the two sets of matches, called the global cost, is defined in terms of the difference between these two vectors. Two sets of junction matches whose relative positions are near the prediction result in a small global cost, while positions far from the prediction result in a large global cost.
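The exact cost expressions are not reproduced in the text above, so the following is only a plausible sketch, assuming squared-difference forms that have the stated qualitative behavior (small when similar or consistent, large otherwise).

    def local_cost(A, B, A2, B2):
        # Dissimilarity of the average intensities (A, B) along the two legs
        # of an L junction in one image and (A2, B2) in the other.
        return (A - A2) ** 2 + (B - B2) ** 2

    def global_cost(predicted, actual):
        # Inconsistency between the predicted relative position (a1, b1) of
        # a matched junction and its actual relative position (a2, b2).
        (a1, b1), (a2, b2) = predicted, actual
        return (a1 - a2) ** 2 + (b1 - b2) ** 2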
To arrive at a unique set of junction matches, the space of potential matches is searched using a beam search [Rubin 80], which is guided by the above two criteria. The search space is represented by a network whose nodes are the possible pairs of junction matches. This is depicted in Fig. 11, where each junction in (say) image1 (i.e., J, K, L, ...) is paired with each of its potential matches in image2 (i.e., J'i, K'i, L'i, ...). The junctions in image1 are ordered so that the junction in column k is within an MxM window of the junction in column k-1. M is chosen so that there is a good probability that junctions within the window arise from vertices on top of the same building.
In Fig. 11, each junction and its candidates lie in a single column, and each candidate is represented by a node in the network. Any path through the network that visits a single node in each column represents a set of unique junction matches. Associated with each such path is a cost, obtained by adding the local costs of its nodes and the global costs of its links. The search starts at column 1 (Fig. 11) and proceeds successively to each column. At each column k, the best N partial paths from column 1 to k are extended to column k+1 as follows. Suppose that each node in column k has a cost corresponding to the minimum-cost path from column 1 to that node. Then, for each of the N lowest-cost nodes J'i in column k, compute the cost of the path when extended from J'i to each node K'j in column k+1. This cost is the sum of the cost of the partial path to node J'i, the global cost between nodes J'i and K'j, and the local cost of node K'j. Then add a link in the network between nodes J'i and K'j. After all of the N best nodes in column k have been extended, each node K'j in column k+1 may have several costs associated with it, one for each link into the node. Suppose the link from node J'i yields the lowest cost to K'j. A backpointer from K'j to J'i is added, and the associated cost is stored; all other links and costs associated with node K'j are discarded. Each of the best N nodes in column k+1 is then extended to column k+2, and so on. Notice that this search is not guaranteed to find the lowest-cost path in the network: a path discarded at column k because it is not among the best N might have been part of the best path had it been extended further.
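A compact sketch of this beam search over the match network follows; the cost-function signatures are illustrative, and candidates are assumed to be hashable.

    def beam_search(columns, local_cost, global_cost, N):
        # columns[k] lists the candidate matches of the junction in column
        # k. Returns (cost, path) for the cheapest path found with beam
        # width N.
        beam = [(local_cost(c), [c]) for c in columns[0]]
        for column in columns[1:]:
            # Extend only the N cheapest partial paths (the beam).
            best = sorted(beam, key=lambda t: t[0])[:N]
            extended = {}
            for cost, path in best:
                for cand in column:
                    new = cost + global_cost(path[-1], cand) + local_cost(cand)
                    # Keep only the cheapest link into each node (the
                    # backpointer); all other links into it are discarded.
                    if cand not in extended or new < extended[cand][0]:
                        extended[cand] = (new, path + [cand])
            beam = list(extended.values())
        return min(beam, key=lambda t: t[0])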

Searching for third legs of junctions. The next step tries to find lines in the images that might be the third leg of matched junctions and that might represent scene edges perpendicular to the ground plane. The method finds lines near the junctions in both images that are directed toward the vertical vanishing point.

Generating 3D wire frames. Finally, 3D coordinates of vertices and equations of edges are derived using triangulation. Fig. 13 shows a perspective view of the 3D vertices and edges that result. We call this a wire-frame description of the scene.

Monocular Analysis
Although stereo is a major source of 3D information, some views of the scene will be only single images.
We can also extract 3D information from these images by exploiting task-specific knowledge. We assume that the objects in the scene are trihedral polyhedra containing only vertical and horizontal faces, i.e., faces perpendicular and parallel, respectively, to the ground plane. Our monocular analysis extracts linear structures in the image that represent boundaries of buildings, and then converts these structures into 3D wire frames.

Steps in Monocular Analysis
This section provides an example showing how the monocular analysis is performed on the image in Fig. 14.
This is a different view of the same scene shown in the earlier stereo pair (Fig. 4).
Extracting lines and junctions. The first step in the monocular analysis is to extract linear segments and junctions from the image. The method used here is the same as that used during stereo analysis, as previously described; the result is shown in Fig. 16.

Linking junctions. Next, candidate connections between pairs of junctions Ji and Jk are hypothesized, and each candidate is then tested by measuring the percentage of its length that is covered by extracted line segments. The connection between Ji and Jk is retained only if the percentage of coverage exceeds a threshold. The result of this pruning step is shown in Fig. 19. Note that it does a good job of eliminating unwanted connections. These two steps illustrate how useful a hypothesize-and-test method can be for low-level image processing: in the first step, candidate connections are hypothesized on rather preliminary evidence; in the second step, the candidates that do not pass a rigid test are eliminated.

Obtaining 3D wire frames. The next step is to convert the 2D structures into 3D wire frames. In order to do so, we assume that all lines that form the 2D structures arise from either vertical or horizontal scene edges.
Furthermore, we use several features that aid us in relating an image to the 3D scene depicted in the image, including vanishing points, the ground plane constraint, propagation of 3D constraints, and collinearity (i.e., alignment of lines).
First, the lines that form the 2D structures are labeled as either "vertical" or "horizontal", depending on whether or not they are directed toward the vertical vanishing point. Given these labels, the 3D configuration of a junction (the vertex and the directions of its edges) can be recovered relative to an arbitrarily chosen depth for the vertex. Although this technique permits us to recover the 3D configuration of any junction relative to some arbitrary depth, it is not useful to apply it directly to the junctions in the original line image (Fig. 16), because the relative heights above the ground plane of the corresponding vertices cannot be determined; the height of each vertex is arbitrarily chosen without relation to the heights of other vertices. It is more useful, however, to apply the technique to the 2D structures in Fig. 20, since the heights of the vertices within each structure can be related. To see how this is done, consider the example in Fig. 22, which shows a 2D structure. The solid lines are part of the extracted structure (the dashed lines are added for the reader's convenience, to make the 3D shape more apparent). Suppose two of the lines have been labeled "vertical", while the other solid lines have been labeled "horizontal". Applying our technique to (say) point p1, the 3-space positions of the vertices corresponding to p1 and its neighboring points can be determined relative to some arbitrary depth a for p1. If the technique is applied next to point p2, the 3-space position of point p3 can be determined as a function of the depth a. This procedure continues with the remaining points until the 3D configuration of the whole structure has been determined, relative to some arbitrary depth.

Figure 22: The solid lines represent a connected 2D structure. The dashed lines are for the reader's convenience, to make the 3D shape more apparent.

In order to obtain a coherent scene description, the depths of the different structures in the scene must be related. We use two methods to do this. The first method involves finding structures that lie on the ground plane. Suppose a junction point p of such a structure is hypothesized to arise from a vertex lying on the ground. Then the 3-space position of the vertex may be obtained as the intersection of the ground plane with the ray through p. The normal vector u to the ground plane is known, but the distance d from the focal point to the ground plane is arbitrarily chosen. Since the 3-space positions of all junctions arising from ground points can be calculated in this manner, the depths of all structures containing such points can be related to one another through the parameter d.

To hypothesize junctions that arise from vertices lying on the ground plane, we use the observation that if a line labeled "vertical" connects two junctions (e.g., one of the vertical lines in Fig. 22), the line is directed toward the vertical vanishing point with respect to one junction, but away from this vanishing point with respect to the other junction. The latter junction is assumed to represent a vertex lying on the ground plane.

There are many structures in Fig. 20 that do not contain points lying on the ground plane, either because such points are occluded in the scene or because they have not been properly extracted from the image. The second method relates the depths of such structures through collinearity. Suppose that points p1 through p7 have already been assigned 3D coordinates, and we want to obtain the 3-space position of the 2D structure p8p9p10p11. Since the lines p6p7 and p8p11 are aligned in the image and both are labeled "horizontal", they are assumed to be aligned in the scene and to lie in the same horizontal plane. The 3-space position of (say) point p8 is therefore determined as the intersection of this plane with the ray through p8. The 3D coordinates of this point may then be propagated to points p9, p10, and p11 as described previously. Note that all 3D positions are functions of the parameter d, which is arbitrarily chosen in the equation of the ground plane.
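A minimal sketch of this propagation follows, assuming each image point can be back-projected to a viewing ray and each connection carries its "vertical" or "horizontal" label; all names are illustrative.

    import numpy as np
    from collections import deque

    def intersect_ray_plane(o, d, n, dist):
        # Point where the ray o + t*d meets the plane n.X = dist.
        t = (dist - n @ o) / (n @ d)
        return o + t * d

    def intersect_ray_vertical_line(o, d, k, n):
        # Point on the ray o + t*d nearest the line k + s*n; in the ideal
        # case the ray meets the vertical line through k, and this
        # least-squares solve recovers the intersection.
        A = np.stack([d, -n], axis=1)
        t, s = np.linalg.lstsq(A, k - o, rcond=None)[0]
        return o + t * d

    def propagate(structure, rays, n, seed, seed_point):
        # structure: vertex -> list of (neighbor, "vertical"/"horizontal").
        # rays: vertex -> (origin, direction) of its viewing ray.
        # Returns vertex -> 3D point, up to the arbitrary depth of the seed.
        pos = {seed: np.asarray(seed_point, float)}
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for u, label in structure[v]:
                if u in pos:
                    continue
                o, d = rays[u]
                if label == "horizontal":
                    # u lies in the horizontal plane through v.
                    pos[u] = intersect_ray_plane(o, d, n, n @ pos[v])
                else:
                    # u lies on the vertical line through v.
                    pos[u] = intersect_ray_vertical_line(o, d, pos[v], n)
                queue.append(u)
        return pos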

Representing and Manipulating the 3D Scene Model
The representation we have developed for the 3D scene model draws on ideas from geometric modelling used in computer-aided design systems [Baer, Eastman, and Henrion 79, Requicha 80]. In these systems, however, the 3D models are usually derived through interaction with a user. Our case is different in that (1) the 3D models are derived automatically from 2D images, and (2) many portions of the scene are unknown or recovered with errors because of occlusions or unreliable analysis.
The following factors have determined how the scene model is represented and manipulated.
1. Partially complete, planar-faced objects must be efficiently described by the model. The model is therefore represented as a graph in terms of symbolic primitives such as faces, edges, vertices, and their topology and geometry. Information is added and deleted by means of these primitives.
2. The model must be easy to use in matching.
3. Because scene approximations are often more useful if they contain reasonable hypotheses for parts of the scene for which there are partial data, we introduce mechanisms that permit hypotheses to be generated, added, and deleted.
4. Because incremental modifications to the model must be easy to perform, we introduce mechanisms to (a) add primitives to the model in such a way that the geometric constraints imposed by these additions are propagated throughout the model, and (b) modify and delete primitives if discrepancies arise between newly derived and current information.

Representation of Model
The 3D structure in the scene is represented in the form of a graph, called the structure graph. The nodes and links represent primitive topological and geometric elements and constraints. The structure graph is incrementally constructed through the addition of these constraints. As constraints are accumulated, their effects are propagated to other parts of the graph so as to obtain globally consistent interpretations. Nodes in the structure graph represent either primitive topological elements (i.e., faces, edges, vertices, objects, and edge-groups, which are rings of edges on faces) or primitive geometric elements (i.e., planes, lines, and points). Face, edge, and vertex nodes are tagged as either confirmed or unconfirmed. Confirmed means that the element represented by the node has been derived directly from images; unconfirmed means that the element has only been hypothesized.
The primitive geometric elements serve to constrain the 3-space locations of faces, edges, and vertices.
Plane and line nodes contain plane and line equations, respectively. Point nodes contain coordinate values.
The structure graph contains two types of links: the part-of link, representing the part/whole relation between two topological nodes, and the geometric-constraint link, representing the constraint relation between a geometric and a topological node.
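An illustrative rendering of these primitives in code follows; the class and field names are ours, not those of the original implementation.

    from dataclasses import dataclass, field

    @dataclass
    class TopologicalNode:
        kind: str                 # "object", "face", "edge-group", "edge", "vertex"
        confirmed: bool = False   # derived from images (True) or hypothesized
        parts: list = field(default_factory=list)        # part-of links
        constraints: list = field(default_factory=list)  # geometric-constraint links

    @dataclass
    class GeometricNode:
        kind: str        # "plane", "line", or "point"
        equation: tuple  # plane/line equation coefficients, or point coordinates

    def add_constraint(topo, geom):
        # Add a geometric-constraint link between a geometric node and a
        # topological node.
        topo.constraints.append(geom)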

Modifications to Model
Modifications to the structure graph are made by adding or deleting nodes and links, changing the equations of line and plane nodes, or changing the coordinates of point nodes. All effects of modifications are propagated to other parts of the graph.
As an example, consider adding or deleting a geometric-constraint link between a geometric and a topological node. Any of the three geometric nodes (points, lines, and planes) may constrain any of the three topological nodes (vertices, edges, and faces). Fig. 27 shows how a constraint on one node may propagate to others. The arrows in the figure indicate the direction of propagation. Point constraints propagate upward. That is, if a point constrains a vertex, it must also constrain all edges and faces which contain that vertex. Similarly, a point that constrains an edge also constrains all faces containing that edge. Line constraints propagate outward, and plane constraints propagate downward. Whenever a geometric-constraint link is added, propagation occurs as indicated in Fig. 27.
When a geometric-constraint link is deleted, the rest of the structure graph must be made consistent with this change. Our approach to this problem is based on the TMS system [Doyle 79], using the notion that when an assertion is deleted, all assertions implying it, and all assertions implied by it that have no other support, should also be deleted. We obtain assertions that imply a given assertion by following backwards along the arrows in Fig. 27, and we obtain assertions implied by a given assertion by following forward along the arrows.
Consider the simple example in Fig. 28a, which depicts three topological nodes (vertex v, edge e, face f) constrained by one geometric node (point p). Suppose now that link 4 is deleted (Fig. 28b), that is, the assertion "p constrains e" is deleted. All assertions which have implied this must now be deleted, for if one were to hold, link 4 would also hold. To find these assertions, we locate the box in Fig. 27 that represents a point constraining an edge and follow backwards along the arrow. The result is the box that represents the point constraining any vertex of the edge. In Fig. 28b, this corresponds to the assertion "p constrains v, and v is part of e". This assertion must therefore be made false. To do so, we may delete either link 1, link 3, or both from Fig. 28b. Our intuition tells us that part-of links (link 1) should dominate constraint links (link 3), and thus link 3 is deleted. This seems to work well for our examples.
We must now determine the assertions implied by the one initially deleted. All these assertions must also be deleted unless they have other support. To find them, we follow forward along the arrow from the box in Fig. 27 that represents a point constraining an edge; the result is the box that represents the point constraining all faces containing the edge. In Fig. 28b, this corresponds to the assertion "p constrains f", which is link 5.
This link should therefore be deleted since it has no other support. The resulting structure graph is depicted in Fig. 28c.
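The following sketch illustrates upward propagation of point constraints (Fig. 27) and a simplified TMS-style deletion of the kind just described; it assumes nodes carry a constraints list as in the earlier sketch, and the traversal helpers (edges_of, faces_of) are assumed lookups.

    def propagate_point_upward(point, vertex, edges_of, faces_of):
        # A point constraining a vertex also constrains every edge
        # containing that vertex, and a point constraining an edge also
        # constrains every face containing that edge.
        for edge in edges_of(vertex):
            edge.constraints.append(point)
            for face in faces_of(edge):
                face.constraints.append(point)

    def delete_edge_constraint(edge, point, faces_of):
        # Delete the assertion "point constrains edge" and, following the
        # arrows of Fig. 27 forward, delete the implied face constraints
        # that have no other support.
        edge.constraints.remove(point)
        for face in faces_of(edge):
            if point in face.constraints:
                face.constraints.remove(point)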

Generating the 3D Scene Model
The result of image analysis is a 3D wire-frame description that represents 3D vertices and edges corresponding to portions of boundaries of objects in the scene. We construct a surface-based description, the 3D scene model, from these boundaries by hypothesizing new vertices, edges, and faces. Both the wire-frame and surface-based descriptions are represented by structure graphs.
Our current techniques for hypothesizing the scene model will be shown next using an example that starts with the output of the stereo analysis component, depicted in Fig. 29. These techniques provide a method for hypothesizing parts of the scene for which there are only partial data by exploiting task-specific knowledge. The various thresholds used throughout this example have been chosen manually.
Combine edges. First, if two wire-frame edges are nearly parallel and very close to each other, they are merged into a single edge. This occurs only once in Fig. 29, for the two edges labeled E1 and E2.
Generate web faces. Next, each vertex is assumed to correspond to a corner of an object. Therefore, each adjacent pair of legs ordered around the vertex corresponds to the corner of a planar face. Thus far in our experiments, we have dealt only with trihedral vertices. In this case, every pair of legs of each vertex corresponds to the corner of a separate face. A partial face, called a web face, is generated for each such pair (Fig. 30a).
Merge partial faces. After all web faces have been created, those that represent portions of a single face are merged. Two partial faces that touch each other (e.g., Fig. 30b, and F1 and F2 in Fig. 29) should be merged if (1) they share exactly one edge, (2) the edge serves as a boundary of both faces, but does not partition them, and (3) the planes of the faces are nearly parallel and very close to each other. Two partial faces that do not touch each other (e.g., Fig. 30c, and F3 and F4 in Fig. 29) should be merged if (1) each face has a single chain of edges that is not closed, and (2) the planes of the faces are nearly parallel and very close to each other, so that the open chains can be joined by intersecting their end edges with their counterparts in the other face.

Complete the shapes of faces. After all mergers have been performed, many faces may still be incomplete because they do not have closed boundaries. In these cases, task-specific knowledge is used to hypothesize the shape of each face, and the face is completed by generating the appropriate edges and vertices. The rules used here are:
1. If the partial face consists of a single corner, i.e., it contains only two connected edges (Fig. 30d), the shape is completed as a parallelogram (see the sketch after these rules).
2. If the partial face contains three or more edges connected as a single chain (Fig. 30e), the shape is completed by connecting the two end points of the chain with a new edge.
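Rule 1 has a simple closed form: the fourth vertex of the parallelogram is determined by the corner vertex and the two free endpoints. A minimal sketch:

    import numpy as np

    def complete_parallelogram(v0, v1, v2):
        # v0 is the corner vertex; v1 and v2 are the free endpoints of its
        # two edges. The hypothesized fourth vertex makes v1, v0, v2, v3 a
        # parallelogram.
        v0, v1, v2 = (np.asarray(v, float) for v in (v0, v1, v2))
        return v1 + v2 - v0

    # Example: a right-angle corner at the origin is completed to a square:
    # complete_parallelogram((0,0,0), (1,0,0), (0,1,0)) -> array([1., 1., 0.])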
Find holes in the faces. After all faces have been completed, one face is assumed to represent a hole in another face if (1) the planes of the faces are nearly parallel and close to each other, and (2) the boundary of the first face, when projected onto the plane of the second face, falls inside the boundary of that face (Fig. 30f). When these conditions are met, the bounding edges of the first face are converted into an inner ring of edges of the second face.
Generate vertical faces for incomplete objects. At this point, many objects will be only partially complete because they are not closed. Since we are dealing with urban scenes, faces that lie high enough above the ground are assumed to represent roofs of buildings. A hypothesized vertical wall is dropped towards the ground from each edge of such faces, unless the edge is already part of another face (Fig. 30g). Each wall is dropped either to the ground or to the first face it intersects on the way down. The equation of the ground plane is currently obtained interactively. The procedure for dropping vertical faces from a face F is as follows. First, an edge is dropped from each vertex of F, either to the ground plane or to the first face it intersects. Next, web faces are created for each new edge pair at each vertex. Newly created faces are then merged and completed in the ways described above.

Fig. 31 shows several perspective views of the resulting scene model. Notice that one of the buildings has a hole in it, through the roof. The planar patches at the "front" of the scene are part of the ground; because they were not high enough above the ground plane, they were not treated as building roofs. Fig. 32 shows the scene model generated when these techniques are applied to the wire-frame description obtained using monocular analysis (Fig. 25). Note that all vertices, edges, and faces which have been hypothesized by the procedures described above are marked as such, and will be replaced by more correct versions as more information becomes available from new views.
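A sketch of the edge-dropping step of the wall-generation procedure above, assuming a ground plane n.X = d with unit normal n and ignoring, for brevity, the test against faces that may be intersected on the way down:

    import numpy as np

    def drop_vertex_to_ground(v, n, d):
        # Project vertex v along -n onto the ground plane n.X = d.
        v, n = np.asarray(v, float), np.asarray(n, float)
        return v - (n @ v - d) * n

    def drop_wall(v1, v2, n, d):
        # Hypothesize the vertical wall hanging from the roof edge (v1, v2):
        # a quadrilateral from the edge down to the ground plane.
        return [v1, v2, drop_vertex_to_ground(v2, n, d),
                drop_vertex_to_ground(v1, n, d)]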

Comparison with Depth Map
There are several interesting points about the model generated from the stereo output. First, notice that it is a higher-level description than a depth map. The product of many stereo analysis systems is a depth map [Baker and Binford 81, Grimson 80, Ohta and Kanade 83] which, like an image, is an array of numbers that must be converted into a higher-level description. Our approach, on the other hand, has been to extract a set of 3D features using stereo analysis (as shown in Fig. 29) and to use task-specific knowledge to go directly to a higher-level 3D description. This description is symbolic and much more compact than one based on surface points, and makes relative sizes and positions of scene objects or their parts easily available. This facilitates matching and updating the model with 3D information derived from subsequent views, matching the model with other models, generating and deleting hypotheses for parts of the model, and computing structural features of the model.

Mapping Gray Scale onto Faces
In order to render more realistic displays, gray scale is added to them [Devich and Weinhaus 80]. This is useful for realistically simulating the appearance of the scene from arbitrary viewpoints. We associate with each face in the model a normalized intensity image patch of the face. These patches are currently extracted from a single image of the scene, but may eventually be extracted from multiple images. For faces that are partially occluded in the image, the intensity patch is associated with the unoccluded portions. Geometric normalization, which eliminates the effects of perspective projection, is performed on the patches. We also hope to perform photometric normalization to eliminate the effects of varying illumination conditions.

Combining New Views with Current Model
The process of incorporating a 3D wire-frame description extracted from a new view into the current scene model can be divided into three main steps:
1. The wire-frame data must first be matched to the current model. This process provides (a) the scale and coordinate transformations from the wire-frame data to the model, and (b) corresponding elements (i.e., vertices and edges) in the two.
2. The new wire-frame data is then merged with the current model. This process includes (a) merging pairs of corresponding elements, and (b) adding to the model wire-frame elements for which no correspondences were found. The latter procedure is aided by knowledge of the scale and coordinate transformations. During the merging process, hypothesized parts of the model that are inconsistent with the new wire-frame data are deleted.
3. At this point, many objects in the model may be incomplete because (a) new wire-frame data has been added, and/or (b) some hypothesized elements have been deleted. These objects are completed using the techniques described in section 6.
To see how these steps are carried out, consider the example of incorporating the information from a second view into the scene model of Fig. 31. This scene model was constructed from the set of wire frames (Fig. 29) automatically extracted from a "front" view of the scene (Fig. 4). The second set of wire frames is shown in Fig. 35.

Matching
We assume in this example that the scale and coordinate transformations from the new wire-frame data to the current model are known; the data and model may therefore be described in the same coordinate system.
We have not yet implemented a general matcher that provides these transformations between the two. The next step is to determine corresponding edges and vertices in the data and model. First, we label each connected group of edges in the wire-frame data as a distinct wire-frame object. Next, wire-frame objects are matched with model objects. Two objects are said to match if they have confirmed parts that match. Matches are sought only for edges and vertices, since these constitute the only confirmed parts of a wire-frame object.
The requirements for two confirmed vertices, one from each object, to match are: (1) they must be very close to each other, or (2) they must be part of matching edges whose other two vertices match. The requirements for two confirmed edges, one from each object, to match are: (1) the two confirmed vertices of one edge must match the two of the other, or (2) one confirmed vertex on one edge matches one on the other, and the two edges are close together and overlap in their lengths. These rules are used in a relaxation algorithm to obtain matching vertices and edges.

Figure 35: Perspective view of manually generated vertices and edges which simulate information available from images showing an opposite point of view from that shown in Fig. 4. The viewpoint for this drawing is chosen to be similar to Fig. 29. Points P1, P2, and P3, for example, correspond to points P1, P2, and P3 in Fig. 29.
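A loose sketch of the vertex-matching rules above follows, iterated to a fixed point in the spirit of the relaxation just mentioned; the distance threshold is arbitrary, and the edge-driven step is a simplification of rule (2).

    import numpy as np

    def match_vertices(data_vs, model_vs, data_edges, model_edges, tol=1.0):
        # data_vs/model_vs: id -> 3D point; data_edges/model_edges: lists of
        # (vertex_id, vertex_id). Returns a set of (data_id, model_id) pairs.
        matches = {(a, b)
                   for a, pa in data_vs.items()
                   for b, pb in model_vs.items()
                   if np.linalg.norm(np.asarray(pa) - np.asarray(pb)) < tol}
        changed = True
        while changed:
            changed = False
            # Rule (2), simplified: if one endpoint pair of a data edge and
            # a model edge already matches, hypothesize the other pair.
            for a1, a2 in data_edges:
                for b1, b2 in model_edges:
                    if (a1, b1) in matches and (a2, b2) not in matches:
                        matches.add((a2, b2))
                        changed = True
        return matches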

Discrepancies
We must now merge the new wire-frame data into the model. An important issue here is how to handle discrepancies between the two. We consider the following two types of discrepancies:
1. After the coordinate system of the wire-frame data has been transformed to that of the model and scale adjustments have been made, corresponding pairs of confirmed vertices and edges may not register perfectly in 3-space. In order to merge them into single elements, we perform a "weighted averaging" of their positions (see the sketch after this list).
2. Hypothesized elements in the model may be inconsistent with newly obtained elements. We handle this by deleting such hypothesized elements.
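A minimal sketch of the averaging in item 1; the weights, which might reflect the expected accuracy of each source, are an assumption.

    import numpy as np

    def merge_positions(p_model, p_data, w_model=0.5, w_data=0.5):
        # Weighted average of the positions of two corresponding confirmed
        # elements (e.g., vertices), one from the model and one from the
        # new wire-frame data.
        p_model = np.asarray(p_model, float)
        p_data = np.asarray(p_data, float)
        return (w_model * p_model + w_data * p_data) / (w_model + w_data)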
To determine whether or not hypotheses are still valid when confirmed elements in the model are modified or deleted, we consider the elements which gave rise to the hypotheses. A hypothesis is dependent on all elements whose existence directly resulted in the creation of the hypothesis. If one of these elements is modified or deleted, the hypothesis must also be modified or deleted, since the conditions under which it was created no longer hold. The dependency relationships for hypothesized elements are explicitly recorded at the time of their creation using dependency pointers [Doyle 81].
The following examples show how some of these relationships are recorded:
1. When two non-touching partial faces are merged (Fig. 36a), each face has two edges which are intersected with their counterparts in the other face. The intersection points form two new hypothesized vertices, each of which is dependent on the two edges whose intersection gave rise to it. In Fig. 36a, the arrows indicate the dependencies. Vertex v1 is dependent on edges e1 and e3, and vertex v2 is dependent on edges e2 and e4. If one of the edges were to be modified (e.g., if its position were to be displaced), the vertex that depends on that edge would no longer be a valid hypothesis, and would therefore be deleted. A new vertex might then be hypothesized.
2. When a face is completed by connecting its two end points (Fig. 36b), two new vertices and one new edge are hypothesized. The new edge e4 is dependent on both e1 and e3, while the new vertices v1 and v2 are dependent on the edges on which they lie.
3. When a vertical wall is dropped from a face, the first step is to drop hypothesized edges from vertices of the face. Such edges are dependent on the vertices from which they are dropped. In Fig. 36c, the new edges e1 and e2 are dropped from, and are dependent on, the vertices v1 and v2, respectively. A dropped edge is constrained to be perpendicular to the ground plane, and would therefore no longer be a valid hypothesis if the vertex it depends on, which is one of its end points, were to be displaced.

Results of Merging
When these procedures are applied to the wire-frame data in Fig. 35 and the scene model in Fig. 31, we obtain the updated scene model shown in Fig. 38. The updated version has two important improvements over the initial version. First, the updated model contains more buildings, since new wire-frame data, some of which represent new buildings, have been incorporated into the initial model. Second, for many buildings described in both versions of the model, the positions of vertices and edges are more accurate in the updated version. This is because many hypothesized vertices and edges are replaced by accurate ones obtained from the new data, and many confirmed vertices and edges are merged with corresponding ones in the data by "averaging" their positions, generally decreasing the amount of error.
The shape of the large hole in the roof of one of the buildings has changed from a rectangle in the initial model to an almost triangular quadrilateral in the updated version. When compared with the source images in Fig. 4, the rectangular shape would seem more accurate. However, the positions of the edges and vertices that form the hole are more accurate in the updated model in the sense that they are more faithful to the wire-frame descriptions derived from the images.
This experiment demonstrates how the information provided by each additional view allows the model to be made incrementally more complete and accurate.

Summary
The 3D Mosaic system acquires an understanding of the 3D configuration of surfaces and objects in a scene. The system encompasses several levels of the vision process, starting with images and ending with symbolic scene descriptions. Because the scenes considered are highly complex, we use multiple views so that more information can be extracted than from a single view. This has led to an incremental approach for acquiring the scene model. As a result, the following capabilities are required:
1. Image analysis must extract as much scene information as possible from input images.
2. Partial scene descriptions must be represented and manipulated.
3. Incremental modifications and updates to the scene model must be easy to perform.
4. Mechanisms for generating, manipulating, and deleting hypotheses from the model must be introduced.
A view of the scene may be either a single image or a stereo pair. Two separate system components for extracting 3D information from images have been described: stereo analysis and monocular analysis. Both of these components extract sparse 3D wire-frame descriptions from the images. A component that converts these wire frames into a surface-based description has also been described.

We have demonstrated that task-specific knowledge is very useful for interpreting complex images.
Knowledge of block-shaped objects in an urban scene is used for stereo analysis, monocular analysis, and reconstructing shapes from the wire frames. Our techniques have been demonstrated on complex aerial photographs of urban scenes.
There are several extensions and improvements to the system that we will pursue in the future:
1. Incorporating depth map data. Currently, our stereo analysis extracts a sparse set of wire frames from the images. We would also like to include a stereo algorithm that extracts depth maps [Baker and Binford 81, Grimson 80, Ohta and Kanade 83]. The depth map from a new view would have to be segmented into surfaces, edges, and vertices, and merged into the current model.
2. Improving the 3D matching. The algorithm that matches new 3D information with the current model should be improved so that it can provide the scale and coordinate transformations between the two. In addition, the current algorithm, which considers only edges and vertices when performing the matching, will have to be extended to include faces, which may be directly obtained from a depth map.
3. Using the current model to interpret a new view. Currently, 3D information is extracted from a new view without using any information available in the model generated from previous views. Making use of the current model may aid in segmenting a depth map, or in extracting 3D information from a single image.
4. Improving the monocular analysis by using other monocular cues, such as shadows [Shafer and Kanade 82] and texture.