A stereo matching algorithm with an adaptive window: theory and experiment

An iterative stereo matching algorithm is presented which selects a window adaptively for each pixel. The selected window is optimal in the sense that it produces the disparity estimate having the least uncertainty after evaluating both the intensity and the disparity variations within a window. The algorithm employs a statistical model that represents uncertainty of disparity of points over the window; the uncertainty is assumed to increase with the distance of the point from the center point. The algorithm is completely local and does not include any global optimization. Also, the algorithm does not use any post-processing smoothing, but smooth surfaces are recovered as smooth while sharp disparity edges are retained. Experimental results have demonstrated a clear advantage of this algorithm over algorithms with a fixed-size window, for both synthetic and real images.


Introduction
Stereo matching by computing correlation or sum of squared differences (SSD) is a basic technique for obtaining a dense depth map from images [MSK89][FP86][Woo83][MKA73]. A central problem with this method lies in selecting an appropriate window size. If the window is too small and does not cover enough intensity variation, it gives a poor disparity estimate, because the ratio of signal (intensity variation) to noise is low. If, on the other hand, the window is too large and covers a region in which the depth of scene points varies, then the disparity within the window is not constant, and the position of maximum correlation or minimum SSD may not represent a correct estimate of disparity. For this reason, an appropriate window size must be selected locally.
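As a point of reference, fixed-window SSD matching at a single pixel can be sketched as below. This is only an illustration of the basic technique, not the algorithm proposed in this paper. The function name `ssd_disparity`, the list-of-rows image layout, the integer disparity range, and the convention that the match in the second image lies at `x + d` are all assumptions made for the example, and the caller is assumed to keep every window inside both images.

```python
def ssd_disparity(left, right, x, y, half, max_disp):
    """Return the integer disparity d in [0, max_disp] that minimizes the SSD
    between the (2*half+1) x (2*half+1) window centred at (x, y) in `left`
    and the window centred at (x + d, y) in `right`.

    Images are lists of rows of floats (hypothetical layout)."""
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        cost = 0.0
        for dy in range(-half, half + 1):
            for dx in range(-half, half + 1):
                diff = left[y + dy][x + dx] - right[y + dy][x + dx + d]
                cost += diff * diff
        # Keep the first disparity with the strictly smallest SSD.
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d
```

The `half` parameter is the fixed window half-size whose choice is the problem discussed above: a small value leaves too little intensity variation inside the window, and a large value mixes pixels of different disparity.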
However, there has been little research on adaptive window selection. Most correlation- or SSD-based methods in the past have used a window of a fixed size that is chosen empirically for each application. Uncertainty in matching due to the variation of unknown disparities within a window is unaccounted for by existing stereo algorithms. Levine et al. [LOY73] presented a method of changing the window size locally depending only on the intensity pattern. However, window selection must also depend on the disparity (i.e., depth) variation, which changes from pixel to pixel in an image. In fact, the difficulty in obtaining an adaptive window lies in evaluating and using the disparity variation. While the intensity variation is directly obtained from the image, evaluation of the disparity variation is not easy, since the disparity is what we intend to calculate as an end product of stereo. To resolve the dilemma, an appropriate model of disparity variation is required which enables us to assess how disparity variation within a window affects the estimation of disparity.
The stereo algorithm we propose in this paper selects a window adaptively by evaluating the local variation of the intensity and the disparity. We employ a statistical model that represents uncertainty of disparity of points over the window: the uncertainty is assumed to increase with the distance of the point from the center point. This modeling enables us to compute both a disparity estimate and the uncertainty of the estimate. So, the algorithm can search for a window that produces the estimate of disparity with the least uncertainty for each pixel of an image. The method controls not only the size but also the shape (rectangle) of the window.
In this paper, we first develop a model of stereo matching in section 2. Section 3 shows how to estimate the most likely disparity and the uncertainty of the estimate based on the modeling in section 2. These two sections provide the theoretical grounds of our proposed algorithm. In section 4, we describe a method to select an appropriate window size and shape adaptively for each pixel. Section 5 provides experimental results with synthesized and real stereo images. The quality of the disparity maps obtained demonstrates the effectiveness of the algorithm.

Modeling Stereo Matching
We will first develop a statistical model of the difference of intensities of two images within a window. The analysis is based on the uncertainty model presented in [OK90b]. Let the stereo intensity images be $f_1(x,y)$ and $f_2(x,y)$. Assume that the baseline is parallel to the $x$ axis, and that $f_1(x,y)$ and $f_2(x,y)$ come from an underlying intensity function $f(x,y)$ with a disparity function $d_r(x,y)$. Then,

$$f_1(x,y) = f(x,y) + n_1(x,y) \tag{1}$$

$$f_2(x,y) = f(x - d_r(x,y),\, y) + n_2(x,y) \tag{2}$$

where $n_1(x,y)$ and $n_2(x,y)$ are independent Gaussian white noise for both images, such that

$$n_1(x,y),\ n_2(x,y) \sim N(0, \sigma_n^2). \tag{3}$$

From equations (1) and (2),

$$f_1(x,y) - f_2(x + d_r(x,y),\, y) = n(x,y) \tag{4}$$

where $n(x,y)$ is Gaussian white noise such that

$$n(x,y) \sim N(0, 2\sigma_n^2). \tag{5}$$

To simplify the notation, suppose that we want to compute the disparity at $(x,y) = (0,0)$, i.e., the value $d_r(0,0)$. Also, suppose a window $W = \{(\xi, \eta)\}$ is placed at the correct corresponding positions in both images, that is, at $(0,0)$ in image $f_1(x,y)$ and at $(d_r(0,0), 0)$ in image $f_2(x,y)$. Figure 1 illustrates the situation. Then, the difference of intensities between $f_1$ and $f_2$ at $(\xi, \eta)$ in the window can be approximated by using the Taylor expansion of the left-hand side of equation (4):

$$f_1(\xi,\eta) - f_2(\xi + d_r(0,0),\, \eta) \approx \bigl(d_r(\xi,\eta) - d_r(0,0)\bigr)\,\frac{\partial}{\partial \xi} f_2(\xi + d_r(0,0),\, \eta) + n(\xi,\eta). \tag{6}$$

At this point, let us introduce the following statistical model for the disparity $d_r(\xi, \eta)$ within a window:

$$d_r(\xi,\eta) - d_r(0,0) \sim N\bigl(0,\ \alpha_d \sqrt{\xi^2 + \eta^2}\bigr) \tag{7}$$

where $\alpha_d$ is a constant that represents the amount of fluctuation of the disparity. That is, this model assumes that the difference of disparity at a point $(\xi, \eta)$ in the window from that of the center point $(0,0)$ has a zero-mean Gaussian distribution with variance proportional to the distance between these points. In other words, the expected value of the disparity at $(\xi, \eta)$ is the same as at the center point, but it is expected to fluctuate more as the point is farther from the center.^1 Or, in terms of the scene, the surface covered by the window is expected to be locally flat and parallel to the baseline, but this is less certain as the window becomes larger.

We also assume that the image intensity derivatives $\frac{\partial}{\partial \xi} f_2(\xi, \eta)$ within a window follow a zero-mean Gaussian white distribution.^2 These assumptions allow us to model a statistical distribution of the intensity difference (6). Let us denote the right-hand side of equation (6) by $n_s(\xi, \eta)$. By assuming $\frac{\partial}{\partial \xi} f_2(\xi, \eta)$ and $d_r(\xi, \eta)$ to be mutually independent, we can compute the mean and variance of $n_s(\xi, \eta)$:

$$E[n_s(\xi,\eta)] = 0 \tag{8}$$

$$E[n_s(\xi,\eta)^2] = 2\sigma_n^2 + \alpha_f \alpha_d \sqrt{\xi^2 + \eta^2} \tag{9}$$

where

$$\alpha_f = E\left[\left(\frac{\partial}{\partial \xi} f_2(\xi + d_r(0,0),\, \eta)\right)^2\right]. \tag{10}$$

From equations (6), (8), and (9), we can show that $n_s(\xi, \eta)$ is approximated by Gaussian white noise such that

$$n_s(\xi,\eta) \sim N\bigl(0,\ 2\sigma_n^2 + \alpha_f \alpha_d \sqrt{\xi^2 + \eta^2}\bigr). \tag{11}$$

The intuitive interpretation of (11) is as follows. Referring to figure 1, $n_s(\xi, \eta)$ is the difference between $f_1$ and $f_2$ at $(\xi, \eta)$ within a window when the window is placed at the corresponding positions for obtaining the disparity at $(0,0)$. If there is no additive noise $n(x,y)$ in the image (i.e., $\sigma_n = 0$) and the disparity is constant within the window (i.e., $\alpha_d = 0$), then the two images match exactly, and $n_s(\xi, \eta)$ must be null. Otherwise, however, the difference has a value which shows a combined noise characteristic which comes from both intensity and disparity variations. As derived in (11), it can be modeled by zero-mean Gaussian noise whose variance is a summation of a constant term and a term proportional to $\sqrt{\xi^2 + \eta^2}$. The constant term is from the noise added to the image intensities. The second term is from uncertain local support. That is, while the points surrounding the center point in the window are used to support the matching for the center point, it should be noted that these points may actually increase the error in computing the disparity of the center point. This is because, in general, the disparity of the surrounding points deviates from that of the center point. This uncertainty is represented as if the intensity signals have additional noise whose power is proportional to the distance from the center point in the window. If the disparity is constant over the window (i.e., $\alpha_d = 0$), the additional noise is zero. If the disparity changes more in the window (i.e., the larger $\alpha_d$ is), its effect becomes larger and the information contributed by the surrounding points becomes more uncertain.

^1 The statistical model of (7) can be shown equivalent to assuming that $d_r(\xi, \eta)$ is generated by Brownian motion (refer to [BN68][Vos87]). More generally, we can assume $d_r(\xi, \eta)$ to be a fractal. This corresponds to choosing a different degree of $\xi^2 + \eta^2$ in the variance in (7). The Brownian motion is the simplest case in which the degree is $1/2$. However, our preliminary experiments have shown no noticeable advantage of using a general fractal assumption.

^2 This is also equivalent to assuming the pattern $f_2(\xi, \eta)$ to be the result of Brownian motion: i.e., locally it has a constant brightness, but has more fluctuation as the window becomes bigger.

Now, we will show how the disparity and its uncertainty can be estimated based on the modeling presented in the previous section. Let $d_0(x,y)$ be an initial estimate of the disparity $d_r(x,y)$. By using the Taylor expansion, equation (11) becomes

$$f_1(\xi,\eta) - f_2(\xi + d_0(0,0),\, \eta) - \Delta d\,\frac{\partial}{\partial \xi} f_2(\xi + d_0(0,0),\, \eta) = n_s(\xi,\eta)$$

where $\Delta d$ is an incremental correction of the estimate to be made, such that $\Delta d = d_r(0,0) - d_0(0,0)$.

Dividing both sides of this equation by $\sqrt{2\sigma_n^2 + \alpha_f \alpha_d \sqrt{\xi^2 + \eta^2}}$ yields an equation whose right-hand side is Gaussian white noise with unit variance. [...] where $N_W$ is the number of the samples within the window. These parameters change as the shape and size of a window change.
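To make the role of the noise model concrete, the sketch below evaluates the variance described for (11), a constant term plus a term proportional to the distance from the window center, and uses its inverse as a weight in a least-squares estimate of the disparity increment. The weighted least-squares estimator and its variance are a standard choice assumed here for illustration; they may differ from the authors' own estimation equations, which are not reproduced in this text. The names `noise_variance`, `estimate_increment`, `sigma_n`, `alpha_f`, and `alpha_d` are hypothetical labels for the image noise level, the intensity-derivative power, and the disparity fluctuation constant.

```python
import math


def noise_variance(xi, eta, sigma_n, alpha_f, alpha_d):
    """Variance of the intensity difference at window offset (xi, eta):
    a constant image-noise term plus a term that grows with the distance
    from the window center."""
    return 2.0 * sigma_n ** 2 + alpha_f * alpha_d * math.hypot(xi, eta)


def estimate_increment(samples, sigma_n, alpha_f, alpha_d):
    """Weighted least-squares estimate of the disparity increment.

    samples: iterable of (xi, eta, diff, grad), where diff is the intensity
    difference f1 - f2 at the current disparity estimate and grad is the
    horizontal derivative of f2 at the same point.
    Returns (delta_d, variance_of_delta_d). Assumed estimator, see lead-in."""
    num = 0.0
    den = 0.0
    for xi, eta, diff, grad in samples:
        w = 1.0 / noise_variance(xi, eta, sigma_n, alpha_f, alpha_d)
        num += w * diff * grad
        den += w * grad * grad
    if den == 0.0:
        # No intensity variation in the window: the increment is undetermined.
        return 0.0, float("inf")
    return num / den, 1.0 / den
```

Under this sketch, samples far from the center receive smaller weights, and a window with little intensity variation (small `grad` values) yields a large variance, which matches the trade-off between intensity variation and disparity variation described above.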

Iterative Stereo Algorithm with an Adaptive Window
In the previous sections we have developed a theory for computing the estimates of the disparity increment and its uncertainty, which take into account the fact that not only the intensity but also the disparity varies within a window. We now describe the complete stereo algorithm based on the theory:

1. Start with an initial disparity estimate $d_0(x,y)$. This initial estimate can be obtained by any existing stereo algorithm.

2. For each point $(x,y)$, choose a window that provides the estimate of disparity increment having the lowest uncertainty. For the chosen window, calculate the disparity increment by (25) and update the disparity estimate by $d_{i+1}(x,y) = d_i(x,y) + \Delta d(x,y)$.

   Here we need a strategy to select a window that results in the disparity estimate having the lowest uncertainty. In the discussions so far the shape of the window can be arbitrary. In practice we limit ourselves to a rectangular window, as illustrated in figure 2, whose width and height can be independently controlled in all four directions. Our strategy is as follows:

   (a) Place a small 3 x 3 window centered at the pixel, and compute the uncertainty by using (27), (28), and (26).

   (b) Expand the window by one pixel in one direction, e.g., to the right ($x+$), for trial, and compute the uncertainty for the expanded window. If the expansion increases the uncertainty, the direction is prohibited from further expansions. Repeat the same process for each of the four directions $x+$, $x-$, $y+$, and $y-$ (excluding the already prohibited ones).

   (c) Compare the uncertainties for all the directions tried and choose the direction which produces the minimum uncertainty.

   (d) Expand the window by one pixel in the chosen direction.

   (e) Iterate steps (b) to (d) until all directions become prohibited from expansion or until the window size reaches a previously set limit.

   Thus, our strategy is basically a sequential search for the best window by maximum descent starting with the smallest window.

3. Iterate the above process until the disparity estimate $d_i(x,y)$ converges, or up to a certain maximum number of iterations.
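The window-selection steps (a) to (e) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the uncertainty computation is abstracted into a hypothetical callback `uncertainty`, the window is represented by four extents measured from the center pixel, and `max_extent` stands in for the preset size limit.

```python
def select_window(uncertainty, max_extent):
    """Greedy search for the window with the lowest uncertainty at one pixel.

    uncertainty: hypothetical callback that takes a dict of extents
        {"x+": r, "x-": l, "y+": u, "y-": d} (pixels from the center)
        and returns the uncertainty of the estimate for that window.
    max_extent: preset limit on the extent in each direction.
    Returns (window, uncertainty_of_window)."""
    window = {"x+": 1, "x-": 1, "y+": 1, "y-": 1}  # step (a): 3 x 3 window
    current = uncertainty(window)
    allowed = set(window)                          # directions not yet prohibited
    while allowed:                                 # step (e): repeat (b) to (d)
        trials = {}
        for direction in sorted(allowed):          # step (b): trial expansions
            if window[direction] >= max_extent:
                allowed.discard(direction)         # size limit reached
                continue
            trial = dict(window)
            trial[direction] += 1
            u = uncertainty(trial)
            if u > current:
                allowed.discard(direction)         # expansion made things worse
            else:
                trials[direction] = (u, trial)
        if not trials:
            break
        best = min(trials, key=lambda k: trials[k][0])  # step (c)
        current, window = trials[best]                  # step (d)
    return window, current
```

For example, with a toy uncertainty that decreases with window area but jumps when the window extends more than three pixels to the right (imitating a disparity edge), the search stops at `"x+": 3` while the other directions grow to the limit.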
Now, by using synthesized data we will examine how the window is adaptively set by the stereo algorithm for each position in an image, and demonstrate its advantage. Figures 3 (a) and (b) show the left and the right images of the test data. In generating the data set, a linear ramp in the direction of the baseline is used as the underlying intensity pattern. It is deformed according to the disparity pattern in figures 3 (c) and (d), and Gaussian noise is added independently to both images. We apply the iterative stereo algorithm to the resultant data.
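A data set of this kind could be generated along the following lines. The sketch is an assumption-laden illustration rather than the authors' generator: the function name `make_stereo_pair`, the ramp slope, the noise level, the seed, and the choice of sampling the ramp at `x - d(x, y)` for the second image are all hypothetical, and the disparity pattern is passed in as a function because the actual patterns of figure 3 are not specified in the text.

```python
import random


def make_stereo_pair(width, height, disparity, slope=1.0, sigma_n=0.5, seed=0):
    """Build a synthetic stereo pair from a linear intensity ramp along x.

    disparity: function (x, y) -> disparity used to deform the ramp
        in the second image (hypothetical interface).
    Gaussian noise with standard deviation sigma_n is added
    independently to both images.
    Returns (left, right) as lists of rows of floats."""
    rng = random.Random(seed)

    def ramp(x):
        # Underlying intensity pattern: linear in the baseline direction.
        return slope * x

    left = [[ramp(x) + rng.gauss(0.0, sigma_n) for x in range(width)]
            for _ in range(height)]
    right = [[ramp(x - disparity(x, y)) + rng.gauss(0.0, sigma_n)
              for x in range(width)]
             for y in range(height)]
    return left, right
```

A step-shaped `disparity` function, for instance one that returns 2.0 on one side of a vertical line and 0.0 on the other, gives a vertical disparity edge of the kind discussed in the results that follow.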
First, we will examine the result of window selection. The four images in figure 4 show the length (increasing brightness corresponds to increasing length) by which the window has been extended in each of the four directions. For example, the vertical dark stripes in figure 4 (a) on the right-hand side of the vertical disparity edge show that the windows for those points are not extended to the left, so that the windows do not cross the disparity edge into a region of different disparity. We observe the same phenomenon in the other directions. We can examine the size and shape of selected windows at several representative positions shown in figure 5. The windows selected at those positions are drawn by dashed lines in figure 6 relative to the disparity edges drawn by solid lines. For example, at P0 a window has been expanded to the limit in all directions, whereas at P1 expansion to the right has been stopped at the disparity edge. At P5, a window is elongated either vertically or horizontally, depending on the image noise, but consistently avoids the corner of the disparity jump.
Next, let us examine the computed disparities. For comparison, we have also computed disparities by running the same iterative algorithm but with a fixed window size; that is, in Step 2 of the stereo algorithm we use a window of predetermined size rather than the window selection strategy. We run with three window sizes: 3 x 3, 7 x 7, and 15 x 15. Figures 7 (a), (b) and (c) show the results produced by fixed window sizes, and (d) the result of the adaptive-window algorithm. We can clearly see the problem with using a predetermined fixed window size. A larger window is good for flat surfaces, but it blurs the disparity edges. In contrast, a smaller window gives sharper disparity edges at the expense of noisy surfaces. The disparity computed by the proposed algorithm, shown in figure 7 (d), has both smooth flat surfaces and sharp disparity edges. The improvements are further visible by plotting the absolute difference between the computed and true disparities as shown in figure 8, with a table that lists their mean error values. The adaptive-window algorithm has the smallest mean error, but more importantly we should observe that the algorithm has reduced two types of errors. A small fixed window results in large random error everywhere. A large fixed window removes the random error, but introduces systematic errors along the disparity edges. The adaptive-window based method generates small errors of both types. In fact, we have shown that at each point the expected value of the error by the adaptive-window method is always smaller than or equal to that produced when any fixed-size window is used [OK90b].

Figures 9 (a) and (b) show another example of synthesized test data. Figure 10 presents the computed disparity by the new method in (d), together with the results produced by fixed window sizes in (a) to (c) for comparison. As with the previous example, we clearly see better performance with the new method.
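The mean error used in such a comparison can be computed as in the sketch below, assuming it is the mean absolute difference between the computed and true disparity maps taken over all pixels. The function name `mean_abs_error` and the list-of-rows layout are hypothetical, and the values in the usage note are made up for illustration only, not taken from the table in figure 8.

```python
def mean_abs_error(computed, truth):
    """Mean absolute difference between two disparity maps of equal shape,
    each given as a list of rows of floats."""
    total = 0.0
    count = 0
    for row_c, row_t in zip(computed, truth):
        for c, t in zip(row_c, row_t):
            total += abs(c - t)
            count += 1
    return total / count if count else 0.0
```

For instance, `mean_abs_error([[1.0, 2.0], [3.0, 5.0]], [[1.0, 2.5], [2.0, 5.0]])` averages the four per-pixel errors 0, 0.5, 1 and 0.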

Experimental Results
We have applied the adaptive-window based stereo matching algorithm presented in this paper to real stereo images. Figure 11 shows images of a town model that were taken by moving the camera vertically. The disparity, therefore, is in the vertical direction. To give an idea of the arrangement of objects in the scene, a picture taken from an oblique angle is given in figure 11 (c).
For initial disparity estimates, we have used a technique of multiple-baseline stereo matching [OK90a] which can remove matching ambiguities due to repetitive patterns, especially in the top portion of the image. Figure 12 (a) shows the disparity map computed by the adaptive-window algorithm. In addition, the uncertainty estimate computed by the algorithm is shown in figure 12 (b): increasing brightness corresponds to higher uncertainty. With this uncertainty estimate we can locate the regions whose computed disparity is not very reliable (very white regions in figure 12 (b)). In this example, they are either due to aliasing caused by the fine texture of roof tiles of a building (in the middle part of the image) or due to occlusion (the others). The disparity estimates of those uncertain parts can be discarded for later processing. The isometric plot of the disparity map is shown in figure 12 (d), which roughly corresponds to the viewing angle of figure 11 (c). We can see that each building wall has a smooth surface and yet is clearly separated from the others, and the shape of the distant bridge (on the left) is recovered. Figure 13 shows perspective views of the recovered scene, obtained by texture mapping the original intensity image onto the constructed depth map and generating views from new positions which are outside of the original stereo views. They give an idea of the quality of the reconstruction. This stereo data set is the same one used in [MSK89]. We can observe a noticeable improvement over the previous result. It should also be noted that this is extremely narrow-baseline stereo: the baseline is only 1.2 cm long and the scene is about 1 m away from the camera, thus the depth to baseline ratio is approximately 80. The shapes of buildings, an A-shaped roof, a water tank on the roof, and a flat ground have been recovered without blurring edges.
In this paper, we have presented an iterative stereo matching algorithm using an adaptive window. The algorithm selects a window adaptively for each pixel. The selected window is optimal in the sense that it produces the disparity estimate having the least uncertainty. By evaluating both the intensity and the disparity variations within a window, we can compute both the disparity estimate and its uncertainty which can then be used for selecting the optimal window.
The key idea for the algorithm is that we employ a statistical model that represents uncertainty of disparity of points over the window: the uncertainty is assumed to increase with the distance of the point from the center point. This model has enabled us to assess how disparity variation within a window affects the estimation of disparity.
The experimental results have demonstrated a clear advantage of this algorithm over algorithms with a fixed-size window both on synthetic and on real images.