CSANG: Continuous scale anisotropic gaussians for robust linear structure extraction

Robust estimation of linear structures such as edges and junction features in digital images is an important problem. In this paper, we use an adaptive robust structure tensor (ARST) method in which the local adaptation process uses a spatially varying adaptive Gaussian kernel that is initialized using the total least-squares structure tensor solution. An iterative scheme is designed in which the size, orientation, and weights of the Gaussian kernel are adaptively changed at each iteration step. This adaptive, continuous scale anisotropic Gaussian kernel change for local orientation estimation helps us obtain robust edge and junction features. We consider an efficient graphical processing unit (GPU) implementation which obtains a 30× speed improvement over traditional central processing unit (CPU) based implementations. Experimental results on noisy synthetic and natural images indicate that we obtain robust edge detections, and comparison with other edge detectors shows that our approach obtains better localization and accuracy under noisy conditions.


Introduction
Edge detection and feature extraction are important image processing tasks. Digital image edge detection has a long history and has been an object of study for many decades [1], [2], [3]. Many of these edge detectors are based on image derivatives and gradients and thus usually require pre-smoothing with a smoothing filter, e.g., convolution with a Gaussian kernel. One of the classical and most famous edge detectors is due to Canny [3], and it is still widely used in many image processing pipelines. Though the Canny edge detector includes Gaussian pre-smoothing, it can still produce spurious edge pixels under noise (see e.g. Figure 1) or double edges, and various improvements have been proposed over the years. Structure tensors are an alternative to gradient computations, and they have recently been advocated in various image processing tasks where edges and orientations are required to be computed accurately and robustly under noisy conditions [4], [5].
Figure 1. Example edge detection results: CSANG provides better results than the optimal Canny edge detector. (a) Input image, (b) Canny [3], (c) CSANG. We show the edge detection results for a synthetic gray scale Fish image. (Top row) Noise-free image; (bottom row) noisy image obtained by adding Gaussian noise of standard deviation σ_n = 30. As can be seen, the CSANG edge detector obtains no spurious edge pixels in the noisy scenario.
Nath and Palaniappan [6] considered an adaptive robust structure tensor (ARST) approach for orientation estimation and image segmentation. The ARST is derived via robust statistics and oriented Gaussian kernel adaptation in an iterative manner. By utilizing a spatially varying adaptive Gaussian kernel, improved accuracy in finding edge and corner locations was obtained. Further, a robust error function, such as the Geman-McClure function, is utilized for estimating the ARST in local neighborhoods, which leads to better localization of low level linear structure extraction. Due to the nonlinear dependency on the orientation of the patch and the iterative nature of the ARST estimation, the method has a high computational cost. In this work, we utilize the graphical processing unit (GPU) for the main computational iterative scheme of the ARST. We show that the GPU ARST provides fast edge detection with an average 30× speed improvement over traditional central processing unit (CPU) based implementations. We provide comparative experimental results on edge detection with other well-known edge detectors on synthetic and natural images.
The rest of the paper is organized as follows. In Section 2 we review the adaptive robust structure tensor (ARST) and the corresponding edge detection approach. Section 3 provides experimental results on various synthetic and real images and comparisons with other edge detectors. Finally, Section 4 concludes the paper.

Robust Estimator for Gradient Estimation
We follow the notation and conventions from [6] and detail the iterative method for the robust estimator based continuous scale anisotropic Gaussian adaptation. Let v(x) be the true (latent) gradient of an image patch Ω(y), centered at pixel x. The error vector between the estimated gradient g(y) at location y and v(x) is given by

e(x, y) = g(y) − (g(y)^T v(x)) v(x).    (1)

The aim here is to estimate v, hence the straightforward way is to minimize an error functional ρ(||e(x, y)||²) integrated over the image patch Ω,

E(x) = ∫_Ω ρ(||e(x, y)||²) W(x, y) dy.    (2)

Here, W(x, y) is a spatially invariant weighting function, for example the Gaussian kernel,

W(x, y) = (1/(2πσ²)) exp(−||y − x||²/(2σ²)).    (3)

This weighting function emphasizes the gradient at the central pixel within a small neighborhood, and we utilize an adaptive size and orientation based weighting function estimation, see Section 2.2. One of the most basic formulations is the method of least squares, which corresponds to setting ρ(s) = s, so that Eqn. (2) becomes

E_LS(x) = ∫_Ω ||e(x, y)||² W(x, y) dy,    (4)

subject to the condition that ||v|| = 1. We utilize a Lagrange multiplier λ to write the unconstrained formulation as

E_LS(x) = ∫_Ω ||e(x, y)||² W(x, y) dy + λ (||v||² − 1).    (5)

Differentiating E_LS with respect to v to find the extremum leads to the standard eigenvalue problem for the best estimate v̂,

J(x) v = λ v,  with  J(x) = ∫_Ω W(x, y) g(y) g(y)^T dy.    (6)
We note that J(x) in Eqn. (6) is the least-squares structure tensor at position x using the weighting kernel W. The maximum eigenvector solution of Eqn. (6) gives the least-squares estimate of the gradient at pixel x using the surrounding gradient information.
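As a concrete illustration, the discrete version of the least-squares estimate in Eqn. (6) can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation; the patch is flattened into an (N, 2) array of gradients, and the noise level is an illustrative assumption.

```python
import numpy as np

def ls_gradient_estimate(grads, weights):
    """Least-squares gradient estimate for a patch.

    grads:   (N, 2) array of gradient vectors g(y) in the patch.
    weights: (N,) array of kernel weights W(x, y).
    Returns the maximum eigenvector of the weighted structure tensor,
    i.e. the total least-squares estimate of the gradient direction.
    """
    # J(x) = sum_y W(x, y) g(y) g(y)^T  (discrete form of Eqn. (6))
    J = (weights[:, None, None] * grads[:, :, None] * grads[:, None, :]).sum(axis=0)
    eigvals, eigvecs = np.linalg.eigh(J)  # eigenvalues in ascending order
    return eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# Noisy gradients clustered around the direction (1, 0):
rng = np.random.default_rng(0)
g = np.array([1.0, 0.0]) + 0.1 * rng.standard_normal((50, 2))
v = ls_gradient_estimate(g, np.ones(50))
print(np.abs(v))  # close to [1, 0] (sign of an eigenvector is arbitrary)
```

Note the sign ambiguity: an eigenvector is only defined up to ±1, so orientation comparisons should use the absolute value or the doubled angle.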
In [6], robust error measures are advocated instead of least squares due to their noise tolerance: robust error functions impose smaller penalties on outliers [7]. An important example is the Geman-McClure function [7], defined on the squared error s = ||e||² as

ρ(s) = s / (s + m²),    (7)

where m is a parameter that determines the amount of penalty imposed on large errors. If we utilize the Geman-McClure robust error function in Eqn. (2) we get

E_GM(x) = ∫_Ω [ ||e(x, y)||² / (||e(x, y)||² + m²) ] W(x, y) dy.    (8)

Using a Lagrange multiplier λ, we can write the unconstrained version as

E_GM(x) = ∫_Ω [ ||e(x, y)||² / (||e(x, y)||² + m²) ] W(x, y) dy + λ (||v||² − 1).    (9)

Differentiating E_GM with respect to v and setting it to zero gives

J(x, v) v = λ v,    (10)

where

J(x, v) = ∫_Ω [ m² / (||e(x, y)||² + m²)² ] g(y) g(y)^T W(x, y) dy    (11)

is the Geman-McClure robust structure tensor. The fixed-point functional iteration

J(x, v_i) v_{i+1} = λ_{i+1} v_{i+1},    (12)

where v_{i+1} is taken as the maximum eigenvector of J(x, v_i), numerically solves for (λ, v) in Eqn. (10) and usually converges to a local minimum [8]. We use the total least-squares solution of Eqn. (6) to initialize the iterative process in Eqn. (12).
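The re-weighting of Eqns. (11)-(12) can be sketched as follows. This is a minimal NumPy sketch under the reconstruction above, not the paper's code; the value m = 0.5, the iteration count, and the synthetic inlier/outlier gradients are illustrative assumptions.

```python
import numpy as np

def gm_weights(grads, v, m):
    """Geman-McClure weights m^2 / (||e||^2 + m^2)^2 for each gradient."""
    proj = grads @ v                          # g(y)^T v
    e2 = (grads**2).sum(axis=1) - proj**2     # ||e||^2 = ||g||^2 - (g^T v)^2
    return m**2 / (e2 + m**2)**2

def robust_gradient_estimate(grads, W, m=0.5, iters=10):
    """Fixed-point iteration of Eqn. (12), initialized with the TLS solution."""
    # Total least-squares initialization: maximum eigenvector of the plain tensor.
    J = (W[:, None, None] * grads[:, :, None] * grads[:, None, :]).sum(axis=0)
    v = np.linalg.eigh(J)[1][:, -1]
    for _ in range(iters):
        w = W * gm_weights(grads, v, m)       # robust re-weighting as in Eqn. (11)
        J = (w[:, None, None] * grads[:, :, None] * grads[:, None, :]).sum(axis=0)
        v = np.linalg.eigh(J)[1][:, -1]       # maximum eigenvector update
    return v

# 80% inlier gradients along (0, 1), 20% outliers along (1, 0):
g = np.vstack([np.tile([0.0, 1.0], (40, 1)), np.tile([1.0, 0.0], (10, 1))])
v = robust_gradient_estimate(g, np.ones(len(g)))
print(np.abs(np.round(v, 3)))  # close to [0, 1]: outliers are down-weighted
```

The outliers receive a small weight because their residual ||e|| is large, so the robust tensor is dominated by the inlier direction.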

Continuous Scale Anisotropic Gaussians
Following [6] we use a soft Gaussian convolution function as the weighting function W(x, y) in our iterative formulation (12), with adaptive size and orientation within Ω.

Figure 2. Kernel adaptation at an ideal edge. [e¹_i, e²_i] are the eigenvectors at the i-th iteration step. R_0 is the radius of the initial (circular) local neighborhood, while [R¹_{i+1}, R²_{i+1}] are the semi-major and semi-minor axes of the new local neighborhood at the (i+1)-th iteration step. Adaptation updates for region size and orientation are shown in Eqns. (14) and (15). Ω_i(y) is initially circular and subsequently becomes an oriented elliptical region. Image adapted from [6].

Spatially varying adaptation of the kernel (local neighborhood shape and coefficients) is beneficial for improving the estimation of oriented image structures. The neighborhood Ω is initialized as a circular region and subsequently adapted to an oriented elliptical region. Figure 2 shows the adaptive algorithm at an ideal edge: the dashed-line elliptical region is oriented along the gradient, while the solid-line elliptical region (scaled and rotated by 90°) is oriented along the edge. The spatially varying kernel W_i(x, y) that is used with Eqn. (11) is defined as

W_i(x, y) = K exp( −(1/2) (y − x)^T U_i D_i^{−1} U_i^T (y − x) ),  D_i = diag((R¹_i)², (R²_i)²),    (13)

where K is a scaling factor associated with the Gaussian function. We initialize the kernel W_0(x, y) as an isotropic Gaussian with R¹_0 = R²_0 = R_0. A fairly large radius is chosen (typically R_0 = 8) in order to reduce the influence of noise when evaluating the structure tensor. The columns of U_i are the eigenvectors (e¹_i, e²_i), with the columns of U_{−1} initialized as the coordinate axes. Let λ¹_i and λ²_i (with λ¹_i > λ²_i) be the eigenvalues of the structure tensor J(x, v_{i−1}, W_i) at the i-th iteration.

Scaled versions of these eigenvalues are used to update the semi-major and semi-minor axes R¹_{i+1} and R²_{i+1} for the (i+1)-th iteration, see Eqns. (14) and (15). The eigenvectors obtained from the current iteration, along with (R¹_{i+1}, R²_{i+1}), are used to update the kernel W_{i+1}(x, y) as in Eqn. (13). This kernel is used to compute a new structure tensor J(x, v_i, W_{i+1}) as per Eqn. (11). To account for the spatially varying Gaussian kernel, Eqn. (12) is modified to

J(x, v_i, W_{i+1}) v_{i+1} = λ_{i+1} v_{i+1}.    (16)

We experimentally determined that two or three iterations are sufficient to achieve good results.
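The oriented elliptical kernel of Eqn. (13) can be sketched as follows. This is a minimal NumPy sketch; the grid radius, the scaling K, and the example axes/eigenvectors are illustrative assumptions, not values from the paper.

```python
import numpy as np

def anisotropic_gaussian(R1, R2, e1, e2, radius=12, K=1.0):
    """Oriented elliptical Gaussian kernel, cf. Eqn. (13).

    R1, R2: semi-major / semi-minor axes; e1, e2: unit eigenvectors
    (columns of U_i). Returns a (2*radius+1)^2 weight array centered at x.
    """
    U = np.column_stack([e1, e2])                 # eigenvector basis U_i
    D_inv = np.diag([1.0 / R1**2, 1.0 / R2**2])   # inverse of D_i
    A = U @ D_inv @ U.T                           # inverse covariance matrix
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    d = np.stack([xs, ys], axis=-1)               # offsets (y - x)
    quad = np.einsum('...i,ij,...j->...', d, A, d)
    return K * np.exp(-0.5 * quad)

# Isotropic initialization W_0 with R1_0 = R2_0 = R_0 = 8:
W0 = anisotropic_gaussian(8.0, 8.0, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# After adaptation: an ellipse elongated along the 45-degree direction.
s = 1.0 / np.sqrt(2.0)
W1 = anisotropic_gaussian(12.0, 4.0, np.array([s, s]), np.array([-s, s]))
print(W0.shape, float(W1.max()))
```

The kernel W1 decays slowly along e1 (the edge direction after the 90° rotation) and quickly along e2, which is exactly the weighting behavior the adaptation aims for.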

Edge Detection with CSANG
Algorithm 1 (fragment): hysteresis thresholding of the non-maximally suppressed edge map E_nms.

for all image coordinates (x, y) ∈ m × n do
    if E_nms(x, y) ≥ t_hi ∧ E_bin(x, y) = 0 then
        TraceAndThreshold(E_nms, E_bin, x, y, t_lo)
return E_bin
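The hysteresis step above can be sketched as follows. This is a minimal NumPy sketch with an explicit stack in place of the recursion; the example image and thresholds are illustrative assumptions.

```python
import numpy as np

def hysteresis(E_nms, t_hi, t_lo):
    """Binary edge map from a non-maximally suppressed magnitude image.

    Pixels >= t_hi seed edges; 8-connected neighbours >= t_lo are traced
    iteratively (an explicit stack replaces the recursive trace).
    """
    h, w = E_nms.shape
    E_bin = np.zeros((h, w), dtype=np.uint8)
    stack = [(y, x) for y in range(h) for x in range(w)
             if E_nms[y, x] >= t_hi]
    for seed in stack:
        E_bin[seed] = 1
    while stack:
        y, x = stack.pop()
        for dy in (-1, 0, 1):                 # visit the 8-neighbourhood
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w and E_bin[ny, nx] == 0
                        and E_nms[ny, nx] >= t_lo):
                    E_bin[ny, nx] = 1
                    stack.append((ny, nx))
    return E_bin

# A ridge whose magnitude dips below t_hi but stays above t_lo:
E = np.zeros((5, 7))
E[2, 1:6] = [0.9, 0.4, 0.4, 0.4, 0.9]
print(hysteresis(E, t_hi=0.8, t_lo=0.3)[2])  # the whole ridge survives
```

The weak middle pixels are kept only because they connect to strong seeds, which is what distinguishes hysteresis from a single threshold.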
The solution v of the iterative procedure (16) is used to obtain an edge map by applying non-maximal suppression and hysteresis thresholding steps, see Algorithm 1. These are standard steps in obtaining binary edge maps of single pixel width and were originally used by the Canny edge detector [3]. We next detail the graphical processing unit (GPU) based implementation of the core iterative procedure.

GPU Implementation Details
• Structure Tensor: This kernel is executed once; it loads the input image into shared memory. First, we compute the gradients on the shared memory data, and from the computed gradients the structure tensor matrix is estimated.
• Robust Estimator: In this kernel, we compute the Geman-McClure (GM) robust estimation for the smoothness and texture terms from the gradients and structure tensor. All the computation equations are shown in the pseudo code. Shared memory is used for computing second derivatives, similar to the normal gradients. Since the GM robust estimation is the same for both the smoothness and texture terms, we implemented it as a device function. The GM robust function contains a floating point division, which is slow; to make it faster, we approximate the division by a multiplication with the reciprocal.
• Input for solver: Inputs to the Thomas solver are prepared using a couple of kernels. Their inputs come from the robust estimator and are transposed for the row-wise gradients. In the original source code there is a data dependency on the previous value. To eliminate this dependency and make the algorithm parallel, we load the input data into shared memory and re-compute the previous value; the overhead of this recomputation is minimal.
• cuSPARSE batched gtsv: The Thomas solver is the cuSPARSE gtsv library function. This solver solves one matrix column per instance, so to make the column elements coalesced we perform a matrix transpose on the inputs. To solve multiple columns at the same time we use the batched gtsv solver, but this requires a considerable amount of global memory (batchCount × (4 × m + 2048) × sizeof(<type>)). In some cases we ran out of memory; to solve this issue we run the solver over multiple iterations with fewer batches (columns), considering the free memory available at the launch of the kernel. Gradients are updated by taking the average of the row-wise and column-wise gradient estimates.
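The memory requirement quoted above translates directly into a batching scheme. A small sketch (plain Python; the helper name, system size, and free-memory figure are illustrative assumptions, and the workspace formula is the batchCount × (4 × m + 2048) × sizeof(<type>) expression from the cuSPARSE batched gtsv documentation):

```python
def gtsv_batches(m, n_systems, free_bytes, elem_size=4):
    """Split n_systems tridiagonal systems of size m into batches whose
    cuSPARSE workspace, batchCount * (4 * m + 2048) * sizeof(<type>),
    fits into the free global memory. Returns the list of batch sizes."""
    per_system = (4 * m + 2048) * elem_size
    max_batch = max(1, free_bytes // per_system)  # largest batch that fits
    batches = []
    remaining = n_systems
    while remaining > 0:
        b = min(max_batch, remaining)
        batches.append(b)
        remaining -= b
    return batches

# 1080 row-systems of a 1920-wide image, single precision, 8 MB free:
print(gtsv_batches(m=1920, n_systems=1080, free_bytes=8 * 1024 * 1024))
```

Each entry of the returned list is one launch of the batched solver, so a shortage of free memory only costs extra launches, never a failed allocation.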
• Orientation Estimation: Finally, once all the iterations for computing the gradients are done, we compute the orientation of the gradients using atan2f, a built-in math function.
• Matrix Transpose: This is the most frequently launched kernel in the whole algorithm, so implementing it optimally is critical for performance. The kernel is optimized for a square block size. First, we load the input matrix into shared memory; a different mapping is then used to read data from shared memory and store it to the output in a coalesced manner. Figure 3(a) shows the GPU execution pipeline discussed above.
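The tiled transpose can be sketched as follows. This is a CPU-side NumPy sketch of the access pattern only, not the CUDA kernel; the tile size is an illustrative assumption.

```python
import numpy as np

def tiled_transpose(a, tile=8):
    """Transpose via square tiles: read a tile into a local buffer (the
    stand-in for shared memory), transpose it there, and write it to the
    mirrored tile position. On the GPU, the different read and write
    mappings are what keep both global-memory accesses coalesced."""
    h, w = a.shape
    out = np.empty((w, h), dtype=a.dtype)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            t = a[y:y + tile, x:x + tile]      # load tile (coalesced read)
            out[x:x + tile, y:y + tile] = t.T  # store transposed (coalesced write)
    return out

a = np.arange(32 * 48).reshape(32, 48)
print(np.array_equal(tiled_transpose(a), a.T))  # True
```

In the actual CUDA kernel, the shared-memory tile is also commonly padded by one column to avoid bank conflicts during the transposed read, a detail this sketch does not model.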

Thread scheduling and shared memory utilization.
Details about thread scheduling and shared memory utilization are described below; see Figure 3(b).
• Thread Scheduling: The input image can be of any resolution, and the thread blocks need to traverse the whole image. Though we have a 2D grid of blocks, we convert this 2D grid into a 1D value in terms of the image width; from this we compute the current iteration indices X and Y. Consider grid dimensions of <3, 2>: thread blocks first walk the image width, and once we hit the boundary of the image we move to the next row (vertically), and so on.
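The 2D-to-1D traversal described above amounts to a row-major linearization of block coordinates. A small sketch (plain Python mirroring the index arithmetic; the function names are illustrative):

```python
def block_coords(block_id, grid_w):
    """Map a linear block id to (X, Y) block coordinates, row-major:
    blocks walk the image width first, then move down one row."""
    return block_id % grid_w, block_id // grid_w

def linear_id(bx, by, grid_w):
    """Inverse mapping: 2D block coordinates back to the linear id."""
    return by * grid_w + bx

# Grid of <3, 2>: ids 0..5 visit (0,0) (1,0) (2,0) (0,1) (1,1) (2,1).
order = [block_coords(i, 3) for i in range(6)]
print(order)
```

The same arithmetic lets a fixed-size grid of persistent blocks stride over an image of any resolution by advancing the linear id by the grid size.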
• Shared memory utilization: Shared memory is used in most of the kernels. Its use in the matrix transpose is straightforward; its use for computing gradients is as follows. For computing the gradient in X we need the previous and next values of the input; for the gradient in Y we need the values in the rows above and below. To exploit these overlaps we load the data into shared memory, with an extra load for the boundary elements. To reduce warp divergence, the threads with threadIdx.y == 0 and threadIdx.x == 0 load two extra elements each; for example, for a block of <8, 8> we load 10 × 10 data into shared memory. Since we use __syncthreads() for barrier synchronization, we make sure all the threads in a block stay alive even if they fall outside the image.

In Algorithm 1, the helper GetOrientationSector returns the orientation sector s_θ for a 2D gradient vector, and TraceAndThreshold(E_nms, E_bin, x, y, t_lo) recursively collects and marks all pixels of an edge that are 8-connected to (x, y) and exhibit a gradient magnitude above t_lo.

Experimental Results
For natural images we use the BSDS500 dataset, which consists of 500 images with five human-drawn contours as ground truth. To check the improvement in timing we tested with binary synthetic images of sizes 512 × 512, 1920 × 1080, and 6600 × 4400. We load the binary image, copy the data to the GPU, and start the kernel execution. All the evaluations are run on a CUDA2 server with two Tesla C2050 GPUs, and serial evaluations are done on an Intel Xeon server. Runtime results (in seconds) for the MATLAB, C, and CUDA implementations are given in Table 1. As can be seen, the improvement with CUDA is around 30× over the traditional MATLAB and C implementations.

Comparison with other edge detectors
We compare our fast CSANG with traditional edge detectors due to Prewitt [1], Sobel [2], and Canny [3].

Synthetic images.
We use a synthetic Fish image of size 518 × 334 for testing the accuracy of different edge detectors under different noise levels [6]. We compare the edge detection results on the noise-free and noisy (obtained by adding Gaussian noise of standard deviation σ_n = 30) images with the other edge detectors [1], [2], [3] using default parameters. As can be seen from the bottom row of Figure 4, under noisy conditions all the traditional edge detectors pick up false positive edges, while our CSANG obtains a robust result.

Natural images.
Figure 5 shows example comparisons of edge detection results on the BSDS500 natural images with the Canny edge detector and the human-drawn ground truth (GT). The GT was aggregated from five different humans (Figure 5(b)), and the Canny edge detection results were obtained with the default parameters of MATLAB (Figure 5(c)). As can be seen by comparing with the edge detection results obtained with our CSANG (Figure 5(d)), we obtain better results. In the Canny edge detection results we observe a lot of spurious edges, while our approach obtains fewer spurious edges; however, edge connectivity is lost in some places.

Conclusions
In this paper, we utilized a fast adaptive robust structure tensor for edge detection in digital images. By utilizing continuous scale anisotropic Gaussians with the Geman-McClure robust error estimator function, we obtained efficient edge detection for noisy images. We provide an efficient GPU implementation using CUDA that obtains a 30× speed improvement over traditional CPU implementations. Compared to different edge detectors from the literature, our ARST based edge detector provides robust and accurate results. Experimental results on a variety of noisy images indicate that we obtain better edge detection results than other standard edge detectors. Testing other robust error estimator functions and applying the fast ARST to large scale image segmentation defines our future work.