Parallel Hough transform for image processing on a mesh of trees architecture

This paper presents algorithms for Hough transform on mesh of trees parallel computers. Using a mesh of trees with n/spl times/n processors on the base, our algorithms perform the Hough transform for an image of size n/spl times/n in O(n) time, with O(n), O(/spl radic/n), and O(n/log n) memory usage, respectively. The algorithms were simulated, and the results were compared with those of existing algorithms. It is shown that our algorithms are efficient, and can achieve a good speedup in real-time Hough transform computation.<<ETX>>

The Hough Transform(HT) has been found very useful in computer vision and is frequently used i n detecting the shape of object boundaries in image pattern analysis [2, 31.It is to detect the presence of groups of collinear or almost collinear pixels.The amount of computation required in the HT grows as the number of pixels increases and the demanded accuracy(known as resolution raises).Because the amount of computation is very large, it is not easily implementable in real-time calculation.In recent years, a number of approaches for parallel implementation of the HT have been reported.Assume that the size of the parameter plane is n x n and the number of the pixels is N .A H T algorithm with time complexity a t least O ( N ) on a finegrain SIMD machine with broadcasting capabilit has been proposed by Ibrahim ' [4].Silberberg has reported an O ( n ) parallel method in GAPP(Geometric Arithmetic Parallel Processor) [5].The systolic array for an improved HT which was reported by Chuang et

HOUGH T R A N S F O R ~I
Basically, HT involves transfornlirrg each of the edge pixels in an image plane to a parameter plane.A straight line in the (x,y)image plane can be described by (a,b) whew U and b represent slope and intercept respectively and y, = ax, + b is satisfied.On the other side, it can be parameterized by ( p , O ) where p is the distance between the origin and the line.8 is the angle hetween the line and the positive x axis measured counterclockwise.and p = I,, * cos 8 f ya * sin 8 (1) is satisfied.The most commonlv used coordinate system in the HT implementation is ( p , 8).
Given a set of n pixels, {(r0,y"),(z1,y~);..,(r~-1.y~-1)}, it is possible to find a set of straight lines that connect some of these points together.A pixel at location (I,, y*) is transformed into the sinusoidal curve in the p-8 plane defined by equation 1 by varying 8. Curves corresponding to the set of collinear figure points in the ( r , y) plane will have a common point.Thus the problem of detecting collinear points can be converted to the problem of finding concurrent curves.(See Fig. 1.)When it is not necessary to determine the lines exactly, we can specify the acceptable error resolution B,,, and ( For each point ( z a r y o ) in the image plane, the ( p , 8 ) cell defined by equation 1 is incrernented.A given cell in H counts the total number of curves intersecting at the point represented by a given ( p , 8 ) in the parameter plane.After all the pixels have been treated, H is inspected 'The broadcast time is ignored ~n time analysis.
Hough Transform utilizing @,e) parametrization.I ZEAFTOROOT(Vector, Source): "Source" has the same meaning as the Obj except that it will send out data.
LEA FTOSUBROOT(Vector, Source): has the same meaning as the ZEAFTOROOT(Vector, Source) except that the SUBROOT refers to a root of a subtree which will be defined later.
IETOROOT(Vector, Source): ('Vector" refers to either all or part of IE's in a subtree."Source" refers t o a corresponding selected register in these IE's which will send out the data.

A GROUP O F ALGORITHMS
We assume that the image size is n x n and the parameter space is also n x n.The main strategy to design our algorithms is: each BE in jth(where 0 5 j 5 n) column only owns a B value, B j , and a onedimension array(represented by h j ) does the count work.The image pixels are input into the root of row and are passed down through the row tree to BEs.BEs calculate the p value by applying equation 2 one pixel at a time and accumulate the corresponding entry in the hj.
After all pixels are processed, the data in the entries of hj is passed up one by one through the column tree t o the root of column where the maximum value in all entries of hj will be found.
The First Algorithm Broadcast.sin B j and cosOj are input into the root of the j t h column for 0 < j 5 n -1 and are consecutively assed in a pi eline fashion downward through the column trees.lj(where j is t i e index of column) is reserved in each BE.

Procedure Broadcasting
for each j(0 5 j 5 n -1) pardo ROOTOLEAF(co1umn j , A) (sin 0j); ROOTOLEAF(co1umn j , B) (cos 0,); end for In the above procedure, pipedo indicates that Procedure Broadcasting is performed in a pipeline fashion assuming sin B j or cos 0, to be available successively a t the row input ports.
Calculation.First, the pixels are entered into the root of the row.For an image containing n x n pixels, the ith row pixels, namely, n pixels, are input into the root of the ith row in a pipeline fashion.In other words, each root of the row takes care of one row pixels.Simultaneously, the value of the pixel is passed down via the tree one by one so that the value of the first pixel arrives at each BE of corresponding row in logn steps: BE(i, j) can now computes the p value by applying equation 2 and increments the pth element of the hj array.Then, BE(i,j) can accept the next pixel.This process is repeated n times.However, we do not perform calculation in the above way and we can have some modifications to save calculation time.Two multiplications and one division are required in each cycle when we calculate the HT by applying equation 2 in the above algorithm.The are all ex ensive floating point operations and are time-consumin l o , it woulzbe advantageous if we can replace these operations by l%s expensive and less time-consuming additions or subtractions.In [7], Kannan introduced a method to realize this goal.His idea can be similarly applied to our algorithm.The modified algorithm is given below.

Procedure Calculation
Input pixel values of ith row to the root of the ilh row successively and pass down through the row tree in a pipeline fashion; In each BE pa = cos 0, /pres; p = j * sin 0, /pres; accept the 2nd pixel from the parent of BE in the row tree; accept the next pixel from the parent of BE in the row tree; end for Summation.Because the same value of B is held in all rocessors of each column of BEs, we have accumulated the counts k r ( p ; , e j ) , 0 < %, j 5 n -1, in the hj array of the BE on the j t h column.The &tents of the corresponding elements of the h, array on a column of BEs should be summed up t o get the total counts, and this should be done in all n columns in parallel.All BEs send the contents of the first cell, namely, h.(O), to their parents in the column tree.The parents of BEs then addthe two values received from their two children and send the result to their parents in the same column tree (i.e., the grandparents of BEs).Simultaneously, the BEs send the contents of the second cell, hj (l), to their parents.The similar tasks are executed in the subsequent cycles.At last, the root of the j t h column will have the final value of hj (corresponding to B j ) .

Procedure Summation for k = O to n -1 pipedo
Send h (k) to the C register of its own BE; LEAF+OROOT (column air, c);

end for
Max.After all the roots of column have received h, for the corresponding B at the end of previous phase, the largest value is t o be found in the H array of the root of each column.
Zinc One more step is needed here to find the largest value in the 2-dimensional H array, i.e., the maximum for all the e's.

Procedure FindMaximum
ROOTOLEAF(co1umn all, A(row 0)); LEAFTOROOT(row 0, A); Time Complecity.The time complexity of the first algorithm is O(n), and the memory space is O(n).

The Second Algorithm
With O(n3) sequential computation of HT and O ( n z ) processors of the Mesh of Trees is, the first algorithm achieves an optimal time complexity O(n).However, the memory usage can be further reduced.
The straight line we want to detect can be specified by p and 0 where p is the distance from the origin to the line.Therefore, the maximum p must be the diagonal of the image, d m = f i n for an image of size n x n (in pixel units).In the first algorithm, each BE has to process n pixels with range of horizontal coordinate from 0 to n-1 and vertical coordinate remaining unchanged.Tbe change&le range d p k each processor is larg, and each BE has to store a p-list of length n.In the following algorithm the length of p-list for each BE is reduced, and the memory usage hence decreases.
Instead of that each BE processes one row pixels(of number n), we let it process J;; x J5i pixels(i.e., a J;; x J;; square of the image).The number of pixels for each BE to process remains unchanged.Because each BE processes one 0, the largest changeable range of p with respect to each B is from the first pixel to the last pixel in the small J;; x J;; square.Obviously, it is 6 if we do not consider the resolution.2The distribution of the image for this algorithm is shown in Fig. 3.  Another problem arises in applying the new method, in which each BE only stores a sublist of p with length JiE.We do not know which 'If the resolution is considered, the range of p will be smaller since the resolution is larger than 1.

Ai
part of the whole p-list the BE should contain, and a pre-process procedure(training) is hence needed to associate each BE with a corresponding part of the p-list.The procedure is"one-time" process.In other words, the procedure only needs t o he executed once, and the result can be saved in the ROM and to be used thereafter.
Training.By broadcasting sin0 and cos0 to the corres ondin BE first.the similar method used in the first algorithm can \e mo2ified and used to calculate the p list .

Procedure Training
for each c(0 5 c 5 n -1) pardo ROOTOLEAF(co1umn c, A) (sin e<); ROOTOLEAF(column c. B) (cose,): e n d for Let p i d ( r ) tte the row index of BE; In each BE pid.k = p i d ( r ) mod fi; s t a r t x = pid-k * J;;; end.r = (pid-k + 1) * fi -1; start-y = ( I p i d ( r ) / f i J ) t J;;; end-y = (Ipid(r)/JfiJ + 1) * J;; -1; for i == start-y t o i < md.y d o = cos e,ip,,,; Summation.Sum up the contents of the corresponding elements of the h, array on a column of BE to obtain the total counts.The method is similar to the Summation phase in the first algorithm, except that the processing of the p-list is modified due to the difference of the list.The total range of p is n , and the array to store n p-values is referred to as a "total-list".The array which contains the calculation results is fi in length in every BE and is referred t o as a "sublist".The values in the sublist should be from part of the total-list and should be successive.First, compare the first index value of sublzst in each BE with the first index value in the total-list.(This needs only one comparison because the index value of sublist is sorted in an ascending order.)If they are not equal, the next one in the sublist is definite1 bi ger than the first one in the total-list.Set a flag bit that is initiafizef to 0 in each BE.The flag bit is set to 1 when the result of comparison is equal.Otherwise it remains 0.Then, each BE with the flag bit equal t o 1 sends the content. of the first cell of that array to its column parent and execute the arithmetic o eration there if necessary.Secondly, compare the first index value in t i e sublist in each BE whose flag bit is 0 with the second index value in the total-list.If they are equal, set the fla bit to l.(Obviously, the flag bit will be set t o 1 some time in all the Bgs.) Send the content of the first cell in each BE whose flag bit is just set to 1 and the content of the next cell in each BE whose flag bit was set to 1 to its column arent and perform an addition there.At the same time, each parent wtich received a value in the previous addition also sends the results to its parent.After (logn -2) cycles, the roots of columns will receive the first value.In the following (n + 1logn) cycles, the operation is almost the same as previous ones except that the roots of column processors ;are needed to compare the current received value with previously received ones, t o choose the bigger one and t o save it for the next comparison.After n cycles, the largest value for each 0 can be obtained.In the following algorithm, we do not consider the comparison erformed in root element, because it only needs one step for each cy& and it will not affect the time complexity.

Procedure Summation
Each BE pardo e n d if e n d for Having the largest p value for each 0. we then call the subroutine FindMaximum in Algorithm 1 to find the maximum value for all the 0's.

The Third Algorithm
Another algorithm is introduced below which does riot need the training phase and hence has a smaller factor in the same O( n) time complexity.Moreover, The memory complexity is also less than O ( n ) .
In this algorithm, each BE processes one row pixela which are divided into log n groups, and each group is with n/ log n pixels.We distribute uniformly the p list to the processors in the lower half of the column tree, with (2 * n l l o g n ) entries per processor.
Initially, each row BEs calculate the p value for the first group of ( n / log n ) pixels.Depending on the p value, a corresponding ent rg will be found in the temporary array maintained by the BE, and the count for it will by incremented.After the first group is processed, the count of the ith entry for p2 in the temporary array will be passed to the root of the column tree.On its way, one of the sublist will have the ith entry with the same pf valuu.m d the count ail1 be accumulated thrrr.We will then calculate thP second group of ( n l l o g n ) pixels, performing the summation again and so on.In this way, the ralculation and summation are done one group pixels after another.As we know, the lower half of the columu tree contains ,h subtrees, each of them having , , / Z leaves and %, ;.e. log fi; height.
Each BE in the same level stores the same part of the p-list, referred t o as a 'sub-list '.

The time complexity of the scjcond algorithm is
T h e o r e m 1 In any subtree, after the calculatioil of any group pixels, if we request that all the BEs send o u t the resulted values which are corresponding to a specified entry of all the sub-lists, there are totally 9 entries along any path in a subtree.However, all BEs in a subtree wzll send out the same value which is corresponding to that specified entry in some IE and only one IE.

Proof:
The range of the resulted value of a group pixels for the first BE in a subtree is ( x * c o s 0 + y * s i n 0 , ( x + n / l o g n ) ~c o s 0 + y * s i n 0 ) , and the range of the resulted value of a group pixels for the last BE in a subtree is (z *cos 0+(y+ fil)+sin 0, (x + n / log n ) * cos 8+ ( y+ J51-1)tsin 0 j, where x and y are the x-coordinate and y-coordinate, respectively, of the first pixel of that group for the first BE in that subtree.
Because each BE in one column has the same 0 value, we can find the largest difference of the resulted value between the first BE and the last BE as n/ log n t cos0 + (fi -1) t sine.As we know.there is a

SIMULATION
We obtained the experimental results on nCUBE2, simulating a Mesh of Trees parallel computer.The nCUBE2 which we use has 64 processors.We process on it three sizes of images of 64 x 64, 128 x 128, and 256 x 256 pixels.Following are the execution time and speed up for implementing the second algorithm.From table 2, we can see that the speed-up increases as the image size increases.When the image size is approaching to infinite, the ideal speed up, 64, can be achieved for a system of 64 processors.(See Figure 4 for reference.)

COMPARISON WITH OTHER WORKS
We have presented several simple efficient parallel HT algorithms on Mesh of Trees.The algorithms take advantage of both Mesh and Tree network structures.With careful consideration of data flow and information management, we obtained a very good result both in time performance and in memory usage.Table 3 summarizes the known bounds in the literature and the results derived in this paper.
al. requires O ( N + n ) time [6].An algorithm that has been described on mesh arrays takes O ( f l f n ) time and O ( n ) memory [7].

Figure 1 :
Figure 1: Representation of points in polar coordinates.tofind those cells with the largest counts.If the count value in a given cell ( p , 8 ) is m, then precisely m edge pixels lie (within the quantization error) along the line which is defined by ( p , 8).

Figure 3 :
Figure 3: Distribution of image pixels for the second algorithm.

( 2 *
n/ logn) long sublist in each IE along the path of that subtree.It is obviously n / l o g n > J;;.Therefore, 2 * n/ Iogn > n / l o g n * cosO + (6 -1 i * sin 0 P r o o f completed Time complexity.The time complexity of the Third algorithm is O ( n ) , and the memory usage is O(n/logn).
sin B,/p,,, + start.2* cos B c / p T e s ; for 3 == start.2+ 1 t o j < end.2 do he the index of the root of the row; In t h e root of each rowp d k = p i d ( a ) mod fi, start-x = prd-k t J;;, en(1.x=(pid-k+ 1) * fi -1;start-y = ( [ p i d ( a ) / J ; ; J ) * Jr;; end-y = (bzd(a)/J;;J + 1) * fi;for i == stnrt-y t o i < r nd-y d o for == start-2 to j < end.2 d oInput pixel value (i, J j in the root of the row successively and pass them down through the row tree in a pipeline fashion;In each BEif 1 == start-x then P o = 6JPW p = j t COS B,/p,., + i t sin BJp,,, ; accept the next value from the BEs row parent; if (pixel == 2 5 5 ) then h, [p] = h, [p] + 1; p = p + p .: if pixel == 255 then h, [p] = h [p] + 1;accept the next value from the bEs row parent;

Table 2 :
Speed-up table I 64 x 64 I 128 x 128 I 256 x 256 size speed-up I 17.18I 18.65 I 26.62