DESIGN AND IMPLEMENTATION OF A SINGLE-CHIP 1-D MEDIAN FILTER

The design and implementation of a 1-dimensional median filter in VLSI is presented. The device is designed to operate on 8-bit sample sequences with a window size of 5 samples. Extensive pipelining and employment of systolic concepts at the bit level enable the chip to filter at rates up to 10 megasamples per second. The chip is designed to be implemented with a λ = 2.5 µm NMOS technology and is 6.2 mm by 5.0 mm in size. A circuit configuration for using the chip in approximate 2-D median filtering is also presented.


Introduction
Median filtering is a nonlinear signal smoothing operation in which the median of a window of size w = 2n + 1 replaces the sample at the middle of the window. Medians computed in this way tend to follow the polynomial trends in the original sequence while sharp discontinuities of short duration are filtered out. Further properties of median filtering have been described in [1], while [2] describes its application to speech processing. Recently, an algorithm for real-time median filtering has been presented in [3]. Systolic algorithms for one- and multi-dimensional median-filtering operations and the more general case of computing running-order statistics have been recently proposed by Fisher [4].
This work presents the design and implementation of a VLSI chip for the 1-dimensional median-filtering operation. The device is designed to operate on 8-bit sample sequences with a window size of 5 samples.
Extensive pipelining and employment of systolic concepts at the bit level give the chip a very high throughput: it can be clocked at rates up to 10 MHz and produces one median every clock cycle after an initial delay to fill the pipeline. The chip is designed to operate as a shift register in a system environment, filtering data coming from the source before it goes into the actual computing system.
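As a software reference for the operation the chip performs, the following minimal sketch computes the running median over every full window of odd size w. The function name `median_filter_1d` and the sample data are illustrative assumptions, not part of the original design, and the sketch says nothing about how the hardware computes the result.

```python
from statistics import median


def median_filter_1d(samples, w=5):
    """Return the median of every full sliding window of odd size w."""
    if w % 2 != 1:
        raise ValueError("window size must be odd (w = 2n + 1)")
    # One output per window position; edge samples without a full
    # window are simply not produced in this sketch.
    return [median(samples[i:i + w]) for i in range(len(samples) - w + 1)]


# A short spike (200) on a slowly rising sequence is filtered out.
signal = [10, 11, 12, 200, 13, 14, 15, 16]
filtered = median_filter_1d(signal)  # -> [12, 13, 14, 15]
```

The output follows the underlying trend while the single-sample discontinuity disappears, which is the behavior described above.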

Systolic Algorithms and Structures
Rapidly advancing VLSI technology offers system designers a very high potential for parallel operations.
However, in order to exploit this potential, algorithms to be implemented with VLSI computing structures should have regular and simple communication schemes. This is mainly due to the fact that communication, especially irregular communication, is costly in VLSI in terms of the chip area that communication channels (i.e., wires) occupy. Furthermore, to reduce the design time, these algorithms should employ a rather small number of basic building blocks (or cells) from which larger systems can be built. A class of parallel algorithms that exhibit such regular structures are systolic algorithms. Systolic algorithms for various computational problems have been described in [4,5,6,7,8]. Systolic data structures for priority queue operations and connectivity problems have been proposed in [9] and in [10] respectively. The general architectural principles of systolic computation systems have been discussed by Kung in [11]. In general, systolic algorithms and the underlying hardware structures implementing them have very regular neighbor-to-neighbor communication schemes. They utilize their inputs many times through pipelining and multi-directional data flow and hence do not make heavy bandwidth demands on system memories.
Employment of systolic concepts at the low-level implementation of logic circuits for various simple functions (like addition and comparison) also leads to regular structures that have small propagation delays (independent of the size of the circuit) and require no broadcasting. Such circuits are suitable as building blocks in higher-level pipelined structures. A previous chip employing such concepts is the pattern matching chip described in [12].
In the last few years various special purpose chips employing systolic algorithms have been designed at Carnegie-Mellon University. These include a pattern matching chip [12], an image processing chip [13], and a tree processor for database applications [14].

The 1-Dimensional Median-Filtering Algorithm
The 1-D median-filtering algorithm implemented differs from the one described in [4] in that it uses the odd/even-transposition sort [15,16,6] as the high-level algorithm and exploits systolic data flow concepts at the bit level to achieve a very high throughput. After an initial delay to fill the pipeline, the chip can produce one median over a sliding 5-wide window at every clock period. The logic design enables the use of a clock period that is long enough to cover the propagation delays of five NMOS gates. However, due to technological limitations, the method employed is suitable only for small window sizes (3 to 7) because the network implementing the pipelined odd/even-transposition sort requires area proportional to the square of the window size. The systolic algorithms presented in [4] require area linear in window size but they need more complex circuits.

High-Level Structure of the Algorithm
At the high level, the algorithm, and hence the underlying hardware that implements it, consists of an input stage which generates the successive window elements from the incoming sample stream, and a pipelined sort stage which performs the odd/even-transposition sort on the elements of successive windows (see Fig. 3-1).
Shamos, in [17], has proposed similar circuits for median finding; in fact, a circuit proposed there for a window of size 5 uses fewer comparators than the circuit presented here, but Shamos' circuit structure is not regular. Each stage of the odd/even-transposition sort network consists of two 8-bit compare-and-swap units (⌊window size / 2⌋ units in general) and one delay element to store the window sample value that does not get compared at that stage, due to the fact that the window size is odd.
Each 8-bit compare-and-swap unit compares the pair of 8-bit numbers at its input and interchanges them if necessary so that the larger of the numbers is at the "top". At the output of the last stage, the window elements will be sorted such that the largest will be at the "top".
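The sort stage described above can be modeled in software as follows. This is a minimal word-level sketch, not the chip's circuit: it assumes the network has w stages (the usual number for an odd/even-transposition sort of w elements), treats index 0 as the "top", and uses hypothetical function names.

```python
def odd_even_transposition_sort(window):
    """Sort a window in descending order (largest at index 0, the "top")."""
    vals = list(window)
    w = len(vals)
    for stage in range(w):  # assumption: w stages in the network
        # Even stages compare pairs (0,1), (2,3), ...; odd stages compare
        # pairs (1,2), (3,4), ... For odd w, one element per stage has no
        # partner and is only delayed, giving w // 2 units per stage.
        start = stage % 2
        for i in range(start, w - 1, 2):
            if vals[i] < vals[i + 1]:
                vals[i], vals[i + 1] = vals[i + 1], vals[i]
    return vals


def window_median(window):
    """The median of an odd-sized window is the middle sorted element."""
    return odd_even_transposition_sort(window)[len(window) // 2]
```

For a window of 5 samples each stage performs two compare-and-swap operations, matching the two units per stage mentioned above; the median appears at the middle output of the last stage.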

Hardware Implementation of the Algorithm
The structure of the odd/even-transposition sort network described above has certain undesirable characteristics if directly mapped into hardware. In the compare-and-swap units, the swapping of the inputs can only be done after the result of the entire 8-bit comparison has been computed. However, this requires waiting for a long propagation delay through 8 stages of bitwise comparators.
It is possible to get rid of this propagation delay by employing systolic concepts at the bit level. This involves breaking up the compare-and-swap operation into steps and then distributing them over time by skewing the bits of the numbers being compared with delay elements so that each pair of bits arrives at their bitwise comparator at the same time the subresult of the comparison of their more significant counterparts arrives.
The basic element to implement the compare-and-swap operation is the bitwise compare-and-swap unit. The functional description of this unit is given in Fig. 3-2. It is a bit comparator followed by two multiplexers which pass the larger of the inputs to the A output and the smaller to the B output if E_in is asserted.
Otherwise it unconditionally swaps or passes the inputs depending on whether L_in is asserted or not. It also passes "downward" the cumulative subresult of the comparison to the less significant stages.
The 8-bit wide compare-and-swap units implemented with the units described above also distribute the swap operation over time along with the comparisons. So at the end of each bitwise compare-and-swap operation, the outputs of the bit compare-and-swap units will be the same as they would be if all the bitwise swaps were done simultaneously after waiting for the final comparison result. This is easy to see if we note that if a less significant compare-and-swap unit decides that all the input bits should be swapped, then the inputs to the more significant stages must have been equal; hence passing them without swapping does not matter.
The implementation of the odd/even-transposition sort network exploits the observations presented above.
Furthermore, the comparisons of the next stage of the odd/even-transposition sort network can be started immediately once the comparisons and swaps of the first bits of the preceding stage are done. This observation leads to an internal block structure of the odd/even-transposition sort network given in Fig. 3-3. It should also be noted that in the resulting structure, the first bits of the sorted outputs are available even before the comparisons of the first stage of the odd/even-transposition sort network are completed.
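The behavior of the bitwise compare-and-swap unit, and the claim that distributing the swap over the bits gives the same result as a single word-level swap, can be checked with a small functional model. This is only a sketch under stated assumptions: the signal polarities (E meaning "all more significant bits were equal", L meaning "swap") and the function names are chosen for illustration, and the model processes bits one after another from the most significant end rather than simulating the clocked, skewed data flow of the chip.

```python
def bit_compare_and_swap(a, b, e_in, l_in):
    """One bitwise unit: returns (a_out, b_out, e_out, l_out)."""
    if e_in:
        # No more significant bit has differed yet: compare locally and
        # send the larger bit to A, the smaller to B.
        a_out, b_out = max(a, b), min(a, b)
        e_out = a == b
        l_out = a < b  # assumed polarity: asserted means "swap"
    else:
        # The decision was made upstream: swap or pass unconditionally.
        a_out, b_out = (b, a) if l_in else (a, b)
        e_out, l_out = False, l_in
    return a_out, b_out, e_out, l_out


def word_compare_and_swap(x, y, k=8):
    """Chain k bitwise units from the most to the least significant bit."""
    e, l = True, False
    top = bottom = 0
    for bit in range(k - 1, -1, -1):
        a = (x >> bit) & 1
        b = (y >> bit) & 1
        a_out, b_out, e, l = bit_compare_and_swap(a, b, e, l)
        top = (top << 1) | a_out
        bottom = (bottom << 1) | b_out
    return top, bottom
```

Under these assumptions, `word_compare_and_swap(x, y)` returns the larger value on top for every pair of 8-bit inputs, even though no unit waits for the complete comparison: the bits above the first differing position are equal, so passing them unswapped is harmless.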

The Chip
The chip employing the method described in the preceding section has been designed to be implemented with an NMOS process with λ of 2.5 microns. The basic methodology and the design rules presented in [18] have been used throughout the design and layout process. The outline of the floor plan of the resulting chip is given in Fig. 4-1. The dimensions of the chip are approximately 6.2 mm by 5.0 mm.
It uses 21 pins: 8 for input, 8 for output, 2 for the two phases of the clock, 1 for Vdd, 1 for Ground and 1 for the substrate bias; hence it can be packaged in a 24-pin package.
As of this writing, the chip has been laid out completely and design-rule checks have been made. Circuit-level simulations of the circuits making up the sort network have been done. Currently the chip is being fabricated by the ARPA facility coordinated by USC-ISI.

Application to 2-Dimensional Image Processing
Although the design is not directly applicable to the 2-dimensional median-filtering operation, a cascade configuration using these chips can be used for approximate median filtering of 2-D images as suggested by Shamos [17]. The basic idea is to find the medians of the rows of an n × n window and then compute the median of the medians. It is shown in [17] that A w the median of the medians of such a window, has the
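The median-of-row-medians idea can be illustrated in software as follows. This is a minimal sketch with a hypothetical function name; it models only the arithmetic of the approximation, not the cascade of chips, and the example window is made up for illustration.

```python
from statistics import median


def approx_median_2d(window):
    """Median of the row medians of an n x n window (n odd)."""
    row_medians = [median(row) for row in window]
    return median(row_medians)


# The result is an approximation: here it differs from the true median.
window = [
    [1, 2, 9],
    [3, 4, 9],
    [5, 9, 9],
]
approx = approx_median_2d(window)                        # row medians 2, 4, 9
exact = median([v for row in window for v in row])
```

In this example the row medians are 2, 4 and 9, so the approximate result is 4, while the true median of the nine values is 5.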

Evaluation and Conclusions
The design and implementation of a VLSI chip for performing the 1-D median-filtering operation has been presented. The major motivation for this work has been to apply systolic concepts at the bit level in the implementation of logic circuits to construct a digital system with a very high throughput. Also, application of the developed chip to 2-D image processing has been investigated and a configuration for employing it in approximate 2-D median filtering has been proposed.
Although the design developed in this work has a very high throughput, the response time is k + w where k is the number of bits in each sample and w is the window size (so the response time for this specific implementation is 13 clock periods). Furthermore, the design is not practical for larger window sizes because the silicon area for implementing the odd/even-transposition sort network grows as the square of the window size.
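The two limitations above can be made concrete with a rough count. The sketch below assumes a network of w stages with ⌊w/2⌋ word-level compare-and-swap units per stage, each built from k bitwise units; it ignores delay elements and wiring, so the figures indicate growth only and are not area estimates for the actual chip. The function name is hypothetical.

```python
def sort_network_cost(w, k=8):
    """Rough unit counts and latency for a window of w samples of k bits."""
    stages = w                 # assumption: one stage per window element
    units_per_stage = w // 2   # one sample per stage is only delayed
    word_units = stages * units_per_stage
    return {
        "word_units": word_units,
        "bit_units": word_units * k,
        "latency_cycles": k + w,  # response time stated in the text
    }


for w in (3, 5, 7, 15):
    print(w, sort_network_cost(w))
```

For w = 5 and k = 8 this gives 10 word-level units, 80 bitwise units and a latency of 13 clock periods; the unit count grows roughly as w²/2, which is why larger windows quickly become impractical.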

Figure 3-1: High-level structure of the algorithm
Figure 3-2: The basic compare-and-swap unit
Figure 3-3: Internal structure of the odd/even-transposition sort network
Figure 4-1: The floor plan of the chip
Figure 5-1: Hardware structure to implement the approximate 2-D median filtering