WAFER-SCALE INTEGRATION AND TWO-LEVEL PIPELINED IMPLEMENTATIONS OF SYSTOLIC ARRAYS

This paper addresses two important issues in systolic array design. How do we provide fault-tolerance in systolic arrays for yield enhancement in wafer-scale integration implementations? And how do we design efficient systolic arrays with two levels of pipelining? The first level refers to the pipelined organization of the array at the cellular level, and the second refers to the pipelined functional units inside the cells. The fault-tolerant scheme we propose replaces defective cells with clocked delays. It has the distinct characteristic that data can flow through an array with faulty cells at the original clock speed. We will show that both the defective cells under this fault-tolerant scheme and the second-level pipeline stages can simply be modeled as additional delays in the data paths of "generic" systolic designs. We introduce the mathematical notion of a cut to solve the problem of how to allow for these extra delays while preserving the correctness of the original systolic array designs. The results obtained by applying the techniques described in this paper are encouraging. When applied to systolic arrays without feedback cycles, the arrays can tolerate large numbers of failures with the addition of very little hardware, while maintaining the original throughput. Furthermore, all of the pipeline stages in the cells can be kept fully utilized through the addition of a small number of delay registers. Adding delays to systolic arrays with cycles, however, typically induces a significant decrease in throughput. In response, we have derived a new class of systolic algorithms in which the data cycle around a ring of processing cells. The systolic ring architecture has the property that its performance degrades gracefully as cells fail. Using our cut theory and ring architectures for arrays with feedback, we have effective fault-tolerant and two-level pipelining schemes for most systolic arrays.
As a side-effect of developing the ring architecture approach, we have derived several new systolic algorithms. These algorithms generally require only one-third to one-half of the number of cells used in previous designs to achieve the same throughput. The new systolic algorithms include ones for LU-decomposition, QR-decomposition and the solution of triangular linear systems.


Introduction
In recent years many systolic algorithms have been designed and several prototypes of systolic array processors have been constructed 1-3. Major efforts are currently devoted to building systolic arrays for large, real-life applications. In this paper, we will consider two implementation techniques for building high-performance systolic arrays: wafer-scale integration (WSI) and fabrication using pipelined components.
Fabrication flaws on a wafer are inevitable. It is necessary for a WSI circuit to be "fault-tolerant" so that wafers with defective components can still be used. A common approach is to include redundant circuitry in the design and avoid defects by programming the interconnection of the constituent elements. In particular, laser-programming technology has been applied successfully to program the redundant circuitry in VLSI RAMs as a yield-enhancement measure 5. The MIT Lincoln Laboratory 6 has also been experimenting with the use of laser-programmable links to build wafer-scale processor arrays.
Systolic arrays are well-suited to wafer-scale integration. They consist of large numbers of small and identical (thus interchangeable) cells, and their regular and localized interconnections greatly simplify the problem of routing around defective cells. Moreover, systolic architectures guarantee full exploitation of their constituent cells to achieve maximum parallelism: the more cells an array has, the more powerful it is. Wafer-scale integration has the potential to provide a very cost-effective and reliable way of implementing high-performance systolic systems.
Before WSI systolic arrays can become a reality, we must solve the problem of how to construct fault-tolerant arrays. After the cells are tested (by wafer-probing, for example), how do we route around the defects to build a functional array? (See Figure 1-1 (a).) This paper describes a "systolic" approach which provides fault-tolerance at a very low cost and admits of a graceful degradation in performance as the number of defects increases. The second issue we address is the design of systolic arrays with two levels of pipelining, the first being the pipelined organization of systolic arrays at the cellular level and the second being the pipelined functional units inside the cells. While this additional level of pipelining can increase the system throughput, it considerably complicates the design of systolic array algorithms. Our solution to this problem is a methodology for transforming existing systolic designs that assume single-stage cells into arrays consisting of pipelined cells.
We will show that both the "fault-tolerance" and the "two-level pipelining" problems can be solved by the same mathematical reasoning and techniques. Our results imply that once a "generic" systolic algorithm is designed, other versions of the algorithm (for execution on arrays with failed cells, or for implementation using different pipelined processing units) can be systematically derived. The techniques of this paper can also be applied to other computation structures, such as FFT processor arrays.
In the next section we will introduce our approach to the problems, using as an example the simplest type of systolic array: the uni-directional linear array. As we will see, systolic arrays without feedback admit of a much simpler solution; they are discussed in Section 3. In Section 4, we propose a new architecture, the "systolic ring", which can be used in place of many systolic arrays with feedback cycles and is much more amenable to fault-tolerant measures. Section 5 contains a summary and some concluding remarks.

Fault-Tolerance and Two-Level Pipelining for Uni-directional Linear Arrays
Figure 2-1 depicts a systolic array 9 for the convolution computation with four weights w1, ..., w4. In this array the data flow only in one direction; that is, both x_i and y_i move from left to right (with x_i going through an additional "delay register" following each cell). This is an example of a systolic array without feedback cycles: an array where none of the values in any data stream depends on the preceding values in the same stream. (For an example of an array with feedback cycles, see Figure 4-1 (a).) Depicted in Figure 2-2 (a) is an example of a 5-cell array with one faulty element. The defective cell in the middle is replaced with two "bypass" registers (drawn in dotted lines), one for the x-data stream and one for the y-data stream. It can easily be shown that this array correctly solves the same problem as the array of Figure 2-1. For example, y1 picks up w4·x4, w3·x3 and w2·x2 at the first, second and fourth cells respectively. The degradation in performance due to the defect is slight. The maximum convolution computed by this array in one pass can have only 4 rather than 5 weights, and the latency of the solution is increased by one cycle. However, the computational throughput, often the most important factor in performance, remains the same at one output per cell cycle. Figure 2-2 (b) depicts the cell specification for this fault-tolerant scheme, using reconfigurable links. Note that the input/output registers in a systolic cell can be used as bypass registers in case the cell fails; therefore no extra registers are needed to implement this fault-tolerant scheme. A basic assumption of this paper is that the probability of the interconnection links and registers failing is very small and thus negligible. This is reasonable because these components are typically much simpler and smaller than the cells themselves. Furthermore, they can be implemented conservatively and/or with high redundancy to increase the yield.
In the proposed scheme data move through all the cells. At failed cells, data items are simply delayed with bypass registers for one cycle, and no computation is performed (Figure 2-3 (a)). We call fault-tolerant schemes of this type systolic, in view of the fact that data travel systolically in a defective array from cell to cell at the original clock speed. We now examine more carefully the idea behind our fault-tolerant scheme for the linear array of Figure 2-2. Because of the unit delay introduced by the bypass registers, all the cells after the failed one receive data items one cycle later than they normally would. Since both the x- and y-data streams are delayed by the same amount, the relative alignment between the two data streams remains unchanged. Thus, each cell after the third one receives the same data and performs the same function, with a one-cycle delay, as would its predecessor in a normal array. For this reason, an n-cell, uni-directional, linear array with k defective cells will perform the same computation as a perfect array of n-k cells.
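This bypass scheme can be checked with a small cycle-accurate simulation. The sketch below is ours, not the paper's: live cells delay x by two registers and y by one (as in the array of Figure 2-1), and faulty cells delay both streams by one bypass register each; weights are assigned, in order, to the live cells.

```python
from collections import deque

def run_array(weights, xs, n_cells, faulty=()):
    """Simulate a uni-directional linear systolic convolution array.
    Live cells delay x by 2 cycles and y by 1; faulty cells are replaced
    by bypass registers that delay both streams by 1 cycle."""
    live = [i for i in range(n_cells) if i not in faulty]
    assert len(live) >= len(weights), "not enough live cells"
    w = {c: weights[r] for r, c in enumerate(live[:len(weights)])}
    # delay lines between cells, pre-filled with zeros (initial register state)
    dx = [deque([0.0] * (2 if i not in faulty else 1)) for i in range(n_cells)]
    dy = [deque([0.0]) for _ in range(n_cells)]
    out = []
    for t in range(len(xs) + 3 * n_cells):       # run long enough to flush
        x_in = xs[t] if t < len(xs) else 0.0
        y_in = 0.0
        for i in range(n_cells):
            # live cell: multiply-accumulate; faulty cell: pass y through
            y_out = y_in + w.get(i, 0.0) * x_in if i not in faulty else y_in
            dx[i].append(x_in); x_in = dx[i].popleft()
            dy[i].append(y_out); y_in = dy[i].popleft()
        out.append(y_in)
    return out
```

Running a 5-cell array with the middle cell bypassed produces the same convolution as a perfect 4-cell array, shifted by the one extra cycle of latency, and at the same one-output-per-cycle throughput.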
The above reasoning also implies that the correctness of a uni-directional linear array is preserved if the same delay, of any length, is introduced uniformly into all the data streams between two adjacent cells. This result is directly applicable to the implementation of two-level pipelined arrays. We can interpret the stages in a given pipelined processing unit as additional delays in the communication between a pair of adjacent cells.
Consider, for example, the problem of implementing the systolic array of Figure 2-1 using the pipelined multiplier and adder of Figure 1-1 (b). Since the adder is now a three-stage pipelined unit instead of a single-stage unit, two additional delays are introduced in the y-data path. Thus each cell requires a total of four delay registers in the x-data path: one is implicit in the original cell definition, the second is the delay register in the original algorithm design, and the last two balance the two new delays in the y-data stream. The resulting two-level pipelined array is depicted in Figure 2-3 (b).
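The delay-balancing rule can be exercised with the same kind of register-level model. In this sketch (our own model, with `adder_stages` generalizing the three-stage adder of the text), each cell carries `adder_stages` registers on the y path and `adder_stages + 1` on the x path, so the two streams stay aligned and the convolution is unchanged up to latency.

```python
from collections import deque

def run_pipelined_array(weights, xs, adder_stages):
    """Convolution array whose cells use an `adder_stages`-deep pipelined
    adder, modeled as that many delay registers on the y path; the x path
    carries adder_stages + 1 registers to keep the streams aligned."""
    n = len(weights)
    dx = [deque([0.0] * (adder_stages + 1)) for _ in range(n)]
    dy = [deque([0.0] * adder_stages) for _ in range(n)]
    out = []
    for t in range(len(xs) + (adder_stages + 1) * n + 5):
        x_in = xs[t] if t < len(xs) else 0.0
        y_in = 0.0
        for i in range(n):
            y_out = y_in + weights[i] * x_in   # MAC on arriving values
            dx[i].append(x_in); x_in = dx[i].popleft()
            dy[i].append(y_out); y_in = dy[i].popleft()
        out.append(y_in)
    return out
```

With three adder stages and four weights, the outputs match the direct convolution with a latency of 4 x 3 = 12 cycles, while the array still accepts one input and emits one output per cycle.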

Systolic Arrays without Feedback Cycles
From the previous section we see that both the defective cells in a fault-tolerant array and the pipeline stages in systolic cells can simply be modeled as additional delays in the data paths. Thus, by solving the single problem of how, if possible, to allow for additional delays in systolic designs, we can transform generic systolic designs into fault-tolerant or two-level pipelined designs. A general theory of adding and removing register delays in a system has been proposed by Leiserson and Saxe 18 in the context of optimizing synchronous systems.

The Cut Theorem
We model a systolic array as a directed graph, with the nodes denoting the combinational logic and the edges the communication links 19. The edges are weighted by the number of registers on the links. We say that two designs are equivalent if, given an initial state of one design, there exists for the other design an initial state such that (with the same input from the host, i.e., the outside world) the two designs produce the same output values (although possibly with a constant delay). In other words, as far as the host is concerned, the designs are interchangeable provided the possible differences in the timing of the output are taken into account. We define a cut to be a set of edges that partitions the nodes of the graph into two disjoint sets, the source set and the destination set, with the property that these edges are the only ones connecting nodes in the two sets and are all directed from the source set to the destination set. We say that a systolic design is a "delayed" version of another design if the former differs from the latter only by having additional delays on some of the communication links. Thus the graph representations of the two designs are the same except for the weights on the edges that correspond to the communication links with additional delays.
Theorem 1: (Cut Theorem) For any design, adding the same delay to all the edges in a cut and to the edges pointing from the host to the destination set of the cut results in an equivalent design.
Proof: Let S be the original design, partitioned by a cut into sets A and B, the source and the destination set respectively. Let S' be the same as S (with corresponding sets A' and B'), with the difference that d delays are now added onto the edges in the cut. We will show that, by properly initializing S' (at time t0), the output values from A and A' will be identical and the output values from B' will be the same as those from B, but lagging behind by d clock cycles.
We define the initial state of A' to be identical to the state of A at time t0. Since none of the edges in the cut feed into A', directly or indirectly, the nodes in A' behave exactly the same way as the corresponding ones in A and thus produce the same outputs.

Let r1(e'), ..., rd(e') be the delay registers on any edge e' in the cut, with r1(e') closest to the source node and rd(e') closest to the destination node. First, we assign the initial state of B' to be identical to the state of B at time t0-d. We then initialize the registers r1(e'), ..., rd(e') with the values of the data on the corresponding edge in S at times t0-1, t0-2, ..., t0-d respectively. In this way, the input data received by the nodes in set B' from time t0 to t0+d-1 are identical to those received by B from t0-d to t0-1, and so the configuration of B' at t0+d and that of B at t0 are identical. Since the outputs from A' are the same as those from A, all the inputs arriving at B' from time t0+d onwards are the same as those arriving at B, except that they lag behind by d cycles due to the additional delay registers. Therefore the nodes in B' behave the same way as the corresponding ones in B, with a d-cycle delay. •

We say that a delayed system S' is derivable from S if there exists a set of cuts C1, C2, ..., Cm with cut delays d1, d2, ..., dm such that, for every edge e' in S', the number of additional delays on e' equals the sum of the di over all cuts Ci that contain e'.
Since equivalence is transitive, the cut theorem implies that if a "delayed" design is derivable from the original design, then the two designs are equivalent. Since a cut partitions the nodes of a graph into two sets with data flowing uni-directionally between them, it cannot cross any feedback cycle. On the other hand, for any given edge not in a feedback cycle, we can always construct a cut that contains it. Therefore any number of delays on the data paths in a graph without feedback can always be accommodated, if we have the option of inserting other delays into the system.
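The existence claim in the last sentence can be made constructive: for an edge (u, v) lying on no cycle, take the destination set to be everything reachable from v. In a DAG no edge can then lead back out of the destination set, so the edges entering it form a cut containing (u, v). A small sketch (the graph encoding is ours, not the paper's):

```python
def cut_through_edge(nodes, edges, e):
    """Given a directed graph (edges as (u, v) pairs) and an edge e = (u, v)
    lying on no cycle, return the cut containing e: all edges from the
    source set into the destination set, where the destination set is
    everything reachable from v.  By the cut theorem, adding the same number
    of registers to every edge of this cut (and to host inputs feeding the
    destination set) preserves the design's behavior."""
    adj = {n: [] for n in nodes}
    for u, v in edges:
        adj[u].append(v)
    dest, stack = set(), [e[1]]          # destination set: reachable from v
    while stack:
        n = stack.pop()
        if n not in dest:
            dest.add(n)
            stack.extend(adj[n])
    assert e[0] not in dest, "edge lies on a feedback cycle; no cut contains it"
    return {(u, v) for (u, v) in edges if u not in dest and v in dest}
```

For the diamond graph a -> b -> c -> d with a shortcut a -> c, the cut through (a, b) is {(a, b), (a, c)}: both edges must receive the same added delay for the design to remain equivalent.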

Linear Arrays Without Feedback
We will now apply the above results to the examples we discussed previously. As depicted in

Two-Level Pipelining for Two-Dimensional Systolic Arrays
It is just as simple to apply the cut theorem to two-level pipelined arrays of two dimensions. Consider the example of a hexagonal systolic array that performs band matrix multiplication 20 (Figure 3-2 (a)). Two results follow directly from the cut theorem. First, consider the edges from the multiplier to the adder inside each cell: these edges define a cut, since none of the outputs from the adders are fed back into the multipliers. By the cut theorem, we conclude that these systolic cells can be implemented using pipelined multipliers with any number of stages without any further modification, provided the number of stages is the same for all the multipliers.

Two-Level Pipelining for the FFT Processor Arrays
The cut theorem can be applied to two-level pipelined designs for any processor array without cycles. We consider here as an example the well-known processor array for computing fast Fourier transforms (FFTs). For an n-point FFT, the array has log2 n stages of n/2 processors performing butterfly operations. The data are shuffled between any two consecutive stages according to a certain pattern 21, 22. Figure 3-3 depicts the so-called constant-geometry version of the FFT algorithm (for n = 16), which allows the same pattern of data shuffling to be used for all stages. Figure 3-4 (a) depicts a straightforward processor implementation of the butterfly operation using four multipliers and six adders. The time the processor takes to perform a butterfly operation is the total delay of one multiplier and two adders.
To increase the throughput of butterfly calculations, we implement the processors with pipelined multipliers and adders. Suppose that these functional units each have five pipeline stages, as in the case of some recent floating-point chips 7. By the cut theorem, the pipeline delays on the b_real and b_imag data paths have to be balanced by the same number of delays on the a_real and a_imag input lines. The two-level pipelined design of the processor is shown in Figure 3-4 (b).
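The butterfly and the stage structure can be sketched in a few lines. The recursive decimation-in-time formulation below is for illustration only (it does not model the constant-geometry shuffling of Figure 3-3); the butterfly computes a' = a + w·b and b' = a - w·b, which in hardware takes the four real multipliers and six real adders of Figure 3-4 (a).

```python
import cmath

def butterfly(a, b, w):
    """One FFT butterfly on complex values: a' = a + w*b, b' = a - w*b.
    The complex product w*b costs 4 real multiplies and 2 real adds; the two
    complex additions cost 4 more real adds -- 4 multipliers, 6 adders."""
    t = w * b
    return a + t, a - t

def fft(x):
    """Radix-2 decimation-in-time FFT; len(x) must be a power of two.
    An n-point transform performs log2(n) stages of n/2 butterflies."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)     # twiddle factor
        out[k], out[k + n // 2] = butterfly(even[k], odd[k], w)
    return out
```

For n = 16 this performs 4 stages of 8 butterflies, matching the array dimensions quoted above; the result agrees with the naive DFT to rounding error.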

Systolic Fault-Tolerant Schemes for Two-Dimensional Arrays
Let us consider as an example the rectangular array of Figure 3-5 (a) where the data move forwards and downwards. Among many other applications, this array can perform matrix multiplication with either an operand or the partial result matrix stored in the array during the computation. We will first discuss the constraints that a correct implementation must satisfy and then we will study several redundancy schemes.

The Local Correctness Criterion
By exploiting the regularity of systolic arrays, the following theorem reduces the problem of establishing equivalence between two designs to smaller problems that can be solved using only "local information".
Theorem 2: Let S be a mesh-connected systolic design without feedback and S' a "delayed" version of it. S' is equivalent to S if, for each square of adjacent cells in the grid, the number of delays on each of the two paths joining the two diagonally opposite corners is the same.
Proof: Let V_i and E_i be the nodes and (vertical) edges in the i-th column of grid S'. We form two subgraphs G1' and G2', such that G1' contains all the nodes and edges to the left of the i-th column and G2' contains all those to the right; in addition, they each contain V_i and E_i. We will first show that graph S' is derivable from S if subgraphs G1' and G2' are derivable from the corresponding subgraphs G1 and G2 of S.
Let C be a cut in subgraph G1'. If C does not intersect E_i, all the nodes in V_i must belong to the destination set of the cut. Since there are no direct links between any nodes in the source set and the nodes in S' - G1', C is also a cut in S'. By the same token, any cut in subgraph G2' that does not contain any edges in E_i is a cut in S'.
It is obvious that a cut can have at most one edge in E_i. Suppose the cuts C1 in G1' and C2 in G2' both contain the same edge e in E_i. For both subgraphs, all the nodes in V_i above e belong to the source set, and those below belong to the destination set. We observe that C1 ∪ C2 partitions the nodes of S' into a source set and a destination set, with the former being the union of the source sets of the two subgraphs and the latter the union of their destination sets. Therefore C1 ∪ C2 is a cut in S'. Without loss of generality, let the delay associated with every cut be 1. (A cut with d delays is equivalent to d identical cuts, each with 1 delay.) If G1' and G2' are derivable from G1 and G2 respectively, then for each edge e ∈ E_i with d(e) additional delays, there exist exactly d(e) cuts containing e in each of the two subgraphs. Therefore all the cuts containing edges in E_i in the two subgraphs can be paired up to form cuts in S'. We have already shown that the cuts in the subgraphs that do not contain any edges in E_i are also cuts in S'. Therefore if G1' and G2' are derivable from G1 and G2 respectively, then S' is also derivable from S.
The above result implies that we can cut the grid S' into vertical strips and show that S' is equivalent to S by proving the equivalence of each of the strips. By applying the same argument to the horizontal links, we can further subdivide the strips into squares, each containing only four cells. The equivalence problem is now reduced to solving the equivalence for each of the squares. An edge from each of the two paths connecting the two diagonally opposite corners constitutes a cut. Therefore, if the number of delays on each of the two paths of a square is the same, the square is derivable from its counterpart in S. If this condition holds for each square, then S' is derivable from, and thus equivalent to, S. • The criterion for correctness derived from this theorem is represented graphically in Figure 3-5 (b). The theorem can be generalized to any array where we can find paths that partition the graph representing the array into disjoint subgraphs. For example, in the case of a hexagonal array without feedback cycles (Figure 3-2 (a)), the constraints for equivalence reduce to the local criterion that, for each unit triangle of three adjacent cells, the number of delays on each of the two paths connecting two of the corners of the triangle must be the same.
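The local criterion of Theorem 2 is easy to mechanize for a rectangular grid. In the sketch below (our own encoding), h[i][j] is the number of delays on the edge from cell (i, j) to (i, j+1) and v[i][j] on the edge from (i, j) to (i+1, j):

```python
def locally_correct(h, v):
    """Return True iff every unit square of four adjacent cells carries the
    same total delay along its two corner-to-corner paths (the local
    criterion of Theorem 2).  For an R x C grid with data flowing rightwards
    and downwards, h is R x (C-1) and v is (R-1) x C delay counts."""
    R, C = len(h), len(v[0])
    for i in range(R - 1):
        for j in range(C - 1):
            # right-then-down path vs down-then-right path across the square
            if h[i][j] + v[i][j + 1] != v[i][j] + h[i + 1][j]:
                return False
    return True
```

Adding a delay to a single edge violates the criterion, while adding the same delay to every horizontal edge of one column (a vertical cut) preserves it, as the cut theorem predicts.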

The utilization of live cells in the rectangular systolic array of Figure 3-5 (a) depends on the availability of two hardware resources: delay registers in the live cells and channel width. The results of Section 3.1 imply that if sufficient delay registers are available in the cells, the "systolic" approach can fully utilize all the live cells without any penalty to the throughput of the system. In general, a lower utilization can be expected with a smaller number of delay registers. The other factor that might decrease utilization is the channel width: if there are not sufficient tracks in the channels, we might not be able to implement the desired interconnection.
We have conducted several experiments to study the tradeoff between the utilization of live cells and the required hardware resources. We implemented four heuristic programs modeling different redundancy schemes, and ran Monte Carlo simulations on three different array sizes with cell failure rates ranging from 5% to 65%. The distribution of defects is assumed to be identical for all cell locations on the wafer. The results are summarized in Figures 3-7 (b) and (c). The larger the array size, the more hardware delay registers are needed to achieve the same utilization. This is to be expected, since the set of constraints that must be satisfied by a larger array is a superset of those satisfied by a smaller array. We have to bear in mind, however, that the cells in a larger array are typically smaller and thus have lower failure rates. These experiments give us a general idea of the expected efficiency of the different redundancy schemes using the systolic approach. In-depth studies using a more precise model are necessary to determine the optimal or near-optimal redundancy scheme for any particular application. Probabilistic analyses 16

Systolic Arrays with Feedback Cycles
In this section we will describe a new technique for treating systolic arrays with feedback cycles. Such arrays include systolic designs for LU-decomposition 25, QR-decomposition 26, triangular linear systems 25 and recursive filtering 27.

Computation of Simple Recurrences-An Example of Cyclic Systolic Arrays
To illustrate the basic ideas, we consider the computation of a simple recurrence of size n-1, in which each result depends on the n-1 preceding results. The standard implementation is the 2-slow bi-directional linear array of Figure 4-1 (a), which produces one result every two cycles. A conventional way to provide fault-tolerance in such an array is to let the data pass through an extra register per cell. This is a 4-slow system, performing the same computation as the 2-slow version but at half its throughput. Suppose that the third cell from the left were to fail. The original function of the array could be preserved by simply allowing cells 2 and 4 to communicate through a bypass register (as illustrated in Figure 4-1 (c)). A drawback of this approach is that the performance of the array degrades rapidly with the number of consecutive failed cells that must be tolerated. Note that systolic arrays with feedback cycles are in general 2- or 3-slow to begin with, and in order to tolerate k consecutive failures, the throughput must be decreased by a further factor of k+1.

The recurrence of size n-1 computed by an n-cell bi-directional linear array (illustrated in Figure 4-1 (a)) can also be implemented on an n/2-cell ring with uni-directional data flow (as in Figure 4-2 (a)). The systolic ring works as follows. The n/2 most recently computed results are stored one in each of the n/2 cells, while the next n/2 partial sums travel around the ring to meet these stored values. Every two cycles, a sum is completed and a new computation begins. For example, at time 0 in Figure 4-2 (a), one partial sum is ready to pick up its last term while another is ready for its first term. Each completed value then travels to the cell holding the oldest stored result and replaces it. Like the bi-directional systolic array of Figure 4-1 (a), this systolic ring has a computational rate of one output every two cycles. However, all of its cells are active at all times, so only half as many cells are needed.

Fault-Tolerant Systolic Rings
Systolic rings not only require fewer cells than other designs solving the same problems, they also degrade gracefully as the number of defective cells increases.
Each cell in the systolic ring computes with a stored result for a period of 2n cycles before the result is replaced by a new value. The ring can be unrolled to form a linear array where each cell stores only one result in its whole lifetime, as shown in Figure 4-2 (b). This transformation reduces the ring structure to one without feedback, and thus allows us to analyze its fault-tolerant behavior using the results of the preceding section.

Two-Level Pipelining for Systolic Rings
By going through a similar argument as above for the two-level pipelined array, we can obtain the following result:

Other Examples of Systolic Ring Architectures
We have shown in the previous section that the ring structure is suitable for solving simple recurrences where each result is dependent on a fixed number of previous results. This characterizes many of the problems solved by systolic arrays with feedback. We will describe some of the examples in this section.

Solution of Triangular Linear Systems
Let A = (a_ij) be a nonsingular n×n band lower-triangular matrix with bandwidth q. Suppose that A and an n-vector b = (b1, ..., bn)^T are given. The problem is to solve Ax = b for x = (x1, ..., xn)^T. This can be viewed as a recurrence problem of size q-1. A ring of q/2 cells is sufficient to solve the problem at a throughput of one result every two cycles. By comparison, the previous bi-directional linear systolic array 25 has the same throughput but uses twice as many cells. The ring is also more robust: with k failures in a ring of m cells, the throughput is only reduced from 1/2 to (m-k)/(2m-k).
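The size-(q-1) recurrence is visible directly in forward substitution. The sketch below is a plain sequential reference implementation (our own, not the ring schedule itself); it shows that each x[i] depends only on the q-1 most recent results, which is exactly what lets a ring of q/2 cells keep all the live data in flight.

```python
def band_lower_solve(A, b, q):
    """Solve Ax = b by forward substitution for a nonsingular band
    lower-triangular A with bandwidth q (q nonzero diagonals).  Each x[i]
    depends only on x[i-q+1], ..., x[i-1]: a recurrence of size q-1."""
    n = len(b)
    x = []
    for i in range(n):
        # only the q-1 previous results enter the inner product
        s = sum(A[i][j] * x[j] for j in range(max(0, i - q + 1), i))
        x.append((b[i] - s) / A[i][i])
    return x
```

Precomputing the reciprocals 1/a_ii, as suggested later for the ring cells, would replace the division here by a multiplication.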

Triangularization of a Band Matrix
The usefulness of the systolic ring approach is not limited to linear arrays. Consider the triangularization of a band matrix using the ring structure of Figure 4-7 (b). The parameters needed for performing the elimination (e.g. Givens rotations for QR-decomposition) pass around the ring after they are generated. Suppose p_i is the parameter generated from the element eliminated in row i and the element above it. If a data input a_ij is not on the subdiagonal to be eliminated, it is updated on the arrival of p_i. It stays in the cell for one cycle to compute with p_{i+1} and then moves on to the next ring. If a_ij is to be eliminated, it is combined with the stored value a_{i-1,j} to get p_i, which is then passed down the ring. The output of each ring is the result obtained by eliminating the last subdiagonal of its input array; the uppermost ring outputs the entries of the triangular matrix that we want to compute. Unlike the data values circulating the rings in the previous examples, the p_i are computed before they are passed around. However, they have the same property that they are produced every two cycles and need to meet w-1 input values before they can be discarded. Therefore, from our previous analysis, q rings of w/2 cells each are required for triangularizing a band matrix with bandwidth w and q subdiagonals. This architecture requires about half the hardware of a previous solution for QR-decomposition 26 while achieving the same throughput. An efficient layout of this ring architecture is shown in Figure 4-8 (a); every two consecutive rows correspond to one ring. The analysis of the fault-tolerant behavior of this ring structure is very similar to that of the one-dimensional ring. A system with n rings can be unrolled to form a mesh-connected acyclic array with n cells on one side and an "unbounded" number on the other. The throughput is reduced from 1/2 to n/(2n+k) if k defects are tolerated in each of the n rings in the final array.
Also, by applying Theorem 2, we can simplify the correctness constraints on the final configuration to a local criterion that has to be satisfied by each unit square in the logical grid. This criterion is depicted in Figure 4-8 (b).

Figure 4-10 shows snapshots of this structure at various stages of the computation. Viewing this structure as an array of rings, its performance can be analyzed using the result of Theorem 4 with parameter p = 2. The throughput of this array is the same as that of the previous design 25, which uses, however, three times as many cells.
This two-dimensional ring architecture admits of a surprisingly efficient layout; see Figure 4-11 (b). The numbers on the cells indicate the original rows the cells are in. This layout can be obtained by the following method. Starting with the original architecture (Figure 4-11 (a)), we first bring the top and bottom rows together to get a cylindrical structure. We then expand the space between the rows by one cell's length, so that when we flatten out the cylinder, the consecutive rows on the "front" and "back" surfaces will be interleaved. Before flattening the cylinder, however, we first "twist" it by one cell's length in the direction that shortens the inter-row links.

General Remarks on Systolic Rings
The systolic ring architecture has some disadvantages compared with other systolic architectures, but they are compensated for by its superior fault-tolerance. One possible disadvantage is that we need to provide an additional data path to unload values during the computation, since the computed results are continuously stored in the ring. This is, however, not the case for the triangularization schemes of Section 4.4.2.
In many conventional cyclic algorithms, only one or a few boundary cells require special processing capability and extra input/output bandwidth. With some ring architectures, however, every cell is required to assume the role of a boundary cell. Algorithm-dependent methods can sometimes be used to alleviate the problem of having to provide each cell with special functionality. For instance, in the previous example of solving triangular linear systems, instead of providing each cell with the capability to divide, we can precompute the reciprocals of the diagonal elements.

Summary and Concluding Remarks
The fault-tolerant approach proposed in this paper is tailored to systolic arrays. By using the additional information about systolic data flows we are able to design schemes that are usually more effective than other schemes designed for general processor arrays. Our systolic fault-tolerant scheme has the characteristic that the maximum interconnection length is not increased. This eliminates a source of inefficiency, such as increased system cycle time or driver area, common to most other approaches.
For uni-directional linear arrays, our systolic fault-tolerant technique achieves 100% utilization of live cells without extra registers or interconnection links. For two-dimensional arrays without feedback cycles, the utilization of live cells on a wafer increases with the number of redundant channels and delay registers available in the cells. The number of delay registers needed to achieve a given utilization also increases with the cell failure rate and the size of the original array on the wafer. Our empirical studies indicate that for a wafer with n×n cells, approximately n delay registers per cell are needed to achieve 100% utilization.
Although many systolic algorithms with feedback have been proposed, some of the problems that these algorithms address can also be solved by systolic arrays without feedback. Examples of such problems include convolution, graph connectivity and graph transitive closure 9, 29, 30. Acyclic implementations usually exhibit more favorable characteristics with respect to fault-tolerance, two-level pipelining, and problem decomposition in general.
For problems that have so far been solved only by systolic arrays with feedback cycles, this paper introduces a new class of systolic algorithms based on a ring architecture. These systolic rings have the property that their throughput degrades gracefully as the number of failed cells increases. Furthermore, as a byproduct of the ring architecture approach, we have derived several new systolic algorithms which require only one-third to one-half of the cells used in previous designs while achieving the same throughput. We have shown that the two-level pipelining problem in systolic arrays can be solved by the same techniques used to solve the fault-tolerance problem. An important task left for the future is the development of software to solve both problems automatically.