Reliability Modeling of Compensating Module Failures in Majority Voted Redundancy

The classical reliability model for N-modular redundancy (NMR) assumes the network to be failed when a majority of modules which drive the same voter fail. It has long been known that this model is pessimistic since there are instances, termed compensating module failures, where a majority of the modules fail but the network is nonfailed. A different module reliability model based on lead reliability is proposed which has the classical NMR reliability model as a special case. Recent results from the area of test generation are employed to simplify the module reliability calculation under the lead reliability model. First a fault equivalent technique, based on functional equivalence of faults, is developed to determine the effect of compensating module failures on system reliability. This technique can increase the predicted mission time (the time the system is to operate at or above a given reliability) by at least 40 percent over the classical model prediction for simple networks. Since the fault equivalent technique is too complex for modeling of large circuits, a second, computationally simpler technique, based on fault dominance, is derived. It is then shown to yield results comparable to the fault equivalent technique. A more complex example circuit analyzed by the fault dominance model shows at least a 75 percent improvement in mission time due to modeling compensating module failures. A commercially available 31-gate integrated circuit chip is also modeled to demonstrate the applicability of the technique to large circuits.


INTRODUCTION
New system designs for reliable computers must be explored to meet the increasing demand for reliable computing systems. One important method of predicting the performance of a system is the modeling of the system reliability.
Modeling requires a mathematical or physical representation which incorporates the salient parameters of the modeled system [1]. A model is an incomplete representation of the subject under study.
To be of value, the modeling technique must be convenient to apply and must successfully predict the behavior of the subject under various parameter changes. If a reliability model is accurate, then insights can be gained as to how the system reliability changes as a function of the design parameters.
An exact method to model the effect of a majority of failed modules in the N-modular redundancy (NMR) scheme is presented and shown to increase the predicted mission time (the time for which the system is to operate at or above a given reliability) over that of the classical reliability model by at least 40 percent. This exact method is too complex to apply to a large circuit. Thus a second and computationally simpler method is developed and shown to yield a predicted mission time within 10 percent of the exact model for example systems.

CLASSICAL NMR RELIABILITY MODEL
NMR [2] is implemented by dividing the nonredundant network into modules, replicating the modules N times (where N = 2t + 1 and t is an integer), and inserting a majority gate between each set of replicated modules. Figure 1 depicts the implementation of a triple modular redundancy (TMR) version of a portion of a nonredundant network consisting of a two input, single output module. TMR will be the major topic of discussion, although the procedures presented have straightforward applications to the general case of NMR.
Classically the reliability of the network in Figure 1 is modeled by assigning the modules a reliability function, call it $R_m(t)$, or $R_m$ with time as an understood variable. The probability of module failure is thus $1 - R_m$. It is then assumed that the system fails when two or more modules driving the same voter, say voter A in Figure 1, fail. The classical reliability model is:
$$R_{TMR} = R_m^3 + 3R_m^2(1 - R_m) \tag{1}$$
The effect of nonperfect voters can readily be incorporated into (1) if voters are assigned to module inputs [3,4,5]. Since each voter drives exactly one module input, a voter failure has the same effect as a module failure. If $R_v$ is the voter reliability, then the effective module reliability in (1) becomes $R_v^2 R_m$. Networks that do not have a voter for every module input can be modeled by more complex techniques [6]. However, for the present discussion we will assume every module input has an associated voter. Further, the discussion will center on applying the modeling techniques to modules only. Nonperfect voters can easily be modeled by applying the techniques to modules and their associated input voters.
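As a numerical illustration of the classical model, the following Python sketch evaluates the TMR reliability under the stated failure criterion (the network is counted as failed once two or more of the three modules driving a voter have failed). The function names, the exponential law $R_m = e^{-\lambda t}$ used to obtain a mission time, and the bisection search are assumptions made for illustration only; the voter adjustment assumes the two input module of Figure 1, with one voter charged to each input.

```python
import math


def tmr_classical(r_m: float) -> float:
    """Classical TMR reliability: all three modules good, or exactly one failed."""
    return r_m**3 + 3 * r_m**2 * (1 - r_m)


def effective_module_reliability(r_m: float, r_v: float, n_inputs: int = 2) -> float:
    """Module reliability with one (possibly imperfect) voter on each module input."""
    return (r_v**n_inputs) * r_m


def mission_time(target: float, lam: float = 1.0, t_hi: float = 10.0) -> float:
    """Largest t in [0, t_hi] with tmr_classical(exp(-lam * t)) >= target.

    Assumes an exponential module failure law (an illustrative assumption),
    so the TMR reliability decreases monotonically with t and bisection applies.
    """
    lo, hi = 0.0, t_hi
    for _ in range(100):
        mid = (lo + hi) / 2
        if tmr_classical(math.exp(-lam * mid)) >= target:
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    for r_m in (0.99, 0.9, 0.5):
        print(r_m, tmr_classical(r_m))
    print("with voters:", tmr_classical(effective_module_reliability(0.9, 0.99)))
    print("mission time for R >= 0.9 (lam = 1):", mission_time(0.9))
```

Any model that credits compensating failures adds nonnegative terms to `tmr_classical`, so a mission time computed this way is a lower bound on what such a model would predict for the same hardware.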
Equation (1) is pessimistic since there are many cases in which a majority of the modules are failed yet the network of Figure 1 is not failed. For example, consider two failed modules for the network of Figure 1. Assume module one has a permanent logical one on its output while module three has a permanent logical zero on its output. The network will still realize its designed function. Such multiple module failures which do not lead to network failures will be termed compensating module failures.
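The example above can be checked exhaustively. The sketch below is illustrative only: it uses a two input NAND gate as a stand-in for the module (the text does not fix the module function at this point) and models the voter as a three input majority function.

```python
from itertools import product


def majority(a: int, b: int, c: int) -> int:
    """Three input majority voter: output is 1 when at least two inputs are 1."""
    return 1 if a + b + c >= 2 else 0


def nand(x: int, y: int) -> int:
    """Stand-in module function (an assumption made for illustration)."""
    return 1 - (x & y)


def is_compensating(module, stuck_one: int, stuck_three: int) -> bool:
    """True if the voted output is correct for every input when the outputs of
    modules one and three are stuck at the given values and module two is good."""
    return all(
        majority(stuck_one, module(x, y), stuck_three) == module(x, y)
        for x, y in product((0, 1), repeat=2)
    )


if __name__ == "__main__":
    # Module one stuck at 1, module three stuck at 0: the failures compensate.
    print(is_compensating(nand, 1, 0))
    # Both failed modules stuck at 1: the voter is wrong whenever NAND should be 0.
    print(is_compensating(nand, 1, 1))
```

With opposite stuck values the two failed modules cancel at the voter, so the fault-free module always decides the vote; with equal stuck values they outvote it.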
Adding these double, and even triple, module failure cases can often lead to a substantially higher predicted reliability than the classical reliability model. With a better reliability model some systems previously designed may be found to be overdesigned for their specific mission because an inadequate reliability model was used.

MODELING COMPENSATING FAILURES
In the literature, equation (1) is sometimes rewritten, as equation (2), to take into account the cases where two modules can fail so as to have compensating effects at the voter. The K in (2) is a probability formed by the ratio of the number of ways in which compensating failures can occur to the number of ways any failure can occur. K has often been taken as 1/2 [7].
An alternative model for compensating failures that has appeared in the literature [7] expresses $R_{TMR}$ as a triple summation over $m$, $n$, and $r$ of terms in $K_{mnr}$ and $P_{mnr}$ (equation (3)), where the module failures follow the Poisson assumption; there are $K_{mnr}$ ways of designating which of the three modules have m, n, and r failures respectively; and $P_{mnr}$ is the probability that the system operates correctly with m failures in one module, n failures in another, and r failures in the other.
Equation (3) can be rewritten as
$$R_{TMR} = R_m^3 + 3R_m^2(1 - R_m) + R_{Two} + R_{Three} \tag{4}$$
where $R_{Two}$ and $R_{Three}$ are the contributions to the system reliability from compensating failures in two and three modules, respectively.

... heavily on the logical stuck-at-fault model [8]. This model assumes that most or all failures of interest in a logic circuit manifest themselves as some line in the circuit taking on a constant logical value, either one or zero. Now that an algebraic structure which applies to the behavior of networks in the presence of stuck-at faults has been developed [8,11,12], the tools are available to formulate and analyze a new module reliability model.
The new model will assign a reliability function to each lead in the network rather than to each module as in the classical model. Lead reliability will be represented by $R$ and the probability of lead failure by $1 - R$.
Much has been written in defense of the stuck-at failure model [8] but a few words will now be devoted to justification of the lead reliability model. In one study of IC failure mechanisms [9] it was found that about 84 percent of the IC failures were directly related to lead failures, either of input leads or of metalization on the chip itself.
Similar to the classical model assumption that module failures are statistically independent events, it will also be assumed that lead failures are statistically independent. A further advantage of the lead reliability model is that it takes into account the increased number of interconnections required for the massive redundancy version of a nonredundant system. Wiring errors and off-chip interconnections then may be the major source of failures.
It will be assumed that the stuck-at-one (s-a-1) faults are as likely to occur as the stuck-at-zero (s-a-0) faults. Hence, the probability that a lead is s-a-1 is:
$$P(\text{s-a-1}) = P(\text{lead failure}) \, P(\text{s-a-1} \mid \text{lead failure})$$
$$P(\text{s-a-1}) = (1 - R) \cdot \tfrac{1}{2}$$

FAULT EQUIVALENT RELIABILITY MODEL
Using the above model for module failure it is now possible to calculate $R_{Two}$. First, a few assumptions are necessary. The modules will be assumed to consist of irredundant combinational logic [8] so that any single internal module fault will cause an improper output for at least one set of inputs. It will also be assumed that the system has failed as soon as it is possible for the voter to give a wrong response to any possible input combination. This excludes the situations where a module trio fails but subsequent faults within the module trio restore proper behavior.
To model the faulty modules we will adopt the notation developed in [8]. We will now illustrate the evaluation of $R_{Two}$, i.e., the case of two faulty modules, for a simple module.
1) Transform the logical circuit into the corresponding logical model [8].
Consider Figure 2(a) where the module under study is a single two input NAND gate. The logical model is a directed graph shown in Figure 2.

The development of step 4) is best given by an example. The equivalence class matrix for the NAND gate of Figure 2 is derived from the entries in Table 1(a) and is shown in Table 2.
For our example of two failed NAND modules (6) becomes:

3) Enumerate the supplementary classes.
For the case of two module failures one of the modules will be the fault free function.
The majority gate can be considered to be a threshold gate with input weights 1 and threshold of 2 [10].
In Table 1 Table 1(c) for our example.
In the last step a matrix E is used to actually evaluate $R_{Two}$. Element $E_{ij}$ of equivalence class matrix E is the number of faults in equivalence class j (the equivalence classes were assigned numbers under step 2) which are a result of i leads in a module failing, where i is termed the fault multiplicity.
4) Form the term for two faulty modules by use of the equivalence class matrix E and the equation: ...

... (1) will now be undertaken.

COMPARISON OF FAULT EQUIVALENT AND CLASSICAL RELIABILITY MODELS
If there are p leads in a module, then the module reliability, $R_m$, according to the fault equivalent model just presented is $R^p$. For the case of fewer than half the modules failing in an NMR network, the classical reliability model gives a reliability of:
$$R_{NMR} = \sum_{i=0}^{\lfloor N/2 \rfloor} \binom{N}{i} R_m^{N-i} (1 - R_m)^i \tag{8}$$
It will now be shown that the first $\lfloor N/2 \rfloor + 1$ terms of the fault equivalent reliability model are identical to (8), the classical NMR reliability model. The fault equivalent failure probability for a single module is given by (9). The number of faults of multiplicity $k$, taken over all equivalence classes, is $2^k \binom{p}{k}$ since there are $\binom{p}{k}$ ways to select k failed leads from p. Each failed lead may be in one of two failure modes, s-a-1 or s-a-0, which accounts for the $2^k$. Hence (9) becomes:
$$\sum_{k=1}^{p} \binom{p}{k} (1 - R)^k R^{p-k} = 1 - R^p$$
... Table 3.
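The counting argument above can be checked numerically. The sketch below is an illustration under the stated assumptions (statistically independent lead failures, each failed lead s-a-1 or s-a-0 with probability $(1 - R)/2$); the function names are hypothetical. It sums the probability of every multiple stuck-at fault in a module with p leads, compares the total with $1 - R^p$, and evaluates the classical NMR sum over at most $\lfloor N/2 \rfloor$ failed modules.

```python
from math import comb


def module_failure_probability(r: float, p: int) -> float:
    """Sum, over fault multiplicity k, of (number of faults) x P(one such fault)."""
    total = 0.0
    for k in range(1, p + 1):
        n_faults = 2**k * comb(p, k)                   # choose k leads, each s-a-0 or s-a-1
        p_fault = ((1 - r) / 2) ** k * r ** (p - k)    # those k leads stuck as given, rest good
        total += n_faults * p_fault
    return total


def nmr_classical(r_m: float, n: int) -> float:
    """Classical NMR reliability: at most floor(n/2) of the n modules have failed."""
    return sum(
        comb(n, i) * r_m ** (n - i) * (1 - r_m) ** i
        for i in range(n // 2 + 1)
    )


if __name__ == "__main__":
    r, p = 0.95, 3
    print(module_failure_probability(r, p), 1 - r**p)  # the two values agree
    print(nmr_classical(r**p, 3))                      # TMR with R_m = R^p
```

Because the per-module failure probability collapses to $1 - R^p$, substituting $R_m = R^p$ into the classical sum reproduces the terms in which fewer than half the modules are failed.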
It is important to note that the apparent improvement in mission time is not due to any change in the modeled hardware system but rather to a more accurate reliability model.

The term $R_{Two}$ (with one lead failure in each of two modules), assuming tree structured modules, will now be derived. Approximations for modules with reconvergent fan-out will also be given.
The modules adhere to the same assumptions as the fault equivalent reliability model. The following definitions will be required:

The proof is by contradiction. Assume $f_1 \sim f_2$ but $T_1 \cap T_2 \neq \phi$. Take the case $T_1 \cap T_2 = t_i$. Since $t_i$ is a test for $f_1$, $F(f_1) \oplus F = 1$ for the input $t_i$. Likewise $t_i$ is a test for $f_2$ and $F(f_2) \oplus F = 1$. Hence, for the input $t_i$, the majority $(F(f_1), F(f_2), F) \neq F$ and $f_1 \not\sim f_2$. The original assumption is contradicted and therefore if $f_1 \sim f_2$ then $T_1 \cap T_2 = \phi$. This completes the proof.
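The relationship between disjoint test sets and correct voter output can be illustrated by brute force on a single two input NAND gate. The sketch below is not the derivation used in the text: the lead names `a`, `b`, `z`, the helper names, and the choice to count ordered pairs of single lead faults (one fault in each of two modules, the third module fault free) are assumptions made for illustration.

```python
from itertools import product

LEADS = ("a", "b", "z")                      # hypothetical names: two inputs and the output
INPUTS = list(product((0, 1), repeat=2))
FAULTS = [(lead, value) for lead in LEADS for value in (0, 1)]   # six single stuck-at faults


def nand_with_fault(x, y, fault=None):
    """Evaluate the NAND gate; `fault` is (lead, stuck_value) or None for fault free."""
    if fault is not None:
        lead, value = fault
        if lead == "a":
            x = value
        elif lead == "b":
            y = value
    z = 1 - (x & y)
    if fault is not None and fault[0] == "z":
        z = fault[1]
    return z


def fault_test_set(fault):
    """Inputs on which the faulty gate output differs from the fault-free output."""
    return {(x, y) for x, y in INPUTS
            if nand_with_fault(x, y, fault) != nand_with_fault(x, y)}


def majority(a, b, c):
    return 1 if a + b + c >= 2 else 0


def voter_correct(f1, f2):
    """True if two faulty modules plus one good module still vote correctly on every input."""
    return all(
        majority(nand_with_fault(x, y, f1), nand_with_fault(x, y, f2),
                 nand_with_fault(x, y)) == nand_with_fault(x, y)
        for x, y in INPUTS
    )


pairs = list(product(FAULTS, repeat=2))      # ordered: first fault in one module, second in another
supplementary = [(f1, f2) for f1, f2 in pairs
                 if fault_test_set(f1).isdisjoint(fault_test_set(f2))]

# Disjoint test sets coincide exactly with a correct voter output for this gate.
assert all(voter_correct(f1, f2) == fault_test_set(f1).isdisjoint(fault_test_set(f2))
           for f1, f2 in pairs)

if __name__ == "__main__":
    print(len(supplementary))                # prints 20
```

Under this counting convention the NAND gate has 20 ordered pairs of compensating single lead faults, which coincides with the value of 20 that the text quotes for $S_2$ for the NAND gate of Figure 2.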
There are two corollaries to Theorem 2 which will prove helpful.

The modules will also be assumed to be composed of elementary gates: Invert, AND, OR, NAND, and NOR, as depicted in Figure 3. The equivalent and dominating fault class structure, as developed in [8,11,12], is also depicted. Consider the two input AND gate in ...

Definition 4:
A lead y is a successor of a lead x iff every path from x to the circuit output also passes through y.

Definition 5:
A lead x is a predecessor of a lead y iff y is a successor of x.
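Definitions 4 and 5 can be made concrete with a small path enumeration. In the sketch below the circuit representation (a dictionary mapping each lead to the leads it feeds), the lead names, and the two example circuits are hypothetical. It also assumes that a lead is not counted as its own successor and that the circuit output lead counts as a successor of every lead that reaches it.

```python
def all_paths(fanout, lead, output):
    """Every path, as a list of leads, from `lead` to the circuit output."""
    if lead == output:
        return [[lead]]
    paths = []
    for nxt in fanout.get(lead, []):
        for rest in all_paths(fanout, nxt, output):
            paths.append([lead] + rest)
    return paths


def successors(fanout, x, output):
    """Leads (other than x) lying on every path from x to the output (Definition 4)."""
    paths = all_paths(fanout, x, output)
    if not paths:
        return set()
    common = set(paths[0])
    for path in paths[1:]:
        common &= set(path)
    return common - {x}


def predecessors(fanout, y, output):
    """Leads x for which y is a successor of x (Definition 5)."""
    leads = set(fanout) | {n for outs in fanout.values() for n in outs}
    return {x for x in leads if y in successors(fanout, x, output)}


# Hypothetical tree circuit: a, b feed a gate with output c; c, d feed a gate with output e.
TREE = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"]}
# Hypothetical reconvergent fan-out: s fans out to p and q, which reconverge at output o.
FANOUT = {"s": ["p", "q"], "p": ["o"], "q": ["o"]}

if __name__ == "__main__":
    print(successors(TREE, "a", "e"))       # c and e lie on the only path from a
    print(predecessors(TREE, "c", "e"))     # a and b
    print(successors(FANOUT, "s", "o"))     # only o: neither p nor q is on every path
```

In a tree structured module every lead has exactly one path to the output, so each lead on that path is a successor; with fan-out, a branch lead is a successor only if all paths pass through it.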
In Figure 4(a) lead y is a successor of lead x and x is a predecessor of y.

Within each tree of the fault class structure any fault on lead x will either dominate or be equivalent to any predecessor node because of the transitivity of the dominance and equivalence relationships [11,12]. Hence there can be no supplementary failures on the $n_{pre}$ predecessor leads since a test set for a fault on a predecessor lead will not be disjoint from the test set for a fault on lead x. Similarly, faults on successor leads to x cannot be supplementary to x since they will either dominate or be ...

Note that for the NAND gate of Figure 2 $A = 1$ and (13) yields $S_2 = 20$. Hence (17) agrees with the first term of equation (7). Similar correspondences have been made between the fault equivalence and fault dominance models for tree structured modules.

The fault dominance model can also be used for circuits with reconvergent fan-out. The circuit is modeled by an identical circuit with fan-out points removed. Figure 5(a) depicts an exclusive-OR circuit and Figure 5(b) the tree circuit used to calculate the supplementary faults. This value for $S_2$ will be a lower bound since not all faults are considered. All supplementary faults in the reduced circuit will also be supplementary in the original circuit since fan-out points only restrict the values of test sets. Gate inputs are no longer independent. Table 5 depicts the mission time improvement for the exclusive-OR and a SN74147 10-line-to-4-line priority encoder chip. The latter consisted of 31 gates and 70 leads. The value of $S_2$ was calculated by inspection. The number of non-successor and non-predecessor leads for each lead, taken over all leads, took less than 10 minutes to determine by hand. This illustrates the applicability of the technique to large circuits. Equation (17) can now be amended to ...

Figure 5. An exclusive-OR circuit (a) and the modeled circuit with fan-out removed (b).