A parallel algorithm for constructing binary decision diagrams

A parallel algorithm for constructing binary decision diagrams is described. The algorithm treats binary decision graphs as minimal finite automata. The automaton for a Boolean function with AND (OR) as its main operation is obtained by forming the intersection (union) of the regular sets associated with its operands. The union and intersection operations are implemented by a product construction on the minimal automata for the regular sets. After each product construction step the automaton must be reminimized. The parallel algorithm is designed so that the minimal representations for several Boolean operations can be found in parallel. The level of each operation is determined. Operations at the same level can be performed in parallel without any communication between processors. If there are relatively few operations in one level, then the product generation step is divided into several suboperations and the results are merged.


Introduction
The ordered binary decision diagram [2] is an acyclic graph representation for Boolean functions. Because this representation provides a canonical form (i.e. two functions are logically equivalent if and only if they have the same form) and is quite succinct in most cases, it has become widely used in CAD applications. However, the construction of binary decision diagrams for certain large or particularly complex Boolean functions can be very time consuming. Consequently, it is important to find ways of speeding up the construction process. This paper describes a parallel algorithm for this task. The algorithm has been implemented on a 16-processor Encore Multimax and tested on several standard examples.
Our approach to binary decision diagrams uses some simple ideas from finite automata theory. An $n$-argument Boolean function can be identified with the set of Boolean vectors that make it true. For example, the function denoted by the Boolean expression $x_1 \cdot x_2 + \neg x_2 \cdot x_3$ is uniquely determined by the set of vectors $\{(1,1,0), (1,1,1), (0,0,1), (1,0,1)\}$. The corresponding set of strings $\{110, 111, 001, 101\}$ is a finite language. Since all finite languages are regular, there is a minimal finite automaton that accepts this set. This automaton provides a canonical representation for the original Boolean function. Logical operations on Boolean functions can be implemented by set operations on the languages accepted by the finite automata: AND corresponds to set intersection, OR corresponds to set union, and NOT corresponds to set difference ((the universal set) - (the specified set)). Standard constructions from elementary automata theory can be used to build the binary decision diagram for a composite Boolean function from the decision diagrams for the atomic proposition symbols in the formula.
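The identification of a Boolean function with a finite language can be checked directly on the example above. The following is a minimal Python sketch, not part of the original program; the helper name `language` and the function `example` are illustrative assumptions.

```python
from itertools import product

def language(f, n):
    """Return the set of length-n bit strings on which f evaluates to True."""
    return {
        "".join(str(b) for b in bits)
        for bits in product((0, 1), repeat=n)
        if f(*bits)
    }

# The example expression x1*x2 + (not x2)*x3 from the text.
def example(x1, x2, x3):
    return bool((x1 and x2) or ((not x2) and x3))

strings = language(example, 3)
print(sorted(strings))
```

With this view, AND, OR and NOT become intersection, union and complement on the resulting Python sets.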
There are several (relatively minor) differences between our notion of a binary decision diagram and the one given in [2]. In the sequential case these differences should have little effect on the complexity of either algorithm. For example, in our scheme, it is unnecessary to label the nodes of the graph with information about the corresponding Boolean variable. The depth of the node in the graph uniquely determines its label.
We believe that there are several important reasons for viewing binary decision diagrams as automata. Minimization of finite automata is a well-understood task for which good algorithms are available. In fact, many of the important properties of binary decision diagrams follow directly from properties of the minimization procedure. A typical example is the normal form property (the proof of this property in Bryant's paper is not so straightforward). Moreover, powerful techniques from Automata and Formal Language Theory can be used to investigate questions such as which properties of a Boolean function determine the size of its binary decision diagram. We have obtained some results of this type that we hope to present in a future paper.
In the construction of a binary decision diagram corresponding to a Boolean function, a parse tree of the function is used, where leaf nodes correspond to input variables, and non-leaf nodes correspond to Boolean operations. The level of each node is defined from leaf nodes to the top of the tree, and operations at the same level are performed in parallel. If there are only a few operations in some level, these operations are divided into several sub-operations to extract additional parallelism.
Our paper is organized as follows: In Section 2, we review some basic terminology on finite automata and binary decision diagrams. Section 3 describes the implementation of Boolean operations as operations on finite automata. Section 4 describes the algorithm for building the product automaton and minimizing it. Section 5 describes the parallel algorithm and gives performance statistics for a number of examples. Section 6 presents a method for constructing BDD's with a large number of nodes. The paper concludes in Section 7 with a summary and discussion of some directions for future research.

Finite Automata and Binary Decision Diagrams
We start with some simple definitions dealing with finite automata and binary decision diagrams. A string is a sequence of symbols over some alphabet $\Sigma$. In this paper, the alphabet will always be $\Sigma = \{0,1\}$, where 0 represents False and 1 represents True. For example, 110 and 111 are strings. The length of a string is the number of symbols in the string. Thus the length of 110 is 3.
A finite automaton $M$ is a 5-tuple $(Q, \Sigma, \delta, q_0, F)$, where $Q$ is a finite set of states, $\Sigma$ is the alphabet for strings, $\delta$ is the state transition function from $Q \times \Sigma$ to $Q$, $q_0$ is the initial state in $Q$, and $F$ is a set of final states in $Q$. $M$ accepts a string $a_1 a_2 \ldots a_n$, where each $a_i \in \Sigma$, if and only if there exists a sequence of states $q_0, q_1, \ldots, q_n$ such that $q_i = \delta(q_{i-1}, a_i)$ and $q_n \in F$. The set of strings accepted by $M$ is called the language of $M$ and will be denoted by $L(M)$. For example, $M = (\{q_0, q_1, q_2, q_3, q_4, q_5, \bot\}, \{0,1\}, \delta, q_0, \{q_5\})$ accepts $\{010, 110, 111\}$, where $\delta$ is defined as $\delta(q_0,0) = q_1$, $\delta(q_0,1) = q_2$, $\delta(q_1,0) = \bot$, $\delta(q_1,1) = q_3$, $\delta(q_2,0) = \bot$, $\delta(q_2,1) = q_4$, $\delta(q_3,0) = q_5$, $\delta(q_3,1) = \bot$, $\delta(q_4,0) = \delta(q_4,1) = q_5$, $\delta(q_5,0) = \delta(q_5,1) = \bot$, and $\delta(\bot,0) = \delta(\bot,1) = \bot$. $\bot$ is called a sink state. The representation of $\delta$ as a directed graph is shown in Figure 1. Note that the graph is acyclic; this will be true for all of the automata that we consider in this paper. The sink state is not shown in the figure for simplicity. In the following, the sink state may not be mentioned explicitly, but its existence is always assumed.
A Boolean function $f$ with $n$ variables is a function from $\{0,1\}^n$ to $\{0,1\}$. For example, $f(x_1, x_2, x_3) = \begin{cases} 1, & \text{if } (x_1, x_2, x_3) \text{ is } (0,1,0),\ (1,1,0) \text{ or } (1,1,1); \\ 0, & \text{otherwise} \end{cases}$ is a Boolean function with three Boolean variables. The value of the function could, of course, also be given by a Boolean expression $f(x_1, x_2, x_3) = (\neg x_1 \wedge x_2 \wedge \neg x_3) \vee (x_1 \wedge x_2)$. Observe that the set of triples in the domain where $f$ has value 1 (i.e. $f^{-1}(1)$) is the same as the language that is accepted by the finite automaton in the previous example.
In general, the set of elements in $\{0,1\}^n$ for which $f$ is 1 can be used to represent $f$. If we associate the $n$-tuple $(a_1, a_2, \ldots, a_n)$ with the string $a_1 a_2 \ldots a_n$, then each set of $n$-tuples from $\{0,1\}^n$ will correspond to a set of strings over $\Sigma = \{0,1\}$ with length $n$. This correspondence allows us to associate a finite language contained in $\Sigma^n = \{0,1\}^n$ with each $n$-variable Boolean function. Since all finite languages are regular, it follows from the correspondence between regular languages and finite automata that each such language is accepted by some finite automaton. The minimal finite automaton corresponding to the Boolean function $f$ provides a canonical form for $f$: two $n$-variable Boolean functions will have the same minimal automaton if and only if they are logically equivalent. Since each node in the state-transition graph for a Boolean function will have at most two successors (one for each symbol of $\Sigma$), we can view this graph as a binary decision diagram for the function.
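The acceptance condition can be illustrated by simulating the six-state example automaton of this section. This is a minimal Python sketch; the string state names, the `SINK` marker and the function `accepts` are assumptions made for illustration, and any transition missing from the table is taken to lead to the sink.

```python
SINK = "sink"

# Transition table of the example automaton; missing entries go to the sink.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q2",
    ("q1", "1"): "q3",
    ("q2", "1"): "q4",
    ("q3", "0"): "q5",
    ("q4", "0"): "q5", ("q4", "1"): "q5",
}
FINAL = {"q5"}

def accepts(word, delta=DELTA, start="q0", final=FINAL):
    """Follow one transition per symbol and test whether the last state is final."""
    state = start
    for symbol in word:
        state = delta.get((state, symbol), SINK)
    return state in final

# Try all eight strings of length 3.
accepted = {w for w in (format(i, "03b") for i in range(8)) if accepts(w)}
print(sorted(accepted))
```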
We illustrate these ideas by giving the finite automata and binary decision diagrams for some simple $n$-variable Boolean functions. First, we consider the function $f_U$ which is identically 1 for all possible values of its arguments, i.e. $f_U(x_1, \ldots, x_n) = 1$ for all values of $x_1, \ldots, x_n$. The language corresponding to $f_U$ consists of all strings of length $n$ over the alphabet $\Sigma = \{0,1\}$, and is accepted by a finite automaton $M_U = (\{q_{0U}, \ldots, q_{nU}\}, \{0,1\}, \delta_U, q_{0U}, \{q_{nU}\})$, where $\delta_U$ is defined as $\delta_U(q_{iU}, 0) = \delta_U(q_{iU}, 1) = q_{(i+1)U}$ for $0 \le i < n$. The binary decision diagram is shown in Figure 2.
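As a small illustration, the automaton for the constant-1 function can be built as a chain of $n + 1$ states. The sketch below is in Python; the integer state numbering and the names `all_strings_automaton` and `accepts_all` are assumptions, with state `i` standing for $q_{iU}$ and `None` standing for the sink.

```python
def all_strings_automaton(n):
    """Chain 0 -> 1 -> ... -> n in which both edges of state i lead to i + 1."""
    delta = {(i, a): i + 1 for i in range(n) for a in "01"}
    return delta, 0, {n}

def accepts_all(machine, word):
    """Run the chain automaton; a missing transition means the sink was reached."""
    delta, state, final = machine
    for a in word:
        state = delta.get((state, a))
        if state is None:
            return False
    return state in final

m_u = all_strings_automaton(3)
print(all(accepts_all(m_u, format(i, "03b")) for i in range(8)))
```

Strings that are shorter or longer than $n$ are rejected, since only state $n$ is final and it has no outgoing edges other than to the sink.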
Any Boolean function can be described using the above functions and Boolean operations, and a BDD corresponding to any Boolean function can be constructed from the above BDD's and operations on BDD's corresponding to Boolean operations.

Implementing Boolean Operations on Binary Decision Diagrams
Let $M_1 = (Q_1, \{0,1\}, \delta_1, q_0^1, F_1)$ and $M_2 = (Q_2, \{0,1\}, \delta_2, q_0^2, F_2)$ be the binary decision diagrams for two $n$-variable Boolean functions $f_1$ and $f_2$, let $\bot_1$ be the sink state in $Q_1$, and let $\bot_2$ be the sink state in $Q_2$. We will show how simple automata theoretic constructions can be used to find the binary decision diagrams for various combinations of $f_1$ and $f_2$ involving the Boolean operations AND ($\wedge$), OR ($\vee$), NOT ($\neg$), and EXOR ($\oplus$).
We consider the AND operation first. The set of strings over $\{0,1\}$ that satisfy $f_1 \wedge f_2$ corresponds to the intersection of the sets accepted by $M_1$ and $M_2$. The standard construction of a finite automaton $M$ that accepts the intersection of $L(M_1)$ and $L(M_2)$ may be used in this case: $M = (Q_1 \times Q_2 \cup \{\bot\}, \{0,1\}, \delta_\wedge, (q_0^1, q_0^2), F_1 \times F_2)$, where $\bot$ denotes the sink state for the product automaton and $\delta_\wedge$ is the transition function of the product, computed from the successor pairs $(\delta_1(q_1, a), \delta_2(q_2, a))$.
The OR operation is similar. The OR of two Boolean functions represented by $M_1$ and $M_2$ corresponds to the union of the sets accepted by $M_1$ and $M_2$. The standard construction for such an $M$ can also be used in this case: $M = (Q_1 \times Q_2 \cup \{\bot\}, \{0,1\}, \delta_\vee, (q_0^1, q_0^2), (F_1 \times Q_2) \cup (Q_1 \times F_2))$, where $\delta_\vee$ is the corresponding product transition function.
The NOT operator corresponds to the set difference. Let $U$ be the set of all strings with length $n$; then $U - L(M_1)$ corresponds to the negation of the Boolean function represented by $M_1$. A finite automaton accepting $U - L(M_1)$ can be constructed from $M_U = (Q_U, \{0,1\}, \delta_U, q_{0U}, F_U)$ and $M_1$ by a product construction whose transition function is defined in the same manner as for the OR operation.
The EXOR operator $\oplus$ is also similar to the OR operator. The finite automaton for this operator is again a product automaton $M$, where $\delta_\oplus$ is defined in the same manner as for the OR operation.
Note that determining the state set of the finite automaton for each of these four operators involves a product construction $M_1 \times M_2$. We exploit this observation by giving a single procedure for the product construction that is parameterized by the type of Boolean operator involved. Also note that in each case the resulting automaton $M$ may not be minimal, even if both $M_1$ and $M_2$ are minimal. Consequently, a final minimization stage is needed after the product construction in order to obtain a canonical binary decision diagram.
Because of our convention regarding final states, a binary decision diagram $M = (Q, \{0,1\}, \delta, q_0, F)$ may be represented by its state-transition graph alone. In particular, two edges emanate from each state $q$: a 0-edge pointing to $\delta(q, 0)$ and a 1-edge pointing to $\delta(q, 1)$. In generating the product automaton for the result of some two-argument Boolean operation applied to $M_1$ and $M_2$, the initial product state is given by $(q_0^1, q_0^2)$, where $q_0^1$ is the initial state of $M_1$ and $q_0^2$ is the initial state of $M_2$. The successors of this state are determined for the inputs 0 and 1, and this process is repeated until no new state pairs are generated. The process is shown in Figure 4.
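The breadth-first generation of state pairs can be sketched in a few lines of Python. This is an illustrative simplification and not the procedure of Figure 4: the names `trie`, `step`, `product` and `accepts` are assumptions, the Boolean operator is applied only when pairs at depth $n$ are classified as accepting, and pairs containing a sink component are kept instead of being collapsed into a single product sink.

```python
from collections import deque
import operator

SINK = None  # sink state shared by both operand automata

def trie(words):
    """Build a (non-minimal) acyclic automaton accepting exactly `words`."""
    delta, final = {}, set()
    for w in words:
        q = 0
        for a in w:
            if (q, a) not in delta:
                delta[(q, a)] = len(delta) + 1  # fresh state number
            q = delta[(q, a)]
        final.add(q)
    return delta, 0, final

def step(delta, q, a):
    """Successor of q on input a; the sink only leads to the sink."""
    return SINK if q is SINK else delta.get((q, a), SINK)

def product(m1, m2, op, n):
    """Breadth-first product of two depth-n automata for the Boolean operator op."""
    (d1, s1, f1), (d2, s2, f2) = m1, m2
    start = (s1, s2)
    edges, accepting, seen = {}, set(), {start}
    queue = deque([(start, 0)])
    while queue:
        (q1, q2), depth = queue.popleft()
        if depth == n:  # bottom level: decide acceptance with op
            if op(q1 in f1, q2 in f2):
                accepting.add((q1, q2))
            continue
        for a in "01":
            nxt = (step(d1, q1, a), step(d2, q2, a))
            edges[((q1, q2), a)] = nxt
            if nxt not in seen:  # occurrence check for the new pair
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return edges, start, accepting

def accepts(machine, word):
    edges, q, accepting = machine
    for a in word:
        q = edges[(q, a)]
    return q in accepting

f = {"110", "111", "001", "101"}
g = {"010", "110", "111"}
words = [format(i, "03b") for i in range(8)]
and_lang = {w for w in words if accepts(product(trie(f), trie(g), operator.and_, 3), w)}
print(sorted(and_lang))
```

Passing `operator.or_` or `operator.xor` instead of `operator.and_` gives the union and the symmetric difference of the two languages, which is the sense in which one procedure serves all the binary operators. The result is generally not minimal.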
Note that there are only two places where we need to take into account the type of the Boolean operator: the computation of $(\delta_1(q_1, 0), \delta_2(q_2, 0))$ and $(\delta_1(q_1, 1), \delta_2(q_2, 1))$. The most time-consuming part of this procedure is deciding whether a pair is new or not. By using a hash method with chaining we can make this test take essentially constant time. The hash function that we use is given by $\mathit{hash}_{prod}(q_1, q_2) = \mathrm{mod}(q_1 * (\mathrm{HASH\_SIZE}/2) + q_2, \mathrm{HASH\_SIZE})$, where $q_1$ and $q_2$ are integer values for the state pointers. Since we use a hash method with chaining, each state is recorded in a linked data structure with 3 fields: the first field (edge0) holds a pointer to the 0-successor of the state, the second field (edge1) holds a pointer to the 1-successor of the state, and the third field holds a pointer to a state with the same hash key. It should be mentioned that we need no special memory for a state pair.
Let state $q$ correspond to a state pair $(q_1, q_2)$, state $q'$ correspond to a state pair $(\delta_1(q_1, 0), \delta_2(q_2, 0))$, and state $q''$ correspond to a state pair $(\delta_1(q_1, 1), \delta_2(q_2, 1))$. First, a data cell corresponding to state $q$ is allocated so that edge0 of $q$ holds a pointer to $q_1$ and edge1 holds a pointer to $q_2$. Then $q$ is registered in a hash table and placed in a queue. If the state pair $(q_1, q_2)$ is generated as a next state of some state, then the hash key is computed and the hash entry is checked. Since the state pair is already registered as $q$ in the hash table, there is no need to register the pair. In this manner, the data cell of $q$ is used for the occurrence check of the same state pair.
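The occurrence check with chaining can be sketched as follows in Python. This is only an illustration under stated assumptions: states are small integers rather than pointers, cells are dictionaries, the class name `PairTable` and method `find_or_add` are hypothetical, and the table size 1023 is the value reported for product generation in the performance evaluation. The later overwriting of edge0 and edge1 with successor pointers is not modeled.

```python
HASH_SIZE = 1023  # table size reported for the product generation phase

def hash_prod(q1, q2, size=HASH_SIZE):
    """Hash key of a state pair, following the formula in the text."""
    return (q1 * (size // 2) + q2) % size

class PairTable:
    """Hash table with chaining; each cell has the fields edge0, edge1, next."""

    def __init__(self, size=HASH_SIZE):
        self.size = size
        self.buckets = [None] * size  # head cell of each chain

    def find_or_add(self, q1, q2):
        """Return (cell, is_new) for the pair (q1, q2)."""
        key = hash_prod(q1, q2, self.size)
        cell = self.buckets[key]
        while cell is not None:
            # Until the cell is dequeued, edge0/edge1 hold the pair itself.
            if cell["edge0"] == q1 and cell["edge1"] == q2:
                return cell, False
            cell = cell["next"]
        cell = {"edge0": q1, "edge1": q2, "next": self.buckets[key]}
        self.buckets[key] = cell
        return cell, True

table = PairTable()
first, first_new = table.find_or_add(2, 3)
again, again_new = table.find_or_add(2, 3)
print(first_new, again_new, first is again)
```

No separate record is kept for the pair: the two edge fields double as its storage, which is the memory saving described above.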
After the state $q$ is dequeued, and state pairs corresponding to $q'$ and $q''$ are calculated, edge0 (edge1) of $q$ is overwritten with a pointer to $q'$ ($q''$). Since the state transition graph is acyclic, if we generate state pairs in a breadth first manner using a queue as shown in Figure 4, the pair $(q_1, q_2)$ will not be generated again after it is dequeued, and we need not keep the data of the pair for the occurrence check after the dequeue operation. Thus, our overwrite method for reducing the memory usage works quite well.
An example illustrating this phase is shown in Figure 5, where the intersection of $M_1$ and $M_2$ is computed (corresponding to an AND operation in the original formula). $M_1$ corresponds to $(\neg x_1 \wedge \neg x_3) \vee x_1$, and $M_2$ corresponds to [...]. The result of the AND operation is $(\neg x_1 \wedge \neg x_2 \wedge \neg x_3) \vee (x_1 \wedge \neg x_2 \wedge \neg x_3) \vee (x_1 \wedge x_2)$. In the product construction, the initial pair $(q_1, q_7)$ is entered in the queue $S$. Then its next state is computed. At this point the queue $S$ contains $(q_2, q_8)$, $(q_3, q_9)$. Next, the successor of $(q_2, q_8)$ is computed, and the queue becomes $(q_3, q_9)$, $(q_4, q_{10})$, $(q_4, q_{11})$. Hence, the states are generated in the order $q_{14}, q_{15}, \ldots, q_{21}$.

Minimization
After the product generation phase, we must minimize the resulting automaton. Since the graphs involved are directed acyclic graphs, we do not need to use the completely general $n \cdot \log(n)$ minimization algorithm described in [1]. Instead, we can use a variant of the linear algorithm for tree isomorphism [1].
In the minimization phase, states are processed starting at the bottom level and working upward, since the determination of whether two states should be merged into an equivalence class is based on the equivalence of their successor states. First, the final states (the bottom level nodes) are processed. Next, the states which have an edge to the final state are processed, and so on. Thus the order in which the states are processed in this phase is the reverse of the order in which they were generated during the product phase.
The minimization algorithm is summarized in Figure 6. To reduce the memory consumption, we keep a global binary decision diagram whose states represent equivalence classes of states of the reduced automaton. The same hash mechanism is used for the occurrence check of the new global state as in the product generation phase. The hash key for a state $q$ is defined as $\mathit{hash}_{min}(q) = \mathrm{mod}(\delta(q, 0) * \mathrm{HASH\_SIZE} + \delta(q, 1), \mathrm{HASH\_SIZE})$ using the edge-pair $(\delta(q, 0), \delta(q, 1))$ of $q$.
For each state q of the product automaton in the reverse order of the generation do
Begin
    Reset_Flag;
    For each global state with the same hash key as q do
    Begin
        If the edge-pair of the global state is the same as that of q then
        Begin
            Set_Flag;
            Break;
        End;
    End;
    If Flag is not set then
    Begin
        Allocate a global state cell;
        Copy the edge information from q to the global state;
    End;
    Mark q as registered, and store a pointer to the global state;
End;
For the product automaton in Figure 5, states are processed in the order $q_{21}, q_{20}, \ldots, q_{14}$ as shown below. The minimal automaton is also given in Figure 5.
1. $q_{21}$ is processed and is registered as the unique final state.
2. The edge-pair $(q_{21}, q_{21})$ of $q_{20}$ is new, and $q_{20}$ is registered as unique.
3. The edge-pair $(q_{21}, \bot)$ of $q_{19}$ is new, and $q_{19}$ is registered as unique.
4. The edge-pair of $q_{18}$ is $(\bot, \bot)$, and it is impossible to reach the final state from $q_{18}$. Thus $q_{18}$ is removed.
5. The edge-pair $(q_{21}, \bot)$ of $q_{17}$ is the same as that of $q_{19}$. Thus we set $q_{17} = q_{19}$.
6. The edge-pair $(q_{19}, q_{20})$ of $q_{16}$ is new, and $q_{16}$ is registered as unique.
7. The edge-pair $(q_{19}, \bot)$ of $q_{15}$ is new, and $q_{15}$ is registered as unique.
8. The edge-pair $(q_{15}, q_{16})$ of $q_{14}$ is new, and $q_{14}$ is registered as unique.
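The eight steps above can be replayed with a short bottom-up pass. The following Python sketch is illustrative only: it uses a dictionary keyed by edge-pairs in place of the hash table of global states, the function name `minimize` is an assumption, and the raw edges of $q_{15}$ (to $q_{17}$ and $q_{18}$) are inferred from its edge-pair after merging rather than read from Figure 5.

```python
SINK = "sink"

def minimize(order, edges, final):
    """Map each state to its representative, processing states bottom-up.

    `order` lists the states in reverse order of generation (final state
    first); `edges[q]` is the pair (0-successor, 1-successor) of q.
    """
    rep = {SINK: SINK, final: final}
    unique = {}  # edge-pair of representatives -> registered state
    for q in order:
        if q == final:
            continue
        pair = tuple(rep[s] for s in edges[q])
        if pair == (SINK, SINK):
            rep[q] = SINK          # final state unreachable: q is removed
        elif pair in unique:
            rep[q] = unique[pair]  # same edge-pair as a registered state
        else:
            unique[pair] = q       # new edge-pair: register q as unique
            rep[q] = q
    return rep

edges = {
    "q20": ("q21", "q21"),
    "q19": ("q21", SINK),
    "q18": (SINK, SINK),
    "q17": ("q21", SINK),
    "q16": ("q19", "q20"),
    "q15": ("q17", "q18"),  # assumed raw edges, see the lead-in
    "q14": ("q15", "q16"),
}
order = ["q21", "q20", "q19", "q18", "q17", "q16", "q15", "q14"]
rep = minimize(order, edges, "q21")
print(rep["q17"], rep["q18"])
```

Because successors are always processed before their predecessors, comparing edge-pairs of representatives is enough to detect equivalent states in a single pass.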

Implementation
We now describe how the basic algorithm outlined in the previous section can be implemented on a shared memory multiprocessor. To illustrate the procedure we consider the following example.
The first step is to determine the level of each node in the parse tree for the formula (see Figure 7). The leaf nodes of the tree are input variables; the non-leaf nodes correspond to the Boolean operators that occur in the formula. The level of each node is determined by the rule:
1. The level of an input variable is 0.
2. The level of a non-leaf node is $\max(l_1, l_2) + 1$, where $l_1$ and $l_2$ are the levels of its operands.
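The level rule translates directly into a recursive function. The Python sketch below assumes a parse tree encoded as nested tuples `(operator, left, right)` with variable names as strings; this encoding, the helper names and the sample formula are hypothetical and are not the formula of Figure 7.

```python
def level(node):
    """Level of a parse-tree node: 0 for a variable, else 1 + max over operands."""
    if isinstance(node, str):
        return 0
    _op, *operands = node
    return 1 + max(level(child) for child in operands)

def group_by_level(node, groups=None):
    """Collect operator names by level; operations in one group are independent."""
    groups = {} if groups is None else groups
    if not isinstance(node, str):
        for child in node[1:]:
            group_by_level(child, groups)
        groups.setdefault(level(node), []).append(node[0])
    return groups

# Hypothetical formula: (x1 AND x2) OR ((x2 OR x3) AND x1)
tree = ("or", ("and", "x1", "x2"), ("and", ("or", "x2", "x3"), "x1"))
groups = group_by_level(tree)
print(level(tree), groups)
```

Here the two level 1 operations could be given to different processors, while levels 2 and 3 each contain a single operation.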
Since we initially generate binary decision diagrams for input variables, we can process operations at level 1 immediately. After the level 1 operations have been completed, we can process Boolean operations at level 2, and so on. In general, we can process level $i$ nodes as soon as the level $i - 1$ nodes have been completed.
Operations at the same level in the tree can be performed in parallel, since they do not conflict. Each such operation is performed on a separate processor since synchronization between processors is very time consuming. Some levels have only a few operations that can be performed in parallel. We divide operations on such levels into several sub-operations so that there will not be as many idle processors. The method is as follows.
In the product generation and minimization phase, the 0- and 1-successors of the initial pair $(q_0^1, q_0^2)$ are generated. Then the product generation and minimization are performed for these two successors. After the minimization for these two successors is completed, the minimization of the root state corresponding to the initial pair is begun. Thus the product and minimization phase for each of these two successors (the 0- and 1-successors of $(q_0^1, q_0^2)$) can be performed in parallel. Note that the minimization phase guarantees the uniqueness of global states.
An example of this procedure is shown in Figure 8. First, processor $P_1$ expands the 0- and 1-successors of the initial pair. Processor $P_2$ takes the 0-successor $(q_2, q_8)$, generates the product automaton and minimizes this automaton. Processor $P_3$ takes the 1-successor and does the same thing. After $P_2$ and $P_3$ have completed the minimization phase for their product automata, processor $P_1$ minimizes $q_{14}$. If, in the example, we compute the 00-, 01-, 10- and 11-successors of the initial pair, then the original operation can be divided into four parts whose product and minimization phases can be performed in parallel. Three merges are needed to reassemble these parts. In a similar manner we can divide a single operation into 8 parts and 5 merges to obtain an even higher degree of parallelism.
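The idea of splitting one operation by the successors of the initial pair can be mimicked at the level of languages. The Python sketch below is an analogy under stated assumptions, not the product and minimization code: each operand is a set of bit strings, `restrict` and `split_and` are hypothetical names, and a thread pool stands in for the processors.

```python
from concurrent.futures import ThreadPoolExecutor

def restrict(lang, prefix):
    """Suffixes of the strings in `lang` that start with `prefix`."""
    return {w[len(prefix):] for w in lang if w.startswith(prefix)}

def split_and(f, g, depth=1):
    """AND of two languages, computed independently for each prefix of length `depth`."""
    prefixes = [format(i, f"0{depth}b") for i in range(2 ** depth)]
    with ThreadPoolExecutor() as pool:
        # Each sub-operation only looks at its own successor of the initial pair.
        partial = list(pool.map(lambda p: restrict(f, p) & restrict(g, p), prefixes))
    merged = set()
    for p, part in zip(prefixes, partial):
        merged |= {p + w for w in part}  # merge step: reattach the prefix
    return merged

f = {"110", "111", "001", "101"}
g = {"010", "110", "111"}
print(sorted(split_and(f, g, depth=2)))
```

With `depth=1` there are two parts (the 0- and 1-successors); with `depth=2` there are four, matching the 00-, 01-, 10- and 11-successors mentioned above.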
Figure 9 shows a program executed in parallel with several processors. The algorithm is intended for a shared memory multiprocessor and would require significant modification for other types of parallel architecture. Data for operations and for global states can be accessed by all processors; lock and unlock are used to implement mutual exclusion.
While (operation exists) do
Begin
    lock;
    take one operation if exists;
    unlock;
    If an operation can be taken then
    Begin
        wait until the operands of the operation have been computed;
        do the operation, i.e. construct product automaton;
        minimize the product automaton;
    End;
End;
Fig. 9 Process structure.
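The process structure of Figure 9 can be sketched with Python threads. This is an illustration under stated assumptions rather than the Cthreads implementation: `run_parallel` is a hypothetical name, each operation is a triple `(name, function, operand names)`, the list of operations is assumed to be sorted by level so that a waiting worker never waits for an operation that has not been taken, and sets of bit strings stand in for binary decision diagrams.

```python
import threading
from collections import deque

def run_parallel(operations, inputs, workers=4):
    """Run operations (sorted by level) on several worker threads."""
    results = dict(inputs)
    done = {name: threading.Event() for name, _func, _args in operations}
    pending = deque(operations)
    lock = threading.Lock()

    def worker():
        while True:
            with lock:  # lock; take one operation if it exists; unlock
                if not pending:
                    return
                name, func, args = pending.popleft()
            for a in args:  # wait until the operands have been computed
                if a in done:
                    done[a].wait()
            results[name] = func(*(results[a] for a in args))
            done[name].set()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

U = {format(i, "03b") for i in range(8)}
inputs = {f"x{k + 1}": {w for w in U if w[k] == "1"} for k in range(3)}
ops = [
    ("a", lambda p, q: p & q, ("x1", "x2")),   # level 1
    ("n2", lambda p: U - p, ("x2",)),          # level 1
    ("b", lambda p, q: p & q, ("n2", "x3")),   # level 2
    ("f", lambda p, q: p | q, ("a", "b")),     # level 3
]
out = run_parallel(ops, inputs)
print(sorted(out["f"]))
```

The lock protects only the shared list of pending operations, which mirrors the remark that contention for shared data is light.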
In the parallel minimization algorithm, the following method is used to maintain consistency of global states.

Performance Evaluation
Our program for building binary decision diagrams is implemented in C and uses the Cthreads package [5] for parallel programming under the Mach operating system. Interlocks are used for process synchronization instead of general semaphores in order to avoid the expense associated with system calls. The program is organized so that locks are only needed for the hash table for global states and for taking a Boolean operation to be executed. Consequently, contention for shared memory is light. The performance statistics that we describe below were obtained for an Encore Multimax with 16 processors and 96 megabytes of shared memory. Each processor is a National Semiconductor 32332 and is rated at roughly 2 MIPS.
Multipliers were used to evaluate the program since the binary decision diagrams for these circuits are known to grow quite rapidly (exponentially in the size of the operands, in fact). Table 1 shows the execution time to construct binary decision diagrams for multipliers with 7 to 10 bits (14 to 20 Boolean variables). In the evaluation, a hash table with 1023 entries is used for the product generation, and a hash table with 32767 entries for the minimization.
Table 1 shows that the minimum execution time on the Multimax with several processors is about 10 times smaller than the execution time with a single processor. The time for a single processor is roughly the same as that of the (sequential) program for constructing binary decision diagrams described in [6]. The graphs in Figure 10 show how the execution time varies with the number of processors. The execution time is inversely proportional to the number of processors. The graphs in Figure 11 show the rate of speed-up for these multipliers. The rate of speed-up is defined as (the execution time using 1 processor) / (the execution time using $n$ processors). The rate is almost linear with the number of processors.
[Figure 10 panels: Execution time for 9-bit multiplier example; Execution time for 10-bit multiplier example.]

Uniform Splitting Method
We have shown a parallel algorithm to construct BDD's. In several cases, the number of nodes exceeds the memory limitation of a computer and the construction of BDD's fails. To overcome the problem, we have devised a divide-and-conquer method. Since the parallel algorithm provides high-speed execution, each part can be processed in reasonable time and the total execution time is also reasonable.
The basic idea of the divide-and-conquer method is that a Boolean function $f$ can be represented as a pair $(f_0, f_1)$, and that Boolean operations can be done independently for each part of the pair.

It is easy to show that
We can also show that
The algorithm is summarized as follows, where $d$ is the logarithm of the size of the division, i.e. the number of parts is $2^d$. Since the construction of BDD's is based on BDD's for input variables and on Boolean operations, we only need to create the BDD's for input variables appropriately in each repetition. Let $d = 3$, and let $f(x_1, \ldots, x_n)$ be the original Boolean formula. In the following loop, BDD's for $f(0,0,0,x_4,\ldots,x_n)$, $f(0,0,1,x_4,\ldots,x_n)$, $f(0,1,0,x_4,\ldots,x_n)$, etc. are constructed with respect to $i = 0, 1, 2, \ldots$
For i = 0 to 2^d - 1 do
Begin
    Initialize data with respect to i:
        Set all operations to be undone;
        Create i-th part of BDD's for input variables:
            The depth d successor of the initial state of the original BDD with respect to the binary representation of i;
    While (operation exists) do
    Begin
        do the operation;
    End;
    remove all data generated in the construction;
End;
Note that if we need a BDD corresponding to the global outputs, we should keep the data corresponding to these outputs in each repetition. Also note that if we only want to know whether two functions are equivalent or not, we need not keep any data, since two Boolean functions are equivalent if and only if each part is equivalent. The check can be done in each repetition. The parallel execution algorithm can be used in doing the sequence of Boolean operations in the above algorithm.
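The loop over the $2^d$ parts and the part-by-part equivalence check can be sketched in Python. The sketch makes several assumptions: Boolean functions are ordinary Python callables, exhaustive evaluation of each part stands in for building and comparing its BDD's, and the names `parts` and `equivalent_by_parts` as well as the sample functions are hypothetical.

```python
from itertools import product

def parts(f, d):
    """Yield (prefix, part) where part is f with its first d arguments fixed."""
    for prefix in product((0, 1), repeat=d):
        yield prefix, (lambda *rest, p=prefix: f(*p, *rest))

def equivalent_by_parts(f, g, n, d):
    """Check f == g one part at a time, keeping no data between parts."""
    for (_, fp), (_, gp) in zip(parts(f, d), parts(g, d)):
        # Stand-in for constructing and comparing the BDD's of this part.
        if any(fp(*r) != gp(*r) for r in product((0, 1), repeat=n - d)):
            return False
    return True

def f(x1, x2, x3):
    return bool((x1 and x2) or ((not x2) and x3))

def g(x1, x2, x3):
    return bool(x1 if x2 else x3)  # same function written as a selector

def h(x1, x2, x3):
    return bool(x1 and x2)

print(equivalent_by_parts(f, g, 3, 2), equivalent_by_parts(f, h, 3, 2))
```

Only one part is alive at a time, which is what keeps the memory requirement of each repetition small.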
Applying the method to the multiplier examples gives the following results. In these examples, 10 processors are used. First, an 8-bit multiplier example is shown for comparison with the data obtained without splitting. The construction of BDD's for 13 to 16-bit multipliers cannot be done without splitting.

8-bit multiplier
The problem is divided into 4 parts. The number of nodes of BDD's for each part is about 34,000 (0.4 MB), and the execution time for the construction of each part is about 4.0 seconds. Total execution time is about 16.27 seconds.

13-bit multiplier
The problem is divided into 8 parts. The number of nodes of BDD's for each part is [...]
[...] Symbolic Model Checking [3,4]. When synchronization or communication is possible among several finite state processes, the number of system states can be quite large. By using a sequential implementation of binary decision graphs to provide a concise representation for large global state-transition graphs, we have already been able to verify a pipelined circuit with as many as $10^{20}$ states. Since constructing binary decision diagrams is the most time consuming part of the verification procedure, we should be able to handle even larger finite state systems in the future.

Fig. 2 A binary decision diagram accepting all strings.

Fig. 4 Construction of the product automaton.

Fig. 7 Levels of Boolean operations for

Fig. 10 Execution time for multiplier examples on Multimax.

Fig. 11 Speed-up rate for multiplier examples on Multimax.
[Figure 11 panel: Speed-up rate for 9-bit multiplier example.]

The size of the hash table (the parameter HASH_SIZE) and the hash function are critical factors in determining the execution time of this phase of the algorithm.
Let the initial pair be $(q_0^1, q_0^2)$;
Put the pair in the queue S, and allocate a new state for it;
While (S is not empty) do
Begin
    Dequeue a pair $(q_1, q_2)$ from

Table 1
Evaluation of multiplier examples on Multimax.