Communication complexity for parallel divide-and-conquer

The relationship between parallel computation cost and communication cost for performing divide-and-conquer (D&C) computations on a parallel system of p processors is studied. The parallel computation cost is the maximal number of the D&C nodes that any processor in the parallel system may expand, whereas the communication cost is the total number of cross nodes (nodes generated by one processor but expanded by another processor). A scheduling algorithm is proposed, and lower bounds on the communication cost are derived. The proposed scheduling algorithm is optimal with respect to the communication cost, since the parallel computation cost of the algorithm is near optimal.


Introduction
Divide and conquer (D&C) is a common computation paradigm, in which the solution to a problem is obtained by solving subproblems recursively. Examples of D&C computations include various sorting methods such as quicksort [6], computational geometry procedures such as convex hull calculation [12], AI search heuristics such as constraint satisfaction techniques [5], adaptive data classification procedures such as generation and maintenance of quadtrees [13], and numerical methods such as multigrid algorithms [10] for solving partial differential equations.
A D&C computation can be viewed as a process of expanding and shrinking a tree. Each node in the tree corresponds to a problem instance, and children of the node correspond to its subproblems. During the computation, each internal (non-leaf) node goes through two phases. The first phase is the divide phase, during which the problem instance associated with the node is divided into subproblems. The second phase is the combine phase, during which the solution of the problem instance associated with the node is derived by combining solutions of the subproblems associated with the node's children. After its creation, each leaf will perform some computation and return the results to its parent. At a given time, nodes on a wavefront that cuts across all paths from the root to leaves can be active in performing divide, combine, or compute operations. Along each path the wavefront first moves down from the root to its leaf and then up from the leaf to the root.
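The divide, compute, and combine phases described above can be illustrated with a small sequential sketch. This is only an illustration under stated assumptions: the function names (`solve`, `split_in_half`, `merge_sorted`) are hypothetical, merge sort is chosen merely as a familiar D&C instance, and the code also counts the nodes of the resulting tree.

```python
import heapq

def solve(problem, divide, combine, compute):
    """Expand one tree node; return (solution, number of nodes in its subtree)."""
    subproblems = divide(problem)           # divide phase; [] means a leaf
    if not subproblems:
        return compute(problem), 1          # leaf: compute, return to parent
    results, nodes = [], 1
    for sub in subproblems:                 # children of this node
        result, count = solve(sub, divide, combine, compute)
        results.append(result)
        nodes += count
    return combine(results), nodes          # combine phase

def split_in_half(xs):
    # Hypothetical divide operation: a list of length <= 1 is a leaf.
    if len(xs) <= 1:
        return []
    mid = len(xs) // 2
    return [xs[:mid], xs[mid:]]

def merge_sorted(parts):
    # Hypothetical combine operation: merge the sorted child solutions.
    return list(heapq.merge(*parts))

sorted_xs, n_nodes = solve([5, 2, 9, 1, 7], split_in_half, merge_sorted, list)
print(sorted_xs, n_nodes)
```

Here the tree is regular, so its shape is known in advance; the irregular, data-dependent trees discussed in this paper are exactly the cases where `divide` returns an unpredictable number of subproblems.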
At first glance, one might think that it should be straightforward to perform D&C in parallel, because nodes on the wavefront can all be processed independently. However, if one wants to achieve good load balancing between the processors, then parallelizing D&C becomes nontrivial. In fact, doing efficient D&C on any real parallel machine has been a major challenge to researchers [3, 4, 9, 14] for many years.
The difficulties are due to the fact that many D&C computations are highly dynamic in the sense that these computations are data-dependent. During computation, a problem instance can be expanded into any number of subproblems depending on the data that have been computed so far. In fact, the trees of many D&C computations can be expected to be sparse and irregular, and as a result, load balancing must be adaptive to the tree structure and must be done dynamically at run time. This implies that computation loads need to be moved around between processors during computation. The challenge is then to devise efficient scheduling algorithms which can achieve good load balancing while minimizing the communication cost for moving computations around.
CH3062-7/91/0000/0151$01.00 © 1991 IEEE
In general there is a tradeoff between balancing computation loads and minimizing communication costs. The results of this paper quantify this tradeoff. In particular, the paper establishes lower bounds on the communication cost for any scheduling algorithm based on how well it performs load balancing.
2 Summary of Results of This Paper

Definitions and Notation
The tree of a D&C computation is called an $(N, h, d)$-tree if
• $N$ is the number of nodes in the tree,
• $h$ is the height of the tree, and
• $d$ is the maximal number of children of a node.
(We assume that $d$ is at least 2, to allow parallel processing of the tree.) A node is said to be at tree level $i$ if it is the $i$-th node on the path from the root to the node. Therefore, the root is at level 1, and the height of the tree is the maximal level number.
For the parallel system which will carry out the D&C computation, we assume that
• $p$ is the number of processors in the system, and
• it takes one time step for a processor to expand a node, i.e., to perform the divide operation for an internal node, or to perform the compute operation for a leaf node.
For simplicity, we assume that a processor takes no time to perform a combine operation.
When a node is expanded, zero or more children may be generated. More precisely, if a node does not generate any children, the node is a leaf; if a node generates one or more (up to $d$) children, the node is an internal node. Each newly generated node will in turn be expanded by some processor in the future. A frontier node is a node which has been generated but has not been expanded.
A scheduling algorithm for a D&C computation schedules nodes (i.e., frontier nodes) on processors for expansion. We assume that scheduling algorithms cannot "look ahead". This non-lookahead assumption is reasonable when dealing with irregular D&C trees. In this type of tree, the number of children a parent may have (if any) is typically data-dependent and is therefore not known a priori.
The parallel computation cost $T_A(H)$ of a scheduling algorithm $A$ for a D&C computation tree $H$ is the maximum number of nodes that any processor may expand. Since there are $N$ nodes and $p$ processors, a lower bound on $T_A(H)$ is $T_{\min} = \lceil N/p \rceil$. The parallel computation cost $T_A$ of algorithm $A$ is defined as the maximum $T_A(H)$ over all $(N, h, d)$-trees $H$.
The communication cost $C_A(H)$ of a scheduling algorithm $A$ for a D&C computation tree $H$ is the total number of cross nodes. A cross node is a node which is generated by one processor but expanded by another processor. Note that the processor expanding a cross node needs to receive information from the processor generating the node. Therefore, $C_A(H)$ is a reasonable measure for capturing the interprocessor communication cost in performing the divide phase of all the internal nodes. (A similar definition of communication cost is used by Papadimitriou and Ullman in [11].) The communication cost $C_A$ of algorithm $A$ is defined as the maximum $C_A(H)$ over all $(N, h, d)$-trees $H$.
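The two cost measures can be made concrete with a short sketch that evaluates them for one finished schedule. The record type `Node` and its fields are hypothetical names introduced here for illustration, and the sketch assumes the root (which no processor generates) is not counted as a cross node.

```python
from collections import Counter
from dataclasses import dataclass
from math import ceil
from typing import List, Optional

@dataclass
class Node:
    generated_by: Optional[int]  # processor that generated the node; None for the root
    expanded_by: int             # processor that expanded the node

def parallel_computation_cost(nodes: List[Node], p: int) -> int:
    """Maximum number of nodes expanded by any one of the p processors."""
    counts = Counter(n.expanded_by for n in nodes)
    return max(counts.get(i, 0) for i in range(p))

def communication_cost(nodes: List[Node]) -> int:
    """Number of cross nodes: generated by one processor, expanded by another."""
    return sum(1 for n in nodes
               if n.generated_by is not None and n.generated_by != n.expanded_by)

# A 7-node tree scheduled on p = 2 processors (an invented example schedule).
schedule = [Node(None, 0), Node(0, 0), Node(0, 1), Node(0, 0),
            Node(0, 0), Node(1, 1), Node(1, 1)]
t_cost = parallel_computation_cost(schedule, p=2)
c_cost = communication_cost(schedule)
t_min = ceil(len(schedule) / 2)
print(t_cost, c_cost, t_min)
```

In this example a single cross node is enough to bring the parallel computation cost down to the lower bound $\lceil N/p \rceil = 4$; the results below concern how many cross nodes are unavoidable in the worst case.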

Main Results
Theorem 1 For each scheduling algorithm $A$ for a parallel system of $p$ processors, for each integer $p'$, $0 < p' \le p$, and for each $N$, $h$, and $d$ with the following two restrictions, S1. $N > 3pd^2h$, and there exists some $(N, h, d)$-tree $H$ for which at least one of the following two properties is true:

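P1. the parallel computation cost of the algorithm is at least $N'/p'$, i.e., $T_A(H) \ge N'/p'$;
P2. the communication cost of the algorithm is at least $C'$, i.e., $C_A(H) \ge C'$.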
Many D&C computations are expected to satisfy restrictions S1 and S2. Since $N$ is usually an exponential function of $h$, restriction S1 is easily satisfied in these cases. Restriction S2 roughly requires that $N < d^{h-2}/ph$. If a tree is perfectly balanced and each internal node has exactly $d$ children, then $N$ would be $\Theta(d^{h-1})$ instead. A perfectly balanced tree is easy for load balancing because the subtrees of each node have the same computation load. Restrictions S1 and S2 basically capture those interesting D&C computations with irregular trees. This class of D&C computations is exactly the one for which it is difficult to achieve good load balancing without paying much in communication overhead. The lower bound on $C_A(H)$, stated in P2 of the theorem, provides an explanation of why this must be the case.
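As a quick check of the node count quoted for a perfectly balanced tree, the count can be written out as a geometric sum. This is a sketch assuming the root is at level 1 (as in the definitions above), every internal node has exactly $d$ children, and $d \ge 2$ is treated as a constant:

```latex
N \;=\; \sum_{i=1}^{h} d^{\,i-1} \;=\; \frac{d^{h}-1}{d-1},
\qquad
d^{h-1} \;\le\; \frac{d^{h}-1}{d-1} \;<\; 2\,d^{h-1} \quad (d \ge 2).
```

For instance, $d = 2$ and $h = 4$ give $N = 15$, which lies between $d^{h-1} = 8$ and $2d^{h-1} = 16$.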
The two properties P1 and P2 in Theorem 1 can be expressed in terms of the quantities $N$, $h$, $d$ (associated with the D&C tree) and $p$ (associated with the parallel system) as follows. One can check that $N' > (1 - \epsilon_N)N$ and $h' \ge (1 - \epsilon_h)h$ for each positive $\epsilon_N \le 1$ and $\epsilon_h \le 1$, provided that $h \ge \log_d ph + \log_d 3 + 6 - \log_d \epsilon_N$ and From this and the fact that $N' < N$ and $h' < h$, we note that $N'$ and $h'$ approach $N$ and $h$ respectively, when both $\epsilon_N$ and $\epsilon_h$ approach 0. Therefore, P1 and P2 in Theorem 1 become $T_A(H) = \Omega(N/p)$ and $C_A(H) = \Omega(pdh)$ for large $h$, when $p'$ is close to $p$. Furthermore, we can slightly change the theorem as Corollary 1.
Theorem 1 also implies an important tradeoff result: if a scheduling algorithm wants to achieve good load balancing by parallel processing, then it must pay a high price in communication cost. We can express the tradeoff between $T_A$ and $C_A$ explicitly by showing a lower bound on their product: $T_A \cdot (C_A + \kappa)$.
If $(p^* - 1)\kappa \le C_A < p^*\kappa$, where $0 < p^* \le p$, then by Theorem 1, $T_A$ must be at least $N'/p^*$. Therefore, $T_A \cdot (C_A + \kappa) \ge (N'/p^*) \cdot p^*\kappa = N'\kappa$. Note that because $T_A \ge N/p > N'/p$, this tradeoff is also satisfied when $C_A \ge p\kappa$. This tradeoff result is summarized in the following corollary.
The algorithm satisfying Theorem 2 has the minimum parallel computation cost. By Corollary 1, the algorithm is optimal with respect to the communication cost, since the parallel computation cost of the algorithm is near optimal. These results also imply that the lower bound on $T_A \cdot (C_A + \kappa)$ in Corollary 2 is tight when both $\epsilon_N$ and $\epsilon_h$ are arbitrarily close to 0.
Note that Theorems 1 and 2 are so formulated that their results are system-independent. That is, the results are independent of the interconnection topology of the processors and of various control overheads such as data structure maintenance and reading/writing messages. Therefore, our upper and lower bounds on $C_A$ are intrinsic to any parallel system. These bounds give insights into actual communication cost in a real implementation, but exactly how they are related to the actual cost is a separate matter depending on the implementation (see [15]).
Section 3 describes the algorithm of Theorem 2. Section 4 presents a simplified version of Theorem 1 and its proof to help the reading of this paper. A complete proof of Theorem 1 is given in Section 5.

Relation to Past Work
There have been several approaches to performing parallel D&C. A simple approach (e.g., in [2]) is to expand all the nodes above a fixed level on one processor and then distribute nodes at this level to other processors. Load balancing would be done poorly in this approach when the tree is irregular. Another approach [14] is to distribute generated nodes, and to have each processor perform load balancing based on load status information from its neighbor processors. For this scheme, the communication cost can be very high in the worst case.
Recently, some researchers have made efforts to reduce communication overhead. A popular approach [4, 9, 16] is based on the "donate-highest-subtree" strategy, in which an idle processor will be given frontier nodes as near to the root as possible. Since a subtree rooted near the top usually has many nodes and these nodes can all be expanded locally, this strategy tends to reduce the amount of interprocessor communication. Ferguson and Korf [3] presented a D&C scheme with several processors scheduled first to a node and then to their children. The idea behind their scheme is also that of distributing frontier nodes near the root to idle processors.
Although the methods described in the previous paragraph all attempt to reduce communication overhead, they do not use global information to balance the load. It turns out that the communication cost for these methods can still be high in the worst case. For example, we estimate that the communication cost is $O(dh^{\log_d p})$ for Ferguson and Korf's scheme, and is $O(\min(p^2h, pdh^2))$ for the scheme in [4] with round-robin scheduling.
In contrast, the communication cost for the scheduling algorithm of this paper (Section 3) is as low as $O(pdh)$ (Theorem 2). This is partly due to the fact that our algorithm is able to make effective use of global information (i.e., the "global pool" in Section 3).
Most importantly, we note that none of the previous work has any lower bound results on the communication cost for parallel D&C computations. It appears that our lower bounds in Theorem 1 and Corollaries 1 and 2 are the first lower bound results for those D&C computations whose tree structures are dynamic in the sense that the tree structure is determined only at run time. Previous results on computation and communication cost tradeoffs such as those in [7, 8, 11] deal with only static computation graphs, whose topologies are known before the computation starts.

A Scheduling Algorithm and Upper Bounds
This section describes a new scheduling algorithm which can achieve the upper bounds in Theorem 2 for both parallel computation cost and communication cost. The bounds hold for any D&C computation, i.e., for any $(N, h, d)$-tree no matter how irregular it is.

Proposed Scheduling Algorithm
The scheduling algorithm uses a data structure, called a Global Pool (abbr. GP), to keep track of frontier nodes at a particular tree level which have not been taken by any processor for expansion. This level, identified by a variable gl, has the property that nodes at higher levels (closer to the root) have all been taken by processors. Every processor will try to take a node from the GP to work on whenever it becomes idle. For the proof of Theorem 2, it suffices to assume that the GP is maintained by some single processor. (See [15] for a distributed scheme where the GP is maintained by multiple processors.) Initially, the GP contains only the root and the value of gl is one. The GP becomes empty when all of its nodes at level gl have been taken by the processors. At this moment, all the processors are requested to send in their frontier nodes at level gl + 1 in the next time step, when all the nodes at level gl + 1 have been generated. Then the GP is filled with this set of new nodes, and gl is increased by one. This process is repeated until all the nodes have been expanded.
The key idea of this algorithm is what each processor will do after it has taken a node from the GP. The processor will do a depth-first traversal. Consequently, the processor can exhaust all possible work locally before asking for a new node from the GP. As a result, we can prove (below) that the communication cost can be as low as $pdh$. While not related to parallel computation cost and communication cost, an important advantage of this local depth-first strategy is that it uses the minimum amount of memory.
In essence, the scheduling algorithm described here uses a breadth-first scheme to distribute big chunks of computation to processors, and has each processor, after receiving a computation, follow the depth-first strategy locally. Therefore, the algorithm is a hybrid method, which interestingly will do a purely depth-first traversal of the tree in the case that only one processor is used.
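The hybrid scheme can be sketched as a small time-stepped simulation. This is a simplified model under stated assumptions, not the authors' implementation: the tree is supplied as a hypothetical `children` dictionary that the scheduler consults only when a node is expanded (so there is no lookahead), refilling the GP is treated as taking no time, every node entering the GP after the root is counted as communication, and the fair round-robin feature used in the proof of Theorem 2 is omitted. The helpers `random_tree` and `height` exist only to build a test input.

```python
import math
import random
from collections import deque

def random_tree(n, d, seed=0):
    """Hypothetical test input: a random tree with n nodes and at most d children per node."""
    rng = random.Random(seed)
    children = {0: []}
    open_nodes = [0]                       # nodes that can still take a child
    for v in range(1, n):
        parent = rng.choice(open_nodes)
        children[parent].append(v)
        children[v] = []
        open_nodes.append(v)
        if len(children[parent]) == d:
            open_nodes.remove(parent)
    return children

def height(children, root=0):
    """Height with the root at level 1, as in the paper's definition."""
    best, stack = 0, [(root, 1)]
    while stack:
        v, lvl = stack.pop()
        best = max(best, lvl)
        stack.extend((c, lvl + 1) for c in children[v])
    return best

def schedule(children, p, root=0):
    """Simulate the GP scheme; return (nodes expanded per processor, comm, time steps)."""
    gl = 1                                 # level currently served by the GP
    gp = deque([(root, 1)])                # global pool of (node, level) pairs
    stacks = [[] for _ in range(p)]        # local depth-first stacks
    expanded = [0] * p
    comm = steps = 0
    while True:
        # GP empty: collect the frontier nodes of the next level from all processors.
        while not gp and any(stacks):
            gl += 1
            for s in stacks:
                sent = [x for x in s if x[1] == gl]
                s[:] = [x for x in s if x[1] != gl]
                gp.extend(sent)
                comm += len(sent)
        if not gp and not any(stacks):
            break                          # every node has been expanded
        steps += 1
        for i in range(p):
            if stacks[i]:
                node, lvl = stacks[i].pop()    # continue local depth-first work
            elif gp:
                node, lvl = gp.popleft()       # idle processor takes a GP node
            else:
                continue                       # nothing available: idle this step
            expanded[i] += 1
            stacks[i].extend((c, lvl + 1) for c in children[node])
    return expanded, comm, steps

tree = random_tree(200, 3, seed=1)
h = height(tree)
for p in (1, 2, 4, 8):
    expanded, comm, steps = schedule(tree, p)
    print(p, max(expanded), comm, steps)
```

In this model a processor takes a new GP node only when its local stack is empty, which is what keeps the number of nodes it holds at any one level to at most $d$.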
Suppose that we define the parallel computation time to be the time (in terms of number of time steps) when the last node is expanded by a processor. Then the parallel computation time of the algorithm described here is at most $\lceil N/p + h \rceil$. To see this, we note that some processors may become idle only when the number of nodes in the GP is smaller than the number of idle processors. In the worst case all the $p$ processors may become idle at the end of some time step, but at this time there is only one node in the GP. Thus, in the next time step, as many as $p - 1$ processors may be idle. This situation can happen at most $h$ times. Therefore, in the entire D&C computation, an additional $h(p - 1)$ nodes could have been expanded if there were no idle processors at any time step. This implies that the parallel computation time is at most $\lceil (N + h(p-1))/p \rceil \le \lceil N/p + h \rceil$.
Note that the parallel computation time defined in the previous paragraph is different from the parallel computation cost defined in Section 2.1. Being able to take into account processor waiting time induced by internode dependency, parallel computation time may be of more practical interest than parallel computation cost.
However, to prove Theorem 2, we need to establish an upper bound on the parallel computation cost of the algorithm. We will do this and also establish an upper bound on the communication cost of the algorithm.
Proof of Theorem 2. To achieve the $\lceil N/p \rceil$ upper bound on parallel computation cost, we will need to add a fair scheduling feature to the algorithm described above. Whenever the number of nodes in the GP is smaller than the number of idle processors, we will select the active processors for the next time step from all the $p$ processors in a fair way. That is, processors take turns to become active using a round-robin scheme. This ensures that, at the end of any time step, the total number of nodes expanded by a processor so far will not exceed that expanded by any other processor by more than one. Thus when all the $N$ nodes are expanded, each processor will have expanded at most $\lceil N/p \rceil$ nodes. This proves that the parallel computation cost of the scheduling algorithm with the fair scheduling feature is at most $\lceil N/p \rceil$.
The communication cost of the algorithm is at most the number of frontier nodes entering the GP, as this represents the only interprocessor communication activity for the entire algorithm. Since by using depth-first search each processor has at most $d$ local nodes at each level (as illustrated in Figure 1), the GP can collect at most $pd$ nodes each time that gl increases. This will happen at most $h$ times, so the total number of nodes entering the GP is bounded above by $pdh$.
Note that in a practical implementation, the fair scheduling feature may not be used since minimizing parallel computation cost may not be important. Without the fair scheduling feature, the parallel computation cost would become $\lceil N/p + h \rceil$. However, the communication cost can be reduced to $p(d-1)h$, if a processor right after expanding a node will schedule one child, if any, of the node for expansion at the next time step.
The scheduling algorithm described in this section is being used as a basis for developing a parallel programming model for D&C computations. To obtain practical insights, we plan to implement a programming system based on the model on the 26-host Nectar network system [1] developed at Carnegie Mellon University.

A Simplified Version of Theorem 1
This section presents Theorem 3 (see below), which is a simplified version of Theorem 1 dealing with only two processors. A relatively simple proof of Theorem 3 is given. This simple proof captures the essence of the more complicated proof of Theorem 1 given in Section 5. The reader is advised to read this simple proof first to understand the ideas.
Note that restrictions S1 and S2 correspond to those in Theorem 1. Restriction S3 is for a minor technical convenience, namely, ensuring that $h'$ is an integer.
Theorem 3 implies, for example, that if the communication cost is small (in the sense that Q2 does not hold), then the parallel computation cost must be large (in the sense that Q1 holds). In particular, if $C_A(H) < h'(d-1)$ and if $3dh \ll N$, then the parallel computation cost will be close to $N$.
Proof of Theorem 3. Suppose that we are given a scheduling algorithm $A$ for performing a D&C computation on processors $P_1$ and $P_2$. For algorithm $A$, we will prove the existence of an $(N, h, d)$-tree $H$ for which at least one of Q1 and Q2 must hold.
By playing an adversary game with algorithm $A$, we will construct the tree by growing it from the root one step at a time. A time step consists of two phases, the node scheduling phase and the node expansion phase. In the node scheduling phase, algorithm $A$ schedules a node or no node for each processor to execute. Then, in the node expansion phase, these scheduled nodes are expanded. In this phase we will determine the number of children each scheduled node will generate. We will first define a special class of subtrees which will be used to describe some sufficient conditions under which a tree can grow to an $(N, h, d)$-tree. We will then give the main part of the proof, including a description of the tree construction procedure.

HFD-Subtree
Definition 1 At any given time during the tree construction, a High-and-Full-Degree subtree (abbr. HFD-subtree) is a subtree which is rooted at a node at or above level $h - \lceil \log_d N \rceil$, and which has been constructed using the following rules:

A1. nodes above level $h$ generate $d$ children; and
A2. nodes at level $h$ generate no children.
Note that rules A1 and A2 imply that a node which is above level $h$ and has no children must be a frontier node.
Lemma 1 At any given time during the tree construction, if the current tree satisfies the following four properties:
I1. the total number of generated nodes is at most $N - h - d$ (generated nodes include the root);
I2. the height is at most $h$;
I3. the degree of any node is at most $d$; and
I4. the tree contains an HFD-subtree,
then a construction procedure can be devised to grow the tree to an $(N, h, d)$-tree.
Proof. We first note that in the HFD-subtree of I4 there exist nodes which are above level $h$ and have no children. Otherwise, the subtree would have been "fully grown" to level $h$, according to rules A1 and A2. Since its root is at or above level $h - \lceil \log_d N \rceil$, this fully grown HFD-subtree would have at least $d^{\lceil \log_d N \rceil}$ $(\ge N)$ nodes. This contradicts I1. As noted above, those nodes in the current HFD-subtree which are above level $h$ and have no children must all be frontier nodes.
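The counting step in this proof can be checked numerically. The following sketch uses hypothetical helper names and integer arithmetic only; it assumes a fully grown subtree whose root sits exactly $\lceil \log_d N \rceil$ levels above level $h$, and confirms that its bottom level alone already holds at least $N$ nodes.

```python
def ceil_log(n, d):
    """Smallest integer k with d**k >= n, i.e. ceil(log_d n), computed without floats."""
    k, power = 0, 1
    while power < n:
        power *= d
        k += 1
    return k

def fully_grown_size(depth, d):
    """Number of nodes in a full d-ary tree whose leaves are `depth` levels below its root."""
    return sum(d ** i for i in range(depth + 1))

for n, d in [(10, 2), (100, 3), (1000, 4)]:
    k = ceil_log(n, d)
    print(n, d, k, d ** k, fully_grown_size(k, d))
```

Because such a subtree alone would exceed the node budget in I1, some node above level $h$ in the HFD-subtree must still be an unexpanded frontier node.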
Let $H_1$ be the current tree. We will identify a set of "padding nodes" which can be added to $H_1$ to make it an $(N, h, d)$-tree.
If $H_1$ has height less than $h$ or degree less than $d$, we will grow it by extending the current HFD-subtree from one of its frontier nodes which are above level $h$. Let $v$ be this frontier node, as shown in Figure 2.
We generate $d$ children for $v$ and create a path from $v$ to a node at level $h$, as shown in Figure 2 (a). The resulting tree, called $H_2$, has height $h$, degree $d$, and no more than $N$ nodes. If $H_2$ has fewer than $N$ nodes, we will pad it with nodes in the fully grown HFD-subtree which are reachable from the current frontier nodes and other padding nodes, as illustrated in Figure 2 (b). Since the fully grown HFD-subtree has at least $N$ nodes, it has sufficient nodes which can be added to $H_2$ to make it an $(N, h, d)$-tree.
After having identified all these padding nodes, we now have a "blueprint" for a construction procedure to follow. More precisely, the construction procedure will just generate all those padding nodes in the dark region in Figure 2 (b). □

Main Part of Proof of Theorem 3
The tree construction procedure consists of three stages. Each stage uses an independent set of rules in constructing the tree.
The following shows that C1 or C2 must become true sometime, i.e., $T_2$ exists. Recall that by the end of stage 1 processor $P_1$ has generated at least $h'(d-1)$ frontier nodes. In stage 2 processor $P_1$ will generate nodes in the subtrees rooted at those frontier nodes which are still in $P_1$. For each of these subtrees, since its root is in area 1 of Figure 3, the subtree can have at least $N - h - 2d$ nodes unless some of these nodes are moved to processor $P_2$ from processor $P_1$. If C1 does not hold, then fewer than $h'(d-1)$ nodes can be moved from $P_1$ to $P_2$. Consequently, some subtree will have at least $N - h - 2d$ nodes, and thus C2 will be true.
Stage 3 starts right after time $T_2$. Lemma 2 below shows that properties I1-I4 of Lemma 1 hold for the tree at time $T_2$. In stage 3, we follow the procedure described in the proof of Lemma 1 to grow the tree to an $(N, h, d)$-tree.

Lemma 2 At any time in stage 1 or 2, including time $T_2$, the tree satisfies properties I1-I4 of Lemma 1.
Proof. It is obvious from the descriptions of stages 1 and 2 that I2 and I3 are satisfied. For I1, we note that the total number of nodes generated in stage 1 is at most $(2h' + 1)d + 1$, and thus at most $N - h - d$ by restriction S1 of Theorem 3. In stage 2, I1 obviously holds when C2 is not true. Suppose that C2 becomes true at time $T_2$. Since the tree has no more than $N - h - 2d$ nodes in the previous time step and since at most $d$ nodes can be generated (in processor $P_1$) in one time step, there are at most $N - h - d$ nodes at time $T_2$.
Property I4 clearly holds for stage 1 by examining its description. It remains to prove that I4 holds for stage 2. The proof is similar to the earlier proof of the fact that C1 or C2 must become true in stage 2. Recall that in stage 1 processor $P_1$ has generated at least $h'(d-1)$ frontier nodes. We note that any of the subtrees rooted at these nodes is an HFD-subtree if the subtree does not contain any expanded cross node. Since the number of cross nodes expanded (not just scheduled) through time $T_2$ is less than $h'(d-1)$, one of these subtrees must be an HFD-subtree. Note that if C2 becomes true at time $T_2$ (in the node scheduling phase), the node scheduled has not been expanded.

□
To complete the proof of Theorem 3, we observe that if C1 becomes true at some time in stage 2 or 3, it will remain true for the rest of the tree construction process. Therefore property Q2 of Theorem 3 will hold for the final $(N, h, d)$-tree. Now assuming that C1 never holds at any time in stage 2 or 3, we want to show that property Q1 of Theorem 3 will hold for the final $(N, h, d)$-tree. We derive an upper bound on the total number of nodes expanded by processor $P_2$. The upper bound is the sum of four terms $U_1$, $U_2$, $U_3$, and $U_4$. In stage 1, processor $P_2$ has expanded at most $U_1 = 2h' + 1$ nodes. At time $T_1$, processor $P_2$ can have generated up to $U_2 = (h' + 1)(d - 1) + 1$ frontier nodes, each of which can be expanded at most once by processor $P_2$ in stage 2 or 3. It is also possible for processor $P_2$ to expand nodes which are generated by $P_1$ but subsequently moved to $P_2$. The total number of these nodes is at most $C_A(H) \le U_3 = h'(d-1)$. Moreover, to take care of the nodes generated after $T_2$ in stage 3, processor $P_2$ may expand up to $U_4 \le h + 2d$ nodes. Therefore the total number of nodes expanded by processor $P_2$ is at most $U = U_1 + U_2 + U_3 + U_4 \le 3dh$. This implies that processor $P_1$ has expanded at least $N - U \ge N - 3dh$ nodes; that is, property Q1 holds. □

Proof of Theorem 1
Suppose that we are given a scheduling algorithm $A$ for performing a D&C computation on a parallel system of $p$ processors. For algorithm $A$, we will prove the existence of an $(N, h, d)$-tree $H$ for which either only $p'$ processors are active for expanding most of the nodes (at least $N'$ nodes) or at least $C'$ nodes are moved between processors to balance their computation loads. For the former, the parallel computation cost will be high, i.e., $T_A(H) \ge N'/p'$ (property P1). For the latter, the number of cross nodes will be large, i.e., $C_A(H) \ge C'$ (property P2).
By playing an adversary game with algorithm $A$, we will construct the tree by growing it from the root one step at a time. The definition of time step is the same as that in the proof of Theorem 3.
We will give some more definitions in Section 5.1 and then give the main part of this proof in Section 5.2. All the related lemmas are in Section 5.3.

Definitions
To help derive a lower bound on the number of cross nodes, we introduce the following relation between subtrees.
Definition 2 A set of subtrees is processor-or-ancestry independent (abbr. PA-independent) if for each pair of subtrees in the set at least one of the following two properties is satisfied:
1. Processor Independence: the roots of these two subtrees are generated on different processors;
2. Ancestry Independence: neither is a subtree of the other. That is, there is no ancestor-descendant relationship between the two roots.
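Definition 2 translates directly into a pairwise check. The sketch below is illustrative only: the tree is represented by a hypothetical `parent` dictionary (the root maps to `None`), and `generated_on` is an assumed mapping from each node to the processor that generated it.

```python
from itertools import combinations

def is_ancestor(parent, a, b):
    """True if node a is a proper ancestor of node b."""
    v = parent.get(b)
    while v is not None:
        if v == a:
            return True
        v = parent.get(v)
    return False

def pa_independent(roots, parent, generated_on):
    """Check Definition 2 for the subtrees rooted at the given nodes."""
    for r1, r2 in combinations(roots, 2):
        processor_indep = generated_on[r1] != generated_on[r2]
        ancestry_indep = (not is_ancestor(parent, r1, r2)
                          and not is_ancestor(parent, r2, r1))
        if not (processor_indep or ancestry_indep):
            return False
    return True

# Tree: 0 -> {1, 2}, 1 -> {3}. Node 3 is generated on processor 1, so node 1
# (generated on processor 0) must have been expanded there as a cross node.
parent = {0: None, 1: 0, 2: 0, 3: 1}
generated_on = {0: 0, 1: 0, 2: 0, 3: 1}
print(pa_independent([1, 2], parent, generated_on))   # siblings: ancestry independent
print(pa_independent([1, 3], parent, generated_on))   # nested, but different processors
print(pa_independent([0, 1], parent, generated_on))   # nested and same processor
```

The second case is the one that matters for the lower bound: two nested roots can only be PA-independent if a cross node lies on the path between them.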
Note that for two PA-independent subtrees rooted at nodes $r_1$ and $r_2$, if node $r_1$ is an ancestor of node $r_2$, then the two nodes must be generated on different processors. This implies that there must exist at least one cross node on the path from node $r_1$ (inclusive) to the parent (inclusive) of node $r_2$. Therefore, from this property, if there are $k$ PA-independent subtrees each of which has at least one expanded cross node, then there are at least $k$ expanded cross nodes in the tree. This is shown in Lemma 3 (in Section 5.3).
Definition 3 An HFDC-subtree is an HFD-subtree (as defined in Definition 1) or a subtree with at least one cross node already expanded. If the root of an HFDC-subtree is generated on processor $P$, the subtree is called an HFDC-subtree on processor $P$.
By Lemma 3 and Definition 3, if there are $k$ PA-independent HFDC-subtrees and fewer than $k$ expanded cross nodes, then there exists an HFD-subtree, as shown in Lemma 4. We will use this lemma to show the existence of an HFD-subtree during some periods of the tree construction procedure.

Main Part of Proof of Theorem 1
The tree construction procedure, like that in Section 4, consists of three stages. Basically, this procedure, summarized in Figure 4, is similar to that in Section 4, except that in stage 1 we use more sophisticated rules to prove a better lower bound on the number of cross nodes. (Note that if $h \gg \log_d N$ and $p = 2$, the lower bound on the communication cost in this theorem is approximately twice as large as that in Theorem 3.)

Figure 4 (tree construction procedure):
Stage 1. Apply the following four rules:
R1. Nodes in area 1 (shown in Figure 5) will generate $d$ children.
R2. Cross nodes in areas 2 and 3 (shown in Figure 5) will not generate any children.
R3. Non-cross nodes in areas 2 and 3 (excluding level $h$) will generate $d$ children.
R4. Nodes at level $h$ will not generate any children.
Repeat rules R1-R4 until time $T_1$ when any of the following three conditions holds:
C1. For some $p'$ processors, at least $h'$ non-cross nodes have been expanded on each processor.
C2. At least $C'$ cross nodes have been scheduled.
C3. At least $N - (pd + d + h)$ nodes have been generated.
Stage 2 (continued from time $T_1$ when C1 holds). Find a set $\Gamma$ of $p'$ processors with the following two properties:
B1. There are at least $C'$ PA-independent HFDC-subtrees in $\Gamma$.
B2. There are at most $h'$ non-cross nodes expanded on each of the other $p - p'$ processors in the set $\bar{\Gamma}$.
R5. Nodes (excluding those at level $h$) in $\Gamma$ will generate $d$ children.
R6. Nodes in $\bar{\Gamma}$ will not generate any children.
R7. Nodes at level $h$ will not generate any children.

In stage 1, we will repeatedly apply rules R1-R4 (in Figure 4) until time $T_1$ when one of the conditions C1-C3 holds. Rules R1-R4 ensure that each subtree rooted in area 1 or 2 is always an HFDC-subtree because in constructing the subtree either rules A1 and A2 are followed (using R1, R3, and R4) or some cross nodes are expanded (using R2). Basically, the procedure in stage 1 attempts to produce at least $C'$ PA-independent HFDC-subtrees on some $p'$ processors (property B1) while preventing each of the other $p - p'$ processors from expanding more than $h'$ non-cross nodes (property B2). (Recall that in the proof of
In stage 2, we will repeatedly apply rules R5-R7 until time $T_2$ when condition C4 or C5 holds. (Note that these rules are exactly the same as those of stage 2 in Section 4.) According to property B1, initially, there are at least $C'$ PA-independent HFDC-subtrees in $\Gamma$. In stage 2, these subtrees continue to be HFDC-subtrees, because either rules A1 and A2 are followed (using R5 and R7) or some cross nodes are expanded (using R6). In addition, by rule R6, the set
Lemma 3 If there are $k$ PA-independent subtrees, each of which has at least one expanded cross node, then there are at least $k$ expanded cross nodes in the tree.
Proof. This proof is not trivial because, among these subtrees, those with an ancestry relationship may contain the same expanded cross node.
In this proof, we will prune the $k$ PA-independent subtrees one by one under the restriction that the subtree being pruned contains no other subtrees which have not been pruned yet. (For the example illustrated in Figure 7, we can prune the subtrees in the order: $\tau_4$, $\tau_3$, $\tau_2$, and $\tau_1$.) For this proof, it suffices to prove that each pruned subtree has at least one expanded cross node. Initially, the first pruned subtree obviously has at least one expanded cross node by the assumption of the lemma. As mentioned in Section 5.1, for any two PA-independent subtrees $\tau$ and $\tau'$ rooted at nodes $r$ and $r'$ respectively, if $r$ is an ancestor of $r'$, there must exist at least one expanded cross node on the path from $r$ (inclusive) to the parent (inclusive) of $r'$ due to processor independence. Therefore, if we prune $\tau'$ at $r'$, $\tau$ still has at least one expanded cross node.
[Figure 7 legend: expanded cross node; $\tau_1$, $\tau_2$, $\tau_3$, $\tau_4$: PA-independent subtrees.]
Hence, after we prune each subtree under the above restriction, each of the remaining subtrees still has at least one expanded cross node. This implies that the next pruned subtree also has at least one expanded cross node. So, each pruned subtree has at least one expanded cross node. □

Lemma 4 At some time, if there are k PA-independent HFDC-subtrees and fewer than k expanded cross nodes, there exists an HFD-subtree.

Proof. Assume that there exists no HFD-subtree. Then each of these PA-independent HFDC-subtrees has at least one expanded cross node, according to the definition of HFDC-subtree. By Lemma 3, there are at least k expanded cross nodes. This contradicts the assumption of the lemma. □

Lemma 5 In stage 1, if a processor has expanded h' non-cross nodes, then there are at least $\kappa$ PA-independent HFDC-subtrees on the processor.
Proof. As mentioned in Section 5.2, each subtree rooted in area 1 or 2 is always an HFDC-subtree in stage 1. Thus it suffices to prove that at least $\kappa$ nodes with ancestry independence in areas 1 and 2 will be generated on the processor after h' non-cross nodes have been expanded. By rules R1-R3, for any non-cross node, all of its ancestors in area 2 (with $h' + 1$ levels) must be non-cross nodes, as shown in Figure 8. So, all the nodes generated by the first h' non-cross nodes must be in areas 1 and 2. Since each of the h' non-cross nodes generates d children and can remove at most one ancestor, these non-cross nodes generate, in total, at least $(d - 1)h'$ $(= \kappa)$ nodes with ancestry independence. □

Lemma 6 At any time in stage 1 or 2, including time $T_1$ or $T_2$, the tree satisfies properties I1-I4 of Lemma 1.
Proof. It is obvious from rules R1-R7 that I2 and I3 are satisfied. In addition, it is also obvious that I1 holds before condition C3 or C5 becomes true. Consider the first time step when at least $N - (pd + h + d)$ nodes have been generated (i.e., condition C3 or C5 holds). Since the tree has no more than $N - (pd + h + d)$ nodes in the previous time step and since at most pd nodes are generated in each time step, there are at most $N - h - d$ nodes in the current time step.
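The node-count bound in this step can be written out as a short worked inequality. The symbols $n_{t-1}$ and $n_t$ (the number of generated nodes in the previous and current time step) are notation assumed here for illustration; they do not appear in the paper.

```latex
\begin{align*}
n_{t-1} &\le N - (pd + h + d)
  && \text{(the threshold was not yet reached)} \\
n_t &\le n_{t-1} + pd
  && \text{(at most $pd$ nodes are generated per time step)} \\
    &\le N - (pd + h + d) + pd \\
    &= N - h - d .
\end{align*}
```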
In the rest of this proof, we will show that I4 always holds (i.e., there always exists an HFD subtree) in each stage.
In stage 1, all the nodes in area 1 generate d nodes by rule R1. So, before all the nodes in area 1 have been expanded, there must exist a frontier node in area 1 whose subtree (consisting of only that node) is an HFD-subtree. After all the nodes in area 1 are expanded, there are at least $d^{\lceil \log_d pdh \rceil} \ge pdh \ge C'$ subtrees rooted at the top level of area 2. Obviously, these subtrees are PA-independent. They are also HFDC-subtrees, because each subtree rooted in area 1 or 2 in stage 1 is always an HFDC-subtree, as described in Section 5.2. Since the number of expanded cross nodes is always less than C' (due to condition C2), there has always been an HFD-subtree up to time $T_1$ by Lemma 4. Thus, we can conclude that there always exists an HFD-subtree in stage 1.
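The count of subtrees at the top level of area 2 relies on the fact that raising d to the ceiling of a base-d logarithm never undershoots the argument. A minimal Python sketch of that check follows; the function names `ceil_log` and `top_level_subtrees` are hypothetical, and the sketch assumes only that the count has the form $d^{\lceil \log_d pdh \rceil}$ as stated above. Integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def ceil_log(base, x):
    """Smallest integer e >= 0 such that base**e >= x (integers only)."""
    e, power = 0, 1
    while power < x:
        power *= base
        e += 1
    return e


def top_level_subtrees(p, d, h):
    """Evaluate d ** ceil(log_d(p * d * h)) for integers p, h >= 1, d >= 2."""
    return d ** ceil_log(d, p * d * h)


def bound_holds(p, d, h):
    """Check p*d*h <= d**ceil(log_d(pdh)) < d * (p*d*h)."""
    count = top_level_subtrees(p, d, h)
    return p * d * h <= count < d * p * d * h
```

For instance, with p = 4, d = 3, and h = 5 the product pdh is 60 and the smallest power of 3 that reaches it is 81.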
In stage 2, initially, there are at least C' PA-independent HFDC-subtrees in $\Gamma$ (property B1). These subtrees continue to be HFDC-subtrees in this stage, as described in Section 5.2. In stage 2, due to condition C4, the number of expanded cross nodes is always less than C'; so, there always exists an HFD-subtree by Lemma 4. □

Corollary 2
For any scheduling algorithm A for a parallel system of p processors, for all N, h, and d with restrictions S1 and S2 as defined in Theorem 1, $T_A \cdot (C_A + \kappa) \ge N' \cdot \kappa$, where N' and $\kappa$ are defined in Theorem 1.

[...] Therefore, $T_A \cdot (C_A + \kappa) > (N'/p^*) \cdot p^* \kappa = N' \kappa$.

Theorem 2 A scheduling algorithm A can be devised such that the parallel computation cost is $T_A = T_{min}$ and the communication cost is $C_A \le C_U$ $(= pdh)$ for any (N, h, d)-tree.

Figure 1: At most d frontier nodes at each level on a processor (d = 3).

Theorem 3
For each scheduling algorithm A for a parallel system of two processors, and for each N, h, and d with the following three restrictions, S1. $N > 3dh$, S2. $h > \lceil \log_d N \rceil + 2$, and S3. $h - \lceil \log_d N \rceil - 2$ is an even integer, there exists some (N, h, d)-tree H for which at least one of the following two properties is true: Q1. the parallel computation cost of the algorithm is $T_A(H) \ge N - 3dh$; Q2. the communication cost of the algorithm is $C_A(H) \ge h'(d - 1)$,

Figure 2: Growing the current tree to an (N, h, d)-tree.

Figure 3: Two areas in the constructed tree.

In stage 1, we expand each node with exactly d children. Stage 1 terminates at time $T_1$, when a total of $2h'$ or $2h' + 1$ nodes have just been expanded. (Note that at this time the tree is completely inside area 1 of Figure 3.) Since the number of frontier nodes increases by $d - 1$ each time a node is expanded, there are exactly $2h'(d - 1) + 1$ or $(2h' + 1)(d - 1) + 1$ frontier nodes at time $T_1$. Without loss of generality, we assume that processor $P_1$ has generated at least $h'(d - 1)$ frontier nodes. Stage 2 starts right after $T_1$. In this stage every node above level h expanded by processor $P_1$ will have [...]
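The frontier count used for stage 1 (one root, and a net gain of d - 1 frontier nodes per expansion) can be checked with a small simulation. This is a minimal Python sketch under stated assumptions: the function name `frontier_after` is hypothetical, and expanding nodes in breadth-first order is an arbitrary choice made here, since the count does not depend on which frontier node is expanded.

```python
from collections import deque


def frontier_after(expansions, d):
    """Expand `expansions` frontier nodes, each with exactly d children,
    and return the number of frontier (unexpanded) nodes that remain."""
    frontier = deque([0])  # the tree starts as a single root node
    next_id = 1
    for _ in range(expansions):
        frontier.popleft()           # the expanded node leaves the frontier
        for _ in range(d):           # its d children join the frontier
            frontier.append(next_id)
            next_id += 1
    return len(frontier)


def matches_closed_form(h_prime, d):
    """Compare the simulation with 2h'(d-1)+1 and (2h'+1)(d-1)+1."""
    even_case = frontier_after(2 * h_prime, d) == 2 * h_prime * (d - 1) + 1
    odd_case = frontier_after(2 * h_prime + 1, d) == (2 * h_prime + 1) * (d - 1) + 1
    return even_case and odd_case
```

Because the frontier holds at least $2h'(d - 1) + 1$ nodes split between two processors, one of them must have generated at least $h'(d - 1)$ of them, which is the pigeonhole step behind the "without loss of generality" assumption.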

[...] subtrees rooted at frontier nodes at time $T_1$ are PA-independent HFDC-subtrees.) If condition C1 holds at time $T_1$, then from Figure 6 we can find a set $\Gamma$ of p' processors for which condition C1 and property B2 hold. According to Lemma [...]

[Figure legend: each processor expands exactly h' non-cross nodes; each processor expands at least h' non-cross nodes; each processor expands fewer than K non-cross nodes.]

Figure 6: Around the time when condition C1 becomes true.

Figure 8: In stage 1, any non-cross node's ancestors in area 2 must have been generated on the same processor.

Corollary 1 For each scheduling algorithm for a parallel system of p processors, for each positive $\epsilon_C < 1$, which can be arbitrarily close to 0, there are values of N, h, d, p, and q (> 0), for which if the parallel computation cost is between [...] and $(1 + \epsilon_T)$ [...], then the communication cost must be at least (1 - [...]
Let $p \ge$ [...] and $d \ge$ [...]. Then, let $\epsilon_T = 1$ [...]. □

3 Relevant Lemmas Lemma 3
[...] If C2 or C3 holds at $T_1$, this implies that stage 2 is empty.) Since Lemma 6 also shows that properties I1-I4 of Lemma 1 hold for the tree at time $T_1$ or $T_2$, in stage 3 we follow the procedure described in the proof of Lemma 1 to grow the tree to [...]

Moreover, to take care of the nodes generated after $T_2$, processors in $\bar{\Gamma}$ (the processors not in $\Gamma$) may expand up to $U_5 \le pd + d + h$ nodes. Therefore, the total number of nodes expanded in $\bar{\Gamma}$ is at most $U = U_1 + U_2 + U_3 + U_4 + U_5 \le 3pd^2h$. This implies that the processors in $\Gamma$ have expanded at least $N - U = N -$ [...]

Since condition C1 does not hold in stage 1, we can find a set $\Gamma$ of p' processors with property B2 (see also Figure 6). Since stage 2 is empty in this case, we can let time $T_2$ be the same as $T_1$. Thus, we can use the same technique as above to prove that property P1 holds.

Suppose that there are k PA-independent subtrees at some time during the computation. If each of these subtrees has at least one expanded cross node, then the total number of expanded cross nodes in the whole tree constructed so far is at least k.
[...] an (N, h, d)-tree H. To complete the proof, we observe that if $C_A(H) \ge C'$, it will remain true for the rest of the tree construction process. Therefore property P2 of Theorem 3 will hold for H. Now, assuming that $C_A(H) < C'$, we want to prove that property P1 holds for H. Since C2 and C4 never hold, either C3 will become true at time $T_1$ or C5 will become true at time $T_2$.

First, suppose that condition C5 becomes true at time $T_2$. To prove that property P1 holds in this case, we derive an upper bound on the total number of nodes expanded in $\bar{\Gamma}$ (the processors not in $\Gamma$). The upper bound consists of five terms $U_1$, $U_2$, $U_3$, $U_4$, and $U_5$. Assume that there are $C_1 < U_1 = C'$ cross nodes expanded in $\bar{\Gamma}$ in stage 1. In stage 1, the processors in $\bar{\Gamma}$ have expanded at most $U_2 = (p - p')h'$ non-cross nodes due to property B2. These nodes expanded in stage 1 will generate at most $U_3 = ((p - p')h' + C_1)d$ [...]