New algorithms and lower bounds for the parallel evaluation of certain rational expressions

This paper presents new algorithms for the parallel evaluation of certain polynomial expressions. In particular, for the parallel evaluation of x^n, we introduce an algorithm which takes two steps of parallel division and ⌈log₂ n⌉ steps of parallel addition, while the usual algorithm takes ⌈log₂ n⌉ steps of parallel multiplication. Hence our algorithm is faster than the usual algorithm when multiplication takes more time than addition. Similar algorithms for the evaluation of other polynomial expressions are also introduced. Lower bounds on the time needed for the parallel evaluation of rational expressions are given. All the algorithms presented in the paper are shown to be asymptotically optimal. Moreover, we prove that by using parallelism the evaluation of any first order rational recurrence, e.g., y_{i+1} = (y_i + a/y_i)/2, and any non-linear polynomial recurrence can be sped up at most by a constant factor, no matter how many processors are used.

1) To run the algorithms, each processor is either masked or performing the same operation at any time. Hence the algorithms can be run on single-instruction stream, multiple-data stream (SIMD) machines (Flynn [66]) such as ILLIAC IV.
2) The algorithms require a very simple interconnection pattern: all we need is a binary tree network between processors.
Hence, for most machine organizations, we should not expect any significant delay caused by communication between processors.
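The binary tree network suffices because the only collective operation the algorithms need is a fan-in sum of n values, which a tree performs in ⌈log₂ n⌉ parallel addition steps. The following minimal sketch simulates those parallel rounds sequentially (the function name and structure are ours, purely illustrative):

```python
def tree_sum(values):
    """Sum n values the way a binary tree network would: in each round every
    processor adds one disjoint pair, so n values are reduced in
    ceil(log2 n) parallel addition steps (simulated sequentially here)."""
    vals = list(values)
    steps = 0
    while len(vals) > 1:
        # One parallel step: all adjacent pairs are combined simultaneously.
        pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:          # an odd element passes through unchanged
            pairs.append(vals[-1])
        vals = pairs
        steps += 1
    return vals[0], steps
```

For eight values the reduction takes exactly three rounds, matching ⌈log₂ 8⌉ = 3.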
We also prove lower bounds on the time needed for the parallel evaluation of certain rational expressions, under the assumption that all processors can perform different operations at any time.
This assumption corresponds to multiple-instruction stream, multiple-data stream (MIMD) machines (Flynn [66]) such as C.mmp, the multi-mini-processor system currently under construction at Carnegie-Mellon University (Wulf and Bell [72]). It is clear that optimal algorithms with respect to MIMD machines must also be optimal with respect to SIMD machines. The lower bounds obtained in the paper imply that the algorithms presented in the paper are asymptotically optimal with respect to MIMD machines, although most of these algorithms can be run on SIMD machines, as noted above.

Furthermore, these lower bounds imply that, by using parallelism, the evaluation of an expression defined by any first order rational recurrence or any non-linear polynomial recurrence can be sped up at most by a constant factor, no matter how many processors are used. Consider, for example, the evaluation of y_n defined by the recurrence y_{i+1} = (y_i + a/y_i)/2, which is the well-known recurrence for approximating √a. We show that for evaluating y_n any parallel algorithm using any number of processors cannot be essentially faster than the obvious sequential algorithm. Thus the theory for non-linear recurrences is completely different from the theory for linear recurrences, where good speed-ups have been obtained (for example, Heller [73], Kogge [72], Kogge and Stone [72], Maruyama [73], Munro and Paterson [73] and Stone [73a]).

Suppose that we have a problem for which multiplication is much more expensive than addition and that we want to minimize the number of multiplications and divisions. Lower bounds on the time needed for the multiplications and divisions are also derived.
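To see why such a recurrence resists parallelism, here is the √a recurrence as a short sketch (the helper name is ours, purely illustrative): each iterate needs the previous one before it can start, so the k steps form a critical path of length k no matter how many processors are available.

```python
def newton_sqrt(a, y0, k):
    """k steps of the first order rational recurrence y_{i+1} = (y_i + a/y_i)/2,
    the well-known iteration for approximating sqrt(a).  Each step depends on
    the value produced by the previous step, so the k steps form a sequential
    critical path regardless of the number of processors."""
    y = y0
    for _ in range(k):
        y = (y + a / y) / 2.0
    return y
```

Starting from y_0 = 1, six steps already approximate √2 to machine precision, but the six divisions must still be performed one after another.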
In the next section, we give basic definitions and an abstract formulation of our problem.
In Section 3 we derive algorithms for the parallel evaluation of various expressions. Lower bound results are given in Section 4. The final section deals with results on non-linear recurrences.

2. ABSTRACT FORMULATION AND DEFINITIONS
Let F be a commutative and algebraically closed field, e.g., F is the field C of complex numbers. Let F[x] and F(x) be the ring of polynomials and the field of rational expressions in x over F, respectively. Our task is to evaluate a set of polynomials in F[x], {f_1(x), f_2(x), ..., f_m(x)}, under the following assumptions:
1) the operations allowed are addition, subtraction, multiplication and division;
2) A, M and D denote the times required by one addition (or subtraction), one multiplication and one division, respectively, and A_s, M_s, D_s denote the corresponding times for scalar operations, i.e., operations involving an element of F;
3) k processors are available.
If the positive integer k in 3) is greater than one, we say {f_1(x), ..., f_m(x)} is to be evaluated in parallel. T_k denotes the minimal time needed with k processors.

(This research was supported in part by the National Science Foundation under Grant GJ32111 and the Office of Naval Research under Contract N00014….)

Each of the algorithms minimizes the time needed for the multiplications to within a constant and can be shown to be faster than the best previously known algorithm for large n. Moreover, all the algorithms, except the one associated with Theorem 3.4, have the two characteristics noted in Section 1.

In the proof of the following theorem we give an algorithm for the parallel evaluation of x^n which uses divisions and which takes time less than ⌈log n⌉M when n is large.

Algorithm 3.1. [An algorithm for the parallel evaluation of x^n by using divisions.]
1) Compute A_i ← x - r_i, i = 1, ..., n, in parallel, where the r_i are in F and are the n distinct zeros of x^n - r for any non-zero element r of F;
2) Compute B_i ← s_i/A_i, i = 1, ..., n, in parallel, where the s_i are the scalars given by the partial-fraction decomposition 1/(x^n - r) = ∑_i s_i/(x - r_i);
3) Compute C ← ∑_{i=1}^n B_i in parallel;
4) Compute x^n ← r + 1/C.

Since
lim_{n→∞} ⌈log n⌉M / (⌈log n⌉A + 2(A_s + D_s)) = M/A,
we have sped up the evaluation of x^n by a factor of M/A for large n.

Remarks on Algorithm 3.1.
1) The choice of r in step 1 depends on the application of the algorithm. For instance, if the algorithm is used to compute A^n for a real matrix A, then the number r should be chosen such that A - r_i I is non-singular for all i; otherwise the algorithm would break down at step 2, where we have to compute s_i(A - r_i I)^{-1} for all i.
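Algorithm 3.1 can be sketched numerically as follows. This is a reconstruction under the assumption that the scalars s_i come from the residues of 1/(x^n - r), namely s_i = r_i/(nr); with r = 1 the r_i are simply the nth roots of unity. The function name is ours, not the paper's.

```python
import cmath

def power_via_divisions(x, n, r=1.0):
    """Evaluate x**n along the lines of Algorithm 3.1 (reconstructed):
    pick a non-zero r, let r_1, ..., r_n be the n distinct zeros of z**n - r,
    and use 1/(x**n - r) = sum_i s_i/(x - r_i) with scalar s_i = r_i/(n*r).
    Only additions and two division steps involve x itself."""
    # The n distinct zeros of z**n - r (for r = 1: the nth roots of unity).
    roots = [r ** (1.0 / n) * cmath.exp(2j * cmath.pi * i / n) for i in range(n)]
    # Step 1 (one parallel addition step): A_i = x - r_i.
    A = [x - r_i for r_i in roots]
    # Step 2 (one parallel division step): B_i = s_i / A_i, s_i = r_i/(n*r).
    B = [(r_i / (n * r)) / a_i for r_i, a_i in zip(roots, A)]
    # Step 3 (ceil(log2 n) parallel addition steps): C = 1/(x**n - r).
    C = sum(B)
    # Step 4 (one more division and one addition): x**n = r + 1/C.
    return r + 1 / C
```

For real x the result carries only a negligible imaginary round-off component. The sketch fails if x coincides with one of the r_i, which is exactly why the choice of r matters (Remark 1).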
(Note that for matrix computation, in the algorithm divisions should be interpreted as matrix inversions, and the scalars r_i, r should be interpreted as r_i I, rI, respectively, where I is the identity matrix.)
2) The algorithm raises x to the nth power without using any multiplications but with two divisions. This may be surprising to those who are dealing only with sequential algorithms.

Algorithm 3.2. [An algorithm for the parallel evaluation of P(x) = (x + a_1)(x + a_2)⋯(x + a_n), where the a_i are distinct.]
1) Compute A_i ← x + a_i, i = 1, ..., n, in parallel;
2) Compute B_i ← b_i/A_i, i = 1, ..., n, in parallel, where b_i = [∏_{j≠i} (a_j - a_i)]^{-1};
3) Compute C ← ∑_{i=1}^n B_i in parallel;
4) Compute P(x) ← 1/C.

Corollary. If P(x) is the nth degree Chebyshev polynomial with respect to some interval, then
(3.3) T_n(P(x)) ≤ ⌈log n⌉A + A_s + 2D_s.
Proof. Since the zeros of P(x) are distinct and are known analytically, the corollary follows from Theorem 3.2. ∎

It is clear that after some obvious modifications of Algorithm 3.2, Theorem 3.2 can be extended to cover the general expression ∏_i (x + a_i)^{m_i}, where the a_i are distinct and the m_i are positive integers. Since it is straightforward, we will not give the details here.

There are several potential applications of Algorithms 3.1 and 3.2. For example, by using Algorithms 3.1 and 3.2 we can compute A^n and P(A), respectively, where A is a matrix and P(x) is some Chebyshev polynomial. A^n and P(A) can then be used to approximate the dominant eigenvectors of A. (See, for instance, Wilkinson [65, Chapter 9].) However, these applications do not fit the topic of this paper. They will be reported in another paper.

Lemma 3.1. If k ≥ ½n(n+1) - 1, then the set {x^2, x^3, ..., x^n} can be evaluated in two steps of parallel division and ⌈log n⌉ + 2 steps of parallel addition. More precisely,
(3.4) T_k(x^2, x^3, ..., x^n) ≤ ⌈log n⌉A + 2(A_s + D_s), provided k ≥ ½n(n+1) - 1.

Proof. We establish the lemma by exhibiting an algorithm.
Algorithm 3.3. [An algorithm for the parallel evaluation of {x^2, ..., x^n} by using at least ½n(n+1) - 1 processors.]
1) Assign i processors to the evaluation of x^i for each i = 2, ..., n, and use Algorithm 3.1 on each, except that step 4 of Algorithm 3.1 will not be performed for the evaluation of x^2, ..., x^{n-1} until the time when step 4 of Algorithm 3.1 is ready to be performed for the evaluation of x^n.
Clearly, the lemma follows from Algorithm 3.3. ∎

Moreover, if k ≥ n, then the set {x^2, x^3, ..., x^n} can be evaluated in five steps of parallel non-scalar multiplication or division and ⌈log n⌉ + 5 steps of parallel addition.
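A numerical sketch of Algorithm 3.2, assuming that step 4 recovers P(x) as 1/C via the identity 1/P(x) = ∑_i b_i/(x + a_i). The function name is ours, purely illustrative.

```python
def product_of_linear_factors(x, a):
    """Evaluate P(x) = (x+a_1)...(x+a_n) for distinct a_i as in Algorithm 3.2
    (reconstructed): 1/P(x) = sum_i b_i/(x + a_i), where the scalars
    b_i = [prod_{j!=i} (a_j - a_i)]**(-1) depend only on the a_i, not on x."""
    n = len(a)
    # Scalar precomputation of the partial-fraction coefficients b_i.
    b = []
    for i in range(n):
        prod = 1.0
        for j in range(n):
            if j != i:
                prod *= a[j] - a[i]
        b.append(1.0 / prod)
    # Step 1 (one parallel addition step): A_i = x + a_i.
    A = [x + a_i for a_i in a]
    # Step 2 (one parallel division step): B_i = b_i / A_i.
    B = [b_i / A_i for b_i, A_i in zip(b, A)]
    # Step 3 (ceil(log2 n) parallel addition steps): C = 1/P(x).
    C = sum(B)
    # Step 4 (one division): P(x) = 1/C.
    return 1.0 / C
```

As in Algorithm 3.1, the only non-scalar operations applied to x are additions and two division steps; no non-scalar multiplication occurs.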

More precisely,
(3.5) T_k(x^2, x^3, ..., x^n) ≤ ⌈log n⌉A + A + 4(A_s + D_s) + M.

The dominant term of the upper bound in (3.6) is 2⌈log n⌉A, while all other upper bounds we have derived so far have the dominant term ⌈log n⌉A (see (3.1)-(3.5)). In the following theorem we show that the upper bound in (3.6) may be improved to have ⌈log n⌉A as the dominant term by using 2n processors.

(It is clear that the tree is finite, since there is a positive lower bound on the time needed for every operation.) We note that if the binary operation associated with a node is a non-scalar addition, multiplication or division, then the two successors of the node must be leaves. Hence along each path of the tree there is at most one such operation.

(Non-linear recurrence problems occur in practice very often.) For example, it seems very difficult to use parallelism for the following non-linear recurrence.

If φ(x), ψ(x) ∈ F(x), then deg(φ∘ψ) = deg(φ)·deg(ψ). To see this, write φ = φ_1/φ_2, where
φ_1(x) = a(x - a_1)^{m_1} ⋯ (x - a_s)^{m_s},  φ_2(x) = (x - b_1)^{n_1} ⋯ (x - b_t)^{n_t},
where the a is in F, the a_i are distinct elements in F, the b_i are distinct elements in F, and the m_i, n_i are non-negative integers. Clearly, deg φ_1 = ∑ m_i and deg φ_2 = ∑ n_i. Since φ_1 and φ_2 are relatively prime, writing ψ = ψ_1/ψ_2 we obtain
φ(ψ(x)) = a(ψ_1(x) - a_1 ψ_2(x))^{m_1} ⋯ (ψ_1(x) - a_s ψ_2(x))^{m_s} / [(ψ_1(x) - b_1 ψ_2(x))^{n_1} ⋯ (ψ_1(x) - b_t ψ_2(x))^{n_t}],
up to a power of ψ_2(x) balancing the degrees of numerator and denominator.

It follows that the evaluation of an expression defined by any first order rational recurrence can be sped up at most by a constant factor, and that the evaluation of an expression defined by any non-linear polynomial recurrence can likewise be sped up at most by a constant factor. Consider, for example, the recurrence y_{i+1} = (y_i + a/y_i)/2.
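The degree argument behind these constant-factor bounds can be checked concretely: k steps of a degree-d polynomial recurrence produce, as an expression in the starting value, degree d^k, while a single arithmetic operation at most doubles the degree of its operands, so any parallel algorithm still needs time proportional to k. A small sketch (the helper names are ours, not the paper's):

```python
def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists (constant term first)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def poly_compose(p, q):
    """Coefficients of p(q(x)), by Horner's rule in the polynomial ring."""
    result = [0.0]
    for c in reversed(p):
        result = poly_mul(result, q)
        result[0] += c
    return result

def degree(p):
    """Degree of a coefficient list, ignoring trailing zeros."""
    d = len(p) - 1
    while d > 0 and p[d] == 0:
        d -= 1
    return d

# k steps of the non-linear recurrence y_{i+1} = y_i**2 + 1, expanded as a
# polynomial in the starting value y_0, have degree 2**k.  Since one operation
# at most doubles the degree, any parallel evaluation needs about k steps.
f = [1.0, 0.0, 1.0]            # f(y) = y**2 + 1, a degree-2 recurrence
g = f
for _ in range(3):             # g = f o f o f o f, i.e. k = 4 steps
    g = poly_compose(g, f)
```

After four steps the degree is 2^4 = 16, illustrating deg(φ∘ψ) = deg(φ)·deg(ψ) for the polynomial case.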