Parallel Algorithms for the Longest Common Subsequence Problem

Abstract-A subsequence of a given string is any string obtained by deleting none or some symbols from the given string. A longest common subsequence (LCS) of two strings is a common subsequence of both that is as long as any other common subsequence. The longest common subsequence problem is to find a longest common subsequence of two given strings. The bound on the complexity of this problem under the decision tree model is known to be mn if the number of distinct symbols that can appear in the strings is infinite, where m and n are the lengths of the two strings, respectively, and m ≤ n. In this paper, we propose two parallel algorithms for this problem on the CREW PRAM model. One takes O(log² m + log n) time with mn/log m processors, which is faster than all the existing algorithms on the same model. The other takes O(log² m log log m) time with mn/(log² m log log m) processors when log² m log log m > log n, or otherwise O(log n) time with mn/log n processors, which is optimal in the sense that the time × processors bound matches the complexity bound of the problem. Both algorithms exploit nice properties of the LCS problem that are discovered in this paper.


I. INTRODUCTION
A STRING is a sequence of symbols. Given a string, a subsequence of it is any string obtained by deleting none or some symbols from it. Section III concentrates on exploiting properties of the LCS problem, Section IV gives the faster algorithm, and Section V presents the optimal algorithm.

A. The LCS Problem via Grid-Directed Acyclic Graph
An l1 × l2 grid directed acyclic graph (DAG) is a DAG whose vertices are the grid points of an l1 × l2 grid. The only edges from vertex (i, j), the grid point on the ith row and the jth column, are to vertices (i, j + 1), (i + 1, j), and (i + 1, j + 1); they are referred to as horizontal, vertical, and diagonal edges, respectively. Vertex (1, 1) is the source, and vertex (l1, l2) is the sink. Given an instance of the LCS problem, i.e., string A = a1 a2 ... am and string B = b1 b2 ... bn, the grid DAG G associated with A and B is an (m + 1) × (n + 1) grid DAG such that each diagonal edge on G, say, from vertex (i, j) to vertex (i + 1, j + 1), is associated with cost 1 if symbol ai in A and symbol bj in B are identical, and otherwise with cost 0. The cost of a path on G is defined as the sum of the costs of its edges. A maximum-cost path is one with the maximum cost.
Throughout we presume that m, the length of A, is a power of 2. As an example, Fig. 1 shows the grid DAG associated with strings tcaggatt and gatttatgcagg. The relation between the LCS problem and the maximum-cost path problem is seen as follows.
Observation 1: Any path with cost l on grid DAG G associated with strings A and B corresponds to a common subsequence of length l of A and B. In particular, a maximum-cost path between the source and the sink corresponds to an LCS of A and B.
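This observation can be checked with a direct sequential dynamic program over the grid DAG (our own sketch, for illustration only; the parallel computation is the paper's subject):

```python
def lcs_length_via_grid_dag(A, B):
    # cost[i][j]: maximum path cost from the source (1, 1) to vertex
    # (i + 1, j + 1) of the grid DAG; only diagonal edges joining equal
    # symbols carry cost 1, so the sink value equals the LCS length.
    m, n = len(A), len(B)
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = cost[i - 1][j - 1] + (1 if A[i - 1] == B[j - 1] else 0)
            cost[i][j] = max(cost[i - 1][j], cost[i][j - 1], diag)
    return cost[m][n]
```

For the strings of Fig. 1, tcaggatt and gatttatgcagg, this returns 5, the length of their LCS.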
So, to solve the LCS problem, we need only find the maximum-cost path beginning at the source and ending at the sink on grid DAG G. To find this path, we actually find the maximum-cost paths from every vertex on the top row to every vertex on the bottom row. Similar to the previous research [2], [3], those paths will be identified under a divide-and-conquer scheme. We divide the (m + 1) × (n + 1) grid DAG G into two (m/2 + 1) × (n + 1) grid DAGs, the upper half, GU, and the lower half, GL, and then find the maximum-cost paths on GU and GL in a recursive fashion.
A vertex v on the bottom row is the jth breakout vertex with respect to vertex (1, i) if v is the leftmost vertex on the bottom row such that there is a path of cost j from vertex (1, i) to v. Sometimes we call v a breakout vertex of vertex (1, i) for short. In Fig. 1, vertices (9, 2), (9, 3), (9, 4), (9, 5), and (9, 13) are the first, second, third, fourth, and fifth breakout vertices of the source. Note that there is no fifth breakout vertex with respect to some vertices, for example (1, 8), because the maximum cost from vertex (1, 8) to the bottom row is 4.
A fact about breakout vertices is this: the maximum-cost path between a vertex, say, v, on the top row and its jth breakout vertex, say, w, on the bottom row of G must have cost exactly j, if v does have a jth breakout vertex. This is because all costs of 1 appear on diagonal edges only. Indeed, if there were a path between v and w with cost greater than j, then w could not be a jth breakout vertex of v, because we could always find a vertex w' to the left of w such that there exists a path between v and w' with cost j. In general, the maximum-cost path between two vertices is not unique. Throughout this paper, we are interested only in the leftmost one among the maximum-cost paths between two vertices, in the sense that no vertex on the other paths lies to its left.
The maximal possible cost of a path on an (m + 1) × (n + 1) grid DAG is m. Hence, any vertex on the top row of the grid DAG has at most m breakout vertices on the bottom row. The information about breakout vertices can therefore be stored in an n × m matrix called the cost matrix. The cost matrix associated with grid DAG G, denoted by DG, is defined as follows: for 1 ≤ i ≤ n and 1 ≤ j ≤ m, DG(i, j) = k if vertex (m + 1, k) is the jth breakout vertex of vertex (1, i), and DG(i, j) = ∞ if vertex (1, i) does not have a jth breakout vertex. Note that an entry in DG is really not a cost, but rather the location of a breakout vertex on the bottom row of G. By DG^i we denote the ith row of DG. Fig. 2 shows the cost matrices associated with GU, the upper half of G shown in Fig. 1, and GL, the lower half of G.
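For illustration, a row of the cost matrix can be computed by brute force from the definitions (our own sequential sketch, not the paper's parallel method; rows and columns are 1-indexed as in the text, Python lists 0-indexed):

```python
import math

def cost_matrix_row(A, B, i):
    # Row i of the cost matrix D_G: for j = 1..m, the column of the jth
    # breakout vertex of top-row vertex (1, i), or infinity if none exists.
    m, n = len(A), len(B)
    # prev[k]: maximum path cost from (1, i) to the current row's vertex
    # in column k; -1 marks vertices unreachable from (1, i).
    prev = [(0 if k >= i else -1) for k in range(n + 2)]
    for r in range(1, m + 1):
        cur = [-1] * (n + 2)
        for k in range(1, n + 2):
            c = max(prev[k], cur[k - 1])              # vertical / horizontal edge
            if k >= 2 and prev[k - 1] >= 0 and A[r - 1] == B[k - 2]:
                c = max(c, prev[k - 1] + 1)           # matching diagonal edge, cost 1
            cur[k] = c
        prev = cur
    # The jth breakout vertex is the leftmost bottom-row vertex reachable
    # with cost at least j (a costlier path can always shed matches).
    return [next((k for k in range(1, n + 2) if prev[k] >= j), math.inf)
            for j in range(1, m + 1)]
```

For the strings of Fig. 1 it returns (2, 3, 4, 5, 13, ∞, ∞, ∞) for the source, matching the breakout vertices listed above.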

B. The Main Structure of Our Algorithms
Now we give an overview of our algorithms. Basically, both algorithms consist of four phases.
1) Compute DG_i, for 1 ≤ i ≤ m, where G_i is a 2 × (n + 1) grid DAG consisting of the ith and (i + 1)th rows of G.
2) Recursively compute DG from DGU and DGL.
Authorized licensed use limited to: Texas A M University. Downloaded on March 10,2020 at 10:40:28 UTC from IEEE Xplore. Restrictions apply.

3) Identify vertices on the maximum-cost path between the source and the sink on G.
4) Identify the LCS that corresponds to the maximum-cost path.

Both algorithms that we designed for the LCS problem share the same implementations of Phase 1 and Phase 4, but differ in their implementations of Phase 2 and Phase 3. Phase 1 and Phase 4 are simple, so a short description of them is provided in the next subsection. The implementations of Phase 3 will be stated later in Sections IV and V. The implementations of Phase 2, which are critical for both algorithms, are quite complicated; we provide some basic ideas in Section II-D, and leave the details to Sections IV and V.

C. The Implementations of Phase 1 and Phase 4

Our basic strategy for the LCS problem is divide-and-conquer. Given two strings A and B, to compute cost matrix DG of the grid DAG G associated with A and B, we initially need to compute m cost matrices, each associated with a 2 × (n + 1) grid DAG, as a base for the "merge." Let us now discuss the computation of the cost matrix of such a grid DAG, say, Gh. Suppose Gh consists of the hth and (h + 1)th rows of G. Let b_{j1}, b_{j2}, ..., b_{jr}, the j1th, j2th, ..., jrth symbols in B, be all the symbols identical to ah, the hth symbol in A, where j1 < j2 < ... < jr. The following facts are apparent from the definition of Gh. Any vertex on the top row of Gh has at most one breakout vertex; any vertex strictly to the right of vertex (1, jr), the jrth vertex on the top row of Gh, has no breakout vertex at all; and any other vertex (1, j), 1 ≤ j ≤ jr, has breakout vertex (2, jk + 1), the (jk + 1)th vertex on the bottom row of Gh, where jk satisfies j_{k-1} < j ≤ jk (j0 is defined as 0). In other words, the entries from DGh(1, 1) to DGh(j1, 1) all have value j1 + 1, the entries from DGh(j1 + 1, 1) to DGh(j2, 1) all have value j2 + 1, and so on; for the entries DGh(j, 1) with jr < j ≤ n, the value is ∞. As an example, for grid DAG G1 shown in Fig. 1, because a1 = t and b3 = b4 = b5 = b7 = t, we have DG1 = (4, 4, 4, 5, 6, 8, 8, ∞, ∞, ∞, ∞, ∞)^T.

j1, j2, ..., jr can be identified by sorting the symbols of B, which can be done in O(log n) time with n processors [6]. The values jk + 1 or ∞ can then be properly assigned to the entries of DGh by the procedure below. DGh can be generated in O(log n) time by using n/log n processors; therefore, Phase 1 can be done in O(log n) time by using mn/log n processors, for there are m such matrices to be computed in total. The procedure for generating DGh is as follows.
1) Assign jk − j_{k-1} to DGh(j_{k-1} + 1, 1), for 1 < k ≤ r, assign j1 + 1 to DGh(1, 1), and assign 0 to every other entry DGh(j, 1) with j ≤ jr.
2) Replace the entries DGh(1, 1), ..., DGh(jr, 1) by their prefix sums.
3) Assign ∞ to the entries of DGh from DGh(jr + 1, 1) to DGh(n, 1).

Now we turn to Phase 4. In Phase 4, we trace the maximum-cost path p = (v1, v2, ..., vl) from the source to the sink and mark symbol ai in A whenever there is an edge e = (vk, v_{k+1}) on p such that edge e has a cost of 1 and vertex vk has row index i. The LCS of A and B that corresponds to p can be obtained by ranking those marked symbols. Since the number of edges on p is bounded by n + m, and because checking the cost of an edge takes constant time, marking symbols in A can be done in constant time with n processors, or in O(log n) time with n/log n processors. The ranking job can be done in O(log n) time with n/log n processors using a standard technique [7]. Thus O(log n) time using n/log n processors suffices for Phase 4.

D. Ideas for Implementing Phase 2

A cost matrix contains all the information about the costs of the maximum-cost paths between vertices on the top row and vertices on the bottom row of a grid DAG, which allows us to compute DG given DGU and DGL. Before proposing a formula for computing DG from DGU and DGL, we would like to first examine their relations through the grid DAG.
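The assignment rule just described can be sketched sequentially as follows (our own stand-in for the parallel difference-and-prefix-sum procedure; it computes the same column of values directly):

```python
import math

def first_row_cost_column(a_h, B):
    # D_{G_h}(i, 1) for i = 1..n on the 2 x (n+1) grid DAG whose matching
    # symbol is a_h: the column of the (single) breakout vertex of
    # top-row vertex (1, i), or infinity if none exists.
    n = len(B)
    matches = [j + 1 for j, b in enumerate(B) if b == a_h]   # j1 < j2 < ... < jr
    col = [math.inf] * n
    k = 0
    for i in range(1, n + 1):
        # jk is the smallest match position with jk >= i; the breakout
        # vertex of (1, i) is then column jk + 1 on the bottom row.
        while k < len(matches) and matches[k] < i:
            k += 1
        if k < len(matches):
            col[i - 1] = matches[k] + 1
    return col
```

For a1 = t and B = gatttatgcagg this reproduces the example vector (4, 4, 4, 5, 6, 8, 8, ∞, ∞, ∞, ∞, ∞).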
Consider vertex (1, i) on the top row of G, suppose its jth breakout vertex is (m + 1, iv), and let p be the leftmost maximum-cost path between them. Clearly, the cost of p is j, and iv is the value of entry DG(i, j). Path p intersects the common boundary of GU and GL at some vertex, say, (m/2 + 1, iq). Thus, vertex (m/2 + 1, iq) partitions path p into two subpaths, say, p1 and p2: path p1 goes from vertex (1, i) to (m/2 + 1, iq) with a certain cost, say, k, and path p2 goes from (m/2 + 1, iq) to (m + 1, iv) with cost j − k. Since p is a leftmost path (we are interested in only the leftmost paths), each of p1 and p2 must be a leftmost path also. Consequently, vertex (m/2 + 1, iq) on G is the kth breakout vertex of (1, i) on GU, whereas vertex (m + 1, iv) on G is the (j − k)th breakout vertex of (1, iq) on GL. In other words, DGU(i, k) = iq and DGL(iq, j − k) = iv, when k ≠ 0 and k ≠ j. From these two equations, plus DG(i, j) = iv, we have DG(i, j) = DGL(DGU(i, k), j − k). When k = 0, p1 must go straight down from (1, i) to (m/2 + 1, iq) (again, this is because p is a leftmost path); thus, iq = i. Consequently, p2, with a cost of j, must be the maximum-cost path from (m/2 + 1, i) to (m + 1, iv), implying that DGL(i, j) = iv; therefore, DG(i, j) = DGL(i, j) in this case. Similarly, one can prove that when k = j, we have DG(i, j) = DGU(i, j). Taking the leftmost (smallest) column over all possible k, we have

Theorem 1: DG(i, j) = min{ DGU(i, j), DGL(i, j), min over 0 < k < j of DGL(DGU(i, k), j − k) }.

As an application of Theorem 1, we can calculate DG(1, 3) from DGU and DGL, shown in Fig. 2, as follows: DG(1, 3) = min{ DGU(1, 3), DGL(1, 3), DGL(DGU(1, 1), 2), DGL(DGU(1, 2), 1) } = 4.
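Theorem 1 can be applied entry by entry, as in the following sequential sketch (the function name and 0-indexed layout are ours; math.inf stands for the ∞ entries):

```python
import math

def merge_cost_matrices(DU, DL):
    # Theorem 1, applied directly: D_G(i, j) is the leftmost (minimum)
    # column among
    #   k = 0:      DL[i][j]               (zero cost collected in G_U)
    #   k = j:      DU[i][j]               (all j cost collected in G_U)
    #   0 < k < j:  DL[DU[i][k]][j - k]    (k in G_U, j - k in G_L)
    n = len(DU)                  # one row per top-row vertex (1, i)
    mU, mL = len(DU[0]), len(DL[0])
    D = [[math.inf] * (mU + mL) for _ in range(n)]
    for i in range(n):
        for j in range(1, mU + mL + 1):
            cand = []
            if j <= mL:
                cand.append(DL[i][j - 1])                    # k = 0
            if j <= mU:
                cand.append(DU[i][j - 1])                    # k = j
            for k in range(1, min(j, mU + 1)):
                iq = DU[i][k - 1]                            # boundary column
                if iq != math.inf and iq <= n and j - k <= mL:
                    cand.append(DL[iq - 1][j - k - 1])
            D[i][j - 1] = min(cand, default=math.inf)
    return D
```

For example, merging DU = ((2), (∞)) and DL = ((3), (3)), the two halves for A = B = "ab", yields ((2, 3), (3, ∞)), the cost matrix of the full grid DAG.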

Proof: The correctness of Theorem 1 in the normal case, in which the jth breakout vertex of vertex (1, i) exists, has been shown through the above discussion. Now we shall show that the theorem is also correct when vertex (1, i) does not have a jth breakout vertex. To do this, it suffices to show that DG(i, j) will be assigned ∞ under this circumstance.

Obviously, when vertex (1, i) does not have a jth breakout vertex on G, neither vertex (1, i) on GU nor vertex (1, i) on GL has a jth breakout vertex. So, the first two terms, DGU(i, j) and DGL(i, j), in the above formula must be ∞, which leaves us to show that DGL(DGU(i, k), j − k) = ∞ for 1 ≤ k < j. For a contradiction, we assume the existence of k, 1 ≤ k < j, such that DGL(DGU(i, k), j − k) = i1 and i1 is finite. Recalling the definitions in Theorem 1, we see that DGU(i, k) must be finite. Let DGU(i, k) = i2. Then DGL(i2, j − k) = i1. In other words, there exist two paths on G: one has cost k and goes from vertex (1, i) to (m/2 + 1, i2), and the other has cost j − k and goes from vertex (m/2 + 1, i2) to (m + 1, i1). A path on G with cost j thus can be obtained by combining these two paths. The existence of such a path contradicts the fact that vertex (1, i) has no jth breakout vertex.

A trivial but inefficient algorithm for computing DG is to apply Theorem 1 directly. Indeed, computing entry DG(i, j) of DG from DGU and DGL is nothing more than identifying the minimum among O(m) entries, which can be done in O(log m) time using m/log m processors. However, because there are in total n × m entries of DG to be computed, nm²/log m processors are needed in order to complete the computation in O(log m) time, which implies a computational time of O(log² m) for generating DG with nm²/log m processors.
We must apply Theorem 1 in a much more efficient way. A better-organized form of this theorem is proposed below.
A k-copy of a row-vector W of size m is a row-vector of size 2m, denoted as CY[k, W]. Given a row-vector, a sub-row-vector of it is obtained from it by deleting none or some entries (not necessarily consecutive ones). If a row-vector is a sub-row-vector of two or more row-vectors, it is called a common row-vector of them.

2) Any k + 1 consecutive rows of DG are k-variant.
Proof: Proposition 2(2) is an immediate result of Proposition 2(1), so we prove only the latter. Without loss of generality, we assume that ∞ exists in both DG^i and DG^{i+1}. We also assume that there are two numbers, l1 and l2, namely the numbers of finite entries in DG^i and DG^{i+1}, respectively. To show that DG^i and DG^{i+1} are 1-variant, it suffices to show that each of these two rows contains at most one "finite" entry that is not in the other.

Noticing that l1 ≥ l2 (suggested by Proposition 1(2)), we actually need to show only that there exists at most one finite entry that is in DG^i but not in DG^{i+1}. Three cases need to be considered, depending on the first entries in DG^i and DG^{i+1}. The third case cannot occur, according to Proposition 1; in the remaining cases, j2 must be less than or equal to 2, and j2 cannot be 1, because DG(i + 1, 1) rules that out. Therefore, we conclude that j2 = 2; i.e., DG(i + 1, 2) is the entry required.

To structurize the useful information in DG and in M[DG^i], we introduce several concepts. Consider k + 1 consecutive row-vectors in DG, say, DG^i for i1 ≤ i ≤ i1 + k, and let L be a common row-vector of them. L can be partitioned into groups, L1, L2, ..., Lr, such that the entries in each group are consecutive entries in every DG^i. The remnant of DG^i with respect to L, denoted R[DG^i] = (R_1^i, R_2^i, ..., R_{r+1}^i), consists of the entries of DG^i that are not in L. Note that there may be no entry in some R_j^i. The size of R[DG^i] is defined as the sum of the sizes of the R_j^i's for 1 ≤ j ≤ r + 1. In this paper, we are interested in only the largest partition, in the sense that for any two entries from two distinct groups, there exists i', i1 ≤ i' ≤ i1 + k, such that these two entries are not consecutive in DG^{i'}. Clearly, under this assumption, the partition into groups is unique. Consider the first and second rows of DGU in Fig. 2 as an example. Their common row-vector is L = (7, 9, 12), and their remnants are R[DGU^1] = (2) and R[DGU^2] = (3), respectively.

Proposition 3:
1) For k + 1 consecutive row-vectors DG^i of DG, i1 ≤ i ≤ i1 + k, the common row-vector L of DG^{i1} and DG^{i1+k} is a common row-vector of all k + 1 row-vectors.
2) L can be partitioned into at most 2k groups.
Proof: To show that L is a common row-vector of the DG^i's, for i1 ≤ i ≤ i1 + k, we need to show only the following:
1) Any "finite" entry in L appears in every DG^i.
2) The number of "∞" entries in L is no more than that in any DG^i.
Statement 1) is true because if, for instance, w is a "finite" entry in L, i.e., there exist j1 and j2 such that DG(i1, j1) = DG(i1 + k, j2) = w, then w appears in every intermediate row as well (by Proposition 1(5)). Statement 2) is true because, in fact, Proposition 1(2) suggests that there are at least as many ∞ entries in DG^i as in DG^{i1} for i1 < i.
L can be obtained from either DG^{i1} or DG^{i1+k} by deleting at most k entries; therefore, L can be partitioned into no more than 2k groups such that the entries in each group correspond to consecutive entries in both DG^{i1} and DG^{i1+k}. Now, to show Proposition 3(2), we shall show that this partition is valid for every DG^i, i1 < i < i1 + k. That is, if w1, w2, ..., wh are h consecutive entries in both DG^{i1} and DG^{i1+k} under this partition, then they are also consecutive in any DG^i, for i1 < i < i1 + k. The existence of those entries in the DG^i's is certain by Proposition 1(5). To show the consecutivity, we assume, for a contradiction, that there exists an entry w in DG^i such that wl < w < w_{l+1} for some 1 ≤ l < h. Since w1 is an entry in DG^{i1+k}, we have i1 + k < w1. Because i < i1 + k, i1 + k < w1, and w1 < w, we have i < i1 + k < w. By Proposition 1(4), w must then be an entry in DG^{i1+k} also. The assumption that w lies between wl and w_{l+1} thus implies that wl and w_{l+1} are not consecutive in DG^{i1+k}, a contradiction. The proof of Proposition 3(3) is similar to that of Proposition 3(2).
The following theorem is an immediate result of Proposition 3.

Theorem 2: Each row-vector DG^i of any k + 1 consecutive rows of DG can be represented by a common row-vector of these consecutive rows and a remnant, such that the common row-vector consists of at most 2k groups and the remnant contains at most k entries.
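Theorem 2's representation can be illustrated by a small sketch (ours; it simplifies by intersecting all rows rather than only the first and last, and it omits the grouping and position functions):

```python
import math

def compress_rows(rows):
    # Represent k+1 consecutive rows by a common row-vector (here simply
    # the finite values present in every row, in increasing order) plus
    # one small remnant per row holding that row's leftover entries.
    finite = [set(v for v in r if v != math.inf) for r in rows]
    common = sorted(set.intersection(*finite))
    remnants = [sorted(f - set(common)) for f in finite]
    return common, remnants
```

On the two example rows above, (2, 7, 9, 12) and (3, 7, 9, 12), it recovers L = (7, 9, 12) with remnants (2) and (3).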

B. Properties of M[DG^i]
Let X and Y be two matrices of the same size, where X^i and Y^i are the ith columns of X and Y, respectively; X is said to be obtained from Y by k-shift. To prove the first claim, we shall prove that it holds for each Mj, where 1 ≤ j ≤ r. Remember that matrix Mj is determined by Lj, together with DGL, as defined in the theorem. Let rj be the size of Lj, and let Lj(1), the first entry in Lj, be the w1th and w2th entry in DGU^{i1} and DGU^{i2}, respectively. We shall show that the submatrix consisting of the rj rows of Mj corresponding to DGU^{i1} is obtained from the one corresponding to DGU^{i2} by the w1-shift; comparing these two copies, one can see that the former is obtained from the latter by the w1-shift. For the same reason, Mj is never physically generated. More details can be seen in Sections IV and V.
We now turn to the second property of M[DG^i]. A 2-D matrix Z is monotone if the minimum value in its ith column lies below or to the right of the minimum value in its (i − 1)th column. (If more than one entry has the minimal value in a column, then we take the uppermost one.) In particular, Z is totally monotone if every 2 × 2 submatrix of it is monotone [2]. A matrix can be monotone without being totally monotone, since total monotonicity constrains every 2 × 2 submatrix.
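As a concrete illustration (our own matrix, since the numeric examples in the source are garbled), Z below is monotone but not totally monotone; the routine is the standard sequential divide-and-conquer for column minima of a monotone matrix, a stand-in for the parallel algorithms cited from [2]:

```python
def column_minima(M, n_rows, n_cols):
    # Column minima of a monotone matrix given as a callable M(r, c):
    # find the argmin of the middle column by scanning, then recurse;
    # monotonicity restricts the rows searched on each side.
    ans = [0] * n_cols
    def solve(c_lo, c_hi, r_lo, r_hi):
        if c_lo > c_hi:
            return
        mid = (c_lo + c_hi) // 2
        best = min(range(r_lo, r_hi + 1), key=lambda r: M(r, mid))  # uppermost on ties
        ans[mid] = best
        solve(c_lo, mid - 1, r_lo, best)   # minima of left columns lie at or above
        solve(mid + 1, c_hi, best, r_hi)   # minima of right columns lie at or below
    solve(0, n_cols - 1, 0, n_rows - 1)
    return ans

# Z is monotone (its column minima sit in rows 0, 1, 2, moving down-right)
# but not totally monotone: the 2 x 2 submatrix on rows {0, 1} and columns
# {1, 2} is [[9, 9], [1, 9]], whose column minima move upward.
Z = [[1, 9, 9],
     [2, 1, 9],
     [9, 2, 1]]
```

Here column_minima(lambda r, c: Z[r][c], 3, 3) returns [0, 1, 2].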
Among the many nice properties of this type of matrix, we mention only two that are important to our algorithms: submatrices of a totally monotone matrix obtained by deleting rows and columns are totally monotone, and efficient parallel algorithms are available to identify the column minima of a totally monotone matrix [2]. To prove that M[DG^i] is totally monotone, it suffices to prove that every 2 × 2 submatrix of it, on rows i1 < i2 and columns j1 < j2, is monotone. We first discuss the case in which all four entries of such a submatrix are finite, and leave the other cases to the last; assuming, for a contradiction, that the monotonicity condition fails, combining the resulting equations yields a contradiction.

IV. A FASTER ALGORITHM FOR THE LCS PROBLEM

This section presents an algorithm that identifies an LCS in O(log² m + log n) time with mn/log m processors. First, this is the fastest algorithm on the CREW-PRAM machine to the best of our knowledge. Second, understanding this algorithm is a warmup for understanding our next algorithm, which is more complicated but optimal. The four main phases of this algorithm have already been described in Section II, and the implementations of Phase 1 and Phase 4 have also been addressed there. Here we focus on the implementations of Phase 2 and Phase 3. Phase 2, which generates cost matrix DG, is the most complicated part of this algorithm. With a divide-and-conquer strategy applied in generating DG, our focus is on the merge stage, i.e., how to compute DG based on DGU and DGL. Phase 3, which identifies the maximum-cost path from DG, is relatively easy, and only a brief discussion of it is given. At the end of this section, we provide an analysis of the complexity of this algorithm. A 2-D array is employed to represent a cost matrix in this algorithm, a big difference from our next algorithm.

The common row-vector L of the DG^i's can be computed from DG^{i1} and DG^{i1+m}, according to Proposition 3(1).
This computation takes advantage of the fact that the entries in DG^{i1} and DG^{i1+m} are monotonically increasing in value (by Proposition 1(1)). One processor is assigned to each entry in DG^{i1}. Typically, the processor assigned to entry u of DG^{i1} independently executes a binary search on DG^{i1+m}, and marks u if it finds an entry w identical to u. Clearly, the marked entries are exactly the entries shared by both DG^{i1} and DG^{i1+m}, and therefore L is obtained simply by ranking them. In order to further partition L into groups L1, L2, ..., Lr, we search for and mark every pair of consecutive entries of L, say, l1 and l2, such that they are not consecutive in DG^{i1+m}. Since the entries in the same group must be consecutive in DG^{i1+m}, l1 and l2 must belong to two different groups; moreover, l1 must be the last entry of some group Li, and l2 the first entry of group L_{i+1}. Thus we partition L into groups. It is easily seen that the above tasks can be done in O(log m) time. For the sake of simplicity, without loss of generality, we assume that any group Mj has no more than m/log² m rows; otherwise, we partition the larger groups into several smaller ones, and the total number of groups is still bounded by O(m/log² m). Consider the computation of Cmin[Mj]. For the worst case, we assume that Mj is of size (m/log² m) × m. We first compute Cmin[Z], where Z is an (m/log² m) × (m/log² m) matrix consisting of every (log² m)th column of Mj.
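The search-and-mark step can be sketched sequentially (one loop iteration standing in for one processor; names are ours):

```python
import bisect
import math

def common_row_vector(row_first, row_last):
    # Both rows have strictly increasing finite entries (Proposition 1(1)),
    # so each "processor" binary-searches its entry of the first row in the
    # second row; the marked (shared) entries, in order, form L.
    L = []
    for v in row_first:
        if v == math.inf:
            break                      # infinite entries need no search
        k = bisect.bisect_left(row_last, v)
        if k < len(row_last) and row_last[k] == v:
            L.append(v)
    return L
```

On the rows (2, 7, 9, 12) and (3, 7, 9, 12) of the earlier example, it returns (7, 9, 12).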

A. How to Compute DG from DGU and DGL
Clearly, Z is totally monotone because Mj is totally monotone, and thus Aggarwal and Park's algorithm [2] is applicable to compute Cmin[Z].

The objective of this subsection is to describe a scheme to identify vertices on the maximum-cost path between the source and the sink of the (m + 1) × (n + 1) grid DAG G, given DG. For a maximum-cost path from the source to the sink on G, say, p = (v1, v2, ..., vl), there can be more than one vertex of p on the same row of G, because m ≤ n. A vertex vi on p is a cross-vertex if vi is the leftmost vertex of p on its row. We denote the cross-vertex on the jth row of G as v[j] to distinguish cross-vertices from the other vertices on p. Clearly, v1 = v[1]. Phase 3 in the main structure described in Section II-B is implemented by two stages.

1) Identify cross-vertex v[i] on p, for 1 ≤ i ≤ m + 1.
2) Identify the other vertices on p.
We start with the first stage. All cross-vertices on a maximum-cost path can be obtained as a side effect of computing cost matrix DG. Suppose we are computing DG(i, j), that is, identifying in G the jth breakout vertex, denoted y, of vertex (1, i), denoted x. Let p be the maximum-cost path from x to y, and let vertex q be the cross-vertex of p on the boundary between GU and GL, meaning q = v[m/2 + 1]. By side effect, we mean that q can be identified in Phase 2 without spending extra time.
In fact, according to Theorem 1, DG(i, j) is

D. The Complexity of the Algorithm
We first state the result.
Theorem 6: O(log² m + log n) time with mn/log m processors suffices to identify the LCS of A and B with lengths m and n, respectively, where m ≤ n.
Recall the four phases described in Section II-B. We have shown in Section II-C that Phase 1 can be done in O(log n) time with mn/log n processors, and that Phase 4 can be done in O(log n) time with n/log n processors. We have just shown in Section IV-C that O(log n) time and n processors suffice for Phase 3. So, to prove Theorem 6, we shall show that O(log² m) time and mn/log m processors suffice for Phase 2. Let T(k) be the time taken to generate a cost matrix of size (k + 1) × (n + 1). The execution time of Phase 2, which is O(log² m), can be obtained from the following recurrence suggested by Theorem 5:

T(k) = T(k/2) + c1 log k,

where T(k/2) is the time taken to generate the two smaller cost matrices, and c1 log k is the time taken for the merge. T(2) is defined as 0, because that time is charged to Phase 1. In order to use no more than mn/log m processors throughout the algorithm, Brent's principle [4] must be applied in Phase 2. Notice that Phase 2 is a recursive process with log m merge stages, each costing O(mn) operations. Therefore, the operations in each stage can be performed in O(log m) time using mn/log m processors, according to Brent's principle.
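Unrolling the recurrence (a short derivation, writing m = 2^t) confirms the O(log² m) bound:

```latex
T(k) = T(k/2) + c_1 \log k, \qquad T(2) = 0
\;\Longrightarrow\;
T(m) = c_1 \sum_{i=2}^{t} \log 2^{i} = c_1 \sum_{i=2}^{t} i = O(t^2) = O(\log^2 m),
\qquad t = \log_2 m.
```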
V. AN OPTIMAL ALGORITHM FOR THE LCS PROBLEM

The four phases of this algorithm are described in Section II-B, in which Phase 1 and Phase 4 are shared with the previous algorithm. In this section, we focus on the implementations of Phase 2 and Phase 3. The fundamental difference between this algorithm and the previous one lies in the data structure used for representing cost matrices. In the previous algorithm, 2-D arrays are used to implement cost matrices; unfortunately, this approach would destroy any hope of achieving optimality. To see this, just think about computing and storing O(m/2^i) cost matrices, each of size n × O(2^i), at the (i + 1)th merge stage of Phase 2, where each matrix corresponds to an O(2^i) × (n + 1) grid DAG. Obviously, this stage alone costs at least O(mn) operations. The fact that there are O(log m) merge stages in Phase 2 implies that at least O(mn log m) operations are needed, which is larger than our desired bound, mn. For this reason, a very efficient data structure is adopted for DG in our optimal algorithm.

A. The Data Structure for DG
We use common row-vectors and remnants to represent cost matrix DG. Specifically, for every k + 1 consecutive rows of the n × m cost matrix DG, say, DG^i for 1 ≤ i ≤ k + 1, we use a 1-D array to represent their common row-vector L, and use similar 1-D arrays to store the k + 1 remnants and the associated position functions. Throughout this section, we presume that any cost matrix is represented by the above data structure. So, by computing DG, we mean computing the common row-vectors, the corresponding remnants, and the position functions. Now we would like to make a short comment on why the above representation of DG can help us achieve our objective. Note that by Theorem 2, k + 1 consecutive rows of DG with size n × m can be represented by their common row-vector L, with a size of at most m, and k + 1 remnants, each with a size of at most k; hence, O(mn/k + nk) space suffices to store all the information in DG. By this representation, not only is the redundant information in DG removed, and thus storage space is cut a great deal, but also the number of entries to be computed is greatly reduced. Two simple facts about the position function are stated as follows.
Proposition 4: Let Lh be the hth group of L, the common row-vector of the DG^i's for i1 ≤ i ≤ i1 + k.

C. The Proof of Theorem 7
We prove Theorem 7 by providing an implementation and a corresponding analysis for each of the three steps.
Step 1 can be implemented similarly to Step 1 of ColMin. The only difference is the data structure of DG. Keeping this in mind, one should have no difficulty in seeing that Step 1 can be done in O(log m log log m) time with m log m processors. We leave the calculation to the reader as an exercise.
Step 2, which generates the remnants, is the most complicated part of the whole algorithm. Let R[DG^a] = (R_1^a, ..., R_{r+1}^a) be the remnant of DG^a. We shall show only how to compute remnant group R_b^a, the bth group of R[DG^a], where 1 ≤ a ≤ log⁴ m and 1 ≤ b ≤ r + 1; the other remnant groups can be handled in exactly the same way. We assume that L contains at least one entry that is not ∞; the case in which L contains only ∞'s is easy to deal with.
Two facts should be noted: any finite entry in R_b^a, b ≥ 2, exists also in R_b^{log⁴ m} (see Proposition 3(3)), and ∞ entries are always to the right of the finite entries in DG^a. Thus, removing ∞ entries from R_b^{log⁴ m} will not affect generating the finite entries in R_b^a. Without loss of generality, we assume that there is no ∞ entry in R_b^{log⁴ m}. Leaving the question of how to get SD[a, b] for now, we first examine the computation time for R_b^a, supposing that SD[a, b] is available. Notice that the size of R[DG^{log⁴ m}] is bounded by log⁴ m, and so is the size of R_b^{log⁴ m}. Hence, the method we used in Step 1(a) of procedure ColMin can be used to compute R_b^a, and the following time bound should be quite clear. Note also that the size of R_1^a is bounded by log⁴ m. So, we define SD[a, 1] as the first log⁴ m entries of DG^a. Let k be the size of R_1^a; then R_1^a consists of the first k entries of SD[a, 1], and the (k + 1)th entry of SD[a, 1] is identical to L1(1). From this, we draw the following conclusions.
Lemma 7: Suppose that SD[a, 1] and L1(1) are given; then log⁴ m processors suffice to identify R_1^a in constant time.
From Lemmas 6 and 7, together with the fact that r, the number of groups of L, is bounded by log⁴ m, we further draw the following conclusion.
Corollary 4: Suppose that SD[i, j] and L(1) are given, and that the size of SD[i, j] is bounded by log⁴ m. Then the R_j^i's, for 1 ≤ i ≤ log⁴ m and 1 ≤ j ≤ r, can be obtained in O(log log m) time by using a polylogarithmic number of processors.
Since L_b(1) has already been computed in Step 1(b), in the remainder of this section we concentrate on the critical problem of how to find SD[a, b], with b ≥ 2, such that R_b^a is a subvector of it and such that its size is bounded by O(log⁴ m).
The basic formula described in Corollary 1 is applied to generate SD[a, b]. (Remember that SD[a, b] is nothing but a subrow of DG^a.) We first discuss the issue of which entries of DG^a should be included in SD[a, b]. A vector, Ind[SD[a, b]], is used to record, for the entries in SD[a, b], their original positions in DG^a.

Understanding the relation between Pos[DG^a, R_b^a] and Pos[DG^{log⁴ m}, R_b^{log⁴ m}], i.e., the relation between the position of the first entry of R_b^a in DG^a and the position of the first entry of R_b^{log⁴ m} in DG^{log⁴ m}, is the key to figuring out how to find Ind[SD[a, b]]. By SM[SD[i, j], k], we refer to the submatrix of both M[SD[i, j]] and X[i, k] such that SM[SD[i, j], k] has the maximum number of columns and rows. See Fig. 6. Matrix M'[SD[a, b]] is defined as the matrix obtained from M[SD[a, b]] (see (2)). Each entry of M'[SD[a, b]], except those entries belonging to Cmin[SM[SD[a, b], k]], can be obtained from DGU and DGL in O(log log m) time using a polylogarithmic number of processors. Therefore, we can make the following observation. T(m) = O(log² m) is suggested by Lemma 9, and can be derived by a recurrence similar to the one used in Section IV-E. The number of processors needed is bounded by n. Thus, we have completed the proof of Theorem 8.

D. The Implementation of Phase 3
Like our first algorithm, Phase 3 in our optimal algorithm is implemented by two stages:
1) Identify cross-vertex v[i] on p, for 1 ≤ i ≤ m + 1.

2) Identify other vertices on p.
Obviously, the discussion about the second stage in our first algorithm is still applicable. Unfortunately, the discussion about the first stage in that algorithm is no longer applicable here. Remember that in the first algorithm, all cross-vertices are computed and stored as a side effect of Phase 2. Since the method of computing cost matrices in our optimal algorithm is totally different, that side effect is not preserved, and a new method is needed. In what follows, we first state the result.

E. The Complexity of the Algorithm

Theorem 9: The LCS problem can be solved in O(log² m log log m) time with mn/(log² m log log m) processors when log² m log log m > log n, or otherwise in O(log n) time with mn/log n processors.
To prove this, we need to examine the complexity of each of the four phases. We deal with only Phase 1 and Phase 2; the discussion of Phase 1 is applicable to Phase 3 and Phase 4. We have proved in Section II-C that Phase 1 can be done in O(log n) time with mn/log n processors. To be consistent with Theorem 9, we just point out that when log² m log log m > log n, instead of using mn/log n processors, we can use only mn/(log² m log log m) processors: by applying Brent's principle, the procedure for Phase 1 can be simulated by mn/(log² m log log m) processors in O(log² m log log m) time.
As for Phase 2, we shall show that it can be done in O(log² m log log m) time with mn/(log² m log log m) processors. From this result, by applying Brent's principle, readers can easily see that O(log n) time suffices for Phase 2 if mn/log n processors are used, where mn/log n < mn/(log² m log log m).
Consider the ith of the O(log m) stages in Phase 2. We have g_i = O(m/2^i) grid DAGs to deal with, each of size O(2^i) × (n + 1), and each of their cost matrices can be computed in t_i time by using p_i processors, where t_i = O(log(2^i) log log(2^i)) and p_i = 2^i n / log³(2^i). Since a total of P processors are available, where P = mn/(log² m log log m), we can compute s_i cost matrices simultaneously, where s_i = P/p_i. On the other hand, because there are a total of g_i cost matrices to compute, when g_i > s_i holds, O((g_i/s_i) t_i) time suffices, and when g_i ≤ s_i holds, t_i time is enough for this stage. This discussion, together with the fact that there are O(log m) stages in Phase 2, yields the time bound for Phase 2: summing over the stages shows that O(log² m log log m) time suffices for Phase 2 using mn/(log² m log log m) processors.