Recognizing substrings of LR(k) languages in linear time

LR parsing techniques have long been studied as efficient and powerful methods for processing context-free languages. A linear time algorithm for recognizing languages representable by LR(k) grammars has long been known. Recognizing substrings of a context-free language is at least as hard as recognizing full strings of the language, as the latter problem easily reduces to the former. In this paper we present a linear time algorithm for recognizing substrings of LR(k) languages, thus showing that the substring recognition problem for these languages is no harder than the full string recognition problem. An interesting data structure, the Forest Structured Stack, allows the algorithm to track all possible parses of a substring without losing the efficiency of the original LR parser. We present the algorithm, prove its correctness, analyze its complexity, and mention several applications that have been constructed.

1 Introduction

The reduction constructs a grammar G' from the grammar G by adding the rule S' → $S$, where S is the start symbol of grammar G, and '$' is a new terminal symbol, not in the original alphabet of grammar G. The non-terminal S' becomes the new start symbol of grammar G'. From the input string x we construct w = $x$. The output of the reduction is the pair (G', w). This reduction can be carried out in constant time and space. The details of this reduction's correctness proof are omitted, and may be easily filled in by the reader. Also, it can be shown that the set of all substrings of a CFL is itself a CFL. Since the set of CFLs is exactly the set of languages accepted by non-deterministic pushdown automata (NPDAs), one easy way to show this is by constructing an NPDA that accepts all substrings of the language of a given context-free grammar. The NPDA constructed for accepting the language of a given context-free grammar (in Greibach normal form) in [HU79] (page 116) can easily be modified to accept all substrings of the language. Thus, the general problem of recognizing substrings is not any harder than that of recognizing full strings. However, the set of all substrings of an LR(k) language is not necessarily itself an LR(k) language, therefore a linear time bound for recognizing substrings of LR(k) languages is not trivial.
In this paper we show that the substring recognition problem for LR(k) grammars is not any harder than the full-string recognition problem. We present an algorithm for the LR(k) substring recognition problem that runs in linear time, comparable to the running time of the original LR parsing algorithm [AU72]. While previous substring parsing algorithms such as Cormack's [Cor89] modified the LR parsing tables to accommodate substring recognition, our algorithm modifies the parsing algorithm itself, while leaving the original LR parsing tables intact. We introduce a data structure, the Forest Structured Stack (FSS), that keeps track of all possible parses of the substring, while preserving the efficiency of the original LR parsing algorithm. The SLR, canonical LR(1) and LALR parser variants differ only in the algorithms that produce the parsing tables from the grammar, and share a common LR parsing algorithm that is controlled by these tables. Since our substring algorithm replaces this run-time parsing algorithm while using the parsing tables "as is", it is equally applicable to all of the above LR variants. The parsing algorithm for canonical LR(k) grammars (k ≥ 2) differs slightly from the other variants, in order to account for the extended lookahead into the input. Thus, a slightly different version of our substring algorithm handles canonical LR(k) grammars.
Section 2 describes the FSS data structure and presents the substring recognition algorithm for LR(1) grammars. In section 3 we prove the correctness of the algorithm. Section 4 analyzes the time complexity of our algorithm. An amortized analysis is used to prove that the algorithm does indeed run in linear time. Section 5 extends the algorithm to the general LR(k) case. Finally, some applications of the algorithm and our conclusions are presented in section 6.

2 The Algorithm
In this section we present our fundamental substring recognition algorithm, appropriate for SLR, canonical LR(1) and LALR parsing tables. These LR parsing variants assume that only the single next input symbol is available to the parser at any point (no further lookahead). The slightly modified algorithm for canonical LR(k) grammars (k ≥ 2) is presented in section 5. The substring recognition algorithm we describe in this section is denoted by SSR. It is a variation of the conventional LR parsing algorithm, denoted by LRP.

2.1 The Forest Structured Stack
The Forest Structured Stack (FSS) is a graph, consisting of a set of trees, representing a possibly infinite set of stacks of LRP. The nodes of the graph are labeled by states of the LR machine. The edges that connect the state nodes are labeled by grammar symbols. Each path from a root to a leaf corresponds to the top portion of an LRP stack, in which the node at the root of the path represents the state at the top of the stack.
The algorithm simulates the behavior of LRP on all the stacks represented in the FSS, adding nodes in correspondence with actions that push items on the stack (shifts), and removing nodes in correspondence with stack reductions. The tree representation avoids the duplication of stacks which have an identical top part but which differ in content deeper down.
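As an illustration (the representation and names below are ours, not from the paper's pseudocode), an FSS tree can be encoded as nodes holding a parser state and a list of children, with each root-to-leaf path spelling out the top portion of one LRP stack:

```python
class FSSNode:
    """A node of the Forest Structured Stack: a parser state plus children.

    Each path from a root down to a leaf represents the top portion of one
    LRP stack, with the root holding the state at the top of the stack.
    (Hypothetical sketch; the grammar-symbol edge labels are omitted.)
    """
    def __init__(self, state, children=None):
        self.state = state
        self.children = children or []

def stack_tops(root, prefix=None):
    """Enumerate all partial stacks (top state first) represented by one FSS tree."""
    prefix = (prefix or []) + [root.state]
    if not root.children:
        yield prefix
    else:
        for child in root.children:
            yield from stack_tops(child, prefix)

# Two stacks sharing the same top two states are stored as a single tree:
tree = FSSNode(7, [FSSNode(4, [FSSNode(1), FSSNode(2)])])
print(sorted(stack_tops(tree)))   # [[7, 4, 1], [7, 4, 2]]
```

The sharing is exactly the duplication-avoidance described above: the common top part (states 7 and 4) is stored once, and only the differing deeper contents branch.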

An Informal Description of the Algorithm
The idea behind SSR is to effectively simulate the behavior of LRP on all possible strings of which the input is a suffix. When parsing a string w, of which our input string x = x_1 x_2 ··· x_n is a suffix, LRP is in some state (at the top of the stack) upon shifting x_1, the first symbol of x. We are interested in all such states, and thus we initialize SSR by building an FSS with a distinct single node tree for each state that can be the result of shifting x_1 according to the pre-compiled action table. Since each single node tree represents all stacks with that state at the top, the initial FSS represents the set of all possible stacks after the shifting of x_1.
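This initialization amounts to a single scan of the action-table column for x_1. A sketch, under an assumed dictionary encoding of the action table (the encoding is ours, not the paper's):

```python
def initial_root_states(action_table, first_sym):
    """Collect every state that can result from shifting the first input symbol.

    action_table is a hypothetical encoding: {state: {terminal: entry}}, where
    entry is ("shift", s), ("reduce", rule) or ("accept",).  Each returned
    state becomes a single-node tree in the initial FSS.
    """
    roots = set()
    for state, row in action_table.items():
        entry = row.get(first_sym)
        if entry is not None and entry[0] == "shift":
            roots.add(entry[1])
    return roots

# Toy table: states 0 and 3 shift 'a' (into states 2 and 5); state 1 reduces.
toy = {0: {"a": ("shift", 2)}, 1: {"a": ("reduce", 0)}, 3: {"a": ("shift", 5)}}
print(initial_root_states(toy, "a"))  # {2, 5}
```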
From here on we continue the parsing of x according to each of the FSS trees. SSR performs a series of alternating Reduce and Shift phases, one pair of phases for each input symbol.
During a Reduce Phase, reductions are performed on all trees whose top state indicates that a reduction is to be performed. In LR parsing, reductions remove nodes from the stack. When performed on a tree, they are done on all paths in the tree, starting at the root, to a depth corresponding to the number of symbols on the right-hand side of the rule being reduced.
Reductions are a problem only when they wish to remove nodes deeper than the length of some path in the FSS. This corresponds to a reduction that includes symbols derived from parsing the part of the full string that is prior to x. In our algorithm, we refer to such reductions as long reductions, and treat them in a manner somewhat similar to our initialization.
A reduction normally removes the right-hand side of the rule being reduced, and then shifts the non-terminal symbol A of the left-hand side of the rule. The new state at the top of the stack is determined from the goto table, and depends on A and on the state revealed at the top of the stack by the reduction. With long reductions, since only a partial stack exists, this state is not known. Our algorithm determines all such possible states by a lookup in the long reduction goto table. This supplemental table specifies for each possible reduction from a state at the top of the stack, the set of states that may be reached as a result of the shifting of the left-hand side non-terminal of the rule being reduced. The table is easily constructed from the parsing tables prior to run-time. Each of the determined goto states corresponds to at least one full string, the parsing of which would have resulted in that state being at the stack top at this point in the parsing process. It is sufficient at this point to add these states to the FSS as single node trees. Long reductions are performed at most once per state in a Reduce Phase, since a second long reduction from the same top state would produce the same new trees, and thus would be redundant.
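The precomputation of the long reduction goto table can be sketched as follows, under assumed dictionary encodings of the tables and a rule list of (lhs, rhs) pairs (all names and encodings are ours). Since the stack contents below the known portion could be anything, every state t contributes goto(t, A) as a possible result of a long reduction by a rule with left-hand side A:

```python
def build_long_reduction_goto(action_table, goto_table, rules):
    """Precompute, for each top state, the set of states reachable by a long
    reduction from it.  Hypothetical encodings:
      action_table: {state: {terminal: ("shift", s) | ("reduce", r) | ("accept",)}}
      goto_table:   {state: {nonterminal: state}}
      rules:        list of (lhs, rhs) pairs, indexed by rule number.
    """
    table = {}
    for state, row in action_table.items():
        targets = set()
        for entry in row.values():
            if entry[0] == "reduce":
                lhs = rules[entry[1]][0]
                # The hidden state below the partial stack could be any state t,
                # so every defined goto(t, lhs) is a possible result.
                targets.update(g[lhs] for g in goto_table.values() if lhs in g)
        if targets:
            table[state] = targets
    return table

# Toy fragment: state 5 reduces by rule 0 (E -> E + T); goto on E is defined
# from states 0 and 4, so a long reduction from state 5 may land in 1 or 8.
rules = [("E", ("E", "+", "T"))]
action = {5: {"$": ("reduce", 0)}}
goto = {0: {"E": 1}, 4: {"E": 8}}
print(build_long_reduction_goto(action, goto, rules))  # {5: {1, 8}}
```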
When the action defined by the table on the root node of a tree is error, the entire tree is discarded. These are trees that correspond to prefix strings of x that cannot be completed to strings in the language. A Reduce Phase terminates when the action indicated by the table, on each of the tree root nodes, is to shift the next input symbol. All the shift operations are done in the consequent Shift Phase of the algorithm.
Upon reaching the end of the input x, if the FSS is not empty, we can safely assume that there exists a prefix string y such that the parsing of the string yx by the LR parser would not have caused a parsing error by this point. Properties of LRP guarantee the existence of a suffix z, such that w = yxz is accepted. Thus x is confirmed to be a valid substring.
To increase the efficiency of the algorithm, two operations, SUBSUME and CONTRACT, are performed on the FSS structure at appropriate times. When a single node tree is added to the FSS, and the state of the node is identical to that of some other tree root node in the FSS, the larger tree may be deleted from the FSS, since the single node tree represents all stacks of LRP that have that particular state at the top of the stack. This set of stacks necessarily includes all stacks that were represented by the larger tree rooted at a node of the same state. The SUBSUME operation detects such conditions and deletes the larger tree. Long reductions frequently create single node trees that subsume other trees in the FSS.
The CONTRACT operation merges two trees, the roots of which are of the same state, returning a single tree as a result. The merging is done recursively down the two trees, to ensure that no immediate sibling nodes in the FSS are labeled by the same state. This in turn guarantees that at all times, the branching degree of every node in the FSS is bounded by the number of states in the parsing table, a property essential for maintaining a linear bound on the running time of the algorithm. Two trees may end up having the same top state as a result of either a shift operation or a reduction. In the shift case, since prior to the shift the trees necessarily had different top states, they may simply be merged at the top node level, and no deeper tree contraction is needed. However, in the case of a reduction, if the result of the reduction is a top state that is the same as that of another existing tree in the FSS, a full CONTRACT operation is performed.
The RECLAIM operation is responsible for freeing the dynamically allocated storage for those nodes and trees that are discarded in the course of the algorithm.

A Formal Description of the Algorithm
We next present a more formal description of algorithm SSR in a pseudo high-level language. We use the following notation:
• Nodes of the FSS are presented as structures with two fields: a state field containing the parser state, and an action field containing the next parser action to be done upon processing the node.
• STATES is the set of all parser states (according to the parsing table).
• ROOTS is the set of nodes that are roots of trees in the FSS.

• NEW-ROOTS: temporary set of new roots.
• EOS: token representing the end of the input string.
• get_next_sym(x): function returning the next input token x.

    if there exists a node n* in ROOTS with n*.state = s
        then SUBSUME(n, n*)
        else add node n to ROOTS with n.action = ACT(s, x);
    end;
    end;
    mark state ts for long reduction;
    end;

CONTRACT

CONTRACT merges two trees that have root nodes of the same state into a single tree.

CONTRACT(n1, n2)
if n1 is a singleton node then RECLAIM(n2) and return n1;
else if n2 is a singleton node then RECLAIM(n1) and return n2;
else for each child c2 of n2 do:
    if n1 has a child c1 with c1.state = c2.state
        then CONTRACT(c1, c2) and replace c1 with the resulting tree
        else add c2 as a new child of n1;
end;
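The pseudocode can be transcribed into a runnable sketch (the node representation and names are ours, not the paper's); RECLAIM is included because CONTRACT calls it, and in a garbage-collected language "deleting" a node simply amounts to dropping the references to it:

```python
class Node:
    """Hypothetical FSS node for illustrating CONTRACT: a state plus children."""
    def __init__(self, state, children=None):
        self.state = state
        self.children = children or []

def reclaim(node):
    """RECLAIM: recursively discard a tree (Python's GC does the actual freeing)."""
    for child in node.children:
        reclaim(child)
    node.children = []

def contract(n1, n2):
    """CONTRACT: merge two trees whose roots carry the same state, recursing so
    that no two sibling nodes end up labeled with the same state."""
    if not n1.children:          # n1 is a singleton: it subsumes n2
        reclaim(n2)
        return n1
    if not n2.children:          # n2 is a singleton: it subsumes n1
        reclaim(n1)
        return n2
    for c2 in n2.children:
        match = next((c1 for c1 in n1.children if c1.state == c2.state), None)
        if match is not None:
            n1.children[n1.children.index(match)] = contract(match, c2)
        else:
            n1.children.append(c2)
    return n1

# Trees (3 <- 1) and (3 <- 1 <- 0) contract into (3 <- 1): the singleton
# child labeled 1 subsumes the deeper subtree below the other child 1.
t1 = Node(3, [Node(1)])
t2 = Node(3, [Node(1, [Node(0)])])
merged = contract(t1, t2)
print(merged.state, [(c.state, c.children) for c in merged.children])  # 3 [(1, [])]
```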

SUBSUME
SUBSUME replaces a tree rooted at a node n with a singleton new node that has the same state.

RECLAIM
RECLAIM deletes all nodes of the tree rooted at given node n from the Forest Structured Stack.

RECLAIM(n)
for all children nodes c of n do RECLAIM(c); delete node n;

An Example
To further clarify how the algorithm works, we present a simple example. Figure 1 contains a simple arithmetic expression grammar, taken from [ASU86] (page 218). Table 1 contains the SLR parsing table for this grammar. Table 2 shows the long reduction goto table for this parsing table. For each state, the long reduction goto table contains the list of states into which the parser may shift after a reduction from that state.

Table 1: SLR parsing table for the grammar in Figure 1.

Table 2: Long reduction goto table for the parsing table of Table 1.

Figure 2 illustrates the execution; the final FSS is shown in Figure 2f. This completes the Shift Phase. The consequent termination test discovers that we have reached the end of the input. Since the FSS is not empty, the input is a valid substring (of an arithmetic expression in the language of our grammar), and the algorithm terminates. Note that due to the simplicity of the chosen example, no CONTRACT or SUBSUME operations occurred in the execution outlined above.

3 Correctness
We now prove the correctness of SSR. The reader is referred to Aho and Ullman [AU72] for a comprehensive proof of correctness of the original LR parsing algorithm LRP. In our proof, we rely on the correctness of LRP, namely that for an LR grammar G, given an input string x, LRP accepts x if and only if x ∈ L(G). We will therefore aim to prove the following theorem: Let G be an LR(1) grammar and x be an input string. SSR accepts x if and only if there exist strings y, z such that w = y · x · z is accepted by LRP.
We show that SSR simulates the parsing of x by LRP for all possible prefix strings y. If upon shifting x_n, the last input symbol of x, SSR has not rejected x, there exists at least one such prefix string y, for which LRP has not rejected the input y · x after the shifting of x_n. The existence of a suffix string z, for which w = y · x · z is accepted by LRP, is assured by the fact that LR parsers reject inputs as early as possible [AU72]. We now provide a formalization of the above outline.

Definition 1 A stack configuration c is a triple (s, x, i), where s = [st_1, st_2, ..., st_k] is a stack of states (with st_k at the top), x is the input string of length n, and 0 ≤ i ≤ n is a position within the input string.
The set of stack configurations represented at any point of SSR includes a configuration for each path from a root node to a leaf in the FSS. The LRP stack configurations are those particular configurations that correspond to stacks manipulated by LRP. To formalize the effect of the parsing operations of algorithm SSR on the FSS, we define the function next, from stack configurations to sets of stack configurations. In the case of a shift or a normal reduction, next(c) is a set containing the single resulting new configuration. In the case of a long reduction, next(c) is the set of all stack configurations consisting of single state stacks, the states of which one can reach after shifting the left-hand side non-terminal of the rule being reduced, as determined by the long reduction goto table. If the action is accept or the end of string is reached, we define next(c) = {c}, and if it is reject (a parse error), then next(c) = ∅.

Note that if c = (s, x, j) is an LRP stack configuration, then for some k, step(x, k) = (s, j), the action cannot be a long reduction, and therefore next(c) contains a single LRP stack configuration c', where c' = step(x, k + 1).
To formalize the effect of the Reduce and Shift phases, we define the extension of next to sets of stack configurations in the following way.
Definition 5 Let C = C_1 ∪ C_2 be a set of stack configurations such that C_1 contains exactly the stack configurations of C whose top state indicates that the next action is a reduction, and C_2 is the rest of C.

Thus, reductions have precedence over other actions.
Based on this extended definition of next we define for every n > 0 the function next^n, which is the result of n successive applications of next. Note that a Reduce Phase corresponds to some finite number of applications of next and that a Shift Phase corresponds to a single application of next. Also note that again, if c = (s, x, j) is an LRP stack configuration, then for some k, step(x, k) = (s, j), the action taken on any of the n following parsing steps cannot be a long reduction, and therefore, for any n > 0, next^n({c}) contains the single LRP stack configuration c', where c' = step(x, k + n).

Lemma 2 (The Simulation Lemma) Let C be a set of stack configurations. Then: M(next(C)) = next(M(C))
Proof: Let c = (s, x, i) be a configuration of C, and let c' ∈ M(next(c)).

Assume c' ∈ next(M(c)). Since the action on c is a reduction, c' must be of the form (r · [st], yx, |y| + i), where the state st is a result of shifting the left-hand side non-terminal at the end of the reduction. From its definition, the long reduction goto table includes state st as a possible result of the long reduction on c. Thus, ([st], x, i) ∈ next(c), and by definition of M, c' ∈ M(next(c)). ∎
We generalize Lemma 2 to any finite number of applications of next.

Lemma 3 (The Generalized Simulation Lemma) Let C be a set of stack configurations. For every n ≥ 1: M(next^n(C)) = next^n(M(C))
Proof: By a straightforward induction on n using Lemma 2.

Let C' = next^(m-1)(C). By the induction hypothesis we have that M(C') = M(next^(m-1)(C)) = next^(m-1)(M(C)). The following set of equalities completes the proof of our claim:

M(next^m(C)) = M(next(next^(m-1)(C)))   by def. of next
             = M(next(C'))              by def. of C'
             = next(M(C'))              by Lemma 2
             = next(next^(m-1)(M(C)))   by induction hyp.
             = next^m(M(C))             by def. of next

This completes the proof of Lemma 3. ∎

Proof: By induction on i. C_1 has both properties due to the way it is constructed. The induction step is proven by the following arguments. Since the next function is a formal modeling of the Reduce and Shift phases of the algorithm (excluding the process of possibly discarding some configurations by SUBSUME and CONTRACT operations), it follows that for some n, C_i ⊆ next^n(C_(i-1)) (with the "missing" configurations being those discarded by the SUBSUME and CONTRACT operations), and since SUBSUME and CONTRACT have no effect on the set of configurations represented by M, M(C_i) = M(next^n(C_(i-1))). The next function has the property that if M(c) ≠ ∅ and next(c) ≠ ∅, then M(next(c)) ≠ ∅, which extends to next^n and thus guarantees soundness. By Lemma 3, M(C_i) = M(next^n(C_(i-1))) = next^n(M(C_(i-1))), which guarantees completeness. ∎

Corollary 1 If C_n is the set of stack configurations represented by the FSS after the nth Shift Phase, where n = |x|, then C_n ≠ ∅ iff there exists an LRP configuration c' = (s', yx, |y| + |x|).
Note that the existence of such an LRP configuration c' implies the existence of a string w' = yx, such that w' is not rejected by LRP by the time x_n was shifted. The soundness property of Lemma 4 guarantees that if C_n ≠ ∅, such an LRP stack configuration c' exists. The completeness property guarantees that if such a configuration c' exists, C_n ≠ ∅.
We may now proceed to proving the main theorem:

Theorem 1 Let G be an LR grammar, and x be a given input string. Algorithm SSR accepts x if and only if there exist strings y, z such that w = y • x • z is accepted by algorithm LRP.
Proof: 1. If: Since there exist strings y and z such that w = y · x · z is accepted by algorithm LRP, the string w' = y · x is not rejected by LRP up to the point of the shifting of x_n (where n = |x|). Thus, from the above corollary it follows that the FSS of algorithm SSR is not empty upon entering the nth TERM stage, and x will be accepted by SSR.

Complexity Analysis for Grammars Free of Epsilon Rules
We now prove that SSR runs in linear time for grammars free of epsilon rules. In the next subsection we will demonstrate that SSR maintains a linear running time even in the presence of such rules.

After the initialization of the FSS, the algorithm enters a loop that consists of a termination test for end of input, examining the next input symbol, a Reduce Phase and a Shift Phase. This loop can be executed up to n -1 times, until the end of string is reached. The initialization of the FSS that precedes the loop requires only constant time. It involves scanning a column of the LR action table, and the creation of a constant number of root nodes. The termination check also takes constant time. Since there are only a constant number of root nodes (see Lemma 5 below), each Shift Phase involves only a constant number of shift operations and thus takes constant time.
However the time cost of each Reduce Phase is not uniform, and varies from one run through the loop to the next. Each Reduce Phase involves some number of Tree Reductions, which are reductions on all paths of an FSS tree to a constant depth. We will show that each such Tree Reduction is completed in constant time and then use an amortized cost evaluation to obtain a linear bound on the total number of Tree Reductions. Finally, we will argue that the total time cost of all SUBSUME, CONTRACT and RECLAIM operations is also at most linear in the length of the input.
In the following analysis, S denotes the set of states of the parser, and |S| is the size of this set. We distinguish between root nodes of the FSS and internal nodes.

Lemma 5 At no point during the execution of SSR does the FSS contain more than |S| root nodes.

Proof: The claim holds after the initialization of the algorithm, and throughout Reduce and Shift Phases SSR explicitly checks for root nodes of identical state, and when detected, merges the appropriate trees, using SUBSUME and CONTRACT as necessary. ∎

Lemma 6 The total number of nodes that become internal in the course of execution of the algorithm on a string x of length n is O(n).
Proof: In the case that the grammar is free of epsilon rules, root nodes become internal only as a result of shift operations. Once a node becomes internal, it never again becomes a root node. Thus, the Lemma is a direct result of the fact that the number of root nodes at the start of any Shift Phase is bounded by |S|, and there are at most n Shift Phases. Thus the total number of shift operations is O(n). ∎

Lemma 7 No node in the FSS ever has more than |S| children.
Proof: Throughout the algorithm CONTRACT operations are performed whenever necessary so as to maintain this property. ∎

We now concentrate on analyzing the time complexity of Reduce Phases. A normal reduction on a single path of nodes in the FSS is identical to an LRP reduction, and takes constant time. Long reductions are very similar to normal reductions. However, they involve accessing the long reduction goto table in order to determine the possible states that may result from the shifting of the left-hand side non-terminal of the rule being reduced. This table access is done in constant time. New root nodes are created for the resulting states of this process, and each added new node may require a SUBSUME operation, if there already exists a root node of the same state. This condition can be detected in constant time by a linear scan of the set of root nodes, and need be done only a constant number of times per long reduction, since at most |S| new root nodes may be added. We account for the time spent on the SUBSUME operations separately. Therefore, excluding the time spent on all SUBSUME operations, a long reduction on a single path requires only constant time. Thus, any reduction, normal or long, on a single path requires only constant time.
A Reduce Phase reduction in SSR operates on an FSS stack tree, and performs the reduction on all paths in the tree that originate at the root node to a depth equal to the number of symbols on the right-hand side of the rule being reduced. Since this is a constant depth, and the fan-out degree of FSS tree nodes is also bounded by a constant, each such Tree Reduction involves only a constant number of reductions (one for each path), each taking constant time. Thus in order to complete the time analysis of Reduce phases, we need only demonstrate that O(n) Tree Reductions are performed in the course of the algorithm.
For the purpose of the analysis, we separate the rules of our grammar into two groups. Grammar rules with a single symbol on the right-hand side are grouped together as non-generative rules and their corresponding reductions are referred to as non-generative reductions. All other rules will be called generative rules and their corresponding reductions generative reductions. We will show that the cost of performing a generative reduction can be charged to internal nodes of the FSS that are discarded by the reduction, and that only a constant number of consecutive non-generative reductions may occur between the generative ones. Thus, the non-generative reductions may be charged to the generative ones, and they in turn can be charged to the nodes.
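The partition is purely syntactic and can be computed in one pass over the grammar. A small sketch, assuming rules are encoded as (lhs, rhs) pairs; epsilon rules (empty right-hand side), which this subsection excludes and the next one treats, are flagged separately:

```python
def classify_rule(rule):
    """Classify a grammar rule for the amortized analysis: rules with a single
    right-hand-side symbol are non-generative (their reductions free no
    internal FSS nodes); all longer rules are generative.  Epsilon rules are
    handled separately in the analysis and flagged as such here.
    """
    lhs, rhs = rule
    if len(rhs) == 0:
        return "epsilon"
    return "non-generative" if len(rhs) == 1 else "generative"

rules = [("E", ("E", "+", "T")), ("E", ("T",)), ("S", ())]
print([classify_rule(r) for r in rules])
# ['generative', 'non-generative', 'epsilon']
```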

Lemma 8 In a Reduce Phase of algorithm SSR, only a constant number of consecutive non-generative Tree Reductions may be performed.
Proof: Since long reductions are performed at most once per state in a Reduce Phase, we need only consider the normal reductions. Non-generative reductions do not remove internal nodes from the FSS. By a counting argument it can be seen that after a constant number of such reductions on FSS trees, such a reduction would be repeated. If this were to occur, the non-generative rules that correspond to this series of reductions would form a cycle, in contradiction with the fact that any LR grammar must be non-cyclic. ∎

First we consider the CONTRACT operations. The CONTRACT operation merges two FSS trees that have root nodes of the same state. The contraction itself is done by comparing the states of the children of the first root node with those of the second root node. Lemma 7 guarantees at most |S|² comparisons. If a child of the first root node has a state identical to that of a child of the second root node, the two subtrees are contracted by a recursive call to CONTRACT. All other children (and their appropriate subtrees) are added as children of the first root node, and the second root node is deleted. Thus, the top level CONTRACT operation requires constant time. Note that any recursive call to CONTRACT will necessarily result in the elimination of an internal node. We may thus charge a unit of cost to the node deleted as a result of each recursive call to CONTRACT, and since the node is deleted from the FSS by this operation, it may be charged only once. Since CONTRACT is invoked only after reductions, there are at most O(n) top level calls to CONTRACT. Lemma 6 guarantees that at most O(n) internal nodes will be charged, therefore implying at most O(n) recursive calls to CONTRACT. This provides us with an O(n) bound on the total cost of all CONTRACT operations.

Finally, we observe that we have already accounted for the SUBSUME operations. SUBSUME searches for a root node of a state identical to that of a new single node tree created by a long reduction. This requires constant time. If found, the tree is then reclaimed by the RECLAIM operation, the time for which we have already accounted. This completes the time complexity analysis of our algorithm, under the assumption that the grammar contains no epsilon rules. Our analysis has shown that the total time cost of all operations in an execution of the algorithm on an input string of length n is O(n).

Extending the Complexity Analysis to Grammars with Epsilon Rules
We now turn to deal with the case that the grammar contains epsilon rules. Epsilon rules complicate our algorithm due to the fact that root nodes may become internal nodes as a result of a reduction by an epsilon rule. Thus, Lemma 6 must be re-argued, namely that the total number of root nodes that become internal in the course of an execution of the algorithm continues to be O(n), even in the presence of epsilon reductions.

Since epsilon rules have no effect on the Shift Phase of our algorithm, in order for our entire complexity analysis to still carry through, we need only to prove that the total number of Tree Reductions is still O(n).
Let us note that a grammar may indeed have epsilon rules, and still be LR. For example, consider the natural grammar for the language a^n b^n (for n ≥ 0) in Figure 3, which is in fact LR(0).
It is convenient to look at epsilon rules as normal grammar rules that generate an "invisible" terminal symbol epsilon. Thus strings in the language generated by the grammar correspond to modified strings that include the epsilon symbols in the appropriate places. For a non-ambiguous grammar we are guaranteed that this is a one to one correspondence (each string in the language corresponds to exactly one string with epsilon symbols).

Lemma 11 An LR grammar has the property that only a constant number of epsilons may appear between two non-epsilon terminal symbols in the modified strings that correspond to strings in the language generated by the grammar. Furthermore, if we denote the length of the longest right-hand side of all grammar rules by L, and the number of grammar rules by i, this constant number of consecutive epsilons is bounded by L^i.
Proof: In order to prove this claim we restrict our attention to E, the subset of grammar rules that may produce a consecutive string of epsilons. It is easy to see that if the rules in E can produce an infinite string of epsilons (starting from any rule in E, whose left-hand side non-terminal is reachable), then the grammar is necessarily ambiguous and thus not LR. The fact that E cannot produce an infinite string of epsilons poses several restrictions on the rules in this subset. No rule in E contains a terminal symbol on its right-hand side. Also, no rule in E can be recursive (the left-hand side non-terminal cannot appear on the right-hand side of the rule). Using these properties, by a simple induction on the number of rules in E, it can be shown that the number of consecutive epsilons that can be produced by E is bounded by the constant C_e = L^i, where L is the length of the longest right-hand side of the rules in E and i is the number of rules in E. ∎

In order to prove that the total number of Tree Reductions continues to be O(n), it is sufficient for us to show that Lemma 6 still holds.

Lemma 12 The total number of root nodes that become internal nodes in the course of an execution of algorithm SSR on a string x of length n is O(n), even if the grammar has epsilon rules.

Proof: For every i, 0 ≤ i ≤ n, let internal(i) be the total number of nodes that have become internal in the course of the algorithm, up until the completion of the Shift Phase of x_i. We prove by induction on i that for every 0 ≤ i ≤ n, internal(i) ≤ C · i, where C is the constant |S| · (C_e + 1).

internal(m + 1) ≤ internal(m) + |S| · C_e + |S|
               ≤ C · m + C    (by the induction hypothesis)
               = C · (m + 1)
Now since the total number of nodes that become internal in the course of the execution of algorithm SSR is bounded by internal(n), and internal(n) ≤ C · n, the above total has indeed been shown to be O(n). ∎

In the process of proving the above lemma, we have in fact shown that only O(n) epsilon reductions may occur in the course of executing SSR on a string x of length n. It thus follows that Lemma 10 continues to hold, and the number of Tree Reductions continues to be O(n), taking into account all three types of tree reductions that now exist: non-generative tree reductions, generative tree reductions, and epsilon rule tree reductions. Combined with the time analysis of the other operations, which continues to hold as before, we may again conclude a linear time bound on the total running time of algorithm SSR.

The Algorithm for Canonical LR(k) Grammars
In this section we consider the implications of generalizing algorithm SSR to deal with the general case of canonical LR(k) parsing tables.
First, let us consider the necessary modifications to the algorithm itself. These turn out to be quite minimal. In fact, only the INIT stage needs to be modified. In the INIT stage, instead of reading just the first symbol of the input string, we must obtain the first k symbols for the lookahead. This is due to the fact that the LR(k) action table is defined according to the k-lookahead on the input. The action table is then searched in order to construct the initial set of root nodes. An obvious complication occurs whenever the length of the input string is less than the needed lookahead (|x| < k). To handle this case, all possible extensions of the input string x to a string y of length k are considered, and the set of root nodes is constructed as the union of the sets derived for all such y. The algorithm will then terminate immediately in the following TERM stage. If the set of root nodes constructed in the INIT stage is not empty, x is accepted; otherwise x is rejected. Following is the "high level" description of the modified INIT stage:

All other stages of the algorithm stay exactly the same as in algorithm SSR, as presented in section 2. In the DISTRIBUTE stage, the actions determined from the LR(k) action table depend on the existing k-lookahead at that particular point in time. In the Shift Phase, the first symbol of the lookahead (the symbol being shifted) is removed from the lookahead and shifted. The get_next_sym function call in the subsequent TERM stage completes the lookahead from length k - 1 to k. The algorithm terminates when the end of string (EOS) is encountered, with k - 1 symbols of the input string still in the lookahead.
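The modified INIT stage described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the names init_roots, action_table, and sigma are assumptions, and a root node is identified here with its LR state alone.

```python
from itertools import product

def init_roots(x, k, states, action_table, sigma):
    """Build the initial set of root nodes from the k-symbol lookahead.

    action_table maps (state, lookahead_tuple) to an action, with
    entries present only where an action is defined.
    """
    if len(x) >= k:
        lookaheads = [tuple(x[:k])]
    else:
        # |x| < k: take the union over all extensions y of x to length k
        lookaheads = [tuple(x) + ext
                      for ext in product(sigma, repeat=k - len(x))]
    return {s for s in states
            for la in lookaheads
            if (s, la) in action_table}
```

For example, with k = 2 and a table defining an action only for state 0 on lookahead ('a', 'b'), the one-symbol input "a" yields the root set {0} (via the extension "ab"), while "b" yields the empty set, so the TERM stage would reject it.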

Let us now consider what implications (if any) the above modification of algorithm SSR has on its correctness and complexity.
The proof of correctness presented in section 3 continues to hold for our modified algorithm. Lemma 4 continues to hold with respect to the appropriate LR(k) version of algorithm SSR. Since the property of rejecting an input at the earliest possible opportunity [AU72] holds for general LR(k) grammars, the proof of the main theorem of correctness continues to hold as well.
Finally, let us consider the complexity analysis. It is easily seen that the revised INIT stage still takes only constant time. The set S_y is a finite set bounded by a constant, thus constructing the initial set of root nodes clearly takes only constant time. The size of this set is still bounded by |S|, the number of states in the LR action table. Since all other stages of algorithm SSR are the same as before, the time complexity analysis of the algorithm remains valid.

Conclusions
We have presented and proved a linear time algorithm for recognizing substrings of LR(k) languages.
The original version of this algorithm was initially developed by the first author in 1980. It did not include the CONTRACT operation for merging trees of the FSS. Tree contractions are crucial to retaining a linear bound on the running time of the algorithm. In the process of trying to prove the linear time bound we discovered this deficiency, and the proper modifications were consequently made.
The original algorithm, while in fact not always linear, was used as the basis for a syntax checking modification to the IBM VM/370 editor XEDIT. That modification enabled the IBM editor to check COBOL source code for syntax errors, when users modified lines, screens or files. For instance, when the cursor was moved off a modified line, the editor would beep and display an unobtrusive error message if the line was not a substring of any COBOL program. Though COBOL has a large grammar, this modification had no apparent effect on the speed of XEDIT on machines of the early 1980's. The algorithm was also used to check Pascal programs on an IBM PC editor, and this too had no apparent effect on the speed of the editor. Thus, the original algorithm appeared to be adequately fast in practice.
We have implemented our revised algorithm and have tried it on several test grammars. No precise measurements have been performed to compare the actual running time of our substring algorithm with that of the original LR parser. However, in practice, the revised implementation continues to run as fast as before.