Inter-Iteration Scalar Replacement in the Presence of Conditional Control-Flow

Introduction
The goal of scalar replacement (also called register promotion) is to identify repeated accesses made to the same memory address, either within the same iteration or across iterations, and to remove the redundant accesses (here we only study promotion within innermost loop bodies, but the ideas we present are applicable to wider code regions as well). The state-of-the-art algorithm for scalar replacement was proposed in 1994 by Steve Carr and Ken Kennedy [CK94]. This algorithm handles very well two special instances of the scalar replacement problem: (1) repeated accesses made within the same loop iteration in code having arbitrary conditional control-flow; and (2) repeated accesses made across iterations in the absence of conditional control-flow. For (1) the algorithm relies on PRE, while for (2) it relies on dependence analysis and rotating scalar values. However, that algorithm cannot handle arbitrary combinations of conditional control-flow and inter-iteration reuse of data.
Here we present a very simple algorithm which generalizes and simplifies the Carr-Kennedy algorithm in an optimal way. The optimality criterion that we use throughout this paper is the number of dynamically executed memory accesses. After application of our algorithm to a code region, no memory location in that region is read more than once or written more than once. Also, after promotion, no memory location is read or written if it was not so in the original program (i.e., our algorithm does not perform speculative promotion). Our algorithm operates under the same assumptions as the Carr-Kennedy algorithm; that is, it requires perfect dependence information to be applicable. It is therefore mostly suitable for FORTRAN benchmarks. Nevertheless, we have implemented our algorithm in a C compiler, and we have found numerous instances where it is applicable as well.
For the impatient reader, the key idea is the following: for each value to be scalarized, the compiler creates a 1-bit runtime flag variable indicating whether the scalar value is "valid." The compiler also creates code which dynamically updates the flag. The flag is then used to detect and avoid redundant loads and to indicate whether a store has to occur to update a modified value at loop completion. This algorithm ensures that only the first load of a memory location is executed and only the last store takes place. This algorithm is a particular instance of a new general class of algorithms: it transforms values customarily used only at compile-time for dataflow analysis into dynamic objects. Our algorithm instantiates availability dataflow information into run-time objects, therefore achieving dynamic optimality even in the presence of constructs which cannot be statically optimized.
We introduce the algorithm through a series of examples which show how it is applied to increasingly complicated code structures. We start in Section 2 by showing how the algorithm handles a special case, that of memory operations from loop-invariant addresses. In Section 3.3 we show how the algorithm optimizes loads whose addresses are induction variables. Finally, we show how stores can be treated optimally in Section 3.4. In Section 4 we describe two implementations of our algorithm: one based on control-flow graphs (CFGs), and one relying on a special form of Static Single Assignment (SSA) named Pegasus. Although the CFG variant is simpler to implement, Pegasus simplifies the dependence analysis required to determine whether promotion is applicable. Special handling of loop-invariant guarding predicates is discussed in Section 5. Finally, in Section 7, we quantify the impact of an implementation of this algorithm when applied to the innermost loops of a series of C programs.

This paper makes the following new research contributions:

• it introduces the SIDE class of dataflow analyses, in which the analysis is carried out statically, but the computation of the dataflow information is performed dynamically, creating dynamically optimal code for constructs which cannot be statically made optimal;

• it introduces a new register-promotion algorithm as a SIDE dataflow analysis;

• it introduces a linear-time term-rewriting algorithm for performing inter-iteration register promotion in the presence of control-flow;

• it describes register promotion as implemented in Pegasus, showing how it takes advantage of the memory dependence representation for effective dependence analysis.

Figure 1: For ease of presentation we assume that prior to register promotion, all loop bodies are predicated.

Conventions
We present all the optimizations examples as source-to-source transformations of schematic C program fragments. For simplicity of the exposition we assume that we are optimizing the body of an innermost loop. We also assume that none of the scalar variables in our examples have their address taken. We write f(i) to denote an arbitrary expression involving i which has no side effects (but not a function call). We write for(i) to denote a loop having i as a basic induction variable; we assume that the loop body is executed at least once. For pedagogical purposes, the examples we present all assume that the code has been brought into a canonical form through the use of if-conversion [AKPW83], such that each memory statement is guarded by a predicate; i.e., the code has the shape in Figure 1. Our algorithms are easily generalized to handle nested natural loops and arbitrary forward control-flow within the loop body.
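The canonical, if-converted shape assumed throughout can be sketched as follows. This is our own illustrative reading of the shape Figure 1 describes: every memory statement is guarded by an explicit predicate. The predicates and index expressions are placeholders, not taken from the paper's figure.

```c
#include <assert.h>
#include <stdbool.h>

int a[16];

/* Each memory statement is guarded by a predicate, as after
 * if-conversion; p1 and p2 stand for arbitrary path predicates. */
void predicated_loop(int n) {
    for (int i = 0; i < n; i++) {
        bool p1 = (i & 1) != 0;      /* placeholder predicate */
        bool p2 = (i % 3) == 0;      /* placeholder predicate */
        int x = 0;
        if (p1) x = a[i];            /* predicated load  */
        if (p2) a[i] = x + 1;        /* predicated store */
    }
}
```

In this form, every load and store carries its guard explicitly, which is the property the promotion algorithm below relies on.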

Scalar Replacement of Loop-Invariant Memory Operations
In this section we describe a new register promotion algorithm which can eliminate memory references made to loop-invariant addresses in the presence of control flow. This algorithm is further expanded in Section 3.3 and Section 3.4 to promote memory accesses into scalars when the memory references have a constant stride.

The Classical Algorithm

Figure 2 shows a simple example and how it is transformed by the classical scalar promotion algorithm. Assuming p cannot point to i, the key fact is that *p always loads from and stores to the same address; therefore *p can be transformed into a scalar value. The load is lifted to the loop pre-header, while the store is moved after the loop. (The latter is slightly more difficult to accomplish if the loop has multiple exits going to multiple destinations. Our implementation handles these as well, as described in Section 4.2.2.)

    /* before */            /* after */
    for (i)                 tmp = *p;
        *p += i;            for (i)
                                tmp += i;
                            *p = tmp;

Figure 2: A simple program before and after register promotion of loop-invariant memory operations.

Figure 3: A small program that is not amenable to classical register promotion.

Loop-Invariant Addresses and Control-Flow
However, the simple algorithm is no longer applicable to the slightly different program in Figure 3. Lifting the load or store out of the loop may be unsafe with respect to exceptions: one cannot lift a memory operation out of a loop if it may never be executed within the loop. To optimize Figure 3, it is enough to maintain a valid bit in addition to the tmp scalar. The valid bit indicates whether tmp indeed holds the value of *p, as in Figure 4. The valid bit is initialized to false. A load from *p is performed only if the valid bit is false. Either loading from or storing to *p sets the valid bit to true. This program will forward the value of *p through the scalar tmp between iterations arbitrarily far apart.
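The transformation can be sketched in C as follows. We assume here, based on the condition (i&1) discussed below, that Figure 3 is essentially "for (i) if (i & 1) *p += i;"; the load and store counters are our own instrumentation, added only to make the access counts observable.

```c
#include <assert.h>
#include <stdbool.h>

int loads, stores;   /* instrumentation only: count memory accesses */

/* Promotion of a conditionally accessed loop-invariant address:
 * tmp caches *p, and the valid bit records at run time whether
 * tmp currently holds the value of *p. */
void promoted(int *p, int n) {
    int tmp = 0;
    bool valid = false;
    for (int i = 0; i < n; i++) {
        if (i & 1) {                              /* conditional access */
            if (!valid) { tmp = *p; loads++; valid = true; }
            tmp += i;                             /* was: *p += i; */
        }
    }
    if (valid) { *p = tmp; stores++; }            /* postlude store */
}
```

No matter how many iterations touch *p, at most one load and one store reach memory.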
The insight is that it may be profitable to compute dataflow information at runtime. For example, the valid flag within an iteration is nothing more than the dynamic equivalent of the availability dataflow information for the loaded value, which is the basis of classical Partial Redundancy Elimination (PRE) [MR79]. When PRE can be applied statically, it is certainly better to do so. The problem with Figure 3 is that the compiler cannot statically summarize when condition (i&1) is true, and therefore has to act conservatively, assuming that the loaded value is never available. Computing the availability information at run-time eliminates this conservative approximation. Maintaining and using runtime dataflow information makes sense when we can eliminate costly operations (e.g., memory accesses) by using inexpensive operations (e.g., Boolean register operations).
Figure 4: Optimization of the program in Figure 3.

This algorithm generates a program which is optimal with respect to the number of loads within each region of code to which promotion is applied (if the original program loads from an address, then the optimized program will load from that address exactly once), but it may execute one extra store: if the original program loads the value but never stores to that location, the valid bit will be true, enabling the postlude store. In order to treat this case as well, a dirty flag, set on writes, has to be maintained, as shown in Figure 5. Note: in order to simplify the presentation, the examples in the rest of the paper will not include the dirty bit. However, its presence is required for achieving an optimal number of stores.
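The dirty-bit refinement can be sketched as follows. The loop body here is our own placeholder (a read of *p when (i & 1), a write only when additionally (i & 2)); what matters is that the postlude store is gated by the dirty bit, so a loop that only reads *p performs no store at all.

```c
#include <assert.h>
#include <stdbool.h>

int loads, stores;   /* instrumentation only */

/* valid: tmp holds the current value of *p.
 * dirty: tmp has been modified and must eventually be stored back. */
void promoted_dirty(int *p, int n) {
    int tmp = 0;
    bool valid = false, dirty = false;
    for (int i = 0; i < n; i++) {
        if (i & 1) {
            if (!valid) { tmp = *p; loads++; valid = true; }
            if (i & 2) { tmp += i; dirty = true; }   /* a write to *p */
        }
    }
    if (dirty) { *p = tmp; stores++; }   /* store only if actually written */
}
```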

Inter-Iteration Scalar Promotion
Here we extend the algorithm for promoting loop-invariant operations to perform scalar promotion of pointer and array variables with constant stride. We assume that the code has been subjected to standard dependence analysis prior to scalar promotion. Figure 6 illustrates the classical Carr-Kennedy inter-iteration register promotion algorithm from [CCK90], which is only applicable in the absence of control-flow. In general, reusing a value after k iterations requires the creation of k distinct scalar values, to hold the simultaneously live values of a[i] loaded for k consecutive values of i. This quickly creates register pressure, and therefore heuristics are usually used to decide whether promotion is beneficial. Since register pressure has been very well addressed in the literature [CCK90, Muc97, CMS96, CW95], we will not concern ourselves with it further in this text. A later extension to the Carr-Kennedy algorithm [CK94] allows it to also handle control flow. The extended algorithm optimally handles reuse of values within the same iteration, by using PRE on the loop body. However, it can no longer promote values across iterations in the presence of control-flow: the compiler has difficulty reasoning about the intervening updates between accesses made in different iterations.
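Carr-Kennedy-style promotion of a straight-line loop can be sketched as follows. The concrete loop, "b[i] = a[i] + a[i+2]", is our own placeholder in the spirit of Figure 6: a[i+2] loaded in iteration i is reused as a[i] two iterations later by rotating scalars, so n+2 loads replace 2n. The ld helper is instrumentation only.

```c
#include <assert.h>

int loads;                         /* instrumentation only */
int ld(int *addr) { loads++; return *addr; }

void promoted_stride(int *a, int *b, int n) {
    int a0 = ld(&a[0]), a1 = ld(&a[1]);   /* prelude loads */
    for (int i = 0; i < n; i++) {
        int a2 = ld(&a[i + 2]);
        b[i] = a0 + a2;
        a0 = a1; a1 = a2;                 /* rotate scalar values */
    }
}
```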

Partial Redundancy Elimination
Before presenting our solution let us note that even the classical PRE algorithm (without the support of special register promotion) is quite successful in optimizing loads made in consecutive iterations. Figure 7 shows a sample loop and its optimization by gcc, which does not have a register promotion algorithm at all. By using PRE alone gcc manages to reuse the load from ptr2 one iteration later.
The PRE algorithm is unable to achieve the same effect if data is reused in any iteration other than the immediately following one, or if there are intervening stores. In such cases an algorithm like Carr-Kennedy's is necessary to remove the redundant accesses. Note that the use of valid flags achieves the same degree of optimality as PRE within an iteration, but at the expense of maintaining run-time information.
Figure 6: A program with no control-flow, before and after register promotion performed by the Carr-Kennedy algorithm. (The rotation step in the transformed loop body: /* Rotate scalar values */ a0 = a1; a1 = a2;)

Figure 7: Sample loop and its optimization using PRE. (The output is the equivalent of the assembly code generated by gcc.) PRE can achieve some degree of register promotion for loads.

Figure 8: Sample program which cannot be handled optimally by either PRE or the classical Carr-Kennedy algorithm.

Removing All Redundant Loads
However, the classical algorithm is unable to promote all memory references guarded by a conditional, as in Figure 8. It is, in general, impossible for a compiler to check whether f(i) is true in both iteration i and iteration i-2, and therefore it cannot deduce whether the load from a[i] can be reused as a[i-2] two iterations later. Register promotion has the goal of executing only the first load and the last store of a variable. The algorithm of Section 2 for handling loop-invariant data is immediately applicable for promoting loads across iterations, since it performs a load as soon as possible. By maintaining availability information at run time, using valid flags, our algorithm can transform the code to perform a minimal number of loads, as in Figure 9. Applying constant propagation and dead-code elimination will simplify this code by removing the unnecessary references to a2_valid.
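The cross-iteration case can be sketched as follows. The loop body (a[i+2] read under one data-dependent condition, a[i] under another) is our own placeholder in the spirit of Figures 8 and 9: each offset gets a scalar and a valid bit, and both are rotated, so no address is loaded twice even though reuse across iterations depends on run-time data. The ld helper is instrumentation only.

```c
#include <assert.h>
#include <stdbool.h>

int loads;                         /* instrumentation only */
int ld(int *addr) { loads++; return *addr; }

long promoted_loads(int *a, int n) {
    int a0 = 0, a1 = 0, a2 = 0;
    bool v0 = false, v1 = false, v2;
    long sum = 0;
    for (int i = 0; i < n; i++) {
        v2 = false;
        if (i % 3 != 0) {                      /* placeholder guard: read a[i+2] */
            if (!v2) { a2 = ld(&a[i + 2]); v2 = true; }
            sum += a2;
        }
        if (i % 2 != 0) {                      /* placeholder guard: read a[i] */
            if (!v0) { a0 = ld(&a[i]); v0 = true; }
            sum += a0;
        }
        a0 = a1; v0 = v1;                      /* rotate the values and    */
        a1 = a2; v1 = v2;                      /* their valid bits in step */
    }
    return sum;
}
```

In the test below, the unpromoted loop would issue 7 loads; with valid bits only the 6 distinct addresses are loaded.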

Removing All Redundant Stores
Handling stores seems to be more difficult, since one should forgo a store if the value will be overwritten in a subsequent iteration. However, in the presence of control-flow it is not obvious how to deduce whether the overwriting stores in future iterations will take place. Here we extend the register promotion algorithm to ensure that only one store is executed to each memory location, by showing how to optimize the example in Figure 10.
We want to avoid storing to a[i+2], since that store will be overwritten two iterations later by the store to a[i]. However, this is not true for the last two iterations of the loop. Since, in general, the compiler cannot generate code to test loop-termination several iterations ahead, it looks as if both stores must be performed in each iteration. However, we can do better than that by performing within the loop only the store to a[i], which certainly will not be overwritten. The loop in Figure 11 does exactly that. The loop body never overwrites a stored value but may fail to correctly update the last two elements of array a. Fortuitously, after the loop completes, the scalars a0, a1 hold exactly these two values. So we can insert a loop postlude to fix the potentially missing writes. (Of course, dirty bits should be used to prevent useless updates.)
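The store-side transformation can be sketched as follows. The loop body, "a[i+2] = g(i); a[i] = h(i);" with g(i) = 10*i and h(i) = i+1, is our own placeholder in the spirit of Figures 10 and 11: only the store to a[i], which is never overwritten, stays in the loop; the pending a[i+2] values travel in scalars, and a postlude writes the last two of them, for n+2 stores instead of 2n. The st helper is instrumentation only.

```c
#include <assert.h>

int stores;                        /* instrumentation only */
void st(int *addr, int v) { stores++; *addr = v; }

void promoted_stores(int *a, int n) {   /* assumes n >= 2 */
    int a0 = 0, a1 = 0;                  /* pending values for a[i+1], a[i+2] */
    for (int i = 0; i < n; i++) {
        int g = 10 * i;                  /* pending a[i+2] = g(i), kept in a scalar */
        st(&a[i], i + 1);                /* a[i] = h(i): certainly not overwritten */
        a0 = a1; a1 = g;                 /* rotate the two pending values */
    }
    st(&a[n],     a0);                   /* postlude: fix up the last two  */
    st(&a[n + 1], a1);                   /* elements no later store reached */
}
```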

Implementation
This algorithm is probably much easier to illustrate than to describe precisely. Since the examples have hopefully conveyed the important ideas, we will only briefly sketch the implementation in a CFG-based framework and then describe the Pegasus implementation in somewhat more detail.

CFG-Based Implementation
In general, for each reference a[i+j] (for a compile-time constant j) we maintain a scalar t_j and a valid bit t_j_valid. Scalar replacement then makes the following changes:

• Replace every load from a[i+j] with the pair of statements: t_j = t_j_valid ? t_j : a[i+j]; t_j_valid = true;

• Replace every store a[i+j] = e with the pair of statements: t_j = e; t_j_valid = true;

Furthermore, all stores except the generating store are removed. Instead, compensation code is added "after" the loop: for each t_j, append the statement if (t_j_valid) a[i+j] = t_j;.

Figure 11: Optimal version of the example in Figure 10.
Complexity: aside from the dependence analysis, the algorithm is linear in the size of the loop. Correctness and optimality follow from the following invariant: the t_j_valid flag is true if and only if t_j holds the contents of the memory location it scalarizes.

An SSA-based algorithm
We have implemented the above algorithms in CASH, a C compiler. CASH relies on Pegasus [BG02b, BG02a, BG03], a dataflow intermediate representation.
In this section we briefly describe the main features of Pegasus and then show how it enables a very efficient implementation of register promotion.
As we argued in [BG03], Pegasus enables extremely compact implementations of many important optimizations; register promotion corroborates this statement. Table 1 shows the implementation code size of all the analyses and transformations used by CASH for register promotion.

Pegasus
Pegasus represents the program as a directed graph where nodes are operations and edges indicate value flow. Pegasus leverages techniques used in compilers for predicated execution machines [MLC + 92] by collecting multiple basic blocks into one hyperblock; each hyperblock is transformed into straight-line code through the use of the predicated static single-assignment (PSSA) form [CSC + 00]. Instead of SSA φ nodes, within hyperblocks Pegasus uses explicit multiplexor (mux) nodes; the mux data inputs are the reaching definitions. The mux predicates correspond to the path predicates in PSSA.
Hyperblocks are stitched together into a dataflow graph representing the entire procedure by creating dataflow edges connecting each hyperblock to its successors. Each variable live at the end of a hyperblock gives rise to an eta node [OBM90]. Eta nodes have two inputs (a value and a predicate) and one output. When the predicate evaluates to "true," the input value is moved to the output; when the predicate evaluates to "false," the input value and the predicate are simply consumed, generating no output. A hyperblock with multiple predecessors receives control from one of several different points; such join points are represented by merge nodes.
Operations with side-effects are parameterized with a predicate input, which indicates whether the operation should take place. If the predicate is false, the operation is not executed. Predicate values are indicated in our figures with dotted lines.
The compiler adds dependence edges between operations whose side-effects may not commute. Such edges carry only an explicit synchronization token, not data. Operations with memory side-effects (loads, stores, calls, and returns) all have a token input. When a side-effect operation depends on multiple other operations (e.g., a write operation following a set of reads), it must collect one token from each of them. For this purpose a combine operator is used; a combine has multiple token inputs and a single token output; the output is generated after it receives all its inputs. In figures (e.g., see Figure 12) dashed lines indicate token flow and the combine operator is depicted by a "V". Token edges explicitly encode data flow through memory. In fact, the token network can be interpreted as an SSA form for the memory values, where the combine operator is similar to a φ function. The tokens encode true-, output-, and anti-dependences, and they are "may" dependences. In Figure 12(A) there is one load and two stores. A load is denoted by "=[ ]" and has three inputs (address, predicate, and token); it produces two outputs: the loaded value and another token. A store is denoted by "[ ]=" and has four inputs (address, data, predicate, and token); its only output is a token.

Register Promotion in Pegasus
We sketch the most important analysis and transformation steps carried out by CASH for register promotion. Although the actual promotion in Pegasus is slightly more complicated than in a CFG-based representation (because of the need to maintain φ-nodes), the dependence tests used to decide whether promotion can be applied are much simpler: the graph will have a very restricted structure if promotion can be applied. The key element of the representation is the token edge network, whose structure can be quickly analyzed to determine important properties of the memory operations. We illustrate register promotion on the example in Figure 8.
1. The token network for the Pegasus representation is shown in Figure 13. Memory accesses that may interfere with each other all belong to the same connected component of the token network. Operations that belong to distinct components of the token network commute and can therefore be analyzed separately. In this example there is a single connected component, corresponding to the accesses made to the array a.

2. The addresses of the three memory operations in this component are analyzed: they are all determined to be induction variables having the same step, 1. This implies that the dependence distances between these accesses are constant (i.e., iteration-independent), making these accesses candidates for register promotion.
The induction step of the addresses indicates the type of promotion: a 0 step indicates loop-invariant accesses, while a non-zero step, as in this example, indicates strided accesses.
3. The token network is further analyzed. Notice that prior to register promotion, memory disambiguation has already proved (based on symbolic computation on address expressions) that the accesses to a[i] and a[i+2] commute, and therefore there is no token edge between them. The token network for a consists of two strands: one for the accesses to a[i], and one for a[i+2]; the strands are generated at the mu, on top, and joined before the etas, at the bottom, using a combine (V). Promotion can be carried out if and only if all memory accesses within the same strand are made to the same address.
CASH generates the initialization for the scalar temporaries and the "valid" bits in the loop pre-header. We do not illustrate this step.
4. Each strand is scanned from top to bottom (from the mu to the eta), term-rewriting each memory operation:

• Figure 14 shows how a load operation is transformed by register promotion. The resulting construction can be interpreted as follows: "If the data is already valid, do not do the load (i.e., the load predicate is 'and'-ed with the negation of the valid bit) and use the data. Otherwise, do the load if its predicate indicates it needs to be executed." The multiplexor will select either the load output or the initial data, depending on the predicates. If neither predicate is true, the output of the mux is not defined, and the resulting valid bit is false.

• Figure 15 shows the term-rewriting process for a store. After this transformation, all stores except the generating store are removed from the graph (for this purpose the token input is connected directly to the token output, as described in [BG03]). The resulting construction is interpreted as follows: "If the store occurs, the data-to-be-stored replaces the register-promoted data, and it becomes valid. Otherwise, the register-promoted data remains unchanged."

5. Code is synthesized to shift the scalar values and predicates between strands (the assignments t_{j-1} = t_j), as illustrated in Figure 16.

6. Finally, the outstanding compensation stores are performed during the last iteration. This is achieved by making the predicate controlling these stores the loop-termination predicate. This step is not illustrated.

Handling Loop-Invariant Predicates
The register promotion algorithm described above can be improved by special handling of loop-invariant predicates. If the disjunction of the predicates guarding all the loads and stores of the same location contains a loop-invariant subexpression, then the initialization load can be lifted out of the loop and guarded by that subexpression. Consider Figure 17, to which we apply loop-invariant scalar promotion.
Applying our register promotion algorithm yields the result in Figure 18. However, using the fact that c1 and c2 are loop-invariant, the code can be optimized as in Figure 19. Both Figure 18 and Figure 19 execute the same number of loads and stores, and therefore, by our optimality criterion, they are equally good. However, the code in Figure 19 is obviously superior, since it avoids maintaining the valid flag at run time.
We can generalize this observation: the code can be improved whenever the disjunction of all conditions guarding loads or stores from *p is weaker than some loop-invariant expression (even if none of the conditions is itself loop-invariant), such as in Figure 20. In this case the disjunction of all predicates is f(i)||!f(i) which is constant "true." Therefore, the load from *p can be unconditionally lifted out of the loop as shown in Figure 21.
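The f(i)||!f(i) case can be sketched as follows. The loop body (reads of *p guarded by complementary conditions) is our own placeholder in the spirit of Figure 20: since the disjunction of the guards is the constant "true," *p is loaded unconditionally before the loop, and no valid bit is needed at all. The ld helper is instrumentation only.

```c
#include <assert.h>

int loads;                         /* instrumentation only */
int ld(int *addr) { loads++; return *addr; }

long lifted(int *p, int n) {
    long sum = 0;
    int tmp = ld(p);               /* hoisted: guard f(i) || !f(i) == true */
    for (int i = 0; i < n; i++) {
        if (i & 1) sum += tmp;     /* was: if (f(i))  sum += *p;     */
        else       sum += 2 * tmp; /* was: if (!f(i)) sum += 2 * *p; */
    }
    return sum;
}
```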
In general, let us assume that each statement s is controlled by the predicate P(s). Then, for each promoted memory location a[i+j]:

1. Define the predicate P_j = ∨_{s_j} P(s_j), where s_j ranges over the statements accessing a[i+j].

Figure 18: Optimization of the code in Figure 17 without using the invariance of some predicates.

Figure 19: Optimization of the code in Figure 17 using the invariance of c1 and c2.

Our current implementation of this optimization in CASH only lifts out of the loop the disjunction of all predicates which are actually loop-invariant.

Dynamic Disambiguation
Our scalar promotion algorithm can be naturally extended to cope with a limited number of memory accesses which cannot be disambiguated at compile time. By combining dynamic memory disambiguation [Nic89] with our scheme to handle conditional control flow, we can apply scalar promotion even when pointer analysis determines that memory references interfere. Consider the example in Figure 22: even though dependence analysis indicates that p cannot be promoted since the access to q may interfere, the bottom part of the figure shows how register promotion can be applied.
This scheme is an improvement over the one proposed by Sastry [SJ98], which stores to memory all the values held in scalars when entering an un-analyzable code region (which in this case is the region guarded by f(i)).
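A hedged sketch of combining dynamic disambiguation with the valid bit, in the spirit of Figure 22 (the loop body and the run-time pointer comparison are our own placeholders, not the paper's figure): *p is promoted even though the occasional store through q may alias it; a run-time check invalidates the scalar copy, after flushing it, only when the aliasing actually happens. The ld/st helpers are instrumentation only.

```c
#include <assert.h>
#include <stdbool.h>

int loads, stores;                 /* instrumentation only */
int  ld(int *a)        { loads++;  return *a; }
void st(int *a, int v) { stores++; *a = v; }

void promoted_dyn(int *p, int *q, int n) {
    int tmp = 0;
    bool valid = false;
    for (int i = 0; i < n; i++) {
        if (!valid) { tmp = ld(p); valid = true; }
        tmp += i;                                   /* was: *p += i; */
        if (i % 3 == 0) {                           /* un-analyzable access */
            if (p == q) { st(p, tmp); valid = false; }  /* flush iff aliased */
            st(q, ld(q) + 1);                       /* was: *q += 1; */
        }
    }
    if (valid) st(p, tmp);                          /* postlude store */
}
```

When p and q do not alias, *p is loaded and stored exactly once; when they do, the scalar is flushed and re-validated only around the interfering accesses.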

Hardware support
While our algorithm does not require any special hardware support, certain hardware structures can improve its efficiency.
Rotating registers were introduced in the Cydra 5 architecture [DHB89] to support software pipelining. These were used on Itanium for register promotion [DKK + 99] to shift all the scalar values in one cycle.
Rotating predicate registers, as in the Itanium, can rotate the "valid" flags. Software-only valid bits can also be made cheaper. If a value is reused k iterations later, then our algorithm requires the use of 2k different scalars: k valid bits and k values. A software-only solution is to pack the k valid bits into a single integer (most likely, promotion across more iterations than there are bits in an integer would require too many registers to be profitable) and to use masking and shifting to manipulate them. This makes rotation very fast, but testing and setting more expensive, a trade-off that may be practical on a wide machine having "free" scheduling slots.
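The packed representation can be sketched as follows; all names here are our own. Rotation of all k valid bits becomes a single shift, while testing and setting an individual bit require masking.

```c
#include <assert.h>

/* k valid bits packed into one unsigned integer: bit j is the
 * valid bit of scalar t_j. */
typedef unsigned valid_set;

valid_set rotate(valid_set v)           { return v >> 1; }         /* t_j -> t_{j-1} */
int       is_valid(valid_set v, int j)  { return (v >> j) & 1u; }  /* test bit j */
valid_set set_valid(valid_set v, int j) { return v | (1u << j); }  /* set bit j  */
```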
Predicated data [RC03] has been proposed for an embedded VLIW processor: predicates are not attached to instructions, but to data itself, as an extra bit of each register. Predicates are propagated through arithmetic, similar to exception poison bits. The proposed architecture supports rotating registers by implementing the register file as an actual large shift register. These architectural features would make the valid flags essentially free both in space and in time.

Other Applications of SIDE
This paper introduces the SIDE framework for run-time dataflow evaluation, and presents the register promotion algorithm as a particular instance. Register promotion dynamically evaluates availability and uses predication to remove memory accesses, achieving optimality. SIDE applies naturally to availability information because availability is a forward dataflow problem whose run-time determination is trivial.
PRE [MR79] is another optimization that uses availability information and could possibly benefit from the application of SIDE. In particular, safe PRE forms (i.e., forms which never introduce new computations on any path) seem amenable to the use of SIDE. While some forms of PRE, such as lazy code motion [KRS92], are optimal, they do incur small overheads; for example, safety and optimality together require the restructuring of control-flow, e.g., by splitting some critical CFG edges. A technique such as SIDE could be used on a predicated architecture to trade off the creation of additional basic blocks against conditionally computing the redundant expression.
The technique used by Bodík et al. [BG97] can be seen as another application of the SIDE framework, this time for the backward dataflow problem of dead-code elimination. This application is considerably more difficult, a fact reflected in the complexity of their algorithm.
An interesting question is whether this technique can be applied to other dataflow analyses, and whether its application can produce savings by eliminating computations more expensive than the inserted code.

Expected Performance Impact
The scalar promotion algorithm presented here is optimal with respect to the number of loads and stores executed. However, this does not necessarily translate into improved performance, for four reasons.
First, it uses more registers, to hold the scalar values and flags, and thus may cause more spill code, or interfere with software pipelining.
Second, the transformed code contains more computations than the original program, for maintaining the flags. The optimized program may end up being slower than the original, depending, among other things, on the frequency with which the memory access statements are executed and on whether the predicate computations are on the critical path. For example, if none of the promoted memory accesses is executed dynamically, all the inserted code is pure overhead. In practice, profiling information and heuristics should be used to select the loops which will benefit most from this transformation.
Third, scalar promotion removes memory accesses which hit in the cache, so its benefit appears to be limited. However, in modern architectures L1 cache hits are not always cheap; for example, on the Intel Itanium 2 some L1 cache hits may cost as much as 17 cycles [CL03]. Register promotion trades bandwidth to the load-store queue (or the L1 cache) for bandwidth to the register file, which is always larger.
Fourth, by predicating memory accesses, operations which were originally independent, and could potentially be issued in parallel, now become dependent through the predicates. This can increase the dynamic critical path of the program, especially when memory bandwidth is not a bottleneck.

Performance Measurements
In this section we present measurements of our register promotion algorithm as implemented in the CASH C compiler. We show static and dynamic data for C programs from three benchmark suites: Mediabench [LPMS97], SpecInt95 [Sta95] and Spec CPU2000 [Sta00].
Our implementation does not use dirty bits and therefore is not optimal with respect to the number of stores (it may, in fact, incur additional stores with respect to the original program). However, dirty bits can only save a constant number of stores, independent of the number of iterations, so we have considered their overhead unjustified. We only lift loop-invariant predicates to guard the initializer; our implementation can thus optimize Figure 17, but not Figure 20. As a simple heuristic to reduce register pressure, we do not scalarize a value if it is not reused within 3 iterations.

Table 2: How often scalar promotion is applied. "New" indicates additional cases which are enabled by our algorithm. We count the number of different "variables" to which promotion is applied; if we can promote arrays a and b in the same loop, we count two variables.

Table 2 shows how often scalar promotion can be applied. Column 3 shows that our algorithm found many more opportunities for scalar promotion than previous scalar promotion algorithms would have found (we do not, however, include here the opportunities discovered by PRE). CASH uses a simple flow-sensitive intra-procedural pointer analysis for dependence analysis. Figure 23 and Figure 24 show the percentage decrease in the number of loads and stores, respectively, that results from the application of our register promotion algorithms. The data labeled PRE indicate the number of memory operations removed by our straight-line code optimizations only. The data labeled loop show the additional benefit of applying inter-iteration register promotion. We have included both bars since some of the accesses can be eliminated by both algorithms.
The most spectacular results occur for 124.m88ksim, which shows substantial reductions in both loads and stores. Only two functions, alignd and loadmem, are responsible for most of the reduction in memory traffic; both benefit from a fairly straightforward application of loop-invariant memory access removal. Although loadmem contains control-flow, the promoted variable is always accessed unconditionally. The substantial reduction in loads for gsm e is also due to register promotion of invariant memory accesses in its hottest function, Calculation of the LTP parameters. This function contains a very long loop body, created by the expansion of many C macros, which repeatedly accesses several constant locations in a local array; the loop body contains control-flow, but all accesses to the small array are unconditional. Finally, the substantial reduction in the number of stores for rasta is due to the FR4TR function, which also benefits from unconditional register promotion.
The impact of these reductions on actual execution time depends strongly on hardware support. The performance impact modeled on Spatial Computation (described in [BG03,Bud03]) is shown in Figure 25. Spatial Computation can be seen as an approximation of a very wide machine connected to a traditional memory system by a bandwidth-limited network.
We model a relatively slow memory system, with a 4-cycle L1 cache hit time. Interestingly, the improvement in running time is better when memory is faster (e.g., with a perfect memory system of 2-cycle latency, the gsm e speed-up becomes 18%). This effect occurs because the cost of the removed L1 accesses becomes a smaller fraction of the total execution cost as memory latency increases.
The speed-ups range from a 1.1% slowdown for 183.equake to a maximum speed-up of 14% for gsm e. Speed-up correlates fairly well with the number of removed loads. The number of removed stores seems to have very little impact on performance, indicating that load-store queue contention caused by stores is not a performance problem (since stores complete asynchronously, they have no direct impact on end-to-end performance). Five programs show a performance improvement of more than 5%. Since most of the removed operations are relatively inexpensive, because they have good temporal locality, the performance improvement is not very impressive. Register promotion alone causes a slight slowdown for 4 programs, while being responsible for a speed-up of more than 1% for only 7 programs.

Related work
The canonical register promotion papers are by Steve Carr et al. [CCK90,CK94]. Duesterwald et al. [DGS93] describe a dataflow analysis for analyzing array references; the optimizations based on it are conservative, removing only busy stores and available loads. They note that the redundant stores can be removed and compensated for by peeling the last k loop iterations, as shown in Section 3.4. Lu and Cooper [LC97] study the impact of powerful pointer analysis on register promotion in C programs. Sastry and Ju [SJ98] introduce the idea of selective promotion for analyzable regions. None of these algorithms simultaneously handles both inter-iteration dependences and control-flow in the way suggested in this paper. [SJ98, LCK + 98] show how to use SSA to facilitate register promotion; [LCK + 98] also shows how PRE can be "dualized" to handle the removal of redundant store operations.
Schemes that use hardware support for register promotion, such as [PGM00, DO94, OG01], are radically different from our proposal, which is software-only. Hybrid solutions, combining several of these techniques with SIDE, can be devised.
Bodík et al. [BGS99] analyze the effect of PRE on promoting loaded values and estimate the potential improvements. The idea of predicating code for dynamic optimality was also advanced by Bodík [BG97], where it was applied to partial dead-code elimination; in fact, that paper can be seen as an application of the SIDE framework to the dead-code dataflow problem. Muchnick [Muc97] gives an example of this promotion opportunity (as in Figure 21), but does not describe a general algorithm for solving the problem optimally.

Conclusions
We have described a scalar promotion algorithm which eliminates all redundant loads and stores, even in the presence of conditional control-flow. The key insight of our algorithm is that availability information, traditionally computed only at compile-time, can be evaluated more precisely at run-time. We transform memory accesses into scalar values, performing a load only when the scalar does not already contain the correct value, and a store only when its value will not be overwritten. Our approach substantially increases the number of instances in which register promotion can be applied.
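The transformation can be summarized by the following sketch (hypothetical names, and no dirty bit, matching our implementation): a conditionally accessed location a[k] is promoted to a scalar guarded by a run-time flag, so at most the first dynamic access loads from memory and at most one store writes the final value back after the loop.

```c
/* Before promotion: a[k] is loaded and stored on every iteration
 * in which the (statically unanalyzable) condition holds. */
void g_before(int *a, int k, const int *b, int n) {
    for (int i = 0; i < n; i++)
        if (b[i] > 0)
            a[k] += b[i];       /* load + store of a[k] each time */
}

/* After promotion: `valid` is the run-time availability flag. */
void g_after(int *a, int k, const int *b, int n) {
    int t = 0;
    int valid = 0;              /* instantiated availability information */
    for (int i = 0; i < n; i++)
        if (b[i] > 0) {
            if (!valid) {       /* only the first access loads */
                t = a[k];
                valid = 1;
            }
            t += b[i];
        }
    if (valid)                  /* a single store, only if a[k] was accessed */
        a[k] = t;
}
```

If the condition never holds, a[k] is neither read nor written, so the promotion is not speculative.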
As the computational bandwidth of processors increases, such optimizations may become more advantageous. In the case of register promotion, the benefit of removing memory operations sometimes outweighs the cost of the additional scalar computations needed to maintain the dataflow information at run-time; however, since the removed operations tend to be inexpensive (i.e., they hit in the load-store queue or in the L1 cache), the resulting performance improvements are relatively modest.