Low depth cache-oblivious algorithms

In this paper we explore a simple and general approach for developing parallel algorithms that lead to good cache complexity on parallel machines with private or shared caches. The approach is to design nested-parallel algorithms that have low depth (span, critical path length) and for which the natural sequential evaluation order has low cache complexity in the cache-oblivious model. We describe several cache-oblivious algorithms with optimal work, polylogarithmic depth, and sequential cache complexities that match the best sequential algorithms, including the first such algorithms for sorting and for sparse-matrix vector multiply on matrices with good vertex separators.
 Using known mappings, our results lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. We generalize these mappings to multi-level cache hierarchies of private or shared caches, implying that our algorithms also have low cache complexities on such hierarchies. The key factor in obtaining these low parallel cache complexities is the low depth of the algorithms we propose.


INTRODUCTION
Due to the physical realities of building machines it seems likely that locality will always play a role in designing efficient algorithms for parallel machines.Indeed many parallel models have been designed to take account of locality on both shared [4,48,7] and distributed memory machines [48,33,12].These models, however, assume a fixed number of processors for which the algorithm designer or programmer have to map their algorithms onto.What seems to be emerging instead as the dominant programming paradigm for shared memory parallel machines is one based on dynamic parallelism.In such models the programmer expresses the full parallelism without concern of how it maps onto processors.The runtime system then supplies a scheduler that maps this dynamic parallelism onto the processors of the machine.A common form of programming in this model is based on nested parallelism-consisting of nested parallel loops and/or fork-join constructs [13,26,20,35,44].If locality is not of concern, performance costs in such models can be calculated in terms of work (number of operations) and depth (also known as the span or the critical path length) and can be mapped onto runtime on a fixed number of processors.This can greatly simplify how programmers think about parallelism.It is not clear, however, how to capture locality in these models in a high-level way.
In this paper we are interested in analyzing the locality of algorithms written with dynamic nested parallelism.We consider a paradigm based on analyzing the cost of an algorithm using in addition to work and depth the cache complexity in the sequential cache-oblivious model.The cache-oblivious model (ideal-cache model) [38] is a two-level model of computation comprised of an unbounded memory and a cache of size M .Data are transferred between the two levels using cache lines of size B; all computation occurs on data in the cache.Both M and B are unknown to the algorithm, and the goal is to minimize an algorithm's cache complexity (number of cache lines transferred).Sequential algorithms designed for this model have the advantage of achieving good sequential cache complexity across all levels of a (single processor) multi-level cache hierarchy, regardless of the values of M i and Bi at each level i [38].Researchers have developed cache-oblivious algorithms for a many problems [6,22,34].
The cache complexity Q(n; M, B) for a natural sequential execution of a parallel program (on input of size n) can be used to bound the cache complexity Qp(n; M, B) for the same program on certain p-processor parallel machines with a single level of cache(s) [1,15].In particular, for a shared-memory parallel machine with private caches (each processor has its own cache) using a work-stealing scheduler, Qp(n; M, B) < Q(n; M, B) + O(pM D/B) with probability 1−δ [1], 1 and for a shared cache using a parallel-depth-first (PDF) scheduler, Qp(n; M + pBD, B) ≤ Q(n; M, B) [15], where D is the depth of the computation.These results apply to nested-parallel computations-computations starting with a single thread and using nested fork-join parallelismthat use binary forking (spawning) of threads.(When viewed as a computation dag where the nodes are constant-work tasks and the edges are dependences between tasks, the dags for such compu-tations are series-parallel.)The "natural" sequential execution is simply one that runs each call in a fork to completion before starting the next.
These results for a single level of cache(s) suggest a simple approach for developing cache-efficient parallel algorithms: Develop a nested-parallel algorithm with (1) low cache-oblivious complexity for the sequential ordering, and (2) low depth; then use the results to bound the cache complexity on a parallel machine.Low depth is important because D shows up in the term for additional misses for private caches, and additional cache size for a shared cache.Moreover, we show that algorithms designed with this approach can also achieve good parallel cache complexity on multilevel private or shared caches.For example, we show that for a work-stealing scheduler on a multi-level private cache hierarchy Qp(n; Mi, Bi) < Q(n; Mi, Bi) + O(pMiD/Bi) with probability 1 − δ for each level i, and that this bound is tight.
As an example of the approach consider Strassen's matrix multiply.It is nested-parallel because the seven recursive calls can be made in parallel and the matrix addition can be implemented by forking off a tree of parallel calls.For n × n matrices the total depth is O(log 2 n)-O(log n) levels of recursion, each with O(log n) depth for the additions.As shown in [38], the sequential cache complexity is Q(n; M, B) = O(n lg 7 /(B √ M )).Thus, we have that Qp(n; M, B) < Q(n; M, B) + O(pM log 2 (n)/B) for a single level of private caches and Qp(n for a shared cache.For practical parameters these bounds indicate either only marginally more total misses than the sequential version for private caches or only a marginally larger cache size for shared caches.Similarly good bounds are obtained for multi-level hierarchies of private or shared caches, using our results for such hierarchies. Although matrix multiply and some other known cache-oblivious algorithms are naturally parallel with low depth (e.g., matrix transpose and FFT [38]), many are not.Importantly, prior cache-oblivious sorting algorithms with optimal sequential cache complexity [23,24,25,36,38] are not parallel.This paper presents the first low (i.e., polylogarithmic) depth cache-oblivious sorting algorithm with optimal cache complexity.Under the standard "tall cache" assumption M = Ω(B 2 ) [38], our (deterministic) sorting algorithm has cache complexity Q(n; M, B) = O( n B log M n) and work W = O(n log n), which are optimal, and depth D = O(log 2 n).We improve the depth for a randomized version.In contrast, parallelizing the prior algorithms using known techniques would result in depth at least Ω( √ n).We illustrate how our sorting algorithm can be used to construct the first polylogarithmic depth, cache-oblivious, optimal cache complexity algorithms for other important problems such as list ranking and tree contraction.Finally, we present the first cache-oblivious, low cache complexity algorithm for sparsematrix vector (SpMV) multiply on matrices with good vertex separators (roughly speaking a sparse matrix has good separators if the corresponding graph can be partitioned by removing a reasonably small set of vertices and their incident edges so no partition is too large).All planar graphs have good separators.The SpMV algorithm is optimal work, O(log 2 n) depth, and its sequential cache complexity improves upon the previous best sequential algorithm and is optimal for planar graphs.
Other work on parallel cache-oblivious algorithms has concentrated on bounding cache misses for particular classes of algorithms.This includes results by Frigo et al. [39] for a class of algorithms with a regularity condition, by Blelloch et al. [14] for a class of binary divide-and-conquer algorithms, and by Chowdhury and Ramachandran [28,29] for a class of dynamic programming and Gaussian elimination-style problems.Recent work by Chowdhury et.∀k ∈ [1 : [31] has studied cache oblivious algorithms for a parallel model with a tree of caches.Our design motive is to have a generic approach that enables one to analyze an algorithm independently of the model in a simple way and then map onto different machines; we study SpMV-multiply, sorting and related algorithms as specific instances of our general approach.Our work may also be contrasted with that of [7], which presents cache-efficient algorithms for private caches but the algorithms are not cache oblivious and are based on a fixed number of processors p.
A preliminary version of this paper appeared as a three page brief announcement in SPAA'09 [17].

SORTING
In this section, we present the first cache-oblivious sorting algorithm that achieves optimal work, polylogarithmic depth, and good sequential cache complexity.Prior cache-oblivious algorithms with optimal cache complexity [23,24,25,36,38] have Ω( √ n) depth.

Algorithm Preliminaries
Our sorting algorithm uses known algorithms for matrix transpose, prefix sum and merging as subroutines.We first describe the exact variants of these algorithms that the sorting algorithm uses.The costs are summarized in Figure 1.The standard divideand-conquer matrix-transpose algorithm [38] is work optimal, has logarithmic depth and has optimal cache complexity when M = Ω(B 2 ).A simple variant of the tree-based parallel prefix-sums algorithm has logarithmic depth and cache complexity O(n/B).As usual the algorithm works in two phases generating partial sums in a tree in one phase and propagating results down in the next.Each phase is implemented using divide-and-conquer over the tree.For cache efficiency, the tree of partial sums is laid out in the infix order.This gives an algorithm that runs with cache complexity O(n/b) and depth O(log n) even if the cache only has a single cache line.
Algorithm 1 merges two arrays A and B of sizes lA and lB (lA + lB = n).The pivots ranked {n 2/3 , 2n 2/3 , . . .} can be found using a dual binary search on the arrays.This takes O(n 1/3 • log n) work, O(log n) depth and at most O(n 1/3 log (n/B)) cache misses.Once the locations of pivots have been identified, the subarrays which are of output size n 2/3 each can be recursively merged and appended.The recursive relation for the cache complexity is It is easy to see that the work involved is linear.Using this merge algorithm in a mergesort in which the two recursive calls are parallel gives an algorithm with depth O(log 2 n) and cache complexity O((n/B) log 2 (n/M )), which is not optimal.Blelloch et al. [14] analyze similar merge and mergesort algorithms with the same (suboptimal) cache complexities but with larger depth.

Deterministic Sorting
Our parallel sorting algorithm is based on a version of sample sort [37,45], and has optimal cache complexity.Sample sorts first use a sample to select a set of pivots that partition the keys into buckets, then route all the keys to their appropriate buckets, and finally sort within the buckets.Compared to prior cache-friendly (sequential) sample sort algorithms [2,41], which with slight modification can be improved to Ω( √ n) depth, our cache-oblivious algorithm uses (and analyzes) a new parallel bucket-transpose algorithm for the key distribution phase, in order to achieve O(log 2 n) depth.
The algorithm (Algorithm COSORT in Figure 2) first splits the set of elements into √ n subarrays of size √ n and recursively sorts each of the subarrays.Then, samples are chosen to determine pivots.This step can be done either deterministically or randomly.We first describe a deterministic version of the algorithm for which the repeat and until statements are not needed; Section 2.3 will describe a randomized version that uses these statements.For the deterministic version, we choose every (log n)-th element from each of the subarrays as a sample.The sample set, which is smaller than the given data set by a factor of log n, is then sorted using the mergesort algorithm outlined above.Because mergesort is reasonably cache-efficient, using it on a set slightly smaller than the input set is not too costly in terms of cache complexity.More precisely, this mergesort does not incur more than O( n/B ) cache misses.We can then pick √ n evenly spaced keys from the sample set P as pivots to determine bucket boundaries.To determine the bucket boundaries, the pivots are used to split each subarray using the cache-oblivious merge procedure.This procedure also takes no more than O( n/B ) cache misses.
Once the subarrays have been split, prefix sums and matrix transpose operations can be used to determine the precise location in the buckets where each segment of the subarray is to be sent.This mapping information is stored in a matrix T of size √ n × √ n.Note that none of the buckets will be loaded with more than 2 √ n log n keys because of the way we select pivots.
Once the bucket boundaries have been determined, the keys need to be transferred to the buckets.Although a naive algorithm to do this is not cache-efficient, we show that the bucket transpose algorithm (Algorithm B-TRANSPOSE in Figure 2) is.The bucket transpose is a four way divide-and-conquer procedure on the (almost) square matrix T which indicates a set of segments of subarrays (segments are contiguous in each subarray) and their target locations in the bucket.The matrix T is cut in half vertically and horizontally and separate recursive calls are assigned the responsibility of transferring the keys specified in each of the four parts.Note that ordinary matrix transpose is the special case of Ti,j = j, i, 1 for all i, j.
PROOF.(sketch) For each node v in the recursion tree of bucket transpose, we define the node's size s(v) to be n 2 , the size of its submatrix T , and the node's weight w(v) to be the number of keys that T is responsible for transferring.We identify three classes of nodes in the recursion tree: 1. Light-1 nodes: A node v is light-1 if s(v) < M/100, w(v) < M/10, and its parent node is of size ≥ M/100.2. Light-2 nodes: A node v is light-2 if s(v) < M/100, w(v) < M/10, and its parent node is of weight ≥ M/10.

Heavy leaves
The union of these three sets covers the responsibility for transferring all the keys, i.e., all leaves are accounted for in the subtrees of these nodes.
From the definition of a light-1 node, it can be argued that all the keys that a light-1 node is responsible for fit inside a cache, implying that the subtree rooted at a light-1 node cannot incur more than M/B cache misses.It can also be seen that light-1 nodes can not be greater than 4n/(M/100) in number leading to the fact that the sum of cache complexities of all the light-1 nodes is no more than O( n/B ).
Light-2 nodes are similar to light-1 nodes in that their target data fits into a cache of size M .If we assume that they have combined weight of n − W , then there are no more than 4(n − W )/(M/10) of them, putting the aggregate cache complexity for their subtrees at 40(n − W )/B.
A heavy leaf of weight w incurs w/B cache misses.There are no more than W/(M/10) of them, implying that their aggregate cache complexity is W/B + 10W/M < 11W/B.Therefore, the cache complexities of light-2 nodes and heavy leaves adds up to Pick an appropriate sorted pivot set P of size h ∀i ∈ [1 : h], Mi ← SPLIT(Si, P) {Each array Mi contains for each bucket j a start location in Si for bucket j and a length of how many entries are in that bucket, possibly 0.} L ← h × h matrix formed by rows Mi with just the lengths , Mi,j 2 {Each triple corresponds to an offset in row i for bucket j, an offset in bucket j for row i and the length to copy.} until No bucket is too big Let B1, B2, . . ., B h be arrays (buckets) of sizes dictated by T Bucket transpose diagram: The 4x4 entries shown for T dictate the mapping from the 16 depicted segments of S to the 16 depicted segments of B. Arrows highlight the mapping for two of the segments.

Figure 2: Cache-Oblivious Sorting and Bucket-Transpose Algorithms
another O( n/B ).We also note that the validity of this proof does not depend on the size of the individual buckets.The statement of the lemma holds even for the case where each of the buckets is as large as O( √ n log n).

PROOF. All the subroutines other than recursive calls to COSORT have linear work and cache complexity O( n/B
).Also, the subroutine with the maximum depth is the mergesort used to find pivots; its depth is O(log 2 n).Therefore, the recurrence relations for the work, depth, and cache complexity are as follows: where the nis are such that their sum is n and none individually exceed 2 √ n log n.The base case for the recursion for cache complexity is Q(n; M, B) = O( n/B ) for n < cM for some constant c.Solving these recurrences proves the theorem.

Randomized Sorting
A simple randomized version of the sorting algorithm is to randomly pick √ n elements for pivots, sort them using brute force (compare every pair) and using the sorted set as the pivot set P. This step takes O(n) work, O(log n) depth (let cn be a constant such that depth is at most cn log(n)) and has cache complexity O(n/B) and the probability that the largest of the resultant buckets are larger than 3 √ n log n is not more 1 − 1/n.When one of the buckets is too large (> 3 √ n log n), the process of selecting pivots and recomputing bucket boundaries is repeated.Because the probability of this happening repeatedly is low, the overall depth of the algorithm is small.Further, the recursion is stopped when the problem size reduces to 2 40 .

THEOREM 2.3. On an input of size n, the randomized version of COSORT has, with probability greater than
PROOF.(sketch) In a call to randomized COSORT with input size n, the loop terminates with probability 1 − 1/n in each round and takes less than 2 iterations on average to terminate.Each iteration of the while loop, including the brute force sort requires O(n) work and incurs at most O( n/B ) cache misses with high probability.Therefore, where each ni < 3 √ n log n and Similarly for cache complexity we have which implies ´.
To show the high probability bounds for work and cache complex-ity, we can use Chernoff bounds since the fan out at each level of recursion is high.
To analyze the depth of the dag, we obtain high probability bounds on the depth of each level of recursion tree (we assume that the levels are numbered starting with the root at level 0).To get sufficient confidence bounds at each level we might have to execute the outer loop more times toward the leaves where the problem size is small.Each iteration of the outer loop at node N of input size m at level k in the recursion tree has depth log m and the termination probability of the loop is 1 − 1/m.
We first show probability bounds on the depth of a maximal path in the computation dag.We represent the path as a recursion tree and show that the sum of depths of all nodes at level d in the recursion tree is at most c d log 3/2 n/ log 2 log 2 n with probability at least 1 − 1/n (log 2 log 2 n) 2  (for some constant c d to be defined shortly) and that the recursion tree is at most 1.5 log 2 log 2 n levels deep.This will prove that the depth of the recursion tree (i.e.the path) is 1.5c d log 3/2 n with probability at least Since each of the paths is a candidate for the critical path, the actual depth is a maximum over all such paths.We will argue that there are not more than C(n) = n 1.5 log 2 log 2 n such paths.Then, by the union bound, it follows that the probability that any single path is longer than 1.5c d log 3/2 n is less than (high probability bound).
The maximum number of levels in the recursion tree can be bounded using the recurrence relation Using induction, it is straightforward to show that this solves to X(n) < 1.5 log 2 log 2 n.Similarly the number of paths C(n) can be bounded using the relation Again, using induction, this relation can be used to show that C(n) = n 1.5 log 2 log 2 n .
To compute the sum of the depth of nodes at level d in the recursion tree, we consider two cases: (1) when d > 80 log 2 log 2 log 2 n and (2) otherwise.
Case 1: The size of a node one level deep in the recursion tree is at most 3 √ n log n ≤ n 1/2+r for r = 1/6.Also, the size of a node which is d levels deep is at most n (1/2+r) d , each costing cn(1/2 + r) d log n depth per trial.Since there are at most 2 d nodes at level d in the recursion tree, and the failure probability of a loop in any node is no more than 1/2, we show that the probability of having to execute more than ) loops is small.Since we are estimating the sum of 2 d independent variables, we use Chernoff bounds of the form: The resulting probability bound is less than 1/n log 2 2 log 2 n for d > 80 log 2 log 2 log 2 n.Therefore, the contribution of nodes at level d in the recursion tree to the depth of recursion tree is at most We classify all nodes at level d in to two kinds, the large ones with size greater than log 2 n and the smaller ones with size at most log 2 n.The total number of nodes is at most 2 d < (log 2 log 2 n) 80 .Consider the small nodes.Each small node can contribute a depth of at most 2cn log 2 log 2 n to the recursion tree and there are at most (log 2 log 2 n) 80 of them.If we set c d to be the minimum number such that c d log 3/2 n > cn log 2 log 82 2 n, then the contribution of small nodes to depth of the recursion tree at level d is lesser than c d log 3/2 n/ log 2 log 2 n.
We use Chernoff bounds to bound the contribution of large nodes to the depth of the recursion tree.Suppose that there are j large nodes.We show that with probability not more than 1/n log 2 log n , it takes more than 10 • j loop iterations at depth d for j of them to succeed.For this, consider 10 • j random independent trials with success probability at least 1 − 1/ log 2 n each.The expected number of failures is no more than μ = 10 • j/ log 2 n.We want to show that the probability that there are greater than 9 • j failures in this experiment is tiny.Using Chernoff bounds with the above μ, and δ = (0.9 • log 2 n − 1), we infer that this probability is less than 1/n log 2 2 log 2 n .Since j < 2 d , the depth contributed by the larger nodes is at most We have shown that each level in the recursion tree adds at most c d log 3/2 n/ log 2 log 2 n depth to a path with probability at least 1− 1/n log 2 2 log n n .The proof follows from the union bound described earlier.

APPLICATIONS OF SORTING: GRAPH ALGORITHMS
In this section, we make use of the fact that the PRAM algorithms for many problems can be decomposed into primitive operations such as scans and sorts.Our approach is similar to that of [7] except that we use the cache-oblivious model instead of the parallel external memory model.Arge et al. [5] demonstrate a cacheoblivious algorithm for list ranking using priority queues and use it to construct other graph algorithms.But their algorithms have Ω(n) depth because list ranking uses a serially-dependent sequence of priority queue operations to compute independent sets.Our parallel algorithms derived from sorting are the same as in [27] except that we use different algorithms for the primitive operations scan and sort, which suit our cache-oblivious framework.Moreover, a careful analysis (using standard techniques) is required to prove our sequential cache complexity and depth bounds under this framework.List Ranking.A basic strategy for list ranking [40] is the following: (i) shrink the list to size O(n/ log n), and (ii) apply pointer jumping on this shorter list.Stage (i) is achieved through finding independent sets in the list of size Θ(n) and removing them to yield a smaller problem.This can be done randomly using the random mate technique in which case, O(log log n) rounds of such reduction would suffice.Alternately, we could use the deterministic technique described in section 3.1 of [40]: use two rounds of Cole and Vishkin's deterministic coin tossing [32] to find a O(log log n)ruling set and then convert the ruling set to an independent set of size at least n/3 in O(log log n) rounds.Arge et al. [8] show how this conversion can be made cache-efficient, and it is straightforward to change this algorithm to a cache-oblivious one.Stage (ii) uses O(log n) rounds of pointer jumping, each round essentially involving a sort operation to figure out the next level of pointer jumping.Thus, the cache complexity of this stage is asymptotically the same as sorting and its depth is O(log n) times the depth of sorting: Graph Algorithms.We tabulate the complexity measures of basic graph algorithms on bounded degree graphs (Figure 3).The al-

Depth
Cache Complexity List Ranking gorithms that lead to these complexities are straightforward cacheoblivious adaptations of known PRAM algorithms (as described for the external memory model in [27]) using primitives from earlier sections.For instance, finding an Euler Tour involves scanning the input to compute the successor function for each edge and running a list ranking.Tree contraction involves constructing an Euler tour of the tree, finding an independent set on it and contracting the tour to recursively solve a smaller problem.Finding Least Common Ancestors of a set of vertex pairs in a tree involves computing the Euler tour and reducing the problem to a range minima query problem (which is solved with search trees).Deterministic algorithms for Connected Components and Minimum Spanning Forest are similar and use tree contraction as their basic idea; the cache bounds are slightly worse than those in [27]: log(|V |/ √ M ) versus log(|V |/M ).While [27] uses knowledge of M to transition to a different approach once the vertices in the contracted graph fit within the cache, cache-obliviously we need for the edges to fit before we stop incurring misses.

SPARSE-MATRIX VECTOR MULTIPLY
We consider the problem of multiplying a sparse n × n matrix with m non-zeros by a dense vector.For general sparse matrices Bender et al. [10] show a lower bound on cache complexity, which for m = O(n) matches the sorting lower bound.However, for certain matrices common in practice the cache complexity can be improved.For example, for matrices with non-zero structure corresponding to a well-shaped d-dimensional mesh, a multiply can be performed with cache complexity O(m/B + n/M 1/d ) [11].This requires pre-processing to lay out the matrix in the right form.However, for applications that run many multiplies over the same matrix, as with many iterative solvers, the cost of the pre-processing can be amortized.Note that for M ≥ B d (e.g., the tall-cache assumption in 2 dimension), the cache complexity reduces to O(m/B) which is asymptotically the same as scanning memory.
The cache-efficient layout and bounds for well-shaped meshes can easily be extended to graphs with good edge separators [14].The layout and algorithm, however, is not efficient for graphs such as planar graphs or graphs with constant genus that have good vertex separators but not necessarily any good edge separators.In this paper we generalize the results to graphs with good vertex separators and present the first cache-oblivious, low cache complexity algorithm for the sparse-matrix multiplication problem on such graphs.We do not analyze the cost of finding the layout, which involves the recursive application of finding vertex separators, as it can be amortized across many solver iterations.Our algorithm for matrices with n separators has linear work, O(log 2 n) depth, and O(m/B + n/M 1− ) sequential cache complexity.
Let S be a class of graphs that is closed under the subgraph relation.We say that S satisfies a f (n)-vertex separator theorem if there are constants α < 1 and β > 0 such that every graph In our presentation we assume the matrix has symmetric non-zero structure (if it is asymmetric we can always add zero weight reverse edges while at most doubling the number of edges).
We now describe how to build a separator tree assuming we have a good algorithm FindSeparator for finding separators.For planar graphs this can be done in linear time [43].The algorithm for building the tree is defined by Algorithm BuildTree in Figure 4.At each recursive call it partitions the edges into two subsets that are passed to the left and right children.All the vertices in the separator are passed to both children.Each leaf corresponds to a single edge.We assume that FindSeparator only puts a vertex in the separator if it has an edge to each side and always returns a separator with at least one vertex on each side unless the graph is a clique.If the graph is a clique, we assume the separator contains all but one of the vertices, and that the remaining vertex is on the left side (Va) of the partition.By convention we place any edges between vertices in the separator in E b .
Every vertex of degree Δ in our original graph corresponds to a binary tree embedded in the separator tree with Δ leaves, one for each of its incident edges.To see this consider a single vertex.Every time it appears in a separator, its edges are partitioned into two sets, and the vertex is copied to both recursive calls.Because the vertex will appear in Δ leaves, it must appear in Δ − 1 separators, so it will appear in Δ − 1 internal nodes of the separator tree.We refer to the tree for a vertex as the vertex tree, each appearance of a vertex in the tree as a vertex copy, and the root of each tree as the vertex root.The tree is used to sum the values for matrix vector multiply.
We reorder the rows/columns of the matrix based on a preorder traversal of their root locations in the separator tree (i.e., all vertices in the top separator will appear first).This is the order we will use for the input vector x and output vector y when calculating y = Ax.We keep a vector R in this order that points to each of the corresponding roots of the tree.The separator tree is maintained as a tree T in which each node keeps its copies of the vertices in its separator.Each of these vertex copies will point to its two children in the vertex tree.Each leaf of T is an edge and includes the indices of its two endpoints and its weight.In all internal vertex copies we keep an extra value field to store a temporary variable, and in the leaves we keep two value fields, one for each direction.Finally we note that all data for each node of the separator tree is stored adjacently (i.e., all its vertex copies are stored one after the other), and the nodes are stored in preorder.This is important for cache efficiency.
Our algorithm for sparse-matrix vector multiplication is described in Algorithm SparseMxV in Figure 4.This algorithm will take the input vector x and leave the results of the matrix multiplication in the root of every vertex.To gather the results up into a result vector y we simply use the root pointers R to fetch each root.The algorithm does not do any work beyond the recursive calls on the way down the recursion, but when it gets to a leaf the edge multiplies its two endpoints by its weight putting the result in its temporary value.If the matrix is symmetric then only one weight is needed.Then on the way back up the recursion the algorithms sums these values.In particular whenever it gets to an internal node of a vertex tree it adds the two children.Since the algorithm works bottom up the values of the children are always ready when the parent reads them.PROOF.For a constant k we say a vertex copy is heavy if it appears in a separator node with size (memory usage) larger than M/k.We say a vertex is heavy if it has any heavy vertex copies.We first show that the number of heavy vertex copies for any constant k is bounded by O(n/M 1− ) and then bound the number of cache misses based on the number of heavy copies.
For a node of n vertices, the size X(n) of the tree rooted at the node is bounded by the separator condition giving the recurrence relation: . Therefore, there exists a constant c such that for n < cM, the subtree rooted at node of n vertices fits into the cache.We use this to count the number of heavy vertex copies H(n).The recurrence relation for H(n) is: for n > cM and 0 otherwise.This recurrence relation is satisfied by ).Now we note that if a vertex is not heavy (i.e., light) it is only used by a single subtree that fits in cache.Furthermore because of the ordering of the vertices based on where the roots appear, all light vertices that appear in the same subtree are adjacent.Therefore the total cost of cache misses for light vertices is O(n/B).We note that the edges are traversed in order so they only incur O(m/B) misses.Now each of the heavy vertex copies can be responsible for at most O(1) cache misses.In particular reading each child can cause a miss.Furthermore reading the value from a heavy vertex (at the leaf of the recursion) could cause a miss since it is not stored in the subtree that fits into cache.But the number of subtrees that just fit into cache (i.e., their parents don't) and read a vertex u is bounded by one more than the number of heavy copies of u.Therefore we can count each of those misses against a heavy copy.We therefore have a total of O(m/B + n/M 1− ) misses.
The work is simply proportional to the number of vertex copies, which is less than twice m and hence is bounded by O(m).For the depth we note that the two recursive calls can be made in parallel and furthermore the for all statement can be made in parallel.Furthermore the tree is depth O(log n) because of the balance condition on separators.Since the branching of the for all takes O(log n) depth, the total depth is bounded by O(log 2 n).

MAPPING TO PARALLEL MULTI-LEVEL HIERARCHIES
In this section, we discuss how (low-depth) algorithms designed for the cache-oblivious model can be scheduled on natural multilevel generalizations of one-level private or shared cache machines, such that we can upper bound the parallel cache complexity and the parallel run time.We begin by defining the two models we consider.

PMDH and PMSH Models
We consider a Parallel Multi-level Distributed Hierarchy (PMDH) (Figure 5(left)), where each of the p processors has a multi-level private hierarchy and a Parallel Multi-level Shared Hierarchy (PMSH) (Figure 5(right)), where all the processors share a multi-level cache hierarchy.All computation by a processor p occurs on data in p's (private or shared) level-one cache.One or more cache lines of a given cache at level i < k fit precisely in a cache line of its "parent" cache at level i + 1.We assume the cache hierarchy is inclusive: each cached word at level i < k is also cached in its parent cache at level i + 1.A processor requesting a memory word fetches the cache line containing the word from the lowest-level ancestor cache containing the line (and populates all intervening caches).If the processor writes to the memory word, only the level-one cache line is updated and the line becomes dirty.Whenever a dirty line is evicted from a cache, its contents are written back to the corresponding line in its parent cache.Each cache is fully associative and uses an optimal replacement policy (within the constraints of being inclusive).
Cache Consistency in PDMH.In private cache models, the same memory word can be in the caches of multiple processors and these copies must be kept consistent.As in the two-level private cache model studied by Frigo et al. [39], the multi-level PMDH assumes a variant of the dag consistency cache consistency model [19] that uses an optimal replacement policy instead of LRU replacement.(We revisit LRU replacement in Section 6.) Caches are non-interfering in that the cache misses of one processor can be analyzed independent of other processors.To maintain this property, Frigo et al. use the BACKER protocol [19].This protocol manages caches so that if an instruction j is a descendant of instruction i, then values written to memory words by i are reflected in j's memory accesses.However, concurrent writes to objects by instructions that do not have a path between them in the dag will not be communicated between processors executing these instructions.Such writes are reconciled to shared memory and reflected in other cache copies only when a descendant of the instruction that performed these writes tries to access them.Reconciliation of a memory block involves updating all written words within the block; the protocol must track all such writes.In case of multiple writes to the same word, an arbitrary write succeeds.Concurrent reads are permitted.We likewise assume the same non-interfering property, with the same reconciliation process.

Extending Private Cache Results to Multiple Levels
Because each processor in the PMDH model has its own private memory hierarchy, it is better to have each processor work on parts of computations that are as far apart as possible.The work stealing scheduler [21,9,1] is an ideal choice for such a system.In its simpler form, a work stealing scheduler maintains a task dequeue for each processor.When a processor spawns a new job, the new job is queued at the tail of its dequeue.When a processor runs out of work, it pulls out the job at the head of its task queue.If its own task queue is empty, the processors randomly picks another task queue to steal from.This version of the work stealing is referred to as randomized work stealing.Another (perhaps less practical) version of work stealing uses a single shared task dequeue for all processors; we refer to this as centralized work stealing.
We derive run time bound for algorithms under randomized work stealing for the PMDH such that the only algorithm-specific metrics in the bound are W , D and the sequential cache complexity Q. (To simplify notation, we will use Q(M, B) instead of Q(n; M, B) in the remainder of this section.)Given a particular execution X of a computation A on some parallel machine P , let c(x) be the cost of instruction x.This cost includes the time for accessing the data used by x; if the access needs to fetch the data from level i cache, the cost is C i (we view the shared memory as level k + 1).The latency added work under execution X is W lat A,P (X) = P x∈A c(x).The latency-added work, W lat A,P , of a computation is the maximum of W lat A,P (X) over all executions X.This can be bounded by PROOF.We use Lemma 12 from [21] to bound the number of steals.Since that result uses a simpler model for the computation that does not charge cache miss costs towards run time, we reduce our dag to a simpler form on which the lemma can be applied directly.For each instruction in A, we replace the instruction by a chain of c(x) (according to some execution) sequential instructions.Each of these replaced instructions take unit time.If x is a fork (join) point, the last (first) node in this chain does the equivalent fork (join).Since we assume a dag-consistent memory model, the run time of this modified computation A is the same as that of A. Since the depth of A under any execution does not exceed D lat A,P , the schedule involves not more than O(p(D lat A,P + log 1/δ)) steals with probability at least 1 − δ, all the caches at level i incur a total of at most Q(Mi, Bi) + O(p(D lat A,P + log 1/δ)Mi/Bi) cache misses with probability at least 1 − δ.To bound the running time of the computation A which has at most W lat A,P instructions and D lat A,P depth, we use Theorem 13 from [21].Since W lat A,P ≤ W + ) with probability at least 1 − δ, the claim about the running time follows.
Thus, for constant δ, the parallel cache complexity at level i exceeds the sequential cache complexity by O(pD lat A,P Mi/Bi) with probability 1 − δ.The bounds in Theorem 5.1 carry over to centralized work stealing without the δ terms, e.g., the parallel cache complexity exceeds the sequential cache complexity by O(pD lat A,P Mi/Bi).Throughout this section, the runtime bounds do not include scheduler overheads (which would increase the runtime by at most a small constant factor).) is a constant that we will define shortly).Each of the first D/2 nodes on the spine forks off a "superscan" that consists of 3 1+12K/(1−c h ) identical parallel scans of length M each.A scan over an array A of size M is a binary tree forking out in to M parallel leaves, each leaf scanning one of the consecutive words in the array A. The remaining D/2 nodes on the spine are the joins corresponding to superscans forked up the spine.Note that D lat A,P = D + C k+1 because each path in the DAG contains at most one memory request.
In a sequential execution, a processor executes the superscans one by one and can reuse a subsequence of length Mi/Bi (at level i) for all the identical scans with in a superscan.In other words, sequential execution gets (pD/6)(3 1+12K/(1−c h ) −1) (Mi−Bi)/Bi cache hits at level i cache, and the sequential cache complexity We argue that in the case of randomized work stealing, there are a large number of superscans such that the probability that at least two scans from such superscans are executed by different processors is greater than some positive constant.This implies that the cache complexity is Θ(pDMi/Bi) higher that the sequential cache complexity (claim A).
1. Once the p/3 spines have been forked, each spine is occupied by at least one processor till the stage where work along a spine has been exhausted.This property follows directly from the nature of the work stealing protocol.
2. In the early stages of computation after spines have been forked, but before the computation enters the join phase on the spines, exactly p/3 processors have a spine node on the head of their work queue.Therefore, the probability that a random steal will get a spine node and hence a fresh superscan is 1/3.

3.
At any moment during the computation, the probability that more than p/2 of the latest steals of the p processors found fresh spine nodes is exponentially small in terms of p and therefore less than 1/2.
4. If processor p stole a fresh superscan A and started the scans in it, the probability that the work from the superscan A is not stolen by some other processor before p executes the first 2/3-rd of the scan is at most a constant c h ∈ (0, 1).This is because the probability that p currently got a fresh superscan does not depend on events in the history, and therefore, with probability at least 1/2, more than p/2 processors did not steal a fresh superscan in the latest steal.This means that these processors which stole a stale superscan got less than 2/3-rd fraction of the superscan to work on before they need to steal again.Therefore, by the time p finishes 2/3-rd of the work, there would have been at least p/2 steal attempts and there is a probability of at least 1/16 that two of these steals stole from p. Two steals from p would cause p to lose work from it's fresh superscan.In this scenario, p does not execute more than 5/6-th of the scan even if it comes back to steal work from the higher instance of scan A.

5.
Since there are at most KpD steals with high probability, there are no more than (1 − c h )pD/12 superscans which incur more than 12K/(1 − c h ) steals.On an average, about (1 − c h )pD/6 superscans are stolen from before the first processor that touched the superscan executes 2/3-rd of it.Therefore, on an average, there are no less than (1−c h )pD/12 superscans which get fewer than 12K/1 − c h steals and are stolen from before one processor executes 2/3-rd of it.Such superscans, by construction have at least two different processors execute a complete scan.This proves claim A.

Centralized work stealing:
The construction for centralized work stealing is shown in Figure 6(b).The DAG ensures each steal causes a scan of a completely different set of memory locations.The bound follows from the fact that unlike the case in sequential computation, cache access overlap in the pairs of parallel scans are never exploited.

Extending Shared Cache Results to Multiple Levels
Finally, we consider the PMSH model.For the case of a single level of shared cache, our previous work [15] showed that the parallel depth-first (PDF) scheduler was a good choice for mapping good sequential cache complexity to provably good parallel cache complexity.In the PDF scheduler [16,15] tasks are prioritized according to their ordering in the natural sequential execution, i.e., according to the ordering used to analyze the sequential cache complexity Q; the ith task in the sequential execution is given priority rank i.A processor completing a task is assigned the lowest ranked task among all the available tasks that are ready to execute.The relative ranking of available tasks can be efficiently determined on-the-fly without having to perform a sequential execution [16].
The results from [15] stated in Section 1 for a single level of shared cache can be generalized to the PMSH: THEOREM 5.3.When a cache-oblivious nested-parallel computation A with sequential cache complexity Q(M, B), work W , and depth D is scheduled on a PMSH P of p processors using a PDF scheduler, then the cache at each level i incurs fewer than Q(p(Mi − BiD lat A,P ), Bi) cache misses.Moreover, the computation completes in time not more than W lat A,P /p + D lat A,P .PROOF.(sketch) The cache bound follows because (1) inclusion implies that hits/misses/evictions at levels < i do not alter the number of misses at level i, (2) caches sized for inclusion imply that all words in a line evicted at level > i will have already been evicted at level i, and hence (3) the key property of PDF schedulers, Qp(M + pBD lat A,P , B) ≤ Q(M, B), holds at each level i of a PMSH.The time bound follows because the schedule is greedy (and we are not accounting for scheduler overheads).Thus, our approach for developing cache-efficient parallel algorithms via (i) low cache-oblivious sequential cache complexity and (ii) low depth is validated for shared-cache hierarchies (and PDF schedulers) as well.

DISCUSSION
A goal of the work described in this paper is to develop a simple model for accounting for locality with dynamic parallelismcleanly separating the cost-model from any particular machine model while still being useful in bounding costs on various machines.We believe the approach of analyzing cache complexity in the cacheoblivious model, and work and depth with dynamic nested parallelism as described in the paper achieves this goal.The approach, however, does have some limitations and ignores some details.We briefly describe these here.Firstly the general bounds on the parallel cache misses rely on low-depth (as the title of the paper implies).It seems that avoiding this would require a modified model for cache complexity, or taking into account particular properties of programs as studied in some previous work [39,14,29].A more general approach to handle algorithms with higher depth would be useful.
Secondly, our scheduler results assume DAG consistency using the BACKER protocol, which at present is not implemented on real machines.The backer protocol avoids cache protocol misses due false sharing (multiple threads writing to different locations of a shared cache line) by resolving cache line conflict when writing back to memory.Strong consistency is not guaranteed and not needed by our DAG consistent algorithms.Maintaining strong consistency per cache line could create problems on the algorithms and scheduling techniques we described by forcing cache lines to "ping-pong" among processors.It would be interesting to develop a model and algorithms that avoid these problems, but ultimately if this makes the process of designing or analyzing algorithms more complicated, or breaks the abstraction between a high-level model and the machines below, this would be a argument to modify cache consistency protocols.
Thirdly, our scheduler results assume an optimal cache replacement policy.Note that for practical purposes, each level of cache could instead use a multi-level inclusive LRU replacement policy.Unlike in the case of optimal replacement, where a complete memory access profile may be needed a priori at all levels in order to compute what to replace, implementing a multi-level LRU replacement policy does not require that all levels of the cache hierarchy see the memory access profile.Assuming that a cache line evicted at level i is sent to level i + 1 and that any access to a memory location not at level i is serviced by passing it from higher levels in the memory hierarchy through cache level i + 1, it is possible for cache level i + 1 to know exactly what memory words are contained in the lower level cache.From the order in which cache lines were evicted by level i, cache level i + 1 can fill up the rest of its slots and order them in LRU order (see Figure 7).It follows from [47] that the number of cache misses at each level under the multi-level LRU policy is within a factor of two of the number of misses for a cache half the size running the optimal replacement policy.For example, under multi-level LRU, the upper bound on the cache misses in Theorem 5.1 becomes 2Q(Mi/2, Bi) + O(p(D lat A,P + log 1/δ)Mi/Bi).Finally, our scheduler results are for multi-level hierarchies of private or shared caches.It would be interesting to extend these results to more general multi-level models [3,4,18,30,42,46,49], while preserving the goal of supporting a simple model for algorithm design and analysis.

THEOREM 2 . 2 .
On an input of size n, the deterministic COSORT has Q(n; M, B) = O( n/B log M n ) sequential cache complexity, O(n log n) work, and O(log 2 n) depth.

THEOREM 3 . 1 .
The deterministic list ranking outlined above has Q(n; M, B) = O( n/B log M n ) sequential cache complexity, O(n log n) work, and O(Dsort(n) log n) depth.

Figure 3 :
Figure 3: Low-depth cache-oblivious graph algorithms.All algorithms are deterministic.The bounds assume M = Ω(B 2 ).Dsort and Qsort are the depth and cache complexity of cache-oblivious sorting.

Figure 4 :
Figure 4: Cache-Oblivious Algorithms for Building a Separator Tree and for Sparse-Matrix Vector Multiply

THEOREM 4 . 1 .
Let M be a class of matrices for which the adjacency graphs satisfy an n -vertex separator theorem.Algorithm SparseMxV on an n × n matrix A ∈ M with m ≥ n non-zeros has O(m) work, O(log 2 n) depth and O( m/B + n/M 1− ) sequential cache complexity.

THEOREM 5 . 2 .Figure 6 :
Figure 6: DAGs used in the proof of Theorem 5.2.D − 2(12K/(1 − c h ) + log(p/3) + log M ) each (c h ∈ (0, 1) is a constant that we will define shortly).Each of the first D/2 nodes on the spine forks off a "superscan" that consists of 3 1+12K/(1−c h ) identical parallel scans of length M each.A scan over an array A of size M is a binary tree forking out in to M parallel leaves, each leaf scanning one of the consecutive words in the array A. The remaining D/2 nodes on the spine are the joins corresponding to superscans forked up the spine.Note that D lat A,P = D + C k+1 because each path in the DAG contains at most one memory request.In a sequential execution, a processor executes the superscans one by one and can reuse a subsequence of length Mi/Bi (at level i) for all the identical scans with in a superscan.In other words, sequential execution gets (pD/6)(3 1+12K/(1−c h ) −1) (Mi−Bi)/Bi cache hits at level i cache, and the sequential cache complexityQ(Mi, Bi) is (pD/6)( Mi/Bi +3 1+12K/(1−c h ) (M −Mi)/Bi ).We argue that in the case of randomized work stealing, there are a large number of superscans such that the probability that at least two scans from such superscans are executed by different processors is greater than some positive constant.This implies that the cache complexity is Θ(pDMi/Bi) higher that the sequential cache complexity (claim A).

Figure 1: Low-depth cache-oblivious algorithms. New algorithms are marked ( * ). All algorithms are work optimal and their cache complexities match the best sequential algorithms. The bounds assume
M = Ω(B 2 ).
where #S is the number of steals.The latency added depth D lat A,P can be defined using c(x) similarly: it is the maximum of P x∈P c(x) over all paths P in A. We note that D • C k+1 is a pessimistic upper bound on the latency-added depth for any machine.THEOREM 5.1.(Upper Bounds) For any δ > 0, when a cacheoblivious nested-parallel computation A with binary forking, sequential cache complexity Q(M, B), work W , and depth D is scheduled on a PMDH P of p processors using randomized work stealing: