Embeddings of negative-type metrics and an improved approximation to generalized sparsest cut

In this article, we study metrics of <i>negative type</i>, which are metrics (<i>V</i>, d) such that √<i>d</i> is a Euclidean metric; these metrics are thus also known as ℓ<sub>2</sub>-squared metrics. We show how to embed <i>n</i>-point negative-type metrics into Euclidean space ℓ<sub>2</sub> with distortion <i>D</i> = <i>O</i>(log<sup>3/4</sup><i>n</i>). This embedding result, in turn, implies an <i>O</i>(log<sup>3/4</sup><i>k</i>)-approximation algorithm for the Sparsest Cut problem with nonuniform demands. Another corollary we obtain is that <i>n</i>-point subsets of ℓ<sub>1</sub> embed into ℓ<sub>2</sub> with distortion <i>O</i>(log<sup>3/4</sup> <i>n</i>).


INTRODUCTION
The area of finite metric spaces and their embeddings into "simpler" spaces lies at the intersection of mathematical analysis, computer science, and discrete geometry. Over the past decade, this area has seen intense activity, partly because it has proved invaluable in many algorithmic applications. Many examples can be found in the surveys by Indyk [2001] and Linial [2002], or in the chapter by Matoušek [2002].
One of the first major applications of metric embeddings in Computer Science was an O(log k)-approximation to the Sparsest Cut problem with non-uniform demands (henceforth called the Generalized Sparsest Cut problem) [Linial et al. 1995; Aumann and Rabani 1998]. This result was based on a fundamental theorem of Bourgain [1985] in the local theory of Banach spaces, which showed that any finite n-point metric can be embedded into ℓ1 space (and indeed, into any of the ℓp spaces) with distortion O(log n). The connection between these results uses the fact that the Generalized Sparsest Cut problem seeks to minimize a linear function over all cuts of the graph, which is equivalent to optimizing over all n-point ℓ1 metrics. Since this problem is NP-hard, we can optimize over all n-point metrics instead, and then use an algorithmic version of Bourgain's embedding to embed into ℓ1 with only an O(log n) loss in performance.
A natural extension of this idea is to optimize over a smaller class of metrics that contains ℓ1; a natural candidate for this class is NEG, the class of n-point metrics of negative type. These are just the metrics obtained by squaring a Euclidean metric, and hence are often called "ℓ2-squared" metrics. It is known that the following relationships hold:
ℓ2 metrics ⊆ ℓ1 metrics ⊆ NEG metrics. (1)
Since it is possible to optimize over NEG via semidefinite programming, this gives us a semidefinite relaxation for the Generalized Sparsest Cut problem [Goemans 1997]. Now if we could prove that n-point metrics in NEG embed into ℓ1 with distortion D, and that this embedding can be found in polynomial time, we would get a D-approximation for Sparsest Cut; while this D has been conjectured to be O(√log n) or even O(1), no bounds better than O(log n) were known prior to this work. (See Section 1.3 for subsequent progress towards the resolution of this conjecture.) In a recent breakthrough, Arora, Rao, and Vazirani [2004] showed that every n-point metric in NEG has a contracting embedding into ℓ1 such that the sum of the distances decreases by only O(√log n). Formally, they showed that the SDP relaxation has an integrality gap of O(√log n) for the case of uniform-demand Sparsest Cut; however, this is equivalent to the above statement by the results of Rabinovich [2003].
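To make the relaxation concrete, here is a sketch (in standard vector form, our rendering rather than a quotation of any cited paper; cf. Goemans [1997]) of the constraints under which one optimizes over NEG via semidefinite programming. The vectors v_x are the variables, and the second family of constraints is the triangle inequality on squared distances that characterizes metrics of negative type:

```latex
% Constraints defining the class NEG in vector form (our rendering):
\begin{align*}
  d(x,y) &= \|v_x - v_y\|_2^2 &&\text{for all } x, y \in V,\\
  \|v_x - v_y\|_2^2 &\le \|v_x - v_z\|_2^2 + \|v_z - v_y\|_2^2 &&\text{for all } x, y, z \in V.
\end{align*}
```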
We extend the techniques of Arora, Rao, and Vazirani to give embeddings for n-point metrics in NEG into ℓ2 with distortion O(log^{3/4} n). More generally, we obtain the following theorem.
Theorem 1.1. Given (V, d), a negative-type metric, and a set of terminal pairs D ⊆ V × V with |D| = k, there is a contracting embedding ϕ : V → ℓ2 such that for all pairs (x, y) ∈ D,
||ϕ(x) − ϕ(y)||_2 ≥ d(x, y)/O(log^{3/4} k).
Note that the above theorem requires the embedding to be contracting for all node pairs, but the resulting contraction needs to be small only for the terminal pairs. In particular, when D = V × V, the embedding is an O(log^{3/4} n)-distortion embedding into ℓ2. Though we also give a randomized polynomial-time algorithm to find this embedding, let us point out that optimal embeddings into ℓ2 can be found using semidefinite programming [Linial et al. 1995, Thm. 3.2(2)]. Finally, let us note some simple corollaries.
Theorem 1.2. Every n-point metric in NEG embeds into ℓ1 with O(log^{3/4} n) distortion, and every n-point metric in ℓ1 embeds into Euclidean space ℓ2 with O(log^{3/4} n) distortion. These embeddings can be found in polynomial time.
The existence of both embeddings follows immediately from (1). To find the map NEG → ℓ1 in polynomial time, we can use the fact that every finite ℓ2 metric can be embedded into ℓ1 isometrically; if we so prefer, we can find a distortion-√3 embedding into ℓ1 in deterministic polynomial time using families of 4-wise independent random variables [Linial et al. 1995, Lemma 3.3].
Theorem 1.3. There is a randomized polynomial-time O(log^{3/4} k)-approximation algorithm for the Sparsest Cut problem with non-uniform demands.
Theorem 1.3 thus extends the results of Arora et al. [2004] to the case of non-uniform demands, although it gives a weaker guarantee than the O(√log k) approximation they achieve for uniform demands.
The proof of Theorem 1.3 follows from the fact that the existence of distortion-D embeddings of negative-type metrics into ℓ1 implies an integrality gap of at most D for the semidefinite programming relaxation of the Sparsest Cut problem. Furthermore, the embedding can be used to find such a cut as well. (For more details about this connection between embeddings and the Sparsest Cut problem, see the survey by Shmoys [1997, Sec. 5.3]; the semidefinite programming relaxation can be found in the survey by Goemans [1997, Sec. 6].)

Our Techniques
The proof of the Main Theorem 1.1 proceeds thus: we first classify the terminal pairs in D by distance scales, defining the scale-i set D_i to be the set of all pairs (x, y) ∈ D with d(x, y) ≈ 2^i. For each scale i, we find a partition of V into components such that for a constant fraction of the terminal pairs (x, y) ∈ D_i, the following two "good" events happen: (1) x and y lie in different components of the partition, and (2) the distance from x to any component other than its own is at least η·2^i, and the same for y. Here η = 1/O(√log k). Informally, both x and y lie deep within their distinct components, and this happens for a constant fraction of the pairs (x, y) ∈ D_i. This partition defines a contracting embedding of the points into a one-dimensional ℓ1 metric (a line) such that every pair (x, y) for which the above "good" events happen has low distortion. (The details of this process are given in Section 3; the proofs use ideas from the paper by Arora, Rao, and Vazirani [2004] and the subsequent improvements by Lee [2005].) Note that the good events happen for only a constant fraction of the pairs in D_i, and we have little control over which of the pairs will be the lucky ones. However, to obtain low distortion for every terminal pair, we want a partitioning scheme that separates a random constant fraction of the pairs in D_i. To this end, we employ a simple reweighting scheme (reminiscent of the Weighted Majority algorithm [Littlestone and Warmuth 1994] and many other applications): we effectively duplicate each unlucky pair and repeat the above process O(log k) times. Since each pair that is unlucky gets a higher weight in the subsequent runs, a simple argument given in Section 4 shows that each pair in D_i will be separated in at least log k of these O(log k) partitions. (Picking one of these partitions uniformly at random would now ensure that each pair is separated with constant probability.) We therefore obtain a good partition for each distance scale individually. We could now use these O(log k) partitions naïvely, by concatenating the corresponding "line embeddings", to construct an embedding where the contraction for the pairs in D would be bounded by √(log k)/η = O(log k). However, this would be no better than the previous bounds, and hence we have to be more careful. We slightly adapt the measured descent embeddings of Krauthgamer et al. [2005] to combine the O(log k) partitions for the various distance scales to get a distortion-O(√(log k/η)) = O(log^{3/4} k) embedding. The details of the embedding are given in Section 5.
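The parameter arithmetic in the paragraph above can be summarized as follows (our rendering, with η = 1/O(√log k)):

```latex
% Naive concatenation of the O(log k) line embeddings loses a factor
%   sqrt(log k)/eta, while measured descent loses only sqrt(log k / eta):
\frac{\sqrt{\log k}}{\eta} \;=\; O(\log k),
\qquad
\sqrt{\frac{\log k}{\eta}} \;=\; \sqrt{\log k \cdot O\big(\sqrt{\log k}\big)} \;=\; O\big(\log^{3/4} k\big).
```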

Related Work
This work adopts and adapts techniques of Arora, Rao and Vazirani [2004], who gave an O(√log n)-approximation for the uniform demand case of Sparsest Cut. In fact, using their results about the behavior of projections of negative-type metrics almost as a black box, we obtain an O(log^{5/6} k)-approximation for Generalized Sparsest Cut. Our approximation factor is further improved to O(log^{3/4} k) by the results of Lee [2005], showing that the hyperplane-separator algorithm of Arora et al. [2004, Section 3] itself gives an O(√log n)-approximation for the uniform demand case. As mentioned above, there has been a large body of work on low-distortion embeddings of finite metrics; see, e.g., [Bartal 1998; Bourgain 1985; Chekuri et al. 2003; Fakcharoenphol et al. 2003; Gupta et al. 2003; Gupta et al. 2004; Krauthgamer et al. 2005; Linial et al. 1995; Matoušek 1996; Matoušek 1999; Rao 1999], and our work stems in spirit from many of these papers. However, it draws most directly on the technique of measured descent developed by Krauthgamer et al. [2005].
Independently of our work, Lee [2005] has used so-called "scale-based" embeddings to give low-distortion embeddings from ℓp (1 < p < 2) into ℓ2. That paper gives a "Gluing Lemma" of the following form: if for every distance scale i we are given a contracting embedding φ_i such that each pair x, y with d(x, y) ≈ 2^i incurs contraction at most K, one can glue them together to get an embedding φ : d → ℓ2 with distortion O(√(K log n)). His result is a generalization of the measured descent embeddings of Krauthgamer et al. [2005], and of our Lemma 5.2; using this gluing lemma, one can derive an ℓ2 embedding from the decomposition bundles of Theorem 4.5 without using any of the ideas in Section 5.

Subsequent Work
Following the initial publication of this work, Arora, Lee, and Naor [2005] built upon our techniques to obtain an embedding from negative-type metrics into ℓ2 with distortion O(√log n · log log n), implying an approximation to Generalized Sparsest Cut with the same factor of approximation. Their improvement lies in a stronger gluing lemma. This result is essentially tight, as it is known that embedding negative-type metrics into ℓ2 requires Ω(√log n) distortion in the worst case [Enflo 1969].
This improvement was coupled with considerable progress on lower bounds for embeddability into ℓ1: Khot and Vishnoi [2005] showed the existence of a negative-type metric that must incur a distortion of Ω((log log n)^{1/6−ε}) when embedded into ℓ1. This was subsequently improved to a lower bound of Ω(log log n) on the distortion by Krauthgamer and Rabani [2006]. In related work, Chawla et al. [2006] showed that it is NP-hard to approximate the Sparsest Cut problem within a factor of o(√(log log n)), assuming an appropriate version of the Unique Games Conjecture of Khot [2002]. (Khot and Vishnoi [2005] independently showed a weaker hardness of Ω((log log n)^{1/6−ε}) under the same assumption.)

Sparsest Cut
In the Generalized Sparsest Cut problem, we are given an undirected graph G = (V, E) with edge capacities c_e, and k source-sink (terminal) pairs {s_i, t_i}, with each pair having an associated demand D_i. For any subset S ⊆ V of the nodes of the graph, let D(S, S̄) be the net demand going from the terminals in S to those outside S, and C(S, S̄) the total capacity of the edges exiting S. Now the generalized sparsest cut is defined as follows (if there is unit demand between all pairs of vertices, then the problem is just called the Sparsest Cut problem):
Φ = min_{S⊆V} C(S, S̄)/D(S, S̄) = min_{S⊆V} [ Σ_{(u,v)∈E} c_{uv} δ_S(u, v) ] / [ Σ_i D_i δ_S(s_i, t_i) ] = min_{ℓ1 metrics μ} [ Σ_{(u,v)∈E} c_{uv} μ(u, v) ] / [ Σ_i D_i μ(s_i, t_i) ].
Here a cut metric δ_S is defined as δ_S(x, y) = 1 if exactly one of x and y is in the set S, and 0 otherwise; so the second equality just follows from the definition. The third equality is less trivial; see, e.g., [Aumann and Rabani 1998] for a proof. This problem is NP-hard [Shahrokhi and Matula 1990], as is optimizing linear functions over the cone of ℓ1 metrics [Karzanov 1985]. There is much work on the sparsest cut problem (see, e.g., [Leighton and Rao 1988; Shmoys 1997]), and O(log k) approximations were previously known [Linial et al. 1995; Aumann and Rabani 1998]. These algorithms proceeded by relaxing the problem and optimizing over all metrics instead of over ℓ1 metrics in the above equation, and then rounding the resulting fractional solution. A potentially stronger relaxation is obtained by optimizing only over metrics d ∈ NEG instead of over all metrics:
Φ_NEG = min_{d∈NEG} [ Σ_{(u,v)∈E} c_{uv} d(u, v) ] / [ Σ_i D_i d(s_i, t_i) ].
This quantity is the value of the semidefinite relaxation of the problem, and can be approximated well in polynomial time (see, e.g., [Goemans 1997]). Since ℓ1 ⊆ NEG, it follows that Φ_NEG ≤ Φ. On the other hand, if we can embed n-point metrics in NEG into ℓ1 with distortion at most D in polynomial time, we can obtain a cut of value at most D × Φ_NEG. It follows that Φ ≤ D × Φ_NEG, and the solution is a D-approximation to Generalized Sparsest Cut.
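As a concrete illustration of the objective being relaxed, the following sketch (ours; the function and variable names are illustrative) computes the sparsity of a single cut S:

```python
# Sparsity of a cut S: capacity crossing the cut divided by demand separated
# by the cut; the (generalized) sparsest cut minimizes this over all S.
def sparsity(S, capacities, demands):
    """S: set of vertices; capacities: {(u, v): c_e}; demands: {(s, t): D_i}."""
    cut_cap = sum(c for (u, v), c in capacities.items() if (u in S) != (v in S))
    cut_dem = sum(D for (s, t), D in demands.items() if (s in S) != (t in S))
    return float('inf') if cut_dem == 0 else cut_cap / cut_dem

# Example: a 4-cycle with unit capacities and one unit demand pair.
caps = {(0, 1): 1, (1, 2): 1, (2, 3): 1, (3, 0): 1}
dems = {(0, 2): 1}
print(sparsity({0, 1}, caps, dems))   # 2 edges cut / 1 demand separated = 2.0
```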

Metrics.
The input to our embedding procedure is a negative-type metric (V, d) with |V| = n. We can, and indeed will, use the following standard correspondence between finite metrics and graphs: we let V be the node set of a complete graph, where the length of the edge (x, y) is set to d(x, y). This correspondence allows us to perform operations like deleting edges to partition the graph. By scaling, we can assume that the smallest distance in (V, d) is 1, and the maximum distance is some value ∆(d), the diameter of the graph.
It is well known that any negative-type distance space admits a geometric representation as the square of a Euclidean metric; i.e., there is a map ψ : V → R^n such that d(x, y) = ||ψ(x) − ψ(y)||² for all x, y ∈ V [Deza and Laurent 1997, Thm. 6.2.2]. Furthermore, the fact that d is a metric implies that the angle subtended by any two points at a third point is non-obtuse. Since this map can be found in polynomial time using semidefinite programming, we will assume that we are also given such a map ψ. For any node x ∈ V, we use x to denote the point ψ(x) ∈ R^n.
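For illustration, here is a small sketch (ours, using the classical Schoenberg construction rather than the semidefinite program the paper actually uses) of how such a map ψ can be extracted from a negative-type distance matrix:

```python
# Sketch (ours): Schoenberg's construction of a geometric representation for a
# negative-type distance matrix D, i.e., points with squared Euclidean
# distances equal to D. The paper instead obtains psi via semidefinite
# programming, but the algebra below is the same.
import numpy as np

def geometric_representation(D):
    # Gram matrix centered at point 0: G[i,j] = (D[i,0] + D[0,j] - D[i,j]) / 2.
    # D being of negative type is equivalent to G being positive semidefinite.
    G = (D[:, [0]] + D[[0], :] - D) / 2.0
    w, V = np.linalg.eigh(G)          # eigendecomposition (G may be singular)
    w = np.clip(w, 0.0, None)         # clip tiny negative eigenvalues
    return V * np.sqrt(w)             # row i is the point psi(i)

# Example: the path metric 0 - 1 - 2 (an l1 metric, hence in NEG).
D = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
X = geometric_representation(D)
assert np.allclose(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1), D)
```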

Terminal Pairs.
We are also given a set of terminal pairs D ⊆ V × V; these are the pairs of nodes for which we need to ensure a small contraction. In the sequel, we will assume that each node in V takes part in at most one terminal pair in D. This is without loss of generality: if a node x belongs to several terminal pairs, we add new vertices x_i to the graph at distance 0 from x, and replace x in the i-th terminal pair with x_i. (Since this transformation adds at most O(|D|) nodes, it does not asymptotically affect our results.) Note that, as a result, D may have two terminal pairs (x, y) and (x′, y′) such that d(x, x′) = d(y, y′) = 0.
A node x ∈ V is a terminal if there is a (unique) y such that (x, y) ∈ D; call this node y the partner of x. We use the phrase scale-i to denote distances in the interval [2^i, 2^{i+1}), and we define D_i to be the set of terminal pairs whose distance according to d is at scale i (i.e., approximately 2^i). If (x, y) ∈ D_i, then x and y are called scale-i terminals. Let D̄ be the set of all terminal nodes, and D̄_i the set of scale-i terminals.
The radius-r ball around a node x is B(x, r) = {z ∈ V : d(x, z) ≤ r}; similarly, for a set S ⊆ V, we write B(S, r) = {z ∈ V : d(z, S) ≤ r}.

Metric Decompositions: Suites and Bundles
Much of the paper will deal with finding decompositions of metrics (and of the underlying graph) with specific properties; let us define these here. Given a distance scale i and a partition P_i of the graph, let C_i(v) denote the component containing a vertex v ∈ V. We say that a pair (x, y) ∈ D_i is δ-separated by the partition P_i if
-the vertices x and y lie in different components, i.e., C_i(x) ≠ C_i(y), and
-both x and y are "far from the boundary" of their components, i.e., d(x, V \ C_i(x)) ≥ δ·d(x, y) and d(y, V \ C_i(y)) ≥ δ·d(x, y).
A decomposition suite Π is a collection {P_i} of partitions, one for each distance scale i between 1 and ⌈log ∆(d)⌉. Given a separation function δ(x, y) : D → [0, 1], we say that Π δ(x, y)-separates a terminal pair (x, y) ∈ D_i if the partition P_i δ(x, y)-separates it. Finally, a δ(x, y)-decomposition bundle is a collection {Π_j} of decomposition suites such that for each (x, y) ∈ D, at least a constant fraction of the Π_j δ(x, y)-separate the pair (x, y).
In Section 3, we show how to create a decomposition suite that Ω(1/√log k)-separates a constant fraction of the pairs (x, y) ∈ D_i, for all distance scales i. Using this procedure and a simple reweighting argument in Section 4, we construct an Ω(1/√log k)-decomposition bundle with O(log k) suites. Finally, in Section 5, we show how decomposition bundles give us embeddings of the metric d into ℓ2.

CREATING DECOMPOSITION SUITES
In this section, we give the procedure Project-&-Prune, which takes a distance scale i and constructs a partition P_i of V that η-separates at least a constant fraction of the pairs in D_i. Here we use η = 1/(4f), where f = c√log k for a suitably large constant c.
Input: The metric (V, d), its graph representation G, its geometric representation where x ∈ V is mapped to x ∈ R^n, and a distance scale i. We assume that the terminal pairs (x, y) ∈ D_i are disjoint.
(1) Project. In this step, we pick a random direction and project the points in V on the line in this direction. Formally, we pick a random unit vector u, and let p_x = √n·⟨x, u⟩ be the normalized projection of the point x on u.
(2) Bucket. Let ℓ = 2^{i/2}, and set β = ℓ/6. Informally, we will form buckets by dividing the line into intervals of length β. We then group the terminals in D_i according to which interval (mod 4) they lie in. (See Figure 1.) Formally, for each a = 0, 1, 2, 3, define A_a to be the set of scale-i terminals x whose projection p_x lies in an interval [jβ, (j+1)β) with j ≡ a (mod 4). A terminal pair (x, y) ∈ D_i is split by A_a if x ∈ A_a and y ∈ A_{(a+2) mod 4}. If the pair (x, y) is not split by any A_a, we remove both x and y from the sets A_a. For a ∈ {0, 1}, let B_a ⊆ D_i be the set of terminal pairs split by A_a or A_{a+2}.
(3) Prune. If there exist terminals x ∈ A_a and y ∈ A_{(a+2) mod 4} for some a ∈ {0, 1} (not necessarily belonging to the same terminal pair) with d(x, y) < ℓ²/f, we remove x and y and their partners from the sets {A_a}. We repeat until no such pairs remain.
(4) Cleanup. For each a, if (x, y) ∈ B_a and the above pruning step has removed either of x or y (recall that we remove the other one as well), then we remove the pair from B_a.
(5) Verify. If the number of surviving pairs |B_0 ∪ B_1| is less than |D_i|/32, go back to Step 1, else go to Step 6.
(6) Say the set B_a has more pairs than B_{(1−a) mod 2}. Define the partition P_i by deleting all the edges in G at distance ℓ²/(2f) from the set A_a. More formally, let C = B(A_a, ℓ²/(2f)), and define the partition P_i to be G[C] and G[V \ C], the components induced by C and V \ C.
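For concreteness, the following condensed sketch (ours; it is not the authors' implementation, and it folds the cleanup into the pruning loop) shows one attempt of the procedure in code, using the notation above (ℓ = 2^{i/2}, β = ℓ/6, and the points ψ(x) passed in as vectors):

```python
# Sketch (ours) of one attempt of Project-&-Prune at scale i. Here vec maps
# each node to its point psi(x), d(x, y) = ||vec[x] - vec[y]||^2, and pairs_i
# is the list of (disjoint) scale-i terminal pairs.
import numpy as np

def project_and_prune_once(vec, d, pairs_i, i, f, rng=np.random.default_rng()):
    ell = 2.0 ** (i / 2.0)                       # ell^2 = 2^i
    beta = ell / 6.0                             # bucket width
    dim = len(next(iter(vec.values())))          # ambient dimension (n)
    u = rng.standard_normal(dim)
    u /= np.linalg.norm(u)                       # random unit direction
    proj = {x: np.sqrt(dim) * float(vec[x] @ u) for x in vec}
    cls = {x: int(np.floor(proj[x] / beta)) % 4 for x in vec}   # bucket mod 4

    partner = {}
    for (a, b) in pairs_i:
        partner[a], partner[b] = b, a

    # Bucket: a pair is split if its endpoints land in opposite classes.
    split = [(x, y) for (x, y) in pairs_i if (cls[x] - cls[y]) % 4 == 2]
    alive = {z for pair in split for z in pair}

    # Prune (+ cleanup): repeatedly remove close terminals lying in opposite
    # classes, together with their partners.
    while True:
        bad = next(((x, y) for x in alive for y in alive
                    if (cls[x] - cls[y]) % 4 == 2 and d(x, y) < ell ** 2 / f),
                   None)
        if bad is None:
            break
        x, y = bad
        alive -= {x, y, partner[x], partner[y]}

    survivors = [(x, y) for (x, y) in split if x in alive and y in alive]
    return survivors, cls   # verify: caller retries if len(survivors) < len(pairs_i)/32
```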
Note that the procedure above ensures that for any pair of terminals (x, y) ∈ A_a × A_{(a+2) mod 4}, the distance d(x, y) is at least ℓ²/f = 2^i/f, even if (x, y) ∉ D_i. Why do we care about these pairs? It is because the separation of ℓ²/f between the sets A_a and A_{(a+2) mod 4} ensures that the balls of radius ℓ²/(2f) around these sets are disjoint.
This in turn implies that terminal pairs (x, y) ∈ D_i ∩ (A_a × A_{(a+2) mod 4}) are η-separated upon deleting the edges in Step 6 from the graph G. Indeed, for such a pair (x, y), the components C_i(x) and C_i(y), obtained upon deleting the edges at distance ℓ²/(2f) from the set A_a, are distinct, and both d(x, V \ C_i(x)) and d(y, V \ C_i(y)) are at least ℓ²/(2f) ≥ d(x, y)/(4f) = η·d(x, y). The following theorem now shows that the procedure Project-&-Prune terminates quickly.
Theorem 3.1. For any distance scale i, the procedure Project-&-Prune terminates in a constant number of iterations. This gives us a randomized polynomial-time algorithm that outputs a partition P_i which η-separates at least (1/64)|D_i| pairs of D_i.
The proof of this theorem has two parts, which we prove in the next two subsections. We first show that the sets B_0 and B_1 contain most of D_i before the pruning step (with high probability over the random direction u). We then show that the pruning procedure removes only a constant fraction of the pairs from these sets B_0 and B_1 with a constant probability. In fact, the size of B_0 ∪ B_1 remains at least |D_i|/32 even after the pruning, and then it follows that the larger of these sets must have half of the terminal pairs, proving the theorem.
Fig. 2. The distribution of projected edge lengths in the proof of Lemma 3.2. If y falls into a light-shaded interval, the pair (x, y) is split.

The Projection Step

Lemma 3.2. With probability at least 1/15 over the choice of the random direction u, at least |D_i|/16 of the pairs in D_i are split at the end of the bucketing step.

Proof. Recall that a terminal pair (x, y) ∈ D_i is split if x lies in the set A_a and y lies in A_{(a+2) mod 4} for some a ∈ {0, 1, 2, 3}. Also, we defined ℓ² = 2^i, and hence (x, y) ∈ D_i implies that ||x − y||² = d(x, y) ∈ [ℓ², 2ℓ²). Consider the normalized projections p_x and p_y of the vectors x, y ∈ R^n on the random direction u, and note that p_y − p_x is distributed (nearly) as a Gaussian random variable Z_u ∼ N(0, σ²) with a standard deviation σ ∈ [ℓ, √2·ℓ) (see Figure 2). Now consider the bucket of width β in which p_x lies. The pair (x, y) will not be separated if p_y lies in either the same bucket, or in either of the adjoining buckets.
(The probability of each of these three events is at most β/(√(2π)·σ).) Also, at least 1/4 of the remainder of the distribution causes (x, y) to be split, since each good interval is followed by three bad intervals of smaller measure.
Putting this together, the probability of (x, y) being split is at least (1/4)·(1 − 3β/(√(2π)·σ)) ≥ 1/8. Since each pair (x, y) ∈ D_i is split with probability at least 1/8, linearity of expectation and the inverse Markov inequality imply that at least one-sixteenth of D_i must be split at the end of the bucketing stage with probability at least 1/15.
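For completeness, here is one way to write out this computation (our rendering), using β = ℓ/6 and σ ≥ ℓ:

```latex
% Our rendering of the computation in the proof of Lemma 3.2:
\Pr\big[(x,y)\ \text{is split}\big]
 \;\ge\; \frac{1}{4}\left(1 - \frac{3\beta}{\sqrt{2\pi}\,\sigma}\right)
 \;\ge\; \frac{1}{4}\left(1 - \frac{3(\ell/6)}{\sqrt{2\pi}\,\ell}\right)
 \;=\; \frac{1}{4}\left(1 - \frac{1}{2\sqrt{2\pi}}\right) \;\ge\; \frac{1}{8}.
```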

The Pruning Step
We now show that a constant fraction of the terminal pairs in D_i also survive the pruning phase. This is proved by contradiction, and follows the argument of Arora et al. [2004].
Assume that, with a large probability (over the choice of the random direction u), a large fraction of the terminal pairs in D_i (say (63/64)|D_i|) get removed in the pruning phase. By the definition of the pruning step, the projection of x − y on u must have been large for such a removed pair (x, y). (Recall again that the removed pair (x, y) is not necessarily a terminal pair.) In our algorithm, this happens when d(x, y) < ℓ²/f, or equivalently, when β > ||x − y||·√f/6. Since the expected value of |p_x − p_y| is roughly ||x − y||, while p_x and p_y are separated by at least one bucket of width β, this implies that the expectation is exceeded by a factor of at least √f/6 = Ω(log^{1/4} k). Setting t = √f/6, we can say that such a pair (x, y) is "stretched by a factor t in the direction u". For any given direction u, the stretched pairs removed in the pruning step are disjoint, and hence form a large matching M_u.
Arora et al. showed the following geometric property: for a given set W and some constant C, the number of disjoint t-stretched pairs in W × W cannot be more than C|W| with constant probability (over the choice of u); however, their proof only established this for stretch t = Ω(log^{1/3} |W|). The dependence on t was improved subsequently by Lee [2005] to t = Ω(log^{1/4} |W|).
In order to make the above discussion more precise, let us first recall the definition of a stretched set of points.
Definition 3.3. A set W of points in R^n is (t, γ, β)-stretched at scale s if, with probability at least γ over the choice of a random unit vector u, there is a partial matching M_u ⊆ W × W of at least β|W| pairs such that every pair (x, y) ∈ M_u satisfies ||x − y|| ≤ s and ⟨x − y, u⟩ ≥ t·s/√n. That is, each pair (x, y) ∈ M_u is stretched by a factor of t in direction u.
Arora et al. proved that no large set of points can be stretched in this sense.
Theorem 3.4 (Arora et al. [2004]). For any γ, β > 0, there is a C = C(γ, β) such that if t > C log^{1/3} n, then no set of n points in R^n can be (t, γ, β)-stretched for any scale s.
The above theorem has been subsequently improved by Lee to the following (as implied by [Lee 2005, Thm. 4.1]).
Theorem 3.5. For any γ, β > 0, there is a C = C(γ, β) such that if t > C log^{1/4} n, then no set of n points in R^n can be (t, γ, β)-stretched for any scale s.
Summarizing the implication of Theorem 3.5 in our setting, we get the following corollary.
Corollary 3.6. Let W be a set of vectors corresponding to some subset of terminals satisfying the following property: with probability Θ(1) over the choice of a random unit vector u, there exist subsets S_u, T_u ⊆ W and a constant ρ such that |S_u| ≥ ρ|W| and |T_u| ≥ ρ|W|, and the length of the projection |⟨u, x − y⟩| ≥ ℓ/(6√n) for all x ∈ S_u and y ∈ T_u. Then with probability Θ(1) over the choice of u, the pruning procedure applied to the sets S_u and T_u returns sets S′_u and T′_u with |S′_u| ≥ (3/4)|S_u| and |T′_u| ≥ (3/4)|T_u|, such that for all x ∈ S′_u and y ∈ T′_u, d(x, y) ≥ ℓ²/f.

Proof.
For a unit vector u, let M(u) denote the matching obtained by taking the pairs (x, y) of terminals that are deleted by the pruning procedure when given the vector u. Note that pairs (x, y) ∈ M(u) have the property that d(x, y) < ℓ²/f and |p_x − p_y| > ℓ/6. For the sake of contradiction, suppose there is a constant γ such that the matchings M(u) are larger than (ρ/4)|W| with probability at least γ over the choice of u.
Using Definition 3.3 above, we get that the vectors in W form a (√f/6, γ, ρ/4)-stretched set at scale ℓ/√f. Theorem 3.5 now implies that √f/6 = (√c/6)·(log k)^{1/4} must be at most C log^{1/4} |W|. However, since |W| ≤ 2k, setting the parameter c suitably large compared to C gives us the contradiction.
Finally, we are in a position to prove Theorem 3.1 using Lemma 3.2 and Corollary 3.6.
Proof of Theorem 3.1. Define W to be D̄_i, the set of all terminals that belong to some terminal pair in D_i. Let a be the index corresponding to the larger of B_0 and B_1 before the pruning step, and set S_u = A_a and T_u = A_{(a+2) mod 4} for this value of a. Lemma 3.2 assures us that |S_u| = |T_u| ≥ (1/32)|D_i| with probability 1/15 (over the random choice of the vector u ∈ R^n). Furthermore, for each x ∈ S_u and y ∈ T_u, the fact that |p_x − p_y| ≥ β translates to the statement that ⟨x − y, u⟩ ≥ ℓ/(6√n). These vectors satisfy the conditions of Corollary 3.6, and hence we can infer that with a constant probability, the pruning procedure removes at most (1/4)|S_u| and (1/4)|T_u| vertices from S_u and T_u respectively. Their partners may be pruned in the cleanup step as well, and hence the total number of terminal pairs pruned is at most (1/4)|S_u| + (1/4)|T_u| = (1/2)|S_u|; at least (1/2)|S_u| ≥ (1/64)|D_i| pairs of D_i therefore survive. Each surviving pair (x, y) is η-separated by the resulting partition P_i: the components C_i(x) and C_i(y) are distinct, and both d(x, V \ C_i(x)) and d(y, V \ C_i(y)) are at least η·d(x, y).
Since this happens with a constant probability, we will need to repeat Steps 1-5 of the procedure (each time with a new unit vector u) only a constant number of times until we find a partition that η-separates at least (1/64)|D_i| of the terminal pairs; this proves the result.
Running the procedure Project-&-Prune for each distance scale i between 1 and ⌈log ∆(d)⌉, we get the following result with γ = 1/64.
Theorem 3.7. Given a negative-type metric d, we can find in randomized polynomial time a decomposition suite Π = {P_i} that η-separates a constant fraction γ of the terminal pairs at each distance scale i.
In the next section, we will extend this result to get a set of O(log k) decomposition suites {Π_j} so that each terminal pair (x, y) ∈ D is separated in a constant fraction of the Π_j's.

OBTAINING DECOMPOSITION BUNDLES: WEIGHTING AND WATCHING
To start off, let us observe that the result in Theorem 3.7 can be generalized to the case where the terminal pairs have associated weights w_xy ∈ {0, 1, 2, . . . , k}.
Lemma 4.1. Given weights w_xy ∈ {0, 1, 2, . . . , k} for the terminal pairs in D, we can find in randomized polynomial time a decomposition suite Π = {P_i} such that, for each distance scale i, the η-separated pairs of D_i carry at least a γ = 1/64 fraction of the total weight Σ_{(x,y)∈D_i} w_xy, where η = 1/O(√log k).
Proof. The proof is almost immediate: we replace each terminal pair (x, y) ∈ D_i having weight w_xy > 0 with w_xy new terminal pairs (x_j, y_j), where the points {x_j} and {y_j} are placed at distance 0 from x and y respectively. Doing this reduction for all weighted pairs gives us an unweighted instance with a set D′_i of terminal pairs with |D′_i| ≤ k|D_i|. Now Theorem 3.7 gives us a decomposition suite η-separating at least (1/64)|D′_i| of the new terminal pairs at distance scale i, where η = 1/O(√(log |D′|)) = 1/O(√log k). Finally, observing that the separated terminal pairs at scale i carry weight at least (1/64)·Σ_{(x,y)∈D_i} w_xy completes the claim.
In the sequel, we will associate weights with the terminal pairs in D and run the procedure from Lemma 4.1 repeatedly. The weights start off at (roughly) k, and the weight of a pair that is separated in some iteration is halved for the subsequent iteration; this reweighting ensures that all pairs are separated in significantly many rounds. (Note: this weighting argument is fairly standard and has been used, e.g., in geometric algorithms [Clarkson 1995], machine learning [Littlestone and Warmuth 1994], and many other areas; see Welzl [1996] for a survey.)
The Algorithm:
(1) Initialize w^(0)(x, y) = 2^⌈log k⌉ for all terminal pairs (x, y) ∈ D. Set j = 0.
(2) Use the algorithm from Lemma 4.1 to obtain a decomposition suite Π_j. Let T_j be the set of terminal pairs η-separated by this decomposition.
(3) Set w^(j+1)(x, y) = w^(j)(x, y)/2 for all pairs (x, y) ∈ T_j, and w^(j+1)(x, y) = w^(j)(x, y) for the others. If w^(j+1)(x, y) < 1 then w^(j+1)(x, y) ← 0.
(4) Increment j. If some pair still has nonzero weight, go to Step 2.
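In code, the loop looks as follows (our sketch; the procedure of Lemma 4.1 is abstracted as a callback decompose that returns a suite together with the pairs it η-separates):

```python
# Sketch (ours) of the reweighting loop of Section 4.
import math

def build_decomposition_bundle(pairs, k, decompose):
    w = {p: 2 ** math.ceil(math.log2(k)) for p in pairs}   # w^(0)(x, y) ~ k
    suites = []
    while any(weight > 0 for weight in w.values()):
        suite, separated = decompose(w)                    # Lemma 4.1
        suites.append(suite)
        for p in separated:
            w[p] /= 2                                      # halve separated pairs
            if w[p] < 1:
                w[p] = 0                                   # round small weights down
    return suites
# By Corollary 4.3 the loop runs at most (4 / gamma) * log k times, and by
# Lemma 4.4 every pair is separated in at least log k of the returned suites.
```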
Note that the distance function d in each iteration of the algorithm remains the same.
Lemma 4.2. In each iteration j of the above algorithm, the following holds for all scales i:
Σ_{(x,y)∈D_i} w^(j+1)(x, y) ≤ (1 − γ/2) · Σ_{(x,y)∈D_i} w^(j)(x, y).
Proof. In each iteration, the algorithm of Lemma 4.1 separates pairs carrying at least a γ fraction of the weight Σ_{(x,y)∈D_i} w^(j)(x, y) for every scale i, and the weight of each separated pair is halved; hence the total weight in the next round drops by at least half this amount.
Noting that initially we have Σ_{(x,y)∈D_i} w^(0)(x, y) ≤ 2k², one derives the following simple corollary.
Corollary 4.3. The above algorithm runs for at most (4/γ)·log k iterations.
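The computation behind Corollary 4.3 is the usual one (our rendering):

```latex
% Per scale, each iteration shrinks the total weight by a factor (1 - gamma/2),
% starting from at most 2k^2; weights below 1 are set to 0, so the number of
% iterations j obeys
\Big(1 - \frac{\gamma}{2}\Big)^{j} \cdot 2k^{2} \;\ge\; 1
\;\Longrightarrow\;
j \;\le\; \frac{\ln(2k^{2})}{\ln\!\big(1/(1 - \gamma/2)\big)}
 \;\le\; \frac{2\ln(2k^{2})}{\gamma} \;=\; O\!\Big(\frac{\log k}{\gamma}\Big).
```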
Lemma 4.4. For all distance scales i, every pair (x, y) ∈ D_i is η-separated in at least log k iterations.
Proof. Since we start off with w^(0)(x, y) ≥ k and end with w^(j)(x, y) < 1, the weight w^(j)(x, y) must have been halved at least log k times. Each such halving corresponds to a round j in which (x, y) was η-separated by Π_j.
Theorem 4.5. The above procedure outputs an η-decomposition bundle with at most (4/γ)·log k decomposition suites, such that each terminal pair (x, y) is η-separated in at least log k of these suites.

EMBEDDING VIA DECOMPOSITION BUNDLES
In the previous sections we have constructed a decomposition bundle with a large separation between terminal pairs. Now, we show how to obtain a small-distortion ℓ2-embedding from it. The proof mainly follows the lines of the results in Krauthgamer et al. [2005].
Theorem 5.1. Given an α(x, y)-decomposition bundle for the metric d and a set D, there exists a randomized contracting embedding ϕ : V → ℓ2 such that for each pair (x, y) ∈ D,
||ϕ(x) − ϕ(y)||_2 ≥ Ω(√(α(x, y)/log k)) · d(x, y).
Note that for α(x, y) = Ω(1/√log k) this theorem implies Theorem 1.1. Along the lines of the reasoning in Krauthgamer et al. [2005], we define a measure of "local expansion". Let
V(x, y) := log( |B(x, 2d(x, y))| / |B(x, d(x, y)/8)| ),
where B(x, r) denotes the set of terminal nodes within the ball of radius r around x. We derive Theorem 5.1 from the following lemma.
Lemma 5.2. Given a decomposition suite Π, there is a randomized contracting embedding ϕ : V → ℓ2 such that for every pair (x, y) ∈ D, with constant probability,
||ϕ(x) − ϕ(y)||_2 ≥ Ω( δ(x, y) · √(V(x, y)/log k) ) · d(x, y),
where δ(x, y) is the separation of the pair (x, y) in Π.
By repeatedly applying Lemma 5.2, we obtain the following guarantee.
Corollary 5.3. Given an α(x, y)-decomposition bundle, there is a randomized contracting embedding ϕ : V → ℓ2 such that, with high probability, for every pair (x, y),
||ϕ(x) − ϕ(y)||_2 ≥ Ω( α(x, y) · √(V(x, y)/log k) ) · d(x, y).
Proof. The corollary follows by applying Lemma 5.2 repeatedly and independently, several times for each decomposition suite in the bundle; concatenating and rescaling the resulting maps gives, with high probability, an embedding that fulfills the corollary. In passing, we note that this algorithm (using independent repetitions) may result in an embedding with a large number of dimensions, which may not be algorithmically desirable. However, it shows the existence of such an embedding, and we can then use semidefinite programming followed by random projections to obtain a nearly optimal embedding of the metric into ℓ2 with O(log n) dimensions in randomized polynomial time.
To see that the above corollary implies Theorem 5.1, we use a decomposition due to Calinescu et al. [2004] and Fakcharoenphol et al. [2003] (and its extension to general measures, as observed by Lee and Naor [2005] and by Krauthgamer et al. [2005]) that has the property that with probability at least 1/2, a pair (x, y) is Ω(1/V(x, y))-separated in this decomposition. Applying the corollary to this decomposition bundle, we get an embedding ϕ1 such that
||ϕ1(x) − ϕ1(y)||_2 ≥ Ω( 1/√(V(x, y)·log k) ) · d(x, y).
Applying the corollary to the decomposition bundle assumed by the theorem gives an embedding ϕ2 with
||ϕ2(x) − ϕ2(y)||_2 ≥ Ω( α(x, y) · √(V(x, y)/log k) ) · d(x, y).
Concatenating the two mappings and rescaling, we get a contracting embedding as desired, since the maximum of the two lower bounds is at least their geometric mean, which is Ω(√(α(x, y)/log k))·d(x, y); the computation is sketched below. Now it remains to prove Lemma 5.2.
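Explicitly, the geometric-mean step reads as follows (our rendering):

```latex
% Concatenation guarantees the maximum of the two bounds, which is at least
% their geometric mean:
\|\varphi(x) - \varphi(y)\|_2
 \;\ge\; \sqrt{\Omega\!\Big(\tfrac{1}{\sqrt{V(x,y)\log k}}\Big)
          \cdot \Omega\!\Big(\alpha(x,y)\sqrt{\tfrac{V(x,y)}{\log k}}\Big)}\; d(x,y)
 \;=\; \Omega\!\left(\sqrt{\frac{\alpha(x,y)}{\log k}}\right) d(x,y).
```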

The embedding
Let T = {1, . . . , log k} and Q = {0, . . . , m − 1}, for a suitably chosen constant m. In the following, we define an embedding into |T| · |Q| dimensions. For t ∈ T, let r_t(x) denote the minimum radius r such that the ball B(x, r) contains at least 2^t terminal nodes. We call r_t(x) the t-radius of x. Further, let ℓ_t(x) ∈ N denote the distance class this radius belongs to (i.e., 2^{ℓ_t(x)−1} ≤ r_t(x) ≤ 2^{ℓ_t(x)}). Pick a decomposition suite Π = {P_s} from the decomposition bundle at random. In the following, δ(x, y) denotes the separation factor between x and y in this suite; i.e., for s the distance class of d(x, y),
δ(x, y) = (1/d(x, y)) · min{ d(x, V \ C_s(x)), d(y, V \ C_s(y)) } if C_s(y) ≠ C_s(x), and 0 otherwise.
Observe that with constant probability we have δ(x, y) ≥ α(x, y).
The standard way to obtain an embedding from a decomposition suite is to create a coordinate for every distance scale and embed points in this coordinate with respect to the partitioning for this scale. For example, one could assign a random color, 0 or 1, to each cluster C ∈ P_i. Let W_i denote the set of nodes contained in clusters with color 0 in the partitioning P_i. By setting the i-th coordinate of the image ϕ(x) of a point x to d(x, W_i), a pair (x, y) gets distance Ω(δ(x, y)·d(x, y)) in this coordinate with probability 1/2, because this is the probability that the clusters C_i(x) and C_i(y) get different colors (in this case the distance is Ω(δ(x, y)·d(x, y)), since both nodes are at least that far away from the boundaries of their clusters). Overall, this approach gives an embedding into ℓ2 with contraction O(√log k/δ(x, y)), and has, e.g., been used by Rao [1999] for getting a √log n-distortion embedding of planar metrics into ℓ2. In order to improve on this, along the lines of the recent measured descent idea of Krauthgamer et al. [2005], the goal is to construct an embedding in which the distance between ϕ(x) and ϕ(y) increases as the local expansion V(x, y) increases. This can be achieved by constructing a coordinate for every t ∈ T and then embedding points in this coordinate according to the partitioning for the corresponding distance scale ℓ_t(x) (i.e., different points use different distance scales, depending on their local expansion). Thereby, for a pair with a high V(x, y)-value, the nodes will often (for ≈ V(x, y) values of t) be embedded according to the partitioning for the distance scale i = ⌈log d(x, y)⌉ that corresponds to d(x, y). Therefore, the pair (x, y) gets a larger distance (by a factor of roughly √V(x, y)) in this embedding than in the standard approach.
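The standard construction described at the start of this paragraph can be sketched as follows (our code; nodes, partitions, and dist are assumed inputs):

```python
# Sketch (ours) of the standard one-coordinate-per-scale embedding: color each
# cluster of P_i with a random bit and use the distance to the 0-colored
# region W_i as the i-th coordinate.
import numpy as np

def scale_embedding(nodes, partitions, dist, rng=np.random.default_rng()):
    """partitions[i]: list of clusters (sets of nodes) forming P_i."""
    coords = []
    for P in partitions:
        color = {id(C): int(rng.integers(0, 2)) for C in P}   # random 0/1 color
        W = {x for C in P if color[id(C)] == 0 for x in C}    # 0-colored nodes
        coords.append({x: min((dist(x, w) for w in W), default=0.0)
                       for x in nodes})                       # phi_i(x) = d(x, W_i)
    return {x: np.array([c[x] for c in coords]) for x in nodes}
```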
However, transferring the rest of the standard analysis to this new idea has some difficulties. If we define the set W_t as the nodes x that are colored 0 in the partitioning for scale ℓ_t(x), we cannot argue that for a pair (x, y) either d(x, W_t) or d(y, W_t) is large, because nodes u very close to x or y may have distance classes ℓ_t(u) that are different from ℓ_t(x) or ℓ_t(y). In order to ensure local consistency, so that all nodes close to x obtain their color from the same partitioning, we construct several coordinates in the embedding for every t, such that for each distance scale ℓ_t(x) there is a coordinate in which all nodes close to x derive their color from the partitioning for scale ℓ_t(x). The details are as follows.
Let Q = {0, . . . , m − 1} denote the set of indices of coordinates corresponding to each value of t. For each q ∈ Q, we partition the distance classes into groups g_q of m consecutive classes each, and let the median class in each group represent that group for the coordinate corresponding to q. In the (q, t)-th coordinate, the color of a node x is chosen according to the median distance class of the group g_q to which ℓ_t(x) belongs.
In particular, let g_q(ℓ) := ⌈(ℓ − q)/m⌉. Note that each distance group contains (at most) m consecutive distance classes, which means that distances within a group differ by at most a constant factor: all distances in group g are in Θ(2^{m·g}). We define a mapping π_q between distance classes that maps all classes of a group to the median distance class in this group (the value of π_q for the first and last distance groups is rounded off appropriately; we omit a precise definition for the sake of clarity):
π_q(ℓ) := q + m·g_q(ℓ) − ⌈m/2⌉.
Observe that this partitioning satisfies the key property that for each distance class i, there exists a q such that π_q(i) = i. Based on this mapping, we define a set W_t^q for each choice of t ∈ T and q ∈ Q by
W_t^q = {x ∈ V : color_{π_q(ℓ_t(x))}(x) = 0},
where color_i(x) denotes the color of the cluster that contains x in the partitioning P_i. Note that all nodes whose t-radii fall into the same distance group (w.r.t. the parameter q) derive their color (and hence whether they belong to W_t^q) from the same partitioning. Based on the sets W_t^q, we define an embedding ϕ_{t,q} : V → R for each coordinate (t, q) by ϕ_{t,q}(x) = d(x, W_t^q). The embedding ϕ : V → R^{|T||Q|} is defined by ϕ(x) := ⊕_{t,q} ϕ_{t,q}(x).
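Putting the pieces together, the coordinates ϕ_{t,q} can be sketched as follows (our code, ignoring the rounding at the boundary groups; color is assumed to provide the cluster colors color_i(x) for every partitioning P_i that comes up):

```python
# Sketch (ours) of the measured-descent coordinates phi_{t,q}(x) = d(x, W_t^q).
import math
import numpy as np

def md_embedding(nodes, terminals, dist, color, T, m=10):
    def r(x, t):          # t-radius: least r with >= 2^t terminals in B(x, r)
        ds = sorted(dist(x, z) for z in terminals)
        return ds[min(2 ** t, len(ds)) - 1]

    def cls(x, t):        # distance class l_t(x) of the t-radius
        return max(0, math.ceil(math.log2(max(r(x, t), 1.0))))

    def pi(q, c):         # median class of the group containing class c
        return q + m * math.ceil((c - q) / m) - m // 2

    coords = []
    for t in range(1, T + 1):
        for q in range(m):
            # color[(i, x)] in {0, 1}: color of x's cluster in partitioning P_i
            W = {x for x in nodes if color[(pi(q, cls(x, t)), x)] == 0}
            coords.append({x: min((dist(x, w) for w in W), default=0.0)
                           for x in nodes})
    return {x: np.array([c[x] for c in coords]) for x in nodes}
```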
In the next section, we analyse the distortion of the map ϕ.
We fix an integer t with log(|B(x, d(x, y)/8)|) ≤ t ≤ log(|B(x, 2d(x, y))|), and we use i = ⌈log d(x, y)⌉ to denote the distance class of d(x, y). Clearly, the distance class ℓ_t(x) of the t-radius of x is in {i − 4, . . . , i + 2}, because d(x, y)/8 ≤ r_t(x) ≤ 2d(x, y). The following claim gives a similar bound on the t-radii of nodes that are close to x.
Claim 5.4. For every node z ∈ B(x, d(x, y)/16), we have d(x, y)/16 ≤ r_t(z) ≤ 3d(x, y), and hence ℓ_t(z) ∈ {i − 5, . . . , i + 3}. (This follows since |r_t(z) − r_t(x)| ≤ d(x, z) ≤ d(x, y)/16.)
In the following, we choose m (the number of distance classes within a group) to be 10, and q such that π_q(i) = i, i.e., such that i is the median of its distance group. Then the above claim ensures that for all nodes z ∈ B(x, d(x, y)/16), the distance class ℓ_t(z) is in the same distance group as i. Furthermore, these nodes choose their color (which decides whether they belong to W_t^q) according to the partitioning for distance scale i. Recall that x is δ(x, y)-separated in this partitioning. Therefore, we can make the following claim.
Claim 5.5. If x does not belong to the set W_t^q, then
d(x, W_t^q) ≥ min{ 1/16, δ(x, y) } · d(x, y) ≥ (1/16) · δ(x, y) · d(x, y).
Now, we consider the following events concerning the distances of x and y from W_t^q, respectively.