A constant factor approximation algorithm for generalized min-sum set cover

Consider the following generalized min-sum set cover or multiple intents re-ranking problem proposed by Azar et al. (STOC 2009). We are given a universe of elements and a collection of subsets, with each set S having a covering requirement of K(S). The objective is to pick one element at a time such that the average covering time of the sets is minimized, where the covering time of a set S is the first time at which K(S) elements from it have been selected.
 There are two well-studied extreme cases of this problem: (i) when K(S) = 1 for all sets, we get the min-sum set cover problem, and (ii) when K(S) = |S| for all sets, we get the minimum-latency set cover problem. Constant factor approximations are known for both these problems. In their paper, Azar et al. considered the general problem and gave a logarithmic approximation algorithm for it. In this paper, we improve their result and give a simple randomized constant factor approximation algorithm for the generalized min-sum set cover problem.


Introduction
The min-sum set cover problem is a min-latency version of the well-known set cover problem; for ease of exposition we will consider the equivalent hitting set formulation of the set cover problem. Here, we are given a universe U of n elements and a collection S = {S_1, S_2, ..., S_m} of subsets with S_i ⊆ U, and the objective is to select one element at a time (i.e., find a linear ordering of the elements) such that the average hitting (or "cover") time of the sets is minimized. Formally, we pick one element at every time instant: if an element e is picked at time t, its cover time is Cov(e) = t. The hitting/cover time of a set S is Cov(S) = min_{e∈S} Cov(e), and the goal is to minimize ∑_{S∈S} Cov(S). For this problem, the greedy algorithm of repeatedly picking the element which covers the most uncovered sets is known to be a 4-approximation [BNBH+98, FLT04], and this is the best possible unless P=NP [FLT04]. A problem that is similar in spirit is the min-latency set cover problem, where the cover time of a set S is Cov(S) = max_{e∈S} Cov(e), the time at which all the elements of the set have been selected. This problem also admits a constant factor approximation algorithm [HL05]. In fact, this problem easily reduces to precedence-constrained scheduling on a single machine, for which a 2-approximation is known using various techniques [HSSW97, MQW03, CM99].
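A minimal sketch (not from the paper; instance names are illustrative) of the greedy rule just described for min-sum set cover, which repeatedly picks the element hitting the most still-uncovered sets:

```python
def greedy_mssc(universe, sets):
    """Greedy for min-sum set cover: return (ordering, sum of cover times).

    Elements that hit no remaining set do not affect the objective and are
    simply left out of the returned prefix.
    """
    uncovered = [set(s) for s in sets]   # sets not yet hit
    remaining = set(universe)
    order, total, t = [], 0, 0
    while remaining and uncovered:
        t += 1
        # pick the element covering the most uncovered sets
        best = max(remaining, key=lambda e: sum(1 for s in uncovered if e in s))
        order.append(best)
        remaining.remove(best)
        newly_hit = [s for s in uncovered if best in s]
        total += t * len(newly_hit)      # each newly hit set has cover time t
        uncovered = [s for s in uncovered if best not in s]
    return order, total

order, cost = greedy_mssc({1, 2, 3, 4}, [{1, 2}, {1, 3}, {4}])
```

Here element 1 is picked first (it hits two sets at time 1), then element 4 covers the last set at time 2, for a total cover time of 2 + 2·1 = 4.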
A substantial generalization of these two problems was offered recently by Azar, Gamzu and Yin [AGY09]: the multiple intents re-ranking problem or the generalized min-sum set cover problem (GenMSSC). Here each set S ∈ S also comes with a covering requirement K(S) ∈ {1, 2, ..., |S|}, and its cover time is defined to be the time at which K(S) elements from S are selected; the goal is to minimize ∑_S Cov(S). Note that we get the min-sum set cover problem if we set K(S) = 1 for all sets S ∈ S, and the min-latency set cover problem if we set K(S) = |S| for all S ∈ S. Azar et al. [AGY09] gave an O(ln r)-approximation algorithm for this problem via a modified greedy algorithm, where r is the largest size of any set in S, and left open the question of obtaining a constant factor approximation for the problem. We resolve that question in this paper.
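A small sketch of the GenMSSC objective just defined: given an ordering of the elements, set S is covered at the first time K(S) of its elements have appeared. Names are illustrative.

```python
def gen_cover_cost(order, sets_with_req):
    """sets_with_req: list of (frozenset S, requirement K); returns sum of
    generalized cover times for the given element ordering."""
    pos = {e: i + 1 for i, e in enumerate(order)}  # 1-indexed selection times
    total = 0
    for S, K in sets_with_req:
        times = sorted(pos[e] for e in S)
        total += times[K - 1]   # time at which the K-th element of S appears
    return total

# K(S) = 1 recovers min-sum set cover; K(S) = |S| recovers min-latency set cover.
cost = gen_cover_cost(['a', 'b', 'c'], [(frozenset('ab'), 2), (frozenset('c'), 1)])
```

In this toy instance the first set is covered at time 2 (when both 'a' and 'b' have appeared) and the second at time 3, for a total of 5.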
Theorem 1.1. The generalized min-sum set cover problem (a.k.a. the multiple intents re-ranking problem) admits a randomized 485-approximation algorithm.
Our approach is based on formulating a strengthened LP relaxation for the problem, obtained by adding the so-called "knapsack-cover inequalities" [CFLP00] to the natural LP relaxation. This is necessary, as one can construct examples (see Section 6.2) where the natural LP has an unbounded integrality gap. We then use a simple stage-based randomized rounding scheme which works as follows. We consider exponentially increasing prefixes of time, and round the (fractional) assignments in these prefixes to obtain partial orderings. Then, we combine these partial orderings into a single ordering. For any set S, our rounding guarantees an expected cover time of O(t_S), where t_S is its cover time in the LP relaxation.

Related Work
The fact that the greedy algorithm was a constant-factor approximation algorithm for min-sum set cover was implicit in the work of Bar-Noy et al. [BNBH+98], and was made explicit in papers by Feige et al., who also simplified the proofs, both in the conference version [FLT02] and then further in the journal version [FLT04]. They also showed that the 4-approximation was the best possible unless P=NP. Other variants of this problem have been studied in different contexts, like when the set coverage is probabilistic (stochastic) [CFK03], or when the cost of a set depends on the set of uncovered elements at the time when it is picked [MBMW05].
At the other end of the spectrum is the min-latency set cover problem. This was formally studied by Hassin and Levin [HL05], who gave a factor-e approximation for the problem via techniques similar to those for the min-latency tour, a.k.a. the traveling repairman problem. Subsequently, they observed that min-latency set cover can be modeled as a special case of the classic precedence-constrained scheduling problem 1|prec|∑_j w_j C_j, for which several 2-approximation algorithms are known using a variety of different techniques (see, e.g., [CK04, KSW99] for surveys). This special case corresponds to the so-called "bipartite constraints" case, where there are two types of jobs J_1 and J_2. All jobs in J_1 have w_j = 0, p_j = 1 (these correspond to elements), all jobs in J_2 have w_j = 1, p_j = 0 (these correspond to sets S_j ⊆ J_1), and the precedence constraints have the form that each job j ∈ J_2 must be preceded by the jobs S_j ⊆ J_1. To see the equivalence to the min-latency set cover problem, note that any valid schedule is just an ordering of the jobs in J_1 (as jobs in J_2 have size 0). Moreover, only jobs in J_2 contribute to the completion time (as jobs in J_1 have weight 0), and being of size 0, a job in J_2 can be assumed to be completed immediately after its preceding jobs in J_1 have been scheduled. Woeginger [Woe03] showed that this special case (or equivalently the min-latency set cover problem) is as hard to approximate as the general 1|prec|∑_j w_j C_j problem. Recently it has been shown [BK09] that, assuming a variant of the Unique Games Conjecture, it is hard to approximate 1|prec|∑_j w_j C_j, and hence min-latency set cover, to better than 2 − ε for any ε > 0.
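The bipartite reduction above can be sketched in a few lines. The dict-based job encoding here is our own illustrative format, not from any scheduling library:

```python
def min_latency_to_prec_scheduling(universe, sets):
    """Encode min-latency set cover as an instance of 1|prec|sum w_j C_j.

    Jobs in J1 (elements): processing time p=1, weight w=0.
    Jobs in J2 (sets):     processing time p=0, weight w=1, preceded by
                           the element-jobs of the set.
    """
    jobs = {}
    for e in universe:
        jobs[('elem', e)] = {'p': 1, 'w': 0, 'prec': []}
    for i, S in enumerate(sets):
        jobs[('set', i)] = {'p': 0, 'w': 1, 'prec': [('elem', e) for e in S]}
    return jobs

jobs = min_latency_to_prec_scheduling(['x', 'y'], [{'x', 'y'}])
```

Any schedule for this instance is effectively an ordering of the element-jobs, and the weighted completion time equals the total cover time of the sets.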
Multiple Intents Re-Ranking: The multiple intents re-ranking problem was introduced by Azar et al. [AGY09]. In this problem, each set S has a weight vector w_S of length |S|, and if the elements of the set are output at times τ_S = (t_1, t_2, ..., t_{|S|}) where t_1 < t_2 < ... < t_{|S|}, then the cost of the set is w_S · τ_S; the goal is to find an ordering of the elements that minimizes the sum of these costs ∑_{S∈S} w_S · τ_S. (However, as noticed in that paper, by making copies of sets, one can equivalently imagine each set S to have a single requirement K(S), and we are charged for the first time at which K(S) elements from S have been chosen; i.e., the model we use.) They showed that if all the weight vectors were increasing or decreasing, one could get constant factor approximations, even though the naïve greedy algorithm could be arbitrarily bad. They then gave an O(log r)-approximation via a greedy-like algorithm using a clever harmonic interpolation idea; here r is the size of the largest set in the set system. However, we can show (see Section 6.1) that their algorithm cannot give a constant-factor approximation for the general problem.
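The copying argument in the parenthetical remark can be sketched as follows: the j-th coordinate of w_S charges w_j at the time the j-th element of S appears, which is exactly the cover time of a copy of S with requirement K = j and weight w_j. A minimal sketch of this reduction (our own encoding, assuming nonnegative weights):

```python
def expand_weight_vectors(sets_with_weights):
    """Turn (S, weight vector w) pairs into weighted single-requirement sets.

    Input:  list of (frozenset S, list w) with len(w) == len(S).
    Output: list of (S, K, weight) triples; a copy with requirement K = j
            and weight w_j is created for each nonzero coordinate.
    """
    out = []
    for S, w in sets_with_weights:
        for j, wj in enumerate(w, start=1):
            if wj:
                out.append((S, j, wj))
    return out

copies = expand_weight_vectors([(frozenset('ab'), [0, 3])])
```

The total weighted cover time of the copies equals the original cost ∑_S w_S · τ_S.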

Min-sum Set Cover and GenMSSC
A key difference between min-sum set cover and the generalized version of the problem can be illustrated by looking at the max-coverage variants of both these problems. In the max-coverage problem, given a bound k, the goal is to choose k elements which maximize the number of sets hit. While it is known that the greedy algorithm is a (1 − 1/e)-approximation algorithm for this problem, the max-coverage variant of the generalized problem becomes Dense-K-Subgraph-hard even for the case when a set is covered when 2 of its elements are selected. Indeed, given a graph G, consider the following instance of GenMSSC: the elements are the vertices, and the sets are the edges. Each set e = {u, v} has a covering requirement K(e) = 2. Clearly, the set of k elements/vertices which "hits" the most sets/edges is the collection of k vertices which induces the most edges. Therefore, the max-coverage version of GenMSSC is as hard as the Dense-K-Subgraph problem.
Hence, while one can get constant factor approximations for the min-sum set cover problem by solving the max-coverage problem for bounds of 2^i (for 1 ≤ i ≤ ⌈log n⌉) and combining these solutions to get a global linear ordering, naïvely using this approach would fail for the GenMSSC problem. (Hassin and Levin [HL05] use the max-coverage approach differently for their e-approximation, and it would be interesting to see if that approach can be extended to work for GenMSSC.) Our approach is based on a variation of this idea.
In particular, we use the following observation, which suffices for our purposes even though it is too weak to yield a useful guarantee for max-coverage. Consider the LP formulation for the max-coverage instance given a bound k, strengthened by adding the knapsack-cover inequalities. Let ℓ denote the number of sets which are covered fractionally to an extent of at least 1/2 (or any constant) in an optimal fractional solution. Then the solution obtained by applying a round of randomized rounding (to the LP solution scaled by a suitable constant factor) covers at least Ω(ℓ) sets. At a high level, it is this observation that forms the basis of our algorithm and its analysis. We next describe the details.
An LP Relaxation

We use a time-indexed formulation, with variables x_et indicating that element e occupies time slot t, and y_St indicating that set S has been covered before time t:

minimize ∑_{S∈S} ∑_{t≥1} (1 − y_St)
subject to ∑_{e∈U} x_et ≤ 1 for all t (3.1)
∑_t x_et = 1 for all e ∈ U (3.2)
∑_{e∈S\A} ∑_{t'<t} x_et' ≥ (K(S) − |A|) · y_St for all S ∈ S, all t, all A ⊆ S (3.3)
x_et, y_St ∈ [0, 1] for all e, S, t (3.4)

If x_et and y_St are restricted to only take values 0 or 1, then this is easily seen to be a valid formulation for the problem. In particular, constraints (3.1) require that only one element can be assigned to a time slot, and constraints (3.2) require that each element must be assigned some time slot. Constraints (3.3) correspond to the knapsack-cover constraints and require that if y_St = 1, then for every subset of elements A, at least K(S) − |A| elements must be chosen from the set S \ A before time t. As a consequence, we get that y_St can be 1 if and only if K(S) elements have been picked from S before time t. Therefore, the set incurs an LP cost of exactly the cover time of the set in the integral ordering (since the term (1 − y_St) keeps contributing 1 to the LP objective until the set has been covered).
Let Opt denote any optimal solution of the given GenMSSC instance, and let LPOpt denote the cost of an optimal LP solution. From the above discussion, the LP is a valid relaxation, and hence we have:

Lemma 3.1. The LP cost LPOpt is at most the total cover time of an optimal solution Opt.

Solving the LP: The Separation Oracle
Even though the LP formulation has an exponential number of constraints, it can be solved provided we can verify, in polynomial time, whether a candidate solution (x, y) satisfies all the constraints. Indeed, consider any fractional solution (x, y). Constraints (3.1), (3.2), and (3.4) can easily be verified in O(mn + n^2) time, one by one.
Consider any set S, a time instant t, and a particular size a < K(S). To verify constraint (3.3), we wish to check the following condition:

min_{A ⊆ S : |A| = a} ∑_{e∈S\A} ∑_{t'<t} x_et' ≥ (K(S) − a) · y_St (3.5)

Now, notice that for any fixed set A of size a, the left-hand side can be rewritten as ∑_{e∈S} ∑_{t'<t} x_et' − ∑_{e∈A} ∑_{t'<t} x_et'. Therefore, if the above condition holds when we choose A to be the set of the a elements with the largest values of ∑_{t'<t} x_et', then it also holds for any other set A of size a. Hence we can verify constraint (3.5) in polynomial time for each choice of set S, time t, and size a, and there are only O(mn^2) such choices to iterate over.
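The separation argument above can be sketched directly: for each (S, t, a), discard the a elements of S with the largest fractional mass before t and compare what remains against (K(S) − a) · y_St. The dict-based variable encoding here is illustrative, not from the paper.

```python
def violated_kc_constraint(x, y, sets, K, horizon):
    """Separation oracle sketch for the knapsack-cover constraints.

    x[e][t] and y[i][t] hold fractional values for element e / set index i at
    time slot t (1-indexed).  Returns a violated (set index, t, a) triple, or
    None if all constraints of the form (3.5) are satisfied.
    """
    for i, S in enumerate(sets):
        for t in range(1, horizon + 1):
            # fractional mass of each element of S before time t, largest first
            mass = sorted((sum(x[e][u] for u in range(1, t)) for e in S),
                          reverse=True)
            for a in range(K[i]):
                lhs = sum(mass[a:])          # drop the a largest: worst-case A
                if lhs < (K[i] - a) * y[i][t] - 1e-9:
                    return (i, t, a)
    return None
```

On a valid (x, y) the oracle returns None; if y is raised before enough mass has accumulated, the first violated triple is reported and could be added as a cutting plane.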

The Rounding Algorithm
Let (x*, y*) denote the optimal LP solution. Our rounding algorithm proceeds in O(log n) stages, with the i-th stage operating on the time interval [1, 2^i). In stage i, we perform one round of randomized rounding (as described below) on the fractional solution restricted to the interval [1, 2^i) and obtain a set O_i of elements. At the conclusion of these stages, we output the elements of O_1, followed by the elements of O_2, O_3, ..., O_⌈log n⌉, with the elements of any set O_j being output in an arbitrary order. (Of course, we should only keep the first occurrence of any element in the final output, but imagining elements to potentially be output multiple times makes the analysis easier.) The rounding process for stage i that generates the set O_i is the following:

Algorithm 1 Randomized Rounding for stage i
1: let t_i = 2^i.
2: let z_{e,i} ← ∑_{t<t_i} x*_et be the fractional extent to which e is selected before time t_i, for each e ∈ U.
3: let p_{e,i} ← min(1, 8 z_{e,i}) for all e ∈ U.
4: mark each element e ∈ U independently with probability p_{e,i}.
5: let O_i be the set of marked elements.
6: if |O_i| > 16 · t_i, truncate O_i by dropping arbitrarily chosen elements until |O_i| = 16 · t_i.
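One stage of the rounding can be sketched as follows. The dict-based encoding of x* is illustrative; the scaling factor 8 and truncation threshold 16 · 2^i follow the algorithm above.

```python
import random

def round_stage(x, universe, i, scale=8, cap_factor=16):
    """One stage of randomized rounding: scale the fractional mass in the
    prefix [1, 2^i) by `scale`, cap at 1, and mark independently."""
    t_i = 2 ** i
    O_i = []
    for e in universe:
        z = sum(x[e].get(t, 0.0) for t in range(1, t_i))  # mass before 2^i
        p = min(1.0, scale * z)
        if random.random() < p:
            O_i.append(e)
    if len(O_i) > cap_factor * t_i:       # step 6: truncate oversized stages
        O_i = O_i[:cap_factor * t_i]
    return O_i
```

An element fully scheduled in the prefix is marked with probability 1, while an element with no mass there is never marked; in between, the marking probability is 8 times its prefix mass.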

The Analysis
In the interests of expositional simplicity, we have not tried to optimize the constants in our analysis.
Observation 5.1. The fractional coverage of the sets is monotonically non-decreasing. That is, y*_St' ≥ y*_St for all sets S ∈ S and all times t' ≥ t.

For a set S, let t*_S denote the last time t at which y*_St < 1/2; thus y*_St ≥ 1/2 for all t > t*_S.

Lemma 5.1. For any set S and any stage i such that t*_S ∈ [1, t_i), the probability that fewer than K(S) elements from S are marked in stage i is at most e^{−9/8}.

Proof. Consider any set S, and let S_g = {e ∈ S | z_{e,i} ≥ 1/8}. By the choice of p_{e,i} in step 3 of our rounding procedure, all elements in S_g are marked with probability 1 in stage i, and any element e ∈ S \ S_g is independently marked with probability 8 z_{e,i}. Thus, if |S_g| ≥ K(S), the lemma clearly holds, so consider the case |S_g| < K(S). Applying the knapsack-cover constraint (3.3) with A = S_g at time t_i, and using y*_{S t_i} ≥ 1/2 (which holds since t*_S < t_i), we get ∑_{e∈S\S_g} z_{e,i} ≥ (K(S) − |S_g|) · y*_{S t_i} ≥ (K(S) − |S_g|)/2. Therefore, the expected number of elements from S \ S_g marked in stage i is μ = ∑_{e∈S\S_g} 8 z_{e,i} ≥ 4(K(S) − |S_g|). Since these elements are marked independently of each other, we can use the following Chernoff bound [MR95] (Theorem 4.2): if X_1, X_2, ..., X_n are independent {0, 1}-valued random variables with X = ∑_i X_i and μ = E[X], then for any β ∈ (0, 1],

Pr[X < (1 − β)μ] ≤ exp(−β²μ/2).

For our application, since μ ≥ 4(K(S) − |S_g|) ≥ 4, we can substitute β = 3/4 (noting that K(S) − |S_g| ≤ μ/4 = (1 − β)μ) and bound the probability that fewer than (K(S) − |S_g|) elements are marked from S \ S_g by exp(−(3/4)²/2 · 4) = e^{−9/8}. As the elements in S_g are all marked with probability 1, it follows that the probability that fewer than K(S) elements are marked from S is also at most e^{−9/8}.

Lemma 5.2. The probability that any elements are dropped in step 6 is at most e^{−6}.

Proof. We use the following concentration inequality [BLM00] (Theorem 1, Remark 3): if X_1, X_2, ..., X_n are independent {0, 1}-valued random variables with X = ∑_i X_i and μ = E[X], then for any β > 0,

Pr[X ≥ μ + β] ≤ exp(−β²/(2μ + 2β/3)).

In our setting, since the probability with which an element is picked in O_i is at most 8 times the extent to which it was scheduled in [1, 2^i) by the fractional LP solution, the expected number of elements picked in O_i satisfies μ ≤ 8 · 2^i. Therefore, substituting β = 8 · 2^i and μ ≤ 8 · 2^i in the above inequality, the probability of picking more than 16 · 2^i elements is at most exp(−64 · 2^{2i}/((64/3) · 2^i)) ≤ exp(−6) for every stage i ≥ 1.
We now bound the cover time of a set S under the above algorithm.

Theorem 5.1. (Cover Time) The expected cover time of a set S is at most O(1) · t*_S.

Proof. Let Cov_Alg(S) denote the cover time of set S with respect to the ordering output by our algorithm.
For ease of analysis, we will consider a set S to be covered in some stage i only if t*_S ∈ [1, t_i) and, moreover, the set O_i returned is not truncated. Note that if the set S is actually covered with any of these criteria not met, its cover time only improves. Let E_iS denote the event that set S is first covered in stage i under this modified notion of coverage. Then we have

E[Cov_Alg(S)] ≤ ∑_{i ≥ ⌈log t*_S⌉} Pr[E_iS] · 32 · 2^i, (5.6)

since if S is covered in stage i, its cover time is at most ∑_{j=1}^{i} 16 · 2^j ≤ 32 · 2^i. Also, we know that any set will certainly be covered by stage ⌈log n⌉, because the matching constraints (3.1) and (3.2) ensure that each element is picked to an extent 1 by time n. Now, the event E_iS that a set S is first covered in stage i is contained in the event that S is not covered in stages ⌈log t*_S⌉, (⌈log t*_S⌉ + 1), ..., (i − 1). But for any i, the event that S is not covered in stage i occurs only when either 1. K(S) elements from S were not picked in O_i, or 2. O_i was truncated in step 6.
The former event happens with probability at most e^{−9/8} by Lemma 5.1, and the latter event happens with probability at most e^{−6} by Lemma 5.2. Hence, the probability that S is not covered in any fixed stage is at most e^{−9/8} + e^{−6} < e^{−1}. Thus, we have Pr[E_iS] ≤ e^{−(i − ⌈log t*_S⌉)}. Plugging this into equation (5.6), we get E[Cov_Alg(S)] ≤ ∑_{i ≥ ⌈log t*_S⌉} 32 · 2^i · e^{−(i − ⌈log t*_S⌉)} = O(1) · t*_S. Finally, since 1 − y*_St > 1/2 for every t ≤ t*_S, the LP pays at least t*_S/2 for each set S; summing the bound above over all sets therefore gives an expected cost of at most 485 · LPOpt. This completes the proof of Theorem 1.1.
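The source of the constant can be checked numerically under the bounds above: each stage fails with probability at most 1/e, stage i contributes at most 32 · 2^i to the cover time, and (since 1 − y*_St > 1/2 for t ≤ t*_S) the LP pays at least t*_S/2 per set.

```python
import math

# Pr[first covered in stage log(t*) + j] <= e^{-j}; stage log(t*) + j costs
# at most 32 * 2^{log(t*) + 1 + j} <= 64 * t* * 2^j.  Summing the geometric
# series and doubling (the LP only pays t*/2 per set) gives the ratio.
per_stage_fail = 1 / math.e
geometric = sum((2 * per_stage_fail) ** j for j in range(200))  # sum of (2/e)^j
expected_cover = 64 * geometric   # bound on E[Cov(S)] in units of t*_S
approx_ratio = 2 * expected_cover
```

The series converges since 2/e < 1, and the resulting ratio lands just below the 485 claimed in Theorem 1.1.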

Closing Remarks
The proofs trivially extend to the case where each set S also has a weight w_S ∈ R_+, and the objective function is ∑_S w_S · Cov(S).
The current approximation factor is a rather large constant, and it would be interesting to pin down the integrality gap of this LP relaxation better. For the two extreme cases of min-sum set cover and min-latency set cover, it is known that the integrality gap of this LP relaxation is 4 (see [FLT02] for a dual-fitting proof) and 2 [HSSW97], respectively. We have not tried to optimize the constants in this abstract; however, getting a substantially lower constant might require other ideas.
The Integrality Gap of the Natural LP

We now show that the natural LP (without the knapsack-cover inequalities (3.3)) has an unbounded integrality gap. The universe consists of the elements {a_1, ..., a_n} ∪ {b_1, ..., b_l}, and there are l sets with S_t = {a_1, ..., a_n, b_t} and covering requirement K(S_t) = n + 1 = |S_t| (i.e., an instance of min-latency set cover). Consider the following fractional solution: select element a_t in time slot t for t ∈ [n], and element b_t in time slot (n + t) for t ∈ [l] (i.e., set x_{a_t, t} = 1 and x_{b_t, (n+t)} = 1). Although the x_et variables are integral, the LP will now cheat when it comes to the y_St variables (which are what contribute to the objective): for any set S and time slot t, the LP solution sets y_St = min(1, (1/(n+1)) ∑_{e∈S} ∑_{t'<t} x_et'). We first analyze the LP cost of this assignment: clearly, with each element we pick from {a_1, ..., a_n}, the term (1 − y_St) decreases by an additive 1/(n + 1), and this happens for each set in each of the first n time steps. Hence, over the first n time steps, each set incurs a cost of 1 + (1 − 1/(n+1)) + (1 − 2/(n+1)) + ... + (1 − (n−1)/(n+1)), which is roughly n/2. After this, however, we cover one set at a time in time slots (n+1), (n+2), ..., (n+l); but by this time each set has only a 1/(n+1) uncovered fraction which needs to incur any cost for this final covering step. Hence the total LP cost is Θ(nl + (1/(n+1))(nl + l²)). However, any integral solution might as well schedule the elements {a_1, ..., a_n} before selecting the elements {b_1, ..., b_l} one by one, giving a cost of (n+1) + (n+2) + ... + (n+l) = nl + l(l+1)/2 = Θ(nl + l²). Now setting n = √l gives us an integrality gap of Ω(√l). Notice that with the knapsack-cover inequalities, this cheating is impossible: choosing A to be the set of elements already selected forces y_St = 0 until K(S) elements of S have been selected before time t. In fact, for this extreme min-latency case, the LP strengthened with the knapsack-cover inequalities is equivalent to the time-indexed LP relaxation for precedence-constrained scheduling on a single machine, which has an integrality gap of 2.
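The gap can be checked numerically. We assume here that the instance has the form suggested by the fractional solution described above: n shared elements a_1, ..., a_n, plus l sets S_t = {a_1, ..., a_n, b_t} each with requirement n + 1 (this reading of the instance is an assumption, as is the choice of l below).

```python
import math

def lp_cost(n, l):
    """Cost of the cheating fractional solution in the natural LP:
    a_t in slot t, b_t in slot n + t, and
    y_St = min(1, (# elements of S_t picked before u) / (n + 1))."""
    total = 0.0
    for t in range(1, l + 1):              # set S_t, covered at slot n + t
        for u in range(1, n + t + 1):      # y reaches 1 only at slot n + t + 1
            picked = min(u - 1, n)         # b_t arrives at slot n + t, not before u
            total += 1 - min(1.0, picked / (n + 1))
    return total

def integral_cost(n, l):
    """Best integral ordering: all a's first, then the b's one by one."""
    return sum(n + t for t in range(1, l + 1))

l = 400
n = int(math.sqrt(l))                      # n = sqrt(l) maximizes the gap
gap = integral_cost(n, l) / lp_cost(n, l)  # grows like Omega(sqrt(l))
```

For l = 400 (n = 20) the fractional cost is about nl/2 + l²/(2n) while the integral cost is nl + l(l+1)/2, and the ratio already exceeds 10; it scales as Θ(√l).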