Probabilistic Planning in the Graphplan Framework

The Graphplan planner has enjoyed considerable success as a planning algorithm for classical STRIPS domains. In this paper we explore the extent to which its representation can be used for probabilistic planning. In particular, we consider an MDP-style framework in which the state of the world is known but actions are probabilistic, and the objective is to produce a finite-horizon contingent plan with the highest probability of success within the horizon. We describe two extensions of Graphplan in this direction. The first, PGraphplan, produces an optimal contingent plan. It typically suffers a performance hit compared to Graphplan but still appears to be fast compared with other approaches to probabilistic planning problems. The second, TGraphplan, runs at essentially the same speed as Graphplan, but produces potentially sub-optimal policies: TGraphplan's policy selects the first action on the highest-probability trajectory from its current state to the goal. Ideally, we would like an optimal planner for probabilistic domains with the same speed that Graphplan would have if the domain were made deterministic. By comparing the speed and quality of these two planners to each other, we can estimate how far off we are from this ideal.


Introduction
The Graphplan planner is based on compiling a STRIPS-style planning problem into a compact (polynomial-size) graph structure in which information can be quickly propagated to aid in the search for a plan [BF97]. Empirical results indicate that this approach is often quite fast compared to other traditional methods [BF97, Byl97, KNHD97]. Since its initial formulation, this basic algorithm has been extended and improved in a number of ways, such as allowing operators with contingent effects [GK97, KNHD97, ASW97], handling certain kinds of uncertainty [SW98, WAS98], and further speed improvements [KNHD97, KLP97].
In this paper, we explore the question: to what extent can the speed of Graphplan be extended to probabilistic domains? We consider the setting in which the initial conditions and current state are known, but actions can be probabilistic, having several possible outcomes. This falls into the framework of Markov Decision Processes (MDPs). In particular, instead of looking for a plan that consists of an action sequence, we will be looking for a contingent plan that tells which action to take based on the history of outcomes so far. The specific kind of MDP we focus on is one in which, as with STRIPS planning, our objective is to satisfy all of a given set of conjunctive goals. Our aim will be to produce an optimal finite-horizon contingent plan, where by "optimal" we mean maximizing the probability of success within the time window (though much of the discussion applies to objectives such as minimizing the expected completion time as well).
Ideally, we would like an optimal planner for probabilistic domains with the same speed that Graphplan would have, had the domain been deterministic. Embedding probabilistic outcomes into Graphplan's graph structure is straightforward. However, how to search for a plan is less obvious. Planning in probabilistic domains is inherently more complex than in deterministic domains, in part because a complete policy can have size exponential in the horizon time, whereas in deterministic domains plan sizes are linear in the horizon time.
We instead present two planners: PGraphplan and TGraphplan. TGraphplan runs at essentially the same speed as Graphplan, but produces potentially suboptimal policies. Specifically, TGraphplan finds the highest-probability trajectory from the start state to the goal, which can then be turned into a kind of greedy contingent plan in a natural way. In contrast, PGraphplan produces optimal contingent plans but typically suffers a performance hit. Unlike the backward-chaining search of Graphplan, PGraphplan is based on a forward-chaining search through the planning graph, which is much more natural for producing optimal plans in these probabilistic settings (we discuss this further in Section 3.3). PGraphplan is like a standard top-down dynamic programming algorithm, but uses information stored in the graph to prune its search. The difficulty here is that for forward-chaining search, Graphplan's mutual exclusion relations are not especially helpful, and instead PGraphplan uses other kinds of information propagated backwards from the goals. In this sense, our objectives are similar to those of Kambhampati and Parker [KP99].
We compare PGraphplan and TGraphplan to each other and to several other probabilistic planners, such as Buridan [KHW95], SPI [BDG95], and vanilla dynamic programming, on a variety of domains. Comparing PGraphplan and TGraphplan to each other allows us to estimate how far off we are from our ideal objective. In particular, from the perspective of Graphplan, two key questions are: "To what extent can the planning graph representation be used to speed up forward search?" and "Can Graphplan's backward-chaining search produce a near-enough-optimal solution?"

Other related work
There is a long history of work in probabilistic planning. One of the first planners designed specifically for probabilistic STRIPS-style domains is Buridan [KHW95], extended to the contingent planner C-Buridan by [DHW94]. Unlike our work, these planners also considered partial observability, but in general were quite slow. Closer to our motivations is the Structured Policy Iteration (SPI) of [BDG95]. SPI is designed for the fully observable MDP setting, and attempts to use the propositional representation of states and actions to find an optimal policy without expanding out the entire state space. Zander [ML99] and Maxplan [ML98] by Majercik and Littman compile a planning problem into a stochastic satisfiability problem, solving the problem in that representation. Maxplan is a blind planner, while Zander produces contingent plans.
There has also been work on probabilistic planning explicitly using representations motivated by Graphplan. In particular, Boutilier et al. [BBG98] generalize Graphplan's pairwise mutual-exclusion constraints to k-wise constraints, and examine the reduction in the size of the MDP that is implicitly represented by the layer at which the planning graph levels off.
The work of Dean et al. [DKKN95] has a close connection to the goals of TGraphplan. That work finds an initial trajectory using a forward search, and then attempts to interpolate (in an anytime fashion) towards an optimal solution.

Graphplan and its representation
Graphplan [BF97] is a planner for STRIPS domains. These domains consist of initial conditions that describe the starting state of the world, operators that describe the legal actions that may be performed, and goals representing those facts that we wish to be true at the end of a plan. Operators have conjunctive preconditions, add-lists, and delete-lists. For instance, in a blocks-world, the initial conditions specify the starting configuration of the blocks, the goals specify what we wish to be true about the blocks at the end of the plan, and the operators specify our legal moves.
Graphplan is based on compiling such a planning problem into a polynomial-size structure called a planning graph. A planning graph is a directed, leveled graph. The first level has one node for each proposition in the initial conditions. The next level has a node for each action (fully-instantiated operator) that might possibly be performed at time 1: that is, one node for each action whose preconditions all exist in the initial conditions. The next level of the graph lists all propositions that might be true at time 2: namely, the union of the add-effects of all actions in the previous action level, including the no-ops (for consistency, Graphplan also includes a no-op action for each proposition that simply propagates the proposition forward in time). The next level consists of actions that might possibly be performed at time 2, and so forth. Edges in a planning graph connect actions to their preconditions and their add and delete effects.
Graphplan begins by creating a planning graph forward from the initial conditions until all of the goals appear in the graph. It then searches in a backward-chaining fashion. If the recursive search does not find a plan, then the graph is extended one more time step and the process is repeated. The planner also has a mechanism for eventually halting if, in fact, the problem has no solution.
The planning graph allows for several important optimizations, including:
1. Propagating pairwise mutual exclusion relations between propositions and between actions forward through the graph while it is being created. These relations tell the planner that, for instance, propositions P and Q cannot both be made true by time step 4, and are used to prune the backward search.
2. Allowing the plan to perform several operators in parallel so long as they do not conflict with each other.
Memoizing the results of unsuccessful searches is also used, and improvements to Graphplan use a number of other optimizations as well. For more details see [BF97, KNHD97, KLP97].
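To make the construction above concrete, the following Python sketch expands one level of the graph for deterministic STRIPS operators, with no-ops folded in; the operator field names (preconditions, adds) are our own assumption, and mutex propagation is omitted.

```python
# One level of planning-graph expansion for deterministic operators.
# Illustrative sketch only; field names are assumptions, and the pairwise
# mutual-exclusion bookkeeping described above is omitted.
def expand_level(props, operators):
    """props: propositions at time t -> (actions at time t, props at time t+1)."""
    actions = [op for op in operators if op.preconditions <= props]
    next_props = set(props)        # no-ops carry every proposition forward
    for op in actions:
        next_props |= op.adds      # union of the actions' add-effects
    return actions, frozenset(next_props)
```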

Representing probabilistic actions
In this paper, we consider probabilistic actions. Specifically, we consider operators with conjunctive preconditions and with several sets of add and delete effects, each set having an associated probability. For example, an "open door" action might require that the door be unlocked, and with 88% probability actually open the door, with 10% probability do nothing, and with 2% probability pull off the handle. We assume that the outcome that actually occurred is always known once the action has been executed (the MDP, not the POMDP, setting).
To represent these kinds of operators, the graph is constructed in the normal manner except that each action contains a list of possible "outcomes", each with its associated probability, and each outgoing edge of the action indicates which outcome produced that edge. For instance, suppose that we have an operator that with probability 0.7 deletes G0 and adds G1, and with probability 0.3 just adds G1. Then there would be two outcomes, one with probability 0.7 and one with probability 0.3. There would also be three outgoing edges. One edge would be a delete edge leading to G0 and associated with the first outcome, one would be an add edge leading to G1 and associated with the first outcome, and one would be an add edge leading to G1 and associated with the second outcome. Since each outcome contains its probability, this representation is sufficient to reconstruct the definition of the operator.
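As a concrete (and purely illustrative) rendering, a probabilistic operator of this kind might be encoded as follows; the class and field names are our own assumptions, not the data structures of the actual planners, which are written in C.

```python
# Hypothetical encoding of a probabilistic STRIPS operator.
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    probability: float                 # probability this outcome occurs
    adds: frozenset = frozenset()      # propositions added by this outcome
    deletes: frozenset = frozenset()   # propositions deleted by this outcome

@dataclass(frozen=True)
class Operator:
    name: str
    preconditions: frozenset           # conjunctive preconditions
    outcomes: tuple                    # Outcomes; probabilities sum to 1

# The "open door" example from the text:
open_door = Operator(
    name="open-door",
    preconditions=frozenset({"door-unlocked"}),
    outcomes=(
        Outcome(0.88, adds=frozenset({"door-open"})),
        Outcome(0.10),                                    # nothing happens
        Outcome(0.02, adds=frozenset({"handle-off"})),
    ),
)
```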

PGraphplan
A standard algorithm for solving a finite-horizon MDP is dynamic programming. One begins by computing the value of each state for a time horizon of 0, then uses that to compute values for a time horizon of 1, and so on up to the given horizon t_max. In propositional planning, because we have an initial state and the state space is not explicitly enumerated, it is more natural to do this in a recursive top-down fashion, as described in Figure 1. Notice that top-down DP, because it stores the result of each computation, explores each state at most once per time step, just like bottom-up DP.

DPsolve(state s, time t): Compute value(s, t) = best possible probability of success within the time window for a contingent plan starting from state s at time step t.

1. If t = t_max, then return 0 if the goals are not satisfied, else return 1.
2. If already-visited(s, t), then return the previously-computed value.
3. For each possible action a:
   (a) For each possible state s' that could result from taking action a in state s, recursively call DPsolve(s', t+1).
   (b) Let value(s, a, t) be the probability-weighted average of the results.
4. Let value(s, t) = max_a value(s, a, t). Return this quantity, after first storing it in case we ever visit s at time t in a later recursive call. Also, return argmax_a value(s, a, t) as the optimal action.

Fig. 1. Standard top-down Dynamic Programming
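The sketch below renders Figure 1 directly in Python, assuming the hypothetical Operator/Outcome encoding sketched earlier (a state here is a frozenset of propositions). It is vanilla top-down DP only; PGraphplan's graph-based pruning, described next, is not included.

```python
# Top-down DP of Figure 1, without PGraphplan's graph-based pruning.
# Assumes the hypothetical Operator/Outcome encoding sketched earlier;
# a state is a frozenset of propositions.
def dp_solve(state, t, t_max, goals, operators, memo):
    """Best success probability of a contingent plan from `state` at time t."""
    if t == t_max:                           # step 1: end of the time window
        return 1.0 if goals <= state else 0.0
    if (state, t) in memo:                   # step 2: each (s, t) solved once
        return memo[(state, t)]
    # the no-op "do nothing" action leaves the state unchanged
    best = dp_solve(state, t + 1, t_max, goals, operators, memo)
    for op in operators:
        if not (op.preconditions <= state):
            continue
        # step 3: probability-weighted average over the action's outcomes
        value = sum(o.probability *
                    dp_solve((state - o.deletes) | o.adds,
                             t + 1, t_max, goals, operators, memo)
                    for o in op.outcomes)
        best = max(best, value)              # step 4: maximize over actions
    memo[(state, t)] = best
    return best
```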
PGraphplan begins with this vanilla top-down dynamic programming as its starting point, but uses the planning graph to prune its search. In particular, PGraphplan propagates two distinct kinds of information through the graph. The first kind tells the planner how various nodes in the graph contribute to solving the problem goals, and the second focuses more on how near or remote that contribution is. Both kinds of information are used in the same way: to tell the planner when the path it is currently exploring provably cannot reach the goals within the given time horizon, and therefore it may safely return failure in its recursive call. The two types of information are given below.

Unary and pairwise "needed" nodes
One very simple optimization is to delete nodes from the planning graph that do not have any paths to the goal literals, by performing a backward sweep through the graph. This simple form of relevance analysis effectively collapses multiple states together, and is roughly equivalent to the notion of relevance used by Knoblock [Kno94]. For an even greater savings, we assign to each node not removed a vector indicating which goal literals are reachable from this node. The planner uses this information when at some state S by looking at the vectors assigned to each literal in S and checking to see if any goal is missing from all the vectors; if so, then it knows it may backtrack immediately. In fact, if there is only one non-noop action that creates some goal, then the preconditions to this action can be used instead of that goal for the purposes of defining these vectors. This can be viewed as a crude but computationally cheap version of the fact-generation trees of Nebel et al. [NDK97].
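As a small sketch of how this check might look, suppose each surviving literal carries a bitmask of the problem goals reachable from it (the bitmask representation is our assumption, purely for illustration):

```python
# Hypothetical form of the unary "needed" check: reach_mask[lit] is a
# bitmask of the problem goals reachable from literal `lit` at this level.
# If some goal bit is set in no literal's mask, the planner can backtrack.
def may_reach_all_goals(state, goal_mask, reach_mask):
    covered = 0
    for lit in state:
        covered |= reach_mask.get(lit, 0)   # OR the per-literal goal vectors
    return (covered & goal_mask) == goal_mask
```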
A more interesting version of this information is a pairwise notion of "neededness". Suppose node P does have one or more paths to goal literals, but all of these paths use node Q as well (e.g., all potentially useful actions that have P as a precondition also have Q as a precondition); in that case, we can add a "P needs Q" edge to the graph, allowing the planner to drop literal P from its state if literal Q is not also present. Specifically, we say that proposition P needs proposition Q if each action with P as a precondition either (a) has Q as a precondition too, or (b) needs some action that has Q as a precondition. Similarly, a non-noop action A needs noop B if all add-effects of A need the result of B (and none are equal to it); a noop action A needs action B if the result of A needs a proposition that can be created only through B. For example, consider a domain in which a robot, initially with key in hand, must move to the end of the hallway, unlock a door at the end, and then drop the key in order to empty its hand for some subsequent task. In this case, the fact that the move-forward action needs the noop-have-key action gets propagated back through the entire graph, ensuring that the robot does not drop the key prematurely.
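In a forward search, these edges let the planner simplify a state before expanding or memoizing it, along the following (hypothetical) lines; we iterate to a fixpoint since dropping one literal can strand another.

```python
# Sketch of pruning with "P needs Q" edges: a literal that needs an absent
# literal cannot contribute to the goals and may be dropped from the state.
# `needs` maps each literal to the set of literals it needs. Names are ours.
def prune_with_needs(state, needs):
    state, changed = set(state), True
    while changed:                 # fixpoint: dropping Q may strand P
        changed = False
        for p in list(state):
            if any(q not in state for q in needs.get(p, ())):
                state.discard(p)
                changed = True
    return frozenset(state)
```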
It is interesting to compare the information propagated here to the "backward mutex" constraints used by Kambhampati and Parker [KP99]. They propagate the notion that "P and Q are redundant" in the sense that one can confidently remove one of them from any state that contains both. That information seems more difficult to propagate appropriately in a probabilistic setting, and we have not attempted to do so.

Value propagation
The idea for this second kind of information is to store values on nodes of the graph that allow one to compute a "permissible heuristic" for an A*-style search. This method empirically produces an even greater savings than the one above.
Imagine that reaching a goal state (one in which all the problem goals are satisfied) is worth t_max dollars, but performing a non-noop action costs $1. In this case, the true value of a state that can reach the goals in i steps is $(t_max - i). Notice that if a state S at time step t is worth less than $t, then this means it cannot possibly reach the goals by time t_max. Thus, a magic oracle that returned the true value of any given state would be quite useful. The idea of value propagation is to propagate heuristic values (hvalues) through the nodes of the graph such that the heuristic value of any state S, defined to be the sum of the hvalues of the nodes of the state, is guaranteed to be greater than or equal to the true value of S. If the planner finds that the heuristic value of its current state at time t is less than $t, then it can confidently backtrack immediately.
Specifically, we begin by dividing the final value t_max evenly among the problem goals at time t_max, breaking ties arbitrarily (all hvalues will be integral). We propagate hvalues on propositions at time t+1 backwards to actions at time t as follows: the hvalue of a noop is the hvalue of its effect; the hvalue of a non-noop action a is -1 plus the sum of hvalue(e) over the add-effects e of a, where preconditions that are not deleted are viewed as add-effects for this purpose. We view probabilistic actions (which may have several possible outcomes) as if they were user-controllable for this computation: that is, a probabilistic action with k possible outcomes is treated as k separate deterministic actions, one for each possibility, each with its own value.
Values on actions at time t are then propagated to propositions at time t as follows. We begin with the noops: each gives its value to its precondition. We then consider each non-noop action in turn. For each action, we compare its value to the sum of the hvalues of its preconditions. If its value is larger, we give the difference to the preconditions; the semantics is that the total value of the preconditions is the maximum of performing or not performing this action. As a heuristic, we distribute the value evenly among just the preconditions deleted by the action (breaking ties arbitrarily) if any exist; if none exist, we distribute the value evenly among all preconditions. (Actions with no preconditions are given a fake "always-true" precondition that is true in the initial conditions and never deleted.) An example is given in Figure 2.
This method is legal in the sense that if a state at time t has a sum of hvalues less than t, then it cannot possibly reach the goals.Unfortunately, it can at times be an over-estimate, because of a double-counting that may occur as values are propagated.
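The condensed sketch below shows one backward step of this propagation and the resulting pruning test, with each probabilistic outcome split into its own deterministic action as described above. The function and field names are our own illustration; the real implementation surely differs in detail.

```python
# One backward step of value propagation (illustrative sketch). `actions`
# are the outcome-split (hence deterministic) actions at level t, and
# hvalue_next holds the hvalues of propositions at level t+1.
def propagate_level(actions, hvalue_next):
    act_value, hvalue = {}, {}
    for a in actions:                      # hvalue of each action
        if a.is_noop:
            act_value[a] = hvalue_next.get(a.effect, 0)
        else:                              # surviving preconditions count as adds
            effects = a.adds | (a.preconditions - a.deletes)
            act_value[a] = -1 + sum(hvalue_next.get(e, 0) for e in effects)
    for a in actions:                      # noops first: value to precondition
        if a.is_noop:
            hvalue[a.effect] = act_value[a]
    for a in actions:                      # then each non-noop action in turn
        if a.is_noop:
            continue
        total = sum(hvalue.get(p, 0) for p in a.preconditions)
        if act_value[a] > total:           # give the difference to preconditions
            targets = sorted((a.deletes & a.preconditions) or a.preconditions)
            share, extra = divmod(act_value[a] - total, len(targets))
            for i, p in enumerate(targets):
                hvalue[p] = hvalue.get(p, 0) + share + (1 if i < extra else 0)
    return hvalue

def can_prune(state, t, hvalue):
    # a state whose summed hvalue is below $t cannot reach the goals in time
    return sum(hvalue.get(p, 0) for p in state) < t
```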
The two kinds of information described above each have a different purpose. The first kind asks: is a node useful, and for what? The second focuses more on how long the path is from some node to its eventual use.

Why forward-chaining?
Finding an optimal policy with a planning graph appears to be considerably more difficult for a backwards search than a forwards one. Consider, for example, a domain in which the initial conditions contain the literals A and B, and the goal is to achieve G. There are two operators: OpA requires A, deletes A, and with probability 0.5 adds the goal G; OpB requires B, deletes B, and with probability 0.5 adds G. See Figure 3. For this domain, the plan OpA, OpB has a 75% chance of success: OpA succeeds with probability 0.5, and if it fails, OpB still succeeds with probability 0.5, for a total of 0.5 + 0.5 × 0.5 = 0.75. However, to find this plan by reasoning backwards in the planning graph appears to require combining seemingly unrelated goal sets. In particular, to produce the optimal plan, we need at time 2 to consider the goal set {G} (which achieves our goal via a noop) and {B} (which has probability 0.5 of achieving the goal via OpB) together. We then need to realize that if OpA is performed from {A, B} at time 1, then each of the two outcomes of the action will lead to one of these goal sets at time 2. (Note: these are two distinct goal sets, not just two goals in the same set.) This kind of reasoning seems possible, but it also seems that it would require time quadratic in the number of goal sets at any given time step (or cubic if some action has three possible outcomes). The problem stems from the fact that we are dealing with goal sets (which are only subsets of states) rather than the states themselves in our backward-chaining search, unlike in bottom-up DP. Perhaps some way can be found around these difficulties, or some way to make this not too expensive in "typical" domains. In any case, this is the reason we choose to use forward chaining.
Smith and Weld [SW98] and Weld et al. [WAS98] use a different approach that allows for backward-chaining search on goal sets in the presence of uncertainty. They handle uncertainty in the framework of Graphplan by essentially creating one graph for each "possible world". For instance, if one views the outcome of a probabilistic action as being determined by the flip of an associated coin, then there would be one graph for each possible sequence of coin flips. Ideally, one would like to use the same kind of search while at the same time handling the (possibly exponentially many) possible worlds within the context of a single planning graph.
TGraphplan

TGraphplan uses a backward-chaining search (essentially the same as the original Graphplan) and finds an optimal trajectory from the initial state to the goals. A trajectory is a sequence of actions and outcomes leading from one state to another, such as "I turn the key in the ignition, the car starts, I drive to the airport, get there in time for my plane, I catch my flight, and arrive at my destination." An optimal trajectory is the highest-probability sequence of states and actions leading to the goal. When two trajectories have the same utility, trajectories that do noops later are preferred, as in Graphplan. This bias is especially important in a probabilistic setting because it maximizes the amount of time available for recovery in case the trajectory experiences a failure.
TGraphplan can be used as a subroutine to manufacture a complete policy, by simulating execution of the trajectory to detect the state and time of unexpected outcomes not on the trajectory. Optimal trajectories for these unexpected outcomes can then be found and simulated forward to find new unexpected outcomes. The recursion terminates when there are no more unexpected outcomes. More naturally, TGraphplan can be used in an online fashion after the optimal trajectory is discovered. While the first action is executing (in the real world), TGraphplan can be planning for subsidiary trajectories. Thus, the important time quantity to measure for TGraphplan is the time to find the optimal trajectory rather than the time to find a complete policy.
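A sketch of this policy-manufacturing loop might look as follows, where plan_trajectory(s, t) stands in for a call to TGraphplan and returns the actions along the best trajectory together with their expected outcomes; all names here are our own illustration.

```python
# Building a complete policy by repeatedly planning trajectories, as
# described above. plan_trajectory(s, t) is a stand-in for TGraphplan and
# returns a list of (action, expected_outcome) pairs.
def build_policy(init_state, t_max, plan_trajectory):
    policy, frontier = {}, [(init_state, 0)]
    while frontier:
        s, t = frontier.pop()
        if (s, t) in policy or t >= t_max:
            continue
        for action, expected in plan_trajectory(s, t):
            policy[(s, t)] = action
            for outcome in action.outcomes:   # queue each unexpected outcome
                if outcome is not expected and t + 1 < t_max:
                    s2 = (s - outcome.deletes) | outcome.adds
                    frontier.append((s2, t + 1))
            s = (s - expected.deletes) | expected.adds
            t += 1
    return policy
```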
TGraphplan starts by building a planning graph with probabilistic outcomes included. The backward-chaining search is done exactly as in Graphplan, except that instead of the recursion returning a binary value (success/fail), it returns a real-valued success probability for the best sub-trajectory. In the TGraphplan search algorithm, the probability of the trajectory is determined by recursively multiplying the probability of a step succeeding by the probability of the partial trajectory already explored. The same set of optimizations that Graphplan uses are used by TGraphplan. Mutual exclusions are more complicated because it is possible for an operator to interfere with one outcome of another probabilistic operator but not with a second outcome. In TGraphplan, a pair of operators is made exclusive when any pair of outcomes interferes. We could instead replace the notion of "exclusive operators" with a notion of "exclusive outcomes", but we do not do so. TGraphplan can also be run in an iterative-deepening mode, as for Graphplan. To do this, the desired trajectory probability must be given in advance, and the algorithm terminates as soon as a trajectory with at least the desired probability is found.
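To show where the outcome probabilities enter the recursion, here is a heavily simplified sketch: where Graphplan's backward search returns success or failure for a goal set at a level, this version returns the best sub-trajectory probability. For brevity it omits mutexes, memoization, and the case where one action supports several goals, so it should be read as an illustration of the scoring change only.

```python
# Simplified TGraphplan-style recursion (no mutexes or memoization; one
# supporting action chosen per goal). achievers(g, level) yields
# (action, outcome) pairs whose outcome adds g; noops have probability 1.
def best_subtrajectory(goals, precs, level, achievers, init):
    if not goals:                    # all goals at this level supported
        if level == 1:
            return 1.0 if precs <= init else 0.0
        return best_subtrajectory(precs, frozenset(), level - 1,
                                  achievers, init)
    g = next(iter(goals))
    rest = goals - {g}
    best = 0.0
    for action, outcome in achievers(g, level):
        # multiply this outcome's probability into the sub-trajectory's
        p = outcome.probability * best_subtrajectory(
                rest, precs | action.preconditions, level, achievers, init)
        best = max(best, p)
    return best
```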
The TGraphplan algorithm is fast in comparison to PGraphplan because it only outputs partial policies, solves an inherently less complex problem, and uses a backward-chaining search that can take advantage of the mutual exclusions propagated forward while building the graph. It is interesting to consider when TGraphplan will produce a (near-)optimal policy and when the choices it makes will be substandard. This will be discussed in the following examples.
Experimental results

We now describe several example domains and give results of running PGraphplan, TGraphplan, and other comparison planners on them. The purpose of these experiments is twofold: to examine the speed of the proposed planners, and to explore the extent to which the plans produced by TGraphplan are optimal or close to it. PGraphplan and TGraphplan are written in C. The planners we compare to are:
- Top-down Dynamic Programming (Figure 1).
- Buridan [KHW95]: in compiled Lisp.
- Blackbox [KS99] (on deterministic domains): written in C.
- SPI [BDG95]: an infinite-horizon discounted MDP solver, written in C.

Moats and Castles: This simple domain is an adaptation of one by Majercik and Littman [ML98], in which the goal is to build a sand castle on the beach. In our version, there are two operators. Dig-moat is a deterministic action that increases the depth of a protective moat (there are 5 discrete depths), and build-castle is a probabilistic action for creating the castle, whose success probability increases with the depth of the moat. The optimal finite-horizon plan for this problem will consist of some number of dig actions, followed by a remainder of builds, where the number of digs depends on the time horizon and the specific success probabilities. The problem as stated has a very small state space; to make things more interesting, we consider having multiple castles, each with its own moat.
We consider this domain in part because it gives a simple illustration of when TGraphplan does or does not produce optimal plans. In particular, for the case of one castle, the optimal trajectory is to dig as many times as possible, followed by one final build operation (and then noops up to the time horizon). This may or may not be a trajectory of the optimal policy; in particular, the optimal policy may have fewer dig operations if, for instance, deepening the moat has only a small effect on the success probability of the build operation.
We describe performance results in Table 1. For this domain, when the time horizon is large, the information propagated by PGraphplan does not provide much of a gain. That is because in this case almost all states can, in principle, lead to solving the goals, and the information propagated is only intended to prune states with no chance of success. TGraphplan scales better with time horizon but worse with number of castles, compared to PGraphplan; the latter effect appears to occur because of interaction between parallel and probabilistic actions.
Probabilistic Blocks: In the standard blocks-world domain, we can pick up a block from the table, put down a block onto the table, unstack a block from another block, and stack a block on another block. So, for instance, to pick up a block from the table and place it on another block takes two time steps. Imagine we augment this domain with a probabilistic operator faststack that can move a block from the table onto the top of another block in one time step, but it succeeds only with 70% probability; with 30% probability it has no effect. This setting is interesting because the optimal action depends on the time-to-go. Note that the optimal trajectory will always choose the deterministic operators if there is sufficient time, and otherwise will choose faststack. Thus, in this domain, TGraphplan does, in fact, lead to an optimal policy. Results for this domain are given in Table 2.
8-puzzle and Flat-tire: In order to compare to deterministic planners, we also considered several deterministic domains, in particular the flat-tire problem of Russell and the 8-puzzle problem. The 8-puzzle problem is interesting because there is no special advantage to backward-chaining on this problem. We consider the goal of achieving board state ABCDEFGH (reading left to right, top to bottom) from two different initial states: one in which a solution requires 18 steps and one in which a solution requires 30 steps (this is the case of initial board HGFEDCBA). Results are given in Table 3. Note that PGraphplan is the fastest of all planners tested (even the deterministic ones) on this problem.

Discussion and Conclusions
PGraphplan performs a forward search to find an optimal contingent plan, using information stored in the graph to collapse and prune away unnecessary states. On problems such as blocks-world, flat-tire, and 8-puzzle, this pruning provides a substantial speedup. The value information seems to provide the greatest savings in general, with the information on the eventual purpose, if any, of each node in the graph providing a smaller but still significant help. PGraphplan is still in general slower than Graphplan, in part, we believe, because we have not yet found the best way to propagate information backwards through the planning graph. One possible avenue would be to better integrate the two kinds of information discussed above, for instance by separating the hvalues into different "flavors" depending on which goal they are intended for. It would also be helpful to propagate more probabilistic information, such as an upper bound on the probability of reaching the goals, which could then be used to prune search. Finally, it would be interesting to explore whether knowledge of TGraphplan's more quickly constructed policy could guide PGraphplan's search.

Table 2. Results for the probabilistic blocks problem described in the text. The initial state is a tower (ABCD...) and the goal is the same tower except that the block that used to be on top is now on the bottom (BCD...A). SPI finds a policy that applies from all possible starting states, so in a sense it is unfairly penalized in this experiment (to partially alleviate this, for SPI we discarded from the domain description actions not used in any reasonable trajectory from our initial state).

Table 3. Results on the flat-tire domain, and the easy and hard 8-puzzle problems. Blackbox was run in its default mode, with -solver graphplan (BlackboxGP), and with -solver walksat (BlackboxWS).
The planners implemented apply to problems for which the objective is to maximize the probability of satisfying the problem goals within the given time window, or the related goal of minimizing expected completion time. More general MDP problems are often given by specifying rewards for performing certain (noop or non-noop) actions. It appears that some of the kinds of information propagated here should be useful in those more general settings, and it would be interesting to see if this generalization could be made without a sacrifice in performance.

Fig. 2. RHS of graph for simple blocks-world domain, ending at t = 4. Some noops are not shown. Numbers indicate hvalues. Notice that the planner will return from any state at time t = 2 that doesn't have clearB (it will never reach the impossible state in which it is holding A and B).

Fig. 3. Planning graph for the simple domain of Section 3.3. Dotted lines represent deletes, and irrelevant nodes have been removed.

Table 1. PGraphplan (PGP), TGraphplan (TGP), top-down DP (DP), SPI, and Buridan (Bur) on the moats and castles problem with varying numbers of castles. SPI is run with discount factor 0.9. The other planners are given a time horizon of 5 or 10 (for Buridan, this is done by providing a desired success probability); values for horizon 10 are given in square brackets. Running times are given on a PII-450 Xeon with 512 MBytes of memory.