Decomposition methods for solving Markov decision processes with multiple models of the parameters

Abstract We consider the problem of decision-making in Markov decision processes (MDPs) when the reward or transition probability parameters are not known with certainty. We study an approach in which the decision maker considers multiple models of the parameters for an MDP and wishes to find a policy that optimizes an objective function that considers the performance with respect to each model, such as maximizing the expected performance or maximizing worst-case performance. Existing solution methods rely on mixed-integer program (MIP) formulations, but have previously been limited to small instances, due to the computational complexity. In this article, we present branch-and-cut and policy-based branch-and-bound (PB-B&B) solution methods that leverage the decomposable structure of the problem and allow for the solution of MDPs that consider many models of the parameters. Numerical experiments show that a customized implementation of PB-B&B significantly outperforms the MIP-based solution methods and that the variance among model parameters can be an important factor in the value of solving these problems.


Introduction
Markov decision processes (MDPs) are mathematical optimization models that have found success in informing optimal decision-making under uncertainty when decisions are made sequentially over time. However, the optimal decisions can be sensitive to the MDP's parameters. Unfortunately, parameter ambiguity is common in MDPs because parameters are usually estimated from data and may rely on expert opinion. The ambiguity in the modeling process creates a dilemma for the decision maker (DM) as to the best choice of parameters to use when solving the MDP, and in turn the DM may be left unsure as to the best course of action.
Recently, there has been a line of research on mitigating the impact of parameter ambiguity by incorporating multiple models of the parameters into the solution of the MDP. For instance, Adulyasak et al. (2015), Buchholz and Scheftelowitsch (2018), and Steimle et al. (2019) proposed designing policies that maximize a weighted value across multiple models. Ahmed et al. (2017) proposed minimizing the maximum regret with respect to each model for finite-horizon MDPs. Adulyasak et al. (2015) and Meraklı and Küçükyavuz (2019) proposed a percentile optimization approach for finite-horizon and infinite-horizon MDPs, respectively. Thus far, the exact solution methods for this problem have relied on mixed-integer programming (MIP) formulations in which binary decision variables encode the policy and continuous variables encode the value functions for each model of the MDP. Although the MIP can be used to find exact solutions, these problems are NP-hard (Steimle et al. 2019, Buchholz and Scheftelowitsch 2018, Delage and Mannor 2009), and MIP-based solution methods have thus far been limited to small MDPs.
In this article, we present new methods for solving MDPs with multiple models of the parameters that leverage the special structure of the problem. For the weighted value case, we present a branch-and-cut (B&C) method, which follows from decomposition algorithms in the stochastic programming literature, and a custom branch-and-bound (B&B) method that exploits the decomposable nature of the problem. We describe how the custom B&B procedure is easily modified to extend to other ambiguity-sensitive objective functions that incorporate multiple models of the parameters. Then, we compare the extensive form of the MIP, the customized B&C scheme, and the custom B&B approach using a numerical study of an MDP for medical decision making. Our study demonstrates that, for the weighted value problem, the custom B&B approach outperforms both solving the MIP directly in its extensive form and the B&C procedure in terms of computation time. We use this new ability to solve larger problem instances exactly to investigate the impact of parameter ambiguity on the performance of an MDP; moreover, we examine how the choice of objective function influences the overall performance of a policy with respect to each model.
The remainder of this article is structured as follows: In Section 2, we provide background on parameter ambiguity in MDPs and related solution methods for solving MDPs with multiple models of parameters. In Section 3, we introduce the multi-model Markov decision process (MMDP), state the Weighted Value Problem (WVP), and present other multi-model objectives for the MMDP. In Section 4, we describe two methods, a B&C procedure and a custom B&B procedure, that leverage problem structure to solve MMDPs. We compare the solution methods in Section 5 using a set of test instances based on a medical decision-making problem. Finally, we conclude with a summary and discussion of the most important contributions of this article in Section 6.

Background
The MDP is a generalization of a Markov chain in which a DM can take actions to influence the stochastic evolution of the system (Puterman 1994, Boucherie and Van Dijk 2017). In this article, we consider a finite state space S ≡ {1, 2, ..., |S|} and a finite action space A ≡ {1, 2, ..., |A|}. The system begins in state s_1 ∈ S according to the initial distribution µ_1 ∈ ∆(S), where ∆(·) denotes the set of probability measures on the discrete set. At each decision epoch t ∈ T ≡ {1, ..., T}, the DM observes the state s_t ∈ S, selects an action a_t ∈ A, and receives a reward r_t(s_t, a_t) ∈ R. Then, the system transitions to a new state s_{t+1} ∈ S with probability p_t(s_{t+1} | s_t, a_t) ∈ [0, 1]. The DM selects the last action at time T, which may influence which state is observed at time T + 1 through the transition probabilities. Upon reaching s_{T+1} ∈ S at time T + 1, the DM receives a terminal reward r_{T+1}(s_{T+1}) ∈ R. Future rewards are discounted at a rate α ∈ (0, 1], which accounts for the preference for rewards received now over rewards received in the future. In this article, we assume without loss of generality that the discount factor is already incorporated into the reward definition.
We will refer to the times at which the DM selects an action as the set of decision epochs, T, the set of rewards as R ∈ R^{|S×A×T|}, and the set of transition probabilities as P ∈ [0, 1]^{|S×A×S×T|}. Throughout the remainder of this article, we will use the tuple (T, S, A, R, P, µ) to summarize the parameters of an MDP.
The strategy by which the DM selects the action for each state at decision epoch t ∈ T is called a decision rule, π_t ∈ Π_t, and the set of decision rules over the planning horizon is called a policy, π ∈ Π. The goal of the DM is to specify a policy that maximizes the expected rewards over the planning horizon:

$$\max_{\pi \in \Pi} \; \mathbb{E}^{\pi}\left[ \sum_{t=1}^{T} r_t(s_t, a_t) + r_{T+1}(s_{T+1}) \right]. \tag{1}$$

It has been shown that a Markov deterministic policy achieves the optimum in (1) (Puterman 1994, Theorem 4.4.2), which means that the optimal action depends only on the current state of the system and the decision epoch. In this article, we restrict our attention to decision rules of the form π_t = (π_t(1), ..., π_t(|S|)), where π_t : S → A, because although history-dependent policies may be optimal in general, they are difficult to implement in practice (Steimle et al. 2019, Meraklı and Küçükyavuz 2019).
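To ground the notation, the following is a minimal sketch of solving a single finite-horizon MDP by backward induction; the array layout (epoch-indexed numpy arrays) is our own choice and not something specified in the text.

```python
import numpy as np

def backward_induction(P, r, r_terminal):
    """Solve a finite-horizon MDP by backward induction.

    P:          array of shape (T, S, A, S); P[t, s, a, s2] = p_t(s2 | s, a)
    r:          array of shape (T, S, A);    r[t, s, a]     = r_t(s, a)
    r_terminal: array of shape (S,);         terminal rewards r_{T+1}(s)

    Returns value functions v (shape (T+1, S)) and a Markov deterministic
    policy pi (shape (T, S)), where pi[t, s] is the optimal action index.
    """
    T, S, A = r.shape
    v = np.zeros((T + 1, S))
    pi = np.zeros((T, S), dtype=int)
    v[T] = r_terminal
    for t in reversed(range(T)):
        # Q[s, a] = immediate reward plus expected value-to-go.
        Q = r[t] + P[t] @ v[t + 1]
        pi[t] = Q.argmax(axis=1)
        v[t] = Q.max(axis=1)
    return v, pi
```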
The standard optimization methods for MDPs assume that the transition probability parameters are known precisely. However, in practice, the transition probability parameters are often subject to errors due to finite sample sizes or epistemic limitations. Unfortunately, policies obtained by optimizing with respect to erroneous input parameters can underperform in practice (Mannor et al. 2007). To address this issue, new models of MDPs that account for parameter ambiguity have been proposed. Some authors have modeled ambiguity in the input parameters using an ambiguity set and seek the policy that maximizes the worst-case performance as the parameters vary within this set (Nilim and El Ghaoui 2005, Iyengar 2005). The structure required for the ambiguity set to be tractable gives rise to max-min policies that tend to be very conservative (Wiesemann et al. 2013), so others have taken steps either to find tractable ambiguity sets that are less conservative (Mannor et al. 2016, Goyal and Grand-Clement 2018) or to assume distributional information about the model parameters (Delage and Mannor 2009, Xu et al. 2012).
The MMDP is another approach for addressing ambiguity in MDPs, in which the DM considers multiple models of the MDP parameters when determining the optimal policy. Steimle et al. (2019), Adulyasak et al. (2015), and Buchholz and Scheftelowitsch (2018) proposed the Weighted Value Problem (WVP) for MMDPs, in which the DM seeks to find a policy that maximizes the weighted performance with respect to each model of the MDP. Adulyasak et al. (2015) and Meraklı and Küçükyavuz (2019) consider an ambiguity-averse DM through the use of a percentile optimization formulation of MMDPs. Ahmed et al. (2017) consider a min-max-regret approach to considering multiple models of the parameters, wherein the DM wishes to minimize the worst-case gap between the best possible performance in a given model and the performance achieved by the specified policy. The exact solution methods proposed in the articles above all rely on MIP-based approaches, while the custom B&B approach we present later does not rely on a MIP formulation of the MMDP.
The WVP for an MMDP is an NP-hard problem and has close ties to two-stage stochastic integer programming. In a two-stage stochastic program, ambiguous problem parameters are treated as random variables. The DM must make some first-stage decisions before these random variables are realized. Upon the realization of the problem parameters, the DM may take recourse actions in the second stage to adapt to the information gained. In many cases, the random variables representing the problem data have finite support, and the outcomes of these random variables are referred to as scenarios. We refer the reader to Birge and Louveaux (1997) and Shapiro et al. (2009) for more information on stochastic programming.
Through the lens of stochastic programming, the MMDP can be viewed as a two-stage stochastic integer program in which the DM specifies a policy that is encoded through first-stage binary variables indicating whether to take a given action for each state-time pair.
Then, the DM observes which model of the MDP is realized; this determines the corresponding value functions for each model, which are represented using continuous variables.
This two-stage stochastic integer program has a fixed technology matrix and a random recourse matrix with a finite number of scenarios. The problem can be written in its extensive form, which contains the first-stage decision variables as well as second-stage decision variables for each scenario. However, the extensive form of the problem can become quite large and potentially inefficient to solve as the number of models grows. Fortunately, the constraint matrix in the extensive form is block-separable except for the columns corresponding to the first-stage decision variables. Due to this structure, two-stage stochastic programs lend themselves well to divide-and-conquer algorithms, such as Benders decomposition (also known as the L-shaped method) (Benders 1962, Van Slyke and Wets 1969).
The stochastic programming formulation of the MMDP has some features that require special algorithmic consideration. First, the binary first-stage decision variables require integer programming methods, such as branch-and-bound. Early work to address this problem includes that of Wollmer (1980), which proposed a cutting plane decomposition method, and Laporte and Louveaux (1993), which proposed a B&C scheme wherein the feasibility and optimality cuts are added within a B&B framework for solving the master problem. Second, the MMDP has relatively complete recourse, and therefore feasibility cuts are not required. Third, logical constraints are required to enforce that the value functions in each of the models of the MMDP correctly correspond to the policy encoded by the binary variables. The logical constraints are enforced through the introduction of the notorious "big-Ms," which weaken the linear programming relaxation of the extensive form of the MIP and cause problems for potential decomposition methods. Logic-based cuts have been proposed to strengthen formulations and avoid the explicit use of big-M values (Hooker and Ottosson 2003, Codato and Fischetti 2006, Ntaimo 2010, Luedtke 2014).
In this article, we propose a Benders-based B&C method to solve the MMDP, and we further propose a custom B&B method that does not explicitly consider a MIP formulation, eliminates the need for the big-Ms in the MMDP, and further exploits the unique structural properties of the MDP subproblems.

Problem statement
In this section, we introduce the MMDP and several objective functions that consider the performance of a policy with respect to the finite set of models of the MDP.

The Multi-model Markov decision process
The MMDP is a tuple (T, S, A, M, Λ), where T is the set of decision epochs, S and A are the state and action spaces, respectively, M is the finite discrete set of models, and Λ := {λ_1, ..., λ_{|M|}} is the set of exogenous model weights with λ_m ∈ (0, 1) for all m ∈ M and Σ_{m∈M} λ_m = 1. Each model m ∈ M is an MDP, (T, S, A, R^m, P^m, µ^m), with a unique combination of rewards, transition probabilities, and initial distribution. The requirement that λ_m ∈ (0, 1) avoids two trivial cases: if there exists a model m ∈ M such that λ_m = 1, the MMDP reduces to a standard MDP, and if there exists a model m ∈ M such that λ_m = 0, the MMDP reduces to an MMDP with the smaller set of models M \ {m}. In this article, we consider a DM who wishes to design a policy π that performs well with respect to the different models in the MMDP. In the following sections, we describe several proposed objective functions for such DMs, beginning with the WVP.
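Viewed as data, an MMDP is simply |M| MDPs sharing a state space, action space, and horizon, together with the weight vector Λ. A minimal sketch of such a container (the field names are ours):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDPModel:
    P: np.ndarray       # (T, S, A, S) transition probabilities p_t^m(s'|s,a)
    r: np.ndarray       # (T, S, A)    rewards r_t^m(s,a), discounting folded in
    r_term: np.ndarray  # (S,)         terminal rewards r_{T+1}^m(s)
    mu: np.ndarray      # (S,)         initial distribution mu^m

@dataclass
class MMDP:
    models: list          # one MDPModel per m in M
    weights: np.ndarray   # (|M|,) weights lambda_m in (0,1), summing to 1
```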

Weighted value problem
In this section, we introduce the WVP and present a MIP formulation for solving it. In the WVP, the DM considers the expected rewards of the specified policy in the multiple models of the MDP.
The value of a policy π ∈ Π in model m ∈ M is given by its expected rewards evaluated with model m's parameters:

$$v^m(\pi) := \mathbb{E}^{\pi, m}\left[ \sum_{t=1}^{T} r^m_t(s_t, a_t) + r^m_{T+1}(s_{T+1}) \right]. \tag{2}$$

We associate any policy π ∈ Π for the MMDP with its weighted value:

$$W(\pi) := \sum_{m \in \mathcal{M}} \lambda_m v^m(\pi). \tag{3}$$

Thus, we consider the WVP, in which the goal of the DM is to find the policy π ∈ Π that maximizes the weighted value defined in (3). Given an MMDP (T, S, A, M, Λ), the WVP is defined as the problem of finding

$$W^* := \max_{\pi \in \Pi} W(\pi) \tag{4}$$

and a set of policies Π* := {π ∈ Π : W(π) = W*} that achieve the maximum in (4).
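Computing W(π) for a fixed Markov deterministic policy only requires evaluating π in each model, one backward pass per model; a sketch building on the MDPModel/MMDP containers above:

```python
import numpy as np

def evaluate_policy(model, pi):
    """Value v^m(pi) of a deterministic policy pi (shape (T, S)) in one model."""
    T, S, _ = model.r.shape
    states = np.arange(S)
    v = model.r_term.copy()
    for t in reversed(range(T)):
        a = pi[t]  # action chosen in each state at epoch t
        v = model.r[t, states, a] + (model.P[t, states, a] * v).sum(axis=1)
    return float(model.mu @ v)

def weighted_value(mmdp, pi):
    """W(pi) = sum_m lambda_m v^m(pi), as in (3)."""
    return float(sum(w * evaluate_policy(m, pi)
                     for w, m in zip(mmdp.weights, mmdp.models)))
```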
Steimle et al. (2019) showed that the WVP for the Markov deterministic policy class, Π^MD, is a hard problem:

Proposition 1 (Steimle et al. (2019)). Solving the weighted value problem with Π = Π^MD is NP-hard.

Steimle et al. (2019) propose the MIP formulation in (5) as an exact solution method for the MMDP. The MIP formulation extends the standard linear programming formulation of an MDP (Puterman 1994, §6.9) to include continuous variables representing the value function for each model of the MMDP and modifies the epigraph constraints to enforce that each model is evaluated according to the same policy. We define model-specific value function continuous variables such that v^m_t(s) ∈ R represents the value-to-go from state s at decision epoch t in model m, and we define the policy decision variables π_t(a|s), which take the value 1 if the policy states to take action a in state s at decision epoch t, and 0 otherwise:

$$\begin{aligned}
\max_{\pi, v} \;\; & \sum_{m \in \mathcal{M}} \lambda_m \sum_{s \in S} \mu^m_1(s)\, v^m_1(s) && \text{(5a)}\\
\text{s.t.} \;\; & \sum_{a \in A} \pi_t(a|s) = 1, \quad \forall s \in S,\, t \in \mathcal{T}, && \text{(5b)}\\
& v^m_t(s) \le r^m_t(s,a) + \sum_{s' \in S} p^m_t(s'|s,a)\, v^m_{t+1}(s') + M\,(1 - \pi_t(a|s)), \quad \forall s, a, t, m, && \text{(5c)}\\
& v^m_{T+1}(s) = r^m_{T+1}(s), \quad \forall s \in S,\, m \in \mathcal{M}, && \text{(5d)}\\
& \pi_t(a|s) \in \{0, 1\}, \quad \forall s \in S,\, a \in A,\, t \in \mathcal{T}. && \text{(5e)}
\end{aligned}$$

Throughout this article, we will refer to (5) as the extensive form of the MMDP.
Constraints (5b) and (5e) are used to encode a valid Markov deterministic policy. Constraint (5c) is a logic-based constraint which uses a "big-M" to enforce the relationship between the value functions in each model and the policy, so long as M ∈ R is selected sufficiently large.
Together, constraints (5c) and (5d) ensure that the value functions in each model correspond to the policy encoded via the binary π variables. As mentioned previously, the MMDP can be viewed as a two-stage stochastic integer program with binary first-stage decision variables, continuous second-stage decision variables, and relatively complete random recourse. Given a fixed policy, the MMDP reduces to the evaluation of |M| Markov chains; policy optimization for an MMDP is challenging because of the coupling constraints that enforce that the same policy is used in each model. The view of the WVP as a two-stage stochastic program also lends itself to measures of the impact of uncertainty in the MDP parameters. The value of the stochastic solution (VSS) is the value added by solving the WVP in (4) rather than a simpler, deterministic version of the MMDP, called the mean value problem (MVP), wherein the DM solves a single MDP whose parameters are the weighted averages of the parameters across the models of the MMDP. Another measure of uncertainty is the expected value of perfect information (EVPI), which is the expected amount that the DM would pay to know with which model of the MMDP she is interacting. These metrics allow for a better understanding of the impact of uncertainty on the performance of the DM's decisions and how valuable it would be to obtain better information about the MDP parameters.
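Under the standard stochastic-programming definitions of these quantities (Birge and Louveaux 1997), and using the notation above, one way to write the two measures is

$$\text{VSS} = W^* - \sum_{m \in \mathcal{M}} \lambda_m v^m(\pi_{\text{MVP}}), \qquad \text{EVPI} = \sum_{m \in \mathcal{M}} \lambda_m \max_{\pi \in \Pi} v^m(\pi) - W^*,$$

where π_MVP denotes an optimal policy for the mean value problem.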

Other multi-model formulations
The WVP is just one formulation that considers multiple models of the MDP's parameters. The WVP may be appropriate for a DM who is risk-neutral with respect to model ambiguity. However, there is considerable evidence that some DMs may be averse to ambiguity, which can affect decision-making (Ellsberg 1961). Therefore, it could be desirable to find a policy that offers more protection against the possible outcomes of the ambiguity and helps guide DMs who exhibit some degree of ambiguity aversion. To do so, we present other multi-model formulations of the MMDP which reflect preferences beyond neutrality towards parameter ambiguity. Some alternative approaches include maximizing the worst-case value (max-min), minimizing the maximum regret (min-max-regret), and percentile optimization (PercOpt). In the max-min approach, the DM seeks a policy that maximizes the expected rewards under the worst-case realization of the ambiguity, i.e.,

$$\max_{\pi \in \Pi} \min_{m \in \mathcal{M}} v^m(\pi). \tag{6}$$

The min-max-regret criterion considers the performance relative to the best possible performance in each model. The regret of policy π in model m, ρ(π, m), is the difference between the optimal value in model m and the value achieved by policy π in model m:

$$\rho(\pi, m) := \max_{\pi' \in \Pi} v^m(\pi') - v^m(\pi). \tag{7}$$

The min-max-regret criterion then seeks the policy π ∈ Π that minimizes the maximum regret:

$$\min_{\pi \in \Pi} \max_{m \in \mathcal{M}} \rho(\pi, m). \tag{8}$$

In the PercOpt approach, the DM selects a level of confidence ε ∈ [0, 1] and wants to maximize the ε-percentile of the value in the models, z:

$$\max_{\pi \in \Pi,\, z \in \mathbb{R}} \; z \quad \text{s.t.} \quad \sum_{m \in \mathcal{M}} \lambda_m \, \mathbb{1}\{ v^m(\pi) \ge z \} \ge 1 - \varepsilon. \tag{9}$$

Each of the multi-model objectives listed above can be formulated as a MIP, as shown in Appendix A.
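For a fixed policy, all four criteria are simple functions of the per-model values v^m(π); the sketch below (the helper name and the PercOpt tie-breaking rule are ours) makes the distinctions concrete.

```python
import numpy as np

def multi_model_objectives(lams, values, opt_values, eps=0.2):
    """Evaluate the multi-model criteria for one fixed policy.

    lams:       (|M|,) model weights lambda_m
    values:     (|M|,) per-model values v^m(pi) of the policy
    opt_values: (|M|,) per-model optimal values max_pi v^m(pi)
    eps:        confidence level for PercOpt
    """
    weighted = float(lams @ values)                   # WVP objective (3)
    worst_case = float(values.min())                  # max-min objective (6)
    max_regret = float((opt_values - values).max())   # min-max-regret (7)-(8)
    # PercOpt: largest z with total weight of models valued below z <= eps.
    order = np.argsort(values)
    cum = np.cumsum(lams[order])
    k = int(np.searchsorted(cum, eps, side='right'))  # models allowed below z
    percentile = float(values[order][min(k, len(values) - 1)])
    return weighted, worst_case, max_regret, percentile
```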

Methods that leverage problem structure
In this section, we describe two methods that leverage the special structure of the MMDP for the WVP. The first is a B&C approach applied to the proposed MIP formulation. The second is a custom B&B approach that treats the problem as |M| independent MDPs with coupling requirements on the policy. Later, in Section 5, we compare these two methods to a generic branch-and-bound implementation using a commercial MIP solver on test cases based on an example problem in the context of medical decision making. Although we develop these approaches in the context of the WVP, the custom B&B easily extends to the other multi-model formulations presented in Section 3.3.

Branch-and-cut for the MMDP
In this section, we present a B&C scheme for solving the MMDP which is in the same vein as Benders decomposition for stochastic programming. Benders decomposition breaks the extensive form of a stochastic program into a master problem and subproblems. The master problem typically considers only the "complicating variables," while the subproblems consider the other variables for fixed values of the complicating variables. In the context of stochastic programming, one typically solves a "relaxed master problem" that involves only the first-stage decisions and a subset of the constraints required to specify the complete optimization problem. Then, one uses duality for the subproblems to add constraints, or cuts, to the relaxed master problem to enforce the feasibility and optimality of the proposed first-stage solutions from the relaxed master problem.
Our B&C approach uses a master problem that includes the binary policy variables and enforces constraints (5c) and (5d) via cutting planes that are generated from each model's subproblem.We begin by describing the decomposition of (5) into the master problem and subproblems, and then we describe a B&C algorithm that uses this decomposition approach.
First, we describe the decomposition of (5) into a master problem and model-specific subproblems. The value of a fixed policy, π, in a model m can be determined by solving a linear program, which we refer to as Subproblem_m(π):

$$\begin{aligned}
\max_{v^m} \;\; & \sum_{s \in S} \mu^m_1(s)\, v^m_1(s) && \text{(10a)}\\
\text{s.t.} \;\; & v^m_t(s) \le r^m_t(s,a) + \sum_{s' \in S} p^m_t(s'|s,a)\, v^m_{t+1}(s') + M\,(1 - \pi_t(a|s)), \quad \forall s, a, t, && \text{(10b)}\\
& v^m_{T+1}(s) = r^m_{T+1}(s), \quad \forall s \in S. && \text{(10c)}
\end{aligned}$$

For constraints of the form (10b), we assign dual variables x^m_t(s, a) ≥ 0, and for constraints of the form (10c) we assign dual variables x^m_{T+1}(s) ∈ R. Then, the dual of Subproblem_m(π) is:

$$\begin{aligned}
\min_{x^m} \;\; & \sum_{t \in \mathcal{T}} \sum_{s \in S} \sum_{a \in A} \big[ r^m_t(s,a) + M(1 - \pi_t(a|s)) \big]\, x^m_t(s,a) + \sum_{s \in S} r^m_{T+1}(s)\, x^m_{T+1}(s) && \text{(11a)}\\
\text{s.t.} \;\; & \sum_{a \in A} x^m_1(s,a) = \mu^m_1(s), \quad \forall s \in S, && \text{(11b)}\\
& \sum_{a \in A} x^m_t(s,a) = \sum_{s' \in S} \sum_{a' \in A} p^m_{t-1}(s \,|\, s', a')\, x^m_{t-1}(s', a'), \quad \forall s \in S,\; t = 2, \dots, T, && \text{(11c)}\\
& x^m_{T+1}(s) = \sum_{s' \in S} \sum_{a' \in A} p^m_T(s \,|\, s', a')\, x^m_T(s', a'), \quad \forall s \in S, && \text{(11d)}\\
& x^m_t(s,a) \ge 0, \quad \forall s, a, t. && \text{(11e)}
\end{aligned}$$

Given the policy π, (11) is easy to solve because the primal constraint corresponding to the tuple (s, a, t) is binding if and only if π_t(a|s) = 1, so long as M is selected sufficiently large to enforce the logical relationship between the value functions and the policy. Therefore, given a policy π, we have already identified an optimal basis for the subproblems. Because the primal constraint corresponding to (s, a, t) is non-binding if π_t(a|s) = 0, it follows from complementary slackness that x^m_t(s,a) = 0 whenever π_t(a|s) = 0.

Proposition 2. The following forward substitution gives an optimal solution to (11):

$$x^m_1(s, a) = \mu^m_1(s)\, \pi_1(a|s), \qquad \forall s \in S,\, a \in A, \tag{12}$$
$$x^m_t(s, a) = \pi_t(a|s) \sum_{s' \in S} \sum_{a' \in A} p^m_{t-1}(s \,|\, s', a')\, x^m_{t-1}(s', a'), \qquad \forall s, a,\; t = 2, \dots, T, \tag{13}$$
$$x^m_{T+1}(s) = \sum_{s' \in S} \sum_{a' \in A} p^m_T(s \,|\, s', a')\, x^m_T(s', a'), \qquad \forall s \in S. \tag{14}$$

Proof. First, we show that x as defined in (12)-(14) is feasible for (11). The solution satisfies (11b) because Σ_{a∈A} x^m_1(s,a) = Σ_{a∈A} µ^m_1(s) π_1(a|s) = µ^m_1(s). The solution satisfies (11c) because Σ_{a∈A} x^m_t(s,a) = Σ_{a∈A} π_t(a|s) Σ_{s',a'} p^m_{t−1}(s|s',a') x^m_{t−1}(s',a') = Σ_{s',a'} p^m_{t−1}(s|s',a') x^m_{t−1}(s',a'), and (11d) is satisfied by definition. The non-negativity constraints are also satisfied.

Next, we show that x as defined in (12)-(14) is optimal. Consider the feasible solution v to Subproblem_m(π) obtained by evaluating π in model m, i.e., v^m_{T+1}(s) = r^m_{T+1}(s) and v^m_t(s) = Σ_{a∈A} π_t(a|s) [ r^m_t(s,a) + Σ_{s'∈S} p^m_t(s'|s,a) v^m_{t+1}(s') ] for all s ∈ S, t ∈ T.

The solutions v and x are feasible for Subproblem_m(π) and its dual. For all (s, a, t) ∈ S × A × T such that π_t(a|s) = 0, we have x^m_t(s, a) = 0, and for all (s, a, t) such that π_t(a|s) = 1, the corresponding primal constraint (10b) holds with equality. Thus, by complementary slackness, v and x are optimal solutions for their respective problems.
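In code, the forward substitution of Proposition 2 propagates state-action visit frequencies forward through the horizon; the sketch below follows our reconstruction of (12)-(14) and reuses the MDPModel container from the earlier sketches.

```python
import numpy as np

def dual_solution(model, pi):
    """Forward substitution (12)-(14): occupancy measures x_t^m(s,a)
    for a deterministic policy pi (shape (T, S)) in one model."""
    T, S, A = model.r.shape
    x = np.zeros((T, S, A))
    states = np.arange(S)
    x[0, states, pi[0]] = model.mu            # (12): x_1(s, pi_1(s)) = mu(s)
    for t in range(1, T):
        # (13): probability mass reaching state s at epoch t is assigned
        # to the single action pi_t(s) prescribed by the policy.
        inflow = np.einsum('sa,sap->p', x[t - 1], model.P[t - 1])
        x[t, states, pi[t]] = inflow
    # (14): terminal state frequencies.
    x_term = np.einsum('sa,sap->p', x[T - 1], model.P[T - 1])
    return x, x_term

def subproblem_value(model, pi):
    """Objective value of the dual (11), which equals v^m(pi)."""
    x, x_term = dual_solution(model, pi)
    return float((x * model.r).sum() + x_term @ model.r_term)
```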
Proposition 2 provides a means to compute the solution to the subproblems without the need to solve a linear program. The importance of this from a computational perspective is formalized in the following proposition.

Proposition 3. (11) can be solved in O(T|S|²) time.

Proof. Computing the values in (12) can be completed in O(|S|) time. For a given state, (13) can be completed in O(|S|) time because exactly one action is taken in each state. Therefore, for a given decision epoch, the solution of (13) requires O(|S|²) time, and thus the total time required for the substitutions in (13) is O(T|S|²). Equation set (14) also requires O(|S|²) time.

For an optimal policy π* for a given subproblem, the dual solution has a corresponding objective value of

$$z^m = \sum_{t \in \mathcal{T}} \sum_{s \in S} \sum_{a \in A} x^m_t(s,a) \big[ r^m_t(s,a) + M(1 - \pi^*_t(a|s)) \big] + \sum_{s \in S} x^m_{T+1}(s)\, r^m_{T+1}(s). \tag{15}$$

Therefore, z^m is a tight upper bound on the value that can be obtained by using the fixed policy π* in model m. Further, the hyperplane in (15), viewed as a function of the policy variables, can be used to bound the value of other policies in model m.
To this point, we have described how the subproblem can be solved quickly for a fixed value of the policy π. We now describe the master problem (16), which is a MIP that uses binary variables π ∈ {0, 1}^{|S×A×T|} to encode the policy and continuous variables θ ∈ R^{|M|} to serve as surrogates for the value functions in each model. The master problem incorporates optimality cuts of the form (15) generated from the subproblems corresponding to previously investigated policies:

$$\begin{aligned}
\max_{\pi, \theta} \;\; & \sum_{m \in \mathcal{M}} \lambda_m \theta_m && \text{(16a)}\\
\text{s.t.} \;\; & \sum_{a \in A} \pi_t(a|s) = 1, \quad \forall s \in S,\, t \in \mathcal{T}, && \text{(16b)}\\
& \theta_m \le \sum_{t, s, a} x^{m,k}_t(s,a) \big[ r^m_t(s,a) + M(1 - \pi_t(a|s)) \big] + \sum_{s \in S} x^{m,k}_{T+1}(s)\, r^m_{T+1}(s), \quad \forall m \in \mathcal{M},\; k = 1, \dots, \nu, && \text{(16c)}\\
& \pi_t(a|s) \in \{0, 1\}, \quad \forall s, a, t, && \text{(16d)}
\end{aligned}$$

where x^{m,k} denotes the dual solution obtained from Subproblem_m(π^k) for the k-th previously investigated policy π^k. Algorithm 1 is a B&C algorithm for solving the MMDP that uses the decomposition described above. The algorithm largely follows the Integer L-Shaped Method (Laporte and Louveaux 1993) but leverages the special structure of the subproblems to quickly generate optimality cuts.

Algorithm 1 Branch-and-cut for the MMDP
1: Set k ← 0, ν ← 0. Let π̄ be any feasible policy and z̄ be the corresponding weighted value (the incumbent); initialize the pending-node list with the master problem (16) with no optimality cuts.
2: Select a pending node from the list; if none exists, stop.
3: Solve the current problem; let (π^ν, θ^ν) be its solution and z^ν its objective value.
4: if the current problem has no feasible solution then
5:   Fathom the current node; return to Step 2.
6: end if
7: if z^ν ≤ z̄ then  ▷ The policy described by π^ν can be pruned by bound
8:   Fathom the current node; return to Step 2.
9: end if
10: if π^ν is fractional then
11:   Create two new pendant nodes by branching on a fractional component of π^ν; return to Step 2.
12: end if
13: Solve Subproblem_m(π^ν) for each m ∈ M via Proposition 2 and set z_ν ← Σ_{m∈M} λ_m v^m(π^ν).
14: if z_ν > z̄ then  ▷ The policy described by π^ν is better than the incumbent
15:   Update the incumbent to (π̄, z̄) ← (π^ν, z_ν).
16: end if
17: if z^ν ≤ z_ν then  ▷ The policy described by π^ν is suboptimal
18:   Fathom the current node; return to Step 2.
19: else
20:   Add the optimality cuts (16c) generated from π^ν; set k ← k + 1, ν ← ν + 1; return to Step 3.
21: end if
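For illustration, the sketch below implements the essence of this loop with Gurobi lazy-constraint callbacks, mirroring the implementation details reported in Section 5. It reuses dual_solution from the previous sketch; the variable layout and the cut expression follow our reconstruction of (15)-(16), so treat it as illustrative rather than the authors' exact implementation.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def branch_and_cut(mmdp, M_big):
    """Benders-style B&C for the WVP; cuts are added lazily at integer solutions."""
    T, S, A = mmdp.models[0].r.shape
    mp = gp.Model('master')
    pi = mp.addVars(T, S, A, vtype=GRB.BINARY, name='pi')
    # theta_m is a surrogate for model m's value; the ub keeps the root bounded.
    theta = mp.addVars(len(mmdp.models), ub=M_big, name='theta')
    mp.addConstrs(pi.sum(t, s, '*') == 1 for t in range(T) for s in range(S))
    mp.setObjective(gp.quicksum(w * theta[m]
                                for m, w in enumerate(mmdp.weights)), GRB.MAXIMIZE)

    def callback(model, where):
        if where != GRB.Callback.MIPSOL:
            return
        vals = model.cbGetSolution(pi)
        policy = np.array([[max(range(A), key=lambda a: vals[t, s, a])
                            for s in range(S)] for t in range(T)])
        for m, mdl in enumerate(mmdp.models):
            x, x_term = dual_solution(mdl, policy)        # Proposition 2
            zm = float((x * mdl.r).sum() + x_term @ mdl.r_term)
            if model.cbGetSolution(theta[m]) > zm + 1e-6:
                # Optimality cut (16c): theta_m <= dual objective as a function of pi.
                cut = gp.quicksum(x[t, s, a] * (mdl.r[t, s, a]
                                                + M_big * (1 - pi[t, s, a]))
                                  for t in range(T) for s in range(S)
                                  for a in range(A) if x[t, s, a] > 0)
                model.cbLazy(theta[m] <= cut + float(x_term @ mdl.r_term))

    mp.Params.LazyConstraints = 1
    mp.optimize(callback)
    return mp
```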

Custom branch-and-bound for the MMDP
In this section, we present a customized B&B framework that can be used to solve MMDPs without relying on a MIP formulation. We begin by introducing the general framework of the MMDP B&B procedure, and then describe search strategies, including node selection, branching, and pruning, within this framework.

General MMDP branch-and-bound framework
The custom B&B procedure is grounded in the idea that, if we relax the requirement that each of the |M| MDPs must have the same policy, the problem becomes easy to solve as it involves solving |M| independent MDPs.
Thus, at the beginning of the custom B&B procedure, we completely relax, at the root node, the requirement that each of the |M| MDPs must have the same policy, and we solve each individual model via backward induction to obtain an optimal policy for each model (for each (s, t) ∈ S × T). Then, we
sequentially add restrictions that the policies must prescribe the same action for certain (s, t)-pairs in each of the |M| MDPs. We refer to these restrictions as partial policies because they specify the actions to be taken in some, but not necessarily all, of the (s, t)-pairs. Given a partial policy at a given node, one solves each MDP independently, taking the best action for each (s, t)-pair in that model unless that (s, t)-pair's action has already been fixed in the partial policy.
The custom B&B procedure begins with an empty partial policy, meaning that none of the (s, t)-pairs are fixed to a specific action, and thus each of the |M| MDPs can be solved independently.
For a given partial policy, π, the corresponding relaxation of the MMDP is solved. The solution of the relaxation is described in more detail in Section 4.2.4. Solving the relaxation provides an optimistic completion of the partial policy of a given node and an upper bound on the weighted value objective that could be achieved by completing this partial policy. The upper bound is compared to the value associated with the incumbent, the best solution seen so far. If the upper bound is worse than the lower bound associated with the incumbent, the node is pruned, meaning that none of its descendants in the tree will be examined. If the model-specific optimal completions of the partial policy are the same in each of the |M| MDPs, then the policy is an implementable policy for the MMDP and, if the corresponding weighted value is better than the incumbent, the incumbent solution is replaced with this policy and the node is pruned by optimality. If the partial policy is not pruned, the node is added to the list of pending nodes. At each iteration, a pending node (i.e., a partial policy) is selected for branching.

Proposition 4 (Worst-case running time of the custom B&B procedure). The worst-case running time of the B&B procedure is O(|M| T S² A^{ST+1}), where T = |T|, S = |S|, and A = |A|.

Proof. B&B algorithms have a worst-case running time of O(U b^d), where b is the branching factor, d is the depth of the tree, and U is an upper bound on the time required to solve a subproblem (Morrison et al. 2016). The branching factor of the B&B tree described above is A, and the depth of the tree is ST. The worst-case time to solve a subproblem is the time required to solve |M| MDPs, and the worst-case time to solve a single MDP is O(T S² A) (Puterman 1994, p. 93).
Interestingly, the worst-case running time grows linearly in the number of models considered and exponentially in the number of actions. This suggests that identifying actions that cannot be optimal via an action elimination procedure (Puterman 1994, §6.7.2), or exploiting the presence of special structure such as a monotone optimal policy (Puterman 1994, §6.11.2), could be used to reduce computation time.

Node selection strategy
Standard options for the search strategy include breadth-first search (BrFS), depth-first search (DFS), and best-first search (BFS). In DFS, the unexplored partial policies are explored in last-in-first-out fashion, placing a priority on finding implementable policies. In this search strategy, the list of partial policies is maintained using a stack wherein the algorithm explores the partial policy of the most recent child node before generating any of its siblings. One advantage of this approach is that there are often value function estimates corresponding to a partial policy that can be reused by the children of that partial policy. In other types of searches, the reusable value function estimates would need to be stored while the node awaits selection; in DFS, however, this information can be removed from memory as soon as the policy is optimally completed. There can be drawbacks to DFS. For example, DFS can lead to imbalanced search trees in which some partial policies remain pending at small depths in the tree while the algorithm focuses on completing other partial policies, which could lead to the B&B procedure spending a lot of time completing poor policies before finding an optimal policy.
Another search strategy is BrFS, which can be viewed as a first-in-first-out approach to exploring partial policies, meaning that all children of a node are examined before any grandchildren. In this case, the unexplored partial policies have roughly the same number of (s, t)-pairs fixed at each stage of the B&B procedure. However, there are usually substantial memory requirements associated with BrFS because most children nodes are not explored immediately after their parent nodes. Another search strategy is BFS, which considers the partial policies that appear most promising. In the MMDP, a best-bound approach explores a partial policy π next if its corresponding upper bound ẑ is higher than that of all other unexplored policies.
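In our reading, the best-first frontier is naturally a max-priority queue keyed on ẑ; a minimal sketch using Python's heapq (a min-heap, hence the negated key; the counter breaks ties so node objects are never compared directly):

```python
import heapq
import itertools

tie = itertools.count()   # tie-breaker: heapq never compares node objects
frontier = []             # pending partial policies, best bound first

def push(node, upper_bound):
    heapq.heappush(frontier, (-upper_bound, next(tie), node))

def pop_best():
    neg_bound, _, node = heapq.heappop(frontier)
    return node, -neg_bound
```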

Branching strategy
At each iteration in which at least one pending node exists, the B&B algorithm selects an (s, t)-pair to branch on to generate new subproblems. In our B&B procedure, we focus on wide branching strategies, which generate |A(s, t)| new partial policies upon branching on a pair (s, t), where each new subproblem corresponds to fixing π_t(s) to one of the possible actions in A(s, t), the action set specific to this (s, t)-pair. In this context, there are several strategies for branching on (s, t)-pairs.
One such branching strategy is horizon-based branching, in which branching decisions are made on the basis of the decision epoch. That is, there is some ordering of the (s, t)-pairs such that t < t′ implies that π_t(s) is fixed before π_{t′}(s′) for any states s and s′. One such approach is early-horizon branching, in which the decisions for epochs early in the planning horizon are fixed first. Early-horizon branching may be desirable because it allows the branch-and-bound procedure to reuse value function estimates from the wait-and-see problem, wherein each of the |M| MDPs is solved independently.
Another branching strategy is disagreement branching, which is in the vein of most-fractional branching in integer programming. The idea behind disagreement branching is to examine the models' optimal completions of the partial policy found in the relaxation and select the (s, t)-pair for which there is the largest disagreement among the models' optimal completion policies. For instance, one disagreement-based strategy is to select the (s, t)-pair for which the largest number of distinct actions is suggested by the different models.

Pruning strategy
A natural pruning procedure in this context is to eliminate partial policies based on their corresponding upper bounds. If the upper bound on the weighted value corresponding to the completion of partial policy π is lower than the weighted value associated with the incumbent solution, then the B&B procedure no longer needs to consider partial policy π, as there is no way to complete the policy such that it is better than the incumbent. We restrict our attention to this pruning strategy because the upper bound associated with a partial policy is easily obtained by solving the relaxation via Algorithm 2. When the partial policy is empty, this relaxation corresponds to solving the wait-and-see problem. For any other partial policy in which some (s, t)-pairs have been fixed, this procedure uses backward induction to find the optimal completion of the partial policy. In Appendix A, we discuss how the pruning procedure is easily modified to reflect other multi-model formulations of this problem.
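Algorithm 2 itself is not recoverable from the text, but the bounding step it performs can be sketched as one backward-induction pass per model in which the maximization over actions is skipped wherever the partial policy fixes an action; the encoding of partial policies as arrays padded with -1 is our own.

```python
import numpy as np

def relax_and_bound(mmdp, partial):
    """Optimistic bound for a partial policy.

    partial: (T, S) int array; partial[t, s] = fixed action index, or -1 if free.
    Returns the upper bound on the weighted value and each model's
    optimal completion of the partial policy.
    """
    T, S, A = mmdp.models[0].r.shape
    bound, completions = 0.0, []
    for w, mdl in zip(mmdp.weights, mmdp.models):
        v = mdl.r_term.copy()
        comp = np.empty((T, S), dtype=int)
        for t in reversed(range(T)):
            Q = mdl.r[t] + mdl.P[t] @ v            # (S, A) action values
            free = partial[t] < 0
            comp[t] = np.where(free, Q.argmax(axis=1), partial[t])
            v = Q[np.arange(S), comp[t]]
        bound += w * float(mdl.mu @ v)
        completions.append(comp)
    return bound, completions
```

If the returned model-specific completions all agree, the node yields an implementable policy whose weighted value can be compared against the incumbent, as described above.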

Computational study
In this section, we present computational experiments that demonstrate the solution methods we propose and provide a comparison of the various objective functions that consider ambiguity in MDP parameters.

Test instances
First, we present an MDP related to medical decision making for HIV, adapted from Chen et al. (2017), which has been used for illustrative purposes in the medical decision-making literature. Figure 1 illustrates the structure of the MDP used to determine the best HIV treatment to select for a patient depending on their health state. The states of the MDP represent the severity of HIV. Over the course of the planning horizon, the DM chooses among three treatments ranging from the least effective and least costly to the most effective and most costly. The MDP parameters are estimated from data and/or derived from expert opinion and thus are not known with certainty. In this case study, the DM seeks to maximize Net Monetary Value, which weighs the benefits in terms of quality-adjusted life years (a measure of the length and quality of a patient's life) against the costs of a particular treatment. We refer the reader to Chen et al. (2017) for more details.
We construct multiple MMDP test instances from this MDP by sampling from the Dirichlet and Lognormal distributions reported in Chen et al. (2017), which represent the transition probabilities among the health states and the relative risks of the various treatments, respectively. We construct sets of MMDP instances with 10, 30, and 100 models.
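As an illustration of this kind of instance generation, transition-probability models can be sampled row-by-row from Dirichlet distributions centered on a nominal model; the concentration parameter kappa and the nominal arrays below are placeholders of ours, not the distributions reported in Chen et al. (2017). The sketch reuses the MDPModel and MMDP containers from earlier; for brevity, it omits the Lognormal relative-risk sampling applied to the rewards.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_models(P0, r0, r_term0, mu0, n_models, kappa=50.0):
    """Sample MMDP models by perturbing nominal transition rows P0."""
    T, S, A, _ = P0.shape
    models = []
    for _ in range(n_models):
        P = np.empty_like(P0)
        for t in range(T):
            for s in range(S):
                for a in range(A):
                    # Dirichlet centered on the nominal row; kappa sets tightness.
                    P[t, s, a] = rng.dirichlet(kappa * P0[t, s, a] + 1e-6)
        models.append(MDPModel(P=P, r=r0, r_term=r_term0, mu=mu0))
    weights = np.full(n_models, 1.0 / n_models)
    return MMDP(models=models, weights=weights)
```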

Experiments
First, we consider the WVP. We solve the WVP for the medical decision making MMDP instances using the MIP extensive form, the B&C procedure, and the custom B&B algorithm presented in Section 4. We implemented the MIP extensive form and the B&C procedure presented in Algorithm 1 with the commercial solver Gurobi Optimizer Version 8.0.1 in C++ using Xcode. We implemented the extensive form using default settings in Gurobi and used special ordered set (SOS) Type 1 constraints for both the extensive form and B&C implementations. In the B&C procedure, the optimality cuts were added as lazy constraints within user callbacks whenever an integer feasible solution was found. We implemented the custom B&B procedure in C++ in Xcode, using a priority queue data structure to manage the search tree and a custom node class to manage the information stored at each node. We specified an optimality gap of 1% and a time limit of 300 seconds for each of the algorithms. For each solution method, we provided the solution from the approximation algorithm described in Steimle et al. (2019) as a warm start in Gurobi for the extensive form and B&C, and as the incumbent for the B&B procedure. We also solved each instance using the custom B&B procedure for the various multi-model objectives. We report the solution time required to solve the MMDP formulations and compare the resulting solutions.

Results
We now present the results of our experiments. First, we describe the performance of the solution methods on the medical decision making test instances. Then, we discuss the properties of the solutions obtained by solving these MMDP instances.

Comparison of the solution methods
Our initial experiments with the branching and node selection strategies for the custom B&B algorithm suggested that early-horizon branching and BFS node selection are the most effective search strategies. Early-horizon branching outperformed disagreement-based branching, in large part due to the ability to reuse value function information from parent nodes when early-horizon branching is used. We tested the node selection strategies using random MMDP test instances, solving each test instance with each of the three node selection strategies under early-horizon branching. Our initial experimentation showed that BFS requires the least computation time to solve these instances, followed by DFS, and finally BrFS. BFS also explored the fewest nodes of all node selection strategies, followed by DFS; BrFS explored the most nodes. We suspect that BFS is the best node selection strategy because the warm start provided by the heuristic in Steimle et al. (2019) is near-optimal in many cases, and thus the algorithm benefits more from finding better bounds than from finding new incumbent solutions.
Table 1 Computational performance for solving the medical decision making test cases using the WVP objective. We tested the performance of the extensive form, branch-and-cut, and branch-and-bound for various values of the number of models |M|. For each number of models, 30 test instances were generated and solved to within 1% of optimality. The warm start provided was the solution found using the approximation algorithm described in Steimle et al. (2019). A time limit of 300 seconds was enforced.

We now compare the MIP-based solution methods and the custom B&B approach using the search strategy described above. Table 1 reports the solution times and optimality gaps for the extensive form of the MIP, the B&C procedure, and the custom B&B algorithm. The custom B&B solves the MMDP instances significantly faster than the MIP-based approaches of solving the extensive form directly and using the B&C procedure: the custom B&B method solved every instance within 8.57 seconds, whereas the extensive form and the B&C procedure could each solve only 1 of the 90 instances within the 300-second time limit. The custom B&B outperforms the extensive form of the MIP and the B&C procedure for all MMDP sizes in terms of average solution time, worst-case solution time, average optimality gap, and worst-case optimality gap. The worst-case optimality gap across all instances for the custom B&B was less than 1%, compared to 7.73% and 4.58% for the extensive form and B&C procedures, respectively. Because the custom B&B procedure outperformed the MIP-based methods for the WVP, we focused on comparing the solution time required to solve the MMDP for the other multi-model objective functions. Table 2 shows how the solution time and optimality gap change for each objective function. The custom B&B performed best on the WVP, with each instance solved in under 10 seconds. The custom B&B algorithm was able to solve each instance under the PercOpt objective, but the solution time increased with the number of models. The max-min and min-max-regret objectives were not as easily solved using the custom B&B algorithm. The performance was excellent on most of these problems, but there was a small number of instances for which the optimality gap was significantly larger than the average, suggesting that these formulations may be somewhat harder to solve than the WVP. This finding motivates the development of good heuristics for the other multi-model formulations to generate near-optimal warm starts, which could decrease the computation time.

Properties of the solutions
We now discuss properties of the solutions of the medical decision making instances. First, we discuss the measures of ambiguity for these test instances for the WVP. Then, we present the policies resulting from the B&B algorithm for the various multi-model objectives. Finally, we show how these policies shape the distribution of the value functions across the multiple models of the parameters for the MMDP. Table 3 reports the relative VSS and expected value of perfect information (EVPI) for the different MMDP sizes. First, we note that the VSS tends to be lower than the EVPI. The maximum VSS across all instances was $2,745, while the EVPI tended to be higher, with a maximum of $9,130 for these MMDP instances, whose optimal weighted values range from $171,358 to $189,068. Low values of the VSS suggest that solving the MVP may be a suitable approximation to the MMDP for the WVP, and higher values of the EVPI suggest that the DM may wish to invest in research to determine better estimates of the MDP parameters.
We now present how the solution of the MMDP depends on the multi-model objective. Figure 2 shows the policies corresponding to the MVP, the WVP, the max-min problem, the min-max-regret problem, and the percentile optimization problem with ε = 0.2 for one instance of a 100-model MMDP. Every multi-model policy agrees that Drug A should be used in the severe health state.
The mean value problem (MVP), WVP, and PercOpt policies are quite similar for the mild and moderate states. For patients in the mild state, these policies recommend using Drug C at the beginning of the planning horizon, Drug B in the middle of the planning horizon, and Drug A at the end of the planning horizon. Figure 3(a) shows the distribution of the individual models' value functions corresponding to these policies and the "wait-and-see" solution. The "wait-and-see" solution is not an implementable policy, but represents the situation where the DM could wait and see the model of the MDP before selecting a policy. As expected, we observe that the different multi-model policies achieve their multi-model outcomes. For example, the value corresponding to the 20th percentile is largest for the PercOpt policy with ε = 0.2. Perhaps due to their similarity, the MVP, WVP, and min-max-regret policies achieve similar distributions. The PercOpt policy does much worse than the MVP and WVP policies in terms of the worst-case model value and worst-case model regret, while not gaining much in terms of the 20th-percentile performance. The max-min policy achieves its goal but sacrifices performance in many models relative to the other policies. Figure 3(b) shows the distribution of regret for these policies with respect to each model. As expected, the min-max-regret policy performs best in terms of worst-case regret. These findings demonstrate that the DM's attitude toward parameter ambiguity may be important when solving MMDPs.

Conclusion
In this article, we addressed the problem of solving MMDPs exactly. The MMDP has been proposed as a method for designing policies that account for ambiguity in the input data for MDPs, but the solution of MMDPs had been restricted to a small number of models. Finding an optimal Markov deterministic policy for an MMDP is an NP-hard problem, and the extensive form of the problem is a MIP that includes "big-M"s which weaken the linear programming relaxation.
We proposed two decomposition methods that leverage the problem structure to solve the MMDP. The first was a B&C algorithm in the vein of the Integer L-Shaped Method for two-stage stochastic integer programming with binary first-stage variables. The B&C algorithm decomposed the extensive form of the MIP into a master problem involving the binary variables used to encode the policy and |M| subproblems that evaluate a proposed policy in each model of the MDP. Unfortunately, the extensive form relies on the notorious "big-M"s to enforce logical constraints, and the big-Ms lead to weak optimality cuts within the B&C procedure. We also proposed a custom B&B procedure which does not require big-Ms in the formulation. The custom B&B procedure views the MMDP as |M| independent MDPs: it begins by allowing each model of the MDP to have its own policy and sequentially adds requirements that the decision rules for certain state-time pairs agree in each model.
We presented the first numerical study of exact algorithms for medical decision making MMDP instances. To do so, we generated random MMDP instances of an HIV treatment problem to compare the computation time required to solve these problems using the extensive form, the B&C procedure, and the custom B&B procedure. Our computational experiments showed that the custom B&B solution method greatly outperforms solving the MIP via the extensive form directly and via the B&C method. The custom B&B procedure was able to solve each of the 90 test instances to within 1% of optimality in under 9 seconds, while the extensive form and the B&C procedure were each able to solve only 1 of the 90 instances to the same tolerance within the time limit of 300 seconds. The worst-case optimality gap was 7.73% for the solution of the extensive form and 4.12% for the B&C method. The custom B&B procedure tended to take more time when there were more models in the MMDP, but still outperformed the other two solution methods.
Our solution methods enabled us to investigate the impact of ambiguity in the parameters of MDPs. Using the medical decision making case study, we considered the impact of the number of models used in the MMDP on the value of the MMDP approach, in terms of value relative to the mean value problem and expected regret relative to each model's optimal policy. We found that the MVP may be a suitable approximation to the WVP, especially when there is a large number of models, and that the EVPI tends to be higher than the VSS. We also used the medical decision making instances to compare alternative ways to incorporate multiple models into the solution of the MDP. We found that the DM's policy may depend on which multi-model objective function is used and that the objective function may shift the corresponding distribution of value functions across the models. Thus, the DM's attitude towards ambiguity may be important when solving MMDPs.
Our study has limitations. First, our B&C procedure does not include logic-based optimality cuts, which could potentially improve its performance. However, our initial experiments with such cuts showed that they do not provide significant enhancements to the B&C procedure. Second, we describe properties of the solution, such as the expected value of perfect information and the value of the MMDP, for a particular medical decision making problem; these values and solution properties may depend on this problem's particular transition probability and reward structure and may not be representative of all MDPs with ambiguity in the transition probabilities. Third, we do not specifically tailor our custom B&B procedure to the other multi-model objectives. Our results suggest that developing better heuristics and further refining the custom B&B for these other objectives could lead to computational gains. Finally, the MDP test instances used in our computational study are not large; although the custom B&B method we propose outperforms the current approach of using MIP formulations, future research should investigate solving even larger problem instances.
Our approach could be extended in several ways. First, we consider only finite-horizon MMDPs, but our decomposition algorithms might be extended to the infinite-horizon setting. For instance, the custom B&B method might be extended to infinite-horizon MMDPs by using linear programming, value iteration, or policy iteration to solve the relaxations at each node in the B&B tree.
It would be interesting to compare this approach to those proposed in Buchholz and Scheftelowitsch (2018). Second, we could further investigate other branching and node selection strategies to enhance the custom B&B procedure. Third, we do not assume any structure on the original MDP or its optimal policy; however, if we were to assume structure, such as monotonicity of the optimal policy, our algorithms might be modified to exploit it for computational gain. Finally, it would be interesting to compare these model-based approaches to more traditional robust MDP formulations in which an ambiguity set is constructed around the models. The work presented in this article provides a foundation for exploring these important future directions.
Appendix A

The formulation in (19) also introduces continuous variables v̂^m_t(s) representing the optimal value functions for each model. Notice that these continuous variables are independent of the policy variables π. The constraints in (19e) and (19f) ensure that the variables v̂^m take on the optimal value functions corresponding to model m ∈ M. We also introduce a continuous variable w to represent the worst-case regret among all of the models. Constraints (19g) serve as epigraph constraints ensuring that w takes on the value of the largest regret among all of the models. To minimize the value of w, the variables v^m will attempt to take on values as close as possible to their v̂^m counterparts, so long as the constraints in (19c) and (19d), which enforce that these value functions correspond to policy π, are satisfied. Rather than solving the MIP (19) directly, the B&B algorithm can be modified to apply to this problem. To do so, the B&B requires a modification to reflect that this is a minimization problem.
Further, one needs to modify the bounding approach to reflect the best possible worst-case model regret that comes from completing a partial policy. The bound is easily found after solving the relaxation, which is done in the same manner as before; however, at a given node, the lower bound needs to reflect the worst-case regret by determining which model's value in the relaxation attains the largest regret. For the percentile optimization problem, Step 15 in Algorithm 2 should be modified to reflect the appropriate upper bound. The upper bound z̄ is easily found by sorting the models in increasing order of the value functions v^m(π) to obtain an ordering m_(1), ..., m_(|M|). Then, one can select z̄ to be v^{m_(x)}(π), where x is the largest value such that Σ_{m=m_(1)}^{m_(x−1)} λ_m ≤ ε.


Figure 1 An illustration of the HIV treatment problem adapted from Chen et al. (2017). The states of the MDP represent the severity of HIV. The parameter β changes the transition probabilities to reflect the treatment option selected.

Figure 2 The policies corresponding to different multi-model objective functions for an instance of the HIV MMDP with 100 models. White indicates that Drug A is optimal, gray indicates that Drug B is optimal, and black indicates that Drug C is optimal for a given state and decision epoch.

Figure 3 The distribution of the models' value functions and model regret of the policies corresponding to different multi-model objective functions for an instance of the HIV model with 100 models.


Table 2 Computational performance for solving the medical decision making test cases using other multi-model objectives. We tested the performance of the custom branch-and-bound for various values of the number of models |M|. For each number of models, 30 test instances were generated and solved to within 1% of optimality. The warm start provided was the solution found using the approximation algorithm described in Steimle et al. (2019). A time limit of 300 seconds was enforced.

Table 3 Measures of ambiguity of the medical decision problem test instances.