ERROR BOUNDS FOR ROLLING HORIZON PROCEDURES IN MARKOV DECISION PROCESSES

Jeffrey M. Alden and Robert L. Smith
Department of Industrial & Operations Engineering
The University of Michigan
Ann Arbor, Michigan 48109

Technical Report No. 86-10
April 1986


Introduction

We derive bounds on the expected error of using a rolling horizon procedure (RHP) to generate a strategy for a non-homogeneous Markov Decision Process (MDP). We rely primarily on a certain measure of decoupling in the MDP to obtain our bounds; however, our bounds are sensitive to discounting as well. We define the expected error of a RHP to be the expected cost of the strategy generated by the RHP less the expected cost of an optimal strategy when these costs are finite due to discounting, and define the average expected error of a RHP as the average per period expected cost difference between the strategy generated by the RHP and an optimal strategy. Unless noted otherwise, all costs are assumed to be discounted to the beginning of the MDP.

We begin by defining both the MDP and the RHP, then continue with some assumptions and notation (Section 1). In Section 2, we derive an inequality which is fundamental in obtaining our bounds. We also give an upper bound on the expected error of a RHP when discounting is present, assuming the one period undiscounted expected costs are non-negative and uniformly bounded from above (Theorem 1). In Section 3, we define a coupling coefficient between periods and give an equivalent MDP formulation where this coefficient simply scales the discount factor. Finally, in Section 4, we combine Theorem 1 and the equivalent MDP formulation to give a final result bounding the expected error of the RHP for the original MDP (Theorem 2). We illustrate our result with a Dynamic Vehicle Assignment problem in Section 5.

1.0 The Markov Decision Process

A Markov Decision Process is a system which, during certain time intervals or periods (where the set of periods partitions time), can be described as being in exactly one of a finite set of states, which can be period dependent. Available at each state is a finite set of actions, from which exactly one action must be selected when the system enters the state; that is, a decision is made.
The set of actions is generally period and state dependent. The system moves, or makes a transition, from the current state to a state of the next period

according to probabilities that depend only on the decision made in the first state, the pair of states involved, and the current period. Whenever a decision is made, a cost is incurred that depends on the action selected, the current state, and the current period. When the cost of a decision also depends on the state entered in the next period, we can use the expected cost of the decision over all possible transitions from the current state as the cost of the decision and so remove the dependency of costs upon the next state entered. This modification produces a problem having identical expected costs for each strategy, and hence produces an equivalent problem when trying to select a strategy that minimizes expected costs. A vector giving a decision for each state of a period is called a policy, and a sequence of policies giving the policy for each period of the problem is called a strategy. The optimization problem is to select a strategy that minimizes the total expected costs. Our notation for the MDP is:

M = the number of periods in the MDP,
α = the per period discount factor (α ∈ [0,1]),
S = the set of states of a period (= {1, 2, ..., s}),
D_i^k = the set of actions available at state i ∈ S of period k,
D^k = the set of feasible policies for period k,
D = the set of feasible strategies,
x_i^k = a decision for state i ∈ S in period k (x_i^k ∈ D_i^k),
x^k = a policy for period k giving x_i^k for each i ∈ S (x^k ∈ D^k),
x = a strategy giving x^k for k = 1, 2, ..., M (x ∈ D),
P_ij^k(x) = the probability of transition to state j ∈ S of period k+1 given the decision x_i^k ∈ D_i^k is made while in state i ∈ S of period k,
P^k(x) = the transition matrix whose i,j element is P_ij^k(x),
c_i^k(x) = the expected cost of making decision x_i^k ∈ D_i^k in state i ∈ S of period k, and
c^k(x) = the vector whose ith element is c_i^k(x), i = 1, 2, ..., s.
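As a small illustration of this notation, the total discounted expected cost of one fixed strategy x, namely Σ_{k=1}^{M} α^{k-1} P^{1,k}(x) c^k(x), can be sketched in code. The two-period transition matrix and cost vectors below are invented for illustration:

```python
import numpy as np

def strategy_cost(P, c, alpha):
    """Vector of total discounted expected costs of one fixed strategy,
       one entry per possible initial state.
       P: transition matrices P^k(x) for k = 1..M-1;
       c: cost vectors c^k(x) for k = 1..M."""
    s = len(c[0])
    prod = np.eye(s)                       # P^{1,1} = I
    total = np.zeros(s)
    for k, ck in enumerate(c):
        total += alpha**k * prod @ ck      # alpha^(k-1) in 1-based indexing
        if k < len(P):
            prod = prod @ P[k]             # extend the product to P^{1,k+1}
    return total

# Two periods, two states (illustrative numbers):
P = [np.array([[0.7, 0.3], [0.4, 0.6]])]
c = [np.array([1.0, 2.0]), np.array([0.5, 1.5])]
print(strategy_cost(P, c, alpha=0.9))
```

Here `strategy_cost(P, c, 0.9)[i]` is the expected discounted cost when the process starts in state i.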

To simplify notation, we let the superscripts and subscripts on a variable define what part of a strategy the variable actually depends on. For example, P_ij^k(x) depends on the strategy x only through the decision x_i^k. We also simplify notation by assuming the number of states per period is constant; generalization to period dependent state sets is obvious. Finally, we assume that all costs are non-negative. Any MDP with costs that are uniformly bounded below by a constant can easily be redefined in terms of non-negative costs by subtracting the constant from the cost of each decision. It is also useful to define the following product of transition matrices for l = 1, 2, ..., M and m = l, l + 1, ..., M:

P^{l,m}(x) = Π_{k=l}^{m-1} P^k(x),

where P^{l,l} = I, and I is an identity matrix of dimension s. The optimization problem of the MDP can be simply stated as:

min_{x ∈ D} Σ_{k=1}^{M} α^{k-1} P^{1,k}(x) c^k(x).

Note that the kth term of the sum is just the expected cost of the policy x^k in period k. The optimization problem above involves the minimization of a vector and not of a scalar. This arises since we do not necessarily know the initial state of the process. Hence, we need to generate a policy for the first period to specify a decision for each state of the first period. Even if we knew the initial state of the process, we would still need to define what it means to minimize a vector in this context because, later on, we will consider the partial costs of a strategy beginning in a period k > 1, which are represented by a vector as the MDP could occupy any one of the states of period k. We interpret the above minimization as that of selecting a single strategy to simultaneously minimize the expected value of each component of the vector. At first, this may seem impossible, as the strategy which minimizes the expected cost from a given beginning state

might well be dependent on that state. Fortunately, this is not the case: a strategy does exist which is independent of the beginning state. This follows from the Markov Property of a state: the optimal decision for a state is independent of the sequence of states visited in reaching the state. Applying this property, we see the policy developed for period 2 and beyond is indeed independent of the initial state occupied in the MDP. This gives the required independence to make our interpretation of the above minimization well defined.

1.1 The Rolling Horizon Procedure

A heuristic procedure for generating a strategy for the MDP is to select a policy for the current period based only on the expected cost of a strategy over a finite number of periods, or horizon, beyond the current period; any costs occurring beyond the end of the last of these periods (the horizon's end) are simply ignored. In effect, this truncates the problem at the horizon's end and forms a smaller and finite MDP or subproblem. The sequence of policies of a strategy covering the periods of a subproblem is called a substrategy of the subproblem. After forming a subproblem, a minimum expected cost substrategy of the subproblem is selected, and the decision for the current state in the first policy of the substrategy becomes the selected decision. Commitment is made to the selected decision and no more decisions are made until the next period. Hence, one rolls forward to the next period and the process repeats with a new subproblem or terminates by arriving at the end (if one exists) of the original MDP. This heuristic procedure is called a rolling horizon procedure or RHP. A simple modification of the RHP is to roll forward not just one period (a decision roll) but to commit to the selected substrategy for several or all periods included in the subproblem (a full roll). To simplify our results, we will only consider the perhaps most interesting case of decision rolls.
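A minimal sketch of one decision roll, under an assumed data layout (lists of per-period transition matrices and cost vectors, invented here): the subproblem is solved by backward induction with all costs beyond the horizon ignored, and only the first-period policy is returned for commitment.

```python
import numpy as np

def rhp_decision(P, c, alpha, T):
    """Solve the T-period subproblem starting at the current period and
       return the first-period policy (one action index per state).
       P[k][a]: s x s transition matrix for action a in subproblem period k;
       c[k][a]: s-vector of expected costs (hypothetical layout)."""
    s = P[0][0].shape[0]
    V = np.zeros(s)                        # horizon's end: later costs ignored
    policy = None
    for k in reversed(range(T)):
        Q = np.stack([c[k][a] + alpha * P[k][a] @ V for a in range(len(c[k]))])
        policy = Q.argmin(axis=0)          # best action for each state
        V = Q.min(axis=0)
    return policy                          # commit this, then roll forward one period

# One-period horizon, two states, two actions (illustrative numbers):
P = [[np.eye(2), np.eye(2)]]
c = [[np.array([1.0, 5.0]), np.array([2.0, 0.0])]]
print(rhp_decision(P, c, alpha=0.9, T=1))  # -> [0 1]
```

Rolling forward then means advancing one period and calling the same routine on the next subproblem's data.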
We illustrate the RHP with a simple deterministic example and identify some of the notation defined later in Section 1.2.

Figure 1. A problem where a RHP using a horizon of 2 generates a poor solution. (Edge costs: each upper branch edge costs 1; the lower branch edge costs are 1/2, 1/4, 1/8, 1/16, ....)

Consider the problem of Figure 1 having three types of actions: U (move diagonally up), D (move diagonally down), and H (move horizontally to the right). If D is selected at the origin (node 0), then one must follow the lower branch of the figure. Alternatively, if U is selected at the origin, then one can follow the upper branch by selecting H until D is selected, causing one, after selecting D again, to continue on the lower branch. Let the duration of each decision be one and the horizon of each subproblem be two (T = 2). Subproblem 1 contains the nodes 0 (the origin o_1) through node 5. The substrategies of subproblem 1 are p1 = {U, H}, p2 = {U, D}, and p3 = {D, H} with respective costs 2, 1.25, and 1.5. Since p2 is the minimum cost substrategy, C_1 = 1.25 and U is the selected decision. After selecting U, the RHP rolls forward to node o_2 = 1 for a total cost of F_{1,2} = 1. The second subproblem contains nodes 1 through 8 except nodes 2 and 5. The substrategies of subproblem 2 are p1 = {H, H}, p2 = {H, D}, and p3 = {D, D} with

respective costs 2, 1.125, and 1.25. Substrategy p2 is the minimum cost substrategy with C_2 = 1.125, and H becomes the selected decision. Note the extension of the minimum cost substrategy of subproblem 1 is the decision D out of node 4, with an extension cost of e_1 = 1. With a fixed horizon of 2, the RHP will continue to select H in all subsequent subproblems. Hence, the RHP generates the only infinite cost strategy in the problem. Note that the RHP selects a decision with a high cost to avoid the high extension cost and to continue with a substrategy with cheaper future decisions, but it never actually makes one of those cheaper decisions and so never benefits from the selection.

1.2 Assumptions and Notation

We make the following assumptions regarding subproblems and the RHP:

A. Each subproblem has a minimum expected cost substrategy and the RHP will roll forward using the first policy of one such substrategy.
B. The horizon's end of each subsequent subproblem is non-decreasing.
C. The expected undiscounted cost of any substrategy is finite.
D. The horizon of any subproblem is finite.

We use the following notation to describe the status of a RHP implementing decision rolls after it has rolled forward k − 1 times, k = 1, 2, ..., M + 1:

T = the number of periods in each subproblem, or the horizon,
k = the subproblem index; this is also the first period included in the kth subproblem,
o_k = the initial state or origin of subproblem k,
F_{j,k} = the (random variable) cost of the strategy generated by the RHP from the beginning of period j to the beginning of period k, where k ≥ j. We define F_{j,j} to be zero.
F*_{j,k} = the (random variable) cost of the optimal strategy from the beginning of period j to the beginning of period k, where k ≥ j. We define F*_{j,j} to be zero.
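The Figure 1 example can be simulated directly. The edge costs below are inferred from the substrategy costs quoted above (upper-branch H edges cost 1; the drop D out of the upper branch at period k costs 2^{-k}; the drop out of the middle level costs 1; the lower branch is never reached by the RHP and so is not modeled):

```python
def subproblem_costs(state, k):
    """Costs of the two-period substrategies from `state` at period k,
       matching the values worked out in the text for k = 1 and k = 2."""
    if state == "MID":                     # node 0, the origin
        return {("U", "H"): 2.0,
                ("U", "D"): 1 + 2.0 ** -(k + 1),
                ("D", "H"): 1 + 2.0 ** -k}
    return {("H", "H"): 2.0,               # state == "TOP"
            ("H", "D"): 1 + 2.0 ** -(k + 1),
            ("D", "D"): 2.0 ** -k + 1}

state, k, total = "MID", 1, 0.0
for _ in range(6):                         # six decision rolls
    costs = subproblem_costs(state, k)
    first = min(costs, key=costs.get)[0]   # commit only the first decision
    if state == "MID":
        total += 1.0                       # U and D both cost 1 at the origin
        state = "TOP" if first == "U" else "BOT"
    else:
        total += 1.0 if first == "H" else 2.0 ** -k
        state = "TOP" if first == "H" else "MID"
    k += 1
print(state, total)                        # -> TOP 6.0
```

The RHP pays cost 1 every period, always preferring {H, D} (defer the descent) over {D, D} (descend now), and never actually descends.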

Figure 2. Illustration of notation. Circles represent states and braces indicate a realization of costs between states. The paths from o_k to a1 and from o_{k+1} to a2 are representative paths from following the optimal substrategies in subproblems k and k + 1, respectively. Paths from o_k to a3 and o_{k+1} to a4 are representative paths from following part of the optimal substrategies in subproblems k + 1 and k (with an extension), respectively.

C_k = the (random variable) cost of an optimal substrategy for subproblem k. We define C_{M+1} to be zero.

Two variables of special interest are:

F_{M+1} = F_{1,M+1}, the cost of the RHP strategy, and
F*_{M+1} = F*_{1,M+1}, the cost of an optimal strategy.

Since the last subproblem that really must be solved in the RHP is the first subproblem that includes the last period of the problem, there are some end of problem effects that are important when M is small. We will try to avoid these, since we are mostly interested in problems with many periods; this means we will sacrifice sensitivity of our results to small M in order to simplify our presentation. In this light, we will assume the RHP will keep rolling forward and solving subproblems, still using a horizon of T and artificially extending the MDP with periods containing zero cost vectors as necessary, until it reaches the end of period M. Hence the RHP will roll forward M times for a total cost of F_{M+1}.

When M is infinite, the cost of the strategy generated by a RHP is taken to be F_∞ = lim_{k→∞} F_{1,k} when this limit exists. Clearly, we seek an upper bound on the quantity E(F_{M+1} − F*_{M+1}). There is one more random variable to define, relating to the additional costs the RHP must consider as it rolls forward and solves the next subproblem. These additional costs are called extension costs. At first glance, extension costs may appear somewhat unimportant when trying to bound the error of a RHP because they occur at the end of a subproblem, while the cost of the RHP strategy comes from the commitment to the first policy of each subproblem substrategy. However, it turns out that extension costs are central in deriving bounds on the error of a RHP. We need to discuss extension costs in more detail before we can define the particular extension costs required in our bounds.

When subproblem k + 1 is formed, it will have some commonality with subproblem k. In particular, the set of substrategies and associated costs that begin in period k + 1 and end at the beginning of period k + T are the same in both subproblems (see Figure 2). Thus, a substrategy of subproblem k can be modified to become a substrategy of subproblem k + 1 in the following way: a) continue the substrategy with a sequence of policies of the original MDP from the end of the first subproblem to the end of the second subproblem, and b) remove all policies for the periods prior to the beginning of the second subproblem. The modification accomplished by task a) is called an extension of a substrategy. The extension cost associated with the extension constructed above is the partial cost of the substrategy resulting from following the modified substrategy from the end of subproblem k to the end of subproblem k + 1. (Note that Assumption B is required for extension costs to make sense.)
It is an upper bound on the expectation of the following extension cost that we need in our analysis of the error of a RHP:

e_k = the (random variable) extension cost of the substrategy selected in subproblem k when extended by an extension which minimizes the expected extension cost.

2.0 A Fundamental Inequality and a Bound on the Error of a RHP

In this section we derive an inequality that bounds, from above, the expected cost of a strategy generated by a RHP. This becomes useful later on to derive bounds on the expected error of a RHP. Before continuing, we make the following definition regarding the expectation of any random variable of interest, generally denoted by Y, with respect to the beginning of subproblem k:

E_k(Y) = the expected value of the random variable Y given the MDP is in state o_k.

Note that the MDP initially starts in state o_1, so we are primarily interested in E_1(Y) for a priori analysis of the RHP, and we will often use E(Y) for E_1(Y).

LEMMA 1. If 1 < M < ∞, then

E(F_{M+1}) ≤ E(C_1) + Σ_{k=1}^{M−1} E(e_k),

and if M = ∞, then

E(F_∞) ≤ E(C_1) + lim_{m→∞} Σ_{k=1}^{m} E(e_k).

Proof. We first prove the case for 1 < M < ∞. Since M > 1 we can select two sequential subproblems, say k and k + 1, formed by the RHP. By our assumptions, the substrategy formed by adding a minimum cost extension to the minimum cost substrategy of subproblem k, with some truncation, is a substrategy of subproblem k + 1. The expected cost of this substrategy, with respect to period k + 1, is E_{k+1}(C_k − F_{k,k+1} + e_k) (see Figure 2). Hence E_{k+1}(C_{k+1}), the minimum expected cost of the substrategy selected in subproblem k + 1, must satisfy

E_{k+1}(C_{k+1}) ≤ E_{k+1}(C_k) − F_{k,k+1} + E_{k+1}(e_k).

(Note that by period k + 1 the value of F_{k,k+1} is known.) Since all terms are finite,

F_{k,k+1} ≤ E_{k+1}(C_k) − E_{k+1}(C_{k+1}) + E_{k+1}(e_k).

By taking expectations of both sides with respect to the first period, we have

E(F_{k,k+1}) ≤ E(C_k) − E(C_{k+1}) + E(e_k).

Summing the above inequality over all k from one to M gives

E(F_{1,M+1}) ≤ E(C_1) − E(C_{M+1}) + Σ_{k=1}^{M} E(e_k).

The result follows from the definitions E(F_{M+1}) = E(F_{1,M+1}) and E(C_{M+1}) = 0, and realizing that E(e_M) = 0 since a substrategy of the last subproblem has no extension. The case for M = ∞ follows from the previous case by replacing M with m to obtain

E(F_{1,m+1}) ≤ E(C_1) − E(C_{m+1}) + Σ_{k=1}^{m} E(e_k),

and taking the limit of both sides as m approaches infinity (with non-negative expected costs, E(C_{m+1}) ≥ 0 and can be omitted from the bound). □

Under the assumption that the undiscounted expected cost of any period is non-negative and uniformly bounded above, we can use Lemma 1 to derive a bound on the expected error of a RHP in terms of the discounting factor.

THEOREM 1. If 1) the undiscounted expected cost of a period is non-negative and uniformly bounded above by c, 2) each forward roll is one period, and 3) each subproblem horizon is T periods (T ≤ M), then

E(F_{M+1} − F*_{M+1}) ≤ α^T (1 − α^{M−1}) c / (1 − α).

Proof. With decision rolls, an extension consists of a single period with an expected undiscounted cost of at most c. The kth extension begins with period T + k. Hence an upper bound on the kth expected extension cost, E(e_k), is

E(e_k) ≤ α^{T+k−1} c.

Summing this inequality over all k = 1, 2, ..., M − 1, we obtain

Σ_{k=1}^{M−1} E(e_k) ≤ α^T (1 − α^{M−1}) c / (1 − α).

Since all expected costs are non-negative and E(C_1) is the expected cost of the minimum expected cost substrategy over the first T periods, it follows that

E(F*_{M+1}) ≥ E(F*_{1,T+1}) ≥ E(C_1).

Combining this inequality with Lemma 1 gives

E(F_{M+1} − F*_{M+1}) ≤ Σ_{k=1}^{M−1} E(e_k).

Now substitute the expression derived above bounding the sum on the right hand side. □

3.0 An Equivalent MDP

We say two MDPs are equivalent if the set of optimal substrategies of any subproblem in one MDP is the same as the set of optimal substrategies of the subproblem in the other MDP containing the same sequence of periods. This is an important property, since it implies the strategy generated by a RHP will be the same in either MDP, provided that the selection of an optimal substrategy from the set of optimal substrategies for a subproblem is the same in each MDP. Note that the cost of an optimal substrategy is not required to be the same in both MDPs. In this section, we identify with every MDP an equivalent MDP that is useful in deriving a bound on the error of a RHP. The equivalent MDP differs from the original MDP in two respects: 1) the discounting factor is scaled by a quantity between zero and one, and 2) the transition matrices are different. Due to the first property above, we can always apply Theorem 1 to bound the error of a RHP in the equivalent MDP whenever the scaled discount factor is less than one. We show in Section 4.0 that a bound on the error of a RHP in the original MDP can be expressed in terms of certain bounds of a RHP in the equivalent MDP. In this way, the equivalent MDP is useful in obtaining a bound on the error of a RHP.
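To get a feel for how the Theorem 1 bound α^T (1 − α^{M−1}) c / (1 − α) behaves, it can be evaluated numerically; the parameter values below are invented for illustration:

```python
def theorem1_bound(alpha, T, M, c):
    """Theorem 1 upper bound on the expected error of a decision-roll RHP."""
    return alpha ** T * (1 - alpha ** (M - 1)) * c / (1 - alpha)

# The bound falls geometrically in the horizon length T:
for T in (5, 10, 20):
    print(T, theorem1_bound(alpha=0.9, T=T, M=100, c=1.0))
```

Doubling the horizon roughly squares the α^T factor, so modest horizons already give small error when α is well below one.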

We motivate the derivation of an equivalent MDP with the following observation: if all possible transition matrices, P^k = {P^k(x) : x^k ∈ D^k}, of period k are identical and have equal rows (i.e., each is a stable matrix), then the MDP is effectively decoupled at the end of period k: it matters not what state is occupied during period k, since the expected cost to continue from the state is state independent for a given strategy. This means we can solve the smaller problem ending with period k and obtain the optimal strategy for the original problem up to period k. Such situations do not occur as a rule; however, there may be a common non-negative matrix, say L^k, with equal rows, where each element of the matrix is less than or equal to the corresponding element of each matrix in P^k. That is, L^k is dominated by each matrix in P^k. Depending on the magnitude of L^k, as measured by the sum of the elements of one of its rows (its row sum), some decoupling of the MDP in period k will exist; e.g., if the row sum is one then we have complete decoupling. It seems reasonable that as the magnitude of L^k increases, the degree of decoupling in period k should increase; that is, there should be a decreased sensitivity of the optimal strategy of the MDP up to and including period k to problem data beyond period k. With this motivation, define

L^k = the stable matrix having the largest row sum which is dominated by each matrix in P^k, and
1 − β_k = the row sum of L^k.

From the definition of L^k, an expression for β_k can be derived:

β_k = 1 − Σ_{j∈S} min_{i∈S} min_{x^k∈D^k} P_ij^k(x).

We call 1 − β_k the decoupling coefficient for period k and, for obvious reasons, β_k the coupling coefficient for period k. In words, the decoupling coefficient is calculated for a single transition matrix by determining the minimum element in each column of the matrix and then adding these minimums together.
The decoupling coefficient for a period is simply the minimum decoupling coefficient over all possible transition matrices of the period. The coupling coefficient for the period is just one minus the decoupling coefficient for the period. A numerical example of calculating the coupling coefficient of a single matrix is given in Table 4 of Section 5, where the decoupling coefficient is 0.3025 (the sum of the elements in the bottom row of the table).

If β_k is zero then the row sum of L^k is one and, as noted before, the MDP is completely decoupled: there is no coupling between periods k and k + 1. If β_k is one then for a given state of period k + 1 there is a state-decision combination in period k which will avoid the given state with certainty. Hence, to avoid certain costs beyond period k it may be important which state one is in during period k; in this sense, there exists a coupling between periods k and k + 1.

We can subtract L^k from P^k(x) to obtain a non-negative matrix whose row sum is just β_k. Hence, assuming for the moment that β_k > 0,

P̃^k(x) = (P^k(x) − L^k) / β_k

is a stochastic matrix. This gives

P^k(x) = β_k P̃^k(x) + (1 − β_k) L̃^k,

where L̃^k = L^k / (1 − β_k) if β_k < 1 and is set to any arbitrary stable stochastic matrix if β_k = 1. Note that L̃^k is always a stable stochastic matrix. To simplify our expressions by replacing products of coupling coefficients with powers of a single coupling coefficient, we define a period independent coupling coefficient as

β = max_k β_k

and show how we can work with β instead of β_k. Clearly, we can subtract any stable matrix dominated by L^k from P^k(x) and still produce a non-negative matrix that can be scaled into a stochastic matrix. In particular, the matrix L̄^k defined as

L̄^k = ((1 − β) / (1 − β_k)) L^k

is a stable matrix dominated by L^k having a row sum of just 1 − β. We then redefine

P̃^k(x) = (P^k(x) − L̄^k) / β,

giving

P^k(x) = β P̃^k(x) + (1 − β) L̃^k,

where L̃^k has not been redefined. We now consider two MDPs: the original MDP, called problem P, and a new MDP, called problem P̃, where P^k(x) is replaced by P̃^k(x) and α is replaced by αβ. To be precise, problems P and P̃ are defined as

(P)  min_{x∈D} Σ_{k=1}^{M} α^{k−1} P^{1,k}(x) c^k(x), and

(P̃)  min_{x∈D} Σ_{k=1}^{M} (αβ)^{k−1} P̃^{1,k}(x) c^k(x).

We show in Lemma 2 below that P and P̃ are equivalent. To do this, we compare the sets of optimal substrategies for an arbitrary subproblem in either MDP covering the same periods and show that they are the same. To aid in the comparison, we define

P(j, l) = the subproblem of P covering periods j through l, and
P̃(j, l) = the subproblem of P̃ covering periods j through l.

LEMMA 2. The sets of optimal substrategies for subproblems P(j, l) and P̃(j, l) are the same for all l = 1, 2, ..., M and j = 1, 2, ..., l.

Proof. (See Appendix A)

If the RHP selects a substrategy from the set of optimal substrategies in a manner that is independent of the problem formulation P or P̃ (which we assume), then Lemma 2 guarantees that the strategy generated by a RHP will also be independent of the problem formulation P or P̃.
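A numerical sketch of this construction, for a single period with two invented two-state transition matrices: β_k is one minus the sum of columnwise minima taken over every row of every matrix in the set, and the decomposition P^k(x) = β_k P̃^k(x) + (1 − β_k) L̃^k can be checked directly.

```python
import numpy as np

def coupling_coefficient(mats):
    """beta_k = 1 - (row sum of L^k), where each row of L^k is the vector
       of columnwise minima over all rows of all matrices in the set."""
    col_mins = np.stack(mats).min(axis=(0, 1))
    return 1.0 - col_mins.sum()

P1 = np.array([[0.6, 0.4], [0.3, 0.7]])
P2 = np.array([[0.5, 0.5], [0.2, 0.8]])
beta_k = coupling_coefficient([P1, P2])          # 1 - (0.2 + 0.4) = 0.4

col_mins = np.stack([P1, P2]).min(axis=(0, 1))
L = np.tile(col_mins, (2, 1))                    # stable matrix L^k
P_tilde = (P1 - L) / beta_k                      # stochastic
L_tilde = L / (1.0 - beta_k)                     # stable stochastic
assert np.allclose(beta_k * P_tilde + (1.0 - beta_k) * L_tilde, P1)
print(beta_k)
```

The same decomposition applied with β = max_k β_k and the scaled matrix L̄^k yields the matrices P̃^k(x) of problem P̃.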

In the next section we use the equivalent problem P̃ to bound the error of a RHP in the original MDP.

4.0 A Bound on the Error of a RHP

In this section, we derive a bound on the error of a RHP under the assumptions of Theorem 1. We begin by establishing a relationship between the error of a RHP in P and a set of errors of the RHP in P̃. We then derive, using Theorem 1, a bound for each error of the set and use the relationship to obtain a bound on the error of a RHP in problem P. In particular, the errors of a RHP in P̃ that we require are the errors of the RHP from period k to period M for each k = 1, 2, ..., M. To aid in discussing these errors, we make the following definitions:

F̃_k = the (random vector) cost of the strategy generated by the RHP from period k through period M in P̃, and
F̃*_k = the (random vector) cost of an optimal strategy from period k through period M in P̃.

(Refer to the comments at the end of Section 1.0 which discuss the minimization of vectors in the context of MDPs.) The error of a RHP from period k to period M is just E(F̃_k − F̃*_k). We now give an expression for the error of a RHP in P in terms of these errors of the RHP in P̃.

LEMMA 3.

E(F_{M+1} − F*_{M+1}) = E(F̃_1 − F̃*_1) + (1 − β) Σ_{i=1}^{M−1} β^{−i} L̃^i E(F̃_{i+1} − F̃*_{i+1}).

Proof. (See Appendix B)

To derive a bound on the error of a RHP in P̃, we note that the RHP not only generates a strategy beginning in period 1, producing the error E(F̃_1 − F̃*_1), but, due to the decision rolls,

it generates a strategy as if starting in period k as well, producing the errors E(F̃_k − F̃*_k) for k = 2, 3, ..., M. Hence we can apply Theorem 1 to bound each component of the vector E(F̃_k − F̃*_k) and substitute the resulting bounds, one for each k, in the expression of Lemma 3. This will then bound the error E(F_{M+1} − F*_{M+1}). In Theorem 2, e is an s-vector of ones.

THEOREM 2. If 1) the undiscounted expected cost of a period is non-negative and uniformly bounded above by c, 2) a forward roll is one period, and 3) each subproblem horizon is T periods (T ≤ M), then

E(F_{M+1} − F*_{M+1}) ≤ [(αβ)^T c / (1 − αβ)] [1 + α (1 − β) (1 − α^{M−T}) / (1 − α)] e.

Proof. Since a forward roll is one period, the RHP solves the MDP as if beginning in period i for each period i = 1, 2, ..., M. Hence, we can use Theorem 1 to bound the error of the RHP beginning in period i in problem P̃. After replacing α with αβ and setting M = ∞ to produce a weaker yet simpler bound, Theorem 1 gives the following bound for i ≤ M − T:

E(F̃_{i+1} − F̃*_{i+1}) ≤ [(αβ)^{T+i} c / (1 − αβ)] e,

and separate considerations give, for i > M − T,

E(F̃_{i+1} − F̃*_{i+1}) = 0 e.

Note that i occurs in the exponent of the first bound to properly discount costs in P̃ to the beginning of the first period. The bound when i > M − T follows from the fact that if a subproblem includes period M, then the RHP must select a substrategy that agrees, in expected cost, with the optimal strategy of P̃ over the periods covered by the subproblem. There is no complication due to the possibility that the actual state occupied in a period may depend on whether the RHP strategy or an optimal strategy is followed in the MDP. This follows since we are comparing expected cost vectors which are independent of the state actually occupied in a period.

Substituting the above expressions for E(F̃_{i+1} − F̃*_{i+1}) into the expression of Lemma 3, using the fact that L̃^i e = e, and summing the resultant geometric series gives the result. □

An interesting corollary of Theorem 2 bounds the average error of a RHP when solving a MDP with no discounting.

COROLLARY 1. With α = 1 and under the assumptions of Theorem 2,

lim_{M→∞} E(F_{M+1} − F*_{M+1}) / M ≤ β^T c e.

Proof. In Theorem 2 take the limit as α → 1 (using an application of L'Hôpital's rule) to obtain

E(F_{M+1} − F*_{M+1}) ≤ [β^T c / (1 − β)] [1 + (1 − β)(M − T)] e,

then divide both sides by M and take the limit as M → ∞. □

5.0 A Dynamic Vehicle Assignment Problem

We model a simple non-homogeneous vehicle assignment problem as a MDP to illustrate our results with an application of Corollary 1. Consider the problem of assigning a fixed number of vehicles, V, to a given set of stations in order to carry loads between the stations so as to minimize costs less revenues. Loads arrive at a station at the beginning of each day and form a queue. Vehicles assigned to a station at the end of the previous day serve the queue, with each vehicle carrying at most one load per day, until there are no more loads or no more vehicles at the station (see Figure 3). The number of loads that arrive at a station is random, depending on the station and the day. The destination of a load is also random, depending on the station at which the load arrived. A vehicle can carry one load in a day; hence the loads in excess of the number of vehicles at a station will be lost with no costs or revenues incurred. Alternatively, the vehicles in excess of the number of loads at a station will not carry any loads during the

Figure 3. Events in the Dynamic Vehicle Assignment Problem. A marker indicates when the state of a period occurs.

day and remain at the station unless reassigned to some other station later in the day. There is a fixed profit, R, for moving a load from a station to its destination. After all vehicles are done moving their loads, any number of vehicles at a station can be assigned and moved to other stations in order to move loads during the following day. There is a fixed cost, C, to assign and move an empty vehicle to another station. The period in this problem is a day. The state of a period is the number of vehicles at each station after all vehicles have moved their loads for the day. A decision for a state in this problem is the assignment of vehicles to the stations in preparation for the next day. A policy for a period is the specification of an assignment for each state of the period. The optimization problem is to select a strategy, which is an assignment for each state of each period, that minimizes the per day average expected costs less revenues. The RHP to generate a strategy for the problem is simple: find an optimal strategy for the problem over a fixed horizon, roll forward one day using the first policy of the strategy, and repeat the process. The question of interest is: what horizon should be used to guarantee that the average error of the RHP is not too great?

For our particular example, we will consider the three station problem of Figure 4. We assume that the number of loads that arrive during the morning at a station has a Poisson

Figure 4. A Three Station Dynamic Vehicle Assignment Problem.

distribution with parameter λ_id for station i (i = 1, 2, 3) and day d (d = 1, 2, ..., 7) of the week. By making λ day dependent, the problem becomes non-homogeneous on a day to day basis. However, the problem does look the same on a week to week basis, as λ only depends on the day of the week. This is done only to keep the problem simple and yet exhibit non-homogeneous behavior, at least for our definition of a period. Let N_id be the random variable for the number of loads that arrive at station i during the dth day of a week. The probability distribution of N_id is

P{N_id = n} = λ_id^n e^{−λ_id} / n!.

Assume that a load at station i will have the station on its left, or a clockwise movement in Figure 4, as its destination with probability p_i and the station on its right with probability 1 − p_i. A state of a period is represented by the vector (n_1, n_2, n_3), where n_i is the number of vehicles at station i. An assignment can be represented in a similar manner as (a_1, a_2, a_3), where a_i is the number of vehicles assigned to station i. The transition probabilities between states of two consecutive periods depend on the assignment of vehicles in the first period and are driven by the probabilities P{N_id = n} and p_i. For example, if all V vehicles are assigned to station 1, or the assignment (V, 0, 0), then the probability that they all go to station 2, or the state (0, V, 0), as a result of moving loads on day d of the week is

P{(V, 0, 0) → (0, V, 0)} = (1 − Σ_{n=0}^{V−1} λ_{1d}^n e^{−λ_{1d}} / n!) p_1^V.
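This transition probability (the Poisson tail for at least V arrivals times p_1^V) can be sketched numerically; the rate and routing values below are illustrative:

```python
import math

def prob_all_to_station2(V, lam, p1):
    """P{(V,0,0) -> (0,V,0)} on a day with arrival rate lam at station 1:
       at least V loads arrive (Poisson tail), and every one of the V
       carried loads heads clockwise, each independently with prob. p1."""
    p_at_least_V = 1.0 - sum(math.exp(-lam) * lam ** n / math.factorial(n)
                             for n in range(V))
    return p_at_least_V * p1 ** V

print(prob_all_to_station2(V=3, lam=2.0, p1=0.5))  # illustrative values
```

Other entries of the transition matrix follow the same pattern, summing over the possible load counts and destination splits.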

The expected cost of an assignment is C times the number of empty vehicles moved in the assignment less R times the expected number of loads moved the next day. In this initial formulation, the expected cost of an assignment can be negative. Since our results require positive expected costs, we need a lower bound, say c, on the expected cost of an assignment to subtract from all expected assignment costs to make them positive. A simple lower bound on the expected cost of an assignment is found by assuming that no vehicles are moved in an assignment and that all vehicles happen to move a load on the following day. This gives a lower bound of c = -VR. This is a weak bound, since it bounds the actual cost of an assignment, which in turn bounds the expected cost of an assignment. However, this bound is not critical in this problem and is therefore adequate for illustration. Now that we have positive expected costs, we need a reasonable upper bound, say c̄, on the expected cost of an assignment for use in Corollary 1. For the Poisson distribution, the expected number of loads moved is smallest for the assignment placing all vehicles at the station having the lowest expected number of arrivals; such an assignment minimizes the expected profit from moving loads. Assuming the worst-case situation, where this assignment requires all vehicles to move empty to other stations, we obtain the following bound on the maximum expected cost of an assignment:

$$\bar{c} \le CV + RV - R \min_{d,i} \left[ \sum_{n=1}^{V-1} n\, P\{N_{id} = n\} + V \sum_{n=V}^{\infty} P\{N_{id} = n\} \right].$$

For a numerical example, assume the cost of moving an empty vehicle is C = 100, the profit of moving a load is R = 400, and the routing probabilities are p_1 = p_2 = p_3 = 0.5. Also, let the expected number of arrivals at each station per day of the week be as given in Table 3. For these data, and for the case of five vehicles (V = 5), the above bound is c̄ ≤ 1350. We illustrate the calculation of the coupling coefficient β for this problem containing only three vehicles (V = 3).
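The assembly of these two bounds can be sketched as follows. The helper names are illustrative assumptions; C, R, V and the smallest arrival rate are taken from the numerical example, and the truncated expectation E[min(N, V)] is the bracketed term of the displayed bound.

```python
from math import exp, factorial

def poisson_pmf(n, lam):
    """P{N = n} for a Poisson random variable with mean lam."""
    return lam ** n * exp(-lam) / factorial(n)

def expected_loads_moved(V, lam):
    """E[min(N, V)]: with all V vehicles at one station, each vehicle
    moves at most one of the N loads that arrive there."""
    head = sum(n * poisson_pmf(n, lam) for n in range(V))
    tail = 1.0 - sum(poisson_pmf(n, lam) for n in range(V))
    return head + V * tail

# Parameters of the numerical example.
C, R, V = 100.0, 400.0, 5
lam_min = 1.0  # smallest expected daily arrivals appearing in Table 3

c_lower = -V * R  # lower bound c on an assignment's expected cost
c_upper = C * V + R * V - R * expected_loads_moved(V, lam_min)
```

Shifting all costs by -c = VR makes them non-negative, after which c_upper bounds the shifted expected cost of any assignment.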
Later, we will consider a larger problem with five vehicles when actually applying Corollary 1. The reason we consider the larger problem is to show that sometimes, even if the coupling coefficient is near one for the problem as stated, by

TABLE 3. Expected Daily Arrivals by Station and Day of Week

    Day   λ_{1d}   λ_{2d}   λ_{3d}
     1      1        1        1
     2      2        1        1
     3      1        2        1
     4      1        1        2
     5      2        2        1
     6      2        1        2
     7      1        2        2

eliminating decisions that are clearly non-optimal in any strategy or substrategy, we find a much smaller coupling coefficient. Recall that the coupling coefficient of a period is calculated from the set of transition matrices of the period, and that a transition matrix for a period is defined by specifying the decision for each state of the period. Specifying a decision for one state effectively defines a transition vector from the state to the states of the next period and so defines a row of the transition matrix for the period. In our example, the decision set of a state, that is, the set of possible vehicle assignments, is state and period independent. Moreover, the transition probabilities given a particular vehicle assignment are also state independent (but period dependent). Hence, the set of candidate transition vectors for any row of a period's transition matrix is row independent. This means the coupling coefficient for a period in this problem is equal to the coupling coefficient of the matrix whose rows represent the set of unique candidate transition vectors for a single row. We call this matrix the candidate matrix for a state. In this example, the candidate matrix for a state is the same for each state of a period; hence the matrix representing the set of unique transition vectors taken from the union of all candidate matrices for the states of a period, or the candidate matrix of the period, is equal to the candidate matrix for any given state of the period. Table 4 contains the candidate matrix for the first day of a week in our example with three vehicles. The last line of the table contains the column minimums, whose sum is 0.3025. Hence, the coupling coefficient is 0.6975 for the first day of a week. Note that the

TABLE 4. Transition Vectors for each Assignment of Day 1

                                   State of Day 2
Assignment   003    012    021    030    102    111    120    201    210    300
   003      .3678  .1839  .0459  .0100  .1839  .0919  .0301  .0459  .0301  .0100
   012      .1162  .1934  .0885  .0243  .1744  .1675  .0694  .0790  .0660  .0208
   021      .0243  .0885  .1934  .1162  .0694  .1675  .1744  .0660  .0790  .0208
   030      .0100  .0459  .1839  .3678  .0301  .0919  .1839  .0301  .0459  .0100
   102      .1162  .1744  .0790  .0208  .1934  .1675  .0660  .0885  .0694  .0243
   111      .0367  .1110  .1110  .0367  .1110  .2231  .1110  .1110  .1110  .0367
   120      .0208  .0790  .1744  .1162  .0660  .1675  .1934  .0694  .0885  .0243
   201      .0243  .0694  .0660  .0208  .0885  .1675  .0790  .1934  .1744  .1162
   210      .0208  .0660  .0694  .0243  .0790  .1675  .0885  .1744  .1934  .1162
   300      .0100  .0301  .0301  .0100  .0459  .0919  .0459  .1839  .1839  .3678
Minimums:   .0100  .0301  .0301  .0100  .0301  .0919  .0301  .0301  .0301  .0100

candidate matrix of Table 4 just happens to be a valid transition matrix for a period. In particular, it is the transition matrix in which each assignment equals the period's state, so that vehicles are never moved in an assignment. In general, the candidate matrix of a period is not a valid transition matrix for the period. Now we consider the case of five vehicles. Calculations of the coupling coefficient for each day of the week were made using the approach illustrated above, and the results are given in the column headed by the number 5 in Table 5. Ignore, for the moment, the remaining columns of Table 5. The maximum coupling coefficient over the days of the week, and hence over all days in the problem, is 0.9606. This number is near one, which means the error bound we would obtain from Corollary 1 may be unreasonably large. This motivates us to do better. Actually, the coupling coefficient we need is only for the set of transition matrices that could be part of an optimal substrategy in the RHP or part of an optimal strategy. For our data, we argue (but do not rigorously prove) that assigning all vehicles to a single station is never part of an optimal substrategy or optimal strategy.
This result alone will allow us to use a coupling coefficient of 0.8746 versus 0.9606 (see the column headed with 4 in Table 5).
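As a check on the three-vehicle figures above, the coupling coefficient can be recomputed from the candidate matrix of Table 4. This assumes, as the numbers in the text indicate, that the coupling coefficient of a matrix is one minus the sum of its column minimums.

```python
# Candidate matrix for day 1 (rows = assignments, columns = states of
# day 2), transcribed from Table 4.
rows = [
    [.3678, .1839, .0459, .0100, .1839, .0919, .0301, .0459, .0301, .0100],
    [.1162, .1934, .0885, .0243, .1744, .1675, .0694, .0790, .0660, .0208],
    [.0243, .0885, .1934, .1162, .0694, .1675, .1744, .0660, .0790, .0208],
    [.0100, .0459, .1839, .3678, .0301, .0919, .1839, .0301, .0459, .0100],
    [.1162, .1744, .0790, .0208, .1934, .1675, .0660, .0885, .0694, .0243],
    [.0367, .1110, .1110, .0367, .1110, .2231, .1110, .1110, .1110, .0367],
    [.0208, .0790, .1744, .1162, .0660, .1675, .1934, .0694, .0885, .0243],
    [.0243, .0694, .0660, .0208, .0885, .1675, .0790, .1934, .1744, .1162],
    [.0208, .0660, .0694, .0243, .0790, .1675, .0885, .1744, .1934, .1162],
    [.0100, .0301, .0301, .0100, .0459, .0919, .0459, .1839, .1839, .3678],
]

# Column minimums and the resulting coupling coefficient.
col_mins = [min(col) for col in zip(*rows)]
beta = 1.0 - sum(col_mins)
```

The column minimums sum to 0.3025, giving a day-1 coupling coefficient of 0.6975.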

TABLE 5. Coupling Coefficients for each Day of the Week

              Limit on Number of Vehicles Assigned to a Station
Day of Week       5        4        3        2
     1          .9606    .8746    .7106    .3909
     2          .9296    .8127    .5893    .2986
     3          .9296    .8127    .5893    .2986
     4          .9296    .8127    .5893    .2986
     5          .8513    .7267    .4668    .1969
     6          .8513    .7267    .4668    .1969
     7          .8513    .7267    .4668    .1969
Maximums:       .9606    .8746    .7106    .3909

It seems reasonable that placing all vehicles at the station with the highest expected number of arrivals is the best case for arguing that such assignments could indeed be optimal. We discuss why, for this problem and data, such assignments cannot be optimal and can therefore be removed from the set of potentially optimal assignments. Placing all vehicles at station 1 on day 2 at zero cost is the best case for any assignment, on any day, that places all vehicles at one station. This assignment has an expected profit of at most 800 = 400 x 2. Now consider a second assignment that places all but one vehicle at station 1. At most one extra move of an empty vehicle will be required by this second assignment, at a cost of 100. The expected profit of moving loads for this assignment is approximately 400 x 2 + 400(1 - e^{-1}) ≈ 1053. Finally, we have to account for cost differences between the two assignments in order to continue in the problem. In particular, the one vehicle not assigned to station 1 in the second assignment may not end up where the fifth vehicle ends up in the first assignment. The other four vehicles necessarily end up at the same stations in both assignments. Hence, we can move the one vehicle, if necessary, in the second assignment at a cost of 100 and match the ending state of the first assignment. The net cost savings of the second assignment over the first assignment is approximately 1053 - 800 - 100 - 100 = 53. Hence, we can eliminate those assignments from the candidate matrix that place all vehicles at one station, and then determine the coupling coefficient.
This, in effect, limits the number of vehicles that can be assigned to a single station to 4. The results of this are summarized in the column headed 4 in Table 5.
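The netting of the two assignments' expected profits and empty-move costs can be verified directly. The variable names below are illustrative; the figures follow the approximate accounting in the text.

```python
from math import exp

R, C = 400.0, 100.0  # profit per load moved, cost per empty vehicle move

profit_concentrated = R * 2                    # all vehicles at station 1 on day 2
profit_spread = R * 2 + R * (1.0 - exp(-1.0))  # one vehicle at a lambda = 1 station; ~1053
extra_empty_move = C                           # the displaced vehicle may move empty
rematch_move = C                               # move it once more to match ending states

savings = profit_spread - profit_concentrated - extra_empty_move - rematch_move
```

The savings come to about 53, matching the text, so assignments concentrating all vehicles at one station can be discarded without losing any optimal assignment.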

TABLE 6. Horizons for the Coupling Coefficients of Table 5 with f = .01

Assignment Limit   Coupling Coefficient   Horizon in Days
       5                 0.9606               114.56
       4                 0.8746                34.37
       3                 0.7106                13.48
       2                 0.3909                 5.01

By a similar although slightly more complicated argument (which we will not present), it can be shown that assignments placing 4 or more vehicles at a station are not potentially optimal assignments. Thus, we can further reduce the limit on the number of vehicles assigned to a station to just 3 without eliminating any optimal assignments. The column headed 3 in Table 5 contains the results of using this limit. Thus, we can use β = 0.7106 for this particular problem, which is much lower than our initial value of 0.9606. The last column of Table 5 is included only for completeness, to show the sensitivity of the coupling coefficient to the assignment limit. Now we turn to Corollary 1. Since we are interested in selecting a horizon T to guarantee that the average error of a RHP is small, say no more than ε, we set the right-hand side of the expression in the corollary to be less than ε and solve for a lower bound on the horizon. This gives

$$T > \frac{\ln(\varepsilon/\bar{c})}{\ln(\beta)}.$$

Suppose that we require the RHP to generate a strategy having an average error of no more than a fraction f of the maximum expected cost of a decision, that is, ε = f c̄. Then the bound simplifies to

$$T > \frac{\ln(f)}{\ln(\beta)}.$$

For our β of 0.7106 and a fraction of f = .01, we find that a fixed horizon of 14 days or more will guarantee that the average error of a RHP is no more than one percent of the maximum expected cost of a decision. Note that this error is no more than 13.5, which is one percent of the bound c̄ ≤ 1350 we calculated above. Table 6 contains the horizons for this and the other values of β found in Table 5.
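The horizons of Table 6 follow directly from the simplified bound; the function name below is an illustrative assumption.

```python
from math import ceil, log

def horizon_days(beta, f):
    """Smallest integer horizon T satisfying T > ln(f) / ln(beta),
    which keeps the average error of the RHP below the fraction f of
    the maximum expected cost of a decision."""
    return ceil(log(f) / log(beta))
```

For example, horizon_days(0.7106, 0.01) returns 14, the horizon quoted in the text for β = 0.7106 and f = .01.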

Appendix A: Proof of Lemma 2

LEMMA 2. The sets of optimal substrategies for subproblems P(j,l) and P̃(j,l) are the same for all l = 1, 2, ..., M and j = 1, 2, ..., l.

Proof. We arbitrarily fix l to an integer in [1, M] and then prove the result for j = l, l-1, ..., 1 by induction on j. We will consider two subproblems identical if the set of optimal substrategies and the expected cost of an optimal substrategy are the same in each subproblem. We begin by defining the hypothesis of our induction argument. Let H(j,l) be the hypothesis that states: 1) the sets of optimal substrategies for subproblems P(j,l) and P̃(j,l) are the same, and 2) the expected cost of an optimal substrategy of subproblem P(j,l) is equal to the expected cost of an optimal substrategy of subproblem P̃(j,l) plus θ_{jl} e, where θ_{jl} is a scalar constant independent of the decision made in period j.

For the case j = l, subproblems P(l,l) and P̃(l,l) are identical since they only involve c^l(x) and contain no transition matrices. Hence, H(l,l) is trivially true with θ_{ll} = 0.

Now assume H(i,l) is true for i = j, j+1, ..., l; we will prove that H(j-1,l) is also true. Subproblem P(j-1,l) is identical to

$$\min_{x^{j-1} \in D^{j-1}} \left[ c^{j-1}(x^{j-1}) + \alpha P^{j-1}(x^{j-1}) \min_{x \in D} \sum_{k=j}^{l} \alpha^{k-j} P^{jk}(x)\, c^k(x) \right],$$

where any dependency on x^{j-1} is made explicit. Making the substitution P^{j-1}(x^{j-1}) = β P̃^{j-1}(x^{j-1}) + (1-β) L^{j-1} and noting that the

resulting term involving L^{j-1} is independent of x^{j-1}, we have the identical subproblem

$$\min_{x^{j-1} \in D^{j-1}} \left[ c^{j-1}(x^{j-1}) + \alpha\beta\, \tilde{P}^{j-1}(x^{j-1}) \min_{x \in D} \sum_{k=j}^{l} \alpha^{k-j} P^{jk}(x)\, c^k(x) \right] + \alpha(1-\beta) L^{j-1} \min_{x \in D} \sum_{k=j}^{l} \alpha^{k-j} P^{jk}(x)\, c^k(x).$$

The two terms above represent two independent minimization problems; the value assumed by one term does not influence the substrategy that minimizes the other term. Hence we can evaluate the second term, that is, actually set x in the second term to an optimal substrategy of the second term, without changing the set of optimal substrategies of the first term. Since L^{j-1} is a stable matrix, the last term then becomes a constant column vector of equal elements. Let δ_{j-1,l} be the scalar such that the last term is simply δ_{j-1,l} e. Now we invoke the second part of the hypothesis H(j,l) and make the substitution

$$\min_{x \in D} \sum_{k=j}^{l} \alpha^{k-j} P^{jk}(x)\, c^k(x) = \theta_{jl}\, e + \min_{x \in D} \sum_{k=j}^{l} (\alpha\beta)^{k-j}\, \tilde{P}^{jk}(x)\, c^k(x),$$

which gives the following identical subproblem:

$$\min_{x^{j-1} \in D^{j-1}} \left[ c^{j-1}(x^{j-1}) + \alpha\beta\, \tilde{P}^{j-1}(x^{j-1}) \left( \theta_{jl}\, e + \min_{x \in D} \sum_{k=j}^{l} (\alpha\beta)^{k-j}\, \tilde{P}^{jk}(x)\, c^k(x) \right) \right] + \delta_{j-1,l}\, e.$$

The product αβ P̃^{j-1}(x^{j-1}) θ_{jl} e is simply αβ θ_{jl} e (since Pe = e for any stochastic matrix P), which is independent of x^{j-1}. Performing this product, pulling it out of the minimization, and collecting terms, we obtain the identical subproblem

$$\min_{x \in D} \left[ \sum_{k=j-1}^{l} (\alpha\beta)^{k-j+1}\, \tilde{P}^{j-1,k}(x)\, c^k(x) \right] + (\alpha\beta\,\theta_{jl} + \delta_{j-1,l})\, e.$$

This subproblem is simply the subproblem P̃(j-1,l) plus a constant vector of equal elements. Thus, the set of optimal substrategies for this subproblem is the same as that

of subproblem P̃(j-1,l). This makes the first part of the inductive hypothesis H(j-1,l) true. By setting θ_{j-1,l} = αβ θ_{jl} + δ_{j-1,l}, the second part of H(j-1,l) is also true. Since H(l,l) is true, it follows by induction on j that H(j,l) is true for j = 1, 2, ..., l. Furthermore, by the arbitrary choice of l, H(j,l) is also true for l = 1, 2, ..., M. The lemma follows from the fact that the first part of H(j,l) is true for all l = 1, 2, ..., M and j = 1, 2, ..., l. ∎

Appendix B: Proof of Lemma 3

LEMMA 3.

$$E(F_{M+1} - F^*_{M+1}) = E(\tilde{F}_1 - \tilde{F}^*_1) + (1-\beta) \sum_{i=1}^{M-1} \alpha^i L^i\, E(\tilde{F}_{i+1} - \tilde{F}^*_{i+1}).$$

Proof. We prove the lemma by simple substitution of P^k(x) = β P̃^k(x) + (1-β) L^k and, after some algebra, identification of the terms as E(F̃_k) or E(F̃^*_k) according as x is the RHP-generated strategy or an optimal strategy x^*. The result follows by a simple subtraction.

We substitute P^k(x) = β P̃^k(x) + (1-β) L^k into the definition of P^{1k}(x) and, when simplifying the expression, use the fact that P L^k = L^k for any stochastic matrix P (since L^k is stable). This gives, after dropping the dependency on x for simplicity,

$$P^{1k} = \beta^{k-1}\, \tilde{P}^{1k} + (1-\beta) \sum_{i=1}^{k-1} \beta^{k-i-1} L^i\, \tilde{P}^{i+1,k},$$

where

$$P^{jk} = \prod_{i=j}^{k-1} P^i \qquad \text{and} \qquad P^{jj} = I.$$

By definition,

$$E(F_{M+1}) = \sum_{k=1}^{M} \alpha^{k-1} P^{1k}(x)\, c^k(x).$$

Substituting in for P^{1k} (and still dropping the dependency on x), we have

$$E(F_{M+1}) = \sum_{k=1}^{M} (\alpha\beta)^{k-1}\, \tilde{P}^{1k} c^k + (1-\beta) \sum_{k=2}^{M} \alpha^{k-1} \sum_{i=1}^{k-1} \beta^{k-i-1} L^i\, \tilde{P}^{i+1,k} c^k.$$

The first term is simply E(F̃_1), for which we will substitute. Since all elements of the second term are non-negative, we can apply the monotone convergence theorem and exchange the summations in the second term to give

$$E(F_{M+1}) = E(\tilde{F}_1) + (1-\beta) \sum_{i=1}^{M-1} \alpha^i L^i \sum_{k=i+1}^{M} (\alpha\beta)^{k-i-1}\, \tilde{P}^{i+1,k} c^k.$$

But

$$E(\tilde{F}_{i+1}) = \sum_{k=i+1}^{M} (\alpha\beta)^{k-i-1}\, \tilde{P}^{i+1,k} c^k,$$

so

$$E(F_{M+1}) = E(\tilde{F}_1) + (1-\beta) \sum_{i=1}^{M-1} \alpha^i L^i\, E(\tilde{F}_{i+1}).$$

Likewise, if we set x = x^*, then

$$E(F^*_{M+1}) = E(\tilde{F}^*_1) + (1-\beta) \sum_{i=1}^{M-1} \alpha^i L^i\, E(\tilde{F}^*_{i+1}).$$

Subtracting the second equation from the first gives the desired result. ∎
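The key expansion used in the proof, P^{1k} = β^{k-1} P̃^{1k} + (1-β) Σ_{i=1}^{k-1} β^{k-i-1} L^i P̃^{i+1,k}, can be checked numerically by building P^k = β P̃^k + (1-β) L^k from arbitrary stochastic matrices P̃^k and stable (equal-row) matrices L^k. The matrices below are random assumptions for illustration, not data from the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic(n):
    """A random n x n stochastic matrix (rows sum to one)."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

def stable(n):
    """A random stable stochastic matrix: every row identical."""
    row = rng.random(n)
    return np.tile(row / row.sum(), (n, 1))

n, k, beta = 4, 5, 0.7
Pt = [None] + [stochastic(n) for _ in range(1, k)]  # P~^1 .. P~^{k-1}
L = [None] + [stable(n) for _ in range(1, k)]       # L^1 .. L^{k-1}
P = [None] + [beta * Pt[i] + (1 - beta) * L[i] for i in range(1, k)]

def prod(Ms, j, kk):
    """M^{j,kk} = M^j M^{j+1} ... M^{kk-1}; the empty product is I."""
    out = np.eye(n)
    for i in range(j, kk):
        out = out @ Ms[i]
    return out

lhs = prod(P, 1, k)
rhs = beta ** (k - 1) * prod(Pt, 1, k) + (1 - beta) * sum(
    beta ** (k - i - 1) * L[i] @ prod(Pt, i + 1, k) for i in range(1, k))
```

The two sides agree to machine precision, which relies on the identity Q L^i = L^i for any stochastic Q, exactly as invoked in the proof.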