OPTIMAL AVERAGE VALUE CONVERGENCE IN NONHOMOGENEOUS MARKOV DECISION PROCESSES

Yunsun Park, James C. Bean and Robert L. Smith

Technical Report 92-04
Department of Industrial and Operations Engineering
University of Michigan, Ann Arbor, Michigan 48109-2117

January, 1992

Optimal Average Value Convergence in Nonhomogeneous Markov Decision Processes*

Yunsun Park
Department of Industrial Engineering, Myungji University, Yongin-Eup, Yongin-Kun, Kyungkee-Do, Seoul, South Korea

James C. Bean and Robert L. Smith
Department of Industrial and Operations Engineering, The University of Michigan, Ann Arbor, MI 48109-2117

Abstract

We address the undiscounted nonhomogeneous Markov decision process with average reward criterion and prove two structural results. First, we establish equivalence of this problem to a discounted Markov decision process by means of an ergodic coefficient embedded in the original problem. Second, we prove, for the original problem, that the optimal finite horizon average values converge to the infinite horizon optimal average value under an ergodic condition.

1 Introduction

Many problems can be modeled as Markov decision processes, but are not necessarily homogeneous. That is, rewards or transitions may be time dependent.

*This work was supported in part by the National Science Foundation under Grants ECS-8700836 and DMM-9018515 to the University of Michigan.

Examples include R&D modelling (Nelson and Winter [1982]), capacity expansion (Freidenfelds [1981], Luss [1982]), equipment replacement (Lohmann [1984]), and inventory control (Sobel [1971]). In some of these applications average reward criteria are more appropriate than discounted objectives. In this paper we address the nonhomogeneous Markov decision process with the objective of maximizing average reward.

This analysis is complicated by the fact that the average reward criterion is tail driven. That is, whatever is done during any finite leading solution segment is irrelevant to the final objective value. In many homogeneous problems this is not a concern since the tail is an exact replica of the original problem. Such is not the case in nonhomogeneous problems. To further complicate the issue, in nonhomogeneous problems we are often interested only in a leading strategy segment, since only that segment must be implemented now.

One approach is to transform the Markov decision process to an equivalent discounted problem. To accomplish this we generalize results of Ross [1968] that consider homogeneous problems, and develop a more efficient transformation. We then adapt results from Alden and Smith [1992] that were developed for finite state, discounted problems. We also prove convergence of the finite horizon average optimal values to the infinite horizon average optimal value. This facilitates planning or solution horizon approaches. Our goal is to establish the mathematical framework necessary to create algorithms to solve these problems.

Section 2 introduces the notation and definitions necessary for our discussion. In Section 3, we describe the relationship between three common ergodic coefficients. Section 4 proves the equivalence between the average reward nonhomogeneous Markov decision process and a discounted problem. In Section 5 we prove average optimal value convergence for the nonhomogeneous Markov decision process. Section 6 summarizes the results of this paper.

2 Notation and Definitions

We generalize the notation of Bean, Smith and Lasserre [1990] for undiscounted nonhomogeneous Markov decision processes.

We observe a process at time points k = 0, 1, ... to be in one of a countable number of states i = 1, 2, .... The decision maker chooses a policy in stage k, $x_k \in X_k$, by selecting an action, $x_k^i$, from the finite set, $X_k^i$, for each state i = 1, 2, .... An infinite horizon feasible strategy, x, is an infinite sequence of policies. It resides in the set of feasible strategies, X. A finite horizon strategy, x(k, N), is a sequence of policies from time k through time N - 1. Even though a finite horizon feasible strategy consists of a finite number of policies, we require that $x(k, N) \in X$ by allocating arbitrary policies before time k and after time N. Also, if k = 0, denote $x(N) \equiv x(0, N)$.

We use an asterisk to represent the optimality of an action, policy or strategy in the minimum class to which it belongs. For example, $x^*(N)$ is an N-horizon optimal strategy.

The set of all feasible strategies, X, is compact in the metric topology presented in Bean, Smith and Lasserre. The metric, $\rho$, is defined
$$\rho(x, \tilde{x}) = \sum_{k=0}^{\infty} \sum_{i=1}^{\infty} \delta_k^i(x, \tilde{x})\, 2^{-(k+i)} \quad \text{for all } x, \tilde{x} \in X,$$
where
$$\delta_k^i(x, \tilde{x}) = \begin{cases} 0, & \text{if } x_k^i = \tilde{x}_k^i, \\ 1, & \text{otherwise.} \end{cases}$$
Using this metric we define algorithmic optimality, first introduced in Hopp, Bean and Smith [1987].

Definition. Under this topology, an infinite horizon strategy, $\hat{x}$, is called algorithmically optimal if, for some sequence of integers, $\{N_m\}_{m=1}^{\infty}$, $x^*(N_m) \to \hat{x}$ in the $\rho$-metric as $m \to \infty$.

If we take action $x_k^i$ in state i at time k, then, independent of past actions, two things happen:

1. we gain a reward $r_k^i(x_k^i)$. The vector of such rewards is $R_k(x_k)$.

2. we transit to states, j, at time k + 1 according to the probability transition matrix $\{p_k^{ij}(x_k)\} \equiv P_k(x_k)$.

Note that both the rewards and transition probabilities may be stage dependent.
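To make the strategy topology concrete, the following sketch (ours, not part of the paper) evaluates a truncated version of the metric $\rho$ for strategies represented as functions of (stage, state). The function name rho and the truncation limits are illustrative assumptions; the neglected tail of the double sum is small because the weights $2^{-(k+i)}$ are summable.

```python
def rho(x, y, num_stages, num_states):
    """Truncated evaluation of the strategy metric
    rho(x, y) = sum_k sum_i delta_k^i(x, y) * 2^-(k+i),
    where delta_k^i is 0 if the two strategies choose the same action for
    state i at stage k and 1 otherwise.

    x, y: callables with x(k, i) returning the action chosen in state i at stage k.
    The infinite double sum is cut off at num_stages stages and num_states states.
    """
    total = 0.0
    for k in range(num_stages):             # stages k = 0, 1, ...
        for i in range(1, num_states + 1):  # states i = 1, 2, ...
            if x(k, i) != y(k, i):
                total += 2.0 ** -(k + i)
    return total

# Two strategies that differ only in the action for state 1 at stage 0:
x = lambda k, i: 0
y = lambda k, i: 1 if (k, i) == (0, 1) else 0
print(rho(x, y, num_stages=20, num_states=20))  # 2^-(0+1) = 0.5
```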

The basis for many optimality criteria is the finite horizon reward function. Given an infinite horizon strategy, x, and a one period discount factor, $0 < \alpha \le 1$, the expected net present value of the total rewards from time k to time N, N > k, at the beginning of stage k, is written $V_k(x; N)$. Note that in evaluating $V_k(x; N)$, the first k policies of x are ignored. Let $V_k(x; N)$ map into $R^{\infty}$ with ith element $V_k^i(x; N)$, which represents the expected net present profit from state i in stage k through stage N under strategy x. Note that $V_k(x^*(N); N) = V_k(x^*(k, N); N)$ for all k = 0, 1, ... by the principle of optimality. The value function from stage k to stage N is written
$$V_k(x; N) = \sum_{n=k}^{N-1} \alpha^{n-k}\, T_k^n(x)\, R_n(x_n),$$
where
$$T_k^n(x) = \prod_{l=k}^{n-1} P_l(x_l), \quad n > k \ge 0, \qquad T_k^n(x) = I, \quad n \le k.$$

Throughout the paper we make the following assumptions:

Assumptions

1. The state space, I, is countable.

2. The number of decisions available is finite for all states, i.e., $|X_k^i| < \infty$ for all i and k, where |A| is the cardinality of set A.

3. Rewards are uniformly bounded over all states and decisions, i.e., for some $R < \infty$, $|r_k^i(x_k^i)| \le R$ for all $i \in I$, k = 0, 1, ... and $x_k^i \in X_k^i$.

4. From each state, i, at stage k, under strategy x, the set of reachable states, $\{j \mid p_k^{ij}(x_k^i) > 0\}$, is finite. That is, only a finite set of states is reachable in one transition from any state, under any action. Further, $\max\{j \mid p_k^{ij}(x_k^i) > 0\}$ is uniformly bounded over $x_k \in X_k$ for each stage, k.
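Returning to the finite horizon value function above, the following sketch (ours, not from the paper) evaluates $V_k(x; N)$ for a fixed strategy using the equivalent backward recursion $V_n = R_n + \alpha P_n V_{n+1}$ with $V_N = 0$, which avoids forming the matrix products $T_k^n(x)$ explicitly. The two-state instance and the function name are made up for illustration.

```python
import numpy as np

def finite_horizon_value(P, R, k, N, alpha=1.0):
    """Compute V_k(x; N) = sum_{n=k}^{N-1} alpha^{n-k} T_k^n(x) R_n(x_n)
    for a fixed strategy x, via the backward recursion
    V_n = R_n + alpha * P_n V_{n+1}, with V_N = 0.

    P[n] : transition matrix P_n(x_n) under the fixed strategy (S x S array)
    R[n] : reward vector R_n(x_n) under the fixed strategy (length S)
    """
    V = np.zeros(len(R[0]))               # V_N(x; N) = 0
    for n in range(N - 1, k - 1, -1):     # n = N-1, ..., k
        V = R[n] + alpha * P[n] @ V
    return V

# A tiny nonhomogeneous instance with 2 states and horizon N = 3 (made-up data):
P = [np.array([[0.6, 0.4], [0.3, 0.7]]),
     np.array([[0.5, 0.5], [0.2, 0.8]]),
     np.array([[0.9, 0.1], [0.4, 0.6]])]
R = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0])]
print(finite_horizon_value(P, R, k=0, N=3, alpha=1.0))
```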

In the infinite horizon problem, with discount factor $\alpha$, $0 < \alpha \le 1$, define $x^*$ to be an $\alpha$-optimal strategy if
$$V_0(x^*) \ge V_0(x) \quad \text{for all } x \in X, \qquad \text{where } V_0(x) = \lim_{N \to \infty} V_0(x; N).$$
This definition is valid if the limit exists. By Assumption 3 it exists whenever $\alpha < 1$. However, the primary interest of this paper is the case $\alpha = 1$. In this case it is possible that $V_0(x; N)$ diverges with N. Then we define $x^*$ to be average optimal if
$$\liminf_{N \to \infty} \frac{V_0(x^*; N)}{N} \ge \liminf_{N \to \infty} \frac{V_0(x; N)}{N} \quad \text{for all } x \in X.$$
Assumption 3 implies that the liminf is always finite.

3 Weak Ergodicity

In this section, we formally define weak ergodicity and the corresponding ergodic coefficients. Let
$$P_{n,N}(x) = P_0\, P_n(x_n)\, P_{n+1}(x_{n+1}) \cdots P_{N-1}(x_{N-1}),$$
where $P_0$ is a starting vector (initial distribution). Let $Q_{n,N}(x)$ be the same forward product with starting vector $Q_0$. If $P = (p_j)$ is a vector, we define the norm of P to be
$$\|P\| = \sum_{j=1}^{\infty} |p_j|.$$
If $P = (p_{ij})$ is a square matrix, we define the norm of P to be
$$\|P\| = \sup_i \sum_{j=1}^{\infty} |p_{ij}|.$$

Definition. A nonhomogeneous Markov decision process is called weakly ergodic if and only if, for all n,
$$\lim_{N \to \infty} \sup_{P_0, Q_0} \|P_{n,N}(x) - Q_{n,N}(x)\| = 0 \quad \text{for all } x \in X,$$

and is called strongly ergodic if and only if there exists a vector $q(x) = (q_1(x), q_2(x), \ldots)$, with $\|q(x)\| = 1$ and $q_i(x) \ge 0$ for all i = 1, 2, ..., such that for all n
$$\lim_{N \to \infty} \sup_{P_0} \|P_{n,N}(x) - q(x)\| = 0 \quad \text{for all } x \in X.$$
That is, a nonhomogeneous Markov decision process is weakly ergodic if and only if it eventually loses memory of the starting vector. For a problem to be strongly ergodic, the process not only must lose memory, but must also converge to a fixed probability vector.

It is often difficult to determine whether a specific problem satisfies these definitions. To facilitate the identification of weak ergodicity, we define several ergodic coefficients: the Ross coefficient ($\alpha_0$), the Doeblin coefficient ($\beta$), and the Hajnal coefficient ($\gamma$).

Definition. Ergodic Coefficients:

* Ross coefficient:
$$\alpha_0 = \sup_k \sup_{x_k \in X_k} \alpha_0(P_k(x_k)), \quad \text{where } \alpha_0(P_k(x_k)) = 1 - \sup_j \inf_i p_k^{ij}(x_k).$$

* Doeblin coefficient:
$$\beta = \sup_k \sup_{x_k \in X_k} \beta(P_k(x_k)), \quad \text{where } \beta(P_k(x_k)) = 1 - \sum_{j=1}^{\infty} \inf_i p_k^{ij}(x_k).$$

* Hajnal coefficient:
$$\gamma = \sup_k \sup_{x_k \in X_k} \gamma(P_k(x_k)), \quad \text{where } \gamma(P_k(x_k)) = 1 - \inf_{i_1, i_2} \sum_j \min\!\left(p_k^{i_1 j}(x_k),\, p_k^{i_2 j}(x_k)\right).$$

We call $\alpha_0$ the Ross coefficient since the homogeneous version was used in Ross [1968]. For the nonhomogeneous case, Hopp, Bean, and Smith use this coefficient to prove the average optimality of an algorithmically optimal strategy. Alden and Smith use the Doeblin coefficient to show that the error between a rolling horizon strategy and a discounted optimal strategy goes to zero when $\beta < 1$. The Hajnal coefficient was first introduced by Dobrushin [1956], followed by several papers and books such as Hajnal [1958], Paz [1963] and Paz [1971]. For applications of this coefficient, see Hopp [1989].
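For a single stochastic matrix, the three per-matrix coefficients can be computed directly; the stage-wise coefficients defined above are the suprema of these quantities over all stages k and actions $x_k$. The sketch below is our illustration, not part of the paper; the function names and the example matrix are made up.

```python
import numpy as np

def ross_coefficient(P):
    """alpha_0(P) = 1 - max_j min_i p_ij."""
    return 1.0 - P.min(axis=0).max()

def doeblin_coefficient(P):
    """beta(P) = 1 - sum_j min_i p_ij."""
    return 1.0 - P.min(axis=0).sum()

def hajnal_coefficient(P):
    """gamma(P) = 1 - min over row pairs (i1, i2) of sum_j min(p_{i1 j}, p_{i2 j})."""
    S = P.shape[0]
    return 1.0 - min(np.minimum(P[i1], P[i2]).sum()
                     for i1 in range(S) for i2 in range(S))

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])
print(ross_coefficient(P), doeblin_coefficient(P), hajnal_coefficient(P))
# For this matrix: gamma = 0.4 <= beta = 0.5 <= alpha_0 = 0.8.
```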

The following lemma describes the relationship between the coefficients and the property of weak ergodicity. The proofs are straightforward and omitted.

Lemma 1
a) $\alpha_0 < 1$ if and only if $\beta < 1$.
b) $\alpha_0 \ge \beta$.
c) If $\beta < 1$ then $\gamma < 1$.
d) If any of $\alpha_0 < 1$, $\beta < 1$, or $\gamma < 1$ holds, then the nonhomogeneous Markov decision process is weakly ergodic.

Even though we know from Lemma 1 that the Hajnal condition ($\gamma < 1$) is the weakest of the three conditions implying weak ergodicity, we will use the Doeblin coefficient to show many of the results in Section 4. The advantage of the Doeblin coefficient is that we can transform the undiscounted Markov decision process into an equivalent discounted Markov decision process by exploiting $\beta$ as a discount factor. We can also transform using $\alpha_0$, but since $\alpha_0 \ge \beta$, the Doeblin coefficient can lead to faster convergence when we solve the equivalent discounted problem.

4 A Discounted Equivalent Problem

The traditional transformation from a nonhomogeneous problem to a homogeneous problem defines states in the homogeneous problem as a (time, state) pair. If the original problem has a countable number of states, then so does the transformed problem. However, even if the nonhomogeneous problem satisfies the conditions for weak ergodicity in Lemma 1, the transformed, homogeneous problem may not. For example, begin with a finite state problem where transitions in an even numbered stage occur with transition probability matrix $P_{\text{even}}$ and in odd numbered stages follow $P_{\text{odd}}$. An equivalent homogeneous problem would have transition matrix
$$P = \begin{pmatrix} 0 & P_{\text{even}} & 0 & 0 & \cdots \\ 0 & 0 & P_{\text{odd}} & 0 & \cdots \\ \vdots & & & \ddots & \end{pmatrix}.$$

Since each of the columns of P contains predominantly zeroes, none of the sufficient ergodic conditions of which we are aware is satisfied (see Federgruen and Tijms [1978]). We now present an improved transformation that preserves the Doeblin condition for weak ergodicity. Alden and Smith proved the finite state version of the following theorem. The extension to the countable state case is straightforward and omitted.

Theorem 1 Every one step probability transition matrix can be expressed as a convex combination of another stochastic matrix and a stable matrix (a stochastic matrix with identical rows), using the ergodic coefficient $\beta$ as a multiplier. That is, for all k and all $x_k \in X_k$,
$$P_k(x_k) = \beta\, \bar{P}_k(x_k) + (1 - \beta) L_k,$$
where $\bar{P}_k(x_k)$ is a stochastic matrix, $L_k$ is a stable matrix independent of $x_k$, and $0 \le \beta \le 1$.

Solving for $\bar{P}_k(x_k)$ in the case $\beta > 0$ we have
$$\bar{P}_k(x_k) = \frac{P_k(x_k) - (1 - \beta) L_k}{\beta},$$
for each k and each $x_k$.

Let (P) be the original problem defined in Section 2. Based on the above theorem, define another class of nonhomogeneous Markov decision processes, $(\bar{P})$.

$(\bar{P})$ The $\beta$-discounted nonhomogeneous Markov decision process with probability transition matrices $\bar{P}_k(x_k)$, rewards $R_k(x_k)$, value functions $\bar{V}_k(\cdot)$, infinite horizon optimal strategies $\bar{x}^*$ and finite horizon optimal strategies $\bar{x}^*(k, N)$.
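The convex combination in Theorem 1 is easy to exhibit for a single stochastic matrix: take the row of the stable matrix to be the normalized vector of column minima. The sketch below is our per-matrix construction, not the paper's proof; it uses that matrix's own Doeblin coefficient as the multiplier, whereas the theorem applies the same idea uniformly over stages and actions with the global coefficient $\beta$.

```python
import numpy as np

def doeblin_split(P):
    """Per-matrix illustration of the convex combination in Theorem 1:
        P = beta * Pbar + (1 - beta) * L,
    where beta = 1 - sum_j min_i p_ij is the Doeblin coefficient of P,
    L is the stable matrix whose identical rows are the normalized column
    minima of P, and Pbar is again stochastic.  Assumes 0 < beta < 1."""
    m = P.min(axis=0)                            # column minima, sum = 1 - beta
    beta = 1.0 - m.sum()
    L = np.tile(m / m.sum(), (P.shape[0], 1))    # stable matrix
    Pbar = (P - (1.0 - beta) * L) / beta         # solve for Pbar
    return beta, Pbar, L

P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4]])
beta, Pbar, L = doeblin_split(P)
assert np.allclose(beta * Pbar + (1.0 - beta) * L, P)        # recovers P
assert np.allclose(Pbar.sum(axis=1), 1.0) and (Pbar >= -1e-12).all()  # Pbar stochastic
```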

We transform the original undiscounted nonhomogeneous Markov decision process, (P), into the $\beta$-discounted nonhomogeneous Markov decision process, $(\bar{P})$, using the ergodic coefficient $\beta$. This generalizes an approach by Ross since it considers nonhomogeneous problems and uses the more efficient Doeblin coefficient. The following lemma shows that the finite horizon optimal value of (P) can be obtained from $(\bar{P})$ and that the set of finite horizon optimal solutions of (P) is equal to that of $(\bar{P})$. It generalizes a result in Alden and Smith from finite states to countable states. It is similar in structure to the result in Alden and Smith, but the lemma below expresses $V_k^i(x^*(k, N); N)$ in terms of $\bar{V}$, whereas the earlier paper expresses it in terms of V and $\bar{V}$. We denote an element of the matrix $L_k$ as $L_k^{ij}$. However, since $L_k$ is stable, $L_k^{ij}$ is independent of i.

Lemma 2 Under the condition that $\beta < 1$, we can represent the finite horizon optimal value of (P) as a function of the finite horizon optimal value of $(\bar{P})$, i.e.,
$$V_k^i(x^*(k, N); N) = \bar{V}_k^i(\bar{x}^*(k, N); N) + (1 - \beta) \sum_{l=k+1}^{N} \sum_{j=1}^{\infty} L_{l-1}^{ij}\, \bar{V}_l^j(\bar{x}^*(l, N); N),$$
for all states i and all k = 0, ..., N - 1. Moreover, the finite horizon optimal strategy set of (P) is equal to that of $(\bar{P})$, i.e., $x^*(k, N) = \bar{x}^*(k, N)$ for all k = 0, ..., N - 1.

Proof Let $V_k(N) \equiv V_k(x^*(N); N) = V_k(x^*(k, N); N)$ and $\bar{V}_k(N) \equiv \bar{V}_k(\bar{x}^*(N); N) = \bar{V}_k(\bar{x}^*(k, N); N)$. We will prove the result by induction on k. For k = N - 1,
$$V_{N-1}^i(N) = \max_{x_{N-1}^i} \{r_{N-1}^i(x_{N-1}^i)\} = \bar{V}_{N-1}^i(N),$$
thus $x^*(N-1, N) = \bar{x}^*(N-1, N)$. Now fix k with $0 \le k \le N - 2$ and assume that the result holds for period k + 1. If $\beta = 0$ then the result holds as above. If $\beta > 0$ then,
$$V_k^i(N) = \max_{x_k^i} \Big\{ r_k^i(x_k^i) + \sum_{j=1}^{\infty} p_k^{ij}(x_k^i)\, V_{k+1}^j(N) \Big\}$$
$$= \max_{x_k^i} \Big\{ r_k^i(x_k^i) + \sum_{j=1}^{\infty} \big[\beta\, \bar{p}_k^{ij}(x_k^i) + (1 - \beta) L_k^{ij}\big] \Big[ \bar{V}_{k+1}^j(N) + (1 - \beta) \sum_{l=k+2}^{N} \sum_{m=1}^{\infty} L_{l-1}^{jm}\, \bar{V}_l^m(N) \Big] \Big\}$$

$$= \max_{x_k^i} \Big\{ r_k^i(x_k^i) + \beta \sum_{j=1}^{\infty} \bar{p}_k^{ij}(x_k^i)\, \bar{V}_{k+1}^j(N) \Big\} + (1 - \beta) \sum_{l=k+1}^{N} \sum_{j=1}^{\infty} L_{l-1}^{ij}\, \bar{V}_l^j(N)$$
$$= \bar{V}_k^i(N) + (1 - \beta) \sum_{l=k+1}^{N} \sum_{j=1}^{\infty} L_{l-1}^{ij}\, \bar{V}_l^j(N),$$
since $L_{l-1}^{jm} = L_{l-1}^{im}$ for all i, j and m (the rows of the stable matrix $L_{l-1}$ are identical), which is the desired result. Also, from the second to last equation we can see the equivalence of the solution sets, since the last term of that equation is independent of $x_k^i$. □

The above lemma is interesting since both the finite horizon optimal solution and value of the original undiscounted nonhomogeneous Markov decision process, (P), can be obtained by solving the $\beta$-discounted nonhomogeneous Markov decision process, $(\bar{P})$. Now, we prove the main theorem of this section.

Theorem 2 Under Assumptions 1 through 4 and the condition that $\beta < 1$, any algorithmically optimal strategy for $(\bar{P})$ is an average optimal strategy for (P).

Proof From Lemma 2, any algorithmically optimal strategy of $(\bar{P})$ is an algorithmically optimal strategy of (P). Theorem 5 of Bean, Smith and Lasserre shows that an algorithmically optimal solution is an average optimal solution under these hypotheses. Hence, the result follows. □

Now a traditional transformation to a discounted, homogeneous problem can be carried out.
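Lemma 2 can be checked numerically on a small instance. The sketch below (ours, not from the paper) builds a nonhomogeneous problem for which the Theorem 1 decomposition holds by construction, solves both (P) and $(\bar{P})$ by backward induction, and verifies the identity of Lemma 2; all instance data and helper names are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, N, beta = 3, 2, 5, 0.6            # states, actions, horizon, mixing weight

def random_stochastic(shape):
    M = rng.random(shape)
    return M / M.sum(axis=-1, keepdims=True)

# Build an instance for which the Theorem 1 decomposition holds by construction:
# P_k(a) = beta * Pbar_k(a) + (1 - beta) * L_k, with L_k stable and action-free.
Pbar = [[random_stochastic((S, S)) for _ in range(A)] for _ in range(N)]
L = [np.tile(random_stochastic((S,)), (S, 1)) for _ in range(N)]
P = [[beta * Pbar[k][a] + (1 - beta) * L[k] for a in range(A)] for k in range(N)]
R = [rng.random((A, S)) for _ in range(N)]   # R[k][a][i] = reward r_k^i(a)

def backward(Pmats, Rmats, disc):
    """Finite horizon optimal values V_k^i by backward induction."""
    V = [np.zeros(S) for _ in range(N + 1)]
    for k in range(N - 1, -1, -1):
        Q = np.stack([Rmats[k][a] + disc * Pmats[k][a] @ V[k + 1] for a in range(A)])
        V[k] = Q.max(axis=0)
    return V

V = backward(P, R, disc=1.0)           # undiscounted problem (P)
Vbar = backward(Pbar, R, disc=beta)    # beta-discounted problem (Pbar)

# Lemma 2: V_k^i = Vbar_k^i + (1 - beta) * sum_{l > k} (row of L_{l-1}) . Vbar_l
for k in range(N):
    corr = (1 - beta) * sum(L[l - 1][0] @ Vbar[l] for l in range(k + 1, N))
    assert np.allclose(V[k], Vbar[k] + corr)
```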

5 Average Value Convergence

In a direct forecast horizon approach to the nonhomogeneous Markov decision process we seek the optimal policy for the first stage, since we must implement that policy now. We proceed, as in Bean, Smith and Lasserre, by truncating the infinite horizon problem at some finite horizon. We then solve the finite horizon problem and test whether the policy for the first stage is optimal for the infinite horizon problem. Such approaches require convergence of the finite horizon optimal values to the infinite horizon optimal value. See the Appendix of Hopp, Bean and Smith for an example where this property fails. This convergence implies that the average value obtained by solving a sufficiently long finite horizon problem is within any chosen $\epsilon$ of the infinite horizon average optimal value. Average optimal value convergence justifies the truncation of an infinite horizon problem to a sufficiently long finite horizon problem, an approach commonly used for real world problems.

In this section, we establish conditions under which we can prove that optimal average values converge. If $\alpha < 1$ the question is trivial. Below we assume that $\alpha = 1$. Mathematically, average optimal value convergence is
$$\liminf_{N \to \infty} \frac{V_0^i(x^*(N); N)}{N} = \liminf_{N \to \infty} \frac{V_0^i(x^*; N)}{N} \quad \text{for all } i \in I.$$
We begin with a technical lemma and then prove the main theorem.

Lemma 3 Let $\hat{x}$ be an algorithmically optimal strategy, so that for some $\{N_m\}_{m=1}^{\infty}$, $x^*(N_m) \to \hat{x}$. Then, under Assumptions 1 through 4 and the condition that $\beta < 1$, for any times k and N with k < N, and any state i,
$$V_k^i(x^*(N); N) - V_k^i(\hat{x}; N) \le \frac{2R}{1 - \beta}.$$

Proof By Assumption 4 and the definition of the $\rho$ metric, for m sufficiently large, $\hat{x}$ and $x^*(N_m)$ agree in action for all states j and for all times through N. Hence,
$$V_k(x^*(N_m); N_m) = V_k(\hat{x}; N) + T_k^N(\hat{x})\, V_N(x^*(N_m); N_m). \tag{1}$$
Define $\tilde{x}$ as the concatenation of policies from $x^*(N)$ for times 0 to N with policies from $x^*(N_m)$ for N and beyond. Then
$$V_k(\tilde{x}; N_m) = V_k(x^*(N); N) + T_k^N(x^*(N))\, V_N(x^*(N_m); N_m). \tag{2}$$
Subtracting (2) from (1) and recognizing the superiority of $x^*(N_m)$ to $\tilde{x}$ over the horizon $N_m$, we get
$$0 \le V_k(\hat{x}; N) - V_k(x^*(N); N) + \big[T_k^N(\hat{x}) - T_k^N(x^*(N))\big] V_N(x^*(N_m); N_m), \tag{3}$$
where 0 is a vector of zeroes. The last term in (3) is equal to
$$\sum_{n=N}^{N_m - 1} \big[T_k^N(\hat{x}) - T_k^N(x^*(N))\big]\, T_N^n(x^*(N_m))\, R_n(x_n^*(N_m)).$$
By Lemma 4 of Bean, Smith and Lasserre this is bounded above by $2R/(1 - \beta)$. Hence, the result follows. □

Theorem 3 Under Assumptions 1 through 4 and the condition that $\beta < 1$, the optimal values of the finite horizon problems converge to the optimal value of the infinite horizon problem, that is,
$$\liminf_{N \to \infty} \frac{V_0(x^*(N); N)}{N} = \liminf_{N \to \infty} \frac{V_0(x^*; N)}{N}.$$

Proof Let $\hat{x}$ be any algorithmically optimal strategy. The existence of such a strategy is proved in Theorem 1 of Bean, Smith and Lasserre. For any given N, by Lemma 3,
$$V_0(x^*(N); N) - V_0(\hat{x}; N) \le \frac{2R}{1 - \beta}.$$
Dividing by N and taking the liminf gives
$$\liminf_{N \to \infty} \frac{V_0(x^*(N); N)}{N} = \liminf_{N \to \infty} \frac{V_0(\hat{x}; N)}{N}.$$
By Theorem 5 of Bean, Smith and Lasserre,
$$\liminf_{N \to \infty} \frac{V_0(\hat{x}; N)}{N} = \liminf_{N \to \infty} \frac{V_0(x^*; N)}{N}.$$
Hence, the result follows. □

Theorem 3 suggests a conceptual, tail value algorithm similar to that in Bes and Lasserre [1986] for the discounted problem; a sketch of the resulting loop in code follows the two steps below.

1. Choose $\epsilon > 0$ and, by Theorem 3, choose N such that the average value of the finite horizon optimum is guaranteed to be within $\epsilon$ of the infinite horizon optimal average value.

2. Find all strategies with finite horizon average value within $2\epsilon$ of the optimal finite horizon value. By Theorem 3, no strategy outside of this set can be optimal. If all strategies in the set begin with the same policy at time 0, that policy must be optimal for the infinite horizon problem. Otherwise, decrease $\epsilon$ and go to Step 1.
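A rough rendering of this conceptual procedure (our sketch, not the paper's algorithm) is given below. It assumes two placeholder routines: horizon_for_epsilon(eps), returning a horizon N with the guarantee provided by Theorem 3, and best_value_per_first_policy(N), returning for each feasible time-0 policy the best N-horizon average value attainable when that policy is used first.

```python
def first_policy_by_elimination(best_value_per_first_policy, horizon_for_epsilon,
                                eps=1.0, shrink=0.5, max_rounds=20):
    """Conceptual tail value elimination loop (a sketch under the stated
    assumptions, not the paper's algorithm verbatim).

    best_value_per_first_policy(N): dict mapping each feasible time-0 policy to
        the best N-horizon average value attainable when it is used first.
    horizon_for_epsilon(eps): a horizon N such that the finite horizon optimal
        average value is within eps of the infinite horizon optimal average
        value (such an N exists by Theorem 3)."""
    for _ in range(max_rounds):
        N = horizon_for_epsilon(eps)
        values = best_value_per_first_policy(N)
        best = max(values.values())
        # Keep only time-0 policies within 2*eps of the finite horizon optimum;
        # no policy outside this set can begin an optimal infinite horizon strategy.
        survivors = [pi for pi, v in values.items() if v >= best - 2 * eps]
        if len(survivors) == 1:
            return survivors[0]       # this first policy is infinite horizon optimal
        eps *= shrink                 # otherwise sharpen the tolerance and repeat
    return None                       # no decision reached within max_rounds
```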

6 Summary

This paper presents several structural results for an infinite state nonhomogeneous Markov decision process with average reward criterion. First, under the Doeblin condition, the problem is shown to be equivalent to a discounted problem. Second, under the same condition, we show that the finite horizon optimal average values converge to the infinite horizon optimal average value.

References

[1] Alden, J. and R. Smith [1992], "Rolling Horizon Procedures in Nonhomogeneous Markov Decision Processes," Operations Research, Vol. 40, Suppl. 2, May-June, 1992.

[2] Bean, J., R. Smith and J. Lasserre [1990], "Denumerable State Nonhomogeneous Markov Decision Processes," Journal of Mathematical Analysis and Applications, Vol. 153, pp. 64-77.

[3] Bes, C. and J. Lasserre [1986], "An On-Line Procedure in Discounted Infinite Horizon Stochastic Optimal Control," Journal of Optimization Theory and Applications, Vol. 50, pp. 61-67.

[4] Dobrushin, R. [1956], "Central Limit Theorems for Non-stationary Markov Chains II," Theory of Probability and Its Applications, Vol. 1, pp. 329-383.

[5] Federgruen, A. and H. Tijms [1978], "The Optimality Equation in Average Cost Denumerable State Semi-Markov Decision Problems, Recurrency Conditions and Algorithms," Journal of Applied Probability, Vol. 15, pp. 356-373.

[6] Freidenfelds, J. [1981], Capacity Expansion: Simple Models and Applications, North-Holland, Amsterdam.

[7] Hajnal, J. [1958], "Weak Ergodicity in Nonhomogeneous Markov Chains," Proceedings of the Cambridge Philosophical Society, Vol. 52, pp. 67-77.

[8] Hopp, W. [1989], "Identifying Forecast Horizons in Nonhomogeneous Markov Decision Processes," Operations Research, Vol. 37, pp. 339-343.

[9] Hopp, W., J. Bean and R. Smith [1987], "A New Optimality Criterion for Nonhomogeneous Markov Decision Processes," Operations Research, Vol. 35, pp. 875-883.

[10] Lohmann, J. [1984], "A Stochastic Replacement Economy Decision Model," Technical Report 84-11, Department of Industrial and Operations Engineering, The University of Michigan, Ann Arbor, MI 48109.

[11] Luss, H. [1982], "Operations Research and Capacity Expansion Problems: A Survey," Operations Research, Vol. 30, pp. 907-947.

[12] Nelson, R. and S. Winter [1982], An Evolutionary Theory of Economic Change, Belknap Press.

[13] Paz, A. [1963], "Graph-Theoretic and Algebraic Characterizations of Some Markov Chains," Israel Journal of Mathematics, Vol. 3, pp. 169-180.

[14] Paz, A. [1971], Introduction to Probabilistic Automata, Academic Press, New York.

[15] Ross, S. [1968], "Non-Discounted Denumerable Markovian Decision Models," The Annals of Mathematical Statistics, Vol. 39, pp. 412-423.

[16] Sobel, M. [1971], "Production Smoothing with Stochastic Demand II: Infinite Horizon Case," Management Science, Vol. 17, pp. 724-735.