Denumerable State Nonhomogeneous Markov Decision Processes

James C. Bean and Robert L. Smith
Department of Industrial and Operations Engineering
The University of Michigan
Ann Arbor, MI 48109-2117

Jean B. Lasserre
LAAS
7 avenue du Colonel Roche
31077 Toulouse, France

Technical Report 89-15
April 1989

DENUMERABLE STATE NONHOMOGENEOUS MARKOV DECISION PROCESSES

James C. Bean†, Jean B. Lasserre‡, Robert L. Smith†

April 19, 1989

ABSTRACT

We consider denumerable state nonhomogeneous Markov decision processes and extend results from both denumerable state homogeneous and finite state nonhomogeneous problems. We show that, under weak ergodicity, accumulation points of finite horizon optima (termed algorithmic optima) are average cost optimal. We also establish the existence of solution horizons. Finally, an algorithm is presented to solve problems of this class for the case where there is a unique algorithmic optimum.

Research on Markov decision problems has concentrated on homogeneous problems. Of most interest here are problems with denumerable state spaces, such as in Federgruen, Schweitzer and Tijms [1978], Cavazos-Cadena [1986], and Hernandez-Lerma and Lasserre [1988]. The usual optimality criterion employed is average optimality. Traditional solution procedures include policy iteration or fixed point methods.

Much less work has been done on the nonhomogeneous version of the problem. Almost all of this work has considered finite state spaces. Recent works include Hopp, Bean, and Smith [1987] and Hopp [1985]. The primary optimality criterion used is algorithmic optimality (i.e., solutions which are accumulation points of finite horizon optima) since we no longer have the optimality equations for average optimality. Traditional solution procedures use solution horizons with cost based stopping rules.

† Department of Industrial and Operations Engineering, The University of Michigan, Ann Arbor, Michigan 48109. The work of these authors was supported in part by the National Science Foundation under Grant ECS-8700836 to The University of Michigan and by a Rackham Grant for International Research from The University of Michigan.
‡ LAAS, 7 avenue du Colonel Roche, 31077 Toulouse, France.

A solution horizon is a time such that solving a finite horizon problem to that horizon or beyond gives a first period policy in agreement with the infinite horizon optimal first policy.

Despite these differences, the problems have a great deal in common. Any finite state nonhomogeneous Markov decision process can be reformulated as a denumerable state homogeneous problem. States are relabeled to include time period and state designation in the new formulation. However, the converse is not always true. Some denumerable homogeneous problems cannot be transformed into finite nonhomogeneous problems. It is interesting that the finite nonhomogeneous problem is a special case of the denumerable homogeneous problem since the former is generally considered harder.

We attempt, in this paper, to investigate this relationship and expand the techniques and theory available to each problem. As a mechanism we consider the general denumerable state nonhomogeneous problem, which subsumes both of the traditional problems. We show that, under weak ergodicity, an algorithmically optimal strategy is also average optimal. We also provide a necessary and sufficient condition for a first period decision to be average optimal. Based on this result we propose a modified value iteration algorithm. At each step of the value iteration procedure a stopping rule permits elimination of some nonoptimal decisions until only one remains. For a given initial state, the optimal first period decision is obtained in a finite number of steps of the value iteration procedure. If we assume that, at each state, there are a finite number of one-step neighbors with strictly positive probability, then the modified value iteration procedure is implementable despite the countable state space.

Section 1 formally states the problem under consideration. Section 2 generalizes the solution horizon theory of Hopp, Bean, and Smith. Section 3 presents the stopping rule and the modified value iteration procedure. Finally, Section 4 includes a summary and conclusions.

1. Problem Statement

1.1 Notation

We generalize the notation of Hopp, Bean, and Smith. The decision maker chooses a policy in stage $k$, $x_k$, by selecting actions, $x_k^i$, from finite sets, for states $i = 1, 2, 3, \ldots$. An infinite horizon strategy, $x$, is an infinite sequence of policies. The set of all feasible strategies is denoted by $X$. Taking action $x_k^i$ in state $i$ of stage $k$ gains a reward, $r_k^i(x_k)$, and transition probabilities to all states in stage $k+1$, $p_k^{ij}(x_k)$, $j = 1, 2, 3, \ldots$. Let $R < \infty$ denote an upper bound on $|r_k^i(x_k)|$. Let $R_k(x_k)$ represent the column vector of rewards in stage $k$ and $P_k(x_k)$ represent the matrix of transition probabilities from all states in stage $k$ to all states in stage $k+1$. The vector of transition probabilities from state $i$ at stage $k$ is denoted $P_k(x_k; i)$. Note that both the rewards and transition probabilities may be stage dependent, i.e., nonhomogeneous.

If strategy $x$ is used and the one period discount factor is $0 < \alpha \le 1$, the expected net present value at the beginning of stage $k$ of the profit from stage $k$ to stage $N$, $N > k$, is written $V_k(x; N)$. In evaluating $V_k(x; N)$, the first $k-1$ policies of $x$ are irrelevant. The $V_k(\cdot)$ function maps into $R^{\infty}$ with the $i$th element given by $V_k^i(x; N)$, which represents the expected net present profit from stage $k$ to stage $N$ given that the process is in state $i$ at stage $k$. In general we are interested in the value function from stage zero onward, which is written

$$V_0(x; N) = \sum_{n=0}^{N-1} \alpha^n T_0^n(x) R_n(x_n),$$

where

$$T_l^n(x) = \prod_{k=l}^{n-1} P_k(x_k), \quad n > l, \qquad \text{and} \qquad T_l^l(x) = I.$$

In the finite horizon problem, with discount factor $\alpha$, optimization is equivalent to maximization of $V_0(x; N)$ over $x \in X$. In the infinite horizon problem define $x^*$ to be $\alpha$-optimal, for $0 < \alpha < 1$, if

$$\lim_{N \to \infty} V_0(x^*; N) \ge \lim_{N \to \infty} V_0(x; N), \quad \text{for all } x \in X.$$

If $\alpha = 1$, so that in general $V_0(x; N)$ diverges with $N$, we define $x^*$ to be average optimal when

$$\liminf_{N \to \infty} \frac{V_0(x^*; N)}{N} \ge \liminf_{N \to \infty} \frac{V_0(x; N)}{N}, \quad \text{for all } x \in X.$$
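Before proceeding, a small computational sketch (not from the report) may help fix the notation. It evaluates $V_0(x; N)$ for one fixed strategy, assuming the denumerable state space has been truncated to the finitely many states reachable from the initial state (in the spirit of Assumption (4) of Section 2.2) and that the stage data are supplied as NumPy arrays; all names are illustrative.

```python
import numpy as np

def finite_horizon_value(P, R, alpha=1.0):
    """Evaluate V_0(x; N) = sum_{n=0}^{N-1} alpha^n T_0^n(x) R_n(x_n)
    for a single fixed strategy x on a finite truncation of the state space.

    P : list of length N-1; P[n] is the stage-n transition matrix P_n(x_n).
    R : list of length N;   R[n] is the stage-n reward vector  R_n(x_n).
    Returns the vector whose i-th entry is V_0^i(x; N).
    """
    num_states = len(R[0])
    T = np.eye(num_states)            # T_0^0(x) = I
    V = np.zeros(num_states)
    for n, r in enumerate(R):
        V += (alpha ** n) * (T @ r)   # add alpha^n T_0^n(x) R_n(x_n)
        if n < len(P):
            T = T @ P[n]              # T_0^{n+1}(x) = T_0^n(x) P_n(x_n)
    return V

# Three-stage, two-state illustration with stage dependent (nonhomogeneous) data.
P = [np.array([[0.9, 0.1], [0.2, 0.8]]),
     np.array([[0.5, 0.5], [0.3, 0.7]])]
R = [np.array([1.0, 0.0]), np.array([2.0, 1.0]), np.array([0.0, 3.0])]
print(finite_horizon_value(P, R, alpha=0.95))
```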

Our assumption that $R < \infty$ implies that the liminfs are always finite. While this measure is the most commonly used, it is less than satisfactory since it allows inclusion of suboptimal leading policies (see Hopp, Bean and Smith). The following section presents a more restrictive measure of optimality.

1.2 Algorithmic Optimality

If optimal solutions to the finite horizon problems, $\{x^*(N)\}_{N=1}^{\infty} \subset X$, approach a particular infinite horizon strategy, we would like to consider it as the optimal strategy. The most useful implementation of this idea is algorithmic optimality, defined in Hopp, Bean, and Smith (called periodic forecast horizon optimality there). The term algorithmically optimal was first used in Schochetman and Smith [1989] and arises since precisely these solutions can be discovered by a forward looking algorithm. To formally define this notion we use a metric analogous to that presented in Bean and Smith [1984]. Let

$$\rho(x, y) = \sum_{k=0}^{\infty} \sum_{i=1}^{\infty} \sigma_k^i(x, y)\, 2^{-(i+k)}, \qquad \text{where } \sigma_k^i(x, y) = \begin{cases} 0, & \text{if } x_k^i = y_k^i \\ 1, & \text{otherwise.} \end{cases}$$

The topology induced by this metric is the product of product topologies. In particular, a sequence $x(n) \to x$ as $n \to \infty$ if and only if, for all $I, K < \infty$, there is an $N(I, K)$ such that $x_k^i(n) = x_k^i$ for $i = 1, 2, \ldots, I$ and $k = 1, 2, \ldots, K$, for all $n > N(I, K)$.

Definition: A strategy, $x^*$, is algorithmically optimal if there exists a subsequence of the integers, $\{N_m\}_{m=1}^{\infty}$, such that $x^*(N_m) \to x^*$ in the $\rho$ metric as $m \to \infty$. Note that $x^*(N_m)$ is composed of the finite optimal solution to the $N_m$ problem extended arbitrarily to cover the infinite horizon.

The following two theorems generalize results from Hopp, Bean, and Smith. They are stated here without the proofs, since they are obvious extensions.

Theorem 1: If $X$ is compact in the topology generated by $\rho$, an algorithmically optimal strategy exists for the denumerable nonhomogeneous Markov decision process.
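Returning to the metric $\rho$ defined above: as an aside (an illustrative sketch, not part of the original), it can be computed to any desired accuracy by truncating the double sum. Here strategies are assumed to be stored as nested mappings from stage and state to the chosen action, and the truncation bounds are arbitrary.

```python
def rho(x, y, stages=50, states=50):
    """Truncated strategy metric
    rho(x, y) = sum_{k>=0} sum_{i>=1} sigma_k^i(x, y) 2^{-(i+k)},
    where sigma_k^i is 0 when x and y choose the same action for state i at
    stage k and 1 otherwise.  The neglected tail is at most
    2^{-stages+1} + 2^{-states+1}.
    """
    total = 0.0
    for k in range(stages):
        for i in range(1, states + 1):
            if x[k][i] != y[k][i]:
                total += 2.0 ** (-(i + k))
    return total

# Two strategies that differ only in the action for state 1 at stage 0:
x = {k: {i: 0 for i in range(1, 51)} for k in range(50)}
y = {k: {i: 0 for i in range(1, 51)} for k in range(50)}
y[0][1] = 1
print(rho(x, y))   # 2^{-(1+0)} = 0.5
```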

As in the finite state problem, algorithmic optimality implies $\alpha$-optimality if the value function is convergent. This implication is a natural extension of the results of Hopp, Bean and Smith and is stated as Theorem 2.

Theorem 2: If Assumptions (2)-(4) (Section 2.2) hold and $\alpha < 1$, then algorithmic optimality implies $\alpha$-optimality.

Remark: The converse of Theorem 2 is false. For a counterexample see Ryan, Bean and Smith [1987].

If we assume that $R < \infty$, the profit functions can diverge only if $\alpha = 1$. In this case, we would like to show that algorithmic optimality implies average optimality. Algorithmic optimality would then be stronger than average optimality since algorithmic optimality discourages leading suboptimal policies. However, this implication is not true without additional conditions.

2. Solution Horizons and Average Optimality

We will show that in most denumerable nonhomogeneous problems the algorithmically optimal solutions are also average optimal, and that solution horizons exist leading to discovery of algorithmically optimal solutions. However, there are cases where these desirable characteristics fail. Consider the following example from Ross [1968]. We have a deterministic problem with state space $\{1, 2, \ldots\}$. At each state, $i$, there are two choices: remain in place and receive reward $i$, or move up one state and receive reward 1. The infinite horizon average reward optimal solution is to go up one state (to state $i$), remain there for $i$ transitions, then repeat. This is not an algorithmically optimal solution. We seek conditions to eliminate the possibility of such behavior. As in the finite nonhomogeneous case, the important characteristic is weak ergodicity.

2.1 Weak Ergodicity

A stochastic matrix is stable if it has identical rows. A scalar function, $\tau(\cdot)$, that is continuous in an appropriate topology on the set of doubly infinite stochastic matrices (treated as points in $R^{\infty \times \infty}$) such that $0 \le \tau(P) \le 1$ for any stochastic matrix, $P$, is a coefficient of ergodicity. It is called proper if $\tau(P) = 0$ if and only if $P$ is stable. See Iosifescu [1972] for a full discussion.

Definition: The Markov chain formed by strategy $x$ achieves weak ergodicity if

$$\lim_{n \to \infty} \tau(T_l^n(x)) = 0, \quad \text{for all } l \ge 0,$$

where $\tau(\cdot)$ is a proper coefficient of ergodicity.

The most commonly discussed coefficient of ergodicity is the Hajnal coefficient,

$$\tau_1(P) = \sup_{i, s} \; \frac{1}{2} \sum_{j=1}^{\infty} \left| p_{ij} - p_{sj} \right|.$$

For example, see Iosifescu; Hopp, Bean and Smith.

Lemma 3: For any infinite sequence of stochastic matrices defined by any strategy $x$, $\{P_k(x_k)\}_{k=0}^{\infty}$, the following are true:

(a) $\tau_1(\cdot)$ is a proper coefficient of ergodicity;

(b) For any $l \ge 0$,
$$\tau_1(T_l^n(x)) \le \prod_{k=l}^{n-1} \tau_1(P_k(x_k)).$$

Proof: Iosifescu. ∎

Lemma 4: For any three stochastic matrices $P_1, P_2, T$, vector $R$ with $|R^k| \le R$, and proper coefficient of ergodicity, $\tau(T)$,

$$(P_1 - P_2)\, T\, R \le \big(2 \tau(T) R\big)\, e,$$

where $e$ is a vector of ones.

Proof: Omitted. ∎
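For a finite truncation of the state space, the Hajnal coefficient is straightforward to evaluate numerically. The sketch below (illustrative, not from the report) takes the maximum over row pairs of half the $\ell_1$ distance between rows, and can also be used to check the submultiplicativity in Lemma 3(b) on examples.

```python
import numpy as np

def hajnal_coefficient(P):
    """Hajnal (Dobrushin) coefficient of ergodicity of a stochastic matrix:
    tau_1(P) = sup over row pairs (i, s) of (1/2) * sum_j |p_ij - p_sj|.
    For a finite truncation the sup is a max over all row pairs.
    """
    P = np.asarray(P, dtype=float)
    tau = 0.0
    for i in range(P.shape[0]):
        for s in range(i + 1, P.shape[0]):
            tau = max(tau, 0.5 * np.abs(P[i] - P[s]).sum())
    return tau

# tau_1 = 0 for a stable (identical-row) matrix, 1 for a permutation matrix.
print(hajnal_coefficient([[0.5, 0.5], [0.5, 0.5]]))   # 0.0
print(hajnal_coefficient([[1.0, 0.0], [0.0, 1.0]]))   # 1.0
print(hajnal_coefficient([[0.9, 0.1], [0.2, 0.8]]))   # 0.7
```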

The probability distribution on the states in stage $n$, starting from each of the initial states, is given by the matrix $P_0(x_0) T_1^n(x)$. If $\tau_1(T_1^n(x)) \to 0$, so that $T_1^n(x)$ has asymptotically identical rows as $n \to \infty$, then $P_0(x_0) T_1^n(x)$ also approaches a matrix of identical rows irrespective of the $P_0(x_0)$ matrix. The probabilities on the states in stage $n$, and hence the expected rewards in stage $n$ and subsequent stages, become independent of $x_0$ as $n$ increases.

2.2 Assumptions

The following assumptions are invoked in subsequent results:

(1) $\sum_{l=k}^{\infty} \tau_1(T_k^l(x)) \le A_k < \infty$, uniformly over $X$, for all $k = 1, 2, \ldots$, where $A_k / k \to 0$ as $k \to \infty$.

(2) At any stage, the choices available for each state are finite in number.

(3) $|r_k^i(x_k)| \le R < \infty$ for all stages, $k$, states, $i$, and strategies, $x$.

(4) From each state, $i$, at stage $k$, under strategy $x$, the set $\{j \mid p_k^{ij}(x_k) > 0\}$ is finite. That is, only a finite set of states is reachable in one transition from any state, under any strategy. Further, $\max\{j \mid p_k^{ij}(x_k) > 0\}$ is bounded uniformly over $x \in X$ for each stage $k$.

Assumption (1) requires slightly more than weak ergodicity since the series must converge rather than just its terms going to zero. This is weaker than the equivalent condition in Hopp, Bean and Smith, which requires that $A_k$ be independent of $k$. All results requiring this assumption can be shown with $\alpha < 1$ substituting for Assumption (1). Assumption (2) limits the problem to discrete, bounded decision variables. Assumption (3) requires an upper bound on the maximum reward obtainable at each stage. Given a known starting state, Assumption (4) requires that the accessible states, for any finite time, are finite in number.

Remark: Since any sequence of policies is feasible and, by Assumption (2), there are a finite number of choices at each state, $X$ is compact in the topology generated by $\rho$ (Tychonoff Theorem).
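As a simple check of Assumption (1) (an added illustration, using the geometric condition that reappears in Section 3.3): if there is a $\beta < 1$ with $\tau_1(P_k(x_k)) \le \beta$ for all stages $k$ and strategies $x$, then Lemma 3(b) gives

$$\sum_{l=k}^{\infty} \tau_1\big(T_k^l(x)\big) \;\le\; \sum_{l=k}^{\infty} \prod_{m=k}^{l-1} \tau_1\big(P_m(x_m)\big) \;\le\; \sum_{l=k}^{\infty} \beta^{\,l-k} \;=\; \frac{1}{1-\beta},$$

so $A_k$ may be taken as the constant $1/(1-\beta)$, independent of $k$, and $A_k / k \to 0$ as required.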

2.3 Average Optimality

Theorem 2, stating that algorithmic optimality implies $\alpha$-optimality if $\alpha < 1$, is based on the uniform convergence of $V_0(x; N)$ on $X$. If $\alpha = 1$, then $V_0(x; N)$ is not necessarily convergent under Assumptions (1) through (4). However, the uniform convergence of $\sum_{l=k}^{\infty} \tau_1(T_k^l(x))$ over $X$ will suffice.

Theorem 5: Under Assumptions (1) through (4), algorithmic optimality implies average optimality.

Proof: Let $x^*$ be algorithmically optimal. By definition there exists a sequence $\{N_m\}_{m=1}^{\infty}$ such that $x^*(N_m) \to x^*$. So, by Assumption (4), for all $k$ there exists $M$ such that $m > M$ implies that $x_n^i(N_m) = x_n^i$, $n = 0, 1, \ldots, k$, for every state $i$ reachable in $k$ steps. Note that $T_0^k(x)$ is identically zero for columns beyond the maximum reachable state in $k$ steps.

Fix $k$ and choose $x \in X$ and $m$ sufficiently large. Since $x^*(N_m)$ is optimal for horizon $N_m$,

$$V_0(x^*(N_m); N_m) \ge V_0(x; k) + T_0^k(x) V_k(x^*(N_m); N_m),$$

which implies that

$$V_0(x^*; k) + T_0^k(x^*) V_k(x^*(N_m); N_m) - V_0(x; k) - T_0^k(x) V_k(x^*(N_m); N_m) \ge 0,$$

since $V_0(x^*; k) = V_0(x^*(N_m); k)$ and $T_0^k(x^*) = T_0^k(x^*(N_m))$. Hence,

$$V_0(x^*; k) - V_0(x; k) + \left[ T_0^k(x^*) - T_0^k(x) \right] V_k(x^*(N_m); N_m) \ge 0.$$

Now

$$\left[ T_0^k(x^*) - T_0^k(x) \right] V_k(x^*(N_m); N_m) = \sum_{n=k}^{N_m - 1} \left[ T_0^k(x^*) - T_0^k(x) \right] T_k^n(x^*(N_m)) R_n(x_n^*(N_m)) \le \sum_{n=k}^{N_m - 1} \left[ \alpha^n\, 2 \tau_1\big(T_k^n(x^*(N_m))\big) R \right] e$$

by Lemma 4. By Assumption (1), this is not greater than $2 A_k R e$. Hence

$$V_0(x^*; k) \ge V_0(x; k) - 2 A_k R e.$$

Divide both sides by $k$ and take the liminf of both sides to conclude that $x^*$ is superior to $x$ in average reward, and hence, is average optimal. ∎

2.4 Existence of Solution Horizons

This section extends results for nonhomogeneous Markov decision processes from those of Hopp, Bean and Smith, and Schochetman and Smith. By use of solution horizon results we can find algorithmically optimal strategies when they are unique.

Theorem 6: Under Assumption (2), $x^*(N) \to x^*$ in the $\rho$ metric as $N \to \infty$ for all choices $x^*(N)$ at each $N$ if and only if the algorithmically optimal strategy is unique.

Proof: (if) Assume otherwise. Then there is an infinite subsequence of $\{x^*(N)\}_{N=1}^{\infty}$ that is bounded away from $x^*$ in the $\rho$ metric. Since $X$ is compact by Assumption (2), this, in turn, has a convergent subsequence. By definition, its limit is also algorithmically optimal, a contradiction to the assumption of uniqueness.

(only if) If the algorithmically optimal strategy is not unique, then there exist at least two distinct strategies which are limit points of finite horizon optima. Then the limit of finite horizon strategies cannot exist. ∎

Corollary 7: Under Assumption (2), if all algorithmically optimal strategies have the same first $L$ policies, then a solution horizon exists leading to the optimal first $L$ policies in the infinite horizon problem.

Note that convergence of $x^*(N)$ to $x^*$ implies that the optimal finite horizon strategy will agree with the optimal infinite strategy over an increasing range of states and stages. This fact follows from the definition of the metric, $\rho$. The existence of solution horizons allows computation of algorithmically optimal strategies in a forward recursive manner. The following section develops such an algorithm.

3. A Cost Based Algorithm

3.1 Preliminary Results

In this section we prove that Assumptions (1) and (3) imply that the difference between future values from any two states converges. In this case, the problem can be transformed into a finite value problem as in Hopp. This characteristic, termed coherence, is necessary for the main theorem of Section 3.

Definition: A nonhomogeneous Markov decision process is coherent if, for all $k$ and state pairs, $i, j$, $V_k^i(x; N) - V_k^j(x; N)$ is uniformly convergent as $N \to \infty$, over $i, j$ and $x \in X$.

Theorem 8: Under Assumptions (1) and (3) the denumerable nonhomogeneous Markov decision problem is coherent.

Proof: From Assumption (1), for any fixed $k$, for all $\epsilon > 0$, there exists $N$, independent of $x$, such that

$$\sum_{l=N}^{\infty} \tau_1(T_k^l(x)) < \epsilon.$$

Let $T_k^l(x; i)$ be the $i$th row of the matrix $T_k^l(x)$. By definition of the Hajnal coefficient, $\tau_1(\cdot)$, for any states, $i, j$, and strategy, $x$,

$$\left| T_k^l(x; i) - T_k^l(x; j) \right| e \le 2 \tau_1(T_k^l(x)).$$

Hence, by Assumption (3),

$$\left| \left( T_k^l(x; i) - T_k^l(x; j) \right) R_l(x_l) \right| \le 2 \tau_1(T_k^l(x)) R.$$

Then

$$\sum_{l=N}^{\infty} \left| T_k^l(x; i) R_l(x_l) - T_k^l(x; j) R_l(x_l) \right| \le 2 R \sum_{l=N}^{\infty} \tau_1(T_k^l(x)) < 2 R \epsilon,$$

uniformly over $x \in X$. Hence, the series defining $V_k^i(x; N) - V_k^j(x; N)$ is uniformly convergent for $i, j$ and $x \in X$ and the problem is coherent. ∎

If a problem is coherent, it can be transformed, without loss of optimality, to a finite value problem. Define the infinite horizon coherent value for state $i$ as

$$V_1^{*i} = \lim_{N \to \infty} \left[ V_1^i(x^*; N) - V_1^1(x^*; N) \right],$$

which is well defined by Theorem 8. Similarly, the finite horizon coherent value is $V_1^{*i}(N) = V_1^i(x^*(N); N) - V_1^1(x^*(N); N)$.

For the finite state problem, Hopp shows that if $x_0^*$ uniquely maximizes $\left[ R_0(x_0) + P_0(x_0) V_1^* \right]$, it is the first policy of an algorithmically optimal strategy. That is, we can transform the infinite valued problem into a traditional finite valued problem, even without discounting. Note that an $x^*$ exists by Theorem 1. We extend this to the denumerable problem.

The following lemma demonstrates convergence of the finite horizon coherent values to the infinite horizon coherent value. It is used to prove the validity of the stopping rule. We use the notation

$$\epsilon_k(N) = \sup_{x \in X} \sum_{n=N}^{\infty} \tau_1\big(T_k^n(x)\big).$$

Under Assumption (1), $\epsilon_k(N) \to 0$ as $N \to \infty$ for all $k$.

Lemma 9 (Coherent Value Convergence): Under Assumptions (1) and (3), $|V_1^{*i}(N) - V_1^{*i}| \le 2 \epsilon_2(N) R$ for all $i = 1, 2, \ldots$.

Proof: For $k \ge 1$, let $W_k(N) = V_k(x^*(N); N)$ denote the vector of optimal values for the $N$ horizon problem, with element $i$ denoted by $W_k^i(N)$; in this notation the finite horizon coherent value is $V_1^{*i}(N) = W_1^i(N) - W_1^1(N)$. For any two integers, $N, m$,

$$W_1^i(N+m) = r_1^i(x_1^*(N+m)) + P_1(x_1^*(N+m); i)\, W_2(N+m) \tag{1}$$
$$W_1^1(N+m) \ge r_1^1(x_1^*(N)) + P_1(x_1^*(N); 1)\, W_2(N+m) \tag{2}$$
$$W_1^i(N) \ge r_1^i(x_1^*(N+m)) + P_1(x_1^*(N+m); i)\, W_2(N) \tag{3}$$
$$W_1^1(N) = r_1^1(x_1^*(N)) + P_1(x_1^*(N); 1)\, W_2(N), \tag{4}$$

where (2) and (3) hold because the indicated actions are feasible but not necessarily optimal for the respective horizons. Taking equations $[(1) - (2)] - [(3) - (4)]$ gives

$$V_1^{*i}(N+m) - V_1^{*i}(N) \le \left[ P_1(x_1^*(N+m); i) - P_1(x_1^*(N); 1) \right] \left[ W_2(N+m) - W_2(N) \right]. \tag{5}$$

Taking $(1) - (3)$ gives $W_1^i(N+m) - W_1^i(N) \le P_1(x_1^*(N+m); i) \left[ W_2(N+m) - W_2(N) \right]$. Similarly, for all $k < N$,

$$W_k(N+m) - W_k(N) \le P_k(x_k^*(N+m)) \left[ W_{k+1}(N+m) - W_{k+1}(N) \right].$$

Using this fact recursively we can refine (5) to

$$V_1^{*i}(N+m) - V_1^{*i}(N) \le \left[ P_1(x_1^*(N+m); i) - P_1(x_1^*(N); 1) \right] \prod_{n=2}^{N-1} P_n(x_n^*(N+m)) \left[ W_N(N+m) - W_N(N) \right].$$

Noting that $W_N(N) = 0$ and $W_N(N+m) = \sum_{n=N}^{N+m-1} T_N^n(x^*(N+m)) R_n(x_n^*(N+m))$, the right hand side equals

$$\left[ P_1(x_1^*(N+m); i) - P_1(x_1^*(N); 1) \right] \sum_{n=N}^{N+m-1} T_2^n(x^*(N+m)) R_n(x_n^*(N+m)) \le 2 \epsilon_2(N) R$$

by Assumption (1) and Lemma 4. A similar analysis for the opposite inequality allows strengthening of the result to

$$\left| V_1^{*i}(N+m) - V_1^{*i}(N) \right| \le 2 \epsilon_2(N) R. \tag{6}$$

Since $\epsilon_2(N) \to 0$ in $N$, the sequence $\{V_1^{*i}(N)\}$ is Cauchy and converges. In the product topology, vector convergence is equivalent to pointwise convergence, so $\{V_1^*(N)\}$ converges. It must converge to $V_1^*$ or any limit point of the sequence $\{x^*(N)\}$ would have greater value than $x^*$. Since $\lim_{m \to \infty} V_1^{*i}(N+m) = V_1^{*i}$, and (6) holds for all $m$, we conclude that $|V_1^{*i}(N) - V_1^{*i}| \le 2 \epsilon_2(N) R$. ∎

Recall that $V_1^{*i} = \lim_{N \to \infty} \left[ V_1^i(x^*; N) - V_1^1(x^*; N) \right]$, where $x^*$ is algorithmically optimal. An immediate corollary of Lemma 9 is that $V_1^*$ takes the same values for any algorithmically optimal $x^*$. For ease of expression we introduce the notation

$$v^i(x_0) = r_0^i(x_0) + P_0(x_0; i) V_1^*, \qquad v^i(x_0; N) = r_0^i(x_0) + P_0(x_0; i) V_1^*(N).$$

Lemma 10: Under Assumptions (1) through (3), if $x_0^*$ is the first policy of an algorithmically optimal strategy, it maximizes $v^{i_0}(x_0)$ for the starting state $i_0$.

Proof: By definition of algorithmic optimality, there exists a subsequence $\{x^*(N_m)\}$ of finite horizon optima whose first policy agrees with $x_0^*$ at state $i_0$. That is, for any $x_0$,

$$r_0^{i_0}(x_0^*) + P_0(x_0^*; i_0)\, V_1(x^*(N_m); N_m) \ge r_0^{i_0}(x_0) + P_0(x_0; i_0)\, V_1(x^*(N_m); N_m).$$

For any $x_0$,

$$v^{i_0}(x_0; N_m) = r_0^{i_0}(x_0) + P_0(x_0; i_0) \left[ V_1(x^*(N_m); N_m) - V_1^1(x^*(N_m); N_m)\, e \right] = r_0^{i_0}(x_0) + P_0(x_0; i_0)\, V_1(x^*(N_m); N_m) - V_1^1(x^*(N_m); N_m)\, P_0(x_0; i_0)\, e,$$

since the series are convergent over the finite horizon (note that $V_1^1(x^*(N_m); N_m)$ is a scalar). This last term is a constant over all $x_0$ since $P_0$ is stochastic. Hence $v^{i_0}(x_0^*; N_m) \ge v^{i_0}(x_0; N_m)$. From Lemma 9 and the definition of $v^{i_0}(x_0; N_m)$, $v^{i_0}(x_0; N_m) \to v^{i_0}(x_0)$. Hence, $v^{i_0}(x_0^*) \ge v^{i_0}(x_0)$. ∎

Assumption (5): The maximizer of $v^{i_0}(x_0)$ is unique.

Corollary 11: Under Assumptions (1) through (5), $x_0^*$ maximizes $v^{i_0}(x_0)$ if and only if it is the first decision of an algorithmically optimal strategy.

We now propose a stopping rule for a value iteration procedure.

3.2 A Stopping Rule

In this section we propose a stopping rule inspired by those presented in Bes and Lasserre [1986] for the discounted cost case and in Hernandez-Lerma and Lasserre [1988] for the (homogeneous) average cost case. At each step of the value iteration procedure we use a test which permits us to eliminate actions which are not optimal for the average cost criterion. Tests in previous works, such as Shapiro [1968] or Hinderer and Hubner [1974], include parameters which are difficult to obtain since they require knowledge of the optimal average cost. We only use known parameters. The most difficult values to obtain are the $\epsilon_2(N)$. In problems with a geometrically decreasing coefficient of ergodicity, these values are easily calculated from problem structure.
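For instance (an added illustration consistent with the geometric case discussed in Section 3.3): if $\tau_1(P_k(x_k)) \le \beta < 1$ for all $k$ and $x$, then Lemma 3(b) bounds the tail by a geometric series,

$$\epsilon_2(N) \;=\; \sup_{x \in X} \sum_{n=N}^{\infty} \tau_1\big(T_2^n(x)\big) \;\le\; \sum_{n=N}^{\infty} \beta^{\,n-2} \;=\; \frac{\beta^{\,N-2}}{1-\beta}, \qquad N \ge 2,$$

so the threshold $2 \epsilon_2(N) R$ used in the test can be computed in advance and decreases geometrically in $N$.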

The test is based on the following theorem.

Theorem 12: Under Assumptions (1) through (5), an initial action for state $i$, $x_0$, is not algorithmically optimal if and only if there exists a time $N$ such that

$$\left[ r_0^i(x_0(N)) - r_0^i(x_0) \right] + \left[ P_0(x_0(N); i) - P_0(x_0; i) \right] V_1^*(N) > 2 \epsilon_2(N) R,$$

where $x_0(N)$ denotes an optimal first period action for the $N$ horizon problem. That is, if and only if $v^i(x_0(N); N) - v^i(x_0; N) > 2 \epsilon_2(N) R$.

Proof: (if) Let $x_0^*$ be the first policy of an algorithmically optimal strategy. By Corollary 11, $v^i(x_0^*) \ge v^i(x_0(N))$. By Lemma 9, for any $i$, $|V_1^{*i}(N) - V_1^{*i}| \le 2 \epsilon_2(N) R$. For any $x_0$, a simple algebraic manipulation extends this to

$$\left| v^i(x_0; N) - v^i(x_0) \right| \le 2 \epsilon_2(N) R. \tag{7}$$

From (7), we have $v^i(x_0(N)) \ge v^i(x_0(N); N) - 2 \epsilon_2(N) R$. Combining gives $v^i(x_0^*) \ge v^i(x_0(N); N) - 2 \epsilon_2(N) R$. Since $x_0(N)$ maximizes $v^i(x_0; N)$, we have $v^i(x_0(N); N) \ge v^i(x_0^*; N)$. From (7), $v^i(x_0^*) \le v^i(x_0^*; N) + 2 \epsilon_2(N) R$. Combining gives $v^i(x_0^*) \le v^i(x_0(N); N) + 2 \epsilon_2(N) R$. Summarizing,

$$\left| v^i(x_0^*) - v^i(x_0(N); N) \right| \le 2 \epsilon_2(N) R, \quad \text{for all } N.$$

This proves the contrapositive of the (if) direction of the theorem.

(only if) Again by contrapositive, if, for all $N$, $0 \le v^i(x_0(N); N) - v^i(x_0; N) \le 2 \epsilon_2(N) R$, then by letting $N$ go to infinity we have that $v^i(x_0) = v^i(x_0^*)$ and, by Corollary 11, $x_0$ is the first decision of an algorithmically optimal strategy. ∎

3.3 Algorithm Statement

Given an initial state $i_0$, we propose an algorithm to obtain the optimal decision $x_0^{i_0}$ which begins the algorithmically optimal (also average optimal) strategy. It is the classical value iteration procedure plus a test, based on the previous theorem, which permits us to eliminate nonoptimal actions without computing an optimal strategy.

Step 0: Initialize $U(i_0)$, the (finite) set of potentially optimal actions in state $i_0$ at time 0, and set $N := 1$.

Step 1: For all $x_0^{i_0} \in U(i_0)$, if

$$\left[ r_0^{i_0}(x_0(N)) - r_0^{i_0}(x_0) \right] + \left[ P_0(x_0(N); i_0) - P_0(x_0; i_0) \right] V_1^*(N) > 2 \epsilon_2(N) R,$$

then eliminate $x_0^{i_0}$ from $U(i_0)$. If $U(i_0)$ is a singleton then stop. Else, set $N := N + 1$ and go to Step 1.

Corollary 13: Under Assumptions (1) through (5), the algorithm is guaranteed to stop in a finite number of steps with $U(i_0) = \{x_0^{*i_0}\}$.

Proof: By Assumption (5) there is a unique maximizer of $v^{i_0}(x_0)$. The algorithm will eliminate any other $x_0$ by Theorem 12. ∎

To carry out this algorithm we know all necessary data except $\epsilon_2(N)$, which depends on special structure. For example, if weak ergodicity is indeed geometric, i.e., $\tau_1(P_k(x_k)) \le \beta < 1$ for all $x$ and $k$, then $\epsilon_2(N)$ can be computed from the tail of the geometric series as in Bes and Sethi.

Note that this algorithm permits us to obtain on-line an optimal average reward strategy. At time 0, in state $i_0$, run the preceding algorithm until $x_0^{i_0}$ is obtained. Implement this decision and observe the new state $i_1$ at time 1. Rerun the above algorithm where the index 0 is replaced by 1, 1 by 2, and the initial state is $i_1$. In the same manner, compute $x_1^{i_1}$ and iterate. At time $n$ we observe the state $i_n$ after we have implemented the sequence of actions $x_0^{i_0}, x_1^{i_1}, \ldots, x_{n-1}^{i_{n-1}}$.

Computation of an entire average optimal policy would be impossible since we have an infinite number of states. Here, since we are interested in the optimal action at state $i_0$ at time 0, we need only evaluate the value functions $V_k(x^*(N); N)$, $k = 0, \ldots, N-1$, at states accessible in $k$ steps from $i_0$. They are finite in number by Assumption (4).
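The procedure can be sketched in code as follows. This is an illustrative implementation, not the report's; the callables `actions`, `reward`, `trans` and `eps2`, the truncation to states reachable from $i_0$ (justified by Assumption (4)), and the toy data at the bottom are all assumptions. Raw stage-1 values are used in the test in place of coherent values, since subtracting the constant $V_1^1(x^*(N); N)$ from every component cancels when multiplied by the difference of two probability rows.

```python
def eliminate_first_actions(actions, reward, trans, i0, eps2, R, max_horizon=200):
    """Value iteration with the elimination test for the first period decision.

    actions(k, i) -> iterable of actions available in state i at stage k (Assumption (2))
    reward(k, i, a) -> reward r_k^i under action a, bounded by R          (Assumption (3))
    trans(k, i, a) -> dict {j: prob} with finitely many successors        (Assumption (4))
    eps2(N)        -> upper bound on sup_x sum_{n >= N} tau_1(T_2^n(x))
    Returns the set of first period actions for state i0 that survive the test.
    """
    U = set(actions(0, i0))
    for N in range(2, max_horizon + 1):
        # States reachable from i0 in k steps under some strategy, k = 0, ..., N.
        reach = [{i0}]
        for k in range(N):
            reach.append({j for i in reach[k] for a in actions(k, i)
                            for j, p in trans(k, i, a).items() if p > 0})
        # Backward induction: V[i] = optimal value from stage k to horizon N.
        V = {i: 0.0 for i in reach[N]}
        for k in range(N - 1, 0, -1):
            V = {i: max(reward(k, i, a)
                        + sum(p * V[j] for j, p in trans(k, i, a).items() if p > 0)
                        for a in actions(k, i))
                 for i in reach[k]}
        # First period scores; the coherent value shift cancels in differences.
        q = {a: reward(0, i0, a)
                + sum(p * V[j] for j, p in trans(0, i0, a).items() if p > 0)
             for a in actions(0, i0)}
        best = max(q.values())
        U = {a for a in U if best - q[a] <= 2.0 * eps2(N) * R}
        if len(U) == 1:
            return U
    return U

# Toy data (assumed, not from the report): from every state the chain returns
# to state 0 with probability 1/2, so tau_1(P_k) <= 1/2 = beta for all policies.
def tr(k, i, a):
    j = i + 1 if a == "move" else i
    d = {0: 0.5}
    d[j] = d.get(j, 0.0) + 0.5
    return d

acts = lambda k, i: ("stay", "move")
rew = lambda k, i, a: 1.2 if a == "move" else 1.0
eps2 = lambda N, beta=0.5: beta ** (N - 2) / (1 - beta)

# "stay" is eliminated once 2 * eps2(N) * R drops below the 0.2 reward gap.
print(eliminate_first_actions(acts, rew, tr, i0=0, eps2=eps2, R=1.2))
```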

4. Summary and Conclusions

We have shown that the analytical framework developed in Hopp, Bean and Smith for the finite state space nonhomogeneous Markov decision process can be generalized to the denumerable state case. Solution horizons exist if the problem is weakly ergodic. The algorithm of Hernandez-Lerma and Lasserre is generalized to this problem. These results and algorithms can also be applied to the well studied denumerable stationary problem as a special case.

REFERENCES

Bean, J. and R. Smith [1984], "Conditions for the Existence of Planning Horizons," Mathematics of Operations Research, Vol. 9, pp. 391-401.

Bes, C. and J. B. Lasserre [1986], "An On-line Procedure in Discounted Infinite Horizon Stochastic Optimal Control," Journal of Optimization Theory and Applications, Vol. 50, pp. 61-67.

Cavazos-Cadena, R. [1986], "Existence of Optimal Stationary Policies in Average Reward Markov Decision Processes with a Recurrent State," submitted to Applied Mathematics and Optimization.

Federgruen, A., P. Schweitzer and H. Tijms [1978], "Contraction Mappings Underlying Undiscounted Markov Decision Problems," Journal of Mathematical Analysis and Applications, Vol. 65, pp. 711-730.

Hernandez-Lerma, O. and J. B. Lasserre [1988], "A Forecast Horizon and a Stopping Rule for General Markov Decision Processes," Journal of Mathematical Analysis and Applications, Vol. 132, pp. 388-400.

Hinderer, K. and G. Hubner [1974], "An Improvement of Shapiro's Turnpike Theorem for the Horizon of Finite Stage Discrete Dynamic Programs," Trans. 7th Prague Conf. on Information Theory, Statistical Decision Functions, Random Processes.

Hopp, W. [1985], "Identifying Forecast Horizons in Non-Homogeneous Markov Decision Processes," Technical Report 85-5, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois, 60201. To appear in Operations Research.

Hopp, W., J. Bean and R. Smith [1987], "A New Optimality Criterion for Non-Homogeneous Markov Decision Processes," Operations Research, Vol. 35, pp. 875-883.

Iosifescu, M. [1972], "On Two Recent Papers on Ergodicity in Nonhomogeneous Markov Chains," The Annals of Mathematical Statistics, Vol. 43, pp. 1732-1736.

Ross, S. [1968], "Non-Discounted Denumerable Markovian Decision Models," Annals of Mathematical Statistics, Vol. 39, pp. 412-423.

Ryan, S., J. Bean and R. Smith [1987], "A Tie-Breaking Algorithm for Discrete Infinite Horizon Optimization," Technical Report 87-24, Department of Industrial and Operations Engineering, The University of Michigan, Ann Arbor, Michigan 48109.

Schochetman, I. and R. Smith [1989], "Infinite Horizon Optimization," Mathematics of Operations Research, Vol. 14, pp. 1-16.

Shapiro, J. F. [1968], "Turnpike Planning Horizons for a Markovian Decision Model," Management Science, Vol. 14, pp. 292-300.
