A NEW OPTIMALITY CRITERION FOR
NON-HOMOGENEOUS MARKOV DECISION PROCESSES
Wallace J. Hopp
James C. Bean
Robert L. Smith
Technical Report 84-24
Revised June, 1986
Revised December, 1986

A NEW OPTIMALITY CRITERION FOR
NON-HOMOGENEOUS MARKOV DECISION PROCESSES†
Wallace J. Hopp
Department of Industrial Engineering and Management Sciences
Northwestern University
Evanston, Illinois 60201
James C. Bean
Robert L. Smith
Department of Industrial and Operations Engineering
The University of Michigan
Ann Arbor, Michigan 48109
ABSTRACT
We propose a new definition of optimality for non-homogeneous Markov decision processes
called periodic forecast horizon (PFH) optimality. Using measures of discounting and ergodicity
we establish conditions under which PFH optimal strategies exist and PFH optimality implies the
more conventional notions of α-optimality and average optimality. Finally, we use this definition
of optimality as the basis for a direct development of forecast horizon results for discounted and
undiscounted non-homogeneous Markov decision processes.
A majority of the work on Markov decision processes deals with the stationary or homogeneous
case. It is assumed that at each decision epoch the future process is stochastically identical to the
process encountered at time 0. For many important applications, however, such as R & D modeling (Nelson and Winter [1982]), capacity expansion (Freidenfelds [1981], Luss [1982]), equipment
replacement (Lohmann [1984]) and inventory control (Sobel [1971]) the appropriate models are
Markovian but not stationary. At each decision epoch a different problem is encountered.
We consider a framework for solving non-homogeneous Markov decision processes. Solving a
problem of this type requires a more general definition of an optimal solution and a more robust
technique for finding optimal solutions. Unlike the stationary case, in its most general form this problem
can neither be stated with finite information nor solved in finite time.
Several optimality criteria have been proposed for non-homogeneous Markov decision processes.
For problems where the value function converges for all feasible strategies, optimality is logically
† This material is based on work supported by the National Science Foundation under Grant No. ECS-8409682.
defined by maximum total reward, which defines strategy x* to be optimal if the total reward for x*
is not less than the total reward for any other strategy in the strategy set X. This will be formally
defined in Section 1.1.
Since these problems have infinite time horizons, often the rewards of the optimal strategies are
infinite. Proposed optimality criteria for the divergent reward case include greatest average reward
(Derman [1966], Ross [1968]), overtaking optimality (Denardo and Rothblum [1979]), 1-optimality
(Blackwell [1962]), and average overtaking optimality (Veinott [1966]). Ideally, the definition of
optimality should be neither overselective nor underselective. An overselective optimality criterion
may define some intuitively optimal strategies to be suboptimal and can result in no optimal
strategy existing. An underselective criterion may accept strategies as optimal that should clearly
be considered suboptimal. In addition to being appropriately selective, we would like our optimality
criterion to be consistent with practical methods for computing optimal strategies.
When reward functions diverge, the most commonly used definition of optimality is greatest
average reward. A strategy x* is defined to be average optimal if the limit as N → ∞ of the average
reward per period for x* is not less than the corresponding limit for any other strategy in X.
Average optimality is tail driven so that the quality of any strategy is determined independently
of any finite leading segment of the strategy. For example, a strategy that produces rewards of
-1000 in periods 0 through N and 1000 in every period beyond N, for any finite N, is equivalent
under the average optimality criterion to a strategy which has a reward of 1000 in every period.
Since average optimality does not distinguish between these two strategies, it is an underselective
criterion. This underselectivity does not generally present difficulties in homogeneous problems
since under a stationary strategy the expected reward will be the same in each period. That is, the
tail is an identical replication of the original problem. In non-homogeneous problems, stationary
strategies will not generally be optimal so it is possible to identify an average optimal strategy that
performs poorly in early periods. Since in practical problems early performance is crucial, we need
a stronger criterion than average optimality to rule out these strategies.
Overtaking optimality is often used as an alternative to average optimality. A strategy is
overtaking optimal if there is some finite time, N', such that for any time horizon beyond N' the
strategy is optimal for that finite horizon problem. For example, suppose one strategy is optimal
for all even time horizons and another strategy is optimal for all odd time horizons and that both
strategies generate the same average reward per period, where average reward means expected
reward over the horizon divided by the number of periods in the horizon. While it is reasonable
to consider both strategies optimal in the infinite horizon problem, neither is overtaking optimal.
Hence, a less stringent criterion is needed to guarantee existence of an optimal strategy.
The 1-optimality and average overtaking optimality criteria are more selective than average
optimality because they emphasize returns in early periods. At the same time, they are less
selective than overtaking optimality and therefore are not as likely to result in nonexistence of an
optimal strategy. In the finite state homogeneous case these criteria can be shown to be equivalent
(Denardo and Miller [1968]). While these criteria are consistent with computational techniques
for homogeneous problems (Miller and Veinott [1969]), they do not lend themselves naturally to
solution procedures for non-homogeneous problems. The optimality criterion introduced in this
paper leads to straightforward solution techniques based on forecast horizons.
For homogeneous Markov decision problems, a variety of techniques, such as policy improvement, value iteration, and linear programming, can be used to compute an optimal stationary
strategy (Howard [1960], Manne [1960], Denardo [1967], Miller and Veinott [1969]). Since stationary strategies consist of a single policy repeated, they are finitely expressible. Under certain
conditions, such as finite state and action spaces, they are finitely computable.
In non-homogeneous problems attention cannot be restricted to stationary solutions. While
many of the properties of homogeneous Markov decision processes can be generalized to the nonhomogeneous case (Hinderer [1970]), this approach does not yield finite algorithms. Further, while
a non-homogeneous problem can be converted into an equivalent homogeneous problem (Schal
[1975]), this can only be done at the expense of transforming a finite state space into a countably
infinite one. Again, this cannot yield a finite algorithm.
Since an optimal solution to the infinite horizon non-homogeneous problem, however it is
defined, cannot be found and stated in finite time, a surrogate for a finite optimal algorithm is
necessary. Such a surrogate is suggested by the fact that only the first few decisions will be
implemented immediately. The remainder of the strategy is found to evaluate future impacts of
these first few decisions. This fact motivates the use of the forecast horizon approach commonly
used in deterministic problems. A forecast horizon is a time horizon long enough to guarantee
agreement in the optimal first decision for all longer horizons. We will show that, under certain
regularity conditions, the first L optimal policy vectors, where L is any finite positive integer, can
be found in finite time. Further, by recursively employing this fact in a rolling procedure, the entire
optimal strategy can be recovered, though not in finite time.
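The rolling procedure sketched above can be illustrated with a small backward-induction sketch. This is not the algorithm of this paper, only a toy: the two-state, two-action non-homogeneous chain below is hypothetical data, and a forecast horizon is detected heuristically by watching the optimal first decision stop changing as the horizon grows.

```python
import numpy as np

def first_policy(reward_fn, trans_fn, alpha, N):
    """Backward induction over an N-stage horizon; returns the optimal
    first-stage action for each state (argmax over stage-0 Q-values).

    reward_fn(k): list over actions of length-n reward vectors for stage k
    trans_fn(k):  list over actions of (n x n) stochastic matrices for stage k
    """
    V = np.zeros(len(reward_fn(N)[0]))
    for k in range(N, -1, -1):
        Q = np.array([np.asarray(r) + alpha * np.asarray(P) @ V
                      for r, P in zip(reward_fn(k), trans_fn(k))])
        V = Q.max(axis=0)
        policy = Q.argmax(axis=0)
    return tuple(policy)

# Hypothetical non-homogeneous data: the reward for action 0 in state 0
# drifts with the stage, so the problem is genuinely non-stationary.
def reward_fn(k):
    return [np.array([1.0 + 0.1 * (k % 3), 0.0]), np.array([0.0, 1.5])]

def trans_fn(k):
    return [np.array([[0.7, 0.3], [0.4, 0.6]]),
            np.array([[0.5, 0.5], [0.2, 0.8]])]

# Solve ever longer horizons; in this instance the optimal first decision
# stops changing once the horizon is long enough.
first = [first_policy(reward_fn, trans_fn, 0.9, N) for N in range(1, 15)]
```

In this toy instance the first decision stabilizes almost immediately because both transition matrices mix strongly; the point of the paper is to give conditions under which such stabilization is guaranteed.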
The foundation for this work is Bean and Smith [1984] which proves the existence of forecast
horizons for a wide class of deterministic sequential decision problems. The important concept in
that paper is the assumed discounting of net cost flows. This discounting allows the present decision
to become independent of decisions made sufficiently far into the future. Information beyond this
future time can be discarded without effect on the initial decisions. This decreasing dependence
will be referred to as diminishing influence. Bhaskaran and Sethi [1985] show that these results
can be simply extended to a stochastic case with discounting. Related observations concerning
diminishing influence are also given in Morton [1979]. A much earlier paper by Shapiro [1968] used
discounting to derive forecast horizon results for homogeneous Markov decision processes.
In this paper we generalize this concept of independence of the present and future decisions.
The existence of a forecast horizon is seen to be an implication of a generalized type of ergodicity,
which has precisely the same effect as that of discounting. The stochasticity present in a large
class of Markov decision processes leads to this ergodicity. Consequently, forecast horizons can be
shown to exist in the absence of discounting and even with negative discounting. Furthermore,
ergodicity and discounting are seen to interact multiplicatively. The interaction of ergodicity and
discounting was first noted in the context of Markov decision processes by Morton and Wecker
[1977]. Their measure of ergodicity is primarily suited to homogeneous problems and is used to
show convergence of the policies and relative value functions. This paper introduces a measure of
ergodicity that applies to a much broader class of non-homogeneous problems and goes beyond
convergence results to develop forecast horizon results.
Section one contains problem formulation, notation, and a discussion of PFH optimality. Section two includes results on weak ergodicity. The third section contains conditions for the existence
of forecast horizons. Finally, section four includes extensions and conclusions.
1. Problem Formulation
1.1 Notation
We consider a process that is observed at discrete intervals to be in one of n states (some of
which may be identical in order to ensure n states per stage). The decision-maker chooses a policy
in stage k, x_k, by selecting actions, x_k^i, from finite sets, for states i = 1, ..., n. An infinite horizon
strategy, x, consists of an infinite sequence of policies, one for each stage. The set of all feasible
infinite horizon strategies is denoted by X.
Taking action x_k^i in state i of stage k results in an immediate reward, r_k^i(x_k^i), and transition
probabilities p_ij^k(x_k^i), j = 1, ..., n, to all states in stage k + 1. Let R denote an upper bound on the
r_k^i(x_k^i) which is assumed to be finite. We let R_k(x_k) represent the (n × 1) column vector of rewards in
stage k and P_k(x_k) represent the (n × n) matrix of transition probabilities from all states in stage k
to all states in stage k + 1. Notice that both the rewards and transition probabilities may be stage
dependent.
If strategy x is used and the one period discount factor is α, 0 < α ≤ 1, the expected net present
value at the beginning of stage k of the profit from period k through period N, N ≥ k, is written
V_k(x_(k); N), where x_(k) represents the continuation of strategy x from stage k forward. By definition,
x_(0) = x. The V_k(·) function maps into R^n with the i-th element given by V_k^i(x_(k); N), which
represents the expected net present profit from stage k through stage N given the process is in state
i of stage k. In general we are interested in the value function from stage zero onward, which can
be written:
V_0(x; N) = Σ_{j=0}^{N} α^j T_0^j(x) R_j(x_j)

where

T_l^j(x) = ∏_{k=l}^{j−1} P_k(x_k),  j > l

T_l^l(x) = I.
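For a fixed strategy, the value function above can be evaluated by accumulating the forward products T_0^j(x) stage by stage; a minimal sketch, where the reward vectors and transition matrices are illustrative placeholders:

```python
import numpy as np

def value_of_strategy(R_list, P_list, alpha):
    """V_0(x; N) = sum_{j=0}^N alpha^j T_0^j(x) R_j(x_j), with
    T_0^j(x) = P_0(x_0) ... P_{j-1}(x_{j-1}) and T_0^0(x) = I.

    R_list: rewards R_j(x_j), j = 0..N; P_list: matrices P_j(x_j), j = 0..N-1.
    """
    T = np.eye(len(R_list[0]))             # T_0^0(x) = I
    V = T @ np.asarray(R_list[0])
    for j in range(1, len(R_list)):
        T = T @ np.asarray(P_list[j - 1])  # T_0^j(x) = T_0^{j-1}(x) P_{j-1}(x_{j-1})
        V = V + alpha ** j * (T @ np.asarray(R_list[j]))
    return V

# Tiny sanity case: one state, reward 1 per stage, alpha = 1/2, N = 2,
# so V_0 = 1 + 1/2 + 1/4 = 1.75.
V_example = value_of_strategy([np.array([1.0])] * 3, [np.eye(1)] * 2, 0.5)
```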
In the finite horizon problem, optimization is equivalent to maximization of V_0(x; N). Formally,
we define x* to be α-optimal if

lim_{N→∞} V_0(x*; N) − lim_{N→∞} V_0(x; N) ≥ 0,  ∀x ∈ X.

This definition is valid as long as the limits exist. It is possible that V_0(x; N) diverges with N,
typically when α = 1. We formally define x* to be average optimal if

liminf_{N→∞} {V_0(x*; N) − V_0(x; N)}/N ≥ 0,  ∀x ∈ X.
1.2 Periodic Forecast Horizon Optimality
If the optimal solutions to the finite horizon problems approach a particular infinite horizon
strategy we would like to consider it as the optimal strategy. This idea has been discussed frequently
in the forecast horizon literature (see Morton [1978]). We refer to such a solution as forecast horizon
optimal. To formally define this notion we use a metric analogous to that presented in Bean and
Smith. Let
ρ(x, x̂) = Σ_{k=0}^{∞} δ_k(x, x̂) 2^{−(k+1)}

where

δ_k(x, x̂) = 0 if x_k^i = x̂_k^i, i = 1, ..., n; 1 otherwise.

The ρ metric has the property that any two strategies which agree in the first m policies are
considered closer than any two strategies that do not. Any metric with this property would suffice
equally well for the purposes of this paper.
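The metric can be sketched in a few lines. For illustration a stage-k policy is collapsed to a single symbol (in the paper it is a vector over states), and the infinite sum is truncated, since the tail beyond depth m contributes less than 2^{-m}:

```python
def rho(x, y, depth=60):
    # rho(x, y) = sum_{k>=0} delta_k(x, y) * 2**-(k+1), where delta_k = 0
    # when the stage-k policies agree and 1 otherwise.
    return sum((x[k] != y[k]) * 2.0 ** -(k + 1)
               for k in range(min(depth, len(x), len(y))))

# Strategies agreeing in their first m policies are within 2**-m of each other.
x = ["A"] * 10
y = ["A"] * 5 + ["B"] * 5   # agrees with x in the first 5 policies only
d = rho(x, y)               # = 2**-6 + ... + 2**-10 < 2**-5
```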
Now we define x* to be forecast horizon (FH) optimal if x*(N) → x* in the ρ metric as N → ∞,
where x*(N) is an α-optimal solution to the N-stage problem. Unfortunately, x*(N) may not
converge due to the existence of multiple optima. Hence, we define here the weaker optimality
criterion of periodic forecast horizon (PFH) optimality.
Definition: A strategy, x, is periodic forecast horizon (PFH) optimal if there exists a subsequence
of the integers, {N_m}_{m=1}^∞, such that x*(N_m) → x in the ρ metric as m → ∞.
We assume that the strategy space, X, is compact in the metric space generated by p. This
assumption precludes the possibility of a sequence of feasible strategies converging to an infeasible
strategy. For further discussion of a related problem see Bean and Smith.
Theorem 1: A periodic forecast horizon optimal strategy exists for the non-homogeneous Markov
decision process.
Proof: Compactness of X implies that the sequence {x*(N)}_{N=1}^∞ has a convergent subsequence.
The limit of such a sequence is PFH optimal by definition.
We would like to have PFH optimality imply α-optimality if the value function is convergent
and average optimality if the value function is divergent and an average optimal strategy exists.
The first implication is a natural extension of the results of Bean and Smith and is stated as
Theorem 2. The second implication requires the main theoretical result of this paper.
Theorem 2: If R < ∞ and α < 1 then PFH optimality implies α-optimality.
Proof: It is a simple matter to show that V_0(x; N) is uniformly convergent on X as N → ∞
(Hopp [1984]). Let x* be PFH optimal. Choose a subsequence of integers, {N_m}_{m=1}^∞, such that
x*(N_m) → x* as m → ∞. By definition,

V_0(x*(N_m); N_m) ≥ V_0(x; N_m),  ∀x ∈ X.

Now

lim_{m→∞} V_0(x*(N_m); N_m) = V_0(x*) < ∞,

where V_0(x*) is the expected net present value of strategy x* over the infinite horizon. This
follows from the uniform convergence of V_0(x; N) over X. Hence,

V_0(x*) ≥ V_0(x),  ∀x ∈ X,

so x* is α-optimal.
Since we assume that R < ∞, the profit functions can diverge only if α = 1. In this case, we
would like to show that PFH optimality implies average optimality. PFH optimality would then be
stronger than average optimality since PFH optimality rules out suboptimal leading policies.
However, this implication is not true without additional conditions. We present a counterexample
in the appendix where a PFH optimal strategy exists but is not average optimal. If this occurs, then
using the limiting strategy of a sequence of optimal finite horizon strategies might not maximize
average returns.
In this counterexample the influence of the first policy extends infinitely far into the future.
The same is true of the undiscounted "credit union" example given by Bean and Smith. Bean and
Smith use discounting to resolve the difficulty presented by their counterexample. We will use the
concept of weak ergodicity to develop conditions that rule out this type of behavior and ensure that
PFH-optimal strategies are average optimal.
2. Diminishing Influence
To develop a measure of the diminishing influence that results from the stochasticity of nonhomogeneous Markov decision processes, some results concerning forward products of stochastic
matrices are needed.
If α = 1 and strategy x is used, the expected reward in stage j is given by T_0^j(x) R_j(x_j). To
describe the diminishing influence of the first policy, x_0, on the reward in stage j we look at T_0^j(x),
the forward product of j stochastic matrices. The following results concerning T_0^j(x) are taken from
Seneta [1981].
2.1 Weak Ergodicity
Definition: A stochastic matrix is said to be stable if it has identical rows.
Definition: A scalar function, τ(·), that is continuous on the set of (n × n) stochastic matrices
(treated as points in R^{n×n}) and satisfies 0 ≤ τ(P) ≤ 1 for any stochastic matrix, P, is a
coefficient of ergodicity. It is called proper if τ(P) = 0 if and only if P is stable.
Definition: The Markov chain formed by strategy x achieves weak ergodicity if

τ(T_l^j(x)) → 0 as j → ∞, for each l ≥ 0,

where τ(·) is a proper coefficient of ergodicity.
More informally, the forward product of any string of stochastic matrices in a weakly ergodic
Markov chain will increasingly resemble a stable matrix as the number of matrices in the string
increases. Note that unlike the conventional notion of ergodicity, called "strong" ergodicity for
clarity, weak ergodicity does not require the forward product to converge to a limiting matrix.
Indeed, the T_l^j(x) may differ significantly as j increases due to the non-homogeneity of the problem.
To be weakly ergodic, T_l^j(x) need only have nearly identical rows for large j. In a homogeneous
problem under a stationary strategy all transition matrices will be identical, so that weak ergodicity
and strong ergodicity are equivalent for this case.
We define the following coefficients of ergodicity for any (n × n) stochastic matrix:

τ_1(P) = max_{i,j} { Σ_{s=1}^{n} |p_is − p_js| / 2 }

c(P) = 1 − max_s ( min_i p_is )
Lemma 3: For any stochastic matrix, P, and the infinite sequence of stochastic matrices defined
by any strategy x, {P_k(x_k)}_{k=0}^∞, the following are true:
(a) τ_1(·) is a proper coefficient of ergodicity and c(·) is an improper coefficient of ergodicity
(b) τ_1(P) ≤ c(P)
(c) For any l ≥ 0 and j > l,

τ_1(T_l^j(x)) ≤ ∏_{k=l}^{j−1} τ_1(P_k(x_k)) ≤ ∏_{k=l}^{j−1} c(P_k(x_k))

Proof: Seneta.
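The two coefficients and the inequalities of Lemma 3 can be checked numerically; a small sketch with arbitrary 2 × 2 stochastic matrices chosen for illustration:

```python
import numpy as np

def tau1(P):
    """tau_1(P) = (1/2) max_{i,j} sum_s |p_is - p_js|."""
    P = np.asarray(P, dtype=float)
    n = len(P)
    return 0.5 * max(np.abs(P[i] - P[j]).sum()
                     for i in range(n) for j in range(n))

def c(P):
    """c(P) = 1 - max_s min_i p_is."""
    P = np.asarray(P, dtype=float)
    return 1.0 - P.min(axis=0).max()

P1 = np.array([[0.7, 0.3], [0.4, 0.6]])
P2 = np.array([[0.5, 0.5], [0.2, 0.8]])

# tau_1 vanishes exactly on stable (identical-row) matrices ...
stable = tau1(np.array([[0.5, 0.5], [0.5, 0.5]]))
# ... is dominated by c, and is submultiplicative over forward products:
bound_ok = (tau1(P1) <= c(P1)
            and tau1(P1 @ P2) <= tau1(P1) * tau1(P2) + 1e-12)
```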
Theorem 4 (Seneta): Forward products of a sequence of stochastic matrices, {P_k(x_k)}_{k=0}^∞, achieve
weak ergodicity if

Σ_{k=0}^{∞} {1 − c[P_k(x_k)]} = ∞.

Corollary 5: If c(P_k(x_k)) ≤ a_0 < 1 for all k = 0, 1, ..., then weak ergodicity is achieved at a rate that
is at least geometric with parameter a_0 (i.e., for each l, τ_1(T_l^j(x)) ≤ a_0^{j−l}, j > l).
The probability distribution on the states in stage j starting from each of the n initial states is
given by the (n × n) matrix P_0(x_0) T_1^j(x). If τ_1(T_1^j(x)) → 0, so that T_1^j(x) has increasingly identical
rows as j → ∞, then P_0(x_0) T_1^j(x) approaches a matrix of identical rows regardless of what the
P_0(x_0) matrix is. The probabilities on the states in stage j, and hence the expected rewards in stage
j, become independent of the first policy as j grows large.
Note that there are many other possible coefficients of ergodicity. Morton and Wecker define
an alternative which has a_0 = 1 for most non-homogeneous problems. Hence, it does not lead to
many of the results which follow.
2.2 Weakly Ergodic Markov Decision Processes
Theorem 4 and its corollary allow us to define a measure of the rate at which the Markov chain
formed by a particular strategy achieves weak ergodicity. To provide a single quantitative measure
of the minimum rate at which the Markov chain in a non-homogeneous Markov decision process
achieves weak ergodicity, define a_0 as follows:

a_0 = sup_{x ∈ X'} sup_k c(P_k(x_k)),
where X' is the set of all potentially optimal x. In general, X' is a subset of X that can be
determined from additional problem structure. If no such structure is available we take
X' = X. If a_0 < 1, then by Corollary 5 the Markov chain formed by any potentially optimal infinite
horizon strategy achieves weak ergodicity at a rate that is at least geometric with parameter a_0. The
condition a_0 < 1 essentially requires that under any potentially optimal strategy each transition
matrix has a column with all entries greater than or equal to 1 − a_0 > 0. This can be interpreted
as requiring the existence of a uniformly accessible state in each stage such that the probability of
reaching that state is at least 1 − a_0 regardless of the state in the previous stage.
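Computing a_0 from the transition data is direct; a sketch with hypothetical stage-dependent matrices, illustrating the uniformly accessible state interpretation:

```python
import numpy as np

def c(P):
    # c(P) = 1 - max_s min_i p_is
    return 1.0 - np.asarray(P).min(axis=0).max()

# Stage-dependent transition matrices under some strategy (hypothetical data).
Ps = [np.array([[0.6, 0.4], [0.3, 0.7]]),
      np.array([[0.5, 0.5], [0.1, 0.9]]),
      np.array([[0.8, 0.2], [0.6, 0.4]])]

a0 = max(c(P) for P in Ps)   # sup_k c(P_k(x_k)) over the listed stages

# Each matrix has a column bounded below by 1 - a0: a state reachable with
# probability at least 1 - a0 from every state in the previous stage.
accessible = all(np.asarray(P).min(axis=0).max() >= 1 - a0 - 1e-12 for P in Ps)
```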
2.3 Assumptions
The following assumptions are made about the non-homogeneous Markov decision process:
(1) αa_0 < 1
(2) At any decision point the choices available for each state are finite in number.
(3) R = sup_k sup_{x_k} max_i |r_k^i(x_k^i)| < ∞
(4) The strategy space, X, is compact in the metric space generated by ρ.
Assumption (1) requires either discounting, weak ergodicity, or both. Assumption (2) limits
the problem to discrete, bounded decision variables. Assumption (3) requires an upper bound on
the maximum reward obtainable at each stage.
2.4 Average Optimality
The result of Theorem 2 that PFH optimality implies α-optimality if α < 1 is based on the
uniform convergence of V_0(x; N) on X. If α = 1, then V_0(x; N) is not necessarily convergent under
assumptions (1) through (4). However, we can show that

f_k(x_k, x̂_k, x_(k+1); N) = V_k(x_k, x_(k+1); N) − V_k(x̂_k, x_(k+1); N)

is uniformly convergent on the set of feasible x_(k+1) and use this result to show that PFH optimality implies average optimality when α = 1 under the assumptions in Section 2.3. The function
f_k(x_k, x̂_k, x_(k+1); N) represents the difference in the value function for stages k through N that results
from using policy x_k versus x̂_k and then continuing with strategy x_(k+1). Thus, f_k is a relative
value function where the reference is determined by a fixed action, x̂_k, rather than a fixed state as
is usually done (Morton and Wecker, Ross).
Theorem 6: For any feasible x_k and x̂_k there exists a finite function, f_k(x_k, x̂_k, x_(k+1)), such that

f_k(x_k, x̂_k, x_(k+1); N) ≡ f_k(N) → f_k = f_k(x_k, x̂_k, x_(k+1))

uniformly on the set of feasible x_(k+1) as N → ∞.
Proof: We will proceed by establishing the Cauchy criterion for uniform convergence. Assume
N > k. For any three stochastic matrices P_1, P_2 and T,

|(P_1 − P_2) T R_k(x_k)| ≤ 2 c(T) R e,

where e is a column vector of ones (see Hopp [1984]). From Lemma 3 it can be shown that

c(T_{k+1}^{N+1}(x)) ≤ a_0^{N−k}.

Combining these we get

|f_k(N + 1) − f_k(N)| ≤ 2Rα(αa_0)^{N−k} e.

This result can be extended additively for k < M < N to

|f_k(M) − f_k(N)| ≤ [2Rα(αa_0)^{M−k} / (1 − αa_0)] e.

By assumption, αa_0 < 1 so that this bound converges to zero at a rate independent of x_k, x̂_k,
and x_(k+1). Since the set of strategies is compact, this is sufficient to show that there exists an f_k
such that f_k(N) → f_k uniformly over all x_(k+1).
Intuitively, Theorem 6 implies that the effect on the reward in stage N of choosing x_k versus
x̂_k in stage k becomes negligible as N gets large. Given this condition we can prove the main result,
that PFH optimality implies average optimality.
Theorem 7: Under assumptions (1) through (4), PFH optimality implies average optimality.
Proof: Let x* be a PFH optimal strategy and x̂ be any feasible strategy in X. Then there exists
some infinite subsequence of the integers, {N_m}_{m=1}^∞, such that x*(N_m) → x*. By construction,

V_0(x*(N_m); N_m) − V_0(x; N_m) ≥ 0 for all x ∈ X,

in particular, for x agreeing with x*(N_m) except in its first policy. By the structure of the metric ρ there exists some m̄
such that x_i*(N_m) = x_i* for all m > m̄ and i = 1, 2, ..., k − 1. Hence,

lim_{m→∞} {V_0(x_0*, x*_(1)(N_m); N_m) − V_0(x_0, x*_(1)(N_m); N_m)} ≥ 0

for all feasible x_0. By Theorem 6, the expression inside the brackets is uniformly convergent
and therefore

lim_{m→∞} {V_0(x*; N_m) − V_0(x_0, x*_(1); N_m)}

exists and is non-negative. By the principle of optimality we also know that

lim_{m→∞} {V_1(x*_(1); N_m) − V_1(x̂_1, x*_(2); N_m)} ≥ 0.
Using this and the fact that

V_0(x; N_m) = R_0(x_0) + α P_0(x_0) V_1(x_(1); N_m)

inductively, we have that

lim_{m→∞} {V_0(x*; N_m) − V_0((x̂_0, x̂_1, ..., x̂_{k−1}, x*_(k)); N_m)} ≥ 0

for any finite k and x̂ ∈ X. We can break this down to

lim_{m→∞} {V_0(x*; k−1) − V_0(x̂; k−1) + α^k (T_0^k(x*) − T_0^k(x̂)) V_k(x*_(k); N_m)}
= V_0(x*; k−1) − V_0(x̂; k−1) + lim_{m→∞} {α^k (T_0^k(x*) − T_0^k(x̂)) V_k(x*_(k); N_m)} ≥ 0.

Using the same uniform bound as in the proof of Theorem 6 this becomes

V_0(x*; k−1) − V_0(x̂; k−1) + [2Rα^k / (1 − αa_0)] e ≥ 0.

Reparameterizing N = k − 1, dividing by N, and taking the lim inf gives

liminf_{N→∞} {V_0(x*; N) − V_0(x̂; N)}/N ≥ 0.

By definition, x* is average optimal.
Theorems 2 and 7 establish the reasonableness of the PFH optimality criterion by verifying that
PFH optimality implies α-optimality when α < 1 and average optimality when α = 1. The manner
in which PFH optimality is defined makes the following forecast horizon results flow naturally.
Note that both discounting and weak ergodicity act geometrically to diminish the influence of
the first policy on future rewards. For this reason, α and a_0 play analogous roles in the development
of conditions for the existence of forecast horizons in the following section.
3. Conditions for the Existence of Forecast Horizons
This section develops results for non-homogeneous Markov decision processes parallel to those
of Bean and Smith for deterministic problems. By using weak ergodicity, we obtain forecast horizon
results for undiscounted, as well as discounted, problems. By use of forecast horizon results we can
find PFH-optimal strategies when they are unique. This is an advantage relative to other criteria
such as 1-optimality and average overtaking optimality.
Theorem 8: If the PFH optimal strategy x* is unique then x*(N) → x* in the ρ metric as N → ∞.
Proof: Assume otherwise. Then there is an infinite subsequence of {x*(N)}_{N=1}^∞ that is bounded
away from x* in the ρ metric. Since X is compact this, in turn, has a convergent subsequence.
By definition, its limit is also PFH optimal, a contradiction to the assumption of uniqueness.
Note that convergence of x*(N) to x* implies that the optimal finite horizon strategy will agree
with the optimal infinite strategy over an increasingly long initial sequence of policies. This fact
follows from the way the p metric is defined and leads to the following forecast horizon results.
Theorem 9: If all PFH optimal strategies have the same first L policies then a forecast horizon
exists leading to the optimal first L policies in the infinite horizon problem.
Proof: Similar to that of Theorem 8.
Corollary 10: If the PFH optimal strategy is unique, or all PFH optimal strategies have the same
first policy, then a forecast horizon exists.
Theorems 8 and 9 and Corollary 10 show that in addition to assumptions (1) through (4), some
form of uniqueness of the infinite horizon optimal policies is needed to yield sufficient conditions
for the existence of a forecast horizon. Multiple optimal policies may cause cycling between the
multiple optima as N increases. From the decision-maker's perspective it would be impossible
to tell whether the cycling was due to multiple optima or to an insufficiently long time horizon.
Hence, as in other optimization problems, degeneracy presents a serious theoretical problem in
non-homogeneous Markov decision processes.
It is a simple matter to extend the results of Bean and Smith to show that if the set of potentially
optimal strategies is countable then the PFH optimal strategy is unique for all α outside of a set
of Lebesgue measure zero. In this case a forecast horizon exists.
The existence of forecast horizons allows computation of PFH-optimal strategies in a forward
recursive manner. The first such technique is presented in Hopp [1985].
4. Extensions and Conclusions
We have developed sufficient conditions for the existence of forecast horizons in non-homogeneous
Markov decision processes. These results form the stochastic counterparts to the forecast horizon
results of Bean and Smith for deterministic problems and also represent the extension of Shapiro's
work to the non-homogeneous and undiscounted problems. The difference between this and previous work is the role weak ergodicity plays in the existence of forecast horizons in non-homogeneous
Markov decision processes. We have shown that weak ergodicity acts analogously to discounting
in the existence conditions. Indeed, the discount factor interacts multiplicatively with an appropriately chosen numerical measure of the rate at which the Markov chain achieves weak ergodicity.
This leads to the existence of forecast horizons in undiscounted as well as discounted problems.
Weak ergodicity yields insight into the underlying causes of forecast horizons that is not readily
seen by investigating deterministic problems. We have shown that it is not discounting, per se, that
generates forecast horizons, but rather, diminishing influence. If the influence of the first policy on
the reward in stage N decreases geometrically with N, then under a set of regularity conditions
given in section 3, a forecast horizon will exist. Discounting produces this effect, as does weak
ergodicity. Diminishing influence due to discounting occurs because the present value of the reward
in stage N goes to zero geometrically with N, regardless of the choice of the first policy. Weak
ergodicity causes diminishing influence because the probability distribution on the states in stage
N becomes independent of the first policy as N grows large. Under the proper conditions, it does
so geometrically.
The forecast horizon results of this paper are for a discrete time non-homogeneous Markov decision process with discrete decision variables. It is possible to relax the assumptions of discreteness
of both time and the decision variables, although the results that are obtained are not as strong as
those given above. A more detailed treatment is given in Hopp [1984]. The primary difference in
the continuous time case is that the forecast horizons will be stated in terms of numbers of stages
rather than numbers of periods. Since these stages have random lengths, the lengths of the forecast
horizons are themselves random variables. In the continuous decision variable case forecast horizons leading to the optimal first policy no longer exist. However, horizons that yield first policies
resulting in an error less than ε, for arbitrary ε > 0, do exist for this case.
APPENDIX
This example presents a case where average optimal and forecast horizon optimal strategies
exist but are not the same.
Suppose there are four states per stage, {1, 2, 3, 4}. In all states in stage 0 there are four
actions, {A, B, C, D}. In all subsequent stages action A is feasible in state 1, action B is feasible in
state 2, action C is feasible in state 3, and action D is feasible in state 4. In addition, action B is
feasible in state 1 of stage k for k = 2^n − 1, n = 1, 2, 3, .... No other actions are feasible. Rewards
and transitions are as follows.

r_0^i(x) = 0,  i = 1, 2, 3, 4,  x = A, B, C, D
r_k^1(A) = r_k^1(B) = 1,  k = 1, 2, 3, ...
r_k^2(B) = 2,  k = 1, 2, 3, ...
r_k^3(C) = 1,  k = 1, 2, 3, ...
r_k^4(D) = 5/4,  k = 1, 2, 3, ...

p_i1^0(A) = p_i2^0(B) = p_i3^0(C) = p_i4^0(D) = 1,  i = 1, 2, 3, 4
p_11^k(A) = p_33^k(C) = p_44^k(D) = 1,  k = 1, 2, 3, ...
p_12^k(B) = p_23^k(B) = 1,  k = 2^n − 1,  n = 1, 2, 3, ...
p_22^k(B) = 1,  k ≠ 2^n − 1,  n = 1, 2, 3, ...

All other probabilities are zero. The problem is undiscounted so α = 1.
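Because the transitions above are deterministic, the average rewards of the two infinite horizon strategies of interest can be checked directly; a minimal sketch comparing AAAA... (reward 1 per stage after stage 0) with DDDD... (reward 5/4 per stage after stage 0):

```python
# Undiscounted totals over N stages for the two stationary continuations.
N = 10_000
V_star = sum(1.0 for _ in range(1, N + 1))       # AAAA...: 0 in stage 0, then 1
V_z = sum(5.0 / 4.0 for _ in range(1, N + 1))    # DDDD...: 0 in stage 0, then 5/4
diff = (V_z - V_star) / N                        # per-period gap -> 1/4
```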
Since transitions are deterministic, an α-optimal strategy in the problem from stage 0 through
stage N can be expressed as a simple sequence of N + 1 actions. Since the four states in stage 0 are
identical, V_0^i(x*(N); N) is the same for i = 1, 2, 3, 4. α-optimal strategies and average rewards
are easily computed to be:

N    x*(N)           V_0(x*(N); N)/N
1    BB              2
2    ABB             3/2
3    ABBB            5/3
4    ABBBC           6/4
5    AAABBB          7/5
6    AAABBBB         9/6
10 AAABBBBBCCC 14/10
11 AAAAAAABBBBB 15/11
Continuing in this manner verifies that the strategy z = AAAA... is forecast horizon optimal.
However, it is also easily shown that
lim inf_{N→∞} {V_0(z; N) - V_0(x̂; N)}/N = (1/4) e
where z = DDDD··· and e is the 4-vector of ones. Hence, z is preferred to x̂ under the average
optimality criterion. A unique forecast horizon optimal strategy exists but is not average optimal.
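The gap of (1/4)e can be checked directly: under z = DDDD··· the chain absorbs in state 4 and earns 5/4 per period, while under x̂ = AAAA··· it remains in state 1 and earns 1 per period (stage 0 earns nothing in either case). A minimal check, with helper names of our own choosing:

```python
from fractions import Fraction

def constant_strategy_value(per_stage_reward, N):
    # total reward over stages 0..N for a strategy that absorbs immediately:
    # the stage-0 reward is 0, and stages 1..N each earn per_stage_reward
    return Fraction(per_stage_reward) * N

def average_gap(N):
    v_z = constant_strategy_value(Fraction(5, 4), N)   # z = DDDD...
    v_xhat = constant_strategy_value(1, N)             # x-hat = AAAA...
    return (v_z - v_xhat) / N

# the per-period gap is exactly 1/4 for every horizon N
assert all(average_gap(N) == Fraction(1, 4) for N in (1, 10, 1000))
```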

REFERENCES
Bean, J. and R. Smith [1984], "Conditions for the Existence of Planning Horizons," Mathematics
of Operations Research, Vol. 9, pp. 391-401.
Bhaskaran, S. and S. Sethi [1985], "Conditions for the Existence of Decision Horizons for Discounted
Problems in a Stochastic Environment: A Note," Operations Research Letters, Vol. 4, pp. 61-65.
Blackwell, D. [1962], "Discrete Dynamic Programming," Annals of Mathematical Statistics, Vol.
33, pp. 719-726.
Denardo, E. [1967], "Contraction Mappings in the Theory Underlying Dynamic Programming,"
SIAM Review, Vol. 9, pp. 165-177.
Denardo, E. and B. Miller [1968], "An Optimality Condition for Discrete Dynamic Programming
with no Discounting," The Annals of Mathematical Statistics, Vol. 39, pp. 1220-1227.
Denardo, E. and U. Rothblum [1979], "Overtaking Optimality for Markov Decision Chains," Mathematics of Operations Research, Vol. 4, pp. 144-152.
Derman, C. [1966], "Denumerable State Markovian Decision Processes-Average Cost Criterion,"
Annals of Mathematical Statistics, Vol. 37, pp. 1545-1554.
Freidenfelds, J. [1981], Capacity Expansion: Simple Models and Applications, North-Holland.
Hinderer, K. [1970], Foundations of Non-Stationary Dynamic Programming with Discrete Time
Parameter, Lecture Notes in Operations Research and Mathematical Systems, Springer-Verlag.
Hopp, W. [1984], "Non-Homogeneous Markov Decision Processes with Applications to R & D Planning," Unpublished Ph.D. dissertation, Department of Industrial and Operations Engineering,
The University of Michigan, Ann Arbor, Michigan 48109.
Hopp, W. [1985], "Identifying Forecast Horizons in Non-Homogeneous Markov Decision Processes,"
Technical Report 85-5, Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, Illinois, 60201.
Howard, R. [1960], Dynamic Programming and Markov Processes, Wiley.
Lohmann, J. [1984], "A Stochastic Replacement Economy Decision Model," Technical Report 84-11, Department of Industrial and Operations Engineering, The University of Michigan, Ann
Arbor, Michigan 48109. To appear in IIE Transactions.
Luss, H. [1982] "Operations Research and Capacity Expansion Problems: A Survey," Operations
Research, Vol. 30, pp. 907-947.

Manne, A. [1960], "Linear Programming and Sequential Decisions," Management Science, Vol. 6,
pp. 259-267.
Miller, B. and A. Veinott [1969], "Discrete Dynamic Programming with a Small Interest Rate,"
The Annals of Mathematical Statistics, Vol. 40, pp. 366-370.
Morton, T. [1978], "The Non-Stationary Infinite Horizon Inventory Problem," Management Science, Vol. 24, pp. 1474-1482.
Morton, T. [1979], "Infinite Horizon Dynamic Programming-A Planning Horizon Formulation,"
Operations Research, Vol. 27, pp. 730-742.
Morton, T. and W. Wecker [1977], "Discounting, Ergodicity and Convergence for Markov Decision
Processes," Management Science, Vol. 23, pp. 890-900.
Nelson, R. and S. Winter [1982], An Evolutionary Theory of Economic Change, Belknap Press.
Ross, S. [1968], "Non-Discounted Denumerable Markovian Decision Models," Annals of Mathematical Statistics, Vol. 39, pp. 412-423.
Ross, S. [1970], Applied Probability Models with Optimization Applications, San Francisco: Holden-Day.
Schäl, M. [1975], "Conditions for Optimality in Dynamic Programming and for the Limit of n-Stage Optimal Policies to be Optimal," Z. Wahrscheinlichkeitstheorie verw. Gebiete, Vol. 32,
pp. 179-196.
Seneta, E. [1981], Non-negative Matrices and Markov Chains, New York: Springer-Verlag.
Shapiro, J. [1968], "Turnpike Planning Horizons for a Markovian Decision Model," Management
Science, Vol. 14, pp. 292-300.
Sobel, M. [1971], "Production Smoothing with Stochastic Demand II: Infinite Horizon Case," Management Science, Vol. 17, pp. 724-735.
Veinott, A. [1966], "On Finding Optimal Policies in Discrete Dynamic Programming with no Discounting," Annals of Mathematical Statistics, Vol. 37, pp. 1284-1294.