4967-5-T

Technical Report No. 153

OPTIMAL MARKOVIAN QUEUEING SYSTEMS

by

K. Chuang

Approved by: B. F. Barton

for

COOLEY ELECTRONICS LABORATORY
Department of Electrical Engineering
The University of Michigan
Ann Arbor, Michigan

Air Force Systems Command
Rome Air Development Center
Contract No. AF 30(602)-2661
Griffiss Air Force Base, New York

June 1963


TABLE OF CONTENTS

ACKNOWLEDGEMENT
1. INTRODUCTION
2. MARKOVIAN MODEL FOR QUEUEING ANALYSIS
3. MARKOVIAN DECISION PROCESSES WITH CONTINUOUS-TIME PARAMETERS
4. MARKOVIAN DECISION PROCESSES WITH DISCRETE-TIME PARAMETERS
5. RETURN FUNCTIONS
6. DYNAMIC PROGRAMMING AND RECURRENCE RELATION
7. ASYMPTOTIC SOLUTION OF RECURRENCE RELATIONS
8. A PROPERTY OF SIMPLE QUEUEING MATRIX
9. THE COMPUTATIONAL METHOD OF FINDING OPTIMUM CONTROLS BY POLICY ITERATION
   9.1 Asymptotic Solution of a Markovian Decision Process as n → ∞
   9.2 Howard Iteration Cycle
10. CONCLUSIONS
REFERENCES

ACKNOWLEDGEMENT

The author wishes to acknowledge the valuable comments on this paper made by Mr. V. L. Wallace of the Cooley Electronics Laboratory.

1. INTRODUCTION

Because of the tremendous cost involved in constructing and operating a multiple-computer system, a preliminary analysis of the performance of a proposed system should clearly be made before the actual system is built. Furthermore, even with an existing system it is advantageous to explore the capability of the system analytically, without interrupting its daily operation. Because of the varying natures of jobs, their equipment requirements, service durations, and times of arrival can be described only statistically.

At present, as far as the state of the art allows, the two most effective means of analyzing a large-scale multiple-computer system are: (1) to examine the system analytically by queueing theory, and (2) to simulate the system on a computer and run the simulated system long enough that reliable statistics on system performance can be obtained. The principal merit of the simulation method is that it can be applied to any system regardless of its complexity, while it is generally difficult to model a complex system by the queueing-theory method or to obtain its analytical solution. However, the simulation method is not without drawbacks. Its worst defect is that a tremendously large number of runs must be performed in order to obtain reliable and meaningful statistical results. This imposes a serious financial limitation on its use and requires that a large amount of computer time be available for the runs. The queueing-theory approach, on the other hand, is analytical in nature, and consequently the reliability of its results (which are usually easy to interpret) depends only on the faithfulness with which the theoretical model describes the actual system. Our experience with both simulation and queueing analysis in previous work indicates that it is highly desirable to study multiple-computer systems by queueing methods because of the high reliability of their statistical conclusions.

The purpose of this report is to describe the extension of queueing theory from its analytical aspect to its synthesis aspect. That is, one not only analyzes the performance of a given multiple-computer system but also devises a scheme or strategy to optimize the given system against a set of preassigned return or cost functions. This approach contrasts with the ordinary queueing method, and may be considered an active queueing method in the sense that it changes the structure of the computer system to meet the current demand (state) of the system.

2. MARKOVIAN MODEL FOR QUEUEING ANALYSIS

In order to facilitate an understanding of the transition from an analysis model to a synthesis model, a brief review of a commonly used analysis model (a Markovian model) is given in this section. The simplest Markovian model used in queueing analysis is one which possesses stationary transition probabilities. The behavior of this system is characterized by a single variable called the state of the system. In ordinary queueing theory this state variable, or state, usually indicates the number of jobs waiting (both in the waiting line and in the computer proper). Mathematically, the system can be described by the following state equation:

    q(t + Δt) = q(t) + a(Δt) − s(Δt)

where

    q(t) = number of jobs waiting at time t (the state variable),
    a(Δt) = number of jobs arriving in the time interval Δt,
    s(Δt) = number of jobs completing service and leaving the system in the time interval Δt.

In the above equation we use the fact that the system is stationary, so that the number of jobs arriving and the number of jobs completing service in the time interval Δt are independent of the present time t. Otherwise, a and s would be functions of both t and Δt, i.e., a(t, Δt) and s(t, Δt).
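To make the state equation concrete, the following minimal sketch (in Python, an illustration added here rather than part of the original report) simulates one sample path of q(t). It anticipates the probabilistic assumptions of the next paragraph: at most one arrival per interval, with probability about λΔt, and at most one departure, with probability about μΔt when the queue is nonempty. All names and parameter values (simulate_queue, lam, mu) are assumptions chosen for the example.

```python
import random

def simulate_queue(lam, mu, dt, steps, q0=0):
    """One sample path of q(t + dt) = q(t) + a(dt) - s(dt).

    For small dt: P{a = 1} ~ lam*dt, and P{s = 1} ~ mu*dt when the
    queue is nonempty; events involving two or more jobs are O(dt)
    and are neglected here.
    """
    q = q0
    path = [q]
    for _ in range(steps):
        a = 1 if random.random() < lam * dt else 0           # arrival in dt
        s = 1 if q > 0 and random.random() < mu * dt else 0  # departure in dt
        q = q + a - s
        path.append(q)
    return path

# A lightly loaded single server (lam < mu), simulated for 10 seconds:
print(simulate_queue(lam=0.5, mu=0.8, dt=0.01, steps=1000)[-10:])
```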

Suppose that a(Δt) and s(Δt) are statistically independent, and suppose furthermore that in the time interval Δt the probability that one job arrives is λΔt and the probability that two or more jobs arrive is O(Δt), where the symbol O(Δt) denotes a quantity that becomes negligible compared with Δt as Δt → 0. Similarly, assume that the probability that one job completes service and leaves in the time interval Δt is μΔt, and that the probability that two or more jobs complete service and leave is O(Δt). Under these assumptions it can be shown that, as Δt → 0, the probability of finding the system in any one of the states at time t satisfies the following Chapman-Kolmogorov equations.

(A) If infinitely many states are allowed:

    dP_0/dt = −λP_0 + μP_1
    dP_n/dt = λP_{n−1} − (λ + μ)P_n + μP_{n+1},    n = 1, 2, ...

where P_n(t) is the probability that the state of the system is n at time t.

(B) If only a finite number N + 1 of states is allowed:

    dP_0/dt = −λP_0 + μP_1
    dP_k/dt = λP_{k−1} − (λ + μ)P_k + μP_{k+1},    k = 1, 2, ..., N − 1
    dP_N/dt = λP_{N−1} − μP_N
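The finite-state equations above form a linear system of ordinary differential equations, so they can be integrated numerically. The sketch below is again an added illustration, not part of the report; it assumes NumPy, forward-Euler integration, and arbitrary illustrative rates, and it propagates the distribution P(t) from an empty queue toward its steady state.

```python
import numpy as np

def ck_derivative(P, lam, mu):
    """Right-hand side of the finite-state Chapman-Kolmogorov equations:
       dP_0/dt = -lam*P_0 + mu*P_1
       dP_k/dt = lam*P_{k-1} - (lam + mu)*P_k + mu*P_{k+1},  0 < k < N
       dP_N/dt = lam*P_{N-1} - mu*P_N
    """
    N = len(P) - 1
    dP = np.zeros_like(P)
    dP[0] = -lam * P[0] + mu * P[1]
    for k in range(1, N):
        dP[k] = lam * P[k - 1] - (lam + mu) * P[k] + mu * P[k + 1]
    dP[N] = lam * P[N - 1] - mu * P[N]
    return dP

# Forward-Euler integration from an empty queue; P(t) should approach
# a (truncated) geometric steady-state distribution with ratio lam/mu.
lam, mu, N, dt = 0.5, 0.8, 10, 0.001
P = np.zeros(N + 1)
P[0] = 1.0
for _ in range(200_000):
    P += dt * ck_derivative(P, lam, mu)
print(np.round(P, 4))
```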

A multiple-computer system which allows simultaneous operations may be considered a multiplicity of service channels in parallel, with a total service rate that changes as a function of state. For such a system the above Chapman-Kolmogorov equations continue to hold, but with different values replacing some of the μ-coefficients. For systems which cannot be described by a Markovian queueing model at every instant of time, considering the state at a sequence of special time points will often allow the system to be treated as a Markovian model; this technique is known as the imbedded Markov chain method.

In view of the above brief description of Markovian queueing models, it is clear that Markovian stochastic processes are powerful tools for queueing analysis. It is worth noting, however, that in queueing analysis the systems studied are assumed to be fixed in structure; that is, we analyze a system without attempting to change its structure.

3. MARKOVIAN DECISION PROCESSES WITH CONTINUOUS-TIME PARAMETERS

If a stochastic system such as a multiple-computer system can be changed by the command of a central executive system, then the following two approaches to studying this type of system are clearly possible:

(1) Assume all possible environments for the system. Then use queueing methods to study the behavior of the system, against a set of return functions, under the various possible sequences of commands.

(2) Assume all possible environments for the system, but use a dynamic programming method to find the sequence of commands which optimizes the system against a set of given return or cost functions.

The first approach is essentially analytical in nature and is time-consuming, because we are forced to analyze all possible sequences of commands, some of which may not be useful in actual operation. The second approach is more efficient because it allows us to select directly only the commands that are optimal. Loosely speaking, the difference between these two approaches resembles the difference between analysis and synthesis techniques in circuit theory.

When the system under study is a Markovian system, the second approach (dynamic programming) is called a Markovian decision process. This process was first studied by Bellman and was developed into a useful tool by Howard (Refs. 1-3). In order to make this report self-sufficient, a brief discussion of Markovian decision processes is given in the next paragraphs.

A Markovian decision process is a generalization of an ordinary Markovian process in which the transition probabilities depend on a command signal u(t) belonging to the set U of allowable commands. Mathematically, the process can be represented as follows. Consider a stochastic service system which can be in any one of the states 0, 1, 2, ..., N. Let P_k(t) be the probability that the system is in the kth state at time t, and let p_ij Δt be the transition probability from the ith state to the jth state in the time interval Δt. Since every p_ij is assumed to be a function of the command signal u(t), we write p_ij = p_ij(u). Furthermore, since the system is assumed to be Markovian, its state probabilities satisfy the following Chapman-Kolmogorov equations as Δt → 0:

    dP_0/dt = Σ_{i=1}^{N} p_i0(u) P_i − [1 − p_00(u)] P_0
        ⋮                                                      u ∈ U
    dP_N/dt = Σ_{i=0}^{N−1} p_iN(u) P_i − [1 − p_NN(u)] P_N

The process of selecting a control u(t) ∈ U so as to maximize or minimize a return function is called a Markovian decision process.
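As a concrete illustration of transition probabilities that depend on a command signal, the sketch below builds p_ij(u) for a birth-death queue in which the command u is interpreted, purely as an assumption for this example, as the number of active service channels; the function name and all rates are hypothetical.

```python
import numpy as np

def transition_matrix(u, lam=0.5, mu=0.4, N=5, dt=0.1):
    """Illustrative p_ij(u) for a birth-death queue in which the command
    u scales the total service rate (u active channels, each of rate mu).
    For small dt the off-diagonal entries are approximately rate*dt, and
    each row is completed so that it sums to one.
    """
    P = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        if i < N:
            P[i, i + 1] = lam * dt             # one arrival in dt
        if i > 0:
            P[i, i - 1] = min(u, i) * mu * dt  # one service completion in dt
        P[i, i] = 1.0 - P[i].sum()             # no change in dt
    return P

# Adding a second channel shifts one-step probability mass toward
# shorter queues; compare the rows for state 3:
for u in (1, 2):
    print(u, np.round(transition_matrix(u)[3], 3))
```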

4. MARKOVIAN DECISION PROCESSES WITH DISCRETE-TIME PARAMETERS

For computational ease, we often treat Markovian decision processes as discrete-time processes; that is, the time t is a member of a countable set {nΔt | n = 0, 1, 2, ...}, and consequently a command signal can be applied only at the time instants nΔt. Furthermore, a discrete-time process may arise naturally in modeling a system in which observations and decisions are made periodically. The associated Chapman-Kolmogorov equations reduce to the following set of difference equations:

    P_0[(n+1)Δt] = Σ_{i=0}^{N} p_i0(u) P_i(nΔt)
        ⋮                                                  u(nΔt) ∈ U
    P_N[(n+1)Δt] = Σ_{i=0}^{N} p_iN(u) P_i(nΔt)

The problem of selecting a control u(t) so as to maximize or minimize a set of return functions reduces to that of selecting a sequence of controls {u(nΔt)}, n = 0, 1, 2, .... In the subsequent text Δt will be taken equal to unity, without loss of generality.

5. RETURN FUNCTIONS

In the preceding two sections, return functions were mentioned in connection with both discrete- and continuous-time parameter Markovian processes. As in all engineering system studies, a return function indicating "the best performance" inevitably involves a certain amount of subjective judgement. Moreover, in choosing between two candidates for a return function, we are sometimes forced to accept the less meaningful one simply because it can be treated mathematically and the other cannot. Here we shall take the return to be a value, depending on u(t) or u(nΔt), associated with each transition of states. For example, whenever a transition is made in the time interval Δt at time instant nΔt with control u(nΔt), a value r_ij[u(nΔt)] is assigned to this transition if state i is the initial state and j is the final state.

The physical significance of these return functions in the discrete-time case can be explained as follows. In a discrete-time system, the occurrence of a transition within a time interval Δt usually serves as a rough indication of the service efficiency or inefficiency of the multiple-computer system. For example, from a user's point of view the queue in the system would ideally be zero at all times; i.e., the computer system would always be available. This means that a large positive value (return) should be assigned to a state transition from greater to lesser queue length, and a large negative value (return) to a transition from lesser to greater queue length. From the point of view of the management of the multiple-computer system, however, there should always be someone in the queue, to guarantee that at least part of the system is never idle; yet to maintain good will, and to keep from losing customers, the queue must not be allowed to grow too long. This means that a different value (return) should be assigned to each transition of states. Consequently, in making the proper choice and assignment of values (returns) to the various state transitions, a compromise must usually be made between the interests of the customers and those of the management.

In general, the return functions associated with the transitions of a finite-state system can be expressed as an (N+1)-by-(N+1) square matrix function of the control u(nΔt):

    [R(u)] = | r_00(u)  r_01(u)  ...  r_0N(u) |
             | r_10(u)  r_11(u)  ...  r_1N(u) |
             |   ...      ...    ...    ...   |
             | r_N0(u)  r_N1(u)  ...  r_NN(u) |

For the continuous-time case, r_ij(u) can be interpreted as a fixed rate of return during the time of a transition from state i to state j with control u(t); for a finite system, the return functions can again be expressed as an (N+1)-by-(N+1) square matrix function of the control.
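The sketch below constructs one plausible return matrix of this form: transitions toward a shorter queue earn a positive return, transitions toward a longer queue incur a negative one, and every transition is charged a cost proportional to the control u (the expense of keeping u channels active). The weights are assumptions for illustration, and the example reuses the transition_matrix sketch of Section 3 to evaluate the expected one-step return Σ_j p_ij(u) r_ij(u), the quantity written b_i(u) in Section 6.

```python
import numpy as np

def return_matrix(u, N=5, gain=2.0, loss=-2.0, cost=0.5):
    """Illustrative r_ij(u): reward transitions to a shorter queue,
    penalize transitions to a longer queue, and charge every transition
    a control cost proportional to u.  All weights are assumptions.
    """
    R = np.zeros((N + 1, N + 1))
    for i in range(N + 1):
        for j in range(N + 1):
            if j < i:
                R[i, j] = gain      # queue shortened
            elif j > i:
                R[i, j] = loss      # queue lengthened
            R[i, j] -= cost * u     # cost of exercising the control
    return R

# Expected one-step return from each state, sum_j p_ij(u) r_ij(u)
# (the b_i(u) of Section 6), using transition_matrix from Section 3:
P, R = transition_matrix(2), return_matrix(2)
print(np.round((P * R).sum(axis=1), 3))
```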

6. DYNAMIC PROGRAMMING AND RECURRENCE RELATION

We are interested here only in the discrete-time parameter system. This system reduces to the continuous-parameter system as Δt → 0, but otherwise it applies to systems which are Markovian and in which only periodic decisions can be made. The basic tool for establishing a recurrence relation is the principle of optimality, which states that an optimal sequence of controls has the property that, whatever the initial state and initial control, the remaining controls must constitute an optimal sequence of controls with regard to the state resulting from the first control.

Using this principle, we can derive the recurrence relation as follows. Suppose that the general transition probability matrix [P(u)] is known and that the return matrix [R(u)] associated with the transitions of states is specified. Let the system be in state i, and let f_n(i) be the expected optimum total return over the next n transitions. Then, by the principle of optimality, the following recurrence relations are obtained:

    f_n(i) = max_{u∈U} Σ_{j=0}^{N} p_ij(u) [r_ij(u) + f_{n−1}(j)]
    f_0(i) = C_i,    i = 0, 1, 2, ..., N and n = 1, 2, ...

where the C_i are specified boundary conditions, usually set equal to zero. If we define

    b_i(u) = Σ_{j=0}^{N} p_ij(u) r_ij(u)

then the recurrence relations can be written

    f_n(i) = max_{u∈U} { b_i(u) + Σ_{j=0}^{N} p_ij(u) f_{n−1}(j) }
    f_0(i) = C_i,    i = 0, 1, 2, ..., N and n = 1, 2, ...

The problem now is to choose a control u at the beginning of each transition such that the sequence of controls over the total time interval nΔt maximizes the total expected return of the system. If the system is allowed to operate indefinitely, then nΔt → ∞, and the problem is to determine which sequence of controls maximizes the total average return over the infinite time interval. Optimization over an infinite time interval is closer to reality than optimization over a finite interval, especially in computer systems, because the operating life of the system is much longer than the interval between state transitions. The long-term return is therefore the more important, and it corresponds mathematically to the asymptotic solution of the recurrence relations given above.

7. ASYMPTOTIC SOLUTION OF RECURRENCE RELATIONS

The following theorem on the asymptotic behavior of the sequence {f_n(i)}, implied by Bellman in his discussion of Markovian processes, has been proven in detail. Both the theorem and the proof are essentially the same as Bellman's, except for the important condition (c), where Bellman requires the much more restrictive condition [P(u)] > 0.

Theorem

Let us assume the following:

(a) b_i(u) and p_ij(u) are of finite dimension and are functions of a vector u whose components assume only a finite set of values.

(b) b_i(u) ≥ 0 for all i and all u, and b_i(u) > 0 for some i and all u.

(c) [P(u)]^k > 0 for some finite number k and for all u; that is, every element of [P(u)]^k is greater than zero.

(d) [P(u)] is a stochastic matrix; i.e., Σ_{j=0}^{N} p_ij(u) = 1 for i = 0, 1, 2, ..., N, and p_ij(u) ≥ 0.

Under these conditions we have the asymptotic result

    f_n(i) ~ nr as n → ∞,    i = 0, 1, 2, ..., N    (1)

where the scalar quantity r is obtained from

    r 1 = max_{u∈U} lim_{n→∞} (1/n) { b(u) + [P(u)] b(u) + ... + [P(u)]^{n−1} b(u) }    (2)

Here 1 = (1, 1, ..., 1)′, b(u) = (b_0(u), b_1(u), ..., b_N(u))′, and [P(u)] = [p_ij(u)].

Proof

(A) On the Markov matrix. From the theory of stochastic matrices it can be shown that a dominant characteristic vector 1 exists. Furthermore, by iteration, for any vector y which is nonnegative and nontrivial we have, for large n,

    [P(u)]^n y ≈ r 1

where r is a scalar quantity depending on y. It can also be shown that the limit

    x = lim_{m→∞} [ Σ_{k=0}^{m} [P(u)]^k y − m r 1 ]

exists. Now consider the quantity

    y + [P(u)] [ Σ_{k=0}^{m} [P(u)]^k y − m r 1 ] = Σ_{k=0}^{m+1} [P(u)]^k y − m r [P(u)] 1

Since 1 is a characteristic vector of [P(u)] associated with the characteristic value 1, we have [P(u)] 1 = 1, and therefore

    y + [P(u)] [ Σ_{k=0}^{m} [P(u)]^k y − m r 1 ] = Σ_{k=0}^{m+1} [P(u)]^k y − (m+1) r 1 + r 1

Now, letting m + 1 → ∞, we obtain

    y + [P(u)] x = x + r 1

If we let y = b(u), we obtain

    b(u) + [P(u)] x = x + r 1

a result used in the remainder of the proof.

(B) Equivalence of equations. We have the two equations

    b(u*) + [P(u*)] x = x + r 1

where u* is the control which maximizes the limit in Eq. 2, and

    max_{u∈U} { b(u) + [P(u)] x } = x + r 1

To prove these equations equivalent, we examine the consequence of their nonequivalence. In that case

    max_{u∈U} { b(u) + [P(u)] x } ≥ x + r 1    (3)

with strict inequality in at least one component. If we let u = u⁰ be a value of u which maximizes b(u) + [P(u)] x, then

    max_{u∈U} { b(u) + [P(u)] x } = b(u⁰) + [P(u⁰)] x    (4)

From Eqs. 3 and 4,

    b(u⁰) + [P(u⁰)] x ≥ x + r 1    (5)

Without loss of generality we can say that the strict inequality occurs in the first component of the vector inequality in Eq. 5:

    b_0(u⁰) + Σ_{j=0}^{N} p_0j(u⁰) x_j > x_0 + r    (6)

    b_i(u⁰) + Σ_{j=0}^{N} p_ij(u⁰) x_j ≥ x_i + r,    i = 1, 2, ..., N    (7)

Equation 6 can be written with ≥ by adding a positive constant a to the right-hand side:

    b_0(u⁰) + Σ_{j=0}^{N} p_0j(u⁰) x_j ≥ x_0 + (1 + a) r    (8)

Now iterate the inequalities of Eqs. 7 and 8, substituting them for the x_j appearing on their own left-hand sides. For Eq. 7 the result is

    b_i(u⁰) + Σ_{j=1}^{N} p_ij(u⁰) [ b_j(u⁰) + Σ_{k=0}^{N} p_jk(u⁰) x_k − r ]
        + p_i0(u⁰) [ b_0(u⁰) + Σ_{k=0}^{N} p_0k(u⁰) x_k − (1 + a) r ] ≥ r + x_i    (9)

For Eq. 8 the result is

    b_0(u⁰) + Σ_{j=1}^{N} p_0j(u⁰) [ b_j(u⁰) + Σ_{k=0}^{N} p_jk(u⁰) x_k − r ]
        + p_00(u⁰) [ b_0(u⁰) + Σ_{k=0}^{N} p_0k(u⁰) x_k − (1 + a) r ] ≥ r + x_0 + a r ≥ r + x_0    (10)

Combining Eqs. 9 and 10 into a single vector inequality, and writing e₁ = (1, 0, ..., 0)′, we obtain

    r 1 + x ≤ b(u⁰) + [P(u⁰)] b(u⁰) + [P(u⁰)]² x − r 1 − a r [P(u⁰)] e₁

or

    x ≤ b(u⁰) + [P(u⁰)] b(u⁰) + [P(u⁰)]² x − 2 r 1 − a r [P(u⁰)] e₁    (11)

Iterating once more (substituting Eq. 11 for the x on its right-hand side), we obtain

    x ≤ b(u⁰) + [P(u⁰)] b(u⁰) + [P(u⁰)]² b(u⁰) + [P(u⁰)]³ b(u⁰) + [P(u⁰)]⁴ x − 4 r 1 − a r { [P(u⁰)] + [P(u⁰)]³ } e₁    (12)

Combining Eqs. 11 and 12 and iterating again gives

    x ≤ Σ_{k=0}^{5} [P(u⁰)]^k b(u⁰) + [P(u⁰)]⁶ x − 6 r 1 − a r { [P(u⁰)] + [P(u⁰)]³ + [P(u⁰)]⁵ } e₁

In general, after the Nth iteration we have the inequality

    x ≤ Σ_{k=0}^{2N−1} [P(u⁰)]^k b(u⁰) + [P(u⁰)]^{2N} x − 2N r 1 − a r Σ_{i=1}^{N} [P(u⁰)]^{2i−1} e₁    (13)

Since [P(u⁰)]^k > 0 for the finite k of condition (c), every element of [P(u⁰)]^{2i−1} with 2i − 1 ≥ k is bounded below:

    [P(u⁰)]^{2i−1} ≥ ε E

where E is the (N+1)-by-(N+1) matrix whose elements are all unity and ε > 0 is the greatest lower bound of the elements of the matrices [P(u⁰)]^{2i−1} with 2i − 1 ≥ k. Therefore

    x ≤ Σ_{k=0}^{2N−1} [P(u⁰)]^k b(u⁰) + [P(u⁰)]^{2N} x − 2N r 1 − a r ε (N − m_k + 1) 1

where m_k, the smallest integer ≥ (k + 1)/2, is the first value of i for which 2i − 1 ≥ k, so that N − m_k + 1 counts the qualifying powers. The maximal property of r = r(u*) asserts that, as N increases, the vector

    Σ_{k=0}^{2N−1} [P(u⁰)]^k b(u⁰) − 2N r 1 − a r ε (N − m_k + 1) 1

becomes an arbitrarily large negative vector, while x − [P(u⁰)]^{2N} x remains bounded. This contradicts the inequality of Eq. 13. Hence part (B) of the theorem is proved.

(C) f_n(i) ~ nr for large n. Assume that

    n r + x_i − c ≤ f_n(i) ≤ n r + x_i + c for n = 0, 1, ..., ℓ

for some constant c. Then, from the principle of optimality,

    f_{ℓ+1}(i) = max_{u∈U} { b_i(u) + Σ_{j=0}^{N} p_ij(u) f_ℓ(j) }
              ≤ max_{u∈U} { b_i(u) + Σ_{j=0}^{N} p_ij(u) [ℓ r + x_j + c] }
              = ℓ r + c + max_{u∈U} { b_i(u) + Σ_{j=0}^{N} p_ij(u) x_j }
              = ℓ r + c + r + x_i = (ℓ + 1) r + x_i + c

and similarly f_{ℓ+1}(i) ≥ (ℓ + 1) r + x_i − c. Q.E.D. Therefore f_n(i) ~ n r.

Remarks

For many finite-state queueing matrices [P(u)], the condition that [P(u)]^k > 0 for some finite k holds. It therefore follows from the above asymptotic theorem that in such cases a steady-state strategy exists.
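The asymptotic theorem can be checked numerically. The following sketch, an added illustration reusing the transition_matrix and return_matrix examples of Sections 3 and 5, runs the recurrence of Section 6 and prints the per-step gain f_n(i) − f_{n−1}(i), which should approach the same constant r in every state; it also verifies the positivity property [P(u)]^N > 0 established for tridiagonal queueing matrices in Section 8 below.

```python
import numpy as np

# Value iteration on the recurrence of Section 6, maximizing over a
# small illustrative control set.  The theorem predicts that the gain
# f_n(i) - f_{n-1}(i) tends to the same scalar r for every state i.
controls, N = (1, 2, 3), 5
f = np.zeros(N + 1)                        # f_0(i) = C_i = 0
for _ in range(200):
    candidates = []
    for u in controls:
        P, R = transition_matrix(u, N=N), return_matrix(u, N=N)
        b = (P * R).sum(axis=1)            # b_i(u) = sum_j p_ij(u) r_ij(u)
        candidates.append(b + P @ f)       # b_i(u) + sum_j p_ij(u) f_{n-1}(j)
    f_new = np.max(candidates, axis=0)     # maximize over u, state by state
    gain = f_new - f
    f = f_new
print(np.round(gain, 4))                   # nearly identical entries ~ r

# Condition (c) for this tridiagonal queueing matrix (see Section 8):
print((np.linalg.matrix_power(transition_matrix(2, N=N), N) > 0).all())
```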

8. A PROPERTY OF SIMPLE QUEUEING MATRIX

In this section we show that a simple queueing matrix has the property that [P(u)]^k > 0 for all sufficiently large k. It can be shown, for the simple queueing system with a negative-exponential service-time distribution and Poisson arrivals, that if Δt is small, the transition matrix [P(u)] of the simple queue is merely nonnegative: only its main-diagonal elements and the two elements adjacent to the diagonal are nonzero. For this type of matrix the following theorem shows that [P(u)]^k > 0 for k ≥ N, where N + 1 is the number of states.

Theorem

Any finite-state simple queueing matrix [P(u)] has the property

    [P(u)]^k > 0 for k ≥ N

where N + 1 is the order of the matrix [P(u)].

Proof

For a simple queueing matrix,

    p_ij > 0 if j = i − 1, i, i + 1 and 0 ≤ i, j ≤ N
    p_ij = 0 otherwise

(for simplicity, p_ij(u) is written as p_ij). Let us denote the elements of [P(u)]^m by p_ij^(m); then, for m = 2,

    p_ij^(2) = Σ_{k=0}^{N} p_ik p_kj = p_{i,i−1} p_{i−1,j} + p_ii p_ij + p_{i,i+1} p_{i+1,j}

From the assumed property of the queueing matrix it then follows that

    p_ij^(2) > 0 for j = i − 2, i − 1, i, i + 1, i + 2 and 0 ≤ i, j ≤ N
    p_ij^(2) = 0 otherwise.

For m = 3,

    p_ij^(3) = Σ_{k=0}^{N} p_ik p_kj^(2) = p_{i,i−1} p_{i−1,j}^(2) + p_ii p_ij^(2) + p_{i,i+1} p_{i+1,j}^(2)

From the properties of p_ij and p_ij^(2) it can readily be shown that

    p_ij^(3) > 0 for j = i − 3, i − 2, i − 1, i, i + 1, i + 2, i + 3 and 0 ≤ i, j ≤ N
    p_ij^(3) = 0 otherwise.

To apply mathematical induction, we assume that p_ij^(m) > 0 for j = i − m, i − (m−1), ..., i, ..., i + (m−1), i + m (with 0 ≤ i, j ≤ N) and that p_ij^(m) = 0 otherwise. The element p_ij^(m+1) of the (m+1)th power of the matrix [P(u)] has the form

    p_ij^(m+1) = Σ_{k=0}^{N} p_ik p_kj^(m) = p_{i,i−1} p_{i−1,j}^(m) + p_ii p_ij^(m) + p_{i,i+1} p_{i+1,j}^(m)

On the basis of the induction hypothesis and the above iteration formula, it is evident that

    p_ij^(m+1) > 0 for j = i − (m+1), i − m, ..., i, ..., i + m, i + (m+1) and 0 ≤ i, j ≤ N
    p_ij^(m+1) = 0 otherwise.

Thus the number of nonzero elements in a row of the matrix [P(u)]^m increases with m: each multiplication by [P(u)] increases the number of nonzero elements in a row by two until one end of the row is filled, and then by one until the entire row is filled. The extreme left nonzero element in the ith row of [P(u)]^m is p_{i,i−m}^(m), and the extreme right nonzero element in the same row is p_{i,i+m}^(m). For a queueing matrix of order N + 1, the element that fills last is the corner element p_{0,N}^(m), which first becomes nonzero when m = N. Therefore

    [P(u)]^k > 0 for k ≥ N.    Q.E.D.

9. THE COMPUTATIONAL METHOD OF FINDING OPTIMUM CONTROLS BY POLICY ITERATION

A method of finding the asymptotic returns and optimum controls, developed by R. Howard, is summarized in the following paragraphs.

9.1 Asymptotic Solution of a Markovian Decision Process as n → ∞

It can be shown that, for a given policy (not necessarily optimal), the expected return f_n(i) satisfies the recurrence equation

    f_n(i) = b_i + Σ_{j=0}^{N} p_ij f_{n−1}(j),    i = 0, 1, 2, ..., N; n = 1, 2, 3, ...

Earlier it was proved by the asymptotic theorem that if [P(u)]^k > 0, then f_n(i) ~ nr. Making use of this result, we can write f_n(i) in the asymptotic form

    f_n(i) = n r + v_i,    i = 0, 1, 2, ..., N, as n → ∞

where r and the v_i are constants. Substituting this asymptotic form back into the recurrence equation yields

    n r + v_i = b_i + Σ_{j=0}^{N} [(n − 1) r + v_j] p_ij

Since Σ_{j=0}^{N} p_ij = 1, this reduces to

    r + v_i = b_i + Σ_{j=0}^{N} p_ij v_j,    i = 0, 1, 2, ..., N

where v_0, v_1, v_2, ..., v_N and r are unknowns. We now face a problem with N + 2 unknowns and only N + 1 equations. However, if we set v_N = 0, we can solve for v_0, v_1, ..., v_{N−1} and r. Indeed, if a constant a is added to every v_i, i.e.,

    r + (v_i + a) = b_i + Σ_{j=0}^{N} p_ij (v_j + a) = b_i + Σ_{j=0}^{N} p_ij v_j + a

the original equation is recovered. This shows that the absolute values of the v_i do not enter the calculation; only their relative values can be obtained.

9.2 Howard Iteration Cycle

The computational technique developed by R. Howard is shown in Fig. 1. If there is no a priori reason for selecting a particular initial policy, it is convenient to start with the policy improvement routine with all v_i = 0. The starting decision in each state is then the one that maximizes b_i, and the routine is stopped when the policies on two successive iterations are identical.

    Value-Determination Operation

    Use the p_ij and b_i of the given policy to solve

        r + v_i = b_i + Σ_{j=0}^{N} p_ij v_j,    i = 0, 1, ..., N

    with v_N = 0.

    Policy Improvement Routine

    For each state i, find the alternative (control) u′ that maximizes

        b_i(u′) + Σ_{j=0}^{N} p_ij(u′) v_j

    using the v_j of the previous policy. Then u′ becomes the new
    decision in the ith state.

    Fig. 1. Howard iteration cycle
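Below is a minimal sketch of the complete cycle of Fig. 1, using the illustrative matrices of the earlier sections (the function and variable names are assumptions, not Howard's notation). The value-determination step solves the linear equations of Section 9.1 with v_N = 0 by treating (v_0, ..., v_{N−1}, r) as the unknowns; the policy-improvement step maximizes b_i(u) + Σ_j p_ij(u) v_j state by state and stops when two successive policies agree.

```python
import numpy as np

def howard_policy_iteration(P_of_u, R_of_u, controls, N):
    """Sketch of the Howard iteration cycle of Fig. 1."""
    policy = np.zeros(N + 1, dtype=int)           # index into `controls`
    while True:
        # Value-determination: solve r + v_i = b_i + sum_j p_ij v_j
        # with v_N = 0; the unknown vector is (v_0, ..., v_{N-1}, r).
        P = np.array([P_of_u(controls[policy[i]])[i] for i in range(N + 1)])
        b = np.array([(P[i] * R_of_u(controls[policy[i]])[i]).sum()
                      for i in range(N + 1)])
        A = np.eye(N + 1) - P                     # coefficients of the v_j
        A[:, N] = 1.0                             # column for r, since v_N = 0
        sol = np.linalg.solve(A, b)
        v, r = np.append(sol[:N], 0.0), sol[N]
        # Policy improvement: maximize b_i(u) + sum_j p_ij(u) v_j per state.
        scores = np.array([[(P_of_u(u)[i] * R_of_u(u)[i]).sum()
                            + P_of_u(u)[i] @ v
                            for u in controls]
                           for i in range(N + 1)])
        new_policy = scores.argmax(axis=1)
        if (new_policy == policy).all():          # identical on two passes: stop
            return [controls[j] for j in policy], r, v
        policy = new_policy

# Using the illustrative matrices of Sections 3 and 5:
best, r, v = howard_policy_iteration(transition_matrix, return_matrix,
                                     controls=(1, 2, 3), N=5)
print("control per state:", best, "  gain r =", round(r, 4))
```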

10. CONCLUSIONS

It has been shown that, under the assumption that the transition probabilities of a queueing system are controllable, a steady-state optimal control exists which is a function of the current state only. This conclusion is important for the operation of a multiple-computer system, because the dependence of the optimal control on the current state alone allows the management of the system to construct an optimal control table a priori. Optimum operation of the system can then be obtained by automatically observing the current state of the system and changing the organization of the multiple-computer system according to the optimum control table.

REFERENCES

1. R. A. Howard, Dynamic Programming and Markov Processes, Technology Press and John Wiley, New York, 1960.

2. R. Bellman, Dynamic Programming, Princeton University Press, Princeton, New Jersey, 1960.

3. R. Bellman, "A Markovian Decision Process," J. Math. and Mech., Vol. 6, 1957, pp. 679-684.