THE UNIVERSITY OF MICHIGAN
COLLEGE OF LITERATURE, SCIENCE, AND THE ARTS
Department of Communication Sciences

Technical Report

A CLASS OF SEQUENTIAL SAMPLING PROBLEMS ARISING IN CERTAIN LEARNING SITUATIONS

Edwin S. Bainbridge

ORA Projects 08226, 03105 and 01252

supported by:

U.S. ARMY RESEARCH OFFICE (DURHAM), CONTRACT NO. DA-31-124-ARO-D-483, DURHAM, NORTH CAROLINA

DEPARTMENT OF THE NAVY, OFFICE OF NAVAL RESEARCH, CONTRACT NO. Nonr-1224(21), WASHINGTON, D.C.

DEPARTMENT OF HEALTH, EDUCATION, AND WELFARE, PUBLIC HEALTH SERVICE, NATIONAL INSTITUTES OF HEALTH, GRANT NO. GM-12236-04, BETHESDA, MARYLAND

administered through: OFFICE OF RESEARCH ADMINISTRATION, ANN ARBOR

December 1967

Distribution of this document is unlimited.

RESEARCH PROGRESS REPORT

Title: "A Class of Sequential Sampling Problems Arising in Certain Learning Situations," E. S. Bainbridge, University of Michigan Technical Report 03105-49-T.

Background: The Logic of Computers Group of the Communication Sciences Department of The University of Michigan is investigating the application of logic and mathematics to the design of computing automata. The application of the techniques and concepts of mathematics, including sequential sampling techniques, to the study of automaton learning forms a part of this investigation.

Condensed Report Contents: A strategist is to decide on each of $n$ turns whether to take a sample from a certain fixed random variable, and receive the outcome as payoff, or to receive as payoff the largest value of the random variable he has discovered so far. The expected total payoff for the $n$ turns is to be maximized. It is shown that the following decision procedure is the solution. Sample until

$$\frac{\int_{-\infty}^{x} F(y)\,dy}{\int_{x}^{\infty} (1 - F(y))\,dy}$$

exceeds the number of turns remaining, where $x$ is the current record value and $F$ is the distribution function of the random variable. A strategy for an indefinite number of turns is described, and for suitable distributions it is shown that the limit of the ratio of the payoff accumulated by this strategy in $n$ turns to the payoff accumulated by the optimal strategy for $n$ turns is one, with probability one.

For Further Information: The complete report is available in the major Navy technical libraries and can be obtained from the Defense Documentation Center. A few copies are available for distribution by the author.

Introduction In a competitive learning situation, the individual must not only discover strategies which yield high payoffs, but also spend a certain part of his time actually using these strategies in order to accumulate enough payoff to maintain his competitive position. Typically, the type of activity which produces information which may be used to improve his strategy is quite distinct from the activity involved in the use of the current best strategy, and will yield considerably lower payoffs, if any. For example, market research presumably produces a better advertising campaign (from the point of view of the manufacturer), but the money spent there increases revenue only when the new campaign is begun. It is therefore necessary for the individual with limited resources to make decisions between information producing and payoff producing activities. In this example, the individual may, if he has sufficient resources, elect to devote some to the gathering of information, and some to the gathering of payoff. However, the resources may be so restricted that only one of these activities may be undertaken at any time. For example, a chess player must decide whether to try an opening with which he is relatively unfamiliar in case it should prove more effective than the one he typically uses, or to stick with his familiar favorite opening. The situation we examine in this paper is inspired by the latter example. A special case of this type of learning situation may be termed trial and error learning. We imagine the player to have available a class of strategies, some of which have been tried, and others which have not. On each turn the player is to decide whether to try a new one, or to use the strategy which has been found to provide the highest payoff. In any real situation, if the player decides to attempt a new strategy, he will choose

one which he thinks has a reasonable chance of success. Also, even if the strategy he tries does not prove better than the current best strategy, the player will presumably get information which he may use in deciding which untried strategy he will attempt next. It would therefore seem unrealistic to suppose that when the player chooses a new strategy he does so at random. However, we may redefine the problem as follows. We may consider the strategy by which the player analyzes unsuccessful attempts in order to decide what his next trial shall be. If he always assimilates information from his explorations in the same way, then surely there is an advantage to be gained in sometimes assimilating it in a different way. If he does change it from time to time, then he has exactly the same problem with these second level strategies as he does with his first level strategies. We therefore argue that unless he has an infinite hierarchy of strategies, at some level either he selects new strategies randomly, or it is to his advantage to do so. It is this problem, in which new strategies are selected randomly, and the only information gained from them is whether or not they were an improvement on the currently best strategy, that we call trial and error learning, and discuss in this paper. This, then, is the motivation for our formulation, which is the following. Each turn must be either a play of the best strategy, which is assumed to produce the same payoff each time it is used, or a trial of a new strategy, which is selected randomly, and hence produces a payoff from some fixed probability distribution. We first consider the case in which a fixed number of turns is allotted, and the expected total payoff for those turns is to be maximized. The solution is explicitly given for the case of a continuous distribution with a mean, and involves a certain criterion function.

This criterion function is then used to define a strategy which does not depend on any fixed number of turns. For suitable distributions, it is shown that, with probability one, the ratio of the payoff accumulated by this strategy in $n$ turns to the payoff of the optimal strategy for $n$ turns approaches one. It is also pointed out that if, as one would expect to be the case, the payoff distribution is not known, then the form of the criterion function is such that it may be estimated as play proceeds.

I. The Basic Decision Problem

Let $X$ be a fixed random variable. We consider a class of decision problems, each of which is characterized by a positive integer $n$ and a real number $x$. The decision problem $(n,x)$ is given as follows. On each of $n$ turns, a decision is to be made between two alternatives, which we shall call sampling and collecting. The decision to sample results in the strategist receiving as payoff the result of an independent sample from the random variable $X$. The decision to collect results in the strategist receiving as payoff the current record value, where the record value on each turn of the decision process is defined as follows. On the first turn, the record value is $x$. On successive turns, the record value is the larger of the record value on the previous turn and the payoff received on the previous turn. The strategist is to maximize the expected total payoff for the $n$ turns.

We obtain the decision procedure which does this as follows. Let us assume that the random variable $X$ has a mean $\bar x$, and write $\varphi_n(x)$ for the maximum expected total payoff over all decision procedures for the problem $(n,x)$. If the strategist samples on the first turn, he receives the payoff $X$, and his new record value is $\max(X,x)$. If he collects on the first turn, he receives the payoff $x$, and his record value remains $x$. Thus we have the recursion

$$\varphi_n(x) = \max\bigl\{\,\bar x + E[\varphi_{n-1}(\max(X,x))],\ x + \varphi_{n-1}(x)\,\bigr\} \qquad (1)$$

Since $\varphi_1(x) = \max(\bar x, x)$, we may calculate $\varphi_n(x)$ for all $(n,x)$ if we know the distribution of $X$. Furthermore, let us define

$$\psi_n(x) = x + \varphi_{n-1}(x) - \bar x - E[\varphi_{n-1}(\max(X,x))].$$

Then if $r$ turns remain, and the current record value is $x$, the optimal decision for the next turn is

$$\text{sample if } \psi_r(x) < 0, \qquad \text{collect if } \psi_r(x) \ge 0 \qquad (2)$$

since by making this decision the strategist chooses the alternative leading to the larger expected total payoff.
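As an aside, the recursion (1) and rule (2) are directly computable. The following is a minimal numerical sketch, assuming (for illustration only, not from the report) that $X$ is exponential with mean 1 and that the distribution is discretized on a finite grid; all names are ours.

```python
import numpy as np

# Value iteration for recursion (1); psi_n is the criterion of rule (2).
# Illustrative assumptions: X ~ Exponential(1), truncated to [0, 20] and
# discretized, so all quantities are grid approximations.

grid = np.linspace(0.0, 20.0, 2001)      # candidate record values x
w = np.exp(-grid)                        # density of X on the grid ...
w = w / w.sum()                          # ... normalized to sampling weights
mean = float(np.sum(w * grid))           # approximates x-bar = 1
idx = np.arange(len(grid))

def step(phi_prev):
    """One application of (1): returns (phi_n, psi_n) from phi_{n-1}."""
    # E[phi_{n-1}(max(X, x))] for each grid point x, by direct summation.
    e = np.array([np.sum(w * phi_prev[np.maximum(idx, i)]) for i in idx])
    sample_val = mean + e                # expected total if we sample now
    collect_val = grid + phi_prev        # expected total if we collect now
    return np.maximum(sample_val, collect_val), collect_val - sample_val

phi = np.maximum(mean, grid)             # phi_1(x) = max(x-bar, x)
for n in range(2, 6):
    phi, psi = step(phi)
    z_n = grid[np.argmax(psi >= 0)]      # unique zero of the increasing psi_n
    print(f"n = {n}: z_n = {z_n:.3f}")
```

The printed thresholds are the zeros $z_n$ of $\psi_n$ obtained in the sequel, and agree with the closed form that (7) yields for this distribution.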

Let us further assume that $X$ has a continuous density function $f(x)$, with distribution function $F(x)$. Then the following facts are readily verified by induction.

(i) $\varphi_n(x)$ is continuous and piecewise differentiable for each $n$, with $\varphi_n'(x) > 0$ where defined.

(ii) $\psi_n(x)$ is continuous and piecewise differentiable for each $n$, with $\psi_n'(x) \ge 1$ where defined.

(iii) $\psi_n(x) \le x - \bar x$, and hence $\psi_n$ is negative for $x < \bar x$.

If $X$ is unbounded, then $\psi_n(x)$ is ultimately positive by (ii), and if $X$ is bounded, say by $x_{\max}$, then $\psi_n(x_{\max}) > 0$. Thus in either case $\psi_n(x)$ has positive values, and by (iii) negative values, and by (ii) is strictly increasing, and hence has a unique zero, which we denote by $z_n$. We may thus restate the optimal decision rule when $r$ turns remain and the current record value is $x$, as

$$\text{sample if } x < z_r, \qquad \text{collect if } x \ge z_r \qquad (3)$$

We now verify that

$$z_n \le z_{n+1} \qquad (4)$$

First, $\psi_n(z_n) = 0$ by definition, so that

$$z_n + \varphi_{n-1}(z_n) = \bar x + E[\varphi_{n-1}(\max(X,z_n))],$$

and thus $\varphi_n(z_n) = \bar x + E[\varphi_{n-1}(\max(X,z_n))]$, by (1). Now we see that $\psi_{n+1}(z_n)$ is not positive, since

$$\psi_{n+1}(z_n) = z_n + \varphi_n(z_n) - \bar x - E[\varphi_n(\max(X,z_n))] = z_n + \bigl[\bar x + E[\varphi_{n-1}(\max(X,z_n))]\bigr] - \bar x - E[\varphi_n(\max(X,z_n))],$$

and, noting that for any $y$, $\varphi_n(y) - \varphi_{n-1}(y) \ge y$, and taking $y = \max(X,z_n)$, we obtain

$$\psi_{n+1}(z_n) \le z_n - E[\max(X,z_n)] < 0.$$

Thus we must have $z_n < z_{n+1}$, as claimed, since $\psi_{n+1}(x)$ is increasing in $x$.

Now suppose $x \ge z_n$. Then $\psi_n(x) \ge \psi_n(z_n) = 0$, so that $x + \varphi_{n-1}(x) \ge \bar x + E[\varphi_{n-1}(\max(X,x))]$, and so $\varphi_n(x) = x + \varphi_{n-1}(x)$, by (1). But $x \ge z_n \ge z_{n-1}$, hence we obtain by induction that

$$\varphi_n(x) = nx \quad \text{for } x \ge z_n \qquad (5)$$

We are now in a position to obtain the optimal decision rule in an explicit form. Since $\psi_n(z_n) = 0$ and $\varphi_{n-1}(z_n) = (n-1)z_n$, $\varphi_{n-1}(\max(X,z_n)) = (n-1)\max(X,z_n)$, we obtain by substitution that

$$z_n + (n-1)z_n = \bar x + (n-1)E[\max(X,z_n)],$$

and hence

$$\frac{z_n - \bar x}{n-1} = E[\max(X,z_n)] - z_n \qquad (6)$$

Now for any continuous distribution with a mean the following facts are easily verified by writing, e.g., $1 - F(x) = \int_x^\infty f(y)\,dy$ and interchanging the order of integration.

(iv) $E[\max(X,x_0)] - x_0 = \int_{x_0}^\infty (1 - F(x))\,dx$

(v) $x_0 - \bar x = \int_{-\infty}^{x_0} F(x)\,dx - \int_{x_0}^\infty (1 - F(x))\,dx$

Thus taking $x_0 = z_n$, substituting in (6) and simplifying, we obtain

$$\frac{\int_{-\infty}^{z_n} F(x)\,dx}{\int_{z_n}^\infty (1 - F(x))\,dx} = n \qquad (7)$$

Let us now define

$$z(x) = \frac{\int_{-\infty}^{x} F(y)\,dy}{\int_{x}^{\infty} (1 - F(y))\,dy} \qquad (8)$$

We note that $z(x)$ is strictly increasing, and that $z(z_n) = n$ by (7).
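For completeness, the interchange of the order of integration behind (iv) can be written out in one line (a standard computation, spelled out here for convenience):

```latex
\begin{align*}
E[\max(X,x_0)] - x_0
  &= \int_{x_0}^{\infty} (x - x_0)\, f(x)\, dx
   = \int_{x_0}^{\infty}\!\! \int_{x_0}^{x} dy\, f(x)\, dx \\
  &= \int_{x_0}^{\infty}\!\! \int_{y}^{\infty} f(x)\, dx\, dy
   = \int_{x_0}^{\infty} \bigl(1 - F(y)\bigr)\, dy .
\end{align*}
```

Fact (v) follows in the same way, since $x_0 - \bar x = E[(x_0 - X)^+] - E[(X - x_0)^+]$ and $E[(x_0 - X)^+] = \int_{-\infty}^{x_0} F(y)\,dy$.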

We may thus once again restate our optimal decision rule when $r$ turns remain and the current record value is $x$, as

$$\text{sample if } z(x) < r, \qquad \text{collect if } z(x) \ge r \qquad (9)$$

Furthermore, the value $z(x)$, where $x$ is the current record value, may be estimated on the basis of the observations on $X$ taken to that point. Let us rewrite $z(x)$, using (v), as

$$z(x) = \frac{\int_{-\infty}^{x} F(y)\,dy}{\int_{-\infty}^{x} F(y)\,dy - x + \bar x}$$

We note that in this form the portion of the distribution of $X$ about which no evidence has been received, that is, the part above $x$, does not appear explicitly.

In summary, we have shown that if $X$ is a random variable with mean $\bar x$, continuous density function $f(x)$ and distribution function $F(x)$, then the decision which maximizes expected future total payoff when $r$ turns remain and the current record value is $x$, is

$$\text{sample if } z(x) < r, \qquad \text{collect if } z(x) \ge r,$$

where

$$z(x) = \frac{\int_{-\infty}^{x} F(y)\,dy}{\int_{x}^{\infty} (1 - F(y))\,dy} = \frac{\int_{-\infty}^{x} F(y)\,dy}{\int_{-\infty}^{x} F(y)\,dy - x + \bar x}$$

is a function which may be estimated from the samples so far taken.
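As a concrete illustration (our worked example, not from the report): for the exponential distribution with mean 1, $F(y) = 1 - e^{-y}$ for $y \ge 0$, the two integrals in (8) are $x - 1 + e^{-x}$ and $e^{-x}$, so that $z(x) = (x-1)e^x + 1$. Since $z$ is strictly increasing, the thresholds $z_r = z^{-1}(r)$ of rule (9) can be found by bisection:

```python
import math

# Criterion function for X ~ Exponential(1) (illustrative choice):
# by (8), z(x) = (x - 1 + e^{-x}) / e^{-x} = (x - 1) e^x + 1.

def z(x):
    return (x - 1.0) * math.exp(x) + 1.0

def z_inv(n, lo=0.0, hi=50.0):
    """Threshold z_n with z(z_n) = n, by bisection (z is increasing)."""
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if z(mid) < n else (lo, mid)
    return 0.5 * (lo + hi)

# Rule (9): with r turns remaining, sample while the record is below z_r.
for r in (1, 2, 5, 20):
    print(f"r = {r:2d}: collect once the record reaches z_r = {z_inv(r):.3f}")
```

Note that $z_1 = 1 = \bar x$, as it must be: with one turn remaining, one collects exactly when the record is at least the mean.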

II. The Extended Decision Problem

It is natural to ask whether there is an acceptable strategy if the number of turns available is not fixed. It is not clear what it would mean for such a strategy to be optimal, since any such strategy can be improved in the following sense. Let $\mathcal X$ be the sample space consisting of infinite sequences $\mathbf x = (x_1, x_2, \ldots)$ of independent samples from the random variable $X$. A strategy for an indefinite number of turns is characterized by a function $S$ mapping $\mathcal X$ into the set $\Delta$ of infinite sequences $\delta = (\delta_1, \delta_2, \ldots)$ of binary digits $\delta_r$. The interpretation of such a sequence is that a sample is taken on the $r$th turn just in case $\delta_r = 1$. We write the $r$th component of $S$ as $S_r$.

We have perhaps made this definition too broad, since not all such strategies are realizable, in the sense that the decision at turn $r$ may depend on samples not yet taken. However, this does not matter, since we shall construct from $S$ an improved strategy $S^*$ which is realizable if $S$ is.

We first exclude those strategies which after some point either never sample or never collect, that is, those strategies for which

$$P(\exists n\ \forall r > n,\ S_r = 0) = 1 \quad \text{or} \quad P(\exists n\ \forall r > n,\ S_r = 1) = 1$$

The former type can be improved in an obvious way by sampling beyond the point at which sampling would have stopped, until a better record value is found, and then collecting indefinitely; and the latter can be improved by collecting indefinitely once a record value better than the mean has been found.

Let us consider the more interesting cases more explicitly. If $S$ is not of the two previous types, then for almost all $\mathbf x \in \mathcal X$, play proceeds by alternate blocks of sampling and collecting. We can therefore pick some fixed $n$ such that with positive probability at least two blocks of collecting separated by a block of sampling have occurred in $n$ turns.

We define a strategy $S^*$ by

$$S^*_r = S_r \quad \text{for } r > n$$

and, for $r \le n$,

$$S^*_r = \begin{cases} 1 & \text{if } r \le \sum_{k=1}^n S_k \\ 0 & \text{if } r > \sum_{k=1}^n S_k \end{cases}$$

That is, we have arranged that in the first $n$ turns, all sampling is done before any collecting. If two blocks of collecting separated by a block of sampling occurred in the first $n$ turns, then with positive probability a better record value was found before the second block began. Now since the strategy $S^*$ does all collecting in the latter part of the first $n$ turns, it will have this better record value available when it begins collecting, and will thus do better than $S$, which made some collections with the smaller record value. Thus with positive probability, $S^*$ accumulates more payoff in the first $n$ turns than $S$ does.

Furthermore, if $S$ was realizable, then $S^*$ is realizable, as follows. $S^*$ can be effected by using $S$ as a "subroutine", sampling whenever $S$ calls for a sample, but ignoring calls for a collection until the number of samples plus the number of collections $S$ has called for equals $n$. At this point the number of turns actually taken by $S^*$ is only the number of samples that $S$ called for, and the remainder of the $n$ turns are still available to $S^*$ for collection. Note that at each stage, say when $s$ samples and $k$ calls for collection have been made ($s + k < n$), the course of play for the first $s + k$ turns if $S$ had been used can be reconstructed: the samples were actually taken, and the outcome of each collection $S$ called for would have been the record value at that time. Thus it is actually possible to use $S$ in the manner described, since any information on which $S$'s decisions might be conditional is available to the strategist who is using $S^*$.
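The effect of the rearrangement can be checked on a single sample path in a few lines; the following sketch uses our own names and an exponential distribution for illustration. Moving all of the first $n$ samples ahead of the collections never decreases the payoff on any path, since every collection is then made at the final record value:

```python
import random

# Payoff of a fixed decision sequence over one sample path (names ours).

def play(decisions, samples, record):
    """decisions[r] == 1 means sample on turn r+1; 0 means collect."""
    total, stream = 0.0, iter(samples)
    for d in decisions:
        if d == 1:
            x = next(stream)             # payoff is the sampled value
            total += x
            record = max(record, x)      # the record value is updated
        else:
            total += record              # payoff is the current record
    return total

random.seed(1)
samples = [random.expovariate(1.0) for _ in range(10)]
S      = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]  # collect-blocks split by sampling
S_star = sorted(S, reverse=True)         # same samples, all taken first
assert play(S_star, samples, 0.5) >= play(S, samples, 0.5)
print(play(S, samples, 0.5), "<=", play(S_star, samples, 0.5))
```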

This method of improving an arbitrary strategy $S$ was chosen instead of the more straightforward method of playing the optimal strategy for the first $n$ turns and then reverting to $S$, since it illustrates the following fact. Each strategy $S$ induces a family $\{S^{(n)}\}$ of strategies for $n$ turns, which are essentially $S$ modified so as to take advantage of the knowledge that only $n$ turns are available. $S^{(n)}$ is given, for $r \le n$, by

$$S^{(n)}_r = \begin{cases} 1 & \text{if } r \le \sum_{k=1}^n S_k \\ 0 & \text{if } r > \sum_{k=1}^n S_k \end{cases} \qquad (10)$$

We cannot find a strategy which is optimal in a strong sense, but we can find a strategy $S$ whose induced finite strategies $\{S^{(n)}\}$ are the optimal strategies for $n$ turns. Furthermore, under suitable assumptions, the total payoff received in $n$ turns by $S$ is asymptotically that received by $S^{(n)}$, for almost all $\mathbf x \in \mathcal X$. We specify $S$ by the decision rule: when $r$ collections have been made, and the current record value is $x$,

$$\text{sample if } z(x) < r + 1, \qquad \text{collect if } z(x) \ge r + 1 \qquad (11)$$
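Rule (11) is immediately implementable once $z$ is known (or estimated). A minimal sketch, again assuming (for illustration) $X$ exponential with mean 1, so that $z(x) = (x-1)e^x + 1$:

```python
import math
import random

# Indefinite-horizon strategy of rule (11), for X ~ Exponential(1):
# with r collections made so far and record value x,
#   sample if z(x) < r + 1, collect if z(x) >= r + 1.

def z(x):
    return (x - 1.0) * math.exp(x) + 1.0

def play_rule_11(n_turns, record=0.0, seed=0):
    rng = random.Random(seed)
    total, collections = 0.0, 0
    for _ in range(n_turns):
        if z(record) < collections + 1:  # sample
            x = rng.expovariate(1.0)
            total += x
            record = max(record, x)
        else:                            # collect
            total += record
            collections += 1
    return total

print(play_rule_11(1000))
```

Play automatically proceeds by the alternate blocks of sampling and collecting discussed below: each new record value is collected until the number of collections catches up with $z$ of the record, and then sampling resumes.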

We now establish that the induced strategies $\{S^{(n)}\}$ are indeed the optimal strategies for $n$ turns. Suppose the initial record value is $\hat x$, and consider an arbitrary $\mathbf x = (x_1, x_2, \ldots) \in \mathcal X$. We know the optimal strategies do all sampling before any collecting, so it is sufficient to show that for each $n$, $\mathbf x$, the optimal strategy for $n$ turns takes $\sum_{r=1}^n S_r(\mathbf x)$ samples.

We proceed by induction. The optimal strategy for one turn samples just in case $z(\hat x) < 1$, and so takes $S_1(\mathbf x)$ samples.

Let us assume that the optimal strategy for $n$ turns takes $\sum_{r=1}^n S_r(\mathbf x) = s$, say, samples in $n$ turns. Then since the $(s+1)$th sample was not taken, i.e., when $n - s$ turns remained and the current record value was $\max(\hat x, x_1, \ldots, x_s)$ the decision was made not to sample, we have by (9) that

$$z(\max(\hat x, x_1, \ldots, x_s)) \ge n - s$$

We examine two cases.

(a) $z(\max(\hat x, x_1, \ldots, x_s)) < n - s + 1$

If this is the case, then $S_{n+1}(\mathbf x) = 1$, since by our induction hypothesis $S$ has taken $s$ samples in $n$ turns, and so $n - s$ collections have been made, so that (a) together with (11) imply that $S$ will sample on the $(n+1)$th turn. However, after $s$ turns the optimal strategy for $n+1$ turns has made $s$ samples [it is immediately verified independently by induction that for each $\mathbf x \in \mathcal X$ the optimal strategy for $n+1$ turns samples at least as many times as the optimal strategy for $n$ turns] and so has $(n+1) - s$ turns remaining, and a record value of $\max(\hat x, x_1, \ldots, x_s)$, so that (a) together with (9) imply that the optimal strategy for $n+1$ turns takes at least $s+1$ samples. However

$$z(\max(\hat x, x_1, \ldots, x_{s+1})) \ge z(\max(\hat x, x_1, \ldots, x_s)) \ge n - s = (n+1) - (s+1)$$

so that the optimal strategy for $n+1$ turns does not sample on the $(s+2)$th turn, and hence takes exactly

$$\sum_{r=1}^n S_r(\mathbf x) + 1 = \sum_{r=1}^{n+1} S_r(\mathbf x)$$

samples.

(b) $z(\max(\hat x, x_1, \ldots, x_s)) \ge n - s + 1$

If this is the case, then $S_{n+1}(\mathbf x) = 0$, by (11). But also, by (9), the optimal strategy for $n+1$ turns will not sample on the $(s+1)$th turn, and hence takes exactly

$$\sum_{r=1}^n S_r(\mathbf x) = \sum_{r=1}^{n+1} S_r(\mathbf x)$$

samples.

We have thus established that $S^{(n)}$ is the optimal strategy for $n$ turns.

We now investigate the asymptotic properties of the payoff accumulated by $S$. We define the random variable $s_n$ by

$$s_n(\mathbf x) = \sum_{r=1}^n S_r(\mathbf x) \qquad (12)$$

and, if the initial record value is $\hat x$, we define the random variable $\bar x_n$ by

$$\bar x_n(\mathbf x) = \max(\hat x, x_1, \ldots, x_{s_n(\mathbf x)}) \qquad (13)$$

(with $\bar x_0 = \hat x$). We define $p_n$ to be the total payoff received by $S$ in $n$ turns, so that

$$p_n(\mathbf x) = \sum_{r=1}^{s_n(\mathbf x)} x_r + \sum_{r=1}^n (1 - S_r(\mathbf x))\,\bar x_r(\mathbf x) \qquad (14)$$

(note that $\bar x_r = \bar x_{r-1}$ whenever $S_r = 0$, so the payoff collected on turn $r$ is indeed $\bar x_r$), and we define $p^{(n)}$ to be the total payoff received by $S^{(n)}$ in $n$ turns. Since $S^{(n)}$ takes the same $s_n(\mathbf x)$ samples, but takes them first, its record value on each of its $n - s_n(\mathbf x)$ collections is $\bar x_n(\mathbf x)$, so that

$$p^{(n)}(\mathbf x) = \sum_{r=1}^{s_n(\mathbf x)} x_r + (n - s_n(\mathbf x))\,\bar x_n(\mathbf x) \qquad (15)$$

Thus, noting that $1 - S_r = (r - s_r) - ((r-1) - s_{r-1})$, we have by summation by parts that

$$\sum_{r=1}^n (1 - S_r)\,\bar x_r = (n - s_n)\,\bar x_n - \sum_{r=1}^{n-1} (r - s_r)(\bar x_{r+1} - \bar x_r),$$

so that (since $\bar x_r > \bar x_{r-1}$ only if $S_r = 1$, in which case $r - s_r = (r-1) - s_{r-1}$)

$$p^{(n)} - p_n = \sum_{r=1}^n (r - s_r)(\bar x_r - \bar x_{r-1}) \qquad (16)$$

For any $r \le n$, let us define $\bar r$ to be the greatest non-negative integer less than $r$ such that a sample was taken on turn $\bar r + 1$, but not on turn $\bar r$, if such exists; and $\bar r = r$ otherwise. If a sample was taken on turn $r$, then $\bar r \ne r$, and we have, since a sample was not taken on turn $\bar r$, that

$$z(\bar x_{\bar r - 1}) \ge (\bar r - 1) - s_{\bar r - 1} + 1$$

by (11), and

$$\bar x_{\bar r - 1} = \bar x_{\bar r}, \qquad s_{\bar r - 1} = s_{\bar r}.$$

Now a sample was taken on turn $\bar r + 1$, so $s_{\bar r + 1} = s_{\bar r} + 1$. Thus, substituting, we obtain

$$z(\bar x_{\bar r}) \ge (\bar r + 1) - s_{\bar r + 1}$$

Furthermore, since samples were taken on every turn from $\bar r + 1$ to $r$, we have $(\bar r + 1) - s_{\bar r + 1} = r - s_r$. Thus we have

$$S_r = 1 \implies z(\bar x_{\bar r}) \ge r - s_r \qquad (17)$$

Now $(\bar x_r - \bar x_{r-1}) = 0$ except possibly for those $r$ such that a sample was taken on turn $r$. Thus we have

$$p^{(n)} - p_n = \sum_{r=1}^n (r - s_r)(\bar x_r - \bar x_{r-1}) \le \sum_{r=1}^n z(\bar x_{\bar r})(\bar x_r - \bar x_{r-1})$$

We now note that if a sample was taken on turn $r$, then $\bar r \le r - 1$, so $z(\bar x_{\bar r}) \le z(\bar x_{r-1})$. Also, if $r$ is between $\bar n$ and $n$, then $\bar r = \bar n$, and $\bar x_{\bar r} = \bar x_{\bar n}$. Thus we have

$$p^{(n)} - p_n \le \sum_{r=1}^{\bar n} z(\bar x_{r-1})(\bar x_r - \bar x_{r-1}) + z(\bar x_{\bar n})(\bar x_n - \bar x_{\bar n})$$

Now it is clear that

$$\sum_{r=1}^{\bar n} z(\bar x_{r-1})(\bar x_r - \bar x_{r-1}) \le \int_{\hat x}^{\bar x_{\bar n}} z(x)\,dx$$

since $z$ is increasing. Thus

$$p^{(n)} - p_n \le \int_{\hat x}^{\bar x_{\bar n}} z(x)\,dx + z(\bar x_{\bar n})(\bar x_n - \bar x_{\bar n}) \qquad (18)$$

Furthermore, from (15), $p^{(n)} \ge (n - s_n)\,\bar x_n$, and so $p^{(n)} \ge (n - s_n)\,\bar x_{\bar n}$. Now if $\bar n \ne n$, then a sample was taken on turn $\bar n + 1$, so $z(\bar x_{\bar n}) < (\bar n - s_{\bar n}) + 1 \le (n - s_n) + 1$, so that

$$p^{(n)} \ge (z(\bar x_{\bar n}) - 1)\,\bar x_{\bar n} \qquad (19)$$

However, if $\bar n = n$, then no samples have been taken up to and including turn $n$. This means that $z(\hat x) \ge n$, since with no sampling the record value is still $\hat x$. Thus if $z(\hat x)$ is finite then sampling will begin after $[z(\hat x)]$ collections, where $[y]$ denotes the greatest integer not exceeding $y$, and thus we may assert (19) for sufficiently large $n$, provided $F(\hat x) < 1$, since $z$ is finite for all such $\hat x$. We exclude the case $F(\hat x) = 1$, since then the decision procedure is pointless, and obtain, for large $n$,

$$\frac{p^{(n)} - p_n}{p^{(n)}} \le \frac{\int_{\hat x}^{\bar x_{\bar n}} z(x)\,dx}{(z(\bar x_{\bar n}) - 1)\,\bar x_{\bar n}} + \frac{z(\bar x_{\bar n})(\bar x_n - \bar x_{\bar n})}{(z(\bar x_{\bar n}) - 1)\,\bar x_{\bar n}} \qquad (20)$$

Relation (20) holds for any continuous distribution which has a mean. We now establish sufficient conditions for the limit of the right side of (20) to be $0$ for almost all $\mathbf x \in \mathcal X$, and thus we will have

$$P\Bigl(\lim_{n\to\infty} \frac{p_n}{p^{(n)}} = 1\Bigr) = 1.$$

We assume that for every $\xi$, $F(\xi) < 1$, and that for every $\varepsilon > 0$,

$$\lim_{\xi\to\infty} \frac{1 - F((1+\varepsilon)\xi)}{1 - F(\xi)} = 0$$

On the latter assumption, we have for all sufficiently large $\xi$

$$1 - F((1+\varepsilon)\xi) \le \varepsilon\,(1 - F(\xi))$$

and hence

$$1 - F((1+\varepsilon)^n \xi) \le \varepsilon^n (1 - F(\xi)).$$

Thus

$$\int_\xi^\infty (1 - F(x))\,dx = \sum_{n=0}^\infty \int_{(1+\varepsilon)^n \xi}^{(1+\varepsilon)^{n+1} \xi} (1 - F(x))\,dx \le \sum_{n=0}^\infty \varepsilon^n (1 - F(\xi))\bigl[(1+\varepsilon)^{n+1} - (1+\varepsilon)^n\bigr]\xi = \frac{\varepsilon}{1 - \varepsilon(1+\varepsilon)}\,\xi\,(1 - F(\xi))$$

if $\varepsilon(1+\varepsilon) < 1$. Since $\varepsilon$ may be taken arbitrarily small, it follows that

$$\lim_{\xi\to\infty} \frac{\int_\xi^\infty (1 - F(x))\,dx}{\xi\,(1 - F(\xi))} = 0 \qquad (21)$$

Now by L'Hopital's rule,

$$\lim_{\xi\to\infty} \frac{(z(\xi) - 1)\,\xi}{\int_{\hat x}^\xi z(x)\,dx} = \lim_{\xi\to\infty} \frac{z(\xi) - 1 + \xi z'(\xi)}{z(\xi)} = \lim_{\xi\to\infty} \Bigl[1 - \frac{1}{z(\xi)} + \frac{\xi z'(\xi)}{z(\xi)}\Bigr]$$

But $1/z(\xi) \to 0$, and

$$\frac{\xi z'(\xi)}{z(\xi)} = \frac{\xi F(\xi)}{\int_{-\infty}^\xi F(y)\,dy} + \frac{\xi(1 - F(\xi))}{\int_\xi^\infty (1 - F(y))\,dy}$$

The first term is positive, and the second term is unbounded, by (21). Thus

$$\lim_{\xi\to\infty} \frac{(z(\xi) - 1)\,\xi}{\int_{\hat x}^\xi z(x)\,dx} = \infty$$

However, we require this result in the equivalent form

$$\lim_{\xi\to\infty} \frac{\int_{\hat x}^\xi z(x)\,dx}{(z(\xi) - 1)\,\xi} = 0 \qquad (22)$$

In order to show that with probability one $\lim_{n\to\infty} (p^{(n)} - p_n)/p^{(n)} = 0$, we now need only show that with probability one $\lim_{n\to\infty} \bar x_{\bar n} = \infty$ and $\lim_{n\to\infty} \bar x_n/\bar x_{\bar n} = 1$, in view of (20), (22), and the fact that $z$ is unbounded.

We first note that since $F(\xi) < 1$ for every $\xi$, $z(\xi)$ is finite for every $\xi$. Thus for any record value $\xi$ at most $[z(\xi)]$ collections will be made, so that sampling will eventually resume. However, once sampling has begun, it will terminate with probability one, since by (11) it terminates when a value $\xi'$ is found such that $z(\xi') \ge [z(\xi)] + 1$; but in order not to find such a $\xi'$, the largest of an arbitrarily long sequence of independent samples from $X$ would have to be bounded by $z^{-1}([z(\xi)] + 1)$, which happens with probability

$$\lim_{m\to\infty} \bigl(F(z^{-1}([z(\xi)] + 1))\bigr)^m = 0$$

Thus with probability one, play proceeds by infinitely many alternate blocks of sampling and collecting. Now if we pick a sequence of numbers $n_i$ such that for each $i$, $n_i$ is in the $i$th block of collections, then $\bar n_i$ will be in the $(i-1)$th block of collections, and since the record value is constant within each block of collections we have $\bar x_{\bar n_i} = \bar x_{n_{i-1}}$. Now $\bar x_{n_i}$ is unbounded with probability one, since with probability one infinitely many samples are taken, so $\bar x_{\bar n_i}$ is also unbounded.

Let us now define the random variable $e$ over $\mathcal X$ by

$$e = \sup\{\delta : \text{for infinitely many } n,\ \bar x_n \ge (1+\delta)\,\bar x_{\bar n}\}.$$

Now $P(\lim_{n\to\infty} \bar x_n/\bar x_{\bar n} \ne 1) > 0$ just in case $P(e > 0) > 0$. We shall show that $P(e > 0) = 0$, and hence that $P(\lim_{n\to\infty} \bar x_n/\bar x_{\bar n} = 1) = 1$. Suppose to the contrary that $P(e > 0) > 0$. Then there is some $\varepsilon > 0$ such that $P(e > \varepsilon) > 0$.

That is,

$$P\bigl(\sup\{\delta : \text{for infinitely many } n,\ \bar x_n \ge (1+\delta)\bar x_{\bar n}\} > \varepsilon\bigr) = P\bigl(\text{for infinitely many } n,\ \bar x_n > (1+\varepsilon)\bar x_{\bar n}\bigr) > 0$$

Now with probability $1$, $\bar x_n$ is unbounded, so we may exclude those $\mathbf x \in \mathcal X$ for which $\bar x_n$ is bounded, and choose values $n_i$ as before. Then

$$P\bigl(\text{for infinitely many } i,\ \bar x_{n_{i+1}} > (1+\varepsilon)\bar x_{n_i}\bigr) > 0$$

But we shall now show that

$$\lim_{\xi\to\infty} P\bigl(\bar x_{n_{i+1}} > (1+\varepsilon)\bar x_{n_i} \mid \bar x_{n_i} = \xi\bigr) = 0$$

and hence have our desired contradiction. The conditional density function of $\bar x_{n_{i+1}}$, given $\bar x_{n_i} = \xi$, is

$$\frac{f(\chi)}{1 - F(z^{-1}([z(\xi)] + 1))} \quad \text{for } \chi \ge z^{-1}([z(\xi)] + 1)$$

Thus

$$P\bigl(\bar x_{n_{i+1}} > (1+\varepsilon)\xi \mid \bar x_{n_i} = \xi\bigr) = \frac{1 - F\bigl(\max((1+\varepsilon)\xi,\ z^{-1}([z(\xi)] + 1))\bigr)}{1 - F(z^{-1}([z(\xi)] + 1))}$$

We note that

$$z(\xi + \varepsilon) = \frac{\int_{-\infty}^{\xi+\varepsilon} F(y)\,dy}{\int_{\xi+\varepsilon}^\infty (1 - F(y))\,dy} \ge \frac{\int_{-\infty}^{\xi} F(y)\,dy + \int_{\xi}^{\xi+\varepsilon} F(y)\,dy}{\int_{\xi}^\infty (1 - F(y))\,dy} \ge z(\xi) + \frac{\int_{\xi}^{\xi+\varepsilon} F(y)\,dy}{\int_{\xi}^\infty (1 - F(y))\,dy}$$

Thus if $\xi$ is sufficiently large that $\int_\xi^\infty (1 - F(y))\,dy < \varepsilon F(\xi)$, then $z(\xi+\varepsilon) > z(\xi) + 1$, and

$$\xi + \varepsilon = z^{-1}(z(\xi+\varepsilon)) > z^{-1}(z(\xi) + 1) \ge z^{-1}([z(\xi)] + 1).$$

Thus for large $\xi$,

$$\frac{1 - F\bigl(\max((1+\varepsilon)\xi,\ z^{-1}([z(\xi)]+1))\bigr)}{1 - F(z^{-1}([z(\xi)]+1))} \le \frac{1 - F((1+\varepsilon)\xi)}{1 - F(\xi+\varepsilon)} \le \frac{1 - F((1+\delta)(\xi+\varepsilon))}{1 - F(\xi+\varepsilon)}$$

for suitable $\delta > 0$ and large $\xi$. But the right-hand term approaches $0$, by hypothesis. This completes the proof that $\lim_{n\to\infty} p_n/p^{(n)} = 1$ with probability one, for suitable distributions.

We note that if the random variable is bounded, we also have the above result, with no assumptions other than that of a continuous distribution and the existence of a value $x_{\max}$ such that $F(x_{\max}) = 1$ and $F(x) < 1$ for $x < x_{\max}$. To show this, we note that

$$\lim_{\xi\to x_{\max}} \frac{\int_\xi^{x_{\max}} (1 - F(x))\,dx}{\xi\,(1 - F(\xi))} = 0$$

and that with probability one $\lim_{n\to\infty} \bar x_{\bar n} = x_{\max}$ and $\lim_{n\to\infty} \bar x_n/\bar x_{\bar n} = 1$, so that we obtain the result directly from (20).

In summary, we have shown that

$$P\Bigl(\lim_{n\to\infty} \frac{p_n}{p^{(n)}} = 1\Bigr) = 1$$

for any bounded random variable $X$ with a continuous density function, or for any random variable with mean $\bar x$ and continuous density function satisfying

(1) $F(\xi) < 1$ for all $\xi$;

(2) $\lim_{\xi\to\infty} \dfrac{1 - F((1+\varepsilon)\xi)}{1 - F(\xi)} = 0$ for all $\varepsilon > 0$.

We note that many standard distributions, such as the normal and exponential distributions, have properties (1) and (2).
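For the exponential distribution, property (2) is immediate, since $(1 - F((1+\varepsilon)\xi))/(1 - F(\xi)) = e^{-\varepsilon\xi} \to 0$. The asymptotic conclusion can also be checked empirically; the following Monte Carlo sketch (our construction, not from the report) runs the strategy of rule (11) for $X$ exponential with mean 1 and compares its payoff $p_n$ with the payoff $p^{(n)}$ of the induced optimal $n$-turn strategy, computed from (15):

```python
import math
import random

# Monte Carlo check that p_n / p^(n) tends to 1 for X ~ Exponential(1).

def z(x):
    return (x - 1.0) * math.exp(x) + 1.0

rng = random.Random(2)
p_n = 0.0          # payoff accumulated by the strategy of rule (11)
sample_sum = 0.0   # sum of all sampled values so far
s = 0              # number of samples taken, s_n
record, collections = 0.0, 0
for n in range(1, 100001):
    if z(record) < collections + 1:      # rule (11): sample
        x = rng.expovariate(1.0)
        p_n += x
        sample_sum += x
        s += 1
        record = max(record, x)
    else:                                # rule (11): collect
        p_n += record
        collections += 1
    if n % 25000 == 0:
        p_opt = sample_sum + (n - s) * record    # eq. (15)
        print(f"n = {n:6d}: p_n / p^(n) = {p_n / p_opt:.4f}")
```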

