A Weak Law of Large Numbers for Rare Events

Donald E. Brown
Department of Systems Engineering
University of Virginia
Charlottesville, VA 22901

Robert L. Smith
Department of Industrial and Operations Engineering
The University of Michigan
Ann Arbor, MI 48109

Technical Report 86-4
February 1986

Abstract

We show that the empirical distribution associated with a discrete probability distribution p, when constrained to lie within a convex information set A, becomes arbitrarily close, with arbitrarily high probability, to the distribution in A that minimizes the relative entropy with respect to p as the number of trials increases.

A Weak Law of Large Numbers for Rare Events

Statistical inference is traditionally concerned with using data and probability models to derive conclusions about a practical problem with inherent variability. Classical statistical inference uses only the data and the sampling function to derive conclusions about the sampled population. However, a large class of problems involves inference with data in the form of deterministic constraints on the underlying probability model. Typically the constraints do not uniquely determine the unknown distribution. A common approach to problems of this kind has been to use an information theoretic procedure known as relative entropy minimization. A special case of the relative entropy minimization procedure, known as entropy maximization, has been used in a wide variety of applications; examples include reliability (Tribus [1969]), urban modeling (Wilson [1970]), stock market pricing (Cozzolino and Zahner [1973]), oil spill damage assessment (Thomas [1979]), and statistical mechanics (Jaynes [1956]). Applications of the more general relative entropy minimization approach can be found in the areas of statistics (Kullback [1959]), statistical mechanics (Hobson [1971]), legal inference (Sampson and Smith [1984]), and risk assessment (Sampson and Smith [1982]). We provide in this paper a relative frequency interpretation of the relative entropy minimization procedure that lends empirical justification to the approach.

1. Relative Entropy Minimization as an Inference Procedure

The maximum entropy principle proposed by Jaynes [1956] was intended to be employed as a user-invariant method of assigning probabilities based on testable information. Information in this context consisted of inequality constraints on the unknown distribution. Relative entropy minimization is a more general inference procedure which admits an initial or prior distribution in addition to the constraint information.

More formally, let p = (p_0, p_1, ..., p_m) > 0 be the prior probability distribution expressed as a probability mass function, and let A, the information constraint, be a closed convex subset of the simplex S of all non-degenerate (m+1)-dimensional discrete distributions, where S = {q | Σ_{i=0}^m q_i = 1, q_i > 0 for i = 0, 1, 2, ..., m}. Then the principle of minimum relative entropy prescribes choosing that q* which minimizes the relative entropy subject to satisfying the constraint A, that is,

q* = argmin_{q ∈ A} I(q, p),

where I(q, p) = Σ_{i=0}^m q_i ln(q_i/p_i). We are reduced to entropy maximization when p_i = 1/(m+1) for all i, that is, when p is the uniform distribution. I(q, p) is also known as the cross-entropy (Shore and Johnson [1980]) or the Kullback-Leibler information discrimination (Kullback and Leibler [1951]). Justifications for using relative entropy minimization as an inference procedure have relied on axioms of information (Hobson [1969], Hobson and Cheung [1973], and Johnson [1979]) or axioms of inference (Shore and Johnson [1980]).

2. A Correspondence Property for Relative Entropy Minimization

One of the principal justifications for the use of entropy maximization as an inference procedure was provided by Jaynes [1968]. He demonstrated a correspondence between the maximum entropy distribution and the most likely outcome in repeated trials of a random experiment. In particular, let a_0, a_1, ..., a_m be the possible outcomes of an experiment where each outcome is equally likely to occur. Suppose now we repeat the experiment n times and observe the number of times N_i that outcome i occurred, for i = 0, 1, ..., m. Then Jaynes effectively showed that

n^{-1} ln P(N_0 = n_0, N_1 = n_1, ..., N_m = n_m) = K_n Σ_{i=0}^m (n_i/n) ln(n_i/n) + C_m + E_n,

where K_n < 0 and C_m > 0 are constants depending only on n and m respectively, and E_n → 0 as n → ∞ for fixed m.

From this result, Jaynes concluded the correspondence property that the probability distribution which maximizes the entropy is identical to the frequency distribution which can be realized in the greatest number of ways. We will extend and strengthen this correspondence property to the general minimum relative entropy procedure.

Suppose now that a_0, a_1, ..., a_m are the possible outcomes of an experiment for which outcome i occurs with probability p_i for i = 0, 1, ..., m. Let V_i^n = N_i/n be the relative frequency of occurrence of outcome i in n repeated independent trials of the experiment. We refer to V^n = (V_0^n, V_1^n, ..., V_m^n) as the empirical distribution based on n trials. The first lemma is due to Sanov [1961].

Lemma 1: For any v ∈ S,

P(V^n = v) = e^{-n(I(v, p) + O(ln n/n))}

for all n with nv integer, where the O(ln n/n) term depends only on m and is independent of v and p.

Proof: By Stirling's approximation (Knuth [1976], p. 111), n! = √(2πn) (n/e)^n (1 + O(1/n)), so that ln n! = n(ln n - 1) + O(ln n). Suppose that nv_i is integer for all i = 0, 1, ..., m. Then

P(V^n = v) = (n!/(nv_0! nv_1! ... nv_m!)) p_0^{nv_0} p_1^{nv_1} ... p_m^{nv_m},

and therefore

ln P(V^n = v) = ln n! - Σ_{i=0}^m ln(nv_i!) + Σ_{i=0}^m nv_i ln p_i
             = n(ln n - 1) + O(ln n) - Σ_{i=0}^m nv_i (ln(nv_i) - 1) + Σ_{i=0}^m O(ln(nv_i)) + Σ_{i=0}^m nv_i ln p_i
             = -Σ_{i=0}^m nv_i ln v_i + Σ_{i=0}^m nv_i ln p_i + O(ln n),

where the O(ln n) term depends only on m since nv_i ≤ n for all i. Hence

ln P(V^n = v) = -n Σ_{i=0}^m v_i ln(v_i/p_i) + O(ln n).

We have then

P(V^n = v) = e^{-n(Σ_{i=0}^m v_i ln(v_i/p_i) + O(ln n/n))} = e^{-n(I(v, p) + O(ln n/n))}.

Lemma 2: Let A_ε = S_ε(v*) ∩ A be an ε-neighborhood of v* = argmin_{v ∈ A} I(v, p), where p ∉ A and S_ε(v*) = {v : |v_i - v_i*| < ε for i = 0, 1, 2, ..., m}. Then for all ε > 0,

lim_{n→∞} P(V^n ∈ A_ε | V^n ∈ A) = 1.

Proof: Let A^c = A - A_ε. Then

P(V^n ∈ A^c) = Σ_{v ∈ A^c, nv integer} P(V^n = v) = Σ_{v ∈ A^c, nv integer} e^{-n(I(v, p) + O(ln n/n))}
            ≤ N(A^c) e^{-n(I(v^c, p) + O(ln n/n))},

where I(v^c, p) = min_{v ∈ A^c} I(v, p) and N(T) is the number of points v ∈ T ⊂ S with nv integer. Now N(A^c) ≤ N(S) ≤ ((n+m) choose m) = O(e^{m ln n}) for fixed m, where S = {v | Σ_{i=0}^m v_i = 1, v_i > 0 for i = 0, 1, 2, ..., m}. Hence

P(V^n ∈ A^c) ≤ e^{-n(I(v^c, p) + O(ln n/n))}.

On the other hand, by the continuity of I(v, p) in v over S, we can choose ε > ε' > 0 small enough so that I(v, p) < I(v^c, p) for all v in the closure of A_{ε'}. Let v̄ be a point in the closure of A_{ε'} such that I(v̄, p) = max I(v, p) over all v in the closure of A_{ε'}. Clearly, for all n ≥ N_{ε'} for some N_{ε'}, there is a v ∈ A_{ε'} with nv integer. Let v^n be any such point. Then for all n ≥ N_{ε'}, since v^n ∈ A_{ε'} ⊂ A,

P(V^n ∈ A) ≥ P(V^n = v^n) = e^{-n(I(v^n, p) + O(ln n/n))} ≥ e^{-n(I(v̄, p) + O(ln n/n))}.

Hence for all n ≥ N_{ε'}, we have

P(V^n ∈ A^c | V^n ∈ A) = P(V^n ∈ A^c)/P(V^n ∈ A) ≤ e^{-n(I(v^c, p) - I(v̄, p) + O(ln n/n))} → 0

as n → ∞, since I(v^c, p) > I(v̄, p). Since P(V^n ∈ A_ε | V^n ∈ A) = 1 - P(V^n ∈ A^c | V^n ∈ A), the lemma follows.

From Lemma 2, we may easily infer the following theorem.

Theorem (Weak Law): Let A be a closed convex subset of S = {q | Σ_{i=0}^m q_i = 1, q_i > 0 for i = 0, 1, 2, ..., m} and let p ∉ A be a point in S. Let V_i^n be the relative frequency of occurrence of outcome i in n independent trials of an experiment that results in outcome i with probability p_i for i = 0, 1, 2, ..., m. Set v* = argmin_{v ∈ A} I(v, p). Then for all ε > 0,

lim_{n→∞} P(|V_i^n - v_i*| > ε | V^n ∈ A) = 0 for i = 0, 1, 2, ..., m.

Proof: From Lemma 2, we have lim_{n→∞} P(V^n ∉ A_ε | V^n ∈ A) = 0, and hence our result.
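To make the theorem concrete, the following minimal Python sketch (an editorial illustration with assumed numbers, not an experiment from the report) conditions the empirical distribution of n multinomial trials on a convex event A that excludes p, and compares the conditional average of V^n with the numerically computed v* = argmin_{v ∈ A} I(v, p). The distribution p, the set A, the sample sizes, and the use of scipy's SLSQP solver are all illustrative choices.

    import numpy as np
    from scipy.optimize import minimize

    # Monte Carlo sketch of the Weak Law (assumed numbers only):
    # condition the empirical distribution of n multinomial trials on the
    # convex event A = {v : v_0 <= 0.4}, which excludes p, and compare the
    # conditional average of V^n with v* = argmin_{v in A} I(v, p).
    rng = np.random.default_rng(0)
    p = np.array([0.5, 0.35, 0.15])         # true distribution, p not in A
    n, batches = 100, 200_000

    counts = rng.multinomial(n, p, size=batches)
    V = counts / n                          # empirical distributions V^n
    in_A = V[:, 0] <= 0.4                   # batches whose V^n landed in A
    print("estimated P(V^n in A):", in_A.mean())
    print("conditional mean of V^n:", V[in_A].mean(axis=0))

    # v* by direct numerical minimization of I(v, p) over A.
    def I(v):
        v = np.clip(v, 1e-12, None)         # guard against log(0)
        return float(np.sum(v * np.log(v / p)))

    cons = [{"type": "eq", "fun": lambda v: v.sum() - 1.0},
            {"type": "ineq", "fun": lambda v: 0.4 - v[0]}]
    v_star = minimize(I, x0=np.full(3, 1.0 / 3.0), bounds=[(1e-12, 1.0)] * 3,
                      constraints=cons, method="SLSQP").x
    print("v* =", np.round(v_star, 4))      # compare with the conditional mean

For the assumed numbers the conditioning event has probability of a few percent, so plain rejection sampling is adequate; the agreement between the conditional mean and v* should tighten as n grows, which is the behavior the theorem describes.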

The general correspondence property suggested by the Theorem above may be summarized by the statement that the minimum relative entropy distribution is identical to the empirical distribution that would be observed after a large number of trials. In a related result, Van Campenhout and Cover [1981] demonstrated that the conditional probability distribution of a single trial, given a fixed sample mean, converges to the minimum relative entropy distribution as the number of trials grows large. We have shown that the relative frequency distribution over all trials also converges to this same minimum relative entropy distribution.
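As a final, purely editorial illustration of the exponential estimate in Lemma 1, the short sketch below compares the exact multinomial log-probability of a fixed empirical distribution with the relative entropy exponent -n I(v, p). The values of p and v are assumptions chosen only so that nv is integer for each n shown.

    import numpy as np
    from scipy.stats import multinomial

    # Informal check of Lemma 1: for a fixed empirical distribution v with
    # nv integer, -(1/n) ln P(V^n = v) should approach I(v, p) as n grows,
    # with a gap of order ln(n)/n.  The p and v below are assumed values.
    p = np.array([0.7, 0.2, 0.1])
    v = np.array([0.5, 0.3, 0.2])
    I_vp = float(np.sum(v * np.log(v / p)))     # relative entropy I(v, p)

    for n in (10, 100, 1000, 10000):
        counts = (n * v).astype(int)            # the event {V^n = v}
        exponent = -multinomial.logpmf(counts, n=n, p=p) / n
        print(n, round(float(exponent), 6), round(I_vp, 6))

The two printed exponents should agree more closely as n increases, with the difference shrinking roughly like ln(n)/n, as the lemma asserts.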

References

Campenhout, J. Van and T. Cover, "Maximum entropy and conditional probability", IEEE Transactions on Information Theory, IT-27, pp. 483-489, 1981.

Cozzolino, J. M. and M. J. Zahner, "The maximum-entropy distribution of the future market price of a stock", Operations Research, Vol. 21, pp. 1200-1211, 1973.

Hobson, A., "A new theorem of information theory", Journal of Statistical Physics, Vol. 1, pp. 383-391, 1969.

Hobson, A. and B. K. Cheung, "A comparison of the Shannon and Kullback information measures", Journal of Statistical Physics, Vol. 7, pp. 301-310, 1973.

Jaynes, E. T., "Information theory and statistical mechanics", Physical Review, Vol. 106, pp. 620-630, 1956.

Jaynes, E. T., "Prior probabilities", IEEE Transactions on Systems Science and Cybernetics, SSC-4, pp. 227-241, 1968.

Johnson, R. W., "Axiomatic characterization of the directed divergences and their linear combinations", IEEE Transactions on Information Theory, IT-17, pp. 641-650, 1979.

Knuth, Donald E., The Art of Computer Programming, Vol. 1, Second edition, Addison-Wesley, Reading, Massachusetts, 1976.

Kullback, S., Information Theory and Statistics, John Wiley & Sons, New York, 1959.

Kullback, S. and R. A. Leibler, "On information and sufficiency", Annals of Mathematical Statistics, Vol. 22, pp. 79-86, 1951.

Sampson, Allan R. and Robert L. Smith, "An information theory model for the evaluation of circumstantial evidence", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 15, pp. 9-16, 1984.

Sampson, Allan R. and Robert L. Smith, "Assessing risks through the determination of rare event probabilities", Operations Research, Vol. 30, pp. 839-866, 1982.

Sanov, I. N., "On the probability of large deviations of random variables", IMS and AMS Translations of Probability and Statistics (from Mat. Sbornik 42, pp. 11-44), 1961.

Shore, J. E. and R. W. Johnson, "Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy", IEEE Transactions on Information Theory, IT-26, pp. 26-37, 1980.

Thomas, M. U., "A generalized maximum entropy principle", Operations Research, Vol. 27, pp. 1188-1195, 1979.

Tribus, M., Rational Descriptions, Decisions and Designs, Pergamon, New York, 1969.

Wilson, A. G., Entropy and Urban Modeling, Pion Limited, London, 1970.