Relative Entropy and the Convergence of the Posterior and Empirical Distributions under Incomplete and Conflicting Information

A. Ben-Tal, Faculty of Industrial Engineering and Management, Technion Israel Institute of Technology, Haifa, Israel, and University of Michigan, Ann Arbor, MI 48109

D.E. Brown, Department of Systems Engineering, University of Virginia, Charlottesville, VA 22901

R.L. Smith, Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109

Technical Report 88-12
March 1988

Relative Entropy and the Convergence of the Posterior and Empirical Distributions under Incomplete and Conflicting Information

A. Ben-Tal, Faculty of Industrial Engineering and Management, Technion Israel Institute of Technology, Haifa, Israel, and Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109

D.E. Brown*, Department of Systems Engineering, University of Virginia, Charlottesville, VA 22901

R.L. Smith, Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, MI 48109

*The work of this author was partially funded by a grant from The Jet Propulsion Laboratory under Grant #JPL957721.

March 22, 1988

Abstract

It is well known that under complete information the posterior distribution converges to a degenerate distribution concentrated at the true probability distribution as

the sample size grows. Suppose, however, that the sample possesses average properties not shared in expectation by any probability distribution in the support of the prior. In this case we show that the posterior distribution converges, as the sample size grows, to a degenerate distribution concentrated at the distribution p*, the point of the support closest in the relative entropy sense to the set of distributions sharing these average properties. We show moreover that the empirical distribution converges in probability to the distribution v* closest in relative entropy to the support of the prior. Both p* and v* can be obtained computationally as solutions of certain convex programming problems. Implications for decision making in the presence of conflicting information are explored.

1 Introduction

Consider the problem of estimating a probability distribution when given only partial information on the distribution. When the partial information is based on a random sample drawn from a population following the unknown distribution, a Bayesian construction might provide the mode of the posterior distribution as a best estimate. However, when the partial information consists instead of deterministic constraints on the unknown distribution, it seems no one best estimate can be singled out from the set of distributions consistent with the information. Jaynes in 1957 introduced the principle of maximum entropy, which proposes to choose the probability distribution q of maximum entropy among those consistent with the constraints. He justified this choice with a correspondence principle demonstrating that if the generating distribution were the uniform distribution, then the maximum entropy estimate was the most likely empirical distribution among those consistent with the constraints. This principle has been generalized, using the Kullback-Leibler separator I(q, p) between two distributions q and p, to the principle of minimum relative entropy (Shore and Johnson [1980]), where this latter principle reduces to the former when p is the uniform distribution. In general, there has been enormous interest in this interplay between statistics and information theory within the communities of statistics (Sanov [1961], Vasicek [1980], and Bahadur [1971]), information theory (Van Campenhout and Cover [1981], Jaynes [1956], Johnson [1979], and Sampson and Smith [1985]), and operations research (Cozzolino and Zahner [1973], Sampson and Smith [1982], Thomas [1979], and Wilson [1970]).

In order to make sense of the correspondence principle, some kind of superpopulation model seems inescapable. That is, the maximum entropy distribution estimates the distribution for a population generated through a random sample from a superpopulation following a uniform distribution. On the other hand, the Bayesian construction, which chooses the mode of the posterior, would then be providing an estimate for the distribution of the superpopulation based on knowledge about the population distribution, the latter regarded as a sample drawn from the former. We intend to pursue this dual interpretation within this paper.
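To make the minimum relative entropy principle concrete, the following is a minimal numerical sketch (not taken from the paper) of projecting a reference probability mass function p onto a constraint set in the relative entropy sense; the outcome values a_i, the bound u, and the use of scipy's SLSQP solver are illustrative assumptions. When p is uniform, the same program returns a maximum entropy estimate in the sense of Jaynes.

```python
# A minimal sketch (not from the paper) of minimum relative entropy projection:
# project a reference pmf p onto A = {v in the simplex : sum_i a_i v_i <= u}.
# The outcome values a_i, the bound u, and the SLSQP solver are assumptions.
import numpy as np
from scipy.optimize import minimize

def rel_entropy(v, p):
    """I(v, p) = sum_i v_i log(v_i / p_i)."""
    v = np.asarray(v)
    return float(np.sum(v * np.log(v / p)))

def min_rel_entropy(p, a, u):
    """Return argmin over v in A of I(v, p)."""
    k = len(p)
    cons = [
        {"type": "eq", "fun": lambda v: np.sum(v) - 1.0},     # v sums to one
        {"type": "ineq", "fun": lambda v: u - np.dot(a, v)},  # mean at most u
    ]
    x0 = np.full(k, 1.0 / k)                                  # uniform start
    res = minimize(rel_entropy, x0, args=(p,), method="SLSQP",
                   bounds=[(1e-9, 1.0)] * k, constraints=cons)
    return res.x

p = np.array([0.1, 0.2, 0.3, 0.4])        # reference distribution
a = np.arange(4.0)                        # outcome values a_i = i
v_star = min_rel_entropy(p, a, u=1.0)     # mean of p is 2.0, so the bound binds
print(v_star, rel_entropy(v_star, p))
```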

Specifically, we suppose that certain constraints are imposed on the population (or empirical) distribution V^n for population size n, as well as on the superpopulation distribution P. The former constraints are represented by a subset A of distributions known to contain V^n, while the latter are represented by the support Ω of the prior distribution P over P. To avoid the trivial case we suppose that A and Ω do not intersect.

In section 2, the formal probability model is presented. In section 3 we collect and refine some large deviation results, and in section 4 we prove results concerning uniqueness and continuity of the relative entropy minimizers. These results are in preparation for section 5, which contains the main convergence results. In Theorem 5.1 and Corollary 5.2, we show that the posterior distribution P^n, based on a sample of size n, converges (as n goes to infinity) to a degenerate distribution concentrated at the distribution p* in Ω that minimizes the relative entropy between A and Ω. In Theorem 5.4 and Corollary 5.5, under mild regularity conditions, we extend the correspondence principle within this Bayesian construction. We show that not only the mode but the entire conditional distribution of V^n converges (as n goes to infinity) to a degenerate distribution concentrated at the distribution v* in A that minimizes the relative entropy between points of A and Ω. Both v* and p* are independent of the prior P and depend only on the geometry of A and Ω. Both p* and v* can be computed as optimal solutions of certain convex programming problems. In section 6, we discuss the implications of these results for decision making under incomplete and conflicting information.

2 The Probability Model

Let the prior distribution P be an absolutely continuous probability measure with support Ω ⊂ S, where

S = {q ∈ R^{m+1} : Σ_{i=0}^m q_i = 1, q_i > 0}

is the relative interior of the m-dimensional standard simplex in R^{m+1}. The true distribution P is a random probability mass function P ∈ Ω with distribution P. Let X be a finite discrete random variable that takes on the value a_i ∈ R with probability P_i for i = 0, 1, ..., m. Suppose P is determined and fixed in accordance with the prior distribution P, and X is then repeatedly observed within an experiment of n trials to yield a sample X_1, X_2, ..., X_n. That is, X_1, X_2, ..., X_n are identically distributed with p.m.f. P and conditionally independent when given P. Let N_i be the random variable equal to the number of times outcome a_i is observed in the n trials, for i = 0, 1, 2, ..., m. The empirical distribution (based on n trials) V^n is the random vector in R^{m+1} whose ith component represents the relative frequency of occurrence of outcome a_i in n trials, i.e.,

V^n = (N_0/n, N_1/n, ..., N_m/n).

Let A be a Borel measurable subset of S with Pr(V^n ∈ A) > 0 for n sufficiently large. For Borel sets B ⊂ S, the posterior distribution P^n of P, when given the information that

V^n ∈ A, is

P^n(B) = Pr(P ∈ B | V^n ∈ A),

and the sample distribution V̄^n of V^n, when given the information that V^n ∈ A, is

V̄^n(B) = Pr(V^n ∈ B | V^n ∈ A).

We consider in what follows conditions under which the posterior and sample distributions P^n and V̄^n converge as the sample size n increases.

3 Preliminary Large-Deviation Results

In this section we derive a large-deviation type result (Theorem 3.3) on the conditional probability of the empirical distribution. We begin by introducing the relative entropy I(v, p) between two distributions v and p in S, which is defined by

I(v, p) = Σ_{i=0}^m v_i log(v_i / p_i),

where log is here and elsewhere taken to the base e. I(v, p) is also called the Kullback-Leibler separator (Kullback [1959]) and the minimum discrimination information (e.g., Brockett, Charnes and Cooper [1980]). Since I(v, p) is nonnegative, and is equal to zero if and only if v = p, it is also referred to as the directed distance from p to v. Although I is not a metric (it is not symmetric, nor does it satisfy the triangle inequality), we shall occasionally adopt this distance terminology to draw geometric analogies for the results to follow.

For a given nonempty subset A ⊂ S and a probability vector p ∈ S, we denote by I(A, p) the relative entropy ("distance") between p and A, defined by

I(A, p) = inf_{v∈A} I(v, p).

The main properties of I(v, p) and I(A, p) needed for our purposes are summarized in Appendix I. The main result in this section (Theorem 3.3) is preceded by a result on uniform convergence of I(A^n, p) (Lemma 3.1) and a well known result bounding the conditional probability Pr(V^n ∈ A | P = p) (Lemma 3.2). Since the empirical distribution V^n, when multiplied by n, yields an integer vector, only a discrete lattice of points within the set A are possible values for V^n. That is, the set A^n of possible values of V^n is given by

A^n = A ∩ S^n,

where

S^n = {q ∈ S : nq_i is a nonnegative integer for i = 0, 1, ..., m}.

Since A^n becomes dense in A as n goes to infinity (for A a closed body), it is perhaps not surprising that the directed distance from p to A^n converges to the directed distance from p to A. Lemma 3.1 below extends this result in Bahadur [1971] by showing that for closed Ω this convergence is uniform in p.

Lemma 3.1 Let Ω and A be subsets of S with Ω closed and A a closed body (i.e., A is the closure of its interior A°). Then I(A^n, p) → I(A, p) uniformly over p ∈ Ω as n → ∞.

Proof: We first show that D(A^n, A) → 0 as n → ∞, where D is the Hausdorff metric defined for compact sets F and G by D(F, G) = max(h(F, G), h(G, F)), with h(F, G) = max{d(x, G) : x ∈ F} and d(x, y) the Euclidean distance between x and y. Now D(A^n, A) → 0 as n → ∞ iff (1) limsup_{n→∞} A^n ⊂ A and (2) A ⊂ liminf_{n→∞} A^n. Here, for a sequence of compact sets {F^n}, limsup_{n→∞} F^n (resp. liminf_{n→∞} F^n) is the set of all points x such that every open neighborhood of x is frequently (resp. eventually) intersected by the F^n (Hausdorff [1957]). Now, (1) follows immediately from the fact that A^n ⊂ A for all n and A is closed. As for (2), let v ∈ A and let N(v) be an open neighborhood of v. Now N(v) ∩ A° ≠ ∅ since A is the closure of A°, and N(v) ∩ A°, being the intersection of two open sets, is open. It follows that there exist ε > 0 and x ∈ N(v) ∩ A° such that the open ball B_ε(x) of radius ε around x satisfies B_ε(x) ⊂ N(v) ∩ A°. Let x^n be the Euclidean closest point in S^n to x. Since x ∈ S, d(x^n, x) → 0 as n → ∞, and hence there is an integer N such that x^n ∈ B_ε(x) for all n ≥ N. But then x^n ∈ A^n for all n ≥ N since B_ε(x) ⊂ A° ⊂ A. Also x^n ∈ N(v) for all n ≥ N since B_ε(x) ⊂ N(v). Hence A^n ∩ N(v) ≠ ∅ for all n ≥ N and A ⊂ liminf_{n→∞} A^n. Therefore D(A^n, A) → 0 as n → ∞.

Choose ε > 0. Since A^n ⊂ A for all n, I(A^n, p) ≥ I(A, p) ≥ I(A, p) − ε for all p ∈ Ω and for all n. It remains to show that I(A^n, p) ≤ I(A, p) + ε for all p ∈ Ω and all n ≥ N_ε, for some N_ε. Let v(p) ∈ Argmin{I(v, p) : v ∈ A}, which is nonempty since I(v, p) is a continuous function over the compact set A. Since D(A^n, A) → 0 as n → ∞, for all δ > 0 there is an N_δ such that D(A^n, A) < δ for all n ≥ N_δ. In particular, for all v ∈ A there are q^n(v) ∈ A^n, for all n ≥ N_δ, such that d(q^n(v), v) < δ. Now since I(v, p) is continuous on A × Ω (see Appendix I(d)), which is compact, I is uniformly continuous over A × Ω. In particular, for all ε > 0 there is a δ_ε, independent of p ∈ Ω, such that if d(v, v') < δ_ε for v, v' ∈ A then |I(v, p) − I(v', p)| < ε. Let N_ε = N_{δ_ε}, so that d(q^n(v(p)), v(p)) < δ_ε for all n ≥ N_ε. Then I(q^n(v(p)), p) − I(v(p), p) < ε for all p ∈ Ω and all n ≥ N_ε. Hence

I(A^n, p) − I(A, p) ≤ I(q^n(v(p)), p) − I(v(p), p) < ε

for all p ∈ Ω and all n ≥ N_ε. Therefore I(A^n, p) ≤ I(A, p) + ε and the result follows. ∎
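The convergence in Lemma 3.1 can be seen numerically. The following small sketch (not from the paper) takes m = 2, a hypothetical constraint set A = {v : v_1 ≤ 1/3}, and a fixed p with p_1 > 1/3; it enumerates the lattice S^n of empirical distributions and reports I(A^n, p), which approaches I(A, p) as n grows.

```python
# Illustrative check of Lemma 3.1 (assumed example sets, not from the paper):
# A = {v : v_1 <= 1/3}, p fixed with p_1 > 1/3, m = 2.
import numpy as np
from itertools import product

def I(v, p):
    v, p = np.asarray(v, float), np.asarray(p, float)
    mask = v > 0
    return float(np.sum(v[mask] * np.log(v[mask] / p[mask])))

def lattice(n):
    """All points of S^n: empirical pmfs (k_0/n, k_1/n, k_2/n)."""
    for k0, k1 in product(range(n + 1), repeat=2):
        k2 = n - k0 - k1
        if k2 >= 0:
            yield (k0 / n, k1 / n, k2 / n)

p = (0.2, 0.6, 0.2)
in_A = lambda v: v[1] <= 1 / 3 + 1e-12
# The projection of p onto A puts v_1 = 1/3 and splits the rest like p_0, p_2.
v_proj = (1 / 3, 1 / 3, 1 / 3)
print("I(A, p) =", I(v_proj, p))
for n in (10, 50, 250):
    print(n, min(I(v, p) for v in lattice(n) if in_A(v)))   # I(A^n, p)
```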

The following well known lemma provides bounds on the probability of finding the empirical distribution in A ⊂ S, when given P, in terms of the directed distance from P to A^n.

Lemma 3.2 Let A be a subset of S and p ∈ S ⊂ R^{m+1}. Then there exists a positive constant γ(m), depending only on m, such that for all n

exp{−n[I(A^n, p) + ((m−1)/2)(log n)/n − (log γ(m))/n]} ≤ Pr(V^n ∈ A | P = p) ≤ exp{−n[I(A^n, p) − m log(n+1)/n]}.

Proof: See Bahadur [1971, p. 18]. ∎

Combining Lemmas 3.1 and 3.2, we get the following important large-deviation result.

Theorem 3.3 Let Ω and A be subsets of S with Ω closed and A a closed body. Then

(1/n) log Pr(V^n ∈ A | P = p) → −I(A, p)

uniformly over p ∈ Ω as n → ∞.

Proof: From Lemma 3.2,

−(I(A^n, p) + ((m−1)/2)(log n)/n − (log γ(m))/n) ≤ (1/n) log Pr(V^n ∈ A | P = p) ≤ −(I(A^n, p) − m log(n+1)/n)

for all n, and hence

I(A, p) − I(A^n, p) − ((m−1)/2)(log n)/n + (log γ(m))/n ≤ (1/n) log Pr(V^n ∈ A | P = p) + I(A, p) ≤ I(A, p) − I(A^n, p) + m log(n+1)/n

for all n. From Lemma 3.1, I(A^n, p) → I(A, p) uniformly over p ∈ Ω as n → ∞. Let ε > 0 and choose N_1, N_2, N_3 large enough so that

|I(A, p) − I(A^n, p)| < ε/2

for all p ∈ Ω and all n ≥ N_1,

((m−1)/2)(log n)/n − (log γ(m))/n < ε/2

for all n ≥ N_2, and

m log(n+1)/n < ε/2

for all n ≥ N_3. Let N = max(N_1, N_2, N_3). Then

−ε < (1/n) log Pr(V^n ∈ A | P = p) + I(A, p) < ε

for all p ∈ Ω and all n ≥ N. ∎

Theorem 3.3 states that for large n, Pr(V^n ∈ A | P = p) ≈ exp{−n I(A, p)}. Hence for p ∉ A, the probability of finding the empirical distribution in a set A decays exponentially in its directed distance from p. Thus the likelihood of finding the empirical distribution in the closer to p of two sets is overwhelmingly greater than that of finding it in the more distant one. By demonstrating that the convergence is uniform over closed Ω, Theorem 3.3 is an extension of large deviation theory that began with the classic work of Sanov [1961].

4 Uniqueness and Continuity of the Relative Entropy Minimizers

The convergence results in the next section make use of uniqueness and continuity properties of the minimizers v(p) = argmin_{v∈A} I(v, p) and p* = argmin_{p∈Ω} I(v(p), p), and of the solution set of min_{(v,p)∈A×Ω} I(v, p). In this section we study these properties. The first lemma gives necessary conditions for nonuniqueness of the solution of min_{A×Ω} I(v, p).

Lemma 4.1 Let A, Ω be convex subsets of S. Consider the problem (P)

(P)   min_{v∈A, p∈Ω} I(v, p) = Σ_{i=0}^m v_i log(v_i / p_i).

If (v̄, p̄) and (v*, p*) are both solutions of (P), then

v̄_i / p̄_i = v*_i / p*_i   for all i = 0, 1, ..., m.   (1)

Proof: Let I* = I(v*, p*) = I(v̄, p̄) = min_{v∈A, p∈Ω} I(v, p). Since I is a convex function on R^{m+1}_{++} × R^{m+1}_{++} (see Appendix I(a)) and the feasible set A × Ω of (P) is convex, it follows that the set of optimal solutions of (P) is convex. In particular, for all 0 ≤ α ≤ 1,

(v^α, p^α) = (α v* + (1 − α) v̄, α p* + (1 − α) p̄)

is an optimal solution, i.e.,

I(α) := I(v^α, p^α) = I*   for all 0 ≤ α ≤ 1.

The latter equation implies

I'(0) = 0,   (2)

i.e.,

I'(0) = Σ_{i=0}^m {(v*_i − v̄_i)[log(v̄_i / p̄_i) + 1] − (p*_i − p̄_i)(v̄_i / p̄_i)} = 0.

Hence, using Σ v̄_i = Σ v*_i = 1 and (2),

Σ_{i=0}^m {v*_i log(v̄_i / p̄_i) − v̄_i p*_i / p̄_i} − I* + 1 = 0.   (3)

Consider the one-variable problem

max_{v_i > 0} {v*_i log(v_i / p̄_i) − v_i p*_i / p̄_i}.

The unique optimal solution of this strictly concave problem is

v̂_i = v*_i p̄_i / p*_i,

and the corresponding optimal value is v*_i log(v*_i / p*_i) − v*_i. Hence

I* − 1 = Σ_{i=0}^m {v*_i log(v*_i / p*_i) − v*_i} = Σ_{i=0}^m max_{v_i > 0} {v*_i log(v_i / p̄_i) − v_i p*_i / p̄_i}
       > Σ_{i=0}^m {v*_i log(v̄_i / p̄_i) − v̄_i p*_i / p̄_i},   if v̄_j ≠ v̂_j for some j,
       = I* − 1   by (3),

which is a contradiction. Hence v̄_i = v̂_i = v*_i p̄_i / p*_i for all i = 0, 1, ..., m, proving (1). ∎

The necessary conditions for nonuniqueness can generate sufficient conditions for uniqueness. This is done in the next three results. The first result is for the two-dimensional case (m = 1).

Corollary 4.2 Let A, Ω be convex subsets of S ⊂ R², with A ∩ Ω = ∅. Then problem (P) has a unique solution.

Proof: Suppose there are two distinct solutions (v*, p*) ≠ (v̄, p̄). Since A ∩ Ω = ∅, v* ≠ p*, and in particular

v*_0 ≠ p*_0.   (4)

By Lemma 4.1, we must have

v*_0 / p*_0 = v̄_0 / p̄_0,   v*_1 / p*_1 = v̄_1 / p̄_1,
v_0 + v_1 = 1,   p_0 + p_1 = 1   (for both solutions).   (5)

Consider the system of linear equations in the unknowns (v_0, p_0, v_1, p_1):

[ p*_0  −v*_0   0      0    ] [ v_0 ]   [ 0 ]
[ 0      0      p*_1  −v*_1 ] [ p_0 ] = [ 0 ]
[ 1      0      1      0    ] [ v_1 ]   [ 1 ]
[ 0      1      0      1    ] [ p_1 ]   [ 1 ]   (6)

Notice that (v*_0, p*_0, v*_1, p*_1) is a solution, as is, by (5), (v̄_0, p̄_0, v̄_1, p̄_1). But this is a contradiction, since (4) implies that the coefficient matrix in (6) is nonsingular (its determinant equals p*_0 − v*_0 ≠ 0). ∎

An important special case of problem (P) is where Ω (or A) consists of a single probability mass function. In this case uniqueness is also obtained.

Corollary 4.3 If one of the sets A, Ω ⊂ S ⊂ R^{m+1} is a singleton, the other is convex, and A ∩ Ω = ∅, then problem (P) has a unique solution.

Proof: Let Ω = {p⁰}. If problem (P) has two solutions they are of the form (v*, p⁰) and (v̄, p⁰), and by Lemma 4.1 they must satisfy

v*_i / p⁰_i = v̄_i / p⁰_i   for all i = 0, 1, ..., m.

Hence v* = v̄ and the two solutions coincide. ∎

The third corollary gives a reasonably general sufficient condition for uniqueness. For its formulation we need the notion of a strictly convex set. A convex set C is strictly convex if int C ≠ ∅ and for all x, y ∈ C with x ≠ y and 0 < λ < 1 it follows that λx + (1 − λ)y ∈ int C. A typical example of such sets are the level sets C = {x ∈ R^n : f(x) ≤ γ}, where f is a strictly convex function and γ > inf f.

Corollary 4.4 If one of the sets A, Ω is convex, the other is strictly convex, and A ∩ Ω = ∅, then problem (P) has a unique solution.

Proof: Suppose (v*, p*) and (v̄, p̄) are distinct solutions of (P). Then by Lemma 4.1 we must have v* ≠ v̄ and p* ≠ p̄. Suppose that Ω is strictly convex. Let

v̂ = (v* + v̄)/2,   p̂ = (p* + p̄)/2,

and note that (v̂, p̂) ∈ A × Ω is also an optimal solution of (P), i.e.,

I(v̂, p̂) = min_{v∈A, p∈Ω} I(v, p) = I*.

Since Ω is strictly convex, p̂ ∈ int Ω. Also A ∩ Ω = ∅ implies that v̂ ∉ Ω. Hence there exists 0 < γ < 1 such that the point p_γ = γ v̂ + (1 − γ) p̂ ∈ Ω. Therefore (v̂, p_γ) ∈ A × Ω, and so

I* ≤ I(v̂, p_γ) ≤ γ I(v̂, v̂) + (1 − γ) I(v̂, p̂) = γ · 0 + (1 − γ) I* < I*,

a contradiction. Thus no two distinct solutions exist. The same proof goes through if A is assumed to be strictly convex. ∎

If none of the sufficient conditions in Corollaries 4.2 through 4.4 holds, problem (P) may indeed have more than one optimal solution. This is illustrated by the following example.

Example Consider the special case of problem (P):

min_{v∈A, p∈Ω} Σ_{i=0}^2 v_i log(v_i / p_i),

where

A = {v ∈ R³ : v_0 + 2v_1 + v_2 ≤ 4/3, v ≥ 0, Σ v_i = 1}

and

Ω = {p ∈ R³ : p_0 + 2p_1 + p_2 ≥ 3/2, p ≥ 0, Σ p_i = 1}.

Hence A and Ω are convex subsets of R³ and A ∩ Ω = ∅, but neither A nor Ω is strictly convex (or a singleton). It is easily verified from the Karush-Kuhn-Tucker conditions that

(v̄, p̄) = ((1/3, 1/3, 1/3), (1/4, 1/2, 1/4))

and

(v*, p*) = ((1/6, 1/3, 1/2), (1/8, 1/2, 3/8))

are both optimal solutions. Note that the necessary condition (1) is satisfied here:

v̄_i / p̄_i = v*_i / p*_i,   i = 0, 1, 2.

The discussion of uniqueness has been limited so far to the case where both A and Ω are convex sets. Without this assumption, the uniqueness hypothesis which is needed for the convergence results in Section 5 is as follows:

A1 For every p ∈ Ω there is a unique solution v(p), where

v(p) = argmin_{v∈A} I(v, p),

and

A2 There is a unique solution p*, where

p* = argmin_{p∈Ω} I(v(p), p).

However, under the convexity assumption, the uniqueness condition A1 ∩ A2 is equivalent to the uniqueness of the optimal solution to problem (P).

Lemma 4.5 Let A and Ω be convex. Then A1 ∩ A2 holds if and only if condition B holds:

B The problem min_{v∈A, p∈Ω} I(v, p) has a unique solution.

Proof: (B ⟹ A1 ∩ A2) Suppose A1 is false, i.e., there exist p ∈ Ω and v_1, v_2 ∈ A, v_1 ≠ v_2, such that v_1, v_2 ∈ Argmin_{v∈A} I(v, p). But this is impossible, since I(·, p) is strictly convex and hence possesses a unique minimizer on the convex set A. Suppose then that A1 holds but not A2, i.e., for some p*_1, p*_2 ∈ Ω with p*_1 ≠ p*_2, both p*_1, p*_2 ∈ Argmin_{p∈Ω} I(v(p), p). Let v*_i = v(p*_i), i = 1, 2. Then

I(v*_1, p*_1) = I(v*_2, p*_2) = min_{v∈A, p∈Ω} I(v, p).   (7)

For otherwise there exists (v̄, p̄) ∈ A × Ω such that

I(v̄, p̄) < I(v(p*_i), p*_i),   i = 1, 2.   (8)

But then

I(v(p*_i), p*_i) ≤ I(v(p̄), p̄)   since p*_i ∈ Argmin_{p∈Ω} I(v(p), p)
               ≤ I(v̄, p̄)        since v(p̄) = argmin_{v∈A} I(v, p̄)
               < I(v(p*_i), p*_i)   by (8),

a contradiction. Hence (7) holds, but this is in turn a contradiction to B.

(A1 ∩ A2 ⟹ B) Suppose B is false and let (v*_1, p*_1) and (v*_2, p*_2) both be solutions of min_{v∈A, p∈Ω} I(v, p) with

(v*_1, p*_1) ≠ (v*_2, p*_2).   (9)

By Lemma 4.1,

v*_{1i} / p*_{1i} = v*_{2i} / p*_{2i}   for all i.

Hence (9) implies v*_1 ≠ v*_2 and p*_1 ≠ p*_2. Now, by A1, v*_i = v(p*_i), and so both p*_1 and p*_2 are solutions of min_{p∈Ω} I(v(p), p) with p*_1 ≠ p*_2, but this contradicts A2. ∎

Whenever assumption A1 holds, the mapping

p ↦ argmin_{v∈A} I(v, p)

is single-valued and defines a function v(·): Ω → A. The last lemma of this section is a general result, directly applicable to the question of the continuity of the function v(p).

Lemma 4.6 Let X be a compact subset of R^n and Y ⊂ R^m. Consider a function f: R^n × R^m → R, and assume that f is continuous on X × Y and that for every y ∈ Y there exists a unique solution

x(y) = argmin_{x∈X} f(x, y).

Then x(·) is a continuous function on Y.

Proof: Let y_n → y in Y and suppose, for some open neighborhood N(x(y)) of x(y) and some subsequence {y_{n_k}}, that x(y_{n_k}) ∉ N(x(y)) for all k. Passing to a further subsequence if necessary, we have x(y_{n_k}) → x' for some x' ∈ X, by compactness of X. Also f(x(y), y) < f(x', y), since x(y) is the unique minimizer for y and x' ∉ N(x(y)). Let ε = f(x', y) − f(x(y), y) > 0. Let k_1 be such that |f(x(y), y_{n_k}) − f(x(y), y)| < ε/2 for all k ≥ k_1, which is possible since f(x(y), ·) is continuous on Y. Let k_2 be such that |f(x', y) − f(x(y_{n_k}), y_{n_k})| < ε/2 for all k ≥ k_2, which is possible since f is continuous on X × Y. We have

f(x(y), y_{n_k}) − f(x(y_{n_k}), y_{n_k}) ≤ |f(x(y), y_{n_k}) − f(x(y), y)| + f(x(y), y) − f(x', y) + |f(x', y) − f(x(y_{n_k}), y_{n_k})| < ε/2 − ε + ε/2 = 0

for k ≥ max(k_1, k_2). Hence for such k we have f(x(y), y_{n_k}) < f(x(y_{n_k}), y_{n_k}), which contradicts x(y_{n_k}) being the unique minimizer of f(·, y_{n_k}) on X. ∎

5 Convergence of the Posterior and Empirical Distributions

Roughly speaking, if we are given that the empirical distribution is in A, Theorem 3.3 tells us that we would find the highest likelihood for it being in a neighborhood of the point

v(p) closest (in relative entropy distance) to p ∈ Ω. Since this likelihood increases as p gets closer to v(p), we should expect that P has the highest probability of being near the point p* closest to A. This is indeed the case, as expressed in the following theorem.

Theorem 5.1 (Posterior Convergence) Let the prior P be an absolutely continuous probability measure with support Ω ⊂ S, where Ω is a closed set. Let A ⊂ S be a closed body with A ∩ Ω = ∅. Suppose v(p) = argmin_{v∈A} I(v, p) is unique for all p ∈ S and p* = argmin_{p∈Ω} I(v(p), p) is also unique. Then P^n ⇒ p* as n → ∞, in the sense that P^n(N(p*)) → 1 as n → ∞ for all open neighborhoods N(p*) of p*.

Proof: We have

Pr(V^n ∈ A | P ∉ N(p*)) ≤ sup_{p ∈ Ω−N(p*)} Pr(V^n ∈ A | P = p)
   ≤ sup_{p ∈ Ω−N(p*)} exp{−n(I(A^n, p) − m log(n+1)/n)}   (by Lemma 3.2)
   ≤ sup_{p ∈ Ω−N(p*)} exp{−n(I(A, p) − m log(n+1)/n)}   (since A^n ⊂ A for all n)
   = exp{−n(I(v(p̄), p̄) − m log(n+1)/n)},

where p̄ ∈ Argmin_{p ∈ Ω−N(p*)} I(v(p), p), which exists since I(v(p), p), being a composition of continuous functions (see Lemma 4.6), is continuous over the compact set Ω − N(p*). Also,

Pr(V^n ∈ A) = ∫_Ω Pr(V^n ∈ A | P = p) dP(p)
   ≥ ∫_Ω exp{−n(I(A^n, p) + ((m−1)/2)(log n)/n − (log γ(m))/n)} dP(p)   (by Lemma 3.2)
   ≥ ∫_{Ω_ε(p*)} exp{−n(I(A^n, p) + ((m−1)/2)(log n)/n − (log γ(m))/n)} dP(p),

since Ω_ε(p*) ⊂ Ω, where

Ω_ε(p*) = Ω ∩ {q ∈ S : |q_i − p*_i| < ε for i = 0, 1, 2, ..., m}.

By Lemma 3.1, I(A^n, p) → I(A, p) uniformly over p in the closure of Ω_ε(p*), so that for all δ > 0 there exists an N_δ, independent of p ∈ Ω_ε(p*), such that I(A^n, p) ≤ I(A, p) + δ for all n ≥ N_δ and all p ∈ Ω_ε(p*). Therefore, by the above inequality, and bounding the integrand from below by its infimum over Ω_ε(p*),

Pr(V^n ∈ A) ≥ ∫_{Ω_ε(p*)} exp{−n(I(A, p) + δ + ((m−1)/2)(log n)/n − (log γ(m))/n)} dP(p)
   ≥ exp{−n(I(v(p(ε)), p(ε)) + δ + ((m−1)/2)(log n)/n − (log γ(m))/n)} P(Ω_ε(p*)) > 0   (10)

for all n ≥ N_δ and all ε, δ > 0, where p(ε) maximizes I(v(p), p) = I(A, p) over the closure of Ω_ε(p*) (the maximum is attained since I(v(p), p) is continuous there), and the bound is positive since Ω_ε(p*) is a nonempty relatively open subset of the support Ω of the absolutely continuous prior P. Hence, using the two inequalities proved above,

Pr(P ∉ N(p*) | V^n ∈ A) = Pr(V^n ∈ A | P ∉ N(p*)) Pr(P ∉ N(p*)) / Pr(V^n ∈ A)
   ≤ exp{−n(I(v(p̄), p̄) − m log(n+1)/n)} (1 − P(N(p*))) / [exp{−n(I(v(p(ε)), p(ε)) + δ + ((m−1)/2)(log n)/n − (log γ(m))/n)} P(Ω_ε(p*))]
   = exp{−n(I(v(p̄), p̄) − I(v(p(ε)), p(ε)) − δ − ((m−1)/2)(log n)/n + (log γ(m))/n − m log(n+1)/n)} (1 − P(N(p*))) / P(Ω_ε(p*))

for all n ≥ N_δ and all δ, ε > 0. Now choose ε > 0 sufficiently small so that

I(v(p̄), p̄) > I(v(p(ε)), p(ε)),

which is possible since I(v(p), p) is a continuous function of p ∈ S and p* = argmin_{p∈Ω} I(v(p), p) is unique.

Let δ = (I(v(p̄), p̄) − I(v(p(ε)), p(ε)))/2 > 0. Then

Pr(P ∉ N(p*) | V^n ∈ A) ≤ exp{−n(δ + (log γ(m))/n − ((m−1)/2)(log n)/n − m log(n+1)/n)} (1 − P(N(p*))) / P(Ω_ε(p*))

for all n ≥ N_δ. Hence limsup_{n→∞} Pr(P ∉ N(p*) | V^n ∈ A) ≤ 0. But trivially liminf_{n→∞} Pr(P ∉ N(p*) | V^n ∈ A) ≥ 0, so that lim_{n→∞} Pr(P ∉ N(p*) | V^n ∈ A) = 0. ∎

Theorem 5.1 states that when v(p) and p* are unique (i.e., A1 ∩ A2 holds), the posterior distribution, when given the information that the empirical distribution lies in A, degenerates, as the sample size grows, to a distribution concentrated at p*. Moreover, this p*, being the point of Ω closest to A, depends only on the geometry of the sets A and Ω and is otherwise independent of the prior P. An interpretation is that a decision maker who initially believes Ω to contain the possible distributions of the superpopulation should, in light of strong empirical evidence that the distribution is in A, adopt the view that P equals p*.

Of course, the probability of finding the empirical distribution in A (when A and Ω do not meet) goes to zero as the sample size diverges to infinity. However, for sufficiently large finite values of the sample or population size n, P will, with arbitrarily high probability, be near p*. Hence the theorem entails an observable prediction. If one were to generate a series of large populations as samples from superpopulations with distributions P chosen in accordance with P over Ω, then a population whose distribution lay in A is nearly certain to have been drawn from a superpopulation with distribution near p*.

Using the uniqueness results of section 4, we derive from Theorem 5.1 the following result, which gives explicit sufficient conditions under which P^n ⇒ p*.

Corollary 5.2 Let the prior P be an absolutely continuous probability measure with support Ω ⊂ S, where Ω is a closed convex set. Let A ⊂ S be a closed convex body with A ∩ Ω = ∅. If either Ω or A is strictly convex or a singleton, or if S ⊂ R², then P^n ⇒ p* as n → ∞, in the sense that P^n(N(p*)) → 1 as n → ∞ for all open neighborhoods N(p*) of p*. ∎

From the above results, in the presence of the convexity and uniqueness conditions, the uncertainty about the true distribution P expressed by P unambiguously resolves into certainty that P is p* in the face of incomplete but conflicting information on a large sample of observations. This convergence to a known distribution in the presence of only partial sample information and a general nonparametric prior is quite striking.
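As an illustration of Theorem 5.1, the following Monte Carlo sketch (not from the paper; the sets, prior, and sample size are hypothetical choices) draws P from a uniform prior over a strictly convex Ω, simulates n observations, and retains only the runs whose empirical distribution falls in A; the retained values of P cluster near a single point of Ω, consistent with posterior concentration at p*.

```python
# Monte Carlo sketch of Theorem 5.1 under illustrative assumptions:
# Omega = {p in simplex : ||p - c|| <= r} (strictly convex), A = {v : v_1 <= 1/3},
# prior uniform on Omega.  Conditioning on V^n in A concentrates P.
import numpy as np

rng = np.random.default_rng(0)
c, r = np.array([0.2, 0.6, 0.2]), 0.1

def draw_prior():
    """Rejection-sample a pmf uniformly from Omega."""
    while True:
        p = rng.dirichlet([1.0, 1.0, 1.0])      # uniform on the simplex
        if np.linalg.norm(p - c) <= r:
            return p

n, runs = 30, 30_000
accepted = []
for _ in range(runs):
    p = draw_prior()
    v = rng.multinomial(n, p) / n               # empirical distribution V^n
    if v[1] <= 1 / 3:                           # conditioning event: V^n in A
        accepted.append(p)

accepted = np.array(accepted)
# The conditional sample of P clusters near the point p* of Omega minimizing
# I(A, p); the spread shrinks as n grows (at the price of rarer acceptances).
print(len(accepted), accepted.mean(axis=0), accepted.std(axis=0))
```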

We turn now to consider the behavior of the empirical or population distribution V^n as n increases. We begin by demonstrating in Lemma 5.3 below that V^n converges in conditional probability to v(p), when given that P = p.

Lemma 5.3 Let A, Ω ⊂ S with Ω closed and A a closed body. Assume that for all p ∈ Ω, v(p) = argmin_{v∈A} I(v, p) is unique. Then

Pr(V^n ∉ N(v(p)) | V^n ∈ A, P = p) → 0

uniformly over p ∈ Ω as n → ∞, where N(v(p)) is an open neighborhood of v(p).

Proof: Let ε > 0. Let Ñ(v(p)) = A − N(v(p)). If Ñ(v(p)) = ∅, then Pr(V^n ∉ N(v(p)) | V^n ∈ A, P = p) = 0 < ε for all n. Suppose then that Ñ(v(p)) ≠ ∅. Let δ = I(Ñ(v(p)), p) − I(A, p) > 0, which is positive since v(p) is unique. By Theorem 3.3, we can choose N_1 so that for all n ≥ N_1 and all p ∈ Ω,

(1/n) log Pr(V^n ∈ Ñ(v(p)) | P = p) ≤ −I(Ñ(v(p)), p) + δ/3,

and choose N_2 so that for all n ≥ N_2 and all p ∈ Ω,

(1/n) log Pr(V^n ∈ A | P = p) ≥ −I(A, p) − δ/3.

Now choose N_3 so that (log ε)/n > −δ/3 for all n ≥ N_3. Let N = max(N_1, N_2, N_3). Then for all n ≥ N and all p ∈ Ω,

(1/n) log Pr(V^n ∉ N(v(p)) | V^n ∈ A, P = p) = (1/n)[log Pr(V^n ∈ Ñ(v(p)) | P = p) − log Pr(V^n ∈ A | P = p)]
   ≤ −I(Ñ(v(p)), p) + I(A, p) + (2/3)δ = −δ/3 < (log ε)/n.

Hence Pr(V^n ∉ N(v(p)) | V^n ∈ A, P = p) < ε for all n ≥ N and all p ∈ Ω. ∎

Theorem 5.4 (Convergence of the Empirical Distribution) Let P, Ω, and A be as in Theorem 5.1, and let v* = v(p*). Consider the conditional sample distribution V̄^n(B) = Pr(V^n ∈ B | V^n ∈ A) for all Borel sets B ⊂ S. Then V̄^n ⇒ v* as n → ∞, in the sense that V̄^n(N(v*)) → 1 as n → ∞ for all open neighborhoods N(v*) of v*.

Proof: Let γ > 0. Let ε > 0 be small enough so that A_ε(v*) ⊂ N(v*), where A_ε(v) = A ∩ {q ∈ S : |q_i − v_i| < ε for i = 0, 1, ..., m}. Choose δ > 0 so that v(p) ∈ A_{ε/2}(v*) for all p ∈ Ω_δ(p*), which is possible by Lemma 4.6. Now choose N_1 so that

Pr(V^n ∉ A_{ε/2}(v(p)) | V^n ∈ A, P = p) < γ/2   for all n ≥ N_1 and all p ∈ Ω_δ(p*),

which is possible by Lemma 5.3. Then, since |V^n_i − v*_i| ≤ |V^n_i − v(p)_i| + |v(p)_i − v*_i|,

Pr(V^n ∉ A_ε(v*) | V^n ∈ A, P = p) < γ/2   for all n ≥ N_1 and all p ∈ Ω_δ(p*).

Now

Pr(V^n ∉ N(v*) | V^n ∈ A) ≤ Pr(V^n ∉ A_ε(v*) | V^n ∈ A)
   = ∫_{Ω_δ(p*)} Pr(V^n ∉ A_ε(v*) | V^n ∈ A, P = p) dP(p | V^n ∈ A)
     + ∫_{Ω−Ω_δ(p*)} Pr(V^n ∉ A_ε(v*) | V^n ∈ A, P = p) dP(p | V^n ∈ A).   (11)

The second term on the right hand side of (11) is bounded from above by

∫_{Ω−Ω_δ(p*)} 1 dP(p | V^n ∈ A) = Pr(P ∉ Ω_δ(p*) | V^n ∈ A) < γ/2

for all n ≥ N_2, for some N_2, by Theorem 5.1. The first term on the right hand side of (11) is bounded from above by

∫_{Ω_δ(p*)} (γ/2) dP(p | V^n ∈ A) = (γ/2) Pr(P ∈ Ω_δ(p*) | V^n ∈ A) ≤ γ/2

for all n ≥ N_1. Let N = max(N_1, N_2). Then Pr(V^n ∉ N(v*) | V^n ∈ A) < γ for all n ≥ N. Hence Pr(V^n ∉ N(v*) | V^n ∈ A) → 0 as n → ∞. ∎

When the uniqueness condition A1 ∩ A2 holds, Theorem 5.4 states the following. Suppose we generate a large population by drawing a random sample from a superpopulation whose distribution P has been determined in accordance with a prior distribution P over Ω. If the distribution of that population exhibits characteristics that depart from those characteristic of the superpopulation (i.e., V^n ∈ A), then that population's distribution is almost certain to be nearly v*. There is a similar classical interpretation of this result for the case when P = p* is known. This special case extends the result of Vasicek [1980] and is related to that of Van Campenhout and Cover [1981]. Combining Theorem 5.4 with the uniqueness results of section 4, we obtain the following corollary.

Corollary 5.5 Let P, Ω, and A be as in Corollary 5.2. Then V̄^n ⇒ v* as n → ∞, in the sense that V̄^n(N(v*)) → 1 as n → ∞ for all open neighborhoods N(v*) of v*. ∎
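Since p* and v* depend only on A and Ω, they can in principle be computed directly by solving the joint convex program (P) of Section 4. The following is a minimal sketch, under the same illustrative choices of A and Ω as in the Monte Carlo sketch above; the use of scipy's SLSQP solver is an assumption of convenience, not part of the paper.

```python
# Sketch: compute (v*, p*) = argmin over v in A, p in Omega of I(v, p) as one
# convex program, for the illustrative sets A = {v : v_1 <= 1/3} and
# Omega = {p : ||p - c||^2 <= r^2}, both within the probability simplex.
import numpy as np
from scipy.optimize import minimize

c, r = np.array([0.2, 0.6, 0.2]), 0.1

def objective(x):
    v, p = x[:3], x[3:]
    return float(np.sum(v * np.log(v / p)))          # I(v, p)

cons = [
    {"type": "eq",   "fun": lambda x: np.sum(x[:3]) - 1.0},              # v pmf
    {"type": "eq",   "fun": lambda x: np.sum(x[3:]) - 1.0},              # p pmf
    {"type": "ineq", "fun": lambda x: 1 / 3 - x[1]},                     # v in A
    {"type": "ineq", "fun": lambda x: r**2 - np.sum((x[3:] - c)**2)},    # p in Omega
]
x0 = np.concatenate([np.full(3, 1 / 3), c])          # feasible starting point
res = minimize(objective, x0, bounds=[(1e-9, 1.0)] * 6, constraints=cons,
               method="SLSQP")
v_star, p_star = res.x[:3], res.x[3:]
print(v_star, p_star, objective(res.x))
```

Because I(v, p) is jointly convex (Appendix I(a)) and both sets are convex, any local solution returned by the solver is global; uniqueness here is guaranteed by Corollary 4.4 since the ball Ω is strictly convex.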

6 Discussion and Conclusion

If we adopt the viewpoint expressed in the introduction that P is the distribution of the superpopulation and V^n is the distribution of a large population of n members drawn from this superpopulation, then the following observation flows from Theorems 5.1 and 5.4. When given the partial information that the population exhibits aggregate properties not shared by the superpopulation (i.e., that its distribution lies in A), it is nearly certain that the population's distribution is close to v*. Moreover, the unknown superpopulation distribution P, known to lie in Ω, is nearly certain to be near p*.

It follows that in a problem of inference under conflicting and incomplete information, it is essential to identify which information relates to the population as opposed to the superpopulation. The reason is that, because of the asymmetry of I(q, p), v* and p* are not invariant when A and Ω are interchanged. For example, if we let A be the set of distributions with mean at most u and Ω be the singleton {p*}, then v* is given by that member of the discrete exponential family containing p* that has mean u (see Sampson and Smith [1985]). In particular, if p* is a binomial distribution, then v* is the binomial distribution with mean u. On the other hand, if we assign Ω to be the set of distributions with mean at least u and A to be a small neighborhood around the point v*, then p* is no longer a member of the exponential family. In fact, p* is given by

p*_i = v*_i / (λ(i − u) + 1)   for i = 0, ..., m,

where λ is the unique root of a certain mth degree polynomial equation (see Appendix II).

In Sampson and Smith [1982, 1985], the information provided by "expert judgement" that the mean is at most u is ascribed to A so as to obtain the exponential family estimate v* for the "true" distribution; Ω is set to the singleton {p*}, where p* is interpreted as the "decision maker's prior". From the results of this paper, this estimate is justifiable only under the interpretation that the decision maker's prior was in fact used to generate a population with mean at most u. Hence, within this interpretation, the informal use of the Bayesian term "prior" by Sampson and Smith is rigorously justifiable. They also show that v*_0 = u + o(u) independently of p*, the so-called rare event approximation. Although v* in that context merely represented the closest point in A in relative entropy to p*, Theorem 5.4 now provides a rigorously substantiated and empirically observable claim.

The duality expressed by v* and p*, the former a classical and the latter a Bayesian viewpoint, is rich in paradoxes. For example, given two pieces of incomplete but conflicting information, which should play the role of A and which of Ω? At least within the model of this paper, the answer seems to pivot on (a) which information is primary (Ω) and which secondary (A), and (b) for which population the estimate is desired, the "population" or the

"superpopulation"? The estimate v* is appropriate for the former while p* is appropriate for the latter. 19

References

1. Bahadur, R.R., Some Limit Theorems in Statistics, SIAM, Philadelphia, 1971.

2. Brockett, P.L., A. Charnes, and W.W. Cooper, "MDI Estimation via Unconstrained Convex Programming," Communications in Statistics, B-9, 223-234, 1980.

3. Van Campenhout, J. and T. Cover, "Maximum Entropy and Conditional Probability," IEEE Transactions on Information Theory, IT-27, 483-489, 1981.

4. Cozzolino, J.M. and M.J. Zahner, "The Maximum-Entropy Distribution of the Future Market Price of a Stock," Operations Research 21, 1200-1211, 1973.

5. Hausdorff, F., Set Theory, Chelsea, New York, 1957.

6. Jaynes, E.T., "Information Theory and Statistical Mechanics," Physical Review 106, 620-630, 1956.

7. Johnson, R.W., "Axiomatic Characterization of the Directed Divergences and Their Linear Combinations," IEEE Transactions on Information Theory, IT-17, 641-650, 1979.

8. Rockafellar, R.T., Convex Analysis, Princeton University Press, Princeton, NJ, 1970.

9. Sampson, A.R. and R.L. Smith, "Assessing Risks Through the Determination of Rare Event Probabilities," Operations Research 30, 839-866, 1982.

10. Sampson, A.R. and R.L. Smith, "An Information Theory Model for the Evaluation of Circumstantial Evidence," IEEE Transactions on Systems, Man, and Cybernetics SMC-15, 9-16, 1985.

11. Sanov, I.N., "On the Probability of Large Deviations of Random Variables," IMS and AMS Translations of Probability and Statistics (from Mat. Sbornik 42, 11-44), 1961.

12. Shore, J.E. and R.W. Johnson, "Axiomatic Derivation of the Principle of Maximum Entropy and the Principle of Minimum Cross-Entropy," IEEE Transactions on Information Theory, IT-26, 26-37, 1980.

13. Thomas, M.U., "A Generalized Maximum Entropy Principle," Operations Research 27, 1188-1195, 1979.

14. Vasicek, O.A., "A Conditional Law of Large Numbers," Annals of Probability 8, 142-147, 1980.

15. Wilson, A.G., Entropy in Urban and Regional Modeling, Pion Limited, London, 1970.

Appendix I

Properties of the Relative Entropy Functional

Let v, p be probability vectors in

S = {x ∈ R^{m+1} : Σ_{i=0}^m x_i = 1, x_i > 0, i = 0, 1, ..., m}.

The relative entropy functional I: S × S → R is defined by

I(v, p) = Σ_{i=0}^m v_i log(v_i / p_i).

We also denote by R^{m+1}_{++} the positive orthant {x ∈ R^{m+1} : x > 0}.

(a) I(v, p) is (jointly) convex on R^{m+1}_{++} × R^{m+1}_{++}.

(b) I(v, ·) is strictly convex on R^{m+1}_{++} (for every v ∈ R^{m+1}_{++}), and I(·, p) is strictly convex on R^{m+1}_{++} for every p ∈ R^{m+1}_{++}.

(c) I(v, p) ≥ 0 for all p, v ∈ S, with equality if and only if p = v.

(d) I is continuous on R^{m+1}_{++} × R^{m+1}_{++}.

(e) Let A be a convex subset of R^{m+1}_{++}. Then the function I(A, p) = inf_{v∈A} I(v, p) is a convex function of p on R^{m+1}_{++}.

Proof:

(a) The two-variable function f(v_i, p_i) = v_i log(v_i / p_i) is convex, since its Hessian

H = [ 1/v_i    −1/p_i
      −1/p_i   v_i/p_i² ]

is positive semidefinite. Indeed, for every x = (x_1, x_2) ∈ R² we have

xᵀ H x = (x_1/√v_i − x_2 √v_i / p_i)² ≥ 0.

But I(v, p) = Σ_{i=0}^m f(v_i, p_i), and hence I is convex.

(b) f(·, p_i) is strictly convex on R_{++} for p_i > 0, and f(v_i, ·) is strictly convex on R_{++} for v_i > 0, and hence (b) follows.

(c) Applying the gradient inequality to the strictly convex function g(x) = I(x, p), we obtain

I(v, p) ≥ I(p, p) + (v − p)ᵀ ∇g(p),

with equality if and only if v = p. Since ∇g(p) has components log(p_i/p_i) + 1 = 1, this gives

I(v, p) ≥ 0 + Σ_{i=0}^m (v_i − p_i) · 1 = 0,

since p, v ∈ S.

(d) This follows from the general fact that a convex function is continuous on the relative interior of its domain. Here ri(dom I) = R^{m+1}_{++} × R^{m+1}_{++}.

(e) The pointwise infimum of a jointly convex function over a convex set is a convex function (see, e.g., Rockafellar [1970, pp. 38-39]). ∎

Appendix II

Let Ω = {p ∈ S : Σ_{i=0}^m i p_i = u} and A = {v*}. The closest point p* ∈ Ω to v* is the optimal solution of

min_p { Σ_{i=0}^m v*_i log(v*_i / p_i) = −Σ_{i=0}^m v*_i log p_i + constant }

subject to

(i) Σ_{i=0}^m i p_i = u,
(ii) Σ_{i=0}^m p_i = 1.

Let λ ∈ R be the Lagrange multiplier of constraint (i) and μ ∈ R the Lagrange multiplier of constraint (ii). The optimality conditions for p* are then (i), (ii), and

(iii) v*_i / p*_i − iλ − μ = 0,   i = 0, 1, ..., m.

Multiplying (iii) by p*_i and summing, and using (i) and (ii), gives μ = 1 − λu, so the solution of (i)-(iii) is

p*_i = v*_i / (λ(i − u) + 1),

where λ is the unique root of

(iv) Σ_{i=0}^m (i − u) v*_i / (λ(i − u) + 1) = 0.

We note that (iv) is equivalent to the mth degree polynomial equation

Σ_{k=0}^m a_k λ^k = 0,

where the coefficient a_k is given by

a_k = Σ_{i=0}^m (i − u) v*_i Σ (j_1 − u)(j_2 − u) ⋯ (j_k − u),

the inner sum being over all indices 0 ≤ j_1 < j_2 < ⋯ < j_k ≤ m with j_l ≠ i for l = 1, ..., k.

As a specific example let m = 2 and u = 1. Then

λ = (v*_2 − v*_0) / (v*_0 + v*_2),   p*_0 = p*_2 = (v*_0 + v*_2)/2,   p*_1 = v*_1.
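The following is a small numerical sketch of Appendix II (the pmf v* is a hypothetical example, and a bracketing root finder is an implementation assumption): it solves (iv) for λ over the interval on which every denominator λ(i − u) + 1 stays positive, recovers p*, and checks the m = 2, u = 1 closed form.

```python
# Sketch for Appendix II: solve (iv) for lambda and recover p*.
# Assumes 0 < u < m and v*_0, v*_m > 0 so that a sign change brackets the root.
import numpy as np
from scipy.optimize import brentq

def solve_projection(v_star, u):
    v_star = np.asarray(v_star, dtype=float)
    i = np.arange(len(v_star))
    g = lambda lam: np.sum((i - u) * v_star / (lam * (i - u) + 1.0))  # eq. (iv)
    lo = -1.0 / (i.max() - u) + 1e-9       # keep lambda*(i-u)+1 > 0 for i > u
    hi = 1.0 / (u - i.min()) - 1e-9        # keep lambda*(i-u)+1 > 0 for i < u
    lam = brentq(g, lo, hi)                # g is strictly decreasing on (lo, hi)
    return lam, v_star / (lam * (i - u) + 1.0)

v_star = np.array([0.5, 0.3, 0.2])         # hypothetical v*, mean 0.7
lam, p_star = solve_projection(v_star, u=1.0)
print(lam, p_star)
# m = 2, u = 1 closed form: p*_0 = p*_2 = (v*_0 + v*_2)/2 and p*_1 = v*_1
print((v_star[0] + v_star[2]) / 2, v_star[1])
```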

Appendix II Let Q = {p E Si 0 ioipi = u} and A = {v*}. The closest point p* E Q to v* is the optimal solution of m m min {Z v' log(v'/pi) = - v logp, + constant} + i=p i=0 subject to (i) En0=oip, = (ii) E 0 Pi = 1. Let A E R be the Lagrange multiplier of constraint (i) and,p E R the Lagrange multiplier of constraint (ii). The optimality condition for p* are then (i), (ii), and (iii) v'/p? - iA - s = 0 i = 0,,..., m. The solution of (i)-(iii) is V* A(i - u) + 1 where A is the unique solution of (iV) Er=o (i-u)+l We note that (iv) is equivalent to the mth degree polynomial equation E akAk = 0, k=O where the coefficient ak is given by k ak = EE-E(ij - )(E V) io il ik j=O where ij E {0, 1,...,m}, ij y ij+l,j = 0, 1.. Ik- 1. As a specific example let m = 2, and u = 1. Then *,;* 1 1 2 *. A. 2 - 1. =.1 i -'POP2 2, P=i l' lV12 24