
On Priors with a Kullback-Leibler Property

STEPHEN WALKER* AND PAUL DAMIEN**
*Department of Mathematical Sciences, University of Bath
**University of Michigan Business School

ABSTRACT. In this paper, we highlight properties of Bayesian models in which the prior puts positive mass on all Kullback-Leibler neighbourhoods of all densities. These properties concern model choice via the Bayes factor, density estimation, and the maximisation of expected utility in decision problems. The results suggest it is appropriate to label a prior with this Kullback-Leibler property as a true Bayesian model. In our illustrations we focus on the Bayes factor and show that, whatever models are being compared, [log(Bayes factor)]/[sample size] converges to a nonrandom number which has a useful interpretation.

KEYWORDS: Bayes factor, decision theory, exchangeability, expected utility rule, Kullback-Leibler divergence.

1. Introduction. Recent Bayesian nonparametric literature has focused on consistency properties of Bayesian procedures. See, for example, Wasserman (1998), Barron, Schervish and Wasserman (1999), and Walker (2002). Based on the results from these papers, we argue that it is practically realistic to define a true Bayesian model as one in which the prior puts positive mass on all Kullback-Leibler neighbourhoods of all densities. Our reasons for this position are the theme and point of the paper.

We consider solely the case when $f_0$ is a density function and $X^n = (X_1, \ldots, X_n)$ is available as a random sample from $f_0$, the first $n$ observations of a possibly infinite sequence $X_1, X_2, \ldots$. Since $f_0$ is unknown, the Bayesian constructs a prior distribution on the relevant space of density functions, or distribution functions, reflecting available prior information about the location of $f_0$. Assuming all the densities under consideration are dominated by some $\sigma$-finite measure, which we will take to be the Lebesgue

measure, Bayes theorem and the data $X^n$ combine to update the prior to the posterior.

There are compelling reasons why a Bayesian should use a prior distribution which puts positive mass on all Kullback-Leibler neighbourhoods of all densities; in particular, on all Kullback-Leibler neighbourhoods of $f_0$. Obviously, if $f_0$ is unknown a priori, then to guarantee that the prior puts positive mass on all Kullback-Leibler neighbourhoods of $f_0$, it is necessary to put positive mass on all Kullback-Leibler neighbourhoods of all densities. We shall refer to this as the Kullback-Leibler property for the prior $\Pi$. In order to achieve this, a nonparametric prior is required. For specific examples of priors with the Kullback-Leibler property, see the recent paper by Barron, Schervish and Wasserman (1999).

We should point out that every Bayesian model, that is, $M = \{f(x; \theta), \pi(\theta)\}$, defines a prior probability $\Pi$ on the space of density functions. A random density function from $\Pi$ is chosen by first choosing a $\theta$ from $\pi$ and putting $f(\cdot) = f(\cdot; \theta)$. Hence, for us, a Bayesian model is precisely the prior $\Pi$. A parametric model of finite dimension will not satisfy the Kullback-Leibler property, unless $f_0$ is known to belong to a particular parametric family.

The following reasons suggest that $\Pi$ should have the Kullback-Leibler property:

1. Many practising statisticians would argue that parametric models are sufficient when combined with model checking and model comparison diagnostics. See, for example, Bernardo and Smith (1994). However, Draper (1999), in an insightful discussion of the paper by Walker et al. (1999), points out that allocating probability mass one to parametric subsets of densities should not be done lightly, because switching models, once the original model under consideration is found to be deficient in some sense, exposes the statistician to the very real possibility of poor calibration. There is therefore a very practical reason for assigning mass one to the set of all densities: the data can offer no surprises.

2. If $\Pi$ does have the Kullback-Leibler property then the Bayes factor comparing this model with any other model will, under mild regularity conditions, always eventually support the prior with the Kullback-Leibler property. The precise result is stated in Section 2. The conclusion is that there is no motivation to put the prior $\Pi$

under the scrutiny of a Bayes factor, unless it is compared with another prior which also has the Kullback-Leibler property.

3. Decisions made via the maximisation of expected utility are consistent when using a prior with the Kullback-Leibler property. This is proved in Section 3. That is, given a utility function and $f_0$, there is a well defined correct action, unknown just as $f_0$ is unknown. Decisions are consistent if the decision rule eventually sticks at this correct action.

4. For those interested in density estimation, there exists a Kullback-Leibler consistent sequence of predictive densities based on a prior with the Kullback-Leibler property. This is stated precisely in Section 4.

Before proceeding, we introduce the notation used throughout the paper. We let $\Pi_n$ denote the posterior distribution given $X^n$. Then define $I_n = \int R_n(f)\,\Pi(df)$ for $n \geq 1$, and $I_0 = 1$, where $R_n(f) = \prod_{i=1}^n f(X_i)/f_0(X_i)$. Define $f_n = \int f\,\Pi_n(df)$ to be the predictive density, and define $D(f) = \int \log(f_0/f)\,f_0$ to be the Kullback-Leibler divergence between $f_0$ and $f$. In the following, a.s. will be with respect to the infinite product measure $F_0^\infty$.

2. Bayes factors. Bayes factors are widely used in Bayesian model selection problems. See, for example, Bernardo and Smith (1994) for a review. To date, asymptotic studies of Bayes factors have only been formulated when one of the models is "correct". See, for example, Gelfand and Dey (1994). The Bayes factor for comparing model 1 with model 2 is given by $B_n = I_{1n}/I_{2n}$, where $I_{jn} = \int R_n(f)\,\Pi_j(df)$, and the Bayesian models are fully characterised by $\Pi_1$ and $\Pi_2$. Recall that all Bayesian models induce prior distributions on the space of density functions. A Bayesian model, characterised by $\Pi$, will be associated with a $\delta \geq 0$; this $\delta$ is such that $\Pi\{f : D(f) < d\} > 0$ only for, and for all, $d > \delta$.

THEOREM 1. (WALKER, 2002). If

1. $\Pi_j\{f : D(f) < d\} > 0$ only for, and for all, $d > \delta_j$;

2. $\sum_n n^{-2}\,\mathrm{Var}(\log I_{jn}/I_{jn-1}) < \infty$;

3. $\liminf_n D(f_{jn}) \geq \delta_j$ a.s.;

then $n^{-1}\log B_n \to \delta_2 - \delta_1$ a.s.

Consequently, $B_n \to \infty$ a.s. (preferring model 1) if, and only if, $\delta_1 < \delta_2$. This makes sense. Note that the rate is exponential; that is, $B_n \approx \exp\{n(\delta_2 - \delta_1)\}$. Obviously, if $\delta_1 = 0$, then $B_n \to \infty$ a.s. for all $\delta_2 > 0$. Condition 2 is an extremely mild condition. Condition 3 is also a realistic assumption to make: one would not anticipate the predictive density getting closer than $\delta$ to $f_0$ in a Kullback-Leibler sense if the prior has no densities this close in its Kullback-Leibler support. We present illustrations of Theorem 1 in Section 5.

If $\Pi$ has the Kullback-Leibler property and the competing model does not, then the Bayes factor will eventually prefer the model which does have the Kullback-Leibler property. A model with this property can therefore rightly be defined as a true model. There is no point in comparing it with a model which does not have the Kullback-Leibler property.

3. Bayes decision theory. Here we provide further support for the notion that a prior $\Pi$ with the Kullback-Leibler property can be called a true model. Taking the notation from Hirshleifer and Riley (1992), the elements of a decision problem are as follows:

(1) A finite set of actions indexed by $a$; for practical purposes we assume $a \in \{1, \ldots, N\}$ for some integer $N$. While much of decision theory is written with the notion of a continuous set of actions, in practice the number of decisions that can be made is finite; see Lindley (1985) for a discussion.

(2) A set of states of nature, which we take to be the appropriate space of distribution functions.[1]

(3) A consequence function $c(a, F)$ showing outcomes under all combinations of actions and states of nature.

[1] We assume that the relevant unknown state of nature is the distribution generating the data. This gives us a general framework to work with. Certainly, knowing the true distribution would solve all decision problems associated with the data.

(4) A preference scaling function $v(c)$ measuring the desirability of the consequence $c$.

(5) A probability distribution on the space of states of nature representing beliefs about the true state of nature. In a Bayesian context this probability is the prior $\Pi$ in the no-sample problem and is the posterior $\Pi_n$ once the data $X^n$ have been observed.

The von Neumann-Morgenstern (1947) expected-utility rule then asserts that the best decision is to take the action $a$ which maximises
$$U_n(a) = \int v\{c(a, F)\}\,\Pi_n(dF).$$
This expected-utility rule is applicable if and only if the $v(\cdot)$ function has been determined in a particular way which leads to $v(c)$ being bounded, specifically $0 \leq v(c) \leq 1$; that is, $v(c)$ has a probabilistic interpretation. See Hirshleifer and Riley (1992). There are differing opinions on the need for a bounded elementary utility function; see, for example, De Groot (1970), who relaxes the axioms of von Neumann and Morgenstern. We would point out that with unbounded $v(\cdot)$ it is not guaranteed that $U_n(a)$ even exists, and since this depends on $f_0$, which is unknown, the bounded $v(\cdot)$ makes most sense. It is not our intention to discuss the expected-utility rule further. Our aim is to show that if $\Pi$ has the Kullback-Leibler property then the decision rule eventually sticks at the action which maximises $U_0(a) = v\{c(a, F_0)\}$, which can be classified as the correct action, obviously unknown because $F_0$ is unknown.

THEOREM 2. If $\Pi$ has the Kullback-Leibler property then $U_n(a) \to U_0(a)$ a.s. for all $a$.

PROOF. If $\Pi$ has the Kullback-Leibler property then $\Pi_n$ converges weakly to $\Pi_0$ a.s., where $\Pi_0$ is the probability measure with point mass one at $F_0$. See Schwartz (1965). The Portmanteau theorem (see, for example, Billingsley, 1968, Theorem 2.1) then gives the desired convergence result for $U_n(a)$, assuming that $v$ is suitably smooth.

Clearly, if $U_n(a) \to U_0(a)$ a.s. for all $a$, then the maximiser of $U_n(a)$, say $a_n$, will eventually stick at $a_0$, which maximises $U_0(a)$. A posterior-sampling sketch of this decision rule is given below.
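To make the expected-utility rule concrete, the following is a minimal Monte Carlo sketch, not taken from the paper: it approximates $U_n(a) = \int v\{c(a, F)\}\,\Pi_n(dF)$ by averaging a bounded utility over posterior draws of $F$ and picks the maximising action. The names (`choose_action`, `consequence`, `v`) and the toy setup at the bottom are hypothetical placeholders; in practice the posterior draws would come from whatever nonparametric prior is actually being used.

```python
import numpy as np

def choose_action(posterior_draws, actions, consequence, v):
    """Monte Carlo version of the expected-utility rule.

    posterior_draws : draws F ~ posterior; here each draw is represented by
                      an array of data simulated from one posterior density
                      (a hypothetical stand-in for a draw of F).
    actions         : finite list of candidate actions a in {1, ..., N}.
    consequence     : function c(a, F) giving the outcome of action a in state F.
    v               : bounded preference scaling, 0 <= v(c) <= 1.

    Returns the action maximising the estimate of U_n(a) = int v{c(a, F)} Pi_n(dF),
    together with the estimated utilities.
    """
    utilities = {}
    for a in actions:
        # Average the bounded utility over the posterior draws of F.
        utilities[a] = np.mean([v(consequence(a, F)) for F in posterior_draws])
    return max(utilities, key=utilities.get), utilities

# Hypothetical illustration: actions are point estimates of the mean of F,
# the consequence is squared error, and v maps it into (0, 1].
rng = np.random.default_rng(0)
posterior_draws = [rng.exponential(size=200) for _ in range(100)]  # stand-in draws
actions = [0.5, 1.0, 1.5]
consequence = lambda a, F: (a - F.mean()) ** 2
v = lambda c: np.exp(-c)  # bounded utility in (0, 1]

best_action, U_hat = choose_action(posterior_draws, actions, consequence, v)
```

As $n$ grows, Theorem 2 says the estimated $U_n(a)$ settles at $U_0(a)$ for each action, so the chosen action eventually sticks at the correct one.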

4. Predictive density. Here we demonstrate that if $\Pi$ has the Kullback-Leibler property then there exists a Kullback-Leibler consistent sequence of densities $\bar{f}_n$; that is, $D(\bar{f}_n) \to 0$ a.s.

THEOREM 3. (WALKER, 2002). Suppose that $\Pi$ has the Kullback-Leibler property and
$$\sum_n n^{-2}\,\mathrm{Var}(\log I_n/I_{n-1}) < \infty.$$
If $\bar{f}_N = N^{-1}\sum_{n=1}^{N} f_{n-1}$, then $D(\bar{f}_N) \to 0$ a.s.

Hence, for those who see density estimation as an important statistical procedure, $\bar{f}_N$ is an easily available Kullback-Leibler consistent sequence of densities. Nonparametric predictive densities are often hard to construct but are not hard to sample from. So, if it is possible to sample from $f_n$ then it is obviously also possible to sample from $\bar{f}_N$.

If $\Pi$ has the Kullback-Leibler property then the condition
$$\sum_n n^{-2}\,\mathrm{Var}(\log I_n/I_{n-1}) < \infty$$
is an extremely mild constraint. If $I_n = \exp(-n t_n)$ then $t_n \to 0$ a.s. whenever $\Pi$ has the Kullback-Leibler property; see Barron, Schervish and Wasserman (1999). It is therefore sufficient that
$$\sum_n E\{(t_n - t_{n-1})^2\} < \infty.$$
If, for example, $t_n = O(n^{-s})$ for some $s > 0$, then $t_n - t_{n-1} = O(n^{-1-s})$ and the condition is easily satisfied.

5. Illustrations. Here we present four examples illustrating Theorem 1.

EXAMPLE 1. In the first example we take the true density function to be $f_0(x) = \exp(-x)$. We take model 1 to be $f_1(x; \theta) = \theta\exp(-x\theta)$ with prior $\pi_1(\theta) = \exp(-\theta)$, and take model 2 to be fixed at $f_2(x) = 0.5\exp(-0.5x)$. It is easy to see here that $\delta_1 = 0$ and $\delta_2 = \log 2 - 0.5 = 0.193$. It is calculated that
$$\int \prod_{i=1}^n f(x_i)\,\Pi_1(df) = \frac{n!}{(1 + s_n)^{n+1}},$$

where $s_n = \sum_{i=1}^n x_i$, and
$$\int \prod_{i=1}^n f(x_i)\,\Pi_2(df) = (1/2)^n\exp(-s_n/2).$$
Following a simulation of data from $f_0$, Figure 1, plotting $n^{-1}\log B_n$ for $n = 1, \ldots, 3000$, is presented at the end of the paper; the convergence of $n^{-1}\log B_n$ to the correct value of 0.193 is evident. (A short simulation sketch of this example is given after Example 3.)

EXAMPLE 2. In the second example we consider the case when both models are wrong, in the sense that neither prior has the Kullback-Leibler property. We now take $f_0(x) = x\exp(-x)$ and keep model 1 as in the first example. The second model is Weibull, $f_2(x; \theta) = \theta x\exp(-\theta x^2/2)$, with $\pi_2(\theta) = \exp(-\theta)$. Then $\delta_1 = 0.116$ and $\delta_2 = 0.099$, and so $\delta_2 - \delta_1 = -0.017$. Again, a simulation of $n^{-1}\log B_n$ was performed, and Figure 2, plotting the convergence to the correct value, is also at the end of the paper. It should be noted that in this case the convergence is very slow and we took 1,000,000 samples. The figure shows every 350th value of $n^{-1}\log B_n$.

EXAMPLE 3. In this example we take a nonparametric prior, not infinite dimensional, but with a large number of parameters. With samples from $[0, 1]$, we take model 1 to be $f_1(x; \theta) = \theta x^{\theta-1}$ with prior $\pi_1(\theta) = \exp(-\theta)$. We take model 2 to be a histogram on $m = 1{,}000$ bins, each bin of length $1/m$. The density function is $f_2(x) = m\,q_k$ for $(k-1)/m < x \leq k/m$, and we take $(q_1, \ldots, q_m)$ to have a Dirichlet prior with parameters all equal to 1. Then
$$\int \prod_{i=1}^n f(x_i)\,\Pi_1(df) = \frac{n!\,\prod_{i=1}^n x_i^{-1}}{(1 + t_n)^{n+1}},$$
where $t_n = -\sum_{i=1}^n \log x_i$, and
$$\int \prod_{i=1}^n f(x_i)\,\Pi_2(df) = \frac{m^n\,\Gamma(m)}{\Gamma(n + m)}\prod_{k=1}^m \Gamma(n_k + 1),$$
where $n_k = \sum_{i=1}^n 1\{(k-1)/m < x_i \leq k/m\}$. If $f_0$ is uniform on $[0, 1]$, then both $\delta_1 = 0$ and $\delta_2 = 0$. As can be seen from the simulation of $n^{-1}\log B_n$ in Figure 3 at the end of the paper, the Bayes factor always prefers the parametric model, although asymptotically it prefers neither model, since $n^{-1}\log B_n \to 0$ a.s.; the convergence is extremely slow. This exposes the myth that Bayes factors always select the more complex model.
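As a check on Theorem 1, the following is a minimal simulation sketch of Example 1, not code from the paper: it draws data from $f_0(x) = e^{-x}$, uses the closed-form marginal likelihoods above, $I_{1n} = n!/(1+s_n)^{n+1}$ and $I_{2n} = (1/2)^n e^{-s_n/2}$, and tracks $n^{-1}\log B_n$, which should settle near $\log 2 - 0.5 \approx 0.193$. The sample size and seed are arbitrary.

```python
import numpy as np
from scipy.special import gammaln  # log Gamma, used for log n!

rng = np.random.default_rng(1)
n_max = 3000
x = rng.exponential(scale=1.0, size=n_max)   # data from f0(x) = exp(-x)
s = np.cumsum(x)                             # s_n = x_1 + ... + x_n
n = np.arange(1, n_max + 1)

# log I_{1n} = log n! - (n + 1) log(1 + s_n); model 1 is theta*exp(-theta*x), prior exp(-theta)
log_I1 = gammaln(n + 1) - (n + 1) * np.log(1.0 + s)
# log I_{2n} = -n log 2 - s_n / 2; model 2 is fixed at 0.5*exp(-0.5*x)
log_I2 = -n * np.log(2.0) - s / 2.0

log_Bn_over_n = (log_I1 - log_I2) / n
print(log_Bn_over_n[-1], np.log(2.0) - 0.5)  # both values should be close to 0.193
```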

EXAMPLE 4. This is a slight variation of Example 3. Here we retain model 2 and $f_0$ as in Example 3 and take $f_1(x) = 2x$ to be fixed. Then $\delta_1 = 0.306$, and Figure 4 shows the convergence of $n^{-1}\log B_n$ to $-0.306$, again very slowly. Note that in this case the Bayes factor always prefers the nonparametric model.

6. Discussion. In this paper we have demonstrated how the Kullback-Leibler property for a prior $\Pi$ provides good large sample properties for a number of Bayes procedures. We argue that Bayesians should be constructing priors with the Kullback-Leibler property, at the very least when there is doubt about the underlying shape of the density function generating the data. Although the results are based on large samples, the notion of having all densities in the Kullback-Leibler support of the prior must be an appealing one for all Bayesians. Indeed, from the Bayes factor perspective, there is no reason to compare a model with the Kullback-Leibler property with any other model, and so, practically speaking, it meets the requirements of a true model. Barron, Schervish and Wasserman (1999) demonstrate that a number of nonparametric priors which are in use, such as Pólya trees and infinite dimensional exponential families, do have the Kullback-Leibler property.

For those interested in subjective issues, consider the following. Walker et al. (1999) show that it is possible to take subjective information from a parametric model and incorporate it into a nonparametric model. Then, for those who would acknowledge the existence of $f_0$, this paper demonstrates the practical relevance of a nonparametric model. For those who would not accept that an object such as $f_0$ exists, the nonparametric approach using the Kullback-Leibler property offers the surprise-free approach (see point 1 in Section 1), and at a minimum avoids the poor calibration that may confront the statistician who is happy to hand out probability one to a host of possible models. See Draper (1999) for a detailed discussion of this point. For those concerned with working in high dimensional spaces, the message from the collection of applied papers edited by Dey et al. (1998) is that it is no more difficult to routinely implement Bayesian nonparametric procedures than parametric ones, following the advent and rapid growth of user-friendly Markov chain Monte Carlo methods.

Other ideas for avoiding the model merry-go-round include Bayesian model averaging (Draper, 1995) and model selection, both ideas based on

a fixed set of models with associated probabilities of plausibility, rather than probabilities of correctness. Practically speaking, it may not be difficult to assign probabilities to models; if there is a finite set, then assigning equal probability is one option. A number of recent researchers have pointed out that model averaging usually outperforms model selection, and intuitively it is easy to see why this might be the case. We see model averaging as an attempt to construct a prior with large support (the idea being that at least one of the models may be close to $f_0$) using a collection of parametric models, and this could be seen as equivalent to a Bayesian nonparametric statistician making the infinite dimensional nonparametric model finite. This often happens, as in the case of Pólya trees and the infinite dimensional exponential family; indeed, it is necessary in these cases.

Acknowledgments. The work of S. Walker is financially supported by an EPSRC Advanced Research Fellowship.

References.

BARRON, A., SCHERVISH, M.J. and WASSERMAN, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27, 536-561.

BERNARDO, J.M. and SMITH, A.F.M. (1994). Bayesian Theory. Wiley & Sons.

BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley & Sons.

DEY, D., SINHA, D. and MÜLLER, P. (1998). Practical Nonparametric and Semiparametric Bayesian Statistics. Lecture Notes in Statistics. Springer, NY.

DRAPER, D. (1995). Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society Series B 57, 45-97.

DRAPER, D. (1999). Discussion of the paper "Bayesian nonparametric inference for random distributions and related functions" by Walker et al. Journal of the Royal Statistical Society Series B 61, 485-527.

GELFAND, A.E. and DEY, D.K. (1994). Bayesian model choice: asymptotics and exact calculations. Journal of the Royal Statistical Society Series B 56, 501-514.

DE GROOT, M. (1970). Optimal Statistical Decisions. McGraw-Hill Book Company.

HIRSHLEIFER, J. and RILEY, J.G. (1992). The Analytics of Uncertainty and Information. Cambridge University Press.

LINDLEY, D.V. (1985). Making Decisions (2nd edn). Wiley & Sons.

LOÈVE, M. (1963). Probability Theory (3rd edn). D. Van Nostrand Company (Canada) Ltd.

SCHWARTZ, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 4, 10-26.

VON NEUMANN, J. and MORGENSTERN, O. (1947). Theory of Games and Economic Behaviour (2nd edn). Princeton University Press, Princeton, NJ.

WALKER, S.G., DAMIEN, P., LAUD, P.W. and SMITH, A.F.M. (1999). Bayesian nonparametric inference for random distributions and related functions (with discussion). Journal of the Royal Statistical Society Series B 61, 485-527.

WALKER, S.G. (2002). A new approach to Bayesian consistency. Submitted.

WASSERMAN, L. (1998). Asymptotic properties of nonparametric Bayesian procedures. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller and D. Sinha, eds.), 293-304. Lecture Notes in Statistics, Springer, NY.

[Figure 1. Example 1: convergence of the Bayes factor; $(\log B_n)/n \to 0.193$, $n = 1$ to 3000.]

[Figure 2. Example 2: convergence of the Bayes factor; $(\log B_n)/n \to -0.017$, every 350th value of $n$ shown.]

[Figure 3. Example 3: convergence of the Bayes factor; $(\log B_n)/n \to 0$, every 350th value of $n$ up to 1,000,000.]

[Figure 4. Example 4: convergence of the Bayes factor; $(\log B_n)/n \to -0.306$, every 350th value of $n$ up to 1,000,000.]