On Bayesian Models and Consistency

STEPHEN WALKER* AND PAUL DAMIEN**
*Department of Mathematical Sciences, University of Bath
**University of Michigan Business School

ABSTRACT. We discuss the relevance of consistency to the Bayesian. Should consistency be dismissed as irrelevant, or thought about seriously when constructing prior distributions? Strong opinions have been held on this matter, but it is probably fair to say it is a largely neglected area. Pioneers, such as de Finetti, Savage and Lindley, have had very little to say on the matter. The aim of this paper is to give specific reasons why Bayesians should be concerned with consistency, and exactly what type of consistency they should be concerned with if the goal is to make rational and good decisions. We also discuss the notion of a true model and define useful connections with consistency.

KEYWORDS: Decision theory, exchangeability, expected utility rule, true model.

1. Introduction

While it is acknowledged that consistency is crucially important in classical statistics, for example in providing justification for many estimators and procedures, no such consensus exists in the domain of Bayesian statistics. This is, we would suggest, due to an understanding that Bayesian decision theory/inference is the logical consequence of a set of axioms of rationality and a subjective interpretation of probability (see, for example, Bernardo and Smith, 1994). Hence, Bayesian procedures are logical consequences of an axiomatic system and so, apparently, asymptotic study of such procedures is not required. Large sample studies have generally focused on the mathematical aspects (Freedman, 1963; Schwartz, 1965; Berk, 1966, 1970), with no practical implications for Bayesian inference being made as a consequence

of the results. That is, in a traditional Bayesian analysis, consistency is not typically, if ever, considered.

The procedural components required to be specified in order to carry out a Bayesian analysis of observed data are undefined; it is only their existence which is guaranteed. The prior distribution is thought to be the most problematic to specify in practice, the existence of which is guaranteed by the Representation Theorem of de Finetti (1938), assuming exchangeable observations; see also Hewitt and Savage (1955). Since we only have an existence theorem, it might seem appropriate to think hard about the specification of the prior and, in particular, how it should be constructed. This, perhaps, could then lead to procedural principles that might further solidify the Bayesian paradigm.

The question in which this paper is interested is the following: should Bayesians construct priors so that inference is consistent in some sense? Clearly it is not incorrect or wrong to do this. But is it sensible? Does it matter?

A first point to make is that consistency (in whatever form it takes) is not an accident. It has to be mathematically designed. This imposes conditions on a prior which might be difficult to establish in specific cases; see, for example, Schwartz (1965), Barron et al. (1999) and Ghosal et al. (1999). Another point to make is whether a procedure which is the logical consequence of a desire for rational behaviour is thereby a good procedure. Clearly, an existence theorem for the prior means precisely and just that: anything goes. The prior could be a single point mass on some unusual density, which is clearly not good. Making use of relevant prior information would (hopefully) eliminate this impractical prior from consideration, but the point is made: the theory does not automatically lead to a good procedure. More effort is required.

In this paper, we endeavour to give reasons why a particular type of consistency is important and to discuss the consequences of this. In particular, we concentrate on what we believe to be a practical perspective and hence consider the notion of consistency with respect to Bayesian decision theory. Since inference is based entirely on models, we start by looking at Bayesian models and in particular true Bayesian models. This is done in Section 2, where the notion of a true Bayesian model is discussed. In Section 3, we propose a general definition of a true Bayesian model in terms of consistency, claiming this a useful definition, and in Section 3.1 consider

the ramifications of this definition. Section 4 then considers which type of consistency is required by considering Bayesian decision theory, acknowledging the fact that data collection leads to decisions having to be made. Section 5 extends the work presented in Section 4 by considering infinite rather than finite decision spaces.

2. True models

A survey of the Bayesian literature will involve meeting the term "true model" with high frequency. This is particularly apparent in the Bayesian model selection and model averaging literature. For example, in Raftery et al. (1996) it is stated: "A typical approach to data analysis is to carry out a model selection exercise leading to a single 'best' model and to then make inference as if the selected model were the true model". Then later in the paper: "...pr(Mk) is the prior probability that Mk is the true model." The following is taken from Key et al. (1999): "...the M-closed view... corresponds to believing that one of the models... is 'true', but which one is unknown." Section 4 of Key et al. (1999) is entitled "Choosing among models when none of them is true". The discussion concluding Key et al. (1999) brings into question the existence of a true model. M. J. Bayarri states: "In a sense, an honest Bayesian can only agree with the main theme of this paper (i.e. Key et al., 1999), namely that (i) no model is true...". This remark is essentially attributable to Savage, who said "no model is true, some models are useful." See also Box (1980), who writes "No statistical model can safely be assumed adequate". It is with these latter sentiments that we wish to disagree.

None of the participants in the "true model" discussions define exactly what they mean by such a phrase. In this paper, we consider the case when f0 is a density and y^n = (y_1, ..., y_n) are available as a random sample from f0, the first n observations of a possibly "infinite" sequence y_1, y_2, ...; Key et al. (1999) refer to this as the exchangeable case. It is quite likely that the above commentators were referring to more complex data structures than the exchangeable case, though this was never mentioned explicitly in their comments. Nevertheless, discussing the concept of a true model is valid in the exchangeable case, and it is obviously the correct place to start since it is the most minimal assumption that can be made to develop Bayesian theory.

We examine when a model is or is not the true model with respect to f0. There seems to be no definition of exactly what a true Bayesian model is. It might be assumed that a precise definition would be as follows:

DEFINITION 1. If M = {f(·; θ), π(θ)} and f0(·) = f(·; θ0) and θ0 is in the support of the prior, then M is a true Bayesian model.¹

We use the phrase 'is a true Bayesian model' since there are, for example, many π(θ) which will have θ0 in their support. The main point to mention here is that if the true model is parametric (i.e. finite dimensional) then Doob (1949) proves that posterior consistency holds, in the sense that the posterior distribution accumulates about θ0 as the sample size tends to infinity, for almost all θ0 ~ π. We would argue that without the result of Doob (1949), Definition 1 would be hard to substantiate as defining a true model, since without consistency it implies nothing favourable about using the model in preference to any other model.

Perhaps the feeling that no true model exists, as indicated by Bayarri, Savage, Box and others, stems from the problems associated with a parametric model. Of course, how can it really be known that a θ0 exists? It cannot. To be sure that the true density does lie in the support of a prior we need to consider bigger, or nonparametric, models. Consequently, a general definition of a true model should also allow for θ being an infinite dimensional parameter, even a density or a distribution. In these nonparametric cases, the support of the prior depends on the topology or metric used to define distances between densities or distributions. Suppose the metric is d(θ, θ').

DEFINITION 2. The support of a prior π is defined to be S = {θ : π(A_ε(θ)) > 0 for all ε > 0}, where A_ε(θ) = {θ' : d(θ, θ') < ε}.

For example, if θ is a density with respect to the Lebesgue measure, then a

¹This version of M suggests a Bayesian model has two components, a parametric family with a prior. We prefer, and this will be apparent later on, to think of M as a probability on a subset of all densities, the subset being a parametric family. This then extends readily to the idea of a nonparametric prior probability on the set of all densities.

possible choice of metric is the Hellinger distance:

d(θ, θ') = {∫ (√θ − √θ')²}^(1/2).

Another possibility is a metric which metrizes weak convergence, such as the Prokhorov metric; see, for example, Billingsley (1968). Thus, a nonparametric version of Definition 1 would be as follows:

DEFINITION 1 (revised). If M = {π(dθ)} and θ0 lies in the support of the prior with respect to the metric d, then M is a true Bayesian model.

Now π is a probability measure on densities or distributions, rather than a density as in the parametric case. When θ is a density, the likelihood function is defined to be

L_n(θ) = ∏_{i=1}^{n} θ(y_i)

and the posterior distribution is characterised via

π(A | y^n) ∝ ∫_A L_n(θ) π(dθ).

This supports the notion that a Bayesian model is indeed just a probability on sets of densities rather than a likelihood and prior combination. There are special issues to consider when θ is a random distribution function which does not admit a density with respect to any dominating measure. Such a case in point is the Dirichlet process (random distributions with random mass allocated to random points), formally introduced by Ferguson (1973). Then alternative procedures need to be developed to obtain the posterior distributions.

Definition 1 is now actually incomplete as a definition since it says nothing about d. It is our intention to convince the reader that Definition 1 is inappropriate as a definition of a true Bayesian model, even if d has been specified, and to propose and support an alternative definition. Whereas Definition 1 may well be appropriate in finite dimensional cases, it does not sit well in nonparametric cases. The key is that Definition 1 implies consistency in the parametric, finite dimensional, case and hence the notion of a true model. However, Definition 1 does not imply consistency in infinite dimensional cases, even if, for example, d is a strong metric.
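As a concrete (and purely illustrative) numerical companion to the distance just defined, the Hellinger distance between two densities can be approximated on a grid; the pair of normal densities used here is our own choice and does not appear in the paper.

```python
import numpy as np

def hellinger(f, g, grid):
    """Approximate d(f, g) = { integral (sqrt(f) - sqrt(g))^2 }^(1/2)
    by a Riemann sum on a uniform grid."""
    dx = grid[1] - grid[0]
    diff2 = (np.sqrt(f(grid)) - np.sqrt(g(grid))) ** 2
    return np.sqrt(np.sum(diff2) * dx)

def normal_pdf(mu, sigma):
    # density of a N(mu, sigma^2) random variable
    return lambda y: np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

grid = np.linspace(-10.0, 10.0, 20001)
d_same = hellinger(normal_pdf(0, 1), normal_pdf(0, 1), grid)  # identical densities
d_far = hellinger(normal_pdf(0, 1), normal_pdf(5, 1), grid)   # well-separated densities
print(d_same, d_far)
```

With this (unnormalised) definition the distance lies between 0 and √2; the two well-separated normals above come out close to the maximum, while the identical pair gives 0.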

Our aim is to provide a definition of a true Bayesian model, whether we decide to put all the mass on all densities (and call ourselves nonparametric Bayesians) or restrict the mass to parametric families of densities (and call ourselves parametric Bayesians). The consideration of nonparametric priors is warranted by the recent explosion in their use; see, for example, Dey et al. (1998) and Walker et al. (1999).

3. Defining a true Bayesian model

Suppose that M = {π(dθ)} and π(A_ε | y^n) → 1 with probability 1 as n → ∞ for all ε > 0, where π(· | y^n) is the posterior based on a sample of size n and A_ε = {θ : d(θ0, θ) < ε}. If this holds for M then it is clear that M is a true Bayesian model with respect to the metric d. It would be hard to argue that M was not a true model if it possessed the property of consistency. Hence, we are saying a consistent model is a true model.

Here the problems of using Definition 1 for defining a true Bayesian model are encountered. A model which satisfies the conditions of Definition 1 is not necessarily consistent. Counter-examples exist. One such counter-example is presented in Barron et al. (1999). Their prior is nonparametric and, although the true density f0 = Un(0, 1) lies within the (Kullback-Leibler) support of the prior, the posterior is demonstrably not consistent with respect to the Hellinger distance. It is, however, weakly consistent. But even weak consistency can cause problems. Diaconis and Freedman (1986) present an example which illustrates that even if a model has the true distribution in the weak support, weak consistency is not guaranteed.

The upshot is that Definition 1 includes models which, not being consistent in any sense, would be difficult to justify as true. The true in true model should mean just that: the true f0 or F0 is available, which is not the case unless the model is consistent with respect to a suitable metric. Without the property of consistency, there is nothing to distinguish, i.e. to set apart, a model which satisfies the conditions of Definition 1 and is not consistent from any other model. That is, the property of having the "truth" in the support counts for nothing unless it is acting as (part of) the sufficient condition for consistency. Hence, it is argued that a true model must be consistent, and we have already indicated that a consistent model is a true model, leading to

DEFINITION 3. A Bayesian model M = {π(dθ)} is said to be true with

respect to the metric d if and only if π(A_ε | y^n) → 1 with probability 1 for all ε > 0, where A_ε = {θ : d(θ0, θ) < ε}.

In the next sub-section we discuss the ramifications of using Definition 3 as defining a true model, in the exchangeable case.

3.1 CONSEQUENCES OF DEFINITION 3

The fundamental assumption adopted here is that a true model is more desirable to work with than any other model, so long as all other important aspects are approximately equal.² We mainly concentrate on the consequences of Definition 3 for model selection.

The first point to make is that with careful construction of a (nonparametric) model it is possible to ensure the prior is consistent, i.e. the posteriors are consistent with respect to either the Hellinger or Prokhorov metrics. Sufficient conditions have been given by both Barron et al. (1999) and Ghosal et al. (1999) for the former and by Schwartz (1965) for the latter. Additionally, Barron et al. (1999) and Ghosal et al. (1999) work out these sufficient conditions for specific priors, including infinite-dimensional exponential families, Pólya trees, Dirichlet mixture models, and histograms. With the sufficient conditions, these models then meet the requirements for being called a true Bayesian model.

Consequently, understanding consistency, the notions of M-closed and M-open views, as discussed by Bernardo and Smith (1994), lose importance. That is, it is possible to force the M-closed view with a single model. If there are two models, one consistent, the other not, all else being equal in the prior information sense, we would obviously select to work solely with the consistent model. There is no good reason to use the non-consistent model. Hence model selection will only be necessary when all the models under consideration have not, for whatever reason, been constructed to be consistent.
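The consistency property of Definition 3 can be watched in action in the simplest possible conjugate setting. The toy below is entirely our own construction (the paper's examples of consistent priors are nonparametric): Bernoulli(p0) data under a Beta(1, 1) prior, with the posterior mass of the neighbourhood A_ε = {p : |p − p0| < ε} estimated by Monte Carlo draws from the Beta posterior. That mass tends to 1 as n grows, which is exactly what Definition 3 asks for.

```python
import random

random.seed(42)

p0, eps = 0.3, 0.05   # true parameter and neighbourhood radius

def posterior_mass(n, draws=4000):
    """Estimate pi(A_eps | y^n) for n Bernoulli(p0) observations under a
    Beta(1, 1) prior; the posterior is Beta(1 + successes, 1 + failures)."""
    successes = sum(1 for _ in range(n) if random.random() < p0)
    a, b = 1 + successes, 1 + n - successes
    sample = [random.betavariate(a, b) for _ in range(draws)]
    return sum(abs(p - p0) < eps for p in sample) / draws

masses = [posterior_mass(n) for n in (10, 100, 1000, 10000)]
print(masses)  # posterior mass of the eps-ball grows towards 1
```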
Obviously, the comments and consequences relating to model selection are equally valid for model averaging (Draper, 1995). That is, knowing a true model eliminates the need for model averaging.

²Two priors, say π1 and π2, can be considered equal in the a priori information sense if, for example, E_π1(A) = E_π2(A) and var_π1(A) = var_π2(A) for all A.

Having said all of this, constructing and making inference from a true model may be inconvenient and inappropriate. For example, a simpler (proxy) model might be needed to explain the statistical procedure to a non-expert, or for ease of inference. In this case a number of simple models can be considered and the 'best' one used. However, the benchmark model is a true model, and consequently the best model should be the one which is closest in some sense to a true model. Walker and Gutiérrez-Peña (1999) implicitly advocate choosing the model whose predictive density is closest, in the sense of Kullback-Leibler divergence, to the predictive density of a true model.

The statements highlighted at the outset of the paper, i.e. no model is true, etc., are incorrect. A consistent model, attainable (at least) in the exchangeable case, is a true model. A question from Key et al. (1999) goes: "But when does it actually make sense to speak of a 'true' model and hence to adopt the M-closed perspective?" The answer, we believe, is when a consistent model has been constructed. In summary, we can achieve true (consistent) models.

On the assumption of exchangeability, de Finetti's celebrated Representation Theorem implies the existence of a prior π and a procedure for updating/predicting based on observations. That is,

pr{y_{n+1} ∈ B | y^n} = E{∫_B f(y) dy | y^n} = ∫ {∫_B f(y) dy} π(df | y^n).

This, however, does not imply it is a "good" procedure in any sense. Consistency is one way to define a good procedure in the sense implied above. It extends the existence theorem to provide a criterion for the selection of π. The key is this: if an arbitrary parametric model is chosen and a prior constructed, it is no great effort to incorporate certain defining characteristics from this parametric model into a nonparametric one, with the merit that the nonparametric model, with the regularity conditions for consistency, can then justifiably be called 'true'. The question left for us to consider is which type of consistency is required.

4. Which kind of consistency?

Clearly the type of consistency depends on the problem at hand, which will be determined by the reason the data was collected in the first place. Density

estimation is an important problem in statistics, and if this is the aim then priors which lead to strong consistency will be needed. We would argue that data collection is usually motivated by a decision problem, and we discuss the type of consistency required from this perspective. We wish to make this discussion as broad as possible, i.e. not restricted to parametric models, and hence we consider prior distributions defined on an arbitrary set of distributions, say F, with topology induced by the metric d, and we assume the distributions in F have densities with respect to the Lebesgue measure.

Let π be a prior distribution on F, y^n = (y_1, ..., y_n) a random sample from F0 and π^n the posterior distribution on F given y^n. We can, as has already been pointed out, distinguish between two types of consistency. Let d_W be a metric, such as the Prokhorov metric, which metrizes weak convergence, i.e. F_k converges weakly to F iff d_W(F_k, F) → 0. In the following notation, we replace θ with the distribution function F.

DEFINITION 4. If π(A | y^n) → 1 a.s. as n → ∞ for all weak neighbourhoods A of F0, i.e. A = {F : d_W(F0, F) < ε} for any ε > 0, then π is said to be weakly consistent at F0.

Now let d_S be a metric which defines strong neighbourhoods of F0; e.g. the Hellinger metric,

d_S(F0, F) = {∫ (√(dF/dy) − √(dF0/dy))² dy}^(1/2).

DEFINITION 5. If π(A | y^n) → 1 a.s. as n → ∞ for all strong neighbourhoods A of F0, i.e. A = {F : d_S(F0, F) < ε} for any ε > 0, then π is said to be strongly consistent at F0.

Schwartz (1965) established the following result: if π puts positive mass on all Kullback-Leibler neighbourhoods of F0, which we now refer to as condition (A), then π is weakly consistent at F0. The condition of Schwartz (1965) is not a necessary condition; see Wasserman (1998), who describes a counter-example first presented by Ghosal et al. (1997). Additional sufficient, but not necessary, conditions beyond condition (A) for strong (Hellinger) consistency are provided by Barron et al. (1999) and Ghosal et al. (1999). A good review is provided by Wasserman (1998).

The apparent need for strong consistency is based on the fact that

weak neighbourhoods of F0 include distributions F with densities that are far from f0 with respect to a strong metric, such as the Hellinger metric. An example highlighting this phenomenon is provided by Barron et al. (1999). It should be noted that the (current) extra conditions for strong consistency are quite strict and could clash with prior information, i.e. it is difficult to both insist on strong consistency and incorporate realistic prior information. An example of this is Pólya trees.

Next, we review the elements of Bayesian decision theory.

4.1 BAYESIAN DECISION THEORY

Taking notation from Hirshleifer and Riley (1992), the elements of a decision problem are as follows:

(1) a finite set of actions indexed by x; for practical purposes we assume x ∈ {1, ..., X} for some integer X. While most theory is associated with finite decision spaces (Raiffa, 1970; Lindley, 1985), a relaxation of this assumption to a non-finite decision space will be discussed in Section 5;

(2) a set of states of nature, which we take to be F, equipped with the weak topology³;

(3) a consequence function c(x, F) showing outcomes under all combinations of actions and states of nature;

(4) a preference scaling function v(c) measuring the desirability of the consequence c;

(5) a probability distribution on F representing beliefs in the true state of nature. In a Bayesian context this probability is the prior π in the no sample problem, and is π^n once the data y^n have been observed.

We assume that v{c(x, F)} is uniformly continuous in F for each x. This makes sense since small changes in F should result in small changes to our elementary utility v(·).

³We assume that the relevant unknown state of nature is the distribution giving rise to the data. This gives us a general framework to work with. Certainly, knowing the true distribution will solve all decision problems associated with the data.
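The five elements reduce to a small computation when beliefs are supported on finitely many candidate states of nature, since the integral defining expected utility becomes a weighted sum. Everything in the sketch below (the actions, states, utility table and belief weights) is invented for illustration; only the rule of weighting utilities by probabilities comes from the text.

```python
# Hypothetical decision problem: two actions, three candidate states of
# nature F1, F2, F3.  The table stores v{c(x, F)} directly, with cardinal
# utilities already scaled to [0, 1] as the expected-utility rule requires.
utilities = {
    1: [0.9, 0.2, 0.4],   # v{c(1, F)} for F1, F2, F3
    2: [0.3, 0.8, 0.6],   # v{c(2, F)} for F1, F2, F3
}
belief = [0.5, 0.3, 0.2]  # prior (or posterior) probabilities of F1, F2, F3

def expected_utility(x):
    # U(x) = sum over states of v{c(x, F)} * pr(F)
    return sum(p * u for p, u in zip(belief, utilities[x]))

best = max(utilities, key=expected_utility)
print(best, expected_utility(best))  # action 1 wins, expected utility 0.59
```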

The Von Neumann-Morgenstern (1947) expected utility rule then asserts that the best decision is to take the action x which maximises

U_n(x) = ∫ v{c(x, F)} π^n(dF).

This expected utility rule is applicable if and only if the v(·) function has been determined in a particular way which leads to v(c) being bounded, specifically 0 ≤ v(c) ≤ 1. That is, v(c) has a probabilistic interpretation. Hirshleifer and Riley (1992) say:

It turns out that the expected-utility rule is applicable if and only if the v(c) function has been determined in a particular way that has been termed the assignment of "cardinal" utilities to consequences.

They go on to say:

To formally justify the joint use of a cardinal preference scaling function and the expected-utility rule, for dealing with choices among risky prospects, involves a somewhat higher order of technical difficulty. What follows... how the required type of preference scaling function can be developed... The essential point is that the v(c) measure obtained via the reference lottery technique is in the form of a probability, so that the expected-utility rule becomes equivalent to the standard formula for compounding probabilities.

See Hirshleifer and Riley (1992) for further details.

There are differing opinions on the point of a bounded elementary utility function. De Groot (1970) states:

In many axiomatic developments of the theory of utility, assumptions are made which are stronger in certain respects than those which have been made here [De Groot, 1970]. These strengthened assumptions [Von Neumann-Morgenstern, 1947] make it possible to conclude that the utility function must be a bounded function.

It therefore depends on which set of axioms, i.e. the strength of them, the experimenter is willing to adhere to. We would adhere to those of Von Neumann

and Morgenstern (1947), summarised in Hirshleifer and Riley (1992), which lead to bounded utilities, since with unbounded utilities, i.e. unbounded v(·), it is not guaranteed that U_n(x) even exists. Berger (1985) effectively works with bounded utilities; he only considers lower bounded loss functions, i.e. l(x, F) > −K > −∞, and we can consider v_x(F) = K − l(x, F).

It is not our intention to discuss the expected utility rule further in this paper. Our aim, motivated by the discussion in Section 4.2, is to provide sufficient conditions under which the rule is consistent. Note that, since we assume c(x, F) to be uniformly continuous in F for all x and v(·) is bounded and uniformly continuous, v_x(F) = v{c(x, F)} is bounded and uniformly continuous in F for all x.

4.2 CONSISTENT DECISIONS

Although we did not cover the ground, i.e. present the axioms and derivations, in the previous subsection, the point is that Bayesian decision theory is the logical consequence of axioms of rationality and the subjective interpretation of probability. The goal is to maximise expected utility. However, good foundations do not necessarily lead to good procedures. As more data accumulate, it is essential that the quality of decisions improve. The theory does not touch on this area. However, it is common sense that with large samples, correct decisions (defined later) must be made. How could it ever be argued that collecting more information (i.e. data) is worthwhile unless such an asymptotic property exists?

To highlight the above point, let us suppose there are two actions, one of which must be taken. If we do not have consistency then the rule will (randomly) point to one action or the other ad infinitum as the sample size increases, or, alternatively, eventually sticks to the incorrect decision. In this latter case, a decision maker is going to reject an infinite and free (no sampling cost) sample, which is clearly ridiculous; that is, at least if he/she guesses at the start (i.e. n = 0), he/she has a 50% chance of getting it right. An experimenter of any type should never be in a position where it is preferable to reject an infinite and free sample rather than accepting it. Although we accept that all decisions will be made with finite samples, the methodology used to perform this task should be such that it would be desirable to accept an infinite and free sample.

This ties up with the notion of perfect information, and is connected to

the idea of deciding on how many samples to collect, i.e. the size of n (when sampling costs money). The following is a quote from Raiffa and Schlaifer (1961):

Let us imagine an ideal experiment e∞ with a known cost c∞ which is capable of yielding exact or perfect information concerning the true state of F and let us suppose that the decision maker wishes to choose between e∞ and the null experiment (no sampling).

This decision problem only makes sense if we do indeed have consistency. The discussion of the number of samples to collect must be done under the fundamental assumption that an infinite sample does indeed lead to perfect information, i.e. the correct decision. Otherwise, there appears no reason for sampling at all.

It is also a matter of inspiring confidence. A procedure for making decisions must inspire confidence in those paying for experiments, collecting the data and then handing it over to the expert decision maker (i.e. the statistician). If the decision maker with his/her procedure cannot guarantee good decisions with large samples then this is not going to inspire confidence. It is not good enough to say procedure A will be implemented for small samples and procedure B for large samples. If procedure B is good for large samples, i.e. leads to correct decisions, why is it not good for small samples? If A is good for small samples, why is it not good for large samples? And what precisely is a small/large sample? As we have mentioned before, we can adequately incorporate both prior information and consistency into a single prior, which sorts out both the large and small sample desirabilities of a prior (or procedure).

The correct decision is made if F0 is known and we can evaluate U_0(x) = v{c(x, F0)} for each x, and hence select the action which maximises U_0(x). As the sample size increases to infinity, at some point we require the correct decision to be made. For this we need

pr{x_n ≠ x_0 i.o.} = 0,

where x_0 is the correct action, i.e. U_0(x_0) > U_0(x) for all x ≠ x_0, and x_n maximises U_n(x). A sufficient condition for this result to hold is that

max_x {U_n(x)} → max_x {U_0(x)} a.s.

which, since x indexes a finite set, holds if U_n(x) → U_0(x) a.s. for all x. Now

U_n(x) = ∫ v_x(F) π^n(dF)

and, if π satisfies condition (A), i.e. π^n converges weakly, a.s., to the probability with point mass at F0, then from the Portmanteau theorem (see, for example, Billingsley, 1968, Theorem 2.1) the desired convergence result for U_n(x) holds. Hence, provided a prior satisfies condition (A), a decision maker is guaranteed to make the correct decision as more samples, i.e. as more information, are accumulated.

We would argue, therefore, that Bayesian consistency is important. We do need decisions to be correct as the sample size goes to infinity. Moreover, it is sufficient that models are weakly consistent rather than insisting that they are strongly consistent. This means that provided a prior puts positive mass on all Kullback-Leibler neighbourhoods of the true distribution, a Bayesian ends up with the right decision. To ensure the Kullback-Leibler property, it may well be necessary to use a large or nonparametric prior, since obviously the true distribution is unknown.

In summary, we propose that priors are constructed to ensure the Kullback-Leibler property. It can be done, and this ensures consistent decisions are made. To guarantee the Kullback-Leibler property it may be required to use nonparametric priors. Indeed, this is necessary to avoid the problem of a decision maker who knows that decisions made with a free and infinite sample will either be random or wrong, and hence is in the undesirable situation of having to turn down such a sample.

4.3 OBJECTIVE PRIOR AND RATES OF CONVERGENCE

A particular type of utility arises if we think of the decision problem as having y_{n+1} as the unknown state of nature, having witnessed y^n. Then pr(y_{n+1} ∈ A | y^n) = F^n(A), where F^n = ∫ F π^n(dF). Then we would have

U_n(x) = ∫ v_x(y) dF^n(y) = ∫ {∫ v_x(y) dF(y)} π^n(dF)

and hence the previous theory applies with v_x(F) = ∫ v_x(y) dF(y). A so-called objective prior may well be warranted in many contexts, and one is available

using a Dirichlet process prior for F (Ferguson, 1973) with diffuse base measure: effectively this is equivalent to using the Bayesian bootstrap (Rubin, 1981). Then

U_n(x) = (1/n) ∑_{i=1}^{n} v_x(y_i),

which clearly only depends on the data and the choices of v_x(·).⁴ In this case we can establish rates of convergence of U_n to U_0 (dropping the subscript x). We make use of a well known result of Hoeffding (1963) which gives

pr{|Z_1 + ... + Z_n| > nη} ≤ 2 exp{−2n²η² / ∑_i (b_i − a_i)²},

where the {Z_i} are independent, E(Z_i) = 0 and a_i ≤ Z_i ≤ b_i for each i. In our situation we have Z_i = v(y_i) − ∫ v(y) dF0(y), so that −1 ≤ Z_i ≤ +1 and so

pr{|U_n − U_0| > ε} ≤ 2 exp{−nε²/2}.

The Borel-Cantelli theorem confirms (what we know) the a.s. convergence of U_n to U_0. The result is useful from a practical perspective. If we require pr{|U_n − U_0| > ε} ≤ δ then we require a sample of size n of at least −2 log(δ/2)/ε².

5. Infinite decision spaces

In this section we assume that the decision space, say Ω, is not necessarily finite, and we let m be a metric on Ω. Of course, with an infinite set Ω, consistent decisions are no longer guaranteed using the theory developed in Section 4. First, we keep to (Ω, m) being compact. We will now go through the mathematics in detail (dropping a.s. from the following). We will need an equicontinuity condition for {U_n(x)}.

DEFINITION 6. {U_n(x)} is equicontinuous at x if for any ε > 0 there exists a δ > 0 such that m(x, x') < δ implies |U_n(x) − U_n(x')| < ε for all n.

To ensure this, we require a condition on v_x(F) of the type

|v_x(F) − v_{x'}(F)| ≤ K m(x, x') for all F

⁴All priors are subjective; some are more data-dependent than others. At times, authors have gone through considerable pain to construct families of objective priors, usually in the parametric framework. Little work has been done on objective nonparametric priors.

for some $K < \infty$. Then it is easy to see that $|U_n(x) - U_n(x')| \le K\,m(x, x')$ and hence we have equicontinuity for $\{U_n(x)\}$. From our conditions on $\Pi$, we know that we have the pointwise convergence of $U_n \to U_0$. A well known theorem from Analysis is useful here:

THEOREM 1. If $U_n \to U_0$ pointwise for all $x \in \Omega$, $\Omega$ is compact and $\{U_n\}$ is equicontinuous at each $x \in \Omega$, then $U_n \to U_0$ uniformly, i.e. $\sup_{x \in \Omega} |U_n(x) - U_0(x)| \to 0$.

Consequently, condition (A) implies pointwise convergence; the compactness and equicontinuity provide uniform convergence.

THEOREM 2. Uniform convergence implies consistent decisions.

PROOF. Suppose there exists a subsequence $n'$ such that $m(x_{n'}, x_0) > \epsilon$, where $x_n$ maximises $U_n$ and $x_0$ maximises $U_0$. Since $\{x_{n'}\}$ lie in a compact space there exists a sub-subsequence $n''$ and an $\alpha$ such that $x_{n''} \to \alpha$, i.e. $m(x_{n''}, \alpha) \to 0$. Equicontinuity implies $|U_{n''}(x_{n''}) - U_{n''}(\alpha)| \to 0$ and since $|U_{n''}(\alpha) - U_0(\alpha)| \to 0$ we must have $|U_{n''}(x_{n''}) - U_0(\alpha)| \to 0$. Since $\alpha \ne x_0$, for all large $n''$, $U_{n''}(x_{n''}) < U_0(x_0) - \eta$ for some $\eta > 0$. However, we know that $U_{n''}(x_0) \to U_0(x_0)$ and hence we encounter a contradiction, indicating that $x_n \to x_0$. That is, $\sup_{x \in \Omega} U_n(x) \to \sup_{x \in \Omega} U_0(x)$ and hence we are again making consistent decisions.

Let us recall the type of decision and "objective" prior considered in Section 4.3. We now wish to consider
$$U_n(x) = n^{-1}\sum_{i=1}^n v_x(Y_i).$$
For equicontinuity we require $|v_x(y) - v_{x'}(y)| \le K\,m(x, x')$ for all $y$. For consistency in a non-compact space, we require a uniform strong law of large numbers; i.e., $\sup_{x \in \Omega} |U_n(x) - U_0(x)| \to 0$ a.s.
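A toy check of the argument above (everything here is invented for illustration: the decision space $[0, 1]$ with $m(x, x') = |x - x'|$, the utility $v_x(y) = -|x - y|$, and the choice of sampling distribution). This $v_x$ satisfies $|v_x(y) - v_{x'}(y)| \le |x - x'|$ for all $y$, so $K = 1$ and $\{U_n\}$ is equicontinuous; the empirical maximiser $x_n$ should then settle near $x_0$, the maximiser of $U_0(x) = -E|x - Y|$, i.e. the median of $F_0$:

```python
import random

def v(x, y):
    # Lipschitz in x, uniformly in y: |v(x, y) - v(x', y)| <= |x - x'|, so K = 1.
    return -abs(x - y)

def argmax_U(data, grid):
    # Maximise U_n(x) = n^{-1} sum_i v(x, Y_i) over a grid
    # in the compact decision space [0, 1].
    return max(grid, key=lambda x: sum(v(x, y) for y in data) / len(data))

random.seed(0)
grid = [i / 100 for i in range(101)]
data = [random.betavariate(2, 5) for _ in range(4000)]  # illustrative F_0
x_n = argmax_U(data, grid)
# U_0 is maximised at the median of Beta(2, 5), roughly 0.26;
# x_n should land at a nearby grid point.
print(x_n)
```

Rerunning with larger $n$ moves $x_n$ onto the grid point nearest the true median, as Theorem 2 predicts for an equicontinuous family on a compact space.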

Following Pollard (1984, Def. 23) let us define $N_1(\epsilon, F^n, V)$, where $F^n$ is the empirical distribution function and $V$ is the set of functions $\{v_x : x \in \Omega\}$, as the smallest integer $m(n, \epsilon)$ for which there exist functions $\{g_1, \ldots, g_{m(n,\epsilon)}\}$ such that $\min_j \int |v_x - g_j|\,dF^n \le \epsilon$ for each $v_x \in V$. Then Pollard (1984, Theorem 24) provides the required result if $\log m(n, \epsilon) = o_p(n)$ for each $\epsilon > 0$. Rates of convergence are also discussed in Pollard (1984).

Let us summarise what we have in general terms. If $(\Omega, m)$ is compact and we construct $v$ and $m$ such that $|v_x(F) - v_{x'}(F)| \le K\,m(x, x')$ for all $F$, then with the Kullback-Leibler condition (A) on the prior distribution $\Pi$ we make consistent decisions. Conditions for consistency are less strict if we have a finite decision space. Suppose that $\Pi$ does satisfy condition (A) and $(\Omega, m)$ is compact; then, for consistency, we need $|v_x(F) - v_{x'}(F)| \le K\,m(x, x')$ for all $F$. Other subjective choices for $v$ and $m$ might not lead to this condition being satisfied. However, it would appear sensible to modify $v$ and $m$ appropriately, and to remain close, in some sense, to the subjective choice in order for claims about the decision-making procedure to be made. Without these claims, there does not appear to be any good reason for the procedure. That it is coherent does not count for anything. Our actions may well stick together - but uniformly badly! It is only with consistency in place that (we believe) we can say anything positive about the decision-making procedure at all.

6. Discussion

Our conclusion is as follows: Bayesian theory suggests a form for the procedure by which data is studied; the choice of procedure effectively boils down to the choice of prior. This form of procedure includes both good and bad procedures. That the choice is left to subjective persuasions based on prior information alone will not provide a unique prior and will include some priors which are consistent (i.e.
lead to posterior consistency). There is no good reason why the final choice should not be from this set of consistent priors. This intersection of consistent and subjective priors is the set of nonparametric priors, since consistency, as defined in this paper, is only guaranteed from such priors.

From a decision theoretic perspective and for a given utility function, we have argued the need to make correct decisions as sample sizes increase. This requires the use of nonparametric priors with the Kullback-Leibler property.

References

BARRON, A., SCHERVISH, M.J. and WASSERMAN, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27, 536-561.

BERGER, J.O. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd edn. Springer-Verlag.

BERK, R.H. (1966). Limiting behaviour of posterior distributions when the model is incorrect. Ann. Math. Statist. 37, 51-58.

BERK, R.H. (1970). Consistency a posteriori. Ann. Math. Statist. 41, 894-906.

BERNARDO, J.M. and SMITH, A.F.M. (1994). Bayesian Theory. Wiley & Sons.

BILLINGSLEY, P. (1968). Convergence of Probability Measures. Wiley & Sons.

BOX, G.E.P. (1980). Sampling and Bayes' inference in scientific modelling and robustness. Journal of the Royal Statistical Society Series A 143, 383-430.

DE FINETTI, B. (1938). Sur la condition d'equivalence partielle. VI Colloque Geneve. Act. Sci. Ind. 739. Hermann, Paris.

DE GROOT, M. (1970). Optimal Statistical Decisions. McGraw-Hill Book Company.

DEY, D., MÜLLER, P. and SINHA, D. (1998). Practical Nonparametric and Semiparametric Bayesian Statistics. Springer, New York.

DIACONIS, P. and FREEDMAN, D. (1986). On the consistency of Bayes estimates (with discussion). Ann. Statist. 14, 1-67.

DOOB, J.L. (1949). Application of the theory of martingales. In Le Calcul des Probabilités et ses Applications, Colloques Internationaux du Centre National de la Recherche Scientifique, Paris, 23-27.

DRAPER, D. (1995). Assessment and propagation of model uncertainty (with discussion). Journal of the Royal Statistical Society Series B 57, 45-97.

FERGUSON, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209-230.

FREEDMAN, D. (1963). On the asymptotic behaviour of Bayes estimates in the discrete case. Ann. Math. Statist. 34, 1386-1403.

GHOSAL, S., GHOSH, J.K. and RAMAMOORTHI, R.V. (1997). Consistency issues in Bayesian nonparametrics. Unpublished.

GHOSAL, S., GHOSH, J.K. and RAMAMOORTHI, R.V. (1999). Posterior consistency of Dirichlet mixtures in density estimation. Ann. Statist. 27, 143-158.

HEWITT, E. and SAVAGE, L.J. (1955). Symmetric measures on Cartesian products. Trans. Am. Math. Soc. 80, 470-501.

HIRSHLEIFER, J. and RILEY, J.G. (1992). The Analytics of Uncertainty and Information. Cambridge University Press.

HOEFFDING, W. (1963). Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 58, 13-30.

KEY, J.T., PERICCHI, L.R. and SMITH, A.F.M. (1999). Bayesian model choice: what and why? In Bayesian Statistics 6, pp. 343-370. J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (Eds.). Oxford University Press.

LINDLEY, D.V. (1985). Making Decisions, 2nd edn. Wiley & Sons.

POLLARD, D. (1984). Convergence of Stochastic Processes. Springer-Verlag.

RAIFFA, H. (1970). Decision Analysis. Addison-Wesley.

RAIFFA, H. and SCHLAIFER, R. (1961). Applied Statistical Decision Theory. Harvard, Boston.

RAFTERY, A.E., MADIGAN, D. and VOLINSKY, C.T. (1996). Accounting for model uncertainty in survival analysis improves predictive performance. In Bayesian Statistics 5, pp. 323-349. J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (Eds.). Oxford University Press.

RUBIN, D.B. (1981). The Bayesian bootstrap. Ann. Statist. 9, 130-134.

SCHWARTZ, L. (1965). On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 4, 10-26.

SMITH, A.F.M. (1984). Bayesian statistics: present position and potential developments: some personal views (with discussion). J. Roy. Statist. Soc. A 147, 245-259.

VON NEUMANN, J. and MORGENSTERN, O. (1947). Theory of Games and Economic Behaviour, 2nd edn. Princeton University Press, Princeton, N.J.

WALKER, S.G., DAMIEN, P., LAUD, P.W. and SMITH, A.F.M. (1999). Bayesian nonparametric inference for random distributions and related functions (with discussion). Journal of the Royal Statistical Society Series B 61, 485-527.

WALKER, S.G. and GUTIÉRREZ-PEÑA, E. (1999). Robustifying Bayesian procedures (with discussion). In Bayesian Statistics 6, pp. 685-710. J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (Eds.). Oxford University Press.

WASSERMAN, L. (1998). Asymptotic properties of nonparametric Bayesian procedures. In Practical Nonparametric and Semiparametric Bayesian Statistics (D. Dey, P. Müller and D. Sinha, eds.), 293-304. Lecture Notes in Statistics, Springer.