RESEARCH SUPPORT
UNIVERSITY OF MICHIGAN BUSINESS SCHOOL
MARCH 1997

BAYESIAN NONPARAMETRIC INFERENCE FOR RANDOM DISTRIBUTIONS AND RELATED FUNCTIONS

WORKING PAPER #9712-OS

BY

STEPHEN WALKER, IMPERIAL COLLEGE, LONDON, UK
PAUL DAMIEN, UNIVERSITY OF MICHIGAN BUSINESS SCHOOL
PURUSHOTTAM W. LAUD, MEDICAL COLLEGE OF WISCONSIN
ADRIAN F. M. SMITH, IMPERIAL COLLEGE, LONDON, UK


BAYESIAN NONPARAMETRIC INFERENCE FOR RANDOM DISTRIBUTIONS AND RELATED FUNCTIONS

Stephen G. Walker, Department of Mathematics, Imperial College, London, UK.
Paul Damien, School of Business, University of Michigan, Ann Arbor, MI, USA.
Purushottam W. Laud, Medical College of Wisconsin, Milwaukee, WI, USA.
Adrian F.M. Smith, Department of Mathematics, Imperial College, London, UK.

Abstract

In recent years, Bayesian nonparametric inference, both theoretical and computational, has witnessed considerable advances. However, these advances have not received a full critical and comparative analysis of their scope, impact and limitations in statistical modelling; many aspects of the theory and methods remain somewhat of a mystery to practitioners, and many open questions remain. In this paper, we discuss and illustrate the rich modelling and analytic possibilities available to the statistician within the Bayesian nonparametric and/or semiparametric framework.

Key words: Levy Processes, Bernoulli Trips, Polya Trees, Polya Urns, Bayesian Bootstraps, Hierarchical Models, Markov Chain Monte Carlo, Latent Variables, Gibbs Sampler.

1 Introduction

1.1 From Bayesian parametrics to nonparametrics

Why nonparametrics? Obviously the answer depends on the particular problem and procedures under consideration, but many, if not most, statisticians appear to feel that it would be desirable in many contexts to make fewer assumptions about the underlying populations from which the data are obtained than are required for a parametric analysis. Bayesian nonparametric models are constructed on 'large' spaces to provide support for more eventualities than are supported by a parametric model. A Bayesian approach to acknowledging uncertainty about a suggested parametric model is to provide a nonparametric model "centered" on the parametric model in some way. For example, to acknowledge uncertainty about the assumption of a normal parametric model, the corresponding nonparametric approach would construct a model centered on the normal assumption but including models in a neighbourhood of this normal family, with the size of the neighbourhood controlled to reflect beliefs in the normality assumption.

Technically, the off-putting aspect (to many) of the Bayesian nonparametric framework is the mathematical apparatus required for specifying distributions on function spaces and for carrying through prior to posterior calculations. A further pragmatic concern is how to incorporate real qualitative prior knowledge into this mathematical framework. A major emphasis of this paper will be an attempt to clarify these issues and provide detailed illustrative analyses: these will demonstrate both the modelling flexibility of this framework and the ease, through tailored simulation methodology, with which prior to posterior analysis can be implemented.

The earliest priors for nonparametric problems seem to have been introduced by Freedman (1963), who introduced tail free and Dirichlet random measures. Subsequently, Dubins and Freedman (1965), Fabius (1964), Freedman (1965) and Ferguson (1973, 1974) formalized and explored the notion of a Dirichlet process. Early work was largely focussed on stylised summary estimates and tests so that comparisons with the corresponding frequentist procedures could be made.

Since Ferguson (1973), the nonparametric Bayesian literature has grown rapidly. The current focus of attention is on full Bayesian analyses of nonparametric models using simulation techniques (apparently first used in this context by Escobar, 1988). In this paper, we shall focus on nonparametric inference for random distributions and related functions. We shall not deal with Bayesian non/semiparametric density estimation; for a recent survey of this field, see Hjort (1996).

1.2 Outline of the paper

The paper is organised as follows. In Section 2.1, the well-known Dirichlet process prior for an unknown distribution function is reviewed and the limitations of this prior are noted. Linear functional estimation is considered in Section 2.2. We assume there is information concerning the functional (specifically, the mean and variance) but that information about the unknown distribution is unavailable and is modelled nonparametrically via the Dirichlet process. We show how to construct the prior for the distribution to incorporate the information on the functional. The extension to the mixture of Dirichlet process (MDP) class of models (essentially a Bayesian nonparametric approach to hierarchical models; see West et al., 1994, and references therein) is discussed in Section 2.3. Other types of prior distributions will be motivated in Section 2.4; detailed descriptions of these latter types of priors will be the focus of Sections 3 (stochastic process priors), 4 (partition model priors) and 5 (exchangeable model priors). In particular, in the context of reliability and failure time data, interest often centres on the hazard rate and/or survival curve of the process under investigation. In Section 3.4 we consider Bayesian nonparametric survival data models, providing estimators which generalise the classical Kaplan and Meier (1958) nonparametric estimator. Also, in Section 4.4 we consider Bayesian semiparametric approaches for the proportional hazards model (Cox, 1972), the accelerated failure time model, and frailty models (Clayton and Cuzick, 1985). In Section 5.4, we consider a three state disease process model.

2 General framework

We assume that Y1, Y2, ..., defined on some space Ω, is a sequence of iid observations from some unknown probability distribution F, assumed to be random and assigned a prior distribution, P_Ω.

In a parametric framework, F is assumed to be characterised by a finite dimensional unknown parameter θ. The prior is then assigned to θ, and we write P_Ω as P_Θ. If we eschew the finite dimensional assumptions we enter the realms of Bayesian nonparametrics. However, if we think of the nonparametric model, P_Ω, as arising from a wish to weaken a posited parametric assumption, P_Θ, we can construct a P_Ω "centered", in some sense, on P_Θ. An important seminal version of a nonparametric prior is the Dirichlet process (Ferguson, 1973), which we now review.

2.1 The Dirichlet process prior

One approach is to define the Dirichlet process via the Dirichlet distribution:

Definition 2.1: Let Ω be a space and B a σ-field of subsets, and let α be a finite non-null measure on (Ω, B). Then a stochastic process F, indexed by elements A of B, is said to be a Dirichlet process on (Ω, B) with parameter α if, for any measurable partition (A1, ..., Ak) of Ω, the random vector (F(A1), ..., F(Ak)) has a Dirichlet distribution with parameter (α(A1), ..., α(Ak)).

We write F ~ D(α) and α(.) = cF0(.), where c > 0 and F0, a probability distribution on Ω, "centres" the process in the sense that E(F) = F0. The scale parameter c is commonly interpreted as controlling the variability of F about F0. Antoniak (1974) considers a larger class of priors based on the Dirichlet process in which priors are assigned to c and the parameters of the parametric distribution F0.

Perhaps the most unsatisfactory aspect of the Dirichlet process is the part played by c. There is actually no clear interpretation for this parameter, due to its dual role, controlling both the smoothness (or discreteness) of the random distributions and the size of the neighbourhood (or variability) of F about F0. To illustrate this, we note that if F ~ D(cF0) then

var F(A) = F0(A)[1 - F0(A)]/(c + 1).

For maximum variability we would want c → 0. However, Sethuraman and Tiwari (1982) point out that as c → 0, F converges in distribution to a single atomic random measure.

Also, note from the expression for the variance of F(A) that it is not possible to specify var F arbitrarily, and that the shape is determined by F0.

A constructive definition of the Dirichlet process is presented in Sethuraman and Tiwari (1982). If F ~ D(cF0) then we can generate an F via

F = Σ_{j=1}^∞ V_j δ_{θ_j},

where V_1 = W_1, V_j = W_j Π_{l<j} (1 - W_l) for j = 2, 3, ..., the W_1, W_2, ... are iid beta(1, c), and θ_1, θ_2, ... are iid from F0. This construction will be useful later. Note that F is discrete and this, combined with the problem of interpreting (and hence specifying) c, makes the Dirichlet process somewhat unsatisfactory.

Bayesian inference via the Dirichlet process is attractively straightforward. Given the data (in the form of an iid sample of exact observations), the posterior is once again a Dirichlet process, so that the latter is a conjugate prior on the space of distribution functions. The prior to posterior parameter updates are c → c + n and F0 → (cF0 + nFn)/(c + n), where Fn is the empirical distribution of the observations. The naive interpretation of c as a prior sample size presumably derives from the forms of these posterior parameters. But does c = 0 correspond to "no information"? If c = 0, note that the Bayes estimate for F, with respect to quadratic loss, is given by Fn, which is what a nonparametric frequentist would typically use as an estimate. Therefore c = 0 fits in with one of the notions of a noninformative prior discussed by Ghosh and Mukerjee (1992). Note, also, that a Dirichlet posterior under a c = 0 specification has the parameter nFn, which is the basis for Rubin's Bayesian bootstrap (Rubin, 1981). An alternative notion considered by Ghosh and Mukerjee is that of "information". Under this notion, c = 0 can definitely not be thought of as providing a "noninformative" prior. As mentioned earlier, as c → 0, F converges to a single atomic measure, which is strong information about the discreteness of F.
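To make the constructive definition and the conjugate update concrete, the following sketch (ours, not part of the original paper) truncates the stick-breaking sum at a finite number of atoms and represents the posterior base measure (cF0 + nFn)/(c + n) as a mixture of F0 and the empirical distribution. Function names and the truncation level are illustrative assumptions.

```python
import numpy as np

def sample_dirichlet_process(c, sample_F0, n_atoms=500, rng=None):
    """One (truncated) draw F = sum_j V_j * delta_{theta_j} from D(c F0)
    via the Sethuraman stick-breaking construction. The small leftover
    mass due to truncation is ignored."""
    rng = np.random.default_rng(rng)
    W = rng.beta(1.0, c, size=n_atoms)                            # W_j ~ beta(1, c)
    V = W * np.cumprod(np.concatenate(([1.0], 1.0 - W[:-1])))     # stick-breaking weights
    theta = sample_F0(n_atoms, rng)                               # theta_j ~ F0
    return V, theta

def posterior_draw(c, sample_F0, data, n_atoms=500, rng=None):
    """Posterior draw: the posterior is D(c F0 + n Fn); its base measure is a
    mixture of F0 (weight c/(c+n)) and the empirical distribution (weight n/(c+n))."""
    data = np.asarray(data)
    n = len(data)
    def sample_base(m, rng):
        from_prior = rng.random(m) < c / (c + n)
        return np.where(from_prior, sample_F0(m, rng), rng.choice(data, size=m))
    return sample_dirichlet_process(c + n, sample_base, n_atoms, rng)

# Example: prior centred on a standard normal, updated with a small data set.
V, theta = posterior_draw(c=5.0,
                          sample_F0=lambda m, rng: rng.standard_normal(m),
                          data=[0.8, 3.1, 5.4, 9.2])
print("posterior mean of F (approx):", np.sum(V * theta))
```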

2.2 Functional estimation

We have seen that it is not possible to express the mean and variance of F arbitrarily using the Dirichlet process, nor is it possible to interpret the parameters satisfactorily if we seek directly to make inference about "the unknown distribution function". However, these difficulties do not arise if interest focuses on inference for a linear functional of F. Let g be a measurable function and, for illustration, consider estimation of φ(F) = ∫ g dF with respect to a quadratic loss function. For example, with g(y) = y^i, φ is the ith moment of F. The Bayes estimate for φ is given (Ferguson, 1973) by

φ̂_n = (c ∫ g dF0 + n ∫ g dFn)/(c + n).

Now φ0 = Eφ(F) = ∫ g dF0, and so φ̂_n can be written more concisely as

φ̂_n = (cφ0 + nφ_n)/(c + n),

where φ_n = ∫ g dFn = n^{-1} Σ_i g(Y_i). Assigning a value to φ0 is typically not problematic since it is the prior guess for φ. In the case where prior information is available, the choice of c can be made by considering φ(F) and λ(F), where

λ(F) = ∫ g² dF - φ²(F).

Note that λ(F) ≥ 0 for all F. Also note that Eλ(F) + var φ(F) = λ(F0), so that if F0 has been assigned then a prior choice for λ can be obtained via the choice of prior variance for φ. We can use the constructive definition of the Dirichlet process to show that Eλ(F)/var φ(F) = c, or, since Eλ(F) + var φ(F) = λ(F0), to show that var φ(F) = λ(F0)/(c + 1). Thus we obtain the following prior specifications for φ in terms of the parameters of the Dirichlet prior:

Eφ = φ(F0),   var φ = λ(F0)/(c + 1).

Being noninformative about φ, by allowing var φ → ∞, is not obtained by allowing c → 0 but rather by allowing λ(F0) = var g(Y0) → ∞, where Y0 ~ F0.

The posterior first two moments are given by E(φ|data) = φ̂_n and

var(φ|data) = (c + n + 1)^{-1} { [c{(c + 1)σ² + φ0²} + Σ_i g²(Y_i)]/(c + n) - φ̂_n² },

where σ² = var φ. If we wish to be "noninformative" by taking σ² → ∞ (with c > 0), then var(φ|data) → ∞, which is not what we want. To obtain a finite posterior variance under this specification (and to be coherent) we require c = 0 and cσ² = 0. Under these conditions we see that var(φ|data) = s²/(n + 1), where s² = n^{-1} Σ_i [g(Y_i) - φ_n]². Note that λ(F) = ∫ [g - φ(F)]² dF, and so s² can be thought of as a sample estimate for λ. Therefore, the posterior distribution for φ, under this limiting form of prior, has mean φ_n and variance s²/(n + 1). As an example, let g(y) = y, so that φ(F) = ∫ y dF represents the mean of F. Under the above prior specifications we can obtain an approximate posterior distribution for φ given by the normal distribution with mean Ȳ and variance n^{-1} Σ_i [Y_i - Ȳ]²/(n + 1).

2.3 Mixture of Dirichlet process model

The Dirichlet process is one of the most "user-friendly" priors in Bayesian nonparametric inference and has received substantial coverage in the literature. However, as mentioned earlier, one problem with the Dirichlet process is that it assigns probability one to the space of discrete probability measures. A class of priors that chooses a continuous F with probability one is the Mixture of Dirichlet Process (MDP) model, which we now briefly discuss.

Recently, mixture models have found widespread application: see, for example, Escobar (1994); Escobar and West (1995); West et al. (1994); Mueller et al. (1996); MacEachern and Mueller (1994); Bush and MacEachern (1996). These are essentially Bayesian hierarchical models, one of the simplest versions taking the form:

Y_i | θ_i ~ f(.|θ_i), i = 1, ..., n,

θ_1, ..., θ_n | F ~ iid F,

and F ~ D(cF0). Instead of the θ_i being assumed to be iid from some parametric distribution (as with standard Bayesian hierarchical models), greater flexibility is allowed via the introduction of the Dirichlet prior centered on a parametric distribution. In the above referenced papers, priors are also assigned to c and the parameters of F0. MDP models have dominated the Bayesian nonparametric literature recently as a consequence of the realisation that full posterior computation is feasible using simulation methods (Escobar, 1994), although the latter can be very computer intensive and involve non-trivial sampling algorithms (particularly when f(.|θ) and F0(θ) form a nonconjugate pair).

The MDP model provides a continuous nonparametric prior for the distribution of the Y_i. Given F, we have

Y_i | F ~ Σ_{j=1}^∞ V_j f(.|θ_j),

based on the constructive definition given in Section 2.1. This mixture model has been successfully exploited by Escobar and West (1995). Note, however, that centering Σ_j V_j f(.|θ_j) will not be easy.

2.4 Beyond the Dirichlet process

As was noted in Section 2.1, there are limitations with the Dirichlet process when it comes to prior specifications and their interpretation. In the rest of the paper, we will focus on generalisations of the Dirichlet prior which overcome these difficulties.

THEOREM 1 (Doksum, 1974; Dalal, 1978). Let (Ω, B) be a measurable space and let a system of finite dimensional distributions (F(B_{1,1}), ..., F(B_{m,k})) be given for each finite class (B_{1,1}, ..., B_{m,k}) of pairwise disjoint sets from B. If

1. F(B) is a random variable on (0, 1) for all B ∈ B;

2. F(Ω) = 1 a.s.;

3. (F(∪_i B_{1,i}), ..., F(∪_i B_{m,i})) =_d (Σ_i F(B_{1,i}), ..., Σ_i F(B_{m,i})) (here =_d denotes equality in distribution);

then there exists a probability measure P_Ω on the space of probability measures on (Ω, B) yielding these finite dimensional distributions.

There are a number of ways of constructing a nonparametric prior to meet the requirements of Theorem 1:

1) Stochastic processes. This approach is particularly appropriate for generating random cdfs on (0, ∞), with application in survival data models. An important and rich class of priors is the neutral to the right process (Doksum, 1974). Briefly, we have F(t) = 1 - exp(-Z(t)), where Z is an independent increments or Levy process on (0, ∞), with Z(0) = 0 and lim_{t→∞} Z(t) = ∞. We shall illustrate this approach with the analysis of the well known Kaplan and Meier (1958) data set.

2) Partitioning. Here we construct a binary tree partition of Ω, denoted by Π = {B_ε}, where ε is a binary sequence which 'places' B_ε in the tree. At level 1 in the partitioning process, we have sets B0 and B1 such that B0 ∩ B1 = ∅ and B0 ∪ B1 = Ω. Then, at level 2, B0 'splits' into B00 and B01, and so on. A probability distribution is assigned to {F(B_ε)} such that, for all ε, F(B_ε0) + F(B_ε1) = F(B_ε) ≥ 0 and F(Ω) = 1. This is the idea behind Polya trees (Ferguson, 1974; Lavine, 1992, 1994; Mauldin et al., 1992). Such priors seem particularly appropriate for error models, either at the first or second stage in a hierarchical model. Applications considered later in this paper include generalised linear mixed models, accelerated failure time and frailty models.

3) Exchangeability. Rather than constructing F directly, as in 1) and 2) above, here we rely on the Representation Theorem (de Finetti, 1937) for a sequence of exchangeable random variables defined on Ω. Such an approach seems particularly appropriate when the problem is one of prediction; that is, in providing the distribution of Y_{n+1} given Y_1, ..., Y_n. Applications considered in this paper include modelling a multiple state disease process.

Each of these approaches will now be considered separately in detail (although they are by no means mutually exclusive: for example, the Dirichlet process has a representation under all three approaches).

3 Stochastic processes

3.1 Neutral to the right processes

We begin by discussing neutral to the right (henceforward, NTR) processes. Many well-known processes, such as the gamma and simple homogeneous processes (Ferguson and Phadia, 1979), and the Dirichlet process (Ferguson, 1973, 1974), belong to this class. More recently, an NTR process called the beta-Stacy process was developed by Walker and Muliere (1997). Detailed background to the following discussion can be found in Levy (1936), Ferguson (1973, 1974), Doksum (1974), and Ferguson and Phadia (1979).

Definition 3.1: A non-decreasing almost surely (a.s.), right continuous a.s., process Z(t), with independent increments, is called an NTR Levy process if it satisfies: 1) Z(0) = 0 a.s.; and 2) lim_{t→∞} Z(t) = ∞ a.s.

Fact 3.1. Z(t) has at most countably many fixed points of discontinuity.

Fact 3.2. Let t1, t2, ... correspond to these fixed points of discontinuity, having independent jumps W1, W2, .... The difference Z_c(t) = Z(t) - Σ_j W_j I_{[t_j,∞)}(t), where I(.) is the indicator function, is a non-decreasing, independent increments process without fixed points of discontinuity and is a pure Levy process. Hence, every NTR process can be written as the sum of a jump component and a continuous component. This will be useful when we later address the problem of generating random variates from an NTR process.

Fact 3.3. The Levy formula for the log of the Laplace transform of Z_c(t) has a representation given by

log E exp(-qZ_c(t)) = -qb(t) + ∫_0^∞ (e^{-qv} - 1) dN_t(v),

where N_t is a continuous Levy measure. The "location" function b(.) is not required for the following discussion: see Ferguson (1974) for details.
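Before proceeding, the following minimal numerical sketch (ours, with purely illustrative parameter choices and function names) shows how a random distribution function F(t) = 1 - exp(-Z(t)) arises from an independent increments process Z. Here Z is crudely approximated by summing independent gamma increments over a finite grid, so the result only approximates a draw from an NTR-type prior; in particular, the finite grid truncates the requirement lim Z(t) = ∞.

```python
import numpy as np

def sample_ntr_cdf(increment_shapes, seed=None):
    """Draw one random cdf F(t) = 1 - exp(-Z(t)) on a grid, approximating the
    independent-increments process Z by independent gamma increments
    (an illustration, not an exact simulation of any particular Levy process)."""
    rng = np.random.default_rng(seed)
    dZ = rng.gamma(shape=increment_shapes, scale=1.0)   # independent, mostly tiny, occasionally large
    Z = np.cumsum(dZ)                                   # non-decreasing sample path of Z
    return 1.0 - np.exp(-Z)

# Grid on (0, 20] with E[Z(t)] growing roughly like 0.1 * t (illustrative choice).
grid = np.linspace(0.0, 20.0, 401)[1:]
shapes = np.full(grid.shape, 0.1 * (20.0 / 400))
for _ in range(3):
    F = sample_ntr_cdf(shapes)
    print("F(10) =", round(float(F[grid <= 10.0][-1]), 3))
```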

Definition 3.2: A random distribution function F(t) on the real line is NTR if it can be expressed as F(t) = 1 - exp(-Z(t)), where Z(t) is an NTR Levy process.

We will concentrate on the beta-Stacy process (Walker and Muliere, 1997). Let α be a continuous measure and β a positive function. Then F is a beta-Stacy process with parameters α and β if the Levy measure is given by

dN_t(v) = dv/(1 - e^{-v}) ∫_0^t exp(-vβ(s)) dα(s).

It can be shown that F is a.s. a random probability measure under the condition ∫ dα(s)/β(s) = +∞. The beta-Stacy process generalises the Dirichlet process, which is obtained when α is a finite measure and β(s) = α(s, ∞). The simple homogeneous process (Ferguson and Phadia, 1979) results when β is constant. In Section 3.2, we discuss the choice of α and β to specify EF and var F. In the rest of this section, the NTR process priors are discussed with reference to the beta-Stacy process, since many specific forms of interest are special cases of this prior. It is also closely related to the beta process prior, for modelling cumulative hazard functions, introduced by Hjort (1990).

3.2 Prior specifications

Ferguson and Phadia (1979) point out that for the NTR processes which they considered, such as the gamma, simple homogeneous, and Dirichlet processes, interpreting the prior parameters is quite difficult. Walker and Damien (1996) provide a way of specifying the mean and variance of the distribution function based on the beta-Stacy process. This method has the merit that the practitioner can model the prior mean and variance via a Bayesian parametric model. Let

γ(t) = -log{ES(t)} = ∫_0^∞ (1 - e^{-v}) dN_t(v)

and

λ(t) = -log{E[S²(t)]} = ∫_0^∞ (1 - e^{-2v}) dN_t(v),

where S(t) = 1 - F(t). Recalling the Levy measure for the beta-Stacy process (we assume for simplicity that there are no fixed points of discontinuity in the prior process), Walker and Damien show that there exist α(.) and β(.) which provide an explicit solution satisfying the above two conditions for arbitrary γ and λ satisfying γ < λ < 2γ, which corresponds to [ES]² < E[S²] < ES. Explicitly, we obtain

dα(t) = dγ(t)β(t)   and   dλ(t)/dγ(t) = 2 - (1 + β(t))^{-1}.

Suppose, for example, we wish to centre, up to and including second moments, the nonparametric model on the parametric Bayesian model given by S(t) = exp(-at) with a ~ gamma(p, q). Then we would have γ(t) = p log(1 + t/q) and λ(t) = p log(1 + 2t/q), giving β(t) = q/(2t) and dα(t) = pq dt/[2t(q + t)].

This method of specifying the prior mean and variance of the distribution function overcomes the difficulties in interpretation identified by Ferguson and Phadia (1979). In the absence of alternative strong prior information, this provides a flexible form of prior specification. We can specify a p and q to reflect beliefs concerning the "likely" position of S; that is, a region of high probability in which S is thought most likely to be. The unrestricted nature of the prior will then allow S to "find" its correct shape within this specified region, given sufficient data.

3.3 Posterior distributions

THEOREM 2 (Ferguson, 1974). If F is NTR and Y1, ..., Yn is a sample from F, including the possibility of right censored samples, then the posterior distribution of F is NTR.

The prior distribution for Z(t) (the Levy process) is characterised by M = {t1, t2, ...}, the set of fixed points of discontinuity, with corresponding densities {f_{t1}, f_{t2}, ...} for the jump components, and N_t(.), the Levy measure for the continuous component of Z(t). We now give the characterisation of the posterior distribution for a single observation Y. (The case for n observations can be obtained by repeated application.)

In the following, we assume the Levy measure to be of the type dN_t(v) = dv ∫_0^t K(v, s) ds, which includes the beta-Stacy process. The next theorem provides the complete posterior characterisation for this class of NTR processes.

THEOREM 3 (Ferguson, 1974; Ferguson and Phadia, 1979). Let F be NTR and let Y be a random sample from F.

i) Given Y > y, the posterior parameters (which we denote by an asterisk) are M* = M,

f*_{tj}(v) = κ e^{-v} f_{tj}(v) if tj ≤ y,   f*_{tj}(v) = f_{tj}(v) if tj > y,

and K*(v, s) = exp{-vI(y > s)} K(v, s), where I(.) is the indicator function and κ denotes the appropriate normalising constant.

ii) Given Y = y ∈ M, the posterior parameters are M* = M,

f*_{tj}(v) = κ e^{-v} f_{tj}(v) if tj < y,   f*_{tj}(v) = κ (1 - e^{-v}) f_{tj}(v) if tj = y,   f*_{tj}(v) = f_{tj}(v) if tj > y,

and, again, K*(v, s) = exp{-vI(y > s)} K(v, s).

iii) Given Y = y ∉ M, the posterior parameters are M* = M ∪ {y}, with

f_y(v) = κ (1 - e^{-v}) K(v, y),

f*_{tj}(v) = κ e^{-v} f_{tj}(v) if tj < y,   f*_{tj}(v) = f_{tj}(v) if tj > y,

and, again, K*(v, s) = exp{-vI(y > s)} K(v, s).

Consequently, if F is a beta-Stacy process with parameters α and β then, given an iid sample from F, with possible right censoring, the Bayes estimate of F(t), under a quadratic loss function, is given by

F̂(t) = 1 - ∏_{[0,t]} {1 - (dα(s) + dN(s))/(β(s) + M(s))},

where N(t) = Σ_i I(Y_i ≤ t, δ_i = 1) counts the uncensored observations up to time t, M(t) = Σ_i I(Y_i ≥ t) is the number at risk at t, and ∏_{[0,t]} represents a product integral (Gill and Johansen, 1990). The Kaplan-Meier estimate is obtained as α, β → 0, which is also the basis for both the censored data Bayesian bootstrap (Lo, 1993) and the finite population censored data Bayesian bootstrap (Muliere and Walker, 1997b).

Fact 3.4. The Dirichlet process is not conjugate with respect to right censored data. The beta-Stacy process is conjugate with respect to right censored data. If the prior is a Dirichlet process, then the posterior, given censored data, is a beta-Stacy process.

Fact 3.5. After a suitable transformation and selection of K, other processes such as the extended gamma (Dykstra and Laud, 1981) and the beta process (Hjort, 1990) can be obtained (for details, see Walker and Muliere, 1997).

The remaining key question is whether prior to posterior calculations for these models are computationally feasible. Below we describe a general algorithm which enables us to simulate the posterior beta-Stacy process and thus to perform fully Bayesian nonparametric calculations.

Simulating an NTR process.

Recall that any NTR process Z(t) can be written as the sum of a jump random variable, say W, and a continuous component Z_c(t). From a simulation perspective, given the posterior process, it is sufficient to generate random variates from these two components separately and independently.

Simulating the jump component. With respect to a beta-Stacy process, without jumps a priori, let W denote the posterior jump random variable at y with density f_y(w). Then, given M* = {Y_i : δ_i = 1}, where δ_i = 1 indicates that Y_i is an uncensored observation,

f_y(w) ∝ (1 - exp(-w))^{N{y}-1} exp(-w[β(y) + M(y) - N{y}]),

where y ∈ M*, N{y} = Σ_i I(Y_i = y, δ_i = 1) and M(y) = Σ_i I(Y_i ≥ y). Also, W =_d -log(1 - B), where B ~ beta(N{y}, β(y) + M(y) - N{y}).
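As an illustration of the jump-component simulation just described (a sketch of ours, not the authors' code; the function name and arguments are assumptions), the posterior jump W at an uncensored observation can be drawn directly through the beta representation W = -log(1 - B).

```python
import numpy as np

def sample_posterior_jump(beta_y, M_y, N_y, size=1, seed=None):
    """Sample the posterior jump W of a beta-Stacy process at an uncensored
    observation y: W = -log(1 - B), B ~ beta(N{y}, beta(y) + M(y) - N{y})."""
    rng = np.random.default_rng(seed)
    B = rng.beta(N_y, beta_y + M_y - N_y, size=size)
    return -np.log1p(-B)                      # -log(1 - B)

# With the prior and data of the example in Section 3.4 below, the jump at y = 0.8 has
# beta(0.8) = exp(-0.08), M(0.8) = 8 and N{0.8} = 1, so W is exponential with mean
# 1/(exp(-0.08) + 7), as noted there.
w = sample_posterior_jump(beta_y=np.exp(-0.08), M_y=8, N_y=1, size=10000, seed=1)
print("sample mean of W:", w.mean(), " theory:", 1.0 / (np.exp(-0.08) + 7.0))
```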

If N{y} = 1 then W has an exponential density with mean value 1/(β(y) + M(y) - 1).

Simulating the continuous component of an NTR process. It is well known (Ferguson, 1974; Damien et al., 1995) that the continuous component will have a distribution that is infinitely divisible (id). Bondesson (1982), Damien et al. (1995) and Walker and Damien (1996) have developed algorithms to generate random variates from any id distribution. Here we note that simulating the continuous component is straightforward regardless of which algorithm one decides to implement. However, the particular choice of the algorithm might depend on the posterior process under consideration. Thus, Laud et al. (1996a) use the Bondesson algorithm to simulate the extended gamma process; Damien et al.'s algorithm is exemplified for the Dirichlet, gamma and simple homogeneous processes; and Walker and Damien (1996) provide a full Bayesian analysis for a large class of NTR processes using a hybrid of algorithms. For the illustrative analyses that involve NTR processes, we will rely on the Walker and Damien method.

3.4 Example

We reanalyse the Kaplan-Meier (1958) data set, partly for its historical significance, but mainly because it has been studied extensively in recent Bayesian literature and thus provides a basis for comparing different methods and models. The data consist of exact observed failures at 0.8, 3.1, 5.4, 9.2 months, and censored observations at 1.0, 2.7, 7.0, 12.1 months. We address the problem of estimating the probability of failure before 1 month; that is, F(0, 1). Whereas Susarla and Van Ryzin (1976) and Ferguson and Phadia (1979) were only able to obtain Bayesian point estimates, we are able to sample from the full posterior distribution. Note, also, that we are able to sample from the posterior distribution of F(0, t) for any t and are therefore able to construct a full picture of the posterior failure time distribution.

We follow Ferguson and Phadia (1979) and, within the beta-Stacy framework, take β(s) = exp(-0.1s) and dα(s) = 0.1 exp(-0.1s) ds. The prior is therefore a Dirichlet process but, with censored observations in the data set, the posterior is not Dirichlet but a beta-Stacy process. We take M = ∅ a priori; i.e., there are no jumps in the prior process.

The continuous component of the posterior process, Z*_c(t), has Levy measure given by

dN*_t(v) = dv/(1 - exp(-v)) ∫_0^t exp(-v[β(s) + M(s)]) dα(s).

We consider sampling F(0, 1) from the posterior distribution. This involves sampling Z*_c(0, 0.8) and Z*_c[0.8, 1), which is achieved using the algorithm described in Walker and Damien (1996), and sampling W_{0.8} from the density f_{0.8}(.), which is the exponential density with mean [exp(-0.08) + 7]^{-1}. A required sample from the posterior distribution of F(0, 1) is then given by 1 - exp{-Z*_c(0, 0.8) - Z*_c[0.8, 1) - W_{0.8}}. We collected 1000 samples from the posterior and the resulting histogram representation with kernel density estimate is given in Figure 1. The mean value is given by 0.12, which is the (exact) point estimate value obtained by Ferguson and Phadia.

It can be argued that this prior seems somewhat informative. Can we recapture the shape of Figure 1 using the flexible, less informative prior we proposed in Section 3.2? To investigate this we reanalyse the data set using the Bayesian parametric model described in Section 3.2 with p = q = 1, in an attempt to be "relatively noninformative". We obtain β(t) = 1/(2t) and dα(t) = dt/[2t(1 + t)]. Again, we collected 1000 samples from the posterior and the resulting histogram representation with kernel density estimate is given in Figure 2. Note that we have recovered the shape of Figure 1 extremely well.

It is also of interest to see how our nonparametric analysis compares with a parametric analysis using the parametric model on which it is centered. The posterior distribution from the parametric model is given by F*(0, 1) = 1 - exp(-a) with a ~ gamma(1 + 4, 1 + 41.3). 1000 samples from this posterior were collected and the resulting histogram representation with kernel density estimate is given in Figure 3. The distributions are fundamentally different. So what does all this add up to? With the parametric model, the first two moments define the shape of the posterior distribution. In the nonparametric model, the first two moments do not define the shape: there is still more flexibility in the model. For a nonparametric/nonparametric comparison, we note that our less informative nonparametric prior leads to essentially the same result as the informative nonparametric prior, which is very encouraging: we do not need to select a fully specified distribution on which to centre the prior.
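As a check on this example (ours, not the Walker and Damien sampler used for Figures 1 and 2), the posterior mean of F(0, 1) can be approximated directly from the product-integral Bayes estimate of Section 3.3, and the parametric posterior used for Figure 3 can be sampled in a few lines. Grid size and function names are illustrative assumptions.

```python
import numpy as np

deaths = np.array([0.8, 3.1, 5.4, 9.2])     # exact failure times (months)
cens = np.array([1.0, 2.7, 7.0, 12.1])      # right censored times
obs = np.concatenate([deaths, cens])

def bayes_estimate_F(t, grid_size=20000):
    """Posterior mean of F(0, t) under beta(s) = exp(-0.1 s), dalpha(s) = 0.1 exp(-0.1 s) ds,
    via the product integral 1 - prod_s {1 - (dalpha(s) + dN(s)) / (beta(s) + M(s))}."""
    s = np.linspace(0.0, t, grid_size + 1)[1:]             # right endpoints of grid cells
    ds = t / grid_size
    beta = np.exp(-0.1 * s)
    dalpha = 0.1 * np.exp(-0.1 * s) * ds
    M = (obs[None, :] >= s[:, None]).sum(axis=1)           # number at risk at s
    dN = np.zeros_like(s)                                  # uncensored deaths per cell
    for d in deaths[deaths <= t]:
        dN[min(np.searchsorted(s, d), grid_size - 1)] += 1.0
    return 1.0 - np.prod(1.0 - (dalpha + dN) / (beta + M))

print("posterior mean of F(0,1):", round(bayes_estimate_F(1.0), 3))   # approximately 0.12

# Parametric comparison (Figure 3): F*(0,1) = 1 - exp(-a), a ~ gamma(1 + 4, 1 + 41.3).
a = np.random.default_rng(0).gamma(shape=5.0, scale=1.0 / 42.3, size=1000)
print("parametric posterior mean of F*(0,1):", round(float(np.mean(1.0 - np.exp(-a))), 3))
```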

Figure 1: Histogram representation with kernel density estimate of posterior density of F(0,1) using Dirichlet process prior.

Figure 2: Histogram representation with kernel density estimate of posterior density of F(0,1) using beta-Stacy process prior.

Figure 3: Histogram representation with kernel density estimate of posterior density of F(0,1) using parametric prior.

What else can be done with the stochastic process approach? Kalbfleisch (1978), Clayton (1991) and Laud et al. (1996b) provide examples of the use of stochastic processes in the context of Cox regression. It is also possible to use such processes to model functions other than a distribution function. Hjort (1990) developed the beta process to model a cumulative hazard function. Simulation algorithms for carrying out prior to posterior analysis for the beta process appear in Damien et al. (1996). Dykstra and Laud (1981) consider modelling monotone hazard rates nonparametrically by developing a class of processes called the extended gamma process. The advantage of this process is that it indexes the class of absolutely continuous functions with probability one. Also, as in the case of cumulative hazard processes, context motivated information can be used to specify the parameters of the hazard rate model, thus offering greater flexibility. Laud et al. (1993, 1996a) develop simulation methods for the extended gamma process; Amman (1984) extended the hazard rate process to model bath-tub hazard rates. Arjas and Gasbarra (1994) develop processes to model the hazard rate piecewise.

However, in practice the stochastic process approach is only user-friendly for relatively simple models of the kind we have illustrated. Inference for more complex models usually requires us to make some partitioning of the sample space, subsequently working with a discrete version of the process.

But this suggests that we should construct the prior on a partitioned space in the first place, and motivates the approach considered in the next section.

4 Partitioning

4.1 Polya tree priors

Detailed background to the material of this section can be found in Ferguson (1974), Lavine (1992, 1994), Mauldin et al. (1992), and Muliere and Walker (1997a). The Polya tree prior relies on a binary tree partitioning of the space Ω. There are two aspects to a Polya tree: a binary tree partition of Ω and a nonnegative parameter associated with each set in the binary partition.

The binary tree partition is given by Π = {B_ε}, where ε is a binary sequence which 'places' the set B_ε in the tree. We denote the sets at level 1 by (B0, B1), a measurable partition of Ω; we denote by (B00, B01) the 'offspring' of B0, so that B00, B01, B10, B11 denote the sets at level 2, and so on. The number of partitions at the mth level is 2^m. In general, B_ε splits into B_ε0 and B_ε1, where B_ε0 ∩ B_ε1 = ∅ and B_ε0 ∪ B_ε1 = B_ε.

A helpful image is that of a particle cascading through these partitions. It starts in Ω and moves into B0 with probability C_0, or into B1 with probability 1 - C_0. In general, on entering B_ε the particle could either move into B_ε0 or into B_ε1. Let it move into the former with probability C_ε0 and into the latter with probability C_ε1 = 1 - C_ε0. For Polya trees, these probabilities are random beta variables, (C_ε0, C_ε1) ~ beta(α_ε0, α_ε1) with non-negative α_ε0 and α_ε1. If we denote the collection of αs by A = {α_ε}, a particular Polya tree distribution is completely defined by Π and A.

Definition 4.1 (Lavine, 1992): A random probability measure F on Ω is said to have a Polya tree distribution, or a Polya tree prior, with parameters (Π, A), written F ~ PT(Π, A), if there exist non-negative numbers A = (α_0, α_1, α_00, α_01, ...) and random variables C = (C_0, C_00, C_10, ...) such that the following hold:

i) all the random variables in C are independent;

ii) for every ε, C_ε0 ~ beta(α_ε0, α_ε1); and

iii) for every m = 1, 2, ... and every ε = ε_1...ε_m,

F(B_{ε_1...ε_m}) = ( Π_{j: ε_j = 0} C_{ε_1...ε_{j-1}0} ) ( Π_{j: ε_j = 1} (1 - C_{ε_1...ε_{j-1}0}) ),

where the first terms (i.e., for j = 1) are interpreted as C_0 and 1 - C_0.

Fact 4.1. The Polya tree class indexes priors which assign probability 1 to the set of continuous distributions, unlike the Dirichlet process, whose sample distribution functions are discrete with probability 1. For example, the choice α_ε = m², whenever ε defines a set at level m, leads to an F which is absolutely continuous (Ferguson, 1974).

Fact 4.2. It is easy to show that the discrete versions of the beta process (Hjort, 1990), the beta-Stacy process, and hence the Dirichlet process, can all be characterised as Polya trees; see, for example, Muliere and Walker (1997a).

4.2 Prior specifications and computational issues

Problems tackled in this paper involving Polya trees require simulating a random probability measure F ~ PT(Π, A). This is done by sampling C using the constructive form given in Definition 4.1. Since C is an infinite set, an approximate probability measure from PT(Π, A) is sampled by terminating the process at a finite level M. Let this finite set be denoted by C_M and denote by F_M the resulting random measure constructed to level M (which Lavine, 1992, refers to as a 'partially specified Polya tree'). From the sampled variates of C_M we define F_M via F(B_{ε_1...ε_M}) for each ε = ε_1...ε_M according to (iii) under Definition 4.1. So, for example, if M = 8, we have a random distribution which assigns random mass to 2^8 = 256 sets.

It is possible to centre the Polya tree prior on a particular probability measure F0 on Ω by taking the partitions to coincide with percentiles of F0 and then taking α_ε0 = α_ε1 for each ε. This involves setting B0 = (-∞, F0^{-1}(1/2)), B1 = [F0^{-1}(1/2), ∞) and, at level m, setting, for j = 1, ..., 2^m, B_j = [F0^{-1}((j - 1)/2^m), F0^{-1}(j/2^m)), with F0^{-1}(0) = -∞ and F0^{-1}(1) = +∞, where (B_j, j = 1, ..., 2^m) correspond, in order, to the 2^m partitions of level m. It is then straightforward to show that EF(B_ε) = F0(B_ε) for all ε.
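A minimal sketch (ours, not the authors' code) of sampling a partially specified Polya tree centred on F0 in the way just described, with α_ε = cm² at level m; the function name, the choice of F0 as a standard normal and the truncation level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def sample_polya_tree(M=8, c=1.0, F0_ppf=norm.ppf, seed=None):
    """Sample a partially specified Polya tree to level M, centred on F0 by placing
    the level-m partitions at the quantiles of F0 and taking alpha = c * m^2 at level m.
    Returns the 2^M + 1 partition endpoints and one draw of the masses F(B_eps)."""
    rng = np.random.default_rng(seed)
    mass = np.ones(1)                               # level 0: all mass in Omega
    for m in range(1, M + 1):
        a = c * m ** 2                              # alpha_eps0 = alpha_eps1 = c m^2
        C = rng.beta(a, a, size=mass.size)          # random split proportions C_eps0
        mass = np.column_stack((mass * C, mass * (1.0 - C))).ravel()
    edges = F0_ppf(np.linspace(0.0, 1.0, 2 ** M + 1))   # B_j = [F0^{-1}((j-1)/2^M), F0^{-1}(j/2^M))
    return edges, mass

edges, mass = sample_polya_tree(M=8, c=1.0, seed=0)
print("total mass:", mass.sum())                    # equals 1 by construction
print("one draw of F(B_1):", mass[0], "(prior expectation is", 1 / 2 ** 8, ")")
```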

In practice, we may not wish to assign a separate α_ε for each ε. It may be convenient, therefore, to take α_ε = c_m whenever ε defines a set at level m. For the top levels (m small) it is not necessary for F(B_ε0) and F(B_ε1) to be 'close'; on the contrary, a large amount of variability is desirable. However, as we move down the levels (m large) we will increasingly wish F(B_ε0) and F(B_ε1) to be close, if we believe in the underlying continuity of F. This can be achieved by allowing c_m to be small for small m and allowing c_m to increase as m increases; choosing, for example, c_m = cm² for some c > 0. According to Ferguson (1974), c_m = m² implies that F is absolutely continuous with probability 1 and therefore, according to Lavine (1992), this "would often be a sensible canonical choice". In what follows, we shall choose cm² for the αs. Note that the Dirichlet process arises when c_m = c/2^m, which means that c_m → 0 as m → ∞ (the wrong direction as far as the continuity of F is concerned) and F is discrete with probability 1 (Blackwell, 1973). This model can be extended by assigning a prior to c, but in the applications which follow we shall confine ourselves to providing illustrative analyses corresponding to a range of specified choices of c.

An alternative idea is to understand and assign the c_m in terms of the variance of the probabilities associated with the sets on level m. These variances are all equal to

(1/2^m) Π_{k=1}^m [(c_k + 1)/(2c_k + 1)] - 1/2^{2m},

which gives a procedure for assigning the c_m based on uncertainty in the centering of F(B_ε). For example, if we want var F(B_ε) = v_m whenever ε defines a set at level m, then we need

(c_m + 1)/(2c_m + 1) = (4^m v_m + 1)/(2(4^{m-1} v_{m-1} + 1)).

Note that this imposes a constraint on the v_m given by v_{m-1}/4 < v_m < v_{m-1}/2 + 1/4^m. More generally, we could define the α_ε to match E_{PT} F(B_ε) and E_{PT}[F²(B_ε)] with those obtained from a parametric model. This procedure will be detailed elsewhere.
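The variance-matching recursion above is easy to implement; the following sketch (ours, with an illustrative variance schedule) solves for c_m and checks the feasibility constraint.

```python
def c_from_variances(v, v0=0.0):
    """Given target variances v[m-1] = var F(B_eps) for levels m = 1, 2, ..., solve
    (c_m + 1)/(2 c_m + 1) = (4^m v_m + 1) / (2 (4^{m-1} v_{m-1} + 1)) for c_m,
    checking the constraint v_{m-1}/4 < v_m < v_{m-1}/2 + 1/4^m."""
    cs, prev = [], v0                      # level 0: F(Omega) = 1, so variance 0
    for m, vm in enumerate(v, start=1):
        assert prev / 4 < vm < prev / 2 + 4.0 ** (-m), "infeasible variance at level %d" % m
        r = (4.0 ** m * vm + 1.0) / (2.0 * (4.0 ** (m - 1) * prev + 1.0))  # = (c+1)/(2c+1)
        cs.append((1.0 - r) / (2.0 * r - 1.0))
        prev = vm
    return cs

# Example: a variance schedule that roughly halves at each level (illustrative numbers).
print(c_from_variances([0.04, 0.018, 0.008]))
```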

4.3 Posterior distributions

Consider a Polya tree prior PT(Π, A). Following Lavine (1992), given an observation Y1, the posterior Polya tree distribution is easily obtained. Write (F|Y1) ~ PT(Π, A|Y1), with (A|Y1) given by

α_ε|Y1 = α_ε + 1 if Y1 ∈ B_ε,   and   α_ε|Y1 = α_ε otherwise.

If Y1 is observed exactly, then an α needs to be updated at each level, whereas in the case of censored data (in one of the sets B_ε), only a finite number require to be updated. For n observations, let Y = (Y1, ..., Yn), with (A|Y) given by (α_ε|Y) = α_ε + n_ε, where n_ε is the number of observations in B_ε.

Let q_ε = P(Y_{n+1} ∈ B_ε|Y), for some ε, denote the posterior predictive distribution, and let ε = ε_1...ε_m; then, in the absence of censoring,

q_ε = (α_{ε_1} + n_{ε_1})/(α_0 + α_1 + n) × (α_{ε_1ε_2} + n_{ε_1ε_2})/(α_{ε_10} + α_{ε_11} + n_{ε_1}) × ... × (α_{ε_1...ε_m} + n_{ε_1...ε_m})/(α_{ε_1...ε_{m-1}0} + α_{ε_1...ε_{m-1}1} + n_{ε_1...ε_{m-1}}).

For censored data, the same expression applies with each denominator α_{ε_1...ε_{j-1}0} + α_{ε_1...ε_{j-1}1} + n_{ε_1...ε_{j-1}} replaced by α_{ε_1...ε_{j-1}0} + α_{ε_1...ε_{j-1}1} + n_{ε_1...ε_{j-1}} - s_{ε_1...ε_{j-1}}, where s_ε is the number of observations censored in B_ε.

4.4 Examples

We first re-analyse the Kaplan-Meier data (given in Section 3.4) using a Polya tree prior, providing a comparison with the NTR approach. The second example involves a linear regression model in the context of accelerated failure time data. The third example involves a generalised hierarchical model involving binomial data in a two-way factorial layout.

The Kaplan-Meier data.

The NTR approach to the analysis of survival data is mathematically involved. Polya trees can simplify this complexity a great deal. However, what should the partitions be? Recall that with the NTR analysis in Section 3.4 the observations effectively partitioned the time axis, with each partition having to be treated separately (and independently), all the different parts subsequently being "put together".

It seems obvious that the partitions of the Polya tree should coincide with at least some of the observations. In fact, the only way to make progress is to have the censoring sets, e.g. [1.0, ∞), coinciding with partition sets. Consider, therefore, the following partitions: B0 = [0, 1.0), B1 = [1.0, ∞), B10 = [1.0, 2.7), B11 = [2.7, ∞), B110 = [2.7, 7.0), B111 = [7.0, ∞), and B1110 = [7.0, 12.1), B1111 = [12.1, ∞). The inclusion of these partitions is compulsory; the choice of the remainder is somewhat arbitrary. A particular partition could therefore be selected in the light of what specific questions needed to be answered; for example, for a future observation, Y_{n+1}, what is P(Y_{n+1} > 6.0|data)? This can be answered by partitioning B110 appropriately. Thus an estimate of the survival curve S(t) = 1 - F(t) can be found for any t ∈ (0, ∞), and its computation for a range of ts will be sufficient to provide a clear picture of S.

Susarla and Van Ryzin (1976) and Ferguson and Phadia (1979) assume a Dirichlet process model with an exponential base distribution, with parameter 0.12, that is, G(B) = ∫_B 0.12 exp(-0.12z) dz, and scale parameter equal to 8 (recall Section 2.1). For comparison with their results, we set α_ε = c_m G(B_ε) whenever ε defines a set on level m, and took c_m = cm² for a number of c > 0. The estimates for various q_ε = P(Y_{n+1} ∈ B_ε|data) over a selection of values for c, the Kaplan-Meier (KM) estimates, and the Susarla and Van Ryzin (SV) estimates are reported in Table 1.

The estimates obtained from the Polya tree are exact and do not depend on a finite level of partitions. This is because in the absence of observations from a particular sub-branch of the tree there is no gain to be made by partitioning sets. For example, suppose we wished to estimate P(Y_{n+1} > 6.0|data). We would partition the interval [5.4, 7.0) at 6.0 and, because there are no observations in (5.4, 7.0), no matter how we partitioned this interval (although at some level there would need to be a partition at 6.0), and to whatever levels we took these partitions, it would make no difference to the estimate of P(Y_{n+1} > 6.0|data).

Interval       Polya tree    Polya tree posterior          KM          SV
               prior         c=10     c=1      c=0.01      (c -> 0)
[0, 0.8)       0.09          0.08     0.05     0           0           0.05
[0.8, 1.0)     0.02          0.03     0.06     0.11        0.12        0.07
[1.0, 2.7)     0.16          0.15     0.08     0           0           0.09
[2.7, 3.1)     0.03          0.03     0.03     0           0           0.02
[3.1, 5.4)     0.17          0.17     0.17     0.19        0.18        0.17
[5.4, 7.0)     0.09          0.09     0.13     0.18        0.18        0.13
[7.0, 9.2)     0.10          0.10     0.08     0           0           0.07
[9.2, 12.1)    0.10          0.11     0.16     0.26        0.26        0.15
[12.1, ∞)      0.24          0.24     0.24     0.26        0.26        0.25

Table 1: Posterior predictive probabilities for death time (the second column gives the prior predictives, which are independent of c).

Note from Table 1 that for c = 10 the prior dominates the information from the data, and for c = 0.01 the data dominate the prior. This can be understood from the expressions for q_ε given in Section 4.3. A suitable choice for c would be somewhere in between these two extremes. For illustration we have chosen c = 1, and remark that the precise choice of c is somewhat problematic. For this reason we recommend the approach briefly outlined in the last paragraph of Section 4.2, which avoids the problem of defining a c altogether. Alternative approaches for dealing with c are described in the immediately following sections.

We can consider the uncertainty associated with estimates of q_ε by sampling F(B_ε) from the posterior. So, if ε = ε_1...ε_m, we would sample C_{ε_1} from p(C_{ε_1}|data), up to C_{ε_1...ε_m} from p(C_{ε_1...ε_m}|data), and set F(B_ε) equal to the corresponding product, as in Definition 4.1. It is also possible, again with the appropriate partitioning, to obtain samples from F(t)|data and hence from S(t)|data. In Figure 4 we show the marginal posterior distribution of S(6.0)|data. This was obtained via the sampling strategy just described, using a sample of size 5,000 and with c = 1.

Using Polya trees for analysing survival data is more straightforward than using NTR processes. Posterior distributions are more tractable and there is little need for computer intensive simulations.

Figure 4: Posterior distribution of S(6.0)|data

An additional advantage is that we can select Polya trees which are continuous, whereas NTR processes are discrete.

Using Polya trees to model error distributions.

We now consider the use of Polya trees for modelling error distributions.

Multiple regression example.

We start by considering the linear model

Y_i = X_i β + θ_i, i = 1, ..., n,

where X_i = (X_{i,1}, X_{i,2}, ..., X_{i,p}) is a vector of known covariates, β is a vector of p unknown regression coefficients and the θ_i are error terms, assumed to be iid from some unknown distribution F. The parameter β is assigned a multivariate normal prior with mean μ and covariance matrix Σ. A priori, F and β will be taken to be independent.

Since F is completely arbitrary, the intercept term of β will be confounded with the location of F. This is easily overcome by fixing the median of F, defining F(B0) = F(B1) = 1/2. Typically, we will want the median to be located at 0, and this is achieved by taking the partition at level 1 to be at 0. In such cases it may be convenient to take F0 as the normal distribution with zero mean and variance σ². This defines a median regression model instead of the more popular mean regression model, and parallels Ying et al.'s (1995) frequentist approach.

If required, we could also fix the scale of F by defining, for example, F(B00), ..., F(B11) each equal to 1/4. This would be appropriate for the alternative model Y_i = X_i β + σθ_i, i = 1, ..., n, where F0 could be taken to be the standard normal distribution. Analysis is based on a MCMC algorithm outlined in the Appendix, and further details can be found in Walker and Mallick (1997a).

We reanalyse the data set presented by Ying et al. (1995). This involves 121 patients suffering small cell lung cancer, each undertaking one of two treatments: A, with 62 patients, or B, with 59 patients. The survival times are given in days, with 98 patients providing exact survival times and the remainder right censored survival times. The covariates are the treatment type, 0 or 1, and the natural logarithm of the entry age of the patient. Ying et al. were only able to estimate the median survival time in their analysis and then test for the "better" treatment. We are not restricted in any way as to the type of inference we can make.

In our analysis we took a normal prior, with mean zero and large variance term, for β. The parameters for the Polya tree are F0 as the normal distribution with zero mean and variance σ² = 10², and α_ε = cm², whenever ε defines a set at level m, with c = 0.1. We found these parameters provided a noninformative approach (see Sections 4.2 and 4.4). We took the number of levels of the Polya tree to be fixed at 8. These parameters were obtained after some preliminary analyses. Increasing σ² had no effect on the results, and yet we were not confident about reducing σ² below 100. Reducing c to 0.01 gave more weight to the noninformativeness of the prior at the bottom levels rather than the continuity, and increasing c to 1 removed the noninformativeness of the prior at the top levels. Finally, we chose M = 8 to give us satisfactory partitions, essentially neither too big nor too small. We therefore had 254 partitions covering (approximately) the interval (-20, +20), giving 0.16 as the average partition length. An alternative approach would be to treat c, M and σ² as unknown parameters and assign prior distributions (though perhaps this is not necessary for M). This is a relatively straightforward idea to implement using the Y_i = X_i β + σθ_i model, but we shall present details elsewhere.

For illustration, predictive survival curves are presented: the first (Figure 5) for a new patient with treatment A, and the second (Figure 6) for a new patient with treatment B.

Figure 5: Predictive survival curves for three new patients with treatment A (time in days/10)

Figure 6: Predictive survival curves for three new patients with treatment B (time in days/10)

The three curves selected for illustration are those for patients whose covariate values coincide with the quartiles of the observed values of the log entry age covariate.

Hierarchical model example.

We now consider nonparametric modelling in the second stage of a hierarchical model. For illustration, we remodel and reanalyse the problem considered by Crowder (1978, Table 3), which involves binomial data in a 2 x 2 factorial layout. The beta-binomial model of Crowder models variation of expected proportions within cell means by assuming that

Y_ij | Z_ij ~ binomial(Z_ij, n_ij), i = 1, ..., 4, j = 1, ..., n_i,

with Z_ij ~ beta(γ_i, δ_i), where π_i = γ_i/(γ_i + δ_i) is the mean of Z_ij. The analysis proceeds by finding maximum likelihood estimates for π = (π_1, π_2, π_3, π_4). We remodel this by considering a random effects logit model with second stage given by

logit Z_ij = X_i β + θ_ij,

where θ_{11}, ..., θ_{4 n_4} | F ~ iid F. Here, the independence assumption for the θ_ij corresponds to the homogeneity of the variance term of the distribution of the Z_ij in the beta-binomial model, given by γ_i + δ_i = constant. The logit Z_ij are a linear combination of the intercept term and the factorial effects, so that X_1 = (1, 0, 0), X_2 = (1, 0, 1), X_3 = (1, 1, 0) and X_4 = (1, 1, 1). A MCMC analysis is detailed in the Appendix and further information can be found in Walker and Mallick (1997b). The prior specifications for the Polya tree were taken to be the same as those for the previous example.

Posterior distributions are summarised using the samples obtained from the MCMC output. The ergodic mean estimates of the π_i are evaluated as

π̂_1 = 0.41, π̂_2 = 0.64, π̂_3 = 0.37, π̂_4 = 0.51,

which compare with the values

π̃_1 = 0.41, π̃_2 = 0.65, π̃_3 = 0.33, π̃_4 = 0.57,

obtained by Crowder.

The posterior mean of F is shown in Figure 7. Essentially, the distribution in Figure 7 characterises the within cell mean variability, that is, the variability of the Z_ij about π_i. As can be seen, this is a fairly tight distribution about zero, as one would probably expect. The posterior distributions for β_1, β_2 and β_3 are shown in Figure 8. Since the density in Figure 7 does resemble a normal to some extent, we reanalyse the data replacing the Polya tree by a normal distribution, with mean zero and variance σ², assigning σ^{-2} a noninformative gamma prior. The resulting posterior distributions for β appear in Figure 9. Note then that these posteriors have the same locations as those from the Polya tree analysis, but with the posterior spread from the parametric model reduced significantly.

Figure 7: Posterior expectation of F in the beta-binomial model

Walker and Mallick (1997a) also detail the use of Polya trees in a frailty model (Clayton and Cuzick, 1985). Here, we omit details and simply draw attention to the posterior estimate of the log-frailty distribution obtained in that paper. In the analysis the frailties are (incorrectly) assumed to be exchangeable and not dependent on a male/female covariate; Figure 10 evidences the great flexibility of the nonparametric framework in recovering a bimodal form for the distribution of the log-frailties arising from the mixed male/female population.
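Returning to the hierarchical logit example above, the cell-mean estimates relate to the model through π_i = E[expit(X_i β + θ)], averaging over θ ~ F and over the posterior of (β, F). The following Monte Carlo sketch (ours) makes this explicit; the "posterior draws" below are placeholders, not output from the MCMC algorithm of the Appendix.

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]])      # intercept + factorial effects
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

# Placeholder posterior draws of beta and of theta ~ F (one set of theta per iteration).
beta_draws = rng.normal([-0.4, 0.2, 0.6], 0.1, size=(1000, 3))
theta_draws = rng.normal(0.0, 0.5, size=(1000, 50))

# pi_i estimated by averaging expit(X_i beta + theta) over all draws.
pi_hat = np.array([np.mean(expit(X[i] @ beta_draws.T + theta_draws.T)) for i in range(4)])
print("Monte Carlo estimates of pi:", np.round(pi_hat, 2))
```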

Figure 8: Histogram representations of the posterior distribution of β obtained from the nonparametric model

Figure 9: Histogram representations of the posterior distribution of β obtained from the parametric model

Figure 10: Posterior expectation of log frailty distribution

5 Exchangeable models

Let Y1, Y2, ... be an exchangeable sequence of random variables defined on Ω. By de Finetti's representation theorem (de Finetti, 1937), there exists a probability measure P_Ω defined on the space of probability measures on Ω, such that the distribution of Y1, Y2, ... can be obtained by first choosing F ~ P_Ω and then taking Y1, Y2, ... | F ~ iid F. That is,

P(Y1 ∈ B1, ..., Yn ∈ Bn) = ∫ ( Π_{i=1}^n F(B_i) ) dP_Ω(F).

Here P_Ω is referred to as the de Finetti or prior measure and, given the joint distribution of Y1, Y2, ..., this P_Ω is unique (Hewitt and Savage, 1955). An example is the general Polya-urn scheme (Blackwell and MacQueen, 1973). Let α(.) = cF0(.), where c > 0 and F0 is a probability measure on Ω. The scheme for generating the exchangeable sequence (Y1, ..., Yn) from Ω is given by

Y1 ~ F0,   Y2 | Y1 ~ (cF0 + δ_{Y1})/(c + 1),   ...,

Y_n | Y_1, ..., Y_{n-1} ~ (cF0 + Σ_{j=1}^{n-1} δ_{Y_j})/(c + n - 1).

Blackwell and MacQueen show that

n^{-1} Σ_{i=1}^n δ_{Y_i} → F with probability 1,

with F from a Dirichlet process with parameter α. The de Finetti measure for the Polya-urn scheme is therefore the Dirichlet process. As might be anticipated from our earlier identification of the beta-Stacy process as a generalisation of the Dirichlet process, a generalised Polya-urn scheme can be obtained which has the discrete beta-Stacy process as the de Finetti measure (Walker and Muliere, 1997; see also Section 5.1).

There are a number of reasons why it is often convenient to consider the sequence Y1, Y2, ... directly, marginalising over F. First, F is an infinite dimensional parameter, so the advantage in removing it is that one ends up working in a finite dimensional framework, making much of the mathematics simpler. Secondly, interest is often in prediction, and the distribution of Y_{n+1} given Y1, ..., Yn is an immediate consequence. Thirdly, we are "closer" to the data in the sense that we have the probability distribution for the data explicitly. It should also be pointed out that the posterior parameters for P_Ω can often be determined from the sequence of predictive distributions (consider, for example, the Polya urn sequence).

5.1 Bernoulli trips

Here we introduce a simple concept and method for modelling multiple state processes based on an exchangeable sampling scheme (the Bernoulli trip). A Bernoulli trip is a reinforced random walk (Coppersmith and Diaconis, 1987; Pemantle, 1988) on a "tree" which characterises the space for which a prior is required. An observation in this space corresponds to a unique path or branch of the tree. The path corresponding to this observation is reinforced; that is, the probability of a future observation following this path is increased. Thus, after n observations, a maximum of n paths have been reinforced.

To construct a Bernoulli trip we discretise the relevant space. The walk starts at e_0 and moves in one of a possible finite number of directions to reach e_1, say.

From here the walk moves, again in one of a possible finite number of directions. In general, a walk reaches e and moves to one of a finite number of "positions", the collection of which we will denote by M_e. For the first walk,

P(e → e') = α(e, e') / Σ_{e'' ∈ M_e} α(e, e''),   e' ∈ M_e,

where each α is nonnegative. There will be positions which, if reached, result in the walk being terminated, and this eventually happens to all walks, whatever the path. After the first walk the parameters α are updated. If during the course of the first walk a move was made from e to e', then we simply replace α(e, e') by α(e, e') + 1. The second walk "follows" these new probabilities. After the second walk the new parameters are themselves updated in the same way, and the third walk "follows" these twice updated probabilities, and so on. It is clear that the probability of the second walk coinciding exactly with the first walk has increased (reinforcement).

If we denote the path of the first walk by Y1 and the second walk by Y2, and so on, then we can write down without much difficulty the joint probability of the first n walks following particular paths. From this it is straightforward to show that (Y1, ..., Yn) are exchangeable random variables for all n. Explicitly, we have

P(Y1, ..., Yn) = Π_e [ Π_{e' ∈ M_e} α(e, e')^{[n(e, e')]} ] / [ Σ_{e' ∈ M_e} α(e, e') ]^{[n(e)]},

where n(e, e') is the number of walks which move from e to e', n(e) = Σ_{e' ∈ M_e} n(e, e'), a^{[m]} = a(a + 1)...(a + m - 1) and a^{[0]} = 1. A Bayesian bootstrap procedure would be to obtain the posterior parameters and then set the prior parameters to zero; thus α*(e, e') = n(e, e'). In such cases the predictives only depend on the data.

To illustrate, consider a two state process with one absorbing state; that is, a survival model. Each walk starts at (0, 0) and on reaching, say, (k, 0), k = 1, 2, ..., the walk can move either to (k + 1, 0) or (k + 1, 1). We assume k indexes time points t1, t2, .... If it reaches (k, 1), for any k, then the walk is terminated (obviously this corresponds to death at t_k). The move (k - 1, 0) to (k, 0) indicates survival from t_{k-1} to t_k. Explicitly, for k = 1, 2, ...,

P((k - 1, 0) → (k, 0)) = α_{k0}/(α_{k0} + α_{k1})

and

P((k - 1, 0) → (k, 1)) = α_{k1}/(α_{k0} + α_{k1}).

Clearly each walk is characterised by the point k at which the move to (k, 1) is made, and we let Y_i represent this point for the ith walk. A priori we have

P(Y = k) = [α_{k1}/(α_{k0} + α_{k1})] Π_{j<k} [α_{j0}/(α_{j0} + α_{j1})],

and a posteriori, after n observations, we have

P(Y_{n+1} = k | Y_1, ..., Y_n) = [α*_{k1}/(α*_{k0} + α*_{k1})] Π_{j<k} [α*_{j0}/(α*_{j0} + α*_{j1})],

where α*_{k0} = α_{k0} + n_{k0} and α*_{k1} = α_{k1} + n_{k1}, with n_{k0} the number of walks that move from (k - 1, 0) to (k, 0) and n_{k1} the number of walks that move from (k - 1, 0) to (k, 1).

We can easily deal with right censored observations within the Bernoulli trip framework. A censored observation at k, that is Y > k, corresponds to a walk being censored at k. The updating mechanism for such a walk is given by α_{j0} → α_{j0} + 1 for all j < k. Note that the walks remain exchangeable provided the censoring mechanism is independent of the failure mechanism. The Bernoulli trip just described can be shown to be a discrete time version of the beta-Stacy process detailed in Section 3. Whereas it would be difficult to extend the stochastic process approach to model multiple state processes, it is relatively easy within the Bernoulli trip framework. The only drawback, if indeed it is one, is that the space needs to be discretised. Typically, however, data arising from multiple state processes do come in a discrete form, as information obtained each day, week, or during some other unit of time.

5.2 Prior trips

For illustration, we consider a three state (disease) process in which all patients start in state 1. From here it is possible to move directly to state 3 or to move to state 3 via state 2. Random right censoring can occur at any time. We define the first walk via the transition probabilities

P((k - 1, 0) → (k, 0)) = α_{k0}/(α_{k0} + α_{k1} + α_{k2}),

5.2 Prior trips

For illustration, we consider a three state (disease) process in which all patients start in state 1. From here it is possible to move directly to state 3 or to move to state 3 via state 2. Random right censoring can occur at any time. We define the first walk via the transition probabilities

$$P((k-1,0) \to (k,0)) = \frac{\alpha_{k0}}{\alpha_{k0}+\alpha_{k1}+\alpha_{k2}}, \quad P((k-1,0) \to (k,1)) = \frac{\alpha_{k1}}{\alpha_{k0}+\alpha_{k1}+\alpha_{k2}}, \quad P((k-1,0) \to (k,2)) = \frac{\alpha_{k2}}{\alpha_{k0}+\alpha_{k1}+\alpha_{k2}}$$

for transitions from state 1. For transitions from state 2 to state 3, we define

$$P((k-1,1) \to (k,1)) = \frac{\beta_{k1}}{\beta_{k1}+\beta_{k2}} \quad\text{and}\quad P((k-1,1) \to (k,2)) = \frac{\beta_{k2}}{\beta_{k1}+\beta_{k2}}.$$

The walk is completed at k whenever (k, 2) is reached. We can obtain the prior predictive for a particular event; for example,

$$P(T = k, S = j) = \frac{\alpha_{j1}}{\alpha_{j0}+\alpha_{j1}+\alpha_{j2}} \prod_{l<j} \frac{\alpha_{l0}}{\alpha_{l0}+\alpha_{l1}+\alpha_{l2}} \times \frac{\beta_{k2}}{\beta_{k1}+\beta_{k2}} \prod_{j<l<k} \frac{\beta_{l1}}{\beta_{l1}+\beta_{l2}}, \qquad j < k,$$

where T denotes the time to reach state 3 and S is the time to reach state 2 (if at all). If state 2 is not visited then

$$P(T = k, \text{state 2 not visited}) = \frac{\alpha_{k2}}{\alpha_{k0}+\alpha_{k1}+\alpha_{k2}} \prod_{l<k} \frac{\alpha_{l0}}{\alpha_{l0}+\alpha_{l1}+\alpha_{l2}}.$$

Note that we need to define the parameters α and β so that the first walk will end with probability 1. Note, also, that the model described here assumes that the transition probabilities from state 2 to state 3 do not depend on the time of transition from state 1 to state 2. This is the Markov model and will be referred to as model M(c).
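The prior predictive for the Markov model M(c) is a simple product of conditional transition probabilities, as the following Python sketch (ours) shows; the weights a and b below are hypothetical prior choices indexed by time, purely to make the formula concrete.

def prior_predictive_markov(a, b, k, j=None):
    """P(T = k, S = j) for j < k or, with j=None, P(T = k, state 2 not visited),
    under model M(c).  a[l] = (alpha_l0, alpha_l1, alpha_l2) and
    b[l] = (beta_l1, beta_l2) are illustrative prior weights indexed by time l."""
    if j is None:
        p = a[k][2] / sum(a[k])
        for l in range(1, k):
            p *= a[l][0] / sum(a[l])
        return p
    p = a[j][1] / sum(a[j])
    for l in range(1, j):
        p *= a[l][0] / sum(a[l])
    p *= b[k][1] / sum(b[k])
    for l in range(j + 1, k):
        p *= b[l][0] / sum(b[l])
    return p

# made-up weights over 5 time points
a = {l: (2.0, 1.0, 0.5) for l in range(1, 6)}
b = {l: (1.5, 1.0) for l in range(1, 6)}
print(prior_predictive_markov(a, b, k=4, j=2))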

The semi-Markov model, in which the transition probabilities from state 2 to state 3 do depend on the time of transition from state 1 to state 2, can also be represented within the Bernoulli trip framework without difficulty. We could have model M(a), given by

$$P(T = k \mid S = j) = \frac{\beta_{k2,j}}{\beta_{k1,j}+\beta_{k2,j}} \prod_{j<l<k} \frac{\beta_{l1,j}}{\beta_{l1,j}+\beta_{l2,j}}, \qquad j < k,$$

to model a direct dependence on the time of transition from state 1 to state 2, or model M(b), given by

$$P(T = k \mid S = j) = \frac{\beta_{k-j,2}}{\beta_{k-j,1}+\beta_{k-j,2}} \prod_{j<l<k} \frac{\beta_{l-j,1}}{\beta_{l-j,1}+\beta_{l-j,2}}, \qquad j < k,$$

where now the conditional probabilities depend solely on the time spent in state 2.

Here we seek an interpretation for the parameters α_{k0}, α_{k1}; we assume that α_{k2} = 0 for simplicity. Note that a priori

$$\frac{\alpha_{k0}}{\alpha_{k0}+\alpha_{k1}} = \frac{P(S=k)}{P(S=k)+P(S>k)} \quad\text{and}\quad \frac{\alpha_{k1}}{\alpha_{k0}+\alpha_{k1}} = \frac{P(S>k)}{P(S=k)+P(S>k)}.$$

It can be seen that α_{k0} is associated with P(S = k) and α_{k1} is associated with P(S > k). Therefore it is reasonable to take α_{k0} = c_k P(S = k) and α_{k1} = c_k P(S > k), where each c_k is positive and the P(·) are prior guesses for the relevant probabilities. Interpretations for the c_k can be found by considering the estimates/predictives for the conditional hazards:

$$P(S_{n+1} = k \mid S_{n+1} \ge k, Y_1,\ldots,Y_n) = \frac{\alpha_{k0}+n_{k0}}{\alpha_{k0}+n_{k0}+\alpha_{k1}+n_{k1}}.$$

It is seen that a large c_k reflects strong belief in the prior estimate/predictive of P(S = k | S ≥ k), and a small c_k reflects a correspondingly weak belief in P(S = k | S ≥ k). Similar interpretations can be found for β_{k1} and β_{k2}.
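A small numerical illustration (ours, with a hypothetical prior guess and made-up counts) of this parameterisation: as c_k grows, the posterior conditional hazard is pulled towards the prior guess; as c_k shrinks, it approaches the empirical hazard.

def posterior_hazard(k, prior_pmf, c, n_k0, n_k1):
    """Posterior estimate of P(S = k | S >= k) with alpha_k0 = c * P(S = k)
    and alpha_k1 = c * P(S > k); prior_pmf is a hypothetical prior guess."""
    p_eq = prior_pmf.get(k, 0.0)
    p_gt = sum(p for t, p in prior_pmf.items() if t > k)
    a_k0, a_k1 = c * p_eq, c * p_gt
    return (a_k0 + n_k0) / (a_k0 + n_k0 + a_k1 + n_k1)

guess = {1: 0.2, 2: 0.3, 3: 0.3, 4: 0.2}        # prior guess for P(S = k)
for c in (0.1, 1.0, 10.0, 100.0):
    print(c, posterior_hazard(2, guess, c, n_k0=5, n_k1=20))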

5.3 Posterior trips

A complication with obtaining the posterior trips arises if some of the observations are interval censored. Suppose that one observation (i = n) is interval censored; that is, S_n is known only to lie in the interval [k_1, ..., k_L] (with k_L < ∞ and T_n > k_L). The (random) updated parameters are given, for M(c), by

$$\alpha^*_{k0} = \alpha_{k0} + n_{k0} + J_{0k} \quad\text{and}\quad \alpha^*_{k1} = \alpha_{k1} + n_{k1} + I(k < k_1) + J_{1k},$$

where n_{k0} = Σ_{i=1}^{n-1} I(S_i = k) and n_{k1} = Σ_{i=1}^{n-1} I(S_i > k). Here J_{0k} and J_{1k} are random, taking values in {0, 1}, with

$$I(J_{0k} = 1) = I(S_n = k \mid k_1 \le S_n \le k_L, S_1,\ldots,S_{n-1}, T_1,\ldots,T_{n-1}),$$
$$I(J_{1k} = 1) = I(S_n > k \mid k_1 \le S_n \le k_L, S_1,\ldots,S_{n-1}, T_1,\ldots,T_{n-1}),$$

and

$$P(J_{0k} = 1 \mid S_1,\ldots,S_{n-1}, T_1,\ldots,T_{n-1}) = P(S_n = k \mid k_1 \le S_n \le k_L, S_1,\ldots,S_{n-1}, T_1,\ldots,T_{n-1}),$$

which is given, up to a constant of proportionality, by

$$r_k \prod_{l=k_1}^{k-1}(1-r_l) \times \prod_{l=k+1}^{k_L}(1-q_l),$$

where

$$r_k = \frac{\alpha_{k0}+n_{k0}}{\alpha_{k0}+n_{k0}+\alpha_{k1}+n_{k1}} \quad\text{and}\quad q_k = \frac{\beta_{k2} + \sum_{i=1}^{n-1} I(T_i = k,\, S_i < k)}{\beta_{k1}+\beta_{k2} + \sum_{i=1}^{n-1} I(T_i \ge k,\, S_i < k)},$$

for k ∈ {k_1, ..., k_L}.
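The conditional distribution of an interval-censored S_n is easily normalised once the hazards r and q have been computed from the prior weights and the remaining observations. The following Python sketch (ours) assumes r and q are already available as dictionaries of hazard estimates over the relevant time points.

def censored_S_distribution(k1, kL, r, q):
    """Conditional distribution of an interval-censored S_n over {k1, ..., kL},
    proportional to r_k * prod_{k1 <= l < k}(1 - r_l) * prod_{k < l <= kL}(1 - q_l)."""
    weights = {}
    for k in range(k1, kL + 1):
        w = r[k]
        for l in range(k1, k):
            w *= 1.0 - r[l]
        for l in range(k + 1, kL + 1):
            w *= 1.0 - q[l]
        weights[k] = w
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

# hypothetical hazards on the interval {2, ..., 5}
r = {2: 0.2, 3: 0.3, 4: 0.25, 5: 0.4}
q = {2: 0.1, 3: 0.15, 4: 0.2, 5: 0.2}
print(censored_S_distribution(2, 5, r, q))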

For more than one interval censored observation we can proceed by sampling the missing data, conditional on all the other observations, obtaining the predictive estimate, or whatever is required, and then taking the average over a number of simulations. Without loss of generality, let S_1, ..., S_m (m < n) be interval censored, with S_j ∈ [k_1(j), ..., k_L(j)] and T_j > k_L(j). The approach is to sample iteratively, for j = 1, ..., m, from P(S_j | k_1(j) ≤ S_j ≤ k_L(j), S_{(j)}, T_{(j)}), where (S_{(j)}, T_{(j)}) contains all the information in the data and in the sampled variates except that on individual j. Note that if S_{(j)} ∩ {k_1(j), ..., k_L(j)} = ∅ then S_j is taken uniformly from {k_1(j), ..., k_L(j)}. At iteration t we have then sampled {S_j^{(t)} : j = 1, ..., m}, which, combined with the observed data, gives the estimator P^{(t)}. The required estimator is then given by the average T^{-1} Σ_{t=1}^{T} P^{(t)} for some large enough T to ensure convergence of the simulated Markov chain. Such a procedure can be viewed as a stochastic version of the iterative algorithm for obtaining the self consistent estimator in Frydman (1992); essentially, sampling from [S_j | ·] replaces taking the expectation of [S_j | ·]. Note that it is also possible to consider the situation in which T and S are both interval censored, using a modified version of the algorithm just described.
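A schematic Python version (ours) of this sampling scheme is given below. The arguments draw_S_given_rest and estimator are placeholders standing, respectively, for the conditional draw of an interval-censored S_j (for instance built from censored_S_distribution above) and for whatever scalar predictive or distributional summary is required; they are not part of the paper.

import random

def data_augmentation(intervals, observed, estimator, draw_S_given_rest,
                      n_iter=1000, burn_in=100):
    """Average an estimator over imputations of interval-censored S_1..S_m.

    intervals[j] = (k1_j, kL_j) for each censored individual j;
    observed     = the exactly observed part of the data;
    draw_S_given_rest(j, current, observed) returns one draw of S_j given
    everything else (a stand-in for the conditional described in the text)."""
    current = {j: random.randint(k1, kL) for j, (k1, kL) in intervals.items()}
    estimates = []
    for t in range(n_iter):
        for j in intervals:                       # one Gibbs sweep over j = 1..m
            current[j] = draw_S_given_rest(j, current, observed)
        if t >= burn_in:
            estimates.append(estimator(current, observed))
    return sum(estimates) / len(estimates)        # Monte Carlo average of P^(t)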

5.4 Example

We analyse a data set presented by De Gruttola and Lagakos (1989) and reanalysed by Frydman (1992, Table 1). 262 haemophiliacs, divided into two groups, heavily and lightly treated, were followed up over a period of time after receiving HIV infected blood. Observations are discretised into 6 month intervals. State 1 is infection free, state 2 corresponds to HIV infection and state 3 is the onset of AIDS. According to current mainstream theory, it is not possible to have AIDS without first being HIV infected, and so it is not possible to move directly from state 1 to state 3. Therefore P((k-1, 0) → (k, 2)) = 0, which we achieve by defining α_{k2} = 0 for all k. For the illustrative results that follow, we take a Bayesian bootstrap approach; that is, we set the prior parameters to zero. De Gruttola and Lagakos (1989) and Frydman (1992) both analysed the data nonparametrically via self consistent estimators (Turnbull, 1976), but the former assumed the times in states 1 and 2 to be independent. Figure 11 shows the estimated cumulative distributions of times to HIV infection for the two groups. These are similar to the results obtained by Frydman. Figures 12 and 13 are, respectively, the estimated marginal cumulative distributions for the onset of AIDS under assumptions (c) and (a). These highlight differences under the two quite valid assumptions.

Figure 11: Estimated cumulative distributions of times to HIV infection; lightly treated —, heavily treated - - -

Figure 12: Estimated (marginal) cumulative distributions of times to onset of AIDS, assumption (c); lightly treated —, heavily treated - - -

Figure 13: Estimated (marginal) cumulative distributions of times to onset of AIDS, assumption (a); lightly treated —, heavily treated - - -

If there is uncertainty about which assumption, or model, to choose, then a possibility is to obtain an estimator which is a mixture of the estimators under the different assumptions. Explicitly this involves taking the estimator P given by

$$P = P_{(a)}\,\pi(M_{(a)} \mid \text{data}) + P_{(b)}\,\pi(M_{(b)} \mid \text{data}) + P_{(c)}\,\pi(M_{(c)} \mid \text{data}),$$

where P_{(·)} is the estimator under M_{(·)} and π(M_{(·)} | data) is the posterior weight assigned to M_{(·)}; that is,

$$\pi(M_{(\cdot)} \mid \text{data}) \propto \pi(\text{data} \mid M_{(\cdot)}) \times \pi(M_{(\cdot)}),$$

where π(M_{(·)}) is the prior weight assigned to model M_{(·)}. Therefore to obtain the estimator P it only remains to evaluate π(data | M_{(·)}); these are in fact straightforward to calculate. It remains to decide on the values of the αs and βs with which to determine the π(data | M_{(·)}). First, for large αs and βs the prior specifications should swamp the data and the model. This is indeed the case: for α = β = 10^6 we obtain log π(data | M_{(·)}) = -578.8 for all M_{(·)} (we have removed the term

$$\prod_k \frac{\alpha_{k0}^{[n_{k0}]}\,\alpha_{k1}^{[n_{k1}]}}{(\alpha_{k0}+\alpha_{k1})^{[n_{k0}+n_{k1}]}}$$

from π(data | M_{(·)}), which is common to all M_{(·)}). To represent vague a priori information we consider β_{k2} = λβ_{k1} = 10^{-6}; for λ = 1, 10, 100 we obtain log π(data | M_{(a)}) = -390.3, -337.3, -284.4, log π(data | M_{(b)}) = -260.0, -234.6, -209.3, and log π(data | M_{(c)}) = -249.2, -226.1, -203.1. As far as Bayes factors are concerned, therefore, the data support M_{(c)}, the Markov model.
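For illustration (ours), the posterior model weights follow from the log marginal likelihoods by exponentiating and normalising; the snippet below uses the values reported above for λ = 100 and assumes equal prior model weights.

import math

def model_weights(log_evidence, log_prior=None):
    """Posterior model weights from log pi(data | M); equal priors assumed
    unless log_prior is supplied.  Subtracts the maximum for numerical stability."""
    if log_prior is None:
        log_prior = {m: 0.0 for m in log_evidence}
    log_post = {m: log_evidence[m] + log_prior[m] for m in log_evidence}
    mx = max(log_post.values())
    unnorm = {m: math.exp(lp - mx) for m, lp in log_post.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

# log pi(data | M) reported in the text for lambda = 100
w = model_weights({"M(a)": -284.4, "M(b)": -209.3, "M(c)": -203.1})
print(w)   # essentially all posterior weight falls on M(c)

The mixture estimator P is then the weighted combination of the model-specific estimators with these weights.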

6 Discussion

Following a description of the Dirichlet process in Section 2.1, we demonstrated in Section 2.2 the use of the Dirichlet process with respect to functional estimation. We showed how it is possible to incorporate prior information on both the mean and variance of the unknown parameter. In Section 3.2 we briefly considered the MDP model. In Section 3 stochastic process priors were described, in particular the neutral to the right process. We showed how to specify and interpret the mean and variance of an unknown survival curve, developed the full posterior distribution and, via illustrative analysis, implemented a full Bayesian solution using simulation. In Section 4 we described Polya tree priors for partition models. We demonstrated the use of these priors in modelling errors in both hierarchical and non-hierarchical frameworks. In particular we were able to capture both 'well behaved' and 'badly behaved' distributions. In Section 5 priors constructed from exchangeable processes were detailed and their use demonstrated in a three state disease process model. A natural question that arises is how to choose the appropriate approach for a given problem. Because of the difficulties involved in simulating a continuous time stochastic process, we recommend the use of such processes only when interest is in the specific function being modelled as a process, for example, the cumulative distribution/hazard function. We have found Polya trees particularly appropriate for modelling error distributions in a large class of models, including linear models, generalised linear models and frailty models. In particular, it is straightforward to fix the location (and scale) of a random probability measure chosen from such a prior. Exchangeable processes are more suited to predictive inference and are especially useful in extending the traditional alive/dead survival models to incorporate multiple states. This paper is an exposé of the current state of the art of Bayesian nonparametrics from our perspective. The work is ongoing and a number of problems remain unresolved. In particular, more work is required in the following areas: a full Bayesian nonparametric analysis involving covariate information; multivariate priors based on stochastic processes; multivariate error models involving Polya trees; developing exchangeable processes to cover a larger class of problems; and nonparametric sensitivity analysis (Lenk, 1996). A further question that arises is the extent to which we currently understand the potential mathematical consequences of the tool-kit we are developing. Diaconis and Freedman (1986) present a nonparametric model that uses a symmetrized Dirichlet prior for the underlying distribution and an independent prior for its median. They then demonstrate that seemingly innocuous choices for the latter lead to an inconsistent Bayes estimate of the median. For the same model, they show other reasonable priors for the median that are consistent. The source of the problem, when it occurs, appears to be the infinite dimensionality of the nuisance parameter.

In light of results such as those in Hjort (1990) and Diaconis and Freedman (1993), which give demonstrably consistent nonparametric Bayesian procedures, general theoretical advances that pinpoint the pitfalls would indeed prove valuable. In the interim, we advocate the use of prudent albeit heuristic sensitivity analyses and look forward to more formal developments in this direction that would afford the practitioner a higher degree of assurance.

Acknowledgements

Research reported here was supported in part by an EPSRC ROPA and travel grant, an NSF grant, and financial support from the Business School at the University of Michigan, Ann Arbor.

References

Ammann, L. (1984). Bayesian nonparametric inference for quantal response data. Ann. Statist. 12, 636-645.
Antoniak, C.E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. Ann. Statist. 2, 1152-1174.
Arjas, E. and Gasbarra, D. (1994). Nonparametric Bayesian inference from right censored survival data using the Gibbs sampler. Statistica Sinica 4, 505-524.
Blackwell, D. (1973). The discreteness of Ferguson selections. Ann. Statist. 1, 356-358.
Blackwell, D. and MacQueen, J.B. (1973). Ferguson distributions via Polya urn schemes. Ann. Statist. 1, 353-355.
Bondesson, L. (1982). On simulation from infinitely divisible distributions. Adv. App. Prob. 14, 855-869.
Bush, C.A. and MacEachern, S.N. (1996). A semiparametric Bayesian model for randomised block designs. Biometrika 83, 275-286.
Christensen, R. and Johnson, W. (1988). Modelling accelerated failure time with a Dirichlet process. Biometrika 75, 693-704.
Clayton, D.G. and Cuzick, J. (1985). Multivariate generalisations of the proportional hazards model (with discussion). J. Roy. Statist. Soc., Ser. A 148, 82-117.
Clayton, D.G. (1991). A Monte Carlo method for Bayesian inference in frailty models. Biometrics 47, 467-485.

Coppersmith, D. and Diaconis, P. (1987). Random walk with reinforcement. Unpublished manuscript.
Cox, D.R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc., Ser. B 34, 187-220.
Crowder, M.J. (1978). Beta-binomial Anova for proportions. Appl. Statist. 27, 34-47.
Dalal, S.R. (1978). A note on the adequacy of mixtures of Dirichlet processes. Sankhya 40, 185-191.
Damien, P., Laud, P.W. and Smith, A.F.M. (1995). Random variate generation from infinitely divisible distributions with applications to Bayesian inference. J. Roy. Statist. Soc., Ser. B 57, 547-564.
Damien, P., Laud, P.W. and Smith, A.F.M. (1996). Implementation of Bayesian nonparametric inference based on beta processes. Scand. J. Statist. 23, 27-36.
Damien, P. and Walker, S.G. (1996). Sampling probability densities via uniform random variables and a Gibbs sampler. Submitted for publication.
de Finetti, B. (1937). La prevision: ses lois logiques, ses sources subjectives. Ann. Inst. H. Poincare 7, 1-68.
De Gruttola, V. and Lagakos, S.W. (1989). Analysis of doubly-censored survival data, with application to AIDS. Biometrics 45, 1-11.
Diaconis, P. and Freedman, D. (1986). On the consistency of Bayes estimates. Ann. Statist. 14, 1-26.
Diaconis, P. and Freedman, D. (1993). Nonparametric binary regression: a Bayesian approach. Ann. Statist. 21, 2108-2137.
Doksum, K.A. (1974). Tailfree and neutral random probabilities and their posterior distributions. Ann. Probab. 2, 183-201.
Dubins, L. and Freedman, D. (1965). Random distribution functions. Bull. Amer. Math. Soc. 69, 548-551.
Dykstra, R.L. and Laud, P.W. (1981). A Bayesian nonparametric approach to reliability. Ann. Statist. 9, 356-367.
Escobar, M.D. (1988). Estimating the means of several normal populations by nonparametric estimation of the distribution of the means. Unpublished Ph.D. dissertation, Department of Statistics, Yale University.
Escobar, M.D. (1994). Estimating normal means with a Dirichlet process prior. J. Am. Statist. Assoc. 89, 268-277.
Escobar, M.D. and West, M. (1995). Bayesian density estimation and inference using mixtures. J. Am. Statist. Assoc. 90, 577-588.

Fabius, J. (1964). Asymptotic behaviour of Bayes estimates. Ann. Math. Statist. 35, 846-856.
Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. Ann. Statist. 1, 209-230.
Ferguson, T.S. (1974). Prior distributions on spaces of probability measures. Ann. Statist. 2, 615-629.
Ferguson, T.S. and Phadia, E.G. (1979). Bayesian nonparametric estimation based on censored data. Ann. Statist. 7, 163-186.
Freedman, D.A. (1963). On the asymptotic behaviour of Bayes estimates in the discrete case I. Ann. Math. Statist. 34, 1386-1403.
Freedman, D.A. (1965). On the asymptotic behaviour of Bayes estimates in the discrete case II. Ann. Math. Statist. 36, 454-456.
Frydman, H. (1992). A nonparametric estimation procedure for a periodically observed three-state Markov process, with application to AIDS. J. Roy. Statist. Soc., Ser. B 54, 853-866.
Ghosh, J.K. and Mukerjee, R. (1992). Noninformative priors. In Bayesian Statistics 4, J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (Eds.). Oxford University Press.
Gill, R.D. and Johansen, S. (1990). A survey of product integration with a view toward application in survival analysis. Ann. Statist. 18, 1501-1555.
Hewitt, E. and Savage, L.J. (1955). Symmetric measures on Cartesian products. Trans. Amer. Math. Soc. 80, 470-501.
Hjort, N.L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. Ann. Statist. 18, 1259-1294.
Hjort, N.L. (1996). Bayesian approaches to non- and semiparametric density estimation. In Bayesian Statistics 5, J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith (Eds.). Oxford University Press.
Kalbfleisch, J.D. (1978). Nonparametric Bayesian analysis of survival time data. J. Roy. Statist. Soc., Ser. B 40, 214-221.
Kaplan, E.L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. J. Am. Statist. Assoc. 53, 457-481.
Laud, P.W., Damien, P. and Smith, A.F.M. (1993). Random variate generation from D-distributions. Statist. Comput. 3, 109-112.
Laud, P.W., Smith, A.F.M. and Damien, P. (1996a). Monte Carlo methods for approximating a posterior hazard rate process. Statist. Comput. 6, 77-84.

Laud, P.W., Smith, A.F.M. and Damien, P. (1996b). Bayesian nonparametric and covariate analysis of failure time data. Submitted for publication.
Lavine, M. (1992). Some aspects of Polya tree distributions for statistical modelling. Ann. Statist. 20, 1222-1235.
Lavine, M. (1994). More aspects of Polya trees for statistical modelling. Ann. Statist. 22, 1161-1176.
Lenk, P.J. (1996). Bayesian inference of semiparametric regression and Poisson intensity functions. Submitted for publication.
Levy, P. (1936). Theorie de l'addition des variables aleatoires. Gauthier-Villars, Paris.
Lo, A.Y. (1993). A Bayesian bootstrap for censored data. Ann. Statist. 21, 100-123.
MacEachern, S.N. and Mueller, P. (1994). Estimating mixtures of Dirichlet process models. Unpublished manuscript.
Mauldin, R.D., Sudderth, W.D. and Williams, S.C. (1992). Polya trees and random distributions. Ann. Statist. 20, 1203-1221.
Mueller, P., Erkanli, A. and West, M. (1996). Bayesian curve fitting using multivariate normal mixtures. Biometrika 83, 67-79.
Muliere, P. and Walker, S.G. (1997a). A Bayesian nonparametric approach to survival analysis using Polya trees. Scand. J. Statist. To appear.
Muliere, P. and Walker, S.G. (1997b). Extending the family of Bayesian bootstraps and exchangeable urn schemes. J. Roy. Statist. Soc., Ser. B. To appear.
Pemantle, R. (1988). Phase transitions in reinforced random walk and RWRE on trees. Ann. Probab. 16, 1229-1241.
Rubin, D.B. (1981). The Bayesian bootstrap. Ann. Statist. 9, 130-134.
Sethuraman, J. and Tiwari, R. (1982). Convergence of Dirichlet measures and the interpretation of their parameter. In Proceedings of the Third Purdue Symposium on Statistical Decision Theory and Related Topics, S.S. Gupta and J.O. Berger (Eds.). Academic Press, New York.
Smith, A.F.M. and Roberts, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). J. Roy. Statist. Soc., Ser. B 55, 3-24.
Susarla, V. and Van Ryzin, J. (1976). Nonparametric Bayesian estimation of survival curves from incomplete data. J. Am. Statist. Assoc. 71, 897-902.
Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Ann. Statist. 22, 1701-1762.

Turnbull, B.W. (1976). The empirical distribution function with arbitrarily grouped, censored, and truncated data. J. Roy. Statist. Soc., Ser. B 38, 290-295.
Walker, S.G. and Damien, P. (1996). A full Bayesian nonparametric analysis involving a neutral to the right process. Submitted for publication.
Walker, S.G. and Mallick, B.K. (1996). A Bayesian semiparametric accelerated failure time model. Revised for J. Am. Statist. Assoc.
Walker, S.G. and Mallick, B.K. (1997). Hierarchical generalised linear models and frailty models with Bayesian nonparametric mixing. J. Roy. Statist. Soc., Ser. B. To appear.
Walker, S.G. and Muliere, P. (1997). Beta-Stacy processes and a generalisation of the Polya-urn scheme. Ann. Statist. To appear.
West, M., Muller, P. and Escobar, M.D. (1994). Hierarchical priors and mixture models, with application in regression and density estimation. In Aspects of Uncertainty: A Tribute to D.V. Lindley, A.F.M. Smith and P. Freeman (Eds.).
Ying, Z., Jung, S.H. and Wei, L.J. (1995). Survival analysis with median regression models. J. Am. Statist. Assoc. 90, 178-184.

Appendix

Here we give a brief outline of the MCMC algorithms for the models described in Section 4.4; for full details see the references cited in the main text. We will concentrate on the hierarchical generalised linear model, since this is the most complex. Recall the first stage binomial model given by Y_ij | (Z_ij = z_ij) ~ binomial(z_ij, n_ij), and that, given the {Z_ij}, the {Y_ij} are mutually independent. The second stage is given by logit(Z_ij) = X_ij β + θ_ij, where θ_11, ..., θ_{n n_n} | F ~ iid F. Finally, the prior specifications are that F ~ PT(Π, A) and a normal prior with zero mean and large variance is taken for β.
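To make the hierarchy concrete, the following Python sketch (ours) simulates data from this model with a finite normal mixture standing in for a draw of the error distribution F; the dimensions, covariates and the particular mixture are illustrative assumptions only, not choices made in the paper.

import numpy as np

rng = np.random.default_rng(0)

def simulate(n_groups=5, n_per=4, beta=np.array([0.5, -1.0])):
    """Simulate Y_ij with n_ij trials and success probability Z_ij, where
    logit(Z_ij) = X_ij beta + theta_ij and theta_ij are iid from a stand-in
    error distribution F (here a two-component normal mixture)."""
    X = rng.normal(size=(n_groups, n_per, 2))            # assumed covariates
    n_trials = rng.integers(5, 15, size=(n_groups, n_per))
    comp = rng.integers(0, 2, size=(n_groups, n_per))    # stand-in draw from F
    theta = np.where(comp == 0, rng.normal(-1, 0.5, comp.shape),
                                rng.normal(1, 0.5, comp.shape))
    Z = 1.0 / (1.0 + np.exp(-(X @ beta + theta)))        # inverse logit
    Y = rng.binomial(n_trials, Z)
    return X, n_trials, Y

X, n_trials, Y = simulate()
print(Y)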

Samples from the relevant full conditional distributions, including p(F | Y, Z, β), p(Z_ij | Y_ij, F, β) and p(β | Y, Z, F), are obtained using an MCMC algorithm, in particular a Metropolis-Hastings/Gibbs method. The full conditional distribution for F is a Polya tree which has been updated, to give the posterior Polya tree, from the Σ_i n_i iid observations e_ij = logit(Z_ij) - X_ij β. A random F_M is then taken as described in Section 4.2. Sampling from the full conditional of β proceeds as follows. Recall that F is sampled to level M, giving F_M as a sequence of weights, {W_k : k = 1, ..., 2^M}, on the r = 2^M sets at level M, say A_1, ..., A_r. The likelihood of β, given Z and F_M, is

$$l(\beta \mid Z, F_M) = \prod_{ij} W_{k(\beta,ij)},$$

where W_{k(β,ij)} = F_M(A_{k(β,ij)}) and A_{k(β,ij)} is the level M set containing logit(Z_ij) - X_ij β. With this likelihood established, a Metropolis-Hastings step can be used, after some preliminary work to establish a good proposal distribution, to sample the full conditional for β. An identical approach can be used for sampling the full conditional distributions of the Z_ij. An alternative method for sampling the full conditionals for β and θ_ij (replacing Z_ij) uses latent variables. The joint distribution for β and θ_ij, given F, is given, up to a constant of proportionality, by

$$f(\beta, \theta_{ij} \mid F) \propto \frac{\exp(y_{ij} X_{ij}\beta + y_{ij}\theta_{ij})}{\{1+\exp(X_{ij}\beta + \theta_{ij})\}^{n_{ij}}}\, p(\theta_{ij} \mid F)\, p(\beta).$$

We can write this in another way by introducing the latent variables U_ij and V_ij and defining the joint distribution

$$f(\beta, \theta_{ij}, u_{ij}, v_{ij} \mid F) \propto p(\beta)\, \exp\{-u_{ij} y_{ij} - v_{ij}(n_{ij}-y_{ij})\}\, I\big(u_{ij} > \log(1+e^{-Z_{ij}}),\; v_{ij} > \log(1+e^{Z_{ij}})\big)\, p(\theta_{ij} \mid F),$$

where now Z_ij = X_ij β + θ_ij. It is easy to see that the marginal distribution for β and θ_ij is as required. This eases the Gibbs sampler since the full conditionals for β and θ_ij are now of known types, albeit restricted to particular sets (Damien and Walker, 1996).
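As a sketch (ours) of one latent-variable step for a single (i, j) cell: given the current Z_ij, the full conditionals of u and v are exponential distributions truncated to the indicated sets, and the resulting indicator constraints define an interval within which the new Z_ij (and hence β and θ_ij) must lie. The helper names and the handling of zero counts are our own choices, not prescriptions from the paper.

import math, random

def sample_latents(z, y, n):
    """Draw u, v given Z = z for one cell; returns (u, v), using None when the
    corresponding exponent (y or n - y) is zero and no latent variable is needed."""
    u = v = None
    if y > 0:                        # u | rest ~ Exp(y) shifted to (log(1+e^{-z}), inf)
        u = math.log(1 + math.exp(-z)) + random.expovariate(y)
    if n - y > 0:                    # v | rest ~ Exp(n-y) shifted to (log(1+e^{z}), inf)
        v = math.log(1 + math.exp(z)) + random.expovariate(n - y)
    return u, v

def z_interval(u, v):
    """Interval for the new Z_ij implied by the indicator constraints:
    u > log(1+e^{-Z})  <=>  Z > -log(e^u - 1);  v > log(1+e^{Z})  <=>  Z < log(e^v - 1)."""
    lower = -math.log(math.exp(u) - 1) if u is not None else -math.inf
    upper = math.log(math.exp(v) - 1) if v is not None else math.inf
    return lower, upper

u, v = sample_latents(z=0.3, y=4, n=10)     # made-up current value and counts
print(z_interval(u, v))

Intersecting these intervals over all (i, j) gives the restricted sets on which the full conditionals for β and θ_ij are sampled.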