RESEARCH SUPPORT, UNIVERSITY OF MICHIGAN BUSINESS SCHOOL, DECEMBER 1996. A FULL BAYESIAN NONPARAMETRIC ANALYSIS INVOLVING A NEUTRAL TO THE RIGHT PROCESS. WORKING PAPER #9612-41. STEPHEN WALKER, IMPERIAL COLLEGE, LONDON, UK, AND PAUL DAMIEN, UNIVERSITY OF MICHIGAN BUSINESS SCHOOL.

A Full Bayesian Nonparametric Analysis Involving a Neutral to the Right Process

Stephen Walker(1) and Paul Damien(2)

(1) Department of Mathematics, Imperial College, 180 Queen's Gate, London SW7 2BZ, UK. (2) Department of Statistics and Management Science, School of Business, University of Michigan, Ann Arbor, 48109-1234, USA.

SUMMARY

Implementation of a full Bayesian nonparametric analysis involving neutral to the right processes (apart from the special case of the Dirichlet process) has been difficult for two reasons: first, the posterior distributions are complex and therefore only Bayes estimates (posterior expectations) have previously been presented; secondly, it is difficult to obtain an interpretation for the parameters of a neutral to the right process. In this paper we extend Ferguson and Phadia (1979) by presenting a general method for specifying the prior mean and variance of a neutral to the right process, providing an interpretation of the parameters. Additionally, we provide the basis for a full Bayesian analysis, via simulation, from the posterior process, using a hybrid of new algorithms that is applicable to a large class of neutral to the right processes (Ferguson and Phadia only provide posterior means). The ideas are exemplified through illustrative analyses.

Keywords: Hierarchical models; Lévy process; Neutral to the right process; Beta-Stacy process; Infinitely divisible law; Latent variables; Gibbs sampler.

1 Introduction

A large section of the Bayesian nonparametric literature is concerned with the construction of prior distributions on the space of probability distributions on (0, ∞). One particular application which has received a great deal of attention is survival analysis. An important and rich class of prior distributions is the neutral to the right process (Doksum, 1974; Ferguson, 1974), which includes the well known Dirichlet process (Ferguson, 1973).

DEFINITION (Doksum, 1974). A random distribution function F(·) on (0, ∞) is neutral to the right (from now on NTR) if for every m and 0 < t_1 < ... < t_m there exist independent random variables V_1, ..., V_m such that

(1 - F(t_1), 1 - F(t_2), ..., 1 - F(t_m)) =_d (V_1, V_1 V_2, ..., V_1 V_2 ... V_m).

If F is NTR then Z(t) = -log(1 - F(t)) has independent increments. The converse is also true: if Z(t) is an independent increments process, nondecreasing almost surely (a.s.), right continuous a.s., with Z(0) = 0 a.s. and lim_{t→∞} Z(t) = +∞ a.s., then F(t) = 1 - exp(-Z(t)) is NTR. For the process Z(·) there exist at most countably many fixed points of discontinuity at t_1, t_2, ..., with jumps J_1, J_2, ... (independent) having densities f_{t_1}, f_{t_2}, .... Then Z_c(t) = Z(t) - Σ_{t_j ≤ t} J_j has no fixed points of discontinuity and therefore has a Lévy representation with log-Laplace transform given by

-log E exp(-φ Z_c(t)) = ∫_0^∞ {1 - exp(-φz)} dN_t(z),

where N_t(·), a Lévy measure, satisfies: N_t(B) is nondecreasing and continuous in t for every Borel set B; N_t(·) is a measure on (0, ∞) for each t; and ∫_{(0,∞)} z(1 + z)^{-1} dN_t(z) < ∞. In short, we have F(t) = 1 - exp{-Z(t)}, with Z an independent increments (Lévy) process.

There has only been limited application involving NTR process priors in the statistics literature (besides the well known Dirichlet process). The reason for this is the difficulty in obtaining anything more than an expression for the posterior expectation of F (Ferguson and Phadia, 1979).

Applications with a full Bayesian analysis have appeared involving the extended gamma process (Laud et al., 1996) and the beta process (Damien et al., 1996), neither of which are NTR processes. A primary aim of this paper is to construct a simple and easy-to-interpret method for specifying the prior mean and variance of a random distribution function. We develop such a method for NTR processes in Section 2. A second aim is to present a new hybrid sampling based approach for making inferences on the posterior NTR process; in particular we provide an algorithm for sampling F[a, b) from the posterior distribution.

2 Prior and Posterior Distributions

Prior Specifications

Central to any nonparametric Bayesian analysis is the ability to use available prior information to obtain expressions for the mean and variance of the random distribution function F(t). For the Dirichlet process it is only the mean of F, say F_0, which can be modelled arbitrarily, since, given F_0,

var F(t) = F_0(t){1 - F_0(t)} / (c + 1)

for c > 0. The parameter of the Dirichlet process is then defined as α(·) = c F_0(·). Ferguson and Phadia (1979) point out that for other NTR processes the interpretation of the prior parameters of the process tends to be nebulous. (We return to this later.) Noting from the previous section that any NTR process is defined by a Lévy process, our goal is to describe the prior mean and variance of the random distribution function in terms of the Lévy measure that characterises the process. Consider then the expressions for -log{E S(t)} and -log{E[S²(t)]}, where S(t) = 1 - F(t). Using the Lévy representation in Section 1 (with no fixed points of discontinuity) it follows that

μ(t) = -log{E S(t)} = ∫_0^∞ (1 - e^{-z}) dN_t(z),

λ(t) = -log{E[S²(t)]} = ∫_0^∞ (1 - e^{-2z}) dN_t(z).
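These two identities are easy to check numerically for any concrete Lévy process. As an illustration (ours, not from the paper), take Z(t) to be a standard gamma process, so that Z(t) ~ gamma(t, 1); then E S(t) = E exp{-Z(t)} = 2^{-t} and E[S²(t)] = 3^{-t}, giving μ(t) = t log 2 and λ(t) = t log 3, which also respects the ordering μ < λ < 2μ since log 3 < log 4. A minimal Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 1.5
z = rng.gamma(shape=t, scale=1.0, size=200_000)  # Z(t) ~ gamma(t, 1): a nondecreasing Levy process
s = np.exp(-z)                                   # S(t) = 1 - F(t) = exp{-Z(t)}

mu_hat = -np.log(s.mean())        # Monte Carlo estimate of mu(t) = -log E S(t)
lam_hat = -np.log((s**2).mean())  # Monte Carlo estimate of lambda(t) = -log E[S^2(t)]

print(mu_hat, t * np.log(2))   # both close to 1.040
print(lam_hat, t * np.log(3))  # both close to 1.648
```

The same check applies to any Lévy process whose marginal Z(t) can be simulated directly.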

Note that it is necessary that 0 < μ(t) < λ(t) < 2μ(t), which derives from the condition {E S(t)}² < E[S²(t)] < E S(t). We want to find a Lévy measure N_t(·) which satisfies these two equations, and in particular we consider Lévy measures of the type

dN_t(z) = dz {1 - e^{-z}}^{-1} ∫_0^t exp{-z β(s)} dα(s),

where β(·) is a nonnegative function and α(·) a measure. This type of Lévy measure characterises beta-Stacy processes, which are considered in greater detail in Walker and Muliere (1995). Our motivation for working with the beta-Stacy process stems from the fact that this process encapsulates virtually all of the NTR processes mentioned in the literature. As examples, the Dirichlet process arises when α(·) is a finite measure and β(t) = α(t, ∞); the simple homogeneous process (Ferguson and Phadia, 1979) is obtained when β is constant.

LEMMA. There exist α(·) and β(·) such that

μ(t) = ∫_0^∞ ∫_0^t exp{-z β(s)} dα(s) dz

and

λ(t) = ∫_0^∞ {1 + e^{-z}} ∫_0^t exp{-z β(s)} dα(s) dz.

PROOF. It is obvious that the first of these conditions is satisfied when ∫_0^t dα(s)/β(s) = μ(t), that is, when dα(t) = β(t) dμ(t). The second condition becomes, using the transformation y = 1 - exp(-z),

λ(t) = ∫_0^t {2 - 1/(1 + β(s))} dμ(s),

giving dλ(t)/dμ(t) = 2 - 1/(1 + β(t)), concluding the proof.

In particular the result allows us to obtain an interpretation for the parameters of a simple homogeneous process, not resolved by Ferguson and Phadia. If β is constant then we have μ(t) = α(t)/β and λ(t) = c α(t)/β, where c = (1 + 2β)/(1 + β).

Example 1. If it is required that E S(t) = exp(-at), a > 0, and E[S²(t)] = exp{-b(t)}, with b(0) = 0, b nondecreasing and b(t) → ∞ as t → ∞, then β(t) = 1/(2 - a^{-1} db/dt) - 1 and dα(t) = a β(t) dt. Note that here we need the condition 0 < a < db/dt < 2a, which corresponds to the necessary condition 0 < μ(t) < λ(t) < 2μ(t). (Note also that the two conditions are in general impossible for the Dirichlet process, which has the extra constraint β(t) = α(t, ∞).)

Example 2. Suppose we wish to center, up to and including second moments, the nonparametric model on the parametric Bayesian model given by S(t) = exp(-θt) with θ ~ gamma(p, q). Then we would have μ(t) = p log(1 + t/q) and λ(t) = p log(1 + 2t/q), giving β(t) = q/(2t) and dα(t) = pq dt/{2t(q + t)}. Firstly, since β(t) ≠ α(t, ∞), the Dirichlet process is not applicable. Secondly, one could use different gamma(p, q) densities along the time axis. Thirdly, some other parametric model could be chosen, resulting, of course, in different forms for α(t) and β(t). All these allow for greater modelling flexibility.

Example 3.

For a two-parameter Weibull 'centering' we have E S(t) = exp{-(at)^b}, leading to μ(t) = (at)^b. Since we require μ < λ < 2μ it seems appropriate to take λ(t) = c(at)^b for some c ∈ (1, 2). It is now easy to see that β(t) = β = (c - 1)/(2 - c) and dα(t) = β a^b b t^{b-1} dt. Here we have a simple homogeneous process. We could consider uncertainty in c via a prior distribution on (1, 2), which would result in a mixture of simple homogeneous processes.

Thus, our method allows one to specify the mean and variance for F, which is not, in general, possible for the Dirichlet process or other well known NTR processes. Ferguson and Phadia (1979, Remarks 1 and 2) discuss the problems associated with interpreting the mean and variance of F(t). They rely on such ideas as prior sample size and strength of belief when considering uncertainty in the centering of F, and conclude they are not satisfactory. In fact, as a consequence of the Lemma, their assertion that c = α(0, ∞) represents a prior sample size, which tends to become noninformative as c → 0, is questionable. For if β(t) = α(t, ∞) = c F_0(t, ∞) then dμ(t) = dF_0(t)/F_0(t, ∞) and, as c → 0, it is clear from the Lemma that λ(t) → μ(t), implying that E[S²(t)] → E S(t). (See, also, Sethuraman and Tiwari (1982) for further insight on this point.)

We next provide details of the posterior distribution of an NTR process.

Posterior Distributions

THEOREM 1 (Ferguson, 1974). If F is NTR and X_1, ..., X_n is a sample from F, including the possibility of right censored observations (where X_i represents the censoring time if applicable), then the posterior distribution of F given X_1, ..., X_n is NTR.

The prior distribution for Z(·) is characterised by M = {t_1, t_2, ...}, the set of fixed points of discontinuity, with corresponding densities {f_{t_1}, f_{t_2}, ...} for the jumps, and N_t(·), the Lévy measure for the part of the process without fixed points of discontinuity. The posterior distribution is now given for a single observation X.
The case for n observations can then be obtained by repeated

application. In the following, we assume the Lévy measure to be of the type

dN_t(z) = dz ∫_{(0,t]} K(z, s) ds

(the beta-Stacy process with parameters α(·) and β(·) arises when K(z, s) ds = (1 - e^{-z})^{-1} exp{-z β(s)} dα(s)).

THEOREM 2 (Ferguson, 1974; Ferguson and Phadia, 1979). Let F be NTR and let X be a random sample from F.

(i) Given X > x, the posterior parameters (denoted by a *) are M* = M,

f*_{t_j}(z) = c e^{-z} f_{t_j}(z) if t_j ≤ x,
f*_{t_j}(z) = f_{t_j}(z) if t_j > x,

and K*(z, s) = exp{-z I(x > s)} K(z, s) (here c is the normalising constant).

(ii) Given X = x ∈ M, the posterior parameters are M* = M,

f*_{t_j}(z) = c e^{-z} f_{t_j}(z) if t_j < x,
f*_{t_j}(z) = c (1 - e^{-z}) f_{t_j}(z) if t_j = x,
f*_{t_j}(z) = f_{t_j}(z) if t_j > x,

and, again, K*(z, s) = exp{-z I(x > s)} K(z, s).

(iii) Given X = x ∉ M, the posterior parameters are M* = M ∪ {x}, with

f*_x(z) = c (1 - e^{-z}) K(z, x),
f*_{t_j}(z) = c e^{-z} f_{t_j}(z) if t_j < x,
f*_{t_j}(z) = f_{t_j}(z) if t_j > x,

and, again, K*(z, s) = exp{-z I(x > s)} K(z, s).

With this general characterisation of a posterior NTR process, a procedure for sampling the posterior process is developed in the next section.

3 Simulating the Posterior Process

We are now concerned with simulating from the posterior distribution of F[a, b). Since F(t) = 1 - exp{-Z(t)} we have

F[a, b) = exp{-Z(0, a)} [1 - exp{-Z[a, b)}]

and note that Z(0, a) and Z[a, b) are independent. Therefore we only need to consider sampling random variables of the type Z[a, b). This will involve sampling from the densities {f*_{t_j} : t_j ∈ [a, b)} corresponding to the jumps in the process and, secondly, from the continuous component, the random variable Z_c[a, b). The types of densities corresponding to the jumps, at points in M*, are defined, up to a constant of proportionality, by

f*(z) ∝ (1 - exp(-z))^ν exp(-λz) f(z), with ν, λ ≥ 0 (integers),

and

f*(z) ∝ (1 - exp(-z))^ν exp(-λz) K(z), with ν ≥ 1, λ ≥ 0 (integers),

where the subscripts t_j have been omitted.

Simulating the jump component

The second of these two types of densities for the random variable corresponding to the jump in the process will most likely be a member of the SD family of densities (Damien et al., 1995), and algorithms for sampling from such densities are given in Walker (1995) and Damien and Walker (1996). However, the first of these two types can be handled by introducing latent variables, whose densities are of the form described in Damien and Walker (1996), and setting up the conditional distributions required to implement a Gibbs sampler (Smith and Roberts, 1993). Define the joint density of z, u = (u_1, ..., u_ν) and v by

f(z, u, v) ∝ exp(-λv) I(v > z) {Π_{l=1}^{ν} exp(-u_l) I(u_l < z)} f(z).

Clearly the marginal density of z is as required. The full conditional densities for implementing Gibbs sampling are given by

f(u_l | u_{-l}, v, z) ∝ exp(-u_l) I(u_l < z), for l = 1, ..., ν,

f(v | u, z) ∝ exp(-λv) I(v > z),

and

f(z | u, v) ∝ f(z) I(z ∈ (max_l{u_l}, v)).

So, provided it is possible to sample from f(z), this algorithm is easy to implement. In the next section, we show that, for all the well-known NTR

processes, simulating from f(z) is trivial.

Simulating the continuous component

The random variable Z_c = Z_c[a, b) corresponding to the continuous component is infinitely divisible (id) with log-Laplace transform given by

-log E exp(-φ Z_c) = ∫_0^∞ ∫_{[a,b)} {1 - exp(-φz)} exp{-z Σ_i I(x_i > s)} K(z, s) ds dz.

There are two algorithms for simulating an id random variable with a given Laplace transform: Damien et al. (1995), which relies on the characterisation of an id random variable as the limit of sequences of sums of Poisson-type random variables, and Walker (1996), which uses the fact that Z_c =_d ∫ z dP(z), where P(·) is a Poisson process with intensity measure dz ∫_{[a,b)} K*(z, s) ds. The Walker (1996) method is described in some detail in the Appendix, since it is an improvement on the Damien et al. algorithm, and hence was used to analyse the example data in the next section.

In summary we have:

1. Specify the prior mean and variance of a neutral to the right process F(t) via the mean and variance of its corresponding Lévy process using, say, a hierarchical parametric model.

2. The posterior process is given by Theorem 2. This process is comprised of a jump component and a continuous component. The jump components are random variables that can be simulated via latent variable substitutions and, if necessary, a Gibbs sampler.

3. The continuous component for any NTR process always has an infinitely divisible distribution. Simulating from such a distribution is described in the Appendix.

4 Illustrative Analysis

Walker and Muliere (1995) developed the beta-Stacy process and showed that it is a generalisation of the Dirichlet process. Unlike the Dirichlet process,

the beta-Stacy process is conjugate to arbitrarily right censored data, which commonly arise in survival analysis. Moreover, Walker and Muliere prove that if the prior process is Dirichlet, then the posterior process is beta-Stacy if the data include right censored observations. Here these results are used to model exact and right-censored data. The Kaplan-Meier (1958) data were chosen because Ferguson and Phadia (1979) examined these data previously, thus offering a basis for comparison. The data consist of exact observations at 0.8, 3.1, 5.4 and 9.2 months and right censored observations at 1.0, 2.7, 7.0 and 12.1 months.

Example 1

We follow Ferguson and Phadia and take β(s) = exp(-0.1s) and dα(s) = 0.1 exp(-0.1s) ds. The prior therefore is a Dirichlet process but, with the inclusion of censored observations within the data set, the posterior process is not Dirichlet but a beta-Stacy process. We take M = ∅ a priori; i.e., there are no jumps in the prior process. Then, given the data, the posterior parameters are given by M* = {X_i : δ_i = 1}, where δ_i = 1 indicates that X_i is an exact observation,

f*_x(z) ∝ (1 - e^{-z})^{N{x}-1} exp[-z{β(x) + Y(x) - N{x}}],

where x ∈ M*, N{x} = Σ_{X_i = x} δ_i and Y(x) = Σ_i I(X_i ≥ x). A striking consequence of the beta-Stacy process model in this first illustration is that if Z, the jump random variable at x, has density f*_x(·), then Z =_d -log(1 - Y), where Y ~ beta(N{x}, β(x) + Y(x) - N{x}). Moreover, if N{x} = 1 then Z has an exponential density with mean value 1/{β(x) + Y(x) - 1}. From a simulation perspective, this obviates the use of both the latent variable steps required for simulating the jump random variable, and a Gibbs sampler. The continuous component of the posterior process has Lévy measure given by

dN*_t(z) = dz (1 - e^{-z})^{-1} ∫_0^t exp[-z{β(s) + Y(s)}] dα(s).
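As a quick check on this beta representation of the jumps (our own sketch, using the quantities just derived): at x = 0.8 we have N{x} = 1, Y(x) = 8 and β(x) = exp(-0.08), so Z = -log(1 - Y) with Y ~ beta(1, exp(-0.08) + 7) should match the exponential density with mean 1/{exp(-0.08) + 7}:

```python
import numpy as np

rng = np.random.default_rng(1)

beta_x = np.exp(-0.08)  # beta(0.8) = exp(-0.1 * 0.8) under the Example 1 prior
n_x, y_x = 1, 8         # N{0.8} = 1 exact observation, Y(0.8) = 8 at risk

# Jump at x = 0.8 via the beta representation
# Z = -log(1 - Y), Y ~ beta(N{x}, beta(x) + Y(x) - N{x}).
y = rng.beta(n_x, beta_x + y_x - n_x, size=100_000)
z = -np.log1p(-y)

target_mean = 1.0 / (beta_x + y_x - 1.0)  # exponential mean claimed in the text
print(z.mean(), target_mean)              # both close to 0.126
```

When N{x} > 1 (tied exact observations) the exponential simplification is lost, but the beta representation still avoids the latent-variable Gibbs step.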

We note therefore that the posterior process is also a beta-Stacy process. We obtain samples from the posterior distribution of F[0, 1). This will involve sampling Z_c[0, 0.8) and Z_c[0.8, 1), which is done using the algorithm of Walker (1996), and sampling J_{0.8} from the density f*_{0.8}(·), the exponential density with mean 1/{exp(-0.08) + 7}. A required sample from the posterior distribution of F[0, 1) is then given by 1 - exp{-Z_c[0, 0.8) - Z_c[0.8, 1) - J_{0.8}}. We collected 1000 samples from the posterior and the resulting histogram representation with kernel density estimate is given in Figure 1. The mean value is 0.12, which is the (exact) value obtained by Ferguson and Phadia.

Example 2

We also reanalyse the data set using the 'base' hierarchical centering model given in Section 2. Here, with p = q = 1, in an attempt to be noninformative with the gamma parametric prior, we have β(t) = 1/(2t) and dα(t) = dt/{2t(1 + t)}. Again, we collected 1000 samples from the posterior, and the resulting histogram representation with kernel density estimate is given in Figure 2. We compare this with the posterior distribution of the parametric Bayes model, namely F[0, 1) = 1 - exp(-θ) with θ ~ gamma(1 + 4, 1 + 41.3). 1000 samples from the posterior were collected and the resulting histogram representation with kernel density estimate is given in Figure 3. The distributions are fundamentally different even though the mean values are identical at 0.11.

5 Conclusions

Neutral to the right processes form a wide class of prior distributions for the space of distribution functions. An intuitive way of specifying prior expectations of a random distribution function via hierarchical parametric models was developed. From a practical perspective, this method enables the statistician to try different prior models for the mean and variance of the Lévy measure that characterises all neutral to the right processes.
The form for the posterior process was stated using a general and well-defined Lévy measure. A full Bayesian solution was then obtained by simulating the posterior process using a hybrid of sampling methods, which was exemplified with exact and censored survival data.

Figure 1: Histogram representation with kernel density estimate of the posterior density of F[0, 1).

Figure 2: Histogram representation with kernel density estimate of the posterior density of F[0, 1).

Figure 3: Histogram representation with kernel density estimate of the posterior density of F[0, 1).

Acknowledgements

The authors are grateful to Adrian Smith for helpful discussions and comments on an earlier draft of the paper.

References

Damien, P., Laud, P.W. and Smith, A.F.M. (1995). Random variate generation approximating infinitely divisible distributions with application to Bayesian inference. Journal of the Royal Statistical Society Series B 57, 547-564.

Damien, P., Laud, P.W. and Smith, A.F.M. (1996). Implementation of Bayesian nonparametric inference based on beta processes. Scandinavian Journal of Statistics 23, 27-36.

Damien, P. and Walker, S.G. (1996). Sampling probability densities via uniform random variables and a Gibbs sampler. Submitted for publication.

Doksum, K. (1974). Tailfree and neutral random probabilities and their posterior distributions. The Annals of Probability 2, 183-201.

Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 209-230.

Ferguson, T.S. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics 2, 615-629.

Ferguson, T.S. and Phadia, E.G. (1979). Bayesian nonparametric estimation based on censored data. The Annals of Statistics 7, 163-186.

Kaplan, E.L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association 53, 457-481.

Laud, P.W., Smith, A.F.M. and Damien, P. (1996). Monte Carlo methods approximating a posterior hazard rate process. Statistics and Computing 6, 77-84.

Sethuraman, J. and Tiwari, R. (1982). Convergence of Dirichlet measures and the interpretation of their parameter. In Proceedings of the Third Purdue Symposium on Statistical Decision Theory and Related Topics, Gupta, S.S. and Berger, J. (eds.), pp. 305-315. Academic Press, NY.

Smith, A.F.M. and Roberts, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society Series B 55, 3-23.

Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics 22, 1701-1762.

Walker, S.G. (1995). Generating random variates from D-distributions via substitution sampling. Statistics and Computing 5, 311-315.

Walker, S.G. (1996). Random variate generation from an infinitely divisible distribution via Gibbs sampling. Submitted for publication.

Walker, S.G. and Muliere, P. (1995). Beta-Stacy processes and a generalisation of the Polya-urn scheme. Under revision for The Annals of Statistics.

Appendix

Here we consider sampling an infinitely divisible variable Z with log-Laplace transform given by

-log E exp(-φZ) = ∫_0^∞ {exp(-φz) - 1} dN(z) × (-1) = ∫_0^∞ {1 - exp(-φz)} dN(z),

where dN(z) = dz ∫_A K*(z, s) ds and A = [a, b). Recall that Z =_d ∫ z dP(z), where P(·) is a Poisson process with intensity measure dN(·). For a fixed ε > 0 let Z = X_ε + Y_ε (X_ε, Y_ε independent), where X_ε is the sum of the jumps (sizes) of P(·) in (ε, ∞) and Y_ε is the sum of the jumps (sizes) in (0, ε], with log-Laplace transform given by ∫_{(0,ε]} {exp(-φz) - 1} dN(z). The aim is to approximate a random draw of Z by sampling X_ε. The error variable Y_ε has mean E Y_ε = ∫_{(0,ε]} z dN(z) and variance var Y_ε = ∫_{(0,ε]} z² dN(z). Now ε can be taken as small as one wishes, and it is not difficult to show, for very small ε, that E Y_ε = O(ε) and var Y_ε = O(ε²).

Let Λ_ε = N(ε, ∞) < ∞ and take ν ~ Poisson(Λ_ε). Here ν is the (random) number of jumps in (ε, ∞), and therefore the jump sizes τ_1, ..., τ_ν are taken iid from G_ε(s) = Λ_ε^{-1} N(ε, s], for s ∈ (ε, ∞). Then X_ε is given by X_ε = Σ_{l=1}^{ν} τ_l. To implement the algorithm it is therefore required to (a) simulate from G_ε(·) and (b) calculate Λ_ε. For the beta-Stacy process the density corresponding to G_ε is given, up to a constant of proportionality, by

g_ε(z) ∝ {1 - exp(-z)}^{-1} I(z > ε) ∫_A exp{-z β(s)} dα(s).

(a) To sample from g_ε, define the joint density function of z, s and u by

f(z, s, u) ∝ I(u < {1 - exp(-z)}^{-1}) exp{-z β(s)} a(s) I(z > ε, s ∈ A),

where u is a random variable defined on the interval (0, {1 - exp(-ε)}^{-1}), s is a random variable defined on A, and a(s) ds = dα(s). Clearly the marginal distribution for z is given by g_ε. We now describe a hybrid Gibbs/Metropolis sampling algorithm (Tierney, 1994) for obtaining random variates {z^{(l)}} from g_ε. The algorithm involves simulating a Markov chain {z^{(l)}, s^{(l)}, u^{(l)}} such that, as l → ∞, z^{(l)} can be taken as a random draw from g_ε. The algorithm is given, with starting values {s^{(1)}, u^{(1)}}, by:

(i) take z^{(l+1)} from the exponential distribution with mean value 1/β(s^{(l)}) restricted to the interval (ε, -log{1 - 1/u^{(l)}}) if u^{(l)} > 1 and to the interval (ε, ∞) if u^{(l)} ≤ 1;

(ii) take u^{(l+1)} from the uniform distribution on the interval (0, {1 - exp(-z^{(l+1)})}^{-1});

(iii) (for example) take s* from a(·)/α(A) on A and r from the uniform distribution on the interval (0, 1); if r < min{1, exp(-z^{(l+1)}[β(s*) - β(s^{(l)})])} then s^{(l+1)} = s*, else s^{(l+1)} = s^{(l)}.

(b) To calculate Λ_ε, define the joint density function of z and s by

f(z, s) ∝ exp{-z β(s)} a(s) I(z > ε, s ∈ A),

and let the right-hand side of this expression be h(z, s). Then

Λ_ε = ∫∫ {1 - exp(-z)}^{-1} h(z, s) dz ds,

so

Λ_ε = E_f[{1 - exp(-z)}^{-1}] × ∫∫ h(z, s) dz ds.

Now E_f[{1 - exp(-z)}^{-1}] can be obtained from L^{-1} Σ_{l=1}^{L} {1 - exp(-z^{(l)})}^{-1}, where {z^{(l)} : l = 1, 2, ...} are obtained via a Gibbs sampling algorithm which

involves sampling from f(z|s), which is the exponential distribution with mean value 1/β(s) restricted to the interval (ε, ∞), and sampling from f(s|z), which is done as in (iii) above. Finally,

∫∫ h(z, s) dz ds = ∫_A {exp(-ε β(s))/β(s)} dα(s),

so Monte Carlo methods, if an analytic expression is not available, can be used to calculate this term (it should also be adequate to take this to be ∫_A dα(s)/β(s), since exp(-ε β(s)) ≈ 1 for small ε).
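The truncation scheme above (ν ~ Poisson(Λ_ε), jump sizes iid from G_ε, X_ε their sum) can be sketched end-to-end for a Lévy measure with closed forms. The illustration below is our own and uses the gamma-process Lévy measure dN(z) = θ z^{-1} e^{-z} dz rather than the beta-Stacy one, so that Λ_ε = θ E_1(ε) (E_1 the exponential integral, scipy.special.exp1) and G_ε can be sampled by simple rejection instead of the hybrid Gibbs/Metropolis chain; X_ε should then be an approximate gamma(θ, 1) draw.

```python
import numpy as np
from scipy.special import exp1

rng = np.random.default_rng(2)
theta, eps, n_draws = 2.0, 0.01, 5_000

lam = theta * exp1(eps)                  # Lambda_eps = N(eps, infinity)
counts = rng.poisson(lam, size=n_draws)  # nu: number of jumps above eps, one per draw
n_jumps = counts.sum()

# Rejection sampling from G_eps (density proportional to exp(-z)/z on (eps, inf)):
# propose z = eps + Exp(1), i.e. proportional to exp(-z), and accept with
# probability eps/z, the target-to-proposal ratio scaled by its bound 1/eps.
jumps = np.empty(0)
while jumps.size < n_jumps:
    z = eps + rng.exponential(size=8 * n_jumps)
    jumps = np.concatenate([jumps, z[rng.uniform(size=z.size) < eps / z]])
jumps = jumps[:n_jumps]

# X_eps for each draw is the sum of its own jumps.
idx = np.repeat(np.arange(n_draws), counts)
x = np.bincount(idx, weights=jumps, minlength=n_draws)

print(x.mean(), x.var())  # both close to 2.0, the gamma(2, 1) moments
```

The discarded jumps in (0, ε] bias the mean down by roughly θε = 0.02 here, the price of the truncation.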