I-Aw r ~ e n c i * iS X5 a I 'X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~' z~ ~~-"i ' * o; 3 " * p-I^I - 9- -O Z Ii El % V^ " s s 0 " 2 2 CA -~~~~~~~~O Co pw.~ N z3 "eb r < o _. 2 U) U a' to z A l a 5'** < Z In~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~d~r c ~ ~ ~~~~~ ~ Sx 2 * > ~~~~~~~~~~? a2 J~~~~~~~~~~~~~,~~~~~~~~~~~~~~~~~~~ 0 -y C~~~~~~~~~~~~~~~ (S)~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ r2 ~d":~ go~ IY cN:*~""-':~:y;i~:..~h~So,', a <?W~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ \A ~~~~~~~~'~~ ~ 0~\ 2i' -I ~a. ~":" i iI:i~:; ~:I ~:o~ f~ul, c~:V1~~.~$s iC A02 ~,~~LI E~T.i;~~~,~r "~l~isi pb,~~'*o -*-=~~ r-~~ '"~.: i:a 'Cn i", "i ~ ~~~~-dC CA ~r~.-:~ ii; ~~ ~~"`i~~~~ ~~~~p.~~~b~?T ";. I'~gi ~ ~ ~ ~ ~ ~ ~ C CO ~ ~ ~ ~ '',t-a-I; —;ia 2~~~.4: i~ C.4 ~ ~ ~ ~.s;at~~-L:i:.-, ~:~~~ w ~ =Lik"~l,~r~.u~?"a~, 't' o eD uP, i C~

f I 6 19^. KRESGk 3us. ADM. EfltM,~ RESEARCH SUPPORT UNIVERSITY OF MICHIGAN BUSINESS SCHOOL DECEMBER 996 BAYESIAN NONPARAMETRIC AND COVARIATE ANALYSIS OF FAILURE TIME DATA WORKING PAPER #9612-38 PURUSHOTTAM W. LAUD MEDICAL COLLEGE OF WISCONSIN AND PAUL DAMIEN UNIVERSITY OF MICHIGAN BUSINESS SCHOOL AND ADRIAN F.M. SMITH IMPERIAL COLLEGE, LONDON, UK

Bayesian nonparametric and covariate analysis of failure time data By Purushottam W. Laud Div'ision of Bio.statistics. Medical College of Wisconsin. Milwaukee. WI 53226-0509. U.S.A. Paul Damien Department of Statistics and Management Science. CUnit ersity of Michigan. Ann.4rbor. MI 48109-1234. U.S.A. and Adrian F.M. Smith Department of Mathematics. Imperial College. London SW7 2BZ. U.K. SUMMARY A Bayesian analysis of the semi-parametric regression model of Cox (1972) is given. The cumulative hazard function is modelled as a beta process. The posterior distribution of the regression parameters and the survival function are obtained using a combination of recent Mlonte Carlo methods. An illustrative analysis within the context of survival time data is given. Some key words: Infinitely Divisible Distributions: Laplace Transform; Beta Process: Hazard Function: Latent Variables; Cox Model: Gibbs Sampler. 1 Introduction Cox (1972) proposed a model for survival time data in the presence of covariates. To state it. let T represent the time to the event of interest for an individual and z = (i.... )' the vector of covariates. Then the survival distribution of T is taken to be P(T > tjz) = exp{-H(t)exp(z'3)} (1) where H(t) = fo h(s)ds is the baseline cumulative hazard function (CHF), h(.) the corresponding hazard rate and,3 = (3?,..., 3p)' is the vector of regression coefficients. This 1 f'

model is often called the proportional hazards model since the ratio of hazard rates of two individuals with differing z's is constant in time. Its use and success are now so widespreadt [see. for example. Klein an,. M.\oeschberger (1996) and the references therein] as to make it. the primary tool in the analysis of lifetime data. In his original article Cox used a factor of the full likelihood. which he later justified and termed partial likelihood in Cox (1975). to estimate 3. IKalbfleisch (1978) considered a Bayesian analysis using a gamma process prior on the logarithm of the baseline survival function and Clayton (1991) discussed a Monte Carlo method for related frailty models. Designed for use in various survival analysis problems, the beta process was advanced by Hjort (1990) as a prior on the space of CHF's. He also briefly treated the Cox model and gave some descriptions of the resulting posterior distributions. In this paper. we follow Hjort (1990) and model the cumulative hazard using a beta process. We give a full Bayesian solution in that the posterior distributions of the baseline CHF H(t) and the regression coefficient vector 3 are obtained. This is accomplished by using and extending a combination of Monte Carlo methods described in Smith and Roberts (1993). Damien et.al (1995) and WTalker (1995). By straighitforward transformations, samples obtained from the full posterior distribution also allow inference for other relevant qucantities such as the baseline survival function, the relative risk and survival probabilities at given values of covariates and times. Section II beloxw describes the use of the beta process with the Cox model and the attendant posterior results available from Hjort (1990). In Section III, we develop the computational form of the model and give details for its implementation. Section IN provides an illustrative analysis of the leukemia data previously analysed, using Bayesian methods. by IKalbfleisch (1978). Some extensions are discussed inSection V. 2 Cox Model with Beta Process Prior Much of nonparametric Bayesian inference has proceeded by modelling the unknown cumulative distribution function (CDF) as a stochastic process. One of the exceptions appears in Hjort (1990) where an alternative is introduced by placing a (beta) stochastic process on the space of cumulative hazard functions. This has several attractive features: it offers a broader class of models in the context of life history data; there is an accessible interpretation of the prior process in terms of hazard functions; and it provides a natural generalisation of neutral to the right processes, such as the Dirichlet (Ferguson. 1973). In the discussion that follows, we will consider the general time-continuous version of the beta process. Let T be a random variable with CDF F(t) = Pr(T < t) on [0. x) I2

and F(0) = 0. The CHF for F or T is a nonnegative. nondecreasing. right continuous function.4 onl [0. Xc). As in Hjort (1990) w-e use the symbol A ill three different wa-s:.4i. 1).4A(t) and A.4{t} denote the increment of the function 4 in the interval [a. b). t lhe value of the-function.4 at a point t. and the increment of the function 4 at t, respectively. Now let.4 be such that it satisfies 1.4A(s) =. [s. s + cds) = Pr {T C [.s. s + d.s)T >. s} =-F(.)/F[s. x ) so that.4b) = dF(s) A[b') -= b F[s. x) J[, F[.S. Recovering F from.4 requires product integrals (Gill and Johansen. 1990) as F(t) = 1I- { - dA(.s)}. t > 0. [0.t] If F' is continuous, this can be shown to r,,' -ce to the usual formula A(t) = -log(1 - F(t)). In general. however. F(t) does not equal 1 - exp{-A(t)}. As explicated in Andersen et al.(1993), this definition of the cumulative hazard function unifies the treatment of discrete. continuous and mixed random variables while retaining the appropriate meaning of hazard as used in the actuarial and survival analysis literature. While we omit the mathematical details of its construction. the beta process has independent increments that, infinitesimally speaking, have beta distributions. More precisely. it is a Levy process with Levy measure concentrated on [0,1]. Using this process as a prior on the space of cumulative hazard functions, the posterior distribution is. in the absence of covariates, again a beta process. Relevant definitions and results are stated in the Appendix. Full Bayesian inference using the Gibbs sampler for this situation is given in Damien et al (1996). To include covariates as prescribed in the Cox model (1). let Ai and zi denote the cumulative hazard function and the covariate vector for individual i, i = 1, ~. n. Then. - dA,(s) {1 -dA(s)}e-P(z:2). Suppose Xi.*.. n, are the times to the event of interest for the 'n individuals and the data are in the form ti = min(Xi. ci), 5 = I(Xi < ci), where I denotes the indicator function. The cis here are censoring times assumed to be independent of the event times. 3

It Now the discussion inI Section 6 of Hjort (1990) leads to the following likelihood for.4 and 3: C.A;-.3: dIt},,) = -I {1 -(. 4(.)} m, K{1 -.4{t}}-e'/(z) - 1 ( 2) [.2c } where R(.s..3) = Ej=> exp(z'.3)I(tj > 4). Using a beta process prior.4 beta(c(.)..4o(.)) for A (see Appendix for notation) and an independent prior-for 3. the conditional posterior of A given 3 and the data is a beta process between points where actulal event times are observed. At such event times. there are positive jumps in A with distributions f.4{, (a) x-(l,- )c(s)R(d(s.3)- )-lnt =1 {1 - di(s)(1 - t)e'P(z)} (3) where cli(s) = I(ti = s. 6i = 1) and d(s. 3) = EZ=> di(s)exp(z'73). A prior specification for 3 and the likelihood in (2) above suffice to yield a conditional posterior for 3 given -A. In the next section we employ these two conditionals and several auxiliary variables to develop a Gibbs sampler. 3 The Computational Model Let A = (A1 A -) denote increments in the A(Q) process between the ti's; i.e.. the lgrid is taken to be all distinct censoring or event times in the data. Call these time points.si < s., < -.. < sx. among which are event times x1 < xr2 <... < xm occuring with multiplicities k1. k.. *.. km repectively. Let U = (L1.. Urm) stand for the jumps in A(*) at these x's. Finally, let i(j) be the index of the jt event observation at xi, j = 1,. k in some fixed order for each i = 1,., mi. This means that i(j) is.a unique number between 1 and n such that ti(j)- xi and i(j) = 1. With this notation, the Gibbs sampler proceeds by simulating from [A. U.13. data] and [.31 A. U. data]. As outlined in the previous section, [AJ[3, data] is the distribution of the increments of a beta process wvith parameters {c(-) + R(-,,3)} and f') C(s5)d(s) The components of A are independent. each with an infinitely divisible distribution. Simulation of these is detailed in Damien et al.(1996). Moreover, U and A are independent, U having independent components with distributions f(. ') x j[(1 - u )c(x,)+R(l,,)-{ZJ lexp(z%)3)}-ll, {1 - (1 - 1 )exP(z'(J).3)} Simulation from such a distribution can be effected using auxiliary variables. Transforming a to v = - log(1 - u). the task is to simulate from [t] x exp(-avc)(1 - exp(-t))-n1' (1 - exp(-bi )). 4

With y a geometric variate and tLi truncated exponeiltials given by [y.lt] = (1 - erxp(-c))exp(-'.). - 0. 1.. and [u'l i,] = b-i e.p(-bti L,)(1 - exp(-bxc))- 1(o.')(t). we get [,jt,'..] x Lt e. p{-t(a + + + btt,)} i=1 which is a gamma variate. It only remains to simulate 31JA. U. data. Adapting the likelihood (2) to the notation of the current section yields [31-. U. data] x [3] {Hn;= (1- Ax,)R )} rWm1 L1 )R(xi.3)-d(x.i)nk {l - ( - ti)xp(z )}.xhere [3] is the prior distribution of the regression coefficients. Using L = - log(l - ui) and incorporating the conditionals of the auxiliary variables yi, ij, i - 1.-.m.j = 1-.., ki. we arrive at [31 A. V. V }I.t' data] x [.3]exp R(si, 3) log(1 - Ai) Li=1 Mr ki m ki - i R(i,3)-) -(1 - ij)ep(zx() 3) +E E zi(j) i=31 j=1 z J i=1j=1 It is straightforward to show that this is logconcave in each component of 3 if the prior is so. leading to efficient simulation of 3 by the algorithm of Gilks and Wild (1992). To summarize. we use [Y13, A, Vl W, data] oc Independent Geometrics [WI3. A. V, Y, datal] x Independent Truncated Exponentials [V\13. A, TV, Y. data] x Independent Gammas [A|3. 1;, V1; i, data] oc Independent Beta Process Inc'rements [31 A.. W, 1W. data],x Comnponentwise Logconca ve as the full conditionals in a Gibbs sampler.,

'.4 Table 1: Leukemia Remission Time Data Group 0 (drtg) 6*. 6. 6. 6. 7. 9. 10*. 10. 11. 13.16.17'. 19*. 20*. 22, 23.! i~'* 32. 32*. 32. 34U* 3' *Censored i Group 1 (placebo) 1. 2. 2.3.4.4.55. S.8..8. 11.11.12.12.15.1 7. 22. 23 4 Illustrative Analysis To demonstrate the types of inference one can make using the techniques of this article. we consider the leukemia data analyzed by Cox (1972) and Kalbfleisch (1978). aniong others. These data. listed in Table 1 as reported by these authors, consist of remission times (in weeks) of leukemia patients assigned to treatment with a drug or a placebo cluring remission maintenance therapy. Although Andersen et al.(1993) point out that the data actually consisted of 21 pairs matched by status of initial remission attained, we have chosen to disregard this aspect here in view of the present purposes of illustration and comparison. In the analyses below, the prior distribution for,3 is taken to be normal with mean 1.5 and standard deviation.5.0. Runs with varying means in the range -5.0 to 10.0 gave similar results not reported here. The prior for the baseline cumulative hazard function is a beta process with 40(t) = 0.0.5t and c(t) = ke-~ ~0t. This simple choice corresponds to a prior mean survival function that is exponential with mean 20 while the prior variability can be controlled via the number k. Our illustrative calculations use a wide range for k from 0.01 to 100.0. The beta process, however, admits a much richer expression of prior opinion as designed in Hjort (1990) and demonstrated in Damien et al.(1996). One can specify different time-dependent degrees of prior variance in A(t) since c(t) can be interpreted as the number at risk at t in an imagined prior sample. Such flexibility is especially welcome in constructing prior specifications from past data sets and reported studies. With these priors and the data in Table 1 the Gibbs sampler of the previous section was implemented and run with each setting of k, saving 2000 iterations after discarding the first 3000. The results for,3 are summarized in Table 2. Values of k near zero represent little prior input whereas large values correspond to a strong prior belief that the baseline hazard is that of an exponential distribution with mean 20. These posterior summaries are remarkably stable over a large range of k, a phenomenon observed also by lKalbflieisch (1978) regarding the posterior mode and an approximation to the posterior standard deviation when using a gamma process prior. In Figure 1(a) we show the 6

Table 2: Posterior Summaries for 3 Summary 0.01 0.10 1.0 10.0 100.0 Mean 1.69 1.71 1.71 1.67 1.62 1 Standard deviation 0.438 0.421 0.422 0.419 0.381 Central 90V1 limits (0.99.2.43) (1.06.2.44) (1.02.2.42) (0.98.2.36) (0.97.2.25) posterior distributions of 3. For comparison with results based on maximizing the partial likelihood, calculations wxere carried out using four different methods including those advanced by Efron (1977) and IKalbfieisch and Prentice (1980) to accomodate tied observations. Details of these are given, for example. in Section 8.3 of Klein and Moeschberger (1996). Estimates of 3 varied from 1.51 with the method of Bresloxv (1974) to 1.63 by that of Cox (1972). The discrepancy is about 28% of the standard error. In contrast, the Bayesian analysis in this paper uses the full likelihood and calculates the appropriate posterior, with or without multiplicities. The MIonte Carlo error in the posterior mean of,3 is less than 0.01. Figure l(b) shows posterior distributions of the more easily interpreted hazard ratio of placebo to drug treatment. These are readily obtained by transformation of the posterior samples of,3. The posterior mean of the baseline survival function and pointwise 90% limits for it are plotted in Figure 2. As k increases, the shape approaches that of the exponential as chosen in the prior. Values of k upto 10 give remarkably stable posteriors except in the right tail wvhere a different c(.) function could be chosen to express higher prior uncertainty. In general, such stable ranges of k will depend on the size of the data. Figure 3 demonstrates the types of inference that can be easily obtained from the full posterior samples. In addition to the mean survival functions for the two groups, part (b) shows how all parameter uncertainty can be propagated to useful quantities such as the difference in the survival functions. A more detailed description at t = 10 weeks is shown in part (c). It clearly conveys the effectiveness of the drug in maintaining remission past 10 weeks. The calculations for this analysis were performed on a Hewlett-Packard 9000 Series 700 workstation using code written in SAS-IML. Generation of 5000 posterior samples was accomplished in 28 minutes. Code in FORTRAN or C could be expected to speed this up considerably. Convergence diagnostics not reported here showed a burn-in of 1000 would suffice. 7

5 Conclusion A semiparanietric model using Cox regression and the beta process was described and exemplifiecL. The complex form of the posterior dlistribution. following latent varial)le transformations. was shown to reduce to a computational model that could be sampled easily. The beta process is a rich family of prior distributions on the >-pace of cumulative hazards with easy to interpret parameters. However, in some contexts. it may 1)e necessary or easier to modlel the hazard rate nonparametrically. One class of prior distributions for hazard rates is called the extended gamma process (Dykstra and Laud. 1981). A semiparaimetric analysis similar to the one in this paper for increasing. decreasing and U-shaped hazard rates can be developed. In another direction. the methods of this paper can be adapted for use with various frailty models. These developments will be reported elsewhere. ACKNOWLEDGEMENTS WVe thank Nils Hjort for personal communication clarifying the distribution of jumps in the beta process and Craig Cumbus for SAS-IMIL code for the Gilks-Wild algorithm. Research reported here was partially funded by a grant from the National Science Fouindation. APPENDIX Beta Process Summary With Beta(ca, 3) denoting the beta distribution proportional to x'-'(1- x)'-' on [0, 1]. we have, Definition: Let Ao be a cumulative hazard function with a finite number of jumps taking place at t1, t2,.-, tm and let c(.) be a piecewise continuous, nonnegative function on [0. cc). A process A is called a beta process with parameters c(-) and.40(.), denoted A -beta(c(.), A0o()), if it has Levy representation E(e-0.(,))= t t E(e-Os)] exp {- (- (1e-)dLt(s)} j:tj<t I 0 8

with S =.4{tJ} ~ Beta(c(t)Ao{tj}. c(t)(1 -.4o{t })) alldl (dLt(s) = [c(_)s-'(1 - s)C (/Ao.()] ds for t > 0 and 0 < s < 1. where Ao.c(t) = A0(t) - tJ<t-4{t}-.40 is the prior guess at the cumulative hazard. and c(t) can be interpreted as the number at risk at t in an imagined prior sample. The above definition implies that the beta process has independent increments and at fixed points of discontinuity the increments have beta distributions. More general choices of dLt(s) are possible with attendant results given in Hjort (1990) and simulation techniques in Damien et al.(1996). Let XN1... X, be independent and identically distributed with cumulative hazard function.4, and.4 beta(c(.), A4o(.)). Suppose (Ti, 61),. (T2, 6n) is observed. where Ti = min2(X. ci), 6i I{Xi < ci} and c1,". cn are censoring times. Define the counting process N and the left-continuous a. risk process Y as 11 n -v(t) = {Ti < t & = 1}, Y(t) = E{Ti > t}, i=1 i=1 where I is the indicator function. Then p\( c(s)dAO(s) + dN(s).4J(Ti. 61). (Tn, n6) beta {c(.) + Y(.), c(s) + Ys) The posterior process contains fixed points of discontinuity even if the prior does not. These extra points occur at observations with 6i = 1. The distribution of any jump is given by A{t}ldata - Beta(c(t)Ao{t} + dN(t), c(t)(1 - Ao{t}) + Y(t) - dN(t)). REFERENCES Andersen, P.K., Borgan, O., Gill, R.D. and IKeiding, N. (1993). Statistical Models Based on Counting Processes. New York: Springer-Verlag. Besag. J. and Green. P.J. (1993). Spatial Statistics and Bayesian Computation. J. R. Statist. Soc. B 55, 25-37. Breslow, N.E. (1974). Covariate Analysis of Censored Survival Data. Biometrics 30. 89-99. 9

Cox. D.R. (1972). Regression Modlels and Life Tables (wxith Discussion). J. R. Statist. Soc. B 34. 187-220. Damien. P.. Laud. P.W.. and Smith. A.F.M., (1995). Approximate Random Variate Generation from Infinitelv Divisible Distributions with Applications to Bayesian Inference. J. R. Statist. Soc. B 57. 547-563. Damien. P.. Laud. P.W.. and Smith. A.F.M.. (1996) Implementation of Bayesiani Nonparametric Inference Based on Beta Processes. Scand. J. Statist. 23. 27-36. Dykstra. R. L. and Laud. P. W. (1981). Bayesian Nonparametric Approach Toward Reliability, Ann. Statist. 9, 356-367. Efron. B. (1977). The Efficiency of Cox's Likelihood Function for Censored Data../. Am. Statist. Assoc. 72,.557-565. Ferguson, T. S. (1973). A Bayesian Analysis of Some Nonparametric Problems. Ann. Statist. 1. 209-230. Gilks, W.R. and Wild. P. (1992). Adaptive Rejection Sampling for Gibbs Sampling. Appl. Statist. 41, 337-348. Gill, R.D. and Johansen. S. (1990). A Survey of Product-Integration with a View toward Application in Survival Analysis. Ann. Statist. 18, 1501-1555. Hjort, N.L. (1990). Nonparametric Bayes Estimators Based on Beta Processes in Models for Life History Data. Ann. Statist. 18, pp 1259-1294. Kalbfleisch, J.D. (1978). Non-parametric Bayesian Analysis of Survival Time Data. J. R. Statist. Soc. B 40, 2, 214-221. Ktalbfleisch, J.D. and Prentice, R.L. (1980). The Statistical Analysis of Failure Time Data. New York: John Wiley. Klein. J.P. and Mloeschberger. M.L. (1996). Survival Analysis: Methods for Censored and Truncated Data. New York: Springer-Verlag. Smith. A.F.M. and Roberts, G.O. (1993). Bayesian Computations via the Gibbs Sampler and Related MIarkov Chain Monte Carlo Methods. J. R. Statist. Soc. B 55. 3-23. Walker, S.G. (1996). Random Variate Generation from an Infinitely Divisible Distribution via Gibbs Sampling. Technical Report, Imperial College, London. 10

o91 (a) I I I 0.18 - 0.12 - (b) I i l I 0.6 - C '0 c L. 0 CL a) 0 n-, ) c o._ 0 0 0 - 0.3 - \1 0.06 - 0.0 0.00 - 0.I.......... 2 Beta k=1.0 3 4 I I 1 5 9 13 Hazard Ratio --- k=1.0 I I 17 21 ---- k=100 PRIOR- k=0.01 ------ k=100 PRIOR k=0.01 Figure 1. Posterior distributions of (a) the regression coefficient and (b) the hazard ratio.

1.0 -0.8 - (a) k=0.01 1.0 -0.8 - (b) k=1.0 I I.0 -0 L0~ LLI) 0.6 - 0.4 - a -L 0. -5 jU) 0.6 -0.4 - 0.2 - 0.0 - 0.2 - 0.0 - 0 10 I20 I30 40 10 20 30 40 Time 1.0 - 0.8 - (c) k=10.0 1.0 - 0 0 10 20 30 40 uTime (d) k= 100.0 C,, 0!.0 -0 0~ LU) 0.6 - 0.4 - >1 -4-J a Q. 0 0.8 0.6 -0.4 - 0.2 - 0.0 - 0.2 - 0.0 - 0 I I I I4 10 20 30 40 0 I I I 2 10 20 30 40 Time Time Figure 2. Posterior mean and central 90% limits for baseline survival function.

1.0 - 0.81 (b) 5 - (c) I I \ I I 0.8 0 -0 0 L. 'so:3.> 23 0'3 0.6 -0.4 - I I \ I i I I I \ I I I \ I I\ A \ \ \ N 0.6-.0.0 L. 5 U3 CD.) C: L4 4) 4 - 0.2 - 0.0 - 0.4 - 0.2 - 0.0 - 4-. 3 -a, 0 L.o 0 o 2 -1 - I I I I / I I I I I I I! I I!# I \ I \ I \ I \ I \ I I I I I \ I \ I \ ~~i 1 ~\ \ I \ \ _ _ t --. / 0 - 0 10 20 i. Time I i 30 40 I I I 0 10 20 Time I r 30 40 0.0 0.2 Survival I I I 0.4 0.6 0.8 Probability at t- 10 1.0 Figure 3. Posterior comparison of groups, drug, placebo, k=0.01 (a) mean survival functions (b) mean difference in survival functions with 90% limits (c) distributions of P(X>10) _ _ I~~ 'T 5 ' ' ~ ' } 9_ rU