I RESEARCH SUPPORT UNIVERSITY OF MICHIGAN BUSINESS SCHOOL APRIL 1997 A BAYESIAN NONPARAMETRIC COMPARISON OF TWO TREATMENTS WORKING PAPER #9705-12 BY STEPHEN WALKER IMPERIAL COLLEGE, LONDON, UK AND PAUL DAMIEN UNIVERSITY OF MICHIGAN BUSINESS SCHOOL

I A Bayesian Nonparametric Comparison of Two Treatments Stephen Walker1 and Paul Damien2 1 Department of Mathematics, Imperial College, 180 Queen's Gate, London SW7 2BZ. 2 Business School, University of Michigan, Ann Arbor, 48109-1234, USA. Summary In this paper we present a full Bayesian nonparametric analysis of survival time data, involving information from two types of treatment. The goal of the analysis is to determine whether there is a difference between the two treatments, according to some well defined criteria, which would justify the use of one in preference to the other. In this paper we present an easy to compute posterior distribution which provides direct insight into the difference of interest. Keywords: Censored data Bayesian bootstrap, exchangeability, partial exchangeability, beta process, survival data. 1

1 Introduction A common dataset which arises in a medical context consists of information gathered on individuals, from one population, exposed to one of two possible types of treatment, usually allocated at random: the goal - to determine which is the 'better' treatment. The specific observables we will be concentrating on is when the responses are survival times; that is, X1,.., Xn are iid from F1 and Y1,, Ym are iid from F2, with both the Fs having support on (0, oo), and the observations are subject to random right censoring. Modelling differences The key issue is how to model 'differences' between the two treatments; equivalently, the difference between F1 and F2. A recent approach, Hsieh (1996a), is to take F1 as a 'baseline' distribution and assume that F2 is a combination location-scale shift of F1; that is, F2(x) = F1 (- ) for some (y, ar) which are to be estimated, and F1 may then be regarded as a nuisance parameter. Here x denotes the log-survival time. A popular approach is to assume a proportional hazards model (Cox, 1972) in which differences are modelled through the hazard functions: A2(x) = A1 (x) exp p. Here A1 is often regarded as a nuisance parameter and interest focuses on estimating p. Recently, Hsieh (1996b) generalised the proportional hazards model to A2(x) = (A1(x)) exp p. In this paper we eschew these scale and location shift models as explanations of the difference between treatments. We look for a more robust procedure in which the data itself determines the form of the difference. This is done in a novel way in which we rely on the notions of exchangeability and partial exchangeability. Exchangeability/partial exchangeability 2

Let us consider two extreme cases: firstly, under no treatment difference we would regard the entire set of survival times as arising from a single. but unknown, distribution function. In a Bayesian context this would be equivalent to regarding all the individuals as being exchangeable. Secondly, at the other extreme, we would not expect information from one group to help in the understanding of the other group, and vice versa. In a Bayesian context, this would be equivalent to the assumption of partial exchangeability; that is, individuals are regarded as being exchangeable within groups, and independent between groups. We refer to the exchangeable structure as Se and the partial exchangeable structure as Sp. We assume that the truth lies somewhere in between these two extremes. That is, the correct structure Sc is given by Sc = 7rSe + (1 - 7)Spi for some r E [0, 1]. Here r will be taken to be the posterior probability that Se is the 'true' structure, given by -_ = _ 7 rop(datalSc) 7rop(datalSe) + (1 - 7o)p(datajSp)' where tro is the prior choice for the probability that Se is the 'true' structure. The above equations mean that to work with structure SC we would carry out analyses Se and Sp and then combine these according to the weight 7r. For example, suppose that interest is in estimating the parameter 0 and let Qe be the estimate obtained under the exchangeable model, 0p the corresponding estimate under the partial exchangeable model; so 0 = rOe + (1-7r)Op is the required estimate. For our proposed method of analysis we are required to assign a (nonparametric) prior to the appropriate Fs, leading to simple calculations of p(datalS), and hence wr. Throughout the rest of the paper we assume that F1 is the probability distribution of survival times for treatment A and F2 the corresponding distribution for treatment B. If interest is in obtaining a posterior distribution for a parameter of interest, OA, associated with treatment A, then p(OAjdata) = 7rp(OA all data) + (1 - 7r)p(Aldata from A), 3

where p(0A data from A) is the posterior based solely on the information provided by those observations from treatment A, and p(OAlall data) is the posterior under the exchangeable assumption which assumes treatment A and B are identical and hence is based on all observations. In Section 2 we discuss the nonparametric prior for the Fs and in Section 3 we calculate 7r. Section 4 discusses a selection criterion for the 'better' treatment, based on the ideas appearing in Spiegelhalter et al. (1994). Section 5 contains two illustrative examples. 2 Beta process: prior and posterior In a Bayesian nonparametric framework we will be assigning a prior distribution on the space of probability distributions on (0, oo). We will be working with a generalisation of the Dirichlet process (Ferguson, 1973), which we need in order to adequately express prior opinion about the unknown distributions. We consider a discretisation of the continuous space (0, oo), say {ds, 2ds, * * } for some appropriate ds, and will therefore be assigning a nonparametric prior on {1, 2, -- }: this is required to calculate p(data). Even if we used a continuous prior (process) an approximate discrete version of the process would be required to calculate p(data). Typically, data arising from survival analyses, do so in a discrete form - as information obtained each day, week, or during some other unit of time - and so no generality is lost in developing the prior to posterior calculations in a discretized framework, at the outset. Let At, for t = 1,2,.., be independent and distributed according to beta distributions with parameters (cat, t). Then a random survival distribution S(t) can be obtained from the discrete version of the beta process (Hjort, 1990; Walker and Muliere, 1997) via S(t)= H (1 - As) s<t If X1, * *, Xn are iid from S(t), possibly with arbitrary right censoring, then the posterior parameters are given by at + nt and flt + mt, where nt is the number of deaths at time t and rnt is the number of survivors just before time t. A Bayesian bootstrap procedure (Lo, 1993) would be to obtain the posterior parameters and then set the prior parameters to zero; that is, the posterior parameters are (nt, mr). 4

Within this discrete framework we can calculate p(data) straightforwardly, p(X, X'7) n { (oat + ft)[nt+mt] ' t where a[ = a(a + 1) * (a + n - 1) and a[] = 1. Whereas we could consider the zero prior parameters to obtain the posterior distribution, we can not do this to calculate ir, which is clear from the above expression for p(data). In the next section we suggest a way to obtain the prior parameters in a meaningful way. 3 Calculating r We have seen that in order to calculate 7r we have to evaluate p(datalSe) and p(datalSp). Let n4 and mrA be the death process and at-risk process, respectively, for group A and nB and mB the corresponding processes for group B. Also let nt = nA + n, and mt = m + mB. Straightforwardly, using the result for p(Xl,,X), we obtain p(datalSe) = i (cf + a~ t)[nt+mt] } and [n] AmA] [n] [m' l p(datalSp)- -= Jff a' 1.l) _t]} 1 (at + n+m (at + lj Hence an expression for 7r can now be constructed. We could be noninformative in our approach, by taking at = t 0 for all t, in order to estimate p(X > t), for example, but this is not possible when calculating p(datalS,) and p(dataiSp). Assigning non zero values to at and pt, would, of course, imply an informative prior. Here then we discuss a way to select these prior parameters. If we let S(t) denote the random probability p(X > t), we obtain E[S(t)] = +' and E[S2(t)] = E[S(t)] a + 1 a<t < + f Ps st <+as + p8 1 5

I Our aim now is to match these first two moments from the nonparameteric model with those from a parametric model. We choose the geometric distribution for its simplicity and ease of interpretation. Our (Bayes) parametric model is given by p(X = tA) = A(1-)t-, t = 1, 2,., and A ~ beta(a, b). Therefore, Ex[S(t)] = (a 11 -and E[S2(t) = (2t21 (a + b)(t-l1 E (= + b)[2t-2] leading, after some algebra, to Pt = - ) and at f t(l/ - 1), &t - 6t where 4 = Ex[S(t)]/Ex[S(t - 1)] and 6, = C Ex[S2(t)]/EX[S2(t- 1)]. A noninformative approach, which would presumably take a = b = 1, simplifies to it = tcr = t/(2t2 - 1). These are the values for at and /ft which will be used to calculate 7r. Note, then, that it is not possible for us to consider the Dirichlet process, since this not only requires E, cat < oo (which follows from above), but also ft = x,>t as, which does not follow from the development above. 4 Selecting treatments Our strategy for comparing and selecting the 'best' treatment is based on the work by Spiegelhalter et al. (1994, Section 2) and the reader is referred to that paper for further insights. Here we briefly discuss the main aspects of the decision making process. We assume that the selection of the preferred treatment will be based on the parameter 6 = A - OB, where each 0 corresponds to a parameter of interest for each of the treatments; for example, 0 = p(X > to) for some time point to. We think of treatment B as being the current choice of treatment and A a new treatment under investigation. Then according to the arguments 6

put forward by Spiegelhalter et al. (1994) we would select treatment A in preference to treatment B if 6 > 3s, indicating clinical superiority of the new treatment. The full significance of 3s and its selection criteria is discussed by Spiegelhalter et al. (1994), and references cited therein. Of course the Os, and hence 5, are unknown. Spiegelhalter et al. (1994, Sections 3 and 4) develop parametric posterior distributions for S. In our Bayesian nonparametric framework we will use the posterior beta processes to construct the posterior distribution for OA and OB, and hence for 6. The use of the beta process is equivalent to using the censored data Bayesian bootstrap (Lo, 1993); in the limiting case at, t -- 0. We will use this bootstrap method to generate samples from the posterior distribution of 6. Here we briefly detail the method. For s = 1,..,to, we generate /A from i(nA,mA) (independently) and set AA = ls<o (1 - ~A), with probability 1- ir, or generate As from,(n,, mns) (independently) and set 0A = YIsto(l - A,), with probability 7r. We do a. similar procedure for OB, based on (n,mBf), and define OB accordingly, and set 6 = OA - OB If this is 61, we repeat this simulation N times, and use (61,, SN) to construct the posterior for 6. Although there are a number of choices for 0 we will consider the case when 0 = p(X > to) for some to. A more general case is to define 0 = to utp(X > t), where Ei 1 U = 1, and the, ut are chosen to reflect the relative importance of surviving beyond time t; i.e., we would take ut < Ut+l for all t. 5 Numerical examples Example 1. Our first example involves leukemia remission times (in weeks) and the two treatments are an active drug and placebo. The data was discussed and analysed by Cox (1972), Kalbfleisch (1978) and more recently by Laud et al. (1996). The times for the active drug are 6,6,6,6* 7,9*, 1010, 11,13,16,17*, 19, 20* 22,23,25* 32,32*, 34* 35, where a * denotes a censored observation; the times for the placebo are 1,1,2,2,3,4,4,5,5,5,8,8,11,11,12,12,15,17,22,23. 7

Taking 7r0 = 1/2 we compute wr to be 2 x 10-4. We take this to mean that the data from the two groups should be analysed separately. For illustration purposes, we construct the posterior distribution for 6 based on 0 = p(X > 10), using the censored data bootstrap method. The posterior is presented in Figure 1, and clearly demonstrates the superiority, in survival beyond ten weeks, for treatment A; this is quite similar to the result obtained, using a Cox regression, by Laud et al. (1996). Example 2. Our second example is taken from Lawless (1982, Example 7.2.1) and also involves remission times, in weeks. Each treatment, A and B, has 20 patients: the observations from treatment A are 1,3,3,6,7,7,10,12,14,15,18,19,22,26,28, 29,34 40,48, 49* and from treatment B are 1,1,2,2,3,4,5,8,8,9, 11, 12,14,16, 18,21,27*,31,38*,44. Again, taking 7ro to be 1/2, we compute the posterior probability ir to be 0.77, providing support for the exchangeable structure and no difference in the treatments. For these data, based on a number of (frequentist) tests, Lawless also concludes that there is "no evidence of a difference in distribution." As in Example 1, we construct the posterior distribution for 6 based on 0 = p(X > 10), using the censored data bootstrap method. The posterior is presented in Figure 2, and this time clearly demonstrates equality of treatments, in terms of survival beyond ten weeks. Discussion In this paper we develop a simple method, via a nonparametric prior distribution, to selecting the "better" treatment. The novelty of our approach is four-fold: firstly, it offers the practitioner to use contextual prior information in a natural way to model the uncertainty in the underlying survival functions corresponding to the two treatments; secondly, treatment differences are identified based on whether or not the data is exchangeable or partially exchangeable; thirdly, the calculation of a formal decision rule enables the selection of the preferred treatment; and fourthly, our solution does not require Markov chain Monte Carlo simulations. A very easy-toimplement bootstrap is the only simulation aspect to the solution, the rest being closed form analytical expressions. 8

References Cox, D.R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society, Series B 34, 187-202. Ferguson, T.S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics 1, 615-629. Hjort, N.L. (1990). Nonparametric Bayes estimators based on beta processes in models for life history data. The Annals of Statistics 18, 1259-1294. Hsieh, F. (1996a). Empirical process approach in a two-sample locationscale model with censored data. The Annals of Statistics 24, 2705-2719. Hsieh, F. (1996b). A transformation model for two survival curves: An empirical process approach. Biometrika 83, 519-528. Kalbfleisch, J.B. (1978). Nonparametric Bayesian analysis of survival time data. Journal of the Royal Statistical Society, Series B 40, 214-221. Laud, P.W., Damien, P. and Smith, A.F.M. (1996). Bayesian nonparametric and covariate,analysis of failure time data. Lawless, J.F. (1982). Statistical Models and Methods for Lifetime Data. John Wiley & Sons. Lo, A.Y. (1993). A Bayesian bootstrap for censored data. The Annals of Statistics 21, 100-123. Spiegelhalter, D.J.,Freedman, L.S. and Parmar, M.K.B. (1994). Bayesian approaches to randomised trials (with discussion). Journal of the Royal Statistical Society, Series A 157, 357-416. Walker, S.G. and Muliere, P. (1997). Beta-Stacy processes and a generalisation of the Polya-urn scheme. To appear in The Annals of Statistics. 9

I -0.5 0.0 0.5 1.0 delta 0 Figure 1: Posterior distribution for 5 based on 6 = p(X > 10), Example 1 toUI)~10 10

0 O1 qI m.. 0 Io -0.4 -0.2 0.0 0.2 0.4 0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 delta Figure 2: Posterior distribution for 6 based on 0 = p(X > 10), Example 2 11