RESEARCH SUPPORT, UNIVERSITY OF MICHIGAN BUSINESS SCHOOL, APRIL 1997. GIBBS SAMPLING FOR BAYESIAN NONCONJUGATE MODELS USING AUXILIARY VARIABLES. WORKING PAPER #9705-13 BY PAUL DAMIEN, UNIVERSITY OF MICHIGAN BUSINESS SCHOOL; STEPHEN WALKER, IMPERIAL COLLEGE, LONDON, UK; AND JON WAKEFIELD, IMPERIAL COLLEGE, LONDON, UK.

Gibbs Sampling For Bayesian Nonconjugate Models Using Auxiliary Variables

Paul Damien(1), Stephen Walker(2), Jon Wakefield(3)

(1) Business School, University of Michigan, Ann Arbor, 48109-1234, USA. (2) Department of Mathematics, Imperial College, 180 Queen's Gate, London SW7 2BZ. (3) Department of Epidemiology and Public Health, Imperial College School of Medicine at St Mary's, Norfolk Place, London, W2 1PG.

SUMMARY

We demonstrate the use of auxiliary (or latent) variables for sampling nonstandard densities which arise in the context of a Gibbs sampler for Bayesian nonconjugate models. Their strategic use can result in a Gibbs sampler having standard full conditionals. We propose such a procedure to simplify or speed up an existing Markov chain Monte Carlo algorithm. The strength of this approach lies in its generality and its ease of implementation. The method is illustrated on posterior densities arising from Bayesian nonconjugate and hierarchical models. A feature of the paper, therefore, is to provide an alternative sampling algorithm to rejection-based methods and other sampling approaches such as the Metropolis-Hastings algorithm.

Keywords: Gibbs sampler, Hierarchical model, Latent variable, Nonconjugate model.

1 Introduction

Markov chain Monte Carlo (MCMC) methods (Smith and Roberts, 1993; Tierney, 1994) allow Bayesian inference for highly complex models in which realistic distributional assumptions can be made. The Gibbs sampler, the most common of the MCMC algorithms, can often be difficult to implement because the required conditional distributions assume awkward forms. In this case the practitioner may turn to the Metropolis-Hastings algorithm and/or a rejection algorithm; see, for example, Metropolis et al. (1953), Hastings (1970), Tierney (1994) and Devroye (1986). However, these may be difficult to set up and may require 'tuning' to achieve satisfactory performance (Bennett, Racine-Poon and Wakefield, 1996). In this paper we discuss a novel approach which, after the introduction of strategic auxiliary (or latent) variables, results in a Gibbs sampler having a set of standard full conditionals. Suppose the required conditional distribution for a random variable X is denoted f. The basic idea is to introduce a latent variable U, construct the joint density of U and X with marginal density for X given by f, and then extend the Gibbs sampler to include the extra full conditional for U. We demonstrate that in many cases it is possible to introduce a latent variable so that all full conditionals are standard and can be sampled directly. This is obviously appealing and, provided there is no dramatic loss in efficiency compared to the original chain(s), will be of particular interest to the MCMC practitioner. In many cases the introduction of a single latent variable will 'split' a nonstandard full conditional distribution appearing in a Gibbs sampler into two standard full conditional distributions, effectively extending the Gibbs sampler by one more full conditional. This may then lead to a more efficient method than the Metropolis-Hastings algorithm, the adaptive rejection method for log-concave densities (Gilks and Wild, 1992) and other rejection algorithms, in many contexts. For a historical overview of Markov chain methods and the use of latent (auxiliary) variables the reader is referred to Besag and Green (1993). In particular our approach develops the original idea introduced by Edwards and Sokal (1988) and highlighted by Besag and Green in Section 5 of their paper. Recent progress with auxiliary variables is reported in Higdon (1996), and references therein. The paper is organised as follows. In the next section, we develop the

theory underlying the new algorithm. We show that our method improves on a Metropolis independence chain as well as a rejection algorithm. In Section 3 we implement the approach for Bayesian nonconjugate models and in Section 4 for Bayesian hierarchical models: Section 4.1 treats generalised linear mixed models (GLMMs) and Section 4.2 nonlinear mixed models (NMMs). Section 5 contains numerical examples, followed by a concluding discussion in Section 6.

2 Preliminaries

The main result on which the algorithm developed in this paper rests is given in the following theorem; see Damien and Walker (1996).

THEOREM 1. Suppose we wish to generate random variates from a density f given by $f(x) \propto \pi(x)\prod_{i=1}^{L} g_i(x)$, where $\pi$ is a density of known form and the $g_i$ are nonnegative invertible functions (not necessarily densities); that is, for any $u > 0$ it is possible to obtain the set $A_i(u) = \{x : g_i(x) > u\}$. Then a Gibbs sampler for generating random variates from f exists in which all but one of the full conditionals are uniform densities and the remaining full conditional is a truncated version of $\pi$.

Proof. We introduce the latent variables $u = (u_1, \ldots, u_L)$, with each $u_i$ defined on $(0, \infty)$, such that the joint density with x is given by $f(x, u_1, \ldots, u_L) \propto \pi(x)\prod_{i=1}^{L} I\{u_i < g_i(x)\}$. Clearly the marginal density for x is f(x). A Gibbs sampler can now be implemented in which the full conditional for each $u_i$ is the uniform density on $(0, g_i(x))$, and the full conditional for x is $\pi$ restricted to the set $A(u) = \{x : g_i(x) > u_i,\; i = 1, \ldots, L\}$.
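To make the construction concrete, the following sketch (in Python, illustrative only) applies Theorem 1 with L = 1 to the toy target $f(x) \propto N(x|0,1)\,g(x)$ with $g(x) = 1/(1 + x^2)$; here $A(u) = \{x : g(x) > u\}$ is the interval $(-\sqrt{1/u - 1}, \sqrt{1/u - 1})$, so both full conditionals are standard. All names and numerical choices are ours, not the paper's.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def g(x):
    # nonnegative and 'invertible' in the sense of Theorem 1: {x : g(x) > u} is an interval
    return 1.0 / (1.0 + x ** 2)

def truncated_std_normal(lo, hi, rng):
    # inverse-cdf sampling of N(0, 1) restricted to (lo, hi)
    return norm.ppf(rng.uniform(norm.cdf(lo), norm.cdf(hi)))

def theorem1_gibbs(n_iter=5000, x0=0.0, rng=rng):
    x, draws = x0, []
    for _ in range(n_iter):
        u = rng.uniform(0.0, g(x))                    # u | x ~ U(0, g(x))
        half = np.sqrt(1.0 / u - 1.0)                 # A(u) = (-half, half)
        x = truncated_std_normal(-half, half, rng)    # x | u ~ N(0,1) restricted to A(u)
        draws.append(x)
    return np.array(draws)

samples = theorem1_gibbs()
print("mean, variance:", samples.mean(), samples.var())

Every step is a uniform or an (inverse-cdf) truncated-normal draw, which is the point of the theorem.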

"When dealing with more complicated models, direct simulation from f(xju) is unlikely to be available." [Italics ours.] As a consequence, they propose that sampling from Xr restricted to the set A(u) be achieved using rejection algorithms; that is, sampling from t until the sample falls in A(u). While this method would work in principle, our aim in this paper is to demonstrate that we can introduce latent variables in complex models which still permits direct simulation from f(xlu); i.e., the use of rejection sampling can be obviated. If the decomposition indicated in Theorem 1 is possible then we will achieve substantial efficiency over the rejection sampling of Besag and Green. The class of densities having the appropriate decomposition seems to be large, and specifically, in the context of Bayesian models, the decomposition stated in Theorem 1 can be readily achieved. Consider the posterior density given by f(x) cc l(x)7r(x) and suppose it is not possible to sample directly from f. The general idea is to introduce latent variable Y, defined on the interval (0, co) or more strictly the interval (0, l(()), where x maximises 1(.), arid define the joint density with X by f(x,y) cc l(y < l(x))r(x). The full conditional for Y is /(0, I(x)) and the full conditional for X is T, restricted to the set Ay = {x: 1(x) > y}. We can show how this approach "improves" on a particular independent Hastings -Metropolis chain. Tlle Metropolis algorithm is a Markovian scheme which may be used for obtaining samples from the posterior f(x) cx l(x)7r(x). Given x(t), a proposal for x(t+1), x, is taken from, for example. 7(.), and a uniform random variable, u. is taken from the interval (0, 1). Essentially, if l(z)/1({x()) > u then x(t+) = a5 else (+1) = x(t). The chain either 'moves on' or 'stays where it is'. The convention is that x is sampled first followed by u. Suppose we reverse this and sample u first. To 'move on' we need to sample x from 7r(.) such that I(i)/l(x(Q)) > u. Suppose, therefore, we sample x from 7r(.) restricted to the set AU(t) = {x: l(x) > ul(xi))). The chain will always 'move on'. In fact we have just described a Gibbs sampler with standard full conditionals, detailed in Theorem 2: THEOREM 2. The Markovian scheme for generating {x(0} given by x(t+) 7r(.) restricted to the set A(t) = {x::(x) > ul(x(t)}, where u is a uniform random variable from the interval (0, 1), satisfies x -t)+ x r f(x) CX 4

$l(x)\pi(x)$.

Proof. Define the joint density function of x and y by $f(x, y) \propto I(y < l(x))\pi(x)$. Clearly the marginal density for x is f. A Gibbs sampler can now be used to generate $\{x^{(t)}\}$ which satisfies the conclusion of the theorem. To implement the Gibbs sampler the full conditional densities need to be sampled in turn, updating the parameters as they are sampled. These full conditional densities are given by $f(y|x^{(t)})$, which is the uniform density on the interval $(0, l(x^{(t)}))$, and $f(x^{(t+1)}|y)$, which is $\pi(\cdot)$ restricted to the set $A_y = \{x : l(x) > y\}$. Clearly such a scheme is the one described in Theorem 2, completing the proof.

Theorem 2 says that it is possible to remove the 'ties' from an independence Metropolis chain and remain with a simulated Markov chain from a pure Gibbs sampler. So why not just simulate from this Gibbs sampler in the first place? This is exactly what we are proposing, and intuitively we must improve the efficiency in exploration of the sample space since we have removed the 'stops and starts' of the Metropolis chain. An additional burden with the Metropolis algorithm is that it may be difficult, in some instances, to obtain a good candidate distribution; see, for example, Chib and Greenberg (1995), who discuss the difficulties in this selection process.
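A hedged sketch of the comparison behind Theorem 2: for the artificial target $f(x) \propto l(x)\pi(x)$ with $\pi = N(0,1)$ and $l(x) = \exp(-(x-2)^2)$, the set $A_u^{(t)}$ is the interval $2 \pm \sqrt{(x^{(t)}-2)^2 - \log u}$, so the Gibbs scheme can be run exactly; the independence Metropolis chain with proposal $\pi$ is included only to count the 'ties' that the Gibbs scheme removes. The example and its constants are ours.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def log_l(x):
    # artificial 'likelihood' l(x) = exp(-(x - 2)^2); the prior pi is N(0, 1)
    return -(x - 2.0) ** 2

def trunc_std_normal(lo, hi, rng):
    return norm.ppf(rng.uniform(norm.cdf(lo), norm.cdf(hi)))

def independence_metropolis(n_iter, rng):
    x, ties, out = 0.0, 0, []
    for _ in range(n_iter):
        prop = rng.standard_normal()                       # proposal from pi
        if np.log(rng.uniform()) < log_l(prop) - log_l(x):
            x = prop
        else:
            ties += 1                                      # chain 'stays where it is'
        out.append(x)
    return np.array(out), ties

def theorem2_gibbs(n_iter, rng):
    x, out = 0.0, []
    for _ in range(n_iter):
        u = rng.uniform()                                  # sample u first
        half = np.sqrt((x - 2.0) ** 2 - np.log(u))
        # A_u = {x : l(x) > u l(x_t)} is the interval (2 - half, 2 + half)
        x = trunc_std_normal(2.0 - half, 2.0 + half, rng)
        out.append(x)
    return np.array(out)

m_draws, ties = independence_metropolis(5000, rng)
g_draws = theorem2_gibbs(5000, rng)
print("Metropolis ties:", ties, "| means:", m_draws.mean(), g_draws.mean())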

The general case

The above algorithm works when $l(\cdot)$ itself is invertible. It is more usual for it to decompose into a product of invertible functions. Therefore, we consider the case when $f(x) \propto \{\prod_{i=1}^{n} l_i(x)\}\pi(x)$. Here we introduce the latent variable $y = (y_1, \ldots, y_n)$, where the $y_i$ are mutually independent given x, and define the joint density of x and y by $f(x, y) \propto \{\prod_{i=1}^{n} I(y_i < l_i(x))\}\pi(x)$. The full conditional densities are given by $f(y_i|x)$, independent uniform densities on the intervals $(0, l_i(x))$, and $f(x|y)$, which is $\pi(\cdot)$ restricted to the set $A_y = \{x : l_i(x) > y_i,\; i = 1, \ldots, n\}$.

A simplification in the multivariate case

If x is multidimensional and it is not possible to obtain the multivariate set $A_y$, then a simplification is to sample from f(x|y) by sampling from $f(x_k|x_{-k}, y)$, for $k = 1, \ldots, p$, where p is the dimension of x. This would involve sampling from $\pi(x_k|x_{-k})$ restricted to the set $A_{k,y} = \{x_k : l_i(x_k, x_{-k}) > y_i,\; i = 1, \ldots, n\}$.

In the discussion above it is not required that l and $\pi$ be a likelihood function and a prior distribution, respectively. More generally, suppose $f = h \times g$, and we wish to sample from f using g as a proposal distribution for a rejection algorithm within a Gibbs sampler. A standard rejection algorithm would proceed by first calculating the supremum of h. Using an argument similar to the one appearing in Theorem 2, it is easy to prove that the method proposed in Theorem 1 will be at least as efficient as a standard rejection algorithm; this, of course, means that the method developed in this section will be at least as efficient as any variant of the standard rejection algorithm, when used within a Gibbs sampler.

3 Bayesian nonconjugate models

Example 3.1 Poisson/log-normal model

Suppose we observe a random nonnegative integer, n, from a Poisson distribution with parameter exp(x). Without loss of generality we assume a N(0, 1) prior for x. Then we obtain the posterior given by $f(x) \propto \exp(nx - \exp(x))\exp(-0.5x^2)$. We introduce the latent variable Y, defined on the interval $(0, \infty)$, such that the joint density with X is given by $f(x, y) \propto \exp(-y) I(y > \exp(x)) \exp(-0.5(x^2 - 2nx))$, which leads to the conditional densities $f(y|x) \propto \exp(-y) I(y > \exp(x))$ and $f(x|y) \propto \exp(-0.5(x - n)^2) I(x < \log(y))$, a truncated N(n, 1) density; see Devroye (1986), Robert (1995) and Cumbus et al. (1997) for information concerning the sampling of truncated normal densities.
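The two full conditionals of Example 3.1 are a truncated exponential and a truncated normal, both of which can be sampled by elementary means: the exponential by adding a unit-rate exponential variate to its lower bound (memorylessness), the normal by cdf inversion. A minimal sketch (ours, not the authors' code):

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

def poisson_lognormal_gibbs(n_obs, n_iter=5000, x0=0.0, rng=rng):
    # Auxiliary-variable Gibbs sampler for f(x) ∝ exp(n x - e^x) exp(-x^2 / 2)
    x, draws = x0, []
    for _ in range(n_iter):
        # y | x: unit-rate exponential restricted to (exp(x), ∞)
        y = np.exp(x) - np.log(rng.uniform())
        # x | y: N(n, 1) restricted to (-∞, log y), sampled by cdf inversion
        hi = norm.cdf(np.log(y) - n_obs)          # standardised upper bound
        x = n_obs + norm.ppf(rng.uniform(0.0, hi))
        draws.append(x)
    return np.array(draws)

samples = poisson_lognormal_gibbs(n_obs=3)
print("posterior mean of x:", samples[1000:].mean())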

Example 3.2 Bernoulli/logistic regression model

Consider the following Bernoulli regression model, in which $w_i | x \sim \text{Bernoulli}(\{1 + \exp(-\mu - xz_i)\}^{-1})$, $i = 1, \ldots, n$, with $X \sim N(0, 1)$ as the prior (we assume $\mu$ is known). The posterior density for X is given, up to a constant of proportionality, by $f(x) \propto \exp(-0.5x^2)\prod_{i=1}^{n}\{1 + \exp(-\mu - xz_i)\}^{-w_i}\{1 + \exp(\mu + xz_i)\}^{w_i - 1}$. We introduce the latent variables $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ such that their joint density with X is given, up to a constant of proportionality, by $f(x, u, v) \propto \exp(-0.5x^2)\prod_{i=1}^{n} I\left(u_i < \{1 + \exp(-\mu - xz_i)\}^{-w_i},\; v_i < \{1 + \exp(\mu + xz_i)\}^{w_i - 1}\right)$. The full conditional densities $f(u_i|u_{-i}, v, x)$ and $f(v_i|v_{-i}, u, x)$ are all uniform: $f(u_i|u_{-i}, v, x) = U\left(0, \{1 + \exp(-\mu - xz_i)\}^{-w_i}\right)$ and $f(v_i|v_{-i}, u, x) = U\left(0, \{1 + \exp(\mu + xz_i)\}^{w_i - 1}\right)$, where U(a, b) is the uniform distribution on (a, b). Let $S = \{i : w_i = 1\} \cap \{i : z_i \neq 0\}$ and $R = \{i : w_i = 0\} \cap \{i : z_i \neq 0\}$. Then $f(x|u, v) \propto \exp(-0.5x^2) I(x \in A_{u,v})$, where $A_{u,v} = (\max_{i \in S}\{a_i\}, \min_{i \in R}\{b_i\})$, $a_i = \{-\log(1/u_i - 1) - \mu\}/z_i$ and $b_i = \{\log(1/v_i - 1) - \mu\}/z_i$. Note that if $S = \emptyset$ then we replace $\max_{i \in S}\{a_i\}$ by $-\infty$ and if $R = \emptyset$ then we replace $\min_{i \in R}\{b_i\}$ by $+\infty$.

Example 3.3 Probit model

Here we consider the posterior density given, up to a constant of proportionality, by $f(\beta) \propto \prod_{i=1}^{n}\{\Phi(\beta_0 + \beta_1 z_i)\}^{w_i}\{1 - \Phi(\beta_0 + \beta_1 z_i)\}^{n_i - w_i}\,\pi(\beta)$, where we assume a multivariate normal$(\mu, \Sigma)$ prior for $\beta$, and $\Phi$ is the standard normal distribution function. We introduce the latent variables $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ such that their joint density with $\beta$ is given, up to a constant of proportionality, by $f(u, v, \beta) \propto \prod_{i=1}^{n} I\left(u_i < \{\Phi(\beta_0 + \beta_1 z_i)\}^{w_i},\; v_i < \{1 - \Phi(\beta_0 + \beta_1 z_i)\}^{n_i - w_i}\right)\pi(\beta)$. The full conditional densities $f(u_i|u_{-i}, v, \beta)$ and $f(v_i|v_{-i}, u, \beta)$ are uniform and so we focus attention on $f(\beta_k|\beta_{-k}, u, v)$. Let $a_i = \Phi^{-1}(r_i) - \beta_1 z_i$, $b_i = (\Phi^{-1}(r_i) - \beta_0)/z_i$, $c_i = \Phi^{-1}(\lambda_i) - \beta_1 z_i$ and $d_i = (\Phi^{-1}(\lambda_i) - \beta_0)/z_i$, where $r_i = u_i^{1/w_i}$ and $\lambda_i = 1 - v_i^{1/(n_i - w_i)}$. Note that $b_i$ and $d_i$ are only defined for those $z_i \neq 0$, $a_i$ and $b_i$ are only defined when $w_i > 0$, and $c_i$ and $d_i$ are only defined when $n_i > w_i$. Then $f(\beta_0|\beta_1, u, v) \propto \pi(\beta_0|\beta_1) I\left(\max_i\{a_i\} < \beta_0 < \min_i\{c_i\}\right)$ and $f(\beta_1|\beta_0, u, v) \propto \pi(\beta_1|\beta_0) I\left(\max_i\{b_i\} < \beta_1 < \min_i\{d_i\}\right)$.
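A sketch of the probit updates follows, with two simplifying assumptions that are ours rather than the paper's: independent normal priors for $\beta_0$ and $\beta_1$ (so $\pi(\beta_0|\beta_1)$ and $\pi(\beta_1|\beta_0)$ are just the marginal priors) and covariates $z_i > 0$ (so the inequalities divide through without changing direction). The data are synthetic.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)

# Synthetic binomial data (hypothetical; the paper does not supply data for this example).
z = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
n = np.array([10, 10, 10, 10, 10])
w = np.array([2, 4, 6, 8, 9])

mu0, sd0 = 0.0, 2.0     # independent N priors on beta0, beta1 (a simplifying choice)
mu1, sd1 = 0.0, 2.0

def trunc_normal(mean, sd, lo, hi, rng):
    a, b = norm.cdf((lo - mean) / sd), norm.cdf((hi - mean) / sd)
    return mean + sd * norm.ppf(rng.uniform(a, b))

def probit_gibbs(n_iter=5000, rng=rng):
    b0, b1, draws = 0.0, 0.0, []
    for _ in range(n_iter):
        p = norm.cdf(b0 + b1 * z)
        u = rng.uniform(0.0, p ** w)                         # latent uniforms
        v = rng.uniform(0.0, (1.0 - p) ** (n - w))
        r = u ** (1.0 / np.maximum(w, 1))                    # r_i = u_i^{1/w_i}, used when w_i > 0
        lam = 1.0 - v ** (1.0 / np.maximum(n - w, 1))
        # bounds for beta0 (assume z_i > 0 throughout)
        a_i = np.where(w > 0, norm.ppf(r) - b1 * z, -np.inf)
        c_i = np.where(n > w, norm.ppf(lam) - b1 * z, np.inf)
        b0 = trunc_normal(mu0, sd0, a_i.max(), c_i.min(), rng)
        # bounds for beta1, using the freshly updated beta0
        b_i = np.where(w > 0, (norm.ppf(r) - b0) / z, -np.inf)
        d_i = np.where(n > w, (norm.ppf(lam) - b0) / z, np.inf)
        b1 = trunc_normal(mu1, sd1, b_i.max(), d_i.min(), rng)
        draws.append((b0, b1))
    return np.array(draws)

out = probit_gibbs()
print("posterior means:", out[1000:].mean(axis=0))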

Example 3.4 Weibull proportional hazards model

The Weibull proportional hazards model is popular for modelling censored survival time data. The hazard function for the ith individual is given by $\lambda_i(t) = \lambda_0(t)\exp(X_i\beta)$, where $\beta = (\beta_1, \ldots, \beta_p)$ is a vector of unknown parameters and $\lambda_0(t)$ is the baseline hazard. The Weibull model arises when $\lambda_0(t) = \alpha t^{\alpha-1}$ for some unknown $\alpha > 0$. The conditional posterior distribution for $\beta$, given $\alpha$ and taking a multivariate normal prior for $\beta$, is given, up to a constant of proportionality, by $f(\beta) \propto \prod_{i=1}^{n}\exp\left(X_i\beta\,I(\delta_i = 0) - t_i^{\alpha}\exp(X_i\beta)\right)\exp\left(-0.5(\beta - \mu)'\Sigma^{-1}(\beta - \mu)\right)$, where $\delta_i = 0$ indicates that $t_i$ is an uncensored observation. Here we introduce the latent variables $U = (U_1, \ldots, U_n)$ such that the joint density with $\beta$ is given, up to a constant of proportionality, by $f(\beta, u) \propto \left\{\prod_{i=1}^{n}\exp(-u_i)\,I\left(u_i > t_i^{\alpha}\exp(X_i\beta)\right)\right\}\exp\left(-0.5(\beta - \mu)'\Sigma^{-1}(\beta - \mu) + \nu\beta\right)$, where $\nu = \sum_{i=1}^{n} X_i\,I(\delta_i = 0)$. The full conditional distributions for each of the $u_i$ are independent exponential(1) distributions restricted to the sets $(t_i^{\alpha}\exp(X_i\beta), \infty)$. Sampling from $f(\beta_k|\beta_{-k}, \alpha)$ requires $A_k = \left\{\beta_k : \beta_k < \min_i\left\{\log(u_i/t_i^{\alpha})/X_{ki} - \sum_{l \neq k} X_{li}\beta_l/X_{ki}\right\}\right\}$. The full conditional for $\alpha$, with prior $\pi(\alpha) = $ constant (Dellaportas and Smith, 1993), is given by $f(\alpha) \propto \alpha^{\bar{n}}\left(\prod_{\delta_i = 0} t_i\right)^{\alpha} I(A^- < \alpha < A^+)$, where $\bar{n}$ is the number of uncensored observations, $A^- = \max_{t_i < 1}\left\{(\log u_i - X_i\beta)/\log t_i\right\}$ and $A^+ = \min_{t_i > 1}\left\{(\log u_i - X_i\beta)/\log t_i\right\}$. We can sample this density via the introduction of a latent variable V, defining the joint density with $\alpha$ by $f(v, \alpha) \propto \alpha^{\bar{n}}\,I\left(v < \left(\prod_{\delta_i = 0} t_i\right)^{\alpha}\right) I(A^- < \alpha < A^+)$. It is now seen that both $f(v|\alpha)$ and $f(\alpha|v)$ are of standard type and can be sampled using uniform random variables.
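The final step above reduces to drawing from a density proportional to $\alpha^{\bar{n}}$ on a bounded interval, which a single uniform variate accomplishes by cdf inversion. A small sketch, taking the interval endpoints as given (they are assumed to collect all of the constraints described above):

import numpy as np

def sample_alpha_conditional(n_uncens, lo, hi, rng):
    # Draw from f(alpha) ∝ alpha^n_uncens on (lo, hi) by cdf inversion;
    # the cdf on (lo, hi) is proportional to alpha^(n_uncens + 1) - lo^(n_uncens + 1).
    k = n_uncens + 1
    w = rng.uniform()
    return (lo ** k + w * (hi ** k - lo ** k)) ** (1.0 / k)

rng = np.random.default_rng(4)
print(sample_alpha_conditional(n_uncens=12, lo=0.8, hi=1.6, rng=rng))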

4 Bayesian hierarchical models

Hierarchical models are relevant when the observed variability in the data on a number of units can be conveniently partitioned into within- and between-unit components. At the first stage of the hierarchy observations from a particular unit are modelled, whilst at the second stage of the hierarchy between-unit differences are modelled. In this paper we consider both (i) generalised linear mixed models and (ii) nonlinear mixed models. We concentrate on the situation in which the second stage distribution is specified parametrically, typically using normal or Student's t distributions. The use of such models is becoming increasingly common as both computational power increases and new computational techniques are developed. For example, GLMMs are used in health services research (Gatsonis et al., 1995; Kahn and Raftery, 1996), disease mapping (Clayton and Kaldor, 1987), multicentre clinical trials (Skene and Wakefield, 1990), educational testing (Goldstein, 1995) and small area estimation (Ghosh and Rao, 1994). NMMs are used, for example, in growth curve analysis (Berkey, 1982) and population pharmacokinetic/pharmacodynamic studies (Beal and Sheiner, 1982; Wakefield, 1996).

Computationally, NMMs and GLMMs pose two problems. First, interest generally focuses on the second stage parameters, and to obtain the likelihood for these parameters the unit-specific first-stage parameters (the random effects) need to be integrated out; for the models considered here these integrals are analytically intractable. Secondly, problems arise when summarising the resultant marginal likelihood function, or the posterior distribution if a Bayesian approach is taken. From a non-Bayesian perspective various approaches have been suggested, including maximum likelihood estimation, generalised estimating equations and quasi-likelihood. There is now a large amount of literature on estimation in GLMMs; see, for example, Williams (1982), Breslow (1984), Stiratelli, Laird and Ware (1984), Gilmour, Anderson and Rae (1985), Liang and Zeger (1986), Goldstein (1987), Zeger, Liang and Albert (1988), Schall (1991) and Breslow and Clayton (1993). Bayesian solutions for GLMMs have been suggested using the Laplace method (Kass and Steffey, 1989) and numerical integration (Skene and Wakefield, 1990). Neither of these techniques is completely satisfactory, however. Laplacian methods and numerical integration become infeasible as the dimensionality of the parameter space increases and for a given model and

dataset it is difficult to assess whether the posterior distribution is sufficiently well-behaved for the analytical and numerical approximations to be appropriate. MCMC techniques are far more appealing since they allow the complete posterior surface to be examined. Unfortunately, for GLMMs the required conditional distributions do not assume standard forms and so specialist code is required. Zeger and Karim (1991) proposed the Gibbs sampler as a method for GLMMs, but their rejection algorithm was not guaranteed to provide a bounding envelope and so, strictly, the Markov chain did not have the correct limiting distribution, though Tierney (1994) gave a Metropolis algorithm to correct for this. For a GLM with appropriate priors, Dellaportas and Smith (1993) showed that the required conditional distributions are log-concave, so the adaptive rejection sampling algorithm (ARS) (Gilks and Wild, 1992) can be used. This approach could be adopted within a hierarchical framework and examples have been presented using the BUGS software (Spiegelhalter et al., 1995).

We turn now to NMMs. A good review of this area is provided by Davidian and Giltinan (1995). Non-Bayesian computational techniques have been suggested by, amongst others, Lindstrom and Bates (1990), Vonesh and Carter (1992) and Walker (1996). Fearn (1975) provided an early Bayesian solution to growth curve analysis and Racine-Poon (1985) provided an EM-type algorithm. However, it was not until MCMC techniques became available that the Bayesian approach was generally feasible. As with GLMMs, the conditional distributions for the random effects do not assume standard forms. Wakefield et al. (1994) used the ratio-of-uniforms black-box random number generation method (Wakefield, Gelfand and Smith, 1991), but this technique is computationally expensive since numerical maximisations are required at each iteration, and it also requires specialist software. The Metropolis algorithm is an obvious candidate but requires 'tuning' by the user. In this paper we provide an algorithm which, via the introduction of latent variables, provides an MCMC solution with all sampling being from standard forms.

4.1 Generalised linear mixed models

4.1.1 The model

Given $\{b_i\}$, a set of q-vector random effects, the observations $y_i$, $i = 1, \ldots, n$, are conditionally independent from the exponential family of distributions with mean $h(X_i\beta + Z_ib_i)$, where $h(\cdot)$ is a nonnegative invertible function, that is, $g = h^{-1}$ exists, $X_i$ is a p-vector of explanatory variables, $\beta$ a p-vector of unknown parameters, and $Z_i$ a q-vector of explanatory variables, for the ith observation. The conditional variances are given by $\text{var}(y_i|b_i) = \phi\, v\{E(y_i|b_i)\}$, where v is a known variance function and $\phi$ an unknown dispersion parameter. The $b_i$ are assumed to be independent and identically distributed (iid) from the multivariate normal distribution with mean 0 and covariance matrix $\Omega$. Within a Bayesian framework conjugate prior distributions are assigned to the parameters $\phi$, $\beta$ and $\Omega$. The prior for $\phi$ (if present) is typically an inverse gamma distribution, the prior for $\beta$ a multivariate normal, say $N(\mu, \Sigma)$, and an inverse Wishart prior is taken for $\Omega$.

4.1.2 The algorithm

Here we present a general algorithm for sampling the conditional distributions of the GLMM. Suppose the full conditional distribution for $\beta$ is given, up to a constant of proportionality, by $f(\beta) \propto \exp\left(\sum_i\{y_iX_i\beta - h(X_i\beta + Z_ib_i)\}\right) N(\beta|\mu, \Sigma)$. In this form the distribution is not of standard type and so cannot be sampled directly without recourse to specific software. However, with the introduction of latent variables, standard forms can be recovered. We proceed by introducing the latent variables $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ such that the joint (full conditional) distribution with $\beta$ is given, again up to a constant of proportionality, by $f(\beta, u, v) \propto \left\{\prod_{i=1}^{n} I\left(u_i < \exp(y_iX_i\beta),\; v_i < \exp\{-h(X_i\beta + Z_ib_i)\}\right)\right\} N(\beta|\mu, \Sigma)$. Clearly the marginal distribution for $\beta$ is as required. Some simple algebra gives the following full conditional distributions for each $\beta_k$, $k = 1, \ldots, p$: $f(\beta_k) \propto N(\beta_k|\mu_k^*, 1/e_{kk})\,I(a_k < \beta_k < c_k)$,

where $\mu_k^* = \mu_k - \sum_{l \neq k}(\beta_l - \mu_l)e_{lk}/e_{kk}$, $e_{lk}$ is the lkth element of $\Sigma^{-1}$, $a_k = \max_{X_{ki} \neq 0}\left\{\left(y_i^{-1}\log u_i - \sum_{l \neq k}\beta_lX_{li}\right)/X_{ki}\right\}$ and $c_k = \min_{X_{ki} \neq 0}\left\{\left(g(-\log v_i) - \sum_{l \neq k}\beta_lX_{li} - Z_ib_i\right)/X_{ki}\right\}$. The introduction of the latent variables provides what can be described as a latent model, and the 'new' Gibbs sampler includes the sampling of the full conditional distributions for u and v within each iteration. These are easily seen to be uniform distributions. The full conditional distribution for $b_i$ is given by $f(b_i) \propto I\left(u_i < \exp\{y_iZ_ib_i\},\; v_i < \exp\{-h(X_i\beta + Z_ib_i)\}\right) N(b_i|0, \Omega)$, which, as with the full conditional for $\beta$, will lead to a truncated normal distribution.

4.1.3 Examples

Example 4.1.1 Random effects binomial model. The model considered here is a random effects binomial model which allows for over-dispersion. If $p_i$ is the probability of success for the ith observation then $y_i|p_i \sim \text{binomial}(p_i, n_i)$, $\text{logit}\,p_i = X_i\beta + b_i$, $b_i \sim N(0, \lambda^{-1})$. Independent priors are assigned to $\lambda$ and $\beta$, typically a gamma and a multivariate normal, respectively. Of interest is the joint probability distribution of $\beta$, $b = (b_1, \ldots, b_n)$ and $\lambda$ given, up to a constant of proportionality, by $f(\beta, b, \lambda) \propto \left\{\prod_{i=1}^{n}\frac{\exp(y_i\theta_i)}{\{1 + \exp(\theta_i)\}^{n_i}}\exp(-0.5\lambda b_i^2)\right\}\lambda^{n/2}\pi(\lambda, \beta)$,

where $\theta_i = X_i\beta + b_i$. Here we introduce the latent variables $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ such that the joint distribution with $\beta$, b and $\lambda$ is given, again up to a constant of proportionality, by $f(\beta, b, \lambda, u, v) \propto \lambda^{n/2}\pi(\lambda, \beta)\left\{\prod_{i=1}^{n} e^{-u_iy_i - v_i(n_i - y_i)}\,I\left(u_i > \log\{1 + e^{-\theta_i}\},\; v_i > \log\{1 + e^{\theta_i}\}\right)\exp(-0.5\lambda b_i^2)\right\}$. The full conditional distribution for $\beta_k$ is given by $f(\beta_k) \propto \pi(\beta_k|\beta_{-k})\,I(\beta_k \in A_k)$, where $A_k$ is the set $\left(\max_{X_{ki} \neq 0}\left\{\left[-\log\{e^{u_i} - 1\} - \sum_{l \neq k}\beta_lX_{li} - b_i\right]/X_{ki}\right\},\; \min_{X_{ki} \neq 0}\left\{\left[\log\{e^{v_i} - 1\} - \sum_{l \neq k}\beta_lX_{li} - b_i\right]/X_{ki}\right\}\right)$. The full conditional distribution for $b_i$ is given by $f(b_i) \propto \exp(-0.5\lambda b_i^2)\,I(b_i \in A_i)$, where $A_i$ is the set $\left(-\log\{e^{u_i} - 1\} - \sum_k X_{ki}\beta_k,\; \log\{e^{v_i} - 1\} - \sum_k X_{ki}\beta_k\right)$. The full conditional distributions for the latent variables are given by $f(u_i) \propto \exp(-u_iy_i)\,I\left(u_i > \log\{1 + e^{-\theta_i}\}\right)$, a truncated exponential($y_i$) distribution, and $f(v_i) \propto \exp\left(-v_i(n_i - y_i)\right)I\left(v_i > \log\{1 + e^{\theta_i}\}\right)$, and the full conditional for $\lambda$ is $f(\lambda) \propto \lambda^{n/2}\exp\left(-\frac{\lambda}{2}\sum_i b_i^2\right)\pi(\lambda)$. All of these full conditionals are of known types. Only minor modifications are required if $y_i = 0$ or $y_i = n_i$.
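A sketch of one possible implementation for an intercept-only version of this model ($\theta_i = \beta + b_i$), which keeps every full conditional in view: shifted exponentials for $u_i$ and $v_i$, truncated normals for $\beta$ and the $b_i$, and a gamma draw for $\lambda$. The synthetic data, the Gamma(1, 1) prior for $\lambda$ and the N(0, 10^2) prior for $\beta$ are our choices; the sketch assumes $0 < y_i < n_i$ so that no 'minor modifications' are needed.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)

# Synthetic data (hypothetical); keep 0 < y_i < n_i so every u_i and v_i is needed.
n_i = np.array([12, 15, 10, 20, 18])
y_i = np.array([3, 9, 4, 11, 6])
n = len(y_i)

mu_beta, sd_beta = 0.0, 10.0      # N prior for the intercept beta (a simplifying choice)
a_lam, b_lam = 1.0, 1.0           # Gamma(shape, rate) prior for the precision lambda

def trunc_normal(mean, sd, lo, hi, rng):
    a, b = norm.cdf((lo - mean) / sd), norm.cdf((hi - mean) / sd)
    return mean + sd * norm.ppf(rng.uniform(a, b))

def gibbs(n_iter=2000, rng=rng):
    beta, b, lam, out = 0.0, np.zeros(n), 1.0, []
    for _ in range(n_iter):
        theta = beta + b
        # latent truncated exponentials (shift by the lower bound: memoryless property)
        u = np.log1p(np.exp(-theta)) + rng.exponential(1.0 / y_i)
        v = np.log1p(np.exp(theta)) + rng.exponential(1.0 / (n_i - y_i))
        lo_i = -np.log(np.expm1(u))   # theta_i must exceed this
        hi_i = np.log(np.expm1(v))    # and stay below this
        # beta | rest: truncated normal (intercept-only model, X_i = 1)
        beta = trunc_normal(mu_beta, sd_beta, (lo_i - b).max(), (hi_i - b).min(), rng)
        # b_i | rest: truncated N(0, 1/lambda)
        for i in range(n):
            b[i] = trunc_normal(0.0, 1.0 / np.sqrt(lam), lo_i[i] - beta, hi_i[i] - beta, rng)
        # lambda | rest: gamma (numpy's gamma takes a scale parameter)
        lam = rng.gamma(a_lam + 0.5 * n, 1.0 / (b_lam + 0.5 * np.sum(b ** 2)))
        out.append(beta)
    return np.array(out)

print("posterior mean intercept:", gibbs()[500:].mean())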

Example 4.1.2 Random effects Poisson model. Here we consider the random effects Poisson model given by $y_i|\theta_i \sim \text{Poisson}(\exp\theta_i)$, $\theta_i = X_i\beta + b_i$, $b_i \sim N(0, \lambda^{-1})$. Priors for $\beta$ and $\lambda$ are taken as in Example 4.1.1. The joint probability distribution of $\beta$, b and $\lambda$ is given by $f(\beta, b, \lambda) \propto \exp\left\{\sum_i\left(y_i\theta_i - \exp\theta_i - 0.5\lambda b_i^2\right)\right\}\lambda^{n/2}\pi(\lambda, \beta)$. Here we introduce the latent variables $u = (u_1, \ldots, u_n)$ and $v = (v_1, \ldots, v_n)$ such that the joint distribution with $\beta$, b and $\lambda$ is given, up to a constant of proportionality, by $f(\beta, b, \lambda, u, v) \propto \lambda^{n/2}\pi(\lambda, \beta)\left\{\prod_i I\left(u_i < e^{y_i\theta_i}\right)e^{-v_i}\,I\left(v_i > e^{\theta_i}\right)\exp(-0.5\lambda b_i^2)\right\}$. The full conditional distribution for $\beta_k$ is given by $f(\beta_k) \propto \pi(\beta_k|\beta_{-k})\,I(\beta_k \in A_k)$, where $A_k$ is the set $\left(\max_{X_{ki} \neq 0}\left\{\left[y_i^{-1}\log u_i - \sum_{l \neq k}\beta_lX_{li} - b_i\right]/X_{ki}\right\},\; \min_{X_{ki} \neq 0}\left\{\left[\log v_i - \sum_{l \neq k}\beta_lX_{li} - b_i\right]/X_{ki}\right\}\right)$. The full conditional distribution for $b_i$ is $f(b_i) \propto \exp(-0.5\lambda b_i^2)\,I(b_i \in A_i)$, where $A_i$ is the set $\left(y_i^{-1}\log u_i - \sum_k X_{ki}\beta_k,\; \log v_i - \sum_k X_{ki}\beta_k\right)$. The full conditional distributions for the latent variables are given by $f(u_i) \propto I\left(u_i < \exp(y_i\theta_i)\right)$, $f(v_i) \propto \exp(-v_i)\,I\left(v_i > \exp(\theta_i)\right)$, and the full conditional for $\lambda$ is $f(\lambda) \propto \lambda^{n/2}\exp\left(-\frac{\lambda}{2}\sum_i b_i^2\right)\pi(\lambda)$. Again, only minor modifications are required for the case when $y_i = 0$.

4.2 Nonlinear mixed models

4.2.1 The model

In the following let i (i = 1, \ldots, n) index individuals and j (j = 1, \ldots, $n_i$), with $N = \sum_i n_i$, index observations within individuals. Let $y_{ij}$ represent an observation or a transformation of the observation (for example the logarithm). The conditional probability model for the observations is given by $y_{ij}|\theta_i \sim N(g(\theta_i, x_{ij}), \sigma^2)$, where $\theta_i$ is the random effect associated with the ith individual, $x_{ij}$ an explanatory variable for the ijth observation and g a known nonlinear mean response function. From now on we will write $g(\theta_i, x_{ij})$ as $g_j(\theta_i)$. The $\theta_i$ are assumed to be normally distributed with mean $\mu$ and variance-covariance matrix $\Sigma$. Here $\sigma$, $\mu$ and $\Sigma$ are the population parameters. Conjugate priors are assigned to these parameters in a manner described in Wakefield et al. (1994). Note that the full conditional density for $\theta_i$ is given, up to a constant of proportionality, by $f(\theta_i) \propto \left\{\prod_{j=1}^{n_i}\exp\left(-0.5\,l_j(\theta_i)/\sigma^2\right)\right\}\pi(\theta_i)$, where $l_j(\theta_i) = (y_{ij} - g_j(\theta_i))^2$ and $\pi(\theta_i) = N(\theta_i|\mu, \Sigma)$. It is not possible to sample this distribution directly without specialist random number generation techniques. Note that here the ratio-of-uniforms method may be used but requires three numerical maximisations for each sample. The adaptive rejection sampling routine cannot be used since the conditional distributions are typically not log-concave. Gilks, Best and Tan (1995) proposed the adaptive rejection Metropolis sampling algorithm for such cases.

We can write this model in a different way by introducing a (latent) random effect $u_{ij}$ for each observation. This latent model is given by $y_{ij}|\theta_i, u_{ij} \sim U\left(g_j(\theta_i) - \sqrt{u_{ij}},\; g_j(\theta_i) + \sqrt{u_{ij}}\right)$ and $u_{ij} \sim G(3/2, \lambda/2)$, where G denotes the gamma distribution and $\lambda = 1/\sigma^2$. It is easily seen that integrating over the $u_{ij}$ returns the original model.
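A quick, illustrative Monte Carlo check of this claim: drawing u from a Gamma(3/2, $\lambda$/2) distribution and then y uniformly on $(m - \sqrt{u}, m + \sqrt{u})$ should reproduce the N(m, 1/$\lambda$) moments. The values of m and $\lambda$ below are arbitrary.

import numpy as np

rng = np.random.default_rng(6)

m, lam = 1.5, 4.0                      # mean and precision (lambda = 1/sigma^2)
n_draws = 200_000

# u ~ Gamma(shape 3/2, rate lambda/2); numpy's gamma takes a scale, so scale = 2/lambda
u = rng.gamma(shape=1.5, scale=2.0 / lam, size=n_draws)
y = rng.uniform(m - np.sqrt(u), m + np.sqrt(u))

print("mixture    mean/var:", y.mean(), y.var())
print("N(m,1/lam) mean/var:", m, 1.0 / lam)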

The full conditional distributions for the random effects are now given by $f(\theta_i) \propto N(\theta_i|\mu, \Sigma)\,I(\theta_i \in A_i)$, where $A_i = \{\theta_i : y_{ij} - \sqrt{u_{ij}} < g_j(\theta_i) < y_{ij} + \sqrt{u_{ij}},\; j = 1, \ldots, n_i\}$. The full conditional distributions for the latent variables are given by $f(u_{ij}) \propto \exp(-\lambda u_{ij}/2)\,I\left(u_{ij} > [y_{ij} - g_j(\theta_i)]^2\right)$. The full conditional distribution for $\lambda$, with prior $\lambda^{-1}$, is given by $G\left(3N/2,\; \sum_{i=1}^{n}\sum_{j=1}^{n_i} u_{ij}/2\right)$. This latent model motivates the algorithm presented in the next section.

4.2.2 The algorithm

The target distribution is given, up to a constant of proportionality, by $f(\theta) \propto \left\{\prod_{j=1}^{m}\exp\left(-0.5\lambda\,l_j(\theta)\right)\right\}\pi(\theta)$, where $l_j(\theta) \geq 0$ for all $\theta$ (here we have removed the subscript i and we put $m = n_i$). The aim is to introduce a latent variable and use Gibbs sampling to generate random variates from the target distribution. Since inference is already enveloped within a Gibbs sampler, the only extra computation required is the sampling of the full conditional distributions for each of the latent variables within each iteration. We define the joint density for $\theta$ and $u = (u_1, \ldots, u_m)$, given up to a constant of proportionality, by $f(\theta, u) \propto \prod_{j=1}^{m}\left\{\exp(-0.5\lambda u_j)\,I\left(u_j > l_j(\theta)\right)\right\}\pi(\theta)$. Clearly the marginal density for $\theta$ is the required target density $f(\cdot)$. The full conditional densities, required for the Gibbs sampling, are given by $f(u_j|\theta) \propto \exp(-0.5\lambda u_j)\,I\left(u_j > l_j(\theta)\right)$,

for $j = 1, \ldots, m$, and $f(\theta|u) \propto \pi(\theta)\,I(\theta \in A_u)$, where $A_u = \{\theta : l_j(\theta) < u_j,\; j = 1, \ldots, m\}$. Sampling from $f(u_j|\theta)$ is straightforward and we assume that sampling from $\pi(\cdot)$ restricted to some set A is also easy (typically, in applications, $\pi(\cdot)$ will be a known multivariate distribution, for example, the normal). Therefore the only remaining difficulty is the determination of the set $A_u$. It should first be pointed out that $A_u$ is not empty: within a Gibbs sampling algorithm it is easily seen that the current $\theta$ must be a member of $A_u$. If it is not possible to obtain the multivariate set $A_u$ then an alternative approach is to sample from $f(\theta|u)$ by sampling from $f(\theta_k|\theta_{-k}, u)$ for $k = 1, \ldots, p$, where p is the dimension of $\theta$. This would involve sampling from $\pi(\theta_k|\theta_{-k})\,I(\theta_k \in A_{k,u})$, where $A_{k,u} = \{\theta_k : l_j(\theta_k, \theta_{-k}) < u_j,\; j = 1, \ldots, m\}$. Clearly each $A_u$ or $A_{k,u}$ will depend on the likelihood $l_j(\cdot)$. We now consider some examples.

Example 4.2.1 One compartment pharmacokinetic model. Here the logged data is assumed to be normally distributed. This gives an approximately constant coefficient of variation, which in such applications mimics assay precision. A Bayesian analysis of a population pharmacokinetic data set (with a one compartment model) then involves the simulation of $\pi(\cdot)$ with $l_j(\theta) = \left(\log y_j - \log d - \theta_1 + x_j\exp(\theta_2)\right)^2$. Here $y_j$ represents the measured concentration of a drug at time $x_j$ after administration of a dose of size d at time $x = 0$. Due to the difficult task of obtaining the two-dimensional set $A_u$, we concentrate on obtaining $A_{1,u}$ and $A_{2,u}$. In this case let $z_j = \log y_j - \log d$. Then $l_j(\theta) < u_j$ implies $\left(z_j - \theta_1 + x_j\exp(\theta_2)\right)^2 < u_j$. This leads to $A_{1,u} = (\max_j\{a_j\}, \min_j\{b_j\})$, where $a_j = z_j - \sqrt{u_j} + x_j\exp(\theta_2)$ and $b_j = z_j + \sqrt{u_j} + x_j\exp(\theta_2)$. If $\max_j\{\theta_1 - z_j - \sqrt{u_j}\} > 0$ then $A_{2,u} = (\max_{j \in S}\{\alpha_j\}, \min_j\{\beta_j\})$,

where $S = \{j : \theta_1 - z_j - \sqrt{u_j} > 0\}$, $\alpha_j = \log\{(\theta_1 - z_j - \sqrt{u_j})/x_j\}$ and $\beta_j = \log\{(\theta_1 - z_j + \sqrt{u_j})/x_j\}$. If $\max_j\{\theta_1 - z_j - \sqrt{u_j}\} \leq 0$ then $A_{2,u} = (-\infty, \min_j\{\beta_j\})$. Note that $\theta_1 - z_j + \sqrt{u_j} > 0$ for all j. Therefore if $\pi(\cdot)$ is a bivariate normal distribution then sampling from $f(\theta_k|\theta_{-k}, u)$ is the sampling of a truncated univariate normal distribution.

Example 4.2.2 Logistic model. For the logistic model we obtain $l_j(\theta) = \left(z_j - \theta_1 + \log\{1 + \exp(\theta_2 + \theta_3 x_j)\}\right)^2$. Therefore $A_{1,u} = (\max_j\{a_j\}, \min_j\{b_j\})$, where $a_j = z_j - \sqrt{u_j} + \log\{1 + \exp(\theta_2 + \theta_3 x_j)\}$ and $b_j = z_j + \sqrt{u_j} + \log\{1 + \exp(\theta_2 + \theta_3 x_j)\}$. Let $S = \{j : \exp(\theta_1 - \sqrt{u_j} - z_j) > 1\}$. If $S \neq \emptyset$ then $A_{2,u} = (\max_{j \in S}\{\alpha_j\}, \min_j\{\beta_j\})$, where $\alpha_j = \log\{\exp(\theta_1 - \sqrt{u_j} - z_j) - 1\} - \theta_3 x_j$ and $\beta_j = \log\{\exp(\theta_1 + \sqrt{u_j} - z_j) - 1\} - \theta_3 x_j$ (note that $\exp(\theta_1 + \sqrt{u_j} - z_j) > 1$). If $S = \emptyset$ then $A_{2,u} = (-\infty, \min_j\{\beta_j\})$. Finally, if $S \neq \emptyset$, $A_{3,u} = (\max_{j \in S}\{\gamma_j\}, \min_j\{\delta_j\})$, where $\gamma_j = [\log\{\exp(\theta_1 - \sqrt{u_j} - z_j) - 1\} - \theta_2]/x_j$ and $\delta_j = [\log\{\exp(\theta_1 + \sqrt{u_j} - z_j) - 1\} - \theta_2]/x_j$. If $S = \emptyset$ then $A_{3,u} = (-\infty, \min_j\{\delta_j\})$. Typically $\pi(\theta)$ will be the normal distribution $N(\theta|\mu, \Sigma)$. Then $f(\theta_k|\theta_{-k}, u)$ will be the univariate normal distribution $N(\theta_k|\mu_k^*, 1/e_{kk})$ restricted to the set $A_{k,u}$, where $\mu_k^* = \mu_k - \sum_{l \neq k} e_{lk}(\theta_l - \mu_l)/e_{kk}$ and $e_{lk}$ is the lkth element of $\Sigma^{-1}$. The algorithm therefore reduces to the sampling of univariate normal distributions restricted to the sets $A_{k,u}$.
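A sketch of the resulting updates for Example 4.2.1, with assumptions of ours layered on top of the paper's description: a diagonal prior covariance for $\theta = (\theta_1, \theta_2)$ (so $\pi(\theta_k|\theta_{-k})$ is just the marginal prior), positive sampling times $x_j$, a fixed precision $\lambda$, and synthetic data.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Hypothetical data: dose d, sampling times x_j > 0 and logged concentrations.
d = 100.0
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
z = np.array([0.35, 0.20, -0.15, -0.90, -2.40])   # z_j = log y_j - log d

mu = np.array([0.3, -0.7])        # prior mean for (theta1, theta2)
sd = np.array([1.0, 1.0])         # prior standard deviations (diagonal Sigma for simplicity)
lam = 50.0                        # 1/sigma^2, held fixed in this sketch

def trunc_normal(mean, s, lo, hi, rng):
    a, b = norm.cdf((lo - mean) / s), norm.cdf((hi - mean) / s)
    return mean + s * norm.ppf(rng.uniform(a, b))

def update(theta, rng):
    t1, t2 = theta
    resid2 = (z - t1 + x * np.exp(t2)) ** 2
    # u_j | theta: exponential(rate lambda/2) shifted to start at l_j(theta)
    u = resid2 + rng.exponential(2.0 / lam, size=len(z))
    r = np.sqrt(u)
    # theta1 | theta2, u: normal restricted to A_{1,u}
    a_j = z - r + x * np.exp(t2)
    b_j = z + r + x * np.exp(t2)
    t1 = trunc_normal(mu[0], sd[0], a_j.max(), b_j.min(), rng)
    # theta2 | theta1, u: normal restricted to A_{2,u}
    upper = np.log((t1 - z + r) / x).min()
    pos = t1 - z - r > 0.0
    lower = np.log((t1 - z[pos] - r[pos]) / x[pos]).max() if pos.any() else -np.inf
    t2 = trunc_normal(mu[1], sd[1], lower, upper, rng)
    return np.array([t1, t2])

theta, draws = mu.copy(), []
for _ in range(3000):
    theta = update(theta, rng)
    draws.append(theta.copy())
print("posterior means:", np.mean(draws[500:], axis=0))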

5 Numerical examples

Example 5.1 Generalised linear model with logit link. Our example is the binomial GLM with a logit link function and a quadratic logistic model given by $y_i|\pi_i \sim \text{binomial}(n_i, \pi_i)$ and $\log\left(\pi_i/(1 - \pi_i)\right) = \beta_1 + z_i\beta_2 + z_i^2\beta_3 = X_i\beta$, $i = 1, \ldots, n$. Further details are provided in Dellaportas and Smith (1993). With a multivariate normal prior for $\beta$, say $N(\mu, \Sigma)$, the posterior distribution is given by $f(\beta) \propto \left\{\prod_i e^{y_iX_i\beta}/(1 + e^{X_i\beta})^{n_i}\right\}\exp\left(-0.5(\beta - \mu)'\Sigma^{-1}(\beta - \mu)\right)$. We introduce the latent variable $u = (u_1, \ldots, u_n)$ such that the joint density with $\beta$ is given, up to a constant of proportionality, by $f(\beta, u) \propto \prod_i I\left(u_i < [1 + \exp(X_i\beta)]^{-n_i}\right)\exp\left(-0.5(\beta - \mu)'\Sigma^{-1}(\beta - \mu) + \nu'\beta\right)$, where $\nu = \sum_{i=1}^{n} y_iX_i'$. The full conditional distributions for each of the $u_i$ are uniform, $f(u_i|u_{-i}, \beta) = U\left(0, [1 + \exp(X_i\beta)]^{-n_i}\right)$. The condition $u_i < (1 + e^{X_i\beta})^{-n_i}$ implies $\exp(X_i\beta) < 1/u_i^{1/n_i} - 1$. Therefore, define the sets $A_{k,u} = \left\{\beta_k : \beta_k < \min_i\left\{\left[\log(1/u_i^{1/n_i} - 1) - \beta_lX_{li} - \beta_mX_{mi}\right]/X_{ki}\right\}\right\}$, where $\{k, l, m\}$ are, in some order, the elements $\{1, 2, 3\}$. Sampling from $f(\beta|u)$ can now be done by sampling successively from $f(\beta_k|\beta_{-k}, u)$, which involves sampling from a univariate normal distribution restricted to the set $A_{k,u}$. This univariate normal distribution is given by $\pi(\beta_k|\beta_{-k})$, where $\pi(\beta)$ is the multivariate normal distribution with mean $\mu + \Sigma\nu$ and covariance matrix $\Sigma$.
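A compact sketch of the complete sampler for this example. The data are synthetic (the Dellaportas and Smith data are not reproduced here), the prior covariance is taken diagonal for readability, and the covariates are chosen positive so that the one-sided bounds above apply directly. Because the normal $\pi(\beta)$ has mean $\mu + \Sigma\nu$, which typically lies far above the truncation point, the one-sided truncated normal is drawn with the exponential-rejection device of Robert (1995) rather than by naive cdf inversion.

import numpy as np

rng = np.random.default_rng(8)

# Synthetic binomial dose-response data (hypothetical; not the Dellaportas-Smith data).
z = np.array([1.7, 1.9, 2.1, 2.3, 2.5, 2.7])
n_i = np.full(6, 30)
y = np.array([2, 6, 12, 19, 25, 28])
X = np.column_stack([np.ones_like(z), z, z ** 2])     # every entry positive by construction

prior_mu = np.zeros(3)
prior_sd = np.array([2.0, 2.0, 2.0])                  # diagonal prior covariance for readability
nu = y @ X                                            # nu = sum_i y_i X_i

def upper_truncated_normal(m, s, c, rng):
    # Draw from N(m, s^2) restricted to (-inf, c); robust when c lies far below m.
    a = (m - c) / s                                   # equivalent standard-normal lower bound
    if a < 0.5:                                       # easy case: plain rejection from N(0, 1)
        while True:
            w = rng.standard_normal()
            if w > a:
                return m - s * w
    while True:                                       # tail case: Robert (1995) exponential rejection
        w = a - np.log(rng.uniform()) / a
        if np.log(rng.uniform()) < -0.5 * (w - a) ** 2:
            return m - s * w

beta, draws = np.zeros(3), []
for _ in range(6000):
    eta = X @ beta
    u = rng.uniform(0.0, (1.0 + np.exp(eta)) ** (-n_i))      # latent uniforms
    cap = np.log(u ** (-1.0 / n_i) - 1.0)                    # X_i beta must stay below cap_i
    for k in range(3):
        others = X @ beta - X[:, k] * beta[k]                # sum over l != k of beta_l X_li
        upper = ((cap - others) / X[:, k]).min()
        mean_k = prior_mu[k] + prior_sd[k] ** 2 * nu[k]      # conditional mean under pi(beta)
        beta[k] = upper_truncated_normal(mean_k, prior_sd[k], upper, rng)
    draws.append(beta.copy())
print("posterior means:", np.mean(draws[1000:], axis=0))

With vague priors the conditional draws sit close to the truncation boundary, so, as in Example 5.2 below, some thinning or a longer run may be needed.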

We analyse a data set relevant to the above example. The data set and prior distribution used are given in Dellaportas and Smith (1993). We start the chain by taking $\beta$ as the location of the prior distribution and then proceed to sample u and then return to $\beta$. We ran the chain for 5,000 iterations (taking under 15 seconds) and collected the last 2,000 for parameter estimation. We can report, as was to be expected, that our parameter estimates ($\beta_1 = -2.36$, $\beta_2 = 0.21$ and $\beta_3 = -0.004$) coincide with those obtained by Dellaportas and Smith. These authors used the adaptive rejection sampling scheme (Gilks and Wild, 1992), which depends on the posterior density being log-concave. (We need no such condition.) This example demonstrates that not only is the algorithm simple to code, but it is also quick.

Example 5.2 Random effects logistic regression. Here we analyse the data set presented in Table 3 of Crowder (1978), which involves binomial data, the proportion of seeds that germinated on each of 21 plates, in a 2 x 2 factorial layout by seed and type of root extract. The model for analysis is described in Example 4.1.1 (Section 4.1). Here we have $\theta_i = \beta_1 + \beta_2 x_{i1} + \beta_3 x_{i2} + \beta_4 x_{i1}x_{i2} + b_i$, where $x_{i1}$, $x_{i2}$ are the seed type and root extract of the ith plate. We encountered some problems with high autocorrelation associated with the Markov chain due to the introduction of the latent variables. To solve this problem we took every 100th sample for inference using a single Markov chain. This reduced the autocorrelation to satisfactory levels. Computing time was about half an hour.

Parameter     Estimate
$\beta_1$     -0.547
$\beta_2$      0.068
$\beta_3$      1.337
$\beta_4$     -0.812
$\sigma$       0.292

Table 1: Parameter estimates for Example 5.2

The parameter estimates for the Crowder data set are given in Table 1. The estimates compare well with those obtained using BUGS. Additionally we provide in Figure 1 the ergodic plots for the parameters obtained from the Markov chain (thinned) output, based on a sample of size 1000.

Figure 1: Ergodic averages for Example 5.2. Parameters are $\beta_1$, $\beta_2$ (row 1); $\beta_3$, $\beta_4$ (row 2) and $\sigma$ (row 3).

Example 5.3 Nonlinear random effects model. The second example is taken from the paper of Lindstrom and Bates (1990). The data set consists of 5 (indexed by i) orange trees with 7 (indexed by j) trunk circumference measurements taken for each tree over an interval of time (denoted by x). The logistic model is used to fit the data, giving $\log y_{ij} = \theta_{1i} - \log\{1 + \exp(\theta_{2i} + x_{ij}\theta_{3i})\} + \epsilon_{ij}$, where $y_{ij}$ are the observed trunk circumference measurements and $\epsilon_{ij}$ are independent $N(0, \sigma^2)$ error variables. As in Example 5.2, the second stage assumes $\theta_i = (\theta_{1i}, \theta_{2i}, \theta_{3i})$ to be independent $N(\mu, \Sigma)$ variables, with priors assigned to $(\sigma^2, \mu, \Sigma)$. The parameter estimates are presented in Table 2. (We took every 10th sample from the Markov chain output for parameter estimation.) These are comparable with the maximum likelihood estimates of Lindstrom and Bates (1990). Histogram representations of the marginal posterior distributions are presented in Figure 2. In Figure 3 we

present the ergodic plots for the parameters based on a sample of size 1000 from the Markov chain (thinned) output. In this example we did not encounter the high autocorrelation which was met in Example 5.2.

Parameter        Estimate
$\mu_1$           5.304
$\mu_2$           2.041
$\mu_3$          -0.00327
$\Sigma_{11}$     0.274
$\Sigma_{22}$     0.576
$\Sigma_{33}$     0.00203

Table 2: Parameter estimates for Example 5.3

Figure 2: Estimated marginal densities from the Gibbs output for Example 5.3: $\mu_1$ (top), $\mu_2$ (middle) and $\mu_3$ (bottom).

6 Discussion, extensions and conclusions

In Section 5 we presented examples using the auxiliary variable method which resulted in quick and efficient MCMC algorithms. Additionally, the algorithm is easy to code, requiring only standard random variate generation routines.

Figure 3: Ergodic averages for Example 5.3. Parameters are $\mu_1$, $\mu_2$ (row 1); $\mu_3$, $\Sigma_{11}$ (row 2) and $\Sigma_{22}$, $\Sigma_{33}$ (row 3).

However, we do not claim that superior efficiency will be the case in general. If there is an efficient Metropolis or rejection algorithm then, rather than introducing latent variables, these may be the preferred choice. A broad question is, "Will a Gibbs sampler with more conditional distributions, all of which are uniform densities, be more efficient than a Gibbs sampler in which some or all of the full conditionals have to be sampled via rejection and/or Hastings type algorithms?" We are not aware of a definitive answer to this question. On the other hand, the question above points to the fact that ease in coding may very well outweigh gains in efficiency, especially when the gains using an alternative approach may be negligible: in many contexts, particularly in examples such as the ones described in this paper, and which are very useful in applications, this appears to be a common trend. In addition to the ease in coding, just like other algorithms, the method proposed in this paper is, in a sense, general. In this spirit the results of this paper can be compared to those of Meng and Van Dyk (1997), but from a Bayesian perspective. Generality combined with ease of computation of the latent variable approach, and gains in efficiency relative to other approaches, in a variety of contexts, are compelling reasons for popularising its use among statisticians. For example, improved efficiency in simulation from distributions arising from exponential power, Student t and stable laws, as well as

truncated versions of the latter laws, are also being studied. These results will be reported elsewhere.

From a practical perspective, suppose that a "one-off" MCMC algorithm is required. Our method provides an "instant" solution and the only issue is that of convergence (which is the case for all chains). Algorithms based on Metropolis or rejection steps typically require "tuning" and even then there is no guarantee that a more efficient chain will emerge. Results on rates of convergence are currently only available for narrow classes of models (see, for example, Polson, 1996). Polson, and the discussants of his paper, point out that answers to questions such as the one stated in this section remain unresolved. But there is no specific reason why the introduction of latent variables should reduce efficiency. On the contrary, Polson, in his section titled 'Using Latent Variables To Improve Convergence', reports that "Careful use of latent variables...can lead to vast improvements in efficiency." The examples in Section 4 of Polson's paper give support to the auxiliary variable approach for two types of distribution. Polson indicates there will be improved efficiency for these cases. That there should be a significant reduction in efficiency for all other types of distribution, with the introduction of auxiliary variables, may not, of course, follow. But Roberts (personal communication) has shown that our method (which Radford Neal at the University of Toronto first referred to as "slice sampling") always converges geometrically, under very mild regularity conditions: this is untrue in general for other MCMC algorithms, which require far more stringent conditions to obtain geometric convergence. We finally note that the method developed in this paper may be used for random variate generation in general. A comprehensive comparison of alternative methods for random variate generation for sampling well-known densities will be reported elsewhere.

Acknowledgements

The authors are grateful to a Joint Editor, several referees and Adrian Smith for critical comments on earlier drafts of the paper. Thanks, also, to Gareth Roberts for sharing his results on convergence of "slice sampling" methods.

References

Beal,S.L. and Sheiner,L.B. (1982). Estimating population kinetics. CRC Critical Reviews in Biomedical Engineering 8, 195-222.

Bennett,J.E., Racine-Poon,A. and Wakefield,J.C. (1996). MCMC for nonlinear hierarchical models. In Markov chain Monte Carlo methods in practice (eds. W.R.Gilks, S.Richardson and D.J.Spiegelhalter), pp. 339-357. London: Chapman and Hall.

Berkey,C.S. (1982). Bayesian approach for a nonlinear growth model. Biometrics 38, 953-961.

Besag,J. and Green,P.J. (1993). Spatial statistics and Bayesian computation. J. Roy. Statist. Soc. B 55, 25-37.

Breslow,N.E. (1984). Extra-Poisson variation in log-linear models. Applied Statistics 33, 38-44.

Breslow,N.E. and Clayton,D.G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9-25.

Chib,S. and Greenberg,E. (1995). Understanding the Metropolis-Hastings Algorithm. Amer. Statist. 49, 327-335.

Clayton,D.G. and Kaldor,J. (1987). Empirical Bayes estimates of age-standardised relative risks for use in disease mapping. Biometrics 43, 671-682.

Crowder,M.J. (1978). Beta-binomial ANOVA for proportions. Applied Statistics 27, 34-37.

Cumbus,C., Damien,P. and Walker,S.G. (1997). Sampling truncated multivariate normal and other densities within the Gibbs sampler.

Damien,P. and Walker,S.G. (1996). Sampling probability densities via uniform random variables and a Gibbs sampler. University of Michigan Business

School Working Paper.

Davidian,M. and Giltinan,D. (1995). Nonlinear Models for Repeated Measurement Data. London: Chapman and Hall.

Dellaportas,P. and Smith,A.F.M. (1993). Bayesian inference for generalised linear and proportional hazards models via Gibbs sampling. Appl. Statist. 42, 443-459.

Devroye,L. (1986). Non-uniform random variate generation. New York: Springer.

Edwards,R.G. and Sokal,A.D. (1988). Generalization of the Fortuin-Kasteleyn-Swendsen-Wang representation and Monte Carlo algorithms. Phys. Rev. D 38, 2009-2012.

Fearn,T. (1975). A Bayesian approach to growth curves. Biometrika 62, 89-100.

Gatsonis,C., Epstein,A., Newhouse,J., Normand,S. and McNeil,B. (1995). Variations in the utilization of coronary angiography for elderly patients with an acute myocardial infarction: an analysis using hierarchical logistic regression. Medical Care 33, 625-642.

Ghosh,M. and Rao,J.N.K. (1994). Small area estimation: an appraisal. Statistical Science 9, 55-93.

Gilks,W.R. and Wild,P. (1992). Adaptive rejection sampling for Gibbs sampling. Appl. Statist. 41, 337-348.

Gilks,W.R., Best,N.G. and Tan,K.K.C. (1995). Adaptive rejection Metropolis sampling within Gibbs sampling. Applied Statistics 44, 455-472.

Gilmour,A.R., Anderson,R.D. and Rae,A.L. (1985). The analysis of binomial data by a generalized linear mixed model. Biometrika 72, 593-599.

Goldstein,H. (1991). Nonlinear multilevel models, with an application to

discrete response data. Biometrika 78, 45-51.

Goldstein,H. (1995). Multilevel Statistical Models. London: Edward Arnold.

Hastings,W.K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.

Higdon,D. (1996). Auxiliary variable methods for Markov chain Monte Carlo with applications. Technical Report, Duke University.

Kahn,M.J. and Raftery,A.E. (1996). Discharge rates of Medicare stroke patients to skilled nursing facilities: Bayesian logistic regression with unobserved heterogeneity. Journal of the American Statistical Association 91, 29-41.

Kass,R.E. and Steffey,D. (1989). Approximate Bayesian inference in conditionally independent hierarchical models (parametric empirical Bayes models). Journal of the American Statistical Association 84, 717-726.

Liang,K. and Zeger,S. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13-22.

Lindstrom,M.J. and Bates,D.M. (1990). Nonlinear mixed-effects models for repeated-measures data. Biometrics 46, 673-687.

Meng,X.L. and Van Dyk,D. (1997). The EM algorithm - an old folk-song sung to a fast new tune (with discussion). J. Roy. Statist. Soc. B. To appear.

Metropolis,N., Rosenbluth,A.W., Rosenbluth,M.N., Teller,A.H. and Teller,E. (1953). Equations of state calculations by fast computing machines. Journal of Chemical Physics 21, 1087-1091.

Polson,N.G. (1996). Convergence of Markov chain Monte Carlo algorithms. In Bayesian Statistics 5 (eds. J.M.Bernardo et al.), pp. 297-321. Oxford: Oxford University Press.

Racine-Poon,A. (1985). A Bayesian approach to nonlinear random effects

models. Biometrics 41, 1015-1024.

Robert,C.P. (1995). Simulation of truncated normal variables. Statist. Comput. 5, 121-125.

Schall,R. (1991). Estimation in generalized linear models with random effects. Biometrika 78, 719-727.

Skene,A.M. and Wakefield,J.C. (1990). Hierarchical models for multivariate binary response studies. Statistics in Medicine 9, 919-929.

Smith,A.F.M. and Roberts,G.O. (1993). Bayesian computations via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. B 55, 3-23.

Spiegelhalter,D.J., Thomas,A., Best,N.G. and Gilks,W.R. (1995). BUGS: Bayesian inference Using Gibbs Sampling, Version 0.50. Cambridge: Medical Research Council Biostatistics Unit.

Stiratelli,R., Laird,N. and Ware,J.H. (1984). Random effects models for serial observations with binary responses. Biometrics 40, 961-971.

Tierney,L. (1994). Markov chains for exploring posterior distributions. Ann. Statist. 22, 1701-1762.

Vonesh,E. and Carter,R.L. (1992). Mixed effects nonlinear regression for unbalanced repeated measures. Biometrics 48, 1-17.

Wakefield,J.C., Smith,A.F.M., Racine-Poon,A. and Gelfand,A.E. (1994). Bayesian analysis of linear and non-linear population models using the Gibbs sampler. Applied Statistics 43, 201-221.

Wakefield,J.C., Gelfand,A.E. and Smith,A.F.M. (1991). Efficient generation of random variates via the ratio-of-uniforms method. Statistics and Computing 1, 129-133.

Wakefield,J.C. (1996). The Bayesian analysis of population pharmacokinetic

models. Journal of the American Statistical Association 91, 62-75.

Walker,S.G. (1996). An EM algorithm for nonlinear random effects models. Biometrics 52, 934-944.

Williams,D.A. (1982). Extra-binomial variation in logistic linear models. Applied Statistics 31, 144-148.

Zeger,S.L. and Karim,M.R. (1991). Generalized linear models with random effects: a Gibbs sampling approach. Journal of the American Statistical Association 86, 79-86.

Zeger,S.L., Liang,K.-Y. and Albert,P.S. (1988). Models for longitudinal data: a generalized estimating equation approach. Biometrics 44, 1049-1060.