RESEARCH SUPPORT
UNIVERSITY OF MICHIGAN BUSINESS SCHOOL
JULY 1996

SAMPLING PROBABILITY DENSITIES VIA UNIFORM RANDOM VARIABLES AND A GIBBS SAMPLER

WORKING PAPER #9612-05

PAUL DAMIEN AND STEPHEN WALKER
UNIVERSITY OF MICHIGAN BUSINESS SCHOOL

Sampling Probability Densities via Uniform Random Variables and a Gibbs Sampler

Paul Damien (1) and Stephen Walker (2)

(1) Department of Statistics and Management Science, School of Business, University of Michigan, Ann Arbor, 48109-1234, USA.
(2) Department of Mathematics, Imperial College, 180 Queen's Gate, London SW7 2BZ.

SUMMARY

Continuous densities on the real line are sampled given only an iid sequence of uniform random variables on the interval (0, 1), using a Gibbs sampling scheme. The method is exemplified via several examples. A feature of the paper is to provide an alternative sampling algorithm to rejection-based methods and to other sampling approaches such as the Metropolis-Hastings algorithm, especially from a Bayesian perspective.

Keywords: Gibbs sampler, Latent variables, Uniform density.

1 Introduction

Let f be a continuous density function defined on the real line. We address the problem of generating a random variate X from f.

We start with the assumption that it is possible to sample uniform random variables from the interval (0, 1), providing an iid sequence U1, U2, ... of such variables. The basic idea is to introduce a latent variable Y, construct the joint density of Y and X with marginal density for X given by f, and to use the Gibbs sampler (see, for example, Smith and Roberts, 1993) to generate random variates from f. This is done by simulating a Markov chain {X_n} in which, given X_n = x, Y is taken from f(y|x) and then X_{n+1} is taken from f(x|Y = y). Under mild regularity conditions X_n converges in distribution to X ~ f. Throughout we look for conditional distributions which can be sampled using the appropriate uniform random variables.

For a historical overview of Markov chain methods and the use of latent (auxiliary) variables the reader is referred to Besag and Green (1993). With the widespread use of the Gibbs sampler, this paper is relevant, especially from a Bayesian perspective. In particular, the new algorithm, after some preliminary analysis (details later), provides a Gibbs sampler in which all the full conditional densities are of known type. In many contexts this may lead to a more efficient method than the Metropolis-Hastings algorithm (Tierney, 1994), the adaptive rejection method for log-concave densities (Gilks and Wild, 1992) and other rejection algorithms.

Preliminaries

First, being able to sample a uniform random variable from the interval (0, 1) allows us to sample a uniform random variable from any interval (a, b); we write such a density as U(a, b).

Let f(x) ∝ exp(−βx) I(a < x < b), where β > 0, 0 ≤ a < b ≤ ∞ and I denotes the indicator function. We write such a density as E(β, a, b). Sampling X from E(β, a, b) can be done by taking U from U(1 − exp(−βa), 1 − exp(−βb)) and setting X = −(1/β) log(1 − U).

Let f(x) ∝ (1 − x)^{β−1} I(a < x < b), where β > 0 and 0 ≤ a < b ≤ 1. We write such a density as B(β, a, b), and it is easy to see that if Y ~ E(β, −log(1 − a), −log(1 − b)) and X = 1 − exp(−Y), then X ~ B(β, a, b). Thus sampling an X from B(β, a, b) follows from the above. Alternatively, one could use the inverse transform method to sample from B(β, a, b).

In subsequent sections, we will show that E(β, a, b), B(β, a, b) and U(a, b), along with an appropriately defined Gibbs sampler, are sufficient to generate random variates from most of the standard densities, and from many nonstandard and complicated densities as well.
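To make these building blocks concrete, the following is a minimal Python sketch (not part of the original paper) of samplers for E(β, a, b) and B(β, a, b), each driven by a single uniform variate; the function names and the use of numpy are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_E(beta, a, b, rng=rng):
    """Draw from E(beta, a, b): density proportional to exp(-beta*x) on (a, b).

    Inverse-CDF method driven by one uniform variate, as in the Preliminaries.
    b may be np.inf.
    """
    lo = 1.0 - np.exp(-beta * a)
    hi = 1.0 - np.exp(-beta * b)   # equals 1.0 when b = np.inf
    u = rng.uniform(lo, hi)
    return -np.log(1.0 - u) / beta

def sample_B(beta, a, b, rng=rng):
    """Draw from B(beta, a, b): density proportional to (1-x)^(beta-1) on (a, b) in (0, 1).

    Obtained by transforming an E(beta, ., .) variate, as in the text.
    """
    y = sample_E(beta, -np.log(1.0 - a), -np.log(1.0 - b), rng)
    return 1.0 - np.exp(-y)

# Example: one draw from E(2, 0.5, inf) and one from B(3, 0.1, 0.9)
print(sample_E(2.0, 0.5, np.inf), sample_B(3.0, 0.1, 0.9))
```

Later sketches in this paper reuse these helpers.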

2 Densities from Johnson and Kotz

While there is not an obvious theorem underlying the proposed method, there is a common thread, which will be exemplified. Here we state this as a rule of thumb:

(1) Given a target density f, defined up to a constant of proportionality, identify factors in f which can be replaced easily via latent variables that induce full conditional densities which can be sampled using a uniform random variable.

(2) Set up the full conditional distributions for the Gibbs sampler based on (1).

Example 1. Normal

First we consider the normal(0, 1) density with density function f(x) ∝ exp(−0.5x²). We introduce the latent variable Y, defined on (0, ∞), which has joint density function with X given, up to a constant of proportionality, by f(x, y) ∝ exp(−0.5y) I(y > x²). Clearly the marginal density for X is the normal(0, 1) density. The conditional densities are given by f(y|x) = E(0.5, x², ∞) and f(x|y) = U(−√y, +√y). For a normal(μ, σ) density we simply generate a X_{(0,1)} variable and take X_{(μ,σ)} = σX_{(0,1)} + μ.

Note that the sampling of the normal(0, 1) density could also be done by defining the joint density by f(x, y) ∝ I(y < exp(−0.5x²)). The conditional densities are now given by f(y|x) = U(0, exp(−0.5x²)) and f(x|y) = U(−√(−2 log(y)), +√(−2 log(y))).

It is no more difficult to sample a truncated normal(0, 1) variate with density f(x) ∝ exp(−0.5x²) I(a < x < b). The joint density of X and Y is given by f(x, y) ∝ exp(−0.5y) I(y > x², a < x < b). The conditional density f(y|x) remains unchanged, but the conditional density f(x|y) becomes U(max{−√y, a}, min{+√y, b}).
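A minimal sketch of the resulting two-step Gibbs sampler for the normal(0, 1) density, under the first formulation above; it reuses the sample_E helper from the Preliminaries sketch, and the function name is ours.

```python
import numpy as np

def gibbs_normal(n_iter=5000, x0=0.0, rng=None):
    """Gibbs sampler for the standard normal via the latent variable of Example 1.

    Alternates f(y | x) = E(0.5, x^2, inf) and f(x | y) = U(-sqrt(y), +sqrt(y)).
    Uses sample_E from the Preliminaries sketch above.
    """
    rng = rng or np.random.default_rng(1)
    x = x0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        y = sample_E(0.5, x * x, np.inf, rng)     # truncated exponential
        x = rng.uniform(-np.sqrt(y), np.sqrt(y))  # uniform on (-sqrt(y), sqrt(y))
        draws[t] = x
    return draws

# Quick check: the sample mean and variance should be near 0 and 1.
out = gibbs_normal()
print(out.mean(), out.var())
```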

Example 2. Gamma

We first consider the gamma(α, 1) density with density function given, up to a constant of proportionality, by f(x) ∝ x^{α−1} exp(−x) I(x > 0) for α > 0. We introduce the latent variable Y, defined on (0, ∞), which has joint density function with X given, up to a constant of proportionality, by f(x, y) ∝ I(y < x^{α−1}, x > 0) exp(−x). The conditional densities are given by f(y|x) = U(0, x^{α−1}) and f(x|y) = E(1, y^{1/(α−1)}, ∞) if α > 1, and f(x|y) = E(1, 0, (1/y)^{1/(1−α)}) if α < 1. Trivially, α = 1 implies f(x) = E(1, 0, ∞). Such a Gibbs algorithm may be appropriate when α < 1 is very small. For a gamma(α, β) density we sample a X_{(α,1)} variable and take X_{(α,β)} = (1/β) X_{(α,1)}. Again, as in Example 1, sampling a truncated gamma density will pose no extra problem.
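The corresponding sketch for the gamma(α, 1) density follows the same pattern; for very small α a log-scale implementation would be needed to avoid floating-point underflow, so the usage line below takes α = 0.5. Again, sample_E is the helper defined in the Preliminaries sketch, and the function name is ours.

```python
import numpy as np

def gibbs_gamma(alpha, n_iter=5000, x0=1.0, rng=None):
    """Gibbs sampler for gamma(alpha, 1) via the latent variable of Example 2.

    f(y | x) = U(0, x^(alpha-1)); f(x | y) is a truncated exponential whose
    support depends on whether alpha is above or below 1.
    Uses sample_E from the Preliminaries sketch above.
    """
    rng = rng or np.random.default_rng(2)
    x = x0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        y = rng.uniform(0.0, x ** (alpha - 1.0))
        if alpha > 1.0:
            x = sample_E(1.0, y ** (1.0 / (alpha - 1.0)), np.inf, rng)
        else:
            x = sample_E(1.0, 0.0, (1.0 / y) ** (1.0 / (1.0 - alpha)), rng)
        draws[t] = x
    return draws

# For a gamma(alpha, beta) variate, divide the output by beta.
print(gibbs_gamma(0.5).mean())   # should be near 0.5
```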

Example 3. Beta

Here we have f(x) ∝ x^{α−1}(1 − x)^{β−1} I(0 < x < 1) for α, β > 0. We introduce the latent variable Y, defined on (0, ∞), which has joint density function with X given, up to a constant of proportionality, by f(x, y) ∝ I(y < x^{α−1}, 0 < x < 1)(1 − x)^{β−1}. The conditional densities are given by f(y|x) = U(0, x^{α−1}) and f(x|y) = B(β, y^{1/(α−1)}, 1) if α > 1, and f(x|y) = B(β, 0, (1/y)^{1/(1−α)}) if α < 1. Trivially, α = 1 implies f(x) = B(β, 0, 1). Again, as in Examples 1 and 2, sampling a truncated beta density will pose no extra problem. We note that the above method overcomes numerical problems when one wishes to sample from a beta distribution with parameters that are less than one and very small.

It is apparent that this method can be applied to other well-known densities as well. For example, the Student's t, chi-squared and Weibull densities can be sampled using transformations or mixtures of the above.

Example 4. Cauchy

The Cauchy(0, 1) density is given by f(x) ∝ 1/(1 + x²). We define the joint density of X and Y, a random variable defined on (0, 1), by f(x, y) ∝ I(y < 1/(1 + x²)). The full conditional densities are given by f(y|x) = U(0, 1/(1 + x²)) and f(x|y) = U(−√(1/y − 1), +√(1/y − 1)).
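A sketch of the Cauchy(0, 1) sampler of Example 4, in which both full conditionals are plain uniforms; the function name and seed are ours.

```python
import numpy as np

def gibbs_cauchy(n_iter=1000, x0=0.0, rng=None):
    """Gibbs sampler for the Cauchy(0, 1) density of Example 4.

    Alternates f(y | x) = U(0, 1/(1+x^2)) and
    f(x | y) = U(-sqrt(1/y - 1), +sqrt(1/y - 1)).
    """
    rng = rng or np.random.default_rng(4)
    x = x0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        y = rng.uniform(0.0, 1.0 / (1.0 + x * x))
        half_width = np.sqrt(1.0 / y - 1.0)
        x = rng.uniform(-half_width, half_width)
        draws[t] = x
    return draws

# 1000 draws, as in the numerical example later in the paper
print(gibbs_cauchy()[:5])
```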

Example 5. Pareto

The Pareto(α, a) density is given, up to a constant of proportionality, by f(x) ∝ 1/x^{α+1} I(x > a), where α, a > 0. We define the joint density of X and Y, a random variable defined on (0, 1/a^{α+1}), by f(x, y) ∝ I(y < 1/x^{α+1}, x > a). The full conditional densities are given by f(y|x) = U(0, 1/x^{α+1}) and f(x|y) = U(a, (1/y)^{1/(α+1)}).

Example 6. Inverse Gaussian

We consider the inverse Gaussian with density given, up to a constant of proportionality, by f(x) ∝ x^{−3/2} exp(−x − 1/x) I(x > 0). Here we introduce two latent variables Y and Z, defined on (0, ∞) and (0, 1), respectively, such that their joint density with X is given, up to a constant of proportionality, by f(x, y, z) ∝ exp(−x) I(y < x^{−3/2}) I(z < exp(−1/x)) I(x > 0). The full conditional densities are given by f(y|x, z) = U(0, x^{−3/2}), f(z|x, y) = U(0, exp(−1/x)) and f(x|y, z) = E(1, −1/log(z), (1/y)^{2/3}).
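A sketch of the two-latent-variable scheme of Example 6; it again relies on the sample_E helper from the Preliminaries sketch, and the function name is ours.

```python
import numpy as np

def gibbs_invgauss(n_iter=5000, x0=1.0, rng=None):
    """Gibbs sampler for f(x) proportional to x^(-3/2) exp(-x - 1/x), x > 0 (Example 6).

    Two latent variables: y < x^(-3/2) and z < exp(-1/x); the conditional for x
    is the truncated exponential E(1, -1/log(z), y^(-2/3)).
    Uses sample_E from the Preliminaries sketch above.
    """
    rng = rng or np.random.default_rng(6)
    x = x0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        y = rng.uniform(0.0, x ** (-1.5))
        z = rng.uniform(0.0, np.exp(-1.0 / x))
        lower = -1.0 / np.log(z)          # from z < exp(-1/x)
        upper = y ** (-2.0 / 3.0)         # from y < x^(-3/2)
        x = sample_E(1.0, lower, upper, rng)
        draws[t] = x
    return draws

print(gibbs_invgauss().mean())
```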

3 Nonstandard Densities

This method, and other related concepts in the context of neutral to the right processes (Doksum, 1974), are studied in detail in Walker and Damien (1996). In this section we apply our method to densities encountered within the context of Bayesian nonparametrics, which have appeared in the recent literature: see, for example, Damien (1994), Damien et al. (1995, 1996), Laud et al. (1993, 1996) and Walker (1995, 1996). In addition, we illustrate the method for some Bayesian non-conjugate models.

Example 7. D-Distributions

The class of D-distributions was introduced by Laud (1977). A random variable X on (0, ∞) is said to have a D-distribution with parameters α, β > 0 and γ > 0 if its density function is given, up to a constant of proportionality, by f(x) ∝ x^{α−1} exp(−βx){1 − exp(−x)}^γ. Here we introduce the latent variable Y, defined on the interval (0, 1), such that the joint density of X and Y is given, up to a constant of proportionality, by f(x, y) ∝ x^{α−1} exp(−βx)(1 − y)^{γ−1} I(y > exp(−x)). The conditional densities are given by f(y|x) = B(γ, exp(−x), 1) and f(x|y) ∝ x^{α−1} exp(−βx) I(x > −log(y)), a truncated gamma(α, β) density (see Example 2). The D-distributions are special cases of the class of SD-distributions (Damien et al., 1995), and the method of this paper can be used to sample from such distributions (Walker, 1995). We omit details here.

Example 8. Diaconis/Kemperman

Here we consider the unusual density uncovered by Diaconis and Kemperman (1996), which is given, up to a constant of proportionality, by f(x) ∝ x^{−x}(1 − x)^{−(1−x)} sin(πx) I(0 < x < 1). We introduce five latent variables: U, V and W, all defined on (0, 1), and Y and Z, both defined on (0, ∞), such that their joint density with X is given, up to a constant of proportionality, by

f(u, v, w, x, y, z) ∝ I(w < sin(πx)) I(u < exp(−xy)) I(v < exp{−(1 − x)z}) I(y > −log(1 − x)) I(z > −log(x)).

The conditional densities are given by f(u|v, w, x, y, z) = U(0, exp(−xy)), f(v|w, x, y, z, u) = U(0, exp{−(1 − x)z}), f(w|x, y, z, u, v) = U(0, sin(πx)), f(y|z, u, v, w, x) = U(−log(1 − x), −x^{−1} log(u)), f(z|u, v, w, x, y) = U(−log(x), −(1 − x)^{−1} log(v)) and f(x|y, z, u, v, w) = U(max{a_w, exp(−z), 1 + z^{−1} log(v)}, min{b_w, −y^{−1} log(u), 1 − exp(−y)}), where (a_w, b_w) = A_w = {x : sin(πx) > w}. It is not clear that this density could be sampled in any other way.
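A sketch of the six-variable Gibbs scheme of Example 8. The only ingredient beyond uniform draws is the observation that A_w is the interval (arcsin(w)/π, 1 − arcsin(w)/π); the initialisation of y and z (chosen so that the starting state lies in the joint support) and the function name are ours.

```python
import numpy as np

def gibbs_diaconis_kemperman(n_iter=10000, x0=0.5, rng=None):
    """Gibbs sampler for f(x) proportional to x^(-x) (1-x)^(-(1-x)) sin(pi x) on (0, 1).

    Six-variable scheme of Example 8; every full conditional is a uniform.
    A_w = {x : sin(pi x) > w} is the interval (arcsin(w)/pi, 1 - arcsin(w)/pi).
    """
    rng = rng or np.random.default_rng(8)
    x = x0
    # initialise the latent variables consistently with x0 (our choice)
    y = -np.log(1.0 - x) + 1.0
    z = -np.log(x) + 1.0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        u = rng.uniform(0.0, np.exp(-x * y))
        v = rng.uniform(0.0, np.exp(-(1.0 - x) * z))
        w = rng.uniform(0.0, np.sin(np.pi * x))
        y = rng.uniform(-np.log(1.0 - x), -np.log(u) / x)
        z = rng.uniform(-np.log(x), -np.log(v) / (1.0 - x))
        a_w, b_w = np.arcsin(w) / np.pi, 1.0 - np.arcsin(w) / np.pi
        lower = max(a_w, np.exp(-z), 1.0 + np.log(v) / z)
        upper = min(b_w, -np.log(u) / y, 1.0 - np.exp(-y))
        x = rng.uniform(lower, upper)
        draws[t] = x
    return draws

print(gibbs_diaconis_kemperman().mean())
```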

Next we consider some common Bayesian non-conjugate models.

Bayesian Non-conjugate Models

If a posterior is given by f(x) ∝ l(x)π(x), where l(·) represents the likelihood and π(·) the prior, then the general idea is to introduce a latent variable Y, defined on the interval (0, ∞), or more strictly on the interval (0, l(x̂)), where x̂ maximises l(·), and to define the joint density with X by f(x, y) ∝ I(y < l(x)) π(x). The full conditional for Y is U(0, l(x)) and the full conditional for X is π restricted to the set A_y = {x : l(x) > y}.

Example 9. Poisson/log-normal model

A Poisson likelihood with a log-normal prior produces the posterior f(x) ∝ exp{nx − exp(x)} exp(−0.5x²), where we assume without loss of generality that the prior is normal(0, 1). We introduce the latent variable Y, defined on the interval (0, ∞), such that the joint density with X is given by f(x, y) ∝ exp(−y) I(y > exp(x)) exp{−0.5(x² − 2nx)}, which leads to the conditional densities f(y|x) = E(1, exp(x), ∞) and f(x|y) ∝ exp{−0.5(x − n)²} I(−∞ < x < log(y)), a truncated normal(n, 1) density (see Example 1).
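A sketch of the Gibbs sampler of Example 9. For the truncated normal step we use inverse-CDF sampling via scipy's normal cdf and ppf rather than the latent-variable construction of Example 1; the function name and the illustrative value n = 3 are ours, and sample_E is the helper from the Preliminaries sketch.

```python
import numpy as np
from scipy.stats import norm

def gibbs_poisson_lognormal(n_obs, n_iter=5000, x0=0.0, rng=None):
    """Gibbs sampler for the posterior proportional to exp(n x - exp(x) - 0.5 x^2).

    f(y | x) = E(1, exp(x), inf); f(x | y) is normal(n, 1) truncated to (-inf, log y).
    The truncated normal step uses inverse-CDF sampling via scipy.
    """
    rng = rng or np.random.default_rng(9)
    x = x0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        y = sample_E(1.0, np.exp(x), np.inf, rng)          # latent variable
        cap = norm.cdf(np.log(y) - n_obs)                  # P(N(n, 1) <= log y)
        x = n_obs + norm.ppf(rng.uniform(0.0, cap))        # inverse-CDF truncation
        draws[t] = x
    return draws

# e.g. with n = 3 in the posterior above
print(gibbs_poisson_lognormal(3).mean())
```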

Example 10. Binomial/logit model

Here we have m|x ~ binomial(n, x) with a normal(0, 1) prior for log{x/(1 − x)}, which produces the posterior (written in terms of the logit) f(x) ∝ exp(mx){1 + exp(x)}^{−n} exp(−0.5x²). We introduce the latent variable Y, defined on the interval (0, 1), such that the joint density with X is given by f(x, y) ∝ I(y < {1 + exp(x)}^{−n}) exp{−0.5(x² − 2mx)}, which leads to the conditional densities f(y|x) = U(0, {1 + exp(x)}^{−n}) and f(x|y) ∝ exp{−0.5(x − m)²} I(−∞ < x < log{1/y^{1/n} − 1}), a truncated normal(m, 1) density (see Example 1).

Example 11. Bernoulli/logistic regression model

Consider the following model, in which y_i | X = x ~ Bernoulli(1/{1 + exp(−μ − x z_i)}), i = 1,..., n, where X ~ normal(0, 1) is the prior (we assume μ is known). The posterior density for X is given, up to a constant of proportionality, by

f(x) ∝ exp(−0.5x²) Π_{i: y_i = 1} {1 + exp(−μ − x z_i)}^{−1} Π_{i: y_i = 0} {1 + exp(μ + x z_i)}^{−1}.

Here we introduce the latent variables U = (U_1,..., U_n) and V = (V_1,..., V_n), where both U and V are defined on (0, 1)^n, such that their joint density with X is given, up to a constant of proportionality, by

f(x, u, v) ∝ exp(−0.5x²) Π_{i: y_i = 1} I(u_i < {1 + exp(−μ − x z_i)}^{−1}) Π_{i: y_i = 0} I(v_i < {1 + exp(μ + x z_i)}^{−1}).

The full conditional densities f(u_i | u_{−i}, v, x) and f(v_i | v_{−i}, u, x) are all uniform. For example, f(u_i | u_{−i}, v, x) = U(0, {1 + exp(−μ − x z_i)}^{−1}) and f(v_i | v_{−i}, u, x) = U(0, {1 + exp(μ + x z_i)}^{−1}). Let S = {i : y_i = 1} ∩ {i : z_i ≠ 0} and R = {i : y_i = 0} ∩ {i : z_i ≠ 0}. Then f(x | u, v) ∝ exp(−0.5x²) I(x ∈ A_{uv}), where A_{uv} = (max_{i∈S}{a_i}, min_{i∈R}{b_i}), a_i = {log(u_i/(1 − u_i)) − μ}/z_i and b_i = {log((1 − v_i)/v_i) − μ}/z_i (taking z_i > 0; for negative z_i the corresponding bounds change sides). This is then a truncated normal density. Note that if S = ∅ then replace max_{i∈S}{a_i} by −∞, and if R = ∅ then replace min_{i∈R}{b_i} by +∞.
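A sketch of the sampler of Example 11, with the truncated normal step again done by inverse-CDF sampling via scipy. The covariates are taken to be positive, matching the max/min bounds as written, and the function name and the data in the usage line are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def gibbs_logistic(y, z, mu, n_iter=5000, x0=0.0, rng=None):
    """Gibbs sampler for the Bernoulli/logistic-regression posterior of Example 11.

    y: 0/1 responses; z: covariates (assumed positive here); mu: known intercept.
    The slope x has a normal(0, 1) prior; its truncated-normal full conditional
    is sampled by inverse-CDF truncation via scipy.
    """
    rng = rng or np.random.default_rng(11)
    y, z = np.asarray(y), np.asarray(z)
    S, R = (y == 1) & (z != 0), (y == 0) & (z != 0)
    x = x0
    draws = np.empty(n_iter)
    for t in range(n_iter):
        u = rng.uniform(0.0, 1.0 / (1.0 + np.exp(-mu - x * z[S])))
        v = rng.uniform(0.0, 1.0 / (1.0 + np.exp(mu + x * z[R])))
        lower = np.max((np.log(u / (1.0 - u)) - mu) / z[S]) if S.any() else -np.inf
        upper = np.min((np.log((1.0 - v) / v) - mu) / z[R]) if R.any() else np.inf
        lo, hi = norm.cdf(lower), norm.cdf(upper)             # prior is normal(0, 1)
        x = norm.ppf(rng.uniform(lo, hi))
        draws[t] = x
    return draws

# Hypothetical data: 6 binary responses with positive covariates and mu = -1
print(gibbs_logistic([1, 0, 1, 1, 0, 1], [0.5, 1.0, 1.5, 2.0, 2.5, 3.0], -1.0).mean())
```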

Example 12. Probit model

Here we have the posterior density given, up to a constant of proportionality, by

f(β) ∝ Π_i {Φ(β_0 + β_1 z_i)}^{y_i} Π_i {1 − Φ(β_0 + β_1 z_i)}^{n_i − y_i} π(β),

where we assume a multivariate normal(μ, Σ) prior for β and Φ is the standard normal distribution function. We introduce the latent variables U = (U_1,..., U_n) and V = (V_1,..., V_n), where both U and V are defined on (0, 1)^n, such that their joint density with β is given, up to a constant of proportionality, by

f(β, u, v) ∝ Π_i I(u_i < {Φ(β_0 + β_1 z_i)}^{y_i}) Π_i I(v_i < {1 − Φ(β_0 + β_1 z_i)}^{n_i − y_i}) π(β).

The full conditional densities f(u_i | u_{−i}, v, β) and f(v_i | v_{−i}, u, β) are uniform (see Example 11 for how to obtain them), and interest is in f(β_k | β_{−k}, u, v). Let a_i = Φ^{−1}(r_i) − β_1 z_i, b_i = {Φ^{−1}(r_i) − β_0}/z_i, c_i = Φ^{−1}(λ_i) − β_1 z_i and d_i = {Φ^{−1}(λ_i) − β_0}/z_i, where r_i = u_i^{1/y_i} and λ_i = 1 − v_i^{1/(n_i − y_i)}. As in Example 11, b_i and d_i are only defined for those z_i ≠ 0, a_i and b_i are only defined when y_i > 0, and c_i and d_i are only defined when n_i > y_i. Then f(β_0 | β_1, u, v) ∝ π(β_0 | β_1) I(max_i{a_i} < β_0 < min_i{c_i}) and f(β_1 | β_0, u, v) ∝ π(β_1 | β_0) I(max_i{b_i} < β_1 < min_i{d_i}).

Numerical Examples

Our first example is the Cauchy(0, 1) density (Example 4). Using the Gibbs sampler algorithm, 1000 random variates were generated and the results are shown in Figure 1.

Our second example is the binomial GLM with a logit link function and a quadratic logistic model, given by y_i | r_i ~ binomial(n_i, r_i) and log{r_i/(1 − r_i)} = β_1 + z_iβ_2 + z_i²β_3 = X_iβ, i = 1,..., n. Further details are provided in Dellaportas and Smith (1993). With a multivariate normal prior for β, say N(μ, Σ), the posterior distribution is given by

f(β) ∝ {Π_{i=1}^n e^{y_i X_i β}(1 + e^{X_i β})^{−n_i}} exp{−0.5(β − μ)′Σ^{−1}(β − μ)}.

Here we introduce the latent variable U = (U_1,..., U_n) such that the joint density with β is given, up to a constant of proportionality, by

f(β, u) ∝ {Π_{i=1}^n I(u_i < {1 + exp(X_iβ)}^{−n_i})} exp{−0.5(β − μ)′Σ^{−1}(β − μ) + v′β},

where v = Σ_{i=1}^n y_i X_i. The full conditional distributions for each of the U_i are uniform, given by f(u_i | u_{−i}, β) = U(0, {1 + exp(X_iβ)}^{−n_i}). However, the real interest is in sampling from f(β | u). First, the condition u_i < {1 + exp(X_iβ)}^{−n_i} implies exp(X_iβ) < u_i^{−1/n_i} − 1. Therefore define the sets

A_{ku} = {β_k : β_k < min_i [{log(u_i^{−1/n_i} − 1) − β_l X_{il} − β_m X_{im}}/X_{ik}]},

where {k, l, m} are, in some order, the elements of {1, 2, 3} (we take each X_{ik} > 0). Sampling from f(β | u) can now be done by sampling successively from f(β_k | β_{−k}, u), which involves sampling from a univariate normal distribution restricted to the set A_{ku}. This univariate normal distribution is given by π(β_k | β_{−k}), where π(β) is the multivariate normal distribution with mean μ + Σv and covariance matrix Σ.

We analyse a data set relevant to the above example. The data set and prior distribution used are given in Dellaportas and Smith (1993). We start the chain by taking β at the location of the prior distribution and then proceed to sample U and then return to β. We ran the chain for 20000 iterations (taking only several seconds) and collected the last 2000 for parameter estimation.
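A sketch of the scheme just described. The conditional π(β_k | β_{−k}) is obtained from the precision matrix of π(β), and the truncated univariate normal draws use inverse-CDF sampling via scipy. The Dellaportas and Smith data are not reproduced here, so the usage lines use made-up dose-response data; function and variable names are ours, and all entries of X are assumed positive.

```python
import numpy as np
from scipy.stats import norm

def gibbs_binomial_logit(y, n, X, prior_mean, prior_cov, n_iter=5000, rng=None):
    """Gibbs sampler for the binomial GLM with logit link (second numerical example).

    Latent u_i < (1 + exp(X_i beta))^(-n_i); each beta_k is then a univariate
    normal (from pi(beta) = N(prior_mean + prior_cov @ v, prior_cov), v = X'y)
    truncated to the set A_ku. Assumes all entries of X are positive.
    Note: very extreme truncation can underflow the inverse-CDF step.
    """
    rng = rng or np.random.default_rng(12)
    y, n, X = np.asarray(y, float), np.asarray(n, float), np.asarray(X, float)
    v = X.T @ y
    m = prior_mean + prior_cov @ v            # mean of pi(beta)
    Q = np.linalg.inv(prior_cov)              # precision of pi(beta)
    beta = np.asarray(prior_mean, float).copy()
    draws = np.empty((n_iter, X.shape[1]))
    for t in range(n_iter):
        u = rng.uniform(0.0, (1.0 + np.exp(X @ beta)) ** (-n))
        cap = np.log(u ** (-1.0 / n) - 1.0)   # X_i beta must stay below cap_i
        for k in range(X.shape[1]):
            rest = X @ beta - X[:, k] * beta[k]
            upper = np.min((cap - rest) / X[:, k])
            cond_var = 1.0 / Q[k, k]
            cond_mean = m[k] - cond_var * (Q[k] @ (beta - m) - Q[k, k] * (beta[k] - m[k]))
            hi = norm.cdf(upper, loc=cond_mean, scale=np.sqrt(cond_var))
            beta[k] = norm.ppf(rng.uniform(0.0, hi), loc=cond_mean, scale=np.sqrt(cond_var))
        draws[t] = beta
    return draws

# Hypothetical usage with made-up data (not the Dellaportas-Smith data):
z = np.array([0.5, 1.0, 1.5])
X = np.column_stack([np.ones(3), z, z ** 2])
out = gibbs_binomial_logit(y=[1, 2, 3], n=[5, 5, 5], X=X,
                           prior_mean=np.zeros(3), prior_cov=0.25 * np.eye(3))
print(out[-2000:].mean(axis=0))
```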

We can report, as was to be expected, that our parameter estimates (β_1 = −2.36, β_2 = 0.21 and β_3 = −0.004) coincide with those obtained by Dellaportas and Smith. These authors used the adaptive rejection sampling scheme (Gilks and Wild, 1992), which depends on the posterior density being log-concave; we need no such condition. Additionally, in Figure 2 we give kernel density estimates of the marginals for β obtained from the output of the Gibbs sampler.

Figure 1: Histogram representation of output obtained using the Gibbs sampler for the Cauchy(0, 1) example.

Figure 2: Marginal posterior distributions for β (β_1 top, β_3 bottom).

4 Discussion and Conclusions

With the increasing use of the Gibbs sampler in Bayesian analysis, faster, more efficient and simpler methods for generating random variates are required. In this paper, we have proposed and illustrated a method that appears to be, in some sense, ubiquitous when sampling from univariate continuous densities from a Bayesian perspective.

Here we simply point out that in all the examples we have presented, the use of popular algorithms such as Metropolis-Hastings, sampling-resampling (Smith and Gelfand, 1992), adaptive rejection, ratio-of-uniforms, or any other rejection method, whether these methods are used within a Gibbs loop or not, is bypassed. (For an excellent discussion of these algorithms see, for example, Chib and Greenberg, 1995, and Mueller, 1995.) The striking feature of our approach is that it obviates the difficulties associated with these alternative approaches, namely identifying dominating densities, calculating suprema and acceptance rates.

Acknowledgements

The authors are grateful to Adrian Smith for helpful discussions and critical comments on earlier drafts of the paper.

References

Besag, J. and Green, P.J. (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society Series B 55, 25-37.

Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician 49, 327-335.

Damien, P. (1994). Some contributions to Bayesian nonparametric inference. Ph.D. Thesis, Imperial College, London.

Damien, P., Laud, P.W. and Smith, A.F.M. (1995). Approximate random variate generation from infinitely divisible distributions with applications to Bayesian inference. Journal of the Royal Statistical Society Series B 57, 547-564.

Damien, P., Laud, P.W. and Smith, A.F.M. (1996). Implementation of Bayesian nonparametric inference based on beta processes. Scandinavian Journal of Statistics 23, 27-36.

Dellaportas, P. and Smith, A.F.M. (1993). Bayesian inference for generalised linear and proportional hazards models via Gibbs sampling. Applied Statistics 42, 443-459.

Diaconis, P. and Kemperman, J. (1996). Some new tools for Dirichlet priors. In Bayesian Statistics 5 (eds J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith), pp. 97-106. Oxford University Press.

Doksum, K.A. (1974). Tailfree and neutral random probabilities and their posterior distributions. The Annals of Probability 2, 183-201.

Gilks, W.R. and Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics 41, 337-348.

Laud, P.W. (1977). Bayesian nonparametric inference in reliability. Ph.D. Dissertation, University of Missouri, Columbia, MO.

Laud, P.W., Damien, P. and Smith, A.F.M. (1993). Random variate generation from D-distributions. Statistics and Computing 3, 109-112.

Laud, P.W., Smith, A.F.M. and Damien, P. (1996). Monte Carlo methods approximating a posterior hazard rate process. Statistics and Computing 3, 109-112.

Mueller, P. (1995). A generic approach to posterior integration and Gibbs sampling. Manuscript.

Smith, A.F.M. and Gelfand, A.E. (1992). Bayesian statistics without tears: a sampling-resampling perspective. The American Statistician 46, 84-88.

Smith, A.F.M. and Roberts, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. Journal of the Royal Statistical Society Series B 55, 3-23.

Tierney, L. (1994). Markov chains for exploring posterior distributions. The Annals of Statistics 22, 1701-1762.

Walker, S.G. (1995). Generating random variates from D-distributions via substitution sampling. Statistics and Computing 5, 311-315.

Walker, S.G. (1996). Random variate generation from an infinitely divisible distribution via Gibbs sampling. Technical Report, Imperial College.

Walker, S.G. and Damien, P. (1996). A full Bayesian nonparametric analysis involving a neutral to the right process. Technical Report, Imperial College.