Uniform Scale Mixture Models with Applications to Variance Regression

Z. STEVE QIN, University of Michigan, Ann Arbor, MI, USA.
PAUL DAMIEN, University of Michigan, Ann Arbor, MI, USA.
STEPHEN WALKER, University of Bath, UK.

SUMMARY. Scale mixtures of uniform distributions are used to model non-normal data in both univariate and multivariate settings. In addition to providing greater modelling flexibility, the use of scale mixtures of uniforms also results in straightforward computational strategies, particularly in a Bayesian analysis where Monte Carlo methods are used.

Key words: Scale mixtures, Gibbs sampling, Bayesian inference, variance regression.

1 Introduction

Markov chain Monte Carlo (MCMC) methods, such as the Gibbs sampler (Gelfand and Smith, 1990; Smith and Roberts, 1993), have made Bayesian analysis of complex models relatively straightforward. Within the framework

of MCMC, various methods have been developed to make computations simpler; see, for example, Polson (1996) and Damien et al. (1999). It has also been recognised that statistical models can be generalised, in a variety of ways, with Bayesian inference remaining tractable, due to MCMC.

This paper is concerned with generalising the popular scale mixture of normal (SMN) model (Andrews and Mallows, 1974) using the idea of a scale mixture of uniform (SMU) model. We develop a richer class of models with uniform distributions replacing normal distributions. Surprisingly perhaps, it turns out that the MCMC is easier to implement for the scale mixture of uniform model than for the original scale mixture of normal model.

The important issues are as follows. Scale mixtures of normals only provide densities with tails heavier than the normal density. On the other hand, scale mixtures of uniforms coincide with the class of unimodal and symmetric densities. Another point is that we are able to consider variance regression models in a manner which is essentially no more complex than the standard mean regression model.

The layout of the paper is as follows. In Section 2 we provide definitions and facts relating to scale mixtures of uniforms. In Section 3 we demonstrate the usefulness of scale mixtures of uniforms for building and estimating a mean and variance regression model, and include both simulated and real examples. Section 4 provides a comparative analysis of SMU and SMN models. Section 5 considers the extension to multivariate models, and we end with a brief discussion in Section 6.

2 Background

We start by considering scale mixtures of normals (Andrews and Mallows, 1974).

Definition 2.1 (Scale mixtures of normals) A random variable Y defined on ℝ has a scale mixture of normal distribution if the density of Y can be written in the form
\[ p_Y(y) = \int_0^\infty N\big(y \mid \mu, \, \kappa(\lambda)\sigma^2\big)\, \pi(\lambda)\, d\lambda, \]
where κ(·) is a positive function on ℝ⁺ and π(λ) is a probability density function.

The class of models defined above is very large and useful. However, a restrictive property of p_Y(·) is that it only provides densities with tails heavier than the normal, i.e. leptokurtic. To deal with this we introduce the idea of a scale mixture of uniform density:

Definition 2.2 (Scale mixture of uniforms) A random variable Y defined on ℝ has a scale mixture of uniform representation if it may be written as
\[ Y \mid [V = v] \sim \mathrm{Un}\big(\mu - \sigma\sqrt{v}, \; \mu + \sigma\sqrt{v}\big), \]
where σ > 0 and V has density f_V(·) on ℝ⁺.

We now collect together some facts which can be easily established.

Fact 2.1 If V ~ Ga(3/2, 1/2) then Y ~ N(μ, σ²).

Fact 2.2 The class of unimodal, symmetric densities on ℝ coincides with the class of scale mixture of uniform densities.

Fact 2.3 If Y has a scale mixture of normal representation, i.e. Y | λ ~ N(μ, σ²/λ) with λ ~ π(λ), then the corresponding representation via scale mixtures of uniforms is given by
\[ Y \mid [V = v] \sim \mathrm{Un}\big(\mu - \sigma\sqrt{v}, \; \mu + \sigma\sqrt{v}\big), \qquad V \sim f_V(v), \]
where f_V(v) = ∫ p(v | λ) π(λ) dλ and p(v | λ) is the density function of Ga(3/2, λ/2).

In the next section we demonstrate how, using scale mixtures of uniforms, we can analyse a mean/variance regression model straightforwardly. We illustrate with some examples.
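Fact 2.1 can be checked by direct simulation. The following sketch (our own code; the parameter values, variable names and sample size are illustrative choices, not from the paper) draws V from Ga(3/2, 1/2), then Y uniformly on (μ − σ√v, μ + σ√v), and compares sample moments with those of N(μ, σ²):

```python
import numpy as np

# Monte Carlo check of Fact 2.1: if V ~ Ga(3/2, 1/2) and
# Y | V = v ~ Un(mu - sigma*sqrt(v), mu + sigma*sqrt(v)),
# then marginally Y ~ N(mu, sigma^2).
rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 1.5, 1_000_000

v = rng.gamma(shape=1.5, scale=2.0, size=n)    # Ga(3/2, rate 1/2) => scale 2
y = rng.uniform(mu - sigma * np.sqrt(v), mu + sigma * np.sqrt(v))

print(y.mean(), y.std())                        # should be close to mu and sigma
print(np.mean(np.abs(y - mu) < 1.96 * sigma))   # should be close to 0.95 under normality
```

The last line checks the normal 95% coverage rule, which a uniform or heavy-tailed mixture would fail.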

3 Application to variance regression

In this section we consider the basic model given by
\[ E[Y_i] = X_i\beta, \qquad \log \mathrm{var}[Y_i] = 2 Z_i\theta, \qquad (1) \]
for i = 1, …, n, where X_i = (1, X_{i1}, …, X_{iJ}) and Z_i = (1, Z_{i1}, …, Z_{iK}) are covariate vectors, and β = (β_0, β_1, …, β_J) and θ = (θ_0, θ_1, …, θ_K) are regression vectors. Many methods have been proposed in the literature for parameter estimation; see, for example, Carroll and Ruppert (1988). Here we demonstrate the advantage of using the scale mixture of uniform family in the analysis of a variance regression model. For convenience, we reparameterise λ_k = e^{−θ_k}.

A specific model which results in the specifications (1) is given by
\[ Y_i \mid [V_i = v_i] \sim \mathrm{Un}\Big(X_i\beta - \sqrt{v_i}\,\prod_k \lambda_k^{-Z_{ik}}, \; X_i\beta + \sqrt{v_i}\,\prod_k \lambda_k^{-Z_{ik}}\Big), \qquad V_i \overset{iid}{\sim} f_V(\cdot). \]
The condition on the variance is satisfied provided we constrain E[V] = 3. For example, for normal errors we can take V ~ Ga(3/2, 1/2). For alternative levels of kurtosis we can take V ~ Ga(3a/2, a/2), for a > 0; here a < 1 leads to tails heavier than normal and a > 1 leads to tails lighter than normal.

In a Bayesian context, this model is straightforward to study via a Gibbs sampler, since all the full conditional densities required to implement the sampler are of standard type. Let π(·) represent the prior for (β, λ), which comprises the product ∏_{j=0}^{J} π(β_j) × ∏_{k=0}^{K} π(λ_k). The full conditional densities are as follows:
\[ \pi(v_i \mid \cdots) \propto \exp(-v_i/2)\, I\Big\{ v_i > (Y_i - X_i\beta)^2 \prod_k \lambda_k^{2 Z_{ik}} \Big\}. \]
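The full conditional for v_i above is particularly convenient: by the memoryless property of the exponential distribution, a draw is simply the truncation point plus an Exp(1/2) variate. A minimal sketch (the function name and test constants are ours):

```python
import numpy as np

# pi(v | ...) ∝ exp(-v/2) I{v > c}, where c is the truncation point
# c = (Y_i - X_i beta)^2 * prod_k lambda_k^{2 Z_ik}.  By memorylessness,
# a draw is c plus an Exponential with rate 1/2 (scale 2).
def sample_v(c, rng, size=None):
    return c + rng.exponential(scale=2.0, size=size)

rng = np.random.default_rng(1)
c = 0.7
draws = sample_v(c, rng, size=200_000)
print(draws.min() > c)   # every draw respects the truncation
print(draws.mean())      # should be close to c + 2
```

The mean check follows because an Exp(1/2) variate has mean 2.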

Similarly, for the regression coefficients β_j,
\[ \pi(\beta_j \mid \cdots) \propto \pi(\beta_j)\, I\Big\{ \beta_j \in \big( \max_i\{A_{ij}\}, \; \min_i\{B_{ij}\} \big) \Big\}, \]
where, writing σ_i = ∏_k λ_k^{−Z_{ik}} and Y_{−ij} = Σ_{l≠j} X_{il} β_l,
\[ A_{ij} = \frac{Y_i - Y_{-ij} - \sigma_i\sqrt{v_i}}{X_{ij}}, \qquad B_{ij} = \frac{Y_i - Y_{-ij} + \sigma_i\sqrt{v_i}}{X_{ij}} \quad \text{for } X_{ij} > 0, \]
with A_{ij} and B_{ij} interchanged when X_{ij} < 0; observations with X_{ij} = 0 place no constraint on β_j. Also,
\[ \pi(\lambda_k \mid \cdots) \propto \pi(\lambda_k)\, \lambda_k^{\sum_i Z_{ik}}\, I\Big\{ -\log\lambda_k \in \big( \max_{i:\, Z_{ik} > 0}\{E_{ik}\}, \; \min_{i:\, Z_{ik} < 0}\{E_{ik}\} \big) \Big\}, \]
where
\[ E_{ik} = \frac{1}{Z_{ik}} \log\left( \frac{|Y_i - X_i\beta|}{\sqrt{v_i}\, \prod_{l \ne k} \lambda_l^{-Z_{il}}} \right). \]
If Z_{ik} ≥ 0 for all i then the upper limit is +∞; if Z_{ik} ≤ 0 for all i then the lower limit is −∞. Hence, all the full conditional densities can be sampled using standard techniques; see, for example, Devroye (1986) and Damien and Walker (2000). Next, we consider some examples.

3.1 Example 1

A simulated dataset contains 50 cases, generated as follows:
\[ Y_i = \beta_0 + \beta_1 x_i + \sigma_i \epsilon_i, \qquad \log \sigma_i = \theta_0 + \theta_1 z_i, \qquad i = 1, \ldots, 50, \]
with ε_i ~ N(0, 1) independently. The regression parameters were taken to be β_0 = 10, β_1 = 5, θ_0 = −3 and θ_1 = 1. The standard ordinary least squares estimates for β_0 and β_1 are 6.75 and 5.46, respectively.
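To make the algorithm concrete, the following sketch (our own code, not the authors'; the covariate designs, flat priors and run lengths are illustrative assumptions) implements the Gibbs sampler for a model of the Example 1 form with normal errors. We work directly with θ_k = −log λ_k under a flat prior: the v_i conditional is a shifted exponential, the β_j conditionals are uniform on an intersection of intervals, and each θ_k conditional picks up a factor exp(−θ_k Σ_i Z_{ik}) from the uniform likelihood heights, so it is a lower bound plus an exponential draw:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- data simulated as in Example 1 (the covariate designs are our assumption) ---
n = 50
x = rng.uniform(0.5, 10.0, n)
z = rng.uniform(0.1, 2.0, n)            # kept positive so each theta_k bound is one-sided
sigma = np.exp(-3.0 + 1.0 * z)          # true theta = (-3, 1)
y = 10.0 + 5.0 * x + sigma * rng.standard_normal(n)   # true beta = (10, 5)

# --- Gibbs sampler; flat priors on beta and theta (our simplification) ---
beta = np.zeros(2)
theta = np.zeros(2)
keep = []
for it in range(4000):
    s = np.exp(theta[0] + theta[1] * z)
    r = y - beta[0] - beta[1] * x
    # v_i | ... is a shifted exponential: (r_i / sigma_i)^2 + Exp(rate 1/2)
    v = (r / s) ** 2 + rng.exponential(scale=2.0, size=n)
    h = s * np.sqrt(v)                   # half-widths of the uniform intervals

    # beta_j | ... : flat prior truncated to an interval => a uniform draw
    beta[0] = rng.uniform(np.max(y - beta[1] * x - h), np.min(y - beta[1] * x + h))
    beta[1] = rng.uniform(np.max((y - beta[0] - h) / x), np.min((y - beta[0] + h) / x))

    # theta_k | ... : the likelihood contributes exp(-theta_k * sum_i Z_ik), and
    # the constraint |r_i| < sigma_i sqrt(v_i) gives a lower bound, so each
    # theta_k is its bound plus an exponential draw
    w = np.log(np.abs(y - beta[0] - beta[1] * x)) - 0.5 * np.log(v)
    theta[0] = np.max(w - theta[1] * z) + rng.exponential(scale=1.0 / n)
    theta[1] = np.max((w - theta[0]) / z) + rng.exponential(scale=1.0 / z.sum())

    if it >= 1000:
        keep.append(np.r_[beta, theta])

post_mean = np.asarray(keep).mean(axis=0)
print(post_mean)    # should settle near the true values (10, 5, -3, 1)
```

Every conditional is sampled exactly and in closed form, which is the practical appeal of the SMU construction; with proper normal or gamma priors the uniform and exponential draws above become truncated versions of those priors.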

Table 1: Posterior summaries of model parameters β_0, β_1, θ_0 and θ_1 for the simulated dataset.

            β_0       β_1      θ_0       θ_1
  mean      10.149    4.960    -2.384    0.890
  std. dev  0.095     0.019    0.151     0.037

In this example, and the others to follow, we used non-informative prior distributions for the regression parameters; that is, we took β_j ~ N(0, σ_β²) and λ_k ~ Ga(a_k, a_k), with σ_β large and a_k very small. Using the scale mixture of uniform model, the resulting posterior estimates, obtained using the Gibbs sampler, are summarized in Table 1. The true values appear to be well approximated by the sample-based statistics.

3.2 Example 2

The dataset used in this example is taken from the original epitaxial layer growth experiment of Kackar and Shoemaker (1986), as reported in Shoemaker, Tsui and Wu (1991). One of the initial steps in fabricating integrated circuit (IC) devices is to grow an epitaxial layer on polished silicon wafers. The experimenters needed to find process factors that can be set to minimize the epitaxial layer nonuniformity while maintaining average thickness as close as possible to nominal. Here we consider a simplified version of this experiment: four experimental factors, susceptor-rotation method, nozzle position, deposition temperature and deposition time (labeled A, B, C and D), are investigated at two levels, − and +. A four-factor, full factorial design of 16 runs with 6 replications was adopted. We are interested in both the location and dispersion effects. Traditionally, these two analyses would be performed separately, via linear regression. Since the assumption of equal variance is violated, such a dichotomous

Table 2: Posterior summaries of model parameters β_0, β_1, θ_0 and θ_1 for the layer growth experiment example.

            β_0       β_1      θ_0       θ_1
  mean      14.415    0.430    -1.389    0.616
  std. dev  0.014     0.013    0.047     0.060

Figure 1: Posterior distributions of model parameters β_0, β_1, θ_0 and θ_1 for the layer growth experiment.

approach may be inappropriate. Using screening techniques, such as half-normal plots, one can easily identify that factor D is the most influential factor for the location effect and factor A is the most influential one for the dispersion effect. Using the scale mixture models described earlier in this paper, the location and dispersion effects are modelled simultaneously. The posterior statistics are summarized in Table 2, and the histograms appear in Figure 1. The sample-based summaries provide a full analysis for the variance regression model.

Table 3: Posterior summaries of model parameters β_0, β_1, β_2, θ_0, θ_1 and θ_2 for the tensile strength experiment.

            β_0       β_1      β_2      θ_0       θ_1       θ_2
  mean      42.903    0.994    1.473    -1.128    -0.108    0.786
  std. dev  0.204     0.066    0.176    0.22      0.215     0.161

3.3 Example 3

In Example 2, it is easy to estimate the variance under each setting, since there are replications. However, in the absence of replication, it is difficult to obtain estimates of the dispersion effects. We consider such an illustration. Box and Meyer (1986) presented an interesting analysis of dispersion effects in a fractional-factorial experiment. The experiment concerns the tensile strength of welds in an off-line welding experiment performed by the National Railway Corporation of Japan (Taguchi and Wu, 1980). This experiment was also studied by Carroll and Ruppert (1988) for dispersion effects. Box and Meyer found that the mean could be adequately explained by two factors, B and C. Here we also want to find out the effect of these two factors on the dispersion. Using the model described earlier, a full Bayesian analysis of the dispersion effects is obtained. The posterior statistics are summarized in Table 3, and the histograms appear in Figure 2.

4 A Comparison of SMN and SMU models

In this section, we demonstrate the differences between the two scale mixture methodologies: scale mixtures of normals and scale mixtures of uniforms.
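The bridge between the two methodologies is Fact 2.3. As a concrete check, the following sketch (our own code; sample sizes are illustrative) generates standard Cauchy draws both ways for the error model used in Section 4.1, via the SMN route and via the equivalent SMU route obtained by compounding the gamma mixing densities:

```python
import numpy as np

# Fact 2.3 in the Cauchy case: a standard Cauchy is an SMN with
# lambda ~ Ga(1/2, 1/2), and equivalently an SMU whose mixing density is
# f_V(v) = int Ga(v | 3/2, lambda/2) Ga(lambda | 1/2, 1/2) dlambda.
rng = np.random.default_rng(11)
n = 1_000_000

# SMN route: Y | lambda ~ N(0, 1/lambda)
lam = rng.gamma(0.5, scale=2.0, size=n)          # Ga(1/2, rate 1/2)
y_smn = rng.standard_normal(n) / np.sqrt(lam)

# SMU route: draw V by compounding (lambda, then V | lambda ~ Ga(3/2, rate lambda/2)),
# then Y | V = v ~ Un(-sqrt(v), sqrt(v))
lam = rng.gamma(0.5, scale=2.0, size=n)
v = rng.gamma(1.5, scale=2.0 / lam)
y_smu = rng.uniform(-np.sqrt(v), np.sqrt(v))

# both should match the standard Cauchy: P(Y <= 1) = 3/4
print(np.mean(y_smn <= 1.0), np.mean(y_smu <= 1.0))
```

The quantile check uses the Cauchy CDF, F(y) = 1/2 + arctan(y)/π, so F(1) = 3/4 for both routes.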

Figure 2: Posterior distributions of model parameters β_0, β_1, β_2, θ_0, θ_1 and θ_2 for the tensile strength experiment.

4.1 Regression model with Cauchy errors

We revisit the linear regression model under Cauchy errors with location and scale parameters 0 and σ, given by
\[ y_i = \alpha + \beta x_i + \epsilon_i, \qquad i = 1, \ldots, n. \]

Scale mixture of normals. The scale mixture of normals representation is given by
\[ y_i \mid \lambda_i \sim N\big(\alpha + \beta x_i, \; \sigma^2/\lambda_i\big), \qquad \lambda_i \sim \mathrm{Ga}(1/2, 1/2). \]
Writing λ = 1/σ² and assuming the non-informative prior π(α, β, λ) ∝ λ^{1/2}, the full conditionals are given by:

* For the λ_i's:
\[ \lambda_i \mid \cdots \sim \mathrm{Ex}\left( \frac{1 + \lambda (y_i - \alpha - \beta x_i)^2}{2} \right). \]

* For α:
\[ \alpha \mid \cdots \sim N\left( \frac{\sum_{i=1}^n \lambda_i (y_i - \beta x_i)}{\sum_{i=1}^n \lambda_i}, \; \frac{1}{\lambda \sum_{i=1}^n \lambda_i} \right). \]

* For β:
\[ \beta \mid \cdots \sim N\left( \frac{\sum_{i=1}^n \lambda_i x_i (y_i - \alpha)}{\sum_{i=1}^n \lambda_i x_i^2}, \; \frac{1}{\lambda \sum_{i=1}^n \lambda_i x_i^2} \right). \]

* For λ:
\[ \lambda \mid \cdots \sim \mathrm{Ga}\left( \frac{n+3}{2}, \; \frac{1}{2} \sum_{i=1}^n \lambda_i (y_i - \alpha - \beta x_i)^2 \right). \]

Scale mixture of uniforms. Based on the development in Section 2, after some algebra, we obtain the following posterior conditional distributions for the parameters of the model using the SMU representation. In canonical notation, π(α) and π(β) denote prior distributions for α and β, respectively. We assume, without loss of generality, π(λ) ~ Ga(a_λ, b_λ); i.e., the prior for λ is a gamma density. It is straightforward to derive the posterior conditional distributions for the parameters and the auxiliary variables. These are given by:

* For v_i:
\[ \pi(v_i \mid \cdots) \propto (1 + v_i)^{-2}\, I\big[ \lambda (y_i - \alpha - \beta x_i)^2, \; \infty \big). \]

* For α:
\[ \pi(\alpha \mid \cdots) \propto \pi(\alpha)\, I\left[ \max_i \left\{ y_i - \beta x_i - \sqrt{v_i/\lambda} \right\}, \; \min_i \left\{ y_i - \beta x_i + \sqrt{v_i/\lambda} \right\} \right]. \]

* For β (taking x_i > 0 for all i; the bounds are interchanged for negative x_i):
\[ \pi(\beta \mid \cdots) \propto \pi(\beta)\, I\left[ \max_i \left\{ \frac{y_i - \alpha - \sqrt{v_i/\lambda}}{x_i} \right\}, \; \min_i \left\{ \frac{y_i - \alpha + \sqrt{v_i/\lambda}}{x_i} \right\} \right]. \]

* For λ:
\[ \pi(\lambda \mid \cdots) \propto \pi(\lambda) \prod_{i=1}^{n} p(y_i, v_i \mid \lambda) \propto \mathrm{Ga}\left( a_\lambda + \frac{n}{2}, \; b_\lambda \right) I\left( 0, \; \min_i \left\{ \frac{v_i}{(y_i - \alpha - \beta x_i)^2} \right\} \right), \]
a truncated gamma distribution. Here π(α) and π(β), in canonical notation, are prior distributions for α and β.

Setting α and β to be 10 and 3, respectively, and σ, the scale parameter in the Cauchy distribution, to be 3, 100 pairs (x_i, y_i) were generated. The least squares estimates are 6.6772 for α and 6.5973 for β. Using the Gibbs samplers detailed above, posterior summaries under the SMN and SMU models are provided in Table 4.

Table 4: Comparison of SMN and SMU on the accuracy of estimated parameters for the Cauchy distribution.

                           SMN                  SMU
  para  True Value    mean     std dev     mean     std dev
  α     10            10.100   0.629       9.770    0.598
  β     3             3.119    0.374       3.393    0.282
  σ     3             2.948    0.380       2.914    0.332

As expected, in both examples, the two representations yield comparable results. This is partly because a non-informative prior was employed. The effect of this prior is that the resulting full conditionals in the Gibbs sampler under the SMN framework reduce to standard forms, from which sampling is straightforward. However, in general, this is not true; consider, for example, a normal prior for the regression parameters in the Cauchy illustration. On the other hand, no matter what prior is used, under SMU the resulting

posterior full conditionals take the form of a truncated version of the prior. It was with this in mind that a canonical notation for the prior distributions was used in the earlier development.

The SMU representation typically converges more slowly than SMN because of the nature of the resulting Markov chain. This, of course, is hardly problematic in an era where fast computers and sophisticated software are the norm. It then seems appropriate to consider models that result in algorithms that are trivial to code - a ubiquitous feature of the SMU enterprise. Also, at least as important, the SMU family includes the SMN family as a special case, thus making it a more flexible and practically appealing family of models.

5 Multivariate distributions

In this section, we briefly describe how the SMU concept can be extended to multivariate distributions. In one dimension, we generated a scale mixture of uniform density using intervals with random lengths. In two or higher dimensions, intervals are replaced by ellipses, ellipsoids or hyper-ellipsoids, and lengths are replaced by radii. Hence, multivariate scale mixtures of uniform distributions can be generated via uniform distributions on ellipses, ellipsoids or hyper-ellipsoids, with random radii.

Let a denote a d-dimensional vector (a_1, a_2, …, a_d), let A denote a d × d positive definite matrix and let r > 0. Let E_d(a, A, r) denote the d-dimensional ellipse, ellipsoid or hyper-ellipsoid which satisfies
\[ (x - a)' A (x - a) \le r^2. \]

Definition 5.1 (Multivariate scale mixture of uniforms) Suppose that for a vector Y = (Y_1, Y_2, …, Y_m) we have
\[ Y \mid [V = v] \sim \mathrm{Un}\big\{ E_m(\mu, \Sigma^{-1}, \sqrt{v}) \big\}, \qquad v \sim f_V(\cdot), \]
where Σ is positive definite and V is positive. Then Y has a multivariate scale mixture of uniform density, with f_V(·) being the generating density.

For example, the density function of a multivariate normal distribution with mean μ and covariance matrix Σ is given by
\[ f(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x - \mu)' \Sigma^{-1} (x - \mu) \right\}, \qquad x \in \mathbb{R}^d. \]
In two dimensions, the multivariate scale mixture of uniform representation is given by: (Y_1, Y_2) | [V = v] is uniformly distributed in E_2(μ, Σ^{−1}, √v) and v ~ Ga(2, 1/2). Generalising, for an m-dimensional vector Y = (Y_1, …, Y_m), if Y | [V = v] is uniformly distributed in E_m(μ, Σ^{−1}, √v) and v ~ Ga(1 + m/2, 1/2), then Y has an m-dimensional multivariate normal distribution with mean μ and covariance matrix Σ. This result is similar to a result of Johnson and Ramberg (1977).

It is clear that this result can be extended to a very general "symmetric" multivariate distribution family; by symmetric, we mean that the density function f satisfies f(x_1, x_2, …, x_n) = f(−x_1, −x_2, …, −x_n). Tong (1990) describes Y as having an elliptically contoured distribution if its density function takes the form
\[ f(x) = |\Sigma|^{-1/2}\, g\big\{ (x - \mu)' \Sigma^{-1} (x - \mu) \big\}, \qquad x \in \mathbb{R}^m, \]
where g: ℝ⁺ → [0, ∞) is nonincreasing. It is clear that this family is included in our development above. Similarly, the "elliptically symmetric distributions" of Kelker (1970) are a special case under our definition.

The density function of the m-dimensional multivariate t distribution is as follows:
\[ f(x) = \frac{\Gamma\left(\frac{m+t}{2}\right)}{\Gamma\left(\frac{t}{2}\right) (t\pi)^{m/2}}\, |\Sigma|^{-1/2} \left\{ 1 + \frac{1}{t} (x - \mu)' \Sigma^{-1} (x - \mu) \right\}^{-(m+t)/2}, \]
where x, μ ∈ ℝ^m, Σ is an m × m positive definite matrix and t is the degrees of freedom.
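The multivariate normal representation above is easy to verify by simulation: draw a point uniformly in the unit m-ball, scale it to radius √v, and map it through a Cholesky factor of Σ. A sketch (our own helper names and test values):

```python
import numpy as np

# Monte Carlo check: if v ~ Ga(1 + m/2, 1/2) and Y | v is uniform on the
# ellipsoid (x - mu)' Sigma^{-1} (x - mu) <= v, then Y ~ N_m(mu, Sigma).
rng = np.random.default_rng(7)
m, n = 2, 500_000
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)

v = rng.gamma(shape=1.0 + m / 2.0, scale=2.0, size=n)   # rate 1/2 => scale 2

# uniform point in the unit m-ball: random direction times a U^{1/m} radius
g = rng.standard_normal((n, m))
direction = g / np.linalg.norm(g, axis=1, keepdims=True)
radius = rng.uniform(size=n) ** (1.0 / m)
ball = direction * radius[:, None]

# scale the ball to radius sqrt(v) and map through the Cholesky factor,
# so that (y - mu)' Sigma^{-1} (y - mu) <= v holds for every draw
y = mu + np.sqrt(v)[:, None] * ball @ L.T

print(y.mean(axis=0))   # should be close to mu
print(np.cov(y.T))      # should be close to Sigma
```

The covariance works out exactly because a uniform point in the unit m-ball has covariance I/(m + 2) and E[v] = m + 2 under Ga(1 + m/2, 1/2).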

The scale mixture of uniform representation for the multivariate t distribution is similar to that for the multivariate normal, except that the density of v is given by
\[ f_V(v) \propto v^{m/2} \left( 1 + \frac{v}{t} \right)^{-(m+t+2)/2}. \]
The multivariate Cauchy is a special case of the multivariate t, having one degree of freedom.

6 Discussion

In this paper, we have developed a new class of models, namely scale mixtures of uniform distributions, that enables the analysis of data when normality assumptions are violated. The scale mixture representation provides a general and flexible approach to modelling; this was illustrated via examples in the context of variance regressions. An attractive feature of the approach is that the full conditional distributions in the resulting Gibbs sampler are all uniform; this makes coding a trivial task. Introducing auxiliary variables in the constructions described in this paper could likely lead to autocorrelation in the Markov chain. In a related work, Damien et al. (1999) provide a comparative study of the efficiency rates of Gibbs samplers constructed via the introduction of auxiliary variables.

References

Andrews, D.F. and Mallows, C.L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36, 99-102.

Box, G.E.P. and Meyer, R.D. (1986). Dispersion effects from fractional designs. Technometrics, 28, 19-27.

Carroll, R.J. and Ruppert, D. (1988). Transformation and Weighting in Regression. New York: Chapman and Hall.

Chew, V. (1968). Some useful alternatives to the normal distribution. The American Statistician, 22, 22-24.

Damien, P., Wakefield, J.C. and Walker, S.G. (1999). Gibbs sampling for Bayesian non-conjugate and hierarchical models by using auxiliary variables. Journal of the Royal Statistical Society, Series B, 61, 331-344.

Damien, P. and Walker, S.G. (2000). Sampling truncated normal, gamma and beta densities. Journal of Computational and Graphical Statistics. To appear.

Devroye, L. (1986). Non-Uniform Random Variate Generation. New York: Springer-Verlag.

Gelfand, A.E. and Smith, A.F.M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.

Johnson, M.E. and Ramberg, J.S. (1977). Elliptically symmetric distributions: characterizations and random variate generation. Proceedings of the American Statistical Association, Statistical Computing Section, 262-265.

Kackar, R.N. and Shoemaker, A.C. (1986). Robust design: a cost-effective method for improving manufacturing processes. AT&T Technical Journal, 65, 39-50.

Kelker, D. (1970). Distribution theory of spherical distributions and a location-scale parameter generalization. Sankhyā, Series A, 32, 419-430.

Polson, N.G. (1996). Convergence of Markov chain Monte Carlo algorithms. In Bayesian Statistics 5 (eds J.M. Bernardo, J.O. Berger, A.P. Dawid and A.F.M. Smith), pp. 297-321. Oxford: Oxford University Press.

Shoemaker, A.C., Tsui, K.L. and Wu, C.F.J. (1991). Economical experimentation methods for robust design. Technometrics, 33, 415-427.

Smith, A.F.M. and Roberts, G.O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods (with discussion). Journal of the Royal Statistical Society, Series B, 55, 3-23.

Taguchi, G. and Wu, Y. (1980). Introduction to Off-line Quality Control. Nagoya, Japan: Central Japan Quality Control Association.

Tong, Y.L. (1990). The Multivariate Normal Distribution. New York: Springer-Verlag.