RESEARCH SUPPORT, UNIVERSITY OF MICHIGAN BUSINESS SCHOOL, APRIL 1997

ON SCALE MIXTURES OF UNIFORM DISTRIBUTIONS AND THE LATENT WEIGHTED LEAST SQUARES METHOD

WORKING PAPER #9712-09

BY STEPHEN WALKER, IMPERIAL COLLEGE, LONDON; PAUL DAMIEN, UNIVERSITY OF MICHIGAN; AND MARY MEYER, UNIVERSITY OF MICHIGAN

On Scale Mixtures of Uniform Distributions and the Latent Weighted Least Squares Method

Stephen Walker(1), Paul Damien(2) and Mary Meyer(3)

(1) Department of Mathematics, Imperial College, 180 Queen's Gate, London SW7 2BZ.
(2) Business School, University of Michigan, Ann Arbor, MI 48109-1234, USA.
(3) Department of Statistics, University of Michigan, Ann Arbor, MI 48109.

Summary

In this paper we introduce a new estimator for the coefficients of a linear regression model. The estimator is based on the characterisation of a normal distribution as a scale mixture of a uniform, the mixing distribution being a particular gamma distribution. The Student t and exponential power distributions are also characterised as scale mixtures of uniforms; these, in turn, are seen to be special cases of a new (and general) family of distributions. It is shown that this new family of distributions coincides with the class of unimodal, symmetric distributions.

Key words: EM algorithm, Latent variables, Weighted least squares, Robustness.

1 Introduction

This paper considers the familiar linear regression model

y_i = \sum_{j=1}^k x_{ij} \beta_j + e_i, i = 1, ..., n, (1)

where the y_i are observable dependent variables, the x_{ij} are known covariates, the \beta_j are the unknown regression coefficients and the parameters of interest, and the e_i are independent and identically distributed (iid) error terms. We make the following assumptions about these errors: 1) E e_i = 0 and the e_i are symmetric about 0; and 2) var(e_i) = \sigma^2.

The most popular and widely used estimate for the regression parameter \beta = (\beta_1, ..., \beta_k) is the ordinary least squares (OLS) estimator:

\hat\beta_{OLS} = [\sum_i X_i X_i']^{-1} \sum_i y_i X_i,

where X_i is the column vector with entries (x_{i1}, ..., x_{ik}). The OLS is the maximum likelihood (ML) estimator under the assumption that the e_i are normally distributed.

It is well known that the OLS estimator is likely to be unsatisfactory in a number of possible sampling scenarios; for example, if the errors have a heavier tailed distribution than the normal, or if there are outlier observations. In the latter case, the OLS estimator gives too much influence to outliers; that is, it is not robust. 'Moving' a response observation to infinity would drag the estimator to infinity as well. It is desirable to have an estimator that 'downweights' the effect of outliers. A number of alternative (ML) estimators are available by considering non-normal error models; for example, the Student t distribution; see Meng and van Dyk (1997) for a recent likelihood based approach and O'Hagan (1988) for a Bayesian treatment.

In this paper we propose a new estimator, Latent Weighted Least Squares (LWLS), for the parameters in a linear regression model in cases of outlier and/or non-normal error models, using an EM type algorithm that is both simple to code and fast to execute. The development of the LWLS estimator starts from a characterisation of the normal distribution as a scale mixture of a uniform, the mixing distribution being a particular gamma distribution. We
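The non-robustness of OLS described above is easy to exhibit numerically. The following is an illustrative sketch (the data, seed and numbers are invented for the example, not taken from the paper): it fits the OLS formula to simulated data, then 'moves' one response far away and refits.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])           # design matrix with intercept
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5, n)    # true beta = (1, 2)

def ols(X, y):
    # beta_OLS = [sum_i X_i X_i']^{-1} sum_i y_i X_i
    return np.linalg.solve(X.T @ X, X.T @ y)

beta_clean = ols(X, y)

# 'move' a single response observation far away:
# the OLS estimate is dragged along with it
y_bad = y.copy()
y_bad[0] += 100.0
beta_bad = ols(X, y_bad)

print(beta_clean)   # close to the true (1, 2)
print(beta_bad)     # pulled well away from (1, 2) by the single outlier
```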

are therefore able to characterise any scale mixture of a normal distribution (Andrews and Mallows, 1974; West, 1987) as a scale mixture of a uniform distribution. In Section 2 we characterise the Student t and exponential power distributions as scale mixtures of uniforms. These characterisations lead to a new family of distributions and a characterisation of the Weibull distribution. In Section 3 we introduce the LWLS estimation method, which is based on the characterisation of the normal as a scale mixture of a uniform. Section 4 contains a simulation study and real data analyses to investigate the LWLS estimator by comparing it with the OLS estimator.

2 Scale mixtures of uniform distributions

We can write the model (1) in a different way: first, introduce the latent variable u = (u_1, ..., u_n), with each u_i defined on (0, \infty). Consider the model

y_i | u_i = \sum_{j=1}^k x_{ij} \beta_j + r_i \sqrt{u_i}, i = 1, ..., n, (2)

where the r_i are iid from the uniform distribution on (-1, +1), and the u_i are iid from some distribution f_u defined on (0, \infty). The r_i and the u_i are also independent of each other.

Theorem 1. For the model given in (2), (i) E y_i = \sum_{j=1}^k x_{ij} \beta_j and y_i is symmetric about its mean; and (ii) if E u_i = 3\sigma^2 then var(y_i) = \sigma^2.

The proof is straightforward, and is omitted. Therefore, provided E u_i = 3\sigma^2, the conditions 1) and 2) will be satisfied. One possibility is the normal distribution, which arises when f_u is a particular gamma distribution:

Theorem 2 (The normal error regression model). For the model in (2), if f_u is the gamma distribution with parameters (3/2, \lambda/2) and mean value 3/\lambda, where \lambda = \sigma^{-2}, then marginally each e_i is normally distributed with mean 0 and variance \sigma^2.
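Theorems 1 and 2 are easy to check by Monte Carlo. The sketch below (illustrative values \sigma = 2 and \mu = 0 are ours) draws u from the ga(3/2, \lambda/2) mixing distribution and then y uniformly on (-\sqrt{u}, +\sqrt{u}); note that NumPy's gamma generator is parameterised by shape and scale, so rate \lambda/2 corresponds to scale 2/\lambda.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200_000, 2.0
lam = 1.0 / sigma**2

# mixing distribution: u ~ gamma(shape 3/2, rate lam/2);
# numpy takes a scale parameter, scale = 2/lam
u = rng.gamma(shape=1.5, scale=2.0 / lam, size=n)

# y | u ~ Uniform(-sqrt(u), +sqrt(u)), i.e. model (2) with mu = 0
y = rng.uniform(-np.sqrt(u), np.sqrt(u))

print(u.mean())   # ~ 3 * sigma**2, as Theorem 1 (ii) requires
print(y.var())    # ~ sigma**2
excess_kurtosis = (y**4).mean() / y.var()**2 - 3.0
print(excess_kurtosis)   # ~ 0, consistent with a normal marginal
```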

Proof. This follows from the result

\int_{u > y^2} \exp(-u) \, du = \exp(-y^2),

and noting that model (2) is equivalent to y_i | u_i being uniformly distributed on the interval (\mu_i - \sqrt{u_i}, \mu_i + \sqrt{u_i}), where \mu_i = \sum_{j=1}^k x_{ij} \beta_j.

We have noted that the normal distribution can be represented as a scale mixture of a uniform distribution. Here we note that two other distributions, the Student t and exponential power distributions, also arise as scale mixtures of uniform distributions. Both of these distributions have a scale mixture of normal representation (West, 1987); for example, the Student t, with \nu degrees of freedom, has the representation

y | \epsilon ~ N(\mu, \sigma^2/\epsilon), \epsilon ~ ga(\nu/2, \nu/2).

We can write the normal first stage as the scale mixture of uniform distribution:

y | u ~ U(\mu - \sigma\sqrt{u}, \mu + \sigma\sqrt{u}), u ~ ga(3/2, \epsilon/2).

We can then combine f(u | \epsilon) and f(\epsilon), and integrate over \epsilon, to obtain the marginal distribution of u and hence a scale mixture of uniform representation:

Theorem 3 (The Student t distribution). If f_u has density given up to proportionality by

f(u) \propto \sqrt{u} / (\nu + u)^{(\nu+3)/2},

and y | u ~ U(\mu - \sigma\sqrt{u}, \mu + \sigma\sqrt{u}), then y has a Student t distribution with mean \mu, scale parameter \sigma and \nu degrees of freedom.

We state the result for the exponential power distribution:

Theorem 4 (The exponential power distribution). If f_u has density proportional to

f(u) \propto u^{1/\tau - 1/2} \exp(-u^{1/\tau}),

and y | u ~ U(\mu - \sigma\sqrt{u}, \mu + \sigma\sqrt{u}), then

f(y) \propto \exp\{-(|y - \mu|/\sigma)^{2/\tau}\},

where \tau \in (0, 2].

This characterisation of the exponential power distribution appears to be more tractable than the alternative scale mixture of a normal characterisation (West, 1987), which is only valid for \tau \in (1, 2]. We can obtain an interesting result by combining the result of West with ours:

Theorem (West, 1987). If y | \lambda ~ N(0, \lambda), where here \lambda denotes 1/(variance), and f(\lambda) \propto \lambda^{-1/2} p_{1/\tau}(\lambda), (1 < \tau < 2), where p_\alpha(\cdot) denotes the density of the positive stable distribution with index \alpha (0 < \alpha < 1), then f(y) \propto \exp(-|y|^{2/\tau}).

We can now insert the uniform and gamma mixture to replace the normal, leading to the following three stage mixture:

y | u ~ U(-\sqrt{u}, +\sqrt{u}), u | \lambda ~ ga(3/2, \lambda/2) and f(\lambda) \propto \lambda^{-1/2} p_{1/\tau}(\lambda).

Combining the last two stages implies

u^{1/\tau - 1/2} \exp(-u^{1/\tau}) \propto u^{1/2} \int \lambda^{3/2} \exp(-\lambda u/2) \, \lambda^{-1/2} p_{1/\tau}(\lambda) \, d\lambda.

Therefore,

Theorem 5. If u | \lambda has the exponential distribution with mean 2/\lambda and

f(\lambda) \propto p_{1/\tau}(\lambda), (1 < \tau < 2), then

f(u) \propto u^{1/\tau - 1} \exp(-u^{1/\tau}),

a Weibull distribution.

In fact any distribution which has a scale mixture of normal representation also has a scale mixture of uniform representation. This is easy to see: if x | \lambda ~ N(0, \lambda) and \lambda ~ g, then x | u ~ U(-\sqrt{u}, +\sqrt{u}) with u ~ f, where

f(u) \propto \sqrt{u} \int_{\lambda=0}^{\infty} \lambda^{3/2} \exp(-\lambda u/2) g(\lambda) \, d\lambda.

There does not seem to be any reason why we should consider only the Student t and exponential power families. Consider the general family of distributions:

y | u ~ U(\mu - \sigma\sqrt{u}, \mu + \sigma\sqrt{u}), u ~ f_u.

Then the following hold: E y = \mu, var(y) = \sigma^2 E[u]/3 and

kurtosis(y) = \frac{9 E[u^2]}{5 (E[u])^2} - 3.

So the mean and variance of u determine the variance and kurtosis of y. To obtain var(y) = \sigma^2 and kurtosis(y) = \tau we require E[u] = 3 and var(u) = 5\tau + 6. Note that we must have \tau > -6/5, which is the kurtosis of the uniform density. A particular distribution which satisfies these requirements is given by f_u = ga(9\alpha, 3\alpha), where \alpha = (5\tau + 6)^{-1}. This new family of distributions has parameters (\mu, \sigma, \tau), with mean \mu, variance \sigma^2 and kurtosis \tau. We recover the normal distribution when \tau = 0 (\alpha = 1/6).

In fact the scale mixture of uniform family coincides with the class of unimodal, symmetric distributions:

Theorem 6. If f_x is a unimodal density, symmetric about 0, and f_x'(x) exists for all x, then

f_x(x) = \frac{1}{2} \int_{u > x^2} \frac{f_u(u)}{\sqrt{u}} \, du,

where f_u(u) = -f_x'(\sqrt{u}). Therefore we can write x | u ~ U(-\sqrt{u}, +\sqrt{u}) with u ~ f_u, provided f_u is a density on (0, \infty). Note that

-\int_0^{\infty} f_x'(\sqrt{u}) \, du = 1,

which follows from

1 = \int_{-\infty}^{\infty} f_x(x) \, dx = \frac{1}{2} \int_{u=0}^{\infty} \frac{-f_x'(\sqrt{u})}{\sqrt{u}} \left[ \int_{x^2 < u} dx \right] du,

and f_u(u) \geq 0 iff f_x(x) is unimodal. Theorem 6 is a consequence of a theorem of Feller (1971, p. 155); see also Brunner and Lo (1989). We note that our approach appears to be more general and simpler than the one provided by Feller. Feller (see also Brunner and Lo) considers the different scale mixture of uniform model given by

x | u ~ U(\mu - u, \mu + u), u ~ G,

for some distribution G with support on (0, \infty). Brunner and Lo then assign G a Dirichlet process prior. However, this model only provides the unimodal, symmetric distribution for x, and the first four moments of u are all required to specify the first four moments of x. With our model,

x | u ~ U(\mu - \sigma\sqrt{u}, \mu + \sigma\sqrt{u}), u ~ f_u,

it is clear that, since we are explicitly modelling the variance of x, only the first two moments of u are required to define the first four moments of x. We can 'improve' on the Brunner and Lo (1989) model by considering the following model:

x | u ~ U(\mu - \sigma\sqrt{u}, \mu + \sigma\sqrt{u}), u ~ G,

with G taken from a Polya tree prior, which generalises the Dirichlet process prior. For this we would need to fix the location of G, which would otherwise be confounded with \sigma. We can achieve this by fixing the median of G, an easy task, and this development will be reported elsewhere.

3 Latent weighted least squares (LWLS)

The advantage of introducing the latent model (2) is the availability of the 'natural' weights (u_1, ..., u_n) for a weighted least squares method, based on var(y_i | u_i) = u_i/3. Given the u_i, the standard estimator for the regression parameter is the weighted least squares (WLS) estimate,

\beta_{WLS}(u) = [\sum_i w_i X_i X_i']^{-1} \sum_i w_i y_i X_i, (3)

where w_i = 1/u_i. This will 'downweight' extreme observations, since it is clear from (2) that the larger y_i is, the larger u_i will need to be. In principle, we have a linear random effects model not unlike the original linear random effects models considered by Laird and Ware (1982). A slightly modified version of their hybrid EM algorithm is used to obtain a LWLS estimate.

We treat the u_i as missing data, so that (y, u) represents the complete dataset. With this complete dataset we have, with f_u as ga(3/2, \lambda/2),

E(y_i | u_i) = \sum_{j=1}^k x_{ij} \beta_j and var(y_i | u_i) = u_i/3,

and we take \beta = \beta_{WLS}(u) given by (3). Following Laird and Ware, we then obtain the expectations of the sufficient statistics for the missing data; that is, \hat{u}_i = E(u_i | \beta, \sigma^2, y), so

\hat{u}_i = (y_i - \sum_{j=1}^k x_{ij} \beta_j)^2 + 2\sigma^2. (4)

Finally, we maximise the complete data likelihood l(\sigma^2 | y, u) to update \sigma^2:

\sigma^2 = \sum_i \hat{u}_i / (3n). (5)

Combining (3), (4) and (5) gives our algorithm for obtaining the LWLS estimator for the linear regression model:

1) \hat\beta = [\sum_i \hat{w}_i X_i X_i']^{-1} \sum_i \hat{w}_i y_i X_i, where \hat{w}_i = 1/\hat{u}_i;
2) \hat\sigma^2 = \sum_i \hat{u}_i / (3n); and
3) \hat{u}_i = (y_i - \sum_{j=1}^k x_{ij} \hat\beta_j)^2 + 2\hat\sigma^2.

The algorithm iterates over 1), 2) and 3) until convergence. Suitable starting values are the OLS estimator for \beta and the sample variance for \sigma^2. A simulation study and real data analyses, comparing LWLS with OLS, are presented in Section 4.
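The three-step iteration above is only a few lines of code. The following is a sketch implementation (variable names, convergence tolerances and the synthetic contaminated-normal test data are ours, not from the paper):

```python
import numpy as np

def lwls(X, y, tol=1e-8, max_iter=1000):
    """Latent weighted least squares for y = X beta + e.
    Iterates steps 1)-3) of Section 3; returns (beta_hat, sigma2_hat)."""
    n = len(y)
    # starting values: OLS for beta, sample variance of the residuals for sigma^2
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    sigma2 = np.mean((y - X @ beta) ** 2)
    for _ in range(max_iter):
        # step 3) (E-step): u_i = (y_i - x_i' beta)^2 + 2 sigma^2, equation (4)
        u = (y - X @ beta) ** 2 + 2.0 * sigma2
        # step 1): weighted least squares with weights w_i = 1/u_i, equation (3)
        w = 1.0 / u
        Xw = X * w[:, None]
        beta_new = np.linalg.solve(Xw.T @ X, Xw.T @ y)
        # step 2): sigma^2 = sum_i u_i / (3n), equation (5)
        sigma2_new = u.sum() / (3.0 * n)
        done = np.abs(beta_new - beta).max() < tol and abs(sigma2_new - sigma2) < tol
        beta, sigma2 = beta_new, sigma2_new
        if done:
            break
    return beta, sigma2

# illustrative check on contaminated-normal errors (data invented for this sketch):
# standard normal noise, shifted by +5 with probability 1/10
rng = np.random.default_rng(7)
n = 500
x = rng.uniform(0.0, 1.0, n)
X = np.column_stack([np.ones(n), x])
e = rng.normal(0.0, 1.0, n) + 5.0 * (rng.uniform(size=n) < 0.1)
y = X @ np.array([1.0, 2.0]) + e
beta_hat, sigma2_hat = lwls(X, y)
print(beta_hat)   # slope close to 2; intercept pulled above 1 by the
                  # positive contamination, but less so than OLS would be
```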

4 Examples

To begin our investigation of the LWLS estimator we simulated 50 standard normal random variates and computed the OLS estimate and the LWLS estimate. We repeated this exercise 1000 times and computed the mean and variance of the sample estimates: for the OLS these are (5.9e-04, 0.02), and for the LWLS these are (-1.1e-03, 4.4e-06). At the other end of the scale, we simulated 50 standard Student t random variates with 2 degrees of freedom and, again, repeated the exercise 1000 times. The corresponding mean and variance of the sample estimates are (-5.4e-03, 0.22) and (7.2e-03, 5.5e-05) for the OLS and LWLS estimators, respectively. It is quite obvious from these results that the LWLS estimator improves quite remarkably on the OLS estimator in these two fundamental, yet common, sampling scenarios.

How does the LWLS estimator cope with contaminated data and outliers? To investigate this point we sampled 50 random variates from the contaminated standard normal, obtained by adding 5 to any of the standard normal random variates with probability 1/10. The corresponding mean and variance of the sample estimates are (0.49, 0.06) and (0.11, 1.1e-04) for the OLS and LWLS estimators, respectively.

For an analysis involving a real dataset we turn to Box and Tiao (1973, Table 3.4.1). The data consist of 20 experiments relating the rate of a chemical reaction to the temperature at which the experiment was conducted. The linear model y_i = \beta_1 + \beta_2 x_i + e_i is used, where y_i and x_i denote the log reaction rate and a measure of the temperature for the ith experiment, respectively. Box and Tiao, within a robust Bayesian framework, use the exponential power family for modelling the error distribution, which includes the normal distribution. They conclude, by consideration of the power parameter of the exponential power distribution, that the normality assumption is valid, and under this assumption obtain an estimate for (\beta_1, \beta_2) of (-4.013, -0.203). Our LWLS estimator turns out to be (-4.008, -0.202).

In the same chapter of their book, Box and Tiao reanalyse the Darwin dataset (Table 3.2.1), consisting of 15 iid observations. They again make use

of the exponential power family to model the observations. They conclude in this case that the normality assumption is not valid and that the true distribution shows signs of leptokurtosis. The mean of the data is 20.933 and the LWLS estimate is 26.469. This higher estimate is in keeping with the results of Box and Tiao (see Figures 3.2.3 and 3.2.6).

5 Discussion

In this paper we study scale mixtures of uniform distributions, a family which coincides with the unimodal, symmetric densities on the real line. We introduce a new (mixture) family of distributions which is ideally suited to the modelling of location, scale and kurtosis. A new estimator for the linear regression model, the Latent Weighted Least Squares (LWLS) estimator, is developed which is simple to code and fast to execute. Illustrative analyses exemplifying the method are provided and demonstrate vast improvement on the OLS estimator in a number of sampling scenarios. We encourage its widespread use.

References

Andrews, D.F. and Mallows, C.L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B 36, 99-102.

Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley, Massachusetts.

Brunner, L.J. and Lo, A.Y. (1989). Bayes methods for a symmetric unimodal density and its mode. The Annals of Statistics 17, 1550-1566.

Feller, W. (1971). An Introduction to Probability Theory and Its Applications II. Wiley, New York.

Laird, N.M. and Ware, J.H. (1982). Random-effects models for longitudinal data. Biometrics 38, 963-974.

Meng, X.L. and van Dyk, D. (1997). The EM algorithm: an old folk-song sung to a fast new tune (with discussion). Journal of the Royal Statistical Society, Series B. To appear.

O'Hagan, A. (1988). Modelling with heavy tails. In Bayesian Statistics 3 (J.M. Bernardo, M.H. DeGroot, D.V. Lindley and A.F.M. Smith, eds.), pp. 345-359. Oxford University Press.

West, M. (1987). On scale mixtures of normal distributions. Biometrika 74, 646-648.