Division of Research
Graduate School of Business Administration
The University of Michigan
August 1, 1980

Robust Sampling Designs Using Several Auxiliary Variables
Working Paper No. 227
Roger L. Wright
The University of Michigan

FOR DISCUSSION PURPOSES ONLY
None of this material is to be quoted or reproduced without the express permission of the Division of Research.

ABSTRACT

Strategies are investigated for large-scale surveys of populations having known auxiliary variables related to the target variable through a linear superpopulation model. Strategies which combine WLS estimators with varying probability sampling designs are evaluated using criteria that integrate sampling and model-based considerations. The best robust strategy is found to incorporate a new WLS estimator and a design which generalizes Neyman allocation. This strategy is typically much more efficient than robust strategies using OLS or BLU estimators. The best robust strategy can be approximated by strategies using strongly stratified sampling.

KEY WORDS: Balanced sampling; Multiple regression models; Robustness; Stratification; Superpopulation models; Unequal probability sampling.

Author's Footnote: Roger L. Wright is Associate Professor of Statistics, Graduate School of Business Administration, The University of Michigan, Ann Arbor, MI 48109. This paper was prepared with the support of the U.S. Department of Energy, Grant No. DE-FG02-80ER10125. However, any opinions, findings, conclusions, or recommendations expressed herein are those of the author and do not necessarily reflect the views of D.O.E. The author wishes to thank K.R.W. Brewer, W.A. Ericson, and D.J. Aigner for valuable comments on an earlier version of this paper.

1. INTRODUCTION

Sampling strategies have conventionally combined sampling designs and estimators which minimize bias and provide reasonable efficiency with minimal assumptions about population distributions. In other fields, inference is more dependent on models, especially linear regression models. This paper identifies a class of strategies for using several auxiliary variables in WLS estimators which are asymptotically unbiased in a traditional sampling sense. By adopting a linear superpopulation model, a strategy can be constructed which is robust in the sense that it provides an asymptotically unbiased estimator regardless of the validity of the model, and is efficient in the sense that it minimizes the asymptotic variance if the model is accurate. This strategy is often substantially more efficient than robust strategies based on OLS or BLU estimators.

Following Royall (1976), consider a finite population comprised of $N$ units labeled $I = 1, \ldots, N$. Unit $I$ has known attributes given by the $(k \times 1)$ vector $X_I$ in $R^k$, as well as an attribute $Y_I$ which is the realization of a superpopulation random variable. For the purpose of design, assume the linear superpopulation model

$$Y_I = X_I'\beta + e_I, \tag{1.1}$$

$$E(e_I) = 0, \qquad E(e_I e_J) = \begin{cases} \sigma_I^2 > 0 & \text{if } J = I, \\ 0 & \text{otherwise,} \end{cases}$$

with $\beta$ unknown but $\sigma_I$ known.

Two restrictions of (1.1) are of interest. Brewer (1979) has investigated the ratio model in which $k = 1$ and $X_I > 0$. Our general analysis will prove to be most interesting in the case of the nonzero-intercept model in which the initial element of $X_I$ is unity.

For any sample $s$, as an estimator, or a predictor under (1.1), of the total $Y = \sum_{I=1}^{N} Y_I$ we use

$$y^* = \sum_{I \in s} Y_I + \sum_{I \notin s} X_I'\hat{\beta} \tag{1.2}$$

with

$$\hat{\beta} = \Big( \sum_{I \in s} W_I X_I X_I' \Big)^{-1} \sum_{I \in s} W_I X_I Y_I. \tag{1.3}$$

Here it is assumed that the $W_I$ and the sample $s$ yield a nonsingular matrix $\sum_{I \in s} W_I X_I X_I'$; otherwise the $W_I$ are to be chosen as part of the strategy.

Conventionally the $W_I$ are chosen following criteria based on (1.1). The most widely recommended choice is $W_I = \sigma_I^{-2}$, so that $y^*$ is the best linear unbiased (BLU) predictor of $Y$ conditional on (1.1) and $s$ (Royall 1976); this predictor is denoted $y^*_{BLU}$. A second choice might be $W_I = 1$, giving the ordinary least squares (OLS) predictor $y^*_{OLS}$. A new choice of $W_I$ will be proposed in Section 3. This strategy is generally superior to BLU or OLS when certain sampling considerations are included.

The precision of $y^*$ depends not only on the $W_I$ but also on the sampling design. If (1.1) is known to be accurate, a sensible strategy is to choose

the sample $s$ to minimize the variance of $y^*_{BLU}$. For example, with the ratio model and $\sigma_I^2 = \sigma_0^2 X_I$, this strategy dictates that $s$ be comprised of the $n$ largest units in the population.

In general, if the assumed model is inaccurate, $y^*_{BLU}$ can be badly biased under this strategy. This has stimulated interest in robust strategies that provide some degree of protection against model misspecification. Royall and Herson (1973a) and Scott, Brewer and Ho (1978) work with $y^*_{BLU}$ for the ratio model and impose balance conditions on $s$ that guarantee unbiasedness under a class of alternative models. Following their approach, $y^*$ is unbiased under the alternative model $Y_I = Z_I'\delta + u_I$ with $E(u_I) = 0$ if $s$ satisfies the balance condition

$$\sum_{I \notin s} X_I' \Big( \sum_{I \in s} W_I X_I X_I' \Big)^{-1} \sum_{I \in s} W_I X_I Z_I' = \sum_{I \notin s} Z_I'. \tag{1.4}$$

The balanced sampling approach raises two questions:

1. How to select $s$ satisfying (1.4) for one or more $Z_I$, and
2. How to offset the loss of efficiency in $y^*$ when $s$ is forced to satisfy (1.4).

A partial answer is to use $y^*_{BLU}$ with stratified sampling (Royall and Herson 1973b) or varying probability sampling (Scott et al. 1978). A more complete answer leading in a new direction is given by Brewer (1979) for the ratio model. Instead of requiring model-unbiasedness under a specific class of alternative models, Brewer obtains robustness by imposing a condition which relates the $W_I$ to the sampling design and which guarantees

that $y^*$ is asymptotically design unbiased (ADU). Brewer then selects the sampling design to minimize the asymptotic variance of $y^*$, say $v(y^*)$. (The asymptotic construction will be described in Section 2.) In particular, Brewer shows for the ratio model that:

1. $y^*$ is ADU if and only if the $W_I$ are proportional to $(\pi_I^{-1} - 1)/X_I$, where $\pi_I$ is the probability of selecting unit $I$.
2. If $y^*$ is ADU, then $v(y^*)$ is minimized with the sampling design $\pi_I = n\sigma_I / \sum_{J=1}^{N} \sigma_J$.
3. With this design,

$$v(y^*) = n^{-1} \Big( \sum_{I=1}^{N} \sigma_I \Big)^2 - \sum_{I=1}^{N} \sigma_I^2.$$

In the next two sections, Brewer's results will be generalized to the unrestricted multivariate model (1.1), and in particular, to the nonzero-intercept model.
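To make the ratio-model case concrete, the following is a minimal sketch of the predictor (1.2)-(1.3) with $k = 1$ and weights proportional to $(\pi_I^{-1} - 1)/X_I$. The function name `ratio_adu_predictor`, the toy population, the chosen sample, and the assumption $\sigma_I^2 \propto X_I$ are illustrative and do not come from the paper; the example only shows the mechanics of the estimator, not its sampling properties.

```python
def ratio_adu_predictor(x, y, pi, sample):
    """Predictor (1.2)-(1.3) for the ratio model (k = 1).

    x, y, pi are population-length lists; only y[i] for i in `sample`
    is read, mimicking the fact that Y is observed on the sample only.
    """
    in_sample = set(sample)
    # Weights proportional to (1/pi_I - 1) / X_I.
    w = {i: (1.0 / pi[i] - 1.0) / x[i] for i in sample}
    # With k = 1, (1.3) reduces to a ratio of weighted sums.
    beta_hat = (sum(w[i] * x[i] * y[i] for i in sample)
                / sum(w[i] * x[i] ** 2 for i in sample))
    observed = sum(y[i] for i in sample)
    predicted = beta_hat * sum(x[i] for i in range(len(x)) if i not in in_sample)
    return observed + predicted


# Hypothetical population: sigma_I^2 assumed proportional to X_I.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
sigma = [xi ** 0.5 for xi in x]
n = 2
pi = [n * s / sum(sigma) for s in sigma]   # design of Brewer's result 2

# Noise-free responses Y_I = 2 X_I, so the population total is 62.
y = [2.0 * xi for xi in x]
estimate = ratio_adu_predictor(x, y, pi, sample=[1, 4])
print(estimate)
```

When the responses follow the ratio model without error, as in this toy case, the predictor reproduces the population total for any sample; with noisy responses it would not, and its behavior is then described by the asymptotic results above.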

2. ASYMPTOTICALLY DESIGN UNBIASED STRATEGIES

Following Brewer (1978), our asymptotic limits will be constructed as follows. For any positive integer $m$, consider $m$ exact copies of the original finite population to form an aggregate population of $mN$ units with total $Y_m = mY$. From each of these $m$ populations, one sample is selected using fixed $\pi_I$; these $m$ samples are considered together as a single aggregate sample. The estimator $y_m^*$ is defined by applying (1.2) and (1.3) to the aggregate sample. To guarantee the existence of various limits, we assume that there exists a constant $\lambda > 0$ such that all characteristic roots of $\sum_{I \in s} W_I X_I X_I'$ are almost surely greater than $\lambda$.

With this construction, $\lim_{m \to \infty} E_p(y_m^*/m)$ exists and is equal to

$$\sum_{I=1}^{N} \pi_I Y_I + C' \sum_{I=1}^{N} \pi_I W_I X_I Y_I. \tag{2.1}$$

Here $E_p$ is the expectation based on the sampling design, and

$$C' = \sum_{I=1}^{N} (1 - \pi_I) X_I' \Big( \sum_{I=1}^{N} \pi_I W_I X_I X_I' \Big)^{-1}. \tag{2.2}$$

$y^*$ is said to be asymptotically design unbiased (ADU) if and only if $\lim_{m \to \infty} E_p(y_m^*/m)$ is equal to $Y$ for any finite population. From (2.1), $y^*$ is ADU if and only if

$$\pi_I + \pi_I W_I C'X_I = 1, \qquad I = 1, \ldots, N. \tag{2.3}$$

Although (2.2)-(2.3) seem to provide a rather complex characterization of the $W_I$, suitable $W_I$ can easily be constructed. Suppose $D$ is any vector such that $D'X_I > 0$ for all $I$, and let

$$W_I = (\pi_I^{-1} - 1)(D'X_I)^{-1}. \tag{2.4}$$

These $W_I$ satisfy (2.3) since (2.4) implies $\pi_I W_I D'X_I = 1 - \pi_I$, so that

$$C' = \sum_{I=1}^{N} \pi_I W_I D'X_I X_I' \Big( \sum_{I=1}^{N} \pi_I W_I X_I X_I' \Big)^{-1} = D'.$$

This means that an ADU estimator $y^*$ can be constructed using (2.4) with any $D$ for which $D'X_I > 0$ for all $I$.

Various choices of $D$ generate simple classes of estimators in particular cases. While additional situations may arise and call for other choices of $D$, two cases are of interest here. For the ratio model, $D$ is necessarily scalar, so (2.4) implies that $y^*$ is ADU if and only if the $W_I$ are proportional to $(\pi_I^{-1} - 1)/X_I$, as in Brewer (1979). Choosing $D = [\alpha^{-1} \; 0 \; \cdots \; 0]'$, $\alpha > 0$, shows that $y^*$ is ADU for any model with a nonzero intercept if

$$W_I = \alpha(\pi_I^{-1} - 1), \qquad \alpha > 0. \tag{2.5}$$
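The ADU condition can be checked numerically. The sketch below assumes a nonzero-intercept model with a single auxiliary variable, so $X_I = (1, x_I)'$, and uses the weights (2.5). It evaluates $C$ from (2.2), the left side of (2.3) for each unit, and the limit (2.1) for an arbitrary set of $Y_I$ that need not follow any linear model. The function names, the population values, and the choice $\alpha = 2$ are illustrative assumptions.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a z = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]


def adu_limit(x, y, pi, alpha=1.0):
    """Evaluate (2.1)-(2.3) for X_I = (1, x_I)' with weights (2.5)."""
    w = [alpha * (1.0 / p - 1.0) for p in pi]
    rows = [(1.0, xi) for xi in x]
    m = [[0.0, 0.0], [0.0, 0.0]]   # sum of pi_I W_I X_I X_I'
    r = [0.0, 0.0]                 # sum of (1 - pi_I) X_I
    for xi, p, wi in zip(rows, pi, w):
        for j in range(2):
            r[j] += (1.0 - p) * xi[j]
            for k in range(2):
                m[j][k] += p * wi * xi[j] * xi[k]
    # C' = r' M^{-1}; M is symmetric, so C solves M C = r.
    c = solve_2x2(m, r)
    cx = [c[0] * xi[0] + c[1] * xi[1] for xi in rows]
    condition = [p + p * wi * v for p, wi, v in zip(pi, w, cx)]   # (2.3)
    limit = sum(p * yi for p, yi in zip(pi, y)) + sum(
        p * wi * v * yi for p, wi, v, yi in zip(pi, w, cx, y))    # (2.1)
    return c, condition, limit


# Hypothetical population; y is deliberately not linear in x.
x = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0]
y = [3.0, 1.0, 7.0, 2.0, 20.0, 9.0]
pi = [0.2, 0.3, 0.4, 0.5, 0.7, 0.9]

c, condition, limit = adu_limit(x, y, pi, alpha=2.0)
print(c, condition, limit, sum(y))
```

In this example $C$ should agree with $D = [\alpha^{-1} \; 0]'$ up to rounding, every entry of `condition` should equal one, and `limit` should equal the population total even though the $Y_I$ are not linear in $x_I$, which is the sense in which the strategy is robust.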

3. BEST ADU STRATEGIES

Within the class of ADU strategies a useful performance measure is the asymptotic variance of $y^*$, denoted $v(y^*)$. Here $v(y^*)$ is the asymptotic design-based expectation of the model-based mean squared error of $y^*$. The asymptotic construction is the same as in Section 2. Specifically, $v(y^*)$ equals

$$\lim_{m \to \infty} E_p E(y_m^* - Y_m)^2 / m.$$

$v(y^*)$ can easily be evaluated for any $y^*$ that is ADU. Using (1.1)-(1.3),

$$v(y^*) = \sum_{I=1}^{N} \pi_I W_I^2 \sigma_I^2 (C'X_I)^2 + \sum_{I=1}^{N} (1 - \pi_I)\sigma_I^2 \tag{3.1}$$

with $C$ as in (2.2). If $y^*$ is ADU, then (2.3) can be used to simplify (3.1), giving

$$v(y^*) = \sum_{I=1}^{N} \pi_I^{-1}(1 - \pi_I)^2 \sigma_I^2 + \sum_{I=1}^{N} (1 - \pi_I)\sigma_I^2 = \sum_{I=1}^{N} (\pi_I^{-1} - 1)\sigma_I^2. \tag{3.2}$$

The best ADU strategy is to choose the $\pi_I$ to minimize (3.2). By Schwarz's inequality,

$$\Big( \sum_{I=1}^{N} \sigma_I \Big)^2 \le \sum_{I=1}^{N} \pi_I \sum_{I=1}^{N} \sigma_I^2 \pi_I^{-1}$$

and $\sum_{I=1}^{N} \pi_I = n$, so

$$v(y^*) \ge n^{-1} \Big( \sum_{I=1}^{N} \sigma_I \Big)^2 - \sum_{I=1}^{N} \sigma_I^2. \tag{3.3}$$

The lower bound (3.3) is achieved by a generalization of Neyman allocation,

$$\pi_I = n\sigma_I \Big/ \sum_{J=1}^{N} \sigma_J. \tag{3.4}$$

A best ADU strategy combines the sampling design (3.4) with any ADU estimator, denoted $y^*_{BDU}$, giving

$$v(y^*_{BDU}) = n^{-1} \Big( \sum_{I=1}^{N} \sigma_I \Big)^2 - \sum_{I=1}^{N} \sigma_I^2. \tag{3.5}$$
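As an illustration of (3.2)-(3.5), the following sketch computes the design (3.4) for a small hypothetical population and compares its asymptotic variance with that of an equal-probability design of the same expected sample size. The function names and the $\sigma_I$ values are assumptions made for the example. Note that (3.4) is only feasible when $n\sigma_I / \sum_J \sigma_J \le 1$ for every unit; the sketch raises an error otherwise rather than attempting a correction.

```python
def asymptotic_variance(sigma, pi):
    """Asymptotic variance (3.2): sum over units of (1/pi_I - 1) * sigma_I**2."""
    return sum((1.0 / p - 1.0) * s ** 2 for s, p in zip(sigma, pi))


def best_adu_design(sigma, n):
    """Inclusion probabilities (3.4): pi_I = n * sigma_I / sum_J sigma_J."""
    total = sum(sigma)
    pi = [n * s / total for s in sigma]
    if max(pi) > 1.0:
        raise ValueError("some pi_I exceeds 1; (3.4) is not feasible for this n")
    return pi


sigma = [1.0, 2.0, 3.0, 4.0, 6.0]   # illustrative values, not from the paper
n = 2

pi_best = best_adu_design(sigma, n)
pi_equal = [n / len(sigma)] * len(sigma)   # equal-probability design, same n

v_best = asymptotic_variance(sigma, pi_best)
v_equal = asymptotic_variance(sigma, pi_equal)
lower_bound = sum(sigma) ** 2 / n - sum(s ** 2 for s in sigma)   # (3.3)

print(v_best, lower_bound, v_equal)
```

For these values the design (3.4) attains the bound (3.3), which equals 62, while the equal-probability design gives 99.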

4. ADU STRATEGIES FOR MODELS WITH NONZERO INTERCEPT

Throughout this section, assume the model (1.1) with a nonzero intercept, and consider the class of ADU strategies satisfying (2.5). Within this context, the best ADU strategy can be compared to more conventional ADU strategies employing OLS or BLU estimators. These comparisons show that the best strategy can greatly outperform the conventional strategies.

Consider first an OLS strategy with $W_I = 1$ and sample size $n_0$, giving the estimator $y^*_{OLS}$. If (2.5) is used to provide robustness, then $\pi_I$ is uniformly $n_0/N$ as in simple random sampling, and (3.2) implies that

$$v(y^*_{OLS}) = \sum_{I=1}^{N} (N/n_0 - 1)\sigma_I^2 = (N^2/n_0)(1 - n_0/N) \Big( \sum_{I=1}^{N} \sigma_I^2 / N \Big). \tag{4.1}$$

By comparing (4.1) with (3.5), we find that the OLS strategy using a simple random sample of size $n_0$ provides the same asymptotic variance as the best ADU strategy with sample size $n = (\text{eff})\, n_0$. Here the asymptotic efficiency (eff) of $y^*_{OLS}$ is

$$\text{eff} = \Big( \sum_{I=1}^{N} \sigma_I \Big)^2 \Big/ \Big( N \sum_{I=1}^{N} \sigma_I^2 \Big) = (cv^2 + 1)^{-1}, \tag{4.2}$$

where $cv$ is the coefficient of variation of $\sigma_I$ throughout the population. There is some evidence suggesting that $cv$ is likely to be well in excess of unity in populations of interest, implying that the asymptotic efficiency of $y^*_{OLS}$ compared to $y^*_{BDU}$ is likely to be substantially below 50%.
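A small numerical illustration of (4.2) may help fix ideas. The sketch below computes the efficiency both directly and through the coefficient of variation, using a hypothetical population in which one unit has a much larger $\sigma_I$ than the rest. The function names and the values are assumptions for illustration only; the variance uses the divisor $N$, matching the population-level definition of $cv$ implied by (4.2).

```python
def ols_efficiency(sigma):
    """Asymptotic efficiency (4.2): (sum sigma)^2 / (N * sum sigma^2)."""
    return sum(sigma) ** 2 / (len(sigma) * sum(s ** 2 for s in sigma))


def cv_squared(sigma):
    """Squared coefficient of variation of sigma, with divisor N."""
    mean = sum(sigma) / len(sigma)
    variance = sum((s - mean) ** 2 for s in sigma) / len(sigma)
    return variance / mean ** 2


# Hypothetical skewed population: nine small units and one large unit.
sigma = [1.0] * 9 + [41.0]

eff = ols_efficiency(sigma)
cv2 = cv_squared(sigma)

# Sample size the best ADU strategy needs to match an OLS sample of n0 units.
n0 = 5
n_equivalent = eff * n0

print(cv2 ** 0.5, eff, 1.0 / (1.0 + cv2), n_equivalent)
```

Here $cv = 2.4$ and the efficiency is about 0.148, so in this toy population the robust OLS strategy would need roughly seven times the sample size of the best ADU strategy to reach the same asymptotic variance. The equivalent sample size is generally not an integer and is meant only as a comparison of asymptotic variances.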

Some empirical work with real populations is underway. In any case, $y^*_{OLS}$ is efficient only if (1.1) is homoscedastic.

A robust BLU strategy has about the same (in)efficiency as the robust OLS strategy. Consider the BLU strategy with $W_I = \sigma_I^{-2}$ and sample size $n_0$. Assume also that the $\pi_I$ are uniformly small, so that (2.5) gives $\pi_I$ approximately equal to $n_0 \sigma_I^2 / \sum_{J=1}^{N} \sigma_J^2$. Then (3.2) shows that $v(y^*_{BLU})$ is approximately equal to (4.1). This means that, given small $\pi_I$, the asymptotic efficiency of $y^*_{BLU}$ is about the same as the asymptotic efficiency of $y^*_{OLS}$, (4.2), and is less than 100% if (1.1) is heteroscedastic.

The inefficiency of $y^*_{BLU}$ for a heteroscedastic model may be surprising. It results from the poor sampling design that is used to provide robustness, namely sampling with probability proportional to $\sigma_I^2$. The model-inefficiency of $y^*_{BDU}$ is more than compensated by the design-efficiency of sampling with probability proportional to $\sigma_I$. Simply stated, the BLU strategy is likely to yield a sample containing too many units with large $\sigma_I$. It is interesting to note that the inefficient BLU sampling design is a pps design if, as is often assumed, $\sigma_I^2$ is proportional to size.
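The approximation behind the BLU comparison can be written out in a few lines. This is a sketch of the intermediate steps under the stated assumption that the $\pi_I$ are uniformly small, so that $\pi_I^{-1} - 1 \approx \pi_I^{-1}$ in (2.5); it uses only (2.5), (3.2), and $\sum_I \pi_I = n_0$.

```latex
\begin{align*}
\sigma_I^{-2} = W_I = \alpha(\pi_I^{-1} - 1) \approx \alpha\,\pi_I^{-1}
  &\;\Longrightarrow\; \pi_I \approx \alpha\,\sigma_I^2, \\
\sum_{I=1}^{N} \pi_I = n_0
  &\;\Longrightarrow\; \alpha \approx n_0 \Big/ \sum_{J=1}^{N} \sigma_J^2, \\
v(y^*_{BLU}) = \sum_{I=1}^{N} (\pi_I^{-1} - 1)\,\sigma_I^2
  &\approx \sum_{I=1}^{N} \Big( \frac{\sum_{J=1}^{N} \sigma_J^2}{n_0\,\sigma_I^2} - 1 \Big) \sigma_I^2
   = (N/n_0 - 1) \sum_{I=1}^{N} \sigma_I^2 .
\end{align*}
```

The last expression is the first form of (4.1), which is why the robust BLU and robust OLS strategies have approximately the same asymptotic variance.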

5. STRONGLY STRATIFIED ROBUST DESIGNS

It is sometimes advantageous to approximate the best ADU strategy using stratification. Suppose that $\{S_h : h = 1, \ldots, H\}$ is any stratification of the population. Let $cv_h$ be the coefficient of variation of $\sigma_I$ within the $N_h$ units of stratum $h$, so that

$$1 + cv_h^2 = N_h \sum_{I \in S_h} \sigma_I^2 \Big( \sum_{I \in S_h} \sigma_I \Big)^{-2}. \tag{5.1}$$

While Dalenius and Hodges (1959), Cochran (1961) and others have been primarily interested in designs with small $H$, we will examine designs that are strongly stratified in the sense that $\max\{cv_h : h = 1, \ldots, H\} = \varepsilon$ is small.

Suppose that $y^*$ is ADU, with

$$\pi_I = n_h/N_h, \qquad I \in S_h. \tag{5.2}$$

Using (3.2),

$$v(y^*) = \sum_{h=1}^{H} (N_h/n_h - 1) \sum_{I \in S_h} \sigma_I^2. \tag{5.3}$$

As in Neyman allocation, (5.3) is minimized given $n$ by choosing $n_h$ proportional to $(N_h \sum_{I \in S_h} \sigma_I^2)^{1/2}$. With this, (5.3) becomes

$$v(y^*) = n^{-1} \Big[ \sum_{h=1}^{H} \Big( N_h \sum_{I \in S_h} \sigma_I^2 \Big)^{1/2} \Big]^2 - \sum_{I=1}^{N} \sigma_I^2$$

$$= n^{-1} \Big[ \sum_{h=1}^{H} (1 + cv_h^2)^{1/2} \sum_{I \in S_h} \sigma_I \Big]^2 - \sum_{I=1}^{N} \sigma_I^2$$

$$\le (1 + \varepsilon^2)\, n^{-1} \Big( \sum_{I=1}^{N} \sigma_I \Big)^2 - \sum_{I=1}^{N} \sigma_I^2. \tag{5.4}$$
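The chain (5.1)-(5.4) can be traced numerically. The sketch below takes a hypothetical three-stratum population, computes $cv_h^2$ from (5.1), allocates $n_h$ proportional to $(N_h \sum_{I \in S_h} \sigma_I^2)^{1/2}$, evaluates (5.3), and compares the result with the lower bound (3.5) and the upper bound (5.4). The function names and the $\sigma_I$ values are assumptions for illustration, and the $n_h$ are left fractional; rounding to integers is ignored here.

```python
from math import sqrt


def stratum_cv_squared(sig):
    """cv_h^2 from (5.1) for the sigma values of one stratum."""
    return len(sig) * sum(s * s for s in sig) / sum(sig) ** 2 - 1.0


def neyman_variance(strata, n):
    """Allocation n_h proportional to (N_h * sum sigma^2)^(1/2), then (5.3)."""
    size = [sqrt(len(sig) * sum(s * s for s in sig)) for sig in strata]
    n_h = [n * z / sum(size) for z in size]
    v = sum((len(sig) / nh - 1.0) * sum(s * s for s in sig)
            for sig, nh in zip(strata, n_h))
    return n_h, v


# Hypothetical strata, each listing the sigma_I of its units.
strata = [[1.0, 1.2, 1.1], [2.0, 2.5], [5.0, 6.0]]
n = 3

n_h, v = neyman_variance(strata, n)
eps_sq = max(stratum_cv_squared(sig) for sig in strata)

total_sigma = sum(sum(sig) for sig in strata)
total_sq = sum(s * s for sig in strata for s in sig)
lower = total_sigma ** 2 / n - total_sq                   # (3.5)
upper = (1.0 + eps_sq) * total_sigma ** 2 / n - total_sq  # (5.4)

print(n_h, lower, v, upper)
```

Because the strata in this example are nearly homogeneous in $\sigma_I$, $\varepsilon^2$ is close to zero and the stratified variance lies only slightly above the bound attained by the best ADU design.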

An almost equally efficient design is obtained with the substantially more convenient allocation rule

$$n_h = n \Big( \sum_{I \in S_h} \sigma_I \Big) \Big/ \sum_{I=1}^{N} \sigma_I. \tag{5.5}$$

In this case,

$$v(y^*) = n^{-1} \Big( \sum_{I=1}^{N} \sigma_I \Big) \sum_{h=1}^{H} (1 + cv_h^2) \Big( \sum_{I \in S_h} \sigma_I \Big) - \sum_{I=1}^{N} \sigma_I^2,$$

which is also bounded above by the right hand side of (5.4).

The factor $1 + \varepsilon^2$ in (5.4) limits the loss in efficiency in $y^*$ that comes from using (5.5) rather than (3.4). Since (5.4) depends on the stratified design only through $\varepsilon$, all strongly stratified designs are almost equally efficient given Neyman allocation or (5.5). So the choice of stratification is largely inconsequential, but a convenient criterion is to choose strata to equalize $\sum_{I \in S_h} \sigma_I$, so that (5.5) gives $n_h = n/H$. In the large scale surveys of interest, $H$ can be chosen large enough so that $\varepsilon$ is negligible.

While the best ADU strategy has been justified on asymptotic grounds, it may be that convergence to these asymptotic limits is accelerated by using strongly stratified designs. If this is true, then strongly stratified designs may perform well with moderate sample sizes.
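One way to approximate the equal-$\sum \sigma_I$ criterion is sketched below. The construction, sorting the units by $\sigma_I$ and cutting the cumulative sum into $H$ nearly equal parts, is an assumption made for this example and is not prescribed in the paper; the function names and the population are likewise hypothetical. With very uneven $\sigma_I$ the rule can leave a stratum empty, a case the sketch does not handle. The allocation follows (5.5) and the variance follows (5.3).

```python
def equal_sigma_strata(sigma, n_strata):
    """Sort units by sigma and cut the cumulative sum into nearly equal parts."""
    order = sorted(range(len(sigma)), key=lambda i: sigma[i])
    total = sum(sigma)
    strata = [[] for _ in range(n_strata)]
    running = 0.0
    for i in order:
        # Assign each unit by the midpoint of its share of the cumulative sum.
        h = min(int((running + sigma[i] / 2.0) / total * n_strata), n_strata - 1)
        strata[h].append(i)
        running += sigma[i]
    return strata


def convenient_allocation_variance(sigma, strata, n):
    """Allocation (5.5) followed by the asymptotic variance (5.3)."""
    total = sum(sigma)
    n_h = [n * sum(sigma[i] for i in s) / total for s in strata]
    v = sum((len(s) / nh - 1.0) * sum(sigma[i] ** 2 for i in s)
            for s, nh in zip(strata, n_h))
    return n_h, v


sigma = [float(i) for i in range(1, 21)]   # hypothetical sigma_I = 1, ..., 20
n, n_strata = 8, 4

strata = equal_sigma_strata(sigma, n_strata)
n_h, v = convenient_allocation_variance(sigma, strata, n)

# eps^2 is the largest cv_h^2 over strata, from (5.1).
eps_sq = max(len(s) * sum(sigma[i] ** 2 for i in s) / sum(sigma[i] for i in s) ** 2
             for s in strata) - 1.0
total_sq = sum(x * x for x in sigma)
lower = sum(sigma) ** 2 / n - total_sq                    # (3.5)
upper = (1.0 + eps_sq) * sum(sigma) ** 2 / n - total_sq   # (5.4)

print([len(s) for s in strata], n_h, lower, v, upper)
```

With only four strata the first stratum is far from homogeneous, so $\varepsilon$ is not small in this toy case; the variance nonetheless falls between the bounds (3.5) and (5.4), and increasing the number of strata would tighten the gap.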

REFERENCES

Brewer, K.R.W. (1979), "A Class of Robust Sampling Designs for Large-Scale Surveys," Journal of the American Statistical Association, 74, 911-915.

Cochran, W.G. (1961), "Comparison of Methods for Determining Stratum Boundaries," Bulletin of the International Statistical Institute, 38, 345-358.

Dalenius, T. and J.L. Hodges, Jr. (1959), "Minimum Variance Stratification," Journal of the American Statistical Association, 54, 88-101.

Royall, Richard M. (1976), "The Linear Least-Squares Prediction Approach to Two-Stage Sampling," Journal of the American Statistical Association, 71, 657-664.

Royall, Richard M. and Jay Herson. (1973a), "Robust Estimation in Finite Populations, I," Journal of the American Statistical Association, 68, 880-889.

Royall, Richard M. and Jay Herson. (1973b), "Robust Estimation in Finite Populations, II: Stratification on a Size Variable," Journal of the American Statistical Association, 68, 890-893.

Scott, A.J., K.R.W. Brewer, and E.W.H. Ho. (1978), "Finite Population Sampling and Robust Estimation," Journal of the American Statistical Association, 73, 359-361.