Division of Research Graduate School of Business Administration The University of Michigan ROBUST ESTIMATION IN SURVEY SAMPLING USING MULTIVARIATE AUXILIARY INFORMATION Working Paper No. 309 Roger L. Wright The University of Michigan Carl-Erik Sarndal The University of Montreal and The National Central Bureau of Statistics of Sweden FOR DISCUSSION PURPOSES ONLY None of this material is to be quoted or reproduced without the expressed permission of the Division of Research. August 1982

ABSTRACT This paper considers multivariate regression estimators for finite population totals and other linear functions. Estimators are identified that are asymptotically design unbiased and preserve certain aspects of the model-based prediction approach. The design-based variance of these estimators is approximated using a generalization of the Yates-Grundy expression. The approach is illustrated in an electric power load research application. KEY WORDS: Generalized regression, Prediction approach, Asymptotic design unbiased, Design-based inference, Model-based inference

1. ALTERNATIVE FORMS OF ASYMPTOTICALLY DESIGN UNBIASED ESTIMATORS Suppose we want to estimate the linear function T = 1 aIYI of the character Y for a finite population U = {1,..., I,..., N}, where aI is a known constant and Y is the unknown Y-value attached to the unit labelled -1 I. The population total (aI = 1, all I) or mean (aI = N, all I) are important special cases of T. The YI are observed for I e s, where sCU denotes a sample. Several estimation approaches exist that exploit (multivariate) auxiliary information. These can be divided into two main groups, those that appeal to principles of design inference (DESI) and those that appeal to principles of model inference (MODI). In DESI, it is the randomized selection of the sample s, according to a given sampling design, p(s), that is the source of the inference. Some recent contributions in this trend are Brewer (1979), Cassel, Sarndal and Wretman (1976), Isaki and Fuller (1982), Sarndal (1980a,b), Wright (1982). In DESI, estimators are usually required to be asymptotically design unbiased (ADU; see below) and the quality of the estimate (given the design) is measured by the design variance, V (T). For confidence intervals, an approximately design unbiased variance estimate V( is created; the reapproximately design unbiased variance estimate V (T) is created; the re~~~A A 1/2 sulting interval T + Zl,/2 {V (T)}2 has the property of covering the true point T for roughly 100(1-a)% of all samples s drawn repeatedly under the specified sampling design. By contrast, in MODI the argument is conditional on s and makes reference to an assumed model t. If an estimator is model unbiased, its quality is measured2 is measured by its model variance, E(T - T) = V(T). Having found an

-2 -estimated model variance, one can construct MODI confidence intervals that are interpreted via repeated realizations of the model. Usually 5 is a linear model stating that Y = XB + e with E (e) = 0, V;(e) = a I. This point of view has been developed in a series of papers by Royall and coworkers (e.g., Royall (1970), Royall and Herson (1973), Royall and Cumberland (1981)). We favor the DESI outlook, and our paper concentrates on that approach. It is nevertheless fitting to observe: (a) that the estimators proposed by DESI as well as by MODI advocates can be represented within a single class, called the QR class, (Wright, 1982), and (b) that some DESI writers (e.g., Brewer, 1979) are attracted by the particular form of the usual MODI estimator, yet manage to conform to DESI requirements, including ADU-ness, by an appropriate choice of "free constants." N The QR class of estimators of T = 1N aIYI is given by T = aIYI + sIaII (1.1) where Y = xI' eT =Y -y (1.2) I I, I = YI - YI' (1.2) = (Zsq1I ')Y1 ) sqx, (1.3) and xl,..., xN are known auxiliary p-vectors. The constants rI and qI > 0 are open to choice. The typical estimator T is evidently suggested by the model t such that Eg(YI) = x g. (1.4) In addition, we assume under t that 2 VC(YI) = aI and that the N random variables Y are mutually independent. The aI are assumed known except for a multiplicative constant.

-3 -The QR class contains three important subclasses (Sarndal, 1980b) defined by three simple choices of the rI: Simple projection estimators (SPRO) defined by setting rI = 0 for all I, yielding T = 1 aIYI, (1.5) Linear prediction estimators (LPRE) defined by letting rI = 1 for all I, yielding T=,saIY + - aIYI (1.6) where s = U - s, and -1 Generalized regression estimators (GREG) defined by setting rI = TI yielding A N A T = l a1YI + XsaIeI/ 1. (1.7) Here rI is the inclusion probability P(Ies), assumed to be nonzero. We argue that the statistician (and his client) may feel attracted to a certain subclass on (1) theoretical grounds, (2) cosmetic grounds. By theoretical grounds we mean the basic properties of the estimator. These differ in DEST, with its ADU-ness requirement, and in MODI, where model-based properties are appealed to. By cosmetic grounds we mean that the form of the estimator matters; many may feel, for example, that the LPRE form is both simple and realistic, being composed of a sum of observed Y's plus a sum of predicted Y's for nonsampled units. The simplicity of the SPRO form may

-4 -also appeal to many. Although the GREG form is not complicated, let us for sake of argument consider that the SPRO and LPRE subclasses are "cosmetic." Returning to theoretical considerations, we note that the LPRE subclass includes the best linear unbiased (BLU) prediction estimator based on the MODI approach under i, (1.4). The BLU-LPRE estimator is obtained by -2 letting rI = 1 and qI = a for all I. Although the BLU-LPRE estimator has desireable MODI properties under i, it has been shown to be vulnerable to serious bias when the assumed model is inaccurate, (e.g., Hansen, Madow, and Tepping, 1978). Under the DESI approach, an important basic property is asymptotic design unbiasedness. An estimator T is said to be ADU for T = 1 aIYI if lim [E (T) - T] = 0. (n,N)+ P DESI writers consider ADU-ness desirable for reasons of robustness. Whether the model (1.4) is true or not, an ADU estimator is correct "on the average" in large samples drawn repeatedly from proportionately larger populations. We do not enter into detail here about the limit argument. Somewhat different versions are found in Brewer (1979), Isaki and Fuller (1982) Sarndal (1980a), and Wright (1982), although the differences do not seem to be consequential. The latter author shows that within the QR class, T is ADUJ for T if and only if there exists a p-vector X such that aI(l - rITI) = TIqXI X' for I = 1,..., N. (1.8) Several important conclusions follow from (1.8): (a) The GREG esti-1 mator, being defined by the choice rl = RI, is ADU regardless of the choice of qI. (b) le can have ADU-ness (thus DESI) and cosmetic form (SPRO or LPRE) at the same time by choosing the qI to satisfy (1.8) for a particular X. (c) Let us define u = (ul,..., uN)' with

-5 -= a(rr - ri)q; I = 1,..., N. (1.9) Then T is ADU if and only if u e M(X), where M(X) is the linear manifold spanned by the columns of X = (x,.., x) ' Conclusion (c) implies that we can provide ADU-ness for any choice of q1 by expanding X. If, for a predetermined choice of r1 and qI, u V M(X), we can augment X by the column u. Then surely u e M(XA), where XA is X augmented by the column u. This may in fact lead to an estimator with a smaller variance, since u is an additional explanatory variable. An important additional result for all ADU estimators within the QR class is the following (Sarndal, 1980a; Wright, 1982): If T given by (1.1) is ADU, then its asymptotic expected variance, lim E E (T - T)2, (n,N)-+ is given by v N 2( -1 2 (1.0) aI ( -)c. (1.10) This expression does not depend on the choice of rI and qI, provided these quantities satisfy the ADU requirement (1.8). This result is important for planning a survey. The expression (1.10) is easily seen to be minimized by a design whose inclusion probabilities are given by IT = nlaIlaI / 1N ajlo. (1.11) Wright (1982) has suggested a procedure that approximates the optimum. It involves estimating the aI and stratifying the population into strata of units I with similar values of laI|I, where oI is the estimate of oI.

-6 -2. ESTIMATING THE POPULATION TOTAL In this section we consider the estimation of the population total 1 YI and comment on the ADU proposals already in the literature. GREG form. As already observed, the GREG estimator is ADU, whatever the choice of qI. The term GREG was first used by Cassel, Sarndal and Wretman -2 (1976) who considered the GREG estimator arising when qI = a under the simple ratio model such that Et(YI) = gxI and V(YI) = o = x. -1 -1 2 Sarndal (1980a,b) also studied the GREG form with qI = rI and qI = I oI, for the general linear model (1.4). He notes (1980a) that GREG takes on the cosmetic SPRO form if there exist X such that 2 aI = XI'; I = 1,..., N. This is the case for any homoscedastic model including an intercept, and for 2 any model' (with or without intercept) if aI c xIj for some variable x. in the model. The GREG form was extended for use in situations with nonresponse by Cassel, Sarndal and Wretman (1979). What qI should be chosen for the GREG? One may study the question with reference to second order efficiency (in terms of expected variance), but no definite answer is available. LPRE and SPRO forms. Brewer (1979) argued that it makes good sense to require the LPRE form (1.6) and ADU-ness. Working under the simple ratio model EY(YI) = fXI, he consequently posed the question: Can one obtain an ADU estimator of the form (1.6)? His answer, in the affirmative, was that this requires that -1 -1 qI (h-I -l)exI I = 1,..., N a choice which clearly satisfies the general condition (1.8).

-7 -If we extend Brewer's wish to maintain ADU-ness and the LPRE form to a general linear model E (YI) = xI', we find that the ADU condition (1.8) is satisfied if we fix any constant vector X such that xI X > 0 for all I and take q cc I -l1)/x'X; I = 1,.*,. N. (2.1) Similarly, we see that the SPRO is ADU if we fix any X and take q cc /x1I'XN; I = 1,..., N. (2.2) I -I Since any choice of X in (2.1) or (2.2) will produce ADU-ness, we may legitimately ask: What X should be chosen? This is an open question, just as the choice of qI in the GREG case mentioned above. The augmented X-matrix argument. We observed in the previous section that, having fixed the rI and the qI, we can always achieve ADU-ness simply by augmenting the X-matrix. We draw the consequences of this argument for the SPRO and LPRE forms, where ADU-ness is not "automatic," as it is for GREG. It follows from (1.9) that the SPRO is ADU if we augment the X matrix by a single column u1 = (u 1,.., UN )' where -1 - 1 u1 = I q ' I = 1,..., N (2.3) provided ul is not already in the linear manifold M(X) spanned by the columns of X. By the same argument, the LPRE is ADU if we augment the X matrix by the single column u2 = (u12,..., UN2)' where -1 -1 u12 = (@I -l)q, = 1,.., N provided u2 is not already in M(X). It also follows that we can have an ADU estimator expressible in either SPRO or LPRE form if X is augmented by the two (non-identical) columns u1 and

-8 - u3 = (U13',*. UN3)' where UI3 = 9I ' I = 1,..., N (2.4) provided neither ul nor u3 is already in M(X). Isaki and Fuller (1982) were concerned in particular with the SPRO -2 estimator together with the choice qI c I, where it is understood that the II have already been determined optimally according to (1.11), that is IT = nuI/ZI ~AI In agreement with (2.3) they observed that the SPRO is ADU if the X-matrix is augmented by the column ("I'..' N) ' Moreover, confirming (2.4) they observed that the resulting estimator also has LPRE form if we further add the 2 2 column (nrl,. N )' 3. VARIANCE ESTIMATION In this section, we are concerned with estimating the design variance V (T) of any ADU estimator T in the QR class. Here T is the linear function P, ay. That is, we seek V (T) such that we can construct a confidence interval A 1/2 T + Z1-/2 {Vp(T)} (3.1) that carries the repeated sampling interpretation. We start by considering the GREG subclass. Several important results concerning the GREG form follow from Sarndal (1982). Consider the estimate g given by (1.2) for a certain fixed choice of qI. Let

-9 - B= (N 1 qIIxIxI ) 1 qI xIYI be the expression to which f converges in design probability as (n,N)+o, and define "residuals from the census fit" by EI YI -xI'B. It can then be shown by Taylor expansion techniques that the design variance of T is approximately N F aIc a TT 2 P (T) IJgIJ I 7r I (3.2) where RIJ = P(I e s and J e s), and gJ = ( -IJ - IJ) IJ 1 for HIJ * 0 = 0 otherwise. We recognize (3.2) as being the variance of the Horvitz-Thompson estimator in aII, t aIEs/br, which is design unbiased for the total 1 a sI. The result (3.2) comes as no great surprise, for the estimation error of the GREG form can be expressed as T - T = Lae1/~ - \1 T -T = ZsaIeI/I 1 aIeI' where ek given by (1.2) is the sample proxy for cI. As an estimate of (3.2), the Yates-Grundy expression in aIEI suggests itself. However, since EI is unknown, we replace it by the computable residual eI, thus arriving at v (T) g IeI aJeJ (3.3) IpJ I J I,JEs

-10 -Monte Carlo experiments for a variety of situations have shown that the resulting confidence interval (3.1) gives a coverage rate in repeated samples s that approximates the intended 100(1-a)% rather well, even with rather modest sample sizes. There is a clear tendency, however, for the actual coverage rate to fall on the short side of 100(1-a)%, indicating that the normal score Zl_-/2 produces intervals that are too short on the average. -1 -2 Sarndal (1982) considered the case where qI = IrT aI, but his argument carries through for any qI. The variance (3.2) and its estimate (3.3) will of course depend on the qI actually chosen, and additional work is needed to settle the question of a "best" choice. We now extend these results to any ADU estimator in the QR class. Wright (1982) showed that the general estimator T given by (1.1), for a fixed choice of the rI and the qI, is ADU if and only if it is identical to the GREG estimator for the same qI. In other words, for a fixed choice of the rI, when T given by (1.1) is ADU, it is numerically equal, for every sample s and for any Y -values, to the GREG with the same qI. But we have just given a variance estimation procedure for the GREG with arbitrary qI. It follows that we have a procedure for variance estimation for any T that is ADU. Example. Brewer's (1979) estimator of 1N Yk for the model E (Yk) = gxk is ADU and of the LPRE form. It can be written as T = (1 xI) e + (34) where eI = Yk - k and P I i 1-)Y }/{ -

-11 -As is easily checked, and consistent with the result of Wright, an expression for T that is equivalent for all s and all Y = (Y1,..., YN)' is ^ r* N ^A T = (1 xI)3 + se I/ But this is of the GREG form. The approximate design variance of Brewer's T (3.4) is therefore given by (3.2) with all aI = 1 and EI = YI - BxI where B = { (1 -I)YfI}/{1 (l-I)XI} NB = (IY/ O I The estimated design variance is given by (3.3) with aI = 1. 4. AN ILLUSTRATION The preceding estimation techniques can be illustrated in an application drawn from electric utility load research. Many large power companies estimate the time-of-day usage of electricity within various populations of customers through special metering of samples of customers. These estimates are used in rate analysis and system planning. Brandenburg and Higgins (1974) describe the application and the standard methodology. Wright (1981) has introduced some of the techniques discussed in this paper. In these load research studies, the target variable YI is often a system peak demand, which is the use of electricity by customer I during a prespecified period of high system-wide consumption of power. In our example, Y measures usage during twelve monthly hours of system peak consumption throughout a specific year. A rich data base of auxiliary information is available in the administrative customer records. Oae important auxiliary variable, x 1 is the customer's total usage of electricity during the year. For certain commercial and industrial customers, another variable, xI2 called maximum noncoincident demand (MND) can be extracted from billing information. In our

-12 -example, maximum noncoincident demand is available throughout the target population; it measures each customer's peak half-hour rate of electricity use during the year. Our sample consists of 175 customers selected from a population of 20,600 customers, following the stratified sampling plan shown in Table 1. Table 1 also shows the population totals of x11 and xI2 that are used in the example. Table 1 about here Two separate analyses have been performed, one using xI1 to estimate the population total T = N YI, and the other using both x1l and xI2 to estimate T. In each case, we have calculated various estimates that satisfy the ADU condition (1.8) either through suitable choice of qI, as in (2.1) and (2.2), or by augmenting the X-matrix as in (2.3) and (2.4). In several of these procedures we have used an estimate of aI based on a multiplicative heteroscedasticity model of Harvey (1976). Under Model One, we assume E4(YI) = OXI' (4.1) I =o0xI1 Given our sample and the assumption of normality, we used seven iterations of the Harvey algorithm to calculate the maximum likelihood estimates -3 Y = 2.042 * 10 xI1, (4.2) A^ -3.8703 oI = 3.378 * 10 x1 Table 2 shows various estimates of T using (1.5) - (1.7) with several choices of qI. Line (1) of Table 2 follows (2.2) while line (2) uses (2.1). The choice of qI in line (3) yields the best linear unbiased LPRE estimator

-13 - conditional on the model (4.1) and aI from (4.2). The choice of qI in line (4) is suggested in Sarndal (1980a). Lines (5) - (7) are obtained following various augmentation techniques. In particular, we augment the X-matrix by u1, uI2, or both uI1 and uI3 using -2 (2.3) - (2.4) with qI = a as in the BLU approach. For illustrative pur9 poses the population totals of u 1, u I2 and uI3 are assumed to be 142.6 * 10, 9 9 141.9 * 10, and 0.7 * 10 respectively. In practice it is not always easy to to obtain the population totals of these artificial variables. Table 2 about here Note that the GREG estimates in Table 2 are all quite similar. Of course they are identical to the corresponding SPRO or LPRE estimates whenever the latter are ADU. However the SPRO and LPRE estimates in line (3) are somewhat lower than the ADU estimates. The last column of Table 2 shows the standard errors of the GREG estimates, calculated using (3.3). These standard errors are also applicable to the SPRO and LPRE estimates when the latter are ADU. As expected there is little variation among these standard errors. Under Model Two, we assume E (YI) = + x1 + 2xI2' (4.3) Y1 Y2 aI = o0 XI I2 ' and we obtain the maximum likelihood estimates

-14 - YI = -1315 + 7.319 * 10 xI1 + 5.182 xI2, A ^ l S l5 -1. 369 2. 237 oI = 1.825 * 10 xl XI2 (4.4) Table 3 shows several estimates of T based on this model. In rows (1) and (2), qI is chosen following (2.2) and (2.1) respectively, with X = (1 0 0)'. The remaining rows are analagous to the corresponding rows of Table 2. To calculate the estimates in rows (5) - (7), we have assumed that 9 9 the population totals of uIL, ui2 and uI3 are now 70.1 * 109, 69.6 * 109 and 0.5 * 10 respectively. Table 3 about here The GREG estimates in Table 3 are very similar, displaying substantially smaller differences than in Table 2. Again, whenever the SPRO and LPRE are ADU, they are identical to the corresponding GREG estimate. Moreover the non-ADU estimates are very similar to the ADU estimates, perhaps reflecting the validity of Model Two. Finally we note that the standard errors in Table 3 are substantially smaller than those shown in Table 2, indicating that the use of both auxiliary variables is quite advantageous.

-15 - REFERENCES Brandenburg, L. and C. E. Higgins, Jr., (1974), "Stratified Random Sampling Methods for Class Load Surveys for Electric Utilities," Applied Statistics for Load Research, Vol. 3, New York: Association of Edison Illuminating Companies. Brewer, K. R. W., (1979), "A Class of Robust Sampling Designs for Large-Scale Surveys," Journal of the American Statistical Association, 74, 911-915. Cassel, C. M., C. E. Sarndal and J. H. Wretman, (1976), "Some Results on Generalized Difference Estimation and Generalized Regression Estimation for Finite Populations," Biometrika, 63, 615-620.,,, (1977), Foundation of Inference in Survey Sampling, New York: John Wiley & Sons. _,, __, (1979), "Prediction Theory for Finite Populations when Model-based and Design-based Principles Are Combined," Scandinavian Journal of Statistics, 6, 97-106. Hansen, M. H., G. W. Madow, and B. J. Tepping, (1978), "On Inference and Estimation from Sample Surveys," Proceedings of the Survey Research Section, American Statistical Association, 82-107. Harvey, A. C., (1976), "Estimating Regression Models with Multiplicative Heteroscedasticity," Econometrica, 44, 461-465. Isaki, C. T. and W. A. Fuller, (1982), "Survey Design under a Regression Superpopulation Model," Journal of the American Statistical Association, 7 7, 89-96. Royall, R. M., (1970), "On Finite Population Sampling Theory under Certain Linear Regression Models," Biometrika, 57, 377-387. and J. Herson, (1973), "Robust Estimation in Finite Populations," Journal of the American Statistical Association, 68, 880-889. and W. G. Cumberland, (1981), "An Empirical Study of Prediction Theory in Finite Population Sampling: Simple Random Sampling and the Ratio Estimator," Journal of the American Statistical Association, 76, 66-77. Sarndal, C. E., (1980a), "On Tr-Inverse Weighting versus Best Linear Unbiased Weighting in Probability Sampling," Biometrika, 67, 639-650. _, (1980b), "A Two-way Classification of Regression Estimation Strategies in Probability Sampling," The Canadian Journal of Statistics, 8, 165-177., (1982), "Implications of Survey Design for Generalized Regression Estimation of Linear Functions," to appear in Journal of Statistical Planning and Inference.

-16 -Wright, R. L., (1981), "Electric Utility Load Research Using Model-Based Statistical Sampling," Proceedings of the 1981 DOE Statistical Symposium, New York: Brookhaven National laboratory., (1982), "Finite Population Sampling with Multivariate Auxiliary Information," School of Business Administration, The University of Michigan.

-17 -Table 1. Sampling Plan Number Total Total Sample of Usage MND Stratum Size Customers 106 EWH 103 W 1 2 3 Total 32 51 92 175 19, 800 682 118 20,600 2, 370 532 197.3, 099 694 134 49 877

-18 - Table 2. Various Estimates of T Under Model One. Added SPRO LPRE GREG SE Approach q Variables 103 KWH 103 KH 103 KH 103KW-H (1) (2) (3) (4) (5) (6) (7) Ra tio Adjusted Ratio BLU -1 wr *BLU AUG1 AUG2 AUG3 -1 -1 7I xI1 I II -1 -1 Af -2)x I ~I ^ -2 aI I -1-2 aI I A -2 ^ -2 ~I "12 ^ -2 I u Ir u13 6659.7 6697.4 6326.7 7049.3 6741.9 6747.9 6591.1 6631.9 6667.2 6320.3 6996. 5 6737.3 6743.3 6591.1 6659.7 6667.2 6592.7 6737.9 6741.9 6743.3 6591.1 198.7 199.9 191.4 214.2 224.8 225.4 202.2

-19 - Table 3. Various Estimates of T Under Model Two. Added SPRO LPRE GREG SE Approach q Variables 10 KIH 100 KKH 103 WH 10 3KH (1) (2) (3) (4) (5) (6) (7) -1 Adjusted IT BLU -1 ITf *BLU AUG1 AUG2 AUG3 -1 IT' -1 I1 -1 A -2 ^1 -2 ~I -1A -2 I ~I ^ -2 ~I ^ -2 ~I ^ -2 ~I 6529.6 6529.6 6541.5 6445. 5 6532.8 6535.3 6517.1 6530.2 6529.0 6539.7 6455.5 6530. 9 6533.5 6517.1 6529.6 6529.0 6535.2 6491.6 6532.8 6533.5 6517.1 153.0 152.5 152.8 152.6 152.7 152.7 152.7 u1, uI2 uI12 uI3