Division of Research
Graduate School of Business Administration
The University of Michigan
January, 1980

SAMPLE DESIGN FOR STRATIFIED RATIO ESTIMATION

Working Paper No. 199

Roger L. Wright
The University of Michigan

FOR DISCUSSION PURPOSES ONLY
None of this material is to be quoted or reproduced without the express permission of the Division of Research.


ABSTRACT

This paper develops simple, easily implemented rules for designing stratified sampling plans for combined ratio estimation. The analysis is based on a superpopulation model and on approximations that hold when strata are constructed to tightly control the variation of the auxiliary variable. The proposed techniques are illustrated using a utility load research example, and are related to the somewhat different designs obtained by the cum $\sqrt{f}$ rule developed by Dalenius and Hodges.

CONTENTS

1. Introduction
2. The Superpopulation Model
3. Stratified Sampling
   3.1 Strong Stratification According to Size
   3.2 Overall Balance of a Stratified Sample
   3.3 Random Sampling Within Strata
4. Sample Design for Ratio Estimation with Strong Stratification
   4.1 The Expected Mean Square Error
   4.2 Approximately Optimal Allocation
   4.3 Design Rules
   4.4 Gain from Stratification
   4.5 Estimation of Model Parameters
5. Sampling in Electric Utility Load Research
6. Comparison with Dalenius-Hodges Stratification

1. Introduction

Ratio estimators are widely used in sampling studies of finite populations. In many of these applications, a stratified random sampling design can be developed using the known population distribution of the positive auxiliary variable $x$ together with information (or assumptions) about the relationship between $x$ and the target variable $y$. Using a superpopulation model of this relationship together with suitable approximations, easily applied rules can be formulated for planning all aspects of the sampling design: total sample size, strata cutpoints, and strata allocation.

The emphasis in this paper is on conventional sample designs utilizing simple random sampling within prescribed strata. The sample design is chosen to optimize the expected statistical precision of the conventional combined ratio estimator. The principal innovation offered here is a simple procedure for choosing the strata cutpoints, and for planning the total sample size and the sample allocation among the strata.

The proposed method is analogous to the Dalenius-Hodges procedure [5, pp. 127-135; 6; 7; 16] for choosing strata cutpoints when the mean or finite population total of $x$ is to be estimated. However, the interest here is in stratification of the distribution of $x$ for ratio estimation of the finite population total of $y$. Any comparison of sampling designs for ratio estimation

should focus on the joint distribution of $x$ and $y$. The practice of using the Dalenius-Hodges "cum $\sqrt{f}$" rule for stratification relies solely on the marginal distribution of $x$ and yields designs that can be inefficient for ratio estimation.

When a suitable previous sample of $x$ and $y$ is available, the conventional sample-based formula for the precision of the combined ratio estimator [5, pp. 165-167] can be used to develop a design. However, the problem of choosing the strata cutpoints to minimize the sample-based expected mean square error bears disquieting similarities to the computationally vexing traveling-salesman and bin-packing problems. As strata cutpoints are tentatively adjusted, the data points in the available sample migrate among strata and often cause abrupt changes in the within-strata sample variances and estimated precision. The resulting designs often seem to be overly tuned to the realized values of $x$ and $y$ in the available sample and not sufficiently based on the underlying joint population distribution.

To avoid this problem, this paper uses a superpopulation model to represent the relationship between $x$ and $y$, together with the known finite population distribution of $x$. Section 2 develops this model. Section 3 considers the advantages of stratified random sampling for ratio estimation; the discussion borrows heavily from Royall and Herson's concepts of balanced sampling and robustness [12, 13, 14, 15]. The present paper suggests that stratified random

sampling with the combined ratio estimator gives much of the robustness of a balanced unstratified sample but with often substantial gains in efficiency. The concept of strong stratification is defined in Subsection 3.1.

Section 4 uses the superpopulation model and certain approximations derived from strong stratification to develop a design rule for stratification, sample allocation, and total sample size. Section 4.1 develops the model-based expected mean square error of the ratio estimator of the finite population total of $y$. Section 4.2 examines the question of efficient allocation. The results developed in these two sections are used in Section 4.3 to formulate a specific design rule. A very simple expression for the gain in efficiency from stratification is developed in Section 4.4. Finally, Section 4.5 completes the model-based analysis by proposing simple estimators for the model parameters.

Section 5 presents an application and numerical example that arises in electric utility management [1, 2, 9]. Section 6 completes the paper with a discussion of the relationship between the proposed design rule and the cum $\sqrt{f}$ rule developed by Dalenius and Hodges [5, 6, 7, 16].

2. The Superpopulation Model

How should the joint distribution of $x$ and $y$ be specified at the planning stage? In the applications of interest, the finite population consists of $N$ units labelled

$1, 2, \ldots, N$ and $x_k$ is known for each unit $k$. Much less is known about the conditional distribution of $y$ given $x$, so it is appropriate and convenient to utilize a superpopulation heteroscedastic regression model. In this formulation $y_k$ is regarded as the realized value of a random variable denoted $Y_k$, which is determined from $x_k$ and a random disturbance $\epsilon_k$ following the regression equation

$$Y_k = h(x_k) + \epsilon_k [v(x_k)]^{1/2}, \quad k = 1, 2, \ldots, N. \tag{2.1}$$

Here the expected value and variance of $Y_k$ depend on $x_k$ and are denoted as $h(x_k)$ and $\sigma^2 v(x_k)$ respectively. The disturbances $\epsilon_1, \epsilon_2, \ldots, \epsilon_N$ are independent random variables with mean zero and variance $\sigma^2$. This model, notation, and many of the analytical techniques follow Royall and Herson [13]. A more comprehensive presentation of this approach is given in [4].

At the design stage it is helpful to adopt a specific form of (2.1) that combines reasonable accuracy, parsimony and analytical convenience. In this paper we will consider the superpopulation model $\xi$ given by (2.1) with the very simple specifications $h(x_k) = \beta x_k$ and $v(x_k) = x_k^{\gamma}$. In many of the applications of the combined ratio estimator, this model provides a sufficiently realistic basis for sample design. The three parameters $\beta$, $\gamma$ and $\sigma$ can be assessed fairly easily even when relevant data are severely limited. Using this model, sample designs can be developed following simple

and sensible rules which extend the conventional sample-based formulas. However, the robustness of these rules to model misspecification remains to be investigated.

The superpopulation model $\xi$ has been widely used to explore the properties of ratio estimation. If the heteroscedasticity parameter $\gamma$ is one, the Gauss-Markov theorem implies that the simple ratio estimator is the best linear unbiased estimator of the superpopulation parameter $\beta$. Brewer [3] and Royall [11] have extended this result to prediction of the finite population ratio $\sum_1^N Y_k / \sum_1^N x_k$ and the population total $\sum_1^N Y_k$. If $\gamma$ is different from one, the simple ratio estimator is still unbiased but not the most efficient estimator. However, in the absence of complete confidence in the accuracy of the superpopulation model $\xi$ and perfect knowledge of $\gamma$, the ratio estimator is often chosen in practice because of its robustness [13].

3. Stratified Sampling

Stratified sampling usually provides two major advantages over simple random selection: control over the sample distribution of $x$, and more efficient allocation of the sample among the population units to reflect differences in the conditional variance of $Y$. Royall and Herson [13] have clarified the advantages of a sample with a balanced $x$-distribution. They have shown that balanced samples provide protection against bias arising from misspecification of $h(x)$ in the

superpopulation model (2.1). Unfortunately, their unstratified balanced samples often suffer a substantial loss of efficiency compared to samples that are optimal for a particular specification of $h(x)$ and $v(x)$ [13, p. 885].

Stratified sampling regains some of the lost efficiency with little sacrifice of robustness. The balance is achieved by using an estimator that weights observations according to the known population distribution of $x$. Substantial gains in efficiency can be achieved by allocating the sample observations appropriately. Although the stratified sample estimator may still not be as efficient as the optimal-sample ratio estimator for a particular superpopulation, it generally seems to be much more robust.

Ratio estimators can be adapted to stratified sampling either by using a separate ratio estimator within each stratum or by using a single combined ratio determined from the stratified-sample estimators of the means of $y$ and $x$ [5, pp. 164-169]. Under the model $\xi$ of Section 2 the combined ratio estimator would usually be recommended for small samples, while for larger samples the choice between the combined and separate ratio estimators would depend upon the credibility of the specification of the regression function $h(x)$. If the model $\xi$ is plausible at the planning stage, it is convenient to consider the combined estimator for planning even though the separate estimator might be used in the analysis for added robustness. In [14] Royall and Herson

have provided certain results useful for designing stratified samples for the separate ratio estimator under a broad class of superpopulation models. The present paper will propose simpler and more prescriptive sampling rules for the combined estimator under the model $\xi$.

3.1 Strong Stratification According to Size

The design rule proposed in this paper is derived from a concept called strong stratification. In order to develop this idea in adequate detail, some notation is required. Suppose that the population of $N$ units is divided into $H$ strata with $N_h$ units in stratum $h$. The population is stratified according to $x$ if stratum one contains the $N_1$ units that are smallest as measured by $x$, and in general, if stratum $h$ contains the $N_h$ smallest units excluded from strata $1, 2, \ldots, h-1$. Each such stratification $S$ is essentially determined by the number of strata $H$ and the strata sizes $N_1, N_2, \ldots, N_H$.

Each stratification $S$ determines within-strata population moments of $x$. The population mean of $x^c$ ($c = 1, 2, \gamma/2, \gamma$, etc.) within stratum $h$ will be denoted as $\bar{x}_h^{(c)}$, i.e. $\bar{x}_h^{(c)} = \sum x_k^c / N_h$ where the summation is over the $N_h$ units in stratum $h$. The absence of the subscript $h$ will denote the overall population mean, so that

$$\bar{x}^{(c)} = \sum_1^N x_k^c / N = \sum_1^H N_h \bar{x}_h^{(c)} / N.$$
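The moment notation above can be made concrete with a short numerical sketch. The Python code below is an illustration only and is not part of the original paper; the function names `stratum_moments` and `overall_moment` and the toy data in the usage note are assumptions introduced here. It stratifies a population by size and computes the within-stratum means of $x^c$.

```python
import numpy as np

def stratum_moments(x, sizes, c):
    """Within-stratum population means of x**c for a stratification by size.

    x     : positive auxiliary values for all N population units
    sizes : stratum sizes N_1, ..., N_H (hypothetical input; must sum to N)
    c     : moment order, e.g. 1, 2, gamma/2 or gamma
    """
    x = np.sort(np.asarray(x, dtype=float))      # stratify according to x
    sizes = np.asarray(sizes, dtype=int)
    if sizes.sum() != x.size:
        raise ValueError("stratum sizes must sum to N")
    bounds = np.cumsum(sizes)[:-1]               # split points between strata
    return np.array([np.mean(part ** c) for part in np.split(x, bounds)])

def overall_moment(x, c):
    """Overall population mean of x**c."""
    return float(np.mean(np.asarray(x, dtype=float) ** c))
```

For example, with `x = [4, 1, 3, 2, 20, 10]` and `sizes = [4, 2]`, the first moments within strata are 2.5 and 15, and their size-weighted average reproduces the overall mean, as the identity above requires.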

The absence of the superscript $(c)$ will indicate the first moment ($c = 1$), either overall or within strata.

A very simple rule for design can be formulated by concentrating on stratifications that tightly control the variance of $x$ within each stratum. A stratification $S$ will be called strong for a given number $c$ if the variance of $x^c$ is small within each stratum, i.e. if $\bar{x}_h^{(2c)} \approx (\bar{x}_h^{(c)})^2$ for $1 \le h \le H$.

3.2 Overall Balance of a Stratified Sample

Royall and Herson's concept of a balanced sample can be adapted to combined ratio estimation. Consider a stratification $S$ and a particular sample $s$ comprised of $n$ units from the population. Let $n_h$ be the number of sample units from stratum $h$ and let $\bar{x}_{sh}^{(c)}$ denote the sample mean of $x^c$ within stratum $h$, so that $\bar{x}_{sh}^{(c)} = \sum x_k^c / n_h$. The overall stratified sample mean of $x^c$ is $\sum_1^H N_h \bar{x}_{sh}^{(c)} / N$, which will be denoted $\bar{x}_s^{(c)}$. The stratified sample $s$ has overall balance of degree $J$ if the overall stratified sample moment $\bar{x}_s^{(c)}$ is equal to the population moment $\bar{x}^{(c)}$ for all $c = 1, 2, \ldots, J$. For overall balance it is sufficient but not necessary that the sample is balanced within each stratum, i.e. that $s$ is a balanced stratified sample of degree $J$ as defined in [14, p. 890].

Royall and Herson [13, p. 883] showed that for a

simple balanced sample of degree $J$ the simple ratio estimator is unbiased under any superpopulation model (2.1) for which $h(x)$ is a polynomial of degree $J$. A similar approach shows that the combined ratio estimator is unbiased under the same family of superpopulation models if the stratified sample has overall balance of degree $J$.

3.3 Random Sampling Within Strata

If a stratification $S$ is strong for given $c$ (Section 3.1), and if a sample is obtained by randomly selecting $n_h$ units from each stratum, then conventional sampling theory shows that $\bar{x}_s^{(c)}$ has high probability of being close to $\bar{x}^{(c)}$. This implies that a random, strongly stratified sampling plan is likely to lead to an approximately balanced sample. Under these conditions, the combined ratio estimator is likely to be quite robust. In particular, if the stratification is strong at least for $c = 1$, then approximate first degree overall balance can be expected to provide protection against bias from a nonzero intercept $\beta_0$ in $h(x) = \beta_0 + \beta_1 x$.

There are some strong arguments against random sampling when a superpopulation model can be assumed. For example, under $\xi$ with $\gamma$ equal to one, the model-based mean square error of the simple ratio estimator can be minimized by observing $Y_k$ for the $n$ largest units in the population [13, p. 883]. However, this design could give a badly biased estimator if $\xi$ is inaccurate. If complete confidence in $\xi$ is

lacking, an approximately balanced sample may be preferred. One convenient way of providing approximate balance is to use a strongly stratified design.

With strong stratification, the arguments against randomization are almost moot. As long as $v(x)$ is continuous, there will be little heteroscedasticity in the superpopulation model within strata and little preference among units. At a small cost, randomization will provide protection against model misspecification. For example, randomization reduces the concern about systematic selection of outliers. It will be seen that randomization also contributes to the simplicity of the design rules.

It is important to note that the overall balance of a stratified sample does not require proportional sample allocation among strata, since the population sizes are used in the overall sample moments. This means that robustness can be retained even while the sample allocation is chosen to maximize the expected precision of the combined ratio estimator.

4. Sample Design for Ratio Estimation With Strong Stratification

A stratified random sampling design $p$ has three components: a stratification $S$ determined by $N_1, N_2, \ldots, N_H$, the overall sample size $n$, and the sample allocation $n_1, n_2, \ldots, n_H$. The sampling plan $p$ determines a sample

space of $\binom{N_1}{n_1}\binom{N_2}{n_2}\cdots\binom{N_H}{n_H}$ equally likely stratified samples. For each such sample $s$, $x_1, x_2, \ldots, x_n$ are fixed but $Y_1, Y_2, \ldots, Y_n$ are regarded as random variables specified by the model $\xi$.

The criterion for selecting $p$ is the model-based expected mean square error of the combined ratio estimator

$$\hat{T} = (\bar{Y}_s / \bar{x}_s) \sum_1^N x_k.$$

Here $\hat{T}$ is regarded as a predictor of the finite population total $T = \sum_1^N Y_k$, which is also a random variable under $\xi$. Initially the conditional mean square error $E_s(\hat{T} - T)^2$ is evaluated using $\xi$ with a fixed sample $s$. Then the unconditional mean square error $E(\hat{T} - T)^2$ is obtained by averaging $E_s(\hat{T} - T)^2$ over the sample space determined by the design $p$.

Very simple rules for selecting a good design can be followed if two approximations are reasonable:

Condition A: The desired design is strongly stratified for $c = \gamma/2$ so that $\bar{x}_h^{(\gamma)} \approx (\bar{x}_h^{(\gamma/2)})^2$ for all strata $h$. Here $\gamma$ is the heteroscedasticity parameter of the superpopulation model $\xi$.

Condition B: The desired design is strongly enough stratified for $c = 1$ to be confident that a randomly selected sample $s$ will have approximate overall balance of degree one, i.e. $\bar{x}_s \approx \bar{x}$.

Condition B is used as in conventional analysis to neglect the sampling variation of the denominator of $\hat{T}$.
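Whether Condition A is reasonable for a candidate stratification can be checked numerically from the known population values of $x$. The following Python sketch is an illustration only; the function name `condition_a_gap` is an assumption, and the paper gives no threshold for how small the gap should be. It reports, for each stratum, the relative gap $\bar{x}_h^{(\gamma)} / (\bar{x}_h^{(\gamma/2)})^2 - 1$, which is the squared coefficient of variation of $x^{\gamma/2}$ within the stratum.

```python
import numpy as np

def condition_a_gap(x, sizes, gamma):
    """Per-stratum value of xbar_h^(gamma) / (xbar_h^(gamma/2))**2 - 1.

    Condition A asks for this quantity to be close to zero in every stratum.
    x is the population of auxiliary values, sizes the stratum sizes
    N_1, ..., N_H for a stratification according to x (hypothetical inputs).
    """
    x = np.sort(np.asarray(x, dtype=float))
    bounds = np.cumsum(np.asarray(sizes, dtype=int))[:-1]
    gaps = []
    for part in np.split(x, bounds):
        m_gamma = np.mean(part ** gamma)          # xbar_h^(gamma)
        m_half = np.mean(part ** (gamma / 2.0))   # xbar_h^(gamma/2)
        gaps.append(m_gamma / m_half ** 2 - 1.0)
    return np.array(gaps)
```

With `x = [1, 1, 4, 4]` and `gamma = 2`, a single stratum gives a gap of 0.36, while the two strata `sizes = [2, 2]` give gaps of exactly zero, so the finer stratification is strong for $c = \gamma/2$ in this toy case.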

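4.1 The Expected Mean Square Error

Using the model $\xi$, the combined ratio estimator $\hat{T}$ may be written as

$$\hat{T} = \Big(\beta + \sum_1^H N_h \bar{u}_{sh} / N\bar{x}_s\Big) \sum_1^N x_k$$

where $u_k = Y_k - \beta x_k = \epsilon_k x_k^{\gamma/2}$, and $\bar{u}_{sh}$ and $\bar{u}_h$ denote the sample and population means of $u_k$ within stratum $h$. Similarly, the finite population total $T$ is

$$T = \Big(\beta + \sum_1^H N_h \bar{u}_h / N\bar{x}\Big) \sum_1^N x_k.$$

So if $\xi$ is accurate, $\hat{T}$ is an unbiased predictor of $T$ for any sample $s$ since $E_s(\hat{T}) = E_s(T) = \beta \sum_1^N x_k$. Moreover, the expected mean square error given $s$, $E_s(\hat{T} - T)^2$, is

$$\bar{x}^2 \sum_1^H N_h^2 E_s(\bar{u}_{sh}/\bar{x}_s - \bar{u}_h/\bar{x})^2$$

which is equal to

$$\sigma^2 \bar{x}^2 \sum_1^H N_h^2 \big[\bar{x}_{sh}^{(\gamma)}/n_h\bar{x}_s^2 + \bar{x}_h^{(\gamma)}/N_h\bar{x}^2 - 2\bar{x}_{sh}^{(\gamma)}/N_h\bar{x}\bar{x}_s\big]. \tag{4.1}$$

Condition B implies that $\bar{x}_s \approx \bar{x}$, so that $E_s(\hat{T} - T)^2$ is approximately

$$\sigma^2 \sum_1^H N_h^2 \big[\bar{x}_{sh}^{(\gamma)}/n_h + \bar{x}_h^{(\gamma)}/N_h - 2\bar{x}_{sh}^{(\gamma)}/N_h\big].$$

When this expression is averaged over the sample space of a specific design $p$, the unconditional expected mean square error can be approximated as

$$E(\hat{T} - T)^2 \approx \sigma^2 \sum_1^H N_h^2 \bar{x}_h^{(\gamma)} (n_h^{-1} - N_h^{-1}). \tag{4.2}$$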
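As a hedged illustration of how (4.2) might be evaluated for a candidate design, the Python sketch below computes the approximate expected mean square error from the population values of $x$, the stratum sizes, the stratum sample sizes, and assumed values of $\gamma$ and $\sigma^2$. The function name `approx_mse` and the toy inputs in the usage note are assumptions introduced here, not part of the paper.

```python
import numpy as np

def approx_mse(x, sizes, n_h, gamma, sigma2):
    """Approximate expected MSE of the combined ratio estimator, as in (4.2).

    x      : auxiliary values for all N population units
    sizes  : stratum sizes N_1, ..., N_H for a stratification according to x
    n_h    : sample sizes n_1, ..., n_H
    gamma  : assumed heteroscedasticity parameter
    sigma2 : assumed disturbance variance sigma^2
    """
    x = np.sort(np.asarray(x, dtype=float))
    sizes = np.asarray(sizes, dtype=int)
    n_h = np.asarray(n_h, dtype=float)
    bounds = np.cumsum(sizes)[:-1]
    # within-stratum population means of x**gamma
    xg = np.array([np.mean(part ** gamma) for part in np.split(x, bounds)])
    return float(sigma2 * np.sum(sizes ** 2 * xg * (1.0 / n_h - 1.0 / sizes)))
```

For instance, `approx_mse([1, 1, 4, 4], [2, 2], [1, 1], 1.0, 1.0)` evaluates to 10.0, and the value falls to zero when every stratum is completely enumerated (`n_h` equal to `sizes`), reflecting the finite population correction in (4.2).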

4.2 Approximately Optimal Allocation

For a given stratification $S$ and a given total sample size $n$, the expression (4.2) for the approximate expected mean square error of $\hat{T}$ can be minimized by choosing the sample allocation

$$n_h/n = N_h (\bar{x}_h^{(\gamma)})^{1/2} \Big/ \sum_1^H N_h (\bar{x}_h^{(\gamma)})^{1/2}. \tag{4.3}$$

This follows from the argument commonly used to demonstrate the optimality of Neyman allocation [5, pp. 96-98]. With the allocation (4.3), the approximate mean square error (4.2) becomes

$$\sigma^2 \Big[\sum_1^H N_h (\bar{x}_h^{(\gamma)})^{1/2}\Big]^2 / n - \sigma^2 \sum_1^H N_h \bar{x}_h^{(\gamma)}.$$

The preceding expression can be considerably simplified under Condition A. Note that the term $\sum_1^H N_h \bar{x}_h^{(\gamma)}$ is equal to $N\bar{x}^{(\gamma)}$, where $\bar{x}^{(\gamma)}$ is the overall population mean of $x^{\gamma}$. Moreover, Condition A implies that $\sum_1^H N_h (\bar{x}_h^{(\gamma)})^{1/2}$ can be approximated as $\sum_1^H N_h \bar{x}_h^{(\gamma/2)}$, which is equal to $N\bar{x}^{(\gamma/2)}$. So under the model $\xi$, under the Conditions A and B of strong stratification, and under the approximately optimal allocation (4.3), the expected mean square error of the combined ratio estimator $\hat{T}$ is approximately

$$E(\hat{T} - T)^2 \approx N^2 \sigma^2 \big[(\bar{x}^{(\gamma/2)})^2/n - \bar{x}^{(\gamma)}/N\big]. \tag{4.4}$$

Condition A also yields two very helpful approximate reformulations of the allocation rule (4.3), namely

$$n_h/n \approx N_h \bar{x}_h^{(\gamma/2)} / N\bar{x}^{(\gamma/2)} \tag{4.5a}$$

$$= \sum_{k \in h} x_k^{\gamma/2} \Big/ \sum_1^N x_k^{\gamma/2}, \tag{4.5b}$$

where the sum in the numerator of (4.5b) is over the units in stratum $h$. The first of these relationships implies that the within-strata sampling fraction $n_h/N_h$ is proportional to the within-stratum mean $\bar{x}_h^{(\gamma/2)}$. In the case that the superpopulation model $\xi$ is homoscedastic, $\gamma$ is zero and our allocation gives a constant sampling fraction. If $\gamma > 0$, then the sampling fraction increases from stratum to stratum in proportion to the mean $\bar{x}_h^{(\gamma/2)}$. The larger the heteroscedasticity parameter $\gamma$, the more heavily oversampled are the units with large $x$. The case that $\gamma = 2$ is closely related to sampling with probability proportional to size.

The second formulation given above, (4.5b), implies that the sample allocation $n_h/n$ is proportional to the within-strata population totals of $x^{\gamma/2}$. In the next section, this approximate characterization of the allocation (4.3) will be used for a convenient stratification rule.

4.3 Design Rules

In the expressions to be developed in this section, the ratio $(\bar{x}^{(\gamma/2)})^2 / \bar{x}^{(\gamma)}$ occurs repeatedly and will be referred to as the design effect and denoted $de(\gamma)$. The term design effect will be justified in the next section.
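The allocation rules (4.3) and (4.5) and the design effect can be compared numerically. The Python sketch below is a minimal illustration under assumed inputs; the function names `allocation_shares` and `design_effect` are hypothetical. It returns the allocation shares $n_h/n$ under (4.3) and under the approximation (4.5b), together with $de(\gamma)$ computed directly from its definition.

```python
import numpy as np

def allocation_shares(x, sizes, gamma):
    """Allocation shares n_h/n under (4.3) and under the approximation (4.5b).

    x is the population of auxiliary values, sizes the stratum sizes for a
    stratification according to x, gamma the assumed heteroscedasticity
    parameter.
    """
    x = np.sort(np.asarray(x, dtype=float))
    sizes = np.asarray(sizes, dtype=int)
    parts = np.split(x, np.cumsum(sizes)[:-1])
    # (4.3): N_h * sqrt(xbar_h^(gamma)), normalized over strata
    w_exact = sizes * np.sqrt([np.mean(p ** gamma) for p in parts])
    # (4.5b): within-stratum totals of x**(gamma/2), normalized
    w_approx = np.array([np.sum(p ** (gamma / 2.0)) for p in parts])
    return w_exact / w_exact.sum(), w_approx / w_approx.sum()

def design_effect(x, gamma):
    """de(gamma) = (xbar^(gamma/2))**2 / xbar^(gamma) for the whole population."""
    x = np.asarray(x, dtype=float)
    return float(np.mean(x ** (gamma / 2.0)) ** 2 / np.mean(x ** gamma))
```

With `gamma = 0` both rules reduce to proportional allocation and the design effect is one. With `gamma = 2`, the shares from (4.5b) are simply the strata shares of the population total of $x$, and the two rules agree more closely as the strata become narrower.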

The design effect $de(\gamma)$ can be conveniently characterized in terms of the coefficient of variation of $x^{\gamma/2}$, defined as

$$cv(x^{\gamma/2}) = \big[\bar{x}^{(\gamma)} - (\bar{x}^{(\gamma/2)})^2\big]^{1/2} / \bar{x}^{(\gamma/2)}.$$

In fact

$$de(\gamma) = [cv^2(x^{\gamma/2}) + 1]^{-1}. \tag{4.6}$$

Using the design effect, (4.4) may easily be restated. Under the superpopulation model $\xi$, and using a sample design satisfying the Conditions A and B of strong stratification together with the approximately optimal allocation (4.3), the expected mean square error of $\hat{T}$ is approximately

$$E(\hat{T} - T)^2 \approx N^2 \sigma^2 \bar{x}^{(\gamma)} [de(\gamma)/n - 1/N]. \tag{4.7}$$

This result gives the following rule for choosing the sample size $n$ with a strongly stratified design and approximately optimal allocation. For expected mean square error $E(\hat{T} - T)^2$ approximately equal to $N^2 s^2$, choose the sample size $n$ equal to $n_{cr}$ where

$$n_{cr} = de(\gamma)\, n_0 / [1 + n_0/N], \tag{4.8a}$$

with

$$n_0 = \sigma^2 \bar{x}^{(\gamma)} / s^2. \tag{4.8b}$$

Here $\sigma^2$ and $\gamma$ are the parameters of the superpopulation

model $\xi$ defined in Section 2. Estimation of these parameters is considered in Section 4.5.

One surprising implication of the preceding result is that the expected mean square error is insensitive to the choice of stratification as long as approximately optimal allocation, (4.3) or (4.5), is used. However, in most applications it is desirable to choose a design giving equal subsamples from each stratum. If (4.5b) is used for allocation, the strata sample sizes will be equal if the population totals of $x^{\gamma/2}$ are equal within all strata. This suggests the following design rule for combined ratio estimation, subsequently called the cr-rule.

Choose the number of strata $H$ large enough to provide Conditions A and B of strong stratification, and choose the strata sizes $N_1, N_2, \ldots, N_H$ to equalize the population totals of $x^{\gamma/2}$ within all strata. Determine the required sample size from (4.8) and allocate the sample equally among strata.

In some applications the cr-rule may prescribe a sample size $n_h$ exceeding the population size $N_h$ for some stratum $h$. If $\gamma > 0$, it is sufficient to consider stratum $H$. In this case, the cr-rule can be modified to provide a 100 percent sample of stratum $H$ while maintaining the efficiency of the allocation and the validity of our approximation (4.7) for the mean square error. Simply decrease the lower boundary of stratum $H$ until $\bar{x}_H^{(\gamma/2)} \approx N\bar{x}^{(\gamma/2)}/n$. Then use the cr-rule to stratify the rest of the population and

to allocate the remainder of the sample. If $\gamma < 0$, a similar adjustment might be required in stratum 1.

4.4 Gain from Stratification

This section examines the gain in efficiency of stratification and allocation following the cr-rule. Our comparisons will be with a design using a simple random sample of size $n$ and the ordinary ratio estimator $(\sum_1^n Y_k / \sum_1^n x_k) \sum_1^N x_k$.

Conditional on any sample $s$, the expected mean square error of the ordinary ratio estimator can be found from (4.1) with $H = 1$ to be

$$(N\sigma\bar{x})^2 \big[\bar{x}_s^{(\gamma)}/n\bar{x}_s^2 + \bar{x}^{(\gamma)}/N\bar{x}^2 - 2\bar{x}_s^{(\gamma)}/N\bar{x}\bar{x}_s\big].$$

Here $\bar{x}_s$ and $\bar{x}_s^{(\gamma)}$ are the unstratified sample moments, $\bar{x}_s = \sum_1^n x_k/n$ and $\bar{x}_s^{(\gamma)} = \sum_1^n x_k^{\gamma}/n$. Assume that $s$ is a large simple random sample so that $\bar{x}_s \approx \bar{x}$. Then the unconditional expected mean square error of the ordinary ratio estimator is approximately

$$N^2 \sigma^2 \bar{x}^{(\gamma)} (1/n - 1/N).$$

This will be equal to $N^2 s^2$ if the size of the simple random sample $n$ is equal to $n_r$ where

$$n_r = n_0 / (1 + n_0/N). \tag{4.9}$$

Here $n_0$ is $\sigma^2 \bar{x}^{(\gamma)} / s^2$ as in (4.8b), and $n_0$ is the simple random sample size required if the population size $N$ is large.
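The cr-rule of Section 4.3 and the sample size formulas (4.8a), (4.8b) and (4.9) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the author's implementation: the function names `cr_rule_strata_sizes` and `sample_sizes` are hypothetical, cutpoints are placed at the first unit where the cumulative total of $x^{\gamma/2}$ reaches each equal-share target, and the 100 percent sampling adjustment for stratum $H$ is not handled.

```python
import numpy as np

def cr_rule_strata_sizes(x, gamma, H):
    """Stratum sizes N_1, ..., N_H that roughly equalize totals of x**(gamma/2).

    Assumption: each cut falls at the first unit where the cumulative total
    reaches h/H of the population total. Some strata may come out empty if
    H is too large for the population.
    """
    x = np.sort(np.asarray(x, dtype=float))
    cum = np.cumsum(x ** (gamma / 2.0))
    targets = cum[-1] * np.arange(1, H) / H
    cuts = np.searchsorted(cum, targets, side="left") + 1   # units up to each cut
    return np.diff(np.concatenate(([0], cuts, [x.size])))

def sample_sizes(sigma2_xgamma, s2, N, de):
    """Return (n_0, n_r, n_cr) following (4.8b), (4.9) and (4.8a).

    sigma2_xgamma : value of sigma^2 * xbar^(gamma)
    s2            : target s^2, where N^2 s^2 is the required expected MSE
    N             : population size
    de            : design effect de(gamma)
    """
    n0 = sigma2_xgamma / s2
    nr = n0 / (1.0 + n0 / N)
    ncr = de * nr          # identical to de * n0 / (1 + n0/N) in (4.8a)
    return n0, nr, ncr
```

For example, with purely illustrative inputs `sample_sizes(100.0, 1.0, 400, 0.5)` gives $n_0 = 100$, $n_r = 80$ and $n_{cr} = 40$, so the stratified design needs half the simple random sample size when the design effect is 0.5. In practice the resulting sizes would be rounded up and the total divided equally among the $H$ strata.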

A comparison of (4.9) with (4.8a) gives the following result. Given the superpopulation model $\xi$, approximately the same expected mean square error can be achieved by either the ordinary ratio estimator with a simple random sample of size $n_r$ or by the combined ratio estimator with an appropriately allocated, strongly stratified sample of size $n_{cr} = de(\gamma)\, n_r$.

So $de(\gamma)$, (4.6), gives the reduction in the sample size achieved by strong stratification with allocation following the cr-rule. It is interesting to note that if the superpopulation relationship is homoscedastic, then $\gamma$ is equal to zero and the design effect is one, so that there is no gain from stratifying. For a fixed population distribution of $x$, the greater $\gamma$ is, the greater is the gain from stratification.

4.5 Estimation of Model Parameters

In order to implement the cr-rule for a strongly stratified sample, the parameters $\gamma$ and $\sigma$ of the superpopulation model $\xi$ must be estimated at the design stage. A full analysis of the estimation problem might address many rather complex issues, perhaps including a Bayesian analysis of inference featuring the value of sample information, methods of pooling information drawn from various more or less relevant populations, and a comparison of model-efficient estimators with robust estimators. In this paper, estimation is not the main focus and it may suffice to suggest simple estimators that minimize reliance on superpopulation assumptions and that maximize consistency with common sampling-based theory and practice.

Consider first the estimation of the heteroscedasticity parameter $\gamma$ from $m$ observations of $Y_k$ generated according to $\xi$ for given $x_k$. A simple procedure for estimating the heteroscedasticity parameter of a regression model has been proposed by Park [10] and developed further by Harvey [8]. Let $u_k$ be the deviation $Y_k - \beta x_k$, which is equal to $\epsilon_k x_k^{\gamma/2}$ under $\xi$, and assume that the first two moments of $\ln(\epsilon_k^2)$ exist. Then

$$\ln(u_k^2) = \alpha + \gamma \ln(x_k) + v_k \tag{4.10}$$

where $\alpha = E[\ln(\epsilon_k^2)]$ and $v_k = \ln(\epsilon_k^2) - \alpha$. Under $\xi$, the disturbances $v_k$ are independent and identically distributed, so that the coefficient $\gamma$ could be estimated using ordinary least squares if the $u_k$ were observable. An asymptotically unbiased predictor of $u_k$ is $\hat{u}_k = Y_k - \hat{\beta} x_k$ where $\hat{\beta} = \sum_1^m Y_k / \sum_1^m x_k$. This leads us to the least squares regression estimator

$$\hat{\gamma} = (\overline{zw}_s - \bar{z}_s \bar{w}_s) / [\bar{w}_s^{(2)} - (\bar{w}_s)^2]. \tag{4.11}$$

Here $z_k = \ln(\hat{u}_k^2)$, $w_k = \ln(x_k)$, $\overline{zw}_s$ is the sample moment $\sum_1^m z_k w_k / m$, $\bar{w}_s^{(2)}$ is $\sum_1^m w_k^2 / m$, and so on.

If the available sample was randomly selected, the sample deviations $\hat{u}_k$ can also be used to estimate the quantity

$\sigma^2 \bar{x}^{(\gamma)}$ which arises in (4.7)-(4.9). Define

$$S^2 = \sum_1^m \hat{u}_k^2 / (m - 1). \tag{4.12}$$

Conditional on the $Y_k$ of the finite population, conventional sampling theory shows that $S^2$ is an asymptotically unbiased estimator of $\bar{u}^{(2)} = \sum_1^N u_k^2 / N$. But under $\xi$, the model-based expectation of $\bar{u}^{(2)}$ is $\sigma^2 \bar{x}^{(\gamma)}$. This implies that $S^2$ is a model-based asymptotically unbiased estimator of $\sigma^2 \bar{x}^{(\gamma)}$. This estimator is closely related to robust estimators considered in [12].

If $S^2$ is used to estimate $\sigma^2 \bar{x}^{(\gamma)}$ and if $de(\hat{\gamma})$ is used to estimate $de(\gamma)$, then equations (4.8) and (4.9) give the following estimates of $n_0$, $n_r$, and $n_{cr}$:

$$\hat{n}_0 = S^2 / s^2 \tag{4.13a}$$

$$\hat{n}_r = \hat{n}_0 / (1 + \hat{n}_0/N) \tag{4.13b}$$

$$\hat{n}_{cr} = de(\hat{\gamma})\, \hat{n}_r. \tag{4.13c}$$

Here $N^2 s^2$ is the expected mean square error required for the estimator of the population total $T$, $n_0$ is the sample size required using the ordinary ratio estimator with a large population, $n_r$ is the sample size required using the ordinary ratio estimator with the finite population correction, and $n_{cr}$ is the sample size required using the combined ratio estimator with allocation following the cr-rule. One noteworthy implication of this use of $S^2$ is that (4.13a) and

(4.13b) are completely consistent with conventional sampling-based analysis.

Suppose now that instead of a simple random sample, we have a stratified random sample with $m_h$ observations in each stratum $h = 1, 2, \ldots, H$. Then equation (4.12) defining $S^2$ must be modified to use the population-weighted average of within-strata sample variances. Let $S_h^2 = \sum \hat{u}_k^2 / (m_h - 1)$, the sample variance of $\hat{u}_k$ within stratum $h$. In place of (4.12), define

$$S^2 = \sum_1^H (N_h/N) S_h^2. \tag{4.14}$$

With this redefinition of $S^2$, equations (4.13a-c) remain appropriate. These procedures will be illustrated in the following section.

5. Sampling in Electric Utility Load Research

The sampling rules discussed in this paper have been developed for a specific application in electric utility load research. Interruptions of electric power service endanger the health of many people, disrupt business, and inconvenience everyone. Utility planners, regulators and managers must provide enough generating and transmitting capacity to instantaneously meet their customers' greatest demands for electricity. To minimize the cost of service, managers try to maintain an efficient balance of base generating units

which have high capital costs but low operating costs, and peak generating units which have low capital costs but high operating costs. Regulators try to establish rates that fairly allocate both capital and operating costs among users. These various concerns -- capacity planning, power production, and rate setting -- all require accurate assessment of the timing of the customers' demand for electricity.

Most utilities measure their power production almost continuously and these data can be adjusted for transmission losses to estimate past demand on an almost instantaneous system-wide basis. In addition, customer billing procedures generally measure each customer's use of electricity monthly or bi-monthly. Despite this abundance of accurate data describing the entire generating system and population of customers, utility managers and regulators need additional data describing the timing of demand for electricity within certain customer subpopulations, especially present or proposed rate-groups.

These data are usually obtained by utilizing special time-of-day meters for a sample of customers. These meters generally measure an individual customer's usage of electricity during each consecutive fifteen minute period. The data are continuously recorded on magnetic tapes which are periodically returned to the utility or to a service bureau for editing and transcription. The expenses of equipment acquisition and

maintenance, meter installation, data collection, and data processing usually add up to several hundred dollars per sample customer, so sample planning is important.

A discussion of this application is especially timely because the Public Utility Regulatory Policies Act of 1978 will greatly expand this activity. In fact this act requires most utilities to develop sampling plans to estimate demand characteristics of specified customer subpopulations with 90 percent probability of less than 10 percent error.

Current practices and issues in electric utility load research are described in detail in various publications and presentations of the Load Research Committee of the Association of Edison Illuminating Companies, especially [2, 9]. Aigner's work [1] is also relevant. While the details vary, the methods of sample design discussed in the present paper are generally applicable to load research.

Consider a specific population of customers, let $x_k$ be the use of electricity of customer $k$ during a specific month as recorded by the customer billing procedure, and let $Y_k$ be that customer's use of electricity during a specific interval of time within the month, perhaps the hour of peak system usage. For convenience, the variable $x_k$ will be called "use" and $Y_k$ will be called "demand." We regard use $x_k$ as nonstochastic, but demand $Y_k$ as a random variable. The utility wants to predict the total demand $T = \sum_1^N Y_k$ of all $N$

customers. It is equivalent to predict the ratio $B = \sum_{k=1}^{N} y_k / \sum_{k=1}^{N} x_k$ or, as is more common in the industry, to predict the reciprocal $B^{-1}$, which is called the load factor. In the latter case $x_k$ is usually redefined as the average hourly use during the month.

The utility needs to plan a procedure for selecting n customers for whom demand $y_k$ will be measured using time-of-day meters. Since these data will be used to determine electricity rates for the population of customers, any design involving non-random sample selection is politically unattractive, and random sampling is almost always used. Royall and Herson's concept of balanced sampling also helps to justify the preference of utility managers and regulators for random sampling.

Many load research designs stratify the population according to use, x. Stratified random sampling lets the researcher oversample the large customers while preserving the impartiality of random selection. This practice is supported by the results of this paper, which show that an allocation following the cr-rule usually provides greater efficiency than a simple random sample and also much of the robustness of a balanced sample.

A stratified sampling design must specify strata boundaries, the total sample size, and the sample allocation among strata. If the superpopulation model of Section 2 is believed to be reasonably accurate, and if the parameters $\gamma$

and $\sigma^2$ can be assessed, then an efficient sampling design can be developed following the rules given in Section 4.3. If a relevant sample is available, the estimation techniques of Section 4.5 are applicable.

A numerical example may be helpful. In [9], Higgins provides data describing use $x_k$ (mwh) and demand $y_k$ (kw) for each of 210 customers. Figure 1 shows a scatterplot of the data, and Table 1 provides sample statistics.

Table 1
Sample Statistics for Example

    n = 210
    $\bar{x}_s$ = 1589 mwh
    $\bar{y}_s$ = 3757 kw
    $\hat{\gamma}$ = 1.704
    S = 1920 kw

While population statistics were not published, the population size N and the population distribution of use would ordinarily be readily available. For the sake of illustration we assume $N = 740$, $\bar{x} = 1400$ mwh, and $cv(x^{\gamma/2}) = 1.238$, so that $de(\gamma) = .3948$. These figures are consistent with the available sample.

Suppose that the design criterion is 10 percent relative error with 90 percent probability, as required for the Public Utility Regulatory Policies Act. Then the expected

[Figure 1. Scatterplot of Demand and Usage for Example. Vertical axis: DEMAND (768 to 39400); horizontal axis: USE.]

mean square error $N^2\epsilon^2$ should satisfy $1.645\,\epsilon = (.10)\,B\bar{x} = (.10)(3757/1589)(1400)$ kw, so that $\epsilon = 201.2$ kw. Then (4.13a-c) imply

    $n_0 = (1920/201.2)^2 = 91$ units,
    $n_r = 91/(1 + 91/740) = 81$ units, and
    $n_{cr} = (.3948)(81) = 32$ units.

A practical design might use $H = 8$ strata with four units per stratum. Following the cr-rule, the eight strata are chosen to equalize the population totals of $x^{\gamma/2}$ within all strata. Using the available sample data, the resulting stratification is shown in Table 2.

Table 2
Choice of Stratification for Example

Stratum   Size   Upper Boundary   Sampling Fraction (%)
   h      N_h         x_h               n_h/N_h           \bar{x}_h^{(\gamma/2)}   (\bar{x}_h^{(\gamma)})^{1/2}
   1      268          575                1.5                    170.0                     173.0
   2      173          840                2.3                    261.3                     262.3
   3      113         1483                3.5                    403.0                     407.2
   4       74         2522                5.4                    603.0                     608.3
   5       49         3814                8.2                    931.4                     934.6
   6       28         7955               14.3                   1398.0                    1432.0
   7       21        11600               16.7                   2347.0                    2360.0
   8       14        16000               28.6                   3457.0                    3473.0
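The sample-size arithmetic that precedes Table 2 can be retraced with a short script. This is only a sketch: the function name `sample_sizes` and its argument names are hypothetical, the three steps simply mirror the arithmetic printed in the text for (4.13a-c), and the relation de = 1/(1 + cv^2) is an assumption that happens to reproduce the quoted value .3948 from cv = 1.238.

```python
def sample_sizes(S, y_bar_s, x_bar_s, x_bar, N, de, rel_err=0.10, z=1.645):
    """Retrace the example's sample-size arithmetic (hypothetical helper).

    S       : the 1920 kw figure from Table 1
    y_bar_s : sample mean demand (kw); x_bar_s : sample mean use (mwh)
    x_bar   : assumed population mean use (mwh); N : population size
    de      : design-effect factor applied in the last step
    """
    ratio = y_bar_s / x_bar_s            # sample ratio of demand to use
    tol = rel_err * ratio * x_bar / z    # tolerance: z * tol = 10% of ratio * x_bar
    n0 = (S / tol) ** 2                  # size ignoring the finite population
    n_r = n0 / (1.0 + n0 / N)            # finite-population adjustment
    n_cr = de * n_r                      # reduction under the cr-rule design
    return tol, n0, n_r, n_cr


cv = 1.238
de = 1.0 / (1.0 + cv ** 2)  # ASSUMPTION: this form reproduces the quoted .3948

tol, n0, n_r, n_cr = sample_sizes(
    S=1920, y_bar_s=3757, x_bar_s=1589, x_bar=1400, N=740, de=de
)
print(f"tolerance = {tol:.1f} kw")
print(f"n0 = {n0:.1f}, n_r = {n_r:.1f}, n_cr = {n_cr:.1f}")
```

The script carries unrounded values from step to step, whereas the text rounds at each step; both routes round to the same whole numbers of units.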

Despite the comparatively small number of strata, Conditions A and B of strong stratification seem well satisfied. The consistency between the last two columns of Table 2 confirms Condition A. Condition B can be verified by showing that the coefficient of variation of x is small. In fact, it turns out to be 2.8 percent. An interesting alternative design would be to use 32 strata with one unit selected at random from each stratum.

6. Comparison with Dalenius-Hodges Stratification

Rules for constructing strata boundaries have been considered by Dalenius and others [5, pp. 129-131; 6; 7; 16]. Cochran summarizes this work by saying that "the cum $\sqrt{f}$ rule applied to x should give an efficient stratification for another variable y that has a linear regression on x with high correlation" [5, p. 131]. However, the cr-rule for stratification proposed in this paper is quite different from Dalenius and Hodges' cum $\sqrt{f}$ rule.

To simplify the discussion, suppose that the population distribution of x is continuous with probability density function f(x). The cum $\sqrt{f}$ rule is to choose strata boundaries, say $x_{h-1}$ and $x_h$, to equalize the integrals $\int_{x_{h-1}}^{x_h} \sqrt{f(x)}\,dx$ between all strata, and then to allocate the sample equally among strata. Under the cr-rule, on the other hand, the integrals $\int_{x_{h-1}}^{x_h} x^{\gamma/2} f(x)\,dx$ are equalized. Both rules balance integrals of the form $\int_{x_{h-1}}^{x_h} w(x) f(x)\,dx$, where the weight function w(x) is $f(x)^{-1/2}$ for the cum $\sqrt{f}$ rule and $x^{\gamma/2}$ for the cr-rule.

This comparison can be reformulated in terms of the sampling fractions $n_h/N_h$. If we ignore 100 percent sampling constraints, both rules divide the total sample n equally among the strata. This implies that the within-stratum sample size $n_h$ is proportional to the integral $\int_{x_{h-1}}^{x_h} w(x) f(x)\,dx$, so that the sampling fraction $n_h/N_h$ is proportional to the within-stratum population mean of w(x), $\int_{x_{h-1}}^{x_h} w(x) f(x)\,dx \big/ \int_{x_{h-1}}^{x_h} f(x)\,dx$. In the case of the cr-rule, the sampling fractions are proportional to the within-stratum means of $x^{\gamma/2}$, as was pointed out in Section 4.2, so that the sampling fractions increase with size as long as $\gamma > 0$. However, in the case of the cum $\sqrt{f}$ rule, the sampling fractions are proportional to the means of $f(x)^{-1/2}$ within each stratum, so that the sampling fractions decrease as the density of x increases.

The difference between the two rules is most dramatic for strata below the mode of a unimodal distribution. The cr-rule will give sampling fractions which increase with x if $\gamma > 0$, while the cum $\sqrt{f}$ rule will give sampling fractions which decrease with increasing x. The two rules will be equivalent only if their weight functions $f(x)^{-1/2}$ and $x^{\gamma/2}$ are proportional, i.e. if f(x) is proportional to $x^{-\gamma}$ wherever f(x) is nonzero.

The results given in this paper appear to contradict Cochran's evaluation of the cum $\sqrt{f}$ rule, quoted above. The

difference in designs derives from different choices of estimator. Dalenius' work dealt with the stratified sampling estimator of the mean or total of y, while the cr-rule relates to the combined ratio estimator. The cum $\sqrt{f}$ rule is appropriate if x is to be used for stratification but not for estimation. The cr-rule seems to be appropriate in those cases in which x is to be used both for stratification and in ratio estimation.
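The contrast between the two rules can be illustrated numerically. The sketch below is not taken from the paper: the population is hypothetical (2000 units placed at normal quantiles with mean 10 and standard deviation 2, chosen only so that several strata fall below the mode), the function names are invented for illustration, and the cum sqrt(f) boundaries use a common histogram approximation in which the square roots of bin frequencies are accumulated and cut into equal intervals. The value of gamma is borrowed from Table 1.

```python
from statistics import NormalDist

import numpy as np


def cr_rule_labels(x, gamma, H):
    """Stratum index (0..H-1) per unit; strata hold near-equal totals of x**(gamma/2)."""
    order = np.argsort(x)
    w = x[order] ** (gamma / 2.0)
    cum = np.cumsum(w) / w.sum()                 # cumulative share of the total
    cuts = np.arange(1, H) / H
    labels = np.empty(len(x), dtype=int)
    labels[order] = np.searchsorted(cuts, cum, side="left")
    return labels


def cum_sqrt_f_labels(x, H, bins=200):
    """Stratum index per unit from a histogram version of the cum sqrt(f) rule."""
    counts, edges = np.histogram(x, bins=bins)
    cum = np.cumsum(np.sqrt(counts))
    cum = cum / cum[-1]
    cuts = np.arange(1, H) / H
    idx = np.searchsorted(cum, cuts, side="left")  # first bin reaching each cut
    boundaries = edges[idx + 1]                    # upper edge of that bin
    return np.searchsorted(boundaries, x, side="right")


# Hypothetical unimodal population (an assumption made for illustration only).
N, H, gamma, n_h = 2000, 6, 1.704, 4
z = np.array([NormalDist().inv_cdf((k + 0.5) / N) for k in range(N)])
x = 10.0 + 2.0 * z

size_cr = np.bincount(cr_rule_labels(x, gamma, H), minlength=H)
size_dh = np.bincount(cum_sqrt_f_labels(x, H), minlength=H)

# With equal allocation, the sampling fraction in stratum h is n_h / N_h.
print("cr-rule      N_h:", size_cr, " fractions:", np.round(n_h / size_cr, 4))
print("cum sqrt(f)  N_h:", size_dh, " fractions:", np.round(n_h / size_dh, 4))
```

For this population the cr-rule strata shrink steadily as x grows, so its sampling fractions rise throughout, while the cum sqrt(f) strata grow toward the mode, so its sampling fractions fall with increasing x in the strata below the mode.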

REFERENCES

[1] Aigner, D.J., "Bayesian Analysis of Optimal Sample Size and a Best Decision Rule for Experiments in Direct Load Control," Journal of Econometrics, 9 (1979), 209-22.

[2] Brandenburg, L. and Higgins, C.E., Jr., "Stratified Random Sampling Methods for Class Load Surveys for Electric Utilities," Applied Statistics for Load Research, Vol. III, Association of Edison Illuminating Companies, New York (July 1974).

[3] Brewer, K.R.W., "Ratio Estimation in Finite Populations: Some Results Deducible from the Assumptions of an Underlying Stochastic Process," Australian Journal of Statistics, 5 (1963), 93-105.

[4] Cassel, C.M., Särndal, C.E., and Wretman, J.H., Foundations of Inference in Survey Sampling, John Wiley & Sons, New York, 1977.

[5] Cochran, W.G., Sampling Techniques, Third Edition, John Wiley & Sons, New York, 1977.

[6] Dalenius, T., Sampling in Sweden: Contributions to the Methods and Theories of Sample Survey Practice, Almqvist and Wiksell, Stockholm, 1957.

[7] Dalenius, T. and Hodges, J.L., Jr., "Minimum Variance Stratification," Journal of the American Statistical Association, 54 (1959), 88-101.

[8] Harvey, A.C., "Estimating Regression Models with Multiplicative Heteroscedasticity," Econometrica, 44 (1976), 461-64.

[9] Higgins, C.E., Jr., "Stratified Random Sampling Methods for Class Load Surveys for Electric Utilities," Unpublished Mimeograph, Virginia Electric & Power Company.

[10] Park, R.E., "Estimation with Heteroscedastic Error Terms," Econometrica, 34 (1966), 888.

[11] Royall, R.M., "On Finite Population Sampling Theory Under Certain Linear Regression Models," Biometrika, 57 (1970), 377-87.

[12] Royall, R.M. and Cumberland, W.G., "Variance Estimation in Finite Population Sampling," Journal of the American Statistical Association, 73 (1978), 351-58.

[13] Royall, R.M. and Herson, J., "Robust Estimation in Finite Populations I," Journal of the American Statistical Association, 68 (1973), 880-89.

[14] Royall, R.M. and Herson, J., "Robust Estimation in Finite Populations II: Stratification on a Size Variable," Journal of the American Statistical Association, 68 (1973), 890-93.

[15] Scott, A.J., Brewer, K.R.W., and Ho, E.W.H., "Finite Population Sampling and Robust Estimation," Journal of the American Statistical Association, 73 (1978), 359-61.

[16] Singh, R., "Approximately Optimal Stratification on the Auxiliary Variable," Journal of the American Statistical Association, 66 (1971), 829-30.