Division of Research June 1979 Graduate School of Business Administration The University of Michigan COMPARING REGRESSION MODELS INVOLVING RATIO VARIABLES Working Paper No. 180 Roger L. Wright The University of Michigan FOR DISCUSSION PURPOSES ONLY None of this material is to be quoted or reproduced without the express permission of the Division of Research.

1. Introduction

Regression models are often specified to include ratio variables such as household consumption per member, regional income per capita, savings as a fraction of income, and profit of a firm per unit of its assets or sales. In cross-sectional studies, the motivation for using ratio variables is usually to eliminate the influence of size so as to better isolate non-size effects (Kuh and Meyer, 1955). In the analysis of time series data, ratio variables arise in adjusting for seasonality and inflation. Ratio variables are also used to implement weighted least squares estimation for heteroscedastic regression models (Johnston, 1972, pp. 214-221).

As Kuh and Meyer emphasized, the Gauss-Markov Theorem for heteroscedastic linear models provides a systematic and rigorous basis for introducing ratio variables in certain circumstances. Various authors, including Fisher (1957), Glejser (1969), Harvey (1976), Park (1966), Rutemiller and Bowers (1968), and Taylor (1978), have discussed the estimation of heteroscedastic models, but these models have not been developed with sufficient generality to encompass the common and useful practice of using both deflated and undeflated independent variables in the same model. In this paper we will examine a very broad family of nonlinear

regression models which generalize Harvey's multiplicative model for heteroscedasticity to allow deflation (or inflation) of independent variables. Rao's scoring method (Rao, 1973, pp. 367-70) will be used to develop a convenient procedure for calculating maximum likelihood estimators.

There has long been confusion about the relationship of correlation to the use of ratio variables. Several authors, including Pearson (1897), Kuh and Meyer (1955) and Madansky (1964), have discussed the spurious correlation that may arise from using a common variable to deflate both dependent and independent variables. Other writers have warned against using multiple correlation coefficients to compare the goodness of fit of models with alternative transformations of the dependent variable. (On the other hand, Granger and Newbold [1976] find that maximizing $R^2$ can be an appropriate strategy for analyzing a family of transformations of both dependent and independent variables.) We will develop here a simple way of using the standard errors of the regressions to compare the likelihood of alternative specifications involving heteroscedasticity or deflated dependent variables.

Our methods of model specification and estimation will be illustrated by reanalyzing Mooz's (1978) capital cost data for nuclear power plants. Some of the issues of concern here were raised in a rather similar setting by Griliches (1972).

2. Model and Notation

Throughout this paper we will examine the following nonlinear, heteroscedastic regression model:

$$Y_i = \sum_{j=1}^{m} e^{Z_{ji}'\alpha_j}\, X_{ji}'\beta_j + u_i \qquad (1 \le i \le n) \tag{1a}$$

Here $\{Z_{ji}, X_{ji}, Y_i : 1 \le j \le m,\ 1 \le i \le n\}$ are observed data, and $\{\alpha_j, \beta_j : 1 \le j \le m\}$ are parameters. In general we consider $Z_{ji}$ and $\alpha_j$ to be vectors of size $p_j \times 1$, and $X_{ji}$ and $\beta_j$ to be of size $q_j \times 1$. The following assumptions are maintained:

(i) The disturbances $\{u_i : 1 \le i \le n\}$ are mutually independent and normally distributed with mean zero and variance

$$E(u_i^2) = \sigma^2 e^{2 Z_{1i}'\alpha_1} \tag{1b}$$

with $Z_{1i}$ and $\alpha_1$ as in the initial term of the summation in (1a), and $\sigma^2 > 0$ an additional parameter.

(ii) The parameters $\{\alpha_j : 1 \le j \le \ell\}$, $\{\beta_j : 1 \le j \le m\}$, and $\sigma^2$ are to be estimated, while $\{\alpha_j : \ell + 1 \le j \le m\}$ are prespecified.

(iii) The variables $Z_{ji}$ and $X_{ji}$ are either nonstochastic or stochastic but distributed independently of the $u_i$ with a distribution not involving the parameters $\alpha_j$, $\beta_j$, and $\sigma^2$.

(iv) $\sum_{i=1}^{n} Z_{1i} = 0$. The importance of this assumption, which imposes no real lack of generality, will become clear in Section 3.

(v) The parameters to be estimated are identifiable.

Some additional notation will simplify subsequent analysis:

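$$X^*_{ji} = e^{Z_{ji}'\alpha_j - Z_{1i}'\alpha_1}\, X_{ji} \qquad (1 \le j \le m) \tag{2a}$$

$$Z^*_{ji} = (X^{*\prime}_{ji}\beta_j)\, Z_{ji} \qquad (1 \le j \le \ell) \tag{2b}$$

$$Y^*_i = e^{-Z_{1i}'\alpha_1}\, Y_i \tag{2c}$$

$$\varepsilon_i = e^{-Z_{1i}'\alpha_1}\, u_i \tag{2d}$$

In addition we will use the following partitioned vectors and matrices:

$$X = \begin{bmatrix}
Z^{*\prime}_{11} & Z^{*\prime}_{21} & \cdots & Z^{*\prime}_{\ell 1} & X^{*\prime}_{11} & \cdots & X^{*\prime}_{m1} \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
Z^{*\prime}_{1n} & Z^{*\prime}_{2n} & \cdots & Z^{*\prime}_{\ell n} & X^{*\prime}_{1n} & \cdots & X^{*\prime}_{mn} \\
\sqrt{2}\,\sigma Z_{11}' & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
\sqrt{2}\,\sigma Z_{1n}' & 0 & \cdots & 0 & 0 & \cdots & 0
\end{bmatrix} \tag{3a}$$

$$\boldsymbol{\beta}' = [\,\alpha_1' \ \cdots \ \alpha_\ell' \mid \beta_1' \ \cdots \ \beta_m'\,] \tag{3b}$$

$$E' = \left[\,\varepsilon_1 \ \cdots \ \varepsilon_n \ \Big|\ \frac{\varepsilon_1^2 - \sigma^2}{\sqrt{2}\,\sigma} \ \cdots \ \frac{\varepsilon_n^2 - \sigma^2}{\sqrt{2}\,\sigma}\,\right] \tag{3c}$$

The dimensions of $X$, $\boldsymbol{\beta}'$, and $E'$ are respectively $(2n \times (r+s))$, $(1 \times (r+s))$, and $(1 \times 2n)$. Here $r = \sum_{j=1}^{\ell} p_j$ and $s = \sum_{j=1}^{m} q_j$. In Section 3 we will often write $\varepsilon_i$, $E$ and $X$ as $\varepsilon_i(\boldsymbol{\beta})$, $E(\boldsymbol{\beta}, \sigma^2)$, and $X(\boldsymbol{\beta}, \sigma^2)$ to emphasize their functional dependence on the parameters.

Three cases of this model are of special interest:

(i) If $\alpha_1 = 0$ and $\ell = 1$, so that all of the deflation coefficients $\alpha_2, \dots, \alpha_m$ are prespecified, we obtain a homoscedastic linear regression model involving $q_1$ independent variables represented by the vector $X_{1i}$ and prespecified deflated independent variables given by $X^*_{2i}, \dots, X^*_{mi}$. In the most common applications, $p_j = 1$, $Z_{ji}$ is the logarithm of a measure of size, and $\alpha_j = -1$ $(2 \le j \le m)$. Alternatively, if $\alpha_j = 1$ we obtain an interaction variable.

(ii) If $\alpha_1 = 0$ and $\ell \ge 2$, we have a nonlinear, homoscedastic regression model which generalizes an example discussed by Draper and Smith (1966, pp. 266-284). In this case the degree of deflation of $X_2, \dots, X_\ell$ is to be estimated from the data.

(iii) If $X_{1i} = 0$ and $\ell = 1$, we have the linear model with multiplicative heteroscedasticity. Harvey (1976) discusses estimation of $\alpha_1$ and $\beta_j$ $(2 \le j \le m)$ in this case. ($\beta_1$ is obviously unidentifiable and is taken to be zero.) Kmenta (1971, pp. 256-264) also discusses this case but with the added restriction that $p_1 = 1$. Furthermore, if $\alpha_1$ is known, it is easy to see that this case can be reformulated as a homoscedastic linear model in the deflated variables $Y^*_i, X^*_{2i}, \dots, X^*_{mi}$. This is the familiar case of weighted least squares.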
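The weighted least squares equivalence in case (iii) can be checked numerically. The following is a minimal sketch on synthetic data, assuming a scalar size variable `z` (so $p_1 = 1$), a known `alpha1`, and hypothetical names throughout; it is an illustration, not the computation used in the paper.

```python
import numpy as np

# Synthetic data for case (iii): Y = X beta + u with Var(u_i) = sigma^2 * exp(2 * z_i * alpha1).
rng = np.random.default_rng(1)
n = 50
z = rng.normal(size=n)
z -= z.mean()                      # assumption (iv): the z_i sum to zero
alpha1 = 1.0                       # deflation coefficient, assumed known here
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0])
Y = X @ beta_true + 0.3 * np.exp(z * alpha1) * rng.normal(size=n)

# Deflated variables Y* and X* as in (2a) and (2c), with alpha_j = 0 for j >= 2.
w = np.exp(-z * alpha1)
Y_star = w * Y
X_star = w[:, None] * X

# Ordinary least squares on the deflated variables ...
b_deflated = np.linalg.lstsq(X_star, Y_star, rcond=None)[0]

# ... agrees with weighted least squares on the original variables (weights w^2).
W = np.diag(w ** 2)
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
```

Both `b_deflated` and `b_wls` solve the same normal equations, so they coincide up to rounding.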

3. Estimation

To study estimation, we reformulate the model (1a-b) using the definitions (2a-d) as

$$Y^*_i = \sum_{j=1}^{m} X^{*\prime}_{ji}\beta_j + \varepsilon_i .$$

Here the $\{\varepsilon_i : 1 \le i \le n\}$ are mutually independent and normally distributed with mean zero and variance $\sigma^2$.

The transformation from $Y_i$ to $Y^*_i$ involves parameters to be estimated. In such a case the likelihood function of the sample can be written as the product of the probability density of the $\{Y^*_i : 1 \le i \le n\}$ and the Jacobian $J$ of the transformation from $Y_1, \dots, Y_n$ to $Y^*_1, \dots, Y^*_n$ (Box and Cox, 1964, p. 215). From (2c) and assumption (iv) it is easy to see that

$$J = \prod_{i=1}^{n} \left| \frac{dY^*_i}{dY_i} \right| = e^{-\sum_{i=1}^{n} Z_{1i}'\alpha_1} = 1 .$$

This will provide a very convenient simplification to our analysis. The likelihood function $L(\boldsymbol{\beta}, \sigma^2)$ can be expressed as

$$L(\boldsymbol{\beta}, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \varepsilon_i^2 \right)$$

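so that

$$\log L(\boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \varepsilon_i(\boldsymbol{\beta})^2 \tag{4}$$

where, using (1a) and (2d), we write

$$\varepsilon_i(\boldsymbol{\beta}) = e^{-Z_{1i}'\alpha_1}\left( Y_i - \sum_{j=1}^{m} e^{Z_{ji}'\alpha_j} X_{ji}'\beta_j \right) \tag{5}$$

From (4) it is clear that the maximum likelihood estimators of $\boldsymbol{\beta}$ and $\sigma^2$, say $\hat{\boldsymbol{\beta}}$ and $\hat\sigma^2$, have some of the properties of ordinary least squares estimators:

(i) $\hat{\boldsymbol{\beta}}$ must minimize
$$\sum_{i=1}^{n} \varepsilon_i(\boldsymbol{\beta})^2 , \tag{6a}$$

(ii)
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i(\hat{\boldsymbol{\beta}})^2 , \text{ and} \tag{6b}$$

(iii)
$$\log L(\hat{\boldsymbol{\beta}}, \hat\sigma^2) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\hat\sigma^2) - \frac{n}{2} . \tag{6c}$$

Rather than attempting to solve directly the nonlinear normal equations associated with (6a), we develop an iterative estimation procedure using Rao's method of scoring (Rao, 1973, pp. 367-70). For this we consider a trial value $\boldsymbol{\beta}_0$ and let $\sigma_0^2$ be defined from (6b):

$$\sigma_0^2 = \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i(\boldsymbol{\beta}_0)^2 .$$

We define the efficient score $S(\boldsymbol{\beta}_0)$ and the information $I(\boldsymbol{\beta}_0)$ to be:

$$S(\boldsymbol{\beta}_0) = \left. \frac{\partial}{\partial \boldsymbol{\beta}} \log L(\boldsymbol{\beta}, \sigma_0^2) \right|_{\boldsymbol{\beta} = \boldsymbol{\beta}_0} , \text{ and} \tag{7a}$$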

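$$I(\boldsymbol{\beta}_0) = E\left[ \left. -\frac{\partial^2}{\partial \boldsymbol{\beta}\, \partial \boldsymbol{\beta}'} \log L(\boldsymbol{\beta}, \sigma_0^2) \right|_{\boldsymbol{\beta} = \boldsymbol{\beta}_0} \right] \tag{7b}$$

Then the revised estimate $\boldsymbol{\beta}_1$ is calculated as

$$\boldsymbol{\beta}_1 = \boldsymbol{\beta}_0 + I(\boldsymbol{\beta}_0)^{-1} S(\boldsymbol{\beta}_0) \tag{7c}$$

This procedure is continued until convergence of $\hat\sigma^2$. The resulting maximum likelihood estimator $\hat{\boldsymbol{\beta}}$ is asymptotically normal with mean $\boldsymbol{\beta}$ and covariance matrix $I(\boldsymbol{\beta})^{-1}$.

Because of the special form of the model that we are considering, the method of scoring is computationally quite convenient. From (5) it is easily seen that

$$\frac{\partial}{\partial \alpha_j} \varepsilon_i(\boldsymbol{\beta}) = -Z^*_{ji} - \delta_{j1} Z_{1i}\, \varepsilon_i \qquad (1 \le j \le \ell)$$

$$\frac{\partial}{\partial \beta_j} \varepsilon_i(\boldsymbol{\beta}) = -X^*_{ji} \qquad (1 \le j \le m)$$

so that (4) implies

$$\frac{\partial}{\partial \alpha_j} \log L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{\sigma^2}\left[ \sum_{i=1}^{n} Z^*_{ji}\varepsilon_i + \delta_{j1} \sum_{i=1}^{n} Z_{1i}\, \varepsilon_i^2 \right] \qquad (1 \le j \le \ell)$$

$$\frac{\partial}{\partial \beta_j} \log L(\boldsymbol{\beta}, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^{n} X^*_{ji}\varepsilon_i \qquad (1 \le j \le m)$$

Here $\delta_{j1}$ is defined to be one if $j = 1$ and zero for other values of $j$. From (3a-c) we find that

$$S(\boldsymbol{\beta}_0) = X(\boldsymbol{\beta}_0, \sigma_0^2)'\, E(\boldsymbol{\beta}_0, \sigma_0^2) / \sigma_0^2 \tag{8a}$$

In a similar fashion one can confirm that

$$E\left[ -\frac{\partial^2}{\partial \alpha_j\, \partial \alpha_k'} \log L(\boldsymbol{\beta}, \sigma^2) \right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} Z^*_{ji} Z^{*\prime}_{ki} + 2\, \delta_{j1}\delta_{k1} \sum_{i=1}^{n} Z_{1i} Z_{1i}' \qquad (1 \le j, k \le \ell)$$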

$$E\left[ -\frac{\partial^2}{\partial \alpha_j\, \partial \beta_k'} \log L(\boldsymbol{\beta}, \sigma^2) \right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} Z^*_{ji} X^{*\prime}_{ki} \qquad (1 \le j \le \ell,\ 1 \le k \le m)$$

$$E\left[ -\frac{\partial^2}{\partial \beta_j\, \partial \beta_k'} \log L(\boldsymbol{\beta}, \sigma^2) \right] = \frac{1}{\sigma^2} \sum_{i=1}^{n} X^*_{ji} X^{*\prime}_{ki} \qquad (1 \le j, k \le m)$$

which implies that

$$I(\boldsymbol{\beta}_0) = X(\boldsymbol{\beta}_0, \sigma_0^2)'\, X(\boldsymbol{\beta}_0, \sigma_0^2) / \sigma_0^2 \tag{8b}$$

Thus the revised estimate $\boldsymbol{\beta}_1$ is given by

$$\boldsymbol{\beta}_1 - \boldsymbol{\beta}_0 = \left[ X(\boldsymbol{\beta}_0, \sigma_0^2)'\, X(\boldsymbol{\beta}_0, \sigma_0^2) \right]^{-1} X(\boldsymbol{\beta}_0, \sigma_0^2)'\, E(\boldsymbol{\beta}_0, \sigma_0^2)$$

and the correction for $\boldsymbol{\beta}_0$ takes the form of an ordinary least squares estimator with $E$ as the dependent variable and $X$ as the data matrix. Moreover, the ordinary standard errors of the final correction serve as standard errors of the final estimate $\hat{\boldsymbol{\beta}}$.

In the particular case that $X_{1i} = 0$, $X'X$ is block diagonal so that the scoring correction takes the form of one ordinary least squares estimator associated with $\alpha_1$ and a second ordinary least squares estimator for $\alpha_2, \dots, \alpha_\ell, \beta_2, \dots, \beta_m$. The first of these two estimations is equivalent to Harvey's scoring procedure.

4. An Application

Policy decisions about alternative sources of energy require projections of future capital costs based on a thorough understanding of past experience. Recently W.E. Mooz (1978) published a detailed statistical analysis of the capital cost of nuclear power plants constructed in the U.S. We will use data collected and published by Mooz to illustrate the methods of analysis developed in Section 3.

In the preparation of his data, Mooz first carries out a rather complex interpolation to adjust annually reported capital costs for the inflation of construction costs, obtaining an estimate of the total capital cost of each plant measured in millions of 1973 dollars. We label this variable as COST. Mooz also records the SIZE of each plant, measured in installed megawatts of capacity. The dependent variable selected by Mooz is the ratio 1000 COST/SIZE, which measures costs in 1973 dollars per kilowatt of installed capacity. The following independent variables are included in Mooz's final model [p. 32]:

SIZE - installed capacity in megawatts.
CPIS - date of issuance of the construction permit, represented as a decimal equivalent of year and month, e.g. 67.17 for February, 1967.
LOCI - a dummy variable indicating the FPC Region I, the Northeastern U.S.
EXP - the natural logarithm of the cumulative number of plants built by the architect-engineer.
TOWER - a dummy variable indicating the inclusion of a cooling tower in the plant.

The data used by Mooz, and also here, cover 39 nuclear power plants having cost data available and considered by Mooz to be reliable. These plants were started between 1967 and 1972 and range

in capacity from 457 megawatts to 1130 megawatts. Their costs range from 208 to 881 million 1973 dollars.

In our re-examination of these data we will investigate the use of capacity to deflate both COST and the independent variables. For our purpose we introduce the variable S = SIZE/GM(SIZE). Here GM(SIZE) is the geometric mean of SIZE within the 39 plant database, 813.788 megawatts. We consider models of the form

$$\text{COST} = \beta_1 S^{\alpha_1} + \beta_2\, \text{SIZE}\, S^{\alpha_2} + \beta_3\, \text{CPIS}\, S^{\alpha_3} + \beta_4\, \text{LOCI}\, S^{\alpha_4} + \beta_5\, \text{EXP}\, S^{\alpha_5} + \beta_6\, \text{TOWER}\, S^{\alpha_6} + u \tag{9}$$

with $E(u) = 0$ and $E(u^2) = \sigma^2 S^{2\alpha_1}$, and sometimes with the constraint $\alpha_1 = \alpha_2$. (For our computations we have also scaled COST by a factor 1/379.745.) It is easy to verify that this specification is within the family of ratio models (1a), with $Z_{ji}$ defined to be $\log(S_i)$.

The table summarizes our numerical analysis, first considering (9) to be unconstrained (full), and then under various constraints. In each case the table shows the maximum likelihood estimates of the coefficients $\alpha_j$ and $\beta_j$ and their respective standard errors. The table also gives several measures of the goodness of fit of each specification $(k = \text{I}, \dots, \text{V})$, namely the standard error $\hat\sigma_k$ determined by (6b), the likelihood ratio of each constrained model

Table. Summary of Illustrative Analyses

Specification       Row     (1)      (2)        (3)    (4)    (5)    (6)    (7) σ̂    (8) L*   (9) χ²   (10) df   (11) χ²_.05

I   Full            α̂_j     .60      1.72       1.65   5.92   3.26   .84    .19442    1.0
                    S_α     .59      1.07       .83    2.74   1.99   1.96
                    β̂_j     -20.2    .25E-3     .31    .20    -.15   .19
                    S_β     3.7      4.22E-3    .04    .12    .07    .09

II  α_1 = α_2       α̂_j     1.07     1.07       1.09   4.51   2.30   1.10   .19624    .695     .73      1         3.84
                    S_α     .52      .52        .46    2.40   1.51   1.87
                    β̂_j     -19.2    -1.21E-3   .31    .22    -.16   .19
                    S_β     3.2      2.79E-3    .04    .11    .06    .09

III α_j fixed       α_j     1.0      1.0        1.0    1.0    1.0    1.0    .20642    .097     4.67     6         12.59
                    β̂_j     -19.0    -.47E-3    .30    .27    -.16   .20
                    S_β     2.2      .21E-3     .03    .09    .04    .07

IV  α_j fixed       α_j     1.0      0.         0.     0.     0.     0.     .20974    .052     5.92     6         12.59
                    β̂_j     -21.0    1.27E-3    .31    .26    -.15   .16
                    S_β     2.3      .20E-3     .03    .08    .04    .07

V   α_j fixed       α_j     0.       0.         0.     0.     0.     0.     .22037    .007     9.77     6         12.59
                    β̂_j     -21.0    1.18E-3    .31    .29    -.16   .17
                    S_β     2.4      .22E-3     .04    .09    .05    .08

$$L^*_k = \hat{L}_k / \hat{L}_{\text{I}}$$

and the chi-square statistic for testing the null hypothesis associated with the constrained model, $\chi^2_k = -2 \log L^*_k$. By (6c), the likelihood ratio reduces to $L^*_k = (\hat\sigma_{\text{I}} / \hat\sigma_k)^n$. Under the null hypothesis that the constrained model is a correct specification, $\chi^2_k$ asymptotically has the chi-square distribution with degrees of freedom (df) equal to the number of constraints, so that a critical value (e.g. $\chi^2_{.05}$) can be easily selected for testing the constrained model against the full model.

A careful examination of the table provides a number of helpful insights, although our findings must be considered rather tentative due to the relatively small sample available. In the full model, $\hat\alpha_1 = .6$, which suggests some heteroscedasticity in the relationship determining COST, with the standard deviation of the disturbance approximately in proportion to the square root of plant size. The high values of $\hat\alpha_2, \dots, \hat\alpha_5$ suggest a rather strong relationship between COST and SIZE, resembling a quadratic function of SIZE with strong interaction between SIZE and LOCI and EXP. The indication is that some deflation of COST is called for to adjust for heteroscedasticity, but the independent variables should not be deflated as in weighted least squares; rather, they should be inflated for SIZE.
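The likelihood ratios and chi-square statistics of the table can be recomputed from the reported standard errors alone. The sketch below assumes $n = 39$ (the size of the plant database cited in Section 4) and uses the relation $\hat{L}_k / \hat{L}_{\text{I}} = (\hat\sigma_{\text{I}} / \hat\sigma_k)^n$ implied by (6c); the function and variable names are illustrative only.

```python
import math

n = 39  # number of plants in the database (assumed from Section 4)

# Standard errors of the regressions as reported in the table.
sigma_hat = {"I": 0.19442, "II": 0.19624, "III": 0.20642, "IV": 0.20974, "V": 0.22037}

def likelihood_ratio(sigma_k, sigma_full, n):
    """L*_k = L_k / L_I, which by (6c) equals (sigma_I / sigma_k)^n."""
    return (sigma_full / sigma_k) ** n

def chi_square(sigma_k, sigma_full, n):
    """Chi-square statistic -2 log L*_k for the constrained model k."""
    return -2.0 * math.log(likelihood_ratio(sigma_k, sigma_full, n))

L_star = {k: likelihood_ratio(s, sigma_hat["I"], n) for k, s in sigma_hat.items()}
chi2 = {k: chi_square(s, sigma_hat["I"], n) for k, s in sigma_hat.items()}

for k in sigma_hat:
    print(f"{k:>3}: L* = {L_star[k]:.3f}, chi-square = {chi2[k]:.2f}")
```

Comparing two constrained models with each other only requires the ratio of their `L_star` values, since the full-model likelihood cancels.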

The second analysis (II: $\alpha_1 = \alpha_2$) is representative of several attempts to simplify the full model without undue loss of goodness of fit. The success of the simplification should be subjectively evaluated in terms of the plausibility of the specification and of the corresponding estimates, and the sacrifice of fit as measured by $\hat\sigma$, $\chi^2$, or (perhaps most suitably) by $L^*$. Under the constraint that $\alpha_1 = \alpha_2$ the standard deviation of the disturbance is estimated to be almost directly proportional to SIZE, and the first two terms of the predictive equation are almost quadratic in SIZE, with the remaining terms similar to the full model. The likelihood ratio of this specification compared to the full model is about 7 to 10 and certainly "acceptable" in the technical sense of the asymptotic $\chi^2$ test.

We might interject a note about the computation of the estimates. The scoring procedure was used with the initial values of the deflation coefficients $\alpha_1, \dots, \alpha_6$ taken to be one and the $\beta_1, \dots, \beta_6$ conditionally determined from ordinary least squares using the deflated variables $X^*$, $Y^*$. Then a second data set was set up corresponding to $X$, $E$, and the corrections to the deflation coefficients were determined from a second ordinary least squares regression. This procedure was iterated until convergence of $\hat\sigma$, the (unadjusted) standard error of the conditional regression. In our experience, satisfactory convergence occurred in two to six cycles

although convergence was not uniformly good for all deflation coefficients. In fact $\hat\alpha_6$ usually oscillated rather widely, indicating poor identifiability of this parameter.

In analyses III-V of the table, the linear coefficients $\beta_1, \dots, \beta_6$ are estimated conditional on specified values of the deflation coefficients. Specification III in its deflated form is equivalent to the final model reported by Mooz. (Mooz uses 1000 COST/SIZE rather than COST/S. Moreover our COST is multiplicatively rescaled as already noted. Mooz's estimates can be obtained by multiplying all statistics of III except $L^*$ and $\chi^2$ by the factor 466.64.) Specification III achieves a considerable simplification at a rather high but not conventionally significant cost in goodness of fit. A comparison of III with II suggests that Mooz might find it desirable to add interaction terms of SIZE with both LOCI and EXP.

Analyses IV and V explore other plausible, simple specifications. In IV, COST is linearly related to undeflated independent variables, but with the standard deviation of the disturbances proportional to SIZE, as in the heteroscedastic specification. V is a homoscedastic specification relating COST to the undeflated independent variables. The likelihood ratios of IV and V relative to III are about .54 and .07 respectively, so III seems to be clearly superior among these three simple conditional models.
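The scoring cycle mentioned in the computational note of this section can be sketched in a few lines. The example below is an illustration on synthetic data, not the computation behind the table: it assumes scalar $Z_{ji}$ and $X_{ji}$ ($p_j = q_j = 1$), two terms with both deflation coefficients free ($\ell = m = 2$), a size variable `z` centred so that assumption (iv) holds, and hypothetical function names. It builds the stacked data matrix and pseudo-response of Section 3, compares the score $X'E/\sigma_0^2$ with a finite-difference gradient of the log-likelihood, and takes one correction step by ordinary least squares.

```python
import numpy as np

def pieces(theta, Y, X, Z):
    """Deflated regressors X*, their companions Z*, and residuals eps (scalar case)."""
    m = X.shape[1]
    alpha, beta = theta[:m], theta[m:]
    w = np.exp(-Z[:, 0] * alpha[0])               # e^{-Z_1i alpha_1}
    X_star = np.exp(Z * alpha) * X * w[:, None]   # definition (2a)
    Z_star = (X_star * beta) * Z                  # definition (2b)
    eps = w * Y - X_star @ beta                   # residual, equation (5)
    return X_star, Z_star, eps

def augmented(theta, Y, X, Z):
    """Stacked 2n-row data matrix A and pseudo-response E for one scoring step."""
    X_star, Z_star, eps = pieces(theta, Y, X, Z)
    n, m = X.shape
    sigma2 = np.mean(eps ** 2)                    # sigma_0^2 from (6b)
    sigma = np.sqrt(sigma2)
    lower = np.zeros((n, 2 * m))
    lower[:, 0] = np.sqrt(2.0) * sigma * Z[:, 0]  # only alpha_1 enters the variance
    A = np.vstack([np.hstack([Z_star, X_star]), lower])
    E = np.concatenate([eps, (eps ** 2 - sigma2) / (np.sqrt(2.0) * sigma)])
    return A, E, sigma2

def log_lik(theta, sigma2, Y, X, Z):
    """Log-likelihood (4) with sigma^2 held fixed."""
    eps = pieces(theta, Y, X, Z)[2]
    return -0.5 * len(Y) * np.log(2 * np.pi * sigma2) - 0.5 * np.sum(eps ** 2) / sigma2

def scoring_step(theta, Y, X, Z):
    """One correction: ordinary least squares of E on A, added to the trial value."""
    A, E, sigma2 = augmented(theta, Y, X, Z)
    delta = np.linalg.lstsq(A, E, rcond=None)[0]
    return theta + delta, sigma2

# Synthetic data (all values hypothetical).
rng = np.random.default_rng(0)
n = 40
z = rng.normal(size=n)
z -= z.mean()                                     # assumption (iv)
Z = np.column_stack([z, z])
X = np.column_stack([np.ones(n), rng.normal(size=n)])
true = np.array([0.5, 1.0, 1.0, 0.5])             # alpha_1, alpha_2, beta_1, beta_2
Y = (np.exp(Z * true[:2]) * X) @ true[2:] + 0.1 * np.exp(0.5 * z) * rng.normal(size=n)

theta0 = np.array([1.0, 1.0, 0.8, 0.4])           # trial value
A, E, sigma2_0 = augmented(theta0, Y, X, Z)
score = A.T @ E / sigma2_0                        # efficient score, as in (8a)

# Central finite-difference gradient of the log-likelihood at theta0.
h = 1e-6
num_grad = np.array([
    (log_lik(theta0 + h * e, sigma2_0, Y, X, Z) - log_lik(theta0 - h * e, sigma2_0, Y, X, Z)) / (2 * h)
    for e in np.eye(4)
])

theta1, _ = scoring_step(theta0, Y, X, Z)         # revised estimate, as in (7c)
```

In practice `scoring_step` would be repeated until the returned `sigma2` stabilizes. No step-size control is included here, so convergence from a poor trial value is not guaranteed.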

5. Concluding Comments

Evaluating the adequacy of alternative model specifications is an essential aspect of most data analysis. The coefficient of determination $R^2$ has seemingly become the statistic of choice for such comparisons when the dependent variable is invariant. Of course the standard error of the regression (unadjusted or adjusted) could serve equally well in this case.

When the comparisons involve transformations of the dependent variable, it is widely recognized that the coefficient of determination is generally misleading because of changes in the total sum of squares. Fortunately, Box and Cox (1964) have given us a simple method of comparing models involving power and logarithmic transformations of the dependent variable. In the Box-Cox case, the dependent variable is multiplicatively standardized to have unitary geometric mean, and the standard error of the regression or the associated likelihood ratio is used for comparisons.

We have shown in this paper that an analogous standardization of the deflation variables yields a similar approach to the comparison of specifications involving ratio transformations or heteroscedasticity. In fact a wide family of specifications involving both Box-Cox transformations and our deflations can be easily investigated provided that the dependent variable and the deflation variables are suitably standardized.
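The role of the standardization can be illustrated with a few lines of code. This is a minimal sketch with hypothetical plant capacities (only the endpoints 457 and 1130 megawatts come from Section 4): dividing the deflation variable by its geometric mean makes its logarithms sum to zero, so the log of the Jacobian of the transformation from $Y_i$ to $Y^*_i$ vanishes for every value of the deflation coefficient.

```python
import numpy as np

# Hypothetical capacities in megawatts; only the two endpoints are taken from the text.
size = np.array([457.0, 600.0, 820.0, 1050.0, 1130.0])

gm = np.exp(np.mean(np.log(size)))   # geometric mean of SIZE
S = size / gm                        # standardized deflation variable
z = np.log(S)                        # plays the role of Z_1i in assumption (iv)

def log_jacobian(alpha1, z):
    """Log of the Jacobian of Y_i -> Y*_i = exp(-z_i * alpha1) * Y_i."""
    return -alpha1 * np.sum(z)

standardized = log_jacobian(0.6, z)              # vanishes up to rounding
unstandardized = log_jacobian(0.6, np.log(size)) # depends on alpha1 and on the units of SIZE
```

Without the standardization the Jacobian term would have to be carried along in every likelihood comparison, and the standard errors of the regressions would no longer be directly comparable across specifications.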

6. References

Box, G.E.P. and Cox, D.R. (1964), "An Analysis of Transformations," Journal of the Royal Statistical Society, Series B, 26, 211-243.

Draper, N.R. and Smith, H. (1966), Applied Regression Analysis, New York: Wiley.

Fisher, G.R. (1957), "Maximum Likelihood Estimators with Heteroscedastic Errors," Review of the International Statistical Institute, 25, 52-55.

Glejser, H. (1969), "A New Test for Heteroscedasticity," Journal of the American Statistical Association, 64, 316-323.

Granger, C.W. and Newbold, P. (1976), "The Use of R2 to Determine the Appropriate Transformation of Regression Variables," Journal of Econometrics, 4, 205-210.

Griliches, Zvi (1972), "Cost Allocation in Railroad Regulation," The Bell Journal of Economics and Management Science, 26-41.

Harvey, A.C. (1976), "Estimating Regression Models with Multiplicative Heteroscedasticity," Econometrica, 44, 461-464.

Johnston, J. (1972), Econometric Methods, Second Edition, New York: McGraw-Hill.

Kmenta, J. (1971), Elements of Econometrics, New York: Macmillan.

Kuh, E. and Meyer, J.R. (1955), "Correlation and Regression Estimates when the Data are Ratios," Econometrica, 23, 400-416.

Madansky, A. (1964), "Spurious Correlation Due to Deflating Variables," Econometrica, 32, 652-655.

Mooz, W.E. (1978), Cost Analysis of Light Water Reactor Power Plants (Report), Santa Monica: Rand, R-2304-DOE.

Park, R.E. (1966), "Estimation with Heteroscedastic Error Terms," Econometrica, 34, 888.

Pearson, K. (1897), "On a Form of Spurious Correlation which may Arise when Indices are used in the Measurement of Organs," Proceedings of the Royal Society of London, 60.

Rao, C.R. (1973), Linear Statistical Inference and its Applications, Second Edition, New York: Wiley.

Rutemiller, H.C. and Bowers, D.A. (1968), "Estimation in a Heteroscedastic Regression Model," Journal of the American Statistical Association, 63, 552-557.

Taylor, W.E. (1978), "The Heteroscedastic Linear Model: Exact Finite Sample Results," Econometrica, 46, 663-676.