RESEARCH SUPPORT
UNIVERSITY OF MICHIGAN BUSINESS SCHOOL
FEBRUARY 1997

ROBUST ESTIMATION OF SEMIPARAMETRIC REGRESSION MODELS

WORKING PAPER #9712-03

BY MARTIN R. YOUNG, UNIVERSITY OF MICHIGAN
AND XUMING HE, UNIVERSITY OF ILLINOIS

Robust Estimation of Semiparametric Regression Models

Martin R. Young (a,*), Xuming He (b)

(a) University of Michigan Business School, Ann Arbor, MI 48109
(b) University of Illinois, Champaign, IL 61820

February, 1997

Abstract

The goodness-of-fit of multiple linear regression models can often be improved by preliminary transformation of the dependent variable. In this paper, we consider the model $h(y) = \beta'x + \epsilon$, where $h(\cdot)$ is a monotone function, and present a method for joint estimation of the transformation function $h(\cdot)$ and the regression coefficients $\beta$, via minimization of a sum of absolute deviations loss function. The resulting estimator is robust with respect to outliers, and has a simple and direct numerical solution, using linear programming. The paper presents consistency results, and applications of the technique to simulated and real economic data.

KEY WORDS: index model, least absolute deviations, splines, transformation
JEL CLASSIFICATION: C14

* Corresponding author. Phone: 313-936-1332, Fax: 313-936-0274, email: myoung@umich.edu. The research of Xuming He was supported by NSA Grant MDA904-96-1-0011.

1 Introduction

The goal of linear regression modeling is the identification of the statistical equation relating predictor variables x to some response variable y. The standard linear regression model assumes that the predictor variables are linearly related to the dependent variate; the validity of this assumption can often be improved by preliminarily transforming the dependent variable nonlinearly, and then applying the linear model to the transformed data. Box and Cox (1964) was an early and influential contribution to the understanding of transformations in regression; Carroll and Ruppert (1988) provides an overview of classical and modern methods for choosing transformations for regression models. In this paper, we present a new method for automatically identifying a transformation which optimizes the goodness-of-fit between the transformed dependent variate and a linear combination of the predictor variables. In this approach, the transformation function is chosen nonparametrically, without restricting the function to belong to a class such as the power function class. Because the predictor is determined as a parametric, specifically linear, function of the independent variables, the overall procedure is termed semiparametric (Horowitz 1996). The method in this paper uses the least absolute deviations (LAD) goodness-of-fit criterion, and as a result is resistant to outliers in the dependent variable. Finally, the method is very computationally efficient: the transformation function and the regression coefficients are estimated jointly as the solution to a single linear programming problem. Throughout the paper, the proposed procedure will be referred to as the "ROSE" technique, short for "Robust Semiparametric Estimation".

Section 2 of this article describes the proposed method for robust semiparametric regression estimation, section 3 discusses inference issues, section 4 describes the application of the ROSE technique to some real and simulated data, section 5 presents results on the asymptotic properties of the estimator, and section 6 gives further discussion, including references to related work.

2 The Algorithm

The data for the problem under consideration are pairs $(y_i, x_i)$, $i = 1, \ldots, n$, where the $y_i$ are observations on a dependent variable, and $x_i = (x_{i0}, \ldots, x_{ip})$ are predictor variables, with, typically, $x_{i0} = 1$ for all $i$. The usual model for relating dependent and predictor variables is the linear regression model:

$$y_i = \beta' x_i + \epsilon_i, \qquad (1)$$

where the $\epsilon_i$ are zero-mean residuals. In this paper, we consider a generalization of model (1), namely:

$$h(y_i) = \beta' x_i + \epsilon_i, \qquad (2)$$

where $h(\cdot)$ is a monotone, nonlinear transformation function. The objective of the analysis is to identify the function $h(\cdot)$ and the coefficient vector $\beta$ to optimize the goodness-of-fit between $h(y_i)$ and $\beta' x_i$. In order to achieve robustness with respect to outliers in the dependent variable, we use the LAD criterion to assess goodness-of-fit; see Dielman and Pfaffenberger (1982) and Gonin and Money (1989) for overviews of LAD methods in robust regression. The semiparametric LAD estimation problem can be formulated as:

$$\min_{h, \beta} \sum_{i=1}^{n} |h(y_i) - \beta' x_i| \qquad (3)$$

subject to

$$h(y_{(1)}) = y_{(1)}, \qquad (4)$$
$$h(y_{(n)}) = y_{(n)}, \qquad (5)$$

where $y_{(1)}$ and $y_{(n)}$ are, respectively, the smallest and largest values of $y$ in the dataset. The optimization in (3) is over the product space $H \times R^{p+1}$, where $H$ is the space of monotone

functions $h(\cdot)$ satisfying boundary conditions (4) and (5). The boundary conditions are needed in order to prevent the trivial solution $h(y) \equiv 0$, $\beta_0 = \cdots = \beta_p = 0$, which would have a sum of absolute deviations equal to zero. Because the boundary conditions constrain both the scale and location of the transformation, it will typically be necessary to include the intercept term $x_{i0} = 1$ in order to achieve a good fit between the $h(y_i)$ and $\beta' x_i$. The values $h(y_{(1)})$ and $h(y_{(n)})$ can clearly be constrained to have arbitrary, distinct values; the choice represented in equations (4)-(5) leads to a location and scale for the transformed variable that match those of the original dependent variable, which correspondence may be helpful in interpreting the estimated model.

To implement the estimation, the function $h(\cdot)$ can be represented as a linear combination of spline basis functions:

$$h(y) = \sum_{k=0}^{q} \alpha_k \phi_k(y), \qquad (6)$$

where the $\phi_k(y)$ are known basis functions. In the examples described in this paper, the linear "power spline" basis (Smith 1979) is used. For a linear power spline with knots at locations $k_1, \ldots, k_r$, the series expansion has $r + 2$ basis functions, and is given by:

$$h(y) = \alpha_0 + \alpha_1 y + \sum_{j=2}^{q} \alpha_j (y - k_{j-1})_+, \qquad (7)$$

where $(z)_+ = \max\{z, 0\}$, and $q = r + 1$. In this parameterization, $h(y)$ is monotone as long as the following linear restrictions on $\alpha$ are maintained: $\alpha_1 \geq 0$, $\alpha_1 + \alpha_2 \geq 0$, ..., $\alpha_1 + \cdots + \alpha_q \geq 0$. Ramsay (1988) presents an alternative parameterization for splines, called I-splines, which has the convenient property that monotonicity of the spline function is ensured simply by enforcing positivity of all coefficients $\alpha_k$ in the basis expansion (6); the I-spline basis could also be used in the procedure described below.
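To make (6)-(7) concrete, the following Python sketch (our illustration; the paper gives no code, and the function names are ours) builds the power-spline basis matrix and the linear monotonicity constraints on $\alpha$:

```python
import numpy as np

def power_spline_basis(y, knots):
    """Basis of (7): phi_0(y) = 1, phi_1(y) = y,
    phi_j(y) = (y - k_{j-1})_+ for j = 2, ..., q = r + 1."""
    y = np.asarray(y, dtype=float)
    cols = [np.ones_like(y), y] + [np.maximum(y - k, 0.0) for k in knots]
    return np.column_stack(cols)              # shape (n, q + 1)

def monotonicity_constraints(n_knots):
    """Rows A with A @ alpha >= 0 enforcing a monotone h:
    alpha_1 >= 0, alpha_1 + alpha_2 >= 0, ..., alpha_1 + ... + alpha_q >= 0."""
    q = n_knots + 1
    A = np.zeros((q, q + 1))
    for m in range(q):
        A[m, 1:m + 2] = 1.0                   # partial sums of alpha_1, ..., alpha_{m+1}
    return A
```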

Given the linear power spline representation for $h(y)$, the estimation problem can be written as:

$$\min \sum_{i=1}^{n} (e_i^+ + e_i^-) \qquad (8a)$$

subject to

$$\sum_{k=0}^{q} \alpha_k \phi_k(y_i) - \sum_{k=0}^{p} \beta_k x_{ik} = e_i^+ - e_i^-, \quad i = 1, \ldots, n, \qquad (8b)$$

$$\sum_{k=0}^{q} \alpha_k \phi_k(y_{(1)}) = y_{(1)}, \qquad (8c)$$

$$\sum_{k=0}^{q} \alpha_k \phi_k(y_{(n)}) = y_{(n)}, \qquad (8d)$$

$$\alpha_1 \geq 0, \qquad (8e)$$

$$\alpha_1 + \alpha_2 \geq 0, \qquad (8f)$$

$$\vdots$$

$$\alpha_1 + \cdots + \alpha_q \geq 0, \qquad (8g)$$

$$e_i^+ \geq 0, \; e_i^- \geq 0, \quad i = 1, \ldots, n. \qquad (8h)$$

The quantities $e_i^+$ and $e_i^-$ are the positive and negative parts of the residual $h(y_i) - \beta' x_i$. The optimization in (8) is over the variables $\{e_i^+, e_i^-, i = 1, \ldots, n\}$, $\{\alpha_k, k = 0, \ldots, q\}$, and $\{\beta_k, k = 0, \ldots, p\}$. The objective function and constraints are linear in the decision variables; thus the problem can be solved as a single linear programming problem, using standard methods such as the simplex, dual simplex, or interior point algorithms. The robust semiparametric estimator (ROSE) is defined, then, as the minimizing solution to linear program (8). Because the model (2) is estimated using linear programming, it is simple to incorporate additional linear constraints on the coefficients $\alpha$ and $\beta$, should such constraints be appropriate. For example, in certain applications, the signs of the linear predictor coefficients $\beta_k$ will be known a priori; adding constraints $\beta_k \geq 0$ or $\beta_k \leq 0$ will not change the linear programming character of the solution.
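Linear program (8) maps directly onto a generic LP solver. Below is a minimal sketch using SciPy's HiGHS backend rather than the Barrodale-Roberts code the authors used; it reuses `power_spline_basis` and `monotonicity_constraints` from the sketch above, and all names are our own:

```python
import numpy as np
from scipy.optimize import linprog

def fit_rose(y, X, knots):
    """Solve LP (8): joint LAD estimation of the spline coefficients alpha
    and the regression coefficients beta.  X should contain an intercept column."""
    n, p1 = X.shape
    Phi = power_spline_basis(y, knots)
    q1 = Phi.shape[1]                              # q + 1 spline coefficients
    nvar = q1 + p1 + 2 * n                         # alpha, beta, e+, e-

    c = np.concatenate([np.zeros(q1 + p1), np.ones(2 * n)])      # objective (8a)

    # (8b): Phi @ alpha - X @ beta - e+ + e- = 0
    A_eq = np.hstack([Phi, -X, -np.eye(n), np.eye(n)])
    b_eq = np.zeros(n)

    # (8c)-(8d): boundary conditions h(y_(1)) = y_(1) and h(y_(n)) = y_(n)
    for val in (y.min(), y.max()):
        row = np.zeros(nvar)
        row[:q1] = power_spline_basis([val], knots)[0]
        A_eq = np.vstack([A_eq, row])
        b_eq = np.append(b_eq, val)

    # (8e)-(8g): monotonicity, written as -A @ alpha <= 0
    A_ub = np.zeros((q1 - 1, nvar))
    A_ub[:, :q1] = -monotonicity_constraints(len(knots))

    bounds = [(None, None)] * (q1 + p1) + [(0, None)] * (2 * n)  # (8h)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(q1 - 1), A_eq=A_eq,
                  b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:q1], res.x[q1:q1 + p1]           # alpha-hat, beta-hat
```

Extra linear restrictions, such as concavity ($\alpha_k \leq 0$ for $k \geq 2$) or sign constraints on the $\beta_k$, simply add rows to `A_ub` or tighten the variable bounds.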

Incorporating such prior information may be helpful in regularizing parameter estimates in cases of multicollinearity or small sample size; see, e.g., the example in section 4.3 below. Also, the transformation $h(y)$ can be restricted to be concave or convex through linear restrictions on the $\alpha$ parameters. For example, with the power spline parameterization for $h(y)$ in equation (7), $h(y)$ will be guaranteed to be concave if $\alpha_k \leq 0$ for $k = 2, \ldots, q$. Solving the linear program (8) is typically rapid; for example, for the data described in section 4.2, with 506 observations, 14 predictor variables, and 6 basis function coefficients used to estimate the transformation function, the optimal parameter estimates are obtained in less than 1 second on a 66 MHz PC with an Intel 486DX CPU, using the algorithm of Barrodale and Roberts (1978).

Because the transformation function $h(\cdot)$ is constrained to be monotone, oversmoothing is not as significant a concern as it is in a typical nonparametric regression problem. However, it is useful to have a heuristic for choosing a model with as few knots as are necessary to achieve a good fit to the data. An approach which seems to provide reasonable model choice is to consider a small number of different values for $q$, the number of parameters in the spline transformation (e.g., $q = 2, 4, 6$, and $8$), and to select the model which minimizes the AIC-like measure $\mathrm{AIC} = -2\hat{\ell} + 2(p + q)$, where

$$\hat{\ell} = n \log n - n \log \left( \sum_{i=1}^{n} \left| h(y_i) - \hat{\beta}' x_i \right| \right) + \sum_{i=1}^{n} \log h'(y_i) \qquad (9)$$

is a surrogate for a concentrated log-likelihood function, assuming Laplace errors, and $p + q$ is the total number of free parameters ($p + 1$ for the regression coefficients, plus $q + 1$ for the spline transformation, minus 2 for the normalizing constraints). The quantity $h'(y_i)$ is the derivative of $h(y)$ at $y_i$, which enters the likelihood function via the Jacobian of the transformation, and which can be evaluated easily as $h'(y_i) = \sum_{k=0}^{q} \alpha_k \phi_k'(y_i)$; for the power spline basis, with $\phi_0(y) = 1$, $\phi_1(y) = y$, and $\phi_j(y) = (y - k_{j-1})_+$, $j = 2, \ldots, q$, the derivatives are $\phi_0'(y) = 0$, $\phi_1'(y) = 1$, and $\phi_j'(y) = I(y > k_{j-1})$, $j = 2, \ldots, q$. Hastie and Tibshirani (1990, chapter 7) describe a related approach to model selection in the context of transformation estimation.
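A sketch of this selection heuristic, under our reconstruction of (9) above and again reusing the earlier helpers (the names are hypothetical, and the fitted spline is assumed strictly increasing at every $y_i$ so that the Jacobian term is finite):

```python
import numpy as np

def rose_aic(y, X, knots, alpha, beta):
    """AIC-like criterion: -2 * loglik + 2 * (p + q), with the surrogate
    concentrated Laplace log-likelihood of equation (9)."""
    n = len(y)
    resid = power_spline_basis(y, knots) @ alpha - X @ beta
    # h'(y_i) = alpha_1 + sum_{j >= 2} alpha_j * 1(y_i > k_{j-1})
    hprime = alpha[1] + (np.asarray(y)[:, None] > np.asarray(knots)) @ alpha[2:]
    loglik = n * np.log(n) - n * np.log(np.abs(resid).sum()) + np.log(hprime).sum()
    p, q = X.shape[1] - 1, len(knots) + 1
    return -2 * loglik + 2 * (p + q)
```

One would evaluate `rose_aic` for each candidate knot count (e.g., $q = 2, 4, 6, 8$) and retain the fit with the smallest value.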

LAD methods have previously been employed in nonparametric regression settings in Koenker, Ng, and Portnoy (1994), Young (1996), and He and Shi (1996). The model described in this paper differs from earlier work in that the nonlinear transformation being estimated is applied to the dependent variable, rather than the independent variates.

3 Inference on Regression Coefficients

The issue of how inference on regression coefficients should proceed in the face of uncertainty about the appropriate transformation for the dependent variable is not fully resolved. Bickel and Doksum (1981) showed that the sampling variability of the estimator $\hat{\beta}$ in a Box-Cox regression model may be considerably greater when the transformation parameter is estimated than when it is fixed. Box and Cox (1982) and Hinkley and Runger (1984) have argued, though, that the parameter $\beta$ is not "physically meaningful" unless the scaling for the dependent variable is considered fixed, and therefore argue for making inference on $\beta$ conditional upon the estimated transformation. If this latter view is taken, then inference on the coefficients $\beta$ in the robust semiparametric model can be made, conditional upon the estimated transformation $\hat{h}(y)$, using the sampling theory for LAD estimators developed in Bassett and Koenker (1978). This is so because, if $\hat{h}(y)$ is conditioned upon, or fixed, the estimate of $\beta$ in the robust semiparametric regression is an LAD regression estimate. Bassett and Koenker (1978) demonstrated that an LAD regression coefficient estimator $\hat{\beta}$ is asymptotically unbiased and normally distributed with covariance matrix $\tau^2 (X'X)^{-1}$, where $X$ is the usual regression design matrix, and $\tau^2/n$ is the variance of the median of a sample of $n$ observations from the error distribution. Thus, for example, a $(1 - \alpha)100\%$ confidence interval for a linear compound $r'\beta$ can be written as $r'\hat{\beta} \pm z_{\alpha/2} \, \tau \, [r'(X'X)^{-1} r]^{1/2}$ (Dielman and Pfaffenberger 1982). A consistent estimate of the quantity $\tau$ can be obtained based on the residuals from the regression. Let the ordered residuals be denoted by $e_{(i)}$; then an estimate of $\tau$ is given by:

$$\hat{\tau} = \frac{e_{(t)} - e_{(s)}}{2(t - s)/n}, \qquad (10)$$

where $t$ and $s$ are symmetric around the median sample residual, and $t - s$ is small relative to $n$; Dielman and Pfaffenberger (1982) suggest the estimate is not sensitive to the choice of $t - s$. Other estimates of $\tau$ based on kernel density estimation or neighboring regression quantiles may also be used.
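As a sketch of the conditional inference just described (our code, not the authors'; the half-window `m` is a heuristic assumption, since the paper only requires $t - s$ to be small relative to $n$):

```python
import numpy as np
from scipy.stats import norm

def lad_confint(resid, X, beta, r, level=0.95, m=None):
    """Normal-theory CI for r'beta, conditional on the fitted transformation,
    using the tau^2 (X'X)^{-1} asymptotics of Bassett and Koenker (1978)."""
    n = len(resid)
    e = np.sort(resid)
    if m is None:
        m = max(1, int(np.ceil(np.sqrt(n) / 2)))        # heuristic: t - s = 2m
    med = n // 2
    tau = (e[med + m] - e[med - m]) / (2 * (2 * m) / n)  # equation (10)
    se = tau * np.sqrt(r @ np.linalg.solve(X.T @ X, r))
    z = norm.ppf(0.5 + level / 2)
    return r @ beta - z * se, r @ beta + z * se
```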

4 Examples

4.1 Seasonal Regression Modeling

A common model for time series forecasting is the seasonal regression model (Cryer 1986):

$$y_t = \beta_0 + \beta_1 t + \beta_2 Q2 + \beta_3 Q3 + \beta_4 Q4 + \epsilon_t, \qquad (11)$$

where

$y_t$ = the time series value at time period $t$
$Q2$ = 1 if period $t$ is in Quarter 2; 0 otherwise
$Q3$ = 1 if period $t$ is in Quarter 3; 0 otherwise
$Q4$ = 1 if period $t$ is in Quarter 4; 0 otherwise.

The term $\beta_1 t$ in equation (11) captures the linear trend in the time series, while the terms $\beta_2 Q2$, $\beta_3 Q3$, and $\beta_4 Q4$ account for seasonal variation. Often it is necessary to nonlinearly transform the data $y_t$ before the linear trend model is appropriate. Here, we estimate the

optimal transformation automatically from the data, through fitting the model:

$$h(y_t) = \beta_0 + \beta_1 t + \beta_2 Q2 + \beta_3 Q3 + \beta_4 Q4 + \epsilon_t. \qquad (12)$$

Figure 1 displays aggregate sales figures for the Ford Motor Company, in nominal dollars; note the nonlinearity of the trend. Also, the additivity of model (11) is questionable for these data, as the seasonal variation appears to increase in magnitude over time. Figure 2 shows the transformation function $h(y)$ estimated by solving (8), using 4 knots spaced at the quintiles of the empirical distribution for $y$. Figure 3 shows the transformed time series; here the linear trend assumption appears more valid. Fortuitously, this transformation also appears to render the model more nearly additive: on the transformed scale, the seasonal variation is fairly constant over time. The model in (12) provides a forecast equation for $h(y_t)$ for some future time $t$; it is elementary to invert the monotone function $h(y)$ to obtain a forecast for $y_t$, as sketched below.

[Insert Figures 1-3 Here]
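Inversion of the fitted transformation is straightforward because $h$ is piecewise linear and strictly increasing; a minimal bisection sketch, assuming the `power_spline_basis` helper above and a forecast `target` for $h(y_t)$ lying between $h(y_{(1)})$ and $h(y_{(n)})$:

```python
import numpy as np

def invert_h(target, knots, alpha, lo, hi):
    """Solve h(y) = target for y on [lo, hi] by bisection; valid since the
    monotonicity constraints make the fitted power spline increasing."""
    h = lambda v: power_spline_basis([v], knots)[0] @ alpha
    a, b = lo, hi
    for _ in range(80):                # 80 halvings: ample precision
        mid = 0.5 * (a + b)
        if h(mid) < target:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)
```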

4.2 Boston Housing Data

Harrison and Rubinfeld (1978) present a hedonic price index model for estimating homeowners' marginal willingness-to-pay for various characteristics of a neighborhood, including crime rate and environmental quality. The unit of observation in this study was a census tract in Boston, and the entire set of variables used in the analysis was:

$y$ = MEDV = median housing value (in $1000s)
$x_1$ = CRIM = per capita crime rate
$x_2$ = ZN = proportion of land zoned for lots greater than 25,000 square feet
$x_3$ = INDUS = proportion of nonretail business acres
$x_4$ = CHAS = bounds Charles River (1 = yes, 0 = no)
$x_5$ = NOX = nitrous oxide concentration (parts per 10 million)
$x_6$ = RM = average number of rooms in home, squared
$x_7$ = AGE = proportion of owner-occupied units built prior to 1940
$x_8$ = DIS = weighted distance to five employment centers in the Boston region
$x_9$ = RAD = index of accessibility to radial highways
$x_{10}$ = TAX = full-value property tax rate (per $10,000)
$x_{11}$ = PTRATIO = pupil-teacher ratio
$x_{12}$ = B = 1000(Bk - 0.63)^2, where Bk is the proportion of blacks in the population
$x_{13}$ = LSTAT = proportion of the population that is lower status

Belsley, Kuh, and Welsch (1980) state that the data for this study "possess much heavier tails than the Gaussian (normal) distribution"; they estimate the model using iterative robust regression procedures, with log-transformed median housing value as the $y$ variate in the regression. Here, we use median value as the $y$ variate, and automatically select an appropriate scaling $h(y)$ via the ROSE procedure. With just 1 knot for the spline expansion of $h(y)$, the estimated transformation is very close to a log transformation, with a Pearson correlation between $h(y)$ and $\log(y)$ of 0.993. Table 1 displays the normalized coefficient estimates obtained via the ROSE procedure, via robust (LAD) regression on $\log(y)$, and via OLS estimation on $\log(y)$. It is seen that the robust semiparametric method produces coefficient estimates similar to those obtained via robust regression applied to the log-transformed data. A possible advantage of the semiparametric method is that it does not require the user to recognize the need for a log transformation prior to analysis of the data.

[Insert Table 1 Here]

4.3 Automobile Fuel Efficiency

Lock (1993) provides a dataset of specifications for new car models from the 1993 model year; measures are provided for evaluating price, fuel economy, engine size, body size, and features. The cars included in the dataset were selected at random from among 1993 passenger car models. Here, regression models are estimated for predicting city fuel economy as a function of various covariates. The analysis includes all complete cases listed in the dataset - a total of 82 observations - and incorporates the following variables:

$y$ = CITYMPG = city fuel efficiency (mpg)
$x_1$ = CYL = number of cylinders
$x_2$ = ENGSIZE = engine size (liters)
$x_3$ = HP = horsepower (100s)
$x_4$ = LENGTH = length (feet)
$x_5$ = WIDTH = width (feet)
$x_6$ = WEIGHT = weight (1000 pounds)
$x_7$ = DOMESTIC = 1 if U.S. manufacturer, 0 otherwise

Table 2 presents the coefficient estimates obtained via OLS; the positive coefficients on LENGTH and WIDTH obtained from OLS seem counterintuitive, and are due to the extreme multicollinearity in the data. A further deficiency of the OLS model is revealed in the residual plot, shown in Figure 4, in which there appears to be a nonlinear pattern not captured by the linear model. The data were also analyzed using the semiparametric regression model estimated via the ROSE procedure. The estimation was regularized by adding the restrictions that the coefficients $\beta_k$ must all be non-positive, except for the coefficient for DOMESTIC; these restrictions slot directly into the linear program, as sketched below. The coefficient estimates are presented in Table 2, and the estimated transformation function is plotted in Figure 5. The zero coefficient estimates are, if not completely satisfactory, more reasonable than positive coefficients. Figure 6 displays the residuals from the transformed regression model; these seem more nearly random than do the residuals from the untransformed model.
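In the LP sketch of section 2, such sign restrictions amount to tightening the bounds on the $\beta$ variables; a hypothetical modification of `fit_rose` for this model (intercept and DOMESTIC left free, the remaining slopes forced non-positive):

```python
# Replace the free beta bounds inside fit_rose:
beta_bounds = [(None, None)]          # intercept beta_0 unrestricted
beta_bounds += [(None, 0)] * 6        # CYL, ENGSIZE, HP, LENGTH, WIDTH, WEIGHT <= 0
beta_bounds += [(None, None)]         # DOMESTIC unrestricted
bounds = [(None, None)] * q1 + beta_bounds + [(0, None)] * (2 * n)
```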

This example provides an illustration of the three potentially beneficial features of the ROSE procedure: the robustness with respect to outliers, the ability to nonlinearly transform the dependent variable, and the ability to easily constrain the parameter estimates to a space that appears "reasonable" a priori. The modeling approach presented here does not represent the conclusive analysis for these data; in particular, the analysis might benefit from judicious use of case deletion (Anscombe 1960; Andrews and Pregibon 1978), variable selection (Mallows 1973), and/or such techniques as ridge regression (Hoerl and Kennard 1970) or principal components regression (Mansfield, Webster, and Gunst 1977). The ROSE approach, though, can be useful in general as a simple method for generating plausible candidate models in the face of multicollinearity, nonlinearity, and residual outliers.

[Insert Table 2 Here]
[Insert Figures 4-6 Here]

4.4 Monte Carlo Analysis

A modest Monte Carlo study was performed to evaluate the effectiveness of the joint estimation procedure. Data were generated from the model

$$y_i^* = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \epsilon_i, \quad i = 1, \ldots, n, \qquad (13)$$

with $\beta_0 = 1$, $\beta_1 = 11$, $\beta_2 = 21$, $\beta_3 = 31$, with the $x_{ij}$ all uniformly distributed on [0, 1], and with the $\epsilon_i$ coming from the Laplace (double-exponential) density with mean 0 and standard deviation $\sigma$. The $y_i^*$ were then nonlinearly transformed into observable quantities $y_i$ by

$$y_i = \frac{a}{1 + \exp(-(y_i^* - 40)/5)} + b, \qquad (14)$$

with $a$ and $b$ chosen so that $y_i = y_i^*$ for the smallest and largest values of $y^*$ in the sample. Figure 7 shows a plot of $y$ vs. $\beta'x$ for a sample dataset, using the true values of $\beta$.
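A sketch of the data-generating process (13)-(14) as we read it from the text (the seed and helper name are ours; the Laplace scale is $\sigma/\sqrt{2}$ so that the standard deviation is $\sigma$):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_data(n, sigma, beta=(1.0, 11.0, 21.0, 31.0)):
    """Draw one Monte Carlo sample from (13)-(14)."""
    X = np.column_stack([np.ones(n), rng.uniform(size=(n, 3))])
    eps = rng.laplace(scale=sigma / np.sqrt(2.0), size=n)
    ystar = X @ np.asarray(beta) + eps                   # latent y* of (13)
    z = 1.0 / (1.0 + np.exp(-(ystar - 40.0) / 5.0))      # logistic warp of (14)
    # pick a, b so that the observed y matches y* at the sample extremes
    a = (ystar.max() - ystar.min()) / (z.max() - z.min())
    b = ystar.min() - a * z.min()
    return a * z + b, X
```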

Inverting equation (14) shows that the optimal transformation function from $y$ back to $y^*$ is not a member of the Box-Cox class of functions. The regression coefficients $\beta$ were estimated three ways: via the ROSE procedure presented in section 2, via LAD regression of $y$ vs. $x$, and via OLS regression of $y$ vs. $x$. The latter two methods would only be appropriate for estimating the direction of $\beta$; thus, in each case, the estimates of $\beta$ were normalized as $\hat{\beta}_n = \hat{\beta}/\|\hat{\beta}\|$. The goodness of fit for each simulation replication was measured as $(\hat{\beta}_n - \beta_n)'(\hat{\beta}_n - \beta_n)$, where $\beta_n$ is the similarly normalized true coefficient vector, and the overall performance of the estimators was assessed as the mean of this quantity over all simulation replications. The statistical significance of the difference in performance was assessed by a binomial sign test, for which the test statistic is the proportion of simulation replications in which the ROSE estimator is more accurate than a competing estimator (LAD or OLS). Table 3 shows the performance of the three methods for different noise levels $\sigma$ and different sample sizes $n$. In each case examined, the ROSE method substantially outperformed the methods which do not employ a preliminary transformation, with the differences statistically significant. These results suggest the value of applying the appropriate nonlinear transformation to the dependent variable, even when all that is desired is an estimate of the direction of the regression coefficient vector.

[Insert Table 3 Here]

5 Asymptotic Analysis: Consistency

This section is concerned with the asymptotic consistency of the estimates proposed in Section 2. We argue that when the model (2) holds, the direction of $\beta$ is being estimated consistently under rather general conditions on the predictor variables. If both the predictor and the error variables are normally distributed, the estimated transformation also approximates the true link function $h_0$ up to a multiplicative constant. To facilitate our arguments, we consider an equivalent form of the model (2) with the

intercept term absorbed into the link function:

$$h_0(Y) = X'\beta_0 + \epsilon, \qquad (15)$$

where $\beta_0 \in S_p$, the p-dimensional unit ball, and $h_0 \in H$, the space of increasing functions that are Lipschitz on any compact interval $[A, B] \subset D_Y$, where $D_Y$ is the interior of the support of $Y$. Note that $D_Y$ could take the form of $(0, \infty)$ or $(-\infty, \infty)$ or a finite interval like $(0, 1)$. Since the Lipschitz condition is imposed on a compact set contained in $D_Y$, the function $h_0$ is allowed to have arbitrarily large derivatives at the "boundaries" of $D_Y$. Furthermore, we assume that $E|X| < \infty$ and that the median of $\epsilon$ is zero. Let $(h^*, \beta^*)$ be the minimizer of $E|h(Y) - X'\beta|$ over $\beta \in S_p$ and $h \in H$.

Lemma 1: If the distribution of X is elliptical, then $\beta^* = \beta_0$. In addition, if $X'\beta_0$ and $\epsilon$ are independently and normally distributed, we have $h^*(y) = c^* h_0(y)$ for some constant $c^*$. In general, there is no assurance that $h^*$ would recover the true link function $h_0$.

Proof: Since $\beta^*$ is the solution to the problem of minimizing $E|h(Y) - X'\beta|$ over $\beta \in S_p$, Theorem 2.1 of Li and Duan (1989) implies that if the distribution of X is elliptical, the solution $(h^*, \beta^*)$ is Fisher consistent in $\beta$, that is, $\beta^* = \beta_0$. To prove the second part of Lemma 1, we assume without loss of generality that $E(X'\beta_0) = 0$, $\mathrm{Var}(X'\beta_0) = \sigma_1^2$, and $E\epsilon^2 = \sigma^2$. The conditional distribution of $\epsilon$ given $Y$ (or equivalently $h_0(Y)$) is normal with mean $\sigma^2 h_0(y)/(\sigma_1^2 + \sigma^2)$. Consider

$$|h(Y) - X'\beta^*| = |\epsilon - (h_0(Y) - h(Y))|. \qquad (16)$$

Note that $h^*$ minimizes the expected value of (16) over $h \in H$. By a well-known property of the population median, the conditional expectation of (16) given $Y = y$ is minimized by $h(y)$ satisfying $h_0(y) - h(y) = \mathrm{median}[\epsilon \mid y] = \sigma^2 h_0(y)/(\sigma_1^2 + \sigma^2)$. Thus, $h^*(y) = \sigma_1^2 h_0(y)/(\sigma_1^2 + \sigma^2)$. The proof is then complete. In fact, we have also seen that the solution $(h^*, \beta^*)$ is unique, but $h^*$ is not equal to $h_0$. A scale multiplier, however, does not affect the goodness-of-fit through a linear equation.

Next, we consider the sampling property of the estimator $(\hat{h}_n, \hat{\beta}_n)$ that minimizes $\sum_{i=1}^{n} |h(y_i) - x_i'\beta|$ over $h \in H$ and $\beta \in S_p$. The observations are assumed to be independent from the model (15). It is easy to see that for every sequence $h_n$ in $H$ and any given compact interval $[A, B]$, there exists a subsequence $h_{n_k}$ that converges to a limit $h_1$ in $H$ in the sense that $\sup_{A \leq y \leq B} |h_{n_k}(y) - h_1(y)| \to 0$.

Lemma 2: The estimator $(\hat{h}_n, \hat{\beta}_n)$ is consistent in the sense that $\hat{\beta}_n \to \beta^*$ and $\hat{h}_n(y) \to h^*(y)$ for any $y \in D_Y$.

Proof: If the estimator $(\hat{h}_n, \hat{\beta}_n)$ is not consistent for $(h^*, \beta^*)$ as stated in Lemma 2, we consider, for some given pair $A < B$ (to be chosen later), a convergent subsequence, still denoted by $(\hat{h}_n, \hat{\beta}_n)$ for notational simplicity, such that $\hat{\beta}_n \to \beta_1$ and $\max_{A \leq y \leq B} |\hat{h}_n(y) - h_1(y)| \to 0$. In this case, we have either $\beta_1 \neq \beta^*$ or $\max_{A \leq y \leq B} |h_1(y) - h^*(y)| > 0$. For any arbitrarily small $\delta > 0$, we can choose $A = A(\delta)$ and $B = B(\delta)$ to be sufficiently close to the boundaries of $D_Y$, and then extend $h_1$ outside $[A, B]$, so that $h_1 \in H$ and

$$E\{|h_1(Y) - X'\beta_1| \, I(A \leq Y \leq B)\} \geq E|h_1(Y) - X'\beta_1| - \delta.$$

Then, there exists sufficiently large $N = N(\delta)$ such that $n \geq N$ implies

$$E|h^*(Y) - X'\beta^*| + \delta \geq n^{-1} \sum_{i=1}^{n} |h^*(y_i) - x_i'\beta^*| \geq n^{-1} \sum_{i=1}^{n} |\hat{h}_n(y_i) - x_i'\hat{\beta}_n|$$
$$\geq n^{-1} \sum_{A \leq y_i \leq B} |\hat{h}_n(y_i) - x_i'\hat{\beta}_n| \geq n^{-1} \sum_{A \leq y_i \leq B} |h_1(y_i) - x_i'\beta_1| - \delta$$
$$\geq E\{|h_1(Y) - X'\beta_1| \, I(A \leq Y \leq B)\} - 2\delta \geq E|h_1(Y) - X'\beta_1| - 3\delta.$$

Letting $\delta \to 0$, we would arrive at $E|h_1(Y) - X'\beta_1| \leq E|h^*(Y) - X'\beta^*|$, which contradicts the definition of $(h^*, \beta^*)$. The proof is then complete.

Finally, we consider the estimator $(\hat{h}_n, \hat{\beta}_n)$ that minimizes $\sum_{i=1}^{n} |h(y_i) - x_i'\beta|$ over $h \in H_B$ and $\beta \in S_p$, where $H_B$ is the space of "power splines" defined on the interval $[y_{(1)}, y_{(n)}]$ with the set of knots $k_1 < \cdots < k_r$; see (7).

Theorem 1: If the knots are quasi-uniform in the sense that $\min_i(k_{i+1} - k_i) \geq \gamma \max_i(k_{i+1} - k_i)$ for some constant $\gamma > 0$, and the number of knots in each finite interval tends to infinity with $n$, then the estimator $(\hat{h}_n, \hat{\beta}_n)$ is consistent in the sense of Lemma 2.

Proof: For any choice of $[A, B]$, we know from the theory of spline approximations that there exists $\tilde{h}_n \in H_B$ such that $\max_{A \leq y \leq B} |\tilde{h}_n(y) - h^*(y)| \to 0$ as $n \to \infty$. Without loss of generality, we can choose $\tilde{h}_n$ to be bounded outside $[A, B]$. Thus, for sufficiently large $n$ and a sufficiently large interval $[A, B]$,

$$n^{-1} \sum_{i=1}^{n} |\hat{h}_n(y_i) - x_i'\hat{\beta}_n| \leq n^{-1} \sum_{i=1}^{n} |\tilde{h}_n(y_i) - x_i'\beta^*| \leq n^{-1} \sum_{A \leq y_i \leq B} |h^*(y_i) - x_i'\beta^*| + 2\delta.$$

Then the same arguments used in the proof of Lemma 2 for the consistency of $(\hat{h}_n, \hat{\beta}_n)$ over $H$ apply to the estimator over $H_B$.

The consistency result we obtained for the function estimate is not uniform, due to the fact that the true link function $h_0$ may not have a bounded derivative on its support. If we start with a finite support such as $D_Y = (0, 1)$, the model (15) with Gaussian error has to be satisfied by a link function with unbounded derivatives at 0 and 1. In this sense, uniform consistency on compact intervals inside $D_Y$ is the best result that can be achieved.

6 Conclusion

We consider the joint estimation of a vector of regression coefficients and a monotone transformation of the dependent variable of the regression. An estimation procedure is presented which has the desirable properties of robustness to outliers and computational efficiency. The algorithm can be easily implemented using a standard linear program

solver. In recent decades, a number of useful extensions to the standard multiple linear regression model have been developed; these include ACE (Breiman and Friedman 1985), AVAS (Tibshirani 1988), generalized additive models (Hastie and Tibshirani 1990), projection pursuit regression (Friedman and Stuetzle 1981), ATS methods (Cleveland, Mallows, and McRae 1993), slicing regression (Li 1991), transform-both-sides methods (Nychka and Ruppert 1995), robust transformation estimation (Carroll 1980; Carroll and Ruppert 1985; Carroll and Ruppert 1987), and semiparametric regression (Powell and Stoker 1989; Ichimura 1993; Bonneu, Delecroix, and Malin 1993; Horowitz 1996; Wang and Ruppert 1996; He and Shen 1997). Bayesian contributions to flexible regression modeling include West, Müller, and Escobar (1993), Mallick and Gelfand (1994), and Laud, Damien, and Smith (1996). The ROSE algorithm presented in the present paper offers a potentially useful addition to the currently available set of techniques. The ROSE technique may be of use in cases in which the data may contain outliers, and the response variate requires nonlinear transformation, but in which the modeler wishes to have explanatory variables enter the model in an easily understandable linear and additive fashion.

References

Andrews, D. and D. Pregibon, 1978, Finding the outliers that matter, Journal of the Royal Statistical Society, Series B 40, 85-93.

Anscombe, F., 1960, Rejection of outliers, Technometrics 2, 123-147.

Barrodale, I. and F. Roberts, 1978, An efficient algorithm for discrete L1 linear approximation, SIAM Journal on Numerical Analysis 15, 603-611.

Bassett, G. and R. Koenker, 1978, Asymptotic theory of least absolute error regression, Journal of the American Statistical Association 73, 618-622.

Belsley, D., E. Kuh, and R. E. Welsch, 1980, Regression diagnostics (John Wiley and Sons, New York, NY).

Bickel, P. and K. Doksum, 1981, An analysis of transformations revisited, Journal of the American Statistical Association 76, 296-311.

Bonneu, M., M. Delecroix, and E. Malin, 1993, Semiparametric versus nonparametric estimation in single index regression model: A computational approach, Computational Statistics Quarterly 8, 207-222.

Box, G. and D. Cox, 1964, An analysis of transformations, Journal of the Royal Statistical Society, Series B 26, 211-246.

Box, G. and D. Cox, 1982, An analysis of transformations revisited, Journal of the American Statistical Association 77, 209-210.

Breiman, L. and J. Friedman, 1985, Estimating optimal transformations for multiple regression and correlation (with discussion), Journal of the American Statistical Association 80, 580-619.

Carroll, R., 1980, A robust method for testing transformations to achieve approximate normality, Journal of the Royal Statistical Society, Series B 42, 71-78.

Carroll, R. and D. Ruppert, 1985, Transformations: A robust analysis, Technometrics 27, 1-12.

Carroll, R. and D. Ruppert, 1987, Diagnostics and robust estimation when transforming the regression model and the response, Technometrics 29, 287-299.

Carroll, R. and D. Ruppert, 1988, Transformation and weighting in regression (Chapman & Hall, London, UK).

Cleveland, W. S., C. L. Mallows, and J. E. McRae, 1993, ATS methods: Nonparametric regression for non-Gaussian data, Journal of the American Statistical Association 88, 821-835.

Cryer, J. D., 1986, Time series analysis (Duxbury Press, Boston, MA).

Dielman, T. and R. Pfaffenberger, 1982, LAV (least absolute value) estimation in linear regression: A review, TIMS Studies in the Management Sciences 19, 31-52.

Friedman, J. and W. Stuetzle, 1981, Projection pursuit regression, Journal of the American Statistical Association 76, 817-823.

Gonin, R. and J. Money, 1989, Non-linear Lp-norm estimation (John Wiley and Sons, New York, NY).

Harrison, D. and D. Rubinfeld, 1978, Hedonic prices and the demand for clean air, Journal of Environmental Economics and Management 5, 81-102.

Hastie, T. and R. Tibshirani, 1990, Generalized additive models (Chapman & Hall, London, UK).

He, X. and L. Shen, 1997, Linear regression after spline transformation, Biometrika (to appear).

He, X. and P. Shi, 1996, Monotone B-spline smoothing, Technical report, University of Illinois Department of Statistics.

Hinkley, D. and G. Runger, 1984, Analysis of transformed data, Journal of the American Statistical Association 79, 302-308.

Hoerl, A. and R. Kennard, 1970, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12, 55-67.

Horowitz, J. L., 1996, Semiparametric estimation of a regression model with an unknown transformation of the dependent variable, Econometrica 64, 103-137.

Ichimura, H., 1993, Semiparametric least squares (SLS) and weighted SLS estimation of single-index models, Journal of Econometrics 58, 71-120.

Koenker, R., P. Ng, and S. Portnoy, 1994, Quantile smoothing splines, Biometrika 81, 673-680.

Laud, P., P. Damien, and A. Smith, 1996, Bayesian nonparametric and covariate analysis of failure time data, Technical report, University of Michigan Business School.

Li, K.-C., 1991, Sliced inverse regression for dimension reduction, Journal of the American Statistical Association 86, 316-342.

Li, K.-C. and N. Duan, 1989, Regression analysis under link violation, Annals of Statistics 17, 1009-1052.

Lock, R., 1993, 1993 new car data, Journal of Statistics Education 1.

Mallick, B. and A. Gelfand, 1994, Generalized linear models with unknown link functions, Biometrika 81, 237-245.

Mallows, C., 1973, Some comments on Cp, Technometrics 15, 661-676.

Mansfield, E., J. Webster, and R. Gunst, 1977, An analytic variable selection technique for principal component regression, Applied Statistics 26, 34-40.

Nychka, D. and D. Ruppert, 1995, Nonparametric transformations for both sides of a regression model, Journal of the Royal Statistical Society, Series B 57, 519-532.

Powell, J. L. and T. M. Stoker, 1989, Semiparametric estimation of index coefficients, Econometrica 57, 1403-1430.

Ramsay, J., 1988, Monotone regression splines (with discussion), Statistical Science 3, 425-461.

Smith, P., 1979, Splines as a useful and convenient statistical tool, American Statistician 33, 57-62.

Tibshirani, R., 1988, Estimating optimal transformations for regression via additivity and variance stabilization, Journal of the American Statistical Association 83, 394-405.

Wang, N. and D. Ruppert, 1996, Estimation of regression parameters in a semiparametric transformation model, Journal of Statistical Planning and Inference 52, 331-351.

West, M., P. Müller, and M. Escobar, 1993, Hierarchical priors and mixture models with applications in regression and density estimation, Technical Report A02, Duke University, Institute of Statistics and Decision Sciences.

Young, M., 1996, Robust seasonal adjustment by Bayesian modeling, Journal of Forecasting 15, 355-367.

Table 1: Parameter estimates, Boston housing data.

Variable      ROSE      LAD       OLS
CRIM        -0.0198   -0.0239   -0.2693
ZN           0.0020    0.0031    0.0907
INDUS        0.0065    0.0079    0.0498
CHAS         0.1586    0.1596    0.0806
NOX         -0.8490   -0.8203   -0.2478
RM           0.4889    0.5296    0.1932
AGE         -0.0020   -0.0019    0.0067
DIS         -0.0799   -0.0940   -0.3002
RAD          0.0192    0.0208    0.3728
TAX         -0.0012   -0.0014   -0.3233
PTRATIO     -0.0752   -0.0900   -0.2574
B            0.0018    0.0018    0.1144
LSTAT       -0.0464   -0.0569   -0.6313

Table 2: Parameter estimates, 1993 car fuel economy data.

Variable        OLS       ROSE
INTERCEPT     29.289     54.380
CYL           -0.241     -0.915
ENGSIZE       +0.888     -0.000
HP            +0.655     -0.595
LENGTH        +0.251     -0.085
WIDTH         +4.228     -0.000
WEIGHT       -23.72     -13.618
DOMESTIC      -1.804     -0.830

Table 3: Results of Monte Carlo evaluation of estimators.

                     RMSE                  Signif.
  n     σ      ROSE     LAD      OLS     LAD   OLS
 100   2.0    0.0263   0.0495   0.0360    *     *
 100   4.0    0.0397   0.0666   0.0521    *     *
  50   2.0    0.0402   0.0791   0.0621    *     *
  50   4.0    0.0665   0.1010   0.0856    *     *

* = Semiparametric estimator more efficient, p < .01, binomial sign test.

Figure 1: Aggregate sales in millions of dollars, Ford Motor Company. [Figure omitted: time series plot of sales y against year t, 1960-1995.]

Figure 2: Estimated transformation of dependent variable in seasonal regression model of Ford Motor Company sales data. [Figure omitted: plot of h(y) against y.]

Figure 3: Transformed sales data, Ford Motor Company. [Figure omitted: time series plot of h(y) against year t, 1960-1995.]

Figure 4: Residual plot, OLS regression of 1993 cars fuel economy data. [Figure omitted: residuals plotted against the fitted index β'x.]

Figure 5: Estimated transformation of dependent variable in 1993 cars fuel economy regression model. [Figure omitted: plot of h(y) against y.]

Figure 6: Residual plot, robust semiparametric regression of 1993 cars fuel economy data. [Figure omitted: residuals plotted against the fitted index β'x.]

Figure 7: Sample dataset from Monte Carlo study: n = 100, σ = 2.0. [Figure omitted: plot of y against β'x.]