Division of Research
Graduate School of Business Administration
The University of Michigan

BOX-JENKINS SEASONAL FORECASTING USING TRANSFORMED DATA: A CASE STUDY

Working Paper No. 129

by

Craig F. Ansley
The University of Michigan

April 1976

FOR DISCUSSION PURPOSES ONLY
None of this material is to be quoted or reproduced without the express permission of the Division of Research.

ABSTRACT

In this paper, sales data for a line of office equipment are analyzed by Box-Jenkins methods using the Box-Cox transformation, given by

    y_λ(t) = (y(t)^λ - 1)/λ,   λ ≠ 0
    y_0(t) = ln y(t),

where λ is the transformation parameter. The analysis uses likelihood equations, and a numerical algorithm for their solution, developed by Ansley, Spivey, and Wrobleski. A step-by-step account of the analysis and forecasting of the data is presented, together with a discussion of some of the special problems encountered. The forecasting performance of the model is shown to be superior to that of the best ARIMA model based on the logarithmic transformation.

Introduction

The Box-Jenkins approach to time series analysis and forecasting is becoming widely used in many business and economic applications. Examples of seasonal forecasting, however, have not appeared often in the literature. Among the few published examples are analyses of airline passenger data, Box and Jenkins [4, Section 9.2]; automobile registration data, Nelson [10, Chapter 7]; and a detailed case study of sales data by Chatfield and Prothero [7].

The sales data study is somewhat disturbing to users of Box-Jenkins seasonal forecasting methods: the authors were unable to obtain a satisfactory forecasting model. Box and Jenkins [5], Harrison [8], and Tunnicliffe Wilson [14] suggested that this occurred because Chatfield and Prothero had improperly transformed their data initially by taking logarithms. A more flexible family of transformations, suggested originally by Box and Cox [3], was put forward by these discussants as a more suitable approach to choosing an initial transformation, and two of them, [5] and [14], were able to obtain much better results in this way. The Box-Cox family of transformations is given by

    y_λ(t) = (y(t)^λ - 1)/λ,   λ ≠ 0                                (1)
    y_0(t) = ln y(t),

where y(t), assumed positive, is a nonstationary time series and λ is the transformation parameter.
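As a concrete illustration, the family (1) reduces to a few lines of code. The sketch below is ours, not part of any program described in this paper; the function name and the use of Python are assumptions made for exposition only.

    import numpy as np

    def box_cox(y, lam):
        # Box-Cox transformation (1): (y**lam - 1)/lam, with the logarithm
        # as the lam = 0 limit; y(t) is assumed positive throughout.
        y = np.asarray(y, dtype=float)
        if lam == 0.0:
            return np.log(y)
        return (y ** lam - 1.0) / lam

Subtracting 1 and dividing by λ leaves the shape of the transformed series unchanged but makes the family continuous at λ = 0, which, as footnote 1 notes, is why (1) is preferred to the raw power transformation for estimation.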

A formal development of the likelihood function for joint estimation of the transformation parameter λ and the other parameters of a seasonal ARIMA model is given by Ansley, Spivey, and Wrobleski [1], [2]. These authors also have developed an algorithm for approximate solution of the likelihood equations. This algorithm is easily implemented and requires only a modest modification of existing Box-Jenkins computer programs.

This paper outlines the effects of the transformation on data structure and summarizes estimation and identification procedures. It then provides a step-by-step account of the application of the transformation to the analysis and forecasting of sales data for a line of office equipment. The results demonstrate a significant improvement in forecasting performance over the log transformation approach.

Nature of the Transformation

For a fixed value of λ the transformation (1) is simply a linear transformation of the power transformation

    y_λ(t) = {y(t)}^λ.                                              (2)

We can therefore use the transformation (2) to gain some insight into the effects of transformations on the data structure.¹

The first step in the Box-Jenkins methodology is to reduce a nonstationary series to a stationary series by means of differencing operations. The differenced series should have (i) constant mean and (ii) constant variance, and we often assume that (iii) the differenced series is normal.² The first property, constant mean, requires that the original series have both trend and seasonal components generated by a polynomial. The

most common example is a series with a straight-line trend plus seasonal fluctuations of constant amplitude. A very simple example is given in Figure 1. Often, however, a series has some other trend pattern, such as an exponential trend. The usual procedure in such a case is to transform the data initially; for an exponential trend, one would choose a logarithmic transformation.

Unfortunately, a logarithmic transformation can "overtransform" the data. In Figure 2 we show a series with an increasing trend and its logarithms. Note that the original series has a trend that could easily be mistaken for an exponential trend and has seasonal fluctuations of increasing magnitude. The transformed series, on the other hand, has a decreasing trend with decreasing seasonal amplitude. This is an example of overtransformation.

Overtransformation can and does arise in practice. For example, the data plotted by Chatfield and Prothero [7, p. 298] show an increasing trend with increasing seasonal amplitude, while the logarithms of the data show a decreasing trend with decreasing seasonal amplitude. With the benefit of hindsight, one could point to this as the source of their problems in developing a forecasting model based on a logarithmic transformation.

The series in Figure 2 was generated by raising that in Figure 1 to the power of 2.5. If we were to transform the data back using λ = .4 in equation (2), we would obtain the original series. A constant mean could then be obtained by simple differencing operations.

[Plot not reproduced.]
Fig. 1. Series with a linear trend and stable seasonal component.

[Plot not reproduced.]
Fig. 2. Example of overtransformation by logarithms. (Solid line: original series; dashed line: logarithms.)
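The mechanism behind Figures 1 and 2 is easy to mimic. The sketch below is a hypothetical reconstruction (the exact series of Figure 1 is not given in the paper): a linear trend plus a stable seasonal is raised to the power 2.5, after which the logarithm overtransforms, while the power transformation (2) with λ = 1/2.5 = .4 recovers the original series exactly.

    import numpy as np

    t = np.arange(1, 21)
    base = 10.0 + 0.5 * t + np.sin(np.pi * t / 2.0)   # stand-in for Figure 1
    grown = base ** 2.5                               # growing trend and amplitude, as in Figure 2

    logged = np.log(grown)       # = 2.5 ln(base): flattening trend, shrinking swings
    recovered = grown ** 0.4     # transformation (2) with lambda = .4 undoes the power

    print(np.allclose(recovered, base))               # True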

The second requirement of the differenced series is that it have constant variance. With real data, we often take one difference and one seasonal difference. These two operations remove parabolic trends, and given the random fluctuations in real data, it can be very difficult to detect a varying mean in the differenced series. If the data are improperly transformed originally, however, the variance of the differenced series will not be constant.

Finally, the choice of the initial transformation will affect the distribution of the individual values of the series. A carefully chosen initial transformation can lead to a more nearly normal distribution for the differenced series.

It is clear that a family of transformations indexed by a single parameter cannot simultaneously satisfy (i), (ii), and (iii) in every case. The maximum likelihood procedures outlined below in effect provide a rational procedure for weighting the importance of these requirements in model estimation.

Outline of the Transformation Estimation Procedure

We assume that we have observations y(t) from a time series and that, for some value λ of the transformation parameter, the transformed observations follow a seasonal ARIMA(p,d,q)×(P,D,Q) process. Using the notation of Box and Jenkins [4, p. 205] we have

    φ(B)Φ(B^s) ∇^d ∇_s^D y_λ(t) = δ + θ(B)Θ(B^s) a(t)              (3)

where B is the backshift operator (see Box and Jenkins [4, p. 8]), and where

    s = length of seasonal cycle;
    d = degree of differencing;
    D = degree of seasonal differencing;
    δ = constant term;
    φ(B) = 1 - φ_1 B - ... - φ_p B^p;
    θ(B) = 1 - θ_1 B - ... - θ_q B^q;
    Φ(B^s) = 1 - Φ_1 B^s - ... - Φ_P B^{sP};
    Θ(B^s) = 1 - Θ_1 B^s - ... - Θ_Q B^{sQ}.

We write φ to represent the vector (φ_1,...,φ_p) of autoregressive parameters and similarly θ, Φ, and Θ to represent the vectors of moving average, seasonal autoregressive, and seasonal moving average parameters, respectively.

Suppose we observe n+d+sD values of the time series, y(-d-sD+1),...,y(0),y(1),...,y(n). Assuming that the first d+sD values y(-d-sD+1),...,y(-1),y(0) are fixed, it has been shown by Ansley, Spivey, and Wrobleski that the log likelihood is given by

    L = const. - (n/2) ln σ_a² + (1/2) ln |M_n| + ln J - S/(2σ_a²)  (4)

where

    σ_a² = variance of a(t);
    σ_a² M_n^{-1} = covariance matrix of w_λ(1),...,w_λ(n);
    J = ∏_{t=1}^{n} {y(t)}^{λ-1};
    S = Σ_{t=-∞}^{n} [a(t)]²;

and where w_λ(t) = ∇^d ∇_s^D y_λ(t) and [a(t)] = E[a(t) | w_λ(1),...,w_λ(n)].

We can shed some light on this complicated expression with the following comments. First, the likelihood function is conditional on the first d+sD observations because the two differencing operations effectively reduce the data set by this number. For example, if we had 72 observations on monthly data (s = 12) and we took one backward difference and one seasonal backward difference to reduce the data to stationarity, the differenced series would contain 59 = 72 - 12 - 1 differences. Second, the formula for J, which is in fact the Jacobian for the transformation (1), excludes the first d+sD observations. For the example above, we would calculate J using only the last 59 observations.

The expected values [a(t)] appearing in the sum of squares S are calculated from the difference equations

    φ(B)Φ(B^s)[w_λ(t)] = δ + θ(B)Θ(B^s)[a(t)].                      (5)

An algorithm for the numerical solution of these equations is given by Box and Jenkins [4, Section 7.2].

Maximization of L is simplified by a numerical algorithm developed by Ansley, Spivey, and Wrobleski [1], [2]. They show that maximum likelihood estimates can be obtained by minimizing

    S_z = Σ_{t=-∞}^{n} [a_z(t)]²                                    (6)

where a_z(t) is obtained from the difference equations

    φ(B)Φ(B^s) z_λ(t) = δ_z + θ(B)Θ(B^s) a_z(t),                    (7)

where z_λ(t) = w_λ(t)/J^{1/n} and δ_z = δ/J^{1/n}.

Because the equations (7) are in the same form as equation (5), the same algorithm can be used for their solution. In these equations δ is replaced by δ_z, and after S_z is minimized, an estimate of δ is obtained from the value δ̂_z by

    δ̂ = δ̂_z J^{1/n}(λ̂).

Any computer program for Box-Jenkins estimation will contain: (i) an algorithm for solving difference equations such as (7), and (ii) a nonlinear least-squares algorithm that can be used to minimize the sum of squares S_z in equation (6). This method for estimating the parameter λ simultaneously with the other ARIMA model parameters can therefore be incorporated into an existing computer package with minimum modification. A flow chart for the major steps involved is given in Figure 3. The main difference between the standard and modified estimation procedures is that in estimating λ, the series must be transformed and differenced prior to each evaluation of the sum of squares within the nonlinear least-squares algorithm.
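To make the difference-equation step concrete, the sketch below computes residuals for the ARIMA(0,1,1)×(0,1,1)_12 model identified later in this paper. It is a deliberately simplified, conditional version: presample residuals are set to zero, whereas the algorithm of Box and Jenkins [4, Section 7.2] back-forecasts them. The function name and language are our own.

    import numpy as np

    def residuals_sma(w, theta, Theta, delta, s=12):
        # Difference equation (5) specialized to (0,1,1)x(0,1,1)_s:
        #   w(t) = delta + a(t) - theta*a(t-1) - Theta*a(t-s) + theta*Theta*a(t-s-1),
        # solved forward for a(t), with presample a's fixed at zero.
        n = len(w)
        a = np.zeros(n + s + 1)              # leading s+1 slots hold presample zeros
        for t in range(n):
            i = t + s + 1
            a[i] = (w[t] - delta + theta * a[i - 1]
                    + Theta * a[i - s] - theta * Theta * a[i - s - 1])
        return a[s + 1:]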

[Flow chart not reproduced. The standard routine initializes (φ, θ, Φ, Θ, δ), differences the series, and passes it to the nonlinear least-squares algorithm, whose residual subroutine evaluates the sum of squares S. The modified routine also initializes λ and δ_z = δ/J^{1/n}(λ), calculates the transformed series z_λ(t) before differencing, minimizes S_z, and rescales the estimate of δ_z by J^{1/n}(λ̂) on output.]
Fig. 3. Flowchart of standard and modified estimation routines.
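The modified routine of Figure 3 can then be sketched end to end. The code below is a schematic stand-in for the authors' program, not a reproduction of it: it uses Nelder-Mead in place of the Marquardt routine, the simplified residuals_sma() above in place of back-forecasting, and assumes an array sales holding the observations used for fitting (here 72 points, so n = 59).

    import numpy as np
    from scipy.optimize import minimize

    def S_z(params, y, s=12):
        # Sum of squares (6) as a function of (lambda, theta, Theta, delta_z).
        lam, theta, Theta, delta_z = params
        y = np.asarray(y, dtype=float)
        y_lam = np.log(y) if lam == 0 else (y ** lam - 1.0) / lam
        d12 = y_lam[s:] - y_lam[:-s]            # seasonal difference
        w = d12[1:] - d12[:-1]                  # regular difference: n = len(y) - s - 1 values
        n = len(w)
        J_root = np.exp((lam - 1.0) * np.mean(np.log(y[-n:])))   # J^(1/n), Jacobian term
        z = w / J_root                          # z_lambda(t) = w_lambda(t) / J^(1/n)
        a = residuals_sma(z, theta, Theta, delta_z, s=s)
        return np.sum(a ** 2)

    lam0, delta0 = -0.05, 0.0045                # initial values used later in the paper
    J0 = np.exp((lam0 - 1.0) * np.mean(np.log(sales[-59:])))
    x0 = np.array([lam0, 0.57, 0.43, delta0 / J0])   # initialize delta_z = delta / J^(1/n)
    fit = minimize(S_z, x0, args=(sales,), method="Nelder-Mead")
    lam_hat, theta_hat, Theta_hat, delta_z_hat = fit.x
    delta_hat = delta_z_hat * np.exp((lam_hat - 1.0) * np.mean(np.log(sales[-59:])))

The last line carries out the rescaling δ̂ = δ̂_z J^{1/n}(λ̂) shown in the flow chart.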

The Box-Jenkins program employed here uses the Marquardt nonlinear least-squares algorithm [9]. The program was adapted as outlined above to minimize S_z simultaneously over all the parameters, including the transformation parameter λ, and it has proven efficient over a wide range of ARIMA models. On the University of Michigan's Amdahl 470 computer, the modification requires approximately 10 percent more CPU time than does standard estimation.

Model Identification

The estimation procedures described above assume that the order of the ARIMA(p,d,q)×(P,D,Q) model is known. A problem arises immediately in that the autocorrelation function used in identification is a function of the initial transformation parameter and will change as the parameter is changed. Experience has shown the following strategy to be successful: (i) choose an initial transformation which seems reasonable from an examination of the raw data; (ii) using this initial transformation, carry out the usual Box-Jenkins identification procedure to find p,d,q,P,D,Q and initial estimates of the parameters.

This simple solution needs some justification. Experience with a large number of business and economic time series has shown that the identification of (p,d,q) and (P,D,Q) is not affected by the choice of the transformation parameter λ over a wide range of values. Moreover, initial estimates of other model parameters are relatively insensitive to the choice of λ. However, as λ is a scale parameter,

any constant term will be sensitive to the choice of λ, and the initial choice of this constant term must be compatible with the initial choice of λ. In addition, the likelihood function has been found to be well behaved and unimodal over a wide range of ARIMA model configurations. This suggests that optimization algorithms should converge efficiently from any reasonable initial value of λ.

Box and Jenkins [5] suggest a heuristic method for rapid approximate evaluation of λ. The value thus obtained could be used here to provide an initial value for λ, but we have found that the additional computer time and effort are not worthwhile.

The Data and an Initial Transformation

Data were obtained for monthly sales of a line of office equipment from January 1969 through December 1975. The data are given in Table 1 and plotted in Figure 4. For the sake of example, the last twelve data points, plotted with a dashed line in Figure 4, were omitted from the identification and estimation phases of the analysis so that they could be used to test forecasting performance.

The series has a marked downward trend resembling an exponential decay. There is also a seasonal pattern with decreasing amplitude. These two factors suggest that a reasonable initial transformation would be the logarithmic transformation, i.e., λ = 0 in equation (1). The logarithms of the data, plotted in Figure 5, show a downward trend that is more nearly linear than the original series, although some appearance of decay remains. This suggests that the

final transformation has a parameter λ < 0. The amplitude of seasonal fluctuations is more nearly constant over the series, although there is some decrease, again suggesting that the final transformation parameter may be negative. Although it appeared that a logarithmic transformation might, in this case, be an undertransformation, it was an adequate initial choice.

TABLE 1
SALES OF OFFICE EQUIPMENT, JANUARY 1969 TO DECEMBER 1975

         1969  1970  1971  1972  1973  1974  1975
Jan      1564  1061   857   757   691   636   553
Feb      1586  1116   917   823   746   652   586
Mar      1475  1083   890   779   724   636   564
Apr      1459  1006   846   768   702   619   553
May      1343   973   823   741   691   591   519
June     1221   934   796   702   647   553   497
July     1155   890   741   652   613   531   481
Aug      1111   868   729   641   591   508   470
Sept     1155   906   790   663   624   536   492
Oct      1105   879   774   691   624   547   492
Nov      1083   868   774   685   613   531   481
Dec      1050   873   713   663   569   503   453

Differencing

Let the observed series at time t be y(t), and the initial transformed series be y_0(t) = ln y(t). The first step is to reduce the transformed series to approximate stationarity by differencing operations. As the series y_0(t) has both a trend and a seasonal pattern, plots and autocorrelations of the series ∇^d ∇_12^D y_0(t) were examined for various integer values of d and D (see Box and Jenkins [4, Chapter 9]). It has been shown (see Appendix to Chatfield and Prothero [7]) that the operator ∇∇_12 will eliminate both a linear trend and a stable seasonal pattern.

[Plot not reproduced.]
Fig. 4. Sales of office equipment from January 1969 to December 1975.

[Plot not reproduced.]
Fig. 5. Logarithms of sales data for January 1969 to December 1974.

The sample autocorrelations for the series {y_0(t)}, {∇y_0(t)}, {∇_12 y_0(t)}, and {∇∇_12 y_0(t)} are given in Table 2. Note that the series contain 72, 71, 60, and 59 terms, respectively. For both {y_0(t)} and {∇_12 y_0(t)} the autocorrelations begin large and positive and then fall steadily to large negative values. Clearly, further differencing is required, a necessity that can also be easily verified in this case from plots of the series. The autocorrelations of {∇y_0(t)} have a seasonal cycle with peaks at lags 12, 24, and 36 and troughs at lags 6, 18, and 30. Examination of the plot of {∇y_0(t)} shows a marked seasonal pattern but little trend. These findings strongly indicate that a seasonal difference is required.

The autocorrelations of {∇∇_12 y_0(t)} diminish quickly after lag 12, and there is no marked trend or seasonal pattern. It is thus reasonable to assume that {∇∇_12 y_0(t)} is approximately stationary. The plot in Figure 6 shows no obvious trend or seasonality, confirming this choice of differencing operator. Note, however, that the variance of the series tends to decrease over time, suggesting once again that a negative value of the transformation parameter may ultimately be required.

TABLE 2
SAMPLE AUTOCORRELATIONS OF VARIOUS DIFFERENCES OF {y_0(t)}

Series          Lags     Autocorrelations
y_0(t)          1-12     .93  .85  .78  .72  .65  .60  .57  .54  .51  .49  .47  .45
                13-24    .40  .35  .30  .26  .22  .19  .17  .15  .13  .12  .12  .11
                25-36    .07  .03  .00 -.04 -.07 -.10 -.11 -.11 -.13 -.14 -.14 -.14
∇y_0(t)         1-12     .12 -.16 -.13  .00 -.03 -.22 -.05  .05 -.07 -.25  .18  .62
                13-24    .15 -.12 -.12 -.03  .00 -.24 -.02  .06 -.13 -.23  .14  .50
                25-36    .07 -.04 -.19 -.02  .02 -.21  .02  .00 -.09 -.19  .08  .29
∇_12 y_0(t)     1-12     .84  .77  .71  .61  .53  .49  .42  .38  .34  .28  .24  .19
                13-24    .18  .15  .12  .07  .04  .02 -.02 -.05 -.09 -.10 -.09 -.16
                25-36   -.20 -.19 -.21 -.20 -.19 -.22 -.23 -.26 -.30 -.32 -.35 -.34
∇∇_12 y_0(t)    1-12    -.43  .06  .03  .05  .01  .12 -.04 -.11  .25 -.18  .23 -.36
                13-24    .22 -.06  .13 -.04 -.05  .00 -.08  .27 -.23  .01  .05  .14
                25-36   -.26  .13 -.05 -.06  .16 -.14  .04 -.09  .13 -.06  .03 -.09

[Plot not reproduced.]
Fig. 6. The differenced series {∇∇_12 ln y(t)}.
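The differencing comparison of Table 2 is straightforward to reproduce. A minimal sketch, assuming an array sales containing the 72 observations of Table 1 used for fitting; the acf() helper is ours, not from any particular package.

    import numpy as np

    def acf(x, nlags):
        # Sample autocorrelations r_1, ..., r_nlags.
        x = np.asarray(x, dtype=float) - np.mean(x)
        c0 = np.dot(x, x) / len(x)
        return np.array([np.dot(x[:-k], x[k:]) / (len(x) * c0)
                         for k in range(1, nlags + 1)])

    y0 = np.log(sales)                                 # initial (log) transformation
    candidates = {
        "y0":       y0,                                # 72 terms
        "Dy0":      np.diff(y0),                       # 71 terms
        "D12y0":    y0[12:] - y0[:-12],                # 60 terms
        "DD12y0":   np.diff(y0[12:] - y0[:-12]),       # 59 terms
    }
    for name, series in candidates.items():
        print(name, np.round(acf(series, 36), 2))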

Identification of the Differenced Series

We have obtained an approximately stationary series w_0(t) given by

    w_0(t) = ∇∇_12 y_0(t).

Next we must find a model of the ARMA class which will adequately describe the autocorrelation structure of w_0(t). The class of seasonal ARMA models can be written

    φ(B)Φ(B^12) w_0(t) = δ + θ(B)Θ(B^12) a(t)

as in equation (3), in which the polynomials φ(B), θ(B), Φ(B^12), and Θ(B^12) have degrees p, q, P, and Q, respectively.

To find suitable values of p, q, P, and Q, we begin by examining the sample autocorrelations and partial autocorrelations. These statistics and their approximate standard errors are given in Table 3. (The autocorrelations are, of course, the same as those given previously in Table 2.) Approximate standard errors for the sample autocorrelations r_k, assuming zero autocorrelations at lags ≥ k, were obtained from the expression

    σ(r_k) = {(1/n)[1 + 2(r_1² + ... + r_{k-1}²)]}^{1/2}

(see Box and Jenkins [4, p. 177]). The partial autocorrelations are calculated from the autocorrelations r_k via the Yule-Walker equations [4, p. 55]. The standard error of the estimated partial autocorrelation φ̂_kk at lag k, given that the underlying true partial autocorrelations are zero at and beyond lag k, is approximately 1/√n [4, p. 178].

From Table 3 we see that the only significant (i.e., greater than two standard errors) autocorrelations are at lags 1 and 12. At lag 1 the autocorrelation r_1 is large and negative, followed by a number of autocorrelations close to zero. The first partial autocorrelation φ̂_11 is large and negative, followed by values which decay more slowly. This suggests that the series {w_0(t)} is of the form

    w_0(t) = a(t) - θa(t-1) + other terms.                          (8)

Turning now to lag 12, we see that there is a large, negative autocorrelation, r_12 = -.36, flanked by positive autocorrelations r_11 = .23 and r_13 = .22.

TABLE 3
AUTOCORRELATIONS AND PARTIAL AUTOCORRELATIONS OF {w_0(t)}

                                                     Approximate
Lag      Autocorrelations                            Standard Error
1-6     -.43  .06  .03  .05  .01  .12                     .13
7-12    -.04 -.11  .25 -.18  .23 -.36                     .15
13-18    .22 -.06  .13 -.04 -.05  .00                     .18
19-24   -.08  .27 -.23  .01  .05  .14                     .19
25-30   -.26  .13 -.05 -.06  .16 -.14                     .20
31-36    .04 -.09  .13 -.06  .03 -.09                     .21

Lag      Partial Autocorrelations
1-6     -.43 -.15  .15  .14  .10 -.05                     .13
7-12     .12 -.01 -.05  .03  .04  .01                     .13
13-18   -.07  .08  .18  .01 -.12  .04                     .13
19-24   -.16  .08 -.32  .14  .06  .05                     .13
25-30   -.13  .20 -.06  .04  .06  .02                     .13
31-36   -.03 -.11  .14 -.18  .01 -.04                     .13

There is no significant autocorrelation at lag 24 or 36, suggesting that a single seasonal MA parameter could adequately explain the observed autocorrelation at lag 12. The following tentative model was chosen for w_0(t):

    w_0(t) = δ + (1 - θB)(1 - ΘB^12) a(t).                          (9)

If σ_a² = Var{a(t)}, the autocovariances are given by

    γ_0 = σ_a²(1 + θ²)(1 + Θ²),
    γ_1 = -θσ_a²(1 + Θ²),
    γ_12 = -Θσ_a²(1 + θ²),
    γ_11 = γ_13 = θΘσ_a²

[4, p. 329]. Note that for such a model γ_11 = γ_13 ≠ 0; this is consistent with the observed sample autocorrelations. It can now be seen that

    ρ_1 = -θ/(1 + θ²),   ρ_12 = -Θ/(1 + Θ²).

Thus we can obtain initial estimates of θ and Θ by solving the equations

    -θ/(1 + θ²) = r_1 = -.43,   -Θ/(1 + Θ²) = r_12 = -.36.

Discarding roots having an absolute value exceeding 1, we obtain θ_0 = .57 and Θ_0 = .43.
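Each of these moment equations is a quadratic in the MA parameter, with one root inside and one outside the unit circle, so the invertible root is selected automatically. A small sketch (the helper is ours, for illustration):

    import numpy as np

    def ma1_from_r(r):
        # Solve -x/(1 + x**2) = r, i.e. r*x**2 + x + r = 0, keeping the
        # invertible root (absolute value less than 1).
        roots = np.roots([r, 1.0, r])
        return float(roots[np.abs(roots) < 1][0].real)

    theta0 = ma1_from_r(-0.43)   # 0.57 to two decimals
    Theta0 = ma1_from_r(-0.36)   # 0.43 to two decimals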

The series {w_0(t)} has 59 points and ranges from -.088 to .078. Its mean is .0045. This value is sufficiently large relative to the range to justify including a constant term in the model for a first estimation run. For the model (9), E w_0(t) = δ, so that an appropriate initial estimate for the constant term is δ_0 = .0045.

To summarize, thus far we have tentatively identified the series as an ARIMA(0,1,1)×(0,1,1) model with initial values λ_0 = 0, θ_0 = .57, Θ_0 = .43, and δ_0 = .0045.

Finally, it is interesting to compare Table 3 with the sample autocorrelations and partial autocorrelations of the differences ∇∇_12 of the raw series, shown in Table 4. There is very little variation between the statistics for the logarithms (λ = 0) and the raw data (λ = 1), especially at lags where the values are significant. This confirms our earlier comment that autocorrelations and partial autocorrelations are insensitive to the initial choice of the transformation parameter λ.

Estimation

Approximate maximum likelihood estimates were obtained using the least-squares algorithm described earlier. An important consideration here is that most nonlinear least-squares computer packages will not accept zero initial values; our Marquardt routine is no exception. It was therefore necessary to perturb the initial value of λ = 0. Because we suspected that the final value of λ would be negative, an initial value of λ_0 = -.05 was chosen. Note that this perturbation must be small because the initial estimate of the constant term was chosen to be consistent with λ = 0.

TABLE 4
AUTOCORRELATIONS AND PARTIAL AUTOCORRELATIONS OF ∇∇_12 y(t) (RAW DATA)

                                                     Approximate
Lag      Autocorrelations                            Standard Error
1-6     -.26  .08  .13  .08  .11  .10                     .13
7-12     .00 -.13  .23 -.06  .17 -.25                     .14
13-18    .17 -.03  .11 -.03 -.03 -.01                     .16
19-24   -.07  .27 -.26  .05  .04  .03                     .17
25-30   -.15  .08 -.09 -.06  .15 -.15                     .18
31-36    .07 -.14  .13 -.05 -.04 -.05                     .19

Lag      Partial Autocorrelations
1-6     -.26  .02  .17  .16  .17  .15                     .13
7-12     .01 -.24  .06  .00  .21 -.20                     .13
13-18    .06 -.04  .13 -.03  .02 -.10                     .13
19-24   -.12  .15 -.04 -.05  .12 -.01                     .13
25-30   -.16 -.08  .03 -.01  .08 -.03                     .13
31-36    .05 -.06  .00  .03 -.06 -.04                     .13

A large perturbation could make the initial values of δ and λ incompatible and possibly cause the least-squares routine to diverge.³

The estimated values and their standard errors are shown in Table 5.

TABLE 5
PARAMETER ESTIMATES AND STANDARD ERRORS

                                Standard Errors
Parameter   Estimate     Unconditional   Conditional
θ            .423           .117            .114
Θ            .891           .054            .054
δ            .663E-3        .117E-2         .232E-3
λ           -.212           .219
σ_a²         .351E-4

There are two ways of viewing the distributions of the estimators θ̂, Θ̂, and δ̂. We can consider their distributions conditional on a fixed value of λ. Or we can admit the possibility of random variation in λ̂ and thus allow additional random variation in θ̂, Θ̂, and δ̂.

The procedure we are using is designed to choose the best metric, or transformation, for our analysis. Given that we have selected λ = -.212, we should examine the adequacy of our model for the data thus transformed. In other words, we should examine the estimates using statistics conditioned on our choice of λ. From Table 5 we see that, from this point of view, each of the estimates θ̂, Θ̂, and δ̂ is

more than two standard errors away from zero and thus can be considered significant. It is worth noting that if we had first transformed our data using λ = -.212 and then estimated the other parameters using standard Box-Jenkins techniques, we would have arrived at identical estimates, with the same estimated standard errors as the conditional standard errors in Table 5.

We can reconcile the two points of view by examining the correlation matrix of the estimators θ̂, Θ̂, δ̂, and λ̂ given in Table 6.⁴ This matrix allows random variation in all of the estimators jointly and therefore corresponds to the unconditional case above.

TABLE 6
ESTIMATED CORRELATION MATRIX

         θ̂       Θ̂       δ̂       λ̂
θ̂      1.00
Θ̂      -.11    1.00
δ̂      -.20     .12    1.00
λ̂      -.22     .10     .98    1.00

There is very little correlation between either θ̂ and λ̂ or Θ̂ and λ̂, implying that the standard deviations of θ̂ and Θ̂ will be little affected by conditioning on λ. This confirms the figures in Table 5. These results are also consistent with our observation that the autocorrelations and partial autocorrelations, which are functions of θ and Θ, are insensitive to the transformation parameter λ.

On the other hand, there is extremely high correlation between λ̂ and δ̂. As pointed out earlier, this is expected because λ is a scale parameter directly affecting the measurement units of δ. Consequently, if we allow λ̂ to vary, we must also expect δ̂ to vary, not only because of the random errors in the model, but also because its units of measurement are changing. Thus the large unconditional standard error for δ̂ (shown in Table 5) also is expected. This standard error is of little use in determining the significance of δ; it can be interpreted only as a measure of the effect of changes in λ on the units of measurement of the transformed data.

We also see in Table 5 that the transformation parameter λ̂ is only one standard error away from zero and could, therefore, be regarded as insignificant. But our purpose is not to test the hypothesis that the logarithm is the correct transformation: it is to find the transformation that best explains the observed data. We therefore prefer λ = -.212 to a logarithmic (λ = 0) transformation and expect to obtain improved forecasting performance through use of this transformation. The only relevance of the standard error of λ̂ is that the forecasting improvement might be expected to be smaller than if λ̂ were, say, two standard errors removed from zero.

Finally, it is useful to examine the nature of the likelihood surface. In Figure 7 we have plotted the variation of the likelihood L with θ and λ. (In this instance L is maximized over Θ and δ.) Note that the surface is regular and unimodal, with nearly elliptical contours, especially near the maximum, supporting our assertion that standard nonlinear least-squares algorithms will be efficient in this application.

[Contour plot not reproduced.]
Fig. 7. Log likelihood contours over θ and λ.

The contours are only slightly inclined to the axes, indicating only slight correlation between θ̂ and λ̂. Similar comments apply to the variation of the likelihood L with Θ and λ, as shown in Figure 8. Figure 9 gives the variation with λ of the likelihood L maximized over θ, Θ, and δ. This illustration shows that L is approximately quadratic in λ near the maximum and that the curve is well behaved and unimodal.

Diagnostics

We have now estimated jointly the transformation parameter and the parameters of the ARIMA(0,1,1)×(0,1,1) model for the transformed data. The next step is to examine the model for adequacy of fit to the transformed data. The adequacy of fit was tested by examining the residuals, their autocorrelations, and their cross-correlations with the differenced transformed series.

As a first step, we plotted the residuals (see Figure 10). Examination of this plot reveals no evidence of systematic change in variance, nor of cyclical patterns. There is some suggestion of "stickiness" in one or two places, but the significance of this was easily checked through the autocorrelation function.

The autocorrelation function of the estimated residuals is given in Table 7. Nowhere do the autocorrelations exceed two standard errors in magnitude. At small lags and at multiples of 12, however, the standard errors can be somewhat smaller than those shown [6], and care must be taken to allow for this possibility. Even so, the only doubtful value is at lag 36, and with no other suspicious values there is insufficient evidence to conclude that the residuals are not white noise.

[Contour plot not reproduced.]
Fig. 8. Log likelihood contours over Θ and λ.

[Plot not reproduced.]
Fig. 9. Variation of log likelihood L with λ.
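A curve like that of Figure 9 can be traced by concentrating the likelihood: at each value of λ, minimize S_z over the remaining parameters and use the standard least-squares approximation that the maximized log likelihood is, up to a constant, -(n/2) ln(S_z/n) (the term (1/2) ln |M_n| in (4) is ignored here). A sketch, assuming S_z() and the sales array from the earlier estimation sketch:

    import numpy as np
    from scipy.optimize import minimize

    lam_grid = np.linspace(-0.45, 0.09, 19)
    profile = []
    for lam in lam_grid:
        fit = minimize(lambda p: S_z(np.concatenate(([lam], p)), sales),
                       x0=np.array([0.42, 0.89, 0.0]), method="Nelder-Mead")
        profile.append(-(59 / 2.0) * np.log(fit.fun / 59))   # n = 59 here
    print(lam_grid[int(np.argmax(profile))])                 # should land near -.21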

[Plot not reproduced.]
Fig. 10. Residuals of the ARIMA(0,1,1)×(0,1,1) model for {y_λ(t)}.

TABLE 7
RESIDUAL AUTOCORRELATIONS

                                                     Approximate
Lag                                                  Standard Error
1-6     -.07  .03  .15 -.03 -.01  .18                     .13
7-12    -.08  .01  .20 -.16  .11  .05                     .14
13-18    .01  .09  .12 -.09  .03 -.09                     .15
19-24   -.09  .07 -.10 -.10  .08 -.01                     .16
25-30   -.25  .01 -.01  .06  .13 -.11                     .16
31-36    .03 -.04 -.11 -.07 -.02 -.20                     .17

It should be noted that the distributions derived by Box and Pierce [6] do not allow for the effect of estimating the transformation parameter λ. We have therefore assumed that these distributions are not greatly affected by the estimation of λ and that the standard errors are approximately the same as under a fixed transformation. This assumption is supported by our observation that the estimates of θ and Θ are insensitive to λ, although an analytical investigation must eventually be undertaken.

A second test calculated the Q statistic [6], [4, p. 291], given by

    Q = 59 Σ_{k=1}^{36} r_k²(â) = 26.5,

where r_k(â) is the residual autocorrelation at lag k. The factor of 59 is the number of values in the differenced series, and the upper limit of 36 is chosen as the lag beyond which the theoretical autocorrelations become negligible. Box and Pierce have shown that Q is asymptotically χ² with 36 - 3 = 33 degrees of freedom (there being 3 parameters, θ, Θ, and δ, estimated in the model). We have assumed that the Q statistic will, in this case, be approximately χ² with 32 degrees of freedom, a reduction of one degree of freedom to allow for estimation of the additional parameter λ. The 5 percent critical value of χ² for 32 degrees of freedom is 46.2. This test therefore provides no evidence for rejecting the estimated model.

Finally, we can examine the cross-correlations between the differenced series w_λ(t) = ∇∇_12 y_λ(t) and the estimated residuals â(t). For positive values of k, we would expect the cross-correlations of w_λ(t) and â(t+k) to be close to zero. The cross-correlations between w_λ(t+k) and â(t) should be somewhat removed from zero at small lags and at multiples of 12 but should diminish as k increases. This general pattern is apparent in Table 8.

We conclude from this and the preceding section that the data are closely approximated by the model

    y_λ(t) = (y(t)^λ - 1)/λ,   λ = -.212,
    w_λ(t) = ∇∇_12 y_λ(t),
    w_λ(t) = .000663 + (1 - .423B)(1 - .891B^12) a(t),              (10)
    Var{a(t)} = .0000351.

TABLE 8
CROSS-CORRELATIONS OF RESIDUALS AND DIFFERENCED SERIES

Lag k    Cross-Correlations â(t), w_λ(t+k)
1-6     -.46  .20 -.03 -.05  .05  .02
7-12    -.09  .04  .07 -.16  .16 -.46
13-18    .16  .06  .02 -.07  .03 -.10
19-24    .02  .08 -.17  .09 -.03  .02
25-30   -.19  .04 -.04  .06  .05 -.11
31-36    .07 -.12  .03 -.04  .00 -.12

Lag k    Cross-Correlations â(t+k), w_λ(t)
1-6     -.02 -.05  .01  .05 -.06  .23
7-12     .04 -.17  .27  .00 -.07  .01
13-18    .19 -.07  .18  .00 -.05  .07
19-24   -.11  .13  .03 -.13  .05  .25
25-30   -.21  .03  .06 -.02  .15 -.01
31-36   -.05  .04  .10 -.11  .04 -.01
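Given the residual autocorrelations, the portmanteau check quoted above reduces to two lines. A sketch, assuming acf() from the earlier differencing sketch and an array resid holding the 59 model residuals:

    import numpy as np

    r = acf(resid, 36)           # residual autocorrelations r_1, ..., r_36
    Q = 59 * np.sum(r ** 2)      # Box-Pierce statistic; 26.5 is reported above
    print(Q)                     # compare with 46.2, the chi-square(32) 5% point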

Forecasting

Forecasts were calculated up to twelve months ahead and compared with actual values. A formula for forecasts of the transformed data can be easily obtained from the model (10). We use ŵ_{λ,t}(ℓ) to represent the forecast of w_λ(t+ℓ) made from the origin date t. Noting that the forecast of a(t+ℓ) is zero for positive values of ℓ, we have

    ŵ_{λ,t}(1) = .000663 - .423â(t) - .891â(t-11) + .377â(t-12),

and for 1 < ℓ ≤ 12 we have

    ŵ_{λ,t}(ℓ) = .000663 - .891â(t+ℓ-12) + .377â(t+ℓ-13).

Now

    w_λ(t) = y_λ(t) - y_λ(t-1) - y_λ(t-12) + y_λ(t-13),

and thus

    y_λ(t) = w_λ(t) + y_λ(t-1) + y_λ(t-12) - y_λ(t-13).

We therefore can obtain forecasts ŷ_{λ,t}(ℓ) of y_λ(t+ℓ) from

    ŷ_{λ,t}(1) = ŵ_{λ,t}(1) + y_λ(t) + y_λ(t-11) - y_λ(t-12),

and for 1 < ℓ ≤ 12,

    ŷ_{λ,t}(ℓ) = ŵ_{λ,t}(ℓ) + ŷ_{λ,t}(ℓ-1) + y_λ(t+ℓ-12) - y_λ(t+ℓ-13).

Approximate tolerance limits for ŷ_{λ,t}(ℓ) were calculated using the methods described by Box and Jenkins [4, Chapter 5]. Forecasts and tolerance limits for the transformed data are given in Table 9. Note that the limits are tolerance (or probability) limits, not confidence intervals. They are only approximate because they are based on estimated values of θ, Θ, and σ_a² rather than true values.
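The recursions above, together with the inverse of transformation (1) used below to return to the original scale, fit in a few lines. A sketch, assuming arrays y_lam (the transformed series through the origin date) and a (the fitted residuals, with a[-1] taken to be â(t)); the parameter defaults are those of model (10).

    import numpy as np

    def forecast_12(y_lam, a, delta=0.000663, theta=0.423, Theta=0.891, lam=-0.212):
        # Twelve forecasts of the original series from the origin date t,
        # taken to be the last point of y_lam.
        yhat = list(y_lam)                   # extended with transformed forecasts
        out = []
        for ell in range(1, 13):
            w = (delta - Theta * a[len(a) - 13 + ell]        # -.891 a(t+ell-12)
                 + theta * Theta * a[len(a) - 14 + ell])     # +.377 a(t+ell-13)
            if ell == 1:
                w -= theta * a[-1]           # extra term in the lead-one formula
            yhat.append(w + yhat[-1] + yhat[-12] - yhat[-13])
            out.append((1.0 + lam * yhat[-1]) ** (1.0 / lam))   # invert (1)
        return np.array(out)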

TABLE 9
TRANSFORMED SERIES: FORECASTS FROM ORIGIN t = 72

Lead Time    Forecast ŷ_{λ,t}(ℓ)    95 Percent Tolerance Limits
 1               3.4712                   ±.0118
 2               3.4933                   ±.0137
 3               3.4879                   ±.0153
 4               3.4804                   ±.0168
 5               3.4747                   ±.0181
 6               3.4554                   ±.0193
 7               3.4416                   ±.0205
 8               3.4365                   ±.0216
 9               3.4521                   ±.0227
10               3.4516                   ±.0237
11               3.4483                   ±.0247
12               3.4376                   ±.0256

Forecasts ŷ_t(ℓ) of the original series {y(t+ℓ)} were then obtained by means of the inverse transformation

    ŷ_t(ℓ) = (1 - .212 ŷ_{λ,t}(ℓ))^{-1/.212}.

Approximate 95 percent tolerance limits were obtained by similarly inverting the limits for ŷ_{λ,t}(ℓ). The final results are shown in Table 10 and plotted in Figure 11.

There is a special problem with calculating forecasts as in Table 10. Box-Jenkins forecasts are minimum mean-square-error forecasts, and, under the assumption of independent normal residuals, these are conditional expectations. Because the normal distribution is symmetrical, the mode and mean coincide and the forecasts become most probable forecasts.

TABLE 10
FORECASTS OF ORIGINAL SERIES FROM ORIGIN DATE t = 72

Lead    Lower 95            Upper 95
Time    Percent Limit   Forecast   Percent Limit   Actual   Error
 1          511            534          559          553      19
 2          551            581          613          586       5
 3          537            569          604          564      -5
 4          519            553          590          553       0
 5          505            541          580          519     -22
 6          468            503          541          497      -6
 7          443            478          516          481       3
 8          433            469          508          470       1
 9          456            497          542          492      -5
10          455            496          541          492      -4
11          447            490          538          481      -9
12          429            471          518          453     -18

However, if we obtain such forecasts for transformed data with a value of λ ≠ 1 and then use the inverse transformation to obtain forecasts of the original series, the distribution of the forecast error will be skewed, and the mode and mean will not be the same. The forecasts will still correspond to the mode of the distribution, and therefore will be most probable forecasts, but they will not be conditional expectations nor minimum mean-square-error forecasts. Some discussion of the possibility of adjusting forecasts to allow for a quadratic loss function is given by Chatfield and Prothero [7], Harrison [8], and Box and Jenkins [5].

[Plot not reproduced.]
Fig. 11. Original series with forecasts and approximate 95 percent tolerance limits from origin date 72 (December 1974). (Solid line: original series; dashed line: forecasts; dot-dash line: 95 percent tolerance limits.)

Comparison with Log Model

To compare the forecasting performance of the transformed model with that of the more conventional log model, we fitted the best model to the logarithms and calculated a set of forecasts. The log model used was

    w_0(t) = ∇∇_12 ln y(t),
    w_0(t) = .00349 + (1 - .389B)(1 - .904B^12) a(t).

All parameters in this model were significant, and no serious irregularities were found in the diagnostics.

The following sets of forecasts were generated for the log model. First we found one-step-ahead forecasts for origin dates 72, 73,..., 83. Then we calculated two-steps-ahead forecasts for origin dates 72, 73,..., 82, and so on, until finally we obtained a single twelve-steps-ahead forecast for the origin date 72. The same model parameters were used at each origin date; no attempt was made to re-estimate the parameters as additional data points became available with the advancing origin date. A similar set of forecasts was obtained for the transformed data model (λ = -.212). In both cases, forecasts of the original data were obtained by simple inversion of forecasts of the transformed data, and no adjustment was made for skewness.

Table 11 summarizes the forecasting errors for the transformed (λ = -.212) model and the log model. These results speak for themselves: the model with λ = -.212 was clearly superior.

TABLE 11
COMPARISON OF FORECASTING PERFORMANCE
(MAE = mean absolute error; MSE = mean square error)

Steps      λ = -.212        λ = 0 (log)      Percentage Improvement   Individual Improvements
Ahead      MAE     MSE      MAE     MSE      MAE        MSE           Number      Percentage
 1          8.1     116      8.7     125       7          7              9            75
 2         10.0     168     11.7     194      15         13              9            82
 3         11.5     213     13.2     269      13         21              7            70
 4         14.6     264     16.2     354      10         25              6            67
 5         15.1     263     18.8     400      20         34              7            88
 6         16.4     298     20.0     431      18         31              7           100
 7         18.8     446     23.0     620      18         28              6           100
 8         22.6     665     28.2     954      20         30              5           100
 9         25.0     772     32.3    1169      23         34              4           100
10         27.7     947     36.0    1469      23         36              3           100
11         29.5    1003     39.5    1693      25         41              2           100
12         32.0    1024     42.0    1764      24         42              1           100
Overall    15.2     345     18.4     495      17         30             66            85
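The evaluation scheme just described is a rolling-origin comparison and is simple to script. In the sketch below, sales_all (all 84 observations) and the helper fcst(origin, ell) are hypothetical names introduced for illustration; fcst would wrap the forecast recursion given earlier, applied to data through the origin date.

    import numpy as np

    errors = {ell: [] for ell in range(1, 13)}
    for origin in range(72, 84):                    # origin dates 72, 73, ..., 83
        for ell in range(1, 85 - origin):           # targets may not run past month 84
            actual = sales_all[origin + ell - 1]    # month origin + ell, zero-based array
            errors[ell].append(actual - fcst(origin, ell))
    for ell, e in errors.items():
        e = np.asarray(e, dtype=float)
        print(ell, np.mean(np.abs(e)), np.mean(e ** 2))   # columns of Table 11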

Discussion

Use of the Box-Cox transformation in this example has been shown to give greatly improved forecasting performance. This approach is new and relatively untried, however, and there are several issues yet to be resolved.

First, identification depends largely on the near orthogonality of the estimator for λ and the estimators of the other structural parameters. A group at The University of Michigan is working on a formal proof of this property, but until they are finished, we have only our rather limited experience on which to rely. Similarly, we believe that asymptotic distributions of parameter estimators and residual autocorrelations remain unchanged except for the loss of a degree of freedom, but, again, a rigorous analytical proof is required.

Forecasting performance has not yet been fully investigated. Experience to date has disclosed substantial improvements in forecasts, and we have found no case in which inferior forecasts were generated. However, a detailed empirical study covering a wide range of time series is being planned by researchers at The University of Michigan to substantiate this experience. It will adopt the same general approach as used by Reid [13] and by Newbold and Granger [11] to compare different forecasting methods.

Despite these reservations, all indications are that the modified Box-Jenkins method is easy to use and gives generally better forecasts. In short, it seems to work.

Footnotes

1. The reason for working with the transformation (1) rather than the power transformation (2) is that the former is continuous at λ = 0 (the log transformation).

2. Pierce [12] has shown that the Box-Jenkins least-squares estimators are consistent and have the same asymptotic distributions for both normal and non-normal series.

3. Standard errors were obtained by inverting the estimated information matrix (see Box and Jenkins [4, p. 227]).

4. This matrix was obtained by inverting the estimated information matrix.

References

1. Ansley, C. F., Spivey, W. A., and Wrobleski, W. J. "An Analysis of Transformations for Time Series." Proceedings of the American Statistical Association, Business and Economics Section, 1975, pp. 211-16.

2. ____. "A Class of Transformations for Box-Jenkins Seasonal Models." Paper submitted for publication to Applied Statistics, February 1976.

3. Box, G. E. P., and Cox, D. R. "An Analysis of Transformations." Journal of the Royal Statistical Society B 26 (1964): 211-43.

4. Box, G. E. P., and Jenkins, G. M. Time Series Analysis, Forecasting and Control. San Francisco, Calif.: Holden-Day, 1970.

5. ____. "Some Comments on a Paper by Chatfield and Prothero and on a Review by Kendall." Journal of the Royal Statistical Society A 136 (1973): 337-45.

6. Box, G. E. P., and Pierce, D. A. "Distribution of Residual Autocorrelations in Autoregressive-Integrated Moving Average Time Series Models." Journal of the American Statistical Association 65 (1970): 1509-26.

7. Chatfield, C., and Prothero, D. L. "Box-Jenkins Seasonal Forecasting: Problems in a Case Study." Journal of the Royal Statistical Society A 136 (1973): 295-315.

8. Harrison, P. J. "Discussion of a Paper by Dr. Chatfield and Dr. Prothero." Journal of the Royal Statistical Society A 136 (1973): 319-24.

9. Marquardt, D. W. "An Algorithm for Least Squares Estimation of Nonlinear Parameters." Journal of the Society for Industrial and Applied Mathematics 11 (1963): 431-41.

10. Nelson, C. R. Applied Time Series Analysis for Managerial Forecasting. San Francisco, Calif.: Holden-Day, 1973.

11. Newbold, P., and Granger, C. W. J. "Experience with Forecasting Univariate Time Series and the Combination of Forecasts." Journal of the Royal Statistical Society A 137 (1974): 131-64.

12. Pierce, D. A. "Least Squares Estimation in the Regression Model with Autoregressive-Moving Average Errors." Biometrika 58 (1971): 299-312.

13. Reid, D. J. "A Comparative Study of Time Series Prediction Techniques on Economic Data." Unpublished Ph.D. thesis, University of Nottingham, 1969.

14. Tunnicliffe Wilson, G. "Discussion of a Paper by Dr. Chatfield and Dr. Prothero." Journal of the Royal Statistical Society A 136 (1973): 315-19.