Division of Research
Graduate School of Business Administration
The University of Michigan
August 1974

ANALYZING AND FORECASTING TIME SERIES
Part 1: Methodology

Working Paper No. 91

by
W. Allen Spivey
William J. Wrobleski
The University of Michigan

© The University of Michigan

FOR DISCUSSION PURPOSES ONLY
None of this material is to be quoted or reproduced without the express permission of the Division of Research.

TABLE OF CONTENTS

1. Introduction
2. Components Time Series Analysis
3. Time Series As A Stochastic Process
4. Classification of Stochastic Processes and Related Concepts
5. Power Spectra and Covariance Stationary Processes
6. Moving Average Processes and Some Examples
7. Autoregressive Processes
8. Autoregressive Moving Average Processes
9. Simple Exponential Smoothing Forecasting Models
10. Higher Order Exponential Smoothing Models
11. A Useful Alternate Form of the Second Order Exponential Smoothing Model
12. Exponential Smoothing Models for Seasonal Time Series
13. Box-Jenkins Forecasting Procedures
Bibliography

1. INTRODUCTION We take the point of view that analyzing and forecasting a time series are interrelated activities. One uses time series analysis as an important model building strategy whereby intuitive ideas about the underlying time series data are translated into an appropriate forecasting model. Two major approaches to time series analysis are available. One, which we will call components analysis, regards the time series as being composed of several influences or components which are generally taken to be trend and cyclical, seasonal, and irregular or random movements. Oftentimes in this approach, trend and seasonal influences are modeled in a deterministic manner; trend may be regarded as a polynomial of a given degree and the seasonal component may be modeled by a trigonometric function with given period and amplitude. Random influences are usually assumed to have a simple probability structure, i. e., they are treated as independent, identically distributed random variables having zero mean and finite variance. Because of the simple role of probability in this approach, estimation of the trend and cyclical component, as well as the seasonal component, is carried out using fairly simple filtering procedures such as differencing, forming moving averages, considering

ratios involving moving averages, and others rather than by using parametric statistical estimation procedures.

The other major approach is to regard a time series as an observed sample function representing a realization of an underlying stochastic process. A stochastic process can be stationary or nonstationary; it can be a moving average process, an autoregressive process, or one which involves both autoregressive and moving average terms. Moreover, associated with a stochastic process are the mean and autocovariance functions, the autocorrelation function, and its power spectrum. When two or more processes are considered, their cross-correlation functions, cospectra, and other characteristics may be examined. This injects a full array of probabilistic concepts and methods into time series analysis and forecasting. The more elaborate probability framework permits a more complete consideration of statistical estimation than is generally undertaken in components analysis.

2. COMPONENTS TIME SERIES ANALYSIS

The components of a time series can be assumed to be combined additively, multiplicatively, or some mixture of the two. Thus we can speak of additive, multiplicative, or mixed models. Specifically, if TC(t) indicates the trend-cyclical component, S(t) the

seasonal, and e(t) the random component, an additive model is given by

(2.1)  Y(t) = TC(t) + S(t) + e(t),

where Y(t) denotes the observed data. A multiplicative model is

(2.2)  Y(t) = TC(t)S(t)e(t),

and an example of a mixed model is

(2.3)  Y(t) = TC(t)S(t) + e(t).

For purposes of forecasting with a components model, one approach is to estimate the components of Y(t) appropriately and use these estimates to obtain a forecast Ŷ(t+τ) of the expected value of Y(t) τ periods ahead as follows.¹ If T̂C(t) and Ŝ(t) are estimates of TC(t) and S(t), respectively, then the τ-periods ahead forecast of Y(t) is taken as

(2.4)  Ŷ(t+τ) = T̂C(t+τ) + Ŝ(t+τ)

in the additive model or as

(2.5)  Ŷ(t+τ) = T̂C(t+τ)Ŝ(t+τ)

for either the multiplicative or mixed model.

¹The forecast of the expected value of Y(t+τ) is often called the forecast value of Y(t).
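To make the mechanics of (2.4) and (2.5) concrete, the following minimal Python sketch forms additive and multiplicative forecasts from estimated components. The particular functional forms chosen for the estimated trend-cycle and seasonal components are illustrative assumptions, not part of the text.

```python
import math

# Hypothetical estimated components (illustrative assumptions only):
# a linear trend-cycle, an additive seasonal (deviations around 0), and a
# multiplicative seasonal (ratios around 1), each with period 12.
def tc_hat(t):
    return 100.0 + 0.8 * t

def s_hat_add(t):
    return 5.0 * math.sin(2 * math.pi * t / 12)

def s_hat_mult(t):
    return 1.0 + 0.05 * math.sin(2 * math.pi * t / 12)

def forecast_additive(t, tau):
    # (2.4): Y-hat(t+tau) = TC-hat(t+tau) + S-hat(t+tau)
    return tc_hat(t + tau) + s_hat_add(t + tau)

def forecast_multiplicative(t, tau):
    # (2.5): Y-hat(t+tau) = TC-hat(t+tau) * S-hat(t+tau)
    return tc_hat(t + tau) * s_hat_mult(t + tau)

t = 48  # current period
for tau in (1, 2, 3):
    print(tau, round(forecast_additive(t, tau), 2), round(forecast_multiplicative(t, tau), 2))
```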

To determine prediction intervals for Ŷ(t+τ), a study of the random component e(t) would be required. For example, if the random component were considered to be a disturbance term which from period to period could be assumed to be independent and identically distributed with zero mean and unknown variance, an estimate of this variance, as well as the estimates T̂C(t) and Ŝ(t), would be necessary.

Methods of estimating components vary from simple ad hoc specification to elaborate computer-based manipulations of the data. Detrending methods illustrate this variety very well. Among the simpler methods is the calculation of first or higher-order differences of the data. The order of differencing used corresponds to one's perception of the trend as represented by a polynomial of given degree, and one might use the variate difference method to estimate the degree of the polynomial representing the trend and cyclical component. Other detrending procedures involve using a moving average as well as polynomial regression. A wide variety of deseasonalizing procedures is also available. These range from manipulation of moving averages, such as the Census II method and its variants, to regression analysis using dummy variables, to more sophisticated harmonic and spectral analyses.

Developing suitable estimates of these components is a delicate problem in time series analysis and these estimates obviously

influence whatever forecasting model is developed subsequently. Unfortunately, from a forecasting point of view, the methods of statistical estimation of these components are inadequate at the present time. Rigorous definitions of trend and seasonal influences have not been developed. Yet another difficulty is that some widely used procedures for separating one component from a time series have been shown to separate parts of others as well. Moreover, a circularity in estimation is present in some procedures. For example, if one wishes to estimate the TC(t) component, a standard method is to estimate the seasonal component and deseasonalize the data, producing a time series consisting of TC(t) and e(t). Unfortunately, in order to estimate S(t) one must make an initial assumption concerning TC(t). Therefore some implicit detrending of the data is present in the process of attempting to estimate TC(t), thus producing an evident circularity in the procedure.

Despite important shortcomings, which include lack of rigorous definitions for TC(t) and S(t) and the fact that many unresolved inference problems remain, components analysis continues to occupy an important place in time series analysis. The literature is large and for over 70 years has received wide attention in both theoretical and applied fields. Our survey is largely directed to time series analysis in a probabilistic framework and we do not give a full treatment

of components analysis here. However, as the reader will see, one is not entirely free of some form of components analysis and the brief discussion above is useful in the material that follows.

3. TIME SERIES DATA AS A STOCHASTIC PROCESS

A stochastic process can be viewed as a family of random variables {X(ω, t)} defined on an appropriate probability space (Ω, ℱ, P), where ω ∈ Ω and the index t belongs to a set T. Here Ω denotes a sample space, ℱ is a σ-field of events associated with Ω, and P is a probability measure. For all purposes in this paper, the set T is either an infinite subset of the set of integers or is an interval on the real line. Whenever T is a countable set we say that the stochastic process is a discrete parameter process; otherwise we say it is a continuous parameter process.

For an alternate view, suppose that ω₀ is a fixed point in Ω. If we treat the corresponding values of the random variable X(ω₀, t) for the fixed ω₀ as a function of t, then we arrive at the concept of a sample function of the stochastic process {X(ω, t)}. We suppress explicit reference to ω in this case, simply using X(t) to denote the sample function. A realized (observed) sample function is called a time series.

Thus by means of this approach and through appropriate statistical

analysis and estimation applied to time series data, it may be possible to make inferences concerning important properties of the stochastic process generating the data. These properties also provide a basis for the development of forecasting models, as will be discussed in later sections.

4. CLASSIFICATION OF STOCHASTIC PROCESSES AND RELATED CONCEPTS

Associated with either of the points of view in the previous section are the families of finite dimensional distribution functions

F(x_1, ..., x_n; t_1, ..., t_n) = P(X(t_1) \le x_1, ..., X(t_n) \le x_n),

where {t_1, ..., t_n} is any finite subset of the index set T. If it turns out that these families of distribution functions satisfy the condition that

(4.1)  F(x_1, ..., x_n; t_1, ..., t_n) = F(x_1, ..., x_n; t_1+h, ..., t_n+h)

for any displacement h of t_1, ..., t_n, then the stochastic process is said to be a strictly stationary stochastic process.

Relating this important concept to the sample function interpretation, we observe that if a time series displays a pronounced trend that persists over a long period of time, the distribution functions

could not satisfy the condition (4.1) and one would in this case say that the underlying stochastic process is not strictly stationary.

Using the family of distribution functions introduced above, we define the mean function of a stochastic process as

(4.2)  M(t) = \int_{-\infty}^{\infty} x \, dF(x; t) = E[X(t)].

The variance function is

(4.3)  V(t) = \int_{-\infty}^{\infty} (x - M(t))^2 \, dF(x; t) = E[(X(t) - M(t))^2]

and the covariance function between X(t_1) and X(t_2) is given by

(4.4)  C(t_1, t_2) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} (x_1 - M(t_1))(x_2 - M(t_2)) \, dF(x_1, x_2; t_1, t_2)
               = E[(X(t_1) - M(t_1))(X(t_2) - M(t_2))].

Note that when t_1 = t_2 = t we get C(t, t) = V(t). The correlation function between X(t_1) and X(t_2) is

(4.5)  ρ(t_1, t_2) = C(t_1, t_2) / \sqrt{V(t_1)V(t_2)}.

It is obvious that when a stochastic process is strictly stationary we have M(t) = M(t+h) and V(t) = V(t+h), since F(x; t) = F(x; t+h). Thus the mean M(t) and the variance V(t) of a strictly stationary process must be constant and cannot depend on t. In addition,

C(t_1, t_2) = C(t_1+h, t_2+h) and ρ(t_1, t_2) = ρ(t_1+h, t_2+h), since F(x_1, x_2; t_1, t_2) = F(x_1, x_2; t_1+h, t_2+h). Furthermore, taking t_1 = t-h and t_2 = t gives

(4.6)  C(t_1, t_2) = C(t-h, t) = C(t, t+h) = C(h)

and we see that for strictly stationary processes the covariance is a function of the displacement h alone. Moreover, if we let h = 0 in (4.6) we get C(0) = C(t, t) = V(t) = V(0), since in the strictly stationary case V(t) is a constant. A similar argument shows that

ρ(t_1, t_2) = ρ(t-h, t) = ρ(t, t+h) = ρ(h).

In this case the functions C(h) and ρ(h), regarded as functions of h, are called the autocovariance and autocorrelation functions, respectively.

An important class of stochastic processes, less restricted than the class of strictly stationary processes, is that of covariance

stationary processes. These and other still less restricted classes are oftentimes of interest in time series analysis. A stochastic process is said to be stationary in the mean if the mean function M(t) = E[X(t)] is a constant (and does not depend on t). A process is said to be covariance stationary (or stationary in the wide sense) if C(t, t+h) = E[(X(t) - M(t))(X(t+h) - M(t+h))] is a function of the displacement h only. This condition in turn implies that M(t) = E[X(t)] and V(t) = E[(X(t) - M(t))^2] are constants, neither depending on t. In other words, covariance stationarity implies stationarity in the mean and stationarity in the variance.

5. POWER SPECTRA AND COVARIANCE STATIONARY PROCESSES

We have seen that when a stochastic process is covariance stationary its autocovariance function depends only on the displacement of t. The covariance function C(k) of a discrete parameter covariance stationary process, regarded as a function of the displacement k, is a positive semi-definite function. Thus it can be represented as a Fourier transform of a uniquely chosen nondecreasing function G(λ),

(5.1)  C(k) = \int_{-1/2}^{1/2} \cos(2\pi k\lambda) \, dG(\lambda),

where 0 ≤ λ ≤ 1/2. In this context, one refers to λ as a frequency, since the function cos 2πλx has period 1/λ and frequency 1/(1/λ) = λ. For continuous parameter stochastic processes which are covariance stationary, the covariance function C(h), regarded as a function of h, is also positive semi-definite and can be uniquely represented as a Fourier transform

(5.2)  C(h) = \int_{0}^{\infty} \cos(2\pi h\lambda) \, dG(\lambda),

where G(λ) is a uniquely chosen monotone nondecreasing bounded function and where 0 ≤ λ < ∞. In either the discrete or continuous case the function G(λ) is called the spectral function. Expressions corresponding to (5.1) and (5.2) are available for complex valued covariance stationary processes, but our discussion here is limited to real valued processes only. These important results are consequences of Bochner's theorem.

Moreover, in the discrete parameter case, when \sum_{k=0}^{\infty} |C(k)| converges there is a continuous spectral density function, sometimes called the power spectrum,

(5.3)  g(λ) = 2C(0) + 4\sum_{k=1}^{\infty} C(k)\cos(2\pi k\lambda).

This function has as its domain the frequencies which are associated with the corresponding autocovariance function, i.e., 0 ≤ λ ≤ 1/2.

This sometimes can be emphasized by saying that the function g(λ) is defined over the frequency domain.

In the continuous parameter case, when \int_{-\infty}^{\infty} |C(h)| \, dh converges, there is a continuous spectral density function or power spectrum

(5.4)  g(λ) = 2C(0) + 4\int_{0}^{\infty} C(h)\cos(2\pi\lambda h) \, dh.

The results (5.3) and (5.4) are special cases of the more general Fourier inversion theorem and they permit one to obtain the spectral density function from the autocovariance function.

Since the function cos 2πλh is symmetric about λ = 0, while C(k) = C(-k) for integers k and C(h) = C(-h) for real numbers h, some writers define the power spectrum as

p(λ) = C(0) + 2\sum_{k=1}^{\infty} C(k)\cos(2\pi k\lambda),   -1/2 ≤ λ ≤ 1/2,

for discrete parameter processes, or

p(λ) = C(0) + 2\int_{0}^{\infty} C(h)\cos(2\pi\lambda h) \, dh,   -∞ < λ < ∞,

for continuous parameter processes, instead of (5.3) and (5.4), respectively. The range of frequencies has been extended so as to include negative integers in the discrete parameter case and negative real numbers in the continuous case, the "power" assigned to the

frequency being one-half that in (5.3) and (5.4), i.e., p(λ) = ½ g(|λ|).

The preceding discussion provides a dual characterization of the probabilistic properties of covariance stationary processes, as they can be examined either in the time domain, using the autocovariance or autocorrelation functions, or in the frequency domain, through the analysis of the power spectrum. This duality between the time and frequency domains is extremely important for time series analysis and forecasting.

6. MOVING AVERAGE PROCESSES AND SOME EXAMPLES

Consider a sequence ε(t) of mutually uncorrelated random variables having mean 0 and unit variance, where t = 0, ±1, ±2, ... (such a sequence is called white noise); then

(6.1)  Z(t) = b_0ε(t) + b_1ε(t-1) + ⋯ + b_qε(t-q),

where b_0 ≠ 0 and b_1, ..., b_q are arbitrary (real) constants, is called a discrete parameter moving average process of finite order q and is denoted by MA(q). For our purposes, when b_0, b_1, ... is an infinite sequence for which \sum_{j=0}^{\infty} b_j^2 < ∞, we define the process

(6.2)  Z(t) = b_0ε(t) + b_1ε(t-1) + ⋯

to be a discrete parameter moving average process of infinite order.
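As an illustration of definition (6.1), the following sketch generates a realization of an MA(3) process from simulated white noise; the coefficients, sample size, and random seed are arbitrary assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative MA(3) coefficients b_0, ..., b_3 (with b_0 != 0).
b = np.array([1.0, 0.5, 0.25, 0.125])
q = len(b) - 1
n = 500                      # length of the realization (arbitrary choice)

# White noise: uncorrelated, mean 0, unit variance.
eps = rng.standard_normal(n + q)

# (6.1): Z(t) = b_0 eps(t) + b_1 eps(t-1) + ... + b_q eps(t-q)
z = np.array([b @ eps[t + q - np.arange(q + 1)] for t in range(n)])

print(z[:5])                 # first few values of the simulated series
```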

The mean function of a moving average process of order q is

(6.3)  M(t) = E[Z(t)] = E[b_0ε(t) + b_1ε(t-1) + ⋯ + b_qε(t-q)]
            = b_0E[ε(t)] + b_1E[ε(t-1)] + ⋯ + b_qE[ε(t-q)] = 0,

since E[ε(t)] = 0 for every t. The variance is

(6.4)  V(t) = Var[b_0ε(t) + b_1ε(t-1) + ⋯ + b_qε(t-q)]
            = b_0^2 Var[ε(t)] + b_1^2 Var[ε(t-1)] + ⋯ + b_q^2 Var[ε(t-q)]
            = \sum_{j=0}^{q} b_j^2

and the autocovariance is

(6.5)  C(t_1, t_2) = E[Z(t_1)Z(t_2)] = E\Big[\Big(\sum_{i=0}^{q} b_i ε(t_1-i)\Big)\Big(\sum_{j=0}^{q} b_j ε(t_2-j)\Big)\Big]
                   = \sum_i \sum_j b_i b_j E[ε(t_1-i)ε(t_2-j)].

When t_1 = t and t_2 = t+k, recalling that the ε(t) have unit variance and are uncorrelated, we find that

(6.6)  C(t, t+k) = \sum_i \sum_j b_i b_j E[ε(t-i)ε(t+k-j)]
                 = \sum_{i=0}^{q-k} b_i b_{i+k},   0 ≤ k ≤ q,
                 = 0,   k > q.

Since C(k) = C(t, t+k) depends only on k, the MA(q) process is covariance stationary. The autocorrelation function for the MA(q) process is

(6.7)  ρ(k) = C(k)/V(0) = \sum_{i=0}^{q-k} b_i b_{i+k} \Big/ \sum_{i=0}^{q} b_i^2,   0 ≤ k ≤ q,
            = 0,   k > q,

and the power spectrum is

(6.8)  g(λ) = 2\Big[\sum_{i=0}^{q} b_i^2\Big] + 4\sum_{k=1}^{q}\Big[\sum_{i=0}^{q-k} b_i b_{i+k}\Big]\cos(2\pi k\lambda),   0 ≤ λ ≤ 1/2.

Clearly, when we consider an infinite order discrete moving average process (6.2) for which \sum_{j=0}^{\infty} b_j^2 < ∞, then Z(t) will be a covariance stationary process with mean M(t) = 0, variance C(0) = V(0) = \sum_{j=0}^{\infty} b_j^2, and autocovariance C(k) = \sum_{i=0}^{\infty} b_i b_{i+k}.

These concepts can be illustrated by using the MA(3) process

(6.9)  Z(t) = ε(t) + .5ε(t-1) + .25ε(t-2) + .125ε(t-3).

The mean is 0, of course, and C(0) = \sum_{i=0}^{3} b_i^2 = 1.328125, C(1) = b_0b_1 + b_1b_2 + b_2b_3 = .65625, C(2) = .3125, C(3) = .125, while C(k) = 0 for |k| > 3. For this process ρ(0) = 1, ρ(1) = .50, ρ(2) = .23,

and ρ(3) = .09, while ρ(k) = 0 for |k| > 3. The autocovariance and autocorrelation functions are shown in Figures 1 and 2. Values for negative k are plotted since these are symmetric functions. The power spectrum for this process is

g(λ) = 2C(0) + 4\sum_{k=1}^{\infty} C(k)\cos(2\pi k\lambda)
     = 2.65625 + 4[(.65625)\cos 2\pi\lambda + (.3125)\cos 4\pi\lambda + (.125)\cos 6\pi\lambda]
     = 2.65625 + 2.625\cos 2\pi\lambda + 1.25\cos 4\pi\lambda + .50\cos 6\pi\lambda,

and is shown in Figure 3.

Returning to the moving average process of finite order q, when the finite dimensional distribution functions of the disturbance terms ε(t) are multivariate normal, Z(t), as a linear combination of jointly distributed normal random variables, must itself be normally distributed (the sequence of random variables ε(t), ε(t-1), ..., ε(t-q) in this case is sometimes referred to as Gaussian white noise). Therefore, the finite dimensional distribution functions for the covariance stationary process Z(t) are given by

(6.10)  F(z_1, ..., z_n; t_1, ..., t_n) = \int_{-\infty}^{z_1}\cdots\int_{-\infty}^{z_n} (2\pi)^{-n/2} |\Sigma(t_1, ..., t_n)|^{-1/2} e^{-\frac{1}{2} z\,\Sigma^{-1}(t_1, ..., t_n)\,z^T} \, dz_1 \cdots dz_n,

Fig. 1. Autocovariance Function for the MA(3) Process (6.9).

Fig. 2. Autocorrelation Function for the MA(3) Process (6.9).

Fig. 3. Power Spectrum for the MA(3) Process (6.9).

where z is the 1 by n vector z = [z_1, ..., z_n] and Σ(t_1, ..., t_n) is the n by n variance-covariance matrix of the random variables Z(t_1), ..., Z(t_n),

\Sigma(t_1, ..., t_n) = \begin{bmatrix} C(t_1, t_1) & \cdots & C(t_1, t_n) \\ \vdots & & \vdots \\ C(t_n, t_1) & \cdots & C(t_n, t_n) \end{bmatrix}.

We have seen that for an MA(q) process C(t_i, t_j) = C(t_i+k, t_j+k), so the variance-covariance matrix depends only on the displacement k, i.e., Σ(t_1, ..., t_n) = Σ(t_1+k, ..., t_n+k); it follows that when the random variables ε(t) are multivariate normal,

F(z_1, ..., z_n; t_1, ..., t_n) = F(z_1, ..., z_n; t_1+k, ..., t_n+k)

and the MA(q) process is strictly stationary as well. More generally, it can be said that when the families of finite dimensional distributions of a stochastic process are multivariate normal, covariance stationarity of the process implies strict stationarity.
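To connect these formulas with the MA(3) example (6.9), the short numpy sketch below computes C(k), ρ(k), and the power spectrum g(λ) directly from the coefficients and compares the sample autocovariances of a simulated realization with the theoretical values; the simulation length and seed are assumptions for illustration.

```python
import numpy as np

b = np.array([1.0, 0.5, 0.25, 0.125])      # coefficients of the MA(3) process (6.9)
q = len(b) - 1

# Theoretical autocovariances (6.6) and autocorrelations (6.7).
C = np.array([np.dot(b[:q + 1 - k], b[k:]) for k in range(q + 1)])
rho = C / C[0]
print("C(k):  ", C)                        # [1.328125, 0.65625, 0.3125, 0.125]
print("rho(k):", np.round(rho, 2))

# Power spectrum (6.8) evaluated on a grid of frequencies in [0, 1/2].
lam = np.linspace(0.0, 0.5, 6)
g = 2 * C[0] + 4 * sum(C[k] * np.cos(2 * np.pi * k * lam) for k in range(1, q + 1))
print("g(lambda):", np.round(g, 4))

# Sample autocovariances from a simulated realization.
rng = np.random.default_rng(1)
n = 20000
eps = rng.standard_normal(n + q)
z = np.array([b @ eps[t + q - np.arange(q + 1)] for t in range(n)])
C_hat = np.array([np.mean(z[:n - k] * z[k:]) for k in range(q + 1)])
print("sample C(k):", np.round(C_hat, 3))
```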

7. AUTOREGRESSIVE PROCESSES

Frequently in economic and other applications, it is natural to regard the current value of the time series as being influenced by the lagged values of the process together with a random disturbance term ε(t) which does not depend on the lagged values. Such a process can be written

(7.1)  a_0Z(t) + a_1Z(t-1) + ⋯ + a_pZ(t-p) = ε(t),

where a_0 ≠ 0, a_1, ..., a_p are real constants and the ε(t), for t = 0, ±1, ±2, ..., are uncorrelated random variables having mean 0 and unit variance (i.e., white noise). The process (7.1) is called a discrete parameter autoregressive process of order p and is denoted by AR(p). From (7.1) we easily obtain

(7.2)  a_0Z(t) = φ_1Z(t-1) + φ_2Z(t-2) + ⋯ + φ_pZ(t-p) + ε(t),

where φ_j = -a_j, j = 1, ..., p, which displays Z(t) as a weighted sum of the lagged Z(t-j) and the disturbance for period t. The equation (7.2) indicates why the process (7.1) is called autoregressive: the lagged terms in (7.2) play a role that is analogous to explanatory variables in standard single equation regression analysis.

One cannot easily determine the mean, variance, and covariance functions of the process (7.1) as was the case for the MA(q)

process because we do not know the joint distribution of the random variables Z(t-j); only some properties of the disturbances ε(t) are known in (7.1). Thus it is natural to ask if Z(t) can be expressed only in terms of the random disturbances ε(t), ε(t-1), ..., and under what conditions this is possible. To this end consider (7.1) for t-1, t-2, ..., t-v, with v ≥ p:

a_0Z(t-1) + a_1Z(t-2) + ⋯ + a_pZ(t-p-1) = ε(t-1)
a_0Z(t-2) + a_1Z(t-3) + ⋯ + a_pZ(t-p-2) = ε(t-2)
  ⋮
a_0Z(t-v) + a_1Z(t-v-1) + ⋯ + a_pZ(t-p-v) = ε(t-v).

Now upon multiplying (7.1) by b_0 ≠ 0 and each of the above equations by b_1, b_2, ..., b_v, respectively, we obtain the system of equations

a_0b_0Z(t) + a_1b_0Z(t-1) + ⋯ + a_pb_0Z(t-p) = b_0ε(t)
a_0b_1Z(t-1) + a_1b_1Z(t-2) + ⋯ + a_pb_1Z(t-p-1) = b_1ε(t-1)
  ⋮
a_0b_vZ(t-v) + a_1b_vZ(t-v-1) + ⋯ + a_pb_vZ(t-p-v) = b_vε(t-v).

Adding these equations together and transposing to the right hand side those Z(t) beyond v periods in the past gives

(7.3)  \sum_{j=0}^{v} d_jZ(t-j) = \sum_{j=0}^{v} b_jε(t-j) + R_Z(v, p),

where

d_j = \sum_{k=0}^{j} a_k b_{j-k},   j = 0, 1, 2, ..., v   (with a_k = 0 for k > p),

d_j = \sum_{k=j-v}^{p} a_k b_{j-k},   j = v+1, v+2, ..., p+v,

and where the remainder term is

R_Z(v, p) = -\sum_{j=v+1}^{p+v} d_jZ(t-j).

For example, for j = 1, ..., v we get

(7.4)  d_0 = a_0b_0
       d_1 = a_0b_1 + a_1b_0
       d_2 = a_0b_2 + a_1b_1 + a_2b_0
        ⋮
       d_v = a_0b_v + a_1b_{v-1} + ⋯ + a_pb_{v-p}.

Clearly, Z(t-1), Z(t-2), ..., Z(t-v) will not appear in (7.3) when

(7.5)  d_j = \sum_{k=0}^{j} a_k b_{j-k} = 0,   j = 1, 2, ..., v.

When this is the case we have

(7.6)  d_0Z(t) = \sum_{j=0}^{v} b_jε(t-j) + R_Z(v, p),

which expresses Z(t) as a weighted sum of the random disturbances for the immediately preceding v periods and a weighted sum of the Z(t) for past periods beyond v. Now as v increases, (7.6) indicates that Z(t) is influenced by successively more disturbance terms, together with a weighted sum of previous values of the time series for successively more remote periods of time. Consequently, in the limit we obtain

d_0Z(t) = \sum_{j=0}^{\infty} b_jε(t-j) + R_Z(\infty),

where R_Z(∞) = \lim_{v\to\infty} R_Z(v, p). For purposes of analyzing economic time series the "infinitely remote past history" reflected in R_Z(∞) can be assumed to be unimportant. If these influences are ignored we arrive at the representation

(7.7)  a_0Z(t) = \sum_{j=0}^{\infty} b_jε(t-j)

or, when a_0 = 1,

(7.8)  Z(t) = \sum_{j=0}^{\infty} b_jε(t-j),

and we have Z(t) expressed as a moving average process of infinite order.

Under the condition that \sum_{j=0}^{\infty} b_j^2 < ∞, we know from the previous section that the mean function of the autoregressive process (7.1) is 0, its variance is C(0) = \sum_{j=0}^{\infty} b_j^2/a_0^2, and its autocovariance function is C(k) = \sum_{i=0}^{\infty} b_ib_{i+k}/a_0^2. The power spectrum of the process is

(7.9)  g(λ) = 2\Big[\sum_{i=0}^{\infty} b_i^2/a_0^2\Big] + 4\sum_{k=1}^{\infty}\Big[\sum_{i=0}^{\infty} b_ib_{i+k}/a_0^2\Big]\cos(2\pi k\lambda).

To complete this discussion we should consider how the b_j in (7.8) can be obtained from the a_j in (7.1). One way to proceed is to use the equations in (7.5). For example, if j = 1 we obtain b_1 from the condition a_0b_1 + a_1b_0 = 0 as

b_1 = -(a_1/a_0)b_0.

For j = 2, we obtain b_2 from the condition a_0b_2 + a_1b_1 + a_2b_0 = 0 as

b_2 = -(a_1b_1 + a_2b_0)/a_0.

This direct recursive procedure becomes extremely involved, however, and infinitely many equations must be solved. An alternate method for determining the b_j is to use the generating functions

A(x) = \sum_{k=0}^{\infty} a_kx^k,   B(x) = \sum_{k=0}^{\infty} b_kx^k.

Now

(7.10)  A(x)B(x) = \Big(\sum_{k=0}^{\infty} a_kx^k\Big)\Big(\sum_{k=0}^{\infty} b_kx^k\Big) = \sum_{j=0}^{\infty} d_jx^j,

where the d_j are given in (7.3). If d_j = 0 for j ≥ 1, we have

(7.11)  A(x)B(x) = d_0 = a_0b_0.

Conversely, if (7.11) holds for all x, this requires that d_j = 0 for j ≥ 1. Thus we can say that the equations (7.5) are satisfied if and only if the product of the generating functions gives A(x)B(x) = a_0b_0. If this is the case, we get

B(x) = a_0b_0 / A(x),

so that B(x) is a ratio of two polynomials. As a rational function, the coefficients b_j of B(x) can be obtained from the partial fraction expansion of B(x). For example, suppose that A(x) has p distinct roots x_1, ..., x_p; then the partial fraction expansion of B(x) is

(7.12)  B(x) = \sum_{k=1}^{p} A_k(x_k - x)^{-1}.

Recalling that B(x) = \sum_{k=0}^{\infty} b_kx^k, we can differentiate B(x) successively and evaluate these derivatives at x = 0, getting

(7.13)  B^{(j)}(0) = j!\,b_j,   j = 1, 2, ....

But since B(x) has the representation (7.12), successive differentiation of the right-hand side of (7.12) gives

\frac{d^j}{dx^j}\Big[\sum_{k=1}^{p} A_k(x_k - x)^{-1}\Big] = \sum_{k=1}^{p} j!\,A_k(x_k - x)^{-(j+1)}.

Therefore, upon evaluating these derivatives at x = 0, since

B^{(j)}(0) = \frac{d^j}{dx^j}\Big[\sum_{k=1}^{p} A_k(x_k - x)^{-1}\Big]\Big|_{x=0},

we obtain

j!\,b_j = \sum_{k=1}^{p} j!\,A_kx_k^{-(j+1)},

or

(7.14)  b_j = \frac{A_1}{x_1^{j+1}} + \frac{A_2}{x_2^{j+1}} + ⋯ + \frac{A_p}{x_p^{j+1}},   j = 0, 1, 2, ....

Thus to determine the b_j in (7.8), instead of solving infinitely many equations recursively, we need only obtain the roots of the polynomial A(x) and use (7.14) to determine the required coefficients b_j in the infinite order moving average representation of the underlying autoregressive process. Moreover, even when the roots of the polynomial A(x) are not distinct a similar representation for the coefficients b_j can be obtained.
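A minimal numerical sketch of this step, using the recursion implied by (7.5) (with a_0 = 1 and the normalization b_0 = 1) rather than the partial fraction form (7.14); the AR(2) coefficients are illustrative assumptions.

```python
import numpy as np

# Assumed AR(2) coefficients a_0 = 1, a_1, a_2 in the form (7.1):
#   Z(t) + a_1 Z(t-1) + a_2 Z(t-2) = eps(t).
a = np.array([1.0, -1.1, 0.3])
p = len(a) - 1

def ma_coefficients(a, n_terms=10):
    """Solve (7.5) recursively: sum_{k=0}^{j} a_k b_{j-k} = 0 for j >= 1."""
    b = np.zeros(n_terms)
    b[0] = 1.0 / a[0]                   # convenient normalization of b_0
    for j in range(1, n_terms):
        s = sum(a[k] * b[j - k] for k in range(1, min(j, p) + 1))
        b[j] = -s / a[0]
    return b

b = ma_coefficients(a)
print(np.round(b, 4))

# Check of the stationarity condition discussed next in the text: the roots of
# A(x) = a_0 + a_1 x + a_2 x^2 should lie outside the unit circle.
roots = np.roots(a[::-1])               # numpy expects highest-degree coefficient first
print(np.abs(roots))
```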

More importantly, the above discussion gives an insight into the stationarity of the autoregressive process. We have seen that an AR process is stationary provided that \sum_{k=0}^{\infty} b_k^2 < ∞. Suppose one of the roots x_i satisfies |x_i| < 1; then from (7.14) it can be seen that |b_j| → ∞, since \lim_{j\to\infty} x_i^{j+1} = 0. In other words, the AR(p) process is covariance stationary if and only if the roots of the polynomial A(x) all lie outside the unit circle.

The autocorrelations associated with a covariance stationary process satisfy an important set of relationships called the Yule-Walker equations. These are obtained by multiplying (7.1) by Z(t-k) for k > 0,

(7.15)  a_0Z(t)Z(t-k) + ⋯ + a_pZ(t-p)Z(t-k) = ε(t)Z(t-k).

Upon taking expected values we get

(7.16)  a_0E[Z(t)Z(t-k)] + ⋯ + a_pE[Z(t-p)Z(t-k)] = E[ε(t)Z(t-k)].

Since k > 0, Z(t-k) refers to a past period whereas ε(t) is the current disturbance, so ε(t) and Z(t-k) in this case are uncorrelated and we have E[ε(t)Z(t-k)] = 0. Also, since C(k) = E[Z(t)Z(t-k)], we see that (7.16) becomes

(7.17)  a_0C(k) + a_1C(k-1) + ⋯ + a_pC(k-p) = 0,   k > 0.

This system of difference equations involving the C(j) can be used to

characterize the autocorrelation function of a covariance stationary AR(p) process, since if we divide (7.17) by V(0) we obtain

(7.18)  a_0ρ(k) + a_1ρ(k-1) + ⋯ + a_pρ(k-p) = 0.

The latter system consists of the Yule-Walker equations and they play an important role in estimation using data from given time series. Moreover, when the autocorrelation function is given, we can regard the system (7.18) as determining the coefficients a_j in the autoregressive scheme (7.1). This system can be written in matrix form as

(7.19)  \begin{bmatrix} 1 & ρ(1) & \cdots & ρ(p-1) \\ ρ(1) & 1 & \cdots & ρ(p-2) \\ \vdots & & & \vdots \\ ρ(p-1) & ρ(p-2) & \cdots & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} ρ(1) \\ ρ(2) \\ \vdots \\ ρ(p) \end{bmatrix}.

For example, for the AR(1) process where a_0 = -1 the system (7.19) becomes

[1][a_1(1)] = [ρ(1)]   or   a_1(1) = ρ(1),

where a_1(1) is intended to indicate the order of the corresponding autoregressive process. For an AR(2) process where a_0 = -1 the system

becomes

\begin{bmatrix} 1 & ρ(1) \\ ρ(1) & 1 \end{bmatrix} \begin{bmatrix} a_1(2) \\ a_2(2) \end{bmatrix} = \begin{bmatrix} ρ(1) \\ ρ(2) \end{bmatrix},

so that

\begin{bmatrix} a_1(2) \\ a_2(2) \end{bmatrix} = \frac{1}{1 - [ρ(1)]^2} \begin{bmatrix} 1 & -ρ(1) \\ -ρ(1) & 1 \end{bmatrix} \begin{bmatrix} ρ(1) \\ ρ(2) \end{bmatrix},

and finally

a_1(2) = \frac{ρ(1) - ρ(1)ρ(2)}{1 - [ρ(1)]^2},   a_2(2) = \frac{ρ(2) - [ρ(1)]^2}{1 - [ρ(1)]^2}.

Continuing in this fashion, the coefficients a_j of the AR(p) process (7.1) can be considered as functions a_j(p) of the order of the autoregressive scheme. In this interpretation the a_1(1), a_2(2), ..., a_p(p) are called partial autocorrelations. We will refer to these concepts again in a later section.
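As a small worked illustration (the autocorrelation values are assumed for the example), the sketch below solves the Yule-Walker system (7.19) for an AR(2) and reproduces the partial autocorrelations a_1(2) and a_2(2) given by the closed-form expressions above.

```python
import numpy as np

# Assumed autocorrelations rho(1), rho(2) for the illustration.
rho1, rho2 = 0.6, 0.45

# Yule-Walker system (7.19) for p = 2 (with a_0 = -1).
R = np.array([[1.0, rho1],
              [rho1, 1.0]])
rhs = np.array([rho1, rho2])
a = np.linalg.solve(R, rhs)
print("a1(2), a2(2) from (7.19):", np.round(a, 4))

# Closed-form expressions derived in the text.
a1 = (rho1 - rho1 * rho2) / (1 - rho1 ** 2)
a2 = (rho2 - rho1 ** 2) / (1 - rho1 ** 2)
print("closed form:             ", round(a1, 4), round(a2, 4))
```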

8. AUTOREGRESSIVE MOVING AVERAGE PROCESSES

In the AR(p) process (7.1) the random disturbances ε(t) are assumed to be uncorrelated random variables with mean 0 and unit variance. The probability structure of these disturbances is so simple that it is natural, especially in economic time series where the disturbances are not uncorrelated from period to period, to consider more general properties for the disturbance or error terms. Accordingly, we replace the error terms ε(t) in (7.1) by a moving average of order q and obtain the autoregressive moving average process, denoted ARMA(p, q); namely,

(8.1)  a_0Z(t) + a_1Z(t-1) + ⋯ + a_pZ(t-p) = b_0ε(t) + b_1ε(t-1) + ⋯ + b_qε(t-q).

Two questions immediately arise: are these processes stationary? What are their means, variances, autocovariances, and autocorrelations?

To consider the first question, rewrite (8.1) in the form

(8.2)  a_0Z(t) + a_1Z(t-1) + ⋯ + a_pZ(t-p) = ε*(t),

where

(8.3)  ε*(t) = b_0ε(t) + ⋯ + b_qε(t-q).

Now whenever the right hand side of (8.2) is a covariance stationary process (as is the case for a MA(q) process), the arguments given in the preceding section establishing the covariance stationarity of the AR(p) process will also establish the covariance stationarity of (8.1).

In other words, under the assumption that the disturbances or errors are covariance stationary, (8.1) will be covariance stationary provided the roots of the polynomial A(x) referred to in the previous section lie outside the unit circle.

Returning to a consideration of the second question posed above, if we use the ARMA(p, q) process (8.2) and the arguments of the previous section, we see that we can express Z(t) in the form

(8.4)  a_0Z(t) = \sum_{j=0}^{\infty} b_jε*(t-j),

where ε*(t) is given in (8.3). From the representation (8.4) we see that the mean function of the ARMA(p, q) process is M(t) = 0. The variance, autocovariance, and autocorrelation functions can be obtained explicitly, but each would be more lengthy than those we dealt with earlier because one must account for the fact that the ε*(t) are no longer uncorrelated random variables.

An alternate characterization for these functions can be developed implicitly using a system of difference equations that they must satisfy. To obtain this system, multiply (8.1) by Z(t-k) for k > 0, getting

(8.5)  a_0Z(t)Z(t-k) + a_1Z(t-1)Z(t-k) + ⋯ + a_pZ(t-p)Z(t-k)
           = b_0ε(t)Z(t-k) + b_1ε(t-1)Z(t-k) + ⋯ + b_qε(t-q)Z(t-k).

Upon taking expectations we obtain

(8.6)  a_0E[Z(t)Z(t-k)] + a_1E[Z(t-1)Z(t-k)] + ⋯ + a_pE[Z(t-p)Z(t-k)]
           = b_0E[ε(t)Z(t-k)] + b_1E[ε(t-1)Z(t-k)] + ⋯ + b_qE[ε(t-q)Z(t-k)].

Since Z(t) is correlated with ε(t), ε(t-1), ..., ε(t-q), the cross-product expectations involving Z(t) and ε(t) beyond the current period are not necessarily zero as is the case for the AR(p) model. These expectations are called the cross-covariances, which we denote

C_{εZ}(j-i) = E[ε(t-i)Z(t-j)],

and, when j = 0, we get C_{εZ}(-i) = E[ε(t-i)Z(t)]. Note that when i < j we have C_{εZ}(j-i) = 0, because we are then dealing with Z(t) for values of t previous to the time of the given disturbance and in this case the Z(t) and ε(t) are uncorrelated. Also, when i ≥ j, C_{εZ}(j-i) will not in general be zero. Using this notation, (8.6) can be written for k ≥ 0 as

(8.7)  a_0C(k) + a_1C(k-1) + ⋯ + a_pC(k-p) = b_0C_{εZ}(k) + b_1C_{εZ}(k-1) + ⋯ + b_qC_{εZ}(k-q).

For example, taking k = 0, 1, 2, ..., q successively in (8.7) we obtain

a_0C(0) + a_1C(1) + ⋯ + a_pC(p) = b_0C_{εZ}(0) + b_1C_{εZ}(-1) + ⋯ + b_qC_{εZ}(-q)

a_0C(1) + a_1C(0) + ⋯ + a_pC(1-p) = b_1C_{εZ}(0) + b_2C_{εZ}(-1) + ⋯ + b_qC_{εZ}(1-q)

(8.8)  a_0C(2) + a_1C(1) + ⋯ + a_pC(2-p) = b_2C_{εZ}(0) + b_3C_{εZ}(-1) + ⋯ + b_qC_{εZ}(2-q)

  ⋮

a_0C(q) + a_1C(q-1) + ⋯ + a_pC(q-p) = b_qC_{εZ}(0).

Whenever k > q, we get

(8.9)  a_0C(k) + a_1C(k-1) + ⋯ + a_pC(k-p) = 0,

since in this case all cross-covariances appearing on the right-hand side of (8.7) are zero. The system of difference equations represented in (8.8) and (8.9) implicitly defines the autocovariance functions. The autocorrelation functions ρ(k) satisfy the same system of equations and are also implicitly defined by means of them.
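A brief simulation sketch (the coefficients, sample size, and seed are assumptions made for illustration) of an ARMA(1,1) process, checking numerically that the sample autocovariances approximately satisfy the difference equation (8.9) for k > q.

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed ARMA(1,1) in the form (8.1) with a_0 = 1, b_0 = 1:
#   Z(t) + a_1 Z(t-1) = eps(t) + b_1 eps(t-1)
a1, b1 = -0.7, 0.4
n = 50000
eps = rng.standard_normal(n)
z = np.zeros(n)
for t in range(1, n):
    z[t] = -a1 * z[t - 1] + eps[t] + b1 * eps[t - 1]

def autocov(x, k):
    """Sample autocovariance C(k)."""
    x = x - x.mean()
    return np.mean(x[:len(x) - k] * x[k:])

# (8.9): C(k) + a_1 C(k-1) = 0 for k > q = 1, so the ratios C(k)/C(k-1)
# should be close to -a_1 = 0.7.
for k in (2, 3, 4):
    print(k, round(autocov(z, k) / autocov(z, k - 1), 3))
```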

9. SIMPLE EXPONENTIAL SMOOTHING FORECASTING MODELS

Suppose a components model for a time series is given by

(9.1)  Y(t) = TC(t) + e(t),

where

(9.2)  TC(t) = s_0 + s_1t + ⋯ + s_qt^q = P(t)

and where the s_j are real, s_q ≠ 0, and e(t) is an irregular or random component representing uncorrelated random variables with mean zero. That is, we have an additive model in which the trend-cyclical component is assumed to be a polynomial of order q. One wishes to estimate P(t) in order to develop a forecasting model.

When the errors e(t) are uncorrelated with mean zero and constant variance σ², given observations on Y(t) for time periods 0, 1, ..., t (we call the first period for which we have observations time period 0 and denote the current period by t), then the s_j in (9.2) could be estimated by ordinary least squares, the estimates ŝ_j being determined so as to minimize

(9.3)  \sum_{j=0}^{t} (Y(t-j) - P(t-j))^2.

More generally, in order to develop estimates ŝ_j of the coefficients in (9.2), when the variances of the errors or disturbances are not constant from observation to observation, we can minimize the

weighted sum of squares

(9.4)  \sum_{j=0}^{t} \frac{(Y(t-j) - P(t-j))^2}{σ^2(t-j)},

where σ²(t-j) = E[e²(t-j)]. These are standard procedures and if in addition the errors are normally distributed, maximum likelihood estimators can be obtained and the conventional large sample inference procedures would be available as well.

For many economic time series the variances of the error terms decrease as one approaches the current time period. If this were not the case and the variances were increasing in time, forecasts for future periods, even those for one period ahead, would be virtually useless in practice because they would be anticipated to differ so widely from the actual future values of the time series.

Returning to the components model (9.1), it is therefore natural to think in terms of discounting previous observations of Y(t) more heavily than the current observation, or equivalently, to think in terms of discounting previous errors more heavily than the current error. A convenient way of doing this is to assign weights which decrease as a geometric sequence,

(9.5)  β^0, β, β^2, β^3, ..., β^t.

In this case the components model can be considered as

(9.6)  Y(t-j) = P(t-j) + e(t-j),   j = 0, 1, ..., t,

where it is assumed that the e(t-j) are uncorrelated with mean zero but have variance proportional to β^{-j} for j = 0, 1, ..., t, say σ²(t-j) = σ²β^{-j}, where σ² is a constant of proportionality. Since the variance is no longer constant we multiply (9.6) by 1/σ(t-j), obtaining

(9.7)  Y(t-j)/σ(t-j) = P(t-j)/σ(t-j) + e(t-j)/σ(t-j),

which we can write as

(9.8)  Y*(t-j) = P*(t-j) + ε(t-j),

where the ε(t-j) are uncorrelated with mean zero and, it should be noted, have unit variance, i.e., are white noise. Thus the ordinary least squares criterion (9.3) applied to this representation is

(9.9)  \sum_{j=0}^{t} ε^2(t-j) = \sum_{j=0}^{t} (Y*(t-j) - P*(t-j))^2
                    = \sum_{j=0}^{t} \Big(\frac{Y(t-j)}{σ(t-j)} - \frac{P(t-j)}{σ(t-j)}\Big)^2
                    = \frac{1}{σ^2}\sum_{j=0}^{t} β^j (Y(t-j) - P(t-j))^2.

We call the minimization of (9.9) with respect to the coefficients s_j of (9.2) the discounted least squares criterion.

To illustrate the use of this criterion, we consider the case in which P(t) = s_0 and estimate s_0 by choosing that value ŝ_0 which minimizes

(9.10)  \sum_{j=0}^{t} β^j (Y(t-j) - s_0)^2.

Upon differentiating (9.10) with respect to s_0, setting the derivative equal to zero, and solving for s_0, we get

(9.11)  ŝ_0 = \frac{\sum_{j=0}^{t} β^j Y(t-j)}{\sum_{j=0}^{t} β^j} = \frac{\sum_{j=0}^{t} (1-β)β^j Y(t-j)}{1 - β^{t+1}},

since

\sum_{j=0}^{t} β^j = \frac{1 - β^{t+1}}{1 - β}.

The factor 1 - β is frequently denoted by α and called the exponential smoothing constant. In practice the choice of the smoothing constant α is based upon the user's judgment regarding the weight he wishes to assign to the most recent observation, rather than on the application of appropriate methods of statistical inference.

Provided |β| < 1, (9.11) can be rewritten in the limiting form

(9.12)  ŝ_0 = \sum_{j=0}^{t-1} αβ^j Y(t-j) + β^t S(0),

where S(0) = αY(0) + αβY(-1) + αβ²Y(-2) + ⋯. One sees that when t is large (a large number of time series observations is available), the term S(0) in (9.12) depends only on observations of the time series from the remote past and in such a way that the observations beyond Y(0) are discounted successively more heavily. Moreover, in (9.12), S(0) is itself discounted by β^t, so when t is large one may ignore the last term on the right in (9.12) and use the estimate

(9.13)  ŝ_0 = \sum_{j=0}^{t-1} αβ^j Y(t-j).

In (9.13), when t is allowed to assume integer values successively, we interpret ŝ_0 as a function of t and write ŝ_0(t) rather than ŝ_0. In other words, for t = 1, 2, 3, ... we have

(9.14)  ŝ_0(t) = \sum_{j=0}^{t-1} αβ^j Y(t-j)
              = αY(t) + \sum_{j=1}^{t-1} αβ^j Y(t-j)
              = αY(t) + β\sum_{j=1}^{t-1} αβ^{j-1} Y(t-j).

Changing the index of summation appearing in the last term above to i = j-1 gives

ŝ_0(t) = αY(t) + β\sum_{i=0}^{t-2} αβ^i Y(t-1-i).

From (9.14), however, when t is replaced by t-1, we obtain

(9.15)  ŝ_0(t-1) = \sum_{i=0}^{t-2} αβ^i Y(t-1-i),

and substituting (9.15) into (9.14) leads to

(9.16)  ŝ_0(t) = αY(t) + βŝ_0(t-1).

Rather than denoting the estimate (9.16) of the constant term in (9.1) by ŝ_0(t), the alternate notation S^{(1)}(t) is frequently used, and expressing (9.16) in this notation gives

(9.17)  S^{(1)}(t) = αY(t) + βS^{(1)}(t-1),   β = 1 - α,   0 < α < 1.

The equation gives the first order exponentially smoothed value of Y(t); it provides a simple recursive procedure for updating the smoothed values S^{(1)}(t) and has an appealing intuitive justification. If we "unfold" (9.17) by substituting S^{(1)}(t-1), S^{(1)}(t-2), ..., as expressed by (9.17), successively in this equation, we see that S^{(1)}(t) can be written as a weighted sum of the observations Y(t), Y(t-1), ..., with the weights

decreasing in a geometric progression as the observations refer to successively more distant time periods.

Finally, for forecasting purposes, since the time series is represented as Y(t) = s_0 + e(t), the mean function of Y(t) under the assumed properties for the disturbance term e(t) is

(9.18)  M(t) = E[Y(t)] = s_0.

Thus an estimate Ŷ(t+τ) of the expected value of Y(t) for τ periods ahead is given by

(9.19)  Ŷ(t+τ) = M̂(t+τ) = ŝ_0(t) = S^{(1)}(t).

Note that in this case, since the trend-cyclical component is given by a polynomial of degree zero, Ŷ(t+τ) does not depend on τ. In other words, we have the same forecast for any period t+τ in the future and this forecast is changed only when the next observation on the time series becomes available. These results are extended in the next section to time series in which the trend-cyclical component is not restricted to a polynomial of degree zero but can be of any degree q.
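A minimal sketch of the recursion (9.17) and the forecast (9.19); the data, the smoothing constant α, and the choice S^{(1)}(0) = Y(0) for the initial smoothed value are illustrative assumptions.

```python
def simple_exponential_smoothing(y, alpha, s_init=None):
    """First order exponential smoothing: S1(t) = alpha*Y(t) + (1-alpha)*S1(t-1)."""
    s = y[0] if s_init is None else s_init   # assumed initial value S1(0)
    smoothed = []
    for obs in y:
        s = alpha * obs + (1.0 - alpha) * s  # recursion (9.17)
        smoothed.append(s)
    return smoothed

y = [12.0, 13.5, 12.8, 14.1, 13.9, 15.2, 14.8]   # assumed observations
s1 = simple_exponential_smoothing(y, alpha=0.3)

# (9.19): the tau-step ahead forecast equals S1(t) for every tau.
print(round(s1[-1], 3))
```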

10. HIGHER ORDER EXPONENTIAL SMOOTHING FORECASTING MODELS

Consider the more general case in which the trend-cyclical component is a polynomial of degree q, and for j = 0, 1, 2, ..., t express (9.2) in the form

(10.1)  P(t-j) = \sum_{i=0}^{q} s_i(t)(-j)^i / i!.

Estimating the coefficients

(10.2)  s_0(t), s_1(t), ..., s_q(t)

in (10.1) by minimizing (9.9) gives estimates ŝ_0(t), ŝ_1(t), ..., ŝ_q(t) which satisfy the system of equations

(10.3)  \begin{bmatrix} m_{11} & m_{12} & \cdots & m_{1,q+1} \\ m_{21} & m_{22} & \cdots & m_{2,q+1} \\ \vdots & & & \vdots \\ m_{q+1,1} & m_{q+1,2} & \cdots & m_{q+1,q+1} \end{bmatrix} \begin{bmatrix} \hat{s}_0(t) \\ \hat{s}_1(t) \\ \vdots \\ \hat{s}_q(t) \end{bmatrix} = \begin{bmatrix} S^{(1)}(t) \\ S^{(2)}(t) \\ \vdots \\ S^{(q+1)}(t) \end{bmatrix},

where

(10.4)  S^{(1)}(t) = αY(t) + βS^{(1)}(t-1)
        S^{(2)}(t) = αS^{(1)}(t) + βS^{(2)}(t-1)
         ⋮
        S^{(q+1)}(t) = αS^{(q)}(t) + βS^{(q+1)}(t-1)

and

(10.5)  m_{i,j+1} = \frac{(-1)^j α^i}{j!\,(i-1)!} \sum_{k=0}^{\infty} k^j \frac{(i-1+k)!}{k!}\, β^k,   i ≥ 1,\; j ≥ 0.

The S^{(j)}(t) in (10.4) is called the jth order exponentially smoothed value of the time series Y(t). One can also see in (10.4) a simple recursive pattern similar to that in first order exponential smoothing as given in (9.17). Indeed, in (10.4) one is smoothing successively time series consisting in turn of other lower order smoothed values of the time series Y(t). As with first order exponential smoothing, initial values S^{(1)}(0), S^{(2)}(0), ..., S^{(q+1)}(0) must be chosen.

As an example of these concepts consider the case q = 1, in which the trend-cyclical component of the time series Y(t) has a linear trend and can be represented as

TC(t) = s_0 + s_1t = P(t),

which is a special case of (9.2), or, alternately, as

TC(t+τ) = s_0(t) + s_1(t)τ = P(t+τ),

which is a special case of (10.1) where j = -τ. Then corresponding to (10.3), (10.4), and (10.5) we have

(10.6)  \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} \begin{bmatrix} \hat{s}_0(t) \\ \hat{s}_1(t) \end{bmatrix} = \begin{bmatrix} S^{(1)}(t) \\ S^{(2)}(t) \end{bmatrix},

where

(10.7)  S^{(1)}(t) = αY(t) + βS^{(1)}(t-1)
        S^{(2)}(t) = αS^{(1)}(t) + βS^{(2)}(t-1)

and the m_{i,j+1} in this case are (with, respectively, i = 1, j = 0; i = 1, j = 1; i = 2, j = 0; i = 2, j = 1):

(10.8)  m_{11} = α\sum_{k=0}^{\infty} β^k = \frac{α}{1-β} = 1

        m_{12} = -α\sum_{k=0}^{\infty} kβ^k = -\frac{αβ}{(1-β)^2} = -\frac{β}{α}

        m_{21} = α^2\sum_{k=0}^{\infty} \frac{(1+k)!}{k!} β^k = \frac{α^2}{(1-β)^2} = 1

        m_{22} = -α^2\sum_{k=0}^{\infty} k\frac{(1+k)!}{k!} β^k = -\frac{2α^2β}{(1-β)^3} = -\frac{2β}{α}.

Then (10.6) becomes

\begin{bmatrix} 1 & -β/α \\ 1 & -2β/α \end{bmatrix} \begin{bmatrix} \hat{s}_0(t) \\ \hat{s}_1(t) \end{bmatrix} = \begin{bmatrix} S^{(1)}(t) \\ S^{(2)}(t) \end{bmatrix}.

Therefore,

\begin{bmatrix} \hat{s}_0(t) \\ \hat{s}_1(t) \end{bmatrix} = \begin{bmatrix} 2 & -1 \\ α/β & -α/β \end{bmatrix} \begin{bmatrix} S^{(1)}(t) \\ S^{(2)}(t) \end{bmatrix},

or

(10.9)   ŝ_0(t) = 2S^{(1)}(t) - S^{(2)}(t)

(10.10)  ŝ_1(t) = (α/β)(S^{(1)}(t) - S^{(2)}(t)).

For a given smoothing constant α, together with given initial values S^{(1)}(0) and S^{(2)}(0), ŝ_0(t) and ŝ_1(t) are updated whenever a new observation on the time series Y(t) becomes available. The updating process utilizes the recursive expressions (10.7); one first obtains updated values of S^{(1)}(t) and S^{(2)}(t) and these are then used in (10.9) and (10.10) to obtain the revised estimates ŝ_0(t) and ŝ_1(t).

In the same way as is illustrated for the case q = 1, (10.3), (10.4), and (10.5) in principle can be used to determine the estimates ŝ_0(t), ŝ_1(t), ..., ŝ_q(t) as linear combinations of the various exponentially smoothed values S^{(1)}(t), S^{(2)}(t), ..., S^{(q+1)}(t). The result that, except for exceptional values of α, the linear equation system (10.3) can be solved is called the fundamental theorem of exponential smoothing (see Brown [4]).
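A compact sketch of second order (double) exponential smoothing as described by (10.7), (10.9), and (10.10); the data, the smoothing constant, and the zero initial values for the smoothed statistics are assumptions for illustration.

```python
def double_exponential_smoothing(y, alpha, s1_init=0.0, s2_init=0.0):
    """Return the estimates s0_hat(t), s1_hat(t) after processing the series y."""
    beta = 1.0 - alpha
    s1, s2 = s1_init, s2_init            # assumed initial smoothed values
    for obs in y:
        s1 = alpha * obs + beta * s1     # (10.7), first order smoothing
        s2 = alpha * s1 + beta * s2      # (10.7), second order smoothing
    s0_hat = 2.0 * s1 - s2               # (10.9)
    s1_hat = (alpha / beta) * (s1 - s2)  # (10.10)
    return s0_hat, s1_hat

y = [10.2, 11.0, 11.9, 12.7, 13.6, 14.3, 15.1]   # assumed observations with a linear trend
s0_hat, s1_hat = double_exponential_smoothing(y, alpha=0.4)

# tau-step ahead forecast for the linear trend case: s0_hat + s1_hat * tau.
for tau in (1, 2, 3):
    print(tau, round(s0_hat + s1_hat * tau, 3))
```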

However, for exponential smoothing of order higher than three, serious computational limitations occur if one attempts to use (10.5) to determine the m_{i,j+1}, because they depend on α, and for each new choice of α a new calculation must be made. Furthermore, progressively more complex closed form expressions arise for the infinite sums involved in (10.5). Cogger [6] has provided a simplified procedure for determining the elements m_{i,j+1}.

Turning now to the problem of developing τ-step ahead forecasts when the trend-cyclical component is a polynomial P(t) of degree q, the mean function of Y(t) in (9.1) is

M(t) = E[Y(t)] = E[TC(t) + e(t)] = P(t),

since the errors are assumed to have mean zero. Thus a τ-step ahead forecast Ŷ(t+τ) of the mean of Y(t+τ) is

Ŷ(t+τ) = M̂(t+τ) = P̂(t+τ),

where

(10.11)  P̂(t+τ) = \sum_{i=0}^{q} \hat{s}_i(t)τ^i / i!,   τ = 1, 2, ...,

and where the vector whose components are ŝ_0(t), ŝ_1(t), ..., ŝ_q(t) is the solution of the equation system (10.3).

For example, when the degree of the polynomial representing the trend-cyclical component is q = 1 and we deal with a linear trend,

we have

(10.12)  Ŷ(t+τ) = ŝ_0(t) + ŝ_1(t)τ,

where ŝ_0(t) and ŝ_1(t) are computed from the first and second order smoothed values as given by (10.9) and (10.10). Specifically, the τ = 1 step ahead forecast of the expected value of Y(t) is

(10.13)  Ŷ(t+1) = ŝ_0(t) + ŝ_1(t) = (2 + α/β)S^{(1)}(t) - (1 + α/β)S^{(2)}(t).

To state a more general result for one-step ahead forecasts, we need the concept of the nth order (backward) difference of a time series Y(t). Suppose we define the first difference of Y(t) as

δY(t) = Y(t) - Y(t-1).

The second order difference can be written

δ²Y(t) = δY(t) - δY(t-1) = Y(t) - 2Y(t-1) + Y(t-2)

and, finally, the nth order difference is

(10.14)  δ^nY(t) = \sum_{i=0}^{n} (-1)^i \binom{n}{i} Y(t-i).

We extend this concept by understanding that δ⁰Y(t) = Y(t). We now state the following optimality property for the τ = 1 step ahead forecast of the mean of a time series based on exponential smoothing: for

any time series whose nth order difference can be represented as a moving average of order n, nth order exponential smoothing will provide the minimum mean squared error, one-step ahead forecast of the mean.

For the components model Y(t-j) = P(t-j) + e(t-j), where the e(t-j) are uncorrelated with mean zero and have variance proportional to β^{-j}, we developed the τ-step ahead model (10.11) through the application of discounted least squares. Thus for τ = 1 we get

(10.15)  Ŷ(t+1) = ŝ_0(t) + ŝ_1(t) + \frac{ŝ_2(t)}{2!} + ⋯ + \frac{ŝ_q(t)}{q!}

as the one-step ahead forecast of the mean of Y(t). We now raise the question: is (10.15) an optimal predictor, i.e., is this a minimum mean squared error one-step ahead forecast? The answer to this question depends on whether or not the (q+1)st difference of Y(t) can be written as a moving average process.

Since the time series Y(t) is represented as Y(t-j) = P(t-j) + e(t-j), upon differencing the time series q+1 times we obtain

δ^{q+1}Y(t-j) = δ^{q+1}P(t-j) + δ^{q+1}e(t-j).

But δ^{q+1}P(t-j) = 0, since P(t) is a polynomial of degree q, and therefore

δ^{q+1}Y(t-j) = δ^{q+1}e(t-j).

We also have from (10.14)

δ^{q+1}e(t-j) = \sum_{i=0}^{q+1} (-1)^i \binom{q+1}{i} e(t-j-i).

However, the δ^{q+1}e(t-j) are not yet expressed as a moving average process because the e(t-j-i) are not white noise. But recalling (9.7) and (9.8), we know that ε(t-j) = e(t-j)/σ(t-j), or

e(t-j) = σ(t-j)ε(t-j),

where σ²(t-j) = σ²β^{-j} and σ² is a constant of proportionality. Therefore

δ^{q+1}e(t-j) = \sum_{i=0}^{q+1} (-1)^i \binom{q+1}{i} e(t-j-i)
             = \sum_{i=0}^{q+1} (-1)^i \binom{q+1}{i} σ(t-j-i)\,ε(t-j-i)
             = \sum_{i=0}^{q+1} b_i ε(t-j-i),

where

b_i = (-1)^i \binom{q+1}{i} σ(t-j-i),   i = 0, 1, ..., q+1,

which expresses δ^{q+1}e(t-j) as a moving average process of order q+1. This establishes (10.15) as an optimal predictor for the one-step ahead forecast.

11. A USEFUL ALTERNATE FORM OF THE SECOND ORDER EXPONENTIAL SMOOTHING MODEL

We recall from the previous section that the second order exponential smoothing forecasting model for the τ-step ahead forecast is

(10.12)  Ŷ(t+τ) = ŝ_0(t) + ŝ_1(t)τ,

in which

(10.9)   ŝ_0(t) = 2S^{(1)}(t) - S^{(2)}(t)

and

(10.10)  ŝ_1(t) = (α/β)(S^{(1)}(t) - S^{(2)}(t)).

It is convenient for subsequent discussions to express the estimates ŝ_0(t) and ŝ_1(t) in an alternate form which reflects a recursive relationship between these estimates,

(11.1)  ŝ_0(t) = α_0Y(t) + β_0[ŝ_0(t-1) + ŝ_1(t-1)],

where α_0 denotes the same smoothing constant α involved in (10.9),

(10.11), and (10.12), and where

(11.2)  ŝ_1(t) = α_1(ŝ_0(t) - ŝ_0(t-1)) + β_1ŝ_1(t-1),

where

(11.3)  α_1 = \frac{α_0}{1 + β_0}   and   β_1 = \frac{2β_0}{1 + β_0}.

The relationships (11.1) and (11.2), with α_1 and β_1 given by (11.3), can be obtained from (10.9), (10.10), and (10.12) directly, together with the definitions of the first and second order smoothed values (10.7). This direct verification of (11.1) and (11.2) is somewhat lengthy and the details involved uninteresting for our purposes, so they will not be given here.

It should be observed that to obtain the same forecast (10.12) by using (10.9) and (10.10) or by using (11.1) and (11.2) for the estimates ŝ_0(t) and ŝ_1(t), it is necessary to choose the smoothing constants α, α_0, and α_1 such that α_0 = α and (11.3) is satisfied. Moreover, if one chooses these three smoothing constants differently, the forecast based on (11.1) and (11.2) will differ from the second order exponential smoothing forecast based on (10.9) and (10.10), the latter being obtained from minimizing the discounted sum of squares criterion.

Equation (11.1) has a natural interpretation which is not apparent from (10.9). Suppose we denote the τ = 1 step ahead forecast of

the expected value of Y(t) made at time t-1 by Ŷ(t; t-1); then

(11.4)  Ŷ(t; t-1) = [Ŷ(t-1+τ)]_{τ=1} = [ŝ_0(t-1) + ŝ_1(t-1)τ]_{τ=1} = ŝ_0(t-1) + ŝ_1(t-1).

Now the forecast of the expected value of Y(t) made at time t is given by (10.12) with τ = 0,

Ŷ(t) = [Ŷ(t+τ)]_{τ=0} = [ŝ_0(t) + ŝ_1(t)τ]_{τ=0} = ŝ_0(t).

Thus (11.1) can be expressed in the alternate form

(11.5)  Ŷ(t) = α_0Y(t) + β_0Ŷ(t; t-1),

which states that the estimate of the expected value of Y(t) made at time t is obtained by simply averaging the current observed value of Y(t) and its forecast value from the immediately preceding period.
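The recursive form (11.1)-(11.2) lends itself to a very short updating routine. In the sketch below the data, the smoothing constants, and the zero initial estimates are assumptions for illustration, and the constants are chosen freely (with β_0 = 1 - α_0 and β_1 = 1 - α_1) rather than through (11.3).

```python
def recursive_trend_smoothing(y, alpha0, alpha1, s0_init=0.0, s1_init=0.0):
    """Update s0_hat and s1_hat by the recursive form (11.1)-(11.2)."""
    beta0, beta1 = 1.0 - alpha0, 1.0 - alpha1
    s0, s1 = s0_init, s1_init                      # assumed initial estimates
    for obs in y:
        s0_prev = s0
        s0 = alpha0 * obs + beta0 * (s0 + s1)      # (11.1): average of Y(t) and prior forecast
        s1 = alpha1 * (s0 - s0_prev) + beta1 * s1  # (11.2): average of recent and prior slope
    return s0, s1

y = [10.2, 11.0, 11.9, 12.7, 13.6, 14.3, 15.1]     # assumed observations
s0, s1 = recursive_trend_smoothing(y, alpha0=0.5, alpha1=0.3)
print([round(s0 + s1 * tau, 3) for tau in (1, 2, 3)])   # forecasts s0 + s1*tau
```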

Equation (11.2) can be interpreted as follows. We have interpreted ŝ_0(t) as an estimate of the expected value of Y(t) at time t; similarly ŝ_0(t-1) is an estimate of the expected value of Y(t-1) made at time t-1. Consequently, if there is a linear trend in the expected value of the time series, then the difference ŝ_0(t) - ŝ_0(t-1) between the two estimates represents the most recent assessment of the slope of the linear trend line that is available on the basis of the current observation Y(t) of the time series. Now (11.2) states that the estimate of the slope of the linear trend component made at time t is merely an average of its most recent assessment ŝ_0(t) - ŝ_0(t-1) and its previous estimated smoothed value ŝ_1(t-1).

12. EXPONENTIAL SMOOTHING MODELS FOR SEASONAL TIME SERIES

Consider a mixed components model of the form

(12.1)  Y(t) = TC(t)S(t) + e(t),

in which the trend-cyclical and seasonal components appear in a multiplicative fashion and the error or disturbance is additive. It is assumed that the seasonal component S(t) is a periodic function having period ℓ, so that S(t) = S(t+ℓ) for all t, with ℓ being the smallest positive number for which this property holds. If S(t) is known, the time series can be deseasonalized, obtaining

(12.2)  Y(t)/S(t) = TC(t) + e(t)/S(t)

or

(12.3)  Y*(t) = TC(t) + e*(t),

and one could return to the models of Sections 10 or 11. It should be emphasized that in using the exponential smoothing models developed earlier it was assumed that seasonal influences were absent (see (9.1), for example). In practice, if one uses these nonseasonal models one should be dealing with data which do not have seasonal influences, or one should deseasonalize the data prior to using these nonseasonal exponential smoothing forecasting models.

Extensions of these nonseasonal exponential smoothing models have been developed for some simple cases which permit one to deal with seasonal influences within the model itself. These methods are ad hoc and their statistical and optimality properties from a forecasting point of view are largely unknown at present. We now develop one such extension and begin by considering the deseasonalized representation of the time series (12.3), in which we assume that the trend-cyclical component is given by a polynomial of degree 1 as

(12.4)  TC(t+τ) = s_0(t) + s_1(t)τ = P(t+τ),

which is a special case of the more general trend-cycle polynomial (10.1) in which j = -τ and q = 1.

In general, were the errors in (12.1) to have mean zero, the expected value of Y(t) for τ periods ahead would be given by

M(t+τ) = E[TC(t+τ)S(t+τ) + e(t+τ)] = TC(t+τ)S(t+τ).

Hence a τ-step ahead forecast Ŷ(t+τ) of Y(t) would be

(12.5)  Ŷ(t+τ) = T̂C(t+τ)Ŝ(t+τ).

Now when the trend-cyclical component is given by (12.4), second order exponential smoothing applied to the deseasonalized time series (12.3) provides an estimate of TC(t+τ),

(12.6)  T̂C(t+τ) = ŝ_0(t) + ŝ_1(t)τ,

where

(12.7)  ŝ_0(t) = α_0Y*(t) + β_0(ŝ_0(t-1) + ŝ_1(t-1)) = α_0[Y(t)/S(t)] + β_0(ŝ_0(t-1) + ŝ_1(t-1))

and

(12.8)  ŝ_1(t) = α_1(ŝ_0(t) - ŝ_0(t-1)) + β_1ŝ_1(t-1).

The last two equations correspond to the alternate form of the smoothing model as given in (11.1) and (11.2), except that these are applied to the deseasonalized time series.

If the seasonal component S(t+τ) were known, (12.7) and (12.8) would be used in (12.6) to forecast E[Y(t+τ)] from (12.5). Typically, however, the seasonal component is unknown and it must be estimated as well. One way to proceed is as follows. Return to the components model (12.1) and assume that TC(t) is known. If we divide both sides of (12.1) by TC(t), we get

(12.9)  Y(t)/TC(t) = S(t) + e(t)/TC(t)

or

(12.10)  Y**(t) = S(t) + e**(t).

The trend-cyclical component of the time series Y**(t) is clearly S(t); thus in order to apply first order exponential smoothing to develop an estimate of S(t) we must assume that S(t) is a constant seasonal factor. Using this interpretation of (12.10) and recalling that S(t) = S(t-ℓ), the estimate obtained is

(12.11)  Ŝ(t) = α_sY**(t) + β_sŜ(t-ℓ) = α_s[Y(t)/TC(t)] + β_sŜ(t-ℓ).

For the linear trend model (12.11) becomes

(12.12)  Ŝ(t) = α_s[Y(t)/s_0(t)] + β_sŜ(t-ℓ),

because TC(t+τ) = s_0(t) + s_1(t)τ and therefore TC(t) = s_0(t).

In cases where the linear trend component is unknown, the value of s_0(t) appearing in (12.12) must itself be estimated. If such an estimate is denoted ŝ_0(t), then an estimate of the smoothed estimate Ŝ(t) of the seasonal component S(t) would be obtained by replacing s_0(t) by its estimate ŝ_0(t) in (12.12) and using

(12.13)  Ŝ(t) = α_s[Y(t)/ŝ_0(t)] + β_sŜ(t-ℓ).

On the other hand, if S(t) were known, an estimate of s_0(t) would be provided by (12.7). Since the seasonal component is unknown, however, but its period is assumed to be ℓ, its last previous estimate Ŝ(t-ℓ) could be used to provide an estimate of its current value S(t). In turn, an estimate of the smoothed estimate ŝ_0(t) of the intercept term of the linear trend is provided by using Ŝ(t-ℓ) in place of S(t) in (12.7) as

(12.14)  ŝ_0(t) = α_0[Y(t)/Ŝ(t-ℓ)] + β_0[ŝ_0(t-1) + ŝ_1(t-1)].

Finally, we use this estimate ŝ_0(t) in (12.8) to obtain in turn the estimate

(12.15)  ŝ_1(t) = α_1[ŝ_0(t) - ŝ_0(t-1)] + β_1ŝ_1(t-1).

The forecast value Ŷ(t+τ) would then be given by

(12.16)  Ŷ(t+τ) = T̂C(t+τ)Ŝ(t+τ-ℓ),

where

(12.17)  T̂C(t+τ) = ŝ_0(t) + ŝ_1(t)τ.

The equations (12.13), (12.14), and (12.15), together with (12.16) and (12.17), are called Winters' seasonal exponential smoothing model for a linear trend with a multiplicative seasonal component [23]. It should be noted that in order to use this model one must choose the smoothing constants α_s, α_0, and α_1 in equations (12.13), (12.14), and (12.15) and, moreover, one must know the periodicity of the seasonal component as expressed in the choice of the period ℓ. Thus four parameters must be chosen from an analysis of the data or by other means in order to apply Winters' multiplicative exponential smoothing model.

Given the period ℓ of the seasonal component, Winters has provided a heuristic procedure for choosing α_s, α_0, and α_1 which consists of making various selections of the three smoothing constants over a grid of possible values. Then for each selection of α_s, α_0, and α_1, one-step ahead forecasts are made and the root mean squared error is calculated for each selection. One chooses that selection of α_s, α_0, and α_1 which is associated with the smallest of the calculated

root mean squared errors. This is an attempt to achieve a selection of the three smoothing constants which minimizes the mean squared error of the one-step ahead forecasts.

Three aspects of Winters' method should be noted. First, it must be assumed that the grid search procedure produces a selection of the smoothing constants α_s, α_0, and α_1 which is reasonably close to that selection which minimizes the mean squared errors of the forecasts over the sequence of observations. Second, it must also be assumed that this choice based on the past data remains a good choice for making forecasts for future time periods; and third, that the period ℓ of the seasonal influence is known exactly.

In Winters' model the seasonal influence is represented in the choice of the period ℓ and the resulting seasonals are given as a system of adjustments that are to be applied to the time series in accordance with the equations defining the model. In other words, the seasonal influence is not described explicitly as a trigonometric function. Harrison [14] has developed an alternative approach in which the seasonal influence is fitted by means of a trigonometric function and then the seasonal estimates are smoothed and used in forecasting. The multiplicative seasonal exponential smoothing approach has been recast as an additive seasonal model by McClain [18] and McClain and Thomas [19].
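Returning to the updating equations (12.13)-(12.15) and the forecast (12.16)-(12.17), the following sketch applies them to an assumed quarterly series (ℓ = 4); the smoothing constants and the crude initial values for the level, slope, and seasonal factors are illustrative assumptions, not part of Winters' prescription.

```python
def winters_multiplicative(y, period, alpha_s, alpha0, alpha1, s0, s1, seasonals):
    """One pass of Winters' multiplicative model; initial s0, s1 and the list of
    `period` seasonal factors are supplied (assumed) by the caller."""
    beta_s, beta0, beta1 = 1 - alpha_s, 1 - alpha0, 1 - alpha1
    seasonals = list(seasonals)
    for t, obs in enumerate(y):
        s_prev = seasonals[t % period]                # S_hat(t - l)
        s0_prev = s0
        s0 = alpha0 * (obs / s_prev) + beta0 * (s0 + s1)               # (12.14)
        s1 = alpha1 * (s0 - s0_prev) + beta1 * s1                      # (12.15)
        seasonals[t % period] = alpha_s * (obs / s0) + beta_s * s_prev  # (12.13)
    return s0, s1, seasonals

# Assumed quarterly data with trend and a repeating seasonal pattern.
y = [100, 120, 90, 110, 104, 125, 94, 115, 108, 130, 98, 119]
s0, s1, seas = winters_multiplicative(
    y, period=4, alpha_s=0.3, alpha0=0.4, alpha1=0.2,
    s0=100.0, s1=1.0, seasonals=[1.0, 1.15, 0.85, 1.05])

# Forecast (12.16)-(12.17): Y_hat(t+tau) = (s0 + s1*tau) * S_hat(t + tau - l).
t = len(y)
for tau in (1, 2, 3, 4):
    print(tau, round((s0 + s1 * tau) * seas[(t + tau - 4) % 4], 2))
```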

Burman [5] has used trigonometric functions to estimate seasonal influences in an additive components model.

In all the exponential smoothing models above, the smoothing constants α must be chosen by the user of the model. It is natural to consider a more general exponential smoothing model in which α is not a constant but depends on time. Such a model, which can be called a dynamic exponential smoothing model, has been developed by Trigg and Leach [22] and has been used by Dunn, Williams, and Spivey [8].

13. BOX-JENKINS FORECASTING PROCEDURES

In using the exponential smoothing models of the preceding sections one oftentimes does not attempt to analyze the underlying probability structure of the time series by methods of statistical inference. The smoothing constant α, although related to the probability structure of the additive error terms of the underlying components model, is often chosen by the user without reference to more objective considerations of statistical inference. Moreover, as we have also emphasized, these models require one to assume that the trend-cyclical component is a polynomial of known degree q. The simple recursive updating relationships in these models, however, allow them to be applied conveniently in an almost mechanical fashion and this simplicity has made

these models widely attractive in situations where many time series must be forecast repeatedly on a routine basis, as in parts inventory systems.

In contrast to the features above, Box and Jenkins [3] have developed forecasting procedures which require one to analyze the underlying probability structure of the time series in considerably more detail, and methods of statistical inference are employed systematically to estimate properties of this underlying structure. Secondly, these procedures are more flexible than exponential smoothing models with respect to the specification of the underlying trend-cyclical component of the time series. For example, in exponential smoothing we must assume that this component is a polynomial of degree q, whereas Box-Jenkins procedures provide various transformations which can be performed on the data and which accommodate more varied trend-cyclical components. Thus, if the data appear to have an exponential trend, one would take logarithms of the data and proceed with the analysis. If the logarithms of the data themselves appear to have a linear trend, one could difference the logarithms. In short, without having to specify the trend-cyclical component explicitly and then estimate it, the Box-Jenkins procedure provides a flexible system of simple transformations which are applied to the data until an autoregressive moving average process of order (p, q) appears to result.
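As a small, purely illustrative sketch of such a transformation sequence (the series below is simulated, and the particular transformations are the ones just described), one might write:

```python
import numpy as np

# A hypothetical series with an apparently exponential trend.
t = np.arange(120)
rng = np.random.default_rng(0)
y = np.exp(0.02 * t + rng.normal(scale=0.05, size=t.size))

log_y = np.log(y)          # exponential trend becomes roughly linear
z = np.diff(log_y, n=1)    # first difference of the logarithms removes the linear trend
# z would now be examined (sample autocorrelations, etc.) to see whether a
# low order ARMA(p, q) process appears to describe it.
```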

Thirdly, unlike exponential smoothing models, which have simple updating features, Box-Jenkins procedures require more elaborate computer algorithms for obtaining estimates of the parameters of the ARMA process, and there is no simple recursive method for updating these estimates as additional observations become available.

We now proceed to a discussion of the Box-Jenkins procedure. Suppose that after a time series Y(t) is differenced d times we arrive at a time series Z(t) = ∇^d Y(t), where it is known that Z(t) is an ARMA(p, q) process of the form (8.1), namely

(13.1)   Z(t) - φ_1 Z(t-1) - ... - φ_p Z(t-p) = ε(t) - θ_1 ε(t-1) - ... - θ_q ε(t-q).

We assume that the ε(t) are white noise and that all the roots of the associated polynomial equation lie outside the unit circle, so that Z(t) is a covariance stationary process. When d > 0, Y(t) is called an integrated autoregressive moving average process and is denoted by ARIMA(p, d, q). When the expected value of the original time series Y(t) is a constant μ ≠ 0, we will consider Ỹ(t) = Y(t) - μ

instead of Y(t), and if Ỹ(t) is an ARMA(p, q) process with mean zero we say that Y(t) is an ARMA process with mean μ. In addition, we generalize (13.1) slightly by permitting the ε(t) to have constant variance σ² rather than unit variance as formerly. These minor extensions concerning the mean and variance, unnecessary in earlier discussions, are now made to permit us to consider a slightly wider class of economic time series.

For purposes of forecasting the time series Z(t), Box and Jenkins suggest using (13.1), which requires the estimation of p + q parameters together with the variance σ², and they refer to (13.1) as a parsimonious representation. Obtaining optimal or minimum mean squared error τ-step ahead forecasts for models of the form (13.1) is simplified because Z(t) and ε(t) appear in (13.1) in essentially linear fashion. To see why this is the case, let Ẑ(t+τ) denote a forecast of the expected value of Z(t+τ) based on a linear representation involving the current and all previous observations,

(13.2)   Ẑ(t+τ) = Π_0 Z(t) + Π_1 Z(t-1) + Π_2 Z(t-2) + ... = Σ_{j=0}^{∞} Π_j Z(t-j).

Since, however, Z(t) can be represented as a moving average process of infinite order,

(13.3)   Z(t) = Σ_{i=0}^{∞} ψ_i ε(t-i)

with ψ_0 = 1, we find that the forecast (13.2) can be expressed as a linear combination of only the error terms associated with the moving average part of the process,

(13.4)   Ẑ(t+τ) = Σ_{j=0}^{∞} Π_j Z(t-j) = Σ_{j=0}^{∞} Π_j Σ_{i=0}^{∞} ψ_i ε(t-j-i) = Σ_{k=0}^{∞} Π'_k ε(t-k),

where

(13.5)   Π'_k = Π_0 ψ_k + Π_1 ψ_{k-1} + ... + Π_k ψ_0.

Therefore the difference between the actual value Z(t+τ) and its forecast value Ẑ(t+τ) can be expressed as

(13.6)   Z(t+τ) - Ẑ(t+τ) = Σ_{i=0}^{τ-1} ψ_i ε(t+τ-i) + Σ_{k=0}^{∞} (ψ_{τ+k} - Π'_k) ε(t-k),

using (13.3) and (13.4). Consequently,

E[(Z(t+τ) - Ẑ(t+τ))²] = σ² [ Σ_{i=0}^{τ-1} ψ_i² + Σ_{k=0}^{∞} (ψ_{τ+k} - Π'_k)² ].

From the decomposition of the sum of squares above we see that

Ẑ(t+τ) will be the minimum mean squared error or optimal forecast if and only if the coefficients Π'_k are chosen so as to be identical to the corresponding coefficients ψ_{τ+k} which appear in the representation (13.3) of Z(t+τ) as a moving average process of infinite order. Therefore (13.4) can be expressed as

(13.7)   Ẑ(t+τ) = Σ_{k=0}^{∞} ψ_{τ+k} ε(t-k),

and the mean squared error associated with this forecast is

E[(Z(t+τ) - Ẑ(t+τ))²] = σ² Σ_{i=0}^{τ-1} ψ_i²,

the mean squared error in this case, of course, being identical to the variance of the forecast error.

An alternate expression for this optimal forecast is obtained by returning to the ARMA(p, q) process which has (13.7) as its infinite order moving average representation. To develop such an expression we first observe that Ẑ(t+τ) as expressed in (13.7) is simply the conditional expectation of Z(t+τ) given Z(t), Z(t-1), ..., or equivalently the conditional expectation of Z(t+τ) given ε(t), ε(t-1), .... In other words, if we calculate the conditional expectation of Z(t+τ) given ε(t), ε(t-1), ..., we see that

(13.8)   E[Z(t+τ) | ε(t), ε(t-1), ...] = E[ Σ_{i=0}^{∞} ψ_i ε(t+τ-i) | ε(t), ε(t-1), ... ]
                                       = E[ Σ_{i=0}^{τ-1} ψ_i ε(t+τ-i) | ε(t), ε(t-1), ... ] + E[ Σ_{i=τ}^{∞} ψ_i ε(t+τ-i) | ε(t), ε(t-1), ... ].

When the successive future errors ε(t+1), ..., ε(t+τ) are independent of the errors ε(t), ε(t-1), ... of the current and previous periods, then

E[ Σ_{i=0}^{τ-1} ψ_i ε(t+τ-i) | ε(t), ε(t-1), ... ] = E[ Σ_{i=0}^{τ-1} ψ_i ε(t+τ-i) ].

Also, since the mean of ε(t) is zero,

E[ Σ_{i=0}^{τ-1} ψ_i ε(t+τ-i) ] = Σ_{i=0}^{τ-1} ψ_i E[ε(t+τ-i)] = 0,

so we obtain

(13.9)   E[Z(t+τ) | ε(t), ε(t-1), ...] = E[ Σ_{k=0}^{∞} ψ_{τ+k} ε(t-k) | ε(t), ε(t-1), ... ] = Σ_{k=0}^{∞} ψ_{τ+k} ε(t-k),

showing, as asserted, that

(13.10)   Ẑ(t+τ) = E[Z(t+τ) | ε(t), ε(t-1), ...].

Suppose we take (13.9) and reexpress it as an ARMA(p, q) process; then we obtain

(13.11)   Z*(t+τ; t) - φ_1 Z*(t+τ-1; t) - ... - φ_p Z*(t+τ-p; t) = ε*(t+τ; t) - θ_1 ε*(t+τ-1; t) - ... - θ_q ε*(t+τ-q; t),

where the Z*(t+τ; t), ε*(t+τ; t), etc., denote random variables which represent the conditional expectations involved, namely

Z*(t_2; t_1) = E[Z(t_2) | ε(t_1), ε(t_1-1), ...],
ε*(t_2; t_1) = E[ε(t_2) | ε(t_1), ε(t_1-1), ...].

Therefore the forecast Ẑ(t+τ) is expressed in the form of an ARMA(p, q) process as

(13.12)   Ẑ(t+τ) = Σ_{i=1}^{p} φ_i Z*(t+τ-i; t) - Σ_{j=1}^{q} θ_j ε*(t+τ-j; t) + ε*(t+τ; t).

Thus to generate the optimal τ-step ahead forecast, the parameters φ_1, ..., φ_p, θ_1, ..., θ_q, as well as the various conditional expectations appearing in (13.12), must be known or be estimated from the data.
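Equations (13.11) and (13.12) translate directly into a recursion: for lead times τ = 1, 2, ..., the conditional expectation Z*(t+τ-i; t) is an observed value of Z whenever t+τ-i ≤ t and a previously computed forecast otherwise, while ε*(t+τ-j; t) is an observed (estimated) residual whenever t+τ-j ≤ t and zero otherwise. The sketch below is our own illustration and assumes that the parameters φ_i and θ_j and the residuals have already been estimated; it shows the forecasting recursion only, not the Box-Jenkins estimation machinery.

```python
import numpy as np

def arma_forecasts(z, resid, phi, theta, max_lead):
    """tau-step ahead forecasts of an ARMA(p, q) process, following (13.12):
    past values of Z and of the errors enter as observed, future errors have
    conditional expectation zero, and future values of Z are replaced by their
    own forecasts.  z and resid are the observed series and residuals;
    phi and theta hold the AR and MA coefficients."""
    z, resid = list(z), list(resid)
    p, q = len(phi), len(theta)
    n = len(z)                      # forecast origin is the last observation, index n - 1
    forecasts = []
    for tau in range(1, max_lead + 1):
        value = 0.0
        # Autoregressive part: observed Z for lags reaching back into the sample,
        # earlier forecasts for lags that fall beyond the forecast origin.
        for i in range(1, p + 1):
            k = n - 1 + tau - i          # index of Z(t + tau - i)
            value += phi[i - 1] * (z[k] if k < n else forecasts[k - n])
        # Moving average part: estimated residuals for past periods, zero for future ones.
        for j in range(1, q + 1):
            k = n - 1 + tau - j          # index of eps(t + tau - j)
            if k < n:
                value -= theta[j - 1] * resid[k]
        forecasts.append(value)
    return forecasts
```

With the sign convention of (13.1), the θ_j enter the forecast with a minus sign, which is why they are subtracted in the moving average part of the loop.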

The Box-Jenkins procedure provides two ways to calculate the conditional expectations: one is to use the unconditional expectations and the other is to utilize the "back-forecasting" scheme. These methods are discussed in Chapter 6 of [3]. When the finite dimensional distributions of the ε(t) are multivariate normal, maximum likelihood estimators of the parameters φ_1, ..., φ_p, θ_1, ..., θ_q, and σ² can be obtained as solutions to a system of simultaneous non-linear equations (see Chapter 7 of [3]). Other estimators are also derived based upon the minimization of sums of squares related to the quadratic form associated with the multivariate normal distribution. These approximations concentrate on minimizing the quadratic form rather than the likelihood function which would be involved in the exact maximum likelihood procedure; a small computational sketch of this idea is given at the end of this section.

Each of the preceding forecasting and estimation procedures assumes that we know the order of differencing d of the time series of observations and the orders p and q of the autoregressive and moving average processes. Box and Jenkins have also provided identification and diagnostic procedures to assist the user in determining the numbers p, d, and q by classical methods of statistical inference that involve the Yule-Walker equations and sample estimates of the autocorrelations and partial autocorrelations of the data as well as the estimated forecast errors.
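These sample statistics are easy to compute. The sketch below is our own illustration, not a procedure taken from [3]: it obtains the sample autocorrelations directly and the sample partial autocorrelations by solving the Yule-Walker equations at successive orders. In the classical identification reasoning, autocorrelations that die out slowly suggest further differencing, a cutoff in the autocorrelations after lag q suggests a moving average component of order q, and a cutoff in the partial autocorrelations after lag p suggests an autoregressive component of order p.

```python
import numpy as np

def sample_acf(z, max_lag):
    """Sample autocorrelations r(1), ..., r(max_lag) of a series z."""
    z = np.asarray(z, dtype=float) - np.mean(z)
    c0 = np.sum(z * z)
    return np.array([np.sum(z[k:] * z[:-k]) / c0 for k in range(1, max_lag + 1)])

def sample_pacf(z, max_lag):
    """Sample partial autocorrelations: the k-th value is the last coefficient
    of the order-k autoregression obtained from the Yule-Walker equations."""
    r = sample_acf(z, max_lag)
    pacf = []
    for k in range(1, max_lag + 1):
        # Yule-Walker matrix of order k, with rho(0) = 1 on the diagonal.
        R = np.array([[1.0 if i == j else r[abs(i - j) - 1] for j in range(k)]
                      for i in range(k)])
        phi = np.linalg.solve(R, r[:k])
        pacf.append(phi[-1])
    return np.array(pacf)
```

In practice these estimates would be compared with approximate standard errors (roughly 1/√n) to judge which lags differ significantly from zero.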

Determination of p, d, and q is a crucial part of the Box-Jenkins procedure and can lead to different forecasting models being generated from the same data by different users. Since these methods are somewhat unsatisfactory, recent research has suggested alternative approaches to the identification and diagnostic procedures of Box and Jenkins (see Parzen [21]).
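Finally, the sum-of-squares approximation to estimation promised above can be sketched as follows: for trial values of the parameters, residuals are generated recursively from (13.1) with pre-sample values set to zero, and the sum of squared residuals is minimized numerically. This is our own simplified illustration, not the procedure of Chapter 7 of [3], and it omits the back-forecasting refinement mentioned earlier.

```python
import numpy as np
from scipy.optimize import minimize

def css(params, z, p, q):
    """Conditional sum of squares for an ARMA(p, q) model of z.  params holds
    the p AR coefficients followed by the q MA coefficients; pre-sample values
    of z and of the errors are treated as zero."""
    phi, theta = params[:p], params[p:p + q]
    n = len(z)
    eps = np.zeros(n)
    for t in range(n):
        ar = sum(phi[i] * z[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        eps[t] = z[t] - ar + ma          # from (13.1): eps(t) = z(t) - AR part + MA part
    return np.sum(eps ** 2)

def fit_arma_css(z, p, q):
    """Minimize the conditional sum of squares over the p + q parameters."""
    start = np.zeros(p + q)
    result = minimize(css, start, args=(np.asarray(z, float), p, q), method="Nelder-Mead")
    phi_hat, theta_hat = result.x[:p], result.x[p:p + q]
    sigma2_hat = result.fun / len(z)     # residual variance estimate
    return phi_hat, theta_hat, sigma2_hat
```

The exact maximum likelihood and back-forecasting treatments refine the handling of the pre-sample values that this sketch simply sets to zero.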

BIBLIOGRAPHY

1. Ash, R. B. Real Analysis and Probability. New York: Academic Press, 1972.
2. Blackman, R. B., and Tukey, J. The Measurement of Power Spectra. New York: Dover Publications, 1959.
3. Box, G. E. P., and Jenkins, G. Time Series Analysis, Forecasting and Control. San Francisco: Holden-Day, 1970.
4. Brown, Robert G. Smoothing, Forecasting and Prediction of Discrete Time Series. Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1962.
5. Burman, J. P. "Moving Seasonal Adjustment of Economic Time Series." Journal of the Royal Statistical Society, Series A, 128 (1965): 534-558.
6. Cogger, K. O. "Extensions of the Fundamental Theorem of Exponential Smoothing." Management Science 19 (Jan. 1973): 547-554.
7. Doob, J. L. Stochastic Processes. New York: John Wiley and Sons, 1953.
8. Dunn, D. M., Williams, W. H., and Spivey, W. A. "Analysis and Prediction of Telephone Demand in Local Geographical Areas." The Bell Journal of Economics and Management Science 2 (Autumn 1971): 561-576.
9. Feller, W. An Introduction to Probability Theory and Its Applications, vol. I. New York: John Wiley and Sons, 1957.
10. Feller, W. An Introduction to Probability Theory and Its Applications, vol. II. New York: John Wiley and Sons, 1966.
11. Granger, C. W. J., and Hatanaka, M. Spectral Analysis of Economic Time Series. Princeton: Princeton University Press, 1964.
12. Grenander, U., and Rosenblatt, M. Statistical Analysis of Stationary Time Series. New York: John Wiley and Sons, 1957.
13. Hannan, E. J. Time Series Analysis. New York: John Wiley and Sons, 1960.

14. Harrison, P. J. "Short-Term Sales Forecasting." Applied Statistics 14 (1965): 102-134.
15. Jenkins, G., and Watts, D. G. Spectral Analysis and Its Applications. San Francisco: Holden-Day, 1968.
16. Kendall, M. G. Time-Series. New York: Hafner Publishing Co., 1973.
17. Kendall, M. G., and Stuart, A. The Advanced Theory of Statistics, vol. 3. New York: Hafner Publishing Co., 1968.
18. McClain, J. O. "Dynamics of Exponential Smoothing with Trend and Seasonal Terms." Management Science 20 (May 1974): 1300-1304.
19. McClain, J. O., and Thomas, L. J. "Response-Variance Tradeoffs in Adaptive Forecasting." Operations Research 21 (March-April 1973): 554-568.
20. Nelson, C. R. Applied Time Series Analysis. San Francisco: Holden-Day, 1973.
21. Parzen, E. "Some Solutions to the Time Series Modeling and Prediction Problem." Technical Report No. 5, Department of Computer Science, State University of New York at Buffalo, Feb. 1974.
22. Trigg, D. W., and Leach, A. G. "Exponential Smoothing with an Adaptive Response Rate." Operational Research Quarterly 18 (1967): 53-59.
23. Winters, P. R. "Forecasting Sales by Exponentially Weighted Moving Averages." Management Science 6 (1960): 324-342.
24. Yaglom, A. M. An Introduction to the Theory of Stationary Random Functions. Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1962.