Division of Research
Graduate School of Business Administration
The University of Michigan

September 1981

TIME SERIES ANALYSIS OF BINARY DATA

Working Paper No. 277

Daniel McRae Keenan
The University of Michigan

FOR DISCUSSION PURPOSES ONLY
None of this material is to be quoted or reproduced without the express permission of the Division of Research.

Acknowledgement. This research was funded by a grant from the National Science Foundation, No. MCS76-81435. This work was part of the author's Ph.D. thesis submitted to the Department of Statistics, University of Chicago. The author is grateful to Professor Ronald A. Thisted, under whose guidance this work was carried out.

1. Introduction

In both experimental and nonexperimental settings, measurements taken sequentially in time have become quite common. For example, we may be observing the same machine or individual over time; the position of a satellite and stress level measurements of an individual are data of this form. The theory for continuous-valued random variables is rather well developed (e.g., regression analysis, time series analysis); for discrete-valued data, however, different approaches are required. For independent observations, logit, probit, and log-linear models are available; this is the domain of qualitative data analysis. When the observations are dependent and discrete-valued, stochastic process models are relevant. Markov chain theory is well developed; Billingsley (1961) and Kemeny and Snell (1960) are relevant references for statistical inference on Markov chains. However, problems often exhibit more structure than just stationary transition probabilities. This work is concerned with developing time series models for discrete-valued data, allowing for arbitrarily long memory. The learning theory models and chains of infinite order are related, but are concerned with different structure (Lamperti and Suppes (1959), Bush and Mosteller (1951)).

The data in the following three examples typify a certain type of data: the observations are sequential in time, discrete-valued, and correlated. In labor economics an important problem is the determination of those factors which influence labor force participation. The Bureau of Labor Statistics records over time (e.g., monthly) whether various individuals are in the labor force or not (Heckman (1979)). In psychology one can study the moods of individuals over time in order to determine the variability both within and across individuals (Larson (1979)). For example, if the happy-sad continuum (i.e., a semantic differential) is divided into two parts, happy and sad, then the data would consist of daily recordings of zeroes and ones.

In business the prediction of directional changes of business cycles is of concern. If the data consist of only the past directional changes, then the data are a series of zeroes and ones.

In these three examples the data (D_1, D_2, ..., D_n) were discrete-valued (actually, binary), sequential in time, and correlated. The approach of this paper is quite general; it is assumed that an underlying time series of continuous-valued data generates the time series of discrete-valued data. The family of discretization mechanisms is broad and can often be given intuitive meaning. The underlying probability structure is exploited, with much of the structure carrying over to the discrete process probabilities. Stationarity is a reasonable condition on the correlations; it also serves as a first approximation in the nonstationary case. The next section describes some alternative approaches to modelling data of this type.

2. Alternative Approaches

One method for modelling time series of discrete-valued data is to approach the subject in a manner analogous to Box-Jenkins (1970); one such approach was considered by Jacobs and Lewis in two articles (1978a, 1978b). Jacobs and Lewis replace the linear combinations of the continuous-valued case with probabilistic mixtures and call their models discrete autoregressive-moving average (DARMA) models. As an example, consider the simplest Box-Jenkins model, the first-order autoregressive model (AR(1)), where the X process is formed according to

X_n = \rho X_{n-1} + e_n,

where |\rho| < 1 and \{e_n\}_{n=-\infty}^{\infty} are i.i.d. (0, \sigma^2). The DAR(1) sequence \{D_n\}_{n=-\infty}^{\infty} is formed according to

D_n = D_{n-1}  with probability \rho,  0 < \rho < 1,
D_n = E_n      with probability 1 - \rho,

where \{E_n\}_{n=-\infty}^{\infty} is an i.i.d. sequence of random variables taking values in the discrete set.
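To make the contrast between the two models concrete, the following sketch simulates both an AR(1) series and a DAR(1) series. It is an illustration added here, not part of the original development; the value 0.6, used both for the autoregressive coefficient and for the DAR(1) mixing probability, and the two-point innovation set {0, 1} are arbitrary choices, and the function names are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar1(n, rho, sigma=1.0):
    """Continuous-valued AR(1): X_t = rho * X_{t-1} + e_t, with e_t i.i.d. N(0, sigma^2)."""
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal(0.0, sigma)
    return x

def simulate_dar1(n, rho, values=(0, 1)):
    """DAR(1) of Jacobs and Lewis: D_t = D_{t-1} with probability rho, otherwise a fresh draw E_t."""
    d = np.empty(n, dtype=int)
    d[0] = rng.choice(values)
    for t in range(1, n):
        if rng.random() < rho:
            d[t] = d[t - 1]               # previous value copied: memory retained
        else:
            d[t] = rng.choice(values)     # innovation chosen: all earlier memory is lost
    return d

x = simulate_ar1(10_000, rho=0.6)
d = simulate_dar1(10_000, rho=0.6)
# Both series have lag-1 autocorrelation near 0.6, but no choice of the mixing
# probability in (0, 1) can give the DAR(1) series a negative autocorrelation.
print(np.corrcoef(x[:-1], x[1:])[0, 1], np.corrcoef(d[:-1], d[1:])[0, 1])
```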

An analogue of the Yule-Walker equations exists in which linear combinations of correlations are replaced by probabilistic mixtures. An advantage in approaching the discrete case in a mode analogous to Box-Jenkins is that the directions which have and have not proven fruitful in the continuous-valued case may provide guidance.

There are some disadvantages to the Jacobs-Lewis approach. For example, in both the DAR(1) and AR(1) models, \rho^j is the autocorrelation of lag j; however, \rho is forced to be nonnegative in the discrete case, so that all autocorrelations must be nonnegative, thereby restricting the model's applicability. In the DAR(1) model, D_{n-1} contains all information available at time n-1 about the past, as does X_{n-1} in the AR(1) model. However, in the discrete model there is randomizing between D_{n-1} and E_n; E_n contains no information about the past, and if E_n is chosen, then all memory before time n is lost forever. In other words, the memory of such discrete models is discontinuous, whereas in the AR(1) model the memory dies out geometrically. These same advantages and disadvantages carry over to the general DARMA models.

Another alternative is the Markov chain or modifications thereof; no exposition of Markov chain modelling will be given here. Markov models of lagged dependency of order two or more become quite complicated and cumbersome in terms of calculations. However, the Markov property, by its very construction, lends itself quite readily to forecasting.

Lomnicki and Zaremba (1955) and Kedem (1980b) model binary data as the clipping of an underlying process; a clipping is a truncation at zero. Kedem's work is for the situation where the underlying data are available and, if need be, can be properly centered; the original data, for computational reasons, are replaced by the coarse data, zeroes and ones, which are used for estimation. Lomnicki and Zaremba consider the clipping of the kth difference of a Gaussian process.

3. Basic Framework

An alternative to defining the structure on the discrete process directly is to assume, as Lomnicki and Zaremba (1955) and Kedem (1980b) do, that it inherits a certain structure from an underlying continuous process. Discretization by a threshold or truncation of a continuous-valued process is a common phenomenon in engineering and biology. I consider a general procedure which includes thresholds and truncations as special cases.

A binary process \{D_n\}_{n=-\infty}^{\infty} is assumed to be generated by \{X_n\}_{n=-\infty}^{\infty}, a continuous-valued, strictly stationary time series, and a monotone function F: R \to [0,1] in the following way. Given \{X_n\}_{n=-\infty}^{\infty}, the D_n are independent and

P(D_n = 1 | X_n) = F(X_n),
P(D_n = 0 | X_n) = 1 - F(X_n).        (3.1)

By the definition of D_n, if 0 < j_1 < j_2 < ... < j_k are integers, then

P(D_1, D_{1+j_1}, ..., D_{1+j_k} | X_1, X_{1+j_1}, ..., X_{1+j_k}) = P(D_1 | X_1) P(D_{1+j_1} | X_{1+j_1}) \cdots P(D_{1+j_k} | X_{1+j_k}).

Dependence among the responses D_n, given X_n, could also be considered, although for most applications independence has intuitive appeal. The D_n process is generated through a response function F which maps R into [0,1]. Some examples of F follow.

Example 1. Truncation or Threshold Function. A continuous process is truncated at some value \mu, so that, given X_n,

D_n = 1 if X_n \geq \mu,
D_n = 0 if X_n < \mu.

Example 2. Probability Cumulative Distribution Function. In this case, the greater the underlying value, the greater the probability of a one being generated. One can think of this as a random threshold or, equivalently, as the introduction of measurement error followed by truncation.

Example 3. Survival Function. A survival function is one minus a probability c.d.f.; this has the reverse effect of Example 2.

Example 4. Defective c.d.f. This puts bounds on the response probabilities below one and/or above zero.

The three examples given in the Introduction can be discussed in terms of these generating procedures. In the mood analysis and business cycle turning point examples, the underlying processes are the true mood and the actual value of the economic indicator, respectively. In the labor force participation example, the underlying process could be the difference in the lifetime utility at time n of the individual if employed and if not employed (Heckman (1979)).

A binary time series generated by these methods is strictly stationary, since the underlying X process is strictly stationary. The probability of the event \{D_1 = d_1, D_2 = d_2, ..., D_n = d_n\}, where (d_1, d_2, ..., d_n) \in \{0,1\}^n, is

P\{D_1 = d_1, D_2 = d_2, ..., D_n = d_n\}
  = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} [F(x_1)]^{d_1} [1 - F(x_1)]^{1-d_1} \cdots [F(x_n)]^{d_n} [1 - F(x_n)]^{1-d_n} \, dG_n(x_1, x_2, ..., x_n).        (3.2)

Second-order stationarity of the underlying process will not insure second-order stationarity of the discrete process. A characterization lemma, which follows, shows that all strictly stationary binary processes can be generated by the above procedure through just the Gaussian processes.
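As an illustration of the generating mechanism (3.1), the sketch below produces a binary series from a Gaussian AR(1) underlying process, once with a threshold response (Example 1) and once with a Gaussian c.d.f. response (Example 2). The AR(1) choice, the threshold at zero, and the response standard deviation 0.5 are assumptions made only for the example.

```python
import math
import numpy as np

rng = np.random.default_rng(1)

def gaussian_cdf(v, sd):
    """c.d.f. of N(0, sd^2)."""
    return 0.5 * (1.0 + math.erf(v / (sd * math.sqrt(2.0))))

def simulate_underlying(n, rho=0.7):
    """A stationary Gaussian AR(1) with Var(X_n) = 1, used here as the underlying X process."""
    x = np.empty(n)
    x[0] = rng.normal()
    innov_sd = math.sqrt(1.0 - rho ** 2)
    for t in range(1, n):
        x[t] = rho * x[t - 1] + rng.normal(0.0, innov_sd)
    return x

def discretize(x, response):
    """Given X_n, draw D_n = 1 with probability F(X_n), independently over n, as in (3.1)."""
    probs = np.array([response(v) for v in x])
    return (rng.random(len(x)) < probs).astype(int)

x = simulate_underlying(5000)
d_threshold = discretize(x, lambda v: 1.0 if v >= 0.0 else 0.0)    # Example 1, threshold mu = 0
d_random    = discretize(x, lambda v: gaussian_cdf(v, sd=0.5))     # Example 2, F = Gaussian c.d.f.
print(d_threshold[:20])
print(d_random[:20])
```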

The integral in equation (3.2) is similar to a convolution and can be given probabilistic meaning. The following lemma, a variation of a known result (Feller (1971), p. 144), gives an alternative view of the response function F. Define the sets C(0) = (-\infty, 0) and C(1) = [0, \infty).

Lemma 3.1. Let the X process and F be as above and let G_n be the distribution function of (X_1, X_2, ..., X_n). Let \{Y_i\}_{i=1}^{\infty} be i.i.d. with c.d.f. F, independent of \{X_n\}_{n=-\infty}^{\infty}, and define V_i = X_i - Y_i. Then, for (d_1, d_2, ..., d_n) \in \{0,1\}^n,

P\{D_1 = d_1, D_2 = d_2, ..., D_n = d_n\} = P\{V_1 \in C(d_1), V_2 \in C(d_2), ..., V_n \in C(d_n)\}.

Proof. The event \{V_1 \in C(d_1), V_2 \in C(d_2), ..., V_n \in C(d_n)\} = \{Y_j \leq X_j for j with d_j = 1, or X_j < Y_j for j with d_j = 0; j = 1, 2, ..., n\}. Pick \epsilon > 0; R = \cup_k I_k, where I_k = (k\epsilon, (k+1)\epsilon]. Since the \{Y_i\} are independent and independent of the X_i process, the probability of the above event is approximately

\sum_{k_1=-\infty}^{\infty} \sum_{k_2=-\infty}^{\infty} \cdots \sum_{k_n=-\infty}^{\infty} \Big\{ \prod_{j=1}^{n} [F(x_j)]^{d_j} [1 - F(x_j)]^{1-d_j} \Big\} P\{X_1 \in I_{k_1}, X_2 \in I_{k_2}, ..., X_n \in I_{k_n}\},

where x_j \in I_{k_j}. If x \in I_k = (k\epsilon, (k+1)\epsilon], then F(k\epsilon) \leq F(x) \leq F((k+1)\epsilon), and the above probability is bounded above and below by sums which depend only on \epsilon and which both converge to P\{D_1 = d_1, D_2 = d_2, ..., D_n = d_n\}, defined by expression (3.2), as \epsilon \to 0. Q.E.D.

This lemma shows that there is a duality with which one can view the generation of discrete processes through response functions. One can view the binary process as being generated by the X_n process and F or, equivalently, by the V_n process and a truncation at zero.

Note that if X_n is an ARMA(p,q) process, then V_n will usually not be an ARMA process of the same order, although it will be in the ARMA family (see Granger and Morris (1976)). Lemma 3.1 says that, probabilistically, we need only consider truncation as the discretization mechanism if we are concerned only with properties of strictly stationary time series. The binary process is a strictly stationary time series with

E(D_n) = P(V_n \geq 0) = P(D_n = 1),
Var(D_n) = P(V_n \geq 0) P(V_n < 0),

and

Cov(D_n, D_{n+j}) = P(V_{n+j} \geq 0, V_n \geq 0) - [P(V_n \geq 0)]^2,    j \geq 1.

If \{X_n\}_{n=-\infty}^{\infty} is a Gaussian process with E(X_n) = 0, Var(X_n) = \tau^2, and correlation structure \{\rho(j)\}_{j=1}^{\infty}, and if F = \Phi_{0,b^2}, the c.d.f. of N(0, b^2), then the V process is Gaussian with E(V_n) = 0, and the covariance matrix of (V_{n+1}, V_n, ..., V_1) is given by M_n with elements

M_{ij} = \rho(|i-j|)\tau^2  if i \neq j,
M_{ij} = (1+k)\tau^2        if i = j,        i, j = 1, 2, ..., n+1,

where k = b^2/\tau^2 is the ratio of the variance of the response function to that of the underlying process. If there is no response function variation, i.e., a truncation, then k = 0; as the amount of response function variation increases, the autocorrelations of the V process are reduced, so that the past of the V process has less effect on its future than the past of the X process has on the future of the X process. Defining \rho_n' = (\rho(1), \rho(2), ..., \rho(n)) and V_n' = (V_n, V_{n-1}, ..., V_1), we have, in partitioned form,

M_n = \begin{bmatrix} (1+k)\tau^2 & \tau^2 \rho_n' \\ \tau^2 \rho_n & M_{n-1} \end{bmatrix}.

The regression coefficients of V_{n+1} on (V_n, V_{n-1}, ..., V_1) are

S_n' = \rho_n' M_{n-1}^{-1},        (3.3)

and, defining Z_n = S_n' V_n, Z_n is Normal with mean zero and variance

Q_n = \tau^2 \rho_n' M_{n-1}^{-1} \rho_n.        (3.4)

Throughout, \tau^2 will be assumed to be one; the effect of taking other values for \tau^2 will be pointed out. The bivariate and trivariate probabilities involving the D_n process can be calculated, since the process can be viewed as a truncation of the V_n process. Classical results on the probabilities of bivariate and trivariate normal quadrants (orthants) can be used; see, for instance, the survey article by S. S. Gupta (1963). Applying these results to V_n, we have, for j, \ell \geq 1, (d_n, d_{n+j}) \in \{0,1\}^2, and (d_n, d_{n+j}, d_{n+j+\ell}) \in \{0,1\}^3,

P\{D_n = d_n, D_{n+j} = d_{n+j}\} = 1/4 + (-1)^{d_n + d_{n+j}} \arcsin[\rho(j)/(1+k)] / (2\pi)        (3.5)

and

P\{D_n = d_n, D_{n+j} = d_{n+j}, D_{n+j+\ell} = d_{n+j+\ell}\}
  = 1/8 + (-1)^{d_n + d_{n+j}} \arcsin[\rho(j)/(1+k)] / (4\pi)
        + (-1)^{d_{n+j} + d_{n+j+\ell}} \arcsin[\rho(\ell)/(1+k)] / (4\pi)
        + (-1)^{d_n + d_{n+j+\ell}} \arcsin[\rho(j+\ell)/(1+k)] / (4\pi).        (3.6)

The correlation structure of the D_n process is

Corr(D_n, D_{n+j}) = 2 \arcsin[\rho(j)/(1+k)] / \pi,    j \geq 1.        (3.7)
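The bivariate probability (3.5) and the correlation (3.7) are easy to evaluate, and Lemma 3.1 makes them easy to check by simulation, since the binary pair can be obtained by truncating a bivariate normal vector with correlation \rho(1)/(1+k) at zero. The sketch below does this for one illustrative pair of values of \rho(1) and k; it is a numerical check added here, not part of the paper.

```python
import numpy as np

rho1, k = 0.6, 0.5            # illustrative parameter values
r = rho1 / (1.0 + k)          # Corr(V_n, V_{n+1})

# Closed forms (3.5) and (3.7).
p11 = 0.25 + np.arcsin(r) / (2.0 * np.pi)     # P(D_n = 1, D_{n+1} = 1)
corr = 2.0 * np.arcsin(r) / np.pi             # Corr(D_n, D_{n+1})

# Monte Carlo check: truncate a bivariate normal with correlation r at zero.
rng = np.random.default_rng(2)
v = rng.multivariate_normal([0.0, 0.0], [[1.0, r], [r, 1.0]], size=200_000)
d = (v >= 0.0).astype(int)
p11_mc = np.mean((d[:, 0] == 1) & (d[:, 1] == 1))
corr_mc = np.corrcoef(d[:, 0], d[:, 1])[0, 1]

print(f"P(1,1): formula {p11:.4f}, simulation {p11_mc:.4f}")
print(f"Corr:   formula {corr:.4f}, simulation {corr_mc:.4f}")
```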

The discretization operates on the underlying correlations by rescaling in two stages: first by scaling down by the factor (1+k), due to the response function, and then by applying the scaled arcsine function, due to the truncation. Note that k does not depend on \{\rho(j)\}_{j=1}^{\infty}, except that any such structure not compatible with a finite \tau^2 constrains k to be zero. The function 2\arcsin[\cdot]/\pi is a monotone increasing function of [-1,1] onto itself, shrinking values towards zero.

The above derivations of the bivariate and trivariate probabilities of the D_n process were possible because of closed-form expressions for 2- and 3-dimensional multivariate normal orthant probabilities. For dimensions greater than 3 there is no general expression. There are a few special cases in 4 dimensions; see Gupta (1963) and Cheng (1969). Numerical methods are described in Abrahamson (1964), McFadden (1956), and Moran (1956), and some recursive formulas in Schlafli (1858), David (1953), Sondhi (1961), and Choi (1975). Only in the equicorrelated case is a closed-form expression known for general n (Gupta, 1963). If good approximations to the n-dimensional probabilities were available, then the joint and one-step-ahead transition probabilities of the D_n process could be approximated.

The following lemma is a characterization of those binary processes which can be generated by an underlying process through a response function. All strictly stationary binary processes can be generated in this manner; in fact, all can be generated under the Gaussian assumptions.

Lemma 3.2. Let \{D_n\}_{n=-\infty}^{\infty} be a binary-valued strictly stationary time series with finite-dimensional joint probabilities P\{D_1 = 0, D_2 = 0, ..., D_n = 0\} specified for n \geq 1. Then there exist an underlying stationary Gaussian process \{X_n\}_{n=-\infty}^{\infty} and a Gaussian c.d.f. F: R \to [0,1] such that the joint probabilities are given by expression (3.2).

Proof. We can assume without loss of generality that Var(X_n) = 1 and E(X_n) = 0, and that the mean and variance of F are \mu and k, respectively. Define c = \mu/\sqrt{1+k}, so that P\{D_1 = 0\} = P\{V_1 < 0\} = \Phi(c), and c = \Phi^{-1}(P\{D_1 = 0\}) is uniquely determined. Suppose that \rho(1), \rho(2), ..., \rho(j-1) have been determined, and let \Sigma_j be the correlation matrix with elements \Sigma_{\ell m} = \rho(|\ell - m|), \ell, m = 1, 2, ..., j+1, with \Sigma_j, therefore, being a function only of \rho(j). Let 0_n' = (0, 0, ..., 0) and 1_n' = (1, 1, ..., 1) be n-dimensional vectors of zeroes and ones. Then we have the following:

\lim_{\rho(j) \to 1} \Phi_{0, \Sigma_j}(c 1_{j+1}) = P\{V_1 \leq c, V_2 \leq c, ..., V_j \leq c, V_1 \leq c\} = P\{D_1 = 0, ..., D_j = 0\},

\lim_{\rho(j) \to -1} \Phi_{0, \Sigma_j}(c 1_{j+1}) = P\{V_1 \leq c, V_2 \leq c, ..., V_j \leq c, -V_1 \leq c\}.

The distribution function \Phi_{0, \Sigma_j}(c 1_{j+1}) is continuous in \rho(j), and Slepian's Theorem (see Slepian (1962), Das Gupta et al. (1972)) says that \Phi_{0, \Sigma_j}(c 1_{j+1}) is a monotone increasing function of \rho(j). Therefore, there exists a \rho(j) such that -1 < \rho(j) < 1 and

\Phi_{0, \Sigma_j}(c 1_{j+1}) = P\{D_1 = 0, D_2 = 0, ..., D_{j+1} = 0\},

and by mathematical induction there exists a sequence \{\rho(j)\}_{j=1}^{\infty}. By Kolmogorov's Existence Theorem there exists a stationary Gaussian process with correlation structure \{\rho(j)\}_{j=1}^{\infty}. Q.E.D.

A result more general than Slepian's Theorem is available for the family of elliptical distributions (see Das Gupta et al. (1972)). The next section is concerned with predicting and/or determining the conditional distribution of D_{n+1}, given data D_1, D_2, ..., D_n.

4. The Binary Prediction Model

For a time series of discrete data an important problem is how to use the data in making decisions about the future behavior of the series. One may want to predict this future behavior or make probabilistic statements about it. To be more precise, given d_1, d_2, ..., d_n, these problems consist of predicting and/or determining the conditional distribution of D_{n+j}, j \geq 1. In terms of the three examples given in the Introduction, the data (d_1, d_2, ..., d_n) would consist of employment history, moods, and economic directional changes, and our objective would be to predict and/or determine the distribution of future employment, mood, and directional change.

The determination of the transition probabilities P(D_{n+1} = d_{n+1} | D_n = d_n, ..., D_1 = d_1) is of fundamental importance. The derivation of approximations to these conditional probabilities is discussed in Keenan (1980) in the context of the loss of the Markov property; the difference between the actual probability and certain reasonable approximations can be viewed as a measure of such a loss. Therefore, the determination of the conditional distribution of D_{n+1}, given D_1, ..., D_n, will not be discussed here, but rather the related problem of predicting D_{n+1}, given D_1, D_2, ..., D_n.

It is assumed that \{X_n\}_{n=-\infty}^{\infty} is a strictly stationary time series generating \{D_n\}_{n=-\infty}^{\infty} through a response function F: R \to [0,1], as follows: given \{X_n\}_{n=-\infty}^{\infty}, the D_n are independent with

P(D_n = 1 | X_n) = F(X_n),
P(D_n = 0 | X_n) = 1 - F(X_n).

Given data d_1, d_2, ..., d_n from this process, the problem considered in this section is to predict d_{n+j}. For simplicity we shall deal with j = 1. To predict d_{n+1}, a prediction accuracy measure is needed. Since with binary data there is only one possible wrong and one possible right prediction, the probability of predicting incorrectly is a reasonable measure of prediction uncertainty. The probability of error using \hat{D}_{n+1} is given by

P(\hat{D}_{n+1} \neq D_{n+1}) = \sum_{(d_1, d_2, ..., d_n, d_{n+1})} P\{\hat{D}_{n+1} \neq D_{n+1}, (D_1, D_2, ..., D_n, D_{n+1}) = (d_1, d_2, ..., d_n, d_{n+1})\},        (4.1)

where \hat{D}_{n+1} is a predictor of D_{n+1} based on the data d_1, d_2, ..., d_n. The broadest class of predictors which can be considered is the class of all randomized rules. For an arbitrary randomized rule \delta, the probability of error, P.E.(\delta), is

\sum_{(d_1, d_2, ..., d_n)} [\delta(d_1, d_2, ..., d_n) P\{D_1 = d_1, D_2 = d_2, ..., D_n = d_n, D_{n+1} = 0\}
  + (1 - \delta(d_1, d_2, ..., d_n)) P\{D_1 = d_1, D_2 = d_2, ..., D_n = d_n, D_{n+1} = 1\}].        (4.2)

One nonrandomized rule which is a reasonable predictor of D_{n+1} is

\hat{D}_{n+1} = 1 if P(D_{n+1} = 1 | D_n = d_n, ..., D_1 = d_1) \geq 1/2,
\hat{D}_{n+1} = 0 if P(D_{n+1} = 1 | D_n = d_n, ..., D_1 = d_1) < 1/2,        (4.3)

which predicts the most probable value. A randomized rule which is just as reasonable is the one which randomizes with the conditional probability: \delta(d_1, d_2, ..., d_n) = P(D_{n+1} = 1 | D_n = d_n, ..., D_1 = d_1). The next lemma shows that \hat{D}_{n+1}, a nonrandomized rule, has minimum probability of error among all randomized rules.

Lemma 4.1. \hat{D}_{n+1} has minimum probability of error within the class of all randomized rules.

Proof. Define an arbitrary randomized rule \delta by \delta(d_1, d_2, ..., d_n) = P(predict 1 | d_1, d_2, ..., d_n). The P.E.(\delta) is given by equation (4.2). For a given (d_1, d_2, ..., d_n),

\delta(d_1, ..., d_n) P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 0\} + (1 - \delta(d_1, ..., d_n)) P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 1\}

lies between P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 0\} and P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 1\}, and so the P.E. is minimized by

\delta^*(d_1, ..., d_n) = 0 if P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 1\} < P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 0\},
\delta^*(d_1, ..., d_n) = 1 if P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 1\} \geq P\{D_1 = d_1, ..., D_n = d_n, D_{n+1} = 0\},

which is equivalent to \hat{D}_{n+1}. Q.E.D.

This lemma shows that if our main concern is predicting D_{n+1}, then we need not know the exact probability P(D_{n+1} = 1 | d_1, d_2, ..., d_n), but only whether or not it exceeds or equals 1/2.

We saw in Section 3 that, when \{X_n\}_{n=-\infty}^{\infty} and F = \Phi_{0,b^2} are Gaussian, the D_n are binomial (1, 1/2), with the bivariate and trivariate probabilities given by expressions (3.5) and (3.6), respectively. Consider the special case of just one data point, d_1. The minimum probability of error predictor of D_2 is given by

\hat{D}_2 = 1 if P(D_2 = 1 | D_1 = d_1) \geq 1/2,
\hat{D}_2 = 0 if P(D_2 = 1 | D_1 = d_1) < 1/2,

where

P(D_2 = d_2 | D_1 = d_1) = 1/2 + (-1)^{d_1 + d_2} \arcsin[\rho(1)/(1+k)] / \pi.

The predictor of D_2 is what we intuitively expect, namely

\hat{D}_2 = d_1 if \rho(1) \geq 0,
\hat{D}_2 = 1 - d_1 if \rho(1) < 0.
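A minimal sketch of this one-observation rule follows; it predicts concordance with d_1 when \rho(1) \geq 0 and discordance otherwise, and evaluates the arcsine expression for its probability of error given in (4.4) below. The parameter values are illustrative only.

```python
import numpy as np

def predict_d2(d1, rho1):
    """Minimum probability of error prediction of D_2 from the single observation D_1 = d1."""
    return d1 if rho1 >= 0.0 else 1 - d1

def prob_error(rho1, k):
    """P(D_2_hat != D_2) = 1/2 - arcsin(|rho(1)|/(1+k)) / pi, as in (4.4)."""
    return 0.5 - np.arcsin(abs(rho1) / (1.0 + k)) / np.pi

# Illustrative values: strong positive dependence, the same with response noise,
# negative dependence, and independence.
for rho1, k in [(0.9, 0.0), (0.9, 0.5), (-0.6, 0.2), (0.0, 0.0)]:
    print(rho1, k, predict_d2(1, rho1), round(prob_error(rho1, k), 4))
```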

The P.E.(\hat{D}_2) can be calculated:

P(\hat{D}_2 \neq D_2) = 1/2 - \arcsin[|\rho(1)|/(1+k)] / \pi \leq 1/2.        (4.4)

For \rho(1) = 0, we have independence of D_2 and D_1, and since D_1 gives no information about D_2, the probability of predicting wrongly is 1/2. If |\rho(1)| = 1, then Var(X_n) = \tau^2 is infinite, k = b^2/\tau^2 = 0, and the probability of error is zero, as expected.

Consider the case of two data points, d_1, d_2. Before considering \hat{D}_3, look at the following two predictors, \hat{D}_3^{(j)}, j = 1, 2, given by

\hat{D}_3^{(j)} = 1 if P(D_3 = 1 | D_{3-j}) \geq 1/2,
\hat{D}_3^{(j)} = 0 if P(D_3 = 1 | D_{3-j}) < 1/2.        (4.5)

\hat{D}_3^{(1)} and \hat{D}_3^{(2)} are predictors based on only part of the available data, one and two time points back, respectively. From the case of prediction based on one data point, we know that for j = 1, 2,

P.E.(\hat{D}_3^{(j)}) = 1/2 - \arcsin[|\rho(j)|/(1+k)] / \pi.        (4.6)

Now consider \hat{D}_3, which uses both data points. Using equations (3.5) and (3.6), the conditional probabilities of D_3 given D_1 and D_2 are explicitly determined. Figure 1 divides the (\rho(1), \rho(2)) plane into four areas via the equiangular lines.

[Figure 1: the (\rho(1), \rho(2)) plane, with both axes running from -1 to 1, divided into four areas by the equiangular lines; the probability of error of \hat{D}_3, of the form 1/2 - \arcsin[|\rho(\cdot)|/(1+k)]/\pi, is indicated on each area.]

Table 1 shows \hat{D}_3 for these four areas, and Figure 1 also plots the probability of error of \hat{D}_3 as a function of (\rho(1), \rho(2)).

Table 1. The predictor \hat{D}_3 in the four areas of Figure 1.

Data                 Area 1   Area 2   Area 3   Area 4
D_1 = 1, D_2 = 1       1        0        0        1
D_1 = 0, D_2 = 1       0        0        1        1
D_1 = 1, D_2 = 0       1        1        0        0
D_1 = 0, D_2 = 0       0        1        1        0

Figure 1 suggests that if |\rho(1)| > |\rho(2)|, one should use only the preceding observation to predict; if |\rho(2)| > |\rho(1)|, use the penultimate observation. In each case, if the larger correlation is positive, predict concordance with the past value; if negative, predict discordance. This is summarized in the following lemma.

Lemma 4.2. For n = 2 data points, when the X_s process and F = \Phi_{0,b^2} are both Gaussian, the minimum probability of error predictor of D_3 is \hat{D}_3^{(\ell)}, where |\rho(\ell)| = \max\{|\rho(1)|, |\rho(2)|\} and \hat{D}_3^{(\ell)} is given by expression (4.5). The minimum probability of error is

1/2 - \arcsin[|\rho(\ell)|/(1+k)] / \pi.

Remark. This lemma states that for two data points, the minimum probability of error prediction is based on only one of the two points, the one having the higher correlation with D_3; knowledge of the other point does not help.

Proof. From Table 1 it can be seen that \hat{D}_3 equals \hat{D}_3^{(1)} in areas 2 and 4 and \hat{D}_3^{(2)} in areas 1 and 3. Therefore, in areas 2 and 4 the P.E.(\hat{D}_3) is 1/2 - \arcsin[|\rho(1)|/(1+k)]/\pi, and in areas 1 and 3 it is 1/2 - \arcsin[|\rho(2)|/(1+k)]/\pi. Q.E.D.

For an arbitrary n \geq 2, define \hat{D}_{n+1}^{(1,2)} as

\hat{D}_{n+1}^{(1,2)} = 1 if P(D_{n+1} = 1 | D_n = d_n, D_{n-1} = d_{n-1}) \geq 1/2,
\hat{D}_{n+1}^{(1,2)} = 0 if P(D_{n+1} = 1 | D_n = d_n, D_{n-1} = d_{n-1}) < 1/2

(i.e., the optimal predictor based on only the two preceding values), and let \tilde{D}_{n+1}^{(1,2)} be any decision rule based on only the two preceding values.

Corollary 4.3. If the X process and F = \Phi_{0,b^2} are Gaussian, then for an arbitrary n \geq 2,

P.E.(\tilde{D}_{n+1}^{(1,2)}) \geq P.E.(\hat{D}_{n+1}^{(1,2)}) = 1/2 - \arcsin[\max\{|\rho(1)|, |\rho(2)|\}/(1+k)] / \pi \geq P.E.(\hat{D}_{n+1}).

Proof. The P.E.(\tilde{D}_{n+1}^{(1,2)}) is given by expression (4.2). Since \tilde{D}_{n+1}^{(1,2)} does not depend on (d_1, d_2, ..., d_{n-2}) and the D process is strictly stationary, the P.E.(\tilde{D}_{n+1}^{(1,2)}) reduces to

\sum_{(d_{n-1}, d_n, d_{n+1})} P\{D_{n-1} = d_{n-1}, D_n = d_n, D_{n+1} = d_{n+1}, \tilde{D}_{n+1}^{(1,2)} \neq d_{n+1}\},

which is greater than or equal to P.E.(\hat{D}_{n+1}^{(1,2)}), which by Lemma 4.2 (and since \hat{D}_{n+1}^{(1,2)} does not depend on (d_1, ..., d_{n-2})) is

1/2 - \arcsin[\max\{|\rho(1)|, |\rho(2)|\}/(1+k)] / \pi,

which in turn is greater than or equal to P.E.(\hat{D}_{n+1}), by definition of \hat{D}_{n+1}. Q.E.D.

Remark. Therefore, for data d_1, d_2, ..., d_n, n \geq 2, we have a lower bound on how well predictors can do using only the present and penultimate observations, and an upper bound on the probability of error of the optimal predictor, which may use more than the two preceding observations.

For the remainder of this section we will consider the binary prediction problem for an arbitrary n \geq 2 and the most fundamental model, the first-order autoregressive. Therefore, the X process is Gaussian AR(1),

X_s = \rho X_{s-1} + e_s,    |\rho| < 1,    Var(X_s) = 1,

and F = \Phi_{0,k}. We have seen that the D process generated by X_s and F is not a Markov chain. An important question is: how much of the past must be used for prediction purposes?

Previously we considered the binary prediction problem for the arbitrary Gaussian case with n = 2 data points. The determination of the conditional distribution of D_3, given D_2 and D_1, allowed for explicit probability of error calculations. For n \geq 3, general closed-form expressions for the conditional distribution of D_{n+1}, given D_n, D_{n-1}, ..., D_1, are not available.

In Keenan (1980) six approximations, A(1) to A(6), to this distribution are developed in a different context, that of determining the loss of the Markov property due to discretization. Rather than attempting to numerically approximate the n- and (n+1)-dimensional integrals, the approach of the six approximations is to reduce the dimensions to 2 and 3, for which the closed-form expressions (3.5) and (3.6) are available. Approximations A(2) and A(3) are the conditional distributions of D_{n+1} given D_n and D_{n-1}, and given D_n, respectively.

Approximations A(4), A(5), and A(6) use the binary data D_n' = (D_n, D_{n-1}, ..., D_1) first to estimate the underlying unobserved data V_n' = (V_n, V_{n-1}, ..., V_1); then to estimate Z_n = S_n' V_n, where S_n' is given by expression (3.3); and finally to calculate the conditional distribution of D_{n+1} given the estimate of Z_n. In Keenan (1980), or using Kalman filtering methods (see Jazwinski (1970), Duncan and Horn (1972), Downing et al. (1980), etc.), recursive expressions for S_n' = (S_{n,1}, S_{n,2}, ..., S_{n,n}), Q_n = Var(Z_n), and R_n = k/(1+k-Q_n) are

S_{n,j} = \rho R_{n-1} S_{n-1,j-1},    2 \leq j \leq n,
S_{n,1} = \rho (1 - R_{n-1}),
Q_n = \rho^2 (1 - R_{n-1}) + \rho^2 R_{n-1} Q_{n-1}.        (4.7)

The factor (1 - R_{n-1}) is the usual Kalman gain. Approximations A(4), A(5), and A(6) estimate Z_n by

Z_n^{(4)} = \sum_{j=1}^{n} S_{n,n-j+1} E(V_j | D_j = d_j) = \Big[ \sum_{j=1}^{n} (-1)^{1-d_j} S_{n,n-j+1} \Big] \sqrt{2(1+k)/\pi},        (4.8)

Z_n^{(5)} = S_{n,1} E(V_n | D_n = d_n, D_{n-1} = d_{n-1}) + S_{n,2} E(V_{n-1} | D_n = d_n, D_{n-1} = d_{n-1}),        (4.9)

where these conditional expectations have closed-form expressions involving \arcsin[\rho/(1+k)], and

Z_n^{(6)} = S_{n,1} E(V_n | D_n = d_n) = (-1)^{1-d_n} S_{n,1} \sqrt{2(1+k)/\pi}.        (4.10)

Approximation A(1) is the conditional distribution of D_{n+1} given D_n and E_{n-1}, i.e., P(D_{n+1} = 1 | D_n = d_n, E_{n-1} = e_{n-1}), where E_{n-1} is the binary (0,1) function of (D_{n-1}, ..., D_1) defined by

E_{n-1} = 1 if Z_{n-1} \geq 0,
E_{n-1} = 0 if Z_{n-1} < 0,

and

e_{n-1} \in \{0,1\} if Range(Z_{n-1}) = R,
e_{n-1} = 1 if Range(Z_{n-1}) = [0, \infty),
e_{n-1} = 0 if Range(Z_{n-1}) = (-\infty, 0].

Justification for this approximation is described in Keenan (1980). The correlations of (D_{n+1}, D_n, E_{n-1}) are

Corr(D_{n+1}, D_n) = 2 \arcsin[\rho/(1+k)] / \pi,
Corr(D_n, E_{n-1}) = 2 \arcsin[sign(\rho) (Q_{n-1}/(1+k))^{1/2}] / \pi,
Corr(D_{n+1}, E_{n-1}) = 2 \arcsin[|\rho| (Q_{n-1}/(1+k))^{1/2}] / \pi.

Approximations A(1) and A(4) were shown to be better than the others; these were the only approximations which used all of the data (D_1 = d_1, D_2 = d_2, ..., D_n = d_n).
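A sketch of the recursion (4.7) and of the estimate Z_n^{(4)} in (4.8) follows. The initialization Q_0 = 0 and R_0 = k/(1+k) is an assumption made for the sketch (it reproduces S_{1,1} = \rho/(1+k) and Q_1 = \rho^2/(1+k)); the code also checks the recursion against the direct matrix expressions (3.3) and (3.4).

```python
import numpy as np

def forward_recursion(rho, k, n):
    """Recursion (4.7) for S_n = (S_{n,1}, ..., S_{n,n}) and Q_n in the Gaussian AR(1) case."""
    Q, R = 0.0, k / (1.0 + k)      # assumed starting values Q_0 and R_0
    S = []
    for _ in range(n):
        S = [rho * (1.0 - R)] + [rho * R * s for s in S]   # S_{m,1}, and S_{m,j} = rho R_{m-1} S_{m-1,j-1}
        Q = rho ** 2 * (1.0 - R) + rho ** 2 * R * Q        # Q_m from R_{m-1} and Q_{m-1}
        R = k / (1.0 + k - Q)                              # R_m, used at the next step
    return np.array(S), Q

def z4(d, rho, k):
    """Z_n^{(4)} of (4.8): each unobserved V_j is replaced by E(V_j | D_j = d_j)."""
    n = len(d)
    S, _ = forward_recursion(rho, k, n)                    # S[j-1] = S_{n,j}, the weight of V_{n-j+1}
    ev = np.sqrt(2.0 * (1.0 + k) / np.pi)                  # |E(V_j | D_j)|, since Var(V_j) = 1 + k
    signs = np.array([1.0 if d[n - j] == 1 else -1.0 for j in range(1, n + 1)])
    return ev * float(np.sum(S * signs))

# Check against the direct regression coefficients (3.3) and variance (3.4), with tau^2 = 1.
rho, k, n = 0.6, 0.5, 8
S_rec, Q_rec = forward_recursion(rho, k, n)
M = np.array([[1.0 + k if i == j else rho ** abs(i - j) for j in range(n)] for i in range(n)])
rho_vec = np.array([rho ** (j + 1) for j in range(n)])
S_dir = np.linalg.solve(M, rho_vec)
print(np.allclose(S_rec, S_dir), np.isclose(Q_rec, rho_vec @ S_dir))
print(z4([1, 0, 0, 1, 1, 1, 0, 1], rho, k))
```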

Consider the predictors \hat{D}_{n+1}^{A(j)}, j = 1, 2, ..., 6, where the approximations A(1) to A(6) are substituted for P(D_{n+1} = 1 | D_n = d_n, ..., D_1 = d_1) in expression (4.3). The next theorem allows us to calculate the probability of error for these predictors.

Theorem 4.4. If the D process is generated by an X_s process which is Gaussian AR(1), |\rho| < 1, Var(X_s) = \tau^2, and a response function F = \Phi_{0,b^2} with k = b^2/\tau^2, then for n data points

P.E.(\hat{D}_{n+1}^{A(2)}) = P.E.(\hat{D}_{n+1}^{A(3)}) = P.E.(\hat{D}_{n+1}^{A(5)}) = P.E.(\hat{D}_{n+1}^{A(6)}) = 1/2 - \arcsin[|\rho|/(1+k)] / \pi
  = P.E.(\hat{D}_{n+1}^{A(1)})    if Q_{n-1} < 1/(1+k),        (4.11)
  = P.E.(\hat{D}_{n+1}^{A(4)})    if \rho r < 1/2.        (4.12)

Proof. Without loss of generality we can assume that \rho is positive; the same conclusions hold for \rho negative. By strict stationarity, P.E.(\hat{D}_{n+1}^{A(2)}) = P.E.(\hat{D}_{n+1}^{(1,2)}) and P.E.(\hat{D}_{n+1}^{A(3)}) = P.E.(\hat{D}_3^{(1)}); Corollary 4.3 now gives the result for A(2) and A(3). Using expression (4.7) and comparing expressions (4.9) and (4.10), the result follows for A(5) and A(6). For approximations A(4), A(5), and A(6), Z_n^{(j)} \geq 0 is equivalent to A(j) \geq 1/2, j = 4, 5, 6. For A(4) the most extreme case is the one in which D_n is one (zero) and the entire infinite past (D_{n-1}, D_{n-2}, ..., D_1, D_0, D_{-1}, ...) is all zeroes (ones); Z_n^{(4)} is greater than or equal to zero in this case if \rho r < 1/2. If the data go back only finitely far and \rho r < 1/2, then Z_n^{(4)} \geq 0 is certainly satisfied. Approximation A(1) is the same as A(3) except when the past differs from the present (d_1 = d_2 = ... = d_{n-1} \neq d_n); but A(1) and A(3) will both be greater than or equal to 1/2 at these extreme points if Q_{n-1} < 1/(1+k). Q.E.D.

The conditions Q_{n-1} < 1/(1+k) and \rho r < 1/2 measure the correlation of D_n with D_{n+1} relative to that of (D_{n-1}, D_{n-2}, ..., D_1) with D_{n+1} and that of (D_{n-1}, D_{n-2}, ..., D_1, D_0, ...) with D_{n+1}, respectively. Table 2 shows the values of |\rho| and k for which the conditions of Theorem 4.4 are satisfied.

Table 2. Conditions for expressions (4.11) and (4.12) to be satisfied, in terms of |\rho| and k.

  k       Q_m < 1/(1+k) (for all m)      \rho r < 1/2
  0       |\rho| < 1.00                  |\rho| < 1.00
  0.2     |\rho| < 0.95                  |\rho| < 0.95
  0.5     |\rho| < 0.90                  |\rho| < 0.85
  1.0     |\rho| < 0.85                  |\rho| < 0.80
  2.0     |\rho| < 0.80                  |\rho| < 0.70

The minimum probability of error predictor involves P(D_{n+1} = 1 | D_n = d_n, ..., D_1 = d_1). Rather than numerically approximating the probabilities of the exact (n+1)- and n-dimensional random variables, the approximations A(1) to A(6) calculate exactly the probabilities of 3- and 2-dimensional approximations to the (n+1)- and n-dimensional random variables. In Keenan (1980) different approximations are shown to do well for different parameter values; for any given parameter region, either A(1) or A(4) appears to be a reasonable approximation to the probability P(D_{n+1} = 1 | D_n = d_n, ..., D_1 = d_1). Therefore, Corollary 4.3 and Theorem 4.4 suggest that if the underlying process is AR(1) (and therefore Markov) and D_n is a discretization of this process, then even though D_n is not a Markov chain (having in fact infinite memory), for prediction purposes we should act as if D_n were a Markov chain and predict from the present value, with 1/2 - \arcsin[|\rho|/(1+k)]/\pi as the approximate probability of error for one-step-ahead prediction.
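The entries of Table 2 can be checked numerically by iterating the recursion (4.7). In the sketch below, r is taken to be the limiting value of R_n; this reading of the condition \rho r < 1/2 is consistent with the tabulated thresholds up to the 0.05 rounding used in the table, but it is an interpretation assumed for the sketch rather than a statement from the paper.

```python
import numpy as np

def q_and_r(rho, k, iters=1000):
    """Iterate recursion (4.7); return the largest Q_m encountered and the limiting R_n."""
    Q, R = 0.0, k / (1.0 + k)       # same assumed initialization as in the earlier sketch
    q_max = 0.0
    for _ in range(iters):
        Q = rho ** 2 * (1.0 - R) + rho ** 2 * R * Q
        R = k / (1.0 + k - Q)
        q_max = max(q_max, Q)
    return q_max, R

def largest_rho(condition, k, grid=np.arange(0.01, 1.0, 0.01)):
    """Largest |rho| on a 0.01 grid for which the condition holds."""
    ok = [rho for rho in grid if condition(rho, k)]
    return max(ok) if ok else None

cond_411 = lambda rho, k: q_and_r(rho, k)[0] < 1.0 / (1.0 + k)    # Q_m < 1/(1+k) for all m
cond_412 = lambda rho, k: rho * q_and_r(rho, k)[1] < 0.5          # rho * r < 1/2

for k in [0.0, 0.2, 0.5, 1.0, 2.0]:
    print(k, largest_rho(cond_411, k), largest_rho(cond_412, k))
```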

5. Multinomial Time Series

The previous sections have dealt with binary data; both the general framework and the specific results under Gaussian assumptions were for the binary case. We can now consider a more general model for the generation of ordered categories (multinomial time series) by an underlying continuous-valued state space process and a response function F. Asymmetric Gaussian randomization, F = \Phi_{\mu,b^2} with \mu not equal to zero, is a special case. Each of the three examples of binary time series given in the Introduction can be generalized to the multinomial case; for example, one could consider more than two psychological states and more than the simple dichotomy of being employed or unemployed.

For m \geq 1, an (m+1)-valued multinomial process \{D_n\}_{n=-\infty}^{\infty} is assumed to be generated by an underlying continuous-valued, strictly stationary time series \{X_n\}_{n=-\infty}^{\infty} with E(X_n) = 0, a probability c.d.f. F: R \to [0,1], and a sequence -\infty = \mu_{-1} < \mu_0 < ... < \mu_{m-1} < \mu_m = +\infty, such that, given \{X_n\}_{n=-\infty}^{\infty}, the D_n are independent with

D_n = j with probability F(X_n - \mu_{j-1}) - F(X_n - \mu_j),    j = 0, 1, ..., m.

The joint probability of (D_1 = d_1, ..., D_n = d_n), where (d_1, ..., d_n) \in \{0, 1, ..., m\}^n, is

\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \prod_{j=1}^{n} [F(x_j - \mu_{d_j - 1}) - F(x_j - \mu_{d_j})] \, dG_n(x_1, ..., x_n).        (5.1)

As in the binary case, the process can be viewed as a truncation of a new process \{V_n\}_{n=-\infty}^{\infty}, except that now the truncations occur at the points (\mu_0, \mu_1, ..., \mu_{m-1}), and a characterization lemma analogous to Lemma 3.2 can be proven.
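A sketch of this multinomial generating mechanism follows: given X_n, category j is drawn with probability F(X_n - \mu_{j-1}) - F(X_n - \mu_j). The Gaussian AR(1) underlying process, the two cutpoints, and the response standard deviation are illustrative assumptions, giving a three-category series.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

def gaussian_cdf(v, sd):
    return 0.5 * (1.0 + math.erf(v / (sd * math.sqrt(2.0))))

def category_probs(x, cutpoints, sd):
    """P(D_n = j | X_n = x) = F(x - mu_{j-1}) - F(x - mu_j), with mu_{-1} = -inf and mu_m = +inf."""
    mus = [-math.inf] + list(cutpoints) + [math.inf]

    def F(v):
        if v == math.inf:
            return 1.0
        if v == -math.inf:
            return 0.0
        return gaussian_cdf(v, sd)

    return [F(x - mus[j]) - F(x - mus[j + 1]) for j in range(len(cutpoints) + 1)]

def simulate_multinomial(n, rho=0.7, cutpoints=(-0.5, 0.5), sd=0.5):
    """Generate an (m+1)-valued series from a Gaussian AR(1) underlying process, as in Section 5."""
    innov_sd = math.sqrt(1.0 - rho ** 2)
    x, d = rng.normal(), np.empty(n, dtype=int)
    for t in range(n):
        if t > 0:
            x = rho * x + rng.normal(0.0, innov_sd)
        d[t] = rng.choice(len(cutpoints) + 1, p=category_probs(x, cutpoints, sd))
    return d

print(simulate_multinomial(30))
```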

The approximations A(1) to A(6) can be extended to the multinomial setting, although they will be given by integral representations rather than closed-form expressions. However, because the integrals are at most 3-dimensional, they can be evaluated quite efficiently.

6. Some Comments on Estimation

In the previous sections probability calculations have involved the parameters of the underlying process and the response function; ordinarily, these parameters are unknown and, therefore, need to be estimated. Prior to estimation, the model to be estimated must be determined. First, the appropriate underlying process and response function are identified; second, the parameters involved are estimated. A third step is diagnostic checking; this includes goodness-of-fit tests, residual analysis, and plots to see whether the data show any gross departures from the fitted model. For continuous-valued time series data, Box and Jenkins (1970) have developed procedures for these three steps of analysis. This section is a brief introduction to the model identification and estimation steps for binary processes. The ideas developed here are introductory, and it is hoped that they will serve as a first step towards a more comprehensive theory.

This section deals with data from a binary process generated by an underlying Gaussian process and response function. Let \{D_n\}_{n=-\infty}^{\infty} be a binary process generated by \{X_n\}_{n=-\infty}^{\infty}, a Gaussian process with E(X_n) = 0, Var(X_n) = \tau^2, and correlation structure \{\rho(j)\}_{j=1}^{\infty}, and F = \Phi_{0,b^2}, the c.d.f. of N(0, b^2). The autocorrelation function of \{D_n\}_{n=-\infty}^{\infty} is given by equation (3.7):

\delta(s) = 2 \arcsin[\rho(s)/(1+k)] / \pi.

Given data D_1 = d_1, D_2 = d_2, ..., D_n = d_n, if \delta(s) can be estimated from the sample, say by \hat{\delta}_n(s), then \rho(s)/(1+k) can be estimated (method of moments) by inverting equation (3.7) to obtain

\widehat{[\rho(s)/(1+k)]} = \sin[(\pi/2) \hat{\delta}_n(s)].

As far as probabilistic statements concerning the binary process \{D_n\}_{n=-\infty}^{\infty} are concerned, \{\rho(s)/(1+k)\}_{s=1}^{\infty}, rather than \{\rho(s)\}_{s=1}^{\infty}, are the appropriate parameters. One estimate of \delta(s) is the sample autocorrelation of lag s,

\hat{\delta}_n(s) = \frac{1}{n-s} \sum_{i=1}^{n-s} (-1)^{d_i + d_{i+s}}.

For s = 1, \hat{\delta}_n(1) is determined by the number of changes of sign (i.e., directional changes) in (d_1, d_2, ..., d_n), relative to the number possible, n-1. Kedem (1980b) and Lomnicki and Zaremba (1955) develop some asymptotic results for this and related estimators.

These estimates do not involve assumptions about the form of the underlying autocorrelation structure \{\rho(s)\}_{s=1}^{\infty}; that is, \{X_n\}_{n=-\infty}^{\infty} need not have a specified ARMA(p,q) form. However, the model assumes that the response function is symmetric, that is, F = \Phi_{0,b^2} and \mu_0 = 0. Without constraints on the correlation structure, k = b^2/\tau^2, the ratio of the response function variance to that of the underlying process, cannot be estimated. If we assume a specific ARMA model for \{X_n\}_{n=-\infty}^{\infty}, then, as a result of the constraints that knowing p and q impose on the autocorrelations, k can be estimated.
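A sketch of this method-of-moments step follows: it computes the sample autocorrelation \hat{\delta}_n(s) and inverts (3.7) to estimate \rho(s)/(1+k). The data are simulated from the model by the truncation device of Lemma 3.1; the AR(1) structure and the values \rho = 0.7 and k = 0.5 are illustrative assumptions.

```python
import math
import numpy as np

rng = np.random.default_rng(4)

def sample_delta(d, s):
    """delta_hat_n(s) = (1/(n-s)) * sum_i (-1)^(d_i + d_{i+s})."""
    d = np.asarray(d)
    return float(np.mean((-1.0) ** (d[:-s] + d[s:])))

def estimate_rho_over_1pk(d, s):
    """Method-of-moments estimate of rho(s)/(1+k), inverting delta(s) = 2 arcsin(rho(s)/(1+k)) / pi."""
    return math.sin(0.5 * math.pi * sample_delta(d, s))

# Simulate a binary series: Gaussian AR(1) underlying process with rho = 0.7 and tau^2 = 1,
# response F = Phi_{0,b^2} with k = b^2 = 0.5, realized as truncation of V_n = X_n - Y_n at zero.
rho, k, n = 0.7, 0.5, 20_000
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = rho * x[t - 1] + rng.normal(0.0, math.sqrt(1.0 - rho ** 2))
v = x - rng.normal(0.0, math.sqrt(k), size=n)
d = (v >= 0.0).astype(int)

# Compare the estimate with the true value rho^s / (1+k) at a few lags.
for s in (1, 2, 3):
    print(s, round(estimate_rho_over_1pk(d, s), 3), round(rho ** s / (1.0 + k), 3))
```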

The variance of \hat{\delta}_n(s) is

Var(\hat{\delta}_n(s)) = \Big(\frac{1}{n-s}\Big)^2 \sum_{i=1}^{n-s} \sum_{j=1}^{n-s} \Big\{ E\big[(-1)^{D_i + D_{i+s} + D_j + D_{j+s}}\big] - [\delta(s)]^2 \Big\},

and it can be evaluated if 4-dimensional probabilities can be determined. Plackett (1954), McFadden (1960), and Gehrlein (1980), among others, discuss numerical approximations to these probabilities. However, the appropriate approximation depends on the magnitudes of the correlations. Lomnicki and Zaremba (1955) give a bound for these 4-dimensional probabilities.

7. Concluding Remarks and Future Directions

Data which are binary, recorded sequentially in time, and correlated are common. In Section 3 a family of models for such data which allows for arbitrary correlation structure was proposed. The memory of such models dies out in a continuous manner. It is assumed that an underlying process with continuous-valued state space generates the discrete process through a response function F. An important question is: how much of the structure of the underlying process passes through to the discrete process? The structure with which we are concerned here is that of being Markovian. If the states of a Markov chain are lumped together to create a smaller number of new states, the question becomes whether or not the new chain is Markovian (i.e., has the property of lumpability) (Kemeny and Snell (1960), pp. 123-140, and Burke and Rosenblatt (1958)). Therefore, the above question can be viewed as a stochastic version of the lumping of a Markov process into two states, zero and one.

A future direction is the measurement of the loss of the Markov property as a result of discretization. In Section 3 it was shown that the discretization could be viewed as the addition of i.i.d. noise followed by truncation; each of these distorts the Markov property. In a forthcoming paper a measure of the loss of the Markov property due to discretization is proposed. In that paper the loss is decomposed into a loss due to the noise and a loss due to the truncation. The relationship of the above models to other areas (functions of Markov chains, source coding, and quantization in signal processing) will be explored.

In Section 4 one aspect of the above question, the ease of forecasting from the discrete process when the underlying process is Markovian, was considered. It was shown that although the discrete process is not Markovian, for prediction purposes it is reasonable to act as if it were. Bounds on and approximations to the probabilities of error for the optimal predictor were derived.

REFERENCES

Abrahamson, I. G. (1964). Orthant Probabilities for the Quadrivariate Normal Distribution. Ann. Math. Statist. 35, pp. 1685-1703.

Billingsley, P. (1961). Statistical Inference for Markov Processes. University of Chicago Press, Chicago.

Box, G. E. P., and Jenkins, G. M. (1970). Time Series Analysis, Forecasting, and Control. Holden-Day, San Francisco.

Burke, C. J., and Rosenblatt, M. (1958). A Markovian Function of a Markov Chain. Ann. Math. Statist. 29, pp. 1112-1122.

Bush, R. R., and Mosteller, F. (1951). A Mathematical Model for Simple Learning. In Luce, R. D.; Bush, R. R.; and Galanter, E. (Eds.), Readings in Mathematical Psychology (1963). John Wiley and Sons, Inc., New York. Vol. 1, pp. 289-299.

Cheng, M. C. (1969). The Orthant Probabilities of Four Gaussian Variates. Ann. Math. Statist. 40, pp. 152-161.

Choi, J. R. (1975). An Equality Involving Orthant Probabilities. Comm. Statist. 4, No. 12, pp. 1167-1175.

Das Gupta, Eaton, Olkin, Perlman, Savage, and Sobel (1972). Inequalities on the Probability Content of Convex Regions for Elliptically Contoured Distributions. Proc. Sixth Berkeley Symp. Math. Statist. Prob., University of California Press, Berkeley and Los Angeles, Vol. 2, pp. 241-265.

David, F. N. (1953). A Note on the Evaluation of the Multivariate Normal Integral. Biometrika 40, pp. 458-459.

Downing, D. J.; Pike, D. H.; and Morrison, G. W. (1980). Application of the Kalman Filter to Inventory Control. Technometrics 22, No. 1, pp. 17-22.

Duncan, D. B., and Horn, S. D. (1972). Linear Dynamic Recursive Estimation from the Viewpoint of Regression Analysis. J. Amer. Statist. Assoc. 67, pp. 815-821.

Feller, W. (1971). An Introduction to Probability Theory and Its Applications, Vol. II, Second Edition. John Wiley & Sons, Inc., New York.

Gehrlein, W. V. (1979). A Representation for Quadrivariate Normal Positive Orthant Probabilities. Comm. Statist. B 8, pp. 349-358.

Granger, C. W. J., and Morris, M. (1976). Time Series Modelling and Interpretation. J. R. Statist. Soc. A 38, pp. 246-257.

Gupta, S. S. (1963). Probability Integrals of Multivariate Normal and Multivariate t. Ann. Math. Statist. 34, pp. 792-828.

Heckman, J. J. (1979). Statistical Models for Discrete Panel Data. Report 7902, Center for Mathematical Studies in Business and Economics, University of Chicago, January 1979.

Jacobs, P. A., and Lewis, P. A. W. (1978a). Discrete Time Series Generated by Mixtures. I: Correlation and Runs Properties. J. R. Statist. Soc. B 40, No. 1, pp. 94-105.

Jacobs, P. A., and Lewis, P. A. W. (1978b). Discrete Time Series Generated by Mixtures. II: Asymptotic Properties. J. R. Statist. Soc. B 40, No. 2, pp. 222-228.

Jazwinski, A. H. (1970). Stochastic Processes and Filtering Theory. Academic Press, New York.

Kedem, B. (1980a). Binary Time Series. Marcel Dekker, Inc., New York.

Kedem, B. (1980b). Estimation of the Parameters in Stationary Autoregressive Process After Hard Limiting. J. Amer. Statist. Assoc. 75, pp. 146-153.

Keenan, D. M. (1980). Time Series Analysis of Binary Data. Technical Report No. 130, Department of Statistics, University of Chicago.

Kemeny, J. G., and Snell, J. L. (1960). Finite Markov Chains. D. Van Nostrand Company, Inc., Princeton, New Jersey.

Lamperti, J., and Suppes, P. (1959). Chains of Infinite Order and Their Application to Learning Theory. Pacific J. Math. 9, pp. 739-754.

Larson, R. (1979). The Significance of Solitude in Adolescents' Lives. Ph.D. Dissertation, Committee on Human Development, University of Chicago.

Lomnicki, Z. A., and Zaremba, S. K. (1955). Some Applications of Zero-One Processes. J. R. Statist. Soc. B 17, pp. 243-255.

McFadden, J. A. (1960). Two Expansions for the Quadrivariate Normal Integral. Biometrika 47, 3 and 4, pp. 325-333.

Moran, P. A. P. (1956). The Numerical Evaluation of a Class of Integrals. Proc. Cambridge Phil. Soc. 52, pp. 230-233.

Plackett, R. L. (1954). A Reduction Formula for Normal Multivariate Integrals. Biometrika 41, pp. 351-360.

Schlafli, L. (1858). On the Multiple Integral \int^n dx dy \cdots dz whose Limits are p_1 = a_1 x + b_1 y + ... + h_1 z > 0, p_2 > 0, ..., p_n > 0, and x^2 + y^2 + ... + z^2 < 1. Quart. J. Math. 2, pp. 269-301; 3, pp. 54-68 and 97-108.

Slepian, D. (1962). The One-Sided Barrier Problem for Gaussian Noise. Bell System Tech. J. 41, pp. 463-501.

Sondhi, M. M. (1961). A Note on the Quadrivariate Normal Integral. Biometrika 48, pp. 201-203.