THE UNIVERSITY OF MICHIGAN
COLLEGE OF ENGINEERING
Department of Aerospace Engineering

Technical Report

AN EMPIRICAL BAYES TECHNIQUE IN COMMUNICATION THEORY

Stuart C. Schwartz

ORA Project 02905

supported by:
NATIONAL AERONAUTICS AND SPACE ADMINISTRATION
GRANT NO. NsG-2-59
WASHINGTON, D.C.

administered through:
OFFICE OF RESEARCH ADMINISTRATION, ANN ARBOR

October 1966

This report was also a dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The University of Michigan, 1966.

TABLE OF CONTENTS

                                                                        Page
LIST OF ILLUSTRATIONS                                                      v
LIST OF APPENDICES                                                        vi
LIST OF PRINCIPAL SYMBOLS                                                vii
LIST OF CONSTANTS                                                          x
ABSTRACT                                                                  xi

CHAPTER

1  INTRODUCTION                                                            1
   1.1  Introduction                                                       1
   1.2  The Empirical Bayes Procedure                                      7
        1.2a  Independent Observations                                    11
        1.2b  Dependent Observations                                      15
        1.2c  Extension of the Dependent Case to Multiple Hypotheses      23
        1.2d  Convergence of the Empirical Procedure for Unbounded
              Loss Functions                                              28
   1.3  Literature Survey and Scope of the Present Study                  31
        1.3a  Literature Survey                                           33
        1.3b  Scope of the Present Study                                  38

2  ESTIMATING THE DENSITY FUNCTION OF THE OBSERVATIONS-UNIVARIATE CASE    42
   2.1  Introduction                                                      42
   2.2  The Empirical Distribution Function                               44
   2.3  Estimate of the Density Function-Kernel Method                    64
        2.3a  Bias Calculation                                            65
        2.3b  Variance Calculation                                        68
        2.3c  Mean-Square Error                                           71
        2.3d  Mean Integrated Square Error                                74
   2.4  Estimating the Density Function by Series Methods-An Orthogonal
        Representation                                                    77
   2.5  Estimating the Density Function by Series Methods-An
        Eigenfunction Representation                                      92
   2.6  Special Forms of α(z)                                            106
   2.7  Summary and Generalizations                                      111

TABLE OF CONTENTS (Concluded)

CHAPTER                                                                 Page

3  ESTIMATING THE DENSITY FUNCTION OF THE OBSERVATIONS-k-VARIATE CASE    119
   3.1  Introduction                                                     119
   3.2  Dominating the 2k-Fold Integral                                  120
   3.3  Estimating the k-Variate Density Function                        133
        3.3a  The Empirical Distribution Function                        133
        3.3b  The Orthogonal Representation                              135
        3.3c  The Eigenfunction Representation                           139
        3.3d  The Gaussian Kernel                                        143

4  APPLICATIONS OF THE EMPIRICAL BAYES TECHNIQUE                         147
   4.1  Introduction-Communication Through a Random Channel              147
        4.1a  Communication Through an Unknown Random Channel-Supervised
              Learning                                                   149
        4.1b  The Detection of Noise in Gaussian Noise                   155
   4.2  Multiple (Simple) Hypotheses With Unknown A Priori Probabilities 158
        4.2a  Finite Number of Observations Per Decision                 158
        4.2b  Limiting Forms                                             162
   4.3  Transmission of Known Signals Through an Unknown Random
        Multiplicative Channel                                           172
        4.3a  Orthogonal Signals                                         176
        4.3b  Bipolar Signals                                            182
        4.3c  Arbitrary Signals                                          185
   4.4  Unbounded Loss Function for the Case of Bipolar Signals          189

5  SUMMARY                                                               194

APPENDICES                                                               196

LIST OF REFERENCES                                                       217

LIST OF ILLUSTRATIONS

Figure                                                                  Page
1   Probability of error vs. p₁                                            3
2   The L₂ series procedure for estimating Lⱼ(wⱼ)-the case of
    orthogonal signals                                                   178

LIST OF APPENDICES

APPENDIX                                                                Page
A   The Hermite Polynomials                                              196
B   The Evaluation of Some Integrals                                     203
C   The Gaussian Kernel                                                  209

LIST OF PRINCIPAL SYMBOLS

Symbol            Meaning                                                             Defined

A                 covariance matrix of dimension k                                    (3.2.13)
aⱼ                Fourier coefficients of f(x) in the L₂ expansion                    (2.4.2)
âⱼₙ               estimate of aⱼ at the n-th stage                                    (2.4.4)
α(z)              distribution function of the random variable Z
α_{m-ℓ}(z₁,z₂)    second-order (stationary) distribution of Z₁ and Z₂                 (2.2.28)
Bᵀ                cross-covariance matrix of dimension k                              (3.1.3)
dⱼ                coefficients associated with α(z) in the eigenfunction
                  representation                                                      (2.5.10)
D_{m-ℓ}, D_{τ,2}  averages involving the differences of the bivariate                 (2.2.10), (2.2.29),
(and related      distribution and the univariate distribution                        (2.2.35), (2.2.41)
averages)
f(x)              overall density function of the observation,
                  f(x) = ∫ g(x-z;σ) dα(z)                                             (2.1.4)
fᵢ(x)             density function of the observation under the i-th hypothesis
f_q(x)            series representation for f(x) truncated at q+1 terms               (2.4.24) or (2.5.25)
f̂ₙ(x)             estimate of f(x) after the n-th observation
Fₙ(x)             empirical distribution function                                     (2.2.3)

LIST OF PRINCIPAL SYMBOLS (Continued)

Symbol            Meaning                                                             Defined

g(x-y;σ)          gaussian density function with mean value y and standard
                  deviation σ. Designated also as g((x-y)/σ) when dealing with
                  the Hermite polynomials
g₂(n₁,n₂;σ,ρ)     bivariate gaussian density with correlation coefficient ρ and
                  both variables having the same standard deviation σ
Hⱼ(x)             Hermite polynomial associated with the weight g²(x)                 (A.11)
Heⱼ(x)            Hermite polynomial associated with the weight g(x)                  (A.3)
Jₙ or Jₙ′         mean integrated square error                                        (2.3.30) or (2.5.20)
K(y)              kernel function                                                     (2.3.4)
ℓ, m              used as subscripts to denote the ℓ-th and m-th interval,
                  observation, etc.
M                 covariance matrix of dimension 2k                                   (3.1.3)
M_(·)(v)          Fourier transform of the quantity appearing in the subscript
n                 used as a subscript to denote the present observation, decision
N                 covariance matrix of dimension 2k                                   (3.2.2)
Nᵢ                i-th gaussian noise sample
O                 big O notation, e.g., V(f̂ₙ(x)) = O(1/n) means V(f̂ₙ(x)) ≤ a/n,
                  where a is a positive constant
Pe                probability of error of the one-stage procedure

LIST OF PRINCIPAL SYMBOLS (Concluded)

Symbol            Meaning                                                             Defined

Pen               probability of error of the empirical procedure
pⱼ(x)             orthonormal Hermite functions                                       (2.4.3)
q(n)              positive integer depending on n
ρ_{m-ℓ}           correlation coefficient of the ℓ-th and m-th gaussian noise samples
ρ*                max_{τ≥1} |ρ_τ|
R(τ)              autocorrelation function of the gaussian process
s(x)              a particular exponential function                                   (2.5.6)
σ                 standard deviation of the noise samples, σ² = R(0)
σ₁                an arbitrary constant which, in section 2.5, is chosen greater
                  than σ
T(x)              test function
t(x)              decision function
τ                 an index to denote the difference τ = m - ℓ
V(·)              variance of the indicated quantity
Xᵢ                denotes the observation Xᵢ = Nᵢ + Zᵢ
ε²                ε² = (σ₁² - σ²)/(σ₁² + σ²)                                          (2.5.35)
^                 designates the estimate of the quantity appearing under it

LIST OF CONSTANTS

The following symbols are used for constants or bounds.

Symbol    Defined
B₁        below equation (2.2.20)
B₂        equation (2.2.24) or (2.2.25)
B₃        equation (2.2.44)
B₄        equation (2.3.10)
B₅        below equation (2.3.16)
B₆        below equation (2.4.19)
B₇        above equation (2.3.21)
B₈(k)     equation (3.2.41)
b₁        below equation (2.3.25)
b₂        below equation (2.3.25)
b₃        above equation (2.3.26)
c₁        above equation (A.30)
c₂        equation (2.4.7)
c₃        equation (2.4.10)
c₄        equation (2.5.16)
c₅        equation (2.5.17)
c₆        equation (2.5.21)
c₇        equation (2.5.33)
c₈(i)     equation (2.6.8)

ABSTRACT

This thesis is concerned with an empirical Bayes procedure and its application to communication theory. The communication problem is one in which a sequence of information bearing signals is either assumed to be a stationary random process or distorted by a stationary random process. In either case, the underlying probability structure is unknown. The message sequence is then added to correlated gaussian noise. The statistical inference problem is to extract information from each member of the observation sequence, i.e., make a decision as to the presence of a particular signal.

The empirical Bayes procedure utilizes all past observations to obtain consistent estimates of the unknown distributions or related quantities. These estimates are then used to form a sequence of test functions which is evaluated using only the present observation. It is shown that the sequence of test functions converges to the test function one would use if all distributions were known and if the observations were independent. For a minimum probability of error criterion, the resulting difference in error probabilities is dominated by a quantity proportional to the mean-square error in the estimate of the test function.

In particular, we consider the class of problems where the marginal density function of an observation is the convolution of a gaussian density function and an unknown distribution, f(x) = ∫ g(x-z;σ) dα(z). By suitably interpreting α(z), a variety of communication problems are included. Much of this study is concerned with obtaining consistent estimates of f(x) given the sequence of dependent, identically distributed random variables Xᵢ = Nᵢ + Zᵢ, i=1,...,n. Three techniques are presented: a kernel method which is similar to the procedure used for estimating a spectral density, an orthogonal expansion for f(x) in Hermite functions, and an eigenfunction representation obtained by solving an eigenfunction problem associated with the integral equation for f(x). For all three methods, we calculate bounds on the mean-square error in the estimate of f(x). A typical result is: if the autocorrelation function of the gaussian noise is absolutely integrable and eventually monotonically decreasing, and if the sequence Zᵢ is M-dependent, the rate of convergence of the estimates is the same as in the case of independent observations. The rate is O(1/n^(4/5)) for the kernel method. For the orthogonal expansion, with the r-th absolute moment of Z finite, the rate is O(1/n^((r-2)/r)). With the eigenfunction representation, we estimate a quantity related to f(x) and obtain the rate O(ln²n/n). The techniques are then extended to the case of estimating a k-variate density function f(x₁,...,x_k).

These results allow us to bound the rate of convergence of the risk incurred using the empirical procedure in a number of communication problems. The problems considered are: communication through an unknown, stationary, random channel when learning samples (channel sounding signals) are available, communication through an unknown random multiplicative channel, and the transmission of known signals with unknown a priori probabilities.

CHAPTER 1

INTRODUCTION

1.1 INTRODUCTION

This thesis is concerned with a class of hypothesis testing problems in which not all pertinent statistics are known, but where the observer is repeatedly faced with the same decision problem. The type of problem we want to discuss is one in which a sequence of information bearing signals is assumed to be, or to be distorted by, a stationary random process whose underlying probability structure is unknown. The message sequence is then added to correlated gaussian noise. The statistical inference problem is to extract information from each member of the observation sequence, i.e., make a decision as to the presence of a particular signal.

The empirical Bayes technique which we shall discuss involves the use of accumulated past observations to obtain consistent estimates of the unknown distributions or related quantities. These estimates are then used to form a sequence of test functions which converges to the test function one would use if all pertinent distributions were known and if the sequence of observations were independent.

These remarks are perhaps best clarified by a simple example. Suppose we have an observation X = N + Z, where N is a gaussian random variable with mean zero and standard deviation equal to one. Z is assumed to be a random variable which takes on the values 0 and 1 with probability p₀ and p₁ = 1 - p₀, respectively. We take Z independent of N.

Designate the distribution of Z by α(z) and let the gaussian density with a standard deviation equal to 1 be denoted by g(x;1). The density function of the observation X is then written

f(x) = ∫ g(x-z;1) dα(z)                                          (1.1.1)
     = p₀g(x;1) + p₁g(x-1;1).

We want to test whether Z=0 or 1 with a minimum probability of error criterion. The optimum test procedure is known to be a likelihood ratio test with a threshold of one. Using the logarithm of the likelihood ratio, an equivalent procedure is to evaluate the function

T(x) = x - 1/2 + ln(p₁/(1-p₁))                                   (1.1.2)
     = x - c,

and compare it to a zero threshold. The test procedure which minimizes the probability of error is to choose H₁ (Z=1) if T(x) ≥ 0 and H₀ if T(x) < 0. Let G(x) denote the cumulative gaussian distribution function. Then, the probability of an incorrect decision is given by

Pe = p₁G(c-1) + (1-p₁)(1-G(c)).                                  (1.1.3)

Pe as a function of p₁ is called the Bayes envelope function and is the minimum probability of error attainable. A plot of this function is given in Figure 1.
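The Bayes envelope (1.1.3) is easy to evaluate numerically. The following sketch (added here for illustration; it is not part of the original report) computes Pe for several values of p₁, assuming the unit-variance gaussian densities of this example; scipy's normal distribution plays the role of G(x).

    # Added sketch: the Bayes envelope of equation (1.1.3),
    # Pe(p1) = p1*G(c-1) + (1-p1)*(1-G(c)), with threshold c = 1/2 + ln((1-p1)/p1).
    from math import log
    from scipy.stats import norm

    def bayes_error(p1):
        """Minimum probability of error when p1 = Pr(Z = 1) is known exactly."""
        c = 0.5 + log((1.0 - p1) / p1)      # zero of T(x) = x - 1/2 + ln(p1/(1-p1))
        return p1 * norm.cdf(c - 1.0) + (1.0 - p1) * (1.0 - norm.cdf(c))

    for p1 in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(f"p1 = {p1:.1f}   Pe = {bayes_error(p1):.4f}")

For p₁ = 0.5 the threshold is c = 0.5 and the routine returns Pe = G(-0.5), roughly 0.31, which is the peak of the Bayes envelope in Figure 1.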

Suppose p₁ is unknown and that we have a "good" estimate which we denote by p̂₁. Then, we might use the test function

T̂(x) = x - 1/2 + ln(p̂₁/(1-p̂₁))                                   (1.1.4)
     = x - ĉ

and compare this quantity to a zero threshold. The reason for this is that if p̂₁ is close to p₁, T̂(x) ought to be close to the Bayes test function T(x). This is in fact the case, as can be seen by calculating the probability of error as a result of using T̂(x). Defining this probability of error as P(p̂₁,p₁), a straightforward calculation yields

P(p̂₁,p₁) = p₁G(ĉ-1) + (1-p₁)(1-G(ĉ)).                            (1.1.5)

A plot of P(p̂₁,p₁) versus p₁ for different values of p̂₁ is also given in Figure 1.

Figure 1. Probability of error (Pe or Pen) vs. p₁: the Bayes envelope function and the curves P(p̂₁,p₁) for p̂₁ = 0, .3, .5, and .7.

Now assume we are repeatedly faced with the same decision problem; we observe the sequence of stationary random variables X_ℓ, ℓ=1,2,...,n, and for each observation X_ℓ we are to decide whether H₀ or H₁ is the true state of nature. Prior to making a decision based on the observation Xₙ, we would use Xₙ to update the estimate of p₁. For this example, a convenient estimate of p₁ is

p̂₁ₙ = (1/n) Σ_{ℓ=1}^{n} X_ℓ.                                      (1.1.6)

Since p̂₁ₙ is a function of the observations, P(p̂₁ₙ,p₁) is a random variable. We define the average value of P(p̂₁ₙ,p₁) by

Pen = E(P(p̂₁ₙ,p₁)).                                               (1.1.7)

For the above procedure to be useful, we should have

lim_{n→∞} Pen = Pe.                                               (1.1.8)

That is, on the average, as the number of observations (and decisions) increases, the probability of error Pen should approach Pe. An estimate of how close Pen is to Pe as a function of n would also be of value. In the next section we will show that for (8) to hold we need p̂₁ₙ converging in probability to p₁. We also show that if p̂₁ₙ converges in mean-square to p₁, a knowledge of the mean-square error provides a bound on the difference Pen - Pe.¹

¹We will refer to equations within the section by the last index. In referring to an equation in another section we will use all three indices.
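As an illustration of the procedure just described, the following simulation sketch (added here, not from the report) updates the estimate p̂₁ₙ of (1.1.6) before each decision and applies the plug-in threshold of (1.1.4). Independent noise samples are assumed, and the clipping of p̂₁ₙ away from 0 and 1 is a numerical safeguard not discussed in the text.

    # Added simulation sketch of the empirical procedure of (1.1.4)-(1.1.6).
    import numpy as np

    rng = np.random.default_rng(0)
    p1_true, n_decisions = 0.3, 20000
    errors, running_sum = 0, 0.0

    for n in range(1, n_decisions + 1):
        z = rng.random() < p1_true                # true state of nature (Z = 0 or 1)
        x = rng.standard_normal() + z             # observation X = N + Z
        running_sum += x                          # update p1_hat before deciding
        p1_hat = np.clip(running_sum / n, 1e-3, 1 - 1e-3)
        c_hat = 0.5 + np.log((1.0 - p1_hat) / p1_hat)
        decide_one = x >= c_hat                   # T_hat(x) = x - c_hat >= 0
        errors += (decide_one != z)

    print("empirical error rate:", errors / n_decisions)

For a long run the printed error rate settles near the Bayes envelope value Pe(p₁) of (1.1.3), which is the behavior asserted in (1.1.8).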

Let us calculate the mean-square error for the estimate p̂₁ₙ. Using (1), we see that the estimate is unbiased,

E(p̂₁ₙ) = p₁.                                                      (1.1.9)

Assume the sequence of observations {X_ℓ} is independent. Then, the mean-square error is

E(p̂₁ₙ - p₁)² = V(p̂₁ₙ) = (1 + p₁(1-p₁))/n.                         (1.1.10)

The estimate p̂₁ₙ converges in mean-square at the rate O(1/n).¹ If, on the other hand, we assume that the gaussian noise samples are correlated, E(N_ℓNₘ) = ρ_{m-ℓ}, we have

V(p̂₁ₙ) = (1 + p₁(1-p₁))/n + (2/n²) Σ_{ℓ=1}^{n} Σ_{m=ℓ+1}^{n} ρ_{m-ℓ}.        (1.1.11)

From the stationarity assumption, we can write

V(p̂₁ₙ) = (1 + p₁(1-p₁))/n + (2/n) Σ_{τ=1}^{n-1} (1 - τ/n) ρ_τ.               (1.1.12)

Assuming the correlation coefficients ρ_τ are absolutely summable,

Σ_{τ=1}^{∞} |ρ_τ| ≤ B₂ < ∞,                                        (1.1.13)

we have

¹We use the standard big O notation to write V(p̂₁ₙ) = O(1/n), which is taken to mean that there is a constant a for which V(p̂₁ₙ) ≤ a/n.

V(p̂₁ₙ) ≤ (1 + p₁(1-p₁))/n + (2/n) Σ_{τ=1}^{∞} |ρ_τ|
       ≤ (1 + p₁(1-p₁) + 2B₂)/n.                                   (1.1.14)

Hence, if (13) holds, p̂₁ₙ converges in mean-square also at the rate O(1/n).

Now, for the case of independent samples, T(x) as given by (2) is the optimum test function; the test procedure using (2) results in the minimum probability of error, Pe. Then, using p̂₁ₙ in (4) we have

T̂ₙ(x) = x - 1/2 + ln(p̂₁ₙ/(1-p̂₁ₙ)).                                 (1.1.15)

If, as a result of using T̂ₙ(x), Pen converges to Pe, we say the sequence of test functions is asymptotically optimal. This definition was introduced by Robbins [30].¹

When the sequence of observations is dependent we will still use the test function given by (15). With V(p̂₁ₙ) = O(1/n), we would expect that Pen converges to Pe at the same rate as for the case of independent observations. Now Pe is no longer the minimum probability of error attainable, since we do not base the present decision on all past observations. We do, however, use all past observations to form an estimate of T(x). For this case, we shall call T(x), as given by (2), the "optimum one-stage" test function. We will be concerned with the convergence of T̂ₙ(x) to this one-stage test function. Clearly, the one-stage

¹Numbers in square brackets refer to references listed at the end of the report.

test can be modified by basing a decision on a specified number of observations. Then, at the expense of increasing the number of hypotheses to be tested, the probability of error would tend to decrease. In general, this procedure is still suboptimum. We do not discuss it further.

We have called the learning and test procedure an empirical Bayes procedure because the decision problem is one of making an inference concerning the presence of a random variable (or process) Z which is distributed according to some a priori distribution α(z), and which we will take as unknown. With dependent observations, which is the case we will study, the procedure is neither an optimum (Bayes) nor an asymptotically optimum procedure.

We now generalize the above problem and establish bounds on the convergence of Pen to Pe.

1.2 THE EMPIRICAL BAYES PROCEDURE¹

We let the parameter λ represent the hypothesis in effect when the random variable X is observed. λ takes on the values "0" and "1" with probability p₀ and p₁ = 1-p₀, respectively. The observed random variable X is governed by the density function fᵢ(x) when λ=i, i=0,1. The density function fᵢ(x) will, in general, be the convolution of a gaussian density and some distribution function. The marginal, or overall density function of the observation is

f(x) = p₀f₀(x) + p₁f₁(x).                                          (1.2.1)

¹In this formulation, and in the proof of asymptotic optimality for independent observations, we follow Robbins [30].

The choice of deciding between the two hypotheses is made by a decision function t(x); t(x) is defined on the space of observations and takes the values 0 or 1, according to which hypothesis we believe is active. A loss function L(t(x),λ) is also defined as the loss incurred when we make the decision t(x) and λ is the true parameter. We take the loss of a correct decision to be zero, L(0,0) = L(1,1) = 0, and define

L(1,0) = a₀ > 0,   L(0,1) = a₁ > 0.¹                               (1.2.2)

Letting b(λ) = L(0,λ) - L(1,λ), we can write the loss incurred by using t(x) as

L(t(x),λ) = L(0,λ) - t(x)b(λ).                                     (1.2.3)

For any decision t(x), the expected loss as a function of the parameter λ is

R(t,λ) = ∫ L(t(x),λ) f_λ(x) dx.                                    (1.2.4)

The expected or overall risk is defined as

R(t,π) = ∫_Λ R(t,λ) dπ(λ) = ∫_Λ ∫ L(t(x),λ) f_λ(x) dx dπ(λ),

¹We will carry a₀ and a₁ along in the development even though we are interested in a minimum probability of error criterion for which a₀ = a₁ = 1.

where π(λ) is the distribution for λ. We can write the expected risk in the form

R(t,π) = p₁a₁ - ∫ t(x)[p₁a₁f₁(x) - p₀a₀f₀(x)] dx.                  (1.2.5)

If we denote the test function T(x) by

T(x) = p₁a₁f₁(x) - p₀a₀f₀(x),                                      (1.2.6)

then (5) becomes

R(t,π) = p₁a₁ - ∫ t(x)T(x) dx,                                     (1.2.7)

and the procedure to minimize the overall risk is to choose

t_B(x) = 1  if T(x) ≥ 0                                            (1.2.8)
       = 0  if T(x) < 0.

The decision function defined in this manner is the Bayes decision function (with respect to the distribution π), and the Bayes (minimum) risk is

R(t_B,π) = p₁a₁ - ∫ T(x)⁺ dx,                                      (1.2.9)

where T(x)⁺ = T(x) if T(x) ≥ 0 and 0 if T(x) < 0.

Now suppose that the test function T(x) is unknown and that we are repeatedly faced with the same decision problem. (Both pᵢ and fᵢ(x) may be taken as unknown.) At the n-th decision, we have observed the sequence of stationary random variables X_ℓ, ℓ=1,2,...,n, and we want to

decide whether λₙ=0 or λₙ=1. The values of λᵢ (the states of nature), i=1,2,...,n-1, may or may not be known, i.e., we may not know whether our previous decisions were correct. We agree to use only the present observation Xₙ to make a decision as to the value of λₙ, but we use all past observations to determine a decision function tₙ which takes on the values 0 or 1. The decision function is now defined on the space of all past observations. We designate this decision function by tₙ(x₁,x₂,...,xₙ) or, for notational convenience, by tₙ(xₙ). Let E_{n,λₙ} denote the mathematical expectation with respect to all the random variables X₁,...,Xₙ,λₙ, and E_{n|λₙ} denote the conditional expectation given λₙ. With the loss given by (3), the expected loss at the n-th stage is

R(tₙ,λₙ) = E_{n|λₙ}(L(tₙ(Xₙ),λₙ))                                  (1.2.10)

and the overall loss is given by

R(tₙ,π) = ∫ R(tₙ,λ) dπ(λ)                                          (1.2.11)
        = E_{n,λₙ}(L(tₙ(Xₙ),λₙ)).

This can also be written as

R(tₙ,π) = p₁a₁ - E_{n,λₙ}(tₙ(Xₙ)b(λₙ)).                            (1.2.12)

Let E_{X,λ} denote the expectation with respect to the pair of random variables (X,λ). If lim_{n→∞} E_{n,λₙ}(tₙ(Xₙ)b(λₙ)) = E_{X,λ}(t_B(X)b(λ)), then, in view of (4)-(6), we have

lim_{n→∞} R(tₙ,π) = R(t_B,π).                                      (1.2.13)

A sequence of decision functions {tₙ(x₁,...,xₙ)} such that (13) is satisfied is said to be asymptotically optimal. This is the definition Robbins adopted for the case of independent observations. We shall now obtain a bound on the convergence of (13) and also investigate the case where the observations are dependent.

1.2a Independent Observations

Consider the second expression in (12):

E_{n,λₙ}(tₙ(Xₙ)b(λₙ)) = p₀E_{n|λₙ=0}(tₙ(Xₙ)b(0)) + p₁E_{n|λₙ=1}(tₙ(Xₙ)b(1)).

In view of the independence and stationarity assumption, and the definition of b(λ), from (1) we have

p₀E_{n|λₙ=0}(tₙ(Xₙ)b(0)) = -a₀p₀ ∫ f₀(xₙ)[∫...∫ tₙ(x₁,...,xₙ) f(x₁)...f(x_{n-1}) dx₁...dx_{n-1}] dxₙ

and

p₁E_{n|λₙ=1}(tₙ(Xₙ)b(1)) = a₁p₁ ∫ f₁(xₙ)[∫...∫ tₙ(x₁,...,xₙ) f(x₁)...f(x_{n-1}) dx₁...dx_{n-1}] dxₙ.

Using the definition for T(x), we can write

E_{n,λₙ}(tₙ(Xₙ)b(λₙ)) = ∫ T(xₙ)[∫...∫ tₙ(x₁,...,xₙ) f(x₁)...f(x_{n-1}) dx₁...dx_{n-1}] dxₙ.
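For the two-hypothesis test of this section, the Bayes rule (1.2.8) and the risk expression (1.2.7) can be checked numerically. The sketch below is an added illustration, not part of the author's development; it assumes unit-variance gaussian class densities at means 0 and 1 with a₀ = a₁ = 1, and shows how the risk of a plug-in rule built from an erroneous estimate p̂₁ exceeds the Bayes risk (1.2.9).

    # Added numerical sketch: Bayes risk (1.2.9) and the risk (1.2.7) of plug-in
    # decision rules.  The gaussian class densities are an assumption for this example.
    import numpy as np
    from scipy.stats import norm

    p1 = 0.3
    x = np.linspace(-8.0, 9.0, 20001)
    dx = x[1] - x[0]
    T = p1 * norm.pdf(x, loc=1.0) - (1.0 - p1) * norm.pdf(x, loc=0.0)   # T(x), eq. (1.2.6)

    def risk(decide_one):
        # R(t, pi) = p1*a1 - integral of t(x) T(x) dx, evaluated by a Riemann sum
        return p1 - np.sum(T[decide_one]) * dx

    print("Bayes risk          :", risk(T >= 0.0))
    for p1_hat in (0.30, 0.35, 0.45):              # increasingly poor estimates of p1
        T_hat = p1_hat * norm.pdf(x, loc=1.0) - (1.0 - p1_hat) * norm.pdf(x, loc=0.0)
        print(f"risk (p1_hat={p1_hat:.2f}):", risk(T_hat >= 0.0))

The excess of each printed risk over the Bayes risk shrinks as the estimated test function approaches T(x), which is the behavior quantified by the corollaries that follow.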

Define a sequence of test functions

Tₙ(xₙ) = Tₙ(x₁,x₂,...,x_{n-1},xₙ;xₙ).

Tₙ(xₙ) is a function of xₙ whose functional form depends on the variables (observations) x₁,...,x_{n-1}.¹ Suppose that for almost every fixed x and arbitrary ε,

lim_{n→∞} Pr{|Tₙ(X₁,...,X_{n-1},x;x) - T(x)| < ε} = 1,             (1.2.14)

i.e., Tₙ(xₙ) converges in probability to T(x) for almost every fixed x. Further, define the sequence of decision functions by

tₙ(x₁,...,xₙ) = 1  if Tₙ(x₁,...,xₙ;xₙ) ≥ 0                          (1.2.15)
              = 0  otherwise.

We then have

E_{n,λₙ}(tₙ(Xₙ)b(λₙ)) = ∫ T(x)[Pr{Tₙ(X₁,...,X_{n-1},x;x) ≥ 0}] dx,

and since ∫ |T(x)| dx < ∞, it follows from the dominated convergence theorem and (14) that

lim_{n→∞} E_{n,λₙ}(tₙ(Xₙ)b(λₙ)) = ∫ T(x) lim_{n→∞} Pr{Tₙ(X₁,...,X_{n-1},x;x) ≥ 0} dx
                                = ∫ T(x)⁺ dx.

¹In this section we drop the ^ notation on Tₙ(xₙ).

Theorem 1.2.1 (Robbins [30], p. 201): With Tₙ(xₙ) such that (14) is true and with tₙ(xₙ) defined by (15), the sequence of test functions is asymptotically optimal in the sense that

lim_{n→∞} R(tₙ,π) = R(t_B,π).

Perhaps the most convenient way to obtain a rate of convergence is to assume that the sequence of test functions converges in mean-square to T(x), uniformly in x. Suppose that for almost every x, the inequality

E_{n-1}(|Tₙ(X₁,...,X_{n-1},x;x) - T(x)|²) ≤ βₙ²                    (1.2.16)

is satisfied and that lim_{n→∞} βₙ = 0.¹ Then, from the Chebyshev inequality, we have convergence in probability with the bound

Pr{|Tₙ(X₁,...,X_{n-1},x;x) - T(x)| ≥ ε} ≤ βₙ²/ε²   for a.e. x.

Notice that by definition R(tₙ,π) ≥ R(t_B,π). From (9) and (12), the difference in risks is given by

0 ≤ R(tₙ,π) - R(t_B,π) = ∫ T(x)⁺ dx - ∫ T(x)[Pr{Tₙ(X₁,X₂,...,X_{n-1},x;x) ≥ 0}] dx.

Define T(x)⁻ = T(x) if T(x) < 0 and T(x)⁻ = 0 if T(x) ≥ 0. We then have

¹E_{n-1} denotes the expectation with respect to the first (n-1) random variables, X₁,...,X_{n-1}.

0 ≤ R(tₙ,π) - R(t_B,π) = ∫ T(x)⁺ dx - ∫ T(x)⁺[Pr{Tₙ(x) ≥ 0}] dx - ∫ T(x)⁻[Pr{Tₙ(x) ≥ 0}] dx      (1.2.17)
                       = ∫ T(x)⁺[Pr{Tₙ(x) < 0}] dx - ∫ T(x)⁻[Pr{Tₙ(x) ≥ 0}] dx.

Let A = {x: T(x) ≥ ε} for arbitrary positive ε and consider the first expression on the right side of (17):

∫ T(x)⁺[Pr{Tₙ(x) < 0}] dx = ∫_A T(x)⁺[Pr{Tₙ(x) < 0}] dx + ∫_{A^c} T(x)⁺[Pr{Tₙ(x) < 0}] dx.

For x contained in A, it follows from the bound below (16) that Pr{Tₙ < 0} = Pr{Tₙ - T < -T} ≤ Pr{Tₙ - T ≤ -ε} ≤ βₙ²/ε². Hence, the first integral in the above expression is bounded by (βₙ²/ε²) ∫_A T(x)⁺ dx. For the second integral, assuming a₁ ≥ a₀, we have

∫_{A^c} T(x)⁺[Pr{Tₙ(x) < 0}] dx ≤ ∫_{A^c} T(x)⁺ dx
                               ≤ a₁ ∫_{A^c ∩ {T>0}} [p₁f₁(x) + p₀f₀(x)] dx
                               = a₁ Pr{0 < T(X) < ε}
                               = a₁δ₁(ε).

Collecting results, we have

∫ T(x)⁺[Pr{Tₙ(x) < 0}] dx ≤ (βₙ²/ε²) ∫ T(x)⁺ dx + a₁δ₁(ε).

In a similar manner, the second integral on the right side of (17) is bounded by

-∫ T(x)⁻[Pr{Tₙ(x) ≥ 0}] dx ≤ (βₙ²/ε²) ∫ (-T(x)⁻) dx + a₁δ₂(ε),

where δ₂(ε) = Pr{-ε < T(X) < 0}. We now have

Corollary 1.2.1: Assume that the sequence of test functions Tₙ converges in mean-square as given by (16). Then, the risk at the n-th decision is bounded by

0 ≤ R(tₙ,π) - R(t_B,π) ≤ (βₙ²/ε²) ∫ |T(x)| dx + a₁δ(ε),

where δ(ε) = Pr{|T(X)| < ε}.

We shall now derive the same bound for the case of dependent observations and also give conditions on f₀(x) and f₁(x) so that δ(ε) can be made arbitrarily small.

1.2b Dependent Observations

Let f(x₁,...,xₙ) denote the (n-variate) density function of the random variables X_ℓ, ℓ=1,...,n; f(x₁,...,x_{n-1}|xₙ,λₙ) be the conditional density function of the first n-1 random variables given Xₙ and λₙ; and f(x₁,...,xₙ|λₙ) designate the density function of the variables X₁,...,Xₙ given λₙ. In analogy to the previous development, we have

R(tₙ,π) = E_{n,λₙ}(L(tₙ(Xₙ),λₙ)) = p₁a₁ - E_{n,λₙ}(tₙ(Xₙ)b(λₙ))    (1.2.18)

and

E_{n,λₙ}(tₙ(Xₙ)b(λₙ)) = p₁a₁E_{n|λₙ=1}(tₙ(Xₙ)) - p₀a₀E_{n|λₙ=0}(tₙ(Xₙ)).

We write the first expectation on the right side as

E_{n|λₙ=1}(tₙ(Xₙ)) = ∫ f₁(xₙ)[∫...∫ tₙ(x₁,...,xₙ) f(x₁,...,x_{n-1}|xₙ,λₙ=1) dx₁...dx_{n-1}] dxₙ.

The conditional density f(xₙ|λₙ=1) is written as f₁(xₙ). The second expectation, E_{n|λₙ=0}, is written with f₀(xₙ) in place of f₁(xₙ) and f(x₁,...,x_{n-1}|xₙ,λₙ=0) in place of f(x₁,...,x_{n-1}|xₙ,λₙ=1). Using these expressions in (18), the empirical risk becomes

R(tₙ,π) = p₁a₁ - ∫ p₁a₁f₁(xₙ)[∫...∫ tₙ(x₁,...,xₙ) f(x₁,...,x_{n-1}|xₙ,λₙ=1) dx₁...dx_{n-1}] dxₙ
        + ∫ p₀a₀f₀(xₙ)[∫...∫ tₙ(x₁,...,xₙ) f(x₁,...,x_{n-1}|xₙ,λₙ=0) dx₁...dx_{n-1}] dxₙ,

and upon using the definition of T(x), the risk is expressed as

R(tₙ,π) = p₁a₁ - ∫ T(xₙ)[∫...∫ tₙ(x₁,...,xₙ) f(x₁,...,x_{n-1}|xₙ,λₙ=1) dx₁...dx_{n-1}] dxₙ
        - ∫ p₀a₀f₀(xₙ)[∫...∫ tₙ(x₁,...,xₙ) {f(x₁,...,x_{n-1}|xₙ,λₙ=1) - f(x₁,...,x_{n-1}|xₙ,λₙ=0)} dx₁...dx_{n-1}] dxₙ.    (1.2.19)

To demonstrate convergence, in contrast to (14) we now need to require convergence in probability conditioned on the n-th pair of random variables Xₙ and λₙ,

lim_{n→∞} Pr{|Tₙ(X₁,...,X_{n-1},x;x) - T(x)| ≥ ε | Xₙ=x, λₙ=i} = 0,   i = 0,1.    (1.2.20)

Clearly, this is satisfied if we have (conditional) mean-square convergence for a.e. x,

lim_{n→∞} E_{n-1}(|Tₙ - T|² | Xₙ=x, λₙ=i)
   = lim_{n→∞} ∫...∫ (Tₙ(x₁,...,x_{n-1},x;x) - T(x))² f(x₁,...,x_{n-1}|xₙ=x, λₙ=i) dx₁...dx_{n-1} = 0,   i = 0,1.    (1.2.21)

This condition, however, is difficult to verify. Under the assumption that the marginal density functions pᵢfᵢ(x), i=0,1, do not equal zero for almost all x, (21) is implied by the inequality

Eₙ(Tₙ - T)² = ∫...∫ (Tₙ(x₁,...,x_{n-1},xₙ;xₙ) - T(xₙ))² f(x₁,...,xₙ) dx₁...dxₙ ≤ γₙ²,    (1.2.22)

where lim_{n→∞} γₙ = 0. This average is considerably easier to obtain. Then, it is easy to see that the empirical risk converges to the risk incurred by using the one-stage test:

lim_{n→∞} R(tₙ,π) = p₁a₁ - ∫ T(x) lim_{n→∞} Pr{Tₙ(X₁,...,X_{n-1},x;x) ≥ 0 | Xₙ=x, λₙ=1} dx
                  = p₁a₁ - ∫ T(x)⁺ dx = R(t_B,π).                  (1.2.23)

To get a rate of convergence, we proceed differently than in the independent case, the reason being that the bound in (22) does not imply a useful bound for (21) and (20). From (5), we have R(t_B,π) = p₁a₁ - E_{X,λ}(t_B(X)b(λ)), and since R(t_B,π) depends only on the present pair of random variables (Xₙ,λₙ), we can write the difference of (18) and R(t_B,π) as

0 ≤ R(tₙ,π) - R(t_B,π) = E_{n,λₙ}(b(λₙ)(t_B(Xₙ) - tₙ(X₁,...,Xₙ))).    (1.2.24)

Since b(λ) is a bounded function and assuming a₁ ≥ a₀, (24) is dominated by

0 ≤ R(tₙ,π) - R(t_B,π) ≤ a₁Eₙ|t_B(Xₙ) - tₙ(X₁,...,Xₙ)|.            (1.2.25)

The functions t_B and tₙ take on the values 0 or 1. Hence, the contributions to the expectation are the two cases where t_B ≠ tₙ. We have, from (8) and (15),

0 ≤ R(tₙ,π) - R(t_B,π) ≤ a₁Pr{Tₙ(X₁,X₂,...,Xₙ;Xₙ) ≥ 0, T(Xₙ) < 0} + a₁Pr{Tₙ(X₁,X₂,...,Xₙ;Xₙ) < 0, T(Xₙ) ≥ 0}.

Let ε > 0 be an arbitrary constant and consider the first expression on the right side:

Pr{Tₙ ≥ 0, T < 0} = Pr{Tₙ ≥ 0, T ≤ -ε} + Pr{Tₙ ≥ 0, -ε < T < 0}.

Assuming (22) holds, we have

Pr{Tₙ ≥ 0, T ≤ -ε} ≤ Pr{|Tₙ - T| ≥ ε} ≤ γₙ²/ε².

Letting δ₁(ε) = Pr{-ε < T(Xₙ) < 0}, it follows that Pr{Tₙ ≥ 0, -ε < T < 0} ≤ δ₁(ε). Hence, we have

Pr{Tₙ ≥ 0, T < 0} ≤ γₙ²/ε² + δ₁(ε).

In a similar manner, we can show that

Pr{Tₙ < 0, T ≥ 0} ≤ γₙ²/ε² + δ₂(ε),

where δ₂(ε) = Pr{0 ≤ T(Xₙ) < ε}. Then, setting δ(ε) = δ₁(ε) + δ₂(ε) = Pr{|T(Xₙ)| < ε}, the difference in risks is dominated by

0 ≤ R(tₙ,π) - R(t_B,π) ≤ a₁(2γₙ²/ε² + δ(ε)).                       (1.2.26)

To show that δ(ε) can be made arbitrarily small, it is sufficient to assume that the density functions f₀(x) and f₁(x) are linearly independent¹ and, in addition, that they are analytic functions of x. The linear independence assumption is not unreasonable since, if the density functions were linearly dependent, one could not distinguish between the two hypotheses. The analytic assumption is more than we need, but in the cases we are interested in this assumption will always be met; fᵢ(x) will be the convolution of a gaussian density (which is analytic) with some distribution function. These two assumptions imply that the roots of T(x) = 0 are isolated. For if T(x) = 0 in some interval then, since T(x) is analytic, T(x) is identically equal to zero. This violates the linear independence assumption. Now, since T(x) is continuous, it follows that for any specified δ, we can choose an ε such that the probability of the set A = {x: |T(x)| < ε} satisfies Pr(A) ≤ δ. This gives the desired result: δ(ε) = Pr{|T(X)| < ε} can be made arbitrarily small.

We collect our results (and assumptions) in

Theorem 1.2.2: We observe the sequence of stationary dependent random variables X_ℓ, ℓ=1,2,..., with the marginal density function f(x) = p₀f₀(x) + p₁f₁(x). Assume that a sequence of test functions Tₙ(x₁,x₂,...,xₙ;xₙ) exists which satisfies (22),

¹By linear independence we mean that there do not exist two non-zero constants c₀, c₁ such that c₀f₀(x) + c₁f₁(x) = 0, a.e. x. (Since the fᵢ are densities, linear dependence is equivalent to equality.)

Eₙ(|Tₙ(X₁,...,Xₙ;Xₙ) - T(Xₙ)|²) ≤ γₙ²

and lim_{n→∞} γₙ = 0. Define the sequence of decision functions by

tₙ(x₁,...,xₙ) = 1  if Tₙ(x₁,x₂,...,xₙ;xₙ) ≥ 0
              = 0  otherwise.

If pᵢfᵢ(x) ≠ 0 for a.e. x, i=0,1, the empirical risk converges to the risk incurred using the one-stage procedure. In addition, if the density functions fᵢ(x), i=0,1, are linearly independent and analytic functions of x, the difference in risks at the n-th decision is bounded by

0 ≤ R(tₙ,π) - R(t_B,π) ≤ a₁(2γₙ²/ε² + δ(ε)).

ε is an arbitrary positive constant, and δ(ε) can be made arbitrarily small by a suitable choice of ε.

We will have the occasion to consider a test function defined as

T′(x) = s(x)T(x) = s(x)[p₁a₁f₁(x) - p₀a₀f₀(x)].                    (1.2.27)

Since s(x) will be a positive function, the decision function

t′_B(x) = 1  if T′(x) ≥ 0                                          (1.2.28)
        = 0  otherwise

is identical to t_B(x). Hence, the risk using this equivalent test, R(t′_B,π), is equal to R(t_B,π). For the empirical procedure, we will then take

T′ₙ(xₙ) = s(xₙ)Tₙ(xₙ)                                              (1.2.29)

as the estimate of T′(x) and define the decision function by

t′ₙ(xₙ) = 1  if s(xₙ)Tₙ(xₙ) ≥ 0                                    (1.2.30)
        = 0  otherwise.

Letting R(t′ₙ,π) denote the risk of this procedure, we have

R(t′ₙ,π) = p₁a₁ - E_{n,λₙ}(t′ₙ(Xₙ)b(λₙ)).                          (1.2.31)

The difference in risks is

0 ≤ R(t′ₙ,π) - R(t_B,π) = E_{n,λₙ}(b(λₙ)(t_B(Xₙ) - t′ₙ(Xₙ)))
                        ≤ a₁Eₙ|t_B(Xₙ) - t′ₙ(Xₙ)|.                 (1.2.32)

Then, by a proof which is identical to the previous theorem, we have

Corollary 1.2.2: Assume that the sequence of test functions satisfies

Eₙ(s²(Xₙ)(Tₙ(X₁,...,Xₙ;Xₙ) - T(Xₙ))²) ≤ (γₙ′)²                     (1.2.33)

and that lim_{n→∞} γₙ′ = 0. Assume further that the density functions fᵢ(x), i=0,1, are linearly independent and that the functions s(x)fᵢ(x), i=0,1, are analytic functions of x. Then, the difference in risks at the n-th stage is bounded by

0 ≤ R(t′ₙ,π) - R(t_B,π) ≤ a₁(2(γₙ′)²/ε² + δ′(ε)),                  (1.2.34)

where δ′(ε) = Pr{|s(X)T(X)| < ε} can be made arbitrarily small by a suitable choice of ε.

Similar results can be obtained for any equivalent test procedure. By an equivalent procedure we mean a test function Tₑ(x) such that for every x, Tₑ(x) ≥ 0 precisely when T(x) ≥ 0, and a decision function t(x) which equals one when Tₑ(x) ≥ 0 and zero otherwise. In view of (32), a bound on the difference in risks, analogous to (34), can easily be obtained. These remarks can be extended to the results of the next two subsections. Since the extension to equivalent tests is straightforward, we will not discuss them further in this chapter.

1.2c Extension of the Dependent Case to Multiple Hypotheses

Let the parameter λ take on the "values" λ = λ₀, λ₁,...,λ_K. Again, λ designates which hypothesis is active. We take pᵢ as the a priori probability of the i-th hypothesis, Σpᵢ = 1, and fᵢ(x) as the density function of the observation given that λ=λᵢ. A test procedure is equivalent to specifying (K+1) decision functions tᵢ(x), i=0,1,...,K, defined on the space of observations such that if, for a given x, tᵢ(x)=1 we announce λᵢ, and if tᵢ(x)=0, we do not announce λᵢ. Clearly, we have Σtᵢ(x) = 1 for all x.

If we take the loss as 0 for a correct classification and equal to 1 if we are in error then, assuming that λ=λⱼ, the loss is (1 - tⱼ

(x)).¹ The expected loss, given that λ=λⱼ, is

R(t,λⱼ) = ∫_{-∞}^{+∞} (1 - tⱼ(x)) fⱼ(x) dx                          (1.2.35)

and the overall loss, or expected risk, is

R(t,π) = Σ_{j=0}^{K} pⱼ ∫_{-∞}^{+∞} (1 - tⱼ(x)) fⱼ(x) dx
       = 1 - Σ_{j=0}^{K} pⱼ ∫_{-∞}^{+∞} tⱼ(x) fⱼ(x) dx.             (1.2.36)

We prefer to write this as

R(t,π) = 1 - p₀ - Σ_{j=0}^{K} ∫_{-∞}^{+∞} tⱼ(x) Tⱼ(x) dx,           (1.2.37)

where the test functions Tⱼ(x) are defined by

Tⱼ(x) = pⱼfⱼ(x) - p₀f₀(x),   j=0,1,...,K.                           (1.2.38)

The test procedure given by

t_{Bj}(x) = 1  if Tⱼ(x) ≥ Tᵢ(x), all i                              (1.2.39)
          = 0  otherwise,

¹We can think of tⱼ(x) as the probability of announcing λ=λⱼ when we observe x; (1 - tⱼ(x)) is then the probability of an error given that λ=λⱼ.

minimizes the probability of error. The minimum probability of error is then

R(t_B,π) = 1 - p₀ - Σ_{j=0}^{K} ∫ t_{Bj}(x) Tⱼ(x) dx
         = 1 - p₀ - Σ_{j=0}^{K} ∫_{Aⱼ} Tⱼ(x) dx,

where Aⱼ = {x: Tⱼ(x) ≥ Tᵢ(x), i=0,1,...,K}.

When the test functions are unknown, we suppose that we can find a sequence of functions Tⱼₙ(xₙ) which satisfy

Eₙ(|Tⱼₙ(X₁,...,X_{n-1},Xₙ;Xₙ) - Tⱼ(Xₙ)|²) ≤ γⱼₙ²,   j=0,1,...,K.    (1.2.40)

We then define the sequence of decision functions:

tⱼₙ(x₁,...,xₙ) = 1  if Tⱼₙ(x₁,...,xₙ;xₙ) ≥ Tᵢₙ(x₁,...,xₙ;xₙ), all i    (1.2.41)
               = 0  otherwise.

With fⱼ(x₁,x₂,...,xₙ) denoting the joint density function of the n observations given that λₙ=λⱼ, the expected risk is

R(tₙ,π) = 1 - Σ_{j=0}^{K} pⱼ ∫...∫ tⱼₙ(x₁,...,xₙ) fⱼ(x₁,...,xₙ) dx₁...dxₙ.

The difference in risks is expressed as

0 ≤ R(tₙ,π) - R(t_B,π) = Σ_{j=0}^{K} pⱼ ∫...∫ (t_{Bj}(xₙ) - tⱼₙ(x₁,...,xₙ)) fⱼ(x₁,...,xₙ) dx₁...dxₙ.    (1.2.42)
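The rule (1.2.39), and its empirical counterpart (1.2.41), amount to announcing the hypothesis whose (estimated) test function is largest. A minimal sketch (added here, not from the report) is shown below; the gaussian class densities and the estimated a priori probabilities are illustrative assumptions only.

    # Added sketch of the decision rule (1.2.39)/(1.2.41): announce the hypothesis
    # whose test function T_j(x) = p_j f_j(x) - p_0 f_0(x) is largest.
    import numpy as np
    from scipy.stats import norm

    mu = np.array([0.0, 1.0, 2.0])          # class means (assumed known here)
    p_hat = np.array([0.5, 0.3, 0.2])       # estimated a priori probabilities

    def decide(x):
        f = norm.pdf(x, loc=mu)             # f_j(x), j = 0,...,K
        T = p_hat * f - p_hat[0] * f[0]     # estimated test functions T_jn(x)
        return int(np.argmax(T))            # index j with t_jn(x) = 1

    print([decide(x) for x in (-0.5, 0.7, 1.8)])

Since the term p₀f₀(x) is common to every Tⱼ(x), maximizing Tⱼ(x) is the same as maximizing pⱼfⱼ(x), i.e., the usual maximum a posteriori rule.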

Clearly, pⱼfⱼ(x₁,...,xₙ) ≤ Σ_{i=0}^{K} pᵢfᵢ(x₁,...,xₙ) = f(x₁,...,xₙ), and hence (42) is dominated by

0 ≤ R(tₙ,π) - R(t_B,π) ≤ Σ_{j=0}^{K} Eₙ|t_{Bj}(Xₙ) - tⱼₙ(X₁,...,Xₙ)|.    (1.2.43)

Let the subscript i in the following expressions read "for all i," and the subscript k mean "for some k." The joint event {(Tⱼₙ(x₁,...,xₙ;xₙ) ≥ Tᵢₙ(x₁,...,xₙ;xₙ) for all i), (Tⱼ(xₙ) < Tₖ(xₙ) for some k)} is written as {Tⱼₙ ≥ Tᵢₙ, Tⱼ < Tₖ}. The expectation inside the summation of (43) becomes

Eₙ|t_{Bj}(Xₙ) - tⱼₙ(X₁,...,Xₙ)| = Pr{Tⱼₙ ≥ Tᵢₙ, Tⱼ < Tₖ} + Pr{Tⱼₙ < Tₖₙ, Tⱼ ≥ Tᵢ}.

Consider the first probability expression on the right side. Since the event {Tⱼₙ ≥ Tᵢₙ, Tⱼ < Tₖ} is included in the event {Tⱼₙ ≥ Tₖₙ, Tⱼ < Tₖ}, the first probability expression is dominated by

Pr{Tⱼₙ ≥ Tₖₙ, Tⱼ < Tₖ} = Pr{Tⱼₙ - Tₖₙ ≥ 0, Tⱼ - Tₖ ≤ -εⱼₖ} + Pr{Tⱼₙ - Tₖₙ ≥ 0, -εⱼₖ < Tⱼ - Tₖ < 0},

where εⱼₖ is an arbitrary positive constant. If (Tⱼₙ - Tₖₙ ≥ 0) and ((Tⱼ - Tₖ) ≤ -εⱼₖ), then the inequality |Tⱼₙ - Tₖₙ - Tⱼ + Tₖ| ≥ εⱼₖ holds. Therefore, it follows that

Pr{Tⱼₙ - Tₖₙ ≥ 0, Tⱼ - Tₖ ≤ -εⱼₖ} ≤ Pr{|Tⱼₙ - Tₖₙ - Tⱼ + Tₖ| ≥ εⱼₖ}.

Defining δ₁ⱼₖ = Pr{-εⱼₖ < Tⱼ - Tₖ < 0}, we have Pr{Tⱼₙ - Tₖₙ ≥ 0, -εⱼₖ < Tⱼ - Tₖ < 0} dominated by δ₁ⱼₖ, in analogy to the previous development. We then have the bound

Pr{Tⱼₙ ≥ Tᵢₙ, Tⱼ < Tₖ} ≤ Pr{|Tⱼₙ - Tₖₙ - Tⱼ + Tₖ| ≥ εⱼₖ} + δ₁ⱼₖ

which, in view of (40) and the Minkowski inequality, can be dominated by

Pr{Tⱼₙ ≥ Tᵢₙ, Tⱼ < Tₖ} ≤ (γⱼₙ + γₖₙ)²/εⱼₖ² + δ₁ⱼₖ.                 (1.2.44)

Similarly, we can show that

Pr{Tⱼₙ < Tₖₙ, Tⱼ ≥ Tᵢ} ≤ (γⱼₙ + γₖₙ)²/εⱼₖ² + δ₂ⱼₖ.

Combining these bounds, we obtain

Eₙ|t_{Bj}(Xₙ) - tⱼₙ(X₁,...,Xₙ)| ≤ 2(γⱼₙ + γₖₙ)²/εⱼₖ² + δⱼₖ(εⱼₖ),

where δⱼₖ = Pr{|Tⱼ(Xₙ) - Tₖ(Xₙ)| < εⱼₖ}.

In analogy to Theorem 1.2.2, we have

Corollary 1.2.3: We observe the sequence of stationary dependent random variables X_ℓ, ℓ=1,2,..., with the marginal density function f(x) = Σ_{j=0}^{K} pⱼfⱼ(x). Assume that the sequences of test functions Tⱼₙ(x₁,x₂,...,xₙ;xₙ), j=0,1,...,K, satisfy

Eₙ(Tⱼₙ - Tⱼ)² ≤ γⱼₙ²,   j=0,1,...,K,                               (1.2.45)

and that lim_{n→∞} γⱼₙ = 0, j=0,1,...,K. At the n-th decision, define the (K+1) decision functions by

tⱼₙ(x₁,...,xₙ) = 1  if Tⱼₙ(x₁,...,xₙ;xₙ) ≥ Tᵢₙ(x₁,...,xₙ;xₙ), all i
               = 0  otherwise,   j=0,1,2,...,K.

Then, if pᵢfᵢ(x) ≠ 0 a.e. x, i=0,1,...,K, the empirical risk converges to the risk of the one-stage procedure. If the density functions fⱼ(x), j=0,1,...,K, are linearly independent and analytic functions of x, the difference in risks at the n-th decision is bounded by

0 ≤ R(tₙ,π) - R(t_B,π) ≤ Σ_{j=0}^{K} [2(γⱼₙ + γₖₙ)²/εⱼₖ² + δⱼₖ(εⱼₖ)],    (1.2.46)

where again δⱼₖ can be made arbitrarily small.

1.2d Convergence of the Empirical Procedure for Unbounded Loss Functions

The fact that the loss function ((2)) is bounded has been used to considerable advantage in obtaining the above bounds. Situations where L(t(x),λ) may not be a bounded function of λ occur when we let λ (the state of nature) take on a continuum of values. We assume that λ is a

random variable drawn from some general parameter space Λ. With λ distributed according to the distribution α(λ), the density function of the observation is written

f(x) = ∫_Λ f_λ(x) dα(λ).                                            (1.2.47)

The hypothesis test we consider is one in which we infer from the observation X whether λ ∈ A (hypothesis H₀) or λ ∈ Λ-A (hypothesis H₁).¹ To obtain the (one-stage) test procedure, we again let t(x) = 0, 1, depending on whether we believe H₀ or H₁ is in effect. Defining

b(λ) = L(0,λ) - L(1,λ)                                              (1.2.48)

and

T(x) = ∫_Λ b(λ) f_λ(x) dα(λ),                                       (1.2.49)

the risk incurred is a minimum if we choose

t_B(x) = 1  if T(x) ≥ 0
       = 0  otherwise.

The risk is then given by

R(t_B,α) = ∫_Λ L(0,λ) dα(λ) - ∫ T(x)⁺ dx,                           (1.2.50)

¹See Robbins [30], section 5, for the details.

which we can also write as

R(t_B,α) = ∫_Λ L(0,λ) dα(λ) - E_{X,λ}(t_B(X)b(λ)).                  (1.2.51)

When the test function T(x) is not known, we define a (two-valued) decision function tₙ(x₁,...,xₙ) as before. The overall (empirical) risk is

R(tₙ,α) = ∫_Λ L(0,λ) dα(λ) - E_{n,λₙ}(tₙ(X₁,...,Xₙ)b(λₙ))           (1.2.52)

and the difference in risks can be written as

0 ≤ R(tₙ,α) - R(t_B,α) = E_{n,λₙ}(b(λₙ)(t_B(Xₙ) - tₙ(X₁,...,Xₙ))).  (1.2.53)

If we assume

∫_Λ b²(λ) dα(λ) ≤ c² < ∞,                                           (1.2.54)

then by the Schwarz inequality it follows that

0 ≤ R(tₙ,α) - R(t_B,α) ≤ c (Eₙ|t_B(Xₙ) - tₙ(X₁,...,Xₙ)|)^{1/2}.     (1.2.55)

The value of the expectation in (55) is identical to the value of the expectation appearing in (25). Hence, we obtain

Corollary 1.2.4: Assume there exists a sequence of test functions which satisfies

Eₙ((Tₙ(X₁,...,Xₙ;Xₙ) - T(Xₙ))²) ≤ γₙ².

Then, if the decision function tₙ(xₙ) is defined as

tₙ(xₙ) = 1  if Tₙ(xₙ) ≥ 0
       = 0  otherwise,

and if (54) holds, the difference in risks at the n-th stage is dominated by

0 ≤ R(tₙ,α) - R(t_B,α) ≤ c(2γₙ²/ε² + δ(ε))^{1/2}.                   (1.2.56)

Observe that the bound is of order γₙ, while the previous bounds on the risk were of order γₙ². This is a direct result of the boundedness of b(λ) for the minimum probability of error criterion and the fact that the sequence {λₗ} may be dependent.

We have assumed throughout that each decision is based on a single observation. The extension of the above results to more than one sample per decision is straightforward.

1.3 LITERATURE SURVEY AND SCOPE OF THE PRESENT STUDY

We have investigated the convergence of a particular empirical procedure to what we have called the optimum one-stage procedure. By dominating the mean-square error,

Eₙ((Tₙ(X₁,...,Xₙ;Xₙ) - T(Xₙ))²) ≤ γₙ²,

we are able to bound the rate of convergence of the empirical risk. Hence, the central problem is to find a sequence of estimates Tₙ(x₁,...,xₙ;xₙ) which is consistent, i.e., lim_{n→∞} γₙ = 0. This is our major concern.

Consider the two hypotheses problem with a minimum probability of error criterion. For this case, T(x) = p₁f₁(x) - p₀f₀(x). Assume that p₁ is known and that the densities f₀(x) and f₁(x) are unknown. To estimate T(x), a natural procedure would be to first estimate the densities and then take

T̂ₙ(x) = p₁f̂₁ₙ(x) - p₀f̂₀ₙ(x)                                         (1.3.1)

as the estimate of the test function for the n-th decision. If f̂₁ₙ and f̂₀ₙ are consistent estimates, then the sequence T̂ₙ(x) will also be consistent. The manner in which the estimates are obtained depends on whether "learning" samples are available.

If one can classify an observation with probability one, it is called a learning sample. Then, if the observation is known to come from, say, hypothesis H₀, we would use it to update our estimate of f₀(x). This type of operation has sometimes been called supervised learning or learning with a teacher.¹

¹In the context of communication problems, learning samples are provided by periodically injecting a known fixed sequence into the sequence of information bearing signals, i.e., channel sounding signals. See [35,36]. Learning samples of a different nature occur in problems such as statistical weather prediction. Based on some observational data, an inference is made about the future weather. At some later time we find out if the inference was correct. This knowledge would then be used to form better inference procedures.
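When learning samples are available, the plug-in estimate (1.3.1) can be formed from any consistent density estimates. The sketch below is an added illustration, not the author's construction (which is developed in Chapter 2); it uses a gaussian kernel estimate for each class, and the n^(-1/5) bandwidth is an arbitrary assumed choice.

    # Added sketch of the supervised plug-in estimate (1.3.1): labeled learning
    # samples give kernel estimates of f_0 and f_1, which form T_hat_n(x).
    import numpy as np

    def kernel_density(samples, x, h):
        """Gaussian-kernel estimate of a density at the points x."""
        u = (x[:, None] - samples[None, :]) / h
        return np.exp(-0.5 * u**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

    rng = np.random.default_rng(1)
    p1 = 0.4
    n0, n1 = 600, 400                          # labeled samples from each class
    s0 = rng.standard_normal(n0)               # learning samples under H0
    s1 = rng.standard_normal(n1) + 1.0         # learning samples under H1

    x = np.linspace(-4.0, 5.0, 7)
    h0, h1 = n0 ** -0.2, n1 ** -0.2            # illustrative bandwidths
    T_hat = p1 * kernel_density(s1, x, h1) - (1 - p1) * kernel_density(s0, x, h0)
    print(np.round(T_hat, 4))                  # decide H1 wherever T_hat >= 0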

When learning samples are not available, the problem is more difficult. Since we never know from which population the observation is drawn, we cannot directly estimate the desired quantities. One possible procedure is to estimate the overall density function, f(x) = p₀f₀(x) + p₁f₁(x), and then attempt to extract from this estimate the parts that are unknown and that are needed to form the estimate of the test function. This mode of operation has been called nonsupervised learning or learning without a teacher. We remark that learning in the nonsupervised mode is not always possible.

When the sequence of observations is independent and if, with either of the above procedures, we obtain consistent estimates of the test function, then these procedures are asymptotically optimal.² This is not to say that the probability of error is minimized at each stage. This, of course, depends on what part of fᵢ(x) is unknown, how it is estimated, and subsequently used to form the estimate of the test function.

1.3a Literature Survey

The learning procedures most frequently investigated are those in which a set of parameter vectors θᵢ, i=1,...,K, is to be estimated

²When the observations are dependent, the procedures are in no way optimum. Presumably, they are reasonable procedures to follow, especially when the exact nature of the dependency on the observations is not specified.

from the statistically related observation vectors X_ℓ, ℓ=1,...,n.¹ Each parameter (or pattern class) θᵢ is associated with a particular hypothesis Hᵢ and could represent samples of a signal which is buried in noise. The density function of the observation given that Hᵢ is active is written as f_{θᵢ}(x), and the overall density function of the observation becomes

f(x) = Σ_{i=1}^{K} p_{θᵢ} f_{θᵢ}(x).                                (1.3.2)

Furthermore, it is assumed that the set of patterns is initially chosen from a known a priori distribution, p_θ(·), and then held fixed for the experiment. The statistical inference problem is to decide which hypothesis (pattern class) is in effect for a particular observation X. The criterion used is the minimization of the total probability of error.

Within this framework, a number of authors (e.g., [3,5,20,21]) have investigated optimum test procedures when learning samples are available. Let X_k represent the sequence of learning samples. Then, given the observation X, the optimum decision rule is to compute the a posteriori conditional densities

¹Vectors are denoted by the underline notation.

P(θⱼ|X, X_k),   j=1,...,K,                                          (1.3.3)

and announce the θⱼ for which (3) is a maximum.

Braverman [5] assumes that the sequence of learning samples X_k = X_{k1}, X_{k2},... is independent and that the learning samples of one class impart no information concerning the patterns of another class. Letting X_{kj} denote the set of learning samples of the j-th class, (3) becomes P(θⱼ|X, X_{kj}), j=1,...,K. He takes the density function f_θ(x) as gaussian (f_{θⱼ}(x) = g(x-θⱼ)) and the a priori densities p_{θⱼ}(·), j=1,...,K, also as gaussian with unknown means and known covariance. The optimum procedure is then to use the learning samples to estimate the means of each class and use these estimates in the computation of the a posteriori probabilities. For the case of two hypotheses, he shows that the difference between the error probability of the above procedure and the error probability in the case the patterns (mean values) are known is approximately inversely proportional to the number of learning samples. Keehn [21] extends the work of Braverman by taking both the mean vector and covariance matrix of p_θ(·) as unknown.

Scudder [39,40] also takes the noise and a priori distributions as independent gaussian and investigates the problems encountered when learning samples are not available. The optimum test procedure now

requires an exponentially growing memory. He then looks at a fixed memory technique similar to the procedure used when learning samples are available, but now learning takes place on the basis of previous decisions, which are never known with certainty to be correct.

The problem of when the optimum test procedure, with or without learning samples, requires a growing memory is discussed in a paper by Spragins [42]. The optimum test procedure (an application of Bayes' rule conditioned on an increasing number of observations) will be of fixed memory if and only if the sequence of (independent) observations admits a sufficient statistic of fixed dimension. The existence of the sufficient statistic is seen to imply the existence of an a priori distribution p_θ(·) which has a "reproducing" property. Thus, by choosing an a priori distribution which has the reproducing property, a number of authors (e.g., [3,21]) are able to obtain optimum fixed memory procedures.

Hancock and Patrick [17] provide for a general formulation of the learning problem by focusing attention on the overall distribution as given by (2). An important contribution of this study is the determination of when sufficient amounts of a priori information exist for a learning procedure to converge. When little a priori information is known, they apply histogram techniques to a class of nonsupervisory problems. When the functional form of the overall density is known, they investigate estimates of the parameters θᵢ which characterize the

overall distribution or, as they call it, the mixture. The estimates are shown to be consistent, thus leading to an asymptotically optimal test procedure.

Somewhat related, but less general in formulation, is the work of Cooper and Cooper [9]. They consider the two-category problem with particular emphasis on the case where the overall density is the sum of two gaussian densities. Taking each hypothesis equiprobable, they discuss different estimates of the unknown means which are then used to form an estimate of the test function. They extend the (nonsupervisory) results to multivariate gaussian densities by estimating the parameters which characterize the optimum partition (i.e., a hyperplane) of the sample space. Also discussed is the case where the arbitrary densities of the two equiprobable hypotheses differ only in a location parameter.

A departure from the above formulation is made by Robbins [29-31] and his associates [19,37]. They consider only one a priori distribution, p_θ(·) = p(θ), and take the distribution as unknown. Here, the inference problem is to decide whether θ is contained in some set A or its complement. Since the density function of an observation under either hypothesis is the same, f(x) = ∫ f_θ(x) dp(θ), every observation can be considered a learning sample even though these observations are never classified correctly with probability one. Their main effort is directed toward showing that the empirical procedures are asymptotically optimal

for a variety of hypothesis testing (and estimation) problems.

All of the above authors take sequences of independent observations. Tainiter [44] extends one aspect of the work of Robbins to M-dependent observations, and Raviv [28] takes the "patterns" to be a Markov sequence with the transition probability matrix initially unknown. Special formulations and learning procedures appropriate to certain communication problems are given by Glaser [1], Price and Green [27], and Sebestyen [41]. A bibliography emphasizing the supervised mode of learning is given in [2]. A discussion of most of the approaches to nonsupervised learning is given in the recent paper by Spragins [43].

1.3b Scope of the Present Study

The present study is closest, in spirit, to the work of Robbins. The problems we will consider are those in which the "patterns" are random variables. Thus, if the same pattern class or hypothesis is active in succeeding intervals, this only means that the distributions from which they are drawn are the same. It is these a priori distributions which we will take as unknown. In particular, we shall consider a class of problems where the marginal density function of a single observation can be written as

f(x) = ∫ g(x-z;σ) dα(z),                                            (1.3.4)

with a corresponding vector equation for multidimensional observations. g(x;σ) denotes the gaussian density function with standard deviation σ.

By suitably interpreting α(z) we can include all the problems we are interested in. We give the following as examples. Let u(t) be the unit step function and define

α(z) = Σ_{i=1}^{K} pᵢ u(z - yᵢ).                                     (1.3.5)

Then, f(x) becomes

f(x) = Σ_{i=1}^{K} pᵢ g(x - yᵢ; σ).                                  (1.3.6)

This represents the density function of the observation where one of K signals is transmitted with gaussian noise added to the message. The signals represent the values the random variable Z can assume. A generalization of (5) is to take

α(z) = Σ_{i=1}^{K} pᵢ ∫ u(z-y) dβᵢ(y),                               (1.3.7)

where βᵢ(y) represents one of K different distributions. f(x) is then given by

f(x) = Σ_{i=1}^{K} pᵢ ∫ g(x-y;σ) dβᵢ(y).                             (1.3.8)

Here the problem would be one of testing between K composite hypotheses with noise-like signals. Letting u(z-y) = u(z - sᵢ(t,y)) in (7) gives

f(x) = Σ_{i=1}^{K} pᵢ ∫ g(x - sᵢ(t,y); σ) dβᵢ(y).                    (1.3.9)

This has the interpretation as the overall density function of K composite hypotheses, with the i-th hypothesis representing the signal sᵢ being transmitted. The a priori probability of this transmission is pᵢ. The notation sᵢ(t,y) is taken to mean that the signal sᵢ which, for example, is time sampled at t, is distorted by the random vector Y.

The difficulties we shall encounter are not in attempting to process the observations in some optimal fashion. We have already agreed to consider a learning procedure which, at best, converges to the optimum one-stage procedure, this empirical procedure being asymptotically optimum if the observations are independent. Our difficulties will stem from the fact that the a priori distribution βᵢ(y) is taken as completely unknown, as opposed to assuming some known functional form with a finite set of unknown parameters.

The empirical procedure we have outlined is one of estimating the densities fᵢ(x) when learning samples are available, and, initially, the overall density f(x) = Σpᵢfᵢ(x) when operating in the nonsupervisory mode. Much of this study deals with estimating f(x) as given by (4), and establishing bounds on the mean-square error in the estimate. There is one exception: we also consider (6) with the a priori probabilities unknown.
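The different choices of α(z) above all lead to the same computation for f(x). The sketch below (an added illustration, not from the report) evaluates (1.3.4) for the discrete distribution (1.3.5), which reduces to the finite mixture (1.3.6), and for an assumed continuous a priori distribution handled by numerical averaging; both choices of α are assumptions made only for this example.

    # Added sketch: f(x) = integral of g(x - z; sigma) d(alpha(z)) for two choices of alpha.
    import numpy as np

    def gauss(u, sigma=1.0):
        return np.exp(-0.5 * (u / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

    x = np.linspace(-3.0, 5.0, 5)

    # (1.3.5)-(1.3.6): alpha places mass p_i at signal values y_i, so f(x) = sum p_i g(x - y_i).
    p, y = np.array([0.6, 0.4]), np.array([0.0, 2.0])
    f_discrete = (p[None, :] * gauss(x[:, None] - y[None, :])).sum(axis=1)

    # A continuous alpha: Z uniform on [0, 2], f(x) = (1/2) * integral_0^2 g(x - z) dz,
    # approximated by averaging the integrand over a fine grid of z values.
    z = np.linspace(0.0, 2.0, 2001)
    f_uniform = gauss(x[:, None] - z[None, :]).mean(axis=1)

    print(np.round(f_discrete, 4))
    print(np.round(f_uniform, 4))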

In Chapter 2 we consider different methods of estimating f(x). Of particular interest is an eigenfunction representation (section 2.5) for f(x), which we obtain by solving an eigenfunction problem associated with equation (4). Chapter 3 extends the results to estimating the k-variate density function f(x₁,...,x_k). In Chapter 4, we apply our results to some problems in communication theory. Section 4.1 considers transmission through a general, stationary, random channel when learning samples are available. This problem serves to relate the results of section 1.2 on the convergence of the empirical procedure with our results on density estimation. It also illustrates when we can expect to obtain solutions to the nonsupervisory problem. The remaining applications emphasize learning in the nonsupervisory mode. In section 4.2 we consider the problem of transmission of known signals with unknown a priori probabilities, and in section 4.3 we discuss the problem of transmission through a random multiplicative channel. In section 4.4 we consider a problem with an unbounded loss function. A summary of this study is given in Chapter 5.

CHAPTER 2

ESTIMATING THE DENSITY FUNCTION OF THE OBSERVATIONS-UNIVARIATE CASE

2.1 INTRODUCTION

As discussed in the previous chapter, one approach to finding a convergent sequence of test functions is to first obtain a convergent sequence of estimates for the unknown density functions. These estimates are then used to form a test function, the structure of which is identical to the test function one would use if all distributions were known. Our main concern in this chapter is obtaining consistent estimates of the univariate density function of the observations. By consistent estimates we will mean estimates which converge in mean-square in the sense of

lim_{n→∞} E((f(x) - f̂ₙ(X₁,X₂,...,Xₙ;x))²) = 0   for every x          (2.1.1)

or

lim_{n→∞} E((f(Xₙ) - f̂ₙ(X₁,X₂,...,Xₙ;Xₙ))²) = 0.                      (2.1.2)

Equation (1), obviously, is concerned with convergence to the constant f(x), while in (2) we have convergence to a random variable. It is (2) which we need to demonstrate convergence of the empirical procedure for the case of dependent samples.¹ Since, for two of the methods which we

¹The convergence in (1) is essentially that required for the case of independent samples. See (1.2.16).

use to estimate f(x), there is little difference between consistency in the sense of (1) and (2), we will evaluate bounds for both types of convergence.

To compare our results with previous work in the area of density estimation, we will also consider a global measure of the error, the mean integrated square error,

E ∫_{-∞}^{+∞} (f(x) - f̂ₙ(X₁,X₂,...,Xₙ;x))² dx.                        (2.1.3)

The basic problem which we want to discuss is as follows. We are given the stationary sequence of identically distributed random variables (observations), Xᵢ = Nᵢ + Zᵢ, i=1,2,..., where Nᵢ is a sample from a stationary gaussian process and Zᵢ is a sample from an unknown random process. The samples may be time samples or any other linear processing of the received waveform which preserves the gaussian nature of the noise. With Nᵢ independent of Zᵢ, the univariate density function of the observation Xᵢ is

f(x) = ∫_{-∞}^{+∞} g(x-z;σ) dα(z),                                    (2.1.4)

where by g(x-y;σ) we mean the gaussian density function with mean value y and standard deviation σ. We want to take the gaussian noise samples as correlated and also consider a dependency on the Zᵢ sequence which will be specified later.

In the next section we consider the empirical distribution function as an estimate of the cumulative distribution of the observations. We

investigate the mean-square error and obtain a bound on the rate of convergence. The results of this section are then applied to the problem of estimating the density function, for which we give three techniques. The first method of estimating f(x), section 2.3, is analogous to the technique used in estimating a spectral density. For this method, we restrict our study of convergence to the modes specified by equations (1) and (3). This method of estimation requires a minimum of assumptions to guarantee convergence. In section 2.4 we consider an orthogonal representation for f(x) and investigate all three of the above modes of convergence. The method in section 2.5 is analogous to the technique generally used to solve a deterministic integral equation. To the best of our knowledge, this approach has not appeared in the literature.

The results which we will need for the applications of the empirical Bayes procedure are contained in Corollaries 2.4.1 and 2.5.1, and in section 2.6. Section 2.6 considers a special form of α(z): the case where α(z) contains a finite set of unknown parameters which enter linearly into f(x). A summary of the chapter and generalizations are given in section 2.7.

2.2 THE EMPIRICAL DISTRIBUTION FUNCTION

We want to consider the empirical distribution function as an estimate of the true distribution. For the case of independent observations, it

is easy to see that this estimate is consistent, with the mean-square error going to zero at a rate 1/n. For the case of dependent samples, our main interest will be a characterization of the nature of the dependency of the samples, or the underlying random process, for which we can still guarantee consistency with a specified rate of convergence.

The sequence of observations X₁,X₂,...,Xₙ are identically distributed (not necessarily independent) random variables. Xᵢ is a sample from a stationary process which is composed of the sum of a gaussian process with an autocorrelation function R(τ) and another stationary process Z(t); Xᵢ = Nᵢ + Zᵢ. With Nᵢ and Zᵢ independent, the density function of the observation is given by

f(x) = ∫_{-∞}^{+∞} g(x-z;σ) dα(z),                                    (2.2.1)

with the corresponding distribution function

F(x) = ∫_{-∞}^{x} f(y) dy.                                            (2.2.2)

The empirical distribution function of the observations is defined as Fₙ(x) = (1/n)(number of Xᵢ ≤ x, i=1,2,...,n). Let U_ℓ(x) = 1 if X_ℓ ≤ x, and equal zero otherwise. Fₙ(x) is then written as

Fₙ(x) = (1/n) Σ_{ℓ=1}^{n} U_ℓ(x).                                     (2.2.3)
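The empirical distribution function (2.2.3) is simply an average of indicator functions. A minimal sketch (added here, not from the report) is given below; the particular mixture used to generate the observations is only an illustrative assumption.

    # Added sketch of the empirical distribution function (2.2.3):
    # F_n(x) = (1/n) * sum of the indicators U_l(x) = 1{X_l <= x}.
    import numpy as np

    def empirical_cdf(samples, x):
        # one row per evaluation point, one column per observation X_l
        return (samples[None, :] <= np.asarray(x)[:, None]).mean(axis=1)

    rng = np.random.default_rng(2)
    X = rng.standard_normal(1000) + (rng.random(1000) < 0.3)   # X_l = N_l + Z_l
    print(empirical_cdf(X, [-1.0, 0.0, 0.5, 1.0, 2.0]))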

With E denoting the mathematical expectation, we have

E(Fₙ(x)) = F(x);                                                      (2.2.4)

Fₙ(x) is an unbiased estimate of the distribution whether or not the observations are independent. The mean-square error can be written in terms of a bias and a variance contribution:

E(F(x) - Fₙ(x))² = (E(Fₙ(x)) - F(x))² + E(Fₙ(x) - E(Fₙ(x)))².         (2.2.5)

Since the first term is zero, to investigate consistency we need only consider the variance of the estimate. The second moment is given by

E(Fₙ²(x)) = (1/n²) E(Σ_{ℓ=1}^{n} U_ℓ(x))²
          = (1/n²) [Σ_{ℓ=1}^{n} E(U_ℓ(x)) + 2 Σ_{ℓ=1}^{n} Σ_{m=ℓ+1}^{n} E(U_ℓ(x)Uₘ(x))]
          = F(x)/n + (2/n²) Σ_{ℓ=1}^{n} Σ_{m=ℓ+1}^{n} F_{m-ℓ}(x,x).    (2.2.6)

We have defined F_{m-ℓ}(x,x) as the joint probability that the samples from the ℓ and m intervals are less than or equal to x,¹

¹The subscripts ℓ and m will always mean the ℓ and m observations (decisions, intervals, etc.) when used in a double sum.

47 F m(x,x) = Pr (XQ < x, Xm < x) = Fm_(x,x) (2.2.7) We want to display the effect of the dependency of the observations on the variance of Fn(x). Add (1-l/n)F2(x) to (6) and subtract its equivalent n n 2 F2(x) ~=1 m=2+l The variance is V(Fn(x)) = F(x) (1 - F(x)) n n n 2 2 + 2 (F 2(x,x) - F (x)) (2.2.8) n =1 m=2+l With independent observations the second expression on the right side of (8) is zero-the variance reduces to the standard result. Assume, for the moment, that the sequence of random variables Zi are independent. Then, the second-order distribution is x x Fm_(x,x) =f / j / g2(yl-Zly2-z2;a Pma)da(zl)da(z2)dyl,dy2, -oo -oo Z1 Z2 (2.2.9) where g2(nl,n2; a,pm_) is the bivariate gaussian density function with the random variables NQ and Nm having the same standard deviation a and a correlation coefficient pm_ = R(m-2)/R(O).1 The univariate distribution If we had time samples, Pm-~=R((m-~)T)/R(O), where T is the time between succeeding samples. We shall take T = 1. The gaussian random variables have the same standard deviation since the waveform in each interval is identically processed.

48 a(zi) is independent of the subscripts I or m because of the assumed stationarity. It will be convenient to denote the expression in the double summation of (8) by Dm- D = F (xx) - F (x) m-~ m-~ ~ - x x J J J 7[g2(Yl-ZlY2-Z2;aPm-~)-g(Y1-Zl;a)g(Y2-Z2;a)]da(z1) - o -o Z1 Z2 da(z2)dyl dy2 (2.2.10) We interchange the y and z integrations((l0) is absolutely integrable) and consider the resulting inner double integrals x x / [ga(Yl-zl,y2-Z2;CaPm )-g(yl-zl;ac)g(y2-z2;a)]dy1 dy2. (2.2.11) -00 -00 The bivariate gaussian density is expressed in terms of Mehler's formula. From Appendix A, (A.20), with the requirement that |Pm_ I < 1, I~ m, we have: 0O g2(yl-zl,y2-Z2;,p ) = g( Y-Z)g(Y2z)X 7z He ) He(YZ) i-I Ca a LC j=o 3~j-0~~o (2.2.12) The Hej(y/a) are the Hermite polynomials orthogonal with respect to the gaussian weight g(y/a). Observe that this is not an orthogonal expansion We use the notation g(y/a) for the gaussian density (with standard deviation a) when dealing with the corresponding polynomials. g(y/a) is identical to g(y;a) which is the notation we generally use for the gaussian density.

49 in the usual sense. The polynomials are defined in such a way that the orthogonal functions are given byTg(y) He.(y). Substitute (12) for the bivariate density in (11). The first term of the series cancels leaving oo00 x x j, r C ) Hj )g( He ( 2)g( 22)dyl dy. (.2.135) j=l -o0 -o0 With IP m l < 1, it is easy to justify the above inversion of summation and integrations The integrals are then dominated using Schwarz's inequality and the orthogonality relation for the Hermite polynomials (Ao10): x fg( ) He ( )dy -00 +00 +00 1/2 < X f g( )dy g( He0( )dy - 00O 00 <\f7. (2.2.14) Hence, (13) is bounded by V I J _ lPm-~l l m.1 - Pmj=1 Dmi is also bounded by the same quantity, D -_1 - -I- ( 1 l -Pm-l0 1See Appendix A, section A.1. 2See Appendix A, Lemma Aol.

50 Notice that (15) would still bea valid bound on Dm_, if the integrand in (10) were replaced by its absolute value, We shall use this later, Combining (8) and (1.5), the variance of the empirical distribution function is dominated by n n V(Fn(x)) =F) (l-F(x)) + Dmn n n =1l m=-+l n, F( x) + nI < F(x) (l-F(x)) + 2 (n-r) -P (2.2,16) n n'2 1 1- TP with the second expression following from the stationarity and with T=m-~o Convergence in quadratic mean requires that V(F (x)) O as n -* oo. For this, it is sufficient to assume that the autocorrelation function of the gaussian noise process satisfies R(T) + 0 as T - co.(2.2l17) Condition (17) excludes the possibility of jumps in the spectrum. of the noise process. It then follows (Loeve [241, po 202) that p* max jp1| < 1 2 (2o2o18) T>i Using this fact, the second part of (16) is m.ajorized by n n S X (nrl-) T- < --- E (2,2-19) - 1 1 - PJTI 1l-p. n T'=l T —L n Since T + 09 it fol.lows that the sequence of arithmetic means,~ - TTl

51 tends to zero as n oo (Hobson [18], po 7). Notice also that (18) is sufficient for the validity of the Mehler formula With the first part of (16) dominated by 1/n, we have Theorem 2.2o1: Given the sequence of identically distributed random variables of the form Xi = Ni + Zi, with the univariate density function given by +00oo f(x) = g(x-z;a) dca(z) -00 Assume that i) the sequence of random variables (Zi) is independent ii) the autocorrelation function of the gaussian noise satisfies R(,-) O0 as T - oo Then, the empirical distribution function is a consistent estimate of F(x)o Upon applying the Chebyshev inequality, and since V(Fn(x)) + 0 uniformly in x, we also have uniform convergence in probability: as ~1 1,~3 "1 n + oo and for arbitrary, Pr[F n(x)-F(x) >C) + 0, uniformly in Xo In order to obtain a bound for the rate of convergence we need to specify the manner in which R(r) + 0 For example, assume that R(T) is bounded by 1We note that the hypothesis of the theorem is sufficient to ensure convergence with probability oneo This will be discussed at the end of the sectiLon 2The bounds given below are for time samples, with the time between succeeding samples taken as T = lo

52 jR(T)I < 2 /T (2.2.20) for ITI > B1 where 0 < 6 < 1. Then, it is not difficult to obtain an integral upper bound for the arithmetic mean: n n n Ip RI IR() < -- B1 1 (2.2.21) n L " T' n/L a2 n (1-5)n6 T=1 T=l In the sequel, we will designate (20) as condition A. Alternatively, we could make the assumption that 00 0 / |R(t)|dt < co. (2.2.22) This implies that the spectrum is absolutely continuous. Then, by the Riemann-Lebesgue lemma. we have R(t) - 0 and t - oo. Assuming further that R(t) is monotonically decreasing for |tl > B1, an integral upper bound is given by n 1 7' B2 n IPTI < - (2.2.23) T=1 where we have set 00 B2 = B1 + 1 / R(t)dt. (2.2.24) a2' B1 The assumption of monotonicity can be dropped if the autocorrela.tion possesses a derivative which is integrable. Then, we replace B2 in (23) by 00 B2 = 1 (lR(t)| + |R'(t)|)dt. (2.2.25) o

55 This follows from the Euler-Ma.claurin summation formula. In the sequel, equation (23) will be designated as condition B, with the constant B2 given either by (24) or (25)0 Corollary 2o2.1: Under the hypotheses of Theorem 2.2ol and with R(t) satisfying condition A, from (16) and (19), the variance of the empirical distribution function is dominated by V(Fn(x)) < + - (2 B + ) (2.2.26) n l-p* n (1-)n6 Alternatively, if R(T) satisfies condition B, the variance is dominated by V(Fn(x))< ( + - ) ( 2.227) Note that the bound in (27) gives the same rate of convergence as in the case of independent samples. As easy extension of Theorem 2.21l, and one of practical importance, can be obtained by replacing the independence assumption on the Z random variables by one of M-dependenceo Definition: The random variables Z~ and Zm are said to be M-dependent if the variables Z~ and Zm are independent for Im.-|>M. In terms of the distributions, we have wa 2he zn2) - a(z1) a(z2) for em-QI > M, where

54 m_-(z1,z2) = Pr (Z < zl, Zm < Z2) ~ The extension is carried out by noting that the independence of Z and Z was first used in (9). In general, this equation now becomes x x Fm,(x, x) g2(yI1-ZI,y2-Z2;c5, P )dmQ(l(Z i, 2)dyi d2 -oo ~ 1 Zp ( 2 (2.2.28) We use (28) in the expression Fm _(xx)-F (x), add and subtract (9), and group the terms so as to display the Z dependenceo Designating the resulting expression by Dm_ we obtain x x l ni j g2(Y1-Z1, Y2-z2; c,Pm,9-)g(Yl-Zl;5 )g(Y2-Z2; ))) _-o -do Z dyZ da(-zl)da(z2)dyi dye x x +: g2(Yl-zl,y22-Z2;C7, Pm_) tdamQ( zl,Z2) -dC(z)da( z2)} ~ -c-1 Z 1 Z2 zdyl dy2 (2.2.29) The first term on the right is the same as before and is bounded by (15,) The second expression is easily dominated by: Dm-_ < 2 for L-m M < M and Dm- = 0 for I|-ml > M Using these bounds and the previous results we have Corollary 2o2o2: Given the hypotheses of Theorem 2o2o1 but with condition i) replaced by one of M-dependence, Then, the variance satisfies

55 lim E(Fn(x) - F(x))2 = im V(Fn(x)) = 0, n-loo n- oo uniformly in x. In addition, if R(t) satisfies condition A, we have the bound V(Fn(x)) < +4(M- 1) + )n) (2.2.30) - n 1-p. n (30) With R(t) satisfying condition B, the variance is dominated by V(Fn(x)) < n (1 + 4(M-1) + 2 B ) (2.2.51) 1 -p* We now replace the M-dependence assumption with an ergodic requirement. Suppose that the stationary sequence (Z~J is ergodic. Now, the weakest condition we have imposed on the correlation function of the gaussian process (R(t) -+ 0 as t - oo) implies that the spectrum of the process is continuous. This, in turn, is a known necessary and sufficient condition for the gaussian process to be ergodic.l Since Ni and Zi are independent it follows that Xi = Ni + Zi is an ergodic sequence. We have previously defined the random variable Ug as: U2(Xj) = 1 n if Xi < x, and = 0 otherwise. Since Fn(x) = Z U,(X,) and E(Ue(Xe)) = -~x) n~ 1=1 E(U2(X))) = F(x) < 1, we can use the Mean Ergodic Theorem([16], p. 16) to get: lim E(lFn(X1,X2,...X;x) - F(x)12) = 0 (2.2.32) n-wo for every x. Hence, we always have mean-square convergence to some F(x). In addition, with the (X)} sequence ergodic, we have from Birkhoff' s ergodic 1Grenander, U., "Stochastic Processes and Statistical Inference," Arkiv fur Matematik, vol. 17, 1950, pp. 195-277.

56 theorem ([16], p. 18): lim F (X1,X2,...,Xn;x) = E(Fn(Xi...Xn;x) = F(x) (2.2.33) n-lo with probability one. Since Fn(x) converges with probability one to F(x) and in mean-square to F(x), it follows that F(x) = F(x) with probability one for every x. Thus, convergence of Fn(x) to the true distribution function is ensured under an ergodic condition on the (ZQ) sequence and the above condition on the autocorrelation function of the gaussian process. What we do not have is a measure of how fast the convergence takes place. We now want to find what conditions are required to characterize a rate of convergence. In doing this, we will also directly verify that F(x) = F(x) when the Z process is ergodic. Consider the expression for V(Fn(x)) in the case the Zi are dependent variables: F( \ dent vaV(Fn()rs) - V((x) (1 - F(x)) n n 2 - D7 ~=1 m=~+l _ () (1- F(x)) n n + 2 E (n-T) D (2.2.16) Tli 7=1 From equation (29), DT is the sum of two terms. The first part of DT has already been bounded by (15): V(Fn(x)) F() (1-F(x)) + (nT) T=l1 - PI

57 n 2 V D designates the second term of D, T2 T x x DT2 / / / / g2(Yl-Z1y2-Z2;,PT) [dc(Z1,Z2)-d( zl)dC( Z2) ]dyldy2 oo -oo Z1 Z2 (2.2.55) Performr the y integrations first and let G2 denote the second-order gaussian distribution functiono DT-Y 2 G2(x -Zlx-Z2; r7Y PT)-G(x-zl;a)G(x-Z2;a)] Z1 Z2 z1 z2 [du (z1, z2:-dou( zj)du( Z2) Z1 Z2 We have added and subtracted the quantity x x G( G(x-z.; G(x-2; ) g(y-zl; )dy g(y-z2; )dy -00 -o00 in the integrando Using our previous results (see (13)-(15)), we easily dominate the first expression on the right side of (56) by (2!p j/(l-1p j))o The bound for the variance becomes' V(Fn(X)) < F(X (1 - F(x)) n.6 Z iP;2 " / (n~.r) -~n 1 x- jp ~5b1

58 n + 22 (n-T) G(x-zl;o)G(x-z2;a) [da (Zl,Z2)n T T=l Zi Z2 da(zl)do(z2)]. (2.2.37) Aside from the constant 6, the second expression is the same as (16). Hence, if R(t) -+ 0 as t + o, this term tends to zero. We now show that the ergodic condition on the (Zr} sequence is sufficient to have the third expression of (37) go to zero. For the stationary sequence (Zn), n = 0, 1, 2,..., a condition equivalent to ergodicity (Rosenblatt [34], p. 110) is: n lim 1 P(BnT A) = P(A)P(B), (2.2.38) n->oo n j=l where A and B are any two events defined on the underlying probability space. P(A) denotes the probability of the set A and T is the unit shift transformation. We take the elementary points of the probability space as co = (.....,cw co(L1,...), where the coi are real numbers and define the random variable Z (wo) = n. Equation (38) holds for any measurable set defined on the probability space. In particular, with A = (w1Z nl() < z1) T -A is the set T-JA = (ClZnl(T c) <Z1) -= (Zn +j(co) < zl Let B = (cD|Zn(W) < Z2). The stationarity assumption gives: P(A) = aZnl(Z1) = c(zl) P(B) = g.n (z2) = a(z2)

59 P(BNATAA) = Pr(Zn2(Wc) < z2, Znl+j(C) < zi) = c(n+jjn( Z,Za) o Define a second-order distribution function as n (z1,z1 ) _ 1 7 (zz2) (2o2o39) ~nnl-n n = n n1-n2+j j=l Then, equation (38) implies nlim (Zn,nn ax(z) a(z2) (2.2.40) n+ no n -n a for every z1 and za and every nl and n2a Returning to the variance expression, designate the double integra.tion in (37) by D' ~ 2 D 2= G(x-z;cG) G(x-z2;c) [da (Z1,Z2)-da(zj)doa(2)1 z (2.2,41) 1 2' Zi Z2 Define the partial sum s:n by n s ) D n 1 2 and the partial Cesaro sum Sn as n Sn n j j = The third expression of (37) can then be written as 2 (n- - D (2242) n1)n

6o Consider the arithmetic mean of the partial suims n n = G(x-Zl;)G(x-,2;:) C ) [dao(ztz1) - da(zl1)da(z2)] z 1 Z2 rT-l (2o.2.43) From the ergodic hypothesis and equations (38)-(40), as n - (oo we have n n ZlA U(ilz) U(z) (2) for every zl and z2o Applying the Helly-Bray Theorem (Lobeve, [2411, p. 183), we get that sn/n + 0 as n -+;o It then follows (see the discussion below (19)) that the arithmetic mean of the partial Cesaro sums Sn 1/n - 0, Hence, (42) and V(F (x)) tend toward zero as n 0oo n' We have just shown that an ergodic assumption on the sequence Z and a somewhat stronger assumption on the gaussian noise implies that V(Fn(x)) - Also, we are now in a position to investigate the rate of convergence. If we were dealing with i:ndependent samples, the rate of convergence would be C(l/n) For definiteness, we will consider this particular rate Clearly, if R(t) satisfies condition B, the second expression on the right side of (35) will be O(1/n) For the third expression of (37) to be of the same order., the sequence of Cesaro sums, S., must either oscillate between finite bounds or converge. There are necessary and sufficient conditions for Ceasaro summabilityo For example, we have: We will discuss this ass'm.ption on the gaussian noise latero 2Ksnopp K:,, Theory and Application of Infinite Seriess Hafner Pubo Co., 1950, p. 486.o Translated from the second German Editiono

61 a necessary and sufficient condition for a series Z an, with partial sums Sn, to be Cesaro summable to the sum S is that the series 00 av i) Z - v=O should be convergent and that for its remainder _>l_ an+a ^ --- + a..+(n= 0 n n+ - n+s3 the relation ii) S + (n - 1) p -- S holdso With D = a., the above conditions, in conjunction with condition B on R(t), yield the rate 1/no These conditions, however, are difficult to interpret in terms of the {Z~) process We shall content ourselves with a. simple sufficient condition which admits some interpretation and which at the same time is not an overly restrictive assumption for the type of problems we will want to deal witho The Cesaro method is regular; that is, if Z D converges to s, T r2 then Sn also converges to so Hence, a. sufficient condition to achieve the rate 1/n is: 00 2 = B3 < oo. (22.2.44) As our final result, we have

62 Theorem 2,2~2: Given the sequence of identically distributed random variables, Xi = Ni + Zi i= 1,2.... Assume: i) R(t) satisfies condition B ii) D as given by (41) satisfies (44). Then, the empirical distribution function is a consistent estimate of F(x) with the variance dominated by E(Fn(x) - F(x)) V(F (x)) (1+ B3) n n n 1-p* In addition, if iii) the sequence (Z.) is ergodic, Fn(x) converges to F(x) with probability one for every x. The term 2B2 is contributed by the first part of D which is due to the correlated noise, while 4B2 results from the bound on the first part of D. Among other things, (44) implies that D -e 0 as T - oo We have n 1 ~ seen that ergodicity implies only - D + 0 as n -+. To characn T=l T12 terize the type of process which satisfies the assumptions of the theorem, we can replace the ergodic assumption on the {Z) (equation (38))by the stronger condition lim P(BnT-JA) P(A)P(B). (2,2.45) n- oo This is called a mixing condition (Rosenblatt [34], po 110),, The mixing condition implies that aT(z2,z2) -+ C(zl) ca(z2) for every zl and z2, and from the Helly-Bray Theorem we have lim DT = 0 Thus, one class of T'am>

63 processes which satisfies ii) of the above theorem is the mixing processes whose dependency is weak enough so that 00 X / X ida (z.l,z2) - da(zl)da(z2)1 < 00 (2o2o46) T=1 Z1 Z2 is satisfied. From the definition of DT,2, this condition then implies ii) of the above theorem. We observe that if the sequence (Z.) is M-dependent, it satisfies the mixing conditiono In this case, ii) of the theorem and (46) are obviously satisfied. If the ([Z] sequence were gaussian, condition ii) would be satisfied if its autocorrelation function satisfied condition Bo A further characterization of processes which satisfy condition ii) is given in section 20 7 The condition R(t) -+ 0 as t co, as we have already remarked, implies ergodicityo The implication does not go the other wayo In fact, this condition on the correlation function is both a necessary and sufficient condition for the gaussian process to be mixingo2 What we have done in this section is to employ the Mehler formula. to dominate the integral Rosenblatt, [34], p. 110, shows that a stationary process of independent, identically distributed ra.ndom variables satisfies the mixing condition. The extension to a M-dependent process is easyo ~Rosenbla.tt, Mo, "Independence and Dependence," Proc 4th Berkeley Symposium Math. Statistics and Probability (1961) vo 2, ppo 431-4430

64 ff Ig2(x1-z1,x2-z2;a,pT) - g(xl-;o)g(x-zg 2;Co) dx dx2 The bound we have obtained is independent of zl and z2 and is given in terms of the correlation coefficient p. By specifying the manner in which the correlation function R(T) goes to zero, we then obtained a bound on V(Fn(x)). When the Zi are dependent, we have to require a condition like (44) so a.s to specify a rate of convergence. In estimating the density function f(x), we will make use of these results. In all three methods which we present the variance of the estimate is dominated in the same manner; the part of the variance expression which is due to the dependency of the observations is written as the difference of two expectations involving the appropriate bivariate and univariate density functions. 2.3 ESTIMATE OF THE DENSITY FUNCTION —KERNEL METHOD In this section we consider a method of estimating the density function which is analogous to that used in estimating the spectral density of a stationary time series. This approach ha.s already been applied to the case of a sequence of independent random variables [25,26,33,46]. We will generally follow Parzen [26]. The density function we want to estimate is given in (2.1.4) and repeated here, f(x) = f g(x-z;a)da(z). (2.3.1) z

65 From the observations X., i=1,2, n on, we take an estimate of the form: n fn(XlIX2,.ooXn;x) = fK) (xh() (2532) nh(n) =1 h(n) where h is a sequence of positive numbers depending on n, and chosen so that lim h(n) = 0 (235.3) n+- oo K(x) is a non-negative function satisfying sup K(x) < 0 o00 < x < 0o f K(x)dx < oo (2%354) lim |xK(x)l = 0 2o3a Bias Calculation The expectation of (2) is E fn(X) f KE(= h ) K(x) f(y)dy (2 35) n 1 j h The following theorem. (specialized to our situation) is given in Parzen. Theorem 2.Sol (Parzen, pE 1067): With h and K(y) satisfying (3), (4), respectively, we have +00 limE fn(x) f(x) () K(y)dy (236) n-d~ _-o at every point x of continuity of f( )e

66 A With JK(y)dy = 1, and since f(x) is everywhere continuous, fn(x) is asymptotically unbiased estimate of f(x) for every x. In fact, since the gaussian density is uniformly continuous, f(x) is also uniformly continuous. It then follows from Parzen's proof that the convergence is uniform in x. Our particular density f(x) is more specialized than that needed for the proof of the theorem. We can use some of its properties to obtain a uniform bound on the bias. Rewrite (5) as +00 E fn(x) = K(u)f(x-hu)du. (2.3.7) -00 Since fK(u)du = 1, we can write +00 E fn(x) - f(x) = K(u)[f(x-hu) - f(x)]du (2.3.8) -00 To find the limiting behavior of the integral, we expand f(x-hu) in a Taylor series about the point x. Since f(x) is the convolution of a distribution with a gaussian density, all derivatives of f(x) exist. f(x-hu) = f(x) - hu f'(x) + h2u2 f"(x) + O(h3) (2.-.9) 2 Choose K(u) as an even function and require that +00 u K(u)du = B4 < o. (2.3.10) -00 Two examples of even, non-negative kernels which satisfy this condition (as well as (4)) are:

67 K(u) = Lu < 1 2 = 0, otherwise. 1 -_ K(u) - e 2 They also integrate to one, Substitute the Taylor series into (8), and perform the integrations. As n + a, we get E en(x) - f(x) > B4 h. (2..311) To obtain a uniform bound (and for future reference) we note that the derivatives of f(x) are uniformly bounded in x. Specifically, from (Ae9) it follows that d - fx - g(x-z;&)doa(z) He (-) g( 3)doa(z) z z and dif(x) 1 c c.47 djfx < ~f / da (z) a 2 -- * (2 3512) CT... 12) j+c dx' ~;' ITi ( I i j" The last line follows from. Cramer's bound, (A.30). With j = 2, we have If"(x)l < cl /4T 3 and a.s n -+ o (11) is dominated by IE fn(x) - f(x)i < cl B4 h + (h4) (2 5) 2.7. cr3

68 It is advantageous, in terms of bias to have h(n) go to zero rapidly. Consideration of the variance of the error, however, will show that it should not approa.ch zero too rapidly. 2.3b Variance Calculation The square of fn(x) is written as n n A 2 1 x-X 2 XXm f (x ) nha l K( X) K( X (2.3.14) n n nh h " Q1=1 =1 m=+l We proceed in a manner analogous to the development in section 2.2. Take the expectation of (14), subtract nh2 n n L h E { h ) ~=1 m=~+l and add its equivalent 1 1 x 2 (1 - ) [E(K(-X) ] h7 n h Subtracting the square of the bias, we obtain the variance: V(fn(x)) = E(fn (x) - E fn(x)) nh2 n n 2 V V^x-X X-X ( x-X _ (x + ) E7K(-7)K(-)J-E4K( ) )JEK( h)} i (2..15) n2 h2 h iJKh nKhj 2=1 m=+l'Again, the second term is a result of the dependency of the observations. We proceed to majorize each of the terms in the variance expression.

69 The first term written out is +00 - E( )} = -- K(2X) f(y)dy (2K3.16) nnh2 h nh h -O0 From the first two conditions of (4) we have fKI(y)dy = B < oo. Since f(y) < 1l/J C, the substitution of u = (x-y)/h in (16) leads to 1 E^ ^ L < B5 (2B3.17) nh h }- nhJ a (2Y5,17) The second term of (15) is 2 +00 1 1 -X12 E [EFK(-11 T2 K( ) f(y)dy (2.3518) nh L' h nh h Use the boundedness of f(y), the same substitution as above, and the fact that K(z) is non-negative and intergrates to one to obtain nh2 E K( ). 2 (2.519) nh2 L h J n 2jta2 For the third term of the variance equation, let Qm-, be the expression inside the double sum. Writing this term out gives QK(x-Y -)K(X - hY2) f (Yl Y2)-f(Yl)f(Y2) dyldy2, h h fJ (2-5.20) where fm-,(Yl,.2) is the second-order density function of the observations in the m and I intervals. The kernel K(y) is bounded (4), say by B7. Hence, 2. P. _ n- / /J| / \ i/ \ A/ \ | i _ 7 (25\l

70 and we are in a position to use the results of the previous section (see equations (2.2.10)-(2.2.15)). For example, with the sequence (Zr) independent, m B2 <PmI1 (2.3.22) Qm-~ <- B?2 I m 1 - IPm-7l Combining this with (17) and (19), (15) is majorized by n fn(X)) < n 2 nh + h22 2E X -n 2rTt2 nh 2 hn 2 1 - IPm_1 ~=1 m=A+l With the autocorrelation function satisfying condition B, (2.2.23) gives 2 V((x)) < 1 1 + 1 Bs + 2BB2 (2..23) n 2.ca nh a nh (l-p*) Under these conditions it follows that we need to require nh2 + oo as n + oo for consistency of the estimate. Notice that if the observations were independent, the third term would be absent. In this case (as in Parzen's development) we need only require nh -+ o. In Appendix C, section C.1, we show that by choosing K(u) as the gaussian kernel, u /2 K(u) -- e (22) is replaced by QM < h2 1Pm- c (2.3.24) 1 -,l -'

71 The reason this is possible is that with the specific gaussian kernel we can perform the y integrations in (20) before taking bounds. In this case, the variance is then dominated by A b 1 b2 V(fn(X)) < + (23.25) Vn(x) - <n nh' where we have set 2b 1 + 2B2 CL 2a2 ( 1-* ) 2Ta b2 B5/ 42iT The rest of the discussion will assume a gaussian kernel, and for definiteness, we assume that the autocorrelation function satisfies condition Bo It was this condition on the noise that gave, for the empirical distribution function, a rate of convergence equal to the case of independent observationso We expect analogous results for estimating the density functiono 2o3c Mean-Square Error The mean-square error is written in terms of the bias and variance contributions o E fn(x) - f(x) E(fn(X) - E( (x))2 + [E( f(x) - f(x))] As n + o, from (13) and (25), and setting l = (C1B4)/('sI 2o3), we have

72 E(f (x) - f(x)) < bl h (253o26) n - - n nh Clearly, to minimize this bound, we choose h (as a function of n) so as to have the last two terms of (26) tend to zero at the same rate. Differentiating, we find tha.t the best h is given by 1/5 h b2 (235.27) 4b3n Therefore, as n + oo, the mean-square error satisfies A 2 4,5 E(fn(x) - f(x))2 0(1/n )5 (23528) This is the order of consistency one obtains for the case of independent samples [26,33]o Theorem 2_. 2: The estimate of the form n f(X1,X2, Xn X) = g(x-X;h(n)) converges in mean-square, uniformly in x, at a rate 1/n /5 if~ 1) (Z? are independent ii) R( ) satisfies condition B iii) h(n) is chosen as in (235.27) Clearly, we can extend the results for other dependencies on Zo For the M-dependent case, we ha~ve: Corollary 2_51: Under the preceeding hypotheses and with condition i) replaced by i) Z~ is independent of if tm-ll > M, the order of consistency remains o( l,/n')

75 This corollary follows from the comments in appendix C, (C.10), choosing h as in (27), and taking bl, as b, -- [ 4(.M-1 (2.3.29) 2ar r2 1 p* A We have obtained the result that the estimate fn(X1,...Xn;x) converges in mean-square to the univariate density at a rate not slower than 1/n / As will be seen in the next chapter, in attempting to estimate the k-variate density function f(xl,...xk), the bound on convergence which we are able to specify indicates slower convergence. Another disadvantage of this method is the problem of "growing memory." The estimate we have been using is of the form n fn(X1,X2,...X n;x) = K X), nh(n) =1 (n' from which it can be seen that all past observations must be stored-at each stage a particular observation's contribution to the estimate is weighted differently. This problem can be eliminated if one is willing to accept a final estimate which is biased. For this situation, we need only store the past N observations, where N is determined by tne bias one will accept. A recursive relationship is then used to update the estimate. One advantage of the kernel method is that the estimate is a density A function; fn(x) in non-negative and integrates to one. Another advantage is that no knowledge of the gaussian or z process is required to form

74 the estimate, and only a minimal amount of information is required to specify the rate of convergence. 2o3d Mean Integrated Square Error (MISE) Another criterion which has been used to measure the error is the mean integrated square error (MISE)o Using this criterion, one can specify an optimum choice of the kernel K(y) and investigate maximum rates of convergenceo It is primarily the rate of convergence which we now want to discuss. The MISE is defined as J = E [ (fn(Xi Xn;x) - f(x))2 dx (2.3530) We remark that the condition nh(n) - oo is sufficient to show that, with probability one, n fn(X,. X;x)) E g(x-Xp;h(n)) (2-3,31) rn(X1,~ n n = n =-1 is square integrable in Xo In fact, in appendix C we obtain a bound on the MISEo Our result (see section C.2) is that the MISE = 0(l/n /5), 1 which is the same rate obtained for the mean-square error. The question naturally arises as to whether one can specify a maximum rate of convergence using estimators of the above typeo This is for the case of independent Z and correlated noise with condition B holding

75 Watson and Leadbetter [45] consider the problem of optimizing the estimator of the form n n(x) = Kn(x-Xp) (2-3.32) ~=1 given a sequence of independent observations. They show that the MISE is a minimum if the Fourier transform of the kernel Kn(x), which we denote by MK (v), is equal to MK(v) = - M(vl) (2-3.33) n^1 + (n-l) IMf(v)|2 n n Mf(v) is the characteristic function of f(x). Notice that M (v) > 1 as n + co, indicating that Kn(x) approaches a delta. function as is the case with the previous estimator. Here, however, the kernel's functional form is dependent on the index n, which gives a more complicated estimator. They show that the minimum MISE cannot decrease faster than 1/n. Specifically, with the optimum kernel given by the inverse transform of (33)9 the minimum MISE is M (0) MKn() J n. = - - 0(1/n) Ii" n Watson and Leadbetter further characterize the optimum. estimator by studying the asymptotic behavior of the (unknown) characteristic function Mf(v).1 1The estimator in (33) is of no practical value since it is expressed in terms of the function being estimated. By specifying the asymptotic behavior of Mf(v), they show that there is a class of kernels, with the same asymptotic behavior, which achieves the maximum rate of convergence.

76 Of particular interest to us is the class of characteristic functions which decrease exponentially with degree r and coefficient y, A characteristic function is of this class if it satisfies: i) IM (v)| < A eA7vl, for some constants A > O y > 0 and 0 < r < 2. 1 nl, s' dt ii) 7.dt 0 as v o O 1 + exp (2yv) IMf(tv) |i Under these conditions they state the following theorem. Theorem 2.3.3: (Watson and Leadbetter, p, 490): Let Mf(v) decrease exponentially with coefficient y and degree r. Then Jn* the minimum MISE, satisfies l/r l/r lim [n/(log n)'l Jn =- [l/(2y)l/r n-oo For our case r=2 and with independent observations, the minimum MISE converges to zero at a rate l 7log(n)/no This assumes an estimate with the kernel given by (33) and represents an improvement in the convergence rate at the expense of a more complicated estimator. With the noise correlated and the sequence (ZQ) independent, we can obtain a corresponding expression for the optimum kernal expressed in terms of Mf(v) and the sequence of correlation coefficients (pr), Calculating the rate of decrease for the MISE is difficult. However, it is easy to show that the minimum MISE cannot decrease faster than l/n. These comments are substantiated by paralleling the development in [45].

77 It is this rate and the one for the independent case which we will want to use a.s a. point of reference while discussing the series methods of estimating f(x)o 2.4 ESTIMATING THE DENSITY FUNCTION BY SERIES METHODS-AN ORTHOGONAL RE PRESENTATION In this section f(x) is represented in an orthogonal expansion and it is this form which we will estimate, Our concern will be with convergence, not only in the sense of MISE, but in mean-square as given by (2o1.1) and (2.1,2). The density function f(x) g(x-z;a)dca(z) is a bounded integrable function. Hence, it is L2 a.nd can be expanded in a series of orthonormal functions: 00 f(x) = a cp(x/&a) (24.1l) Jj J0 j-o +00 a. = f(x) cp.(x/c:l)dx (2a4.2) -00 Naturally, equality in (1) is in the sense of limit-in-mean. The functions in the expansion are the normalized Hermite functions which form a complete orthonormal set on the whole line (see Appendix A, section Ao2): 1This technique has been discussed in [6], but not in any deptho

78 CP (x/Ai) =.,, J ^ (2.435) o1 is an arbitrary positive constanto As discussed in appendix A, we reserve the H notation for the polynomials orthogonal with respect to the square of the gaussian weight. They 2 are generated by differentiating g (x). The He polynomials are generated by differentiating g(x) and are orthogonal with respect to the gaussian weight. These polynomials were introduced earlier. Given the sequence of observations (XIj, ~=1,2,..o,n, the problem of estimating f(x) (in the L2 sense) is reduced to one of estimating the coefficients aj, We designate the estimate of the aj coefficient at the A n-th state by aj n 1 It follows that these estimates are unbiased: n A 1^ E a.;n — E pj(X~/ 1) n nl / P.(x/cil) f(x)dx - a_ o (2.4,5) -00 A The mean-square error in the estimate is then given by the variance of ad n To calculate this variance we proceed as before: V(t aj E(. - E n E(ajn) - a E~~~~~~

79 n n n = ~ n2 4 J( / + 2 X I l j(X l)j - a2 2 Q=1 Q=1 m=~+l 1 tE(q^(X/aj)) - [E(X/O) )] } n n +2 7 tE(pj(Xj/O) Cpj(XM/)) - E(cpj(X/)) n 1 m= +l E (cp (Xm/o~)). (2.4.6) The first expression on the right is easily dominated using Cramer's bound. From (A.29), we have - - |Hj(x) e 2 < c1, (A.29) 2 jj7.' -where the constant cl is independent of x and j. Hence, l/4 1/2 |cp.(x/oi)I < ci/(r 1/ c2 (2.4)7) and the first expression in (6) is dominated by 2c /no For the second expression in (6), we again write out the expectations and use the above bound on cpj(x) to dominate it by n n + +o 2 c2 If (XlYX2) - f(xl)f(x2)ldxdx2 o (2.4.8) 2 2 M-i n =1 m= +l We have already majorized this term. As it will again appear in the sequel, we use our previous results to record the following lemmraso Lemrma 24o l1 Assume i) the autocorrelation function satisfies condition B

80 ii) the sequence (Zj} is M-dependent. Then, (8) is majorized by 2 C2 (2B2 + 4(M-1)) n l-p* Alternatively, if R(r) < o2/T for ITI > B1 and 0 < 6 < 1 (condition A), we have Lemma 2o4.2: Assume i) R(T) satisfies condition A ii) the sequence (Z~) is M-dependent Then, (8) is majorized by 21-p2 + ) c2M-l) (2.4.9) l-p* Fn (1-5)n/ n The proofs and appropriate definitions are given in the discussion leading to Corollary 2,2.1o Clearly, the M-dependence assumption can be replaced by a weaker condition as reflected in Theorem 2,2.2. Again, the point here is that the estimate of aj is taken as an average of bounded functions and the problem then reduces to one of dominating a. sum involving the absolute difference of the bivaria.te and univariate density functions, In the following discussion we shall refer only to Lemma 2.4.1. Theorem 2.4.1: The estimate as given by n an - X ~(v/) (2.44)' n Ln JL ~ =1

81 is unbiased, and under the hypotheses of Lemma 2.4.1, the variance is bounded by 2 V(aj )<- (2 + 2 B2 + 4(M-1)). (2.4.10) n 1-p* n Now define the estimate of the density function at the n-th stage by s(n) fn(x) = ajn (Cj(x/al), (2.4.11) fn Z-j jn j=0 where q(n) is an integer which depends on n. Consider the MISE: +00 Jn = jE(fn(x) - f(x)) dx -00 00 q = a. + E(an- aj) j=q+l j=l a + C3 q(n) < 2 a. +,.(2.4.12) --, J n j=q(n)+l 00 Since E a. = ff2(x)dx < oo, it follows that the first term of (12) goes jml J to zero if q(n) -> oo. With q(n) properly chosen so that the ratio q(n)/n-o, we have Jn+ 0. The problem of specifying the sequence (q(n)), n=l,e,..., is analogous to choosing the constants h(n) in the previous section so as to balance the bias and variance errors. Here, a partial answer is provided if we assume that the random variable Z has, for example, a finite second moment,

82 zadc(z) < oo. Below, we will show that this assumption implies that -00 the coefficients aj satisfy aj B6/j (2.4.13) where B6 is a constant independent of j. With this being the case, from the Euler-Maclaurin summation formula, we obtain: 00 C00 aKB U Be6 + )dt Be(-+ ) q(n)+l q+l q (2.4.14) The MISE is then dominated by i! T(1+ C3q(n) Jn < Be (q - 7q) + c-3n 1/2 1 and with q(n) n n' This does not compare favorably with the 1/n rate obtained by 1/2 the kernel methodo Here, however, the rate 1/n /is not a direct result of the method of estimationo but rather, results from the second moment assumptiono After proving (13), it will be clear as to what additional moment assumption is needed to achieve any rate up to l/no Lemma 2o43: With f z2dC(z) < c> the coefficients a. satisfy -00 J More precisely, since q is an integer, we choose q = [n], where [ ] denotes the largest integer< n o

83 a2 < B6/j2 j - To prove this lemma we first obtain the conditions on f(x) needed to give the above bound and then show that the second moment assumption implies these conditions. Proof: From (3), +00 +0 a: = f(x)cpj(x/al)dx = f(x) g(x/cT)Hj(x/a1)dx I-00-00 (2.4.16) We note the relationship for the derivative of the Hermite polynomial which is obtained from (Ao28), d H. (x/al) - 2(j+l)Hj(x/a) dx J+1 C Substitute for Hj(x/li) and integrate by parts. + 00 d. / a(x) - ~,~ dx dx -00 -l +00 Hj+i(x/Ci) i - "1 L'/. JJ [dx f(x)g(x/al) dx 42_ j+l -(j+'lo / l7 CT 00 1 t t0 wp (x/O) [X (x - f'(x) =i dx. (345.17) Repeat the argument with the function (xf(x) - l2'f'(x)) playing the role IThe essential idea of the lemma can be found in Sansone [38], ppo. 368-369.

84 2 d of f(x): a= -1 0.H.2(x/a1) 2 al d gx/a1Kxf^x-Ilfx)j - 1 cp+(x/al) w(x) dx (2.4.18) 2 (j+l)( j+2)' _i where w(x) = ((x2-Ca2)f(x) - 2al2xf'(x) + oa4f"(x)). If w(x)eL2 we obtain a2 < w2(x)dx < B (2.4.19) ----,.....2 4(j+l)(j+2) -oo where B6 is the L2 norm of w(x)/2. Using characteristic functions, it is easy to see that w(x) is an L2 function by showing that the individual terms are L2. Define: +00 Mf(v) = / ejx f(x)dx -00 +00 Ma(v) = e-jVda(z) -00 Recall that X = N + Z, and that N and Z are independent. Then, 1 a2v2 Mf(v) = e- 2 Mc(v). (2.4.20) Clearly, v Mf(v) is L2. Then from Plancherel's theorem its transform f"(x) is L2. Similarly, if d- Mf(v) is L2, x2f(x) will be L2. With the second moment of Z finite, the first and second derivatives of MW(v) exist, are continuous and bounded. From this it follows that

85 d2 Mf(v)EL2. In this manner, all terms of w(x) are seen to be L2 dv2 functions qoe od It should be clear as to the conditions needed to guarantee faster convergence of the MISEo Assume that the r-th absolute moment of z exists, f IzlrdC(z) < oo. By repeated application of the method of the -00 lemma we obtain: 2 r aj K B6/j Z a B6 (( )qrl +(2421) j q+l B6 is now the L2 norm of the function _______ 1 d' ______ -7^ -7(g(x/oi)f(x)). g(x/l1) 2r/2 dxr (g(x/ We summarize this discussion in Theorem 2.4.2: Under the hypotheses of Lemma 2.4.1, the sequence of estimates q(n) ~n(XlXa,,.Xn;x) a7 Cp.(x/C1) (2.4.11) jn j j-o with q(n)+ oo and q(n)/n - 0 converge in the sense of MISE to f(x)o If izj da(z) < oo, r > 2, the MISE at the n-th stage satisfies the inequality

86 Jn E J (fn() - f(x)) dx < B6((-i1 — + q(n) + calq - (,))-1 q(n) - (2.4.22) where C3 is defined in (10). Choosing q(n) as the largest integer l/r < (Ben/c3) gives J= 0(1/n(r )/r) (2.423) We now want to consider the mean-square error for fixed xo With the A estimate f (x) as in (14), define the function q(n) fq(x) = a. pj(x/i) (2.4.24) j=0 The mean-square error, as a function of x, is given by E (f(x) - fn(XlX2.Xn;x)) = E (f(x)- ()) (f(x) - fq(x)) q + 2(f(x) -fq(x)) E(a.j- n) cp (x/i) j=0 q A A + E ((aj-a.jn)(ak- kn)) CP (x/o)Pk(x/ ) (2.4.25) j=0 k=0 Since ajn is an unbiased estimate, the cross-product term is zero. The other expectation of (25) is bounded by Schwarz's inequality and Theorem. 24 1: Elj n) (aak-akn )1< L(ain) kn)]1/2

87 Using (7), the third expression in (25) is dominated by c2q2(n)c3/n. Hence, we obtain the bound: E {(f(x)-fn(x)) } < (f(x)-fq(x))2 + c C3 q(n). (2.4.26) q n To continue the discussion we need to investigate the pointwise convergence of fq(x) to f(x). Theorem 2.4.3 (Sansone [38], p, 381): If f(x) is contained in L1 and L2 then, at a finite point xO, the series 00 Z a. (/) j=0 behaves like the Fourier trigonometric series of a function which coincides with f(x) in an arbitrarily small neighborhood (xo-h,xo+h) of xO. In particular if f(x) is of bounded variation in a neighborhood of xo we have 00 Z aj cPj(xo/ai) = 2 [f(Xo) + f(xo )] j=0 and again if f(x) is continuous and of bounded variation in (-0o,oo), then the series converges uniformly in any interval interior to (-0oo,o). Sansone's proof, which is attributed to J. V. Uspensky, involves showing that in a neighborhood of x0 the partial sum of the series can be made arbitrarily close to the partial sum of the Fourier expansion of the function in the same interval. This adaptation of Fejer's method

88 of Fourier series is also used by Wiener to obtain the same result. The above theorem is a special case of the more general result that for orthogonal functions of the Sturm-Liouville type, the L2 series of these functions behaves in the same manner at a point as Fourier's series do (Hobson [18], p. 771). The density function f(x) is L1, L2, and continuous for every x. It is also of bounded variation in any interval since f'(x) exists and is bounded (see (2.3.12)). Hence, in any finite interval, as n-oo we obtain q(n) fq(x) = a aj cpj(x/al) f(x), uniformly in x. j=0 Specifying a rate again involves a moment assumption. Assuming the r-th absolute moment of Z exists, where now r > 3 (cf. Theorem 2.4.2), yields: 00 If(x)-fq(x)l < I a C (x/ao ) j=q+l 00 < C2 l a. | j=q+l 00 < C2 BI jr/2 j=q+l < c2%f/ 1 -) - 1 ) (2.4.27) < C2~JB~I~r _ 1 r 2-l) q q7 N. Wiener, The Fourier Integral and Certain of Its Applications, Dover Publications, Inc., N.Y., 1933, pp. 55-67.

89 We have used (7), the first part of (21), and the usual. integral upper bound We are now in a. position to prove Theorem 2o4o4~ Assuming the hypotheses of Lemma. 24o1 hold, then, since f(x) satisfies the conditions of the previous theorem, the sequence of estimates q(n) fn(x) = ajnj(x/Cli) (2,4.11) j =1 converge in mean-square to f(x) if q -+ oo and q2/n + 0. This convergence is uniform in x for any finite interval. If the r-th absolute moment of z exists, r > 35 the mean-square error is dominated by 2 (2422 2 8) 1 /r Upon choosing q(n) = [nl/lr as n -+ co we obtain r-2 E 4(f(x) f (X)f = 0(1/n r n For the application of the empirical Bayes technique,'we shall need A the convergence of fn(Xn) to the random variable f(Xn)o The mean-square 1 error in this case is written (see (25)): En f (n) - fn (X1,...Xn;Xn)) = F ~(~"~r~iA 9 En (f(X) fxn n )). n. n n)) We use the En notation of Chapter 1o

90 = En (f(Xn) - fq(Xn))2} q(n) + 2En (f(Xn) -fq(Xn)) (aj-an) cp (Xn/ali J=0 i j~~~=O -k=0 Now, the first and third terms have already been bounded independent of the argument. With a r-th absolute moment assumption on the random variable Z, from (27) we have E (f(Xn) - fq(X )) < (< 2 (1 -)q'2 2 and as before, the third term of (29) is dominated by c c2 q. Using n these bounds and the Minkowski inequality, (29) is dominated by'-A 2-1 1 I"}2 Enf() - n(Xn)) < C2 q C n L (2.4.30) The fastest rate of convergence is obtained if q(n) is set equal to the largest integer less than or equal to (BGn/c3) /. This gives: 2 _ 1 2l] En f(Xn) - fn(Xn)) < C2 qr86 r/r ] r1 1 r C2 (B/ (2.4.1)):(=)l/r 117,(r-2)/2r n r22/2 B C3 J

91 Asymptotically, (31) is dominated by En ( (Xn)-fn(n)) (r-2)/r 2 2r Ln' n" -)- n r-<)C3 B< n 2 1 r-2 1 2 r 2r + C2 B cSr (2.4 32) Corollary 2.4.1: Under the hypotheses of Lemma 2.4.1 and with the r-th absolute moment of Z finite, r >, fn(Xn) converges in mean-square to f(Xn). The mean-square error is dominated by (31), and for large n we have En {(fn n(X) n)) = O(1/n( r). (2.4.53) It is not surprising of course, that we achieve the same rate as in Theorem 2.4.o4 In practice, we may want to hold q fixed, i.e., estimate an approximation of f(x). We then take the estimate q fn(x) = n cJ(x/ai). (2.4.54) j jn' q is now a fixed integer chosen according to some error criterion. The estimates of the q + 1 coefficients, (4), can be put in the recursive form jn = (nl)' + AJ(Xn/l)], j = 0,l..q (2.4.55) fn(x) converges in MISE to a function which differs from f(x) in

92 00 2 the L2 sense by ) aj, ioe., (12) gives q+l 00 E (f()-f (x)) dx < aj + (2.4.36) j =q+l In practice, it can reasonably be assumed that the second moment of Z exists. Hence, the asymptotic error in the estimate is then known to be inversely proportional. to the number of terms used in the series((2l))o Similarly, we can take fn(X ) as an approximate estimate of f(Xn)o With r greater than two, the mean-square error ((50)) is bounded by En jf(Xn)-n(Xn)) } C2 ( )q r ) 6.r2 -1) q N (2.4-37) The asymptotic error is lim E (f(Xn) -(Xn 01/ n-oo which is inversely proportional to some power of the number of terms usedo 2 5 ESTIMATING- THE DENSITY BY SERIES METHODS-AN EIGEUliNCTION REPRESENTATION The method we now discuss makes use of the fact that f(x) is the convolution of a known gaussian density function with an unknown distributionO This is in contradistinction to the previous methods which do not utilize this knowledge In the equation for f(x), f(x) g(x-z;a)da(z) -00

95 a(z) is the unknown quantity. To make use of this we want to express f(x) is a form which, in some sense, isolates 0(z)) This is accomplished by solving an eigenfunction problem associated with the above equation. We first observe that the "kernel" g(x-z;5) is not Hilbert-Schmidt; it is not square integrable in the x-z product spa.ceo We will display a function s(x) such that s(x) g-(x-z;a)dx dz < o x z A considerably more difficult task is to choose a s(x) so that we can solve for the eigenfunctions and eigenvalues of the operator s(x)g(x-z;a), ioe@, find the cp's and k s which satisfy +00 s(x)g(x-z;cr) cp.(z/al)dz X- p.(x/7y) -00 We shall find these quantities by obtaining the diagonal L2 expansion 00 s(x)g(x-z;o) C X cp.(x/7) pj(z/J) j O Note that by a. change of variables y x the right side of the expansion is symmetrical in y and zo Hence, the operator s( yy/rl)g( - z;cs) is also symmetrical and the above formulae reduce to more familiar formso For our purposes, it is more convenient to deal with the unsymmetrical operator, s(x)g(x-z; ) o Having found the' sS we define the coefficients associated with the distribution a( z):

94 +00 d = CP j(z/cZ/) da(z) j.1 -co Under suitable conditions, the quantity s(x)f(x) can be written: s(x)f(x) / s(x) g(x-z;c)dc(z) -00 o00 = E x dj Pj(x/y) j- =0 It is this form which displays the unknown quantities-the dj's. Consequently, we will estimate not f(x), but rather, the product s(x)f(x). Since s(x) turns out to be a positive function, it is then just a matter of dividing by s(x) to discuss the mean-square error. However, as discussed in section 1.2, when an equivalent test which incorporates s(x) can be found, the quantity we are interested in estimating is just the product s(x)f(x). We proceed to obtain the above expansion. Consider the gaussian density appearing in the convolution. g(x-z;a) is an L2 function in z for every x. Expand this function in the orthonormal series 00 g(x-z;a) = j cj(xaal)cp (z/aL) (2.5.1) j=0 where the cp are again the Hermite functions as in (2.4.3), and oa is J an arbitrary positive constant. Here the expansion is in terms of the indepenaent variable z. As indicated, the coefficients are functions

95 of x, C1 and a, and are calculated by +00 cj(x,Ia,a) = /g(x-z;a) cpj(z/al)dz.(2.5.2) -00 The evaluation of this integral is the main result of Appendix B. Make the definitions: 2 2 2 = a1 -aC 012+a2D 2 ((a12-a2)(a12+a2) (2.5.3) 2'~~2 1 C12 Then, from (B.10), j g(x; cJa+a2) Hj(x/)) c (xiaioj) = -(2.5.4) j2j'/ S**-\al and hence, 00 g(x-z;a) = g(x; o1-2+a2) jz/ ) (2.5.5) j=o J2Jj'./ C1l Now consider the function S(X fg(x;y) s(x) = / g(x; -) ~ 22 1/4 22 _2 ( 2 a2+a2)(a2a) 2 a+12 y (2.5.6) Let al be greater than a but otherwise arbitrary. Then, s(x) is a bounded function. It is also non-negative, L1, and L2. Multiply (5) by s(x)

96 00 s(x)g(x-z;ac) - cpj(x/7')CP(z/5i) (2 5.7) j =0 s(x) is just the right function to make the set of coefficients c.(x,cr) J an orthogonal.set (with respect to dx); or what is the same thing, (7) is an L2 expansion in the x-z product space with the coefficient a. = 0 ii if iij, aj*.-. In fact, we even have more. Since g(x-z;c) as a function of z satisfies the conditions of the theorem quoted earlier (Theorem 2.453), (5) converges uniformly in z for every xo Hence, we have pointwise convergence in x and-z. Multiplying by s(x) does not change this convergence and (7) converges pointwise to s(x)g(x-z;o). Since 5 < 1, it is easy to see that the error in the remainder term of (7) is dominated'by c1 sJ+l j=J Since the remainder is independent of x and z, (7) converges uniformly in x and z. Now write s(x)f(x) - s(x)g(x-z;/)d(z) and substitute (7)~ s(x)f(x) - x o c.(x/) n pm(z/ario da(z) (275.9) It is not difficult to calculate the Lq norms for both sides of (7)0 It is equal to Z (SJ)2 (aoioc2)/2 2

97 Define d. as the j-th coefficient associated with the distribution a( z). -00 +00 We justify interchanging the operations in (9) by the Lebesgue dominated convergence theoremo The result is: 00 J j =0 Equation (11) is an orthogonal representation for s(x)f(x). It also follows, in a number of ways, that the series converges pointwiseo In view of the inequalities Id.I < f Jcp (z/a1)Jd(x) < J J 1/41 12 dU(z) 1/4 1/2 (2.5l12) (T rE 1 ) ( J - ) and i|1 < 1, it also follows that q ls(x)f(x) V dj Cpj(x/y) j =0 c q+l ci( 2 51 < - (2o5o13) \/71 l-T and the series in (11) converges uniformly in X To estimate the quantity s(x)f(x), we proceed in a manner analogous to section 2o40 Take, as estimates of the product of the coefficients

98 t dj, the quantity n d = c j(Xl/y) s(XI), j = 1,2.... (2.5.14) i=1 The estimates are unbiased: A E(t dJ ) = E(cpj(X/7)s(X)) = f cpj(x/7)s(x)f(x)dx = J dj. (2.5.15) Since the function cp.(x/y)s(x) is bounded (uniformly in j and x) by (Pj(X/Y)S(X) < 1/4 1/2 2_ 2-a ()1/2 1/ 2 2/ 21/2 (- ( 12_)2j [(a +a)1/4 C4 2.5.16) in analogy to Theorem 2.4.1 (see (2.4.10)) we have Theorem 2.5.1: Underthe hypotheses of Lemma 2.4.1, the estimates S d. converge in mean-square at the rate 1/n: E 2 (dj -d nj. d < ~i 2 + + 4(M-1) = E; J (diJ" }- J jn -in n l-p* n (2.5.17) As the estimate of s(x)f(x) at the n-th observation we take q(n) s(x)n (x) = 7 Th d. q.(X/). (2.5.18) n L1d jn J j=0

99 Designating Jn as the MISE, we have: Jn = E f (s(x)(f(x)-f(x))) dx Z0 q Z' 2 2 P A 2 1 _ L 2 dj + E (dn-dj) (2o5.19) j=q+l j"l In view of (12) and the previous theorem., Jn is dominated by J'_ - =l2 2 2q(n) cEq(n) (2 ) JE2 + (2o5.20) J7 1 1_ For fixed n, this bound is minimized if q(n) is chosen as 1 c5 o_ r 1 ~In n ~n n q(n) - n n- = c6 -n (2o5.21) 2 n ccin 2 2n 2~nt The logarithm is'taken to the base e. Since 0 < ~ < 1, q(n) + 0oo Letting q(n) equal the largest integer less than or equal to (21), Jn is dominated by jv <(c2 6 1'-Kc+ (2 5.22) n \- 1/2/ )-2 n on n / Theorem 2~5~2: Under the hypotheses of Lemma 2.o4ol, the sequence of estimates q(n) S(x)f (x)' n djn (cp(x/y) n n jnj - with coefficients n J jn n C pj(X/7Y)s(X~) ^=11

100 converge in MISE to s(x)f(x) if q(n) is chosen as above. In this case, the MISE satisfies Jn = 0 ( ) (2 5.23) n n Observe that no assumptions on the moments of the random variable Z are required. The mean-square error for fixed x is given by (cfo(2.4.26)): 2 A xK~x)-?^))2) = s2 x)(f(x)-f (x))2 E(s2(x)(f(x)-f (x))) -s ( x)(f(x)-fq(x))2 q + 2s(x)(f(x).-fq(x)) E(.J(dj-djn))(pJ(x/7) j=0 q i+ X E(SJ (dj-djn)(dkd ))CP(x/Pj( k(x/7) (2 5.24) kn j =0 k=O fq(x) is defined implicitly by the equation q(n) s(x)fq(x) = ~ dj cpj(x/7) (2. 525) j=0 From Theorem 25,1l and (2.4,7), the third term of (24) is dominated by cl c_5 q (n) n The first term. of (24) is majorized using (13), and the middle term is zeroo Combining bounds yields;

101 4 2q(n) c C q2(n) E(s x) (f(x)-fn(x))2) < 2 2q(n) c nnol2 (1+-)7 n (2.5.26) The fastest rate of convergence is obtained if q(n) is set equal to the largest integer less than or equal to In n 2 In 5 Using the inequality (n) -1 < q(n) <, we have: 2 In 2 2 In E (x) (f(x)-fn(x)) < 4 n nLJ (g71Y)(l-g) n -) + c1c5 1 (n n)2 (2.5.27) 4( 7n In Theorem 2.5.3: Under the hypotheses of Lemma 2.4.1, the mean-square error (for fixed x) satisfies 2 E 2(x)(f( x)-f(x)2) = 0( —) (2.5.28) Consequently, the sequence of estimates q(n) fn(x) = -J d) n (x/Y) (2.5.29) j=0 converge in mean-square to f(x) at the same rate. This convergence is uniform in any finite interval.

102 To investigate the mean-square convergence of s(X )fn(Xn) to the random variable s(Xn)f(Xn) we write: En( S2(Xn)( f(Xn)-fn(Xn)) E n4s Xn)(f(n) fq(-n) )2 q(n) + 2En (Xn)(f(Xn)-fq(Xn)) J (dj-d jn)CPj(Xn/y j=0 q(n) k A A + E J (dj-djn)(dj-djn)Pj(Xn/Y)cPk(Xn/y7 (2 5.30) k=0 The first and third terms of this equation have just been bounded independent of the argument. From the Minkowski inequality, (13), and the expression below (25), we have (cf.(26)): En (X n)(f(Xn)- fn(X )) < n( n- -- - { -+ cjI sft (2o5.31)'C>l cl g(n) +n This expression is minimized for fixed n if q(n) is chosen as ~n(n) q(n) = C7 + n() (2o5o32) 21|ln I where _ (1/4 1/2

103 1 Setting q equal to the largest integer less than or equal to (52), we have: En{s (Xn)(f(Xn) -n(Xn))2 2cC7 1 c5 n ~el5 ~ jn (C7 r^ In (n) 2 Cor7(1-ay 2. 5'ind ertyem.qnI (2..4) (2.5.34) Corollary 2.5.1: Under the hypotheses of Lemma 2.4.1, the sequence of estimates q(n) A 1 j A s(X )fn(Xn) d. (X /) j=0 with q(n) given by (32), converge in mean-square to the random variable s(Xn)f(Xn). The mean-square error is dominated by (34). Hence, we achieve the rate: E s2(Xn)(f(X)-f(X)) = O( n ). (2 5.5) There are a number of differences between this method of estimation and the previous L2 series representation. To apply the method of this section the standard deviation of the noise must be known while in the previous method a moment assumption on the random variable Z is needed to specify a rate. Using the present method, we have obtained a rate of We have used q > c7 -1+ -n( n) in the first expression on the right side of (34). 2 a21n

104 2 ln n/n for the mean-square error (Thmo2.5.5) and ln(n)/n for the MISE (Thmo 2o5,2). For the first series method, with the r-th absolute moment of Z (r-2)/r finite, we had the rate 1/n / for the mean-square error (Thmo 2.4.4) and 1/n(r-1)/r for the MISEo In pra.ctice, we may want to hold q fixed, settling for an approximate estimate of s(x)f(x)o In this case, we designate the. estimate by q s(X)fn(X) =- djn Tj(x/7) j=0 Considering the MISE we have (see (20)): lim E f (s(x)(f(x)-fn(x)) dx = n+ n->oo 0 0 t = 2 2 2 q + 2 c 2q ci-CT2 Id <` 2 j=q+l d 2 -]72a J(2.53.6) Similarly, s(Xn)fn(Xn) converges in mean-square to s(Xn)fq(Xn)o The mean-square error is given by (31) and the asymptotic error is: lim E (Xn )-f(Xn)n(Xn))} n (t C4 (- 2 The comparison of rates for MISE is not really valid as different quantities are being estimatedo

105 C14 1 (2 2)2q+l (2.57) - 44 ( 2 229q+3 4 ~ - (a1 +- ) The asymptotic error in both cases decreases geometrically with q. In the previous method, the asymptotic error is inversely proportional to some power of q. Another significant difference between the two methods is when we want to estimate the density function c'(z). This will be discussed in section 414. Recall that ai is an arbitrary constant chosen to be greater than a. (a is the standard deviation of the noise samples.) For applications in hypothesis testing with a. minimum probability of error criterion, we will use a test function of the form poS(x)fo(x)-pls(x)fi(x) Since s(x) is proportional to x2 2 exp {- 2{ -(a2a 2)(a2+ a2) the free parameter ai can be considered as a scaling factor for the test functional. We shall discuss the problem of choosing oa in section 4.1. We remark that in the series method of section 2.4, ai is also arbitrary and would naturally be chosen to minimize the bound on convergence. For example, it enters into the MISE bound through two terms: B6 2r is proportional to a1 and c3 is proportional to 1/a1.

1o6 2o6 SPECIAL FORMS OF a(z) We consider the case where C(z) contains a finite set of unknown parameters which enter linearly into f(x). Examples are equations (1o3o5) and (1o3o7) with the set of a. priori probabilities unknown. We take (13oo5) k a(z) = Pi u(Z-Yi) (1-355) i=l where u(z) is the unit step functiono Then, t+00 k f(x) = g(x-z;c)doa(z) = p g(x-y.;a), (135.6) -0o i=l A and the problem is to find a sequence of estimates, Pi., which converge to the Pi for i=1,2,..oK. For the case of independent samples, this problem has been solvedo The procedure used is still applicable in our situation. A necessary and sufficient condition for the existence of the A sequence p n is that the signals (the mean values of the gaussian density functions) y.i i=l,2,,ooK, be distinct. The condition is clearly necessary for if yi = yj we can not distinguish between the hypotheses i,j, or the a priori probabilities pi, pj. Sufficiency is demonstrated by constructing the sequence. 1 Robbins [31]. For a discussion of this general type of problem see H. Teicher,'Identifiability of Finite Mixtures," Anno Math. Stat., v, 34, 19635, pp. 1265-1269.

107 We will show the condition that the signals be distinct is equivalent to having the K functions g(x-yi;a), i=l,...K, linearly independent. Postponing this proof till later, we now give a procedure for estimating the pio Assuming that the g(x-yi;a) are linearly independent, define: gij = g(x-yi;c)g(x-yj;a)dx = g(yi-yj; Nf o) (2,6.1) -00 G = the Gramian matrix whose elements are gij; i,j=l,o.oK -1 hij the elements of the inverse G g(x) = the K-dimensional column vector whose i-th entry is g(x-yi;a) o Consider the vector g(x) = G g(x), (2-6o2) whose i-th entry is given by k gi(x) - his g(x-yj;a) (2.603) j=1 Postmultiply g (x) by the transpose of the vector g(x) and integrate each element of the resulting matrix over x: f g(x) g'(x) dx = G f g(x)g'(x)dx - (2.6o4) I is the identity matrixo Hence, g.(x) is orthogonal to the space spanned by the g(x-yj;a), j=loooi-l,i+l, oooK

108

    \int \tilde{g}_i(x)\, g(x-y_k;\sigma)\, dx = \sum_{j=1}^{K} h_{ij} \int g(x-y_j;\sigma)\, g(x-y_k;\sigma)\, dx
        = 1  if k = i,
        = 0  otherwise.                                                                          (2.6.5)

As an estimate of p_i at the n-th stage, take

    \hat{p}_{in} = \frac{1}{n}\sum_{\ell=1}^{n} \tilde{g}_i(X_\ell)
                 = \frac{1}{n}\sum_{\ell=1}^{n}\sum_{j=1}^{K} h_{ij}\, g(X_\ell - y_j;\sigma).¹   (2.6.6)

The estimate is unbiased:

    E\{\hat{p}_{in}\} = E\{\tilde{g}_i(X)\} = \int \tilde{g}_i(x) f(x)\, dx
                      = \sum_{j=1}^{K} p_j \int \tilde{g}_i(x)\, g(x-y_j;\sigma)\, dx = p_i.      (2.6.7)

Observe that \tilde{g}_i(x) is a bounded function:

    |\tilde{g}_i(x)| \le \frac{1}{\sqrt{2\pi}\,\sigma}\sum_{j=1}^{K} |h_{ij}| = c_8(i).           (2.6.8)

Consequently, our previous theory is immediately applicable.

¹Observe that the "active" part of the estimate is the inner sum. The h_{ij} are constants computed before the estimating procedure begins.
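As a concrete illustration of the estimator (2.6.6), the sketch below computes the Gramian from (2.6.1), inverts it once, and then forms the estimates as sample averages of the functions g̃_i(X_l). The signal values, noise level, and the independent sampling used in the demonstration are assumptions made only for the example; the estimator itself does not require independence.

    import numpy as np

    def gaussian(x, mean, sd):
        """Univariate gaussian density g(x - mean; sd)."""
        return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (np.sqrt(2 * np.pi) * sd)

    def prior_estimates(X, y, sigma):
        """Unbiased estimates of the a priori probabilities p_i, following the pattern of (2.6.6).

        X     : observations X_1,...,X_n (one time sample per interval)
        y     : the K distinct signal values y_1,...,y_K
        sigma : standard deviation of the additive gaussian noise
        """
        y = np.asarray(y, dtype=float)
        # Gramian g_ij = integral g(x-y_i;sigma) g(x-y_j;sigma) dx = g(y_i - y_j; sqrt(2) sigma), as in (2.6.1)
        G = gaussian(y[:, None], y[None, :], np.sqrt(2) * sigma)
        H = np.linalg.inv(G)                                           # elements h_ij
        # g~_i(x) = sum_j h_ij g(x - y_j; sigma); the estimate is its sample average
        g_vals = gaussian(np.asarray(X)[:, None], y[None, :], sigma)   # n x K matrix of g(X_l - y_j; sigma)
        return (g_vals @ H.T).mean(axis=0)

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        y, p, sigma, n = [-2.0, 0.0, 2.0], [0.2, 0.5, 0.3], 1.0, 20000
        Z = rng.choice(y, size=n, p=p)
        X = Z + sigma * rng.standard_normal(n)
        print(prior_estimates(X, y, sigma))   # near [0.2, 0.5, 0.3]; individual estimates need not lie in [0, 1]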

109

By a proof which is essentially identical to that of Theorem 2.4.1, we obtain

Lemma 2.6.1: Under the hypotheses of Lemma 2.4.1, the sequences of estimates \hat{p}_{in}, i = 1,...,K, converge in mean-square to p_i, with the variance of the estimate dominated by

    E\{(p_i - \hat{p}_{in})^2\} = V(\hat{p}_{in}) \le \frac{c_8^2(i)}{n}\Big(1 + \frac{2B_2}{1-\rho^*} + 2(M-1)\Big).          (2.6.9)

We have identified the random variable Z_\ell as Z_\ell(\omega_j) = y_j. That is, the underlying probability space consists of the points \omega_1,...,\omega_K with P(\omega_j) = p_j. Z_\ell(\omega_j) is then interpreted as saying that in the \ell-th interval, y_j was transmitted. The M-dependence assumption represents transmission with a finite memory: the probability of transmitting y_j in the \ell-th interval and y_i in the m-th interval is p_{ij}^{m-\ell}, which need not equal p_i p_j if |m-\ell| < M. The extension to transmission with "infinite" memory comes essentially from Theorem 2.2.2. For example, we can require that

    p_{ij}^{\tau} \to p_i p_j   as   \tau \to \infty,
and
    \sum_{\tau=1}^{n} |p_{ij}^{\tau} - p_i p_j| \le B_3 n^{\beta},   0 \le \beta < 1.             (2.6.10)

Then, the variance of the estimate is bounded by

110

    E\{(p_i - \hat{p}_{in})^2\} = V(\hat{p}_{in}) \le \frac{c_8^2(i)}{n}\Big[1 + \frac{2B_2}{1-\rho^*} + 2K^2 B_3 n^{\beta}\Big].¹      (2.6.11)

Convergence takes place at the rate 1/n^{1-\beta}.

There remains to show that the g(x-y_i;\sigma) are linearly independent. Assume they are dependent. Then, with a_i \ne 0 for all i, the dependence assumption gives

    \sum_{i=1}^{K} a_i\, g(x-y_i;\sigma) \equiv 0.                                               (2.6.12)

Take the Fourier transform of (12) and divide out the common e^{-\sigma^2 v^2/2} term:

    \sum_{i=1}^{K} a_i\, e^{j v y_i} = 0.

Multiply through by e^{-j v y_k}. Using the mean-value property

    \lim_{T\to\infty} \frac{1}{T}\int_{0}^{T} e^{j v x}\, dv = 1  if x = 0,  and  0  otherwise,

and the fact that the y_i are distinct, we get that a_i \equiv 0, i = 1,2,...,K. This is a contradiction. Hence, the g(x-y_i;\sigma) are linearly independent.

The above procedure is clearly applicable to the case

¹The K^2 factor comes from bounding the term D_{\tau,2}, i.e., summing (10) over all i and j. See (2.2.41).

111

    \alpha(z) = \sum_{i=1}^{K} p_i \int u(z-y)\, dP_i(y),                                        (1.3.7)

where the P_i(y) are known distributions and the set {p_i} is taken as unknown. The density function of the random variable X is

    f(x) = \sum_{i=1}^{K} p_i\, f_i(x),

where

    f_i(x) = \int g(x-y;\sigma)\, dP_i(y).

Since the f_i(x) are bounded functions, so will the corresponding \tilde{f}_i(x) be bounded. Then, we need only require that the characteristic functions of the distributions P_i(y) be linearly independent.

2.7 SUMMARY AND GENERALIZATIONS

In this chapter we have primarily been concerned with the problem of estimating a special, but not unimportant, univariate density function

    f(x) = \int g(x-z;\sigma)\, d\alpha(z).

Given the sequence of dependent random variables X_\ell = N_\ell + Z_\ell, we have displayed consistent estimates of f(x) and have obtained bounds on the rate of convergence. We have given two methods of estimating f(x) and another method of estimating the product s(x)f(x).

To apply the kernel method, one must recompute the contribution of all past observations at each stage in order to obtain an asymptotically unbiased estimate. Using a gaussian kernel, we have shown that this method

112

gives a rate of convergence equal to O(1/n^{4/5}).¹ The estimate is a density function in that it is non-negative and integrates to one. No knowledge of the underlying process is needed to form the estimate or to specify the rate of convergence.

To use the first series method, we require a moment condition in order to be able to specify a rate of convergence. No knowledge of the gaussian process is needed to form the estimate. The eigenfunction representation for s(x)f(x), on the other hand, does require knowing the standard deviation \sigma. Both series methods may lead to estimates which, at some point of the sequence, are negative over a finite range of x. In practice, either series can be truncated, with the remaining finite number of coefficients estimated recursively. We have already pointed out the dependence of the asymptotic error on the number of terms used. We have also shown (Corollaries 2.4.1 and 2.5.1) that both series methods converge in the manner required to guarantee the convergence of the empirical Bayes procedure discussed in section 1.2.

Somewhat secondary to our purpose, but worthy of mention, is the fact that the L_2 representation can be applied to the more general problem of density estimation given a sequence of independent observations. Suppose the density function p(x) and its first three derivatives exist. Further, assume that x^3 p(x), x^2 p'(x), x p''(x), and p'''(x) are L_2 functions. Then, using the technique in section 2.4, we can estimate p(x) with a mean-square error of O(1/n^{1/3}).

¹This assumes appropriate conditions on the dependencies of the observations.
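The following is a minimal sketch of the gaussian-kernel estimate with the bandwidth choice h_n proportional to n^{-1/5}, the choice that yields the O(1/n^{4/5}) MISE rate quoted above. Because h_n depends on n, the sum over past observations is recomputed whenever the estimate is formed at a new stage, which is the recomputation burden just noted; the proportionality constant and the simulated data are assumptions for the example.

    import numpy as np

    def kernel_estimate(X, x_grid, c=1.0):
        """Gaussian-kernel estimate f_n(x) = (1/n) sum_l g(x - X_l; h_n), with h_n = c * n**(-1/5).
        Since h_n changes with n, every past observation's contribution is recomputed here."""
        X = np.asarray(X, dtype=float)
        n = len(X)
        h = c * n ** (-0.2)                              # bandwidth giving the O(n^{-4/5}) MISE rate
        u = (x_grid[:, None] - X[None, :]) / h
        return np.exp(-0.5 * u**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        Z = rng.choice([-1.0, 1.0], size=4000)           # unknown Z sequence
        X = Z + rng.standard_normal(4000)                # X = N + Z, sigma = 1
        grid = np.linspace(-4, 4, 9)
        print(kernel_estimate(X, grid))                  # non-negative, and the full curve integrates to one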

113

This does not compare favorably with the rate O(1/n^{4/5}) obtained by the kernel method, which requires a minimal amount of assumptions on p(x). However, in estimating a k-variate density function, we will see that the series method, with the same type of assumptions as in the univariate case, keeps the 1/n^{1/3} rate, while the kernel method leads to a rate which is considerably slower.

With the exception of section 2.5, the results we have obtained are not unique to the gaussian noise assumption, but hold for a class of processes of which the gaussian is the most prominent member. Specifically, the technique we have used is applicable to any stationary bivariate density function which can be expanded in the form¹

    p_2(x,y) = p_a(x)\, p_b(y) \sum_{i,j} a_{ij}\, \theta_i^{(a)}(x)\, \theta_j^{(b)}(y),        (2.7.1)

with

    a_{ij} = \int\!\!\int p_2(x,y)\, \theta_i^{(a)}(x)\, \theta_j^{(b)}(y)\, dx\, dy,            (2.7.2)

and where p_a and p_b are the marginal density functions of p_2(x,y). The \theta_i^{(a)}(x) are polynomials orthogonal with respect to the weight p_a(x). The Mehler formula is a special case of this expansion (with convergence already established) wherein

    a_{ij} = 0,\; i \ne j;\qquad a_{jj} = \rho^{\,j};\qquad \theta_j(x) = He_j(x).               (2.7.3)

¹Equation (1) is called the Barrett-Lampard expansion [3], and has found other uses in noise theory [7,22,25].
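The role of the diagonal expansion (2.7.1)-(2.7.3) can be checked numerically: for the bivariate gaussian case the partial sums of Mehler's series converge to the joint density, and the j = 0 term is exactly the product of the marginals. The sketch below verifies this at a single point using the probabilists' Hermite polynomials; it is only a numerical check, not part of the development, and the test point and correlation value are arbitrary.

    import numpy as np
    from math import factorial, pi, sqrt

    def he(j, u):
        """Probabilists' Hermite polynomial He_j(u)."""
        c = np.zeros(j + 1); c[j] = 1.0
        return np.polynomial.hermite_e.hermeval(u, c)

    def phi(u):
        """Standard univariate gaussian density."""
        return np.exp(-u**2 / 2) / sqrt(2 * pi)

    def mehler_partial(x, y, rho, J):
        """Partial sum phi(x) phi(y) sum_{j<=J} (rho^j / j!) He_j(x) He_j(y) of the diagonal expansion."""
        return phi(x) * phi(y) * sum(rho**j / factorial(j) * he(j, x) * he(j, y) for j in range(J + 1))

    def bivariate_gaussian(x, y, rho):
        """Standard bivariate gaussian density with correlation rho."""
        q = (x**2 - 2 * rho * x * y + y**2) / (1 - rho**2)
        return np.exp(-q / 2) / (2 * pi * sqrt(1 - rho**2))

    if __name__ == "__main__":
        rho, x, y = 0.6, 0.8, -0.3
        exact = bivariate_gaussian(x, y, rho)
        for J in (0, 2, 5, 10, 20):
            print(J, mehler_partial(x, y, rho, J), exact)   # partial sums approach the exact value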

114

That is, for the bivariate gaussian density, the expansion is diagonal and p_a(x) = p_b(x). It is this expansion which gave the desired cancellation of the product of the univariate gaussian densities and the first term of the series (2.2.11). Any bivariate density function which can be expressed as in (1) will give the desired cancellation, for it is easy to see that, in general, \theta_0(x) \equiv 1 and a_{00} = 1. Hence, the first term of the expansion is just the product of the univariate densities. Then, in analogy to the development in section 2.2, one can obtain the bound

    \int\!\!\int |p_2(x,y) - p_a(x)\,p_b(y)|\, dx\, dy \le \sum_{i,j} |a_{ij}(\tau)|,            (2.7.4)

where i and j are not both equal to zero. To dominate the variance expression corresponding to (2.2.8), the summability of

    \frac{1}{n}\sum_{\tau=1}^{n}\sum_{i,j} |a_{ij}(\tau)|                                        (2.7.5)

would be investigated. The interpretation of summability in terms of the underlying noise process would now have to be made with reference to all the moments of the bivariate density, and not just the correlation function as in the gaussian case.

In the Barrett-Lampard paper, the authors give another example of a bivariate density which admits a diagonal expansion and for which the

115

coefficients form a geometric progression. This is the case of narrow-band gaussian noise subjected to an instantaneous square-law envelope detector. The expansion takes the form ([3], Eq. 80):

    p(x_1,x_2) = p(x_1)\,p(x_2)\sum_{j=0}^{\infty}\big(\mu_1(\tau)\big)^{j}\, L_j(x_1/2\sigma^2)\, L_j(x_2/2\sigma^2),        (2.7.6)

where x_i is the square of the envelope, x_i = R_i^2, and the density function of R is given by the Rayleigh density. The polynomials in this expansion are the Laguerre polynomials, which are orthogonal on 0 \le x < \infty with respect to the weight

    p(x) = \frac{1}{2\sigma^2}\exp\Big(-\frac{x}{2\sigma^2}\Big).                                (2.7.7)

The quantity \mu_1 is defined as ([3], Eq. 72)

    \mu_1(\tau) = \frac{1}{\sigma^4}\int_{0}^{\infty}\!\!\int_{0}^{\infty} s_n(f_1)\, s_n(f_2)\, \cos 2\pi(f_1-f_2)\tau\; df_1\, df_2,

where s_n(f) is the power spectrum of the narrow-band gaussian noise and

    \sigma^2 = \int_{0}^{\infty} s_n(f)\, df.

Assuming that \mu_1(\tau) < 1 for \tau \ne 0, (4) reduces to \mu_1(\tau)/(1-\mu_1(\tau)), and the quantity (5) becomes

    \frac{1}{n}\sum_{\tau=1}^{n}\frac{\mu_1(\tau)}{1-\mu_1(\tau)},

in direct analogy to the development in section 2.2.

For the class of stationary Markov processes, Barrett and Lampard show that if the bivariate density admits a diagonal expansion then the

116 correlation function is an exponential function of time and the expansion takes the form 00 p(xl,x))=p(xi)p(x2) X e^ ej 9(xl)j (x2) o (2.7.9) j=O T is the distance (in time) between the random variables xl and x2. Wong and Thomas [47] have further characterized the Markov processes which admit the above expansion. This class is composed of three distinct types, all of whose univariate density functions belong to the Pearson system of distributions. This class consists of~ a.) the gaussian density with the associated Hermite polynomialso b) the density p(x) xa e-x and the associated Laguerre r(a+i) polynomials o For a=n-,1 n=0,l,2,... we have the chi-square distribution, and for ca=O we have the case just discussed ((7))~ 1 r(a+P+2) (1-x a(l+x) c) the density function p(x) = (l) +) 2a+p+i r(a+i)r(O+l) which represents the Pearson type I system. This includes the uniform density and the density function of a. sine wave with unit amplitude and random phase. The associated polynomials are the Jacobi polynomials which include, among others, the Legendre and Chebyshev polynomials. With the marginal density function of the Markov sequence given by one of the above, the corresponding biva.riate density function is given 1 Their technique is to reduce the Fokker-Plank equation to a Sturm-Liouville eigenvalue problemo Necessary and sufficient conditions are then found under which the eigenfunctions form a complete set of orthogonal polynomials. These conditions include a. differential equation which the univariate density must satisfy and which characterizes the Pearson system~

117 by (9) In the context of our problem, these results can be applied not only to the noise, but to the jZ.~) sequence as wello For suppose that the density function of a T(zl,z2) is given by (9)0 Then, the quantity DT (Eqo (2.2041)) can be written as: D = ff G(x-Zl;d)G(x-z2;a) aT(Z1,z2)-a(Z1)a( Z2) dzj dz2 T2 T < ff la(Ezl,z2)-a' (zXl)Ca (z) dz dz2 00 < j e J j=Consequently, a theorem analogous to Theorem 2~2.2 can be obtained with a, rate of convergence determined by the growth or summability of the quantity n 00 2 T V -^T h ) (n-T) > eJ T-1 j-= To extend the technique of section 2~5 to other noise processes, one must solve an eigenfunction problem where the kernel is the corresponding univariate density —function of the noiseo We suspect, but have not proved, that progress in this direction can be made for those processes whose univa.riate densities are the weights associated with the classical polynomialso These polynomials ([12], page 164) are just those polynomials mentioned earlier whose corresponding weights belong to the Pearson system, Most of the specific properties of the Hermite polynomials which we used

are common to the classical polynomials. In addition, those polynomials which are solutions to a Sturm-Liouville problem form a complete set and behave (in an interval) as does the usual Fourier series (Hobson [18], page 771).

CHAPTER 3

ESTIMATING THE DENSITY FUNCTION OF THE OBSERVATIONS-k-VARIATE CASE

3.1 INTRODUCTION

In this chapter our previous results are extended to an arbitrary k-variate density function. For the univariate case, convergence statements generally followed from the inequality

    \int\!\!\int \big| g_2(x_1,x_2;\sigma,\rho_\tau) - g(x_1;\sigma)\,g(x_2;\sigma) \big|\, dx_1\, dx_2 \le \frac{|\rho_\tau|}{1-|\rho_\tau|}.      (3.1.1)

This was derived in section 2.2, equation (2.2.15). In this chapter we will be concerned with dominating the integral

    \int\!\!\int \big| g_{2k}(x_1,x_2;M_\tau) - g_k(x_1;A)\,g_k(x_2;A) \big|\, dx_1\, dx_2,       (3.1.2)

where g_k and g_{2k} denote the k-variate and 2k-variate gaussian density functions. x_1 and x_2 are k-dimensional vectors: x_1 represents the k samples from the \ell-th interval and x_2 the k samples from the m-th interval. Both vectors have a covariance matrix A which, from the stationarity assumption, is independent of the interval. The covariance matrix M_\tau, of dimension 2k, is given by

    M_\tau = E\left\{ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \begin{bmatrix} x_1' & x_2' \end{bmatrix} \right\}

119

120 A K BT L, (3.1.3) with T = m-~, and' denoting the transpose. In the next section we majorize the 2k-fold integral by displaying a transformation which converts (2) into k double integrals of the form (1). This change of variables allows us to go directly to the Mehler formula without any further generalization of the Hermite polynomials introduced earlier. The majorant will now be a function of the eigenvalues of a certain matrix and not simply expressed in terms of the correlation coefficients. A rate of convergence can still be determined by investigating the properties of the autocorrelation function. Having majorized (2), the extension of our previous results will be obvious. For this reason we simply state the results in section 3.3, commenting when there is a significant difference in the technique or final result. 3.2 DOMINATING THE 2k-FOLD INTEGRAL Define the 2k-dimensional vector u as u = x}l, (3.2.1) and the matrix N (Of dimension 2k) by A I N = --. (5.2.2) i

121 Equation (3.1.2) is written as J ig~k( UIY)~ g~k(U) I du,(3 3) with du representing the 2k differentials dul...duk. As already mentioned, we want to show that a change of variables u=T-lv reduces (3) to k double integrals of the form (3.1.1). What is the same thing is to show that a. non-singular transformation T ta.kes MT into i C TM -- (3.2.4) TTI a,nd N into C 0 T'T = - - (3.2.5) T _0 C Here, C and R a.re kxk diagona.l matrices. This simulatneous transformation is accomplished in two stages. The first step is to reduce N and M to diagonal matrices. Let T1 be the transformation such that %2 T MT T1 = A = L. (3.2.6) 0 ~L ~-2k and I 0 T3 N T1 = _ —---. (3.2.7) 0 I This is the usual "double-diagonalization" procedure.

122 The second step is the reverse of this process. That is, we then show that there is an orthogonal transformation T2 such that C I RT I RT T' A TI = - (3.2.8) is \ T — ^, ^ ~ -' (5.2.8) R' I C R I i Since T2 is orthogonal (7), remains invariant under this transformation. The required transformation is then T= —T2. From (6), it is not immediately evident that AT (with 2k diagonal elements) is similar to (8) which has k free parameters (the diagonal elements of R). In addition, to apply Mehler's formula we will need the elements of RT strictly less than one. We will establish that AT and (8) are indeed similar, and that the elements of R are less than one. The only assumption which we will make is that MT be a. positive definite covariance matrix. By this we shall always mean strictly positive definite. What we are concerned with here is a generalized characteristicvalue problem, or what Ga.ntmacher calls the pencil of quadratic forms. We will use some results from Gantma.cher [14], Chapter 10, section 6. To avoid a. proliferation of subscripts we drop the T subscript for the present and since we are consistently dealing with vectors we will not use any special notation for them. Two real symmetric quadratic forms M(x,x)=x'Mx and N(x,x)=x'Nx determine the pencil of forms M(x,x)-XN(x,x), \ is a parameter.

123 Definition: If the form N(x,x) is positive definite, the pencil M(x,x)XN(x,x) is called regular. Definition: The equation |M-\NI = 0 is called the characteristic equation of the pencil of forms M(x,x)-kN(x,x). From (3), with M p.d. (positive definite), A is also p.d. since it is a principal minor of M. Hence, N is p.d., and the pencil of forms is regular. Theorem 3.2.1 (Gantmacher, page 310): The characteristic equation of a regular pencil of forms, jM-XNI = 0, always has 2k real roots Xj with the corresponding principal vectors zJ=( Zj,zj,...,z j) j=l,...2k, which satisfy Mzj = XjNz, j=l,2,...2k. (3.2.9) These principal vectors can be chosen such that the relations N(z,z3) = z NzJ = i, ij=l,..2k (3o2.10) are satisfied. From (10)-,it follows that the z, j=1,...2k are linearly independent. The existence of the required transformation T1 is assured by 2k Theorem 3.2.2 (Gantmacher, p. 314): If Z = (zij is a principal matrix of a regular pencil of forms M(x,x)-XN(x,x), then the transformation x = Zy (5.2.11)

124 reduces the forms M(x,x) and N(x,x) simultaneously to sums of squares 2k 2k Z 2 2 Exj y,I X y., (3.2.12) j=l j=l where \1,72,...,?2k are the characteristic values of the pencil M(x,x) -?N(x,x) corresponding to the columns z1,z2,.,z2 of Z. Conversely, if some transformation (x=Zy) simultaneously reduces M(x,x) and N(x,x) to the above form, then Z is a principal matrix of the regular pencil of forms M(x,x)-AN(x,x). The first theorem is proved by writing (9) as N lMz = 7jzj and then showing that N' M is similar to some symmetric matrix for which the characteristic values are known to be real, etc... The second theorem is the usual statement of the double-diagonalization process. As a consequence of Theorem 3.2.1, we know that the elements of A in (6) are real. A further characterization is obtained by noting that for any characteristic value and vector, we have zi Mzj = Xjkz, j=l,2,...2k. (3.2.13) Since M and N are both p.d., it follows that Xj > 0; j = 1,2,...2k. (3.2.14) We now want to show that k of the Xj are determined by the remaining k characteristic values. Let zl and z2 represent k-dimensional vectors and write z = [-Z1}-, 2where z is a principal vector of (9) Expressing M and N in terms of A where z is a principal vector of (9). Expressing M and N in terms of A

125 and B, we obtain from (9) the following two equations: (l-X)Azi + Bz2 = 0 (.2.15) B'zl + (1-X)Az2 = 0 (3.2.16) From (15) we have (l-A)zi = -A- 1B2. (3.2.17) Multiply (16) by (l-x) and use (17) to substitute for (l-X)B'zl. Equation (16) becomes ((-X)2A-BA-B)z2 = 0. (3.2.18) In an analogous manner we obtain an equation for the zi vector, ((l-X)2A-BA-1B')z = 0. (3.2.19) The 2k characteristic values %j must satisfy I|M-KjN = 0. They must also satisfy the characteristic equations associated with (18) and (19)o It is not difficult to show that the characteristic equations of the above two equations((l8) and (19)) are equal. Therefore, we need only consider one of them. Let r = (l-) (3.2.20) Equation (19) is written as Bellman, R., Introduction to Matrix Analysis, McGraw-Hill Book Co., 1960,;po 94...

126 (r2A-BA-B' )z = 0. (3.2.21) Observe that BA-B' is symmetric. Since A is p.d., (21) is a regular pencil 2 with the parameter r. From Theorem 3.2.1, we have that there are k real roots rj, j=l,2,...k, with the corresponding linearly independent principal vectors zJ, j=l,...k. The characteristic equation corresponding to (21) is a polynomial 2 of degree k in r. Viewed as a. polynomial of degree 2k in r, the 2k roots a.re equal to + rj, j=l,...k. From the definition or r, we have j = 1 +r = + r 3 J j (3.2.22) = =2 l-r j j+k 3 j+k = i - r = 1 - r, j = 1,2,...k. Since Xj is greater than zero ((14)), we have the bound jIrj < 1, j=l,...k. (3.2.25) Using these results, (6) can now be written as l+rl l+r2 0 l+rk (3.2.24) T1MT = A = l-rl 0

127 2 with the k values of r given by the roots of the polynomial IA-BA-B' - r2l = 0 (3.2.25) As an aside, we remark that the 2k principal vectors of (9) are given by = { j+k, $,2 |Z J v2 Z2 j where zli and z2j are the principal vectors (of dimension k) of (19) and (18) It is a. simple matter to see that the 2k vectors thus defined are linearly independent. It is only slightly more difficult to show that they perform the double-diagonalization, i.e., (12) is satisfied. We now show that A, as given by (24), and the matrix defined in (8) are similar. For this -it suffices to show that their characteristic equations a.re the same. From (24) we have, k A - yI = H (1-r - 2/ + 2) (3.2.26) i=l To evaluate the characteristic equation of (8), we need to specify the diagonal matrix R. Choose the k positive values rj = +fr2, j=l,...k, a.nd take R as rl r2 R =.' (3.2.27) rk

128 The characteristic polynomial of (8) is given by I R I 0I -— [ —- _y -— 4 —_R I_ O _ I_ (3.2.28) To determine this characteristic polynomial, we use the same argument as in going from (15) to (.19). We write I R L — -- -4.w wl - R I w2 w2 and obtain the characteristic equation for the vector wl (or we), ((I-y)2I - R2) wl = 0. (3.2.29) The characteristic equation in given by I(l-y)2I-R21, which is identical to (26). Hence, the existence of the orthogonal transformation T2 is assured. In the context of our problem, and to summarize the results to date, we have Theorem 3.2.3: We are given the 2k-order integral |g2k(u;M) - gk(u;N)|du (3.2.3) u A | BT -A 0 with M = --- a.nd N = --- LBT A O A.. If M is a positive definite covariance matrix, there exists a change of variables

129 u = T v = (T1T5)-v (3.2.30) with IT I 0 which takes (3) into 2k k k.f ---- g (vi,v +k;,ri) - g(vi;l)g(vi;l) dVl dv..dvk i=l i=l (3.2.31) The vj are scaler variables and the k values of ri are given by the roots of 0 = IA-1BA'B' - r2I. (3.2.25) Since Irk| < 1, i=l,...k, we can use Mehler's formula to majorize (31). Let the symbol j ~ 0 mean "excluding the term ji=j2=... k=O." Then, using the expansion for the bivariate gaussian density, we have: 2k k k f....I n 2 g2(vivi+k;l,ri) - i g(vi;l)g(vi;l) dvl...dv2k i=l i=l 00 ji 2k k r. -.. n j II g(vi;l)g(vi+k;l)) Hi(vi) He (vi+k) i=l Ji' Ji Ji j i=o dvl...dv2k, 0 0. (3.2.33) Bring the integrations inside the summation a.nd bound the integrals as in (2,.2.lk)..We then obtain Corollary 3.2.1: Let MT be a positive definite covariance matrix. It then follows that / Ig2k(u;Mr)-g2k(u;N)Id < ll

130 00 k I j,r j $ 0 i=l i s - - Ji=o - -1 + r, (3.2.34) 1-r. i=l i,T 2 where r., satisfies IA-1B A' B-r TII = 0, i=1,2,...k. 2 For k=l we have A=1, BT=p, r T=p2, and (33) reduces to (2.2.15). T' 19T T The majorant in (34) will enter into the variance calculations summed over T. We want to again interpret this sum in terms of the autocorrelation function R(T). In analogy to section 2.2, designate D v~1J 1+ / D' = 1 +11 T -^ L \ 1= 1-ri l T=l T=l 1iTJ k n n /- n (1-r.. Z i=! =1 L( Irl(l-r ), ) J Let r*=min (l-ri ). Since ri., < 1, r >. Then, (35) is dominated T>l by n 1 f k1 DT- < t ( 1- I 1 ri (3.2.36) T=1 T 1 Writing out this expression gives n n ~k k. 7 T - r < ) ri, - r,Trj. +... +(1)k r r...r T=1 T=i =1 i, j L D!j (3 2.37

131 The j-th inner summation gives kT./(j'.(k-j)') terms, where j=1,2,...1k. k There are then 2 -1 terms which are summed over T. Since IriT| < 1, for all i and T. any term which contains a r. as a factor is dominated 1zT by the single term ri appearing in the first inner sum. Consequently, (37) can be bounded by n n k k T — r. k T n (k k (3.2.38) T=l T=1 i=l From the Schwarz inequality we have k 2 k ( r < k ri i=l i=l and it follows that n k/2 k 1/2 T_ 1 ( I 1/T T1 k r. i=l The quantity Z riT is just the trace of the matrix (A BA B' ), Hence n k n r1V/2 (2k-1) D' < (2<l) Z trace (A B A-1B ) (3.2.39) T - 1/2 T T T=1 k r* T=i The trace is the sum of k terms. Each of these terms involves the product of the correlation function evaluated at different arguments. The matrix A m, which also involves the autocorrelation function, does not depend on T but only on the manner in which samples are taken in a particular

132 interval Define: -1 _1 aij = elements of A Rij( = elements of B, ij=l,2,..k (3.2.40) RB(T) = max IR. (T)| i,j 1J A(k) E Z.a i j The trace is given by k -1 -1 trace (A B A B') =j a j a j4 R.j (T) Rjj4 (T) Jl, j2,3j, j Using the above definitions, we obtain the bound trace (A -B A-9 ) < R2()A2(k) and (39) becomes n n k' < -) A(k) ) IR(T)I -T 1/2I T=l k r* T n = BS(k) |R*(T)I. (32.41) T=l We are now in a: position to use the results of section 2.2 to obta.in Corollary 3.2.2~ If R(T) satisfies condition A, from (2.2.21), we obtain Y Dt < Be(k) a(B1 + -- ) (3.2.42) U \ (I5)

133 If condition B is satisfied, then, from (2.2.23), we have n I D'T Bs(k) 2 B2. (52-45) T — T=l With the exception of when B2 is given by the Euler-Maclaurin summation formula, the constants B1 and B2 have the same meaning as in section 2.2. To use the Euler-Maclaurin formula, rather than sum R*(T) over T, we first sum Rj(T) over j, j=l,...k, for each T. That is, we sum all elements of the first row of BT for each T, and then sum over T. B2 is now given by (cf.(2.2.25)). 00 B2 = (k|R(t) + IRT(t)l)dt 2 3.3 ESTIMATING THE k-VARIATE DENSITY FUNCTION Rather than introduce a new set of constants we will use the same notation as in chapter 2 for those constants which play similar roles. 3.3a The Empirical Distribution Function The k-variate empirical distribution function is given by n F (x)= n X u() X (5-3 1) ~=1 The random variable U~(X ) is defined as The notation y is used to designate a vector.

134 U (xi) = 1 if X < x (353.2) = 0 otherwise Xi is the sample vector in the Q-th interval, so that by X < x we mean the set of inequalities X i < xi, i=l,2,...k, ~=1,2,...n. In analogy to Theorem 2o2o2., we have Theorem 3o3.l: Given the sequence of identically distributed stationary random vectors Xi = N. + Zi. with the k-variate density function f(x) = f g(x-z;A) da(z). (3 3 3) Define: D = I Gk(yl-z1;A) Gk(Y2-Z2;A) daT(zi,z2) - da(zl ) dx( Za ) ( 3 34) where Gk(y-; A) = g(x-z;A)dx -00 a,nd -(zl,z2) = Pr4 Z i <z, Z +T< z Assume i) R(t) satisfies condition B ii) DT satisfies T,2

135 00 ) D = Bo < o 1 T, 2 3 <00 T=l Then, the empirical distribution function is a, consistent estimate with the variance bounded by EI(F(x)-Fn(x))2 = V(Fn(X)) < [1 + 6 a2 B8(k) B2 + B3]. (335.5) 353b The Orthogonal Representation The k-variate density function (3) is expanded in the series f(x) = f(X,x2,...xk) =- ", a..' cPj(Xl/1) J.. pj (xk/k) jl jk (35.56) The c's are the one-dimensional Hermite functions as in (2.4.3). We will write (6) as f(x) = a. cp(x) (33..7) j A The estimate of a. at the n-th stage is denoted by a, n3 Jj n jn = cp (X, ). (3.3 8) ~=1 These estimates are unbiasedo Since the function cpj(x) is bounded (see f(x) is bounded by [(2t) |A|j] and therefore it is L2 in Rk.

136 (2A4.7)) by k k/4 1/2 lPj ()! < C1 /(kt (/ a. k) ) = C2, (3*3.9) in analogy to Theorem 2o41o and using Corollary 3.2.2, we have Lemma 53ol: Assume the sequence of vectors (Z] is M-dependent and that R(t) satisfies condition B. Then, E<aa) = V(a. ) 2- c2 (1 + 2 B8(k) B2 + 2(M-l)) = n^ = -j1n -| J11n n (3 3 -10) For k = 1, 2 Ba(k) B2 = 2B2 and (10) reduces to (2o4.10). 1 -p As an estimate of f(x), we take ql(n) qk(n) An -' A k n =.jk J0 * * k(k/ak) We shall set ql =.. = qk = q, and write the estimate as q(n) f (x) = nP (x) (353511) n L -ijn jj=0 Assume the r-th absolute product moment of Z exists: J f. Iziz2 ozkl| da(zl,z2, o.z2 k) < co, r > 2. (3512) Then, Lemma 2o4o3 can be directly extended to k dimensions. This results in the bound 2 =. a. = a... < B6/(j2l..'Jk) (>'.:_1)

137 and we have Theorem 353.2: The MISE of the estimate is 00 00 J = E (f(x) - (x))2d = a J n.. nJ~' a'Jk jl=q+l jk=q+l q q + y - -V E ( ^ -~j -. jk a, i. ~ * j jl=0 jk=0 With (12) and Lemma. 3.3.1 holding, we obtain the pound B ( 1 + k q Jn < B r ). c~ — + (3.3.14) n — (r6l) (qr-l q 1 Choosing q(n) as the largest integer < (BGn/c3)rk, the MISE satisfies r-l Jn = 0(1/n r ). (3.3.15) This is the same rate as in the univariate case. Naturally, the actual bound is different. The constant Be in (14) is now the L2 norm of a function defined in Rk. This function is r (g(xlal). g(x );a)) jj, j, (1'..: ~.Jk) (g(xlj;cyd [gf(x l..xk)g( x; al )...g(xk; )], dxl Ji..dx Jk k where the summation is to be taken over all possible integers with j i = i=l

138 To investigate the mean-square error, define the function q q fq(X) =' ajl...jk cj(Xl/al)... cpj(k/Ok) jil=0 k=o (3. 3.16) As in the one-dimensional case, this function converges to f(x) pointwise and uniformly in any finite interval. In particular, with r > 3, we have the bound k If(x) - fq(x)l < C2JB ( -1) q r -1 r _1 (3.3.17) In analogy to Corollary 2 4.1, we obtain Corollary 3.3.1: Assuming that Lemma 3.3.1 holds and that (12) is true with r > 3, we have En (fn(Xn) - f())2} Qf-1)( -1 q+ + + c) q n ^ — - r r r 2 q2 ( -3. 18) l/rk Upon choosing q ~ 1/n, as n + co, the mean-square error satisfies r-2 En (fn() - f(Xn)) O(1/n r) (3.3.19) Observe that this is the same rate as in the univariate case.
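For k = 2, the tensor-product estimate of the form (3.3.11), with the truncation point q(n) of order n^{1/(rk)} as in the preceding results, can be sketched as follows. The orthonormal scaling of the Hermite functions, the value of the free scale parameter, and the simulated data are assumptions made for the example; the coefficients are estimated by sample averages of products of Hermite functions, in the spirit of (3.3.8).

    import numpy as np
    from math import factorial, pi, sqrt

    def h(j, u):
        """Orthonormal Hermite function h_j(u) = H_j(u) e^{-u^2/2} / sqrt(2^j j! sqrt(pi))."""
        c = np.zeros(j + 1); c[j] = 1.0
        return np.polynomial.hermite.hermval(u, c) * np.exp(-u**2 / 2) / sqrt(2.0**j * factorial(j) * sqrt(pi))

    def fit_bivariate_series(X, q, scale=1.0):
        """Tensor-product estimate f_n(x1,x2) = sum_{j1,j2<=q} a_{j1 j2} psi_{j1}(x1) psi_{j2}(x2),
        with psi_j(x) = h_j(x/scale)/sqrt(scale) and each a_{j1 j2} estimated by the sample average
        of psi_{j1}(X_1) psi_{j2}(X_2) over the n observation vectors."""
        X = np.asarray(X, dtype=float)                                              # shape (n, 2)
        P1 = np.stack([h(j, X[:, 0] / scale) for j in range(q + 1)]) / sqrt(scale)  # (q+1, n)
        P2 = np.stack([h(j, X[:, 1] / scale) for j in range(q + 1)]) / sqrt(scale)
        A = P1 @ P2.T / len(X)                                                      # coefficient estimates
        def estimate(x1, x2):
            v1 = np.array([h(j, x1 / scale) for j in range(q + 1)]) / sqrt(scale)
            v2 = np.array([h(j, x2 / scale) for j in range(q + 1)]) / sqrt(scale)
            return float(v1 @ A @ v2)
        return estimate

    if __name__ == "__main__":
        rng = np.random.default_rng(3)
        n, r, k = 4000, 4, 2
        q = max(1, int(n ** (1.0 / (r * k))))            # q(n) of order n^{1/(rk)}
        Z = rng.choice([-1.0, 1.0], size=(n, 1))         # common Z for both samples in an interval
        X = Z + rng.standard_normal((n, 2))              # two unit-variance noise samples per interval
        f_hat = fit_bivariate_series(X, q, scale=2.0)    # "scale" plays the role of the free constant sigma_1
        print(q, f_hat(1.0, 1.0), f_hat(0.0, -2.0))

As noted for the univariate case, such truncated series estimates need not be non-negative everywhere.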

139 3 3c The Eigenfunction Representation We shall assume that the covalriance matrix of the noise A is known and that the observation vector is "pre-whitened," i.e., we apply the 1l/2 transformation A to the observation vector X~. The new data. vector is A 1/(N2 + Z9)7 What this does is to make the resulting gaussian noise samples within an interval independent. The noise vectors in different intervals are still correlated with a. cross-correla.tion matrix -1/2 -1/2 equal to A BT A. Note that the characteristic polynomial for 2 r remains the sameo Hence, Corollary 3.2.2 is applicable without change. We assume that the data. is pre-whitened without changing notation. The density function of the observation vector is f( o ) = ~ g(xl-zl;l) o..g(xk-zk;l)da(z...zk) (3.3.20) which we write as f(x) = g(x-z;I)da(z) (353.21) z With cp.(z) given in terms of the Hermite functions CPj(z) = (Pj (z//ao) (pj(z2/cl) o. cPjk(zk/l), (353.22) where a1 > 1, define All that is necessary is s to transform the data vector so that the resulting covariance matrix is diagonalo For purposes of notation, it is more convenient to have the resulting covariance matrix equal to the identity matrix o Hence we use A-1/2. Since A is (strictly) positive definite, Al1/2 is well defined,

140 -J d = cp (z)da(z) (3.3.23) JJ ~'Jk - The coefficient dj is uniformly bounded in j and z, k ij lj.. I Jl @1 - ( 3, 5 24) Let, k 2 k/4 \ ~xp: (+1 2 2 (~3~.2:) S(X) = ) 2 (12X+l)(2l) (555) r- i=1 and define o~ -1 a = 5 om+l (3.3o26) In analogy to section 2.5, we obtain the expansion 00 00 s(x)f(x) =, o (J.. tJk)dj c (x /7)...p (x/ ~ W fj ( j jk ji J k Ji=0 jk=~ k - dj _ cpj(x) o (3~o~27) j This series converges in mean-square and uniformly in x. We also have the uniform bound s(x)f(x) - d C j(x) < ) _ cl q 1 (5 5.8)

141 The unbiased estimates of the coefficients are n j A J Jn n (p(x (X)s( (3.3.29) ~=1 Using Corollary 3.2.2, we have Lemma 3.3.2: Assume that R(t) satisfies condition B and that the sequence (ZI) is M-dependent. Then, the variance of the estimate is dominated by i ~ 2 C4q/ 2 v(J dj) < 2 4 + B(k) B + 2(M-1) = cs/n. (3.3.30) nj n where c4 is defined by 1/4 k cl, 12+1 -XPj(X)S(X) < X 1 )k/2 = c4 (5551) k/4 ( k/) 2 The estimate of s(x)f(x) is taken as q q s(x)f (x) = X * (jl+ (x.J) dJ...jkC(x1)' (k/ k ji=0 jk=O )= J djn %P(x) * (35.5.32) j=0 In analogy to Theorem 2.5.2 and Corollary 2.5.1, we have the following results. Theorem 3.3.3: Under the hypothesis of the previous lemma, the MISE is dominated by

142 ^ X _S A2 Jn = E f(s(x)f(x) - s(x)fn (x) dx k < f/2 ( 2~ ) + (3.3.33) (1/2 al-2) q1 n Choosing q(n) as the largest integer < n n, e have - 2k In 1' ( cl2 l_2)) 1, ( In n)k n1 - < /, (.31- 34) n Tc 1/2Crl n I lnP 12k.) n or J' = O((Qn n)/n). Corollary 3.3.2: Under the hypothesis of the previous lemma, the meansquare error in the estimate of the random variable s(Xn)f(Xn) is bounded by En( s(X ) (f(Xn )-fn(X-n)) F' l2 gq+l k k - 2 (3*3*35) Letting q be the largest integer < c7 + I - gives 2kl|nSi 2k E, ( ) ( 7 + n../ l2,n (3.3.36) Vn n or, En ( ) = 0((Qn n)/n)

145 The constant c7 is not given explicitly as in (255533), but would be determined by minimizing (36) 3.3d The Gaussian Kernel The estimate of f(x) is taken as n fn(x) = g( x-X;H). (35.537) n -1 With H: a diagonal matrix with i-th entry equal to hi, (37) is just the k-dimensional version of (Col). We take H as this diagonal matrixo The expectation of the estimate is E f(x) = f da(z)g (x - z; A + H) (33.38) As in section 2.3a, the bias is dominated by considering the expression (Efn(x) - f(x)) and expanding in a Taylor series. Since the kernel gk(y;H) is an even function in y, the first and mixed second derivatives drop out. We obtain, as n - oo and hi + o0 i=l,.k k E () k/2 - 2f(x) h 2 + O(hi (3. 39) E.'(x)-f(x) + 2 (h) (3~3539) 2 L x h i=l z The variance expression is n n n 1 Z- + ~=1 m=2+l

144 -E(g(x-X; H)Eg(x-Xm;H) (3.3.40) Using (38), the first bracket is bounded by' i2T) (h...hk) A + Hl + 2)klA + HI ~ (3..41) For the second expression, the term inside the double summation is given by Qm- =/ / dC((zl)d (z2) )g2k(X-ZlX-Z2;M+H) - Z1 Z2 g2k( X-Zl, n *X-Z2; N+H) (3 *3 42) ~ H 0 M and N have the same meaning as before, and H =. Since M+H is positive definite, we can apply the transformation of the previous section to the expression in the bracket in (42). We then dominate the resulting expression using Mehler's formula and Cramer's bound (see(C.8) and (C.9)). The result is, Qm =I QT < (-)-1 + II ] (3.-3 43) 2k i=l 1-ri T The r2iT are the roots of |(A+H)1 BT(A+H)1 B1 - r II. Use Corollary 3.2.2 to obtain 2 V Q B < (7 C a B2(k)B2 I m Here, we assume the sequence (Z') is independent.

145 The constant B8(k) is defined differently since A(k) is now the bound on the elements of (A+H) The variance is dominated by A iF 2 1 1 V(f(X)) < ( 1) Bs(k)a B2 + -n -n Lat (21)k JA+-H 1 1 (2t)2klA+ H1 (hl...hk) b= I^ -— S —~ 1 (55. ~(3,3.44) n n(hi...hk)J Adding the square of the bias error to (44), the mean-square error is E f(x) - f n(X)) k < b + b2 + b.. h2 h2 (.3.45) - n ij i j n(hlh2o.hk) ) ij where the bij are constants which bound the second partials in (39). We have not made much progress in selecting the set hi so as to minimize (45)~ The obvious thing to do is to set hl=..ohk=h. Then, it is easy to see that h is chosen proportional to l/n1/ and E{(fn(X)f(x)) = O(1/n/(4+k) (3.3.46) From the way in which we have taken the estimates, it is not surprising that the assumptions needed to specify a rate of convergence (for all three methods) are direct extensions of those needed for the univariate

146

case. Thus, throughout this chapter we have assumed that the covariance matrix M_\tau is (strictly) positive definite. In Chapter 2, we required the same thing by taking |\rho_\tau| < 1; i.e., for k = 1,

    M_\tau = \begin{bmatrix} 1 & \rho_\tau \\ \rho_\tau & 1 \end{bmatrix},

and |\rho_\tau| < 1 implies that M_\tau is positive definite. To use the eigenfunction representation, we take the covariance matrix A as known. For the L_2 series, we require a product-moment assumption in order to specify a rate. Again, no knowledge of the underlying process is required to form the kernel estimate or to specify a rate of convergence.

The eigenfunction representation gives essentially the same 1/n rate as in the univariate case: for the MISE we have the rate (ln n)^k/n, and for the mean-square error we obtain (ln n)^{k/2}/n. With an r-th product-moment assumption on the vector Z, we obtain the same rate of convergence for the L_2 series as in the univariate case. The kernel method now gives a slower rate; the reason is that the bias in the estimate is still O(h^2) while the variance term is now O(1/(n h^k)).
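The quantities r_{i,\tau} of (3.2.25), which control all of the k-variate bounds of this chapter, are easy to compute numerically: they are the square roots of the eigenvalues of A^{-1}B_\tau A^{-1}B_\tau'. The sketch below evaluates them, checks that they are less than one when M_\tau is positive definite, and forms the majorant of Corollary 3.2.1. The exponential autocorrelation and the sampling instants are assumptions chosen only for illustration.

    import numpy as np

    def corr_matrices(tau, sample_times, R):
        """Within-interval covariance A and cross-covariance B_tau for samples taken at the same
        relative positions in two intervals a distance tau apart."""
        t = np.asarray(sample_times)
        A = R(t[:, None] - t[None, :])
        B = R(t[:, None] - (t[None, :] + tau))
        return A, B

    def r_values(A, B):
        """The k values r_{i,tau}: square roots of the eigenvalues of A^{-1} B A^{-1} B'."""
        Ainv = np.linalg.inv(A)
        eig = np.linalg.eigvals(Ainv @ B @ Ainv @ B.T)
        return np.sqrt(np.abs(eig))      # for positive definite M_tau these eigenvalues are real and non-negative

    def majorant(r):
        """prod_i (1 + r_i)/(1 - r_i)  -  1, the bound of Corollary 3.2.1."""
        return np.prod((1 + r) / (1 - r)) - 1

    if __name__ == "__main__":
        R = lambda t: np.exp(-np.abs(t))                 # assumed autocorrelation (satisfies condition B)
        times = np.array([0.1, 0.4, 0.7])                # assumed sampling instants within an interval
        for tau in (1, 2, 5):
            A, B = corr_matrices(tau, times, R)
            M = np.block([[A, B], [B.T, A]])
            r = r_values(A, B)
            print(tau, bool(np.all(np.linalg.eigvalsh(M) > 0)), np.round(r, 4), round(majorant(r), 6))

The printed majorant decreases with the lag, which is what makes the sums over tau in Corollary 3.2.2 finite.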

CHAPTER 4

APPLICATIONS OF THE EMPIRICAL BAYES TECHNIQUE

4.1 INTRODUCTION

We now apply our results to some problems in communication theory. As discussed in Chapter 1, we will be concerned with procedures which converge to what we have called the optimum one-stage test. To reiterate, this test uses only the present observations for the present test and, as such, is truly optimum only when the sequence of observations is independent. The empirical sequence of tests makes decisions on the same basis as the one-stage procedure, but it incorporates all past observations in updating the estimate of the test function. Furthermore, we will take a sequence of tests whose test function is identical in structure to the test function one would use if all distributions were known. It does not follow, especially in the small-sample case, that this is the optimum thing to do. What we expect to show is that when we are repeatedly faced with the same decision problem, our sequence of test functions will get closer, in the mean-square sense, to the one-stage test. It then will follow from the results of section 1.2 that the empirical risk will approach the risk incurred by using the one-stage procedure.

So far, we have considered estimating the marginal density function of the observation which, in general, can be written as

147

148

f(x) = \int g(x-z;\sigma)\, d\alpha(z). What remains is to show that we can extract from the estimate of f(x) those quantities needed to form a consistent estimate of the test function. In the supervisory mode, there is no difficulty in finding these consistent estimates. Since we have available samples which are correctly classified with probability one, we can estimate the particular density f_j(x) from which each was drawn. As would be expected, obtaining consistent estimates of the test function is more difficult when operating under a nonsupervisory condition.

In the remainder of this section, we discuss the problem of transmission through a random unknown channel when learning samples are available. This problem relates the results of section 1.2 on the convergence of the empirical procedure to the results on density estimation. It also serves to indicate when we can expect to find a solution to the nonsupervisory problem. In section 4.2 we consider the problem of transmitting known signals with unknown a priori probabilities. The problem of communication through a random multiplicative channel is considered in section 4.3. This problem is discussed in some detail since, for the case of small nonlinear distortion, a first-order analysis reduces to an analysis of a multiplicative disturbance.¹

¹One of K signals, y_i(t), is transmitted with gaussian noise added to a distorted version of the signal. The received waveform is x(t) = n(t) + y_i(t,T), where T is, say, a random delay with known mean value T_0. If, for example, the variance of T is small, we write T = T_0 + \Delta T and approximate x(t) by x(t) = n(t) + y_i(t,T_0) + \Delta T\,(\partial/\partial T)\, y_i(t,T)\big|_{T=T_0}. Then, \Delta T is taken as a zero-mean random variable with an unknown distribution.

149 A problem with an unbounded loss function is discussed in section 4.4. 4.la Communication Through an Unknown Random Channel-Supervised L1arning Suppose we transmit one of two signals, Yo(t) or y1(t), with a priori probabilities po and pi = 1-po. The signal is passed through a random unknown channel. We take the output of the channel to be a stationary random process zo(t) or zl(t), depending on which signal was transmitted. The received waveform during any interval is then x(t) = n(t) + zi(t), i=O,l, I < t < (X+1). (4.1.1) The density function of a. single time sample is f(x) = po fo(x) + pi fi(x) (4.1.2) where fo(x) = fg(x-zo;c)da(z ) (4.1.3) fl(x) = fg(x-zl;r)dp(zl). (4.1.4) The distributions a and P a.re taken as unknown and may or may not be 1 related. The statistical inference problem is to decide, with minimum probability of error, which of the two processes Zo(t) or zl(t) is present a and P would be unrelated if, instead of transmitting known signals through a random channel, the problem was one of sending one of two unrelated random'signals to which gaussian noise is added.

150 during each interval, The one-stage procedure is to evaluate the test function T(x) = pi fl(x) p fo(x) (4o 5) and compare it to a zero threshold Suppose that we have an estimate of T(x) and, after we have made a. decision, we are told whether or not the decision was correcto In this esupervised learning" situation, we can use the observation to update the estimate of the density function from which it was drawno We now have a better estimate of the particular density and the test function which we subsequently use for the next decision, Assuming Po is known, the error in the estimate of the test function after the n-th decision would be Tn(x) - T(x) = pi(fl(x)-flnl(x))-po(f(x) (x)) (4.16) ni is the known (after a. decision is made) number of occurrences of yi(t) in the n intervals, no+n=no Let Yinj=En (fi(X)-finj(X)) o Then, by the Minkowski inequality, we have EnPn(Xn)-T(Xn))j < (Pi Yn + Po 70n ) n Q (4 7) Hence to guaraee that tends to zero, sa t the same rate as for Hence, to guarantee that -n tends to zero, say at the same rate as for the ca.se of independent samples, we need to require that the autocorrelation function satisfy condition B~ e,go, it be integrable a,nd

151 eventually monotonically decreasing. In addition, we need a weak dependency amongst the samples derived from zi(t) as reflected in Theorem 2.2.2. Furthermore, to guarantee that the probability of error Pen converges to Pe (the probability of error using the one-stage test), we also require that (see Theorem 1.2.2) a) fi(x) i 0 a.e. x, i=0,1, and to be able to specify a bound on the rate of convergence of Pen it is sufficient to assume that b) fi(x) is analytic, i=0,1 c) fi(x), i=0,l are linearly independent, i.e., unequal. From (3) and (4), we see that conditions a and b are satisfied as g(x-z;a) is non-negative and analytic. If a and 3 are unequal then condition c will be satisfied. We will illustrate the estimation procedure with the eigenfunction representation. As in section 2.5, define: +00 d. = cp(z/ac)da(z) j 3 -00 e. = /cj(z/li)da(z) -00 2 2 2

152 2 (, - )(2+2) ( + (4.12+) 2 2 2 (4.1.8) C 1 r12 Multiplying (2)-(4) by s(x) = exp - (41.9) 2 o57 gives 00 s(x)fo(x) = - dj cp(x/7) (4.1.10) j=0 00 00 00 s(x)T(x) = (pej-pod ).(x/y) o (4.1.11) j=0 then we are told from which population Xn was drawno Assuming X wa's j =0 A test function equivalent to (5) is given by s(x)T(x) = C (pej-podj)~ (4.1(x/) Operating under a. supervisory condition, we make a. decision and then we are told from which population Xn was drawn. Assuming Xn was drawn from fl(x), we then update the estimates of e. as in section 2.5. At the end of the n-th decision, our estimate of the test function would be 1 Perhaps more appropriate to communication type problems, learning samples can also be provided by transmitting a known sequence yoY1,yoy1, etcoeo interspersed with the message sequenceo

153 ql(nl) s(x)Tn(x) = p1 ejn. (x/Y) P e 1 jn1 j=O - P, d cp j(x/) 4.114) n0 no j=0 Supposing that each qi(ni) is chosen as in Corollary 2.5.1, we obtain the bound En =(Xn) (Tn(Xn)-T(Xn)= (Yn) = O(ln2n/n), (41,15) where n* = min(no nl). Hence, from Corollary 1.2.2, the difference between the probability of error of the empirical procedure and the one-stage procedure is bounded by 2( yn)2 o < Pen -Pe < ( 2 + t(E)) (4.1.16) where b'(e) = Pr s(x)T(x)| < E and (n) is given by (15). Therefore, with the above assumed dependencies on the observations, Pen converges a.t the rate ln'n*/n.. Recall that the constant o1 is chosen greater than a but otherwise arbitrary. We would naturally choose oa to minimize the difference between P and P To illustrate how ol enters into (16), suppose en e we truncate the series (14) at q+l terms. Then, using the bound in 2 (2.o537) for (y'), the asymptotic difference in the probabilities of error is dominated by

154 (-m4 2q+7 - enen->oo en e - -_ 42 ( 2 2+ 2qJ (4.1.17) The closer o1 is to a, the smaller the first term of (17). On the otherhand, from (8) and (9) we have s(x) s- e eT 2 s(x )~1'- exp *- \2r (.4-a4)/ and 6'(e) is seen to increase as ol approaches a. One possible procedure to follow would be to choose ol so that 6'(E) is less than some preassigned small number A. Then, q is chosen large enough so that the first term of (17) is also less than A. When more than one sample is used to base a decision, the assumptions and results are direct extensions of those for the univariate case. Thus, if we took k samples for each decision, Pen would converge 2k to (a smaller) Pe at the rate O(lnn n*/n*)o This discussion assumes that we operate in a supervisory mode and, if one is willing to use a one-stage (or finite-stage) test, the procedure just outlined is straightforward and provides a solution to a variety of problems.l However, the above formulation and solution are Included in this formulation is the case of non-coherent communication. By restricting attention to a one-stage test we do, however, exclude the case of intersymbol interference.

155 not the best we can do. The above procedure has eliminated the signal design problem-the known signals yo(t) and yl(t) do not influence the convergence of the procedure. This has come about as we have neglected the relationship between a(zo) and P(zl). A relationship certainly exists as the channel presumably distorts both signals in the same manner. That some relationship between a(zo) and P(zl) must exist and be utilized in order to learn when operating in a nonsupervisory mode is almost obvious. For, with no knowledge of the relationship between a and P, there is no way to extract from f(x) consistent estimates of fo(x) and fl(x). In contrast, if a and P are equal then fo and fl are also equal and there is no longer any statistical inference problem. It is somewhere between these two cases where solutions to the nonsupervisory problem are to be found. One such case is the transmission of known signals through a random multiplicative channel. We consider this problem in section 4.3 where it will be seen that the nature of the signals enters into the bounds and, in fact, determines whether or not consistent procedures can be found in the nonsupervisory mode of operation. For the supervisory mode, the rate given above in (15) will be improved upon with n replacing n*-the point being that, in some cases, both sets of coefficients, dj and ej, can be updated at each stage. j J 4.lb The Detection of Noise in Gaussian Noise We want to mention a somewhat different application of the eigen

156 function representation, the detection of noise in gaussian noise when all distributions, including the a priori probabilities, are known. The application has the unusual aspect of incorporating a test procedure which is sequential in the number of terms of the series used for the test, For simplicity, we restrict our attention to decisions based on a single sample. Using the notation of the previous sub-section, assume one of two random processes, zo(t) or zl(t), is transmitted with a priori probabilities po and plo With the distributions a(zo) and P(zl) known then, in principal, the coefficients dj and e. are known. The procedure which minimizes the probability of an incorrect decision is to evaluate (13) and compare it to a zero threshold. Consider truncating the test function at q+l terms, q+l s(x)Tq(x) = (plej-podj ) cpj(x/y) o(.18) j=0 From (2. 513), we have the bound q *2 q+l Is(x)f(x) - djp.(x/y)[ < c (4.1.19) o J1 A~17 —- 1 — j=0 JN/;TL with an identical bound for the truncated series for s(x)fl(x). Then, the difference between (13) and (18) is dominated by

157 2 q+l 2 q-1/2 |s(x)(T(x)-T (x))| < c- = cl (4.1.20) q -/;7 1- t(cff) 1-1 Hence, if for a given observation X, the value of s(x)Tq(x) is greater in magnitude than the right side of (20), the decision using the truncated test function is the same as when the complete series is used. Again, we would not pick Oi arbitrarily close to a so as to make & arbitrarily small. All this does is scale down the possible range of values of s(x)Tq(x) and s(x)T(x). Specifically, from (2.4.3), we have g(x/y)Hj(x/7) (Pj(x/7) = j(2.4.5) 3 2 J.T /4'77 and s(x)Tq(x) becomes 2 2 s(x)T (x) = exp x 1 --- q On \2 2 (a 2+a2)2 Hj(x/y) ~( pi2je.-pd (4.1.21) Having fixed the value of ol, the procedure would be to evaluate the first ql terms of (21) and compare the magnitude to (20). If the magnitude is greater than (20), we announce hypothesis H1 if s(x)Tql(x) is positive and Ho if it is negative. If Is(x)Tql(x)l is less than (20), compute another term of the series and recycle. In this manner, we expect to eventually make a decision which would be identical to the

158 decision based on the original test function. The value of q at which the procedure terminates is a random variable whose distribution depends on the value of al chosen, the coefficients dj and ej, and the observation X, There is no theoretical difficulty in extending the procedure to a finite number of samples. We use the nonsingular transformation A /2 to whiten the gaussian noise. Then, for example, the vector zo is transformed into z with a. distribution a(A / z ). The difficulties,.~-o -o of course, are in inverting A and in calculating the coefficients d.. 4.2 MULTIPLE (SIMPLE) HYPOTHESES WITH UNKNOWN A PRIORI PROBABILITIES We consider the problem of detecting one of K+l known signals when the a priori probabilities are unknown. In each time interval, I < t< i+ 1 = 0,1,..., a signal yj(t) is chosen with a priori probability pj, j=O,l,...K. Zero mean, correlated gaussian noise is added to the signal. The received waveform is then x(t)=n(t)+yj(t). In the next sub-section, we consider the detection problem with a finite number of time samples. In sub-section 4.2b, the Karhunen-Loeve expansion is used to obtain limiting forms. 4.2a Finite Number of Observations Per Decision Under the j-th hypothesis, the density function for k time samples x=(xl,X2,...xk) is given by fj(x) = gk(x-yj;A) (4.2.1) 3

159 where A is the non-singular covariance ma.trix of the gaussian noise vector n, and yj represents the k samples of the signal yj(t). The overall density function is K f(x) = pjfj(x). (4,2.2) j =0 j=O For a. minimum probability of error test based on k samples, as discussed in section 1.2, we form the (K+l) test functions Tj(x) = pj fj(x) - po fo(x) j = 0,, oooK. (4.23) The decision function is tj(x) = 1 if T.(x) > Ti(x), i = 0,1...K, and tj(x) = 0 otherwise. Suppose that the set of a priori probabilities are unknown. Then in (3), we use estimates of these quantities which are updated in every transmission interval. In the n-th interval the error-in estimating the test function is nj (n)Tj(xn) =(PjnPj) f.(X n)(Pon-Po)fo(xn),J=0,lo oKo (4.2.4) denotes the n-th sample vector X = tX X ~ p. is -n iny,n~ ~ ~ jkn n naturally a function of all the observations (X ), ~=1,2,... Observe that fj(x) contains a common factor (1/(2 k/) |A|/) which can be cancelled out of the K+l test functionso We assume this has been done in (4), but keep the notation unchanged. Since f j(x)

160 A is a bounded function, for any unbiased estimate Pjn, we have 2 22 E{jnPjn ) f j(Xn) < V (j A 1 (4.2.5) The mean-square error of (the modified) equation (4) is dominated by E 7n j(T n((X n) < (J V( jn) + ) = yj Sj=1,2,...K. n n~j -n -Pjn on (4.2.6) Notice that we begin the index at j=l since we would naturally set A Tno(X)=0, which equals To(X). The inequality in (6) is the bound we need to dominate the difference in the probabilities of error. Let Pe be the probability of error using the empirical procedure and Pe the probability of error when the a priori probabilities are known. In analogy to the equation above (1.2.40) and to (1.2.46), which are valid for one sample per interval, we have for k samples per interval: K Pe l-Po fTj(x)dx j=0 A. (4.2.7) Aj=A:( ) A> Ti(x i=O x,,..K and ~ K ~ ujn ~ j())2 0 < Pn-Pe I Z 2(7< — - ^+ ( (4.2.8) - n u(j) Ju(J) t 7(j)u(j with 7on set equal to zero. The subscript u(j) plays the role of k in

161 (1.2.46) since, in this section, k is the number of observations per interval. In addition, here we let u=u(j) depend on the index j. The reason for this is as follows. Recall that ju(j) is an arbitrary constant and ju(j) = PrlTj(x)-Tu(x)l < j. One would like to choose the subscript u, u(j)=O,1,...K, so that 5ju(j) is a minimum for each j. This involves solving for the roots of Tj(x)-Tu(x)=O in terms of the signals and a priori probabilities and then, perhaps by linearization of T -T, obtaining bounds for the 5j in terms of the signals j u yj and y There remains to estimate the pj's so as to bound the difference P -P e To do this, we proceed as in section 2.6. We use the sequence en e of observations x. with i fixed (e.g., the first component of each in observation vector) to obtain an unbiased estimate Pjn. Assuming the samples of the signals are distinct, ijy ij.,j,ji=O,l,^.. K and i=l, o.k, there are k such sequences available which give k unbiased estimates of pjo These are then combined in a linear fashion. Taking the sequence of observations xi~,=l1,2,. on, the density function of xie is K f(x i) E= 7 p(x, - j=0 We assume the samples of the signals, Yij, j=O,l,...K, are distinct. Then, the estimate of p. is taken as (see (26.66)): J

162

    \hat{p}_{jn} = \frac{1}{n}\sum_{\ell=1}^{n}\sum_{v=0}^{K} h_{jv}\, g(x_{i\ell} - y_{iv};\sigma),\qquad j = 0,1,\ldots,K.

Now, assume that the autocorrelation function of the gaussian noise is integrable and eventually monotonically decreasing (condition B). Assume further that the signal transmitted in a particular interval can depend only on the signals transmitted in the previous M-1 intervals (M-dependence). Then, the hypotheses of Corollary 2.6.1 hold and we obtain V(\hat{p}_{jn}) = O(1/n). Using this in (8), we have P_{en} converging to P_e at the rate 1/n. Convergence also takes place at the 1/n rate with an "infinite" transmission memory if the conditional a priori probabilities satisfy (2.6.10) with \beta = 0.

The above procedure gives no guarantee that the estimates \hat{p}_{jn} are probabilities. In practice, we would want to normalize the estimates so that 0 \le \hat{p}_{jn} \le 1 and \sum_{j=0}^{K} \hat{p}_{jn} = 1.

4.2b Limiting Forms

The application of the empirical Bayes procedure to the finite sample case is straightforward. This is not so when limiting forms are considered. To use the Karhunen-Loeve expansion, we make the following definitions. Since all processes are stationary, these definitions hold for any interval. For the properties of the expansion see [11].
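Before listing the definitions, we note that the eigenvalues and eigenfunctions required by the expansion, defined by the integral equation given in (4.2.9) below, can be approximated numerically by discretizing the interval and solving a symmetric matrix eigenproblem. The sketch that follows does this for an assumed exponential autocorrelation; it is a numerical aid only and is not part of the development.

    import numpy as np

    def karhunen_loeve(R, m=200, T=1.0):
        """Approximate the solutions of  integral_0^T R(s-t) psi_j(t) dt = lambda_j psi_j(s)
        by discretizing [0, T] into m points and solving the resulting symmetric eigenproblem."""
        t = (np.arange(m) + 0.5) * (T / m)
        K = R(t[:, None] - t[None, :]) * (T / m)      # kernel matrix times the quadrature weight
        lam, V = np.linalg.eigh(K)
        order = np.argsort(lam)[::-1]
        lam, V = lam[order], V[:, order]
        psi = V.T / np.sqrt(T / m)                    # normalized so that integral psi_j^2 dt = 1
        return t, lam, psi

    if __name__ == "__main__":
        R = lambda u: np.exp(-5.0 * np.abs(u))        # assumed noise autocorrelation, for illustration only
        t, lam, psi = karhunen_loeve(R)
        print(np.round(lam[:5], 4))                   # leading eigenvalues lambda_j
        x = np.sin(2 * np.pi * t)                     # a sample waveform x(t)
        coeffs = psi[:5] @ x * (t[1] - t[0])          # expansion coefficients x_j = integral x(t) psi_j(t) dt
        print(np.round(coeffs, 4))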

163 R(s-t) t (t)dt = X. (s), 0 < s < lj=l,2,.. ~ j 0 j - - 1 x. = x(t)j.(t)dt o (4.2.9) 1 n = n(t):.(t)dt 0J o 1 Yij = Yi(t)*j(t)dt) i=0.l,...K. o We then have as the first k observations x = n + Yij, j=12l...k, jJ'' 13 and it follows from a. property of the expansion that E(n n ) = 0, i j j.j, i = j. Since the n. are uncorrelated gaussian random variables, they are independent. Hence, under the i-th hypothesis, the density function of the first k observations x = (xl...xk) is simply the product of k univariate gaussian density functions fi(x)=f (x1...xk) (k exp Yij) 4 (2-k/(-..o — /-exp -

164 Define the functions k Yikl kl (t) vik(t) = i., i=0,l,...K; O < t < 1 (4.2.11) kl=l and the inner products 1 k (vik.x) = Vik(t) x(t)dt = ylklXkl i=.K. (4.2.12) o kl=l 1 k 2 (vikyi) = Vik(t) yi(t)dt = (4.2. 13) 0 k1=1 Now, it is more convenient to take the test functions a.s the ratio. f(x) Tp -- p fo(x), j=O,.... (4.2.14) The decision function remains the same. Cancelling common factors in (14) and using the above definitions and (10) yields Tj(x) = - exp - - 2(v.kx)+(vjk,.) Po 2 M j +2(vok x)-(v ok Yo)j. Since the logarithm is a monotonic. function we can just as well use lnTj(x) in the test. (vokY o)-(vjk,Yj) in Tj(x) = ln(pj/p) + (vjk-vok,x) + o (4.2.15) We shall assume

165 Ik - L <, i =0,...K. (4.2.16) kl=l kl Then, it is known ([352]) that: i) the series in (12) converges with probability one 00 lim (vik,.,x ) = Yiklk < 0, i = 0,...K. k->oo kkl k~oc, k! ~~~ki~=l =1(4.2.17) ii) vik(t) converges to an L2 function v (t) which satisfies the i integral equation R(t-s)vi(s)ds = (t), 0 < t < 1, i = O...K. (4.2.18) o Since we are assuming that the autocorrelation function is strictly positive definite, the solution to (18) is unique. Define the random variables Wi = / x(t)vi(t)dt, i = 0,.. K. (4.2.19) o 0 and the quantity uij = (viyi) = vi(t) yi(t)dt.1 (4.2.20) o The (gaussian) random variable wi is the output of the filter uij ca.n be thought of as the signal correlation to noise ratio. The white noise case gives: R(t)=N6(t), uij = o Yi(t)yj(t)dt/No and u..= signal energy/noise power density (per cycle).

166 matched to the signal Yi(t). The logarithm of the test function is now a function of the random variables (wj-wo). Taking the limit as k oo in (15) yields: ln(Tj(w -wo)) = ln(pj/Po)+(wj-wo)+ oo jj (4.2.21) 2 The error in the estimate of the log of the test function is simply ln(T j(w -w )) - ln(T.(w.-w )) = ln(p /p ) - ln(p /p ) jn on j o = n. /) - n( /p ). (4.2.22) jn j -on o To form the estimate of pj, consider the output of the r-th matched filter during any time interval 2 < t < (~+1). Assuming the i-th hypothesis to be active during this interval, the received waveform is x(t) = n(t)+yi(t), and wr is a. gaussian random variable with mean value uri and variance equal to u r Averaging over all possible hypotheses, the density function of the observation wr is K f(r) = P g(Wr-Uri; r (4.2.25) i=O The only difference between the situation here and in section 2.6 is that the correlation between the random variable (wr-uri) in different intervals is not given directly in terms of R(t). We do, however, have:

167 m+l Q+1 E wr r;)(w rim S E(n(tl)n(t2))Vr(tl)vr(t2)dtldt2 m r i 1 = f/ /R(m-f+t2-t1)vr(t1)Vr(t2)dtldt2 (4.2.24) O O If we assume that R(t) satisfies condition B ((2.2.23)), let T=m-l, and denote (24) by u PT, then, n Ip lp<' B2 J Vr(t)l d Urr o (4.2.25) a 2 < B2br' u r rr where br is a bound on the L1 norm of vr(t). Since I |vr(t)|dt < j[o v2r(t)dt), we can set b equal to the L2 norm of vr(t). From the definition of vr(t) and the previous assumptions, this norm is finite. Using the output of the r-th matched filter, we take as the estimate of pj n Pjn n= Z ~=1 iX~~~~=1 ~~(4.2.26) K n =n hi g(Wr -u i; r)' j = 1..K. i=l =1 wrlis the output of the r-th matched filter during the l-th interval. As in section 2.6, these estimates are unbiased. In analogy to the previous case of time samples, assuming the autocorrelation function

168 satisfies condition B and that the signals satisfy an M-dependency, from Corollary 2.6.1 and (25), the variance expression is dominated by 2 2 A O2 2 V(^ ) < -2 -8( (1 + B2 br + 2(M-1)) rr ~2 =, j = 1,...K. (4.2.27) To calculate the mean-square error in the estimate of the test function (22), we proceed as follows. Since Pjn converges to pj in mean-square, it converges in probability: Pr j1jn-pJl I> E <. Dr Pin _ J _ Consequently, with probability greater than 1- n /E2 A ~ A j -iP j1 ln( /p) = ln(l+(pj -pj)/pj = Pn- + 2 ~( pj pj Letting A equal the set of sample points which satisfy fAdP(cu) > l-yn /ej it follows that 1n2(Pjn(w)/pj)dP(w) <3 + o(~E) (4.2.28) A 2 Pj Upon applying the Minkowski inequality to the expected value of the square of (22), it follows that, except for a set of experiments of probability less than n /e2 or 2/E2 (whichever is greater), the mean-square error in the j-th test function is dominated by

169 2 2 ~A 22 Pi+P 2 EnlnTlnj-1lTj)j < (n + 2 0(ej))O, j=1,...K. pi Po (4.2.29) We have defined = (yjn /j+yon/P ) and as before, we set Tn(wo-w )=0 which gives n=0. Letting n = max Yjn/ejp in analogy to (1.2.37), for this (equivalent) procedure we obtain: except in a set of experi* ments of probability less than y 2, the difference in the probabilities n of error is bounded by <PenPe < t 2 jn + Ykn + 2 2 ( ) 0<P -,<- 7 7 -0:) =0 jk Pj Po 2 +P2 2 0p p ( k k + (jk)- (4.2.30) p p2 kj jk kj Pk o From our earlier assumptions, yjn = 0(1/n). Hence, Pen converges to P at the same rate, except for a set of possible experiments of probability less than y 2. n To investigate the manner in which the signals affect the bounds, we consider the case of binary communication, K=l. The difference in the probabilities of error is then given by Theorem 1.2.2, 2 72 0 < P -P < n + 6(c), (4.2.31) en e- 2 where Yn is defined below (29) with j=l, po+pl=l, and 6(E) is defined as 6(e)=Pr|lnTi(wi-w) <.

170 Suppose we only use the output of the j=0 matched filter to estimate Po. The bound on the variance of the estimate is 2 2 2 2 c8 () T2 B2 b on < -2 (1 + 2 b (M-l)) (4.2.32) on n 2in u 1-pP 2 n oo. 00 where c8(0) = Iho. j=0 The h.. are the elements of the inverse of the Gramian matrix G (see 13 (2.6.1)). For this example, G1 is 2 -' 1 -exp -(u o-uo) /2u0 -exp -(Uo0-u )/2Uoo 1 -1 G = 4iruo00' (1-exp(u -uOl ) /u and c8(0) becomes 1 _ _ (1 + exp {L(uoo-uo0)2/2 uo0) C8(0) =,/4O, (1 - exp (uo -u0)2/uO }) The variance of the estimate is written s 2 e1 x 3 Bb(u u+ exp)2/2uoo-'on n 81 u + 2M-1) ( T on n 8,,2Uo 00 l-p,* 1 00 1 - exp (u-0Uoo-u )2/uoo (4.2.33) For purposes of illustration, let us assume that the autocorrelation

171 of the gaussian noise is approximately a delta function, R(t)=No0(t). Then, v (t) is approximately yo(t)/No, and we. have 1 2 2 2 b < j v (t)dt u /N =u /o, 0 and, 2 2 1 B 1 + exp( u00-u )2 u 7o = --- 2 2+ 2M-l) ( Yon n 8nru lp 1o 00 exP|LNuoo or I oo Roughly speaking, the variance of the estimate is inversely proportional to the square of the signal-to-noise ratio. Consequently, in combining estimates of po from the two matched filters we would weigh more heavily the filter corresponding to the larger signal energy. For testing (K+l) hypotheses, there is a. simplification when the signals are orthogonal. The signals Yi(t) are said to be orthogonal if 1 11 uij = v(t)yi(t)dt =/ v.(t)R(t-s)vj(s)dt ds = 0, iij 0 00 For this situation, the density function of the output of the r-th matched filter (equation (23)) reduces to \171) + (l~g~w:,P.f f(wr) = Pr g(Wr-urr; I) + (l-P)g(wr; J ) Hence, the output of this filter is used to estimate only Pr and operationally, the procedure for estimating the a priori probabilities

172 reduces to that given in the introductory example of section 1.1. 4.3 TRANSMISSION OF KNOWN SIGNALS THROUGH AN UNKNOWN RANDOM MULTIPLICATIVE CHANNEL We now discuss the problem of detecting one of K+1 nonzero known signals which are passed through a random multiplicative channel. In any interval, the received waveform is x(t) =N(t)+Z~Yi(t); ~ < t < ( +1), i=O,...K, (4,3.1) where the a priori probability of transmitting Yi(t) is Pi. The signal is amplitude modulated by a random variable ZQ, which may depend on the previous Z's, but which is independent of the gaussian noise. In this problem, we take the a priori probabilities, pi, i=O,l,..K, as known, and the a priori distribution of Z oa(z), as unknown We will mainly be concerned with learning in the nonsupervised mode as the problem of supervised learning (with an arbitrary channel) has already been discussed in 4.1. We will, however, point out when the results in 4.1 can be improved upon for the particular case of a multiplicative channel. At the end of this section, we shall briefly mention the problem where the received waveform is given by x(t)=N(t)+Z(t)Yi(t); ~ < t < (+1), i=O,1,...K. (4.3.2) Here, the signals are amplitude modulated by the (unknown) random process Z(t).

173 We start with the problem given by (1) and again derive limiting forms via the Karhunen-Loeve expansion. Using the notation and definitions of the previous section, from (4.2.9), the observations are the Xkl kl=l,...k. Under hypothesis Hj. the density function for the first k observations in the ~-th interval is +00 k 2] f(X) = exp[ 1 (xkl-z Yjk ) ( fj( exp: _ (2 ) k/2(1..2)1/2 2 k= d(z) (4.3.3) The density function of the observations averaged over. all hypotheses is simply K f(x) = pjfj(). (4..4) j=O For the (K+l) test functions we take P) fj(x) T.(x) = i, j = 0,...K. (4.5.5) P f (x) PO0 -O Cancel common terms in (5) and use the definitions in (4.2.11)-(4.2.13) to obtain +00 Po p. [exp - 2'[-2zp(vok, x)+z 2(vokY.]] da(zA) Txp - -~o — ~ — ok — 00 i o kJ j (4.3.6)

174 Y jk Since (v.,y.) = jk > 0, j=O,l,...K, we can complete the square jk J k, %kl in z~. Define the random variables (vx), j = oi...K. (4.3.7) (vjk,Yj) Then, + 00 +2 2k l F 2 1 1 j+ _k ( j) j[exp - vjkyj)(wjk-z) d2](z ) 3 p W2k +00 TJ - P0 k PV e 2 oVyk~o^ e lJp 1w 2 d a r 00 (4.3.8) We make the same assumption as before, Y 0 <,o j=O,l,...K. (4.2.16) kl=l kl Define (vjk,x) (v,x) w = lim w = lim (Vkx) (v4,x9) ko j koo (vjkYj) U In any interval, Z~ is just a constant. Hence, (4.2.16) guarantees that wj exists with probability one. Then, taking the limit in (8) yields (by the bounded convergence theorem): 2 +00 1 (.(-z)2 d(z) +- 2 j e 2 jj i 2 j lim T.(x) = - e 00 J- P + 1 k- oo o e UooW 2 uoo 1 e d(.(z~) (4.5.10)

175 with probability one. The vj's are again the (unique) solutions to the integral equation (4.2.18) and, since R(t) is assumed to be strictly positive definite, we have ujj > o, j=O,...K. For some of the cases we will discuss, it is more convenient to use the logarithm of the test function. Defining +CO Lj(w ) = \2/u' j g(wj-z;l/JT7 )da(z), j=0,...K, -00 (4.3.11) we have 1 2 2 ln(Tj(wj,w))=ln(pj/po) + 2 (ujjw-u o) +lnLj(wj)-lnLo( o). (4 12) The test procedure is to pass the received waveform x(t), < t < I + 1, through K+l matched filters. The gain of the j-th filter is 1/ujj, and the output (equation (9)) is wj. The K+l w. are then used to evaluate J 3 the test function in (10) or (12), with the signal corresponding to the largest value being announced. This is the one-stage procedure to minimize the probability of error when a(z) is known. To estimate the test function in the nonsupervisory mode when a(z) is unknown, we first obtain the density function of the wj. Consider the output of the j-th matched filter in any interval. Assuming hypothesis i is active, we have (vj,x) 1 j= ui = 1 Sv.(t)(n(t)+zyi(t)dt. (4.3.13) J u ^Ujj 0 J wj, given the value of z, is a gaussian random variable with mean value 3

176 (zuji)/u.. and standard deviation 1/Ju.The density function of w. given that hypothesis i is in effect is fj(wjl Y=yi): g(w. - z O1; 1/dj(z),(4.3.14) 0 ujj -00 and averaging over all hypotheses we obtain K +00 fj(WJ) = P g(wj-z; i/4u jj) da(z), j=O,...K. i=O -00 (4.3.15) We have dropped the I subscript since the sequence (ZA] is assumed to be stationary. 4.3a Orthogonal Signals We first consider the case of orthogonal signals. With uji=o for jOi, (15) reduces to — oo fj(wj) = Pj g(wj-z; l/ Tu j )da(z) -00 + (l-pj)g(wj; 1/j).) (4.5.16) In view of the definition of Lj(wj), we can write Lj(wj) = / (fj(wj)-(l-pj)g(wj; l/5ujj)).(4.17) Since ujj and p. are known, the unknown part of ln(Tj) is ln(L.j(wj)). From (17), the problem of estimating Lj(wj) reduces to estimating fj(wj)

177 and then subtracting off the known quantities. Clearly, we can do this under a. nonsupervisory condition with either the L2 series or the eigenfunction representation. For the L2 series approach, we expand each of the fj(wj) in a series as in section 2.4. Then, assuming the r-th absolute moment of the random variable Z exists (r > 3), the sequence Z~ is M-dependent, and that R(t) satisfies condition B, we can apply Corollary 2.4.1 to obtain nA pi / (r-2)/r ET fjn (W. )-fj(Wjn )) = (/ )/ (4.3.18) where Wjn represents the output of the j-th filter during the n-th interval. This bound is then used to dominate the quantity Pen-Pe. Since (18) implies convergence in probability, one could proceed to 2A dominate the expected value of In Ljn(Wjn) as in the previous section. In this manner, one could make a (probability) statement about the convergence of Pen to P in analogy to the case in sub-section 4.2b. A block diagram of the procedure for estimating L.(w.) is given in Figure 2. As indicated, the output of each matched filter is fed into q+l devices to evaluate the first q+l coefficients. The ai for each of the L2 series represents an arbitrary constant. If we picked them all equal, we would not need q+l coefficient generators for each The structure here is identical to the problem of binary on-off communication through a random channel. In both cases, we are testing a composite hypothesis vso a specified alternative. The distribution of the composite hypothesis is estimated by estimating the overall distribution and then subtracting the known quantities.

at the n-th decision: n t (n+l) x(t) Won Lo(Won) I x(t)on(t)dt oo - j=Q,1,...q. _I_, __ I (^ijn-(l-Pi)bij)Pj (WIin A+ 0 Figure 2__(t.(t in lpc. (/ fo) e +)st —-— n —— in Outputs of g H 01 o ___tI _ I"t h e q + 1 0 coefficient I(i (n-1 c /I generators CPjCWn/(i) "ni 1 (l-P ) are summed WKn Li(Win)'ufX(t)vi(t)dt ____in tos + 00 ~ biJ g(x; 1/4quii)cpj(x/ai)dx Figure 2. The Ie series procedure for estimating Lj(wj)-the case of orthogonal signals.

179 filter; the observations Wjn, j=O, 1...K, could be processed serially by the same q+l devices. We can also eliminate storing the first q+l Hermite functions by use of the recurrence formula. cp +T(wl/) = cp (w/a) + 2j cp. (w/a) Since we have truncated the estimate of the series, there-will be an asymptotic error as given by (2.4.37). We have, lim E f( - ( < c2 /-B6 (/2)-1 n jn jn j6 { 77j7 r7 -l n- (o -1) q r|2; 1/4. The constant c2 is cl/(rt /r ). Recall that cl is Cramer's bound and c. is the arbitrary constant. B6 is the L2 norm of the function given below (2.4.21). Both the signal-to-noise ratio ujj and aj enter into B6. We now use the eigenfunction representation for the same problem of orthogonal signals. From (16) and (2.5.11), fj(w.) can be expressed as: 00 fj(w) = (pj i dji + (l-p )b i)ci(w /Yj) (453.19) J s.(w.) j ji j ji i J J J i=0 where 1 2 s (w. ) = e 2 u SJ J ^"- Yj 2 a aj j

180 Ujj j-1 2 UJJ Jc (4.5.20) j u.. +1 2 (Uj j-1) (ujjj+l)'j 2 2 u..o. and dj i = cpi(z/aj)da(z) 4.5.21) The b.. are the Fourier coefficients of sj(w)g(w;l/Jujj), ji bi = ci(w/yj)sj(w)g(;l//jj7)dw.(4.5.22) The constant a2 is chosen to be greater than 1/u... j The part of the test function which we want to estimate is lnLj(wj). From the definition of Lj we have the expansion o00 sj(wj)Lj(wj) = 2/ujj u j dJi cPi(wj/yj),(4.5.25) i=0 and can write for (12) ln(Tj(wj,w)) = ln(pj/po) + - (ujw - u W) + ln(sj(wj)L (wj)) - In(so(Wo)Lo(wo)) - in sj(wj) + in sO(W). (4.5.24) The error in the estimate of (24) is A A A lnT -lnT^ =ln(s LjL)-ln(s n )-ln(s )+ln(s L ) (4.5.25) which we can also write as

181 A = in ii+5L5 lnT -lnT In + S(i )n SOL o (4.3.26) s.L. -sL. -! +.A n i.i sjLj In analogy to section 2.5, we take the estimate of sj(wj)Lj(wj) as q(n) s( jnj) = Ajin(/) (4.3.27) i=O where the estimates of the coefficients n ji A:i 1 1n Xi jp i( /y) (W) - -p)bi dj fin n y t (-p j differ from those in section 2.5 only by the (1-pj)bji term. If the sequence of random variables (ZI} is M-dependent and if R(t) satisfies condition B. then, from Corollary 2.5.1, the sequence of estiA mates sjLjn converges in mean-square (uniformly in wj) to sjLj. The rate is In2n/n with the bound given by (2.5o34). Since we have convergence in probability, we could again make a probability statement concerning the convergence of Pen to Peo We remark that if the orthogonal signals yj(t) have equal energy then ujj=..=Uoo, and the fj(wj) are essentially the same. In this situation we need only use the output of one filter to estimate the 1The f.(w.) would be identical if the signals were also equi-probable. 33J

182 coefficients. 4.3b Bipolar Signals Some interesting problems arise when we are testing between two hypotheses and the signals are bipolar. The signals Yo(t) and yl(t) are said to be bipolar (or antipodal) if Yo(t) = -yi(t). In this case, we need only one matched filter as vo(t) = -vl(t). The density function of the observation (equation (15)) is f(wo) = Po f g(wo-Z; l/ NJuo)dO(z) (4.3.28) + P1i g(wo+z; l/ 7Uoo)da(z) since u o=ull=-u. With w = -wl, the test function, as given by (10) becomes fg(wo-z; J/~o)doa(z) T(w ) P -.- (4.3.29) o g( w+z; l/ f )da(z) 000 Let Lo(wo) = so(wo) f g(wo-z; l/ 4 )da(z) (4.3.30) Ll(wo) = So(wo) f g(wo+z; l/TJo)daO(z), where so(wo) is defined as in (20). A procedure equivalent to comparing T(wo) to a threshold of one is to evaluate

183 T(w ) = pi L1(w) - poLo(W ) (4.3.31) and compare it to a zero threshold. We need to assume that da(z) d da(-z). If da(z) = d( -z) then Li(wo) = Lo(wo) and there is no basis for a decision. Hence, if the density function exists (da(z) = a'(z)dz), we assume that it is not an even function. Multiplying (28) by so(wo) results in the expansion so(wo)f(Wo) = d(Pl+P(-l) ) c(/Y) (4.3.32) while the test function can be written as T(wo) = / ^(pi-p(-l) pj(w/7). (4.3.33) The coefficients d. are defined in (21). We have used the property that the Hermite functions are even or odd functions (depending on whether the index j is even or odd) to obtain (32) and (33). We observe from (32) that if pi P po, there is no problem in estimating the product (S dj) which is then used to estimate T(wo). For bipolar signals, however, it is certainly reasonable to take Po = P1 = * Then, (32) and (33) become 00 j=O s()f(w ) = d cp (w /7) (4.5.54).j=0 j=0 0 2

184 Hence, we can only estimate the even coefficients, while the test function depends on the odd coefficients. There is one set of circumstances where this difficulty can be overcome. Suppose the density function of z exists, da(z) = a (z)dz, and that Oa'(z) is L2. Further, assume that the random variable z can only take on positive values, a'(z) = 0 for z < O. Then, the even part of ca'(z) uniquely determines the odd part: 00 o(z) = a(zC('(-z)) V dj / (z/) (e )2 = X d: Cp.(ZI/a ) (4.3.36) j=0 and = 2 =) z o(z) for z > 0 o 2 e -U'(z) for z < 0 e From (30) and (31), the test function can be written as +00 T(wo) = s(wo) g(wo-z; l/ uf ) (z)-(-z) dz 00 2 -00 The procedure, then, would be to obtain estimates of the coefficients 1 Operating in the supervisory mode, there is no difficulty since we can estimate all the coefficients, even and odd, by estimating the marginal density fo(Wo) or fl(-wo). The even coefficients of fo and fl are equal and the odd coefficients differ by a minus sign.

185 d2j by estimating (' d2j) in (34), and then dividing by g2j The estimate of the test function would take the form q(n) + T( )= s(o) n g(wo-; l/ ) C (z/o) no =0 2jn K 00 2j z/n 0) j=O o 0 - g(wo-z; 1/u ) CP(Z/o) (4.5.58) -00 ~ 2 The convergence of (38) to T(wo) is not given by the in n/n rate since we are, in effect, estimating a' (z) and not f(x). As indicated above, using the eigenfunction representation, we can "pick off" the dj's. The L2 series can also be used to estimate a'(z). The estimation procedure, however, is more complicated. This will be discussed in section 4.4 4.3c Arbitrary Signals The distinction between the supervisory and nonsupervisory modes is more marked in the case of general signals. Define the constants G. U.. a =l ji ji Ujj (4.3.39) 2 u a.. -1 2 _ Ji ji u. o. +1 33 312 2 (uj -j2i-1)(Ujj. 1) ji u2 a 2 JJ Ji

186 and the functions 2 s w (W)> 1 exp L.} (4.3.4o0) ji 3 2 2 2 where the indices i,j = 0,1...K. The (K+1) oj are chosen so that 2 a.. > u... From (2.5.6), we have the expansion 00 sji(wj)g(wj-y; 1/ j) = ji ck(Wji) Pk(Y/oji) k=0 or, 00 Sji(wj )g(-; 1/ ) = CPk(wj/ji)CPk(z/lj) jj k=0, k=O (4.3.41) Defining the coefficients +00 d = cpk(z/a.)da(z), (4.3.42) -00 the density function of the output of the j-th matched filter (equation (15)) becomes K 00 (w. i) w k f.(w.) i= L Pi (w) ji d.k cPk(w%/7'i) * (4.3.43) i=0 ji j k=0 It is the djk which are needed to estimate the unknown part of the test function, lnL.(wj). Repeating the definition of L.(w ), +00 Lj(w ):= 2/j/uj j g(wj-z;l/u. )dol(z), (4.5.11) -00

187 we use (42) to obtain 00 (w2/u.. k Lj(j ( ) dk d (/Y ) (4. 3 44) s (w j) dk ^jj'Pk(wj / kJ=O The difficulty in estimating djk is that the unknowns in the overall density function no longer appear as the coefficients of an orthogonal expansion. Of course, in the supervisory mode, there is no need to consider the overall density function fj(wj). By working with the individual densities in (43), we have the usual orthogonal expansion and can update the estimates of djk at each stage. For the nonsupervisory jk problem, we can use a procedure analogous to that given in section 2.6 to obtain approximate estimates for a finite number of the djk. By defining K v-~~,;C1 k jk(W.) = P.(w. ) k(w /7yj ) (4.53.45) jk j X i sji j(w j i/ i=O fj(wj) can be expressed as 00 fj(wj) = d djk jk( (4.3.46) k=O It can be shown that the functions *jk(wj), k=O,l,...J-1, are linearly independent. Then, we can construct (as in section 2.6) a set of functions jk(wj) which satisfy +00 f j(w)j (w)dw = 0, mk (447) -oo = 1, m=k

188 where k,m=O,l,...J-l. As an estimate of djk, k=O,l,...J-l, we can take n dkn =n i jk( (4.3.48) Q=l These estimates are biased o0 +oo E(d jkn) djk = dj Vjk(w) Vji(w)dw. (4.5.49) i=J -~ The variance calculations are unwieldy a.nd we will not discuss them. The difficulties encountered in the case of arbitrary signals carry over when the received waveform is x(t) = N(t) + Z(t)yi(t); I < t < ( +1), i = 0,.. K. (4.3.2) We define the time samples z(tkl)yi(tkl) as zklYik, and let u be the vector of time samples. The density function of the k samples, given that the i-th hypothesis is active, is fi(x) = dj.jk.' gk(x-ui;A)cpj(u /y i lil).. *~ e 1 Jk JiJ... ~ duil.dui (Pik(uik/yi ck). Y4-3-) jk(Uik/Yi k4) Yil...Yik For simplicity, we have assumed that the density function of a(z) exists and is L2, and have defined the coefficients as

189 d...k=... P (zi/ai1)... jk( k/k)al(zi,...zk)dzl..d k (4.5.51) The test procedure involves evaluating each of the fi(x) and then comparing them as in section lo2. If the noise samples in each interval are independent, A is diagonal and the problem is a direct extension of the one above. For the supervisory mode, each si(x)fi(x) can be expressed in an orthogonal series. In the nonsupervisory mode, (50) is summed over i and one can obtain the k-variate analog of (46). When A is not diagonal, f.(x) can not be expressed in an orthogonal series with the dj.. defined as in (51). Be redefining the coefficients so as to depend on the index i (let al.o. k depend on i), one could express each fi(x) in the same orthogonal series. Then, in the supervisory mode, the coefficients could be estimated a.s in section 3o3.5c. For the nonsupervisory mode, this representation would only be useful for the case of bipolar signals. 4.4 UNBOUNDED LOSS FUNCTIONS FOR THE CASE OF BIPOLAR SIGNALS For the case of equi-probable bipolar signals, we had to assume that the random variable Z could only take on positive values in order to consistently estimate the test function in the nonsupervisory mode. Under the same assumption on Z, we now give another formulation which is illustrative of a. class of problems.

190 Equation (4.3.28) can be written as +00 f(w) = g(W-z;l// I~) (d z)-d(- ). (4.4.1) -00 The output of the matched filter is Wo=N oZ, depending on whether yo(t) or yl.(t) = -y (t) was transmitted. No is the (derived) gaussian noise sample with variance equal to 1/u, Deciding between yo or -yo is 00 equivalent to deciding whether the prefixing on the (non-negative) random variable Z is + or -. Rewriting (1) as +00 f(wo) = g(w-z;l/uo)dc() (4.4.2) -00 we can think of Z a.s a random variable with the symmetric distribution Oe(z)o Z will be positive (negative) only if yo(-Yo) is transmitted. Hence, the decision problem is equivalent to deciding whether Z is positive or negative. If Z is close to zero, it is more difficult to distinguish between the two hypotheses. The penalty of an incorrect decision should, a.ccordingly, be small. On the otherhand, if Z is large and we make an incorrect decision, the loss should be high. As one possible loss function, we take the loss equal to the magnitude of Z for an incorrect decision and equal to zero if we decide correctly. Using the notation of sub-section 1.2d (with z and wo in place of X and x), the loss function L(t(wo),z) is:

191 L(O,z) = 0 if z > 0 = -z if z < 0 (4.4.3) L(l,z) = z if z > 0 = 0 if z <0 We announce yo(t) if t(wo) = 0 and -yo(t) if t(w )=l. Defining b(z) = L(O,z) - L(l,z) = -z, (4.4.4) from (1.2.49), the test function is +00 T(wo) = b(z)g(wo-z;l/oou )dc! (z) -00 The procedure which minimizes the average risk is to set t(w ) =1 if Te (W) > 0 =0 otherwise. Substituting for b(z) yields: +00 T e(wo) = -J zg(w -z;l1/I )da (z) -00 1.~~~ ~~(4.4.5) (wof(w0 + - f(w)) 00 Hence, the test function does not depend explicitly on Ce(z). However, we now require estimates of the derivative of f(wo). To estimate the derivative, the techniques given in Chapter 2 are applicable. By differentiating the estimates of f(Wo), we can obtain consistent estimates

192 1 of f'(wo), As would be expected, convergence of these estimates takes place at a. slower rate than the estimates of the density function. Then, assuming fz 2dce(z)< o, Corollary 1.2.4 can be used to bound the difference in risks. If b(z) is a k-th order polynomial in z, the test function will 2 depend on f(wo) and its first k derivatives. Even when the test function ca.n not be written as a polynomial, the problem of finding consistent estimates of Tte(wo) is, in principal, easier than the situation encountered in the last section for the case of arbitrary signalsthe point being that here, the density function under either hypothesis (Z > 0 or Z < 0) is the same and is given by (2). Consequently, we can estimate ae(z) and form an estimate of the test function by taking A A Ta n(wo) = b(z) g(w -z;l/ \ )dcen(z) (4.4.6) o00 We briefly discuss how to estimate the distribution or, for simplicity, the density e((z). We assume a'(z) is L2s Using the eigen~~e e function representation, we first form the estimate of gjdj a.s in section 2.5, and then divide by the known Ji The estimate of Ca(z) is taken as 1 For the method of 2o3, we have verified this only with the gaussian kernel. To use the eigenfunction representation, we estimate the product s(Wo)f'(wo)o 2For a discussion of this point, and other densities (besides the gaussia.n) which have this property, see [37]~

193 q(n) e(z) = djnpj (z/al) Using the L2 series for f(x), we define the estimate implicity by +oo A A f (w) = g(w -z;l/u )a' (z)dz n o oo en -00 (4.4.7) q(n) -= Z a,jn Pj (Wo/CT) j=0 Since the Hermite functions, aside from a. constant, are their own Fourier transforms, letting Mf (v) be the transform of fn(x), we obtain q(n) Mf (v) -= ai (-1) cp.(av). (4.4.8) Mfn (vj) J j=0 A Letting Men(v) be the transform of atn' we ha.ve (v),+v2/2u M n e 00 Mf (v). (4.4.9) e n To form the estimate of ac(z), take the inverse finite Fourier transform of (9), +B(n) A +v2/2u-ivz C (z) = J e 0 M(v) dv, en fn -B(n) with Mf(v) given by (8). n Both of these representations for 4C(z) lead to consistent estimates and, under appropriate conditions on b(z), the sequence of test functions defined in (6) can be shown to be consistento

CHAPTER 5 SUMMARY We have studied a particular empirical procedure and applied it to some problems in communication theory. The procedure utilizes all past observations to form an estimate of a test function which is then evaluated using only the present observation. This procedure is neither optimum nor asymptotically optimum when the sequence of observations is assumed to be dependent. Whether or not the sequence of observations is dependent, we have shown that if the sequence of test functions converges in mean-square to the one-stage test function, the difference in the risks (for the case of a bounded loss function) is dominated by a quantity proportional to the mean-square error in the estimate of the test function. These calculations (section 1.2), although straightforward, appear to be new. In estimating the density function f(x) from the sequence of dependent observations, we are able to dominate the mean-square error and hence, specify the rate of convergence of the estimate. The key relationship is the Mehler formula (2.2.12), or more generally, the BarrettLampard expansion (2.7.1). It appears that this particular application of the expansion has not been used before. We have presented three methods of estimating the density f(x). Varying amounts of information are required to apply and specify a rate of convergence for each of the techniques. A summary of assumptions 194

195 needed and rates of convergence are given in section 2.7 and at the-.end of 2.5. Of the three methods we have presented, the most interesting is the eigenfunction representation. To the best of our knowledge, this approach of estimating a density function obtained from a convolution and the particular solution of the eigenfunction problem in section 2.5 have not appeared in the literature. The communication problems we have considered are those in which the unlknowns enter linearly into the overall density function. In section 4.2, the unknowns consist of a finite set of parameters. In the other problems, an arbitrary distribution is taken as unknown and expressed in terms of a countable set of linear unknown parameters and known functions by using either of the two series methods. The series methods we have studied may also be useful for other purposes. We have mentioned two; estimating a k-variate density function with the L2 series (2.7 and 3.3b) and using the eigenfunction representation for the detection of random signals in gaussian noise with the number of terms to be used in the test function determined by a sequential procedure (4,lb).

APPENDIX A THE HERMITE POLYNOMIALS In this appendix, we review the definitions and properties of the Hermite polynomials and develop the relationships which we will need. Common practice has been to use the notation Hen(x) for the poly-x2/2 nomials associated with the weight function ex /, and Hn(x) for the polynomials associated with the weight e. This is the notation which we will adopt. The polynomials are related by (Erdelyi, [12] p. 268), n Hen(x) = 2 2 Hn(x//) (A.1) Hn(x) = 2n/ Hen(/'x). (A.2) A.1 THE POLYNOMIALS Hen(x) From Cramer. [10], page 133, we define Hen(x) = (l)nd n /2 (A ) dx The first five polynomials are: 3 Heo(x) = 1 He3(x) = x-3x (A.4) Hel(x) = x He4(x) = x4-6x2+ 3 He2(x) = x - 1 The orthogonality relation is 196

197 +oo 2 x- "-12 e 2 He (x) He (x)dx = n'. m = n — OO (A.5) =, m n One generating function is -t2/2 + tx C tv 00 e + tx = He (x) (A.6) Or v. v and another particularly useful expansion is given by F2 2 22 2 p y 00 p x +p y -2 pxy v 1 e L 2( l-p2) P= L Hev(x) He (y), (A.7) with (A.7) holding for Ipi < 1. We define (see Table of Symbols) 1 - (Xm)2 g(x-m;a) = g(x) = 1 e 2a2 (A.8) CT -J e2it 2.8 We have from (A.3) -n g(x-m;a) g( ) He (xm) (A.9) dx a a n a and from (A.5) +oo Hen(- He-) He -) g(-) dx = n., ( = n -00 (A.10) =0, n A.2 THE POLYNOMIALS Hn(x) From Erdelyi, [12], section 10.13, define:

198 (X) n x2 n x2 H,(x) = (-1) e (d) e (A.ll) Hn(x) is also given by [ n 21 m, n-2m H (x) = n'. 1) (2) (A.12) n Z-j ml. (n-2m)'. m=O where [n/2] denotes the largest integer n/2 or (n-1)/2. The corresponding five polynomials are: Ho(x) = 1 H3(x) = 8 x3- 12 x 4: 2 H1(x) = 2x H4(x) = 16 x - 48x2 + 12 H2(x) = 4x -2 (A.13) The orthogonality relation is +00 2 -x n H (x) Hm(x) e dx = 2 n'. f 6m (A.14) n mn -00 and the corresponding generating functions for these polynomials are: 00 -t2+2tx c tn e t= Hn(X) (A.15) n=O 1'i2p2+y2p2 -2 pxy 00n 1 e 1-p2 = (p H (x) H() IP < 1 (A16) (A.l6) is called Mehler's formula. For the weight function x2 (A.16) is 1 e g (x/a) 1= -2 e O2 ( &N)G2'

199 we have from (A.ll) (d) g2(x/a) = )n () g2(x/a) (A.17) dx cn n (o and from (A.14) +00 (XH (X e dx = 2 n' (A.18) a~o mao mn -00 In terms of the g(x/c) notation, the orthogonality relation becomes +00. n rm~~~2 n' Hn(X/a)H(x/)g2(x/)dx = 2. (A.19) -C The sequence of functions given by g(x/o.)Hj(x/a) is known to be a complete orthogonal system [38]. A.5 THE EXPANSION OF THE BIVARIATE GAUSSIAN DENSITY FUNCTION AND A LEMMA An expansion for the bivariate ga.ussian density function is ea-sily (x2+y2) obtained by multiplying (A.7) through by 1 e 2 and then substituting x = -, y = Y-m2 In terms of our g(x) notation, the a1 a2 result is (x-ml)2 2 (x-ml)(y-m2) (y-m2.)2 1m 1 a12 a 1 aC2 Ca22 2oaiao2(l-p2)l/2 exp 2 1-p2 oo g(X -m) g( - n1 n a1 n a2 =( -- ) g( He (x- )Hen( )'I <. (A.20) ca1 a2 n' n cG _C n=O This result can also be derived from the bivariate gaussian characteristic

200 function by expanding the cross-product term in a power series. The double integral in the inversion formula then becomes two single integrals. We denote the bivariate guassian density function by g2(x-ml,y-m2;Cl,C2,p). If the standard deviation of both variables is the same, we designate the density by g2(x-ml,y-m2;ap). When integrating the bivariate density as given by (A.20), we will want to interchange the double integration and summation. This is easily justified by Lemma A.1: With |p| < 1, it follows that Yl Y2 / f g2(Xl,X;12,;,a2,p)dxdx2 = -00 -00 = P- g(xl/la)Hen(xi/al)dxl g(x2/c2)He (x2/a2)dx2 (A.21) n=0. n=O -0 -0 Proof: Let 00 n G(xl,x2) =y E M |Hen(xl/pl) lHen(x2/a2)|g(xl/al)g(x2/a2) n=0 (A.22) (A.22) Then by the Lebesgue monotone convergence theorem, Y1 Y2 G(xI,x2)dxidx2 = -00 -00 Stratonovich, R. L., Topics in Theory of Random Noise, Gordon and Breach, N. Y., 1963. Translated from the Russian by R. A. Silverman pp. 41-42.

201 00 n Y 2 =y - Ln - He (xi/l)|Ig(xi/l)dxi,JHe (x2/o2)lg(x2/a2)dx2 (A.25) n=O -~00 Using (A.10) and the Schwarz inequality, we obtain Yi Yi J |He (x1/a1)J|g(x1/a)dx1 < He2(Xlal )g(xl/al)dxl /g(xl/al)dxl -00 -00 -o (A.24) < /n'7 Since Ipl < 1, (A.23) is dominated by Y1 Y2 00! n Y~1Z n G(xl,x2)dxldx2 < I =(A.25) - J< n l-p (A.25) -00 -00 n=O N n Now define gN(xl,x2) = p- g(x1/c,) g(x2/a2)Hen(xl/al) Hen(x2/oa). n=O n'. Clearly, gN(xl,x2) —g(xlx2;op) pointwise. For all N we have gN(xl,x2) < G(xl,X2) which we just showed was integrable. (A.21) then follows from the Lebesgue dominated convergence theorem. A.4 MISCELLANEOUS RELATIONSHIPS From Erdelyi, [12], p. 193, we have the fact that Hn(x) is either an even or odd function, depending on the index being even or odd, Hn(x) = (-1)n Hn(-x). (A.26) From the same page we have: H (x) - 2xH (x) + 2n H n_(x) = 0 (A.27) n.+l n n-l d H(x) = H' (x) = 2n H (x). (A.28)

202 A uniform bound (in x) for the Hermite functions was given by Cramero From Erdelyi, [12], p. 208, or Sansone, [38], p. 324, Cramer's bound is 2 ex /2 Hn(x)| < cl 2n 7, (A.29) where the constant cl = 1.086435. Using (A.1) a bound in terms of the Hen(x) polynomials is x2/2 -x /4 e-xL |He (x)l< e |Hen(x)| < cl n (A.50) The bound ha.s been improved with cl replaced by 2 /T. Reuter, G. E. H., "On the Boundedness of the Hermite 1rthogonal System,' Journal of the London Math. Society, volo 24, April 1949, ppo 159-160.

APPENDIX B EVALUATION OF SOME INTEGRALS In this appendix we calculate a number of integrals involving the Hermite polynomials. For our purposes, the most useful result is given in equation (B.10). Equations (B.15) and B.17) are of interest and have been included for the sake of completeness. We shall evaluate the integrals starting from an integral appearing in Tables of Integral Transforms [13]. By way of verification, we also indicate different (and sometimes more direct) methods of obtaining the results. From Erdelyi, et. al.,[l1], page 290, number 17: +00 n/2 exp 1 (x-y) Hen(ax)dx = (2) /2 (1 ) Hen - aay (B.1) It is easy to verify this integral for n=0,1. Then, integrate (B.l) by parts and use the relations d Hen(Ox) = naHen_1(x) dx n n-l Qx Hen(ax) = Hen+l(X) + n Henl(Ox) (which are derived from (A.27) and (A.28)) to express (Bol) in terms of integrals involving polynomials of order n-1 and n-2. (Bol) is then 203

204 verified by induction. The simple substitution x=x/a, y=y/a gives +00 n g(n) Hen()d (1-Xa ) Hen Li, (B.2) CTf nf sr' ^ n l' _ J7(B.2) — 00 __ where again g( —) denotes the gaussian density with mean y and standard deviation o. Using the relationship between the two types of Hermite polynomials, (A.1), the integral is expressed in terms of Hn. — 2~ n/n/2 With x=x-m, this becomes' g(x-.) H x ( - 2)n/2 H a, y 2(B _ a n~y21o_ n o3CT(l-Q!2C1/-00 We now want to evaluate the following integral. +00 I1 = Hn( X-) g( -ml) g(Xm2) dx. (B.5) n CTa1 1 2 -00 Use the relation /xmol _~/ ml-M2 ( x-ab (B.6) g( Xml) g( 2 ) = g(mlr ) g(- ) (x. ) a1 C- a a1 +2b where m_ m2 a. 2 2 a1 a2 2 2 b2= o1 a2 a12 +022

205 and I1 becomes +00 2 9m M-l 2 H (X-ml ) g(x b )dx (7)' g( 2+a'2 n El b -00 Make the following identifications 2 y = ab -m = b T2b C1 and substitute into (B.7) +00 I1 = g(m-m2) ) H a(x-ml)1 g(x-y-ml)dx (Bo8) 12+C22 -0 n L2 a b This integral is given by (Bo4) mn/2 2 HF I, = g( -m ) (l-a) Hn j- (By 9) 9( Mal _M2. H (Bo9) _ 12' 1- 1/ Substitute for y, ca, and then for a. and b The result is Il = H (xm ) g( -m ) g( X_2)dx n'V 1''2 F 2,,/t2-.2.1+2 ll+S n L, ~ai 0 (B.10) This integral exhibits a. type of reproducing property which we will find useful in this study The property which we refer to is the fact that the (gaussian) average of a. Hermite polynomial of order n and its associated gaussian weight gives a. Hermite polynomial of the same order

206 and a gaussian weight, with the resulting two functions having different arguments. For the case where a1=C2, by using (A.12), (B.10) reduces to n I1 = g( m -ml ) (m2-ml) (B.11) I2 al r 1 As an alternate derivation, equations (B.10) and (B.ll) can be obtained by using a form of the Weirstrass transform (Bilodeau [4]). Briefly, this method involves defining the transform of *(y) by +oo 2 Y(x) = e(xY) (y)dy (B.12) -00 and noting that (under suitable conditions on i(y)) +00 dn 1 2 n (x) = - f Hn(y)e-y (y)dy. (B.13) dxn,,/, — 00 x=O The evaluation of (B.12) is carried out by completing the squares of the product of appropriate gua.ssian weights. A generalization of (B.10) is the evaluation of +00 Hn ml) g(x-m2) g(xm3)dx. (B.14) Hn i Y2 (B li3 -00 We use (B.6) to manipulate (B.14) into a form so as to use (B.10). The result of this straightforward procedure is: +00 Hn(X ml) g( m) g( X m )dx -00

207 2(_ 2. 2). 2(, 22. 22 2 2, b 622 2 (2_+2) scrs2 +Cr3 - Cl ^( 2 +3_ a32(m-ml) + a2 (m3-m-)l) (B Hn (B.15) 4 2+ 2 $22(Crl2 a32)+ C (aY12 2) The polynomials Hen(x) also exhibit the reproducing property analogous to (Bol0) For the integral given below, set al =T2 a2 and use (A.l): -00 +00 n = 2 2 Hn(YX2) g(YX2) g(y-x3) ( (B.16) o 01 z2. o3 -00 This integral is given by (Bol5). Upon substituting back for:j and He we obtain the desired result: He (yx2) g(Y X2) g(yX3)dy n a2 C2 Cy3 n nr n2 - 2. _ J a2 2 g( X3-X2 ) HeF x s- (B.17) Lo'2 ^3 J vap' c^3 2 Van +3 Here, the argument of the resulting gaussian weight and Hermite polynomial are the same (cfo (B.10))o This result can also easily be obtained from the integral g(YX2) g(- )dy = X3 -00and the following relationships and the following relationships:

208 n n d3'\IU2 +C (U2+~3 2) 22 2vC 3 -a 2C3 VU 2 3 dn 1 L(- g(~(Y-) He (La) dx3 (Y3 (3n C3 n 3 Hen(-y) = (-l)n Hew(y) Another integral which we will need is I2 = //g(xl-y;1)g(x2-y2;Cr) gp(y1-z1,y2-z2;52,p)dyjdy2 (Bol8) where g2(Y-zlYJ2-Z2;P) is the bivariate gaussian density withstandard deviation a2 and correlation coefficient p. This integral can be evaluated by using (B.17) and the expansion (Ao20). We note, however, that (B.18) represents the density function of the sum of two independent gaussian vectorso The resulting probability density function, which of course is gaussian, has a! covariance matrix equal to (Yl 0 0 s2 PG2 2- "2+C2 P222 2 + 2 \ 2 -2 2 22 0 o i J. L. (po - C = 6po.2 a +2 J Hence, defining 2 2 2 7 = 1 + 02 (B ol9) 2 a2 2 2 +O 12 (Bol8) is given by I2 = g2(xlx-Z1,x2-z2;/, P) o (B 20)

APPENDIX C THE GAUSSIAN KERNEL By specializing the kernel in section 203 to K(y)=g(y;l), we are able to perform some of the required integrations.. This permits, for example, an exact expression for the bias and second moment of the estimate. It also leads to sharper bounds on the variance expression. Col THE UTEIVARIATE CASE From section 2,35 the estimate of the density function is -n A = fn(x) = L X o nh =1 h Take K(x) = g(x;l) = i eX2/ The estimate is n f(x) = _ X g(x-X%;h). (C.I) n=. The mean value is given by E.n(x) = Eg(x-X;h) = g(x-y;h) g(y-z; c)d d(z)dy = g(x-z +h2 )da(z) (C.2) The bias in the estimate is then E f(x) - (x) - g(x-z, o +h ) - g(x-z;a B da(z) (C.) 209

210 In (2.3.20), we defined the quantity Q^ = ff K(X YK4^ <mj(Y1Y^2) - f(Yl)f(y2jJ dyldy2. This expression, the third part of the variance bound, led to the 1/nh factor. Using the gaussian kernel we will be able to eliminate the 1/h term. Substuting for the kernel we obtain Qn- = j J g(x-yl;h)g(x-y2;h) [fm (Yi.Y2) - f(Yl)f(Y2) dyldy2 (C.4) Designate the first part of this expression by Qm-l. Write out fm- (y1,Y2) and interchange the y and z integrations. s-n-^i = J j dC(z1z) d( j j g2( Y1- 12- Z2; Pm) g(x-y; h)g(x-y2;h)dyldy2. (C.5) The double integration has been evaluated in the previous appendix. Define /= -+h,,= pm-2 e *c2 (c.6) Cr+h2

211 Then, from (B.20), Qn-e,1= j J zdcz(z)dcX(Z)g( x-zlix-zl-2;P,7l ) (C 7) The second part of (C.4) is just the square of (CQ2). Combining (CL7) and (Co2) we have, Qm- = j da( z,)dU(Z) (2(x- x- z2;P,_ -) - h2 z! Z2 g(x-zl;p)g(x-Z2;) o (CO8) Noting that.rmj.l Pm- e!<1 for mR, we use the Mehler formula and Cramer's bound to obtain IQ <h jm12 cl2 - hPmei c12 c.9) m- /^' --------.. —. — - I, - 2 - j 2jTcr Il-m-Xl I -IPm_ l 2 which is the result quoted in section 2.3, equation (2o,324). Note that we have taken the -rZ) as independent. The extension to M-dependent variables is straightfowardo An additional term is added to Qm-_/h2 which, after performing the y integrations, is given by Jn g2(x-zx-zx-Z2;,i ) [d(m_ ( xz2) - da(zl)d (z2) ]o (ColO) This contribution to the V(fn(x)) expression can be bounded by 4 (M-l) n 2uay (l-p-*.)

212 The result in (C.5) enables us to get an exact expression for *the second moment of the estimate. Using this and essentially (C.2), we have E{n(X))2 = Eg2(x-X;h) 2 E Eg(x-X;h)g(x-Xm;h) E%(xX;h n2 =1 m=l+l 1 g(x-z; o2+h/2 )dc( z) nh2v-7 n n + 2 da (zl)da(z2)g2(xx-zx-z2;P, -). (C.11) n2 =1 m=2+l C.2 MEAN INTEGRATED SQUARE ERROR In section 2.3d, we stated that the MISE was = 0(1/n /5). We now obtain this bound assuming the [{Z} are independent and that the autocorrelation function satisfies condition B. From (C.2) and (C.11), the MISE is: Jn: E (n<x) f(x)) dx =< __ f g(x-z;,2+ /2 )dc?(z) n h2/r1 Z n n + 2 z I g2(x-zl,x-Z2;;,ym- )doa(zl)dCa(z2) n2=1 m=+ 2 " =1 m=i+l

213 - 2 l g(x- z; ) da( z ) g(x- z2;, + )da(z2) Z1 + f g(x-z; a)g(x-z2; )dc( z )dc(z2) C.12) g(x- (Zl+Zi ).; g(z -z 2;P2( (C.13) to perform the x integrations. We obtain: = m=i z 1 da(z) Jn h2V~ + f f g(zj-Z2;\V24 )dg(zj)dca(z2). (C.i4).n Use Parseval's relation / / g( z1- z2; a) da( zl) da( z2) -;L fct~y(v)%P(S'(v)e 2 dv, 2 Tr where -Pq(v) = J:e. da(z),

214 to write n n'nJ, ~0 p(v)'2e 2(1-%) d~ 22tJn = 7 + 2 7 j cv(v) Je_'/VI d n:h n2 R=1 m=R+l -2 J IP(v) 12e1/2v2(2a+h2 dv + lP(v)| ie dv. (C.15) Add and subtract e dv. (1-1) lp(v) |! ev2(d02+h v n Regrouping terms yields: 2:J^n Z= x -' 1! CXp(V) 12e vdv -ni n {-v2( c2+h2) - 1 v2(2o2+h) -v dv + + Ie&( v)- 2e + e dv + n n + 2 (V) I 2v2i1-) e-v J. (C. 16) n Q=1 m=1+l Using |cpa(v) |I 1, the second expression on the right hand side is bounded by'T-7/a. For the third expression, it follows that as n-, h(n) --- 0 and ( -va+h2) 2 -1/2v(22+h2) v22 Ce 2e- 2e + e - e ((5 v4h4 + 0(h6)). 8

215 Hence, as h -- 0, the third expression of (C.16) can be bounded by 15 Sh4. 8 The last expression of (C.16) is just slightly more difficult to bound. Substitute for p and Ym' and bring the summations inside the integral n n 2 e-v2(+h2) l (v) I Z (e+v a Pm- -l.d (C.17) n 1*1 m-=+l Expand the exponential in its power series and let T=m-t. The double summation is then dominated by I I (e+v Pm 1) z=1 m=2+l n ( =. (:na-) y ( V PRr) T=1 j=l j' = (+v2a2) + (m-T)pT M V~l T1 L_ j j=i j! T=1 00 j n -<n ( v2 ) PT. (c.18) j =1 J 00 T=l Under condition B, we have the boundl IpPTI-B (see 2.2.33 ). Hence, 2 T=l (C,18) is dominated by oo j nB 7 (v2) - nB e 2 2 j=l t'

216 Substituting this result into (C.17), we obtain n n 2 f e(2( hIIca( v) | (e m -2.l1) n2 l=1 m=t+l s 2B2 J 1p(v) 2 e hdv n 5 2B2 /7. (C.19) n h Combining all the bounds, we have Js 1 fL 1 1+2B2 + 15 ah. (C.20) 27:na:nh 8 As was the case for the mean-square error (2.3.26), setting h(n) = n / gives for the mean integrated square error, = E J (fn(x) - f(x)) dx = 0(1/n4/5) as n-Oo.

LIST OF REFERENCES 1. Abramson; No, and Do Braverman. "Learning to Recognize Patterns in a Random Environment'" IRE Trans. PGITj Volo 8, 1962 pp.58-635 2. Abramson, N., Do Braverman, and G. Sebestyen, "Pattern Recognition and Machine Learning," in Report on Progress in Info. Theory in the U.S.S., 1960 1963, IRE Trans. PGIT, Vol. 9, 1963, pp.257-261. 35 Barrett, J.F., and D.G. Lampard. "An Expansion of Some SecondOrder Probability Distributions and Its Applications to Noise Problems, IRE Trans. PGIT, Volo 1, 1955, ppO10-15. 4, Bilodeau, Go G. "On the Summability of Series of Hermite Polynomials," Journal of Mathematical Analysis and Applications, Vol. 8, 1964, pp. 406-422o 5 Braverman, D. "Learning Filters for Optimum Pattern Recognition," IRE Transo PGIT, Vol. 8, 1962, ppo 280-2850 6, Brick, DB., and G. Zameso "Bayes Optimum Filters Derived Using Weiner Canonical Forms,' IRE Trans. PGIT, Vol. 8, 1962, ppo 35-46. 7o Brown, JLo "On a Cross-Correlation Property for Stationary Random Processes," IRE Transo PGIT, Vol 35, 1957, ppo 28-31. 80 Capon, J., "On the Asymptotic Effieiency of Locally Optimum Detectors," IRE Trans. PGIT, Volo 7, 1961, ppo 67-71. 9, Cooper, DoBo. and PoWo CooperO "Nonsupervised Adaptive Signal Detection and Pattern Recognition," Information and Control, Volo 7. 1964, ppo 416-4440 10. Cram~g, Ho Mathematical Methods of Statistics, Princeton University Press, Princeton, New Jersey, 1947. 11o Davenport, WoBo and WoLo Root. An Introduction to the Theory of Random Signals and Noise, McGraw-Hill Book Co, New York. 1958. 12o Erdelyi, Ao. Wo Magnus, Fo Oberhettinger, and FoGo Tricomio Higher Transcendental Functions, Volo II, Bateman Manuscript Project, McGraw-Hill Book Coo New York 13953 217

218 13. Erdelyi, A., W. Magnus, F. Oberhettinger, and F.G. Tricomi. Tables of Integral Transforms, Vol. II, Bateman Manuscript Project, McGraw-Hill Book Co., New York, 1954. 14. Gantmacher, F.R. The Theory of Matrices, Vol. I, trans. by K.A. Hirsch, Chelsea Publishing Co., New York, 1959. 15. Glaser, E.M. "Signal Detection by Adaptive Filtersj" IRE Trans. PGIT Vol. 7, 1961, pp. 87-98. 16. Halmos, P.R., Lectures on Ergodic Theory, Chelsea Pub. Co., N.Y., 1956. 17. Hancock, J. and E.A. Patrick, Learning Probability Spaces for Classification and Recognition of Patterns With or Without Supervision, Purdue University School of E.E., Electronic Systems Research Lab., Nov. 1965, Tech. Rept. No. TR-EE65-21. 18. Hobson, E.W., The Theory of Functions of a Real Variable, Vol. II, Dover Publications, Inc., N.Y., 1957. 19. Johns, M.V. "An Empirical Bayes Approach to Non-Parametric Two-Way Classification," Studies in Item Analysis and Prediction, Stanford University Press, 1961 ed., H. Solomon 20. Kac, M. "A Note on Learning Signal Detection," IRE Trans. PGIT, Vol. 8, 1962, pp. 126-128. 21. Keehn, D.G., "A Note on Learning for Gaussian Properties," IEEE Trans, PGIT, Vol. 11, 1965, pp. 126-132. 22. Leipnik, R. "Integral Equations, Biorthonormal Expansions and Noise," S.I.A.M., Vol. 7, 1959, pp. 6-30. 23. Lubbock, J. K.,"The Optimization of A Class of Non-Linear-Filters," Institution of Electrical Engineers Proceedings (London), Vol. 107, pt. C, 1960, pp. 60-74. 24. Lo've, M., Probability Theory, D. Van Nostrand Co., Inc., Princeton, N.J., 1963, Third Edition. 25. Murthy, V.K., Nonparametric Estimation of Multivariate Densities With Applications, Douglas Missle and Space Systems, Div., Douglas Paper No. 3490.

219 26, Parzen, Eo, "On Estimation of a Probability Density and Mode," Ann. Math, Stat. Vol. 335 1962, 1065-1076. 27. Price, R. and Green, P, E,, "A Communication Technique for Multipath Channels," Proc. IRE, Vol. 46, 1958, ppo 555-570. 28o Raviv, J.o "Decision Making in Incompletely Known Stochastic Systems," Into Jo Engngo Sci,, Volo 3, 1965, ppo 119-140. 29. Robbins, Ho "An Empirical Bayes Approach to Statistics," Proc. Third Berkely Symposium on Statistics and Probability, Volo I, 1955, pp. 157-164o 30. Robbins. Ho, "The Empirical Bayes Approach to Testing Statistical Hypotheses, " Review of the International Stat, Inst. Vol. 351 ppo 195-208o 51. Robbins, Ho "The Empirical Bayes Approach to Statistical Decision Problems.," Ann, Matho Stat,, Vol. 355 1964, pp, 1-20, 32, Root, WoLo The Detection of Signals in Gaussian Noise, prepared under NASA Research Grant NsG-2-59, Sept, 19650 To appear as a chapter in a forthcoming book edited by AoVo Balakrishnano 335 Rosenblatt, Mo "Remarks on Some Nonparametric Estimates of a Density Function, Ann. Matho Stat,, Vol 27, 1956, ppo 832-837. 34~ Rosenblatt, Mo, Random Processes, Oxford University Press, NoY., 1962o 355 Rushford, Co Ko. Communication in Random or Unknown Channels, Stanford Electronics Lab., 1962, Repto No. 2004'-6 36, Rushford, Co Ko "Adaptive Communication With Sounding Signals in Random Channels,"E EEE Int, Convo Record, 1963, part IV, pp, 102106, 537 Samuel, Eo ".tAn pirical Bayes Approach to the Testing of Certain Parametric Hypotheses' Anno Ma Stath. Stat, Vol. 34, 1963 ppo 13701385. 38, Sansone, G,, Orthogonal Functions, Interscience Pub, Inc, No.Y. 19590 Translated from the Italian by Ao Ho Diamond,

220 59. Scudder, H.J. III, "Probability of Error of Some Adaptive PatternRecognition Machines," IEEE Trans. PGIT, Vol.11, 1965, pp. 363371. 40. Scudder, H.J. III, "Adaptive Communication Receivers," IEEE Trans. PGIT, Vol. 11, 1965, pp. 167-174. 41. Sebestyen, G., "Pattern Recognition by an Adaptive Process of Sample Set Construction," IRE Trans. PGIT, Volo 8, 1962, pp. 82-91. 42. Spragins, J., "A Note on the Iterative Application of Bayes Rule," IEEE Trans. PGIT, Vol. 11, 1965, pp. 544-549. 43. Spragins, J., "Learning Without a Teacher," IEEE Trans. PGIT, Vol. 12, 1966, pp. 223-230. 44. Tainiter, M., "Sequential Hypothesis Tests for the r-dependent Marginally Stationary Processes," Ann. Math. Stat., Vol. 37, 1966, pp. 90-97. 45. Watson, G. S. and Leadbetter, M. R., "On the Estimation of the Probability Density, I," Ann. Math. Stat., Vol. 34, 1963, pp. 480491. 46. Whittle, P. "On the Smoothing of Probability Density Functions," J. Roy. Stat. Soc. Ser. B, Vol. 28, 1958, ppo 334-343. 47~ Wong, E. and Thomas, J.B., "On Polynomial Expansions of SecondOrder Distributions," SIAM, Vol. 10, 1962, pp. 507-516.