Technical Report No. 225
004860-3-T

SIMULTANEOUS DETECTION AND ESTIMATION:
THE USE OF SUFFICIENT STATISTICS AND
REPRODUCING PROBABILITY DENSITIES

by

Jurgen O. Gobien

Approved by: Theodore G. Birdsall

COOLEY ELECTRONICS LABORATORY
Department of Electrical and Computer Engineering
The University of Michigan
Ann Arbor, Michigan

for

Contract No. N00014-67-A-0181-0035
Office of Naval Research
Department of the Navy
Arlington, Va. 22217

November 1973

Approved for public release; distribution unlimited.


ABSTRACT

The problem considered postulates two mutually exclusive and exhaustive statistical hypotheses, under each of which the probability distribution on the observation space is known except for (conditional to) a finite-dimensional parameter. The parameters may or may not have common components, and are considered as random variables and assigned a priori probability density functions (p. d. f.'s). It includes the standard signal-detection problem where either the signal, or the noise, or both, may contain uncertain parameters. Estimation is defined as knowledge of the a posteriori parameter p. d. f. given by Bayes' rule. The likelihood ratio of marginal observation distributions is considered an optimal detection statistic. It is shown that this statistic can be found by using the two separate estimation results to modify a related simple-hypothesis detection statistic. Thus, estimation and detection occur simultaneously in a very natural fashion.

The concept and existence of necessary and sufficient statistics are investigated. If the conditional observation distribution under either hypothesis admits a sufficient statistic of fixed dimension, then a natural conjugate family of parameter densities exists

and is indexed by a "conjugate parameter" of the same dimension. Explicit relations can be found to "update" the conjugate parameter based on the sufficient statistic; usually the procedure is recursive. Explicit use of Bayes' rule becomes unnecessary and the estimation problem is reduced to a tractable, fixed-dimensional procedure. The detection problem is similarly simplified. It is shown that any p. d. f. nonzero on the same parameter space as the natural conjugate density also reproduces. Much of the signal processing is shown to be independent of the a priori parameter densities. All results are rigorously extended to include observations which are continuous-parameter random processes.

To illustrate the theory, the problem of detecting a known signal in, and simultaneously estimating the parameters of, Mth-order stationary autoregressive Gaussian (Gauss-Markov) noise is addressed. Solutions are found for both the discrete (sampled) and the continuous case. The estimation solution is tractable; the detection statistic is complicated. It is written in closed form for M = 1. For arbitrary (known) values of M, it contains integrals which are quite difficult and are left unevaluated.

FOREWORD

This doctoral thesis represents a major theoretical breakthrough in the fields of detection theory and estimation. Previous research has been based on the use of the Shannon sampling theorem or the Karhunen-Loeve theorem; both require knowledge of the noise autocorrelation function. In practice, this knowledge is often unavailable. In situations where this lack of knowledge may critically influence the design and performance of equipment, it is necessary to "tell the mathematics" about this uncertainty of the noise characteristics. This doctoral thesis develops both the foundation and the techniques for working with uncertainty about the noise process. It is hoped that its distribution to research workers in detection and estimation theory will spur renewed interest and progress in extending theories toward handling more realistic situations and thereby be of more direct help to the practical equipment designer.

TABLE OF CONTENTS

ABSTRACT iii
FOREWORD v
LIST OF ILLUSTRATIONS ix
LIST OF APPENDICES x
LIST OF SYMBOLS AND ABBREVIATIONS xi

CHAPTER I: INTRODUCTION 1
1.1 General Notation and Assumptions 7
1.2 Statement of the Problem 12
1.2.1 The Simple-Hypothesis Detection Problem 13
1.2.2 The Bayesian Estimation Problem 15
1.2.3 The Compound-Hypothesis Detection Problem 18
1.3 Theoretical Foundations; the Traditional Solution 20
1.3.1 Simple-Hypothesis Detection Theory 20
1.3.2 Bayesian Estimation Theory 28
1.3.3 Compound-Hypothesis Detection Theory 33
1.4 Historical Background 36

CHAPTER II: NECESSARY AND SUFFICIENT STATISTICS 39
2.1 Definitions and General Results for Finite-Dimensional Observations 41
2.1.1 Basic Concepts 42
2.1.2 Some General Results 47
2.2 Continuous Parameter Processes 53
2.2.1 Sampling the Observation 54
2.2.2 Sufficient Statistics for Continuous Processes 57

TABLE OF CONTENTS (Cont.)

2.2.3 Summary 60
2.3 Sequential Samples from an Mth-Order Markov Process 62

CHAPTER III: SUFFICIENT STATISTICS AND REPRODUCING DENSITIES IN SIMULTANEOUS DETECTION AND ESTIMATION 74
3.1 Bayesian Estimation 75
3.1.1 Natural Conjugate Densities 77
3.1.2 Other Reproducing Densities 84
3.2 Compound-Hypothesis Detection 87
3.2.1 Natural Conjugate Densities 89
3.2.2 Other Reproducing Densities 93
3.3 Continuous Observations 94
3.3.1 Estimation 95
3.3.2 Compound-Hypothesis Detection 98
3.3.3 Sequential Processing of Continuous Observations 99
3.4 Conclusions and Historical Sketch 101
3.4.1 Summary and Discussion 101
3.4.2 Historical Outline 104

CHAPTER IV: EXACTLY-KNOWN SIGNALS IN DISCRETE STATIONARY AUTOREGRESSIVE GAUSSIAN NOISE 108
4.1 Noise Models and Parametrizations 110
4.1.1 The Mth-Order Discrete Autoregression 110
4.1.2 The Transition and Joint Densities 115
4.2 Sequential Estimation 118
4.2.1 Sufficient Statistics and the Natural Conjugate Class 119
4.2.2 Updating and Estimation 121
4.2.3 The Conditioning on y0 123
4.3 Sequential Detection 126
4.3.1 The Signal Hypothesis 126
4.3.2 Detection 127

TABLE OF CONTENTS (Cont.)

4.4 Gauss-Markov Noise: M = 1 128
4.4.1 The Autoregression 128
4.4.2 Estimation 131
4.4.3 Detection 136

CHAPTER V: EXACTLY-KNOWN SIGNALS IN CONTINUOUS STATIONARY AUTOREGRESSIVE GAUSSIAN NOISE 137
5.1 Continuous Stationary Autoregressive Noise 138
5.2 Gaussian Measures 142
5.2.1 Equivalence and Singularity 142
5.2.2 R-N Derivatives for Rational Spectrum Gaussian Processes 148
5.3 Estimation of Noise Parameters and Detection for Arbitrary M 153
5.3.1 The Implications of Singularity 153
5.3.2 Estimation of Noise Parameters 156
5.3.3 Detection in Noise of Unknown Parameters 161
5.4 The Ornstein-Uhlenbeck Process: M = 1 162
5.4.1 Estimation 166
5.4.2 Detection 170
5.5 Estimation and Detection: 2-SAG Noise 172

CHAPTER VI: SUMMARY AND CONCLUSIONS 179
6.1 Narrative Summary 179
6.1.1 Problem Statement: General Solution 179
6.1.2 Necessary and Sufficient Statistics 180
6.1.3 Continuous Observations; General Solution and Sufficient Statistics 183
6.1.4 Reproducing Densities 184
6.1.5 Discrete M-SAG Noise 186
6.1.6 Continuous M-SAG Noise 187
6.2 Contributions of this Work: Discussion 189
6.3 Areas for Future Research 192

REFERENCES 260
DISTRIBUTION LIST 267

LIST OF ILLUSTRATIONS

Figure Title Page
3.1 The Primary Detection Processor 91
3.2 The Secondary Processor: Modification for the Actual A Priori Densities 93
3.3 Sampling Scheme 100
3.4 Partitioning of the Estimator/Detector 103
5.1 Parameter Space for 2-SAG Noise 175
A.1 The Metzger Model 198

Table
3.1 Detection Problem Notation 88

LIST OF APPENDICES

APPENDIX A: THE METZGER MODEL 195
APPENDIX B: MEASURE AND PROBABILITY THEORY; SUFFICIENT STATISTICS ON MEASURE SPACES 215
APPENDIX C: QUASI-BAYESIAN: THE USE OF UTILITY MEASURES 234
APPENDIX D: PROOFS AND DERIVATIONS 240
APPENDIX E: R-N DERIVATIVES FOR THE 1-SAG PROCESS 251

LIST OF SYMBOLS AND ABBREVIATIONS(1)

a.e., a.s.   almost everywhere, almost surely; true except on a set of measure zero [#B.1].
𝒜   the σ-algebra induced on the observation space 𝒴 by the observed random process [#B.1, #2.2].
𝒜*, 𝒜_0   sub-σ-algebras of 𝒜 [#B.5].
c.d.f.   cumulative distribution function [#B.2].
D0, D1   the decision that H0 [resp. H1] is true.
{e_k}   a white sequence of unit Gaussian random variables.
E(·)   mathematical expectation.
E^0(·)   conditional expectation w.r.t. the sub-σ-algebra 𝒜_0 [#B.5, (B.13)].
f   frequency.
f(·)   a probability density function identified by its arguments [#1.1].
g[t(y); θ]   the factor of the conditional observation p.d.f. which depends on the parameter [(2.10)].

(1)Where necessary, references to the text are made in square brackets. Section numbers are preceded by #, and equation numbers placed in parentheses.

LIST OF SYMBOLS AND ABBREVIATIONS (Cont.)

G(y)   the factor of the conditional observation p.d.f. which does not depend on the parameter [(2.10)].
G(·, ·)   a detection-theory goal functional satisfying certain regularity properties [#1.3.1, #1.3.3].
h(·; θ)   an alternate way of writing the transition density of a Markov process [(3.35)].
H0, H1   mutually exclusive and exhaustive statistical hypotheses, usually representing "noise only" and "signal plus noise" resp. in communications problems.
H(j2πf)   a Fourier transfer function.
j   the square root of -1.
J(·)   a cost functional for Bayesian estimation.
L(θ; y)   the natural log of the transition likelihood ratio function for a Markov process [(2.28), (2.29)].
ℒ   Lebesgue measure.
ℓ(·)   the likelihood ratio [(1.14)].
MAP   maximum a posteriori, an estimate based on the mode of the a posteriori p.d.f.
M-SAG   Mth-order stationary autoregressive Gaussian.
ℳ   the set of measures μ_θ on the observations, indexed by θ ∈ Θ [#B.4].
n_t   a continuous-parameter noise process.
{n_i}   a discrete noise sequence.

LIST OF SYMBOLS AND ABBREVIATIONS (Cont.)

N(·, ·)   the Gaussian distribution with mean given by the first argument and variance by the second.
𝒪(·)   of the order of, in the sense that 𝒪(δ)·δ^{-1} → M < ∞ as δ → 0.
p_i   the upper-half-plane poles of a rational spectral density [(5.2), (5.3)].
p.d.f.   probability density function.
p(θ; γ)   the natural conjugate density on the parameters of H1 [(3.8)].
P(·)   a probability.
P_θ   the Borel probability measure given by the p.d.f. f(Y_k|θ) [#2.2, (2.19)].
μ_θ, μ_0   probability measures on the observation space [#B.1].
q_i   coefficients of the polynomial Q(·).
q(η; ψ)   the natural conjugate density on the parameters of H0.
Q(·)   a polynomial with roots whose real parts are negative [(5.4), (5.5)].
r_i   the cross-correlation of sequential samples separated by i instants [(4.10)].
r   the M-vector of cross-correlations of a stationary Mth-order Markov sequence.
r_1(·), r_0(·)   the R-N derivatives relating the actual and natural conjugate a priori densities under H1 and H0 resp. [(3.19)].

LIST OF SYMBOLS AND ABBREVIATIONS (Cont.)

R-N   Radon-Nikodym.
R   a covariance matrix [(4.9), (4.11)].
R^n   n-dimensional Euclidean space.
R_+   the complement of the negative half-line.
R_y(τ)   the autocorrelation function E(y_t y_{t+τ}).
ℛ   Bayes' risk, the expectation of J(·).
ℬ^n   the Borel sets on R^n.
s(t)   an exactly-known deterministic signal.
S_y(f)   the spectral density function of the process y_t.
𝒮_β   the admissible set of values for the autoregressive parameters β [#4.1.1, (4.3)].
t   time.
t(·)   a sufficient statistic.
T, T_i   fixed instants denoting the beginning or end of an observation interval.
𝒯   the space in which the sufficient statistic t(·) takes values.
w.r.t.   with respect to.
y   a scalar observation.
y_t   a continuous-time observation.

LIST OF SYMBOLS AND ABBREVIATIONS (Cont.)

Y_k   a k-vector of sequential scalar observations [(1.1)].
𝒴, 𝒴^k   the observation space and its k-fold Cartesian product.
z(·)   the natural log of the likelihood ratio.
α   the intensity parameter of the discrete autoregression [(4.2), (4.6)]; the threshold for the likelihood ratio [(1.15)].
β_i, β   parameters of the discrete autoregression [(4.2), (4.6)].
γ, Γ   the conjugate parameter under H1 and the space in which it assumes values [#3.1, (3.8)].
δ   the sampling interval.
∈   belongs to, is a member of.
η, 𝓗   the uncertain parameter of the observation under H0, and the associated parameter space [#1.2.3].
θ, Θ   the uncertain parameter of the observation under H1, and the associated parameter space [#1.2.3].
θ̂, θ*, θ_0   an estimate, the true value, and a fixed value of θ.
λ(y_t, θ)   the R-N derivative given by the limit of the likelihood ratio function [(2.21)].
Λ_k(Y_k, θ)   the likelihood ratio function of k sequential samples [(1.30)].

LIST OF SYMBOLS AND ABBREVIATIONS (Cont.)

Λ(y, θ)   the transition likelihood ratio function of a Markov process [(2.27)].
Ψ(·)   the logarithmic derivative of the unit Gaussian c.d.f. [#5.4.1].
φ(·)   the unit Gaussian p.d.f.
Φ(·)   the unit Gaussian c.d.f.
ρ   an alternate intensity parameter for the discrete autoregression [(4.22)].
[·]   modulo, with respect to; also, a reference to the bibliography or a closed interval on the real line.
∀   for any, for all.
|   given, conditional to.
*   transpose, complex conjugate.
→   tends to, has the limit.
⊂   is a subset of.
≜   is defined as.
≡   denotes equivalence of measures [#B.3].
⊥   denotes singularity of measures [#B.3].
<<   denotes absolute continuity of measures [#B.3].
∎   denotes the end of theorems, proofs, and examples.

CHAPTER I

INTRODUCTION

The theory of signal detection and estimation is the study of methods for determining the presence of, and extracting useful information from, communication signals corrupted by random interference. As such, it is cast within the framework of the statistical theories of hypothesis testing and point estimation and shares a great deal of mathematical ground with other disciplines cast in that framework; for example, pattern recognition and statistical decision theory. Both of the latter disciplines explicitly apply and make use of some statistical results concerning sufficient statistics and reproducing probability densities. It is the aim of this dissertation to apply those same concepts to the theory of signal detection and estimation. To this end, it is necessary to rederive the results in terms of the language and notation of detection theory. Then, the available results are considerably extended by applying them to the infinite-dimensional function spaces generated by random processes. Finally, they are applied to the detection and estimation problem and some significant new results are obtained.

This work concerns itself with problems in which the signal, the corrupting noise, or both contain a finite number of parameters

which are not known; it is especially significant that the methods developed are easily applied to unknown noise parameters. As will be shown, the results allow application of the classical detection theory results without the assumption that a complete statistical description of the noise is available.

The viewpoint taken is essentially Bayesian; that is, unknown parameters are considered as random variables with probability distributions which reflect a composite of the observer's subjective opinion and of past observation of their state.(1) As further observations are made, these distributions are modified according to Bayes' Rule. This viewpoint is a matter of some controversy in the statistical literature (see, e.g., Savage [53]) and certainly there are cases where it is unjustified. For many problems of interest in detection theory, the Bayesian approach is quite valuable. If nothing else, the domain of an unknown parameter can often be bounded by practical considerations and ignorance beyond this point expressed as a uniform a priori distribution.(2) Further, it is shown in Appendix C that Bayesian techniques are just as

(1)This is the only intended implication of the word "Bayesian"; in particular, the results are not restricted to the linear or quadratic cost functionals which "Bayesian" is sometimes taken to imply.
(2)See, for example, Kashyap [31].

applicable if the unknown parameters are not random but if one admits a utility function which represents a specification of the estimator and detector performance as a function of the "true" parameter values.

Since Bayesian methods in general utilize and manipulate probability distributions rather than just numbers, the data storage and calculation necessary to their use often seem prohibitive. One purpose of this dissertation is to demonstrate how, for a large class of problems, these difficulties can be surmounted.

A word is in order concerning the mathematical level of this work. It is intended that a first-year graduate education in communications engineering which includes one or two courses in probability and statistics be a sufficient background to read and apply the results. Thus, the use of analysis or measure theory is specifically avoided whenever possible. Occasionally (e.g., in sections of Chapters II, III, and V) it is not; the offending sections are marked with an asterisk. To read them, one needs the background of Appendix B, as well as some knowledge of random process theory. This dichotomy in mathematical levels has the acknowledged effect of making the material somewhat longer and occasionally more tedious than necessary, but is considered worthwhile.

The notation is standard. Footnotes are referenced in raised parentheses ( ), references to the bibliography are made in square brackets [ ], equation numbers are placed on line in parentheses ( ), and the end of theorems, proofs, examples, etc., is indicated by a square block. An effort has been made to use conventional engineering symbols, abbreviations, and notation whenever possible. For convenience, a list of symbols is in the preliminaries.

The material is organized as follows. Chapter I first states and then presents traditional solutions to three problems: simple-hypothesis signal detection, Bayesian point estimation, and compound-hypothesis detection. An interesting new model, useful for the solution of certain simple-hypothesis detection problems, is given in Appendix A and is used to obtain solutions needed in later chapters. In addition to stating the problems in some detail, Chapter I establishes the notation to be used and concludes with a brief historical review of the subject.

Chapter II addresses the concept, properties, and existence of necessary and sufficient statistics for a family of probability distributions. After a presentation of the usual statistical concepts, some modern measure-theoretic results concerning sufficient statistics are applied to the function spaces generated by a random process; this allows application of these concepts to observations

made continuously in time. The chapter concludes by reconsidering finite-dimensional observations; many of the classical results are rederived for the case where discrete samples possess an Mth-order Markov dependence, as is often true in communications and control problems.

Chapter III defines and demonstrates the existence of classes of reproducing probability densities, and then applies the properties of such classes to the Bayesian sequential and "one-shot" detection and estimation problems; these are shown to be simplified considerably. The detection problem is shown to partition in such a way that a great deal of the signal processing becomes independent of the a priori distributions. Finally, the results are extended to observations made continuously in time, and the sequential treatment of such observations is addressed.

Chapter IV applies the theory to the problem of detecting a known sure signal in, and simultaneously estimating the spectral parameters of, discrete Mth-order stationary autoregressive Gaussian (M-SAG) noise. General solutions are derived, and the case M = 1 is done in detail. The results are quite complicated; no attempt is made to simplify or approximate the discrete solution. Instead, it is left to stand in contrast to the material of Chapter V, which solves the same problem for signal and noise

observed continuously in time. Although the theory is much harder, the results are considerably simpler than the corresponding "sampled" results of Chapter IV.

The word "solutions" above should be interpreted loosely. The purpose of Chapters IV and V is to illustrate the application of and to further clarify the theory; no claim as to the practicality of the results is made. Estimation, for instance, will consist of finding the parameters which specify the a posteriori p. d. f. as a member of some known family. This is fine in theory, but actually making a sensible estimate based on that density (i.e., on its parameters) is a completely new problem, and is not one which this work intends to address. The same holds for detection; usually the detection statistic, though optimal, is so complicated as to be impractical. Again, no attempts will be made to approximate or simplify the results.

One final apology is necessary at the outset. A key feature of the theory of Chapters II and III is that it is in no way restricted to problems which are linear, Gaussian, involve quadratic costs, or any of the other usual constraints; yet, the examples of Chapters IV and V are both linear and Gaussian. There are two reasons for this: First is the usual one that Gaussian noise is indeed a very realistic model, especially in communications problems. Second, it was desirable that one unified example serve for both

the discrete and the continuous case. In the latter, the singularity of (or the existence of densities for) abstract measures becomes an important topic for which very few practical results exist once one leaves the realm of Gaussian measures.

1.1 General Notation and Assumptions

Throughout this work, observations will be denoted as y and will lie in a set 𝒴 which represents the totality of all possible observations. They are generated by some sort of random mechanism (process, experiment), and one wishes to use them to make inferences (of a nature to be made clear in Section 1.2) about that generating mechanism. There are two broad categories of observations which will be considered. The first is that y is a sample function from a real-valued random process {y_t, t ∈ [0,T]}, where t takes on continuous values in the fixed, finite interval [0,T]. The sample space is a measure space (𝒴, 𝒜, μ) of real-valued functions on [0,T]; μ is a probability measure belonging to a family ℳ = {μ_θ; θ ∈ Θ} indexed by a finite-dimensional parameter. Solution of this case is quite difficult and will usually be approached by taking a suitable limit of solutions obtained for the second category of observations, namely: the observation is a real number belonging to some domain 𝒴 ⊂ R. All results could just as well have been derived for

y ∈ R^n, but at considerable notational expense. If a single observation is to be processed, the Borel probability measure on 𝒴 will be considered as given by a member of a family {f(y|θ); θ ∈ Θ} of probability density functions (p. d. f.'s) indexed by a finite-dimensional parameter. Throughout this dissertation, it will be assumed that all Borel probability measures are absolutely continuous with respect to Lebesgue measure and hence given by p. d. f.'s.

A totally rigorous and unambiguous notation would require that these p. d. f.'s be identified by a subscript, and that different symbols be used to distinguish random variables, dummy variables, and constants in their arguments. Such a notation makes equations rather cumbersome, especially where conditional p. d. f.'s are employed; further, it can be difficult to read without practice. Hence, this dissertation will use the shorter but admittedly ambiguous notation of letting f(·) denote a p. d. f. which is identified by its arguments, e.g., f(θ) ≠ f(y). Further, the notation will not explicitly show in what sense those arguments are to be considered; ambiguities will be resolved by context or by an explicit comment.

Often, one is interested not in a single observation y but

in a finite-length sequence of observations which the process has generated. Denote the first k observations of interest as

Y_k = (y_1, y_2, ..., y_k)   (1.1)

which is, if appropriate, considered a column vector. Clearly, Y_k ∈ 𝒴^k ⊂ R^k. If no assumptions are made regarding their statistical dependence, its elements have a joint p. d. f. which belongs to {f(Y_k|θ); θ ∈ Θ}; these families may be different for different values of k, and hence require an arbitrarily large "hard memory" to store.(1) The densities can be written sequentially using the relation

f(Y_k|θ) = ∏_{i=1}^{k} f(y_i|Y_{i-1}, θ)   (1.2)

but this does not lessen the storage requirement.

The situation improves if more structure is placed on the problem. Consider, for instance, that the samples possess an Mth-order Markov dependence with stationary transitions.(2) For reasons apparent later, it is then convenient to include the

(1)"Hard" memory contains information inherent in the problem statement and stays fixed as θ is learned.
(2)Each observation is statistically dependent upon only the preceding M.

samples (y_{-M+1}, ..., y_0) and write

f(y_{-M+1}, ..., y_0, y_1, ..., y_k|θ) = f(y_{-M+1}, ..., y_0|θ) ∏_{i=1}^{k} f(y_i|y_{i-1}, ..., y_{i-M}, θ)   (1.3)

This is simplified by defining the state vector

y_i = (y_i, ..., y_{i-M+1})   (1.4)

Conditioned upon y_0, (1.3) becomes

f(Y_k|y_0, θ) = ∏_{i=1}^{k} f(y_i|y_{i-1}, θ)   (1.5)

The structure of this equation will be important in the sequel, and it will often be convenient to condition all results upon the "initial observations" y_0 = (y_{-M+1}, ..., y_0) in this fashion. It is important to note that the results will not in general be the same as if unconditioned, and the conditioning must finally be undone or justified in some way. The case of statistically independent observations is given by M = 0, i.e.,

f(Y_k|θ) = ∏_{i=1}^{k} f(y_i|θ)   (1.6)

Note that (1.5) and (1.6) present two simplifications. First, they are inherently sequential, so that the joint density of k observations can be found using only the p. d. f. of (k-1) observations and the preceding M observations themselves. Second, a saving of "hard memory" is apparent, since knowledge of the family of transition densities {f(y_i|y_{i-1}, θ); θ ∈ Θ} yields the (conditioned) family of joint p. d. f.'s for any k.

Many results for the continuous process {y_t; t ∈ [0,T]} will be obtained by considering that k samples are drawn from the interval. Let T_k denote the set of k sampling instants

T_k = {t_i = iδ_k; i = 1, 2, ..., k}   (1.7)

equally spaced with sampling interval δ_k = T/k. One obtains a discrete process {y_t, t ∈ T_k}; it will be assumed that the distribution of Y_k, the vector of (1.1) with elements y_i = y_{t_i}, satisfies all requirements posed above for the finite-dimensional case. As k → ∞ (i.e., δ_k → 0), the samples grow dense in [0,T] and, under suitable restrictions, inferences can be made about the continuous process.(1)

(1)Inferences made by this scheme are actually based on observing the half-open interval (0,T]. If the sample functions are a.s. continuous, this is clearly equivalent to observing [0,T]. If not, it may be desirable to include the point t = 0 in the finite sample sets.
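To make the sequential structure of (1.5) concrete, the following sketch evaluates log f(Y_k|y_0, θ) for a first-order (M = 1) Gaussian autoregression of the kind introduced later in Example 1.2(c). The code and all names in it are illustrative additions, not part of the original report; only one transition term is accumulated per sample, so the work and storage never grow with k.

import math

def log_joint_density(y, y0, beta1, alpha):
    # Evaluate log f(Y_k | y_0, theta) via the factorization (1.5) for the
    # M = 1 Gaussian autoregression y_i + beta1*y_{i-1} = alpha*e_i.
    log_f = 0.0
    prev = y0
    for yi in y:
        mean = -beta1 * prev                    # E(y_i | y_{i-1}, theta)
        log_f += (-0.5 * math.log(2.0 * math.pi * alpha**2)
                  - (yi - mean)**2 / (2.0 * alpha**2))
        prev = yi
    return log_f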

If the continuous process is Mth-order Markov,(1) then it is again convenient to condition all expressions upon y_0; these will not be identical to the expressions given above until the conditioning is accounted for. There is an important distinction between inferences which include initial observations and those which are conditioned on initial observations.

1.2 Statement of the Problem

This section describes the problem to be solved by first stating two sub-problems: the simple-hypothesis detection problem, and the Bayesian estimation problem. Finally a combination of the two, the doubly compound-hypothesis detection problem, will be posed. It is this last problem of simultaneous detection and estimation which is of concern; the sub-problems are separated as above for historical reasons and because the solution will separate similarly. Section 1.3 will be organized similarly and will present general solutions to the problems; Chapter III will later apply the concepts of sufficient statistics which are introduced in Chapter II.

(1)This means that any collection of sequential samples is Mth-order Markov as previously defined.

1.2.1 The Simple-Hypothesis Detection Problem. The basic signal detection problem postulates two mutually exclusive and exhaustive hypotheses, denoted as H0 ("noise alone is present") and H1 ("signal and noise are present"). The "true" hypothesis activates some sort of probabilistic mechanism (communications channel); the result is mathematically described as placing a corresponding probability measure on the space of observables 𝒴; i.e., one of two p. d. f.'s is active on 𝒴:

f(y|H0) or f(y|H1)   (1)

By observing y ∈ 𝒴, the "detector" must decide which hypothesis is true; the two possible decisions are denoted D0 and D1. The detector will be designed to maximize a "goal functional" G[P(D1|H1), P(D1|H0)] which is a real function of the "detection probability" P(D1|H1) and the "false-alarm probability" P(D1|H0). The only restriction on G(·,·) is that it be monotone nondecreasing in its first argument and monotone nonincreasing in the second: it does not penalize correct decisions or reward false ones.

(1)These statements should be interpreted as referring to measures instead of p. d. f.'s if the observations are continuous time functions. This liberty with notation will be taken throughout.

The following examples of simple-hypothesis detection problems establish notation to be used in later chapters.

Example 1.1(a) The observation is a continuous random process, given under each hypothesis by:

H0: y_t = n_t
H1: y_t = s(t) + n_t,  t ∈ [0,T]

where s(t) is an exactly-known deterministic signal, and n_t is first-order stationary autoregressive Gaussian (1-SAG) noise of zero mean with known spectral density

S_n(ω) = a²/(ω² + p_1²),  a², p_1 > 0   (1.8)

and autocorrelation function

R_n(τ) = (a²/2p_1) e^{-p_1|τ|}   (1.9)

This noise process is often called the Ornstein-Uhlenbeck process. The observation y_t, t ∈ [0,T], is to be processed to determine whether or not the signal is present; "optimality" is determined in accordance with a goal functional G which satisfies the assumptions made above. ∎

Example 1.1(b) Sample the observation of the preceding example as described in (1.7). Then

H0: Y_k = N_k
H1: Y_k = S_k + N_k

and the observation is a vector-valued random variable which is easily treated by classical methods. ∎

1.2.2 The Bayesian Estimation Problem. In the preceding section, the family of distributions on 𝒴 had two members. Now, assume that this family is, as stated in Section 1.1, given by the p. d. f.'s {f(y|θ); θ ∈ Θ} where the parameter set Θ is a domain in R^m. Let f(y|θ*) be the member of the family which is active on 𝒴. Before making an observation, the subjective and objective knowledge about θ* is summarized as an a priori p. d. f. f_0(θ), θ ∈ Θ. This may be an actual explicitly known probability density, or it may be a density function chosen from some class for its

ability to model the observer's a priori knowledge.(1)(2) Observations are used to "update" the probability distribution on Θ, resulting in a new a posteriori p. d. f. after each observation is made. This p. d. f. may then be used to make an optimal estimate θ̂(y) of θ* with respect to any desired criterion as shown in Section 1.3.2; the results of this dissertation will not be restricted to specific criteria. Instead, "estimation" will always imply explicit knowledge of the a posteriori p. d. f.

Example 1.2(a) Put

θ = (a², p_1),  a², p_1 > 0

and let the p. d. f. f_0(θ) summarize all previous knowledge about θ*. The observation is continuous,

(1)Such modeling techniques constitute an active field of study; see, e.g., Kashyap [31] or De Groot [10], Ch. 6.
(2)Appendix C shows that Bayesian techniques are as applicable (though the philosophy is different) if one considers θ a fixed but unknown constant and uses an integrable utility or performance function in place of the a priori p. d. f., or even if one has a bona-fide a priori p. d. f. and wishes to combine a utility specification with it.

y_t = n_t,  t ∈ [0,T]

where n_t is 1-SAG noise as described in Example 1.1(a); "a²" and "p_1" are the parameters of the noise spectral density, and they are to be estimated to minimize the cost functional

J = E ‖θ̂(y) - θ*‖ ∎

Example 1.2(b) Sample the observation of Example 1.2(a), so that the parameters to be estimated are unchanged but 𝒴 = R^1 and the observation is Y_k = N_k ∈ 𝒴^k. The cost is

J = E ‖θ̂_k(Y_k) - θ*‖ ∎

Example 1.2(c) Chapter V will show that the preceding example is equivalent to: Let {y_i, i = 1, 2, ...} be the first-order autoregressive process which satisfies

y_i + β_1 y_{i-1} = α e_i,  |β_1| < 1

with initial condition selected at random from a N(0, α²/(1 - β_1²)) distribution. {e_k} is a sequence of independent N(0, 1) random variables. The parameters of this problem are related to those of Example 1.2(b) by

β_1 = -e^{-p_1 δ}

α² = (a²/2p_1)(1 - e^{-2p_1 δ})

where δ is the sampling interval. ∎ (A simulation sketch based on these relations is given after the marginal detection probabilities below.)

1.2.3 The Compound-Hypothesis Detection Problem. The situation is similar to the simple-hypothesis problem, except that under either or both hypotheses the active probability measure on 𝒴 is one of a family indexed by an unknown parameter. Under H1, this parameter is denoted θ ∈ Θ, and under H0 it is η ∈ 𝓗; these may or may not have common components. The corresponding families of densities {f(y|θ, H1); θ ∈ Θ} and {f(y|η, H0); η ∈ 𝓗} are known. In essence, this problem is a combination of the two already defined. The reward of a Bayesian approach to this problem is two-fold: First, it allows detection and estimation to occur simultaneously in a natural way, especially for sequential observations.

Second, it permits sidestepping a difficult problem of statistical decision theory; namely, it avoids the issue of finding a "uniformly most powerful" decision statistic (one which is optimal for any admissible θ*). This is accomplished by optimizing the detector with respect to a goal functional exactly as was done for the simple-hypothesis problem of Section 1.2.1, except that detection and false-alarm probabilities are now defined to be the marginal probabilities

P(D1|H1) = ∫_Θ P(D1|H1, θ) f_0(θ) dθ   (1.10)

P(D1|H0) = ∫_𝓗 P(D1|H0, η) f_0(η) dη   (1.11)

where f_0(θ), f_0(η) are the a priori p. d. f.'s. Once the detector makes a decision, that hypothesis is assumed true and the corresponding parameter is to be estimated precisely as in Section 1.2.2.
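Returning to Example 1.2(c): the relations between (β_1, α²) and (a², p_1) give a direct recipe for generating sampled 1-SAG noise. The sketch below is an illustrative addition (function and variable names are assumptions of this edit, not the report's); because the discretization is exact, the samples have the statistics of the continuous process at the sample instants.

import math, random

def sample_1sag(a2, p1, T, k, seed=None):
    # Draw k equally spaced samples of the Ornstein-Uhlenbeck (1-SAG)
    # process on (0, T] via the discrete autoregression of Example 1.2(c).
    rng = random.Random(seed)
    delta = T / k
    beta1 = -math.exp(-p1 * delta)                      # beta_1 = -e^{-p_1 delta}
    alpha = math.sqrt((a2 / (2.0 * p1)) * (1.0 - math.exp(-2.0 * p1 * delta)))
    y = rng.gauss(0.0, math.sqrt(a2 / (2.0 * p1)))      # stationary start, var a^2/2p_1
    samples = []
    for _ in range(k):
        y = -beta1 * y + alpha * rng.gauss(0.0, 1.0)    # y_i = -beta_1 y_{i-1} + alpha e_i
        samples.append(y)
    return samples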

Example 1.3 In the previous notation,

H0: y_t = n_t
H1: y_t = a s(t) + n_t,  t ∈ [0,T]

where s(t) is a signal known exactly and n_t is 1-SAG noise (Example 1.1). The values of a, a², and p_1 are not known, so

θ = (a, a², p_1) ∈ Θ,  η = (a², p_1) ∈ 𝓗

Note that some parameters are common to both hypotheses. ∎

1.3 Theoretical Foundations: the Traditional Solution

1.3.1 Simple-Hypothesis Detection Theory. The problem of Section 1.2.1 has four possible decision/hypothesis combinations. Since a decision is forced, the probabilities of two of these (usually the "detection probability" P(D1|H1) and the "false-alarm probability" P(D1|H0)) are sufficient to characterize the performance of any detection scheme. It is a basic fact of classical detection theory (see [18], [43], [39]) that optimal processing of the observation consists

of computing a one-dimensional statistic (the likelihood ratio). This is trivial to demonstrate for a simple goal functional; suppose one desires to maximize

G = P(D1|H1) - α P(D1|H0),  α > 0   (1.12)

Many linear goal functionals, including those commonly referred to as "Bayes' criteria," can be written thus. Clearly,

G = ∫ P(D1|y) dP_1(y) - α ∫ P(D1|y) dP_0(y)
  = ∫ P(D1|y) [dP_1(y) - α dP_0(y)]   (1.13)

where P_1(y) and P_0(y) are probability measures on 𝒴 and P(D1|y) is a randomized decision rule. If the Radon-Nikodym derivative (likelihood ratio)

ℓ(y) = dP_1(y)/dP_0(y)
     = f(y|H1)/f(y|H0) if densities exist   (1.14)

makes sense, then (1.13) is maximized by choosing the decision rule as follows:

P(D1|y) = 1 if ℓ(y) > α
        = r ∈ [0,1] if ℓ(y) = α
        = 0 if ℓ(y) < α   (1.15)

If y is finite-dimensional and densities exist as in (1.14), then application of these concepts is straightforward; more general situations, however, can become quite complex. Birdsall [5] has demonstrated that the likelihood ratio (1.14) is an optimal decision statistic, and that the decision rule has the form of (1.15), for any goal functional possessing the properties set forth in Section 1.2.1; only the threshold α depends upon the actual criterion. This is a powerful result since it permits design of a detector which will be optimal for a very wide class of criteria.
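In code, the rule (1.15) is a threshold comparison with a randomized tie-break. The sketch below is a minimal illustration added here (the names and the tie-break probability r are assumptions, not the report's):

import random

def decide(lr, alpha, r=0.5):
    # Decision rule (1.15): announce D1 when l(y) > alpha, D0 when
    # l(y) < alpha, and choose D1 with probability r on the boundary.
    if lr > alpha:
        return "D1"
    if lr < alpha:
        return "D0"
    return "D1" if random.random() < r else "D0"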

In infinite-dimensional spaces, the situation is much more difficult because p. d. f.'s fail to exist. Most of the available theory pertains to Gaussian processes, in which case the mean and covariance functions constitute a complete statistical description. One basic result concerns the detection of a signal known exactly in known Gaussian noise; it is obtained as follows.(1) Suppose

H0: y_t = n_t
H1: y_t = s(t) + n_t,  t ∈ [0,T]   (1.16)

where s(t) is a sure signal and n_t is zero-mean stationary Gaussian noise with autocorrelation function R_n(τ). Let {λ_k; k = 1, 2, ...} be the eigenvalues of the kernel R_n(t - s) and expand y_t, s(t), and n_t in terms of the corresponding normalized eigenfunctions {φ_k(t)}; if n_t has rational spectral density, this is a complete orthonormal (c.o.n.) set of functions in L²[0,T], the space of all square-integrable functions.(2) Let {y_i}, {s_i}, and {n_i} be the corresponding sets of (possibly random) Fourier coefficients, e.g.,

n_i = ∫_0^T n_t φ_i(t) dt   (1.17)

(1)See Grenander [18] or Davenport and Root [9], Art. 14.5.
(2)Davenport and Root [9], Appendix 2.

By the Karhunen-Loeve theorem, {n_i/√λ_i} is a c.o.n. set of random variables; since Gaussian, they are also statistically independent. Let Y_k denote the finite collection (y_1, ..., y_k). It is trivial to verify that z(Y_k), the natural logarithm of ℓ(Y_k), is given as(1)

z(Y_k) = ∑_{i=1}^{k} y_i s_i/λ_i - (1/2) ∑_{i=1}^{k} s_i²/λ_i   (1.18)

Here the first term represents the necessary signal processing and the second is a measure of the performance of the detector:

∑_{i=1}^{k} s_i²/λ_i = d   (1.19)

where d is the "detectability index". For the general Gaussian

(1)Van Trees [61], p. 100 or Peterson, Birdsall, and Fox [43].

problem (i.e., z(y) is a Gaussian random variable under either hypothesis), it is always true that

d = E[z(y)|H1] - E[z(y)|H0]   (1.20)

where E(·) denotes mathematical expectation. Under suitable regularity conditions,(1) the sums in (1.18) converge in L² and

z(y) = ∫_0^T y_t s_2(t) dt - (1/2) ∫_0^T s_1²(t) dt   (1.21)

where

s_2(t) = ∑_{i=1}^∞ (s_i/λ_i) φ_i(t)   (1.22)

s_1(t) = ∑_{i=1}^∞ (s_i/√λ_i) φ_i(t)   (1.23)

(1)It is necessary that ∑ s_i²/λ_i converge; see e.g., Kelly, Reed, and Root [32].
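Given the Fourier coefficients and the (positive) eigenvalues, the truncated statistic (1.18) is just a weighted correlation. The helper below is an illustrative addition with assumed names; it computes z(Y_k) together with the detectability index (1.19).

def log_lr_coefficients(y, s, lam):
    # z(Y_k) of (1.18): correlate observation and signal coefficients,
    # each weighted by 1/lambda_i, then subtract half of d from (1.19).
    z1 = sum(yi * si / li for yi, si, li in zip(y, s, lam))
    d = sum(si * si / li for si, li in zip(s, lam))
    return z1 - 0.5 * d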

It is easily verified using (1.22) that the function s_2(t) is the solution to a Fredholm integral equation of the first kind,

s(t) = ∫_0^T s_2(λ) R_n(t - λ) dλ,  0 ≤ t ≤ T   (1.24)

and in fact, most classical solutions solve (1.24) using an associated differential equation with suitable boundary conditions;(1) the necessary signal processing is then given by the first part of (1.21).

A model first suggested by K. Metzger [38] and extensively refined by T. Birdsall allows finding s_2(t) and the quadratic content of s_1(t) through methods more familiar to the engineer. It applies to cases where n_t has a rational spectral density: in engineering terms, n_t can be represented as "white Gaussian noise" filtered by a causal, linear, time-invariant system with rational transfer function H(s). Details of the model are derived in Appendix A. As an example of its application, Appendix A

(1)For example, Van Trees [61], Art. 4.3.6 or Helstrom [26], Art. IV.5.

solves the problem of detecting an exactly-known sure signal in additive M-SAG noise; for M = 1, one obtains the solution to Example 1.1:

z(y) = (1/a²) { ∫_0^T [p_1² s(t) - s''(t)] y_t dt + [p_1 s(0) - s'(0)] y_0 + [p_1 s(T) + s'(T)] y_T } - d/2   (1.25)

where d is the detectability index

d = (1/a²) { p_1² ∫_0^T s²(t) dt + p_1 [s²(T) + s²(0)] + ∫_0^T [s'(t)]² dt }   (1.26)

Note that the last term of (1.25) does not involve the observation y_t; in most classical solutions this term is either omitted and considered as part of the likelihood-ratio threshold or is kept separate as a measure of performance.(1)(2)

(1)Van Trees [61], p. 318.
(2)Helstrom [26], p. 136.
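For a sampled record, the structure of (1.25) and (1.26) can be displayed with a crude finite-difference approximation. The sketch below is an illustrative addition (the differencing scheme and the names are assumptions of this edit, not the report's processor); y and s are equally spaced samples of y_t and s(t) on [0,T].

def ou_detection_statistic(y, s, T, a2, p1):
    # Finite-difference sketch of z(y), (1.25), and d, (1.26), for 1-SAG
    # noise; derivatives use one-sided differences at the ends.
    k = len(y) - 1
    dt = T / k
    ds = [(s[min(i + 1, k)] - s[max(i - 1, 0)])
          / ((min(i + 1, k) - max(i - 1, 0)) * dt) for i in range(k + 1)]
    dds = [(s[min(i + 1, k)] - 2 * s[i] + s[max(i - 1, 0)]) / dt**2
           for i in range(k + 1)]
    integral = sum((p1**2 * s[i] - dds[i]) * y[i] for i in range(k + 1)) * dt
    d = (p1**2 * sum(si * si for si in s) * dt
         + p1 * (s[0]**2 + s[-1]**2)
         + sum(dsi * dsi for dsi in ds) * dt) / a2
    z = (integral + (p1 * s[0] - ds[0]) * y[0]
         + (p1 * s[-1] + ds[-1]) * y[-1]) / a2
    return z - 0.5 * d, d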

For reasons which will be apparent in Chapter V, it is here retained as an explicit term in the likelihood ratio.

It is well known that sample functions of the process n_t, and hence y_t, are continuous with probability 1. Thus, the likelihood ratio (1.25) is the same whether one considers {y_t, t ∈ [0,T]} as done here, or {y_t, t ∈ (0,T]}.

1.3.2 Bayesian Estimation Theory.

a. Bayes' Rule. Consider the problem of Section 1.2.2 for sequential discrete observations. Suppose Y_k has been observed. Given the a priori p. d. f. f_0(θ), one can use the information in the observation to form an a posteriori p. d. f. f(θ|Y_k) by using Bayes' rule, which is merely a restatement of the definition of conditional p. d. f.'s:

f(θ|Y_k) = f(Y_k|θ) f_0(θ) / ∫_Θ f(Y_k|θ) f_0(θ) dθ   (1.27)

Once an observation is made, the denominator is a number which serves to normalize the a posteriori p. d. f.,

K(Y_k) = ∫_Θ f(Y_k|θ) f_0(θ) dθ   (1.28)

Prior to any observations, Y_k is a random variable and its marginal p. d. f. may be written

f(Y_k) = f(Y_k|θ) f_0(θ) / f(θ|Y_k)   (1.29)

As an incidental observation, suppose one defines

Λ_k(Y_k; θ) ≜ f(Y_k|θ) / f(Y_k|θ_0)   (1.30)

where θ_0 is a fixed value in Θ such that the expression exists. Observe that Bayes' rule (1.27) holds if Λ_k(Y_k; θ) replaces f(Y_k|θ), since the denominator of (1.30) cancels out of (1.27) and (1.28). The function Λ_k(·) is called the likelihood ratio function and will play a significant role in later chapters.

Equations (1.27) and (1.29) can be put into sequential form; using (1.2) in (1.27) gives

f(θ|Y_k) = f_0(θ) ∏_{i=1}^{k} f(y_i|Y_{i-1}, θ) / ∫_Θ (Numerator) dθ   (1.31)

Separating the first factor, one finds

f(θ|Y_k) = f(y_k|Y_{k-1}, θ) f(θ|Y_{k-1}) / ∫_Θ (Numerator) dθ   (1.32)

Comparison with (1.27) shows that, at each stage, the previous a posteriori p. d. f. should be used as a prior for the next increment. A sequential analog to (1.29) follows similarly:

f(y_k|Y_{k-1}) = f(y_k|Y_{k-1}, θ) f(θ|Y_{k-1}) / f(θ|Y_k)   (1.33)
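On a finite grid of parameter values, the recursion (1.32) is a pointwise multiply-and-normalize. The sketch below is an illustrative addition; the grid representation of the density is an assumption of this edit, not a construct of the report.

def bayes_step(posterior, transition_lik):
    # One step of (1.32) on a grid over Theta: posterior[j] holds the
    # current f(theta_j | Y_{k-1}), transition_lik[j] holds
    # f(y_k | Y_{k-1}, theta_j); the sum plays the role of the normalizer.
    new = [p * l for p, l in zip(posterior, transition_lik)]
    total = sum(new)
    return [p / total for p in new]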

For the reasons given in Section 1.1, all this is in general intractable: the hard memory (that which retains information which stays fixed throughout the problem) must store arbitrarily many different conditional p. d. f.'s f(y_k|Y_{k-1}, θ), and the soft memory (that which changes as θ is learned) must update and retain the a posteriori p. d. f. as a function. Further, all previous observations must be retained in order to select the proper conditional p. d. f. If the observations are Mth-order Markov with stationary transitions (see Section 1.1, Eq. (1.3) ff.), things improve markedly. Equation (1.31) becomes

f(θ|Y_k, y_0) = f_0(θ) ∏_{i=1}^{k} f(y_i|y_{i-1}, θ) / ∫_Θ (Numerator) dθ   (1.34)

and most of the hard-memory problems have been alleviated.

Under quite general conditions, the a posteriori p. d. f. defined by (1.27) or its equivalents approaches a delta-function about the "true" value θ* as k → ∞; this has been studied, for example, by Liporace [37] and Le Cam [34]. Note that the a posteriori p. d. f. is precisely the classical likelihood function f(Y_k|θ) altered through multiplication by the a priori p. d. f. If the

a priori p. d. f. is nonzero in a neighborhood of θ*, one expects that the maximum likelihood and Bayes estimators will have similar asymptotic properties; this has been verified by Levin and Shinakov [36] for a large class of problems.

b. Bayesian Estimation. Suppose that the a posteriori p. d. f. has been formed and that a cost functional J[θ̂(y), θ] is given (θ̂ is an estimate of the true value θ*); the object is to find a θ̂(y) which minimizes the risk ℛ, defined as the mathematical expectation of the cost. Conditional to θ*, this expectation is

E_y[J(θ̂, θ*)|θ*] = ∫ J[θ̂(y), θ*] f(y|θ*) dy

where θ̂(y) is the estimator being used. Since θ* is not known, the risk is in fact given by averaging this over all possible θ* using the known (a priori) distribution on Θ:

ℛ = ∫_Θ f_0(θ) ∫ J[θ̂(y), θ] f(y|θ) dy dθ

Note from (1.27) that, upon interchanging the order of integration, this can be written

ℛ = ∫_𝒴 f(y) ∫_Θ J[θ̂(y), θ] f(θ|y) dθ dy   (1.35)

Since f(y) > 0, it is sufficient to choose θ̂(y) in a way which minimizes the a posteriori expected cost; for a large class of cost functionals this estimate is the mean of the a posteriori density f(θ|y).(1) If a cost functional is not given, the estimate is often chosen to be the mode of f(θ|y); this is called maximum a posteriori (MAP) estimation. For purposes of this dissertation, the Bayesian estimation problem is considered solved when the a posteriori density is found, and no specific cost functionals will be considered.

A final concern is to determine if and when one can sample a continuous observation as discussed in Section 1.1 and define an a posteriori p. d. f. (and hence a Bayes' estimate) by taking a limit as the samples grow dense in the observation interval. This question will be postponed until Chapter II (Section 2.2).

(1)Van Trees [61], pp. 60-63.
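For a gridded a posteriori density, the two estimates just mentioned reduce to one-line computations. The helpers below are illustrative additions with assumed names (they presume the grid representation used in the earlier sketch):

def posterior_mean(theta_grid, posterior):
    # Bayes estimate under quadratic cost: the mean of f(theta | y).
    return sum(t * p for t, p in zip(theta_grid, posterior)) / sum(posterior)

def posterior_mode(theta_grid, posterior):
    # MAP estimate: the mode of f(theta | y).
    return max(zip(posterior, theta_grid))[1]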

1.3.3 Compound-Hypothesis Detection Theory. The problem was stated in Section 1.2.3; one desires a decision to maximize G[P(D1|H1), P(D1|H0)] where G(·,·) satisfies the consistency conditions stated in Section 1.2.1 and where the detection and false-alarm probabilities are marginal probabilities as defined in (1.10) and (1.11).(1) For discrete observations it is easily shown that the optimal decision is based on the likelihood ratio of marginal p. d. f.'s

ℓ(Y_k) = f(Y_k|H1) / f(Y_k|H0)
       = ∫_Θ f(Y_k|H1, θ) f_0(θ) dθ / ∫_𝓗 f(Y_k|H0, η) f_0(η) dη   (1.36)

If hypothesis H0 is simple this reduces to the well-known result

ℓ(Y_k) = ∫_Θ ℓ(Y_k|θ) f_0(θ) dθ   (1.37)

(1)It should be noted that defining the goal functional in this way sidesteps one of the primary concerns of statistical decision theory, namely the existence of a uniformly most powerful test; see, e.g., Van Trees [61], p. 85 or Lehmann [35], Ch. 3. This is possible only because θ is considered a random variable. (See also Appendix C).

Section 1.2.3 claimed that using the Bayesian approach results in a natural relation between detection and estimation which allows the two to be done concurrently. This is dramatically illustrated by using (1.29) to write the marginal densities in the likelihood ratio (1.36):

ℓ(Y_k) = [f_0(θ) / f(θ|Y_k, H1)] [f(η|Y_k, H0) / f_0(η)] ℓ(Y_k|θ, η)   (1.38)

where

ℓ(Y_k|θ, η) ≜ f(Y_k|θ, H1) / f(Y_k|η, H0)   (1.39)

Note that θ and η must "cancel out" of the right side of (1.38), since the left side does not depend on the parameters. This implies that any fixed, convenient, admissible value of the parameters may be used to evaluate the expression, and ℓ(Y_k|θ, η) is merely the simple-hypothesis likelihood ratio for those fixed values.
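Numerically, (1.36) can also be approximated directly by replacing the integrals over Θ and 𝓗 with sums on parameter grids. The sketch below is an illustrative addition of this edit (names and the grid weights are assumptions):

def marginal_lr(lik_h1, prior_h1, lik_h0, prior_h0):
    # Likelihood ratio of marginal densities, (1.36), on parameter grids:
    # lik_h1[j] is f(Y_k | theta_j, H1); prior_h1[j] is f_0(theta_j) times
    # the grid spacing, and similarly for H0 over the eta grid.
    num = sum(l * p for l, p in zip(lik_h1, prior_h1))
    den = sum(l * p for l, p in zip(lik_h0, prior_h0))
    return num / den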

Thus, the compound-hypothesis detection problem can be solved using the simple-hypothesis result, modified by the a posteriori p. d. f.'s which are found in the Bayesian estimation problem. Equation (1.38) will be a basic relation in the work which follows.

A sequential analog to (1.38) is again easily found; it is given by

ℓ(Y_k) = ∏_{i=1}^{k} ℓ(y_i|Y_{i-1})   (1.40)

where

ℓ(y_i|Y_{i-1}) = [f(θ|Y_{i-1}, H1) / f(θ|Y_i, H1)] [f(η|Y_i, H0) / f(η|Y_{i-1}, H0)] ℓ(y_i|Y_{i-1}, θ, η)   (1.41)

1.4 Historical Background

The "classical" theory of signal detection, based primarily upon the statistical theory of hypothesis testing, was developed in the early 1950's; basic papers in the field are those of Grenander [18], Peterson, Birdsall, and Fox [43], and Middleton and Van Meter [39]. This early work was primarily concerned with detection; although the presence of unknown parameters in the signal was addressed, no attempt was made to simultaneously estimate these.

The earliest treatments of simultaneous estimation and detection usually proposed suboptimal solutions based upon using a maximum likelihood estimate of the parameters in a detection scheme; see, e.g., Kelly, Reed, and Root [32] or Price [45]. Only quite recently has the simultaneous optimality of detection and estimation schemes been addressed; even then, this has usually been done in a limited context which tends to obscure the overall simplicity of the problem. For instance, Middleton and Esposito [40] developed a Bayesian solution for signal parameters, but from a nonrecursive viewpoint and through use of a cost structure which explicitly considers coupling of the detection and estimation costs. Scharf and Lytle [54] address Gaussian noise of unknown level, thus including noise parameters in the problem. Their solution is nonrecursive, and investigates the existence of uniformly most powerful (UMP) or UMP-invariant tests; this question can be avoided by adoption of a Bayesian approach. Spooner [57] [58] considered unknown noise parameters in detail; his work falls closest to that presented here, being a specific application of the principles involved. Jaffer and Gupta [29] [30] consider the recursive Bayesian problem

using a quadratic cost, Gaussian Markov processes, and estimating only signal parameters. Roberts [47] [48] considered signals of unknown phase or amplitude. Though much more specialized, his work also is quite close to the results given here; he appears to have been the first to employ reproducing probability densities (a phenomenon he called "closure of the distribution") for the uncertain parameters. Spooner [58] used the same technique to estimate the spectral density of sampled white noise; the results were subsequently generalized by Birdsall [6], whose work provided much of the motivation for Chapter III of this dissertation. Many other pertinent papers exist in the recent literature; among them, the multiple-hypothesis work of Fredriksen, Middleton, and Vandelinde [17] and the paper by Nahi [41]. A good discussion of earlier papers on the problem is contained in [40].

CHAPTER II

NECESSARY AND SUFFICIENT STATISTICS

Most results in the sequel depend on the existence of sufficient statistics for the parameter of a family of probability distributions; this chapter is dedicated to the presentation of results concerning such statistics and the families of distributions which admit them. The emphasis is different from that usually found in treatments of the subject; for a more conventional approach, the reader is referred to Ferguson [16] Ch. 3, or Hogg and Craig [28] Ch. 8. A rigorous, measure-theoretic treatment based on the work of Halmos and Savage [24] and Bahadur [3] can be found in Appendix B and is necessary to a full understanding of Section 2.2.2 and portions of Chapters II and V. The mathematics involved in that treatment are sufficiently difficult to be incompatible with the rest of this work; hence, the results were relegated to an appendix. All results in Sections 2.1 and 2.3 can be derived from the work in Appendix B; in keeping with the spirit of this dissertation, they are here developed independently at a lower level of rigor.

Section 2.1 will introduce the concept of necessary and sufficient statistics by considering them in the context of finite parameter and observation spaces (which means that all probabilities are discrete). In addition to motivating the various

definitions, this discussion forms a good heuristic basis for the measure-theoretic work in Appendix B. The section then proceeds to formally define necessary and sufficient statistics (using the factorization criterion, since all spaces are assumed finite-dimensional and all probability measures absolutely continuous), and concludes by demonstrating a function which, under suitable restrictions, always satisfies the definitions.

Section 2.2 extends the results to continuous-parameter random processes. This is done by assuming such a process to be observed on a fixed, finite interval and sampled at evenly-spaced points on that interval. The samples are made to grow dense, and limits for the sufficient statistics and the a posteriori p. d. f. generated by Bayes' rule are investigated.

Finally, Section 2.3 assumes that observations are generated sequentially by some process which causes them to possess an Mth-order Markov dependence. This generalizes, and includes as a special case, the situation where samples are independent. Under certain restrictions, only exponential families of distributions will be seen to admit sufficient statistics of fixed, finite dimension.(1)

(1)This was first proved for the case of independent samples by Koopman [33].

The general theory developed in Chapter III will depend primarily upon the contents of Sections 2.1 and 2.2. Section 2.3 treats a special class of processes and is presented here for reference in Chapter IV, where the Mth-order Gaussian Markov process will be considered.

2.1 Definitions and General Results for Finite-Dimensional Observations

Recall that y ∈ 𝒴 is observed; for notational simplicity, 𝒴 is considered a domain in R¹. If k sequential observations are made, their ordered column vector is denoted Y_k ∈ 𝒴^k ⊂ R^k. Suppose that P_k = {f(Y_k|θ); θ ∈ Θ ⊂ R^m} is a known family of Borel probability measures given by the indicated piecewise-smooth p. d. f.'s on 𝒴^k. (These results are generalized in Section 2.2 and Appendix B.) Occasionally in the sequel, it will be assumed that each f(Y_k|θ) is strictly positive on 𝒴^k; this will be referred to as the regular case. The assumption, when made, simplifies results significantly but rules out some (though not many) distributions of practical interest; e.g.,

f(Y_k|θ) = θ^{-k} if 0 ≤ y_i ≤ θ, i = 1, ..., k
         = 0 elsewhere
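For this non-regular family the a posteriori density depends on Y_k only through max(y_i), which foreshadows the definition of sufficiency given next. The gridded sketch below is an illustrative addition (names and the grid representation are assumptions of this edit):

def uniform_posterior(y, theta_grid, prior):
    # Posterior over a grid of theta values for f(Y_k|theta) = theta**(-k)
    # on 0 <= y_i <= theta: only k and t(Y_k) = max(y_i) enter, since
    # every theta below the maximum receives zero likelihood.
    # (The grid is assumed to extend beyond max(y).)
    k, t = len(y), max(y)
    post = [(th ** (-k) if th >= t else 0.0) * p
            for th, p in zip(theta_grid, prior)]
    total = sum(post)
    return [p / total for p in post]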

2.1.1 Basic Concepts. The concept of a necessary and sufficient statistic is easy when stated with enough generality: Suppose the observations Y_k have been made. A mapping t(Y_k) is a sufficient statistic if its value contains, in a sense to be defined, as much information about the true value of θ as did Y_k itself; a mapping t(Y_k) is a necessary statistic for θ if none of the information it contains about θ is redundant.(1) The definitions in existence all seek to express these concepts in various forms and degrees of generality. Usually the task is complicated by the necessity of admitting the non-Bayesian, so that θ may not be considered a random variable. That is not a problem here; accordingly, t(Y_k) will be considered a sufficient statistic (of a sample of size k, for the parameter θ) if

f(θ|Y_k) = f[θ|t(Y_k)],  Y_k ∈ 𝒴^k   (2.1)

where the a posteriori p. d. f.'s are given by Bayes' rule (1.27).

The following discussion, due to Dynkin [13], makes the

(1)The sense in which this may be interpreted is a subject of continuing debate by mathematical statisticians. For a good discussion, see the conclusion of the paper by Halmos and Savage [24].

results in the sequel heuristically clear. Suppose that all probability distributions are discrete and finite, i.e., one has a finite set of random outcomes

𝒴 = {y_1, y_2, ..., y_N}

On these observations define a finite family of probability distributions, each member of which is

{P(y_1|θ_i), ..., P(y_N|θ_i)}

where θ_i ∈ Θ = {θ_1, ..., θ_s}, a finite parameter set. Given an a priori distribution {P(θ = θ_i) = p_i, i = 1...s} on Θ, and assuming an outcome y was observed, one uses Bayes' rule

P(θ_i|y) = p_i P(y|θ_i) / ∑_{k=1}^{s} p_k P(y|θ_k),  i = 1...s   (2.2)

to construct the a posteriori distribution. Now suppose that for some y', y'' ∈ 𝒴,

P(y'|θ_i) = P(y''|θ_i),  i = 1...s   (2.3)

Then from (2.2),

P(θ_i|y') = P(θ_i|y''),  i = 1...s   (2.4)

and the outcomes {y', y''} are equivalent for constructing the a posteriori distribution on Θ. Use (2.3) to define a class of sets on 𝒴:

B_k = {y : P(y|θ_i) = γ_{ki}, i = 1...s}

The class {B_k} is a sufficient class of sets for Θ, since one need only know the set into which a particular observation fell in order to construct the a posteriori distribution on Θ. A mapping t: 𝒴 → 𝒯 which is constant on the sets B_k is a sufficient statistic for θ;(1) clearly, P(θ_i|y) = P(θ_i|t(y)), i = 1...s.

It is not clear that (2.3) generates the coarsest class which is sufficient to construct {P(θ_i|y); i = 1...s}. In fact, it does not in general. Suppose θ_0 ∈ Θ is a value of the parameter for which P(y_i|θ_0) > 0, i = 1...N; define the probability ratios

(1)In Appendix B, {B_k} = 𝒜_0 will be called a "sufficient sub-σ-algebra" for {μ_θ}, and t(·) is a statistic which generates 𝒜_0.

Λ(y_j|θ_i) = P(y_j|θ_i) / P(y_j|θ_0),  j = 1...N, i = 1...s   (2.5)

Bayes' rule (2.2) is unchanged if all P(y|θ_i) are replaced by Λ(y|θ_i). Hence, the class of sets {A_k},

A_k = {y : Λ(y|θ_i) = λ_{ki}, i = 1...s}   (2.6)

is also sufficient for Θ.(1) It can be shown that {A_k} is the coarsest (i.e., the minimal) class which is sufficient; accordingly, it is called the necessary and sufficient class of sets for Θ, and a mapping T which is constant on {A_k} is a necessary and sufficient statistic.

Now let {A_k} be an arbitrary sufficient class for Θ and let t(y) be a sufficient statistic which is constant on members of this class, say t(y) = t_k for y ∈ A_k. By assumption, then,

(1)It is clear that a sufficient class of sets also results if the denominator of the ratios (2.5) is any arbitrary probability distribution on 𝒴 which does not depend on θ; this class will not, however, necessarily be minimal.

46 P( ly) = P( It(y)) (2.7) for i = 1...s. Let {pi, i =1...s} be a known assignment of a priori probabilities. It follows directly from the definition of conditional probabilities that Pi P(y 0i) = P(iIy)P(y) Hence, P(y i) = g[t(y), i ] G(y) (2.8) where, from (2. 7), P(O ilt(y)) g[t(y), 0 = Pi depends on y only through t(-), and where s G(y) = Z P P(yl0) r r r r=l does not depend on 0. The converse is trivial to verify: Using Bayes' rule, any family satisfying (2. 8) satisfies (2. 7). Thus, the factorization (2. 8) provides an alternate characterization of a

47 sufficient statistic. Finally, compute the conditional probabilities: P(yI i) if yEAk P(ylt(y)= tk, 6i) r= ^ ~k i <YrEAk 0 otherwise Using (2. 8) and the fact that t(y) = tk SEAyk, l/nk if yEAk P(yltk' i) = (2.9) 0 otherwise where nk is the number of observations in Ak. This motivates yet a third characterization of sufficient statistics; namely, t(y) is a sufficient statistic for 0 if the conditional distribution P(ylt, ) does not depend on. This completes the elementary discussion using finite spaces. 2. 1. 2 Some General Results. For this and the next section,
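The finite-space discussion lends itself to a short numerical illustration, a minimal Python sketch in which the family P(y|θ), the prior, and the outcome labels are all invented for the example and are not prescribed by the development above. It forms the a posteriori distribution (2.2), verifies that outcomes satisfying (2.3) yield identical posteriors, and builds the coarser partition (2.6) by grouping outcomes with a common ratio vector (2.5):

import numpy as np

# Rows: outcomes y_1..y_5; columns: parameter values theta_1, theta_2.
# The entries are invented for illustration; each column sums to 1.
P = np.array([[0.10, 0.20],
              [0.10, 0.20],    # equal rows: y_1 and y_2 satisfy (2.3)
              [0.30, 0.15],
              [0.20, 0.10],    # proportional (not equal) to y_3: same set A_k
              [0.30, 0.35]])
prior = np.array([0.5, 0.5])

def posterior(y):
    """Bayes' rule (2.2): the vector P(theta_i | y)."""
    joint = prior * P[y]
    return joint / joint.sum()

assert np.allclose(posterior(0), posterior(1))      # (2.3) implies (2.4)

# Partition (2.6): group outcomes by the ratio vector (2.5), theta_1 as theta_0.
cells = {}
for y, lam in enumerate(P / P[:, [0]]):
    cells.setdefault(tuple(np.round(lam, 12)), []).append(y)
print("necessary and sufficient cells:", list(cells.values()))

# y_3 and y_4 fall in different sets B_k but the same set A_k, and indeed
# they produce the same a posteriori distribution.
assert np.allclose(posterior(2), posterior(3))

Note that the partition {A_k} obtained here is strictly coarser than {B_k}, which is exactly the point made above.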

2.1.2 Some General Results. For this and the next section, the factorization criterion will serve as a useful characterization of sufficient statistics:

DEFINITION 2.1 A mapping t_k: 𝒴^k → 𝒯_k, t_k = t_k(Y_k), is said to be a sufficient statistic of a sample of size k for the family {f(Y_k|θ); θ ∈ Θ} if there exist:
-- a nonnegative function g[t_k(Y_k); θ] which depends on Y_k only through t_k(·), and
-- a function G(Y_k) which does not depend on θ,
such that

    f(Y_k|θ) = g[t_k(Y_k); θ] G(Y_k)        (2.10)

The value of a statistic which satisfies (2.10) is clearly sufficient to evaluate the a posteriori p.d.f. via Bayes' rule; using (2.10) in (1.27) and (1.28),

    f(θ|Y_k) = g[t_k(Y_k); θ] f_0(θ) / ∫_Θ (Numerator) dθ = f(θ|t_k(Y_k))        (2.11)

The set 𝒯_k is purposely left unspecified as yet; if it has enough structure and dimension less than k, then a clear saving in "soft memory" requirements results from using t_k(Y_k) rather than Y_k to estimate θ.

DEFINITION 2.2 A system of sufficient statistics {t_k: 𝒴^k → 𝒯_k}, k = 1, 2, ..., is said to be of fixed dimension r if, for any k, elements of the set 𝒯_k can be indexed by a parameter of dimension r. ∎

All sufficient statistics of practical interest will be of fixed dimension; further, it will be possible to update them sequentially: t_{k+1}(Y_{k+1}) can be evaluated using only t_k(Y_k) and y_{k+1}. These two properties, and the fact that sufficient statistics exist in many problems of practical interest, lend significance to their study; both properties are illustrated in the short sketch at the end of this discussion.

So far, nothing has been said about necessary statistics. To discuss the topic without use of measure theory, the following concept is needed:(1)

DEFINITION 2.3 Let s_1(·), s_2(·) be any mappings defined on a set 𝒳. Then one says
-- s_1 is dependent on s_2 if s_2(x') = s_2(x'') implies s_1(x') = s_1(x'');
-- s_1 is equivalent to s_2 if each is dependent on the other;
-- s_1 is trivial if it is equivalent to I(x), the identity function on 𝒳.

(1) The following definitions should be compared with Definition B.2 of Appendix B.

DEFINITION 2.4 t_k: 𝒴^k → 𝒯_k is a necessary statistic of a sample of size k for the family {f(Y_k|θ); θ ∈ Θ} if it is dependent on every sufficient statistic. ∎

The following facts follow trivially from the definitions:
a. The elementary necessary statistic is a constant on 𝒴^k.
b. The elementary sufficient statistic is Y_k itself.
c. All necessary and sufficient statistics are equivalent.
d. If T_k is a sufficient statistic for θ and T_k is dependent on S_k, then S_k is a sufficient statistic for θ.
e. If T_k is a necessary statistic for θ and S_k is dependent on T_k, then S_k is a necessary statistic for θ.

The definitions agree with the concepts presented in Section 2.1.1. An elementary discussion using classes of sets is not possible here because 𝒴^k is not finite, or even countable. Thus, it is necessary to resort to the properties of functions to describe the desired concept.
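The fixed dimension and the sequential updating of Definition 2.2 are demonstrated by the following minimal Python sketch; the i.i.d. N(μ, σ²) family with both parameters unknown is chosen only as a familiar stand-in, and the numerical values are invented:

import numpy as np

# For i.i.d. N(mu, sigma^2) samples with both parameters unknown, the
# statistic (k, sum y_i, sum y_i^2) is sufficient, of fixed dimension r = 3,
# and sequentially updatable: t_{k+1} follows from t_k and y_{k+1} alone.
def update(t, y):
    k, s1, s2 = t
    return (k + 1, s1 + y, s2 + y * y)

rng = np.random.default_rng(0)
t = (0, 0.0, 0.0)
for y in rng.normal(1.0, 2.0, size=1000):
    t = update(t, y)          # soft memory stays three numbers, never 1000

k, s1, s2 = t
print("sample mean:", s1 / k, " sample variance:", s2 / k - (s1 / k) ** 2)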

DEFINITION 2.5 Let θ_0 ∈ Θ be a fixed value of the parameter for the regular(1) family of p.d.f.'s {f(Y_k|θ); θ ∈ Θ}. The likelihood ratio function of a sample of size k, Λ_k: 𝒴^k × Θ → R^+, is defined as(2)

    Λ_k[Y_k, θ] ≜ f(Y_k|θ) / f(Y_k|θ_0)        (2.12)

(1) "Regular" means that each member of the family is strictly positive on 𝒴^k; see Section 2.1.
(2) Compare this with (2.5).

Recall from the comment made following (1.30) that Bayes' rule can be written

    f(θ|Y_k) = Λ_k(Y_k, θ) f_0(θ) / ∫_Θ (Numerator) dθ        (2.13)

Thus, if t_k(·) is a sufficient statistic for θ, it follows that

    Λ_k[Y_k, θ] = g[t_k(Y_k); θ] / g[t_k(Y_k); θ_0]        (2.14)

The significance of the likelihood ratio function lies in the following result:

THEOREM 2.1(1) The likelihood ratio function, considered as a mapping from 𝒴^k into piecewise smooth, positive functions on Θ, is a necessary and sufficient statistic of a sample of size k for θ.

Proof: Sufficiency follows by rearranging (2.12); necessity from (2.14), whence Λ_k is dependent on any sufficient statistic t_k(·).(2) ∎

(1) Dynkin [13], p. 23.
(2) The remark made following (2.5) holds here also: if the denominator of the likelihood ratio function is an arbitrary p.d.f. of Y_k which does not depend on θ, then the function yields a sufficient statistic; choosing the denominator from the family {f(Y_k|θ)} yields a necessary and sufficient statistic. In fact, it is easy to see that if the denominator is a linear combination of members of the family, then one still obtains a necessary and sufficient statistic.

Clearly, the same properties hold for the function

    L(Y_k, θ) ≜ ln Λ_k(Y_k, θ)        (2.15)

In either case, the range of the statistic so defined is a function space; suppose it has finite dimension r. Then the likelihood ratio function can be expressed in terms of a given basis in the space; the coefficients of such an expansion will constitute a mapping 𝒴^k → R^r, and are a set of necessary and sufficient statistics in the usual sense. This will be made more precise in a later section; to do so, it is beneficial to restrict attention to a smaller class of observed processes. First, the case where the observation space is a function space will be investigated.

2.2 Continuous Parameter Processes

Suppose the observed process is {y_t, t ∈ [0, T]}, a continuous-parameter random process on a fixed finite interval whose sample functions, elements of a measure space (𝒴, 𝒜), are real-valued on [0, T] ⊂ R^1. ℳ = {μ_θ, θ ∈ Θ} is a family of probability measures on (𝒴, 𝒜). The definition of this measure space is beyond the scope of this dissertation;(1) elements of 𝒴 are now functions, which is inconsistent with the previous use of the symbol but should cause no confusion.

(1) See, e.g., Doob [11], pp. 47-50, or Wong [62], pp. 37-41.

Section 2.2.1 considers that a finite number of samples are drawn from the observation, and previously established results are applied to the vector of those samples. The limiting procedure as sampling grows dense in [0, T] will be discussed. Section 2.2.2 actually develops the continuous result; a working knowledge of the contents of Appendix B is presupposed. Finally, Section 2.2.3 presents a brief summary of the results, stated in terms which do not require a knowledge of measure theory.

2.2.1 Sampling the Observation. Suppose that discrete observations are generated by drawing k evenly spaced samples from a realization y_t, as discussed in Section 1.1 (see Eq. (1.7) ff.). As before, the column vector of samples is denoted Y_k = (y_1, y_2, ..., y_k)*. In some cases (e.g., Markov processes) it may again be convenient to omit a number of "initial" samples from consideration and proceed with all expressions conditioned on these samples.(1) This will be done in applications which follow in later chapters, but will not be explicitly indicated here.

(1) Recall the discussion of Section 1.1.

The joint p.d.f. of k samples, f(Y_k|θ), is usually easy to find.(1) One wishes to investigate limiting behavior as the samples grow dense in [0, T], i.e., as δ_k → 0 or k → ∞. Specifically, it is desired to investigate the limits, should they exist, of the sufficient statistics and the a posteriori p.d.f. This requires some care, since subtle mathematical singularities can occur in such a procedure;(2) mathematical rigor is indispensable, and liberal use of the measure-theoretic results of Appendix B will thus be made.

(1) It is, in fact, the theoretical basis for being able to define the measures {μ_θ}; see Wong [62], p. 40, or Doob [11], pp. 47-48.
(2) See, for example, Slepian [56].

A direct attempt at finding a limiting a posteriori p.d.f. by use of Bayes' rule, say

    f(θ|y_t) = lim_{k→∞} f(Y_k|θ) f_0(θ) / ∫_Θ (Numerator) dθ        (2.16)

is usually doomed to failure. The conditional p.d.f. f(Y_k|θ) is defined on a space of dimension k, but p.d.f.'s in the usual sense (i.e., with respect to Lebesgue measure) only exist on finite-dimensional spaces; the problem arises because of the required normalization. Thus, the numerator of the right-hand side of (2.16) does not in general have a limit; the limit of the entire term, including the denominator, may be extremely difficult to evaluate. Recall from (2.13) that if the family of distributions is regular, Bayes' rule can be written using the likelihood ratio function Λ_k(Y_k; θ) as defined in (2.12). Moreover, this function will be shown to have a limit under quite general conditions; if it does, one can define an a posteriori p.d.f. on Θ for continuous observations by

    f(θ|y_t) = lim_{k→∞} Λ_k(Y_k; θ) f_0(θ) / ∫_Θ (Numerator) dθ        (2.17)

Suppose that the families of density functions {f(Y_k|θ); θ ∈ Θ}, k = 1, 2, ..., have fixed rank r < ∞; i.e., they admit a family of sufficient statistics {t_k(·)} of dimension r.

Under suitable conditions (e.g., Section 2.3; in particular, Eq. (2.34)) the likelihood ratio function is

    Λ_k(Y_k; θ) = Λ(t_k(Y_k); θ)        (2.18)

and it appears the sufficient statistics may themselves possess a limit. This will also be investigated.

2.2.2 Sufficient Statistics for Continuous Processes. The convergence of the likelihood ratio function can be established using a basic result for random processes. Assume that μ_0 is a probability measure which dominates ℳ: μ_θ << μ_0 for all θ ∈ Θ.(1) Suppose (𝒴, 𝒜, μ_0) is complete. Let T_k denote the set of k sampling instants, see (1.7), and P_{k,θ} the Borel probability measure generated by {y_t, t ∈ T_k} under μ_θ. By assumption, P_{k,θ} is given by the p.d.f. f(Y_k|θ), and so

    Λ_k(Y_k; θ) = dP_{k,θ} / dP_{k,0}        (2.19)

(1) Theorem 2.2 will hold whether or not μ_0 ∈ ℳ. In practice, things are greatly simplified if ℳ is an equivalent family and μ_0 ∈ ℳ.

THEOREM 2.2(1) Let
(i) (𝒴, 𝒜, μ_θ), θ ∈ Θ, be as above;
(ii) (𝒴, 𝒜, μ_0) be complete, and μ_θ << μ_0 for all θ ∈ Θ;
(iii) the process {y_t, t ∈ [0, T]} be separable and continuous in probability [μ_0] for all θ.
Then for each θ ∈ Θ,

    Λ_k(Y_k; θ) → (dμ_θ/dμ_0)(y_t)   a.s. [μ_0]        (2.20)

(1) This result was rigorously established by Striebel [60], who obtained it as a direct consequence of Doob's convergence theorem for martingales ([11], Ch. 7, Thm. 4.1). In her work Striebel also observed that sufficient statistics can often be found by inspection of the resulting density.

Assumption (iii) is quite weak; it is satisfied by virtually every process of practical interest. The requirement which is stronger, and which may be more difficult to establish, is that μ_θ << μ_0 for all θ ∈ Θ. Fortunately, a large literature exists on the singularity of Gaussian measures, and Gaussian processes are of primary interest in many applications.

The convergence in (2.20) does not imply that the limit (a Radon-Nikodym derivative) is easy to evaluate or even exists in closed form. The existence of sufficient statistics can simplify the matter considerably; the following theorem summarizes the results of Appendix B for the case where μ_0 dominates ℳ.(1)

(1) Recall the footnote following Theorem 2.1 concerning the choice of denominator; it applies here as well.

THEOREM 2.3 Put

    λ(y_t; θ) ≜ (dμ_θ/dμ_0)(y_t)        (2.21)

Let A_θ(r) be the inverse image of the Borel set {λ(·; θ) < r} in R^1, and let 𝒜* be the minimal σ-algebra which contains all A_θ(r), θ ∈ Θ, 0 < r < ∞. The sub-σ-algebra 𝒜* is necessary and sufficient for ℳ. In particular, t: (𝒴, 𝒜) → (𝒯, ℬ) is a necessary and sufficient statistic for ℳ (i.e., for θ) iff there exists a family of functions g(·; θ) such that

    λ(y_t; θ) = g[t(y_t); θ] G(y_t)        (2.22)

Here, "y_t" denotes a sample function and "t(·)" the sufficient statistic. Since λ(y_t; θ) is the probability density of μ_θ with respect to μ_0, Theorem 2.3 is a generalization of the classical factorization criterion; often, the sufficient statistics can be found by inspection. The definitions of necessary and sufficient σ-algebras which were used to obtain the theorem are direct extensions of the set-theoretic notions discussed in Section 2.1.1.

2.2.3 Summary. The procedure for the continuous case is as follows. Once it has been verified that {y_t, t ∈ [0, T]} is separable and continuous in probability,(1) a suitable μ_0, with respect to which each member of ℳ is absolutely continuous, must be found. If μ_0 ∈ ℳ or is a linear combination of members of ℳ, the procedure is guaranteed to yield a necessary and sufficient statistic; if not, only sufficiency can be ensured. Commonly, ℳ will be an equivalent set of measures and μ_0 = μ_{θ_0}, with θ_0 ∈ Θ chosen arbitrarily;(2) as demonstrated in Chapter V, this has an additional advantage for certain Gaussian processes. Next, one writes the joint p.d.f. of k < ∞ evenly spaced samples of y_t and forms Λ_k(Y_k; θ) as in (2.12). At this point, it is helpful to recognize the sufficient statistics t_k(Y_k) if they exist, and to arrange the expression for Λ_k (usually through judicious multiplication and division by the sampling interval) so that the statistics will converge to a well-defined function of y_t. The limit of Λ_k(Y_k; θ) is then evaluated,(3) and the sufficient statistics for continuous observations are recognized directly (as above) or through factoring as in Theorem 2.3. The a posteriori p.d.f. of θ is given by (2.17), which may clearly be written

    f(θ|y_t) = λ(y_t; θ) f_0(θ) / ∫_Θ (Numerator) dθ        (2.23)

provided that the integral exists. The procedure will be illustrated in Chapter V. As a final comment, it is pointed out that procedures other than the limiting one described above exist for evaluating Radon-Nikodym derivatives; any such procedure may, of course, be employed in lieu of the above.

(1) A separable version of any process exists; Wong [62], p. 42.
(2) The arbitrariness of θ_0 may seem bothersome, since it will usually introduce extraneous parameters into the resulting density. Theoretically, this is no problem since those parameters are constants and do not affect the sufficient statistics. Practically, it is also no problem since they will generally cancel in Bayes' rule (2.23).
(3) As will be seen in Chapter V, this also is not a trivial task.

2.3 Sequential Samples from an Mth-Order Markov Process

To conclude this chapter, attention is focused on a special class of discrete-parameter (sampled) processes. One wants to investigate the conditions under which such processes admit necessary and sufficient statistics, and what form those statistics may take. Most sequential results available in the statistical literature address sufficient statistics in the context of independent sampling, i.e., successive observations are assumed statistically independent.

Many of these results, especially those concerning the form of distributions which admit sufficient statistics, are easily extended to the case where the samples possess an Mth-order Markov dependence as discussed in Section 1.1 (Equations (1.3) through (1.5)). Note that (1.5) is functionally similar to (1.6), the expression for the joint p.d.f. of k independent samples, except that each factor of (1.5) is a function of the M + 1 variables (y_i, ..., y_{i-M}). Many results concerning necessary and sufficient statistics are derived from the functional (as opposed to probabilistic) properties of these factors; thus, it is just as easy to use vector notation and treat the Mth-order Markovian case; the "independent" result may be recovered by putting M = 0 if necessary. The work presented parallels Dynkin [13]; if proofs are sufficiently similar to be obvious, they will only be sketched here.

Define state vectors of sequential observations as in (1.4) and also define the (M + 1)-vector

    ỹ_i = (y_i, y_{i-1}, ..., y_{i-M})*        (2.24)

where (*) denotes transpose. Now assume that:

a. The conditional Borel probability measure of the ith observation is given by one of a family of p.d.f.'s {f(y_i|y_{i-1}, ..., y_{i-M}, θ); θ ∈ Θ} = {f(y_i|ȳ_{i-1}, θ); θ ∈ Θ}, where θ is an m-dimensional parameter.
b. f(y_i|ȳ_{i-1}, θ) does not depend on i; that is, the observations have stationary transitions.
c. f(y_i|ȳ_{i-1}, θ) is piecewise smooth and nonzero in, and integrable on, 𝒴 × 𝒴^M × Θ = 𝒴^{M+1} × Θ.
d. The value of θ stays fixed as sequential observations are made.

Some remarks are necessary: Whenever the vector ỹ is used, its first component is understood to be the "present" observation y, and its other components are the previous M observations which are needed in the transition density. To start the process, one needs ȳ_0; these samples will be assumed given, and all expressions which follow should be interpreted as conditioned upon the "initial observation" ȳ_0 (see Section 1.1). Henceforth, the transition density will be denoted as

    h(ỹ; θ) ≜ f(y|ȳ, θ)        (2.25)

This is convenient because, as discussed, most results depend on the functional properties of h(ỹ; θ) rather than on its probabilistic interpretation. Note that assumption c. implies the "regular case": the domain in which the transition density is nonzero does not depend on the value of θ. If the transition density can be factored in a manner analogous to (2.10),

    h(ỹ; θ) = g[t(ỹ), θ] G(ỹ)        (2.26)

then the joint density of k observations (1.5) factors similarly. Thus, the results which follow will be based on the properties of the transition density; t(ỹ) above will be called a sufficient statistic of ỹ for θ, and a necessary statistic is similarly defined (Definition 2.4). Consider the transition likelihood ratio function

    Λ_0(ỹ; θ) ≜ f(y|ȳ, θ) / f(y|ȳ, θ_0) = h(ỹ; θ) / h(ỹ; θ_0)        (2.27)

where θ_0 ∈ Θ is an arbitrary fixed element. Λ_0 is well-defined (see assumption c.); its k-fold product is, from (1.5), seen to be the likelihood ratio function Λ_k(Y_k; θ). Also,

    L(θ; ỹ) ≜ ln Λ_0(ỹ; θ)        (2.28)

is well-defined; its use will turn out to be convenient in what follows.

THEOREM 2.4 The mapping L from 𝒴^{M+1} into piecewise smooth functions on Θ,

    L(θ; ỹ) = ln h(ỹ; θ) - ln h(ỹ; θ_0)        (2.29)

is a necessary and sufficient statistic of ỹ for θ.

Proof: From (2.29),

    h(ỹ; θ) = h(ỹ; θ_0) exp{L(θ; ỹ)}

so L is a sufficient statistic. Let t(ỹ) be any sufficient statistic; then (2.26) is satisfied for some functions g(·) and G(·). Using (2.29),

    L(θ; ỹ) = ln g[t(ỹ); θ] - ln g[t(ỹ); θ_0]

which is dependent on t(ỹ). Thus L is a necessary statistic. ∎

Note again that L maps 𝒴^{M+1} into piecewise smooth functions on Θ.

Corollary A necessary and sufficient statistic for a sample of size k, Y_k = (y_1, ..., y_k)*, is

    L_k(θ; Y_k) = Σ_{i=1}^{k} L(θ; ỹ_i)        (2.30)

This follows from Theorem 2.1 and (1.5). ∎

THEOREM 2.5 Let V_L denote the minimal linear space of functions, defined on 𝒴^{M+1}, spanned by the constants and the functions {L(θ; ·); θ ∈ Θ}. Suppose dim V_L = r + 1 (possibly r = ∞). Then:
a. For every finite k < r, any sufficient statistic for a sample of size k is trivial.(1)
b. If the functions {1, φ_1(ỹ), ..., φ_r(ỹ)} are a basis for V_L, then for all k ≥ r the r-vector of functions t_k(Y_k) = [t_{ki}(Y_k); i = 1...r], where

    t_{ki}(Y_k) = φ_i(ỹ_1) + φ_i(ỹ_2) + ... + φ_i(ỹ_k)        (2.31)

is functionally independent and forms a necessary and sufficient statistic of Y_k for θ. ∎

(1) See Definition 2.3.

The function t_k(Y_k) is a mapping from 𝒴^k to R^r; it is the quantity usually written directly as a sufficient statistic. The approach here is more general and yields the concept of necessity as a side benefit. The proof is a generalization of Dynkin's proof for independent observations ([13], p. 24) and is sketched in Appendix D. In essence, it is only necessary to replace Dynkin's observation x with the vector ỹ and to redefine the spaces accordingly. Care must be taken, however, to recall from (2.24) how the vectors ỹ were defined; for example, the pair of vectors {ỹ_i, ỹ_{i+1}} has M common components and thus {ỹ_i, ỹ_{i+1}} ∈ 𝒴^{M+2}; similarly, {ỹ_i, ..., ỹ_{i+k}} ∈ 𝒴^{M+1+k}.

Theorem 2.5 inspires the following definition:

DEFINITION 2.6 The rank of the family {f(y_i|ȳ_{i-1}, θ); θ ∈ Θ} in the domain 𝒴^{M+1} is the greatest number r such that, for any finite k < r, there is no nontrivial sufficient statistic of a sample of size k for θ. ∎

THEOREM 2.6(a) Suppose {f(y_i|ȳ_{i-1}, θ); θ ∈ Θ} has finite rank r for ỹ ∈ 𝒴^{M+1}. Then the transition density has the form

    h(ỹ; θ) = exp{ Σ_{i=1}^{r} φ_i(ỹ) c_i(θ) + c_0(θ) + φ_0(ỹ) }        (2.32)

where the functions {φ_1(ỹ), ..., φ_r(ỹ)} are piecewise smooth in 𝒴^{M+1}, and where the systems of functions {1, φ_1, ..., φ_r} and {1, c_1, ..., c_r} are linearly independent.

THEOREM 2.6(b) Suppose

    f(y_i|ȳ_{i-1}, θ) = exp{ Σ_{j=1}^{r} φ_j(ỹ_i) c_j(θ) + φ_0(ỹ_i) + c_0(θ) }   for (y_i, ȳ_{i-1}) ∈ 𝒴^{M+1}, θ ∈ Θ        (2.33)

Then the rank of the family of densities does not exceed r. If the systems of functions {1, φ_1, ..., φ_r} and {1, c_1, ..., c_r} are linearly independent, then the rank equals r and, for k ≥ r,

    t_k(Y_k) = [ Σ_{j=1}^{k} φ_i(ỹ_j); i = 1...r ]

is a necessary and sufficient statistic for θ. ∎

The proofs will be omitted; they are analogous to those of Dynkin [13], pp. 26-27, and are easy consequences of Theorem 2.5 and Equation (2.29). Only the statements concerning linear independence require some effort.

Before illustrating all this with an example, two comments should be made:

a. Recall from Theorem 2.5 that V_L includes the constant functions. In general, the constant in the likelihood ratio function for a sample of size k will explicitly depend on k, which must therefore be known to have a totally sufficient statistic. Although it is pedagogically questionable, k will henceforth be included as a component of t_k (even though it does not depend upon the observation) whenever the discrete case is being investigated. This clearly makes no sense if one intends only to pass to the limit k → ∞.

b. If (2.32) and (2.25) are used in (2.27), it is clear that the likelihood ratio function for a sample of size k can be written

    Λ_k(Y_k, θ) = Λ[t_k(Y_k), θ]        (2.34)

where the functional form of Λ depends only on the transition density, and not on k.

Example 2.1 Suppose that {y_i} is a stationary, discrete, Mth-order autoregressive Gaussian process as defined in Section 4.1; the parameter θ has components denoted α and β = (β_1, ..., β_M)*, and the transition density is (see (4.16))

    f(y_i|y_{i-1}, ..., y_{i-M}, β, α) = exp{ -(1/2)[ln(2π) + 2 ln α] } exp{ -(1/2α²)[ y_i² + 2 β*(y_i ȳ_{i-1}) + β*(ȳ_{i-1} ȳ_{i-1}*)β ] }        (2.35)

It is convenient to retain vector-matrix notation. Define

    φ_0(ỹ) = y_i²        (2.36)

    φ_M(ỹ) = y_i ȳ_{i-1} = (y_i y_{i-1}, ..., y_i y_{i-M})*        (2.37)

    Φ_M(ỹ) = ȳ_{i-1} ȳ_{i-1}* =
        [ y_{i-1}²          y_{i-1} y_{i-2}   ...   y_{i-1} y_{i-M} ]
        [ y_{i-2} y_{i-1}   y_{i-2}²          ...   y_{i-2} y_{i-M} ]
        [    ...                                         ...        ]
        [ y_{i-M} y_{i-1}   y_{i-M} y_{i-2}   ...   y_{i-M}²        ]        (2.38)

Φ_M is symmetric and has M(M+1)/2 linearly independent components; thus, there is a total of (1/2)(M+1)(M+2) linearly independent statistics defined above, and a set of necessary and sufficient statistics for a sample of size k is

    t_0(Y_k) = Σ_{i=1}^{k} φ_0(ỹ_i) = Σ_{i=1}^{k} y_i²        (2.39)

    τ_M(Y_k) = Σ_{i=1}^{k} φ_M(ỹ_i) = Σ_{i=1}^{k} y_i ȳ_{i-1}        (2.40)

    T_M(Y_k) = Σ_{i=1}^{k} Φ_M(ỹ_i) = Σ_{i=1}^{k} ȳ_{i-1} ȳ_{i-1}*        (2.41)

In Chapter IV, these will collectively be referred to as t(Y). ∎
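The statistics of Example 2.1 are easily accumulated in practice. The following minimal Python sketch simulates an M-SAG process via (4.2) and forms (2.39) through (2.41) recursively; the coefficient values, the seed, and the crude zero start are all invented for the illustration (the zero start is not the "quiet start" of Section 4.1, so the first few transitions are not exactly stationary). As a plausibility check, it also inverts the statistics into rough parameter estimates in the spirit of (4.14) and (4.15):

import numpy as np

M, k = 2, 5000
beta = np.array([-0.9, 0.3])     # keeps the roots of (4.3) inside the unit circle
alpha = 1.0
rng = np.random.default_rng(1)

# Simulate y_i = -beta* ybar_{i-1} + alpha e_i (zero start for brevity).
y = np.zeros(k + M)
for i in range(M, k + M):
    y[i] = -beta @ y[i - M:i][::-1] + alpha * rng.normal()

# Recursive accumulation of t_0, tau_M, T_M over the k transitions.
t0, tau, T = 0.0, np.zeros(M), np.zeros((M, M))
for i in range(M, k + M):
    ybar = y[i - M:i][::-1]      # (y_{i-1}, ..., y_{i-M})*
    t0 += y[i] ** 2              # (2.39)
    tau += y[i] * ybar           # (2.40)
    T += np.outer(ybar, ybar)    # (2.41)

# Only these fixed-dimension statistics, never the k raw samples, are needed
# later; for instance, they already determine rough parameter estimates:
beta_hat = -np.linalg.solve(T, tau)
alpha2_hat = (t0 + 2 * beta_hat @ tau + beta_hat @ T @ beta_hat) / k
print("beta estimate:", beta_hat, " alpha^2 estimate:", alpha2_hat)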

CHAPTER III

SUFFICIENT STATISTICS AND REPRODUCING DENSITIES IN SIMULTANEOUS DETECTION AND ESTIMATION

Chapter I, specifically Sections 1.3.2 and 1.3.3, presented general Bayesian solutions to the estimation and compound-hypothesis detection problems; these were given in both one-shot and sequential formulations. Section 2.1 then introduced the concept of sufficient statistics; it was shown (Eq. (2.11)) how the existence of sufficient statistics of fixed dimension could significantly reduce the amount of "soft memory" necessary to store the observation, since one need only save the (r-dimensional) updated value of t(Y_k), rather than saving the observations Y_k themselves, in order to estimate θ. The definitions of Chapter II always admit the observation itself as an "elementary" sufficient statistic. To save verbiage, it will henceforth be convenient to reserve the term "sufficient statistic" for those statistics which are of fixed minimal dimension r, i.e., which are necessary and sufficient and of fixed dimension. If no such statistic exists, then the family of distributions will be said not to admit a sufficient statistic. As a further notational convenience, one subscript will henceforth be eliminated and t_k(Y_k) written as t(Y_k).

The remainder of this dissertation will deal only with classes of probability distributions which admit sufficient statistics in the sense just discussed; this chapter presents the general theory, and Chapters IV and V a specific application. Section 3.1 treats only the estimation problem, since its solution is basic to solving the detection problem (see Section 1.3.3); it will be shown that not only does the existence of a sufficient statistic eliminate the need to retain all past observations, but it also obviates the need to store the a posteriori p.d.f. on Θ as a function. Instead, this p.d.f. is itself indexed by a parameter of fixed dimension, regardless of the number of observations upon which it is based. Section 3.2 applies the results to the detection problem; finally, Section 3.3 will treat the case of continuous observation.

3.1 Bayesian Estimation

Suppose that t(·): 𝒴^k → R^r, t(Y_k) = t_k, is a sufficient statistic of fixed dimension; then by Definition 2.1,

    f(Y_k|θ) = g[t(Y_k), θ] G(Y_k)        (3.1)

for all k, and Bayes' rule can be written as in (2.11).

The following example illustrates the concept and will be re-examined later in this section:

Example 3.1 Let the observations y_i be generated by

    y_i = θ + n_i,   i = 1, 2, 3, ...

where θ is an unknown scalar and the n_i are independent "noise" samples from an N(0, σ²) distribution. Conditioned on θ, the samples are independent N(θ, σ²) and their joint conditional p.d.f. can be written as follows:

    f(Y_k|θ) = exp{ -(1/2σ²)( kθ² - 2θ Σ_{i=1}^{k} y_i ) } · (2πσ²)^{-k/2} exp{ -(1/2σ²) Σ_{i=1}^{k} y_i² } = g[t(Y_k); θ] G(Y_k)        (3.2)

Thus, the sufficient statistic t(Y_k) consists of the total number of observations and their sum,

    t(Y_k) = ( k, Σ_{i=1}^{k} y_i )*        (3.3)

where * denotes transpose. Obviously, this can be "updated" sequentially as the sequence {y_i} is received. Including k as a "statistic" is, as discussed in the paragraph preceding (2.34), pedagogically questionable but will turn out to be a convenience. ∎

The function g[t(Y_k); θ] will always be assumed maximally factored (i.e., it contains no factors which do not depend on θ).

3.1.1 Natural Conjugate Densities. The sufficient statistic can also serve to characterize the a posteriori p.d.f.'s on Θ, eliminating the need to store them as functions in soft memory.

Example 3.2 Consider the situation of Example 3.1: Suppose the a priori p.d.f. on Θ has the form

    f_0(θ) = K exp{ -(1/2σ²)( γ_01 θ² - 2 γ_02 θ ) },   -∞ < θ < ∞        (3.4)

where γ_01 > 0. This is the class of Gaussian densities on Θ = R^1; its members are indexed by the parameter γ_0 = (γ_01, γ_02). Note the similarity between (3.4) and the first factor of (3.2). Now suppose Y_k is observed; applying Bayes' rule (1.27) in one-shot form yields

    f(θ|Y_k) = K_k exp{ -(1/2σ²)[ (γ_01 + k) θ² - 2( γ_02 + Σ_{i=1}^{k} y_i ) θ ] }        (3.5)

where all terms not involving θ are included in the normalizing constant K_k. This is the member of the class of (3.4) indexed by

    γ_k = γ_0 + t(Y_k)        (3.6)

Choosing f_0(θ) in the class of (3.4) causes the a posteriori p.d.f. to remain in that class; the indexing parameter γ is updated through (3.6). Note that the updating can be done recursively. ∎

The concept of a reproducing density function, illustrated above, will be defined shortly. One might wonder whether the parametrization established in (3.4) is necessary to the phenomenon; the answer is no. For example, had the a priori p.d.f. been parametrized by its mean and variance, f_0(θ) ~ N(m_0, d_0²), then the a posteriori density would be Gaussian with

    m_k = d_k² [ m_0/d_0² + (1/σ²) Σ_{i=1}^{k} y_i ],   1/d_k² = 1/d_0² + k/σ²        (3.7)

These parameters are updated using the same set of statistics (3.3), and hence can also be formed sequentially; only the "updating equations" (compare (3.6) and (3.7)) change.
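The equivalence of the two parametrizations is easily checked numerically. The following minimal Python sketch, whose prior, noise level, and data values are invented for the illustration, updates the conjugate parameter γ of (3.4) through (3.6) and the pair (m, d²) through the sequential form of (3.7) from the same data stream, and confirms that both always describe the same Gaussian a posteriori density:

import numpy as np

sigma2 = 4.0
gamma = np.array([2.0, 1.0])                     # (gamma_01, gamma_02), gamma_01 > 0
m, d2 = gamma[1] / gamma[0], sigma2 / gamma[0]   # the same prior, reparametrized

rng = np.random.default_rng(2)
for y in 0.7 + np.sqrt(sigma2) * rng.normal(size=200):
    gamma = gamma + np.array([1.0, y])           # (3.6): gamma_k = gamma_0 + t(Y_k)
    d2_new = 1.0 / (1.0 / d2 + 1.0 / sigma2)     # (3.7), one observation at a time
    m = d2_new * (m / d2 + y / sigma2)
    d2 = d2_new

# Both descriptions must give the N(gamma_2/gamma_1, sigma^2/gamma_1) density.
assert np.isclose(m, gamma[1] / gamma[0])
assert np.isclose(d2, sigma2 / gamma[0])
print("posterior mean:", m, " posterior variance:", d2)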

DEFINITION 3.1 Let 𝒫_Γ(Θ) = {h(θ; γ); γ ∈ Γ ⊂ R^m, θ ∈ Θ} be a family of p.d.f.'s on Θ which is indexed by the m-dimensional parameter γ. 𝒫_Γ(Θ) is said to be a reproducing class of probability densities under {f(Y_k|θ)} if, for any k, whenever the a priori p.d.f. on Θ is

    f_0(θ) = h(θ; γ_0),   γ_0 ∈ Γ

there exists a γ_k = γ_k(γ_0, Y_k) ∈ Γ such that the a posteriori p.d.f. is

    f(θ|Y_k) = h(θ; γ_k),   γ_k ∈ Γ        ∎

If such a class exists and is used, f(θ|Y_k) need not be stored as a function; γ_k completely characterizes the a posteriori p.d.f., and its dimension is fixed. The following theorem generalizes the result of Example 3.2.

THEOREM 3.1 Suppose f(Y_k|θ) admits a nontrivial sufficient statistic of fixed dimension, t(Y_k), for θ and hence can be factored as in (3.1); let the function g(·, ·) be as defined there and, provided the integral exists, put

    p(θ; γ) = g[γ, θ] / ∫_Θ g[γ, θ'] dθ',   γ ∈ Γ        (3.8)

where Γ is the image of the space of observations under t(·).(1) Then {p(θ; γ); γ ∈ Γ} is a reproducing class of densities under f(Y_k|θ). ∎

(1) Γ can actually be taken to include all values of γ for which p(·; γ) retains the mathematical properties of a p.d.f.; see Raiffa and Schlaifer [46], p. 50.

The proof is given in Appendix D. The class thus defined is called the natural conjugate class of p.d.f.'s under f(Y_k|θ); existence of a sufficient statistic implies existence of such a class. The parameter γ which indexes its members will be called the conjugate parameter to distinguish it from the parameter θ. The class may be quite rich,(1) or it may be restrictive and not contain a satisfactory model for the a priori p.d.f. The next section will show that this class is only a small subset of all densities which reproduce; first, some relations which apply if the a priori p.d.f. is in the natural conjugate class. Consider one-shot processing with f_0(θ) = p(θ; γ_0), γ_0 ∈ Γ. Bayes' rule (1.27) becomes

    p(θ; γ_k) = p(θ; γ_0) f(Y_k|θ) / f(Y_k)        (3.9)

where

    f(Y_k) = ∫_Θ p(θ; γ_0) f(Y_k|θ) dθ        (3.10)

(1) See, for example, Howard [27].

Alternately, (3.9) may be rearranged to yield the analog to (1.29),

    f(Y_k) = [ p(θ; γ_0) / p(θ; γ_k) ] f(Y_k|θ)        (3.11)

Since the updated conjugate parameter γ_k is found from γ_0 using the sufficient statistics (see, e.g., (3.6)), this result can be valuable in forming the marginal p.d.f. of Y_k. Analogous expressions can be written for sequential processing. From (1.32), Bayes' rule is

    p(θ; γ_{k+1}) = p(θ; γ_k) f(y_{k+1}|Y_k, θ) / f(y_{k+1}|Y_k)        (3.12)

and the marginal observation p.d.f. is

    f(y_{k+1}|Y_k) = [ p(θ; γ_k) / p(θ; γ_{k+1}) ] f(y_{k+1}|Y_k, θ)        (3.13)

The comment made following (1.39), which applies to (3.11) and (3.13) as well, indicates that these equations are not in their simplest form (although they will be useful as written); indeed, one may use the factorization criterion (3.1) and the definition (3.8) of p(θ; γ) to rewrite them. First, using (3.8), define the "conjugate normalizing constant" as

    K(γ) = [ ∫_Θ g[γ, θ] dθ ]^{-1}        (3.14)

Use of the symbol K(·) will henceforth be consistent with this definition. Now use (3.1), (3.8), and (3.14) to rewrite Bayes' rule (3.9):

    p(θ; γ_k) = g[t(Y_k), θ] g[γ_0, θ] / ∫_Θ (Numerator) dθ        (3.15)

             = K(γ_k) g[γ_k, θ]        (3.16)

The second equality follows because g(·, ·) was assumed maximally factored, so that

    g[γ_k, θ] = g[t(Y_k), θ] g[γ_0, θ]        (3.17)

Finally, use (3.14) through (3.17) to rewrite (3.11):

    f(Y_k) = [ K(γ_0) / K(γ_k) ] G(Y_k)        (3.18)

This, as claimed, does not depend on θ. Equation (3.18) implies that, though t(Y_k) is sufficient for θ, it is not sufficient for the marginal density of the observation (and hence, in a later section, for the marginal likelihood ratio); this in turn means that (3.15) and (3.17) can be made sequential, but this need not be so for (3.18).
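Equation (3.18) can be checked numerically for the Gaussian family of Examples 3.1 and 3.2, for which K(γ) has a closed form (obtained by completing the square). The minimal Python sketch below, with all numerical values invented for the illustration, computes the marginal p.d.f. both from (3.18) and by direct numerical integration of (3.10):

import numpy as np

sigma2 = 2.0
gamma0 = np.array([1.5, 0.5])
y = np.array([0.3, -1.1, 0.8, 0.4])
k = len(y)

def K(g):
    # (3.14) for g[gamma, theta] = exp{-(g1 theta^2 - 2 g2 theta)/(2 sigma^2)}:
    # the integral over theta is sqrt(2 pi sigma^2 / g1) exp{g2^2/(2 sigma^2 g1)}.
    g1, g2 = g
    return np.exp(-g2**2 / (2 * sigma2 * g1)) / np.sqrt(2 * np.pi * sigma2 / g1)

gamma_k = gamma0 + np.array([k, y.sum()])                  # (3.6)
G = (2 * np.pi * sigma2) ** (-k / 2) * np.exp(-np.sum(y**2) / (2 * sigma2))
marginal_318 = K(gamma0) / K(gamma_k) * G                  # (3.18)

# Brute force (3.10): integrate prior times likelihood over theta.
theta = np.linspace(-20.0, 20.0, 200001)
prior = K(gamma0) * np.exp(-(gamma0[0] * theta**2 - 2 * gamma0[1] * theta) / (2 * sigma2))
lik = (2 * np.pi * sigma2) ** (-k / 2) * np.exp(-np.sum((y[:, None] - theta) ** 2, axis=0) / (2 * sigma2))
marginal_310 = np.sum(prior * lik) * (theta[1] - theta[0])

assert np.isclose(marginal_318, marginal_310, rtol=1e-5)
print("marginal f(Y_k):", marginal_318)

The division of labor is visible in the closed form: the conjugate constants carry all dependence on the prior, while G(Y_k) carries the part of the observation not captured by t(Y_k).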

3.1.2 Other Reproducing Densities. Suppose now that the actual a priori p.d.f. on Θ is not a natural conjugate prior but can, for some value γ_0 ∈ Γ of the conjugate parameter, be written(1)

    f_0(θ) = r(θ) p(θ; γ_0)        (3.19)

where r(·) is a positive function defined on Θ. The following result will show that f_0(·) also reproduces with the parameter γ.(2)

(1) Note that r(·) is a Radon-Nikodym derivative (see Appendix B): if μ_p is the Borel measure on Θ represented by the p.d.f. p(θ; γ_0) and μ_f the measure represented by f_0(θ), then r(θ) is the Radon-Nikodym derivative of μ_f with respect to μ_p. The positivity of r(θ) can be relaxed to requiring it nonnegative but nonzero in some neighborhood of the "true" value θ*; this is necessary so that the actual a priori p.d.f. does not exclude the true value.
(2) This can be shown more directly (see, e.g., Spragins [59]) by applying Bayes' rule to (3.19). Clearly, the reproducing class is the natural conjugate class with each member multiplied by r(θ). The approach here was chosen because the equations developed in the theorem are needed for later work.

THEOREM 3.2 Let f_1(·), f_2(·) be two p.d.f.'s on Θ which can be written

    f_2(θ) = r(θ) f_1(θ)        (3.20)

Suppose that using f_1(·) as an a priori p.d.f. results in an a posteriori p.d.f. f_1(θ|Y_k) and a marginal p.d.f. f_1(Y_k); similarly, using f_2(·) results in f_2(θ|Y_k) and f_2(Y_k). Then

    f_2(θ|Y_k) = r(θ) f_1(θ|Y_k) / ∫_Θ r(θ') f_1(θ'|Y_k) dθ'        (3.21)

    f_2(Y_k) = f_1(Y_k) ∫_Θ r(θ') f_1(θ'|Y_k) dθ'        (3.22)

The proof is given in Appendix D. ∎

Corollary An a priori p.d.f. which can be written as in (3.19) reproduces with parameter γ. ∎

For, in an obvious notation, (3.21) becomes

    f(θ|Y_k, f_0) = r(θ) p(θ; γ_k) / ∫_Θ (Numerator) dθ        (3.23)

Since r(·) is known, the product family {r(θ) p(θ; γ); γ ∈ Γ} is a class of (unnormalized) p.d.f.'s which reproduces under f(Y_k|θ) with the conjugate parameter γ. Applying (3.22) yields

    f(Y_k|f_0) = f(Y_k|p_0) ∫_Θ r(θ) p(θ; γ_k) dθ        (3.24)

where p_0 means "using p(θ; γ_0) as an a priori p.d.f." Now suppose f_0(·) is given: One can choose a convenient γ_0, define r(·) by (3.19), and proceed (using the theory of the preceding section) on the assumption that the a priori p.d.f. is p(θ; γ_0). To account for the actual a priori p.d.f. f_0(·), the results must then be modified using (3.23) and (3.24). If it is possible to define reproducing classes for which the integral in the last two equations exists in closed form, the use of this technique can be as tractable as the use of natural conjugate priors. If the integral must be evaluated numerically, the resultant processing is probably too complex to be worthwhile.

3.2 Compound-Hypothesis Detection

The problem was stated in Section 1.2.3 and its solution begun in Section 1.3.3. The notation is more complicated than that of the preceding section since there are essentially two separate estimation problems to be solved. Under hypothesis H_1, the symbols introduced for the estimation problem will be used; under H_0, an analogous set of symbols is introduced. This is summarized in Table 3.1.

Table 3.1 Detection Problem Notation

                                    Hypothesis H_1              Hypothesis H_0
    Uncertain parameter             θ                           η
    Conditional p.d.f. on the
      observations                  f(Y_k|θ, H_1)               f(Y_k|η, H_0)
    Sufficient statistic            t_1(Y_k)                    t_0(Y_k)
    Natural conjugate p.d.f.'s      p(θ; γ), γ ∈ Γ              q(η; ψ), ψ ∈ Ψ
    Other reproducing p.d.f.'s      f_0(θ) = r_1(θ) p(θ; γ_0)   f_0(η) = r_0(η) q(η; ψ_0)

Recall that θ and η may represent completely different parameters or may have components which represent the same parameters (as would, for instance, be the case in the classical "signal or signal-plus-noise" problem with unknown noise parameters). Even if they represent the same parameters, the natural conjugate classes may be different (and hence, since they are computed under different statistical hypotheses, their a posteriori p.d.f.'s will evolve differently).

The optimal detection statistic is the likelihood ratio of marginal observation p.d.f.'s (1.36). This was rewritten in (1.38) and formulated sequentially in (1.40). Recall from these equations and the comment following (3.18) that if a "parameters fixed" (simple-hypotheses) solution is known, then the a posteriori p.d.f.'s (i.e., the updated parameters of the natural conjugate p.d.f.'s) are sufficient to solve the compound-hypotheses problem. If not, then this may not be the case.

3.2.1 Natural Conjugate Densities. Suppose that both hypotheses admit sufficient statistics and that the a priori p.d.f.'s are members of the corresponding natural conjugate classes. The detector can be written in two equivalent forms: The first is useful if a solution to the corresponding simple-hypotheses problem already exists (especially if the results are to be extended to the case of continuous observation); the second eliminates redundancies in the first and is especially useful if the sequential (or the one-shot discrete) solution is desired for its own sake.

a. Form #1 of the Detector: Apply (3.13) and its analog for hypothesis H_0 to (1.38):

    ℓ(y_i|Y_{i-1}) = [ p(θ̃; γ_{i-1}) / p(θ̃; γ_i) ] [ q(η̃; ψ_i) / q(η̃; ψ_{i-1}) ] ℓ(y_i|Y_{i-1}; θ̃, η̃)        (3.25)

Recall again the comment following (1.39); the computation required to implement (3.25) can often be simplified by proper choice of the fixed values of the parameters θ and η. Updating of the conjugate parameters γ and ψ takes place through the sufficient statistics, precisely as described for the estimation problem. If the logarithm of the likelihood ratio is denoted z(·), then Fig. 3.1 illustrates the processing necessary to find z(Y_k) by use of (3.25); the tilde on θ and η indicates that fixed values of those parameters are used. The updated values of γ and ψ completely characterize the a posteriori p.d.f.'s, so the conjugate parameters are sufficient outputs to an external estimator (recall the comment at the end of Section 1.3.2(b)). Finally, note that the "parameters given" likelihood ratio is explicitly required; this, again, is because γ and ψ are not sufficient for the marginal densities.

[Fig. 3.1, The Primary Detection Processor: under each hypothesis, the observation y_i updates the sufficient statistic and thence the natural conjugate parameter (γ_i or ψ_i); these multiply the "parameter known" solution z(y_i|Y_{i-1}; θ̃, η̃), and a unit-delay feedback accumulates the running sum z(Y_i).]

b. Form #2 of the Detector: If the simple-hypotheses solution is not available, there is no advantage to explicitly retaining ℓ(y_i|Y_{i-1}; θ̃, η̃); one may as well perform the simplification of (3.18), in which case the "one-shot" likelihood ratio becomes

    ℓ(Y_k) = [ K_p(γ_0) / K_p(γ_k) ] [ K_q(ψ_k) / K_q(ψ_0) ] [ G(Y_k|H_1) / G(Y_k|H_0) ]        (3.26)

K_p(·) and K_q(·) are the normalizing constants of the corresponding natural conjugate densities, see (3.14). Equation (3.26) depends on Y_k through the functions G(·), see (3.1); it can be intractable unless these are themselves of fixed dimension. In most applications, this will be the case. The result can be made sequential by noting from (1.40) that

    ℓ(y_i|Y_{i-1}) = ℓ(Y_i) / ℓ(Y_{i-1})        (3.27)

and using (3.26) for the numerator and denominator; this is of little use unless the resulting ratios are simpler to evaluate than the functions themselves. In a sense, though, (3.26) is already sequential: Provided that the functions G(·|H_j) are tractable, it depends on Y_k only through statistics of fixed dimension which are "updated" in sequential fashion.
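A numerical illustration of Form #2 follows (a minimal Python sketch with invented values). To keep it short, a deliberately degenerate special case is assumed: under H_1 the observations are N(θ, σ²) with the conjugate prior (3.4) on θ, while under H_0 they are N(0, σ²) exactly, so the K_q factors of (3.26) disappear; moreover, for this particular pair G(Y_k|H_1) = f(Y_k|H_0), and the one-shot likelihood ratio collapses to K_p(γ_0)/K_p(γ_k):

import numpy as np

sigma2 = 1.0

def K(g1, g2):
    # Conjugate normalizing constant (3.14) for the Gaussian class (3.4).
    return np.exp(-g2**2 / (2 * sigma2 * g1)) / np.sqrt(2 * np.pi * sigma2 / g1)

def log_lr(y, gamma0=(1.0, 0.0)):
    g1, g2 = gamma0
    gk1, gk2 = g1 + len(y), g2 + y.sum()        # (3.6), computed under H_1
    return np.log(K(g1, g2)) - np.log(K(gk1, gk2))

rng = np.random.default_rng(3)
under_h1 = 0.5 + rng.normal(size=200)           # a "signal" offset is present
under_h0 = rng.normal(size=200)                 # noise alone
print("log l(Y_k), data from H1:", log_lr(under_h1))   # large and positive
print("log l(Y_k), data from H0:", log_lr(under_h0))   # typically near or below zero

The sketch also makes the remark following (3.18) concrete: here the detector output depends on the data only through γ_k, but the collapse of the G-ratio is special to this pair of hypotheses; in general the ratio G(Y_k|H_1)/G(Y_k|H_0) must be carried along as well.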

3.2.2 Other Reproducing Densities. Suppose that natural conjugate prior densities are not appropriate, but that the actual a priori p.d.f.'s can be written as in (3.19) (see Table 3.1). Then (3.24) holds under both hypotheses, and

    ℓ(Y_i|f_0) = ℓ(Y_i|p_0, q_0) [ ∫_Θ r_1(θ) p(θ; γ_i) dθ ] / [ ∫ r_0(η) q(η; ψ_i) dη ]        (3.28)

where ℓ(·|p_0, q_0) is the likelihood ratio assuming that p(θ; γ_0) and q(η; ψ_0) are the a priori densities; this is computed by the "primary processor" given by (3.25), Fig. 3.1, or (3.26). A "secondary processor" (see Fig. 3.2) then computes the "modifier gains" in (3.28) and forms the actual likelihood ratio. Again, the secondary processing can be minimal if the integrals of (3.28) exist in closed form; if they must be evaluated numerically, then the designer must trade the freedom of choosing a priori p.d.f.'s against the cost of the processing involved.

[Fig. 3.2, The Secondary Processor: the conjugate parameters γ_i and ψ_i drive the computation of the H_1 and H_0 modifier gains, ∫ r_1(θ) p(θ; γ_i) dθ and ∫ r_0(η) q(η; ψ_i) dη, which multiply ℓ(Y_i|p_0, q_0).]

3.3 Continuous Observations

Suppose that, as discussed in Sections 1.1 and 2.2.1, discrete observations are generated by sampling {y_t, t ∈ [0, T]}; one wishes to infer how the continuous observation should be processed by studying the preceding results as the samples grow dense. Much of the necessary groundwork has been laid in Section 2.2; the development here will be less rigorous and will deal primarily with p.d.f.'s and the convergence of their ratios; the conditions necessary for convergence have already been established.

Under hypothesis H_1, {y_t, t ∈ [0, T]} is defined on a probability space (sample space) (𝒴, 𝒜, μ_θ); ℳ = {μ_θ; θ ∈ Θ} is a family of probability measures, and μ_0 dominates ℳ, i.e., μ_θ << μ_0 for all θ. Under H_0, the process is defined on (𝒴, 𝒜, ν_η), and a probability measure ν_0 dominates {ν_η}; assume that ν_η << ν_0 for all η.

It is not necessary, but will often be the case, that μ_0 and ν_0 are members of the families they dominate. Assume further that t_1(·): (𝒴, 𝒜) → (R^r, ℬ^r) and t_0(·): (𝒴, 𝒜) → (R^s, ℬ^s) are sufficient statistics for θ and η respectively; then it has been shown that, as k → ∞,

    Λ_k^1(Y_k; θ) → λ_1(y_t; θ) = g_1[t_1(y_t); θ] G_1(y_t)        (3.29)

    Λ_k^0(Y_k; η) → λ_0(y_t; η) = g_0[t_0(y_t); η] G_0(y_t)        (3.30)

where y_t denotes a sample function and where, for example,

    Λ_k^1(Y_k; θ) = f(Y_k|θ, H_1) / f(Y_k|θ_0, H_1)        (3.31)

Again, the use of "t" for time and "t(·)" for sufficient statistics should cause no confusion. Section 3.3.1 will treat the estimation problem by itself and, as usual, employ the notation of H_1. Section 3.3.2 then applies the results to the detection problem, and Section 3.3.3 discusses how continuous solutions on subintervals of [0, T] can be treated sequentially. Throughout, the results will be seen to bear a striking resemblance to those obtained for the discrete case.

3.3.1 Estimation. Recall that, under the assumptions, an a posteriori p.d.f. for continuous observations can be defined as in (2.23). It will be assumed throughout that the integral in the denominator exists. Using the generalized factorization given above (i.e., using Theorem 2.3),

    f(θ|y_t) = g_1[t_1(y_t); θ] f_0(θ) / ∫_Θ (Numerator) dθ        (3.32)

Now, analogous to Theorem 3.1, put

    p_c(θ; γ) = g_1[γ; θ] / ∫_Θ g_1[γ; θ'] dθ',   γ ∈ Γ = t_1(𝒴)        (3.33)

where the subscript "c" denotes "continuous." It is easy to verify that {p_c(θ; γ); γ ∈ Γ} is a reproducing class of p.d.f.'s on Θ under the generalized Bayes' rule (3.32); it will, again, be called the natural conjugate class, and its members are indexed by the conjugate parameter γ. This is true because the form of (3.32) is the same as that of the finite-dimensional Bayes' rule, and that form alone was necessary to prove Theorem 3.1.(1) The relation by which the conjugate parameter is updated depends upon the parametrization of p_c, which in turn depends upon the choice of sufficient statistic (recall that this is not unique).

(1) From an alternate point of view, the function g_1[t_1(y_t); θ] can be normalized and considered as a bona fide p.d.f. on the "sufficient space" 𝒯 = t_1(𝒴); this space is by assumption finite-dimensional, so that all the results can be extended with only a slight notational modification.

Suppose that, analogous to (3.19), the a priori p.d.f. is not natural conjugate but can be written

    f_0(θ) = r(θ) p_c(θ; γ_0),   θ ∈ Θ, γ_0 ∈ Γ        (3.34)

Since the proof of Theorem 3.2 also depended only on the functional form of Bayes' rule, the analog to (3.23) is immediate:

    f(θ|y_t, f_0) = r(θ) p_c(θ; γ_T) / ∫_Θ (Numerator) dθ        (3.35)

where γ_T is computed assuming that the a priori p.d.f. is p_c(θ; γ_0).

3.3.2 Compound-Hypothesis Detection. For a finite number of samples, and assuming a simple-hypothesis solution known, the problem is solved by (1.38), i.e.,

    ℓ(Y_k) = [ f_0(θ) / f(θ|Y_k, H_1) ] [ f(η|Y_k, H_0) / f_0(η) ] ℓ(Y_k|θ, η)        (3.36)

Recall that this expression does not depend on θ or η; so long as it makes sense, it may be evaluated at arbitrary fixed values of those parameters. Under the assumptions, each term in (3.36) converges, and thus the decision statistic converges to

    ℓ(y_t) = [ f_0(θ) / f(θ|y_t, H_1) ] [ f(η|y_t, H_0) / f_0(η) ] ℓ(y_t|θ, η)        (3.37)

where the last term is the simple-hypothesis solution and is presumed to be known. If natural conjugate a priori p.d.f.'s are used then, in the previously introduced notation,

    ℓ(y_t) = [ p_c(θ; γ_0) / p_c(θ; γ_T) ] [ q_c(η; ψ_T) / q_c(η; ψ_0) ] ℓ(y_t|θ, η)        (3.38)

If the actual priors are not natural conjugate but can be written as in (3.19) (see Table 3.1), then an expression analogous to (3.28) can also be developed.

3.3.3 Sequential Processing of Continuous Observations. Suppose that the interval [0, T] is partitioned into subintervals by defining a set of partitioning times τ_i so that 0 < τ_1 < τ_2 < ... < τ_{n-1} < τ_n = T. It is tempting to consider processing the continuous observation on the subintervals (τ_i, τ_{i+1}], using the sequential results derived in Sections 3.1 and 3.2 to obtain analogous notions of a posteriori p.d.f.'s and likelihood ratios which "evolve" as further subintervals are observed. This is a very difficult problem to treat in general,(1) but can be quite tractable in specific cases.

(1) See, e.g., Bahadur [3], Arts. 8-11.

Consider, for example, that {y_t, t ∈ [0, T]} is an Mth-order Markov process and that k > M equally spaced samples are taken from each subinterval (see Fig. 3.3).(1) Let y_{ij} denote the jth sample from the ith subinterval; analogous to (1.4), let

    ȳ_{ij} = (y_{ij}, y_{i,j-1}, ..., y_{i,j-M+1})*   and   Y_{i,k} = (y_{i,1}, ..., y_{i,k})*

[Fig. 3.3, Sampling Scheme: the samples y_{i-1,k}, y_{i,1}, y_{i,2}, ..., y_{i,k-1}, y_{i,k}, y_{i+1,1} shown on the time axis about the partitioning times τ_{i-1} and τ_i.]

(1) See Section 1.1.

Note that all the samples of the vector ȳ_{i,0} belong to the (i-1)st subinterval and are known if that interval has been processed; thus, using the p.d.f. f(Y_{i,k}|ȳ_{i,0}, θ) makes sense. Under suitable conditions, the likelihood ratio function can be formed and its limit as k → ∞ taken as outlined earlier; this in turn yields the sufficient statistic, the natural conjugate density, and the updating expressions. The procedure will be illustrated for a specific problem in Chapter V.

3.4 Conclusions and Historical Sketch

3.4.1 Summary and Discussion. It has been shown that the optimal detector solves two separate Bayesian estimation problems, one under each hypothesis, and then uses the a posteriori p.d.f.'s to modify a simple-hypothesis likelihood ratio; thus, detection and estimation occur simultaneously. Without further assumptions, this is not generally a tractable solution since all densities must be stored and manipulated as functions during the estimation. If the observations admit a sufficient statistic (of fixed size) for the unknown parameters, then a class of natural conjugate reproducing densities exists under either hypothesis; each class is indexed by a parameter vector of the same dimension as the sufficient statistic. The ensuing simplifications are dramatic:

1. Sequential Bayesian estimation becomes tractable, since the a posteriori p.d.f.'s are completely characterized by the fixed-dimensional "conjugate" parameter vector; this parameter is updated through the use of fixed-dimensional sufficient statistics to reflect the evolution of the p.d.f.'s. Both operations are inherently recursive.

2. Thus, if a simple-hypothesis detector is known, the compound-hypothesis detection problem is tractably solved as stated above.

3. The optimal solution becomes independent of a priori distributions on the parameters in the following sense: "Primary" processing (detection and estimation) of the signal is done under the assumption that the a priori densities are natural conjugate; a secondary processor then modifies both the likelihood ratio and the a posteriori p.d.f.'s using the Radon-Nikodym derivatives of the actual a priori p.d.f.'s with respect to the assumed natural conjugate priors.

This partitioning is illustrated in Fig. 3.4. The primary processor is as developed in (3.25) and shown in Fig. 3.1; it can be permanently designed ahead of time using the "natural conjugate" prior densities. The secondary processor modifies the output of the primary processor according to (3.28); it can be reprogrammed for different a priori p.d.f.'s simply by supplying the necessary Radon-Nikodym derivatives.

[Fig. 3.4, Partitioning of the Estimator/Detector: the observation y_i feeds the primary processor; its outputs drive the secondary processor, the estimator, and a threshold comparator which issues the terminate/threshold commands.]

The user can reoptimize the receiver "in the field" for any a priori p.d.f. which represents a distribution absolutely continuous with respect to the natural conjugate distribution. The estimation processor has a structure which depends on the cost functional; regardless of the specific cost functional, however, it uses the true a posteriori p.d.f., see (3.23), and hence requires the inputs shown.

The advantages of this partitioning are threefold. First, it eliminates the need for the designer to know exactly the a priori p.d.f.'s with respect to which he must optimize. Second, he can use the mathematical tractability of the natural conjugate prior p.d.f.'s in developing the primary processor. And third, the user obtains a capability for reoptimizing the receiver as his information (or opinion) about the a priori densities changes.

Under suitable restrictions, many of the results apply to continuous observations made on a fixed finite interval as well. The resulting expressions are remarkably similar to those obtained in the discrete case.

3.4.2 Historical Outline. It has long been known that the existence of sufficient statistics is a prerequisite for practical Bayesian estimation, especially if a large number of observations are to be processed; many of the techniques presented in Section 3.1 are implicit in the actions of the Bayesian statistician, even though he may not have bothered to explicitly formulate them or even be aware of the fact that he is using them.

The concepts of sufficient statistics, natural conjugate densities, and Bayesian estimation appear to have first been explicitly connected in a decision-theoretic context by Raiffa and Schlaifer [46], Chapters 2 and 3. Another text, which considerably clarifies the results and employs them in the same context, is that of De Groot [10], Chapter 9. Both references (as well as most other statistical works) treat only the case of discrete, independent observations and are thus rather limited in their applicability to communications or control problems.

It has for the past decade been apparent to researchers at The University of Michigan that the same fundamental connection between these concepts could be useful in detection theory. For example, Roberts [48] observed that if a uniform a priori p.d.f. was placed on phase in the "signal known exactly except for phase, plus Gaussian noise of known spectrum" detection problem, then the a posteriori p.d.f. had a form which remained unchanged as more information was processed; he called this property "closure of the distribution." Spooner [58] similarly found that a Gamma p.d.f. placed on the unknown intensity of white Gaussian noise would reproduce. The report of Birdsall [12] sought to summarize the knowledge about the phenomenon; in many ways, it is a departure point for the research presented here.

Simultaneously, the works of Spragins [59] and Grettenberg [20] were published; both struck close to the core of the subject as presented here and were instrumental in the formulation of this work. The development of this research began as an attempt to extend and clarify the report by Birdsall [12]. The key steps turned out to be:

(a) The explicit factorization of the conditional density function as in, e.g., (2.10).

(b) Defining the natural conjugate density as in (3.8); previous works used a similar definition but employed the entire conditional p.d.f., which is correct but obscures what actually occurs.

(c) Separating the concepts of "parameters which index the natural conjugate densities" and "sufficient statistics"; previous works lumped these under a common symbol, obscuring the actual nature of the "updating" relationships.

The results of this early research are published in [7] and constitute Sections 3.1 and 3.2 of this dissertation. The remainder of the results were obtained as an outgrowth of an attempt to solve the problem of detecting a known signal in 1-SAG noise of unknown parameters (Example 1.3, except that α is known). It soon became apparent that a sampled version of the problem fit quite naturally into the theory developed above; the results are in Section 4.4. Attempts to carry these results to the limit by letting the samples grow dense failed until the technique of using the likelihood ratio function (Section 3.3) was attempted; an application of a theorem due to Baxter (Theorem 5.3) gave the desired result and provided a clue to the intimate connection between the singularity of measures and the use of Bayesian methods in infinite-dimensional spaces. The papers of Halmos and Savage [24] and Bahadur [3] clarified what was happening, and the continuous-time results of Section 3.3 and Appendix B followed. Finally, a study of the subject of Gaussian measures set the stage for the work of Chapter V (and, incidentally, revealed that much of the 1-SAG solution had already been obtained by Striebel [60]).

CHAPTER IV

EXACTLY-KNOWN SIGNALS IN DISCRETE STATIONARY AUTOREGRESSIVE GAUSSIAN NOISE

This and the following chapter will seek to illustrate the theory of Chapters II and III by applying it to the problem of optimally estimating the spectral parameters of M-SAG noise while detecting a known, sure signal corrupted by that noise. The problem, though closely related to the classical Gaussian detection problem, is heretofore unsolved at this level of generality. As stated in the introduction, no claim is made as to the practicality of the result; that is not the object here. Rather, "estimation" consists of explicit knowledge of the a posteriori p.d.f.; no attempt at a detailed analysis of this density and estimates based on it will be made. Similarly, the detection statistic will turn out to be a very complicated function of the observation which, again, will be left as is rather than being approximated or analyzed further. The purpose of these chapters is primarily to illustrate the theory. However, it will become clear that all the elements upon which a practical analysis must be based are present here.

The problem addressed in this chapter can be summarized as follows:

    H_0: y_i = n_i
    H_1: y_i = n_i + s_i,   i = 1, 2, 3, ...        (4.1)

where {s_i} is a known sequence of real numbers and where {n_i} is discrete-parameter, Mth-order, stationary, autoregressive Gaussian (M-SAG) noise whose spectral parameters are not known. The object is to detect the presence or absence of the signal while simultaneously estimating the parameters of the noise.

Section 4.1 will discuss the noise in some detail; two parametrizations will be considered. The first is the spectral parametrization mentioned above, while the second characterizes the noise in terms of the elements of the correlation matrix of M + 1 sequential samples; relations between the two sets of parameters will be derived. The ultimate interest is in discrete noise which results from sampling continuous-time M-SAG noise, but this question will be postponed until Chapter V. Section 4.2 treats estimation of the noise parameters by applying the concepts of Chapter III, especially Section 3.1. Although the notation of H_0 will be used, it is clear from (4.1) that this is possible under either hypothesis, so long as the hypothesis is specified. Section 4.3 then treats the related detection problem by applying the results of Section 4.2; finally, the results are specialized to the case M = 1 in Section 4.4.

4.1 Noise Models and Parametrization

M-SAG noise is the stationary solution to an Mth-order autoregressive stochastic difference equation driven by a sequence of independent Gaussian random variables. As will be shown in Chapter V, a sequence of samples from the continuous M-SAG noise process considered in communications engineering can be modeled in this fashion. Alternate parametrizations will be investigated and the joint p.d.f. of k samples determined. This is necessary as a prelude to finding the sufficient statistics and the natural conjugate class of densities for the unknown parameters of the autoregression.

4.1.1 The Mth-Order Discrete Autoregression. The process {n_k; k = 1, 2, 3, ...} of interest is the stationary solution to the stochastic difference equation

    n_k + β_1 n_{k-1} + ... + β_M n_{k-M} = α e_k        (4.2)

where {e_k} is a sequence of N(0, 1) random variables which are statistically independent of each other and of the preceding noise samples. The parameters {β_1, ..., β_M, α} are real.

the roots of Q(z) Z= + sM.. + = 0 (4.3) have modulus less than 1. The region where this is true will be called cRM. The parameter a will be assumed nonzero. There is a technical inconsistency in calling {nk} a stationary sequence and yet considering only k > 0. This is resolved below by judiciously choosing the initial conditions { nM+,''.. n0} and then agreeing to not consider values of k which are not positive. The power spectral density of {nk} is 2 f(c) = a. Q(e )Q(e- ) 2 ^ ______________^__________(4.4) I eiMw i'(M- 1) w. +(4 4) and thus the parameters {P1". 3M' a } will be called the spectral parameters of the noise.

112 To simplify the notation, define the following column vectors: = (Pl' " " *M)* eR n. (n:Mn * RM (4.5) -i (''i- M+ 1) (4.5) = (nl, n2, *.,nk)* Rk Equation (4.2) can then be written n 0 k_1 =aek' <(4.6) k + * P-k-1 ek' 1 0 The sequence {nk} is to be stationary; since its mean is zero this requires that rd E(ni) (4.7) not depend on i. In practice this can be achieved by means of a "quiet start, "i.e., by choosing the initial conditions n = (nM+1,...., no)* from an ensemble with the desired stationary statistics. It is of interest to see what these statistics are. Using

113 (4. 6) in (4. 7) yields rO = *R + at (4.8) where R is the covariance matrix R En.n.* (4.9) Since it is a covariance, R is symmetric and positive definite. (1) Since {ni} is stationary, it is also Toeplitz.2) Thus, its ijth entry is E n. n.= r (4. 10) l ii-ji and R is constant along its diagonals: ( 1Strict positivity follows from condition (4. 3); see Doob[ 11], pp. 253-255. ()Grenander and Szego [ 19], p. 170.

114 r0 r r2 ^ rM-1 1 r0 rl rM2 R r r r r Rr ~'2'1 ^0 ^M-3 rM-1 rM-2 rM-3 * r0 (4. 11) The quantities {r0, r1,.., rM} will henceforth be called the correlation parameters; it will be useful to find relations between them and the spectral parameters. To accomplish this, M equations are needed in addition to (4. 8). These are found by defining the M-vector of cross-correlations r = (rl,..., rM)* (4. 12) and noting that = E nk_1 Vk> = - R (4. 13)

115 where (4. 13) follows from (4. 6). If the correlation parameters are given, the spectral parameters can thus be found: = R- 1 r (4. 14) = rO r*R r (4. 15) the latter of which follows by using the former in (4. 8). Writing these equations for {r0.. rM} in terms of { a, } is more complicated and will not be done for general M. In a specific problem it is necessary to know the correlation matrix of y0 to achieve a "quiet start" in simulation or to account for the conditioning on Y0 in sequential theoretical work, and the equations must ultimately be inverted. This will, e. g., be done in Section 4. 4 for M = 1. 4. 1. 2 The Transition and Joint Densities. The joint conditioned p.d.f. of k observations Nk = (n,..., n)* willbe found as outlined in Section 1. 1, Eq. (1. 3). This is most conveniently done in terms of the spectral parameters. By inspection of (4. 6), the distribution of nk given nk 1 is N(-n* il, a ); thus, the transition density is

116 fi(n i-n no0, a) = (2 r ) 2 exp - (ni + * )j (4. 16) From (1. 5), the desired joint p. d. f. is a k-fold product of transition densities. Recall that it is conditioned on no to be tractable for sequential estimation. If the resulting sum in the exponent is expanded, one obtains -k =(2rac) exp I- _ * TM([_ +1) ) +to(Nk)] }' - 1 (4. 17)

117 k TM(Nk- 1 = 1 - 1i k 3 M1 Zk ni-i ni-2n.. 2 1 ni-M n. n. *_ = 2 i-i 2 ni-2 ni-2'N-M n i Mni-1 2 ni-Mni-2.* n M (4. 18) k _-Mc~Jk, C ";1'in. n. - k 1 Z. n.n2 nin ] -M (4. 19) to() -3 n. (4.20) i=l

118 To obtain the joint or transition densities in terms of the correlation parameters, it is only necessary to use (4. 14) and (4. 15) in these expressions. 4. 2 Secuential Estimation Before proceeding with a solution of the estimation problem, it is necessary to resolve a notational matter which arises because all the uncertain parameters (say, a and f ) are common to both hypotheses: They are parameters of the noise, and under H0 the observation is the noise, ni = yi, while under H1 the noise may be reconstructed by subtracting the exactly-known signal, n. = y. - s.. Strictly speaking, the parameters are different ran1 1 1 dom variables under H0 and H1 since they carry different distributions and may even be assigned different a priori densities; for this reason, and because in general they need not be common to both hypotheses, they were assigned different symbols ( for H1 and 77 for H0 ) in Chapter II. (1) That practice will be discontinued from here on, and the parameters will be explicitly referred to as (a, Q); the hypothesis can always be deduced from the notation ( )As discussed in Section 1. 1, this ambiguity could have been avoided through use of a more rigorous notation, but only at the cost of considerable complexity.

119 being used (recall Table 3. 1). For obvioust reasonIs, the estimation problem will be solved assuming H0 true. For notational simplicity, no further mention of the hypothesis will be made in this section. 4. 2. 1 Sufficient Statistics and the Natural Conjugte Class. The sufficient statistics have already been found as Example 2. 1, Eqs. (2. 36) - (2. 38). The factorization (see (2. 10) or (3. 1) ) is trivial; G(Yk) = (2)k/2 (4. 21) It will be convenient henceforth to use d -2 p = a > 0 (4.22) instead of a. From (4. 17) and Theorem 3. 1, the natural conjugate class is seen to be q(pP;4) = K() p exp ~ -I [x*M +2* iM + O] (4. 23) where l M is an M x M, symmetric, positive definite matrix; M is an M-vector; %0 > 0 is a scalar; and 4c is a scalar "counting parameter"(see comment "a. " following Theorem 2. 6). The symbol "4" without subscripts refers to the totality of all the conjugate parameters; K(Q) is a normalizing constant. For fixed values of f, (4. 23) represents an unnormalized Gamma p. d. f. in p; for fixed p, it is an M-variate Gaussian

120 in L, truncated to be zero on the complement of the region P. The latter fact makes it very difficult to deal with this p. d. f. in general terms; unfortunately, K(4) is explicitly required to solve the detection problem (see (3. 25) and (3. 26) ), and K-1() = f p exp 2 \ 2 M + -2 [* " M V + 2 dL dp (4. 24) The outer integral is easily evaluated, giving / +1 /~~ r(ycl)2l (~+ 1)2 d+ (4' [f@MMM 20+1 (4.25) where the integrand is the marginal density of. The remaining integral is quite difficult, and no attempt to evaluate or approximate it will be made here. ()See, e. g., Hogg and Craig [28], p. 91.

121 4.2. 2 Updating and Estimation. Suppose that q(p,;; /(O)) is used as an a priori p. d. f., Yk = (Y1. yk)* is observed, and Y0 = (Y-M+1'.. Y0) is known. From (3. 9), it is clear that the a posteriori p. d.f. is q(p, g; p(k)), where the conjugate parameter has been updated using:(k) (0) + tM (4. 27) (k): J00 + t0(Yk) (k) (=,) + to(k) (4. 28) 41,(k) (0) + k/2 (4. 29) c c Because the a posteriori p. d. f. is explicitly known, the estimation problem is in principle solved. Practically speaking, it is not, since an estimate based on (4. 23) may be quite difficult. For large k one expects the p. d. f. to have a well-defined mode near the true value of the parameters, and the "maximum a posteriori" estimates

122 kMAP) - EI-Y (k) SMAP EM J AM (k) c P (k) c (4. 30) MAP ( k* (k)( (k)* (k) (k) ~0 -M M M obtained by differentiating (4. 23) may be reasonable. These relations can also be useful in choosing 0) to model the a priori p.d.f. Making suitable ergodicity assumptions and using a law of large numbers, the MAP estimates are seen to be consistent: For large k the initial values 0) in (4. 26) through (4. 29) become negligible; if ( and a are the spectral parameters of the process generating the observations, then using (4. 18) through (4. 20) it is clear that (k). -1_ (k) - 1 MP r(r* -_ (4.31) ^ ^ r ^ R -r

123 where ro, r, R are the correlation parameters corresponding to a and f; using (4. 14) and (4. 15), one finds that (k) - 9MAP P (4. 32) (k) -2 MAP () P This argument can be made rigorous. If the a posteriori (or the desired a priori) density does not have a well-defined mode, then the estimation problem may involve computing moments of (4.23) and can be quite complicated. As discussed, a detailed analysis of the p. d. f. is outside the scope of this dissertation and will not be attempted. Should the desired (or known) a priori p. d. f. not belong to the natural conjugate class but be expressible as in (3. 19), then the technique of Section 3. 1. 2 can applied. It may be possible to choose classes of densities which simplify the computation of moments and the estimation problem (recall the closing comment of Section 3. 1.2), but this again is a problem which will not be addressed. 4. 2. 3 The Conditioning on 0. Recall from Sections 1. 1 and 2. 3 that, as derived, the a posteriori p. d. f. is conditioned

124 on the initial observations y0; to obtain a joint p. d. f. of the desired form (see (1. 5) and Section 2. 3), the basic relation throughout was Bayes' rule conditioned on y: f(IY)k i p,) fo(p, fo ) ~f(p,...k. Y) -- (4. 33) S (Numerator) dp dg with (1. 5) employed in the numerator. The "a priori" p. d. f. in this expression is actually conditioned on y0 and must be interpreted as a posteriori to the initial conditions; if it is natural conjugate, then so is the a posteriori p. d. f. defined above, f(p,'lykYk,) = q(p,;4g ) (4.34) and the solution given by the use of natural conjugate techniques is thus indeed conditioned on y0. As previously remarked, this must be undone or accounted for before the solution can be considered complete. From (1. 3) it is clear that (4. 33) can be unconditioned as follows: f(YO [ Po g) f(Yk 0Y' P' 0) fo (p' ) ^ ~f(p ^), (Nui=fr~to- (4 35)' J~k f(Numerator) dp dg

125 where for the problem at hand, M 1 2_ 2 1 f(Y0Ip, A) = (27) 2 I exp R- (4.36) -2 and R is related to p = a and e through (4. 14) and (4. 15). Examination of (4. 35) suggests two equivalent ways to account for the conditioning: a. Consider the first and last term in the numerator as a unit and compare with (4. 34); clearly, the a priori p. d. f. f0(p, () in the conditioned result is equivalent to the a posteriori (to y0 ) p. d. f. in the unconditioned result. To obtain a situation which reproduces unconditionally using the natural conjugate class, it is necessary to choose an a priori [ y ] density from the class q(P,;/) f g(P, e (4. 37) % (P' I) =('o'P,) in which case the a posteriori [ y ] p. d. f. will be natural conjugate but not conditioned on y0. b. Separate the first term of (4. 35) and compare with

126 (3. 19) or (3.20); once YO is observed, f(y0lp,L) has all the properties required of r(6) (i.e., r(p, ) ). Thus if a "modifying processor" to account for other-than-natural-conjugate a priori densities has been implemented, one may as well choose a natural conjugate prior as discussed following (3. 24) and include f(y0op,) in the modifying term. Finally, one can ignore the problem; this amounts to nothing more than a refusal to process the information in y0, and may be entirely reasonable if a large number of observations (k >> M) are to be made. 4. 3 Sequential Detection 4. 3. 1 The Signal Hypothesis. Recall from (4. 1) that H1: ni = Yi - i (4.38) Thus, all estimation results of the preceding section may be applied under the signal hypothesis if, instead of using the observation, one uses yi - si Specifically, a set of necessary and sufficient statistics is tl(Yk) = t(Yk - Sk) as defined in Example 2. 1, and the natural conjugate class of densities is p(p, L;Y) =K(y) p exp - 2 M + 70] (4. 39)

127 with the conjugate parameters y defined in complete analogy with the parameters 4 of (4. 23). If the initial parameters are y(), updating takes place through r (k) r () + TM(Yk -1Sk) (4. 40) rM - k -r (k) = (0) (Y ) (4.41) X'M =M + t M(Yk - Sk) (k) (0) (442) Y/0 YtO(Yk- Sk) (k)= (0) + k/2 (4.43) c c and the a posteriori p. d. f. may be analyzed as in the preceding section. Note that 4(0) and y(O) may be different; even though the same observation parameters are being estimated under different statistical hypotheses, there is no reason why the a priori densities must necessarily be the same. 4. 3. 2. Detection. Putting the preceding expressions into (3. 25) or using (3. 26) directly, one finds that the marginal likelihood ratio is =K[()] [ k)] (4.44) K[ ] i K[/]

128 where K( *) is defined by (4. 24) or (4. 25). Clearly, the naturalconjugate normalizing constants are explicitly required to solve the sequential detection problem. Recall that these involve integrating over J7; since detailed solution of the problem is not a primary goal of this dissertation, the matter will not be pursued further. 4. 4 Gauss-Markov Noise: M = 1 4.4. 1 The Autoregression. The noise is generated by % + la nk-l = aek5) where I 3 I < 1. It is desirable that the results correspond to samples from continuous-time lowpass (as opposed to bandpass) Gauss-Markov noise; this requires the further restriction that - 1 < < 0, which will henceforth be made. One obtains the following correspondences with quantities defined in Section 4. 1: General M M= /31 n. n. -1 1 r r. R rn

129 Equations (4. 14) and (4. 15) become P1 = - rl/r (4.46) a = r - r2/r which may be inverted to find - _. r0 (4. 47) r1 =. 1-1 and so yo has (or, in simulation, should be chosen from an ensemble which has) a N(, ( 41 distribution. The joint conditional p. d. f. of k observations is given by (4. 17) as

130 k -2 k f(Nkln0 P, P) = ) 2 ex p - 1 i-1 k k + 2P1 3 n.n + n. (4.48) ^ ~ " i=l i The statistics TM( ), tM( ), and t0( ) of (4. 18) through (4. 20) are all scalars; they can here be combined into a single vector (1 k i=l t(Nk) = Z n. Z ni k/ 2 (4.49) (1For an explanation of the last component, recall again comment (a.) following Theorem 2. 6.

131 4. 4. 2 Estimation. The notation of H0 will be used; since all the vectors and matrices defined in Section 4. 2 are one-dimensional for M =, much of the notation is superfluous and will be dropped. To parallel the definition of t(-) above, the natural conjugate density (4. 23) is written q(p, 1; ) = K(g) p exp [1 1 + 21 2 3] (4. 50) where?, the parameter indexing the densities, is a vector with the indicated components which takes values in the subset of R defined by' = {kiR4: 1' P43,' 4 > and 2 4/14/3 < 0} (4. 51) Analysis of the density can be carried somewhat further than for arbitrary M. The marginal p.d.f.'s of 1 and p are, respectively,

132 ~4+1 K(Q) r( i4+1)2 q9(;4/) X4+1'... (Pa -1+21 <2+ 4<3) (4. 52) and q(Pp; K( ( ) 2 2 ^ ^ 12P,P>, exp -2'~1 (~/1 3 /2a), P > 0 ____ r 2~C2 (41 4/32A/) (4. 53) where 4 ( ) represents the unit Gaussian cumulative distribution function. If the initial value of V4 is chosen an integral multiple of 1/2, then it will remain so throughout and (4. 52) can be integrated to obtain an explicit (though not closed) expression for the

133 normalizing constant:(1) Let n = 44 1 4 D-=^ - D = 1,4'3 -,2 P= 4/1 + 242 + /3 N. = (i- 1)!i 1i ^ (2i)! M. = (2i)! i (i!) Then K i~ 2(2n) ^ (l nI ___P /4D1\ = n! 2() 21 (g1- g/2) ( Ni'g4/P N. 4D + 4'2 1 N (413i] ~J-~D~I [tan'(4j —D) +tan (4 )] (4. 54) (The necessary integrals may be found in the C. R. C. Tables [ 55], p. 404 (#113) and p. 414 (#241).

134 provided that n is an integer, or - 1 (m+l)lm! 23(m r(m+ /2) -y =Dm+ (2m+2)! - f? ^ /D r'" 2) m ( D)i E2CM.,rP i=O + H MO (4gD4,) (4. 55) if m is an integer. These expressions, though not impressive for the insight they provide, are certainly suitable for machine evaluation. From (4. 49) and (4. 50), the sufficient statistic updates the parameter of the natural conjugate density through i(k) -= (0) + t(Y(456)

135 The mode of the a posteriori p. d. f. occurs at (k1) (k)= - 2(k) Ak) _(k)4(k) p (k) / )1 /4 k(k) (k)- (k)] (4. 57) 3 -[ 2(4. 57) and these may provide reasonable estimates. To account for the conditioning on y0, recall that y0 ~ N(O, r0) as given by (4. 47). So one can follow Section 4. 2. 3(a. ) and choose the a priori density from fo(P i; )) = p exp 2 [1 1-, +21, )2 +3() (4. 58) in which case the a posteriori [ O ] density will be natural conjugate with - (4 + yoa )1/2 (4. 59)

136 Note that (4. 58) has no limit as 1 - 1; this technique may be undesirable unless 01 can be bounded away from - I. Alternately, one can follow the technique of Section 4. 2. 3(b. ) as outlined there. 4. 4. 3 Detection. Not much more can be said. Since n. = yi- Si under H1, all the results of the previous section may be applied as detailed in Section 4. 3. 1; the constants K((k)) and K(y )) can be evaluated as in (4. 4) and (4. 55), and the likelihood ratio found from (4. 44). No significant simplification results when the ratio of constants, rather than the constants themselves, are considered. Thus, the discrete solution is unreasonably complicated even for M = 1. Instead of pursuing it further, attention will be focused on the continuous solution. This is both theoretically more interesting and results in somewhat simpler expressions.

CHAPTER V EXACTLY-KNOWN SIGNALS IN CONTINUOUS STATIONARY AUTOREGRESSIVE GAUSSIAN NOISE The problem addressed here is H0 yt=n Ho: Yt = nt H1: t =nt + s(t), te[0, T] (5.1) where nt is M-SAG noise with unknown spectral parameters (Section 5.1) and s(t) is an exactly-known sure signal which is 2M times differentiable for 0 < t < T. It is a continuous version of the problem solved in the preceding chapter. Section 5.1 discusses and establishes parametrizations for the noise, and relates a sampled version of nt to the process studied in Chapter IV. Section 5.2 presents pertinent results from the theory of Gaussian measures; these are needed to insure nonsingularity of the estimation and detection problems, and are also useful for evaluating the required Radon-Nikodym (R-N) derivatives without the tedium of finding a limit for the likelihood ratio function. To verify that the resulting derivatives are the same, Appendix E evaluates such a limit for M = 1. Section 5.3 solves the estimation and detection problem for 137

138 arbitrary M. The sufficient statistics are found and the natural conjugate density and updating equations are explicitly written. The chapter concludes by giving more detailed solutions for M = 1 and M = 2. Even for these cases, it is found that the detection statistic is extremely complicated and probably not practical as written. 5.1 Continuous Stationary Autoregressive Gaussian Noise. The noise process under consideration, denoted int, t [0, T], is a finite segment of a zero mean, Mth-order, stationary autoregressive Gaussian (M-SAG) random process; its spectral density function is rational and of the form s (f2) = a (i2 p2k; (5.2) n M i Pk i=l It is well known that this process can be modeled as white Gaussian noise filtered by a causal, lumped parameter system with Fourier transfer function Re(i)> 0 H(j27rf) = M ~; pi # pk (5.3) -(j2iTf) oi=L2?,f + J a >0

139 An alternate parametrization for the filter of (5.3) results from expanding the denominator H(j27Tf) (5.4) Q(j2iff) where / M M-1 Q(z) = z +qlz ++ + q (5.5) is a polynomial with real coefficients and with distinct roots in the left-half z-plane. The coefficient q is the sum of all different combinations of the Pi taken k at a time. In terms of this parametrization, the spectral density can be re-written as S (f2) = 2 (5.6) I Q(j27Tf)2 Occasionally, it will be necessary to use yet a third parametrization of Sn(f2), namely s (f2) =..1 (5.7) 0 M- i (j27rf) i-%0M-i (j2f)i where 0 = la and 0. = qial, i=1... M. From (5.2), the autocorrelation function of the noise is

140 M -p. IT I R(T) = a2. e (5.8) n i=l where 2 2'. =F 2pkl,( ] (5.9) 1 L kii Pi The process nt is well known to be Mth -order Markov.(l) Since it is Gaussian, the M+1 parameters {a2, P1,. PM} or 2 {a, q1 ). M'' M constitute a complete statistical description. It is easily verified that for 6 > 0 the parameters {R (0), R (6),..., Rn(M6)} provide an equivalent description; the two sets are uniquely related through (5.8). Suppose n is sampled at t = i6, i=-M+1,..., 0, 1, 2,..., and the corresponding discrete process denoted {n. }. It can be shown(2) that this discrete process is the solution to a related More precisely: If nt is sampled at arbitrary tK < t2 < t3 <... and samples arranged into "state vectors" n.= (n, nt,..., ) — 1 i 1i-l i-M+1 then the discrete vector process {n.; i = 1, 2, 3,... } is a Markov process. See, e.g., Astr5m [2], Arts. 3.3 and 3.10. (2)Doob [11] or Astr5m [2], p. 84.

141 autoregressive difference equation (see Section 4.1.1); clearly any (M+1) adjacent samples are a multivariate Gaussian random variable with zero mean and covariance matrix r1 r0 r1 R r2 r1 r2 (5.10) M+1 1 0 (5.10) r. r1 rM r r M 1 0 which is positive definite, symmetric, and Toeplitz() so that r.. = ril = R (li-jl6). Using these facts and equations (4.11)ij li-ji n (4.15), the parameters of the related difference equation can be determined. It is also well known that almost every sample function of nt is uniformly continuous and is everywhere M-1 times continuously differentiable; the (M-1)St derivative will be a function of unbounded variation. (2) )Grenander and Szego [19], p. 170. (2) ()This follows from a theorem of Baxter [4] which will be stated later as Theorem 5.3; see, e.g., Wong [ 62] pp. 221-222.

142 Before studying the detection and estimation problem in M-SAG noise, it is necessary to investigate results concerning the singularity of, and R-N derivatives for, Gaussian measures. 5.2 Gaussian Measures 5.2.1 Equivalence and Singularity. Once again, consider the pre-probability space ( w, ) whose elements are sample functions of {Yt; t [ 0, T]}, where Jlis generated by the process. If 1 and 0 are any two probability measures on (f, e4), 0 then one can combine the Radon-Nikodym theorem (Theorem B.1) and the Lebesgue decomposition theorem(1) and write THEOREM 5.1 There exists f >0 which is meas.[.d ], and a finite measure /L1 0, such that VA ed t(A) = Jfde 0(y)+ i(A)l (5.11) A 0 Clearly,? << K iff L - 0 and then f = d /d? as usual. A measure Pon ( Y, ) is a Gaussian measure if {Yt, t [0, T]} is a Gaussian random process with respect to, (2) ()Royden [50], p. 240. (2)That is, the random variables {y, teTk}, with Tk any finite parameter subset of [ 0, T], are jointly Gaussian. See Wong [62], pp. 46-47.

143 Assume henceforth that the process is separable and continuous in nrobability. Any Gaussian? is uniquely described by the corresponding mean and covariance functions of {yt, t [ 0, T] }, and t the singularity (or R-N derivative) of two such measures can be studied in terms of these functions. An extensive literature exists on the subject. 1 (a) The Basic Dichotomy. It is a basic result that if 49~ and P0 are Gaussian measures on (, e/) then either 1 -- c0 or ~1 l 0. This is proven by simultaneously diagonalizing the covariance matrices of {yt, t Tk} under 1 and s; as the finite index set Tk grows dense in [0, T], 2 1 and 0 are obtained as infinite products of (independent) onedimensional Gaussian measures, and the result follows from a theorem due to Kakutani. 2) If under 1 and e0 the process has the same covariance function R(t, s) but different mean-value functions (say, without loss of generality, ml(t) = F(t) and m0(t) = 0), then only one "matrix" need be diagonalized; this is easily accomplished and the result can be stated quite simply: Especially useful are the papers by Root [ 49] and Yaglom [63], and Arts. 6.1-6.4 of Wong [62]. (2)SeeRoot[49], p. 296 orWong[62], p. 215.

144 THEOREM 5.2(1) Let {X., 0i(t)} be the eigenvalues and functions of the kernel R(t, s); let S c L2[ O, T] be the span of { 0i(t)}, and expand yt and,i(t) in terms of the i. Then - o if jU(t) eS and so 2 z Mi < X i=l X. in which case 1 = expii 1 i (5.12) i=l Xi Otherwise,? L?. This is exactly the simple hypothesis detection result stated in Section 1.3.1 [Eq. (1.18)]; indeed, the problems are seen to be same. The case 19 jL 0 results in "singular detection," since one need only examine the separating set A (see the definition of singularity preceding Theorem B.1 in Appendix B) to decide c0 or 4* with probability 1. The possibility of singular detection raises serious questions regarding the Gaussian model; these have (2) been discussed at length in the literature. (2) 717Much of this theorem is due to Grenander [18]. (2)For example, see Root [49] or Slepian [ 56].

145 The dichotomy theorem for the case that the mean and the covariance functions both are different under 91 and 1? is much more difficult; its general form was arrived at independently by Hajek [ 21] and Feldman [ 15]. More restrictive forms were known much earlier; a good survey is contained in the paper by Yaglom [ 63]. No benefit is derived from treating the general result here; instead, only the easier case where {Yt, t [ 0, T]} has rational spectral density under 0 and 1? will be presented. Before turning to that result, the following theorem is presented for later reference. THEOREM 5.3 (Baxter's Theorem)(1 Let {yt, t [ 0, T]} be Gaussian with mean /p(t) and covariance R(t, s), where gL is of bounded variation and where a2 a R(t, s)< oo (5.13) at as except possibly for t = s. Put f(t) = lim R(t,t () - R(s,t) li R(tt) - R(s,t) st sts t-S (5.14) Let {T } be a family of finite parameter subsets of [ 0, T] which kBaxter [4].

146 grow dense in [, T] as k- Q. Then ZkY Y, ) 2 f/ (t) dt (5.15) i=l\ i- o a.s. as k —. fl This theorem is basic to many results on the singularity of Gaussian processes. If Yt is stationary then (5.14) and (5.15) say that, with probability one, the quadratic variation of a sample function is equal to the jump in the derivative of the autocorrelation function at the origin multiplied by the length of the observation interval. It is remarked that a continuous function with nonzero quadratic variation is necessarily of unbounded variation, and hence is not Riemann-Stieltjes integrable with respect to itself, or anywhere differentiable. (b) Processes with Rational Spectral Density. Suppose that under 0 and 1', Yt is a zero-mean stationary Gaussian process with spectral density function S2) PP (j27rf)2 Snif = Q=j i j; n = O, 1 (5.16) where j = v-1, P (z) and Q (z) are polynomials with no roots in the right-half z plane, and the order of Q( ) exceeds that of Pn( ). In engineering terms, yt is the output of a stable realizable n

147 filter with rational transfer function P (s) H (s) =; n = 0, 1 (5.17) which is driven by unit-density white Gaussian noise. The autocorrelation functions corresponding to (5.16) are R(T) = J S (f)e df (5.18) - n The dichotomy theorem for processes of this type can be stated as follows: THEOREM 5.4 - - 40 iff lim = lim (j27rf) 1 (5.19) f-oo S0(f ) f- oo1 Otherwise, l g U The necessity of (5.19) for equivalence was established by Slepian [ 56]; the theorem in its entirety was established in the (1) aforementioned works of Feldman and Hajek. Slepians observation follows directly from Baxter's Theorem (Thm. 5.3)' Let m+l (See also Hajek [22], p. 439.

148 be the smaller of the differences between the order of the numerator and denominator polynomials in (5.16). Then t is m times differentiable and, unless (5.19) is satisfied, the test function of (5.15) when applied to the derivative y(m) will converge (with probability one and for arbitrarily small T ) to a nonzero number or to zero according as whether'1 or 01 is the measure generating the process. Note that (5.19) requires that the spectral densities S(f2) and S (f2) have the same high-frequency asymptote. 5.2.2 R-N Derivatives for Rational Spectrum Gaussian Processes. Recall that in the estimation problem the observed process is defined on (,, 0, Ad e.(1) The quantity of interest is the R-N derivative d J X(yt ) = -d - (yt) (5.20) where 40 dominates A.(2) Theorem 2.3 showed that this 0 quantity yields a sufficient statistic for 0 and that a generalization of the factorization criterion applies. Under the assumptions of 1Again, the compound hypothesis detection problem will be left for a later section. As usual, the notation of H1 will be used here. (2) 4, may or may not be a member of,A. 0

149 Theorem 2.2, it can be found as the limit of a likelihood ratio. From (2.23) or (3.29), X(yt; 0) may be employed in place of the conditional density function in a generalization of Bayes' rule. The preceding section gave necessary and sufficient conditions for the R-N derivative to exist (i.e., for Am to dominate 0 0 f) in the case that Ad and all the At c gAare given by 0 rational spectral densities; if those spectral densities are denoted P(j2lTf) 2 S0(f2) = | _______ I (5.21) Qo(j27rf) and 2 P (j 2f) 2 S (f2 ) J ]f) 1 (5.22) 0Q (j2lf) respectively, then it is necessary and sufficient that P (j27Tf) Q0(j2irf) 2 flim Q(j2f) -P(j1) 1 V e OE (5.23) The subscript 0 is used to denote the spectral parameters regardless which set [see (5.2)-(5.7)] is actually being employed. Suppose (5.23) is satisfied; it is then necessary to evaluate the R-N derivative. Finding a limit for the likelihood ratio can be extremely difficult and tedious (this is illustrated in Appendix E for

150 I-SAG noise); luckily, the literature contains general expressions, derived in a variety of ways, for such R-N derivatives. This chapter is concerned only with M-SAG noise; i.e., P1 (j27rf) = a a positive constant, and the order of Q (-) is M. R-N derivatives for this type of process take a particularly simple form; the results stated here are due to Hajek. (1) Instead of finding the R-N derivative of B with respect to some measure given by another rational spectral density, Hajek defines the ~(2) following: ) DEFINITION 5.1 Let P+ denote the Gaussian measure such that under M, ao ^ ~~~Mc~(M (i) The vector (y0' y6,., y 1) is distributed according to M-dimensional Lebesgue measure. (ii) yt(M -) is a zero-mean Gaussian process of independent increments such that Eldy(M-) 12 = -1 dt (iii) For all t, [yt(M-) yO(M) is independent of v v- v(M- ) Y0 YO' Y'"' Y0 ~.HajekL [22], Art. 7. 2Ibid, p. 433.

151 Let 0 = (0...,'M)* be the set of parameters defined by (5.7). These are equivalent to (pl. PM) or (q qM) Recall that 00 = a1 1. Define the state vector Yt = (Yt Yt * Yt )* t (5.24) Hajek shows that if {yt, te[ 0, T]} is to be stationary, the autocorrelation matrix of the "initial condition" y0 must bel) [E y () (k) M- 1 [D ]1 E j) Y )] j, k=0'jk (5.25) where the elements Djk are 2 0~() M- i M+i-j-k- for j+k even D = (5.26) jk 0 for j+k odd and the sum runs over max(0, j+k+l-M) < i < min(j,k). He then proves THEOREM 55(2) Let {Yt' t[ 0, T]} be a finite segment of a stationary Gaussian process with spectral density (5.7). Then 9'91+ 0 M,)0 (2Ibid, p. 421. Ibid, p. 433.

152 and.d...... i T M-1 k T (k) 2 &~... = ID jk iexp - 2 AkJ [t ] dt ddk 0O k=O 0 M, 0 (5.27) ^ M-1 M-I 1 M- M[ (j) (k) (j) (k)D j=0 k=O j+k even 2k where Ak is the coefficient of (j27rf)k in the denominator of (5.7); i.e., is given by M M M v n M-n M-m Z2k (5.28) ~ ~ Ge z (-z) z (5.28) m, n=0 k=O An explicit expression for Ak is easily found: k = 0 (Mk) - i (1)M-i (5.29) i i 2(M-k)- i where max[0,M-2k] < i < min[M, 2 (M-k)]. The fact that the dominating measure is ^/+ rather M, 0 than a more familiar measure may initially seem bothersome; however, it turns out to be irrelevant since the terms in X(yt; 0 ) or in di'+ M, 0

153 which are unique to the dominating measure will cancel when that R-N derivative is used in Bayes' rule. In fact, the dominating measure of Definition 5.1 is superior because it introduces no extraneous constants into the problem; this, again, will be illustrated for M = I in Section 5.4. 5.3 Estimation of Noise Parameters and Detection for Arbitrary M. 5.3.1 The Implications of Singularity. Consider first the estimation problem; the uncertain noise parameters will be denoted as 0. Recall from (2.23) that a generalization of Bayes' rule is d o d (Yt; ) fo( ) Iyt) = (5.30) f (Numerator) dt where? dominates Lk and AP is induced by the M-SAG 00 o noise process with parameter 0 (the subscript 0 will be used regardless of which parameterization is actually employed). If a sufficient statistic exists, the R-N derivative may be factored and one obtains (3.32). From Theorem 5.4, it is clear that the measure 0 0 induced by an N-SAG noise process is equivalent to iP if and 0 only if M = N and the numerators a2 of the corresponding spectral densities (5.2) or (5.6) are identical; this means that the

154 2 parameter a may not be included in 0. If it is included, then 2 one can always construct a test function which converges to a with arbitrarily small error for arbitrarily small observation in2 tervals [0, T]. Since a characterizes the properties of {yt' t [ 0, T] } for very high frequencies, this is heuristically 2 reasonable. Physically, error in estimation of a is bounded below by the ability to sample yt at arbitrarily high rates; the argument is the same one found in discussions of singular Gaussian noise-in-noise detection. Indeed, the problem is the same since the "noise-in-noise" detection statistic is precisely the R-N derivative of (5.30). Suppose the dominating measure is P + as in Hajek's ~-1 M, 0 Definition 5.1. From (5.7), 0 = Ia I so that again the constant a cannot be estimated but, rather, must be known. Section 5.4 2 will illustrate the singular estimation of a for the case M = 1 by using Baxter's Theorem (Theorem 5.3). Now consider the detection problem. Since a simple-hypothesis (parameters-known) solution is available (see Section 1.3.1), equations (3.37) or (3.38) are the obvious candidates for a solution. Recall the comment made at the beginning of Section 4.2; the parameters under either hypothesis are 0 = (0 1... 0M) [or equivalently, (ql' * * qM) or (P1 * PM)] and the notation which indicates that their distributions evolve differently under Hn and H. will

155 be dropped. From (5.2), it is clear that for any 0 E O the measure induced on (y,cJA) by {yt, te[0, T]} under H0 is identical to the measure induced by {y - s(t), te [0, T]} under H1(1) The R-N derivative of the former with respect to,_ + 1 M, 00 is given by (5.27), and the R-N derivative of the latter is the same expression with yt- s(t) replacing Yt Clearly, a (or 00) cannot be estimated under either hypothesis. If this parameter is known and one makes the necessary additional assumptions on s(t), (2) then (3.37) [with (5.30) used for the a posteriori p.d.f. Is] provides a well-defined detection statistic. If the dominating measures used in (5.30) are not M 0+ M,00 but are some other rational spectrum Gaussian measures satisfying (5.23), then from Theorem 5.4 they are equivalent to 9 + M, & and the result is unchanged. This is so because then _ d / d'M 0 0 d 0( d M,0 0 1(5.31) d,_#0 dB+ d,0 deco0 d M,l 00 and the second term cancels out of (5.30). Section 5.4 will specifically illustrate this for M = 1. This assumes that s(t) has at least the same smoothness properties as almost every sample function of {yt};this assumption is weaker than the assumptions necessary to guarantee a solution to the simple-hypothesis problem, and hence is of no concern. (2)See the beginning of Appendix A; s(t) must be (2M)-times differentiable on the interval (0, T).

156 5.3.2 Estimation of Noise Parameters. The notation of H0 will be used. As is by now clear, all that is said can be repeated for H if y is replaced by y - s(t) Henceforth, the R-N derivative of (5.27) will be denoted f(ytl0). This is only a slight abuse of notation since this quantity is a direct generalization of the usual notion of a p.d.f. (which is the R-N derivative of a probability measure with respect to Lebesgue measure). The ultimate goal is to factor this R-N derivative as in (2.22), recognize the sufficient statistics, and determine the natural (1) conjugate class of p.d.f.'s on.) From previous results and discussions (e.g., Sections 1.1, 2.3, and 3.3.3) one conjectures that this class will be much simpler and more useful for sequential estimation if f(yI 0) is first divided by ("made conditional to") the p.d.f. of the initial state y0 = (y, y,... y0(M-)) ( Recall that After the factorization of (3.31) the remaining expression g[t(yt); 0] represents a bona-fide p.d.f. on the finite-dimensional space of the sufficient statistics. (2)In the finite-dimensional case, f(YkI0) was divided by f(yOI ) and y consisted of M discrete samples. It seems reasonable, though no attempt will be made to prove, that passing to the limit after such a procedure yields the same result as above (Section 5.4 will illustrate that this is true for M=1 ). The question is of little relevance, since the primary concern is with the functional form of the densities involved. If the above procedure is arbitrarily adopted, then all the results of Section 4.2.3 can be justified.

157 f y 0 ) = [(2r)-M ID j ]exp - y [D k]YO (5.32) The quotient of (5.27) by (5.32) is the desired "conditional density" TOM T M T01 M-1 k T(k) 2 f(Ytl0(2yo) = (2)exp 2 - k (-) Ak [Yt ] dt (5.33) M-1 M-1 0. k 0) (k) 4.0 aT [Y T O ]Djk j+k even The following facts are obvious by inspection: (i) Sufficient statistics for the estimation of 0 = (,0..., 6M) in (5.7) are: [ (k) 2dt (k) (k) ^^t I d O Y^ T;k=0,1...M-1 (5.34) (ii) With the results conditioned on yO the natural conjugate class of densities for 0 = (1... 0M) = (q/lal,..., qM/ lal) is an M-variate Gaussian p.d.f. (Note from (5.26) and (5.28) that all terms in the exponent of (5.33) are either linear or quadratic in 0.) This density is truncated to be zero on that region

158 of R where the roots of e00M + 0M- +.. + M = have positive real parts; the complement of that region is denoted (1) It is shown in Section D.4 of Appendix D that (5.33) can be re-written as follows: Define the parameter vector 0= (O1 M) (5. 5) and the functions of the observation (k)(yt) = J[yt)] dt (5.36) 0 k) (j) (k) (j) y ()37) E(J'B)(y) YTJy^ -ygy^ (5.37) Use these sufficient statistics to write an MxM, positive definite, symmetric matrix T(yt) whose elements are ('~ ()Note that the natural conjugate density written in terms of (P1''' PM) is much more complicated, see (5. 5) ff., but is defined on a much simpler region. (Namely, Re(p.) < 0 for all i.)

159 (5.38) 3i + (M- + (-1) J (yt); i+j even it. L (_1)M-k-l-min (i j) k (2M-i-j-k-l, k) j d oE (y);i+j odd where max(0, M-i-j) < k < min(M-l, 2M-i-j-1). Also, define (1) the M-vector t(yt) whose elements are: 20to(Yt) 2; =1 t.(Yt) = (5.39) jt 20 tj(Yt); j=2...M then the "conditional density" for use in Bayes' rule is M f(ytl 0, y0) = (27T) exp{ -~ [6* T(yt) + t*(t) (5.40) Clearly, the natural conjugate density has the same form but involves an MxM conjugate parameter matrix m and a conjugate parameter M-vector 4; these are then updated through ( 1Again, the use of t for "time" and t( ) for a sufficient statistic, and of T for observation period and T(.) for a sufficient statistic, will hopefully cause no confusion.

160 r = 0 + T(yt) (5.41) T -= 0 + t(Yt) Their components are functionally independent only to the extent that those of T(y) and t(yt) are. Consider now the sequential processing of finite subintervals as discussed in Section 3.3.3. Define d Y - Y -i -Ti (5.42) d Y(i-1, i] Yt; Ti_1 < t < Ti} and similar notation for the derivatives of y. Since sample functions of yt and their first M-1 derivatives are a.s. uniformly continuous, it is irrelevant whether one considers open or halfclosed intervals. If (5.36) is modified to run over the interval (Ti1, T ), the endpoints in (5.37) changed accordingly, the numerator of the first term of (5.39) changed from "T" to "T - T. " and appropriate notational changes are made, then 1 i-1 the density of (5.40) can be written M f[(i-, i) Yi] = (2r) 2exp{- [O*T(Y(i i) (5.43) + t*_, (li))]

161 The unconditioned density of (5.33) could be similarly modified; from either one, it is easy to verify the relations k y (0, k)y ik f[(i, i), il] (5.44) By definition (see (5.33) ), f[Y(o,k)I [] i=0 f(YO 0 ) (5. 45) The form of all these expressions is precisely the same as in the finite dimensional Markov case and so the formal manipulations (including forming the natural conjugate density, updating its parameter by using the sufficient statistic, and accounting for the conditioning on y0 ) may proceed in exactly the same fashion. 5.3.3 Detection in Noise of Unknown Parameters. Recall the discussion concerning the detection problem begun in Section 5.3.1. If one considers (3.37) and uses (5.30) with the appropriate numerator() for the a posteriori p.d.f.'s on 0, one obtains the non-sequential (not conditioned on yO ) detection statistic. The a priori densities are seen to cancel, and there remains )i.e., f(Yt I )f0(0) for H0 and f(yt-s(t) )f( )) for H1, where f(. ]0 ) is the density defined by (5. 27).

162 J'f (Y- s (t) ) f () d 2(yt) = __ _ _ __ _ _~ /f(y IX)- f(X) dX e (5.46) f(Yt 10 ) f(Yt- s(t)l0) (t1 As has been previously remarked (e.g., following (1.38) or (3.25)), the parameters 0 must cancel out of this expression; alternately, it may be evaluated for any admissible fixed value of 0. Owing to the complexity of the terms (recall that (yt 0) is the simplehypothesis solution as given by the Metzger model or by other "classical" techniques), this cancellation will not be verified for arbitrary values of M. The following sections will do so for M=1 and M=2; they will also present results equivalent to (5.46) but stated in terms of the natural conjugate densities and thus useful for sequential processing. 5.4 The Ornstein - Uhlenbeck Process: M = 1 Consider the 1-SAG noise process, i.e., tie zero-mean stationary Gaussian process {nt, t e[ 0, T] } with spectral density function

163 S (f2 a2 S(2) = 2 2 (5.47) (27if) + p1 - -^ ~ ^ (5.48) (j2irf) + q1 (5.48) where, in this case, p1 = ql so that there is no difference between the parametrizations. (The parameter will be called ql from here on.) The corresponding autocorrelation function is 2 -qll RT) = a e; q > 0 (5.49) To illustrate the previously claimed singularity in the estimation of a, consider that nt is observed on [0, T] and is sampled as usual. Note that R' (0) - R' (0+) = a (5.50) n n where the indicated derivatives are limits from the left and right respectively. From Theorem 5.3 (Baxter's Theorem), k lim C (n.- n 2 2T a.s. (5.51) k-oc i=l

164 where n is the noise sample at t = iT/k. Since the convergence is strong (i.e., holds for almost every sample function), the dis2 crete test function of (5. 51) can be used to estimate a with arbitrarilysmallerror for arbitrarily small T. (Physically, of course, one is limited by the ability to sample nt at very high rates). It should be noted that since a.e. sample function of nt is uniformly continuous on [0, T], (5. 51) implies that a.e. sample function is of unbounded variation. (1) By direct substitution into Hajek's result (5.27) one finds the desired Radon-Nikodym derivative de f(ntIO) = 0, 2 2 )2 exp dt2 10 ( 280 2 0 1 a ( y2T 2 2 10 (YT + Y0)j (5.52) where 00, 01 are parameters of the spectral density written as in (5.7); recall 06 = lal- is not estimable and must be known. See Appendix D, Section D. 5.

165 In terms of the parameter q1 = 1 lal1, this can be written g1 (^ I F? 1 f(ntlql) - exp 2 Jtdt a p 2 2a2 I t 2 2 + q1(YT+ Y0)] (5.53) This can be considered as a R-N derivative or it can be normalized and viewed as a bona-fide p.d.f. on the "sufficient space" with co2 ordinates { J nt dt, no, nT } It was claimed at the end of Section 5.3.1 that the same result is obtained if one evaluates the R-N derivative with respect to an equivalent measure (given by another rational spectral density) as the limit of a likelihood ratio function (Section 2.2.2). ( This is verified in Appendix E using the following procedure: Let P, be the measure induced by a I-SAG process with fixed parameter q* > 0 instead of q1; * is equivalent to A. Sample the process nt as usual; the sequence of samples is the solution to a first order autoregression whose parameters may be found in terms of q1 or q* and the sampling interval. Thus the likelihood ratio function (conditional to no ) may be evaluated using (4.48); its limit can be found, but it is necessary to explicitly employ Baxter's ( 1This was first done rigorously by Striebel [ 60].

166 result (5. 51) in the process. This limit is X(ntln0o; q) as defined in Theorem 2.3 but conditional to nO (i.e., its factorization yields the density for sequential processing). Multiplication by f(n0lql) gives the R-N derivative X(nt; q1), which still contains the "nuisance parameter" q*; this cancels out in Bayes' rule (5.30), and the result is identical to Hajek's density (5. 53). 5.4.1 Estimation. Suppose the observation is noise, i.e., H0 is true. To obtain the sequential natural conjugate density on q, (5. 53) must be conditioned on yo. From (5.32), or by noting that 2 Rn(0) ='q (5.54) one finds the p. d. f. of the initial sample / ^ \ ^ F ^1 21 f (0q) = 2 exp [- y (5 55) Dividing (5. 53) by (5. 55), or merely putting M = 1 in (5.33), yields f(yl' y,) = exp [2 J2 dt ql T +2 (. 56) q1(a2T + y0 - y) 56

167 By inspection (recall Theorem 3.1), the conjugate density on q is Gaussian, (1) P(ql; 4) = K(g)exp { 2 [q2 1- q142] } 1 (5.57) where 1 > 0. Its parameter is updated through TT 2 yt dt (T) 4(0) C + (5. 58) ^2 2 2 (Ta T +y -y. If observations are made on sequential subintervals, (5. 56) may be su itably modified and the conjugate parameter updating equation becomes, in the previous notation, T. 1 2 (T (Ti_) 1) T i a (T -T 1) + y2 _- 2 To be totally consistent with Chapter III, this density would have to be denoted q( ~; _) since H0 is under consideration. However, it is clear that the natural conjugate class is Gaussian under either hypothesis and only the parameters ( ~ for Ho, _y for H1 ) differ. - _

168 Analysis of the natural conjugate density p(ql; 4) is straightforward; it is a truncated Gaussian p.d.f. with mean and variance prior to truncation of - -2 2 a2 _()) = 24, -- a()=S(5.60) respectively. Once it is truncated to the positive half-line, its normalizing constant is K(],) = - (~ - (5.61) a \2av,/ where the function 0 (.) is the logarithmic derivative of the Gaussian c.d.f. and is defined by (x) a (x) = (x) - o < x < xo where 0(x) is the unit Gaussian p.d.f. and c(x) the corresponding cumulative distribution function. The mean and variance of P(q;-4) are given by:'()The utility of this function in certain detection problems was first noticed by Roberts [47], p. 68.

169 E(q1 1g) = - (~2a -) (5.62)' 2- \-2a / (5. 2 2 2 a2 a\2 2 \ 24/1 \l4a 4/ 2 ~ (5. 63) var(qll_) = E(ql 1)- E2(q1) (5.64) and are thus quite complicated; under suitable conditions, the mode g(4) of the untruncated p.d.f. may be a reasonable estimate. Any reasonable estimate may be shown to be consistent; (T) using (T) as given by (5. 58) in (5. 60) shows that the a posteriori p.d.f. on q1 is Gaussian with mean and variance prior to truncation of (0 2 ^2 2 ~2(0) + a2T + yT 2T 2- T 2L1( + 2J yt2 dt T = a 1 + JYt dt (5. 65) Let T- oo and recall that

170 1.i.-1 2 l.i.m. T Jyt dt = R (0) T-oo 0 2 (5.66) a 2qq where ql* is the "true" value of the parameter for the process generating the observation; clearly T- q; T - 0~ as T - T I O T and the a posteriori p.d.f. tends to a "delta function" at ql* If desired, the conditioning on y0 can be undone by any of the methods mentioned in Section 4.2.3. 5.4.2 Detection. It is again clear that under H! all the above results hold; the parameter of the natural conjugate p.d.f. (0) is then y, its a priori value may be different from (), and it is updated through (5.67) T 2 J[t - s(t)] dt (T) (0) + a [ Y0- (0)] [YT- s(T)] The R-N derivatives of the observation may be written as in (5. 53) or (5.56), with yt replaced by yt - s(t).

171 The detection statistic is given by (3.37) or (3.38); if one wishes to detect sequentially and use the natural conjugate class, the conditioning on y0 must be accounted for since the simplehypothesis statistic f(Yt Iq1) given in (1.25) was not conditioned on y. One way to do this is, as discussed in Section 4.3.2 (a), to choose an a priori (to y0 ) p.d.f. from the class Po(ql;-) = Ko(- )ql exp - t - ql2] (5.68) in which case the a posteriori (to the discrete observation y0 ) p.d.f. will be natural conjugate and hence the a posteriori (to yt, t[ 0, T]) p.d.f. is also natural conjugate but is no longer conditioned on y0;its parameter is found using (5. 58) or (5. 67). Appendix E gives the calculations involved in simplifying the detection statistic; it is shown that the parameter ql does indeed cancel, and one finds KO(-) K(T)) -2 T QP(Yt) - ( - exp -a [J s (t)ytdt K W) K()y 0 (5.69) T- s'( + j [s'(t)] dt- s'(0)y0 - s'(T)yT 0

172 where K(') is the normalizing constant in (5.57) and K0(') normalizes (5.68). The second ratio of constants involves t and thus represents processing of the observation. Eq. (E.26) explicitly shows the resulting terms; even for the case M=1, the processing necessary to obtain the optimal detection statistic is extremely complicated. If one does not want to use the a priori p.d.f. of (5.68), then (5.46) may be used directly; equivalently, the natural conjugate densities may be employed and the results modified as in Section 3.2.2. In either case, the results are similar to the above. 5.5 Estimation and Detection: 2-SAG Noise The case M = 2, though not of great interest for its own sake, is more indicative of the general case than M = 1. The various parameterizations of the spectral density (see (5.2), (5.4), and (5. 7) ) are related by = P1 + P2 (5.70) 2 = P P2 and.= qi/lal; i=1, 2 (5.71) SO = 1/ al

173 The autocorrelation function is written in (A.22). As usual, let JO represent the measure induced by the process regardless of which parametrization is being employed. It is by now clear that instead of evaluating d i /d 60 as 0 0 the limit of a likelihood ratio, it is more convenient to employ d 0/d+ /d2 directly.() From (5.27) and (5.33), the desired d s/di~tzo O 0 2 00 R-N derivatives (densities of the sufficient statistics) are, in the usual notation %1^2 g 1 " T2 f(y Iq) =.. exp qT [2J d. t - Yt exp 2 2 q2 Ytdt a" 2a 0O T 2 2 2 2 + (q2 2q2)J (t) dt + q q2 ( + 0) + q (y + y 2) (5.72) and qlT 1 2 2 f(yIYt 0 q) = 2 exp 2 - q2 Jytdt 2a 2 0 2 2 2 2 + (ql - 2q2) J(y) dt + ql q2(Y -y0 ) + ql(YT' 2 Y ) (5.73) NThere is some ambiguity here since the first case t6 represents any arbitrary Gaussian measure equivalent to A, wxhile in the second case 0 refers to the number tall. This wqll not present cause for further confusion.

174 The latter equation can be re-written, by inspection or using (5.36)- (5.40), as (5.74) f(Yt ly0 q) = 27 exp q- [q* t)q +q *t(yt)] 2a - -) where q (qi' q2)* (5.75) Yo = (Yo, y* (5.76) T 2 2 j(y2 dt yT - 0 y 2 2 T(Yt) = (5.77) 2 2 Y Y T 0fy2 dt,2,2 2 - aT YT o0 t(y,)= - T J 2 d (5.78) 0 The natural conjugate class of densities is p(q; \) = K() exp- ~!2 [q* q +q* q j 2a qeO (5.79) where 0 is the set on which the roots of

175 z + qlz + q2 = 0 have negative real parts; this is illustrated in Fig. 5.1. K(/) involves integrating over O and will not be evaluated. From (5.77) and (5.78), it is clear that for a minimal parameterization one must put 1 43 =L =,:' =, _ 1: (5.80) which contains only four independent parameters; or, as commented in Theorem 3.1, one can let the vector, of conjugate parameters be arbitrary and still remain within the class as observations are processed. In either case, the parameters are updated as in (5.41). F q2 2~~~~~/ 4q2 = ql Fg5.Pa trpeo 2/A o Fig. 5.1. Parameter Space for 2-SAG Noise

176 By completing the square, the natural conjugate density (5. 79) can be re-written p(q; ) = K(4) exp. L* It 2a exp 2 exp ~ ~-2 (q - -)* ~ (q - i) 2a, qeO (5.81) where IL = -~ I (5.82) is the mode of the density provided that /ce 0. This may provide a reasonable estimate of q, (T) =_ a I(T)-,(T) if 0 (5.83) IMAPi~l- if AMAPEO (5.83)'MAP - - A and is analogous to (5.65). The detection problem is solved in complete analogy with the work of Section 5.4.2. Once again, the required cancellation with the simple-hypothesis solution as given by (A. 51) [ recall that z(y) = &n &(ytlq)] occurs. If fl(q) and f0(q) are arbitrary a priori densities under H1 and H, then using (5.46) one finds that

177 (Yt) = K1(Yt) exp S(yt) (5.84) where K.(yt), i =, 1, are normalizing constants for the a posteriori p.d.f., and where T T -2 (3) 2 S(Yt) = - a J t s (t)dt + 2 J[ s"(t)] dt + s"(0) - y~ s"(T) (5.85) If one chooses the a priori p.d.f. from the class po(q_; ) = K0( -) 1 exp - 12q* q + *q] q 2 2a.... (5.86) and treats the discrete observation y0 as usual to obtain a natural conjugate situation which is not conditioned on y0 then the detection statistic is K0(y) K(4 (T)) fy 0) K-~ ) -- y exp (S(y)) (5.87) - ( ) K0 ) K(- ) In both (5. 84) and (5.87) the ratios of normalizing constants represent complicated signal processing, so that the expressions are

178 not as simple as they first appear (recall (E.26) for the simpler case M = 1 ).

CHAPTER VI SUMMARY AND CONCLUSIONS 6. 1 Narrative Summary( 6. 1. 1 Problem Statement: General Solution. The problem statement postulates two mutually exclusive and exhaustive hypotheses (#1. 2), under each of which a statistical description of the observation is known except for a finite-dimensional parameter. These parameters, which index families of probability distributions on the observation space, are considered random variables; they may or may not have common components. If the hypothesis is specified, one is left with the problem of estimating the corresponding parameter based upon the observation(s) (#1. 2. 2). For the present purpose, this estimation problem is considered solved when the a posteriori p. d. f. based on Bayes' rule (1. 27) is known. Suppose neither hypothesis is specified and one is interested in deciding which one is true. For a large class of problems (#1. 3. 3) the likelihood ratio of marginal observation p. d. f.'s (1. 36) is an optimal detection statistic. Using ()Throughout this chapter, references to section numbers are placed in parentheses and preceded by #, and references to equations are merely placed in parentheses. 179

180 Bayes' rule, the marginal density for either hypothesis can be found as the product of the ratio of a priori to a posteriori parameter p. d. f. and the conditional observation p. d. f. (1. 29). The expression can be evaluated at arbitrary, fixed values of the parameter. Thus, the marginal likelihood ratio is given by a similar expression (1. 38) which also does not depend on the parameters, but which involves the related simple-hypothesis likelihood ratio. One finds that the solutions to the two related estimation problems (one under each hypothesis) serve quite naturally to solve the detection problem, provided that the simple-hypothesis result is known. With no further assumptions, all the conditional observation densities and a riori and a posteriori parameter densities must be evaluated, stored, and manipulated as functions (or as discrete approximations to functions); furthermore, all the observations must be saved, This clearly makes the procedure intractable, especially for recursive processing. It becomes tractable, however, if the observation distributions admit finite and fixed-dimensional sufficient statistics for the unknown parameters. Before pursuing the subject, Chapter II was used to investigate the concept of sufficient statistics. 6.1. 2 Necessary and Sufficient Statistics. Heuristically, the concept is best understood by considering sets of observations such that knowing the set into which an observation falls is equivalent

181 (for the purpose of finding the a posteriori p. d. f. on the parameters) to knowing the observation itself. A class of such sets is called sufficient for estimating the parameter; the coarsest such class is called necessary and sufficient. A mapping constant on the sets of such a class is called a sufficient or a necessary and sufficient statistic. This approach serves well in the trivial case of finite spaces (#2. 1. 1) and in the difficult case of infinite-dimensional probability spaces (Appendix B). For the intermediate case considered by classical probability and statistics, the concept is formulated differently but equivalently. If the observation space is finite dimensional and the conditional probability distribution of the observation admits a density w. r. t. Lebesgue measure ( as will be assumed), then a sufficient statistic can (if it exists) be found by factoring that density (2. 10). Equivalently, one can factor the ratio of that density to another probability density defined on the same space; if this second density happens to be a member of the family of conditional p. d. f.'s being investigated, then that ratio is called the likelihood ratio function (2. 12) and its factorization always yields a necessary and sufficient statistic (Theorem 2. 1). Bayes' rule is unchanged if one replaces the conditional observation density with the likelihood ratio function (1. 30).

To establish results needed in Chapters IV and V, the existence of sufficient statistics when sequential observations possess an Mth-order Markov dependence was studied in detail (#2.3). If all p.d.f.'s are conditioned on M initial samples y0, then the joint conditional p.d.f. of k samples is a k-fold product of transition densities (1.5) and is very similar in form to that of independent samples (1.7); results for the latter case are readily available in the literature. The similarity was pursued. Under suitable restrictions (2.24 ff.), only those processes possessing exponential-class transition densities (2.33) admit sufficient statistics of fixed dimension; these statistics may be found by inspection of the transition density and can be updated recursively based on the M-dimensional observation "state vector." It must be borne in mind that all the results are conditioned on y0 and that this must ultimately be undone or justified. If sufficient statistics exist, the observations themselves need not be saved and the memory necessary for storage of the observation has fixed size regardless of the number of samples processed.

Statistics which are sufficient for estimation may or may not be sufficient for detection; if a simple-hypothesis solution is known and one can use the procedure of the preceding section, then they are. If not, the conditional p.d.f. of the observation

must be more carefully investigated.

6.1.3 Continuous Observations; General Solution and Sufficient Statistics. This problem was approached by sampling the observation, using the preceding results, and then letting the samples grow dense (#1.1); the form of the results is quite similar to the discrete case. Suppose that the hypothesis is specified and the corresponding family of measures on the observation space (indexed by the unknown parameter) is dominated. The R-N derivative with respect to the dominating measure, which is a function of the parameter, may be employed as the "conditional observation density" in a generalization of Bayes' rule (2.23). Under suitable conditions, this derivative is the limit of the likelihood ratio function of the samples (2.20). If the dominating measure belongs to the family being studied, the R-N derivative may be factored as usual to find a necessary and sufficient statistic (Theorem 2.3); if not, then only sufficiency can be guaranteed. If the a posteriori p.d.f.'s under either hypothesis can be found as above and the continuous simple-hypothesis result is known, then the detection statistic can also be found by a procedure analogous to the discrete result. In any case, one must be extremely careful to investigate absolute continuity of the measures involved. The mathematical state of the art largely dictates that the continuous results are

practically useful only for Gaussian problems.

6.1.4 Reproducing Densities (#3.1). Existence of a sufficient statistic implies that there exists a "natural conjugate" class of p.d.f.'s on the unknown parameter which reproduces in the sense of Def. 3.1. This class is indexed by a parameter (called the conjugate parameter) whose dimension is the same as that of the sufficient statistic, and the functional form of its members can be deduced by inspection of the conditional observation density (3.8). Suppose one chooses a natural conjugate a priori p.d.f.; it is then possible to determine explicit relations to "update" the conjugate parameter based upon the sufficient statistics of the observation such that the "updated" parameter indexes the a posteriori p.d.f. Explicit consideration of Bayes' rule becomes unnecessary. Further, the dimensionality of the p.d.f.'s on the unknown parameter becomes finite since they are indexed by the conjugate parameter. Since knowledge of the a posteriori conjugate parameter solves the estimation problem, that problem is therefore collapsed to a very tractable procedure. The detection problem may or may not behave similarly, depending on the tractability of the simple-hypothesis solution (#3.2.1). In many problems, the results found as above are inherently sequential and easily lend themselves to recursive processing of the observation, [(3.12) - (3.18)].
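The mechanics of conjugate-parameter updating can be seen in the simplest exponential-class example: independent Gaussian samples with unknown mean and known variance, for which the natural conjugate family is itself Gaussian. The Python sketch below is an illustrative assumption, not the M-SAG construction of Chapters IV and V; it shows the a posteriori density being tracked by pure arithmetic on a two-dimensional conjugate parameter, with no functional form ever stored.

import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0          # known observation standard deviation (assumed)
theta_true = 0.6     # assumed true mean, to be learned

# Natural conjugate family for the mean: N(m, v). The pair (m, v) is the
# conjugate parameter; each observation updates it by a fixed rule.
m, v = 0.0, 4.0      # a priori conjugate parameter (assumed)

for _ in range(200):
    y = theta_true + sigma * rng.standard_normal()
    # recursive update: no explicit Bayes' rule, only arithmetic on (m, v)
    v_new = 1.0 / (1.0 / v + 1.0 / sigma**2)
    m = v_new * (m / v + y / sigma**2)
    v = v_new

print(m, v)          # a posteriori conjugate parameter; m approaches theta_true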

Suppose the a priori p.d.f. is not itself natural conjugate but may be written as the product of a natural conjugate density and a known function of the parameter (3.19); i.e., it represents a measure absolutely continuous w.r.t. the conjugate class. One may then assume that the a priori p.d.f. is natural conjugate, proceed as above, and forestall modification of the results to account for the actual a priori density until the very end of the procedure. The necessary modification is given in (3.23) and (3.24), and may be considered to take place in a "secondary" processor (Fig. 3.4); it can be quite tractable if the integrals involved are available in closed form. Again, all the results extend in an obvious way to solve the detection problem; the resultant receivers are derived in #3.2. The partitioning into "primary" and "secondary" processors illustrates that much of the receiver structure is independent of the a priori densities. The advantages of this approach to the problem are discussed in #3.4.1.

All these results can be applied to continuous observations if, instead of considering the conditional observation density in finding sufficient statistics and the natural conjugate class, one uses the R-N derivative of the observation measures with respect to a dominating probability measure. This derivative may be found as the limit of a likelihood-ratio function, or any such derivative

available in the literature may be employed.

6.1.5 Discrete M-SAG Noise. To illustrate application of the discrete results, the problem of detecting an exactly-known signal in (and simultaneously estimating the parameters of) discrete M-SAG noise was solved. The noise is considered to be the stationary solution to an Mth-order autoregressive difference equation driven by a white Gaussian random sequence [(4.2), (4.6)]. All coefficients of the equation, including the intensity of the driving sequence, are treated as unknown parameters. The joint p.d.f. of k sequential samples (conditioned on the parameters and M "initial samples" y0) is easily written (4.17). By inspection, the sufficient statistics are found to be vectors and matrices whose elements are the sample auto- and cross-correlations up to order M, [(4.18) - (4.20)]. The natural conjugate density is seen to be a composite p.d.f. which is an M-variate Gaussian on the parameters of the autoregression and a Gamma density on the intensity of the driving sequence, (4.23). The normalizing constant and moments of this p.d.f. are extremely difficult to compute because the Gaussian portion is truncated to be zero on that region of R^M which would yield an unstable difference equation, (4.3); no attempt was made to compute moments or thoroughly analyze the conjugate class. The conjugate parameter updating relations take a very simple additive form [(4.26) - (4.29)],

however, and a modal (MAP) estimate is easily written (4.30) and is shown to be consistent. Solution of the estimation problem is similar under either hypothesis, since the uncertain parameters are the same and the signal is simply additive. The solution given is for H0, and may be used for H1 if the known signal is first subtracted from the observations [(4.38) - (4.43)]. Several techniques for eliminating the conditioning on y0 were discussed in detail (#4.3.2).

Since in the discrete problem there is no advantage to explicitly retaining the simple-hypothesis solution as part of the overall detection statistic, that statistic was simplified. It turned out to consist merely of the product of ratios of a priori to a posteriori natural conjugate normalizing constants (4.44). As stated above, these were not evaluated further. The case M = 1 was done in more detail (#4.4); here, it was possible to explicitly find the normalizing constant and write the detection statistic. These were, however, found to be extremely complicated expressions [(4.43), (4.55), (4.44)].

6.1.6 Continuous M-SAG Noise. This problem is essentially the same as the preceding one except that the noise is a continuous-parameter process. It may be modeled as the solution to a stochastic differential equation, in which case its parameters are the coefficients of the equation. Alternately, it may be

modeled as the output of a linear, rational transfer-function filter [(5.2) - (5.4)] which is driven by white Gaussian noise, in which case its parameters are the coefficients of the denominator polynomial of the transfer function when the leading coefficient is unity (5.5). To obtain the required continuity of measures, it is necessary that the parameter which represents the high-frequency asymptote of the filter transfer function be known (#5.3.1). Rather than finding a limit for the likelihood ratio function, an R-N derivative found by Hajek [60] was employed, (5.27). The sufficient statistics were found to be the quadratic content of the observation and its first M - 1 derivatives, as well as the value of these quantities at the endpoints of the observation interval (5.34). By "conditioning" the R-N derivative on y0 (which now consists of the observation and its first M - 1 derivatives at t = 0), the R-N derivative was put into a form suitable for sequential processing (5.33); see Section 6.1.2 above. The natural conjugate density for this form was a truncated M-variate Gaussian p.d.f. [(5.40), (5.41)]; the region on which it is truncated is again very complicated (5.5), and the normalizing constant was not found for arbitrary M. Estimation under either hypothesis is accomplished using similar techniques, exactly as discussed in the preceding section.

The detection statistic was written in general form (5.46), but was not simplified because the normalizing constants were not

available. The simple-hypothesis term was explicitly retained since that result is available from classical detection theory. The cases M = 1 and M = 2 were again done in more detail. MAP parameter estimates were given [(5.65), (5.83)] and shown to be consistent. For M = 1, the limit of the likelihood ratio function was evaluated (Appendix E) and the result shown to be the same as Hajek's R-N derivative. The required cancellations with the classical simple-hypothesis result were shown to occur [Appendix E, (5.69), and (5.84)] and the resulting detection statistic, though optimal, was found to be an extremely complicated function of the observation; (5.69), (5.84), and (E.26).

6.2 Contributions of this Work: Discussion

Speaking very generally, the main contributions of this work lie in the notational unification of known results of mathematical probability and statistics and in their application to the detection and estimation problem. This was accomplished both by recasting the statistical results in the language of communications theory and by generalizing the communications problem enough so that its relation to the statistical theory became clear. It is significant that the results include the case of continuous observations, and especially that this is possible without the use of the stochastic calculus which has recently become popular in communication and

control problems. All derivatives and integrals of the observation in Chapter V, for example, are ordinary derivatives or integrals of sample functions.

No novelty is claimed for the statistical results concerning sufficient statistics or natural conjugate reproducing densities in Chapters II and III; to the Bayesian statistician concerned with time series analysis or decision theory, these would appear somewhat less than startling. The only thing which is possibly unique about the material is the unified presentation which makes it clear that sufficient statistics, natural conjugate densities, and Bayes' rule all arise from and involve the same quantity. Classically, this quantity is considered to be the joint conditional p.d.f. of k observations (preferably written as a k-fold product of densities (1.7), (1.5) to make sequential processing tractable). Here, it is noted that the same results are obtained if that quantity is the likelihood ratio function, also preferably written as a product. Its use has the added advantage that the results concerning sufficient statistics are more immediately apparent, and that all results extend readily to the (infinite-dimensional) case of continuous observations. The concept of employing a Bayesian approach and using natural conjugate densities to process continuous observations on sequential finite intervals is, to the author's knowledge, original.

Further, the application of all these results to the simultaneous estimation and detection problem is believed original. Occasionally, a previous work has touched on these concepts in a specific example (#3.4.2); however, the general application of the theory and the resulting parametrizations and partitioning of the problem are unique to this work and, in some degree, to its predecessors [6] and [7]. Chapters IV and V illustrated the theory by treating the detection of a known signal in, and the simultaneous estimation of the parameters of, Mth-order stationary autoregressive Gaussian (Gauss-Markov) noise. Both the discrete and continuous versions of this problem were heretofore considered unsolved in communications theory. To be sure, the discrete estimation problem of Chapter IV bears a close resemblance to the well-known statistical problem of maximum-likelihood estimation of the parameters of an autoregressive stationary time series. This is natural if one recalls the close relation between maximum likelihood and Bayesian methods (see the remark following (1.34)). The explicit use of the two estimation results to solve the detection problem (#1.3.3), (#3.2) is believed original, as are most of the corresponding techniques for continuous observations as presented in (#3.3) and Chapter V.
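That resemblance to autoregressive maximum-likelihood estimation can be made concrete with a small simulation: generate an Mth-order autoregression, accumulate the fixed-dimensional sample auto- and cross-correlation statistics of #6.1.5, and form the modal estimate. The Python sketch below is an illustration under assumed values and is not the thesis's equations (4.18) - (4.30); in particular, a flat prior is used and the stability truncation is ignored, so the MAP estimate reduces to conditional least squares.

import numpy as np

rng = np.random.default_rng(1)
M = 2
a_true = np.array([1.2, -0.5])       # a stable AR(2); assumed for illustration
sigma_true = 0.7

# Simulate k samples, conditioning on M initial samples y0 = (0, 0)
k = 4000
y = np.zeros(k + M)
for t in range(M, k + M):
    past = y[t - M:t][::-1]          # the state vector (y_{t-1}, ..., y_{t-M})
    y[t] = a_true @ past + sigma_true * rng.standard_normal()

# Fixed-dimensional sufficient statistics: sample auto- and cross-correlations
Phi = np.zeros((M, M))               # sum of past * past^T
r = np.zeros(M)                      # sum of y_t * past
s2 = 0.0                             # sum of y_t^2
for t in range(M, k + M):
    past = y[t - M:t][::-1]
    Phi += np.outer(past, past)
    r += y[t] * past
    s2 += y[t] ** 2

# Under a flat prior the modal estimate is the conditional least-squares solution
a_map = np.linalg.solve(Phi, r)
sigma2_map = (s2 - a_map @ r) / k
print(a_map, sigma2_map)             # approach a_true and sigma_true**2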

6.3 Areas for Future Research

As is often the case, the work presented has raised at least as many interesting questions as it answered. Some of these are listed and discussed below:

a. An obvious omission has been the lack of consideration or evaluation of the performance of the detectors derived here. A study of the subject might include the usual approximations to the normal receiver operating characteristic (ROC) curves and detectability index d, especially for small observation times (when the a priori parameters are very significant) and for large observations (when they may be neglected); a small numerical sketch of the normal-ROC calculation is given after this list.

b. As observations are processed, the conjugate parameters trace some type of curve in the spaces J in which the sufficient statistics take values. As the parameters are thus learned, these trajectories presumably tend to some subspace or point which represents complete knowledge of the parameter (i.e., which corresponds to the a posteriori p.d.f. being a delta function). The concept is intuitively appealing; perhaps a metric on J can be defined in such a way that the "distance" from the target subspaces is easily evaluated and gives an indication of estimator and detector performance.

c. A common artifice in communications theory is to consider that a small amount of "white Gaussian noise" is added to

the "colored" noise being considered.(1) This eliminates some singularities and simplifies the solution of many problems. An interesting exercise might be to attempt an analog to Chapter V using this artifice.

d. In Chapters IV and V, the order M of the noise must be known. It is of much current interest to estimate M based on the observations. Very few sound results are available in this area.

e. Because of the difficulty of evaluating moments for, or even normalizing, the natural conjugate densities in Chapters IV and V, those results cannot be considered practical. It would be of interest to carry the solutions further, perhaps arriving at practically usable approximations. Also, one might be able to determine reproducing classes in the sense of #3.1.2 which are more tractable. A natural departure point would be the 1-SAG noise solution, which is at least given in closed form.

f. As samples grow dense, estimation of the parameter a^2 is singular in Chapter V. If the discrete results of Chapter IV are examined, one finds that under the same conditions the conjugate parameter q_c (also called /4 in #4.4) tends to infinity.

(1)See, e.g., Van Trees [61], p. 288.

It seems clear that this has the effect of making the (discrete-case) natural conjugate density singular in the subspace which corresponds to the parameter a^2 of the continuous process. The subject, especially the rate at which singular convergence occurs as samples grow dense, bears further investigation.

g. Suppose that the discrete solution of Chapter IV is given the same information as the continuous solution of Chapter V, i.e., the values of the discrete parameters which correspond to a^2 (these can be related; see (5.10) ff., and (E.5) for M = 1) are fixed. It would be of interest to compare the resulting discrete solution with a corresponding finite approximation to the continuous solution.
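Item a above can be made concrete in the simplest setting, a known signal in white Gaussian noise, where the log-likelihood ratio is Gaussian under both hypotheses and the normal ROC is exact rather than an approximation. The following Python sketch (all values assumed, and not part of this work's results) checks the normal-ROC formula P_D = \Phi(\sqrt{d} - \Phi^{-1}(1 - P_F)) by Monte Carlo.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, sigma = 64, 1.0
s = 0.3 * np.sin(2 * np.pi * 5 * np.arange(n) / n)   # hypothetical known signal
d = s @ s / sigma**2                                 # detectability index

trials = 20000
noise = sigma * rng.standard_normal((trials, n))
z0 = noise @ s / sigma**2 - d / 2                    # log-likelihood ratio under H0
z1 = (noise + s) @ s / sigma**2 - d / 2              # under H1

for pf in (0.01, 0.1):
    thresh = np.quantile(z0, 1 - pf)                 # empirical threshold at P_F
    pd_mc = (z1 > thresh).mean()                     # Monte-Carlo P_D
    pd_th = norm.cdf(np.sqrt(d) - norm.ppf(1 - pf))  # normal-ROC prediction
    print(pf, pd_mc, pd_th)                          # the last two columns agree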

APPENDIX A
THE METZGER MODEL

Consider the detection problem

H_0: y_t = n_t
H_1: y_t = s(t) + n_t,  t \in [0, T]

where n_t is zero-mean, stationary Gaussian noise with known rational spectral density N(f^2); the numerator and denominator of N(f^2) are polynomials in f^2, and the order of the denominator polynomial exceeds that of the numerator by p \ge 1. Let the autocorrelation function of n_t be

R_n(\tau) = \int_{-\infty}^{\infty} N(f^2)\, e^{j 2\pi f \tau}\, df   (A.1)

Suppose s(t) is an exactly-known signal which is 2p times continuously differentiable for 0 < t < T, so that the first (2p - 1) derivatives are continuous from inside the interval at t = 0 and t = T.

The classical solution is as follows [9], [16], [6]: Let z(y) denote the natural logarithm of the likelihood ratio, and d the "detectability index," a performance (probability of detection) measure which, in the Gaussian case, is given by

d = E_1[z(y)] - E_0[z(y)]   (A.2)

where E_i[\cdot] is the expectation under H_i, i = 0, 1. Then z(y) is an optimum decision axis, and

z(y) = \int_0^T y_t\, s_2(t)\, dt - \frac{d}{2}   (A.3)

d = \int_0^T s_1^2(t)\, dt   (A.4)

The functions s_1(\cdot) and s_2(\cdot) are defined as follows: Let \{\lambda_k, \phi_k(\cdot)\} be the eigenvalue-eigenfunction pairs of the L_2 kernel R_n(\cdot); \{\phi_k(\cdot)\} are a complete orthonormal set of functions in L_2. Expand s(t) and y_t in terms of these functions so that, e.g.,

s_k = \int_0^T s(t)\, \phi_k(t)\, dt

Assume that

\sum_{k=1}^{\infty} \frac{s_k^2}{\lambda_k} < \infty

Then

s_1(t) = \sum_{i=1}^{\infty} \frac{s_i}{\sqrt{\lambda_i}}\, \phi_i(t)

s_2(t) = \sum_{i=1}^{\infty} \frac{s_i}{\lambda_i}\, \phi_i(t)   (A.5)

and s_2(t) is the solution to a Fredholm integral equation of the first kind,

\int_0^T s_2(u)\, R_n(t - u)\, du = s(t)   (A.6)

for 0 \le t \le T.

A useful mnemonic device which simultaneously models the "generation" of the observed signal and permits evaluation of the function s_2(\cdot) and the quadratic content of s_1(\cdot) is the Metzger model,(1) illustrated in Fig. A.1. The following discussion refers to that figure. First, normalize the autocorrelation so that

R_n(\tau) = \frac{N_0}{2}\, \bar{R}_n(\tau)   (A.7)

where \bar{R}_n(0) = 1. Hence,

\bar{N}(f^2) = \int_{-\infty}^{\infty} \bar{R}_n(\tau)\, e^{-j 2\pi f \tau}\, d\tau   (A.8)

Factor N(f^2) as follows:

N(f^2) = \frac{N_0}{2}\, H(j 2\pi f)\, H^*(j 2\pi f)   (A.9)

(1)Based on a personal correspondence from K. Metzger, [38].

[Fig. A.1. The Metzger Model. Block diagram: white Gaussian noise of density N_0/2 drives Filter #2 (causal, H(j2\pi f), impulse response h(t)), whose output is n_t. The signal c_2(t), with c_2(t) = 0 for t \notin [0, T], drives Filter #1 (anti-causal, H^*(j2\pi f), impulse response h(-t)), producing c_1(t); the signal path then passes through a time gate to [0, T] and sums with the noise to form s(t) + n_t under H_1.]

where H(\cdot) is a rational, causal transfer function and * denotes complex conjugate; hence H^*(\cdot) represents an anti-causal filter. Let the inverse Fourier transforms of H(\cdot) and H^*(\cdot) be h(t) and h(-t) respectively. Obviously, the output of Filter #2 is n_t when that filter is driven by white Gaussian noise of density N_0/2.

The signal c_2(t) is proportional to s_2(t), as follows. Let C_2(f) be the Fourier transform of c_2(t) and \tilde{S}(f) the transform of \tilde{s}(t), the signal component of y. From Fig. A.1,

\tilde{S}(f) = H(j 2\pi f)\, H^*(j 2\pi f)\, C_2(f)

and thus, using the convolution theorem,

\tilde{s}(t) = \frac{2}{N_0} \int_{-\infty}^{\infty} R_n(t - \lambda)\, c_2(\lambda)\, d\lambda,  -\infty < t < \infty

Since c_2(t) is defined to be zero outside the interval [0, T] and s(t) is \tilde{s}(t) truncated, this implies

s(t) = \frac{2}{N_0} \int_0^T R_n(t - \lambda)\, c_2(\lambda)\, d\lambda,  0 \le t \le T   (A.10)

Comparison with (A.6) shows that

s_2(t) = \frac{2}{N_0}\, c_2(t)   (A.11)

The detectability, as defined in (A.2) and given by (A.4), is proportional to the quadratic content of c_1(t),

d = \frac{2}{N_0} \int_{-\infty}^{T} c_1^2(t)\, dt   (A.12)

This is proved as follows. First note from (A.1) and (A.9) that

R_n(t - \mu) = \frac{N_0}{2}\, h(t - \mu) \circledast h(\mu - t) = \frac{N_0}{2} \int_{-\infty}^{\infty} h(\mu - \lambda)\, h(t - \lambda)\, d\lambda   (A.13)

where \circledast denotes convolution. Now put (A.4) into (A.3) and take its expectation under both hypotheses:

E_0[z(y)] = -\frac{1}{2} \int_0^T s_1^2(t)\, dt

E_1[z(y)] = \int_0^T s(t)\, s_2(t)\, dt - \frac{1}{2} \int_0^T s_1^2(t)\, dt

Hence, from (A.2), an alternate expression for d is

d = \int_0^T s(t)\, s_2(t)\, dt   (A.14)

= \int_{t=0}^{T} \int_{\mu=0}^{T} \frac{2}{N_0}\, R_n(t - \mu)\, c_2(\mu)\, s_2(t)\, d\mu\, dt

= \int_{t=0}^{T} \int_{\mu=0}^{T} \int_{\lambda=-\infty}^{\infty} h(\mu - \lambda)\, h(t - \lambda)\, c_2(\mu)\, s_2(t)\, d\lambda\, d\mu\, dt

where the second equality follows from (A.6) and (A.11), the third from (A.13). Recombining, changing the order of integration, and using (A.11) yields

d = \frac{2}{N_0} \int_{-\infty}^{\infty} \left[ \int_0^T c_2(t)\, h(t - \lambda)\, dt \right] \left[ \int_0^T c_2(\mu)\, h(\mu - \lambda)\, d\mu \right] d\lambda

Now c_2(t) = 0 for t \notin [0, T], so the inner integrals are convolutions and, by inspection of Fig. A.1, are precisely the output of Filter #1:

d = \frac{2}{N_0} \int_{-\infty}^{\infty} c_1^2(\lambda)\, d\lambda

But h(-t) = 0 for t > 0, so c_1(t) = 0 for t > T, and thus (A.12) is proven. In terms of the Metzger Model signals, (A.3) and (A.4) can be written as

z(y) = \frac{2}{N_0} \int_0^T y_t\, c_2(t)\, dt - \frac{1}{N_0} \int_{-\infty}^{T} c_1^2(t)\, dt   (A.15)

The Metzger Model provides no new theoretical results; rather, it permits solution of the integral equation (A.6) using techniques familiar to the engineer and thus provides much more

insight into the problem than would a straightforward solution of that equation. As a side benefit it provides an alternate method (Eq. (A.12)) of evaluating the detectability index.

To illustrate its application, the model will be used to solve the problem of detecting a known signal in Mth-order stationary autoregressive Gaussian (M-SAG) noise (see Section 5.1), i.e., in zero-mean Gaussian noise with spectral density

N(f^2) = \frac{a^2}{\prod_{i=1}^{M} \left[ (2\pi f)^2 + p_i^2 \right]},  Re(p_i) > 0   (A.16)

The solution will be derived in detail for M = 2, and will be stated for M = 1 and for arbitrary M.

Consider Fig. A.1; recall that s(t) is assumed 2M-times differentiable in (0, T) so that s^{(2M)}(0) and s^{(2M)}(T) exist as limits from inside the interval. Thus, s(t) and its first (2M - 1) derivatives must be continuous from inside the interval at the endpoints t = 0 and t = T. Now c_2(t) is zero outside of [0, T]; it is easily verified that the most general form for c_2(t) is(1),(2)

(1)This is, of course, the same as the classical result of Zadeh and Ragazzini [64]. In fact, the Metzger Model is merely a system which models the differential equation and boundary conditions associated with (A.6).

(2)If the numerator polynomial is nontrivial, c_2(t) must also contain exponential terms in t which are related to the zeros

c_2(t) = g_2(t)\left[ u(t) - u(t - T) \right] + \sum_{i=0}^{M-1} (-1)^i \alpha_i\, \delta^{(i)}(t) + \sum_{j=0}^{M-1} (-1)^j \beta_j\, \delta^{(j)}(t - T)   (A.17)

where g_2(t) is piecewise continuous, u(t) is the unit step function, and \delta^{(i)}(t) is the i-th derivative of the Dirac delta function. Recall

\int f(t)\, \delta^{(i)}(t)\, dt = (-1)^i f^{(i)}(0)   (A.18)

This in turn implies that c_1(t) contains no singularities but may be discontinuous at t = 0 and t = T, and that s(t) and its first (2M - 1) derivatives are not only continuous from inside [0, T] but are continuous.

To use the Metzger Model, one assumes a c_2(t) as in (A.17) and solves the model in the "forward" direction for s(t). This is somewhat tedious but the procedure is straightforward and will be illustrated in the example which follows. Since the first (2M - 1) derivatives of s(t) are continuous at t = 0 and t = T, and since they are known from inside the interval, "matching up" the boundaries yields a set of equations which may be

of the numerator; these represent the homogeneous solution to the associated differential equation. See Helstrom, [26] p. 441 and [25].

solved for the constants \alpha_i and \beta_i; the function g_2(t) is just the "infinite interval" solution and may be found using, for example, standard transform techniques. The procedure will now be illustrated for M = 2:

Example A.1

N(f^2) = \frac{a^2}{\left[ (2\pi f)^2 + p_1^2 \right] \left[ (2\pi f)^2 + p_2^2 \right]},  Re(p_1, p_2) > 0   (A.19)

= \frac{a^2}{\left| (j 2\pi f)^2 + q_1 (j 2\pi f) + q_2 \right|^2}   (A.20)

where

q_1 = p_2 + p_1
q_2 = p_2 p_1   (A.21)

The alternate parameters introduced in (A.20) will be convenient. From (A.19) the autocorrelation function is

R_n(\tau) = \frac{a^2}{2 p_1 p_2 (p_2^2 - p_1^2)} \left[ p_2\, e^{-p_1 |\tau|} - p_1\, e^{-p_2 |\tau|} \right]   (A.22)

Normalize this as in (A.7); define the constant

K^2 = 2 q_1 q_2 = 2 p_1 p_2 (p_1 + p_2)   (A.23)

Then one obtains the following quantities for the Metzger Model:

\frac{N_0}{2} = \frac{a^2}{K^2}   (A.24)

\bar{R}_n(\tau) = \frac{1}{p_2 - p_1} \left[ p_2\, e^{-p_1 |\tau|} - p_1\, e^{-p_2 |\tau|} \right]   (A.25)

H(j 2\pi f) = \frac{K}{(j 2\pi f + p_1)(j 2\pi f + p_2)}   (A.26)

= \frac{K}{(j 2\pi f)^2 + q_1 (j 2\pi f) + q_2}   (A.27)

and the impulse response is

h(t) = K\, \frac{e^{-p_1 t} - e^{-p_2 t}}{p_2 - p_1}\, u(t)   (A.28)

H^*(j 2\pi f) is found by inspection of (A.26) or (A.27); its impulse response is

h^*(t) = h(-t)   (A.29)
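The factorization (A.9) and the constants (A.23) and (A.24) are easy to verify numerically. The following Python sketch (pole values assumed for illustration) recovers the left-half-plane poles of the causal factor from the denominator polynomial of N(f^2) and checks that (N_0/2)|H(j2\pi f)|^2 reproduces the spectral density.

import numpy as np

a, p1, p2 = 1.0, 0.5, 2.0                 # assumed M = 2 example values

# Denominator of N(f^2) as a polynomial in w^2, w = 2*pi*f
den_in_w2 = np.polymul([1.0, p1**2], [1.0, p2**2])

# Substitute w^2 = -s^2: the roots of D(-s^2) come in +/- pairs;
# the left-half-plane roots are the poles of the causal factor H(s)
D_s = np.array([den_in_w2[0], 0.0, -den_in_w2[1], 0.0, den_in_w2[2]])
lhp = np.roots(D_s)
lhp = lhp[lhp.real < 0]                   # here: -p1 and -p2

K = np.sqrt(2 * p1 * p2 * (p1 + p2))      # gain from (A.23); N0/2 = a^2/K^2

# Numerical check of (A.9) on a frequency grid
f = np.linspace(-3, 3, 7)
w = 2 * np.pi * f
H = K / np.prod([1j * w - r for r in lhp], axis=0)
N_direct = a**2 / ((w**2 + p1**2) * (w**2 + p2**2))
N_fact = (a**2 / K**2) * np.abs(H)**2
print(np.allclose(N_direct, N_fact))      # True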

Since the truncation of the "Time Gate" does not affect the linearity of the system for 0 < t < T, it is easy to verify by use of Fourier transforms that, inside the interval, the signals of the Metzger Model are related by

c_1(t) = \frac{1}{K} \left[ s''(t) + q_1 s'(t) + q_2 s(t) \right]   (A.30)

c_2(t) = \frac{1}{K} \left[ c_1''(t) - q_1 c_1'(t) + q_2 c_1(t) \right]   (A.31)

= \frac{1}{K^2} \left[ s^{(4)}(t) + (2 q_2 - q_1^2)\, s''(t) + q_2^2\, s(t) \right]   (A.32)

and clearly \tilde{s}(t) = s(t), 0 < t < T. For clarity, the solution procedure will be divided into steps.

Step 1: Assume that c_2(t) is known:

c_2(t) = \frac{1}{K^2} \left\{ g_2(t) \left[ u(t) - u(t - T) \right] + \alpha_0\, \delta(t) - \alpha_1\, \delta'(t) + \beta_0\, \delta(t - T) - \beta_1\, \delta'(t - T) \right\}   (A.33)

Find c_1(t) by convolution,

c_1(t) = c_2(t) \circledast h(-t)

After some manipulation,

c_1(t) = \frac{1}{K^2} \left[ \int_{\max(0,t)}^{T} g_2(\tau)\, h(\tau - t)\, d\tau + \alpha_0\, h(-t) + \alpha_1\, h'(-t) + \beta_0\, h(T - t) + \beta_1\, h'(T - t) \right] u(T - t)   (A.34)

with h(\cdot) given by (A.28). It is easily verified that the discontinuities in (A.34) are

c_1(0^+) - c_1(0^-) = -K^{-1} \alpha_1
c_1(T^+) - c_1(T^-) = -K^{-1} \beta_1   (A.35)

and that

c_1'(0^+) - c_1'(0^-) = K^{-1} \left[ \alpha_0 - \alpha_1 (p_2 + p_1) \right]
c_1'(T^+) - c_1'(T^-) = K^{-1} \left[ \beta_0 - \beta_1 (p_2 + p_1) \right]   (A.36)

Step 2: The signal c_1(t) as found in (A.34) has the form

c_1(t) = \begin{cases} K^{-1}\, \dfrac{\gamma_1 e^{p_1 t} - \gamma_2 e^{p_2 t}}{p_2 - p_1}, & t < 0 \\[4pt] K^{-1}\, g_1(t), & 0 < t < T \\[4pt] 0, & t > T \end{cases}   (A.37)

Again using convolution,

\tilde{s}(t) = c_1(t) \circledast h(t)   (A.38)

one can find \tilde{s}(t). For t < 0, the result is

\tilde{s}(t) = \left[ 2 p_1 p_2 (p_2^2 - p_1^2) \right]^{-1} \left( p_2 \gamma_1 e^{p_1 t} - p_1 \gamma_2 e^{p_2 t} \right)   (A.39)

so that

\tilde{s}(0^-) = \frac{p_2 \gamma_1 - p_1 \gamma_2}{2 p_1 p_2 (p_2^2 - p_1^2)}   (A.40)

\tilde{s}'(0^-) = \frac{\gamma_1 - \gamma_2}{2 (p_2^2 - p_1^2)}   (A.41)

One can evaluate \tilde{s}(t) for t > 0 and verify that \tilde{s} and \tilde{s}' are continuous, but that serves no purpose here.

Step 3: Assume now that s(t) is given; using the derived relations, one can evaluate the constants \alpha_i and \beta_i of (A.33) in terms of the known s(t). Recall first that \tilde{s}(0^-) = s(0) and \tilde{s}'(0^-) = s'(0). Thus, from (A.40) and (A.41),

\gamma_1 = 2 p_1 p_2 (p_1 + p_2) \left[ s(0) - \frac{s'(0)}{p_2} \right]
\gamma_2 = 2 p_1 p_2 (p_1 + p_2) \left[ s(0) - \frac{s'(0)}{p_1} \right]   (A.42)

Finally, using this intermediate result and the differential equations (A.30) and (A.31) in (A.35) and (A.36), one obtains

\alpha_1 = -s''(0) + q_1 s'(0) - q_2 s(0)   (A.43)

\alpha_0 = s^{(3)}(0) + (q_2 - q_1^2)\, s'(0) + q_1 q_2\, s(0)   (A.44)

\beta_1 = s''(T) + q_1 s'(T) + q_2 s(T)   (A.45)

\beta_0 = -s^{(3)}(T) - (q_2 - q_1^2)\, s'(T) + q_1 q_2\, s(T)   (A.46)

Using (A.31), one thus finds that

c_2(t) = \frac{1}{K^2} \left\{ \left[ s^{(4)}(t) - (q_1^2 - 2 q_2)\, s''(t) + q_2^2\, s(t) \right] \left[ u(t) - u(t - T) \right] + \alpha_0\, \delta(t) - \alpha_1\, \delta'(t) + \beta_0\, \delta(t - T) - \beta_1\, \delta'(t - T) \right\}   (A.47)

Step 5: The detectability index d may be found using either (A.12),

d = \frac{2}{N_0} \int_{-\infty}^{T} c_1^2(t)\, dt   (A.48)

or (A.14),

d = \frac{2}{N_0} \int_0^T s(t)\, c_2(t)\, dt   (A.49)

The equality of these integrals provides an excellent check on the solutions obtained for the model. For this example, either method will (after some integration by parts and use of (A.24)) yield

d = \frac{1}{a^2} \left\{ \int_0^T \left[ (s''(t))^2 + (q_1^2 - 2 q_2)(s'(t))^2 + q_2^2 s^2(t) \right] dt + q_1 q_2 \left[ s^2(0) + s^2(T) \right] + q_1 \left[ (s'(0))^2 + (s'(T))^2 \right] + 2 q_2 \left[ s(T) s'(T) - s(0) s'(0) \right] \right\}   (A.50)

Step 6: Utilizing (A.15), all these results may be pieced back together to write the solution to the detection problem:

z(y) = \frac{1}{a^2} \left\{ \int_0^T y_t \left[ s^{(4)}(t) - (q_1^2 - 2 q_2)\, s''(t) + q_2^2\, s(t) \right] dt + y_0 \left[ s^{(3)}(0) + (q_2 - q_1^2)\, s'(0) + q_1 q_2\, s(0) \right] - y_T \left[ s^{(3)}(T) + (q_2 - q_1^2)\, s'(T) - q_1 q_2\, s(T) \right] - y_0' \left[ s''(0) - q_1 s'(0) + q_2 s(0) \right] + y_T' \left[ s''(T) + q_1 s'(T) + q_2 s(T) \right] \right\} - \frac{d}{2}   (A.51)

where d is given by (A.50).

Example A.2

Using the same technique as above one finds that if n_t is 1-SAG noise (also known as the Ornstein-Uhlenbeck process), with spectral density function

N(f^2) = \frac{a^2}{(2\pi f)^2 + p_1^2}   (A.52)

and autocorrelation

R_n(\tau) = \frac{a^2}{2 p_1} \exp(-p_1 |\tau|)   (A.53)

then

s_2(t) = \frac{2}{N_0}\, c_2(t) = \frac{1}{a^2} \left\{ -s''(t) + p_1^2\, s(t) + \left[ p_1 s(0) - s'(0) \right] \delta(t) + \left[ p_1 s(T) + s'(T) \right] \delta(t - T) \right\}   (A.54)

and

d = \frac{2}{N_0} \int_{-\infty}^{T} c_1^2(t)\, dt = \frac{1}{a^2} \left\{ \int_0^T \left[ (s'(t))^2 + p_1^2\, s^2(t) \right] dt + p_1 \left[ s^2(0) + s^2(T) \right] \right\}   (A.55)

As usual, the solution to the simple-hypothesis detection problem can then be written using (A.3).
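Equation (A.54) may be checked directly against the integral equation (A.6): when s_2 is substituted back, the delta functions contribute R_n(t) and R_n(t - T) terms, and the total should reproduce s(t) on (0, T). A minimal Python sketch, with an assumed smooth test signal:

import numpy as np

a, p1, T, w0 = 1.0, 1.5, 4.0, 2.0          # assumed values

Rn = lambda tau: (a**2 / (2 * p1)) * np.exp(-p1 * np.abs(tau))
s   = lambda t: np.sin(w0 * t)             # assumed test signal
sp  = lambda t: w0 * np.cos(w0 * t)        # s'
spp = lambda t: -w0**2 * np.sin(w0 * t)    # s''

u = np.linspace(0.0, T, 40001)
g = (-spp(u) + p1**2 * s(u)) / a**2        # interior part of s2(t) from (A.54)

def trap(fvals, x):                        # trapezoidal quadrature
    return np.sum(0.5 * (fvals[1:] + fvals[:-1]) * np.diff(x))

for t in (0.7, 1.9, 3.2):
    smooth = trap(Rn(t - u) * g, u)
    deltas = ((p1 * s(0.0) - sp(0.0)) * Rn(t) +
              (p1 * s(T) + sp(T)) * Rn(t - T)) / a**2
    print(t, smooth + deltas, s(t))        # last two columns agree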

Example A.3

For M-SAG noise, the spectral density function is as in (A.16); this can alternately be written

N(f^2) = \frac{a^2}{Q(j 2\pi f)\, Q(-j 2\pi f)}   (A.57)

where Q(\cdot) is an Mth-order polynomial

Q(z) = z^M + q_1 z^{M-1} + \cdots + q_M   (A.58)

with no zeros in the right-hand z-plane. The coefficient q_i is the sum of all different products of the p_k taken i at a time. For this general case, it can be shown that(1)

s_2(t) = \frac{1}{a^2} \Big\{ Q\!\left(\tfrac{d}{dt}\right) Q\!\left(-\tfrac{d}{dt}\right) s(t) + \sum_{\ell=0}^{M-1} \delta^{(\ell)}(t) \sum_{m=0}^{2M-1-\ell} s^{(m)}(0) \Big[ \sum_k (-1)^k\, q_{M-k}\, q_{M-m-\ell-1+k} \Big] + \sum_{\ell=0}^{M-1} \delta^{(\ell)}(t - T) \sum_{m=0}^{2M-1-\ell} s^{(m)}(T) \Big[ \sum_k (-1)^{m+k+\ell}\, q_{M-k}\, q_{M-m-\ell-1+k} \Big] \Big\}   (A.59)

(1)Pisarenko [44], p. 59.

where q_0 = 1 and where the limits on the sums over k are \max(0,\, m + \ell + 1 - M) \le k \le \min(m, M). Using (A.14) and integrating by parts, the detectability index is found to be

d = \frac{1}{a^2} \left\{ \sum_{k=0}^{M} (-1)^k\, Q_{2k} \int_0^T \left[ s^{(k)}(t) \right]^2 dt + E_1 + E_2 \right\}

where E_1 arises from the delta functions in (A.59) and is given by

E_1 = \sum_{i=0}^{M-1} \sum_{j=0}^{2M-1-i} \left[ (-1)^j\, s^{(i)}(T)\, s^{(j)}(T) + (-1)^i\, s^{(i)}(0)\, s^{(j)}(0) \right] \sum_k (-1)^k\, q_{M-k}\, q_{M-i-j-1+k}

and the limits on k are \max(0,\, i + j + 1 - M) \le k \le \min(j, M); the term E_2 arises from the integration by parts and is

E_2 = \sum_{k=1}^{M} Q_{2k} \sum_{i=0}^{k-1} (-1)^i \left[ s^{(i)}(T)\, s^{(2k-1-i)}(T) - s^{(i)}(0)\, s^{(2k-1-i)}(0) \right]

The coefficient Q_{2k} is the coefficient of z^{2k} in the expansion of Q(z) Q(-z), where Q(z) is given by (A.58).
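The coefficients Q_{2k} are mechanical to generate for any set of poles. The following Python sketch (pole values assumed) builds Q(z), forms Q(z)Q(-z), and reads off the Q_{2k}.

import numpy as np

p = [0.5, 2.0]                          # assumed poles, M = 2

# Q(z) = prod (z + p_i), coefficients ordered from z^M down to z^0
Q = np.array([1.0])
for pk in p:
    Q = np.polymul(Q, [1.0, pk])

degrees = np.arange(len(Q) - 1, -1, -1)
Qminus = Q * (-1.0) ** degrees          # coefficients of Q(-z)

QQ = np.polymul(Q, Qminus)              # Q(z)Q(-z): an even polynomial in z
Q2k = QQ[::-2]                          # [Q_0, Q_2, ..., Q_{2M}]
print(Q2k)                              # here: [1.0, -4.25, 1.0] = [q2^2, 2*q2 - q1^2, 1]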

APPENDIX B
MEASURE AND PROBABILITY THEORY; SUFFICIENT STATISTICS ON MEASURE SPACES

The purpose of this appendix is two-fold: to establish the measure-theoretic background necessary for portions of the main body of this dissertation, and to present an exposition of some modern results concerning necessary and sufficient statistics. The treatment of measure and probability theory is necessarily terse and incomplete. The aim is mostly to establish a notation; the scheme follows three primary references: Chapter 1 of Wong [62], Articles 2.1 to 2.4 of Lehmann [35], and the author's notes from a course taught by Professor W. L. Root in 1972. The treatment of necessary and sufficient statistics follows an elegant formulation of that subject in terms of sub-\sigma-algebras as first presented by Bahadur [3]; basic to that work is a classic paper due to Halmos and Savage [24]. An attempt is made throughout to relate the results to the more traditional concepts of statistics and conditional probability.

1. Basic Measure Theory. Consider an abstract space Y whose points are denoted y; often Y will represent the totality of all outcomes y of a random experiment and will be called the sample space. Aggregates of points (outcomes) are subsets of Y and

denoted, e.g., A \subset Y. If \mathcal{A} is a class of such sets which is closed under complementation and countable union, it is called a \sigma-algebra. A is said to be an \mathcal{A}-measurable set if A \in \mathcal{A}. An arbitrary class of sets \mathcal{C} is \mathcal{A}-measurable if each C \in \mathcal{C} is; given such a class, there exists a \sigma-algebra generated by \mathcal{C}, \mathcal{A}(\mathcal{C}), which is the minimal \sigma-algebra with respect to which \mathcal{C} is measurable. A \sigma-algebra \mathcal{A}_0 is a sub-algebra of \mathcal{A}, \mathcal{A}_0 \subset \mathcal{A}, if A \in \mathcal{A}_0 \Rightarrow A \in \mathcal{A}. The couple (Y, \mathcal{A}) is called a measurable space or a pre-probability space.

Let \mu be a nonnegative set function on \mathcal{A} which is \sigma-additive: if \{A_n\} \in \mathcal{A} are pairwise disjoint, then

\mu\left( \bigcup_{n=1}^{\infty} A_n \right) = \sum_{n=1}^{\infty} \mu(A_n)   (B.1)

\mu is called a measure on \mathcal{A}. If there exists a sequence of sets satisfying \bigcup_n A_n = Y and if each \mu(A_n) < \infty, then \mu is a \sigma-finite measure. If \mu(Y) < \infty, then \mu is a finite measure. If \mu(Y) = 1, then \mu is a probability measure and is denoted \mathcal{P}. The triple (Y, \mathcal{A}, \mu) is called a measure space. Sets of \mu-measure zero and their subsets are called null sets, and (Y, \mathcal{A}, \mu) is said to be complete if all null sets are \mathcal{A}-measurable. Every measure space can be uniquely completed. Any relation which holds for all y \in Y except on a null set is said to hold almost surely

(a.s.[\mu]).(1) If Y is R^n, then the smallest \sigma-algebra which contains all rectangles, i.e., all n-fold products of half-open intervals

B = \{x \in R^n: a_i < x_i \le b_i,\ i = 1 \ldots n\}   (B.2)

is called the Borel \sigma-algebra \mathcal{B}^n, and its members are the Borel sets. Measures on (R^n, \mathcal{B}^n) are called Borel measures. The unique Borel measure which assigns to each rectangle the product of the lengths of the intervals which comprise it is called the Lebesgue measure on (R^n, \mathcal{B}^n) and is denoted \ell. The completion of (R^n, \mathcal{B}^n, \ell) defines a \sigma-algebra which contains the Lebesgue-measurable sets, but which will not be distinguished notationally.

Let T: Y \to Z be a mapping, and \mathcal{B} a \sigma-algebra of sets in Z. T is a measurable mapping (meas.[\mathcal{A}]) if T^{-1}(B) \in \mathcal{A} for all B \in \mathcal{B}. Writing T: (Y, \mathcal{A}) \to (Z, \mathcal{B}) will imply T meas.[\mathcal{A}]. If a \sigma-algebra \mathcal{B} is not determined by other considerations, then T can be considered to generate on its range the \sigma-algebra \mathcal{B} consisting of all sets B such that

T^{-1}(B) = \{y \in Y: Ty \in B\} \in \mathcal{A}   (B.3)

(1)Throughout, square brackets should be read as "modulo" or "with respect to" when they occur in the text.

Given a measure \mu on (Y, \mathcal{A}), T induces a measure \mu^T on (Z, \mathcal{B}) which satisfies

\mu^T(B) = \mu[T^{-1}(B)]  \forall B \in \mathcal{B}   (B.4)

and often one writes \mu^T = \mu T^{-1}.

2. Basic Probability Theory. A measure space (Y, \mathcal{A}, \mathcal{P}) is called a probability space and often a sample space; a probability measure satisfies all the basic axioms normally attributed to the concept of probability. In applications, probabilities over the sample space (Y, \mathcal{A}) refer to random experiments whose outcomes are the points y \in Y. Denote these observations as Y, and let the probability that Y falls in A \subset Y be \mathcal{P}^Y(A) = \mathcal{P}\{y: y \in A \in \mathcal{A}\}. Considered as a variable whose value is determined by the observation, Y is called a random variable(1) over Y and the measure \mathcal{P}^Y is called the probability distribution of Y. If T: (Y, \mathcal{A}) \to (Z, \mathcal{B}), then \mathcal{P}^Y induces a probability measure \mathcal{P}^T on the sets of \mathcal{B} as in (B.4). The values taken on by T(Y) can be considered outcomes of a related random experiment, and so

(1)This differs from the definition usually given in treatments of probability theory, but does not conflict with them. The approach here follows Lehmann [35], and is consistent with the classical concept of a random variable. Mathematically, a random variable is nothing more than the carrier of its distribution.

T = T(Y) is a random variable with probability distribution \mathcal{P}^T. A measurable mapping T defined on the sample space is called a statistic; if Z is a Euclidean space, then \mathcal{B} will always be considered the Borel sets. It is clear that any statistic is also a random variable.

Let (R^n, \mathcal{B}^n, \mathcal{P}) be a Borel probability space; then the point function

P(a_1, \ldots, a_n) = \mathcal{P}\{x: -\infty < x_i \le a_i,\ i = 1 \ldots n\}   (B.5)

is called the cumulative distribution function (c.d.f.) of \mathcal{P} and has all the usual properties of such a function;(1) conversely, every c.d.f. uniquely determines a Borel probability measure.

3. Integration and Expectation. Let f: (Y, \mathcal{A}) \to (R^1, \mathcal{B}^1); f is the indicator function of a set A, f(y) = I_A(y), if f(y) = 1 for y \in A and zero elsewhere. Clearly, A \in \mathcal{A} \Leftrightarrow I_A(y) meas.[\mathcal{A}]. f is a simple function if, for some sets A_i \in \mathcal{A} and some constants a_i < \infty,

f = \sum_{i=1}^{N} a_i\, I_{A_i}   (B.6)

(1)Wong [62], pp. 6-7.

Let \mu be a measure on (Y, \mathcal{A}); if f is simple, define

\int_Y f(y)\, d\mu(y) = \sum_{i=1}^{N} a_i\, \mu(A_i)   (B.7)

Any measurable f can be approximated by a sequence of simple functions and the definition of (B.7) can be unambiguously extended.(1) A function f is said to be integrable with respect to \mu (f is integ.[\mathcal{A}, \mu]) if f is \mathcal{A}-measurable and \int_Y |f|\, d\mu < \infty.

If (Y, \mathcal{A}, \mathcal{P}) is a sample space and T a real-valued statistic, the expectation of T is defined as

E(T) = \int_Y T(y)\, d\mathcal{P}(y) = \int_R t\, d\mathcal{P}^T(t)   (B.8)

where \mathcal{P}^T = \mathcal{P} T^{-1}. The equivalence of these expressions will be established by Lemma B.2 below.

Let \mu and \nu be two measures on (Y, \mathcal{A}). \nu is said to be absolutely continuous with respect to \mu (\nu \ll \mu) if \mu(A) = 0 \Rightarrow \nu(A) = 0. \mu and \nu are equivalent (\nu \equiv \mu) if they are mutually absolutely continuous. \nu and \mu are (mutually) singular, \nu \perp \mu, if there exists a set A such that \mu(A) = 0 and \nu(A^c) = 0,

(1)Ibid., pp. 15-17.

where "c" denotes complement. A basic theorem is:

THEOREM B.1 (Radon-Nikodym)(1)

Let \mu, \nu be \sigma-finite measures on (Y, \mathcal{A}). Then \nu \ll \mu iff there exists f: (Y, \mathcal{A}) \to (R^1, \mathcal{B}^1), f \ge 0, such that

\nu(A) = \int_A f\, d\mu  \forall A \in \mathcal{A}   (B.9)

f is unique (a.s.[\mu]).

f is called the Radon-Nikodym derivative of \nu with respect to \mu, f = d\nu/d\mu, and often one writes d\nu = f\, d\mu. If \nu is a probability measure, f is also called the probability density function of \nu with respect to \mu (p.d.f.[\mu]). If (Y, \mathcal{A}) = (R^n, \mathcal{B}^n) and \mu = \ell, then f is just called the probability density function (p.d.f.) of \nu.

4. Families of Measures. Given (Y, \mathcal{A}), let \mathcal{M} = \{\mu_\theta: \theta \in \Theta\} be a family of measures on \mathcal{A}. A set is \mathcal{M}-null if it is a null set [\mu_\theta] \forall \theta. A function is \mathcal{M}-integrable, integ.[\mathcal{A}, \mathcal{M}], if it is integ.[\mathcal{A}, \mu_\theta] \forall \theta. \mathcal{M} is said to be a dominated family of measures (\mathcal{M} \ll \lambda) if there exists a \sigma-finite measure \lambda on (Y, \mathcal{A}), not necessarily a member of \mathcal{M}, such that \mu_\theta \ll \lambda

(1)See, e.g., Royden [50], p. 238.

for all \theta \in \Theta. Domination by a \sigma-finite measure is equivalent to domination by a finite or a probability measure.(1) \mathcal{M} is an equivalent family if the \mu_\theta are pairwise equivalent.

5. Sub-Algebras, Statistics, and Conditional Expectation. Let \mathcal{A}_0 \subset \mathcal{A} be a sub-algebra. A measure \mu on \mathcal{A} is also a measure on \mathcal{A}_0. Let \mu, \nu be measures on \mathcal{A}; then \nu \ll \mu on \mathcal{A} \Rightarrow \nu \ll \mu on \mathcal{A}_0, and thus \mu \equiv \nu on \mathcal{A} \Rightarrow \mu \equiv \nu on \mathcal{A}_0.

Let (Y, \mathcal{A}, \mathcal{P}) be the sample space and T: (Y, \mathcal{A}) \to (Z, \mathcal{B}) a statistic. T induces on Y a sub-algebra

\mathcal{A}_0 = T^{-1}(\mathcal{B}) = \{A_0 \in \mathcal{A}: A_0 = T^{-1}(B),\ B \in \mathcal{B}\}   (B.10)

If the events \{A_i\} are all in the same set A_0 \in \mathcal{A}_0, they cannot be distinguished by observation of T(Y). The correspondence of sets (B.10) establishes a similar correspondence for measurable functions:

Lemma B.1(2)

Let T be as above. Then f: (Y, \mathcal{A}) \to (R^1, \mathcal{B}^1) is

(1)Halmos and Savage [24], p. 232; see also Theorem B.2 below.
(2)Ibid., p. 223.

meas.[\mathcal{A}_0] iff there exists g: (Z, \mathcal{B}) \to (R^1, \mathcal{B}^1) such that

f(y) = g[T(y)]  \forall y \in Y   (B.11)

Integrals of these functions can also be related:

Lemma B.2(1)

Let \mu be a \sigma-finite measure on (Y, \mathcal{A}), \mu^T be the measure on (Z, \mathcal{B}) induced by T, and the functions f, g be as above. Then for any B \in \mathcal{B},

\int_{T^{-1}(B)} g[T(y)]\, d\mu(y) = \int_B g(t)\, d\mu^T(t)   (B.12)

Suppose \mathcal{A}_0 is an arbitrary sub-algebra of \mathcal{A}, and f: (Y, \mathcal{A}) \to (R^1, \mathcal{B}^1) is nonnegative and integ.[\mathcal{A}, \mathcal{P}]. The conditional expectation (c.exp.) of f given \mathcal{A}_0, E^{\mathcal{A}_0} f, is the unique(2) nonnegative, real-valued function which satisfies:

(1)Ibid., pp. 228-229.
(2)Existence and uniqueness follow from the Radon-Nikodym Theorem B.1.

a. E^{\mathcal{A}_0} f is meas.[\mathcal{A}_0]

b. \int_{A_0} [E^{\mathcal{A}_0} f](y)\, d\mathcal{P}(y) = \int_{A_0} f(y)\, d\mathcal{P}(y) for all A_0 \in \mathcal{A}_0   (B.13)

E^{\mathcal{A}_0} f yields the same average as f on the "coarser" sets of \mathcal{A}_0, but is meas.[\mathcal{A}_0], which f is not. Some properties of c.exp., most of which follow directly from (B.13), are:(1)

Lemma B.3

Suppose \mathcal{A}_0 \subset \mathcal{A}; f_i, i = 1, 2, \ldots, are integ.[\mathcal{A}, \mathcal{P}]; and h is integ.[\mathcal{A}_0, \mathcal{P}]. Then, a.s.[\mathcal{A}_0, \mathcal{P}]:

a. \int_Y [E^{\mathcal{A}_0} f](x)\, d\mathcal{P}(x) = \int_Y f(x)\, d\mathcal{P}(x) = E f

b. f_1(x) \le f_2(x) \Rightarrow E^{\mathcal{A}_0} f_1 \le E^{\mathcal{A}_0} f_2, and the same holds for equalities.

c. E^{\mathcal{A}_0}(a f_1 + f_2) = a\, E^{\mathcal{A}_0} f_1 + E^{\mathcal{A}_0} f_2

(1)Wong [62], pp. 29-32.

d. E^{\mathcal{A}_0}[h(x) f(x)] = h(x)\, E^{\mathcal{A}_0} f(x), provided the terms make sense. In particular, E^{\mathcal{A}_0} h(x) = h(x).

e. If \mathcal{A}_1 \subset \mathcal{A}_0 is a sub-algebra, then E^{\mathcal{A}_1} f = E^{\mathcal{A}_1}\{E^{\mathcal{A}_0} f\}.

To relate this to the more familiar notion of c.exp. "given a value of a statistic," suppose that \mathcal{A}_0 was generated by T: (Y, \mathcal{A}) \to (Z, \mathcal{B}). From Lemma B.1, [E^{\mathcal{A}_0} f](y) depends on y only through T(y), since there must exist a function g: (Z, \mathcal{B}) \to (R^1, \mathcal{B}^1) such that

[E^{\mathcal{A}_0} f](y) = g[T(y)]   (B.14)

Traditionally, this is written as

E[f(Y) | T = t] = g(t)   (B.15)

It appears that the formulation using sub-algebras is more basic and elegant. Suppose that S also generates \mathcal{A}_0; by Lemma B.1 one can write the c.exp.

[E^{\mathcal{A}_0} f](y) = g[T(y)] = h[S(y)]

In fact, any statistic which generates \mathcal{A}_0 gives a similar determination of E^{\mathcal{A}_0} f which depends only on the statistic. Many

notions traditionally expressed in terms of specific statistics are much simpler and more elegantly expressed in terms of sub-algebras; this is true in particular of the notion of "necessary and sufficient statistics": one speaks of necessary and sufficient sub-algebras, and then any statistics generating these sub-algebras are necessary and sufficient statistics.

The conditional probability of a set is the conditional expectation of its indicator function,

P\{A | \mathcal{A}_0\} = E^{\mathcal{A}_0} I_A,  A \in \mathcal{A}   (B.16)

or, in terms of a specific statistic,

P\{A | t\} = E[I_A(Y) | t]   (B.17)

These relations are defined and must be considered as functions of t for fixed A. Under suitable restrictions, however (namely that Z be a Euclidean space), there exist determinations of P\{A | t\} which make it a probability measure on (Y, \mathcal{A}) for each fixed t \in Z.(1)

(1)Lehmann [35], Art. 2.5.
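On a finite sample space the content of (B.13) is transparent: when \mathcal{A}_0 is generated by a partition, E^{\mathcal{A}_0} f is simply the probability-weighted average of f over each cell, and the defining property can be checked by direct summation. A small Python sketch, with assumed numbers:

import numpy as np

# Y = {0,...,7} with probability measure P; A0 is generated by the partition
# {0,1,2}, {3,4}, {5,6,7}, i.e., the sets an observer of T can distinguish
P = np.array([0.05, 0.10, 0.15, 0.20, 0.10, 0.15, 0.10, 0.15])
cells = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5, 6, 7])]
f = np.array([3.0, -1.0, 2.0, 0.5, 4.0, 1.0, -2.0, 0.0])

# E[f | A0] is constant on each cell: the P-weighted average over the cell
Ef = np.empty_like(f)
for c in cells:
    Ef[c] = (f[c] * P[c]).sum() / P[c].sum()

# Defining property (B.13b): both sides integrate alike over every set of A0
for c in cells:
    assert np.isclose((Ef[c] * P[c]).sum(), (f[c] * P[c]).sum())
print(Ef)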

6. Necessary and Sufficient Statistics. Let \mathcal{M} = \{\mathcal{P}_\theta, \theta \in \Theta\} be a family of probability measures on (Y, \mathcal{A}), and T: (Y, \mathcal{A}) \to (Z, \mathcal{B}). Let the induced measures on (Z, \mathcal{B}) be denoted \mathcal{M}^T = \{\mathcal{P}_\theta T^{-1}, \theta \in \Theta\}.

If Y is a finite-dimensional Euclidean space, then the conditional probabilities discussed in the preceding section constitute a well-defined conditional probability measure \mathcal{P}_\theta(A | t) on (Y, \mathcal{A}). Traditionally, T is said to be a sufficient statistic for \mathcal{M} (i.e., for \theta) if this conditional distribution does not depend on \theta,

\mathcal{P}_\theta(A | t) = \mathcal{P}(A | t)  \forall \theta \in \Theta   (B.18)

Such a definition is heuristically justified in Section 2.1.1. Early works attempted to extend this concept to abstract probability spaces; for example, Halmos and Savage [24] defined: T is a sufficient statistic for \mathcal{M} if, for every A \in \mathcal{A}, there exists p = p(A | t) which is real-valued and meas.[\mathcal{B}] on Z such that

P_\theta[A | t] = E_\theta[I_A | t] = p(A | t)  a.s.[\mathcal{P}_\theta T^{-1}]   (B.19)

Lehmann [35] (pp. 47-48) uses the same definition; in an early section of his work, Bahadur [3] (p. 429) rephrases the concept:

T is a sufficient statistic for \mathcal{M} if, for each A \in \mathcal{A}, there exists a [\mathcal{B}, \mathcal{M}^T]-integrable function \Phi_A(t) such that

\int_{A \cap T^{-1}(B)} d\mathcal{P}_\theta = \int_B \Phi_A(t)\, d\mathcal{P}_\theta^T(t)   (B.20)

for all B \in \mathcal{B}, \theta \in \Theta.

These approaches are successful but, due to the explicit use of statistics, are unnecessarily cumbersome. As previously discussed, a more elegant formulation uses the concept of sub-algebras. (The development here will follow that of Bahadur [3].) A rigorous justification for eliminating T from explicit consideration and working only with \mathcal{A}_0 lies in the fact that the measure spaces (Z, \mathcal{B}, \mathcal{M}^T) and (Y, \mathcal{A}_0, \mathcal{M}) are isomorphic and the isomorphism is independent of \theta.(1) Thus, explicit consideration of (Z, \mathcal{B}), of the values t of T, and of the distributions \mathcal{M}^T = \{\mathcal{P}_\theta T^{-1}\} of T is not essential to a study of T. One can equivalently study the distributions \mathcal{M} = \{\mathcal{P}_\theta\} in the reduced sample space (Y, \mathcal{A}_0); for example, \mathcal{M}^T is dominated on \mathcal{B} iff \mathcal{M} is dominated on \mathcal{A}_0, etc.(2)

(1)Halmos [23], p. 167.
(2)Bahadur [3], p. 430.

Accordingly, let \mathcal{A}_0 be an arbitrary sub-algebra of \mathcal{A} in (Y, \mathcal{A}, \mathcal{M}), \mathcal{A}_0 \subset \mathcal{A}. Generalizing (B.19), one obtains

DEFINITION B.1(1)

\mathcal{A}_0 is said to be a sufficient sub-algebra for \mathcal{M} if, for each A \in \mathcal{A}, there exists \Phi_A which is meas.[\mathcal{A}_0] and does not depend on \theta, such that

\Phi_A(y) = E_\theta^{\mathcal{A}_0}[I_A(y)]  a.s.[\mathcal{A}_0, \mathcal{M}]   (B.21)

This is readily seen to be equivalent to (B.20) also.

In accordance with the heuristic concept of a necessary statistic (Section 2.1.1), a necessary and sufficient sub-algebra should be the "coarsest" of all sufficient sub-algebras; accordingly one makes

DEFINITION B.2

A sub-algebra \mathcal{A}_1 \subset \mathcal{A} is necessary for \mathcal{M} (on \mathcal{A}) if \mathcal{A}_1 \subset \mathcal{A}_0 a.s.[\mathcal{A}, \mathcal{M}] for every sub-algebra \mathcal{A}_0 which is sufficient for \mathcal{M}.

(1)Bahadur [3], p. 430.

The following facts are clear intuitively; all are proven in Bahadur [3]:

Lemma B.4

If T is a sufficient statistic for \mathcal{M}, then there exists \mathcal{B}_0 \subset \mathcal{B} such that \mathcal{A}_0 = T^{-1}(\mathcal{B}_0) is a necessary and sufficient sub-algebra for \mathcal{M}.

Lemma B.5

(i) If \mathcal{A}' \subset \mathcal{A}'' \subset \mathcal{A} and \mathcal{A}' is sufficient for \mathcal{M}, then \mathcal{A}'' is sufficient for \mathcal{M}. Suppose \mathcal{A}^* \subset \mathcal{A} is necessary and sufficient for \mathcal{M}. Then:

(ii) \mathcal{A}' is necessary for \mathcal{M} iff \mathcal{A}' \subset \mathcal{A}^*.

(iii) \mathcal{A}' is sufficient for \mathcal{M} iff \mathcal{A}^* \subset \mathcal{A}'.

(iv) \mathcal{A}' is necessary and sufficient for \mathcal{M} iff \mathcal{A}' = \mathcal{A}^*.

(v) The elementary sufficient sub-algebra is \mathcal{A}.

(vi) The elementary necessary sub-algebra is \{\emptyset, Y\}, where \emptyset is the empty set.

Assume from now on that \mathcal{M} is dominated by a \sigma-finite measure \lambda. Then there exists a countable subset \mathcal{M}_0 = \{\mathcal{P}_1, \mathcal{P}_2, \ldots\} of \mathcal{M} such that a set is null for every \mathcal{P}_i,

i = 1, 2, \ldots, iff it is null for all \theta \in \Theta; i.e., \mathcal{M}_0 \equiv \mathcal{M}.(1) Choose a sequence \{c_i\}, c_i > 0, \sum c_i = 1, and put

\lambda_0(A) = \sum_i c_i\, \mathcal{P}_i(A),  A \in \mathcal{A}   (B.22)

THEOREM B.2(2)

As defined, \lambda_0 \equiv \mathcal{M}. The sub-algebra \mathcal{A}_0 is sufficient for \mathcal{M} on \mathcal{A} if and only if, for each \theta, there exists a nonnegative meas.[\mathcal{A}_0] function g_\theta such that

d\mathcal{P}_\theta = g_\theta(y)\, d\lambda_0  a.s.[\mathcal{M}]   (B.23)

In other words, \mathcal{A}_0 is a sufficient sub-algebra iff each Radon-Nikodym derivative d\mathcal{P}_\theta / d\lambda_0 is meas.[\mathcal{A}_0].

As an easy corollary, one obtains a generalization of the factorization theorem. This will be stated in terms of a statistic:

(1)See Halmos and Savage [24] p. 232, or Lehmann [35] p. 354.
(2)This theorem was first proven by Halmos and Savage [24], p. 233, and was restated in the present form by Bahadur [3], p. 437.

Corollary(1)

If \mathcal{M} \ll \lambda, \lambda \sigma-finite on \mathcal{A}, then T: (Y, \mathcal{A}) \to (Z, \mathcal{B}) is a sufficient statistic for \mathcal{M} iff there exists a nonnegative function h: (Y, \mathcal{A}) \to (R^1, \mathcal{B}^1) and a family of nonnegative functions \{g_\theta; \theta \in \Theta\}, g_\theta: (Z, \mathcal{B}) \to (R^1, \mathcal{B}^1), such that

d\mathcal{P}_\theta = h(y)\, g_\theta[T(y)]\, d\lambda  a.s.[\mathcal{M}]   (B.24)

Note that here it is the p.d.f.[\lambda] of \mathcal{P}_\theta which is factored. Recall from Lemma B.1 that the composition g_\theta[T(y)] is meas.[\mathcal{A}_0].

Now recall the measure \lambda_0 of the theorem. Since \lambda_0 \equiv \mathcal{M}, there corresponds to each \theta \in \Theta a nonnegative, meas.[\mathcal{A}] function f_\theta(y) such that

d\mathcal{P}_\theta = f_\theta(y)\, d\lambda_0   (B.25)

Let A_\theta(r) \subset Y be the set

(1)Strictly speaking, this corollary only follows from Theorem B.2 if \lambda_0 is used in place of \lambda; the fact that it is true as stated is easily proven (e.g., Lehmann [35], p. 49). The generalization makes the result much more useful; in applications, one almost always has the case where \lambda is Lebesgue measure or where \lambda = \lambda_0 \equiv \mathcal{M}.

A_\theta(r) = \{y: f_\theta(y) \le r\},  0 \le r < \infty   (B.26)

This is the inverse image of a Borel set and hence is meas.[\mathcal{A}]. Let \mathcal{A}^* \subset \mathcal{A} be the \sigma-algebra generated by the \{A_\theta(r); \theta \in \Theta, r \in (0, \infty)\}.

THEOREM B.3(1)

\mathcal{A}^* is a necessary and sufficient sub-algebra for \mathcal{M}.

(1)Bahadur [3], p. 439.
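Theorem B.2 can be watched in action on the simplest dominated family: n independent binary trials with unknown success probability \theta. With \lambda_0 an equal mixture of two members of the family, d\mathcal{P}_\theta / d\lambda_0 depends on the outcome only through the number of successes, so the \sigma-algebra generated by that count is sufficient. The following Python sketch is illustrative; the family members and mixture weights are assumptions.

from itertools import product

n = 4
outcomes = list(product([0, 1], repeat=n))

def P(theta, y):                 # probability of a particular binary outcome y
    k = sum(y)
    return theta**k * (1 - theta)**(n - k)

# lambda_0: an equal mixture of two members of the family, as in (B.22)
lam0 = lambda y: 0.5 * P(0.3, y) + 0.5 * P(0.7, y)

for theta in (0.2, 0.5, 0.9):
    f = {y: P(theta, y) / lam0(y) for y in outcomes}   # dP_theta / d lambda_0
    # check measurability w.r.t. T(y) = sum(y): constant on each level set of T
    by_T = {}
    for y, val in f.items():
        by_T.setdefault(sum(y), set()).add(round(val, 12))
    assert all(len(vals) == 1 for vals in by_T.values())

print("dP/dlambda0 depends on y only through T(y) = sum(y); T is sufficient")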

APPENDIX C
QUASI-BAYESIAN: THE USE OF UTILITY MEASURES

In typical statistical treatments of Bayesian estimation and detection (see Section 1.3), the a priori p.d.f. on the parameters is interpreted strictly as a "relative frequency of occurrence" measure; under loose assumptions, its operational function is to enhance estimator performance in those regions of the parameter space where the "true" parameter, \theta^*, is a priori considered most likely to lie. This is clear if one notes, from (1.27) for example, that the a posteriori p.d.f. is just the classical likelihood function multiplied by the a priori p.d.f. and normalized; this makes the Bayes estimate closely related to the maximum likelihood estimate, but biases it toward those regions of \Theta where the a priori p.d.f. places most of its mass. Suppose, for example, that the conditions of the problem admit the a posteriori mean as an optimum estimate.(1) Since the likelihood function is objective, it is clear that the a priori p.d.f. serves to bias the estimate toward those values where \theta^* is subjectively presumed most likely.

(1)See Van Trees [61], pp. 59-62.

More generally, note that (1.33) gives

\bar{c} = \int_Y \int_\Theta J[\hat{\theta}(y), \theta^*]\, f_0(\theta^*)\, f(y | \theta^*)\, d\theta^*\, dy   (C.1)

which is to be minimized by choice of \hat{\theta}(y). The discussion above follows upon observing that the product of the first two terms in the integrand may be viewed as a new cost functional, say J'[\hat{\theta}(y), \theta^*].

In engineering practice it is common to encounter system specifications which do not reflect a cost on the relationship between the estimate \hat{\theta} and the true value \theta^*, but which specify only that the equipment (estimator) is to perform better in some regions of \Theta than in others. For instance, it might be desired to build a receiver which performs most efficiently for small signals even though large signals are known to be prevalent. From above, it is clear that such a specification can just as well be incorporated into the a priori p.d.f. (which must then be interpreted differently) as into the cost functional. In fact, such a specification can be translated into a nonnegative normalized weighting function, say v(\theta), and used in place of an a priori p.d.f. even when no prior distribution is known and one is not willing to consider \theta as a random variable. This can make Bayesian techniques acceptable when the philosophy is not.

1. Quasi-Bayesian Estimation. Assume (or construct) the cost

functional J[\hat{\theta}(y), \theta^*] to have no factors which depend only on \theta^* (usually, it will be J[\hat{\theta} - \theta^*]); incorporate all specifications of the type discussed above into a nonnegative, bounded, integrable weighting function v(\theta) on \Theta. Construct a utility density function w(\theta) as follows:

w(\theta) = \begin{cases} f_0(\theta) & \text{if no } v(\cdot) \text{ exists} \\[4pt] \dfrac{f_0(\theta)\, v(\theta)}{\int_\Theta f_0(\theta')\, v(\theta')\, d\theta'} & \text{if both exist} \\[4pt] \dfrac{v(\theta)}{\int_\Theta v(\theta')\, d\theta'} & \text{if no } f_0(\theta) \text{ exists} \end{cases}   (C.2)

The first case is the classical Bayesian one, and w(\cdot) represents a (possibly subjective) relative frequency of occurrence. In the last case, w(\cdot) represents a pure utility and \theta need not even be considered random; the integral over Y in (C.1) can be viewed as an expectation, and that over \Theta as an ordinary integral. The second case is, of course, a combination of the other two. Once w(\cdot) is defined one proceeds with all the Bayesian techniques, using w(\theta) in place of f_0(\theta). Clearly, much advantage is gained if w(\theta) can be chosen in the class of natural conjugate

densities defined in Section 3.1.1 (provided that sufficient statistics exist) or can be represented as in (3.19) using an r(\theta) which makes the integral in (3.23) tractable.

2. Quasi-Bayesian Detection. Identical comments can be made for the compound-hypotheses detection problem; the optimal decision statistic becomes the "marginal" likelihood ratio averaged over the utility densities. This is best illustrated with an example.

Example C.1

Suppose H_1 occurs with probability P(H_1) = p, and that f(y | \theta, H_1) and an a priori p.d.f. f_0(\theta) are known. Similarly, P(H_0) = 1 - p, and f(y | \eta, H_0) and f_0(\eta) are known. A "reward structure" is established as follows:

                         Hypothesis
                   H_1              H_0
Decision D_1:  v_{11}(\theta)   v_{10}(\eta)
Decision D_0:  v_{01}(\theta)   v_{00}(\eta)

with v_{11}(\theta) > v_{01}(\theta) and v_{00}(\eta) > v_{10}(\eta). Such a structure lends itself readily to emphasizing performance in certain regions of \Theta and of the \eta-space \mathcal{H}. The objective is to maximize the expectation of the reward,

V = P(H_1) \int_\Theta \int_Y \left[ v_{01}(\theta)\, P(D_0 | y) + v_{11}(\theta)\, P(D_1 | y) \right] f(y | \theta, H_1)\, f_0(\theta)\, dy\, d\theta + P(H_0) \int_{\mathcal{H}} \int_Y \left[ v_{00}(\eta)\, P(D_0 | y) + v_{10}(\eta)\, P(D_1 | y) \right] f(y | \eta, H_0)\, f_0(\eta)\, dy\, d\eta   (C.3)

Denote a randomized decision function as

g(y) = P(D_1 | y) = 1 - P(D_0 | y)   (C.4)

Use this in (C.3); after considerable rearranging, it is seen that maximizing V is equivalent to maximizing

V' = \int_Y g(y) \left\{ \int_\Theta f(y | \theta', H_1) \left[ v_{11}(\theta') - v_{01}(\theta') \right] f_0(\theta')\, d\theta' - \frac{1 - p}{p} \int_{\mathcal{H}} f(y | \eta', H_0) \left[ v_{00}(\eta') - v_{10}(\eta') \right] f_0(\eta')\, d\eta' \right\} dy   (C.5)

Clearly, V is maximized by putting g(y) = 1 when the term in braces is positive, and g(y) = 0 otherwise. In the notation developed prior to this example, the result can be summarized: Define the utility densities

w_1(\theta) = \frac{\left[ v_{11}(\theta) - v_{01}(\theta) \right] f_0(\theta)}{\int_\Theta (\text{Numerator})\, d\theta}   (C.6)

w_0(\eta) = \frac{\left[ v_{00}(\eta) - v_{10}(\eta) \right] f_0(\eta)}{\int_{\mathcal{H}} (\text{Numerator})\, d\eta}   (C.7)

Then all Bayesian results apply with these utility densities used in place of the a priori densities; the threshold for the "utility-averaged" likelihood ratio is

\beta = \frac{(1 - p) \int_{\mathcal{H}} f_0(\eta) \left[ v_{00}(\eta) - v_{10}(\eta) \right] d\eta}{p \int_\Theta f_0(\theta) \left[ v_{11}(\theta) - v_{01}(\theta) \right] d\theta}   (C.8)

All the results discussed here may be applied to sequential and continuous observations as well. For the sake of brevity, the body of this dissertation proceeds on the assumption that a priori p.d.f.'s are probabilities in the strict, classical sense.
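The threshold (C.8) reduces to a pair of one-dimensional integrals once the reward structure is specified. A small Python sketch follows; all densities and rewards below are assumptions chosen purely for illustration.

import numpy as np
from scipy.integrate import quad

p = 0.4                                       # assumed P(H1)
f0_theta = lambda th: np.exp(-th)             # assumed a priori p.d.f. on theta
f0_eta = lambda et: 1.0                       # assumed a priori p.d.f. on eta (uniform on [0,1])

v11 = lambda th: 2.0 / (1.0 + th)             # reward emphasizing small theta (assumed)
v01 = lambda th: 0.0
v00 = lambda et: 1.0
v10 = lambda et: 0.0

num, _ = quad(lambda et: f0_eta(et) * (v00(et) - v10(et)), 0.0, 1.0)
den, _ = quad(lambda th: f0_theta(th) * (v11(th) - v01(th)), 0.0, np.inf)

beta = (1 - p) / p * num / den                # the threshold of (C.8)
print(beta)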

APPENDIX D
PROOFS AND DERIVATIONS

D.1 Proof of Theorem 2.5

As discussed following the statement of the theorem, this proof closely parallels the proof of Dynkin's theorem and thus will only be outlined here.(1) It follows from three lemmas:

Lemma 1. For any \phi \in V_L, the statistic of a sample of size k given by

t(Y_k) = \sum_{j=1}^{k} \phi(y_j)   (D.1)

is a necessary statistic for \theta. This is proved by noting that by the definition of V_L, \phi(y) can be written

\phi(y) = \sum_{q=1}^{r} c_q\, L(\theta_q; y) + c_0   (D.2)

Use this in (D.1) to see that t(Y_k) is dependent on L_k(\theta; Y_k), which is a necessary statistic.

(1)See Dynkin [34], pp. 24-25.

Lemma 2. If \{1, \phi_1(y), \ldots, \phi_s(y)\}, s > r, span V_L, then the s-vector of statistics t_k(Y_k), where

t_{ki}(Y_k) = \sum_{j=1}^{k} \phi_i(y_j);  i = 1, 2, \ldots, s   (D.3)

is sufficient for \theta. To prove Lemma 2, note that by assumption

L(\theta; y) = a_0(\theta) + \sum_{j=1}^{s} a_j(\theta)\, \phi_j(y)  \forall \theta \in \Theta   (D.4)

Use (D.4) to rewrite (2.30) and then use (D.3) to replace the resulting \phi_j(y) terms by t_{kj}(Y_k). This shows L_k(\theta; Y_k), which is a sufficient statistic for \theta, to be dependent on \{t_{ki}(Y_k), i = 1, \ldots, s\}.

Lemma 3. If \{1, \phi_1(y), \ldots, \phi_s(y)\}, s < r, are linearly independent in V_L, then the s-vector of statistics defined in (D.3) is functionally independent. This is proven by induction using a rather lengthy classical calculus argument (see [13] p. 26); it is true in essence because the L(\theta; Y_k) are piecewise smooth.

Part b. of the theorem follows directly from these lemmas;

part a. requires a minor amount of further work using Lemma 3.

D.2 Proof of Theorem 3.1

The proof will be given in sequential notation, so that "a priori" refers to the state prior to observing y_{k+1} and hence is posterior to Y_k; "a posteriori" refers to Y_{k+1}. Clearly, if p(\theta; y) reproduces under any y_{k+1}, then by induction it reproduces. For an arbitrary a priori p.d.f. f(\theta | Y_k), the a posteriori density is given by (1.32),

f(\theta | Y_{k+1}) = \frac{f(y_{k+1} | Y_k, \theta)\, f(\theta | Y_k)}{\int_\Theta (\text{Numerator})\, d\theta}   (D.5)

This will be used shortly; for now, consider the identity

f(Y_{k+1} | \theta) = f(y_{k+1} | Y_k, \theta)\, f(Y_k | \theta)   (D.6)

Since a sufficient statistic of fixed dimension exists by assumption, both sides may be factored as in (2.10). Replace t_k(Y_k) by \bar{y}_k and normalize the g[\bar{y}; \theta] functions on both sides. Using (3.8) and rearranging,

p(\theta; \bar{y}_{k+1}) = \left\{ \frac{G(\bar{y}_k) \int_\Theta g[\bar{y}_k, \theta']\, d\theta'}{G(\bar{y}_{k+1}) \int_\Theta g[\bar{y}_{k+1}, \theta'']\, d\theta''}\, f(y_{k+1} | Y_k, \theta) \right\} p(\theta; \bar{y}_k)   (D.7)

The left side was constructed to be a probability density on \Theta; hence the term in braces on the right is a normalizing constant, and

p(\theta; \bar{y}_{k+1}) = \frac{f(y_{k+1} | Y_k, \theta)\, p(\theta; \bar{y}_k)}{\int_\Theta (\text{Numerator})\, d\theta}   (D.8)

Compare with (D.5), which holds for arbitrary a priori p.d.f.'s; clearly, p(\theta; \bar{y}) reproduces.

D.3 Proof of Theorem 3.2

Using f_n(\theta), n = 1, 2, as an a priori p.d.f. yields

f_n(\theta | Y_k) = \frac{f_n(\theta)\, f(Y_k | \theta)}{f_n(Y_k)}   (D.9)

Note that f(Y_k | \theta) does not depend on n (i.e., on the prior). Put n = 2 and use (3.20),

f_2(\theta | Y_k) = \frac{r(\theta)\, f_1(\theta)\, f(Y_k | \theta)}{f_2(Y_k)}   (D.10)

Now put n = 1 in (D.9), and use the result to rewrite the last two terms of the numerator of (D.10). Rearranging,

f_2(\theta | Y_k)\, f_2(Y_k) = f_1(Y_k)\, r(\theta)\, f_1(\theta | Y_k)   (D.11)

Integrate on \theta to verify (3.22). Then divide (D.11) by (3.22) to verify (3.21).

D.4 The R-N Derivative and Natural Conjugate Density for Continuous M-SAG Noise

Consider the R-N derivative of (5.27), where the coefficients D_{jk} are defined by (5.26) and the A_k by

\sum_{i,j=0}^{M} \theta_i \theta_j\, (-1)^{M-i}\, z^{2M-(i+j)} = \sum_{k=0}^{M} (-1)^k A_k\, z^{2k}   (D.12)

All terms for which i + j is odd cancel on the left, so the expression makes sense. To obtain an explicit expression for A_k, fix k; this requires that

2M - (i + j) = 2k   (D.13)

and relates the indices i and j, collapsing the double sum. Use (D.13) to eliminate the sum over j; recall that 0 \le j \le M,

which restricts the limits on i:

A_k = \sum_{i=\max(0,\, M-2k)}^{\min[M,\, 2(M-k)]} (-1)^{M-i}\, \theta_i\, \theta_{2(M-k)-i}   (D.14)

This verifies (5.29). The natural conjugate density is found by re-writing (5.33) so that it has the form

f(y_t | \theta, y_0) = K \exp \left\{ \sum_{i,j} t_{ij}\, \theta_i \theta_j \right\}   (D.15)

Consider the last two terms in the exponent of (5.33), which have the form

\sum_{k=0}^{M-1} (-1)^k A_k\, J^{(k)}(y) + \sum_{\substack{j,k=0 \\ j+k\ \text{even}}}^{M-1} D_{jk}\, E^{(j,k)}(y)   (D.16)

where

J^{(k)}(y) = \int_0^T \left[ y_t^{(k)} \right]^2 dt   (D.17)

E^{(j,k)}(y) = y_T^{(j)} y_T^{(k)} - y_0^{(j)} y_0^{(k)}   (D.18)

and D_{jk} is given by (5.26). It is necessary to re-arrange the resulting double and triple sums to obtain explicitly the coefficients

of \theta_i \theta_j as in (D.15). This is best done by inspection; consider the first term in (D.16), which may be re-written using (D.12) and (D.13):

\sum_{k=0}^{M-1} (-1)^k A_k\, J^{(k)}(y) = \sum_{k=0}^{M-1} \sum_{\substack{i,j=0 \\ 2M-(i+j)=2k}}^{M} (-1)^{M-i+k}\, \theta_i \theta_j\, J^{(k)}(y)   (D.19)

Only terms for which (i + j) is even appear; each of these has a coefficient

t_{ij} = (-1)^{2M - i - \frac{i+j}{2}}\, J^{\left( M - \frac{i+j}{2} \right)}(y)   (D.20)

where k has been eliminated through use of (D.13); k \le M - 1 means that i + j must be strictly positive. The exponent on (-1) can be simplified without changing its effect; one finally obtains

\sum_{k=0}^{M-1} (-1)^k A_k\, J^{(k)}(y) = \sum_{\substack{i,j \\ i+j\ \text{even}}} \theta_i \theta_j\, (-1)^{\frac{3i+j}{2}}\, J^{\left( M - \frac{i+j}{2} \right)}(y)   (D.21)

Now consider the second term of (D.16). By inspection of (5.26) one concludes that, as j, k run over their allowable range, one obtains all products \theta_p \theta_q such that q > p and p + q is odd.

Further, $D_{jk} = D_{kj}$ and both occur. Now use (5.26) in (D.16):

$$\sum_{k=0}^{M-1} \sum_{j=0}^{M-1} D_{jk}\, E^{(j,k)}(y) = \sum_{k=0}^{M-1} \sum_{j=0}^{M-1} \sum_{p} (-1)^{M-p-j}\; \theta_p\,\theta_{2M-p-j-k-1}\; E^{(j,k)}(y) \qquad (D.22)$$

where the substitution $p = M-i$ has been made; the sum on $p$ runs over all values for which the subscripts on $\theta$ make sense. To find the coefficient of an arbitrary $\theta_p\theta_q$ satisfying the above restrictions, fix $p$ and call the second subscript $q$; use that relation to eliminate $j$, changing that sum to one over $q$. Fix $q$:

$$t'_{pq} = \sum_{k} (-1)^{M-q-k-1}\; E^{(2M-p-q-k-1,\; k)}(y) \qquad (D.23)$$

and the sum runs over all values $0 \le k \le M-1$ for which it makes sense, i.e., for which $0 \le 2M-p-q-k-1 \le M-1$. Thus,

$$\sum_{\substack{k=0 \\ j+k\ \mathrm{even}}}^{M-1} \sum_{j=0}^{M-1} D_{jk}\, E^{(j,k)}(y) = \sum_{p=0}^{M-1} \sum_{\substack{q=p+1 \\ p+q\ \mathrm{odd}}}^{M} t'_{pq}\; \theta_p\,\theta_q \qquad (D.24)$$

Combine all this to write (D.16) as follows:

$$(D.16) = \sum_{i=0}^{M} \sum_{j=0}^{M} t'_{ij}\; \theta_i\,\theta_j \qquad (D.25)$$

where

$$t'_{ij} = \begin{cases} (-1)^{\frac{3i+j}{2}}\; J^{\left(M-\frac{i+j}{2}\right)}(y); & i+j > 0,\ \mathrm{even} \\[4pt] \displaystyle\sum_k (-1)^{M-k-1-\min(i,j)}\; E^{(2M-i-j-k-1,\; k)}(y); & i+j\ \mathrm{odd} \end{cases} \qquad (D.26)$$

and $k$ runs over $\max(0,\, M-i-j) \le k \le \min(M-1,\, 2M-i-j-1)$. Note that $t'_{ij} = t'_{ji}$, and $t'_{00} = 0$.

Since $\theta_0$ is not estimable, it is desirable to separate it from the other parameters; since the sum is symmetric, (D.25) can be written

$$(D.16) = 2\theta_0 \sum_{j=1}^{M} t_j\,\theta_j \;+\; \sum_{i=1}^{M} \sum_{j=1}^{M} t'_{ij}\,\theta_i\,\theta_j \qquad (D.27)$$

where $t'_{ij}$ is as in (D.26) and where

$$t_j = \begin{cases} (-1)^{j/2}\; J^{(M-j/2)}(y); & j\ \mathrm{even} \\[4pt] \displaystyle\sum_k (-1)^{M-k-1}\; E^{(2M-j-k-1,\; k)}(y); & j\ \mathrm{odd} \end{cases} \qquad (D.28)$$

Define the parameter vector

$$\theta \triangleq (\theta_1, \ldots, \theta_M)' \qquad (D.29)$$

Putting all these results together, the "density" of (5.33) can be written

$$f(y_t|\theta, y_0) = (2\pi)^{-M/2} \exp\left\{ \tfrac{1}{2}\Big[ \mathbf{t}^{*\prime}(y_t)\,\theta \;+\; \theta'\,\mathbf{T}^{*}(y_t)\,\theta \Big] \right\} \qquad (D.30)$$

where the elements of the sufficient statistic vector and matrix are as given in (5.38) and (5.39).

D.5 Lemma(1)

If $f(t)$, $0 \le t \le T$, is a real-valued continuous function with nonzero quadratic variation, then $f(t)$ is of unbounded variation on $[0, T]$.

Proof: Let $P_k$ be a partition of $[0, T]$ with modulus $|P_k|$. Then the quadratic variation of $f$ is

$$Q_f[0, T] \triangleq \lim_{|P_k|\to 0}\, \sum_{i \in P_k} \big[ f(t_i) - f(t_{i-1}) \big]^2 \;\le\; \lim_{|P_k|\to 0}\, \Big[ \max_i \big| f(t_i) - f(t_{i-1}) \big| \Big] \sum_{i \in P_k} \big| f(t_i) - f(t_{i-1}) \big|$$

The first factor goes to 0 since $f$ is continuous; the limit of the second is, by definition, the total variation $V_f[0, T]$. So assume $f(t)$ is continuous with nonzero $Q_f[0, T]$ and $V_f[0, T] < \infty$. Then $Q_f[0, T] = 0$, a contradiction.

(1) This lemma was taken from a set of lecture notes by Prof. W. L. Root.
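The lemma can be seen numerically: a Wiener path with variance parameter $a^2$ is continuous and has quadratic variation $a^2 T$, so by the lemma its total variation must blow up as the partition refines. A small illustration, not part of the original report (the values of $a^2$ and $T$ are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    a2, T = 4.0, 1.0

    for k in [10**3, 10**4, 10**5, 10**6]:
        dt = T / k
        dw = rng.normal(0.0, np.sqrt(a2 * dt), size=k)  # Wiener increments
        Q = np.sum(dw**2)       # quadratic variation estimate -> a2 * T
        V = np.sum(np.abs(dw))  # total variation estimate -> grows like k**0.5
        print(k, Q, V)

The second column settles near $a^2 T = 4.0$ while the third grows without bound; the same behavior of the 1-SAG sample functions is what makes the sum (E.8) of Appendix E converge while the paths remain of unbounded variation.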

APPENDIX E

R-N DERIVATIVES FOR THE 1-SAG PROCESS

E.1 The R-N Derivative

The 1-SAG process with parameter $q_1$ has spectral density function

$$S_y(f) = \frac{a^2}{(2\pi f)^2 + q_1^2} \qquad (E.1)$$

and autocorrelation

$$R_y(\tau) = \frac{a^2}{2q_1}\, \exp(-q_1|\tau|), \qquad q_1 > 0 \qquad (E.2)$$

Let $\{y_t;\ t \in [0, T]\}$ be a finite segment of the process; let $\mu_{q_1}$ be the measure on the space of sample functions which is induced by the process, and let $\mu_{q^*}$ be the measure induced by a 1-SAG process with parameter $q^* > 0$. From Section 5.2, $\mu_{q_1}$ and $\mu_{q^*}$ are equivalent.

Suppose $y_t$ is sampled as usual on $[0, T]$; see Section 1.1. The autocorrelation matrix of any two adjacent samples is, since the process is stationary,

$$R = \begin{bmatrix} r_0 & r_1 \\ r_1 & r_0 \end{bmatrix} \qquad (E.3)$$

where, from (E.2),

$$r_0 = \frac{a^2}{2q_1}, \qquad r_1 = \frac{a^2}{2q_1}\, e^{-q_1\delta} \qquad (E.4)$$

and $\delta = T/k$ is the sampling interval. From (4.46), it is clear that the samples may be considered as having been generated by a discrete autoregression with parameters

$$\beta_1 = -e^{-q_1\delta}, \qquad \sigma^2 = \frac{a^2}{2q_1}\big( 1 - e^{-2q_1\delta} \big) \qquad (E.5)$$

and the joint density of the $k$ samples, conditioned on $y_0$, is given by (4.48). If the parameter of the process were $q^*$, then of course all the above holds with $q_1$ replaced by $q^*$; it is desired to find

$$\lambda(y_t|y_0;\, q_1) = \lim_{k\to\infty} \frac{f(Y_k|y_0,\, q_1)}{f(Y_k|y_0,\, q^*)} \qquad (E.6)$$

Before passing to the limit, it is desirable to rearrange the exponent of (4.48) to obtain sums of the observation which will converge. These sums turn out to be:

$$\sum_{i=1}^{k} y_i^2\, \delta \;\to\; \int_0^T y_t^2\, dt \qquad (E.7)$$

$$\sum_{i=1}^{k} (y_i - y_{i-1})^2 \;\to\; a^2 T \qquad (E.8)$$

$$(y_0,\, y_k) = (y_0,\, y_T) \qquad (E.9)$$

where $\delta = T/k$. Equation (E.7) is true because $y_t$ is a.s. sample-function continuous and has finite average power; (E.8) follows from Baxter's theorem, see (5.15); (E.9) is obvious. After some manipulation, and using (E.5), one can rewrite (4.48) in terms of these sums:

$$f(Y_k|y_0,\, q_1) = \left[ \frac{q_1}{\pi a^2 (1-\beta_1^2)} \right]^{k/2} \exp\left\{ -\frac{q_1}{a^2(1-\beta_1^2)} \sum_{i=1}^{k} \big( y_i + \beta_1 y_{i-1} \big)^2 \right\} \qquad (E.10)$$

where $\beta_1 = -e^{-q_1\delta}$. Again, the denominator of (E.6) is similar.
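The convergence of the sums (E.7) and (E.8) is easy to see by simulation. The sketch below is not part of the original report; it generates samples by the exact discrete autoregression (E.5) (parameter values and seed are arbitrary) and prints the two sums as the sampling interval shrinks.

    import numpy as np

    rng = np.random.default_rng(2)
    a2, q1, T = 1.0, 2.0, 10.0

    for k in [10**3, 10**4, 10**5]:
        d = T / k
        phi = np.exp(-q1 * d)                              # one-step correlation, from (E.4)
        s2 = (a2 / (2.0*q1)) * (1.0 - np.exp(-2.0*q1*d))   # innovation variance, from (E.5)
        y = np.empty(k + 1)
        y[0] = rng.normal(0.0, np.sqrt(a2 / (2.0*q1)))     # stationary start, variance r0
        for i in range(1, k + 1):
            y[i] = phi * y[i - 1] + rng.normal(0.0, np.sqrt(s2))
        print(k, np.sum(y[1:]**2) * d, np.sum(np.diff(y)**2))

The second column fluctuates about $r_0 T$ (pathwise it converges to the random quantity $\int_0^T y_t^2\, dt$), while the third settles at $a^2 T = 10$, as Baxter's theorem asserts.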

The likelihood ratio function has the form

$$\lambda_k(y_t|y_0;\, q_1) = c_k(q_1)\, \exp\left\{ -\frac{1}{a^2}\left[ d_k(q_1) \sum_{i=1}^{k} y_i^2\,\delta + e_k(q_1) \sum_{i=1}^{k} (y_i - y_{i-1})^2 + f_k(q_1)\,\big( y_k^2 - y_0^2 \big) \right] \right\} \qquad (E.11)$$

The limits of the various coefficient functions will be found separately.

Limit for $c_k(q_1)$:

$$c_k(q_1) = \left[ \frac{q_1\big( 1 - e^{-2q^*\delta} \big)}{q^*\big( 1 - e^{-2q_1\delta} \big)} \right]^{k/2} = \exp\left\{ \frac{k}{2}\left[ \ln\frac{q_1}{q^*} + \ln\big( 1 - e^{-2q^*\delta} \big) - \ln\big( 1 - e^{-2q_1\delta} \big) \right] \right\} \qquad (E.12)$$

Since $1 - e^{-2q\delta} = 2q\delta\,(1 - q\delta + O(\delta^2))$ and $\ln(1-\epsilon) \to -\epsilon$ as $\epsilon \to 0^+$, the bracketed quantity in (E.12) is $(q_1 - q^*)\,\delta + O(\delta^2)$; with $k\delta = T$,

$$c_k(q_1) \to \exp\left[ \tfrac{1}{2}(q_1 - q^*)\, T \right] \quad \text{as } k \to \infty \qquad (E.13)$$

Limit for $d_k(q_1)$:

$$d_k(q_1) = \frac{1}{\delta}\left[ q_1\, \frac{1 - e^{-q_1\delta}}{1 + e^{-q_1\delta}} - q^*\, \frac{1 - e^{-q^*\delta}}{1 + e^{-q^*\delta}} \right] \qquad (E.14)$$

$$\to \tfrac{1}{2}\big( q_1^2 - q^{*2} \big) \quad \text{as } \delta \to 0 \qquad (E.15)$$

as can be verified using a first-order Taylor approximation.

Limit for $e_k(q_1)$:

$$e_k(q_1) = \frac{q_1 e^{-q_1\delta}}{1 - e^{-2q_1\delta}} - \frac{q^* e^{-q^*\delta}}{1 - e^{-2q^*\delta}} = \frac{1}{2\delta}\left[ q_1\delta\, \mathrm{csch}(q_1\delta) - q^*\delta\, \mathrm{csch}(q^*\delta) \right] \qquad (E.16)$$

$$\to 0 \quad \text{as } \delta \to 0 \qquad (E.17)$$

as can be verified using a series approximation for $\mathrm{csch}(\cdot)$.(1)

(1) See [1], p. 85, #4.5.65.

Limit for $f_k(q_1)$:

$$f_k(q_1) = \frac{q_1}{1 + e^{-q_1\delta}} - \frac{q^*}{1 + e^{-q^*\delta}} \qquad (E.18)$$

$$\to \tfrac{1}{2}\big( q_1 - q^* \big) \quad \text{as } \delta \to 0 \qquad (E.19)$$

Put these four results into (E.11); take the limit, using (E.7) - (E.9):

$$\lambda(y_t|y_0;\, q_1) = \exp\left\{ -\frac{q_1^2 - q^{*2}}{2a^2} \int_0^T y_t^2\, dt + \frac{q_1 - q^*}{2a^2}\Big[ a^2 T + y_0^2 - y_T^2 \Big] \right\} \qquad (E.20)$$

The "nuisance parameter" $q^*$ will cancel if this is used in Bayes' rule, leaving (5.56); if the remaining terms are multiplied by $f(y_0|q_1)$ as given in (5.55), one obtains Hajek's result (5.57).
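The passage to the limit can be checked by simulation. The sketch below (not part of the original report) generates a path by the exact autoregression (E.5), computes the log of the ratio in (E.6) directly from the sampled transition densities, and compares it with the log of the closed form (E.20) as reconstructed above; parameter values and seed are arbitrary.

    import numpy as np

    rng = np.random.default_rng(3)
    a2, q1, qs, T = 1.0, 1.5, 0.7, 5.0
    k = 100_000
    d = T / k

    def log_f(y, q):
        # log of (4.48): product of exact AR(1) transition densities, given y[0]
        phi = np.exp(-q * d)
        s2 = (a2 / (2.0 * q)) * (1.0 - np.exp(-2.0 * q * d))
        e = y[1:] - phi * y[:-1]
        return -0.5 * len(e) * np.log(2.0 * np.pi * s2) - np.sum(e**2) / (2.0 * s2)

    # Simulate under q1 (any equivalent measure would do).
    phi = np.exp(-q1 * d)
    s2 = (a2 / (2.0 * q1)) * (1.0 - np.exp(-2.0 * q1 * d))
    y = np.empty(k + 1)
    y[0] = rng.normal(0.0, np.sqrt(a2 / (2.0 * q1)))
    for i in range(1, k + 1):
        y[i] = phi * y[i - 1] + rng.normal(0.0, np.sqrt(s2))

    disc = log_f(y, q1) - log_f(y, qs)      # log of (E.6) before the limit
    I2 = np.sum(y[1:]**2) * d               # the sum (E.7)
    cont = (-(q1**2 - qs**2) / (2.0 * a2)) * I2 \
           + ((q1 - qs) / (2.0 * a2)) * (a2 * T + y[0]**2 - y[-1]**2)
    print(disc, cont)                       # close for small delta

For small $\delta$ the two numbers agree to a few decimal places; the residual discrepancy is the $O(\delta)$ error in the coefficient limits (E.13) - (E.19).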

E.2 The Detection Statistic

Suppose one wishes to solve the detection problem of (5.1) using (3.37). To account for the fact that (E.20) is conditioned on $y_0$, one chooses the a priori p.d.f. from the class of (5.68); upon observing $y_0$, the a posteriori p.d.f. is natural conjugate as in (5.57), with parameter

$$\psi(0) = \psi + \begin{bmatrix} 0 \\ -2y_0^2 \end{bmatrix} \qquad (E.21)$$

After observing $y_t$, $0 < t < T$, the parameter is further updated using (5.58) and becomes

$$\psi(T) = \psi + \begin{bmatrix} \displaystyle\int_0^T y_t^2\, dt \\[6pt] a^2 T - (y_0^2 + y_T^2) \end{bmatrix} \qquad (E.22)$$

The results are no longer conditioned on $y_0$; analogous results hold for $H_1$, with $\bar\psi$ and $(y_t - s(t))$ replacing $\psi$ and $y_t$. Equation (3.37) can be written

$$\ell(y_t) = \frac{p(q_1;\, \bar\psi)}{p(q_1;\, \psi)} \cdot \frac{p(q_1;\, \psi(T))}{p(q_1;\, \bar\psi(T))}\; \ell(y_t|q_1) \qquad (E.23)$$

Call the ratio of densities above $L(y_t, q_1)$. By direct calculation using (E.22), (5.68), and (5.57), it is found that

$$L(y_t, q_1) = \frac{K_0(\bar\psi)\, K(\psi(T))}{K_0(\psi)\, K(\bar\psi(T))} \exp\left\{ \frac{q_1^2}{2a^2}\left[ \int_0^T s^2(t)\, dt - 2\int_0^T s(t)\, y_t\, dt \right] - \frac{q_1}{2a^2}\Big[ 2y_0 s(0) - s^2(0) + 2y_T s(T) - s^2(T) \Big] \right\} \qquad (E.24)$$

The term $\ell(y_t|q_1)$ has already been found using the Metzger model, and is given in (1.25) and (1.26); recall that $p_1 = q_1$. Comparing those equations with (E.23) shows that all terms which involve $q_1$ do indeed cancel, and one is left with

$$\ell(y_t) = \frac{K_0(\bar\psi)}{K_0(\psi)} \cdot \frac{K(\psi(T))}{K(\bar\psi(T))}\, \exp\left\{ \frac{1}{a^2}\left[ -s'(0)\, y_0 + s'(T)\, y_T - \int_0^T s''(t)\, y_t\, dt - \frac{1}{2}\int_0^T \big[ s'(t) \big]^2\, dt \right] \right\} \qquad (E.25)$$

For simplicity, assume that the a priori densities on $q_1$ were identical under $H_0$ and $H_1$ (i.e., $\bar\psi = \psi$), so that the first ratio above is 1. The second ratio can be found using (E.22) and (5.61); indicating the updated parameters explicitly,

$$\ell(y_t) = \frac{K\!\left( \psi_1 + \int_0^T y_t^2\, dt,\;\; \psi_2 + a^2 T - (y_0^2 + y_T^2) \right)}{K\!\left( \psi_1 + \int_0^T \{ y_t - s(t) \}^2\, dt,\;\; \psi_2 + a^2 T - [y_0 - s(0)]^2 - [y_T - s(T)]^2 \right)} \exp\left\{ \frac{1}{a^2}\left[ -s'(0)\, y_0 + s'(T)\, y_T - \int_0^T s''(t)\, y_t\, dt - \frac{1}{2}\int_0^T \{ s'(t) \}^2\, dt \right] \right\} \qquad (E.26)$$

It does not appear that any further significant simplifications are possible.
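A schematic sketch of the recipe of this appendix follows; it is not part of the original report. It assumes the conjugate form implied by (E.20), with a $\sqrt{q_1}$ factor standing in for the contribution of $f(y_0|q_1)$, a flat conjugate prior $\psi = (0, 0)$, and grid integration in place of the closed-form normalizer (5.61); the signal and all parameter values are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    a2, T, k, q_true = 1.0, 20.0, 80_000, 1.2
    d = T / k
    t = np.linspace(0.0, T, k + 1)
    s = 0.5 * np.sin(2.0 * np.pi * t / T)        # hypothetical known signal s(t)

    # Observation under H1: known signal plus 1-SAG noise with parameter q_true.
    phi = np.exp(-q_true * d)
    s2 = (a2 / (2.0*q_true)) * (1.0 - np.exp(-2.0*q_true*d))
    n = np.empty(k + 1)
    n[0] = rng.normal(0.0, np.sqrt(a2 / (2.0*q_true)))
    for i in range(1, k + 1):
        n[i] = phi * n[i - 1] + rng.normal(0.0, np.sqrt(s2))
    y = s + n

    def psi_T(x):
        # Conjugate-parameter update (E.21)-(E.22), started from psi = (0, 0).
        return np.sum(x[1:]**2) * d, a2 * T - (x[0]**2 + x[-1]**2)

    q = np.linspace(0.01, 5.0, 2000)             # grid over the parameter space

    def weight(psi1, psi2):
        # Un-normalized natural conjugate density p(q; psi); sqrt(q) is the
        # assumed factor contributed by f(y0 | q1).
        return np.sqrt(q) * np.exp(-(q**2 * psi1 - q * psi2) / (2.0 * a2))

    w0 = weight(*psi_T(y))          # estimation of q1 assuming H0 (noise alone)
    w1 = weight(*psi_T(y - s))      # estimation of q1 assuming H1 (signal present)
    print(q[np.argmax(w1)])         # posterior mode under H1, near q_true for large T
    print(w1.sum() / w0.sum())      # the ratio K(psi(T)) / K(psi_bar(T)) of (E.26)

The remaining factor of (E.26), the exponential involving $s'(t)$ and $s''(t)$, depends only on the data and the known signal and would be multiplied in directly; it is omitted here because it requires the derivatives of $s$.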

REFERENCES

1. M. Abramowitz and I. Stegun, eds., Handbook of Mathematical Functions with Formulas, Graphs, and Tables, National Bureau of Standards, U.S. Government Printing Office, Washington, 1970.

2. K. J. Astrom, Introduction to Stochastic Control Theory, Academic Press, New York, 1970.

3. R. R. Bahadur, "Sufficiency and Statistical Decision Functions," Annals of Mathematical Statistics, 25, 1954, pp. 423-462.

4. G. Baxter, "A Strong Limit Theorem for Gaussian Processes," Proc. of the American Mathematical Society, 7, 1956, pp. 522-528.

5. T. G. Birdsall, "The Theory of Signal Detectability: ROC Curves and Their Character," Ph.D. Dissertation, The University of Michigan, Ann Arbor, August 1966.

6. T. G. Birdsall, Adaptive Detection Receivers and Reproducing Densities, Cooley Electronics Laboratory Technical Report No. TR-194, The University of Michigan, Ann Arbor, July 1968.

7. T. Birdsall and J. Gobien, "Sufficient Statistics and Reproducing Densities in Simultaneous Sequential Detection and Estimation," to be published in the IEEE Trans. on Information Theory.

8. H. Cramer, Mathematical Methods of Statistics, Princeton University Press, Princeton, New Jersey, 1946.

9. W. B. Davenport and W. L. Root, An Introduction to the Theory of Random Signals and Noise, McGraw-Hill Book Company, Inc., New York, 1958.

10. M. H. DeGroot, Optimal Statistical Decisions, McGraw-Hill Book Company, New York, 1970.

11. J. L. Doob, "The Elementary Gaussian Processes," Annals of Mathematical Statistics, 15, 1944, pp. 229-282.

12. J. L. Doob, Stochastic Processes, John Wiley and Sons, Inc., New York, 1953.

13. E. B. Dynkin, "Necessary and Sufficient Statistics for a Family of Probability Distributions," Selected Translations in Mathematical Statistics and Probability, 1, 1961, pp. 17-40.

14. V. N. Faddeeva, Computational Methods of Linear Algebra, Dover Publications, Inc., New York, 1959.

15. J. Feldman, "Equivalence and Perpendicularity of Gaussian Processes," Pacific Journal of Mathematics, 8, No. 4, 1958, pp. 699-708.

16. T. S. Ferguson, Mathematical Statistics: A Decision Theoretic Approach, Academic Press, New York, 1967.

17. A. Fredriksen, D. Middleton, and D. Vandelinde, "Simultaneous Signal Detection and Estimation Under Multiple Hypotheses," IEEE Trans. on Information Theory, IT-18, No. 5, 1972, pp. 607-614.

18. U. Grenander, "Stochastic Processes and Statistical Inference," Arkiv for Matematik, 1, 1950, pp. 195-277.

19. U. Grenander and G. Szego, Toeplitz Forms and Their Applications, University of California Press, Berkeley, 1958.

20. T. L. Grettenberg, A New Class of Structurally Invariant Learning Machines, Communication Theory Laboratory Technical Report No. T5-685/3111, California Inst. of Technology, Pasadena, Calif., July 1965.

21. J. Hajek, "On a Property of Normal Distribution of Any Stochastic Process," Selected Translations in Mathematical Statistics and Probability, 1, 1958, pp. 245-253.

22. J. Hajek, "On Linear Statistical Problems in Stochastic Processes," Czechoslovak Mathematical Journal, 12, 1962, pp. 404-443.

23. P. R. Halmos, Measure Theory, D. Van Nostrand Company, Inc., New York, 1950.

24. P. Halmos and L. Savage, "Application of the Radon-Nikodym Theorem to the Theory of Sufficient Statistics," Annals of Mathematical Statistics, 20, 1949, pp. 225-241.

25. C. W. Helstrom, "Solution of the Detection Integral Equation for Stationary Filtered White Noise," IEEE Trans. on Information Theory, IT-11, 1965, pp. 335-339.

26. C. W. Helstrom, Statistical Theory of Signal Detection, Pergamon Press, Oxford, 1968.

27. R. A. Howard, "Decision Analysis: Perspectives on Inference, Decision, and Experimentation," Proceedings of the IEEE, 58, No. 5, 1972, pp. 632-643.

28. R. Hogg and A. Craig, Introduction to Mathematical Statistics, The Macmillan Company, New York, 1965.

29. A. Jaffer and S. Gupta, "Recursive Bayesian Estimation with Uncertain Observation," IEEE Trans. on Information Theory, IT-17, No. 5, 1971, pp. 614-616.

30. A. Jaffer and S. Gupta, "Coupled Detection-Estimation of Gaussian Processes in Gaussian Noise," IEEE Trans. on Information Theory, IT-18, No. 1, 1972, pp. 106-110.

31. R. L. Kashyap, "Prior Probability and Uncertainty," IEEE Trans. on Information Theory, IT-17, No. 6, 1971, pp. 641-650.

32. E. J. Kelly, I. S. Reed, and W. L. Root, "The Detection of Radar Echoes in Noise, Parts I and II," Journal of the Society on Industrial and Applied Mathematics, 8, 1960, pp. 309-341 and 481-505.

33. B. O. Koopman, "On Distributions Admitting a Sufficient Statistic," Transactions of the American Mathematical Society, 39, 1936, pp. 399-409.

34. L. LeCam, "On Some Asymptotic Properties of the Maximum Likelihood Estimate and the Related Bayes Estimate," University of California Publications in Statistics, 1, 1953, pp. 277-330.

35. E. L. Lehmann, Testing Statistical Hypotheses, John Wiley and Sons, Inc., New York, 1959.

36. B. Levin and Y. Shinakov, "Asymptotic Properties of Bayes Estimates of Parameters of a Signal Masked by Interference," IEEE Trans. on Information Theory, IT-18, No. 1, 1972, pp. 102-106.

37. L. A. Liporace, "Variance of Bayes Estimates," IEEE Trans. on Information Theory, IT-17, No. 6, 1971, pp. 665-669.

38. K. Metzger, personal correspondence consisting of EE 634 notes presented at The University of Michigan on February 13, 1968.

39. D. Middleton and D. Van Meter, "Modern Statistical Approaches to Reception in Communication Theory," IRE Trans. on Information Theory, IT-4, 1954.

40. D. Middleton and R. Esposito, "Simultaneous Optimum Detection and Estimation of Signals in Noise," IEEE Trans. on Information Theory, IT-14, No. 3, 1968, pp. 434-444.

41. N. E. Nahi, "Optimal Recursive Estimation with Uncertain Observation," IEEE Trans. on Information Theory, IT-15, No. 4, 1969, pp. 457-462.

42. L. W. Nolte, Adaptive Realizations of Optimum Detectors for Synchronous and Sporadic Recurrent Signals in Noise, Cooley Electronics Laboratory Technical Report No. TR-163, The University of Michigan, Ann Arbor, March 1965.

43. W. W. Peterson, T. G. Birdsall, and W. C. Fox, "The Theory of Signal Detectability," IRE Trans. on Information Theory, IT-4, 1954.

44. V. F. Pisarenko, "The Detection of a Random Signal on a Background of Noise," Radiotekhnika i Elektronika, 6, No. 4, 1961, pp. 514-528.

45. R. Price, "Optimum Detection of Random Signals in Noise, with Applications to Scatter-Multipath Communication, Part I," IRE Trans. on Information Theory, IT-2, 1956, pp. 125-135.

46. H. Raiffa and R. Schlaifer, Applied Statistical Decision Theory, The M.I.T. Press, Cambridge, Massachusetts, 1961.

47. R. A. Roberts, Theory of Signal Detectability: Composite Deferred Decision Theory, Ph.D. Dissertation, The University of Michigan, Ann Arbor, 1965.

48. R. A. Roberts, "On the Detection of a Signal Known Except for Phase," IEEE Trans. on Information Theory, IT-11, 1965, pp. 76-82.

49. W. L. Root, "Singular Gaussian Measures in Detection Theory," Proceedings of the Symposium on Time Series Analysis (Brown University, 1962), M. Rosenblatt, ed., J. Wiley and Sons, New York, 1963, pp. 292-315.

50. H. L. Royden, Real Analysis, The Macmillan Company, New York, 1968.

51. A. P. Sage and J. R. Melsa, Estimation Theory with Applications to Communications and Control, McGraw-Hill Book Company, New York, 1971.

52. L. J. Savage, The Foundations of Statistics, John Wiley and Sons, 1954.

53. L. J. Savage, Joint Statistics Seminar, The University of London, 1959, John Wiley and Sons, New York, 1962.

54. L. Scharf and D. Lytle, "Signal Detection in Gaussian Noise of Unknown Level: An Invariance Application," IEEE Trans. on Information Theory, IT-17, No. 4, 1971, pp. 409-411.

55. S. M. Selby, ed., CRC Standard Mathematical Tables, The Chemical Rubber Company, Cleveland, 1970.

56. D. Slepian, "Some Comments on the Detection of Gaussian Signals in Gaussian Noise," IRE Trans. on Information Theory, 4, No. 2, 1958, pp. 65-68.

57. R. L. Spooner, "On the Detection of a Known Signal in a Non-Gaussian Noise Process," Journal of the Acoustical Society of America, 44, 1968, pp. 141-147.

58. R. L. Spooner, The Theory of Signal Detectability: Extension to the Double-Composite Hypothesis Situation, Cooley Electronics Laboratory Technical Report No. TR-192, The University of Michigan, Ann Arbor, April 1968.

59. J. Spragins, "A Note on the Iterative Application of Bayes' Rule," IEEE Trans. on Information Theory, IT-11, No. 4, 1965, pp. 544-549.

60. C. T. Striebel, "Densities for Stochastic Processes," Annals of Mathematical Statistics, 30, 1959, pp. 559-567.

61. H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I, John Wiley and Sons, Inc., New York, 1968.

62. E. Wong, Stochastic Processes in Information and Dynamical Systems, McGraw-Hill Book Company, New York, 1971.

63. A. M. Yaglom, "On the Equivalence and Perpendicularity of Two Gaussian Measures in Function Space," Proceedings of the Symposium on Time Series Analysis (Brown University, 1962), M. Rosenblatt, ed., J. Wiley and Sons, New York, 1963, pp. 327-346.

64. L. Zadeh and J. Ragazzini, "Optimum Filters for the Detection of Signals in Noise," Proceedings of the IRE, 40, 1952, p. 1223.

