Technical Report No. 222 (010586-1-T)

DIGITAL COMMUNICATIONS: DETECTORS AND ESTIMATORS FOR THE TIME-VARYING CHANNEL WITH INTERSYMBOL INTERFERENCE

by David G. Messerschmitt

Approved by Theodore G. Birdsall

COOLEY ELECTRONICS LABORATORY
Department of Electrical and Computer Engineering
The University of Michigan
Ann Arbor, Michigan

for Contract No. N66001-72-C-0073
Naval Undersea Research & Development Center
Pasadena, California 91107

April 1973

ABSTRACT

In this study receivers are designed for a pulse amplitude modulation (PAM) digital communication system with additive white Gaussian noise and intersymbol interference. Considered along with the transversal filter receiver, which is an established technique, are the bit detector and block detector. The bit detector minimizes the probability of error in the decision on each data digit, while the block detector minimizes the probability of error in making a joint decision on the entire sequence of data digits.

Initially the three receivers are derived and compared in noniterative form on the fixed known channel. It is shown that the sampled autocorrelation function of the basic PAM pulse waveform, which is a nonnegative definite function, determines the signal-space geometry and, along with the noise spectral density, the error probability of the three receivers. The output of a bank of matched filters, one matched to each translate of the basic PAM waveform, constitutes a sufficient statistic for the realization of all three receivers. The Hilbert space projection theorem is applied to the derivation of the transversal filter, and it is shown that the transversal filter receiver probability of error approaches 0.5 as the channel autocorrelation function approaches the boundary of its nonnegative definite region. A lower bound on the reliability function with error-correcting coding is obtained for the channel with intersymbol interference.

Next, iterative real-time and non-real-time realizations for the bit and block detectors on a bandlimited known channel, where sampling can be employed, are derived. Dynamic programming is applied to the block detector to yield an algorithm analogous to the Viterbi algorithm of convolutional decoding. The real-time and non-real-time bit detectors are generalized to an arbitrary time-varying channel, where the block detector dynamic programming algorithm is not applicable.

Channel estimation algorithms are developed using stochastic approximation techniques for use with the known channel block detector on a time-varying channel. Both supervised estimators, which must be used in a decision-directed mode, and unsupervised estimators are considered. It is shown by computer simulation on the fixed random channel that the real-time block detector in conjunction with a properly chosen estimator can have an error probability not significantly higher than that of the real-time bit detector.

ACKNOWLEDGEMENT

The author's work was supported by Bell Telephone Laboratories through their Doctoral Support Program. The preparation and production of this report was supported by the Naval Undersea Research & Development Center, Pasadena, California.

TABLE OF CONTENTS

ABSTRACT  iii
ACKNOWLEDGEMENT  v
LIST OF ILLUSTRATIONS  x
LIST OF TABLES  xiii
LIST OF APPENDICES  xiv
LIST OF SYMBOLS  xv

CHAPTER 1: INTRODUCTION AND BACKGROUND  1
  1.1 PAM System Model  5
  1.2 Review of Previous Work  8
  1.3 The Minimum Probability of Error Receiver  10
  1.4 Hilbert Space Notation and Matrix Spectral Norms  13
    1.4.1 The Hilbert Space L2  13
    1.4.2 Euclidian Space  14
    1.4.3 Matrix Spectral Norm  14
  1.5 Stochastic Approximation  17
    1.5.1 Gradient Search Algorithm  18
    1.5.2 The Robbins-Monro Stochastic Approximation Algorithm  20
    1.5.3 Fixed Step-Size Stochastic Approximation Algorithm  22
  1.6 A Traditional Approach to Intersymbol Interference - The Transversal Filter and Automatic Equalization  22
    1.6.1 Transversal Filter  23
    1.6.2 Automatic Equalization  25
  1.7 Plan of the Study  36

CHAPTER 2: A GEOMETRIC SIGNAL-SPACE APPROACH TO INTERSYMBOL INTERFERENCE  37
  2.1 The Bit and Block Detectors  37
    2.1.1 Bit Detector  39
    2.1.2 Block Detector  44
  2.2 Properties of the Autocorrelation Function  50
  2.3 Reliability Bounds for Intersymbol Interference  63
  2.4 Transversal Filter Receiver  79
    2.4.1 Derivation of the Transversal Filter  80
    2.4.2 Performance Evaluation  86
    2.4.3 Frequency Domain Interpretation  95
  2.5 A Comparison of the Bit Detector, Block Detector, and Transversal Filter  97
  2.6 Sampling Relationships and Assumptions  109
  2.7 Conclusions  115

CHAPTER 3: OPTIMUM RECEIVERS FOR A FIXED KNOWN CHANNEL  118
  3.1 The Bit Detector and Block Detector for a Bandlimited Channel  118
  3.2 Bit Detector  122
  3.3 Real-Time Bit Detector  128
  3.4 Block Detector  132
  3.5 Real-Time Block Detector  140
  3.6 The Real-Time Detectors for the Special Case D = 1, Kb =   142
    3.6.1 Block Detector  142
    3.6.2 Error Probability of the Block Detector  147
    3.6.3 Bit Detector  158
  3.7 Conclusions  170

CHAPTER 4: OPTIMUM RECEIVERS FOR A RANDOM CHANNEL  172
  4.1 The Block Detector for a Random Channel  174
  4.2 Bit Detector for a Random Channel  177
    4.2.1 General Data Statistics  178
    4.2.2 Kb-Markov Data Statistics  183
  4.3 Real-Time Bit Detector for Random Channel  191
  4.4 Conclusions  195

CHAPTER 5: CHANNEL ESTIMATION ALGORITHMS  196
  5.1 Supervised A Posteriori Density  199
  5.2 Maximum Likelihood Estimator for a Fixed Channel  213
  5.3 A Robbins-Monro Supervised Estimator  215
  5.4 A Fixed Step-Size Algorithm for a Time-Varying Channel  227
  5.5 An Algorithm for Unknown and/or Time-Varying Data Statistics  246
  5.6 Fixed-Increment Algorithms  256
    5.6.1 Fixed Observation Estimator  258
    5.6.2 Sequential Estimate Algorithm  263
    5.6.3 Quantized Sequential Test  272
  5.7 An Unsupervised Stochastic Approximation Algorithm  278
  5.8 A Numerical Example for L = 2, M = 2  285
  5.9 Conclusions  304

CHAPTER 6: COMPUTER SIMULATION RESULTS  306
  6.1 Results for the Fixed Known Channel  307
    6.1.1 Bit Detector  307
    6.1.2 Block Detector  310
    6.1.3 Performance of Three Receivers for an Example with L = 2  312
    6.1.4 Parameter Sensitivity  317
  6.2 Fixed Random Channel  320
    6.2.1 Bit Detector for a Fixed Random Channel  322
    6.2.2 Stochastic Approximation Algorithm Parameters  325
    6.2.3 Simulation Results  331
  6.3 Conclusions  337

CHAPTER 7: CONCLUSIONS AND SUGGESTIONS FOR FURTHER STUDY  339
  7.1 Conclusion - Comparison of the Techniques  339
    7.1.1 Transversal Filter Receiver  341
    7.1.2 Bit Detector  343
    7.1.3 Block Detector  344
  7.2 Suggestions for Future Study  346

REFERENCES  433

LIST OF ILLUSTRATIONS

Figure  Title  Page
1.1  PAM transmitter and channel  6
1.2  Transversal filter and automatic equalizer  26
2.1  L = 3; region of allowable ρ̃(1) and ρ̃(2)  57
2.2  Graphical interpretation of reliability bound  72
2.3  Reliability upper and lower bounds  74
2.4  Reliability upper and lower bounds  75
2.5  Effective decrease in S/N ratio for reliability bound with intersymbol interference  76
2.6  Upper and lower bounds on channel capacity  77
2.7  Effective decrease in S/N ratio for transversal filter receiver (L = 2)  93
2.8  Effective decrease in S/N ratio for transversal filter receiver (L = 3)  94
2.9  Receiver block diagrams  98
3.1  Real-time block detector for L = 2, D = 1  146
3.2  Decision regions for real-time block detector (and bit detector as c_k → 0)  148
3.3  Probability of error vs. ρ̃(1) for L = 2  157
3.4  Optimum bit detector, L = 2, D = 1  162
3.5  Decision regions of real-time bit detector as σ² → ∞  166
3.6  Decision regions for real-time bit detector (S/N = 10 dB)  167
3.7  Decision regions for real-time bit detector (S/N = 0 dB)  168
3.8  Decision regions for real-time bit detector (S/N = -10 dB)  169
5.1  Stochastic approximation estimator, L = 2  222
5.2  α_{k,1} for λ₁ = 1/2  226
5.3  Convergence of the fixed-step algorithm  233
5.4  Locus of error vector divided by δ  236
5.5  F(α) for r = 1 - αλ₁  243
5.6  Comparison of mean-square error of two algorithms  273
5.7  Parameters of C  288
5.8  Convergence parameters of the fixed step-size algorithm for known data statistics  291
5.9  Upper bound on mean-square error for known data statistics algorithm  293
5.10  Upper bound on mean-square error (expanded scale)  294
5.11  Mean-square error and upper bound for independent data digits (P = 1/2) and L = 2  296
5.12  Parameters of the stochastic approximation algorithm for unknown data statistics  298
5.13  Mean-square error upper bound for unknown data statistics algorithm  300
5.14  Performance of the fixed increment algorithms  302
5.15  Performance of the fixed increment algorithms  303
6.1  Error probability of three receivers  314
6.2  Block detector error probability vs. D for L = 2  315
6.3  Block detector error probability vs. D for L = 6  316
6.4  Bit detector sensitivity to σ²  319
6.5  Receiver sensitivity to knowledge of H  321
6.6  Estimates of h(0) at S/N = 10 dB  332
6.7  Estimates of h(0) at S/N = 0 dB  333
6.8  Estimates of h(0) at S/N = -10 dB  334

LIST OF TABLES

Table  Title  Page
6.1  Stochastic Approximation Algorithm Parameters  329
6.2  Error Probabilities for the Fixed Random Channel  336

LIST OF APPENDICES

APPENDIX A: OPTIMUM RECEIVER RELATIONSHIPS  348
APPENDIX B: AN ARGUMENT FOR THE CONVERGENCE OF THE BAYES A POSTERIORI MEAN  367
APPENDIX C: CONVERGENCE OF THE ROBBINS-MONRO ALGORITHM  376
APPENDIX D: MEAN-SQUARE ERROR BOUNDS FOR THE FIXED STEP-SIZE ALGORITHM  389
APPENDIX E: MEAN-SQUARE ERROR BOUNDS FOR THE FIXED STEP-SIZE ALGORITHM FOR UNKNOWN DATA STATISTICS  406
APPENDIX F: ANALYSIS OF THE FIXED-INCREMENT ALGORITHMS  410
APPENDIX G: UNIQUENESS OF REPRESENTATION AND MEAN-SQUARE ERROR OF UNSUPERVISED ALGORITHM  423

LIST OF SYMBOLS

Symbol  Representation  Page
A  Set of signals  66
A  Decision region  104, 144
A_j  Set in V  105
a_j(f_k)  Integration limits  150
A_0  A priori covariance matrix of H  175
A_j  A posteriori covariance matrix of H  206
a_1, a_2  Constants in representation of H  280
b_m  One of M PAM levels  5
B_k  Data digit  5
B_k  Vector of data digits  119
C(ω)  Channel frequency response  7
C_j  Data autocorrelation function  207
C  Data covariance matrix  207
Ĉ_k  Estimate of C  208, 214
C̃  Approximation to C  228
C_ℓ  Moment matrix  279
C̃_ℓ  Symmetric version of C_ℓ  279
d_i  Tap-gains of transversal filter  24
D_e  Mean-square error  27
d  Vector of tap-gains  27
d_o  d which minimizes D_e  28
d̂_k  Estimate of d_o  31
d  S/N ratio  88
D_i  Distance between signals  67
D(z)  z-transform of d_i  89
D(ω)  D(z) for z = e^{-iωT}  95
D  Parameter of real-time detectors  129
e(t)  Element of W  82
E(R)  Reliability function  63
e_i  Error sample  259
E_p(k*)  Estimate of error  264
F(λ)  Monotone function  58
f_k(B_1 ⋯ B_{N-k})  Minimized likelihood function  133
f_k  Block detector updating statistic  143
F(α)  Bounding function  240
g_k(t)  Transversal filter weighting function  24
g(t)  Element of W  81
G(α)  Bounding function  283
G  Covariance matrix  27
g  Cross-covariance matrix  27
G(ω)  Power spectrum  53
g_k(B_{k+1} ⋯ B_{k+J})  Minimized likelihood function  137
h(t)  PAM pulse waveform  7
H(ω)  Fourier transform of h(t)  7
H(p)  Entropy function  65
h(k)  Samples of h(t)  112
H  Vector of samples of h(t)  119
H_k  Time-varying H  172
h_k(t)  Time-varying h(t)  172
Ĥ  Estimate of H_k or H  217
H(n)  Quantized values of H  322
I  Number of transversal filter taps  25
J  max{K_b, L-1}  124
K_b  Order of Markov data statistics  122
K_h  Order of Markov channel statistics  173
K_0  Number of observations  258
k*  Number of observations of sequential test  264
K_k  Number of observations for kth estimate  264
L_2  Hilbert space of square-integrable functions  13
L  Number of interfering pulses  39
L_k  Decision axis  87, 160
L_1  Linear manifold  102
L_1^±  Positive or negative side of linear manifold  102
L_k(B_1 ⋯ B_k)  Likelihood function  132
L_0  Greatest integer contained in 2  244
M  Number of PAM levels  5
m_k(d)  Regression function  17
M_k(B_k ⋯ B_{k+J})  Likelihood function  137
M_B(t)  Moment generating function  266
m_0  A priori mean of H  175
m_j  A posteriori mean of H  206
N  Number of transmitted data digits  5
n_k  Noise sample  112
n(t)  Additive white Gaussian noise  7
N_0/2  Noise double-sided spectral density  7
P_e  Probability of error  11
P(·)  Probability density function  11
P(ω)  Equivalent power spectrum  60
P_0  Lower bound on P(ω)  84
p_i  Probability of incorrect increment  275
p_M  A posteriori probability  307
p_{m,n}  A posteriori probability  322
q_i  Related to p_i  275
q_n  Block detector state before minimization  310
Q  Number of quantized values of each component of H  322
R(τ)  Noise autocorrelation function  7
R^n  n-dimensional Euclidian space  10
R  Code rate  64
R(z)  z-transform of ρ(·)  89
r(H)  Ratio of probability densities  204
r  Convergence parameter, known data statistics algorithm  229
r  Convergence parameter, known data statistics algorithm  231
r_min  Minimum value of r  232
R_2  Convergence parameter, unknown data statistics algorithm  249
R_{2,min}  Minimum value of R_2  251
R_t  Threshold in fixed increment algorithms  264
r  Convergence parameter, fixed increment algorithms  270
s(t)  Signal waveform  38
S_i  Signal set  40
T  Spacing of PAM pulses  5
T(ω)  Transmitting filter frequency response  7
T(λ)  Linear equation  71
t_0  Solution of transcendental equation  266
U  Vector space  101
U*  Dual space of U  101
Driving vector  377
V  Vector space  101
V*  Dual space of V  101
W  Linear subspace of L_2  81
W̄  Closure of W  81
W^⊥  Orthogonal subspace of W  81
W_{k,j}  Solution matrix  378
x(t)  Received waveform  7
x_k  Observation sample  112
x_k  Vector of samples  27
X_k  (x_1 ⋯ x_k)  119
X̃_k  (x_k ⋯ x_{N+L-1})  125
Y(·, ·)  Regression function  17
z_i  Roots of R(z)  90
α  Fixed step-size  19
α_k  Variable step-size  20
α  Element of U  101
α_max  α corresponding to r = 1  232, 250
α_min  α corresponding to r_min  250
β  Element of V  105
Norm of W_{k,j}  384
δ(t)  Dirac delta function  7
δ_{i,j}  Kronecker delta function  110
Parameter relating to independence of data digits  240
Fixed increment  257
ε  Approximate error probability  69
ε_k  Error vector  208
Transversal filter efficiency  88
Related to bandwidth of h(t)  109
Λ  Autocorrelation matrix  47
λ_i  Eigenvalues of C  218
λ̃_i  Eigenvalues of C̃  230
λ_{ℓ,i}  Eigenvalues of C_ℓ  280
λ_i^{(k)}  Eigenvalues of A_k  372
μ_ℓ  Second moment of observation samples  279
μ̂_{ℓ,k}  Estimate of μ_ℓ  282
ρ̃(k)  Normalized autocorrelation  39
φ_j(t)  Orthonormal set  41
∅  Empty set  11
Φ(x)  Normal distribution function  88
ρ(A)  Matrix spectral radius  15
ρ(k)  Autocorrelation function  38
π(·)  Linear transformation from U to V  105
σ²  Noise variance  112
σ²  Matrix variance  249
βᵀ  Linear functional on V  106
Ratio of largest to smallest eigenvalue of C  232
τ  Time constant  328
φ_ω(t)  Sampling function  110
‖·‖  Hilbert space norm  13, 14, 15
(·, ·)  Hilbert space inner product  13

CHAPTER 1

INTRODUCTION AND BACKGROUND

A digital communication system is one which communicates a discrete number chosen from a finite set. The receiver in such a system must make a choice, or sequence of choices, from among a predetermined and finite number of alternatives. A digital communication system is to be contrasted with an analog communication system, which communicates a continuum of numbers. In both types of systems some form of modulation, or translation from the output of the information source into a physical waveform suitable for transmission on the particular medium used, is required.

One of the more common modulation techniques for digital communications is pulse amplitude modulation (PAM), in which a single elementary waveform, called a pulse, is repeatedly transmitted at uniformly spaced times. Each of the transmitted pulses is amplitude modulated to one of a finite and fixed number of predetermined amplitudes in response to the data from the information source. Typical information sources for a digital communication system would be computer data or the output of a pulse code modulation (PCM) encoder, which is used for the conversion of a sampled analog waveform to a stream of discrete numbers.

One problem which has plagued PAM systems since their inception, and the problem which is frequently the most significant
limiting factor in their performance, is intersymbol interference. This is a term used to describe the overlapping, and resultant interference, of the pulses which reach the receiver, and is caused by the distortion effects in the transmission medium. When the channel can be modeled as a time-invariant linear filter with a frequency response which is not flat, then the resultant intersymbol interference can be eliminated at the receiver by a properly chosen time-invariant linear filter. This filter is generally called an equalizer, because it "flattens" or "equalizes" the channel frequency response. The equalizer filter generally takes the form of a transversal filter, which simply sums the output of a tapped delay line with multiplicative tap gains.

In many channels, such as the high frequency radio channel or the underwater acoustic channel, and to some extent in all physical channels, intersymbol interference is accompanied by a large additive noise, and in addition, the channel frequency response is actually varying in time. In this instance, the channel cannot be equalized in straightforward fashion because there is no way of determining the current channel frequency response precisely because of the noise, and even if there were, some method of simply and automatically adjusting the equalizer filter to track the channel would be required.

The most common receiver for the channel with intersymbol interference and no time variations is the transversal filter receiver. The transversal filter receiver is essentially an extension of the
matched filter receiver, which is optimum for the channel without intersymbol interference, in that it consists of a linear equalizer filter followed by a threshold device to make the decision. More recently, it has been shown that some nonlinear receivers can significantly improve on the error probability of the linear transversal filter receiver. In particular, it has been demonstrated that the minimum probability of error receiver for the fixed known channel with additive white Gaussian noise is nonlinear.

When the channel frequency response is varying in time, the classical approach is to provide automatic adjustment of the tap gains of the transversal filter in a transversal filter receiver. This configuration is known as an automatic, or adaptive, equalizer. The minimum probability of error receiver for a known channel in conjunction with a channel estimator has also been applied to this type of channel by using the channel estimate as if it were actually exact.

The results of the present study fall into several categories:

1. Greater understanding of the fundamental nature of intersymbol interference is developed. The transversal filter is developed from a point of view different from previous authors, and this leads to an understanding of its shortcomings.

2. The previous work on the minimum probability of error receiver for a fixed known channel and independent data digits is extended to include a) general statistical models of channel time variations, and b) general statistical dependencies in the data. The resulting receiver will be called the bit detector for reasons which will become clear.

3. A suboptimum receiver, which is actually a receiver optimized against a different and slightly less satisfactory criterion than probability of error, is developed. This receiver, which will be called the block detector, reduces the computation significantly over the minimum probability of error receiver, and yet has a significantly lower probability of error than the transversal filter receiver when interference is severe.

4. New channel estimators, based on stochastic approximation techniques, are developed. They are suitable for use with a time-varying channel, and can be used with either the bit detector or block detector designed for a fixed known channel by using the estimates as if they were exact.

In the following section the communication system model to be used will be discussed in more detail. Then in Section 1.2 past work in this area will be outlined more thoroughly. Sections 1.3 - 1.6 will deal with more specific background material for the sequel. In Section 1.7 the organization of this study will be outlined.
1.1 PAM System Model

The standard configuration of the various types of PAM systems has been adequately covered in standard references on the subject [2]. The purpose of this section is only to give a brief description of, and justification for, the model to be used.

A typical configuration of a PAM transmitter and channel is shown in Fig. 1.1(a). The data input consists of N data digits, (B_1 ⋯ B_N), each of which can assume one of M possible real number values, {b_m}_{m=1}^M. In order to generate the PAM pulses, the {B_k} modulate a sequence of N delta functions spaced at T second intervals, which are passed into a linear transmitting filter. The impulse response of the transmitting filter becomes the elementary PAM pulse. The signal at the output of the transmitting filter is known as the baseband signal, since it is generally a low-pass waveform with frequency components centered about d.c. In practice, there is generally a minimum of intersymbol interference at the output of the transmitting filter.

In order to transfer the baseband signal to a frequency range suitable for a particular physical channel, the system of Fig. 1.1(a) uses double-sideband amplitude modulation (single-sideband or vestigial-sideband amplitude modulation could be used as well). The baseband signal is multiplied by cos(ω₀t) at the input of a bandpass channel which passes frequencies in the vicinity of ω₀ radians/second. The channel produces intersymbol interference.

Fig. 1.1. (a) PAM transmitter and channel; (b) PAM transmitter and equivalent baseband channel.
At the output of the channel, the signal is coherently demodulated by multiplying by cos(ω₀t), and a low-pass filter is included to reject the double-frequency components.

In the analysis of a system of the type of Fig. 1.1(a), it is sufficient to consider the simplified configuration of Fig. 1.1(b). All the elements within the dashed line of Fig. 1.1(a) labeled "equivalent baseband channel" have been lumped together into an equivalent low-pass channel filter with frequency response C(ω). In addition, an additive noise n(t) has been included at the output of the channel to model the cumulative effect of the noises which enter at various points in the system.

If h(t) is defined to be the inverse Fourier transform of

    H(ω) = T(ω) C(ω),    (1.1.1)

then the received waveform is

    x(t) = Σ_{k=1}^N B_k h(t - kT) + n(t).    (1.1.2)

All that remains in the specification of the model is to state the nature of the additive noise n(t). It will be assumed to be a wide-sense stationary white Gaussian noise random process, which has autocorrelation

    R(τ) = (N₀/2) δ(τ).    (1.1.3)
This noise model is rather restricted, in that it does not represent a very large class of noise statistics. However, it is a sufficiently accurate model of many noises encountered in practice, and it simplifies the analysis. Since the major emphasis here is on dealing with intersymbol interference, and not with complex noise processes, choice of this simple noise model allows maximum attention to be directed at the problem of intersymbol interference.

1.2 Review of Previous Work

The old and established approach to countering intersymbol interference is the transversal filter receiver, which is explained in the excellent book of Lucky et al. [2]. Perhaps the most important result concerning the transversal filter is that it is the optimum time-invariant linear filter for minimizing the probability of error when followed by a threshold, as discovered by Aaron and Tufts [19]. More recently, algorithms for automatically adjusting the tap gains of a transversal filter, so that it can be systematically adjusted before and during the transmission of data, were discovered by Lucky [21], [22] and Gersho [39]. The bulk of the recent work in intersymbol interference has been in the area of improving upon these algorithms.

Recently there has been a shift in emphasis toward studying nonlinear receivers for countering intersymbol interference. This has been prompted by the discovery that the minimum probability of
error detector is nonlinear and that some very simple nonlinear receivers can outperform some very complex linear receivers. The consideration of minimum probability of error receivers was started by Chang and Hancock [6], with later revisions by Bowen [8] and Abend et al. [9]. Kimball [7] considered the simplest case of adjacent overlapping pulses in some detail, and was able to evaluate the performance of the optimum receiver for this case analytically. Abend and Fritchman [10] contributed a real-time optimum receiver and included computer simulation results. Several suboptimum nonlinear receivers were considered by Kimball [7], and several authors have studied another nonlinear receiver, which is a modification of the transversal filter to include decision feedback [16], [40], [41].

A receiver used extensively in this study, which will be called the block detector, has its origins in the Viterbi algorithm of convolutional decoding [11], [12]. The potential application of this algorithm to intersymbol interference has been mentioned by Omura [14] and Viterbi himself [13]. Very recently, a receiver similar to the block detector, but not employing dynamic programming like the Viterbi algorithm, has been studied by Costello and Patrick [42].

The effort in applying nonlinear receivers to a random fixed or random time-varying channel has been very limited and is one of the primary contributions of the present study. Monsen [16] and George et al. [41] have considered iterative algorithms for adjusting the
tap-gains of decision feedback equalizers, but as yet no attempt has been made to extend the optimum receiver to this case. However, several problems considered in the pattern recognition literature are closely related to this one. A general tutorial article on work in this area is that of Ho and Agrawala [23], and especially applicable is the work of Hilborn and Lainiotis [27], [28].

Finally, in developing simple channel estimators, the concept of stochastic approximation will be used. This is developed well in the tutorial chapter by Sakrison [30]. The idea of using a fixed step-size algorithm to track a time-varying parameter, which will be exploited in later chapters, is used by Gersho [39] and Monsen [40] among others.

1.3 The Minimum Probability of Error Receiver

In this section, the receiver which minimizes the probability of error in making a decision from among a finite number of alternatives will be derived. The form of this receiver is well known [1], but it will be derived here for completeness, since it will be used often in Chapters 2, 3, and 4.

Assume that an observation z ∈ Rⁿ, which is a vector of real numbers with n components, is received, that there are k known hypotheses {H_j}_{j=1}^k about how the observation was generated, and that one and only one of them must be chosen. Also assume that probability densities of z conditional on each of the hypotheses,
P(z|H_j), 1 ≤ j ≤ k, are known, as are the a priori probabilities P(H_j), 1 ≤ j ≤ k, of each hypothesis. Designing a decision rule corresponds to determining k subsets {A_j}_{j=1}^k of the observation space Rⁿ such that their union is the whole space,

    ∪_{j=1}^k A_j = Rⁿ,    (1.3.1)

and they are disjoint,

    A_i ∩ A_j = ∅, i ≠ j.    (1.3.2)

The decision rule is to choose H_j if and only if z ∈ A_j. The goal is to choose the sets {A_j} so as to minimize the probability of making an incorrect decision, which is

    P_e = 1 - Σ_{j=1}^k P(H_j) ∫_{A_j} P(z|H_j) dz.    (1.3.3)

The problem is to choose {A_j}_{j=1}^k so that the sum on the right is maximized. Since each point z ∈ Rⁿ must be placed in one and only one A_j, the choice which minimizes (1.3.3) is the one for which P(z|H_j) P(H_j) is maximum. That is,

    A_j = {z : P(z|H_j) P(H_j) > P(z|H_i) P(H_i), 1 ≤ i ≤ k, i ≠ j}.    (1.3.4)

There is an obvious ambiguity in Eq. 1.3.4 when P(z|H_i) P(H_i) =
P(z|H_j) P(H_j) for some i ≠ j, but it is easily resolved by arbitrarily placing z in A_i or A_j, since either choice will not affect Eq. 1.3.3.

The decision rule specified by Eq. 1.3.4 is often referred to as the maximum a posteriori probability (MAP) decision rule, since it corresponds to maximizing the a posteriori probability of H_j given z. To see this, note that by Eq. 1.3.4, a particular H_j is chosen so that

    max_{1≤j≤k} P(z|H_j) P(H_j) = max_{1≤j≤k} P(z, H_j)    (1.3.5)

is satisfied, which is equivalent to the rule

    max_{1≤j≤k} P(H_j|z),    (1.3.6)

since Eq. 1.3.6 is obtained from Eq. 1.3.5 by dividing by P(z), a constant.

In Chapters 2, 3, and 4 the minimum probability of error receivers will be derived using the maximum a posteriori probability decision rule. The decision rule of this section and the detectors of those chapters could be simply modified to minimize more complex cost functions, but probability of error will be used exclusively because it is adequate for communications applications.

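As a concrete illustration, the following sketch (in modern numerical notation) applies the MAP rule of Eq. 1.3.4 to a scalar observation with Gaussian conditional densities. The hypothesis means, noise level, and priors are hypothetical values invented for the example.

```python
# A minimal sketch of the MAP decision rule of Eq. 1.3.4 for scalar Gaussian
# observations. All numerical values are hypothetical.
import numpy as np

def map_decide(z, means, sigma, priors):
    """Return the index j maximizing P(z|H_j) P(H_j) (Eq. 1.3.5)."""
    # Gaussian likelihoods P(z|H_j); the common normalizing factor is dropped
    likelihoods = np.exp(-(z - means) ** 2 / (2 * sigma ** 2))
    return int(np.argmax(likelihoods * priors))

means = np.array([-1.0, 0.0, 1.0])    # hypothetical means under H_1, H_2, H_3
priors = np.array([0.25, 0.5, 0.25])  # hypothetical a priori probabilities
print(map_decide(0.4, means, sigma=0.5, priors=priors))
```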
1.4 Hilbert Space Notation and Matrix Spectral Norms

Throughout this study the concepts of norm and inner product will be used. In this section the notation of norms and inner products is summarized and some results on matrix spectral norms which will be used extensively are stated.

1.4.1 The Hilbert Space L₂. A Hilbert space which will find extensive use in the sequel is the space of square integrable functions on the real line, which is generally denoted by L₂. Elements of L₂ are Lebesgue measurable functions x(t) which satisfy

    ∫ x²(t) dt < ∞,    (1.4.1)

where the integral is interpreted in the Lebesgue sense. The norm of x(t), which can be interpreted geometrically as a length, is denoted by ‖x(t)‖ and defined as

    ‖x(t)‖² = ∫ x²(t) dt.    (1.4.2)

The inner product between two elements of L₂, x(t) and y(t), is denoted by (x(t), y(t)) and defined as

    (x(t), y(t)) = ∫ x(t) y(t) dt.    (1.4.3)

Of course, the norm can be defined in terms of the inner product as

    ‖x(t)‖² = (x(t), x(t)).    (1.4.4)
1.4.2 Euclidian Space. The second Hilbert space which will be used extensively is Euclidian N-space, whose elements are N×1 matrices,

    x = (x₁, x₂, …, x_N)ᵀ.    (1.4.5)

The inner product between two vectors x and y is defined by

    (x, y) = xᵀ y,    (1.4.6)

where xᵀ is the transpose of x. The norm of an element x is defined by

    ‖x‖² = (x, x) = xᵀ x.    (1.4.7)

1.4.3 Matrix Spectral Norm. An additional related concept which will be employed is that of matrix spectral norm. Since this concept is less familiar, some of the major results which will be needed will be given. A more complete treatment can be found in Varga [31].

The spectral radius ρ(A) of an N×N square matrix A with complex eigenvalues {λ_i}_{i=1}^N is defined as
    ρ(A) = max_{1≤i≤N} |λ_i|,    (1.4.8)

and the spectral norm of A is defined as

    ‖A‖ = sup_{x≠0} ‖Ax‖ / ‖x‖.    (1.4.9)

(Notice that an identical notation for the norm of a matrix, vector, and element of L₂ has been used. Which norm is indicated should be clear from context.) Some of the more obvious properties of the matrix spectral norm are summarized in the following theorem, which is stated without proof [31]:

Theorem 1.4.1: If A and B are two N×N matrices, then

1. ‖A‖ ≥ 0 with equality if and only if A is the null matrix,
2. ‖aA‖ = |a| ‖A‖ for an arbitrary complex number a,
3. ‖A + B‖ ≤ ‖A‖ + ‖B‖,
4. ‖Ax‖ ≤ ‖A‖ ‖x‖ and there exists an x such that equality holds,
5. ‖A‖ ≥ ρ(A),
6. ‖A‖² = ρ(A*A) where A* is the conjugate transpose of A.

An immediate consequence of Theorem 1.4.1 is the following important corollary [31]:
Corollary 1.4.1: If A is Hermitian (A* = A) then ‖A‖ = ρ(A).

Another important result is the following [31]:

Theorem 1.4.2: If A is an N×N matrix,

1. lim_{k→∞} Aᵏ = 0 if and only if lim_{k→∞} ‖Aᵏ‖ = 0,
2. lim_{k→∞} Aᵏ = 0 if and only if ρ(A) < 1.

A result which will be used frequently is the following:

Theorem 1.4.3: If A is a symmetric real matrix, then

    ‖I - αA‖ = r    (1.4.10)

and lim_{k→∞} (I - αA)ᵏ = 0 if and only if r < 1, where

    r = max_{1≤i≤N} |1 - αλ_i|.    (1.4.11)

Proof: Equation 1.4.10 follows immediately from Corollary 1.4.1 and the fact that (I - αA) is Hermitian and the eigenvalues of (I - αA) are {(1 - αλ_i)}_{i=1}^N. Then lim_{k→∞} (I - αA)ᵏ = 0 if and only if lim_{k→∞} ‖(I - αA)ᵏ‖ = 0 by Theorem 1.4.2, and hence the condition r < 1, for then and only then lim_{k→∞} ‖(I - αA)ᵏ‖ ≤ lim_{k→∞} rᵏ = 0. QED.

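A numerical check may make the role of r concrete; the symmetric matrix and step-size in the sketch below are arbitrary test values, not quantities from the sequel.

```python
# Numerical check of Theorem 1.4.3 on an arbitrary symmetric test matrix.
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])                        # symmetric real A (hypothetical)
alpha = 0.4                                       # step-size

lam = np.linalg.eigvalsh(A)                       # eigenvalues of A
r = np.max(np.abs(1 - alpha * lam))               # Eq. 1.4.11
norm = np.linalg.norm(np.eye(2) - alpha * A, 2)   # spectral norm ||I - alpha*A||
print(r, norm)                                    # the two agree, per Eq. 1.4.10

# (I - alpha*A)^k -> 0, since r < 1 for these values
print(np.linalg.matrix_power(np.eye(2) - alpha * A, 50))
```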
Finally, there is the following corollary to Theorem 1.4.1:

Corollary 1.4.2: If x and y are two N×1 matrices and A is a symmetric N×N matrix, then the bilinear form xᵀAy can be bounded by

    |xᵀAy| ≤ ‖A‖ · ‖x‖ · ‖y‖.    (1.4.12)

Proof: By the Schwarz inequality,

    |xᵀAy| ≤ ‖x‖ · ‖Ay‖,

and the bound follows from Theorem 1.4.1.

1.5 Stochastic Approximation

A technique which will be used in Section 1.6 and extensively in Chapter 5 is stochastic approximation. Although this technique is adequately explained by Sakrison [30], the following discussion should serve to introduce the particular form of stochastic approximation which will be used in Section 1.6 and Chapter 5.

All the stochastic approximation applications which will be encountered in the sequel will be of the following type: Let there be an arbitrary measurable function Y(d, x_k) of a vector d and random vector x_k. The desired goal is to determine d such that the minimum of the expected value of Y(d, x_k) is achieved,

    min_d m_k(d) = m_k(d_k),    (1.5.1)
where

    m_k(d) = E(Y(d, x_k)).    (1.5.2)

Specifically, there are three special cases which arise:

1. m_k(d) = m(d) is known and independent of k,
2. Y(·, ·) is known and an infinite sequence of samples of the random vector x_k, {x_k}, is available, but all that is known about m_k(d) is that it is independent of k, and
3. The same conditions hold as in 2. except that m_k(d) may be a function of k.

Three types of iterative algorithms are applicable to the determination of d_k in these three cases. They are, respectively, the gradient search algorithm, the Robbins-Monro stochastic approximation algorithm, and the fixed step-size stochastic approximation algorithm. The gradient search algorithm will not be used, but rather serves only as a motivation for the other two algorithms. These three algorithms are treated in the following three sections:

1.5.1 Gradient Search Algorithm. Assume that m_k(d) = m(d) is a known function of the vector d. In all the situations encountered later, a necessary and sufficient condition for the minimum of Eq. 1.5.1 is that the gradient with respect to d be zero,

    grad_d m(d)|_{d=d_o} = 0,    (1.5.3)
where the solution d_k = d_o does not depend on k. This equation can be solved directly for d_o, but in order to motivate the two stochastic approximation algorithms, the approach here will be to develop a successive approximation algorithm to find d_o. Specifically, a successive approximation algorithm generates a sequence of approximations to d_o, {d_j}_{j=1}^∞, such that

    lim_{k→∞} d_k = d_o.    (1.5.4)

One such algorithm is the gradient search algorithm, which works as follows: Let d_k be the current approximation to d_o. Since the gradient of m(d) evaluated at d_k is a vector in the direction of the greatest rate of increase of m(d), subtracting a constant times this gradient from d_k should yield a new estimate d_{k+1} which is closer to d_o. The resulting equation is

    d_{k+1} = d_k - α grad_d m(d)|_{d=d_k},    (1.5.5)

where d_0 is arbitrary and α is a constant which must be determined from considerations of convergence. The algorithm of Eq. 1.5.5 may or may not converge (i.e., satisfy Eq. 1.5.4) depending on α and the particular form of m(d). The general procedure will be to study the convergence of each such algorithm encountered, rather than attempting to arrive at general convergence conditions.

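The sketch below carries out the iteration of Eq. 1.5.5 on a known quadratic regression function, for which Eq. 1.5.3 has a unique solution; the matrix, minimizing vector, and step-size are hypothetical.

```python
# Gradient search (Eq. 1.5.5) on a known quadratic m(d) = (d - d_o)' A (d - d_o).
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])          # hypothetical positive definite matrix
d_o = np.array([1.0, -2.0])         # hypothetical minimizing vector

d = np.zeros(2)                     # arbitrary starting point d_0
alpha = 0.1                         # small enough for convergence here
for _ in range(200):
    grad = 2 * A @ (d - d_o)        # gradient of the quadratic m(d)
    d = d - alpha * grad            # Eq. 1.5.5
print(d)                            # approaches d_o
```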
1.5.2 The Robbins-Monro Stochastic Approximation Algorithm. The second and more interesting case occurs when Y(·, ·) is known but all that is known about m_k(d) is that it is independent of k. As in Section 1.5.1 it is assumed that there is a unique solution d_o to the minimization problem, where d_o satisfies Eq. 1.5.3. It is also assumed that there is available a sequence of samples {x_j}_{j=1}^∞ of the random variables {X_j}_{j=1}^∞.

Since m(d) is not known, the minimization of Eq. 1.5.1 cannot be performed directly. The best that can be done is to develop an estimate d_k of d_o based on {x_j}_{j=1}^k and knowledge of Y(·, ·). A reasonable approach to determining d_k is to substitute the gradient of Y(d, x_k) for the gradient of its expected value in the gradient search algorithm of Eq. 1.5.5,

    d_{k+1} = d_k - α_k grad_d Y(d, x_k)|_{d=d_k}.    (1.5.6)

The fixed step-size α of Eq. 1.5.5 has been replaced by a step-size α_k which is a function of k. When {α_k}_{k=1}^∞ satisfies

    Σ_{k=1}^∞ α_k = ∞,    (1.5.7)

    Σ_{k=1}^∞ α_k² < ∞,    (1.5.8)

then Eq. 1.5.6 is called a Robbins-Monro stochastic approximation
algorithm. The term "stochastic" enters because {d_k} is a sequence of random vectors. Because the algorithm is now stochastic, the convergence of Eq. 1.5.6 must be defined in some probabilistic sense. A strong form of convergence occurs when it can be asserted that Eq. 1.5.4 occurs with probability one (almost everywhere). A weaker form of convergence, which will be used exclusively here, is mean-square convergence, which occurs when

    lim_{k→∞} E(d_k) = d_o    (1.5.9)

and

    lim_{k→∞} E‖d_k - d_o‖² = 0    (1.5.10)

are satisfied. The particular choice of the {α_k} of Eqs. 1.5.7 and 1.5.8 arises in order that Eqs. 1.5.9 and 1.5.10 can occur. Condition 1.5.7 insures that d_k can reach d_o from any starting point d_0, while Eq. 1.5.8 insures that a finite noise variance is added in the infinite number of stages so that Eq. 1.5.10 can be satisfied. If a fixed step-size were chosen, Eq. 1.5.10 could not be satisfied because the same noise variance would be added in at each stage.

The convergence of each algorithm in Chapter 5 will be considered individually. Sakrison [30] has given general sufficient conditions for mean-square convergence, but they are not satisfied
by the algorithms to be considered in Chapter 5.

1.5.3 Fixed Step-Size Stochastic Approximation Algorithm. Now assume that all the conditions of Section 1.5.2 are met except the one which requires that m_k(d) be independent of k. There is then a different solution d_k to the minimization problem of Eq. 1.5.1 for each k, and an estimation algorithm must attempt to track the time-varying vector d_k. Several authors [39], [40] have had success in tracking a slowly time-varying vector by replacing the decreasing step-size {α_k} in the Robbins-Monro algorithm (Eq. 1.5.6) by a fixed step-size α_k = α, as in the gradient search algorithm of Section 1.5.1. The algorithm cannot then converge in the mean-square sense of Eq. 1.5.10, even when d_k is actually fixed, but rather will always have a non-zero asymptotic mean-square error.

In Section 1.6 and Chapter 5, the performance of the fixed step-size algorithms will be measured by fixing d_k = d_o and calculating how fast E(d_k) converges to d_o, and the asymptotic mean-square error between d_k and d_o. These two performance indices give an excellent indication of how well the algorithm will track a time-varying d_k.
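A toy scalar example may clarify the distinction between the two stochastic algorithms: with Y(d, x) = (d - x)², m_k(d) is minimized at the mean of x_k, and the iterations below estimate a slowly drifting mean. The drift rate, noise level, and step-sizes are illustrative only.

```python
# Robbins-Monro (decreasing step-size) versus fixed step-size tracking of a
# slowly time-varying minimum; all constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
target = 0.002 * np.arange(n)          # slowly drifting d_k (the moving minimum)
x = target + rng.normal(0.0, 1.0, n)   # noisy samples

d_rm, d_fs = 0.0, 0.0
for k in range(n):
    d_rm += (x[k] - d_rm) / (k + 1)    # Robbins-Monro: a_k = 1/(k+1)
    d_fs += 0.05 * (x[k] - d_fs)       # fixed step-size: a = 0.05

print(target[-1], d_rm, d_fs)          # the fixed-step estimate tracks the drift;
                                       # the decreasing-step estimate lags far behind
```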
1.6 A Traditional Approach to Intersymbol Interference - The Transversal Filter and Adaptive Equalization

In this section a concrete example of the design and analysis of an adaptive equalizer using the transversal filter will be given. This is the traditional and widely used approach to countering intersymbol interference, and it works well whenever intersymbol interference is not too severe (this statement will be made precise in Section 2.4). The analysis of the resulting equalizer will serve to illustrate the ideas of matrix norm explained in Section 1.4 and of the fixed step-size stochastic approximation algorithm explained in Section 1.5.3.

1.6.1 Transversal Filter. The transversal filter receiver is based on the following considerations: In the absence of intersymbol interference, the optimum receiver for detecting B_k, where

    x(t) = B_k h(t - kT) + n(t),    (1.6.1)

is to sample the output of a filter matched to h(t) and apply to that sample a series of M-1 thresholds [1]. That is, the receiver calculates the statistic

    f_k(B_k, n(t)) = (h(t - kT), x(t)),    (1.6.2)

and chooses B_k = b_j if and only if

    a_{j-1} < f_k ≤ a_j,    (1.6.3)

where -∞ = a_0 < a_1 < ⋯ < a_M = ∞ are a set of appropriately chosen thresholds. In the presence of
intersymbol interference, the simplicity of the matched filter receiver can still be preserved and reasonable performance maintained if a g_k(t) can be found such that

    f_k*(B_k, n(t)) = (g_k(t), x(t))    (1.6.4)

is a function of B_k alone and not of B_j, 1 ≤ j ≤ N, j ≠ k. It will be shown in Section 2.4 that this can be accomplished by letting

    g_k(t) = Σ_{i=-∞}^{∞} d_i h(t - (i+k)T)    (1.6.5)

for appropriately chosen coefficients {d_i}. The filter of Eq. 1.6.5 consists of a matched filter followed by a sampler with sample rate 1/T and an infinite tapped delay line with tap coefficients {d_i}.

The ideal filter of Eq. 1.6.5 cannot be implemented in practice because of its infinite extent. A finite version of this filter is given by the weighting function

    g_k(t) = Σ_{i=-I}^{I} d_i h(t - (i+k)T).    (1.6.6)

Strictly speaking, the filter of Eq. 1.6.6 cannot eliminate interference in the sense of Eq. 1.6.4. However, if I is chosen sufficiently large, intersymbol interference can be made negligible.

Often, in practice, the transmitted waveform is known, but the nature of the channel distortion, and hence h(t), is not known. This
is an instance where automatic adjustment of the tap gains (adaptive equalization) should be contemplated. Nevertheless, even with adaptive equalization, the question remains as to what filter to put in place of the filter matched to h(t) which precedes the tapped delay line when h(t) is not precisely known. The usual approach is to place there a filter matched to the transmitted waveform, or perhaps just a low-pass filter (which may be the same thing).

The transversal filter receiver is shown in a portion of Fig. 1.2. The samples of the output of some linear filter, {x_i}, are passed through a tapped delay-line and summation device to form the statistic

    y_k = Σ_{i=-I}^{I} d_i x_{i+k}.    (1.6.7)

It is assumed that I is large enough and the tap gains {d_i} are chosen properly so that y_k is an accurate representation of B_k, and can simply be compared with M-1 thresholds to determine the decision, as shown. The portion of the receiver consisting of a tapped delay line and summation device is known as a transversal filter, and so the entire receiver is known as a transversal filter receiver. For a more detailed exposition on the transversal filter the reader is referred to the standard references on the subject [2].

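The sketch below forms the statistic of Eq. 1.6.7 and applies a single threshold for the binary case M = 2; the samples and tap gains are hypothetical.

```python
# Transversal filter statistic y_k (Eq. 1.6.7) and a binary threshold decision.
import numpy as np

def transversal_output(x, d, k, I):
    """y_k = sum_{i=-I}^{I} d_i x_{i+k}; d is stored as (d_{-I}, ..., d_I)."""
    return sum(d[i + I] * x[k + i] for i in range(-I, I + 1))

x = np.array([0.1, -0.9, 1.1, 0.2, -1.0, 0.95, 0.0])  # filter output samples (hypothetical)
d = np.array([-0.1, 1.0, -0.1])                       # tap gains for I = 1 (hypothetical)

y3 = transversal_output(x, d, k=3, I=1)
decision = 1 if y3 > 0 else -1                        # single threshold for M = 2 levels
print(y3, decision)
```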
Fig. 1.2. Transversal filter and automatic equalizer.
1.6.2 Automatic Equalization. The question which naturally arises in the design of the transversal filter is the choice of the tap-gains {d_i}. It is necessary to determine the tap-gains according to some criterion, since no choice of tap-gains can completely eliminate intersymbol interference for finite I [21]. Lucky [21] used a criterion which he calls the peak distortion and determined a method of automatically adjusting the tap gains to initially set the equalizer and keep it set during data transmission [22]. The shortcoming of his method was that it didn't work under all conditions. Gersho used a different criterion which does not have this shortcoming [39], and which will be used here.

Gersho's criterion is to minimize the mean-square error between y_k and B_k, which is defined as

    D_e = E(y_k - B_k)².    (1.6.8)

It is assumed that a probability measure has been defined for B_k, and that all second moments exist. To proceed with the determination of {d_i} to minimize Eq. 1.6.8, define

    x_k = (x_{k-I}, …, x_{k+I})ᵀ,    (1.6.9)

    d = (d_{-I}, …, d_I)ᵀ,    (1.6.10)

    G = E(x_k x_kᵀ),    (1.6.11)

    g = E(B_k x_k),    (1.6.12)
where stationary statistics have been assumed. Then, since

    y_k = dᵀ x_k,    (1.6.13)

Eq. 1.6.8 becomes

    D_e = (d - d_o)ᵀ G (d - d_o) - d_oᵀ G d_o + E(B_k²),    (1.6.14)

where

    G d_o = g.    (1.6.15)

It follows from Eq. 1.6.13 that

    E(y_k²) = dᵀ G d ≥ 0,    (1.6.16)

so that G is automatically non-negative definite. In order to proceed, assume that G is positive definite in addition, so that Eq. 1.6.15 can be solved for d_o,

    d_o = G⁻¹ g.    (1.6.17)

Since G is positive definite, from Eq. 1.6.14,

    D_e ≥ E(B_k²) - d_oᵀ G d_o,    (1.6.18)

with equality if and only if d = d_o, so that d = d_o of Eq. 1.6.17 is the unique minimum of Eq. 1.6.8.

The techniques of Section 1.5 can be applied to the minimization of Eq. 1.6.8. Identify the function Y(·, ·) as
    Y(d, x_k) = (dᵀ x_k - B_k)²,    (1.6.19)

and then the minimization of Eq. 1.6.8 is equivalent to Eq. 1.5.1. In addition, Eq. 1.5.3 is satisfied, since d = d_o is the unique solution of

    grad_d D_e = 0 = 2(G d - g).    (1.6.20)

The gradient search algorithm of Eq. 1.5.5 becomes, for the Y(·, ·) of Eq. 1.6.19,

    d_{k+1} = d_k - α(G d_k - g),    (1.6.21)

where d_0 is an arbitrary initial estimate of d_o and α must be chosen to insure the convergence of Eq. 1.6.21 in the sense of Eq. 1.5.4. Rewriting this difference equation in terms of the error between d_k and d_o,

    (d_{k+1} - d_o) = (I - αG)(d_k - d_o) + α(g - G d_o),    (1.6.22)

and using Eq. 1.6.15, Eq. 1.6.22 becomes

    (d_{k+1} - d_o) = (I - αG)(d_k - d_o) = (I - αG)^{k+1} (d_0 - d_o).    (1.6.23)

By Theorem 1.4.3, Eq. 1.5.4 is satisfied,
    lim_{k→∞} d_k = d_o,

if and only if

    |1 - αγ_i| < 1, 1 ≤ i ≤ 2I+1,    (1.6.24)

where the {γ_i} are the eigenvalues of G. Since it has been assumed that G is positive definite, γ_i > 0, 1 ≤ i ≤ 2I+1, and α can always be chosen small enough to satisfy Eq. 1.6.24. Consequently, if α is properly chosen, then the gradient search algorithm of Eq. 1.6.21 will converge to d_o without the necessity of performing the matrix inversion of Eq. 1.6.17.

When the channel has statistics unknown to the receiver and/or varying in time, the fixed step-size stochastic approximation algorithm of Section 1.5.3 is applicable. The gradient of Y(·, ·) is, from Eq. 1.6.19,

    grad_d Y(d, x_k) = 2 e_k x_k,    (1.6.25)

where

    e_k = y_k - B_k = dᵀ x_k - B_k    (1.6.26)

is the error between y_k and B_k.
The algorithm corresponding to Eq. 1.5.6 is

    d_{k+1} = d_k - α x_k (x_kᵀ d_k - B_k).    (1.6.27)

Note that the algorithm of Eq. 1.6.27 requires knowledge of the data digits B_k in addition to the filter output samples {x_i}. This is to be expected, since B_k, along with x_k, is part of the random part of Y(·, ·), and the fixed step-size algorithm of Eq. 1.5.6 requires that the samples of all the random variables be available. In order to circumvent this difficulty, assume that the S/N ratio is high enough and the tap gain vector d_k is close enough to d_o that very few decision errors are being made by the receiver. Then use the receiver decisions as if they were the actual data digits in Eq. 1.6.27. This technique is known as decision-directed operation, and has the inherent danger that a few bad decisions can adversely affect Eq. 1.6.27 and cause the tap gains d_k to be incremented in the wrong direction, causing more decision errors, and so forth. More will be said about decision-directed operation in Chapter 5, and it will be studied by computer simulation in Chapter 6.

The decision-directed implementation of Eq. 1.6.27 is illustrated in Fig. 1.2, where the decision at the output of the receiver is subtracted from y_k and multiplied by each of the tap outputs to determine the increment in the corresponding tap gain. This equalizer is an extension of that discovered by Gersho [39], which was designed only for use in a training mode with isolated test pulses.

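The following sketch simulates decision-directed operation of Eq. 1.6.27 for binary (±1) data on a mild hypothetical channel: the receiver decision sgn(y_k) stands in for the unknown B_k in the tap update, and the taps it produces approximate the optimum d_o = G⁻¹g of Eq. 1.6.17. The channel samples, noise level, and step-size are invented for the example.

```python
# Decision-directed fixed step-size adaptation (Eq. 1.6.27) for binary data.
import numpy as np

rng = np.random.default_rng(1)
B = rng.choice([-1.0, 1.0], size=5000)           # data digits
chan = np.array([0.2, 1.0, 0.2])                 # hypothetical sampled channel
x = np.convolve(B, chan, mode="same") + 0.1 * rng.normal(size=B.size)

I, alpha = 3, 0.01
d = np.zeros(2 * I + 1)
d[I] = 1.0                                       # start from a pass-through filter
for k in range(I, B.size - I):
    xk = x[k - I:k + I + 1]                      # samples x_{k-I} ... x_{k+I}
    yk = d @ xk                                  # Eq. 1.6.7
    Bk_hat = 1.0 if yk > 0 else -1.0             # receiver decision replaces B_k
    d = d - alpha * xk * (yk - Bk_hat)           # decision-directed Eq. 1.6.27
print(np.round(d, 3))                            # taps settle near an equalizer for chan
```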
As explained in Section 1.5.3, Eq. 1.6.27 cannot converge in mean-square because of the fixed step-size α. However, it will be shown that when it is assumed that all decisions on B_k are correct and the channel and data statistics are fixed, the algorithm will converge to a neighborhood of the optimum tap gains, and the size of that neighborhood will be measured by the asymptotic mean-square error between d_k and d_o. It will be shown that the asymptotic mean-square error can be made arbitrarily small by choosing α small, but that the rate of convergence to that asymptotic error slows as α approaches zero. Therefore, there is a tradeoff between two conflicting objectives: fast convergence and small asymptotic error. In case the channel and/or data statistics are actually varying, as long as the rate of variation is slow with respect to the convergence rate of the algorithm, then the algorithm can be expected to track it well.

To make the analysis of Eq. 1.6.27 possible, it is necessary to modify it somewhat. Assume that there is a Δ > 0 such that x_k and x_{k+Δ} are independent random variables, then modify Eq. 1.6.27 to

    d_{k+1} = d_k - α x_{kΔ+1} (x_{kΔ+1}ᵀ d_k - B_{kΔ+1}).    (1.6.28)

The key simplification is then that x_{kΔ+1} and d_k are statistically independent random vectors, since d_k is a function of (x_1, x_{Δ+1}, …, x_{(k-1)Δ+1}). Subtracting d_o from both sides of Eq. 1.6.28,
    (d_{k+1} - d_o) = (I - α x_{kΔ+1} x_{kΔ+1}ᵀ)(d_k - d_o) - α(x_{kΔ+1} x_{kΔ+1}ᵀ d_o - x_{kΔ+1} B_{kΔ+1}).    (1.6.29)

Then, taking expected value,

    E(d_{k+1} - d_o) = (I - αG) E(d_k - d_o) - α(G d_o - g) = (I - αG)^{k+1} E(d_0 - d_o),    (1.6.30)

since G d_o = g by Eq. 1.6.15. If and only if Eq. 1.6.24 is satisfied will Eq. 1.6.28 be unbiased,

    lim_{k→∞} E(d_k - d_o) = 0.    (1.6.31)

Furthermore, the mean-square error can be bounded from

    ‖d_{k+1} - d_o‖² = ‖(I - α x_{kΔ+1} x_{kΔ+1}ᵀ)(d_k - d_o) - α(x_{kΔ+1} x_{kΔ+1}ᵀ d_o - x_{kΔ+1} B_{kΔ+1})‖².    (1.6.32)

Note first that
    (I - α x_{kΔ+1} x_{kΔ+1}ᵀ)² = ((I - αG) + α(G - x_{kΔ+1} x_{kΔ+1}ᵀ))²
        = (I - αG)² + 2α(I - αG)(G - x_{kΔ+1} x_{kΔ+1}ᵀ) + α²(G - x_{kΔ+1} x_{kΔ+1}ᵀ)²,    (1.6.33)

and then substituting Eq. 1.6.33 in Eq. 1.6.32, taking expected value, and using Corollary 1.4.2,

    E‖d_{k+1} - d_o‖² ≤ R² E‖d_k - d_o‖² + 2α ‖F‖ · ‖I - αG‖ᵏ · ‖d_0 - d_o‖ + α² f,    (1.6.34)

where

    F = E[(I - α x_{kΔ+1} x_{kΔ+1}ᵀ)(x_{kΔ+1} x_{kΔ+1}ᵀ d_o - x_{kΔ+1} B_{kΔ+1})],    (1.6.35)

    f = E[‖(x_{kΔ+1}ᵀ d_o - B_{kΔ+1}) x_{kΔ+1}‖²],    (1.6.36)

    R² = ‖I - αG‖² + α² E‖G - x_{kΔ+1} x_{kΔ+1}ᵀ‖².    (1.6.37)

Since G is assumed to be positive definite, α can always be chosen sufficiently small so that R < 1, and when that is satisfied then ‖I - αG‖ < 1, so that the second term of Eq. 1.6.34 approaches zero as k → ∞. Iterating Eq. 1.6.34,

CHAPTER 2

A GEOMETRIC SIGNAL-SPACE APPROACH TO INTERSYMBOL INTERFERENCE

In this chapter a point of view toward intersymbol interference that differs from that of previous authors will be adopted. Previously attention has been drawn toward iterative realizations of the various receivers, since these are the form best adapted to implementation. By adopting the point of view of non-iterative realizations in this chapter, a greater understanding of intersymbol interference results. The signal-space geometric approach adopted here is influenced by the excellent book by Wozencraft and Jacobs [1], the relevant portion of which seems to be based in turn on the work of Arthurs and Dym [20].

2.1 The Bit and Block Detectors

In this section the bit detector and block detector will be derived for a channel with white Gaussian noise, known impulse response, and intersymbol interference. The treatment of this section is purposefully abstract, for this approach gives greater insight into the nature of the receiver. Chapter 3 will develop receiver realizations more suitable for implementation.

The PAM system model to be employed was developed in Section 1.1. To reiterate the model briefly, the transmitted signal
consists of a sequence of N elementary waveforms {h(t - kT)}_{k=1}^N, each of which is modulated by one of a finite number M of arbitrary but distinct levels {b_j}_{j=1}^M. The elementary waveform h(t) is assumed to be of finite energy, h(t) ∈ L₂. The received signal is

    x(t) = s(t) + n(t),    (2.1.1)

where

    s(t) = Σ_{k=1}^N B_k h(t - kT),    (2.1.2)

and where n(t) is a wide-sense stationary white Gaussian noise process with autocorrelation R_n(τ) = (N₀/2) δ(τ) and power spectrum S_n(ω) = N₀/2.

If the elementary waveforms are orthogonal, then the minimum probability of error detector consists of N matched filters, one matched to each h(t - kT), and each filter responds to one and only one waveform (due to their orthogonality). The case of intersymbol interference, which is the subject of this study, occurs when the waveforms are not orthogonal. Thus, it is natural to define the inner product of h(t) and h(t - kT) as

    ρ(k) = (h(t), h(t - kT)), k = 0, ±1, ±2, ….    (2.1.3)

It follows directly from the definition of inner product, Eq. 1.4.3, that
    ρ(k) = ρ(-k)    (2.1.4)

and

    (h(t - jT), h(t - kT)) = ρ(j - k).    (2.1.5)

A case of particular interest occurs when ρ(k) is non-zero only for a small number of arguments, and this motivates the following definition:

Definition 2.1.1: Intersymbol interference is L-finite if and only if

    sup {k : ρ(k) ≠ 0} = L - 1 < ∞.    (2.1.6)

Since ρ(k) is just the sampled version of the deterministic autocorrelation function of h(t), it will henceforth be called the autocorrelation function. Also of frequent interest is the normalized autocorrelation function, defined as

    ρ̃(k) = ρ(k)/ρ(0).    (2.1.7)

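As an example of Definition 2.1.1, the sketch below evaluates the sampled autocorrelation of Eq. 2.1.3 for a hypothetical finite-energy ramp pulse spanning two symbol intervals, for which the interference is L-finite with L = 2.

```python
# Sampled autocorrelation rho(k) of Eq. 2.1.3 for a pulse spanning two symbol
# intervals (hypothetical ramp pulse), by numerical integration.
import numpy as np

T, dt = 1.0, 1e-3
t = np.arange(0.0, 2.0, dt)
h = 1.0 - t / 2.0                   # h(t) on [0, 2T); zero elsewhere

def rho(k):
    """(h(t), h(t - kT)) on the discrete grid."""
    shifted = np.interp(t - k * T, t, h, left=0.0, right=0.0)  # h(t - kT)
    return float(np.sum(h * shifted) * dt)

print([round(rho(k), 4) for k in range(-3, 4)])  # non-zero only for |k| <= 1: L = 2
print(rho(1) / rho(0))                           # normalized autocorrelation, Eq. 2.1.7
```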
It is desired to design receiver structures for making a decision on each of the data digits B_j, 1 ≤ j ≤ N, based on the reception x(t). Two receivers, optimized with respect to different criteria, will now be considered:

2.1.1 Bit Detector. The first receiver, which will be called the bit detector, minimizes the probability of error in detecting each B_k, 1 ≤ k ≤ N. Equivalently, it minimizes the total expected number of errors in the detection of the entire data sequence (B_1 ⋯ B_N). It is called the bit detector because it detects each data digit, or bit in the case of M = 2, individually.

According to the work of Section 1.3, the bit detector processes the waveform x(t) to form the M a posteriori probabilities

    P(B_k = b_j | x(t)), 1 ≤ j ≤ M,    (2.1.8)

and makes the decision B_k = b_j if and only if

    P(B_k = b_j | x(t)) > P(B_k = b_i | x(t)), 1 ≤ i ≠ j ≤ M.    (2.1.9)

Often it is more convenient to calculate this probability multiplied by P(x(t)), which is independent of B_k, so that the decision is based equivalently on

    P(B_k | x(t)) P(x(t)) = P(x(t) | B_k) P(B_k).    (2.1.10)

To explicitly calculate this probability, divide the set of M^N data sequences into M sets S_1, S_2, …, S_M, where

    S_i = {B_1 ⋯ B_N | B_k = b_i}.    (2.1.11)

Each S_i contains M^{N-1} signals. Proceeding formally,

    P(x(t) | B_k = b_i) P(B_k = b_i) = P(x(t), S_i) = Σ_{ψ∈S_i} P(x(t) | ψ) P(ψ),    (2.1.12)
since the elements of S_i are mutually exclusive and exhaustive. For each ψ ∈ S_i the signal is completely specified.

To proceed, observe that, since the waveforms {h(t - jT)}_{j=1}^N can be assumed to be linearly independent, all signals lie in an N-dimensional subspace of L₂. To obtain an orthogonal basis it is necessary to apply the Gram-Schmidt orthogonalization process to {h(t - jT)}_{j=1}^N. By this process an orthonormal basis {φ_j(t)}_{j=1}^N can be obtained, where each φ_j(t) is generated iteratively by

    φ_j(t) = [h(t - jT) - Σ_{i=1}^{j-1} (h(t - jT), φ_i(t)) φ_i(t)] / ‖h(t - jT) - Σ_{i=1}^{j-1} (h(t - jT), φ_i(t)) φ_i(t)‖, j = 1, …, N.    (2.1.13)

With respect to this basis, each signal waveform s(t) corresponding to each ψ ∈ S_i can be written

    s(t) = Σ_{j=1}^N C_j φ_j(t),    (2.1.14)

where

    C_j = (s(t), φ_j(t)).    (2.1.15)

It is also convenient to define

    x_j = (x(t), φ_j(t))    (2.1.16)

and
    n_j = (n(t), φ_j(t)).    (2.1.17)

Wozencraft and Jacobs [1] have shown that {x_j}_{j=1}^N constitutes a sufficient statistic for the calculation of P(x(t)|ψ).* Furthermore, since the n_j are jointly independent Gaussian random variables with mean zero and variance N₀/2, it follows from the relation

    x_j = C_j + n_j, j = 1, 2, …, N,    (2.1.18)

that the desired probability is

    P(x(t)|ψ) = (πN₀)^{-N/2} exp[-(1/N₀) Σ_{j=1}^N (x_j - C_j)²].    (2.1.19)

By Parseval's relationship,

    Σ_{j=1}^N (x_j - C_j)² = ‖x(t) - s(t)‖²,

and hence

    P(x(t)|ψ) = (πN₀)^{-N/2} exp[-(1/N₀) ‖x(t) - s(t)‖²].    (2.1.20)

Finally, the desired probability is obtained from Eq. 2.1.12,

    P(x(t), B_k = b_i) = (πN₀)^{-N/2} Σ_{ψ∈S_i} P(ψ) exp[-(1/N₀) ‖x(t) - s(t)‖²].    (2.1.21)

*Observe that this property depends strongly on our model, which completely specifies the noise statistics. If the noise has any unknown parameters, then this statement is not true.
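The Gram-Schmidt construction of Eq. 2.1.13 is easily carried out numerically; the sketch below orthonormalizes four translates of the same hypothetical ramp pulse used earlier and verifies that the resulting set is orthonormal.

```python
# Gram-Schmidt orthonormalization (Eq. 2.1.13) of the translates h(t - jT),
# on a discrete time grid; the pulse is a hypothetical two-symbol ramp.
import numpy as np

dt, T, N = 1e-3, 1.0, 4
t = np.arange(0.0, (N + 2) * T, dt)

def pulse(tau):
    """h(tau): ramp on [0, 2T), zero elsewhere."""
    return np.where((tau >= 0) & (tau < 2 * T), 1.0 - tau / (2 * T), 0.0)

def inner(u, v):
    return float(np.sum(u * v) * dt)

basis = []
for j in range(N):
    h_j = pulse(t - j * T)
    v = h_j - sum(inner(h_j, phi) * phi for phi in basis)  # remove projections
    basis.append(v / np.sqrt(inner(v, v)))                 # normalize: phi_j(t)

gram = np.array([[inner(p, q) for q in basis] for p in basis])
print(np.round(gram, 6))    # identity matrix: {phi_j} is orthonormal
```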
The bit detector forms a weighted sum of terms involving the L₂ distance (metric) between the received waveform and each of the possible transmitted signals for which B_k = b_j. The L₂ metric in the exponent of Eq. 2.1.21 can be put in more familiar form. In particular, note that

    ‖x(t) - s(t)‖² = ‖x(t)‖² + ‖s(t)‖² - 2(x(t), s(t)).    (2.1.22)

The first term, ‖x(t)‖², is a constant in the summation of Eq. 2.1.21 and can be ignored, while the second term is dependent on the summation index ψ but not on the reception x(t) and should be grouped with P(ψ). Thus, Eq. 2.1.21 is proportional to

    P(x(t), B_k = b_i) ∝ Σ_{ψ∈S_i} P(ψ) e^{-‖s(t)‖²/N₀} e^{2(x(t), s(t))/N₀}.    (2.1.23)

The term (x(t), s(t)) in the final exponent is just the output of a filter matched to s(t). But this term can be put in more basic form by simply putting s(t) in terms of the {h(t - jT)},

    (x(t), s(t)) = (x(t), Σ_{j=1}^N B_j h(t - jT)) = Σ_{j=1}^N B_j (x(t), h(t - jT)),    (2.1.24)

where the inner product has been moved inside the

summation, and each term (x(t), h(t - jT)) is the output of a filter matched to h(t - jT). Equations 2.1.23 and 2.1.24 lead to the conclusion that the N statistics
(x(t), h(t - jT)),  1 ≤ j ≤ N    (2.1.25)
constitute a sufficient statistic for the detection of (B_1 ... B_N). The matched filter outputs of Eq. 2.1.25 are the same statistics as would be required in the absence of intersymbol interference, the difference being that in the intersymbol interference case these statistics are combined in the highly nonlinear fashion of Eq. 2.1.23, whereas when there is no intersymbol interference, a simple threshold applied to each of the matched filter outputs is all that is required.
2.1.2 Block Detector. The second receiver considered, called the block detector, minimizes the probability of at least one error in the detection of the entire sequence of data digits. It makes a simultaneous decision on all the data digits (B_1 ... B_N) and minimizes the probability of error in making that decision. Since to the block detector making an error on any number of B_j's, 1 ≤ j ≤ N, is the same as making an error on just a single B_j, the block detector will make more frequent errors in the sequence of N data digits than the bit detector. According to Section 1.3, the block detector chooses (B_1 ... B_N)

to maximize the a posteriori probability
P(B_1 ... B_N | x(t)).    (2.1.26)
As with the bit detector, the approach is to actually calculate a constant (independent of (B_1 ... B_N)) times this probability,
P(B_1 ... B_N | x(t)) P(x(t)) = P(x(t) | B_1 ... B_N) P(B_1 ... B_N).    (2.1.27)
Since for each (B_1 ... B_N) the signal is completely specified, the receiver calculates for each β ∈ ∪_{i=1}^{M} S_i
P(B_1 ... B_N | x(t)) P(x(t)) = (π N_0)^{-N/2} exp{-(1/N_0) ||x(t) - s(t)||^2} P(B_1 ... B_N)    (2.1.28)
and chooses the β for which the quantity on the right is maximum. Equivalently the receiver can maximize
-(1/N_0) ||x(t) - s(t)||^2 + ln P(B_1 ... B_N)
or
(1/N_0) [2(x(t), s(t)) - ||s(t)||^2] + ln P(B_1 ... B_N)    (2.1.29)
since ln(·) is a monotonically increasing function of its argument and ||x(t)||^2 is a constant.
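The assembly of the statistic of Eq. 2.1.29 from the matched-filter outputs is easily made concrete: by Eq. 2.1.24, (x(t), s(t)) = Σ_j B_j m_j with m_j = (x(t), h(t - jT)), and ||s(t)||^2 = Σ_j Σ_k B_j B_k p(j - k). The following sketch enumerates the statistic for equally likely binary sequences; the autocorrelation and matched-filter outputs shown are hypothetical.

    from itertools import product

    # A sketch of the block-detector statistic 2(x, s) - ||s||^2 of Eq. 2.1.29,
    # for equally likely binary sequences; p(k) and the matched-filter outputs
    # m_j = (x(t), h(t - jT)) below are hypothetical.
    N = 4
    p = {0: 1.0, 1: 0.3, -1: 0.3}          # an L = 2 autocorrelation p(k)
    m = [0.9, -1.2, 0.8, 1.1]              # assumed matched-filter outputs

    def statistic(B):
        xs = sum(B[j] * m[j] for j in range(N))               # (x, s), Eq. 2.1.24
        ss = sum(B[j] * B[k] * p.get(j - k, 0.0)
                 for j in range(N) for k in range(N))          # ||s||^2
        return 2.0 * xs - ss

    best = max(product((+1, -1), repeat=N), key=statistic)
    print(best)                            # the block-detector decision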

The only dependence of Eq. 2.1.29 on the observation x(t) is through the inner product (x(t), s(t)). As was shown in Section 2.1.1, this inner product can be calculated for each of the M^N signals s(t) by combining the outputs of the N filters matched to h(t - jT), 1 ≤ j ≤ N, as given by Eq. 2.1.25. Thus, as with the bit detector, the output of these N matched filters constitutes a sufficient statistic for the decision on (B_1 ... B_N). When each of the M^N signals is equally likely, the block detector has two special properties which arise because the P(B_1 ... B_N) term does not affect the maximization and can be ignored. The first property is that the receiver simply chooses the signal which is closest in L2 metric to the received waveform. The second property, and one which is very important in practice, is that the 1/N_0 term in Eq. 2.1.29 does not affect the maximization and can be dropped. This means that in this case the implementation of the block detector does not require knowledge of the noise variance.
A word is in order regarding the calculation of the error probability of the bit detector and block detector. The bit detector and block detector both calculate the same linear functionals on the observation, in the form of L2 inner products with each of the possible signals, but the bit detector forms a nonlinear weighting of these linear functionals prior to the decision thresholds, while the block detector does not. Thus, one would expect the calculation of the error probability to be difficult in either case.

The more important results of this section are summarized below:
1. The bit detector minimizes the expected number of errors in the detection of (B_1 ... B_N). It is a nonlinear receiver which requires knowledge of the noise spectral density height N_0/2.
2. The block detector minimizes the probability of one or more errors in the entire data sequence. It is linear, much simpler to implement than the bit detector, and does not require knowledge of the noise power when all signals are equally likely.
3. The performance of either of these receivers depends only on the noise power spectral density N_0/2 and the autocorrelation function p(k).
4. A set of sufficient statistics for the realization of either of these receivers is (x(t), h(t - kT)), 1 ≤ k ≤ N. That is, both receivers can be implemented by combining the outputs of a set of filters matched to the time translates of h(t).
5. The complexity of the signal space geometry would seem to preclude the exact calculation of the error probability of either of these receivers, except in degenerate cases.*
The receiver realizations discussed in this section, while interesting and instructive, are not as suitable for receiver implementation as the iterative receivers yet to be developed. In Section 2.6 a restrictive bandlimiting assumption will be made as a first step
*Kimball [7] has managed to calculate the performance of the optimum bit detector with L = 2 and time-limited waveforms. His method does not generalize to other cases.

toward practical receiver implementation. Following that, in Chapter 3, is a discussion of several practical methods of implementation, and in Chapter 6 the results of performance comparisons made by computer simulation are given.
2.2 Properties of the Autocorrelation Function
In Section 2.1 the autocorrelation function p(k) was defined. In this section necessary and sufficient conditions for a given function p(k) to be an autocorrelation will be established and interpreted in the frequency domain.
Several properties of the autocorrelation follow immediately from the definition of Eq. 2.1.3. First of all, it is necessary for p(0) to be positive, since it is the energy of h(t),
p(0) = ||h(t)||^2 > 0.    (2.2.1)
Furthermore, by the Schwarz inequality,
|(h(t), h(t + kT))| ≤ ||h(t)|| · ||h(t + kT)||
so that
|p(k)| ≤ p(0)    (2.2.2)
for all k. It is natural to ask if the conditions of Eqs. 2.2.1 and 2.2.2 are sufficient; that is, if a given function p(k) satisfies Eqs. 2.2.1 and 2.2.2, then is it necessary that there exist an h(t) ∈ L2 which

has that p(k) as an autocorrelation? The answer to that question is no, because p(k) must be a non-negative definite function, satisfying the conditions of the following definition:
Definition 2.2.1: A function p(k) mapping the integers into the reals which satisfies p(-k) = p(k) for all k is non-negative definite if for every N and every non-trivial set of complex numbers {α_i}_{i=1}^{N},
Σ_{m=1}^{N} Σ_{n=1}^{N} α_m α_n* p(m - n) ≥ 0.    (2.2.3)
When strict inequality holds in Eq. 2.2.3, p(k) is said to be positive definite.
It will now be shown that the non-negative definite conditions of Definition 2.2.1 are stronger than Eqs. 2.2.1 and 2.2.2. That is, it will be shown that a non-negative definite function automatically satisfies Eqs. 2.2.1 and 2.2.2. Let p(k) be non-negative definite, and let N = 1 and α_1 = 1 in Eq. 2.2.3. Then Eq. 2.2.3 becomes
p(0) ≥ 0    (2.2.4)
which is the same as Condition 2.2.1. Similarly, let N = k + 1 and
α_m = +1, m = 1;  α_m = +1, m = k + 1;  α_m = 0, otherwise

in Eq. 2.2.3, which then becomes
2p(k) + 2p(0) ≥ 0
or
p(k) ≥ -p(0).    (2.2.5)
Finally, let N = k + 1 and
α_m = +1, m = 1;  α_m = -1, m = k + 1;  α_m = 0, otherwise
in Eq. 2.2.3, which gives
p(k) ≤ p(0).    (2.2.6)
Condition 2.2.2 follows immediately by combining Eqs. 2.2.5 and 2.2.6.
To prove that p(k) must be a non-negative definite function, let N and {α_i}_{i=1}^{N} be arbitrary as in Definition 2.2.1, and observe that
||Σ_{m=1}^{N} α_m h(t - mT)||^2 = Σ_{m=1}^{N} Σ_{n=1}^{N} α_m α_n* (h(t - mT), h(t - nT)) = Σ_{m=1}^{N} Σ_{n=1}^{N} α_m α_n* p(m - n) ≥ 0.    (2.2.7)

This proves the following theorem:
Theorem 2.2.1: The autocorrelation p(k) of h(t) ∈ L2 is non-negative definite.
Unfortunately, the converse of Theorem 2.2.1 is not true: there is not necessarily an h(t) ∈ L2 which has a given non-negative definite autocorrelation. However, for L-finite intersymbol interference, stronger results can be obtained as a consequence of the definition
G(λ) = Σ_{k=-(L-1)}^{L-1} p(k) e^{i2πkλ},  |λ| ≤ 1/2    (2.2.8)
and the following Lemma:
Lemma 2.2.1: If p(k) = 0, |k| ≥ L, and G(λ) is given by Eq. 2.2.8, then p(k) is non-negative definite if and only if
G(λ) ≥ 0,  |λ| ≤ 1/2.    (2.2.9)
Proof: Multiplying Eq. 2.2.8 by e^{-i2πkλ} and integrating, the inversion of Eq. 2.2.8 is
p(k) = ∫_{-1/2}^{1/2} e^{-i2πkλ} G(λ) dλ.    (2.2.10)
For arbitrary {α_m}_{m=1}^{N},

Σ_{m=1}^{N} Σ_{n=1}^{N} α_m α_n* p(m - n) = ∫_{-1/2}^{1/2} |Σ_{m=1}^{N} α_m e^{-i2πmλ}|^2 G(λ) dλ ≥ 0    (2.2.11)
from Eq. 2.2.10. Conversely, if p(k) is non-negative definite, let α_m = e^{i2πmλ}, from which
0 ≤ (1/N) Σ_{m=1}^{N} Σ_{n=1}^{N} p(m - n) e^{i2π(m-n)λ} = Σ_{|m| ≤ L-1} p(m) e^{i2πmλ} (1 - |m|/N),  N > L.    (2.2.12)
Taking the limit of Eq. 2.2.12 as N → ∞,
G(λ) ≥ 0.    (2.2.13)
QED
The following theorem is a consequence of Lemma 2.2.1:
Theorem 2.2.2: Given a non-negative definite function p(k) such that p(k) = 0, |k| ≥ L, there exists an h(t) ∈ L2 such that
p(k) = (h(t), h(t - kT)).    (2.2.14)

Fig. 2.1. L = 3; region of allowable ξ_1 and ξ_2

In the more general case of intersymbol interference which is not L-finite, the situation is more complicated. The complete answer is given by Doob [18] in the following theorem, which is stated without proof:
Theorem 2.2.3: A real symmetric p(k) is non-negative definite if and only if there exists a monotone non-decreasing function F(λ), |λ| ≤ 1/2, such that
p(k) = ∫_{-1/2}^{1/2} e^{i2πλk} dF(λ).    (2.2.21)
Clearly, for a given non-negative definite p(k), if the F(λ) guaranteed by Theorem 2.2.3 should happen to be absolutely continuous, then its derivative exists almost everywhere and Eq. 2.2.21 can be rewritten as
p(k) = ∫_{-1/2}^{1/2} e^{i2πλk} F'(λ) dλ.    (2.2.22)
If, in addition, F'(λ) is square-integrable, then there is an h(t) ∈ L2 with autocorrelation p(k), the h(t) being given by Eq. 2.2.15 with G(λ) replaced by F'(λ). However, when F(λ) is not absolutely continuous, h(t) would have to be found outside L2, in the class of infinite energy and finite power waveforms. From a practical viewpoint, this mathematical difficulty should not be of concern, since all waveforms dealt with here will be in L2, and hence a p(k) with a

corresponding F(λ) which is not absolutely continuous will not be encountered.
In the case of L2 waveforms, the F'(λ) of Eq. 2.2.22 has an important interpretation in terms of the power spectrum of h(t). First of all, it is well known that the autocorrelation p(k) is the inverse Fourier transform of |H(ω)|^2, the power spectrum of h(t). This can be easily derived by applying Parseval's Theorem to the autocorrelation integral,
p(k) = ∫_{-∞}^{∞} h(t) h(t - kT) dt = (1/2π) ∫_{-∞}^{∞} H(ω) H*(ω) e^{iωkT} dω = (1/2π) ∫_{-∞}^{∞} |H(ω)|^2 e^{iωkT} dω.    (2.2.23)
Equation 2.2.23 can be put in the form of Eq. 2.2.22 by dividing the ω-axis into intervals of length 2π/T,
p(k) = (1/2π) Σ_{m=-∞}^{∞} ∫_{(2m-1)π/T}^{(2m+1)π/T} |H(ω)|^2 e^{iωkT} dω = (1/2π) ∫_{-π/T}^{π/T} Σ_{m=-∞}^{∞} |H(ω + 2πm/T)|^2 e^{iωkT} dω = (1/2π) ∫_{-π/T}^{π/T} P(ω) e^{iωkT} dω    (2.2.24)

where
P(ω) = Σ_{m=-∞}^{∞} |H(ω + 2πm/T)|^2.    (2.2.25)
Reversal of integration and summation in Eq. 2.2.24 is justified by Fubini's Theorem because h(t) ∈ L2 and thus
p(0) = (1/2π) ∫_{-∞}^{∞} |H(ω)|^2 dω < ∞.    (2.2.26)
Comparing Eq. 2.2.22 and Eq. 2.2.24, it is apparent that P(ω) of Eq. 2.2.25 is the F'(λ) guaranteed by Theorem 2.2.3. The condition that F(λ) be monotone non-decreasing is equivalent to the trivial condition that P(ω) ≥ 0. Further interpretation of P(ω) results when it is expanded in a Fourier series (which is possible since it is defined on a finite interval),
P(ω) = Σ_{k=-∞}^{∞} a_k e^{-iωkT},  |ω| ≤ π/T    (2.2.27)
where
a_k = (T/2π) ∫_{-π/T}^{π/T} P(ω) e^{iωkT} dω.    (2.2.28)
From Eq. 2.2.24 and Eq. 2.2.28, it follows that

P(ω) = T p(0) + 2T Σ_{m=1}^{∞} p(m) cos(ωmT),  |ω| ≤ π/T    (2.2.29)
and P(ω) is just a Fourier series expansion with coefficients proportional to p(k).
The importance of P(ω) should now be apparent. There is a one-to-one relationship between P(ω) and p(k) through Eq. 2.2.29, and P(ω) is related to H(ω) through Eq. 2.2.25. Just as the error probabilities of the bit and block detectors are determined by p(k), so too are they determined by P(ω). There are many h(t)'s in L2 corresponding to a given P(ω), but they all have equivalent signal-space geometries. Since P(ω) is related to the power spectrum of h(t) through Eq. 2.2.25 and is equal to the power spectrum |H(ω)|^2 when h(t) is bandlimited to π/T radians/second, it will be called the equivalent power spectrum of h(t).
For large L or infinite L, the conditions of Definition 2.2.1 are difficult to check by any method. Accordingly, the following Theorem gives a simple sufficient condition for a given p(k) to be non-negative definite which suffices for many practical purposes:
Theorem 2.2.5: If p(k) satisfies
Σ_{m≠0} |p(m)| ≤ p(0)    (2.2.30)
and is symmetric, then it is non-negative definite.

Proof: Define G(λ) analogously to Eq. 2.2.8, but without the L-finite assumption,
G(λ) = Σ_{m=-∞}^{∞} p(m) e^{i2πmλ},  |λ| ≤ 1/2.    (2.2.31)
Then, by Theorem 2.2.3 it suffices to show that G(λ) ≥ 0. From Eq. 2.2.30, it follows that Eq. 2.2.31 is absolutely convergent, since |p(m) e^{i2πmλ}| = |p(m)| and
Σ_{m=-∞}^{∞} |p(m)| ≤ 2p(0) < ∞
from Eq. 2.2.30. Hence,
|Σ_{m≠0} p(m) e^{i2πmλ}| ≤ Σ_{m≠0} |p(m)| ≤ p(0)    (2.2.32)
and Eq. 2.2.32 implies that G(λ) = p(0) + Σ_{m≠0} p(m) e^{i2πmλ} ≥ 0. QED.
To summarize this section, precise conditions which the autocorrelation function must satisfy have been determined. These conditions have been interpreted in the frequency domain in terms of the equivalent power spectrum of h(t). The primary applications of these results will be in Section 2.3, where the sufficient condition of Theorem 2.2.5 will be needed, and in Section 2.4, where it will be shown that the transversal filter receiver error probability approaches 1/2 as p(k) approaches the boundary of its non-negative definite region.
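Both tests developed in this section are straightforward to carry out numerically. The sketch below checks a hypothetical L-finite autocorrelation against the condition G(λ) ≥ 0 of Lemma 2.2.1 (on a finite grid of λ) and against the sufficient condition of Theorem 2.2.5.

    import numpy as np

    # A sketch of the non-negative definiteness tests: G(lambda) >= 0 of
    # Lemma 2.2.1, checked on a grid, and the sufficient condition of
    # Theorem 2.2.5.  p = [p(0), p(1), ..., p(L-1)] is hypothetical.
    def G(p, lam):
        """G(lambda) = sum_{|k| < L} p(k) e^{i 2 pi k lambda}, p(-k) = p(k)."""
        k = np.arange(1, len(p))
        return p[0] + 2.0 * np.sum(p[1:] * np.cos(2.0 * np.pi * k * lam))

    def nonneg_definite(p, grid=4096):
        """Exact for L-finite p(k), up to the resolution of the lambda grid."""
        return all(G(p, lam) >= 0.0 for lam in np.linspace(-0.5, 0.5, grid))

    def sufficient(p):
        """Theorem 2.2.5: sum_{m != 0} |p(m)| <= p(0)."""
        return 2.0 * np.sum(np.abs(p[1:])) <= p[0]

    p = np.array([1.0, 0.4, 0.1])          # hypothetical L = 3 autocorrelation
    print(sufficient(p), nonneg_definite(p))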

2.3 Reliability Bounds for Intersymbol Interference
In Section 2.1 problems involved in calculating the performance of the bit detector and block detector receivers were discussed. As was pointed out there, due to the complexity of the signal space geometry it is difficult to get upper bounds on the probability of error which do not exceed the actual probability of error by a substantial amount. In the next section, an upper bound will be derived by considering a linear receiver which eliminates intersymbol interference. However, as the intersymbol interference becomes very severe that bound approaches an error probability of 1/2.
It is much easier to obtain lower bounds on the reliability function in the presence of coding. The reliability function is a measure of the rate at which the probability of error approaches zero when block coding is used and the block length is allowed to grow without bound. To be more specific, it will be shown that, using a block error-correcting code with block length N, the probability of error is bounded by
P_e ≤ exp{-N E(R)}    (2.3.1)
where R is the code rate. The largest such E(R) is called the

reliability function [3]. The largest R for which the reliability function is positive is the channel capacity. Because of the asymptotic nature of Eq. 2.3.1, the derivation of lower bounds on E(R) does not require very precise upper bounds on the probability of error.
A technique similar to that of Schiff and Wolf [5] will be used. The type of block codes considered will be restricted to binary group codes with M = 2 and B_k = ±1, and the error-correcting properties of those codes will be bounded using the Varshamov-Gilbert bound. Since the signal set is predetermined, the signal space geometry will be used to bound the probability of error in place of the random signal selection technique of Schiff and Wolf.
To proceed with the derivation, consider the special case of M = 2, b_1 = 1, b_2 = -1, and use a binary group code with block length N and k ≤ N information bits. When the jth bit of the code is a 1 bit, let B_j = b_1 = 1, and when it is a 0 bit, let B_j = b_2 = -1. According to custom [3], define the rate of the code in nats/dimension as
R = (k/N) ln 2.    (2.3.2)
The maximum possible rate is then ln 2 = 0.69. The Varshamov-Gilbert bound [17] then states that there exists a group code with minimum Hamming distance d if
((N - k)/N) ln 2 ≥ H((d - 2)/(N - 1))    (2.3.3)

where
(d - 2)/(N - 1) ≤ 1/2
and
H(p) = -p ln p - (1 - p) ln(1 - p)
is the entropy function. The code will then correct all patterns of t or fewer errors provided that [17]
2t + 1 ≤ d.    (2.3.4)
Using the fact that H(p) is a monotonically increasing function for p ≤ 1/2, letting N be large, and substituting Eqs. 2.3.2 and 2.3.4 in Eq. 2.3.3,
ln 2 - R ≥ H(2t/N).    (2.3.5)
The optimum receiver in the presence of block coding is the block detector of Section 2.1.2, where the signal set is restricted to code words. Instead of using the optimum receiver, the error probability can be bounded by using the block detector of Section 2.1 without restricting it to signals corresponding to code words. This "detection" portion of the receiver is then followed by a "decoder" which chooses the code word which is closest in Hamming distance to the binary sequence at the output of the detector. The Varshamov-Gilbert bound is applicable to the performance of this suboptimum receiver.

Denoting by S the set of 2^N possible signals, the detection part of the receiver chooses the s(t) ∈ S to satisfy
min_{s(t) ∈ S} ||x(t) - s(t)||^2    (2.3.6)
where equally likely signals have been assumed. The decoding portion of the receiver then chooses the code word closest in Hamming distance to the binary sequence generated by the detector part.
Assume that an arbitrary s_1(t) ∈ S is actually sent, and denote by C_j the event that s_j(t) ∈ S is the signal which satisfies Eq. 2.3.6 after the reception, given that s_1(t) was sent. The probability of error in the detection portion of the receiver conditional on s_1(t) being sent is
Pr{error | s_1(t) sent} = Pr{∪_{j=2}^{2^N} C_j} ≤ Σ_{j=2}^{2^N} Pr{C_j}    (2.3.7)
where the upper bound is a consequence of the subadditivity of the probability measure. Denote by A_ρ the set of signals corresponding to binary sequences at Hamming distance ρ from the code word corresponding to s_1(t). Then

P ≤ Σ_{ρ=t+1}^{N} (N choose ρ) ε^ρ    (2.3.15)
where
ε = exp{-(p(0)/N_0)(1 - Σ_{m≠0} |ξ_m|)}.
Now, let ε_1 = ε/(1 + ε) and write Eq. 2.3.15 as
P ≤ (1 + ε)^N Σ_{ρ=t+1}^{N} (N choose ρ) ε_1^ρ (1 - ε_1)^{N-ρ},  t + 1 = λN.    (2.3.16)
Note that Eq. 2.3.16 is just the complement of the cumulative distribution function of a binomial distribution. Applying the Chernoff bound [17] to Eq. 2.3.16, for λ > ε_1 = ε/(1 + ε) and μ = 1 - λ,
P ≤ (1 + ε)^N ε_1^{λN} (1 - ε_1)^{μN} λ^{-λN} μ^{-μN} ≤ exp{-N(-λ ln ε - H(λ))}    (2.3.17)
but noting that
-ln ε = (p(0)/N_0)(1 - Σ_{m≠0} |ξ_m|)
the bound on reliability is

E(R) ≥ λ (p(0)/N_0)(1 - Σ_{m≠0} |ξ_m|) - H(λ)    (2.3.18a)
where, from Eqs. 2.3.17, 2.3.3, and 2.3.13,
ε/(1 + ε) < λ ≤ 1/4    (2.3.18b)
ln 2 - R ≥ H(2λ)    (2.3.18c)
1 - Σ_{m≠0} |ξ_m| > 0.    (2.3.18d)
Equation 2.3.18a gives a lower bound on the reliability function. An upper bound on the same reliability function is somewhat easier to obtain. Specifically, it can be taken as the exact reliability function in the absence of intersymbol interference. This reliability function is well known, since the channel is then a binary symmetric channel with error probability
P_e = Φ(-√(2p(0)/N_0)).    (2.3.19)
The reliability is given by [3]
E(R) = T(λ) - H(λ)    (2.3.20)
where
ln 2 - R = H(λ)

and
T(λ) = -λ ln P_e - (1 - λ) ln(1 - P_e)
for
P_e ≤ λ ≤ √P_e / (√P_e + √(1 - P_e));
and
E(R) = ln 2 - R - 2 ln(√P_e + √(1 - P_e))    (2.3.21)
for
R ≤ ln 2 - H(√P_e / (√P_e + √(1 - P_e))).
Equations 2.3.18 and 2.3.20 are similar, in that both express E(R) in terms of the difference between a straight line and the entropy function H(λ). The graphical interpretation of these two relationships is shown in Fig. 2.2. With intersymbol interference (Fig. 2.2a), λ is determined by the intersection of the line ln 2 - R with H(2λ). Then, as long as ε/(1 + ε) < λ, where ε is given by Eq. 2.3.15, the lower bound on E(R) is found by subtracting H(λ) from the straight line -λ ln ε. If H(λ) > -λ ln ε the bound is not useful. Without intersymbol interference (Fig. 2.2b), the binary symmetric channel crossover probability P_e is determined from Eq. 2.3.19.

Fig. 2.2. Graphical interpretation of the reliability bounds: (a) with intersymbol interference; (b) without intersymbol interference

The T(λ) of Eq. 2.3.20 is the equation of a straight line tangent to H(λ) at λ = P_e. For λ ≤ √P_e / (√P_e + √(1 - P_e)), E(R) is determined by calculating T(λ) - H(λ) for the λ at the intersection of ln 2 - R with H(λ). For larger λ (smaller R), E(R) is linear.
In Figs. 2.3 and 2.4 the upper and lower bounds on reliability are plotted for S/N ratios of 5 and 10 dB. The S/N ratio referred to is 10 log(p(0)/N_0) for the upper bound (no interference) and 10 log[(p(0)/N_0)(1 - Σ_{m≠0} |ξ_m|)] for the lower bound. At 10 dB S/N ratio there is reasonable agreement between the upper and lower bounds, but at 5 dB the lower bound is very weak. The effective decrease in S/N ratio due to intersymbol interference, given by
-10 log(1 - Σ_{m≠0} |ξ_m|),
is plotted in Fig. 2.5 for L = 3 (and, letting ξ_2 = 0, for L = 2 also). For any ξ_1 and ξ_2, the value given there must be subtracted from 10 log(p(0)/N_0) before reading the lower bound in Figs. 2.3 and 2.4.
Finally, in Fig. 2.6 upper and lower bounds on channel capacity have been plotted. The upper bound is just the channel capacity of the binary symmetric channel.

Fig. 2.3. Reliability upper and lower bounds; effective signal-to-noise ratio = 5 dB; rate R in nats/digit

Fig. 2.4. Reliability upper and lower bounds; effective signal-to-noise ratio = 10 dB; rate R in nats/digit

Fig. 2.5. Effective decrease in S/N ratio for reliability bound with intersymbol interference

Fig. 2.6. Upper and lower bounds on channel capacity versus effective S/N ratio (dB)

The lower bound was obtained by finding the largest rate at which the lower bound on E(R) of Eq. 2.3.18 is positive at each effective S/N ratio. The lower bound on channel capacity is nearly the same as the upper bound for S/N ratios in excess of 10 dB, but otherwise the difference between the two is substantial. The lower bound is useless below an effective S/N ratio of 4 dB, since it is identically zero there.
In conclusion, upper and lower bounds on the reliability function for channels with intersymbol interference have been derived. The lower bound is useful for effective S/N ratios in excess of about 5 dB, but becomes the trivial lower bound of zero below that point. Due to the complexity of the signal space geometry, the determination of a non-trivial bound for the low effective S/N ratios would be difficult.
In the next section an upper bound on the probability of error will be determined in another way. A linear receiver will be derived which completely eliminates intersymbol interference in a special sense. The probability of error of this receiver can be calculated, and for small and moderate interference it is a reasonably accurate upper bound on the probability of error of the optimum receiver. Unfortunately, since the additive noise samples at the output of this receiver are correlated, the analysis does not lend itself to obtaining more precise lower bounds on the reliability function than those obtained in this section.
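The lower bound of Eq. 2.3.18 can be evaluated numerically as described above: λ is obtained from the intersection of ln 2 - R with H(2λ), and the bound -λ ln ε - H(λ) applies when λ > ε/(1 + ε). In the sketch below, snr denotes p(0)/N_0, xi_sum denotes Σ_{m≠0}|ξ_m|, the expression for ε is that of Eq. 2.3.15, and the sample values are hypothetical.

    import numpy as np
    from scipy.optimize import brentq

    # A numerical sketch of the lower bound of Eq. 2.3.18; snr stands for
    # p(0)/N0 and xi_sum for sum_{m != 0} |xi_m|.  Inputs are hypothetical.
    def H(q):
        """Entropy function in nats."""
        return -q * np.log(q) - (1.0 - q) * np.log(1.0 - q)

    def E_lower(R, snr, xi_sum):
        eps = np.exp(-snr * (1.0 - xi_sum))              # from Eq. 2.3.15
        # lambda from the intersection ln 2 - R = H(2 lambda), Eq. 2.3.18c
        lam = 0.5 * brentq(lambda q: H(q) - (np.log(2.0) - R),
                           1e-12, 0.5 - 1e-12)
        if lam <= eps / (1.0 + eps):                     # Eq. 2.3.18b violated
            return 0.0
        return max(0.0, -lam * np.log(eps) - H(lam))     # Eq. 2.3.18a

    print(E_lower(R=0.2, snr=10.0, xi_sum=0.3))          # snr = 10, i.e. 10 dB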

2.4 Transversal Filter Receiver
The transversal filter receiver was introduced in Section 1.6.1. In this section it will be proven that under certain conditions an infinite transversal filter can eliminate intersymbol interference in the special sense that the linear functional (x(t), g_k(t)) on the observation x(t) is independent of B_j, 1 ≤ j ≤ N, j ≠ k, where
g_k(t) = Σ_{i=-∞}^{∞} d_i h(t - (i + k)T)    (2.4.1)
if the d_i are properly chosen. The probability of error of the resulting receiver will be evaluated for L-finite intersymbol interference and small L, and the result will be interpreted in the frequency domain.
Results similar to these have been obtained by Lucky [21]. However, the results obtained here differ from those of Lucky in several respects. First, Lucky considered the problem entirely in the frequency domain, whereas the transversal filter falls out more naturally in the time domain approach used here. Secondly, Lucky considered only the case of h(t) bandlimited to 2π/T radians/second, while the more general case of h(t) ∈ L2 is considered here. Finally, Lucky did not determine the error probability of the transversal filter receiver.
The determination of error probability for the transversal filter receiver is particularly important, since it represents an

upper bound on the error probability of the bit detector. Also, it points out dramatically the major shortcoming of the transversal filter receiver, which is that its error probability deteriorates as p(k) approaches the boundary of its non-negative definite region.
2.4.1 Derivation of the Transversal Filter. In this section it will be shown that a finite transversal filter of the form of Eq. 1.6.6 cannot eliminate intersymbol interference, but that an infinite transversal filter of the form of Eq. 1.6.5 can, as long as the equivalent power spectrum is bounded away from zero.
First, to show that a finite transversal filter cannot eliminate intersymbol interference in the special sense of Eq. 1.6.4 for N > 2(I + L) - 1 and L-finite intersymbol interference, let g_k(t) be as in Eq. 1.6.6 and consider
(g_k(t), h(t - (k + I + L - 1)T)) = (Σ_{i=-I}^{I} d_i h(t - (i + k)T), h(t - (k + I + L - 1)T)) = Σ_{i=-I}^{I} d_i p(I + L - 1 - i) = d_I p(L - 1).    (2.4.2)
Equation 2.4.2 establishes that (g_k(t), x(t)) will always be a function of B_{k+I+L-1}, and intersymbol interference cannot be eliminated in the sense of Eq. 1.6.4.
However, it will now be shown that, as a consequence of the

Hilbert space projection theorem, when P(ω) is bounded away from zero, intersymbol interference can be eliminated when I = ∞.
Let h(t) ∈ L2 and denote by W the subspace of L2 whose elements are finite linear combinations of elements of the set {h(t - kT)}_{k≠0}. Let W̄ denote the closure (with respect to the topology induced by the L2 metric) of W. Any f(t) ∈ W can be written in the form
f(t) = Σ_{i≠0} d_i h(t - iT).    (2.4.3)
Also, define W^⊥ as the subspace of L2 orthogonal to W̄,
W^⊥ = {f_1(t) ∈ L2 | (f_1(t), f_2(t)) = 0 for all f_2(t) ∈ W̄}.
If it can be shown that W̄ is a proper subspace of L2, then it will follow that W^⊥ ≠ {0} and there exists a non-zero g(t) ∈ W^⊥. This g(t) would satisfy
(g(t), h(t - kT)) = 0,  k ≠ 0    (2.4.4)
and thus would satisfy
(g_k(t), h(t - jT)) = 0,  k ≠ j    (2.4.5)

where
g_k(t) = g(t - kT).    (2.4.6)
This g(t) appears to be the solution to the problem of determining an element of L2 which is orthogonal to all the time translates of h(t) except one. However, this g(t) would not be of any use unless
(g(t), h(t)) ≠ 0    (2.4.7)
is satisfied, for only then will a filter matched to g_k(t) have a response to B_k and thereby provide the basis for making a decision on B_k. Not only must the question of whether there is any g(t) ∈ W^⊥ which satisfies Eq. 2.4.7 be addressed, but specifically it must be determined whether there is an element of W^⊥ of the form of the transversal filter of Eq. 2.4.1 which satisfies Eq. 2.4.7. Equation 2.4.1 can always be rewritten in the form
g(t) = d_0 h(t) - Σ_{m≠0} d_m h(t - mT)    (2.4.8a)
= d_0 h(t) - e(t)    (2.4.8b)
where e(t) ∈ W̄. However, not every filter of the form of Eq. 2.4.8(b) can necessarily be written in the form of Eq. 2.4.8(a) [43]. The framework has been laid for the following Theorem, which gives necessary and sufficient conditions for there to exist a g(t) ∈ W^⊥ which satisfies Eq. 2.4.7 and Eq. 2.4.8:

Theorem 2.4.1: The following statements are equivalent:
1. h(t) ∉ W̄,
2. There exists a g(t) ∈ W^⊥ which satisfies Eq. 2.4.7,
3. There exists a g(t) ∈ W^⊥ of the form of Eq. 2.4.8 which satisfies Eq. 2.4.7.
Proof: To show the equivalence of 1 and 2, the key observation is that (W^⊥)^⊥ = W̄, since W̄ is a closed subspace. Thus, if h(t) ∉ W̄, then W̄ ≠ L2, W^⊥ ≠ {0}, and there exists at least one non-zero g(t) ∈ W^⊥. Since h(t) ∉ (W^⊥)^⊥, there exists at least one g(t) ∈ W^⊥ not orthogonal to h(t), i.e., which satisfies Eq. 2.4.7. Conversely, let g(t) ∈ W^⊥ satisfy Eq. 2.4.7. Then h(t) ∉ (W^⊥)^⊥ = W̄, and the equivalence of 1 and 2 is established.
To show the equivalence of 1 and 3, note that if 3 is satisfied, then that g(t) also satisfies 2, and hence 1 is satisfied. Conversely, by the Hilbert space projection theorem, any h(t) ∈ L2 has the unique direct-sum decomposition
h(t) = e(t) + g(t)    (2.4.9)
where e(t) ∈ W̄ and g(t) ∈ W^⊥. The g(t) of Eq. 2.4.9 satisfies Eq. 2.4.8 with d_0 = 1. To show that Eq. 2.4.7 is satisfied when h(t) ∉ W̄, note that from Eq. 2.4.9,

(g(t), h(t)) = (g(t), e(t)) + (g(t), g(t)) = ||g(t)||^2 = ||h(t) - e(t)||^2.    (2.4.10)
Since h(t) ∉ W̄ and e(t) ∈ W̄, it follows that h(t) ≠ e(t) and Eq. 2.4.7 is satisfied. QED.
Summarized in words, Theorem 2.4.1 states that whenever h(t) ∉ W̄ there is a filter which responds to h(t) but not to any of its T-second translates, and in particular there is a transversal filter of the form of Eq. 2.4.8(b) which has this property.* A filter with this property need not be a transversal filter. Kimball [7], in considering the special case of L = 2 and time-limited waveforms, gives an example of a filter which is not of the form of Eq. 2.4.8(a), but which nevertheless is in W^⊥ and satisfies Eq. 2.4.7.
For a particular h(t), the critical question is whether or not it is in W̄. The following theorem gives a sufficient condition for h(t) ∉ W̄:
Theorem 2.4.2: If there is a P_0 > 0 such that P(ω) ≥ P_0 for |ω| ≤ π/T, then h(t) ∉ W̄.
*Naylor and Sell [43] give sufficient conditions for when Eq. 2.4.8(b) can be written in the form of a transversal filter, Eq. 2.4.8(a).

Proof: Since every element of W̄ is the limit of a sequence of elements of W, it suffices to show that there is a δ > 0 such that
||h(t) - Σ_{k≠0} a_k h(t - kT)||^2 ≥ δ    (2.4.11)
for every choice of N and {a_k}_{k≠0}. Eq. 2.4.11 can be established by using Parseval's Theorem,
||h(t) - Σ_k a_k h(t - kT)||^2 = (1/2π) ∫_{-∞}^{∞} |H(ω)|^2 |1 - Σ_k a_k e^{-iωkT}|^2 dω
= (1/2π) ∫_{-π/T}^{π/T} P(ω) |1 - Σ_k a_k e^{-iωkT}|^2 dω
≥ (P_0/2π) ∫_{-π/T}^{π/T} |1 - Σ_k a_k e^{-iωkT}|^2 dω ≥ P_0/T
where the fact that
(T/2π) ∫_{-π/T}^{π/T} |1 - Σ_k a_k e^{-iωkT}|^2 dω = 1 + Σ_k |a_k|^2 ≥ 1
has been used, and where |H(ω)|^2 has been folded into

P(ω) in exactly the same manner as in Eq. 2.2.24. Thus, Eq. 2.4.11 is satisfied with δ = P_0/T. QED.
It is worth noting that the condition of Theorem 2.4.2 that P(ω) be bounded away from zero is the same as the condition P(ω) > 0 when P(ω) is a continuous function, since a continuous function on a compact set assumes its infimum.
Theorem 2.4.1 is an existence theorem which guarantees that under certain conditions there is a transversal filter which eliminates intersymbol interference. It remains to actually find the tap-gains {d_m}. This can be accomplished by assuming a filter of the form of Eq. 2.4.8, and constraining it to be orthogonal to all the time translates {h(t - kT)}_{k≠0} of h(t),
(g(t), h(t - kT)) = p(k) - Σ_{m≠0} d_m p(k - m) = 0,  k ≠ 0.    (2.4.12)
Equation 2.4.12 is an infinite system of linear equations, which can be solved using the method of bilateral z-transforms. More important than the explicit solution of Eq. 2.4.12 is the determination of the error probability of the transversal filter receiver. In the following section this will be accomplished without explicitly solving for the tap-gains.
2.4.2 Performance Evaluation. The performance of the

transversal filter receiver derived in Section 2.4.1 can be evaluated without the explicit calculation of the tap-gains {d_m}. This will be accomplished in this section for L-finite intersymbol interference with M = 2 and b_1 = 1, b_2 = -1.
The decision axis of the transversal filter is
L_k = (x(t), g(t - kT))    (2.4.13)
where g(t) is given by Eq. 2.4.8. A single threshold is to be applied to L_k to make a decision on B_k. If it is assumed that P(B_k = b_1) = P(B_k = b_2) = 1/2, then the optimum threshold is at zero, which means that the decision B_k = b_1 is made if and only if
L_k > 0.    (2.4.14)
The form of L_k can be determined more explicitly using Eq. 2.1.1 and Eq. 2.4.5 as
L_k = B_k (g(t - kT), h(t - kT)) + (g(t - kT), n(t)) = B_k (g(t), h(t)) + (g(t - kT), n(t)).    (2.4.15)
L_k is a Gaussian random variable, since it is a linear functional of a Gaussian process. Its mean value is
E(L_k) = B_k (g(t), h(t))    (2.4.16)
and its variance is the variance of the second term in Eq. 2.4.15,

Var(L_k) = (N_0/2) ||g(t)||^2 = (N_0/2) (g(t), h(t))    (2.4.17)
from Eq. 2.4.10. By direct calculation,
(g(t), h(t)) = p(0) - Σ_{m≠0} d_m p(m) = p(0) η    (2.4.18)
where
η = 1 - Σ_{m≠0} d_m ξ_m.    (2.4.19)
The probability of error, using the decision rule of Eq. 2.4.14, is then
P_e = Φ(-√η d)    (2.4.20)
where
Φ(x) = ∫_{-∞}^{x} (1/√(2π)) e^{-u²/2} du    (2.4.21)
and
d = √(2p(0)/N_0)    (2.4.22)
is related to the S/N ratio. Equation 2.4.20 reduces to the probability

of error of a matched filter receiver in the absence of intersymbol interference when η = 1. Thus, the parameter η represents an effective decrease in S/N ratio due to intersymbol interference.
All that remains in the determination of the error probability is the evaluation of η. One method would be to find the tap-gains {d_m} from Eq. 2.4.12 and substitute into Eq. 2.4.19. A much easier method is as follows: Observe that Eq. 2.4.12 can be written as
p(k) - Σ_m d_m p(k - m) = p(0) η δ_{k,0}    (2.4.23)
where the additional constraint
d_0 = 0    (2.4.24)
is required. Then, taking the bilateral z-transform of both sides of Eq. 2.4.23, it becomes
R(z) - D(z) R(z) = p(0) η    (2.4.25)
where
R(z) = Σ_m z^m p(m)    (2.4.26)
and
D(z) = Σ_m z^m d_m.    (2.4.27)
Solving Eq. 2.4.25 for D(z), it is

D(z) = 1 - p(0) η / R(z).    (2.4.28)
The parameter η can then be determined from Eq. 2.4.24 and Eq. 2.4.28.
Consider now the special case of L-finite intersymbol interference. Evaluating Eq. 2.4.26 for this case,
R(z) = Σ_{m=-L+1}^{L-1} z^m p(m) = (p(L-1)/z^{L-1}) Π_{k=1}^{2L-2} (z - z_k).    (2.4.29)
It is apparent from Eq. 2.4.29 that for each root z_k, 1/z_k is also a root. Furthermore, the conditions of Theorem 2.4.2 cannot be satisfied if there is a root of Eq. 2.4.29 on the unit circle. To see this, assume that z = e^{iθ} is a root for some |θ| ≤ π and substitute in Eq. 2.4.29,
0 = Σ_{m=-L+1}^{L-1} p(m) e^{imθ} = P(θ/T)/T
which violates the condition of Theorem 2.4.2. Therefore, assume, without loss of generality, that
|z_k| < 1,  1 ≤ k ≤ L-1
z_{k+L-1} = 1/z_k,  1 ≤ k ≤ L-1.    (2.4.30)

Fig. 2.7. Effective decrease in S/N ratio for transversal filter receiver (L = 2)

Fig. 2.8. Effective decrease in S/N ratio for transversal filter receiver (L = 3)

For fixed ξ_1, the minimum effective decrease in S/N ratio occurs for a positive ξ_2, which increases with |ξ_1|. Referring back to Fig. 2.1, the (ξ_1, ξ_2) curve corresponding to the minimum probability of error lies approximately in the center of the non-negative definite (ξ_1, ξ_2) region.
2.4.3 Frequency Domain Interpretation. The interpretation of the results of Section 2.4.1 in the frequency domain leads to additional insight into the results obtained there. The relationship which the tap-gains {d_m} must satisfy is Eq. 2.4.23. Substituting for p(k) from Eq. 2.2.24, Eq. 2.4.23 becomes
(1/2π) ∫_{-π/T}^{π/T} P(ω) [e^{iωkT} - Σ_{m≠0} d_m e^{iω(k-m)T}] dω = p(0) η δ_{k,0}.    (2.4.36)
Since {e^{iωkT}} is a complete orthogonal set in L2(-π/T, π/T), Eq. 2.4.36 can only be satisfied if
P(ω) D(ω) = T p(0) η,  |ω| ≤ π/T    (2.4.37)
where
D(ω) = 1 - Σ_{m≠0} d_m e^{-iωmT}    (2.4.38)
is the frequency response of the transversal filter. It is apparent from Eq. 2.4.37 that the function of the transversal filter is to flatten the equivalent power spectrum of the channel. Accordingly, the

transversal filter has a frequency response which is proportional to the inverse of the equivalent power spectrum of the channel.
It is also evident that if P(ω_0) = 0 for some |ω_0| ≤ π/T, no choice of D(ω) can satisfy Eq. 2.4.37. In order for this to happen, it is necessary and sufficient that
H(ω_0 + 2πm/T) = 0,  m = 0, ±1, ±2, ...    (2.4.39)
When Eq. 2.4.39 is satisfied for some ω_0, a transversal filter cannot eliminate intersymbol interference. This establishes the necessity of a condition such as that found in Theorem 2.4.2, which requires that P(ω) must be bounded away from zero.
The error probability calculation of Section 2.4.2 showed that as the boundary of non-negative definite p(k) is approached, the probability of error approaches 1/2. This result can be interpreted in terms of Eq. 2.4.37, since the boundary of non-negative definite p(k) corresponds to a P(ω) which vanishes at one or more points. As P(ω) gets small at some point, the corresponding D(ω) of Eq. 2.4.37 gets large at that point, and the transversal filter admits a progressively larger noise variance into the decision. This is the major shortcoming of the transversal filter receiver: for a channel which has severe intersymbol interference, in the sense that p(k) is near the boundary of its non-negative definite region, the transversal filter receiver will necessarily have a large probability of error.
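A useful consequence of Eq. 2.4.37 is a direct numerical route to η: imposing the constraint d_0 = 0 of Eq. 2.4.24 and integrating D(ω) = T p(0) η / P(ω) over one period shows that η is the harmonic mean of the normalized equivalent power spectrum S(λ) = Σ_k ξ_k e^{i2πkλ}, that is, η = [∫_{-1/2}^{1/2} dλ / S(λ)]^{-1}. The sketch below evaluates η and the error probability of Eq. 2.4.20 under a hypothetical autocorrelation and S/N ratio.

    import numpy as np
    from scipy.stats import norm

    # A numerical sketch based on Eq. 2.4.37: with d_0 = 0, eta is the
    # harmonic mean of the normalized equivalent power spectrum S(lambda).
    # The autocorrelation xi and the S/N ratio below are hypothetical.
    xi = np.array([1.0, 0.4, 0.1])         # xi_0, xi_1, xi_2 (L = 3)
    lam = np.linspace(-0.5, 0.5, 20001)
    k = np.arange(1, len(xi))
    S = xi[0] + 2.0 * (xi[1:, None] * np.cos(2 * np.pi * np.outer(k, lam))).sum(axis=0)

    eta = 1.0 / np.trapz(1.0 / S, lam)     # effective S/N multiplier, Eq. 2.4.19
    d = np.sqrt(2.0 * 10.0)                # d = sqrt(2 p(0)/N0) at p(0)/N0 = 10
    Pe = norm.cdf(-np.sqrt(eta) * d)       # Eq. 2.4.20
    print(eta, Pe)

For L = 2 the integral evaluates to η = √(1 - 4ξ_1²), so η → 0 and P_e → Φ(0) = 1/2 as ξ_1 approaches the boundary value 1/2 of the non-negative definite region, consistent with the behavior plotted in Fig. 2.7.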

The best single indicator for a given channel of how well a transversal filter receiver can be expected to perform is the equivalent power spectrum P(ω). Specific numerical examples comparing the error probabilities of the bit detector, block detector, and transversal filter receiver will be given in Chapters 3 and 6.
2.5 A Comparison of the Bit Detector, Block Detector, and Transversal Filter
Now that the bit detector, block detector, and transversal filter receiver have been discussed, it is appropriate to compare these three receivers. Block diagrams of the bit detector, block detector, and transversal filter are given in Fig. 2.9. It should be emphasized that, in the case of the bit and block detectors, these are not the iterative realizations which will be developed in Chapter 3. The most meaningful comparison is in non-iterative form.
A bank of matched filters, one matched to each h(t - kT), 1 ≤ k ≤ N, is pictured in Fig. 2.9(a). All three receivers process the output of this matched filter bank. For actual implementation, a single matched filter sampled at successive intervals of T seconds would suffice.
Both the bit and block detector require the calculation of the quantity 2(x(t), s(t)) - ||s(t)||^2, which can be obtained by the method of Fig. 2.9(b).

Fig. 2.9. Receiver block diagrams: (a) bank of matched filters; (b) formation of 2(x(t), s(t)) - ||s(t)||^2; (c) formation of P(x(t), B_k = b_j); (d) bit detector; (e) block detector; (f) transversal filter

For a particular s(t), the output of the matched filter bank of Fig. 2.9(a) is weighted by the corresponding data sequence (B_1 ... B_N) and summed. This yields (x(t), s(t)), which is multiplied by two, and the energy of s(t), given by Eq. 2.1.32, is subtracted.
The bit detector and block detector can now be built up using the block of Fig. 2.9(b) as a basis. Consider the bit detector first. According to Eq. 2.1.23, it is necessary to fix B_k = b_j and calculate 2(x(t), s(t)) - ||s(t)||^2 for each of the M^{N-1} values of (B_1 ... B_{k-1}, B_{k+1} ... B_N), take the exponential, multiply by the probability of the corresponding data sequence, and sum. This is shown in Fig. 2.9(c). Then, as shown in Fig. 2.9(d), this operation is repeated for each of the M values of B_k = b_j, 1 ≤ j ≤ M, and the maximum chosen to determine the decision on B_k. This realization requires M^N banks of matched filters, although in practice the multiple outputs of a single bank (or the multiple successive outputs of a single matched filter) would be stored and used for M^N separate computations.
The block detector, shown in Fig. 2.9(e), is decidedly simpler. The block of Fig. 2.9(b) is repeated M^N times, once for each value of the data sequence (B_1 ... B_N), and the maximum is chosen. This determines the decision on the whole data sequence (B_1 ... B_N). Many exponentiations and summations are eliminated,

as well as the need to repeat the process for each B_k, 1 ≤ k ≤ N.
Finally, the transversal filter is pictured in Fig. 2.9(f). Only a single matched filter bank is required. The output is multiplied by a set of constants, summed, and applied to a set of (M - 1) thresholds to determine B_k. The set of weighting constants depends on k, since they actually consist of a single set of constants which "slide by" the matched filter bank.
One matter requires comment. Aaron and Tufts [19] showed that the linear time-invariant filter followed by a sampler and threshold which minimizes the probability of bit error is a properly chosen transversal filter preceded by a matched filter. The resulting receiver is precisely the finite transversal filter receiver of Section 2.4. The block detector, on the other hand, does not consist of a single linear filter followed by a single threshold, but rather has M^N linear filters and chooses the maximum of their outputs to determine (B_1 ... B_N). Thus, the block detector does not belong to the class of receivers considered by Aaron and Tufts. Their result is not contradicted by the conclusion of Chapters 3 and 6 that the block detector has a lower probability of error than the transversal filter receiver for all the examples considered.
In fact, it will now be shown that it is very reasonable for the block detector to have a lower error probability than the transversal filter receiver. This will be accomplished by considering the decision

regions of the two receivers, and showing that the block detector decision regions have a much more complicated geometry. At the same time, a better understanding of the two receivers will be gained.
This discussion begins with a few elementary concepts from linear algebra, which follow Nering [44]. Let U and V be finite dimensional vector spaces with dimension m and n and dual spaces Û and V̂. A linear manifold in V is defined to be a set of the form
L_1 = α + S    (2.5.1)
where α ∈ V and S is a subspace of V. The dimension of L_1 is defined to be the dimension of S. When L_1 is (n-1)-dimensional, it is said to be a hyperplane. The following lemma is the key to the problem at hand:
Lemma 2.5.1: If φ ∈ V̂, φ ≠ 0, then the set
L_1 = {α ∈ V | φ(α) = a}    (2.5.2)
is a hyperplane of the form
L_1 = α_1 + S_1    (2.5.3)
where
S_1 = {α ∈ V | φ(α) = 0}    (2.5.4)

and α_1 satisfies
φ(α_1) = a.    (2.5.5)
Proof: Let α_1 and S_1 be as defined in Eq. 2.5.5 and Eq. 2.5.4, and let α ∈ α_1 + S_1. Then φ(α) = a and α ∈ L_1. Conversely, if α ∈ L_1, then φ(α) = a and
φ(α - α_1) = φ(α) - φ(α_1) = a - a = 0
so that α_2 = α - α_1 ∈ S_1. Since α = α_1 + α_2, it follows that α ∈ α_1 + S_1. QED.
In addition, the following lemma will be useful:
Lemma 2.5.2: The hyperplane L_1 of Lemma 2.5.1 divides V into three disjoint convex sets: L_1 itself, the positive side,
L_1^{(+)} = {α ∈ V | φ(α) > a}    (2.5.6)
and the negative side,
L_1^{(-)} = {α ∈ V | φ(α) < a}.    (2.5.7)
Proof: Let α_1, α_2 ∈ L_1^{(+)}

103 and 0<p<l. Then S(Pa I + (1-P) oa2) = I(a + (1-p) (o!2) > a so that p 1 + (1-p) a2 E L. Similarly, it can be shown that L] and L1 are convex. QED. Two linear manifolds L1 =__1I + S1 and L2 = a2 + S2 are said to be parallel if S C S2, S2 C S1, or S1 = S2. If two parallel linear manifolds have a point in common, one is a subset of the other. The groundwork has been laid to look at the decision regions of the transversal filter receiver and block detector. Specifically, identify U as the N-dimensional vector space of outputs of the N matched filters (x(t), h(t- jT)). Making a decision on Bk consists of choosing M mutually exclusive and exhaustive subsets of U, {A.j}M and choosing Bk = b. if and only if the actual matched Jj=1 J filter bank output vector ca e Aj. The geometry of the sets {Aj) will now be examined for the two receivers. Consider the transversal filter receiver first. It applies a series of thresholds to a single linear functional 0 e U to determine Bk. Specifically, it has M- 1 thresholds, a1 < a2 <.. < aM-1' with resulting decision regions A1 = {leU0l(a) < a1}

A_m = {α ∈ U | a_{m-1} ≤ φ(α) < a_m},  2 ≤ m ≤ M-1
A_M = {α ∈ U | a_{M-1} ≤ φ(α)}    (2.5.8)
where the decisions corresponding to the α ∈ U such that φ(α) = a_m have been chosen arbitrarily. Different conventions can be handled similarly. The geometrical shapes of the A_m are the subject of the following theorem, which follows directly from Eq. 2.5.8 and Lemma 2.5.2:
Theorem 2.5.1: If L_m is defined as
L_m = {α ∈ U | φ(α) = a_m},  1 ≤ m ≤ M-1,    (2.5.9)
then
A_1 = L_1^{(-)}
A_m = L_{m-1} ∪ (L_{m-1}^{(+)} ∩ L_m^{(-)}),  2 ≤ m ≤ M-1
A_M = L_{M-1} ∪ L_{M-1}^{(+)}.    (2.5.10)
All these sets are convex.
In words, Theorem 2.5.1 states that each A_m, 2 ≤ m ≤ M-1, consists of a hyperplane together with the space between that hyperplane and another parallel hyperplane. A_1 and A_M consist of the open side of a hyperplane, and a hyperplane together with one side, respectively.
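The parallel-hyperplane structure of Eq. 2.5.8 can be made concrete with a few lines of code; the weight vector defining the functional φ and the thresholds below are hypothetical.

    import numpy as np

    # A small illustration of Eq. 2.5.8; the weights w and thresholds a_m are
    # hypothetical.  Each region A_m is a slab between parallel hyperplanes.
    w = np.array([1.0, -0.4, 0.1])     # the linear functional phi(alpha) = w . alpha
    a = (-2.0, 0.0, 2.0)               # thresholds a_1 < a_2 < a_3, so M = 4

    def decide(alpha):
        """Return m such that alpha lies in A_m of Eq. 2.5.8."""
        v = float(np.dot(w, alpha))
        for m, a_m in enumerate(a, start=1):
            if v < a_m:
                return m
        return len(a) + 1              # A_M: beyond the last hyperplane

    print(decide(np.array([0.9, -1.2, 0.8])))   # -> 3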

Now, consider the block detector. It performs the maximization of Eq. 2.1.29,
max_{(B_1 ... B_N)} (1/N_0) [2 Σ_{i=1}^{N} B_i (x(t), h(t - iT)) - ||Σ_{i=1}^{N} B_i h(t - iT)||^2] + ln P(B_1 ... B_N) = max_{1 ≤ j ≤ M^N} {φ_j(α) + a_j}    (2.5.11)
where α ∈ U is the output of the matched filter bank, {φ_j}_{j=1}^{M^N} ⊂ Û are linear functionals on U, and {a_j}_{j=1}^{M^N} are constants. Identify V as M^N-dimensional Euclidian space, and let the linear transformation from U into V defined by the M^N linear functionals {φ_j}_{j=1}^{M^N} be identified as π(·). Each linear functional φ_j(α) is associated with a unique data sequence (B_1 ... B_N). Assume, without loss of generality, that the first M^{N-1} functionals correspond to B_k = b_1, the second M^{N-1} to B_k = b_2, and so forth. Further, define M^N sets in V as
Ã_j = {β ∈ V | β_j + a_j ≥ β_m + a_m, 1 ≤ m ≤ j-1; β_j + a_j > β_m + a_m, j+1 ≤ m ≤ M^N},  1 ≤ j ≤ M^N.    (2.5.12)
The decision regions in U are then

A_j = ∪_{i=(j-1)M^{N-1}+1}^{j M^{N-1}} π^{-1}(Ã_i),  1 ≤ j ≤ M    (2.5.13)
where the decision B_k = b_j is made if and only if α ∈ A_j.
The geometry of the sets π^{-1}(Ã_j) will now be determined. The trick is to recognize that the maximization of Eq. 2.5.11 can be reformulated in terms of thresholds applied to additional linear functionals. Accordingly, define the M^{2N} linear functionals on V,
ψ_{i,j}(β) = β_j - β_i,  1 ≤ i, j ≤ M^N    (2.5.14)
and M^{2N} hyperplanes in U,
L_{i,j} = {α ∈ U | ψ_{i,j}(π(α)) = a_i - a_j},  1 ≤ i, j ≤ M^N.    (2.5.15)
Then, the geometry of π^{-1}(Ã_j) is expressed in terms of these hyperplanes by the following theorem:
Theorem 2.5.2: If Ã_j is given by Eq. 2.5.12 and L_{i,j} by Eq. 2.5.15, then

π^{-1}(Ã_j) = [∩_{i=1}^{j-1} (L_{i,j} ∪ L_{i,j}^{(+)})] ∩ [∩_{i=j+1}^{M^N} L_{i,j}^{(+)}],  1 ≤ j ≤ M^N.    (2.5.16)
Proof: Let α be an element of the set on the right side of Eq. 2.5.16 and let β = π(α). Then, by Eq. 2.5.15 and Eq. 2.5.16,
α ∈ L_{i,j} ∪ L_{i,j}^{(+)},  1 ≤ i ≤ j-1
α ∈ L_{i,j}^{(+)},  j+1 ≤ i ≤ M^N    (2.5.17)
or
ψ_{i,j}(π(α)) = ψ_{i,j}(β) ≥ a_i - a_j,  1 ≤ i ≤ j-1
ψ_{i,j}(π(α)) = ψ_{i,j}(β) > a_i - a_j,  j+1 ≤ i ≤ M^N    (2.5.18)
which implies, by Eq. 2.5.14, that
β_j + a_j ≥ β_i + a_i,  1 ≤ i ≤ j-1
β_j + a_j > β_i + a_i,  j+1 ≤ i ≤ M^N.    (2.5.19)
By Eq. 2.5.12, β ∈ Ã_j, or α ∈ π^{-1}(Ã_j). Conversely, let α ∈ π^{-1}(Ã_j) and β = π(α) ∈ Ã_j. Then, by Eq. 2.5.12,

ψ_{i,j}(β) = ψ_{i,j}(π(α)) ≥ a_i - a_j,  1 ≤ i ≤ j-1
ψ_{i,j}(β) = ψ_{i,j}(π(α)) > a_i - a_j,  j+1 ≤ i ≤ M^N    (2.5.20)
so that
α ∈ L_{i,j} ∪ L_{i,j}^{(+)},  1 ≤ i ≤ j-1
α ∈ L_{i,j}^{(+)},  j+1 ≤ i ≤ M^N.    (2.5.21)
It follows immediately that α is an element of the set on the right of Eq. 2.5.16. QED.
Theorem 2.5.2 together with Eq. 2.5.13 states that the block detector decision region consists of the union of M^{N-1} sets π^{-1}(Ã_i) which are themselves the intersection of M^N - 1 sets, each consisting of the positive side of a hyperplane and possibly the hyperplane itself. Each of the π^{-1}(Ã_i) is a convex set, since it is the intersection of convex sets, but naturally the A_j are not convex. The physical picture is that for each value of B_k = b_j, 1 ≤ j ≤ M, there are M^{N-1} points in signal space corresponding to the M^{N-1} values of (B_1 ... B_{k-1}, B_{k+1} ... B_N). Each of these signals is then contained in a convex set π^{-1}(Ã_i), which is bounded by M^N - 1 hyperplanes. The decision region A_j for B_k = b_j is then the union of all these sets.
In contrast, the transversal filter decision region A_j for B_k = b_j consists of the area between two parallel hyperplanes. The

coefficients of the filter (i.e., the form of the linear functional) are chosen so that in some sense there is a minimum component of the h(t - mT), m ≠ k, in the direction orthogonal to these parallel hyperplanes.
In summary, the decision regions of the block detector are much more complex than those of the transversal filter receiver. This complexity gives much greater flexibility in the choice of the decision regions, and therefore admits the possibility of a lower error probability.
2.6 Sampling Relationships and Assumptions
The final topic of this chapter is the development of sampling relationships for a bandlimited channel. This development will provide the basis for the iterative receiver realizations of Chapters 3 and 4. The assumption of bandlimited h(t) is not restrictive from a practical point of view, since all physical channels are bandlimited.
Assume that h(t) is bandlimited to W = γ/2T Hz,
H(ω) = 0,  |ω| > 2πW = γπ/T    (2.6.1)
for some positive integer γ. A typical value of γ for a practical channel would be 2 or 3. It is well known that the sampling functions {φ_W(t - k/2W)}_{k=-∞}^{∞}

form a complete orthonormal set on the subspace of L2 consisting of elements of L2 bandlimited to W Hz, where
φ_W(t) = √(2W) sin(2πWt)/(2πWt).    (2.6.2)
The Fourier series for h(t) is
h(t) = (1/√(2W)) Σ_{m=-∞}^{∞} h(t_0 + m/2W) φ_W(t - t_0 - m/2W)    (2.6.3)
where 0 ≤ t_0 < T is arbitrary. Without loss of generality, assume t_0 = 0. Equation 2.6.3 expresses h(t) in terms of its samples at a rate of γ/T samples/second.
It has been customary in the literature to measure the severity of intersymbol interference by the size of the samples of h(t) spaced T seconds apart. A typical criterion for the absence of intersymbol interference has been
h(kT) = h(0) δ_{k,0}.    (2.6.4)
It is easy to show that Eq. 2.6.4 is neither necessary nor sufficient for there to be no intersymbol interference in the more fundamental sense of
p(k) = p(0) δ_{k,0}.    (2.6.5)
An example of an h(t) for which Eq. 2.6.5 is satisfied and yet all the

T-second samples are non-zero is
h(t) = φ_{1/2T}(t - T/2)    (2.6.6)
for which γ = 1. Conversely, an example of an h(t) which satisfies Eq. 2.6.4 but not Eq. 2.6.5 is
h(t) = φ_{1/T}(t) + φ_{1/T}(t - T/2) + φ_{1/T}(t - 3T/2)    (2.6.7)
for which γ = 2.
It was seen in Section 2.1 that a sufficient statistic for the detection of (B_1 ... B_N) was the output of N matched filters. These matched filter outputs can be rewritten, using Eq. 2.6.3, as
(x(t), h(t - kT)) = (1/√(2W)) Σ_{m=-∞}^{∞} h(m/2W) (x(t), φ_W(t - kT - m/2W)),  1 ≤ k ≤ N.    (2.6.8)
It is clear from Eq. 2.6.8 that an equivalent sufficient statistic is
x_k = √(2W) (x(t), φ_W(t - k/2W)),  -∞ < k < ∞.    (2.6.9)
In the sequel, the {x_k} of Eq. 2.6.9 will be called the "samples" of x(t), since they are actually the samples (at a rate of γ/T samples/second) of the output of a low-pass filter with bandwidth W Hz. Physically, the purpose of the low-pass filter is to reject all the

components of the white noise outside the bandwidth occupied by the signal h(t).
The samples {x_k} can be written in terms of the samples of h(t), using Eq. 2.1.2 and Eq. 2.6.9, as
x_k = Σ_{j=1}^{N} B_j h(k/2W - jT) + n_k    (2.6.10)
where
n_k = √(2W) (n(t), φ_W(t - k/2W))
is a noise sample. The {n_k} are jointly independent Gaussian random variables with mean zero and variance
E(n_k^2) = σ^2 = γN_0/2T.    (2.6.11)
The samples {x_k} will be used to derive the iterative forms of the bit detector and block detector of Chapters 3 and 4. In the derivation of those receivers, two assumptions will be made:
Assumption 1: The channel is bandlimited to 1/2T Hz (γ = 1).
Assumption 2: Only L samples of h(t) are non-zero. Since the time origin is arbitrary, this is equivalent to the condition
h(k) ≡ h(kT) = 0,  k < 0, k ≥ L.    (2.6.12)

Assumption 1 is merely a means of simplifying notation, since for γ > 1 the scalar sample x_k can simply be replaced by a γ-component vector sample in the probability relationships to follow. Those probability relationships will be otherwise unchanged, since the vector samples have Markov dependencies on the data digits identical to those of the scalar samples.
Assumption 2 is more restrictive, but nevertheless it is necessary in order to reduce the receiver computational effort to a reasonable level. Assumption 2 requires that the current sample x_k be a function of the L-1 past data digits (B_{k-L+1} ... B_k) alone. The L-1 future data digits (B_{k+1} ... B_{k+L-1}) also interfere with B_k, since signal energy from B_k occurs in (x_{k+1} ... x_{k+L-1}). Equation 2.6.12 requires that h(t) can be written in the form
h(t) = √T Σ_{i=0}^{L-1} h(i) φ_{1/2T}(t - iT)    (2.6.13)
which implies, by direct calculation, that intersymbol interference be L-finite, since
p(k) = T Σ_{i=0}^{L-1} h(i) h(i - k) = 0,  |k| ≥ L.    (2.6.14)
The converse is not true; L-finite intersymbol interference does not require that only L samples of h(t) are non-zero, as was seen in the example of Eq. 2.6.6. However, when intersymbol interference is

L-finite and if more than L samples of h(t) are non-zero, then infinitely many are non-zero. This is evident from Eq. 2.6.14, with L replaced by L' > L. When infinitely many samples of h(t) are non-zero, they must satisfy
Σ_{k=-∞}^{∞} h^2(kT) < ∞    (2.6.15)
since h(t) ∈ L2, and hence
lim_{|k|→∞} h(kT) = 0.    (2.6.16)
From a practical point of view, since Eq. 2.6.16 is satisfied, all but a finite number of h(kT)'s may be neglected if L is chosen sufficiently large. Therefore, Assumption 2 is not only required to reduce the computational effort, but it is also a reasonable assumption. The three possible cases and the meaning of Assumption 2 in those cases are summarized below:
1. Intersymbol interference is L-finite and only L samples of h(t) are non-zero, in which case no approximation has been made,
2. Interference is M-finite (where most likely M < L), infinitely many samples are non-zero, and all but L have been neglected, or

3. Interference is not L-finite, in which case infinitely many samples must be non-zero and all but L have been neglected. In practice this is the most likely case, since L-finite interference forces the power spectrum to assume a rather special form.
In view of Assumptions 1 and 2, the samples x_k can be written as
x_k = Σ_{j=0}^{L-1} B_{k-j} h(j) + n_k,  1 ≤ k ≤ N+L-1  (with B_j = 0 for j < 1 and j > N)
x_k = n_k,  k < 1, k > N+L-1    (2.6.17)
from Eq. 2.6.9 and Eq. 2.6.13. The bit detector and block detector need only process a finite sequence of N+L-1 samples to obtain the appropriate a posteriori probabilities in order to make a decision on (B_1 ... B_N).
2.7 Conclusions
The more important conclusions of Chapter 2 are summarized below:

1. The bit detector minimizes the average number of errors in detecting (B_1 ... B_N). It is a nonlinear receiver which requires knowledge of the noise spectral density height N_0/2.
2. The block detector minimizes the probability of one or more errors occurring in the detection of (B_1 ... B_N) and is linear up to the final choice of the maximum of M^N linear functionals on the observation. When all signals are equally likely, it does not require knowledge of the noise spectral density N_0.
3. The transversal filter receiver applies a threshold to a single linear functional on the observation. As I → ∞, it can always eliminate intersymbol interference whenever P(ω) is bounded away from zero. Because it does not weight the noise optimally, its performance can be improved upon in spite of the fact that it eliminates interference.
4. All three receivers require the output of N matched filters, (x(t), h(t - jT)), 1 ≤ j ≤ N. When h(t) is bandlimited, these matched filters can be replaced by a low-pass filter followed by a sampler.
5. The error probabilities of all three receivers are determined by the autocorrelation p(k) and noise spectral density height N_0/2. In the case of the transversal filter receiver, a more useful (and equivalent) indicator of performance is the equivalent power spectrum P(ω). As P(ω_0) → 0 for some |ω_0| ≤ π/T, the

probability of error of the transversal filter receiver approaches 1/2.
6. The decision regions of the block detector are much more complex than those of the transversal filter receiver. This admits the possibility of the block detector having a lower error probability than the transversal filter. The calculations of Chapter 3 and the computer simulations of Chapter 6 will bear this out.

CHAPTER 3
OPTIMUM RECEIVERS FOR A FIXED KNOWN CHANNEL
In this chapter iterative realizations for the bit and block detectors, real-time and non-real-time, will be developed. In the case of the bit detector, this work is an extension of the results of Chang and Hancock [6] and Abend and Fritchman [10] to the case of general data statistics. The dynamic programming realization of the block detector developed here is an application of the Viterbi algorithm for decoding convolutional codes [11], generalized to general data statistics and to real-time operation, to the problem of intersymbol interference.
The iterative forms of the real-time bit and block detectors will be compared in detail for the simple case of L = 2, and the error probability of the block detector will be evaluated for that case. The performance of the bit detector for the same case has been evaluated by Kimball [7].
3.1 The Bit Detector and Block Detector for a Bandlimited Channel
In this section the bandlimited channel relationships of Section 2.6 will be exploited to find simpler versions of the bit detector and block detector of Section 2.1.
It was shown in Section 2.6 that these receivers can process the samples x_k, given by Eq. 2.6.9. The dependency of x_k on the

data digits is given by Eq. 2.6.17,

x_k = \sum_{j=0}^{L-1} B_{k-j} h(j) + n_k,   1 \le k \le N+L-1    (3.1.1)

where it is understood that B_k = 0 for k < 1 and k > N. The \{x_k\} of Eq. 3.1.1 are jointly independent Gaussian random variables with mean

E(x_k) = \sum_{j=0}^{L-1} B_{k-j} h(j)    (3.1.2)

and variance \sigma^2, which is given by Eq. 2.6.11 with \gamma = 1. To simplify the notation for the remaining chapters, define

X_k = (x_1, x_2, ..., x_k)    (3.1.3)

H = (h(0), h(1), ..., h(L-1))^T    (3.1.4)

B_k = (B_k, B_{k-1}, ..., B_{k-L+1})^T    (3.1.5)

The basic probability required for both receivers is

P(X_{N+L-1} | B_1 ... B_N) = (2\pi\sigma^2)^{-(N+L-1)/2} \prod_{j=1}^{N+L-1} \exp\{-(x_j - B_j^T H)^2 / 2\sigma^2\}    (3.1.6)

The block detector chooses (B_1 ... B_N) to satisfy Eq. 2.1.26,

\max_{(B_1 ... B_N)} P(B_1 ... B_N | X_{N+L-1})    (3.1.7)

or equivalently

\min_{(B_1 ... B_N)} [-\ln P(X_{N+L-1} | B_1 ... B_N) - \ln P(B_1 ... B_N)]    (3.1.8)

Substituting Eq. 3.1.6 in Eq. 3.1.8 and ignoring the constant term, Eq. 3.1.8 becomes

\min_{(B_1 ... B_N)} \left[ \sum_{j=1}^{N+L-1} (x_j - B_j^T H)^2 / 2\sigma^2 - \ln P(B_1 ... B_N) \right]    (3.1.9)

The linear nature of the block detector is again evident from Eq. 3.1.9 when it is recognized that the \sum x_j^2 term is a constant independent of (B_1 ... B_N) and may be ignored. One way of determining the minimum of Eq. 3.1.9 is to calculate the term in brackets for all M^N values of (B_1 ... B_N). In Sections 3.4 and 3.5 the simple additive nature of Eq. 3.1.9 will be exploited to find a dynamic programming solution to the minimization which substantially reduces the computation relative to straightforward enumeration.

The bit detector can be developed in similar fashion. By Eq. 2.1.8, it performs the maximization

\max_{1 \le j \le M} P(B_k = b_j | X_{N+L-1})    (3.1.10)

or equivalently,

\max_{1 \le j \le M} \sum_{B_1 ... B_{k-1}, B_{k+1} ... B_N} P(X_{N+L-1} | B_1 ... B_N) P(B_1 ... B_N)    (3.1.11)

Substituting Eq. 3.1.6 in Eq. 3.1.11 and ignoring constant terms, Eq. 3.1.11 becomes

\max_{1 \le j \le M} \sum_{B_1 ... B_{k-1}, B_{k+1} ... B_N} P(B_1 ... B_N) \exp\left\{ -\frac{1}{2\sigma^2} H^T \left( \sum_{j=1}^{N+L-1} B_j B_j^T \right) H \right\} \exp\left\{ \frac{1}{\sigma^2} H^T \sum_{j=1}^{N+L-1} x_j B_j \right\} \Bigg|_{B_k = b_j}    (3.1.12)

The first two terms in the summation of Eq. 3.1.12 are weighting factors which are a function of the data sequence (B_1 ... B_N) but not of the observation. The last term demonstrates again the nonlinear nature of the bit detector, which is in the form of an exponential weighting of a linear function of the observation. The terms in the exponent of Eq. 3.1.12, \sum B_j B_j^T and \sum x_j B_j, will occur frequently in Chapter 5 when supervised estimators of the Bayes, maximum likelihood, and stochastic approximation type are considered. The following sections of this chapter will develop iterative realizations of the two optimum receivers of Eq. 3.1.9 and Eq. 3.1.12. In addition, real-time versions of these receivers will be derived.
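To make the enumeration implicit in Eq. 3.1.9 concrete, the following is a minimal sketch of the straightforward minimization over all M^N candidate sequences. It is an illustration only, not the report's program: the Python setting, the function names, and the uniform-prior default are assumptions.

```python
import itertools

def block_detect_enumerate(x, h, levels, sigma2, log_prior=lambda seq: 0.0):
    """Straightforward enumeration of Eq. 3.1.9: choose (B_1 .. B_N) minimizing
    sum_j (x_j - B_j' H)^2 / (2 sigma^2) - ln P(B_1 .. B_N).

    x         : received samples x_1 .. x_{N+L-1}  (Eq. 3.1.1)
    h         : pulse samples h(0) .. h(L-1)       (the vector H of Eq. 3.1.4)
    levels    : the M data levels b_1 .. b_M
    log_prior : ln P(B_1 .. B_N); the default treats all sequences alike
    """
    L = len(h)
    N = len(x) - L + 1
    best_seq, best_cost = None, float("inf")
    for seq in itertools.product(levels, repeat=N):
        B = lambda i: seq[i - 1] if 1 <= i <= N else 0.0   # B_k = 0 outside 1..N
        cost = -log_prior(seq)
        for k in range(1, N + L):                          # k = 1 .. N+L-1
            mean = sum(B(k - j) * h[j] for j in range(L))  # E(x_k), Eq. 3.1.2
            cost += (x[k - 1] - mean) ** 2 / (2.0 * sigma2)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq
```

For equally likely antipodal digits one would call block_detect_enumerate(x, h, (+1.0, -1.0), sigma2); the M^N outer loop is exactly the computation that the dynamic programming of Sections 3.4 and 3.5 avoids.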

3.2 Bit Detector

In this section iterative realizations will be derived for the bit detector of Eq. 3.1.11. Two types of data statistics will be considered:

1. An arbitrary probability measure P(B_1 ... B_N) is given.

2. The data digits are K_b-Markov,

P(B_k | B_1 ... B_{k-1}) = P(B_k | B_{k-K_b} ... B_{k-1}),   k > K_b    (3.2.1)

The Markov assumption of Eq. 3.2.1 includes the case of independent data digits, which occurs when K_b = 0. Considering first the case of general data statistics, by the law of conditional probabilities,

P(B_1 ... B_N) = \prod_{j=1}^{N} P(B_j | B_1 ... B_{j-1})    (3.2.2)

and by Eq. 3.1.6,

P(X_{N+L-1} | B_1 ... B_N) = \prod_{j=1}^{N+L-1} P(x_j | B_{j-L+1} ... B_j)    (3.2.3)

where, as before, it is assumed that B_j = 0 for j < 1 and j > N. Combining Eq. 3.2.2 and Eq. 3.2.3, the summand of Eq. 3.1.11 becomes

P(B_1 ... B_N, X_{N+L-1}) = \prod_{j=1}^{N} P(x_j | B_{j-L+1} ... B_j) P(B_j | B_1 ... B_{j-1}) \prod_{j=N+1}^{N+L-1} P(x_j | B_{j-L+1} ... B_N)    (3.2.4)

Equation 3.2.4 suggests the following iterative realization: to calculate, at stage i,

P(B_1 ... B_i, X_i) = P(B_1 ... B_{i-1}, X_{i-1}) P(x_i | B_{i-L+1} ... B_i) P(B_i | B_1 ... B_{i-1}),   1 \le i \le N
P(B_1 ... B_N, X_i) = P(B_1 ... B_N, X_{i-1}) P(x_i | B_{i-L+1} ... B_N),   N+1 \le i \le N+L-1    (3.2.5)

where P(B_1 ... B_{i-1}, X_{i-1}) has been calculated in the previous stage. After Eq. 3.2.5 has been calculated for i = N+L-1, it can be substituted in Eq. 3.1.11 for the decision on each B_k, 1 \le k \le N. The iterative algorithm of Eq. 3.2.5 has the advantage that the whole observation X_{N+L-1} need not be stored prior to the commencement of processing; rather, each x_i can be processed and then discarded as it comes in. However, the computational burden of Eq. 3.2.5 is excessive for all but very small N. The number of values of (B_1 ... B_i) for which Eq. 3.2.5 must be calculated and stored in memory increases exponentially with i. The Markov assumption on

the data sequence of Eq. 3.2.1 must be made before the computation and memory can be reduced.

Consider now the case of Markov data digits (Eq. 3.2.1). The algorithm which will now be developed is a generalization to Markov data digits of the algorithm first proposed by Hancock and Chang [6] for independent data digits. Hancock and Chang determined an iterative algorithm for calculating

P(B_{k-L+1} ... B_k, X_{N+L-1}),   L \le k \le N    (3.2.6)

and maximized Eq. 3.2.6 over (B_{k-L+1} ... B_k) to determine a decision on B_k. It was quickly pointed out [8], [9] that it is a simple matter to modify Eq. 3.2.6 to get the minimum probability of error receiver (bit detector), since the desired probability is

P(B_k = b_j, X_{N+L-1}) = \sum_{B_{k-L+1} ... B_{k-1}} P(B_{k-L+1} ... B_{k-1}, B_k = b_j, X_{N+L-1})    (3.2.7)

To generalize the Hancock and Chang algorithm to Markov data statistics, let

J = \max\{K_b, L-1\}    (3.2.8)

and observe that

P(B_{k-J+1} ... B_k, X_{N+L-1}) = P(\bar{X}_k | B_{k-J+1} ... B_k, X_k) P(B_{k-J+1} ... B_k, X_k)    (3.2.9)

where \bar{X}_k is defined as

\bar{X}_k = (x_{k+1}, ..., x_{N+L-1})    (3.2.10)

The key simplification of the algorithm results from the fact that, conditional on (B_{k-J+1} ... B_k), the future observations \bar{X}_k are statistically independent of the past and present observations X_k. This fact is established in Appendix A, Eq. A.1,

P(\bar{X}_k | B_{k-J+1} ... B_k, X_k) = P(\bar{X}_k | B_{k-J+1} ... B_k)    (3.2.11)

Proceeding with the calculation of P(B_{k-J+1} ... B_k, X_{N+L-1}), the natural algorithm is to update P(B_{k-J+1} ... B_k, X_k) iteratively on a forward pass of the observation vector X_{N+L-1}, P(\bar{X}_k | B_{k-J+1} ... B_k) on a backward pass of X_{N+L-1}, and combine them according to Eq. 3.2.9 and Eq. 3.2.11,

P(B_{k-J+1} ... B_k, X_{N+L-1}) = P(\bar{X}_k | B_{k-J+1} ... B_k) P(B_{k-J+1} ... B_k, X_k),   J \le k \le N    (3.2.12)

It is shown in Appendix A, Eq. A.16, that an updating equation for P(B_{k-J+1} ... B_k, X_k) is

P(B_{k-J+1} ... B_k, X_k) = \sum_{B_{k-J}} P(x_k | B_{k-L+1} ... B_k) P(B_k | B_{k-K_b} ... B_{k-1}) P(B_{k-J} ... B_{k-1}, X_{k-1}),   k = J+1, J+2, ..., N    (3.2.13)

where Eq. 3.2.13 can be initialized by

P(B_1 ... B_k, X_k) = P(x_k | B_{k-L+1} ... B_k) P(B_k | B_{k-K_b} ... B_{k-1}) P(B_1 ... B_{k-1}, X_{k-1}),   k = 1, 2, ..., J    (3.2.14)

Similarly, a backdating equation for P(\bar{X}_k | B_{k-J+1} ... B_k) is given by Eq. A.20,

P(\bar{X}_k | B_{k-J+1} ... B_k) = \sum_{B_{k+1}} P(x_{k+1} | B_{k-L+2} ... B_{k+1}) P(B_{k+1} | B_{k-K_b+1} ... B_k) P(\bar{X}_{k+1} | B_{k-J+2} ... B_{k+1}),   k = N-1, N-2, ..., J    (3.2.15)

where Eq. 3.2.15 can be initialized by

P(\bar{X}_k | B_{k-J+1} ... B_N) = P(x_{k+1} | B_{k-L+2} ... B_N) P(\bar{X}_{k+1} | B_{k-J+2} ... B_N),   k = N+L-1, N+L-2, ..., N    (3.2.16)

Once Eq. 3.2.12 has been determined, the decision on B_k can

be made by summing out all the data digits other than B_k,

P(B_k = b_j, X_{N+L-1}) = \sum_{B_{k-J+1} ... B_{k-1}} P(B_{k-J+1} ... B_k, X_{N+L-1}) \Big|_{B_k = b_j}    (3.2.17)

Notice that it is not really necessary to calculate Eq. 3.2.12 for any values of k other than multiples of J. This completes the derivation of the algorithm, which reduces to that described by Hancock and Chang for K_b = 0. The important features of the algorithm are summarized below:

1. Both the update and backdate algorithms require a fixed number of computations and a fixed memory at each stage. The number of computations, on the order of M^J, increases exponentially with J. This limits the applicability of the algorithm to relatively small values of J.

2. The most reasonable implementation of the algorithm would be to perform the updating procedure in real-time as the data is received, and then perform the backdating procedure and combination simultaneously, after X_{N+L-1} has been received and stored in its entirety. This would require the storage of approximately M^J \cdot N/J values of P(B_{k-J+1} ... B_k, X_k), k = J, 2J, ..., while the data is being received. This linear dependence of memory requirements on N limits the block length which may be accommodated.

3. The Markov case represents a very considerable savings

in computation and memory over the general data statistics case for large N. The dependence on N is linear, rather than exponential.

Several of the disadvantages of this algorithm are eliminated by the real-time bit detector of the next section. This detector bases the decision on each data digit not on the entire received block, but rather on a linearly expanding portion of it. It can achieve substantially the same performance, and yet operate in real-time with greatly reduced storage requirements.

3.3 Real-Time Bit Detector

The bit detector of the previous section requires that the entire observation vector be stored prior to processing by the backdating equation. By forcing the receiver to make a decision on B_k shortly after the reception of the data containing the signal energy corresponding to B_k, this storage problem can be eliminated at the expense of increasing the error probability. The resulting receiver will be termed the real-time bit detector, because it makes decisions as the observation samples are received. It eliminates the necessity of accepting the entire block of data before making a decision on any of the data digits. A real-time bit detector for independent data digits has been derived by Abend and Fritchman [10]. Their receiver will be generalized, as was done in the last section, to account for an arbitrary joint probability measure on (B_1 ... B_N) and then for a K_b-Markov

dependence of the data digits.

Considering first the general data statistics case, it is desired to base the decision on B_k not on X_{N+L-1} as in the last section, but rather on X_{k+D}, where D is a non-negative integer.*

*Note that X_{k+D} for D < 0 does not contain any signal energy for data digit B_k and hence would not provide any basis for a decision on B_k.

The desired a posteriori probability is

P(B_k = b_j, X_{k+D}) = \sum_{B_1 ... B_{k-1}, B_{k+1} ... B_{k+D}} P(B_1 ... B_{k+D}, X_{k+D})    (3.3.1)

and the updating equation has already been given in Eq. 3.2.5,

P(B_1 ... B_{k+D}, X_{k+D}) = P(x_{k+D} | B_{k+D-L+1} ... B_{k+D}) P(B_{k+D} | B_1 ... B_{k+D-1}) P(B_1 ... B_{k+D-1}, X_{k+D-1}),   1 \le k \le N-D
P(B_1 ... B_N, X_{k+D}) = P(x_{k+D} | B_{k+D-L+1} ... B_N) P(B_1 ... B_N, X_{k+D-1}),   N-D+1 \le k \le N    (3.3.2)

The algorithm of Eq. 3.3.2 shares with its counterpart of

Section 3.2 a computational load and memory which increase exponentially with k. As in Section 3.2, this difficulty can be eliminated by assuming that the data statistics are K_b-Markov and satisfy Eq. 3.2.1.

In the following, assume that the data sequence is K_b-Markov and that D \ge J, where J is defined in Eq. 3.2.8. The reason for the latter assumption will soon become clear. Since the current observation x_{k+D} is now independent of the data digits prior to B_k, the need for keeping in memory a posteriori probabilities for data digits prior to B_k is eliminated. Accordingly, the desired probability is

P(B_k = b_j, X_{k+D}) = \sum_{B_{k+1} ... B_{k+D}} P(B_k ... B_{k+D}, X_{k+D})    (3.3.3)

where the probability in the summand of Eq. 3.3.3 can be updated,

P(B_k ... B_{k+D}, X_{k+D}) = P(x_{k+D} | B_k ... B_{k+D}, X_{k+D-1}) P(B_{k+D} | B_k ... B_{k+D-1}, X_{k+D-1}) \sum_{B_{k-1}} P(B_{k-1} ... B_{k+D-1}, X_{k+D-1})    (3.3.4)

Because (and only because) D \ge J, the first term of Eq. 3.3.4 can be simplified to

P(x_{k+D} | B_k ... B_{k+D}, X_{k+D-1}) = P(x_{k+D} | B_{k+D-L+1} ... B_{k+D})

from Eq. A.12, and

P(B_{k+D} | B_k ... B_{k+D-1}, X_{k+D-1}) = P(B_{k+D} | B_{k+D-K_b} ... B_{k+D-1})

from Eq. A.15. The resulting simplified Eq. 3.3.4 is

P(B_k ... B_{k+D}, X_{k+D}) = P(x_{k+D} | B_{k+D-L+1} ... B_{k+D}) P(B_{k+D} | B_{k+D-K_b} ... B_{k+D-1}) \sum_{B_{k-1}} P(B_{k-1} ... B_{k+D-1}, X_{k+D-1})    (3.3.5)

The reason for assuming D \ge J should now be evident. If that assumption is not made, then the updating equation would have to retain some data digits prior to B_k. These data digits would have to be summed out at each stage to make the decision on B_k. This would be acceptable, except that the computation would not be reduced below that necessary for D = J, and yet the error probability would be increased because fewer observations would be taken into account for each decision.* For D \ge J, the computation and memory of the algorithm of Eq. 3.3.5 increase exponentially with D. Therefore, the value of J places a fundamental lower bound on the bit detector computation and memory.

*Abend and Fritchman [10] present all their numerical results for K_b = 0 and D < L-1. For the reasons stated, this case is of doubtful practical interest.
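The update of Eq. 3.3.5 for K_b = 0 (independent, equally likely digits) is sketched below. The receiver carries P(B_k ... B_{k+D}, X_{k+D}) over the M^{D+1} current digit patterns, decides the oldest digit by summing out the rest (Eq. 3.3.3), then marginalizes B_{k-1} and appends the newest digit. This is a hypothetical illustration, not the report's program: block-edge effects are neglected, constant factors common to all patterns are dropped, and every identifier is invented.

```python
import itertools
import math

def realtime_bit_detector(xs, h, levels=(+1.0, -1.0), sigma2=0.25, D=None):
    """Sketch of the real-time bit detector (K_b = 0, equally likely digits,
    D >= L-1), per the update of Eq. 3.3.5.  The last D+1 digits of the
    stream are left undecided."""
    L = len(h)
    D = L - 1 if D is None else D
    assert D >= L - 1
    def like(x, digits):
        # P(x_k | B_{k-L+1} .. B_k): Gaussian with the mean of Eq. 3.1.2;
        # digits earlier than the block are simply absent (B_k = 0, k < 1)
        mean = sum(b * g for b, g in zip(reversed(digits[-L:]), h))
        return math.exp(-(x - mean) ** 2 / (2.0 * sigma2))
    # initialize P(B_1 .. B_{D+1}, X_{D+1}) from the first D+1 samples
    state = {s: math.prod(like(xs[i], s[:i + 1]) for i in range(D + 1))
             for s in itertools.product(levels, repeat=D + 1)}
    decisions = []
    for x in xs[D + 1:]:
        # decide the oldest digit by summing out the others (Eq. 3.3.3)
        post = {b: sum(p for s, p in state.items() if s[0] == b) for b in levels}
        decisions.append(max(post, key=post.get))
        # marginalize the oldest digit and extend by the newest (Eq. 3.3.5)
        new = {}
        for s, p in state.items():
            for b in levels:
                t = s[1:] + (b,)
                new[t] = new.get(t, 0.0) + p * like(x, t)
        z = sum(new.values())                    # renormalize against underflow
        state = {s: p / z for s, p in new.items()}
    return decisions
```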

3.4 Block Detector

In this section dynamic programming will be applied to the problem of determining the minimum of Eq. 3.1.9. The result will be a practical iterative realization of the block detector, as well as a considerable savings in computational effort and memory relative to the iterative bit detector of Section 3.2. As in Section 3.2, general and K_b-Markov data statistics will be considered.

The first case, that of general data statistics, assumes an arbitrary probability measure P(B_1 ... B_N). In this case, the block detector criterion of Eq. 3.1.9 can be rewritten as

\min_{(B_1 ... B_N)} \left[ \sum_{j=1}^{N+L-1} (x_j - B_j^T H)^2 / 2\sigma^2 - \sum_{j=1}^{N} \ln P(B_j | B_1 ... B_{j-1}) \right]    (3.4.1)

The straightforward enumeration of all the values of Eq. 3.4.1 to determine the minimum requires its calculation for M^N values of (B_1 ... B_N). This computation can be greatly reduced by employing dynamic programming, which is possible because of the additive nature of Eq. 3.4.1. Accordingly, define

L_k(B_1 ... B_k) = \begin{cases} (x_k - B_k^T H)^2 / 2\sigma^2 - \ln P(B_k | B_1 ... B_{k-1}), & 1 \le k < N \\ \sum_{j=N}^{N+L-1} (x_j - B_j^T H)^2 / 2\sigma^2 - \ln P(B_N | B_1 ... B_{N-1}), & k = N \end{cases}    (3.4.2)

so that Eq. 3.4.1 becomes

\min_{(B_1 ... B_N)} \sum_{k=1}^{N} L_k(B_1 ... B_k)    (3.4.3)

Furthermore, define

f_k(B_1 ... B_{N-k}) = \min_{(B_{N-k+1} ... B_N)} \sum_{j=N-k+1}^{N} L_j(B_1 ... B_j),   1 \le k \le N-1    (3.4.4)

and Eq. 3.4.3 becomes

\min_{(B_1 ... B_N)} \left[ \sum_{k=1}^{N} L_k(B_1 ... B_k) \right] = \min_{B_1} [L_1(B_1) + f_{N-1}(B_1)]    (3.4.5)

Thus, the minimum of Eq. 3.4.1 can be determined if a way can be found to determine f_k(B_1 ... B_{N-k}) recursively for k = 1, 2, ..., N-1. This can be accomplished by manipulating Eq. 3.4.4,

f_{k+1}(B_1 ... B_{N-k-1}) = \min_{(B_{N-k} ... B_N)} \left[ L_{N-k}(B_1 ... B_{N-k}) + \sum_{j=N-k+1}^{N} L_j(B_1 ... B_j) \right]
= \min_{B_{N-k}} \left[ L_{N-k}(B_1 ... B_{N-k}) + \min_{(B_{N-k+1} ... B_N)} \sum_{j=N-k+1}^{N} L_j(B_1 ... B_j) \right]
= \min_{B_{N-k}} [L_{N-k}(B_1 ... B_{N-k}) + f_k(B_1 ... B_{N-k})]    (3.4.6)

which is the required recursion relationship. The initial value of Eq. 3.4.6 is

f_1(B_1 ... B_{N-1}) = \min_{B_N} [L_N(B_1 ... B_N)]    (3.4.7)

This receiver can be implemented as follows: store the entire observation vector X_{N+L-1} as it is received. Then, to initialize the algorithm, use the samples (x_N, ..., x_{N+L-1}) and Eq. 3.4.2 to calculate L_N(B_1 ... B_N) for each (B_1 ... B_N). The initial value f_1(B_1 ... B_{N-1}) can be determined from Eq. 3.4.7 by finding, for each (B_1 ... B_{N-1}), the value of B_N which minimizes L_N(B_1 ... B_N). This value of f_1(B_1 ... B_{N-1}) is then stored with the associated minimizing value of B_N for each sequence (B_1 ... B_{N-1}). The samples (x_N, ..., x_{N+L-1}) are no longer needed and may be discarded. The first stage of the algorithm proceeds by calculating L_{N-1}(B_1 ... B_{N-1}) for each value of (B_1 ... B_{N-1}) using Eq. 3.4.2 and the stored value of x_{N-1}. Then, for each (B_1 ... B_{N-2}), [L_{N-1}(B_1 ... B_{N-1}) + f_1(B_1 ... B_{N-1})] can be minimized over B_{N-1} to yield

f_2(B_1 ... B_{N-2}) = \min_{B_{N-1}} [L_{N-1}(B_1 ... B_{N-1}) + f_1(B_1 ... B_{N-1})]    (3.4.8)

For each (B_1 ... B_{N-2}), Eq. 3.4.8 determines a B_{N-1}, which in turn determines a value of B_N left in storage from the initialization.

These values of B_{N-1} and B_N are then stored in memory along with the associated value of f_2(B_1 ... B_{N-2}) for each (B_1 ... B_{N-2}). The remaining values of B_N stored in the initialization stage, which are associated with B_{N-1}'s discarded in the determination of Eq. 3.4.8, may be discarded along with x_{N-1} and f_1(B_1 ... B_{N-1}).

At stage k+1, the memory contains the partial observation vector X_{N-k}, and, for each (B_1 ... B_{N-k}), the previously calculated value of f_k(B_1 ... B_{N-k}) and an associated sequence (B_{N-k+1} ... B_N). The following steps are performed:

1. Using x_{N-k}, calculate L_{N-k}(B_1 ... B_{N-k}) for each (B_1 ... B_{N-k}).
2. Calculate f_{k+1}(B_1 ... B_{N-k-1}) from Eq. 3.4.6.
3. For each (B_1 ... B_{N-k-1}), store a vector (B_{N-k} ... B_N) which consists of the B_{N-k} determined in the minimization of Step 2 and the corresponding (B_{N-k+1} ... B_N) taken from memory.
4. Discard from memory the (B_{N-k+1} ... B_N) corresponding to values of B_{N-k} rejected in the minimizations of Step 2, as well as the values of x_{N-k} and f_k(B_1 ... B_{N-k}).

Finally, at stage N the memory contains only x_1, and, for each value of B_1, a value of f_{N-1}(B_1) and an associated data sequence (B_2 ... B_N). Using the stored value of x_1, L_1(B_1) can be calculated and the value of B_1 which minimizes Eq. 3.4.5 determined.

The complete detected output is that value of B_1 and the corresponding stored (B_2 ... B_N).

The amount of computation required for the dynamic programming algorithm is substantially reduced below that required for the straightforward enumeration of all possibilities of (B_1 ... B_N). Both methods of determining the (B_1 ... B_N) to minimize Eq. 3.4.1 require the calculation of L_k(B_1 ... B_k) for every sequence of data digits (B_1 ... B_k) and every k. The difference in computation lies in the number of combinations of L_k's for which the sum of Eq. 3.4.3 must be calculated. Enumeration requires that sum to be formed for every (B_1 ... B_N), while the dynamic programming algorithm requires that the entire sum be determined for only M values of (B_1 ... B_N). Of course, it also calculates many partial sums which are systematically rejected in the course of the algorithm.

When the data statistics are K_b-Markov, the computation can be reduced much further. In this case forward rather than backward dynamic programming can be employed, and this gives the algorithm a fixed structure and memory requirement which simplifies implementation. In addition, the forward dynamic programming algorithm will lead easily and directly to a real-time block detector in Section 3.5. Assume that the data statistics are K_b-Markov. Then Eq. 3.4.1 can be rewritten as

\min_{(B_1 ... B_N)} \left[ \sum_{k=1}^{N-J} M_k(B_k ... B_{k+J}) \right]    (3.4.9)

where

M_k(B_k ... B_{k+J}) = \begin{cases} \sum_{j=1}^{J+1} (x_j - B_j^T H)^2 / 2\sigma^2 - \sum_{j=1}^{J+1} \ln P(B_j | B_{j-K_b} ... B_{j-1}), & k = 1 \\ (x_{k+J} - B_{k+J}^T H)^2 / 2\sigma^2 - \ln P(B_{k+J} | B_{k+J-K_b} ... B_{k+J-1}), & 2 \le k \le N-J-1 \\ \sum_{j=N}^{N+L-1} (x_j - B_j^T H)^2 / 2\sigma^2 - \ln P(B_N | B_{N-K_b} ... B_{N-1}), & k = N-J \end{cases}    (3.4.10)

Also, define

g_k(B_{k+1} ... B_{k+J}) = \min_{(B_1 ... B_k)} \sum_{j=1}^{k} M_j(B_j ... B_{j+J}),   1 \le k \le N-J    (3.4.11)

so that Eq. 3.4.9 becomes

\min_{(B_1 ... B_N)} \left[ \sum_{k=1}^{N-J} M_k(B_k ... B_{k+J}) \right] = \min_{(B_{N-J+1} ... B_N)} [g_{N-J}(B_{N-J+1} ... B_N)]    (3.4.12)

and all that remains is to find an iterative algorithm for g_k(B_{k+1} ... B_{k+J}), k = 2, ..., N-J. Manipulating Eq. 3.4.11,

g_{k+1}(B_{k+2} ... B_{k+J+1}) = \min_{(B_1 ... B_{k+1})} \left[ M_{k+1}(B_{k+1} ... B_{k+J+1}) + \sum_{m=1}^{k} M_m(B_m ... B_{m+J}) \right]
= \min_{B_{k+1}} \left[ M_{k+1}(B_{k+1} ... B_{k+J+1}) + \min_{(B_1 ... B_k)} \sum_{m=1}^{k} M_m(B_m ... B_{m+J}) \right]
= \min_{B_{k+1}} [M_{k+1}(B_{k+1} ... B_{k+J+1}) + g_k(B_{k+1} ... B_{k+J})]    (3.4.13)

which is the desired algorithm. The initial value is

g_1(B_2 ... B_{J+1}) = \min_{B_1} [M_1(B_1 ... B_{J+1})]    (3.4.14)

The algorithm of Eq. 3.4.13 does not require the storage of X_{N+L-1}; rather, each sample x_j can be processed and discarded as it is received. As x_{J+1} arrives, it can be processed according to Eq. 3.4.10 to form M_1(B_1 ... B_{J+1}). The initial state g_1(B_2 ... B_{J+1}) can then be determined according to Eq. 3.4.14 by minimizing M_1(B_1 ... B_{J+1}) over B_1 for each (B_2 ... B_{J+1}). The M^J minimizing values of B_1, one for each (B_2 ... B_{J+1}), are then stored with the associated g_1(B_2 ... B_{J+1}). After x_{J+2} has been received, M_2(B_2 ... B_{J+2}) can be calculated from Eq. 3.4.10 and combined with the stored values of g_1(B_2 ... B_{J+1}) to determine

g_2(B_3 ... B_{J+2}) = \min_{B_2} [g_1(B_2 ... B_{J+1}) + M_2(B_2 ... B_{J+2})]    (3.4.15)

Since g_1(B_2 ... B_{J+1}) is then no longer needed, the values of g_2(B_3 ... B_{J+2}) can replace it in memory. For each (B_3 ... B_{J+2}), the value of B_2 which minimizes Eq. 3.4.15 determines a B_1 left in memory from the initialization stage, and this (B_1, B_2) pair can be stored for the next stage. This procedure continues until stage N-J is complete, and the memory contains, for each (B_{N-J+1} ... B_N), a value of g_{N-J}(B_{N-J+1} ... B_N) and an associated sequence (B_1 ... B_{N-J}). The final state g_{N-J}(B_{N-J+1} ... B_N) can then be minimized as in Eq. 3.4.12 to determine the entire detected data sequence (B_1 ... B_N).

The algorithm of Eq. 3.4.13 has a fixed storage requirement of M^J locations for g_k(B_{k+1} ... B_{k+J}), and a linearly increasing storage requirement, k M^J locations, for the associated data sequences (B_1 ... B_k). The computational effort and algorithm structure are fixed at each stage, so that the algorithm is well suited to hardware implementation. One disadvantage of the forward dynamic programming algorithm is that its computation and memory requirements are exponentially dependent on J, and this will, in practice, limit the largest J which may be accommodated. Other disadvantages are the linearly growing memory requirement for past data sequences and the necessity of waiting until X_{N+L-1} is received before a decision can be made on any of the data digits. The last two disadvantages will be

readily eliminated, however, by the real-time block detector dynamic programming algorithm of Section 3.5.

3.5 Real-Time Block Detector

In the case of K_b-Markov data statistics, the equations of Section 3.4 can easily be revised to yield a real-time block detector analogous to the real-time bit detector of Section 3.3. In Section 3.4 the decision on each B_k, 1 \le k \le N, was based on the entire observation vector X_{N+L-1}. In this section, the decision on B_k will be based on X_{k+D} for some non-negative integer D. The decision on B_k will then be available after x_{k+D} is received, and the algorithm will operate in real-time. The criterion of Eq. 3.1.7 must be modified to

\max_{(B_1 ... B_{k+D})} P(B_1 ... B_{k+D} | X_{k+D})    (3.5.1)

Notice that, even though the maximization over (B_1 ... B_{k+D}) is performed implicitly, it is only the resulting value of B_k which is actually used. When x_{k+D} has been received and the criterion of Eq. 3.5.1 is applied, the actual decisions on (B_1 ... B_{k-1}) have already been made (on the basis of fewer observations) and the decisions on (B_{k+1} ... B_{k+D}) must await future observations. Assume the data statistics are K_b-Markov. Then, the criterion of Eq. 3.4.9 must be modified to

\min_{(B_1 ... B_{k+D})} \sum_{j=1}^{k+D-J} M_j(B_j ... B_{j+J})    (3.5.2)

to agree with Eq. 3.5.1. Noticing, from Eq. 3.4.11, that

g_{k+D-J}(B_{k+D-J+1} ... B_{k+D}) = \min_{(B_1 ... B_{k+D-J})} \sum_{j=1}^{k+D-J} M_j(B_j ... B_{j+J})    (3.5.3)

it follows that Eq. 3.5.2 is equivalent to

\min_{(B_{k+D-J+1} ... B_{k+D})} g_{k+D-J}(B_{k+D-J+1} ... B_{k+D})    (3.5.4)

In view of Eq. 3.5.4, the modification of the block detector of Section 3.4 to yield a real-time algorithm is minor. The recursion relationship for g_k(B_{k+1} ... B_{k+J}) (Eq. 3.4.13) can be retained. Then, at each stage it can be minimized to determine a decision on B_{k+J-D}. If D \le J, it is not even necessary to store any past data digits corresponding to previous minimizations in the recursion relationship of Eq. 3.4.13. On the other hand, when D > J, it is necessary to retain in memory D-J+1 past data digits for each current state (B_{k+1} ... B_{k+J}). In Section 3.6 the real-time bit detector of Section 3.3 and the real-time block detector of this section will be compared in detail for the simple case L = 2 and D = 1.
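A compact sketch of the forward recursion of Eq. 3.4.13 for independent, equally likely digits (K_b = 0, so J = L-1) follows. It keeps g_k over the M^J states (B_{k+1} ... B_{k+J}) with one surviving predecessor digit per state, exactly as in the Viterbi algorithm; for simplicity it releases the full block decision of Eq. 3.4.12 at the end, whereas the real-time variant of this section would instead emit a digit at each stage per Eq. 3.5.4. The code is a hypothetical illustration: the block-edge terms (the first and last branches of Eq. 3.4.10) are ignored, and all names are invented.

```python
import itertools

def viterbi_block_detector(xs, h, levels=(+1.0, -1.0)):
    """Forward DP block detector (Eqs. 3.4.13-3.4.14) with K_b = 0, J = L-1.
    xs[i] plays the role of x_{k+J} at stage k; edge terms are neglected,
    so the first and last few decisions are approximate."""
    L = len(h)
    J = L - 1
    states = list(itertools.product(levels, repeat=J))   # (B_{k+1} .. B_{k+J})
    g = {s: 0.0 for s in states}                         # g_k(B_{k+1} .. B_{k+J})
    paths = {s: [] for s in states}                      # surviving (B_1 .. B_k)
    for x in xs:
        new_g, new_paths = {}, {}
        for s in states:
            best_m, best_b = float("inf"), None
            for b in levels:                             # b = B_k, minimized out
                prev = (b,) + s[:-1]                     # predecessor state
                mean = sum(d * c for d, c in zip(reversed((b,) + s), h))
                m = (x - mean) ** 2 + g[prev]            # branch metric + g_{k-1}
                if m < best_m:
                    best_m, best_b = m, b
            new_g[s] = best_m
            new_paths[s] = paths[(best_b,) + s[:-1]] + [best_b]
        g, paths = new_g, new_paths
    s_best = min(g, key=g.get)                           # final step, Eq. 3.4.12
    return paths[s_best] + list(s_best)
```

For L = 2 this collapses to the two-state receiver examined next in Section 3.6.1.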

3.6 The Real-Time Detectors for the Special Case M = 2, L = 2, D = 1, K_b = 0

In this section the real-time bit and block detectors will be studied in greater detail for the special case M = 2 (with antipodal signalling), L = 2, and D = 1. The receiver structures are relatively simple for this case, and the comparison between the two receivers is very interesting. In addition, the performance of the block detector will be calculated analytically for this simple case. Kimball [7] has already calculated the performance of the bit detector for L = 2 and time-limited waveforms, so that rather lengthy calculation will not be repeated here.

3.6.1 Block Detector. The block detector is the simpler of the two receivers, so it will be considered first. Assume that L = 2, D = 1, and M = 2 (with b_1 = 1, b_2 = -1). Also, let the data digits be independent (K_b = 0) and equally likely. From Eq. 3.5.4, B_k should be chosen to satisfy

\min_{B_{k+1}} g_k(B_{k+1})    (3.6.1)

where g_k(B_{k+1}) satisfies the recursion relationship of Eq. 3.4.13,

g_k(B_{k+1}) = \min_{B_k} [M_{k+1}(B_k, B_{k+1}) + g_{k-1}(B_k)]    (3.6.2)

and where

M_{k+1}(B_k, B_{k+1}) = (x_{k+1} - B_k h(1) - B_{k+1} h(0))^2 / 2\sigma^2    (3.6.3)

As was mentioned in Section 3.4, the 1/2\sigma^2 in Eq. 3.6.3 is not relevant because the data digits are equally likely. Furthermore, Eq. 3.6.3 can be simplified by squaring the term in brackets and eliminating the constant terms, which do not affect the minimization of Eq. 3.6.2. Accordingly, the modified recursion relationship is

g_k(B_{k+1}) = \min_{B_k} [B_k B_{k+1} h(0) h(1) - x_{k+1}(B_k h(1) + B_{k+1} h(0)) + g_{k-1}(B_k)]    (3.6.4)

Equation 3.6.4 can be simplified even further by defining

f_{k-1} = g_{k-1}(1) - g_{k-1}(-1)    (3.6.5)

and subtracting g_{k-1}(-1) from every term in the minimization of Eq. 3.6.4,

g_k(B_{k+1}) = \min \begin{cases} B_{k+1} h(0)h(1) - x_{k+1}(B_{k+1} h(0) + h(1)) + f_{k-1}, & (B_k = 1) \\ -B_{k+1} h(0)h(1) - x_{k+1}(B_{k+1} h(0) - h(1)), & (B_k = -1) \end{cases}    (3.6.6)

Performing the minimization of Eq. 3.6.6 explicitly,

g_k(1) = \begin{cases} h(0)h(1) - x_{k+1}(h(0) + h(1)) + f_{k-1}, & (x_{k+1}, f_{k-1}) \in A_1 \ (B_k = 1) \\ -h(0)h(1) - x_{k+1}(h(0) - h(1)), & (x_{k+1}, f_{k-1}) \in A_2 \cup A_3 \ (B_k = -1) \end{cases}    (3.6.7)

and

g_k(-1) = \begin{cases} -h(0)h(1) - x_{k+1}(h(1) - h(0)) + f_{k-1}, & (x_{k+1}, f_{k-1}) \in A_1 \cup A_2 \ (B_k = 1) \\ h(0)h(1) + x_{k+1}(h(0) + h(1)), & (x_{k+1}, f_{k-1}) \in A_3 \ (B_k = -1) \end{cases}    (3.6.8)

where

A_1 = \{x_{k+1}, f_{k-1} : 2h(0)h(1) \le 2x_{k+1}h(1) - f_{k-1}\}
A_2 = \{x_{k+1}, f_{k-1} : -2h(0)h(1) \le 2x_{k+1}h(1) - f_{k-1} \le 2h(0)h(1)\}
A_3 = \{x_{k+1}, f_{k-1} : 2x_{k+1}h(1) - f_{k-1} \le -2h(0)h(1)\}    (3.6.9)

The updating relation for f_k becomes, from Eqs. 3.6.7 and 3.6.8,

f_k = \begin{cases} 2h(0)h(1) - 2x_{k+1}h(0), & (x_{k+1}, f_{k-1}) \in A_1 \\ -f_{k-1} - 2x_{k+1}(h(0) - h(1)), & (x_{k+1}, f_{k-1}) \in A_2 \\ -2h(0)h(1) - 2x_{k+1}h(0), & (x_{k+1}, f_{k-1}) \in A_3 \end{cases}    (3.6.10)

The decision rule on B_k is implemented by the receiver of Fig. 3.1, whose upper branch calculates g_k(B_{k+1} = 1) along with a corresponding B_k, while the

lower branch calculates g_k(B_{k+1} = -1) along with a corresponding B_k. The "SIGN" box determines the sign of f_k = g_k(1) - g_k(-1) and chooses the value of B_k coming from the corresponding "MIN" box. Specifically, if f_k < 0, then the decision on B_k is the value of B_k coming from the "MIN(B_{k+1} = +1)" box, and vice versa.

The decision regions of the receiver in the (x_{k+1}, f_{k-1}) plane are shown in Fig. 3.2 for six representative values of h(0) and h(1), all of them positive. By Eq. 3.6.1, g_{k-1}(B_k) tends, on the average, to be smaller for the correct value of B_k than for the incorrect value. Thus, from Eq. 3.6.5, a negative value of f_{k-1} indicates that, based on X_k, B_k = 1 is more likely than B_k = -1. As a result, in Fig. 3.2 much of the left half-plane corresponds to the decision B_k = 1. The exception to this rule occurs when the reception x_{k+1} is strongly indicative that B_k = -1. This occurs when 0 < x_{k+1} < 1 (which tends to indicate that B_{k+1} = 1 and B_k = -1) and when x_{k+1} < -1 (which suggests that B_k = B_{k+1} = -1). However, the weight placed on x_{k+1} decreases as h(1) becomes smaller, until in the limit of h(1) = 0 the decision is based completely on f_{k-1}.

3.6.2 Error Probability of the Block Detector. The probability of error in making the decision on B_k for the real-time block detector of Section 3.6.1 will now be calculated. Even though the probability of block error is the criterion used in the optimization of the block detector, the probability of bit error is the measure of performance which will be utilized.
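For reference in what follows, the Section 3.6.1 recursion can be recorded in executable form; the same hypothetical routine is reused in the simulation sketch later in this section. It works directly from Eqs. 3.6.1 through 3.6.3 rather than from the region equations Eq. 3.6.7 through Eq. 3.6.10, which yields the same decisions with less bookkeeping. A minimal sketch, assuming equally likely antipodal digits and ignoring block-edge terms; all names are invented.

```python
def realtime_block_l2(xs, h0, h1):
    """Real-time block detector, M = 2, L = 2, D = 1, K_b = 0.  Maintains
    g_k(B_{k+1}) for B_{k+1} = +1, -1 (Eq. 3.6.2 with the metric of Eq. 3.6.3,
    the 1/2 sigma^2 factor dropped) and decides B_k from the sign of
    f_k = g_k(1) - g_k(-1), as in Fig. 3.1.  xs holds x_2, x_3, ..."""
    g = {+1: 0.0, -1: 0.0}
    decisions = []
    for x in xs:                                   # x = x_{k+1}
        new_g, new_arg = {}, {}
        for b1 in (+1, -1):                        # b1 = B_{k+1}
            cands = {bk: (x - bk * h1 - b1 * h0) ** 2 + g[bk] for bk in (+1, -1)}
            bk_best = min(cands, key=cands.get)
            new_g[b1], new_arg[b1] = cands[bk_best], bk_best
        f = new_g[+1] - new_g[-1]                  # f_k of Eq. 3.6.5
        decisions.append(new_arg[+1] if f < 0 else new_arg[-1])
        m = min(new_g.values())                    # only differences matter,
        g = {b: v - m for b, v in new_g.items()}   # so keep g bounded
    return decisions
```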

[Fig. 3.2. Decision regions of the real-time block detector in the (x_{k+1}, f_{k-1}) plane (and of the bit detector as \sigma^2 \to 0), with h(0) = 1.0 and h(1) = 0.10102, 0.20871, 0.33333, 0.5, 1.0 corresponding to \rho_1 = 0.1, 0.2, 0.3, 0.4, 0.5. The region boundaries are the lines 2x_{k+1}h(1) - f_{k-1} = \pm 2h(0)h(1).]

It was shown in Section 3.6.1 that the decision on B_k is based exclusively on the value of f_{k-1} passed on from the previous stage and the value of the current reception x_{k+1}. Thus, the error probability can be determined from the density P(f_{k-1}, x_{k+1} | B_k) together with the knowledge of the decision regions in the (f_{k-1}, x_{k+1}) plane. All that f_{k-1} and x_{k+1} have in common is B_k, so that conditional on B_k, f_{k-1} and x_{k+1} are independent. The density P(x_{k+1} | B_k) is known, so all that remains to be determined is P(f_{k-1} | B_k).

A recursion relationship for the density of f_k can be derived as follows: observe that

P(f_k | B_{k+1}) = \frac{1}{2} \sum_{B_k} P(f_k | B_k, B_{k+1}) = \frac{1}{2} \sum_{B_k} \int_{-\infty}^{\infty} P(f_k, f_{k-1} | B_k, B_{k+1}) \, df_{k-1}    (3.6.13)

Equation 3.6.13 must be evaluated separately for each of the three regions of Eq. 3.6.9. The recursion relationship of Eq. 3.6.10 can be regarded as a mapping from (f_{k-1}, x_{k+1}) into (f_{k-1}, f_k). This mapping has the Jacobian

\left| \frac{\partial(x_{k+1}, f_{k-1})}{\partial(f_k, f_{k-1})} \right| = \begin{cases} \frac{1}{2|h(0)|}, & (x_{k+1}, f_{k-1}) \in A_1 \cup A_3 \\ \frac{1}{2|h(0) - h(1)|}, & (x_{k+1}, f_{k-1}) \in A_2 \end{cases}    (3.6.14)

and the resultant density of (f_{k-1}, f_k) is

P(f_k, f_{k-1} | B_k, B_{k+1}) = \begin{cases} \frac{1}{2|h(0)|} P(f_{k-1} | B_k) \, P_{x_{k+1}}\!\left( h(1) - \frac{f_k}{2h(0)} \,\Big|\, B_k, B_{k+1} \right), & (f_{k-1}, f_k) \in A_1 \\ \frac{1}{2|h(0) - h(1)|} P(f_{k-1} | B_k) \, P_{x_{k+1}}\!\left( -\frac{f_k + f_{k-1}}{2(h(0) - h(1))} \,\Big|\, B_k, B_{k+1} \right), & (f_{k-1}, f_k) \in A_2 \\ \frac{1}{2|h(0)|} P(f_{k-1} | B_k) \, P_{x_{k+1}}\!\left( -h(1) - \frac{f_k}{2h(0)} \,\Big|\, B_k, B_{k+1} \right), & (f_{k-1}, f_k) \in A_3 \end{cases}    (3.6.15)

where P_{x_{k+1}}(\cdot | B_k, B_{k+1}) is the Gaussian density of x_{k+1}. The decision regions can be redefined in terms of (f_{k-1}, f_k) by using Eq. 3.6.10 to eliminate x_{k+1} in Eq. 3.6.9. Defining

a_1(f_k) = 2h(1)(h(1) - h(0)) - \frac{h(1)}{h(0)} f_k
a_2(f_k) = 2h(1)(h(0) - h(1)) - \frac{h(1)}{h(0)} f_k    (3.6.16)

the \{A_j\} become

A_1 = \{f_{k-1}, f_k : f_{k-1} \le a_1(f_k)\}
A_2 = \{f_{k-1}, f_k : a_1(f_k) \le f_{k-1} \le a_2(f_k)\}
A_3 = \{f_{k-1}, f_k : a_2(f_k) \le f_{k-1}\}    (3.6.17)

Using Eqs. 3.6.13, 3.6.15, and 3.6.17, the required density is

P(f_k | B_{k+1}) = \frac{1}{2} \sum_{B_k} \Bigg[ \frac{1}{2|h(0)|} P_{x_{k+1}}\!\left( h(1) - \frac{f_k}{2h(0)} \,\Big|\, B_k, B_{k+1} \right) \int_{-\infty}^{a_1(f_k)} P(f_{k-1} | B_k) \, df_{k-1}
+ \frac{1}{2|h(0) - h(1)|} \int_{a_1(f_k)}^{a_2(f_k)} P_{x_{k+1}}\!\left( -\frac{f_k + f_{k-1}}{2(h(0) - h(1))} \,\Big|\, B_k, B_{k+1} \right) P(f_{k-1} | B_k) \, df_{k-1}
+ \frac{1}{2|h(0)|} P_{x_{k+1}}\!\left( -h(1) - \frac{f_k}{2h(0)} \,\Big|\, B_k, B_{k+1} \right) \int_{a_2(f_k)}^{\infty} P(f_{k-1} | B_k) \, df_{k-1} \Bigg]    (3.6.18)

The derivation of Eq. 3.6.18 has assumed that h(0) \ne h(1). When h(0) = h(1), it can be shown that Eq. 3.6.18 becomes

P(f_k | B_{k+1}) = \frac{1}{2} \sum_{B_k} \Bigg[ \frac{1}{2|h(0)|} P_{x_{k+1}}\!\left( h(0) - \frac{f_k}{2h(0)} \,\Big|\, B_k, B_{k+1} \right) \int_{-\infty}^{-f_k} P(f_{k-1} | B_k) \, df_{k-1}
+ P(f_{k-1} = -f_k | B_k) \int_{-f_k/2h(0) - h(0)}^{-f_k/2h(0) + h(0)} P_{x_{k+1}}(x | B_k, B_{k+1}) \, dx
+ \frac{1}{2|h(0)|} P_{x_{k+1}}\!\left( -h(0) - \frac{f_k}{2h(0)} \,\Big|\, B_k, B_{k+1} \right) \int_{-f_k}^{\infty} P(f_{k-1} | B_k) \, df_{k-1} \Bigg]    (3.6.19)

For all values of h(0) and h(1), the starting value f_0 is given by

f_0 = -2h(0) x_1    (3.6.20)

so that the initial distribution for Eqs. 3.6.18 and 3.6.19 is

P(f_0 | B_1) = \frac{1}{2|h(0)|} P_{x_1}\!\left( -\frac{f_0}{2h(0)} \,\Big|\, B_1 \right)    (3.6.21)

Starting with the distribution of Eq. 3.6.21, Eq. 3.6.18 or 3.6.19 can be iterated until a stationary distribution is achieved. Once P(f_{k-1} | B_k) has been determined in this manner, the probability of error in making the decision on B_k can be calculated by referring to Fig. 3.2. First observe that

P(error | B_k) = \sum_{B_{k+1}} P(error, B_{k+1} | B_k)    (3.6.22)

Assume that h(0) \ne h(1), and divide the f_{k-1} axis into the three sections for which the decision boundary is linear. Then,

P(error | B_k = -1, B_{k+1}) = \int_{-\infty}^{2h(1)(h(1)-h(0))} P(f_{k-1} | B_k = -1) \left[ 1 - \Phi\!\left( \frac{f_{k-1}/2h(1) - h(0) + h(1) - B_{k+1}h(0)}{\sigma} \right) \right] df_{k-1}
+ \int_{2h(1)(h(1)-h(0))}^{2h(1)(h(0)-h(1))} P(f_{k-1} | B_k = -1) \left[ \Phi\!\left( \frac{-f_{k-1}/2(h(0)-h(1)) + h(1) - B_{k+1}h(0)}{\sigma} \right) - \Phi\!\left( \frac{f_{k-1}/2h(1) - h(0) + h(1) - B_{k+1}h(0)}{\sigma} \right) + 1 - \Phi\!\left( \frac{f_{k-1}/2h(1) + h(0) + h(1) - B_{k+1}h(0)}{\sigma} \right) \right] df_{k-1}
+ \int_{2h(1)(h(0)-h(1))}^{\infty} P(f_{k-1} | B_k = -1) \left[ 1 - \Phi\!\left( \frac{f_{k-1}/2h(1) + h(0) + h(1) - B_{k+1}h(0)}{\sigma} \right) \right] df_{k-1}    (3.6.23)

where \Phi(\cdot) is the unit normal distribution function and B_{k+1}h(0) - h(1) is the mean of x_{k+1} given B_k = -1. When h(0) = h(1), the formula simplifies somewhat,

P(error | B_k = -1, B_{k+1}) = \int_{-\infty}^{0} P(f_{k-1} | B_k = -1) \left[ 1 - \Phi\!\left( \frac{f_{k-1}/2h(0) - h(0) + h(1) - B_{k+1}h(0)}{\sigma} \right) \right] df_{k-1}
+ \int_{0}^{\infty} P(f_{k-1} | B_k = -1) \left[ 1 - \Phi\!\left( \frac{f_{k-1}/2h(0) + h(0) + h(1) - B_{k+1}h(0)}{\sigma} \right) \right] df_{k-1}    (3.6.24)

For all h(0) and h(1) the distribution of f_1 can be obtained in closed form in terms of the normal distribution function. However, the distribution of f_k for k > 1 can only be determined by the numerical integration of Eq. 3.6.18 or 3.6.19. In any case, the error probability can be determined by performing the integration of Eq. 3.6.23 or 3.6.24 numerically.
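As an alternative to iterating Eq. 3.6.18 or Eq. 3.6.19 numerically, the error probability can also be estimated by direct simulation, which is how the Chapter 6 results were obtained. A minimal sketch, reusing the hypothetical realtime_block_l2 routine above and assuming independent, equally likely antipodal digits:

```python
import random

def simulated_error_rate(h0, h1, sigma, n_bits=200_000, seed=1):
    """Monte Carlo estimate of the L = 2, D = 1 real-time block detector
    bit error rate.  Illustrative sketch only."""
    rng = random.Random(seed)
    bits = [rng.choice((+1, -1)) for _ in range(n_bits + 1)]
    # x_{k+1} = B_{k+1} h(0) + B_k h(1) + n_{k+1}   (Eq. 3.1.1 with L = 2)
    xs = [bits[k + 1] * h0 + bits[k] * h1 + rng.gauss(0.0, sigma)
          for k in range(n_bits)]
    decided = realtime_block_l2(xs, h0, h1)
    errors = sum(d != b for d, b in zip(decided, bits))
    return errors / n_bits
```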

Performing the integration of Eq. 3.6.19 (h(0) = h(1)) numerically is considerably easier than performing the integration of Eq. 3.6.18 (h(0) \ne h(1)) numerically. Therefore, the numerical integration of Eq. 3.6.19 was programmed on the computer, and it was determined that the error probability as calculated from P(f_k | B_{k+1}) for k > 1 was not significantly different from that calculated for k = 1 (within 1 percent). Therefore, the same result can be expected for the recursion of Eq. 3.6.18 as long as h(1) is about the same as h(0) (i.e., for \rho_1 near 1/2). Since this is the region of greatest interest (the transversal filter receiver, which is easier to implement, performs well for small \rho_1), the closed form expression for P(f_1 | B_2) was programmed and the error probability was determined numerically from Eq. 3.6.23. The resulting error probability was compared to that determined by computer simulation (see Chapter 6) and the results were found to agree closely for \rho_1 \ge 0.4, with a substantial discrepancy for \rho_1 \le 0.2. Since \rho_1 \ge 0.3 is the region where the transversal filter receiver has a large error probability, and because performance evaluation by simulation is substantially easier and yields accurate, reproducible results, the distribution of f_k was not programmed for k > 1 when h(0) \ne h(1).

Kimball [7] has calculated the performance of the bit detector for L = 2. He did not consider a real-time algorithm, but rather his

receiver based its decision on the whole observation, in a manner equivalent to the receiver of Section 3.2. It will be shown by simulation in Chapter 6 that the real-time block detector has the lowest error probability for D = L-1 (for all specific cases considered). Thus, the most meaningful comparison in performance for L = 2 is between the real-time block detector with D = 1 and the non real-time bit detector, since each then has the lowest error probability that its technique can offer.

In Fig. 3.3a-b, the transversal filter receiver, real-time block detector, and non real-time bit detector are compared for L = 2 at S/N ratios of 0, 7, and 11.8 dB. The S/N ratio is defined as

S/N = 10 \log_{10} \frac{1}{2\sigma^2}

S/N ratios of 7 and 11.8 dB were chosen because they are the smallest and largest S/N ratios considered by Kimball [7], and this choice avoided the necessity of reprogramming his equations. Figure 3.3a-b demonstrates that the real-time block detector, while naturally admitting more frequent errors than the bit detector, offers substantially better performance than the transversal filter receiver for \rho_1 \ge 0.3. At \rho_1 = 0.5, where the transversal filter receiver has an error probability of 0.5, the error rate of the block detector is 1.44 times as great as that of the bit detector at S/N = 7 dB and 10 times as great at 11.8 dB. At this \rho_1, then, there is

[Fig. 3.3. Probability of error vs. \rho_1 for L = 2, parts (a) and (b): transversal filter receiver, real-time block detector with D = 1, and optimum bit detector (not real-time), with interference-free performance shown for reference; h(0) and h(1) as in Fig. 3.2.]

still a significant improvement in performance to be attained by use of the bit detector at high S/N ratios. The equivalent ratios at \rho_1 = 0.4 are 1.18 and 1.66, so that at this \rho_1 the block detector is much more competitive with the bit detector, while the transversal filter is still out of contention.

These considerations do not take into account the fact that the bit detector requires knowledge of the noise variance, while the block detector does not. This gives the block detector a significant practical advantage. In a situation where the noise variance is not known precisely and/or is varying in time, the block detector may actually have a lower error probability, since the bit detector's performance is necessarily degraded by that lack of knowledge. In Chapter 6, the sensitivity of the bit detector performance to inaccurate knowledge of \sigma^2 will be studied by computer simulation. More complete comparisons of the error probabilities of the bit detector and block detector will also be given in Chapter 6.

3.6.3 Bit Detector. The structure of the real-time bit detector (L = 2 and D = 1) will now be compared to that of the block detector considered in Section 3.6.1. The updating equations of Section 3.3 will be manipulated to a form which points out more clearly the similarities to the block detector. The updating equation of the real-time bit detector is

P(B_k, B_{k+1}, X_{k+1}) = P(x_{k+1} | B_k, B_{k+1}) P(B_{k+1}) \sum_{B_{k-1}} P(B_{k-1}, B_k, X_k)    (3.6.25)

from Eq. 3.3.5. The decision B_k = 1 is then made if and only if

\sum_{B_{k+1}} P(B_k = 1, B_{k+1}, X_{k+1}) \ge \sum_{B_{k+1}} P(B_k = -1, B_{k+1}, X_{k+1})    (3.6.26)

Note that the only quantity passed from the k-th stage to the (k+1)-st stage is

\sum_{B_{k-1}} P(B_{k-1}, B_k, X_k) = P(B_k, X_k)    (3.6.27)

and this probability summarizes the receiver's total knowledge about B_k based on X_k. It seems appropriate at this point to derive a recursion relationship for P(B_k, X_k) instead of P(B_k, B_{k+1}, X_k) and then relate the decision on B_k to that probability. Further impetus for this strategy can be gained by considering the f_{k-1} of Eq. 3.6.5. Since g_k(B_{k+1}) is proportional to

\min_{B_1 ... B_k} \{ -\sigma^2 \ln P(B_1 ... B_{k+1}, X_{k+1}) \} + \text{constant}    (3.6.28)

it follows that f_k is proportional to

-\sigma^2 \left[ \max_{B_1 ... B_k} \ln P(B_1 ... B_k, B_{k+1} = 1, X_{k+1}) - \max_{B_1 ... B_k} \ln P(B_1 ... B_k, B_{k+1} = -1, X_{k+1}) \right]    (3.6.29)

For the bit detector, a quantity analogous to Eq. 3.6.29 would be

L_k = -\sigma^2 \ln \frac{P(B_{k+1} = 1, X_{k+1})}{P(B_{k+1} = -1, X_{k+1})}    (3.6.30)

It will now be shown that L_k is indeed analogous to f_k, in that the decision regions in the (f_{k-1}, x_{k+1}) plane for the real-time block detector and in the (L_{k-1}, x_{k+1}) plane for the real-time bit detector are very similar.

A recursion relationship for P(B_k, X_k) is readily obtained by summing Eq. 3.6.25 over B_k,

P(B_{k+1}, X_{k+1}) = \sum_{B_k} P(B_{k+1}) P(x_{k+1} | B_k, B_{k+1}) P(B_k, X_k)    (3.6.31)

To relate the decision mechanism to P(B_k, X_k), substitute Eq. 3.6.25 into Eq. 3.6.26 to yield

P(B_k = 1, X_k) \sum_{B_{k+1}} P(B_{k+1}) P(x_{k+1} | B_k = 1, B_{k+1}) \ge P(B_k = -1, X_k) \sum_{B_{k+1}} P(B_{k+1}) P(x_{k+1} | B_k = -1, B_{k+1})    (3.6.32)

as the condition for the decision B_k = 1. For purposes of receiver implementation, substitute the Gaussian density for P(x_{k+1} | B_k, B_{k+1}) in Eq. 3.6.31 and Eq. 3.6.32 and eliminate terms independent of B_k, B_{k+1}, and x_{k+1} to yield the updating equation

P(B_{k+1}, X_{k+1}) = \exp\left\{ \frac{h(0) B_{k+1} x_{k+1}}{\sigma^2} \right\} \sum_{B_k} P(B_k, X_k) \exp\left\{ \frac{B_k h(1)(x_{k+1} - B_{k+1} h(0))}{\sigma^2} \right\}    (3.6.33)

and the decision relation

P(B_k = 1, X_k) \, e^{2h(1)x_{k+1}/\sigma^2} \cosh\!\left[ \frac{h(0)(x_{k+1} - h(1))}{\sigma^2} \right] \ge P(B_k = -1, X_k) \cosh\!\left[ \frac{h(0)(x_{k+1} + h(1))}{\sigma^2} \right]    (3.6.34)

as the condition for the decision B_k = 1. A receiver block diagram based on Eqs. 3.6.33 and 3.6.34 is shown in Fig. 3.4. Upon comparison with Fig. 3.1 the greater

[Fig. 3.4. Optimum bit detector, L = 2, D = 1: block diagram realizing the update of Eq. 3.6.33 (gains h(0)/\sigma^2 and 2h(1)/\sigma^2, exponential and cosh nonlinearities) followed by the maximization of Eq. 3.6.34.]

complexity of the bit detector is apparent. Particularly troublesome from the point of view of implementation is the proliferation of exponentiations.

For purposes of comparison of the decision mechanism of Eq. 3.6.34 with that of the block detector, write Eq. 3.6.34 in terms of the L_{k-1} of Eq. 3.6.30, which gives the following condition for the decision B_k = 1,

L_{k-1} \le 2h(1)x_{k+1} - \sigma^2 \ln \frac{\cosh[h(0)(x_{k+1} + h(1))/\sigma^2]}{\cosh[h(0)(x_{k+1} - h(1))/\sigma^2]}    (3.6.35)

Equation 3.6.35 compares L_{k-1}, which is a function of probabilities passed on from the previous stage, with a threshold which is a function of x_{k+1}. Further interpretation of Eq. 3.6.35 is possible when it is noted that

\frac{P(B_k = 1, X_k)}{P(B_k = -1, X_k)} = \frac{P(B_k = 1 | X_k)}{P(B_k = -1 | X_k)}    (3.6.36)

and

\frac{P(B_k = 1, X_k)}{P(B_k = -1, X_k)} = \frac{P(X_k | B_k = 1) P(B_k = 1)}{P(X_k | B_k = -1) P(B_k = -1)} = \frac{P(X_k | B_k = 1)}{P(X_k | B_k = -1)}    (3.6.37)

Thus, by Eq. 3.6.36, L_{k-1} is proportional to Kimball's "log-odds ratio" of B_k based on X_k [7], and by Eq. 3.6.37, L_{k-1} is also proportional to the "log-likelihood ratio" of B_k based on X_k. The dependency of the B_k decision on x_{k+1} is in the threshold of Eq. 3.6.35, which is applied to L_{k-1} after the reception of x_{k+1}.

It is instructive to look at limiting forms of the threshold relation of Eq. 3.6.35. For h(0) > 0 and h(1) > 0, as \sigma^2 \to 0, Eq. 3.6.35 approaches

L_{k-1} \le \begin{cases} 2h(1)(x_{k+1} - h(0)), & x_{k+1} > h(1) \\ 2(h(1) - h(0))x_{k+1}, & -h(1) \le x_{k+1} \le h(1) \\ 2h(1)(x_{k+1} + h(0)), & -h(1) > x_{k+1} \end{cases}    (3.6.38)

since each \cosh(\cdot) approaches an exponential. These are precisely the decision regions for the block detector in the (f_{k-1}, x_{k+1}) plane, so Fig. 3.2 is still valid as the limiting decision region as \sigma^2 \to 0 when the f_{k-1} axis is relabeled as the L_{k-1} axis. In order to calculate the limiting decision region as \sigma^2 \to \infty, it is necessary to determine a limit of the form

\lim_{x \to 0} \frac{\ln \cosh(ax) - \ln \cosh(bx)}{x}

and since both numerator and denominator approach zero, it is easily verified by l'Hospital's rule that the limit is zero. Hence

Eq. 3.6.35 becomes, as \sigma^2 \to \infty,

L_{k-1} \le 2h(1)x_{k+1}    (3.6.39)

for all x_{k+1}. The decision regions corresponding to Eq. 3.6.39 are shown in Fig. 3.5. The boundary lines have the same slope as in Fig. 3.2, but the x_{k+1} intercept is at the origin instead of at x_{k+1} = h(0). The receiver is much less sensitive to small changes in x_{k+1}, which is understandable in view of the large noise. The actual decision regions corresponding to Eq. 3.6.35 are shown in Figs. 3.6, 3.7, and 3.8 at S/N ratios of 10, 0, and -10 dB. The decision regions with a S/N ratio of 10 dB are very close to the \sigma^2 \to 0 case of Eq. 3.6.38, while the decision regions with a S/N ratio of -10 dB are close to the \sigma^2 \to \infty case of Eq. 3.6.39. The decision regions at a S/N ratio of 0 dB are a hybrid combination of the two.

A word of caution is in order here. Even though the decision regions for the block detector and bit detector as \sigma^2 \to 0 are the same, it cannot be concluded that the two receivers are identical for large S/N ratios nor that they have identical performance. The two statistics that the receivers use in conjunction with the decision regions, f_{k-1} and L_{k-1}, are different, as shown by Eq. 3.6.29 and Eq. 3.6.30. In fact, if Fig. 3.3b is any indication, the differences in performance of the two receivers are perhaps even accentuated at high S/N ratios.
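For comparison with the block detector sketch of Section 3.6.1, the corresponding bit detector recursion can be written in probability form directly from Eqs. 3.6.31 and 3.6.32; the cosh threshold of Eq. 3.6.35 is algebraically the same decision. Again a hypothetical sketch, assuming equally likely antipodal digits, ignoring edge terms, and with invented names; xs holds x_2, x_3, ....

```python
import math

def realtime_bit_l2(xs, h0, h1, sigma2):
    """Real-time bit detector, M = 2, L = 2, D = 1, K_b = 0, per
    Eqs. 3.6.31-3.6.32.  Note the explicit dependence on sigma2: unlike the
    block detector, this receiver must know the noise variance."""
    like = lambda x, bk, b1: math.exp(-(x - bk * h1 - b1 * h0) ** 2 / (2.0 * sigma2))
    p = {+1: 0.5, -1: 0.5}                       # P(B_k, X_k), kept normalized
    decisions = []
    for x in xs:                                 # x = x_{k+1}
        # decision on B_k (Eq. 3.6.32, equal priors on B_{k+1})
        score = {bk: p[bk] * (like(x, bk, +1) + like(x, bk, -1)) for bk in (+1, -1)}
        decisions.append(max(score, key=score.get))
        # update to P(B_{k+1}, X_{k+1}) (Eq. 3.6.31), then renormalize
        q = {b1: like(x, +1, b1) * p[+1] + like(x, -1, b1) * p[-1] for b1 in (+1, -1)}
        z = q[+1] + q[-1]
        p = {b: v / z for b, v in q.items()}
    return decisions
```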

[Fig. 3.5. Decision regions of the real-time bit detector as \sigma^2 \to \infty, in the (L_{k-1}, x_{k+1}) plane.]

[Fig. 3.6. Decision regions for the real-time bit detector, S/N = 10 dB.]

[Fig. 3.7. Decision regions for the real-time bit detector, S/N = 0 dB.]

[Fig. 3.8. Decision regions for the real-time bit detector, S/N = -10 dB.]

In summary, the real-time bit detector and real-time block detector have somewhat similar structures for L = 2 and D = 1. Both can be implemented in a form which updates a single quantity (f_k or L_k). This quantity summarizes the state of knowledge of the receiver concerning B_k based on X_k. The decision on B_k of each receiver is then based on f_{k-1} or L_{k-1} together with x_{k+1}. The decision regions of the two receivers in the (f_{k-1}, x_{k+1}) and (L_{k-1}, x_{k+1}) planes are identical in the limit of high S/N ratios.

3.7 Conclusions

The real-time versions of the bit detector and block detector with Markov data statistics are the most practical. This is because they have structures and memory requirements which are fixed in time, and because their memory requirements are much smaller than those of the non real-time versions. In addition, very little performance is sacrificed by using the real-time versions. Computer simulations by Abend and Fritchman [10] and later in Chapter 6 reveal that the real-time bit detector has an error probability which is nearly as low as that of the non real-time version for values of D only slightly larger than L-1. In addition, the computer simulations of Chapter 6 reveal the surprising fact that the block detector error probability actually increases as D exceeds L-1, leading to the conclusion that the real-time block detector for D = L-1 will actually have a lower error probability than the non real-time version.

The primary difference between the two real-time algorithms is that the bit detector's computational effort increases exponentially with D for D > L-1 (and is constant for D \le L-1), while that of the real-time block detector is independent of D. Even at D = L-1, where the real-time bit detector's computation is at a minimum, the real-time block detector requires considerably less computation, because summations over joint probability distributions are eliminated. Another significant advantage of the real-time block detector is that it does not require knowledge of the noise variance for equally likely data digits. However, the block detector has the disadvantage that its dynamic programming algorithm does not generalize to the random channel case, as will be shown in Section 4.1. This difficulty will be circumvented in Chapter 6 by using the estimators of Chapter 5 in conjunction with the known channel block detector of the present chapter on the random channel.

CHAPTER 4
OPTIMUM RECEIVERS FOR A RANDOM CHANNEL

This chapter will generalize the iterative realizations of the bit detector developed in Chapter 3 to a random channel. Before specifying what types of channel statistics are to be considered, it is necessary to modify the notation of Chapter 3 to allow time variations in the channel vector H. Accordingly, modify Eq. 1.1.2 to

x(t) = \sum_{k=1}^{N} B_k h_k(t - kT) + n(t)    (4.0.1)

so that the received signal does not consist of the T-second translates of a single waveform h(t), but rather the T-second translates of a time-varying waveform \{h_k(t)\}_{k=1}^{N}. Assume that all the h_k(t) are bandlimited to 1/2T Hz, and define a time-varying channel vector H_k analogous to the H of Eq. 3.1.4 as

H_k = (h_k(0), h_{k-1}(T), ..., h_{k-L+1}((L-1)T))^T    (4.0.2)

The received samples of Eq. 3.1.1 become, from Eq. 4.0.1,

x_k = \sum_{j=0}^{L-1} B_{k-j} h_{k-j}(jT) + n_k,   1 \le k \le N+L-1    (4.0.3)

where it is assumed that B_k = 0 for k < 1 and k > N.
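The sample model of Eq. 4.0.3 can be made concrete with a short generator; note that the symbol transmitted at time k is shaped by its own waveform h_k, so each sample x_k mixes contributions from up to L different channel states. The following is a hypothetical sketch with invented names, not part of the report.

```python
import random

def received_samples(bits, channels, sigma, seed=0):
    """Generate x_k = sum_j B_{k-j} h_{k-j}(jT) + n_k  (Eq. 4.0.3).
    channels[i] holds (h_{i+1}(0), h_{i+1}(T), ..., h_{i+1}((L-1)T)) for
    symbol i+1; digits outside 1..N contribute nothing (B_k = 0 there)."""
    rng = random.Random(seed)
    N, L = len(bits), len(channels[0])
    def term(k, j):
        i = k - j                                 # index of the contributing symbol
        return bits[i - 1] * channels[i - 1][j] if 1 <= i <= N else 0.0
    return [sum(term(k, j) for j in range(L)) + rng.gauss(0.0, sigma)
            for k in range(1, N + L)]
```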

The statistics of the family of channel vectors \{H_k\}_{k=1}^{N+L-1} must be specified. Three increasingly restrictive assumptions on the channel statistics will be made:

1. There is given an arbitrary joint probability density function P(H_1 ... H_{N+L-1}).

2. The vector H_k is K_h-Markov,

P(H_k | H_1 ... H_{k-1}) = P(H_k | H_{k-K_h} ... H_{k-1})    (4.0.4)

3. The vector H_k is fixed, H_k = H, and there is given an arbitrary probability density P(H).

The same two types of data statistics, general and K_b-Markov, will be considered. The data and channel ensembles will be taken as statistically independent. For all three types of channel statistics, the unsupervised nature of the bit detector precludes the existence of a fixed-dimensional sufficient statistic for the determination of the a posteriori density of H given X_k. Therefore, for practical implementation of these receivers it will be necessary to assume that (H_1 ... H_{N+L-1}) is drawn from a finite discrete distribution. The receiver can store in memory a finite number of probabilities corresponding to a finite number of possible values of H, but it cannot store a continuous probability density for H unless that density is parameterized by a finite-dimensional sufficient statistic.

In Section 4.1 it will be shown that the dynamic programming block detector algorithm of Chapter 3 does not generalize to the random channel case of this chapter. Accordingly, Sections 4.2 and 4.3 will concentrate on extending the bit detector to the random channel case.

4.1 The Block Detector for a Random Channel

In this section it will be established that the dynamic programming block detector algorithm of Section 3.4 does not generalize to a random channel. When the channel is random, the a posteriori probability of (B_1 ... B_N) must be averaged over the channel statistics. For instance, for the fixed random channel, the a posteriori probability is

P(B_1 ... B_N, x(t)) = \int P(B_1 ... B_N, x(t) | H) P(H) \, dH    (4.1.1)

The dynamic programming block detector algorithm of Section 3.4 depended on the fact that \ln P(B_1 ... B_N, x(t)) is an additive function of the data digits of the form

\ln P(B_1 ... B_N, x(t)) = \sum_j f_j(B_j ... B_{j+J})    (4.1.2)

Likewise, \ln P(B_1 ... B_N, x(t) | H) will have the property of Eq. 4.1.2, but unfortunately the \ln(\cdot) function does not commute with the integral in Eq. 4.1.1. Therefore, the a posteriori probability

of Eq. 4.1.1 cannot have the property of Eq. 4.1.2, and the possibility of applying dynamic programming is lost. The same conclusion applies to more complex channel statistics.

As a concrete example, consider the fixed random channel with a Gaussian a priori density,

P(H) = (2\pi)^{-L/2} |\Lambda_0|^{-1/2} \exp\left\{ -\frac{1}{2} (H - m_0)^T \Lambda_0^{-1} (H - m_0) \right\}    (4.1.3)

where m_0 is the mean value of H,

m_0 = E(H)    (4.1.4)

and \Lambda_0 is the covariance matrix,

\Lambda_0 = E[(H - m_0)(H - m_0)^T]    (4.1.5)

From Eq. 3.1.6,

P(B_1 ... B_N, x(t) | H) = P(B_1 ... B_N) (2\pi\sigma^2)^{-(N+L-1)/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{j=1}^{N+L-1} (x_j - B_j^T H)^2 \right\}    (4.1.6)

and ignoring constants independent of (B_1 ... B_N), the integrand of Eq. 4.1.1 is

P(B_1 ... B_N, x(t) | H) P(H)
\propto P(B_1 ... B_N) \exp\left\{ -\frac{1}{2\sigma^2} \left[ -2 H^T \left( \sum_k x_k B_k \right) + H^T \left( \sum_k B_k B_k^T \right) H \right] \right\} \exp\left\{ -\frac{1}{2} (H - m_0)^T \Lambda_0^{-1} (H - m_0) \right\}
= P(B_1 ... B_N) \exp\left\{ -\frac{1}{2} (H - m_1)^T \Lambda_1^{-1} (H - m_1) \right\} \exp\left\{ \frac{1}{2} \left[ m_1^T \Lambda_1^{-1} m_1 - m_0^T \Lambda_0^{-1} m_0 \right] \right\}    (4.1.7)

where

\Lambda_1^{-1} = \Lambda_0^{-1} + \frac{1}{\sigma^2} \sum_{k=1}^{N+L-1} B_k B_k^T    (4.1.8)

m_1 = \Lambda_1 \left( \Lambda_0^{-1} m_0 + \frac{1}{\sigma^2} \sum_{k=1}^{N+L-1} x_k B_k \right)    (4.1.9)

Performing the integration of Eq. 4.1.1,

P(B_1 ... B_N, x(t)) \propto P(B_1 ... B_N) |\Lambda_1|^{1/2} \exp\left\{ \frac{1}{2} m_1^T \Lambda_1^{-1} m_1 \right\}    (4.1.10)

where, by direct calculation,

m_1^T \Lambda_1^{-1} m_1 = m_0^T \Lambda_0^{-1} \Lambda_1 \Lambda_0^{-1} m_0 + \frac{2}{\sigma^2} m_0^T \Lambda_0^{-1} \Lambda_1 \sum_{j=1}^{N+L-1} x_j B_j + \frac{1}{\sigma^4} \sum_{i=1}^{N+L-1} \sum_{j=1}^{N+L-1} x_i x_j B_i^T \Lambda_1 B_j    (4.1.11)

Thus, the exponent of Eq. 4.1.10 is a quadratic function of (B_1 ... B_N), while |\Lambda_1|^{1/2} is a rather complicated function of (B_1 ... B_N). The linear dependence of \ln P(B_1 ... B_N, x(t)) on the observation vector X_{N+L-1} is lost, as is the property of Eq. 4.1.2. This example confirms the conclusion that the dynamic programming algorithm is no longer applicable.

Since the block detector dynamic programming algorithm is not applicable to a random channel, the remainder of this chapter will concentrate on deriving iterative solutions for the bit detector. Chapter 5 will derive practical channel estimation techniques which can be used in conjunction with the known-channel block detector of Chapter 3 by using the channel estimate as if it were exact.

4.2 Bit Detector for a Random Channel

In this section the bit detector of Section 3.2 will be generalized to a random channel. In Section 4.2.1 this will be accomplished for general data statistics, and in Section 4.2.2 K_b-Markov data statistics will be considered.
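The integral over P(H) that blocked dynamic programming in Section 4.1 is the same averaging every receiver of this chapter must perform, and for the Gaussian prior it has the closed form of Eqs. 4.1.8 through 4.1.11. The following hypothetical sketch (invented names, uniform data prior assumed) evaluates the data-dependent part of ln P(B_1 ... B_N, x(t)); the log-determinant term it computes is precisely the piece that destroys the additive structure of Eq. 4.1.2.

```python
import numpy as np

def marginal_log_posterior(xs, Bvecs, m0, A0, sigma2):
    """Data-dependent part of ln P(B_1 .. B_N, x(t)) for the fixed Gaussian
    random channel (Eqs. 4.1.8-4.1.11).

    xs     : samples x_1 .. x_{N+L-1}
    Bvecs  : the vectors B_j = (B_j, .., B_{j-L+1}) of Eq. 3.1.5, one per sample
    m0, A0 : prior mean and covariance of H (Eqs. 4.1.4-4.1.5)
    """
    A0_inv = np.linalg.inv(A0)
    A1_inv = A0_inv + sum(np.outer(B, B) for B in Bvecs) / sigma2      # Eq. 4.1.8
    A1 = np.linalg.inv(A1_inv)
    m1 = A1 @ (A0_inv @ m0 + sum(x * B for x, B in zip(xs, Bvecs)) / sigma2)  # Eq. 4.1.9
    # Eq. 4.1.10: ln |A1|^{1/2} + m1' A1^{-1} m1 / 2; the log-determinant is
    # the non-additive, non-quadratic piece.
    return 0.5 * np.linalg.slogdet(A1)[1] + 0.5 * m1 @ A1_inv @ m1
```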

4.2.1 General Data Statistics. Assume that a general probability measure P(B_1 ... B_N) is given. There are three types of channel statistics to be considered. In the first case, there is given a general probability density P(H_1 ... H_{N+L-1}). The required probability is

P(B_k, X_{N+L-1}) = \sum_{B_1 ... B_{k-1}, B_{k+1} ... B_N} \int \cdots \int P(B_1 ... B_N, H_1 ... H_{N+L-1}, X_{N+L-1}) \, dH_1 \cdots dH_{N+L-1}    (4.2.1)

But

P(B_1 ... B_N, H_1 ... H_{N+L-1}, X_{N+L-1}) = P(X_{N+L-1} | B_1 ... B_N, H_1 ... H_{N+L-1}) P(B_1 ... B_N) P(H_1 ... H_{N+L-1})    (4.2.2)

where

P(X_{N+L-1} | B_1 ... B_N, H_1 ... H_{N+L-1}) = \prod_{j=1}^{N+L-1} P(x_j | B_{j-L+1} ... B_j, H_j)    (4.2.3)

from Eq. 3.2.3, and

P(H_1 ... H_{N+L-1}) = \prod_{j=1}^{N+L-1} P(H_j | H_1 ... H_{j-1})    (4.2.4)

Combining Eqs. 4.2.2 through 4.2.4 and Eq. 3.2.2,

P(B_1 ... B_N, H_1 ... H_{N+L-1}, X_{N+L-1}) = \prod_{j=1}^{N} P(x_j | B_{j-L+1} ... B_j, H_j) P(H_j | H_1 ... H_{j-1}) P(B_j | B_1 ... B_{j-1}) \prod_{j=N+1}^{N+L-1} P(x_j | B_{j-L+1} ... B_N, H_j) P(H_j | H_1 ... H_{j-1})    (4.2.5)

The probability of Eq. 4.2.5 has a recursive realization analogous to Eq. 3.2.5,

P(B_1 ... B_k, H_1 ... H_k, X_k) = P(x_k | B_{k-L+1} ... B_k, H_k) P(H_k | H_1 ... H_{k-1}) P(B_k | B_1 ... B_{k-1}) P(B_1 ... B_{k-1}, H_1 ... H_{k-1}, X_{k-1}),   1 \le k \le N
P(B_1 ... B_N, H_1 ... H_k, X_k) = P(x_k | B_{k-L+1} ... B_N, H_k) P(H_k | H_1 ... H_{k-1}) P(B_1 ... B_N, H_1 ... H_{k-1}, X_{k-1}),   N+1 \le k \le N+L-1    (4.2.6)

It is readily apparent that Eq. 4.2.6 has the exponentially growing computational load and memory requirement in common with the known channel algorithm of Eq. 3.2.5. In addition, the existence of a fixed-dimensional sufficient statistic for H is precluded by the time-varying and unsupervised nature of the problem. Therefore, practical implementation would dictate that the (H_1 ... H_{N+L-1})

vectors be quantized, the continuous density on H be replaced by a finite discrete distribution, and the integral of Eq. 4.2.1 be replaced by a summation.

The computation and memory of the algorithm of Eq. 4.2.6 will now be estimated. If each component of H is quantized to Q values, then there are Q^L different values of each H_k. Thus, there are M^N values of (B_1 ... B_N) and Q^{L(N+L-1)} values of (H_1 ... H_{N+L-1}), for a grand total of (M Q^L)^N Q^{L(L-1)} values of the probability on the right side of Eq. 4.2.1 which must be calculated and stored. This amount of computation and storage can be staggering, even for modest values of M, Q, and N. For instance, take the case of M = L = 2, and let Q = 10. Then the number of storage locations required is approximately 10^{23} for just 10 bits (N = 10). Needless to say, the algorithm of Eq. 4.2.6 is not of much practical importance; more assumptions must be made on the data and channel ensembles before anything approaching practical reality is achieved.

When the channel ensemble is K_h-Markov, a great deal of the computation and storage of Eqs. 4.2.1 and 4.2.6 can be eliminated. In particular, in the iteration of Eq. 4.2.6 only the last K_h values of H_k need be retained. Using the K_h-Markov assumption in Eq. 4.2.6 and integrating both sides over (H_1 ... H_{k-K_h}), the resulting recursion relationship is

P(B_1 ... B_k, H_{k-K_h+1} ... H_k, X_k) = \begin{cases} P(x_k | B_{k-L+1} ... B_k, H_k) P(B_k | B_1 ... B_{k-1}) P(H_k | H_1 ... H_{k-1}) P(B_1 ... B_{k-1}, H_1 ... H_{k-1}, X_{k-1}), & 1 \le k \le K_h \\ P(x_k | B_{k-L+1} ... B_k, H_k) P(B_k | B_1 ... B_{k-1}) \int P(B_1 ... B_{k-1}, H_{k-K_h} ... H_{k-1}, X_{k-1}) P(H_k | H_{k-K_h} ... H_{k-1}) \, dH_{k-K_h}, & K_h+1 \le k \le N \end{cases}

P(B_1 ... B_N, H_{k-K_h+1} ... H_k, X_k) = P(x_k | B_{k-L+1} ... B_N, H_k) \int P(B_1 ... B_N, H_{k-K_h} ... H_{k-1}, X_{k-1}) P(H_k | H_{k-K_h} ... H_{k-1}) \, dH_{k-K_h},   N+1 \le k \le N+L-1    (4.2.7)

After the recursion relation of Eq. 4.2.7 has been applied for k = N+L-1, the decisions on the (B_1 ... B_N) can be made on the basis of

P(B_k, X_{N+L-1}) = \sum_{B_1 ... B_{k-1}, B_{k+1} ... B_N} \int \cdots \int P(B_1 ... B_N, H_{N+L-K_h} ... H_{N+L-1}, X_{N+L-1}) \, dH_{N+L-K_h} \cdots dH_{N+L-1}    (4.2.8)

The computation has been reduced over the previous case, because the

H-ensemble portion of the calculation is no longer growing in exponential fashion, but rather is fixed in size from stage to stage. When H is quantized to Q values per component, the maximum storage requirement is now M^N values of (B_1 ... B_N) and Q^{LK_h} values of (H_{k-K_h+1} ... H_k), for a grand total of M^N Q^{LK_h}. For the same example of M = L = 2 and N = 10, the total number of storage locations is approximately 10^{2K_h+3}, which is a considerable savings over 10^{23} for K_h << 10, but still somewhat impractical.

Finally, consider the case of a fixed random channel. As in Eq. 4.2.5,

P(B_1 ... B_N, H, X_{N+L-1}) = P(X_{N+L-1} | B_1 ... B_N, H) P(B_1 ... B_N) P(H) = P(H) \prod_{j=1}^{N+L-1} P(x_j | B_{j-L+1} ... B_j, H) \prod_{j=1}^{N} P(B_j | B_1 ... B_{j-1})    (4.2.9)

which has the recursive realization

P(B_1 ... B_k, H, X_k) = \begin{cases} P(x_1 | B_1, H) P(B_1) P(H), & k = 1 \\ P(x_k | B_{k-L+1} ... B_k, H) P(B_k | B_1 ... B_{k-1}) P(B_1 ... B_{k-1}, H, X_{k-1}), & 2 \le k \le N \end{cases}

P(B_1 ... B_N, H, X_k) = P(x_k | B_{k-L+1} ... B_N, H) P(B_1 ... B_N, H, X_{k-1}),   N+1 \le k \le N+L-1    (4.2.10)

After Eq. 4.2.10 has been determined for k = N+L-1, the decision on B_k is based on

P(B_k, X_{N+L-1}) = \sum_{B_1 ... B_{k-1}, B_{k+1} ... B_N} \int P(B_1 ... B_N, H, X_{N+L-1}) \, dH    (4.2.11)

This algorithm completely eliminates the integration (or summation) over H until the last stage. The storage required is Q^L times that of the analogous algorithm for a fixed known channel (Eq. 3.2.5), for a total of M^N Q^L locations. For the example of M = L = 2, Q = 10, this number is 10^5 for N = 10, which would be feasible were it not for the fact that N is much larger in practice. The bit detector computation and memory will be significantly reduced in Section 4.2.2 by considering K_b-Markov data statistics.
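A brute-force rendering of Eqs. 4.2.9 and 4.2.11 shows the structure of the fixed random channel bit detector with H quantized to a finite grid, as the text prescribes. It enumerates all M^N sequences, so it is usable only for tiny blocks; a hypothetical sketch with invented names, assuming binary, equally likely digits:

```python
import itertools
import math

def bit_detect_fixed_random_channel(xs, Hgrid, Hprior, sigma2):
    """Bit decisions from P(B_k, X_{N+L-1}) of Eq. 4.2.11, with the integral
    over H replaced by a sum over the grid Hgrid with priors Hprior.
    Binary antipodal digits assumed; exponential in N."""
    L = len(Hgrid[0])
    N = len(xs) - L + 1
    num = [0.0] * N                                  # P(B_k = +1, X) numerators
    total = 0.0
    for seq in itertools.product((+1.0, -1.0), repeat=N):
        B = lambda i: seq[i - 1] if 1 <= i <= N else 0.0
        for H, pH in zip(Hgrid, Hprior):
            ll = 0.0                                 # likelihood product, Eq. 4.2.9
            for k in range(1, N + L):
                mean = sum(B(k - j) * H[j] for j in range(L))
                ll -= (xs[k - 1] - mean) ** 2 / (2.0 * sigma2)
            w = pH * math.exp(ll)                    # uniform data prior dropped
            total += w
            for k in range(N):
                if seq[k] > 0:
                    num[k] += w
    return [+1.0 if nk > total / 2.0 else -1.0 for nk in num]
```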

184 As in Eq. 3.2. 11, it is shown in Appendix A, Ea. A.21, that P(X,-Hk+l HN+L-1 Bk-J+l Bk' H1 * Hk' Xk) = P(Xk' Hk+ — HN+L- 1Bk-J+ Bk, H H) (4.2.13) The simplification of Eq. 4.2.13 admits recursion relationships for the two probabilities on the right of Eq. 4.2.12 similar to Eqs. 3.2. 13 and 3.2. 15. Specifically, it is shown in Appendix A, Eq. A.34, that an updating equation for P(BkJ+1'' Bk, H1 * Hk Xk) k-J+1 k' 1 k' Xkis P(BkJ+ Bk, H1 Hk, Xk) ~ P(xk Bk L+ 1 Bk, Hk) P(Bk IBk K Bk- 1) Bk-J b P(Hk1 Hk-l) P(Bk-J Bk-, H1''Hk-_, Xk_1) k=J+l, J+2,.., N (4.2.14) where Eq. 4.2.14 can be initialized by P(B Bk' H1'' Hk' Xk) = P(xklBkL+ Bk' k (BklBkKb.. k- l) kH -kl) P(B1 Bkl, H1 Hk-l, Xkl) k =1,2,..., J. (4.2.15)

185 Similarly, a backdating relation for P(Xk, Hk+1 HN+L- 1 Bk J+1 ~ Bk, H1. Hk) is given by Eq. A. 46, P(Xk, Hk+ 1 "-UN+L-1 tBk- J+l Bk, -- k) Z P(x k+l BkL+2" Bk+l' Hk+l) P(Bk+l Bk-Kb+l )' k+1 -(k+l H11 -k) Pk+l' -k+2 HN+L-1 |Bk-J+2 Bk+l' _ 1 Hk+l) k = N-l, N-2,..., J (4.2.16) where Eq. 4.2. 16 can be initialized by (Xk' Hk+l'''HN+L-1BkJ+ BN H1 Hk) P(xk+l1 Bk-L+2" BNs Hk+l 1) P (Hk+ 1_H1 -k)~ Pk+' k+2 -N+L-1 tBk-J+2 BN' 1 -k+l) k=N+L-1, N+L-2,..., N (4.2.17) Once Eq. 4.2.12 has been calculated, the decision on Bk is based on P (Bk XN+L- 1 ) " ~ Y'"r P(Bk -J+l" s — k -N+L- XN+L-) d H Bk- J+ Bk- 1 d H d HN+L-1 (4.2.18)

Once Eq. 4.2.12 has been calculated, the decision on B_k is based on

P(B_k, X_{N+L-1}) = \sum_{B_{k-J+1} ... B_{k-1}} \int \cdots \int P(B_{k-J+1} ... B_k, H_1 ... H_{N+L-1}, X_{N+L-1}) \, dH_1 \cdots dH_{N+L-1}    (4.2.18)

As before, it is only necessary to calculate Eq. 4.2.12 for values of k which are multiples of J. The computation and memory associated with (B_{k-J+1} ... B_k) are fixed at each stage of the algorithm. However, the computation associated with the channel vectors grows exponentially in the updating algorithm and is fixed at its largest value Q^{NL} in the backdating algorithm. The ultimate probability calculated in Eq. 4.2.12 has M^J Q^{NL} values, or 4 \times 10^{20} for the example considered earlier of M = J = L = 2 and Q = 10. K_h-Markov channel statistics must be considered before a reasonable amount of computation is approached.

When the channel is K_h-Markov, a combination equation analogous to Eq. 3.2.9 is

P(B_{k-J+1} ... B_k, H_{k-K_h+1} ... H_k, X_{N+L-1}) = P(\bar{X}_k | B_{k-J+1} ... B_k, H_{k-K_h+1} ... H_k, X_k) P(B_{k-J+1} ... B_k, H_{k-K_h+1} ... H_k, X_k),   J \le k \le N    (4.2.19)

and it is shown in Appendix A, Eq. A.47, that

P(\bar{X}_k | B_{k-J+1} ... B_k, H_{k-K_h+1} ... H_k, X_k) = P(\bar{X}_k | B_{k-J+1} ... B_k, H_{k-K_h+1} ... H_k),   J \le k \le N    (4.2.20)

187 An updating equation for P(BkJ1 Bk, Hk-Kh+l Hk, Xk) can be derived very simply by integrating both sides of Eq. 4.2.14 over (H1 -Hk-K h) to yield P(BkJ+ B HKh+'' Hk, k- J+ k' -k- -k' Xk) = P(xkBk L+1 Bk' -k) P(Bk k-Kb Bk-1) Bk-J (k H-k_ K- — k- 1) P(Bk-J' Bk-l' -k-Kh k- 1' Xk- 1) d -_Kh max {J+1, Kh+1} < k < N (4.2.21) For 1 < k < min {J, Kh}, Eq. 4.2.15 is valid, and for min {J, Kh} < k < max {J+1, Kh+l} the updating equation is the appropriate combination of Eqs. 4.2.15 and 4.2.21. It is shown in Appendix A, Eq. A. 56, that a backdating relation for P(XkRBkJ+1 Bk, Hk-Kh+l "-1-k) is (Xk k-J+ B1 k' -k- Kh+ 1 -k) k+ 1 P(Hk+ Hk-Kh+ Hk) P(Xk+l Bk-J+2 Bk+l' k-Kh+2 — k+1) dHk+l k=N-1, N-2,..., J (4.2.22)

188 where Eq. 4.2.22 can be initialized by P(Xkl Bk-J+ BN'+l' kHk f P(Xk+ 1Bk-L+2 BN' Hk) P(Hk+l Hk-Kh+ H ) P(Xk+ 1 Bk-J+2'' BN' Hk_ Kh+ 2 H k+ 1 ) dHk+ 1 k = N+L-1, N+L-2,..., N (4.2.23) When the iteration of Eqs. 4.2.21 and 4.2.22 are complete, the decision on Bk is based on P(Bk' XN+L-1) B E *B f" f P(BkJ B+l k' Hk- Kh+ 1Hk XN+L-1) k- J+1 Bk- I d_ IKh+1 -k (4.2.24) The computation and memory requirements of the updating, backdating, and combination equations are fixed. The probability in Eq. 4.2.19 must be computed for MJ values of (Bk J+1 Bk) and ~h values of (Kh+ Hk) for a total of Q M values, which is independent of N. For the previous example, this number 2Kh is 4 x 10, which can be reasonable for small Kh. The final and simplest case is that of a fixed random channel. In this instance the combination equation analogous to Eq. 3.2. 9 is

189 P(BkJ+1 Bk, _H XN+L_1) P(Xk Bk-J+1 * Bk' H, Xk) P(BkJ+' Bk' H, Xk) J < k < N (4.2.25) and it is shown in Appendix A, Eq. A. 57, that P(XkBk-J+l Bk' H Xk) = P(Xk BkJ+1 "Bk, H) (42.2 6) An updating equation for P(Bk_j J+1 Bk, H, Xk) is Eq. A.67, P(B- J+ Bk H, Xk) B P(xkl BkL+1 Bk, H) P(Bk BkKb.* Bk-l). Bk-J P(B B 1 H Xk k = J+1, J+2,..., N k-J B k -l, Xkl) (4.2.27) where Eq. 4.2.27 can be initialized by P(B1' Bk, H, Xk) = P(xk BkL+ Bk -H) P(BkiBk-. Bk) P(B1 Bk_1, H, Xk_1) k = 1, 2,..., J. (4.2.28) Similarly, a backdating relationship is given in Eq. A. 75,

190 P(XkBlkJl Bk, H) CB P (X BkL Bk+L+2' H) P(Bk+ Bk Kb+ k+1 P(Xk+ 1Bk-J+2'Bk' H) k=N-1, N-2,..., J (4.2.29) where Eq. 4.2.29 can be initialized by P(Xk Bk J+ 1' BN' H) = P(Xk+l Bk-L+ 2 BN' H) P(X k+ Bk-J+2 BN' H) k =-N+L-1, N+L-2,..., N. (4.2.30) Finally, the decision on Bk is made on the basis of P(B' XN+L- 1 ) = P(Bk J+ 1BkH, XN+ L 1 ) d H Bk- J+lI Bk- 1 (4.2.31) This fixed random channel algorithm requires that the probability in Eq. 4.2.25 be calculated for MJ values of (BkJ+l Bk) and QL values of H for a total of QLMJ values. For the previous example this number is 400, and independent of N. All the algorithms of this section require that the entire observation vector XN+L1 be received and stored before the backdating algorithm can commence and decisions on any of the data digits

191 can be made. Just as the algorithms of Section 3.2 were modified to yield real-time algorithms, the results of this section will be modified in Section 4.3 to yield real-time bit detectors for each combination of data and channel statistics. These real-time algorithms will eliminate the necessity of storing the entire observation vector. Rather, they will process and discard the observation samples as they arrive. 4.3 Real-Time Bit Detector for Random Channel In this section the real-time bit detector for a random channel will be derived. This bit detector can be deduced for D = 0 from the solution of Hilborn and Lainiotis [28] to a general pattern recognition problem. However, the real time bit detector can also be derived from the results of Section 4.2 for D > 0, so this is the approach which will be followed. The real-time bit detector bases its decision on Xk+D for some non-negative integer D. For the case of general data statistics, satisfactory recursion relations and decision equations have been derived in Section 4.2.1. These are repeated here: General Channel Statistics P(B1'Bk+D, H1 - Hk+D, Xk+D) P(XDk+D-LB k+D L+ B k+D Hk+D) P(D1 Hk+Di-1)

192 P(BI B* -B )P(B1 B H H X k+Di 1 k+D-1)P(k+D- 1 -k*-D-1' k+D- 1' (4.3.1) P (Bk' Xk+D) = B" "ff(BBB1 Bk+D, 1 k+D+D, Xk+D B1 Bk- Bk+l Bk+D dH * d H (4.3.2) K -Markov Channel ~-h-~~~ +h P(B!' Bk+D' Hk+D-Kh+1 Hk+D' Xk+D) P(Xk+D Bk+D-L+1 Bk+D' Hk+D) P(Bk+D B1.. Bk+D-1) ~ Hk+DH HH f P(Hk+D -.- H-k+D- Kh Hk+D- 1 ) H X )dH P(B1 Bk+D- 1' Hk+D- Kh -k+D- 1' k+D- 1 k+D- Kh (4.3.3) P (Bk' Xk+D) C f..feP(B. B H -H B1 B B B 1 k+D'k+D-Kh +1 -k+D' 1 k-1 k+l k+D Xk+D) dk+DKh+l dHk+D (434)

193 Fixed Channel P(B1'' Bk+D' H, Xk+D) P(Xk+DiBk+D-L+1 Bk+D' H) P(Bk+DB1' Bk+D-1) I P(B1 Bk+D- 1' H, Xk+D- 1) (4.35) P (Bk, Xk+D) (4.3.6) 3~~ 3... C Sf P(B1** Bk+i, H Xk+) dH B1 Bk-l Bk+1 Bk+D The appropriate updating equations for Kb-Markov data statistics can be derived by assuming, as in Section 3.3, that D > J and summing Eqs. 4.3.1, 4.3.3, and 4.3.5 over (B1.. Bkl ). The resulting equations follow: General Channel Statistics P(Bk' Bk+ D', 1 k+D' k+D) P(Xk+Dt Bk+D-L+ Bk+D' Hk +D) P(Hk+DH1' Hk+D-1) P(Bk+DBkD- Bk+D-Kb B) P(Bkl Bk+D- 1' H 1 Hk+D-1' Bk- 1 Xk+D 1) (4.3.7)

194 P(Bk, XkRD)= 3.. 3 f..fP(B..B,H H X dH dH B P(Bk'D Bk+D, H1 -Hk+D' Xk+D) dH1 dk+D k+1 k+D (4.3.8) K -Markov Channel P(k' B+D' Hk+D-Kh+ 1 -k+D' Xk+D) P(x k+D Bk+D-t L 1 Bk+D Hk+D) P(Bk+D Bk+D-Kb k+D- 1) P(-Hk+D-Hk+D-Kh Hk+D-) PB k k1 k+D- 1' Hk+D-Kh Hk+D- 1' Xk+D- ) d k+D(4.3.9) P (Bk Xk+D) = ~ S J'P(Bk' B k+D -Hk+D-Kh+1 — k+D' Xk+D) k+ 1 k+ D d H ~ d H(4.3.10) k+D-Kh+1 k+D (4. 3.k+0) Fixed Channel P(Bk * Bk+D H, Xk+D) P(Xk+DI|l+ Bk+D, _H) P(Bk+D1 HB+D)J* Bk+ D Kb Bk+D- 1)

195 Z P(B Bk +D H, Xk+D1) (4.3.11) Bk kDP(Bk' Xk+D) ='' * P(Bk k+D Bk+D dH Bk+l Bk+D (4.3.12) These updating equations are all quite similar to those of the real-time bit detector of Section 3.3. The only difference is that there is additional storage and computation associated with past and present channel vectors. The amount of additional computation and storage which is required is dependent on the type of channel statistics considered. Specifically, the computation and storage is L LKh LN approximately QL QLKh or Q times that required for the known channel real-time bit detector for fixed random, Markov, and general channel statistics respectively. 4.4 Conclusions Extension of the bit detector to the random channel case has been accomplished. The computation and storage requirements are somewhat impractical for all cases except for the real-time bit detector with Markov data and channel statistics and small Kb and Kh. In Chapter 6 the real-time block detector, used in conjunction with estimators to be developed in Chapter 5, and the real-time bit detector will be simulated on a computer for a fixed known and fixed random channel and their performances will be compared.

CHAPTER 5 CHANNEL ESTIMATION ALGORITHMS This chapter is prompted by the observation of Section 4. 1 that the block detector dynamic programming algorithm does not generalize to the random channel case. This is unfortunate, since it would be desirable to have a receiver which improved upon the performance of the transversal filter adaptive equalizer when intersymbol interference is severe, but which requires less computation and memory than the bit detector of Chapter 4. The known channel block detector of Chapter 3 can be used on the random channel if estimates of Hk are available for use in place of the actual channel vector. The purpose of this chapter is to develop channel estimators. Surprisingly, there are almost no published results on the direct estimation of the channel vector, because the emphasis has been upon developing algorithms for adjusting the tap-gains of a transversal filter equalizer (as in Section 1. 6). Costello and Patrick [42] recently gave a direct estimation algorithm for use with a detector similar to the block detector, but which does not employ dynamic programming. In this chapter, the emphasis will be on fixed step size stochastic approximation algorithms such as that described in Section 1.5.3. The only other published approach to stochastic approximation in the presence of time variations is that of Chien and 196

197 Fu [33], where a particular functional form for the time variations is assumed. The estimators of this chapter will be classified as "supervised" or "unsupervised." These are terms which are drawn from the pattern recognition literature, and which refer to the two ways that an estimate of the channel vector can be made. In supervised estimation, the estimator is given, or knows beforehand, the values of the data digits (B1 * BN). In unsupervised estimation, the estimator does not know the data digits, but may have available their statistics. An example of an unsupervised estimator was given in Chapter 4. The bit detector of that chapter implicitly makes an estimate of the channel vector H, and this estimate is unsupervised since the bit detector is not given the data digits. Consider, for example, the fixed random channel real-time bit detector of Eq. 4.3. 11. The a posteriori probability density of H based on Xk+D can readily be calculated at any stage of the algorithm, since P(H, Xk+D) D(H 1XT1V ) kD(5.0.1) P(H Xk+D) f P(H, Xk+ D)dH where P(H X+D) =.. P(Bk..Bk+D, H, Xk+D) (0.2) k k+D

198 From this a posteriori probability many unsupervised estimates of H can be made. Perhaps the best example is the a posteriori mean, H E(H X H P(=H)=d, (5.0.3) -k = E(HTXk+D) = S H P(HXk+D) dH, which is the minimum mean square estimator of H. Several examples of supervised estimators, which in general are much simpler than unsupervised estimators, will be given in this chapter. Supervised estimators are usually used in one of the following two ways: 1. The receiver is operating in a training period, the purpose of which is to obtain an accurate estimate of the channel vector H prior to the commencement of actual data transmission. During this period, the transmitter sends a data sequence, such as a maximal length shift register sequence, which the receiver has in memory or can generate. 2. The receiver is operating in a decision directed mode. A decision directed receiver has two parts: a detector, which uses an estimate of the channel vector H to make decisions on the data digits, and a supervised estimator, which uses the detector's decisions as if they were the actual data digits in order to update the channel estimate. This is one of the simplest unsupervised estimation techniques, but it has the inherent danger that decision errors can cause poor estimates to be made, which cause more incorrect decisions, and so

199 forth. The effect of decision errors on the supervised estimators of this chapter, used in a decision directed mode in conjunction with the known channel block detector, will be studied by computer simulation in Chapter 6. Sections 5. 1 through 5.6 will develop supervised channel estimators, Section 5.7 will derive two unsupervised estimators, and a numerical example for L = 2 will be given in Section 5.8. 5. 1 Supervised A Posteriori Density In this section the supervised a posteriori probability density of tHk based on Xk will be determined for a random channel and any a priori density P(H1 * * HN+L 1). It will then be shown that for a fixed random channel there exists a fixed dimensional sufficient statistic which can be used to determine this a posteriori density for any a priori density P(H). The supervised a posteriori density, P(HklB B * Bk, Xk), is analogous to the density of Eq. 5.0. 1, except that knowledge of (B1.. Bk) is required. Once this density is determined, any of the standard estimates, such as the a posteriori mean or maximum a posteriori probability, can be obtained. This density will now be obtained for the same three types of channel statistics as were considered in Chapter 4. The data statistics do not enter into this density, since it is conditional on (B1!. Bk). When a general probability density P(H1 HNTL 1 ) is given,

200 the desired a posteriori density is given by P(Hk B1..Bk Xk) f fJ.. P(H1 kB1.. Bk, Xk) dH1.. dHk (. 11) A recursive method of calculating the integrand of Eq. 5. 1.1 is required. By the chain rule, P(H1 "Hkk' XkBl" Bk' Xk-1) = P(XklB1 *Bk,'_H k, Xk) P(H1'_HkB1'' Bk Xk) = P(H1 — HkB1B Bk, Xk) P(xkB1 Bk' Xk-l) (5.1.2) where P(xkl B Bk, H 1 Hk) Xk) =1 P (xk Bk- L+ 1 Bk, k) (5.1.3) from Eq. A. 29, P(H1.. _ kBl Bk, Xkl) P k B1 Bk' - - k-H l' Xk- l) P(H - 1 Hkl B1' Bk' Xk-l) (5.1.4) and P(HkIB" Bk, H1 "Hk-l, Xkl) = P(Hk1 1k-l) (.1.)

201 from Eq. A.33. Combining Eqs. 5.1.3 through 5.1.5 in Eq. 5.1.2, the desired updating equation is P(_H1 * H"kB1 I Bk, Xk) P (Xk Bk- L+' Bk' Hk) P(BHk | H'' Hk- 1) P(H 1'' Hk-l I B - Bk-' Xk- ) P(xk B1'' Bk, Xk- 1) P(Xk k Bk-L+1' Bk' Hk) P(Hk H''1 Hk- 1) P(H1 *' Hk- l B'1' Bk-l' Xk- 1) f *. f (Numerator) dH1.. dHk (5.1.6) The integral relation for P(xk B1'' Bk Xk 1) in the denominator of Eq. 5. 1. 6 can be obtained by integrating both sides of Eq. 5. 1. 2 over (HI..- Hk). Implementation of Eq. 5.1.6 is obviously impractical for all but very small values of N, simply because of the exponentially growing computational requirements. This problem is eliminated and a fixed processor structure is obtained if a Kh-Markov probability structure on the channel is presumed. An equation analogous to Eq. 5.1.2 is, in this case, Pk- Kh'k' XkB1l Bk, Xk-l) BkB ak H X) P(H.H B X) P(xkBl Bkk' Hk-Kh Hk Xk-1) P(-Hk-Kh -Hk B1 k' k- ) P(Hk.' HkLB1'' Bk, Xk) P(XkIB' Bk, Xk_) (5.1.7)

202 where, as in Eq. 5.1.4, P(H k- Hk{B1.. BkXk_1) = P(Hk lB1 Bk5 k-Kh Hk-, Xk1) P(_Hk- Kh Hk-IB 1 Bk, Xk 1l) h P Hk - Kh -k- 1) P(k- Kh- 1' k- 1 B1 Bk- 1' X- 1)dH- Kh- (5.1.8) Combining Eq. 5.1.8 in Eq. 5.1.7 in the same way that Eq. 5.1.6 was derived, the updating equation becomes P(Hk-Kh Hk BI Bk, Xk) P (Xk Bk-L+1'' Bk' -k) P(-Hk k-K — Hk- ) P (Hk-K - Hkh 1 Kh- 1 11 1 kk -k-h Kh f f * Js (Numerator) dHkK dHk (5.1.9) The desired a posteriori density can be obtained from Eq. 5.1.9 by a trivial modification of Eq. 5.1.1. As was the case in Chapter 4, there does not appear to be any prospect of finding a fixed dimensional sufficient statistic for the densities of Eq. 5.1.6 or Eq. 5.1.9 because of the time-varying nature of the problem. Thus, implementation of either of these algorithms would require that the H vectors be quantized and their continuous

203 densities be replaced by discrete probabilities. The resulting computation and memory requirements are so large that relatively little computation and memory would be saved by using these estimators together with the known channel block detector in place of the random channel bit detector. The situation is much more encouraging in the fixed random channel case. It will be shown that in this case the Gaussian a priori density for P(H) is a reproducing density function. It will follow that the mean and covariance matrix of H constitute a finite fixed dimensional sufficient statistic which can be easily updated. In addition, it will be shown that this sufficient statistic can be used to determine the a posteriori density of H for any a priori density on H. For the fixed random channel, an equation analogous to Eq. 5.1.2 is P(H, XkjB1- Bk, Xk-l) = (xklB Bk, H, Xkl) P(HB1.. Bkl, Xk-l) = P(H B1. Bk, Xk) P(xkB1 Bk, Xk-l) (51.10) The resulting updating equation for the a posteriori density of H is P(HIB1.. Bk, Xk) = P(xki BL+ 1 * Bk' H) P(HB1 Bkl, Xk- 1 (5.1.11) f (Numerator) dH

204 Before establishing that a Gaussian a priori density reproduces in Eq. 5.1.11, a result due to Birdsall [29] which allows any a priori density to be handled will be restated in the present context. Let P0(H) be the Gaussian a priori density of Eq. 4.1.3, and let P1(H) be any other density on H. Since P0(H) / 0 for all H, the ratio of PI(H) to P0(H) exists for all H, P1(H) r(H)- P0H (5.1.12) Let the two a posteriori densities be denoted by the same subscripts as the corresponding a priori densities. Then a relation similar to Eq. 5.1.10 is Pj(H, Xk B1.. Bk) - Pj(HB1 *. Bk, Xk) Pj(Xk B1l. Bk) P(XkB1l Bk, _H) Pj(HIB1' B j =1,2 (5.1.13) from which it follows that P(Xk| B. Bk, H) P (H) =.(H|B1 k* Bk -k) = J (5.1.14) Pj(H B1.. Bk' Xk) J-S (Numerator) d H j = 1, 2 Substituting Eq. 5.1.12 in Eq. 5.1.14,

205 Bk, =P(Xk B1 Bk, H) r(H) P0(H) P1 (H}B1" Bk' Xk) f (Numerator) d H r(H) Po(HIB1' Bk, Xk) f P(XkjB1 Bk,! H) P0(H) dH f P(XkB1I' Bk, H) r(H) P0(H) dH (5.1.15) where the ratio of integrals in Eq. 5.1.15 is independent of H and is evidently the normalizing constant. The final relationship between the two a posteriori probabilities is therefore r(H) PO( (HB1 Bk, Xk) P(HIB1 Bk, Xk) = (5.1.16) P-H- Bk k) = f (Numerator) d H If PO(HIB1 Bk, Xk) can be determined simply from a Gaussian a priori density, then PI(H lB'. Bk) Xk) can be found for any P1(H) from Eq. 5.1.16 and Eq. 5.1.12. It can readily be established that P0(H B1 Bk, Xk) is a Gaussian density. Let P0(H) be given by Eq. 4.1.3, and let the mean m0 and covariance A be given by Eq. 4.1.4 and Eq. 4.1.5 respectively. The proof that PO(H B1 Bk, Xk) is a Gaussian density is by induction. Assume that the density of H conditional on Xkl and (B1 * B 1) is Gaussian with mean _mk1 and covariance Ak l, and calculate the numerator of Eq. 5.1.11,

206 P(xk Bk-L+1' Bk' _H) PO(H B1 Bk-' 1k- ) = [(2,a) 2Ak a exp - + (H- nk 1) Ak-l (H - T -k- I -k- I k- 1 (5.1.17) Manipulating the exponent of Eq. 5.1.17, it is a quadratic form in H equal to (H - mk) A-1 (H - mk) + constant where A1 - A- + B Bk (5. 19) -k -k- 1 2 -k -k -1 -1 1 A m =Ak lrkln + BgX (5.1.20) Ak —k- _k- 1-k- 1 2 2 Bk Xk Therefore, P(Ht B1'' Bk, Xk) is a Gaussian density with a mean value __k given by Eq. 5.1.20 and a covariance matrix Ak given by Eq. 5.1.19. The updating equation for the a posteriori mean of Eq. 5.1.20 will now be put in a form which shows its similarity to the RobbinsMonro type of stochastic approximation algorithm. Iterating Eq. 5.1.19, it becomes

207 - - 1 T — 0 A + ~-B. BT (5.1.21) _ k 0i=l -1 Since A0 is assumed to be positive definite in order for the probaT bility density of Eq. 4.1.3 to exist and each B.iB. is non-negative definite,* Ak of Eq. 5.1.21 is positive definite for all k. Therefore its inverse Ak exists for all k. Assume the {Bk} is a widesense stationary sequence of random variables, and define C = E(Bk B) C C C C 0 1 LC1 C0. CL-2 (5.1.22) C C C L- 1 L-2 0 where C = E(Bk Bk+j) (5.1.23) Then the mean of Ak is, from Eq. 5.1.21, E(A )= AO +- C (5.1.24) k -0 2 - T B B Y (yT B )2 > o - k — k - - k

208 and Ck k Ak (5.1.25) is an asymptotically unbiased estimate of C. The updating equation can now be expressed in terms of Ck by multiplying both sides of Eq. 5.1.20 by Ak, and substituting for A 1 from Eq. 5.1.19, 1 m = m k-I+- C 1 Bk(xk - B_ m ) (5 1 26) k -k k1+k k-.. The algorithm of Eq. 5.1.26 has the general form of the RobbinsMonro stochastic approximation algorithm of Eq. 1.5. 6. The stepsize ak = 1/k of Eq. 5.1.26 satisfies the requirement of Eqs. 1.5.7 and 1.5.8. The purpose of the Ck.1 Bk matrix in the algorithm of Eq. 5.1.26 becomes apparent when Eq. 5.1.26 is expressed in terms of the error between mk I and H, = m - H (5.1.27) -k -k - Using Eq. 5.1.27 and Eq. 3.1.1, Eq. 5.1.26 becomes 1 A -1 T 1 -1 -k -k-lk -k -k k -k-l + -k Ck -k (5.1.28) Since _C is an estimate of the expected value of Bk B, the C 1 BB term will be approximately the identity matrix on the -k -k -ktemwlbeapoiaeyteiettmarxnth

209 average. The algorithm simply increments the mk1l by a term which is, on the average, proportional to the error between mk_ 1 and H. It is of practical interest to determine conditions under which the algorithm of Eq. 5.1.26 converges to H. The convergence of Eq. 5.1.28 to H is difficult to establish rigorously because of the presence of the inversion of the random matrix A1 and the statis-1 tical dependence of A and Bk. Nevertheless, there is a great deal that can be said about convergence through partially heuristic considerations, and additional insight into the nature of the algorithm results. First of all, Eq. 5.1.26 cannot converge for every data sequence {Bk}. As an example, consider the sequence in which every data digit is identical, say Bk = bI. The received samples are then L-1 Xk = bl h(i) + nk (5.1.29) i=O for every k. The mean value of xk is the same for every k,and L-1 for every H for which 2 h(i) is the same. The best that any i=1 L-1 estimation algorithm could do is estimate: h(i), but not H i=l itself. This is the problem of "identifiability" frequently mentioned by Patrick [25] and others in the pattern recognition literature. In order for an estimation problem to be identifiable there must be a one-to-one transformation between the vector to be estimated, H, and

210 the observation statistics. Clearly the observation of Eq. 5.1. 29 is not identifiable, because many vectors H result in identical observation statistics. Under what circumstances, then, is convergent estimation possible? In the absence of noise, the received samples are Xk Bk H (5.1.30) and H can be determined from {xkk=1 if and only if L linearly independent non-zero values of Bk appear in the data sequence. In the presence of additive noise, it might be expected that L linearly independent values of Bk must appear in the data sequence infinitely often in order for convergence of any estimation algorithm to be possible. In order to relate this condition to the statistics of {Bk}, assume it is a wide- sense stationary random process with covariance matrix C given by Eq. 5.1.22. The following lemma is proven in Appendix B: Lemma BI: C is positive definite if and only if L linearly independent values of Bk have non-zero probability. Lemma Bi states that when the data sequence is wide- sense stationary with a C which is not positive definite, then with probability one fewer than L linearly independent values of Bk appear in the data

211 sequence, and any estimator will converge to H with probability zero (even in the absence of additive noise). Conversely, when C is positive definite, then L linearly independent values of Bk have non-zero probability of occurring in the data sequence, and with probability one, as N - cx each of those values will occur in the data sequence infinitely often. Returning to the a posteriori mean of Eq. 5.1.26, Lemma B1 indicates that convergence may occur when {Bk} is wide-sense stationary and C is positive definite. The mean-square convergence of Eq. 5.1.26 is considered in some detail in Appendix B. Although mean-square convergence is not proven, a convincing argument is given to justify the following conjecture: Conjecture: When mk is given by Eq. 5.1.26, and 1. {Bk} is a wide-sense stationary random process, 2. C = E(BBk ) is positive definite, 3. The Bk are known (without error), 4. The nk are independent random variables such that E(nk) 0 (5.1.31) E(nk2) < aZ < c (5.1.32)

212 then* lim Elim - HI12 = 0 (5.1.33) k- xc An interesting feature of the conjecture is that, even though the a posteriori mean of Eq. 5.1..26 was derived under the assumption that the nk are independent Gaussian random variables with zero mean and variance a2, the conjecture does not require that they be Gaussian at all! It only requires that they be independent with zero mean and uniformly bounded second moments. The most important point of the conjecture is that C must be positive definite. It is interesting that, even though the estimator has the data sequence available, the statistics of the data sequence strongly affect the convergence of the algorithm. In fact, it is shown in Appendix B that the average convergence rate of the algorithm is monotonically related to the smallest eigenvalue XI of C: the larger X1, the greater the rate of convergence. Another interesting feature of all this is that it is only the autocorrelation of Bk within a constraint length of L that affects the convergence of the algorithm. The quantity C. for Ij > L has no effect. Briefly summarizing this section, the updating equations for the a posteriori density of Hk given Xk and (B1 - Bk) have been Whenever k - c, it is understood that N - since necessarily k<N.

213 derived for three types of channel statistics. For general and KhMarkov channel statistics, the required computation and memory are excessive and quantization of Hk is required. In the fixed random channel case, the a posteriori mean of H, given by Eq. 5.1.26, and the a posteriori covariance of H, given by Eq. 5.1.21, constitute a fixed dimensional sufficient statistic from which the a posteriori density can be determined for any a priori density. The convergence to H of the a posteriori mean of Eq. 5.1.26 in mean-square requires that C be a positive definite matrix. In Section 5.2, the maximum likelihood estimator of a fixed channel will be derived. Section 5.3 will consider a Robbins-Monro stochastic approximation algorithm for a fixed channel and stationary data statistics which eliminates the matrix inversion necessary for Eq. 5.1.26. In Section 5.4 that algorithm will be extended to a time-varying channel and in Section 5.5 to time-varying data statistics. 5.2 Maximum Likelihood Estimator for a Fixed Channel In the last section the supervised estimation of H was considered from a Bayesian point of view. That is, an a priori density on the channel vector H was assumed and the a posteriori density of H based on the observation was calculated. In this section, the maximum likelihood estimator of H, which is assumed to be fixed, will be derived. This estimator does not assume an a priori density

214 on H, but rather bases the estimate solely on the observation Xk, knowledge of the data digits (B1 Bk), and the knowledge that H is fixed. In order to determine the supervised maximum likelihood estimator, it is necessary to determine the H which satisfies sup n P(XkB1..Bk, H) (5.2.1) where fn P(Xk IB1.. Bk, H) (x. - BT H)2 + constant (5. 2.2) 2a2 i=1 l from Eq. 3.1.6. Taking the gradient of Eq. 5.2.2 with respect to H, the H of Eq. 5.2.1 must satisfy (Z B B )H - Z Bi Xi= 0 (5.2.3) i= - i=Define an estimate of C analogous to Eq. 5.1.25, k B.B.T Ck- k1 (Ck B ) (5.2.4) If Ck is non-singular, then Eq. 5.2.3 has a unique solution for H, which will be denoted by Hk'

215 H C- Bi Xi k k k i -1 + k C BkX- B ) (5.2.5) -Hk-I +k k (x k -k- I With the exception of the form of the estimate of C of Eq. 5.2.4, the algorithm of Eq. 5.2.5 is identical to that of Eq. 5.1.26. Upon comparing Eqs. 5.2.4 and 5.1. 25, the difference in the C estimators is twofold. First, Eq. 5.2.4 does not require knowledge of a2 Second, the inverse of Eq. 5.2.4 does not exist unless L linearly independent values of Bk occur in the sum of Eq. 5.2.4, for then and only then will Ck be positive definite. Therefore, the iteration of Eq. 5.2.5 cannot commence until L linearly independent values of Bk have occurred in the data stream. The convergence argument of Appendix B is still applicable to the convergence of Eq. 5.2.5. This establishes that supervised estimation of H does not require knowledge of a2. With the exception of the minor difference in how C is estimated, the a posteriori mean with a Gaussian a priori density and the maximum likelihood estimator result in the same estimation algorithm. 5.3 A Robbins-Monro Supervised Estimator In this section the Robbins-Monro stochastic approximation algorithm ov Section 1.5. 2 will be applied to the estimation of the channel vector H of a fixed channel when the data statistics are

216 wide-sense stationary and C is known. The resulting algorithm will eliminate the matrix inversions of the a posteriori mean and maximum likelihood estimators of Sections 5.1 and 5.2. Referring to Section 1.5, define the function Y(', ) as T 2 (5.3.1) Y(H, xk) (Xk Bk H) (5.3.1) Substituting for xk from Eq. 3.1.1, Y(H, xk) (H- H)+ nk (5.3.2) where H is the actual channel vector, assumed to be fixed for all k. Then, letting C be defined as in Eq. 5.1.22 and taking the expected value of Eq. 5.3.2, m(H) = (H- H) C(H- H)+ c2 (5.3.3) As was discussed in Section 5.1, C must be positive definite in order for convergent estimation to be possible. Therefore, assume hereafter that C is positive definite. From Eq. 5.3.3, m( H) can be bounded below, m(H) > a2 (5.3.4) with equality if and only if H = H. Thus, the minimum of Eq. 1.5.1 is satisfied by H = H. Furthermore, Eq. 1.5.3 is satisfied, since

217 grad m(H) = 0 H - 2 C (H- H) (5.3.5) has the unique solution H = H. A Robbins-Monro algorithm will be derived by the method of Section l.5.2 using the Y(,.) of Eq. 5.3.1. However, a greater understanding of the resulting algorithm results if the gradient search algorithm of Section 1.5.1 is studied first. The gradient search algorithm of Eq. 1.5.5 becomes, for the m(H) of Eq. 5.3.3, Hk -Hk-i - -k-i1 (5.3.6) where ck is the error between Hk_1 and H, Hc -H H (5.3.7) -k -k - The recursive relation for the error is, from Eqs. 5.3.6 and 5.3.7, k -C k-I (I- aC)0 (5.3.8) The convergence of Hk to H, lim Ek 0= (5.3.9) k-x-k can only be guaranteed for all possible initial errors E0 if

218 lim (I- aC)k = 0 (5.3.10) k —oc By Theorem 1.4. 2, Eq. 5.3.10 is satisfied if and only if 11-oX.l < 1, 1 <j<L (5.3.11) where 0 < k1 < K2 <... < XL are the eigenvalues of C. Equation 5.3.11 can always be satisfied by choosing ca sufficiently small. The following considerations give a better understanding of the operation of the algorithm of Eq. 5.3.6 than the preceding convergence proof: The algorithm subtracts from the previous estimate a term proportional to C times the error in the previous estimate. The vector C ek can always be written in the direct- sum decomposition C -k a Ek + Yk (5.3.12) where Yk is orthogonal to Ek, T Ek Yk 0 (5.3.13) -k Yk Since C is positive definite, from Eq. 5.3.12, ek Cck a E1 1k12 > 0 (5.3.14) _k -k _a so that automatically, a > 0 (5.3.15)

219 Therefore, there is always a positive component of C ek in the direction of ek. Carrying this line of thought further, let {xi}L -k' - i=1 be a set of orthogonal and normalized eigenvectors of C corresponding to the eigenvalues {i}L. Then e can be written as L Ek = a. x. (5.3.16) j=1 J-i and, from Eq. 5.3.14, the constant of proportionality (a) becomes L E a.2 X. a j (5.3.17) f a.2 j= J from which it follows immediately that 0 < X < a < XL (5.3.18) In addition, using Eq. 5.3.16 in Eq. 5.3.8, L =k+l (1 - a X.) aj x (5.3.19) -k+1 J-J and when Eq. 5.3.11 is satisfied, the component of the error in the direction of each eigenvector of C decreases at each stage. The critical dependence of convergence on the positive definite property of C is apparent from Eq. 5.3.19. If one or more eigenvalues of C were zero, any non-zero component of the error in the direction

220 of a corresponding eigenvector could never decrease. A Robbins-Monro stochastic approximation algorithm can be derived by using the gradient of Y (, ~), grad Y(H,xk) 2Bk( TH- x (5.3.20) H in place of the gradient of m ( H) in the algorithm of Eq. 5.3.6. Since C is assumed to be known, the B Bk matrix in Eq. 5.3.20 -k-k can be replaced by C without affecting the m(H) of Eq. 5.3.3. The result is an algorithm analogous to Eq. 1. 5.6, k = Hk_ - 1 k (C Hk_1 - Bk Xk) (5*3.21) This algorithm might be expected to converge to H under suitable conditions, such as those of Eqs. 1.5.7 and 1.5.8. Comparison of Eqs. 5.3.21 and 5.1.26 is more meaningful when the former is expressed in terms of the estimation error of Eq. 5.3.7, Hk -H k-1C- k C Ek-I - k(C - Bk )__H+ k Bknk (5.3.22) Neglecting the last two terms in Eq. 5.3.22 and the last term of Eq. 5.1. 28, all of which have mean value zero, the difference between the two algorithms is that in Eq. 5.1.28 there is an attempt to undo the effect of the B B matrix multiplying _ek1 by multiplying by C 1 - 1,,k an estimate of C. In Eq. 5.3.22, ek1 is simply multiplied

221 by C, and the necessity of inverting the matrix Ck is eliminated. As was seen in Eqs. 5.3.12 through 5.3.19, there is no necessity of eliminating the C which multiplies Ek_1 in Eq. 5.3.22 because C is positive definite and hence C ek_1 always has a positive component in the direction of EOk A block diagram of the stochastic approximation algorithm of Eq. 5.3.21 is given in Fig. 5.1 for L = 2 and an arbitrary C. When the data digits are independent (C1 = 0) the estimator is simplified considerably, with several of the branches and arithmetic operations eliminated. Because Eq. 5.3.21 has no matrix inversions, the meansquare convergence of the algorithm of Eq. 5.3.21 can be proven. The standard Robbins-Monro stochastic approximation algorithm convergence proofs [31] are not applicable, however, because the increments of Eq. 5.3.21 are not statistically independent. The following theorem is proven in Appendix C: Theorem Cl: When Hk is given by Eq. 5.3.21 and 1. tBk} is a wide-sense stationary random process, 2. C = E(BkBkT) is positive definite, 3. The nk are a sequence of independent zero-mean random variables such that E(nk2) < c2 <, (5.3.23)

CO "'-k+l C C1 _ _ II k+F i + S a x am+ + -- I. CI 0 -k+l C~~~~C1C Fig. 5.1. Stochastic approximation estimator, L = 2, C - C1

223 4. {ca.} is a sequence of positive real numbers such that (5.3.24) 3c ak c (5.3.25) k=l c 2 < Xc (5.3.26) k=1 5. The Bk are known (without error), 6. There exists an integer A > 0 such that BI and B. are statistically independent for i- jI > A, then lim E 11Hk-Hl12 = 0 (5.3.27) k-Xc Many of the conditions of Theorem C1 are similar to those of the Conjecture of Section 5.1. Specifically, Conditions 1, 2, 3, and 5 are identical to those of the Conjecture, while the step size 1/k of the algorithm of Eq. 5.1.26 satisfies Condition 4. The requirement of Condition 4 that {aji} be a monotone non- increasing sequence is a non-restrictive assumption (not generally required in Robbins-Monro convergence proofs) which simplifies the proof of convergence of the present algorithm. Condition 6 did not appear in the Conjecture of Section 5.1.

224 Essentially what it requires is that two data digits always be statistically independent when they are sufficiently far apart in the data stream. This condition is easily met in practice for random data which has no periodic components. Close examination of the analysis of Appendix B and C reveals that the reason this condition is not required in the Conjecture of Section 5.1 is the special manner in which the data sequence is compensated for by the multiplication by an estimate of C 1 It must be concluded therefore, that the algorithm of Eq. 5.3.21 is more sensitive to the data statistics than is the a posteriori mean of Section 5.. In particular, it is sensitive to the joint statistics of Bk and Bj for Ik- jI > L, whereas the algorithm of Eq. 5.1.26 is not. Further insight into the algorithm of Eq. 5.3.21, especially as it compares to the fixed step- size algorithm which will be considered in Section 5.4, is obtained by consideration of the solution. The solution of Eq. 5.3.21 is, from Eq. C.14, k Hk Wk 1k HO + w (i BWk x. (5.3.28) i=1 where the matrix Wk iis given by Eq. C.13. Equation 5.3.28 shows the relative weight given to each sample xi, 1 < i < k, in the determination of the estimate Hk. Specifically, each observation xi, i < k, is weighted by a scalar a. times the vector W i+ B. The normn of W i+ is denoted by k i k, i+~~~~ ]~k, i"

225 -klw,il. =k,i (5.3.29) in Appendix C. The quantity 3k, i which is indicative of the relative weight given B.x. in the determination of Hk, is graphed in Fig. 5.2 for X1 = 1/2, i = 1, and ak = 1/k. Consider just the first sample, B x1. If it were given equal weight with the other samples in the estimate Hk, then k, 1 would decrease like 1/k, which is shown in Fig. 5.1 as the dotted line. Since 1/k decreases faster than Ok, 1' it must be concluded that B1 x gets a progressively greater relative weighting than later B. x.'s in the determination of Hk as k increases. In fact, later B.x.'s are also multiplied by the decreasing sequence {cai} in addition to 3k, i' and this gives B1 xl an even greater relative weighting. The question arises as to why the particular weighting sequence 3k, i is chosen, or, equivalently, why {a(i} must satisfy the rather sensitive requirements of Condition 4 in Theorem C. 1. The answer is that this particular choice of {cli} satisfies two conflicting goals: First, it must satisfy Eq. 5.3.25 so that the algorithm can converge to H from any starting point in RL. On the other hand, it must satisfy Eq. 5.3.26 in order that the total noise variance added into the determination of Hk be finite as k - x. The finite noise variance added into an infinite number of stages allows the algorithm to converge in mean-square. In Se:Mtion 5.4 a fixed step- size algorithm will be considered.

226 0.6 0.5 0.4 0.3 a - 1. 0.2 \l. =1 1 0.1 0 1 2 3 4 5 6 7 8 9 10 Fig. 5.2. fk1 for 2

227 In this algorithm only the latest received samples are given a significant weighting, and the result is that an Hk which is varying slowly in time can be estimated. On the other hand, the noise variance added in each stage does not decrease with k, so that there is no possibility of mean-square convergence. 5.4 A Fixed Step-Size Algorithm for a Time-Varying Channel The convergence of the decreasing step-size stochastic approximation algorithm of Section 5.3 depended critically on there being a fixed channel and fixed second-order data statistics. When the channel is actually varying with time, that algorithm cannot track the channel. By simply replacing the decreasing step-size by a fixed step-size, an algorithm results which can track a slowly time-varying channel. However, this advantage is partially offset by the fact that the algorithm cannot then converge in mean-square, even when H is actually fixed. It is difficult to determine the performance of the fixed stepsize algorithm when H is varying in some fashion. Therefore, the algorithm will actually be analyzed for a fixed channel. The rate of convergence and bounds on the asymptotic mean-square error will be calculated. Then, when H is actually varying in time, as long as the rate of variation is somewhat slower than the convergence rate of the algorithm, the algorithm can be expected to track H well with a meansciuare error approximately equal to the fixed channel asymptotic mean- square

228 error. On the other hand, when the rate of channel variation is much greater than the rate of convergence, the algorithm cannot be expected to track the channel at all. It will be shown that there is a trade-off between rate of convergence and asymptotic mean- square error which is governed by the choice of the step- size. Specifically, it will be shown that the asymptotic mean-square error decreases with decreasing step-size, while the convergence rate increases with increasing step- size. For a given situation, the step- size must be chosen to appropriately balance the conflicting objectives of small asymptotic error and high rate of convergence. Define the fixed step- size stochastic approximation algorithm by letting cak= = in Eq. 5.3.21, H H k-1 (C Hk-1 B k) (5.4.1) k k- I.:k- 1kk A positive definite matrix C, not necessarily equal to C, has been substituted in Eq. 5.4.1 to account for the possibility of inaccurate knowledge of the data statistics. In Section 5.5, C will be replaced by B B in Eq. 5.4.1, and an algorithm which does not require knowledge of the data statistics will result. The fixed step-size cO of Eq. 5.4.1 does not satisfy Eq. 5.3.26, and therefore it cannot be asserted on the basis of Theorem Cl that Eq. 5.4.1 converges in mean-square. In fact, a fixed step-size

229 algorithm cannot converge in mean- sauare. In order to understand why not, write Eq. 5.4.1 in terms of the error and substitute for xk from Eq. 3.1.1, H - H -1 C Cek(C - - B )_H + -Bknk (5.4.2) -k — k- I -kThe last noise term, a Bknk, has a variance which is fixed for all k. Therefore, since there is a fixed variance noise added into the algorithm at each stage, the variance of Hk cannot decrease to zero. The ability of Eq. 5.4.1 to track a time-varying H can be established by observing the solution of Eq. 5.4.1, Ak k k-i k (I- a C) H0 + a (I- aC) B. x (5 4 3) from Eqs. 5.3.28 and C.13. The weighting factor k, i of Eq. 5.3.29 becomes k- i k = l(I- aC)i r (5.4.4) where r =II - a C II (5.4.5) It will be shown shortly that convergence of Eq. 5. 4.1 requires that r < 1. The weighting factor of B1xl, ik, 1 is plotted in Fig. 5.2 for (C = 1 and A = 1/2. It decreases exponentially with k, which is, of course, much faster than 1/k. Consequently, the relative weight

230 given B1 x in the determination of Hk decreases very rapidly with k. The fixed step-size algorithm gives a significant weighting to only the most recent received samples, and is therefore able to track H even as it is changing. The algorithm of Eq. 5.4.1 will now be analyzed under the assumption of a fixed channel vector H and wide- sense stationary data statistics with covariance matrix C. The mean value of Hk will be considered first, and it will be shown that when a is chosen properly, E(H ) converges exponentially to C1 C H. Bounds on the mean-square error between Hk and C C H will then be obtained. It is shown in Appendix D, Eq. D. 6, that the mean value of H is ~~~~-k ~ E(H ) -1 C H+ (I - C) (H - C CH) (5.4.6) If a satisfies r < 1 where r= II - a C 11 (5.4.7) = max 11- a.l (5.4.8) 1<j<L and where 0 < X1 < X2 <K... < L are the eigenvalues of C, then lim E(Hk,) = C 1 CH (5.4.9)

231 In particular, when C = C and Eq. 5.4.7 is satisfied, Eq. 5.4.1 is asymptotically unbiased. In fact, from Eq. 5.4. 6, IiE(H )-C 1C HI < rkH C C H1 (5.4.10) -k - - - _ 10 and the error between E(Hk) and its asymptotic value decreases at least exponentially with k. Since C and C are positive definite, aO can always be chosen sufficiently small so that Eq. 5.4.7 is satisfied. When C = C, the r of Eq. 5.4.8 is the same as r in Eq. 5.4.5. The parameter r is a measure of the rate of convergence of E(Hk) to C 1 C H. The smaller r can be made, the greater the rate of convergence. The dependence of r on the step-size a will now be determined. The parameter r can be determined from the following equations by substituting the eigenvalues of C for those of C. Since X! and )2 are the smallest and largest eigenvalues of C respectively, r is 1 - aX1, O < <1 + 1 L r = (5.4.11) 2;XL-, +X <a L EAnea n1;iL from Eq. 5.4.5. A necessary and sufficient condition for r < 1 is, therefore,

232 2 0 < a < amax (5.4.12) max XL From Eq. 5.4.11 it is apparent that r cannot be made arbitrarily small, but rather has the minimum value rmin = +1 (5.4.13) which occurs for = 2 (5.4.14) 1 L where 4' = L/i1 (5.4.15) is the ratio of largest to smallest eigenvalue of C Equation 5.4.11 is plotted qualitatively in Fig. 5.3(a). For ac near zero and 2-, the parameter r is near 1, so that the algorithm L converges slowly. When ac is chosen near mid-range, the algorithm has its fastest convergence. The minimum r, rmin (which corresponds to the greatest convergence rate), is plotted in Fig. 5.3(b) as a function of 4'; it increases from a minimum of zero at 4' = 1 and approaches an asymptotic value of one as 4 - zc. There is a maximum rate at which the algorithm can converge, that maximum rate is dependent on the parameter 4 alone, and is achieved by choosing a according to Ea. 5.4.14.

233 r r _ _1 I / vXL mrin +I = oI I I I I 2 2 0 X +XL max L (a) 1.0 min 0.75 0.5 0.25 0 1 2 3 4 5 6 7 8 9 10 (b) Fig. 5.3. Convergence of the fixed-step algorithm

234 Since the algorithm is asymptotically biased when C # C, it is of interest to determine the size of the asyrnptotic error. This will now be done for a concrete example with L = 2. C is given by Eq. 5.1.22, C0 C C (5.4.16) C1 C where CO E(B 2) and C E(Bk Bkl). Generally CO is known accurately; for instance, when Bk = +1, then C0 =1. However, there may be some uncertainty in the knowledge of C1, which depends on the statisticsl dependencies between adjacent data digits. Therefore, assume that there is an error of 6 in the knowledge of C1, so that C- (5.4.17) C1+s CO where 6 is small. The error between H and the asymptotic value of E(Hk) is C CH-H H- H C (C- C)H (5.4.18) where, by direct calculation,

235 CO2 - (C1 + 6) COC - C(C1 + 6) C0 - (C1+6)2 C 2 (C1+6)2 C (C-C( CoC1 - CO(C1 + 6) C2 - C (C + 6) C 2 (C1+ 6)2 C 2 (- C1+ 6)2 C1 -C (5.4.19) C2-C2 I I C 1 LCo C1 In Eq. 5.4.19, each term has been expanded in a Taylor series in 6 and only the first order terms have been retained. Therefore, the error vector is approximately C 1h(0)- C h(l) ^-1 1 C (C-C)H - (5.4.20) 0 0 - C0 h(O) - C1 h(l) 1 The locus of Eq. 5.4.20 normalized by a factor -6 has been plotted in Fig. 5.4 for C0 =1 and H = [1.0 0.5].T The locus demonstrates that the maximum 6 which can be tolerated for a given error varies widely depending on the value of C1. As C1 - +1 (particularly as C1 - -1) the estimator error becomes very sensitive to small errors in the value of C1. This is to be expected, since this is precisely where the smallest eigenvalue of the matrix C is approaching zero, and a good estimate of H is becoming increasingly

236 h(1) 1 0 ] H 3 -0.5- 05 /1 c2 /OC 1 1* C JC1+C 2 3 =0 X ~\C1=0'5 C1= 0.5 /x c_ =O.9 x C1 =-0.75 Fig. 5.4. Locus of error vector divided by 6

237 difficult to make. For C near zero, the error in the knowledge of C1 expressed as a percentage of C must be on the same order as the norm of the expected asymptotic error vector expressed as a percentage of I HII. Thus, if only lOpercent asymptotic error can be tolerated, IC (C-d)HII 1 (5.4.21) IIHI - 10 then the maximum 6 which may be tolerated is approximately 6Is,.. 1/'5161,~~ 1 ~(5.4.22) C 10O for IC I 1 < 1 (5.4.23) C 0 Now that the mean value of Hk has been considered in some detail, it is appropriate to consider the mean-square error between Hk and its asymptotic mean. This mean-square error has been bounded in Appendix D, and only the results will be summarized here. The general conclusion will be that if r < 1, the mean- square error ^2k decreases to a constant asymptotic value at a rate bounded by r The asymptotic error can be controlled by the choice of a, and under certain conditions on the data statistics, can be made arbitrarily small by making a sufficiently small. Unfortunately, as ac approaches

238 zero, the convergence rate also approaches zero, since the r (or r) of Eq. 5.4.11 approaches one. The mean-square has been bounded for three different types of data statistics, which are considered separately below: General Case In this case, no assumptions are made on the data statistics other than the usual one of wide- sense stationarity with positive definite covariance matrix C. It is shown in Appendix D that the A2k mean- square error decreases as r to an asymptotic error bounded by Eq. D.31, EII 1Hk-C 1CH I L a lim < k- La2 - 1 -r 1~ / 11 2 2 L3 b 4 a2 + M ) (5.4.24) LaY h (1 - r)2 A2k The bound as a function of k, which shows the r convergence rate, is given in Eq. D. 29 and is not repeated here because of its much greater complexity. The quantity on the left is the ratio of the asymptotic mean-square error per component of H to the noise variance, and the bound on the right is expressed in terms of the normsquared of H per component divided by the noise variance, which is proportional to the S/N ratio.

239 The case of small a is of particular interest. Restrict oa to < 2 (5.4.25) A1 + AL where r=l-aXl from Eq. 5.4.11. When Eq. 5.4.25 is satisfied, Eq. 5.4.24 becomes 11H -C C HI2 L I11HII2 1 lim - -- < + (X L b 2 2 L ( k-x- L z2 X1(2- aXl) L a 1 (5.4.26) The second term in Eq. 5.4.28 is a fixed error independent of a but proportional to the S/N ratio. The first term is a function of the step size a and can be reduced to an arbitrarily small value by letting a -- 0; however, as a - 0, r - 1, and the convergence rate decreases. The greatest convergence rate occurs for a relatively large value of a, a = and for that value of a, Eq. 5.4.26 1 + XL becomes 11 H C CH12 HX lk - -_L lim < + k- c L2 1 i1 L II H112 L 2 4) ILu2a _ (x i+ L3 b 4). (5.4.27) Both terms of Eq. 5.4.29 can be relatively large when A1 is small.

240 A slightly tighter bound than Eq. 5.4.24, Eq. D. 32, is obtained in Appendix D for the case C =C. The only difference is that the L 2/ o 2 of the second term in Eq. 5.4.24 is replaced by unity; Li the undesirable feature of having a fixed error which cannot be eliminated by letting a - 0 is not removed. However, by making an independence assumption on the data statistics, that fixed error can be removed. Bk and B. Statistically Independent for Ik- jl > A -k -j The second case considered in Appendix D imposes Condition 6 of Theorem C1; namely, that there exist an integer A such that Bk and B. are statistically independent for I k- jl > A. In this -k -j case, the upper bound on the error can be made arbitrarily small for a sufficiently small ac. The specific bound is Eq. D.40, E ilHk- Ci C H 12 Xa lim < k-cx L u2 -r2 flH112 L3 bl4 + AL2 + M L F (ao) (5.4.28) La2 12 where A -2A+l zA A+ i ~ 2A F(t) =h x2 A2 1- r r r i- -/(1 + () 1 L <1- )(r (1- r) (5.4.29)

241 Equation 5.4.28 is essentially unchanged from Eq. 5.4.24 except for the factor of F(al) in the second term. It can readily be shown that F(ao) approaches zero as a approaches zero. Since r = (1 - a X1) for small a, use the asymptotic formula ^k k r = (1 - aX1) 1- ka430X1 which is valid for small a. Substituting Eq. 5.4.30 in Eq. 5.4.29, the asymptotic formula for F(ao) for small Oa is ao X F(oa) - (2~+1) 1 (5.4.31) 2- -a Substituting Eq. 5.4.31 in Eq. 5.4.28, that bound becomes E 11Hk- C CHII lim < k-xc La 2 L ( + (2A+1) - (L3 b A) (5.4.32) X1(2-a1) \ L for small ac. From Eq. 5.4.32, linr lim E IIHk C- C H 12 (5.4.33) ca-0 k-xc - and the asymptotic mean-square error can be made arbitrarily small by choosing al small.

242 F(ao) is plotted in Fig. 5.5 for 1 < A < 10. The fact that F(~a) approaches zero as ca approaches zero is evident. Figure 5.5 also shows how F(o) approaches one as A - c, so that the bound of Eq. 5.4.28 approaches that of Eq. 5.4. 24, as it should. However, care should be used in using Fig. 5.5, since it is derived under the assumption that 1< 2 or, equivalently, r < I1 Also, IL + AL a must be greater than or equal to L, since otherwise B. and B would contain common data digits and could not be independent. When the data digits are statistically independent, an exact expression for the mean-square error can be obtained. Independent Zero Mean Data Digits The case of independent and zero mean data digits will now be considered. In this case the data covariance matrix is diagonal, C = C I - 0 In order to obtain a closed form expression for the mean-square error, it is also necessary to assume C = C. Then it is shown in Appendix D that the mean-square error converges as r2k to an asymptotic value of

243 1.0 \ 60 0.8 - 0.6 1 0.4 0.2 0 0.2 0.4 0.6 0.8 1.0 (Valid for r > - 1a L) Fig. 5.5. F(a) for r = 1- a1

244 E II H - H112 lim k —o L 2 2-aC2C!H 1 2 L +E(Bk4) r whereLu C0the k I 2 c i - L- 2 + + r La2I C 2 f=-L p(2) 0 - a 0 2 (5 This exact expression can be compared to the bound of Eq. 5.4.28, which is valid in the present case with A\ = L. Noting that X. X. CO, Eq. 5.4.28 becomes ^122L 3 CO,-c L u2 - 20 +r = i-a~ CO~ ~(5.4.36) oL uL on the latter. Equation 5.4.34 contains an additional term which

245 approaches zero as ac - 0. The primary difference between the bound and the exact expression, however, is the (2L+1) factor in the bound, which generally makes it considerably larger. Of course, the asymptotic error of Eq. 5.4.34 approaches zero as a - 0, as indeed it must from Eq. 5.4.33. This completes the derivation of bounds on the mean-square error. A detailed numerical example of their application to a simple case will be given in Section 5.8. To summarize this section, the properties of the estimation algorithm of Eq. 5.4.1 are listed below: 1. When the data statistics are wide- sense stationary with known covariance matrix, the estimation algorithm is asymptotically unbiased. 2. The mean-square error converges exponentially to its asymptotic value, with the maximum rate of convergence determined by the ratio of largest to smallest eigenvalues of the data covariance matrix. 3. When there exists a A > 0 such that B. and B. are -1 J independent for l i- j l > A, the asymptotic mean- square error can be made arbitrarily small by choosing a sufficiently small step- size. 4. Under all conditions, the convergence rate decreases to zero as the step-size approaches zero. Thus, there is a tradeoff between convergence rate and asymptotic mean-square error. 5. The major disadvantage of the estimation algorithm of

246 Eq. 5. 4.1 is that when the data sequence has time-varying statistics and/or unknown statistics, the estimate will generally be biased. When the smallest eigenvalue of C is nearly zero, the estimation error can be substantial. In many practical situations, the data sequence can accurately be considered to consist of independent data digits, in which case the algorithm is particularly simple since C is a diagonal matrix. Such would be the case when the algorithm was being used in a training mode and a maximal length shift register sequence was used for data. However, in situations where the algorithm is used in a decision-directed mode with random data, often the assumption of independent data digits is unjustified. In this case, an algorithm which requires no knowledge of the data statistics and which can track time-varying data statistics in addition to a time-varying channel is needed. Such an algorithm will be studied in Section 5.5. 5.5 An Algorithm for Unknown and/or Time-Varying Data Statistics In Section 5.4 a fixed step- size algorithm which presumed knowledge of the data covariance matrix C was considered. If that covariance matrix is not known accurately and/or is varying in time, then that algorithm is likely to yield an asymptotically biased estimate of H. In this section the C matrix of Eq. 5.4.1 will be replaced by T BkBk, and the result will be an algorithm which does not require any knowledge of the data statistics.

247 Replace C in Eq. 5.4.1 by B B T to yield a new fixed stepsize stochastic approximation algorithm, T -H -oIB (B H ~ Hk-i Ol BkBk -k-1+ -k Bkk (5.5.1) The algorithm of Eq. 5.5.1 may actually be somewhat easier to implement than Eq. 5.4.1. For instance, if Bk = ~1, the elements of the matrix B Bk are all plus and minus ones, and no actual -k-k T multiplication is involved in forming B Bk Hk 1 and BkXk. On the other hand, the elements of C could be arbitrary real numbers for non-independent data digits, and multiplication hardware would be required to form C Hk_1. The comparison of Eqs. 5.5.1 and 5.4.2 reveals a great deal about the two algorithms. The algorithm of Eq. 5.4.2 is asymptotically unbiased for small oa because the vector C ek always has a positive component in the direction of ek 1 when C is positive definite T and B Bk is only non-negative definite, not positive definite. -k-k Therefore, it would appear that Eq. 5.5.1 may not be asymptotically unbiased. However, note that the only time that B B e does not -k-k -k-i T have a positive component in the direction of ek is when B1 = -k-1 -k -k-I 0. But if L linearly independent values of Bk occur repeatedly in the data sequence, for any ek1 a B must eventually occur such -k-i -,k

248 that _Bk Tk-_1 0. Thus, when the data statistics are wide-sense stationary with positive definite covariance matrix C, Eq. 5.5.1 may be asymptotically unbiased. The convergence rate may be slowed relative to that of the algorithm of Eq. 5.4.1, however, because the error Ek-1 is not reduced on every iteration. The algorithm of Eq. 5.5.1 is virtually impossible to analyze in the manner of Section 5.4 because of the presence of a matrix of the form k T E II (I- aB.B ) j-1 — J in the solution. However, by making a simple modification of the algorithm, this difficulty can be circumvented. Specifically, assume that there is an integer Az such that B. and B. are statistically — 1 - independent for li - j > A, and define a modified stochastic approximation algorithm, -Hk+l = H(k- 1)A+l H kia+l k+1 (k-l)A+l XkA+l) (5.5.2) The resulting simplification is that the increments of Eq. 5.5.2 are statistically independent. The algorithm of Eq. 5.5.2 is analyzed in Appendix E under the assumption that the channel vector H is actually fixed, the data statistics are wide- sense stationary with positive definite covariance matrix C, and B. and B. are statistically independent for — 1 -J

249 l i- ji > >A. Specifically, it is shown that if r < 1, then the algorithm is asymptotically unbiased, lim E(Hk+il) H (5.5.3) k —oc with a convergence rate of rk, where r is given by Eq. 5.4.5. In addition, it is shown that when R2 < 1, the mean-square error decreases as R2k to an asymptotic value bounded by E1II H -Hf112 (2 C E 11 HkA+1 -H a lim - < (5.5.4) k-xc L a2 - - R2 where R2 _ r2 + 2 2 2 (5.5.5) BB and a 2 is the variance of the matrix BkBk T BBT -k-k T = E lBBkBT - C 112 (5.5.6) BBT Since R > r, if R < 1 then automatically r < 1. Therefore, there will generally be some ac's for which the algorithm of Eq. 5.4.1 will converge but Eq. 5.5.1 will not converge. In general, the meansquare error of the algorithm of Eq. 5.4.1 will decrease faster to its asymptotic value than will the mean- square error of Eq. 5.5. 1. This is the price which is paid for not knowing the data statistics. From E;;. 5.4.11, R2 can be written explicitly as a function

From Eq. 5.4.11, R² can be written explicitly as a function of α and the eigenvalues of C as

    R² = 1 - 2αλ_1 + α²(λ_1² + σ²_{BB^T}),    0 < α ≤ 2/(λ_1 + λ_L)
    R² = 1 - 2αλ_L + α²(λ_L² + σ²_{BB^T}),    2/(λ_1 + λ_L) ≤ α    (5.5.7)

and the condition that R < 1 becomes a condition on α,

    α < α_max = 2λ_L/(λ_L² + σ²_{BB^T}),    λ_1 λ_L ≥ σ²_{BB^T}
    α < α_max = 2λ_1/(λ_1² + σ²_{BB^T}),    λ_1 λ_L ≤ σ²_{BB^T}    (5.5.8)

which reduces to the earlier condition of Eq. 5.4.12 for σ²_{BB^T} = 0 and is otherwise more stringent. It is straightforward to show from Eq. 5.5.7 that the minimum R² occurs at

    α_min = λ_1/(λ_1² + σ²_{BB^T}),    σ²_{BB^T} ≥ λ_1(λ_L - λ_1)/2
    α_min = 2/(λ_1 + λ_L),    σ²_{BB^T} ≤ λ_1(λ_L - λ_1)/2    (5.5.9)

and the minimum value of R² is

    R²_min = σ²_{BB^T}/(λ_1² + σ²_{BB^T}),    σ²_{BB^T} ≥ λ_1(λ_L - λ_1)/2
    R²_min = [(λ_L - λ_1)² + 4σ²_{BB^T}]/(λ_1 + λ_L)²,    σ²_{BB^T} ≤ λ_1(λ_L - λ_1)/2    (5.5.10)

Again, these values reduce to those obtained in Section 5.4 for σ²_{BB^T} = 0. For σ²_{BB^T} > 0 the value of α_min is reduced and the value of R²_min is increased. As σ²_{BB^T} → ∞, α_min → 0 and R²_min → 1.

The bound on asymptotic mean-square error of Eq. 5.5.4 can be expressed in terms of α by using Eq. 5.5.7,

    α²C_0/(1 - R²) = αC_0/[λ_1(2 - α(λ_1 + σ²_{BB^T}/λ_1))],    0 < α ≤ 2/(λ_1 + λ_L)
    α²C_0/(1 - R²) = αC_0/[λ_L(2 - α(λ_L + σ²_{BB^T}/λ_L))],    2/(λ_1 + λ_L) ≤ α < α_max    (5.5.11)

Equation 5.5.11 should be compared with the similar expression of Eq. 5.4.32. When σ²_{BB^T} = 0, they differ in two respects. First, the λ_L in the numerator of Eq. 5.4.32 is replaced by C_0 in Eq. 5.5.11. Second, there is an additional term in Eq. 5.4.32 which does not even appear in Eq. 5.5.11.
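The convergence parameters just derived are easy to evaluate numerically. The sketch below (Python; a hypothetical helper, not part of the original simulation programs) computes R², α_max, and the asymptotic bound from Eqs. 5.5.7, 5.5.8, and 5.5.11 as reconstructed above, given the smallest and largest eigenvalues λ_1 and λ_L of C, the variance σ²_{BB^T}, and C_0 = E(B_k²):

```python
import numpy as np

def convergence_params(lam1, lamL, var_bbt, C0, alpha):
    # R^2 of Eq. 5.5.7 (piecewise in alpha)
    if alpha <= 2.0 / (lam1 + lamL):
        R2 = 1 - 2 * alpha * lam1 + alpha**2 * (lam1**2 + var_bbt)
    else:
        R2 = 1 - 2 * alpha * lamL + alpha**2 * (lamL**2 + var_bbt)
    # maximum step size for convergence, Eq. 5.5.8
    if lam1 * lamL >= var_bbt:
        alpha_max = 2 * lamL / (lamL**2 + var_bbt)
    else:
        alpha_max = 2 * lam1 / (lam1**2 + var_bbt)
    # asymptotic mean-square error bound per component, Eqs. 5.5.4 and 5.5.11
    bound = alpha**2 * C0 / (1 - R2) if R2 < 1 else np.inf
    return R2, alpha_max, bound
```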

Since C_0 < λ_L,* the bound of Eq. 5.5.11 for σ²_{BB^T} = 0 is smaller than the bound of Eq. 5.4.32. The reason is that the algorithm of Eq. 5.5.2 uses only every Δth sample, and the interactive effects of dependent increments are eliminated.

* The sum of the eigenvalues of C is equal to its trace, so that LC_0 = Σ_{i=1}^{L} λ_i. The largest eigenvalue, λ_L, must be larger than the average of the eigenvalues, C_0.

The problem of determining σ²_{BB^T} for particular data statistics remains. In very simple cases it can be calculated analytically. In more complicated situations it can be determined numerically by calculating the eigenvalues of the matrix (B_k B_k^T - C) and averaging the square of the largest eigenvalue in magnitude with respect to the probability measure on B_k. Where this may be impractical, bounds on σ²_{BB^T} can be obtained.

As an example of data statistics for which σ²_{BB^T} can be determined analytically, consider the case of independent zero mean data digits. Then C is C_0 times the identity matrix, and λ is an eigenvalue of B_k B_k^T - C with eigenvector x if

    B_k B_k^T x = (λ + C_0)x    (5.5.12)

Since B_k^T x is a scalar and commutes with B_k, Eq. 5.5.12 becomes

    (B_k^T x) B_k = (λ + C_0)x    (5.5.13)

and evidently (B_k B_k^T - C) has the eigenvalue λ = ||B_k||² - C_0, the largest in magnitude, with eigenvector x = B_k. Thus,

    σ²_{BB^T} = E(||B_k||² - C_0)²    (5.5.14)

For example, when B_k = ±1, C_0 = 1, and

    σ²_{BB^T} = (L - 1)²    (5.5.15)

Several bounds on σ²_{BB^T} will now be developed. First note that, from Theorem 1.4.1,

    ||B_k B_k^T - C|| ≤ ||B_k B_k^T|| + ||C||    (5.5.16)

and by Lemma D1,

    ||B_k B_k^T|| = ||B_k||² ≤ L max_{1≤j≤M} b_j²    (5.5.17)

Combining Eqs. 5.5.6, 5.5.16 and 5.5.17,

    σ²_{BB^T} ≤ (L max_{1≤j≤M} b_j² + λ_L)²    (5.5.18)

When this bound is applied to independent zero mean data digits with B_k = ±1, Eq. 5.5.18 becomes

    σ²_{BB^T} ≤ (L + 1)²    (5.5.19)

which is not much larger than the exact expression of Eq. 5.5.15, particularly for large L. The bound of Eq. 5.5.18 requires knowledge of the largest eigenvalue of C, which may not be known. However, that largest eigenvalue can be bounded by noting that

    λ_L ≤ Σ_{i=1}^{L} λ_i = Tr(C) = LC_0 ≤ L max_{1≤j≤L} b_j²    (5.5.20)

Substituting Eq. 5.5.20 in Eq. 5.5.18, a bound which requires no knowledge of the data statistics results,

    σ²_{BB^T} ≤ 4L²(max_{1≤j≤M} b_j²)²    (5.5.21)

This bound becomes, for independent zero mean data digits with B_k = ±1,

    σ²_{BB^T} ≤ 4L²    (5.5.22)

which is considerably larger than the exact expression of Eq. 5.5.15.
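The numerical procedure mentioned above, averaging the squared largest-magnitude eigenvalue of (B_k B_k^T - C) over the distribution of B_k, can be sketched as follows. Purely for illustration it is specialized to independent ±1 digits, for which every draw yields (L - 1)², the exact value of Eq. 5.5.15 (Python; names are illustrative):

```python
import numpy as np

def estimate_var_bbt(L, n_trials=10000, seed=0):
    """Monte Carlo evaluation of sigma^2_{BB^T} = E ||B_k B_k^T - C||^2,
    averaging the squared largest-magnitude eigenvalue of (B_k B_k^T - C)."""
    rng = np.random.default_rng(seed)
    C = np.eye(L)                      # C = C_0 I with C_0 = 1 for +/-1 digits
    acc = 0.0
    for _ in range(n_trials):
        B = rng.choice([-1.0, 1.0], size=L)
        eigs = np.linalg.eigvalsh(np.outer(B, B) - C)
        acc += np.max(np.abs(eigs)) ** 2
    return acc / n_trials              # equals (L-1)^2 here, by Eq. 5.5.15
```

For dependent digits one would draw B_k from the actual data model (for example, the Markov model of Section 5.8) and supply the corresponding C.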

There remains the important practical problem of determining a step-size α sufficiently small that the algorithm will continue to converge in spite of the fact that the data statistics may be varying. The inequality which α must satisfy is Eq. 5.5.8, which is an inconvenient inequality because of the two ranges on σ²_{BB^T}. However, the α_max of Eq. 5.5.8 can be bounded by

    α_max ≥ 2λ_1/(λ_L² + σ²_{BB^T})    (5.5.23)

for all σ²_{BB^T}. In order to lower bound Eq. 5.5.23 by a quantity which is independent of the data statistics, it would be necessary to determine upper bounds on λ_L and σ²_{BB^T} and a lower bound on λ_1. Upper bounds on λ_L and σ²_{BB^T} are available in Eqs. 5.5.20 and 5.5.21. These upper bounds require absolutely no information about the data statistics; only the {b_j}_{j=1}^{M} are required. On the other hand, a lower bound on λ_1 does require some knowledge of the data statistics. For any given {b_j}_{j=1}^{M}, it cannot be asserted that λ_1 ≠ 0 without some knowledge of the data statistics. Therefore, an α which insures convergence cannot be chosen without some knowledge of the data statistics.

To summarize this section, an algorithm which does not require knowledge of C has been considered. It has been shown that when H is fixed and the data statistics are wide-sense stationary

with positive definite covariance matrix C, the estimation algorithm is asymptotically unbiased and the mean-square error approaches a non-zero asymptotic value, as long as the step-size is appropriately chosen. For a given step-size, the rate of convergence of the mean-square error to its asymptotic value is slower than that of the algorithm of Section 5.4 because of the lack of knowledge of the data statistics. Although the algorithm itself does not use the data statistics, the choice of a step-size small enough to insure convergence does require some knowledge of the data statistics. Specifically, enough must be known about the data statistics to be able to obtain a non-zero lower bound on the smallest eigenvalue of C.

5.6 Fixed Increment Algorithms

In Sections 5.1 through 5.5, the stochastic approximation algorithms considered obtained an updated estimation vector by adjusting the previous estimation vector by a quantity which was a continuous function of the current observation and the past estimate. Three algorithms which adjust each component of the estimation vector by a fixed amount in one direction or the other will be considered in this section. The direction of adjustment will be determined by the previous estimate and a possibly variable number of intervening observations. Consideration of this type of algorithm is motivated by the adaptive equalization algorithm originated by Lucky [22]. These three algorithms, and particularly the third one, will be

somewhat simpler to implement than those considered previously.

The specific type of algorithm to be considered updates a previous estimate of H, H_{k-1}, by the equation*

    [H_k]_ℓ = [H_{k-1}]_ℓ - δ sgn[ε̂_{k-1}]_ℓ    (5.6.1)

where

    sgn(x) = 1, x ≥ 0;  -1, x < 0    (5.6.2)

and ε̂_{k-1} is an estimate of the error vector

    ε_{k-1} = H_{k-1} - H

Whenever a component of ε_{k-1} is estimated to be positive, the corresponding component of H_{k-1} is decreased by the fixed amount δ. When very simple estimates of ε_{k-1} are chosen, the algorithm of Eq. 5.6.1 can be extremely simple to implement.

* [x]_ℓ is taken to mean the ℓth component of x.

Three algorithms will be obtained by considering three different estimators for ε_{k-1}. In the first, a maximum likelihood estimate of ε_{k-1} based on a fixed number K_0 of observations will be used. The disadvantage of the resulting algorithm is that the convergence is linear, rather than exponential. An approximately exponential

convergence will be obtained in the second algorithm by using a sequential estimator which uses a variable number of observations between each increment of Eq. 5.6.1. The third estimator will simplify implementation by modifying the second and sacrificing some performance. All three estimators will be designed and analyzed for the case of statistically independent data digits. Although these estimators can be generalized to non-independent data digits, their performances become much more difficult to analyze.

5.6.1 Fixed Observation Estimator. The first estimator of ε_{k-1} will be a maximum likelihood estimator based on K_0 observations. Suppose that the initial estimate of H at k = 0 is H_0, and update H according to Eq. 5.6.1 after x_{kK_0} is received, k ≥ 1. The estimate of ε_k will be a maximum likelihood estimate based on x_{kK_0+1}, …, x_{(k+1)K_0}. Starting with Eq. 5.2.2 and expressing it in terms of ε_k, the likelihood function is

    p(x_{kK_0+1}, …, x_{(k+1)K_0} | B_{kK_0+1}, …, B_{(k+1)K_0}, ε_k)
      = ∏_{i=kK_0+1}^{(k+1)K_0} (2πσ²)^{-1/2} exp{-(x_i - B_i^T H)²/2σ²}
      = (2πσ²)^{-K_0/2} exp{-(1/2σ²) Σ_{i=kK_0+1}^{(k+1)K_0} (e_i + B_i^T ε_k)²}    (5.6.3)

where

    e_i = x_i - B_i^T H_k    (5.6.4)

is the error between the received sample and an estimate of what it should be, based on the previous estimate of H and knowledge of the sequence of data digits. Taking the gradient of Eq. 5.6.3 with respect to ε_k and setting it equal to zero, the condition for the maximum likelihood estimator ε̂_k is

    Σ_{i=kK_0+1}^{(k+1)K_0} B_i B_i^T ε̂_k = -Σ_{i=kK_0+1}^{(k+1)K_0} B_i e_i    (5.6.5)

Invoking the assumption of statistically independent data digits, the matrix Σ_{i=kK_0+1}^{(k+1)K_0} B_i B_i^T will be approximately K_0 C_0 times the identity matrix for large K_0. Therefore, the solution for ε̂_k in Eq. 5.6.5 is approximately

    ε̂_k = -(1/K_0 C_0) Σ_{i=kK_0+1}^{(k+1)K_0} B_i e_i    (5.6.6)

Using the estimate of Eq. 5.6.6, the updating equation, Eq. 5.6.1, becomes

    H_{k+1} = H_k + δ sgn(Σ_{i=kK_0+1}^{(k+1)K_0} B_i e_i)    (5.6.7)

where sgn of a vector is taken to mean sgn of each component of that vector.
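A sketch of one increment of Eq. 5.6.7 follows (Python; illustrative names, with the K_0 data vectors B_i collected as the rows of a matrix):

```python
import numpy as np

def fixed_observation_increment(H_k, B_block, x_block, delta):
    """Eq. 5.6.7: after K0 samples, step each component of H by
    +/- delta according to the sign of sum_i B_i e_i."""
    e = x_block - B_block @ H_k        # e_i = x_i - B_i^T H_k  (Eq. 5.6.4)
    return H_k + delta * np.sign(B_block.T @ e)
```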

In analyzing the performance of the algorithm of Eq. 5.6.7, the quantity of interest is the probability that a particular component of H_k is incremented in the wrong direction,

    P_e(ℓ) = Pr{[ε̂_k]_ℓ < 0 | [ε_k]_ℓ},    0 ≤ ℓ ≤ L-1    (5.6.8)

where it is assumed that [ε_k]_ℓ > 0. Substituting for e_i from Eq. 5.6.4 in Eq. 5.6.6,

    ε̂_k = (1/K_0 C_0) Σ_{i=kK_0+1}^{(k+1)K_0} (B_i B_i^T ε_k - B_i n_i)    (5.6.9)

and

    [ε̂_k]_ℓ = (1/K_0 C_0) Σ_{i=kK_0+1}^{(k+1)K_0} (Σ_{j=0}^{L-1} B_{i-ℓ}B_{i-j}[ε_k]_j - B_{i-ℓ}n_i)    (5.6.10)

In the case of L = 2 with B_k = ±1, the terms in the sum of Eq. 5.6.10 are statistically independent.*

* It is easily verified that B_iB_{i-1} and B_iB_{i+1} are independent. For instance, P(B_iB_{i-1} = 1)P(B_iB_{i+1} = 1) = P(B_iB_{i-1} = 1, B_{i+1}B_i = 1) = 1/4.

In this case the distribution function of the sum in Eq. 5.6.10 will approach a Gaussian distribution for large K_0. However, for other cases, the terms in the sum are not statistically independent, due to the presence of common data digits. The actual distribution is difficult to obtain because the

number of combinations of data digits which must be considered grows exponentially with K_0. Therefore, the distribution of the sum in Eq. 5.6.10 will be approximated by a Gaussian distribution. The corresponding approximation to the probability of incorrect increment is

    P_e(ℓ) ≈ Φ(-E[ε̂_k]_ℓ / √(Var[ε̂_k]_ℓ))    (5.6.11)

The mean of [ε̂_k]_ℓ can easily be shown to be

    E[ε̂_k]_ℓ = [ε_k]_ℓ    (5.6.12)

and the variance of [ε̂_k]_ℓ is calculated in Appendix F, Eq. F.6, as

    Var[ε̂_k]_ℓ = σ²/(K_0 C_0) + [(E(B_k⁴) - C_0²)/(K_0 C_0²)][ε_k]_ℓ² + (1/K_0) Σ_{j≠ℓ}[ε_k]_j²
      + (a fourth term, of order 1/K_0², involving the cross-products [ε_k]_j[ε_k]_m of the error components)    (5.6.13)

In general, the last term of Eq. 5.6.13 will be very much smaller than the other three, since there is no reason to expect that the different components of the error will be highly correlated, and since the last term is multiplied by an additional factor of 1/K_0.

There are two parameters of the algorithm which the designer is free to choose: δ and K_0. The rate at which the algorithm converges and the behavior of the algorithm when an error on the order of δ is achieved are dependent on both parameters. Assuming P_e(ℓ) is small, each component of the error vector ε_{k-1} will decrease linearly in time at a rate of δ/(K_0 T) per second until the error is on the order of δ/2. At that point the error will oscillate on both sides of zero if K_0 is chosen large enough. The rate of convergence is proportional to δ and the asymptotic mean-square error is proportional to δ². The parameter K_0 should be chosen to insure that P_e(ℓ) is small for [ε_k]_ℓ ≈ δ/2, and δ should be chosen to balance the convergence rate against the asymptotic error.

Consider the numerical example of Lucky [22], where it is required that P_e(ℓ) ≈ 0.01 for [ε_k]_ℓ ≈ δ/2. Let B_k = ±1 with equal probability, so that E(B_k²) = E(B_k⁴) = 1, and assume, for the moment, that σ² = 0. Then, when all the error components are approximately δ/2,

    Var[ε̂_k]_ℓ ≈ (1/K_0) Σ_{j≠ℓ}[ε_k]_j² = (L-1)δ²/(4K_0)    (5.6.14)

where the last term of Eq. 5.6.13 has been neglected. The specification that P_e(ℓ) = 0.01 requires that

    2.33 ≤ (δ/2)/√((L-1)δ²/(4K_0)) = √(K_0/(L-1))

or

    K_0 ≥ 5.44(L-1)    (5.6.15)

independent of δ. On the other hand, when σ² ≠ 0, then

    K_0 ≈ 5.44(L-1) + 21.76 σ²/δ²    (5.6.16)

is required. This K_0 can be quite large if a small asymptotic error (small δ²) is required.

The major shortcoming of this algorithm is the approximately linear convergence. It is much easier for an estimator to track a time-varying channel if the convergence is exponential, since the estimator then responds much more quickly when the error is large. In addition, the rate of linear convergence of the algorithm of Eq. 5.6.7 will generally be quite slow, because K_0 must be chosen large in order to insure that the probability of incorrect increment is small when the error is small. When the error is large, the same probability of incorrect increment could be maintained with a much smaller K_0. In the next section an algorithm with a variable K_0 will be considered. That algorithm will achieve exponential convergence by using a Wald-type sequential test.

5.6.2 Sequential Estimate Algorithm. In this section the

ε_{k-1} estimator of Section 5.6.1 will be changed into a Wald-type sequential test. It will be shown that the resulting algorithm has approximately exponential convergence, and the asymptotic mean-square error as a function of convergence rate will be compared to that of the algorithm of Section 5.5.

The algorithm of Eq. 5.6.7 incremented all the components of H_{k-1} simultaneously after the reception of x_{(k+1)K_0}. The algorithm to be considered now increments each component of H_{k-1} separately, after a random number of intervening observations since the last increment. Specifically, assume that [H_k]_ℓ was determined by a fixed increment after x_{K_{ℓ,k}} was received, and update [H_k]_ℓ again after x_{K_{ℓ,k+1}} is received, where

    K_{ℓ,k+1} = K_{ℓ,k} + k*    (5.6.17)

and k* is the smallest integer such that

    E_ℓ(m) = -Σ_{i=K_{ℓ,k}+1}^{K_{ℓ,k}+m} B_{i-ℓ} e_i    (5.6.18)

satisfies

    E_ℓ(k*) ≥ R_t    or    E_ℓ(k*) ≤ -R_t    (5.6.19)

The updating equation for [H_k]_ℓ is

    [H_{k+1}]_ℓ = [H_k]_ℓ - δ sgn(E_ℓ(k*))    (5.6.20)

The algorithm of Eq. 5.6.20 is best understood by calculating the mean value of E_ℓ(m) from Eqs. 5.6.18 and 5.6.12,

    E[E_ℓ(m)] = m C_0 [ε_k]_ℓ    (5.6.21)

Therefore, {E_ℓ(m)}_{m=1}^{∞} is a sequence of random variables, each of which has a mean value that increases linearly with m and has the same sign as [ε_k]_ℓ. If [ε_k]_ℓ > 0, then with high probability E_ℓ(m) will eventually get larger than R_t, in which case the corresponding component of H_k will be incremented in the direction which reduces [ε_k]_ℓ. By Eq. 5.6.21, the average length of time it takes for E_ℓ(m) to exceed R_t will be inversely proportional to the size of the error [ε_k]_ℓ. That is, from Eq. 5.6.21 the mean value of the random variable k* will be approximately

    E(k*) ≈ R_t/(C_0 [ε_k]_ℓ)    (5.6.22)

Therefore, since increments occur more often on the average when [ε_k]_ℓ is large, the algorithm will converge more rapidly when the error is large than when it is small. It will be shown later that the convergence of the algorithm is approximately exponential.
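One possible realization of Eqs. 5.6.17 through 5.6.20 keeps a running sum E_ℓ(m) for each component and increments that component whenever its sum crosses ±R_t. The following sketch is one such realization (Python; the class and names are illustrative, not from the report):

```python
import numpy as np

class SequentialIncrementer:
    """Sketch of the sequential fixed-increment algorithm of
    Eqs. 5.6.17 - 5.6.20: component l accumulates -B_{i-l} e_i and is
    stepped by +/- delta when the sum crosses +/- R_t."""
    def __init__(self, H0, delta, R_t):
        self.H = np.asarray(H0, dtype=float)
        self.delta, self.R_t = delta, R_t
        self.E = np.zeros_like(self.H)       # one accumulator per component

    def observe(self, B_i, x_i):
        e_i = x_i - B_i @ self.H             # error sample, Eq. 5.6.4
        self.E += -B_i * e_i                 # Eq. 5.6.18
        crossed = np.abs(self.E) >= self.R_t
        # Eq. 5.6.20: increment the crossed components, reset their sums
        self.H[crossed] -= self.delta * np.sign(self.E[crossed])
        self.E[crossed] = 0.0
        return self.H
```

Note that each component is tested and incremented independently, so the number of samples between increments of a given component is the random variable k*.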

The application of the two thresholds to E_ℓ(m) of Eq. 5.6.18 is actually a Wald-type sequential test of the hypothesis [ε_k]_ℓ > 0 against the hypothesis [ε_k]_ℓ < 0. Wald [37] and Ferguson [38] have analyzed tests of this type in some detail, and their results have been used in Appendix F to determine approximate expressions for the probability of incorrect increment and the mean and variance of k*. The results of Appendix F will now be summarized. Let M_ℓ(t) be the moment generating function of -B_{i-ℓ}e_i,

    M_ℓ(t) = E exp(-t B_{i-ℓ} e_i)    (5.6.23)

which is given by Eq. F.11, and let t_0 < 0 satisfy

    M_ℓ(t_0) = 1    (5.6.24)

Wald [37] has shown that such a t_0 always exists. Then it is shown in Appendix F, Eqs. F.22 and F.23, that

    Pr{E_ℓ(k*) ≥ R_t} ≈ (e^{-t_0 R_t} - 1)/(e^{-t_0 R_t} - e^{t_0 R_t})    (5.6.25)

    Pr{E_ℓ(k*) ≤ -R_t} ≈ (1 - e^{t_0 R_t})/(e^{-t_0 R_t} - e^{t_0 R_t})    (5.6.26)

An approximate expression for the mean number of samples until increment is given by Eq. F.25.

To obtain an approximate expression for [ε_k]_ℓ as a function of k, it will be assumed that k* = E(k*) and that the approximate expression for k* of Eq. 5.6.22 holds. Let the initial error be nδ, so that the error drops linearly with k,

    [ε_k]_ℓ = (n - k)δ,    k = 0, 1, …, n    (5.6.34)

By definition, K_{ℓ,k} is the number of samples which have been received when the estimate H_k is determined. From Eqs. 5.6.17 and 5.6.22, K_{ℓ,k} is

    K_{ℓ,k} ≈ Σ_{i=0}^{k} R_t/(C_0 (n - i)δ)    (5.6.35)

where the sum in Eq. 5.6.35 can be approximated by an integral,

    Σ_{i=0}^{k} 1/(n - i) ≈ ∫_0^k dx/(n - x) = ln(n/(n - k))    (5.6.36)

Substituting Eqs. 5.6.36 and 5.6.35 in Eq. 5.6.34 to eliminate k,

    [ε_k]_ℓ ≈ [ε_0]_ℓ exp(-C_0 δ K_{ℓ,k}/R_t)    (5.6.37)

and the error drops approximately exponentially with the number of received samples.

Because the algorithm of this section has exponential convergence in common with the algorithm of Section 5.5, it is meaningful to compare the asymptotic mean-square error of the two algorithms for the same convergence rate. To accomplish this, expressions for the asymptotic mean-square error as a function of convergence rate are required. Considering the fixed increment algorithm first, and assuming that each component of the final error alternates between ±δ/2, the asymptotic mean-square error is

    lim_{k→∞} ||H_k - H||² ≈ L(δ/2)²    (5.6.38)

By Eq. 5.6.37, the mean value of [ε_k]_ℓ decreases to zero approximately as r^k, where

    r = exp(-C_0 δ/R_t)    (5.6.39)

The threshold R_t should be chosen to make P_e(ℓ) small for an error on the order of δ/2. The approximate solution to Eq. 5.6.33 for [ε_k]_ℓ = δ/2 is*

    t_0* ≈ -2(δ/2)/σ² = -δ/σ²    (5.6.40)

* This equation for t_0* is exact when [ε_k]_j = 0, j ≠ ℓ.

In order to maintain a constant probability of incorrect increment, the

quantity

    β ≡ -t_0* R_t* = δ R_t/σ²    (5.6.41)

should be kept constant, by Eq. 5.6.31. Substituting Eqs. 5.6.41 and 5.6.39 in Eq. 5.6.38, the asymptotic mean-square error becomes

    lim_{k→∞} ||H_k - H||² ≈ (β/4) L σ² ln(1/r)    (5.6.42)

Now consider the algorithm of Section 5.5. Assume that α is small enough that R ≈ r in Eq. 5.5.5, and assume that E(ε_k) approaches zero as r^k. Then the approximate mean-square error of Eq. 5.5.4 becomes

    lim_{k→∞} ||H_k - H||² ≈ L σ² α² C_0/(1 - r²) ≈ L σ² α/2 = (L σ²/2)(1 - r)    (5.6.43)

since r = 1 - αC_0 for the case of independent data digits presently under consideration. The approximate expressions of Eqs. 5.6.42 and 5.6.43 display the same dependence on Lσ². Equation 5.6.42 has the additional factor (β/4), where β must be somewhat larger than one in order for the probability of incorrect increment to be small. The

major difference between the two expressions is the dependence on the parameter r. The two functions of r are plotted in Fig. 5.6, which shows that ln(1/r) is much larger than (1 - r)/2 for all r except those near one. For r near one, however, the difference between the two functions is not large. In practice, α must be chosen very small in order to hold down the asymptotic mean-square error (see Section 5.8 for numerical examples), and the parameter r will be nearly one. Thus, for practical convergence rates the asymptotic mean-square error of the fixed increment algorithm is larger, but not too much larger, than that of the fixed step-size algorithm of Section 5.5.

5.6.3 Quantized Sequential Test. In this section the sequential estimate of Section 5.6.2 will be modified by quantizing each component of E_ℓ(m). The resulting fixed-increment algorithm will have a somewhat larger asymptotic mean-square error for a given convergence rate than the fixed increment algorithm of Section 5.6.2 or the fixed step-size algorithm of Section 5.5, but its use may be justified in some applications because of its simpler implementation.

The quantized sequential estimate will be obtained by quantizing each term in the sum of Eq. 5.6.18 to the values +1 and -1,

    E_ℓ(m) = Σ_{i=K_{ℓ,k}+1}^{K_{ℓ,k}+m} sgn(-B_{i-ℓ} e_i)    (5.6.45)

[Fig. 5.6. Comparison of mean-square error of two algorithms: ln(1/r) and (1 - r)/2 as functions of the rate of convergence r]

Retain the remainder of the fixed increment algorithm as summarized in Eqs. 5.6.17, 5.6.19, and 5.6.20. The simplification in implementation results from the fact that E_ℓ(m) in Eq. 5.6.45 is an integer. Thus, the algorithm maintains L up-down counters, one for each E_ℓ(m), 0 ≤ ℓ ≤ L-1. With each received sample, the error e_i is calculated, and each of the counters is incremented up or down depending on the sign of e_i and B_{i-ℓ}, 0 ≤ ℓ ≤ L-1. When a counter overflows (i.e., exceeds R_t in magnitude), the corresponding component of H is incremented by ∓δ and the state of the up-down counter is returned to zero.
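The counter structure just described can be sketched directly (Python; illustrative names; note that np.sign(0) returns 0 here, whereas Eq. 5.6.2 assigns +1 at zero, a measure-zero event for continuous noise):

```python
import numpy as np

class QuantizedSequentialIncrementer:
    """Sketch of the quantized sequential test of Eq. 5.6.45: one up-down
    counter per component, counting sgn(-B_{i-l} e_i); on overflow at
    +/- R_t the component of H is stepped and the counter reset."""
    def __init__(self, H0, delta, R_t):
        self.H = np.asarray(H0, dtype=float)
        self.delta, self.R_t = delta, int(R_t)
        self.count = np.zeros(len(self.H), dtype=int)

    def observe(self, B_i, x_i):
        e_i = x_i - B_i @ self.H
        self.count += np.sign(-B_i * e_i).astype(int)   # Eq. 5.6.45
        over = np.abs(self.count) >= self.R_t
        self.H[over] -= self.delta * np.sign(self.count[over])
        self.count[over] = 0
        return self.H
```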

This algorithm is analyzed in Appendix F, where it is assumed that R_t is an integer. It is shown that

    Pr{E_ℓ(k*) = R_t} ≈ q_ℓ^{R_t}/(1 + q_ℓ^{R_t})    (5.6.46)

    Pr{E_ℓ(k*) = -R_t} ≈ 1/(1 + q_ℓ^{R_t})    (5.6.47)

    E(k*) ≈ (R_t/(2p_ℓ - 1)) (q_ℓ^{R_t} - 1)/(q_ℓ^{R_t} + 1)    (5.6.48)

    Var(k*) ≈ (2p_ℓ(1 - p_ℓ)R_t/(2p_ℓ - 1)³) (q_ℓ^{R_t} - 1)/(q_ℓ^{R_t} + 1)
      + 4p_ℓ(1 - p_ℓ)(R_t/(2p_ℓ - 1))² q_ℓ^{R_t}/(q_ℓ^{R_t} + 1)²    (5.6.49)

where p_ℓ is the probability of an upcount,

    p_ℓ = Pr{-B_{i-ℓ} e_i > 0 | [ε_k]_ℓ}    (5.6.50)

and

    q_ℓ = p_ℓ/(1 - p_ℓ)    (5.6.51)

In Eq. 5.6.48 there is a nonlinear dependence of E(k*) on [ε_k]_ℓ. For this reason, it is difficult to approximate the type of convergence that might be expected. Further insight into the nature of the algorithm can be gained by developing approximations to Eqs. 5.6.47 and 5.6.48 for two special cases. When the error [ε_k]_ℓ is small (on the order of δ/2), p_ℓ will be just slightly larger than 1/2. Let

    p_ℓ = 1/2 + ε    (5.6.52)

for small ε, so that

    q_ℓ ≈ 1 + 4ε    (5.6.53)

and

    q_ℓ^{R_t} = e^{R_t ln(1+4ε)} ≈ e^{4εR_t},    ε << 1
    q_ℓ^{R_t} ≈ 1 + 4εR_t,    εR_t << 1    (5.6.54)

Then Eq. 5.6.47 and Eq. 5.6.48 become

    Pr{E_ℓ(k*) = -R_t} ≈ 1/(1 + exp(4εR_t)),    ε << 1
    Pr{E_ℓ(k*) = -R_t} ≈ (1/2)/(1 + 2εR_t),    εR_t << 1    (5.6.55)

    E(k*) ≈ (R_t/(2ε)) (exp(4εR_t) - 1)/(exp(4εR_t) + 1),    ε << 1
    E(k*) ≈ R_t²/(1 + 2εR_t),    εR_t << 1    (5.6.56)

Thus, as [ε_k]_ℓ → 0, ε → 0 and E(k*) → R_t². This should be contrasted with Eq. 5.6.22, where E(k*) → ∞ as [ε_k]_ℓ → 0. The reason for the difference is that the magnitude of the increment in E_ℓ(m) is unity, independent of how small e_i may be.

The second special case occurs when the error is large and p_ℓ is nearly one. Let

    p_ℓ = 1 - ε    (5.6.57)

for small ε, so that

    q_ℓ ≈ 1/ε    (5.6.58)

and Eqs. 5.6.47 and 5.6.48 become

    Pr{E_ℓ(k*) = -R_t} ≈ ε^{R_t} << 1    (5.6.59)

    E(k*) ≈ R_t/(1 - 2ε) ≈ R_t,    ε << 1    (5.6.60)

Thus for large [ε_k]_ℓ, E(k*) approaches R_t, rather than one as it did in Section 5.6.2.* The reason is that in Eq. 5.6.45 it is apparent that k* ≥ R_t: E_ℓ(m) cannot equal R_t until R_t samples have been received, regardless of the size of the e_i.

* Since k* ≥ 1, obviously E(k*) ≥ 1. The reason that E(k*) → 0 as [ε_k]_ℓ → ∞ in Eq. 5.6.22 is that the derivation of that equation assumed that [ε_k]_ℓ << R_t.

The character of the quantized sequential estimate of this section is somewhat different from that of the unquantized version of Section 5.6.2. When the error [ε_k]_ℓ is large, the quantized algorithm requires a longer average time to make an increment in H, by a factor

of about R_t. On the other hand, when the error is nearly zero, increments in H are made more often on the average. The difference between the two algorithms for large error is the more serious one, because it indicates that the convergence of the quantized algorithm will be somewhat slower when [ε_k]_ℓ is very large.

This completes the discussion of fixed increment algorithms. In Section 5.8 a numerical example of the relationship of P_e(ℓ), R_t, and E(k*) will be given for the case L = 2. But first, an example of an unsupervised stochastic approximation algorithm will be given in Section 5.7.

5.7 An Unsupervised Stochastic Approximation Algorithm

In this section two unsupervised fixed step-size stochastic approximation algorithms will be derived. Unlike the supervised algorithms used in a decision directed mode, the unsupervised algorithms of this section are not adversely affected by decision errors. However, these unsupervised algorithms have the disadvantage that they require greater knowledge of the data statistics, require knowledge of σ², and require some a priori knowledge about the location of the channel vector H. In addition, they are more complicated to implement and more difficult to analyze than the supervised algorithms discussed in Sections 5.4 through 5.6. Consider the second moments of the received samples,

…    (5.7.10)

if and only if

    λ_11 λ_22 - λ_21 λ_12 ≠ 0    (5.7.11)

In Eqs. 5.7.9 and 5.7.10, {λ_ij}_{j=1}^{2} and {x_ij}_{j=1}^{2} are the eigenvalues and eigenvectors of C_i.

The important point brought out by Eq. 5.7.6 is that even when H can be determined from {μ_ℓ}_{ℓ=0}^{L-1} (i.e., L = 2 and Eq. 5.7.11 is satisfied), the solution is not unique. If the two axes of R² are placed in the direction of x_1 and x_2 (i.e., the standard axes are rotated 45°), then the quadrant in which H resides must be known a priori before H can be estimated by this method. In addition, from Eqs. 5.7.7 through 5.7.9 it is apparent that σ² and {C_ℓ}_{ℓ=0}^{L-1} must be known. This is much more a priori information than is required by the supervised algorithms of Sections 5.4 through 5.6, none of which require σ² or any prior knowledge of the location of H.

An unsupervised estimate of μ_0 and μ_1 will now be developed. Once an estimate of μ_0 and μ_1 is available, an estimate of H can be made from Eq. 5.7.6 if the necessary a priori information is also available. An unsupervised estimate of μ_0 and μ_1 which is suitable for a fixed channel is the time average

    μ̂_{ℓ,k} = μ̂_{ℓ,k-1} - (1/k)(μ̂_{ℓ,k-1} - x_k x_{k-ℓ}),    ℓ = 0, 1    (5.7.12)

This decreasing step-size algorithm can be converted into an algorithm capable of tracking a time-varying μ_0 and μ_1 (and hence a time-varying H) simply by replacing the decreasing step-size by a fixed step-size, viz.,

    μ̂_{ℓ,k} = μ̂_{ℓ,k-1} - α(μ̂_{ℓ,k-1} - x_k x_{k-ℓ}),    ℓ = 0, 1    (5.7.13)

The analysis of this algorithm is very similar to that of the previous fixed step-size algorithms considered, and is relegated to Appendix G.
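One step of the fixed step-size moment tracker of Eq. 5.7.13, as a minimal sketch (Python; names are illustrative):

```python
def unsupervised_moment_update(mu, x_k, x_km1, alpha):
    """One step of Eq. 5.7.13: fixed step-size tracking of the second
    moments mu_0 = E(x_k^2) and mu_1 = E(x_k x_{k-1})."""
    mu0, mu1 = mu
    mu0 -= alpha * (mu0 - x_k * x_k)      # l = 0
    mu1 -= alpha * (mu1 - x_k * x_km1)    # l = 1
    return mu0, mu1
```

An estimate of H would then be formed from (μ̂_0, μ̂_1) through Eq. 5.7.6, using the a priori quadrant information discussed above.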

It is shown that if

    |1 - α| < 1    (5.7.14)

then

    lim_{k→∞} E(μ̂_{ℓ,k}) = μ_ℓ,    ℓ = 0, 1    (5.7.15)

and, further, if Eq. 5.7.14 is satisfied and there exists a Δ such that B_i and B_j are independent for |i - j| > Δ, then

    E(μ̂_{0,k} - μ_0)² ≤ (2α/(2 - α))(2L b_M⁴ ||H||⁴ + σ⁴) + (L² b_M⁴ ||H||⁴ + 2σ² L b_M² ||H||² + σ⁴) G_0(α)    (5.7.16)

    E(μ̂_{1,k} - μ_1)² ≤ σ² L b_M² ||H||² (α/(2(1 - α)) + 2α/(2 - α)) + L² b_M⁴ ||H||⁴ G_1(α)    (5.7.17)

where

    G_ℓ(α) = 1 - (2/(2 - α))(1 - α)^{2Δ+2ℓ+1},    ℓ = 0, 1    (5.7.18)

It is straightforward to show, using Eq. 5.4.30, that for small α

    G_ℓ(α) ≈ (2Δ + 2ℓ + 1/2)α    (5.7.19)

where second order terms in α have been ignored. Thus, all the terms in Eqs. 5.7.16 and 5.7.17 have the same dependence on the step-size α for small α, and

    lim_{α→0} lim_{k→∞} E(μ̂_{ℓ,k} - μ_ℓ)² = 0,    ℓ = 0, 1    (5.7.20)

On the other hand, when no statistical independence can be assumed in the data sequence and Δ → ∞, G_ℓ(α) → 1 for all α such that Eq. 5.7.14 is satisfied, and Eq. 5.7.20 cannot be asserted. In this case, there may be a fixed asymptotic error which cannot be eliminated by making α smaller.

When Eq. 5.7.13 is used in conjunction with Eq. 5.7.6 to make an estimate of H, presuming that the correct solution in Eq. 5.7.6 is chosen, the mean-square error in H can be roughly estimated by looking at the coefficients of μ_0 and μ_1. The contribution that the mean-square estimation error of μ_0 or μ_1 makes to the mean-square error in the resulting estimate of H is then proportional to those coefficients.

The foregoing algorithm is suitable for L = 2. Another algorithm, which is valid for arbitrary L but which is difficult to analyze because it is nonlinear, will now be defined. Note from Eq. 5.7.2 that

    E(x_k²) = H^T C_0 H + σ²
    E(x_k x_{k-1}) = H^T C_1 H
      ⋮
    E(x_k x_{k-L+1}) = H^T C_{L-1} H    (5.7.21)

Thus, a fixed step-size algorithm suitable for estimating H might be

    H_{k+1} = H_k - α [ H_k^T C_0 H_k + σ² - x_k x_k ;  H_k^T C_1 H_k - x_k x_{k-1} ;  … ;  H_k^T C_{L-1} H_k - x_k x_{k-L+1} ]    (5.7.22)

The quantity in brackets in Eq. 5.7.22 has mean zero for H_k = H, so that one stationary point of the algorithm is H_k = H. However, there is at least one other stationary point, H_k = -H, so that as before some a priori knowledge of H must be available. Specifically, H_0 must be chosen near H, or else the algorithm of Eq. 5.7.22 may converge to -H or some other stationary point.

In this section two unsupervised algorithms have been derived using the second moments of the observation vectors. These algorithms require greater a priori knowledge about the location of the channel vector H, the data statistics, and the noise statistics than do the supervised algorithms of Sections 5.4 through 5.6. Because of its nonlinear nature, the second algorithm has not been analyzed even superficially. Rather, its performance will be gauged by the computer simulations of Chapter 6.

5.8 A Numerical Example for L = 2, M = 2

There has been considerable success in this chapter in analyzing the various stochastic approximation algorithms. However, detailed

numerical comparison of the algorithms requires the assumption of specific data statistics. Therefore, in this section the several fixed step-size and fixed increment algorithms which have been considered in Sections 5.4 through 5.6 will be compared for L = 2 and a simple first-order Markov model of the data statistics. In Chapter 6 these algorithms will be simulated on the computer in conjunction with the known channel bit detector. These computer simulations will serve to verify the theoretical results of this section and to determine the effect of decision errors on the operation of these supervised algorithms when they are used in a decision-directed mode.

Assume that the data digits B_k are +1 or -1 with equal probability (M = 2) and first-order Markov with transition probability

    P(B_k | B_{k-1}) = p, B_k = B_{k-1};  1 - p, B_k = -B_{k-1}    (5.8.1)

where 0 < p < 1. A wide range of statistical dependencies between adjacent data digits can be studied with this simple first-order Markov model of the data statistics.

The covariance matrix C will be calculated first. For the expectation E(B_k B_{k-1}), observe that B_k and B_{k-1} are equal with probability p and of opposite sign with probability (1 - p), so that

    E(B_k B_{k-1}) = p - (1 - p) = 2p - 1    (5.8.2)

and of course

    E(B_k²) = 1    (5.8.3)

Thus, the matrix C is, from Eq. 5.1.22,

    C = | 1      2p-1 |
        | 2p-1   1    |    (5.8.4)

By straightforward calculation, the two eigenvalues of C are

    λ_1 = 2p, p ≤ 1/2;  2(1 - p), p ≥ 1/2    (5.8.5)

    λ_2 = 2(1 - p), p ≤ 1/2;  2p, p ≥ 1/2    (5.8.6)

and the ratio of largest to smallest eigenvalue, κ, is

    κ = (1 - p)/p, p ≤ 1/2;  p/(1 - p), p ≥ 1/2    (5.8.7)

These important parameters of C are plotted in Fig. 5.7. Since λ_1 > 0 for 0 < p < 1, the matrix C is positive definite for this range of p.
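These quantities are easily computed for any p; a minimal sketch (Python; names are illustrative) follows:

```python
import numpy as np

def markov_cov_params(p):
    """Covariance matrix of Eq. 5.8.4 for the first-order Markov data
    model of Eq. 5.8.1 (L = 2), with its eigenvalues and eigenvalue
    ratio kappa (Eqs. 5.8.5 - 5.8.7)."""
    C = np.array([[1.0, 2 * p - 1.0],
                  [2 * p - 1.0, 1.0]])
    lam = np.sort(np.linalg.eigvalsh(C))   # [2p, 2(1-p)] for p < 1/2
    return C, lam, lam[1] / lam[0]
```

For p = 0.3, for example, it returns eigenvalues 0.6 and 1.4 and κ = 2.33, in agreement with Eqs. 5.8.5 through 5.8.7.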

[Fig. 5.7. Parameters of C: λ_1, λ_2, and κ as functions of p]

The point p = 1/2 (independent data digits) is special, because λ_1 and κ there assume their maximum and minimum respectively. Therefore, the greatest convergence rates of the supervised estimators of H will occur at p = 1/2. It should, however, be possible to estimate H as long as p ≠ 0 or 1.

The second data statistic which should be calculated is the variance of B_k B_k^T, σ²_{BB^T}, which is defined in Eq. 5.5.6. To accomplish this, the eigenvalues of the matrix

    B_k B_k^T - C = | 0                     B_k B_{k-1} - 2p + 1 |
                    | B_k B_{k-1} - 2p + 1  0                    |    (5.8.8)

are required. They can be shown to be ±(B_k B_{k-1} - 2p + 1), so that the norm squared of (B_k B_k^T - C) is

    ||B_k B_k^T - C||² = (B_k B_{k-1} - 2p + 1)²    (5.8.9)

which has the expected value

    σ²_{BB^T} = 4p(1 - p)    (5.8.10)

The convergence properties of the fixed step-size algorithm of Section 5.4 can now be examined. Using Eq. 5.4.11 in conjunction with Eqs. 5.8.5 and 5.8.6, for p ≤ 1/2

    r = 1 - 2αp,    0 < α ≤ 1
    r = 2α(1 - p) - 1,    1 ≤ α    (5.8.11)

and for p ≥ 1/2,

    r = 1 - 2α(1 - p),    0 < α ≤ 1
    r = 2αp - 1,    1 ≤ α    (5.8.12)

The minimum r occurs at α = 1 and is

    r_min = 1 - 2p, p ≤ 1/2;  2p - 1, p ≥ 1/2    (5.8.13)

from Eqs. 5.4.13 and 5.8.7. The maximum step-size for which convergence will occur is given by Eq. 5.4.12,

    α_max = 1/(1 - p), p ≤ 1/2;  1/p, p ≥ 1/2    (5.8.14)

Equations 5.8.11 through 5.8.14 are plotted in Fig. 5.8. Here again, the most advantageous situation occurs at p = 1/2. For any step-size α, the rate of convergence decreases as p increases or decreases from p = 1/2. The requirement on α to insure convergence also becomes more stringent as p increases or decreases from p = 1/2.

[Fig. 5.8. Convergence parameters of the fixed step-size algorithm for known data statistics: (a) r as a function of α; (b) r_min as a function of p; (c) α_max as a function of p]

However, for p ≠ 0, 1, convergence will always be assured if the step-size α is chosen to be less than unity.

The upper bounds on mean-square error of Section 5.4 will now be evaluated for p ≤ 1/2. Corresponding relations can be obtained for p ≥ 1/2 by substituting (1 - p) for p (as can be seen from Eqs. 5.8.5 and 5.8.6). The upper bound of Eq. 5.4.28, which is valid in the present case with Δ = 1, becomes

    lim_{k→∞} E||H_k - H||²/(Lσ²) ≤ α/(2(1 - αp)) + (||H||²/(Lσ²)) F(α),    0 < α ≤ 1
    lim_{k→∞} E||H_k - H||²/(Lσ²) ≤ α/(2(1 - α(1 - p))) + (||H||²/(Lσ²)) F(α),    1 ≤ α < α_max    (5.8.15)

where F(α) is the function arising in the bound of Section 5.4. For the special case of p = 1/2 (independent data digits) an exact expression for the mean-square error is available in Eq. 5.4.34,

    lim_{k→∞} E||H_k - H||²/(Lσ²) = (α/(2 - α))(1 + ||H||²/(Lσ²))    (5.8.16)

In Figs. 5.9 and 5.10, Eq. 5.8.15 is plotted in dB for two values of ||H||²/(Lσ²) (0 and 100) and 10 values of p. This upper bound is small for small α, but increases rapidly for α near α_max.

[Fig. 5.9. Upper bound on mean-square error for known data statistics algorithm: mean-square error per component/σ² (dB) vs. step-size α, for p = 0.1 through 0.9 and ||H||²/(Lσ²) = 0 and 100]

[Fig. 5.10. Upper bound on mean-square error for known data statistics algorithm: mean-square error per component/σ² (dB) vs. step-size α]

The effect of ||H||²/(Lσ²) is effectively to shift the entire curve up and down. The asymptotic mean-square error per component is larger relative to the noise variance at high S/N ratios. The most striking feature of Fig. 5.9 is that there is much to be gained, in terms of reducing the asymptotic mean-square error, by keeping α very small. As α drops below 0.2, there is a very rapid drop in asymptotic mean-square error. Another feature of Eq. 5.8.15 is that when the step-size is fixed the asymptotic mean-square error is at a minimum for p = 1/2.

For the special case of p = 1/2, it is possible to compare the bound of Eq. 5.8.15 with the exact expression of Eq. 5.8.16. This is done in Fig. 5.11, where it is seen that the bound and exact expression agree at ||H||²/(Lσ²) = 0. At ||H||²/(Lσ²) = 10 and 100, there is a discrepancy between the bound and exact expression of about 10 dB.

The supervised algorithm of Section 5.5, which is suitable when the data statistics are unknown, will be considered next. The important parameter σ²_{BB^T} has been calculated in Eq. 5.8.10. From Eqs. 5.8.5 and 5.8.6,

    λ_1 λ_2 = 4p(1 - p) = σ²_{BB^T}    (5.8.17)

[Fig. 5.11. Mean-square error and upper bound for independent data digits (p = 1/2) and L = 2: exact expression and bound vs. step-size α, for ||H||²/(Lσ²) = 0, 10, and 100]

so that the maximum α for which convergence occurs is, from Eqs. 5.5.8 and 5.8.17,

    α_max = 2/(λ_1 + λ_2) = 1

for 0 < p < 1. The rate of convergence is determined by Eq. 5.5.7, which becomes

    R² = (1 - 2αp)² + 4α²p(1 - p),    p ≤ 1/2
    R² = (1 - 2α(1 - p))² + 4α²p(1 - p),    p ≥ 1/2    (5.8.18)

and the minimum value of R² is

    R²_min = 1 - p, p ≤ 1/2;  p, p ≥ 1/2    (5.8.19)

from Eq. 5.5.10, and occurs for a step-size of

    α_min = 1/2    (5.8.20)

from Eq. 5.5.9. The expressions for σ²_{BB^T} (Eq. 5.8.10), R² (Eq. 5.8.18), and R²_min (Eq. 5.8.19) are graphed as functions of p in Fig. 5.12. Comparison with Fig. 5.8 reveals the decreased convergence rate of the unknown data statistics algorithm of Section 5.5 relative to the algorithm of Section 5.4 for all but very small α.

[Fig. 5.12. Parameters of the stochastic approximation algorithm for unknown data statistics: (a) σ²_{BB^T} as a function of p; (b) R² as a function of α; (c) R²_min as a function of p]

However, as mentioned in Section 5.5, reduction of the asymptotic mean-square error generally requires that a very small α be chosen, and for these small α the convergence properties of the two algorithms do not differ appreciably.

The upper bound on mean-square error of Eq. 5.5.4 becomes, for the present case,

    lim_{k→∞} E||H_k - H||²/(Lσ²) ≤ α²/(1 - (1 - 2αp)² - 4α²p(1 - p)),    p ≤ 1/2
    lim_{k→∞} E||H_k - H||²/(Lσ²) ≤ α²/(1 - (1 - 2α(1 - p))² - 4α²p(1 - p)),    p ≥ 1/2    (5.8.21)

This bound is plotted in Fig. 5.13. The character of the asymptotic error bound is very similar to that of Fig. 5.9, except that α_max is always unity (independent of p) and the bound is not a function of ||H||²/(Lσ²). The latter property is a result of the method of analysis in Section 5.5, where the increments in the estimate H_k were forced to be statistically independent.

As the final item of this section, the probability of incorrect increment and E(k*) will be calculated for the three fixed increment algorithms of Section 5.6, which were derived for independent data digits (p = 1/2) alone. These expressions will be calculated for only the first component of ε_k. Consider first the fixed observation algorithm of Section 5.6.1. The probability of incorrect increment is given by Eq. 5.6.11.

[Fig. 5.13. Mean-square error upper bound for unknown data statistics algorithm: bound of Eq. 5.8.21 (dB) vs. step-size α, for p = 0.1 through 0.9]

[Fig. 5.15. Performance of the fixed increment algorithms: threshold R_t* vs. K_0 or E(k*) for the fixed observation, sequential, and quantized sequential estimates; P_e(ℓ) = 0.1, ℓ = 0, 1]

approximately twice the E(k*) of the unquantized version. Therefore, the penalty in performance paid for the simpler implementation of the quantized algorithm is a decrease in convergence rate of approximately 50 percent for the cases considered in Figs. 5.14 and 5.15.

This completes the numerical example of this section. In Chapter 6, computer simulations will be made in order to determine the effect of decision errors on these supervised algorithms when they are used in a decision-directed mode. In addition, the bit detector of Chapters 3 and 4 will be included in these simulations.

5.9 Conclusions

The estimators of the channel vector H which have been derived and analyzed in this chapter fall into three basic categories:

1. The supervised decreasing step-size algorithms of Sections 5.1 through 5.3 are suitable for a fixed channel. Their primary value is in estimating the channel in a training period which is short enough that the channel will not vary appreciably. In this application they will converge much faster than the fixed step-size algorithms, because their initial step-size is much larger.

2. The supervised fixed step-size algorithms of Sections 5.4 through 5.6 are suitable for a time-varying channel. Often during a long transmission of random data the channel can be expected to vary appreciably. The fixed step-size algorithms can track such a channel when used in a decision-directed mode. Since they are then susceptible

to decision errors, a reasonably accurate initial estimate of the channel should be provided. This estimate can be obtained during an initial training period.

3. The unsupervised algorithms of Section 5.7 will track a time-varying channel and are not susceptible to decision errors. However, they are more complicated to implement and require more a priori information than the supervised estimators. Therefore, the unsupervised algorithms are of greatest value when the S/N ratio is so low that the supervised estimators would necessarily be severely affected by decision errors.

Perhaps the most important characteristic of the estimation algorithms of this chapter is that their implementation requires very little knowledge about the channel. They are applicable to channels about which very little statistical information is available. When a channel is more completely characterized statistically, better performance can be obtained by incorporating that information into the estimator. An extreme example of such an estimator is the unsupervised a posteriori mean of Eq. 5.0.3, which requires a complete characterization of the channel and data statistics. However, estimators incorporating more statistical information are generally more complicated, and the performance improvement obtained may not justify the increased complexity.

CHAPTER 6

COMPUTER SIMULATION RESULTS

The difficulty of evaluating the performance of the bit detector and block detector has been noted in previous chapters. The error probabilities of the bit and block detectors for L = 2 and a known channel were calculated analytically. However, in other cases, and particularly for the random channel, analytical determination of the error probability appears to be prohibitively difficult. In addition, the estimation techniques of Chapter 5 have been fairly thoroughly analyzed under the assumption that the error-free data digits are available, but the effect of decision errors in the decision-directed mode is almost impossible to determine analytically.

In this chapter an alternate approach to obtaining the performance of these receivers will be employed: they will be simulated on the computer using Monte Carlo methods. Specifically, simulation results will be obtained for the bit detector on a fixed known channel and a fixed random channel, for the block detector on a fixed known channel, and for the block detector in conjunction with the estimation algorithms of Chapter 5 on the fixed random channel. These results are somewhat limited by the large expense of these simulations, particularly for the bit detector, and are therefore limited primarily to the case L = 2.

6.1 Results for the Fixed Known Channel

The bit and block detectors for a fixed known channel with independent data digits and M = 2 were programmed on an IBM 360/67 computer in Fortran IV. In order to give a better idea as to how the equations of these two receivers might be implemented, a brief description of how the simulation programs were organized will be given prior to an outline of the results.

6.1.1 Bit Detector. The bit detector updating relation is given by Eq. 3.3.5 and the decision equation is Eq. 3.3.3. In order to make the relations more directly useful for the computer, let {b_j = 2j - 3}_{j=1}^{2}, and let

    m = (B_k + 1)/2 + ((B_{k+1} + 1)/2)·2 + … + ((B_{k+D} + 1)/2)·2^D + 1    (6.1.1)

Then m is the integer corresponding to the binary number ((B_{k+D} + 1)/2, …, (B_k + 1)/2). Define

    p_m = P(B_k, …, B_{k+D}, X_{k+D})    (6.1.2)

as a vector of 2^{D+1} probabilities which is kept in storage, updated, and used to make decisions. Assume that the vector {p_m}_{m=1}^{2^{D+1}} is in storage and contains the values of Eq. 6.1.2. Then the decision and

updating procedure is as follows:

Step 1. To make a decision on B_k, let

    P_1 = P(B_k = 1, X_{k+D})
    P_0 = P(B_k = -1, X_{k+D})    (6.1.3)

Then, according to Eqs. 3.3.3 and 6.1.2,

    P_1 = Σ_{m=1}^{2^D} p_{2m}
    P_0 = Σ_{m=1}^{2^D} p_{2m-1}    (6.1.4)

and the decision is determined by

    B̂_k = 1, P_1 > P_0;  -1, P_1 < P_0    (6.1.5)

Step 2. The first step in the updating procedure of Eq. 3.3.5 is to sum over B_k. This is accomplished by forming

    p_m = p_{2m-1} + p_{2m},    1 ≤ m ≤ 2^D
    p_m = p_{m-2^D},    2^D + 1 ≤ m ≤ 2^{D+1}    (6.1.6)

The {p_m} array now contains P(B_{k+1}, …, B_{k+D}, X_{k+D}), replicated for both values of B_{k+D+1}.

Step 3. The final step is to form P(B_{k+1}, …, B_{k+D+1}, X_{k+D+1}) according to Eq. 3.3.5 after the receipt of x_{k+D+1}. If it is assumed that the two values of B_{k+D+1} are equally likely, the P(B_{k+D+1}) term can be ignored, as can constant factors in p(x_{k+D+1} | B_{k+D-L+2}, …, B_{k+D+1}). Thus, the updating relation is

    p_m = p_m · p(x_{k+D+1} | B_{k+D-L+2}, …, B_{k+D+1})
        = p_m · exp{-(x_{k+D+1} - Σ_{j=0}^{L-1} B_{k+D+1-j} h(j))²/(2σ²)}    (6.1.7)

When this is completed, control returns to Step 1 for the decision on B_{k+1}.

In the Fortran IV version of this receiver, Eq. 6.1.7 was accomplished by keeping two registers, one containing the integer m and the other the binary sequence (using logical variables)

    (B_{k+1} + 1)/2, …, (B_{k+D} + 1)/2

The binary-to-decimal conversion of Eq. 6.1.1 was avoided by incrementing these two registers simultaneously through all 2^{D+1} values. In a more hardware-oriented implementation, only the (D+1)-bit binary register would be required, and it could be incremented and used to directly address the p_m vector in memory. In this respect, a hardware or machine language implementation would be somewhat more straightforward than one in Fortran.
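The three steps above can be collected into a single routine. The following sketch is in Python rather than Fortran IV; array indices are zero-based, so bit j of the index corresponds to (B_{k+j}+1)/2, it assumes D ≥ L-1, and a renormalization is added to mitigate the underflow behavior noted in Section 6.1.4 (all names are illustrative):

```python
import numpy as np

def bit_detector_step(p, x_new, h, sigma2):
    """One cycle of the bit detector of Section 6.1.1 for M = 2.
    p has length 2^(D+1).  Returns the decision on B_k and the updated
    array, which then refers to (B_{k+1}, ..., B_{k+D+1})."""
    L = len(h)
    D = int(np.log2(len(p))) - 1
    # Step 1 (Eqs. 6.1.4-6.1.5): decide B_k, the low-order index bit
    P1, P0 = p[1::2].sum(), p[0::2].sum()
    decision = 1 if P1 > P0 else -1
    # Step 2 (Eq. 6.1.6): sum over B_k, then replicate for B_{k+D+1} = +/-1
    q = p[0::2] + p[1::2]
    p = np.concatenate([q, q])
    # Step 3 (Eq. 6.1.7): weight by the likelihood of the new sample
    idx = np.arange(len(p))
    s = np.zeros(len(p))
    for j in range(L):                 # mean = sum_j B_{k+D+1-j} h(j)
        s += (2.0 * ((idx >> (D - j)) & 1) - 1.0) * h[j]
    p *= np.exp(-(x_new - s) ** 2 / (2.0 * sigma2))
    return decision, p / p.sum()       # renormalize against underflow
```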

310 used to directly address the pm-vector in memory. In this respect, a hardware or machine language implementation would be somewhat more straightforward than in Fortran. 6.1.2 Block Detector. There are several equivalent methods of implementing the block detector. The method discussed here uses more temporary storage than is really necessary and is therefore inefficient in its use of memory. However, it is more easily understood than alternative implementations. 2L-1 Let the state vector be {p }l m =1 Pm gk+D- L+1 (B k+D- L+2 k+D) as defined in Eq. 3.4. 11, where Bk+D- L2 + 1 Bk+D + 1 L-2 m 2 +'+ 2 +1 (6.1.9) Similarly, maintain a temporary storage vector {qn}, where n=1 q - B h (j) +g (B B (xk+D+1 j- Bk+D+l-j h(D- + gk+D-L+ 2 k+D (6.1.10) and Bk+D-L+2 + 1 Bk+D+1 + 1 Ln _ - + ^.+ 2 1+ 1 ~2~~~~2 (6.1.11)

In addition, for each m, 1 ≤ m ≤ 2^{L-1}, let there be in storage a sequence of D-L+2 bits (d_{m,1}, …, d_{m,D-L+2}) which contains the bits which minimized the quantity on the right of the recursion relationship of Eq. 3.4.13 in the D-L+2 past iterations, d_{m,D-L+2} being the most recent. Then the steps to be performed in making a decision and iterating the state vector follow:

Step 1. To make a decision on B_k according to Eq. 3.5.4, minimize over the state vector {p_m} and choose the corresponding d_{m,1}. That is, let the decision on B_k be d_{m_0,1}, where m_0 satisfies

    p_{m_0} = min_{1≤m≤2^{L-1}} p_m    (6.1.12)

Step 2. As the first step in iterating the state vector, fill the temporary vector {q_n} according to Eqs. 6.1.11 and 6.1.8,

    q_n = (x_{k+D+1} - Σ_{j=0}^{L-1} B_{k+D+1-j} h(j))² + p_n,    1 ≤ n ≤ 2^{L-1}
    q_n = (x_{k+D+1} - Σ_{j=0}^{L-1} B_{k+D+1-j} h(j))² + p_{n-2^{L-1}},    2^{L-1} + 1 ≤ n ≤ 2^L    (6.1.13)

where n is determined according to Eq. 6.1.11.

Step 3. First shift the bits in memory back,

    d_{m,j} = d_{m,j+1},    1 ≤ j ≤ D-L+1,  1 ≤ m ≤ 2^{L-1}    (6.1.14)

and then determine d_{m,D-L+2} and {p_m} according to Eq. 3.4.14,

    p_m = q_{2m}, d_{m,D-L+2} = +1,    q_{2m} < q_{2m-1}
    p_m = q_{2m-1}, d_{m,D-L+2} = -1,    q_{2m-1} ≤ q_{2m}    (6.1.15)

and then return to Step 1.

This completes the description of the bit and block detectors. Nothing has been said about the initialization of the vectors in memory, which is straightforward.
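A compact functional sketch of Steps 1 through 3 follows (Python; zero-based indices; the survivor memories here are inherited from the winning predecessor state, the usual Viterbi bookkeeping, rather than stored per the memory-inefficient description above; all names are illustrative):

```python
import numpy as np

def block_detector_step(p, surv, x_new, h):
    """One cycle of the block detector of Section 6.1.2 (Eqs. 6.1.12-6.1.15).
    p[m]: metric of state m (bit t of m holds (B_{k+D-L+2+t}+1)/2);
    surv[m]: list of D-L+2 surviving decisions for state m."""
    L = len(h)
    nstate = 1 << (L - 1)
    # Step 1 (Eq. 6.1.12): decide B_k from the best state's oldest stored bit
    m0 = int(np.argmin(p))
    decision = surv[m0][0]
    # Step 2 (Eq. 6.1.13): branch metrics for all 2^L one-digit extensions
    q = np.empty(2 * nstate)
    for n in range(2 * nstate):        # bit t of n is (B_{k+D-L+2+t}+1)/2
        mean = sum((2 * ((n >> (L - 1 - j)) & 1) - 1) * h[j] for j in range(L))
        q[n] = (x_new - mean) ** 2 + p[n % nstate]
    # Step 3 (Eqs. 6.1.14-6.1.15): select survivors, shift decision memory
    new_p, new_surv = np.empty(nstate), []
    for m in range(nstate):
        n_star = 2 * m if q[2 * m] <= q[2 * m + 1] else 2 * m + 1
        new_p[m] = q[n_star]
        pred = n_star % nstate         # predecessor state of winning branch
        new_surv.append(surv[pred][1:] + [2 * (n_star & 1) - 1])
    return decision, new_p, new_surv
```

With p initialized to zeros and each surv[m] to a list of D-L+2 dummy decisions, repeated calls realize the real-time block detector with decision delay D.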

6.1.3 Performance of Three Receivers for an Example with L = 2. The real-time bit detector (D = 3) and real-time block detector (D = 1) were simulated for a channel with L = 2 and M = 2. The specific channel vector chosen was

    H = (1.0, 0.5)^T    (6.1.16)

for which

    ρ_1 = 0.4    (6.1.17)

The resulting probabilities of error are plotted in Fig. 6.1 as a function of the S/N ratio in dB, which is given by Eq. 3.6.24. Also plotted are the error probabilities of a matched filter receiver in the absence of intersymbol interference and of the infinite transversal filter receiver of Section 2.4. The latter two probabilities, given by Eqs. 2.4.20 and 2.4.33, represent lower and upper bounds respectively on the error probability of the non-realtime bit detector.

The results in Fig. 6.1 can be summarized as follows: The error probabilities of the real-time bit detector and real-time block detector are not significantly greater than that which can be achieved in the absence of intersymbol interference. The error probability of the transversal filter is considerably greater than that of the bit detector and block detector. More specifically, the effective decrease in S/N ratio due to intersymbol interference is 5 dB for the transversal filter receiver, while for the bit and block detectors it is at most 1 dB.

Since the error probability of the block detector (D = 1) is only slightly greater than that of the bit detector (D = 3), it is apparent that the block detector error probability cannot get much smaller when larger values of D are used. The error probability of the block detector is plotted as a function of D in Fig. 6.2, which reveals the surprising fact that the error probability actually increases as D gets larger than D = 1 at all S/N ratios. The same phenomenon can be observed in Fig. 6.3 for an example with L = 6. In this case too, the minimum error probability occurs at D = L-1 = 5.

The real-time bit detector error probability cannot increase with increasing D, since the minimum probability of error can only decrease with the addition of relevant information.

[Fig. 6.1. Error probability of three receivers as a function of S/N ratio (dB): matched filter (lower bound), infinite transversal filter (upper bound), real-time bit detector (D = 3), and real-time block detector (D = 1); L = 2, ρ_1 = 0.4]

[Fig. 6.2. Block detector error probability as a function of D, L = 2]

[Fig. 6.3. Block detector error probability as a function of D for L = 6; H = [-0.077, -0.355, 0.59, 1.0, 0.059, -0.273]; S/N = 0 and 5 dB]

For the block detector, on the other hand, the error probability increases when samples are introduced into the decision on B_k which contain no signal energy from B_k. The additional samples and data digits introduced into the decision by the addition of samples with no signal energy from B_k apparently confuse the decision sufficiently to raise the error probability.

6.1.4 Parameter Sensitivity. Since in practice the channel vector H and noise variance σ² would seldom be known with complete accuracy, the sensitivity of the receiver error probabilities to inaccurate knowledge of these parameters is an important practical question. It indicates just how inaccurate the knowledge of these parameters can be before the error probabilities are significantly affected. If these parameters cannot be obtained with the required accuracy, the uncertainty should be modeled statistically and incorporated into the receiver structure.

Consider first the sensitivity to knowledge of the noise variance σ². It was shown in Section 2.1 that the block detector does not require knowledge of σ² when the signals are equally likely. The bit detector sensitivity can be qualitatively determined by examining the effect of σ² on the decision regions of Figs. 3.2 and 3.5, which show the asymptotic decision regions for the bit detector at high and low S/N ratios respectively. As shown in Figs. 3.6 through 3.8,

high S/N ratios are those above about 10 dB, and low S/N ratios are those below about -10 dB. If the σ² given to the bit detector differs from the actual σ², this should only make a significant difference if one of them corresponds to a high S/N ratio and the other to a low S/N ratio. For example, if the actual S/N ratio were -10 dB, but the bit detector was told that the S/N ratio is 10 dB, the error probability would be increased somewhat because the decision regions would not be correct. On the other hand, if the bit detector was told that the S/N ratio is 30 dB, but in fact it is 10 dB, the decision regions would be essentially unchanged and the error probability increase would be minimal. This qualitative statement takes into account the decision regions in the (L_{k-1}, x_{k+1}) plane, but does not reflect the effect of inaccurate knowledge of σ² on the updating of the decision axis L_{k-1}.

The sensitivity of the bit detector error probability to knowledge of σ² is shown in Fig. 6.4 for an actual S/N ratio of 5 dB and the H of Eq. 6.1.16. Since 5 dB is a relatively high S/N ratio, the error probability should be significantly degraded only when the assumed σ² is on the order of 10 dB larger than the actual. This is confirmed in Fig. 6.4, which also shows that the block detector actually has a lower error probability than the bit detector when the assumed σ² is more than 6 dB too large.

[Fig. 6.4. Bit detector sensitivity to σ²: error probability vs. the noise variance the receiver assumes, in dB above the actual noise variance; bit detector (D = 2), block detector (D = 1), and the a posteriori error probability calculated by the bit detector; actual S/N ratio = 5 dB, h(0) = 1, h(1) = 0.5, ρ_1 = 0.4]

Both the bit detector and the block detector will be sensitive to an inaccurate knowledge of H. Since the decision regions of the two receivers are so similar, one would expect that the degradation in performance of the two receivers would be very similar. This is confirmed in Fig. 6.5, where the error probabilities of the two receivers are plotted as a function of the ratio of the assumed h(k) to the actual h(k), k = 0, 1.

The bit detector error probability is plotted only over a relatively narrow range in Fig. 6.5. This is because the bit detector was plagued by overflows and underflows in intermediate calculations at points outside this range. All the exponentiations in the bit detector make it particularly susceptible to this type of behavior.

6.2 Fixed Random Channel

The real-time bit detector computer program was modified to handle a fixed random channel using the updating equations of Section 4.3. In addition, the known channel block detector program described in Section 6.1.2 was modified to accept estimates of the channel vector H as if they were exact, and the various estimation algorithms of Chapter 5 were programmed to provide these estimates. Before giving the results of the simulations, the structure of the modified bit detector will be discussed in Section 6.2.1, and the choice of stochastic approximation algorithm step-sizes and thresholds will be justified in Section 6.2.2.

[Fig. 6.5. Receiver sensitivity to knowledge of H: error probability of the bit detector (D = 2) and block detector (D = 1) vs. the ratio of assumed to actual h(1) (upper panel) and h(0) (lower panel); H = [1.0, 0.5], S/N ratio = 5 dB]

6.2.1 Bit Detector for a Fixed Random Channel. The updating equation for the real-time bit detector for a fixed random channel and independent data digits is Eq. 4.3.11 with E_K = 0. The equations of Section 6.1.1 can easily be modified to be consistent with Eq. 4.3.11. To do so, quantize each component of H to Q values, so that there are a total of Q^L quantized values of H, {H^(n)}_{n=1}^{Q^L}. Then define

    p_{m,n} = P(B_k, …, B_{k+D}, H^(n), X_{k+D}),    1 ≤ m ≤ 2^{D+1},  1 ≤ n ≤ Q^L    (6.2.1)

where m is defined as in Eq. 6.1.1. The {p_{m,n}} array is stored in memory, used to make a decision on B_k, and then updated. The decision and updating steps are listed below. They are very similar to the steps of Section 6.1.1, except for the addition of the variable n associated with the quantized values of H.

Step 1. Define P_1 and P_0 as in Eq. 6.1.3, and let

    P_{1,n} = P(B_k = 1, H^(n), X_{k+D})
    P_{0,n} = P(B_k = -1, H^(n), X_{k+D})    (6.2.2)

Then, as in Eq. 6.1.4,

    P_{1,n} = Σ_{m=1}^{2^D} p_{2m,n}
    P_{0,n} = Σ_{m=1}^{2^D} p_{2m-1,n}    (6.2.3)

and

    P_1 = Σ_{n=1}^{Q^L} P_{1,n}
    P_0 = Σ_{n=1}^{Q^L} P_{0,n}    (6.2.4)

The decision on B_k is determined by Eq. 6.1.5 as before. In addition to a decision on B_k, the a posteriori mean of H is required, in order to compare the speed of convergence of this estimate of H with that of the stochastic approximation algorithms. It can be calculated from {p_{m,n}} as

    E(H | X_{k+D}) = Σ_{n=1}^{Q^L} H^(n) P(H^(n) | X_{k+D}) = Σ_{n=1}^{Q^L} H^(n) (P_{0,n} + P_{1,n})/(P_0 + P_1)    (6.2.5)

which follows from Eqs. 6.1.3 and 6.2.2.

Step 2. The first step in the updating procedure is to sum P(B_k, …, B_{k+D}, H, X_{k+D}) over B_k in the same way as in Eq. 6.1.6,

    p_{m,n} = p_{2m-1,n} + p_{2m,n},    1 ≤ m ≤ 2^D
    p_{m,n} = p_{m-2^D,n},    2^D + 1 ≤ m ≤ 2^{D+1},  1 ≤ n ≤ Q^L    (6.2.6)

The {p_{m,n}} array then contains P(B_{k+1}, …, B_{k+D}, H^(n), X_{k+D}).

Step 3. The final step is to update P(B_{k+1}, …, B_{k+D}, H^(n), X_{k+D}) to P(B_{k+1}, …, B_{k+D+1}, H^(n), X_{k+D+1}) using x_{k+D+1}. The appropriate transformation of the {p_{m,n}} array is

    p_{m,n} = p_{m,n} exp{-(x_{k+D+1} - B_{k+D+1}^T H^(n))²/(2σ²)},    1 ≤ m ≤ 2^{D+1},  1 ≤ n ≤ Q^L    (6.2.7)

When this is completed, control returns to Step 1 for the decision on B_{k+1}.

This algorithm is very similar to that of Section 6.1.1, the only difference being that each probability p_m in Eq. 6.1.2 is replaced by a vector of Q^L probabilities, one for each H^(n). The procedure of Section 6.1.1 is then essentially repeated once for each

of the Q^L values of H^(n). The amount of computation involved is increased by a factor of Q^L, which can be an enormous increase for typical values of Q and L.

In the simulations to follow, the actual value of H used is given by Eq. 6.1.16, while each of the two components of H was quantized to Q = 10 values. The quantized values of the first component were {-0.2 + n(0.4)}_{n=1}^{10} and those of the second component were {-0.1 + n(0.2)}_{n=1}^{10}. The a priori density of H was chosen to be uniform,

    P(H^(n)) = 0.01,    1 ≤ n ≤ 100    (6.2.8)

which gives an a priori mean of

    E(H) = (2.0, 1.0)^T    (6.2.9)

All the other estimation algorithms were initialized by the same value of H, given by Eq. 6.2.9.
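For the quantized-channel recursion above, the grid of Eq. 6.2.8 and the a posteriori mean of Eq. 6.2.5 can be sketched as follows (Python; names are illustrative; p is the 2^{D+1} by Q^L array of Eq. 6.2.1):

```python
import numpy as np

# quantized channel grid used in the simulations (Q = 10, L = 2)
g1 = -0.2 + 0.4 * np.arange(1, 11)     # first-component values
g2 = -0.1 + 0.2 * np.arange(1, 11)     # second-component values
H_grid = np.array([(a, b) for a in g1 for b in g2])   # the 100 vectors H^(n)

def posterior_mean(p, H_grid):
    """A posteriori mean of H, Eq. 6.2.5: weight each H^(n) by
    P(H^(n), X_{k+D}) = P_{0,n} + P_{1,n} and normalize."""
    w = p.sum(axis=0)                  # sum over all data-digit indices m
    return (w[:, None] * H_grid).sum(axis=0) / w.sum()
```

The mean of this grid under the uniform prior is (2.0, 1.0)^T, in agreement with Eq. 6.2.9.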

6.2.2 Stochastic Approximation Algorithm Parameters. Also programmed for simulation were seven channel vector estimation algorithms for use with the known channel block detector. These seven estimators, along with the a posteriori mean of Eq. 6.2.5, are listed below:

1. ORECR: The a posteriori mean of Eq. 6.2.5.
2. BAYES: The supervised a posteriori mean of Eq. 5.1.26.
3. SAA: The fixed step-size stochastic approximation algorithm for known wide-sense stationary data statistics of Eq. 5.4.1.
4. SAAUD: SAA modified to handle unknown and/or time-varying data statistics, given by Eq. 5.5.1.
5. FISSA: The fixed increment sequential estimation algorithm of Eq. 5.6.20.
6. QFISSA: The quantized version of FISSA of Eq. 5.6.45.
7. USSA: The unsupervised stochastic approximation algorithm of Eqs. 5.7.13 and 5.7.6. The solution of Eq. 5.7.6 in the correct quadrant was always chosen.
8. DUSSA: The direct version of USSA of Eq. 5.7.22.

The ORECR estimate was included not to provide an estimate of H for use by the block detector, but rather as a standard against which to compare the other estimates. Since it is the conditional mean, ORECR is the least mean-square nonlinear unsupervised estimator of H [1].

Since it has available a complete statistical characterization of the channel and data statistics, it can be expected to converge faster than the various supervised stochastic approximation algorithms.

The estimation algorithms SAA, SAAUD, FISSA, QFISSA, and USSA were fairly thoroughly analyzed in Chapter 5 under the assumption that the data digits are available without error. The primary purpose of the computer simulations of these algorithms is twofold. First, they are necessary in order to determine the effect of decision errors when the supervised algorithms are used in a decision-directed mode. Second, they are necessary in order to compare the convergence rate of these algorithms with that of ORECR and BAYES.

In order to observe the effect of decision errors, S/N ratios of 10, 0, and -10 dB were chosen for the simulations. These three S/N ratios represent radically different situations with respect to the number of decision errors to be anticipated. Since the error rate at a S/N ratio of 10 dB should be on the order of 10^-3 (see Fig. 6.1), with only one error in every 1000 bits on the average, the decision-directed algorithms should not be adversely affected by decision errors. On the other hand, at a S/N ratio of 0 dB the best error rate that could be expected is about 0.2, and the decision-directed algorithms could be adversely affected by one error in every 5 bits. And finally, at a S/N ratio of -10 dB the best error rate is about 0.4, and the decision-directed algorithms are certain to be adversely affected.

The parameters of the various algorithms will now be determined at these three S/N ratios using the theory developed in Chapter 5. The objective will be to obtain at all three S/N ratios an asymptotic mean-square error bounded by

$$\lim_{k\to\infty} E\|H_k - H\|^2 \le 0.025 \tag{6.2.10}$$

The resulting parameters of the algorithms are shown in Table 6.1. The procedure used to obtain these parameters will now be outlined.

The step size $\alpha$ of SAA can be determined from Eq. 5.8.16, since the simulations will be performed with independent data digits. Since the resulting step sizes are very small, the convergence rate and asymptotic error of SAAUD should be essentially the same as those of SAA if the same step size is used. The time constant $\tau$ of SAA and SAAUD is defined by the equation

$$r^k = (1 - \alpha\lambda_1)^k \approx e^{-k/\tau}, \qquad \text{or} \qquad \tau = \frac{-1}{\ln(1 - \alpha\lambda_1)} \approx \frac{1}{\alpha\lambda_1} \tag{6.2.11}$$

for small $\alpha\lambda_1$. The time constant $\tau$ is the number of samples required for $\|E(H_k) - H\|$ to decrease to $e^{-1}$ times the initial error, $\|H_0 - H\|$.

Table 6.1  Stochastic Approximation Algorithm Parameters

                                          S/N Ratio (dB)
                                        10        0        -10
  σ²                                   0.125     1.25      12.5
  lim (k→∞) E‖H_k − H‖² / (Lσ²)        0.1       0.01      0.001
  ‖H‖² / (Lσ²)                         5.0       0.5       0.05

  SAA, SAAUD    α                      0.0333    0.00663   0.002
                τ = 1/αλ₁              31        151       500

  FISSA,        δ                      0.224     0.224     0.224
  QFISSA        δ/2σ                   0.317     0.1       0.032
                P_e(δ)                 0.1       0.1       0.1

  FISSA         R_t                    4.0       11.0      117.0
                E(k*)                  11        85        808
                τ = R_t/δ              18        49        522

  QFISSA        R_t                    4         14        42
                E(k*)                  13        135       1300

It is, therefore, a measure of the convergence rate of the algorithm.

The step size $\delta$ of the fixed increment algorithms was determined by setting

$$\lim_{k\to\infty} E\|H_k - H\|^2 = L\left(\frac{\delta}{2}\right)^2 = 0.025 \tag{6.2.12}$$

Then, in order to determine $R_t$ for FISSA and QFISSA, the value of $P_e(\delta)$ chosen for $|e_k| = \delta/2$ was 0.1. The threshold $R_t$ and $E(k^*)$ were determined using Eqs. 5.6.26 and 5.6.27 for FISSA, and Eqs. 5.6.47 and 5.6.48 for QFISSA. The approximate time constant $\tau$ of FISSA follows directly from Eq. 5.6.39,

$$\tau = \frac{R_t}{\delta} \tag{6.2.13}$$

The step sizes of USSA and DUSSA were chosen to be the same as those of SAA and SAAUD. Since the theoretical convergence rate of USSA is $r^k$, where $r = 1 - \alpha\lambda$ (see Section 5.7), USSA should then converge at the same rate as SAA and SAAUD.

One surprising feature of Table 6.1 is that the time constant $\tau$ of FISSA is smaller than that of SAA; that is, FISSA should converge more quickly than SAA for this choice of parameters. This is a consequence of the choice of $P_e(\delta)$ for FISSA, because the time constant is strongly dependent on the probability of incorrect increment. The convergence rate of FISSA would be slowed if $P_e(\delta)$ were chosen to be smaller.
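The fixed step-size updates themselves are simple; the work above is entirely in choosing the parameters. As a point of reference, the following sketch writes the SAA and SAAUD recursions in the forms consistent with the error recursions of Appendices C through E (SAA uses an assumed data covariance matrix; SAAUD does not, and in Appendix E it is applied only every Δth sample, whereas the sketch applies it per sample for brevity). The function names and the worked example are assumptions made here, not the report's program.

```python
import numpy as np

def saa_step(H_est, B, x, alpha, C_assumed):
    """SAA: fixed step size, assumed data covariance C_assumed = E(B B^T)."""
    return H_est + alpha * (B * x - C_assumed @ H_est)

def saaud_step(H_est, B, x, alpha):
    """SAAUD: no covariance assumed; B B^T H_est replaces C H_est."""
    return H_est + alpha * B * (x - B @ H_est)

# Example at S/N = 10 dB with the Table 6.1 parameters (alpha = 0.0333):
rng = np.random.default_rng(0)
H_true = np.array([1.0, 0.5])
H_est = np.array([2.0, 1.0])              # initialized at the a priori mean, Eq. 6.2.9
alpha, sigma = 0.0333, np.sqrt(0.125)
C = np.eye(2)                             # independent +/-1 digits
B = rng.choice([-1.0, 1.0], size=2)       # most recent L = 2 data digits
for _ in range(300):
    x = B @ H_true + sigma * rng.normal()
    H_est = saa_step(H_est, B, x, alpha, C)
    B = np.array([rng.choice([-1.0, 1.0]), B[0]])   # shift in a new digit
print(H_est)                              # should be near H_true after ~10 time constants
```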

6.2.3 Simulation Results. The results of the computer simulations are displayed in Figs. 6.6, 6.7, and 6.8. In each of these figures, the first component of each of the eight estimates is shown as a function of the number of samples.

In Fig. 6.6, which corresponds to S/N = 10 dB, the algorithms all behave in the manner predicted by Table 6.1. SAA and SAAUD, which follow each other closely, converge with a time constant of approximately $\tau = 40$. FISSA and QFISSA converge almost twice as fast as SAA and SAAUD. USSA converges at almost exactly the same rate as SAA, while DUSSA converges much faster than USSA. However, the asymptotic variation of DUSSA is somewhat greater than that of USSA. Both of the estimators suitable only for a fixed channel vector H, ORECR and BAYES, converge very quickly to near the actual value of h(0). There is more asymptotic variation in the BAYES estimate, however; presumably this is because BAYES is adversely affected by decision errors, whereas ORECR is not.

In Fig. 6.7, where the S/N ratio is 0 dB, the effects of decision errors become very evident. SAA and SAAUD have not converged completely after 500 samples, even though the time constant predicted by Table 6.1 is $\tau = 150$.

[Fig. 6.6. Estimates of h(0) at S/N = 10 dB, plotted versus number of samples (0 to 300). Upper panel: SAA, SAAUD, BAYES, and ORECR; lower panel: USSA, DUSSA, FISSA, and QFISSA.]

[Fig. 6.7. Estimates of h(0) at S/N = 0 dB, plotted versus number of samples (0 to 500). Upper panel: SAA, SAAUD, ORECR, and BAYES; lower panel: FISSA, QFISSA, USSA, and DUSSA.]

[Fig. 6.8. Estimates of h(0) at S/N = -10 dB, plotted versus number of samples (0 to 1000).]

In fact, the convergence of SAA and SAAUD does not even appear to be exponential, but rather is closer to being linear. FISSA and QFISSA display very similar behavior. Both of the unsupervised algorithms, USSA and DUSSA, have essentially converged after 500 samples. As in Fig. 6.6, DUSSA converges somewhat faster than USSA. In Fig. 6.7 ORECR takes about 400 samples to converge completely; however, the ORECR estimate is close to h(0) after only about 20 samples. The most surprising feature of Fig. 6.7 is that BAYES is severely affected by decision errors. It appears to be converging to an h(0) of about 0.3, rather than 1.0. Apparently BAYES is more sensitive to decision errors than are the fixed step-size and fixed increment algorithms.

Finally, as expected, all the supervised estimators are severely affected by decision errors in Fig. 6.8, where the S/N ratio is -10 dB. They all exhibit approximately the same behavior, converging to h(0) ≈ 3.0. The three unsupervised estimators, USSA, DUSSA, and ORECR, converge very slowly. The ORECR estimate is fairly accurate after about 200 samples, but has not converged completely even after 1000 samples.

The error probability of the bit detector and of the block detector with each of the seven estimators is given in Table 6.2 for S/N ratios of 0 and -10 dB. The error probability at S/N = 10 dB is not shown because the number of errors observed in the run of 1000 samples was not large enough to be statistically significant.

Table 6.2  Error Probabilities for the Fixed Random Channel

                                      S/N Ratio (dB)
                                        0        -10
  Bit Detector                         0.17      0.35
  Block Detector with SAA              0.17      0.37
  Block Detector with SAAUD            0.18      0.37
  Block Detector with BAYES            0.35      0.37
  Block Detector with FISSA            0.17      0.37
  Block Detector with QFISSA           0.19      0.37
  Block Detector with USSA             0.17      0.36
  Block Detector with DUSSA            0.17      0.36

At a S/N ratio of 0 dB, the block detector with SAA, FISSA, and USSA had as low an error probability as the bit detector. The error probability of the block detector with BAYES is poor because that estimator did not converge properly, as exhibited in Fig. 6.7. At a S/N ratio of -10 dB the bit detector had a lower error probability than the block detector with any of the estimators. The block detector with USSA and DUSSA had a lower error probability than the block detector with any of the supervised estimators, because those supervised estimators did not converge.

6.3 Conclusions

In this chapter the performance of several receivers was studied on the fixed known channel and the fixed random channel. The approach used was to simulate the channel and receivers using Monte Carlo methods.

On the fixed known channel, the real-time bit detector and real-time block detector were simulated. For the example considered, the error probabilities of the two receivers did not differ appreciably at any S/N ratio. The effective decrease in S/N ratio due to intersymbol interference did not exceed 1 dB for either receiver. The error probability of the real-time block detector was determined as a function of D for an example with L = 2 and another example with L = 6. In each case the minimum probability of error occurred at D = L - 1.

On the fixed random channel eight receivers were simulated. One was the real-time bit detector; the remaining seven were estimation algorithms used in conjunction with the real-time block detector. Five of the seven estimation algorithms (SAA, SAAUD, FISSA, QFISSA, BAYES) were supervised and were used in a decision-directed mode. The bit detector and the BAYES estimation algorithm are suitable for the fixed random channel alone, while the remaining estimation algorithms are suitable for a time-varying channel.

In all cases the bit detector estimate ORECR converged much faster than the others, with the exception of BAYES at a S/N ratio of 10 dB. Because they are unsupervised, the ORECR, USSA, and DUSSA algorithms converged at all three S/N ratios. The error probability of the block detector with USSA and DUSSA was not appreciably greater than that of the bit detector at any S/N ratio. All the decision-directed receivers failed to converge at all at a -10 dB S/N ratio. At a S/N ratio of 0 dB, all except BAYES converged; however, they were adversely affected by decision errors, in that the convergence rate was somewhat slower than that predicted theoretically (under the assumption of no decision errors).

In summary, the block detector used on a fixed random channel in conjunction with an estimation algorithm can have an error probability essentially the same as that of the bit detector, as long as some care is exercised in choosing the algorithm and its parameters. At very low S/N ratios (below about 0 dB) the estimation algorithm must be unsupervised. At relatively high S/N ratios (above about 0 dB) a supervised estimation algorithm used in a decision-directed mode works quite well.

CHAPTER 7
CONCLUSIONS AND SUGGESTIONS FOR FUTURE STUDY

In preceding chapters several techniques for combatting intersymbol interference in PAM systems have been considered separately. In Section 7.1 of this chapter the comparative advantages and disadvantages of these techniques will be discussed in a general way. The primary purpose will be to outline the type of situation where each technique might be applicable. In Section 7.2 suggestions for the extension of this study will be given.

7.1 Conclusion - Comparison of the Techniques

In Chapters 1 through 5, receivers for the PAM system operating over an additive white Gaussian noise channel with intersymbol interference were designed. Three different states of knowledge of the channel statistics were considered:

1. In Chapter 2 the elementary pulse waveform h(t) was assumed to be completely known to the receiver. In Chapter 3 this known waveform was assumed to be bandlimited.

2. In Chapter 4 and Section 5.1 the bandlimited random channel was considered. By a random channel, it is meant that the channel has been statistically modeled to the extent that an a priori density can be assumed. A random channel can be fixed or can vary in time.

3. In most of Chapter 5, the bandlimited channel was assumed to be non-random. By non-random it is meant that no probability structure has been placed on the channel vector H in the form of an a priori density.

The model which is appropriate for a particular problem depends on several factors:

1. The actual state of knowledge of the channel, based on physical considerations or extensive observation of its operational characteristics, will influence the model. For instance, it may be known from previous experience with the channel that it can be expected to remain fixed over the period of a single communication (NT seconds).

2. The type and accuracy of the information on the channel which can be gained by observation just prior to the commencement of communication, or during the communication itself, will influence the model to be used. This factor is strongly influenced by the S/N ratio and the rate of channel variation. At high S/N ratios it is almost always possible to make an accurate estimate of a fixed channel during a training period, and in this case the fixed known channel model will be appropriate. At a low S/N ratio, or if the channel is varying rapidly, an accurate channel estimate is difficult to make and a random channel model will be more appropriate.

3. The amount of receiver complexity which can be tolerated will influence the choice of model.

In general, the receivers for a random channel are more complex than the receivers for a non-random channel. Therefore, the allowable complexity of implementation might dictate that the non-random channel model be used even when there is enough statistical information on the channel available to warrant the random channel model. However, as a general rule, the best performance can be obtained by incorporating all the available statistical information into the receiver.

Three different PAM receivers have been considered. Each of these three receivers will be discussed separately below.

7.1.1 Transversal Filter Receiver. The established receiver for countering intersymbol interference is the transversal filter receiver. Essentially, it is the extension to a channel with intersymbol interference of the matched filter receiver, which is the minimum probability of error receiver for a fixed known white Gaussian noise channel without intersymbol interference. Like the matched filter receiver, it consists of a single linear filter whose sampled output is applied to a threshold to make the decision on a single data digit. When the transversal filter has an infinite number of taps, the tap-gains can (under certain conditions on h(t)) be chosen so that the sampled output of the filter depends on only the single data digit on which a decision is desired. The filter cannot have this property for arbitrary lengths of data sequences when there is a finite number of taps.

The major shortcoming of the transversal filter receiver is that its probability of error approaches 1/2 as the equivalent power spectrum P(ω) of the channel approaches zero at some frequency ω₀. The equivalent power spectrum depends on the channel frequency response H(ω) and the signaling interval T. The applicability of the transversal filter receiver is therefore limited to channels and data rates for which P(ω) does not vanish or nearly vanish at any point.

For actual implementation, a transversal filter with a finite number of taps must be used. In this case the sampled output must contain a contribution from more than one data digit. The tap-gains must then be chosen according to some criterion, such as minimizing the mean-square error between the sampled filter output and the data digit on which the decision is to be made. Iterative automatic equalization algorithms have been developed by many authors to adjust the tap-gains during a training period. When the channel is time-varying, adaptive equalization algorithms have been developed to update the tap-gains continually during the transmission of actual random data. An example of an algorithm suitable for either automatic or adaptive equalization is given in Section 1.6; a sketch of a typical update of this kind appears at the end of this subsection.

Most of the existing adaptive equalization algorithms are supervised and decision directed. Therefore, they are applicable only at relatively high S/N ratios. These algorithms are applicable only to the non-random channel, as they contain no provision for incorporating statistical information on the channel into the tap-gain updating algorithm.
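The algorithm of Section 1.6 is not reproduced here; the sketch below shows a generic decision-directed tap-gain update of the minimum mean-square-error type just described, with the detected digit serving as the reference. The function and variable names are assumptions made here for illustration.

```python
import numpy as np

def equalizer_step(c, y, alpha):
    """One decision-directed tap-gain update for a transversal equalizer.

    c     : current tap-gain vector
    y     : the most recent received samples in the filter delay line
    alpha : step size
    Returns the detected digit and the updated tap gains.
    """
    z = c @ y                             # sampled equalizer output
    b_hat = 1.0 if z >= 0 else -1.0       # threshold decision (binary PAM)
    e = z - b_hat                         # error relative to the detected digit
    return b_hat, c - alpha * e * y       # steepest-descent correction
```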

7.1.2 Bit Detector. The second receiver which has been considered is the bit detector. The bit detector is the minimum average probability of error receiver for detecting (B₁ ⋯ B_N). The bit detector is applicable to the known or random channel, but not to the non-random channel. It requires a complete statistical characterization of the channel and data statistics.

The main disadvantages of the bit detector are twofold. First, in many practical situations the channel may not be sufficiently well characterized to provide the bit detector with accurate channel statistics. Second, the computation and memory requirements of the bit detector are excessive, particularly in the random channel case. This is due to the fact that the random channel vector must be quantized and the updating probabilities calculated for every quantized value.

Perhaps the greatest value of the bit detector is that its error probability is a standard against which to compare other receivers. Where there is a considerable disparity between the error probabilities of the bit detector and another receiver, it is an indication that with some effort the performance of that other receiver could be substantially improved upon. Such is the case with the transversal filter, the error probability of which is much larger than that of the bit detector for severe intersymbol interference.

On the other hand, the calculations and simulations of this study have revealed no substantial difference between the error probabilities of the bit detector and block detector on the fixed known channels and fixed random channels. On these channels the implementation of the bit detector would seldom be justified, because of the adequate performance of the block detector. The one instance where actual implementation of the bit detector might be justified is on the random time-varying channel, because there the bit detector can make maximum use of the available statistical information on the channel. However, no simulations or analytical calculations have been made on the random time-varying channel in the present study to confirm this.

7.1.3 Block Detector. A receiver which eliminates most of the shortcomings of the transversal filter receiver, at the expense of considerably greater complexity of implementation, is the block detector. The real-time version of the block detector, in the case of equally likely data digits, chooses from among the $M^{k+D}$ possible signals the one which is closest to the reception $X_{k+D}$ in signal space. Whereas the transversal filter receiver applies a threshold to the sampled output of a single linear filter, the block detector chooses the maximum of the sampled outputs of $M^{k+D}$ linear filters

to determine $B_k$. Unlike the transversal filter receiver, the block detector requires and takes advantage of the data statistics. It is most practical in the $K_b$-Markov data statistics case, where a fixed processor structure and memory are obtained in the real-time version. In this case the computation and memory are exponentially dependent on $J = \max\{L-1, K_b\}$, placing a practical limitation on the L and $K_b$ for which the block detector is applicable. Where the data statistics are unknown, or $K_b > L-1$ and it is desired to reduce the computation, the block detector for independent equally likely data digits can be used. The block detector then chooses the signal nearest the reception in signal space, and can be expected to perform well. There is further impetus for this approach in the fact that the block detector then does not require knowledge of the noise variance $\sigma^2$. A sketch of the resulting dynamic programming recursion is given below.
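For the independent equally-likely case just described, the dynamic programming recursion reduces to a minimum-distance search over the $M^{L-1}$ channel states. The following sketch is a reconstruction of that recursion, not the report's program: binary digits, a squared-distance metric, the non-real-time form (deciding at the end of the block rather than with delay D), and all names are assumptions made here; it requires L ≥ 2.

```python
import numpy as np
from itertools import product

def block_detect(x, h):
    """Minimum-distance sequence decision for binary PAM with pulse taps h.

    x : received samples, x_k = sum_j h[j] * b[k-j] + noise
    h : channel taps (h[0], ..., h[L-1]), L >= 2
    Returns the detected +/-1 sequence.
    """
    L = len(h)
    states = list(product([-1, 1], repeat=L - 1))    # the last L-1 digits
    cost = {s: 0.0 for s in states}                  # accumulated metric per state
    path = {s: [] for s in states}                   # survivor sequence per state
    for xk in x:
        new_cost, new_path = {}, {}
        for s in states:
            for b in (-1, 1):                        # hypothesized new digit
                pred = np.dot(h, (b,) + s)           # predicted noiseless sample
                c = cost[s] + (xk - pred) ** 2
                ns = (b,) + s[:-1]                   # new state after absorbing b
                if ns not in new_cost or c < new_cost[ns]:
                    new_cost[ns], new_path[ns] = c, path[s] + [b]
        cost, path = new_cost, new_path
    best = min(cost, key=cost.get)                   # state with the smallest metric
    return path[best]
```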

The block detector dynamic programming algorithm does not generalize directly to the random channel case. However, the known channel block detector is still applicable when used with one of the estimation algorithms developed in Chapter 5. Estimators can be developed which incorporate a great deal of statistical information about the random channel (such as the a posteriori mean of Section 5.1), or which require virtually no statistical information on the channel (such as the algorithm of Section 5.5). The estimates can be supervised, like the transversal filter equalization algorithms, or unsupervised. The simulations of Chapter 6 indicate that the known channel block detector in conjunction with an estimator can have essentially the same error probability as the bit detector on a fixed random channel when the estimator is properly chosen. At high S/N ratios a supervised estimator used in a decision-directed mode works well, while at low S/N ratios (below about 0 dB) it is necessary to use an unsupervised estimator.

7.2 Suggestions for Future Study

There are numerous directions in which the present study could be extended. A few of these are listed below:

1. There is a need to perform additional computer simulations of these receivers. In particular, no analytical performance evaluations or computer simulations for any time-varying channel have been performed.

2. The implications of hardware and software implementation of the block detector and bit detector should be considered.

3. Some specific communication channels should be considered. An attempt at modeling these channels statistically could lead to a better understanding of the relative merit of the several receivers for these particular channels.

4. After Steps 1 through 3 have been performed, promising approaches can be tried out on actual channels to determine any shortcomings of the theory which should be corrected.

5. The extension of these receivers to channels employing diversity reception should be carried out.

6. More complex noise statistics should be incorporated into the bit and block detectors. The extension to colored Gaussian noise would be a natural first step.

7. There is a need for analysis of the runaway situation in the decision-directed mode. Although such an analysis would be difficult, it might be possible to carry out in some simple cases.

8. Additional channel estimators, particularly of the unsupervised variety, are needed.

9. The block detector has a computational load exponentially dependent on L. A linear dependence on L could be obtained by using the Fano algorithm for the sequential decoding of convolutional codes [1]. This algorithm would provide an alternative to the transversal filter receiver for large L.

APPENDIX A
OPTIMUM RECEIVER RELATIONSHIPS

In this appendix various relationships relating to the optimum receivers of Chapters 3 and 4 will be derived. We organize the appendix according to the section in which the relationship is used. Throughout, $X_k = (x_1, \ldots, x_k)$ denotes the reception through sample k, and $\tilde{X}_k = (x_{k+1}, \ldots, x_{N+L-1})$ the remaining samples.

Section 3.2

We wish to establish the identity

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, X_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k) \tag{A.1}$$

To accomplish this, rewrite this probability as

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, X_k) = \sum_{B_{k+1}} \cdots \sum_{B_N} P(\tilde{X}_k \mid B_{k-J+1} \cdots B_N, X_k)\, P(B_{k+1} \cdots B_N \mid B_{k-J+1} \cdots B_k, X_k) \tag{A.2}$$

Treating the first term in the summand of (A.2),

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_N, X_k) = \sum_{B_1} \cdots \sum_{B_{k-J}} P(\tilde{X}_k \mid B_1 \cdots B_N, X_k)\, P(B_1 \cdots B_{k-J} \mid B_{k-J+1} \cdots B_N, X_k) \tag{A.3}$$

but

$$P(\tilde{X}_k \mid B_1 \cdots B_N, X_k) = \frac{P(B_1 \cdots B_N, X_{N+L-1})}{P(B_1 \cdots B_N, X_k)} = \frac{P(X_{N+L-1} \mid B_1 \cdots B_N)}{P(X_k \mid B_1 \cdots B_N)} = \prod_{j=k+1}^{N+L-1} P(x_j \mid B_{j-L+1} \cdots B_j) = P(\tilde{X}_k \mid B_{k-L+2} \cdots B_N) \tag{A.4}$$

where (3.2.3) and the fact that $J \ge L-1$ have been used. Returning to (A.3), it follows from (A.4) that

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_N, X_k) = P(\tilde{X}_k \mid B_{k-L+2} \cdots B_N) \tag{A.5}$$

Next, manipulation of the second term of (A.2) yields

$$P(B_{k+1} \cdots B_N \mid B_{k-J+1} \cdots B_k, X_k) = \sum_{B_1} \cdots \sum_{B_{k-J}} P(B_{k+1} \cdots B_N \mid B_1 \cdots B_k, X_k)\, P(B_1 \cdots B_{k-J} \mid B_{k-J+1} \cdots B_k, X_k) \tag{A.6}$$

and by a similar development the first term of (A.6) can be simplified,

$$P(B_{k+1} \cdots B_N \mid B_1 \cdots B_k, X_k) = \frac{P(B_1 \cdots B_N, X_k)}{P(B_1 \cdots B_k, X_k)} = \frac{P(X_k \mid B_1 \cdots B_N)\, P(B_1 \cdots B_N)}{P(X_k \mid B_1 \cdots B_k)\, P(B_1 \cdots B_k)} = P(B_{k+1} \cdots B_N \mid B_1 \cdots B_k) \tag{A.7}$$

Utilizing (3.2.1),

$$P(B_{k+1} \cdots B_N \mid B_1 \cdots B_k) = \prod_{j=k+1}^{N} P(B_j \mid B_1 \cdots B_{j-1}) = \prod_{j=k+1}^{N} P(B_j \mid B_{j-K_b} \cdots B_{j-1}) = P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k) \tag{A.8}$$

where the fact that $J \ge K_b$ has been used. Combining (A.6) through (A.8), (A.6) becomes

$$P(B_{k+1} \cdots B_N \mid B_{k-J+1} \cdots B_k, X_k) = P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k) \tag{A.9}$$

Finally, substituting (A.5) and (A.9) into (A.2),

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, X_k) = \sum_{B_{k+1}} \cdots \sum_{B_N} P(\tilde{X}_k \mid B_{k-L+2} \cdots B_N)\, P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k) \tag{A.10}$$

which is the desired result.

Now, we must calculate an updating equation for $P(B_{k-J+1} \cdots B_k, X_k)$. Rewriting it as

$$P(B_{k-J+1} \cdots B_k, X_k) = \sum_{B_{k-J}} P(x_k \mid B_{k-J} \cdots B_k, X_{k-1})\, P(B_k \mid B_{k-J} \cdots B_{k-1}, X_{k-1})\, P(B_{k-J} \cdots B_{k-1}, X_{k-1}) \tag{A.11}$$

the first term on the right side of (A.11) is

$$P(x_k \mid B_{k-J} \cdots B_k, X_{k-1}) = P(x_k \mid B_{k-L+1} \cdots B_k) \tag{A.12}$$

Next, the second term in (A.11) is

$$P(B_k \mid B_{k-J} \cdots B_{k-1}, X_{k-1}) = \sum_{B_1} \cdots \sum_{B_{k-J-1}} P(B_k \mid B_1 \cdots B_{k-1}, X_{k-1})\, P(B_1 \cdots B_{k-J-1} \mid B_{k-J} \cdots B_{k-1}, X_{k-1}) \tag{A.13}$$

and since

$$P(B_k \mid B_1 \cdots B_{k-1}, X_{k-1}) = \frac{P(B_1 \cdots B_k, X_{k-1})}{P(B_1 \cdots B_{k-1}, X_{k-1})} = \frac{P(X_{k-1} \mid B_1 \cdots B_k)\, P(B_1 \cdots B_k)}{P(X_{k-1} \mid B_1 \cdots B_{k-1})\, P(B_1 \cdots B_{k-1})} = P(B_k \mid B_{k-K_b} \cdots B_{k-1}) \tag{A.14}$$

(A.13) becomes

$$P(B_k \mid B_{k-K_b} \cdots B_{k-1}). \tag{A.15}$$

Finally, combining (A.15), (A.12), and (A.11), the updating equation becomes

$$P(B_{k-J+1} \cdots B_k, X_k) = \sum_{B_{k-J}} P(x_k \mid B_{k-L+1} \cdots B_k)\, P(B_k \mid B_{k-K_b} \cdots B_{k-1})\, P(B_{k-J} \cdots B_{k-1}, X_{k-1}). \tag{A.16}$$

Proceeding with the derivation of a backdating equation, write

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k) = \sum_{B_{k+1}} P(\tilde{X}_k \mid B_{k-J+1} \cdots B_{k+1})\, P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k) \tag{A.17}$$

and it is necessary to simplify the first term of (A.17). It is

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_{k+1}) = P(x_{k+1} \mid B_{k-J+1} \cdots B_{k+1}, \tilde{X}_{k+1})\, P(\tilde{X}_{k+1} \mid B_{k-J+1} \cdots B_{k+1}) \tag{A.18}$$

in which the first term is $P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1})$ and the second term can be written as

$$\sum_{B_{k+2}} \cdots \sum_{B_N} P(\tilde{X}_{k+1} \mid B_{k-J+1} \cdots B_N)\, P(B_{k+2} \cdots B_N \mid B_{k-J+1} \cdots B_{k+1}) = \sum_{B_{k+2}} \cdots \sum_{B_N} P(\tilde{X}_{k+1} \mid B_{k-L+3} \cdots B_N)\, P(B_{k+2} \cdots B_N \mid B_{k-K_b+2} \cdots B_{k+1}) = P(\tilde{X}_{k+1} \mid B_{k-J+2} \cdots B_{k+1}) \tag{A.19}$$

Thus, the backdating equation becomes

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k) = \sum_{B_{k+1}} P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1})\, P(\tilde{X}_{k+1} \mid B_{k-J+2} \cdots B_{k+1})\, P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k) \tag{A.20}$$

Section 4.2

It is desired to establish the identity

$$P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k, X_k) = P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) \tag{A.21}$$

The left-hand term of (A.21) can be rewritten as

$$\sum_{B_{k+1}} \cdots \sum_{B_N} P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_N, H_1 \cdots H_k, X_k)\, P(B_{k+1} \cdots B_N \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k, X_k). \tag{A.22}$$

The first term in the summand of (A.22) is rewritten as

$$\sum_{B_1} \cdots \sum_{B_{k-J}} P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_1 \cdots B_N, H_1 \cdots H_k, X_k)\, P(B_1 \cdots B_{k-J} \mid B_{k-J+1} \cdots B_N, H_1 \cdots H_k, X_k). \tag{A.23}$$

The first term of (A.23) is, in turn,

$$\begin{aligned} P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_1 \cdots B_N, H_1 \cdots H_k, X_k) &= \frac{P(X_{N+L-1} \mid B_1 \cdots B_N, H_1 \cdots H_{N+L-1})\, P(H_1 \cdots H_{N+L-1})}{P(X_k \mid B_1 \cdots B_N, H_1 \cdots H_k)\, P(H_1 \cdots H_k)} \\ &= \prod_{j=k+1}^{N+L-1} P(x_j \mid B_{j-L+1} \cdots B_j, H_j)\; P(H_{k+1} \cdots H_{N+L-1} \mid H_1 \cdots H_k) \\ &= P(\tilde{X}_k \mid B_{k-L+2} \cdots B_N, H_{k+1} \cdots H_{N+L-1})\, P(H_{k+1} \cdots H_{N+L-1} \mid H_1 \cdots H_k, B_{k-L+2} \cdots B_N) \\ &= P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-L+2} \cdots B_N, H_1 \cdots H_k) \end{aligned} \tag{A.24}$$

Putting (A.24) in (A.23) and observing that $J \ge L-2$, (A.23) becomes

$$P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-L+2} \cdots B_N, H_1 \cdots H_k) \tag{A.25}$$

As in (A.9), it is easy to establish that the second term in (A.22) is

$$P(B_{k+1} \cdots B_N \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k, X_k) = P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k) \tag{A.26}$$

Now, since (A.25) is the first term of (A.22), and using (A.26), (A.22) becomes

$$\sum_{B_{k+1}} \cdots \sum_{B_N} P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-L+2} \cdots B_N, H_1 \cdots H_k)\, P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k) = P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) \tag{A.27}$$

and (A.21) is established.

Now, we need an updating relation for $P(B_{k-J+1} \cdots B_k, H_1 \cdots H_k, X_k)$. Noting that

$$P(B_{k-J+1} \cdots B_k, H_1 \cdots H_k, X_k) = \sum_{B_{k-J}} P(x_k \mid B_{k-J} \cdots B_k, H_1 \cdots H_k, X_{k-1})\, P(B_k \mid B_{k-J} \cdots B_{k-1}, H_1 \cdots H_k, X_{k-1})\, P(H_k \mid B_{k-J} \cdots B_{k-1}, H_1 \cdots H_{k-1}, X_{k-1})\, P(B_{k-J} \cdots B_{k-1}, H_1 \cdots H_{k-1}, X_{k-1}), \tag{A.28}$$

each term in (A.28) can be handled separately. Obviously,

$$P(x_k \mid B_{k-J} \cdots B_k, H_1 \cdots H_k, X_{k-1}) = P(x_k \mid B_{k-L+1} \cdots B_k, H_k) \tag{A.29}$$

and the second term of (A.28) can be written

$$P(B_k \mid B_{k-J} \cdots B_{k-1}, H_1 \cdots H_k, X_{k-1}) = \sum_{B_1} \cdots \sum_{B_{k-J-1}} P(B_k \mid B_1 \cdots B_{k-1}, H_1 \cdots H_k, X_{k-1})\, P(B_1 \cdots B_{k-J-1} \mid B_{k-J} \cdots B_{k-1}, H_1 \cdots H_k, X_{k-1}) \tag{A.30}$$

The first term of (A.30) becomes

$$P(B_k \mid B_1 \cdots B_{k-1}, H_1 \cdots H_k, X_{k-1}) = \frac{P(B_1 \cdots B_k, H_1 \cdots H_k, X_{k-1})}{P(B_1 \cdots B_{k-1}, H_1 \cdots H_k, X_{k-1})} = \frac{P(H_1 \cdots H_k, X_{k-1} \mid B_1 \cdots B_k)\, P(B_1 \cdots B_k)}{P(H_1 \cdots H_k, X_{k-1} \mid B_1 \cdots B_{k-1})\, P(B_1 \cdots B_{k-1})} = P(B_k \mid B_{k-K_b} \cdots B_{k-1}) \tag{A.31}$$

Combining (A.30) with (A.31), it follows that

$$P(B_k \mid B_{k-J} \cdots B_{k-1}, H_1 \cdots H_k, X_{k-1}) = P(B_k \mid B_{k-K_b} \cdots B_{k-1}) \tag{A.32}$$

Now, the third term of (A.28) becomes

$$P(H_k \mid B_{k-J} \cdots B_{k-1}, H_1 \cdots H_{k-1}, X_{k-1}) = \frac{P(H_1 \cdots H_k, B_{k-J} \cdots B_{k-1}, X_{k-1})}{P(H_1 \cdots H_{k-1}, B_{k-J} \cdots B_{k-1}, X_{k-1})} = \frac{P(B_{k-J} \cdots B_{k-1}, X_{k-1} \mid H_1 \cdots H_k)\, P(H_1 \cdots H_k)}{P(B_{k-J} \cdots B_{k-1}, X_{k-1} \mid H_1 \cdots H_{k-1})\, P(H_1 \cdots H_{k-1})} = P(H_k \mid H_1 \cdots H_{k-1}) \tag{A.33}$$

Therefore, combining (A.28), (A.29), (A.32), and (A.33),

$$P(B_{k-J+1} \cdots B_k, H_1 \cdots H_k, X_k) = \sum_{B_{k-J}} P(x_k \mid B_{k-L+1} \cdots B_k, H_k)\, P(B_k \mid B_{k-K_b} \cdots B_{k-1})\, P(H_k \mid H_1 \cdots H_{k-1})\, P(B_{k-J} \cdots B_{k-1}, H_1 \cdots H_{k-1}, X_{k-1}) \tag{A.34}$$

Proceeding with the derivation of a backdating equation for $P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k)$, it can be written as

$$\sum_{B_{k+1}} P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_k)\, P(B_{k+1} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) \tag{A.35}$$

The first term in the summand of (A.35) can be manipulated to

$$P(x_{k+1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_{N+L-1}, \tilde{X}_{k+1})\, P(\tilde{X}_{k+1}, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_k) \tag{A.36}$$

Each term in (A.35) and (A.36) is handled separately. First,

$$P(x_{k+1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_{N+L-1}, \tilde{X}_{k+1}) = P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1}, H_{k+1}) \tag{A.37}$$

and, by (A.32),

$$P(B_{k+1} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) = P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k) \tag{A.38}$$

Finally,

$$P(\tilde{X}_{k+1}, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_k) = P(\tilde{X}_{k+1}, H_{k+2} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_{k+1})\, P(H_{k+1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_k) \tag{A.39}$$

and by (A.33)

$$P(H_{k+1} \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_k) = P(H_{k+1} \mid H_1 \cdots H_k) \tag{A.40}$$

The first term of (A.39) is rewritten as

$$\sum_{B_{k+2}} \cdots \sum_{B_N} P(\tilde{X}_{k+1}, H_{k+2} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_N, H_1 \cdots H_{k+1})\, P(B_{k+2} \cdots B_N \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_{k+1}) \tag{A.41}$$

in which the first term is

$$P(\tilde{X}_{k+1}, H_{k+2} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_N, H_1 \cdots H_{k+1}) = P(\tilde{X}_{k+1} \mid B_{k-L+3} \cdots B_N, H_1 \cdots H_{N+L-1})\, P(H_{k+2} \cdots H_{N+L-1} \mid H_1 \cdots H_{k+1}) = P(\tilde{X}_{k+1}, H_{k+2} \cdots H_{N+L-1} \mid B_{k-L+3} \cdots B_N, H_1 \cdots H_{k+1}) \tag{A.42}$$

Similarly, the second term of (A.41) is

$$P(B_{k+2} \cdots B_N \mid B_{k-J+1} \cdots B_{k+1}, H_1 \cdots H_{k+1}) = P(B_{k+2} \cdots B_N \mid B_{k-K_b+2} \cdots B_{k+1}) \tag{A.43}$$

and (A.41) becomes

$$P(\tilde{X}_{k+1}, H_{k+2} \cdots H_{N+L-1} \mid B_{k-J+2} \cdots B_{k+1}, H_1 \cdots H_{k+1}) \tag{A.44}$$

and (A.39) becomes, when (A.40) and (A.44) are substituted in,

$$P(\tilde{X}_{k+1}, H_{k+2} \cdots H_{N+L-1} \mid B_{k-J+2} \cdots B_{k+1}, H_1 \cdots H_{k+1})\, P(H_{k+1} \mid H_1 \cdots H_k) \tag{A.45}$$

Finally, the backdating relation is at hand,

$$P(\tilde{X}_k, H_{k+1} \cdots H_{N+L-1} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) = \sum_{B_{k+1}} P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1}, H_{k+1})\, P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k)\, P(H_{k+1} \mid H_1 \cdots H_k)\, P(\tilde{X}_{k+1}, H_{k+2} \cdots H_{N+L-1} \mid B_{k-J+2} \cdots B_{k+1}, H_1 \cdots H_{k+1}) \tag{A.46}$$

When the channel is $K_h$-Markov, we can establish the identity

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_{k-K_h+1} \cdots H_k, X_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_{k-K_h+1} \cdots H_k) \tag{A.47}$$

where

$$J = \max\{L-1, K_b\}. \tag{A.48}$$

To do so, start with (A.21) and integrate both sides over $(H_{k+1} \cdots H_{N+L-1})$ to yield

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k, X_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) \tag{A.49}$$

We will manipulate the right side of (A.49), which can be written as

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) = \sum_{B_1} \cdots \sum_{B_{k-J}} P(\tilde{X}_k \mid B_1 \cdots B_k, H_1 \cdots H_k)\, P(B_1 \cdots B_{k-J} \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) \tag{A.50}$$

In turn,

$$P(\tilde{X}_k \mid B_1 \cdots B_k, H_1 \cdots H_k) = \sum_{B_{k+1}} \cdots \sum_{B_N} \int \cdots \int P(\tilde{X}_k \mid B_1 \cdots B_N, H_1 \cdots H_{N+L-1})\, P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k)\, P(H_{k+1} \cdots H_{N+L-1} \mid H_1 \cdots H_k)\, dH_{k+1} \cdots dH_{N+L-1} \tag{A.51}$$

where

$$P(\tilde{X}_k \mid B_1 \cdots B_N, H_1 \cdots H_{N+L-1}) = \prod_{j=k+1}^{N+L-1} P(x_j \mid B_{j-L+1} \cdots B_j, H_j) = P(\tilde{X}_k \mid B_{k-L+2} \cdots B_N, H_{k+1} \cdots H_{N+L-1}) \tag{A.52}$$

Combining (A.52) with (A.51), that equation becomes

$$P(\tilde{X}_k \mid B_1 \cdots B_k, H_1 \cdots H_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_{k-K_h+1} \cdots H_k) \tag{A.53}$$

and substituting (A.53) in (A.50),

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_{k-K_h+1} \cdots H_k) \tag{A.54}$$

Since the right side of (A.49) is not a function of $(H_1 \cdots H_{k-K_h})$, the left side cannot be either, and so (A.47) is established.

A backdating algorithm can be derived simply from (A.46), which we integrate over $(H_{k+1} \cdots H_{N+L-1})$,

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_1 \cdots H_k) = \sum_{B_{k+1}} P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k) \int P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1}, H_{k+1})\, P(H_{k+1} \mid H_{k-K_h+1} \cdots H_k)\, P(\tilde{X}_{k+1} \mid B_{k-J+2} \cdots B_{k+1}, H_1 \cdots H_{k+1})\, dH_{k+1} \tag{A.55}$$

and substituting (A.54) in (A.55), it becomes

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H_{k-K_h+1} \cdots H_k) = \sum_{B_{k+1}} P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k) \int P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1}, H_{k+1})\, P(H_{k+1} \mid H_{k-K_h+1} \cdots H_k)\, P(\tilde{X}_{k+1} \mid B_{k-J+2} \cdots B_{k+1}, H_{k-K_h+2} \cdots H_{k+1})\, dH_{k+1}. \tag{A.56}$$

Now, we treat the simplest case, that of a constant channel. The standard identity to establish first is

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H, X_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H) \tag{A.57}$$

The left side of (A.57) is

$$\sum_{B_{k+1}} \cdots \sum_{B_N} P(\tilde{X}_k \mid B_{k-J+1} \cdots B_N, H, X_k)\, P(B_{k+1} \cdots B_N \mid B_{k-J+1} \cdots B_k, H, X_k) \tag{A.58}$$

and it follows immediately that the first term in the summand of (A.58) is

$$P(\tilde{X}_k \mid B_{k-L+2} \cdots B_N, H) \tag{A.59}$$

while the second term is

$$\sum_{B_1} \cdots \sum_{B_{k-J}} P(B_{k+1} \cdots B_N \mid B_1 \cdots B_k, H, X_k)\, P(B_1 \cdots B_{k-J} \mid B_{k-J+1} \cdots B_k, H, X_k) \tag{A.60}$$

The first term of (A.60) becomes

$$P(B_{k+1} \cdots B_N \mid B_1 \cdots B_k, H, X_k) = \frac{P(B_1 \cdots B_N, H, X_k)}{P(B_1 \cdots B_k, H, X_k)} = \frac{P(X_k \mid B_1 \cdots B_N, H)\, P(B_1 \cdots B_N)\, P(H)}{P(X_k \mid B_1 \cdots B_k, H)\, P(B_1 \cdots B_k)\, P(H)} = P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k) \tag{A.61}$$

from which (A.60) becomes (A.61), and (A.58) becomes

$$\sum_{B_{k+1}} \cdots \sum_{B_N} P(\tilde{X}_k \mid B_{k-L+2} \cdots B_N, H)\, P(B_{k+1} \cdots B_N \mid B_{k-K_b+1} \cdots B_k) = P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H) \tag{A.62}$$

which is (A.57). An updating relation can be derived from

$$P(B_{k-J+1} \cdots B_k, H, X_k) = \sum_{B_{k-J}} P(B_{k-J} \cdots B_k, H, X_k) = \sum_{B_{k-J}} P(x_k \mid B_{k-J} \cdots B_k, H, X_{k-1})\, P(B_k \mid B_{k-J} \cdots B_{k-1}, H, X_{k-1})\, P(B_{k-J} \cdots B_{k-1}, H, X_{k-1}) \tag{A.63}$$

the first term of which is

$$P(x_k \mid B_{k-L+1} \cdots B_k, H) \tag{A.64}$$

and the second term is

$$\sum_{B_1} \cdots \sum_{B_{k-J-1}} P(B_k \mid B_1 \cdots B_{k-1}, H, X_{k-1})\, P(B_1 \cdots B_{k-J-1} \mid B_{k-J} \cdots B_{k-1}, H, X_{k-1}) \tag{A.65}$$

The first term in (A.65) becomes

$$P(B_k \mid B_1 \cdots B_{k-1}, H, X_{k-1}) = \frac{P(B_1 \cdots B_k, H, X_{k-1})}{P(B_1 \cdots B_{k-1}, H, X_{k-1})} = \frac{P(X_{k-1} \mid B_1 \cdots B_k, H)\, P(H)\, P(B_1 \cdots B_k)}{P(X_{k-1} \mid B_1 \cdots B_{k-1}, H)\, P(H)\, P(B_1 \cdots B_{k-1})} = P(B_k \mid B_1 \cdots B_{k-1}) = P(B_k \mid B_{k-K_b} \cdots B_{k-1}) \tag{A.66}$$

Thus, (A.65) becomes the same as (A.66), and the updating relation (A.63) becomes

$$P(B_{k-J+1} \cdots B_k, H, X_k) = \sum_{B_{k-J}} P(x_k \mid B_{k-L+1} \cdots B_k, H)\, P(B_k \mid B_{k-K_b} \cdots B_{k-1})\, P(B_{k-J} \cdots B_{k-1}, H, X_{k-1}) \tag{A.67}$$

Proceeding with the derivation of a backdating equation, the right side of (A.57) can be written as

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H) = \sum_{B_{k+1}} P(x_{k+1} \mid B_{k-J+1} \cdots B_{k+1}, H, \tilde{X}_{k+1})\, P(\tilde{X}_{k+1} \mid B_{k-J+1} \cdots B_{k+1}, H)\, P(B_{k+1} \mid B_{k-J+1} \cdots B_k, H) \tag{A.68}$$

the first term of which immediately becomes

$$P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1}, H) \tag{A.69}$$

The second term of (A.68) is

$$\sum_{B_{k+2}} \cdots \sum_{B_N} P(\tilde{X}_{k+1} \mid B_{k-J+1} \cdots B_N, H)\, P(B_{k+2} \cdots B_N \mid B_{k-J+1} \cdots B_{k+1}, H) \tag{A.70}$$

but the second term of (A.70) is

$$P(B_{k+2} \cdots B_N \mid B_{k-K_b+2} \cdots B_{k+1}) \tag{A.71}$$

and the first term is

$$P(\tilde{X}_{k+1} \mid B_{k-L+3} \cdots B_N, H) \tag{A.72}$$

so that, from (A.71) and (A.72), (A.70) becomes

$$P(\tilde{X}_{k+1} \mid B_{k-J+2} \cdots B_{k+1}, H) \tag{A.73}$$

Finally, the third term of (A.68) is obviously

$$P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k) \tag{A.74}$$

and from (A.69), (A.73), and (A.74), (A.68) becomes

$$P(\tilde{X}_k \mid B_{k-J+1} \cdots B_k, H) = \sum_{B_{k+1}} P(x_{k+1} \mid B_{k-L+2} \cdots B_{k+1}, H)\, P(B_{k+1} \mid B_{k-K_b+1} \cdots B_k)\, P(\tilde{X}_{k+1} \mid B_{k-J+2} \cdots B_{k+1}, H) \tag{A.75}$$
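The constant-channel updating relation (A.67) is easy to verify numerically. The following sketch is an illustration added here, not part of the original derivation; it assumes binary independent equally likely digits (so $K_b = 0$, $P(B_k \mid \cdot) = 1/2$, and $J = L - 1 = 1$ for L = 2) and treats $B_0$ as 0. It checks that the recursion reproduces the directly computed joint probability $P(B_N, X_N)$, up to the Gaussian normalization constant omitted consistently from both sides.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
h = np.array([1.0, 0.5])          # h(0), h(1): L = 2, so J = L - 1 = 1
sigma2, N = 0.5, 6
b_true = rng.choice([-1, 1], size=N)
x = np.array([h[0] * b_true[k] + h[1] * (b_true[k - 1] if k else 0) for k in range(N)])
x += np.sqrt(sigma2) * rng.normal(size=N)

def lik(xk, bk, bk1):
    """Unnormalized Gaussian likelihood P(x_k | B_k, B_{k-1}, H)."""
    return np.exp(-(xk - h[0] * bk - h[1] * bk1) ** 2 / (2 * sigma2))

# Recursion (A.67): the state is B_k alone, with prior P(B_k) = 1/2.
P = {b: 0.5 * lik(x[0], b, 0) for b in (-1, 1)}
for k in range(1, N):
    P = {b: 0.5 * sum(lik(x[k], b, bp) * P[bp] for bp in (-1, 1)) for b in (-1, 1)}

# Direct computation of P(B_N, X_N) by summing over all 2^N digit sequences.
direct = {-1: 0.0, 1: 0.0}
for seq in product([-1, 1], repeat=N):
    p = 0.5 ** N
    for k in range(N):
        p *= lik(x[k], seq[k], seq[k - 1] if k else 0)
    direct[seq[-1]] += p

print(P, direct)   # the two dictionaries should agree to rounding
```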

APPENDIX B
AN ARGUMENT FOR THE CONVERGENCE OF THE BAYES A POSTERIORI MEAN

In this appendix we will give a somewhat lengthy but nevertheless heuristic argument that the Bayes a posteriori mean of (5.1.20) converges in mean square as $k \to \infty$ to the actual value of H under the following conditions:

1. The sequence of data digits is a wide-sense stationary random process, independent of the channel noise,
2. The C given by (5.1.22) is positive definite, and
3. The $B_k$ are known (without error) to the receiver.

The first result is the following Lemma:

Lemma B1. C is positive definite if and only if L linearly independent values of $B_k$ have non-zero probability.

Proof: First, observe that C must be non-negative definite, since for an arbitrary column vector w,

$$w^T C\, w = E(w^T B_k B_k^T w) = E(w^T B_k)^2 \ge 0. \tag{B.1}$$

C is not positive definite if and only if equality holds in (B.1) for some $w \ne 0$. But this condition becomes

$$0 = E(w^T B_k)^2 = \sum_{B_k} (w^T B_k)^2\, P(B_k) \tag{B.2}$$

and for (B.2) to hold, w must be orthogonal to every $B_k$ for which $P(B_k) \ne 0$. Since $w \ne 0$ can be orthogonal to at most $L-1$ linearly independent vectors, at most $L-1$ linearly independent values of $B_k$ can have non-zero probability for C not to be positive definite. Q.E.D.

Let us examine (5.1.20) in more detail. Pre-multiplying both sides of (5.1.20) by $A_{k+1}^{-1}$,

$$A_{k+1}^{-1}\, m_{k+1} = A_k^{-1}\, m_k + \frac{1}{\sigma^2}\, B_{k+1}\, x_{k+1} \tag{B.3}$$

The solution of (B.3) is

$$m_k = A_k A_0^{-1} m_0 + \frac{1}{\sigma^2} A_k \sum_{i=1}^{k} B_i\, x_i \tag{B.4}$$

To establish (B.4) by induction, note that it is true for k = 1, and assume it is true for arbitrary k, from which it follows that

$$m_{k+1} = A_{k+1}\left(A_k^{-1} m_k + \frac{1}{\sigma^2} B_{k+1} x_{k+1}\right) = A_{k+1} A_0^{-1} m_0 + \frac{1}{\sigma^2} A_{k+1} \sum_{i=1}^{k+1} B_i\, x_i \tag{B.5}$$

from (B.3) and (B.4). Now, substitute for $x_i$,

$$m_k = A_k A_0^{-1} m_0 + \frac{1}{\sigma^2} A_k \sum_{i=1}^{k} B_i \left(B_i^T H + n_i\right) \tag{B.6}$$

From (5.1.21),

$$A_k^{-1} = A_0^{-1} + \frac{1}{\sigma^2} \sum_{i=1}^{k} B_i B_i^T \tag{B.7}$$

and substituting for $\frac{1}{\sigma^2}\sum_{i=1}^{k} B_i B_i^T$ from (B.7) in (B.6),

$$e_k = A_k A_0^{-1} e_0 + \frac{1}{\sigma^2} A_k \sum_{i=1}^{k} B_i\, n_i \tag{B.8}$$

where

$$e_k = m_k - H \tag{B.9}$$

is the error in the estimate $m_k$ and $e_0$ is the initial error. Taking the expected value of (B.8) and using the fact that the $n_i$ are independent of $A_k$ and $B_i$ with $E(n_i) = 0$,

$$E(e_k) = E(A_k)\, A_0^{-1} e_0 \tag{B.10}$$

First, we would like to show from (B.8) that (B.6) is an asymptotically unbiased estimate of H; that is,

$$\lim_{k \to \infty} E(e_k) = 0 \tag{B.11}$$

This is true if and only if

$$\lim_{k\to\infty} \|E(e_k)\| = 0 \tag{B.12}$$

and taking norms of both sides of (B.10),

$$\|E(e_k)\| = \|E(A_k)\, A_0^{-1} e_0\| \le \|A_0^{-1} e_0\|\, \|E(A_k)\| \tag{B.13}$$

where $\|E(A_k)\|$ is the matrix spectral norm as defined by Varga [31]. Similarly, let us calculate the mean-square error. From (B.8),

$$\|e_k\|^2 = e_0^T A_0^{-1} A_k^2 A_0^{-1} e_0 + \frac{2}{\sigma^2}\, e_0^T A_0^{-1} A_k^2 \sum_{i=1}^{k} B_i n_i + \frac{1}{\sigma^4} \sum_{i=1}^{k}\sum_{j=1}^{k} n_i n_j\, B_i^T A_k^2 B_j \tag{B.14}$$

where the symmetry of $A_k$ and $A_0$ has been used. By the Schwarz inequality,

$$(A_0^{-1}e_0)^T A_k^2 (A_0^{-1}e_0) \le \|A_0^{-1}e_0\|\, \|A_k^2 A_0^{-1}e_0\| \tag{B.15}$$

but

$$\|A_k^2 A_0^{-1}e_0\| \le \|A_k\|^2\, \|A_0^{-1}e_0\|. \tag{B.16}$$

Similarly,

$$B_i^T A_k^2 B_j \le \|B_i\|\, \|A_k^2 B_j\| \tag{B.17}$$

and

$$\|A_k^2 B_j\| \le \|A_k\|^2\, \|B_j\|. \tag{B.18}$$

Combining (B.14) through (B.18) and taking the expected value,

$$E\|e_k\|^2 \le \|A_0^{-1}e_0\|^2\, E\|A_k\|^2 + \frac{1}{\sigma^4} \sum_{i=1}^{k}\sum_{j=1}^{k} E(n_i n_j)\, E\!\left(\|B_i\|\,\|B_j\|\,\|A_k\|^2\right) \tag{B.19}$$

which can be simplified by noting that

$$E(n_i n_j) = \sigma^2\, \delta_{ij} \tag{B.20}$$

and, since $B_i$ takes on a finite number $M^L$ of values, there exists a real number B such that

$$\|B_i\|^2 \le B < \infty \tag{B.21}$$

Thus (B.19) becomes

$$E\|e_k\|^2 \le \|A_0^{-1}e_0\|^2\, E\|A_k\|^2 + \frac{kB}{\sigma^2}\, E\|A_k\|^2 \tag{B.22}$$

From (B.13) and (B.22) we see that showing convergence in mean square of (5.1.19) reduces to showing that

$$\lim_{k\to\infty} \|E(A_k)\| = 0 \tag{B.23}$$

$$\lim_{k\to\infty} E\|A_k\|^2 = 0 \tag{B.24}$$

and

$$\lim_{k\to\infty} k\, E\|A_k\|^2 = 0. \tag{B.25}$$

Establishing (B.23) through (B.25) rigorously is very difficult, if not impossible, because $A_k$ is available only as the inverse of $A_k^{-1}$, given by (B.7). However, we can piece together a heuristic argument that (B.23) through (B.25) hold when C is positive definite. This argument is not only convincing, but also very instructive.

Consider $A_k^{-1}$ as given by (B.7). Since $A_0^{-1}$ is positive definite and each $B_i B_i^T$ is a dyad and hence non-negative definite, $A_k^{-1}$ is positive definite for $k \ge 1$, so all its eigenvalues are positive real numbers. Denote its eigenvalues by $\{\lambda_j^{(k)}\}_{j=1}^{L}$, where the following ordering is presumed to exist,

$$0 < \lambda_1^{(k)} \le \lambda_2^{(k)} \le \cdots \le \lambda_L^{(k)} \tag{B.26}$$

Each $\lambda_j^{(k)}$ is a random variable which is a function of $(B_1 \cdots B_k)$. Since $A_k$ is positive definite and symmetric [31],

$$\|A_k\| = \frac{1}{\lambda_1^{(k)}} \tag{B.27}$$

Thus, if we could show that, in some sense, $\lambda_1^{(k)}$ increases as fast as k on the average, then we would expect that (B.23) through (B.25) would be satisfied.

The sum of the eigenvalues of a symmetric matrix is equal to its trace,¹ so from (B.7)

$$\sum_{j=1}^{L} \lambda_j^{(k)} = \mathrm{Tr}(A_0^{-1}) + \frac{1}{\sigma^2} \sum_{i=1}^{k}\sum_{j=0}^{L-1} B_{i-j}^2$$

and

$$E\left(\sum_{j=1}^{L} \lambda_j^{(k)}\right) = \mathrm{Tr}(A_0^{-1}) + \frac{kL}{\sigma^2}\, E(B_k^2). \tag{B.28}$$

Equation B.28 indicates that $\lambda_1^{(k)}$ certainly could increase as fast as k. The first result we show is the following Lemma:

Lemma B2: $\lambda_1^{(k)} \ge \lambda_1^{(k-1)}$, with equality if and only if $B_k$ is orthogonal to an eigenvector corresponding to $\lambda_1^{(k-1)}$.

Before proving Lemma B2, we need an additional Lemma:

Lemma B3:

$$\lambda_1^{(k)} = \inf_{x \ne 0} \frac{x^T A_k^{-1} x}{x^T x} \tag{B.29}$$

¹This can be shown by observing that it is true for the diagonalized version and then noting that the trace is invariant under an orthogonal similarity transformation. This is also true for an arbitrary square matrix [35].

Proof of Lemma B3: Let $\{x_i\}_{i=1}^{L}$ be an orthonormal set of eigenvectors corresponding to $\{\lambda_i^{(k)}\}$, so that

$$A_k^{-1} x_i = \lambda_i^{(k)} x_i, \qquad x_i^T x_j = \delta_{ij}, \qquad 1 \le i, j \le L$$

For any $x \ne 0$ there exist scalars $\{c_i\}_{i=1}^{L}$ such that $x = \sum_{i=1}^{L} c_i x_i$, so that

$$\frac{x^T A_k^{-1} x}{x^T x} = \frac{\sum_{i=1}^{L} \lambda_i^{(k)} c_i^2}{\sum_{i=1}^{L} c_i^2} \ge \lambda_1^{(k)} \tag{B.30}$$

and equality is satisfied in (B.30) when $x = x_1$. Q.E.D.

Proof of Lemma B2: From (B.29),

$$\lambda_1^{(k)} = \inf_{x\ne 0} \frac{x^T A_{k-1}^{-1} x + \frac{1}{\sigma^2}(x^T B_k)^2}{x^T x} \ge \inf_{x\ne 0} \frac{x^T A_{k-1}^{-1} x}{x^T x} = \lambda_1^{(k-1)} \tag{B.31}$$

Equality holds in (B.31) if and only if some $x \ne 0$ simultaneously achieves the infimum of the first term and satisfies $x^T B_k = 0$; that is, if and only if $B_k$ is orthogonal to an eigenvector corresponding to $\lambda_1^{(k-1)}$. Q.E.D.

Lemma B2 gives us another verification that C must be positive definite in order for (5.1.20) to converge. For, if C is not positive definite, by Lemma B1 fewer than L linearly independent values of $B_k$ occur in the sequence. In this case there is a subspace of vectors orthogonal to every $B_k$ with non-zero probability. By Lemma B2, if the eigenvector of $A_k^{-1}$ corresponding to $\lambda_1^{(k)}$ should ever fall in that subspace, then with probability one $\lambda_1^{(\ell)} = \lambda_1^{(k)}$ for $\ell > k$, and there would be no possibility of convergence. On the other hand, if C is positive definite, no matter what the eigenvector of $A_k^{-1}$ is, a $B_k$ will eventually arrive which increases the smallest eigenvalue.

Asymptotically as $k \to \infty$, $\frac{1}{k} A_k^{-1} \to \frac{1}{\sigma^2} C$ in probability. Therefore, if C is positive definite, its smallest eigenvalue $\lambda_1$ is positive and

$$\lambda_1^{(k)} \approx \frac{k\,\lambda_1}{\sigma^2} \tag{B.32}$$

in some sense. Thus, $\lambda_1^{(k)}$ could be expected to increase as fast as k on the average. The rate of convergence of (5.1.19) is, therefore, monotonically related to the smallest eigenvalue of C.
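The behavior asserted by Lemma B2 and the limit (B.32) can be observed directly. The sketch below is an illustration added here; it assumes independent ±1 digits (so that C = I and $\lambda_1 = 1$) and $A_0^{-1} = I$, accumulates $A_k^{-1}$ from (B.7), and prints the growth of its smallest eigenvalue, which should be nearly linear in k with slope $\lambda_1/\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
L, sigma2, k_max = 3, 1.0, 2000
Ak_inv = np.eye(L)                  # a priori information matrix A_0^{-1}
digits = rng.choice([-1.0, 1.0], size=k_max + L)
growth = []
for k in range(1, k_max + 1):
    B = digits[k:k + L][::-1]       # shift-register vector (B_k, ..., B_{k-L+1})
    Ak_inv += np.outer(B, B) / sigma2
    if k % 500 == 0:
        lam1 = np.linalg.eigvalsh(Ak_inv)[0]   # smallest eigenvalue lambda_1^(k)
        growth.append((k, lam1, lam1 / k))
print(growth)   # lam1 / k should settle near lambda_1 / sigma^2 = 1.0 here
```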

APPENDIX C
CONVERGENCE OF THE ROBBINS-MONRO ALGORITHM

In this appendix we prove the following theorem:

Theorem C1: If $H_k$, $k \ge 1$ is given by (5.3.21), and if

1. The data sequence is wide-sense stationary and statistically independent of $n_i$, so that $C = E(B_k B_k^T)$ is independent of k,

2. C is positive definite,

3. The $n_k$ are a sequence of independent zero-mean random variables such that

$$E(n_k^2) \le \sigma_0^2 < \infty \qquad \text{for } k \ge 1, \tag{C.1}$$

4. The $\alpha_k$ are a sequence of positive real numbers such that

$$\alpha_{k+1} \le \alpha_k, \tag{C.2}$$

$$\sum_{k=1}^{\infty} \alpha_k = \infty \tag{C.3}$$

and

$$\sum_{k=1}^{\infty} \alpha_k^2 < \infty, \tag{C.4}$$

5. The $B_k$, $k \ge 1$ are known (without error), and

6. There exists an integer $\Delta > 0$ such that $B_j$ and $B_k$ are independent for $|j-k| > \Delta$,

then

$$\lim_{k\to\infty} E(H_k - H) = 0 \tag{C.5}$$

and

$$\lim_{k\to\infty} E\|H_k - H\|^2 = 0. \tag{C.6}$$

Proof: Letting

$$e_k = H_k - H \tag{C.7}$$

and substituting in (5.3.22),

$$e_{k+1} = (I - \alpha_{k+1} C)\, e_k + \alpha_{k+1}\, u_{k+1} \tag{C.8}$$

where

$$u_{k+1} = (B_{k+1} B_{k+1}^T - C)\, H + B_{k+1}\, n_{k+1}. \tag{C.9}$$

Since $E(u_k) = 0$, it follows from (C.8) that

$$E(e_{k+1}) = (I - \alpha_{k+1} C)\, E(e_k), \tag{C.10}$$

and evidently

$$E(e_k) = \prod_{j=1}^{k} (I - \alpha_j C)\, e_0, \tag{C.11}$$

so that

$$\|E(e_k)\| \le \prod_{j=1}^{k} \|I - \alpha_j C\|\, \|e_0\|. \tag{C.12}$$

To proceed with the calculation of the mean-square error, define

$$W_{k,j} = (I - \alpha_k C)(I - \alpha_{k-1} C) \cdots (I - \alpha_j C) \tag{C.13}$$

and the solution of (C.8) becomes

$$e_k = W_{k,1}\, e_0 + \sum_{i=1}^{k} \alpha_i\, W_{k,i+1}\, u_i \tag{C.14}$$

To verify (C.14), observe that it holds for k = 1, and if it holds for arbitrary k, then

$$e_{k+1} = (I - \alpha_{k+1} C)\left(W_{k,1}\, e_0 + \sum_{i=1}^{k} \alpha_i W_{k,i+1} u_i\right) + \alpha_{k+1} u_{k+1} = W_{k+1,1}\, e_0 + \sum_{i=1}^{k+1} \alpha_i\, W_{k+1,i+1}\, u_i$$

Then, from (C.14),

$$\|e_k\|^2 = e_0^T W_{k,1}^T W_{k,1}\, e_0 + 2\, e_0^T W_{k,1}^T \sum_{i=1}^{k} \alpha_i W_{k,i+1} u_i + \sum_{i=1}^{k}\sum_{j=1}^{k} \alpha_i \alpha_j\, u_i^T W_{k,i+1}^T W_{k,j+1}\, u_j \tag{C.15}$$

The first term of (C.15) is deterministic and bounded by

$$e_0^T W_{k,1}^T W_{k,1}\, e_0 \le \|W_{k,1}\|^2\, \|e_0\|^2 \tag{C.16}$$

The second term offers no difficulty, since $E(u_i) = 0$. The third term is more formidable. Substituting for $u_i$ from (C.9),

$$\begin{aligned} u_i^T W_{k,i+1}^T W_{k,j+1} u_j ={}& H^T B_i B_i^T W_{k,i+1}^T W_{k,j+1} B_j B_j^T H + H^T C W_{k,i+1}^T W_{k,j+1} C H \\ & - H^T C W_{k,i+1}^T W_{k,j+1} B_j B_j^T H - H^T B_i B_i^T W_{k,i+1}^T W_{k,j+1} C H \\ & + n_j\, H^T (B_i B_i^T - C)\, W_{k,i+1}^T W_{k,j+1} B_j + n_i\, B_i^T W_{k,i+1}^T W_{k,j+1} (B_j B_j^T - C) H \\ & + n_i n_j\, B_i^T W_{k,i+1}^T W_{k,j+1} B_j \end{aligned} \tag{C.17}$$

and taking the expected value,

$$E(u_i^T W_{k,i+1}^T W_{k,j+1} u_j) \le E(H^T B_i B_i^T W_{k,i+1}^T W_{k,j+1} B_j B_j^T H) - H^T C W_{k,i+1}^T W_{k,j+1} C H + \sigma_0^2\, \delta_{ij}\, E(B_i^T W_{k,i+1}^T W_{k,j+1} B_j) \tag{C.18}$$

by (C.1). By assumption, when $|i-j| > \Delta$,

$$E(H^T B_i B_i^T W_{k,i+1}^T W_{k,j+1} B_j B_j^T H) = H^T E(B_i B_i^T)\, W_{k,i+1}^T W_{k,j+1}\, E(B_j B_j^T)\, H = H^T C W_{k,i+1}^T W_{k,j+1} C H \tag{C.19}$$

and under this condition, since $\Delta > 0$, $\delta_{ij} = 0$, and

$$E(u_i^T W_{k,i+1}^T W_{k,j+1} u_j) = 0. \tag{C.20}$$

Consider the first term of (C.18) when $|i-j| \le \Delta$. The center portion is a scalar-valued bilinear form, so it can be factored out,

$$H^T B_i B_i^T W_{k,i+1}^T W_{k,j+1} B_j B_j^T H = (H^T B_i B_j^T H)\left(B_i^T W_{k,i+1}^T W_{k,j+1} B_j\right) \tag{C.21}$$

To proceed, we require a Lemma:

Lemma C1: For an L×L matrix W and two vectors x and y, the following inequality holds:

$$|x^T W y| \le L\, \|W\|\, \|y x^T\|. \tag{C.22}$$

Proof of Lemma C1: Observe that $x^T W y = \mathrm{Tr}(W y x^T)$. Since the trace of a matrix is equal to the sum of its eigenvalues, let the eigenvalues of $W y x^T$ be denoted by $\{\gamma_i\}_{i=1}^{L}$, so that

$$|x^T W y| = |\mathrm{Tr}(W y x^T)| = \left|\sum_{i=1}^{L} \gamma_i\right| \le \sum_{i=1}^{L} |\gamma_i| \le L \max_{1 \le i \le L} |\gamma_i|$$

But Varga [31] has shown that¹

$$\max_{1\le i\le L} |\gamma_i| \le \|W y x^T\|$$

so that

$$|x^T W y| \le L\, \|W y x^T\| \le L\, \|W\|\, \|y x^T\|. \qquad \text{Q.E.D.}$$

Using Lemma C1,

$$|B_i^T W_{k,i+1}^T W_{k,j+1} B_j| \le L\, \|W_{k,i+1}^T W_{k,j+1}\|\, \|B_j B_i^T\| \le L\gamma\, \|W_{k,i+1}^T W_{k,j+1}\| \tag{C.23}$$

where

$$\gamma = \max_{B_i, B_j} \|B_i B_j^T\| < \infty \tag{C.24}$$

Also,

$$H^T B_i B_j^T H \le \|H\|^2\, \|B_i B_j^T\| \le \gamma\, \|H\|^2 \tag{C.25}$$

Finally, combining (C.22) through (C.25),

¹Varga calls $\max_i |\gamma_i|$ the "spectral radius" of a matrix. For a symmetric matrix the spectral radius and norm are equal.

$$E(H^T B_i B_i^T W_{k,i+1}^T W_{k,j+1} B_j B_j^T H) \le E\left|(H^T B_i B_j^T H)(B_i^T W_{k,i+1}^T W_{k,j+1} B_j)\right| \le L\gamma^2\, \|H\|^2\, \|W_{k,i+1}^T W_{k,j+1}\|. \tag{C.26}$$

The other terms of (C.18) can be bounded, for $|i-j| \le \Delta$,

$$-H^T C W_{k,i+1}^T W_{k,j+1} C H \le |H^T C W_{k,i+1}^T W_{k,j+1} C H| \le \|H\|^2\, \|C\|^2\, \|W_{k,i+1}^T W_{k,j+1}\| \tag{C.27}$$

and, by Lemma C1,

$$E(B_i^T W_{k,i+1}^T W_{k,i+1} B_i) \le E\left(L\, \|W_{k,i+1}^T W_{k,i+1}\|\, \|B_i B_i^T\|\right) \le L\gamma\, \|W_{k,i+1}\|^2 \tag{C.28}$$

Combining all these bounds with (C.15), we get

$$E\|e_k\|^2 \le \|W_{k,1}\|^2\, \|e_0\|^2 + L\gamma\sigma_0^2 \sum_{i=1}^{k} \alpha_i^2\, \|W_{k,i+1}\|^2 + (L\gamma^2 + \|C\|^2)\, \|H\|^2 \sum_{i=1}^{k} \sum_{\substack{j=i-\Delta \\ 1\le j\le k}}^{i+\Delta} \alpha_i \alpha_j\, \|W_{k,i+1}^T W_{k,j+1}\| \tag{C.29}$$

Demonstrating (C.5) and (C.6) reduces, according to (C.12) and (C.29), to showing that

$$\lim_{k\to\infty} \|W_{k,1}\| = 0, \tag{C.30}$$

$$\lim_{k\to\infty} \sum_{i=1}^{k} \alpha_i^2\, \|W_{k,i+1}\|^2 = 0 \tag{C.31}$$

and

$$\lim_{k\to\infty} \sum_{i=1}^{k} \sum_{\substack{j=i-\Delta \\ 1\le j\le k}}^{i+\Delta} \alpha_i \alpha_j\, \|W_{k,i+1}^T W_{k,j+1}\| = 0 \tag{C.32}$$

Proceeding in that direction, from (C.13),

$$\|W_{k,i}\| \le \prod_{j=i}^{k} \|I - \alpha_j C\| = \prod_{j=i}^{k}\; \max_{1\le m\le L} |1 - \alpha_j \lambda_m| \tag{C.33}$$

where $0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_L$ are the eigenvalues of C. By (C.4), evidently

$$\lim_{k\to\infty} \alpha_k = 0 \tag{C.34}$$

so that there exists an integer $N > 0$ such that, for $j \ge N$,

$$\alpha_j < \frac{2}{\lambda_1 + \lambda_L}$$

which implies that

$$\max_{1\le m\le L} |1 - \alpha_j \lambda_m| = 1 - \alpha_j \lambda_1, \qquad j \ge N$$

and

$$\|W_{k,i+1}\| \le \beta_{k,i+1} \equiv \begin{cases} \beta \prod_{j=N}^{k} (1 - \alpha_j \lambda_1), & i+1 \le N-1 \\ \prod_{j=i+1}^{k} (1 - \alpha_j \lambda_1), & i+1 \ge N \end{cases} \tag{C.35}$$

where

$$\beta = \max_{1\le i\le N-1} \|W_{N-1,i}\|. \tag{C.36}$$

Now, (C.30) reduces to showing that

$$\lim_{k\to\infty} \beta_{k,1} = 0. \tag{C.37}$$

Since

$$\ln \beta_{k,1} \le \ln\beta + \sum_{j=N}^{k} \ln(1 - \alpha_j \lambda_1) \le \ln\beta - \lambda_1 \sum_{j=N}^{k} \alpha_j$$

it follows that

$$\beta_{k,1} \le \beta\, e^{-\lambda_1 \sum_{j=N}^{k} \alpha_j} \to 0$$

as $k \to \infty$ by (C.3). In analogous fashion it can be shown that

$$\lim_{k\to\infty} \beta_{k,i} = 0 \qquad \text{for all } i \ge 1 \tag{C.38}$$

Now, let

$$\chi_{[1,k]}(i) = \begin{cases} 1, & i \in [1,k] \\ 0, & i \notin [1,k] \end{cases} \tag{C.39}$$

be the characteristic function of $[1,k]$, and rewrite (C.31) as

$$\lim_{k\to\infty} \sum_{i=1}^{k} \alpha_i^2\, \beta_{k,i+1}^2 = \lim_{k\to\infty} \sum_{i=1}^{\infty} \alpha_i^2\, \beta_{k,i+1}^2\, \chi_{[1,k]}(i) = \sum_{i=1}^{\infty} \alpha_i^2 \lim_{k\to\infty} \beta_{k,i+1}^2\, \chi_{[1,k]}(i) = 0 \tag{C.40}$$

which establishes (C.31). The interchange of summation and limit in (C.40) is justified by the Lebesgue dominated convergence theorem [4], since

$$\beta_{k,i+1}^2 \le \beta_0^2 < \infty \tag{C.41}$$

is uniformly bounded for some $\beta_0$, and therefore $\alpha_i^2 \beta_{k,i+1}^2 \chi_{[1,k]}(i)$ is dominated by $\beta_0^2 \alpha_i^2$, which by (C.4) is integrable.

In similar fashion (C.32) can be established. Noting that

$$\|W_{k,i+1}^T W_{k,j+1}\| \le \|W_{k,i+1}\|\, \|W_{k,j+1}\| \le \beta_{k,i+1}\, \beta_{k,j+1} \tag{C.42}$$

and rewriting (C.32),

$$\lim_{k\to\infty} \sum_{i=1}^{k} \sum_{\substack{j=i-\Delta \\ 1\le j\le k}}^{i+\Delta} \alpha_i \alpha_j\, \beta_{k,i+1}\beta_{k,j+1} = \lim_{k\to\infty}\left[\sum_{i=\Delta+1}^{k-\Delta} \sum_{\ell=-\Delta}^{\Delta} \alpha_i \alpha_{i+\ell}\, \beta_{k,i+1}\beta_{k,i+\ell+1} + \sum_{i=1}^{\Delta} \sum_{\ell=1-i}^{\Delta} \alpha_i \alpha_{i+\ell}\, \beta_{k,i+1}\beta_{k,i+\ell+1} + \sum_{i=k-\Delta+1}^{k} \sum_{j=i-\Delta}^{k} \alpha_i \alpha_j\, \beta_{k,i+1}\beta_{k,j+1}\right] \tag{C.43}$$

The second term of (C.43) is easily handled, since the summation is finite,

$$\lim_{k\to\infty} \sum_{i=1}^{\Delta} \sum_{\ell=1-i}^{\Delta} \alpha_i \alpha_{i+\ell}\, \beta_{k,i+1}\beta_{k,i+\ell+1} = 0 \tag{C.44}$$

by (C.38). The third term can be rewritten as

$$\lim_{k\to\infty} \sum_{i=k-\Delta+1}^{k} \sum_{j=i-\Delta}^{k} \alpha_i \alpha_j\, \beta_{k,i+1}\beta_{k,j+1} = \lim_{k\to\infty} \sum_{m=-\Delta}^{0} \sum_{n=m-\Delta}^{0} \alpha_{m+k} \alpha_{n+k}\, \beta_{k,m+k+1}\beta_{k,n+k+1} = \sum_{m=-\Delta}^{0} \sum_{n=m-\Delta}^{0} \lim_{k\to\infty} \left(\alpha_{m+k} \alpha_{n+k}\, \beta_{k,m+k+1}\beta_{k,n+k+1}\right) = 0 \tag{C.45}$$

by (C.41) and (C.34).

Finally, addressing ourselves to the first term of (C.43),

$$\lim_{k\to\infty} \sum_{i=\Delta+1}^{k-\Delta} \sum_{\ell=-\Delta}^{\Delta} \alpha_i \alpha_{i+\ell}\, \beta_{k,i+1}\beta_{k,i+\ell+1} = \sum_{\ell=-\Delta}^{\Delta} \lim_{k\to\infty} \sum_{i=\Delta+1}^{\infty} \alpha_i \alpha_{i+\ell}\, \beta_{k,i+1}\beta_{k,i+\ell+1}\, \chi_{[\Delta+1,\,k-\Delta]}(i) = \sum_{\ell=-\Delta}^{\Delta} \sum_{i=\Delta+1}^{\infty} \alpha_i \alpha_{i+\ell} \lim_{k\to\infty} \beta_{k,i+1}\beta_{k,i+\ell+1}\, \chi_{[\Delta+1,\,k-\Delta]}(i) = 0. \tag{C.46}$$

The interchange of limit and summation again follows from the Lebesgue dominated convergence theorem and the fact that

$$\sum_{i=1}^{\infty} \alpha_i \alpha_{i+\ell} < \infty, \tag{C.47}$$

which follows from (C.2) and (C.4) since, by the comparison test for series, for $\ell \ge 0$

$$\sum_{i=1}^{\infty} \alpha_i \alpha_{i+\ell} \le \sum_{i=1}^{\infty} \alpha_i^2 < \infty \tag{C.48}$$

and for $\ell < 0$,

$$\sum_{i=1}^{\infty} \alpha_i \alpha_{i+\ell} \le \sum_{i=1}^{\infty} \alpha_{i+\ell}^2 < \infty. \tag{C.49}$$

This completes the proof of Theorem C1.
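Conditions (C.2) through (C.4) are satisfied, for example, by the harmonic sequence $\alpha_k = 1/k$. The following sketch is an illustration added here, not part of the proof; the recursion is written in the form consistent with (C.8) and (C.9), and independent ±1 digits with known C = I are assumed. The printed error continues to shrink, in contrast to the fixed step-size algorithm of Appendix D, which settles at a noise floor.

```python
import numpy as np

rng = np.random.default_rng(3)
L, sigma = 2, 0.5
H = np.array([1.0, 0.5])
C = np.eye(L)                        # independent +/-1 digits
H_est = np.zeros(L)
digits = rng.choice([-1.0, 1.0], size=5000 + L)
for k in range(1, 5001):
    B = digits[k:k + L][::-1]        # (B_k, ..., B_{k-L+1})
    x = B @ H + sigma * rng.normal()
    alpha = 1.0 / k                  # satisfies (C.2), (C.3), (C.4)
    H_est = H_est + alpha * (B * x - C @ H_est)
    if k in (10, 100, 1000, 5000):
        print(k, np.linalg.norm(H_est - H))   # error keeps decreasing
```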

APPENDIX D
MEAN-SQUARE ERROR BOUNDS FOR THE FIXED STEP-SIZE ALGORITHM

The task of this appendix is to establish bounds on the neighborhood of convergence of the fixed step-size stochastic approximation algorithm of (5.4.1). This analysis will be done for three cases:

1. The data sequence is wide-sense stationary.

2. In addition to 1., $B_k$ and $B_j$ are statistically independent for $|k-j| > \Delta$, for some integer $\Delta$.

3. In addition to 1., the data digits $B_k$ and $B_j$ are statistically independent for $k \ne j$ (though the vectors $B_k$ and $B_j$ may not be independent, because they contain common digits).

The actual value of $E(B_k B_k^T)$ will be C, while the algorithm of (5.4.1) will use $\bar{C}$, which may be different. Both C and $\bar{C}$ are assumed to be positive definite. As explained in Section 5.4, H will be assumed to be fixed for the purposes of this appendix.

The recursion relationship for the estimates is given by (C.8) with $\alpha_k = \alpha$ and C replaced by $\bar{C}$; that is,

$$e_{k+1} = (I - \alpha\bar{C})\, e_k + \alpha\, u_{k+1} \tag{D.1}$$

and the solution is given by

$$e_k = (I - \alpha\bar{C})^k\, e_0 + \alpha \sum_{i=1}^{k} (I - \alpha\bar{C})^{k-i}\, u_i \tag{D.2}$$

Since, from (C.9),

$$E(u_i) = (C - \bar{C})\, H \tag{D.3}$$

we have

$$E(e_k) = (I - \alpha\bar{C})^k\, e_0 + \alpha \sum_{i=1}^{k} (I - \alpha\bar{C})^{k-i}\, (C - \bar{C})\, H \tag{D.4}$$

To simplify (D.4), we use the identity

$$\alpha \sum_{i=1}^{k} (I - \alpha\bar{C})^{k-i} = \left[I - (I - \alpha\bar{C})^k\right] \bar{C}^{-1} \tag{D.5}$$

which can be established by induction. Substituting (D.5) in (D.4),

$$E(e_k) = (I - \alpha\bar{C})^k \left[e_0 - \bar{C}^{-1}(C - \bar{C}) H\right] + \bar{C}^{-1}(C - \bar{C}) H \tag{D.6}$$

The second term is a fixed error which can be eliminated only if $\bar{C} = C$, while the first term approaches zero asymptotically with k if and only if

$$|1 - \alpha\bar\lambda_m| < 1, \qquad 1 \le m \le L \tag{D.7}$$

where $0 < \bar\lambda_1 \le \bar\lambda_2 \le \cdots \le \bar\lambda_L$ are the eigenvalues of $\bar{C}$.
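Identity (D.5) is easy to check numerically before it is used. The following sketch is an illustration added here, with an arbitrary positive definite matrix standing in for $\bar{C}$.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(3, 3))
Cb = A @ A.T + 3 * np.eye(3)              # a positive definite stand-in for C-bar
alpha = 0.1 / np.linalg.eigvalsh(Cb)[-1]  # any alpha works; the identity is algebraic
k = 7
M = np.eye(3) - alpha * Cb
lhs = alpha * sum(np.linalg.matrix_power(M, k - i) for i in range(1, k + 1))
rhs = (np.eye(3) - np.linalg.matrix_power(M, k)) @ np.linalg.inv(Cb)
print(np.allclose(lhs, rhs))              # True
```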

Assuming (D.7) is satisfied, the error between $e_k$ and its steady-state value is

$$e_k - \bar{C}^{-1}(C - \bar{C}) H = (I - \alpha\bar{C})^k \left[e_0 - \bar{C}^{-1}(C - \bar{C}) H\right] + \alpha \sum_{i=1}^{k} (I - \alpha\bar{C})^{k-i} \left[u_i - (C - \bar{C}) H\right] \tag{D.8}$$

Taking the squared norm and then the expected value (the cross term has zero mean),

$$E\left\|e_k - \bar{C}^{-1}(C - \bar{C}) H\right\|^2 = \epsilon_0^T (I - \alpha\bar{C})^{2k} \epsilon_0 + \alpha^2 \sum_{i=1}^{k}\sum_{j=1}^{k} E\left\{\left[u_i - (C - \bar{C}) H\right]^T (I - \alpha\bar{C})^{2k-i-j} \left[u_j - (C - \bar{C}) H\right]\right\} \tag{D.9}$$

where $\epsilon_0$ is defined by

$$\epsilon_0 = e_0 - \bar{C}^{-1}(C - \bar{C}) H \tag{D.10}$$

Note that the term in braces in (D.9) becomes, upon substitution for $u_i$,

$$E(u_i^T (I - \alpha\bar{C})^{2k-i-j} u_j) - H^T (C - \bar{C}) (I - \alpha\bar{C})^{2k-i-j} (C - \bar{C}) H = E\left(H^T B_i B_i^T (I - \alpha\bar{C})^{2k-i-j} B_j B_j^T H\right) - H^T C (I - \alpha\bar{C})^{2k-i-j} C H + \sigma^2 \delta_{ij}\, E\left(B_i^T (I - \alpha\bar{C})^{2(k-i)} B_i\right) \tag{D.11}$$

Also from (D.5) we observe that

$$\alpha^2 \sum_{i=1}^{k}\sum_{j=1}^{k} H^T C (I - \alpha\bar{C})^{2k-i-j} C H = H^T C \bar{C}^{-1} \left[I - (I - \alpha\bar{C})^k\right]\left[I - (I - \alpha\bar{C})^k\right] \bar{C}^{-1} C H \tag{D.12}$$

The second term of (D.11) then becomes, using (D.12) and expanding the product,

$$-\alpha^2 \sum_{i=1}^{k}\sum_{j=1}^{k} H^T C (I - \alpha\bar{C})^{2k-i-j} C H \le \|H\|^2\, \|C\|^2\, \|\bar{C}^{-1}\|^2 \left(1 + 2\|I - \alpha\bar{C}\|^k + \|I - \alpha\bar{C}\|^{2k}\right) \tag{D.13}$$

The last term of (D.11) is easily handled, since

$$\sigma^2\alpha^2\, E\left(B_i^T (I - \alpha\bar{C})^{2(k-i)} B_i\right) = \sigma^2\alpha^2\, \mathrm{Tr}\left[(I - \alpha\bar{C})^{2(k-i)} C\right] \le \sigma^2\alpha^2 L\, \|C\|\, \|I - \alpha\bar{C}\|^{2(k-i)} \tag{D.14}$$

where the last inequality follows from the proof of Lemma C1. To simplify (D.13) and (D.14), define

$$r = \|I - \alpha\bar{C}\| < 1 \tag{D.15}$$

which requires that

$$\alpha < \frac{2}{\bar\lambda_L} \tag{D.16}$$

In addition we know that

$$\|C\| = \lambda_L \tag{D.17}$$

and

$$\|\bar{C}^{-1}\| = \frac{1}{\bar\lambda_1}. \tag{D.18}$$

Using (D.15) through (D.18),

$$\epsilon_0^T (I - \alpha\bar{C})^{2k} \epsilon_0 \le r^{2k}\, \|\epsilon_0\|^2 \tag{D.19}$$

$$\alpha^2 \sum_{i=1}^{k}\sum_{j=1}^{k} \left|H^T C (I - \alpha\bar{C})^{2k-i-j} C H\right| \le \|H\|^2\, \frac{\lambda_L^2}{\bar\lambda_1^2} \left(1 + 2r^k + r^{2k}\right) \tag{D.20}$$

and

$$\sigma^2\alpha^2 \sum_{i=1}^{k} E\left(B_i^T (I - \alpha\bar{C})^{2(k-i)} B_i\right) \le \sigma^2\alpha^2 L \lambda_L \sum_{i=1}^{k} r^{2(k-i)} = \frac{\sigma^2\alpha^2 L \lambda_L}{1 - r^2} \left(1 - r^{2k}\right) \tag{D.21}$$

When $\bar{C} = C$, these bounds can be strengthened somewhat. Equation D.13 becomes

$$\alpha^2 \sum_{i=1}^{k}\sum_{j=1}^{k} \left|H^T C (I - \alpha C)^{2k-i-j} C H\right| \le \|H\|^2 \left(1 + 2r^k + r^{2k}\right) \tag{D.22}$$

and using the identity

$$\alpha \sum_{i=1}^{k} (I - \alpha C)^{2(k-i)} = \left[I - (I - \alpha C)^{2k}\right] (2I - \alpha C)^{-1} \tag{D.23}$$

(D.14) becomes

$$\sigma^2\alpha^2 \sum_{i=1}^{k} \mathrm{Tr}\left[(I - \alpha C)^{2(k-i)} C\right] = \sigma^2\alpha \left(\mathrm{Tr}\left[C(2I - \alpha C)^{-1}\right] - \mathrm{Tr}\left[(I - \alpha C)^{2k} C (2I - \alpha C)^{-1}\right]\right) \le \frac{\sigma^2\alpha L \lambda_L}{2 - \alpha\lambda_L} \left(1 + r^{2k}\right) \tag{D.24}$$

Simplification of the first term of (D.11) requires consideration of each of the three special cases. We will also require the following Lemma:

Lemma D1: For two column vectors x and y, the following holds:

$$\|x y^T\| = |y^T x| \tag{D.25}$$

Proof: To find the eigenvalues of $x y^T$ we must solve

$$x y^T u = \lambda u$$

Since $y^T u$ is scalar-valued, it commutes with x,

$$(y^T u)\, x = \lambda u$$

and apparently $u = a x$ and

$$\lambda = y^T x$$

$x y^T$ has only one non-zero eigenvalue, since it has rank one, so

$$\|x y^T\| = |\lambda| = |y^T x|. \qquad \text{Q.E.D.}$$

1. General Case

Manipulating the first term of (D.11), remove the scalar-valued bilinear form from the center and utilize Lemma C1,

$$H^T B_i B_i^T (I - \alpha\bar{C})^{2k-i-j} B_j B_j^T H = (H^T B_i B_j^T H) \left(B_i^T (I - \alpha\bar{C})^{2k-i-j} B_j\right) \le L^3 b_M^4\, \|H\|^2\, r^{2k-i-j} \tag{D.26}$$

The last inequality is derived as follows. Note that

$$\|B_i B_j^T\| \le \max_{B_i, B_j} \|B_i B_j^T\| = \max_{B_i, B_j} |B_i^T B_j| \tag{D.27}$$

by Lemma D1. If the M possible values of the data digit are $b_1 < b_2 < \cdots < b_M$, where, without loss of generality, we can assume $|b_1| \le |b_M|$, then

$$\max_{B_i, B_j} |B_i^T B_j| = L\, b_M^2 \tag{D.28}$$

Combining (D.19) through (D.21) and (D.26),

$$E\left\|e_k - \bar{C}^{-1}(C - \bar{C}) H\right\|^2 \le r^{2k}\, \|\epsilon_0\|^2 + \frac{\sigma^2\alpha^2 L \lambda_L}{1 - r^2}\left(1 - r^{2k}\right) + \|H\|^2\, \frac{\lambda_L^2}{\bar\lambda_1^2}\left(1 + 2r^k + r^{2k}\right) + \frac{L^3 b_M^4\, \alpha^2\, \|H\|^2}{(1 - r)^2}\left(1 - r^k\right)^2 \tag{D.29}$$

when $\bar{C} \ne C$, and

$$E\|e_k\|^2 \le r^{2k}\, \|e_0\|^2 + \frac{\sigma^2\alpha L \lambda_L}{2 - \alpha\lambda_L}\left(1 + r^{2k}\right) + \|H\|^2\left(1 + 2r^k + r^{2k}\right) + \frac{L^3 b_M^4\, \alpha^2\, \|H\|^2}{(1 - r)^2}\left(1 - r^k\right)^2 \tag{D.30}$$

when $\bar{C} = C$. Asymptotically as $k \to \infty$, (D.29) and (D.30) become

$$\lim_{k\to\infty} E\left\|e_k - \bar{C}^{-1}(C - \bar{C}) H\right\|^2 \le \frac{\sigma^2\alpha^2 L \lambda_L}{1 - r^2} + \|H\|^2\left(\frac{\lambda_L^2}{\bar\lambda_1^2} + \frac{L^3 b_M^4\, \alpha^2}{(1 - r)^2}\right) \tag{D.31}$$

for $\bar{C} \ne C$, and

$$\lim_{k\to\infty} E\|e_k\|^2 \le \frac{\sigma^2\alpha L \lambda_L}{2 - \alpha\lambda_L} + \|H\|^2\left(1 + \frac{L^3 b_M^4\, \alpha^2}{(1 - r)^2}\right) \tag{D.32}$$

for $\bar{C} = C$.

2. $B_k$ and $B_j$ Statistically Independent for $|k-j| > \Delta$

For $|i-j| > \Delta$, the first term of (D.11) becomes

$$E\left(H^T B_i B_i^T (I - \alpha\bar{C})^{2k-i-j} B_j B_j^T H\right) = H^T C (I - \alpha\bar{C})^{2k-i-j} C H \tag{D.33}$$

which is just the negative of the second term, so that the two cancel. In addition, when $|i-j| \le \Delta$ we will use a slightly different bounding technique for the second term of (D.11). Specifically, note that

$$\left|H^T C (I - \alpha\bar{C})^{2k-i-j} C H\right| \le \|H\|^2\, \|C\|^2\, \|I - \alpha\bar{C}\|^{2k-i-j} = \|H\|^2\, \lambda_L^2\, r^{2k-i-j} \tag{D.34}$$

Now, divide up the sum of (D.9) into

$$\sum_{i=1}^{k}\sum_{j=1}^{k} = \sum_{i=\Delta+1}^{k-\Delta}\sum_{j=i-\Delta}^{i+\Delta} + \sum_{i=1}^{\Delta}\sum_{j=1}^{i+\Delta} + \sum_{i=k-\Delta+1}^{k}\sum_{j=i-\Delta}^{k} \tag{D.35}$$

where all other terms in the sum are zero by (D.33). Then, using the identity

$$\sum_{i=a}^{b}\sum_{j=i+c}^{i+d} r^{2k-i-j} = \frac{r^{2k-2b-d}\left(1 - r^{d-c+1}\right)\left(1 - r^{2(b-a+1)}\right)}{(1 - r)(1 - r^2)} \tag{D.36}$$

(and its analogues for the edge sums of (D.35)) together with (D.33), (D.20), and (D.26),

$$E\left\|e_k - \bar{C}^{-1}(C - \bar{C}) H\right\|^2 \le r^{2k}\, \|\epsilon_0\|^2 + \frac{\sigma^2\alpha^2 L \lambda_L}{1 - r^2}\left(1 - r^{2k}\right) + \|H\|^2\left(L^3 b_M^4 + \lambda_L^2\right)\alpha^2\left[\frac{r\left(1 - r^{2\Delta}\right)\left(1 - r^{2(k-2\Delta)}\right)}{(1 - r^2)(1 - r)} + \frac{1 - r^{2\Delta+1} - r^{\Delta+1}\left(1 - r^{\Delta}\right)/(1 + r)}{(1 - r)^2}\right] \tag{D.39}$$

Taking the limit as $k \to \infty$,

$$\lim_{k\to\infty} E\left\|e_k - \bar{C}^{-1}(C - \bar{C}) H\right\|^2 \le \frac{\sigma^2\alpha^2 L \lambda_L}{1 - r^2} + \|H\|^2\left(L^3 b_M^4 + \lambda_L^2\right)\alpha^2\left[\frac{r\left(1 - r^{2\Delta}\right)}{(1 - r^2)(1 - r)} + \frac{1 - r^{2\Delta+1} - r^{\Delta+1}\left(1 - r^{\Delta}\right)/(1 + r)}{(1 - r)^2}\right] \tag{D.40}$$

and, of course, the slightly tighter bound when $\bar{C} = C$,

$$\lim_{k\to\infty} E\|e_k\|^2 \le \frac{\sigma^2\alpha L \lambda_L}{2 - \alpha\lambda_L} + \|H\|^2\left(L^3 b_M^4 + \lambda_L^2\right)\alpha^2\left[\frac{r\left(1 - r^{2\Delta}\right)}{(1 - r^2)(1 - r)} + \frac{1 - r^{2\Delta+1} - r^{\Delta+1}\left(1 - r^{\Delta}\right)/(1 + r)}{(1 - r)^2}\right] \tag{D.41}$$

3. Independent Data Digits with $E(B_k) = 0$

Now consider the case where the data digits $B_i$ and $B_j$ are statistically independent for $i \ne j$. We also assume that $\bar{C} = C$. Under these conditions we can solve for the mean-square error without resorting to bounds. Since the data digits are independent we have

$$C = C_0\, I \tag{D.42}$$

and (D.11) becomes

$$E\left(H^T (B_i B_i^T - C)(I - \alpha C)^{2k-i-j} (B_j B_j^T - C) H\right) = (1 - \alpha C_0)^{2k-i-j}\left[E(H^T B_i B_i^T B_j B_j^T H) - C_0^2\, \|H\|^2\right] \tag{D.43}$$

A typical term in

$$A_{ij} = B_i B_i^T B_j B_j^T = (B_i B_i^T)(B_j B_j^T) \tag{D.44}$$

is

$$a_{mn} = \sum_{\ell=0}^{L-1} B_{i-m+1}\, B_{i-\ell}\, B_{j-\ell}\, B_{j-n+1} \tag{D.45}$$

There are two instances of i and j where the expected value of (D.45) can be non-zero. First, when $i = j$ and $m = n$, then

$$E(a_{mm}) = E\left(B_{i-m+1}^2 \sum_{\ell=0}^{L-1} B_{i-\ell}^2\right) = (L-1)\, C_0^2 + E(B_k^4) \tag{D.46}$$

When $i \ne j$, then either $m = n$, or $i - \ell = j - n + 1$ and $j - \ell = i - m + 1$. The latter two equations imply that $n = (\ell+1) - (i-j)$ and $m = (\ell+1) + (i-j)$, from which $m = n + 2(i-j)$. Thus, for $i \ne j$,

$$E(a_{mn}) = \begin{cases} C_0^2, & m = n \\ C_0^2, & n = (\ell+1) - (i-j),\; m = n + 2(i-j),\; 0 \le \ell \le L-1 \\ 0, & \text{otherwise} \end{cases} \tag{D.47}$$

Finally, for $i = j$,

$$H^TE\bigl(B_iB_i^TB_iB_i^T\bigr)H = \bigl[(L-1)\,C_0^2 + E(B_k^4)\bigr]\,\|H\|^2 \qquad (D.48)$$

and for $i \ne j$,

$$H^TE\bigl(B_iB_i^TB_jB_j^T\bigr)H = C_0^2\,\|H\|^2 + C_0^2\sum_{\ell}h_{\ell+2(i-j)}\,h_\ell \qquad (D.49)$$

From (D.48 - D.49) and (2.6.14), in terms of the sampled autocorrelation $\rho(\cdot)$ of the channel (so that $\rho(0) = \|H\|^2$), evidently

$$H^TE\bigl(B_iB_i^TB_jB_j^T\bigr)H = \begin{cases}\bigl[(L-1)\,C_0^2 + E(B_k^4)\bigr]\,\rho(0), & i = j\\ C_0^2\,\bigl[\rho(0) + \rho(2(i-j))\bigr], & i \ne j \end{cases}$$

The second term of (D.18) is, for independent data digits,

$$\alpha^2\sigma^2\,\delta_{ij}\,E\bigl(B_i^T(I-\alpha C)^{2(k-i)}B_i\bigr) = \alpha^2\sigma^2\,\delta_{ij}\,(1-\alpha C_0)^{2(k-i)}\,L\,C_0 \qquad (D.50)$$

and (D.9) becomes

$$\|(I-\alpha C)^k\,e_0\|^2 = (1-\alpha C_0)^{2k}\,\|e_0\|^2 \qquad (D.51)$$

Finally, an exact expression for the mean-square error becomes

$$E\|e_k\|^2 = (1-\alpha C_0)^{2k}\|e_0\|^2 + \alpha^2\sigma^2LC_0\sum_{i=1}^{k}(1-\alpha C_0)^{2(k-i)} + \alpha^2\bigl[(L-2)C_0^2 + E(B_k^4)\bigr]\|H\|^2\sum_{i=1}^{k}(1-\alpha C_0)^{2(k-i)} + \alpha^2C_0^2\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\ne i}}^{k}\rho(2(i-j))\,(1-\alpha C_0)^{2k-i-j}$$

$$= r^{2k}\|e_0\|^2 + \frac{\alpha\sigma^2L}{2-\alpha C_0}\bigl(1-r^{2k}\bigr) + \|H\|^2\bigl[(L-2)C_0^2 + E(B_k^4)\bigr]\frac{\alpha}{C_0(2-\alpha C_0)}\bigl(1-r^{2k}\bigr) + C_0^2\,\|H\|^2\,D_k(\alpha,\rho,C_0) \qquad (D.52)$$

where

$$D_k(\alpha,\rho,C_0) = \alpha^2\sum_{i=1}^{k}\sum_{\substack{j=1\\ j\ne i}}^{k}\frac{\rho(2(i-j))}{\rho(0)}\,r^{2k-i-j} \qquad (D.53)$$

Now, let $L_0$ be the greatest integer contained in $L/2$. Then $\rho(2n) = 0$ for $n > L_0$, and $D_k(\alpha,\rho,C_0)$ can be written as

$$D_k(\alpha,\rho,C_0) = \alpha^2\left[\sum_{i=1}^{L_0}\sum_{\substack{j=1\\ j\ne i}}^{i+L_0} + \sum_{i=L_0+1}^{k-L_0}\;\sum_{\substack{j=i-L_0\\ j\ne i}}^{i+L_0} + \sum_{i=k-L_0+1}^{k}\;\sum_{\substack{j=i-L_0\\ j\ne i}}^{k}\right]\frac{\rho(2(i-j))}{\rho(0)}\,r^{2k-i-j} \qquad (D.54)$$

Obviously,

$$\lim_{k\to\infty}\alpha^2\sum_{i=1}^{L_0}\sum_{\substack{j=1\\ j\ne i}}^{i+L_0}\frac{\rho(2(i-j))}{\rho(0)}\,r^{2k-i-j} = 0 \qquad (D.55)$$

and the third term of (D.54) can be written as

$$\alpha^2\sum_{m=1}^{L_0}\sum_{\substack{n=m\\ n\ne m+L_0}}^{2L_0}\frac{\rho(2(m-n+L_0))}{\rho(0)}\,r^{3L_0-m-n} \qquad (D.56)$$

which is independent of $k$. Finally, the middle term of (D.54) can be manipulated to

$$\alpha^2\,\frac{r^{2L_0}\bigl[1-r^{2(k-2L_0)}\bigr]}{1-r^2}\sum_{\substack{\ell=-L_0\\ \ell\ne0}}^{L_0}\frac{\rho(2\ell)}{\rho(0)}\,r^{\ell} \qquad (D.57)$$

Combining (D.55 - D.57) in (D.52), the final result is

$$\lim_{k\to\infty}E\|e_k\|^2 = \frac{\alpha\sigma^2L}{2-\alpha C_0} + \|H\|^2\left[(L-2) + \frac{E(B_k^4)}{C_0^2}\right]\frac{\alpha C_0}{2-\alpha C_0} + C_0^2\,\|H\|^2\left[\frac{\alpha\,r^{2L_0}}{C_0(2-\alpha C_0)}\sum_{\substack{\ell=-L_0\\ \ell\ne0}}^{L_0}\frac{\rho(2\ell)}{\rho(0)}\,r^{\ell} + \alpha^2\sum_{m=1}^{L_0}\sum_{\substack{n=m\\ n\ne m+L_0}}^{2L_0}\frac{\rho(2(m-n+L_0))}{\rho(0)}\,r^{3L_0-m-n}\right] \qquad (D.58)$$
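The steady-state expression (D.58) can be exercised by direct simulation. The following is a minimal Monte Carlo sketch, assuming independent $\pm1$ data digits and an update of the known-statistics form $\hat H_{k+1} = \hat H_k + \alpha(X_kB_k - \bar C\hat H_k)$ with $\bar C = C = I$; the update form, the two-tap channel, and all parameter values are illustrative assumptions rather than quantities taken from this report. For $L = 2$ the $\rho$ terms of (D.58) vanish, leaving $\alpha\bigl(L\sigma^2 + (L-1)\|H\|^2\bigr)/(2-\alpha)$ as the predicted steady-state mean-square error.

```python
import numpy as np

rng = np.random.default_rng(0)
L, alpha, sigma = 2, 0.05, 0.5
H = np.array([1.0, 0.5])       # true channel (illustrative assumption)
n_iter, n_trials = 4000, 200
mse = 0.0
for _ in range(n_trials):
    b = rng.choice([-1.0, 1.0], size=n_iter + L)
    Hhat = np.zeros(L)
    for k in range(n_iter):
        B = b[k:k + L][::-1]               # data vector B_k
        X = B @ H + sigma * rng.normal()   # received sample X_k
        Hhat += alpha * (X * B - Hhat)     # assumed update; C = C0 I = I
    mse += np.sum((Hhat - H) ** 2) / n_trials

# L = 2 specialization of (D.58) for +/-1 digits: the rho terms vanish
predicted = alpha * (L * sigma ** 2 + (L - 1) * (H @ H)) / (2 - alpha)
print(f"simulated {mse:.4f}  predicted {predicted:.4f}")
```

With these values the simulated mean-square error settles near the predicted level, which supports the leading terms of (D.58).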

APPENDIX E

MEAN-SQUARE ERROR BOUNDS FOR THE FIXED STEP-SIZE ALGORITHM FOR UNKNOWN DATA STATISTICS

The task of this appendix is to calculate bounds on the mean-square error of the algorithm of (5.5.2). To do so, we assume that $B_k$ and $B_j$ are statistically independent for $|k-j| > \Delta$, that $\{B_k\}$ is a wide-sense stationary random process with a covariance matrix $C = E(B_kB_k^T)$ which is positive definite, and that $\{n_k\}$ is a sequence of statistically independent random variables with mean zero and variance $\sigma^2$.

The first step is to substitute for $X_{k\Delta+1}$ and let $e_{k\Delta+1} = H_{k\Delta+1} - H$ in (5.5.4),

$$e_{k\Delta+1} = \bigl(I - \alpha B_{k\Delta+1}B_{k\Delta+1}^T\bigr)\,e_{(k-1)\Delta+1} + \alpha\,B_{k\Delta+1}\,n_{k\Delta+1} \qquad (E.1)$$

The key simplification results from the fact that, since $e_{(k-1)\Delta+1}$ is a function of $B_j$ for $j \le (k-1)\Delta+1$, it is independent of $B_{k\Delta+1}$, and thus

$$E\bigl[(I - \alpha B_{k\Delta+1}B_{k\Delta+1}^T)\,e_{(k-1)\Delta+1}\bigr] = E\bigl(I - \alpha B_{k\Delta+1}B_{k\Delta+1}^T\bigr)\,E\bigl(e_{(k-1)\Delta+1}\bigr) = (I - \alpha C)\,E\bigl(e_{(k-1)\Delta+1}\bigr) \qquad (E.2)$$

Therefore, since $E(n_{k\Delta+1}) = 0$, from (E.2),

$$\|E(e_{k\Delta+1})\| \le r^k\,\|e_1\| \qquad (E.3)$$

Thus, the algorithm is asymptotically unbiased if and only if

$$r = \|I - \alpha C\| < 1 \qquad (E.4)$$

To calculate the mean-square error, note that from (E.1),

$$\|e_{k\Delta+1}\|^2 = e_{(k-1)\Delta+1}^T\bigl(I - \alpha B_{k\Delta+1}B_{k\Delta+1}^T\bigr)^2\,e_{(k-1)\Delta+1} + 2\alpha\,e_{(k-1)\Delta+1}^T\bigl(I - \alpha B_{k\Delta+1}B_{k\Delta+1}^T\bigr)B_{k\Delta+1}\,n_{k\Delta+1} + \alpha^2\,n_{k\Delta+1}^2\,\|B_{k\Delta+1}\|^2 \qquad (E.5)$$

and to upper bound (E.5), first observe that

$$\bigl(I - \alpha B_{k\Delta+1}B_{k\Delta+1}^T\bigr)^2 = \bigl[(I - \alpha C) + \alpha(C - B_{k\Delta+1}B_{k\Delta+1}^T)\bigr]^2 = (I - \alpha C)^2 + \alpha^2\bigl(C - B_{k\Delta+1}B_{k\Delta+1}^T\bigr)^2 + 2\alpha\,(I - \alpha C)\bigl(C - B_{k\Delta+1}B_{k\Delta+1}^T\bigr) \qquad (E.6)$$

that the expected value of the second term in (E.5) is zero, and further, that

$$E\|B_{k\Delta+1}\|^2 = L\,C_0 \qquad (E.7)$$

Substituting (E.6) in (E.5),

$$\|e_{k\Delta+1}\|^2 \le \bigl(\|I - \alpha C\|^2 + \alpha^2\,\|C - B_{k\Delta+1}B_{k\Delta+1}^T\|^2\bigr)\,\|e_{(k-1)\Delta+1}\|^2 + 2\alpha\,e_{(k-1)\Delta+1}^T(I - \alpha C)\bigl(C - B_{k\Delta+1}B_{k\Delta+1}^T\bigr)\,e_{(k-1)\Delta+1} + 2\alpha\,e_{(k-1)\Delta+1}^T\bigl(I - \alpha B_{k\Delta+1}B_{k\Delta+1}^T\bigr)B_{k\Delta+1}\,n_{k\Delta+1} + \alpha^2\,n_{k\Delta+1}^2\,\|B_{k\Delta+1}\|^2 \qquad (E.8)$$

and taking the expected value of (E.8),

$$E\|e_{k\Delta+1}\|^2 \le R^2\,E\|e_{(k-1)\Delta+1}\|^2 + \alpha^2\sigma^2\,L\,C_0 \qquad (E.9)$$

where

$$R^2 = r^2 + \alpha^2\,\sigma_{BB^T}^2 \qquad (E.10)$$

and $\sigma_{BB^T}^2$ is the variance of the matrix $B_kB_k^T$,

$$\sigma_{BB^T}^2 = E\|B_kB_k^T - C\|^2 \qquad (E.11)$$

Iterating the inequality (E.9),

$$E\|e_{k\Delta+1}\|^2 \le R^{2k}\,\|e_1\|^2 + \alpha^2\sigma^2LC_0\sum_{i=1}^{k}R^{2(k-i)} = R^{2k}\,\|e_1\|^2 + \frac{\alpha^2\sigma^2LC_0}{1-R^2}\bigl(1-R^{2k}\bigr) \qquad (E.12)$$

and assuming that $R^2 < 1$, we get, finally,

$$\lim_{k\to\infty}E\|e_{k\Delta+1}\|^2 \le \frac{\alpha^2\sigma^2\,L\,C_0}{1-R^2} \qquad (E.13)$$

Note that when the condition $R^2 < 1$ is satisfied, then the earlier condition that $r < 1$ is automatically satisfied.
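The bound (E.13) is easy to exercise numerically. The following is a minimal sketch, assuming independent $\pm1$ digits (so $C = I$, $C_0 = 1$, $r = 1-\alpha$) and updates spaced $\Delta = L$ samples apart, as in (E.1), so that the data vectors entering successive updates share no digits; the channel taps and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
L, alpha, sigma = 3, 0.1, 0.4
Delta = L                       # update spacing, as in (E.1)
H = np.array([0.8, -0.5, 0.3])  # true channel (illustrative assumption)

# Estimate sigma_BBT^2 = E||B B^T - C||^2 (spectral norm); C = I here
norms = [np.linalg.norm(np.outer(B, B) - np.eye(L), 2) ** 2
         for B in rng.choice([-1.0, 1.0], size=(2000, L))]
sigma_bbt2 = np.mean(norms)

r = 1 - alpha                              # ||I - alpha C|| with C = I
R2 = r ** 2 + alpha ** 2 * sigma_bbt2      # (E.10)
bound = alpha ** 2 * sigma ** 2 * L / (1 - R2)   # (E.13), C0 = 1

# Simulate the Delta-spaced updates and measure the steady-state error
mse, n_trials, n_upd = 0.0, 300, 1500
for _ in range(n_trials):
    b = rng.choice([-1.0, 1.0], size=n_upd * Delta + L)
    Hhat = np.zeros(L)
    for k in range(n_upd):
        i = k * Delta
        B = b[i:i + L][::-1]
        X = B @ H + sigma * rng.normal()
        Hhat += alpha * B * (X - B @ Hhat)
    mse += np.sum((Hhat - H) ** 2) / n_trials
print(f"simulated {mse:.4f}  bound {bound:.4f}")
```

The simulated steady-state error should fall below the bound, which is loose because (E.9) discards the sign of the cross terms.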

APPENDIX F

ANALYSIS OF THE FIXED-INCREMENT ALGORITHMS

The first requirement is to calculate the variance of the estimate (5.6.10), which we repeat here,

$$[\hat e_k]_\ell = \frac{1}{K_0C_0}\sum_{i=kK_0+1}^{(k+1)K_0}B_{i-\ell}\left(\sum_{j=0}^{L-1}B_{i-j}\,[e_k]_j - n_i\right) \qquad (F.1)$$

From (5.6.12), we know that

$$E[\hat e_k]_\ell = [e_k]_\ell \qquad (F.2)$$

so that we require only the second moment of (F.1). Squaring (F.1),

$$[\hat e_k]_\ell^2 = \frac{1}{K_0^2C_0^2}\sum_{m=kK_0+1}^{(k+1)K_0}\;\sum_{n=kK_0+1}^{(k+1)K_0}B_{m-\ell}\,B_{n-\ell}\left(\sum_{j=0}^{L-1}B_{m-j}\,[e_k]_j - n_m\right)\left(\sum_{i=0}^{L-1}B_{n-i}\,[e_k]_i - n_n\right) \qquad (F.3)$$

and taking the expected value of (F.3),

$$E[\hat e_k]_\ell^2 = \frac{1}{K_0^2C_0^2}\sum_{m=kK_0+1}^{(k+1)K_0}\;\sum_{n=kK_0+1}^{(k+1)K_0}\left[\sum_{j=0}^{L-1}\sum_{i=0}^{L-1}[e_k]_j\,[e_k]_i\,E\bigl(B_{m-\ell}B_{n-\ell}B_{m-j}B_{n-i}\bigr) + \sigma^2\,\delta_{mn}\,E\bigl(B_{m-\ell}^2\bigr)\right] \qquad (F.4)$$

Evaluating the expectations term by term,

$$E[\hat e_k]_\ell^2 = \frac{1}{K_0^2C_0^2}\left\{K_0\,\sigma^2C_0 + K_0\bigl[E(B_k^4)-C_0^2\bigr][e_k]_\ell^2 + K_0^2\,C_0^2\,[e_k]_\ell^2 + K_0\,C_0^2\sum_{\substack{j=0\\ j\ne\ell}}^{L-1}[e_k]_j^2 + 2\,C_0^2\sum_{m=kK_0+1}^{(k+1)K_0}\;\sum_{\substack{n=kK_0+1\\ n\ne m}}^{(k+1)K_0}[e_k]_{n-m+\ell}\,[e_k]_{m-n+\ell}\right\} \qquad (F.5)$$

Subtracting the mean value squared of (F.2) from (F.5), we get the variance,

$$\mathrm{Var}[\hat e_k]_\ell = \frac{\sigma^2}{K_0C_0} + \frac{E(B_k^4)-C_0^2}{K_0C_0^2}\,[e_k]_\ell^2 + \frac{1}{K_0}\sum_{\substack{j=0\\ j\ne\ell}}^{L-1}[e_k]_j^2 + \frac{2}{K_0^2}\sum_{m=kK_0+1}^{(k+1)K_0}\;\sum_{\substack{n=kK_0+1\\ n\ne m}}^{(k+1)K_0}[e_k]_{n-m+\ell}\,[e_k]_{m-n+\ell} \qquad (F.6)$$
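To make the behavior of (F.1) - (F.6) concrete, the block estimate can be simulated. The sketch below assumes independent $\pm1$ digits (so $C_0 = 1$ and $E(B_k^4) = 1$, making the second term of (F.6) vanish) and a fixed tap-error vector; the error-signal convention $\epsilon_i = \sum_jB_{i-j}[e_k]_j - n_i$ follows (F.1), and all parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
L, K0, sigma = 3, 500, 0.4
e = np.array([0.2, -0.1, 0.05])   # fixed tap-error vector (illustrative)
ell = 0                           # component being estimated
est = []
for _ in range(1000):
    b = rng.choice([-1.0, 1.0], size=K0 + L)
    acc = 0.0
    for i in range(K0):
        Bi = b[i:i + L][::-1]                     # (B_i, ..., B_{i-L+1})
        eps = Bi @ e - sigma * rng.normal()       # error signal, as in (F.1)
        acc += Bi[ell] * eps
    est.append(acc / K0)                          # C0 = 1 for +/-1 digits
est = np.array(est)
print(est.mean(), e[ell])                         # (F.2): unbiased
pred_var = (sigma ** 2 + np.sum(np.delete(e, ell) ** 2)) / K0
print(est.var(), pred_var)                        # leading terms of (F.6)
```

For $\ell = 0$ the last term of (F.6) also vanishes for this error vector, so the sample variance should match the two leading terms closely.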

Consider the sequential test of (5.6.18 - 19). When the data digits are independent and $B_k = +1$ or $-1$ with equal probability, then the terms in the sum (5.6.18) are statistically independent for $L = 2$. However, for $L > 2$ and for other data statistics they are not. To get an idea of how great the statistical dependencies are, we will calculate the variance of $E_\ell(m)$ and the sum of the variances of each term in the sum. From (5.6.13), it can be inferred that

$$\mathrm{Var}[E_\ell(m)] = m\,C_0\sigma^2 + m\bigl[E(B_k^4)-C_0^2\bigr][e_k]_\ell^2 + m\,C_0^2\sum_{\substack{j=0\\ j\ne\ell}}^{L-1}[e_k]_j^2 + C_0^2\sum_{i=K+1}^{K+m}\;\sum_{\substack{j=K+1\\ j\ne i}}^{K+m}[e_k]_{\ell+(i-j)}\,[e_k]_{\ell-(i-j)} \qquad (F.7)$$

where $K$ denotes the time index at which the current test began, while it can easily be shown that

$$\sum_{i=K+1}^{K+m}\mathrm{Var}\bigl(B_{i-\ell}\,e_i\bigr) = m\,C_0\sigma^2 + m\bigl[E(B_k^4)-C_0^2\bigr][e_k]_\ell^2 + m\,C_0^2\sum_{\substack{j=0\\ j\ne\ell}}^{L-1}[e_k]_j^2 \qquad (F.8)$$

The two equations are identical except for the last term in (F.7),

which is indicative of the statistical dependencies in the sum. However, as has been noted, that term is generally small in relation to the others because it is multiplied by an additional factor of $1/m$ and because the correlations between components of the error can be expected to be small. In order to get an approximate analysis of the properties of the algorithm, we will assume that the terms in the sum (5.6.18) are independent. From the preceding analysis, we can conclude that the results should be reasonably accurate.

First, we will state several results which are due to Wald [37], [38]. Let $\{z_n\}$ be a sequence of independent identically distributed random variables, with

$$S_n = \sum_{i=1}^{n}z_i \qquad (F.9)$$

$$N^* = \inf\{n \mid S_n \notin (a,b)\},\qquad a < 0 < b \qquad (F.10)$$

and

$$M(t) = E\bigl(\exp(t\,z_i)\bigr) \qquad (F.11)$$

Then we state without proof the following three lemmas [38]:

Lemma F1: If $\Pr\{z_i = 0\} \ne 1$, then $\Pr\{N^* < \infty\} = 1$.

Lemma F2: If $\Pr\{z_i = 0\} \ne 1$ and $\Pr\{|z_i| \le c\} = 1$ for some finite $c$, then

$$E\bigl(\exp\{t\,S_{N^*}\}\,M(t)^{-N^*}\bigr) = 1$$

for all $t$ such that $M(t) < \infty$.

Lemma F3: If $M(t)$ exists in a neighborhood of $t = 0$, then

$$E(S_{N^*}) = E(z_i)\,E(N^*) \qquad (F.12)$$

and

$$E\bigl[\bigl(S_{N^*} - N^*E(z_i)\bigr)^2\bigr] = \mathrm{Var}(z_i)\,E(N^*) \qquad (F.13)$$

The $B_{i-\ell}\,e_i$ in the sum (5.6.18) satisfy all the conditions of Lemmas F1 - F3 except the boundedness condition $\Pr\{|B_{i-\ell}\,e_i| \le c\} = 1$. However, if we define the new random variable

$$z_{\ell i} = \begin{cases}B_{i-\ell}\,e_i, & |B_{i-\ell}\,e_i| \le nR_t\\ nR_t, & B_{i-\ell}\,e_i > nR_t\\ -nR_t, & B_{i-\ell}\,e_i < -nR_t\end{cases} \qquad (F.14)$$

where $n \ge 2$, then the character of the algorithm (5.6.20) is not changed and the boundedness condition is satisfied. Furthermore, as $n \to \infty$, all moments and the moment generating function of $z_{\ell i}$ approach those of $B_{i-\ell}\,e_i$. Therefore, we will assume that

$$E(z_{\ell i}) = C_0\,[e_k]_\ell \qquad (F.15)$$

and that $z_{\ell i}$ and $B_{i-\ell}\,e_i$ have the same moment generating function.

We are now prepared to analyze (5.6.20). First we can assert from Lemma F1 that the test terminates with probability one. Then

we need to calculate the moment generating function of $z_{\ell i}$, which is

$$M(t) = E\left(\exp\left\{t\left(B_{i-\ell}\sum_{j=0}^{L-1}B_{i-j}\,[e_k]_j - B_{i-\ell}\,n_i\right)\right\}\right) \qquad (F.16)$$

In order to get concrete results, we will assume independent data digits with $B_k = +1$ or $-1$ with equal probability. Then, it can be shown that $B_{i-\ell}B_{i-j}$ and $B_{i-\ell}B_{i-m}$ are independent random variables for $j \ne m$. Thus, (F.16) becomes

$$M(t) = e^{t[e_k]_\ell}\,e^{t^2\sigma^2/2}\prod_{\substack{j=0\\ j\ne\ell}}^{L-1}\cosh\bigl(t\,[e_k]_j\bigr) \qquad (F.17)$$

and let $t_0$ satisfy

$$M(t_0) = 1 \qquad (F.18)$$

(Wald [37] shows that $t_0 \ne 0$ exists and is unique.) In general $t_0$ must be determined numerically from the transcendental equation (F.17). Then from Lemma F2,

$$E\bigl(\exp\{t_0\,E_\ell(k^*)\}\bigr) = 1 \qquad (F.19)$$

but if $R_t \gg |C_0[e_k]_\ell|$, then

$$E_\ell(k^*) \approx R_t \text{ or } -R_t. \qquad (F.20)$$

Using approximation (F.20) in (F.19), we get

$$\Pr\{E_\ell(k^*) = R_t\}\,e^{t_0R_t} + \bigl(1 - \Pr\{E_\ell(k^*) = R_t\}\bigr)\,e^{-t_0R_t} \approx 1 \qquad (F.21)$$

or

$$\Pr\{E_\ell(k^*) = R_t\} \approx \frac{1 - e^{-t_0R_t}}{e^{t_0R_t} - e^{-t_0R_t}} \qquad (F.22)$$

and

$$\Pr\{E_\ell(k^*) = -R_t\} \approx \frac{e^{t_0R_t} - 1}{e^{t_0R_t} - e^{-t_0R_t}} \qquad (F.23)$$

Equations (F.22 - F.23) are useful for finding the probability of an incorrect increment in the algorithm. They can also be used to find the expected value of $k^*$. In particular,

$$E\bigl(E_\ell(k^*)\bigr) \approx R_t\,\Pr\{E_\ell(k^*) = R_t\} - R_t\,\Pr\{E_\ell(k^*) = -R_t\} = R_t\,\frac{2 - e^{t_0R_t} - e^{-t_0R_t}}{e^{t_0R_t} - e^{-t_0R_t}} \qquad (F.24)$$

from (F.23). Thus, from Lemma F3,

$$E(k^*) = \frac{E\bigl(E_\ell(k^*)\bigr)}{E(z_{\ell i})} = \frac{R_t}{C_0[e_k]_\ell}\cdot\frac{2 - e^{t_0R_t} - e^{-t_0R_t}}{e^{t_0R_t} - e^{-t_0R_t}} \qquad (F.25)$$

Using (F.13), a formula for the variance of $k^*$ can be developed. From (F.13),

$$E\bigl(S_{N^*}^2\bigr) - 2\,E(z_i)\,E(S_{N^*})\,E(N^*) + [E(z_i)]^2\,E(N^{*2}) = \mathrm{Var}(z_i)\,E(N^*) \qquad (F.26)$$

(where $S_{N^*}$ and $N^*$ have been treated as uncorrelated in the cross term), from which we get

$$\mathrm{Var}(N^*) = \frac{\mathrm{Var}(z_i)\,E(N^*) - E\bigl(S_{N^*}^2\bigr) + 2\,E(z_i)\,E(S_{N^*})\,E(N^*)}{[E(z_i)]^2} - [E(N^*)]^2 \qquad (F.27)$$

Note that

$$E\bigl[E_\ell(k^*)^2\bigr] \approx R_t^2 \qquad (F.28)$$

and

$$\mathrm{Var}\bigl(B_{i-\ell}\,e_i\bigr) = \sigma^2 + \sum_{\substack{j=0\\ j\ne\ell}}^{L-1}[e_k]_j^2 \qquad (F.29)$$

from (F.8). Thus, (F.27) becomes

$$\mathrm{Var}(k^*) = \frac{R_t}{C_0^3[e_k]_\ell^3}\left(\sigma^2 + \sum_{\substack{j=0\\ j\ne\ell}}^{L-1}[e_k]_j^2\right)\frac{2 - e^{t_0R_t} - e^{-t_0R_t}}{e^{t_0R_t} - e^{-t_0R_t}} - \frac{R_t^2}{C_0^2[e_k]_\ell^2}\left[1 - \left(\frac{2 - e^{t_0R_t} - e^{-t_0R_t}}{e^{t_0R_t} - e^{-t_0R_t}}\right)^2\right] \qquad (F.30)$$

This completes the analysis of the algorithm of (5.6.17 - 20). We now focus attention on the quantized algorithm of (5.6.45). This analysis is very similar to Lucky's [22]. The important quantity is the probability of an upcount,

$$P_\ell = \Pr\{-B_{i-\ell}\,e_i > 0 \mid e_k\} \qquad (F.31)$$

From (5.6.4), (F.31) becomes

$$P_\ell = \sum_{B_i\cdots B_{i-L+1}}\Pr\left\{B_{i-\ell}^2\,[e_k]_\ell > B_{i-\ell}\,n_i - B_{i-\ell}\sum_{\substack{j=0\\ j\ne\ell}}^{L-1}B_{i-j}\,[e_k]_j\right\}P\bigl(B_i\cdots B_{i-L+1}\bigr) \qquad (F.32)$$

which can be calculated by computer for specific data statistics. If we specialize to independent data digits with $B_k = +1$ or $-1$ with equal probability, then (F.32) becomes

419 p " - -'. Pr Pr[k] >nni- - 2 B. B. B B j=O BI Bi-_+1 i-f- Bi Bi-L+1 jf0 L-1 Bi j[ek] j + Pr -[e] < n. - B..[Ck] (F.33) -i I-k I |kf - 1 j0= - iJ -kj j/o For instance, for L = 2, (F.33) becomes + [Ek1 [ok] - 1I P= 2 ( o + ( ~) (F 34) [!k] +I Ek] Ek [ -[k - o...1) ~+.(.... + O) (F.35) Lucky [22] has given an approximation for (F. 33) which, unfortunately, is valid only for very small errors. He assumes that the quantity on the right of the inequalities in (F. 33) is Gaussian with mean zero and variance a. Then (F.33) becomes ([!k]) ([ Ek] a (F.36) For example, if L = 2 and [ek] = [Ek] = 6, then comparison of ~~6:k0 (F. 34 - F. 35) with (F. 36) reveals that (F. 36) is reasonably a c curate only for a <. Thus, (F.36) is valid only when evaluating the probability of upcount when the error has reached its smallest value

$\delta$ and $\delta$ is much smaller than the standard deviation of the noise.

Once $P_\ell$ has been evaluated, by whatever means, the probability of an incorrect increment is easily approximated using Lemmas F1 - F3. As before, we assume that the elements of the sum (5.6.45) are independent random variables. Letting

$$z_{\ell i} = \mathrm{sgn}\bigl[-B_{i-\ell}\,e_i\bigr] \qquad (F.37)$$

$z_{\ell i}$ is either $+1$ or $-1$ with probability $P_\ell$ or $(1-P_\ell)$ respectively. Let us assume $P_\ell \ne 1/2$, so that $E(z_{\ell i}) \ne 0$. The moment generating function of $z_{\ell i}$ is

$$M(t) = P_\ell\,e^t + (1-P_\ell)\,e^{-t} \qquad (F.38)$$

and $M(t_0) = 1$ for $t_0 \ne 0$ implies that

$$t_0 = \ln\left(\frac{1-P_\ell}{P_\ell}\right) \qquad (F.39)$$

Likewise, we can calculate the mean and variance of $z_{\ell i}$ as

$$E(z_{\ell i}) = 2P_\ell - 1 \qquad (F.40)$$

$$\mathrm{Var}(z_{\ell i}) = 4P_\ell(1-P_\ell) \qquad (F.41)$$

Now, as long as $R_t$ is an integer, (F.20) is an exact equation (not an approximation) and (F.21) becomes

$$\Pr\{E_\ell(k^*) = R_t\}\,e^{t_0R_t} + \bigl(1 - \Pr\{E_\ell(k^*) = R_t\}\bigr)\,e^{-t_0R_t} = 1 \qquad (F.42)$$

and substituting (F.39) in (F.42),

$$\Pr\{E_\ell(k^*) = R_t\} = \frac{1}{1 + \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}} \qquad (F.43)$$

$$\Pr\{E_\ell(k^*) = -R_t\} = \frac{\left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}}{1 + \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}} \qquad (F.44)$$

Then, using (F.43 - F.44), the mean of $E_\ell(k^*)$ is

$$E\bigl[E_\ell(k^*)\bigr] = R_t\,\frac{1 - \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}}{1 + \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}} \qquad (F.45)$$

from which it follows, from Lemma F3, that

$$E(k^*) = \frac{R_t}{2P_\ell - 1}\cdot\frac{1 - \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}}{1 + \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}} \qquad (F.46)$$

Also, using (F.27), we can calculate the variance of $k^*$,

$$\mathrm{Var}(k^*) = \frac{4P_\ell(1-P_\ell)\,R_t}{(2P_\ell-1)^3}\cdot\frac{1 - \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}}{1 + \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}} - \frac{R_t^2}{(2P_\ell-1)^2}\left[1 - \left(\frac{1 - \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}}{1 + \left(\dfrac{1-P_\ell}{P_\ell}\right)^{R_t}}\right)^2\right] \qquad (F.47)$$
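Since the quantized test is a simple $\pm1$ random walk stopped at $\pm R_t$, (F.43) and (F.46) can be checked by direct simulation. In the sketch below the upcount probability $P_\ell$ and the threshold $R_t$ are illustrative assumptions; the predictions follow the formulas above with $\theta = (1-P_\ell)/P_\ell$.

```python
import random

rng = random.Random(3)
P, Rt = 0.6, 5                  # illustrative upcount probability, threshold
theta = (1 - P) / P
pred_up = 1 / (1 + theta ** Rt)                                   # (F.43)
pred_len = (Rt / (2 * P - 1)) * (1 - theta ** Rt) / (1 + theta ** Rt)  # (F.46)

hits, steps, n = 0, 0, 20000
for _ in range(n):
    s, k = 0, 0
    while abs(s) < Rt:           # counting test: +/-1 steps to +Rt or -Rt
        s += 1 if rng.random() < P else -1
        k += 1
    hits += (s == Rt)
    steps += k
print(hits / n, pred_up)         # fraction of correct increments
print(steps / n, pred_len)       # mean test length
```

Because $R_t$ is an integer, no overshoot occurs and the agreement should be essentially exact up to sampling error.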

APPENDIX G

UNIQUENESS OF REPRESENTATION AND MEAN-SQUARE ERROR OF UNSUPERVISED ALGORITHM

The first task of this appendix is to study the solution for $H$ of (5.7.2), which we repeat here,

$$\mu_\ell = H^T\underline{C}_\ell\,H + \sigma^2\,\delta_{\ell 0},\qquad 0 \le \ell \le L-1 \qquad (G.1)$$

The first step is to observe that a quadratic form can always be written in terms of a symmetric matrix, so that (G.1) is unchanged when $\underline{C}_\ell$ is replaced by the symmetric matrix

$$\tfrac{1}{2}\bigl[\underline{C}_\ell + \underline{C}_\ell^T\bigr] \qquad (G.2)$$

which we again denote by $\underline{C}_\ell$. The second important observation is that, for $L = 2$,

$$\underline{C}_0\,\underline{C}_1 = \underline{C}_1\,\underline{C}_0 \qquad (G.3)$$

but that, for general $L > 2$,

$$\underline{C}_\ell\,\underline{C}_m \ne \underline{C}_m\,\underline{C}_\ell \qquad (G.4)$$

The case $L = 2$ is then very special because, since $\underline{C}_0$ and $\underline{C}_1$ commute, they have common eigenvectors and can therefore be simultaneously diagonalized. The fact that they do not commute for

general $L > 2$ indicates that a general solution to (G.1) cannot be found. Let us therefore concentrate on the case $L = 2$. To verify (G.3), it is straightforward to show that

$$\underline{C}_0\,\underline{C}_1 = \underline{C}_1\,\underline{C}_0 = \begin{bmatrix}\dfrac{3C_0+C_2}{2}\,C_1 & C_1^2 + \dfrac{C_0+C_2}{2}\,C_0\\[6pt] C_1^2 + \dfrac{C_0+C_2}{2}\,C_0 & \dfrac{3C_0+C_2}{2}\,C_1\end{bmatrix} \qquad (G.5)$$

As a consequence of (G.5), $\underline{C}_0$ and $\underline{C}_1$ have common eigenvectors. To see this, let $\underline{C}_0$ have eigenvectors $x_1$ and $x_2$ with distinct eigenvalues $\lambda_{11}$ and $\lambda_{12}$. Then,

$$\underline{C}_0\,x_j = \lambda_{1j}\,x_j,\qquad j = 1,2 \qquad (G.6)$$

and pre-multiplying by $\underline{C}_1$,

$$\underline{C}_0\bigl(\underline{C}_1\,x_j\bigr) = \underline{C}_1\,\underline{C}_0\,x_j = \lambda_{1j}\bigl(\underline{C}_1\,x_j\bigr),\qquad j = 1,2,$$

from which we may conclude that

$$\underline{C}_1\,x_j = \lambda_{2j}\,x_j,\qquad j = 1,2, \qquad (G.7)$$

which implies that $x_1$ and $x_2$ are eigenvectors of $\underline{C}_1$.

If we choose $\{x_j\}_{j=1}^2$ to be an orthonormal set of eigenvectors

of $\underline{C}_0$ (and $\underline{C}_1$), then

$$x_1^Tx_2 = 0,\qquad x_j^Tx_j = 1,\quad j = 1,2, \qquad (G.8)$$

and $H$ can be written

$$H = a_1\,x_1 + a_2\,x_2 \qquad (G.9)$$

where

$$a_j = x_j^TH. \qquad (G.10)$$

Substituting (G.9) in (G.1) and utilizing (G.8), (G.1) becomes

$$\mu_0 = \lambda_{11}\,a_1^2 + \lambda_{12}\,a_2^2 + \sigma^2$$
$$\mu_1 = \lambda_{21}\,a_1^2 + \lambda_{22}\,a_2^2 \qquad (G.11)$$

which has a unique solution for $a_1^2$ and $a_2^2$ if and only if

$$\lambda_{11}\,\lambda_{22} \ne \lambda_{21}\,\lambda_{12}. \qquad (G.12)$$

When (G.12) is satisfied, the solution of (G.11) is

$$a_1^2 = \frac{\lambda_{22}\bigl(\mu_0 - \sigma^2\bigr) - \lambda_{12}\,\mu_1}{\lambda_{11}\lambda_{22} - \lambda_{21}\lambda_{12}}$$

$$a_2^2 = \frac{\lambda_{11}\,\mu_1 - \lambda_{21}\bigl(\mu_0 - \sigma^2\bigr)}{\lambda_{11}\lambda_{22} - \lambda_{21}\lambda_{12}} \qquad (G.13)$$

Thus, the solutions of (G.1) are

$$H = \begin{cases}\;\;\,a_1\,x_1 + a_2\,x_2\\ \;\;\,a_1\,x_1 - a_2\,x_2\\ -a_1\,x_1 + a_2\,x_2\\ -a_1\,x_1 - a_2\,x_2\end{cases} \qquad (G.14)$$

and there are four solutions to (G.1). If the single correct solution is to be found, more information on the location of $H$ must be known.

It remains to find the eigenvalues and eigenvectors of $\underline{C}_0$ and $\underline{C}_1$. By straightforward manipulation it can be shown that

$$\lambda_{11} = C_0 + C_1 \qquad (G.15a)$$
$$\lambda_{12} = C_0 - C_1 \qquad (G.15b)$$
$$\lambda_{21} = C_1 + \frac{C_0 + C_2}{2} \qquad (G.15c)$$
$$\lambda_{22} = C_1 - \frac{C_0 + C_2}{2} \qquad (G.15d)$$

$$x_1 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\1\end{bmatrix} \qquad (G.16a)$$
$$x_2 = \frac{1}{\sqrt{2}}\begin{bmatrix}1\\-1\end{bmatrix} \qquad (G.16b)$$

On the basis of (G.15 - G.16), (G.12) becomes

$$\frac{C_1}{C_0} \ne \frac{C_0 + C_2}{2\,C_1} \qquad (G.17)$$

as the condition for a solution to exist.

The next task is to study the mean-square error of estimation algorithm (5.7.14). First, defining the error,

$$e_{\ell,k} = \hat\mu_{\ell,k} - \mu_\ell \qquad (G.18)$$

then (5.7.13) becomes

$$e_{\ell,k} = (1-\alpha)\,e_{\ell,k-1} + \alpha\bigl(x_k\,x_{k-\ell} - \mu_\ell\bigr) \qquad (G.19)$$

from which we get

$$E(e_{\ell,k}) = (1-\alpha)\,E(e_{\ell,k-1}) = (1-\alpha)^{k-\ell}\,e_{\ell,\ell} \qquad (G.20)$$

and, if $|1-\alpha| < 1$,

$$\lim_{k\to\infty}E(e_{\ell,k}) = 0. \qquad (G.21)$$

The solution of (G.19) is

$$e_{\ell,k} = (1-\alpha)^{k-\ell}\,e_{\ell,\ell} + \alpha\sum_{i=\ell+1}^{k}(1-\alpha)^{k-i}\bigl(x_i\,x_{i-\ell} - \mu_\ell\bigr) \qquad (G.22)$$

from which we get

$$e_{\ell,k}^2 = (1-\alpha)^{2(k-\ell)}\,e_{\ell,\ell}^2 + 2\alpha(1-\alpha)^{k-\ell}\,e_{\ell,\ell}\sum_{i=\ell+1}^{k}(1-\alpha)^{k-i}\bigl(x_i\,x_{i-\ell} - \mu_\ell\bigr) + \alpha^2\sum_{i=\ell+1}^{k}\sum_{j=\ell+1}^{k}(1-\alpha)^{2k-i-j}\bigl(x_i\,x_{i-\ell} - \mu_\ell\bigr)\bigl(x_j\,x_{j-\ell} - \mu_\ell\bigr) \qquad (G.23)$$

The second term of (G.23) has mean zero, while for the third term,

$$E\bigl[\bigl(x_i\,x_{i-\ell} - \mu_\ell\bigr)\bigl(x_j\,x_{j-\ell} - \mu_\ell\bigr)\bigr] = E\bigl(x_i\,x_{i-\ell}\,x_j\,x_{j-\ell}\bigr) - \mu_\ell^2 \qquad (G.24)$$

We assume that $B_k$ and $B_j$ are independent for $|k-j| > \Lambda$, so that (G.24) is zero whenever $|i-j| > \Lambda+\ell$. For $|i-j| \le \Lambda+\ell$, we must consider several cases. First, the general formula is

$$E\bigl(x_i\,x_{i-\ell}\,x_j\,x_{j-\ell}\bigr) = E\bigl(H^TB_iB_{i-\ell}^TH\,H^TB_jB_{j-\ell}^TH\bigr) + E\bigl(H^TB_iB_j^TH\bigr)E\bigl(n_{i-\ell}\,n_{j-\ell}\bigr) + E\bigl(H^TB_iB_{j-\ell}^TH\bigr)E\bigl(n_{i-\ell}\,n_j\bigr) + E\bigl(H^TB_{i-\ell}B_j^TH\bigr)E\bigl(n_i\,n_{j-\ell}\bigr) + E\bigl(H^TB_{i-\ell}B_{j-\ell}^TH\bigr)E\bigl(n_i\,n_j\bigr) + E\bigl(n_i\,n_{i-\ell}\,n_j\,n_{j-\ell}\bigr) \qquad (G.25)$$

where all terms involving first and third moments of the noise random process have been omitted since they are automatically zero. The individual terms of (G.25) can be bounded by Lemma D1,

$$H^TB_mB_n^TH \le \|H\|^2\,\|B_mB_n^T\| = \|H\|^2\,|B_m^TB_n| \le L\,b_M^2\,\|H\|^2 \qquad (G.26)$$

where necessary. Thus, for $\ell = 0$,

$$E(x_i^4) \le L^2b_M^4\,\|H\|^4 + 6\sigma^2Lb_M^2\,\|H\|^2 + 3\sigma^4 \qquad (G.27)$$

for $i = j$, and

$$E(x_i^2\,x_j^2) \le L^2b_M^4\,\|H\|^4 + 2\sigma^2Lb_M^2\,\|H\|^2 + \sigma^4 \qquad (G.28)$$

for $i \ne j$. For $0 < \ell \le L-1$,

$$E(x_i^2\,x_{i-\ell}^2) \le L^2b_M^4\,\|H\|^4 + 2\sigma^2Lb_M^2\,\|H\|^2 + \sigma^4 \qquad (G.29)$$

for $i = j$, and

$$E\bigl(x_i\,x_{i-\ell}\,x_j\,x_{j-\ell}\bigr) \le L^2b_M^4\,\|H\|^4 + 2\sigma^2Lb_M^2\,\|H\|^2 \qquad (G.30)$$

for $i - j = \ell$ and $i - j = -\ell$. Now, calculating the expected value of (G.23) with the aid of (G.26 - G.30), while letting $r = 1 - \alpha$,

$$\lim_{k\to\infty}E(e_{0,k}^2) \le \lim_{k\to\infty}\alpha^2\sum_{i=1}^{k}\bigl(4\sigma^2Lb_M^2\,\|H\|^2 + 2\sigma^4\bigr)\,r^{2(k-i)} + \lim_{k\to\infty}\alpha^2\sum_{i=1}^{k}\sum_{\substack{j=1\\ |i-j|\le\Lambda}}^{k}\bigl(L^2b_M^4\,\|H\|^4 + 2\sigma^2Lb_M^2\,\|H\|^2 + \sigma^4\bigr)\,r^{2k-i-j} \qquad (G.31)$$

for $\ell = 0$, and

$$\lim_{k\to\infty}E(e_{\ell,k}^2) \le \lim_{k\to\infty}\alpha^2\sum_{i=\ell+1}^{k}\Bigl[\bigl(\sigma^2Lb_M^2\,\|H\|^2 + \sigma^4\bigr)\,r^{2(k-i)} + 2\sigma^2Lb_M^2\,\|H\|^2\,r^{2(k-i)+\ell}\Bigr] + \lim_{k\to\infty}\alpha^2\sum_{i=\ell+1}^{k}\sum_{\substack{j=\ell+1\\ |i-j|\le\Lambda+\ell}}^{k}L^2b_M^4\,\|H\|^4\,r^{2k-i-j} \qquad (G.32)$$

for $0 < \ell \le L-1$. Dividing up the double sum in (G.31 - G.32) into parts as in (D.35), the result is

$$G(\alpha) = \lim_{k\to\infty}\alpha^2\sum_{i=\ell+1}^{k}\sum_{\substack{j=\ell+1\\ |i-j|\le\Lambda+\ell}}^{k}r^{2k-i-j} = \alpha^2\left[\frac{r^{\Lambda+\ell}\bigl(1-r^{2(\Lambda+\ell)+1}\bigr)}{(1-r^2)(1-r)} + \frac{1 - r^{\Lambda+\ell} - r^{\Lambda+\ell+1}\bigl(1-r^{2(\Lambda+\ell)}\bigr)/(1+r)}{(1-r)^2}\right] = \frac{\alpha}{2-\alpha}\left[1 + \frac{2(1-\alpha)\bigl(1-(1-\alpha)^{\Lambda+\ell}\bigr)}{\alpha}\right] \qquad (G.33)$$

Also, by direct calculation,

$$\lim_{k\to\infty}\alpha^2\sum_{i=\ell+1}^{k}r^{2(k-i)} = \frac{\alpha^2}{1-r^2} = \frac{\alpha}{2-\alpha} \qquad (G.34)$$

$$\lim_{k\to\infty}\alpha^2\sum_{i=\ell+1}^{k-\ell}r^{2k-2i-\ell} = \lim_{k\to\infty}\alpha^2\sum_{i=2\ell+1}^{k}r^{2k-2i+\ell} = (1-\alpha)^\ell\,\frac{\alpha}{2-\alpha} \qquad (G.35)$$

Therefore the final result is, for $\ell = 0$,

$$\lim_{k\to\infty}E(e_{0,k}^2) \le \bigl(4\sigma^2Lb_M^2\,\|H\|^2 + 2\sigma^4\bigr)\,\frac{\alpha}{2-\alpha} + \bigl(L^2b_M^4\,\|H\|^4 + 2\sigma^2Lb_M^2\,\|H\|^2 + \sigma^4\bigr)\,G(\alpha)$$

and for $0 < \ell \le L-1$,

$$\lim_{k\to\infty}E(e_{\ell,k}^2) \le \Bigl[\sigma^2Lb_M^2\,\|H\|^2 + \sigma^4 + 2\sigma^2Lb_M^2\,\|H\|^2\,(1-\alpha)^\ell\Bigr]\frac{\alpha}{2-\alpha} + L^2b_M^4\,\|H\|^4\,G(\alpha)$$

where $G(\alpha)$ is given by (G.33).
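The $L = 2$ moment inversion of (G.9) - (G.14) is short enough to carry out numerically. In the sketch below the scalar data correlations $C_0, C_1, C_2$, the noise variance, and the true channel are illustrative assumptions; in practice $\mu_0$ and $\mu_1$ would be estimated from the received samples, and the computation recovers the channel only up to the four-fold sign ambiguity of (G.14).

```python
import numpy as np

c0, c1, c2 = 1.0, 0.4, 0.1     # scalar data correlations (illustrative)
sigma2 = 0.05                  # noise variance (illustrative)
h = np.array([1.0, 0.6])       # true channel, unknown to the receiver

# Eigenvalues (G.15) and eigenvectors (G.16) of the symmetrized matrices
l11, l12 = c0 + c1, c0 - c1
l21, l22 = c1 + (c0 + c2) / 2, c1 - (c0 + c2) / 2
x1 = np.array([1.0, 1.0]) / np.sqrt(2)
x2 = np.array([1.0, -1.0]) / np.sqrt(2)

# Observation moments (G.1): mu_l = H^T C_l H + sigma^2 delta_l0;
# here they are computed from h for the demonstration
C0m = np.array([[c0, c1], [c1, c0]])
C1m = np.array([[c1, (c0 + c2) / 2], [(c0 + c2) / 2, c1]])
mu0 = h @ C0m @ h + sigma2
mu1 = h @ C1m @ h

# Solve (G.11) for a1^2, a2^2 via (G.13); (G.12)/(G.17) must hold
det = l11 * l22 - l21 * l12
a1 = np.sqrt((l22 * (mu0 - sigma2) - l12 * mu1) / det)
a2 = np.sqrt((l11 * mu1 - l21 * (mu0 - sigma2)) / det)

# The four solutions (G.14); one of them recovers h
for s1 in (+1, -1):
    for s2 in (+1, -1):
        print(s1 * a1 * x1 + s2 * a2 * x2)
```

With these values one of the four printed vectors reproduces the assumed channel, which illustrates why additional prior information on the location of H is needed to resolve the sign ambiguity.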

REFERENCES

1. J. M. Wozencraft and I. M. Jacobs, Principles of Communication Engineering, New York: Wiley, 1967.

2. R. W. Lucky, J. Salz, E. J. Weldon, Principles of Data Communication, New York: McGraw-Hill, 1968.

3. R. G. Gallager, Information Theory and Reliable Communication, New York: Wiley, 1968.

4. C. Goffman and G. Pedrick, A First Course in Functional Analysis, Englewood Cliffs, N.J.: Prentice-Hall, 1965.

5. L. Schiff and J. K. Wolf, "High-Speed Binary Data Transmission over the Additive Band-Limited Gaussian Channel," IEEE Trans. Information Theory, Vol. IT-15, March 1969, pp. 287-295.

6. R. W. Chang and J. C. Hancock, "On Receiver Structures for Channels Having Memory," IEEE Trans. Information Theory, Vol. IT-12, October 1966, pp. 463-468.

7. C. V. Kimball, "Intersymbol Interference in Binary Communication Systems," TR-195, Cooley Electronics Laboratory, University of Michigan, August 1968.

8. R. R. Bowen, "Bayesian Decision Procedure for Interfering Digital Signals," IEEE Trans. Information Theory, Vol. IT-15, July 1969, pp. 506-507.

9. K. Abend, T. J. Harley, B. D. Fritchman, C. Gumacos, "On Optimum Receivers for Channels Having Memory," IEEE Trans. Information Theory, Vol. IT-14, November 1968, pp. 819-820.

10. K. Abend and B. D. Fritchman, "Statistical Detection for Communication Channels with Intersymbol Interference," Proc. IEEE, Vol. 58, May 1970, pp. 779-785.

11. A. J. Viterbi, "Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm," IEEE Trans. Information Theory, Vol. IT-13, April 1967, pp. 260-269.

12. J. K. Omura, "On the Viterbi Decoding Algorithm," IEEE Trans. Information Theory, Vol. IT-15, January 1969, pp. 177-178.

13. A. J. Viterbi, "Review of Statistical Theory of Signal Detection," IEEE Trans. Information Theory, Vol. IT-16, September 1970, p. 653.

14. J. K. Omura, "On Optimum Receivers for Channels with Intersymbol Interference," IEEE International Symposium on Information Theory, Noordwijk, The Netherlands, June 1970.

15. K. Abend, "Compound Decision Procedures for Unknown Distributions and for Dependent States of Nature," in Pattern Recognition, L. N. Kanal, Ed., Washington, D.C.: Thompson, 1968.

16. P. Monsen, "Feedback Equalization for Fading Dispersive Channels," IEEE Trans. Information Theory, Vol. IT-17, January 1971, pp. 56-64.

17. W. W. Peterson, Error Correcting Codes, Cambridge, Mass.: MIT Press, 1961.

18. J. L. Doob, Stochastic Processes, New York: Wiley, 1953.

19. M. R. Aaron and D. W. Tufts, "Intersymbol Interference and Error Probability," IEEE Trans. Information Theory, Vol. IT-12, January 1966.

20. E. Arthurs and H. Dym, "On the Optimum Detection of Digital Signals in the Presence of White Gaussian Noise - A Geometric Interpretation and a Study of Three Basic Data Transmission Systems," IRE Trans. Communication Systems, Vol. CS-10, December 1962, pp. 336-372.

21. R. W. Lucky, "Automatic Equalization for Digital Communication," Bell Sys. Tech. J., Vol. 44, April 1965, pp. 547-588.

22. R. W. Lucky, "Techniques for Adaptive Equalization of Digital Communication," Bell Sys. Tech. J., Vol. 45, February 1966, pp. 255-286.

23. Y. C. Ho and A. K. Agrawala, "On Pattern Classification Algorithms - Introduction and Survey," Proc. IEEE, Vol. 56, December 1968, pp. 457-462.

24. E. A. Patrick, "On a Class of Unsupervised Estimation Problems," IEEE Trans. Information Theory, Vol. IT-14, May 1968, pp. 407-414.

25. S. C. Fralick, "Learning to Recognize Patterns Without a Teacher," IEEE Trans. Information Theory, Vol. IT-13, January 1967, pp. 57-64.

26. C. G. Hilborn and D. G. Lainiotis, "Recursive Computations for the Optimal Tracking of Time-Varying Parameters," IEEE Trans. Information Theory, Vol. IT-14, May 1968, pp. 514-515.

27. C. G. Hilborn and D. G. Lainiotis, "Optimal Unsupervised Learning Multicategory Dependent Hypothesis Pattern Recognition," IEEE Trans. Information Theory, Vol. IT-14, May 1968, pp. 468-470.

28. C. G. Hilborn and D. G. Lainiotis, "Unsupervised Learning Minimum Risk Pattern Classification for Dependent Hypothesis and Dependent Measurements," IEEE Trans. System Science and Cybernetics, Vol. SSC-5, April 1969, pp. 109-115.

29. T. G. Birdsall, "Adaptive Detection Receivers and Reproducing Densities," TR 194, Cooley Electronics Laboratory, University of Michigan, July 1968.

30. D. J. Sakrison, "Stochastic Approximation: A Recursive Method for Solving Regression Problems," in Advances in Communication Systems, A. V. Balakrishnan, Ed., New York: Academic Press, 1966.

31. R. S. Varga, Matrix Iterative Analysis, Englewood Cliffs, N.J.: Prentice-Hall, 1962.

32. Y. T. Chien and K. S. Fu, "On Bayesian Learning and Stochastic Approximation," IEEE Trans. System Science and Cybernetics, Vol. SSC-3, June 1967, pp. 28-38.

33. Y. T. Chien and K. S. Fu, "Stochastic Learning of Time-Varying Parameters in Random Environment," IEEE Trans. System Science and Cybernetics, Vol. SSC-5, July 1969, pp. 237-245.

34. G. N. Saridis, Z. J. Nikolic, K. S. Fu, "Stochastic Approximation Algorithm for System Identification, Estimation, and Decomposition of Mixtures," IEEE Trans. System Science and Cybernetics, Vol. SSC-5, January 1969, pp. 8-15.

35. M. C. Pease, Methods of Matrix Algebra, New York: Academic Press, 1965.

36. R. Bellman, Introduction to Matrix Analysis, New York: McGraw-Hill, 1960.

37. A. Wald, Sequential Analysis, New York: Wiley, 1947.

38. T. S. Ferguson, Mathematical Statistics - A Decision Theoretic Approach, New York: Academic Press, 1967.

39. A. Gersho, "Adaptive Equalization of Highly Dispersive Channels for Data Transmission," Bell Sys. Tech. J., Vol. 48, January 1969, pp. 55-70.

40. P. Monsen, "Linear Equalization for Digital Transmission Over Noisy Dispersive Channels," Doctoral Thesis, Columbia University, New York, June 1970.

41. D. A. George, R. R. Bowen, and J. R. Storey, "An Adaptive Decision Feedback Equalizer," IEEE Trans. Communication Technology, Vol. COM-19, June 1971, pp. 281-293.

42. J. P. Costello and E. A. Patrick, "Unsupervised Estimation of Signals with Intersymbol Interference," IEEE Trans. Information Theory, Vol. IT-17, September 1971, pp. 620-621.

43. A. W. Naylor and G. R. Sell, Linear Operator Theory in Engineering and Science, New York: Holt, Rinehart and Winston, 1971.
