Technical Report No. 168
6137-14-T
A FINITE-MEMORY ADAPTIVE PATTERN RECOGNIZER
by K. B. Irani
Approved by: T. W. Butler, Jr.
for COOLEY ELECTRONICS LABORATORY
Department of Electrical Engineering
The University of Michigan, Ann Arbor, Michigan
Contract No. DA-36-039 AMC-03733(E)
Signal Corps, Department of the Army
Department of the Army Project No. 1P021101A0420102
September 1965

TABLE OF CONTENTS
Page
ACKNOWLEDGEMENTS iv
LIST OF ILLUSTRATIONS v
SUMMARY vi
1. INTRODUCTION 1
2. BACKGROUND 3
3. THE MATHEMATICAL STRUCTURE 5
4. THEOREMS AND USEFUL RESULTS 10
5. DESCRIPTION OF THE PROBLEM 25
6. RESULT OF THE INVESTIGATION 28
6.1 The Adaptive Procedure and the Proof of the Theorem 29
7. ILLUSTRATIVE EXAMPLE 36
8. CONCLUSIONS 43
REFERENCES 44
DISTRIBUTION LIST 45

ACKNOWLEDGEMENTS
The author wishes to acknowledge the valuable assistance he received from several co-workers at the Cooley Electronics Laboratory. He is indebted to Professor A. W. Naylor for many useful suggestions, including the application of the one-dimensional example to the signal detection problem. Professor F. M. Waltz wrote the computer program for the example that is worked out. He was assisted in getting numerical data and plotting graphs by Mr. R. L. Haken. The author has also had some useful discussions with Professor F. N. Bailey. Messrs. J. N. Gittleman, J. A. Colligan, and D. G. Mueller read the manuscript and offered constructive criticisms.

LIST OF ILLUSTRATIONS
Figure  Title  Page
1(a)  Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection.  38
1(b)  Average loss over twenty runs at each stage of the adaptive process.  38
2(a)  Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection.  39
2(b)  Average loss over twenty runs at each stage of the adaptive process.  39
3(a)  Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection.  40
3(b)  Average loss over twenty runs at each stage of the adaptive process.  40
4(a)  Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection.  41
4(b)  Average loss over twenty runs at each stage of the adaptive process.  41
5(a)  Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection.  42
5(b)  Average loss over twenty runs at each stage of the adaptive process.  42

SUMMARY
This report gives an adaptive procedure for selecting a discriminant for a pattern recognizer. No a priori knowledge of the probability density on the observation-space is assumed. Moreover, the pattern recognizer is assumed to have a finite memory. A mathematical model of the problem of pattern recognition is constructed and several theorems are proved. With the help of these theorems, the adaptive procedure is developed. This adaptive procedure is, in effect, a method of using the finite memory efficiently in "training" the pattern recognizer.

1. INTRODUCTION
The problem of pattern recognition is a problem of separating objects into two or more classes. Given an object, a set of measurements is made on this object. Based on the results of these measurements, the object is assigned to one of two or more classes.
Most of the problems of pattern recognition have been reduced to a common mathematical model in which an object is represented by a point in an n-dimensional space (the measurement-space). Each coordinate in the space corresponds to one of the measurements made on an object. By a "pattern" we shall mean a point in the measurement-space which represents an object to be classified. We shall sometimes refer to the measurement-space as the pattern-space. The problem of classification is then reduced to the problem of defining simply-connected or multiply-connected boundaries (discriminants) in the pattern-space. The region on one side of each boundary is associated with one class of patterns, i.e., points in the pattern-space representing one class of objects. If a received pattern falls on the side of a boundary which is associated with a class, then the pattern (with the associated object) is declared to belong to that class; otherwise it is declared not to belong to that class. The boundary may be included with one region or the other. The boundaries are said to be optimally selected if the "cost" (to be defined later) resulting from misclassifying patterns is minimum.
It becomes obvious that the problem of defining boundaries is greatly simplified if the patterns belonging to each class are clustered together in the measurement-space. This simplification can be achieved if the distinguishing features of different classes of objects are recognized and if the measurements made on an object are the measurements of these features. Unfortunately, it is not always possible to directly measure the distinguishing

features of an object. This may be due to the fact that either the distinguishing features of the different classes cannot be recognized, or direct measurements of these features are not possible. Under these circumstances one does the "best" one can in selecting features to be measured.
In this report we shall not concern ourselves with the method of selecting the features. We shall assume that the features to be measured have been selected, and objects are represented by patterns or points in the n-dimensional space, n being the number of features being measured. The problem we are concerned with is the problem of selecting the optimum boundaries which separate different classes of patterns.

2. BACKGROUND
Considerable work has been done in determining the optimum boundaries in the measurement-space. A large percentage of this work is devoted to linear discriminants (hyperplane boundaries) in the measurement-space. Hu (Refs. 1, 2, 3, 4) and Singleton (Ref. 5) determine conditions under which two classes of binary numbers are linearly separable (i.e., can be separated completely by a linear discriminant). Highleyman (Ref. 6) determines the optimum linear discriminants for the case of not necessarily linearly separable classes. Cooper (Ref. 7) investigates the problems in which a linear discriminant is optimum. Sebestyen (Ref. 8) shows that the correlation and regression line techniques are also examples of linear discriminants.
Discriminants other than linear are also considered in the literature. For example, Cooper (Ref. 9) considers problems for which a hypersphere forms the optimum boundary of separation between two classes of patterns. Transformations are sometimes applied to the measurement-space in such a way that the patterns belonging to each class are brought together or in such a way that in the new space the boundaries have simple geometric forms. Sebestyen (Ref. 8) considers several linear and nonlinear transformations. In some cases of linear transformations, he identifies the eigenvectors of the transformations with the features of the objects and the eigenvalues with the weights to be attached to these features.
For the cases where the occurrence of the patterns is stochastic, the theory of pattern recognition falls within the framework of statistical inference and decision theory. As Cooper (Ref. 10) shows, if the probability densities are known, classifications can be done on the basis of one of the four classical criteria, viz., Bayes, maximum-likelihood, minimax, and Neyman-Pearson. All four criteria are optimally satisfied by a decision rule in which the test statistic is a likelihood ratio which is compared to a threshold k. Chow (Ref. 11) describes a functional diagram of such an optimum system.
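To make the likelihood-ratio rule mentioned above concrete, the following is a minimal sketch (not from the original report) in Python. It assumes two hypothetical one-dimensional Gaussian class densities and an illustrative threshold k; with k equal to the ratio of prior probabilities this is the Bayes rule for a zero-one cost.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian; stands in for a known class density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def likelihood_ratio_classify(x, k=1.0):
    """Assign x to class A if p(x|A)/p(x|B) exceeds the threshold k, else to class B.

    The class densities and the value of k are illustrative assumptions,
    not quantities taken from the report.
    """
    p_a = gaussian_pdf(x, mean=1.0, var=1.0)   # assumed density of class A patterns
    p_b = gaussian_pdf(x, mean=-1.0, var=1.0)  # assumed density of class B patterns
    return "A" if p_a / p_b > k else "B"

if __name__ == "__main__":
    for x in (-2.0, 0.1, 1.5):
        print(x, likelihood_ratio_classify(x))
```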

In actual practice, however, the exact probability densities, more often than not, are not known a priori. If the probability densities are known except for a few parameters, the optimum discriminant can be determined by an adaptive process. See, for example, Cooper (Ref. 9). In many problems of practical importance, however, the probability densities are completely unknown. Discriminants have to be determined from a few typical samples with known classifications. This is called the "learning mode" of pattern recognition. In such cases, it is natural to select discriminants which are optimum with respect to the given samples. As illustrated by Sebestyen (Ref. 8), these "optimum" discriminants may turn out to be very poor as far as the unknown patterns are concerned. The remedy is to continue with the learning mode even after the "recognition mode" (i.e., the mode of operation in which a discriminant is used to classify unknown patterns) is started. The process is thus made adaptive by continuously improving upon the discriminant employed. This is possible if the correct classification of a pattern is made known by some outside agency after a classification decision is made about it. The process converges to the optimum discriminant or discriminants (if there are more than two classes of patterns) if the system has infinite storage capacity (infinite memory).
Infinite memory is not required if the patterns are known to be linearly separable. This has been demonstrated by Widrow and Hoff (Ref. 13) with the Adaline and by Rosenblatt (Ref. 14) with the Perceptron. The convergence is proved mathematically, among others, by Novikoff (Ref. 15) and Albert (Ref. 16).
In this report, we shall consider a problem in which no prior knowledge of probability densities is assumed, the patterns are not necessarily linearly separable, the discriminants are not necessarily linear, and only a finite storage space is available. We give an adaptive procedure for obtaining the optimum discriminant at any time, and for sequentially "improving" the discriminant. For simplicity, we restrict ourselves to the case in which a pattern is to be classified into one of two classes.

3. THE MATHEMATICAL STRUCTURE
Assuming that there are n measurements made on an object and that the values of these measurements are real numbers, a point ω in the real n-dimensional space R^n is a pattern, and the space R^n is the pattern-space. Objects represented by these patterns belong either to a class A or to a class B. Due to noise, either in the measurements or in the system through which these measurements are transmitted, a point in R^n sometimes actually corresponds to an object of Class A and at other times to an object of Class B. We shall refer to the product space R^n × {A, B} as the observation-space, and a point (ω, ·), which represents an object and its true classification, as an observation. Thus a pattern is the first of the two elements of the ordered set which we call an observation. The probability density on the observation-space will be denoted by p(ω, ·). Sometimes, for convenience, we shall drop the argument (ω, ·) and denote this probability density by p. At this time we want to point out that in the problem we shall be considering, we shall assume that p(ω, ·) is not known.
Let Γ = {c_1, c_2, ..., c_M} be the set of all discriminants (boundaries in the pattern-space R^n) under consideration. We assume that M is finite. Each discriminant c_i divides the space R^n into two disjoint sets Ω_A^i and Ω_B^i. The points in Ω_A^i are classified as patterns belonging to Class A, and those in Ω_B^i are classified as patterns belonging to Class B.
If the probability density p on the observation-space were known, one plausible way to define the "best" discriminant is to say that it is a discriminant which minimizes a given loss function L(c_i, p). One such loss function is given by
L(c_i, p) = ∫_{Ω_B^i} p(ω, A) dω + ∫_{Ω_A^i} p(ω, B) dω.
This loss function assumes that the loss is one if a pattern is incorrectly classified, and is

zero if it is correctly classified. If we define E_i as the event that the discriminant c_i classifies an unknown pattern incorrectly, then L(c_i, p), defined above, is the probability of the event E_i. In the future when we refer to the loss function, we shall mean the loss function defined above.
Since p(ω, ·) is not known, we cannot use the simple criterion of minimizing the loss function L(c_i, p) to select the best discriminant. We now give an intuitive explanation of the criterion we actually use for our case.
It is not necessary to know the probability density p(ω, ·) completely in order to determine the best discriminant defined above. It is not even necessary to know the exact values L(c_i, p) of the loss function. It is sufficient to know the correct ordering of the discriminants according to the values of the loss function, or, what amounts to the same thing, the ordering of the events E_i according to their probabilities. In the absence of any information, all possible orderings (which are M! in number)¹ are assumed equally probable. The absence of any information about p(ω, ·) implies that the probability distribution over the space of all possible probability density functions on the observation-space is uniform. We consider a more general situation in which the probability distribution over the space of all possible probability density functions on the observation-space is not necessarily uniform, yet the M! orderings of the events E_i remain equally probable in the absence of any other information.
To describe how such a situation may arise, we first define a class of sets A_k, k = 1, 2, ..., M. Each set A_k is a set of M(M−1)⋯(M−k+1)/k! positive numbers in the range [0, 1]. There is a one-to-one correspondence between the elements of A_k and the subsets of k distinct elements of the set A_1. The numbers in the various sets are compatible in the sense that if the numbers in A_1 represent the probabilities of M events, each number in the set A_k represents the probability of the intersection of the corresponding k events. With this definition of the Class {A_k} we describe the general situation we referred to before as follows:
¹We consider M! distinct orderings, even though it may happen that the probabilities of two or more events are equal.

A one-to-one mapping is known to exist between {E_i} and A_1 such that if a_j in A_1 corresponds to E_i, then a_j is the probability of E_i. Though such a mapping is known to exist, the actual mapping is unknown. In other words, knowing {A_k}, one knows the numbers which are the probabilities of the various events E_i and of their intersections, but one does not know which number in A_1 is the probability of which event. As such, all the M! orderings of the events E_i still remain equally probable. What makes the situation more general than that implied by the complete lack of information about the probability density p(ω, ·) is that we assume that the probability distribution over all possible values of {A_k} is not necessarily uniform.
In such a situation, one way of selecting a discriminant would be to select a discriminant conditional on knowing {A_k} and then somehow weigh the discriminants for various values of {A_k} according to the probability distribution on {A_k}. Fortunately, it turns out this does not become necessary, because the way we select a discriminant conditional on knowing {A_k} is independent of the assumed value of {A_k}. From now on we shall assume that {A_k} is known.
Corresponding to a given {A_k} there is a set ℬ of probability densities on the observation-space which lead to the probabilities of the various events E_i, and of the intersections of these events, as given by {A_k}. This set is divided into M! subsets B_1, B_2, ..., B_M!. Each B_i is a set of probability densities on the observation-space all of which correspond to one ordering of the events E_i. The probability that the actual probability density p(ω, ·) belongs to any subset B_j, conditional on knowing {A_k}, is 1/M! when no other information is available. At the beginning, therefore, there is no reason to consider that one discriminant is "better" than any other.
At any future time in the process when some information about the observation-space is available, a discriminant (to classify unknown patterns) is selected on the basis of the available information. The selection rule for the discriminant is called a decision function. A decision function is a "mapping" of the space of available information into the set Γ of the discriminants. In the entire text, the term "mapping" as applied to a decision function is intended to include the case when the values of a function are not deterministic but probabilistic.

Since, for our problem, we are assuming that we have a limited amount of memory available, we can store only a limited amount of the available information. (We shall assume that the amount of memory available is less than that required to store M numbers. This is necessary because otherwise, as will become apparent later, the process becomes trivial.) The available information can be in the form of observations (i.e., patterns with their true classifications) or of discriminants that were selected in the past, together with the numbers of errors these discriminants made in classifying patterns (or the numbers of patterns correctly classified by these discriminants).
For convenience, we break up the space of information into subspaces and shall refer to the restrictions of a decision function on different subspaces as decision functions of different types. What these different subspaces are, and what the corresponding different types of decision functions are, will become clear in the next section. For the time being we shall concentrate on one of the subspaces, which we shall denote by Z. A decision function δ of the corresponding type "maps" the subspace Z into the set Γ of the discriminants.
Instead of defining the best discriminant as we do when the probability density on the observation-space is known, we now define the best or the optimum decision function of Type Z. Towards that end we first define a risk function R, whose value for a given decision function δ and a given subset B_i is given by
R(δ, B_i) = Σ_{j=1}^{M} L(c_j, p) P(Z_j | p ∈ B_i).
Here Z_j is the set of all elements of the information subspace Z which are mapped by δ into the element c_j of the set Γ, and P(Z_j | p ∈ B_i) is the probability of Z_j when the probability density p(ω, ·) on the observation-space is an element of the subset B_i. It is well to remember that the value of R(δ, B_i) actually depends on which element of B_i is the probability density p(ω, ·). However, we have chosen not to depict this fact in the symbol for the risk function in order to avoid awkwardness in notation. Since initially the probability density p(ω, ·) on the observation-space is by assumption equally likely to come from any one subset B_i of the set ℬ, we define the average risk R(δ) as

R(δ) = (1/M!) Σ_{i=1}^{M!} R(δ, B_i).
Here again we would like to point out that, though we have chosen not to show it symbolically, R(δ) is a function of the particular elements of the subsets B_i, i = 1, 2, ..., M!, which are assumed to be the probability density p(ω, ·). For a fixed set of these elements, we define an optimum decision function δ* by the following inequality:
R(δ*) ≤ R(δ),
where δ is any decision function of Type Z which maps the subspace of information Z into the set Γ. Though we have defined the optimum decision function for a fixed set of M! elements, one from each subset B_i, i = 1, 2, ..., M!, it turns out that R(δ*) does not depend on this set but only on the Class {A_k}, while the optimum decision function of Type Z is independent even of {A_k}.
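The objects defined in this section can be made concrete with a small simulation. The sketch below (a Python illustration, not part of the original report) assumes one-dimensional patterns, two Gaussian class densities, and a hypothetical finite set Γ of threshold discriminants; it estimates the loss L(c_i, p), i.e., the probability of the error event E_i, by sampling observations. In the report itself p(ω, ·) is unknown to the recognizer, so such an estimate is available only to an outside observer.

```python
import random

# Hypothetical setup (not from the report): one-dimensional patterns, two classes,
# and threshold discriminants c_i that call a pattern "A" when it exceeds c_i.
THRESHOLDS = [t * 0.5 for t in range(-6, 7)]   # the finite set Gamma
P_A = 0.5                                      # assumed probability of class A

def draw_observation(rng):
    """Draw one observation (omega, label) from an assumed density p(omega, .)."""
    if rng.random() < P_A:
        return rng.gauss(1.0, 1.0), "A"
    return rng.gauss(-1.0, 1.0), "B"

def misclassifies(threshold, omega, label):
    """Event E_i: the discriminant 'threshold' classifies the pattern incorrectly."""
    decision = "A" if omega > threshold else "B"
    return decision != label

def estimate_loss(threshold, n=20000, seed=0):
    """Monte-Carlo estimate of L(c_i, p) = P(E_i), the probability of misclassification."""
    rng = random.Random(seed)
    errors = sum(misclassifies(threshold, *draw_observation(rng)) for _ in range(n))
    return errors / n

if __name__ == "__main__":
    losses = {c: estimate_loss(c) for c in THRESHOLDS}
    best = min(losses, key=losses.get)
    print("estimated best discriminant:", best, "with loss", round(losses[best], 3))
```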

4. THEOREMS AND USEFUL RESULTS
In this section we consider decision functions of various types, i.e., the restrictions of decision functions which map various subspaces of the information space into the set Γ of discriminants. We also state and prove theorems concerning optimum decision functions of various types.
Recall that by an observation (ω, ·) we mean a pattern and its correct classification. A set of m observations will be called a sample of m observations, and will be denoted by S^m. We denote the cardinality of an arbitrary set T by ρ(T). Thus ρ(S^m) = m. The symbol ρ(S^m ∩ E_i) then denotes the number of errors that the discriminant c_i makes in classifying the m patterns of the sample S^m. We also define ρ°(S^m) and Γ°(S^m) as follows:
ρ°(S^m) = min_{E_i} ρ(S^m ∩ E_i)
and
Γ°(S^m) = {c_i | ρ(S^m ∩ E_i) = ρ°(S^m)}.
Thus ρ°(S^m) is the minimum number of errors made by a discriminant in classifying the m patterns of a sample S^m, and Γ°(S^m) is the set of all the discriminants which make the least number of errors in classifying the m patterns of the sample S^m. An arbitrary element of Γ°(S^m) will be denoted by c°(S^m).
For a fixed m, a decision function δ_I^m is a mapping of the space 𝒮^m of all S^m's into Γ. Here we have defined one type of decision function.
Theorem 1: δ_I^{m*}(S^m) = c°(S^m) for S^m ∈ 𝒮^m.

Remark: To put it in words, the theorem states that, given a sample of m observations, the best decision that can be made about selecting a discriminant is to select any one of the discriminants which make the least number of errors in classifying the m patterns of the sample S^m.
Proof: We shall prove this theorem first for two special cases before proving it for the most general case.
Case (i):
Γ = {c_1, c_2}, ℬ = {B_1, B_2},
P(E_1 | p ∈ B_1) = P(E_2 | p ∈ B_2) = a,
P(E_2 | p ∈ B_1) = P(E_1 | p ∈ B_2) = 1 − a,
P(E_1 ∩ E_2 | p ∈ B_1) = P(E_1 ∩ E_2 | p ∈ B_2) = 0.
Thus, for this case, A_1 = {a, 1−a} and A_2 = {0}. This is the case of only two discriminants, and these two discriminants are such that if one classifies a pattern correctly, the other does not.
Let a decision function δ (for convenience we shall drop the superscript m and the subscript I for this proof) be such that
F_11 = P[δ(S^m) = c_1 | p ∈ B_1] = probability that the value of δ is c_1 if the probability density p(ω, ·) on the observation-space is some element of the subset B_1,
F_22 = P[δ(S^m) = c_2 | p ∈ B_2] = probability that the value of δ is c_2 if the probability density p(ω, ·) on the observation-space is some element of the subset B_2.
Then
R(δ, B_1) = a F_11 + (1−a)(1−F_11)
and
R(δ, B_2) = (1−a)(1−F_22) + a F_22.

Consequently, since for this case
R(δ) = ½ R(δ, B_1) + ½ R(δ, B_2),
we have
R(δ) = (1−a) + ½ (2a−1)(F_11 + F_22).
If a = 1 − a, any decision function is optimum and the theorem is proved. If a ≠ 1 − a, the optimum decision function minimizes F_11 + F_22. Let us assume that a > (1−a). The proof for the case (1−a) > a should follow similarly.
Let us look at this problem from the point of view of hypothesis testing (Ref. 17). Let the hypothesis to be tested be
H: probability of E_2 > probability of E_1,
against the alternate
K: probability of E_1 > probability of E_2.
If the hypothesis is accepted when δ(S^m) = c_1, and rejected when δ(S^m) = c_2, then F_11 is the level of significance of the test, and F_22 = 1 − β, where β is the power of the test. According to the fundamental lemma of Neyman and Pearson (Ref. 17), there is a most powerful test for which F_11 is minimum for a given F_22. If the result of that lemma is applied to the testing of the above hypothesis, then the most powerful test is given by the following: For a fixed F_22, F_11 is minimum if
δ(S^m) = c_1 if ρ(S^m ∩ E_2) > k ρ(S^m ∩ E_1)
       = c_2 if ρ(S^m ∩ E_2) < k ρ(S^m ∩ E_1)
       = c_1 with probability γ, and c_2 with probability (1−γ), if ρ(S^m ∩ E_2) = k ρ(S^m ∩ E_1).
Here k ≥ 0 and 0 ≤ γ ≤ 1. The dependence of F_22 on k and γ is given by
F_22 = P[ρ(S^m ∩ E_2) < k ρ(S^m ∩ E_1) | p ∈ B_2] + (1−γ) P[ρ(S^m ∩ E_2) = k ρ(S^m ∩ E_1) | p ∈ B_2]

if k is such that r = km/(1+k) is an integer; otherwise ρ(S^m ∩ E_2) cannot be equal to k ρ(S^m ∩ E_1), and so
F_22 = P[ρ(S^m ∩ E_2) < k ρ(S^m ∩ E_1) | p ∈ B_2].
The various probabilities involved in the equations for F_22 can be easily calculated by applying binomial distributions. Thus
F_22 = Σ_{i=0}^{r−1} (m choose i) a^i (1−a)^{m−i} + (1−γ)(m choose r) a^r (1−a)^{m−r}
if r is an integer; otherwise the second term is zero and the summation in the first term is taken up to the largest integer less than r. For a fixed value of F_22, the minimum value of F_11, which is obtained by using the most powerful test described above, is given by
F_11 = P[ρ(S^m ∩ E_2) > k ρ(S^m ∩ E_1) | p ∈ B_1] + γ P[ρ(S^m ∩ E_2) = k ρ(S^m ∩ E_1) | p ∈ B_1]
     = Σ_{i=r+1}^{m} (m choose i) (1−a)^i a^{m−i} + γ (m choose r) (1−a)^r a^{m−r}
if r is an integer; otherwise the second term in both of the above equations is zero, and the summation in the first term of the second equation starts with the smallest integer that is larger than r.
Using the above equations for F_11 and F_22, it can be easily deduced that F_11 + F_22 is minimum, and the resulting decision function is optimum, if k = 1 and γ takes any value in the closed interval [0, 1]. Thus for the case (i)
δ*(S^m) = c_1 if ρ(S^m ∩ E_2) > ρ(S^m ∩ E_1)
        = c_2 if ρ(S^m ∩ E_1) > ρ(S^m ∩ E_2)
        = c_1 with probability γ, and c_2 with probability (1−γ), if ρ(S^m ∩ E_1) = ρ(S^m ∩ E_2),

where γ is any number in the closed interval [0, 1]. In other words, δ*(S^m) = c°(S^m). This completes the proof for the case (i).
Case (ii):
Γ = {c_1, c_2}, ℬ = {B_1, B_2},
P(E_1 | p ∈ B_1) = P(E_2 | p ∈ B_2) = a + c,
P(E_2 | p ∈ B_1) = P(E_1 | p ∈ B_2) = b + c,
P(E_1 ∩ E_2 | p ∈ B_1) = P(E_1 ∩ E_2 | p ∈ B_2) = c,
a + b + c < 1.
Thus, for this case, A_1 = {a+c, b+c} and A_2 = {c}. This is the case when the two events E_1 and E_2 are not necessarily disjoint and all-inclusive, i.e., the two discriminants c_1 and c_2 are such that there are patterns that can be correctly classified by both discriminants, and there are patterns which are incorrectly classified by both discriminants.
For a decision function δ, let F_ij^{rt} be the probability that a sample S^m which satisfies the conditions
ρ[(S^m ∩ E_1) ∩ (S^m ∩ E_2)] = r,
ρ[(S^m ∩ E_1) ∪ (S^m ∩ E_2)] = m − t,
is mapped into c_j when the probability density p(ω, ·) on the observation-space is an element of the subset B_i, i = 1, 2. Notice that r is the number of patterns in the sample S^m which are incorrectly classified by both the discriminants c_1 and c_2, and t is the number of patterns which are correctly classified by both c_1 and c_2. For this case
R(δ) = ½ [R(δ, B_1) + R(δ, B_2)] = ½ Σ_{r,t} [(a+c) F_11^{rt} + (a+c) F_22^{rt} + (b+c) F_12^{rt} + (b+c) F_21^{rt}].

The summations are taken over all possible values of r and t. But
F^{rt} ≜ F_11^{rt} + F_12^{rt} = F_21^{rt} + F_22^{rt}.
So
R(δ) = Σ_{r,t} F^{rt} { (b+c) + ½ (a−b) [ (F_11^{rt}/F^{rt}) + (F_22^{rt}/F^{rt}) ] }.
If a = b, any decision function is optimum, and the theorem is proved for Case (ii). If a ≠ b, the optimum decision function minimizes [(F_11^{rt}/F^{rt}) + (F_22^{rt}/F^{rt})] for every r and t.
Minimizing [(F_11^{rt}/F^{rt}) + (F_22^{rt}/F^{rt})], for fixed r and t, is similar to minimizing F_11 + F_22 in the case (i), where a sample consists of m − (r+t) observations, each pattern being correctly classifiable by one and only one of the two discriminants c_1 and c_2. This is also evident from the fact that the patterns which are either correctly classifiable, or incorrectly classifiable, by both the discriminants do not help in the selection of one discriminant over the other. The decision function which minimizes [(F_11^{rt}/F^{rt}) + (F_22^{rt}/F^{rt})], for fixed r and t, is the optimum decision function of the case (i) when applied to the sample of m − (r+t) observations which are correctly classifiable by one and only one discriminant. However, since adding the remaining (r+t) observations does not change the value of the decision function, we conclude that for fixed r and t, the optimum decision function of the case (i), when applied to the complete sample S^m of the case (ii), minimizes [(F_11^{rt}/F^{rt}) + (F_22^{rt}/F^{rt})]. Since this is true for each r and t, the optimum decision function of the case (i), when applied to the complete sample S^m of the case (ii), minimizes R(δ) for the case (ii). Thus the optimum decision function of case (i) is also the optimum decision function for the case (ii). This completes the proof of the theorem for the case (ii).
Case (iii): This is the most general case.
Γ = {c_1, c_2, ..., c_M}, ℬ = {B_1, B_2, ..., B_M!}.

For an arbitrary but fixed class {A_k}, the B_i's are the subsets of probability densities on the observation-space as defined before. For this case, the theorem can be proved by applying the result of Case (ii) successively. Thus, according to the case (ii), given a sample S^m, if the choice is between c_i and c_j, the optimum decision function has the value c_i
with probability 1 if ρ(S^m ∩ E_i) < ρ(S^m ∩ E_j),
with probability 0 if ρ(S^m ∩ E_i) > ρ(S^m ∩ E_j),
with probability γ if ρ(S^m ∩ E_i) = ρ(S^m ∩ E_j),
where 0 ≤ γ ≤ 1. Notice that the last part of the above statement is equivalent to saying: select c_i or c_j arbitrarily if ρ(S^m ∩ E_i) = ρ(S^m ∩ E_j). This optimum selection between c_i and c_j is true for all pairs of i and j.
Combining the results of all such pairs of i and j, we conclude that, given a sample S^m, the optimum decision function selects c_i from the set Γ
with probability 1 if ρ(S^m ∩ E_i) < ρ(S^m ∩ E_j) for all j ≠ i,
with probability 0 if ρ(S^m ∩ E_i) > ρ(S^m ∩ E_j) for any j ≠ i,
with probability γ if ρ(S^m ∩ E_i) < ρ(S^m ∩ E_j) for some j ≠ i, and ρ(S^m ∩ E_i) = ρ(S^m ∩ E_j) for the rest of the j ≠ i,
where 0 ≤ γ ≤ 1. This is equivalent to saying that the value of the optimum decision function for a given sample S^m is c°(S^m). This proves Theorem 1.
Corollary: R(δ_I^{m*}) ≥ R(δ_I^{(m+i)*}) for i ≥ 0.
The proof of the corollary follows by observing that, given a sample S^{m+i} of m+i observations, one way of selecting a discriminant is to disregard i of the observations

and use the decision function δ_I^{m*} on the remaining m observations. But this is not necessarily the best decision function δ_I^{(m+i)*}. The equality holds for the trivial case when all the events E_i are equally probable.
We now define another type of decision function. For this decision function, the domain is (α, c_α) × 𝒮^m, where α is a fixed positive integer less than or equal to m, and c_α is a fixed discriminant. The number α and the discriminant c_α indicate that sometime in the past there existed a sample S^m for which ρ°(S^m) = α and c°(S^m) = c_α, i.e., the minimum number of errors made by a discriminant in classifying the m patterns of that sample was α, and one such discriminant which made α errors was c_α. This type of decision function will be denoted by δ_II^{m/α}. We shall further assume that for a given pair (α, c_α), the decision function δ_II^{m/α} first maps 𝒮^m into {[α_i(S^m), c_i]}, where α_i(S^m) denotes the number of errors made by the discriminant c_i in classifying the m patterns of a sample S^m, and then maps (α, c_α) × {[α_i(S^m), c_i]} into Γ. With the definition of this new type of decision function we prove the following:
Theorem 2:
δ_II^{m/α*}(S^m) = c_α if ρ°(S^m) > α
              = c°(S^m) if ρ°(S^m) < α
              = c_α with probability γ, and c°(S^m) with probability (1−γ), if ρ°(S^m) = α.
Here 0 ≤ γ ≤ 1.
Remark: To put it in words, the theorem states that if the minimum number of errors that a discriminant can make in classifying the m patterns of a given sample S^m is greater than α, then it is better to select the discriminant c_α which made only α errors while classifying the m patterns of some past sample. If the minimum number of such errors is less than α, then it is better to select any discriminant which makes the minimum number of errors while classifying the m patterns of the present sample S^m. If, however, the minimum

number of errors is equal to α, either of the two discriminants is selected arbitrarily.
Before we prove this theorem, we shall prove three lemmas.
Lemma 1: Given two disjoint subspaces 𝒵_1 and 𝒵_2 of the total information space, a decision function δ which maps 𝒵_1 ∪ 𝒵_2 into Γ is optimum if and only if the restriction of δ on 𝒵_i is optimum for i = 1, 2.
Remark: There are three types of decision functions involved in the above lemma: one that maps the subspace of information 𝒵_1 into Γ, a second that maps the subspace of information 𝒵_2 into Γ, and a third which maps the union 𝒵_1 ∪ 𝒵_2 of the two subspaces into Γ. The lemma asserts that a decision function of this third type is optimum if and only if its restriction on the subspace 𝒵_1 is an optimum decision function of the first type and its restriction on the subspace 𝒵_2 is an optimum decision function of the second type.
Proof: Let R_i(δ), i = 1, 2, be the average risk if the decision function δ is restricted to the subspace 𝒵_i. Let F_i, i = 1, 2, be the probability that the available information z is an element of the subspace 𝒵_i when the probability density p(ω, ·) on the observation-space is any one of the M! (equally probable) possibilities used in evaluating R_i(δ). Then the average risk R(δ) of the decision function δ (unrestricted to either subspace 𝒵_1 or 𝒵_2) is given by
R(δ) = F_1 R_1(δ) + F_2 R_2(δ).
The proof of the lemma follows immediately from the fact that a decision function δ, which maps 𝒵_1 ∪ 𝒵_2 into Γ, is optimum if and only if it minimizes R_i(δ) for i = 1, 2.
Corollary: Let 𝒮^{m/α} ≜ {S^m | ρ°(S^m) = α}. Let δ_I^{m/α}¹ denote a type of decision function that maps 𝒮^{m/α} into Γ. Then δ_I^{m/α*} is the restriction of δ_I^{m*} on 𝒮^{m/α}.
¹The distinction between the two types of decision functions, δ_I^{m/α} and δ_II^{m/α}, should be recognized. The domain of δ_I^{m/α} is the set 𝒮^{m/α}, i.e., the set of all samples S^m for which the minimum number of errors a discriminant can make is α. The domain of the decision function δ_II^{m/α} is the product space (α, c_α) × 𝒮^m, where α and c_α are fixed.

Lemma 2: R(δ_I^{m/α*}) ≥ R(δ_I^{m/(α−1)*}), where m ≥ α ≥ 1. The equality sign holds if and only if all the events E_i are equally probable, i.e., all the M numbers in the set A_1 are equal.
Remark: This lemma suggests that if there is a choice between two discriminants, one of which makes (α−1) errors in classifying the m patterns of a sample S_1^m for which the least number of errors a discriminant can make is (α−1), and another which makes α errors in classifying the m patterns of another sample S_2^m for which the least number of errors a discriminant can make is α, then it is "better" to select the first of the two discriminants for the nontrivial case when all the events are known to be not equally probable. In the trivial case either of the two discriminants can be selected arbitrarily.
Proof: As in the case of Theorem 1, we shall prove this lemma for two special cases before proceeding to the proof of the third and most general case. For this proof, we shall drop the subscript I and the superscript * from the optimum decision functions.
Case (i):
Γ = {c_1, c_2}, ℬ = {B_1, B_2},
P(E_1 | p ∈ B_1) = P(E_2 | p ∈ B_2) = a,
P(E_2 | p ∈ B_1) = P(E_1 | p ∈ B_2) = 1 − a,
P(E_1 ∩ E_2 | p ∈ B_1) = P(E_1 ∩ E_2 | p ∈ B_2) = 0.

For this case
R(δ^{m/α}, B_1) = R(δ^{m/α}, B_2)
= [a (m choose α) a^α (1−a)^{m−α} + (1−a)(m choose α)(1−a)^α a^{m−α}] / [(m choose α) a^α (1−a)^{m−α} + (m choose α)(1−a)^α a^{m−α}]
= [a + (1−a)(a/(1−a))^{m−2α}] / [1 + (a/(1−a))^{m−2α}].
The lemma is proved for this case if we prove that
R(δ^{m/α}, B_1) ≥ R(δ^{m/(α−1)}, B_2).
Now
R(δ^{m/α}, B_1) − R(δ^{m/(α−1)}, B_2)
= [a + (1−a)(a/(1−a))^{m−2α}] / [1 + (a/(1−a))^{m−2α}] − [a + (1−a)(a/(1−a))^{m−2α+2}] / [1 + (a/(1−a))^{m−2α+2}]
= (2a−1)(a/(1−a))^{m−2α} [(a/(1−a))^2 − 1] / {[1 + (a/(1−a))^{m−2α}][1 + (a/(1−a))^{m−2α+2}]}
≥ 0,
since (2a−1) and [(a/(1−a))^2 − 1] always have the same sign. The equality sign holds if and only if a = 1 − a. Notice that neither a nor (1−a) can be equal to zero, because otherwise 𝒮^{m/α}, for α ≠ 0, is empty. This completes the proof for Case (i).
Case (ii):
Γ = {c_1, c_2},

ℬ = {B_1, B_2},
P(E_1 | p ∈ B_1) = P(E_2 | p ∈ B_2) = a + c,
P(E_2 | p ∈ B_1) = P(E_1 | p ∈ B_2) = b + c,
P(E_1 ∩ E_2 | p ∈ B_1) = P(E_1 ∩ E_2 | p ∈ B_2) = c,
a + b + c < 1.
The lemma is proved for this case if we prove that
R(δ^{m/α}, B_1) + R(δ^{m/α}, B_2) ≥ R(δ^{m/(α−1)}, B_1) + R(δ^{m/(α−1)}, B_2).
The risk R(δ^{m/α}, B_i), for i = 1, 2, can be written as
R(δ^{m/α}, B_i) = Σ_{s,t} R_{st}(δ^{m/α}, B_i),
where R_{st} is that portion of the risk which corresponds to all the elements of 𝒮^{m/α} which have s patterns correctly classifiable by both c_1 and c_2, and t patterns incorrectly classifiable by both c_1 and c_2. Using a procedure similar to the one used in the proof of Case (i), it can be shown that
R_{st}(δ^{m/α}, B_1) + R_{st}(δ^{m/α}, B_2) ≥ R_{st}(δ^{m/(α−1)}, B_1) + R_{st}(δ^{m/(α−1)}, B_2)
for every possible value of s and t. The equality holds if and only if a + c = b + c, i.e., the two events E_1 and E_2 are equally probable. This completes the proof for Case (ii).
Case (iii):
Γ = {c_1, c_2, ..., c_M}, ℬ = {B_1, B_2, ..., B_M!},
where the sets B_i have been defined before. For this case
R(δ^{m/α}) = (1/M!) Σ_{i=1}^{M!} R(δ^{m/α}, B_i).

Because of their properties mentioned before, the subsets B_i of the set ℬ can be paired off. Each such pair (B_{i_1}, B_{i_2}) has the following properties:
P(E_{j_1} | p ∈ B_{i_1}) = P(E_{j_2} | p ∈ B_{i_2}),
P(E_{j_2} | p ∈ B_{i_1}) = P(E_{j_1} | p ∈ B_{i_2}),
P(E_{j_1} ∩ E_{j_2} | p ∈ B_{i_1}) = P(E_{j_1} ∩ E_{j_2} | p ∈ B_{i_2}).
Also
P(E_k | p ∈ B_{i_1}) = P(E_k | p ∈ B_{i_2}) for k = 1, 2, ..., M except j_1 and j_2.
We can, therefore, write
R(δ^{m/α}) = (1/M!) Σ [R(δ^{m/α}, B_{i_1}) + R(δ^{m/α}, B_{i_2})],
where the summation is taken over distinct pairs (i_1, i_2). Each pair of terms under the summation represents Case (ii). Thus we consider the case (iii), and hence the lemma, proved, with the observation that the equality in the lemma for this case will hold if and only if
P(E_k | p ∈ B_{i_1}) = P(E_k | p ∈ B_{i_2}) for i_1, i_2 = 1, 2, ..., M! and k = 1, 2, ..., M.
Corollary to Lemma 2: R(δ^{m/α*}) ≥ R(δ^{m/k*}) for m ≥ α > k ≥ 0.
The equality holds if and only if the probabilities of all the events E_i are the same. This is a generalization of Lemma 2 and can be proved by successive application of the lemma.

Lemma 3: Let 𝒵_1, 𝒵_2, ..., 𝒵_ℓ be ℓ disjoint information subspaces. Let δ_i denote a decision function of the ith type which maps 𝒵_i into Γ, for i = 1, 2, ..., ℓ. Let δ denote a decision function of the type which maps (∪_{i=1}^{ℓ} 𝒵_i) × (∪_{j=1}^{ℓ} 𝒵_j) first into {i, δ_i(z_i)} × {j, δ_j(z_j)} and then maps {i, δ_i(z_i)} × {j, δ_j(z_j)} into Γ, where (z_i, z_j) is an element of (∪_{i=1}^{ℓ} 𝒵_i) × (∪_{j=1}^{ℓ} 𝒵_j) with z_i ∈ 𝒵_i and z_j ∈ 𝒵_j. Then
δ*(z_i, z_j) = δ_i*(z_i) if R(δ_i*) < R(δ_j*)
            = δ_j*(z_j) if R(δ_i*) > R(δ_j*)
            = either δ_i*(z_i) or δ_j*(z_j), arbitrarily, if R(δ_i*) = R(δ_j*),
for i, j = 1, 2, ..., ℓ.
Proof: By Lemma 1, it is sufficient to prove this lemma for every restriction δ_ij of the decision function δ on the subspace 𝒵_i × 𝒵_j, i, j = 1, 2, ..., ℓ. Let the value of a decision function δ_ij be δ_i(z_i) with probability a and δ_j(z_j) with probability (1−a). Then the average risk R(δ_ij) is given by
R(δ_ij) = a R(δ_i) + (1−a) R(δ_j).
It now becomes evident that R(δ_ij) becomes minimum, and consequently δ_ij becomes optimum, for
a = 1 and δ_i = δ_i* if R(δ_i*) < R(δ_j*);
a = 0 and δ_j = δ_j* if R(δ_j*) < R(δ_i*);
a arbitrary and δ_i = δ_i*, δ_j = δ_j* if R(δ_i*) = R(δ_j*).
Thus the lemma is proved.
Proof of Theorem 2: The decision function δ_II^{m/α} can be looked upon as a part of a mapping of the space (∪_{i=0}^{m} 𝒮^{m/i}) × (∪_{i=0}^{m} 𝒮^{m/i}) first into {i, δ_I^{m/i}(S^{m/i})} × {j, δ_I^{m/j}(S^{m/j})} and then into the set Γ. According to Lemma 2, R(δ_I^{m/i*}) ≥ R(δ_I^{m/j*}) for i ≥ j, where the equality sign holds for the case when the probabilities of all the events E_i are equal. Theorem 2 follows from Lemma 3.

Corollary 1: R(δ_II^{m/α_1*}) ≤ R(δ_II^{m/α_2*}) if α_1 < α_2. The equality sign holds only for the case when the events E_i are equally probable.
Corollary 2: R(δ_II^{m/α*}) ≤ R(δ_I^{m/α*}). The equality sign holds only for the case when the events E_i are equally probable.
Theorem 3: R(δ_I^{m/α*}) = R(δ_I^{(m−α)/0*}) for m ≥ α.
Proof: Given an S^{m/α} (i.e., a sample of m observations such that the minimum number of errors a discriminant can make in classifying its patterns is α), one can always pick a sample S^{(m−α)/0} from it, if m ≥ α. The discriminant δ_I^{m/α*}(S^{m/α}) is any element of the set Γ°(S^{m/α}), the set of all discriminants which make α errors in classifying the patterns of the sample S^{m/α}. One can choose to select this element of Γ°(S^{m/α}) by first selecting a sample S^{(m−α)/0} from the sample S^{m/α} and then taking the value δ_I^{(m−α)/0*}(S^{(m−α)/0}). This proves the theorem.
Corollary: R(δ_I^{m_1/α_1*}) ≤ R(δ_I^{m_2/α_2*}) if (m_1 − α_1) ≥ (m_2 − α_2), with the equality holding only for the case when all the events E_i have equal probabilities.
Before going on to the next section, we would like to point out that, though we assumed knowledge of the class {A_k}, none of the optimum decision functions depend on this class.
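Before leaving this section, the two selection rules that the adaptive procedure will use can be summarized in code. The sketch below is a Python illustration (not part of the original report); it assumes discriminants are represented by callables returning a class label and observations are (pattern, label) pairs, so that error counts play the role of ρ(S^m ∩ E_i).

```python
import random

def error_count(discriminant, sample):
    """rho(S^m intersect E_i): number of patterns in the sample the discriminant misclassifies."""
    return sum(1 for pattern, label in sample if discriminant(pattern) != label)

def delta_I(discriminants, sample, rng=random):
    """Theorem 1: pick (arbitrarily) a discriminant making the least number of errors."""
    errors = [error_count(c, sample) for c in discriminants]
    least = min(errors)
    candidates = [c for c, e in zip(discriminants, errors) if e == least]
    return rng.choice(candidates), least

def delta_II(stored_disc, stored_alpha, discriminants, sample, gamma=0.5, rng=random):
    """Theorem 2: compare the stored error count alpha with the least error count on the
    present sample; keep the stored discriminant when it is strictly better, switch when
    the present sample does better, and break ties at random with probability gamma."""
    best, least = delta_I(discriminants, sample, rng)
    if least > stored_alpha:
        return stored_disc, stored_alpha
    if least < stored_alpha:
        return best, least
    return (stored_disc, stored_alpha) if rng.random() < gamma else (best, least)
```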

5. DESCRIPTION OF THE PROBLEM
In this section we describe the pattern-recognition problem we have considered. The problem is described below in the form of a list of assumptions and related remarks:
A. We assume that the observation-space is known to be R^n × {A, B}. This means that an object is known to be represented by a set of n measurements and that it is known to belong to one of the two classes A and B.
B. The probability density p(ω, ·) on the observation-space is assumed to be unknown a priori.
C. We assume that observations are made available as a sequence in real time, not necessarily at fixed intervals. Note again that by an observation (ω, ·) we mean a pattern and its correct classification.
D. The instant of time immediately after a new observation is available will be called a "stage" of the "process." At every stage a fresh discriminant is selected to classify unknown patterns. The discriminant is selected from a given set Γ = {c_1, c_2, ..., c_M} of M discriminants, where M is finite. In some special cases where the discriminants are simple geometrical boundaries, it may not be necessary to require M to be finite. In any case, in order that the problem does not become trivial, M is required to be larger than the amount of available memory, so that it is not possible to count the number of errors made by each discriminant in classifying known patterns and to select a discriminant on the basis of those numbers by the application of Theorem 1.
In general, we make no assumptions about the relative geometric properties of the discriminants. However, we shall point out wherever advantage can be taken of simple relations, if they exist, between the elements of the set Γ. For example, if all the elements of Γ are completely arbitrary, each one has to be stored separately. However, if

the elements of Γ are, say, equally-spaced concentric hyperspheres, or hyperplanes inclined at equally-spaced angles to a fixed hyperplane, then it is sufficient to have an algorithm that would generate these discriminants when required.
E. Storage space is available to keep a record of the past observations. However, this storage space is not infinite; therefore, there comes a time after which all the past observations cannot be stored. To be very specific, we shall assume that there are λ units of storage space available to store information about past observations, one unit being equivalent to the space required to store one observation. We shall refer to this storage space as the "information memory." Note that it is not necessary that all the λ units be used up in storing past observations. However, the information memory is to be used only for the purpose of storing information about the past observations. The information we shall be concerned with storing will be old observations and discriminants with the known numbers of errors these discriminants have made, or the known numbers of patterns these discriminants have correctly classified.
F. At every stage, after making the decision about the discriminant to be selected, there is another decision to be made if the information memory is full. This concerns the question of how the presently available information in the information memory is to be rearranged, and how much of that information should be destroyed in order to make a unit available for the latest observation.
G. If a new observation arrives before the step of making room for it is completed, that observation is not taken into account.
H. The "process," therefore, consists of repeating the cycle of storing a new observation, selecting the "optimum" discriminant from the information available in the information memory, and making room in the information memory for a fresh observation in such a way that the "optimum" discriminant can be "improved" in real time. The cycle is repeated as long as new observations are available.
I. We shall mention one more constraint on our system. In the adaptive procedure to be described later, very often before the "optimum" discriminant is selected at a stage, we want to find a discriminant which makes the least number of errors in classifying the patterns associated with the observations which are stored in the information

memory at that stage. It is possible that there is more than one such discriminant. In fact, let us say that there is a subset Γ_o of discriminants which satisfy this requirement. If Γ_o contains more than one element, then the adaptive procedure requires an arbitrary selection of one of the elements of Γ_o.
We assume that there is no storage space available to store the subset Γ_o. We assume that there is enough storage space for two discriminants and two numbers, in addition to the information memory, to compare two discriminants according to the number of errors they make in classifying the patterns in the information memory. Thus we have the constraint that only one element of the subset Γ_o can be known at a time. If we wish to make the selection of this element of Γ_o arbitrary, we start testing discriminants at an arbitrary point in the sequence Γ = {c_1, c_2, ..., c_M} and go through the set Γ systematically. We arrange to select the first discriminant that made the least number of errors.
It is not necessary to go through the entire set Γ if the elements of Γ are geometrically related. For example, suppose the elements of Γ are concentric hyperspheres with the points in the inside of each hypersphere representing the patterns of one fixed class. The procedure in such a case would be to select a hypersphere arbitrarily, determine the number of errors it makes, and store both. Then test the next larger hypersphere and determine the number of errors it makes. If the latter makes fewer errors than the former, store it and the number of errors it has made in place of the former; otherwise retain the former. Then test the next larger sphere and continue the process until the number of errors made by a sphere is twice as many as the one that is stored. It can be easily established that there is no larger sphere that can do better than the one that is stored. At that stage, start testing spheres smaller than the one that was first tested, and repeat the procedure for the smaller spheres. It can be shown that this procedure leads to the selection of an arbitrary element of Γ_o without necessarily testing all the elements of the set Γ.
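The search just described can be sketched in a few lines of Python. The sketch below is an illustration (not from the report); it assumes a nested family of discriminants ordered by "size" (a one-dimensional family of thresholds stands in for the concentric hyperspheres) and uses the stopping rule above: once a tested element makes at least twice as many errors as the one currently stored, no element further in that direction can do better.

```python
def error_count(discriminant, sample):
    """Number of labeled patterns in the sample that the discriminant misclassifies."""
    return sum(1 for pattern, label in sample if discriminant(pattern) != label)

def sequential_search(discriminants, sample, start_index):
    """Find a least-error element of a nested family without testing the whole set.
    Starts at an arbitrary index, moves first toward larger elements and then toward
    smaller ones, and stops moving in a direction once the current error count
    reaches twice the best count stored so far."""
    best_i = start_index
    best_err = error_count(discriminants[start_index], sample)
    for step in (+1, -1):                      # larger "spheres" first, then smaller ones
        i = start_index + step
        while 0 <= i < len(discriminants):
            err = error_count(discriminants[i], sample)
            if err < best_err:                 # keep the better element and its error count
                best_i, best_err = i, err
            if err >= 2 * best_err:            # nothing further in this direction can win
                break
            i += step
    return best_i, best_err

if __name__ == "__main__":
    # One-dimensional analogue: "spheres" are thresholds t; the inside (x <= t) is class A.
    thresholds = [lambda x, t=t: "A" if x <= t else "B" for t in range(10)]
    sample = [(2.3, "A"), (4.1, "B"), (0.7, "A"), (6.8, "B"), (3.0, "A")]
    print(sequential_search(thresholds, sample, start_index=5))
```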

6. RESULT OF THE INVESTIGATION
We now give the result of our investigation in the form of the following theorem:
Theorem 4: Let x_1, x_2, ..., x_k, ... be a sequence of independent observations in real time (i.e., t_i < t_j for i < j, where t_i is the time of arrival of the observation x_i). Let y_k be the discriminant selected from the set Γ (to classify unknown patterns) at stage k (i.e., at the instant of time immediately after the observation x_k is stored). Then if y_k is selected according to the adaptive procedure given below, it is "adaptively optimum" in the following sense:
(i) For the content of the information memory at the kth stage, the decision function used to determine y_k is the optimum.
(ii) The information retained in the information memory after the discriminant y_k is selected is such that the type of decision function anticipated to be used at the (k+1)st stage is better than (i.e., its average risk is less than that of) the type of decision function actually used at the kth stage, unless there is not enough information in the information memory at the kth stage. In the latter case the information retained after the discriminant y_k is selected is such that the anticipated type of decision function to be used at the (k+i)th stage is better than the one used at the kth stage, where i > 1, and the decision function to be used at the (k+1)st stage is as good as the one used at the kth stage.
Remarks: 1. The difference between the "type of decision function anticipated to be used" and the "type of decision function actually used" should be noted. One anticipates using a certain type of decision function before one knows the actual value of the newly arrived observation. One knows what decision function was actually used after the decision

has been made.
2. The adaptive procedure can be described roughly in words as follows: There are three modes of operation. During the first mode, space is available in the information memory to store new observations. In this mode any one of the discriminants which make the least number of errors in classifying the stored patterns is selected. During the second mode of operation the information memory is filled with observations and a number which shows the errors made by the selected discriminant in classifying the patterns in the memory at the time it was selected. A new observation is accommodated by removing the oldest observation from the memory. A comparison is made between the number of errors of the present discriminant and the least number of errors that can be made in classifying the patterns in the memory by an element of the set Γ. If the former is larger than the latter, a new discriminant is selected which makes the least number of errors in classifying the patterns in the information memory, and the number of errors it makes is stored in the memory in place of the number that is presently stored. Otherwise, no change is made. This mode is continued until all the patterns in the memory are correctly classified by a discriminant. At that stage the adaptive procedure goes into the third mode of operation.
To describe the third mode of operation roughly, a comparison is made between the number of "successive" patterns correctly classified by the present discriminant and the number of "successive" patterns correctly classified by an "appropriately" selected discriminant. If the former is larger than the latter, the present discriminant is retained; otherwise it is replaced by the other discriminant. In the latter case, another discriminant is "appropriately" selected for future comparisons. The process is continued ad infinitum.
6.1 The Adaptive Procedure and the Proof of the Theorem
It will be assumed that no element of the sequence {x_i} is either correctly classifiable or incorrectly classifiable by all the elements of the set Γ. If such elements exist, they are taken out of consideration, because such elements do not contribute any information towards the selection of a discriminant but use up the information memory.
The adaptive procedure is broken down into three modes of operation and into six steps. The first mode extends from Step 1 through Step 3, the second mode consists of Step 4 and Step 5, and the third mode consists of Step 6.

We now give the adaptive procedure step by step, and simultaneously the step-by-step proof of Theorem 4.
Mode 1
1. Initially, when no observations are available and when the information memory is empty, the discriminant y_0 is any arbitrary element of the set Γ.
Proof: This is obvious because, in the absence of any available observations, all the elements of Γ are equally likely to be the best discriminant.
2. As long as k ≤ λ, where λ is the number of units of available information memory, the new observation x_k is stored in the information memory and
y_k = δ_I^{k*}(x_1, x_2, ..., x_k).
Note: As mentioned in Point I of Section 5, we have, in addition to the λ units of information memory, memory units available to store two discriminants and two numbers, to assist us in the selection of a discriminant which makes the least number of errors in classifying the patterns in the information memory.
Proof: As long as there is room for a new observation in the information memory, there is no need to destroy any of the old observations. Also, since all the old observations are preserved, there is no need to preserve any discriminant which may have been tested or selected in the past. The choice of y_k follows directly from Theorem 1.
3. For k = λ, after y_λ is selected, the need arises to make a place for a new observation. The course of action depends on the value of ρ°(x_1, x_2, ..., x_λ), i.e., the minimum of the numbers of errors made by the discriminants in the set Γ in classifying the patterns of the observations x_1, x_2, ..., x_λ. If the value of ρ°(x_1, x_2, ..., x_λ) is 0 or 1, the process goes into Mode 3, which consists of Step 6; otherwise the process goes to Mode 2, which begins with Step 4.
Mode 2
4. The first thing to do is to reorganize the information in the information

memory. An arbitrary x_i which is not correctly classifiable by the discriminant y_λ is removed from the information memory. The observations x_j, j > i, are moved one step upwards in the information memory. The last unit of the information memory is thus made empty. From henceforth, the content of this memory unit at any stage k will be called α_k. For k = λ, α_k is made equal to ρ°(x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_λ).
Note 1: The only reason for eliminating an observation in this way, to make room for the number α_k, is to show that in the process of making room for α_k the optimum decision function is not "degraded."
Note 2: We recognize here that what we called a unit of information memory (viz., the amount of space required to store an n-dimensional vector, i.e., a pattern, and its correct classification) is a lot more storage space than is required to store the number ρ°(x_1, x_2, ..., x_{i−1}, x_{i+1}, ..., x_λ). One could use the remaining space to store some other information. However, for the sake of keeping the procedure simple, we will not use the remaining fraction of the unit of information memory for storing information.
At any future stage, a newly arriving observation is stored at the end of the sequence of observations in the information memory. This place at the end of the sequence is created by destroying the observation at the head of the sequence and moving the rest of the sequence up by one step. We shall denote the sequence of observations in the information memory at the kth stage by S_k. The discriminant y_{λ+1} is selected by
y_{λ+1} = δ_II^{(λ−1)/α_λ*}(S_{λ+1}),
i.e.,
y_{λ+1} = y_λ if α_λ < ρ°(S_{λ+1})
       = c°(S_{λ+1}) if α_λ > ρ°(S_{λ+1})
       = y_λ with probability γ, and c°(S_{λ+1}) with probability (1−γ), if ρ°(S_{λ+1}) = α_λ,
where 0 ≤ γ ≤ 1.

Proof: The decision function used at the λth stage was δ_I^{λ/(α_λ+1)*}, which is as good as the decision function δ_I^{(λ−1)/α_λ*}, according to Theorem 3. The decision function anticipated to be used at the (λ+1)st stage is δ_II^{(λ−1)/α_λ*}, which by Corollary 2 of Theorem 2 is better than δ_I^{(λ−1)/α_λ*}, which was the decision function actually used at the λth stage.
At the stage λ+1, the decision function δ_II^{(λ−1)/α_λ*} operates on a new sample of λ−1 observations which is formed from the old λ−1 observations and the newly arriving observation by destroying an arbitrary observation from the λ−1 old observations. We choose to destroy the oldest one.
5. At any future stage k ≥ λ+1, after selecting the discriminant y_k, if ρ°(S_k) = 0, the process goes to the third mode and Step 6; otherwise the process runs as follows:
α_k = α_{k−1} if α_{k−1} ≤ ρ°(S_k)
    = ρ°(S_k) otherwise.
The new discriminant y_{k+1} selected at the (k+1)st stage (i.e., after the newly arrived observation is stored at the end of the sequence) is
y_{k+1} = δ_II^{(λ−1)/α_k*}(S_{k+1}).
After that, Step 5 is repeated.
Proof: If ρ°(S_k) ≥ α_{k−1}, then the decision function actually used at the stage k is δ_I^{(λ−1)/α_{k−1}*}. With α_k = α_{k−1}, the decision function that is anticipated to be used at the (k+1)st stage is δ_II^{(λ−1)/α_k*}, which is better than the decision function δ_I^{(λ−1)/α_{k−1}*}, according to Corollary 2 of Theorem 2. If ρ°(S_k) ≠ 0 and ρ°(S_k) < α_{k−1}, and α_k = ρ°(S_k), then the decision function that is anticipated to be used at the (k+1)st stage is δ_II^{(λ−1)/ρ°(S_k)*}, which is better than the decision function δ_I^{(λ−1)/ρ°(S_k)*}, which was actually used at the kth stage.
If ρ°(S_k) = 0, there is a perfectly separable sample (i.e., a sample of observations for which the least number of errors a discriminant can make in classifying its

patterns is zero), and the third mode of operation is followed. The proof for Step 5 is complete.
Note: (i) Consistent with our policy of treating the case where there is no geometrical relation between the elements of the set Γ, we are giving a third mode of the adaptive procedure which does not take advantage of such a geometrical relation, should it exist. It must be pointed out, however, that special procedures which take such geometrical relations into account turn out to be much more efficient.
(ii) In the third mode it is not necessary at every stage to find a discriminant which makes the least number of errors in classifying the patterns in the memory. As such, the storage space of two discriminants and two numbers (mentioned in the note in Step 2) reserved for that purpose is free to be used in the intermediate stages. We shall use the storage spaces for one of the discriminants and one of the numbers. The content of the storage space for the discriminant will be called z_k at the kth stage, and the content of the storage space for the number will be called γ_k.
Mode 3
6. In coming from Step 3, make room in the information memory for a number whose value at any future stage k will be called a_k. This is done by removing the observation in the information memory whose pattern is incorrectly classified by the discriminant y_λ, and moving the subsequent observations in the sequence one step upwards in the information memory, if ρ°(x_1, x_2, ..., x_λ) = 1. If ρ°(x_1, x_2, ..., x_λ) = 0, the space for a_k can be made by removing any observation arbitrarily and moving the subsequent observations one step upwards. For k = λ, a_k is made equal to λ − 1 if ρ°(x_1, x_2, ..., x_λ) = 1, and equal to λ if ρ°(x_1, x_2, ..., x_λ) = 0. In other words, a_k will represent the number of patterns correctly classified by the discriminant y_k. z_λ is made equal to y_λ, and γ_λ is made equal to a_λ.
If coming from Step 5, after, say, the k_0th stage, a_{k_0} is set equal to λ − 1, z_{k_0} is made equal to y_{k_0}, and γ_{k_0} is made equal to a_{k_0}.
At any future stage k > λ if coming from Step 3, and k > k_0 if coming from Step 5, the following procedure is followed:

    γ_{k+1} = γ_k    if y_k < a_k,
            = γ_k    if y_k = a_k and the pattern in the new observation is not correctly classifiable by z_k,
            = z_k    if y_k = a_k and the pattern in the new observation is correctly classifiable by z_k.

If the pattern in the new observation is correctly classifiable by z_k, then y_{k+1} = y_k + 1 and z_{k+1} = z_k. Also, if a_k = y_k, then a_{k+1} = y_{k+1}. If the pattern in the new observation is incorrectly classified by z_k, then

    a_{k+1} = a_k,
    z_{k+1} = Γ̃(S_k), a discriminant making the least number of errors in classifying the patterns of S_k,
    y_{k+1} = λ - 1 - p̃(S_k).

Remark: z_k is the discriminant which we previously referred to as the "appropriately" selected discriminant. y_k is the number of "successive" patterns correctly classified by z_k up to the stage k, and a_k is the number of "successive" patterns correctly classified by γ_k. In Mode 3, γ_{k+1} is made equal to one of the two discriminants γ_k and z_k, depending on whether the number of "successive" patterns correctly classified by γ_k or by z_k is larger. A brief illustrative sketch of this bookkeeping is given after the proof below.

Proof: We notice that

    γ_k = δ^{a_k/0*}(S_k)    and    z_k = δ^{y_k/0*}(S_k).

The decision function used in Mode 3 is of the type mentioned in Corollary 3 of Theorem 2. Here, at the (k+1)st stage, the choice is between the values of two types of decision functions; one is δ_I^{a_k/0*} and the other is equivalent to δ_I^{y_{k+1}/1*}. If a_k = y_k, then the anticipated decision function is equivalent to δ_I^{a_{k+1}/1*}, which, by Corollaries 2 and 3 of Theorem 2, is better than δ_I^{a_k/0*}, which was the decision function used at the kth stage. If a_k > y_k, then the decision function at the (k+1)st stage is equivalent to δ^{a_k/0*}, which is the same as the one used at the kth stage. Because there is not enough information, the anticipated decision function at the (k+1)st stage cannot be made better than the one used at the kth stage in the case y_k < a_k. However, the anticipated decision function at the (k+i)th stage, where i = a_k - y_k, is again of the type mentioned in Corollary 3 of Theorem 2. This time the choice is between the values of two decision functions, δ^{a_k/0*} and a decision function which is equivalent to δ_I^{y_{k+i}/i*}. This decision function at the (k+i)th stage is better than δ_I^{a_k/0*}. This completes the proof of Theorem 4.
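Before turning to the illustrative example, the Mode 3 bookkeeping can be summarized in a short sketch. It follows our reading of the (partly garbled) update rules above: γ is the discriminant currently in use, z is the stored candidate, y counts the successive patterns z has classified correctly, and a counts those credited to γ; z takes over only when its streak has caught up with γ's and it also classifies the newly arrived pattern correctly. The function mode3_step and the helper best_discriminant (which returns a least-error discriminant on the stored sample together with the number of stored patterns it classifies correctly) are illustrative assumptions, not the report's notation, and the exact resets applied when z misclassifies are our reconstruction.

```python
def mode3_step(gamma, z, a, y, new_obs, memory, classifies, best_discriminant):
    """One Mode 3 update: returns (gamma, z, a, y) for the next stage.

    classifies(d, obs) -> bool tells whether discriminant d classifies the
    pattern in obs correctly; best_discriminant(memory) -> (d, n_correct)
    returns a least-error discriminant on the stored sample. Both are
    stand-ins supplied by the caller.
    """
    # The candidate z takes over only when its streak has caught up and it
    # also handles the newly arrived pattern.
    if y == a and classifies(z, new_obs):
        next_gamma = z
    else:
        next_gamma = gamma

    if classifies(z, new_obs):
        # z survives; its streak grows, and if it just took over,
        # gamma's streak becomes the same streak.
        z_next, y_next = z, y + 1
        a_next = y_next if a == y else a
    else:
        # z failed: recompute a candidate from the stored sample and start
        # its streak at the number of stored patterns it classifies correctly.
        z_next, n_correct = best_discriminant(memory)
        y_next = n_correct
        a_next = a

    return next_gamma, z_next, a_next, y_next
```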

7. ILLUSTRATIVE EXAMPLE

A pattern recognizer which uses the adaptive procedure given in the last section was simulated on the IBM 7090 digital computer to work out a problem of signal detection in the presence of additive noise. The usual problem of signal detection is the following: given a waveform x(t), it is required to determine whether it is pure noise n(t) or signal-plus-noise, s(t) + n(t), for 0 < t < T. It is well known (Ref. 12) that if the signal s(t), with energy E, is known exactly, if the noise is white Gaussian noise with power per unit bandwidth known to be equal to N_0, and if the probability P(SN) of signal occurring with noise is known, then the decision rule which minimizes the average "cost" is the following: if

    (2/N_0) ∫_0^T x(t) s(t) dt > c,

then the hypothesis that x(t) contains the signal s(t) is accepted; otherwise the alternate hypothesis that x(t) is pure noise is accepted. The value of the threshold level c is given by

    c = E/N_0 + ln β,

where β depends on P(SN) and on the cost. In this report the cost has been taken equal to unity if there is an incorrect classification, and zero if there is a correct classification. For such a cost, β is given by

    β = [1 - P(SN)] / P(SN).

With known E, N_0, and P(SN), the value of the optimum threshold c can be calculated. For this illustrative example we are assuming that the knowledge about E, N_0, and P(SN) is not available. It is the job of the pattern recognizer to select a threshold (discriminant), according to the adaptive procedure given in Section 6, from a given set Γ of thresholds. The set Γ actually used is the set of the 41 odd numbers from 1 through 81. The input to the pattern recognizer is a sequence of observations. Each observation, for this example, is the value of the integral I and the information that the corresponding x(t) is s(t) + n(t) or n(t). In the language of this report, I is the one-dimensional pattern ω, and the information about x(t) is the classification of the pattern.

Thus we may say that the pattern ω represented by a value of I belongs to Class A if the corresponding x(t) is n(t); otherwise it belongs to Class B. The unknown probability density on the observation-space is given by

    p(ω, A) = [1 - P(SN)] p[I | x(t) = n(t)],
    p(ω, B) = P(SN) p[I | x(t) = n(t) + s(t)].

For white Gaussian noise, the probability densities p[I | x(t) = n(t)] and p[I | x(t) = n(t) + s(t)] are normal, each having a variance equal to 2E/N_0 and means differing by 2E/N_0. The simulated pattern recognizer has an information memory of 10 units, i.e., the pattern recognizer can store, at most, 10 observations at a time.

For each set of values of 2E/N_0 and P(SN), twenty runs of the adaptive process were made. Each run was carried up to 100 stages. Five sets of values of 2E/N_0 and P(SN) were considered. The (a) parts of the five graphs show the values of the losses (as defined in Section 3) for the 41 discriminants. The number with the arrow in each graph marks the best discriminant; in some cases there are two "best" discriminants. The (b) part of each of the five graphs shows the average, over twenty runs, of the loss of the discriminant selected at each stage. These graphs clearly show a decreasing trend in the average value of the loss of the selected discriminant as the number of the stage increases. This decreasing trend indicates that the pattern recognizer is learning. The learning is very rapid in the beginning. However, the graphs clearly show that the process of learning continues even after the tenth stage. This is especially true when the signal-to-noise ratio is small. This result clearly indicates that the adaptive procedure leads to a better discriminant than the one that would be selected on the basis of the first ten observations alone. The latter would have been selected if the observations beyond the first ten were ignored because the pattern recognizer has memory to store only ten observations at a time.
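The author's IBM 7090 program is not reproduced in the report; the short Python sketch below merely re-creates the statistics just stated — I is normal with variance 2E/N_0, with mean 0 under noise alone and mean 2E/N_0 under signal-plus-noise — and tabulates the empirical loss of each of the 41 odd thresholds alongside the known-parameter optimum c = E/N_0 + ln β. The parameter values 2E/N_0 = 40 and P(SN) = 0.5 are those of Fig. 3; the function names are ours.

```python
import numpy as np

def simulate_observations(n, snr, p_sn, rng):
    """Draw n observations of the detection statistic I.

    Under noise alone I is N(0, snr); under signal-plus-noise I is N(snr, snr),
    where snr stands for 2E/N0 (the mean shift equals the variance).
    Returns (values, labels) with label 1 for signal-plus-noise.
    """
    labels = rng.random(n) < p_sn
    means = np.where(labels, snr, 0.0)
    values = means + rng.normal(0.0, np.sqrt(snr), n)
    return values, labels.astype(int)

def empirical_loss(threshold, values, labels):
    """Fraction of misclassifications when deciding 'signal' for I > threshold."""
    decisions = (values > threshold).astype(int)
    return float(np.mean(decisions != labels))

rng = np.random.default_rng(0)
snr, p_sn = 40.0, 0.5                    # 2E/N0 and P(SN), as in Fig. 3
values, labels = simulate_observations(100_000, snr, p_sn, rng)

thresholds = range(1, 82, 2)             # the 41 odd numbers 1, 3, ..., 81
losses = {c: empirical_loss(c, values, labels) for c in thresholds}
best = min(losses, key=losses.get)

# Known-parameter optimum for comparison: c = E/N0 + ln(beta) with
# beta = (1 - P(SN)) / P(SN); for P(SN) = 0.5, beta = 1 and c = E/N0.
c_opt = snr / 2 + np.log((1 - p_sn) / p_sn)
print(f"best odd threshold {best} (loss {losses[best]:.3f}); "
      f"known-parameter optimum c = {c_opt:.1f}")
```

With these parameters β = 1, so c = E/N_0 = 20, which agrees with the best discriminants marked near 21 and 23 in Fig. 3(a).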

Fig. 1(a). Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection. [Plot; P(SN) = 0.5, 2E/N_0 = 56.]

Fig. 1(b). Average loss over twenty runs at each stage of the adaptive process. [Plot; P(SN) = 0.5, 2E/N_0 = 56; minimum loss 0.081.]

Fig. 2(a). Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection. [Plot; P(SN) = 0.375, 2E/N_0 = 40.]

Fig. 2(b). Average loss over twenty runs at each stage of the adaptive process. [Plot; P(SN) = 0.375, 2E/N_0 = 40; minimum loss 0.1509.]

Fig. 3(a). Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection. [Plot; P(SN) = 0.5, 2E/N_0 = 40; best discriminants marked at 21 and 23.]

Fig. 3(b). Average loss over twenty runs at each stage of the adaptive process. [Plot; P(SN) = 0.5, 2E/N_0 = 40.]

Fig. 4(a). Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection. [Plot; P(SN) = 0.625, 2E/N_0 = 40.]

Fig. 4(b). Average loss over twenty runs at each stage of the adaptive process. [Plot; P(SN) = 0.625, 2E/N_0 = 40.]

Fig. 5(a). Loss vs. Discriminant (odd numbers) used in the illustrative example of signal detection. [Plot; P(SN) = 0.5; best discriminants marked at 33 and 35.]

Fig. 5(b). Average loss over twenty runs at each stage of the adaptive process. [Plot; P(SN) = 0.5; minimum loss 0.242.]

8. CONCLUSIONS

An adaptive procedure is given whereby a pattern recognizer continually "learns" to recognize patterns. The learning is demonstrated by the continual "improvement" of the discriminant selected for recognizing unknown patterns, as long as new observations (patterns and their respective correct classifications) are made available to the pattern recognizer. No a priori knowledge of the probability density on the observation-space is assumed. Moreover, the pattern recognizer is assumed to have a finite memory. The adaptive procedure not only makes efficient use of the finite memory in storing past information, but also uses "optimum" decision rules to select a discriminant on the basis of the stored information. At any stage of the process, a discriminant is selected from a prescribed set of discriminants. The factors affecting the choice of this prescribed set of discriminants have not been discussed.

REFERENCES

1. Hu, S. T., "On the Willis Method of Finding Separating Systems," TR 6-90-61-44, Lockheed Missiles and Space Division, December 1960.

2. Hu, S. T., "The Synthesis Problem of Linear Separability," TR 6-90-62-17, Lockheed Missiles and Space Division, September 1961.

3. Hu, S. T., "An Elimination Process for Willis Synthesis Method," TR 6-90-62-18, Lockheed Missiles and Space Division, December 1961.

4. Hu, S. T., "Successive Approximation Applied to the Synthesis of Linear Separability," TR 6-90-62-99, Lockheed Missiles and Space Division, January 1962.

5. Singleton, R. C., "A Test for Linear Separability as Applied to Self-Organizing Machines," Stanford Research Institute, Project No. 3605, May 1962.

6. Highleyman, W. H., "Linear Decision Functions, with Application to Pattern Recognition," Proceedings of the IRE, June 1962.

7. Cooper, P. W., "The Hyperplane in Pattern Recognition," Sylvania Applied Research Laboratory, 1962.

8. Sebestyen, G. S., "Decision-Making Processes in Pattern Recognition," ACM Monograph Series, The Macmillan Company, 1962.

9. Cooper, P. W., "The Hypersphere in Pattern Recognition," Information and Control, Vol. 5, No. 4, December 1962.

10. Cooper, P. W., "Classification by Statistical Methods," Melpar Technical Note, April 1961.

11. Chow, C. K., "An Optimum Character Recognition System Using Decision Functions," IRE Transactions on Electronic Computers, December 1957.

12. Peterson, W. W., and Birdsall, T. G., "The Theory of Signal Detectability," Part I and Part II, "Applications with Gaussian Noise," Cooley Electronics Laboratory Technical Report No. 13, June 1953 and July 1953.

13. Widrow, B., and Hoff, M. E., "Adaptive Switching Circuits," Stanford Electronics Laboratories Technical Report No. 1533-1, June 1960.

14. Rosenblatt, F., "Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms," Spartan Books, 1961.

15. Novikoff, A. B., "On Convergence Proofs for Perceptrons," Stanford Research Institute Technical Report, Project No. 3605, January 1963.

16. Albert, A., "A Mathematical Theory of Pattern Recognition," The Annals of Mathematical Statistics, Vol. 34, No. 1, March 1963.

