THE UNIVERSITY OF MICHIGAN
OFFICE OF RESEARCH ADMINISTRATION
ANN ARBOR

AN ABSTRACT MODEL FOR THE PATTERN RECOGNITION PROCESS

Technical Report No. 121
3642-1-T
AFOSR 488

Cooley Electronics Laboratory
Department of Electrical Engineering

By: Donald L. Richards
Approved by: W. P. Tanner, Jr.

Project 3642
Contract No. AF 49(638)-884
Air Force Office of Scientific Research
Air Research and Development Command

April 1961

TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
2. THE FORMAL MODEL
   2.1 Notation and Organization
   2.2 Patterns and Inputs
   2.3 Initial Knowledge
   2.4 Learning
   2.5 Decision
3. DIAGRAM OF THE SYSTEM
4. CONCLUDING DISCUSSION

ABSTRACT

The goal of this work, unlike that of other work in the field of pattern recognition, is the construction of a general recognition model which can apply to any group of patterns and to any amount of initial knowledge about them. In the preliminary model which has been developed, the initial knowledge is presented to the recognizer as a list of conditional probabilities; then, given certain characteristics of the sequence of items which has been seen, the most probable correct answer can be selected. Where these conditional probabilities are not known exactly, they are estimated by a primitive "learning" process. Decisions are based on a posteriori probabilities and on other criteria studied in decision theory. The model is described in precise detail, and some possible goals of further work are mentioned.

1. INTRODUCTION

Many attempts have been made to recognize patterns by machine, especially such patterns as: geometric figures, machine-printed alphanumeric characters, hand-written letters, speech, and enemy fire-power in reconnaissance photos. Seldom has a study of one of these sets of items provided results which were applicable to any of the others, and progress in each case has depended largely on the researcher's familiarity with the particular set chosen. Yet clearly these problems are worthy of study and clearly there are many similarities among them. In this work we shall identify and concentrate upon the elements of similarity, attempting to construct a formal (almost axiomatic) system which applies to all the above patterns. This process of abstraction, though common in mathematics, has never, as far as we know, been applied to pattern recognition. In applying it, we gain simplicity, generality, and exactness. We lose, initially, the insight that familiarity with one particular set of patterns may bring; but when our model is completed, it should apply directly to each such set and should provide a precise framework with which such insights can be fruitfully combined.

What elements, then, are common to the various pattern recognition situations? We shall identify them as follows: there is a set of items which the recognizer observes, one at a time; there is a set of possible responses the recognizer can make; and there is later information as to which response was correct in connection with each item observed. Usually the recognizer's goal is to identify correctly as many items as possible, but generality is increased if the more complicated goals of decision theory (most of which depend on correct identification as a preliminary stage) are also allowed. In any case, when the goal has been defined, the recognizer's performance can (and must) be evaluated in terms of that goal.

Note that, in order to perform in a better-than-random way, a recognition system must initially be provided with some overall information about the universe of patterns. True, a human being can correctly identify an item which he has never seen before, but this abstraction process is possible only because of long experience with the "real" world and because of evolutionary adaptive mechanisms.

To duplicate this ability in our model we must provide conceptual analogues both of: 1) information about (or "experience with") individual items, and 2) initial information which is relevant to the environment of patterns. The information in (1) will be gained in the course of a recognition experiment, through the presentation of items and of the correct response associated with these items. Without information of this type a system possesses no flexibility or ability to learn. The information in (2) will be made explicit at the beginning of each experiment. Without information of this type a system possesses no ability to abstract or to respond to new situations. Together, these two types of information largely determine the performance of any system and the nature of any abstract model of recognition systems. With this in mind, we now present our model formally.

2. THE FORMAL MODEL

2.1 Notation and Organization

Let there be a finite set P of elements called patterns, denoted by X_1, X_2, X_3, ..., X_m or by Y_1, Y_2, Y_3, ..., Y_m. In addition let there be a (possibly infinite) set I of inputs, denoted {I_h}. Let time be indexed by the variable t, so that t = 1, 2, 3, .... An experiment will consist of a sequence of inputs I^1, I^2, ..., the superscript denoting the time of presentation. At each time t = 1, 2, 3, ..., a pattern X^t from P may or may not also be presented.

Let there be a system S which for each time t (= 1, 2, 3, ...) takes I^t as input, takes X^t as input when it is available, and produces as output a pattern Y^t from P. Y^t may be a function of I^1, I^2, ..., I^t and of X^1, X^2, X^3, ..., X^{t-1} (where present), as well as of any prior knowledge available to the system. It may not be a function of X^t or of I^u or X^u where u > t.

(Intuitively, we identify the sequence I^1, I^2, ... with items seen, and each X^t with the "correct" pattern which corresponds to I^t. Similarly, Y^t represents the system's guess upon seeing I^t. We may regard X^t as being presented "after" Y^t is produced, but within the same time-step. If Y^t were permitted to be a function of X^t, recognition would be trivial.)
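To make the timing constraints concrete, the following is a minimal sketch, in Python (a modern addition, not part of the original report), of the observe-guess-reveal cycle just described; the names Recognizer, observe, and reveal are illustrative only.

from typing import Optional, Sequence

class Recognizer:
    """System S: sees I^t, produces a guess Y^t, and may then be told the correct X^t."""

    def __init__(self, patterns: Sequence[str]):
        self.patterns = list(patterns)   # the finite set P of patterns
        self.inputs_seen = []            # I^1, I^2, ..., I^t
        self.answers_seen = []           # X^1, ..., X^(t-1); None where X was omitted

    def observe(self, input_t) -> str:
        """Take I^t and produce Y^t, using only earlier I's, earlier X's, and prior knowledge."""
        self.inputs_seen.append(input_t)
        return self.patterns[0]          # placeholder guess; a real S applies Sections 2.3-2.5

    def reveal(self, correct: Optional[str]) -> None:
        """Receive X^t after Y^t has been produced (or None if it is withheld)."""
        self.answers_seen.append(correct)

An experiment then consists simply of calling observe and reveal once per time-step.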

Let the prior knowledge available to S be precisely specified as follows:

1) S has available a decision function, D, which specifies S's goal and can be used to measure its performance (e.g., the decision function may require that Y^t correspond to X^t as often as possible).

2) S is supplied with a finite list of characteristics, C_1, C_2, ..., C_n. A characteristic is formally defined as an effectively calculable function from the set of all double sequences (I^1, I^2, ..., I^t; X^1, X^2, ..., X^{t-1}) into a finite subset of the integers. Less formally, we ask that after any input I^t is presented it is possible to determine, from the observed I's and X's, what the value of the characteristic C_j is. We denote the k-th value of C_j by c_{jk}.

(In the "pure pattern recognition" case, the characteristics will apply to I^t only, and will be "characteristics of I^t" in the ordinary sense. For another, more complex example, consider the characteristic which has the value i = 1, 2, ..., m if and only if X^{t-1} was X_i. A machine using this characteristic alone would ignore the I's and would be a type of sequence predictor.)

3) For any or all values of any or all of the above characteristics, S may have available a list of probabilities. Suppose that a characteristic C_j has c_{jk} as one of its possible values. Then S will have available either no probabilities associated with C_j = c_{jk}, or a complete list of the probabilities

P_{c_{jk}}(X^t = X_1), P_{c_{jk}}(X^t = X_2), ..., P_{c_{jk}}(X^t = X_m).

These may be written as "P_{c_{jk}}(X_1), ..., P_{c_{jk}}(X_m)"; the whole list (or, alternatively, the i-th item) being denoted "P_{c_{jk}}(X_i)." Note that they are conditional probabilities, conditioned on C_j having the value c_{jk}.

(The characteristics are simply the factors which are relevant in deciding which patterns the various inputs represent. There are no theoretical limits on their number or complexity; however, the system will ignore any connections between inputs and patterns which are not available in the form of characteristics, no matter how obvious they may be. When the list of probabilities, P_{c_{jk}}(X_i), is supplied, S will use these exact values in making its decision and producing Y^t; otherwise, it will estimate these same probabilities from the observed sequences.)

4) S also has available a set of functions, f_i, which express the legitimate a posteriori probability of each X_i in terms of probabilities like those above. Frequently the f_i will be trivial and obvious functions. An explanation of their importance in other cases is best postponed.

A more detailed discussion of the model and of the operation of S follows.

2.2 Patterns and Inputs

The word "pattern" has deliberately been left undefined, a "pattern" being merely an element of the set P. X^t and Y^t can be considered as names: X^t as the name associated with I^t; Y^t as the name chosen at time t to optimize the decision function. (Alternatively, each X_i might be regarded as a set of inputs, so that each input I_h belongs to at least one pattern. However, this alternative concept is inelegant with respect to the X's and inapplicable to the Y's. The current formulation, which does not require any specific connection between I^t and X^t, is simpler and at the same time more general.)

The inputs, I^t, are usually the objects under study: in speech recognition, the speech sounds; in character recognition, the letters, etc. The following formal definition seems to apply to all such situations and will be adopted as a means of clarifying and limiting the model's scope:

DEFINITION: An n-dimensional input is a mapping from the Cartesian product, \prod_{i=1}^{n} D_i, of n ordered sets to an ordered set, R.

Thus a 0-dimensional input is a mapping with no domain, i.e., is nonexistent. A one-dimensional input is a mapping from D_1 to R and corresponds to a voltage waveform or acoustic signal (or speech sound) if we let D_1 be the real line representing time and R be the set of values representing voltage.

A two-dimensional input is a mapping from D_1 × D_2 to R and corresponds to a printed letter or handwritten character (or photographic image) if we let D_1 and D_2 represent the horizontal and vertical directions, respectively, and we let R represent a set of gradations from white to black. A position in a tic-tac-toe game is a two-dimensional input where D_1 = D_2 = {1, 2, 3} and R = {X, blank, O}. These examples should indicate how the definition is to be interpreted; examples of higher dimensions can, of course, be constructed.

For any one experiment a single dimensionality and a single domain \prod D_i will presumably be selected, and a subset of the possible inputs with this domain will be used as {I_h}. The set I of inputs may be infinite; but at any one time, t, S will have observed only a finite number of them, namely, I^1, ..., I^t. Likewise any single input may have an infinite domain (such as an interval or the real line), but S need not observe or store this infinite item directly. The only information S must handle is that which is used in computing the values of the characteristics. We demand that the characteristics be effectively calculable functions, i.e., that for any double sequence, I^1, ..., I^t; X^1, ..., X^{t-1}, S can compute the value of any characteristic in a finite number of steps. For each input, I^t, S then handles whatever information about I^t is needed to compute characteristic values at time t or at any later time. The amount of this information will be finite.

2.3 Initial Knowledge

We have specified that initially S will have available a decision function, a list of characteristics, and, perhaps, lists, P_{c_{jk}}(X_i), for some characteristic values, c_{jk}. Discussion of decision functions and of the process of decision can be postponed, but lists of characteristics and probabilities merit discussion here. We shall call the probabilities, P_{c_{jk}}(X_i), "post-probabilities." Their use as a decision statistic is justified in decision theory, where they are usually called a posteriori probabilities. In order to produce an output Y^t which optimizes its decision function, S must have available (except in trivial cases) estimates of the likelihood that X^t is X_1, that X^t is X_2, etc.

The more relevant information that is used to construct these estimates, the more accurate they will be, and the better the decision will be (on the average). The ideal likelihood estimate for X_i is thus the post-probability, P_{R_t}(X_i), where R_t denotes "all relevant information which is available for decision at time t"; but, as we have seen, all relevant available information is to be represented by the characteristics. Where we denote the value of C_j at time t by c_{jt}, we then have

P_{R_t}(X_i) = P_{c_{1t} c_{2t} ... c_{nt}}(X_i),

and we call the expression on the right "the post-probability of X_i."

Suppose first that the post-probability for each X_i is always available at any time t. Then it can be used directly as a likelihood estimate for X_i. However, we have specified that S has available lists of the form P_{c_{jk}}(X_i) rather than of the form P_{c_{1k} c_{2k} ... c_{nk}}(X_i). To fit this special case into our general framework, it is only necessary to treat the combined characteristics C_1, ..., C_n as a single one, C_w. Thus for each distinct combination of values which C_1, C_2, etc., can have, we let C_w have a single distinct value. The likelihood estimate for X_i at time t is then P_{c_{wt}}(X_i), the value c_{wt} being determined collectively by c_{1t}, c_{2t}, ..., c_{nt}. Accordingly, the list of characteristics available to S reduces to one element, C_w, and the lists of post-probabilities will be of the form P_{c_{wt}}(X_i) for all i and t.

Suppose next that the post-probabilities above are not directly available for all X_i and all possible values of C_1, C_2, etc. They can then be estimated from the sequence of I's and X's presented to the system by a process which will be described in the next section. However, the estimation process may well require a very large number of time intervals to produce good estimates. Thus, whenever possible, it is desirable to compute the post-probability as a function of several, more readily estimatable, parts. A natural way of doing this is to separate the characteristics which are involved into groups and to express the post-probability as a (calculable) function of the post-probabilities of the groups. Thus, where the characteristics, C_1, ..., C_n, are grouped to produce D_1, ..., D_p, we will have:

P_{c_{1t} c_{2t} ... c_{nt}}(X_i) = f_i [ P_{d_{1t}}(X_i), P_{d_{2t}}(X_i), ..., P_{d_{pt}}(X_i) ]   (for 0 < p ≤ n, for all X_i),

where d_{jt} denotes the value of the group D_j at time t.
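The device of combining several characteristics into a single one is easy to realize concretely. The following is a minimal sketch, in Python, of one way to do it (a tuple of values serving as a single value of the combined characteristic); the function name combined_value and its arguments are illustrative only.

def combined_value(char_values, group):
    """One distinct value of the combined characteristic for each distinct combination
    of values taken by the characteristics whose indices are listed in `group`."""
    return tuple(char_values[j] for j in group)

# For example, with c_1t = 0, c_2t = 1, c_3t = 2, combining C_1 and C_3 gives
# combined_value([0, 1, 2], group=[0, 2]) == (0, 2), a single value of that group.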

In the special case where p = n and D_j = C_j, the probabilities P_{c_{jk}}(X_i) are clearly sufficient to compute the overall post-probability of X_i. In the other extreme case where p = 1, the post-probability is irreducible, f_i is the identity function, and there is effectively only one characteristic, as we have already seen. To make the intermediate cases fit into the framework also, we simply let D_1, D_2, ..., D_p be the characteristics. There are p of these, each representing a combination of one or more C's in exactly the same way that C_w represented n C's in the irreducible case.

More needs to be said about the purpose and form of the functions f_i. If the post-probability of X_i were: 1) mathematically equivalent to some function of the separate P_{c_{jk}}(X_i), or 2) readily available, there would be no need for the f_i. Unfortunately, things do not always work out this way, and we must have a way of computing post-probabilities from other statistics which are readily available. A conceptually simple and convenient set for these purposes is the set P_{c_{jk}}(X_i) for each C_j. In some practical cases, the functions, f_i, can be chosen according to the experimenter's estimate of the dependence of the post-probability on the various P_{c_{jk}}(X_i). In others no particular basis for the choice of the f_i will exist. Then certain assumptions of homogeneity can be made and the functions which result from these assumptions can be chosen.

Let us illustrate the choice of the f_i by a detailed example. Suppose there are two patterns, P_1 and P_2 (X_1 and X_2); three characteristics, C_1, C_2, and C_3; and that C_1 and C_2 have the possible values 0 and 1 while C_3 has the possible values 0, 1, and 2. Suppose first that no estimates of f_1 and f_2 are known, so that assumptions of homogeneity must be made. Let us consider as a specific case the estimation of f_1 and, in particular, the estimation of P_{c_1=0, c_2=0, c_3=0}(X_1) from P_{c_1=0}(X_1), P_{c_2=0}(X_1), and P_{c_3=0}(X_1). We note that when c_1 = 0, six separate situations can occur, i.e., C_2 may have either of two values and C_3 may have any of three. Lacking other information about f_1, we then make the crucial assumption of homogeneity; we assume that each of these six situations is equally probable.

It follows that (1/6) P_{c_1=0}(X_1) is an estimate of P_{c_1=0, c_2=0, c_3=0}(X_1). Similarly, when c_2 = 0, six separate situations can occur, and, when c_3 = 0, four separate situations can occur. Making a similar assumption of homogeneity in each case, we find two more estimates of P_{c_1=0, c_2=0, c_3=0}(X_1), respectively: (1/6) P_{c_2=0}(X_1) and (1/4) P_{c_3=0}(X_1). As our final estimate, we then take the average of these three:

(1/3) [ (1/6) P_{c_1=0}(X_1) + (1/6) P_{c_2=0}(X_1) + (1/4) P_{c_3=0}(X_1) ],

or

(1/36) [ 2 P_{c_1=0}(X_1) + 2 P_{c_2=0}(X_1) + 3 P_{c_3=0}(X_1) ].

Every step used in deriving this estimate can be carried through in the same way for each of the twelve combinations of values of the characteristics, and for X_2 as well as X_1. Thus, in this example, we estimate P_{c_1=k_1, c_2=k_2, c_3=k_3}(X_i) by

(1/36) [ 2 P_{c_1=k_1}(X_i) + 2 P_{c_2=k_2}(X_i) + 3 P_{c_3=k_3}(X_i) ],

and we have f_1(a, b, c) = f_2(a, b, c) = (1/36)(2a + 2b + 3c).

This example may help to explain the homogeneity assumptions and the estimates which result. A similar explanation could be given for the general case, but to avoid excessive notation, we shall merely state the result. Where there are characteristics, C_1, C_2, ..., C_n; where the number of values taken by C_j is V_j; and where the homogeneity assumptions for X_i are made, P_{c_{1t} c_{2t} ... c_{nt}}(X_i) is estimated by

M \sum_{j=1}^{n} V_j P_{c_{jt}}(X_i),

and f_i(a_1, a_2, ..., a_n) = M \sum_{j=1}^{n} V_j a_j, where M is the constant 1 / (n \prod_{j=1}^{n} V_j). Note that this estimation is performed by the experimenter, and is not part of the system's operation; S begins each experiment with f_1, f_2, ..., f_m directly available to it, whether computed in this way or not.
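The general result lends itself to direct computation. The following is a minimal sketch, in Python, of the homogeneity-based combining function; the name homogeneity_estimate is illustrative, V[j] stands for V_j, and a[j] stands for P_{c_{jt}}(X_i).

from math import prod

def homogeneity_estimate(a, V):
    """Estimate the post-probability as M * sum_j V_j * a_j, with M = 1 / (n * prod_j V_j)."""
    n = len(V)
    M = 1.0 / (n * prod(V))
    return M * sum(Vj * aj for Vj, aj in zip(V, a))

# In the example above, V = [2, 2, 3] gives M = 1/36 and weights 2/36, 2/36, 3/36,
# so homogeneity_estimate([a, b, c], [2, 2, 3]) reproduces (1/36)(2a + 2b + 3c).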

2.4 Learning

In the previous section, we have shown how the post-probability for X_i can be obtained from the lists, P_{c_{jk}}(X_i). We have specified that the lists, P_{c_{jk}}(X_i), may be available to S at the beginning of its operation. However, if some of these lists are not available, they too must be estimated.

The estimation of P_{c_{jk}}(X_i) for a particular characteristic value, c_{jk}, and pattern X_i, is a simple one. We merely note that

P_{c_{jk}}(X_i) = P(X^t = X_i and C_j = c_{jk}) / P(C_j = c_{jk}),

by definition of conditional probability. Then, interpreting these probabilities as relative frequencies, we consider all previously identified inputs in the experiment for which C_j had the value c_{jk}. The number of these which were identified as pattern X_i, when divided by the total number, serves as an estimate for P_{c_{jk}}(X_i). Naturally this proportion may change in the course of an experiment. But, by the law of large numbers, it will tend to become a more and more exact estimate of the correct probability. Thus, in a rudimentary sense, S will gradually "learn" to perform correct identifications as time goes on.

To compute these estimates, S must have in storage some numbers which indicate how often each characteristic value, c_{jk}, has occurred in conjunction with each pattern, X_i. At time t in an experiment, consider the number of previous time-steps at which characteristic C_j had value c_{jk} and X_i was presented as the correct pattern. We denote this number by #(c_{jk}, X_i). Let S store the list #(c_{jk}, X_1), #(c_{jk}, X_2), ..., #(c_{jk}, X_m) for each c_{jk}. Then at time t, P_{c_{jk}}(X_i) can be estimated by:

#(c_{jk}, X_i) / \sum_{r=1}^{m} #(c_{jk}, X_r).

If no X^t is presented, the same numbers will remain in storage for computations at time t + 1. However, if X^t is presented, the stored values must be changed to keep the estimates correct for time t + 1.
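The counts and the relative-frequency estimate just described can be sketched directly; the updating of the stored counts is detailed in the next paragraph. In the following minimal Python sketch, the names counts, update, and estimate are illustrative, counts[(j, c, i)] plays the role of #(c_{jk}, X_i), and the behavior before any counts exist (a uniform estimate) is an assumption not specified in the report.

from collections import defaultdict

counts = defaultdict(int)            # counts[(j, c, i)]  ~  #(c_jk, X_i)

def update(char_values, correct_i):
    """When X^t = X_i is revealed, add one to #(c_jt, X_i) for every characteristic j."""
    for j, c in enumerate(char_values):
        counts[(j, c, correct_i)] += 1

def estimate(j, c, i, m):
    """Relative-frequency estimate of P_{c_jk}(X_i): #(c_jk, X_i) / sum_r #(c_jk, X_r)."""
    total = sum(counts[(j, c, r)] for r in range(m))
    return counts[(j, c, i)] / total if total else 1.0 / m   # uniform when no counts yet (assumption)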

Suppose that X^t = X_1 and the characteristics at time t have the respective values c_{1t}, c_{2t}, ..., c_{nt}. Then S merely adds one to each of the stored values #(c_{1t}, X_1), #(c_{2t}, X_1), ..., #(c_{nt}, X_1) to keep the estimation correct. The same is true, of course, for X_2, X_3, etc.

2.5 Decision

The decision processes involve, and need to involve, no more than the decision function, D, and the post-probabilities for each X_i. The former is available to S directly; the latter can be estimated for each time t, as we have seen. The most common decision function (and the one used almost exclusively in earlier pattern recognition experiments) is that of "percent correct." To operate under this decision function, S need only choose the X_i which is most likely to be correct at each stage, i.e., that which has the highest post-probability.

A more general case is that in which the goal is to maximize expected value with respect to a table of values and costs. Formally such a table is a mapping from {X_i} × {Y_s} = P × P into the integers (or real numbers). Informally, it specifies how much value (positive or negative) S receives at time t when X^t = X_i and Y^t = Y_s, for every possible i and s. If the value, D(X_i, Y_s), is 1 where i = s and -1 otherwise, this reduces to the "percent correct" case. The expected value of choosing Y_s is defined as

E(Y_s) = \sum_{i=1}^{m} P_{R_t}(X_i) D(X_i, Y_s),

where P_{R_t}(X_i) is the post-probability of X_i as before. To maximize expected value at time t, S merely chooses the Y_s for which E(Y_s) is highest. Still other decision functions have been treated in the literature of decision theory and need not be discussed here.

3. DIAGRAM OF THE SYSTEM

The following is a list of the notation used in the diagram of the system in Fig. 1:

A. Prior Knowledge:
1. D: decision function.
2. C_j: list of characteristics and their possible values.
3. Known P_{c_{jk}}(X_i): for some characteristic values c_{jk}, the list P_{c_{jk}}(X_1), ..., P_{c_{jk}}(X_m). (May be omitted.)
4. f_i: for each X_i, a function which estimates P_{c_{1t} ... c_{nt}}(X_i) from P_{c_{1t}}(X_i), ..., P_{c_{nt}}(X_i).

B. Knowledge Presented at Time t:
1. I^t: input at time t.
2. X^t: "correct" pattern for time t. (May be omitted.)

C. Output at Time t:
1. Y^t: system's "guess" at time t.

Fig. 1. Diagram of the System. The diagram shows the flow of one time-step: the current value c_{jt} of each characteristic C_j is computed from the observed I's and X's, information about I^t being stored for later time periods where necessary; P_{c_{jt}}(X_i) is taken from the known lists where these are supplied, and otherwise estimated for all c_{jt} and all X_i as #(c_{jt}, X_i) / \sum_r #(c_{jt}, X_r); P_{R_t}(X_i) is computed as f_i[P_{c_{1t}}(X_i), ..., P_{c_{nt}}(X_i)] for all X_i; Y^t is selected to maximize D according to the P_{R_t}(X_i); and, if X^t is present, one is added to each stored count #(c_{jt}, X^t).
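Read as a procedure, the diagram amounts to one routine executed per time-step. The following is a minimal sketch, in Python, of such a step, assuming the count-based estimate and update routines of Section 2.4 and the expected-value decision of Section 2.5; all names are illustrative, and the sketch covers only the "pure" case in which the characteristics are functions of I^t alone.

def recognition_step(input_t, correct_t, characteristics, f, D, patterns, estimate, update):
    """One time-step of S: compute c_jt, form post-probabilities, choose Y^t, then learn.

    characteristics: functions C_j applied here to I^t alone (the "pure" case);
    f[i]: the combining function f_i; D[(i, s)]: the value table D(X_i, Y_s);
    estimate/update: the count-based routines sketched in Section 2.4.
    """
    m = len(patterns)
    char_values = [C(input_t) for C in characteristics]          # current values c_jt

    # Post-probability of each X_i via f_i applied to the estimated P_{c_jt}(X_i).
    post = [f[i]([estimate(j, c, i, m) for j, c in enumerate(char_values)])
            for i in range(m)]

    # Expected value of each possible answer: E(Y_s) = sum_i P_Rt(X_i) D(X_i, Y_s).
    expected = [sum(post[i] * D[(i, s)] for i in range(m)) for s in range(m)]
    guess = patterns[max(range(m), key=lambda s: expected[s])]

    # Learning: if X^t is presented, add one to each #(c_jt, X^t) in storage.
    if correct_t is not None:
        update(char_values, patterns.index(correct_t))
    return guess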

4. CONCLUDING DISCUSSION

We have attempted to present a formalization of some of the elements and processes inherent in any pattern recognition situation. To do this, it has been necessary to abstract, and to omit discussion of the details of any specific application. At the same time, certain somewhat arbitrary choices have had to be made; for example, the "characteristics" discussed here provide only one of many ways of formalizing information and of relating initial knowledge to experience with individual items. We have made these choices in an attempt to: 1) limit generality as little as possible, 2) achieve simplicity of structure, and 3) permit ready application to particular situations. These three goals run through the work. Simplicity has obvious advantages, and generality is sought, not for itself, but as a means to eventual widespread application.

The indispensable first step has been taken: a basic model has been made precise. At this point it is possible to ask many further questions in a meaningful way, and to use the formal model in the search for answers. In conclusion, let us summarize some of the areas which may be investigated in the future.

1. Adaptation: Suppose the system S is considered as a part of a larger system, whose goal is to improve the performance of S over a series of experiments. How can this be done most effectively? Can successful adaptation be achieved by modifying the characteristics or by generating new ones? By systematic modification of the f_i functions? What further modes of learning can increase the power of S?

2. Memory: A study of the memory capacity required by S for its various operations could be made. In particular, how much (on the average) will memory limitations impair successful recognition?

3. Computer Simulation: In what ways can the model be streamlined so as to permit more efficient computer simulation? Simulation is now possible, of course, but might well be postponed until it is of greater theoretical or practical value.

4. Theoretical Additions: A method of weighting the importance and the degree of independence of the characteristics might be added.

Where the P_{c_{jk}}(X_i) are known only as estimates (rather than being completely known or unknown), one might allow the "learning" mechanism to accept these estimates initially and to treat them as though they were based on an earlier sequence of inputs. Estimates might alternatively be supplied as a range of possible values, rather than as a single exact value.
