THE UNIVERSITY OF MICHIGAN COLLEGE OF LITERATURE, SCIENCE, AND THE ARTS Computer and Communication Sciences Department ON INDUCING A NON-TRIVIAL, PARSIMONIOUS, HIERARCHICAL GRAMMAR FOR A GIVEN SAMPLE OF SENTENCES John Reed Koza January 1973 Technical Report No. 142 with assistance from: National Science Foundation Grant No. GJ-29989X Washington, D.C.

ABSTRACT

ON INDUCING A NON-TRIVIAL, PARSIMONIOUS, HIERARCHICAL GRAMMAR FOR A GIVEN SAMPLE OF SENTENCES

by John Reed Koza

Co-Chairpersons: Bruce Arden, John Holland

The thesis presents an algorithm which, for a given sample of sentences, induces a grammar that (1) can generate the given sample of sentences, (2) is non-trivial in the sense that it does not merely enumerate the sentences of the sample, (3) is hierarchical in the sense that the sentences of the sample are derived through relatively long sequences of sentential forms, (4) is, at the same time, parsimonious in the sense that the grammar contains a relatively small number of rules of production, (5) contains recursive rules of production under appropriate specified conditions, and (6) contains disjunctive rules of production (and generalizations) under appropriate specified conditions. The thesis formally defines the notion of a regularity in a sample of sentences, presents a combinatorial algorithm for searching for regularities, and argues that searching for local regularities is an appropriate basis for a grammar discovery algorithm. The Grammar Discovery Algorithm presented involves (1) a search of the given sample of sentences for local regularities (this search being an exhaustive process within a limited range), (2) the non-exhaustive consideration of various possible transformations of the given sample via grammatical rules of production that are developed from the local regularities, (3) a selection process among such transformations, and (4) the reapplication of these steps until the sample is fully resolved and a grammar induced. The thesis contains and describes a computer program implementing the Grammar Discovery Algorithm. While the emphasis is on grammar discovery, there is also discussion of the equivalent problem of inducing axioms and rules of inference in a formal system. The Grammar Discovery Algorithm is justified in terms of its internal consistency, its satisfaction of various heuristic criteria, and by examples.

ACKNOWLEDGMENT

I thank Professor Bruce Arden who, as co-chairperson of this thesis committee, gave of his time, ideas, and information in writing this thesis; Professor John Holland who, as co-chairperson, originally suggested this topic in his seminar in adaptive systems theory and gave aid and comfort to this thesis throughout; Professor Eugene Lawler (now removed to the University of California, Berkeley, and hence not now a member of the committee) who suggested several of the optimizing features of the Grammar Discovery Algorithm; Associate Professor Bernard Zeigler who was helpful at various stages; Professor Walter Reitman who was helpful in the writing of this thesis and particularly the sections justifying the Algorithm; and Professor Larry Flanigan who, on very little notice, served as a proxy and replacement for Professor Zeigler on the committee, while the latter was on sabbatical in Israel. I also thank Monna Whipp of the Logic of Computers Group for typing and preparing this thesis for publication as a report of the Logic of Computers Group. And, in general, I thank the Department and Faculty for tolerating the glacial advance of this research project.

John Koza
Ann Arbor, Michigan
December, 1972

TABLE OF CONTENTS

Acknowledgment ii
Table of Notation vi

I. INTRODUCTION
A. STATEMENT OF THE PROBLEM 1
B. REVIEW OF PREVIOUS WORK IN THE FIELD 7
C. PRINCIPAL CLAIMS OF THIS THESIS 9
D. OVERVIEW OF THIS THESIS 10

II. DESCRIPTION OF THE GRAMMAR DISCOVERY ALGORITHM
A. PRESENTATION OF THE SAMPLE 17
B. THE SEARCH PHASE -- FINDING REGULARITIES IN A SAMPLE OF SENTENCES
1. INTRODUCTION TO THE SEARCH PHASE 19
2. CYLINDER SETS IN A SAMPLE 20
3. MASKS 23
4. TYPES OF MASKS 25
5. CONTEXT-SUB-SEQUENCES, PREDICTED-SUB-SEQUENCES, AND DOMAIN SEQUENCES 27
6. DEFINITION OF A REGULARITY 28
7. TYPES OF REGULARITIES 30
8. RATIONALE FOR SEARCHING FOR LOCAL REGULARITIES 31
9. ALGORITHM OF THE SEARCH PHASE 35
10. EXAMPLES 39
C. THE RECODING PHASE -- DEVELOPING RULES OF PRODUCTION FROM REGULARITIES
1. INTRODUCTION TO THE RECODING PHASE 45
2. TRANSFORMATIONS (RECODINGS) 46
3. DEVELOPMENT OF RULES OF PRODUCTION FROM LOCAL REGULARITIES 48
4. THE RECODING PROCEDURE 52
5. EXAMPLE 56
D. THE SELECTION PHASE -- SELECTING RECODINGS
1. INTRODUCTION TO THE SELECTION PHASE 60
2. ENTROPY, PARSIMONY, RECURSIVE PARSIMONY OF A TRANSFORMATION 61
3. THE Pij-GRAPH OF A TRANSFORMATION 58
4. RESOLVING TRANSFORMATIONS 73
5. TRIAL RECODINGS AND THE SELECTION OF THE ACTUAL RECODING 75

E. INDUCING RECURSIONS FROM A FINITE SAMPLE OF SENTENCES
1. INTRODUCTION AND MOTIVATION 84
2. DEFINITION OF A RECURSION 86
3. APPROACHES TO INDUCING RECURSIONS 87
4. SENTENCE-ORIENTED METHOD OF INDUCING RECURSIONS 90
5. RULE-ORIENTED METHOD OF INDUCING RECURSIONS 94
F. INDUCING DISJUNCTIONS AND GENERALIZATIONS
1. INTRODUCTION AND MOTIVATION FOR INDUCING GENERALIZATIONS 100
2. INDUCING GENERALIZATIONS COMBINATORIALLY (RULE-ORIENTED METHOD) 101
3. INTRODUCTION AND MOTIVATION FOR INDUCING DISJUNCTIONS 104
4. INDUCING DISJUNCTIONS USING ENTROPY (SENTENCE-ORIENTED METHOD) 106
G. THE TERMINATION PROCEDURE 111
H. CHOICE OF LIMITS ON LENGTH OF REGULARITIES IN THE SEARCH PHASE 113
I. CHOICE OF PARAMETERS IN THE RECODING PROCEDURE 116
J. EXTENSION OF THE GRAMMAR DISCOVERY ALGORITHM TO SAMPLES OTHER THAN LINEAR SEQUENCES OF SENTENCES 119

III. THE SELF-PUNCTUATING NATURE OF THE GRAMMAR DISCOVERY ALGORITHM
A. INTRODUCTION 123
B. EFFECT OF CHOICE OF LIMITS ON LENGTH OF REGULARITIES IN THE SEARCH PHASE
1. INITIAL DISCUSSION 124
2. EXTENSIONS OF A REGULARITY 125
3. DEFINITION OF A MAXIMAL REGULARITY 127
4. PERSISTENCE OF MAXIMAL REGULARITIES AFTER RECODINGS 128
C. EFFECT OF CHOICE OF PARAMETERS IN THE RECODING PROCEDURE 134
D. EFFECT OF PRESENCE OR ABSENCE OF INITIAL PUNCTUATION IN THE SAMPLE 136

IV. ADDITIONAL FEATURES OF THE GRAMMAR DISCOVERY ALGORITHM
A. TERNARY MASKS AND DON'T CARE CONDITIONS 139
B. ALPHABET-ENLARGING VERSUS HUFFMAN-TYPE RECODING 142
C. RECODINGS WITH NOISE IN THE SAMPLE 144
D. CONTEXT-FREE CASE 145

V. EXAMPLES 148

VI. FUTURE DIRECTIONS 180

APPENDIX A: DESCRIPTION OF INPUT TO THE COMPUTER PROGRAM IMPLEMENTING THE GRAMMAR DISCOVERY ALGORITHM 182
APPENDIX B: LISTING OF THE COMPUTER PROGRAM IMPLEMENTING THE GRAMMAR DISCOVERY ALGORITHM 193
BIBLIOGRAPHY 245

TABLE OF NOTATION

A Starting set (axioms) of general derivation system.
VT Terminal Alphabet of grammar or general derivation system.
VN Non-Terminal Alphabet of grammar or general derivation system.
R Rules of Production of grammar or general derivation system.
H = <VT, VN, R, A> General Derivation System.
S Starting Symbol of grammar.
G = <VT, VN, R, S> A Grammar.
C1 Number of symbols in alphabet of the Sample (VT).
Y The sample of sentences. A sequence over VT or the current alphabet.
T Indexing parameter for sample.
N Length of Sample Y.
Nmax Length of longest sentence in Y. (If there is no initial punctuation in the sample, Nmax = N.)
V The current Alphabet. VT is the current alphabet for level one.
C Number of symbols in current alphabet (that is, C1 plus non-terminal symbols added). C = C1 for level one.
M Length of Sequence of symbols.
g, h Indices typically used to index positions in a regularity (the context part or predicted part of a regularity).
ε The value 1/C^g.
Pij Conditional Probabilities.
% Symbol meaning "predicted".
_ Symbol meaning "context".
# Symbol meaning "Don't Care".

ℳ(I) Mask. Sequence of Length M over the symbols "%", "_", and possibly also "#" (if the mask is ternary).
γ(I) Context-Sub-Sequence. Sequence of Length M over the current alphabet augmented by "%" and possibly also "#".
φ(I) Predicted-Sub-Sequence. Sequence of Length M over the current alphabet, augmented by "_" and possibly also "#".
ℛ = <γ, φ, P, Y> A regularity in sample Y whose context-sub-sequence is γ, whose predicted-sub-sequence is φ, and whose conditional probability is P.
Δ The domain sequence of a regularity.
τ A transformation.
π Coefficient of Parsimony.
πr Coefficient of Recursive Parsimony.
n Number of Rules of Production.
nr Number of Recursive Rules of Production.
. The period (Initial punctuation mark).

I. INTRODUCTION

A. STATEMENT OF THE PROBLEM

The processes of deduction and induction are fundamental in mathematics and the axiomatized sciences. These same two processes also appear in the study of generative grammars. We define a general derivation system H = <VT, VN, R, A> to be a system in which VT is the set of terminal symbols, where VN is the set of non-terminal symbols, where R is a set of rules, and where A is the starting set. The starting set contains strings of terminal and non-terminal symbols. The rules are pairs of symbol strings. The first string, called the antecedent (or simply the left side), is a non-empty string of terminal symbols and/or non-terminal symbols. The second string of the pair, called the consequent (or simply the right side), is a string of terminal symbols and/or non-terminal symbols, or is empty. A derivation system generates sentences by starting with any one of the strings in the starting set, and by applying the rules, one after the other. Each application of a rule consists of substituting the consequent sub-string for the antecedent sub-string. Each symbol string so produced is, in general, a string of terminal symbols and non-terminal symbols, and is called a sentential form. A sentential form consisting only of terminal symbols is called a sentence. A generative grammar is a derivation system in which the starting set A consists only of a single distinguished non-terminal of length one. We customarily call this symbol "S", the starting symbol, and we
call the rules of a grammar rules of production. A formal system consists of two derivation systems. First, there is a set VT of symbols of discourse about the subject at hand, and a set of formation rules RF for the formation of well-formed sentences over VT. This derivation system is a grammar and has a starting symbol S and whatever non-terminal symbols are needed to express the rules RF. The second derivation system has a subset A of the well-formed sentences of the first derivation system as its starting set. This subset consists of the accepted truths of the system, called axioms. There is also a set of truth-preserving rules RI, called rules of inference, for transforming one true, well-formed sentence into another. By recursively applying the axioms and the rules of inference, one derives the true sentences YT of the formal system, called theorems. The non-terminal symbol set of this second derivation system is typically the same as its terminal symbol set. Thus, any sentential form of this second derivation system is a sentence and also a theorem. The process of generating sentences in a derivation system is called derivation. The process of generating theorems in a formal system is called deduction. In both cases, VT, VN, R, and A are given. Typically, these two processes present themselves in the setting of finding the derivation, if any, that can generate a particular given sentence. For grammars, this problem is the parsing problem. For formal systems, it is well to remember that all theorem proving is merely a parsing problem. Parsing and theorem-proving algorithms are either bottom-up in the sense that they start with the given sentences and consider possible sentential forms from which these sentences can be immediately derived, or top-down in the sense that the derivation starts with the starting set A.
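Since the grammar discovery problem developed below inverts exactly this derivation process, a small illustration may be useful. The sketch below is in modern Python rather than the Fortran IV of the thesis's program, and the toy grammar and function names are invented for exposition only:

    # One derivation step: substitute a rule's consequent for the leftmost
    # occurrence of its antecedent in the current sentential form.
    def derive_step(form, rule):
        antecedent, consequent = rule
        i = form.find(antecedent)
        if i < 0:
            return None                  # the rule does not apply
        return form[:i] + consequent + form[i + len(antecedent):]

    # A toy grammar: "S" is the starting symbol; lower-case letters are terminal.
    rules = [("S", "aSb"), ("S", "ab")]

    form = "S"                           # begin with the starting set {"S"}
    for rule in (rules[0], rules[0], rules[1]):
        form = derive_step(form, rule)   # "S", "aSb", "aaSbb" are sentential forms
    print(form)                          # "aaabbb" is a sentence: terminals only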

Induction is the inverse process for deduction. In induction, one begins with an observed, limited set Y of the true sentences YT. The Copernican doing induction tries to discover a parsimonious set of axioms A and a set of rules of inference RI which together could have generated the given set Y. Grammar discovery is the inverse process for derivation in a grammar. In the grammar discovery problem, one starts with a set VT of C1 terminal symbols and a set Y of sentences over this VT. Y may be presented, for example, as a concatenation of the individual sentences, with or without initial punctuation between the sentences to identify the sentences. The problem we discuss herein is the problem of finding a non-trivial, parsimonious, hierarchical grammar G = <VT, VN, R, "S"> which could have generated the given sentences Y. This problem requires finding a set R of rules of production and a suitable set of non-terminal symbols VN. Grammar discovery and induction are necessarily bottom-up processes. In what follows, the main emphasis will be on the grammar discovery problem, since this is the underlying problem common with the induction problem for formal systems. When we say that we are seeking a non-trivial grammar, we are referring to any grammar except a grammar which merely catalogs the sentences of the sample Y, an example of which is the grammar in which the starting symbol "S" may be rewritten as the disjunction of the observed sentences of Y. Note that the adjective "non-trivial" applies to this well-defined, definite enumerative situation. When we say that we are seeking a hierarchical grammar, we are referring to a
grammar which, as a minimum, is non-trivial, but which has the additional quality of generating its sentences through relatively long sequences of sentential forms. Note that the adjective "hierarchical" is less precise and more connotative than the adjective "non-trivial". The connotation is that each small phrase and substructure of a sentence is generated by a separate application of a rule of production, in contrast to a situation in which large phrases are generated by one application of one rule. When we say that we are seeking a parsimonious grammar, we are referring to a grammar with a relatively small number of rules of production. Note that the adjective "parsimonious" is also a connotative word. The connotation is that a few rules of production are each applied many times in the generation of sentences, in contrast to a situation in which many rules are each used only in a few specific cases. A hierarchical grammar without recursions tends to be unparsimonious. A grammar may be parsimonious without being hierarchical, for example, if it is trivial. A grammar may be hierarchical without being parsimonious, as, for example, a grammar which fails to employ recursions, or employs superfluous non-terminal symbols or superfluous rules of production. Note that we state the grammar discovery problem as a problem of finding a parsimonious and a hierarchical grammar, but not necessarily the most parsimonious and most hierarchical grammar, qualities that are not defined here. Similarly, for lack of definition, we do not require the finding of a necessarily interesting grammar, nor a significant grammar, nor a linguist's grammar, nor a grammar conforming to some known or supposed physiological or psychological model, nor the "best" grammar. The grammar discovery problem, of course, admits of several
variations, and some of these variations have appeared within the very limited literature on grammar discovery and induction. For example, the machine alleging to solve the grammar discovery problem may receive more than a fixed sample of sentences. The machine may, for example, receive, or ask to receive, answers to certain questions in order to assist it in solving the problem. These questions may take the form of asking whether certain new sentences which it constructs are in the language produced by the grammar which is the desired solution to the problem. Or, non-sentences identified as such may be introduced with the sample sentences. Or, the machine may ask for a larger sample of sentences or identified non-sentences if it needs them (Gold, 1967). Then again, the machine may be assumed to be receiving a continuing, infinite input of sample sentences (Goodall, 1962). Another possibility is that the machine can receive, or ask to receive, certain cues, such as the number (or an upper bound on the number) of internal states which an automaton of a certain type would have to have in order to accept the language generated by the grammar which is the solution to the problem (Pao, 1969). Or, the machine may be told that the given sample contains all the sentences of the language (as in Gold's "text" method of presentation) or that the sample contains sentences whose production requires the use of all the transitions in a particular automaton (Pao, 1969). Or, the sample may be assumed to be structured in various ways, for example, the sample may be assumed to contain (or tend to contain) all of the sentences of the language whose length is not longer than the longest sentence of the sample (Feldman, 1969). We deal here, however, with what seems to be the most basic and general problem, the kind of grammar discovery problem one would
encounter in deciphering an unknown message from outer space, or in decoding an ancient language from a given scroll, or in finding the substructures in a coded message, that is: given a fixed, finite sample of sentences and no other information of any kind (that is, no other information about the source of the sample; no information about the type of grammar producing the sentences; no information about the statistical properties or completeness of the sample; no non-sentences; no cues; and no information in response to queries or requests), find a non-trivial, hierarchical, parsimonious grammar that could have generated the sentences of the sample.

B. REVIEW OF PREVIOUS WORK IN THE FIELD

The literature in the field of induction is very small. Solomonoff (1964) treats the problem as a special case of sequence extrapolation using enumerative methods which are not readily extendable, and which are highly dependent on such artifacts as order of sentences within the sample. Gold (1967) proposed a theory of language learning in which the learner receives sequences of correct sentences of an unknown language and successively guesses grammars of specified types. If the guessing converges to one correct grammar, the language has been "identified in the limit." Gold proves several decidability results concerning language identifiability in the limit, and proposes several categorizations of language learning models, and methods of presenting information to such learning models. Feldman (1969) proposes various concepts of grammatical complexity and proposes approaches to inducing a least complex grammar for a given sample. Feldman (1969) includes a heuristic method of merging symbols of the alphabet to achieve recursive rules from a finite sample. Pao (1969) deals with the induction of finite state languages and "delimited" languages, a subset of context-free languages. Pao specifies conditions on the sample to guarantee a solution; and then, given certain cues, the algorithm constructs an automaton, evaluates it, and modifies it until a solution is attained. Caianiello (1970) discusses the Procrustes Algorithm for feature extraction and form analysis of language, and uses semigroup homomorphisms similar to what are called "transformations" or "recodings" herein.

In a highly suggestive, short paper, Goodall (1962) discusses the problem of induction from the point of view of logical types, and the Gödel Incompleteness Theorem. Goodall develops a probabilistic automaton for a given ongoing infinite sequence of input symbols. In this paper, Goodall suggests, without specifying, the idea of recoding the given sample into a new "higher level" sample using "resolving transformations" that optimize the conflicting qualities of "entropy" (information) and "parsimony." This thesis started as an attempt to attribute some consistent and reasonable meaning to these suggestive terms and other metaphorical observations in the Goodall paper, but no claim is made that the definitions and constructions herein of these same terms in any way correspond to those intended by Goodall. In addition to work in the immediate field of induction, there is, of course, a large literature in the fields of pattern recognition (e.g. Uhr); cryptography (e.g. Smith and Gaines); picture analysis; and sequence forecasting and correlation, including continuous functions and numerical sequences.

C. PRINCIPAL CLAIMS OF THIS THESIS

In this thesis, we formalize the notion of a regularity in a sample of sentences, present a combinatorial algorithm for searching for regularities, and argue that searching for local regularities is a suitable beginning for a grammar discovery algorithm, an approach not found in the literature. We then present the Grammar Discovery Algorithm which, for a given sample of sentences, induces a grammar that (1) can generate the given sample, (2) is non-trivial in the sense that it does not merely enumerate the sentences of the sample, (3) is hierarchical in the sense that the sentences of the sample are derived through relatively long sequences of sentential forms, (4) is, at the same time, parsimonious, in the sense that the grammar contains a relatively small number of rules of production, (5) contains recursive rules of production under appropriate specified conditions, and (6) contains disjunctive rules of production (and generalizations) under appropriate specified conditions. We do not claim that the demonstrably non-trivial, hierarchical, and parsimonious grammar that we seek and find here is an interesting grammar, is a significant grammar (in the sense, perhaps, that it conforms to some known or supposed physiological or psychological model), is a "linguist's" grammar, or is the "best" grammar. Rather, we only claim that (1) the grammars we seek and find here do possess the above three qualities and that these qualities are reasonable objectives for a grammar discovery algorithm; (2) the grammar discovery algorithm herein satisfies certain heuristic criteria; and (3) the Algorithm handles a variety of interesting examples that indicate it is not inherently limited in its capacity.

D. OVERVIEW OF THIS THESIS

Part II of this thesis contains a description of the Grammar Discovery Algorithm. In II. A., we discuss the method of presenting the sample of sentences to the Grammar Discovery Algorithm. The Grammar Discovery Algorithm begins with a sample of sentences Y produced by an unknown grammar G = <VT, VN, R, "S">. The sentences are presented as a concatenation of sentences, with or without initial punctuation (for example, periods) between the sentences to identify the sentences. The terminal alphabet VT is also given, or can be trivially inferred from the sample. The goal is to find a parsimonious, non-trivial, hierarchical grammar for the given sample of sentences, that is, to find a set R of rules of production, and whatever non-terminal alphabet symbols VN are needed to write the rules R. One of the non-terminal symbols in VN will be designated as the starting symbol ("S") of the grammar. The Grammar Discovery Algorithm is divided into three parts: the Search Phase, the Recoding Phase, and the Selection Phase. In II. B., we describe the first main phase of the Grammar Discovery Algorithm, namely, the Search Phase. In II. B. 1 through II. B. 6, we formally develop and define the idea of a regularity in a sample. Informally, a "regularity" exists whenever the occurrence of certain symbols somewhere in a sentence in the sample implies that certain other symbols occur elsewhere in the sentence with a conditional probability greater than mere chance. In the Search Phase, one is interested in finding regularities in the given sample of sentences. The search is for local, not
global, regularities. It is not obvious that any grammar discovery algorithm should begin with a search for regularities, but an argument will be given to justify this search (II. B. 8). The regularities will also be categorized by type in a manner suggestive of the various Chomsky types of grammars (e.g. context-sensitive, unrestricted rewrite, etc.) (II. B. 7). The grammatical types of the rules of production that ultimately will be induced will parallel the types assigned to the regularities. We present the algorithm of the Search Phase in II. B. 9, and give examples (II. B. 10). The Search Phase is an exhaustive, combinatorial process, within the limits imposed by the principle of searching only for local regularities. The Search Phase ends with the generation of various tables of statistics about the various regularities discovered. The second main phase of the Grammar Discovery Algorithm is the Recoding Phase. In the Recoding Phase, the local regularities found in the Search Phase are used to transform (recode) the given sample of sentences into a new set of sentences. This recoding is specified by (1) a set of grammatical rules of production, which specify which substitutions are to be made, and by (2) a Recoding Procedure, which specifies when and where these substitutions are to be made. The grammatical rules of production are developed from the local regularities discovered in the Search Phase. These rules ultimately become the rules of production of the induced grammar. The Recoding Procedure is essentially a punctuating process that partitions the given sample into appropriate phrases and substructures. The Recoding Procedure operates so as to minimize the introduction of rules of
production into the induced grammar. For example, before a new rule of production is introduced into the induced grammar, the algorithm considers the possibility of instead introducing one recursive rule (to replace one existing rule and the proposed new rules) as well as the possibility of re-using the new rule alone or in conjunction with other existing rules. The Recoding Phase is a non-exhaustive process, in contrast to the Search Phase. In II. C., we describe this Recoding Phase. We define the notion of a transformation (recoding) in II. C. 2. We show how to develop grammatical rules of production from local regularities in II. C. 3, and show the connection between the grammatical type of the rule and the mask of the local regularity from which the rule was developed. In II. C. 4, we describe the Recoding Procedure, which, with the rules of production, defines the transformation (recoding). The new set of sentences resulting from a recoding contains most of the information and structure of the given sample of sentences. Because structural regularities found in the original sample are replaced with short markers, what were originally global regularities in the original sample tend to become nearby local regularities in the new set of sentences. These short markers ultimately become the non-terminal symbols of the induced grammar. The original given sample of sentences is regarded as the set of sentences of the first level in the Grammar Discovery Algorithm. These sentences are composed entirely of terminal symbols. The set of sentences resulting from the recoding then becomes the sentences of the next level of the process. These sentences contain non-terminal symbols as well. These sentences will now be searched for their local regularities, and another recoding performed. The alternate
application of first the Search Phase, and then the Recoding Phase is what produces the non-trivial, hierarchical quality of the grammar that is developed. The third main phase of the Grammar Discovery Algorithm is the Selection Phase. After the Search Phase is completed, several (but not all) possible recodings of the sentences are considered. The Selection Phase is a non-exhaustive process that oversees these several trial recodings and selects one recoding which is then actually applied to the sample to produce one new set of sentences. This selection is made on the basis of three criteria: the entropy, parsimony, and recursive parsimony of the transformation. The parsimonious quality of the induced grammar derives from this selection from among the various possible recodings. Upon completion of this actual recoding, the 3 main phases are repeated, using the new set of sentences, instead of the original sample. The processes continue until the conditions of the Termination Procedure are fulfilled, thus ending the grammar discovery algorithm. The result is a set of rules of production R, and a non-terminal alphabet VN with which to express them, and a starting symbol "S". Together with the given terminal alphabet VT, these elements will constitute a grammar G = <VT, VN, R, "S"> that could have produced the given sample of sentences. In II. D., we describe this Selection Phase. We define the entropy, parsimony, and recursive parsimony of a transformation in II. D. 2. We define the notion of a resolving transformation in II. D. 4. We graph the conditional probabilities of the regularities of the transformation in II. D. 3. And, in II. D. 5, we describe the process of making several trial recodings (each parameterized by the
length of the longest regularity used), and the algorithm for selecting the actual recoding. In II.E., we describe the process of inducing recursions from a finite sample of sentences. In II.E.1, we first discuss the motivation for inducing recursions, namely, the ability of the induced grammar to generate an infinity of sentences. Without recursions, there can only be a finite number of sentences generated by a grammar. We define recursions in II.E.2, and discuss different possible approaches to inducing recursions in II.E.3. The different approaches can be divided according to whether they operate on the sentences of the sample or on the rules of production of the induced grammar. In II.E.4, we present the sentence-oriented method of inducing recursions. This method has limited application. Finally, in II.E.5, we discuss the rule-oriented method of inducing recursions. In II.F., we discuss the related problems of inducing generalizations and inducing disjunctions. The discussion parallels the discussion of recursions. In II.F.1, we present the motivation for inducing generalized rules when the sample is a formal system. In II.F.2, we present a combinatorial rule-oriented method for inducing generalizations. Then, in II.F.3, we turn to disjunctions. We discuss the motivation for inducing disjunctive (non-deterministic) rules of production in grammar induction, namely, the ability of the induced grammar to generate a variety of essentially different sentence types. Without disjunctions, there can only be essentially one sentence type generated by the grammar. In II.F.4, we present a sentence-oriented method for inducing disjunctions that is based on finding ensembles of substitution instances that have maximal or near-maximal entropy. We conclude this section with a discussion of the similarity of the problems of
inducing generalizations and inducing disjunctions. In II.G., we discuss the Termination Procedure and Convergence of the Algorithm. In II.H., we discuss the heuristic considerations for choosing the upper limits on lengths of regularities to be searched for in the Search Phase. In II.I., we discuss the various parameters of the Recoding Procedure. In II.J., we elaborate on the idea that the Grammar Discovery Algorithm presented here for linear sequences of symbols is applicable to a broad variety of induction and pattern recognition problems presented in different formats. This section also serves to extend the definitions and concepts of the Grammar Discovery Algorithm by providing illustrations from a simple two-dimensional pattern recognition problem. In II.K, we present a variety of examples of the Algorithm. Throughout Part II, we point out how the basic features of the Grammar Discovery Algorithm, namely, its Search Phase involving local regularities, the Recoding Phase, and the Selection Phase, work to produce a non-trivial, hierarchical, and parsimonious induced grammar. In Part III, we discuss the self-punctuating nature of the Grammar Discovery Algorithm. In III.B.2, we define the notion of extending a regularity and of a regularity being preserved under extension, and we define a maximal regularity (III.B.3). In III.B.4, we argue that a maximal regularity will usually not be missed or lost by an unfortunate choice of limits on the length of regularity in the Search Phase. In III.C., we argue that the choice of parameters in the Recoding
Phase and the combinatorial incompleteness of the Recoding Phase do not usually cause the Algorithm to miss maximal regularities in the sample, unless it finds equally desirable maximal regularities. In III.D., we make the observation (quite contrary to intuition) that the presence or absence of initial punctuation between the individual sentences of a sample does not qualitatively or even quantitatively (combinatorially) complicate grammar induction. In Part IV, we discuss several additional features of the Grammar Discovery Algorithm, including ternary masks and "don't care" conditions (IV.A); alphabet-enlarging versus the alternative (Huffman) type recodings (IV.B); recodings with noise in the sample and what noise represents (IV.C); and the generation of context-free rules of production by an algorithm which appears to be inherently context-sensitive (IV.D). In Part V, we discuss the justification for the Algorithm. In the appendices, we describe the input to the Fortran IV computer program implementing the Grammar Discovery Algorithm (Appendix A); we exhibit the program in Appendix B; and we present additional examples in Appendix C.

II. DESCRIPTION OF THE GRAMMAR DISCOVERY ALGORITHM

A. PRESENTATION OF THE SAMPLE

Let VT be a set of C1 symbols called the terminal alphabet. The Grammar Discovery Algorithm is said to operate in one of two modes depending on whether initial punctuation is present in the sample. We say that the Grammar Discovery Algorithm is operating in the first mode if there is no initial punctuation present in the sample, that is, the sample Y consists of one long sequence of terminal symbols Y = Y(1), Y(2), ..., Y(N)** in which the "sentences" of the sample are concatenated together. We say that the Grammar Discovery Algorithm is operating in the second mode*** if initial punctuation is present between the sentences of the sample. This initial punctuation, if present, is conventionally the period. In this mode, the sample appears as a string of symbols (a sentence) followed by a period, another string of symbols (another sentence), another period, etc. Thus, the sample is a sequence of symbols over the terminal alphabet, with or without initial punctuation between the individual sentences.

*Mathematical variables which are capitalized herein refer to variables that appear in the computer program implementing the Grammar Discovery Algorithm.
**Mathematical vectors which appear in the computer program implementing the Grammar Discovery Algorithm are written herein in the parenthesized style common in computer languages.
***The variable MODE in the computer program specifies whether the Grammar Discovery Algorithm is operating in the first mode or in the second mode.

The sample consists of only this symbol sequence. No additional information of any kind is given. No assumptions as to the statistical properties, completeness, or any other aspect of this sample are made. Note that this method of information presentation is neither Gold's "text" method (in which the sample is presumed to contain every sentence of the language!) nor Gold's "informant" method (in which non-sentences appear) (Gold, 1967). The presentation of the sample Y as a sequence of symbols, rather than as a square, rectangle, tree, graph, or other structure or raster containing symbols, conforms to the predisposition in the study of grammars of natural and artificial languages and in the study of formal systems to consider linear sequences of symbols. However, virtually nothing in what follows is actually dependent on this linear method of presentation. Thus, pattern recognition problems where the sample consists of other structures or rasters of symbols are equally amenable to the methods described herein.
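As a concrete illustration of the two modes (the strings below are invented for exposition, not taken from the thesis):

    # First mode: no initial punctuation; Y is one long concatenation
    # of sentences over the terminal alphabet VT = {a, b, c}.
    Y_mode1 = "abcabbcabc"

    # Second mode: initial punctuation (conventionally the period)
    # appears between the individual sentences.
    Y_mode2 = "abc.abbc.abc."
    sentences = Y_mode2.rstrip(".").split(".")   # ['abc', 'abbc', 'abc']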

B. THE SEARCH PHASE — FINDING REGULARITIES IN A SAMPLE OF SENTENCES

1. INTRODUCTION TO THE SEARCH PHASE

The first main phase of the Grammar Discovery Algorithm is the Search Phase. In the Search Phase, we search for local regularities in the given sample of sentences. Informally, a "regularity" exists whenever the occurrence of certain symbols somewhere in a sentence in the sample implies that certain other symbols occur elsewhere in the sentence with a conditional probability greater than mere chance. In this section, we proceed to define formally the notion of a cylinder set in a sample; to generalize this concept using the idea of a mask; to categorize these masks by grammatical type; to define formally the notions of the context-sub-sequence, the predicted-sub-sequence, and the domain sequence; and to formally define and illustrate the idea of a regularity in a sample. We then argue that if a grammar discovery algorithm is to begin with a search for regularities, the length of those regularities must be small, that is, we should only search for local regularities. We present the algorithm of the Search Phase. The Search Phase is an exhaustive, combinatorial process, within the limited range established by the principle of searching only for local regularities.

2. CYLINDER SETS IN A SAMPLE

Let V be a set of C symbols, a1, a2, ..., aC. Call V the alphabet. A sentence over V is a sequence s = y1 y2 ... yn where each yi, 1 ≤ i ≤ n, is in V. Our interest here is in finite sentences, although most of the statements we make would apply equally to infinite sentences. Let Y be a given sample of sentences. In the terminology of automata theory and information theory (Khinchin, 1956), each sentence in the set of all possible sentences over VT is an elementary event of some space of events. Each subset of Y (and Y itself) then is an event. We are interested in a particular kind of event, the cylinder set. An index sequence Z = (t1, t2, ..., th) is a sequence of distinct positive integers. An allowable index sequence for a given set Y of sentences is an index sequence in which no integer is greater than the length of the shortest sentence in Y. Two index sequences are disjoint if they contain no integers in common. The cardinality of an index sequence Z, denoted |Z|, is the number of integers in Z. A cylinder set ξ, or a cylinder, for a given sample Y of sentences is the subset of all sentences in Y such that yti = ai for all i = 1, ..., h, where each ai ∈ V, and where the sequence (ti), i = 1, ..., h, is an allowable index sequence for the set Y. ξ may, of course, turn out to be empty. Note that the cylinder set ξ is defined in terms of 3 parameters: (1) the sample Y of sentences, (2) the
allowable index sequence Z, and (3) a sequence of h ai's from the alphabet V. Consider the cylinder set ξ defined for a sample Y of sentences, an index sequence Z = (ti), i = 1, ..., h, and h ai's from the alphabet V. Let W = (wj), j = 1, ..., g, be an index sequence disjoint from Z. Let α be the cylinder set defined over the cylinder set ξ (which is a subset of the sample Y of sentences), the index sequence W, and g ai's from V. A positive regularity exists in a sample Y of sentences if, for some non-empty cylinder set ξ defined as above, and for some cylinder set α defined as above, the conditional probability

|α| / |ξ| ≥ 1/C^g.

We denote the quantity 1/C^g by ε. Note that this quantity is the probability associated with a "pure chance" event, that is, it is the probability of occurrence of a particular string of length g over C symbols if each of the strings occurs equiprobably. It will be customary herein to number the cylinder sets, such as ξ and α, so that we will then be able to index the conditional probabilities (and thereby the positive regularities) with two numerical indices and refer to them as Pij, where i is the index number assigned to cylinder set ξ, and j is the index number assigned to α. The h positions of the index sequence Z constitute the context part of the positive regularity; and the g positions of the index sequence W constitute the predicted part of the positive regularity. Note that a positive regularity in a given sample is specified by 4 parameters: (1) the sample, (2) a context part, (3) a predicted part, and (4) the conditional probability |α|/|ξ|. In what follows, we will assume that the union of the h+g integers in the disjoint index sequences Z and W together constitute h+g consecutive integers, or, equivalently, that the context part
and the predicted part cover h+g contiguous positions of the sentences of the sample. We will discuss the occasions when this is not the case separately later, in the section on TERNARY MASKS.
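To make these definitions concrete, the following sketch (modern Python, with an invented four-sentence sample; the thesis's own program is the Fortran IV listing of Appendix B) builds the cylinder sets ξ and α and tests the conditional probability against ε = 1/C^g:

    # A cylinder set: all sentences y in Y with y[t] = a for each (t, a) pair.
    # Positions t are 1-based, as in the text.
    def cylinder(Y, assignment):
        return [y for y in Y if all(y[t - 1] == a for t, a in assignment.items())]

    Y = ["ababb", "abaab", "aabab", "babab"]
    C = 2                                  # alphabet {a, b}

    xi = cylinder(Y, {1: "a", 2: "b"})     # context part: h = 2 positions
    alpha = cylinder(xi, {3: "a"})         # predicted part: g = 1 position

    p = len(alpha) / len(xi)               # conditional probability |alpha| / |xi|
    epsilon = 1.0 / C**1                   # the "pure chance" level for g = 1
    print(p, p >= epsilon)                 # 1.0 True: a positive regularity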

3. MASKS

It will prove advantageous to generalize the above approach to cylinder sets and to detach the cylinders from specific positions t1, ..., th. That is, if a "1" in position 1 and a "1" in position 2 reliably predict a "0" in position 3, and a "1" in position 4 and a "1" in position 5 reliably predict a "0" in position 6, we should be able to generalize these two observations about the sample and identify them as being a regularity applying to sequences of length 3. Indeed, often the same phrase will appear many times in a sample, but it will usually appear in many different relative positions within the sentences. We should therefore be able to generalize these regularities and identify them as one regularity. Khinchin accomplishes this generalization by introducing a time shift operation. This approach is equivalent to the various "window" schemes in pattern recognition approaches (Uhr, 1963). However, even with the time shift, all the observations are specific to the symbols that occur in the various positions. That is, if an "a" in the first position (note: not position 1) and a "c" in a third position reliably predict a "b" in the second position, and a "d" in a first position and an "f" in a third position reliably predict an "e" in the second position, then we should also be able to generalize these two observed regularities and identify them as one regularity applying to certain sequences of length 3. Thus, it is desirable to make a further generalization and detach the specific context symbols and the specific predicted symbols from the regularity, and express the fact that certain positions may constitute a context that reliably predicts certain predicted positions. This leads to the following definition:

A binary mask ℳ, or a mask, is a binary sequence over the symbols "_" and "%", provided the sequence contains at least one "_" and at least one "%". The symbol "_" denotes "the context." The symbol "%" denotes "the predicted part." An M-mask is a mask of length M. For example, "_%_" is a 3-mask in which the context is the first and third position, while the predicted part is the second position. The universe of binary masks is all 2^M − 2 binary sequences over "_" and "%" provided the sequence contains at least one "_" and at least one "%". By implication, the shortest masks are of length 2. For example, the universe of binary masks of length 3 consists of the 6 masks "__%", "_%%", "%%_", "%__", "_%_", and "%_%". Note that "___" (all context) is not considered a mask and has no meaning. Also "%%%" (all predicted part) is not considered a mask at this point in this paper. It can, however, be given a reasonable meaning by interpreting it as the mask which catalogs the unconditional probability of occurrence of strings of length 3 in the sample.*

*The control parameter LAWL controls whether this extra mask is used in the computer program implementing the Grammar Discovery Algorithm.
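A hedged sketch of this universe of masks (modern Python; here and below the context symbol is written "_" and the predicted symbol "%"):

    from itertools import product

    # All binary M-masks: sequences over {"_", "%"} containing at least
    # one context symbol and at least one predicted symbol.
    def binary_masks(M):
        return ["".join(m) for m in product("_%", repeat=M)
                if "_" in m and "%" in m]

    print(binary_masks(3))        # the 6 masks of length 3
    print(len(binary_masks(4)))   # 2**4 - 2 = 14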

4. TYPES OF MASKS

The universe of M-masks is partitioned by grammatical type as follows: All masks of the form "_^(M−k) %^k", k ≠ 0, are called strictly left-sensitive. All masks of the form "%^k _^(M−k)", k ≠ 0, are called strictly right-sensitive. Both of these two types of masks are considered to be of the same level of grammatical complexity, that is, context-sensitive. "__%" is an example of a strictly left-sensitive 3-mask, and "%_" is an example of a strictly right-sensitive 2-mask. All masks of the form "_^k1 %^k2 _^(M−k1−k2)", where k1 ≠ 0 and k2 ≠ 0 and k1 + k2 < M, are called strictly context-sensitive. M must be at least 3 to have a strictly context-sensitive mask. "_%_" is an example of a strictly context-sensitive mask. All other masks are called strictly unrestricted rewrite. These masks are characterized by more than two switches from "_" to "%" or from "%" to "_" (if they start with a "_"), or by two or more such switches (if they start with a "%"). "%_%" is the shortest strictly unrestricted rewrite mask. The justification for the choice of these names for these subsets of masks will be seen later, in the section wherein we use the regularities derived from the masks to develop the rules of production of the induced grammar. Masks can have a radix of either 2 or 3. If the radix of a mask is 3, the mask is called a ternary mask. A ternary mask is a sequence over the symbols "_", "%", and "#". The symbols "_" and "%" are defined as for binary masks. The symbol "#" denotes "don't care." There are several important differences and effects when one uses ternary masks, instead of binary masks. Ternary masks (1) allow
a separation in positions between the symbols of a regularity; (2) allow a kind of generalization operation (opposite in effect to the substitution rule of inference in logic) over whatever may be in a certain position; and (3) may allow certain cryptographic regularities to be more readily discovered. Ternary masks will be discussed in full later. For example, of the 6 masks of length 3, "_%%" and "__%" are strictly left-sensitive; "%%_" and "%__" are strictly right-sensitive; "_%_" is strictly context-sensitive; and "%_%" is strictly unrestricted rewrite.
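This classification can be restated in terms of the number of runs of identical symbols along the mask; a sketch (modern Python, invented function name):

    # Classify a binary mask by grammatical type, per the partition above.
    def mask_type(mask):
        runs = 1 + sum(mask[i] != mask[i + 1] for i in range(len(mask) - 1))
        if runs == 2:                    # one block of "_" and one of "%"
            return ("strictly left-sensitive" if mask[0] == "_"
                    else "strictly right-sensitive")
        if runs == 3 and mask[0] == "_":
            return "strictly context-sensitive"
        return "strictly unrestricted rewrite"

    for m in ["__%", "%%_", "_%_", "%_%"]:
        print(m, "->", mask_type(m))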

5. CONTEXT-SUB-SEQUENCES, PREDICTED-SUB-SEQUENCES, AND THE DOMAIN SEQUENCE

The context-sub-sequence γ is a sequence of length M over the C symbols of the current alphabet Vc and the predicted symbol "%", defined, for I = 1, 2, ..., M, as

γ(I) = Y(T−M+I) if ℳ(I) = "_"
γ(I) = "%" if ℳ(I) = "%"

The symbol "%" is merely a filler symbol in γ, and its appearance here does not carry any of the meaning that the symbol otherwise has. Because of this convention, note that the context-sub-sequence γ contains in itself all the information of the binary mask ℳ from which it was produced. The predicted-sub-sequence φ is a sequence of length M over the C symbols of the current alphabet Vc and the context symbol "_", defined, for I = 1, 2, ..., M, as

φ(I) = Y(T−M+I) if ℳ(I) = "%"
φ(I) = "_" if ℳ(I) = "_"

Here again "_" is a filler symbol in φ. Note that the predicted-sub-sequence contains in itself all the information of the binary mask ℳ from which it was produced. If the mask is a ternary mask, then γ(I) and φ(I) are "#" whenever ℳ(I) is "#". Finally, we define the domain sequence Δ as the sequence of length M over the current alphabet Vc obtained as follows:

Δ(I) = γ(I) if ℳ(I) = "_"
Δ(I) = φ(I) if ℳ(I) = "%"

for I = 1, ..., M. Note that Δ consists of the symbols from the current alphabet Vc (and not "%" or "_") occurring position-by-position in either γ or φ, and represents the symbols actually appearing in Y.
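These three sequences can be read off a window of M sample symbols position by position; a sketch (modern Python, binary masks only; the window passed in is Y(T−M+1), ..., Y(T) as a string):

    # Form gamma, phi, and delta from a mask and the window it covers.
    def subsequences(mask, window):
        gamma = "".join(w if m == "_" else "%" for m, w in zip(mask, window))
        phi   = "".join(w if m == "%" else "_" for m, w in zip(mask, window))
        delta = window                   # the symbols actually appearing in Y
        return gamma, phi, delta

    print(subsequences("_%", "QU"))      # ('Q%', '_U', 'QU')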

6. DEFINITION OF A REGULARITY

A regularity ℛ = <γ, φ, P, Y> is defined as a quadruple consisting of a context-sub-sequence γ (a sequence of length M over the symbols of the current alphabet Vc and the symbol "%"), a predicted-sub-sequence φ (a sequence of length M over the symbols of the current alphabet Vc and the symbol "_"), a real number P representing the conditional probability that the occurrence of γ in the sample Y predicts the occurrence of the predicted-sub-sequence φ in Y, and the sample Y itself. When the sample Y intended is clear, we may write ℛ as <γ, φ, P>. The length of a regularity is M, which is the common length of γ and φ and Δ. In English text, for example, one regularity is ℛ = <"Q%", "_U", 1.0>. This regularity is of length 2. This regularity expresses the fact that the letter "Q" is always followed by the letter "U". The binary mask ℳ from which this regularity is derived is "_%", and is of length 2. This mask is a strictly left-sensitive mask. The context-sub-sequence γ of this regularity is "Q%", and the predicted-sub-sequence φ is "_U". Both, of course, are also of length 2. The conditional probability P that the context-sub-sequence predicts the occurrence of the predicted-sub-sequence in the sample is 1.0. The domain sequence Δ for any occurrence of this regularity is "QU", that is, "QU" are the symbols that actually occur in the sample. Δ is also of length 2. Note that the frequency of occurrence of a given context-sub-sequence (the "Q" in the above case) does not enter into the definition of the regularity since conditional probabilities are used. It should be noted, however, that the notion of a mask can be extended in order to accommodate the measurement of probabilities of occurrence of single symbols, 2-grams, 3-grams, etc. in the sample. This extension is
accomplished by allowing a mask to consist entirely of the predicted symbol ("%"). With a vacuous context, the conditional probability Pij now becomes the probability of occurrence of the predicted-sub-sequence. The probability of occurrence is not, however, used anywhere in the Grammar Discovery Algorithm.
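Putting the pieces together, one can tabulate, for a given mask, the conditional probability with which each context-sub-sequence predicts each predicted-sub-sequence. A sketch (modern Python; the sample string is invented, and the strings γ and φ themselves serve here in place of the index numbers i and j):

    from collections import Counter

    # Tabulate P(phi | gamma) for one mask over a sample string Y.
    def regularities(Y, mask):
        M = len(mask)
        ctx, joint = Counter(), Counter()
        for t in range(M, len(Y) + 1):   # windows ending at T = M, ..., N
            window = Y[t - M:t]
            gamma = "".join(w if m == "_" else "%" for m, w in zip(mask, window))
            phi   = "".join(w if m == "%" else "_" for m, w in zip(mask, window))
            ctx[gamma] += 1
            joint[gamma, phi] += 1
        return {(g, p): n / ctx[g] for (g, p), n in joint.items()}

    P = regularities("QUICKQUIETQUILTS", "_%")
    print(P[("Q%", "_U")])               # 1.0: the structural regularity above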

7. TYPES OF REGULARITIES

We classify regularities and Pij's by the magnitude of the conditional probability Pij. If the Pij is exactly 1.0, we say that the regularity (or the Pij itself) is of type 1, and we call it structural. The regularity that "U" follows "Q" in English is a structural regularity. If Pij is near 1.0, that is, if Pij is less than 1.0 but greater than or equal to 1.0 − 1/C^g, then we say that the regularity (and the Pij) is of type 2. If the Pij is less than 1.0 − 1/C^g but greater than or equal to 1/2, then we say that the regularity (and the Pij) is of type 3. If the Pij is less than 1/2 but greater than or equal to ε = 1/C^g, then we say that the regularity (and the Pij) is of type 4. The regularities of these 4 types are the positive regularities. If the Pij is less than ε, but greater than 0.0, we say that the Pij is of type 5, and we call it noise. If the Pij is exactly 0.0, we say that the Pij is of type 6. We are interested, of course, in the Pij that are 1.0 or near 1.0.
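As a sketch of this classification (modern Python; the thresholds are exactly those defined above):

    # Classify a conditional probability P by type 1-6, where eps = 1/C**g.
    def p_type(P, C, g):
        eps = 1.0 / C**g
        if P == 1.0:
            return 1                     # structural
        if P >= 1.0 - eps:
            return 2                     # near 1.0
        if P >= 0.5:
            return 3
        if P >= eps:
            return 4                     # types 1-4: the positive regularities
        if P > 0.0:
            return 5                     # noise
        return 6

    print(p_type(1.0, 26, 1))            # 1: "U" always follows "Q"
    print(p_type(0.02, 26, 1))           # 5: below eps = 1/26, noise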

8. RATIONALE FOR SEARCHING FOR LOCAL REGULARITIES

It is not obvious that searching for local regularities in a sample of sentences is the way to begin a grammar discovery algorithm. Indeed, both practical and theoretical considerations seem to advise against this kind of approach. First of all, it would appear that, in searching for regularities in a sample of sentences produced by an unknown grammar, one would be obliged to consider dependencies between symbols widely separated from one another in the sentence. This point is made in Chomsky (1956, p. 286) in discussing the distance in the "relations of dependencies" that may occur as a result, for example, of self-embedding in natural languages such as English. In this regard, consider the sentence +++pppp, which is a well-formed sentence of the propositional calculus (in Polish Notation), and which is generated from the single recursive rule of production "p → +pp". When one parses this sentence, the first symbol (i.e. the "+") is in fact generated with the same application of the rule of production as the last symbol (i.e. the final "p"). Now the number of possible regularities observable in m positions (m ≥ 2) of sentences over an alphabet of C symbols is

Fm = the sum, for h = 1, ..., m−1, of (m choose h) · C^h · C^(m−h)

(where h is, as before, the number of positions in the context part of the regularity). This is equal to

Fm = (2^m − 2) · C^m,

which is very large for anything but small m. Indeed, in the simplest case, that of the binary alphabet (C = 2), we have

m    2    3    4    5    6     7
Fm   8    48   224  960  3968  16,128

Moreover, we are naturally interested in the possibility of sentences of indefinite length; that is, in sentences that are produced by automata with cyclic state transition diagrams, and in sentences produced by grammars with at least one recursive rule of production. A finite sample of such sentences (which is, of course, the only situation we would encounter) contains a longest sentence; but, still, that sentence could be very long. Thus, there would seem to be practical combinatorial obstacles to any algorithm which is based on searching for regularities over m symbols, for even modestly large m, much less the m one would encounter in practice. Even without the practical combinatorial obstacles that seem to surround analysis of sentences of non-small length m, there is a theoretical problem in searching for regularities among distant symbols. Consider a finite sample of sentences over an alphabet V with C symbols, wherein the longest sentence is of length m. Unless the sample contains at least one instance of each of the C^m possible sequences of length m (which would make the grammar very uninteresting, indeed), there will be at least one, and in general very many, conditional probabilities that are greater than 1/C^g, for each h = 1, 2, 3, ..., m−1. Thus, there will be at least one positive regularity in any such sample. The fact that positive regularities do appear will be used later. But for the moment, it will be used to make another argument. Consider again all the well-formed sentences of the propositional calculus generated by the single recursive rule of production "P → +PP". The sentences produced by this rule are all of length m = 3, 5, 7, 9, .... And, in particular, there are
8 sentences of length 9; namely,

+P+P++PPP    ++P++PPPP
+P+++PPPP    ++++PPPPP
+P+P+P+PP    ++P+P+PPP
+P+++P+PP    ++++P+PPP

Suppose further, we are given a sample of sentences in which no sentence is of length greater than 9, and that only 7 of the 8 sentences of length 9 are included in the sample. Suppose it is the last sentence which is not present. Then the conditional probability that there is a "P" in position 6, given that there is a "++++P" in positions 1 through 5, respectively, and given that there is a "PPP" in positions 7 through 9, respectively, is 1.0. Since C = 2 and g = 1 here, this is greater than 1/C^g = 1/2, and indeed therefore indicates a "positive" regularity. This discovered structural regularity is, of course, an artifact of the sample. If the sample were understood to be complete, in the sense of containing all of the sentences of the grammar, then this regularity would be faithful to the grammar. However, this sample of sentences is typical in that it is admittedly an incomplete sample arising from a grammar which is capable of producing other sentences. In a larger sample, it is possible that the missing sentence would occur as a sentence. Or, it may very well appear as a sub-sentence (which we can call a phrase or clause) of a longer sentence. Thus, when we search for regularities of length equal to the length of the longest sentences of the sample, the regularities we find are likely to be artifacts of the sample size, and therefore highly dependent on and variable with the sample size. Any grammar discovery algorithm based on finding regularities of this quality would be unstable. Our goal, of course, is that the grammar
produced not depend on sample size, after a certain point. A grammar discovery algorithm should converge, and thereafter be stable. A final objection to searching for regularities of long length is that their discovery may indeed tell us something about each itemized long sentence present in the sample, but tells us little about the sub-structure of the sentence. Thus, even in the absence of other objections, the regularities produced would tend to lead to trivial information (i.e. cataloging the long sentences), rather than information about the sub-structure of the sentences. Thus, if a grammar discovery algorithm is to begin with a search for regularities, the length of regularities considered must be small, both absolutely, to avoid practical combinatorial problems, and relatively, to avoid instability and triviality problems. For these, and other reasons that will be discussed later, it is necessary (and also sufficient) to consider only small M in searching for regularities in a grammar discovery algorithm. The way to find global properties is to look for local properties.
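The count Fm tabulated in the rationale above can be checked directly; a short sketch (modern Python):

    from math import comb

    # Fm = sum over context sizes h of (m choose h) * C**h * C**(m - h)
    #    = (2**m - 2) * C**m
    def F(m, C=2):
        return sum(comb(m, h) * C**h * C**(m - h) for h in range(1, m))

    print([F(m) for m in range(2, 8)])
    # [8, 48, 224, 960, 3968, 16128], i.e. (2**m - 2) * 2**m for each m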

35

9. ALGORITHM OF THE SEARCH PHASE

In the grammar discovery algorithm, we consider masks of length M = 2, 3, 4, ..., up to some pre-determined, small upper limit on M.* The selection of this upper limit M2 will be discussed later. For each such M, we consider all possible masks of length M which are of grammatical type up to some pre-determined limit** (perhaps, for example, all masks no more complex than strictly context sensitive). The masks are considered in order of increasing grammatical complexity, up to the pre-determined limit. That is, context sensitive masks are considered, for example, before unrestricted rewrite masks. This pre-determined limit is selected on the basis of what type of grammatical rules one is seeking. It does not matter whether one is too generous in admitting rules that are more complex, because the rules are not only generated in order of increasing grammatical complexity, but also are adopted into the grammar in that order. Thus, all the M-1 strictly right sensitive masks of length M, and all the M-1 strictly left sensitive masks of length M, are generated and considered first. They, in turn, are followed by the (M-1)(M-2)/2 strictly context sensitive masks. These masks are then followed by the strictly unrestricted rewrite masks of length M, of which there are

2^M - 2 - 2(M-1) - (M-1)(M-2)/2

Suppose we denote, for the moment, the symbols of a sentence in the sample as Y(1), Y(2), ..., Y(d), where d is the length of the sentence. Then, given the current M we are considering, and the current mask, we examine the sentence,

*The variable M2 is this upper limit in the computer program.
**The variable GRAM controls this in the computer program.

36

symbol by symbol. We consider the sub-strings of length M terminating in position T = M, M+1, ..., d (d >= M), and match each with the current mask; that is, we match the symbols Y(T-M+1), ..., Y(T) with mask positions 1, 2, ..., M. A context-sub-sequence γ and a predicted-sub-sequence φ are then developed from each such matching, as indicated by the definitions of γ and φ. Each distinct context-sub-sequence γ that is produced is given a distinct index number. Each distinct predicted-sub-sequence φ that is produced is also given a distinct index number. This allows us to refer to the conditional probability Pij that predicted-sub-sequence j occurs, given that context-sub-sequence i occurred, by using the two indices i and j. In the computer program implementing the Grammar Discovery Algorithm, the index number assigned is the natural number equivalent to the M symbols of γ or φ, treated as an M-digit numeral of radix (C+2)+2.** These two natural numbers together thus abbreviate all the information needed to identify a sub-sequence as either a predicted-sub-sequence or a context-sub-sequence, because a context-sub-sequence contains symbols from the alphabet and the "%", but never the "_", while a predicted-sub-sequence contains symbols of the alphabet and the "_", but never the "%".

*It should be noted that many of the operations of the Search Phase, as well as later phases such as the Recoding Phase, are amenable to being performed by parallel computers, either by general purpose parallel computers which will represent the next generation of computers, or by a special purpose parallel computer. The combinatorial obstacles to performing grammar induction on large samples of unknown text will thus be greatly reduced in practice.

**The natural number corresponding to the context-sub-sequence is called SQCTXT in the computer program. The natural number corresponding to the predicted-sub-sequence is called SQPRED. The reason for the first "+2" is the inclusion of the symbols "_" and "%". The reason for the second "+2" is the inclusion of the symbol "#" and the initial punctuation mark.
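The index computation just described can be sketched as follows (a hypothetical rendering; the report specifies only that the index is the natural-number equivalent of the M symbols in radix (C+2)+2, so the particular digit assignment below is an assumption):

    def sequence_number(seq, alphabet):
        # Assumed digit ordering: the C alphabet symbols first, then the
        # predicted marker '%', the context marker '_', the don't-care
        # symbol '#', and the initial punctuation mark '.'.
        symbols = list(alphabet) + ['%', '_', '#', '.']
        radix = len(symbols)                   # (C + 2) + 2
        value = 0
        for s in seq:
            value = value * radix + symbols.index(s)
        return value

    # sequence_number('11%%', '01') and sequence_number('__00', '01')
    # together identify the regularity <"11%%", "__00"> (cf. SQCTXT, SQPRED).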

37

The pair of natural numbers together is sufficient to recreate the context-sub-sequence γ, the predicted-sub-sequence φ, and the mask. The procedure for doing the above depends somewhat on the mode in which the Grammar Discovery Algorithm is operating. If initial punctuation is present in the sample (the second mode), the examination above proceeds through each separate sentence. Naturally, if a sub-string of length M would extend beyond the boundaries of a sentence, then masks of that length M are not considered at that position of that sentence. (We can, if we like, allow the initial punctuation mark appearing at both ends of a given sentence to be considered as part of that sentence; however, we never allow inter-sentence examinations.) If there is no initial punctuation in the sample, the only restriction is that the sub-string of length M lie wholly within the sample. In either mode, the entire sample Y is examined under the current mask. Upon completion of this examination, it is now possible to compute the conditional probability P that a given predicted-sub-sequence φ arising from a particular mask of length M occurred, given the occurrence of a particular context-sub-sequence γ arising from that same mask. This process is then repeated for each mask being considered. It is convenient to sort these conditional probabilities into descending numerical order, from 1.0 on down. The Pij's are then classified into types (I, II, III, IV, and V) as previously described. Note that for a given context-sub-sequence with h context positions, one would expect that most of the C^(M-h) = C^g possible predicted-sub-sequences will not actually occur in the sample. In a non-pathological sample, there are relatively few context-sub-sequence

38

and predicted-sub-sequence pairs that will actually occur.* The result of the Search Phase is the production of tables of regularities that occur in the sample.

*In the computer program implementing the Grammar Discovery Algorithm, the possibility of an excessive number of small, insignificant Pij's is dealt with by providing for the early culling of the tables of regularities. This early elimination of Pij's is controlled by the parameter EARLY.
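The Search Phase just described amounts to sliding each mask across the sample, splitting each matched window into its context-sub-sequence and predicted-sub-sequence, and tallying how often each predicted-sub-sequence completes each context-sub-sequence. The sketch below (function names are mine, not the program's; it treats the sample as one unpunctuated string, i.e. the first mode) produces a table of regularities sorted from 1.0 on down:

    from collections import defaultdict

    def search_phase(Y, masks):
        # Returns (mask, context, predicted, P) tuples: the conditional
        # probability P that the predicted-sub-sequence completes the
        # context-sub-sequence, for every mask and window position of Y.
        table = []
        for mask in masks:
            M = len(mask)
            counts = defaultdict(lambda: defaultdict(int))
            for T in range(M, len(Y) + 1):          # windows ending at T
                w = Y[T - M:T]
                gamma = ''.join(c if m == '_' else '%' for c, m in zip(w, mask))
                phi   = ''.join(c if m == '%' else '_' for c, m in zip(w, mask))
                counts[gamma][phi] += 1
            for gamma, phis in counts.items():
                total = sum(phis.values())
                for phi, n in phis.items():
                    table.append((mask, gamma, phi, n / total))
        return sorted(table, key=lambda row: -row[3])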

39

10. EXAMPLES

Example A: To illustrate the ideas developed above, consider the unpunctuated (first mode) symbol string Y as follows:

THE-BIG-CAT-RAN-THE-BAD-CAT-RAN-THE-BIG-RAT-RAN-THE-BAD-RAT-RAN-

The terminal alphabet for this sample Y of length N = 64 over C1 = 12 symbols is A = {A,B,C,D,E,G,H,I,N,R,T,-}, where "-" is an undistinguished, but visible, symbol used to denote the blank. The table below illustrates the kind of masks, context-sub-sequences, predicted-sub-sequences, and conditional probabilities that one obtains in analyzing the sample.

SOME REGULARITIES FOR LEVEL 1

        CONDITIONAL           CONTEXT        PREDICTED
LENGTH  PROBABILITY Pij  TYPE  MASK  Sub-Sequence  Sub-Sequence
  3       1.00000          1   _%%      C%%           _AT
  3       1.00000          1   __%      TH%           __E
  3       1.00000          1   %%_      %%G           BI_
  3       1.00000          1   _%_      R%T           _A_
  3       1.00000          1   %__      %AN           R__
  3       0.50000          3   _%%      B%%           _IG
  3       0.50000          3   _%%      B%%           _AD

The first line of this table, for example, indicates that the letter C is always followed by the letters AT in the sample. This forms the "word" CAT. The second line of this table indicates that the letter pair TH is sufficient to determine the following letter, which is E. These 2 lines illustrate left-sensitive rules. Line 3 of this table indicates that the letter G is sufficient context on the right to predict, with 100% reliability, the preceding two letters, namely B and I. Line 4 indicates that the context of R on the left and the context of T on the right is sufficient to determine

40

the one intervening letter A, making the word RAT. This is an example of a strictly context sensitive rule. The only mask not illustrated for length 3 above is the unrestricted rewrite mask %_%. This is the only unrestricted rewrite mask of length 3. The last two lines of this table indicate that the letter B is followed by IG and AD equally often (i.e. with probability .50). Using the terminology we have developed, the information on line 1 represents the type I regularity R = <"C%%", "_AT", 1.0, Y> of length M = 3.

Example B: Now let us consider an example based on the following sequence of 272 bits found in the memory of an IBM 360 computer:

(1) "1110 0110 1100 1000 1100 0101 1101 0101 0100 0000 1100 1001 1101 0101 0100 0000 1110 0011
     1100 1000 1100 0101 0100 0000 1100 0011 1101 0110 1100 0100 1100 0010 1100 0101 0100 0000
     1101 0110 1100 0110 0100 0000 1100 1000 1110 0100 1101 0100 1100 0001 1101 0101 0100 0000
     1100 0101 1110 0101 1100 0101 1101 0101 1110 0011 1110 0010 0100 0000"

Here the alphabet A = {0,1} and N = 272 and C1 = 2. When the memory of the computer is dumped, the sequence of 272 bits would conventionally be presented as 68 half-bytes (4-bit sequences). The name conventionally attached to each 4-bit sequence is the hexadecimal equivalent of the 4 bits in binary. Using these names, the dump might read:

(2) "E6C8C5D5 40 C9D5 40 E3C8C5 40 C3D6C4C2C5 40 D6C6 40 C8E4D4C1D5 40 C5E5C5D5E3E2 40"

41

When the 68 half-bytes are then combined into 34 full bytes and presented as EBCDIC characters (as is the case in the EBCDIC dump), we see

(3) "WHEN IN THE COURSE OF HUMAN EVENTS"

If we now search the sample (1) for regularities, we find for a regularity length M of 2 the following regularities:

LEVEL= 1  M= 2
NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)  0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)     0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)     4
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)     4
NUMBER OF TYPE 5 SEQUENCES (NOISE)       4
NUMBER OF P(I)                          12
NUMBER OF POSSIBLE P(I)                  8
SEPARATION VALUE BETWEEN TYPES 3 AND 4   0.50000
EPSILON FOR DEFINING TYPES 2 AND 3       0.34000

 #  LEVEL  LENGTH  PROB  TYPE  MASK  CONTEXT  PREDICTED
 6    1      2     0.57    3    _%     1%        _0
 2    1      2     0.57    3    %_     %1        0_
 7    1      2     0.55    3    _%     0%        _0
 4    1      2     0.55    3    %_     %0        0_
 3    1      2     0.45    4    %_     %0        1_
 8    1      2     0.45    4    _%     0%        _1
 1    1      2     0.43    4    %_     %1        1_
 5    1      2     0.42    4    _%     1%        _1
11    1      2     0.31    5    %%     %%        00
10    1      2     0.25    5    %%     %%        10
12    1      2     0.25    5    %%     %%        01
 9    1      2     0.19    5    %%     %%        11

Note that the mask "%%" is included in the above table and that it catalogs the probabilities of occurrence of the various sequences of symbols appearing in the sample.

42

Similarly, if we search the sample for regularities of length no greater than 3, we obtain the following table of regularities:

LEVEL= 1  M= 3
NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)  0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)     0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)     8
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)     8
NUMBER OF TYPE 5 SEQUENCES (NOISE)      24
NUMBER OF P(I)                          40
NUMBER OF POSSIBLE P(I)                 32
SEPARATION VALUE BETWEEN TYPES 3 AND 4   0.50000
EPSILON FOR DEFINING TYPES 2 AND 3       0.33333

 #  LEVEL  LENGTH  PROB  TYPE  MASK  CONTEXT  PREDICTED
22    1      3     0.61    3    __%    11%       __0
38    1      3     0.60    3    %__    %11       0__
 6    1      2     0.57    3    _%     1%        _0
26    1      3     0.57    3    __%    00%       __0
 2    1      2     0.57    3    %_     %1        0_
42    1      3     0.57    3    %__    %00       0__
28    1      3     0.56    3    __%    01%       __0
40    1      3     0.55    3    %__    %10       0__
 7    1      2     0.55    3    _%     0%        _0
 4    1      2     0.55    3    %_     %0        0_
23    1      3     0.52    3    __%    10%       __0
43    1      3     0.51    3    %__    %01       0__
44    1      3     0.49    4    %__    %01       1__
24    1      3     0.48    4    __%    10%       __1
 3    1      2     0.45    4    %_     %0        1_
 8    1      2     0.45    4    _%     0%        _1
39    1      3     0.45    4    %__    %10       1__
27    1      3     0.44    4    __%    01%       __1
41    1      3     0.43    4    %__    %00       1__
 1    1      2     0.43    4    %_     %1        1_
25    1      3     0.43    4    __%    00%       __1
 5    1      2     0.42    4    _%     1%        _1
37    1      3     0.40    4    %__    %11       1__
21    1      3     0.39    4    __%    11%       __1
20    1      3     0.31    5    _%%    0%%       _00
36    1      3     0.31    5    %%_    %%0       00_
11    1      2     0.31    5    %%     %%        00
15    1      3     0.30    5    _%%    1%%       _00
30    1      3     0.30    5    %%_    %%1       00_
32    1      3     0.28    5    %%_    %%1       10_
16    1      3     0.27    5    _%%    1%%       _01
14    1      3     0.26    5    _%%    1%%       _10
10    1      2     0.25    5    %%     %%        10
31    1      3     0.25    5    %%_    %%1       01_
19    1      3     0.25    5    _%%    0%%       _10
12    1      2     0.25    5    %%     %%        01
35    1      3     0.25    5    %%_    %%0       01_
34    1      3     0.24    5    %%_    %%0       10_
17    1      3     0.23    5    _%%    0%%       _01
33    1      3     0.20    5    %%_    %%0       11_

43

18    1      3     0.20    5    _%%    0%%       _11
 9    1      2     0.19    5    %%     %%        11
52    1      3     0.17    5    %%%    %%%       000
29    1      3     0.17    5    %%_    %%1       11_
13    1      3     0.17    5    _%%    1%%       _11
51    1      3     0.14    5    %%%    %%%       010
47    1      3     0.13    5    %%%    %%%       100
48    1      3     0.13    5    %%%    %%%       001
50    1      3     0.12    5    %%%    %%%       101
46    1      3     0.11    5    %%%    %%%       110
49    1      3     0.11    5    %%%    %%%       011
45    1      3     0.07    5    %%%    %%%       111

Some of the regularities we obtain with an M of 4 are

LEVEL= 1  M= 4
NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)  0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)     0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)    16
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)    32
NUMBER OF TYPE 5 SEQUENCES (NOISE)      64
NUMBER OF P(I)                         112
NUMBER OF POSSIBLE P(I)                 96
SEPARATION VALUE BETWEEN TYPES 3 AND 4   0.50000
EPSILON FOR DEFINING TYPES 2 AND 3       0.25000

  #  LEVEL  LENGTH  PROB  TYPE  MASK   CONTEXT  PREDICTED
142    1      4     0.70    3    %___    %101      0___
 87    1      4     0.68    3    ___%    110%      ___0
 96    1      4     0.67    3    ___%    101%      ___0
 85    1      4     0.65    3    ___%    111%      ___0
139    1      4     0.63    3    %___    %011      0___
147    1      4     0.63    3    %___    %111      0___
 90    1      4     0.61    3    ___%    100%      ___0
 22    1      3     0.61    3    __%     11%       __0
 98    1      4     0.61    3    ___%    010%      ___1
 38    1      3     0.60    3    %__     %11       0__
 93    1      4     0.60    3    ___%    011%      ___0
138    1      4     0.60    3    %___    %001      0___
135    1      4     0.58    3    %___    %100      1___
134    1      4     0.58    3    %___    %110      0___
144    1      4     0.58    3    %___    %010      1___
  6    1      2     0.57    3    _%      1%        _0
 26    1      3     0.57    3    __%     00%       __0
  2    1      2     0.57    3    %_      %1        0_
 42    1      3     0.57    3    %__     %00       0__
 28    1      3     0.56    3    __%     01%       __0
 40    1      3     0.55    3    %__     %10       0__
  7    1      2     0.55    3    _%      0%        _0
  4    1      2     0.55    3    %_      %0        0_

44

And, some of the regularities we obtain with an M of 5 are shown below:

LEVEL= 1  M= 5
NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)  0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)     2
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)    33
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)    79
NUMBER OF TYPE 5 SEQUENCES (NOISE)     174
NUMBER OF P(I)                         288
NUMBER OF POSSIBLE P(I)                256
SEPARATION VALUE BETWEEN TYPES 3 AND 4   0.50000
EPSILON FOR DEFINING TYPES 2 AND 3       0.20000

  #  LEVEL  LENGTH  PROB  TYPE  MASK    CONTEXT  PREDICTED
291    1      5     0.86    2    ____%   1111%     ____0
419    1      5     0.86    2    %____   %1111     0____
394    1      5     0.74    3    %____   %0011     0____
287    1      5     0.73    3    ____%   1010%     ____1
400    1      5     0.73    3    %____   %1011     0____
270    1      5     0.72    3    ____%   0110%     ____0
272    1      5     0.70    3    ____%   1101%     ____0
142    1      4     0.70    3    %___    %101      0___
410    1      5     0.70    3    %____   %0101     1____

In later sections, we will refer to these tables as part of other examples.
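Tables of this kind can be approximated by running the search_phase sketch given earlier on the 272-bit sample (a usage illustration; the program's exact counts depend on its handling of the ends of the sample, which the report does not spell out, so individual probabilities may differ slightly in the last decimal place):

    bits = (
        "111001101100100011000101110101010100000011001001110101010100"
        "000011100011110010001100010101000000110000111101011011000100"
        "110000101100010101000000110101101100011001000000110010001110"
        "010011010100110000011101010101000000110001011110010111000101"
        "11010101111000111110001001000000"
    )
    for mask, gamma, phi, p in search_phase(bits, ["_%", "%_", "%%"]):
        print(mask, gamma, phi, round(p, 2))    # e.g. _%  1%  _0  0.57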

45

C. THE RECODING PHASE: DEVELOPING RULES OF PRODUCTION FROM REGULARITIES

1. INTRODUCTION TO THE RECODING PHASE

The Recoding Phase is the second main phase of the Grammar Discovery Algorithm. In the Recoding Phase, the local regularities found during the Search Phase are used to develop grammatical rules of production which, in turn, are used to transform the given set of sentences into a new set of sentences. The new set of sentences contains most of the information and structure of the original set of sentences. This transformation (recoding) process works by replacing contiguous predicted symbols of local regularities occurring in the sample with single symbols. Thus, what were originally global regularities in the given sample tend to become local regularities in the new set of sentences. The recoding that is done is defined, in part, in terms of grammatical rules of production; and these rules of production ultimately become the rules of production of the grammar being induced. The resulting new sentences can later be treated anew as a sample, and this new sample can then be searched for regularities, reusing the methods of the Search Phase. And, this new sample can then be recoded at this higher level, yielding additional rules of production and a newer set of sentences. The alternate application of first the Search Phase, and then the Recoding Phase, is what produces the hierarchical (and non-trivial) quality in the grammar being induced.

46

2. TRANSFORMATIONS (RECODINGS)

We are concerned here with mappings from a sequence Y of length N over C1 symbols to a new sequence Y' of length N' over C1' symbols. A transformation (or recoding), as used here, will refer to such a mapping. If C1' > C1, we say that the mapping is alphabet-enlarging. A transformation is presented in terms of (1) a set of grammatical rules of production, and (2) a Recoding Procedure. Both are necessary to define the transformation. Each grammatical rule of production consists of an antecedent (left) side, which is a symbol string over the alphabet of Y'; an arrow; and a consequent (right) side, which is a symbol string over the alphabet of Y. An example of a rule of production is

0A0 → 01110

We read this as "replace A by 111 whenever A is found in context with a zero on both sides of it." Note that the set of rules of production is not sufficient, in general, to uniquely define the mapping from Y to Y'. There will often be many overlapping occurrences of the context-sub-sequences of the local regularities. Hence there are many different overlapping opportunities for applying the rules of production. It is therefore necessary to establish an order of precedence for developing the rules of production. The Recoding Procedure is an algorithm for scanning the sequence Y for occurrences of the context-sub-sequences of the various local regularities and for specifying in what order the rules of production should be developed from these regularities. The sample Y, as initially presented, is said to be the sample of the first level. Each transformation, when applied to a sample, maps that sample from its current level to a sample of one higher level. It should be noted that each level above the first represents

47

one sentential form in the derivation of the terminal string (which is the sample at the first level). Thus, each level above the first will contain some non-terminal symbols. We adopt the convention here that a transformation of a given sample Y will always be over the complete domain Y. This will in general necessitate the application of the identity transformation on certain substrings of Y. This convention guarantees that the entire domain Y is "covered" by each transformation, and this completeness of covering facilitates later comparisons between transformations.

48

3. DEVELOPMENT OF RULES OF PRODUCTION FROM LOCAL REGULARITIES

In the Search Phase, certain local regularities were discovered. Recall that a regularity R = <γ, φ, P, Y> consists of a context-sub-sequence γ, a predicted-sub-sequence φ, and an associated conditional probability P. The local regularities are now the basis for developing the rules of production of the grammar. Suppose we are given a context-sub-sequence γ from some regularity R of length M. Suppose further that for some integral T > 0, Y(T-M+I) = γ(I) for each I = 1, 2, ..., M for which γ(I) is not "%" or "#"*; that is, suppose the context specified by the context-sub-sequence γ is present in the sample Y, and that its terminal position is position T of Y. Suppose the Recoding Procedure indicates it is appropriate that a rule of production be developed using this occurrence of the context part of this regularity, and suppose further that we are generating context-dependent rules of production and that we are using only alphabet-enlarging recodings. Then the rule of production would be produced as follows:

1. The consequent (right) side of the rule consists of the domain sequence A = A(1), ..., A(M), which is of length M.

2. The antecedent (left) side consists of a sequence of symbols constructed as follows: Begin with the context-sub-sequence γ and consider each sub-string of contiguous predicted symbols "%" in γ. Enlarge the current alphabet Vc by a single new symbol, say "N".

*"#" is the "don't care" symbol and is discussed more fully later.

This new symbol will be a non-terminal symbol in the induced grammar. Now, insert this single new symbol N into the context-sub-sequence γ in place of the entire substring of contiguous predicted symbols. If there is more than one sub-sequence of contiguous predicted symbols in γ, introduce another new non-terminal symbol into Vc for each such sub-sequence. Note that there can be more than one such sub-sequence only when the mask from which the regularity was derived was of strictly unrestricted rewrite type. Note that the antecedent (left) side thus produced is of length M or less. Note that this decrease in length is the result of replacing the entire contiguous predicted substring by a single new symbol. Note also that the symbols of the current alphabet Vc occurring in context or don't-care positions in the sample are left unchanged. Finally, note that the grammatical type of the rule of production developed in this way is exactly the grammatical type of the mask from which the regularity R was derived. Suppose, for example, that R = <"11%%", "__00", 1.0, Y> is a regularity in a given sample Y; that is, in Y, the symbols 11 are invariably followed by 00. The context-sub-sequence γ for this structural regularity is "11%%". The predicted-sub-sequence φ is "__00". The conditional probability P is 1.0. The mask "__%%" gave rise to this regularity, and this mask is a left-sensitive mask. The domain sequence A is "1100". The sequence Y begins 0100110001..., so that the context part of the context-sub-sequence "11%%" occurs in Y so that its final symbol (i.e. the second "%") is at position T = 8 of Y. The rule of production developed from this regularity of length 4 in Y will be the left-sensitive rule

11N → 1100

where N is understood to be a non-terminal. In this example, "N"

50

replaces the two contiguous predicted symbols, which in this case are the two 0's. The alternative to generating context-dependent rules of production is to generate context-free rules. If context-free rules are to be generated, the antecedent (left) side will consist solely of a single new non-terminal symbol. In this event, the antecedent (left) side will, of course, be of length 1. If we are considering only M >= 2, then it will be shorter than the consequent side. The subject of generating context-free rules will be discussed in detail later. An alternative to an alphabet-enlarging recoding is a recoding using a Huffman code, or a similar scheme for optimal recoding. In this approach, the alphabet is not enlarged. Instead, a distinctive symbol string in the original alphabet is inserted in place of the symbols in Y. The details of possible non-alphabet-enlarging recodings and the advantages and disadvantages of this kind of recoding will be discussed later. Thus, as we have seen, the length of the regularities searched for in the Search Phase limits the length of the consequent (right) side of the rules of production developed for the induced grammar. Thus, if M is the length of the longest regularity searched for in the Search Phase, M is also the length of the longest consequent side of a rule of production in the induced grammar. This, in turn, means that if the sample Y is of length N, then regardless of how the rules of production are applied to accomplish the recoding of Y, the recoded version of Y cannot be of length shorter than N/M. Since M is generally fairly small, the image of the sample Y under a Recoding is

51

of a length relatively close to the length of Y itself. This means that Y is not recoded by any one Recoding into an extremely short, completely-resolved image. In other words, if we look at the way the sample Y is derived from the rules of production using the ultimate induced grammar, we see that the derivation is indeed a hierarchical derivation; that is, a derivation requiring relatively many intermediate sentential forms (levels). Note that it is the smallness of M which results in this hierarchical characteristic. Note also that the hierarchical grammar thus induced is also not trivial. Thus, both the hierarchical and non-trivial character of the induced grammar follow from the decision to search only for local regularities in the Search Phase. Finally, note that while certain Unrestricted Rewrite masks can be accommodated in the above procedure (for example, the mask "%%_%%"), there is no provision in general for the development of length-increasing rules of production (that is, rules in which the consequent (right) side is longer than the antecedent (left) side). (See Ginsburg and Greibach [1966].) However, since every rudimentary event (Smullyan [1959]) can be accepted by some linear bounded automaton (Myhill [1960]), and hence defines a context-sensitive language, the formal systems of Smullyan are within the scope of this Algorithm.
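The development step described in this section (replace each maximal run of contiguous predicted symbols "%" in γ by a fresh non-terminal, and take the matched domain sequence as the consequent) can be sketched compactly (helper names are mine; only the context-dependent, alphabet-enlarging case is shown, and the choice of new symbols is illustrative):

    import re
    from itertools import count

    _next = count()                      # supplies fresh non-terminals

    def develop_rule(gamma, domain):
        # Each maximal run of '%' in the context-sub-sequence becomes one
        # new non-terminal symbol; context and don't-care positions are
        # left unchanged. Returns (antecedent, consequent).
        antecedent = re.sub('%+', lambda m: chr(ord('N') + next(_next)), gamma)
        return antecedent, domain

    # develop_rule('11%%', '1100')  ->  ('11N', '1100'), i.e. 11N -> 1100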

52

4. THE RECODING PROCEDURE

We noted earlier that a transformation is specified in terms of both (1) a set of grammatical rules of production, and (2) a Recoding Procedure. The Recoding Procedure is an algorithm for scanning the sequence Y for occurrences of the context-sub-sequences of various local regularities, and for specifying in what order the rules of production should be developed from those regularities. The Recoding Procedure implemented in the computer program for the Grammar Discovery Algorithm admits of numerous variations. These variations will be described in detail as the reason for each possible variation arises. For now, we will consider merely the basic features of the Recoding Procedure. In the Search Phase, masks of various possible lengths, from a length of M1 to a length of M2, were used to find local regularities. We take these same M1 and M2 from the Search Phase for use in the Recoding Procedure. If M2 is the longest mask length considered in the Search Phase, then the first occurrence of a context-sub-sequence associated with a mask of that length cannot terminate earlier than position T = M2 in the sequence Y. Therefore, the Recoding Procedure starts scanning the sequence Y at position T = M2. The sequence Y is examined for the presence of a sub-sequence of Y of length M which terminates at position T such that Y(T-M+I) = γ(I) for each I = 1, 2, ..., M for which γ(I) is not "%" or "#". In general, context-sub-sequences may be identified with various sub-sequences of Y. The purpose of the Recoding Procedure is to select one context-sub-sequence from among the several possibilities. The sub-sequence of Y that is selected is the one which meets the following conditions:

(1) The sub-sequence selected is of longest length, of the lengths between M1 and a parameter MBE which is less than or equal to M2. This range of lengths is further constrained by 2 considerations. First, there must indeed be a sub-sequence in Y of the length L being considered. This means that T >= L, for each L considered. Also, if the sequence Y has initial punctuation, then the sub-sequence in Y of length L must not include an initial punctuation mark. Note that an initial punctuation mark is always assumed to precede any sequence Y, and the first case is merely a special case of the second. The second constraint is that none of the positions occupied by the sub-sequence of length L have been recoded as yet on this level. That is, our aim is to find regularities which "cover" the sequence Y no more than once at any position. If multiple covering were allowed at any one level, then (i) the resulting grammar would be ambiguous, and (ii) it might be necessary to establish a precedence ordering among the rules of production of the grammar induced.

(2) The regularity selected is the one whose conditional probability is of the highest type, among those types which are allowed in the Recoding. The lowest allowable type* is specified in advance, and is typically type I, II, or occasionally III.

(3) The regularity selected has the highest conditional probability, given condition (2). This is accomplished by the sorting of the conditional probabilities P(I) associated with the regularities.

(4) The regularity selected is the regularity whose mask is of simplest grammatical type. That is, a regularity arising from

*The variable ALLOW in the computer program specifies the lowest allowable type of regularity to be used in recoding.

54

a left-sensitive mask will be preferred to a regularity arising from a strictly-context-sensitive mask, etc. Naturally, the types of masks that can be selected are limited by the types of masks used in the Search Phase.

After this local regularity is selected, a rule of production may be developed from it, in the manner described in the preceding section. The sample Y is then recoded, as specified by that rule of production. Before a new rule of production is introduced, however, the possibility of instead introducing a recursive rule of production is considered. The procedure for inducing recursions works with rules of production developed at previous levels of recoding. Thus, if the grammar discovery algorithm is operating on its first level, no test for recursion is performed, and the rule of production developed from the regularity is automatically added to the induced grammar. If the algorithm is already on the second level, or on a higher level, the test for recursion will be performed. This test is fully described in a later section. If a recursion is then induced, one recursive rule of production is added to the induced grammar. This recursive rule replaces one non-recursive rule of production which is already in the induced grammar. The Recoding Procedure operates so as to minimize the introduction of new rules of production. Thus, when a new non-recursive rule is to be introduced, the entire sequence Y is immediately searched for additional possibilities of applying that rule. These recodings are then performed in preference to any other possible recoding. Similarly, whenever a new recursive rule of production is introduced, a similar back-tracking search is made through all previous levels to see if the recursive rule can be applied. Moreover, with recursive rules,

the Recoding Procedure considers the possibility of immediately reapplying recursive rules (even to the same positions of Y which have just been recoded), so that as much of Y as possible is recoded using the newly induced recursive rule. Also, the Recoding Procedure operates in such a way that it first scans Y looking only for the possibility of inducing recursive rules of production. After this scan is done, the sequence Y is re-scanned for the possibility of inducing non-recursive rules. This double-scanning approach gives preference to inducing a recursive rule. Because a recursive rule replaces an existing rule already in the grammar, this approach helps minimize the number of rules in the induced grammar. The Recoding Procedure operates in such a way that the degree of "covering" of the sample is maximized; that is, as many as possible of the symbols of Y are either in the context part or the predicted part of some regularity and some rule of production. Once a non-recursive rule is introduced into the induced grammar, opportunities to reapply that rule are always considered before any further rules are added to the grammar.
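A greatly simplified rendering of this scan is sketched below (a single left-to-right pass that prefers the longest, most reliable regularity at each position and covers each position at most once; the recursion test, the back-tracking search, and the double scan described above are omitted, and all names are mine):

    def recode(Y, regularities, M1, MBE):
        # regularities: dict mapping a context-sub-sequence gamma to a
        # (probability, antecedent) pair for the rule developed from it.
        out, T = [], 0
        while T < len(Y):
            best = None
            for L in range(min(MBE, len(Y) - T), M1 - 1, -1):  # longest first
                window = Y[T:T + L]
                for gamma, (p, lhs) in regularities.items():
                    if len(gamma) == L and all(
                            g in '%#' or g == c for g, c in zip(gamma, window)):
                        if best is None or p > best[0]:
                            best = (p, lhs, L)
                if best:
                    break
            if best:
                out.append(best[1]); T += best[2]   # emit antecedent
            else:
                out.append(Y[T]); T += 1            # identity transformation
        return ''.join(out)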

56

5. EXAMPLE

In this section, we illustrate the development of rules of production from regularities, and illustrate the application of the rules using a recoding procedure. We begin with an illustration of a context-sensitive rule of production. Suppose the given sample of sentences is as follows:

--------------------LEVEL 1--------------------
SYMBOL STRING OF LENGTH 137 AND USING ALPHABET OF SIZE 14

.THE-BIG-CAT-RAN-.THE-BAD-CAT-RAN-.THE-BIG-DOG-RAN-.THE-BAD-DOG-RAN-.THE-BIG-CAT-SAT-.THE-BAD-CAT-SAT-.THE-BIG-DOG-SAT-.THE-BAD-DOG-SAT-.

Note that the terminal alphabet for the sentences of this sample includes 14 symbols (Roman letters and the visible blank "-") and the period (as initial punctuation). Suppose further that the following regularity has been identified in the sample:

R = <"%%E-", "TH__", P>

This regularity is a right-sensitive context-sensitive regularity because the mask from which the regularity was derived is a right-sensitive mask (i.e., "%%__").

The development of a rule of production from this regularity proceeds as follows: The two contiguous predicted positions (namely positions 1 and 2) are consolidated and replaced by the new non-terminal symbol ("0" in this case). Together with the two context positions, these three symbols become the antecedent (left) side of the rule of production. The consequent (right) side of the rule consists of the four terminal symbols from the regularity. The rule of production is thus

0E- → THE-

In a similar manner, the following 7 rules of production might be developed:

TENTATIVE RULES OF PRODUCTION FOR M OF 4
0E-  ----> THE-
1G-  ----> BIG-
2AT- ----> CAT-
3AN- ----> RAN-
4D-  ----> BAD-
5G-  ----> DOG-
6AT- ----> SAT-

Applying these rules of production to the sample, one recodes the sample and obtains the following:

NEW STRING
.0E-1G-2AT-3AN-.0E-4D-2AT-3AN-.0E-1G-5G-3AN-.0E-4D-5G-3AN-.0E-1G-2AT-6AT-.0E-4D-2AT-6AT-.0E-1G-5G-6AT-.0E-4D-5G-6AT-.
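Because the consequents of these seven rules tile the sample without overlap, the recoding can be reproduced by straightforward substitution of each consequent by its antecedent (a minimal sketch of this example only; the general Recoding Procedure must also enforce the precedence conditions of the preceding section):

    rules = [("THE-", "0E-"), ("BIG-", "1G-"), ("CAT-", "2AT-"),
             ("RAN-", "3AN-"), ("BAD-", "4D-"), ("DOG-", "5G-"),
             ("SAT-", "6AT-")]

    sample = (".THE-BIG-CAT-RAN-.THE-BAD-CAT-RAN-"
              ".THE-BIG-DOG-RAN-.THE-BAD-DOG-RAN-"
              ".THE-BIG-CAT-SAT-.THE-BAD-CAT-SAT-"
              ".THE-BIG-DOG-SAT-.THE-BAD-DOG-SAT-.")

    for consequent, antecedent in rules:   # recode: consequent -> antecedent
        sample = sample.replace(consequent, antecedent)
    print(sample)   # .0E-1G-2AT-3AN-.0E-4D-2AT-3AN-. ...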

58

Note that the rules of production in this example were context-sensitive because the masks of the regularities from which the rules were derived were context-sensitive masks. It would seem that only context-sensitive rules of production can be derived. Such is not the case. Later, after motivating and defining the concept of maximal regularity (in sections III.B.3 and IV.D), we will develop context-free rules of production. However, it should be noted now that if there is a procedure for developing context-free rules of production, then their application to the given sample, using the recoding procedure, is exactly the same as with context-sensitive rules of production. To illustrate this statement, recall Example B from the previous section. The sample in Example B was the following binary sequence:

--------------------LEVEL 1--------------------
SYMBOL STRING OF LENGTH 272 AND USING ALPHABET OF SIZE 2

111001101100100011000101110101010100000011001001110101010100
000011100011110010001100010101000000110000111101011011000100
110000101100010101000000110101101100011001000000110010001110
010011010100110000011101010101000000110001011110010111000101
11010101111000111110001001000000

59

Suppose there is a justification for developing context-free rules of production in the Grammar Discovery Algorithm. Suppose that the following four context-free rules are developed:

TENTATIVE RULES OF PRODUCTION FOR M OF 2
A ----> 11
B ----> 10
C ----> 01
D ----> 00

Then the recoding of the given sample, using these four context-free rules of production and the recoding procedure, would yield the following new sample of sentences:

NEW STRING
ABCBADBDADCCACCCCDDDADBCACCCCDDDABDAADBDADCCCDDDADDAACCBADCD
ADDBADCCCDDDACCBADCBCDDDADBDABCDACCDADDCACCCCDDDADCCABCCADCC
ACCCABDAABDBCDDD

In the examples that accompany several of the following sections of this paper, we will use context-free rules of production in many cases, although the rationale for developing these rules of production will not be fully developed until sections III.B.3 and IV.D.
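Since the four consequents 11, 10, 01, 00 tile every position of the sample, this recoding is simply a non-overlapping chunking into pairs (a one-line sketch, assuming the bits string from the earlier usage example):

    pair = {"11": "A", "10": "B", "01": "C", "00": "D"}
    new_string = ''.join(pair[bits[i:i + 2]] for i in range(0, len(bits), 2))
    print(new_string)            # ABCBADBDADCC... (136 symbols in all)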

60

D. THE SELECTION PHASE: SELECTING RECODINGS

1. INTRODUCTION

The third main phase of the Grammar Discovery Algorithm is the Selection Phase. The Selection Phase oversees several possible trial recodings and selects one, which is then actually applied to the sample of sentences to produce one new set of sentences. This selection is made on the basis of three criteria: the entropy, the parsimony, and the recursive parsimony of the transformation. In this section, we define the concepts of entropy of a transformation, parsimony of a transformation, and recursive parsimony of a transformation. We define a resolving transformation, and we present the algorithm for trying recodings, evaluating them, and then selecting one recoding as the actual recoding to be performed on the sample. The idea of the Pij-graph is developed.

61

2. ENTROPY, PARSIMONY, RECURSIVE PARSIMONY OF A TRANSFORMATION

Typically, there is more than one rule of production in each transformation. Each rule of production is derived from a local regularity. Recall that each regularity R = <γ, φ, Pij> consists of a context-sub-sequence γ, a predicted-sub-sequence φ, and an associated conditional probability Pij. We define the entropy of a transformation to be

- Σ Pij log2 Pij

(using the convention that 0 log2 0 = 0), where the Pij are the conditional probabilities associated with the regularities from which the rules of production of the transformation were derived. Note that the sum here is taken over the (reduced) set of rules of production. Note that this measure depends only on knowing the set of rules of production of the transformation. Note, for example, that if each local regularity from which the rules of production of the transformation were developed is a structural regularity (i.e. the context-sub-sequence predicts the predicted-sub-sequence with 100% reliability), then the entropy of the transformation will be zero. As an illustration of this, consider the string Y

... abcdefghefghabcdefgh ...

This string might be recoded using the two strictly-context-sensitive rewrite rules

aNd → abcd
eMh → efgh

which each have 100% reliability. The resulting string Y' would then be ... aNdeMheMhaNdeMh ....

62

Note that no information is lost by this transformation for the positions actually recoded. On the other hand, if the local regularities used to develop the rules of production of a transformation all had conditional probabilities of ε = 1/C^g (i.e. where the occurrence of the context tells us "nothing" about the "predicted" part), then the entropy of such a transformation would be maximal. Information is lost by such a transformation; the original sequence Y cannot be recovered by applying any kind of inverse of this transformation to the encoded string Y'. The information lost would be maximal for the positions actually encoded. For example, suppose 001 is followed by 00 or by 11 with probability 1/2 each, and that the rule of production is 001A → 00100. Y may be the sequence ...00100,00111,00100,00111... (the commas are added here only as a visual aid) and Y' would then be ...001A,00111,001A,00111... Here Y cannot be recovered from Y'. The weighted entropy of a transformation is the same sum as the entropy of a transformation, except that the sum is taken over each application of the rules of production. Thus, rules that are used more often in the actual recoding count more in this sum. Note that this measure is dependent on both the set of rules of production and also the Recoding Procedure, which determines how many times each rule is applied. The fewer rules of production in a grammar, the more parsimonious it is. Accordingly, we define the parsimony of a transformation as λ·n, where n is the number of (reduced) rules of production, and

63

where λ is a positive real number called the coefficient of parsimony.* Note that n is a measure of the complexity of the transformation, and indirectly of the induced grammar (Goodall). To illustrate the ideas of entropy and parsimony, consider the following trivial grammar which can be inferred from any given sample of sentences a1, a2, ..., aw over an alphabet VT. The grammar is G = <VT, {S}, S, P>, where the rules of production P are the w rules:

S → a1
S → a2
  .
  .
S → aw

In this case, the entropy of the transformation here is zero, because all the rules are 100% reliable. However, the parsimony here is high, namely λw. We define the combined entropy-parsimony measure of a transformation to be

- Σ Pij log2 Pij + λ·n

Note that there is a trade-off between the entropy and the parsimony of a transformation. The combined entropy-parsimony measure for 2 different transformations can be the same if the one with more rules of production has correspondingly less entropy (i.e. if the regularities used to develop the rules of production of the transformation have greater reliability). Note also that while the probability of occurrence of the different contexts is not explicitly considered in the Grammar Discovery Algorithm (because conditional probabilities are used throughout), probabilities of occurrence are reflected implicitly in the following

*The variable LAMBDA in the computer program is the coefficient of parsimony.

64

way: A rule of production developed from a regularity whose context part occurs rarely in the sample will not be applied often in transforming the sample, and therefore other rules of production will be needed to transform the remainder of the sample, and therefore parsimony will be less. The ideas of "entropy" and "parsimony" were suggested by a paper of Goodall (1962). However, it is likely that Goodall's entropy was a measure of information (and therefore was dependent on probabilities of occurrence and not conditional probabilities). The idea of a combined measure and a trade-off between these two quantities appears in Goodall (1962). One recursive rule of production generally replaces many (indeed, an infinity of) non-recursive rules of production. The recursive parsimony of a transformation is defined as λr·nr, where nr is the number of recursive (reduced) rules of production, and where λr is a positive real number, called the coefficient of recursive parsimony.* The combined entropy-parsimony-recursive-parsimony measure of a transformation is defined to be

- Σ Pij log2 Pij + λ·n + λr·nr

Typically, recursive rules of production are more desirable than non-recursive rules. Thus, λr would be chosen so that

λr << λ

so as to give relatively greater weight in the measure to the less desirable non-recursive rules. One might even assign a value of zero to recursive parsimony.

*The variable LAMR in the computer program is the coefficient of recursive parsimony.
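The combined measure just defined is simple to compute once the rules of a trial transformation are known (a sketch; the weighted, normalized variant actually used in the program is given in the next section):

    from math import log2

    def combined_measure(p_values, n_rules, n_recursive, lam, lam_r):
        # - sum P log2 P + lambda*n + lambda_r*n_r, with 0 log 0 taken as 0.
        entropy = -sum(p * log2(p) for p in p_values if p > 0)
        return entropy + lam * n_rules + lam_r * n_recursive

    # For the trivial grammar above, all p_values are 1.0, so the measure
    # reduces to lam * w, its parsimony term.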

Note that in all of the foregoing cases, the identity transformation would have no entropy at all, since the application of an identity transformation is a completely reliable recoding. The combined measure of entropy, parsimony, and recursive parsimony actually used in the computer program implementing the Grammar Discovery Algorithm is an ad hoc combination of the above features. We have not attempted to derive the particular function chosen from a series of desired axiomatized properties, or to show that the function (or its class) is the only function (or only class of functions) that satisfies the axioms. The properties that the function does, however, satisfy are as follows:

(1) A rule of production developed from a structural (Type I) regularity should have zero entropy. Thus, rules developed from either type I regularities or identity transformations will contribute zero to entropy.

(2) Each rule of production should count equally against parsimony. Certainly, measuring grammatical complexity can be done in other ways (and better ways) than counting rules of production; however, we will use this simple approach suggested by Goodall (1962) here. One important implication of this requirement is that each different identity transformation should count as a rule in determining parsimony. Thus, we do not treat the identity transformation as one transformation any more than we treat any other rules of production having different predicted parts as being the same. Note that if identity transformations did not count against parsimony, they would not contribute to the H measure at all (since their entropy is zero); and, if that were the case, the transformation with minimal H would always be the identity transformation on the whole sample, and the initial sample would then be perpetuated from

66

level to level.

(3) Each recursive rule of production should count equally against recursive parsimony.

(4) The entropy for each application of each rule should be normalized to reflect the number of symbols recoded with it. If this normalization were not done, longer rules would tend to be favored over shorter rules. Carried to the extreme, the best recoding would be accomplished with very long rules: rules so long that they are 100% reliable solely by virtue of their capturing artifacts of the sample.

(5) The relative weights to be given to the 3 attributes of entropy, parsimony, and recursive parsimony should be determined by a weighted sum. The trade-off should be between entropy on the one hand and the two kinds of parsimony on the other.

The weighted, normalized, combined entropy-parsimony-recursive-parsimony measure used in the computer program is therefore

H = - Σ Mij Pij log2 Pij + λ·n + λr·nr

where this sum is taken over each application of each rule of production, and where Mij is the length of the regularity associated with Pij. Note that Σ Mij = N, where N is the length of Y. The idea of reliably recoding, or "bunching together," portions of a sample of data is found in a variety of settings. It is well known in psychology and physiology [Miller, 1956] that human beings are well able to classify gradations in pitch, hue, loudness, count, smell, taste, and a variety of sensory information into about 7 categories. If faced with much more than 7 categories, humans lose the ability to distinguish reliably and must "bunch" the data together so

67

that items within the bunch may first be distinguished and gradated. Then, at the next higher level, representatives of the bunches may in turn be distinguished and gradated. The pervasiveness of this number 7 (plus or minus 2) seems also to extend into the social and political behavior of humans. As C. Northcote Parkinson (1957) has noted in an examination of the English Privy Council and other cabinet bodies, human decision-making bodies tend to function best with about 8 members. When more than this number are added to a committee, an inner committee of smaller size tends to form to do the actual decision-making and to integrate the views of factional representatives from the larger body. Although not within the scope of this study, it seems possible that, for humans, the trade-off between parsimony and entropy comes when more than 7 items must be integrated. This level (of parsimony) in turn implicitly defines the degree of faithfulness of representation (entropy) which necessarily must be accepted by humans in a variety of sensory and social and political settings.

68

3. THE Pij-GRAPH OF A TRANSFORMATION

Whenever we have a collection of probabilities from a set of regularities abstracted from a given sample, we may not only classify them as to type (i.e. types I, II, III, IV, and V), but we may also sort them into descending order by magnitude. Such a graph, which we call the Pij-graph, has the general appearance seen below:

[Figure: general appearance of a Pij-graph. The F values Pij are plotted in descending order of magnitude, from 1.00 downward, against the sorted index σ(1), σ(2), ..., σ(F).]

69

In the figure, σ denotes the permutation of the F Pij's which sorts them into descending order. σ(i), where i is an integer from 1 to F, is the ordered pair (i,j) that identifies the Pij. Goodall (1962) presented such a graph, although apparently erroneously using the actual probabilities of occurrence. These probabilities, of course, are rarely near 1.0. However, if we instead consider conditional probabilities (as we do herein) or even relative probabilities (perhaps normalizing all probabilities of occurrence of an M-gram by dividing by the largest), there are several provocative and valid points in Goodall. As always, the Goodall terminology is highly metaphorical and suggestive. The high probabilities at the left end of the graph represent the "structure" in the sample. The structure is that portion of the sample that is reliably present. Structure is associated with the type I or II regularities in the sample. Structure may be such facts as that Q's are always followed by U's in English; that periods are reliably followed by spaces; that most birthdays in the social security files have a "19" in them; or that certain self-synchronizing codes have a certain punctuating symbol at the beginning of each phrase (e.g. the peak of a sawtooth wave at the beginning of a television line image). Since the structure is reliable, all of the structure may be abbreviated and replaced by a shorter marker (or even removed altogether!). What is structure at one level is structure at the next level, regardless of how it is recoded, abbreviated, or deleted. The message portion of a sample is the part that is unpredictably variable at this level of analysis. The entropy of this portion of the sample is high because this is the part of the sample which contains the information. A key point is that what is "unresolvable message"

70

at one level is not necessarily unresolvable at a higher level. Unlike structure, message is not always message. Of course, message cannot be recoded (except by an identity transformation) because it has no as-yet-identified internal structure.* Message corresponds to the type III and IV regularities in the sample. The noise portion of the sample is that part of the sample which occurs with very low probability. Noise corresponds to the type V regularities; indeed, the "irregularities." Noise represents all the exceptions, contradictions, error, and randomness of the sample: things which occur so rarely that their non-occurrence in the sample is so predictable that they cannot be considered message (which is unpredictable variability). Whether noise represents "error" (as, for example, would be the case in an incoming encoded message) or "exceptions" (that is, information which indeed is correct, but rare) depends on the nature of the sample. Accordingly, one would want to preserve exceptions for analysis at a higher level and discard error as information that is incompatible with the sample. This is an external decision. On the Pij-graph, structure, message, and noise appear as one proceeds from left to right. The Pij-graph described earlier is of greatest interest when we plot only the Pij's used to develop rules of production that are actually used in a transformation. We call this graph the Pij-graph of a transformation.

*There is one possible exception to this statement which is discussed in the sentence-oriented method for inducing disjunctions. The exception occurs when a message sequence is one of an ensemble of possible substitution instances for a predicted-sub-sequence that has maximal or near-maximal entropy. See II.E.4.

71

For example, if our initial sample is

--------------------LEVEL 1--------------------
SYMBOL STRING OF LENGTH 272 AND USING ALPHABET OF SIZE 2

111001101100100011000101110101010100000011001001110101010100
000011100011110010001100010101000000110000111101011011000100
110000101100010101000000110101101100011001000000110010001110
010011010100110000011101010101000000110001011110010111000101
11010101111000111110001001000000

and we use the following 4 rules of production using an M of 2

TENTATIVE RULES OF PRODUCTION FOR M OF 2
A ----> 11
B ----> 10
C ----> 01
D ----> 00

we might find the following values for entropy and parsimony:

ENTROPY TERM                          65.18735
PARSIMONY TERM                         4.00000
RECURSIVE PARSIMONY TERM               0.0
                                     ---------
VALUE OF H FOR THIS RECODING          69.18735
NUMBER OF RULES OF PRODUCTION          4
NUMBER OF RECURSIVE RULES              0
NUMBER OF TIMES RULES ARE APPLIED    136

If we then apply the rules of production, using a standard recoding procedure, we would obtain a Pij-graph such as is found on the next page. This recoding results in the following new string:

ABCBADBDADCCACCCCDDDADBCACCCCDDDABDAADBDADCCCDDDADDAACCBADCD
ADDBADCCCDDDACCBADCBCDDDADBDABCDACCDADDCACCCCDDDADCCABCCADCC
ACCCABDAABDBCDDD

72

GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 2

[Graph: the 4 P(I) values used in this recoding, plotted in descending order on a scale from 0.0 to 1.00; their types, from left to right, are 3, 3, 3, 4.]

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)  0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)     0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)     3
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)     1
NUMBER OF TYPE 5 SEQUENCES (NOISE)       0
NUMBER OF P(I)                           4

73

4. RESOLVING TRANSFORMATIONS

If the entropy is not large for a transformation, it must be that the Pij's are mostly near 1.0. Let g be the largest number of predicted positions for any mask in a transformation. This number is always less than or equal to M-1 for the largest M tried in the Search Phase. The criterion for a transformation is the value

log2 (1/(1-ε))

A resolving transformation is a transformation for which the entropy is less than the criterion. This suggestive term is found in Goodall (1962). The idea there is that when a recoding satisfies the criterion, the underlying structure that generated the sample has been identified; and, therefore, the sample is "resolved." Note that if one of our measures which combine entropy and parsimony is used, some Pij's may be smaller than 1-ε; or, if a transformation which is not a resolving transformation is considered, some Pij's may be smaller than 1-ε. In both these cases, however, most Pij's will tend to be near 1, and exceptions will occur only because the introduction of that Pij leads to such a substantial reduction in the number of rules of production that the increase in entropy is justified. In all cases, however, most of the Pij's will be bunched near 1.0. In the Pij-graph of a transformation actually used in recoding there will be no Pij's that are less than ε, because we never use regularities so unreliable in a transformation.

74

In the case of a resolving transformation, there can be no Pij's equal to ε, because one such Pij would alone make the entropy of the transformation larger than the criterion. These last 2 observations define the features of any Pij-graph of a resolving transformation.

75

5. TRIAL RECODINGS AND THE SELECTION OF THE ACTUAL RECODING

M1 and M2 are the lower and upper limits, respectively, on the length of context-sub-sequences and predicted-sub-sequences considered in the Search Phase in the search for local regularities. For each M between M1 and M2, a different recoding results, in general, if we allow only regularities of length up to M to be used in developing rules of production in the recoding. Thus, M2-M1+1 recodings are defined. These recodings are parameterized by the index MBE. In the Selection Phase, these M2-M1+1 different trial recodings are each attempted, and the entropy, parsimony, and recursive parsimony of each computed. The recoding which is selected to be the actual recoding is the one which is associated with the first M (considering the direction of considering the M's*) which is a resolving transformation**, or the M which has the best combined entropy-parsimony-recursive-parsimony measure (if there is no resolving transformation). The best combined measure is the least combined measure. The strings resulting from the application of the actual recoding constitute the strings of the next level of the process. The Search Phase, the Recoding Phase, and the Selection Phase of the Grammar Discovery Algorithm are then applied to the strings of this level, until the Algorithm terminates.

*The variable VDIR specifies this order in the computer program.
**The variable WDF in the computer program determines whether the first resolving transformation discovered is used for the actual recoding, or whether (if there is more than one resolving transformation) the best resolving transformation is used.

76

For example, if we again refer to the example started on page 68, we find that for an M of 3, we develop the following 8 rules of production:

TENTATIVE RULES OF PRODUCTION FOR M OF 3
A ----> 111
B ----> 001
C ----> 101
D ----> 100
E ----> 011
F ----> 000
G ----> 110
H ----> 010

The application of these 8 rules to the sample yields the following new string:

ABCDDEFCGCHDFEBBGCHDFEDEGHBDHCFFGFACEEFDGFCDHCFFGCCDEBFFGHBG
HEHDGFECHCFFGBEGHAFCGCEGBADHHF00

The entropy and parsimony for this transformation are as follows:

ENTROPY TERM                          42.55614
PARSIMONY TERM                         8.00000
RECURSIVE PARSIMONY TERM               0.0
                                     ---------
VALUE OF H FOR THIS RECODING          50.55614
NUMBER OF RULES OF PRODUCTION          8
NUMBER OF RECURSIVE RULES              0
NUMBER OF TIMES RULES ARE APPLIED     90

And, the Pij-graph for the transformation is shown on the next page.

77

GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 3

[Graph: the 8 P(I) values used in this recoding, plotted in descending order on a scale from 0.0 to 1.00; their types, from left to right, are 3, 3, 3, 3, 3, 3, 4, 4.]

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)  0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)     0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)     6
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)     2
NUMBER OF TYPE 5 SEQUENCES (NOISE)       0
NUMBER OF P(I)                           8

78

Similarly, for an M of 4, we find the following rules:

TENTATIVE RULES OF PRODUCTION FOR M OF 4
A ----> 1110
B ----> 0110
C ----> 1100
D ----> 1000
E ----> 0101
F ----> 1101
G ----> 0100
H ----> 0000
I ----> 1001
J ----> 0011
K ----> 0010
L ----> 0001

and the following evaluation:

ENTROPY TERM                          30.02759
PARSIMONY TERM                        12.00000
RECURSIVE PARSIMONY TERM               0.0
                                     ---------
VALUE OF H FOR THIS RECODING          42.02759
NUMBER OF RULES OF PRODUCTION         12
NUMBER OF RECURSIVE RULES              0
NUMBER OF TIMES RULES ARE APPLIED     68

and the following new string:

ABCDCEFEGHCIFEGHAJCDCEGHCJFBCGCKCEGHFBCBGHCDAGFGCLFEGHCEAECE
FEAJAKGH

The Pij-graph is found on the next page.

79

GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 4

[Graph: the 12 P(I) values used in this recoding, plotted in descending order on a scale from 0.0 to 1.00; their types, from left to right, are eight 3's followed by four 4's.]

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)  0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)     0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)     8
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)     4
NUMBER OF TYPE 5 SEQUENCES (NOISE)       0
NUMBER OF P(I)                          12

If it seems that the combined entropy-parsimony measure (H) is merely a monotonic function of M, consider what happens with an M of 5. Here we get an extraordinary number of rules --- 27 in all --- as shown below:

TENTATIVE RULES OF PRODUCTION FOR M OF 5
A ---> 11100
B ---> 11011
C ---> 00100
D ---> 01100
E ---> 01011
F ---> 10101
G ---> 01010
H ---> 00000
I ---> 11001
J ---> 00111
K ---> 10100
L ---> 00001
M ---> 11000
N ---> 11110
O ---> 01000
P ---> 10110
Q ---> 00010
R ---> 00110
S ---> 00011
T ---> 01110
U ---> 01001
V ---> 11010
W ---> 01111
X ---> 00101
Y ---> 10111
Z ---> 10001
<AA> ---> 10000

For M of 5, the parsimony is therefore quite large. However, the entropy is quite small. With an M as large as 5, there is sufficient context to guarantee a more faithful (and hence more reliable, and hence lower-entropy) transformation. However, this greater faithfulness is the result of the proliferation of rules of production. There are more rules, each used less often, and each used more reliably. The following is the calculation of H:

ENTROPY TERM                            22.21033
PARSIMONY TERM                          27.00000
RECURSIVE PARSIMONY TERM                 0.0
VALUE OF H FOR THIS RECODING..........  49.21033
NUMBER OF RULES OF PRODUCTION           27
NUMBER OF RECURSIVE RULES                0
NUMBER OF TIMES RULES ARE APPLIED       54

And the following is the new string that results:

ABCOEFGIIJGKLMNOMFHDJFPCMEQKLFPROSCTUFRLVFHDEI'XVYZNC<AA>OO

(Note that when we exhaust the non-terminal alphabet provided, we use brackets as in Backus-Naur Form to indicate non-terminals.) The p(i)-graph for an M of 5 is found below.

[Figure: GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 5; the summary printed beneath the plot reads:

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)    0
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)       2
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)      17
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)       8
NUMBER OF TYPE 5 SEQUENCES (NOISE)         0
NUMBER OF P(I)                            27]

Note that for M of 5, we have some Type II regularities in the transformation. Reviewing this example, note that the smallest H is attained with an M of 4. Thus, after considering trial recodings for M of 2, 3, 4, and 5, we would select the recoding based on an M of 4 for the actual recoding. We would then use the new string obtained by the recoding based on an M of 4 as the input string to level 2 of the Algorithm.
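For reference, the three trial recodings just worked out compare as follows (H is the sum of the entropy, parsimony, and recursive parsimony terms):

M    ENTROPY TERM    PARSIMONY TERM    RECURSIVE PARSIMONY TERM        H
3        42.55614           8.00000                     0.0       50.55614
4        30.02759          12.00000                     0.0       42.02759
5        22.21033          27.00000                     0.0       49.21033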

E. INDUCING RECURSIONS FROM A FINITE SAMPLE OF SENTENCES

1. INTRODUCTION AND MOTIVATION

Every sample of sentences we encounter is finite; and every finite sample of sentences can be generated by a finite state grammar. In fact, a finite state grammar can be trivially found which generates the sentences in the sample and only those sentences, thus producing the best possible fit between the sample and the grammar induced. A grammar which generates only a finite number of sentences is a finite cardinality grammar, and a language with only a finite number of sentences is a finite cardinality language. A finite cardinality grammar need not be a finite state (regular) grammar, and a finite state (regular) grammar need not be a finite cardinality grammar. Of course, for reasons of economy, and to satisfy our own intuitive requirements, we insist that the grammar induced by a grammar discovery algorithm not always be either a finite state grammar or a finite cardinality grammar. Thus, we allow the induction of non-finite-state grammars (that is, of left-sensitive, right-sensitive, strictly context-sensitive, and strictly unrestricted rewrite grammars), all from a finite sample (which is, of course, a finite cardinality language itself). Similarly, we allow the induction of non-finite-cardinality grammars from a finite sample. The only way for a grammar (by which we mean, of course, a finite grammar, i.e. one with a finite number of rules) to generate more than a finite number of sentences is for the grammar to have a recursion. Allowing the induction of recursions from a finite sample is thus as important as allowing the induction of non-finite-state rules of production from a finite sample. Indeed, without recursion, many simple relations must be expressed

by numerous different rules of production. Moreover, if the sample of sentences is extended (say, by application of a recursion) and we are not allowed to induce grammars with recursions, then new rules of production must be added to explain the extension. Each extension leads to a proliferation of additional rules of production. In the case of this kind of extension of the sample, a grammar which did not contain a recursion would be neither economical nor stable (under the extension).

2. DEFINITION OF A RECURSION

Given a grammar G = <V_N, V_T, P, S>, let V = V_N ∪ V_T, and let w and y be strings over V*. We say that w immediately derives y, which we write as w => y, if there exist z1, z2, u, and v, all in V*, such that w = z1 u z2 and y = z1 v z2, and such that u -> v is a rule of production in P. We say that w derives y, which we write as w =>* y, if there exist strings w0, w1, ..., wn such that w = w0, y = wn, and w_i => w_{i+1} for each i from 0 to n-1. A sentential form is a string w in V* such that S =>* w. A sentence is a string w in V_T* such that S =>* w. A grammar is said to contain a recursion if there is a non-terminal N in V_N such that αNβ =>* γNδ, where α, β, γ, and δ are in V*, provided that it is not the case that α = γ and β = δ. Note that rules such as N -> N and NM -> MN do not count as recursions under the above definition. We use only the noun "recursion" in referring to the above idea. An individual rule of production αNβ -> γNδ is called a recursive rule if the same non-terminal N appears in both the antecedent (left) side and the consequent (right) side of the rule, provided that it is not the case that α = γ and β = δ, and provided that |αNβ| <= |γNδ|.
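The following minimal Python sketch applies one reading of the rule-level definition (our reading takes the length condition strictly, which by itself also excludes identity and permutation rules such as N -> N and NM -> MN; the representation of rules as symbol lists is ours):

def is_recursive_rule(antecedent, consequent, nonterminals):
    # A rule is recursive when some non-terminal N appears on both the
    # antecedent (left) and consequent (right) sides and the consequent
    # is strictly longer, so repeated application can grow a string.
    shared = set(antecedent) & set(consequent) & set(nonterminals)
    return bool(shared) and len(consequent) > len(antecedent)

print(is_recursive_rule(['p'], ['+', 'p', 'p'], {'p'}))       # True:  p -> +pp
print(is_recursive_rule(['N'], ['N'], {'N'}))                 # False: N -> N
print(is_recursive_rule(['N', 'M'], ['M', 'N'], {'N', 'M'}))  # False: NM -> MN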

3. APPROACHES TO INDUCING RECURSIONS

There are several possible approaches to inducing recursions from a finite sample. Feldman (1966, 1967) proposed an algorithm for inferring an unambiguous, recursive, regular grammar from a given sample. This algorithm begins with the construction of a non-recursive regular grammar that generates exactly the given sentences, then merges non-terminals in this grammar to get a recursive regular grammar which seems to be a "reasonable generalization" of the sample. The algorithm ends with a simplification process. To generate the intermediate grammar which exactly generates the given sentences, Feldman processes the strings of the sample in order of decreasing length. To the extent that the sample does not have different sentences of equal length, this procedure eliminates the effect of the order of the sentences within the sample. Rules of production are developed one by one, as they are needed to generate each sentence in the sample in turn. Specifically, starting with a (the) longest sentence a1 a2 ... an of length n, Feldman would generate n-1 rules as follows: The first rule is S -> a1 A1. The next n-3 rules are A_i -> a_{i+1} A_{i+1}, for i = 1, ..., n-3. The (n-1)-th rule is A_{n-2} -> a_{n-1} a_n, which is not in the form of a rule of a regular grammar, and which he calls a "residue" rule. "Residue" rules in general are rules of the form A -> w, w in V_T*, |w| >= 2. New rules are added as necessary as additional sentences from the sample are considered. As each sentence is considered, new rules of production are added only to the extent required to guarantee the generation of that sentence. These new rules may even be terminating rules, which are like

88 "residue" rules except that their consequent (right) side is of length 1. Terminating rules come about when all but the last symbol of the sentence under consideration can be generated with the previously developed rules. The rules so developed may be combined and written non-deterministically. To obtain recursion, rules of production are now merged. Each residue rule is merged with a non-residue rules, thus eliminating the "residue" rules. The general principle is that after such merging, the resulting grammar must still generate all the sample, plus as few new short sentences as possible. This merging is accomplished as follows: Whenever the non-terminal on the antecedent (left) side of a residue rule occurs on the consequent (right) side of a nonresidue rule, the non-terminal of the left side of this non-residue rule is substituted for the non-terminal on its consequent (right) side which was in common with the residue rule. This eliminates the residue rule, and makes the non-residue rule involved into a recursive non-residue rule. Note that the resulting grammar is now entirely in the form of a regular grammar. Also, note that this procedure guarantees that the longest sentence of the sample is generated by a recursive rule. Shorter sentences of the sample may be generated either with the aid of a recursive rule, or only by non-recursive rules. The tendency is that the shorter the sentence, the more nonrecursive rules have been constructed; hence, it is more likely that it can be generated by a sequence of non-recursive rules, followed by one terminating rule. The non-residue recursive grammar thus produced is now simplified to remove equivalent productions. This simplication results in no change in the language generated.

A second algorithm by Feldman (1969) infers a "pivot" grammar for a given set of sentences. A pivot grammar is an operator grammar in which a terminal symbol which separates non-terminals in a production appears in no other way. Linear grammars are a special case of pivot grammars, but the class of pivot grammars is much broader than the linear grammars. The algorithm begins with the sample of sentences and the knowledge of which terminal symbol(s) are the pivot terminal symbols. The algorithm produces a pivot grammar. The main strategy of the algorithm is to find self-embeddings. Each sentence is examined to see if it has a proper substring which is also a sentence of the sample. If it does, a "loop non-terminal" is substituted for the longest such substring. This results in a new sentence. This new sentence becomes part of the sample under consideration. If no sentences have such substrings, the sample is scanned to see if all sentences have the same first symbol, or the same last symbol. If this is the case, the common symbol is trimmed off, and the process described above is then applied to the trimmed sentences.
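A minimal Python sketch of one pass of the self-embedding search (our own simplification: sentences as strings, a single loop non-terminal "L"):

def substitute_loops(sample, loop='L'):
    # If a sentence has a proper substring that is itself a sentence of
    # the sample, substitute the loop non-terminal for the longest one.
    result = []
    for s in sample:
        inner = [t for t in sample if len(t) < len(s) and t in s]
        result.append(s.replace(max(inner, key=len), loop, 1) if inner else s)
    return result

print(substitute_loops(['ab', 'aabb', 'aaabbb']))  # ['ab', 'aLb', 'aLb']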

4. SENTENCE-ORIENTED METHOD OF INDUCING RECURSION

The methods for inducing recursions can be divided into two categories. First, there are methods that operate on the sentences of the sample, that is, methods that look for some indication in the sample that a recursive rule of production is justified. Second, there are methods that operate on the rules of production developed at each level of the grammar discovery algorithm.* We examine the sentence-oriented approaches first. A theorem of Bar-Hillel et al. (1962) states that for every context-free language L, there exist constants p and q, depending only on L, such that if there is a sentence z in L of length greater than p, then z may be written as a concatenation uvwxy, where |vwx| <= q and |vx| > 0, and further that each sentence u v^i w x^i y is in L for i >= 0. Thus, if L contains one sufficiently long sentence, an infinity of other long sentences are thereby also in L. A proof of this theorem appears in Hopcroft and Ullman (1969) under the name "uvwxy theorem." The special case of the uvwxy theorem for regular languages is a well-known result, usually presented in discussions of whether a given regular language is finite or infinite. This result is the theorem stating that if a regular language contains a sentence of length greater than the number of non-terminal symbols in the grammar (which is the same as the number of states in the finite automaton associated with the grammar), then the language is infinite, and in fact contains an infinity of sentences of the form

*The variable WCV specifies which approach is to be used; in the computer program, only the rule-oriented approach is implemented. The variable KRECUR specifies whether any attempt to induce recursions should be made.

u v^i w y, i >= 0. The uvwxy theorem, and its proof, suggest a sentence-oriented approach for inducing the presence of a recursion in a given sample of sentences. If the sample of sentences is sufficiently rich and varied, there should be some instances in the sample of substrings of the form v^i w y. We should attempt to recognize these instances. In a sentence, a substring such as v^i is the product of recursion. This substring can be generated by the application of some recursive rule i times. The detection of v^i can be accomplished with masks of even length M, wherein the first M/2 positions of the mask are context positions and the last M/2 positions are predicted positions, that is, a mask consisting of M/2 context symbols "_" followed by M/2 predicted symbols "%". (A mask with the context and predicted symbols reversed works equally well.) The given sample of sentences at the current level is searched for regularities associated with this series of masks of even length. The only occurrences of regularities associated with such a mask can be over those strings where there is a repeated substring. The conditional probability associated with this regularity should be near 1.0 for a recursion to exist. That is, most occurrences of the substring should be followed by another occurrence of the same substring, the only exception being the last occurrence within the sentence. Note that the conditional probability cannot be exactly 1 unless the entire sample consists of repetitions of the substring. If the recursive rule is a non-self-embedding rule, then we would find only v^i in a sentence. If the recursive rule is a self-embedding one, then there would also be an x^i substring.
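A minimal Python sketch of this detection idea (our own simplification: the sample is taken as one string, and the window consists of M/2 context positions followed by M/2 predicted positions):

def repeated_substring_reliability(sample, half):
    # For each substring v of length `half`, estimate how often an
    # occurrence of v is immediately followed by another occurrence of v;
    # a value near 1.0 suggests a recursion generating v^i.
    from collections import Counter
    seen, followed = Counter(), Counter()
    for i in range(len(sample) - half + 1):
        v = sample[i:i + half]
        seen[v] += 1
        if sample[i + half:i + 2 * half] == v:
            followed[v] += 1
    return {v: followed[v] / seen[v] for v in seen if followed[v]}

print(repeated_substring_reliability('cababababd', 2))
# {'ab': 0.75, 'ba': 0.666...}: most occurrences of "ab" are followed by "ab"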

Whenever a repeated substring α is discovered, a new non-terminal "N" can be introduced into the non-terminal vocabulary of the induced grammar, and the recursive rule N -> αN can be added to the induced grammar. Each occurrence of α in the sample is then replaced with the non-terminal "N". For example, with α = "abc", a portion of the sample such as ...cabcabcabcd... is recoded as ...cNNNd..., the derivation tree stacking one application of the recursive rule for each occurrence of the repeated substring. There is no good procedure for deciding whether to write the rule as N -> abcN or N -> cabN when a terminal such as "c" appears at the beginning of the repeated substrings. There is no good criterion for deciding whether to make the recursive rule left recursive or right recursive. There is no good way to decide whether the sequence immediately following the repeated substring should be written as a disjunctive alternative to the recursive symbol, that is, N -> abcN | d, or whether another formulation is more desirable. Moreover, an obvious self-embedding such as N -> abcNdef (whose derivation tree matches each "abc" with a "def")

may yield the pair of recursive rules N -> abcN and M -> defM, and a recoded sample of ...cNNNMMM.... Moreover, this method has no obvious extension to context-sensitive or unrestricted rewrite rules; and the method also requires a rather large sample size in general.

5. RULE-ORIENTED METHOD OF INDUCING RECURSION

In the rule-oriented method of inducing recursion, we examine the rules of production generated at each successive level of the grammar discovery process, and try to induce the presence of recursion. Consider, for example, a language generated by the grammar with "p" and "+" as the terminals and with the single strictly-context-free recursive rule

p -> +pp

A sample of sentences (with initial punctuation) from this language might be

(1)  ++pp+pp.++ppp.+pp.+p+pp.++pp+pp.+++pppp.++pp+p+pp.++pp++pp.

In practice, the Grammar Discovery Algorithm would discover the single non-recursive rule

A -> +pp

at level 1. The original sample would then be recoded using this rule of production. The result would be a set of sentences containing the original terminal symbols "p" and "+", and also the non-terminal "A". The substring "+pp" will not, of course, appear in the recoded strings. The result would be

(2)  +AA.+Ap.A.+pA.+AA.++App.+A+pA.+A+Ap.

Note that substrings such as "+pA" and "+Ap", as well as "+AA", do appear in the image. The rules of production developed at level 2 include B -> +AA, which is the obvious non-recursive rule resulting from the recursive structure of the language. However, rules such as C -> +pA and

D -> +Ap would also be developed, since "+pA" and "+Ap" are cohesive substrings at this level. Note the resulting exponential growth of the number of rules of production, and of the non-terminals. This growth is typical when the more economical recursive rule (in this case p -> +pp) is not developed. We now introduce some terminology. Recursive rules of production for context-free grammars take the following forms, where A is a non-terminal, and where α and β are non-null strings: first, A -> Aα, which is called a left recursion; second, A -> αA, which is called a right recursion; and finally, A -> αAβ, which is called a self-embedding ("middle" recursion). The first two rules are left regular and right regular rules, respectively, when the length of α is only 1. If M is large enough (that is, if the length of the consequent (right) side of a rule of production can be large enough), then recursions will manifest themselves between adjacent levels of the grammar discovery process. Suppose the rule of production A -> αB appears in the induced grammar at level i; suppose the rule B -> δCθ appears at level i+1; and suppose A and C are ultimately to be identified with one another to make the recursive rule A -> αδAθ; then, if M had been large enough to encompass the length of αδAθ,

the rule A -> αδBθ could have appeared at level i, so that the rule A -> αδAθ would have been induced at that level. Thus, it is sufficient to examine rules of production occurring at adjacent levels of the induced grammar, if M is generous enough. (Alternatively, if M is not generous, it will be necessary to expand our examination of rules to non-adjacent levels.)* We now develop a procedure for identifying recursions. Let R1 be a rule of production induced at level i-1 of the grammar discovery process, and let R2 be a proposed new rule of production that is about to be added to the induced grammar at level i. Before any proposed new rule is added at level i (i >= 2), that rule is considered in relation to each rule already in the grammar from level i-1, to see if one recursive rule could replace both. Note that it is here that we limit the detection of recursion to adjacent levels. The following are the conditions for inducing a recursion:

(1) Rules R1 and R2 must be isomorphic in the following sense: (a) The precondition for isomorphism is that the consequent (right) side of rule R1 and the consequent (right) side of rule R2 be of the same length. (b) Secondly, there must exist a one-to-one mapping φ from the set of symbols appearing in rule R1 to the set

*This latter feature is not now implemented in the computer program for the Grammar Discovery Algorithm.

of symbols appearing in rule R2, such that if symbol f appears in the consequent (right) side of rule R1 at position t, then symbol φ(f) must appear in the consequent (right) side of rule R2 at position t, for all t between 1 and the common length of the consequent sides of the two rules.

(2) There must exist a symbol σ such that the following two conditions hold: (a) The symbol σ is a predicted symbol on the antecedent (left) side of rule R1. (To determine this, the mask from which rule R1 was developed must be considered.) (b) The symbol σ appears on the consequent (right) side of rule R2.

The symbol σ is the recursive symbol, the symbol linking the two rules. If no recursive rule were identified, there would instead be a chain of rules at successive levels, each pair of adjacent levels linked by one such symbol. If the above conditions hold, we have a recursion. The symbol σ is now an extraneous symbol in the induced grammar. Let T be the symbol appearing in the same position of rule R1 at level i-1 as σ occurs in rule R2 at level i. The rule R2 is extraneous, and should not be added to the induced grammar. The symbol σ is extraneous, and should now be deleted from the current non-terminal vocabulary of the induced grammar. Rule R1 should be made into a recursive rule by replacing the symbol σ on its antecedent (left) side by the symbol T, so that T now occurs on both the antecedent (left) side and the consequent (right) side of this modified rule R1. Thus, this modified rule R1 is now a recursive rule of production.
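A minimal Python sketch of the isomorphism test of condition (1) (consequent sides written as strings; the mapping, when it exists, is returned):

def isomorphic(rhs1, rhs2):
    # Condition (1): same length, and a one-to-one mapping phi relating
    # the two consequent sides position by position.
    if len(rhs1) != len(rhs2):
        return None
    phi, image = {}, set()
    for f, g in zip(rhs1, rhs2):
        if f in phi:
            if phi[f] != g:
                return None          # phi would not be a function
        elif g in image:
            return None              # phi would not be one-to-one
        else:
            phi[f] = g
            image.add(g)
    return phi

print(isomorphic('+pp', '+AA'))  # {'+': '+', 'p': 'A'}
print(isomorphic('+pp', '+Ap'))  # None: "p" would map to both "A" and "p"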

Moreover, T should replace every occurrence of σ in all existing rules as well. Rules which are then identical should be deleted from the induced grammar. This rewriting process simplifies the induced grammar. Finally, the recursive rule just induced should be applied recursively wherever possible to the existing set of sentences. There will, in general, be many opportunities to apply the recursive rule which did not exist before. This back-tracking process has the effect of applying each rule maximally before going on. Because of these processes of rewriting all existing rules and of recursively applying any new recursive rule wherever possible, it is simpler, from the point of view of writing the computer program for this algorithm, to induce only one recursive rule at each level. The addition of additional levels to the grammar discovery algorithm in no way complicates the induced grammar. Thus, in the example, the rule

A -> +pp

was a rule at level 1; and (if no search for recursion is made) the rules

B -> +AA
C -> +pA
D -> +Ap

were rules induced at level 2. Level 2 is the first level at which the inter-level search for recursion can be applied. Before accepting

a rule such as B -> +AA, we would discover that it is isomorphic to the rule A -> +pp. That is, the consequent (right) sides are both of length 3, and there exists a one-to-one mapping φ taking "+" into "+" and "p" into "A". The symbol "A" is the recursive symbol σ. T is "p". Replacing "A" with "p", we get the recursive rule

p -> +pp

If we now apply this rule wherever possible to the original sample (1), we get a new sample which has a "p" wherever there is an "A" in (2). However, by applying this rule in every possible way, the sample (1) is in fact completely reduced to

p.p.p.p.p.p.p.p.

It should be noted that every recursive rule of production is inherently a disjunctive rule as well. In particular, any recursive rule can be written in the form

αNβ -> γNδ | ε

where ε represents the empty string, where α, β, γ, and δ are strings, and where "N" is a non-terminal (recursive) symbol. The recursive rules induced above are all of this form. Thus, the rule induced in the above example is, if written completely,

p -> +pp | ε

F. INDUCING DISJUNCTIONS AND GENERALIZATIONS

1. INTRODUCTION AND MOTIVATION FOR INDUCING GENERALIZATIONS

One specific application of the Grammar Discovery Algorithm is the induction problem for formal systems. In general, the rules and meta-rules of formal systems are written in terms of quantified variables. For example, a given property, say commutativity of some binary operation "·", may hold "for all" variables taken from a certain set Z = {Z1, ..., Zk}. The commutative rule would then not be stated as separate commutative rules for each choice of variables (i.e. Z1·Z2 = Z2·Z1, Z1·Z3 = Z3·Z1, Z2·Z3 = Z3·Z2, etc.), but rather as one rule stated in terms of meta-variables which range over the set Z (i.e. for all x1, x2 in Z: x1·x2 = x2·x1). Of course, whenever the set Z is not finite, the use of meta-variables is not merely a convenience and an economy, but a necessity. The process of inducing the required meta-rule (which is stated in terms of quantified meta-variables) from specific instances of the rule is called generalization. Generalization in induction is the counterpart of substitution in derivation. In doing induction in formal systems, it is desirable to have a facility for generalization.

2. INDUCING GENERALIZATIONS COMBINATORIALLY (RULE-ORIENTED METHOD)

The induction of generalized rules can be done combinatorially, in a manner similar in concept to the induction of recursions. We begin by allowing the Grammar Discovery Algorithm to proceed in its development of rules level by level. But before a new rule of production R is admitted into the induced grammar (at a level greater than level one), the possibility of instead introducing a generalized rule is considered. The generalized rule, if introduced, would replace the rule R about to be admitted, as well as certain rule(s) already in the induced grammar. The following are the conditions required to induce a generalization:

(1) First, there must be a non-empty set of rules R1, ..., Rk in the induced grammar which are all isomorphic to the rule R. (Isomorphism of rules is defined in the discussion of inducing recursion.)

(2) Second, for the h context positions of the consequent (right) side of rule R, suppose only a particular subset of symbols (say, Z1, Z2, ..., Zd) appears in those h positions in R and in R1, ..., Rk. Then all d^h combinations of these d symbols in those h positions must occur. (Note, then, that a minimum requirement is that k+1 >= d^h.)

(3) Third, in all but the h positions, the rules R1, ..., Rk and R must be identical.

Now define Z as the set of the d symbols above: Z = {Z1, Z2, ..., Zd}. Let x1, ..., xd be d meta-variables. Let R* be the rule obtained from R by substituting x_i for each occurrence of Z_i in rule R (1 <= i <= d). Now the rule R* replaces R1, ..., Rk and R in the induced grammar. Rule

R* is the generalized rule; it is written in terms of the d meta-variables x1, ..., xd. For example, suppose a sample from a formal system contains rules of inference such as

Z1Z2 -> Z2Z1     Z2Z1 -> Z1Z2     Z1Z1 -> Z1Z1   (identity rules)
Z1Z3 -> Z3Z1     Z3Z1 -> Z1Z3     Z2Z2 -> Z2Z2   (identity rules)
Z2Z3 -> Z3Z2     Z3Z2 -> Z2Z3     Z3Z3 -> Z3Z3   (identity rules)

where Z = {Z1, Z2, Z3}, d = 3, and h = 2. These d^h = 9 rules of inference are isomorphic. The d^h rules of inference can be consolidated into one meta-rule expressed in terms of the meta-variables x1 and x2 universally quantified over Z, namely

for all x1, x2 in Z:  x1x2 -> x2x1.

The above induction of generalized meta-rules is a combinatorial process. Of course, the requirement for the appearance of all d^h variations can be relaxed in practice, with the attendant risk of over-generalizing.*

*This combinatorial generalization process is not implemented in the computer program.
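A minimal Python sketch of conditions (2) and (3) (our own toy representation: each rule is reduced to its consequent string, and `positions` are the h positions in which the rules may differ):

from itertools import product

def can_generalize(consequents, positions):
    # The isomorphic rules must agree outside the h given positions
    # (condition (3)), and at those positions the d symbols observed
    # must occur in all d**h combinations (condition (2)).
    combos = {tuple(c[p] for p in positions) for c in consequents}
    rest = {tuple(ch for i, ch in enumerate(c) if i not in positions)
            for c in consequents}
    symbols = {s for combo in combos for s in combo}
    return len(rest) == 1 and combos == set(product(symbols, repeat=len(positions)))

# Four isomorphic consequents differing only at positions 1 and 2:
print(can_generalize(['aPPb', 'aPQb', 'aQPb', 'aQQb'], (1, 2)))  # True
print(can_generalize(['aPPb', 'aPQb', 'aQPb'], (1, 2)))          # False: QQ missing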

Finally, it should be noted that the Generalization Process includes a process which might be called the process of finding "negative" regularities. Suppose, for example, in a non-binary alphabet, that a given symbol of the alphabet does not appear in a particular context in the sample. Then a "negative" regularity can be envisioned which states that fact about the sample. In fact, one can envision the masks in the Search Phase as containing "negative" predicted positions, that is, positions which record the absence of particular symbols in particular contexts in the sample. However, it will be seen that the Generalization Process described above subsumes this process, and indeed is more general, in the sense that pairs, triplets, etc. of absent symbols in particular contexts can be expressed easily. Let us now digress briefly and discuss the process of inducing disjunctive rules.

3. INTRODUCTION AND MOTIVATION FOR INDUCING DISJUNCTIONS

An explicit disjunctive rule is a rule of production in which the consequent (right) side consists of two or more different strings, any one of which can be substituted for the antecedent (left) side. A typical explicit disjunctive rule might be

A -> aB | bB | c

where "A" and "B" are non-terminal symbols, and "a", "b", and "c" are terminal symbols. An implicit disjunctive rule is said to exist in a grammar whenever the antecedent (left) sides of two or more rules of production are identical, but the respective consequent (right) sides are not. Obviously, implicit disjunctive rules can be collected and written as one explicit disjunctive rule. Disjunctions already arise in the Grammar Discovery Algorithm, but only in the course of inducing recursions. Each recursive rule of production is inherently a disjunctive rule. In particular, any recursive rule can be written in the form

αNβ -> γNδ | ε

where ε represents the empty string, where α, β, γ, and δ are strings, and where "N" is a non-terminal symbol. The recursive rules induced by the Grammar Discovery Algorithm are all in this form. But so far, there is no other facility in the Grammar Discovery Algorithm for inducing disjunctions. Indeed, the bottom-up character of the Algorithm guarantees that no implicit disjunctive rule (and therefore no explicit disjunctive rule) can result, because different substrings in the sample at any level will never be encoded in the same way. A facility for inducing disjunctions is particularly necessary when the sample has initial punctuation. In that event, the encoding can never proceed beyond a string such as

S1.S2.S3. (and so forth), where the S_i are non-terminal symbols. A facility for inducing disjunctions at any level of the process is needed in order to avoid always writing final rules such as

S -> S1 | S2 | S3

where "S" is the starting (non-terminal) symbol of the grammar. This same unattractive situation (i.e. of having to write one final trivial disjunctive rule having a large set of disjunctive choices) is possible, of course, even in the absence of initial punctuation. In any case, this situation can be avoided if the Grammar Discovery Algorithm has a facility to induce disjunctive (non-deterministic) rules of production at any level (that is, not merely at the final level). Finally, in the absence of a facility for inducing disjunctions (i.e. if all rules were deterministic), only one possible structure (except for variations caused by the disjunctions inherent in recursions) can be represented by the grammar. Thus, while recursions provide the means for generating infinities of sentences, disjunctions provide the means for generating different varieties of sentences.

4. INDUCING DISJUNCTIONS USING ENTROPY (SENTENCE-ORIENTED METHOD)

Up to now, only non-disjunctive (deterministic) rules of production have been induced. These rules of production have been induced as the result of the existence of highly reliable regularities in the sample. For example, the fact that the symbol "a" appears in position 1 of the sample and an "e" appears in position 4 of the sample may imply, with 99% reliability, that the symbols "bb" appear in positions 2 and 3. Perhaps "cc" appears in positions 2 and 3 with conditional probability .5%. In this situation, a rule of production

aNe -> abbe

may be induced, where N is a non-terminal symbol. This rule faithfully represents the sequence of symbols in the sample 99% of the time. Actually, three "regularities" are involved in the above recoding. First, there is the highly reliable Type II regularity R1 = <"a%%e", "_bb_", .99, Y>. This regularity is the basis for the recoding. Then, there are the two Type V regularities (or "non-regularities", if you prefer) R2 = <"a%%e", "_cc_", .005, Y> and R3 = <"a%%e", "_dd_", .005, Y>. These two regularities are in effect ignored, because they represent a relationship among the symbols of the sample which occurs only rarely. Note that the entropy of the ensemble of probabilities <.99, .005, .005> is very nearly zero. This near-zero entropy is associated with the deterministic rules of production that are induced by the Algorithm. Now suppose that, in a given sample Y, we have the following three regularities:

R1 = <"a%%e", "_bb_", .33, Y>,

R2 = <"a%%e", "_cc_", .33, Y>, and R3 = <"a%%e", "_dd_", .33, Y>. Now the three possible substitution instances (namely, "bb", "cc", or "dd" in positions 2 and 3) occur with equal probability. Certainly, one could not reasonably induce a rule of production that favored any one of the three possibilities (e.g. aNe -> abbe). The reason is that the context here (i.e. an "a" in the first position, and an "e" in the fourth position) does not reliably predict the intervening 2 symbols. Indeed, of the three possibilities that occur, the three occur equally often. On the other hand, there are only three substitution instances; that is, strings such as "ae", "ab", etc. are not possibilities. Note that, in this situation, the ensemble of probabilities is <.33, .33, .33>, and that the entropy of this ensemble is large (and indeed maximal, if only the three substitution instances that actually occur are considered). The above suggests a criterion for inducing disjunctive rules, starting at the very first level of the Grammar Discovery Algorithm. Let W be a set of w regularities R_i = <γ, φ_i, P_i, Y>, i = 1, ..., w, having the same context-sub-sequence γ and having non-zero conditional probabilities P_i. (In general, w << C^h, where h is the number of predicted positions in the mask associated with the regularities and C is the size of the current alphabet.) Suppose the entropy of the ensemble of these w non-zero probabilities is maximal, or nearly maximal; that is, suppose

log2(w) + Σ(i=1..w) P_i log2(P_i)

is zero, or small. (The quantity -Σ P_i log2(P_i) is the entropy of the ensemble, and log2(w) is its maximal possible value.) In this situation, we write the disjunctive (non-deterministic) rule of production

α -> A1 | A2 | ... | Aw

where α is the (common) antecedent (left) side, as may be obtained from any of the R_i, and where A_i is the A-sequence for R_i, 1 <= i <= w. Note that we induce deterministic rules of production when the reliability of a single regularity is high (the entropy over the possible substitution instances is zero, or small), and that we induce disjunctive (non-deterministic) rules of production when there are several equally-likely (or nearly equally-likely) substitution instances (i.e. the entropy over the possible substitution instances is maximal, or nearly maximal). With intermediate entropy, we do no encoding, and we defer resolution to a higher level. Note also that with either approach, the goal of minimizing the number of rules of production and maximizing the parsimony of the induced grammar is furthered, as is the goal of maximizing the non-trivial and hierarchical quality of the grammar.
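A minimal Python sketch of this three-way criterion (the thresholds are ours and purely illustrative):

import math

def decide(probabilities, low=0.1, high=0.1):
    # Compare the entropy of the w non-zero conditional probabilities
    # sharing one context with its maximum, log2(w): near zero means a
    # deterministic rule; near maximal means a disjunctive rule;
    # otherwise defer resolution to a higher level.
    w = len(probabilities)
    entropy = -sum(p * math.log2(p) for p in probabilities)
    if entropy <= low:
        return 'deterministic'
    if math.log2(w) - entropy <= high:
        return 'disjunctive'
    return 'defer'

print(decide([0.99, 0.005, 0.005]))  # deterministic
print(decide([1/3, 1/3, 1/3]))       # disjunctive
print(decide([0.6, 0.3, 0.1]))       # defer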

In certain samples from formal systems, there is an equivalence between generalized rules and disjunctions. Suppose strings α, β, and γ can be disjunctively substituted for a non-terminal N appearing in some string δ (using the rule N -> α | β | γ). The string δ may be written as δ = δ1 N δ2, where δ1 and δ2 are strings. Define Z as the set containing α, β, and γ, and then define z as a meta-variable ranging over Z. If it is appropriate to interpret the sample as containing theorems of a formal system, we can then rewrite the disjunctive rule N -> α | β | γ as the universally quantified generalization

for all z in Z:  δ1 z δ2

for that formal system. Conversely, if we have a universally quantified generalization

for all z in Z:  F(z)

where F(z) is a rule in which z appears, then we can write N -> z1 | z2 | ... | zn, where Z = {z1, z2, ..., zn}. As mentioned earlier, implicit disjunctive rules in a grammar can be gathered together and written as one explicit disjunctive rule. We can therefore define the multiplicity of an explicit disjunction as the number of different disjunctive alternatives stated in the rule. Rules of production are then of three types: (1) explicit disjunctive rules, (2) recursions, and (3) non-disjunctive, non-recursive rules. If a grammar consists only of the third type of rule, at most one sentence can be generated. It is necessary to have disjunctions to attain a variety of sentence structures. If the grammar consists only of rules of the first and third types, there are no more sentences than the product of the multiplicities of the explicit disjunctions. This number is, of course, finite. This product is an upper limit on the number of essentially different structural types for the sentences generated by the grammar (of course, through ambiguity and for other reasons, this upper limit may not be attained). Indeed, variety in structural types arises only from disjunctions. If there are any recursions in the grammar, the number of sentences that may be generated by the grammar becomes infinite (assuming the recursive rule can be used).

Note, however, that the number of essentially different structural types of sentences is still bounded. Recall also that it was noted earlier that every recursive rule is inherently a binary disjunctive rule, one of the disjunctive possibilities being the recursive part of the rule, and the other being the so-called "base" string. Note also that the recursive rule can always be written so that this base string is the empty string. A normal-form grammar here is a grammar such that all of the rules of production are rules of one of the following types: (1) explicit disjunctive rules, (2) recursions with an empty base string, and (3) non-disjunctive, non-recursive rules. For every grammar, there is a normal-form grammar that generates exactly the same language; that is, there is an equivalent normal-form grammar. In general, for a normal-form grammar, the number of essentially different sentence structures is bounded above by the product of the multiplicities of the explicit disjunctive rules and 2^nr, where nr is the number of recursive rules. Note, from the description of the Grammar Discovery Algorithm, that every induced grammar produced by the Algorithm is a normal-form grammar.

G. TERMINATION PROCEDURE

The Grammar Discovery Algorithm terminates when the sample Y of the current level consists of only one symbol. This symbol becomes the starting symbol of the induced grammar. The Grammar Discovery Algorithm arrives at this situation in one of two ways. If the Grammar Discovery Algorithm is operating in the first mode (i.e. there is no initial punctuation between sentences), then this situation is arrived at naturally by the reduction of Y to a length of one. If the Algorithm is operating in the second mode (i.e. there is initial punctuation between the sentences of the sample), then the sample at the next-to-last level may have consisted of a concatenation of single non-terminal sentences and initial punctuation marks (i.e. S1.S2. and so forth). In that case, the reduction of the sample from this level to the final level may be accomplished by (a) removing repetitions of the same non-terminal from the sample, and (b) inducing disjunctions over the single symbols. Typically this disjunction occurs at an intermediate level, and process (a) then applies to the repetitions at the final level. Given a finite sample Y, the Grammar Discovery Algorithm converges to some induced grammar after a finite amount of time. If M1 >= 2 (as is always the case), and if only context-free rules are being generated (as described in a later section), then each application of each context-free rule reduces the length of the sample by at least one symbol. Even if each transformation involves only one application of one binary constituent rule (M of 2), the Algorithm would terminate after at most either N-1 levels (N being the length of the sample, if there is no initial punctuation in the sample) or Nmax-1 levels (Nmax being the length of the longest sentence in the sample, if there is initial punctuation).

If context-sensitive rules are being generated, the reduction in the length of the sample Y from level to level depends on there being more than one contiguous predicted symbol in the mask of at least one regularity used at least once in the recoding. One can assure this by considering only masks with this property, or by requiring that there be at least one application of such a rule as part of each transformation (or, of course, by having at least one context-free rule in each transformation). In practice, of course, one usually "covers" the entire sample Y (or nearly all of it), so that these requirements are virtually automatically satisfied by any transformation.

H. CHOICE OF LIMITS ON LENGTH OF REGULARITIES IN THE SEARCH PHASE

In the Search Phase of the Grammar Discovery Algorithm, masks of length M = M1, ..., M2 are considered. The lower limit M1 and the upper limit M2* on the length of the masks to be considered are determined from the following considerations:

(1) Clearly one cannot scan the sample Y with an M greater than the length Nmax of the longest sentence in Y. If the Grammar Discovery Algorithm is operating in the first mode (i.e. with no initial punctuation in the sample), then M2 cannot be larger than N, the sample length. If the Grammar Discovery Algorithm is operating in the second mode (i.e. with initial punctuation), M2 is limited above by Nmax, the length of the longest individual sentence in the sample Y.

(2) An M of 1 is appropriate only in the case where each symbol in the sequence Y is statistically independent of the others, or in the case where Y consists entirely of one symbol repeated endlessly. In the case of one symbol repeated endlessly, an H (information rate) of zero is attained. In the case of statistical independence among the single (different) symbols, an information rate greater than zero is attained; and indeed H is maximal (given the alphabet size) when the symbols occur equally often. These two cases are so uninteresting that we specify M1 >= 2 to avoid them.

(3) If M is large, the number of different sequences of length M over C symbols is large. Thus, the number of appearances of any particular sequence in the sample is relatively small. Certainly, if a sequence occurs only once, twice, or three times, the statistics about these sequences will not be meaningful. Thus, an M should be tried which is not too large relative to the total length N of the sample. Also,

*M1 and M2 are parameters in the computer program.

if N < C^M, it is not even possible for all possible sequences of length M to appear, even if, in these N symbols, no sequence appeared more than once. Thus, M should be chosen so that log_C N >= M.

(4) In attempting to induce recursions, the point was made that if there is a recursion, it will manifest itself in the tentative rules induced at consecutive levels, provided M is large enough. Thus, either M should be large enough to encompass any recursion, or the search for recursions should extend between non-consecutive levels. This could be done in the same fashion as the search for recursions between consecutive levels.

It should always be remembered that the entire approach to grammar discovery described here is based on the idea of using a relatively small M in searching for regularities. If a regularity is missed at one level because M was too small, then we claim the regularity will be detected at a higher level. Thus, if the maximum length of M is limited to, say, 5, and a Type I (100% reliable) regularity of length 10 exists in the sample, that regularity will be detected at a higher level of the process. In the simplest case, the first 5 symbols may be encoded as one new symbol and the next 5 symbols as a second new symbol. If the first 5 symbols at the first level reliably predict the five succeeding symbols, then this regularity involving 10 symbols at the first level manifests itself as a regularity involving two symbols at the second level: namely, the first new symbol reliably predicts the second new symbol as its immediate successor. It should be emphasized that finding two regularities of length 5 at the first level of the process, and then finding one regularity of length 2 at the second level, is not at all the same as finding the one regularity of length 10 at the first level. This point is made in

Goodall (1962). Combinatorics aside, the important difference is that the first approach discovers a hierarchical relation in the sample, and offers a significant insight into the sub-structure of the phrase of length 10, while the other approach merely catalogs the occurrence of a gross event. The difference is between the hierarchical description

A -> B C,  with  B -> a1a2a3a4a5  and  C -> a6a7a8a9a10

and the flat description

A -> a1a2a3a4a5a6a7a8a9a10

Indeed, if M is large enough, the subsequences occurring in any finite sample become quite regular and unique, the limiting case of this being a completely trivial, non-hierarchical, and non-parsimonious grammar.
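A minimal Python sketch combining considerations (1) and (3) into an upper limit for M2 (the function name and the use of the floor are ours):

import math

def upper_limit_M2(N, C, N_max=None):
    # M2 may not exceed the sample length N (or, with initial punctuation,
    # the longest-sentence length N_max), and should satisfy M <= log_C N
    # so that statistics about length-M sequences can be meaningful;
    # M1 >= 2 gives the lower bound of 2.
    cap = N if N_max is None else N_max
    return max(2, min(cap, int(math.log(N, C))))

print(upper_limit_M2(N=1000, C=2))  # 9, since log2(1000) is about 9.97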

I. CHOICE OF PARAMETERS IN THE RECODING PROCEDURE

The Recoding Procedure described earlier is only one possible recoding procedure. Several variations in this Recoding Procedure are possible. For example, the sample Y may be scanned from right to left, instead of from left to right, in the search for opportunities for developing rules of production. Or, the scan could be started at some intermediate symbol. Or, the positions could be scanned in some order determined by other considerations, such as the values of the conditional probabilities encountered, etc. Similarly, the regularities could be considered in ascending, rather than descending, order of length*, thereby giving preference to longer regularities. Also, the position for starting the scan is variable.** If the length of the regularity varies between M1 and M2, and the scan of Y is started at position T < M2, then obviously a regularity of length less than or equal to T may well be found to recode some or all of these first T positions, to the exclusion of a regularity of length M2. The Recoding Procedure can also operate so that whenever recoding is attempted using phrases no longer than M, the recoding is actually done using only phrases of length exactly M. This is called a strict recoding***, and is appropriate only when the sample is believed to come from a "uniform" code source (Fano). This variation in the Recoding Procedure would be only rarely appropriate. Another variation in the Recoding Procedure concerns the allowed range in the values of the conditional probabilities of the regularities

*This variation is under the control of the variable VDIR in the computer program implementing the Grammar Discovery Algorithm.

**This variation is under the control of the variable VSTART.

***This variation is under the control of the variable STRICT.

used in recoding. The conditional probabilities associated with the regularities are categorized by type (i.e. Type I, Type II, etc.), and one must specify whether the Recoding Procedure can use only absolutely reliable (Type I) regularities, or whether Type II or Type III regularities can also be used.**** Finally, Recoding Procedures can differ according to the order in which the various criteria for selecting regularities are applied. For example, one can consider all the Type I regularities which are also of length M, and then the other Type I regularities of different lengths. Alternately, one can consider all the regularities of length M which are also of Type I, and then consider other regularities of length M which are of a different type. The impact of these variations will be discussed in a later section. Finally, it should be noted that the Recoding Procedure, even with the variations noted above, is itself only an abbreviation of a more combinatorially complete recoding procedure, that is, a recoding procedure which is exhaustive in the sense that it includes all possible variations in recoding. This complete recoding procedure would not scan the sample from left to right, or from right to left, or from the middle out. Instead, it would start by considering all possible ways of partitioning the given sample Y of length N into parts, with no part containing more than M symbols. The number of such partitions is the same as the number of partitions of the integer N into parts no greater than M (Riordan, 1958), and this number is very large. Then all the possible regularities that might be applied to each part of each partition must be considered. And for each partition, and each such part, each possible regularity

****This variation is under the control of the variable ALLOW.

must be substituted into the part, a recoding attempted, and the entropy, parsimony, and recursive parsimony of that recoding computed. And, when a recoding is finally completed at one level, all the possible different recodings at each successive level must be considered before that recoding is finally selected to be the recoding at the present level.
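For a sense of the combinatorics, a minimal Python sketch counting the partitions of the integer N into parts no greater than M (the count cited from Riordan):

def partitions_max_part(N, M):
    # Dynamic programming over the largest allowed part: table[n] holds
    # the number of partitions of n using parts introduced so far.
    table = [1] + [0] * N
    for part in range(1, M + 1):
        for n in range(part, N + 1):
            table[n] += table[n - part]
    return table[N]

print(partitions_max_part(10, 3))   # 14
print(partitions_max_part(10, 10))  # 42, the unrestricted partition count p(10)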

J. EXTENSIONS OF THE GRAMMAR DISCOVERY ALGORITHM TO SAMPLES OTHER THAN LINEAR SEQUENCES OF SYMBOLS

Earlier we made the statement that the methods of induction and grammar discovery described herein are not particular to the format of presentation of the sample. We then proceeded to describe a specific Grammar Discovery Algorithm dealing with the problem of inducing a grammar to generate a given sample of sentences presented as a linear sequence of symbols. In this section, we return to this point and argue for the general applicability of the methods of induction and of the Grammar Discovery Algorithm in terms of a two-dimensional pattern recognition problem. It will be seen that this rephrasing is not particular either to the Grammar Discovery problem for sentences or to the two-dimensional pattern recognition problem, and that the main features and insights of the Grammar Discovery Algorithm apply to other problems, as well as to similar problems presented in other formats. Let us consider the simplest kind of two-dimensional pattern recognition problem. The sample will consist of a two-dimensional raster (matrix) of digital symbols. The raster may, for example, consist of binary digital symbols representing the digitalized image of certain items to be recognized, perhaps printed letters of the arabic alphabet. The alphabet for this sample is the set of whatever symbols appear in the raster, perhaps 0 and 1 in the case of a binary raster. A raster is said to have initial punctuation if certain sub-matrices in the raster are separated in some way, for example, if 11 x 7 sub-matrices were outlined such that each sub-matrix is presumed to contain one letter of the arabic alphabet. In contrast, a raster would be unpunctuated if no such initial punctuation of the raster were specified. A mask is a sub-matrix over the context symbol "_", the predicted symbol

120 "%", and-if the mask is ternary-the don't care symbol '#'. 6 and are submatrices defined in a manner analagous to the subsequences 6 and ( of the Grammar Discovery problem. A regularity is defined exactly as for the Grammar Discovery problem except that 6 is the context-sub-matrix and ( is the predicted-sub-matrix. In the Search Phase, various masks are considered, and regularities are catalogued and characterized by type according to their conditonal probability. In the recoding phase, the domain is partitioned ("covered") by sub-matrices. Whenever the context-sub-matrix of a regularity of allowably high type occurs in one of the parts of the domain, this part is recoded. Rules of production are developed form regularities to express this recoding in a way exactly analogous to the 1-dimensional case. For example, the letter "L" occurring as in a 12 x 12 raster amongst a large sample of arabic letters might be recoded as shown below. LI>I E Here an 8 x 3 pillar of l's is recognized as a feature of many letters, and recoded as "V" (vertical line); the 3 x 3 vertical-right intersection sub-matrix is recognized and recoded as "I"; and the 3 x 4 hori-,~~~~~~~~~~~3

zontal line might be recognized and recoded as "H". The 9 x 5 blank area might be recoded as "B". Thus, the letter "L" is recoded as a 2 x 2 sub-matrix containing a "V", an "I", an "H", and a "B". Indeed, this description is precisely what an "L" is, namely, a vertical line segment intersecting at its end with the end of a horizontal line segment. Note that both the vertical and the horizontal segment may be represented by recursive rules. If L's occur often in the sample, this 2 x 2 sub-matrix might itself be given a name and recoded. One should note in passing that this entire paper is about induction and pattern recognition over samples of discrete symbols. Suppose that, instead of doing a basic symbol-by-symbol matching operation, one does a correlation between time segments of continuous signals. It may be possible to extend the notion of regularity to continuous signals. The time shift operation inherent in the matching of symbols has an immediate analog for continuous signals. Defining the context part, the predicted part, and the don't care part of a signal merely involves considering the portion of a given signal restricted to a particular time domain, with the correlation coefficient playing the role of the conditional probability. The idea of a regularity amongst a sample of continuous signals may be defined in the obvious way. The universe of possible masks is infinite, but it is possible to imagine some discrete set of time intervals being used to make this universe finite. The concepts of entropy, parsimony, and recoding all have obvious analogs. It should be noted that the operation of digitalizing a continuous signal is itself a recoding, or punctuating, process, in exactly the sense we have been discussing throughout this paper. In digitalizing

data, a fidelity criterion (Chomsky & Miller) is used to associate almost identical continuous signal segments and to encode them under a common discrete digital name.
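To see how little changes in two dimensions, here is a minimal Python sketch (our own toy encoding: the raster as a list of strings, with "%" marking the predicted positions of a context-sub-matrix) of testing whether a context-sub-matrix occurs at a given position:

def context_occurs(raster, context, top, left):
    # Every non-"%" cell of the context-sub-matrix must equal the
    # corresponding raster cell when the sub-matrix is placed with its
    # upper-left corner at (top, left); "%" cells are the predicted
    # positions and are unconstrained by this test.
    for i, row in enumerate(context):
        for j, sym in enumerate(row):
            if sym != '%' and raster[top + i][left + j] != sym:
                return False
    return True

raster  = ['111',
           '101',
           '111']
context = ['1%1',
           '%%%',
           '1%1']
print(context_occurs(raster, context, 0, 0))  # True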

III. THE SELF-PUNCTUATING NATURE OF THE GRAMMAR DISCOVERY ALGORITHM

A. INTRODUCTION

The choices of the upper and lower limits on the length of the regularities searched for in the Search Phase, and the choice of parameters in the Recoding Procedure, affect the partitioning (punctuating) of the sample into the short parts which are in turn recoded. Since the choice of parameters in the Recoding Procedure and the values of M1 and M2 are specified externally according to heuristic considerations, the grammar that is ultimately induced would seem to be the product, not of the Grammar Discovery Algorithm, but rather of these external choices. These external choices would seem to lead to two principal types of effects: First, there would seem to be the effect of choosing M2 too small in the Search Phase and thereby "missing" longer regularities in the sample. Second, there would seem to be numerous effects of partitioning the sample for purposes of recoding: perhaps of missing regularities because the regularity encompasses symbols that end up in different parts of this partition, or, perhaps, of losing the "synchronization" of the symbols in the sample (as, for example, in a sample produced by a uniform code source). It will be seen that both the choice of M2 and the choice of parameters in the Recoding Procedure ultimately concern the partitioning, or "punctuating", of the sample into parts. In fact, the Grammar Discovery Algorithm is essentially a punctuating process (or a "selective punctuation" process, as Goodall would call it). More importantly, the Grammar Discovery Algorithm is "self-punctuating," and recovers from the presumed predetermining effects of the choice of M2 and of synchronization.

B. EFFECT OF CHOICE OF LIMITS ON LENGTH OF REGULARITIES IN THE SEARCH PHASE

1. INITIAL DISCUSSION

Let us first consider the possible limiting and predetermining effects of the choice of M2. (M1 is normally 2 and is not an issue.) Suppose that the maximum length of regularity searched for in the Search Phase is M2, but that a regularity of length M, with M > M2, exists in the sample Y, say in positions h+1, ..., h+M. The regularity that might be missed might, for example, be that the symbol Y(h+1) in position 1 of the regularity and the symbol Y(h+M) in position M together reliably determine all the intermediate symbols Y(h+2), ..., Y(h+M-1). If M is, say, 8, the regularity would be characterized by a context-sub-sequence of "a%%%%%%h", a predicted-sub-sequence of "_bcdefg_", a mask of "_%%%%%%_", and a conditional probability of 100% (Type I, absolutely reliable). That is, the regularity is <"a%%%%%%h", "_bcdefg_", 1.0>. If M2 is only 6, this regularity would appear to be missed. We claim that if a regularity is missed at one level because M2 was too small, then the regularity will be detected as a regularity at a higher level. We now proceed to make this statement more precise.

2. EXTENSIONS OF A REGULARITY

First, it is necessary to note that the regularity which may be "missed" is the longest regularity encompassing the symbols involved. This does not weaken the statement; it does not sacrifice generality; indeed, it would seem to make the statement harder to verify. Recall that a regularity R = <γ, φ, P, Y> is characterized by its context-sub-sequence γ (a sequence of length M over the current alphabet V_c augmented by the predicted symbol "%"), its predicted-sub-sequence φ (a sequence of length M over the current alphabet V_c augmented by the context symbol "_"), and the value P (which expresses the conditional probability that the context-sub-sequence γ predicts the predicted-sub-sequence φ in the sample Y). A regularity R = <γ, φ, P, Y> may be p-extended to the left (right) to a new regularity R' = <γ', φ', P', Y> by the addition of a predicted position, that is, by increasing the length of γ and φ by one: by annexing a "%" to the context-sub-sequence γ on the left (right), thus forming γ'; by annexing a specified symbol from the current alphabet V_c to the corresponding position of the predicted-sub-sequence φ, thus forming φ'; and by changing the conditional probability P to P' to express the conditional probability that γ' predicts φ' in the sample Y. Note that there are 2C possible p-extensions to the left (or right) for a given regularity. Note also that P' <= P. Similarly, a regularity R = <γ, φ, P, Y> may be c-extended to the left (right) to a new regularity R' = <γ', φ', P', Y> by the addition of a context position, that is, by increasing the length of γ and φ by one: by annexing a "_" to the predicted-sub-sequence φ

on the left (right), thus forming φ', by annexing a specified symbol from the current alphabet V_c to the corresponding position of the context-sub-sequence y, thus forming y', and by changing the conditional probability P to P' to express the conditional probability that y' predicts φ' in the sample Y. Note that there are 2·C possible c-extensions (C to the left and C to the right) for a given regularity. Note also that P' ≥ P. Of course, the type of a regularity may change under extension: p-extensions are in general of lower type (less reliable) than the regularity from which they were derived, and c-extensions are in general of higher type (more reliable) than the regularity from which they were derived. In our discussions, if it does not matter whether an extension is a c-extension or a p-extension, it will be called simply an extension.

3. DEFINITION OF A MAXIMAL REGULARITY

A regularity R = <y, φ, P, Y> of Type I, II, or III is said to be preserved (p-preserved) (c-preserved) under extension provided that P' ≥ P, or that the extension is of the same type as R. A regularity R is said to be maximal (p-maximal) (c-maximal) provided it is not preserved under extension (p-extension) (c-extension) for any possible extensions (p-extensions) (c-extensions) of it. The existence of a maximal regularity means that a particular sequence appears as a unit in the sample Y, but that the symbols surrounding this unit vary widely. For example, if R = <"ab%", "  c", 1.0, Y> is a regularity, it means that whenever "ab" appears in Y, the symbol "c" always follows. If R is a maximal regularity, it means that the sequence "abc" appears in the sample Y embedded in all variety of different environments, perhaps as "dabce", "fabcg", etc. If R' = <"a%", " b", 1.0, Y> is also a regularity in Y, it is not maximal, because R' may be extended to R. Equivalently, "ab" does not appear in a variety of different contexts in Y, but rather always appears with a "c" to its right.
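A companion sketch, ours and hedged the same way, enumerates the 2·C p-extensions and 2·C c-extensions and applies one reasonable reading of the maximality test: an extension is taken to preserve R only if it keeps both R's reliability and R's full set of occurrences. On the toy sample below, <"ab%", "  c", 1.0, Y> tests maximal while <"a%", " b", 1.0, Y> does not:

    def stats(context, predicted, Y):
        """Return (P, q): conditional probability and context count."""
        M, q, qp = len(context), 0, 0
        for i in range(len(Y) - M + 1):
            w = Y[i:i + M]
            if all(c == "%" or c == x for c, x in zip(context, w)):
                q += 1
                if all(c != "%" or p == x
                       for c, p, x in zip(context, predicted, w)):
                    qp += 1
        return (qp / q if q else 0.0), q

    def extensions(context, predicted, alphabet):
        """The 2·C p-extensions and 2·C c-extensions of a regularity."""
        for s in alphabet:
            yield "%" + context, s + predicted   # p-extension, left
            yield context + "%", predicted + s   # p-extension, right
            yield s + context, " " + predicted   # c-extension, left
            yield context + s, predicted + " "   # c-extension, right

    def is_maximal(context, predicted, alphabet, Y):
        """One reading of the definition: R is maximal when no extension
        keeps both R's reliability and R's full set of occurrences."""
        P, q = stats(context, predicted, Y)
        return not any(P2 >= P and q2 == q
                       for c, p in extensions(context, predicted, alphabet)
                       for P2, q2 in [stats(c, p, Y)])

    Y = "dabcefabcgfabcd"             # "abc" embedded in varied contexts
    print(is_maximal("ab%", "  c", "abcdefg", Y))  # True: "abc" is a unit
    print(is_maximal("a%",  " b",  "abcdefg", Y))  # False: extends to "ab%"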

4. PERSISTENCE OF MAXIMAL REGULARITIES AFTER RECODINGS

We now claim that a maximal regularity of length M will usually not be missed by the Grammar Discovery Algorithm because M2 (where M > M2) was chosen too small. If a maximal regularity exists at one level of the sample, it will usually persist and continue to exist after the recoding at the next higher level. Thus, an unfortunately small choice of M2 will not cause the maximal regularity to be lost. Here "usually" is our "almost always," and it means "in all cases except possibly for a case whose occurrence requires the coincidence of several independent events, each of very small probability"; in effect, "usually" refers to second order (and lower) effects. Let us assume that a maximal regularity R = <y, φ, P, Y> of specified type (Type I, II, or III) exists in the sample Y, and that the regularity is of length M > M2. Suppose that the context sequence y appears q times in the sample Y. Let A be the domain sequence of R in Y. Say that A appears q' times in Y. Note that P = q'/q. Let Aj denote the j-th occurrence of the sequence A in Y (j = 1, ..., q'). Let Aj(i) denote the i-th symbol in the j-th occurrence of A in Y, where i = 1, ..., M and j = 1, ..., q'. Let aj be the symbol appearing in Y to the left of the left-most symbol of the j-th occurrence of A in Y (j = 1, ..., q'), and let bj be the symbol appearing in Y to the right of the right-most symbol of the j-th occurrence of A in Y. Note that because R is maximal, no symbol in the sequence <a1, ..., aq'> or <b1, ..., bq'>, and no pair in the sequence of ordered pairs <(a1,b1), ..., (aq',bq')>, will appear more than ε·q' times (since R is of Type III or better). There are apparently three cases to consider:

(1) None of the strings ajAjbj of length M+2 is encoded by the transformation operating at this level of the Grammar Discovery Algorithm. (2) Some of the symbols in Aj are encoded by a transformational rule, but neither the symbol aj nor the symbol bj is also encoded by the same application of this rule; that is, only symbols interior to Aj are encoded. (3) Some symbols of Aj are encoded by a transformational rule, and either the symbol aj or the symbol bj is also encoded by the same application of this rule. (Note that both aj and bj cannot be encoded by one application of one rule, since M > M2.)

Case (1) poses no problem as to "missing" regularities, since no encoding occurs. At a higher level some encoding may occur, but even then one of the cases below will then apply. It should be noted, however, that the Grammar Discovery Algorithm operates in practice so that most symbols of Y do in fact get "covered" by the rules at each level. When we say that the symbols are "covered," we do not necessarily mean that they are changed, but that they will at least be part of a context-sub-sequence. In summary, this case does not occur often, and is no problem when it does occur.

In case (2), some symbols entirely interior to Aj are encoded. Suppose that the k symbols Aj(h), ..., Aj(h+k-1), k < M, are the symbols interior to Aj which are encoded; that is, they are the domain of the transformation T being applied to the sample Y. Note that this transformation T is made only because there exists some regularity R' = <y', φ', P', Y> of length k of allowably* high type in the sample. As a matter of notation, note that the A of R'

*The variable ALLOW controls this in the computer program.

is Aj(h), ..., Aj(h+k-1), and we call this substring A'. Note also that by T(A)(I) we mean the I-th symbol in the image of A under T. In a transformation, contiguous symbols of the domain A' are encoded as new non-terminal symbols. If the image of Aj, that is, T(Aj), is of the same length as or longer than Aj (the latter occurring only when Unrestricted Rewrite Rules are being generated), no regularity is "missed," because at some higher level the reduction in length finally occurs, and one of the cases here will then apply. If T(Aj) is shorter than Aj, then T(A') was shorter than A'. Say the length of T(A') is f, where f < k. Then the image of Aj is the substring

    T(Aj) = Aj(1), ..., Aj(h-1), T(A')(1), ..., T(A')(f), Aj(h+k), ..., Aj(M)

which is of length M-(k-f). There are now two sub-cases. In the first sub-case, M-(k-f) is still greater than M2. In this case, the regularity R is not "missed," because the second sub-case below will apply at a higher level when the necessary reduction in length finally occurs. The second sub-case is that M-(k-f) ≤ M2. The original maximal regularity of length M now manifests itself as a regularity R'' = <y'', φ'', P'', Y>. We show that R'' exists by constructing it. First we extend the transformation T (which is a semigroup homomorphism) to the domain V_c ∪ {" ", "%"} by defining the extension T* as

    T*(x) = T(x)  if x ∈ V_c
            x     if x ∈ {" ", "%"}

Now we have, for I = 1, ..., M-(k-f), the context-sub-sequence

    y''(I) = y(I)            if 1 ≤ I ≤ h-1
             T*(y')(I-h+1)   if h ≤ I ≤ h+f-1
             y(I+(k-f))      if h+f ≤ I ≤ M-(k-f),

the predicted-sub-sequence

    φ''(I) = φ(I)            if 1 ≤ I ≤ h-1
             T*(φ')(I-h+1)   if h ≤ I ≤ h+f-1
             φ(I+(k-f))      if h+f ≤ I ≤ M-(k-f),

and the conditional probability P'' = P·P'. Note that if the types of regularity allowed for making the transformation T are only of Type I (100% reliable), then R'' will not only be detected, but its conditional probability P'' will be the same as that of R (since P' = 1.0). If the allowed types of regularity used in the recoding are of Type II, then R'' will either still be of Type II or usually at least of Type III (since it would, in general, take a fairly large number k of such recodings to reduce P'' appreciably, each factor P' being within ε of 1.0, where ε was defined earlier to be 1/C, which is very small). Note that this k is, in general, only 1, and will be more than 1 only if case (1) applies at some point, or if, in case (2), T(Aj) is the same length as or longer than Aj for some recodings. If Type III regularities are used in the recoding (particularly if the conditional probability of R' is near the separation value of 1/2 between Types III and IV), then regularities of Type III may be lost. This is, of course, the reason why regularities in the lower range of Type III are not very suitable for recodings.

It should be noted that this discussion of case (2) assumed that one

rule of production (derived from one regularity R') was used to encode the symbols of Aj. In general, more than one rule of production could be invoked. However, the argument is not altered, although the probability P'' would then be diminished by the sum of, say, two small probabilities. It should be emphasized that because M2 is always small, it would be most unlikely that as many as three rules of production would be applied within one such domain.

Case (3) really never happens. By hypothesis, R is maximal. Hence there is no regularity of Type III or of higher type that encompasses aj (or bj) and some of the symbols of Aj; for if there were, there would then have to be a regularity of length M+1 encompassing the sequence ajAj (or Ajbj) of length M+1, and this is impossible, since the fact that R is maximal means that no such regularity exists.

We return now to the example cited earlier: the question of whether the regularity R = <"a%%%%%%h", " bcdefg ", 1.0> of length M = 8 would be missed if the maximum length of regularity searched for in the Search Phase was M2 = 6. This regularity states a relationship among 8 symbols, and, in particular, that two widely separated symbols determine the intermediate symbols. We will assume that this regularity R is maximal; that is, that the string "abcdefgh" appears in a wide variety of different contexts in the sample. With an M2 of 6, the Grammar Discovery Algorithm will be recoding the sample using regularities up to length 6. We assume that some recoding of "abcdefgh" does occur at this level. If some recoding does occur, it is the result of the existence of some regularity R' within the 8 symbols; perhaps that "bc" is always followed by "defg". This recoding would be accomplished by the rule of production

    bcN ---> bcdefg

where N is a non-terminal. Thus, the 8 symbols at the first level would be encoded as "abcNh" in every case. Now, at the next level, it would be noted that an "a" appearing in position 1 and an "h" appearing in position 5 reliably predict the occurrence of "bcN" as the intermediate symbols. This regularity would be noted because its length is 5, which is less than M2, which is 6. Thus, at level 2, this regularity R'' = <"a%%%h", " bcN ", 1.0> can be used to develop the rule

    aPh ---> abcNh

where P is a non-terminal. Note that since "abcdefgh" appears in a variety of different contexts at level one, so will "abcNh" at level two. Thus, no encoding involving the symbols to the left of "a" and no encoding involving symbols to the right of "h" will be developed.
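The construction of R'' given in the persistence argument, and the worked example just concluded, can both be traced in a short Python sketch. This is our illustration, not the thesis program: `contract` and `prob` are invented names, and the strings are chosen to match the example. The first part rebuilds y'' and φ'' from the piecewise definitions; the second verifies that the length-8 regularity, invisible to a search bounded by M2 = 6, reappears with length 5 after the level-1 recoding:

    def contract(seq, h, k, image):
        """Rebuild y'' (or phi'') from y (or phi): the k positions
        h..h+k-1 are replaced by `image` (length f <= k), and the
        suffix shifts left by k-f.  Indices follow the text (1-based)."""
        f, M = len(image), len(seq)
        out = []
        for I in range(1, M - (k - f) + 1):
            if I <= h - 1:
                out.append(seq[I - 1])            # unchanged prefix
            elif I <= h + f - 1:
                out.append(image[I - h])          # the recoded block
            else:
                out.append(seq[I + (k - f) - 1])  # shifted suffix
        return "".join(out)

    # R of the example: length 8; "defg" (h = 4, k = 4) is imaged onto
    # the single non-terminal "N" (f = 1); P'' = P * P' = 1.0.
    print(contract("a%%%%%%h", 4, 4, "%"))   # a%%%h
    print(contract(" bcdefg ", 4, 4, "N"))   # " bcN "

    def prob(context, predicted, Y):
        """Conditional probability of a regularity in Y, as before."""
        M, q, qp = len(context), 0, 0
        for i in range(len(Y) - M + 1):
            w = Y[i:i + M]
            if all(c == "%" or c == x for c, x in zip(context, w)):
                q += 1
                if all(c != "%" or p == x
                       for c, p, x in zip(context, predicted, w)):
                    qp += 1
        return qp / q if q else 0.0

    level1 = "xabcdefghyzabcdefghwabcdefghv"
    level2 = level1.replace("bcdefg", "bcN")  # the rule bcN -> bcdefg
    print(level2)                             # xabcNhyzabcNhwabcNhv
    print(prob("a%%%h", " bcN ", level2))     # 1.0: found at length 5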

C. EFFECT OF CHOICE OF PARAMETERS IN THE RECODING PROCEDURE

Now let us consider the possible limiting and pre-determining effects of the synchronization, punctuating, or partitioning of the sample as it is determined by the choice of parameters of the Recoding Procedure. This punctuating is the product of choices which either are external choices determined according to heuristic considerations, or are choices that are an integral part of the Recoding Procedure itself and of the fact that the Recoding Procedure is not a combinatorially exhaustive procedure for partitioning the sample. We claim that a maximal regularity R = <y, φ, P, Y> will usually not be "missed" by the Grammar Discovery Algorithm because of the partitioning of Y by the Recoding Procedure, except when a regularity of equally high type is used instead as the basis for partitioning and recoding.

As before, case (1) (where there is no encoding) presents no problem of losing regularities. In case (2), we are concerned about the domain (the A sequence) of a maximal regularity R being split between two parts of the partition of Y. (Obviously, if the A of R lies entirely within one part of the partition of Y, there is no problem.) First of all, the partition is not likely to divide a maximal regularity. The partition is not imposed indiscriminately on Y, independent of its regularities. Indeed, the partition is the consequence of the existence of regularities of allowably high type and suitable length. More often than not, the hypothesized maximal regularity will itself determine the partition and therefore be located within one part of the partition. Recall, in particular, that the

Recoding Procedure considers the regularities in order of decreasing type (that is, Type I first), so that the regularity chosen as the basis for making the partition of Y will, at worst, be of the same type as R itself. Also, within a given type, the regularities are considered in order of decreasing reliability, so that the regularity chosen will have a conditional probability at least as high as that of R. Moreover, since the longer regularities are, in practice, considered first (and even if the shorter regularities are considered first, it is the longer regularities that are more reliable, so that they tend to appear first in any case), the maximal regularity will tend to be used as the basis for making the partition in the first place (in preference to any shorter regularity). Thus, the problem of a maximal regularity being divided is not likely to arise in the first place. However, suppose this division does occur. Indeed, in a highly structured sample, there may be many reliable regularities that can be used as a basis for partitioning and recoding, and the A's of these regularities will tend to overlap. But because of the order of considering regularities for recoding purposes, only regularities at least as reliable as R will be invoked.

D. EFFECT OF PRESENCE OR ABSENCE OF INITIAL PUNCTUATION IN THE SAMPLE

The purpose of this section is to make the assertion (quite contrary to intuition) that the presence or absence of initial punctuation separating sentences in the sample is of little consequence in the grammar discovery process, and that if it is absent, the discovery of sentence boundaries is qualitatively the same punctuation problem as grammar discovery in general, and is moreover not even quantitatively (i.e., combinatorially) much more difficult.

It is important to see that the entire Grammar Discovery Algorithm is a punctuating process, that is, a process of partitioning the sample Y into appropriate parts and then recoding the parts. The problem of inducing appropriate sentence boundaries is no different from the process of inducing appropriate phrase boundaries within sentences. Naturally, if one wants to induce sentence boundaries, it is necessary that the sample contain repetitions of various sentences (or at least of significant parts of them), just as the induction of phrase boundaries requires repetitions of the phrases. Indeed, it is the repetition of features which establishes them as features. With this precondition in mind, let us use a particular constructed sample Y to argue the main assertion (above).

Suppose that a sample Y consists of 10 copies of 100 different sentences, each sentence being of length 20, the particular numbers and the uniformity of the number of copies and of the sentence length being unimportant to our argument. The sample thus consists of 1,000 sentences. The 1,000 sentences appear in jumbled disorder (if the sentences appeared in a regular order, they would not be individual sentences!). If the Grammar Discovery Algorithm is

operating in the second mode (i.e., with initial punctuation between sentences), the sample appears as a string of 1,000 sentences, each separated by an initial punctuation mark (the period). If the Grammar Discovery Algorithm is operating in the first mode (i.e., with no initial punctuation), the sample appears as a string of 20,000 symbols. In either mode, the Grammar Discovery Algorithm first searches for local regularities and then recodes them. In the second mode, no search for regularities is made across initial punctuation marks.

The following happens in the first mode: Since the sample is jumbled, the two sentences on either side of each of the 10 occurrences of a given sentence α will, in general, be different. Indeed, the probability that a particular sentence β appears, say, to the left of the given sentence α is only 1/100; and, even with the Birthday Problem phenomena, the odds are against any duplication of any sentence to the left of any of the sentences, and certainly against any reliable occurrence of such a duplication. Thus, it is most unlikely that any of the symbols at the left end of α are related in any way with the symbols occurring to their left. Since the partitioning and recoding of Y is done because of the occurrence of regularities, the symbols within each sentence may be recoded, but groups of symbols spanning the sentence boundary will not. Indeed, each sentence may be what we called a maximal regularity (but it need not be, of course). Thus, although we cannot predict how the recoding of Y will proceed, we can predict that no groups of symbols spanning a sentence boundary will ever be recoded together. Thus, the sentence boundary will be preserved as the recoding proceeds from level to level.

Even if each sentence finally becomes encoded as a single non-terminal at a high level, these single non-terminals will display no regularities amongst each other; indeed, the sequence of 1,000 single non-terminals will have maximal entropy or near-maximal entropy. This may, of course, be an appropriate moment for inducing disjunctions, but the point is simply that even at this level, no regularities should appear, whether for an M of 2, 3, 4, etc., because the sentences (being independent units) were jumbled. Thus, appropriate sentence boundaries will be induced by the same punctuating process as phrase boundaries.

Note also that the search for regularities is not even appreciably shortened by the presence of initial punctuation. If the sentences have an average length of 20, most of the searching for local regularities (with small M's of up to 4, 5, or 6) does not occur over the sentence boundaries anyway. Thus, the presence of initial punctuation (across which the search is not made) does not appreciably reduce the combinatorics of the Search.

IV. ADDITIONAL FEATURES OF THE GRAMMAR DISCOVERY ALGORITHM

A. TERNARY MASKS AND DON'T CARE CONDITIONS

In the description of the Grammar Discovery Algorithm, the masks used to search for regularities in the Search Phase were all binary masks; that is, they were sequences of length M over only the symbols " " ("context") and "%" ("predicted"). The masks were also compact, in the sense that a mask expresses a relation within a certain substring of the sample, and all symbols appearing between the left-most symbol of that substring and the right-most symbol of that substring are either in the context part of the regularity or in the predicted part. One may wonder about regularities involving relations between two or more substrings that are widely separated by symbols that are not part of the regularity, as, for example, might occur in mechanically encoded messages. For example, an "a" in a certain position in the sample and a "b" occurring 6 positions later may reliably predict the symbol occurring half-way between them (in position 4). There are two approaches to finding such regularities. If binary compact masks are being used, this regularity will be detected as a regularity at a higher level, particularly if the distance between the substrings is moderately large or large. Or, this regularity may be detected in the Generalization Process; that is, a sequence of rules of production is found in which one or more of their context positions are found to vary "freely" over the entire current alphabet V_c, and these rules are then replaced by one generalized rule having a universally quantified metavariable in the appropriate positions. This approach is particularly appropriate when the distance between the substrings of the relation is rather small.

The second approach is more certain and involves using ternary masks.* A ternary mask is a sequence of length M over the symbols " ", "%", and the symbol "#" (which is called don't care). There are 3^M ternary masks of length M, as opposed to only 2^M binary masks of length M. To illustrate the use of the "don't care" symbol in ternary masks, consider the following example: The ternary mask is " #%". (Note that the smallest ternary mask using all 3 symbols is of length 3.) This mask refers to the relationship in which a particular symbol predicts a third symbol, regardless of the second symbol. A regularity based on this mask might be <"a#%", " #b", 1.0, Y>; that is, the symbol "a" appearing in any position of the sample Y reliably predicts the occurrence of the symbol "b" two positions later. To express this same regularity using only binary masks would require C (the number of symbols in the current alphabet) separate regularities, namely <"aa%", "  b", 1.0, Y>, <"ab%", "  b", 1.0, Y>, <"ac%", "  b", 1.0, Y>, etc. In general, if there are h don't care positions in a mask, one ternary mask replaces C^h separate binary masks. Thus, there are 2^M·C^M binary regularities of length M, and there are 3^M·C^(M-h) ternary regularities. The smallest non-degenerate situation where both a binary and a ternary mask might be used occurs for values of M = 3 (since the mask must contain at least one " ", one "%", and one "#"), C = 2 (a binary alphabet), and h = 1 (at least one don't care position). For this case, the binary mask approach is more efficient than the ternary mask approach. But when

    h > M·log_C(3/2)

*The variable FR controls the use of binary or ternary masks in the computer program. Ternary masks are not implemented in the computer program at this time.

(which encompasses most other situations, particularly since C will tend to be moderately large and M is usually small), the ternary masks will involve fewer combinations. Moreover, it should be remembered that the Generalization Process is itself a combinatorial process, so that any apparent efficiency of binary masks in the Search Phase is lost in the Generalization Process. Also, the Generalization Process requires a rather large and complete sample before it can operate (at least without making conjectures), and this is another factor in favor of the efficiency of ternary masks.
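The trade-off can be checked numerically. The following sketch, ours, counts regularities under each kind of mask using the figures given above and evaluates the threshold h > M·log_C(3/2); the function names are illustrative:

    import math

    def binary_regularities(M, C):
        return (2 ** M) * (C ** M)       # 2^M masks, C^M symbol fillings

    def ternary_regularities(M, C, h):
        # h don't-care positions carry no symbol, so only M-h positions
        # are filled from the alphabet of size C
        return (3 ** M) * (C ** (M - h))

    def ternary_wins(M, C, h):
        return h > M * math.log(3 / 2, C)

    # The smallest non-degenerate case of the text: M = 3, C = 2, h = 1
    print(binary_regularities(3, 2), ternary_regularities(3, 2, 1))  # 64 108
    print(ternary_wins(3, 2, 1))    # False: binary is cheaper here
    print(ternary_wins(3, 10, 1))   # True once C is moderately large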

B. ALPHABET-ENLARGING VERSUS HUFFMAN-TYPE ENCODINGS

As mentioned earlier, there are two types of transformations: alphabet-enlarging, and non-alphabet-enlarging.

In the alphabet-enlarging transformation (which is the approach used above), there is a non-terminal alphabet; strings containing these non-terminals from time to time replace strings in the given sample. Thus, at level one, the current alphabet V_c consists only of the terminal alphabet V_T; but at higher levels, new non-terminals are added as needed (as rules of production are developed), and the current alphabet V_c becomes the union of the original terminal alphabet V_T and these added new non-terminals. Because these non-terminals are new symbols which are not found in the original sample at level one, a string containing one of these non-terminals can be recognized as being a string from a level higher than level one.*

In the non-alphabet-enlarging transformation, non-terminal symbols are not used. The sample at each level (with the possible exception of the last and highest level) consists only of the symbols occurring at level one. The encoding is done by mapping one string over the given alphabet to another string over the same alphabet. In general, not all strings over the given alphabet are possible in the given sample Y, so that some of the "impossible" strings are available to serve the function which the new non-terminals serve in the alphabet-enlarging approach. Generally, these "impossible" strings will be rather long strings. Note that the two sides of the

*The variable EXR in the computer program specifies whether alphabet-enlarging or Huffman-type recoding is to be done. Although Huffman encoding was attempted in early stages of this work (and the subroutine HUFF remains in the computer program), this feature is no longer operative in the computer program.

rules of production developed from this approach are strings over the same alphabet. Note also that the antecedent (left) side is almost always going to be longer than the consequent (right) side of the rule. This is in contrast to the situation when alphabet-enlarging transformations are used. Thus, when the alphabet-enlarging transformation is used, the Grammar Discovery Algorithm is inherently a context-sensitive (or simpler) process; while when the non-alphabet-enlarging approach is used, unrestricted rewrite rules emerge. Naturally, one must choose the new long strings with care, so that they cannot be confused with natural strings in the sample. These new longer strings must be treated as an encoded unit, and we are not concerned with the statistics or internal regularities that these strings may exhibit. Since the new strings tend to make more symbol strings possible in the sample, the total entropy increases as a result of their addition. An optimal coding procedure, such as the Huffman code procedure (Fano), can be used to produce these new strings in such a manner that it is possible to reverse the code and recover the original sample. When a Huffman coding procedure is used on a sample of messages which are themselves strings over the encoding alphabet, it is necessary to encode every symbol in the original sample. As this type of encoding is applied from level to level, all possible strings become possible in the sample, and the information rate of the sample increases to a maximum.

The two approaches can be combined into a limited-alphabet transformation. In this approach, the number of non-terminals that can be added is limited to a fixed number h. If more non-terminals are needed, combinations (in the Huffman sense) of the allowed non-terminals must be used. Unrestricted rewrite rules would then result.
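The contrast between the two styles of transformation can be seen in a small sketch, ours, in which the phrases, the fresh non-terminal, and the codeword are all invented for illustration:

    def alphabet_enlarging(Y, phrase, fresh):
        """Replace `phrase` with a brand-new symbol; the rule produced
        is fresh -> phrase, and the current alphabet grows by one."""
        return Y.replace(phrase, fresh), (fresh, phrase)

    def non_alphabet_enlarging(Y, phrase, codeword):
        """Replace `phrase` with a string over the *same* alphabet;
        the codeword must be an "impossible" string, i.e. one that
        never occurs in Y, or the recode cannot be reversed."""
        assert codeword not in Y, "codeword must be an impossible string"
        return Y.replace(phrase, codeword), (codeword, phrase)

    Y = "abcabcabd"
    print(alphabet_enlarging(Y, "abc", "N"))        # ('NNabd', ('N', 'abc'))
    print(non_alphabet_enlarging(Y, "abc", "ddd"))  # ('ddddddabd', ('ddd', 'abc'))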

C. RECODING WITH NOISE IN THE SAMPLE

When a recoding is based only on Type I regularities,* no information is lost in the recoding. Whenever a recoding is based on a Type II or lower type regularity R = <y, φ, P, Y>, then (1-P)·100% of the time the transformation is not faithful to the sample. There can be two motivations for using regularities of Type II or lower type:

(1) The sample Y is assumed to contain a small amount of noise and is therefore assumed not to be 100% accurate itself. In this situation, if a regularity is found which has a conditional probability of nearly 100%, it is assumed that the parts of the sample that do not conform to the regularity are in error. The use of a regularity with conditional probability near 100% therefore has the effect of correcting the sample and removing the alleged errors in it.

(2) The sample Y is accepted as being 100% accurate, but one desires to develop a very simple grammar for the sample. Parts of the sample which do not conform to regularities whose conditional probabilities are near 100% are assumed to represent "exceptions," and one is not willing to sacrifice the parsimony necessary to account for all the exceptions. Thus, the simplified grammar will be one that represents most of the sample most of the time. This simplified grammar will have fewer rules of production (more parsimony), but will have higher entropy (more information from the sample is lost).

*The variable ALLOW in the computer program regulates the type of regularity allowed in the recoding.
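Motivation (1) amounts to forcing the sample into agreement with a nearly-reliable regularity before recoding. A minimal sketch of that correction step, ours, with invented helper names and an exaggerated error rate for visibility:

    def correct_sample(Y, context_sym, predicted_sym):
        """Wherever `context_sym` occurs, force the following symbol to
        be `predicted_sym`, overriding the (1-P) fraction of exceptions."""
        out = list(Y)
        for i in range(len(out) - 1):
            if out[i] == context_sym:
                out[i + 1] = predicted_sym
        return "".join(out)

    Y = "qrqrqrqxqr"                  # "q" predicts "r" with P = 0.8 here
    fixed = correct_sample(Y, "q", "r")
    print(fixed)                      # qrqrqrqrqr: the lone "x" is "corrected"
    print(fixed.replace("qr", "N"))   # NNNNN: the recode is now exact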

D. THE CONTEXT-FREE CASE

In the Recoding Phase, certain strings of symbols are recoded. Up to now, this recoding (assuming an alphabet-enlarging encoding) has always involved substituting a new non-terminal symbol for a string of contiguous symbols that are well predicted by a certain context. Thus, the antecedent (left) side of any rule of production that is ultimately developed has fewer than or the same number of symbols as the consequent (right) side of the rule. Thus, each rule developed will be context-sensitive, and the grammar ultimately induced will also be context-sensitive. Thus, a grammar discovery algorithm based on contexts and regularities is inherently a context-sensitive process. It might appear, because the whole Recoding Phase is based on context and conditional probabilities, that only strictly-context-sensitive rules can be developed. Indeed, regularities that have no context part (as, for example, regularities developed from the all-predicted mask of M "%"s) merely record frequencies of occurrence of substrings, rather than any relationship among symbols in the sample. Therefore, it would appear that every rule must have a non-empty context and therefore be strictly-context-sensitive. In fact, however, context-free rules (and regular rules, which are context-free rules with an M of 2) can be developed in one of several ways by the Grammar Discovery Algorithm.

First of all, a context-free rule can be induced using the Generalization Process. Whenever a combinatorially complete set of strictly-context-sensitive rules exists (that is, given h context positions in a rule, all C^h possible strictly-context-sensitive rules appear in the grammar), the Generalization Process can then induce a

context-free rule to replace the entire set of strictly-context-sensitive rules. Certainly, if a given substitution is made in all possible contexts, then the substitution can be made without regard to that context; that is, the substitution is context-free. However, because the Generalization Process is a combinatorial process, this approach is generally not available, whether for reasons of the large combinatorics involved or because not all C^h strictly-context-sensitive rules appear. The latter condition is indeed most restrictive. One can loosen the Generalization Process by allowing the generalization if no exceptions are found; that is, if among all the contexts that do appear, the substitution is invariably made, then a context-free substitution can be generalized. However, even this approach involves an exhaustive examination of a fair number of cases.

A second approach is based on the idea of the maximal regularity, which was defined earlier. Whenever a rule is developed from a maximal regularity R = <y, φ, P, Y>, a context-free rule can be induced, namely N ---> A, where N is a new non-terminal, and where A is the A-sequence for R. The justification for the writing of this context-free rule in this situation is that the regularity cannot be extended. This is the case because A appears in the sample Y in a wide variety of contexts (the distribution of symbols surrounding A has a non-zero and presumably high entropy), and therefore A is "free" of its contexts. However, determining that a regularity is maximal (although involving only the checking of 2·C p-extensions and 2·C c-extensions) is itself a small-scale combinatorial process and somewhat unnatural. However, this second approach suggests a third approach.

The third approach is based on the fact that regularities are

considered in the Recoding Phase in decreasing order of their length. Thus, if any regularities discovered in the Search Phase are maximal regularities, they will be considered first in the Recoding Phase. Thus, we can simply write N ---> A whenever we have a regularity R. Moreover, note that if R is not maximal, nothing is lost by this approach, because at a higher level the maximal regularity will still appear as a regularity encompassing N and various other symbols. Suppose the A-sequence of a maximal regularity R' of length M' is A', and that a regularity R of length M, M < M', is used to write the rule N ---> A, where A is the A-sequence of R. Then, with A a proper substring of A', the maximal regularity R' manifests itself at a higher level over the string A'(1)···A'(h) N A'(k)···A'(M'), of length M'-M+1. Using the maximal regularity to write a rule now (assuming that M'-M+1 is now ≤ M2), we might get

    N' ---> A'(1)···A'(h) N A'(k)···A'(M')

where N' is a non-terminal. Thus, we would have two rules, both context-free, and this third approach works.*

*The variable RIT controls whether context-free rules are written in this way in the computer program.
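The third approach can be traced on a toy string. In the sketch below, ours, with invented strings and non-terminal names, a shorter regularity is recoded first as N, and the remainder of the maximal unit is picked up one level later as a second context-free rule:

    level1 = "uabcdevuabcdewuabcdex"     # the maximal unit A' is "abcde"
    level2 = level1.replace("bcd", "N")  # a shorter regularity: write N -> bcd
    print(level2)                        # uaNevuaNewuaNex

    # At level 2 the unit "aNe" recurs in varied contexts, so a second
    # context-free rule M -> aNe (M standing for the text's new
    # non-terminal) can be written at the higher level.
    level3 = level2.replace("aNe", "M")
    print(level3)                        # uMvuMwuMx
    # Together, N -> bcd and M -> aNe derive the original unit "abcde",
    # so nothing was lost by recoding the non-maximal regularity first.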

V. EXAMPLES

In this section, we consider a number of examples that illustrate the Grammar Discovery Algorithm.

EXAMPLE A: Consider the following sample of sentences:

    AA.AAAA.AAAAAA.AAAAAAAA.AAAAAAAAAA.

This sample presumably is from the language whose sentences are all of the form A^(2n), where n ≥ 1. The terminal alphabet consists only of the single symbol "A". The input to the Grammar Discovery Algorithm might be as follows:

    MAXIMUM NUMBER OF LEVELS TO BE TRIED            3
    SMALLEST M TO BE TRIED                          2
    LARGEST M TO BE TRIED                           4
    DIRECTION OF CONSIDERING M IN RECODE            DESCEND
    LAMBDA -- COEFFICIENT OF PARSIMONY              1.00000
    LAMR -- COEFFICIENT OF RECURSIVE PARSIMONY      0.10000
    SOURCE OF INITIAL SYMBOL STRING                 READ
    MODE -- INITIAL PUNCTUATION MARK                2ND
    TYPE OF GRAMMAR DESIRED (MASKS TO BE TRIED)     LS-RS
    RIT -- CONTROLS ANTECEDENT SIDE OF RULES        CF
    WORST TYPE OF P(I) ALLOWED TO ENTER CODE        4
    USE OF BEST M STRICTLY                          YES
    FR -- RADIX OF MASK                             2
    PRINT CONTROL -- PRINT UNSORTED P(I)            NO
    PRINT CONTROL -- PRINT SORTED P(I)              NO
    PRINT CONTROL -- PRINT GRAPH OF P(I)-S          ALL
    PRINT CONTROL -- PRINT SEQUENCE NUMBERS         YES
    SUBROUTINE USED FOR RECODING                    RECODE
    OVERRIDE VALUE FOR MBEST (MCH)                  0
    EARLY ELIMINATION OF UNALLOWABLE P(I)           NO
    INCLUDE ALL-PREDICTED MASK                      NO
    START RECODE AT TIME OF MBEST                   YES
    HSEL -- METHOD FOR COMPUTING H                  4
    CUT-OUT ON FINDING GOOD M                       NO
    METHOD OF FINDING RECURSIONS                    RULE
    INCLUDE IDENTITY RULES                          NO
    TEST FOR RECURSION                              YES
    SIZE OF TERMINAL ALPHABET (INITIAL STRING)      1
    TERMINAL ALPHABET:  A
    INITIAL PUNCTUATION MARK (MODE=2)               .
    SIZE OF BASIC NON-TERMINAL ALPHABET             10
    SYMBOLS OF THE BASIC NON-TERMINAL ALPHABET:  0123456789

At level 1, for an M of 2, the Grammar Discovery Algorithm will develop exactly one possible rule of production, as follows:

    --------------- LEVEL 1 ---------------
    SYMBOL STRING OF LENGTH 35 AND USING ALPHABET OF SIZE 1
    AA.AAAA.AAAAAA.AAAAAAAA.AAAAAAAAAA.

    LEVEL= 1   M= 2
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         2
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              0
    NUMBER OF DIFFERENT MASKS USED                  2
    NUMBER OF POSSIBLE M-SEQUENCES                  1
    NUMBER OF POSSIBLE P(I)                         2
    NUMBER OF P(I)                                  2
    NUMBER OF DIFFERENT CONTEXTS                    2
    SEPARATION VALUE BETWEEN TYPES 3 AND 4          0.50000
    EPSILON FOR DEFINING TYPES 2 AND 3              0.34000

    TENTATIVE RULES OF PRODUCTION FOR M OF 2
    0 ----> AA

    ENTROPY TERM                                    0.0
    PARSIMONY TERM                                  1.00000
    RECURSIVE PARSIMONY TERM                        0.0
    VALUE OF H FOR THIS RECODING ..........         1.00000
    NUMBER OF RULES OF PRODUCTION                   1
    NUMBER OF RECURSIVE RULES                       0
    NUMBER OF IDENTITY RULES                        0
    NUMBER OF TIMES RULES ARE APPLIED               1
    CURRENT STRING Y
    AA.AAAA.AAAAAA.AAAAAAAA.AAAAAAAAAA.
    NEW STRING
    0.00.000.0000.00000.

The pij-graph for the one rule actually used in the recoding is rather sparse:

    GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 2
    [graph omitted: the vertical axis runs from 0.0 to 1.00; the single
    p(i) used appears at 1.00]
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         1
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              0
    NUMBER OF P(I)                                  1

Similarly, for an M of 3 at level 1, the sample admits of the development of only one rule of production:

    LEVEL= 1   M= 3
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         4
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              0
    NUMBER OF DIFFERENT MASKS USED                  4
    NUMBER OF POSSIBLE M-SEQUENCES                  1
    NUMBER OF POSSIBLE P(I)                         4
    NUMBER OF P(I)                                  4
    NUMBER OF DIFFERENT CONTEXTS                    6
    SEPARATION VALUE BETWEEN TYPES 3 AND 4          0.50000
    EPSILON FOR DEFINING TYPES 2 AND 3              0.33333

    TENTATIVE RULES OF PRODUCTION FOR M OF 3
    0 ----> AAA

    ENTROPY TERM                                    0.0
    PARSIMONY TERM                                  1.00000
    RECURSIVE PARSIMONY TERM                        0.0
    VALUE OF H FOR THIS RECODING ..........         1.00000
    NUMBER OF RULES OF PRODUCTION                   1
    NUMBER OF RECURSIVE RULES                       0
    NUMBER OF IDENTITY RULES                        0
    NUMBER OF TIMES RULES ARE APPLIED               1
    CURRENT STRING Y
    AA.AAAA.AAAAAA.AAAAAAAA.AAAAAAAAAA.
    NEW STRING
    AA.0A.00.00AA.000A.

Note that the new string contains both the original terminal symbol "A" and the induced non-terminal "0".

The pij-graph for an M of 3 is as sparse as for an M of 2, and will not be shown here. For an M of 4, one rule is again induced at level 1, as follows:

    TENTATIVE RULES OF PRODUCTION FOR M OF 4
    0 ----> AAAA

Application of this rule leads to the following new string:

    AA.0.0AA.00.00AA.

We omit the pij-graph and other output for M of 4. Note that we have temporarily suppressed consideration of the effect of identity rules in this example. In each case the total value of H consisted only of the value of parsimony. For M of 2, 3, and 4, the value of parsimony was the same. The best M, therefore, would be the shortest M, and an M of 2 would be chosen, as noted by the following:

    BEST M IS 2 WITH H OF 1.00000

More significant, however, is that 2 is in fact the best M when the effect of the identity rules is considered. For an M of 3 or 4 (but not an M of 2), a recoding of the entire sample requires application of an identity rule which transforms the terminal symbol "A" into itself. This rule, when given any non-zero weight at all, is sufficient to dictate the choice of an M of 2 as the best M. In Example B, the role of these identity rules will again be crucial.

To review level 1, the following is a table of the regularities observed in the sample

    REGULARITIES FOR LEVEL 1
    #   LEVEL  LENGTH  PROB  TYPE  MASK   CONTEXT  PREDICTED
    1   1      2       1.00  1     _%     A%       _A
    2   1      2       1.00  1     %_     %A       A_
    3   1      3       1.00  1     _%%    A%%      _AA
    4   1      3       1.00  1     __%    AA%      __A
    5   1      3       1.00  1     %%_    %%A      AA_
    6   1      3       1.00  1     %__    %AA      A__
    7   1      4       1.00  1     _%%%   A%%%     _AAA
    8   1      4       1.00  1     __%%   AA%%     __AA
    9   1      4       1.00  1     ___%   AAA%     ___A
    10  1      4       1.00  1     %%%_   %%%A     AAA_
    11  1      4       1.00  1     %%__   %%AA     AA__
    12  1      4       1.00  1     %___   %AAA     A___

and the following is the one non-identity rule of production induced at level 1:

    -------------- RULES OF PRODUCTION FOR LEVEL 1 --------------
    0 ----> AA

The new string resulting from application of this one rule of production is as follows:

    NEW STRING
    0.00.000.0000.00000.

The Algorithm now accepts this new string as its input for level 2:

    --------------- LEVEL 2 ---------------
    SYMBOL STRING OF LENGTH 20 AND USING ALPHABET OF SIZE 2
    0.00.000.0000.00000.

For an M of 2 at level 2, the Grammar Discovery Algorithm proceeds just as it did at level 1. The same regularities that were observed at level 1 are observed in this particular sample, except that they occur in terms of the induced non-terminal "0" instead of in terms of the terminal symbol "A". The rule of production

    1 ----> 00

is the rule of production which the Algorithm begins to induce. However, before inducing a new rule of production at a level above level 1, the possibility of instead inducing a recursive rule of production must be considered. Note that the rule 1 ----> 00 is isomorphic to the rule 0 ----> AA, and that the symbol "0" is common to both the consequent side of the rule at level 2 and the antecedent side of the rule at level 1. The same symbol positions of both rules are predicted and context symbols. Therefore, instead of inducing the rule 1 ----> 00 at level 2 and adding this rule to the induced grammar, we instead induce the recursive rule A ----> AA. We delete 0 ----> AA as a rule, and we suppress the rule 1 ----> 00. We obtain

    TENTATIVE RULES OF PRODUCTION FOR M OF 2
    RECURSION NO. 1
    A ----> AA

The new rule (which is a recursive rule) is applied in all possible ways, yielding:

    A.A.A.A.A.

The termination conditions are now satisfied, since the sample is now entirely reduced to sentences of length 1. The induced grammar is thus A ----> AA. No further M's need be considered at level 2; however, if we did continue, we would induce A ----> AAA for an M of 3, and A ----> AAAA for an M of 4. The parsimony again would be identical for M of 2, 3, and 4, except for the fact that identity recodings would be necessary when A ----> AAA and A ----> AAAA are used. Thus, again, an M of 2 would be the best M.
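The recursion test used in this example (and again in Example C below) can be sketched as follows. The helper is ours and covers only the simple rule shapes occurring in these examples:

    def try_recursion(lower, upper):
        """lower, upper: (lhs, rhs) rules from consecutive levels.  If the
        upper rule is the lower rule's image one level up (its rhs uses
        the lower lhs exactly where the lower rhs uses some symbol s),
        return the single recursive rule s -> lower_rhs; else None."""
        (l_lhs, l_rhs), (u_lhs, u_rhs) = lower, upper
        for s in set(l_rhs):
            if u_rhs == l_rhs.replace(s, l_lhs):
                return (s, l_rhs)      # this one rule replaces both
        return None

    print(try_recursion(("0", "AA"),  ("1", "00")))   # ('A', 'AA')
    print(try_recursion(("1", "PP+"), ("2", "11+")))  # ('P', 'PP+')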

EXAMPLE B: Suppose the sample consists of a slightly different set of sentences, namely, sentences of the form A^n, n ≥ 2, as follows:

    --------------- LEVEL 1 ---------------
    SYMBOL STRING OF LENGTH 63 AND USING ALPHABET OF SIZE 1
    AA.AAA.AAAA.AAAAA.AAAAAA.AAAAAAA.AAAAAAAA.AAAAAAAAA.AAAAAAAAAA.

The analysis will proceed in a manner similar to that of Example A, except that now two rules of production will always result. The two rules will be either A ----> AA and A ----> AAA, or A ----> AA and an identity rule mapping A into itself. The additional rule, in both cases, accommodates the sentences of odd length. Note that, as in Example A, other rules (such as perhaps A ----> AAAA or A ----> AAAAAA) will not be developed, because A ----> AA subsumes both and is more parsimonious.

EXAMPLE C: In this example, we consider a sample of well-formed sentences of the propositional calculus in Polish notation, with one term and one operation. The sample is

    P.PPPP+++.PP+PP++.PPPP+++.PPP++P+.PP+P+.PPP++.PP+P+.

The input to the Grammar Discovery Algorithm is as follows:

    [input parameter listing illegible in the original; it follows the
    same format as in Example A, with a terminal alphabet of size 2
    consisting of the symbols "P" and "+"]

For an M of 2, three rules of production are developed, as follows:

    LEVEL= 1   M= 2
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         0
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            2
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            4
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              2
    NUMBER OF DIFFERENT MASKS USED                  2
    NUMBER OF POSSIBLE M-SEQUENCES                  4
    NUMBER OF POSSIBLE P(I)                         8
    NUMBER OF P(I)                                  8
    NUMBER OF DIFFERENT CONTEXTS                    4
    SEPARATION VALUE BETWEEN TYPES 3 AND 4          0.50000
    EPSILON FOR DEFINING TYPES 2 AND 3              0.34000

    TENTATIVE RULES OF PRODUCTION FOR M OF 2
    1 ----> PP
    2 ----> ++
    3 ----> P+

The value of H obtained upon applying these rules is as follows:

    ENTROPY TERM                                    1.20941
    PARSIMONY TERM                                  3.00000
    RECURSIVE PARSIMONY TERM                        0.0
    VALUE OF H FOR THIS RECODING ..........         4.20941
    NUMBER OF RULES OF PRODUCTION                   3
    NUMBER OF RECURSIVE RULES                       0
    NUMBER OF IDENTITY RULES                        0
    NUMBER OF TIMES RULES ARE APPLIED               3

The new string obtained is

    NEW STRING
    P.112+.1+12.112+.13+3.P33.13+.1+3.

The pij-graph for this recoding is as follows:

    GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 2
    [graph omitted: the three p(i) used in the recoding appear as two
    Type II points near the top of the scale and one Type III point
    near 0.5]
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         0
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            2
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            1
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              0
    NUMBER OF P(I)                                  3

For an M of 3, we obtain the following:

    LEVEL= 1   M= 3
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         0
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            4
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            7
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            6
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              15
    NUMBER OF DIFFERENT MASKS USED                  4
    NUMBER OF POSSIBLE M-SEQUENCES                  8
    NUMBER OF POSSIBLE P(I)                         32
    NUMBER OF P(I)                                  32
    NUMBER OF DIFFERENT CONTEXTS                    16
    SEPARATION VALUE BETWEEN TYPES 3 AND 4          0.50000
    EPSILON FOR DEFINING TYPES 2 AND 3              0.33333

    TENTATIVE RULES OF PRODUCTION FOR M OF 3
    1 ----> PP+

The value of H is as follows:

    ENTROPY TERM                                    0.64752
    PARSIMONY TERM                                  2.00000
    RECURSIVE PARSIMONY TERM                        0.0
    VALUE OF H FOR THIS RECODING ..........         2.64752
    NUMBER OF RULES OF PRODUCTION                   2
    NUMBER OF RECURSIVE RULES                       0
    NUMBER OF IDENTITY RULES                        1
    NUMBER OF TIMES RULES ARE APPLIED               2

The string is then recoded as follows:

    NEW STRING
    P.PP1++.11+.PP1++.P1+P+.1P+.P1+.1P+.

Note that the rule 1 ----> PP+ is reapplied several times in the recoding.

After considering tentative recodings based on an M of 2, 3, and 4, the Algorithm concludes that

    BEST M IS 3 WITH H OF 2.64752

and then uses the rule to recode the sample. The resulting string is then the input to the Algorithm at level 2.

    -------------- RULES OF PRODUCTION FOR LEVEL 1 --------------
    1 ----> PP+

At level 2, the Search Phase again searches for regularities in the new sample, develops rules of production, and computes the value of H for the resulting transformations. For an M of 2 at level 2, we obtain the following:

    LEVEL= 2   M= 2
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         1
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            2
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            4
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            1
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              10
    NUMBER OF DIFFERENT MASKS USED                  2
    NUMBER OF POSSIBLE M-SEQUENCES                  16
    NUMBER OF POSSIBLE P(I)                         32
    NUMBER OF P(I)                                  18
    NUMBER OF DIFFERENT CONTEXTS                    7
    SEPARATION VALUE BETWEEN TYPES 3 AND 4          0.50000
    EPSILON FOR DEFINING TYPES 2 AND 3              0.34000

The value of H is as follows:

    ENTROPY TERM                                    2.15596
    PARSIMONY TERM                                  4.00000
    RECURSIVE PARSIMONY TERM                        0.0
    VALUE OF H FOR THIS RECODING ..........         6.15596
    NUMBER OF RULES OF PRODUCTION                   4
    NUMBER OF RECURSIVE RULES                       0
    NUMBER OF IDENTITY RULES                        0
    NUMBER OF TIMES RULES ARE APPLIED               7

The pij-graph for an M of 2 at level 2 is as follows:

    GRAPH OF P(I) USED IN RECODING FOR LEVEL 2 AND M OF 2
    [graph omitted: the four p(i) used in the recoding appear as one
    Type I point at 1.00, two Type II points near the top of the
    scale, and one Type III point near 0.5]
    NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)         1
    NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            2
    NUMBER OF TYPE 3 SEQUENCES (MESSAGE)            1
    NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
    NUMBER OF TYPE 5 SEQUENCES (NOISE)              0
    NUMBER OF P(I)                                  4

For an M of 3 at level 2, the Algorithm considers developing the rule 2 ----> 11+, but first checks to see whether a recursion is a possibility. In fact, one recursive rule can replace both 2 ----> 11+ and the rule 1 ----> PP+ developed at level 1. We thus obtain the following recursion instead:

    TENTATIVE RULES OF PRODUCTION FOR M OF 3
    RECURSION NO. 1
    P ----> PP+

Upon applying the recursion once (the non-terminal "1" being superseded by the recursive symbol "P"), we obtain the following new string:

    P.PP+.P.PP+.PP+.P.P.P.

Then, upon applying the recursion in every possible way to the new string, we obtain the following as the recoding:

    P.P.P.P.P.P.P.P.

The conditions for terminating the Algorithm now obtain, and no further levels are considered.

EXAMPLE D: In this section, we analyze the behavior of the Grammar Discovery Algorithm for a language, rather than for merely a particular finite sample of sentences from the language. We do this to demonstrate that the kinds of results one obtains from the Algorithm for small samples of sentences, from languages with relatively simple grammars, over relatively small alphabets, can be expected to obtain also for larger samples from languages with more complex grammars over alphabets with large numbers of symbols. This particular example also serves to illustrate the full range of grammar induction devices discussed, including those which are not implemented in the computer program for the Algorithm, or which require large combinatorial searches, namely: context-sensitive regularities, maximal regularities, recursions, and disjunctions, as well as context-sensitive and context-free rules of production. With this in mind, consider the language generated by the following rules:

    V ---> abV | 0
    W ---> cdeW | 0
    X ---> fghiX | 0
    P ---> VWX
    Y ---> lmY | 0
    Z ---> nopZ | 0
    Q ---> Y | Z
    S ---> P j Q
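For concreteness, a sample of sentences of this language can be produced by the following sketch, ours, which assumes that the "0" alternative in the rules denotes the empty string and uses arbitrary 50/50 expansion probabilities:

    import random

    def V(): return "ab"   + V() if random.random() < 0.5 else ""
    def W(): return "cde"  + W() if random.random() < 0.5 else ""
    def X(): return "fghi" + X() if random.random() < 0.5 else ""
    def Y(): return "lm"   + Y() if random.random() < 0.5 else ""
    def Z(): return "nop"  + Z() if random.random() < 0.5 else ""
    def Q(): return Y() if random.random() < 0.5 else Z()
    def S(): return V() + W() + X() + "j" + Q()

    random.seed(1)
    print(".".join(S() for _ in range(5)) + ".")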

Note that the sentences of this language have the general form

    (ab)^r (cde)^s (fghi)^t j (lm)^u (nop)^v

where r, s, t ≥ 0, and either u or v is 0 while the other is > 0. In the Search Phase, one finds several different kinds of regularities. First, there are regularities within the 5 phrases "ab", "cde", "fghi", "lm", and "nop". Second, there are regularities occurring between each of these 5 phrases and themselves. Third, there are regularities occurring between 2 different ones of these 5 phrases, as well as with the symbol "j".

Consider first the first kind of regularity. Among these will be the regularity which catalogs the fact that an "a" always predicts a "b" to the right. This regularity is a Type I left-sensitive regularity. There is also the fact that a "b" is almost always, but not always, preceded by an "a". This is a Type II right-sensitive regularity. Similarly, a "b" is usually followed by an "a", but occasionally is followed by a "c" or even an "f". No regularity involving "a" and "b" of length greater than 2 will be as reliable as those of length 2 because, as will turn out, the string "ab" is a maximal regularity. Similarly, within the phrases "cde" and "fghi" there will be regularities. Some of these will be context-sensitive regularities (something that was impossible in a phrase of length 2), such as the Type I regularity

    <"c%e", " d ", 1.0>

Again, all regularities will be within the phrases.

Similarly, no regularities involving "cde" will appear of length greater than 3 which will be as reliable as those of length 3, and no regularity involving "fghi" of length greater than 4 will be superior to those of length 4. Finally, similar observations will apply to the two phrases "lm" and "nop".

The second kind of regularity will involve strings longer than the lengths of the 5 basic phrases. These regularities will catalog the propensity of "ab" to follow an "ab". As mentioned above, these regularities will be less reliable than the corresponding shorter regularities.

The third kind of regularity for this language will involve the interface between occurrences of the 5 basic phrases and the symbol "j". These will be of Type II or worse.

In order to recode the sample at level 1, each of the 5 basic phrases will have to be recoded in some way internal to itself. The reason why there will be no inter-phrase recodings is that a Type I regularity internal to the phrase exists in each case, while any inter-phrase regularity will be of Type II or worse. Exactly which recoding will be used will depend on several externally-specified parameters. For example, if context-sensitive rules are searched for, the Algorithm will find and use a regularity such as

    <"c%e", " d ", 1.0>

whereas if only right-sensitive regularities are searched for, something such as

    <"%%e", "cd ", 1.0>

might be used. The Recoding Procedure, and in particular the point in the sample where recoding begins, all affect the exact recoding chosen. However, even within the variability permitted by these external choices, it can be predicted that the phrase "cde" will be recoded as a unit, and that, for example, no phrase involving "cdecde" (of length 6) will be used. Thus, the initial sample of sentences will be punctuated into sub-phrases of lengths 2, 3, and 4 corresponding to the 5 basic phrases. However these phrases are recoded, the opportunity to recode them at level 2 will occur, and it is at level 2 that the recursive structure of the sample will be discovered. Again, the exact way in which the recursions will be discovered will depend on some externally-specified choices. If, for example, the "uvwxy" method of inducing recursions is being used, the Algorithm will note that there are recurrences of the 5 basic phrases "ab", "cde", etc., or recurrences of whatever symbols now represent these phrases. If the rule-oriented method of inducing recursions is being used, then at level 2 there may be recurrences of non-terminals such as V, W, X, Y, and Z. These non-terminals may be recoded, in a manner similar to that of Examples A and B, into other non-terminals, and the recursion discovered at level 3. In any event, the recursive structure of the sample will be discovered somewhere between levels 1 and 3 of the recoding, regardless of how the recoding actually proceeds. Also, the symbol "j" will not be recoded at all, because it is the exceptional suffix to "fghi" and the exceptional

prefix to whatever follows. The "j" will be followed approximately equally often by either "lm" or "nop". After these 2 phrases are recoded internally, the non-terminals into which they have been recoded will appear as the two possible substitutions following the "j". Thus, a situation where the suffix to the "j" presents maximal entropy will exist. If the phrase "lm" is recoded as Y, and the phrase "nop" is recoded as Z, then the disjunction Q ---> Y | Z can be induced. With the 3 recursions discovered at the beginning of each sentence, and the 2 recursions and 1 disjunction discovered at the end of each sentence, each sentence will have the form VWX j Q, or perhaps P j Q, depending on the intermediate recoding that occurred. When each sentence of the sample is reduced to a common form, the conditions for terminating the Algorithm have been satisfied. Thus, in this case, we have induced the original grammar, or one virtually the same as the original grammar. Note that the different recursions do not get "tangled," because the phrases in which they are embedded are all maximal regularities.

EXAMPLE E: Here again we shall consider a language rather than a finite sample of sentences, in order to illustrate a situation which cannot be represented by any finite sample. Consider an ergodic source generating strings over a terminal alphabet of size C. Perhaps the grammar which describes the language is

    S ---> (a_1 | a_2 | ... | a_C) S

Any finite subset of the sentences generated by such a grammar (even a subset that is complete in the sense that it includes all sentences of the language up to a certain specified length) has certain regularities which are artifacts of the finiteness of the sample. The nature of ergodicity is precisely the opposite, namely, that there are essentially no regularities in the sample. Therefore, to discuss the ergodic case, one cannot consider any specific finite sample.

Recall that the Algorithm is given an externally specified range of M's. This range is typically from a lower value of 2 up to a small whole number such as 5. The Algorithm is also given an externally specified direction for considering the M's for recoding purposes (almost always descending, and always descending when context-free rules are being generated). Regardless of these two choices, the entropy associated with any ergodic sample is always the maximal value. Thus, a disjunction expressed in terms of phrases of length M (the first M considered) is induced. If an M of 1 is considered, the disjunction is in terms of single symbols from the alphabet. (Note that the only usable mask for an M of 1 is the "predicted" mask, which catalogs probability of occurrence.)

Thus, the Grammar Discovery Algorithm produces a reasonable result for an ergodic source generating the sentences of the given sample.

The above discussion suggests, in effect, an alternative definition of ergodicity. In this paper, the notion of the universe of masks is well defined, as is the concept of the universe of possible regularities for a given value of M. A source is ergodic, therefore, if there are no regularities in the sentences produced by the source except for Type V regularities having a conditional probability pij equal to ε = 1/C^g, where C is the size of the alphabet, and g is the number of predicted positions in the regularity. In a more practical vein, if the limiting values of the pij's are their respective 1/C^g, then we can call the source ergodic.
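This practical criterion can be sketched directly; the function, tolerance, and random test string below are our illustrative assumptions. Every binary-mask regularity of length M is tallied, and the source is pronounced ergodic only if each observed conditional probability sits near its 1/C^g:

    import itertools, random

    def ergodic(Y, alphabet, M=2, tol=0.05):
        """Call the source ergodic if every p(i) of length M is within
        tol of 1/C^g, g = number of predicted positions in its mask."""
        C = len(alphabet)
        for mask in itertools.product(" %", repeat=M):
            g = mask.count("%")
            if g == 0 or g == M:
                continue        # a regularity needs both kinds of position
            for syms in itertools.product(alphabet, repeat=M):
                context = "".join(s if m == " " else "%"
                                  for m, s in zip(mask, syms))
                predicted = "".join(s if m == "%" else " "
                                    for m, s in zip(mask, syms))
                q = qp = 0
                for i in range(len(Y) - M + 1):
                    w = Y[i:i + M]
                    if all(c == "%" or c == x for c, x in zip(context, w)):
                        q += 1
                        if all(c != "%" or p == x
                               for c, p, x in zip(context, predicted, w)):
                            qp += 1
                if q and abs(qp / q - 1 / C ** g) > tol:
                    return False    # a genuine regularity: not ergodic
        return True

    random.seed(0)
    Y = "".join(random.choice("ab") for _ in range(20000))
    print(ergodic(Y, "ab"))         # True: all p(i) hover near 1/2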

EXAMPLE F: In this example, we will consider a very simple example of sentences from English. The sample is as follows:

.THE¬BIG¬CAT¬RAN¬.THE¬BAD¬CAT¬RAN¬.THE¬BIG¬DOG¬RAN¬.THE¬BAD¬DOG¬RAN¬.
THE¬BIG¬CAT¬SAT¬.THE¬BAD¬CAT¬SAT¬.THE¬BIG¬DOG¬SAT¬.THE¬BAD¬DOG¬SAT¬.

(The symbol "¬" is used in place of the blank; see Appendix A.) The input to the Grammar Discovery Algorithm is as follows:

MAXIMUM NUMBER OF LEVELS TO BE TRIED                    3
SMALLEST M TO BE TRIED                                  2
LARGEST M TO BE TRIED                                   4
DIRECTION OF CONSIDERING M IN RECODE              DESCEND
LAMBDA -- COEFFICIENT OF PARSIMONY                1.00000
LAMR -- COEFFICIENT OF RECURSIVE PARSIMONY        0.10000
SOURCE OF INITIAL SYMBOL STRING                      READ
MODE -- INITIAL PUNCTUATION MARK                      2ND
TYPE OF GRAMMAR DESIRED (MASKS TO BE TRIED)         LS-RS
RIT -- CONTROLS ANTECEDENT SIDE OF RULES               CF
WORST TYPE OF P(I) ALLOWED TO ENTER CODE                4
USE OF BEST M STRICTLY                                YES
FR -- RADIX OF MASK                                     2
PRINT CONTROL -- PRINT UNSORTED P(I)                   NO
PRINT CONTROL -- PRINT SORTED P(I)                     NO
PRINT CONTROL -- PRINT GRAPH OF P(I)-S                ALL
PRINT CONTROL -- PRINT SEQUENCE NUMBERS               YES
SUBROUTINE USED FOR RECODING                       RECODE
OVERRIDE VALUE FOR MBEST (MCH)                          0
EARLY ELIMINATION OF UNALLOWABLE P(I)                  NO
INCLUDE ALL-PREDICTED MASK                             NO
START RECODE AT TIME OF MBEST                         YES
HSEL -- METHOD FOR COMPUTING H                          4
CUT-OUT ON FINDING GOOD M                              NO
METHOD OF FINDING RECURSIONS                         RULE
INCLUDE IDENTITY RULES                                 NO
TEST FOR RECURSION                                    YES
SIZE OF TERMINAL ALPHABET (INITIAL STRING)             14
TERMINAL ALPHABET: ABCDEGHINORST¬
INITIAL PUNCTUATION MARK (MODE=2)  .
SIZE OF BASIC NON-TERMINAL ALPHABET                    10
SYMBOLS OF THE BASIC NON-TERMINAL ALPHABET: 0123456789

The Search Phase produces the following for an M of 2:

LEVEL= 1    M= 2

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)        19
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)           11
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
NUMBER OF TYPE 5 SEQUENCES (NOISE)             16
NUMBER OF DIFFERENT MASKS USED                  2
NUMBER OF POSSIBLE M-SEQUENCES                196
NUMBER OF POSSIBLE P(I)                       392
NUMBER OF P(I)                                 46
NUMBER OF DIFFERENT CONTEXTS                   28
SEPARATION VALUE BETWEEN TYPES 3 AND 4    0.50000
EPSILON FOR DEFINING TYPES 2 AND 3        0.34000

TENTATIVE RULES OF PRODUCTION FOR M OF 2

0 ---> TH
1 ---> E¬
2 ---> BI
3 ---> G¬
4 ---> CA
5 ---> T¬
6 ---> RA
7 ---> N¬
8 ---> BA
9 ---> D¬
<00> ---> DO
<01> ---> SA

The value of H for M of 2 is as follows:

ENTROPY TERM                              1.50000
PARSIMONY TERM                           12.00000
RECURSIVE PARSIMONY TERM                  0.0
VALUE OF H FOR THIS RECODING ........... 13.50000
NUMBER OF RULES OF PRODUCTION                  12
NUMBER OF RECURSIVE RULES                       0
NUMBER OF IDENTITY RULES                        0
NUMBER OF TIMES RULES ARE APPLIED              12

The sample is then recoded as follows:

.01234567.01894567.0123<00>367.0189<00>367.012345<01>5.018945<01>5.0123<00>3<01>5.0189<00>3<01>5.
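The value of H just printed combines the three terms shown. As a check on the arithmetic, this minimal sketch (not the thesis program) reproduces the 13.50000 from the entropy term, the rule count, and the recursive-rule count, using the HSEL = 4 form of Appendix A with this run's coefficients LAMBDA = 1.0 and LAMR = 0.1:

      PROGRAM HMEAS
C     SKETCH ONLY: H = ENTROPY TERM + LAMBDA*NQ + LAMR*NR,
C     WITH THE VALUES PRINTED ABOVE FOR THE M = 2 RECODING.
      REAL ENT, LAMBDA, LAMR, H
      INTEGER NQ, NR
      ENT    = 1.5
      LAMBDA = 1.0
      LAMR   = 0.1
      NQ     = 12
      NR     = 0
      H = ENT + LAMBDA*REAL(NQ) + LAMR*REAL(NR)
      WRITE (*,*) 'VALUE OF H FOR THIS RECODING', H
      END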

The pij-graph for this recoding based on an M of 2 is as follows:

GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 2

[Bar graph of the p(i) values used in the recoding, on a scale from 0.0 to 1.00: nine bars at 1.00 (type 1) and three bars at 0.50 (type 3).]

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)    9
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)       0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)       3
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)       0
NUMBER OF TYPE 5 SEQUENCES (NOISE)         0
NUMBER OF P(I)                            12

For an M of 3, the following tentative rules of production are developed:

LEVEL= 1    M= 3

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)        46
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)           34
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)            0
NUMBER OF TYPE 5 SEQUENCES (NOISE)             32
NUMBER OF DIFFERENT MASKS USED                  4
NUMBER OF POSSIBLE M-SEQUENCES               2744
NUMBER OF POSSIBLE P(I)                     10976
NUMBER OF P(I)                                112
NUMBER OF DIFFERENT CONTEXTS                   98
SEPARATION VALUE BETWEEN TYPES 3 AND 4    0.50000
EPSILON FOR DEFINING TYPES 2 AND 3        0.33333

TENTATIVE RULES OF PRODUCTION FOR M OF 3

0 ---> THE
1 ---> ¬BI
2 ---> G¬C
3 ---> AT¬
4 ---> RAN
5 ---> ¬BA
6 ---> D¬C
7 ---> G¬D
8 ---> OG¬
9 ---> D¬D

The value of H for the M of 3 is

ENTROPY TERM                              2.00000
PARSIMONY TERM                           10.00000
RECURSIVE PARSIMONY TERM                  0.0
VALUE OF H FOR THIS RECODING ........... 12.00000
NUMBER OF RULES OF PRODUCTION                  10
NUMBER OF RECURSIVE RULES                       0
NUMBER OF IDENTITY RULES                        0
NUMBER OF TIMES RULES ARE APPLIED              10

The sample is then recoded as follows:

.01234¬.05634¬.01784¬.05984¬.0123S3.0563S3.0178S3.0598S3.

For an M of 3, the pij-graph is as follows:

GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 3

[Bar graph of the p(i) values used in the recoding, on a scale from 0.0 to 1.00: six bars at 1.00 (type 1) and four bars at 0.50 (type 3).]

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)    6
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)       0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)       4
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)       0
NUMBER OF TYPE 5 SEQUENCES (NOISE)         0
NUMBER OF P(I)                            10

Note that there are 6 type 1 sequences and 4 type 3 sequences in this recoding.

For an M of 4, the Search Phase produces the following:

LEVEL= 1    M= 4

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)        73
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)            0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)           81
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)           18
NUMBER OF TYPE 5 SEQUENCES (NOISE)             20
NUMBER OF DIFFERENT MASKS USED                  6
NUMBER OF POSSIBLE M-SEQUENCES              38416
NUMBER OF POSSIBLE P(I)                    230496
NUMBER OF P(I)                                192
NUMBER OF DIFFERENT CONTEXTS                  219
SEPARATION VALUE BETWEEN TYPES 3 AND 4    0.50000
EPSILON FOR DEFINING TYPES 2 AND 3        0.25000

TENTATIVE RULES OF PRODUCTION FOR M OF 4

0 ---> THE¬
1 ---> BIG¬
2 ---> CAT¬
3 ---> RAN¬
4 ---> BAD¬
5 ---> DOG¬
6 ---> SAT¬

The value of H is as follows:

ENTROPY TERM                              0.0
PARSIMONY TERM                            7.00000
RECURSIVE PARSIMONY TERM                  0.0
VALUE OF H FOR THIS RECODING ...........  7.00000
NUMBER OF RULES OF PRODUCTION                   7
NUMBER OF RECURSIVE RULES                       0
NUMBER OF IDENTITY RULES                        0
NUMBER OF TIMES RULES ARE APPLIED               7

The sample is then recoded as follows:

NEW STRING
.0123.0423.0153.0453.0126.0426.0156.0456.

For an M of 4, the pij-graph shows that there are 7 type 1 regularities in the recoding:

GRAPH OF P(I) USED IN RECODING FOR LEVEL 1 AND M OF 4

[Bar graph of the p(i) values used in the recoding, on a scale from 0.0 to 1.00: seven bars at 1.00 (type 1).]

NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)    7
NUMBER OF TYPE 2 SEQUENCES (MESSAGE)       0
NUMBER OF TYPE 3 SEQUENCES (MESSAGE)       0
NUMBER OF TYPE 4 SEQUENCES (MESSAGE)       0
NUMBER OF TYPE 5 SEQUENCES (NOISE)         0
NUMBER OF P(I)                             7

The Algorithm then selects the best M, as follows:

BEST M IS 4 WITH H OF 7.00000

Using the best M (of 4) as the basis for the recoding, the original sample is then recoded, and this recoded version is then used as input to level 2, as follows:

LEVEL 2
SYMBOL STRING OF LENGTH 41 AND USING ALPHABET OF SIZE 21

.0123.0423.0153.0453.0126.0426.0156.0456.

The resolution of this sample at level 2 requires the induction of a disjunction at level 2, a feature not now implemented in the computer program. However, if an M of 1 is used at level 2, this sample will be recognized as being derived from a grammar which places a "0" as the first symbol of every string, and then disjunctively places a "1" or a "4" at position 2; a "2" or a "5" at position 3; and a "3" or a "6" at position 4.
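The selection just shown is a minimization of the combined measure H over the trial values of M. A minimal sketch of that step (again not the thesis program), using the three H values computed above:

      PROGRAM BESTM
C     SKETCH ONLY: KEEP THE M WHOSE TRIAL RECODING GAVE THE
C     SMALLEST H.  THE VALUES ARE THOSE PRINTED ABOVE:
C     H(2)=13.5, H(3)=12.0, H(4)=7.0.
      REAL H(2:4)
      INTEGER M, MBEST
      DATA H /13.5, 12.0, 7.0/
      MBEST = 2
      DO 10 M = 3, 4
         IF (H(M) .LT. H(MBEST)) MBEST = M
   10 CONTINUE
      WRITE (*,*) 'BEST M IS', MBEST, ' WITH H OF', H(MBEST)
      END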

VI. FUTURE DIRECTIONS

This paper raises numerous additional questions and suggests further research about grammar induction and grammar discovery. Among these questions are the following:

First, it would be interesting to explore the two-dimensional pattern recognition problem using the methods of the Grammar Discovery Algorithm. The concepts of context-sub-sequences, predicted-sub-sequences, regularities, and hierarchical rules of production would seem to have application to pattern recognition problems as well as grammar induction problems. The kinds of descriptions that the Grammar Discovery Algorithm builds up in analyzing samples would seem to parallel the kind of descriptions one would want a good pattern recognition algorithm to develop about its samples.

Second, a precondition to seriously analyzing the two-dimensional pattern recognition problem would be to rewrite the computer program implementing the Grammar Discovery Algorithm. The computer program here was an integral part of the research for this paper. As such, the program was constantly changed as the Algorithm was developed. Thus, the computer program is structured in ways peculiar to its development (e.g., the time-consuming hash coding of sequences) which make it quite inefficient for analyzing large samples.

Third, there are several aspects of the induction problem that are amenable to more formal analysis than has been presented here. For example, the combined measure of entropy, parsimony, and recursive parsimony used in the computer program is essentially an ad hoc measure based on intuitive considerations. An axiomatic development of a combined measure would be interesting. Also, formal treatment of some of the questions

of convergence and rates of convergence of the Grammar Discovery Algorithm should be possible.

Fourth, it would be interesting to consider extensions of the grammar discovery process that would lend themselves to samples consisting of symbols that are ordered or partially ordered. This would include samples of integers, or of playing cards, or of musical notes, which present interesting questions of induction or of identifying underlying structure.

APPENDIX A. DESCRIPTION OF INPUT TO THE COMPUTER PROGRAM IMPLEMENTING THE GRAMMAR DISCOVERY ALGORITHM

The input to the computer program implementing the Grammar Discovery Algorithm consists of three control cards, followed by additional cards containing the sample of sentences.

The first control card contains a variety of parameters which control the operation of the Grammar Discovery Algorithm. All are integer numbers, unless otherwise specified. For each parameter, the card columns, variable name, description, and typical value are given.

Card columns 1-5: L3. Typical value 99. Upper limit on the number of levels to be tried by the algorithm. This parameter has no theoretical significance, and is used only to conserve computer time in testing. By choosing a large value for L3, its function is effectively negated. (See termination, II.G.)

Card columns 6-10: M1. Typical value 2. Lower limit on the length of regularities searched for in the Search Phase. This parameter is almost always set to 2. (See sections II.H and III.B.)

Card columns 11-15: M2. Typical value 8. Upper limit on the length of regularities searched for in the Search Phase. This parameter is usually a small whole number. (See sections II.H and III.B.)

Card columns 16-20: VDIR. Typical value -1. Direction of considering M's in the Recoding Procedure. (See section II.C.4.) A value of +1 specifies ascending values, and a value of -1 specifies descending values. The choice of -1 is usual, especially when context-free rules are being generated (IV.D). (See RIT below.)

Card columns 21-25: LAMBDA. Typical value .5. Real number: coefficient of parsimony. (See section II.D.2.)

Card columns 26-30: LAMR. Typical value .1. Real number: coefficient of recursive parsimony. Typically LAMR << LAMBDA. (See section II.D.2.)

Card columns 31-32: CONT. Typical value 1. Specifies the source of the sample. A value of 1 specifies that the sample is to be read in on cards, after the 3 control cards. A value of 2 or 3 specifies that the sample is to be generated by experimental subroutine GEN2 or GEN3, respectively. CONT is usually 1. The subroutine GEN1 reads the cards, if CONT = 1.

Card columns 33-34: MODE. Typical value 1 or 2. Specifies the mode of the initial sample. In the first mode (MODE=1), there is no initial punctuation in the sample, while in the second mode (MODE=2), there is. (See section II.A.)

Card columns 35-36: GRAM. Typical value 1. Specifies the grammatical type of the masks that are used in the Search Phase to search for regularities. A value of 0 specifies the unrestricted-rewrite type; a value of 1, context-sensitive; a value of 2, left-sensitive and right-sensitive. If context-free rules are desired, GRAM may be 0 or 1, provided RIT is properly set. If regular rules are desired, GRAM may be 0 or 1, provided RIT is properly set, and provided M2 and M1 are both 2. A value of 0 is most general. (See section II.B.9.)

Card columns 37-38: ALLOW. Typical value 1, 2, or 3. The worst type of regularity allowed to be used in Recoding. ALLOW may be 1 (only 100% reliable regularities are to be used); or 2 (either type I or type II may be used); or 3 (types I, II, or III may be used). It would be unusual for ALLOW to be 4 or 5. (See section II.C.4.)

Card columns 39-40: STRICT. Typical value 0. A value of 1 specifies the use of only one value of M (namely, MBEST) in Recoding, and is suitable only when the sample is from a uniform code source. A value of 0 allows any value of M from M1 to MBEST (which is limited above by M2) in Recoding. STRICT is almost always 0. (See section II.C.4.)

Card columns 41-42: FR. Typical value 2 or 3. Radix of masks. A value of 2 specifies binary masks, and a value of 3 specifies ternary masks (i.e., masks with "don't care" positions). (See section IV.A.)

Card columns 43-44: PR1. Typical value 0. Controls printing of the unsorted table of regularities. A value of 1 specifies printing, and a value of 0 specifies no printing.

Card columns 45-46: PR2. Controls printing of the sorted table of regularities. A value of 1 specifies printing, and a value of 0 specifies no printing.

Card columns 47-48: PRG. Typical value 1 or 2. Controls printing of the graph of regularities used in Recodings. A value of 0 specifies the printing of no graphs; a value of 1 specifies printing of some graphs (namely, when a recoding is actually used); a value of 2 specifies printing of all graphs (i.e., for all trial recodings and the one actual recoding).

Card columns 49-50: PRS. Typical value 0. Controls printing of the SEQNO table (table of indices of regularities). A value of 0 specifies no printing, and a value of 1 specifies printing.

Card columns 51-52: EXR. Typical value 1. A value of 1 specifies that the alphabet-enlarging Recoding as contained in subroutine RECODE is to be used. A value of 2 specifies that the Huffman type of Recoding is to be used. (See section IV.B.)

Card columns 53-54: MCH. Typical value 0. If MCH is non-zero, this value is taken as an override value of MBEST for purposes of Recoding. This parameter is used only to force a value for the "best M" for Recoding, for testing purposes.

Card columns 55-56: EARLY. Typical value 0. A value of 1 specifies that early culling of the table of regularities is to be done (to conserve memory space in the computer), and a value of 0 specifies that it is not. This parameter has no theoretical significance. (See section II.B.9.)

Card columns 57-58: LAWL. This variable is no longer used by the program.

Card columns 59-60: VSTART. Typical value 0 or 1. A value of 1 specifies that the scan in the Recoding Phase is to start at position MBEST (rather than M1) of the sample (or, if MODE=2, of each individual sentence). (See sections II.I and III.C for discussion of the effect of VSTART.)

Card columns 61-62: HSEL. Typical value 1. Selects the method of computing H (the combined entropy-parsimony-recursive-parsimony measure):

HSEL    EXPRESSION

  1     -(1/NTM) SUM Pij log2 Pij  +  LAMBDA*Nq  +  LAMR*Nr
  2     -(Nq/NTM) SUM Pij log2 Pij  +  LAMBDA*Nq  +  LAMR*Nr
  3     -(MBE/M1) SUM Pij log2 Pij  +  LAMBDA*Nq  +  LAMR*Nr
  4     -SUM Pij log2 Pij  +  LAMBDA*Nq  +  LAMR*Nr

where Nq is the number of rules of production, Nr is the number of recursive rules, NTM is the number of times the rules are actually applied in the transformation, and MBE is the current M being used (as a maximum M for any rules in this transformation). The usual value of HSEL is 1. The other expressions were used for testing purposes. (See section II.D.2.)

Card columns 63-64: WDF. Typical value 1. Controls the effect of discovering a Resolving Transformation. A value of 1 specifies that if a value of M results in a resolving transformation, then this M is immediately used for the actual recoding at this level. A value of 0 allows the continued examination of M's (and thereby allows another equally good, or possibly even better, M to be used). Note that the variable VDIR controls the order of considering the M's (and that this is usually descending order). WDF is usually 1. (See section II.C.4.)

Card columns 65-66: WCV. Typical value 1. Specifies the approach for inducing recursions: a value of 1 specifies the Rule-Oriented Method (see section II.E.4), and a value of 2 specifies the Sentence-Oriented ("uvwxy") Approach (see section II.E.5). WCV is usually 1.

Card columns 67-68: WXL. This variable is no longer used by the program.

Card columns 69-70: KRECUR. Specifies whether recursions should be induced. A value of 1 specifies that recursions should be induced (if possible), while a value of 0 specifies that no attempt to induce recursions should be made. (See section II.E.1.)

The second control card contains the terminal alphabet and the initial punctuation mark (if any). The variable C1, specifying the size of the terminal alphabet, appears in card columns 1-5. The C1 symbols of the terminal alphabet appear starting in column 6. If MODE = 2, the initial punctuation mark (customarily, the period) appears after the C1 symbols. If the sample contains blanks, it is customary to use the symbol "¬" in place of blanks to improve readability of the analytical tables. This convention in no way identifies the blank as a distinguished symbol to the Algorithm, however.

The third control card contains the non-terminal alphabet. The variable N26 specifies the size of the non-terminal alphabet and appears in card columns 1-5. The N26 symbols starting in column 6 are the non-terminal alphabet symbols. If more than N26 non-terminal symbols are required by the Algorithm, the Algorithm uses bracketed pairs of these N26 symbols in the fashion of Backus normal form. It is therefore advisable not to use brackets in either the terminal or non-terminal alphabets. It is customary to use numbers for non-terminals if letters have been used in the terminal alphabet, and vice versa; or, alternatively, to use upper-case letters for non-terminals and lower-case letters for terminals (if a suitable output printer is available).

If the sample is to be read in from cards (and this is under the control of the variable CONT), then the fourth card is a card containing the value of N, the length of the sample, in card columns 1-5. The fifth and succeeding cards contain the sample, 80 symbols to a card. The size of the sample includes the initial punctuation in the sample, if any.
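For concreteness, the following minimal sketch (not the program of Appendix B, although it uses the same card formats; the array bounds are illustrative) reads the three control cards, the length card, and the sample cards in the layout just described:

      PROGRAM RDCARD
C     SKETCH ONLY: READ THE CONTROL CARDS AND THE SAMPLE.
C     OPT HOLDS THE 21 SMALL INTEGER OPTIONS (CONT ... KRECUR).
      INTEGER L3, M1, M2, VDIR, OPT(21)
      REAL LAMBDA, LAMR
      INTEGER C1, N26, N, I
      CHARACTER*1 CHAR(75), LETTER(75), Y(500)
C     FIRST CONTROL CARD: THE PARAMETERS
      READ (5,4010) L3, M1, M2, VDIR, LAMBDA, LAMR, OPT
 4010 FORMAT (4I5,2F5.2,25I2)
C     SECOND CONTROL CARD: TERMINAL ALPHABET (AND PUNCTUATION MARK)
      READ (5,4070) C1, (CHAR(I), I=1,75)
C     THIRD CONTROL CARD: NON-TERMINAL ALPHABET
      READ (5,4070) N26, (LETTER(I), I=1,75)
 4070 FORMAT (I5,75A1)
C     FOURTH CARD: LENGTH OF SAMPLE; THEN THE SAMPLE ITSELF,
C     80 SYMBOLS TO A CARD
      READ (5,4071) N
 4071 FORMAT (I5)
      READ (5,4072) (Y(I), I=1,N)
 4072 FORMAT (80A1)
      END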

The following is an example of input to the computer program implementing the Grammar Discovery Algorithm:

    3    2    5   -1  0.5  0.1 1 1 2 1 4 1 2 0 1 2 1 1 0 0 1 1 4 0 1 0 1
    2 01
   26 ABCDEFGHIJKLMNOPQRSTUVWXYZ
  272
[the 272 binary symbols of the sample, on four cards]

This input would be interpreted as follows:

MAXIMUM NUMBER OF LEVELS TO BE TRIED                    3
SMALLEST M TO BE TRIED                                  2
LARGEST M TO BE TRIED                                   5
DIRECTION OF CONSIDERING M IN RECODE              DESCEND
LAMBDA -- COEFFICIENT OF PARSIMONY                0.50000
LAMR -- COEFFICIENT OF RECURSIVE PARSIMONY        0.10000
SOURCE OF INITIAL SYMBOL STRING                      READ
MODE -- INITIAL PUNCTUATION MARK                      1ST
TYPE OF GRAMMAR DESIRED (MASKS TO BE TRIED)         LS-RS
RIT -- CONTROLS ANTECEDENT SIDE OF RULES               CF
WORST TYPE OF P(I) ALLOWED TO ENTER CODE                4
USE OF BEST M STRICTLY                                YES
FR -- RADIX OF MASK                                     2
PRINT CONTROL -- PRINT UNSORTED P(I)                   NO
PRINT CONTROL -- PRINT SORTED P(I)                    YES
PRINT CONTROL -- PRINT GRAPH OF P(I)-S                ALL
PRINT CONTROL -- PRINT SEQUENCE NUMBERS               YES
SUBROUTINE USED FOR RECODING                       RECODE
OVERRIDE VALUE FOR MBEST (MCH)                          0
EARLY ELIMINATION OF UNALLOWABLE P(I)                  NO
INCLUDE ALL-PREDICTED MASK                            YES
START RECODE AT TIME OF MBEST                         YES
HSEL -- METHOD FOR COMPUTING H                          4
CUT-OUT ON FINDING GOOD M                              NO
METHOD OF FINDING RECURSIONS                         RULE
INCLUDE IDENTITY RULES                                 NO
TEST FOR RECURSION                                    YES
SIZE OF TERMINAL ALPHABET (INITIAL STRING)              2
TERMINAL ALPHABET: 01
SIZE OF BASIC NON-TERMINAL ALPHABET                    26
SYMBOLS OF THE BASIC NON-TERMINAL ALPHABET: ABCDEFGHIJKLMNOPQRSTUVWXYZ

APPENDIX B: LISTING OF THE COMPUTER PROGRAM IMPLEMENTING THE GRAMMAR DISCOVERY ALGORITHM

C
C     GRAMMAR DISCOVERY PROGRAM
C
      IMPLICIT INTEGER (A-Z)
C
C     REAL SCALARS
      REAL CRIT
      REAL CX
      REAL EPS
      REAL FC
      REAL HBEST
      REAL HP
      REAL LAMBDA
      REAL LAMR
      REAL SEPAR
      REAL TT
C
C     REAL FUNCTIONS
      REAL ALOG
C
C     VECTORS OF LENGTH NS
      DIMENSION Y(500)
      DIMENSION ALPHA(500)
      DIMENSION YNEW(500)
      DIMENSION SEQNO(500)
      INTEGER DONE(500)
C
C     VECTORS OF LENGTH MTSIZE
      REAL P(500)
      REAL PD(500)
      DIMENSION SQCTXT(500)
      DIMENSION SQLEN(500)
      DIMENSION SQPRED(500)
      DIMENSION TYPE(500)
      DIMENSION INDEX(500)
      DIMENSION INDEP(500)
      DIMENSION TBPHI(500)
      DIMENSION TBNEXT(500)
      DIMENSION TBCNT(500)
C
C     VECTORS OF LENGTH M2MAX
      REAL H(20)
      DIMENSION MASK(20)
      DIMENSION PRE(20)
      DIMENSION POST(20)
      REAL FIRST(20)
      REAL SECOND(20)
      REAL THIRD(20)
      INTEGER BI(20)
      INTEGER BII(20)
C
C     VECTORS OF LENGTH N40
      DIMENSION SCRJ(40)
      DIMENSION SCRJAL(40)
      DIMENSION SCRD(40)
      DIMENSION SCRDAL(40)

      DIMENSION SCRMAL(40)
C
C     VECTORS OF LENGTH RPMAX
      INTEGER LLIST(50)
      INTEGER RPALEN(50)
      INTEGER RPCLEN(50)
      INTEGER RPLEV(50)
      INTEGER RPIX(50)
      INTEGER RTIX(50)
      REAL POX(50)
      INTEGER DESC(50)
C
C     ARRAYS OF SIZE RPMAX * M2MAX
      INTEGER RPA(50,20)
      INTEGER RPC(50,20)
      INTEGER RPAM(50,20)
C
C     VECTORS OF LENGTH YRPMAX
      INTEGER YRP(300)
C
C     VECTORS OF SIZE MNDC
      REAL HDJ(500)
      INTEGER ILCLEN(500)
C
C     OTHER VECTORS
      DIMENSION NUMT(5)
      DIMENSION LETTER(75)
      DIMENSION CHAR(78)
      DIMENSION MARKER(3)
      DATA MARKER / '#', '_', '*' /
      DIMENSION BLANKS(5)
      DATA BLANKS / 5*' ' /
      DIMENSION ARROW(12)
      DATA ARROW / ' ', '-', '-', '-', '-', '>', 6*' ' /
      DATA SSS / 'S' /
      REAL*8 NAME(5)
      DATA NAME /'URS','CSG','LS-RS','CFG','RG'/
      INTEGER SOURCE(3)
      DATA SOURCE /'READ','GEN2','GEN3'/
      REAL*8 WAYS(2)
      DATA WAYS /'RECODE','HUFFMAN'/
      REAL*8 TTQ(2)
      DATA TTQ / 'RULE', 'UVWXY' /
      INTEGER TRI(3)
      DATA TRI /'NONE','SOME','ALL'/
      INTEGER ANSW(2)
      DATA ANSW /'NO','YES'/
      INTEGER ORD(2)
      DATA ORD /'1ST','2ND'/
      REAL*8 DIRECT(3)
      DATA DIRECT /'DESCEND',' ','ASCEND'/
      REAL*8 ANTE(2)
      DATA ANTE /'CONTEXT','CF'/
C
C
C     COMMON STORAGE
      COMMON C1, CONT, L3, M1, M2, PR1, PR2, FR, LAMBDA, RIT,

     1 MCH, MODE, ALLOW, STRICT, EARLY, GRAM, NCHAR,
     2 N26, PRG, EXR, PRS, LAWL, VSTART, VDIR, HSEL, WDF,
     3 WCV, WXL, KRECUR, LAMR,
     4 NS, N, MBE, C, MTSIZE, M2MAX, NQ, TRIAL, CATCH, PUNC, LEVEL,
     5 N40, FC, NST, PERC, CTXT, TN, NTM, RPMAX, NRP, NY, YRPMAX,
     6 NRECUR, YRARRW, YRPUNC, NTP
C
C
C     VECTORS AND SCALARS....
C
C     ALLOW=THE WORST TYPE OF P(I) ALLOWED IN THE RECODING
C     ALPHA(T)=ALPHABETIC VERSION OF Y
C     ANSW=ALPHABETIC VECTOR FOR YES-NO PRINTING
C     ANTE=ALPHABETIC VECTOR FOR PRINTING ANTECEDENT TYPES
C     ARROW=ALPHABETIC ARROW
C     BI,BII=VECTORS FOR TESTING ISOMORPHISM OF RULES
C     BLANKS=ALPHABETIC BLANKS
C     C = # OF DIFFERENT SYMBOLS IN SYMBOL STRING Y AT THIS LEVEL
C         INCLUDES THE SYMBOLS OF MARKER
C         EXCLUDES INITIAL PUNCTUATION MARK (.), IF ANY
C         THIS INCLUDES THE TERMINALS AND THE NON-TERMINALS
C     C1= # OF DIFFERENT TERMINAL SYMBOLS IN INITIAL STRING Y
C         EXCLUDES THE SYMBOLS OF MARKER
C         EXCLUDES INITIAL PUNCTUATION MARK (.), IF ANY
C     CW=# OF TERMINAL AND NON-TERMINAL SYMBOLS IN CURRENT STRING
C         EXCLUDES THE SYMBOLS OF MARKER
C         EXCLUDES INITIAL PUNCTUATION MARK (.), IF ANY
C     CX=REAL VERSION OF C
C     CHAR=VECTOR OF NCHAR ALPHABETIC CHARACTERS
C     CONT=CONTROLS METHOD OF PRODUCING INITIAL STRING
C         1=USE GEN1  2=USE GEN2  3=USE GEN3
C     CRIT=THRESHOLD CRITERION FOR H
C     DESC=TYPES OF P(I)'S BEING GRAPHED BY 'GRAPH'
C     DIRECT=ALPHABETIC VECTOR FOR PRINTING DIRECTIONS
C     DONE(I): 0=POSITION I OF Y IS NOT YET RECODED  1=IS DONE
C     EARLY=CONTROLS TIME OF CULLING OUT SMALL P(I)'S
C         1=EARLY CULLING  0=NO EARLY CULLING
C     EPS=EPSILON USED TO SEPARATE TYPE 2 AND TYPE 3 P(I)'S
C     FC=REAL CONSTANT TO CONVERT TO BASE 2 LOGARITHMS
C     FIRST(I)=FIRST TERM OF COMBINED H MEASURE
C     FR=RADIX OF MASKS  2=BINARY  3=TERNARY
C     GRAM=TYPE OF GRAMMAR DESIRED
C         0=UNRESTRICTED REWRITE
C         1=CONTEXT SENSITIVE
C         2=LEFT-SENSITIVE AND RIGHT-SENSITIVE
C         NB: IF CONTEXT-FREE RULES ARE DESIRED, SET RIT=1
C         IN THIS EVENT, GRAM CONTROLS TYPE OF MASKS TO BE TRIED
C     H(M)=COMBINED ENTROPY AND PARSIMONY MEASURE
C     HBEST=BEST VALUE OF H FOR THIS LEVEL
C     HP=REAL CONSTANT FOR CUMULATING H MEASURE
C     HDJ(I)=THE ENTROPY OF THE I-TH DIFFERENT CONTEXT
C         (SUMMED OVER THE DIFFERENT PREDICTED PARTS)
C     ILCLEN(I)=NUMBER OF DIFFERENT PREDICTED PARTS ASSOCIATED
C         WITH CONTEXT I IN ILC ARRAY
C     INDEP=INDEX VECTOR FOR SORTING PD INTO DESCENDING ORDER
C     INDEX=INDEX VECTOR FOR SORTING P INTO DESCENDING ORDER
C     KRECUR: 1=CHECK FOR RECURSION  0=DON'T CHECK
C     L3=NUMBER OF LEVELS TO BE TRIED
C     LAMBDA=COEFFICIENT OF PARSIMONY

C     LAMR=COEFFICIENT OF RECURSIVE PARSIMONY
C     LAWL=CONTROLS COMPUTATION OF PROBABILITIES OF OCCURRENCE
C         0=DON'T  1=DO
C     LETTER=VECTOR OF NON-TERMINAL SYMBOLS
C     LEVEL=THE LEVEL.  PROCESS BEGINS AT LEVEL=1 (INITIAL Y)
C     LLIST=LIST OF L'S USED TO CATALOG RULES OF PRODUCTION
C     M=TEST SEQUENCE LENGTH
C     M1,M2=RANGE OF VALUES OF M TO BE CONSIDERED
C         M1 >= 2
C         M2 <= M2MAX
C         FOR REGULAR GRAMMARS (TYPE 3), SET M1=M2=2
C     M2MAX=DIMENSIONED SIZE OF VECTORS VARYING OVER VALUES OF M
C     MARKER=ALPHABETIC VECTOR = RANGE OF MASK
C         #=DON'T CARE SYMBOL  _=CONTEXT PART  *=PREDICTED PART
C     MASK(M)=VECTOR WHOSE VALUES RANGE OVER THE SET MARKER
C         MASK:  #  _  *
C         PRE:   Y  Y  *
C         POST:  #  _  Y
C     MCH=OVERRIDE VALUE OF MBEST FOR LEVEL 1  0=NO OVERRIDE
C     MINM=MINIMUM M THAT PRODUCES A RECODING THAT USES AT LEAST
C         ONE RULE OF PRODUCTION
C     MNDC=DIMENSIONED MAXIMUM NUMBER OF DIFFERENT CONTEXTS
C     MODE=INITIAL PUNCTUATION CONTROL
C         1=NO INITIAL PUNCTUATION MARK
C         2=CHAR(C1+4) IS THE INITIAL PUNCTUATION MARK
C     MSC=DIMENSIONED MAXIMUM NUMBER OF PREDICTED PARTS FOR EACH
C         CONTEXT
C     MTSIZE=DIMENSIONED SIZE OF VECTORS CATALOGING SEQUENCES
C         (SUCH AS SQCTXT, SQPRED, SQLEN, TYPE, P, PD, INDEX, LLIST)
C     N=LENGTH OF Y
C     N26=SIZE OF LETTER
C     N40=DIMENSIONED SIZE OF SYMBOL STRINGS
C     NAME=ALPHABETIC VECTOR OF NAMES OF GRAMMAR TYPES
C     NCHAR=SIZE OF VECTOR CHAR
C         INCLUDES INITIAL PUNCTUATION (IF ANY)
C     NDC=NUMBER OF DIFFERENT CONTEXTS
C     NEPS= # OF P(I)'S WITHIN EPS OF 1.0 (TYPE 2)
C     NMESS= # OF MESSAGE P(I)'S (TYPES 2,3,4)
C     NMASK= # OF PERMISSIBLE MASKS
C     NNOISY= # OF P(I)'S BETWEEN EPS AND 0.0 (NOISE) (TYPE 5)
C     NQ=# OF RULES OF PRODUCTION AT THIS LEVEL
C     NRECUR=# OF RECURSIVE RULES OF PRODUCTION AT THIS LEVEL
C     NS=DIMENSIONED SIZE OF Y
C     NSTRU= # OF P(I)'S EXACTLY 1.0 (STRUCTURAL) (TYPE 1)
C     NST=SIZE OF VECTORS CATALOGING SEQUENCES
C         (SUCH AS SQCTXT, SQPRED, SQLEN, TYPE, P, PD, INDEX, LLIST)
C     NT3= # OF P(I)'S BETWEEN 1.0-EPS AND SEPAR (TYPE 3)
C     NT4= # OF P(I)'S BETWEEN SEPAR AND EPS (TYPE 4)
C     NTB=NUMBER OF SEQUENCES CATALOGED FOR CURRENT M
C         (SIZE OF TBCNT, TBPHI, TBNEXT)
C     NTM=# OF TIMES RULES OF PRODUCTION ARE APPLIED IN RECODE
C     NRP=VERSION OF NTP IN SUBROUTINE RECODE
C     NTP=ACTUAL # OF RULES IN RPA, RPC
C     NUMT=VECTOR OF COUNT OF TYPES OF P(I)'S
C     NY=# OF SYMBOLS IN YRP
C     NZ= # OF P(I)'S FOR CURRENT M
C     ORD=ALPHABETIC VECTOR FOR PRINTING MODES
C     P(I)=CONDITIONAL PROBABILITY THAT THE CONTEXT PART OF THE
C         SEQUENCE I PREDICTS THE PREDICTED PART OF THE SEQUENCE

C     PD(I)=COPY OF P WHICH IS SORTED INTO DESCENDING ORDER
C         ALSO USED TO SORT P(I)'S THAT ARE USED IN RECODING
C     POX(I)=CONDITIONAL PROBABILITY OF THE SEQUENCE PAIR THAT
C         PRODUCED RULE OF PRODUCTION I
C     PR1=PRINT CONTROL FOR UNSORTED P(I)'S  0=DON'T PRINT  1=PRINT
C     PR2=PRINT CONTROL FOR SORTED P(I)'S  0=DON'T PRINT  1=PRINT
C     PRG: PRINT CONTROL FOR GRAPH OF P(I)'S USED IN RECODING
C         0=DON'T PRINT GRAPH
C         1=PRINT SOME GRAPHS --- ONLY WHEN TRIAL=2
C         2=PRINT ALL GRAPHS
C     PRS: CONTROLS PRINTING OF SEQNO  0=DON'T PRINT  1=PRINT
C     PRE(M)=M-VECTOR OF CONTEXT PART OF SEQUENCE
C     POST(M)=M-VECTOR OF PREDICTED PART OF SEQUENCE
C     RIT=CONTROLS FORM OF ANTECEDENT (LEFT) SIDE OF
C         RULE OF PRODUCTION
C         1=(CONTEXT-FREE)= REPLACE ENTIRE PREDICTOR AND PREDICTED
C           SEQUENCES WITH ONE NEW NON-TERMINAL SYMBOL
C         0=REPLACE ALL CONTIGUOUS PREDICTOR SYMBOLS WITH
C           NEW NON-TERMINAL SYMBOL
C     RPA(I,J)=J-TH SYMBOL OF ANTECEDENT SIDE OF I-TH RULE OF PROD
C     RPC(I,J)=J-TH SYMBOL OF CONSEQUENT SIDE OF I-TH RULE OF PROD
C     RPALEN(I)=LENGTH OF ANTECEDENT SIDE OF I-TH RULE OF PRODUCTION
C     RPCLEN(I)=LENGTH OF CONSEQUENT SIDE OF I-TH RULE OF PROD
C     RPLEV(I)=LEVEL OF RULE I
C     RPAM(I,J)=RECORD OF PREDICTED PARTS FOR RULE I
C     RPMAX=DIMENSIONED MAX # OF RULES OF PRODUCTION IN RPA AND RPC
C     RPIX(I): STATUS OF RULE I IN RPA AND RPC TABLES
C         0=RULE I HAS BEEN MADE INTO A RECURSIVE RULE AND DEACTIVATED
C         -2=NON-RECURSIVE RULE
C         POSITIVE # =RECURSIVE RULE.  THE NUMBER POINTS TO THE
C             ORIGINAL NON-RECURSIVE RULE FROM WHICH IT WAS DERIVED
C     RTIX(I)=STORAGE FOR RPIX OUTSIDE OF RECODE SUBROUTINE
C     STRUCTURE: RPA(I,J), RPC(I,J), RPALEN(I), RPCLEN(I),
C         RPLEV(I), RTIX(I), RPIX(I), POX(I)
C     SCRD,SCRJ=SCRATCH VECTORS
C     SCRDAL,SCRJAL,SCRMAL=ALPHABETIC SCRATCH VECTORS
C     SECOND(I)=SECOND TERM OF COMBINED H MEASURE
C     SEPAR=VALUE SEPARATING TYPE 3 AND TYPE 4 P(I)'S
C     SEQNO(T)=INDEX OF ENTRY IN SQCTXT-TABLE CONTAINING PHI OF
C         LONGEST STRUCTURAL SEQUENCE BEGINNING AT TIME T
C     SOURCE=ALPHABETIC VECTOR OF SOURCES OF INPUT SAMPLE
C     SQCTXT(I)= PHI OF THE CONTEXT PART OF THE SEQUENCE
C     SQPRED(I)= PHI OF THE PREDICTED SEQUENCE
C     SQLEN(I)= LENGTH OF STRUCTURAL SEQUENCE I
C     SSS=ALPHABETIC 'S'
C     STRICT: CONTROLS VARIETY OF M'S ALLOWED IN RECODING
C         0=USE M1...MBEST  1=USE MBEST ONLY
C     T IS TIME
C     TBCNT(I)=TEMPORARY COUNT OF OCCURRENCES OF SUB-SEQUENCES
C     TBPHI(I)=TEMPORARY CONTEXT FOR SUB-SEQUENCES
C     TBNEXT(I)=TEMPORARY PREDICTED PART FOR SEQUENCES
C     THIRD(I)=THIRD TERM OF COMBINED H MEASURE
C     TT= # OF OCCURRENCES OF SEQUENCES PHI (REAL)
C         REPRESENTING PARTICULAR CONTEXT
C     TRI=ALPHABETIC VECTOR FOR PRINTING
C     TTQ=ALPHABETIC VECTOR OF WAYS OF INDUCING RECURSIONS
C     TYPE(I)=THE TYPE OF SEQUENCE I (1,2,3,4, OR 5)
C         TYPE 1 (STRUCTURAL) P(I) EXACTLY 1.0

C         TYPE 2 (MESSAGE) P(I) WITHIN EPS OF 1.0
C         TYPE 3 (MESSAGE) P(I) BETWEEN 1.0-EPS AND SEPAR
C         TYPE 4 (MESSAGE) P(I) BETWEEN SEPAR AND EPS
C         TYPE 5 (NOISE) P(I) BETWEEN EPS AND 0.0
C     VDIR: CONTROLS ASCENDING OR DESCENDING DIRECTION OF
C         CONSIDERATION OF M'S IN RECODE  -1=DESCENDING  +1=ASCENDING
C     VSTART: BEGIN SCAN IN RECODE AT TIME OF MBEST FOR EACH SENTENCE
C         (FOR MODE=2, ONLY)  0=NO  1=YES
C     WAYS=ALPHABETIC VECTOR OF CODING SUBROUTINES
C     WCV: 1=USE DEFINITIONAL APPROACH TO FIND RECURSIONS
C          2=USE UVWXY APPROACH
C     WDF: EFFECT OF SATISFYING CRITERION OR FINDING COMPLETE
C         RESOLUTION FOR A PARTICULAR M
C         0=NO EFFECT  1=CUT-OUT AND USE THIS M
C     WXL: INCLUDE IDENTITY RULES (AND FACTOR OF M IN H)
C         0=NO  1=YES
C     Y(T)=LARGE SAMPLE OF CONCATENATED SENTENCES
C     YNEW(T)=NEW Y STRING CREATED BY RECODING
C     YRP(I)=STRING CONTAINING RULES OF PRODUCTION IN CONTIGUOUS
C         FORM.  CONTAINS YRARRW BETWEEN ANTECEDENT AND CONSEQUENT
C         SIDES.  CONTAINS YRPUNC BETWEEN RULES.
C     YRPMAX=DIMENSIONED MAX # OF SYMBOLS IN YRP
C
C
C     SUBROUTINES....
C
C     BEST FINDS BEST H
C     CONV CONVERTS A NATURAL NUMBER INTO A C-ARY SEQUENCE
C     FZERO=ZEROES FLOATING POINT VECTOR
C     GEN1 READS INITIAL SYMBOL STRING FROM CARDS
C     GEN2 IS SPECIAL BINARY SEQUENCE GENERATOR
C     GETMSK GENERATES MASKS
C     GRAPH PRINTS GRAPH OF P(I)'S USED IN RECODING
C     GRID PRINTS OUT MASKS
C     HUFF IS HUFFMAN CODE SUBROUTINE
C     IEQUAL=SETS VECTORS EQUAL
C     IMATCH=MATCHES INTEGER VALUES
C     ISEEK DOES MATCHING
C     ISUM FINDS SUM OF VECTOR
C     IXSPRY=SPRAYS INTEGER CONSTANT
C     LOC LOCATES SEQUENCES THAT OCCUR IN SYMBOL STRING
C     LOOK KEEPS LIST OF NEW NON-TERMINAL SYMBOLS AT EACH LEVEL
C     MORE CREATES MASK FROM TWO SEQUENCES (PRE AND POST)
C     MTEST TESTS GRAMMATICAL TYPE OF MASK
C     PHI CONVERTS A C-ARY SEQUENCE INTO THE NATURAL NUMBER
C     RANDU IS RANDOM INTEGER GENERATOR
C     RECODE DOES RECODING FROM ONE STRING TO STRING OF HIGHER LEVEL
C     SPRAY SETS VECTOR TO CONSTANT
C     SYMBOL PRODUCES SYMBOL STRING FOR PRINTING
C     TLOC DETERMINES WHICH SEQUENCES APPEAR AT A GIVEN TIME IN Y
C     XFSORT SORTS BLOCK OF REALS WITHIN ARRAY
C     ZERO SETS VECTOR TO ZEROES
C
C
C     INPUT AND OUTPUT...
C     5=INPUT CARDS
C     6=OUTPUT PRINTER
C     7=OUTPUT FILE CONTAINING SUMMARY OF PRINTER OUTPUT
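C
C     SKETCH ONLY (NOT PART OF THIS LISTING): AN ILLUSTRATIVE
C     RECONSTRUCTION, FROM THE ONE-LINE DESCRIPTIONS ABOVE, OF
C     THE ARITHMETIC PERFORMED BY THE PAIR PHI AND CONV.  PHI
C     PACKS A C-ARY SEQUENCE INTO A SINGLE NATURAL NUMBER, AND
C     CONV UNPACKS IT AGAIN.  THE DEMONSTRATION VALUES ARE
C     HYPOTHETICAL.
C
      PROGRAM DEMO
      INTEGER SEQ(3), OUT(3), X
      DATA SEQ /1, 0, 2/
      CALL PHI (SEQ, 3, 3, X)
      CALL CONV (OUT, 3, 3, X)
      WRITE (*,*) 'PHI GIVES', X, '  CONV GIVES', OUT
      END
C
      SUBROUTINE PHI (SEQ, M, C, X)
C     CONVERT THE C-ARY SEQUENCE SEQ(1..M) INTO A NATURAL NUMBER X
      INTEGER SEQ(M), M, C, X, L
      X = 0
      DO 10 L = 1, M
         X = X*C + SEQ(L)
   10 CONTINUE
      RETURN
      END
C
      SUBROUTINE CONV (SEQ, M, C, X)
C     CONVERT THE NATURAL NUMBER X INTO THE C-ARY SEQUENCE SEQ(1..M)
      INTEGER SEQ(M), M, C, X, L, XT
      XT = X
      DO 10 L = M, 1, -1
         SEQ(L) = MOD(XT, C)
         XT = XT/C
   10 CONTINUE
      RETURN
      END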

C
C     DIMENSIONED SIZES OF VECTORS
C
      NS=500
      M2MAX=20
      MTSIZE=500
      N26=26
      N40=40
      RPMAX=50
      YRPMAX=300
      MNDC=500
      MSC=15
C
C     MAIN LOOP
C
  110 CONTINUE
      WRITE(6,790)
  790 FORMAT ('1')
C
C     READ CONTROL PARAMETERS
C
C     INPUT CARDS...
C
C     1. CONTROL CARD:
C        4I5:   L3, M1, M2, VDIR
C        2F5.2: LAMBDA, LAMR
C        25I2:  CONT,MODE,GRAM,RIT,ALLOW,STRICT,FR,PR1,PR2,PRG,PRS,
C               EXR,MCH,EARLY,LAWL,VSTART,HSEL,WDF,WCV,WXL,KRECUR
C     2. TERMINAL ALPHABET VECTOR (FORMAT I5, 75A1)
C        C1=SIZE OF TERMINAL ALPHABET
C        CARD COLUMNS 6... CONTAIN THE C1 TERMINAL SYMBOLS
C        IF MODE=2, LAST COLUMN AFTER THE TERMINALS CONTAINS
C        THE INITIAL PUNCTUATION MARK (.)
C     3. NON-TERMINAL ALPHABET VECTOR (FORMAT I5, 75A1)
C        N26=SIZE OF VECTOR IN FORMAT I5
C        LETTER=ALPHABET IN FORMAT 75A1
C     4. IF CONT=1, GEN1 READS CARD CONTAINING LENGTH OF Y (FORMAT I5)
C        IF CONT=2, GEN2 READS CARD (FORMAT 16I5)
C     5. IF CONT=1, GEN1 READS IN CARD(S) UNTIL N SYMBOLS HAVE BEEN
C        READ IN UNDER AN 80A1 FORMAT
C
      READ (5,4010) L3,M1,M2,VDIR,LAMBDA,LAMR,CONT,MODE,GRAM,RIT,ALLOW,S
     1TRICT,FR,PR1,PR2,PRG,PRS,EXR,MCH,EARLY,LAWL,VSTART,HSEL,WDF,WCV,WX
     2L,KRECUR
 4010 FORMAT (4I5,2F5.2,25I2)
C
      WRITE (6,4020)
      WRITE (7,4020)
 4020 FORMAT ('1')
      WRITE (6,4030) L3,M1,M2,VDIR,LAMBDA,LAMR,CONT,MODE,GRAM,RIT,ALLOW,
     1STRICT,FR,PR1,PR2,PRG,PRS,EXR,MCH,EARLY,LAWL,VSTART,HSEL,WDF,WCV,W

2uu 2XLKRECUR 4030 FORMAT (lX,415,2F5.2,251I2) C C WRITE (6,4040) L3,M1,M2,DI RECT(VOIR+2..,OLAMBDALAMR WRITE (7,40t4U) L3, Ml, M2DIRECT (VCIR+2,L.AMBDA LAMR 4040 FORMAT ( '0' / 8 ' MAXIMUP NUMBER OF LEVELS TO BE TRIED ', 1 7 4 SMALLEST M TO BE TRIED a, I7 / 5 'LARGEST M TO BE TRIED ' 17 / 8 DC'IRECT IN OF CONSIDERING M IN RECGDE t 6X, A' / 9 ' LAiMBDA -- COEFFICIENT OF PARSIMONY, F10.5 / 9 LAMR -- COEFFICIENT OF RECUkSIVE PARSIMONY, F10.5 7 ) WRITE (6,4050) SCURCE(CONT),ORD(MODE),NAME(GRAM+1),ANT(Rt+i't+),ALL 1OW,ANSW(STRICT+i),FRANSW(PRI+IA) NSWR2 TRIPRGANSR2 TRIPG+ ANSW(PRS+1 2),WAYS(EXR),MCHANSW(EARLY+),ANSW(LAWL+ ),ANSW(VSTART+1),HSEL WRITE (7,4050) SOURCE(CONT),ORDO(MO DE),NAME(GRAM+1),ANTE(RIT+l)tALL OW,ANS (STRIC T ),FR, ASW(PR )NSWPR + ) TRI(PRG+ ) ANSW SIT(PRS+1 2) WA;YS(EXR) t'CHANSW(EARLY+I),ANSW(LAWL+l),ANSW(VSTART+1),HSEL 4050 FORMAT {. 2 SOURCE OF INITIAL SYMEOL STRING,t 6X, A4 / 3 ' MODE -- INITIAL PUNCTUATICN MARK It 6XA4 2 TYPE OF GRAMMAR DESIRED (MASKS TO BE TRIED) ' 6XA8 / 1 RIT - CNCTROLS ANTECEDENT SIDE OF RULES. 6X, A8 / 1 WORST TYPE OF P(I) ALLOWED TO ENTER CODE 17 / 8 ' USE OF BEST M STRICTLY ', 'X4'A / 2 ' FR -- RACIX OF MASK t 17 / 6 ' PRINT CONTROL -- PRINT UNSORTED P(I) ' 6X, A4 / I ' PRINT CCNTROL -- PRINT SORTED P(I) 6X, A4 / ' PRIiT CCNTROL -- PRINT GRAPh UF P(I)-S, 6X, A4 / 5 ' PRINT CCNTROL -- PRINT SEQUENCE NUMBERS ' 6X, A4 / 2 ' SUBROUTINE USED FOR RECODING '' 6X, A8 / 2 ' OVERRICE VALUE FOP MBEST (ICH) I77 ' EARLY ELIMINATION OF UNALLOWABLE P(I) ', 6X, A4 / 7 'INCLUDE ALL-PREDICTED MASK ' 6X, A4 / 5 'START RECODE AT TIME CF MBEST ', 6X, A4 / 7 HSEL -- VETHCO FOR COMPUTING H ', I7 8 ) WRITE (b,4060) ANSW( hDF+I),TTQ ( CV),ANSW(W XL+I),ANSW(KRECUR+l) WRITE (7,406u) ANSW WDF+I),TTQ (CV),ANSW(WXL+ ),ANSW(KRECUR+IJ 4060 FORMAT ( 1 ' CUT-OUT CN FINDING GOOD M ', 6X, A4 / 2 ' METHOD C.F FINDING RECURSIONS ', 6X, A8 / 3 ' INCLUDE IDENTITY RULES ', 6X, A4 / 4 ' TEST FOR RECURSICN ', 6X,A4 5 ) C C C READ IN TERMINAL ALPHABET, INITIAL PUNCTUATION MARK (IF ANY), C READ (5,407C) C1,(CHAR(I), I=4,78) 4070 FORMAT (I5,75A1) NCHARC l+3+ ( MCE-1 ) C PRINT OUT TERMINAL ALPHABET NCX=NCHAR- ( MODE-1) WRITE (6,4080) CI,(CHAR(I),I=44,NCX) WRITE (7,4080) Cl, (CHAR(I),I14,NCX) 4080 FORMAT ( i i /

201 1 ' SIZE OF TER.MINAL ALPHABET (INITIAL STRING) I7 / 2' TERMINAL ALPHABET:,' 75A1 4 ) IF (MODE.EQ.2) WRITE (6,4090). CHAR(NCHAR) IF (MODE.EQ.2) WRITE (7,4090) CHAR(NCHAR) 4090 FORMAT ' INITIAL PUNCTUATION MARK (MODE=2),Al) C C C READ IN NCN-TERMINAL ALPHABET C READ (5s4070) N26s(LETTEK(I),I=1,N26) C C PRINT OUT EASIC NCN-TERMINAL ALPHABET C WRITE (6,4100) N26,(LETTER ( II 1N26) WPITE (7,4100) N26,(LETTER(I),I=1,N26) 4100 FORMAT ( ' ' / 1 ' SIZE OF EASIC NON-TERMINAL ALPHABET ', I7 / 2 SYMBOLS OF THE 2ASIC NON-TERMINAL ALPHABET: 't 75A1 C. C ESTABLISH CONVENTIONS FOR REPRESENTING SYMBOLS C __ C SYMBOLIC NUMERIC C FORM FORM C (ALPHA) (Y) C C C# DON'T CARE SYMBOL C 1 CCNTEXT PART C 2 PREDICTED PART C.. 3...CL+2 THE Cl TERMINAL SYMBOLS C * C1+3 T-E INITIAL PUNCTUATION MARK, IF ANY C C DO 115 I=1, 3 CHAR(I )=MARKER(I) 115 CONTINUE C=NC t-A DONT=O CTXT=1 PEPC=2 PUNC=NCHAR-1 YRARRW=OCNT YRPUNC=PERC C C C OBTAIN INITIAL Y FOR LEVEL 1 C GO TO (130,14C,150), CCNT C GENI: REAC SYMBOL STRING FROM CARDS 130 CONTINUE CALL GEN1 ( YNNS,ALPHACHAR,NCHAR) GO TO 180 C GEN2: SPECIAL BINARY SEQUENCE GENEPATOR 140 CONTINUE C CALL GEN2 ( STRING,NNSMODENCHARPUNC) GO TO 180 C GEN3:GENERAL SEQUENCE GENERATCR 150 CONTINUE GO TO 180

202 180 CONTINUE C C INITIALIZE FOR ALL LEVELS C LEVEL= CATC -=1 NRP=O NTP=O NY=O C C CONSIDER VARIOUS VALUES OF LEVELal,...,L3 190 CONTINUE C C C INITIALIZE FOR THIS LEVEL C MINM=O NST=O NZ=O NDC=O 00 192 11=1, MNDC HDJ( 1 )=O.O 192 CUNT INUE C C PRINT OUT Y C CW=C-3-(MODE-1) WRITE(6,200) LEVEL, N, CW WRITE(7,200) LEVEL, N, CW 200 FORMAT (//' 1 --.- ------ LEVEL ' I2* 9 -._..-............ ----_ --- ——. --- / / 2 ' SYMBOL STRING GF LENGTH ' 13, 3 ' AND USING ALPHABET OF SIZE 13 / } CALL SYMBOL (YALPHAsN, vMACHAR,NChAR#LETTER,N26) WRITE (6,220) (ALPHA(III),I 1=1,MMA) 220 FORMAT ( '0', (/1X,60Al) ) WRITE (7,220) (ALPHA(I 1 ), I II=1,MMA) C C CEMPUTE CRITERICN FOR THIS LEVEL C FC=ALOG(2.0) CX=C-3-(MODE-1) CRIT=-i. O *ALOG(1 O/CX)/FC WRITE (b, 225) CRIT 225 FORM-AT(//'OCRITERION ', F105 / ) C C CONSIDER VARIOUS M = M... M2 C ff=Mi 230 CONTINUE WRITE (6,240) LEVEL,M 240 FORMAT (1','LEVEL=I 4,5X, 'M=' I4////) IF ( PR1.NE. 0 ) WRITE(6t-544) C INITIALIZE FOR THIS M C NT6=O H( iM)=9999 NSTRU=O

203 NEPS=O NT 3=0 NT4=0 NNOI SY=0 CALL ZERO( TBCNT, NS) CALL SPRAY( TeNEXT, NS, -1. CALL SPRAY TbPHI, NS, -1 ) C C EPS AND SEPAR SE PAR=. 50 EPS=1.O/M IF(EPS.GE..34) EPS=.34 C C C CCNSIDER VARIOUS MASKS C C CCNVENTICNS FOR MASK, PRE, POST VECTORS C - #=-DONT CARE SYMBOL _=CONTEXT PART =-PREDICTED PART C C MASK: # % C PRE: Y Y % C POST: # _ Y C NMASK=0 SIND=O 262 CONTINUE CALL GETMSK( P, FR, GRAM, NMASK, CTXT, PERCO DONTI M2MAX, 1 MASK, IPART, &421 ) 263 CONTINUE C C C CONSIDER ALL T FROM T=M TO T=N C FOR CURRENT VALUE OF M, CONSIDER M-SEQUENCES STARTING AT C TIME T=T-M+1 TO TIME T=T C DO 400 T=M,N C C FORM PRE AND POST SEQUENCES C DO 350 L=1,M JD=L TMJD=T-M+JD K=M+1-L C C IF M-SEQUENCE CONTAINS PUNCTUATION MARK, SKIP OVER IT IF (MODEf.EQ..ANO.Y(TMJD).EQ.PUNC) GO TO 390 IF ( MASK{K).EQ. DCNT ) PRE (L)Y(TMJO) IF ( MASKLK).E.E DCNT ) POST(L) = DCNT IF ( MASKIK).EQ. CTXT ) PRE(L)= Y(TFJD) IF ( MASK(K).'Q. CTXT ) PCST(L) = CTXT IF ( MASK(KJ.EQ. PERC ) PRE(L)=PERC IF (!ASK(KJ.EQ. PERC J POST(L) = Y(TiJD) 350 CONTINUE C C GET PHI' S FUR EACH SEQUENCE CALL PHI (PKR,MCXPkE) CALL PHI (PCST,M,C,XPOST) C C INSERT INTO TABLES IF (NT, EQ.O) GO TO 370

204 00 360 I=1,NTB IF (XPRE.NE.TEPHI(I)) GO TO 360 IF (XPOST.NE.TBhEXT(I)) GO TO 360 TBCNT( I)=TCNT(I)+1 GO TC 380 360 CONTINUE 370 CONTINUE NTB=NTB+i TBCNT(NTB)=i ToPH I (NT)=XPRE TBNEXT(NTB) =XPOST 380 CONTINUE 390 CONTINUE C C C END OF LOOGP FCR THIS T C 400 CONTINUE C C C END OF LOOP FOR THIS MASK C IF(SIND.EQ.1) GO TO 429 GO TO 262 C C ALSO USE THE ALL-PREDICTED MASK C 421 CONTINUE IF(LAWL.NE.1) GO TO 428 SIND=1 NMASK=NMASK+i DO 422 IZ=1, M MASK(I )=PE R 422 CONTINUE GO TO 263 428 CONTINUE 429 CONTINUE C C C CCMPUTE THE P(I)'S C IF(NTB.EQ.0) GO TO 545 NZ=O C C MAIN LOOP FOR I=1, NTB C DO 540 1=1,NTB IF (TBCNT(I).LE.0) GO TO 540 C C NEW AND DIFFERENT CONTEXT DISCOVERED C CCMPUTE TT=COUNT OF ITS OCCRRENCES C NDC=NDC+1 TT=O.O DO 430 J=I,NTB C C CONSIDER TABLE ENTRIES ONLY ONCE IF (TBCNT(J).LE.O) GO TO 430 C MATCH CO.NTEXTS

205 IF (TBPHI(I).NE.TBPHI(J)) GO TO 430 TT=TT+T3CNT J) C C PREVENT FURTHER CONSIDEKATI.UN OF PREDICTED PARTS (OTHER THAN C FIRST) ASSOCIATED WITh THIS CONTEXT TBCNT (J) =-T8CINT J) 430 CONTINUE C C CCMPUTE THE P, SQCTXT,SPRED,SQLEN, ETC. VALUES ti t HIS C PARTICULAR CONTEXT. THAT IS, CONSIDER EACH PREDICTED PART C ASSOCIATED WIT- THIS CONTEXT. C ILCLEN(NOC)=O DO 520 J=I,NTB C C CCNSIDER TABLE ENTRIES CNLY ONCE IF (TBCNT(J).CE.O) GO TO 520 C C MATCH CONTEXT IF (TBPHI(I).NE.TBPHI(J)) GO TO 520 NST=NST+1 NZ=NZ+i P(NST)=-TBCNT(J)/TT HDJ(NDC)=HDJ(MDC) - P(NST) * ALOG(P(NST)) TBCNT(J)=0 ILCLEN(NDC) = ILCLEN( NGC)+ SQLEN(NST)=M SQPRED (NST) =T6NEXT( J SQCTXT(NST =TBPHL(J) C C C CLASSIFY SEQUENCES AS TO TYPE (l.t2.3,4, OR 5) C IF (P(NST).GE.olO) GO TO 440 IF (P(NST}.G:.i..O-EPS) GO TO 450 IF (P(NST).GE.SEPAR) GO TO 460 IF (P(NST).GE.EPS) GC TC 470 IF (P(NST).GE..OO) GO TO 480 C C TYPE 1: P(I) EXACTLY 1.0 (STRUCTURE) C 440 CONTINUE NSTRU=NSTKU+1 TYPE(NST)=1 GO TO 490 C C TYPE 2: P(I WITHIN EPS OF 1.0 C 450 CONTINUE NE PS=NEPS+1 TYPE(NST)=2 GO TO 490 C C TYPE 3: P(I) BETWEEN 1.0-EPS AND SEPAR C 460 CONTINUE NT3=NT3+1 TYP (NST)=3 GO TO 490 C

206 C TYPE 4: P(l) BETWEEN SEPAR AND EPS C. 470 CONTINUE NT4=NT4+1 TYPE(NSr =4 GO TO 490 C C TYPE 5: P(I) LESS THAN EPS (NOISE) C NB: NO P(I)=O.O C 480 CONTINUE NNOI SY=NNOISY 1 TYPE(NST)=5 GO TO 490 490 CCNTINUE C C C DELETE P(I)'S HERE THAT ARE TOO SMALL C IF (.NOT.(EARLY.EQ.1.AND.TYPE(NST).GT.ALLOW)) GO TO 495 NZ=NZ- NST=NST-1 495 CONTINUE C PRINT UNSCRTED TABLE OF NON-ZERO ANC DEFINED PIS C IF (PR1.EQ.C) GO TO 510 CALL CONV (SCRJ,M,C,SQCTXT(NST)) CALL SYMbJL (SCRJ,SLRJAL,M, M,CHARNCHAR,LETTERtN26) CALL CONV (SCR:DM, C SPRED(NST)) CALL SYMBOL (SCRD, S CRCAL,M, MMQ,CHAR,NCHAR,LETTER,N2&"J CALL GRID ( SCRJ, SCRD M, DCNT, CTXT, PERC, SCRMAL, MARKER ) JJ=NST MJ=M WRITE(6,710) JJ,LEVEL,SQLLEN(JJ),P(JJ),YPEJJ), 1 (SCRMAL(JJH),JJHI,MJ), BLANKS,(SCRJAL(KKJ: ).KKJ=1,MM),BLANKS, 2 (SCRDAL(JKL), JKL=1,MMQ ) 510 CONTINUE 520 CONTINUE 540 CONTINUE C PRINT STATISTICS ON TYPES OF SEQUENCES C 545 CONTINUE WRITE(6,550) NSTRU, NEPS, NT3, NT4, NNOISY 550 FORMAT ( 1 'ONUMBER OF TYPE 1 SEQUENCES(STRUCTURAL I17 / 6 'NUMBER CF TYPE 2 SEQUENCES (MESSAGE), I / 6 'NUMBER OF TYPE 3 SEQUENCES (MESSAGE) ', 7 / 6 NUMBER uF TYPE 4 SEQUENCES (MESSACE) I7 / 3 ' NUMBER OF TYPE 5 SEQUENCES (NOISE) 1,7 ) NP I= (C-3-(MOCE-i)) **M PSIZE=NP I*NASK WRITE(6,570) NMASK, NPI, PSIZE, NZ, NCC, SEPAR, EPS 570 FORMAT ( 1 ' NUMBER CF DIFFERENT MASKS USEC, 17 / 2 NUMBER OF POSSIBLE M-SEQUENCES ', 17 / 2 ' NUMBER CF PCSSIBLE PI} 'I7 1 ' NUMBER OF P(I) 'I7 / 3 ' NUMbERP CF DIFFERENT CONTEXTS I7 /

207 3 '\ SEPARATICN VALUE BETWEEN TYPES 3 AND 4', F1.5 / 4 ' EPSILUN FOR DEFINING TYPES 2 AND 3 ', F10.5 5 ) C' C C SCRT THE P(I)'S INTO DESCENDING ORDER C IF (NTB.EQ.O) GO TO 610 DO 542 JK=1, NST PD(JK)=P(JK) 542 CONTINUE CALL XFSORT ( PD, MTSIZE, INDEX, 1, NST ) C C PRINT SORTEU TABLE OF NUN-ZERO AND CEFINED P'S IF ( PR2.EQ. 0 ) GO TO 548 WRITE (6,544) 544 FORMAT 'O # LEVEL LENGTH PRGO TYPE',7X, 1 'MASK CONTEXT PREDICTED' / '0 ) DO 546 L=l, ST JJ= INDEX(L) MJ=SQLEN (JJ CALL CONV ( SCRJ,MJ, C, SQCTXT(JJ) CALL SYMbOL ( SCRJ, SCRJALIMJ, MY, ChAR, NCHAR,LETTER,N26) CALL CONV ( SCRD,MJ, C, SQPRED(JJ) ) CALL SYMBUL ( SCRD, SCRDAL, FJ, MMQ, CHAR, NCHAR tLETTERN26) CALL GRID { SCRJ, SCRD, M, DONT, CTXT, PERC, SCRMAL, MARKER ) WRITE(6,710) JJLEVEL,SQLEN(JJ),P(JJ),TYPE(JJ), 1 (SCRMAL JJH),JJH=1,MJ),3LANKS,(SCRJAL(KKJ) KKJ=l,MM),BLANKS, 2 (SCRDAL JKL), JKL=1,MMQ ) 546 CONTINUE 548 CONTINUE C C C ATTEMPT RECODING, COMPUTE H FOR THIS M C TRIAL=1 MbE=M CALL RECODE (Y,YN EW, SECNO ALPHA LLIST,SQTXT, SQL EN,QPRED, 1 TYPE, PPDINDEXtINDEP, CHAR, SCRJSCRJAL,SCRD, 2 SC~CAL, POST, BI,BI I,PRE,MASK, H,FIRST,SECOND,THIRD, LETTER 3 RPALEN, RPLLEN,RPLEV,RFA,RtPC,RPAMYRP,NUMTARROW,DONE,OESC 4, PCX, RPIX, RTIX ) C C CCOMPUTE NSUM, NMESS, MINM C IF (ALLOW.EQ.1) NSUM=NSTRU IF (ALLOW.EQ.2) NSUM=NSTRU -NEPS IF (ALLOw. EQ.3) NSUM=NSTRU+NEPS+NT3 IF (ALLUW. EC.4) NSUM=NSTRU+I-EPS+tnT3+NT4 IF (ALLO.EU.5) NSUM=NSTRU+NEPS+NT3+1T4+NNOISY NMrFSS=NL-NNC I SY-NSTKU IFINTM.EQ.O) GG TO 600 IF (MINM.EQ.O) MINM=M 600 CONTINUE C C C CHECK FOR (1) ALL SEQUENCES STRUCTURAL SEQUENCES, C (2) H(M) IS 0.0, OR (3) H(M) IS LESS THAN CRITERION C

208 IF(WDF.NE.l) GO TO 592 IF(NUMT(1).NE.O.AND. NUMT(2).EQ.O.AND. NUMT(3).EQ.0 1.AUD. NUMT(4).EQ.O.ANO. NUMT(5J.EQ.O ) GO TO 650 IF(FIRST(M).EQ. 0.0 ) GO TO 650 IF(H(M).LE.CRIT) GO TO 630 592 CONTINUE C C C END OF LCLP FOR M C 610 CONTINUE M=M+l IF (M.LE.M2) GG TO 230 C C C C FIND BEST M --- LOCAL MINIMA AMONG THE H(M) C CALL BEST(H,MI, M2, HBEST, MBEST, MINMt FIRSTtSECCND,THIRD,M2MAX) WRITE (6,620) MBEST,H(MBEST) WRITE (7,62C) MBESTsH(MBEST) 620 FORMAT ( 'OBEST M IS ', I4, I WITH H OF I, F10.5 GO TO 670 C C CRITERION SATISFIED C 630 CONTINUE MBEST=M WRITE (6,64C) MBEST,CRITH(MBEST) WRITE (7,64C) MBEST,CRITH(MBEST) 640 FORMAT ( 'OM CF ', I4, ' SATISFIES CRITERION OF,s F1O.5, 1 WITH H OF ', F10.5 ) GO TO 670 C C ALL P{I) ARE STRUCTURAL C 650 CONTINUE M'iESTrM WRITE (b,o60) M,LEVEL WRITE (7,660) M,LEVEL o60 FORMAT( 'OM CF ', 14, ' GIVES COMPLETE RESOLUTION AT LEVEL',[2//) C C OVERRIDE VALUE OF MBEST (FOR LEVEL 1 CNLY) C 670 CONTINUE IF (.NuT.(MCH.Nc.O.AND.LEVEL.EQ.1)) GG TO 690 MBEST=MCH WRITE (6,680) MBEST WRIITE (7,68C) MSEST 680 FJRMAT ( 'OM OF ', I4, ' IS OVERRIDE VALUE ' ) 690 CONTINUE C C C PRINT THE SQ-TABLES C WRITE (6,700) LEVEL 700 FORMAT ( '1REGULARITIES FOR LEVEL 't 14 ) WRITE(6, 544) IF (NST.EQ.O) GO TO 730 DO 720 L =1, NST

209 I= INDEX (L) M=SQLEN( I) CALL CONV (SCRJ,M,C,SQCTXT(I)) CALL SYMBOL (SCRJSCRJAL,M, MM,CHARNChARLETTERN26) CALL CJNV (SCRD,M,CSQPRED(I)) CALL SYMBOL (SCRDSCRDALM, MMQ, CHARtNCHARLETTER,N26 ) CALL GRID ( SCRJ, SCRD, M, DCNTTCTXT, PERC, SCRMALt MARKER ) JJ=I J:=M WRITE (6,710) JJ, LELE EL, S LEN(JJ),P(JJ) TYPE(JJJ) 1 (SCRMAL(JJH),JJH=i,MJ),BLANKS,(SCRJAL(KKJ),KKJ=l,MM)BLANKS, 2 (SCRDAL(JKL), JKL=1,MMQ ) 710 FORMAT ( 318, FS.2, 18, 7X, 80A1 ) 720 CCNTINUE 730 CONTINUE C C C DO RECODING WITH M=MBE (BEST M) C TRIAL=2 MtE=MBEST CALL RECODE(Y,YN E,SEQNOALPHA, LLISTtSQCTXT,SQLENSQPRED, 1 TYPE, P,PD,INDEX,INDEP, CIAR, SCRJ,SCRJAL,SCRD, 2 SCRDAL, POST,bIBII,PKE,MASK,H,FIRSTISECONDTHIRD, LETTER 3 RPALEN, RPCLEN,RPLEV,RPA,RPC,RPAMYRPNUMT,ARROW,OONE,DESC 4, PCX, RPIX, RTIX ) C C C END CF LCCP FOR THIS LEVEL C GO TC (740,760), CATCH 740 CONTINUE LEVEL=LEVEL+1 IF (LEVEL.LE.L3) GO TO 190 WRITE (6,750) WRITE (7,75C) 750 FORMAT ( 'ONU MORE LEVELS TO BE CONSIDERED' / ) CALL SYMBOL (YALPHA,N, MMA, C NCR NCHARLETTER,N26) WRITE (6,22C) (ALPHA(I II), II=1,MMA) WRITE (7,220) (ALPHA(III),I II=1,MMA) C C C PRINT FINAL RULE OF PRODUCTION C 760 CONTINUE GO TO ( 762, 772), MODE 762 CONTINUE CALL SYMBUL (YALPHAN, MMA,CHAR,NCHARLETTER,N26) WRITE(6,770) SSS, ARRUW, (ALPHA(III), III=1, MMA,,) WRITE(7,770) SSS, ARROW, (ALPHA(III), III-l, MMA ) 770 FORMAT ( '0' / ' ', 129A1 / ( ', 1X, llOA1 / ) f GO TO 778 772 CONTINUE L=: DO 775 T=l, I L=L+1 YNEW(L ) = Y (T) IF(Y(T).NE. PUNC) GO TO 775 CALL SYMBOL( YNEW, ALPHA, L, MMA, CHAR, NCHAR, LETTER"-N'26 'r WRITE(o,770) SSS, ARROW, (ALPHA(III), III=i, MMA )

210 WRITE(7,770) SSS, ARROW,,ALPhA' IIl), III=l, MMA ) L=0 775 CONTINUE 778 CONTINUE C C END OF PROCESSING OF THIS DATA SET' C 780 CONTINUE GO TO 110 END

211 SCOPY RECODE *SINK*@-CC SUBROUTINE RECODE(Y YNEW SECNOALPHAt LLISTSQCTXTSQLENSQPRED, 1 TYPE, P,PD,INDEXINOEP, CHARt SCRJ SCRJAL SCRDt 2 SCkDAL,. PUST,BI1SBIIPREMASK,H,FIRSTSECCNDTHIRC, LETTER, 3 RPALEN, RPCLEN,RPLEVRPA,RPCtRPA,~YRFNUMT~ARROW,DONE,DESC 4, POX, RPIX, RTIX ) C C SUBROUTINE FOR RECODING SAMPLE Y C IMPLICIT INTEGER (A-Z) C C C CCMMON STCRAGE C CCMMCN Cl1CONT,L3,Ml1,M2 PR1, PR2, FR, LAMBDAt RIT 1 MCH, MODE, ALLOW, STRICTEARLYGRAM, NCHAR, 2 N26, PRG, EXR, PRS, LAWL, VSTART t VDIR, HSELt WOF, 3 WCV, WXLt KRECUR, LAMR 4, NS,N,MBE,C,MTSIZEM2MAXNQTRIALCCNT,PUNCLEVELN40" TC'',NST, 5 PERCCTXT,TN, NTMRPMAX NRPtNYYRPPAXNRECURYRARhWYRPUNC 1, NTP C C C C VECTORS OF SIZE NS INTEGER Y(NS) INTEGER YNE (NS) INT'GER SEQNC(NS) INTEGER ALPHA/(NS) INTEGER DONE(NS) C C VECTURS OF SIZE MTSIZE INTEGER SQCTXT(MTSIZE) INTEGER SQLEN(MTSIZE) INTEGER SQPRE L (MTSI L INTfGER TYPEL(ITSIZE) INTEGER INDEX (TSIZE INTEGER INDEP MTSIZE) REAL P(MTSIZE) REAL PD(MTSIZE) C C VECTORS OF SIZE NCHAR INTEGER CHAR INCHAR) C C VECTORS CF SIZE N40 INTEGER SCRJIN40) INTEGER SCRJAL(N40) INTEGER SCRD(N40) INTEGER SCRCALN40) C C VECTORS OF SIZE M2MAX INTFGER POST (M2MAX) INTEGER BI.( 2MAX) INTEGER BII(M2. AX) INTEGER PRE(M2MAX) INTEGER MASK(M2MAX) REAL H('4M2iAX) REAL FIRST(M2VAX) REAL SECOND(M2MAX)

212 REAL THIRDIM2MAX) C C VECTORS OF SIZE N26 INTEGER LETTER(N26) C C VECTORS CF SIZE RPMAX INTEGER LLIST(RPMAX) INTEGER RPALEN RPMAX) INTEGER RPCLE (RPMAX) INTEGER RPLEV(RPMAX) REAL POX (RPMAX) INTEGER RPIX(RPMAX) INTEGER RTIX(PPMAX) INTEGER OESCIRPMAX) C C ARRAYS OF SIZE RPMAX * M2MAX INTEGER RPA (RPMAXM2MAX) INTEGER RPC (RPMAX,M2MAX) INTEGER RPAI(RPMAX,M2MAX) C C VECTORS OF SIZE YRPMAX INTEGER YRP(YPPMAX) C OTHER VECTCKS DIMENSION NUMT(5) INTEGER ARRCW(i 2) C C REAL CONSTANTS AND FUNCTICNS REAL HP REAL ALOG REAL FC REAL LAMBDA REAL LAMR C C C C SUBROUTINE RECOOES SAMPLE ACCORDING TO PARAMETERS C MAIN PARAMETERS FOR RECOOING: C MBE=UPPER LIMIT ON LENGTH OF REGULARITY TO BE USED IN C THIS RECODING C M=lv y.., M6E C TRIAL-=NDICAIION OF WHETHER THIS RECODING IS TRlrLTk't" C ACTUAL RECGDING 1= TRIAL 2=ACTUAL C RIT- hETHER TO WRITE CCNTEXT-FREE RULES C ALL'OW=IYPE OF WORST P(I) ALLOWEC IN RECODING C STRICT=WH-THER TO D0 STRICT RECODING C KRECUR=w'HETHER TO TRY RECURSICNS C OTHER PARAMETERS C VDIR=CIRECTICN OF SCAN FOR M C VSTART=STARTING POINT FOR SCAN C HSEL=SELcCTION OF H MEASURE (FOR TESTING PURPOSE' C C INITIALIZATION C YNL=N HP=O.0 NTM=O NCT=O NQ=O NRECUR=O

213 NHER=O FILL=DUNT CALL ZERO (SEQNON) CALL ZERO (DONE,NS) DO 2040 I=1,Ml YNEW(I =Y(I) 2040 CONTINUE IF (TRIAL.EQ.2) WRITE (6,20101 LEVEL IF (TRIAL.EQ.2) WRITE (T,2010) LEVEL 2010 FORMAT ( ' --------------- RULES CF PRODUCTICN ', 9 'FOR LEVEL', I', 1 1~1' ~) IF (TRIAL.EQ.1) WRITE (b,2020) MBE 2020 FORMAT ( 'OTENTATIVE RULES OF PRODUCTION FOR M OF ', 13 ) C C COPY INTO TEMPORARY RPIX FRCM THE PERMANENT RTIX TABLE C N8: IF TRIAL=1, THE IPA AND OTHER ASSOCIATED TABLES ARF NOT C ALTERED, AND NPP IS NOT CHANCED. IF(NTP.NE.O) CALL IEQUAL (RPIX, NTP, RTIX) C C C LOOP FOR T=M1,...,N C LAST=O QLAST= O TN=MI T=Ml 2050 CONTINUE YNEW(TN)=Y (T) IF (NST.EQ.O) GO TO 2460 IF (VSTART.EC1l.AND.T-QLAST.LT.MBE) GO TO 2460 C C C LOUP FOR LR=1,2 C FIRST TIME: TRY TO USE SEQUENCES THAT BECOME RECURSIVE C PkRVIDED KRECUR IS SET AT 1{ ES) C PROVIDED LEVEL IS GREATER THAN 1 C SECUND TIME: USE ANY SEQUENCE C LR=1 IF (LEVEL.EQ.1) LR=2 IF IKRECUR.EQ.O) LR=2 2050 CONTINUE C C C LOOP FOR LTA=1, ALLOW C LTA=l 2070 CONTINUE C C C LOCP FOR M C M=1,...,MBE IF VDIR=+i C M=M2,...,MI IF VDIR=-1 C M=MBE GhLY IF STRICT=l C IF (V'IR.EQ.1) M=M1 IF(VDIR.EQ.-l) M=MBE IF (STRICT.EQ.1i M=MBE C

214 2080 CONTINUE DO 2120 LLL=,NST L=INDEX(LLL IF (SQLEN(L})NE.M) GO TO 2120 IF (TYPE(L).NE.LTA) GO TO 2120 C C MAKE SURE NO PUNCTUATION CCCURS IN M-SEQUENCE OF Y IF (MUOE.EQ.1) GO TO 2100 DO 2090 JD=i,M TMJD=T-M+J D IF [Y(TMJD).NE.PUNC) GG TO 2090 QLAST=TMJD IF (VDIR.EQ.-1) GO TO 2130 GO TO 2460 2090 CONTINUE 2100 CeNTINUE C C RECREATE CONTEXT-SEQUENCE AND PRECICTED-SEQUENCE CALL CONV ( PREM,CSQCTXT(L)) CALL CONV (PGST,M,C,SQPREO(L)) C C MATCH Y AGAINST CONTEXT AND PREDICTED SEQUENCES DO 2110 JD=1,M TMJD=T-M+JD IF (POST(JD).EQ.DONT) GO TO 2110 IF (PAE(JD).E.PERC.AND.POST (JD).NE.Y (TJD)) GO TO 2120 IF (POST(JD).EQ.CTXT.ANU.PKE(JD).NE.Y(TMJD)) GO TO 2120 2110 CONTINUE C LOCATED SEQUENCE OF LENGTH M AT TIME T C GO TO 2150 2120 CONTINUE 2130 CONTINUE C C C END OF LOOP FOR M C M=M+VDI R IF (STRICT.EC.1) GO TO 2140 IF (ViIR.EQ.-l.AND.M.GE.ML) GO TO 2C80 IF (VDIR.EQ.1.AND-M.LE.MBE) GO TO 2080 C 2140 CONTINUE L=O 2150 CONTINUE SEQNO(T)=L C C C THE SEQUENCE THAT IS FOUND IS C 1. OF LONGEST LENGTh FROM Ml TO MBE OF THOSE SEQUENCES C ENDING AT TIME T (IF VDIR -1 ) C OF SHORTEST LENGTH (IF VDIR +1 C OF LENGTH MBE (IF STRICT = 1 C 2. OF HIGHEST ALLOWABLE TYPE C 3.OF HIGHEST P(I) POSSIBLE C 4.OF SIMPLEST GRAMMATICAL TYPE C IF (L.EQ.O) GG TO 2450 1= SQ LEN( L)

215 C C IS THEkE RlOM (FROM LEFT) FOR AN M-SEQUENCE IF (T+1-M.LE.LAST) GO TO 2450 C C ARE ALL 1\ POSITIONS AS YET UNRECODED? DO 2160 JD= 1, TMJD=T-M+JD IF (COJNE(TMJD).EQ.1) GO TO 2450 2160 CONTINUE C C CGNTRIBUTICN TO H IF (P(L).EQ.0.O) GO TO 2170 NTM=NTM+1 IF(WXL.E'.O) MW'Q=1 IF (CXL.EQ.l).WOQM HP=HP- (L)*ALOG(P(L)) * MwC 2170 CONTINUE C C CHECK CURRENT L AGAINST LLIST C QNEW: 1= L IS NEW TO LLIST O=L IS OLD C CALL LOOK (LLIST,MTSIZEtNQL ILINEQNEW) LAST=T TN=TN-M KL=O T=T-M C C WRITE CONSEQUENT (RIGHT) SIDE OF RULE OF PRODUCTION C CONSEQUENT SIOt IS OF LENGTH M C IF (ONEW.NE.1) GO TO 2190 DO 2180 JP=1,M SCRD(JP)=Y( TJP) 2180 CONTINUE 2190 CONTINUE C C C WRITE ANTECEDENT SIDE OF RULE OF PRODUCTION C C RIT=CONTRULS FORM OF ANTECEDENT (LEFT) SIDE OF C RULE OF PRODULCTION C 1=(CONTEXT-FREE)-= REPLACE ENTIRE PREDICTOR AND PREDICTED C SEQUENCES i ITH uNE NEw NON-TERMINAL SYMBOL C O=REPLACE ALL CN'TIGLCUS PKEDICTOR SYMBCLS WITH C NEW NUN-TEhMINAL SYMtOL C IF.(RIT.EQ.O) GO TO 2210 C CCNTEXT-FR'EE CASE (RIT=1) C REPLACE ENTIRE ANTEECEENT LEFT) SIDE WITH I NEW NON-TERMINAL C C C dRANCH TO RECURSIVE TEST, IF LR=1 KRET=1 IF (LR.EQ.1 ) GO TO 2810 2200 CONTINUE IF (LR.EQ.1.AiD.LRX.EQ.) GO TO 2450 TN=TN+1

216 IF (LR.EQ.1.AtD.LRX.EQ. ) YNEW(TN) RSYM IF (LREC.2) YNEW(TN)=C-l+ILINE+NCT KL=KL+-1 SCRJ(KL)=YNEW (TN) CALL IXSPRY (DONEtNS, 1T+l*T+Mj T=T+M IF(M.GE.2) CALL IXSPRY(YNEWNSFILL, TN+is TN+M-1 ) TN=TN+M-1 GO-TO 2360 C C RIT=O CASE C 2210 CONTINUE CALL CONV (PRE,M,C,SQCTXT(L)) CALL CONV (PGST,MC,SQPREO(L)) CALL MCRE (PRE,POST,tMASKDGNTCTXTPERCMODE,PUNC) C C IS MASK STRICTLY CONTEXT-SENSITIVE C OR STRICTLY UNRESTRICTED REWRITE C CALL MTEST ( MASK, MPERC, ~2305, &2215 ) C C (***** *********** * ******** ********************8r*********t****** C STRICTLY CCNTEXT-SENSITIVE CASE C C FIND CONTEXT ON LEFT, IF ANY C 2215 CONTINUE LEFT=O R!IGHT=O DO 2220 LEFT=1,M IF (POST(LEFT).LCQ.DCNT.CR.POST(LEFT).EQ.CTXT) GO TO 2230 2220 CONTINUE 2230 CONTINUE LEFT=LEFT-1 C FIND CONTEXT ON RIGHT, IF ANY DO 2240 RL=1,M RLQ=M+ 1-RL IF (POST(RLQ.EQ. DONT.OR.POST(RLQ).EQ.CTXT) GO TO 2250 2240 CONTINUE 2250 CONTINUE RIGHT.=RL-1 C C C BRANCH TO ~ECURS IVE TEST, IF LR=1 KRET=2 IF (LR.EQ.1) GO TO 2810 2260 CONTINUE IF (LR.EQ.1oAND.LRX.EQ.O) GO TO 2450 CALL IXSPRY (DONEtNS, 1T+sIT+M) C C LEFT END jIF ANY) C IF (LEFT.EQ.0J GO TO 2280 00 2270 KT=1,LEFT T=T+1 TN=TN+I YNEWITN)=Y(T) KL=KL+1

217 SCRJ (KL =YNEW (TN) 2270 CONTINUE 2280 CONTINUE C C C REPLACE 'MID' CONTIGUOUS SYMBOLS IN Y WITH 1 NEW NON-TERMINAL C TN=TN+I IF (LR.EQ.1.AKD.LRX.EQ.l) YNEW(TN)=RSYM IF (LR.Eq. 2) YNEW(TN)=C-1+ILINE+NCT KL=KL+1 SCRJ(KL) =YNEW(TN) MID=M-R IGHT-LEFT IF(MID.GE.2) CALL IXSPRY(YNEWNSFILL, TN+1, TN+MID-1 ) TN=TN+MI D-1 T=T+MID C C RIGHT END (IF ANY) IF (RIGHT.EQ.O) GO TO 2300 DO 2290 KP=1,RIGHT T=T+1 Tr,:=TN+I YNEW(TN =Y(T) KL=KL+I SCRJ (KL)=YNEW(TN) 2290 CONTINUE 2300 CONTINUE GO TO 2360 C C STRICTLY UNRESTRICTED-REWRITE CASE C 2305 CONTINUE C ***RPAMRECUR,DCNE NB=-1 KA=O 2310 CONTINUE KA=KA+1 IF (KA.GT.M) GO TO 2350 IF (MASK(KA). E.NB) GO TO 2330 Nb=MASK(KA) KRET=3 IF (LR.EC.1) GO TO 2810 2320 CONTINUE IF (LR.EQl..AND.LRX.EQ.O) GO TO 2450 TN=TN+1 IF (LR.Q. 1.AND.LRX.EQ.1) YNEWLTN)=RSYM IF (LR.EQ.2) YNEW(TN)=C-1+ILINE+NCT NCT=NCT+l KL=KL+1 SCRJ (KL)=YNEW(TN) T=T+l GO TO 2310 2330 CONTINUE IF (MASK KA).EC.PERC) GO TO 2310 T=T+I TN-TN+ YNEW(TN)=Y(T) KL=KL+ 1

218 SCRJ (KL)=YNEW(TN) 2350 CONTINUE GO TC 2360 C C C SAVE ANTECEDENT (LEFT) SIDE OF RULE OF PRODUCTION IN RPA C 2360 CONTINUE IF (QNEW.NE.1) GO TO 2410 NHER=NHER+1 NRP=NRP+ RPLEV (NRP)= LEVEL POX(NRP)=0.0 IF(LRX.EQ.O) PCX(NRP) = P(L) IF(LR.EQ.2.OR. LR.EQ.L.AND.LRX.EQ.O ) RPIX(NRP)= -2 C NB: RPIX CAN ALSC BE SET IN RECURSIVE ROUTINE RPALEN(NRP)=KL DO 2370 JG=1,KL RPA(NRP, JG) =SCRJ(JG) 2370 CCNTINUE C C SAVE CONSECUENT (RIGHT) SIDE OF RULE OF PRODUCTICN I~ RP-C' RPCLEN(NRP)=M 00 2380 JP=l,M RPC(NRP,JP)=SCRD(JP) 2380 CONTINUE IF (RIT.EO.O) GO TO 2390 RPAM(NRP, )=PERC GO TO 2410 2390 CONTINUE DU 2400 JB=1,M RPAM(NP,JB) =ASK(JB) 2400 CONTINUE 2410 CONTINUE C C PRINT RULES OF PRODUCTION ONTO 6, AS THEY ARE PRODUCED C SCRJ=ANTECEODNT SIDE OF RULE OF PRODUCTION C SCRC=CCNSEQUENT SIDE OF RULE OF PRCDUCTION C IF (QNEW.NE.1) GC TO 2430 CALL SYM80L (SCRODSCRDALMtMMQtCHARtNCHAR,LETTER,N26) CALL SYMbGL (SCRJ,SCRJALKLtM,CHARNCARCHR,LETTER,N26) WRITE (6b,2420) (SCRJAL (J), J,MM) ARROW, (SCRDAL(JJ ) J=1 MMQ) 2420 FORMAT ( 0 ', 129AI/(1X,1 9X, 110A1/J) 2430 CONTINUE C C IF NEW RULE WAS FOUND, OR NEW RECURSIVE RULE WAS FORMED, C TRY TO APPLY IT THRCUGHCUT STRING Y C I=NRP MG=RPC LEN( I) IF(STRICT.EQ.1.AND. MG-NE. MBE) GO TO 2450 3538 CONTINUE NGL=O C C LOOP FOR TX=M1,... C LASTX=O QLASTX=O

      TNX=M1
      TX=M1
 3550 CONTINUE
      IF (VSTART.EQ.1.AND.TX-QLASTX.LT.MBE) GO TO 3630
C
C     IS THERE ROOM (FROM LEFT) FOR AN M-SEQUENCE
      IF (TX+1-MG.LT.LASTX) GO TO 3630
C
C     ARE ALL M POSITIONS AS YET UNRECODED?
      DO 3570 JD=1,MG
      TMJD=TX-MG+JD
      IF (DONE(TMJD).EQ.1) GO TO 3630
 3570 CONTINUE
C
C     MAKE SURE NO PUNCTUATION OCCURS IN M-SEQUENCE OF Y
      IF (MODE.EQ.1) GO TO 3590
      DO 3580 JD=1,MG
      TMJD=TX-MG+JD
      IF (Y(TMJD).NE.PUNC) GO TO 3580
      QLASTX=TMJD
      GO TO 3630
 3580 CONTINUE
 3590 CONTINUE
C
C     MATCH Y WITH CONSEQUENT (RIGHT) SIDE OF RULE I
      DO 3600 JD=1,MG
      IF (RPAM(I,JD).EQ.DONT) GO TO 3600
      TMJD=TX-MG+JD
      IF (RPC(I,JD).EQ.Y(TMJD)) GO TO 3600
      IF (LRX.EQ.1.AND.Y(TMJD).EQ.SMID.AND.RPC(I,JD).EQ.RSYM)
     1 GO TO 3600
      GO TO 3630
 3600 CONTINUE
C
C     LOCATED SEQUENCE OF LENGTH M AT TIME T
      SEQNO(TX)=L
      NGL=NGL+1
      LASTX=TX
C
C     CONTRIBUTION TO H
      IF (POX(I).EQ.0.0) GO TO 3610
      NTM=NTM+1
      IF (WXL.EQ.0) MWQ=1
      IF (WXL.EQ.1) MWQ=M
      HP=HP-POX(I)*ALOG(POX(I)) * MWQ
 3610 CONTINUE
C
C     RECODE
      TNX=TNX-MG
      TX=TX-MG
      MAA=RPALEN(I)
      DO 3620 JD=1,MAA
      TNX=TNX+1
      YNEW(TNX)=RPA(I,JD)
 3620 CONTINUE
      IF (MG.GT.MAA) CALL IXSPRY (YNEW,NS,FILL,TNX+1,TNX+MG-MAA)
      TNX=TNX+MG-MAA
      CALL IXSPRY (DONE,NS,1,TX+1,TX+MG)
      TX=TX+MG
C

      YRP(NY)=SCRD(JP)
 2490 CONTINUE
      NY=NY+1
      YRP(NY)=YPUNC
      CALL SYMBOL (SCRJ,SCRJAL,MA,MM,CHAR,NCHAR,LETTER,N26)
      CALL SYMBOL (SCRD,SCRDAL,MC,MW,CHAR,NCHAR,LETTER,N26)
      WRITE (7,2420) (SCRJAL(J),J=1,MM),ARROW,(SCRDAL(JJ),JJ=1,MW)
 2500 CONTINUE
C
C     COPY INTO PERMANENT RTIX THE TEMPORARY VECTOR RPIX
C
      IF (NRP.EQ.0) GO TO 2510
      NTP=NRP
      CALL IEQUAL (RTIX,NRP,RPIX)
 2510 CONTINUE
 2520 CONTINUE
C
C     PRINT OUT ORIGINAL STRING Y
C
      WRITE (6,2720)
      CALL SYMBOL (Y,ALPHA,N,MMA,CHAR,NCHAR,LETTER,N26)
      WRITE (6,2730) (ALPHA(III),III=1,MMA)
C
C
C     REMOVE 'FILL' SYMBOLS FROM YNEW
      CALL REMOVE ( YNEW, YNEW, YNL, YNL, FILL )
C
C     PRINT OUT NEW STRING YNEW
C
      WRITE (6,2732)
      CALL SYMBOL (YNEW,ALPHA,YNL,MMA,CHAR,NCHAR,LETTER,N26)
      WRITE (6,2730) (ALPHA(III),III=1,MMA)
C
C
C     TRY TO RE-USE ALL RULES DEVELOPED IN ALL POSSIBLE WAYS
C
      IF (NRP.EQ.0) GO TO 2652
 2538 CONTINUE
      NGL=0
C
C     LM=1,2
C     FIRST TIME: TRY TO USE SEQUENCES THAT ARE RECURSIVE
C     SECOND TIME: TRY ANY OTHER SEQUENCE
C
      LM=1
      IF (KRECUR.EQ.0) LM=2
 2540 CONTINUE
C
C     LOOP FOR T=M1,...
C
      LAST=0
      QLAST=0
      TN=M1
      T=M1
 2550 CONTINUE
      IF (VSTART.EQ.1.AND.T-QLAST.LT.MBE) GO TO 2631
      IF (YNEW(TN).EQ.FILL) GO TO 2631
      DO 2630 I=1,NRP
      M=RPCLEN(I)
      IF (STRICT.EQ.1.AND.M.NE.MBE) GO TO 2630

C     WHEN LM=1, PROCESS ONLY RECURSIVE RULE (RPIX > 0)
      IF (LM.EQ.1.AND.RPIX(I).LE.0) GO TO 2630
C     WHEN LM=2, PROCESS NON-RECURSIVE RULE (RPIX=-2)
      IF (LM.EQ.2.AND.RPIX(I).NE.-2) GO TO 2630
C     DEACTIVATED RULES ARE SKIPPED
C
C     IS THERE ROOM (FROM LEFT) FOR AN M-SEQUENCE
      IF (T+1-M.LT.LAST) GO TO 2630
C
C     MAKE SURE NO PUNCTUATION OCCURS IN M-SEQUENCE OF Y
      IF (MODE.EQ.1) GO TO 2590
      DO 2580 JD=1,M
      TMJD=T-M+JD
      IF (YNEW(TMJD).NE.PUNC) GO TO 2580
      QLAST=TMJD
      GO TO 2630
 2580 CONTINUE
 2590 CONTINUE
C
C     MATCH Y WITH CONSEQUENT (RIGHT) SIDE OF RULE
      DO 2600 JD=1,M
      IF (RPAM(I,JD).EQ.DONT) GO TO 2600
      TMJD=T-M+JD
      IF (RPC(I,JD).NE.YNEW(TMJD)) GO TO 2630
 2600 CONTINUE
C
C     LOCATED SEQUENCE OF LENGTH M AT TIME T
C
      NGL=NGL+1
      LAST=T
C
C     CONTRIBUTION TO H
C
      IF (POX(I).EQ.0.0) GO TO 2610
      NTM=NTM+1
      IF (WXL.EQ.0) MWQ=1
      IF (WXL.EQ.1) MWQ=M
      HP=HP-POX(I)*ALOG(POX(I)) * MWQ
 2610 CONTINUE
C
C     RECODE
C
      TN=TN-M
      T=T-M
      MA=RPALEN(I)
      DO 2620 JD=1,MA
      TN=TN+1
      YNEW(TN)=RPA(I,JD)
 2620 CONTINUE
      IF (M.GT.MA) CALL IXSPRY (YNEW,NS,FILL,TN+1,TN+M-MA)
      TN=TN+M-MA
      CALL IXSPRY (DONE,NS,1,T+1,T+M)
      T=T+M
 2630 CONTINUE
C
C     END OF LOOP FOR T AND TN
C
 2631 CONTINUE
      T=T+1
      TN=TN+1

      IF (T.LE.YNL) GO TO 2550
C
C     END OF LOOP FOR LM=1,2
C
      LM=LM+1
      IF (LM.LE.2) GO TO 2540
 2650 CONTINUE
C
C
C     GO THROUGH LIST OF RULES OF PRODUCTION UNTIL NO APPLICATIONS ARE
C     POSSIBLE
C
      IF (NGL.NE.0) GO TO 2538
C
C     IF TRIAL=1, DELETE NHER RULES FROM TABLES
C
 2652 CONTINUE
      IF (TRIAL.EQ.1) NRP=NRP-NHER
C
C     IDENTITY RULES
C     (APPROXIMATE COUNT)
C
      NIR=0
      IF (WXL.EQ.0) GO TO 2659
      CZ=0
      XG=1
      DO 2658 IJ=1,N
      IF (Y(IJ).EQ.PUNC) GO TO 2653
      IF (DONE(IJ).EQ.1) GO TO 2657
      XG=0
      CZ=CZ+1
      IF (CZ.LT.MBE) GO TO 2658
 2653 CONTINUE
      NIR=NIR+1
      NTM=NTM+1
      XG=1
      CZ=0
      GO TO 2658
 2657 CONTINUE
      IF (XG.EQ.0) GO TO 2653
      XG=1
      CZ=0
 2658 CONTINUE
 2659 CONTINUE
C
C
C     COMPUTE H
C
      FIRST(MBE)=2499.
      GO TO (2660,2670,2680,2690), HSEL
 2660 CONTINUE
      IF (NTM.NE.0) FIRST(MBE)=HP/(FC*NTM)
      GO TO 2700
 2670 CONTINUE
      IF (NTM.NE.0) FIRST(MBE)=-(HP*NQ)/(FC*NTM)
      GO TO 2700
 2680 CONTINUE
      FIRST(MBE)=(HP*MBE)/(FC*M1)
      GO TO 2700
 2690 CONTINUE

      FIRST(MBE)=HP/FC
      GO TO 2700
 2700 CONTINUE
      SECOND(MBE)=NQ*LAMBDA
      SECOND(MBE)=(NQ+NIR)*LAMBDA
      THIRD(MBE)=LAMR*NRECUR
      H(MBE)=FIRST(MBE)+SECOND(MBE)+THIRD(MBE)
      WRITE (6,2710) FIRST(MBE),SECOND(MBE),THIRD(MBE),H(MBE),NQ,
     1 NRECUR,NIR,NTM
 2710 FORMAT ( '0' /
     2 ' ENTROPY TERM                          ', F10.5 /
     3 ' PARSIMONY TERM                        ', F10.5 /
     4 ' RECURSIVE PARSIMONY TERM              ', F10.5 /
     5 '                                         --------' /
     6 ' VALUE OF H FOR THIS RECODING..........', F10.5 /
     7 /
     8 ' NUMBER OF RULES OF PRODUCTION         ', I7 /
     9 ' NUMBER OF RECURSIVE RULES             ', I7 /
     A ' NUMBER OF IDENTITY RULES              ', I7 /
     B ' NUMBER OF TIMES RULES ARE APPLIED     ', I7 )
C
C     PRINT OUT ORIGINAL STRING Y
C
      WRITE (6,2720)
 2720 FORMAT ( '0CURRENT STRING Y ' / '0' )
      CALL SYMBOL (Y,ALPHA,N,MMA,CHAR,NCHAR,LETTER,N26)
      WRITE (6,2730) (ALPHA(III),III=1,MMA)
 2730 FORMAT ( '0', (/1X,60A1) )
C
C     AGAIN REMOVE ALL 'FILL' SYMBOLS FROM YNEW
C
      CALL REMOVE ( YNEW, YNEW, YNL, YNL, FILL )
C
C     PRINT OUT NEW STRING YNEW
C
      WRITE (6,2732)
 2732 FORMAT ( '0NEW STRING ' / '0' )
      CALL SYMBOL (YNEW,ALPHA,YNL,MMA,CHAR,NCHAR,LETTER,N26)
      WRITE (6,2730) (ALPHA(III),III=1,MMA)
C
C     PRINT OUT STRING YRP
C
C
C     PRINT GRAPH OF P(I) ACTUALLY USED IN RECODING
C
      IF (.NOT.(PRG.EQ.2.OR.PRG.EQ.1.AND.TRIAL.EQ.2)) GO TO 2790
      IF (NQ.EQ.0) GO TO 2790
      CALL ZERO (NUMT,5)
      DO 2740 I=1,NQ
      L=LLIST(I)
      PD(I)=P(L)
      K=TYPE(L)
      NUMT(K)=NUMT(K)+1
 2740 CONTINUE
      IF (PRG.EQ.0) GO TO 2760
      CALL XFSORT (PD,MTSIZE,INDEP,1,NQ)
      DO 2750 I=1,NQ
      DESC(I)=TYPE(LLIST(INDEP(I)))
 2750 CONTINUE

      CALL GRAPH (PD,NQ,NUMT,LEVEL,MBE,DESC)
 2760 CONTINUE
C
C
C     PRINT STATISTICS ON P(I)'S ACTUALLY USED IN RECODING
C
      WRITE (6,2770) (NUMT(IIM),IIM=1,5)
 2770 FORMAT (
     1 ' NUMBER OF TYPE 1 SEQUENCES (STRUCTURAL)', I7 /
     2 ' NUMBER OF TYPE 2 SEQUENCES (MESSAGE)   ', I7 /
     3 ' NUMBER OF TYPE 3 SEQUENCES (MESSAGE)   ', I7 /
     4 ' NUMBER OF TYPE 4 SEQUENCES (MESSAGE)   ', I7 /
     5 ' NUMBER OF TYPE 5 SEQUENCES (NOISE)     ', I7 )
      NZZ=ISUM(NUMT,5)
      WRITE (6,2780) NZZ
 2780 FORMAT (
     1 ' NUMBER OF P(I)                         ', I7 )
 2790 CONTINUE
C
C     IF TRIAL=2, SUBSTITUTE RECODED YNEW INTO Y
C     NEW SIZE OF ALPHABET
C     NEW LENGTH OF Y
C     SUBSTITUTE NEW STRING (YNEW) INTO Y VECTOR
C
      IF (TRIAL.NE.2) GO TO 2800
      C=C+NQ+NCT
      N=YNL
      CALL IEQUAL (Y,N,YNEW)
 2800 CONTINUE
      RETURN
C
C     CHECK FOR RECURSION
C     ---------------------------------------------------------------
C
 2810 CONTINUE
      LRX=0
      IF (KRECUR.EQ.0) GO TO 2910
      IF (QNEW.EQ.0) GO TO 2910
      IF (LEVEL.EQ.1) GO TO 2910
      IF (NRP.EQ.0) GO TO 2910
      DO 2900 I=1,NRP
C
C     RULE I IS OF PREVIOUS LEVEL
      IF (RPLEV(I).NE.LEVEL-1) GO TO 2900
C     CONSEQUENT (RIGHT) SIDE OF RULE I FROM PREVIOUS LEVEL, AND
C     CONSEQUENT SIDE OF CURRENT RULE ARE OF SAME LENGTH
C     NB: THIS IS PRE-CONDITION FOR ISOMORPHISM (SEE BELOW)
      LCC=M
      IF (RPCLEN(I).NE.LCC) GO TO 2900
      MA=RPALEN(I)
      DO 2890 J=1,MA
C
C     SYMBOL 'SMID' IS A 'PREDICTED' SYMBOL ON THE ANTECEDENT (LEFT)

C     SIDE OF RULE I, AT PREVIOUS LEVEL
      IF (RPAM(I,J).NE.PERC) GO TO 2890
      SMID=RPA(I,J)
      DO 2880 JJ=1,LCC
C
C     SYMBOL 'SMID' APPEARS ON CONSEQUENT (RIGHT) SIDE OF RULE AT
C     THIS LEVEL
      IF (SCRD(JJ).NE.SMID) GO TO 2880
C
C     ISOMORPHISM OF CONSEQUENT (RIGHT) SIDE OF RULE I FROM PREVIOUS
C     LEVEL, AND CONSEQUENT (RIGHT) SIDE OF CURRENT RULE
      BILEN=0
      BIILEN=0
      DO 2830 KK=1,LCC
      CALL LOOK (BI,M2MAX,BILEN,RPC(I,KK),BILEV,BINEW)
      CALL LOOK (BII,M2MAX,BIILEN,SCRD(KK),BIILEV,BIINEW)
      IF (BILEV.NE.BIILEV) GO TO 2880
 2830 CONTINUE
C
C     HAVE RECURSION
C
      LRX=1
      NRECUR=NRECUR+1
      WRITE (6,2840) NRECUR
 2840 FORMAT ( '0RECURSION NO. ', I3 )
C
C     IDENTIFY RECURSIVE SYMBOL
      RSYM=RPC(I,JJ)
C
C     ABANDON GENERATION OF CURRENT RULE
      IF (NQ.NE.0) NQ=NQ-1
C
C     INSERT RSYM WHEREVER SMID OCCURRED IN RULES
      DO 2870 IA=1,NRP
      LZA=RPALEN(IA)
      DO 2850 JA=1,LZA
      IF (RPA(IA,JA).EQ.SMID) RPA(IA,JA)=RSYM
 2850 CONTINUE
      LZC=RPCLEN(IA)
      DO 2860 JA=1,LZC
      IF (RPC(IA,JA).EQ.SMID) RPC(IA,JA)=RSYM
 2860 CONTINUE
 2870 CONTINUE
C
C     GENERATE A RECURSIVE RULE AND DEACTIVATE RULE I
      RPIX(I)=0
      RPIX(NRP+1)=I
      DO 2872 JP=1,M
      SCRD(JP)=RPC(I,JP)
 2872 CONTINUE
      DO 2874 JG=1,MA
      SCRJ(JG)=RPA(I,JG)
 2874 CONTINUE
      GO TO 2910
C
C     END OF LOOP FOR JJ
 2880 CONTINUE
C
C     END OF LOOP FOR J
 2890 CONTINUE

C
C     END OF LOOP FOR I
 2900 CONTINUE
C
 2910 CONTINUE
      GO TO (2200,2260,2320), KRET
C
      END
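For reference, the figure of merit H assembled at statements 2660 through 2710 in the routine above can be written out explicitly. The equation below is a reconstruction from the listing itself, not a formula printed in the original; HSEL selects among four variants of the entropy term, and the form shown is the HSEL = 1 case:

$$
H(M) \;=\; \frac{H_P}{F_C\,N_{TM}} \;+\; \lambda\,(N_Q+N_{IR}) \;+\; \lambda_R\,N_{RECUR},
\qquad
H_P \;=\; -\sum_i w_i\,p_i\,\log p_i ,
$$

where the sum runs over the NTM rule applications counted during the recoding, the weight w_i is 1 or M according to the switch WXL, N_Q is the number of rules of production (the length of LLIST), N_IR is the approximate count of identity rules, N_RECUR is the number of recursive rules, and LAMBDA and LAMR are the two parsimony weights.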

      SUBROUTINE BEST ( H, M1, M2, HBEST, MBEST, MINM, FIRST, SECOND,
     1 THIRD, M2MAX )
      IMPLICIT INTEGER (P-Z)
      REAL H(M2MAX)
      REAL FIRST(M2MAX)
      REAL SECOND(M2MAX)
      REAL THIRD(M2MAX)
      REAL HBEST
C
C     SUBROUTINE TO FIND BEST H
C     OUT: HBEST, MBEST
C
      MO=M1
      IF ( MO.LT.MINM ) MO=MINM
      HBEST=H(MO)
      MBEST=MO
      DO 10 I=MO,M2
      IF ( HBEST.LE.H(I) ) GO TO 10
      HBEST=H(I)
      MBEST=I
   10 CONTINUE
      RETURN
      END
      SUBROUTINE GETMSK (MFR,GRAM,M,MASK,CTXT,PERC,DONT,M2MAX,NMASK,
     1 IPART,*)
      IMPLICIT INTEGER (A-Z)
      INTEGER MASK(M2MAX)
      IF (NMASK.NE.0) GO TO 2010
      JJZ=2
      KKZ=1
      LATCH=1
      MLIM=M
      IIII=0
      IREG=0
      NTOT=MFR**M-2
      IPART=1
 2010 CONTINUE
C
C     LOOP FOR IPART
 2020 CONTINUE
C
      DO 2030 IG=1,M
      MASK(IG)=CTXT
 2030 CONTINUE
      GO TO (2040,2050,2110,2140), IPART
C
C     TYPE 3 MASK (REGULAR)
 2040 CONTINUE
      IF (M.GT.2) GO TO 2190
      IREG=IREG+1
      IF (IREG.GT.2) GO TO 2190
C     LEFT END
      IF (IREG.EQ.1) MASK(M)=PERC
C     RIGHT END
      IF (IREG.EQ.2) MASK(1)=PERC

      GO TO 2180
C
C     RIGHT-SENSITIVE AND LEFT-SENSITIVE
 2050 CONTINUE
      IF (M.LE.2) GO TO 2190
      GO TO (2060,2090), LATCH
C
C     RIGHT-SENSITIVE
 2060 CONTINUE
      MLIM=MLIM-1
      IF (MLIM.GT.0) GO TO 2070
      LATCH=2
      MLIM=M
      GO TO 2050
 2070 CONTINUE
      DO 2080 JM=1,MLIM
      MASK(JM)=PERC
 2080 CONTINUE
      GO TO 2180
C
C     LEFT-SENSITIVE
 2090 CONTINUE
      MLIM=MLIM-1
      IF (MLIM.LE.0) GO TO 2190
      DO 2100 JL=1,MLIM
      JW=M+1-JL
      MASK(JW)=PERC
 2100 CONTINUE
      GO TO 2180
C
C     TYPE 1 MASK (CONTEXT-SENSITIVE)
 2110 CONTINUE
      IF (M.LE.2) GO TO 2190
      KKZ=KKZ+1
      IF (KKZ.LE.M-1) GO TO 2120
      JJZ=JJZ+1
      IF (JJZ.GT.M-1) GO TO 2190
      KKZ=JJZ
 2120 CONTINUE
      DO 2130 LK=JJZ,KKZ
      MASK(LK)=PERC
 2130 CONTINUE
      GO TO 2180
C
C     TYPE 0 MASK (UNRESTRICTED REWRITE)
 2140 CONTINUE
      IF (M.LE.2) GO TO 2190
      IIII=IIII+1
      IF (IIII.GT.NTOT) GO TO 2190
      CALL CONV (MASK,M,MFR,IIII)
      DO 2150 IJK=1,M
      IF (MASK(IJK).EQ.0) MASK(IJK)=CTXT
      IF (MASK(IJK).EQ.1) MASK(IJK)=PERC
 2150 CONTINUE
C
      CALL MTEST (MASK,M,PERC,&2180,&2140)
 2180 CONTINUE
      NMASK=NMASK+1
      RETURN
 2190 CONTINUE

      IPART=IPART+1
      IF (IPART.LE.4-GRAM) GO TO 2020
      RETURN 1
      END
      SUBROUTINE MTEST (MASK,M,PERC,*,*)
      IMPLICIT INTEGER (A-Z)
      INTEGER MASK(M)
C     TEST MASK FOR GRAMMATICAL TYPE
C     RETURN 1: UNRESTRICTED RE-WRITE
C     RETURN 2: CONTEXT-SENSITIVE
      NB=-1
      CCC=-1
      DO 2170 IK=1,M
      IF (MASK(IK).EQ.NB) GO TO 2170
      NB=MASK(IK)
      CCC=CCC+1
 2170 CONTINUE
      IF (CCC.GE.3.OR.(CCC.EQ.2.AND.MASK(1).EQ.PERC)) RETURN 1
      RETURN 2
      END
      SUBROUTINE GRID ( PRE, POST, M, DONT, CTXT, PERC, SCRG, ZLM )
      IMPLICIT INTEGER (A-Z)
      INTEGER PRE(M)
      INTEGER POST(M)
      INTEGER SCRG(M)
      INTEGER ZLM(3)
C     #=DON'T CARE SYMBOL   .=CONTEXT PART   %=PREDICTED PART
      DO 100 I=1,M
      K=I
      IF ( POST(I).EQ.DONT ) SCRG(K)=ZLM(1)
      IF ( POST(I).EQ.CTXT ) SCRG(K)=ZLM(2)
      IF ( PRE(I).EQ.PERC ) SCRG(K)=ZLM(3)
  100 CONTINUE
      RETURN
      END
      SUBROUTINE GRAPH ( X, N, NUMT, LEVEL, MBE, TY )
      IMPLICIT INTEGER (A-Z)
      REAL X(N)
      INTEGER NUMT(5)
      INTEGER SPOT(5)
      INTEGER IMAGE(51,101)
      INTEGER TY(N)
      REAL COMP
      REAL ORDIN
      DATA BLANK / ' ' /
      DATA VERT / 'I' /
      DATA HORZ / '-' /
      DATA PLUS / '+' /
      DATA STAR / '*' /
C
C     GRAPH PRINTS GRAPH OF P(I)'S USED IN RECODING
C
      WRITE (6,1800) LEVEL, MBE
 1800 FORMAT ( ' GRAPH OF P(I) USED IN RECODING FOR LEVEL', I2,
     1 ' AND M OF', I4 / ' ' )
C
      IF ( N.LE.101 ) GO TO 150
      COMP=N/100.
      NWIDE=101
      GO TO 170

  150 CONTINUE
      NWIDE=N
      COMP=1.0
  170 CONTINUE
C
      DO 200 J=1,NWIDE
      DO 200 I=1,51
      IMAGE(I,J)=BLANK
  200 CONTINUE
C
C     VERTICAL LINES AT RIGHT AND LEFT EDGE
      DO 235 I=2,50
      IMAGE(I,1)=VERT
      IMAGE(I,NWIDE)=VERT
  235 CONTINUE
C
C     FIND LOCATIONS OF VERTICAL LINES DIVIDING GRAPH
      IS=0
      DO 220 K=1,5
      IS=IS+NUMT(K)
      SPOT(K)=IS/COMP
      IF (SPOT(K).EQ.0) SPOT(K)=1
  220 CONTINUE
C
C     INSERT VERTICAL LINES DIVIDING GRAPH
      DO 230 K=1,5
      DO 230 I=2,50
      IF (I.EQ.26) GO TO 230
      IMAGE(I,SPOT(K))=VERT
  230 CONTINUE
C
C     HORIZONTAL LINES
      DO 240 II=1,3
      I=1+(II-1)*25
      DO 240 J=1,NWIDE
      IMAGE(I,J)=HORZ
      IF ( J.EQ.1 .OR. J.EQ.NWIDE ) IMAGE(I,J)=PLUS
      DO 238 L=1,5
      IF (J.EQ.SPOT(L)) IMAGE(I,J)=PLUS
  238 CONTINUE
  240 CONTINUE
C
C     INSERT POINTS TO BE PLOTTED INTO GRAPH
      DO 300 J=1,N
      JJ=J/COMP
      IF (JJ.EQ.0) JJ=1
      II=1+50.*X(J)
      IMAGE(II,JJ)=STAR
  300 CONTINUE
C
C
C     PRINT GRAPH
C
      DO 500 I=1,51
      II=52-I
      ORDIN=.02*(II-1)
      WRITE (6,430) ORDIN, (IMAGE(II,J), J=1,NWIDE)
  430 FORMAT ( ' ', F4.2, 1X, 101A1 )
  500 CONTINUE
C

C     PRINT TYPES OF THE P(I)'S ACROSS BOTTOM OF GRAPH
C
      WRITE (6,600) ( TY(I), I=1,NWIDE )
  600 FORMAT ( 6X, 101I1 )
C
      RETURN
      END
      SUBROUTINE SYMBOL ( Y, ALPHA, N, K, CHAR, NCHAR, LETTER, N26 )
      IMPLICIT INTEGER (A-Z)
      INTEGER Y(N)
      INTEGER ALPHA(N)
      INTEGER CHAR(NCHAR)
      INTEGER LETTER(N26)
C
C     CONVERTS NUMERICAL STRING INTO ALPHABETIC SYMBOL STRING
C
C     IN:
C     Y=STRING OF NUMBERS
C     0,1,2,...
C     N=LENGTH OF Y
C     NCHAR=NUMBER OF SYMBOLS IN VECTOR CHAR
C     CHAR=VECTOR OF TERMINAL SYMBOLS
C     INCLUDING INITIAL PUNCTUATION, IF ANY
C     ALSO INCLUDING THE SYMBOLS OF ZLM
C     LETTER=VECTOR OF NON-TERMINAL SYMBOLS
C     N26=SIZE OF VECTOR LETTER
C     OUT:
C     K=LENGTH OF ALPHA STRING PRODUCED
C     ALPHA=RESULTING STRING OF CHARACTERS
C     ALPHA MUST BE DIMENSIONED TO AT LEAST 4
C     NB: ALPHA AND Y MUST BE DISTINCT
C     IF MORE THAN N26 NON-TERMINALS ARE NEEDED, PARENTHESES AND
C     TWO LETTERS ARE USED (SEE BELOW)
C
      DATA LP / '<' /
      DATA RP / '>' /
      DATA NBLANK / ' ' /
C
      ALPHA(1)=NBLANK
      ALPHA(2)=NBLANK
      ALPHA(3)=NBLANK
      ALPHA(4)=NBLANK
      K=0
      IF (N.NE.0) GO TO 5
      K=4
      GO TO 999
    5 CONTINUE
      DO 15 I=1,N
C
C     TERMINAL SYMBOLS (INCLUDING INITIAL PUNCTUATION, IF ANY)
C     -INCLUDING ZLM
      IF (.NOT.(Y(I).LT.NCHAR)) GO TO 6
      K=K+1
      ALPHA(K) = CHAR( Y(I) + 1 )
      GO TO 15
C
C     FIRST N26 NON-TERMINAL SYMBOLS (SINGLE SYMBOL)
    6 CONTINUE
      IF (.NOT.(Y(I).LT.NCHAR+N26)) GO TO 7
      K=K+1

      ALPHA(K) = LETTER( Y(I)+1-NCHAR )
      GO TO 15
C
C     LATER NON-TERMINALS (PARENTHESES AND 2 NON-TERMINALS)
    7 CONTINUE
      L1=(Y(I)-NCHAR)/N26
      L2=Y(I) - NCHAR - L1*N26 + 1
      K=K+1
      ALPHA(K)=LP
      K=K+1
      ALPHA(K)=LETTER(L1)
      K=K+1
      ALPHA(K)=LETTER(L2)
      K=K+1
      ALPHA(K)=RP
   15 CONTINUE
  999 CONTINUE
      RETURN
      END
      SUBROUTINE MCRE ( PRE, POST, M, MASK, DONT, CTXT, PERC, MODE,
     1 PUNC )
      IMPLICIT INTEGER (A-Z)
      INTEGER PRE(M)
      INTEGER POST(M)
      INTEGER MASK(M)
C     CREATES MASK FROM PRE AND POST SEQUENCES
      DO 860 JD=1,M
      K=M+1-JD
      IF ( PRE(JD).EQ.PERC ) MASK(K) = PERC
      IF ( POST(JD).EQ.CTXT ) MASK(K) = CTXT
      IF ( POST(JD).EQ.DONT ) MASK(K) = DONT
  860 CONTINUE
      RETURN
      END
      SUBROUTINE GEN1 ( Y, N, NS, ALPHA, CHAR, NCHAR )
      IMPLICIT INTEGER (A-Z)
      INTEGER Y(NS)
      INTEGER ALPHA(NS)
      INTEGER CHAR(NCHAR)
      READ (5,12) N
   12 FORMAT ( I5 )
      READ (5,20) (ALPHA(I), I=1,N)
   20 FORMAT ( 80A1 )
      DO 30 I=1,N
      CALL IMATCH (CHAR, NCHAR, ALPHA(I), Y(I))
      Y(I)=Y(I)-1
   30 CONTINUE
      RETURN
      END
      SUBROUTINE LOC (NS,N,M1,STRICT,MBE,SQLEN,TYPE,ALLOW,PRE,C,Y,
     1 MTSIZE,M2MAX,POST,SQPRED,MODE,DONT,PUNC,INDEX,SEQNO,NST,SQCTXT,
     2 PERC,CTXT,VDIR,USED,QLAST,VSTART,M2)
      IMPLICIT INTEGER (A-Z)
      INTEGER Y(NS)
      INTEGER SEQNO(NS)
      INTEGER SQCTXT(MTSIZE)
      INTEGER SQLEN(MTSIZE)
      INTEGER SQPRED(MTSIZE)
      INTEGER TYPE(MTSIZE)
      INTEGER INDEX(MTSIZE)
      INTEGER USED(MTSIZE)

      INTEGER POST(M2MAX)
      INTEGER PRE(M2MAX)
C
C
C     SEARCH Y FOR ALLOWABLE SEQUENCES
C
C     IN:
C     MBE=VALUE OF M TO BE USED IN THIS RECODING
C     OUT:
C
C
      IF (NST.EQ.0) GO TO 2141
C
C     VARIOUS M = M1 ... MBEST
      IF (VDIR.EQ.1) M=M1
      IF (VDIR.EQ.-1) M=M2
      IF (STRICT.EQ.1) M=MBE
 2123 CONTINUE
C
C     CONSIDER T= M,... N
      T=M
 2125 CONTINUE
      IF (VSTART.EQ.1.AND.T-QLAST.LT.MBE) GO TO 2137
C
      DO 2135 LJ=1,2
      DO 2134 LTA=1,ALLOW
      DO 2133 LLL=1,NST
      L=INDEX(LLL)
      IF (LJ.EQ.1.AND.USED(L).EQ.0) GO TO 2133
      IF (LJ.EQ.2.AND.USED(L).EQ.1) GO TO 2133
      IF (SQLEN(L).NE.M) GO TO 2133
      IF (TYPE(L).NE.LTA) GO TO 2133
      IF (MODE.EQ.1) GO TO 2129
      DO 2127 JD=1,M
      TMJD=T-M+JD
      IF (Y(TMJD).NE.PUNC) GO TO 2127
      QLAST=TMJD
      GO TO 2137
 2127 CONTINUE
 2129 CONTINUE
      CALL CONV (PRE,M,C,SQCTXT(L))
      CALL CONV (POST,M,C,SQPRED(L))
      DO 2131 JD=1,M
      TMJD=T-M+JD
      IF (MODE.EQ.2.AND.Y(TMJD).EQ.PUNC) GO TO 2137
      IF (POST(JD).EQ.DONT) GO TO 2131
      IF (PRE(JD).EQ.PERC.AND.POST(JD).NE.Y(TMJD)) GO TO 2133
      IF (POST(JD).EQ.CTXT.AND.PRE(JD).NE.Y(TMJD)) GO TO 2133
 2131 CONTINUE
C
C     LOCATED SEQUENCE OF LENGTH M AT TIME T
C
C     SEQNO(T)=THE ROW (L) IN STPHI, STNEXT, STLEN TABLES OF THE
C     LONGEST SEQUENCE OF ALLOWABLE TYPE ENDING AT TIME T
C
      USED(L)=1
      SEQNO(T)=L
      GO TO 2137
 2133 CONTINUE

 2134 CONTINUE
 2135 CONTINUE
 2137 CONTINUE
C
      T=T+1
      IF (T.LE.N) GO TO 2125
C
      M=M+VDIR
      IF (STRICT.EQ.1) GO TO 2139
      IF (VDIR.EQ.-1.AND.M.GE.M1) GO TO 2123
      IF (VDIR.EQ.1.AND.M.LE.MBE) GO TO 2123
 2139 CONTINUE
C
 2141 CONTINUE
C
      RETURN
      END
      SUBROUTINE TLOC (NS,M1,STRICT,MBE,SQLEN,TYPE,ALLOW,PRE,C,Y,
     1 MTSIZE,M2MAX,POST,SQPRED,MODE,DONT,PUNC,INDEX,NST,SQCTXT,PERC,
     2 CTXT,VSPEC,SPEC,USED,T,L,VSTART,VDIR,QLAST,LAWL,M2)
      IMPLICIT INTEGER (A-Z)
      INTEGER Y(NS)
      INTEGER USED(MTSIZE)
      INTEGER SQCTXT(MTSIZE)
      INTEGER SQLEN(MTSIZE)
      INTEGER SQPRED(MTSIZE)
      INTEGER TYPE(MTSIZE)
      INTEGER INDEX(MTSIZE)
      INTEGER POST(M2MAX)
      INTEGER PRE(M2MAX)
C     SEARCH Y FOR ALLOWABLE SEQUENCES
C     GIVEN T
C
C
C     IN:
C     MBE=VALUE OF M TO BE USED IN THIS RECODING
C     OUT:
C     VSPEC, SPEC
      VSPEC=0
      SPEC=0
      IF (NST.EQ.0) GO TO 1639
      IF (VSTART.EQ.1.AND.T-QLAST.LT.MBE) GO TO 1639
C
C     VARIOUS M = M1 ... MBEST
      IF (VDIR.EQ.1) M=M1
      IF (VDIR.EQ.-1) M=M2
      IF (STRICT.EQ.1) M=MBE
C
 1623 CONTINUE
      DO 1633 LJ=1,2
      DO 1632 LTA=1,ALLOW
      DO 1631 LLL=1,NST
      L=INDEX(LLL)
      IF (LJ.EQ.1.AND.USED(L).EQ.0) GO TO 1631
      IF (LJ.EQ.2.AND.USED(L).EQ.1) GO TO 1631
      IF (SQLEN(L).NE.M) GO TO 1631
      IF (TYPE(L).NE.LTA) GO TO 1631
      IF (MODE.EQ.1) GO TO 1627

      DO 1625 JD=1,M
      TMJD=T-M+JD
      IF (Y(TMJD).NE.PUNC) GO TO 1625
      QLAST=TMJD
      GO TO 1639
 1625 CONTINUE
 1627 CONTINUE
      CALL CONV (PRE,M,C,SQCTXT(L))
      CALL CONV (POST,M,C,SQPRED(L))
      DO 1629 JD=1,M
      TMJD=T-M+JD
      IF (MODE.EQ.2.AND.Y(TMJD).EQ.PUNC) GO TO 1635
      IF (POST(JD).EQ.DONT) GO TO 1629
      IF (PRE(JD).EQ.PERC.AND.POST(JD).NE.Y(TMJD)) GO TO 1631
      IF (POST(JD).EQ.CTXT.AND.PRE(JD).NE.Y(TMJD)) GO TO 1631
 1629 CONTINUE
C     LOCATED SEQUENCE OF LENGTH M AT TIME T
      IF (LJ.EQ.1) VSPEC=L
      IF (LJ.EQ.2) SPEC=L
 1631 CONTINUE
 1632 CONTINUE
 1633 CONTINUE
 1635 CONTINUE
      M=M+VDIR
      IF (STRICT.EQ.1) GO TO 1637
      IF (VDIR.EQ.-1.AND.M.GE.M1) GO TO 1623
      IF (VDIR.EQ.1.AND.M.LE.MBE) GO TO 1623
 1637 CONTINUE
C
 1639 CONTINUE
      L=0
      IF (SPEC.NE.0) L=SPEC
      IF (VSPEC.NE.0) L=VSPEC
      IF (L.NE.0) USED(L)=1
C
      RETURN
      END
      SUBROUTINE GEN2 ( STRING, N, NS, MODE, NCHAR, PUNC )
      IMPLICIT INTEGER (A-Z)
      INTEGER STRING(NS)
      REAL PROB(20)
      REAL YFL
C
C     SPECIAL BINARY SEQUENCE GENERATOR FOR SENTENCES OF UNIFORM
C     LENGTH
C
C     TERMINAL SYMBOLS: 0,1 (S0, S1)
C
C
C     RULES OF PRODUCTION:
C
C     S --> XA
C
C     A --> YA
C     A --> ZA
C     A --> WA
C
C     X --> 11
C     Y --> 01
C     Z --> 10

C     W --> 00
C
C
C     IN:
C     NS=DIMENSIONED SIZE OF STRING
C     NCHAR=NUMBER OF TERMINAL SYMBOLS (INCLUDING PUNCTUATION, IF ANY)
C     MODE=DETERMINES METHOD OF INITIAL PUNCTUATION
C     OUT:
C     N=SIZE OF STRING PRODUCED
C     STRING=THE STRING PRODUCED
C     READ IN FROM CARD:
C     NEAR=APPROXIMATE SIZE OF STRING DESIRED
C     BEGIN=SEED FOR RANDU
C     N8=SENTENCE LENGTH
C     DIV: 2=USE NON-TERMINALS Y AND Z   3=USE Y, Z, AND W
C
      S0=3
      S1=4
C
C     READ ONE CONTROL CARD
C
      READ (5,13) NEAR, BEGIN, N8, DIV
   13 FORMAT ( 16I5 )
C
      DO 8 L=1,DIV
    8 PROB(L)=0
C
      N=0
C
C     NEW SENTENCE
C
    4 CONTINUE
      IF ( N.GT.NEAR ) GO TO 30
C     LETTER X=11
C     SENTENCE ALWAYS BEGINS WITH AN X
      N=N+1
      STRING(N)=S1
      N=N+1
      STRING(N)=S1
      J=2
C
C     INSERT Y, Z, OR W RANDOMLY
C
    5 CONTINUE
      CALL RANDU (BEGIN, IY, YFL)
      BEGIN=IY
      IND=YFL*DIV+1.0
      PROB(IND)=PROB(IND)+1.0
      IF ( J.GE.N8 ) GO TO 28
      GO TO ( 10, 20, 25 ), IND
C
C     LETTER Y=01
   10 CONTINUE
      N=N+1
      STRING(N)=S0
      N=N+1
      STRING(N)=S1
      J=J+2

      GO TO 5
C
C     LETTER Z=10
   20 CONTINUE
      N=N+1
      STRING(N)=S1
      N=N+1
      STRING(N)=S0
      J=J+2
      GO TO 5
C     LETTER W=00
   25 CONTINUE
      N=N+1
      STRING(N)=S0
      N=N+1
      STRING(N)=S0
      J=J+2
      GO TO 5
C
C     END OF SENTENCE
C     INCLUDE INITIAL PUNCTUATION MARK, IF MODE=2
C
   28 CONTINUE
      IF ( MODE.NE.2 ) GO TO 4
      N=N+1
      STRING(N)=PUNC
      GO TO 4
C
C     PRINT OUT PROBS AS CHECK
   30 CONTINUE
      DO 38 L=1,DIV
      PROB(L) = PROB(L) / N * 2.0
   38 CONTINUE
      WRITE (6,99) ( PROB(I), I=1,DIV )
   99 FORMAT ( 10F11.6 )
      RETURN
      END
      SUBROUTINE HUFF ( M, C, M2, P, LEN, KODE, PC, ACT, TAB, LMIN,
     1 MKD, LIST )
      IMPLICIT INTEGER ( A-Z )
      REAL P(M)
      REAL PC(M2)
      INTEGER LEN(M)
      INTEGER KODE(M,MKD)
      INTEGER ACT(M2)
      INTEGER TAB(M,C)
      INTEGER LMIN(C)
      INTEGER LIST(M)
      REAL SUM
      REAL MIN
C
C     HUFFMAN CODING SCHEME
C     STARTS WITH M MESSAGES OF PROB P(I)
C     CODES THEM INTO SEQUENCES OF C SYMBOLS
C     IN:
C     M= # OF MESSAGES

C     M2= 2*M
C     C=# OF SYMBOLS IN ALPHABET FOR MESSAGE
C     P(I) = PROB OF MESSAGE I
C     MKD=MAX LENGTH OF ANY ENCODED MESSAGE (2ND SUB OF KODE)
C     OUT:
C     KODE(I,J) = CODE FOR MESSAGE I ( USES LEN(I) SYMBOLS)
C     LEN(I)= LENGTH OF CODE FOR ORIGINAL SEQ I
C     SCRATCH VECTORS:
C     PC(I) = ORIGINAL VECTOR P PLUS ADDITIONAL LINES
C     ACT(I) = 0 IMPLIES LINE IS INACTIVE, 1 IMPLIES ACTIVE
C     TAB(I,J)=THE LINES USED TO CREATE LINE M+I IN TABLE.
C     LINE M+1 HAS ONLY MO ENTRIES. OTHER LINES HAVE DEL=C
C     LMIN(I) = LOCATION OF ONE OF THE I SMALLEST PC'S
C     LIST= C-ARY NUMBER USED IN TREE SEARCH
C
C     *****************************************************************
C     CALLING SEQUENCE
C     NEED NME, CC, P
C     REAL PC(400)
C     DIMENSION KODE(200,20)
C     DIMENSION TAB(200,2)
C     DIMENSION LEN(200)
C     DIMENSION ACT(400)
C     DIMENSION LIST(200)
C     DIMENSION LMIN(10)
C     MKD=20
C     RECODING
C     IN TO HUFF:
C     NME=NUMBER OF MESSAGES TO BE RECODED
C     CC=NUMBER OF SYMBOLS
C     P=PROBABILITY VECTOR
C     MKD=DIMENSIONED SIZE OF 2ND DIM OF KODE
C     MKK=TWICE NME
C     PRODUCED BY SUBROUTINE: KODE, LEN
C     KODE(I,J) = CODE FOR MESSAGE I ( USES LEN(I) SYMBOLS)
C     LEN(I)= LENGTH OF CODE FOR ORIGINAL SEQ I
C     SCRATCH FOR HUFF: TAB, ACT, LIST, LMIN
C     IF (EXR.NE.2) GO TO 980
C     IF (NME.EQ.0) GO TO 980
C     MKK= 2*NME
C     CALL HUFF(NME, CC, MKK, P8, LEN, KODE, PC, ACT, TAB, LMIN, MKD,
C    1 LIST)
C 980 CONTINUE
C
C     *****************************************************************
C
C     INITIALIZE
      DO 7 I=1,M
      ACT(I)=1
      PC(I)=P(I)
      LEN(I)=0
    7 CONTINUE
      DO 9 L=1,C
      DO 9 I=1,M
      TAB(I,L)=0
    9 CONTINUE
C

C     FIND MO
      IF ( C.EQ.2 ) GO TO 20
      CL=C-1
      DO 10 MO=2,C
      NU= M - MO
      Q= NU / CL
      IF ( Q * CL.EQ.NU ) GO TO 30
   10 CONTINUE
   20 CONTINUE
      MO=2
   30 CONTINUE
C
C     MAIN SECTION
C     FIRST TIME DEL=MO, THEN DEL=C
C
      DEPTH=M
      TDEPTH=0
      NACT=M
      DEL=MO
C
C     FIND DEL SMALLEST PROBS
C     GET SUM OF DEL SMALLEST PROBS
C
   22 CONTINUE
      SUM=0.0
      DO 70 L=1,DEL
      MIN=999.
      DO 60 I=1,DEPTH
      IF ( MIN.LE.PC(I) .OR. ACT(I).NE.1 ) GO TO 60
      MIN=PC(I)
      LMIN(L)=I
   60 CONTINUE
      II=LMIN(L)
      ACT(II)=0
      SUM=SUM+PC(II)
   70 CONTINUE
      NACT=NACT - DEL
C
C     CREATE NEW LINE IN PC-TABLE EQUAL TO SUM OF THE DEL SMALLEST
C
      DEPTH=DEPTH+1
      TDEPTH=TDEPTH+1
      ACT(DEPTH)=1
      PC(DEPTH) = SUM
      DO 90 J=1,C
      TAB(TDEPTH,J) = LMIN(J)
   90 CONTINUE
C
      IF ( NACT.LT.1 ) GO TO 100
      NACT=NACT+1
      DEL = C
      GO TO 22
  100 CONTINUE
C
C     CREATE CODE SEQUENCES
C
      DO 120 I=1,M
      LIST(I)=1
  120 CONTINUE
C

C     TREE SEARCH
  125 CONTINUE
      II=TDEPTH
      K=0
  130 CONTINUE
      K=K+1
      II = TAB( II, LIST(K) )
      IF ( II.LE.M ) GO TO 140
      II=II - M
      GO TO 130
C
C     REACHED ENDPOINT OF TREE
  140 CONTINUE
      IF ( II.EQ.0 ) GO TO 190
      DO 150 LL=1,K
      KODE(II,LL) = LIST(LL) - 1
  150 CONTINUE
      LEN(II) = K
  190 CONTINUE
      LIST(K) = LIST(K) + 1
      IF ( LIST(K).LE.C ) GO TO 125
      LIST(K) = 1
      K=K-1
      IF ( K.GT.0 ) GO TO 190
C
C     PRINT CODES
      PRH=0
      IF ( PRH.EQ.0 ) GO TO 361
      DO 360 I=1,M
      JJ= LEN(I)
      WRITE (6,341) I, ( KODE(I,J), J=1,JJ )
  341 FORMAT ( 1X, I4, 5X, 120I1 )
  360 CONTINUE
  361 CONTINUE
      RETURN
      END
      SUBROUTINE REMOVE ( IN, OUT, NIN, NOUT, FILL )
      IMPLICIT INTEGER (A-Z)
      INTEGER IN(NIN)
      INTEGER OUT(NIN)
      TG=0
      DO 10 T=1,NIN
      IF ( IN(T).EQ.FILL ) GO TO 10
      TG=TG+1
      OUT(TG)=IN(T)
   10 CONTINUE
      NOUT=TG
      RETURN
      END
      SUBROUTINE ZERO (NVECT,N)
      ENTRY IZERO (NVECT,N)
      INTEGER NVECT(N)
C     SUBROUTINE TO ZERO VECTOR
      DO 10 I=1,N
      NVECT(I)=0
   10 CONTINUE
      RETURN
      END
      SUBROUTINE FZERO (NVECT,N)

      REAL NVECT(N)
C     SUBROUTINE TO ZERO VECTOR
      DO 10 I=1,N
      NVECT(I)=0.0
   10 CONTINUE
      RETURN
      END
      SUBROUTINE SPRAY (NVECT,N,NVAL)
      ENTRY ISPRAY (NVECT,N,NVAL)
      INTEGER NVECT(N)
C     SUBROUTINE TO SET VECTOR TO A CONSTANT
      DO 10 I=1,N
      NVECT(I)=NVAL
   10 CONTINUE
      RETURN
      END
      FUNCTION ISUM (NVECT,N)
      INTEGER NVECT(N)
C     ISUM FINDS SUM OF VECTOR
      ISUM=0
      DO 10 I=1,N
      ISUM=ISUM+NVECT(I)
   10 CONTINUE
      RETURN
      END
      SUBROUTINE PHI ( DIGIT, M, C, SUM )
      IMPLICIT INTEGER (A-Z)
      INTEGER DIGIT(M)
C
C     PHI ASSIGNS AN INDEX NUMBER TO A GIVEN M-SEQUENCE
C     PHI=NATURAL NUMBER CORRESPONDING TO M DIGITS MODULO C
C
C     IN: DIGIT, M, C
C     DIGIT=VECTOR OF M DIGITS MOD C
C     OUT: SUM
C     SUM=NATURAL NUMBER PRODUCED
C
      SUM=0
      E=1
      DO 120 L=1,M
      SUM=SUM+E*DIGIT(M+1-L)
      E=E*C
  120 CONTINUE
      RETURN
      END
      SUBROUTINE CONV ( DIGIT, M, C, NUM )
      IMPLICIT INTEGER (A-Z)
      INTEGER DIGIT(M)
C
C     CONVERTS NUMBER 'NUM' TO M DIGITS MOD C
C
C     IN: NUM, M, C
C     NUM=NATURAL NUMBER
C     OUT: DIGIT
C     DIGIT= M DIGITS MOD C
C
      N=NUM
      DO 10 L=1,M
      Q=N/C
      DIGIT(M-L+1) = N-Q*C

      N=Q
   10 CONTINUE
      RETURN
      END
      SUBROUTINE XFSORT ( V, N, INDEX, N1, N2 )
      ENTRY FXSORT ( V, N, INDEX, N1, N2 )
      REAL V(N)
      REAL T
      INTEGER INDEX(N)
C
C     INDEXED SORTING ROUTINE.
C     SORTS BLOCK WITHIN VECTOR V INTO HIGH...LOW ORDER
C     IN:
C     V=VECTOR OF LENGTH N OF VALUES TO BE SORTED
C     N=DIMENSIONED SIZE OF V
C     N1=INDEX OF BEGINNING OF BLOCK WITHIN V TO BE SORTED
C     N2=INDEX OF END OF BLOCK WITHIN V TO BE SORTED
C     OUT:
C     INDEX(I) = SUBSCRIPT OF I-TH LARGEST ELEMENT OF V
C     CHANGED:
C     THAT PART OF VECTOR V WHICH IS SORTED
C     NB: IN MAIN, INDEX VECTOR V BY I, AND ALL OTHER ASSOCIATED
C     VECTORS BY INDEX(I)
      DO 324 KJ=N1,N2
  324 INDEX(KJ)=KJ
      IF ( N1.GE.N2 ) GO TO 999
      NL=N2-1
      DO 300 L=N1,NL
      IF ( V(L).GE.V(L+1) ) GO TO 300
C     V(L+1) GREATER
      LLL= L+1 - N1
      DO 100 J=1,LLL
      K=L+1-J
      IF ( V(K).GE.V(K+1) ) GO TO 200
      T=V(K)
      V(K) = V(K+1)
      V(K+1) = T
      NT=INDEX(K)
      INDEX(K) = INDEX(K+1)
      INDEX(K+1) = NT
  100 CONTINUE
  200 CONTINUE
  300 CONTINUE
  999 CONTINUE
      RETURN
      END
      SUBROUTINE LOOK ( LLIST, LDIM, LEN, ITEM, LEVEL, NEW )
      IMPLICIT INTEGER (A-Z)
      INTEGER LLIST(LDIM)
C     ITEM=CURRENT ELEMENT
C     LDIM=DIMENSIONED SIZE OF LLIST
C     LEN=ACTUAL LENGTH OF LLIST
C     LEVEL=INDEX OF PLACE WHERE ITEM IS FOUND IN LLIST
C     LLIST(I)=LIST OF ITEMS
C     NEW: 0=ITEM WAS IN LLIST ALREADY
C     1=ITEM IS NEW, AND WAS ADDED TO LLIST

C     INITIALIZE: LEN=0 BEFORE FIRST ENTRY
C
      IND=1
      GO TO 50
      ENTRY LOOKN ( LLIST, LDIM, LEN, ITEM, LEVEL, NEW )
      IND=0
   50 CONTINUE
      IF (LEN.EQ.0) GO TO 150
      DO 100 I=1,LEN
      IF (LLIST(I).EQ.ITEM) GO TO 200
  100 CONTINUE
C     NO MATCH
  150 CONTINUE
      IF (IND.EQ.0 .OR. LEN.LT.LDIM) GO TO 168
      PRINT 166, LDIM
  166 FORMAT ( '0TOO MANY ELEMENTS IN LIST OF LENGTH ', I6 )
C     NB: LEVEL SET TO LAST POSSIBLE LEVEL IN THIS CASE
      LEVEL=LDIM
      GO TO 250
  168 CONTINUE
      NEW=1
      LEN=LEN+1
      LLIST(LEN)=ITEM
      LEVEL=LEN
      GO TO 250
C     CURRENT 'ITEM' MATCHES WITH EXISTING ELEMENT OF LLIST
  200 CONTINUE
      NEW=0
      LEVEL=I
  250 CONTINUE
      RETURN
      END
      SUBROUTINE ISEEK ( NVECT, N, NOW, KODE )
      INTEGER NVECT(N)
      DO 10 I=1,N
      IF ( NVECT(I).EQ.NOW ) GO TO 20
   10 CONTINUE
      KODE=0
      RETURN
   20 CONTINUE
      KODE=I
      RETURN
      END
      SUBROUTINE IXSPRY (NVECT,NDIM,NVAL,I1,I2)
      INTEGER NVECT(NDIM)
      DO 10 I=I1,I2
      NVECT(I)=NVAL
   10 CONTINUE
      RETURN
      END
      SUBROUTINE IEQUAL ( JVECT, N, KVECT )
      INTEGER JVECT(N)
      INTEGER KVECT(N)
      DO 10 I=1,N
      JVECT(I)=KVECT(I)
   10 CONTINUE
      RETURN
      END
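The list routine LOOK is the workhorse used throughout to build LLIST, the table of distinct regularities. The following minimal driver is added here for illustration and is not part of the original listing; it exercises LOOK's contract as reconstructed above (each call either finds ITEM in LLIST and returns NEW=0, or appends it and returns NEW=1, with LEVEL giving the item's position in either case):

C     ILLUSTRATIVE DRIVER FOR LOOK -- NOT PART OF THE ORIGINAL LISTING
      INTEGER LLIST(10)
      LEN=0
      CALL LOOK (LLIST, 10, LEN, 7, LEVEL, NEW)
C     NOW LEN=1, LEVEL=1, NEW=1 (7 WAS APPENDED)
      CALL LOOK (LLIST, 10, LEN, 9, LEVEL, NEW)
C     NOW LEN=2, LEVEL=2, NEW=1 (9 WAS APPENDED)
      CALL LOOK (LLIST, 10, LEN, 7, LEVEL, NEW)
C     NOW LEN=2, LEVEL=1, NEW=0 (7 WAS ALREADY PRESENT)
      WRITE (6,99) LEN, LEVEL, NEW
   99 FORMAT (1X, 3I5)
      STOP
      END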

      SUBROUTINE MATCH ( NVECT, N, NOW, KODE )
      ENTRY IMATCH ( NVECT, N, NOW, KODE )
      INTEGER NVECT(N)
      DO 10 I=1,N
      IF ( NVECT(I).EQ.NOW ) GO TO 20
   10 CONTINUE
      PRINT 15, NOW
   15 FORMAT ( '0UNABLE TO MATCH:', A4 )
      KODE=0
      RETURN
   20 CONTINUE
      KODE=I
      RETURN
      END
      SUBROUTINE RANDU (IX,IY,YFL)
C
C     COMPUTES UNIFORMLY DISTRIBUTED RANDOM REAL NUMBERS BETWEEN
C     0 AND 1.0 AND RANDOM INTEGERS BETWEEN ZERO AND
C     2**31. EACH ENTRY USES AS INPUT AN INTEGER RANDOM NUMBER
C     AND PRODUCES A NEW INTEGER AND REAL RANDOM NUMBER.
C     DESCRIPTION OF PARAMETERS
C     IX - FOR THE FIRST ENTRY THIS MUST CONTAIN ANY ODD INTEGER
C          NUMBER WITH NINE OR LESS DIGITS. AFTER THE FIRST ENTRY,
C          IX SHOULD BE THE PREVIOUS VALUE OF IY COMPUTED BY THIS
C          SUBROUTINE.
C     IY - A RESULTANT INTEGER RANDOM NUMBER REQUIRED FOR THE NEXT
C          ENTRY TO THIS SUBROUTINE. THE RANGE OF THIS NUMBER IS
C          BETWEEN ZERO AND 2**31
C     YFL- THE RESULTANT UNIFORMLY DISTRIBUTED, FLOATING POINT,
C          RANDOM NUMBER IN THE RANGE 0 TO 1.0
C
C     REAL YFL
C     IX=65549
C     DO 1 K=1,100
C     CALL RANDU ( IX, IY, YFL )
C   1 CONTINUE
C     REMARKS
C     THIS SUBROUTINE IS SPECIFIC TO SYSTEM/360 AND WILL PRODUCE
C     2**29 TERMS BEFORE REPEATING. THE REFERENCE BELOW DISCUSSES
C     SEEDS (65539 HERE), RUN PROBLEMS, AND PROBLEMS CONCERNING
C     RANDOM DIGITS USING THIS GENERATION SCHEME. MACLAREN AND
C     MARSAGLIA, JACM 12, P. 83-89, DISCUSS CONGRUENTIAL GENERATORS;
C     A SCHEME COMBINING TWO GENERATORS OF THE RANDU TYPE, ONE
C     FILLING A TABLE AND ONE PICKING FROM THE TABLE, IS OF BENEFIT
C     IN SOME CASES. 65549 HAS BEEN SUGGESTED AS A SEED WHICH HAS
C     BETTER STATISTICAL PROPERTIES FOR HIGH ORDER BITS OF THE
C     GENERATED DEVIATE.
C     SEEDS SHOULD BE CHOSEN IN ACCORDANCE WITH THE DISCUSSION
C     GIVEN IN THE REFERENCE BELOW. ALSO IT SHOULD BE NOTED THAT
C     IF FLOATING POINT RANDOM NUMBERS ARE DESIRED, AS ARE
C     AVAILABLE FROM RANDU, THE RANDOM CHARACTERISTICS OF THE
C     FLOATING POINT DEVIATES ARE MODIFIED AND IN FACT THESE
C     DEVIATES HAVE HIGH PROBABILITY OF HAVING A TRAILING LOW
C     ORDER ZERO BIT IN THEIR FRACTIONAL PART.
C     POWER RESIDUE METHOD DISCUSSED IN IBM MANUAL C20-8011,
C     RANDOM NUMBER GENERATION AND TESTING
C     INTEGER NUMBERS ARE ALL ODD
C
      IY=IX*65539
      IF (IY) 5,6,6
    5 IY=IY+2147483647+1
    6 YFL=IY
      YFL=YFL*.4656613E-9
      IX=IY
      RETURN
      END
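As a final illustration (again not part of the original listing), this driver exercises the utility pair CONV and PHI listed earlier, which implement the correspondence between an index number and its M-digit representation modulo C on which the sequence tables are built. The two routines are inverses of one another:

C     ILLUSTRATIVE DRIVER FOR CONV AND PHI -- NOT IN THE ORIGINAL
C     LISTING.  CONV EXPANDS AN INDEX NUMBER INTO M DIGITS MODULO C;
C     PHI PACKS THE DIGITS BACK INTO THE SAME INDEX NUMBER.
      INTEGER DIGIT(4)
      NUM=11
C     EXPAND 11 INTO 4 DIGITS MOD 3: DIGIT BECOMES 0,1,0,2
      CALL CONV (DIGIT, 4, 3, NUM)
C     PACK THE DIGITS BACK: KSUM BECOMES 11 AGAIN
      CALL PHI (DIGIT, 4, 3, KSUM)
      WRITE (6,99) NUM, (DIGIT(I),I=1,4), KSUM
   99 FORMAT (1X, I5, 5X, 4I2, 5X, I5)
      STOP
      END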

BIBLIOGRAPHY

Caianiello, E. R. and Capocelli, R. M., "On Form and Language: The Procrustes Algorithm for Feature Extraction," Report of Laboratorio di Cibernetica del C.N.R., Arco Felice, Naples, Italy, November 1970.

Chomsky, N., "On Certain Formal Properties of Grammars," Information and Control, 2 (1959), p. 137-167.

________, "Formal Properties of Grammars," Handbook of Mathematical Psychology, Wiley, 1963, p. 323-418.

Fano, Robert M., Transmission of Information, MIT Press, 1961.

Feldman, Jerome A., et al., Grammatical Complexity and Inference, Stanford Artificial Intelligence Project Memo AI-89 (1969).

Floyd, Robert W., "A Note on Mathematical Induction in Phrase Structure Grammars," Information and Control, 4 (1961), p. 353-358.

Gaines, Helen Fouche, Cryptanalysis, (1939) Dover, 1956.

Ginsburg, Seymour, The Mathematical Theory of Context-Free Languages, McGraw-Hill, 1966.

Ginsburg, Seymour and S. A. Greibach, "Mappings which Preserve Context-Sensitive Languages," Information and Control, 9 (1966), p. 563-582.

Gold, M., "Language Identification in the Limit," Information and Control, 10 (1967), p. 447-474.

________, "Limiting Recursion," Journal of Symbolic Logic, 30 (1965), p. 28-47.

Goodall, M. C., "Induction and Logical Types," Biological Prototypes and Synthetic Systems, Volume I, ed. by E. E. Bernard and Morley R. Kare, Plenum (1962).

Hopcroft, J. and Ullman, J., Formal Languages and their Relation to Automata, Addison-Wesley, 1969.

Hunt, E. and P. S. Marin, Experiments in Induction, Academic Press, 1966.

Khinchin, A. I., Mathematical Foundations of Information Theory, (1953-1956) Dover, 1957.

Miller, G. A., "The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information," Psychological Review, 63 (1956), p. 81-97.

Myhill, J., "Linear Bounded Automata," WADD Technical Note 60-165, Wright-Patterson Air Force Base, Ohio.

Pao, Tsyh-Wen Lee, A Solution of the Syntactical Induction-Inference Problem for a Non-Trivial Subset of Context-Free Languages, University of Pennsylvania Ph.D. Thesis, 1969.

Parkinson, C. Northcote, Parkinson's Law, Houghton Mifflin, Boston, 1957.

Riordan, John, An Introduction to Combinatorial Analysis, Wiley, 1958.

Shannon, C. E. and Warren Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1963.

Smith, L. D., Cryptography, (1943) Dover, 1955.

Smullyan, R., Theory of Formal Systems, Princeton, 1959.

Solomonoff, R., "A Formal Theory of Inductive Inference," Information and Control, 7 (1964), p. 1-22, 224-254.

________, "A New Method for Discovering the Grammars of Phrase Structure Languages," Information Processing, June 1959, p. 285-290.

Uhr, Leonard and Vossler, Charles, "A Pattern Recognition Program That Generates, Evaluates, and Adjusts Its Own Operators," in Feigenbaum, E. and Feldman, J. (editors), Computers and Thought, McGraw-Hill, 1963.

Walk, K., "Entropy and Testability of Context-Free Languages," in Formal Language Description Languages for Computer Programming, ed. by T. B. Steel, North-Holland, Amsterdam, 1969.

Younger, D. H., "Recognition and Parsing of Context-Free Languages in Time n³," Information and Control, 10 (1967), p. 189-208.
