THE UNIVERSITY OF MICHIGAN COLLEGE OF LITERATURE, SCIENCE, AND THE ARTS Communication Sciences Program Technical Report A HOMOMORPHIC THEORY OF CONTEXT-FREE LANGUAGES AND ITS GENERALIZATIONS Yehoshafat Give' on ORA Projects 03105 and 07363 under contract with: U. S. ARMY RESEARCH OFFICE (DURHAM) GRANT NO. DA-ARO(D)-31-124-G665 DURHAM, NORTH CAROLINA and DEPARTMENT OF THE NAVY OFFICE OF NAVAL RESEARCH CONTRACT NO. Nonr-1224(21) WASHINGTON, D.C.? ~ i. administered.. through. OFFICE OF RESEARCH ADMFI.STRATION. ANN ARBOR September'1965

I ^n

ABSTRACT Usually, and naturally, context-free languages are defined and studied by means of grammars. In the course of study of these languages several algebraic characterizations were found. In this paper we want to regard one of these characterizations (namely, the homomorphic characterization that was established by Chomsky and Schiutzenberger) as the definition of this family of languages and to show how one can derive some of their well known properties directly from this "redefinition." In addition to simplification of proofs we find that the arguments involved lead us naturally to some generalizations that have some bearing on mathematical linguistics. The families of languages derived by means of these generalizations exhibit some features which are too complex for the context-free model and yet these features are of the type that one encounters in the study of natural languages. We discuss very briefly some examples of these generalizations. iii

1. Introduction Usually, and naturally, context-free languages are defined and studied by means of grammars (Chomsky 1956, Bar-Hillel et al 1960, Chomsky & Miller 1963). In the course of study of these languages several algebraic characterizations were found. In this paper we want to regard one of these characterizations (namely, the homomorphic characterization that was established by Chomsky and Schuiitzenberger (cf. Chomsky ~ Miller 1963)) as the definition of this family of languages and to show how one can derive some of their well known properties directly from this "redefinition." In addition to simplification of proofs we find that the arguments involved lead us naturally to some generalizations that have some bearing on mathematical linguistics. The families of languages derived by means of these generalizations exhibit some features which are too complex for the context-free model and yet these features are of the type that one encounters in the study of natural languages. We discuss very briefly some examples of these generalizations. -1

In what follows we assume a certain acquaintance with the theory of regular events and with the elementary theory of monoids. 1.1 We define a family of languages to be called CF-languages: Let VT be a fixed finite alphabet. For any finite alphabet *. * V we denote by V(2) the basis of the monoid V xV That is, the elements of V(2) are of the form (a,X) or (A,a), where X is the empty-word and a6V. Let KV be the kernel of the homomorphism y: V 2 FG(V) v(2T- l V * of V(2), the free monoid generated by V(2), onto FG(V), the free group generated by V, with y(a,X) = a, and y(X,a) = a, for all asV That is, KV = {xcV(2): y(x) = lEFG(V)} For our purposes we are interested in alphabets V which include VT. For any such alphabet V c VT, we have a natural homomorphism * * ~V: VC2)-~- VT' which is defined by: -2

a if a = (a,X) and acVT CV(a) for all a(V X otherwise; Obviously, c amounts to the erasure of letters except for VTx{X} which is interpreted as V. No confusion is caused by denoting all the homomorphisms V, for all VT c V, by e We denote by RV* the class of the regular events over V(2) (2) as an alphabet. 1.1.1 A subset L of VT is said to be a CF-language defined b V iff L = E(EnKV) for some EBRv* (2) We denote by lV the class of all CF-languages defined by V We may express now the characterization of the context-free languages by: 1.1.2 THEOREM (Chomsky & Schuitzenberger): A subset L of VT is a context-free language iff there exists a finite alphabet V VT, such that Lc 4. -3

1.2 Our main goal is to prove the following well known properties of context-free languages for the CF-languages as defined in 1.1.1: 1.2.1 Every regular event over VT is a CF-language. 1.2.2 The intersection of a CF-language with a regular event over VT, is a CF-language. * * 1.2.3 For any homomorphism *: VT --- VT and any CF-language L, (L) is a CF-language. 1.2.4 The class of all CF-languages ( over VT ) is closed under the Kleenean operations of events. 1.2.5 If VT is a single-letter alphabet then every CF-language is regular; if not, then there exists a CF-language which is not regular. -4

2. The properties of f V 2.1 LEMMA: If VT V1 V2 then S V ~~i i ^ ^1 2 Proof: K c KV and R * c R V2) ~1 \ 2 -)1 (2) 2)(2) 2.1.1 COROLLARY: For any VT c V1,V2 ~V1 V2 V1 V2 2.2 PROPOSITION: RV*,hence forany VT V RV*S. hT VT T' Proof: KV is obviously a submonoid of V 2) Define the * homomorphism g: VT - KV by VT g(a) = (a,X) (X,a) for all acVT * * Clearly, for any xeVT: e(g(x)) = x. Hence for any E c VT: E = c(g(E)) = e(g(E)nKV ) If EcRVT, then g(E)'sR()*, and therefore E. T T'(2) T. 2.3 PROPOSITION: For any VT SV, if EeRV* and LeV, then EnLce Proof: Let L = etEL'nKV ) for ELcRV*.. "" ( 2 ) -5

EnL = En~(ELr KV) -1 = c(c (E) )n ~ (ELn KV) = ~((E (E) nEL) nK) But ECv* and ECR* -1 But EcRV* and ELcRV* imply that E (E) EL is regular in T \(2) * V(2) and therefore EnLe 2.4 PROPOSITION: Let: VT- -VT be any homomorphism, VT c V, bV, and Vb = Vv{b}. Then for any Lc,:;(L)cAv' Proof: Let iI and i2 be the two natural monomorphisms of V into V2); namely, those which are determined by il(a) = (a,X), and i2(a) = (X,a), for all aV. By combining 4 with i>, and i2 we get two homomorphisms of VT into V (2) = io~,' and 2 i2o * * We "extend": VT —- VT into a homomorphism Vb: (2 ( b)(2) by b (aX) = (b,X) n1(a) for all aeV, *b(Aa) =- 2(a)(X,b) -6

b(aX). (aX) for all acV-V. Ab(X,a) = (X,a) We have now: (i) for any EcRV: 2 b(E)eRVb)( (ii) xcKv iff ~b(x)cKv, hence for any S c V() *b (S) ^ KV = Jb(S nKV); b (iii) for any xcV(2): E(b(x)) = *(E(x)) Thus, if L = E(EL KV) for ELcRV,then: (2),(L) = (ce(ELn KV)) e (Cb (EL KV)) = c(Cb(EL) KV) which shows that *(L)~ Vb' b 2.5 PROPOSITION: For any VT C V, if L1,L2c then L1 vL2 Proof: L1 L2 = (EL L KV) V (EL KV) 1 2 C((ELn KV)V(EL n KV)) - ~(CEL% EL )fKv) 1 -2 -7-.

2.5.1 COROLLARY: For any VT c V1,V2, if LleV and L^ O.....1 2 then Ll L2 VvV2 2.6 PROPOSITION: Let VT c V, bV, and Vb Vv{b}. If L,,e L then bL L Proof: We have Li e(EL KV) for,EL cRV 1 L. i EL L V ~ 1 4 ^ 4 42 (2) Define E = EL -(b,X)EL *(X,b); 41 2 then EcR * and (Vb)(2) EnKV = (ELA KV) (b,) (EL" KV)(X,b) b 1 2 Hence, c(En K ) t((EL n Kv) (b,X).(EL2 KV)(X,b)) Vb "1 2 C((EL L KV) (EL r KV)) 1 2 Li'L2, and therefore L1 L2c Vb 2.7 PROPOSITION: Li 4r7 c V, bV, and Vb = Vv{b). If LeV then L e Vb -8

Proof: Let L e e(EL KV) for ELcRV*. We define (2) E (b,X) *(EL(X,b)), and clearly we have E nK = U (b,X) *((ELn KV)'(X,b)) Kb k0 v=O Hence, c(EKV) E (U (b) ((ELn KV) (Xb)) ) b k=O0 = JU(C(EeLK))k S L 2.8 If VT {a) then KV enjoys a particular structure. T For any set of words E, denote by t(E) the permutationclosure of E (i.e., the set of all permutations of the words that are in E). Obviously, Ky = Ir(((a,X).(X,a))), T and for any S c Vr(2) Snv A r (S)r nK T "T This leads us to employ Presburger's formulas and their relationships ~9

with the commutative-events that are derived from regular events by means of iw (cf. Laing X Wright 1962, Presburger 1930). 2.8.1 LEMMA: If VT {(a) then.V R* = R* -... T VT V a T T Proof: For any word x, over an alphabet that contains the letter b, we denote by nb(x) the number of occurrences of b in x For any EeR * there exists a Presburger's formula Q(m,n) VT (2) such that xcir(E) KV iff Q(n(a,X) (x), n(,a)(X)); ^T hence for any L.ZV there exists a Presburger's formula Q(n) such that L - {an: Q(n)}. Since {n: Q(n)} is ultimately periodic, L must be regular. 2.8.2 Unfortunately I cannot prove the stronger true statement: if VT = {a}, then for any VT c V: = R * without referring to Theorem 1.1.2. 2.8.3 LEMMA: If VT contains at least two letters then M contains languages which are not regular. -10 -

Proof: Let a,bcVT; we define * * E - ((a,k)*(fb)) *.((X,a).(b,X)) Obviously, E KV = {((a,X). (X,b))k ((X,a).(bX)): k }, and therefore e(EKV) = {akbk: k > O}eV. T T -11

3. Generalizations: Abstract CF-systems 3.1 An abstract CF-system is a system - =<WW, P, Pc> where: (i) WT is a monoid ( the terminal monoid of ); (ii) ~ is a set of monoids {W: a6A} (the auxiliary monoids of C} ) where A is an indexing set; (iii) P is a mapping which associates with any auxiliary monoid W,a submonoid P of W; ax xa a (iv) E is a functional which associates with any auxiliary monoid W a homomorphism, also denoted by e: W — W; subjected to conditions which will be specified presently. First, for any monoid W, we denote by RW, the class of the regular-events in W. (A subset E of W is said to be regular in W iff there exists a homomorphism ~: W --- with a finite range, such that E = ~ (F) for soie F c M.) -12

The conditions imposed on ~ are as follows: 3.1.1 For any a,BeA there exists ycA such that WWBVw cW 3.1.2 If W c Ws then R S RW ax W Wa 3.1.~3 W c W implies WnP^ = P 3.1.4 e(P) = WT for all acA. 3.2 For the basic properties of regular events in arbitrary monoids we refer to the literature (Give'on, 1964a,b ). In particular, we shall make use of the following basic properties: 3.2.1 For any monoid W, RW is a Boolean algebra of sets. 3.2.2 For any epimorphism a: W1 — W if EcRW then s (E)cRW1 313 Let ~ be a fixed abstract CF-system. For any acA we define a to be the set of all subsets L -13

of WT of the form L = e(En P ) for-some; EC~R a 3.4 PROPOSITION: If W c W then a f. Proof: Let L = c(EnP ) for EeRW. Since W c W, it follows by 3.1.2 and 3.1.3 that EeRW and W nP P Hence EW5 aa 8 a EnP = E (P W ) = (E nW)n P E = En P and therefore Le 3.5 PROPOSITION: For any acA: RW WT Proof: From c(P) = WT (3.1.4) follows that for any subset S c WT we have E(C (S)n Pa)= S Hence, for any ECRW we have E = c(e (E)P ), and since c (E)eRw T a (3.2.2), it follows that E. 3.6 PROPOSITION: For any acA, EcRlT and Lce, we have EnL:. Proof: Let L =.(ELP,P) for ELeRW, then -14

EnL = Ere(ELrP ) = (C 1(E))n (EILPa) = e((e (E)E^EL)Pa). By 3.2.1 and 3.2.2 follows that e (E) ELCRw and therefore a EnLee 3.7 PROPOSITION: For any acA and L1,L2c: Lv L2 e. Proof: Let Li = e(E.iP ) for L1,L2eRW, then a 1 L2 L= (E1 P )uE (E2P ) = C((E1VE2) n P) By 3.2.1 follows that Ll L2e I 3.7.1 COROLLARY: For any a,BcA there exists a ycA such that for all 1 4a and L2 L1 L2 3.8 PROPOSITION: Let TT WT - WT be a mapping such that for some a,8eA there exists a mapping Ta:. Wa W with the following properties: (i)'TE = eT, (ii) for any EeRW: T (E)cRW (a 8 -15

(iii), for all'weW: weP iff T (W)cP Then for any Le,a:.T(L)e. Proof: Let L = (E nP ) for EeRW. By (i) we have TT(L) = tT(E P) = C'T (E P) By (iii) we have TT(L) = (T o(E)n P), which by (ii) implies that T(tL)E ~' -16

4. Examples: Generalized CF-systems In order to make our generalization more concrete and more applicable to the study of formal languages, we impose on the abstract CF-systems additional conditions. 4.1 A generalized CF-system ( over VT with auxiliary alphabets V ) is an abstract CF-system J = <V? F(O (), PsC> (2) where (i) F( J2)) is the set of the free monoids V(2), for any Vci (ii) for any Ve;:.: V T VT is the erasing homomorphism (as defined in 1.1). (iii)' is rich enough in the following sense: for any finite cardinality n which is larger than the cardinality of VT, there exists a Ve 2 with cardinality n 4.1.1 Note that from (iii) follows that V can be taken as the class -17

of all finite alphabets, and by. 3.1.3 and 3.1.4, we can even assume that V is the class of all finite alphabets that include VT, without changing the languages defined by means of. Note however, that in the definition of abstract CF-systems we required (cf. 3.1 (ii)) that the class of the auxiliary monoids of any abstract CF-system must be a set. But this is, in our case, a baroque difference. 4.1.2 Obviously, the CF-languages as defined in 1.1.1 are defined by means of a generalized CF-system where for any Ve9': p(V) = K. On the other hand, by means of other choices for p we may derive new types of languages over VT, and in this section we suggest and discuss briefly some examples of such types. 4.2 Our notation, namely V(2), hints that in fact we have in mind some other generalizations of the concept of CF-languages, which fall under the machinery of abstract systems but not under the generalized CF-,systems. * * * For example, let V(3) denote the basis of V xV xV. Denote -18

(3) * by BV) the set of all words in V,3) which reduce to the empty-word L by means of successive (and successful) applications of the rule (aX,X) (X,a,X)(XX,a) > X for all aV. c i3) Define L c VT to be a CF -language iff there exists EcRV* such that (3) (3) L:= (EB n ). (The definition of e is evident.) The generalized notion of CF( k-languages is obv'ious. Note' that the CF(l)-languages are the regular languages over VT and the (2) CF ()-languages are precisely the CF-languages. It is easily verified (k) that for any k > 1, the CF k)-languages are defined by means of appropriate abstract CF-systems with VT as their t erminal monoid, and (V ) as their auxiliary monoids. 4.2.1 PROPOSITION: Let k 1, VT S {al,...,ak}, then la2... ak: n > O0 is a CF()-language. Proof: For any 1: i < k define b c V(k by:. - ((ai,,...A).(,ai+l,,....,).:..,(,,...,,ai+k19)) -19-_

where i * j = (i + j) modulo k. Define now E = E *E2 Ek and clearly, nn n (E n BV )-= anlan2.. ak: n >. (E T 1 2 0} 4.3 For any alphabet V we have the natural homomorphism d: V 2 V xV (2) which is determined by the identity on V(2) By means of d we define the following two submonoids of V;) (2) * 4.3.1 E= {xeV(2): d(x) = (y,y). 4.3.2 PV {xeV2): d(x) = (y,a(y)) for some permutation a of y. * * (4.3.3 In general, any submonoid W of V xV determines in this fashion a submonoid d (W) of V. 4.4 It is quite easy to verify: -' *, 4.4.1 If V1 S V2 then CV)(VZ)n Ev = Ev, and (Vl)(2)n) PV lP -20

4.4.2 For all V c VT: c(EV) (PV) VT Hence we have two types of generalized CF-systems. / We denote by CV ( resp., by VP ) the set of languages of the form L (EnEV) for EeRV* (2) ( resp., L = (E —rPV) for ECRV* ). (2) 4.5 Note that if a: V(2) FAG(V) is the natural homomorphism of. *. V(2) onto the free abelian group generated by V, then PV = Ker a. 4.5.1 The relationship between EV and KV is not clear, though obviously, EV c PV and KV S PV For example, let x = (a,X)(b,X)(X,a)(X,b), and x = (X,b)(a,X)(A,a)(b,X); then xliKV but x1cEV, and x2cKV but x2gEv 4.6 In addition to the properties o f V and ~V which were derived.V- V in the previous section for abstract CF-systems, we can prove some of the -21

properties of the CF-languages to hold for as well. For example, 2.6 holds for v; substitute EV instead of KV in the proof. The same holds for 2.4 and 2.8.1. 4.7 PROPOSITION: Let {al,...,ak} then {aa2... a: n > O is both in E and in J P T T Proof: We define E = (al,)A).((X,al).(a2,)).((,a)a,((X,a2)a3 kl)(ak, X))'(X,ak) then E nPV = EnE = n= {(a,) ((Xa1)*(a 2 ))...(,a kl)(ak)) CXak)'n O},} and therefore (E E) ka n,*ZE, tP cCEnPv) = (EE) = {a2 ak n > O E 0 v vn -22T

Bibliography Bar-Hillel, Y., M. Perles G E. Shamir, "On Formal Properties of Simple Phrase Structure Grammars," Language and Information by Y. Bar-Hillel, Addison Wesley Series in Logic, pp. 116-150 (1964). First appeared in Zeitschrift fur Phonetik, Sprachwissenschaft und Kommunikationsforschung, vol. 14, pp. 143-172 961); after Tech. Rep. No. 4, Applied Logic Branch at the Hebrew University, Jerusalem, Israel (1960). Chomsky, N., " Three Models for the Description of Language," IRE Trans, on Information Theory, IT-2, pp. 113-124 1956). Chomsky, N. & G. A. Miller, "Introduction to Formal Analysis of Natural Languages," Handbook of Mathematical Psychology II, John Wiley? Sons Inc., pp. 269-418 (1963). Give'on, Y., "The Theory of Algebraic Automata I: Morphisms and Regular Systems," Tech. Note 05662, 03105-27-T, ORA, The Univ. of Michigan (1964). Give'on, Y., "Outline for an Algebraic Study of Event Automata," Tech. Rep. 05662, 06689, 03105-28-T, ORA, The Univ. of Michigan (1964). Laing, R. ~ J. B. Wright, "Commutative Machines," Tech. Note 04422, 03105-24-T, ORA, The Univ. of Michigan (1962). -23

UNIVERSITY OF MICHIGAN 3 9015 02826 1033., 1;. ~ *~