THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY

OPTIMAL GRAPH CLUSTERING PROBLEMS WITH APPLICATIONS TO INFORMATION SYSTEM DESIGN

Wan Ping Chiang

CRL-TR-30-84

June 1984

Room 1079, East Engineering Building
Ann Arbor, Michigan 48109 USA
Tel: (313) 763-8000


ABSTRACT

OPTIMAL GRAPH CLUSTERING PROBLEMS WITH APPLICATIONS TO INFORMATION SYSTEM DESIGN

by Wan Ping Chiang

Chairmen: Toby J. Teorey, William C. Rounds

Information system design problems, including database and software problems, can often be represented in terms of directed or undirected graphs. Some of these problems typically involve determining how to cut the graph into a set of nonvoid and disjoint subgraphs such that each of the subgraphs satisfies a set of constraints while an objective function defined over the subgraphs is optimized. This class of information system design problems is analyzed with the following three objectives: First, to formally define this class of problems as an Optimal Graph Clustering (OGC) problem and classify it into a set of subproblems. Second, to find efficient algorithms that give an exact and optimal solution to each problem with nontrivial objective function and constraints. And third, to demonstrate these algorithms' usefulness by formalizing and solving some information system (database and software)

design problems. Eight classes of digraphs (general digraph, acyclic digraph, out-necklace, out-tree, out-star, in-necklace, in-tree and in-star) and four classes of undirected graphs (general undirected graph, necklace, tree and star) are considered and are used to classify the OGC problem into thirty-five subproblems. These subproblems are shown to be NP-complete. Dynamic programming and enumerative approaches are used to show that eighteen OGC subproblems whose given graph has n-1 arcs or n arcs can be solved in pseudopolynomial time. It is also determined that the rest of the NP-complete subproblems are solvable in practice if the given graphs are sparse (i.e., the number of arcs does not greatly exceed the number of vertices). The class of possible objective functions and constraints applicable to these solutions such that all time complexities remain the same is also defined. All monotonic objective functions and constant time computable constraints are found to be applicable. The solutions' usefulness is demonstrated by solving three information system design problems: B-tree secondary storage allocation, translation of an integrated schema into a hierarchical schema, and database record clustering.

OPTIMAL GRAPH CLUSTERING PROBLEMS WITH APPLICATIONS TO INFORMATION SYSTEM DESIGN by Wan Ping Chiang A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Computer and Communication Sciences) in The University of Michigan 1984 Doctoral Committee: Associate Professor Toby J. Teorey, Co-chairman Associate Professor William C. Rounds, Co-chairman Professor Bernard A. Galler Associate Professor Vaclav Rajlich

TABLE OF CONTENTS

DEDICATION ..................... ii
ACKNOWLEDGEMENTS ................. iii
LIST OF FIGURES .................. vii
LIST OF APPENDICES ................ ix

CHAPTER

1  INTRODUCTION ................. 1
   1.1  Problem Definition and Classifications ... 3
   1.2  Complexity Issues ............ 8
   1.3  Literature Survey ............ 11
   1.4  The Organization of the Dissertation .. 12

2  OPTIMAL OUT-TREE/OUT-STAR CLUSTERING OF OUT-TREE/OUT-STAR PROBLEMS ........... 14
   2.1  Optimal Out-Tree Clustering of an Out-Tree Problem (ODCtoto) ......... 15
        2.1.1  A Sequential Clustering Approach to ODCtoto ............ 16
        2.1.2  ODCtoto-1 Algorithm ....... 25
   2.2  Optimal Out-Star Clustering of an Out-Star Problem (ODCsoso) ......... 31
   2.3  Optimal Out-Star Clustering of an Out-Tree Problem (ODCsoto) ......... 32

3  OPTIMAL OUT-TREE/OUT-STAR/OUT-NECKLACE CLUSTERING OF GENERAL DIGRAPH/ACYCLIC DIGRAPH/NECKLACE PROBLEMS .......... 36
   3.1  The Structures of the Set of Possible Clusterings that Satisfy the Cluster Graph Class Constraint ......... 36
   3.2  Optimal Out-Tree Clustering of General Digraph, Acyclic Digraph and Out-Necklace Problems (ODCtog, ODCtoa, ODCtono) ...... 42

   3.3  Optimal Out-Star Clustering of General Digraph, Acyclic Digraph and Out-Necklace Problems (ODCsog, ODCsoa, ODCsono) ...... 46
   3.4  Optimal Out-Necklace Clustering of Out-Necklace and General Digraph Problems (ODCnono, ODCnog) .......... 48
        3.4.1  ODCnono Problem ......... 48
        3.4.2  ODCnog Problem .......... 50

4  OPTIMAL ACYCLIC DIGRAPH/GENERAL DIGRAPH CLUSTERING OF ACYCLIC DIGRAPH/GENERAL DIGRAPH PROBLEMS ................... 54
   4.1  Optimal Acyclic Digraph Clustering of an Acyclic Digraph Problem (ODCaa) .... 54
        4.1.1  A Sequential Clustering Approach to ODCaa ............ 54
        4.1.2  ODCaa-1 Algorithm ........ 65
   4.2  Optimal General Digraph Clustering of a General Digraph Problem (ODCgg) ..... 72
   4.3  Optimal Acyclic Digraph Clustering of a General Digraph Problem (ODCag) ..... 74

5  EXTENSIONS .................. 76
   5.1  Extensions of Classes of Graphs ..... 76
   5.2  Extensions of Objective Function and Constraint ............... 80
   5.3  Divide-and-Conquer Technique for ODCaa, ODCgg and ODCag ........... 84
        5.3.1  Cutpoint and Cover ........ 84
        5.3.2  Divide G into a Set of Covers .. 89
        5.3.3  Find and Combine the Clusterings of Each Cover to Form the Optimal Solution ............. 92

6  SOME INFORMATION SYSTEM DESIGN APPLICATIONS . 98
   6.1  B-Tree Secondary Storage Allocation ... 98

   6.2  Translation of Integrated Schema to IMS Schema ................. 103
   6.3  Database Record Clustering ......... 110

7  SUMMARY AND FUTURE RESEARCH ......... 118
   7.1  Summary of Results ........... 118
   7.2  Future Research ............. 120

APPENDICES ..................... 121
BIBLIOGRAPHY .................... 128

LIST OF FIGURES

Figure
1-1a  A clustering with X13=X25=0, X12=X24=X56=1 ... 4
1-1b  Three clusters induced by the clustering ... 4
1-2   Five classes of digraphs ... 6
2-1   An out-tree and its G[vr,i]'s ... 19
2-2   The eleven stages of the clustering process of the out-tree in Figure 2-1 ... 20
2-3   An example out-star and its optimal clustering result ... 33
3-1   All possible out-tree clusterings of an acyclic digraph ... 38
3-2   A general digraph and its two maximal out-necklace clusterings ... 45
3-3   An example out-necklace and its optimal clustering result ... 51
4-1   An acyclic digraph and its tree-like structure ... 56
4-2   The eleven stages of the clustering process ... 59
5-1a  In-necklace, in-tree and in-star ... 78
5-1b  General undirected graph, necklace, tree and star ... 79
5-2   A general digraph and its cover graph ... 85
5-3   Eleven stages of the clustering process of the cover graph given in Figure 5-2 ... 93
6-1   A B-tree of order 5 ... 101
6-2a  Translating an integrated schema into its corresponding IMS physical and logical databases: example 1 ... 107

6-2b  Translating an integrated schema into its corresponding IMS physical and logical databases: example 2 ... 108
6-3   Placements of record occurrences using CLUSTERED VIA SET NEAR OWNER ... 112
6-4   A database schema and its two maximal out-tree clusterings ... 114
A-1   A digraph representing the ODC breakup problem ... 124

LIST OF APPENDICES

Appendix
A  The Breakup of the Fourteen ODC Problems as an ODC Problem ... 122
B  Proof of Proposition 4.7 ... 126

CHAPTER 1
INTRODUCTION

This dissertation studies a class of information system (database and software) design problems that has the following two characteristics: 1) The problem can be represented as a graph (directed or undirected). 2) The problem is how to cut the graph into a set of nonvoid and disjoint subgraphs such that each of the subgraphs satisfies a set of constraints while an objective function defined over the subgraphs is optimized. Some examples of this class of problems are: B-tree secondary storage allocation (see Chapter 6); translation of database integrated schema to IMS schema (see Chapter 6); database record clustering (Schkolnick '77), (Chiang & Teorey '82), (also see Chapter 6); computational objects allocation in distributed systems (Jenny '82); software system partitioning (Belady & Evangelisti '81); and program restructuring (Ferrari '76).

There have been approaches to some of these problems, but they are usually specific to isolated cases or restricted to simple objective functions and constraints. This dissertation goes beyond isolated cases to provide a generalized and unified framework for this class of problems and a set of efficient and structured solutions to them. Effort is also spent on extending the objective function and constraints. The purpose of this study is thus threefold: 1) To formally define this class of problems as an Optimal Graph Clustering (OGC) problem and classify it into a set of subproblems. 2) To find efficient algorithms that give an exact and optimal solution to each problem with nontrivial objective function and constraints. Heuristic or approximate solutions are not considered. 3) To demonstrate these algorithms' usefulness by formalizing and solving some information system (database and software) design problems. The first purpose is taken up in this chapter. The definition of the OGC problem and the classification of it into a set of optimal graph clustering subproblems are presented in section 1.1. The complexity issues of the subproblems are studied in section 1.2. Section 1.3 gives a literature survey of the OGC problem and section 1.4 gives the organization of this dissertation. The graph theory terminology used throughout this

dissertation follows that of Harary ('69) and Aho et al ('74). If a different term/notation is used, it will be followed by the established term/notation in brackets.

1.1 Problem Definition and Classifications

Given a (connected) graph G=(V,E) with vertex set V and arc set E, a clustering K of G is defined as a cutting of G into a set of nonvoid and disjoint (connected) subgraphs by removing some arcs in G. Thus, a clustering K is an assignment function which assigns either 0 or 1 to each arc of E. An assignment with value 0 means the arc is removed (or cut) while a 1 assignment means the arc is not removed (or cut). Each clustering K can be represented as a set of arc assignments Xij, i.e., K = {Xij | Xij = K((vi,vj)) for all (vi,vj) in E} (see Figure 1-1a), or as the set of clusters it induces, i.e., K = {C | C is a cluster induced by K}. Each of these clusters is a (connected) subgraph of G (see Figure 1-1b). Let |E| represent the number of arcs in G and let PK represent the set of all possible clusterings of G. Thus, PK contains 2^|E| possible clusterings of G. A clustering of G which satisfies a given set of constraints defined for a problem is called a feasible clustering. Let PKf be the subset of PK which contains only feasible clusterings. Based on the above, an Optimal Graph Clustering (OGC) problem can be defined as:

Figure 1-1a  A clustering with X13=X25=0, X12=X24=X56=1 (B = 3; wi = 1 for every vi)
Figure 1-1b  Three clusters induced by the clustering

Given: 1) a graph G=(V,E); 2) an objective function F; 3) two constraints: one limits the size of each cluster and the other requires each cluster to be a (connected) subgraph of G.

Find: a clustering K which satisfies these two constraints and has optimal objective function value F(K).

In this dissertation, only twelve classes of graphs are considered. They are: (connected) directed graphs -- general digraph (Gg), acyclic digraph (Ga), out-necklace (Gno) [contrafunctional digraph (Harary '69)], out-tree (Gto), out-star (Gso), in-necklace (Gni) [functional digraph (Harary '69)], in-tree (Gti), in-star (Gsi) (see Figures 1-2 and 5-1a); (connected) undirected graphs -- general undirected graph (Gu), necklace (Gn) [unicyclic graph (Harary '69)], tree (Gt), star (Gs) (see Figure 5-1b). The OGC problem can be classified into many subproblems by using the class of the graph given and the class of the clusters induced by a clustering. Since an undirected graph can never be obtained by clustering a directed graph and vice versa, the OGC problem can first be classified into two families of problems. One family deals with the directed graph and is called an Optimal Digraph Clustering (ODC) problem. The other deals with the undirected graph and is

Figure 1-2  Five classes of digraphs: general digraph, acyclic digraph, out-necklace, out-tree, out-star.

called an Optimal Undirected Graph Clustering (OUC) problem. Both ODC and OUC problems can be refined into a set of subproblems based on whether a class of graph can be obtained by clustering another class of graph. A class of graph B (e.g., out-tree) is a subclass of a class of graph A (e.g., acyclic digraph), denoted A --> B, if B is a class of graph obtainable by clustering A (i.e., B can be a subgraph of A).

Proposition 1.1:
a) Gg --> Ga; Gg --> Gno --> Gto --> Gso; Ga --> Gto; Gg --> Gni --> Gti --> Gsi; Ga --> Gti.
b) Gu --> Gn --> Gt --> Gs.

Proof: a) (Gg --> Gno): An out-necklace is a digraph in which every vertex has indegree 1. Given a general digraph with cycles and q m-indegree vertices v1, v2, ..., vq in it (i.e., those that have indegree more than 1), to transform an m-indegree vertex vi into an indegree 1 vertex we need to remove id(vi) - 1 arcs, where id(vi) is the indegree of vi. By repeating all combinations of the removal process, we can find at least one occurrence of an out-necklace. If this is not true, then the original given digraph is an acyclic digraph, which is a contradiction. Thus, Gg -->

Gno. The rest of the subclass relations in a) can be proved similarly. Q.E.D.
b) Similar to a). Q.E.D.

Using Proposition 1.1, the set of subproblems for ODC is (the first subscript of ODC or OUC is the class of cluster induced; the second is the class of the graph given):

  given graph Gg:         ODCgg, ODCag, ODCnog, ODCtog, ODCsog, ODCnig, ODCtig, ODCsig
  given graph Ga:         ODCaa, ODCtoa, ODCsoa, ODCtia, ODCsia
  given graph Gno (Gni):  ODCnono, ODCtono, ODCsono  (ODCnini, ODCtini, ODCsini)
  given graph Gto (Gti):  ODCtoto, ODCsoto  (ODCtiti, ODCsiti)
  given graph Gso (Gsi):  ODCsoso  (ODCsisi)

The set of subproblems for OUC is:

  given graph Gu:         OUCuu, OUCnu, OUCtu, OUCsu
  given graph Gn:         OUCnn, OUCtn, OUCsn
  given graph Gt:         OUCtt, OUCst
  given graph Gs:         OUCss

1.2 Complexity Issues

All the OGC subproblems being discussed in this dissertation are NP-complete problems. To prove that, it is only necessary to show that ODCsoso, ODCsisi and OUCss are. To show OUCss, ODCsoso and ODCsisi are NP-complete, it is only necessary to show that the integer versions of them with integer objective function value and constraint are. The integer versions of these three problems can be defined as follows:

Given: 1) a star/out-star/in-star; 2) an objective function F(K) = Σ{(vi,vj) in E} fij*Xij, where fij is a positive integer value associated with each arc (vi,vj); 3) two constraints: 3.1) every cluster must be a star/out-star/in-star or its subclass; 3.2) let a vertex weight be a positive integer value wi associated with each vertex and the cluster weight of a cluster be defined as the sum of all vertex weights in the cluster; every cluster weight must be less than or equal to a positive integer B.

Find: a feasible clustering K such that F(K) is minimized.

Proposition 1.2: All OGC subproblems considered in this

dissertation are NP-complete problems.

Proof: Bertolazzi et al ('80) showed the subset sum problem is polynomial-time reducible to the integer version of OUCss. Since the former has been shown to be NP-complete, the latter is also NP-complete. The same technique can also be used to show the integer version of ODCsoso/ODCsisi is NP-complete by replacing the term "tree" with "out-tree/in-tree". Since the integer version of OUCss/ODCsoso/ODCsisi is NP-complete, the general version of OUCss/ODCsoso/ODCsisi that is defined in section 1.1 is also NP-complete by using the proof by restriction technique (Garey & Johnson '79). Again, using proof by restriction, since the rest of the OGC subproblems are generalizations of the general version of OUCss/ODCsoso/ODCsisi, they are also NP-complete. Q.E.D.

The fact that all of the OGC subproblems discussed in this dissertation are NP-complete does not mean efficient and practical algorithms cannot be found to solve them. Because they are also number problems, eighteen pseudopolynomial time solutions have been found in this dissertation (the subproblems in the two tables above whose given graph is an out-necklace, out-tree, out-star, in-necklace, in-tree, in-star, necklace, tree or star). One advantage of pseudopolynomial time solutions is that they are practical (efficient) in many real life applications. A second advantage is that they can often be turned into fully polynomial time approximation schemes (Garey & Johnson '79).

1.3 Literature Survey

Many studies on the OGC problem originated from diverse application areas. Some of them are listed at the beginning of this chapter. Most of the early approaches formalize the problem too generally (as an OUCuu or ODCgg problem) and are very difficult. Thus, either heuristic algorithms are proposed or only problems with restricted objective functions and constraints have been solved. Lukes ('72,'74,'75) used a dynamic programming technique to solve OUCuu and ODCtoto with restricted objective function and constraints and provided an O(n*B^2) solution for the latter, where n is the number of vertices in the given graph and B is an integer weight bound on the clusters. In this study, Lukes' technique is used with considerable modifications and extensions. Recently, efficient algorithms have been found for ODCtoto and OUCss. They are: Bertolazzi et al ('80) developed an O(n*B) algorithm for the integer version of OUCss; Johnson and Niemi ('83) found an O(n*B) algorithm for a nonnegative integer version of ODCtoto where the objective function value, the vertex weight and the bound B are nonnegative integers; Schrader ('83) devised an approximation algorithm for ODCtoto; Perl and Snir ('83) discussed a generalized solution for the integer version of ODCtoto -- the

generalization is to allow an additional integer bound on the total objective function value of each cluster.

1.4 The Organization of the Dissertation

This dissertation is organized into seven chapters. In Chapter 1, the motivation, the problem definition and classifications, complexity issues and a literature survey are presented. In Chapters 2 through 4, three subsets of the ODC problem solutions involving only general digraph, acyclic digraph, out-necklace, out-tree and out-star are presented (see Figure 1-2). The discussions are restricted to a simple objective function and two constraints. The division of these ODC problems into three chapters can itself be formulated as an ODC problem; its solution is discussed in Appendix A. In Chapter 5, the solutions presented in Chapters 2 to 4 are used to solve the rest of the ODC problems and all of the OUC problems. Also included are: how to deal with disconnected graphs and graphs with parallel arcs or loops; an extension to the class of objective functions and constraints that can be defined for all OGC subproblems so that the time complexity of the corresponding algorithm remains the same; and a technique to lessen the high complexity of the algorithms presented in Chapter 4. In Chapter 6, three information system design problems are presented to demonstrate the usefulness of the solutions. The three applications are: B-tree secondary storage allocation, translation of a database integrated

schema into an IMS schema, and database record clustering. A summary of results and a future research direction are given in Chapter 7.
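Before turning to the algorithms of Chapter 2, it may help to make the arc-assignment view of a clustering (section 1.1) concrete. The sketch below is illustrative only and is not part of the dissertation: the function name is invented, and the arc set of the example is inferred from the captions of Figures 1-1 and 2-1 (arcs (1,2), (1,3), (2,4), (2,5), (5,6) with X13 = X25 = 0 and the remaining arcs kept).

from collections import defaultdict

def induced_clusters(vertices, arcs, assignment):
    """Clusters induced by a clustering: the connected pieces that remain
    after removing every arc whose assignment is 0 (arc direction is
    ignored when grouping vertices)."""
    parent = {v: v for v in vertices}

    def find(v):                          # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for (i, j) in arcs:
        if assignment[(i, j)] == 1:       # kept arc: i and j share a cluster
            union(i, j)

    clusters = defaultdict(lambda: {"V": [], "E": []})
    for v in vertices:
        clusters[find(v)]["V"].append(v)
    for (i, j) in arcs:
        if assignment[(i, j)] == 1:
            clusters[find(i)]["E"].append((i, j))
    return list(clusters.values())

V = [1, 2, 3, 4, 5, 6]
E = [(1, 2), (1, 3), (2, 4), (2, 5), (5, 6)]
K = {(1, 2): 1, (1, 3): 0, (2, 4): 1, (2, 5): 0, (5, 6): 1}
for c in induced_clusters(V, E, K):
    print(c)          # three clusters: {1,2,4}, {3}, {5,6}, as in Figure 1-1b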

CHAPTER 2
OPTIMAL OUT-TREE/OUT-STAR CLUSTERING OF OUT-TREE/OUT-STAR PROBLEMS

This chapter analyzes and solves three ODC problems: ODCtoto, ODCsoto and ODCsoso (cluster 1 in Appendix A). Since the algorithm used to solve ODCtoto is fundamental to the understanding of all solutions to the OGC subproblems, a detailed description of the approach to solving it and its solution are first presented in section 2.1. In section 2.2, a solution for ODCsoso is discussed. The solution for ODCsoto is given in section 2.3. Throughout this chapter and Chapters 3 and 4, two simple constraints and a simple objective function are used. They are defined in terms of: 1) a vertex weight wr, which is a nonnegative integer associated with each vertex vr of V; 2) an arc value fij, which is a nonnegative real number associated with each arc (vi,vj) of E; and 3) a weight bound B, which is a nonnegative integer. Also, it is defined here that the cluster weight W(C) is the sum of the vertex weights of all vertices in cluster C and the cluster value F(C) is the sum of the arc values of all arcs in cluster C.
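A small sketch tying this notation to code may help: cluster weight W(C), cluster value F(C), and the feasibility test "W(C) <= B for every cluster". It is illustrative only; the helper names are invented, the clusters checked are those of the optimal solution found later in section 2.1.2, and the arc values are the ones that can be inferred from its triplet trace.

def W(cluster, w):
    # cluster weight: sum of the vertex weights in the cluster
    return sum(w[v] for v in cluster["V"])

def F(cluster, f):
    # cluster value: sum of the arc values of the arcs inside the cluster
    return sum(f[a] for a in cluster["E"])

def feasible(clusters, w, B):
    return all(W(c, w) <= B for c in clusters)

w = {v: 1 for v in range(1, 7)}                       # unit vertex weights
f = {(1, 2): 3, (1, 3): 5, (2, 4): 3, (2, 5): 7, (5, 6): 2}
clusters = [{"V": [1, 3], "E": [(1, 3)]},
            {"V": [2, 4, 5], "E": [(2, 4), (2, 5)]},
            {"V": [6], "E": []}]
print(feasible(clusters, w, B=3))                     # True
print(sum(F(c, f) for c in clusters))                 # 15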

2.1 Optimal Out-Tree Clustering of an Out-Tree Problem (ODCtoto)

The ODCtoto problem is to find a set of nonvoid and disjoint clusters that are subgraphs of the given out-tree such that a set of constraints (one of them requires each cluster to be an out-tree or its subclass) is satisfied while an objective function is optimized. A maximization version of ODCtoto is defined as:

Given: 1) an out-tree G=(V,E); 2) two constraints on the set of clusters induced by a clustering K: a) cluster graph class constraint: every cluster must be an out-tree or its subclass; b) integer cluster weight constraint: every cluster C must have a cluster weight less than or equal to B; 3) an objective function F: PK --> R+ u {0} (nonnegative real numbers) which is defined as the sum of the cluster values of all clusters Cp induced by a clustering K, i.e.,

    F(K) = Σ{all Cp induced by K} F(Cp)
         = Σ{all Cp induced by K} ( Σ{(i,j) in Cp} fij )
         = Σ{Xij=1} fij
         = Σ{(i,j) in E} Xij*fij.

Find: a feasible clustering K such that F(K) is maximized.

A minimization version of ODCtoto can be defined

similarly by changing the objective function to

    F(K) = Σ{(i,j) in E} fij - Σ{Xij=1} fij = Σ{Xij=0} fij,

and the problem is to find a feasible clustering K such that F(K) is minimized. Obviously, the maximization and minimization versions are equivalent. Hereafter, only the maximization version will be used to define problems and to illustrate their solutions. The solution approach will first be presented in section 2.1.1 and an algorithm to solve ODCtoto in section 2.1.2.

2.1.1 A Sequential Clustering Approach to ODCtoto

From the definition of a clustering (section 1.1), an enumerative clustering approach can be used to solve ODCtoto and the rest of the problems being considered in this dissertation. Such an approach consists of three steps: 1) Enumerate all possible clusterings by cutting all possible combinations of zero arcs, one arc, two arcs, ..., and m arcs. 2) Determine the subset of the set of possible clusterings that satisfies all constraints. 3) Find an optimal value clustering among these feasible clusterings. The first step forms (m choose 0) + (m choose 1) + ... + (m choose m) = 2^m clusterings

and is therefore impractical. It is proposed here that a sequential clustering approach can solve ODCtoto in pseudopolynomial time. Given an out-tree G, a sequential clustering approach sequences the clustering process into a finite number of sequentially ordered stages. At each stage, a larger set of subgraphs of G, which is obtained from the immediately previous stage by adding a vertex or an arc, is considered. Because the set of subgraphs of G considered at one stage contains only one more vertex or arc than its immediately previous stage, the set of all feasible clusterings of one stage can be obtained from the immediately previous stage by simple operations (i.e., operations that take constant time to execute). The main thrust is to use the integer cluster weight constraint and the objective function to define an elimination rule such that at most a constant number (B) of feasible clusterings remains at each stage. Thus, the problem can be solved in pseudopolynomial time. To describe how the clustering process is sequenced and to identify the simple operations needed in the sequential clustering approach, it is necessary to first define some terms and notations.

Terms and Notations: Let the given out-tree G=(V,E) have m arcs and n vertices and let vroot be the root of G. For each vr in V, G[vr] is any subgraph of G rooted at vr. od(vr) is the outdegree of vr, where od(vr)=0 if vr is a leaf vertex. For each

vr in V and i, 0 <= i <= od(vr), G[vr,i] is the subgraph of G induced by vr, by its first i children (from left to right), and by all their descendants (see Figure 2-1). Note that G[vroot,od(vroot)] = G and G[vr,0] is the subgraph of G consisting of only the vertex vr. A subgraph G[vr,i] of G rooted at vr is a complete subgraph if i = od(vr) and is a partial subgraph if i < od(vr). Observe that G[vr,i+1] is a larger subgraph of G than G[vr,i]. It not only contains the partial subgraph G[vr,i] rooted at vr but also the complete subgraph rooted at vj, which is the (i+1)th child of vr (denoted by G[vj,od(vj)]), and the arc (vr,vj) connecting the two subgraphs G[vr,i] and G[vj,od(vj)]. The clustering process is sequenced in stages by visiting the vertices of G in postorder traversal sequence (Aho et al '74). At each vertex vr, G[vr,0] is first considered and then G[vr,1], G[vr,2], ..., G[vr,od(vr)]. Since there are n vertices and m arcs, there are m+n stages. Figure 2-2 shows all eleven stages of the clustering process of the out-tree with six vertices and five arcs given in Figure 2-1. Note that each vertex vr is involved in od(vr)+1 stages. In Figure 2-2, because the vertices are visited in postorder traversal sequence, at each stage of the clustering process the graph considered is a set of subgraphs of G (not necessarily a single connected subgraph). For example, at stage 6, there are two

Figure 2-1  An out-tree and its G[vr,i]'s (G[2,0], G[2,1], G[2,2] shown; B = 3; wi = 1 for every vi).

Figure 2-2  The eleven stages of the clustering process of the out-tree in Figure 2-1.

subgraphs: {V = {2,4}, E = {(2,4)}} and {V = {5,6}, E = {(5,6)}}. Except for possibly one, each subgraph in the set is a complete subgraph of G. For example, at stage 6, subgraph {V = {5,6}, E = {(5,6)}} is a complete subgraph rooted at 5 while {V = {2,4}, E = {(2,4)}} is a partial subgraph rooted at 2. At stage 7, there is only one subgraph rooted at 2 and it is a complete subgraph. There are two patterns of changes in the set of subgraphs considered from stage to stage: 1) If all subgraphs in stage k are complete subgraphs, then in stage k+1 the set of subgraphs considered will contain a new vertex as a separate subgraph. This stage k+1 is called a new vertex stage (e.g., stages 1, 2, 3, 5, 8 and 9 in Figure 2-2). 2) If one of the subgraphs in stage k is a partial subgraph, then in stage k+1 the set of subgraphs considered will contain a new arc which connects the root of this partial subgraph with the root of a complete subgraph existing in the set of subgraphs considered in stage k. This stage k+1 is called a new arc stage (e.g., stages 4, 6, 7, 10 and 11 in Figure 2-2). Since the set of subgraphs of G considered at each new stage only contains one more new vertex or arc than the immediately previous stage, there is no need to form all possible clusterings from scratch for each new stage. This means that if the set of all possible clusterings PKk of stage k is

known, the set of all possible clusterings PKk+1 of the next stage k+1 can be obtained by some simple operations on PKk. The set of all possible clusterings PKk formed at stage k can be represented by s subsets of clusterings if there are s disjoint subgraphs considered at that stage. Each subset of clusterings represents the set of all possible clusterings for one subgraph. Let each clustering be represented by the set of clusters it induces. For example, in Figure 2-2, {{V={4},E=0}}, {{V={5,6},E=0}, {V={5,6},E={(5,6)}}} are all possible clusterings of the two subgraphs in stage 4. To obtain all possible clusterings of stage 5, a new vertex stage, an operation which "inserts" vertex 2 as a separate cluster into the set of possible clusterings is needed. Thus, the resulting clusterings of stage 5 are {{V={2},E=0}}, {{V={4},E=0}}, {{V={5,6},E={(5,6)}}, {V={5,6},E=0}}. Consider stage 6, a new arc stage: since vertices 2 and 4 are connected by arc (2,4), there is a possibility that {V={2},E=0} and {V={4},E=0} could be "connected" together to form one cluster after clustering (i.e., assignment X24=1) or not be connected together but simply "put" next to each other (i.e., assignment X24=0). Thus, the new set of possible clusterings in stage 6 is {{V={2,4},E={(2,4)}}, {V={2,4},E=0}}, {{V={5,6},E={(5,6)}}, {V={5,6},E=0}}. It is now obvious that the necessary operations needed in the sequential clustering approach are an insertion operation in the new vertex stage and connect/put operations

in the new arc stage. To formalize these operations, let {G[v1,od(v1)], G[v2,od(v2)], ..., G[vj,od(vj)], ..., G[vr,i]} be the set of mutually disjoint subgraphs rooted at v1, v2, ..., vj, ..., vr that are being considered in stage k, with G[vj,od(vj)] as the (i+1)th complete child subgraph of vr. Let PKk be the set of all possible clusterings formed at stage k. PKk includes PK(G[v1,od(v1)]), PK(G[v2,od(v2)]), ..., PK(G[vj,od(vj)]), ..., and PK(G[vr,i]) because there are r disjoint subgraphs considered at stage k. The operations needed in the two types of stages are:

For a new vertex stage: At stage k+1, the set of subgraphs considered is {G[v1,od(v1)], G[v2,od(v2)], ..., G[vj,od(vj)], ..., G[vr,i], G[vp,0]} where G[vp,0] represents the new vertex vp that is introduced. There is only one way to obtain a clustering of a single vertex, which is a cluster containing only this vertex. Thus, the new set of all possible clusterings can be obtained by inserting {{V={vp},E=0}} into PKk, i.e.,

    INSERT operation: PKk+1 = PKk u {{V={vp},E=0}}.

For a new arc stage: At stage k+1, the set of subgraphs considered is {G[v1,od(v1)], G[v2,od(v2)], ..., G[vj-1,od(vj-1)], G[vj+1,od(vj+1)], ..., G[vr,i+1]} where G[vr,i+1] is the subgraph formed by connecting G[vj,od(vj)] and G[vr,i]

that exist in stage k with an arc (vr,vj). Let the sets of possible clusterings of G[vj,od(vj)] and G[vr,i] be denoted by PK[vj,od(vj)] and PK[vr,i] respectively. Thus, PKk+1 = PKk - PK[vr,i] - PK[vj,od(vj)] u PK[vr,i+1]. Let K[vr,i] and K[vj,od(vj)] be any two clusterings of PK[vr,i] and PK[vj,od(vj)] respectively. Let K[vr,i] = {C1,C2,...,Cp} and K[vj,od(vj)] = {C1',C2',...,Cq'}, and assume C1 and C1' are the clusters that contain the root of G[vr,i] and of G[vj,od(vj)] respectively. The set of all possible clusterings PK[vr,i+1] can be obtained from all possible combinations of K[vr,i] in PK[vr,i] and K[vj,od(vj)] in PK[vj,od(vj)] for assignment Xrj = 0 (denoted by PK0[vr,i+1]) and Xrj = 1 (denoted by PK1[vr,i+1]) on arc (vr,vj), i.e.,

    PK[vr,i+1] = PK0[vr,i+1] u PK1[vr,i+1];
    PK0[vr,i+1] = {K0[vr,i+1] | K0[vr,i+1] is formed using PUT for every pair of K[vr,i] and K[vj,od(vj)]};
    PK1[vr,i+1] = {K1[vr,i+1] | K1[vr,i+1] is formed using CONNECT for every pair of K[vr,i] and K[vj,od(vj)]},

where

    PUT operation: K0[vr,i+1] = K[vr,i] u K[vj,od(vj)];
    CONNECT operation: K1[vr,i+1] = {C2,C3,...,Cp,C2',C3',...,Cq'} u {C1''}, where C1'' is the cluster obtained by connecting clusters C1 and C1' with arc (vr,vj).

To obtain all feasible clusterings instead of just all possible clusterings, the out-tree cluster and cluster

weight constraints need to be enforced at each stage. This can easily be done because: 1) Any cluster formed in the sequential clustering approach is a subgraph of an out-tree and is therefore an out-tree itself. Thus, the out-tree cluster constraint is automatically enforced in the process. 2) For each pair of clusterings from PK[vr,i] and PK[vj,od(vj)], only the CONNECT operation forms a new cluster C1'' which has a new, higher cluster weight value. To make sure every cluster formed has weight at most B at each new arc stage, the CONNECT operation can be made conditional as follows:

    IF W(C1) + W(C1') <= B THEN K1[vr,i+1] = {C2,C3,...,Cp,C2',C3',...,Cq'} u {C1''}.

Thus, the set of all feasible clusterings formed at stage k+1 (denoted PKf,k+1) can easily be obtained by applying either INSERT or PUT/conditional CONNECT operations on PKf,k.

2.1.2 ODCtoto-1 Algorithm

For computational purposes, let each feasible clustering K of a subgraph be represented by a triplet (F(K),W(K),CF(K)) that contains the following three components: 1) A value component F(K) which stores the sum of the cluster values of all the clusters induced by the clustering. 2) A weight component W(K) which stores the cluster

weight of the cluster that contains the root of the subgraph. 3) A configuration component CF(K) which stores the clusters induced by this clustering of the subgraph; each cluster is represented by the vertices in it enclosed by "< >". At each stage of the clustering process, if there are s subgraphs considered at the stage, there are s sets of feasible triplets formed. Each of these s sets of feasible triplets represents all feasible clusterings of one subgraph considered at that stage.

Notation: Let PKf,k and PKf,k+1 now represent all feasible triplets produced at stages k and k+1, and let PKf[vr,i+1], PKf[vr,i] and PKf[vj,od(vj)] now represent all feasible triplets of the subgraphs G[vr,i+1], G[vr,i] and G[vj,od(vj)] respectively. Let K[vr,i+1], K[vr,i] and K[vj,od(vj)] be a feasible triplet in PKf[vr,i+1], PKf[vr,i] and PKf[vj,od(vj)] respectively. The three operations can now be redefined in terms of triplet formation:

    INSERT operation: PKf,k+1 = PKf,k u {(0, wp, <vp>)}.

    PUT operation: form K0[vr,i+1] with

        F(K0[vr,i+1]) = F(K[vr,i]) + F(K[vj,od(vj)]),
        W(K0[vr,i+1]) = W(K[vr,i]), and
        CF(K0[vr,i+1]) = CF(K[vr,i]) u CF(K[vj,od(vj)]).

    Conditional CONNECT operation: IF W(C1) + W(C1') <= B THEN form K1[vr,i+1] with
        F(K1[vr,i+1]) = F(K[vr,i]) + F(K[vj,od(vj)]) + frj,
        W(K1[vr,i+1]) = W(C1) + W(C1'), and
        CF(K1[vr,i+1]) = {C2,C3,...,Cp,C2',C3',...,Cq'} u {C1 u C1'}.

For example, let {(0,1,<4>)}, {(0,1,<5><6>),(2,2,<5,6>)} be the sets of all feasible triplets formed at stage 4 in Figure 2-2. The feasible triplet sets formed at stage 5 using the INSERT operation are {(0,1,<2>)}, {(0,1,<4>)}, {(0,1,<5><6>),(2,2,<5,6>)}. The feasible triplet sets formed at stages 6 and 7 using PUT and conditional CONNECT are (assume B=3 and wi=1 for every vi):

    stage 6: {(0,1,<2><4>),(3,2,<2,4>)}, {(0,1,<5><6>),(2,2,<5,6>)};
    stage 7: {(0,1,<2><4><5><6>),(2,1,<2><4><5,6>),(3,2,<2,4><5><6>),(5,2,<2,4><5,6>),(7,2,<2,5><4><6>),(9,3,<2,5,6><4>),(10,3,<2,4,5><6>)}.
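The PUT/conditional CONNECT step is easy to express in code. The sketch below forms all triplets of PK[vr,i+1] from PK[vr,i] and PK[vj,od(vj)]; it is illustrative only (the function name is invented), and it reproduces the stage 6 to stage 7 transition above with the arc value f25 = 7 inferred from the triplet trace given later in this section.

def combine_stage(pk_root, pk_child, root, child, f_rj, B):
    """All triplets for G[vr,i+1] from PK[vr,i] (pk_root) and
    PK[vj,od(vj)] (pk_child).  A triplet is (F, W, CF): total value,
    weight of the cluster holding the subgraph root, and the induced
    clusters as a frozenset of frozensets of vertices."""
    out = []
    for (F1, W1, CF1) in pk_root:
        for (F2, W2, CF2) in pk_child:
            # PUT: keep the two clusterings side by side (X_rj = 0)
            out.append((F1 + F2, W1, CF1 | CF2))
            # conditional CONNECT: merge the two root clusters (X_rj = 1)
            if W1 + W2 <= B:
                C1 = next(c for c in CF1 if root in c)
                C1p = next(c for c in CF2 if child in c)
                merged = (CF1 - {C1}) | (CF2 - {C1p}) | {C1 | C1p}
                out.append((F1 + F2 + f_rj, W1 + W2, merged))
    return out

# stage 6 of Figure 2-2 (B = 3, unit vertex weights)
pk_2 = [(0, 1, frozenset({frozenset({2}), frozenset({4})})),
        (3, 2, frozenset({frozenset({2, 4})}))]
pk_5 = [(0, 1, frozenset({frozenset({5}), frozenset({6})})),
        (2, 2, frozenset({frozenset({5, 6})}))]
for t in combine_stage(pk_2, pk_5, root=2, child=5, f_rj=7, B=3):
    print(t)       # the seven stage-7 triplets listed above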

28 stage can double the number of triplets formed in the immediate previous stage using PUT/conditional CONNECT. Thus, unless reduction of the number of feasible triplets can be done at the end of each new arc stage, the total computation time can still be exponential. Fortunately, some triplets formed at each new arc stage can be eliminated because they can never be extended to the optimal clustering of the out-tree. The idea is to define a dominance relation on the set of new feasible triplets formed at each new arc stage. Let K1 and K2 be any two feasible triplets formed at the same new arc stage. K1 dominates K2 if F(K1) 2 F(K2) and W(K1) < W(K2). Proposition 2.1: The dominated triplet K2 cannot be extended to the optimal clustering of the out-tree. Proof: (See Proposition 4.4 in Chapter 4 where a generalized proof is given.) Q.E.D. Proposition 2.1 implies that at the end of each new arc stage, all dominated triplets can be eliminated. The elimination can be done in two steps by decomposing W(K1) < W(K2) into W(K1) = W(K2) and W(K1) < W(K2): step 1) W(K1) = W(K2) and F(K1) 2 F(K2): For triplets that have the same weight, retain the one with the highest value and eliminate the rest. Thus, only one triplet remains for each integer weight value.

step 2) W(K1) < W(K2) and F(K1) >= F(K2): For every remaining triplet K2, if it has a higher weight than another triplet K1 but a lower or equal value, then eliminate K2.

Assume that every vertex has vertex weight 1. Since the weight of a triplet which represents a feasible clustering of a subgraph is an integer less than or equal to B, the maximum number of triplets remaining for a subgraph after each stage is B. If vertices can have weight higher than 1, and Z is the maximum number of vertices that can be put in a cluster with cluster weight less than or equal to B, then the maximum number of triplets remaining for a subgraph after each stage is Z (< B). A complete algorithm called ODCtoto-1 can now be specified:

ALGORITHM: ODCtoto-1
INPUT: an out-tree G;
OUTPUT: an optimal out-tree clustering of G and its value;
PROCESS:
I) FOR each vr visited in postorder traversal sequence DO:
   1) FOR j=0 to od(vr) DO:
      1.1) IF j=0 THEN form a new PKf,k+1 from PKf,k using INSERT;
      ELSE
      1.2) form a new PKf,k+1 from PKf,k using PUT and conditional CONNECT;

           eliminate all dominated triplets;
      END;
II) find the optimal value clustering in PKf,m+n and RETURN the solution;
END.

Proposition 2.2: The running time of algorithm ODCtoto-1 is O(n*Z^2) (or O(n*B^2) if every vertex has vertex weight 1).

Proof: Every new vertex stage takes constant time while every new arc stage takes 2*Z^2 time because of the PUT and conditional CONNECT. Thus, step I) takes O(n*Z^2) time. Step II) takes O(Z) time. The time complexity of ODCtoto-1 is therefore O(n*Z^2). Q.E.D.

Example: Applying ODCtoto-1 to the example given in Figure 2-2, the following 24 triplets are formed during the clustering process:

    triplet                     formed by   formed by    eliminated by  eliminated by
                                INSERT      PUT/CONNECT  PUT/CONNECT    dominance rel.
                                (stage #)   (stage #)    (stage #)      (stage #)
    (0,1,<4>)                   1                        6
    (0,1,<6>)                   2                        4
    (0,1,<5>)                   3                        4
    (0,1,<5><6>)                            4            7
    (2,2,<5,6>)                             4            7
    (0,1,<2>)                   5                        6
    (0,1,<2><4>)                            6            7
    (3,2,<2,4>)                             6            7
    (0,1,<2><4><5><6>)                      7                           7
    (2,1,<2><4><5,6>)                       7            10
    (7,2,<2,5><4><6>)                       7            10
    (9,3,<2,5,6><4>)                        7                           7
    (3,2,<2,4><5><6>)                       7                           7
    (5,2,<2,4><5,6>)                        7                           7
    (10,3,<2,4,5><6>)                       7            10

    (0,1,<3>)                   8                        11
    (0,1,<1>)                   9                        10
    (2,1,<1><2><4><5,6>)                    10                          10
    (7,1,<1><2,5><4><6>)                    10                          10
    (10,1,<1><2,4,5><6>)                    10           11
    (5,2,<1,2><4><5,6>)                     10                          10
    (10,3,<1,2,5><4><6>)                    10                          10
    (10,1,<1><2,4,5><6><3>)                 11
    (15,2,<1,3><2,4,5><6>)                  11

The optimal solution is (15,2,<1,3><2,4,5><6>). Further improvements on ODCtoto-1 can be made, but since the time complexity would not change, we will not discuss them here.

2.2 Optimal Out-Star Clustering of an Out-Star Problem (ODCsoso)

The ODCsoso problem is to find a set of bounded out-star clusters, or its subclass clusters (i.e., those that have cluster weight at most B), from an out-star such that the total value of the set is optimized. A formal definition for the problem is the same as that of ODCtoto except that the given digraph is an out-star and each cluster C must be an out-star or its subclass. Since an out-star is a special case of an out-tree, all previous results of ODCtoto are applicable here. Thus, an algorithm ODCsoso-1, which is identical to ODCtoto-1, can be used to solve ODCsoso and even has a lower time complexity.

Proposition 2.3: The application of ODCsoso-1 yields a time complexity of O(n*Z).

Proof: The time complexity is O(n*Z) instead of O(n*Z^2) because each PKf[vj,od(vj)] has only one triplet in it

instead of Z triplets. Thus, the new arc stages take O(n*Z) time in total. Q.E.D.

Example: Applying ODCtoto-1 to the out-star shown in Figure 2-3 yields the optimal clustering solution (8,3,<1,3,4><2>).

2.3 Optimal Out-Star Clustering of an Out-Tree Problem (ODCsoto)

The ODCsoto problem is to find a set of bounded out-star clusters, or its subclass clusters, from an out-tree such that the total value of the set is maximized. A formal definition for this problem is the same as that of ODCtoto except that each cluster C must be an out-star or its subclass. The constraint that each cluster must be an out-star or its subclass can be enforced in many ways (e.g., incorporate the constraint as an additional condition for applying conditional CONNECT). The following proposition suggests the most effective way to enforce the constraint while lowering the time complexity of the solution algorithm.

Proposition 2.4: Only two triplets need to be retained in each PKf[vj,od(vj)]: one whose configuration contains vj alone as a cluster, and one with the maximum value.

Proof: The constraint that each cluster must be an out-star or its subclass means that if a vertex vj is connected to its parent vertex vr in a cluster by the conditional CONNECT

Figure 2-3  An example out-star and its optimal clustering result.

operation (i.e., assignment Xrj=1), then vj cannot have any child vk connected to it (i.e., every assignment Xjk must be 0) and vice versa (otherwise the formed cluster is not an out-star or its subclass). Thus, not every pair of triplets in PKf[vr,i] and PKf[vj,od(vj)] can be used by conditional CONNECT to form a new triplet. As a matter of fact, only a triplet in PKf[vj,od(vj)] which contains vj alone as a cluster can be used to form a new triplet with every triplet in PKf[vr,i]. Similarly, given a triplet K[vr,i] in PKf[vr,i], there is no need to form Z triplets from K[vr,i] and every triplet in PKf[vj,od(vj)] using PUT. This is because all these Z formed triplets have the same weight and thus only the one with the highest value remains at the end of this stage. This implies that only the triplet in PKf[vj,od(vj)] with the maximum value is necessary to form a new triplet with every triplet in PKf[vr,i] using PUT. From both arguments above, it can be concluded that only two triplets need to be retained for each PKf[vj,od(vj)]. Q.E.D.

An algorithm ODCsoto-1 for ODCsoto can thus be easily modified from ODCtoto-1 as follows:

ALGORITHM: ODCsoto-1
INPUT: an out-tree G;
OUTPUT: an optimal out-star clustering of G and its value;
PROCESS:
I) FOR each vr visited in postorder traversal sequence

   DO:
   1) FOR j=0 to od(vr) DO:
      1.1) same as 1.1) of ODCtoto-1;
      1.2) same as 1.2) of ODCtoto-1;
      END;
   2) eliminate all triplets except two: the triplet whose configuration contains vr alone as a cluster and the triplet with the highest value;
   END;
II) same as II) of ODCtoto-1;
END.

Proposition 2.5: The running time of algorithm ODCsoto-1 is O(n*Z).

Proof: Immediate from the fact that every PKf[vj,od(vj)] contains only two triplets. Q.E.D.

Example: Applying ODCsoto-1 to the out-tree shown in Figure 2-1, the optimal clustering solution is (15,2,<1,3><2,4,5><6>).
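To close the chapter, here is a compact, self-contained sketch of the whole sequential clustering scheme (INSERT, PUT, conditional CONNECT and dominance elimination) in the style of ODCtoto-1. It is an illustration rather than the dissertation's own code: the function names are invented, and the arc values of the Figure 2-1 instance are inferred from the triplet trace in section 2.1.2; with those values the sketch reproduces the optimal clustering (15,2,<1,3><2,4,5><6>).

def odc_toto(children, f, w, B, root):
    """Optimal out-tree clustering of an out-tree (ODCtoto-1 style DP).
    children[v]: ordered list of v's children; f[(v,c)]: arc value;
    w[v]: vertex weight; B: cluster weight bound.
    A state is (F, W, CF): total value, weight of the cluster containing
    the subgraph root, and the clusters as a frozenset of frozensets."""

    def prune(states):
        # step 1: for each weight keep only the highest value
        best = {}
        for F, W, CF in states:
            if W not in best or F > best[W][0]:
                best[W] = (F, CF)
        # step 2: drop states beaten by a lighter state of equal or higher value
        kept, run = [], None
        for W in sorted(best):
            F, CF = best[W]
            if run is None or F > run:
                kept.append((F, W, CF))
                run = F
        return kept

    def solve(v):
        states = [(0, w[v], frozenset({frozenset({v})}))]         # INSERT
        for c in children.get(v, []):
            child_states = solve(c)
            nxt = []
            for F1, W1, CF1 in states:
                for F2, W2, CF2 in child_states:
                    nxt.append((F1 + F2, W1, CF1 | CF2))           # PUT
                    if W1 + W2 <= B:                               # cond. CONNECT
                        Cv = next(x for x in CF1 if v in x)
                        Cc = next(x for x in CF2 if c in x)
                        merged = (CF1 - {Cv}) | (CF2 - {Cc}) | {Cv | Cc}
                        nxt.append((F1 + F2 + f[(v, c)], W1 + W2, merged))
            states = prune(nxt)
        return states

    F, W, CF = max(solve(root), key=lambda s: s[0])
    return F, sorted(sorted(c) for c in CF)

children = {1: [2, 3], 2: [4, 5], 5: [6]}                 # Figure 2-1 out-tree
f = {(1, 2): 3, (1, 3): 5, (2, 4): 3, (2, 5): 7, (5, 6): 2}
w = {v: 1 for v in range(1, 7)}
print(odc_toto(children, f, w, B=3, root=1))              # (15, [[1, 3], [2, 4, 5], [6]])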

CHAPTER 3
OPTIMAL OUT-TREE/OUT-STAR/OUT-NECKLACE CLUSTERING OF GENERAL DIGRAPH/ACYCLIC DIGRAPH/NECKLACE PROBLEMS

This chapter uses the cluster graph class constraint and ODCtoto/ODCsoto/ODCsono to analyze and solve eight ODC problems: ODCtog, ODCtoa, ODCtono, ODCsog, ODCsoa, ODCsono, ODCnono and ODCnog (clusters 2, 3, 4, 5 and 6 in Appendix A). Since the solutions of the first three problems can be directly applied to the rest, an analysis and solution of the first three problems are given first in sections 3.1 and 3.2. The solutions to the next three problems are given in section 3.3. In section 3.4, solutions to the last two are given.

3.1 The Structures of the Set of Possible Clusterings that Satisfy the Cluster Graph Class Constraint

Let Ki and Kj be any two clusterings of a digraph G. Ki <= Kj if every cluster induced by Ki is a subgraph of some cluster induced by Kj. For example, in Figure 3-1, each of K3, K4, K5, K8, K9, K10 and K12 <= K1. Note that if Ki <= Kj, then Ki can be obtained from Kj by removing some arcs in Kj. The <= relation is reflexive, transitive and anti-symmetric. It induces a partial ordering on the set of all possible

clusterings PK of G. Let [PK,<=] denote the partially ordered set (poset) that consists of the set PK and the partial ordering relation <= on PK. For any two elements Ki and Kj in PK, an element Kk in PK is called the greatest lower bound, or meet, of Ki and Kj, denoted by Ki*Kj, if Kk <= Ki, Kk <= Kj, and for every Kl in PK, Kl <= Ki and Kl <= Kj imply Kl <= Kk. Dually, an element Km in PK is called the least upper bound, or join, of Ki and Kj, denoted by Ki+Kj, if Ki <= Km, Kj <= Km, and for every Kl in PK, Ki <= Kl and Kj <= Kl imply Km <= Kl. A lattice is a partially ordered set which has a least upper bound and a greatest lower bound for every pair of elements. Let x and y be any of the five digraph classes. Define the set of all x clusterings of a specific y as the subset of all possible clusterings of y which satisfies the constraint that each of these clusterings induces clusters of type x or its subclass.

Proposition 3.1: The set of all out-tree clusterings of an out-tree (and out-star clusterings of an out-star) together with the <= relation forms a lattice.

Proof: This is immediate from the fact that every subgraph of an out-tree/out-star is an out-tree/out-star. Thus, every possible clustering is an out-tree clustering and the set forms a lattice. Q.E.D.

But the set of out-tree clusterings of a general

Figure 3-1  All possible out-tree clusterings of an acyclic digraph.

digraph, acyclic digraph or out-necklace is not a lattice because it does not always have a least upper bound for each pair of elements. For example, in Figure 3-1, K1*K2=K4 and K3+K1=K1, but K3+K7 is undefined. A poset [PK,<=] is a join-semilattice (meet-semilattice) if for any two elements Ki and Kj in the poset PK, Ki + Kj (Ki * Kj) exists.

Proposition 3.2: The set of all out-tree clusterings of a general digraph (acyclic digraph, out-necklace) together with the <= relation forms a meet-semilattice.

Proof: For each cluster Ci induced by Ki and Cj induced by Kj, let C = Ci n Cj and form three clusters (where n represents graph intersection): 1) Ci - C, 2) Cj - C, 3) C. The set of all clusters so formed for each pair of clusters from Ki and Kj, denoted by Kk, is also a clustering in PK because each cluster is an out-tree or its subclass. Obviously, Kk <= Ki and Kk <= Kj. Now, assume that there is a Kl in PK such that Kl <= Ki and Kl <= Kj but not Kl <= Kk. Then some cluster C' induced by Kl is not a subgraph of any cluster induced by Kk. But since Kl <= Ki and Kl <= Kj, C' is a subgraph of some Ci induced by Ki and of some Cj induced by Kj, and therefore of C = Ci n Cj, which is a cluster induced by Kk -- a contradiction. This means that for every Kl in PK such

that Kl <= Ki and Kl <= Kj, Kl <= Kk holds. Thus, the set of all out-tree clusterings of a general digraph/acyclic digraph/out-necklace together with <= forms a meet-semilattice. Q.E.D.

An element Ki in PK is maximal when there is no other element Kj in PK for which Ki <= Kj (a minimal element is similarly defined). For a lattice, there is always one minimal element and one maximal element. In a meet-semilattice, we note that there is always one minimal element (e.g., K2 in Figure 3-1) and possibly many maximal elements (e.g., K1 and K2 in Figure 3-1).

Proposition 3.3: The poset formed by the set of out-tree/out-star clusterings of an out-tree/out-star and the <= relation has only one maximal out-tree/out-star clustering.

Proof: Immediate from the fact that the set of all out-tree/out-star clusterings of an out-tree/out-star together with the <= relation forms a lattice (Proposition 3.1). Q.E.D.

Proposition 3.4: The poset formed by the set of out-tree clusterings of an out-necklace and the <= relation has K maximal out-tree clusterings if the number of arcs in the only cycle in the out-necklace is K.

Proof: Each vertex in an out-necklace has indegree 1, while each vertex in an out-tree also has indegree 1 except for the root, which has indegree 0. Thus, a maximal out-tree clustering can be obtained by removing a minimum of one arc from the out-necklace. But not all clusterings generated by removing one arc from the out-necklace are

out-tree clusterings; only those formed by removing an arc in the cycle of the out-necklace are out-tree clusterings. Thus, if there are K arcs in the cycle of the out-necklace, there are K maximal out-tree clusterings. Q.E.D.

Proposition 3.5: The poset formed by the set of out-tree clusterings of an acyclic digraph and the <= relation has

    L = a1 * a2 * ... * ap

maximal out-tree clusterings if the acyclic digraph has p m-indegree vertices where each has indegree ai (i = 1, ..., p).

Proof: Let the indegrees of the set of m-indegree vertices in the given acyclic digraph be a1, a2, ..., ap. To form a maximal out-tree clustering from the acyclic digraph requires the removal of the "redundant" arcs that exist at each of the m-indegree vertices so that every m-indegree vertex is reduced to indegree 1. There are

    (a1 choose a1-1) * (a2 choose a2-1) * ... * (ap choose ap-1) = a1 * a2 * ... * ap

combinations. Thus, there are L = a1 * a2 * ... * ap maximal out-tree clusterings. Q.E.D.
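As a quick illustration of Proposition 3.5, the sketch below enumerates the maximal out-tree clusterings of an acyclic digraph by keeping exactly one incoming arc at every m-indegree vertex and removing the rest. It is illustrative only: the function name and the small example digraph are invented (Figure 3-1 itself is not reproduced here); for the example, whose only m-indegree vertex has indegree 2, it yields the expected L = 2 maximal clusterings.

from itertools import product

def maximal_out_tree_clusterings(vertices, arcs):
    """Yield each maximal out-tree clustering of an acyclic digraph as the
    set of arcs it keeps: one kept incoming arc per m-indegree vertex,
    all other arcs unchanged."""
    incoming = {v: [a for a in arcs if a[1] == v] for v in vertices}
    choice_sets = [ins for ins in incoming.values() if len(ins) > 1]
    fixed = [a for a in arcs if len(incoming[a[1]]) <= 1]
    for kept in product(*choice_sets):
        yield set(fixed) | set(kept)

V = [1, 2, 3, 4]
E = [(1, 2), (1, 3), (2, 4), (3, 4)]      # vertex 4 is m-indegree (a1 = 2)
for kept_arcs in maximal_out_tree_clusterings(V, E):
    print(sorted(kept_arcs))
# [(1, 2), (1, 3), (2, 4)] and [(1, 2), (1, 3), (3, 4)]:
# L = 2 maximal out-tree clusterings, as Proposition 3.5 predicts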

Proposition 3.6: The poset formed by the set of out-necklace clusterings of a general digraph and the <= relation has at most L = a1 * a2 * ... * ap maximal out-necklace clusterings if the general digraph has p m-indegree vertices where each has indegree ai.

Proof: Similar to the proof given for Proposition 3.5. Q.E.D.

3.2 Optimal Out-Tree Clustering of General Digraph, Acyclic Digraph and Out-Necklace Problems (ODCtog, ODCtoa, ODCtono)

The ODCtog, ODCtoa and ODCtono problems are to find a set of bounded out-tree clusters, or its subclass clusters, from a general digraph, acyclic digraph and out-necklace respectively such that the total value of the set is optimized. Again, the enumerative clustering approach is too expensive to use. A better approach is to use the maximal out-tree clustering concept and the ODCtoto-1 algorithm discussed in section 2.1. A maximal out-tree clustering induces the set of largest out-tree clusters that can be obtained from an acyclic digraph or out-necklace by removing a minimal number of arcs from the m-indegree vertices. Since any possible out-tree clustering, including the optimal one, is <= one of the maximal out-tree clusterings (or can be obtained by additional arc removals from the induced clusters), the optimal solution can thus be found within one of the maximal out-tree clusterings. We can do that by using ODCtoto-1. The optimal out-tree clustering of an acyclic digraph or out-necklace is therefore the maximum value one among the set of optimal out-tree clusterings obtained from all maximal out-tree clusterings. An algorithm for these two problems can be formulated as follows:

ALGORITHM: ODCtoa-1/ODCtono-1
INPUT: an acyclic digraph/out-necklace G;
OUTPUT: an optimal out-tree clustering and its value;
PROCESS:
1) FOR each maximal out-tree clustering of G, which induces one or more out-tree clusters, DO:
   1.1) apply algorithm ODCtoto-1 to obtain an optimal solution for each out-tree cluster;
   1.2) IF the sum of the optimal out-tree clustering values of these out-tree clusters is the largest value so far, keep the solution;
   END;
2) RETURN the last solution kept;
END.

An algorithm for ODCtog, named ODCtog-1, is similar to algorithm ODCtoa-1 except that instead of applying ODCtoto-1 in step 1.1, ODCtono-1 is used and all occurrences of maximal out-tree clusterings are changed to maximal out-necklace clusterings.

Proposition 3.7: The ODCtono-1, ODCtoa-1 and ODCtog-1 algorithms find
1) the optimal out-tree clustering of an out-necklace in O(n*Z^2*K) time;
2) the optimal out-tree clustering of an acyclic digraph in O(n*Z^2*L) time; and
3) the optimal out-tree clustering of a general digraph in O(n*Z^2*K*L) time.

Proof:
1) In step 1 of ODCtono-1, a given maximal out-tree clustering induces only one out-tree cluster. Thus, steps 1.1 and 1.2 take O(n*Z^2) time by Proposition 2.2. Since there are K maximal out-tree clusterings to search (see Proposition 3.4) in the step, the time complexity is O(n*Z^2*K).
2) In step 1 of ODCtoa-1, let a given maximal out-tree clustering contain w out-trees with n1, n2, ..., nw vertices respectively. Steps 1.1 and 1.2 take O(n1*Z^2) + O(n2*Z^2) + ... + O(nw*Z^2) = O(n*Z^2) time to find an optimal solution. Since there are L maximal out-tree clusterings to search (see Proposition 3.5) in step 1, the time complexity is O(n*Z^2*L).
3) In step 1 of ODCtog-1, let a given maximal out-necklace clustering contain w out-necklaces with n1, n2, ..., nw vertices respectively. Steps 1.1 and 1.2 take O(n1*Z^2*K1) + O(n2*Z^2*K2) + ... + O(nw*Z^2*Kw) = O(n*Z^2*K) time, where K is the maximum of all Ki, 1 <= i <= w. Since there are L maximal out-necklace clusterings to search, the time complexity is O(n*Z^2*K*L). Q.E.D.

Example: The general digraph shown in Figure 3-2a contains the two maximal out-necklace clusterings shown in Figures 3-2b and c. Applying ODCtog-1 to the general digraph shown in Figure 3-2a, we find the optimal solution in Figure 3-2b, which has a total value of 16 and 2 clusters:

Figure 3-2  A general digraph (a) and its two maximal out-necklace clusterings (b, c) (B = 3; wi = 1 for every vi).

Cluster 1 contains vertices 1 and 3. Cluster 2 contains vertices 2, 4, 5 and 6.

3.3 Optimal Out-Star Clustering of General Digraph, Acyclic Digraph and Out-Necklace Problems (ODCsog, ODCsoa, ODCsono)

The ODCsog, ODCsoa and ODCsono problems are to find a set of bounded out-star clusters, or its subclass clusters, from a general digraph, acyclic digraph and out-necklace respectively such that the total value of this set is optimized. An approach to solving ODCsoa and ODCsono can be formulated similarly to the approach presented in section 3.2, which is to use the maximal out-tree clustering concept and the efficient ODCsoto-1 algorithm. This is based on the fact that an out-star is a subclass of an out-tree. Since any optimal out-star clustering of the given digraph (acyclic digraph or out-necklace) is <= one of the maximal out-tree clusterings, the optimal solution can be found in one of them. ODCsoto-1 will obtain an optimal out-star clustering for each maximal out-tree clustering, and the clustering with the maximum value is then the optimal out-star clustering of the given digraph. Thus, two algorithms, ODCsoa-1 and ODCsono-1, can be formulated for ODCsoa and ODCsono as follows:

ALGORITHM: ODCsoa-1/ODCsono-1
INPUT: an acyclic digraph/out-necklace G;
OUTPUT: an optimal out-star clustering and its value;
PROCESS:

1) FOR each maximal out-tree clustering of G, which induces one or more out-tree clusters, DO:
   1.1) apply algorithm ODCsoto-1 to obtain an optimal solution for each out-tree cluster;
   1.2) IF the sum of the optimal out-star clustering values of these out-tree clusters is the largest value so far, keep the solution;
   END;
2) RETURN the last solution kept;
END.

An algorithm for ODCsog, named ODCsog-1, is similar to ODCsoa-1 except that instead of applying ODCsoto-1 in step 1.1, ODCsono-1 is needed and all occurrences of maximal out-tree clusterings are replaced by maximal out-necklace clusterings.

Proposition 3.8: The ODCsono-1, ODCsoa-1 and ODCsog-1 algorithms find:
1) the optimal out-star clustering of an out-necklace in O(n*Z*K) time;
2) the optimal out-star clustering of an acyclic digraph in O(n*Z*L) time; and
3) the optimal out-star clustering of a general digraph in O(n*Z*K*L) time.

Proof: Similar to the proof given for Proposition 3.7. Q.E.D.

Example: Applying ODCsog-1 to Figure 3-2a, we obtain the

optimal solution shown in Figure 3-2b, which has total value 14 and contains 2 out-stars: Star 1 contains vertices 1, 3 and 4. Star 2 contains vertices 2, 5 and 6.

3.4 Optimal Out-Necklace Clustering of Out-Necklace and General Digraph Problems (ODCnono, ODCnog)

The ODCnono and ODCnog problems are to find a set of bounded out-necklace clusters, or its subclass clusters, from a given out-necklace and general digraph respectively, such that the total value of this set is optimized. The ODCnono problem is first solved in section 3.4.1 and the solution is then used to solve ODCnog in section 3.4.2.

3.4.1 ODCnono Problem

A unique property of an out-necklace is that it has only one cycle. To find a set of out-necklace clusters from it, observe that if the weight of the cycle contained in a given out-necklace is less than B, then the optimal out-necklace clustering will always contain the cycle in one of the induced clusters. If the cycle is overweight, then the optimal out-necklace clustering will never contain the cycle in one of the clusters. Thus the ODCnono problem can be divided into two subproblems, where each deals with one of these two possibilities. The ODCtoto-1 algorithm can be used to solve the first subproblem if we observe that by aggregating all vertices in the cycle into one vertex and replacing all the arcs going

out of the vertices in this cycle by arcs going out of this newly formed vertex, the out-necklace becomes an out-tree with the newly formed vertex as the root. This newly formed root can then be assigned a weight equal to the total weight of all vertices in the cycle, and the total value of the cycle can be added later to the cluster of the optimal clustering that contains the root. The second subproblem can be easily solved because finding an optimal out-necklace clustering which contains no cycle in any of the induced clusters is the same as finding an optimal out-tree clustering of the out-necklace. Thus, the ODCtono-1 algorithm can be used to solve this subproblem. An algorithm for ODCnono can be formulated as follows:

ALGORITHM: ODCnono-1
INPUT: an out-necklace;
OUTPUT: an optimal out-necklace clustering and its value;
PROCESS:
1) IF the total weight of the K vertices on the cycle is less than B, THEN
   1.1) transform the out-necklace into an out-tree as described above;
   1.2) apply the ODCtoto-1 algorithm to find an optimal solution;
   1.3) replace the root in the configuration of the optimal solution obtained in 1.2) by all K vertices in the cycle;
   1.4) add to the value of the optimal solution obtained

in 1.2) the total value of the cycle;
   1.5) RETURN the configuration and its value obtained in 1.3) and 1.4);
2) ELSE
   2.1) apply the ODCtono-1 algorithm to find an optimal solution;
   2.2) RETURN the configuration and its value;
END.

Proposition 3.9: The ODCnono-1 algorithm finds the optimal solution in O(n·Z²·K) time.
Proof: Step 1) takes O(n·Z²) time from Proposition 2.2. Step 2) takes O(n·Z²·K) time from Proposition 3.7. Thus, the time complexity is O(n·Z²·K). Q.E.D.

Example: The out-necklace shown in Figure 3-3a contains a cycle with weight 3 < B = 4. It is transformed into an out-tree as shown in Figure 3-3b. Applying the ODCtoto-1 algorithm obtains an optimal solution which has value 14 and contains 2 clusters: Cluster 1 is an out-necklace with vertices 1, 2, 3 and 4. Cluster 2 is an out-necklace with vertex 5. A brief code sketch of the cycle-contraction step used in this transformation is given after Figure 3-3.

3.4.2 ODCnog Problem

We tackle this problem by using the same idea used in solving ODCtog. Since the optimal out-necklace clustering of a general digraph is contained in one of the maximal out-necklace clusterings, the optimal solution can be found among this maximal set. The approach is to use ODCnono-1 to obtain an optimal out-necklace clustering for each maximal out-necklace clustering.

Figure 3-3 An example out-necklace and its optimal clustering result (B = 4; wi = 1 for all vi).
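As a concrete illustration of the cycle-contraction step used in step 1.1 of ODCnono-1 (and pictured in Figure 3-3), the following minimal Python sketch contracts the unique cycle of an out-necklace into a single aggregated root vertex. The dictionary-based graph representation and the function name are assumptions made for illustration only, not the dissertation's notation.

def contract_cycle(vertices, arcs, weights, cycle):
    """Contract the vertices of `cycle` into one new root vertex.
    vertices: set of numeric vertex ids; arcs: set of (u, v) pairs;
    weights: dict vertex -> weight; cycle: list of vertex ids on the unique cycle."""
    root = max(vertices) + 1                     # new aggregated root vertex
    cyc = set(cycle)
    new_vertices = (vertices - cyc) | {root}
    new_weights = {v: w for v, w in weights.items() if v not in cyc}
    new_weights[root] = sum(weights[v] for v in cycle)   # total weight of the cycle
    new_arcs = set()
    for (u, v) in arcs:
        if u in cyc and v in cyc:
            continue                             # arcs inside the cycle disappear
        if u in cyc:
            new_arcs.add((root, v))              # arcs leaving the cycle now leave the root
        elif v in cyc:
            new_arcs.add((u, root))              # does not occur in an out-necklace
        else:
            new_arcs.add((u, v))
    return new_vertices, new_arcs, new_weights, root

After ODCtoto-1 is applied to the contracted out-tree, the aggregated root is expanded back into the original cycle vertices and the cycle's total value is added, as in steps 1.3 and 1.4.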

From this set of optimal out-necklace clusterings, the one with the maximum value is then the optimal out-necklace clustering for the given digraph. An algorithm for ODCnog can be formulated as follows:

ALGORITHM: ODCnog-1
INPUT: a general digraph G;
OUTPUT: an optimal out-necklace clustering and its value;
PROCESS:
1) FOR each maximal out-necklace clustering of G which induces a set of out-necklace clusters DO:
   1.1) apply algorithm ODCnono-1 to obtain an optimal solution for each out-necklace cluster;
   1.2) IF the sum of the optimal out-necklace clustering values of the set of out-necklace clusters is the largest value so far, THEN keep the solution;
2) RETURN the last solution kept;
END.

Proposition 3.10: The ODCnog-1 algorithm finds the optimal solution in O(n·Z²·K·L) time.
Proof: Similar to the proof given for 3) of Proposition 3.7. Q.E.D.

Example: By applying the ODCnog-1 algorithm to the general digraph shown in Figure 3-2a, we obtain the optimal solution in Figure 3-2c. This optimal solution has value 20 and

contains 2 clusters: Cluster 1 is an out-necklace with vertices 1 and 3. Cluster 2 is an out-necklace with vertices 2, 4, 5 and 6.
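The "enumerate the maximal clusterings, solve each, keep the best" pattern shared by ODCtog-1, ODCsog-1 and ODCnog-1 can be summarized in a short Python sketch. Both helper functions named below are hypothetical placeholders: enumerate_maximal_clusterings stands for the machinery of section 3.1, and solve_cluster for the appropriate subproblem algorithm (e.g., ODCnono-1 applied to one out-necklace cluster).

def best_clustering(G, enumerate_maximal_clusterings, solve_cluster):
    """Generic enumerate-and-solve wrapper in the style of ODCnog-1.
    enumerate_maximal_clusterings(G) is assumed to yield lists of clusters (subgraphs);
    solve_cluster(C) is assumed to return (value, configuration) for one cluster."""
    best_value, best_config = None, None
    for clustering in enumerate_maximal_clusterings(G):
        total, config = 0, []
        for cluster in clustering:
            value, sub_config = solve_cluster(cluster)   # e.g. ODCnono-1 on one out-necklace
            total += value
            config.extend(sub_config)
        if best_value is None or total > best_value:     # keep the largest value so far
            best_value, best_config = total, config
    return best_value, best_config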

CHAPTER 4
OPTIMAL ACYCLIC DIGRAPH/GENERAL DIGRAPH CLUSTERING OF ACYCLIC DIGRAPH/GENERAL DIGRAPH PROBLEMS

This chapter generalizes the sequential clustering approach presented in section 2.1 to analyze and solve ODCaa, ODCgg and ODCag (cluster 7 in Appendix A). ODCaa is solved in section 4.1, ODCgg in section 4.2 and ODCag in section 4.3.

4.1 Optimal Acyclic Digraph Clustering of an Acyclic Digraph Problem (ODCaa)

The ODCaa problem is to find a set of bounded acyclic digraph clusters, or its subclass clusters, from a given acyclic digraph such that the total value of the set is optimized. A formal definition for the problem can be obtained by replacing all occurrences of an out-tree by an acyclic digraph in the formal definition of ODCtoto that is given in section 2.1. The solution approach is first presented in section 4.1.1 and an algorithm to solve the problem in section 4.1.2.

4.1.1 A Sequential Clustering Approach to ODCaa

Section 2.1 in Chapter 2 has presented a sequential clustering approach which obtains the optimal solution of an

out-tree through a sequence of clusterings of increasingly larger sets of subgraphs such that all feasible clusterings of each set can be constructed from those of the immediately previous set using simple operations. This approach can be applied to ODCaa as well, except that the given acyclic digraph must first be restructured into a tree-like structure so that the same approach can be applied successfully. This tree-like structure, T = (V, E0 ∪ E1), has the same vertex set V as the original digraph G = (V,E) but divides the arc set E into a tree arc set E0 (e.g., the solid lines in the second figure of Figure 4-1) and a co-tree arc set E1 (e.g., the dashed lines in the same figure). An algorithm for constructing such a tree-like structure from an acyclic digraph with a designated root v can be devised by modifying the depth-first search algorithm (Aho et al '74) for undirected graphs. The only modification is to ignore the arc directions of the acyclic digraph in the restructuring process. Thus, instead of the parent-child relation, an undirected relation is-related-to is defined on the vertex set V of G such that a vertex vj is-related-to vr if either (vr,vj) ∈ E or (vj,vr) ∈ E. The restructuring algorithm, named RESTRUCTURE-1, is as follows:

ALGORITHM: RESTRUCTURE-1(G, v)
INPUT: an acyclic digraph G=(V,E) with a designated root v;
OUTPUT: T=(V, E0 ∪ E1);
PROCESS:
ALGORITHM: Search-1(vr)

Figure 4-1 An acyclic digraph and its tree-like structure.

PROCESS:
1) mark vr as "visited";
2) FOR each vertex vj which is-related-to vr DO:
   2.1) IF vj is not visited THEN add the arc (i.e., (vr,vj) or (vj,vr) ∈ E) to E0; CALL Search-1(vj);
   2.2) ELSE add the arc (i.e., (vr,vj) or (vj,vr) ∈ E) to E1;
   END;
I) CALL Search-1(v);
II) RETURN T;
END.

Proposition 4.1: The run time of algorithm RESTRUCTURE-1 is O(max(m,n)) where m is the number of arcs and n is the number of vertices in G.
Proof: Immediate from the SEARCH algorithm of Aho et al ('74). Q.E.D.

An important feature of T = (V, E0 ∪ E1) is that its vertex set V together with its tree arc set E0 forms a tree (if we ignore the direction of each tree arc). The (m+n)-stage sequential clustering process can thus be based on this tree just like the one for ODCtoto-1 (see Figure 4-2). But a complication in the clustering process is that all the co-tree arcs must also be considered. This creates a need to form, at each stage, a set of clusterings with the "bounded nonchild descendant" vertices in addition to the "child" vertices, because of the additional

possibilities of cluster formation induced by those co-tree arcs. For example, vertices 1, 2 and 3 in Figure 4-1 could form a cluster. At the stages when vertex 2 is considered (e.g., stage 7 in Figure 4-2), a set of clusterings should be formed not only with vertex 4 but also with vertex 3. To formalize this idea, some new terms and additional processes are needed.

If there is a path (ignoring arc directions) from the root of T to vr and a path from vr to vj in the tree formed by V and E0, then vr is an ancestor of vj and vj is a descendant of vr. In the case where vr is-related-to vj in the tree by a tree arc, vr is also called the parent of vj and vj the child of vr. A vertex vj is called a neighbor of a vertex vr if vj satisfies one of the following two rules:
1) vj is a child of vr.
2) vj is a nonchild descendant of vr and vj lies on a path vr,x1,x2,...,xp,vj (p ≥ 0) such that if p = 0, then vr is-related-to vj by a co-tree arc, and if p > 0, then all the x's are ancestors of vr and xp is-related-to vj by a co-tree arc (e.g., in Figure 4-1, vertex 3 is a nonchild descendant of 2 and lies on the path 2,1,3).
Also, vj is a bounded nonchild neighbor of vr if both of the following two rules are true:
1) vj is a neighbor of vr.
2) vj is a nonchild descendant and the path

Figure 4-2 The eleven stages of the clustering process.

vr,x1,x2,...,xp,vj is bounded, i.e., W(vr,x1,x2,...,xp,vj) ≤ B (e.g., in Figure 4-1, vertex 3 is a bounded nonchild neighbor of vertex 2 if B=3 and wq=1 for all vq ∈ V).

To find the set of neighbors and bounded nonchild neighbors for each vertex, two functions, DF-NO and LOW, are needed. DF-NO assigns depth-first search numbers to vertices in the order they are visited in the construction of T=(V, E0 ∪ E1) from a given digraph G=(V,E). For each vertex vr in V, LOW: V → Z+ can be defined as:

LOW(vr) = MIN({DF-NO(vr)} ∪ {DF-NO(vk) | there exists a co-tree arc (i.e., (vj,vk) or (vk,vj)) in E1 such that vj is a descendant of vr, and vk is an ancestor of vr in the tree formed by V and E0}).

The DF-NO for each vertex in V can easily be obtained during the construction of T for the given digraph. The LOW value can also be obtained easily because the definition of LOW can be reformulated recursively as:

LOW(vr) = MIN({DF-NO(vr)} ∪ {LOW(vj) | vj is a child of vr} ∪ {DF-NO(vk) | vr is-related-to vk by a co-tree arc and vk is an ancestor of vr}).

An algorithm called RESTRUCTURE-2, which is modified from Aho et al's ('74) SEARCHB algorithm, for finding T and the LOW and DF-NO values of each vertex, is as follows:

ALGORITHM: RESTRUCTURE-2(G, v)

INPUT: an acyclic digraph G=(V,E) with a designated root v;
OUTPUT: T=(V, E0 ∪ E1) with LOW and DF-NO values for each vertex;
PROCESS:
ALGORITHM: Search-2(vr)
PROCESS:
1) mark vr as "visited"; DF-NO(vr) = df-count; LOW(vr) = DF-NO(vr); df-count = df-count + 1;
2) FOR each vertex vj that is-related-to vr DO:
   2.1) IF vj is not visited DO:
        2.1.1) parent(vj) = vr;
        2.1.2) CALL Search-2(vj);
        2.1.3) LOW(vr) = MIN(LOW(vr), LOW(vj));
   2.2) ELSE IF vj is not parent(vr) THEN LOW(vr) = MIN(LOW(vr), DF-NO(vj));
   END;
I) df-count = 1;
II) CALL Search-2(v);
III) RETURN T with LOW and DF-NO values for each vertex;
END.

Proposition 4.2: The run time of algorithm RESTRUCTURE-2 is the same as that of RESTRUCTURE-1.
Proof: Immediate from algorithm SEARCHB of Aho et al ('74). Q.E.D.
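A minimal Python rendering of RESTRUCTURE-2 may help make the computation of DF-NO and LOW concrete. The adjacency-list representation, the one-time recording of each co-tree arc, and the function name are assumptions made for this sketch; arc directions are ignored during the traversal, as in the text.

import sys
from collections import defaultdict

def restructure2(arcs, root):
    """Build the tree-like structure T = (V, E0 ∪ E1) of a digraph, ignoring arc
    directions, and compute DF-NO and LOW for every vertex reachable from root.
    arcs: iterable of directed arcs (u, v). Returns (E0, E1, df_no, low, parent)."""
    sys.setrecursionlimit(10_000)              # recursive DFS; fine for a sketch
    neighbors = defaultdict(list)              # the undirected 'is-related-to' relation
    for (u, v) in arcs:
        neighbors[u].append((u, v))
        neighbors[v].append((u, v))
    df_no, low, parent = {}, {}, {root: None}
    E0, E1 = [], []                            # tree arcs and co-tree arcs
    counter = [1]

    def search(vr):
        df_no[vr] = low[vr] = counter[0]
        counter[0] += 1
        for arc in neighbors[vr]:
            vj = arc[1] if arc[0] == vr else arc[0]
            if vj not in df_no:                # unvisited: the arc becomes a tree arc
                parent[vj] = vr
                E0.append(arc)
                search(vj)
                low[vr] = min(low[vr], low[vj])
            elif vj != parent[vr] and df_no[vj] < df_no[vr]:
                E1.append(arc)                 # back arc: record each co-tree arc once
                low[vr] = min(low[vr], df_no[vj])

    search(root)
    return E0, E1, df_no, low, parent

# Hypothetical usage on a small acyclic digraph (one of the four listed in section 4.1.2):
# E0, E1, df_no, low, parent = restructure2([(2, 1), (2, 4), (1, 3), (3, 4)], root=2)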

Using the DF-NO and LOW values of each vertex vr, a neighbor list Lnb[vr] and a bounded nonchild neighbor list Lbn[vr] can be defined for the subgraph rooted at each vertex vr:

Lnb[vr] = {vj | vj ∈ ch(vr)} ∪ {vk | vk ∈ Lnb[vj] for some child vj of vr, such that LOW(vk) < DF-NO(vr)};

Lbn[vr] = {vm | vm ∉ ch(vr) and vm ∈ Lnb[vr] and LOW(vm) < DF-NO(vr) and W(vr,x1,x2,...,xp,vm) ≤ B}.

An algorithm for finding T=(V, E0 ∪ E1) with a neighbor list Lnb[vr] and a bounded nonchild neighbor list Lbn[vr] for each vertex vr can thus be formulated as follows:

ALGORITHM: RESTRUCTURE-3(G, v)
INPUT: an acyclic digraph G with a designated root v;
OUTPUT: T=(V, E0 ∪ E1) with a neighbor list Lnb[vr] and a bounded nonchild neighbor list Lbn[vr] for each vertex vr;
PROCESS:
ALGORITHM: Search-3(vr)
PROCESS:
1) IF vr has no children THEN RETURN Lnb[vr] = Lbn[vr] = ∅;
ELSE
2) FOR each vertex vj ∈ ch(vr) DO:
   2.1) CALL Search-3(vj) to obtain Lnb[vj] and Lbn[vj];
   2.2) Lnb[vr] = Lnb[vr] ∪ {vj} ∪ {vk | vk ∈ Lnb[vj] s.t. LOW(vk) < DF-NO(vr)};
   2.3) FOR each vm ∈ Lnb[vj] DO:
        2.3.1) IF wm + wr ≤ B and LOW(vm) < DF-NO(vr)

               THEN vs = vr; PATH-WEIGHT = wm + wr;
                    WHILE there is no co-tree arc between vs and vm DO:
                       vs = parent(vs); PATH-WEIGHT = PATH-WEIGHT + ws;
                    END;
        2.3.2) IF PATH-WEIGHT ≤ B THEN Lbn[vr] = Lbn[vr] ∪ {vm};
        END;
   END;
I) CALL RESTRUCTURE-2(G, v);
II) CALL Search-3(v);
III) RETURN T with the two lists, Lnb[vr] and Lbn[vr], associated with each vertex vr;
END.

Proposition 4.3: The time complexity of RESTRUCTURE-3 is O(n²·d(T)) where d(T) is the maximum depth of the constructed T.
Proof: Step I) takes O(max(m,n)) time from Proposition 4.2; II) takes O(n²·d(T)) time; III) takes O(n) time. Q.E.D.

Given T, Lnb[vr] and Lbn[vr], the clustering process can be sequenced in m+n stages just like ODCtoto-1 (see Figure 4-2). The INSERT, PUT, and conditional CONNECT operations are defined exactly the same way as in section 2.1.1. But at each new arc stage, in addition to

the PUT and conditional CONNECT operations, a new conditional COMBINE operation forms a set of feasible clusterings (denoted by PK2[vr,i+1]) for each pair of feasible clusterings K[vr,i] in PKf[vr,i] and K[vk] in PKf[vj,od(vj)], for each vk which is a bounded nonchild neighbor of vr. Thus, the set of feasible clusterings formed at each new arc stage now is:

PKf(k+1) = PKf(k) - PKf[vr,i] - PKf[vj,od(vj)] ∪ PKf[vr,i+1]

where PKf[vr,i+1] = PK0[vr,i+1] ∪ PK1[vr,i+1] ∪ PK2[vr,i+1]. PK0[vr,i+1] and PK1[vr,i+1] are the same as those defined in section 2.1.1.

PK2[vr,i+1] = {K2[vr,i+1] | K2[vr,i+1] is formed using the COMBINE operation for each pair of K[vr,i] and K[vk,od(vk)] and for each vk ∈ Lbn[vr]}.

For each bounded nonchild neighbor vk of vr, i.e., vk ∈ Lbn[vr], the COMBINE operation forms a set of feasible clusterings for each pair of feasible clusterings K[vr,i] in PKf[vr,i] and K[vk] in PKf[vj,od(vj)]. However, there is a consideration which depends on whether there is an arc between vr and vk or not:

Case 1: If there is a co-tree arc between vr and vk, then the new clustering is formed in exactly the same way as by the conditional CONNECT operation defined above, using K[vr,i] and K[vk].

Case 2: If there is no co-tree arc between vr and vk, then the new clustering is formed by merging the cluster C1 with the cluster C1' into one new cluster C1'' with no arc connecting them. Include this new cluster with all the other clusters in both clusterings to form a new clustering. This new clustering contains: 1) all clusters except C1 in K[vr,i]; 2) all clusters except C1' in K[vk]; and 3) C1''. Note that C1'' contains two disconnected subgraphs of G. (This is different from the cluster formed by PUT and CONNECT, which contains one connected subgraph.) The reason C1'' is formed in this way is that in a future stage these two disconnected subgraphs can be connected together through a co-tree arc which relates an ancestor vm of vr with vk.

4.1.2 ODCaa-1 Algorithm

Let a feasible clustering be represented by a triplet with three components. The value component of a triplet is defined in the same way as in section 2.1.2. The weight component is now a list of s entries enclosed in ( ) if there are s induced clusters. Each entry stores the cluster weight of the corresponding cluster. This is necessary because at each stage the triplets are formed not only from those of the child but also from those of the bounded nonchild

neighbors. Since the latter can be in any cluster induced by K[vj,od(vj)], it creates a need to keep track of all cluster weights. We will use the notation W(C[vr,i]) and W(C[vr,i+1]) to denote the cluster weight of the induced cluster containing vr in K[vr,i] and K[vr,i+1] respectively. W(C[vk]) will denote the cluster weight of the induced cluster containing the bounded nonchild neighbor vk in K[vj,od(vj)]. The configuration component is also modified to indicate not only the vertices in each cluster but also the arcs in the cluster. This is necessary because the cluster is no longer an out-tree and therefore a set of vertices cannot uniquely represent the cluster. For example, in Figure 4-1, suppose B=4 and vertices 1, 2, 3 and 4 form a cluster. The notation <1,2,3,4> does not tell which of the following four acyclic digraphs the cluster represents:

V = {1,2,3,4}  E = {(2,1),(2,4),(3,4),(1,3)}
V = {1,2,3,4}  E = {(2,1),(2,4),(3,4)}
V = {1,2,3,4}  E = {(2,4),(3,4),(1,3)}
V = {1,2,3,4}  E = {(2,1),(3,4),(1,3)}

Let the configuration component CF(K) that is used in Chapter 2 now be CF0(K) and the newly added representation of the arcs be CF1(K). Thus, CF(K) = CF0(K) ∪ CF1(K). In terms of triplet formation, the four operations can now be defined as:

INSERT: same as defined in section 2.1.2.
PUT: same as defined in section 2.1.2.
conditional CONNECT:

IF W(C[vr,i]) + W(C[vj,od(vj)]) < B THEN form K2[vr,i+1] with
   F(K2[vr,i+1]) = F(K[vr,i]) + F(K[vj,od(vj)]) + frj,
   W(C[vr,i+1]) = W(C[vr,i]) + W(C[vj,od(vj)]),
   CF(K2[vr,i+1]) = CF0(K2[vr,i+1]) ∪ CF1(K2[vr,i+1]) where
      CF0(K2[vr,i+1]) = {C2,C3,...,Cp,C2',C3',...,Cq'} ∪ {C1 ∪ C1'},
      CF1(K2[vr,i+1]) = CF1(K[vr,i]) ∪ CF1(K[vj,od(vj)]) ∪ {(vr,vj) or (vj,vr)}.
conditional COMBINE:
IF there exists a co-tree arc between vr and vk THEN form a K2[vr,i+1] using the conditional CONNECT;
ELSE form a K2[vr,i+1] with
   F(K2[vr,i+1]) = F(K[vr,i]) + F(K[vk]),
   W(C[vr,i+1]) = W(C[vr,i]) + W(C[vk]), and
   CF(K2[vr,i+1]) = CF0(K2[vr,i+1]) ∪ CF1(K2[vr,i+1]) where
      CF0(K2[vr,i+1]) = {C2,C3,...,Cp,C2',C3',...,Cq'} ∪ {C1 ∪ C1'},
      CF1(K2[vr,i+1]) = CF1(K[vr,i]) ∪ CF1(K[vk]).
END.

For example, in Figure 4-2, let {{(0,(1,1,1),<3><4><5>), (6,(2,1),<3,4><5>(3,4)), (4,(2,1),<4,5><3>(4,5)), (10,(3),<3,4,5>(3,4)(4,5))}, {<2>}} be the set of feasible triplets formed at stage 6. At stage 7, the set of feasible

triplets formed using the respective operations are:

PUT: (0,(1,1,1,1),<2><3><4><5>) (6,(1,2,1),<2><3,4><5>(3,4)) (4,(1,2,1),<2><4,5><3>(4,5)) (10,(1,3),<2><3,4,5>(3,4)(4,5))
CONNECT: (1,(2,1,1),<2,4><3><5>(2,4)) (7,(3,1),<2,3,4><5>(2,4)(3,4)) (5,(3,1),<2,4,5><3>(2,4)(4,5))
COMBINE (vertex 3 is a bounded nonchild neighbor of 2): (0,(2,1,1),<2,3><4><5>) (4,(2,2),<4,5><2,3>(4,5))

A dominance relation can still be defined over the set of triplets formed at each new arc stage even though it is now more restricted (not as powerful as the one defined in Chapter 2). Let vr be the parent of vertex vj in the tree induced by V and E0. Suppose K1[vj] and K2[vj] are any two feasible triplets formed in a new arc stage. A cluster C1 induced by K1[vj] is cluster-similar to a cluster C2 induced by K2[vj] if both clusters contain the same set of neighbors in Lnb[vr]. K2[vj] is triplet-similar to K1[vj] if for every cluster C2 induced by K2[vj] which contains one or more neighbors in Lnb[vr], there is a C1 induced by K1[vj] which is cluster-similar to C2, and vice versa. Define K1[vj] dominates K2[vj] if
1) K2[vj] is triplet-similar to K1[vj];
2) for every pair of C1 and C2 induced by K1[vj] and

K2[vj] respectively, if C1 is cluster-similar to C2, then W(C2) ≥ W(C1);
3) F(K1[vj]) ≥ F(K2[vj]).

Proposition 4.4: The dominated triplet K2[vj] defined above cannot be extended to the optimal clustering of the acyclic digraph.

Proof: Let the n vertices visited in postorder sequence be denoted by v1,v2,...,vn. A triplet formed in visiting vj, denoted by K[vj], can be represented by a sequence of pairs (Lukes '75): K[vj] = [v1,< >],[v2,C2],...,[vj,Cj], where the first entry of each pair represents the vertex visited and the second entry represents the cluster to which the vertex is added. Let K[vr] and K[vn] be two feasible triplets formed in the rth and nth stages respectively. We define a derivation of a feasible triplet K[vn] from K[vr] (vn is the root of the tree induced by V and E0) as the sequence [vr+1,Cr+1],[vr+2,Cr+2],...,[vn,Cn]. Let K[vk-1]' be triplet-similar to K[vk-1]'', where both are formed in the same new arc stage. Let K[vk-1]' have equal or higher value than K[vk-1]''. Assume that there exists a triplet K[vn]'' derived from K[vk-1]'' that has a higher value than any feasible triplet derived from K[vk-1]' for root vn. We now show this is impossible. Let a derivation of K[vn]'' from K[vk-1]'' be [vk,Ck''],[vk+1,Ck+1''],...,[vn,Cn'']. Since K[vk-1]' is

triplet-similar to K[vk-1]'', a triplet K[vn]' can be formed from K[vk-1]' with the derivation [vk,Ck'],[vk+1,Ck+1'],...,[vn,Cn'] such that for all m = k,k+1,...,n, Cm' is cluster-similar to Cm''. This triplet K[vn]' is a feasible triplet because every cluster Cm' in it contains the same neighbors as Cm'' and has equal or lower cluster weight than Cm''. Since there are no co-tree arcs between the set of nonneighbors and the set of vertices that are ancestors of vk-1, the values of the triplets formed at vk,vk+1,...,vn are independent of all the nonneighbors that appear in a cluster together with neighbors. Thus, the difference in cluster values between each pair of Cm' and Cm'' is the sum of the values of those (tree or co-tree) arcs that exist among the nonneighbors in Cm' and Cm''. Since K[vk-1]' has equal or higher value than K[vk-1]'', for m = k,k+1,...,n the value of Cm' is equal to or higher than that of Cm''. Thus, K[vn]' has equal or higher value than K[vn]''. This contradicts the assumption made above. Q.E.D.

Given the definitions of the four operations and the dominance relation, an algorithm for ODCaa can be outlined as follows:

ALGORITHM: ODCaa-1
INPUT: an acyclic digraph G;
OUTPUT: an optimal acyclic digraph clustering of G and its value;
PROCESS:

I) choose an arbitrary vertex v as root and CALL RESTRUCTURE-3(G,v) to build a T of G with the lists Lnb[vr] and Lbn[vr] for each vertex vr;
II) FOR each vr visited in postorder traversal sequence DO:
   1) FOR i=0 TO od(vr) DO:
      1.1) IF i=0 THEN form a new PKf(k+1) from PKf(k) using the INSERT operation;
      ELSE DO:
      1.2) form a new PKf(k+1) from PKf(k) using PUT, conditional CONNECT, and conditional COMBINE; eliminate all dominated triplets;
      END;
III) find the triplet in PKf(m+n) that has the maximum value and RETURN the solution;
END.

Proposition 4.5: Let vp be the parent of vr. Each new arc stage forms no more than B^n[p,r] · n[p,r]! · B^n[r,j] · n[r,j]! · (n[r,j] + 1) new feasible triplets, where n[p,r] = |Lnb[vp] ∩ V[vr,i]| and n[r,j] = |Lnb[vr] ∩ V[vj,od(vj)]|.
Proof: See Appendix B. Q.E.D.
Note: when the given digraph is an out-tree, n[p,r]=1 and n[r,j]=1 in ODCaa-1. Each new arc stage then forms no more than 2·B² new feasible triplets and has only B triplets remaining after elimination.
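The dominance test of Proposition 4.4, which step 1.2 applies after forming new triplets, can be sketched in Python as follows. The triplet representation (a value plus a list of (vertex set, weight) clusters) and the function names are illustrative assumptions; note that since the clusters of one triplet are disjoint, distinct clusters that contain neighbors have distinct neighbor signatures.

def cluster_signature(cluster_vertices, neighbor_set):
    """The neighbors of Lnb[vr] contained in a cluster (its similarity key)."""
    return frozenset(cluster_vertices) & frozenset(neighbor_set)

def dominates(k1, k2, neighbor_set):
    """True if triplet k1 = (value1, clusters1) dominates k2 = (value2, clusters2),
    where clusters are lists of (vertex_set, weight) pairs."""
    value1, clusters1 = k1
    value2, clusters2 = k2
    if value1 < value2:                      # condition 3: F(K1) >= F(K2)
        return False
    sig1 = {cluster_signature(vs, neighbor_set): w for vs, w in clusters1
            if cluster_signature(vs, neighbor_set)}
    sig2 = {cluster_signature(vs, neighbor_set): w for vs, w in clusters2
            if cluster_signature(vs, neighbor_set)}
    if set(sig1) != set(sig2):               # condition 1: triplet-similar
        return False
    # condition 2: every cluster-similar pair satisfies W(C2) >= W(C1)
    return all(sig2[s] >= sig1[s] for s in sig1)

def eliminate_dominated(triplets, neighbor_set):
    """Keep only the triplets that are not dominated by another triplet in the set."""
    kept = []
    for t in triplets:
        if any(dominates(u, t, neighbor_set) for u in kept):
            continue                         # t is dominated by something already kept
        kept = [u for u in kept if not dominates(t, u, neighbor_set)]
        kept.append(t)
    return kept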

Example: Applying the ODCaa-1 algorithm to the acyclic digraph shown in Figure 4-1 yields the following result. For each triplet the table gives the stage at which it was formed by an INSERT/CONNECT/PUT/COMBINE operation, the stage at which it was consumed (eliminated) by a CONNECT/PUT/COMBINE operation, and the stage at which it was eliminated by dominance:

triplet                                         formed   elim. by CO/P/COM   elim. by domi.
(0,(1),<5>)                                       1         4
(0,(1),<3>)                                       2         5
(0,(1),<4>)                                       3         4
(0,(1,1),<4><5>)                                  4         5
(4,(2),<4,5>(4,5))                                4         5
(0,(1,1,1),<3><4><5>)                             5         7
(6,(2,1),<3,4><5>(3,4))                           5         7
(4,(2,1),<4,5><3>(4,5))                           5         7
(10,(3),<3,4,5>(3,4)(4,5))                        5         7
(0,(1),<2>)                                       6         7
(0,(1,1,1,1),<2><3><4><5>)                        7         9                   7
(1,(2,1,1),<2,4><3><5>(2,4))                      7         9                   7
(0,(2,1,1),<2,3><4><5>)                           7         9                   7
(6,(1,2,1),<2><3,4><5>(3,4))                      7         9
(7,(3,1),<2,3,4><5>(2,4)(3,4))                    7         9
(4,(1,2,1),<2><4,5><3>(4,5))                      7         9
(5,(3,1),<2,4,5><3>(2,4)(4,5))                    7         9
(4,(2,2),<4,5><2,3>(4,5))                         7         9
(10,(1,3),<2><3,4,5>(3,4)(4,5))                   7         9
(0,(1),<1>)                                       8         9
(6,(1,1,2,1),<1><2><3,4><5>(3,4))                 9                             9
(11,(2,2,1),<1,2><3,4><5>(2,1)(3,4))              9                             9
(9,(3,1,1),<1,3,4><2><5>(1,3)(3,4))               9                             9
(7,(1,3,1),<1><2,3,4><5>(2,4)(3,4))               9                             9
(4,(1,1,2,1),<1><2><4,5><3>(4,5))                 9                             9
(9,(2,2,1),<1,2><4,5><3>(2,1)(4,5))               9                             9
(7,(1,2,2),<2><4,5><1,3>(1,3)(4,5))               9                             9
(5,(1,3,1),<1><2,4,5><3>(2,4)(4,5))               9                             9
(8,(2,3),<1,3><2,4,5>(1,3)(2,4)(4,5))             9                             9
(4,(1,2,2),<1><4,5><2,3>(4,5))                    9                             9
(12,(3,2),<1,2,3><4,5>(2,1)(1,3)(4,5))            9                             9
(10,(1,1,3),<1><2><3,4,5>(3,4)(4,5))              9
(15,(2,3),<1,2><3,4,5>(2,1)(3,4)(4,5))            9

The optimal clustering solution is (15,(2,3),<1,2><3,4,5>(2,1)(3,4)(4,5)).

4.2 Optimal General Digraph Clustering of a General Digraph Problem (ODCgg)

The ODCgg problem is to find a set of bounded general

digraph clusters, or its subclass clusters, from a general digraph such that the total value of this set is optimized. ODCaa-1 can be used to solve ODCgg with one modification. The need for the modification comes from the observation that if two vertices vr and vj form a cycle (i.e., both (vr,vj) and (vj,vr) are in E), then in T not only is there a tree arc (e.g., (vr,vj)) between vr and vj, but there is also a co-tree arc (e.g., (vj,vr)) between them. In ODCaa-1, this co-tree arc does not exist and has not been taken into consideration in forming a set of feasible triplets for each subtree root vr. To include this situation, there are four cases to consider based on the assignments to these two arcs:
1) Xrj = 0 and Xjr = 0. This has been handled by the PUT operation in step 1.2 of ODCaa-1.
2) Xrj = 1 and Xjr = 0. This has been handled by the CONNECT operation in step 1.2 of ODCaa-1.
3) Xrj = 0 and Xjr = 1. This needs to be considered.
4) Xrj = 1 and Xjr = 1. This needs to be considered.
Since the three triplets formed for cases 2, 3 and 4 will always have the same weight and the same set of neighbors, by Proposition 4.4 the two triplets formed from cases 2 and 3 will always be dominated triplets, because the one formed for case 4 has the highest value. Thus, only cases 1 and 4 need to be taken into consideration. The modification needed so that ODCaa-1 can be applied to ODCgg is to modify the value

and configuration calculation of the CONNECT operation in step 1.2 of ODCaa-1 to:

F(K[vr,i+1]) = F(K[vr,i]) + F(K[vj,od(vj)]) + frj + fjr,
CF1(K[vr,i+1]) = CF1(K[vr,i]) ∪ CF1(K[vj,od(vj)]) ∪ {(vr,vj)} ∪ {(vj,vr)}.

4.3 Optimal Acyclic Digraph Clustering of a General Digraph Problem (ODCag)

The ODCag problem is to find a set of bounded acyclic digraph clusters, or its subclass clusters, from a given general digraph such that the total value of the set is optimized. To solve this problem, an algorithm ODCag-1 can be devised that is similar to the ODCaa-1 algorithm except for two modifications. The first modification is to consider the possibility of two vertices vr and vj forming a cycle. Unlike ODCgg, where only cases 1 and 4 are taken into consideration, cases 1, 2 and 3 are needed here. Case 4 is never allowed because it introduces cycles into the resulting cluster. Thus, one more conditional CONNECT operation must be added to step 1.2 of ODCaa-1 so that cases 2 and 3 are both covered. The second modification is to add one more elimination rule before the dominance eliminations in step 1.2 of ODCaa-1. The added rule is that if any cluster in a newly formed triplet is not acyclic, the triplet is eliminated. The algorithm for testing whether a cluster (digraph) is acyclic is a simple procedure (Aho et al '83). It is obvious that the optimal solution will then contain

only acyclic digraph clusters.
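One standard way to realize the acyclicity test mentioned above is a topological-sort check (Kahn's algorithm). The following small, self-contained Python sketch uses a graph representation chosen for illustration; it is a sketch of the test, not the dissertation's implementation.

from collections import defaultdict, deque

def is_acyclic(vertices, arcs):
    """Return True if the digraph (vertices, arcs) contains no directed cycle.
    Kahn's algorithm: repeatedly remove vertices of indegree zero."""
    indegree = {v: 0 for v in vertices}
    successors = defaultdict(list)
    for (u, v) in arcs:
        successors[u].append(v)
        indegree[v] += 1
    queue = deque(v for v in vertices if indegree[v] == 0)
    removed = 0
    while queue:
        u = queue.popleft()
        removed += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return removed == len(vertices)   # every vertex removed iff no cycle exists

# e.g. is_acyclic({1, 2, 3, 4}, [(2, 1), (2, 4), (3, 4), (1, 3)]) -> True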

CHAPTER 5
EXTENSIONS

In the previous three chapters, the solutions to the fourteen ODC problems that are related to the classes of general digraph, acyclic digraph, out-necklace, out-tree and out-star have been presented. This chapter considers three extensions to those solutions. They are: 1) extensions of the classes of graphs (directed or undirected); 2) monotonic objective functions and additional constraints for all problems; and 3) a divide-and-conquer technique for ODCaa, ODCgg and ODCag. Each extension is discussed in the following three sections, in that order.

5.1 Extensions of Classes of Graphs

In some application areas, the converse digraphs of the out-necklace, out-tree and out-star, which are the in-necklace Gni, in-tree Gti and in-star Gsi respectively (see Figure 5-1a), are more appropriate problem representations. Using the principle of directional duality (Harary '69), which states that "for each theorem about digraphs there is a

corresponding theorem obtained by replacing every concept by its converse", the solutions given in Chapters 2 and 3 can be modified to solve all ODC problems that are related to the in-tree, in-star and in-necklace (i.e., change all "indegree" to "outdegree" and "m-indegree vertices" to "m-outdegree vertices"). The modifications are trivial and need not be discussed.

Undirected graphs, i.e., the general undirected graph Gu, necklace Gn, tree Gt and star Gs (see Figure 5-1b), can also be handled in practically the same way. That is, the solutions given in Chapters 2, 3 and 4 can be used to solve them with slight or no modification. Obviously, ODCaa-1 can be used to solve OUCuu. To solve OUCnn, OUCtn, OUCsn, OUCtt, OUCst and OUCss, a direction can be imposed uniquely on the given necklace, tree and star (a small orientation sketch is given after Figure 5-1b). These problems can then be solved by using ODCnono-1, ODCtono-1, ODCsono-1, ODCtoto-1, ODCsoto-1 and ODCsoso-1. The OUCnu, OUCtu and OUCsu problems can be solved using an algorithm similar to ODCag-1 by changing the acyclicity test to necklace, tree and star tests. These three tests can be computed in constant time at each stage.

Another extension is to allow the given graph (directed or undirected) to be disconnected. The optimal clustering of each connected component of the disconnected graph is found independently. The final solution is the union of the optimal solutions obtained from clustering each connected component.
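A minimal sketch of this last extension, assuming a hypothetical solve_component routine that stands for whichever OGC algorithm applies to one connected component:

from collections import defaultdict

def solve_disconnected(vertices, arcs, solve_component):
    """Cluster each connected component independently and take the union.
    solve_component(vs, es) is assumed to return (value, clusters) for one component."""
    adj = defaultdict(set)
    for (u, v) in arcs:                      # connectivity ignores arc direction
        adj[u].add(v)
        adj[v].add(u)
    seen, total_value, all_clusters = set(), 0, []
    for start in vertices:
        if start in seen:
            continue
        stack, comp = [start], set()         # depth-first search for one component
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        value, clusters = solve_component(comp, [(u, v) for (u, v) in arcs if u in comp])
        total_value += value
        all_clusters.extend(clusters)
    return total_value, all_clusters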

Figure 5-1a In-necklace, in-tree and in-star.

Figure 5-1b General undirected graph, necklace, tree and star.
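As referenced above, imposing a direction on a given tree or star simply means choosing a root and directing every edge away from it; a minimal sketch under an assumed edge-list representation follows (the necklace case additionally requires orienting its single cycle consistently, which is not shown here).

from collections import defaultdict, deque

def orient_tree(edges, root):
    """Impose directions on an undirected tree by rooting it: every edge is
    directed away from `root`, turning the tree into an out-tree."""
    adj = defaultdict(set)
    for (u, v) in edges:
        adj[u].add(v)
        adj[v].add(u)
    arcs, seen, queue = [], {root}, deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                arcs.append((u, v))          # arc points away from the root
                queue.append(v)
    return arcs

# e.g. orient_tree([(1, 2), (2, 3), (2, 4)], root=1) -> [(1, 2), (2, 3), (2, 4)]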

In a case where the given graph contains parallel arcs or loops, our approach is still applicable if the graph can first be reduced to one without these parallel arcs and loops. For example, in many practical situations, a loop can simply be eliminated and each group of parallel arcs can be replaced by one arc with value equal to the sum of the arc values in the group (Chiang & Teorey '82).

5.2 Extensions of Objective Function and Constraint

So far, a simple objective function and two constraints (defined in Chapter 2) have been used in presenting the solutions of all the OGC subproblems. This section considers extensions to the class of objective functions and constraints such that all previous algorithms remain applicable and have the same time complexity. A basic approach common to all the solutions of the OGC subproblems discussed so far is that they all use ODCaa-1 (or its special case ODCtoto-1) without modifications in their computation of the objective function value. Thus, any objective function applicable to ODCaa-1 is also applicable to the rest of the OGC subproblems. This class of objective functions is the class of monotonic objective functions, which can be defined using the notation used in the proof of Proposition 4.4. Let the n vertices in the out-tree visited in postorder traversal sequence be denoted by v1,v2,...,vn. A triplet formed for K[vj] can be represented by a sequence of pairs: K[vj] = [v1,< >],[v2,C2],...,[vj,Cj].

Let K1[vj] = (F(K1[vj]), W(K1[vj]), CF(K1[vj])) and K2[vj] = (F(K2[vj]), W(K2[vj]), CF(K2[vj])) be two triplets formed in the same stage. Let a derivation of a triplet K1[vn] from a triplet K1[vj] be the sequence of pairs [vj+1,Cj+1'],[vj+2,Cj+2'],...,[vn,Cn']. An induced derivation of a triplet K2[vn] from K2[vj] can be defined as the sequence [vj+1,Cj+1''],[vj+2,Cj+2''],...,[vn,Cn''] such that Ci' is cluster-similar to Ci'' for j+1 ≤ i ≤ n. An objective function is monotonic if the following is true:
1) F(K1[vj]) ≤ F(K1[vn]).
2) F(K1[vj]) ≥ F(K2[vj]) ==> F(K1[vn]) ≥ F(K2[vn]).
Three example monotonic objective functions are as follows:
1) F1(K) = Σ over (i,j) ∈ E of (a·Xij·fij + b·X̄ij·fij), where a and b are nonnegative real constants.
2) F2(K) = Σ over v ∈ V of a·f(v)·p(v), where a is a positive real constant and f is any nonnegative real constant function which assigns a nonnegative real number to each vertex in V, i.e., f: V → R+ ∪ {0}. Let v1,v2,...,vr+1 be a path from the root v1 to a vertex vr+1. The integer function p assigns an integer to each vertex in V as follows: p(v1) = 1, p(vr+1) = p(vr) + X̄r,r+1.
3) F3(K) = a·F1(K) + b·F2(K).

82 ODC aa-1, we only need to prove it is applicable to Proposition 4.4. Proposition 5.1: Proposition 4.4 is still true when the objective function is a monotonic objective function. Proof: Suppose Kl[vj] and K2[vj] are feasible triplets formed in the same new arc stage and Kl[vj] dominates K2[vj], i.e., 1) K2[vj] is triplet-similar to Kl[vj]; 2) for every pair of C1 and C2 induced by Kl[vj] and K2[vj] respectively, W(C2) > W(C1); 3) F(Kl[vj]) > F(K2[vj). Let K2[vn] be a feasible triplet derived from K2[vj]. Then by 1), K1Lvn] can be derived from Kl[vj] with the induced derivation sequence. By 2), for each pair of C1eK1[vj] and C2CK2[vj], W(C2)>W(C1). Thus, Kl[vn] is a feasible triplet. Further more, by 3) and the definition of a monotonic funtion, F(K2[vj]) < F(K1[vj]) implies F(K2[vn]) <F(Kl[vn]). Hence, K2[vj] can be eliminated. Q.E.D. Proposition 5.2: Proposition 2.1 is still true when the objective function is a monotonic objective function. Proof: Immediate from Proposition 5.1. Q.E.D. To extend the set of constraints applicable to ODCaa-l and others, observe that the two simple constraints used in previous chapters are mandatory. This is because the cluster graph class constraint is used in the problem

classification (see section 1.1) while the integer cluster weight constraint is used in all algorithms to bound the computation time (e.g., in section 2.1.2, only 2·B² new triplets are formed in each new arc stage). But the class of formulas for calculating the cluster weight, and additional constraints, can be imposed on all OGC subproblem solutions without increasing the time complexity if they are constant time computable constraints. A constraint is constant time computable if it can be defined by a recursive formula such that it can be enforced in each stage of the clustering process in constant time and is independent of the calculation of the objective function value. Some examples of such constant time computable constraints are:

1) Let d(v1,v2) be the number of arcs on the unique path from a vertex v1 in a cluster Cr to a descendant v2:
   C1: Max over v2 ∈ Cr of d(vr,v2) ≤ Y, where vr is the root of Cr and Y is any integer.
2) C2: Σ over vi ∈ C of wi ≤ B, where wi is the weight of vertex vi.
3) Let C[vi] be an out-tree cluster rooted at vi. The length L(C[vi]) is defined as
   L(C[vi]) = si + Σ over vj with Xij=1 of ⌈(nj/ni) · L(C[vj])⌉,
   where ni and nj are two integers associated with vertices vi and vj respectively, si is the length of vertex vi, and L(C[vj]) is the length of the complete

subgraph in cluster C[vi] rooted at vj:
   C3: L(C[vi]) ≤ B.
All three types of constraints can obviously be defined by recursive formulas, which can be computed in constant time at each stage, and have nothing to do with the calculation of the objective function value.

5.3 Divide-and-Conquer Technique for ODCaa, ODCgg and ODCag

This section applies Lukes' divide-and-conquer idea (Lukes '75) to lessen the high time complexity of ODCaa, ODCgg and ODCag with the simple objective function and constraints. For each of these problems, we divide the given digraph into a set of covers (nonvoid and nondisjoint subgraphs), find a solution to each of these covers independently, and then recombine their clustering solutions into an optimal clustering solution. Before the algorithm is presented in sections 5.3.2 and 5.3.3, some terminology is first defined in section 5.3.1.

5.3.1 Cutpoint and Cover

A cutpoint of a digraph G=(V,E) is defined as a vertex in V such that removing the vertex from V results in two or more subgraphs of G (e.g., vertices 5, 6, 8, 10, … in Figure 5-2a). A digraph which has a cutpoint is called separable, and one which has none is called nonseparable. Let V' ⊆ V and E' ⊆ E. The induced subgraph G'=(V',E') is called a (maximum biconnected) cover if G' is nonseparable and if for every larger V'', V' ⊂ V'' ⊆ V, the induced subgraph G''=(V'',E'') is

Figure 5-2 A general digraph and its cover graph.

separable (e.g., GCV1 to GCV7 in Figure 5-2b). A cover (undirected) graph CG = (V,E) (e.g., Figure 5-2c) can be constructed from the set of covers and cutpoints of G. The vertex set V of CG is a set of vertices where each vertex represents either a cover (called a cover vertex CVj) or a cutpoint (called a cutpoint vertex cpi), i.e., V = { cp1, cp2, ..., cpr, CV1, CV2, ..., CVt }. The arc set E of CG is a set of undirected arcs (cpi,CVj) where the cutpoint cpi is a vertex in the cover which CVj represents, i.e., E = { (cpi,CVj) | cpi is a vertex in the cover GCVj }.

Proposition 5.3: The cover (undirected) graph CG=(V,E) constructed from G has the following properties:
a) Each cover represented by a cover vertex in CG is nonseparable.
b) For any two covers represented by two cover vertices in CG, their intersection contains at most one vertex.
c) cp is a cutpoint in G if and only if cp is the intersection of some two covers of G.
d) CG is a tree.
Proof: a), b) and c) are immediate from Aho et al ('74). d) is immediate from b). Q.E.D.

If a digraph G has more than one cover, the following proposition shows that it is valid to find the optimal solution of G by clustering each cover independently and then combining these clusterings to generate an optimal

solution of G.

Proposition 5.4: Let a digraph G have s covers, where s > 1, and let K be an optimal clustering of G which induces t clusters, K = { C1, C2, ..., Ct }. The optimal solution K of G can always be decomposed into s subsets of clusters where each subset is a feasible clustering of one cover and the sum of the values of these s subsets is equal to the optimal clustering value.

Proof: The set of clusters induced by K can be divided into two sets. One set (KN) contains those clusters that have no cutpoints, while the other set (KC) contains those that do. Thus, F(K) = F(KN) + F(KC). Let in-the-same-cover be a relation defined on the set of clusters induced by K. Any two clusters C1 and C2 induced by K are in-the-same-cover if they contain only vertices of the same cover. Since the set KN is a set of clusters that contain no cutpoints, each cluster in KN contains vertices of one cover only. This is because if a cluster contains vertices that belong to more than one cover, the cluster must also contain the cutpoints connecting these covers. Thus, the set KN can easily be partitioned into s disjoint subsets, KN1, KN2, ..., KNs, by using the in-the-same-cover equivalence relation while the total value remains the same, i.e.,

Σ(i=1 to s) F(KNi) = F(KN).

Since the set KC is a set of clusters that contain

cutpoints, each cluster in KC could contain vertices of one or more covers. For each cluster C in KC that contains vertices belonging to m of the s covers, C can be partitioned into m smaller disjoint clusters {C1,C2,...,Cm} such that 1) C1 ∪ C2 ∪ ... ∪ Cm = C, and 2) for each cutpoint cp in C, cp is assigned to only one cover. The partition of each C into m smaller clusters obviously does not increase the total value of the cluster, i.e.,

Σ(j=1 to m) F(Cj) = F(C).

Since the intention is to decompose K into s subsets of clusters, where each subset is a feasible clustering of one cover, each cutpoint, which is assigned to only one cover, should also be included as a single-vertex cluster in all covers that contain it. This newly formed set of clusters does not increase the value of the set KC because only the single-vertex clusters, each of which has no value, are added to the other clusters. Each cluster now contains vertices belonging to only one cover. The set can now easily be divided into s subsets, KC1, KC2, ..., KCs, which satisfy

Σ(i=1 to s) F(KCi) = F(KC).

Thus, the original optimal clustering K can be transformed into s subsets of clusters without changing the total value: {{KN1 ∪ KC1}, {KN2 ∪ KC2}, ..., {KNs ∪ KCs}}, and each {KNi ∪ KCi} is a feasible clustering of cover i.

Q.E.D.

From the proof of Proposition 5.4, the optimal solution of a digraph G can be obtained by the reverse process. Thus, the divide-and-conquer process consists of two steps: 1) Divide G into a set of covers. 2) Find and combine the clusterings of each cover to form the optimal solution.

5.3.2 Divide G into a Set of Covers

The process of finding a set of covers for a digraph is similar to the process of finding the set of biconnected components of an undirected graph. A biconnected component of an undirected graph G is a (maximum) subgraph of G which has no cutpoints. An efficient algorithm for finding biconnected components has been presented in Aho et al ('74). We modify it to solve the cover finding problem here. The basic idea of the cover finding algorithm is to use the property that the removal of a nonleaf vertex of a tree splits the tree into two or more subgraphs. Thus, the algorithm first constructs a (depth-first) tree-like structure from the given digraph and then decides which vertices are cutpoints. The decision rules are based on the following proposition:

Proposition 5.5: Let G=(V,E) be a given digraph. Let T=(V, E0 ∪ E1) be a tree-like structure for G. Vertex vr is a cutpoint of G if either of the following is true for vr:

1) vr is the root and has more than one child.
2) vr is neither the root nor a leaf vertex, and for some child vj of vr, there is no co-tree arc between any descendant of vj (including vj itself) and a proper ancestor of vr.
Proof: Immediate from Aho et al ('74). Q.E.D.

Using the DF-NO(vr) and LOW(vr) values presented in Chapter 4, a vertex vr is a cutpoint if 1) vr is not a leaf vertex; and 2) either one of the following is true: 2.1) vr is not the root and LOW(vj) ≥ DF-NO(vr), where vj is a child of vr; 2.2) vr is the root and has more than one child. The following COVER-FINDING algorithm uses DF-NO(vr) and LOW(vr) to find all covers. The algorithm is similar to Aho et al's ('74) algorithm for finding biconnected components.

ALGORITHM: COVER-FINDING-1
INPUT: an acyclic/general digraph G=(V,E);
OUTPUT: a list of the arcs of each cover of G and a set of cutpoints;
PROCESS:
ALGORITHM: Cover-search(vr)
PROCESS:
1) mark vr as "visited"; DF-NO(vr) = df-count; LOW(vr) = DF-NO(vr); df-count = df-count + 1;

2) FOR each vertex vj which is-related-to vr DO:
   2.1) PUSH the arc (i.e., (vj,vr) or (vr,vj)) onto STACK if it is not already there;
   2.2) IF vj is not visited DO:
        2.2.1) parent(vj) = vr;
        2.2.2) CALL Cover-search(vj);
        2.2.3) IF LOW(vj) ≥ DF-NO(vr) DO:
               cutpoint-set = cutpoint-set ∪ {vr};
               POP from STACK all arcs up to and including the arc containing vj (i.e., (vj,vr) or (vr,vj)); include these arcs in cover(cover-count);
               cover-count = cover-count + 1;
        2.2.4) LOW(vr) = MIN(LOW(vr), LOW(vj));
   2.3) ELSE IF vj is not parent(vr) THEN LOW(vr) = MIN(LOW(vr), DF-NO(vj));
   END;
I) df-count = 1; cover-count = 1;
II) choose an arbitrary vertex v as the root of the T to be built and CALL Cover-search(v);
III) RETURN the list of arcs of each cover and the set of cutpoints;
END.

Proposition 5.6: The run time of algorithm COVER-FINDING-1 is O(m) where m is the number of arcs in G.
Proof: Immediate from Aho et al's finding biconnected

components algorithm ('74). Q.E.D.

5.3.3 Find and Combine the Clusterings of Each Cover to Form the Optimal Solution

Using the list of arcs of each cover and the set of cutpoints, a cover graph with a cover vertex as root can easily be constructed. The cover graph, which is a tree, can be used to order a sequential sequence of clusterings and combinations of covers similar to ODCtoto-1. The vertices of the constructed tree are still visited in postorder traversal sequence. But for each vertex vr, the od(vr)+1 stages induced by visiting G[vr,0], G[vr,1], ..., G[vr,od(vr)] in ODCtoto-1 are now aggregated into one stage. Thus, the sequential clustering sequence of the cover graph consists of n stages. Figure 5-3 shows the eleven stages of the cover graph of Figure 5-2. The n stages can be classified into two types by whether the vertex considered is a cutpoint vertex or a cover vertex. The two types are the cutpoint stage and the cover stage. If the vertex considered at a certain stage is a cutpoint vertex cpj, then that stage is a cutpoint stage. All the children of the cutpoint vertex cpj are cover vertices. The operation needed at this type of stage is simply to collect all the feasible clusterings formed so far by each subtree of the tree rooted at these cover vertices. The set of feasible clusterings collected is called the cutpoint list of cpj, denoted by CL[cpj]. If the vertex considered at a stage is a cover vertex

Figure 5-3 Eleven stages of the clustering process of the cover graph given in Figure 5-2.

(CV), then it is a cover stage. The only parent vertex of CV is a cutpoint vertex (cpi) and all the children of CV are cutpoint vertices (cpj). The operation needed at this type of stage is to form a set of feasible clusterings from the corresponding cover GCV which CV represents and from all the children's cutpoint lists CL[cpj]. Depending on whether it is an ODCaa, ODCgg or ODCag problem, the corresponding algorithm ODCaa-1, ODCgg-1 or ODCag-1 can be modified to suit this purpose. The modifications are as follows:
1) Since the set of feasible clusterings formed for the cover vertex CV will be combined with that of its ancestor cover vertex, which shares the same cutpoint vertex cpi with CV in the cover graph, only the set of all feasible clusterings with respect to this cutpoint needs to be formed. Step I) of ODCaa-1, ODCgg-1 and ODCag-1 can be modified to select cpi as the root of the tree-like structure T to be constructed. The set of feasible clusterings produced at the end is the set desired.
2) Suppose vr is a vertex in the cover GCV and vr is a cutpoint cpj in CG. To obtain the set of all feasible clusterings for vertex vr in CV, the vertex should be composed with the corresponding cutpoint list CL[cpj] at the end of the stage that considers vr. The operation is called a COMPOSE operation, which is defined as:

COMPOSE operation:

FOR each pair of feasible triplets K[vr] and K[cpj] from PKf[vr] and CL[cpj] respectively, form a triplet K[v] with
   F(K[v]) = F(K[vr]) + F(K[cpj]),
   W(K[v]) = W(K[vr]) + W(K[cpj]) - wr,
   CF(K[v]) = CF0(K[v]) ∪ CF1(K[v]) where
      CF0(K[v]) = {C2,C3,...,Cp,C2',C3',...,Cq'} ∪ {C1 ∪ C1' - {vr}},
      CF1(K[v]) = CF1(K[vr]) ∪ CF1(K[cpj]).

The second step of the divide-and-conquer process, finding and combining the clusterings of each cover to form the optimal solution, can now be described in an algorithm:

ALGORITHM: FIND-GLOBAL-SOLUTION
INPUT: a list of arcs for each cover and a set of cutpoints;
OUTPUT: an optimal solution and its value;
PROCESS:
I) choose an arbitrary cover vertex CV as root and construct a cover graph CG using the list of arcs for each cover and the set of cutpoints;
II) FOR each vertex in the cover graph traversed in postorder sequence DO:
   IF the vertex is a cover vertex CV THEN
   1) let cpi be the parent vertex of CV in the cover graph; choose cpi as the root and CALL RESTRUCTURE-3(GCV, cpi) to build a tree-like structure T with lists Lnb[vr] and Lbn[vr] for

each vertex;
   2) FOR each vr in the cover GCV traversed in postorder sequence DO:
      2.1) FOR j=0 TO od(vr) DO:
           2.1.1) same as step 1.1 of ODCaa-1, ODCgg-1 or ODCag-1;
           2.1.2) same as step 1.2 of ODCaa-1, ODCgg-1 or ODCag-1;
           END;
      2.2) IF vr is a cutpoint in the cover graph THEN
           2.2.1) COMPOSE PKf[vr,od(vr)] produced in 2.1 with the cutpoint list CL[vr] to form a new set PKf[vr,od(vr)];
           2.2.2) eliminate all dominated triplets in the PKf[vr,od(vr)] formed in 2.2.1;
           END;
   ELSE IF the vertex is a cutpoint cpi THEN DO:
      FOR each child vj of cpi DO: CL[cpi] = CL[cpi] ∪ PKf[vj]; END;
   END;
III) find the triplet in PKf[CV,od(CV)] that has the maximum value and RETURN it;
END.

This algorithm reduces the number of feasible triplets formed at each stage of ODCaa-1, ODCgg-1 or ODCag-1 to a number directly proportional to the number of feasible triplets formed if each cover were clustered independently. This algorithm can also be applied to OUCuu, OUCnu, OUCtu

and OUCsu as well, with the simple objective function and the two constraints given in section 2.1.
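To make the cover-finding step of section 5.3.2 concrete, the following Python sketch follows the arc-stack scheme of COVER-FINDING-1 (after the biconnected-components algorithm of Aho et al '74). The adjacency-list representation and the function name are assumptions made for illustration, the graph is assumed to be connected, and this is a sketch rather than the dissertation's implementation.

from collections import defaultdict

def find_covers(vertices, arcs, root):
    """Find the covers (biconnected components, ignoring arc directions) and the
    cutpoints of a digraph. Each cover is returned as a list of its original arcs."""
    adj = defaultdict(list)
    for arc in arcs:
        u, v = arc
        adj[u].append(arc)
        adj[v].append(arc)
    df_no, low, parent = {}, {}, {root: None}
    counter = [1]
    stack, covers, cutpoints = [], [], set()

    def search(vr):
        df_no[vr] = low[vr] = counter[0]
        counter[0] += 1
        children = 0
        for arc in adj[vr]:
            vj = arc[1] if arc[0] == vr else arc[0]
            if vj not in df_no:
                children += 1
                parent[vj] = vr
                stack.append(arc)              # push the arc when it becomes a tree arc
                search(vj)
                low[vr] = min(low[vr], low[vj])
                if low[vj] >= df_no[vr]:
                    # vr separates vj's subtree: pop one cover off the stack
                    cover = []
                    while True:
                        a = stack.pop()
                        cover.append(a)
                        if a == arc:
                            break
                    covers.append(cover)
                    if parent[vr] is not None or children > 1:
                        cutpoints.add(vr)      # the root is a cutpoint only with >1 child
            elif vj != parent[vr] and df_no[vj] < df_no[vr]:
                stack.append(arc)              # back (co-tree) arc, pushed once
                low[vr] = min(low[vr], df_no[vj])

    search(root)
    return covers, cutpoints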

CHAPTER 6
SOME INFORMATION SYSTEM DESIGN APPLICATIONS

This chapter considers three information system design problems that can be formalized and solved as OGC problems. To show that an information system design problem can be solved by using an OGC solution algorithm, it is necessary to define the objective function and constraints, decide which ODC problem it is, and prove the constraints and objective function to be applicable. The objective function of a design problem can usually be formulated in many ways. Only one way is shown here to illustrate the solution approach. The three information system design problems to be discussed are: 1) the B-tree secondary storage allocation problem (section 6.1); 2) the translation of an integrated schema to an IMS schema problem (section 6.2); and 3) the database record clustering problem (section 6.3).

6.1 B-Tree Secondary Storage Allocation

Computer-based information systems are critical to the management of large public or private organizations. Such systems typically operate on databases containing billions

of characters and serve varying user communities' information needs (March '83). In these systems, indexes (or directories) are built upon the data files (which are stored on secondary storage) to facilitate random or sequential accesses. Because of the large amount of data, the indexes (or directories) are large files in their own right and are stored in secondary storage. In the present practice of Data Base Management Systems (DBMS) and Information Retrieval systems, the B-tree is one of the most popular techniques for organizing index files (Comer '79). The idea of the B-tree is related to the binary search tree and the m-ary search tree. A binary search tree is a tree which contains either no vertices at all, or every vertex in the tree contains a key and two children where 1) all keys in the left subtree are less (alphabetically or numerically) than the key in the root; 2) all keys in the right subtree are greater than the key in the root; and 3) the left and right subtrees are also binary search trees (Horowitz & Sahni '76). An m-ary search tree is a generalization of a binary search tree in which each vertex has at most m-1 keys and m children. If t1 and t2 are two children of some vertex, and t1 is to the left of t2, then the keys in the subtree rooted at t1 are all less than the keys in the subtree rooted at t2. A B-tree of order m is a special type of balanced m-ary

tree with the following properties: 1) The root is either a leaf or has at least two children. 2) Each vertex, except for the root and the leaf vertices, has between ⌈m/2⌉ and m children. 3) Each path from the root to a leaf vertex has the same length (Aho et al '83). Figure 6-1 shows a B-tree of order 5. To retrieve a record in the data file with a given key value, the B-tree is used instead of searching all records sequentially in the data file. The process always starts by searching from the root down to the leaf vertex which contains the key value and a pointer to the record with that key value in the data file. If each vertex of the B-tree of order m is stored in a separate (physical) block and there are a total of n different records (or keys) in the data file, each such retrieval needs no more than log⌈m/2⌉((n+1)/2) physical block accesses (PBAs) to the index file. Since very often the length of a vertex in the B-tree is much less than the secondary storage block length (which is typically 2000 to 4000 bytes), many vertices can be stored in one block. This gives a chance of lowering the log⌈m/2⌉((n+1)/2) PBAs by storing more than one vertex in a block. But observe that only storing a vertex together with its parent in the same block can reduce the number of PBAs. Thus, for each vertex there is only one decision to be made, i.e., whether to store the vertex with its parent in the same block or not. This is equivalent to assigning a 0 or 1 to each arc, and thus an OGC problem (a small sketch of the resulting cost computation follows Figure 6-1).

Figure 6-1 A B-tree of order 5.
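The effect of the 0/1 arc assignments on retrieval cost can be previewed with a small Python sketch; it computes the total expected number of PBAs for a given assignment, in the spirit of the objective function defined formally below. The tree representation, the frequencies and the vertex names used here are illustrative assumptions.

def expected_pba(parent, freq, same_block):
    """Total expected physical block accesses for a B-tree index.
    parent: dict vertex -> parent vertex (None for the root);
    freq: dict vertex -> access frequency f(v);
    same_block: dict vertex -> 1 if the vertex is stored in the same block as its
    parent (the assignment on the arc from its parent), else 0."""
    pba = {}

    def pba_of(v):
        if v in pba:
            return pba[v]
        if parent[v] is None:
            pba[v] = 1                       # accessing the root costs one block access
        else:
            # a new block is read only when v is NOT stored with its parent
            pba[v] = pba_of(parent[v]) + (1 - same_block[v])
        return pba[v]

    return sum(freq[v] * pba_of(v) for v in parent)

# Hypothetical three-vertex index: root r with children a and b,
# where a shares the root's block and b does not.
# expected_pba({'r': None, 'a': 'r', 'b': 'r'},
#              {'r': 0, 'a': 50, 'b': 20},
#              {'r': 0, 'a': 1, 'b': 0})    # -> 50*1 + 20*2 = 90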

The decision can be based on the usage pattern of the environment, which is the access frequency of each key in the B-tree. The higher the frequency with which a key is accessed, the more of its ancestors should be stored in the same block with it. Since a B-tree can be represented as an out-tree, the problem then is to find a set of out-tree clusters, or its subclass clusters, such that the total expected number of PBAs for accessing every key in the environment is minimized. The B-tree secondary storage allocation problem can be formally defined as an ODCtoto problem:

Given:
1) an out-tree G=(V,E) which represents the B-tree of order m;
2) two constraints: one requires each cluster to be an out-tree or its subclass; the other limits the length L(Ci) of each cluster Ci induced by a clustering K to a block length B, i.e., for all Ci ∈ K, L(Ci) = Σ over v ∈ Ci of s(v) ≤ B, where s is a constant function which assigns to each vertex an integer value equal to the size of the vertex, namely m · (size of pointer) + (m-1) · (size of key);
3) an objective function which is defined as F(K) = Σ over v ∈ V of f(v) · PBA(v), where f(v) is the frequency of accesses to v; PBA(v)

is the number of physical block accesses needed to access v from the root. Let r = v1, v2, ..., vm = v be a path from the root r to a vertex v. The PBA function can be defined recursively as PBA(v1) = 1; PBA(vi+1) = PBA(vi) + X̄vi,vi+1, where X̄vi,vi+1 is the inverse of the assignment on the arc (vi,vi+1).
Find: A feasible clustering K such that F(K) is minimized.

Proposition 6.1: The objective function defined above is monotonic and the second constraint is constant time computable.
Proof: Immediate from F2 and C2 of section 5.2. Q.E.D.

6.2 Translation of Integrated Schema to IMS Schema

Database design is the process of synthesizing conceptual and target DBMS schemas (logical or physical) that satisfy the information manipulation needs of a community of users. A stepwise database design process consisting of the following stages has been widely accepted (Teorey & Fry '82): Stage 1) Requirements formulation and analysis. Stage 2) Conceptual design. Stage 3) Implementation design. Stage 4) Physical design. This section considers a principal issue in an

implementation design, which is how to map an integrated schema produced by the conceptual design stage to an IMS schema. Many design tools, as surveyed by Chen et al ('80), have been developed to aid each stage of the design process. Recently, a new database design system called the View Integration System (VIS) (Chiang et al '83) has been built to support the first three stages of a database design process. The VIS integrates many related tools to support a bottom-up logical database methodology. One of the tools is for the translation of an integrated schema to an IMS schema, which can be based on several solutions given in this dissertation and is discussed in this section. An integrated schema synthesized from user views by VIS consists of a set of 3NF record types (or groups) interrelated by relationships which are essentially of type 1:m (1:1 is a special case of it). Depending on the structure of the integrated schema, it can be represented as a general digraph, acyclic digraph, out-necklace, out-tree, or out-star, with each vertex representing a record type and each arc representing a relationship. The primary task of translating the integrated schema to the IMS schema is to transform the integrated schema into a schema that satisfies the rules of IMS. An IMS schema is made up of one or more physical database and logical database descriptions. A physical database description is made up of

descriptions for a set of segments interrelated by a set of relationships in a hierarchical tree structure. The concept of segments is equivalent to the concept of record types, and each hierarchical parent-to-child relationship is also of type 1:m. Thus, each physical database description can be represented as an out-tree with each vertex representing a segment and each arc representing a 1:m relationship. But a set of out-trees cannot represent those additional arcs that form m-indegree vertices or cycles that could exist in a general digraph, acyclic digraph or out-necklace. The logical database descriptions are used to supply this missing information. A logical database description is composed of descriptions of segments from one or more physical databases interrelated by 1:m relationships in a hierarchical tree structure. The idea of forming a logical database description is to establish a relationship between any point in a physical database and a target segment of another or the same physical database. The target segment is called the logical parent segment. This relationship is provided by establishing a link field as a new data field created as part of an existing segment or as part of a newly created virtual segment which contains the link field (Cardenas '79). The links in the logical or physical database are unidirectional, proceeding from parent to child and not vice versa. But each child can be specified to have a pointer to

However, each child can be specified to have a pointer to its parent. In the following discussion, it is assumed that this is the case. This allows a link to be bidirectional and makes for a simpler representation of m-indegree vertices and cycles.

Figure 6-2a shows, in pictorial form, two alternatives for translating an integrated schema into its corresponding IMS physical databases and logical database descriptions. The solid boxes, solid arcs and dashed boxes represent segments, relationships and virtual segments, respectively, in physical database descriptions. The dashed line represents a relationship in a logical database description. Figure 6-2b is another example. Several restrictions are imposed on the hierarchical tree structure of logical or physical database descriptions. The relevant ones are that a maximum of 255 segment types may be included and that up to 15 segment types are allowed in any one hierarchical path.

Again, to decide which alternative is better, the usage pattern of the environment must be estimated or collected. The usage pattern used in Figure 6-2a is the frequency of transitions between two vertices in the integrated schema. Obviously, if the frequencies of transitions from segment STUDENT to GRADE (i.e., 300 times/day) and from COURSE to GRADE (i.e., 100 times/day) shown in the integrated schema in Figure 6-2a are specified, the first translation shown in Figure 6-2a is the better choice.

Figure 6-2a  Translating an integrated schema into its corresponding IMS physical and logical databases: example 1.

Figure 6-2b  Translating an integrated schema into its corresponding IMS physical and logical databases: example 2.

This is because, if the occurrences of each segment type are stored in different physical blocks, the total number of PBAs for accessing GRADE from STUDENT and GRADE from COURSE together is smaller for the first translation than for the second:

    Total block accesses of the first translation in Figure 6-2a:  300*1 + 100*2 = 500.
    Total block accesses of the second translation in Figure 6-2a: 300*2 + 100*1 = 700.

This translation problem can be formalized as ODCtog, ODCtono, ODCtoa, ODCtoto or ODCsoso, depending on the class of digraph to which the integrated schema belongs. The translation problem can be formalized as follows:

Given: 1) a digraph G = (V,E) which represents the integrated schema;

2) three constraints: one requires that each cluster be an out-tree or one of its subclasses; one limits the size S(C_i) of each cluster C_i of a clustering K to 255, i.e.,

    ∀ C_i ∈ K:  S(C_i) = number of vertices in C_i ≤ 255;

and one limits the depth D(C_i) of each cluster C_i of a clustering K to 15, i.e.,

    ∀ C_i ∈ K:  D(C_i) = max over v ∈ C_i of d(v) ≤ 15,

where d(v) is the length of the unique path from the root of C_i to v;

3) an objective function defined as

    F(K) = Σ_{(v_i,v_j): x_ij = 1} f_ij  +  2 Σ_{(v_i,v_j): x_ij = 0} f_ij
         = Σ_{(v_i,v_j) ∈ E} (2*x̄_ij + x_ij) * f_ij,

where f_ij is the transition frequency between vertices v_i and v_j, x_ij = 1 if v_i and v_j are placed in the same cluster and 0 otherwise, and x̄_ij = 1 - x_ij.

Find: A feasible clustering K such that F(K) is minimized.

Proposition 6.2: The objective function defined above is monotonic, and the second and third constraints are constant time computable.

Proof: This is immediate from F1, C2 and C1 in section 5.2. Q.E.D.

6.3 Database Record Clustering

An important cost factor of a physical database design is the secondary storage search time required by the typical operational workload. The major component of that search time is the expected number of physical block accesses (PBAs). Thus, one important consideration in the initial physical database design, or in a later reorganization of the database, is to map the logical database structure onto secondary storage in such a way that PBAs are minimized (Chiang & Teorey '82). One way to minimize the expected PBAs is to use the database usage pattern to group occurrences of record types that are logically related and frequently accessed together into clusters, and to store each cluster in the same block, so that inter-block accesses are minimized.

Many DBMSs (e.g., IDMS, IMS, SYSTEM R) allow users to specify such a request with a statement like CLUSTERED VIA SET NEAR OWNER. For example, in Figure 6-3a, X is the parent record type of child record type Y. The occurrences of X and Y are physically linked together as in Figure 6-3b. Assume occurrences of X and Y are frequently accessed together. A user can specify CLUSTERED Y VIA X-Y NEAR OWNER X, which will cause the DBMS to place the occurrences of X and Y as shown in Figure 6-3c. This placement has the advantage that any time x_i is read into memory, y_i1, y_i2 and y_i3 are also in memory, which reduces PBAs.

There are several restrictions on the way a database designer can specify such grouping of occurrences: 1) each specification must be via a set type; 2) each record type can be clustered with only one of its parents; 3) a record type may not be CLUSTERED VIA a set type of which it is both member and owner; and 4) the specifications cannot form a cycle. The first restriction implies that the database record clustering problem is a clustering problem. The second, third and fourth restrictions imply that each cluster must be an out-tree or one of its subclasses (a sketch of a check for these restrictions is given after Figure 6-3). Thus, the database record clustering problem is an ODCsoso, ODCtoto, ODCtono, ODCtoa or ODCtog problem if the given logical database can be represented as an out-star, out-tree, out-necklace, acyclic digraph or general digraph, respectively.

Figure 6-3  Placements of record occurrences using CLUSTERED VIA SET NEAR OWNER. (A rectangle represents a record type, a square represents a page, and a circle represents an occurrence.)
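To make restrictions 2) through 4) concrete, the following is a minimal sketch, not taken from the dissertation, of a validity check for a set of CLUSTERED VIA specifications. The record type names, the clustered_via mapping and the function name are hypothetical; restriction 1) is taken to be implicit in the form of the mapping (each entry corresponds to one set type). The check verifies that each record type is clustered via at most one owner, that no record type is clustered via itself, and that the specifications form no cycle; together these guarantee that every cluster is an out-tree or one of its subclasses.

    # Sketch: validate a set of CLUSTERED VIA SET NEAR OWNER specifications.
    # Assumption (not from the dissertation): clustered_via maps a child record
    # type to the single owner record type it is clustered with; a record type
    # with no entry starts its own cluster.

    def valid_clustering_spec(clustered_via):
        # Restriction 2: a dict already allows at most one owner per child.
        # Restriction 3: a record type may not be clustered via itself.
        for child, owner in clustered_via.items():
            if child == owner:
                return False
        # Restriction 4: the specifications may not form a cycle.
        for start in clustered_via:
            seen = {start}
            node = start
            while node in clustered_via:      # follow child -> owner links
                node = clustered_via[node]
                if node in seen:              # revisiting a record type: cycle
                    return False
                seen.add(node)
        return True                           # the clusters form out-trees

    # Hypothetical usage: Y is stored near its owner X, as in Figure 6-3c.
    print(valid_clustering_spec({"Y": "X"}))            # True
    print(valid_clustering_spec({"X": "Y", "Y": "X"}))  # False (cycle)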

The database record clustering problem can be formalized as follows:

Given: 1) a digraph G = (V,E) which represents the database schema;

2) two constraints: one requires each cluster to be an out-tree or one of its subclasses; one limits the length L(C[v_i]) of each cluster C[v_i] rooted at v_i to the block length B:

    L(C[v_i]) = s_i + Σ_{v_j: x_ij = 1} (n_j / n_i) * L(C[v_j]) ≤ B,

where n_i and n_j are the expected numbers of occurrences of vertices v_i and v_j, s_i is the length of vertex v_i, and L(C[v_j]) is the length of the complete subgraph in C[v_i] rooted at v_j;

3) an objective function defined as

    F(K) = Σ_{(v_i,v_j): x_ij = 0} f_ij,

where f_ij is the transition frequency between v_i and v_j, as in section 6.2.

Find: A feasible clustering K such that F(K) is minimized.

Proposition 6.3: The objective function defined above is monotonic and the second constraint is constant time computable.

Proof: Immediate from F1 and C3 of section 5.2. Q.E.D.

Example: Let the database schema given by Teorey and Fry ('80) be represented by the acyclic digraph shown in Figure 6-4a.

Figure 6-4  A database schema and its two maximal out-tree clusterings.

The record length (s_i) and the expected number of occurrences (n_i) of each record type v_i are shown in the following table:

    record type    record length    expected no. of occurrences
    PLANT                60                 50
    JOB_TYPE             26                 1K
    WORK_TASK            13               200K
    EMPLOYEE             80               100K
    EMP_DATA             36               100K

The usage information, characterized by the frequency and the number of logical record accesses (LRAs) made to the database schema by each application, is given in the following table:

    application    frequency     LRAs
    T1             100/day       1 PLANT + 1 EMPLOYEE + 1 EMP_DATA
    T2             100/day       1 PLANT + 1 EMPLOYEE + 1 EMP_DATA
    T3             200k/day      1 EMPLOYEE + 1 WORK_TASK + 1 JOB_TYPE
    T4             120k/year     1 EMPLOYEE
    R1             Quarterly     100k EMPLOYEE + 100k EMP_DATA
    R2             Annually      50 PLANT + 100k EMPLOYEE
    R3             Monthly       50 PLANT
    R4             Monthly       50 PLANT + 50*20 JOB_TYPE
    R5             Daily         50 PLANT + 100k EMPLOYEE + 200k WORK_TASK + 200k JOB_TYPE
    R6             Monthly       50 PLANT + 100k EMPLOYEE + 100k EMP_DATA
    R7             1000/day      1 EMPLOYEE

Let f_ij be the transition frequency between v_i and v_j. A table of f_ij can be constructed from the frequency and LRAs of each application as follows. (We assume 1 year = 240 working days, 1 quarter = 60 working days, and 1 month = 20 working days.)

    (v_i, v_j)              f_ij                                      from
    PLANT, EMPLOYEE         0
    EMPLOYEE, EMP_DATA      100 + 100 + 100k/60 + 100k/20 = 6.86k     T1, T2, R1, R6
    PLANT, JOB_TYPE         1k/20 = 0.05k                             R4
    JOB_TYPE, WORK_TASK     200k + 200k = 400k                        T3, R5
    EMPLOYEE, WORK_TASK     200k + 200k = 400k                        T3, R5

Let the block size B be 2,000. Since the given database schema can be represented as an acyclic digraph and the desired clusters are out-trees or their subclasses, the ODCtoa solution presented in Chapter 3 is applied here. The following traces the results produced by each major step:

    step    action performed
    1       Obtain the first maximal out-tree clustering (see Figure 6-4b).
    1.1     Apply ODCtoto-1 to obtain 3 clusters:
                cluster 1 contains: PLANT, JOB_TYPE
                cluster 2 contains: WORK_TASK
                cluster 3 contains: EMPLOYEE, EMP_DATA
            with value F(K) = 0 + 400k + 400k = 800k.
    1.2     Keep the solution (because this is the first result).
    1       Obtain the second maximal out-tree clustering (see Figure 6-4c).
    1.1     Apply ODCtoto-1 to obtain 2 clusters:
                cluster 1 contains: PLANT, JOB_TYPE
                cluster 2 contains: EMPLOYEE, WORK_TASK, EMP_DATA
            with value F(K) = 0 + 400k = 400k.
    1.2     Keep the solution (because 400k is less than 800k).
    2       RETURN the kept solution, with F(K) = 400k.  Q.E.D.

Note: Without clustering, the number of PBAs is 0 + 6.86k + 0.05k + 400k + 400k = 806.91k, while with clustering it is 400k. This is a savings of (806.91 - 400)/806.91 ≈ 50%.
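As an independent check on the trace above (and not the dissertation's ODCtoa-1 algorithm itself), the following minimal sketch recomputes F(K) and the cluster-length constraint for the two candidate clusterings of this example. The dictionaries and function names are hypothetical; the values of s_i, n_i, f_ij and B = 2,000 are taken from the tables above, and the parent-child relations used for the length check are those of the second maximal out-tree clustering (Figure 6-4c).

    # Sketch: re-evaluate the two candidate clusterings of the Figure 6-4
    # example.  F(K) sums the transition frequencies f_ij of arcs whose
    # endpoints fall in different clusters (x_ij = 0); L(C[v]) follows the
    # length constraint of section 6.3.  All data come from the tables above.

    s = {"PLANT": 60, "JOB_TYPE": 26, "WORK_TASK": 13, "EMPLOYEE": 80, "EMP_DATA": 36}
    n = {"PLANT": 50, "JOB_TYPE": 1e3, "WORK_TASK": 200e3, "EMPLOYEE": 100e3, "EMP_DATA": 100e3}
    f = {("PLANT", "EMPLOYEE"): 0.0, ("EMPLOYEE", "EMP_DATA"): 6.86e3,
         ("PLANT", "JOB_TYPE"): 0.05e3, ("JOB_TYPE", "WORK_TASK"): 400e3,
         ("EMPLOYEE", "WORK_TASK"): 400e3}
    B = 2000

    def objective(clusters):
        # Sum f_ij over arcs cut by the clustering.
        block = {v: i for i, c in enumerate(clusters) for v in c}
        return sum(fij for (u, v), fij in f.items() if block[u] != block[v])

    def cluster_length(root, kept_children):
        # L(C[root]) = s_root + sum over children w kept in the same cluster
        # of (n_w / n_root) * L(C[w]).
        total = s[root]
        for w in kept_children.get(root, []):
            total += (n[w] / n[root]) * cluster_length(w, kept_children)
        return total

    # Clustering of Figure 6-4b and clustering of Figure 6-4c.
    k1 = [{"PLANT", "JOB_TYPE"}, {"WORK_TASK"}, {"EMPLOYEE", "EMP_DATA"}]
    k2 = [{"PLANT", "JOB_TYPE"}, {"EMPLOYEE", "WORK_TASK", "EMP_DATA"}]
    print(objective(k1), objective(k2))   # 800000.0 400000.0, as in the trace

    # Length of the cluster rooted at EMPLOYEE in the second clustering:
    # 80 + (200k/100k)*13 + (100k/100k)*36 = 142 <= B, so it fits in a block.
    print(cluster_length("EMPLOYEE", {"EMPLOYEE": ["WORK_TASK", "EMP_DATA"]}))

The printed objective values, 800k and 400k, agree with the trace, and the cluster rooted at EMPLOYEE has expected length 142, well within the block size.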

CHAPTER 7

SUMMARY AND FUTURE RESEARCH

7.1 Summary of Results

In this dissertation, we have investigated the problem of cutting a graph into clusters such that a set of constraints on the clusters is satisfied while an objective function is optimized. In Chapter 1, eight classes of digraphs and four classes of undirected graphs are considered and used to divide the Optimal Graph Clustering (OGC) problem into a set of thirty-five subproblems.

Chapter 2 describes a sequential approach which obtains optimal solutions for ODCtoto, ODCsoto and ODCsoso in O(n*Z^2), O(n*Z) and O(n*Z) time, respectively. The constraints considered are: a) every cluster must be an out-tree/out-star or one of its subclasses; b) every cluster must have cluster weight less than or equal to B. The objective function is to maximize the sum of all cluster values.

In Chapter 3, the maximal out-tree/out-star clustering concept and the ODCtoto-1/ODCsoto-1 algorithms are used to find solutions to ODCtog, ODCtoa, ODCtono, ODCsog, ODCsoa and ODCsono. The solutions to ODCtoto and ODCtono are then used to solve ODCnono, which in turn is used to solve ODCnog.

Only ODCtono, ODCsono and ODCnono have pseudopolynomial time solutions, i.e., O(n*Z^2*K), O(n*Z*K) and O(n*Z^2*K); the others are exponential with respect to the number of m-indegree vertices. Thus, if the given digraph is sparse, its ODC solution is still computable. The objective function and the set of constraints used are the same as those in Chapter 2.

Chapter 4 generalizes the sequential approach of Chapter 2 to solve ODCaa, ODCgg and ODCag. The same set of constraints and the same objective function as in Chapter 2 are used. The time complexity depends entirely on the number of neighbors of each vertex. An upper bound on the number of new triplets formed by each stage is:

    B^n[p,r] * n[p,r]! * B^n[r,j] * n[r,j]! * (n[r,j]+1).

In Chapter 5, the solutions of Chapters 2 to 4 are used to solve the rest of the OGC subproblems. Solution approaches for disconnected graphs and for graphs with parallel arcs or loops are also discussed. Next, the class of applicable objective functions and constraints is extended, for all solutions presented in the dissertation, to monotonic objective functions and constant time computable constraints, without changing the time complexities. A divide-and-conquer technique is then presented to lessen the high complexity of the ODCaa-1, ODCgg-1 and ODCag-1 algorithms. The technique reduces the number of feasible triplets formed at each stage to a number directly proportional to the number of feasible triplets that would be formed if each cover were clustered independently.

Chapter 6 illustrates the usefulness of the solutions by applying them to three information system design problems.

7.2 Future Research

An immediate direction for future research is to extend the applicable objective functions and constraints with respect to the divide-and-conquer technique discussed in section 5.3. This technique can substantially cut down the computational steps of all of the strongly NP-complete problems. Since only eight classes of digraphs and four classes of undirected graphs have been considered in this dissertation, a natural extension is to consider other classes of graphs, such as planar graphs. For example, a solution for planar graphs would be useful for solving schema or program structure layout problems. Instead of finding a set of (disjoint, nonvoid) clusters, another extension of the Optimal Graph Clustering problem is finding a set of (nondisjoint, nonvoid) covers that satisfies a set of constraints while an objective function is optimized. Solutions to this class of problems can be used to decompose database schemas, programs, and queries for distributed processing.

APPENDICES

APPENDIX A

THE BREAKUP OF THE FOURTEEN ODC PROBLEMS AS AN ODC PROBLEM

The writing of a dissertation is itself a design problem: the research results must be presented in a structured way for easy comprehension. A structured presentation typically consists of a sequence of chapters of roughly equal length covering the following topics:

1) Introduction to the problem.
2) Solutions to the problem.
3) Extensions to the solution.
4) Applications of the solution.
5) Conclusion and future direction.

This dissertation follows such a structure closely; however, the solutions to the ODC problems are exceptionally long. To maintain chapters of approximately equal length, it is necessary to break up the solutions into three separate chapters. This breakup can be formalized as an Optimal Digraph Clustering problem, and by using it as an example the purpose and general content of this dissertation can be illustrated.

Figure A-1 shows a digraph which represents the breakup problem. The digraph is an out-tree with 14 vertices, each representing an ODC problem solution.

Each arc of the digraph represents one of the following two cases:

1) The head of the arc uses a modified version of the tail in its solution. This applies to the arcs ODCtoto -> ODCsoto and ODCtoto -> ODCgg.
2) The head of the arc uses an exact version of the tail in its solution. This applies to the rest of the arcs.

The values associated with the arcs in the first case indicate how similar the two vertices on the arc are, while the values in the second case indicate how heavily the tail vertex is used in the head vertex. The rating runs from 1 to 5 and gives a sense of how closely the two vertices on an arc are related to each other. The weight appearing at the upper right corner of each vertex represents the number of pages planned for that solution. Suppose that each cluster's size should not exceed 20 pages, and let the objective be to find a set of hierarchically related problems such that their closeness is maximized. This division problem is thus an instance of ODCtoto.

A solution to this problem is given by the circles in Figure A-1, which indicate 7 clusters. Clusters 1 and 7 each have weight 20, so each will be presented in one chapter. The weights of the remaining five clusters do not individually add up to 20; therefore, these clusters will be presented together in one chapter.

Figure A-1  A digraph representing the ODC breakup problem.

As to the ordering of the three chapters, the solution of ODCtoto is fundamental to the rest of the fourteen problems; therefore, the solutions to the problems of cluster 1 in Figure A-1 are presented first (Chapter 2). The solutions to clusters 2 to 6 are presented in the next chapter (Chapter 3), because these problems directly use the results of cluster 1. The problems in cluster 7 are more difficult and are therefore left to the last of the three chapters (Chapter 4).

APPENDIX B

PROOF OF PROPOSITION 4.7

The number of steps taken by each type 2 stage is directly proportional to the number of triplets formed by step 1.2, which can be estimated by determining: 1) the number of feasible triplets in PKf[v_j, od(v_j)]; 2) the number of feasible triplets in PKf[v_r, i]; and 3) the number of triplets formed by PUT, conditional CONNECT and conditional COMBINE.

Let v_p be the parent of v_r. Let n[r,j] be the number of neighbors of v_r in V[v_j, od(v_j)], i.e., n[r,j] = |Lnb[v_r] ∩ V[v_j, od(v_j)]|, and let n[p,r] be the number of neighbors of v_p in G[v_r, i], i.e., n[p,r] = |Lnb[v_p] ∩ V[v_r, i]|.

The set of feasible triplets in PKf[v_j, od(v_j)] can be partitioned into disjoint blocks by the triplet-similar relation. Let PK1 be one of the blocks induced by this equivalence relation. Since there are n[r,j] neighbors, these neighbors can be distributed into no more than n[r,j] distinct clusters. Any given cluster can assume a weight that varies from 1 to B because of the nonneighbors. Thus, the number of clusterings in PK1 is no greater than B^n[r,j], or Z^n[r,j] if the vertex weight is greater than 1.

The upper bound on the number of blocks induced by the triplet-similar relation is the number of ways in which n[r,j] distinct objects can be distributed into i nondistinct cells, where i varies from 1 to n[r,j]. Lukes ('75) showed this number to be smaller than n[r,j]!. Thus, an upper bound on the number of feasible triplets in PKf[v_j, od(v_j)] is

    B^n[r,j] * n[r,j]!.

Similarly, an upper bound on the number of feasible clusterings in PKf[v_r, i] is

    B^n[p,r] * n[p,r]!.

An upper bound on the number of triplets formed by PUT is:

    B^n[p,r] * n[p,r]! * B^n[r,j] * n[r,j]!.

An upper bound on the number of triplets formed by conditional CONNECT and conditional COMBINE is:

    B^n[p,r] * n[p,r]! * (B-1) * B^(n[r,j]-1) * n[r,j]! * n[r,j].

An upper bound on the number of triplets formed by PUT, conditional CONNECT and conditional COMBINE together is therefore:

    B^n[p,r] * n[p,r]! * B^n[r,j] * n[r,j]! * (n[r,j]+1).
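As a numerical sanity check on these bounds (not part of the dissertation's proof), the following sketch evaluates the three quantities above for small hypothetical values of B, n[p,r] and n[r,j], confirming that the PUT bound plus the CONNECT/COMBINE bound never exceeds the combined bound, since (B-1)*B^(n-1) ≤ B^n.

    from math import factorial

    # Sketch: check the Appendix B triplet-count bounds for small, made-up
    # values of B, n[p,r] and n[r,j].
    def triplet_bounds(B, n_pr, n_rj):
        A = B**n_pr * factorial(n_pr)                    # clusterings in PKf[v_r, i]
        put = A * B**n_rj * factorial(n_rj)              # bound for PUT
        conn = A * (B - 1) * B**(n_rj - 1) * factorial(n_rj) * n_rj   # CONNECT/COMBINE
        combined = A * B**n_rj * factorial(n_rj) * (n_rj + 1)         # stated overall bound
        return put, conn, combined

    for B in (2, 5, 10):
        for n_pr in (1, 2, 3):
            for n_rj in (1, 2, 3):
                put, conn, combined = triplet_bounds(B, n_pr, n_rj)
                assert put + conn <= combined            # the combined bound holds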


BIBLIOGRAPHY

Aho, A.V. et al. The Design and Analysis of Computer Algorithms. Reading, Mass.: Addison-Wesley; 1974.

Aho, A.V. et al. Data Structures and Algorithms. Reading, Mass.: Addison-Wesley; 1983.

Belady, L.A. and Evangelisti, C.J. System Partitioning and Its Measure. The Journal of Systems and Software, Vol.2; 1981.

Bertolazzi, P. et al. Analysis of a Class of Graph Partitioning Problems. 18th Annual Allerton Conference on Communication, Control, and Computing; 1980.

Cardenas, A.F. Data Base Management Systems. Boston, Mass.: Allyn and Bacon, Inc.; 1979.

Chen, P.P. et al. Survey of State-of-the-Art Logical Database Design Tools. Graduate School of Management, Univ. of California, Los Angeles; 1981.

Chiang, Wan P. and Teorey, Toby J. A Method for Database Record Clustering. Proceedings of the Conference on Information Systems; December 13-15, 1982: p. 57-73.

Chiang, Wan P. et al. Data Modeling with PSL/PSA: The View Integration System (VIS) (Preliminary Draft). Ann Arbor, Mich.: ISDOS Inc.; 1983.

Comer, Douglas. The Ubiquitous B-Tree. Computing Surveys, Vol.11, No.2; June 1979.

Ferrari, Domenico. The Improvement of Program Behavior. Computer, Vol.9, No.11; 1976.

Garey, M.R. and Johnson, D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, Calif.: Freeman; 1979.

Harary, F. Graph Theory. Reading, Mass.: Addison-Wesley; 1969.

Horowitz, E. and Sahni, S. Fundamentals of Data Structures. Woodland Hills, Calif.: Computer Science Press, Inc.; 1976.

Jenny, C.J. Placing Files and Processes in Distributed Systems: A General Method Considering Resources with Limited Capacity. IBM Research Report RZ 1157 (41911). Yorktown Heights, N.Y., and Zurich, Switzerland: IBM Research Div.; July 1982.

Johnson, D.S. and Niemi, K.A. On Knapsacks, Partitions, and a New Dynamic Programming Technique for Trees. Mathematics of Operations Research, Vol.8; 1983.

Lukes, J.A. Combinatorial Solutions to Partitioning Problems. Ph.D. Thesis, Stanford Univ., Calif.; 1972.

Lukes, J.A. Efficient Algorithm for the Partitioning of Trees. IBM Journal of Res. Develop., Vol.18; 1974.

Lukes, J.A. Combinatorial Solution to the Partitioning of General Graphs. IBM Journal of Res. Develop., Vol.19; 1975.

March, Salvatore T. Techniques for Structuring Database Records. Computing Surveys, Vol.15, No.1; March 1983.

Perl, Yehoshua and Snir, Marc. Circuit Partitioning with Size and Connection Constraints. Networks, Vol.13; 1983.

Schkolnick, M. A Clustering Algorithm for Hierarchical Structures. ACM Trans. Database Syst., Vol.2, No.1; 1977.

Schrader, Rainer. Approximations to Clustering and Subgraph Problems on Trees. Discrete Applied Mathematics, Vol.6; 1983.

Teorey, T.J. and Fry, J.P. The Logical Record Access Approach to Database Design. ACM Computing Surveys, Vol.12, No.2; June 1980: p. 179-211.

Teorey, T.J. and Fry, J.P. Design of Database Structures. Englewood Cliffs, N.J.: Prentice-Hall; 1982.
