THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY AN OPTIMAL DIGRAPH CLUSTERING APPROACH TO DATABASE RECORD CLUSTERING Wan P. Chlng Toby J.Teorey CRL-TR-31-84 JULY 1984 Room 1079, East Engineering Building Ann Arbor, Michigan 48109 USA Tel: (313) 703-8000

t M e I -— n ttJ<(':

An Optimal Digraph Clustering Approach to Database Record Clustering by Wan P. Toby J. Chiang* Teorey** June 1984 * Bell Laboratories, 1600 Osgood Street, North Andover, MA 1845 ** Dept. of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109

ABSTRACT The database record clustering problem is represented in terms of a directed graph, and solved by determining how to cut the graph into a set of nonvoid and disjoint subgraphs such that each subgraph satifies a set of database system constraints while an objective function of physical block accesses is optimized. It is shown that if the logical database structure can be represented as an out-tree or out-necklace, this problem can be solved in pseudopolynomial time although the general problem is NPcomplete. Efficient algorithms are given to compute the optimal solution for the general cases where the logical database structure can be represented as an acyclic or general digraph. An example database design problem is solved to illustrate the approach. Categories and Subject Descriptors: G.2.2 [Discrete Mathematics]: Graph Theory, H.2.2 [Database Management]: Physical Design - access methods. General Terms: Performance, Theory. Additional Keywords and Phrases: Physical database design, record clustering.

TABLE OF CONTENTS SECTION 1. Introduction................ 1 2. A Sequential Clustering Approach to ODCtoto. 6 3. Maximal Clusterings Enumeration Approach to ODCtoto, ODCtoa and ODCtog........ 19 4. An Example Application to Database Design.. 26 APPENDIX A. Definitions of Graph Theory Terminology 31 APPENDIX B. Proof of Proposition 2.1....... 34 REFERENCES................ 36 ii

1. Introduction An important cost factor of a physical database design depends on the secondary storage search time that is required by the typical operational workload. The major component of that search time is the expected number of physical block accesses (PBAs). Thus, one important consideration in the initial physical database design, or a later reorganization of the database, is to map the logical database structure into a secondary storage in such a way so as to minimize PBAs [ChTe82]. One way to minimize the expected PBAs is to make use of the database usage pattern to group occurrences of those record types that are logically related and are frequently accessed together as clusters and store them into the same block so that intrablock PBAs can be maximized while inter-block PBAs can be minimized [Schk77, TeFr82, Marc83]. Many DBMSs (e.g., IDMS, IMS, DB2.) allow users to specify such a request using a statement like CLUSTERED CHILD VIA PARENT CHILD RELATIONSHIP NEAR PARENT. For example, in Figure la, X is the parent record type of child record type Y. The logically related occurrences of X and Y are physically linked together like Figure lb. Assume occurrences of X and Y are frequently accessed together. A user can specify CLUSTERED Y VIA X-Y NEAR X which will result in the placement of the logically related occurrences of X and Y by the DBMS as shown in Figure lc. This placement has the advantage that any time xi is read into memory, yil, Yi2 and Yi3 are also in memory. This reduces PBAs.. There are several restrictions to the way a database designer can specify the grouping of occurrences statement. They are: 1) each specification must be via a parent - child relationship; 2) each record type can only be clustered with one of its 1

2 parents; 3) a record type may not be CLUSTERED VIA a PARENT CHILD RELATIONSHIP of which it is both child and parent type; and 4) the specification cannot form a cycle. Let the logical database structure be represented by a digraph G=(V,E) where each vertex represents a record type and each arc a parent child relationship. The first restriction implies that on each arc there is one decision of value 0 or 1 to be made. A 1 value decision means the occurrences of the two record types which are related by the parent child relationship are going to be stored in the same cluster while a 0 value means otherwise. The second, third and fourth restrictions imply that each cluster(subgraph) resulting from the specification is an out-tree (see Appendix A for' the basic definitions of graph theory terminology). Schkolnick [Schk77] was the first to present a cost model and algorithm for the special case where the given logical database structure can be represented as an outtree. But his algorithm runs in exponential time with respect to the outdegree of each vertex in the outtree. Chiang & Teorey [ChTe82] extended Schkolnick's approach to deal with the case where the given logical database structure can be represented as a general digraph and obtained an exponential time algorithm also. This paper presents a simpler cost model and a more efficient algorithm to Schkolnick's and Chiang & Teorey's work. We also consider the case where the given logical database structure can be represented as an out-necklace or acyclic digraph in addition to out-tree and general digraph. The approach is to represent the logical database structure as a digraph and formalize the database record clustering problem as four Optimal Digraph Clustering (ODC)

3 a) x X-Y crr s etb) A rectangle represents a record type. A square represents a block. A circle represents a record occurrence. Figure 1 Placements of record occurrences using CLUSTERED CHILD VIA PARENT CHILD RELATIONSHIP NEAR PARENT.

4 subproblems [Chia84]: O DCtoto, ODCt, toa and ODCto g. The first subscript of ODC indicates each cluster (subgraph) resulted from a specification must be an out-tree while the second subscript means the given digraph is either an out-tree(t~), out-necklace(n~), acyclic diqraph(a) or general diqraph(g) respectively. Given a (connected) graph G=(V,E) with vertex set V and arc set E, a clustering K of G is defined as a cutting of G into a set of nonvoid and disjoint (connected) subgraphs by removing some arcs in G. Each of the (connected) subgraphs is called a cluster induced by the clustering K (see Figure 2b). Thus, a clustering K is an assignment function which assigns either 0 or 1 to each arc of E, i.e., K: E — > {0,1}. An assignment on arc (i,j) with value 0, denoted by K..=0, means the arc (i,j) is removed (or cut) while a 1 1J assignment on arc (i,j), denoted by K. ij.=l, means the arc (i,j) is not removed (or cut) (see Figure 2a). The database record clustering problem can be formalized as a maximization problem as follows: Given: 1) a digraph G=(V,E) which can be an out-tree, outnecklace, acyclic digraph or general digraph; 2) two constraints: one requires each cluster to be an out-tree; one limits the length L(C[vi]) of each cluster C[vi] rooted at vi to the block length B: L(C[v.) = s + > Tn j/ni * L(C[v.])T < B 1] Vj (Kij=l) i 3 3 where n. and n. are the expected number of occurrences of vertices vi and vj, si is the length of vertices vi and L(C[vj]) is the length of the complete subgraph in C[vi] rooted at vj; 3) an objective function which is defined as F(K) = > f.. Kj=l

6utjaqsn'l aEt Aq paonpuT siaisn3 aajql, (qz a9n6T. =9 S= = _Z IT '0= 'S =EIt qJ. 6uuT.jaisnsTO V (PZ ajn6bi,03 / (q //^s l / t/t~r I,!11 \_ _ A J J

6 where fij is the frequency of accesses between vertices vi and v.. Find: A feasible clustering K such that F(K) is maximized. All ODC subproblems discussed in this paper are NPcomplete problems [Chia84]. But this does not mean efficient and practical algorithms cannot be found to solve them. Because all of them are number problems, two pseudopolynomial algorithms are presented in this paper for ODCtoto and ODCtOno. Two exponential algorithms for ODCtOa ODCtog are also given which are useful when the given acyclic or general digraph is sparse. An algorithm for a problem is pseudopolynomial if it solves any instance I of the problem in time bounded by a polynomial function of the encoding length of I and the largest integer appearing in I. A polynomial algorithm runs in time bounded by a polynomial function of the the encoding length of I alone. One advantage of a pseudopolynomial algorithm for a problem is that it is practical (efficient) in many real life applications. This is because the largest integer in the problem is often bounded by a polynomial function of the encoding length of the problem and thus the pseudopolynomial algorithm becomes a polynomial algorithm. A second advantage is that even if the largest integer appearing in the problem is not bounded by a polynomial function of the encoding length of the problem, it can often be turned into fully polynomial time approximate schemes [GaJo79]. 2. A Sequential Clustering Approach to ODCtotO Given an out-tree G, a sequential clustering approach sequences the clustering process into a finite sequentially ordered stages. At each stage, a larger set of subgraphs of G, which is obtained from the immediate previous stage by adding a vertex or an arc, is considered. Because the set. of sugraphs of G considered at one stage contains only one more

7 vertex or arc than its immediate previous stage, every feasible clustering of one stage can be obtained from the immediate previous stage by a simple operation (i.e., an operation that takes constant time to execute). The main thrust is to use the integer cluster length constraint and objective function to define an elimination rule such that at most a constant number of feasible clusterings (B) remains for each subgraph at each stage. Thus, the problem can be solved in pseudopolynomial time. To describe how the clustering process is sequenced and to identify the simple operations needed in the sequential clustering approach, it is necessary to first define some terms and notations. Terms and Notations: Let the given out-tree G=(V,E) have n vertices and m arcs and let v root be the root of G. For each v rV, G[v r] is any subgraph of G rooted at vr. od(v r) is the outdegree of v where od(v )=0 if v is a leaf vertex. For each vrCV and i, 0<isod(vr), G[v r,i] is the subgraph of G induced by v r, by its first i children (from left to right), and by all their descendants (see Figure 3). Note that G[Vrootd(root )]=G and G[vr, is the subgraph of G consisting of only the vertex vr. A subgraph G[vr,i] of G rooted at vr is a complete subaraph if i = od(v ) and is a partial subqraph if i < od(v ). Observe that G[v r,i+l] is a larger subgraph of G than G[v,i]. It not only contains the partial subgraph G[vr,i] rooted at vr but also the complete subgraph rooted at v.j, which is the i+lth child of vr (denoted by G[vj,od(vj)]), and the arc (vr,vj) connecting the two subgraphs G[v,i] and G[vj,od(vj)]. The clustering process is sequenced in stages by visiting the vertices of G in postorder traversal sequence [Aho74]. At each vertex vr, G[v r,0] is first considered and

8 3f/\5 3 G 120 ' G(2,2] G(2,1l 1 = ~V. 1 3 (s=l) & (ni=l) Figure 3 An out-tree and its G[vr i]'s.

9 then G[v r,1], G[v r,2],..., G[vr,od(v r)]. Since there are n vertices and m arcs, there are m+n stages. Figure 4 shows all eleven stages of the clustering process of the out-tree that has six vertices and five arcs given in Figure 3. Note that each vertex vr is involved in od(v r)+l stages. In Figure 4, because the vertices are visited in postorder traversal sequence, at each stage of the clustering process, the graph considered is a set of subgraphs of G (not necessarily a single connected subgraph). For example, at stage 6, there are two subgraphs: (V = {2,4}, E = {(2,4)}) and (V = {5,6}, E = {(5,6)}). Except for at most one, each subgraph in the set is a complete subgraph of G. For example, at stage 6, subgraph (V = {5,6}, E = {(5,6)}) is a complete subgraph rooted at 5 while (V = {2,4}, E = {(2,4)}) is a partial subgraph rooted at 2. At stage 7, there is only one subgraph Yooted at 2 and it is a complete subgraph. There are two patterns of changes in the set of subgraphs considered from stage to stage: 1) If all subgraphs in stage k are complete subgraphs, then in stage k+l the set of subgraphs considered will contain a new vertex as a separate subgraph. This stage k+l is called a new vertex stage (e.g., stages 1,2,3,5,8,and 9 in Figure 4). 2) If one of the subgraphs in stage k is a partial subgraph, then in stage k+l the set of subgraphs considered will contain a new arc which connects the root of this partial subgraph with the root of a complete subgraph existing in the set of subgraphs considered in stage k. This stage k+l is called a new arc staqe (e.g., stages 4,6,7,10,and 11 in Figure 4). Since the set of subgraphs of G considered at each new stage only contains one more new vertex or arc than the immediate previous stage, there is no need to form all possible clusterings from scratch for each new stage. This means

10 stag 9 ' 0 '0 2 00 3Q( 0 00J 6 Figure 4 Tile eleven stages of the clustering process of the out-tree in Figure 3.

11 that if the set of all possible clusterings PKk of stage k is known, the set of all possible clusterings PKk+l of the next stage k+l can be obtained by some simple operations on PKk. The set of all possible clusterings PKk formed at stage k can be represented by s subsets of clusterings if there are s disjoint subgraphs considered at that stage. Each subset of clusterings represents the set of all possible clusterings for each subgraph. Let each clustering be represented by the set of clusters it induces. For example, in Figure 4, {(V={4},E=0)},{(V={5,6},E=0), (V={5,6),E={(5,6)})} is all possible clusterings of the two subgraphs in stage 4. To obtain all possible clusterings of stage 5, a new vertex stage, an operation which "inserts" vertex 2 as a separate cluster into the set of possible clusterings is needed. Thus, the resulting clusterings of stage 5 is {(V={21,E=0)},{(V={41,E=0)},{(V={5,6}, E={(5,6)}),(V={5,6},E=0)}. Consider stage 6, a new arc stage, since vertices 2 and 4 are connected by arc (2,4), there is a possibility that (V={2},E=0) and (V={4},E=0) could be "connected" together to form one cluster after clustering (i.e, assignment K24=1) or not be connected together but simply "put" next to each other (i.e., assignment K24=0). Thus, the new set of possible clusterings in stage 6 is {(V={2,4},E={(2,4)}),(V={2,4},E=0)}, {(V={5,6},E={(5,6)}),(V={5,6},E=0)}. It is now obvious that the necessary operations needed in the sequential clustering approach are an insertion operation in the new vertex stage and connect/put operations in the new arc stage. To formalize these operations, let {G[v1,od(v1)], G[v2,od(v2)],..., G[v.,od(vj)],..., G[vr,i]} be the set of mutually disjoint subgraphs rooted at vl,v2,...,v.,...,vr that are being considered in stage k, with G[vj,od(vj)] as the i+lth complete child subgraph of J J

12 vr. Let PKk be the set of all possible clusterings formed at stage k. PKk includes PK(G[vl,od(v1)], PK(G[v2,od(v2)]),..., PK(G[vj,od(v )],..., and PK(G[v r,i]) because there are r disjoint subgraphs considered at stage k. The operations needed in the two types of stages are: New vertex staqe: At stage k+l, the set of subgraphs considered is {G[vl,od(v1)], G[v2,od(v2)],..., G[v,od(v.)], ***. G[v,i], G[v p,0]} where G[v p,0] represents the new vertex vp that is introduced. There is only one way to obtain a clustering of a single vertex digraph G[v p,0], which is a cluster containing only the vertex v. This clustering is denoted by PK[v,0]. Thus, the set of all possible clusterings for stage k+l can be obtained by taking the union of PK[v,0] and PKk, i.e., PKk+l = PKk u PK[v p,0 where PK[v,0] = {K[v,0] is formed by INSERTI INSERT operation: K[v,0] = {(V={v },E=0)}. New arc stage: At stage k+l, the set of subgraphs considered is {G[vl,od(v1)], G[v2,od(v2)],..., G[vj lod(vj-_) ] G[vj+1, od(vj+)],... G[Vr-1,od(vr_ )], G[vr,i+l]} where G[vr,i+l] is a subgraph formed by connecting G[v.,od(vj)] and G[v r,i] that exist in stage k with an arc (v,vj). Let the set of possible clusterings of G[vj,od(vj)] and G[v r,i] be denoted by PK[vj,od(v.)] and PK[v r,i] respectively. Thus, PKk+1 = PKk - PK[vri] - PK[vj,od(v.)] u PK[vr,i+l]. Let K[v r,i] and K[vj,od(v.)] be any two clusterings of PK[v,i] and PK[vj,od(vj)] respectively. Let K[v,i] = {ClC2,...,C q}, and K[vj,od(vj)] = {C ',C ',...,C '}, and 2 ~ q J Jt1' 721@*C}

13 assume C =(V1,E ) and C,'=(V 1,El ) are clusters that contain the root of G[v r,i] and G[vj,od(vj)] respectively. The set of all possible clusterings for PK[v r,i+l] can be obtained from all possible combinations of K[v,i] in PK[v r,i] and K[v.,od(v.)] in PK[v.,od(vj)] for assignment K..j= 0 (denoted by PK~[v r,i+l]) and 1 (denoted by PKl[vr,i+l]) on arc (vrvj), i.e., PK[v r,i+l] = PK~[vr,i+l] u PKEvr,i+l]; PK~[v r,i+l] = {K~[vr,i+l] is formed using PUT for every pair of K[vr,i] and K[vj,od(vj) ]}; PK[vr,i+l] = {KLvr,i+l] is formed using CONNECT for every pair of K[vr,i] and K[vj,od(vj)]}, where PUT operation: K~[vr,i+l] = K[vr,i] u K[vj,od(v.)]; CONNECT operation: K1[vr'i+l] = {C2,C3,...,Cp,C2 IC3,...,C u {C1 } where C = (V1 u V, E1 u E1' u {(vr,v )}). To obtain all feasible clusterinqs (i.e., clusterings that satisfy all constraints) instead of just all possible clusterings, the out-tree cluster and cluster length constraints need to be enforced at each stage. This can be easily done because: 1) Any cluster formed in the sequential clustering approach is a subgraph of an out-tree and is. also an out-tree itself. Thus, the out-tree cluster constraint is automatically enforced in the process. 2) For each pair of clusterings from PK[v r,i] and PK[vj,od(i+l)], only the CONNECT operation forms a new cluster C," which has a new higher cluster length value. To make sure every cluster formed is less than B at each new arc stage, the CONNECT operation can be

14 made conditional as follows: IF L(C1) + [nl'/n1. L(C1')] < B THEN K1Cvr i+l] (C C...C Cp2' C'.C ' u {C"}. K~[vr'i+l] = {C2'c3'' ''c2 ' 3,', q 1 Thus, the set of all feasible clusterings formed at stage k+1 (denoted PKf)k+ can be easily obtained by applying either INSERT or PUT/conditional CONNECT operations on PKk. For computational purpose, let each feasible clustering K of a subgraph be represented by a triplet (F(K),L(K),CF(K)) that contains the following three components: 1) A value component F(K) which stores the sum of the cluster values of all the clusters induced by the clustering. 2) A lenqth component L(K) which stores the cluster length of the cluster that contains the root of the subgraph. 3) A configuration component CF(K) which stores the clusters induced by this clustering of the subgraph and each cluster is represented by the vertices in it enclosed by "< >". At each stage of the subgraphs considered at feasible triplets formed. triplets represents all considered at that stage. clustering process, if there are s a stage, there are s sets of Each of these s sets of feasible feasible clusterings of a subgraph Notation: Let PKf and PKk+l now represent all feasible triplets produced at stages k and k+l, and let PKf[v,i+l], PK [vri], and PKf[v.,od(vj)] now represent all feasible triplets of the subgraph G[v,i+l], G[v,i] and G[vj,od(vj)] respectively. Let K[vr,i+l], K[vr,iJ and K[vj,od(vj)] be a feasible triplet in PKV r,i+l], i 3~~~~~~~~~~~~~~~~

15 PK [v,i] and PK [v.,od(vj)] respectively. Let CF(K[v r,i]) = {C1,C2,...,C } and CF(K[vj,od(v) ]) = {C' C2' a. Cpt } The three operations can now be redefined in terms of triplet formation: INSERT operation: form K[v,0] with F(K[v p,0] = 0, L(K[vp,0] = sp, and CF(K[v,0] = {<v >}. PUT operation: form K~[v r,i+l] with F(K[vr,i+l]) = F(K[vr,i]) + F(K[v_,od(v ])), rr r J L(K[v,i+l]) = L(K[vr,i]), and CF(K[vr,i+l]) = CF(K[v,i]) u CF(K[v,od(vj) ]). Conditional CONNECT operation: IF L(C1) + L(Ci ) < B THEN form Ki[vr,i+l] with F(K[vr,i+l]) = F(K[vr,i]) + F(K[vj,od(v) ]) + f r' L(K[vr,i+l]) L(C1) + [n1 /n1 L(C1 ) ] CF(K[Vr,i+l]) = {C2,C3,.0.,CC2s C3 'O**-ICq'I {C1 U C1' }. For example, let {(0,1,<4>)},{(0,1,<5><6>),(2,2,<5,6>)} be the set of all feasible triplets formed at stage 4 in Figure 4. The feasible triplet sets formed at stage 5 using the INSERT operation are {(0,1,<2>)},{ (0,1,<4>) }, {(0,1,<5><6>),(2,2,<5,6>)}. The

16 feasible triplet sets formed at stages 6 and 7 using the PUT and conditional CONNECT are: (Assume B=3 and Vv (s =1 and P P np=l).) stage 6: {(0,1,<2><4>), (3,2,<2,4>)}, {(0,1,<5><6>),(2,2,<5,6>)}; stage 7: {(0,1,<2><4><5><6>),(2,1,<2><4><5,6>), (3,2,<2,4><5><6>),(5,2,<2,4><5,6>),(7,2,<2,5><4><6>), (9,3,<2,5,6><4>),(10,3,<2,4,5><6>)}. In the above example, every new vertex stage forms one additional new triplet using INSERT while every new arc stage can double the number of triplets formed in the immediate previous stage using PUT/conditional CONNECT. Thus, unless reduction of the number of feasible triplets can be done at the end of each new arc stage, the total computation time can still be exponential. Fortunately, some triplets formed at each new arc stage can be eliminated because they can never be extended to the optimal clustering of the out-tree. The idea is to define a dominance relation on the set of newly formed feasible triplets at each new arc stage. Let K1 and K2 be any two feasible triplets formed at the same new arc stage. K1 dominates K2 if F(K1) > F(K2) and L(K1) < L(K2). Proposition 2.1: The dominated triplet K2 cannot be extended to the optimal clustering of the out-tree. Proof: See Appendix B. Q.E.D. Proposition 2.1 implies that at the end of each new arc stage, all dominated triplets can be eliminated. The elimination can be done in two steps by decomposing L(K1) <

17 L(K2) into L(K1) -L(K2) and L(K1) < L(K2): step 1) L(K1) = L(K2) and F(K1) > F(K2): For triplets that have the same length, retain the one with the highest value and eliminate the rest. Thus, only one triplet remains for each integer length value. step 2) L(K1) < L(K2) and F(K1) > F(K2): For every remaining triplet K2, if it has higher length than another triplet K1 but have lower or equal value, then eliminate K2. Assume that every vertex has vertex length 1. Since the length of a triplet which represents a feasible clustering of a subgraph is an integer less than or equals to B, the maximum number of triplets remaining for a subgraph after each stage is B. If the vertex has length higher than 1, and suppose Z is the maximum number of vertices that can be put in a cluster with cluster length less than or equals to B, then the maximum number of triplets remaining for a subgraph after each stage is Z (<B). A complete algorithm specified: called ODCt0t 0 -1can now be ALGORITHM: ODC o 1 INPUT: an out-tree G; OUTPUT: an optimal out-tree clustering of G and its value; PROCESS: I) FOR each vr visited in postorder traversal sequence DO: 1) FOR j=0 to od(vr ) DO: 1.1) IF j=0 THEN form a new PKf1 from PKk k+l

18 using INSERT; ELSE 1.2) form a new PKk+l from PKf using PUT and conditional CONNECT; eliminate all dominated triplets; END; II) find the optimal value clustering in PK +n and RETURN the optimal value F(K) and the optimal clustering configuration CF(K) END. Proposition 2.2: The running time of algorithm ODCtoto-1 is O(n*Z ) (or O(nB 2) if every vertex has vertex length 1). Proof: Every new vertex stage takes constant time while every new arc stage takes 2*Z time because of the PUT and 2 conditional CONNECT. Thus, step I) takes O(n*Z2) time and Step II) takes O(Z) time. The time complexity of ODCtoto is therefore O(n-Z2). Q.E.D. Example: Applying ODCtoto-l to the example given in Figure 2-2, the following 24 triplets are formed during the clustering process: formed formed elimi. elim. by formed by INSERT by P/C by P/C Dominate rel. triplets (stage #) (stage #) (stage #) (stage #) F(K),L(K),CF(K) (0,1,<4>) 1 6 (0,1,<6>) 2 4 (0,1,<5>) 3 4 (0,1,<5><6>) 4 7 (2,2,<5,6>) 4 7 (0,1,<2>) 5 6 (0,1,<2><4>) 6 7 (3,2,<2,4>) 6 7 (0,1,<2><4><5><6>) 7 7 (2,1,<2><4><5,6>) 7 10 (7,2,<2,5><4><6>) 7 10

19 (9,3,<2,5,6><4>) 7 7 (3,2,<2,4><5><6>) 7 7 (5,2,<2,4><5,6>) 7 7 (10,3,<2,4,5><6>) 7 10 (0,1,<3>) 8 11 (0,1,<1>) 9 10 (2,1,<1><2><4><5,6>) 10 10 (7,1,<1><2,5><4><6>) 10 10 (10,1,<1><2,4,5><6>) 10 11 (5,2,<1,2><4><5,6>) 10 10 (10,3,<1,2,5><4><6>) 10 10 (10,1,<1><2,4,5><6><3>) 11 (15,2,<1,3><2,4,5><6>) 11 The optimal value is 15 and the optimal clustering configuration is <1,3><2,4,5><6>. Further improvements on ODCt.to-l can be made, but since the time complexity will not change, we will not discuss them here. 3. Maximal Clusterings Enumeration Approach to ODCto o, ODCa and ODCtg Let K. and K. be any two clusterings of a digraph G. K. < K. if every cluster induced by K. is a subgraDh of some 3 1 cluster induced by Kj. For example, in Figure 5, each K3, K4, K5, K8, K9g, K10 and K12 < K1. Note that if K. < K., then K. can be obtained from K. by removing some arcs in K.. The < relation is reflexive, transitive and anti-symmetric. It induces a partial ordering on the set of all possible clusterings PK of G. Let [PK,<] denote the partly ordered set (poset) that consists of a set PK and a partial ordering relation < on PK. For any two elements Ki and K. in PK, an elenment Kk in 1 J k PK is called the greatest lower bound, or meet of K. and K., denoted by K.-K. if Kk<Ki' Kk<Kj, and V K1 in PK such that K_<Ki and Ki<Kj imply Ki<Kk. Dually, an element Km e PK is called the least upper bound, or loin of K. and K., denoted by K.+K., if 1 J 1 j

'qdeJ6Tp DTT3ADe ue jo s6uT lasnIT aeae-;no alqTssod ITV 9 ajn6BT ilx.' I oP 1 11 ' X ot* X 1 III fof I I wines Is r X x _ I, 1 ^ 9 1 *Gup'~~d Li "I%, ~ d 1% I op % fl~ 1% A aft "A / ^ A of i^;( % ^ 'W " 's /^^X^~~~~~~ IL o f D i Goop ^^ % -,N J~ %,,' w _\./ * F,% aawA Tn

21 K.<K and K.<K and 1 m j m V K in PK such that K.<K and K.<K imply K <K 1 ii j3 1 m1 A lattice is a partly ordered set which has a least upper bound and a greatest lower bound for every pair of elememts. Let x and y be any of the five digraph classes. Define the set of all x clusterings of a specific y as the subset of all possible clusterings of y which satisfies the constraint that each of these clusterings induces cluster of class x. Proposition 3.1: The set of all out-tree clusterinas of an out-tree together with the < relation form a lattice. Proof: This is immediate from the fact that every subgraph of an out-tree is an out-tree. Thus, every possible clustering is an out-tree clustering and the set forms a lattice. Q.E.D. But the set of out-tree clusterings of a general digraph, acyclic digraph or out-necklace is not a lattice because it does not always have a least upper bound for each pair of elements. For example, in Figure 5, K1 *K2=K4 and K3+K10=K1, but K3+K7 is undefined. A poset [PK,<] is a join-semilattice (meet-semilattice) if for any two elements Ki and K. in the poset PK, K. + K. (K. * K.) exists. Proposition 3.2: The set of all out-tree clusterings of a general digraph/acyclic digraph/out-necklace together with the < relation form a meet-semilattice. Proof: For each cluster C. induced by K. and C. by K. and C = CinC3, form three clusters (where n represents graph intersection): 1) C - C. 2) C. - C.

22 3) C. The set of all clusters so formed for each pair of clusters from K. and K., denoted by Kk, is also a clustering of PK because each cluster is an out-tree. Obviously, Kk < Ki and Kk < Kj. Now, assume that there is a K1 in PK such that K:<Ki, K1<K. and K <K. Let C = C. n C. ~ 0 be an induced cluster 3 k l' 1 j in Kk. By Kk<Kl, C must also be a subgraph of some C' induced by K1. Assume C'-C~0, then C' cannot be a subgraph of both Ki and Kj. Thus, C' is not a subgraph of both Ki and Kj. This contradicts the assumption that K1<Ki and Ki<Kj. Therefore, either C=C' or CcC' is true and K1 < Kk. This means that VK1 in PK such that K <K. and K <K. 11 1 3 imply Ki<Kk. Thus, the set of all out-tree clusterings of a general digraph/acyclic digraph/out-necklace together with < forms a meet-semilattice. Q.E.D. An element K.GPK is maximal when there is no other element K. in PK where K.<K. (a minimal element is similarly defined). For a lattice, there is always one minimal element and one maximal element. In a meet-semilattice, we note that there is always one minimal element (e.g., K12 in Figure 5) and many maximal elements (e.g., K1 and K2 in Figure 5). Proposition 3.3: The poset formed by the set of out-tree clusterings of an out-tree and the < relation has only one maximal out-tree clusterinq. Proof: Immediate from the set of all out-tree clusterings of an out-tree which together with the < relation form a lattice (from Proposition 3.1). Q.E.D. Proposition 3.4: The poset formed by the set of out-tree clusterings of an out-necklace and the < relation has K maximal out-tree clusterings if the number of arcs in the only cycle in the out-necklace is K.

23 Proof: Each vertex in an out-necklace has indegree 1, while each vertex in an out-tree also has indegree 1 except for the root which has indegree 0. Thus, a maximal out-tree clustering can be obtained by removing a minimum of one arc from an out-necklace. But not all sets of; clusterings generated by removing one arc from the out-necklace are outtree clusterings. It is only those that are formed by removing an arc in the cycle of the out-necklace are outtree clusterings. Thus, if there are K arcs in the outnecklace, there are K maximal out-tree clusterings. Q.E.D. Proposition 3.5: The poset formed by the set of out-tree clusterings of an acyclic digraph and the < relation has L = T a. maximal out-tree clusterings i=l 1 if the acyclic digraph has p m-indegree vertices where each has indegree ai. Proof: Let the indegree of the set of m-indegree vertices in the given acyclic digraph be al, a2,..., ap. To form a maximal out-tree clustering from the acyclic digraph requires the removal of the "redundant" arcs that exist in each of the m-indegree vertices so that every m-indegree vertex can be reduced to have indegree 1. There are (aI ), (a2 ) *.. * (ap ) a * a a = al * a *... * a combinations. p Thus, there are T ai maximal out-tree clusterings. Q.E.D. i=l1 Proposition 3.6: The poset formed by the set of outnecklace clusterings of a general digraph and the < relation has at most P L = TT a. maximal out-necklace clusterings i=l1

24 if the general digraph has p m-indegree vertices where each has indegree a. Proof: Similar to the proof given for Proposition 3.5. Q.E.D. The ODCtogr ODCtoa and ODCtono problems are to find a set of bounded out-tree clusters from a general digraph, acyclic digraph and out-necklace respectively such that the total value of the set is optimized. To solve ODCtOa and ODCt0on0 the maximal out-tree clustering concept and the ODCto o-l discussed in section 2 are used. A maximal out-tree clustering induces a set of largest out-tree clusters that can be obtained from an acyclic digraph or out-necklace by removing minimal number of arcs from the m-indegree vertices. Since any possible out-tree clustering, including the optimal one, < one of the maximal out-tree clustering (or can be obtained by additional arc removals of the induced clusters), the optimal solution can thus be found in one of the maximal out-tree clusterings. That can be done by using ODCtotO-1. The optimal out-tree clustering of an acyclic digraph or out-necklace is therefore the maximumm value one among the set of optimal out-tree clusterings obtained from all maximal out-tree clusterings. An algorithm for these two problems can be formulated as follows: ALGORITHM: ODCto a-1/ODCt n o 1 INPUT: an acyclic digraph/out-necklace G; OUTPUT: an optimal out-tree clustering and its value; PROCESS: 1) FOR each maximal out-tree clustering of G which induces-one or more out-tree clusters DO: 1.1) apply algorithm ODCt.to-l to obtain an optimal solution for each out-tree cluster;

25 1.2) IF the sum of the optimal out-tree clustering values of these out-tree clusters is the largest value so far, keep the solution; END; 2) RETURN the last solution kept; END. An algorithm for ODCtOg, named ODC tog1, is similar to algorithm ODCtoa-1 except that instead of applying ODCtoto-l in step 1.1, ODCtono-l is used and all occurrences of maximal out-tree clusterings are changed to maximal outnecklace clusterings. Proposition 3.7: The ODCtono-lF ODCtoa-l and ODCg1 algorithms find 1) the optimal out-tree clustering of an out-necklace in O(n-Z *K) time; 2) the optimal out-tree clustering of an acyclic digraph 2 in O(n.Z *L) time; and 3) the optimal out-tree clustering of a general digraph in O(n-Z2 K-L) time. Proof: 1) In step 1 of ODCton -l, a given maximal out-tree clustering induces only one out-tree cluster. Thus, steps 1.1 and 1.2 take O(n-Z ) by Proposition 2.2. Since there are K maximal out-tree clusterings to search for (see Proposition 3.4) in the step, the time complexity is O(n*Z2 K). 2) In step 1 of ODCoa -1, let a given maximal out-tree clustering contain w out-trees with nl,n2,...,n vertices respectively in them. Steps 1.1 and 1.2 take O(nl.Z)+O(n2Z2 )+...+O(nw Z2) = O(n.Z2) time to i z v~~~~

26 find an optimal solution. Since there are L maximal out-tree clusterings to search for (see Proposition 2 3.5) in step 1, the time complexity is O(n-Z2 L). 3) In step 1 of ODCto -1, let a given maximal necklace clustering contain w necklaces with n,n2,...,nw vertices respectively in them. Steps 1.1 and 1.2 take O(nl-Z2 K1)+O(n2*Z2 K2)+...+O(nw Z Kw) = O(n-Z 2K) where K is the maximum of all Ki, l<i<w. Since there are L maximal out-necklace clusterings to search for, 2 - the time complexity is O(n*Z *K-L). Q.E.D. Example: The general digraph shown in Figure 6a contains the two maximal out-necklace clusterings that are in Figures 6b and c. Applying ODCto -1 to the general digraph shown in Figure 6a, we find the optimal solution in Figure 6b which has a total value of 16 and 2 clusters: Cluster 1 contains vertices 1 and 3. Cluster 2 contains vertices 2, 4, 5 and 6. 4. An Example Application to Database Design Let the logical database structure given by Teorey and Fry [TeFr803 be represented by an acyclic digraph shown in Figure 7a. The record length (s.) and the expected number of occurrences (ni) of each record type vi is shown in the following table: record type record lenqth expected no. of occurences PLANT 60 50 JOB TYPE 26 1K WORK TASK 13 200K EMPLOYEE 80 10OK EMP DATA 36 10OK The usage information which is characterized by the

27 a) b) c) 'M3 ( B = 3 Vvi (s=l) a (ni=l) Figure 6 A general digraph and its two maximal out-necklace clusterings.

28 a) P LANT JOB TYPE EMPLOYEE WORK TASK EMP DATA b) PLANT JOB TYPE EMPLOYEE WORK TASK EMP DATA _-___c) PLANT JOB TYPE EMPLOYEE WORK TASK EMP DATA Figure 7 A database schema and its two maximal out-tree clusterings.

29 frequency and the number of logical record accesses (LRAs) to the database schema by each application is defined in the following table: application frequency LRAs T1 T2 T3 T4 R1 R2 R3 R4 R5 100/day 100/day 200k/day 120k/year Quarterly Annually Monthly Monthly Daily 1 PLANT+1 EMPLOYEE+l EMP_DATA 1 PLANT+1 EMPLOYEE+1 EMP_DATA 1 EMPLOYEE+ 1 WORK_TASK+ 1 JOB_TYPE 1 EMPLOYEE 100k EMPLOYEE +100k EMP_DATA 50 PLANT + 100k EMPLOYEE 50 PLANT 50 PLANT + 50-20 JOB_TYPE 50 PLANT + 100k EMPLOYEE + 200k WORK_TASK + 200k JOB_TYPE 50 PLANT+100k EMPLOYEE+100k EMPDATA 1 EMPLOYEE R6 R7 Monthly 1000/day Let fij be the transition frequency between vi and v.. A table showing f.~ can be constructed using the frequency and LRA of each application as follows: (We assume: 1 year=240 working days, 1 quarter=60 working days, 1 month= 20 working days.) (v.,v.-) f.l _ LI from application PLANT, EMPLOYEE EMPLOYEE, EMP_DATA PLANT, JOB_TYPE JOB_TYPE, WORK_TASK EMPLOYEE, WORK_TASK 0 100+l00+lOOk/60 +100k/20=6.86k lk/20=0.05k 200k+200k=400k 200k+200k=400k T1 T2 R R6 R4 T3' R5 T3' R5 Let the block length B be 2,000 bytes. Since the given database schema can be represented as an acyclic digraph and the desired clusters are out-tree, the ODCtOa presented in Section 3 is applied here. The following traces the results produced by each major steps: step action performed

30 1 Obtain the first maximal out-tree clustering (see Figure 7b). 1.1 Apply ODCtoto-1 to obtain 3 clusters: cluster 1 contains: PLANT, JOB_TYPE cluster 2 contains: WORK_TASK cluster 3 contains: EMPLOYEE, EMPDATA with value F(K)= 0.05k + 6.86k = 6.91k. 1.2 Keep the solution (because this is the first result). 1 Obtain the second maximal out-tree clustering (see Figure 7c). 1.1 Apply ODCtoto-1 to obtain 2 clusters: cluster 1 contains: PLANT, JOB_TYPE cluster 2 contains: EMPLOYEE, WORK TASK, EMPDATA with value F(K)= 0.05k + 400k + 6.86k = 406.91k. 1.2 Keep the solution (because 406.91k is greater than 6.91k). 2 RETURN the kept solution with an F(K)=406.91k. Note: Without clustering, the number of inter-block PBAs is 0 + 6.86k + 0.05k + 400k + 400k = 806.91k, while with clustering, it is 806.91k - 406 = 400k. This is a savings of (806.91-400)/806.91 = 50.4%. In conclusion, we have shown that efficient record clustering algorithms can be developed to optimize performance for database models represented as networks, hierarchies, or as separate relations. Globally optimal solutions are obtainable for typical database structures found in practice.

Appendix A. Definitions of Graph Theory Terminology A graph G=(V,E) is a collection of vertices V and a collection of arcs E. If the arcs in E have directions, the graph is called a directed qraph or digraph. If the arcs in E have no directions, the graph is called an undirected graph. An arc is denoted by the two vertices it connects. In a digraph, every arc has a direction from the first vertex to the second. The first vertex is called the tail and the second vertex the head of the arc. In an undirected graph, an arc has no direction and the order of the two vertices in an arc has no meaning. A path is a sequence of arcs (vO,v1),(vl,v2),...,(vn 1,vn) such that consecutive arcs in the sequence have a common vertex and each arc appears only once in the arc sequence. A cycle is a path for which the first vertex v0 and the last vertex vn in the sequence are the same. A graph is connected if every pair of vertices are joined by a path. A subgraph of G is a graph G'=(V',E') such that V' is a subset of V and E' consists of arcs (v,vj) in E such that both v and v. are in V'. For a digraph, the outdeqgree od(v) of a vertex v is the number of arcs which has v as the head; indeqree id(v) of v is the number of arcs which has v as the tail. A vertex which has indegree/outdegree greater than one is caled a mindegree/m-outdeqree vertex. For an undirected graph, the degree of a vertex v is the number of arcs incident with v. In this paper, ony four classes of graphs are being considered. They are: (see Figure 8) general digraph (G) -- a digraph that may have mindegree vertices, m-outdegree vertices and cycles in it; acyclic digraDh (Ga) -- a digraph that may have ma 31

32 general digranh acyclic digraph out-nock lace out-t. re Figure 8 Four classes of digraphs.

33 indegree and m-outdegree vertices but no cycles; out-neckace (G n) -- a digraph that may have moutdegree vertices and one cycle in it, but no mindegree vertices; out-tree (Gto) -- a digraph that may have m-outdegree vertices in it but no m-indegree vertices and no cycles;

34 Appendix B. Proof of Proposition 2.1 A clustering K is an assignment function which assigns either 0 or 1 to each arc ei of E. Let a triplet formed in a new arc stage j be represented by a sequence of pairs: Kj = [ela1],[e2,a2],...,[ej,aj] where the first entry of each pair represents the arc considered in the new arc stage and the second entry represents the assignment on that arc. Let K and K be two feasible triplets formed in the rth and pth stages respectively. We define a derivation of a feasible triplet Kp from Kr as the sequence [er+lar+l], [er+2 ar+2],..., [e p, ap] Let K ' and Kk1" both be formed in the same new arc k-1 k-1 stage. Let F(Kk- ) F(Kk-1 ) and L(Kk-l') < L(Kk "). Assume that there exists a feasible triplet Km+n" derived m+n from Kk 1" that has a higher value than any feasible triplet derived from Kk '. We now show that this is impossible. Let a derivation of K "+n from K be [ekak ],[ek+l,ak+l],'., [em+n,am+n ]. a triplet k+n can be formed from Kk_ ' with the derivation [ekak],[ek+l, ak+1l],*.,[em+nam+n] such that 1) the triplet Km+n' is feasible, and 2) F(Kmn ) > F(Km+n ). The triplet Km+n' is feasible because in every new arc m+n stage from s=k to m+n, the cluster containing the. root of the subgraph in K s' has equal or lesser length than that of ~s K ". This is immediate from the fact that for each s=k to m+n, if a =0 then both K ' and K " have the same weight. If a =1 then s L(Ks') = ) L(C1) + [n * L(') L(Ks") = L(C1) + [n1 /n * L(C1 ")]

35 L(Ks") - L(Ks') = n1 /n1 * [L(C1)"-L(C1')] > 0 F(Km+n ') F(Km+n") is immediate from the fact that F(Kk-1') > F(Kk_") and the fact that km+n' and K +n" is formed from Kk 1' and Kk 1" with the same derivation sequence: [ekak]',[ek+l'ak+l' ** ' [em+n'am+n]'

REFERENCES Aho74 Aho, A.V. et al. The Design and Analysis of Computer Algorithms. Reading, Mass: AddisionWesley; 1974. ChTe82 Chiang, Wan P. and Teorey, Toby J. A Method for Database Record Clustering. Proceedinqs of the Conference on Information System; December 13-15, 1982: p. 57-73. Chia84 Chiang, W.P. Optimal Graph Clustering Problems with Applications to Information System Design. Ph.D. Thesis, University of Michigan, Michigan; May 1984. GaJo79 Garey, M.R. and Johnson, D.S. Computers and Intractability, a Guide to the Theory of NPCompleteness. San Francisco, Calif: Freeman; 1979. Marc83 March, Salvatore T. Techniques for Structuring Database Records. Computing Surveys, Vol.15, No.1; March 1983. Schk77 Schkolnick, M. A Clustering Algorithm for Hierarchical Structures. ACM Trans. Database Syst., Vol.2, No.1; 1977. TeFr80 Teorey, T.J. and Fry, J.P. The Logical Record Access Approach to Database Design. ACM Computerinq Suveys, Vol.12, No.2; June 1980: p. 179-211. TeFr82 Teorey, T.J. and Fry, J.P. Desian of Database Structures. Englewood Cliffs, N.J.: Prentice-Hall; 1982. 36

UNIVERSITY OF MICHIGAN 3 9015 02829 5726