THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY¹

A METHODOLOGY FOR THE DISTRIBUTION OF DATABASES

Mourad Oulid-Aissa

CRL-TR-2~-84

Under the Direction of Professor Keki B. Irani

APRIL 1984

Room 1079, East Engineering Building, Ann Arbor, Michigan 48109 USA. Tel: (313) 763-8000

¹This research was supported by the Department of the Army, Ballistic Missile Defense Advanced Technology Center, Rome Air Development Center, and the Defense Mapping Agency under Contract F30602-80-C-0173. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author and do not necessarily reflect the views of the funding agency.

ABSTRACT

A METHODOLOGY FOR THE DISTRIBUTION OF DATABASES

by Mourad Oulid-Aissa

Chairman: Keki B. Irani

This research proposes an approach to distributing data that feature semantic cross-references in a computer network. The unit of distribution is the functional dependency (FD). First, each scheduled query is assigned a set of cross-referencing data units (CRDUs) that will normally be accessed, at possibly different sites, to process it. Each CRDU is associated with one FD. The selected FDs are mapped to a so-called query envelope. The selection of a query envelope is based on semantic integrity and on the minimization of the total "volume" of the instances of the FDs belonging to the envelope. This problem, identified as the envelope optimization problem, is presented and analyzed in a formal framework, and is solved with efficient graph algorithms. Second, the CRDUs corresponding to the FDs of all the envelopes are distributed so as to minimize the operational communication cost. The issue of preserving database consistency is also discussed. Further, the following modeling points are considered: (i) the cross-references of physically distant CRDUs, which produce file-to-file inter-site communication in addition to the usual file-to-user communication, (ii) the partitioning of the set of CRDUs, (iii) the materialization of the CRDUs associated with one envelope, in case of duplication, and (iv) the flow capacity constraints. The problem of the distribution of CRDUs is modeled in the form of a manageable multi-commodity mixed-integer program, with quadratic cost and linear constraints. Computational techniques and numerical experience are discussed. Through experimental investigations, the research draws some practical conclusions about the distribution of CRDUs in a computer network. The effects of the reducing and assembly rates on the simultaneous referencing of CRDUs, and on data redundancy, are investigated. Further, the effects of the updating rates on cross-referencing, and on data redundancy, are investigated. The results suggest that a look-up of the user-provided data can indicate whether standard data distribution techniques, instead of CRDU distribution techniques, can be used.

TABLE OF CONTENTS

LIST OF TABLES ..... vi
LIST OF FIGURES ..... vii
LIST OF APPENDICES ..... ix

CHAPTER

1. INTRODUCTION ..... 1
   1.1. The physical environment ..... 1
   1.2. The user environment ..... 2
   1.3. Distributed query processing ..... 3
   1.4. The problem of database distribution ..... 4
   1.5. Research motivations ..... 5
   1.6. Cross-referencing data units (CRDUs) ..... 7
   1.7. Objectives ..... 9
2. BACKGROUND ..... 10
   2.1. The relational model ..... 10
   2.2. Functional dependency: A tool for logical schema design ..... 11
   2.3. Minimum depiction of functional dependencies ..... 13
   2.4. Distributed query processing ..... 14
   2.5. Data allocation models ..... 17
3. PRELIMINARY FRAMEWORK ..... 26
   3.1. Definitions ..... 26
      3.1.1. Functional dependencies and database schema ..... 26
      3.1.2. F-graphs ..... 30
      3.1.3. Link and envelope ..... 32
      3.1.4. Distributed query processing sequence ..... 34
   3.2. Initial problem formulation ..... 35
   3.3. Procedure towards a solution ..... 36
4. DATA MODELING ..... 39
   4.1. Foundation of the distribution approach ..... 39
   4.2. Step-1: Initial processing of a user F-graph ..... 41
      4.2.1. F-graph reduction ..... 41
      4.2.2. Minimum F-graph ..... 48
      4.2.3. Lossless F-graph ..... 51
   4.3. Step-2: Extraction of a link ..... 58
   4.4. Step-3: The envelope optimization problem (EOP) ..... 69
      4.4.1. Relation-size estimation techniques ..... 70
      4.4.2. A "brute-force" approach to solve EOP ..... 71
      4.4.3. A first heuristic approach to solve EOP ..... 76
      4.4.4. Final solution of EOP ..... 82
   4.5. Consistency preservation in updates ..... 94
      4.5.1. Constraints on the antecedent and range of an FD ..... 95
      4.5.2. Deletion of attribute values ..... 96
      4.5.3. Update cascades ..... 97
   4.6. Summary of PART 1: From users' FDs to envelopes ..... 105
   4.7. Example ..... 109
5. A MODEL FOR DATABASE DISTRIBUTION ..... 134
   5.1. Notation ..... 135
   5.2. "Naive" distribution modeling ..... 139
   5.3. PART 2: A model for the distribution of CRDUs ..... 141
6. COMPUTATIONAL TECHNIQUES ..... 149
   6.1. Solving the uncapacitated model by integer programming ..... 149
      6.1.1. Description of a branch-and-bound algorithm ..... 150
      6.1.2. A practical problem ..... 154
      6.1.3. Practical limitations ..... 159
   6.2. Improving feasible solutions for the uncapacitated model ..... 161
      6.2.1. Adaptive random search for the uncapacitated model ..... 163
      6.2.2. Solving the MRDB problem with XrefU ..... 170
   6.3. The capacitated model ..... 171
      6.3.1. Decomposition of the capacitated model ..... 172
      6.3.2. Optimizing the capacitated model ..... 176
      6.3.3. Convergence and robustness of XrefC ..... 179
7. EXPERIMENTAL RESULTS ..... 185
   7.1. Effect of the reducing rates over cross-referencing ..... 185
   7.2. Effect of updating over cross-referencing ..... 188
   7.3. Lower and upper bounds for the communication cost ..... 189
8. CONCLUSION ..... 192
   8.1. The problem ..... 192
   8.2. Unsolved issues ..... 194
   8.3. Research contributions ..... 195
   8.4. Improvements and further research ..... 196

APPENDICES ..... 199
BIBLIOGRAPHY ..... 225

LIST OF TABLES

Table
4.1. Cardinalities and widths of the FDs in the envelope of query Q t,5 ..... 111
4.2. Cardinalities of FDs, in the envelope of Q t,5, projected on each referenced attribute ..... 112
4.3. Possible partitions of the referenced attributes for query Q t,5 ..... 116
4.4. Cardinalities and widths of the FDs belonging to the envelope of query Q t,7 ..... 118
4.5. Cardinalities of the FDs, in the envelope of query Q t,7, projected on each referenced attribute ..... 118
4.6. Possible partitions of the attributes referenced by query Q t,7 ..... 122
6.1. Some functional dependencies for the MRDB ..... 154
6.2. Typical user-provided transactions for the MRDB ..... 155
6.3. Typical frequencies for the MRDB transactions ..... 156
6.4. All the FDs selected for distribution, and the possible partitioning configurations for the MRDB ..... 157
6.5. An example of numerical input (corresponding to site 2 in the MRDB problem) to be provided to the distribution program ..... 158
6.6. A partial solution for the MRDB problem ..... 160
6.7. Notations for the capacitated model ..... 175
6.8. Additional definitions for the adaptive algorithm ..... 180
C.1. The user-provided functional dependencies (FDs) for the MRDB ..... 211
C.2. Typical scheduled transactions for the MRDB ..... 212
C.3. Needs of each site in the MRDB computer network, and frequencies of occurrence of all the transactions ..... 213
C.4. Cardinalities and widths of the functional dependencies for the MRDB ..... 214
C.5. Cardinalities of the functional dependencies, for the MRDB, when projected on attributes ..... 215
C.6. Attribute widths for the MRDB ..... 216
C.7. Cardinalities of the domains for the MRDB ..... 217

LIST OF FIGURES

Figure
2.1. The effect of data clustering on distributed query processing ..... 23
2.2. Noncritical semi-joins in distributed query processing ..... 24
2.3. The materialization problem ..... 25
4.1. An example of F-graph reduction ..... 64
4.2. Another example of F-graph reduction ..... 65
4.3. Yet another example of F-graph reduction ..... 65
4.4. An illustration of LR-minimality ..... 66
4.5. A lossy F-graph (i), and its lossless version (ii) ..... 67
4.6. A link ..... 68
4.7. An F-graph with "partial" and "total" labels ..... 104
4.8. An F-graph depiction of the Medical Record Database (MRDB) ..... 125
4.9. A minimum F-graph for the MRDB ..... 126
4.10. The link of query Q t,5 ..... 127
4.11. A proper link for query Q t,5 ..... 128
4.12. The metagraph for query Q t,5 ..... 129
4.13. The weighted metagraph for query Q t,5 ..... 130
4.14. The link of query Q t,7 ..... 131
4.15. The metagraph for query Q t,7 ..... 132
4.16. The weighted metagraph for query Q t,7 ..... 133
5.1. The placement of two redundant CRDUs at the same site ..... 148
5.2. A distributed query processing scheme ..... 148
6.1. Random search assignments ..... 166
6.2. Encoding scheme for the adaptive algorithm ..... 167
6.3. A "structure" for the adaptive algorithm ..... 168
6.4. The string manipulation operations used in an adaptive search ..... 182
6.5. The performance of Algorithm XrefU ..... 183
6.6. A schematic visualization of the execution of Algorithm XrefC ..... 184
7.1. Variations of the measure of redundancy and the measure of cross-referencing with respect to the reducing volume ratio ..... 190
7.2. Variations of the measure of redundancy and the measure of cross-referencing with respect to the updating volume ratio ..... 191

LIST OF APPENDICES

Appendix
A. INDEX ..... 200
B. THE RELATIONAL DATA MODEL ..... 206
C. THE MEDICAL RECORD DATABASE PROBLEM ..... 210
D. INPUT FOR THE DISTRIBUTION ALGORITHMS ..... 218
E. OPTIMAL SOLUTION OF THE MRDB ..... 224

CHAPTER 1

INTRODUCTION

The goal of this thesis is to outline a detailed procedure which can be applied to the design of distributed databases. In this chapter, the main problem of interest is introduced. Initially described are the notions of "distributed system" (Section 1.1) and "user environment" (Section 1.2). The principles of distributed query processing are discussed in Section 1.3. Section 1.4 introduces the notion of a "distributed database" and states the central problem of interest. Section 1.5 provides some key arguments to support our contention that the design methodologies of database distribution available at the time of this writing suffer from shortcomings. Section 1.6 emphasizes one important issue in the research, the definition of a distribution unit. Finally, the research objectives are stated in Section 1.7 in informal terms.

1.1. The physical environment

Our work applies to a physical environment referred to as a distributed system*. A distributed system is characterized by the placement of processing and control functions** at different computing sites, as opposed to a centralized system, in which all the processing and control functions are located in a main computing center. A single, usually powerful, computing unit is enough to support a centralized system; a distributed system, however, usually implies the existence of an underlying computer network. A computer network can be visualized as a set of interconnected sites or nodes. Thus, a computer network consists of a set of computing units that can exchange information with each other through a common communication medium (usually a network of tightly connected processors, i.e. connected by telephone channels, or loosely connected processors, i.e. communicating through a broadcasting medium [BOOT 81, DORA 77]). A distributed system can also be supported by a database machine or some more general parallel computing machines. A fundamental property of a distributed system is its ability to remain operative for most users (or most jobs) in the event of localized failures of some nodes, or of communication disruptions. In this context it is assumed that a computer network stands for a set of autonomous computers together with their interconnecting links. (However, the material of this thesis is also relevant to the design of distributed systems conceived on one or several database machines.) Computers which are said to be autonomous are logically equal [TANE 81], i.e. they participate within a distributed system in an "equal partnership" relationship. This type of physical environment excludes any hierarchical or master-slave relationship between any two computers. Furthermore, the computers of such an environment, sometimes called a horizontal distributed system, need not have the same physical characteristics. In addition to considering a distributed system, we will focus on the usage of such a system in the context of database applications.

*Technical words in italic, in Chapters 1 and 2, are defined in Appendix A.
**A "function" can be visualized as a software module. We refer to functions that are part of a distributed management system, and in particular, the functions applied to the control and processing of transactions.

1.2. The user environment

The user environment is characterized by the nature of the information processing applications. In this context, the user environment translates into a database that is to be processed through intensive cross-referencing of data.
Distributed databases are relevant for the support of such organizations as government agencies, communication industries, transportation companies, and the military, to name just a few. In general, the type of database which we wish to consider is one that contains a huge amount of intricately related "text" or numerical data. The information that it represents is under periodic querying and updating, and in general, each query may require more than a simple reading operation. A query may indeed involve complex cross-referencing of data. Users originating such queries from each site of the computer network form a homogeneous usage environment in the sense that they all belong to the same organization and that the entire database appears to them as if it were centralized. It is assumed, however, that there may exist some typical patterns for the queries and updates initiated from any specific site. The typical transactions are referred to as the scheduled queries and updates*. Hence, although the information available through the database is accessible by any user from any site, one site may be associated with users with specialized interests who originate certain types of transactions more frequently than other types of transactions. The data needs of a user constitute the view of that user. The user environment to be considered consists of one homogeneous distributed database, and a set of users' views, where a user is identified with one computer or site of the underlying computer network of the distributed system. We now wish to refine the description of the user environment and to introduce some technical notions of distributed query processing.

*As opposed to ad-hoc queries and updates.

1.3. Distributed query processing

The concept of distribution implies more than a mere geographical distancing of data. Indeed, it presupposes some strong features regarding the strategies of query processing. Normally, a query is processed in three steps. During the first step, any query issued at a user-site, by virtue of its qualification clauses, leads to some extensive local processing on a set of initially accessed relations*. The relevant relations are usually reduced in size by a sequence of restriction and projection operations. The second step involves the complex cross-referencing of the reduced relations. Those relations which are semantically dependent† are involved in a mutual effort to further reduce one another, in order to eliminate as much irrelevant data, for the query, as possible [CHUN 83]. In practice, this is usually accomplished by performing a sequence of semi-joins between distant relations††. Data cross-referencing in distributed query processing is associated with the so-called inter-site or file-to-file communication. This type of communication usually incurs a relatively large flow of data volume, also known, for obvious reasons, as the reducing flow in the computer network‡. When the reduction step is completed, the copies of the reduced relations contain the information requested by the query, and are ready to be shipped to and assembled at the user-site to produce a final answer¶. This third and last step of distributed query processing is referred to as the file-to-user communication, and is associated with the so-called assembly flow in the computer network§.

*Since it is common to use the terminology proper to relational databases, the word relation will often be favored over the less descriptive word file.
†E.g. if they have one or more common attributes.
††Located at different sites.
‡The words reducing and file-to-file will be associated with a logical operation such as a semi-join, with the cross-referencing communication process established between two relations, or with the resulting flow in the network.
§The remark made in footnote ‡ applies also to the terms assembly and file-to-user.
¶The assembling operation may be performed at a site distinct from the user-site. However, there seems to be no advantage in doing so.

At the design stage, the data units which are subject to distribution are referred to as cross-referencing data units (CRDUs), because in the operational stage of the distributed database, they will assume the form of relational tables or fragments thereof**, and will be submitted to cross-referencing during the distributed query processing sessions. The objective of this research is to find an efficient and formal approach to the problem of defining and distributing CRDUs.

**E.g. after the application of projection and restriction on some larger relations yet to be defined.

1.4. The problem of database distribution

Although the problem of "where to place the data" can sometimes be solved pragmatically, it is nevertheless a non-trivial one, mainly because of data dependence. Further, data distribution may rely in many ways on formal optimization techniques. It is usually not practical to replicate and store a whole database at each site of a network, particularly in the case of a very large database. Data units which are semantically dependent on one another may have to be stored at different sites, and may thus require some information to be represented with partial redundancy. A single transaction originated at one site may therefore cause some interlocation accesses to take place between several sites other than the user-site. Such a database is called a distributed database. Hence, a distributed database is characterized by the storage of data units, organized into files, at several sites of a computer network. Any data unit may be stored at one or more sites and may be accessed from any other site when needed. In this context, the overall operating cost of the distributed database results primarily from the communication cost per unit time. Given a network of computers that share common data units, the problem of finding a placement of the data units over the computing sites, such that this placement yields a minimum overall operating cost while satisfying a set of physical and logical constraints, has been identified in the literature as the "data allocation" problem. Usually, whether multiple copies of data units are allowed or not is stated explicitly. This research addresses a larger problem: that of defining, selecting, and distributing CRDUs so that relations can be synthesized at each site. This problem is identified as the database distribution problem. Thus, data allocation can be considered to be a subproblem of the database distribution problem. Note that the word "distribution" is used, rather than the word "allocation". In our terminology, "distribution" includes the possibility of keeping duplicate copies of any one of the CRDUs. In the section that follows, two issues which motivate further research in the solution of the database distribution problem are indicated.

1.5. Research motivations

For some organizations, a specific distribution of data may naturally be dictated by the rigid nature of the applications associated with the sites of a network [BOOT 81].
However, if the database to be designed is not initially partitioned, and if no a priori distribution naturally emerges, then the database distribution problem needs to be addressed, to some extent, in optimization terms. Further, the most critical issue to consider when distributing data in a computer network is the definition of a unit of distribution. This latter issue is discussed at more length in Chapter 2. The choice of a unit of distribution will eventually affect the performance of a distributed database. The traditional methodology [BUCK 79] applied to the database distribution problem consists of dealing separately with the two problems of defining the database files and of allocating them to the computers of the network. Hence, the allocation phase usually rests upon the assumption that the data are already organized into files. The methodology applied to define those files makes use of centralized database design techniques. That design approach is inefficient because measures of performance, mainly the traffic cost associated with distributed query processing mechanisms, are not taken into consideration at the time when the initial files are defined. In general, the distributed database design methodologies available at the time of this writing are based on the pragmatic notion that file definition and database distribution are two independent problems [BUCK 79, ESWA 74]. It must be realized that this approach is a simplified solution to the general database distribution problem and needs to be reviewed and extended. The underlying methodology of this work emphasizes, to some extent, a synthesis approach to design. The proposed design process starts with the definition of the CRDUs, which may be looked at as the elementary building blocks of some, yet undefined, relations, and it ends with the definition and usage of those relations in the distributed system environment. The preceding paragraphs contain some general remarks about the methodology of distributing databases.
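The semi-join reduction step of distributed query processing, described in Section 1.3, can be sketched in a few lines. The relation names, attribute names, and data below are purely hypothetical, and lists of dictionaries stand in for relational tables; the sketch only illustrates why shipping the projection of one relation can shrink another before the assembly step.

```python
# A toy illustration of the semi-join reduction step of Section 1.3.
# Relations are represented as lists of dicts (hypothetical data).

def semi_join(r, s, attr):
    """Reduce relation r to the tuples whose value on `attr`
    also appears in relation s (r semi-joined with s on attr)."""
    s_values = {t[attr] for t in s}
    return [t for t in r if t[attr] in s_values]

# Suppose PATIENT is stored at site A and VISIT at site B.
patient = [{"pid": 1, "name": "a"}, {"pid": 2, "name": "b"},
           {"pid": 3, "name": "c"}]
visit = [{"pid": 2, "date": "1983-01-05"}, {"pid": 2, "date": "1983-02-11"}]

# Reducing flow: site B ships only the projection of VISIT on `pid`
# (here a single value) instead of the full relation; site A then keeps
# only the PATIENT tuples that can survive the final join.
reduced_patient = semi_join(patient, visit, "pid")
print(reduced_patient)   # only patient 2 remains
```

The assembly flow then ships `reduced_patient`, rather than the whole of `patient`, to the user-site, which is the saving that cross-referencing-aware distribution tries to exploit.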
The next observation refers specifically to current data allocation models and their solution techniques. In these models, the assumption of data independence is routinely made. This assumption implies that the distributed query processing mechanisms are exclusively modeled as sequences of file-to-user accesses [RAMA-b 79]. Indeed, the assumption of data independence implies that any query can be solved by sending a complete answer from a distant site (that possesses the required information) to the site where the query is issued. Therefore, the simplest models of data allocation follow standard mathematical structures, similar to those of the "facility-location" type problems in operations research. However, the data independence assumption (which is a prerequisite for those models) cannot realistically be applied to complex databases that make use of relational database techniques, since it is inadequate for representing distributed query processing mechanisms, i.e. the cross-referencing of data units located at different sites, as was noted previously. The issue of cross-referencing in distributed query processing will be discussed in more detail in the next chapter. Furthermore, the data independence assumption prevents us from investigating the effect on the traffic of the update transactions which are originated by the system (as opposed to those originated by the user) to preserve database consistency. Updates and update side-effects will be treated in Section 4.5. Thus, if optimization is to be included in a methodology for database distribution, some basic semantic issues must be considered, and therefore, modifications must be made to current mathematical programming models of data allocation.

1.6. Cross-referencing data units (CRDUs)

In this section, the term "cross-referencing data unit" is given a precise definition, and the soundness of that definition is argued. A transaction originated by a user at some site of the network is really an access of data-items or attributes. Therefore, instead of defining a CRDU to be a file or a relation, a natural approach might be to associate a CRDU with an attribute. However, attributes, as such, are meaningless from a semantic point of view. The notion of dependency among attributes, which is central to guaranteeing the correctness of transaction processing, suggests that groups of attributes can be used to define CRDUs. To ensure query correctness, several rules have been developed in the literature. These rules hold for semantic constraints of the functional type*, known as functional dependencies (FDs) [ARMS 74]. It is convenient to identify a CRDU, i.e. an atomic data unit which can be subject to distribution, with an FD. Semantic constraints of the functional type, or FDs, are chosen for the following reasons. (1) FDs are sufficient to determine whether or not a data access leads to a semantically correct answer for a query. (2) They are simple and easily identifiable by the user. (3) They are the building elements of the synthetic approach to constructing logical database schemata for the three major models: the "network", the "hierarchical", and the "relational" models. It has been shown that FDs can be used to design logical databases for any of the three models mentioned [BERN 76, DELO 78, HOUS 79]. Further support for the choice of FDs to stand for CRDUs will be provided in Section 2.2. The relational model is selected because it is the best adapted to current distributed query processing terminologies, as discussed in Chapter 2. However, this does not presuppose any restriction as far as the local physical database implementations are concerned. In this context, the CRDUs are the physical counterparts of FDs, and can be visualized as particular types of relational tables featuring at least two columns of attribute values. In the remainder of this thesis, FD is used to denote a functional dependency as an abstract concept. A CRDU refers to the physical instance of an FD. Algorithms for synthesizing relational tables [WANG 75, BERN 76, KAMB 78] permit the construction of relations at each site after the CRDUs mapped into FDs are distributed. Thus, under our definition of the CRDUs, it will be possible to define a methodology which introduces optimization early in the solution of the database distribution problem.
*This choice is made without any loss of generality because dependencies between attributes that are not of the functional type can be modeled with artificial FDs called non-functional relationships. If the concept of non-functional relationships [BERN 76] is considered to be too controversial for some readers, another argument supporting the use of FDs is that it is widely acknowledged that regular FDs augmented with the related notion of join-dependency (defined in Section 3.1) are enough to capture the semantics of most practical databases [BEER 81].

1.7. Objectives

The problem addressed in this thesis is now broadly stated. A computer network which consists of computer sites and of communication links is available. The computer network has a predefined topology and the links have fixed channel capacities. It is equipped with a horizontal distributed system in the sense of Section 1.1. A distributed database is to be designed for the benefit of an organization of users in the sense of Section 1.2, and a universe of data-items is provided by the organization together with a set of FDs. This organization also provides specifications for every transaction, query and update, at each site, together with the occurrence frequencies of those transactions and some statistical information about "raw" data-volumes* with respect to FDs and attributes.

The problem objectives are the following. (1) Model and select a set of CRDUs which may be subject to distribution without any loss of the over-all information, and which will allow simple and efficient file-to-file accesses in the operating phase of the distributed database. Determine how the CRDUs are affected by updates. (2) Represent the problem of organizing CRDUs in the computer network in the form of a manageable mathematical optimization model. The organization of the CRDUs must capture** (i) the cross-referencing characteristics of distributed query processing, (ii) the clustering of the CRDUs, e.g. whether they belong to a common relational table or not, and the resulting effect on the assembly and reducing flows of data-volume, and (iii) the materialization, in the static sense, of the CRDUs, i.e. the choice of relevant instances of CRDUs to seek access to, for each query, when there are duplicates. (3) Explore and implement efficient numerical techniques to solve the proposed model of (2) so as to minimize the operational communication cost.

The next chapter surveys related studies in the literature, and develops some background for the research.

*See Section 4.4.
**The following issues will be explained in more detail in the next chapter.

CHAPTER 2
BACKGROUND

In this chapter the conceptual foundations needed to achieve the objectives of Section 1.5 are laid out. Relevant issues are discussed in light of related studies in the literature. At the same time, some basic assumptions are made. The relational model is first introduced (Section 2.1). The usage of functional dependencies in database schema design is discussed (Section 2.2), and means to depict functional dependencies are reviewed (Section 2.3). Distributed query processing is described (Section 2.4), followed by a survey of earlier data allocation models (Section 2.5).

2.1. The relational model

Some basic technical terms of relational database modeling are introduced. The relational model [CODD 72] has been the object of very intensive and rich studies. It is legitimate to claim that database theory, as it is being developed now, was born out of the relational model. What makes it attractive is its nonprocedural orientation. The definition of a relational database will not be expanded on; comprehensive presentations can be found in [CODD 79, CODD 72, DATE 78, ULLM 80] to name just a few sources. A relation or relational table will be referred to as a subset of the arbitrarily ordered cartesian product of a list of domains. A domain is itself a value set for one or several attribute types (the "atomic" data-item types). A relation is a set of tuples. Thus, a tuple represents a possible value for a specific list of attribute types. That list constitutes the heading of a relation. Well-defined operations can be applied to relations to reduce them or to combine them together. These operations are used to define the relational algebra [CODD 79, CODD 72, DATE 78]. More will be said below about the use of these operations in conjunction with distributed query processing policies. In the next section, functional dependencies are discussed. They can be visualized as special types of relations that can be used as building blocks to generate more complete relations. The process of defining relations from FDs is known as the synthetic approach of logical schema design.

2.2. Functional dependency: A tool for logical schema design

The functional dependency (FD) semantic constraints serve as building units in database design. FDs were first mentioned in connection with the normalization of relational tables as a measure to avoid update anomalies [CODD 72]. Axioms pertaining to FDs were established in [ARMS 74] as a foundation for a formal description of database semantics. After the first definition of third normal form was proposed in [CODD 72], there followed the concept of multivalued dependency (MVD) [FAGI 77], implying stricter normal form definitions. Database theorists recognized early the interest of synthesizing schemata from functional dependencies [WANG 75, BERN 76]. The synthesis approach consists of building normalized relational tables given an initial set of FDs. However, in [KAMB 78], the algorithms presented in [WANG 75, BERN 76] were shown to feature some weaknesses. In this latter work, a new algorithm for synthesizing relational tables was presented. As mentioned in [KAMB 78], the reason for considering FDs exclusively, and not multivalued dependencies, in the context of relational table synthesis stems from what we might call the universality property of FDs. To quote Kambayashi [KAMB 78]: An FD X →
Y holds in any set of attributes that contains X ∪ Y under the uniqueness assumption discussed in [BERN 76]; an MVD is defined together with the set of attributes where it holds.

This is why MVDs are considered primarily in the context of a decomposition approach to schema design [BEER 80, ZANI 81] rather than a synthetic approach. A representation of a database as a set of "3NF"-normalized relational tables is not unique [LOZI 78]. Heuristics aiming at improving the performance of a centralized relational database were devised as an extension to the synthetic approach [LOZI 78]. The relational model is not the only logical database model which stands to gain from the use of FDs. FDs have been shown to serve in the conversion from a hierarchical model to a relational model [DELO 78]. Also, they have been used to construct logical database schemata described with the network model [HOUS 79]. In [HOUS 79], operations defined on FDs are used to "navigate" through a schema and to describe transactions in a formal way. However, those "navigation" mechanisms do not always capture correct query processing, since so-called lossy accesses, as described in [AHO 77], may sometimes occur. It has been argued that a set of FDs, together with the notion of lossless join (to be discussed in the next section), is enough to capture the semantics of most practical databases [BEER 81]. Further, the important problem resulting from deletion cascades [HOUS 79] is most simply observed through FD-modeling. Deletion side-effects associated with consistency preservation can assume a critical role in database distribution. These particular aspects of query correctness, i.e. data integrity and data consistency, have been avoided in previous data allocation methods [PALM 79, BUCK 79], and in mathematical file allocation models [CASE 72, CHU 73, ESWA 74, MAHM 76, MORG 77], mainly through the assumption of data independence. The preceding arguments further support the claim, made in Chapter 1, that it is indeed appropriate and quite general to retain FDs as the sole low-level semantic modeling tools.
In the next section, issues pertaining to the depiction of a minimum cover of FDs, i.e. a smallest set of FDs which is semantically equivalent to a given set of FDs, are discussed.

2.3. Minimum depiction of functional dependencies

The main idea of this section is that the non-redundant depiction of a set of FDs by graphical means is a convenient way to arrive at the representation of a minimum cover of that set. In [AHO 77, BISK 79, LOZI 80], similar results concerning lossless joins of relations have been reported. The main result of interest to us is summarized in Theorem 1 (Section 4.1), which combines two corollaries presented in [BISK 79]. Corollaries 1 and 3, and Proposition 7, extend this result to relational schemes involving FDs uniquely. The notion of lossless join is useful to determine whether or not a query solution respects the database integrity. We are not considering the correctness of actual query processing steps, since we assume that appropriate processing strategies are available, but rather the validity of the so-called query envelopes [CCA 80]. In this work, the envelope of a query is informally defined to be a portion of the data which is sufficient for answering the query. A query envelope consists of a set of FDs that features the lossless join property. Since, for each query, the choice of an envelope is not unique, optimization will be used to determine a "good" query envelope. The envelope optimization problem is not unlike the problems considered in [LOZI 78, LOZI 80] for a centralized database. In [LOZI 78] the author considers possible improvements of Bernstein's synthesis algorithm [BERN 76] to enhance database performance. In [LOZI 80], given a set of relations, the author looks for efficient query procedures. We borrow from [LOZI 80] the notion of an F-graph to depict FDs, and the notion of a link to extract a relevant portion of a database for a query (although this latter notion will be refined and adapted to our specific needs). Because of the "distributed" nature of our envelope optimization problem, our assumptions and solution techniques are different from those proposed in [LOZI 80].
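The lossless-join property required of a query envelope can be made concrete for the simplest case of a two-scheme decomposition, where a classical test (in the spirit of [AHO 77, BISK 79]) reduces to an attribute-closure computation: the decomposition is lossless if the common attributes functionally determine one of the two schemes. The sketch below is illustrative only; the attribute names and FDs are hypothetical, and the general multi-scheme case used in this thesis requires the machinery of Chapter 4.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under a set of FDs given as (lhs, rhs) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    """True if the decomposition {r1, r2} has a lossless join under fds."""
    common = r1 & r2
    c = closure(common, fds)
    return r1 <= c or r2 <= c

# EMP(emp, dept) and DEPT(dept, mgr) under dept -> mgr: lossless,
# because the shared attribute dept determines all of DEPT.
fds = [(frozenset({"dept"}), frozenset({"mgr"}))]
print(lossless_binary({"emp", "dept"}, {"dept", "mgr"}, fds))  # True
print(lossless_binary({"emp", "dept"}, {"emp", "mgr"}, fds))   # False (lossy)
```

The second call is lossy because the shared attribute emp determines neither scheme, so joining the two projections can manufacture spurious tuples.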
The methodology described in [LOZI 80] begins with an initial F-graph (Section 3.1.2) to depict FDs, but no mention is made as to how such a graph could be obtained without redundancy. The generation of an initial F-graph from a user-provided F-graph is one important issue to address, especially in the context of automatic schema design. A set of reducing rules for F-graphs based on Propositions 1-3 (Section 4.2) is provided to eliminate redundancy systematically. The approach of building reduced F-graphs is justified by Propositions 5-7 (Section 4.2), in light of previous results on minimum FD covers reported in [MAIE 80]. It is shown that Algorithm Reduce of Section 4.1, which constructs a nonredundant F-graph to depict FDs, leads to the same result and is at most as complex as the algorithm of [MAIE 80] which derives LR-minimum covers of FDs. The discussion on FDs and the means to represent them suggests that they can be utilized to describe a user's view of its data [FEDA 81] and also a system view of the data (if the FDs are rearranged to eliminate excessive redundancy). The focus will be placed on a low level of abstraction corresponding roughly to the "instance operations level" of [NAVA 76], and not on mappings from external schemata to the conceptual schema as in [FEDA 81], although the importance of this latter issue at a higher semantic level is recognized. Determining an envelope of lossless FDs for a query constitutes a starting point from which actual query processing sequences may be derived [CHUN 83]. In the next section distributed query processing strategies are discussed.

2.4. Distributed query processing

It is essential for comprehension of the proposed model (Chapters 4 and 5) to outline the characteristics of distributed query processing [HEVN 79, WONG 77, CCA 80, CHUN 83]. The first issue associated with the solution of a distributed query is the localization, through a data directory, of the relevant data needed [CHU 76]. It is assumed that such directories are available at each site [CHU 76]. Since data may be replicated at different sites, a specific nonredundant instance of data must be determined for each query.
This instance of data is known as a static materialization [CCA 80, HEVN 79]. A static materialization refers to a

physical representation of the query envelope, which itself refers to a smallest portion of the conceptual schema relevant to solve the query. A materialization can be obtained by table look-up [CCA 80] or can be determined dynamically during query processing [WONG 81]. It is assumed that a look-up table will eventually be available at each site [HEVN 79]. During the normal operation of the system, a certain user query issued at some site will always look for a specific CRDU at a specific site. The system in charge of selecting the most cost-beneficial copies of data to be included in a query materialization is called the "integrity subsystem" in [HEVN 79].

The next issue is query analysis. It is mainly concerned with the decomposition of a query statement, expressed in a high-level language, into simpler subqueries [WONG 77]. This leads to the accomplishment of elementary processing tasks. Query optimization is the final issue of query processing, and it is the objective of the "query processing subsystem" [CCA 80, HEVN 79]. To visualize the optimization of a distributed query, the relational algebra is now mentioned. The relational algebra is basically defined through its operations on relations, mainly Selection, Projection and Join [CCA 80]. Each data processing task can be mapped to one of those algebra operations. A join can be substituted by a pair of semi-joins. In most practical cases this substitution actually results in less communication overhead [CCA 80, CHUN 83]. Whereas the joining of two relations necessitates the displacement of at least one whole relation to a common site, a semi-join operation can be performed by keeping the relations at their local sites, and by transmitting their respective joining columns to each other. Hence, joining several relations, which incidentally is a commutative and associative operation [AHO 77], can be achieved by a sequence of semi-joins.
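As a minimal illustration of the semi-join idea just described, consider two toy relations held at different sites (the relations, sites and values below are hypothetical). Instead of shipping all of S to the site of R, only R's joining column travels one way and the reduced relation travels back:

```python
# Toy relations as lists of tuples; R at site 1, S at site 2.
R = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]   # R[A, B]
S = [("a1", "c1"), ("a1", "c2"), ("a9", "c9")]   # S[A, C]

# A full join at site 1 would require shipping all 3 tuples of S.
# Semi-join: site 1 first sends only R's A-column to site 2 ...
a_column = {a for a, _ in R}                     # shipped: 3 attribute values
# ... site 2 reduces S to the tuples that can actually join ...
S_reduced = [t for t in S if t[0] in a_column]   # shipped back: 2 tuples
# ... and site 1 completes the join on the reduced relation.
join = [(a, b, c) for a, b in R for a2, c in S_reduced if a == a2]
print(join)  # [('a1', 'b1', 'c1'), ('a1', 'b1', 'c2')]
```

The dangling tuple ("a9", "c9") never crosses the network; with realistic cardinalities the joining column is far smaller than the relation it filters, which is the source of the communication savings claimed in [CCA 80, CHUN 83].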
This implies that one semi-join can take advantage of another in terms of data-volume [CCA 80]. At this point it is worthwhile to pause and make the following remark. One query may be solved by more than one sequence of such selections, projections and semi-joins. It is commonly accepted that all the restrictions and projections must be performed locally, as far as possible, to reduce the size of the initial relations. Those smaller-sized relations are referred to as fragments. It is assumed that an efficient sequence of semi-joins can be found prior to addressing the allocation problem explicitly. That it is correct to do so results from the fact that, in a packet-switching network, delays are mainly proportional to the flow of data-volume, and therefore, in such an environment, any optimal query processing sequence depends mainly on the partitioning of accessed relations, and not so much on their positions in the network [CHUN 83]. The quality of a processing sequence is measured by the cost of its semi-joins (the volume of data to be transmitted) and their benefits (the volume of data eliminated by a semi-join operation [CCA 80, CHUN 83]). Algorithm OutTree of Section 4.4 eliminates, as much as possible, the need to access attributes that do not belong to the target list of a query, while minimizing some measure of cost associated with semi-joins. No distributed query processing strategy is, as of now, standard. However, the SDD-1 algorithm [CCA 80] contains all the features of a realistic strategy. "Algorithm H" of [CHUN 83] follows the same spirit as [CCA 80], but goes further towards the optimization of semi-join procedures. During a distributed query processing session, a query originates file-to-file communications to execute some data-reducing operations, or simply, reducing operations. Once those operations are completed, file-to-user communications take place to collect the result at the originating user site. This action of collecting the final relevant data is referred to as the assembly operation*. It is important to note that the flow incurred by file-to-file transmissions, as well as the assembly flow, are much larger than the flow corresponding to messages due to requests and control [CHUN 83]. The reducing flow is based upon the transmission of large volumes of data, sometimes complete columns of a relational table at a time, associated with join or semi-join [CCA 80] operations. For updates, however, transmissions are mainly due to messages.
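The cost/benefit measure of a semi-join can be made concrete with a back-of-the-envelope calculation in the style of the SDD-1 estimates [CCA 80]. Every number below (column width, cardinalities, selectivity) is a hypothetical statistic, not data from this thesis:

```python
# Hypothetical statistics for a candidate semi-join reducing R by S's A-column.
width = {"A": 4}                 # bytes per value of the joining attribute A
s_distinct_a = 400               # distinct A-values occurring in S
r_tuples, r_width = 10_000, 40   # cardinality and tuple width (bytes) of R
selectivity = 0.3                # fraction of R's tuples whose A-value appears in S

cost = s_distinct_a * width["A"]                  # volume shipped: S's A-column
benefit = (1 - selectivity) * r_tuples * r_width  # volume of R eliminated
print(cost, benefit, benefit > cost)  # 1600 280000.0 True
```

A strategy such as the SDD-1 algorithm would schedule this semi-join because its benefit greatly exceeds its cost; with a selectivity near 1 the inequality reverses and the semi-join is not worth performing.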
Transmissions of messages associated with concurrency control mechanisms are not considered; neither are the time delays due to transmission and waiting in queues [MAHM 76, KHAB 80]. The main focus is on the minimization of

*Note that the terms assembly and reducing were previously used in connection with the flow rather than the logical operations that lead to that flow. Such flexibility in terminology will be continued.

data-volume shipments. In the final section, some characteristics of earlier data allocation models are surveyed. Modeling assumptions and mathematical aspects are discussed.

2.5. Data allocation models

Early research on data allocation has been reported in [CASE 72, CHU 73, ESWA 74, MAHM 76, MORG 77]. Further studies related to the same problem include [KHAB 80, FISH 80, RAMA 79b, APER 80]. In all the publications mentioned above, the assumption of file independence is made. This means that, given any query, the complete answer to the query is the collection of the partial answers contained in all the relevant files accessed. Unfortunately, this assumption is entirely inadequate when applied to a distributed database that is represented by a set of relational tables. As was noted above, distributed query processing does not consist in a mere file-to-user access, but incurs file-to-file communication as well, between sites other than the user site [BOOT 81, CCA 80]. When considering a network of autonomous computers [TANE 81], and users at each site with roughly the same needs in terms of logical accesses, it is hazardous to claim the feasibility of designing independent files that will seldom necessitate any file-to-file communication when processing queries. Relations are not independent; they are related by the attributes they have in common, and therefore they must be allowed to participate in cross-referencing operations. A model for the placement of relations in a network that apparently departs from classical ones is presented in [RAMA 79b]; it applies to a relational database in which cross-referencing accesses are permitted. However, the query processing policy assumed in [RAMA 79b] requires that the relations accessed in a multi-relation retrieval be moved simultaneously to the originating site before retrieval operations can be performed.
In essence, this amounts to restricting the query processing communications to file-to-user communications, and therefore the flow of data associated with the reducing operations, which has been

referred to as the file-to-file flow, is not considered. In [APER 80], however, the problem of distributing a set of relations in a computer network is considered from a different viewpoint. Indeed, the author acknowledges the importance of including cross-referencing in the distribution stage. An efficient heuristic is used to allocate already existing relations to the sites of the network. The two last-mentioned studies do not address some very typical issues in the distribution of CRDUs; these are introduced below.

There is one main difference between distributing, on the one hand, files which do not feature the cross-referencing property, and on the other hand, CRDUs, and in particular relations, which by definition can participate in cross-referencing accesses with one another because of semantic dependence. In the first case it is acceptable to evaluate the total data-volume flow using the superposition principle, whereas in the second case the data-volume contributions are ruled by highly nonlinear effects due to clustering. To understand the significance, on data-volume flow, of clustering CRDUs, let us consider Figure 2.1. It depicts two possible cases of relation clustering for a query, originated at site t, which needs to access relations R1[A,B], R2[A,C] and R3[A,D]. In Figure 2.1-i, R1 and R3 are clustered* at site 1. R2 is available, by itself, at site 2. The query may be processed by following a strategy as depicted by the graph of Figure 2.1-i. On this directed graph, arc (a,b) refers to a "dummy"** move of relation R3 towards R1, resulting in a join R1[A]R3. Also, each vertical arc, such as (b,e), indicates that no data flow is associated with the corresponding access of data because no physical inter-site move is required.
Arc (b,d) represents a reducing operation from R1[A]R3 towards R2, initiated at step 0 of the distributed query processing sequence; it corresponds to the semi-join (R1^0[A]R3^0)<A]R2^0, which requires the shipment of a volume of data labeled (R1^0[A]R3^0)[A]†. Arc (d,e) corresponds to a reducing operation from R2

*We may visualize the clustering of R1 and R3 as resulting from the join of R1 with R3 over column A, if it is advantageous to do so.
**It obviously incurs no flow since R1 and R3 are located at the same site.
†The superscripts indicate the step in the distributed query processing sequence. R1[A]R3 sends all the values of A that it possesses to site 2, where R2 is located. We assume familiarity with the usual notation for join, semi-join and projection.

towards R1[A]R3, initiated at step 1; it is labeled R2^1[A] to indicate that the copy of R2, at step 1, sends its own column of values for the attribute A to site 1. The assembly flow results from a transfer of relations (R1^0[A]R3^0)<A]R2^1 and R2^1, respectively from sites 1 and 2, to the user site t. For the case depicted in Figure 2.1-ii, R1 and R2 are clustered at site 1. They are represented by their join R1[A]R2 at site 1, whereas R3 is available at site 2. The flow of data-volume resulting from this configuration is indicated in Figure 2.1-ii. It is assumed that the query processing strategy depicted seeks to minimize the flow of data-volume for this new configuration of the initially accessed relations. The distributed query processing strategy of Figure 2.1-ii is shown to be totally different from the previous one in Figure 2.1-i. The reason is that in each case, the particular representation of the initially accessed relations is fully taken advantage of by the distributed query processing algorithm used [CHUN 83]. Therefore the two configurations of R1, R2 and R3 in the network result in different combined file-to-file and file-to-user flows. Obviously, the clustering of CRDUs affects in a non-trivial way the flow of data-volume associated with the distributed query processing mechanism.

Yet another problem arises from overlooking the importance of the position, in the network, of the user who originates a query. Figure 2.2-i shows an optimal query processing strategy to solve a query by operating on three distinct CRDUs r, q1 and q2. It is assumed that those CRDUs will be placed at different sites, and hence the clustering effect need not be considered. Figure 2.2-ii displays the reducing and assembly flows in the network when q1 is at site i, q2 at site j, r at site k, and the user site t is distinct from i, j and k. If the user is moved to site k, then site t becomes identical to k; the flow would look as shown in Figure 2.2-iii.
The assembly flow resulting from the shipment of the processed copy of r to site t is discarded. However, one question still remains; looking at Figure 2.2-i, it can be noticed that arcs [(i,2),(t,3)] and [(j,2),(t,3)] correspond to cross-referencing operations aiming at reducing the size of the copy of CRDU r, and therefore the assembly flow from k to t. Furthermore, because no other cross-referencing operation is to either benefit or follow from their

application, it is said that these cross-referencing operations are noncritical. Of course it seems futile to try to decrease the size of r, because this will not trigger any move to decrease the sizes of q1 or q2 any further, and because r is already at the destination site t! Therefore the arcs which are represented with dashed lines can be omitted from Figure 2.2-iii, illustrating the fact that, even without the effect of clustering, the total flow in processing a query may be affected by the location of the site at which the query is originated. The term noncritical will be used in association with a cross-referencing operation or a semi-join, in the context of distributed query processing, when discussing the optimization model. A more formal definition will be provided in the next chapter.

The next remark is about the materialization issue. Figure 2.3 depicts a user, at site t, who originates a query that may be answered through the access and cross-referencing of CRDUs R1, R2 and R3. The figure depicts a multiple-copy situation. The query need only access a non-redundant instance of the set {R1, R2, R3}; this is referred to as a static materialization*. Such materializations are not known in advance; therefore, at the design stage, the materialization problem must be addressed jointly with the distribution problem.

Finally, stringent physical constraints may have to be considered at the design stage. This is particularly true if the supporting computer network of the distributed system does not include communication links which feature virtually infinite flow capacity characteristics. For instance, if static routing is used to direct the traffic in the network, it may not be practical to always follow the shortest path from a node i to a node j whenever some communication is to take place between node i and node j. Consequently, any isolated communication between i and j may be associated with a set of superposing paths from i to j.
If adaptive routing techniques are used, the flow is forced to follow the least busy routes in the network. At the design stage, for both routing techniques mentioned, finite flow capacity may affect the distribution of the CRDUs. This issue is most easily dealt with when a descriptive

*As opposed to a dynamic materialization [WONG 81], which may include redundant copies and which may allow the query processing to take advantage of parallel processing of those redundant copies.

mathematical model is available. The computational methods applied to allocation problems are usually based on integer or mixed-integer programming algorithms. A few of them are surveyed below*.

The scope of the problem analyzed in [MAHM 76] is larger than those of most file allocation formulations. It includes a wider description of the computer network, such as link and node reliabilities, and file availability. Time delays are also considered explicitly. The final model is a nonlinear integer program. A nonlinear branch-and-bound integer programming technique is adopted for its solution; it turns out to be unsuccessful for problems that are not of small size. In [KHAB 80], the file allocation problem is combined with the communication network design (i.e. its topology and channel capacity assignment). The communication cost is identified by a nonlinear integer (0-1) expression, because the flow is estimated in terms of the zero-one allocation variables and probabilities associated with the access policy. Furthermore, the number of copies for each file is fixed by the file availability requirements. Hence, the proposed design approach leads to a simple form of the allocation model. The file allocation routine is based upon a greedy algorithm that has been tested with as many as ten sites. In [MORG 77], the problem includes the allocation of programs as well as files. The associated mathematical model is a mixed-integer linear program. In the solution of this model, each possible file assignment is represented by a vertex on an n-dimensional hypercube. A partial search is initiated with any feasible file assignment. A sequence of file assignments is generated by branching forward on the hypercube. Through the execution of the algorithm, a feasibility test is consulted to determine whether some portions of the hypercube can be discarded.
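For intuition, the classical (independent-file, file-to-user only) allocation problem that these formulations address can be stated as a toy exhaustive search over placements. The network, request volumes and unit costs below are hypothetical, and real instances require the branch-and-bound or heuristic machinery surveyed in this section:

```python
from itertools import product

# Hypothetical 3-site network: comm_cost[i][j] = cost per unit of volume from i to j.
comm_cost = [[0, 1, 2],
             [1, 0, 1],
             [2, 1, 0]]
# query_volume[s][f] = volume of file f requested per period by users at site s.
query_volume = [[5, 1],
                [2, 4],
                [1, 1]]
n_sites, n_files = 3, 2

def cost(assignment):
    """Total file-to-user communication cost of placing file f at site assignment[f]."""
    return sum(query_volume[s][f] * comm_cost[assignment[f]][s]
               for s in range(n_sites) for f in range(n_files))

# Enumerate every single-copy placement (one vertex per assignment, as in [MORG 77]).
best = min(product(range(n_sites), repeat=n_files), key=cost)
print(best, cost(best))  # (0, 1) 6
```

Because files are assumed independent and unreplicated, the cost is separable per file and the search space is only n_sites**n_files; replication, file-to-file flows and clustering, as introduced below for CRDUs, destroy this separability.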
The model described in [FISH 80] is based on simple file-to-user accesses and on the premise that the whole database can be represented by one file. Overall it features the simplest assumptions in terms of database and network requirements. Nevertheless, the problem leads to a mixed-integer program as in [MORG 77], and the algorithm presented is valuable in its own right regardless of the initial assumptions. The algorithm is based on heuristic building blocks for

*None of them deals with the issue of cross-referencing in distributed query processing.

generating feasible solutions. A Lagrangian relaxation approach is followed for obtaining lower bounds on the cost of an optimal solution. [FISH 80] also includes the development of a dominance test that can be used in a preprocessing step to reduce the size of the problem before submitting it to a branch-and-bound algorithm. This algorithm has been shown to withstand relatively large-scale testing. It has been proved in [ESWA 74] that the classical file allocation problem is NP-complete. Since then, researchers have encouraged the use of heuristics to solve such problems [CHAN 77, RAMA 79a]. Also, some possible pragmatic techniques applicable to data allocation have been described in [BUCK 79, HAMM 79, PALM 79].

In the following chapter (Section 3.1) a formal definition of some of the technical terms encountered so far* is given. The problem of interest is then worded in a more precise way (Section 3.2), and a solution procedure is sketched (Section 3.3). Other technical terms are briefly defined in Appendix A in casual words. For more complete definitions, the references should be consulted.

Figure 2.1. The effect of data clustering on distributed query processing. [Figure legend: local join (no inter-site move, no delay); participation in a semi-join (no inter-site move, finite delay); participation in a semi-join (inter-site move, finite delay); semi-join.]

Figure 2.2. Noncritical semi-joins in distributed query processing. [Figure legend: reducing flow; saved flow.]

Figure 2.3. The materialization problem.

CHAPTER 3
PRELIMINARY FRAMEWORK

This chapter is divided into three sections. Section 3.1 contains basic definitions with appropriate references. Section 3.2 provides a concise statement of the problem of interest in more formal terms. Section 3.3 proposes general guidelines to tackle the problem.

3.1. Definitions

In this section, definitions of the terms functional dependency (FD), closure and covers of a set of FDs, relation scheme, and database schema are presented.

3.1.1. Functional dependency and database schema

Let U be a universe of attributes. Every attribute ai in U takes its values from a domain set DOM(ai). Let X = {a1, ..., an} and Y = {b1, ..., bm} (X ⊆ U and Y ⊆ U) be ordered sets of attributes. Suppose there exists a function f,

f: DOM(a1) × DOM(a2) × ... × DOM(an) → DOM(b1) × ... × DOM(bm),

i.e. for any x, x' ∈ DOM(a1) × ... × DOM(an), at any time, if x = x', then f(x) = f(x'). A functional dependency (FD) is said to hold from X to Y, denoted by X → Y, if there exists such a function f. More casually, f is said to hold on X ∪ Y. We also say that X determines Y. X ∪ Y is called the attribute scheme of f and is denoted by S_f. If both X → Y and Y → X hold, we write X ↔ Y. We will use capital letters from the end of the alphabet, such as X, Y, Z, to denote sets of attributes. Single attributes will be denoted with

lower-case letters from the beginning of the alphabet, such as a, b. FD inference rules, known as Armstrong's rules, have been studied in the literature [ARMS 74]. These are:

FD inference rules
FD 1 (Reflexivity): Y ⊆ X yields X → Y,
FD 2 (Augmentation): Z ⊆ W and X → Y yield XW → YZ (where XW stands for X ∪ W),
FD 3 (Transitivity): X → Y and Y → Z yield X → Z,
FD 4 (Pseudotransitivity): X → Y and YW → Z yield XW → Z,
FD 5 (Union): X → Y and X → Z yield X → YZ,
FD 6 (Decomposition): X → YZ yields X → Y and X → Z.

A set of inference rules is said to be complete if it implies rules FD 1-6. Relaxed versions of FD 1 and FD 2, together with FD 4, can be shown to constitute a complete set of inference rules [KAMB 78]. These are: (i) X → X, (ii) X → Y yields XW → Y, (iii) X → Y and YW → Z yield XW → Z.

The closure F+ of a set of FDs F is the set of all FDs that can be inferred from the FDs in F using a complete set of inference rules. Given two sets of FDs F and G, F is a cover of G if F+ = G+. A set of FDs F is nonredundant if there is no set of FDs G properly contained in F with G+ = F+ [MAIE 80]. A set of FDs F is minimum if there is no set G with fewer FDs than F such that G+ = F+ [MAIE 80]. A set of FDs F is L-minimum if: (1) F is minimum, and (2) for every FD X → Y in F, there is no X' ⊂ X with X' → Y in F+ [MAIE 80].
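Although F+ itself can be exponentially large, membership of an individual FD X → Y in F+ can be decided with the standard attribute-closure computation, which repeatedly fires the rules above until no new attributes are derived. A minimal sketch; the encoding of FDs as (left-hand-side, right-hand-side) pairs of frozensets is our own convention, not the thesis's notation:

```python
# Attribute closure: compute X+ under a set of FDs, then decide X -> Y in F+
# by checking Y ⊆ X+. Each FD is a (lhs, rhs) pair of attribute frozensets.

def attribute_closure(x, fds):
    """Return the set of all attributes determined by x under fds."""
    closure = set(x)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closure and not rhs <= closure:  # FD fires
                closure |= rhs
                changed = True
    return closure

def implies(fds, lhs, rhs):
    """True iff lhs -> rhs belongs to the closure of fds."""
    return set(rhs) <= attribute_closure(lhs, fds)

fds = [(frozenset("A"), frozenset("B")),   # A -> B
       (frozenset("B"), frozenset("C"))]   # B -> C
assert implies(fds, {"A"}, {"C"})          # transitivity gives A -> C
assert not implies(fds, {"B"}, {"A"})
```

The same test also decides the cover relation: F is a cover of G exactly when each set derives every FD of the other.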

A set of FDs F is LR-minimum if it is L-minimum, and replacing any FD X → Y in F by X → Y', with Y' ⊂ Y, alters the closure of F [MAIE 80].

As in [BISK 79], a relation scheme is defined to be a pair <X,E> where X ⊆ U and E ⊆ F+ is a set of FDs holding on attributes of X, i.e. if R → S ∈ E, then R ∪ S ⊆ X. Note, however, that even though R → S may hold on R ∪ S, <R ∪ S, ∅> is a legitimate relation scheme, where ∅ denotes the empty set. If only the first entry or the second entry of the pair <X,E> is relevant, we use the notation <X,> and <,E> respectively. The relation scheme <U,F> is called the universal relation scheme.

A database schema is a set D = {<X_1,F_1>, ..., <X_n,F_n>} where X_1 ∪ ... ∪ X_n = U and (F_1 ∪ ... ∪ F_n)+ = F+, and the FDs of F_i hold on X_i. Note that this definition is more restrictive than the one in [BISK 79]. A database subschema is a set D' = {<Y_1,E_1>, ..., <Y_m,E_m>} where (1) the set Z defined as Z = Y_1 ∪ ... ∪ Y_m satisfies Z ⊆ U, (2) the set E defined as E = E_1 ∪ ... ∪ E_m satisfies E+ ⊆ F+, and (3) the FDs of E_i hold on Y_i.

For X = {a_1, ..., a_k} ⊆ U (and X is ordered), a finite set r ⊆ DOM(X) = DOM(a_1) × ... × DOM(a_k) is defined to be a relation on X. We also denote such a relation by r[X]. Each element t of r is a tuple of values defined on X. (A tuple t can be seen as a function from the set of attributes X to DOM(X).) An FD Y → Z is said to hold in a relation r defined on X if it holds on some subset of X, i.e. Y ∪ Z ⊆ X. Note that for any two tuples t and s in r, if t(Y) = s(Y) then t(Z) = s(Z). The class of instances described by the relation scheme <X,E> is I_<X,E> = {r: r is a relation on X, and all FDs of E hold in r}. An element of I_<X,E> is called a relation instance. A database schema D = {<X_1,F_1>, ..., <X_n,F_n>} describes the following class of instance sets:

I_D = {{r_1, ..., r_n}: r_i ∈ I_<X_i,F_i> for i = 1, ..., n}.

If X' ⊆ X then t(X') denotes the subtuple containing the |X'| components of t corresponding to the elements of X'. The projection, over Y ⊆ X, of a relation r defined on X is r[Y] = {t': t ∈ r and t' = t(Y)}. The natural join of two relations, r[X] and s[Y], denoted by r[X ∩ Y]s, or simply r[]s, is r[X ∩ Y]s = {t: t(X) ∈ r[X] and t(Y) ∈ s[Y]}. The semi-join of relation s[Y] with relation r[X], denoted by r[X ∩ Y>s or simply r[>s, is r[>s = (r[X ∩ Y])[]s.

A database schema D = {<X_1,>, ..., <X_n,>} is said to feature (or satisfy) the lossless join property, or display a join dependency, with respect to <U,F> if, for all r ∈ I_<U,F>, r = [] r[X_i], where the notation on the right-hand side of the equality means that all the relations are joined together. This notation is appropriate because the join operation is associative and commutative. Let D' be a subschema {<Y_1,E_1>, ..., <Y_m,E_m>}, and let Z = Y_1 ∪ ... ∪ Y_m and E = E_1 ∪ ... ∪ E_m. D' is said to satisfy the lossless join property with respect to <Z,E> if for all r ∈ I_<Z,E>, r = [] r[Y_i].

The concept of universal relation scheme or, simply, universal relation (UR), has been discussed by many researchers; see for instance [BEER 81, ULLM 82, FAGI 82, MAIE 83]. The implications of such a concept are controversial. However, there is some agreement that the universal relation is both useful and practical [MAIE 83]. Furthermore, to quote [BEER 81]:

In almost any "real world" situation, a single join dependency suffices, together with some functional dependencies, to define the legal relations that might be the universal relation at some time.

In order to render the concept of universal relation acceptable, the following assumptions are made:

Assumption UR-1: If an FD holds from X to Y, then it is unique. In particular, X → X denotes the identity FD over X.

Assumption UR-2: The attributes are named in such a way that rule FD 5 is satisfied.

Assumptions UR-1 and UR-2 imply that sufficient renaming of attributes needs to be made.

Assumption UR-3: Any database schema D satisfies the lossless join property with respect to <U,F>.

Assumptions UR-1 and UR-3 are discussed in [ULLM 82], and Assumption UR-2 is necessary if problems pointed out in [ATZE 82] are to be avoided.

3.1.2. F-graphs

In this section, a graphical means of depicting the FDs of the universal relation scheme <U,F>, first used in [LOZI 78], is introduced. Let <U,F> be a universal relation scheme where F = {f_1, ..., f_k}. Let G = <N_G,A_G> be a bipartite graph where N_G = V ∪ W for some V and W, and A_G = A_v ∪ A_w for some A_v and A_w, with V ∩ W = ∅ and A_v ∩ A_w = ∅. G is said to be the F-graph of <U,F> if and only if: for all f_i ∈ F, where f_i stands for X_i → Y_i (for some X_i, Y_i ⊆ U), there exist (1) V^i = {v_j: a_j ∈ (X_i ∪ Y_i)}, where v_j denotes the only node which is labelled a_j, (2) w_i ∈ W, unlabelled and only associated with f_i, (3) A_v^i = {(v_l,w_i): a_l ∈ X_i}, (4) A_w^i = {(w_i,v_m): a_m ∈ Y_i}, and further:

(1) V = V^1 ∪ ... ∪ V^k, (2) W = {w_1, w_2, ..., w_k}, (3) A_v = A_v^1 ∪ ... ∪ A_v^k, (4) A_w = A_w^1 ∪ ... ∪ A_w^k.

Any FD f_i ∈ F is said to be explicitly represented (or simply, explicit) in G. V is called the set of v-nodes, and W the set of w-nodes. Given a node x, the set of predecessors of x, π(x), is π(x) = {y: (y,x) ∈ A_G}, and the set of successors of x, σ(x), is σ(x) = {y: (x,y) ∈ A_G}.

A v-node y is reachable [LOZI 80] from a set of v-nodes V' ⊆ V if: (1) y ∈ V', or (2) there exists w_j ∈ W with (w_j,y) ∈ A_w such that all v ∈ π(w_j) are reachable from V'. A w-node y is reachable from a set of v-nodes V' ⊆ V if all the nodes of π(y) are reachable from V'. A set of v-nodes V' is said to lead to another node y if y is reachable from V'. A w-node x of an F-graph G leads to another node y of G if y is reachable from π(x). A set of w-nodes W' leads to another node y if all x ∈ W' lead to y. In general, a set of nodes X leads to a set of nodes Y (we write X > Y) if X leads to every node of Y. A source of a set of v-nodes V' is defined to be any w-node w_s that leads to V'.

If v_i is a v-node and a_i its attribute label, then a_i is also denoted by Att(v_i). If V' = {v_1, v_2, ..., v_p} is a set of v-nodes, {Att(v_i): v_i ∈ V'} is denoted by Att(V'). Given a w-node w_s and a set of v-nodes V' such that w_s > V', Q ⊆ U is a key of Att(V') if and only if: (1) Q = Att(π(w_s)), and (2) there is no w-node w' with w' > V' and π(w') ⊂ π(w_s).
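The reachability rule above is an AND/OR condition: a w-node fires only when all of its predecessors are already reachable. It can be evaluated by a simple fixed-point iteration, which is essentially the basis of the ReachAll algorithm used throughout Chapter 4. A sketch, in which w-nodes are modeled plainly as (predecessor-set, successor-set) pairs (an encoding chosen here for illustration, not the thesis's data structure):

```python
# A bipartite F-graph: v-nodes are attribute labels; w-nodes are indices,
# one per explicit FD. pred[w] and succ[w] play the roles of pi(w) and
# sigma(w) in the text.

class FGraph:
    def __init__(self):
        self.pred = []   # pred[w]: v-nodes with an arc into w   (A_v arcs)
        self.succ = []   # succ[w]: v-nodes that w points to     (A_w arcs)

    def add_fd(self, lhs, rhs):
        """Depict the explicit FD lhs -> rhs with a fresh w-node."""
        self.pred.append(set(lhs))
        self.succ.append(set(rhs))
        return len(self.pred) - 1

    def reachable(self, v0):
        """All v-nodes reachable from the v-node set v0 (Sect. 3.1.2)."""
        reach, fired = set(v0), set()
        while True:
            ready = [w for w in range(len(self.pred))
                     if w not in fired and self.pred[w] <= reach]
            if not ready:
                return reach
            for w in ready:          # w-node fires: all predecessors reached
                fired.add(w)
                reach |= self.succ[w]

    def leads_to(self, v0, targets):
        """v0 > targets: every target v-node is reachable from v0."""
        return set(targets) <= self.reachable(v0)

g = FGraph()
g.add_fd("AB", "C")   # AB -> C
g.add_fd("C", "D")    # C  -> D
assert g.leads_to({"A", "B"}, {"D"})   # {A,B} > D through both w-nodes
assert not g.leads_to({"A"}, {"C"})    # A alone does not fire AB -> C
```

Note that on an F-graph depicting F, this computation coincides with the attribute closure: V_i > V_j exactly when X_i → X_j ∈ F+, a fact used repeatedly in Section 4.2.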

If a set of FDs F is LR-minimum, then the corresponding F-graph is said to be minimum. An F-graph is said to be lossless if the database schema defined by D = ∪_{f_i ∈ F} {<S(f_i),{f_i}>} satisfies the lossless join property with respect to <U,F>.

3.1.3. Link and envelope

Defined below are a link (we will compare our definition with that of [LOZI 80]) from a set of w-nodes to a set of v-nodes, and an envelope for a set of v-nodes.

Suppose there exists a subgraph L(w_s,v) = (V_L ∪ W_L, A_vL ∪ A_wL) of G, where w_s ∈ W and v ∈ V and w_s > v in L, but for any node y ∈ V_L ∪ W_L, w_s does not lead to v in L − {y}; and further, if any arc e is deleted from A_vL ∪ A_wL, then w_s does not lead to v in L − {e}. Then L(w_s,v) is called a proper link from w_s to v. We say that any arc in L is a linking arc from w_s to v with respect to A_vL ∪ A_wL. w_s and v are respectively called the source and the sink of L. If L^1(w_s,v), ..., L^r(w_s,v) stand for all the proper links from w_s to v, then L(w_s,v) = L^1(w_s,v) ∪ ... ∪ L^r(w_s,v) is simply called the link from w_s to v. The link L(w_s,T) from w_s to T ⊆ V is defined by: L(w_s,T) = ∪_{v ∈ T} L(w_s,v). An F-graph L'(w_s,T) defined by L'(w_s,T) = ∪_{v ∈ T} L^{i_v}(w_s,v), i_v ∈ {1, ..., r}, is called a sub-link of L (or simply, a sub-link). If for any node y in L' we have that w_s does not lead to T in L' − {y}, then L' is said to be proper. Given W_S ⊆ W, the link from W_S to T is L(W_S,T) = ∪_{w ∈ W_S} L(w,T).

An FD f: X → a_m is said to be a base-FD in an F-graph G if there exists w ∈ W such that X = Att(π(w)) and a_m ∈ Att(σ(w)). Given the F-graph G of a universal relation scheme <U,F>, an FD f: X → Y, f ∈ F+, is said to be induced by a sub-link L(W_S,T) if X = ∪ {Att(v): v ∈ V'} where V' =

∪_{w ∈ W_S} π(w), and Y = ∪ {Att(v): v ∈ T}. If such an FD f is not explicit in G, it is said to be implicitly represented (or simply, implicit) in G. The notation f(w_s,v) will denote the FD induced by a proper link L(w_s,v).

The length of a proper link L(w_s,v_t) is defined to be the maximum number of w-nodes, including w_s, encountered on a directed path from w_s to v_t. If there exist two or more distinct proper links from w_s to v_t, two of which are L+(w_s,v_t) and L−(w_s,v_t), then the pair [L+(w_s,v_t); L−(w_s,v_t)] is said to be a doublet. Let f+(w_s,v_t) and f−(w_s,v_t) be the respective representations of L+ and L− in the F-graph G. Then, by Assumption UR-1, f+ and f− must be identical.

Let L(w_s,v_t) be a proper link. Let C ⊆ V_L be such that: (1) C > v_t and, (2) for any v_i ∈ C, C − {v_i} does not lead to v_t. Define C' ⊆ W_L as C' = ∪_{v_i ∈ C} σ(v_i). A cutset Γ of L is defined as: Γ = {(v_i,w_j): v_i ∈ C and w_j ∈ C'}. By abuse of notation, we write Γ = (C,C').

Given an F-graph G of a universal relation scheme <U,F>, and given a set of v-nodes T, an envelope K(T) of T is a subschema K(T) = {<Y_1,E_1>, ..., <Y_m,E_m>} where, (1) for any <Y_i,E_i> ∈ K(T), there exists f_i ∈ F+ such that E_i = {f_i} and Y_i = S(f_i),

(2) for any v_i ∈ T, there exists <,{f_i}> ∈ K(T) such that Att(v_i) ∈ S(f_i), and (3) K(T) satisfies the lossless join property with respect to <Y_1 ∪ ... ∪ Y_m, E_1 ∪ ... ∪ E_m>.

Such an envelope K(T) is always guaranteed to exist but is not necessarily unique. In the next chapter, we will show that, for any T, for Condition (3) to be satisfied, it suffices to have some set of attributes Y_i be such that Y_i → Att(T) holds. In the worst case, Y_i may have to be identical to Att(T). We will also show that, given a sub-link L(w_s,T), the set of relation schemes defined as ∪_{(w_s,v_i) ∈ A_L} {<S(f(w_s,v_i)), {f(w_s,v_i)}>} is an envelope of T. However, we will see that such a sub-link, which has a single source w_s for all its v-nodes, can only be guaranteed to exist in a special kind of lossless F-graph. We will also show that Assumption UR-3 requires that, for any T, at least one envelope K(T) can be determined for which all the FDs are in F rather than in F+, as would be the case more generally. Further, a universal relation scheme <U,F> which satisfies Assumption UR-3 must be depicted with a lossless F-graph.

3.1.4. Distributed query processing sequence

Let R_1, ..., R_l denote l relations. Let R represent the list of these relations, i.e. R = (R_1, ..., R_l). A distributed processing sequence is a sequence R^n, n = 0, 1, ..., K, such that R^{n+1} = T^{n+1}[R^n], n = 0, 1, ..., K − 1, where T^{n+1} is a transformation that consists of semi-joins executed with respect to some pairs of relation instances of the form (R_i^n, R_j^n) at step n + 1, and where R^0 is obtained from R after all initial projections and selections have been performed for a given query. The final result of the sequence is the join of the components of R^K, (R_1^K) [] ... [] (R_l^K). Any semi-join R_i^{K−1} [> R_j^{K−1} contributing to the transformation T^K is said to be a noncritical semi-join with respect to the sequence. All the other semi-joins are said to be critical. In Figure 2.2 (Chapter 2), a distributed query processing

sequence was depicted using a directed graph.

3.2. Initial problem formulation

An informal summary of the problem follows. User-provided inputs include:

(1) A universal relation scheme which consists of (i) a universe of attributes, (ii) a set of functional dependencies (FDs) holding on the attributes, and (iii) some statistical information for each attribute, the domain from which the attribute gets its values, and for each FD.

(2) A computer network with tightly or loosely connected nodes. It is assumed that the communication network has a fixed topology and that there is enough over-all channel capacity. Communication costs and channel capacities are assumed to be given.

(3) The user's scheduled queries and updates. A query is assumed to consist of a target list of attributes, a usage frequency, and a set of qualification clauses holding on attributes. An update is assumed to consist of a target list of attributes, a type, i.e. whether the update is a "delete" or an "add", and an occurrence frequency.

The first phase of the problem must consist of identifying the FDs which will allow the solution of every query and which will enhance the performance of distributed query and update processing. The second phase of the problem must result in a distribution of FDs over the network which minimizes the operational communication cost.

The problem is expressed in formal terms below. We are given the following:

(1) A universal relation scheme <U,F>, where U is a set of attributes and F a set of FDs which hold on U (Assumptions UR-1, UR-2 and UR-3 hold on <U,F>), and Ω a set of statistical information for each attribute and each FD.**

(2) The graph of a computer network Γ = (N,A), where N stands for the set of nodes, identified with computer sites, and A stands for the set of edges, identified with communication channels. The graph has a fixed topology, and the network has appropriate channel capacities.† Transmission cost characteristics are given.

(3) For each user site t ∈ N, a set of queries is given, defined by

**See Section 4.6.
†This means that there is enough channel capacity overall to accommodate the total traffic.

Q_t = {Q_tk = <T_tk, v_tk, C_tk>, 1 ≤ k ≤ K_t},

where K_t is a positive integer, T_tk = (j_tk^1, j_tk^2, ..., j_tk^|U|), and j_tk^i = 1 indicates that the i-th attribute a_i ∈ U is referenced in the query, j_tk^i = 0 otherwise. v_tk is the frequency of occurrence of query Q_tk, and C_tk is a set of qualification clauses specified with respect to the universal scheme <U,>. For each user site s ∈ N, a set of updates is given, defined by

U_s = {U_sk = <T_sk, v_sk, TYPE>, 1 ≤ k ≤ K_s},

where K_s is a positive integer, T_sk is defined as T_tk with the subscript s substituted for the subscript t, and j_sk^i = 1 indicates that the i-th attribute a_i ∈ U is referenced in the update. v_sk is the frequency of occurrence of U_sk, and TYPE is either "ADD" or "DELETE". (A query Q_tk or an update U_sk is called a transaction. Given a transaction Q_tk or U_sk, the set T = {a_i: a_i ∈ U and j^i = 1} is called the target list of the transaction. The corresponding set of v-nodes in the F-graph is referred to as the set of target nodes.)

The following is to be achieved: (1) For each query Q_tk, find an envelope K_tk ⊆ {<,{h_i}>: h_i ∈ F+} whose FDs contain the attributes referenced in Q_tk. (2) Distribute on Γ the CRDUs mapped to ∪_{Q_tk} K_tk so as to minimize the total communication cost.

3.3. Procedure towards a solution

The task of organizing CRDUs in the computer network is divided into two parts: (1) the determination of envelopes for all queries, i.e. the selection of appropriate FDs, and (2) the distribution of the CRDUs mapped to the selected FDs. The main steps according to which the distributed database design problem is addressed are outlined below:

PART 1: Selection: From users' FDs to envelopes

Given an initial set of FDs, F, a universe of attributes U, and a set of transactions for each user node, the following is to be achieved: for each query, one envelope which leads to the efficient processing of that particular query is to be determined. The following approach

is proposed:*

‡Step-1: Given an F-graph, build a minimum F-graph G (Algorithm Reduce). The full reduction of the initial user-given F-graph leads to an extensive decrease in processing effort later on in the design method. It is shown that Algorithm Reduce is slightly less complex than [MAIE 80]'s algorithm. Given an F-graph G, make minor modifications to satisfy losslessness, as required by Assumption UR-3. This step ensures that a query referencing any set of attributes can be solved correctly. Algorithm NoLoss addresses the problem as a simple covering problem.

†Step-2: Given a set of target nodes (for every query), find the sets of w-nodes from which the target nodes are reachable (Algorithm GetSources [LOZI 80]); these w-nodes are referred to as source nodes. Find the nearest set(s) of source nodes (Algorithm GetNearestSource [LOZI 80]). Find the link from the source nodes to the target nodes. This is accomplished by Algorithm GetLink, which is a corrected version of Algorithm 2 in [LOZI 80]. This step consists of extracting, from the F-graph, the subgraph which is sufficient to solve the query under consideration.

‡Step-3: For each query, given a link from a set of source nodes to a set of target nodes, extract FDs that cover the target list, contain a minimum number of unreferenced attributes, and lead to low-cost distributed query processing. Thus, each query is assigned its own envelope. Each FD in an envelope is mapped to a CRDU.

Step-4: Determine input data volumes in preparation for PART 2. A distributed query processing algorithm such as [CHUN 83]'s may be used.

*Although the approach based on the synthesis of a distributed database through the distribution of CRDUs mapped to FDs is original, some steps of its implementation make use of existing algorithms. For this reason we attach a double dagger (‡) to indicate an original contribution, and a dagger (†) to indicate that modifications have been made to some known algorithms.
Other algorithms or techniques are appropriately referenced.
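The per-query and per-update inputs consumed by the steps above are the transactions of Section 3.2: an incidence vector over U, a frequency, and, for queries, qualification clauses. A minimal sketch of such an input record; the field names and the toy universe are our own illustration, not the thesis's notation:

```python
# A scheduled transaction <T, v, C> of Section 3.2: an incidence vector
# over the universe U, an occurrence frequency, and (for queries) a set
# of qualification clauses. All names here are illustrative only.

from dataclasses import dataclass, field

U = ["emp", "dept", "site", "salary"]      # a toy universe of attributes

@dataclass
class Transaction:
    incidence: list                        # j^i = 1 iff attribute a_i is referenced
    frequency: float                       # v_tk (query) or v_sk (update)
    kind: str = "QUERY"                    # or "ADD" / "DELETE" for updates
    clauses: list = field(default_factory=list)

    def target_list(self):
        """The set {a_i in U : j^i = 1} of referenced attributes."""
        return {a for a, j in zip(U, self.incidence) if j == 1}

q = Transaction(incidence=[1, 1, 0, 0], frequency=5.0)
assert q.target_list() == {"emp", "dept"}
```

The target list computed here is what Step-2 maps to target v-nodes in the F-graph.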

PART 2: Distribution: Covering the network with CRDUs

‡Step-5: Given a set of CRDUs, find a minimum-cost distribution and materialization.

Step-6: At this stage in the design, for each query, a database subschema that satisfies the lossless join property has been obtained. Since redundant FDs may have been allocated at any one site during Step-5, a nonredundant subset (not just any nonredundant cover) of FDs must be identified at each site. Relations are then synthesized from FDs using [KAMB 78]'s algorithm.

In Chapter 4 the problems of depicting FDs, finding envelopes for target nodes, and selecting the FDs that will be mapped to the CRDUs in PART 2 are addressed.
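The lossless join property that Step-6 relies on, and the semi-joins of the processing sequences in Section 3.1.4, can be made concrete with small executable versions of the relational operations. A sketch over relations modeled as lists of attribute-to-value dicts (an encoding chosen here purely for illustration):

```python
# Projection, natural join, and semi-join over relations modeled as lists
# of attribute->value dicts, plus a single-instance check of the lossless
# join property: r should equal the join of its projections.

def project(r, attrs):
    seen = {tuple(sorted((a, t[a]) for a in attrs)) for t in r}
    return [dict(items) for items in seen]

def join(r, s):
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in set(t) & set(u)):
                out.append({**t, **u})
    return out

def semi_join(s, r):
    """r[>s: keep the tuples of s that join with at least one tuple of r."""
    return [u for u in s
            if any(all(t[a] == u[a] for a in set(t) & set(u)) for t in r)]

# With A -> B holding, the decomposition {AB, AC} of ABC joins losslessly:
r = [{"A": 1, "B": 2, "C": 3}, {"A": 1, "B": 2, "C": 4}]
rejoined = join(project(r, {"A", "B"}), project(r, {"A", "C"}))
canon = lambda rel: sorted(tuple(sorted(t.items())) for t in rel)
assert canon(rejoined) == canon(r)
```

Reducing a relation by a semi-join before shipping it, as in the sequences of Section 3.1.4, shrinks the volume that crosses the network while leaving the final join result unchanged.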

CHAPTER 4

DATA MODELING

The first section of this chapter begins with the statement of a known result pertaining to the lossless join property of a set of relation schemes. In Section 4.2, properties of F-graphs are studied and Step-1 of the distributed database design problem is addressed. A set of rules to simplify F-graphs that depict extraneous FDs is then proposed. This problem is related precisely to the problem of finding LR-minimum covers for a set of FDs. Section 4.3 deals with F-graph manipulation algorithms for Step-2. Section 4.4 is devoted to the modeling and solution of Step-3, the envelope optimization problem. Section 4.5 discusses some important issues of consistency preservation associated with the updating of FDs in a database. Section 4.6 closes this chapter with a complete summary of the data modeling approach, and with illustrative examples.

4.1. Foundation of the distribution approach

It was mentioned, earlier in Chapter 1, that each CRDU, an atomic unit of distribution, is associated with one FD. Supporting arguments for this choice were given. This section consolidates the role of FDs in database distribution. Here, the database distribution philosophy is based on vertical distribution (i.e. with respect to attribute types, exploiting the "projection" operation), and on the synthesis approach [BERN 76, KAMB 78]. The explanation for the latter statement is that, after distribution, the FDs assigned to a site can be used to build a relational database [BERN 76], a network database [HOUS 79], or a hierarchical database [DELO 79]. Horizontal partitioning (i.e. with respect to tuples, exploiting the "selection"

operation) is beyond the scope of this thesis, but could still complement the proposed approach without conflict.

The well-known result on lossless joins [AHO 77, BISK 79, LOZI 80], summarized in Theorem 1, is of central importance to our distribution approach. Given a query with target list Y, and given a database subschema D = {<X_1,>, ..., <X_n,>} where Y ⊆ X_1 ∪ ... ∪ X_n, Theorem 1 indicates how to guarantee that some relations r_1, ..., r_n, which are picked from I_<X_1,>, ..., I_<X_n,> respectively, can be joined to solve the query while preserving semantic integrity. Theorem 1 combines two corollaries reported in [BISK 79].

Theorem 1: Let D = {<X_1,H_1>, ..., <X_n,H_n>} be a database subschema and <Z,E> a relation scheme such that X_1 ∪ ... ∪ X_n = Z. Furthermore, suppose that H, where H = H_1 ∪ ... ∪ H_n, is a covering of E (H+ = E+) such that for any R → S ∈ H, there exists <X_i,> ∈ D with R ∪ S ⊆ X_i (Condition C). Then there exists <X_i,> ∈ D such that X_i → Z ∈ E+ if and only if D has the lossless join property with respect to <Z,E>.

Proof: Refer to [BISK 79].

The next corollary refers to the special case where each relation scheme in the database subschema contains a unique FD. Although Corollary 1 trivially follows from Theorem 1, it provides a criterion which is easier to use than that provided by Theorem 1. This criterion only involves FDs, and thus Corollary 1 suggests that FDs constitute natural units of distribution.

Corollary 1: Let D = {<X_1,{h_1}>, ..., <X_n,{h_n}>} be a database subschema and <Z,E> a relation scheme with X_1 ∪ ... ∪ X_n = Z, and H = {h_1, ..., h_n} such that H+ = E+, and X_i = S(h_i) for 1 ≤ i ≤ n. Then there exists a <X_i,{h_i}> ∈ D that satisfies

X_i → Z ∈ E+ if and only if D has the lossless join property with respect to <Z,E>.

From this follows a sufficient condition for an F-graph G to be lossless.

Corollary 2: If an F-graph G = (V ∪ W, A_v ∪ A_w) is a sub-link with a single source, G = L(w_s,V), then it is lossless.

The converse is not necessarily true. As a counter-example, consider the F-graph that depicts the two explicit FDs AB → C and A → D.

4.2. Step-1: Initial processing of a user F-graph

A user-provided F-graph may feature unnecessary redundancy in depicting FDs. Several steps may be taken to simplify the user F-graph while preserving its information-representation content. These steps are presented in Section 4.2.1 in the form of a detailed algorithm, Reduce. Section 4.2.2 considers the theoretical implications of Algorithm Reduce. Section 4.2.3 indicates how Assumption UR-3 may be enforced on an F-graph.

4.2.1. F-graph reduction

The following propositions state some properties of F-graphs. Typically, if the FD inference rules of Chapter 3 can be applied to some of the explicit FDs of an F-graph to obtain other explicit FDs, then the latter can be eliminated from the graph, and the graph simplified. The notation X_i will denote a set of attributes, and V_i will denote the corresponding set of v-nodes in an F-graph. If no subscript is used to denote a set of attributes, such as X, then the notation V_X will denote the set of v-nodes associated with X. We will use the fact that, in an F-graph which depicts F, the statement V_i > V_j is equivalent to X_i → X_j ∈ F+ (see [LOZI 80]).

Proposition 1: Given an F-graph G which depicts F, let v_T ∈ V be a v-node and w_T ∈ π(v_T). If there exist V_i ⊆ π(w_T) and V_j ⊆ π(w_T) such that V_i leads to V_j in G,

and V_i ∩ V_j = ∅, then the arcs in the set {(v,w_T): v ∈ V_j} are extraneous (i.e. they do not alter the closure of the FDs explicitly represented in G), and can be deleted from G.

Proof: Let V_k be defined as V_k = π(w_T) − (V_i ∪ V_j). To show that the arcs in {(v,w_T): v ∈ V_j} are extraneous, it suffices to show that X_j is superfluous in the left-hand side of the explicit FD X_i ∪ X_j ∪ X_k → A_T (1). From the hypothesis, V_i > V_j, which is equivalent to X_i → X_j ∈ F+ (2). Using Rule FD 4 on (1) and (2) leads to X_i ∪ X_k → A_T. Therefore, A_T is determined by X_i ∪ X_k without any reference to X_j. Furthermore, since V_i ∩ V_j = ∅, the arcs of {(v,w_T): v ∈ V_j} are only used to convey the participation of X_j in FD (1); hence they are extraneous.

Proposition 2: Given an F-graph G which depicts F, let the FD f_i: XY → ZS be explicit in G, and let w_i be its w-node. If the FD X → Z can be induced by a link L, and w_i ∉ W_L, then the arcs in {(w_i,v_j): v_j ∈ V_Z} are extraneous in G.

Proof: Since X → Z ∈ F+, by FD 2, XY → Z ∈ F+ without any contribution from w_i; hence the depiction of Z among the successors of w_i is redundant.

A set of rules is now proposed that can be used to simplify an F-graph that depicts extraneous explicit FDs. These reducing rules are based on the FD rules of Section 3.1.1 and exploit Propositions 1 and 2. A reduced F-graph is an F-graph on which the following rules have been performed.

RULE 1: while CONDITION Ω_0: "Some w-nodes w_1, ..., w_n have exactly the same predecessor set," do
REDUCE 1: Combine all the w-nodes w_1, w_2, ..., w_n into a unique w-node, say w_1. Delete all the arcs incident on w_2, w_3, ..., w_n. Create new arcs (w_1,v_i) for all v_i such that (w_j,v_i) ∈ A_w, 2 ≤ j ≤ n.
/* RULE 1 makes use of rule FD 5. See Figure 4.1-i. */

RULE 2: while CONDITION Ω_1: "There exist some V' and w for which V' = π(w) ∩ σ(w)," do
REDUCE 2: begin

for all v ∈ V', delete (w,v). If there is no v-node v ∉ V' such that (w,v) ∈ A_w, then delete w from W and delete all the arcs incident on w
end
/* RULE 2 uses rule FD 1. See Figure 4.1-iv. */

RULE 3: while CONDITION Ω_2: "An FD XY → Z is explicitly represented in the graph, w_i ∈ π(V_Z) is its w-node, and further X → Z is induced by a link L(W_j,V_Z), where W_j ⊆ ∪_{v ∈ V_X} σ(v) and w_i ∉ W_j," do
REDUCE 3: /* Makes use of Proposition 2. */ begin
delete all (w_i,v_j) where v_j is a v-node labelled with an attribute in Z (v_j ∈ V_Z). If there is no (w_i,v_j) such that v_j ∉ V_Z, then for all v_k ∈ V such that v_k is labelled with attributes of X or Y, delete (v_k,w_i), and delete w_i from W
end
/* See Figures 4.1-iii and 4.3. */

RULE 4: while CONDITION Ω_3: "For some w_T ∈ W, there exists a proper subset of π(w_T), V_i, that leads to another subset of π(w_T), V_j, and V_i ∩ V_j = ∅," do
REDUCE 4: /* Makes use of Proposition 1. */ begin
for all v ∈ V_j, delete (v,w_T) from A_v
end
/* See Figures 4.1-ii and 4.2. */

Conditions Ω_0, Ω_1, Ω_2 and Ω_3 can easily be implemented; here is how:

(1) Condition Ω_0: This condition simply requires that every w-node w_i be checked for the following property: given w_i, is there a w-node w_j, w_j ≠ w_i, such that π(w_i) = π(w_j)? This is answered by tracing the F-graph backwards from w_i and checking whether such a w_j exists. The time complexity for RULE 1 is O(|W|²).
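As a side note, the pairwise comparison behind Condition Ω_0 can also be carried out by hashing predecessor sets, which merges all duplicates in a single pass (expected time linear in |W| rather than the quadratic scan above). A sketch in a parallel pred/succ list encoding of the F-graph (our own encoding, one entry per w-node):

```python
# RULE 1 via hashing: merge w-nodes that have exactly the same predecessor
# set, taking the union of their successor sets (rule FD 5).

def rule1(pred, succ):
    merged = {}                          # frozen predecessor set -> new index
    new_pred, new_succ = [], []
    for p, s in zip(pred, succ):
        key = frozenset(p)
        if key in merged:
            new_succ[merged[key]] |= s   # combine into the earlier w-node
        else:
            merged[key] = len(new_pred)
            new_pred.append(set(p))
            new_succ.append(set(s))
    return new_pred, new_succ

# AC -> E and AC -> D share the predecessor set {A,C} and are combined:
pred = [set("AC"), set("AC"), set("A")]
succ = [set("E"), set("D"), set("B")]
pred, succ = rule1(pred, succ)
assert (pred, succ) == ([{"A", "C"}, {"A"}], [{"E", "D"}, {"B"}])
```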

RULE 1:
for all w_i ∈ W do
begin
  for all w_j ∈ W − {w_i} do
  begin
    if π(w_i) = π(w_j) then REDUCE 1
  end
end

(2) Condition Ω_1: This condition is the most easily verified. For each w-node w, π(w) and σ(w) are compared. RULE 2 has a time complexity of O(|W|).

RULE 2:
for all w ∈ W do
begin
  if π(w) ∩ σ(w) ≠ ∅ then REDUCE 2
end

(3) Condition Ω_2: For every w_i, this condition may be checked by trying to find a link L(W_j,V_Z) which does not contain w_i, where ∪{π(w): w ∈ W_j} is contained in π(w_i), and V_Z ⊆ σ(w_i). This is accomplished by starting from π(w_i) and by traversing the F-graph forwards in search of such a link (ALGORITHM ReachAll, Sect. 4.3, of time complexity O(|A_v| + |A_w|), can be used). Note that we do not need to identify V_Z (defined in the statement of CONDITION Ω_2) explicitly. All we need to check is whether such a V_Z exists at all. The complexity of RULE 3 is O(|W| × (|A_v| + |A_w|)).

RULE 3:
for all w_i ∈ W do
begin

  Look for the set S of v-nodes reachable (not trivially) from π(w_i) without using w_i (ALGORITHM ReachAll, Sect. 4.3)
  V_Z := S ∩ σ(w_i)
  if V_Z ≠ ∅ then REDUCE 3
end

(4) Condition Ω_3: Condition Ω_3 is slightly less obvious than the others. The reason is that one must try to look for a link (of any length) from one subset of a set of v-nodes, P, to another subset of P. (Algorithm ReachAll, Sect. 4.3, of complexity O(|A_v| + |A_w|), can be used.) However, there seem to be too many cases to check a priori. The following proposition is helpful in determining whether Condition Ω_3 is satisfied or not.

Proposition 3: Consider the formulation of Condition Ω_3. Let P = {v_1, v_2, ..., v_K} = π(w_T) for some w_T. The only subsets V_i of P that need to be considered are those for which |V_i| = K − 1.

Proof: Note that V_i = P does not need to be considered, since V_i must be a proper subset of P. For K = 1 the proposition is trivially true, and so it is for K = 2. For K = 3, {v_1,v_2} > v_3, {v_1,v_3} > v_2 and {v_2,v_3} > v_1 are the only links that need to be considered. Indeed, v_1 > v_2 does not need to be considered if {v_1,v_3} > v_2 is considered, because if {v_1,v_3} does not lead to v_2 then v_1 does not lead to v_2; otherwise, we are done. In general, consider the reachability cases V_{i_1} > v_1, ..., V_{i_K} > v_K, where for any i_k, |V_{i_k}| = K − 1 and v_k ∉ V_{i_k}. If for some i_k, V_{i_k} > v_k holds, then Condition Ω_3 is true and (v_k,w_T) can be deleted; therefore there is no need to consider the eventuality of R > v_k holding for any other R ⊆ P. On the other hand, if V_{i_k} does not lead to v_k, then for any R ⊆ V_{i_k}, R > v_k does not hold either.

Using Proposition 3, Condition Ω_3 can be checked, for a given w_T, in time of complexity O(|π(w_T)| × (|A_v| + |A_w|)). The complexity of RULE 4 is therefore:

Σ_{w ∈ W} O(|π(w)| × (|A_v| + |A_w|)) = O(|A_v| × (|A_v| + |A_w|)).

RULE 4:
for all w_T ∈ W do
begin
  P := π(w_T)
  for all v_i ∈ P do
  begin
    V_i := P − {v_i}
    find the set S of all the v-nodes reachable from V_i (using ALGORITHM ReachAll, Section 4.3)
    if v_i ∈ S then
    begin
      REDUCE 4
      P := V_i
    end
  end
end

To show that all the redundancy featured by an F-graph can be removed by a sequential application of RULEs 1-4, one must show that (i) there is no need to repeat any of RULEs 2-4 more than once, and (ii) RULEs 1-4 are "complete", in the sense that no other rule is needed to eliminate redundancy.

Proposition 4: Let the indexes i and j denote any two of 2, 3 or 4, with i different from j. Given an F-graph G, an unsuccessful attempt to apply RULE i on G may never become successful after a successful application of RULE j.

Proof: RULEs 2-4 do not alter the closure of the explicit FDs in G, and therefore do not alter reachability (>).
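The loop above, combined with Propositions 1 and 3, can be sketched directly in executable form: for each w-node, test only the (K − 1)-element subsets of its predecessor set, and drop an arc whenever the remaining predecessors still reach the removed v-node. As before, the pred/succ list encoding of the F-graph is our own illustration:

```python
# A sketch of RULE 4: for each w-node w_T with predecessor set P, if
# P - {v} still leads to v, then the arc (v, w_T) is extraneous
# (Proposition 1) and is deleted. Reachability is the fixed-point
# iteration of Section 3.1.2 (the role played by ALGORITHM ReachAll).

def reachable(v0, pred, succ):
    reach, changed = set(v0), True
    while changed:
        changed = False
        for w in range(len(pred)):
            if pred[w] <= reach and not succ[w] <= reach:
                reach |= succ[w]
                changed = True
    return reach

def rule4(pred, succ):
    for wt in range(len(pred)):
        for v in sorted(pred[wt]):          # Proposition 3: (K-1)-subsets only
            rest = pred[wt] - {v}
            if v in reachable(rest, pred, succ):
                pred[wt] = rest             # delete the extraneous arc (v, wt)

# Explicit FDs AC -> E, ABC -> D, A -> B (the example of this section):
pred = [set("AC"), set("ABC"), set("A")]
succ = [set("E"), set("D"), set("B")]
rule4(pred, succ)
assert pred[1] == {"A", "C"}   # B was extraneous in ABC -> D, since A -> B
```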

Hence, the order in which RULEs 2-4 are applied is immaterial. For performance, since RULE 3 and RULE 4 are of the same time complexity, it makes no difference whether RULE 3 precedes RULE 4 or vice versa. However, because RULE 2 is of lesser complexity than RULEs 3-4, it should be applied before the latter. RULE 1 must be repeated after the application of RULE 4, and the following example shows why.

Example: Suppose the set of explicit FDs is {AC → E, ABC → D, A → B}. RULE 1 cannot be applied. However, RULE 4 can be applied and leads to {AC → E, AC → D, A → B}. Now RULE 1 can be applied. The result is {AC → ED, A → B}.

In practice it helps to apply RULE 2 first, then RULE 1 and RULEs 3-4, then RULE 1 again, in that order. Indeed, RULE 2 is only of time complexity O(|W|) and RULE 1 is of lesser time complexity than RULEs 3-4. Further, RULE 1 will eliminate most superfluous w-nodes (and the arcs incident to them or emanating from them). The following algorithm is used to fully reduce an F-graph:

ALGORITHM Reduce: /* Fully reduce a user F-graph */
  RULE 2
  RULE 1
  RULE 3
  RULE 4
  RULE 1
end Reduce.

Proposition 5: The application of Algorithm Reduce on an F-graph G leads to an F-graph G' which depicts a nonredundant set of explicit FDs.

Proof: To prove the assertion of the proposition, one has to show that no explicit FD can be induced by the other explicit FDs using a complete set of FD rules. Suppose there exists in G' an explicit FD, f_r: R → R', which is redundant and has not been detected by RULE 2, RULE 3 or RULE 4. Note that, from the application of RULE 2, we have V_R ∩ V_R' = ∅. Because

48

fs is redundant, it can be induced by a link L(W', VR) which does not contain ws (the w-node associated with fs) and does not contain the arcs incident to ws or emanating from it. Since Condition f12 (Proposition 2) is met, fs should have been eliminated by RULE 3. ∎

Proposition 6: Algorithm Reduce has a time complexity bound of order:

O( |Av| × (|Av| + |Aw|) )

Proof: The total time complexity of Algorithm Reduce is:
complexity(RULE 1) + complexity(RULE 2) + complexity(RULE 3) + complexity(RULE 4)
= O(|W|²) + O(|W|) + O(|W| × (|Av| + |Aw|)) + O(|Av| × (|Av| + |Aw|))
= O( |Av| × (|Av| + |Aw|) ). ∎

In the next section, a correspondence is established between a reduced F-graph, i.e. an F-graph obtained as a result of the application of Algorithm Reduce, and a minimum cover of FDs.

4.2.2. Minimum F-graph

Algorithm Reduce can be used to derive an LR-minimum cover for a set of FDs as stated in the following proposition.

Proposition 7: An F-graph is minimum if and only if it is fully reduced.

Proof: Let us call H the set of explicit FDs in the fully reduced F-graph G'. Let F be the set of FDs that were explicitly represented by the original F-graph G. Algorithm Reduce simplifies or eliminates all the explicit FDs of F that can be derived from H using a complete set of FD rules (i.e. reflexivity, augmentation and pseudotransitivity). This implies that H is a nonredundant cover of F (Proposition 5). To show that it is minimum, we use a result reported in [MAIE 80], namely, that a nonredundant cover H is minimum if and only if there are no X → X̄ and Y → Ȳ in H such that X ↔ Y and X → Y can be induced by using a set of

49

explicit FDs, I, with I ∩ EH(X) = ∅, where EH(X) = {X' → V : X' → V ∈ H and X' ↔ X ∈ H+}. If for every X → X̄ and Y → Ȳ in H, X ↮ Y, then H is minimum. Else, if H is not minimum, suppose there exist X → Y and Y → X, both in H+, and we can reach VY from VX on the F-graph of H through a set of explicit FDs I, with I ∩ EH(X) = ∅. This implies that there exists a set of explicit FDs I ⊆ H which can be used to reach Y from X on the F-graph, such that for any X' → Z in I, X' ↮ X. If we take X' to correspond to a set of v-nodes on the link from X to Y such that X' → Y, then X → X'. Further, since Y → X by hypothesis, therefore X' → X too, a contradiction. Therefore, H must be minimum. Let us now show that H is also L-minimum. If it is not, then, for some explicit FD XY → Z in H, we can find an FD X → Z in H+. This implies that there exists a link from a subset of σ(VX) to VZ in G', which is a contradiction since we asserted that RULE 3 and RULE 4 cannot be applied any further. Therefore H is L-minimum. Let us assume it is not LR-minimum. There must exist an FD X → YZ in H where Y is superfluous (i.e. the arcs existing between the w-node associated with the FD and the v-nodes in VY are extraneous). This implies that X → Y ∈ H+. Therefore, a link can be found between a subset of σ(VX) and VY, without using the w-node associated with the explicit FD X → YZ. This contradicts the fact that RULE 2 and RULE 3 cannot find a range of application. Therefore H is LR-minimum, and G' is a minimum F-graph. The converse is obvious. ∎

The advantages of having an algorithm such as Algorithm Reduce are now discussed. (1) It can be noted that Maier's algorithm [MAIE 80] is of complexity O(n²) where n is the length, in number of characters, of the initial set of FDs F. Since n = |Av| + |Aw|, Algorithm Reduce seems to be of slightly lower complexity than Maier's.
However, a careful look at Maier's algorithm reveals that it is really of complexity O( max(|Av|, |Aw|) × n ). Note that Maier's algorithm features two steps of complexity O(|W| × n) (respectively for

50

nonredundancy and minimality), one step of complexity O(|Av| × n) (to enforce L-minimality), and one step of complexity O(|Aw| × n) (to enforce LR-minimality). On the other hand, Algorithm Reduce has only one step of time complexity O(|Av| × n). Further, note that |W| is no greater than n/2. Hence, Algorithm Reduce is more direct in its approach than Maier's algorithm. (2) Since the F-graph is going to be directly submitted to some processing (Sections 4.3-4), it is easier to use it from the beginning rather than resort to some indirect technique such as the one reported in [MAIE 80]. For instance, if the information structure of the universal database scheme is regularly updated or extended, RULEs 1-4 can be applied in a straightforward manner to adapt the F-graph to a changing environment. (3) The reduction of an initial F-graph decreases considerably the complexity of the algorithms used to select appropriate CRDUs for all queries (Sections 4.3-4).

From now on, it may be assumed that any F-graph under consideration depicts a set of explicit FDs which is LR-minimum. This section is closed with an example.

Example: To show the interest of considering LR-minimum covers, we depict the FDs for each FD-set below (Figure 4.4): (1) {A → B, A → C} is nonredundant, but it is not minimum since, by RULE 1, {A → BC} has fewer FDs. (2) {ABC → D, A → B} is minimum, but not L-minimum since, by RULE 4, the B can be removed from the left-hand-side of the first FD. (3) {A → AB} is L-minimum but not LR-minimum, since by RULE 2, A can be removed from the right-hand-side. ∎

It has been shown above how an F-graph, depicting FDs, can be fully reduced. The next step is to enforce Assumption UR-3 if necessary.
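The L-minimality test in case (2) of the example amounts to checking whether a left-hand-side attribute is extraneous: B is removable from ABC → D exactly when D is in the closure of {A, C}. A minimal closure-based sketch (illustrative names, not the dissertation's graph rules):

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of
    (lhs, rhs) pairs of attribute strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def extraneous_lhs(lhs, rhs, fds):
    """Left-hand-side attributes removable from lhs -> rhs without
    weakening it (the condition enforced by RULE 4)."""
    return {a for a in lhs
            if set(rhs) <= closure(set(lhs) - {a}, fds)}

fds = [("ABC", "D"), ("A", "B")]
print(extraneous_lhs("ABC", "D", fds))   # {'B'}: closure of {A,C} contains D
```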

51

4.2.3. Lossless F-graph

Let E be a given set of FDs. We write E = {X1 → Y1, X2 → Y2, ..., Xn → Yn}. We now discuss how to efficiently represent the information contained in E with an F-graph. A minimum cover, F, of E can be found using Maier's algorithm [MAIE 80], and an F-graph of F can then be constructed. An equivalent approach to that of Maier is to build the F-graph of E and then to reduce it (see Section 4.2.2). Let G = (V ∪ W, Av ∪ Aw) be the F-graph thus obtained. We will later argue that it is desirable to depict a universal scheme <U,F> with a lossless F-graph, in particular a graph of the form L(w0, V) for some w0 ∈ W. First, we prove a Corollary which states that it is always possible to do so.

Corollary 3: Given any F-graph G = (V ∪ W, Av ∪ Aw), a lossless F-graph Ḡ can be obtained from G by adding, at most, one explicit FD to G.

Proof: By Corollary 1, if a set of relation schemes D does not possess the lossless join property with respect to a universal relation scheme <U,F>, a new set of relation schemes D̄ that does satisfy the lossless join property with respect to <U,F> can be found. Namely, D̄ = D ∪ {<Y,∅>}, where Y ⊆ U and Y → U holds. We can apply a similar argument for our case. Suppose the F-graph G happens to be lossy. Let U denote the set of all the attributes, i.e. U = ∪i Xi ∪ Yi, and let V denote the set of v-nodes, i.e. U = Att(V). As will be shown below, we can always identify a subset S of V such that S > V in G. A lossless F-graph, Ḡ = (V̄ ∪ W̄, Āv ∪ Āw), can be obtained from G as follows. Create a new v-node v̄ with a new attribute which is the concatenation of all the attributes of the nodes in S. Define V̄ = V ∪ {v̄}. Create a new w-node w0. Define W̄ = W ∪ {w0}. Also create a new arc (v̄, w0). Define Āv = Av ∪ {(v̄, w0)}. Now create new arcs from w0 to every node in S. Call this set of arcs A0. Define Āw = Aw ∪ A0. The new F-graph Ḡ thus obtained is of the form Ḡ = L(w0, S) ∪ G = L(w0, V̄), where L(w0, S) stands for (S ∪ {v̄, w0}, A0 ∪ {(v̄, w0)}). By Corollary 2, Ḡ is a lossless F-graph. ∎

52

If an F-graph G is of the form G = L(w0, V), for some w0, then we say that G is lossless in the sense of Corollary 3. Without loss of generality, we always assume that an F-graph is lossless in the sense of Corollary 3. Indeed, if an F-graph is lossless, but not in the sense of Corollary 3, a new FD can be added as it is done for lossy graphs. Such an assumption is not essential; however, it is made to simplify the exposition of the proposed design algorithms.

We now argue that no semantic constraint is either added or lost when <U,F> is augmented to <U ∪ {τ}, F ∪ {f0}>. Let us denote Att(v̄) by the new attribute name τ. Suppose the set S in the proof of Corollary 3 is S = {s1, ..., sm}. Let a1, ..., am denote the attribute labels of s1, ..., sm respectively. We follow the following convention: DOM(τ) is DOM(a1) × ... × DOM(am). Further, the FD f0: τ → a1...am and the identity FD a1...am → a1...am hold in exactly the same relations. Then, given some set of attributes Y such that {a1, ..., am} ⊆ Y, the join of a relation r ∈ I<Y,∅> with a relation s ∈ I<{τ}∪{a1,...,am},{f0}> produces q = r ⋈ s ∈ I<{τ}∪Y,{f0}>. From our convention, q contains the same tuples as q[{τ} ∪ Y] or q[Y]. Our convention also implies that I<U,F+> = I<U,(F∪{f0})+>, and thus, no semantic constraint is either added or lost when <U,F> is augmented to <U ∪ {τ}, F ∪ {f0}> (equivalently, when G is augmented to Ḡ). However, the database schema { <S(f),{f}> for all f ∈ F ∪ {f0} } satisfies the lossless join property, whereas { <S(f),{f}> for all f ∈ F } does not. The importance of this latter assertion will be discussed before the end of the section.

In the proof of Corollary 3 we mentioned the problem of identifying a set of v-nodes S. This problem can be modeled and solved as a simple covering problem as follows. Algorithm NoLoss, stated below, determines a set of w-nodes, WsourceNodes, that leads to all the v-nodes of the F-graph, except those in ∪ of π(w) over w ∈ WsourceNodes. The objective is to minimize the cardinality of WsourceNodes. By so doing, the number of attributes that belong to the key of the universal relation scheme is minimized. Algorithm NoLoss is formulated as a standard

53

covering problem. Detailed implementation is omitted because solution techniques for this problem are well documented. Appropriate references on efficient solution techniques for the covering problem can be found in [ETCH 77, SALK 75, MURT 76].

ALGORITHM NoLoss: /* Modify the F-graph, if necessary, to make it lossless. */
begin
  /* Solve the following covering problem. An algorithm such as [ETCH 77]'s may be used. */
  Given p_ij = 1 if w_i > v_j or v_j ∈ π(w_i), 0 otherwise,
  find x_i = 1 if w_i ∈ WsourceNodes, 0 otherwise,
  Minimize Σ_i x_i subject to Σ_i p_ij x_i ≥ 1 for all j.

  If |WsourceNodes| > 1 then
  begin
    /* Insert a new key that corresponds to all the predecessors of the source nodes. */
    W := W ∪ {w0} /* Where w0 is a new node. */
    V := V ∪ {v̄}
    Att(v̄) is the concatenation of the attribute labels of the nodes in ∪ of π(w) over w ∈ WsourceNodes
    Av := Av ∪ {(v̄, w0)}
    for all v ∈ ∪ of π(w) over w ∈ WsourceNodes do
      Aw := Aw ∪ {(w0, v)}
  end
end NoLoss.

In Figure 4.5-ii, G' is a lossless version of G, Figure 4.5-i.

The forthcoming theorem clarifies the dependencies between the concepts of lossless F-graph, link, and envelope. In particular, the theorem states that a lossless F-graph guarantees
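The covering step of Algorithm NoLoss is stated above as an exact 0-1 program; a common cheap alternative is the classic greedy approximation, sketched below under our own naming (the text itself defers to exact techniques such as [ETCH 77]):

```python
def greedy_source_cover(reach, v_nodes):
    """Greedy sketch of the NoLoss covering step: pick w-nodes until
    every v-node is covered.  `reach[w]` is the set of v-nodes j with
    p_wj = 1 (w > v_j or v_j in the predecessor set of w)."""
    uncovered = set(v_nodes)
    chosen = []
    while uncovered:
        # pick the w-node covering the most still-uncovered v-nodes
        best = max(reach, key=lambda w: len(reach[w] & uncovered))
        if not reach[best] & uncovered:
            raise ValueError("some v-nodes cannot be covered")
        chosen.append(best)
        uncovered -= reach[best]
    return chosen

reach = {"w1": {"v1", "v2"}, "w2": {"v2", "v3"}, "w3": {"v3"}}
print(greedy_source_cover(reach, {"v1", "v2", "v3"}))
```

The greedy choice is only an approximation of the minimum cover (within a logarithmic factor), so the resulting universal key may be slightly larger than the optimum the integer program would give.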

54

that, for any set of target nodes T, an envelope K(T) can be determined using explicit FDs exclusively. Clearly, a lossless F-graph is desirable since one does not have to worry about inferring implicit FDs to determine query envelopes.

Theorem 2: Consider the following five propositions, where G always denotes an F-graph of the form (V ∪ W, Av ∪ Aw), and G depicts a universal relation scheme <U,F>.
(a) G is lossless in the sense of Corollary 3, i.e. G = L(w0, V) for some w0 ∈ W.
(b) Given G, for any T ⊆ V, there exists a link L(ws, T) for some ws ∈ W.
(c) Given G, for any T ⊆ V, there exists an envelope K(T) such that, for any <S(fi),{fi}> ∈ K(T), fi is explicit in G.
(d) G is lossless.
(e) Assumption UR-3.
The following assertions are true.
(1) (a) if and only if (b).
(2) (c) if and only if (d).
(3) If (a) then (c).
(4) If (e) then (d).
Further, if G is a minimum F-graph, then:
(5) If (d) then (e).

Proof: Assertion (1) is obvious since we are guaranteed to find at least one link with a single source, i.e. L(w0, V). Assertion (2) follows immediately from the definition of a lossless F-graph and the definition of an envelope. Assertion (3) follows from (1), the definition of an envelope, and Corollary 2: all the explicit FDs of a link L(ws, T) can be picked to produce an envelope for T. Assertion (4) follows from the fact that if Assumption UR-3 holds, then, in

55

particular, it holds for the set of relation schemes {<S(f1),{f1}>, ..., <S(fn),{fn}>} where the fi's are all the FDs in F, and from the definition of a lossless F-graph. To prove (5), we will prove the equivalent Corollary to (5):

Corollary 4: Suppose an F-graph G, which depicts a universal scheme <U,F>, is lossless and minimum. Let Ḡ be an F-graph that depicts a scheme <U,E> where E+ = F+. Then Ḡ is lossless.

Proof: We know that, if for some X0 ⊆ U, X0 → U ∈ F+, then X0 → U ∈ E+. Similarly, if for some V0 ⊆ V, V0 > V is true in G, then it is also true in Ḡ. Further, we claim that such an X0 is the left-hand-side (lhs) of some explicit FD in G. To prove the claim we will use Corollary 1 together with the definition of a lossless F-graph. Since, by hypothesis, G is lossless, therefore, by Corollary 1, there exists an explicit FD, say f0, such that X0 ⊆ S(f0) and X0 → U. If f0 is of the form f0: YX' → X''Z, where X' ∪ X'' = X0 and Y, Z are two sets of attributes, then, without loss of generality, X0 can be redefined to be YX'. This completes the proof of the claim. We now wish to show that X0 is also in the lhs of some explicit FD in Ḡ, since this, together with the knowledge that V0 > V and Corollary 1, would imply that Ḡ is lossless. We proceed as follows. Suppose X0 is not in the lhs of any explicit FD in Ḡ. Then, in Ḡ, V0 can be partitioned into at least two subsets V01 and V02, where V0 = V01 ∪ V02, and furthermore, there exist two w-nodes w1, w2 such that π(w1) = V01, π(w2) = V02, and finally, there exists no w-node which contains V0 in its entirety in its predecessor set. For the rest of the proof we will use the symbol >G to denote the reachability relation in G, and >Ḡ to denote the reachability relation in Ḡ. In G, let w0 be the w-node of FD f0. Still in G, let us pick any v-node x out of the set σ(w0). In Ḡ, suppose w1 >Ḡ x. Then, Att(V01) → Att(x) ∈ E+, therefore Att(V01) → Att(x) ∈ F+ too. But, by hypothesis, Att(V01 ∪ V02) → Att(x) ∈ F, i.e. this latter FD is explicit in G. Therefore, by Proposition 2, G is not minimum. This latter conclusion contradicts the hypothesis. A similar argument can be applied if it is assumed that w2 >Ḡ x. An identical conclusion

56

would be obtained. Therefore both assumed assertions must be false, i.e. w1 ≯Ḡ x and w2 ≯Ḡ x. Consequently, the only possibility for V0 > x to be true in Ḡ is that, for some S ⊆ W̄, where W̄ is the set of w-nodes in Ḡ, S >Ḡ x and, for any single wi ∈ S, wi ≯Ḡ x, and the union of π(wi) over wi ∈ S is contained in V0. Without loss of generality it can be assumed that S = {w1, w2}. Let wy be a w-node such that V01 ≯Ḡ wy and V02 ≯Ḡ wy, but V01 ∪ V02 >Ḡ wy. We claim that there has to be at least one such w-node wy. If not, then for any wy, V01 >Ḡ wy or V02 >Ḡ wy (we already know that V01 ∪ V02 >Ḡ wy is certainly true), and in particular, for any wx ∈ π(x), V01 >Ḡ wx or V02 >Ḡ wx, which would imply that w1 >Ḡ x or w2 >Ḡ x. However, we showed earlier that this may not occur in Ḡ. Thus the claim has been proved. Now let Wy be the set of all such possible nodes wy nearest to V0, in the sense that, for any wy ∈ Wy, π(wy) = Vy1 ∪ Vy2, for some Vy1, Vy2, where V01 >Ḡ Vy1 and V02 >Ḡ Vy2. (By assumption, Vy1 ≯Ḡ wy, Vy2 ≯Ḡ wy.) We claim that the set of w-nodes Wy leads to the v-node x, i.e. Wy >Ḡ x. To prove this claim, we construct Wy as follows. Define W0 as {w1, w2}, V1 as σ(w1) ∪ σ(w2), and W1 as the union of σ(vi) over vi ∈ V1. By induction, for some integer k, we proceed as follows. Top of the construction loop: if wik ∈ Wk and w1 ≯Ḡ wik and w2 ≯Ḡ wik, then change Wy to Wy ∪ {wik}. Continue to add new wik's to Wy whenever it is still possible to do so. If at the end of this adding process it happens that Wy ≠ ∅, then terminate the construction. This is the end of the construction loop. If however Wy = ∅, then change k to k + 1, let Vk be the union of σ(wik-1) over wik-1 ∈ Wk-1, and let Wk be the union of σ(vi) over vi ∈ Vk. Go back to the top of the construction loop. It should be clear that if the construction is continued until all the w-nodes are examined (i.e. not only the potential elements of Wy), then we would obtain the largest set of nodes wy, say W̄y. Further, from the construction, W̄y > Wy. Also, suppose that all the w-nodes wx ∈ π(x) are such that wx ∉ W̄y. Then, by definition of W̄y, either w1 >Ḡ wx or w2 >Ḡ wx. However, we showed earlier that this may not occur. Therefore wx ∈ W̄y, therefore W̄y >Ḡ wx and hence W̄y >Ḡ x. This completes

57

the proof of the claim. Now let us redefine Vy1 to be the union of the Vy1's and Vy2 to be the union of the Vy2's, over all wy ∈ Wy (where Vy1 and Vy2 were defined earlier for each w-node wy in Wy). From the latter claim, we have that Att(Vy1 ∪ Vy2) → Att(x) ∈ E+ (1), and further we know that X01 → Att(Vy1) ∈ E+ and X02 → Att(Vy2) ∈ E+ (2). Therefore, X01 → Att(Vy1) ∈ F+ too, and it is an implicit FD in G. Similarly, X02 → Att(Vy2) ∈ F+. By (1), either Vy1 >G x or Vy2 >G x, or neither of the previous two reachability propositions is true, but yet Vy1 ∪ Vy2 >G x. However, suppose Vy1 >G x. Then V0 >G x and, from Proposition 2, that implies that G is not minimum, since arc (w0, x) is extraneous, which leads to a contradiction of the hypothesis. Therefore, Vy1 ∪ Vy2 >G x, but Vy1 ≯G x and Vy2 ≯G x. Since V0 >G Vy1 ∪ Vy2 and Vy1 ∪ Vy2 >G x, therefore either the arc (w0, x) is extraneous in G, or V0 = Vy1 ∪ Vy2. In any case, the hypothesis is contradicted. The latter assertion implies that X0 is indeed on the lhs of an explicit FD in Ḡ. Therefore, by Corollary 1, Ḡ is lossless. This completes the proof of the Corollary and of the Theorem. ∎

At this point we have developed the tools that permit us to indicate our distribution approach. For each scheduled query, an envelope will be determined. By definition, an envelope features a join dependency and therefore guarantees correctness of query processing. The FDs of all the query envelopes will be distributed; and, at each site, the FDs will be locally synthesized [BERN 76] into a database.

The first part of the distributed database design problem, which consists of assigning CRDUs (mapped to FDs) to each scheduled query, may now be tackled. Section 4.3 deals with the extraction of a minimum relevant sub-F-graph that suffices to answer a particular query. The material includes some techniques already reported by Lozinskii [LOZI 80]. We will discuss some pathological problems inherent in Lozinskii's algorithms, and we will propose some remedies for them.
In particular, we will explain why our original definition of a link is relevant to those remedies.

58

4.3. Step-2: Extraction of a link

To solve a query, a relevant portion of the universal relation scheme, a query envelope, must be retained. Having shown, through Theorem 2, that it is possible to find an envelope for any set of target nodes T, and in particular, an envelope whose FDs are in some LR-minimum cover H of the user set of FDs F, we now show how to actually determine an envelope for some specific T.

Given a lossless F-graph G in the sense of Corollary 3, and given a set of target nodes T, we wish to consider the smallest set K of relevant envelopes for T, in the following sense: for any envelope K'(T) ∈ K, if some other envelope K(T) is such that K'(T) ⊂ K(T), then K(T) ∉ K. For any T there exists at least one w-node that leads to those v-nodes. From Theorem 2, (a) is equivalent to (b), and (b) implies (c). Therefore, given a lossless F-graph, in the sense of Corollary 3, and a set of v-nodes T, to determine an envelope of T, it suffices to determine the link L(ws, T), for some ws (indeed, any sub-link of L will do). Note that distinct w-nodes ws will yield distinct links L(ws, T). All the envelopes contained in the set K, mentioned above, can be produced by exploiting the implicit representation of FDs in such links L(ws, T). If an envelope can be produced by using the explicit or implicit FDs of some sub-link, we will say that the envelope is embedded in this sub-link.

To avoid having to consider envelopes that contain extraneous FDs, some nodes ws may have to be discarded. To that effect, Lozinskii [LOZI 80] described two short algorithms ("ALGORITHM 1" and "ALGORITHM 3-Stage B") which determine a set of w-nodes (let us call it SourceSet) each of which leads to a set of v-nodes T. Each ws in SourceSet is characterized by the fact that Att(π(ws)) constitutes a key for the attribute labels of T. We will refer to "ALGORITHM 1" [LOZI 80] as Algorithm GetSource, and to "ALGORITHM 3-Stage B" [LOZI 80] as Algorithm GetNearestSource.
Also, Lozinskii's "ALGORITHM 2" [LOZI 80] determines the link from a given w-node ws to a given set of v-nodes T. However, a backward pass must be added to this latter algorithm to avoid the inclusion of superfluous

59

branches. Further, Lozinskii's algorithm does not handle potential cycling problems created by "self-reachability" (of the type Vi > Vi, for some Vi ⊆ V, other than trivially). An easy cure to this problem is to prevent any node from being a candidate, for addition in the link under construction, more than once. However, a link constructed by the algorithm would then be incompatible with Lozinskii's definition of a link, where it is possible to have v ∈ π(w) and v ∈ σ(w) for some v-node v and some w-node w. Our definition of a link does not allow for the representation of self-reachability, and is compatible with Lozinskii's algorithm once this algorithm is corrected as discussed. Finally, note that even if no self-reachability exists in an F-graph, Lozinskii's algorithm (augmented with a backward pass) constructs a sub-graph which may still be incompatible with Lozinskii's definition of a link. The reason is that a link, in Lozinskii's sense, is really almost the same as a proper link in our sense (it may include additional w-arcs). It is our contention that our definition is not detrimental to the generality of the proposed approach. We will refer to the corrected version of "ALGORITHM 2" [LOZI 80] as Algorithm GetLink. Any envelope in K must be embedded in a link L(ws, T), obtained for some node ws in SourceSet.

Algorithms GetSource, GetNearestSource and GetLink follow. In GetSource, NoPredec(y) stands for the number of predecessor nodes of y.

ALGORITHM GetSource: (adapted from ALGORITHM 1 [LOZI 80]) /* Find all the w-nodes which lead to a set of target nodes T */
SourceSet := ∅
TargetNodes := T
for all w ∈ W do ReachAll
end GetSource.

ALGORITHM ReachAll: /* Find all the v-nodes reachable from w */
begin
ReachNodes := π(w)
NextNodes := { v : (w,v) ∈ Aw }

60

for all y ∈ W do NoPredec[y] := |π(y)|
while NextNodes ≠ ∅ and TargetNodes ≠ ∅ do
begin
  Generate := ∅
  for all x : x ∈ NextNodes and x ∉ ReachNodes do
  begin
    for all y : (x,y) ∈ Av do
    begin
      NoPredec[y] := NoPredec[y] − 1
      if NoPredec[y] = 0 then Generate := Generate ∪ { v : (y,v) ∈ Aw }
    end
  end
  TargetNodes := TargetNodes − NextNodes
  ReachNodes := ReachNodes ∪ NextNodes
  NextNodes := Generate
end
if TargetNodes = ∅ then SourceSet := SourceSet ∪ {w}
end ReachAll.

Algorithm GetSource is of time complexity O(|Av| + |Aw|).

ALGORITHM GetNearestSource: (adapted from ALGORITHM 3-Stage B [LOZI 80]) /* Find the nearest subset of the set of source-nodes for a set of target nodes T */
NearestSource := ∅
for all w ∈ SourceSet do
begin
  SourceSet := SourceSet − {w}
  for all ŵ ∈ SourceSet do
  begin
    if ŵ > w then NearestSource := NearestSource ∪ {w}
  end
end
end GetNearestSource.

Algorithm GetNearestSource is of time complexity O(|SourceSet|²).

The basis for Algorithm GetLink is formalized in Proposition 8.
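The ReachAll propagation can be sketched in Python, with the bipartite F-graph held in two adjacency maps: `wv` for the w-to-v arcs of Aw and `vw` for the v-to-w arcs of Av. A w-node fires only once all of its predecessor v-nodes have been reached, which is what the counter decrement implements. The names are ours, for illustration.

```python
def reach_all(w, wv, vw, pred_count):
    """Return the set of v-nodes reachable from w-node `w`.
    wv[w] = v-successors of w (arcs in Aw); vw[v] = w-nodes that
    v feeds (arcs in Av); pred_count[y] = |π(y)| for each w-node y."""
    count = dict(pred_count)        # NoPredec, private to this call
    reached = set()
    frontier = set(wv.get(w, ()))   # v-nodes directly generated by w
    while frontier:
        generate = set()
        for v in frontier - reached:
            for y in vw.get(v, ()): # each w-node fed by v
                count[y] -= 1
                if count[y] == 0:   # all predecessors of y are reached
                    generate |= set(wv.get(y, ()))
        reached |= frontier
        frontier = generate - reached
    return reached

# w1: {vA} -> {vB};  w2: {vB} -> {vC}
wv = {"w1": {"vB"}, "w2": {"vC"}}
vw = {"vA": set(), "vB": {"w2"}}
print(reach_all("w1", wv, vw, {"w1": 1, "w2": 1}))  # vB and vC
```

Each arc is touched at most once, matching the O(|Av| + |Aw|) bound stated for GetSource.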

61

Proposition 8: Let <Z,E> be a relation scheme, let D = {<X1,{h1}>, ..., <Xn,{hn}>} be a database subschema and let H = {h1, ..., hn} be the set of corresponding FDs as in the statement of Corollary 1. If D satisfies the lossless join property with respect to <Z,E>, then for any set of attributes X ⊆ Z there exists a subschema D' ⊆ D, D' = {<Xi1,{hi1}>, ..., <Xid,{hid}>}, with X' the union of the Xij's, and a set of FDs H' ⊆ H, where X ⊆ X' ⊆ Z, such that D' satisfies the lossless join property with respect to <X',H'>.

Proof: This is a trivial consequence of Theorem 2, Assertion (2). ∎

In GetLink, Forlink is a link from ws to v-node x, generated by a forward pass. CurrentTarget is a set of current target v-nodes. CurrentNodes is a set of nodes: for each x ∈ CurrentNodes, a Forlink which contains π(x) has already been constructed, and x has not yet generated, i.e. none of its outgoing arcs is contained in the current Forlink. Successors is a set of nodes succeeding the nodes of CurrentNodes. Envelope is a set of FDs that would be used to access x. Predecessors is a set of w-nodes preceding the nodes of CurrentTarget. Backlink is the initial link in a backward pass, before deletion of extra arcs and nodes. Link(ws,T) stands for the final link from ws to T. Outlink and Inlink are sets of intermediate w-nodes. "ALGORITHM 2" of [LOZI 80] was shown to be of complexity O(|Av| + |Aw|) and therefore so is the complexity of Algorithm GetLink.

ALGORITHM GetLink: (corrected version of ALGORITHM 2 [LOZI 80]) /* Forward pass to generate a link from ws to v-nodes T */
Forlink := ∅
CurrentTarget := T
CurrentNodes := {ws}
Successors := ∅
for all y ∈ W do Count[y] := |π(y)|
for all x ∈ CurrentNodes while CurrentTarget ≠ ∅ do
begin
  for all y : (x,y) ∈ Av ∪ Aw and y ∉ W ∩ Forlink do

62

  begin
    if y ∉ Successors then
    begin
      Outlink := {y}
      Successors := Successors ∪ {y}
    end
    else Outlink := ∅
    if y ∈ W then Count[y] := Count[y] − 1
    if (y ∈ V or (y ∈ W and Count[y] = 0)) then
    begin
      CurrentNodes := CurrentNodes − {x}
      CurrentNodes := CurrentNodes ∪ {y}
      Forlink := Forlink ∪ Outlink ∪ {(x,y)}
      Successors := Successors − {y}
      CurrentTarget := CurrentTarget − {y}
    end
  end
end /* End of forward pass */
BackPass
end GetLink.

ALGORITHM BackPass: /* Backward pass to identify the final link from ws to T */
ExplicitFD := ∅
CurrentSource := {ws}
CurrentNodes := T
Link(ws,T) := ∅
Backlink := Forlink
Predecessors := ∅
for all y ∈ W do Count[y] := |σ(y)|
for all x ∈ CurrentNodes while CurrentSource ≠ ∅ do
begin
  for all y : (y,x) ∈ Backlink do
  begin
    if y ∉ Predecessors then
    begin
      Inlink := {y}
      Predecessors := Predecessors ∪ {y}
    end
    else Inlink := ∅
    Count[y] := Count[y] − 1
    if (y ∈ V or (y ∈ W and Count[y] = 0)) then
    begin
      CurrentNodes := CurrentNodes − {x}
      Link(ws,T) := Link(ws,T) ∪ Inlink ∪ {(y,x)}
      if y ∈ W then ExplicitFD := ExplicitFD ∪ {fy}
      CurrentNodes := CurrentNodes ∪ {y}

63

      Predecessors := Predecessors − {y}
      CurrentSource := CurrentSource − {y}
    end
  end
end
Link(ws,T) := Link(ws,T) ∪ π(ws)
end BackPass.

As an example, the link L(w, {vA, vB, vD, vG}) obtained by the application of Algorithm GetLink on G (Figure 4.6.i) is shown in Figure 4.6.ii.

Note that if the original F-graph happens to be lossy, then for some given T, it may not always be possible to find a source ws. Thus, if a set SourceSet can be determined for all the scheduled queries, then it is perfectly all right to use the original F-graph rather than a lossless version of it.

In the next section, a typical, but so far unaddressed, sub-problem of the distributed database design problem is identified. It is the envelope optimization problem (EOP) for a query, to enhance distributed query processing performance. An original formulation and solution of the problem is proposed. Problem EOP is first introduced in the form of a "brute-force" integer program (Section 4.4.2). Section 4.4.3 provides some useful ideas to solve EOP heuristically. EOP is given a more elegant model and solution in Section 4.4.4. Section 4.4.1 provides us with some useful relation-size estimation techniques from [ROSE 81, CCA 80]†.

†Also see [CHUN 83] for some possible refinements in relation-size estimation techniques.

64

[Figure: panels (i)-(iv)]
Figure 4.1. An example of F-graph reduction

65

[Figure: predecessor set of w2 and extraneous arc (v5, w2) marked]
Figure 4.2. Another example of F-graph reduction

66

[Figure: panels (i)-(iii)]
Figure 4.4. An illustration of LR-minimality

67

[Figure: (i) G; (ii) G']
Figure 4.5. A lossy F-graph (i) and its lossless version (ii)

68

[Figure: (i) F-graph G; (ii) link L for T = {vA, vB, vD, vG}]
Figure 4.6. A link

69

4.4. Step-3: The envelope optimization problem (EOP)

Given a query and its target nodes T in an F-graph, a set of sources, SourceSet, can be determined as discussed earlier. For each node ws of SourceSet, a link L(ws,T) can also be determined (Section 4.3). The set K mentioned earlier is the set of all envelopes embedded in those links. Thus, any one of those links can safely be used to produce an envelope for the query. However, even if a node ws in SourceSet is chosen arbitrarily, many options to select a valid envelope are still available. Indeed, the explicit FDs depicted in the link L(ws,T) can obviously be retained, but some other combination of implicit FDs in L could be retained as well, provided that they feature some join dependency. Thus, optimization can be used to select a "good" query envelope among all the possible ones in K.

With this latter objective in mind, we state the advantages of working with a lossless F-graph in the sense of Corollary 3. (i) The explicit FDs of any sub-link that contains T can directly be used to produce an envelope of T. If the F-graph were arbitrary, we would be forced to check for the satisfaction of the lossless join property for envelopes that may contain explicit FDs as well as implicit FDs. (ii) Each of the sub-links under consideration has one single w-node for source. Since there are fewer sub-links of the form L(ws,T) than of the more general form L(Ws,T), where Ws is a set of w-nodes, we are restricting the domain of possible envelopes in K to a reasonable size. Finally, note that in the optimization stage, we can still take advantage of the implicit representation of FDs in a sub-link, to discard extraneous attributes, by applying the transitivity inference rule (i.e. X → Y and Y → Z yield X → Z).

In a distributed database environment, the following properties of a query envelope are desirable:
(p-1) The number of distinct attributes is minimum.
(p-2) The FDs contain as few common attributes as possible.
(p-1) will tend to minimize file-to-file communication in query processing sessions. Because of (p-2), semi-joins between relations will feature high reduction capability and hence will lead to more benefit in terms of data-volume elimination. Furthermore, the overhead incurred by updates will tend to be minimized. The problem under consideration is identified as the envelope optimization problem (EOP). Before tackling EOP, we will digress momentarily to review some relevant tools for estimating the size of relations.

70

4.4.1. Relation-size estimation techniques

Some useful relation-size estimation techniques previously reported in [ROSE 81, CCA 80] are indicated. Although some more accurate, but more complex, techniques have been reported in [CHUN 83], we will only consider the first referenced techniques for simplicity. The size of an FD, f(ws,vT) (here, we really refer to the size of the largest relation in I<S(f),{f}>), or any relation, includes the number of tuples, mf (the cardinality of I), and the width, lf. The width is simply: lf = Σ over v ∈ π(ws) of |Att(v)|, plus |Att(vT)|, where |a| stands for the width (in bytes) of attribute a. If f(ws,vT) is implicitly represented by a link L(ws,vT), the number of tuples in f(ws,vT) is the number of tuples in the relation obtained by joining all the FDs explicitly represented in L, and by projecting over Y = [ ∪ over v ∈ π(ws) of {Att(v)} ] ∪ {Att(vT)}. To estimate this number of tuples, the same assumption as [ROSE 81]'s is made, namely,

Assumption Join [ROSE 81]: For each value x ∈ DOM(X) of some attribute X, the number of tuples t of <Y,∅> such that t[X] = x and X ⊆ Y determines a random variable nY(X). For some other relation scheme <W,∅> such that Z ⊆ W, for some Z, and DOM(X) = DOM(Z), nW(Z) is defined similarly. nY(X) and nW(Z) are independent. Furthermore, all the relation schemes <Y,{f}> explicitly represented in a link L are such that E[nY(X)] = |I<Y,{f}>| / |DOM(X)|.

From [ROSE 81] and Assumption Join, given two relations R[X] and S[Y], such that X, Y ⊆ U and X ∩ Y = Z, the cardinality of their join R ⋈ S is given by |R ⋈ S| = |R| × |S| / |DOM(Z)|. From [CCA 80], given a relation R[X], the cardinality of its projection over Z ⊆ X, R[Z], is given by: if Z = {A}, |R[Z]| = |R[A]|, for some A ∈ U,

else if Π_{A ∈ Z} |R[A]| < |R|, then |R[Z]| = Π_{A ∈ Z} |R[A]|, else |R[Z]| = |R|.

The first formulation of EOP follows.

4.4.2. A brute-force approach to solve EOP

The purpose of this section is not to solve EOP, but rather, to get a feel for possible heuristic solutions for it. To tackle the envelope optimization problem, it is assumed that the following items are given:

SourceSet: A set of source w-nodes for a set of target v-nodes T.
L(w_s,T): Link from w-node w_s to T, for every w_s ∈ SourceSet. (In Sections 4.4.2 and 4.4.3, and only those sections, it will be assumed that a single node w_s is picked from the set SourceSet, and that the link L(w_s,T) rooted in w_s is selected. How to determine such a w_s will be discussed in Section 4.4.4.1.)
k_hi: 1 if v_h is adjacent to w_i in L, 0 otherwise.
|f|: Number of tuples in any base-FD f of the F-graph.
w_f: Number of bytes per tuple of any base-FD f of the F-graph.
|f[a]|: Number of distinct values of attribute a belonging to the attribute-scheme of any base-FD f of the F-graph.
|a|: Width in bytes of attribute type a.
|DOM(a)|: Number of distinct values of attribute type a.

The following notation is used:

f_ij: A base-FD in H+ which is implicitly represented in L, by w_i and v_j, such that f_ij = ∪_{v ∈ π(w_i)} {Att(v)} → Att(v_j).

The following 0-1 variables are to be determined:

u_ij: 1 if w_i ∈ W_L is associated* with f_ij, 0 otherwise.
u_ijkl: 1 if f_ij <X| f_kl is a semi-join to be included, 0 otherwise.

*Algorithm GetAllFDs, to follow, determines if u_ij can be equal to 1.
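The size-estimation rules of Section 4.4.1, and the semi-join operation that the EOP cost model is built on, can be sketched as follows. The function names and the list-of-dicts relation encoding are illustrative assumptions, not notation from the thesis.

```python
def est_join_size(card_r, card_s, dom_z):
    """[ROSE 81] estimate: |R join_Z S| = |R| * |S| / |DOM(Z)|."""
    return card_r * card_s / dom_z

def est_projection_size(card_r, distinct_vals):
    """[CCA 80] estimate for |R[Z]|: the product of the distinct-value
    counts |R[A]| for A in Z, capped by the cardinality |R|."""
    prod = 1
    for d in distinct_vals:
        prod *= d
    return min(prod, card_r)

def semi_join(r, s, x):
    """r <X| s: keep the tuples of r whose projection on attributes x occurs in s."""
    s_proj = {tuple(t[a] for a in x) for t in s}
    return [t for t in r if tuple(t[a] for a in x) in s_proj]
```

Under Assumption Join, the expected cardinality of the relation reduced by a semi-join is given by `est_join_size` with `dom_z = |DOM(X)|`, which is how the overhead costs Z below are assembled.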

The problem of envelope optimization for a query is initially addressed as follows:

EOP: Given the above information, determine an envelope K(T) so that:
(A) The number of attributes contained in K(T) is a minimum.
(B) No FD in K(T) is extraneous. In particular, no FD in K(T) can be inferred from other FDs in K(T) by transitivity†.
(C) An objective function of the data-volume flow is minimized.

†The idea is to avoid an unnecessary increase in semi-join occurrence and therefore of cost. See properties (p-1) and (p-2) for (A) and (B).

Note that possible advantages associated with the sequencing of semi-joins for query processing are not considered. This is to keep the overhead of the solution techniques for EOP to a minimum. In this section, the following assumptions are made to evaluate the cost of item (C):

Assumption EOP-cost-1: Query processing strategy: each FD is assigned to a CRDU. The CRDUs are first locally processed to eliminate unwanted tuples. All pairwise semi-joins are evaluated.

Assumption EOP-cost-2: Since the flow of data-volume corresponding to the answer is generally small compared to the file-to-file data-volume flow, only the latter is considered.

The symbol Z_ijkl is defined to be the "worst-case" overhead cost of the semi-joins f_ij <X| f_kl and f_kl <X| f_ij, where X = S(f_ij) ∩ S(f_kl), as:

    Z_ijkl = min { Z'_ijkl, Z'_klij },

where

    Z'_ijkl = |f_ij[X]| × w_X‡

        + w_kl × |f_kl| × |f_ij[X]| / |DOM(X)|†.

The term marked "‡" stands for the reducing flow cost (shipping the projection f_ij[X] to the site of f_kl), and the term marked "†" stands for the assembly flow cost (shipping the reduced f_kl, whose expected cardinality is |f_kl| × |f_ij[X]| / |DOM(X)| by Assumption Join). By Assumption EOP-cost-2, the flows directed to the user are not counted, so that

    Z'_ijkl = |f_ij[X]| × w_X + w_kl × |f_kl| × |f_ij[X]| / |DOM(X)|.

The overhead cost of semi-join f_kl <X| f_ij, or Z'_klij, is defined similarly. An integer program for the envelope optimization problem is stated as follows:

EOP: Minimize

    Σ_{i,k: w_i, w_k ∈ W_L} Σ_{j,l: v_j, v_l ∈ V_L} Z_ijkl × u_ijkl    (eop-0)

subject to:

Assure the covering of the target nodes:

    Σ_{i=1}^{|W_L|} u_ij ≥ 1,  ∀ j: v_j ∈ T.    (eop-1)

The attributes of the source are covered:

    Σ_{j=1}^{|V_L|} u_ij ≥ 1,  ∀ i: w_i ∈ SourceSet.    (eop-2)

If a semi-join is retained, then the FDs involved in it must be included in the envelope:

    k_ji u_ij + k_lk u_kl − u_ijkl ≤ 1,  ∀ i, j, k, l.    (eop-3)

Transitivity-redundancy is undesirable:

    u_ij u_j'k u_ik = 0,  ∀ i, j, j', k such that f_ik can be inferred from f_ij and f_j'k by transitivity.    (eop-4)

Connectedness: for all v-nodes v_j such that u_ij = 1 for some w-node w_i, there exists a chain† (Algorithm CheckConnect) from SourceSet to v_j.    (eop-5)

†A directed path.

This integer program can be solved with a branch-and-bound algorithm. General guidelines to design such an algorithm are discussed below. Specific details of implementation are, however, omitted: although an optimal optimization technique can be applied and is manageable in the context of solving EOP, it would clearly incur a large overhead in processing time because all queries must be considered individually. Therefore a heuristic approach to solve EOP will be proposed later. The branch-and-bound solution would proceed as follows.

EOP-ip-1: Solve EOP without constraints (eop-4) and (eop-5). This is an assignment problem; therefore the Hungarian method [MURT 81] may be used to compute lower bounds. If (eop-4) and (eop-5) happen to be satisfied in the solution, then the problem has fathomed. If the problem has fathomed and this step is being visited for the first time, then we are done. Else, if the cost of the fathomed problem is less than the cost of the incumbent, where the incumbent is initialized to be ∅ and its cost to be ∞, then the current problem takes the place of the incumbent. Otherwise it is pruned. We then go to EOP-ip-2.
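Steps EOP-ip-1 through EOP-ip-3 can be sketched as a generic LIFO (depth-first) branch-and-bound over 0-1 variables. The toy `cost` and `lower_bound` callables below are placeholders for the Hungarian-method bound of EOP; all names are illustrative assumptions.

```python
def branch_and_bound(n, cost, lower_bound):
    """Minimize cost over 0-1 assignments of n variables.

    cost(x) evaluates a complete assignment (may return inf if infeasible);
    lower_bound(x) bounds the best completion of a partial assignment x.
    """
    best, best_x = float("inf"), None
    stack = [{}]                       # partial assignments of x_0 .. x_{n-1}
    while stack:
        part = stack.pop()             # LIFO search (EOP-ip-2)
        if lower_bound(part) >= best:
            continue                   # prune against the incumbent
        if len(part) == n:             # candidate problem has fathomed
            c = cost(part)
            if c < best:
                best, best_x = c, dict(part)   # new incumbent
            continue
        i = len(part)                  # branch: set u_i = 0 or u_i = 1
        for v in (0, 1):
            child = dict(part)
            child[i] = v
            stack.append(child)
    return best, best_x                # EOP-ip-3: the incumbent is optimal
```

With additive costs, a valid lower bound on a partial assignment is simply the cost accumulated so far, which is the pattern the test below uses.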

EOP-ip-2: Branch on a variable u_ij (set u_ij = 0, or u_ij = 1) to generate two new candidate problems. A LIFO search may be used. Pruning occurs when a candidate problem is infeasible, or when its lower bound is larger than the incumbent's. If there are still some dangling candidate problems, then go back to EOP-ip-1. Else, go to EOP-ip-3.

EOP-ip-3: The optimal solution is the incumbent.

Given a link L(w_s,T), to find out which are the acceptable variables u_ij, the following algorithm may be used.

ALGORITHM GetAllFDs:
/* Given a link L(w_s,T) of G, find all the FDs, with a single attribute on the right-hand side, which are explicitly or implicitly represented. */
Wstart := SourceSet
FDset := ∅
Vinter := ∅
for all w_s ∈ Wstart while Wstart ≠ ∅ do
begin
    /* Find the v-nodes V_r reachable from w_s */
    V_r := ReachAll(w_s)
    for all new v_j ∈ V_r do FDset := FDset ∪ {u_sj}
    Wstart := Wstart − {w_s}
    Vinter := σ(w_s)
    for all v_k ∈ Vinter do Wstart := Wstart ∪ σ(v_k)
end
end GetAllFDs.

Algorithm GetAllFDs is of time complexity bound O(|A_V| + |A_W|). To check the connectedness condition (constraint (eop-5)), the following algorithm can be used:

ALGORITHM CheckConnect:
/* Given a set of FDs explicitly or implicitly represented in a link L, determine if they satisfy the lossless join property. */

Valid := true
Vcurrent := ∅
Wstart := {w_i ∈ W_L: u_ij = 1 for some v_j ∈ V_L} − Wsource
for all w_i ∈ Wstart while Valid = true do
begin
    for all v_j ∈ π(w_i), v_j ∉ Vcurrent, while Valid = true do
    begin
        Vcurrent := Vcurrent ∪ {v_j}
        if u_kj = 0 for all k then Valid := false
    end
end
end CheckConnect.

Algorithm CheckConnect is of time complexity O(|W_L|²). In the next section a heuristic approach to solve the envelope optimization problem EOP is suggested and discussed. The primary purpose of Section 4.4.3 is to motivate the final formulation and solution of EOP in Section 4.4.4.

4.4.3. A first heuristic approach to solve EOP

There are two immediate reasons for considering a heuristic technique rather than mathematical optimization for solving EOP. (1) The exact closed-form objective function is almost impossible to obtain, because the partitioning of the query envelope must be known before the distributed query processing strategies can be determined. (2) It is not worth using a mathematical programming technique such as branch-and-bound because it is too expensive. Further, looking for a "true" optimal solution is arguable after making Assumptions EOP-cost-1 and EOP-cost-2.

The heuristic approach is now discussed. Given is a link L(w_s,T), where w_s is a source of T, the set of target nodes. There is no redundancy** and no extraneous information†. Furthermore, the link is the shortest possible‡. The goal, in Step-3, is to get a minimum-cost lossless FD-covering†† of T. One possible approach is proposed below and discussed.

**Because the link was extracted from a minimum F-graph.
†By definition of a link.
‡By virtue of Algorithm GetNearestSource.
††With FDs which are either explicitly or implicitly represented in the link under consideration.

APPROACH 3-a: A first heuristic for (EOP).

Step-3-a-1: Given the link L(w_s,T), define W_T = ∪_{v_T ∈ T} σ(v_T) ∩ W_L. For each pair of nodes w_i, w_j in Y = W_T ∪ {w_s}‡‡, find all the proper links from w_i to π(w_j).

‡‡We assume for the time being that each node v_T of T has at most one successor w_T in L. This assumption will be relaxed in Section 4.4.4.

Step-3-a-2: Augment the proper links to create proper links which explicitly represent the FDs which were previously implicitly represented (by exploiting transitivity).

Step-3-a-3: Construct a graph (Y,L) such that any edge in L is associated with a proper link. Find a minimum spanning out-tree (also referred to as arborescence) of graph (Y,L).

The steps of Approach 3-a are discussed below.

4.4.3.1. Step-3-a-1: Proper link distinguishing

Algorithm Distinguish, which follows, finds all the proper links between a w-node w_s and a v-node v_T. Algorithm Distinguish branches backwards through the link from v_T (Backtrack). It pushes onto STACK the branches traced from a current set of w-nodes, VISIT. Branching is interrupted whenever a v-node v_m is found to feature more than one predecessor w-node. When this condition is detected, a "$" character is pushed onto the stack, v_m is labelled DANGLE, and the link is branched through another live branch starting from any w-node in LiveW[v_m] (where LiveW[v_m] is the set of predecessor w-nodes of v_m which can still be back-tracked from, at this stage in the algorithm). When the source of the link, w_s, is

reached by all the branches generated in Backtrack, one new proper link is found. The stack is popped until a "$" is retrieved, and if some v-nodes are still labelled DANGLE, a new branch can be traced back to create a new proper link. The algorithm proceeds in this manner until all the possible combinations are exhausted. Algorithm Distinguish follows:

ALGORITHM Distinguish:
k := 1 /* Number of proper links */
LiveW := ∅
VISIT := π(v_T)
L¹ := {v_T} ∪ π(v_T)
PUSH(v_T)
if |π(v_T)| > 1 then
begin
    DANGLE := {v_T}
    PUSH($)
    LiveW[v_T] := π(v_T)
    w_o := some w_i ∈ LiveW[v_T]
end
else
begin
    DANGLE := ∅
    w_o := π(v_T)
end
Succ(w_o) := v_T
Backtrack(w_o)
while DANGLE ≠ ∅ do
begin
    Retract(w_i) /* where w_i is output by Retract */
    Backtrack(w_i)
end
end Distinguish.

ALGORITHM Backtrack(w_o):
if k > 1 then k := k + 1
for all w_i ≠ w_s ∈ VISIT while VISIT ≠ {w_s} do
begin /* Backtrack terminates when w_s is found and there is no more dangling branch */
    PUSH(w_i)
    L^k := L^k ∪ {w_i, <w_i, Succ(w_i)>}
    for all v_m ∈ π(w_i) do
    begin
        PUSH(<v_m, w_i>)
        PUSH(v_m)

        L^k := L^k ∪ {<v_m, w_i>, v_m}
        if |π(v_m)| > 1 then
        begin
            PUSH($)
            LiveW[v_m] := π(v_m)
            w := some w ∈ LiveW[v_m]
            VISIT := VISIT ∪ {w}
            Succ(w) := v_m
            DANGLE := DANGLE ∪ {v_m}
        end
        else
        begin
            VISIT := VISIT ∪ {w = π(v_m)}
            Succ(w) := v_m
        end
    end
    VISIT := VISIT − {w_i}
end
L^k := L^k ∪ π(w_s) ∪ {<v, w_s>: v ∈ π(w_s)} ∪ {w_s}
end Backtrack.

ALGORITHM Retract(w_i):
Item := † /* Dummy character */
L^{k+1} := L^k
while Item ≠ $ and STACK ≠ ∅ do
begin /* Have to find at least one $ */
    POP(Item) /* Retrieve a character from the stack into Item */
    L^{k+1} := L^{k+1} − {Item}
    if Item = $ then
    begin
        v_m := ReadTop(STACK)
        if LiveW[v_m] = ∅ then
        begin
            DANGLE := DANGLE − {v_m}
            Item := †
        end
        else
        begin
            w_i := some w ∈ LiveW[v_m]
            VISIT := VISIT ∪ {w_i}
            LiveW[v_m] := LiveW[v_m] − {w_i}
            PUSH($)
            L^{k+1} := L^{k+1} ∪ {<w_i, v_m>, w_i}
        end
    end
end
end Retract.
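The enumeration performed by Distinguish can be illustrated on a small abstraction: for each v-node reached backwards from v_T, exactly one predecessor w-node is fixed (the characterization of a proper link proved in the sequel), and every combination of such choices is generated. The dictionary encoding (`w_pred` mapping each v-node to its candidate w-nodes, `v_pred` mapping each w-node to the v-nodes it still requires) and the function name are assumptions for illustration; the number of enumerated links grows like p^d, as noted below.

```python
def proper_links(v_t, w_pred, v_pred):
    """Enumerate all proper sub-links as maps {v-node: chosen w-node}.

    w_pred[v] lists the predecessor w-nodes of v-node v; v_pred[w] lists
    the v-nodes that w-node w depends on (absent or empty for the source).
    """
    results = []

    def backtrack(frontier, choice):
        if not frontier:                     # every branch reached the source
            results.append(dict(choice))
            return
        v, rest = frontier[0], frontier[1:]
        for w in w_pred[v]:                  # fix exactly one predecessor w-node
            new_vs = [u for u in v_pred.get(w, [])
                      if u not in choice and u not in rest]
            backtrack(rest + new_vs, {**choice, v: w})

    backtrack([v_t], {})
    return results
```

Each result is one backtracked branch assignment; the product of the `w_pred` fan-ins along a branch of length d gives the p^d growth mentioned for Algorithm Distinguish.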

We now show that Distinguish accomplishes what it is supposed to.

Proposition 9: A sub-link L(w_s,v_T) is proper if and only if any v-node in it has at most one predecessor w-node.

Proof: Let L(w_s,v_T) be a proper link. We proceed by contradiction. Assume that there exist z ∈ V_L, X, Y ⊆ V_L, and two w-nodes w_X and w_Y preceding z, such that π(w_X) = X and π(w_Y) = Y. Then there exist two sub-links of length one, L_X(w_X,z) and L_Y(w_Y,z), and thus any one of the arcs (w_X,z) or (w_Y,z) is extraneous. The latter conclusion contradicts the assumption that L is proper. Conversely, given a sub-link L(w_s,v_T), if for any v-node z there exists at most one predecessor w-node, then by branching backwards from v_T in L, while preserving reachability, all the nodes and arcs must be retained to depict the relationship π(w_s) → v_T. Therefore L satisfies the definition of a proper link. ∎

It is obvious that Algorithm Distinguish finds all the sub-graphs of L(w_s,v_T) which are "rooted" in w_s and which include v_T. Since they all satisfy the condition of Proposition 9, they are proper as well. Since the algorithm enumerates all the possibilities by exploiting the condition of Proposition 9, it keeps track of all the proper links between w_s and v_T. The number of proper links generated by Algorithm Distinguish can be as large as p^d, where p is the average number of w-nodes in π(v_j) over all v-nodes v_j, and d is the length of the link under consideration. It is a non-polynomial algorithm, but it can be used as long as p^d is not too large.

4.4.3.2. Step-3-a-2: Proper link augmenting

For each FD f(w,v) implicitly represented by a pair (w,v) in a proper link L(w_s,v_T), a new FD can be explicitly represented by adding a new arc (w,v). Since L is proper, L(w_s,v_T) ∪ {(w,v)} − L(w,v) is also proper. In Step-3-a-2 the idea is to create new explicit FDs (exploiting transitivity on L) while deleting intermediate links. Note that since each v_T

of T is considered individually in this manner, whether another target node, say v'_T, belongs to the link L(w,v) being dropped or not is irrelevant.

4.4.3.3. Step-3-a-3: Spanning out-tree

In this step, an algorithm to find a spanning out-tree rooted at one node may be used. Such an algorithm is described in [LAWL 76]. In general, the algorithm must be able to detect the existence of cycles in a graph, which makes it inefficient. However, in this context, it will be shown in Section 4.4.4.2 that drastic simplifications can be made.

4.4.3.4. Remarks on the initial heuristic

Several remarks can be made about Approach 3-a.

(1) The problem of finding all the proper sub-links between a pair of nodes is NP-complete, because it can be restricted to the well-known path distinguisher problem [GARE 79]. However, this problem may still be manageable for many practical queries, because redundancy has been eliminated in Step-1, and because the link under consideration has been shortened to a maximum in Step-2.

(2) Any query envelope must be determined on an underlying sub-graph of (Y,L), such as an out-tree, which is rooted in w_s. This comes from the fact that any envelope must satisfy the lossless join property (see also constraint (eop-2)).

(3) The sub-graph must at least be a tree, i.e. it must be connected, because there must exist a link from the root node, w_s, to any other node, even if this link is not represented by a single edge in (Y,L). This also follows from the lossless join property (see also constraint (eop-5)).

(4) The sub-graph may be more than an out-tree, but this would add unnecessary "access paths", and thus increase data redundancy. This remark agrees with requirements (A) and (B) (see also constraint (eop-4); note, however, that constraint (eop-4) is extraneous in problem EOP, for it tends to decrease the objective function).

(5) Requirement (A) is obviously satisfied; further, all the target nodes are covered (see constraints (eop-1) and (eop-3)).

(6) A spanning tree may be more

than needed, since the initial link may or may not be proper. Eventually, some branches of the spanning tree may be thrown out.

We will formalize those remarks in the next section (Proposition 16). Since the first heuristic avoids the penalty of using branch-and-bound only by introducing a potential curse of cardinality in Step-3-a-1, as well as some substantial overhead in Step-3-a-2, a more direct solution approach to problem (EOP) is obviously called for. It is presented in the next section.

4.4.4. Final solution of EOP

An immediate simplification to the first heuristic is to restrict the types of proper links to be determined in Steps 3-a-1 and 3-a-2. This suggests the following heuristic approach.

APPROACH 3-b: A refined heuristic for (EOP)

Step-3-b-1: Select w_s in SourceSet such that it yields the shortest proper link L*(w_s,T) = (V*_L ∪ W*_L, A*_V ∪ A*_W) to T. Construct the metagraph (Y,L) of L* as outlined through the following algorithm.

ALGORITHM Metagraph:
/* Note that the only nodes and arcs considered are those of L*, and that π and σ are restricted to L*. */
L := ∅ /* Arcs */
Y := {<π(w_s), w_s>} /* Nodes */
for all v_T ∈ T do
begin
    if σ(v_T) ≠ ∅ then
        for all w ∈ σ(v_T) do Y := Y ∪ {<π(w), w>}
    else Y := Y ∪ {<{v_T}, nil>}
end
for all <V_i, w_i> and <V_j, w_j> ∈ Y do

begin
    if L*(w_i, V_j) exists, then L := L ∪ {[<V_i, w_i>, <V_j, w_j>]}
    if L*(w_j, V_i) exists, then L := L ∪ {[<V_j, w_j>, <V_i, w_i>]}
end
if there exist <V, w> ∈ Y and <V', w'> ∈ Y such that V' ⊆ V, then add the arc [<V, w>, <V', w'>] to L.
end Metagraph.

Step-3-b-2: For each arc e = [<V, w>, <V', w'>] ∈ L, estimate the size of the FD f(w,V') induced by L*(w,V'). Define the weight of e to be the size of f(w,V'). In particular, if V' ⊆ V, then the weight of e is zero.

Step-3-b-3: Find a minimum spanning out-tree of graph (Y,L).

Each step of Approach 3-b will now be discussed. Note first that remarks (2) through (5) of Section 4.4.3.4 still hold for Approach 3-b with the new definition of (Y,L). However, remark (6) can be changed as follows.

(6) A spanning tree is needed because the initial link is a proper link, and therefore no node can be thrown out of Y. (This is consistent with property (p-2) of an envelope, Section 4.4.)

4.4.4.1. Step-3-b-1: Minimum proper link

Algorithm MinPropLink, which follows, finds all the minimum-length proper links between w_s and the nodes v_T ∈ T. It is very similar to the classical shortest path algorithm of Dijkstra [MURT 76]. The algorithm labels each v-node of the initial link as UNLAB, PERM or TEMP (the labels stand for unlabelled, permanently labelled and temporarily labelled). At the beginning, TEMP is empty, while UNLAB contains all the intermediate v-nodes of the link (V_L − π(w_s)), and PERM contains π(w_s). The link is branched out from w_s in search of a minimum-length proper link that leads to v_T. Nodes of UNLAB are branched

from one by one, and transferred into TEMP. A v-node v_j in TEMP is always a candidate for branching until v_T is reached. When a minimum branch from PERM to TEMP is selected, the v-node y which is at the end of the branch is transferred into PERM and is never visited again. When v_T is found, the algorithm traces back the graph from v_T using the Predec labels. The same process is repeated for every v_T in T. Algorithm MinPropLink follows:

ALGORITHM MinPropLink:
for all v_T ∈ T do /* T is the set of target nodes */
begin
    UNLAB := V_L − π(w_s)
    PERM := π(w_s)
    TEMP := ∅
    for all v_i ∈ PERM do
    begin
        Predec[v_i] := nil /* Initialize the predecessor pointers */
        Length[v_i] := 0 /* Length of the shortest directed path from PERM */
    end
    Length[w_s] := 0
    for all v_i ∈ σ(w_s) do
    begin
        TEMP := TEMP ∪ {v_i}
        UNLAB := UNLAB − {v_i}
        Predec[v_i] := w_s
        Length[v_i] := 1
    end
    /* At this stage, PERM and TEMP have been initialized */
    while v_T ∉ PERM do /* Quit whenever v_T is reached */
    begin
        if TEMP ≠ ∅ then
            y := v_i such that Length[v_i] = min {Length[v_k]: v_k ∈ TEMP}    (†)
        PERM := PERM ∪ {y} /* Add to PERM the node y of TEMP which is nearer to PERM than any other node in TEMP */
        for all w_i ∈ σ(y) do Length[w_i] := max {Length[v_k]: v_k ∈ π(w_i)}
        TEMP := TEMP − {y}
        for all v_j ∈ TEMP do /* Update the labels of all the nodes in TEMP */
        begin
            if ( there exists w_i: π(w_i) ⊆ PERM and v_j ∈ σ(w_i), and w_i ≠ Predec(v_j), and Length[v_j] > Length[w_i] + 1 ) then
            begin
                Length[v_j] := Length[w_i] + 1
                Predec[v_j] := w_i
                Length[w_i] := max {Length[v_k]: v_k ∈ π(w_i)}

            end
        end
        for all v_j ∈ UNLAB do /* Update TEMP */
        begin
            Predec[v_j] := w_i such that Length[w_i] = min {Length[w_k]: π(w_k) ⊆ PERM and v_j ∈ σ(w_k)}
            Length[v_j] := Length[w_i] + 1
            UNLAB := UNLAB − {v_j}
            TEMP := TEMP ∪ {v_j}
        end
    end /* of the while loop */
    /* The minimum-length link is found by tracing back the graph from v_T using the Predec labels. */
    L_T* := {v_T, Predec(v_T), <Predec(v_T), v_T>}
    for all w_i ∈ L_T* while L_T* is not a link do
    begin
        L_T* := L_T* ∪ π(w_i)
        L_T* := L_T* ∪ {<v, w_i>: v ∈ π(w_i)}
        for all v_j ∈ π(w_i) do L_T* := L_T* ∪ {Predec(v_j), <Predec(v_j), v_j>}
    end
end /* of the for all loop */
L* := ∪_{v_T ∈ T} L_T*
TotalLength := max {Length[v_T]: v_T ∈ T}
end MinPropLink.

We argue that Algorithm MinPropLink finds a minimum proper link. Indeed, L* is more than a minimum proper link from w_s to T; it is the union of the minimum proper links from w_s to every node v_T of T.

Proposition 10: The sub-links L_T* constructed by Algorithm MinPropLink are proper links. Their union, L*, is also proper.

Proof: By Proposition 9, the only way a sub-link L_T* might not be proper is if it contains a v-node v_j with more than one predecessor w-node. But whenever a v-node y is added to PERM, and to L_T*, only one w-node, Predec[y], is added as well (line (†) in Algorithm MinPropLink). Since y is never visited more than once, only one predecessor w-node is associated with it. Therefore L_T* is proper. We now prove the second statement. Suppose a v-node v_j belongs to two sub-links L*(w_s,v_T) and L*(w_s,v'_T). The candidate nodes are always examined in the same order. Therefore, v_j must have the same predecessor in L*(w_s,v_T) as it does in L*(w_s,v'_T). Hence, since the previous argument applies to any node v_j, the union of the two proper links is proper, and, by induction, L* is proper. ∎

The proof of minimality is very similar to the proof for Dijkstra's algorithm [MURT 81].

Proposition 11: A sub-link L_T*(w_s,v_T) constructed in Algorithm MinPropLink is the shortest link from w_s to v_T, and Length[v_T] is the length of the shortest link.

Proof: At some stage of the algorithm, assume that the links from w_s to the nodes v_i ∈ PERM traced by the labels on the nodes in PERM are the shortest links. By the manner in which the labels Predec and Length are updated in each step, it is clear that for any v_j ∈ TEMP the link from w_s to v_j traced by the current labels is a minimum-length link from w_s to v_j among links from w_s to v_j which traverse only through nodes in PERM. The length of that link is the distance in the current label Length[v_j]. Let y ∈ TEMP be the node made permanent next in MinPropLink. Let the current distance in the permanent label of y be Length[y], and the distance on the current label of v_j ∈ TEMP, v_j ≠ y, be Length[v_j]. Then, by the choice of y at line (†),

    Length[y] = min {Length[v_j]: v_j ∈ TEMP, v_j ≠ y}.

If the link from w_s to y traced by the current labels is not the shortest, the shortest link from w_s to y must pass through some node v_j ∈ TEMP, v_j ≠ y, before reaching y. Suppose v_k ∈ TEMP is the first such node on this shortest link. Let δ ≥ 0 be the length of a shortest link

from v_k to y. The length of the shortest link from w_s to y is then Length[v_k] + δ ≥ Length[v_k] ≥ Length[y], by (†), which is a contradiction. Therefore the link from w_s to y traced by the current labels Predec is indeed the shortest link from w_s to y. By induction, and the above argument, the assertions in Proposition 11 are valid for the set PERM after each stage of the algorithm. ∎

Note that Algorithm NearestSource finds a set of w-nodes SourceSet such that, for every w_s ∈ SourceSet, π(w_s) → T, and for any v_i ∈ π(w_s), π(w_s) − v_i does not determine T, whereas MinPropLink explicitly identifies the shortest link from a single w-node w_s to all the v-nodes v_T in T. Thus, given a source Wsource, MinPropLink can be executed for all the nodes w_s ∈ Wsource, and thus the over-all minimum proper link L*(w_s,T) can be determined. The computational effort required by Algorithm MinPropLink is O[|V_L|² × |T|].

Each node of the so-called metagraph, (Y,L), consists of a w-node and its predecessors in L* if those predecessors contain at least one target node, or if the w-node is the root of L*, w_s. Additional nodes account for target nodes which do not have successors. Each arc of the metagraph depicts an implicit FD in L*, induced by a sub-link of L*. Each such FD contains at least one target attribute on either side of its arrow. Note that, in Algorithm Metagraph, since the starting sub-link is proper, all the sub-links are also proper, and therefore Algorithm GetLink (more efficient than MinPropLink) can be used to determine all the arcs of L.

4.4.4.2. Step-3-b-2: Weight assignment on proper links

The size of the FD f(w_s,T), implicitly represented by link L*(w_s,T), can easily be calculated after L* has been traced back in Algorithm MinPropLink. The following algorithm, which estimates the size of f(w_s,T), is very similar to Algorithm ReachAll seen earlier, and of the same time complexity O[|A_V| + |A_W|].
To minimize estimation errors, the FDs explicitly represented in L* are joined from the source w_s until ∪_{v_T ∈ T} {Att(v_T)} is covered.
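The joining-from-the-source estimation can be sketched for a simple chain of FDs, applying the [ROSE 81] join rule at each step. The encoding of each FD as a (cardinality, joining-domain-size) pair, and the function name, are illustrative assumptions.

```python
def chain_size_estimate(fd_chain):
    """Estimate the cardinality of the join of a chain of FDs.

    Each element is (num_tuples, dom_join): the FD's cardinality and the
    |DOM| of the attribute joining it to the partial result built so far.
    The first FD's dom_join is unused (pass 1).
    """
    size = None
    for tuples, dom_join in fd_chain:
        if size is None:
            size = tuples                       # start from the source FD
        else:
            size = size * tuples / dom_join     # |R join S| = |R|*|S|/|DOM|
    return size
```

As in Algorithm SizeEstimate, the running product can then be capped by the [CCA 80] projection bound computed from the distinct-value counts of the source and target attributes.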

ALGORITHM SizeEstimate:
/* Estimates the size of f(w_s,T) implicitly represented by a link L* */
TargetNodes := V_T
f := ∅
Size := 1
ReachNodes := π(w_s)
NextNodes := {v: (w_s,v) ∈ A*_W}
for all w_i ∈ W*_L do NoPredec[w_i] := |π(w_i)|
while NextNodes ≠ ∅ and TargetNodes ≠ ∅ do
begin
    Generate := ∅
    for all x ∈ NextNodes, x ∉ ReachNodes, do
    begin
        for all w_i: (x,w_i) ∈ A*_V do
        begin
            NoPredec[w_i] := NoPredec[w_i] − 1
            if NoPredec[w_i] = 0 then
            begin
                Generate := Generate ∪ {v_k: (w_i,v_k) ∈ A*_W}
                f := f ∪ {∪_{v ∈ π(w_i)} {Att(v)} → Att(v_k)}
                Size := Size × |f_i| / Π_{v ∈ π(w_i)} |DOM(Att(v))|
            end
        end
    end
    TargetNodes := TargetNodes − NextNodes
    ReachNodes := ReachNodes ∪ NextNodes
    NextNodes := Generate
end
if PROD := Π_{A ∈ ∪{Att(v): v ∈ π(w_s)}} |f_i[A]| × Π_{v_T ∈ T} |f_j[Att(v_T)]| < Size, for some j and i,
then Size := PROD
Size(w_s,T) := Size
end SizeEstimate.

In the next section some useful mathematical properties of (EOP) are established. A new model is then proposed to capture the problem of Step-3-b-3. A detailed implementation of

Step-3-b-3 then follows.

4.4.4.3. Step-3-b-3: Matroid model and optimization

In this step, we extract an out-tree of (Y,L) which is said to depict an envelope K(T) in the following sense: K(T) is the envelope that contains the FDs represented by the arcs of the out-tree, and only those FDs. The justifications for this step will be provided later. We will first prove three important properties of (Y,L). Those properties will later be used to design a very simple and efficient optimization algorithm.

Property 1

Proposition 12: The metagraph (Y,L) of L*(w_s,T) does not contain any directed cycle.

Proof: If there exists a directed cycle in (Y,L), then there exists a "link-cycle" in L* (the notion of proper link in an F-graph is analogous to the notion of directed path in a regular graph; a link-cycle is the obvious extension of a directed cycle). Therefore, either one v-node, in L* but not in π(w_s), has more than one predecessor, which contradicts Proposition 9, since L* is a proper link, or some v-node, y, in π(w_s) has a predecessor, z. The latter possibility must be discarded, since an arc (z,y) may not be a linking arc from w_s to y with respect to the set of arcs in L*. ∎

Property 2

Consider the hypergraph M = (L,A), where L is, as before, the set of arcs of the metagraph, and A is:

    A = {A_i: A_i is a set of elements of L, and A_i is an out-tree* or a forest of out-trees in (Y,L)} ∪ {∅}.

*An out-tree is a directed tree in which each node has at most one arc directed into it.

As proved below, M is a matroid (see [LAWL 76]).

Proposition 13: M = (L,A) is a matroid, i.e.:
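Because (Y,L) is acyclic (Proposition 12), membership in A, i.e. independence, reduces to checking that no node has two incoming arcs, and the matroid greedy rule developed in the rest of this section then builds a maximum-weight base by scanning arcs weightiest first. A minimal sketch, with illustrative names and arcs encoded as (tail, head) pairs:

```python
def independent(arcs):
    """Forest-of-out-trees test in an acyclic metagraph: a set of arcs is
    independent iff no node has two incoming arcs (heads are distinct)."""
    heads = [h for (_, h) in arcs]
    return len(heads) == len(set(heads))

def greedy_base(all_arcs, weight):
    """Matroid greedy: take arcs in decreasing weight, rejecting an arc
    only if it points at the head of an already-chosen arc."""
    base = []
    for arc in sorted(all_arcs, key=weight, reverse=True):
        if independent(base + [arc]):
            base.append(arc)
    return base
```

With weights set to a large constant minus the estimated FD sizes, the maximum-weight base returned by `greedy_base` is the minimum-size spanning out-tree sought in Step-3-b-3.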

(i) For any A_i ∈ A, if A_j ⊆ A_i, then A_j ∈ A. Also, ∅ ∈ A.
(ii) If A_i^p and A_j^{p+1} ∈ A contain respectively p and p+1 elements of L, then there exists L ∈ A_j^{p+1} − A_i^p such that A_i^p ∪ {L} ∈ A.

Proof: (i) A proper subset of an out-tree is an out-tree or a forest of out-trees. Also, by definition, ∅ is a member of A. (ii) Let A_i^p be a forest of out-trees with p arcs. Let A_j^{p+1} be a forest of out-trees with p+1 arcs. We proceed by contradiction. Suppose that for any L = (e,f) ∈ A_j^{p+1} − A_i^p, A_i^p ∪ {L} is not a forest of out-trees. Then, for all (e,f) ∈ A_j^{p+1}, either both e and f already belong to A_i^p, or f is pointed at by an arc in A_i^p while e is outside A_i^p. (Of course, there must exist at least one arc (e,f) ∈ A_j^{p+1} such that e is not a node in A_i^p.) Let us try to reconstruct A_j^{p+1}. For each arc (g,h) in A_i^p, there exists an arc (e,h) in A_j^{p+1} (with e possibly identical to g). Therefore we can account for only p arcs in A_j^{p+1}. This contradicts the assumption that A_j^{p+1} is a forest of out-trees with p+1 arcs. Therefore there must exist at least one L ∈ A_j^{p+1} − A_i^p such that A_i^p ∪ {L} is an out-tree or a forest of out-trees. Since M satisfies (i) and (ii), it is a matroid. ∎

All the subsets of L in A are said [LAWL 76] to be independent sets of M. A maximal independent set is said to be a base of M. By definition of (Y,L), a base is a spanning out-tree in (Y,L).**

**We know that there is at least one such spanning out-tree by definition of (Y,L).

Property 3

Proposition 14: Any spanning tree in (Y,L) is rooted in e_s = <π(w_s), w_s> ∈ Y.

Proof: This follows from the fact that there exists an arc between e_s and any other node of Y, and from the acyclic property of (Y,L) asserted by Proposition 12. ∎

It is known that the matroid greedy algorithm [LAWL 76] finds a base of maximum weight. In this context, it is reasonable to define the weight of an arc

L = [<·, w_s>, <V, ·>] in L to be a large number minus the estimated size of the FD f(w_s,V) implicitly represented by the link L(w_s,V) associated with L. The general matroid greedy algorithm as described in [LAWL 76] is: choose the elements of the matroid in order of weight, weightiest element first, rejecting an element only if its selection would destroy the independence of the set of chosen elements. Although the construction of a spanning out-tree is usually not trivial because of potential directed cycles, in this context a very simple greedy algorithm may be devised for Step-3-b-3: Assign to each arc in L a large positive number minus the size of the FD implicitly represented by the corresponding link. Choose the elements of L in order of weight, weightiest element first, rejecting an element only if it points towards the same node of Y as another already chosen arc.

That this algorithm indeed finds a maximum-weight spanning out-tree of (Y,L) follows immediately from Properties 1-3 and the theory of matroids [LAWL 76]. However, we include a comprehensive proof below.

Proposition 15: The greedy algorithm finds the maximum-weight spanning out-tree of (Y,L).

Proof: First, let us show that at any step p of the algorithm, the current forest of out-trees A^p is the weightiest among all other forests of out-trees with the same number of arcs. At step p = 1 this is obviously true, since the weightiest arc is initially chosen. By induction, suppose A^{p−1} is the weightiest forest of out-trees with p−1 arcs. Two cases must be considered:

(1) A^{p−1} can be augmented: Let A^{p−1} = {(e_i, f_i): i = 1, ···, p−1}. If the next weightiest arc (g,h) is such that h = f_i for some f_i, then it is not acceptable. Else it is acceptable (by Proposition 12 and the fact that A^{p−1} is a forest of out-trees), A^p = A^{p−1} ∪ {(g,h)} is a forest of out-trees, and we are done. Suppose that h = f_m for some m. Assume further that there exists a forest of out-trees with p arcs:

    A_1^p = {(e_j, f_j): j = 2, ···, p−1} ∪ {(e', f'), (g, f_m)}

such that:

    Σ_{j=2}^{p−1} Weight(e_j, f_j) + Weight(e', f') + Weight(g, f_m) > Σ_{i=1}^{p−1} Weight(e_i, f_i) + Weight(k, l)

for any acceptable (k,l). Since Weight(e_1, f_1) ≥ Weight(g, f_m), therefore:

    Σ_{j=2}^{p−1} Weight(e_j, f_j) + Weight(e', f') + Weight(e_1, f_1) > Σ_{i=1}^{p−1} Weight(e_i, f_i) + Weight(k, l),

which implies that:

    Weight(e', f') > Weight(k, l).

But we know that f' ≠ f_i for all i, because if this were not true then A^{p−1} ∪ {(e', f')} would not be a forest of out-trees. Furthermore, A^{p−1} ∪ {(e', f')} is a forest of out-trees (using Proposition 12) weightier than A^{p−1} ∪ {(k,l)}. This last statement contradicts the assumption that (k,l) was the next heaviest available arc for A^{p−1}. Therefore, either a forest of out-trees such as A_1^p does not exist, or A^{p−1} cannot be augmented further. In the first eventuality the original assertion is true. In the latter, we have to consider case (2):

(2) A^{p−1} cannot be augmented: Suppose that there exists a forest of out-trees A_j^q with q > p−1 such that A_j^q is weightier than A^{p−1}. Then there exists A_j^p ⊆ A_j^q such that, by Proposition 13, some L ∈ A_j^p − A^{p−1} can be added to A^{p−1} to create a larger forest of out-trees. This contradicts the assumption that A^{p−1} can no more be augmented.

Therefore, A^p must indeed be the weightiest forest of out-trees at any step p. Since there exists a spanning out-tree of (Y,L), and using the assertion of Proposition 14, the greedy algorithm terminates when the maximum-weight spanning out-tree rooted in e_s is found. ∎

Step-3-b-3 may now be justified.

Proposition 16: An out-tree of (Y,L) depicts an envelope K(T) and not more.

Proof: As noted in connection with Theorem 2, there exists a query envelope which is embedded

such that:

Sum_{j=1..p-1} Weight(e'_j,f'_j) + Weight(g,f_m) > Sum_{i=1..p-1} Weight(e_i,f_i) + Weight(k,l)

for any acceptable (k,l). Since Weight(e_m,f_m) >= Weight(g,f_m), therefore:

Sum_{j=1..p-1} Weight(e'_j,f'_j) + Weight(e_m,f_m) > Sum_{i=1..p-1} Weight(e_i,f_i) + Weight(k,l),

which implies, because A^{p-1} is the weightiest forest with p-1 arcs, that for some j:

Weight(e'_j,f'_j) > Weight(k,l).

But we know that f'_j != f_m for such a j, because otherwise A'^p would not be a forest of out-trees. Furthermore, A^{p-1} U {(e'_j,f'_j)} is a forest of out-trees (using Proposition 12) weightier than A^{p-1} U {(k,l)}. This last statement contradicts the assumption that (k,l) was the next weightiest available arc for A^{p-1}. Therefore, either an out-tree such as A'^p does not exist, or A^{p-1} cannot be augmented further. In the first eventuality the original assertion is true. In the latter, we have to consider case (2):

(2) A^{p-1} cannot be augmented: Suppose that there exists a forest of out-trees A'^q with q > p-1 which is weightier than A^{p-1}. Then there exists A'^p contained in A'^q such that, by Proposition 13, some arc in A'^p - A^{p-1} can be added to A^{p-1} to create a larger forest of out-trees. This contradicts the assumption that A^{p-1} cannot be augmented further. Therefore, A^p must indeed be the weightiest forest of out-trees at any step p. Since there exists a spanning out-tree of (Y,L), and using the assertion of Proposition 14, the greedy algorithm terminates when the maximum-weight spanning out-tree rooted in e_s is found. []

Step-3-b-3 may now be justified.

Proposition 16: An out-tree of (Y,L) depicts an envelope K(T) and not more.

Proof: As noted in connection with Theorem 2, there exists a query envelope which is embedded in the link L(w_s,T). It is easy to see that there is also an envelope K(T) embedded in a proper sub-link L* of L. Therefore, the metagraph (Y,L) depicts more than is needed for such an envelope K(T). If a subgraph of (Y,L) is an out-tree, then by Proposition 6 it is an out-tree rooted in <pi(w_s),w_s>. Therefore, the F-graph L which has for explicit FDs the FDs associated with the arcs of the out-tree is a sub-link of the F-graph G+, where G+ denotes the F-graph that explicitly represents all the FDs of F+. By Corollary 2, L is lossless. Furthermore, since the out-tree spans all the nodes of Y, all the v-nodes of T are covered. Therefore, the out-tree depicts an envelope K(T). The subgraph extracted from (Y,L) may be more than an out-tree, but it would then contain extraneous information. An envelope depicted by such a subgraph of (Y,L) would conflict with requirement (eop-2). []

The implementation of the greedy algorithm follows:

ALGORITHM OutTree:
/* Finds a maximum-weight spanning out-tree in (Y,L) rooted in e_s = <.,w_s> */
Arborescence := true
OutTree := {}
for all e = [<.,w>; <V,.>] in L do
   Weight[e] := S - Size(w,V)
   /* Size is given by SizeEstimate and S is a large number */
while Arborescence = true do
begin
   e = (e1,e2) := (e_r in L : Weight[e_r] = max {Weight[e_q] : e_q in L - OutTree}
                   and for all e' = (e'1,e'2) in OutTree, e'2 != e2)
   if e = {}
   then Arborescence := false
   else OutTree := OutTree U {e}
end
end OutTree.

Algorithm OutTree is of time complexity O(|L|^2). Note that the total time complexity of Approach 3-b is therefore

O(|SourceSet| x |V_L|) + O(|T| x |V_w|) + O(|A_v| + |A_w|) + O(|L|^2).

If a proper link L(w_s,T) is given, and if the cost of the model of EOP in Section 4.4.2 is changed to be the sum of the sizes of the FDs in the envelope sought, then Approach 3-b solves exactly the latter model. For most practical cases the expression for the cost given in Section 4.4.2 is unreasonable because of Assumptions EOP-cost-1-2 (page 72). Therefore, a comparison of the solutions obtained respectively by Approach 3-b and an "optimal" algorithm is not justified.

In the next section, some typical problems of FD-updating are considered. The semantic constraints that support the FD-modeling structure control the processing of updates, much as join dependencies control the correct processing of queries. Thus, a single update on an individual FD may force, as a side-effect, a so-called cascade of updates on other FDs. The cascade associated with one update is defined to be the set of FDs which are the targets of this update. Section 4.5 addresses the problem of determining update cascades so as to preserve database consistency.

4.5. Consistency preservation in updates

When selecting a query envelope, the lossless join property was considered to be the main factor in satisfying semantic correctness. Similarly, some criteria must be considered in association with updates of FDs to preserve the integrity of the database. In the next section, definitions of simple practical semantic constraints on FDs are provided. These constraints help explain how the update of a single FD may lead to a cascade of side-effects (Section 4.5.2). In Section 4.5.3 three basic types of updates are considered in detail.

4.5.1. Constraints on the antecedent and range of an FD

The notion of update-provoked side-effects will rely, in part, on the following definitions. As explained in Section 3.1, each FD f_h is derived from a function F_h. Using the same notation as in Chapter 3, A_h = DOM(A_1) x DOM(A_2) x ... x DOM(A_n) is defined to be the antecedent of f_h, and R_h = DOM(B_1) x DOM(B_2) x ... x DOM(B_m) is defined to be the range of f_h. For some x in A_h and y in R_h, the notation y = f_h(x) will be used.

(1) f_h is partial with respect to A_h if there exists x in A_h such that x has no corresponding image in R_h with respect to f_h. We write f_h(x) = {}.

(2) f_h is total with respect to A_h if for any x in A_h, there exists y in R_h such that f_h(x) = y.

(3) f_h is partial with respect to R_h if there exists y in R_h such that for any x in A_h, f_h(x) != y.

(4) f_h is total with respect to R_h if for any y in R_h, there exists x in A_h such that f_h(x) = y.

On an F-graph, the tail and the head of each w-arc are labelled with either the letter P (partial) or T (total), as shown in Figure 4.7. The need for maintaining such labels for each FD is illustrated in the following example from [HOUS 79].

Example: A nation-wide organization keeps records of its MEMBERs, and of their SPOUSEs if they are married. The MEMBER and SPOUSE attributes are related by the FD f_MS, which stands for MEMBER -> SPOUSE. Those MEMBERs who are not married will not be part of FD f_MS; therefore f_MS is partial with respect to its antecedent MEMBER, and total with respect to its range SPOUSE, since only the SPOUSEs of known MEMBERs in the organization appear in the database. We write MEMBER(P) -> SPOUSE(T).
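These four properties can be verified mechanically on an instance of an FD. Below is a minimal sketch in Python; the dict-based representation and the function name `labels` are illustrative assumptions of this example, not notation from the thesis. It derives the "P"/"T" labels for the antecedent and the range of an FD instance by comparing the instance against the two domains.

```python
def labels(fd, antecedent_dom, range_dom):
    """Derive (antecedent_label, range_label) for an FD instance.

    fd              -- dict mapping antecedent values to image values
    antecedent_dom  -- the domain A_h of the antecedent
    range_dom       -- the domain R_h of the range

    The FD is total ("T") w.r.t. its antecedent if every value of A_h
    has an image, and total w.r.t. its range if every value of R_h is
    the image of some antecedent value; otherwise it is partial ("P").
    """
    a_label = "T" if all(x in fd for x in antecedent_dom) else "P"
    r_label = "T" if set(range_dom) <= set(fd.values()) else "P"
    return a_label, r_label

# The MEMBER(P) -> SPOUSE(T) example: M5 is unmarried, so the FD is
# partial on its antecedent but covers every recorded SPOUSE.
f_ms = {"M1": "S1", "M2": "S2", "M3": "S3", "M4": "S4"}
print(labels(f_ms, ["M1", "M2", "M3", "M4", "M5"],
             ["S1", "S2", "S3", "S4"]))   # -> ('P', 'T')
```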

4.5.2. Deletion of attribute values

It is obvious that the deletion of an attribute value with respect to an FD can trigger a cascade of deletions on related FDs. This "side-effect" phenomenon has been studied previously in [HOUS 79] and is addressed in Section 4.5.3. Typically, it is important to keep track of the labels of any FD under updating. This is particularly true in the case of deletions of attribute values or of FD tuples. The following example helps explain why a single deletion may require further deletions in cascade.

Example: (a) Let FD f_1, which stands for MEMBER(P) -> SPOUSE(T), be represented by the following table:

f_1:  MEMBER   SPOUSE
      M1       S1
      M2       S2
      M3       S3
      M4       S4

If M1 is to be deleted from the database, then there is no use in keeping S1 around in the database. If S1 is deleted, then some other information related to S1 through other FDs may have to be deleted too.

(b) Consider f_2, which stands for MEMBER_MICH(T) -> US_CENTER(P), and which relates the members belonging to the organization in the state of Michigan to the centers that the organization possesses all over the US. Every member in Michigan relates to some center, hence the label "T" assigned to MEMBER_MICH; but some other centers in the US are obviously not present in f_2, hence the label "P" assigned to US_CENTER in f_2. f_2 is represented by the following table.

f_2:  MEMBER_MICH   US_CENTER
      M1            C1
      M2            C1
      M3            C2
      M4            C2

If center C1 is closed, then all the members assigned to that center are either re-assigned, in which case the deletion is immediately followed by an update to write the new center information, or they are released, in which case M1 and M2 are automatically deleted from the database. Some simple side-effects of deletions have thus been illustrated. []

In the next section another source of update cascades will be mentioned. It will then be shown how to handle typical updates and how to identify the FDs affected by a cascade of updates triggered by a single update on attribute values or on a tuple of an FD.*

4.5.3. Update cascades

Typical update side-effects are now indicated. Let H be the union of all the query envelopes. If it happens that H+ = F+, where F is the user-provided set of FDs, then we are done. Else, if H+ is properly contained in F+, then (H U I)+ = F+, where I = F+ - H+. H+ and F+ can easily be derived by applying repeatedly the FD rules of Chapter 3 on H and F respectively. An LR-minimum cover of I can be derived using [MALE 80]'s algorithm or Algorithm Reduce (Section 4.2.1). Let G be the final F-graph which depicts H U I. From now on, it will be assumed that any update is specified with respect to G, and that all "P" and "T" labels are specified with respect to the explicit FDs of G.

*It is beyond the scope of this thesis, however, to address all the intricacies associated with updating. This section only considers the simplest cases of updates. Obviously, the more complex the modeling of user views is, the more delicate the update problems become.
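The deletion behaviour illustrated in the example of Section 4.5.2 can be simulated in a few lines. The sketch below is only an illustration: the dict-based FD representation, the helper name `delete_antecedent`, and the hypothetical FD `f3` on SPOUSE are assumptions of this example, not part of the thesis's formal treatment (which follows in Algorithms DeleteTuple and DeleteValue).

```python
def delete_antecedent(fds, fd_labels, fd_name, value):
    """Delete `value` from the antecedent of FD `fd_name` and cascade.

    fds       -- dict of FD name -> dict(antecedent value -> image value)
    fd_labels -- dict of FD name -> (antecedent_label, range_label)

    When the range of the FD is total ("T"), an image value left with
    no antecedent is orphaned and must be deleted in turn wherever it
    appears as an antecedent value of another FD.
    """
    image = fds[fd_name].pop(value, None)
    if image is None:
        return
    _, r_label = fd_labels[fd_name]
    if r_label == "T" and image not in fds[fd_name].values():
        for other in fds:                     # cascade into other FDs
            if image in fds[other]:
                delete_antecedent(fds, fd_labels, other, image)

# f1: MEMBER(P) -> SPOUSE(T); f3 is a hypothetical FD on SPOUSE.
fds = {"f1": {"M1": "S1", "M2": "S2"},
       "f3": {"S1": "club-A", "S2": "club-B"}}
fd_labels = {"f1": ("P", "T"), "f3": ("P", "T")}
delete_antecedent(fds, fd_labels, "f1", "M1")
# Deleting M1 orphans S1, which cascades into f3.
```

After the call, S1 has been removed both as the image in f1 and as an antecedent in f3, mirroring the example: deleting a member removes the spouse and every piece of information reached through the spouse.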

The typical updates that are considered are:

(a) Add a tuple to the instance of FD f(w,v) for some arc (w,v) in G. If the union of {Att(v_i)} over v_i in pi(w) is {A_1, ..., A_n}, and Att(v) = B, then the newly inserted tuple is taken from DOM(A_1) x ... x DOM(A_n) x DOM(B). This type of update may inherently imply the addition of a value in the antecedent of f, or in its image, or both.

(b) Remove a tuple from the instance of FD f(w,v) for some arc (w,v) in G.

(c) Remove a value under Att(v) for some v, or under the union of {Att(v_i)} over v_i in pi(w) for some w in G.

It was shown earlier that the semantic constraints represented by the "P" and "T" labels can force a single deletion update of type (b) or (c) to trigger further updates in cascade. Similarly, the first universal relation assumption (Assumption UR-1 of Chapter 3) must always be enforced whenever an update of type (a) is made. This is a consideration in the design stage as well as in the operating stage. In the design stage, enforcing Assumption UR-1 serves to identify the set of FDs affected by a cascade. In the operating stage, it serves to define some actual data processing. Typically, the updates occurring in the operating stage are predictable, and therefore the designer is able to determine in advance what the specific side-effects are. For non-scheduled or ad-hoc updating transactions, the tedious task of enforcing the semantic constraints* must take place under the careful supervision of the database administrator. Obviously, if an updating transaction consists of a conditional update, i.e. a potential update preceded by a question, then the "question" part of the transaction may be analyzed individually with the other regular queries. Each type of update is considered below.

*In particular the identification of all the specific updating operations implied by one update of type (a).

4.5.3.1. Adding a tuple to the instance of an FD

Let f(w_s,v_s) stand for { union of {Att(v_k)} over v_k in pi(w_s) -> Att(v_s) } = { A_1, A_2, ..., A_n -> A_m }. A new tuple of f(w_s,v_s) is added. It is of the form d_s = (d_1, ..., d_n, d_m), where d_k is in DOM(A_k). This type of update is the most laborious because of the potential disruption of the "uniqueness" assumption (Assumption UR-1, Chapter 3). More specifically, as a new tuple of f(w_s,v_s) is added, the uniqueness assumption may be endangered whenever f(w_s,v_s) belongs to at least one doublet. The principle behind the correction of any trouble is rather simple; it may be stated as follows:

Let (w_s,v_s) be the arc that corresponds to the FD f(w_s,v_s) for which a new tuple is to be added. For each doublet (L_j+, L_j-), each ending at node v_t, to which the arc (w_s,v_s) belongs, check whether, for the tuple d_s added in L_j+, the image d_t in DOM(Att(v_t)) is the same along L_j+ and L_j-, and the inverse image relation, say r_s, is the same along L_j+ and L_j-. If not, the update is not correct. If d_s has an image d_t and an inverse image r_s, augment L_j- such that the image of r_s in DOM(Att(v_t)) along L_j- is d_t too.

Using the notation of Chapter 3, the translation of this action becomes quite laborious. It appears below, after some notational definitions. Let Gamma = (C, C') denote a cutset of the proper link L_j+(w_s,v_t), which itself belongs to a doublet (L_j+, L_j-). We write:

X_C  = union of {Att(v)} over v in pi(w), w in C
X_C' = union of {Att(v)} over v in C'
X_Gamma = X_C U X_C'
X_s  = union of {Att(v_i)} over v_i in pi(w_s)

X(L_j+) = union of {Att(v)} over v in V(L_j+).

In the following algorithm, the notation I<X,Gamma> denotes the largest relation in I<X>.

ALGORITHM AddTuple:
/* Add a tuple d_s of f(w_s,v_s) */
ALTER := {Gamma_s}
Current := ALTER  /* Cutsets still to be visited */
/* Gamma_s = (w_s,v_s) if (w_s,v_s) is a cutset of L_j+; otherwise Gamma_s is any cutset of L_j+ which contains (w_s,v_s) */
for all Gamma_r in Current while Current != {} do
begin
   Current := Current - {Gamma_r}
   for all (L_j+, L_j-) : L_j+ contains Gamma_r do
   begin
      /* Trace forward in L_j+ */
      d_t := { I<X(L_j+),Gamma_r>[ X_Gamma_r U {Att(v_t)} ; X_Gamma_r = d_s ] }[Att(v_t)]
      if d_t = {} then done
      else begin
         /* Trace backwards in L_j+ */
         r_s := { I<X(L_j+),Gamma_r>[ X_s U X_Gamma_r ; X_Gamma_r = d_s(X_Gamma_r) ] }[X_s]
         if r_s = {} then done
         else begin
            /* Now trace back from v_t to w_s on L_j- */
            k := 0  /* An index to identify cutsets on L_j- */
            r[k] := {d_t} ; X_k := Att(v_t)
            X_{k+1} := {Att(v) : v in pi(w) & w in Gamma(v_t)} restricted to L_j-
            while X_k != X_s and r[k] != {} do
            begin
               k := k + 1  /* Next cutset to visit */
               r[k] := { I<X(L_j-),Gamma_k>[ X_k U X_{k-1} ; X_{k-1} = d & d in r[k-1] ] }[X_k]
               X_{k+1} := {Att(v) : v in pi(w) & w in Gamma(V_k)} restricted to L_j-
            end
            if r_s in r[k] then done
            else begin
               /* No antecedent for some cutset; k = Length(L_j-) */
               while X_k != Att(v_t) do
               begin
                  k := k - 1
                  s[k] := { I<X(L_j-),Gamma_k>[ X_k U X_{k+1} ; X_{k+1} = d & d in s[k+1] ] }[X_k]
                  for all A_k in X_k do
                     if (s[k])[A_k] = {} then
                     begin
                        q[k] := r[k] - s[k]
                        modify f(w,v) : Att(v) = A_k such that each tuple in s[k] which
                        features a null value is mapped to a tuple in q[k]
                        /* ... user controlled */
                        ALTER := ALTER U {Gamma : (w,v) in Gamma}
                        Current := Current U {Gamma : (w,v) in Gamma}
                     end
               end
            end
         end
      end
   end
end
end AddTuple.

Note that all the proper links may be determined with Algorithm Distinguish (Section 4.4.3.1), as soon as the modified F-graph depicting H is completed. The side-effects due to a deletion have been mentioned in [HOUS 79]. Typically, they take place whenever an affected FD does not feature a "P" label for both its antecedent and its image. Again, the translation of simple deletion cascades into pseudocode is quite lengthy. The next two sections deal with deletion.

4.5.3.2. Removing a tuple from the instance of an FD

A tuple d_s = (d_1, ..., d_n, d_m) is to be deleted from the database. The following steps are taken:

ALGORITHM DeleteTuple:
/* Remove a tuple d_s of f(w_s,v_s) */
ALTER := {(w_s,v_s)}
if LABEL[(w_s,v_s)] = (T,T) then
begin
   DeleteValue(<d_1, ..., d_n>)
   if for all t in f(w_s,v_s), t(Att(v_s)) != d_m then DeleteValue(d_m)
end
else if LABEL[(w_s,v_s)] = (T,P)
then DeleteValue(<d_1, ..., d_n>)
else if LABEL[(w_s,v_s)] = (P,T) then
begin
   if for all t in f(w_s,v_s), t(Att(v_s)) != d_m then DeleteValue(d_m)
end
end DeleteTuple.

4.5.3.3. Removing a value from the antecedent or range of an FD

ALGORITHM DeleteValue:
/* Given d = <d_1, ..., d_n> in DOM(A_1) x ... x DOM(A_n) and ALTER, possibly empty */
Possible := {}  /* Possible set of FDs in a cascade */
for all f(w_j,v_j) : {A_1, ..., A_n} is contained in the union of {Att(v)} over v in pi(w_j) do
begin
   /* Delete an image value */
   Possible := Possible U {(w_j,v_j)}
   if LABEL[(w_j,v_j)] = (T,T) or LABEL[(w_j,v_j)] = (P,T) then
   begin
      if for all t in f(w_k,v_j) for some k, t(Att(v_j)) != d_m,
         where d_m is the image of d in f(w_j,v_j), then
      begin
         DeleteValue(d_m)
         ALTER := ALTER U {(w_j,v_j)}
      end
   end
end
for all f(w_j,v_j) in Possible do
begin
   /* Delete an antecedent value */
   if LABEL[(w_j,v_j)] = (T,T) or LABEL[(w_j,v_j)] = (T,P) then
   begin
      for all f(w_j,v_t) in Possible : t[A_1, ..., A_n] = d do
      begin
         ALTER := ALTER U {(w_j,v_j)}
         DeleteValue(d)
      end
   end
end
end DeleteValue.

For any update, the FDs in H which are affected by a cascade are the FDs represented by the elements of the set ALTER. In the next section the problem addressed in this chapter is summarized, together with the proposed solution methodology.
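The case analysis that drives DeleteTuple reduces to a small decision table on the (antecedent, range) label pair. A sketch of just that dispatch follows; the function name and the action strings are illustrative assumptions of this example.

```python
def delete_tuple_actions(label):
    """Return the follow-up deletions triggered by removing a tuple,
    given the (antecedent, range) label pair of the FD, following the
    case analysis of Algorithm DeleteTuple.

    "antecedent" -- the antecedent values of the removed tuple must be
                    deleted from the database (via DeleteValue).
    "image"      -- the image value must be deleted as well, provided
                    no remaining tuple of the FD still maps to it.
    """
    a_label, r_label = label
    actions = []
    if a_label == "T":
        actions.append("antecedent")
    if r_label == "T":
        actions.append("image")
    return actions
```

A (P,P) labelled FD thus triggers no cascade at all, which matches the earlier observation that side-effects occur only when an affected FD lacks a "P" label on its antecedent or its image.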

Figure 4.7. An F-graph with "partial" and "total" labels.

4.6. Summary of PART 1: From users' FDs to envelopes

The design methodology of PART 1 for the distributed database design problem (as defined in Section 3.3) is now summarized in the form of a design algorithm named Selection. Each instruction in the forthcoming Algorithm Selection is numbered for easy referencing, and is of the form:

(i) Output list <- Algorithm[+Algorithm+ ...][Input list],

where "(i)" is the reference index. The items in an input or output list are separated by commas. Symbols preceded by a double-dagger sign "#" are understood to be user-provided inputs. If several algorithms are used, they are separated by plus signs "+". Most of the algorithms have been defined in pseudocode in the course of Chapter 4. Some other algorithms will simply be defined in words, because they involve actions which can be trivially realized. Symbols which have not yet been defined will be defined following the presentation of Selection, or in Chapter 5. Algorithm Selection is shown below and is followed by specific remarks on each instruction.

DESIGN Selection: /* PART 1 of the distributed database design */
begin /* Step-1: Initial processing of the user F-graph */
   (0) Given #F and #U, construct GF
   (1) GF' <- Reduce+ReachAll[GF]
   (2) GF* <- NoLoss[GF']
end
for all Qtk do
begin /* Step-2: Extract the minimum information */
   (3) Source <- GetSources[GF*, #Ttk]
   (4) NearestSource <- GetNearestSource[GF*, #Ttk, Source]
   for all ws in NearestSource do
      (5) L(ws,T) <- GetLink[GF*, #Ttk, NearestSource]
   /* Step-3: Find envelope */
   (6) OMEGAtk <- LocalProc[#OMEGA, #Ctk]
   for all ws in NearestSource do
      (7) L*(ws,T) <- MinPropLink[L(ws,T)]
   ws : TotalLength(L*(ws,T)) = min {TotalLength(L*(w,T)) : w in NearestSource}
   (8) (Y,L) <- Metagraph[L(ws,T)] /* ... Note that the original link L is used, rather than L* */
   for all L = [(.,w); (V,.)] in L do
      (9) Size(w,V) <- SizeEstimate[(Y,L), OMEGAtk, #OMEGA]
   (10) ACCtk <- OutTree[(Y,L), for all L in L : Size(L)]
   (11) PARtk <- AllPartitionsOf[ACCtk]
   (12) CLUtk^p <- AllClustersOf[p in PARtk]
end
(13) H <- union of [ACCtk, for all Qtk]; construct GH, the F-graph of H
for all Qtk do
begin /* Step-4: Find distributed query processing data-volumes */
   for all p in PARtk do
   begin
      (14) query_tk^p <- QuerySpec[OMEGAtk, for all L in L : Size(L), Ctk, #vtk, ACCtk, p in PARtk, CLUtk^p]
      (15) LAMBDAtk^{p,q}, LAMBDAtk^{p,q,r}, LAMBDAtk^{p,r,q} <- Algorithm H [CHUN 83][query_tk^p], for all q,r in CLUtk^p
   end
end
for all Usk do
begin /* Step-5: Find cascade */
   (16) DELsk <- UpdateSpec[GH, #Tsk]
   (17) WRTsk, for all h in WRTsk : Bsk^h <- AddTuple+DeleteTuple+DeleteValue+Distinguish[GH, #beta, #vsk, #TYPE, DELsk]
end
end Selection.

The following comments apply respectively to the instructions in Selection:

(0) F and U are respectively the user-provided set of FDs and the universe of attributes. GF is the F-graph which depicts the FDs of F.

(1) Algorithm Reduce can be found in Section 4.2.1. Algorithm ReachAll has been seen in Section 4.3. Note that ReachAll is used to determine the set Generate of reachable v-nodes, referred to as V in CONDITION a-4 of Reduce.

(2) Algorithm NoLoss can be found in Section 4.2.3. It includes a covering algorithm such as the one reported in [ETCH 77]. GF* is a lossless, fully reduced F-graph.

(3) Ttk is defined in Section 3.2. Algorithm GetSources is shown in Section 4.3. It calls Algorithm ReachAll.

(4) See Section 4.3.

(5) T stands for the set of target nodes which have an index equal to one in Ttk. Algorithm GetLink can be found in Section 4.3.

(6) OMEGA is a set of user-provided statistical information and is defined as:

OMEGA = { |f|, |f[A]|, lambda_A, |DOM(A)|, for all f in F & A in U },

where lambda_A denotes the width of attribute A. All the notations have been defined in Section 4.4.2. Any query Qtk is expressed in a query language with respect to the universal relation scheme <U,F>. A query specification is of the form:

Qtk: FIND (A_1, ..., A_m)
Ctk: [WHERE (A_i1 = c_i1)]
     [AND (A_i2 = c_i2)]
     ...
     [AND (A_in = c_in)]

where A_i is in U and c_i is in DOM(A_i). The square brackets enclose an optional qualification clause. All the attributes mentioned have an index equal to one in Ttk. Ctk is the set of qualification clauses for query Qtk. OMEGAtk is the statistical information for the universal relation obtained from OMEGA after initial processing of the FDs in F in order to satisfy the qualification clauses of Ctk. Thus, LocalProc denotes the action of processing

the initial FDs of F by eliminating unwanted tuples:

OMEGAtk = { |ftk|, |ftk[A]|, lambda_A, for all f in F & A in U },

where ftk is f processed under Ctk using the selection relational operation, if it is possible to do so.

(7) Algorithm MinPropLink can be found in Section 4.4.4.1. In this algorithm, TotalLength denotes the length of L*(ws,T).

(8) Algorithm Metagraph can be found in Section 4.4.4.

(9) Size(w,V) is the same as Size(w,T) in Algorithm SizeEstimate in Section 4.4.4.1, where ws would be equal to w and T would be equal to V.

(10) ACCtk (for "access") stands for the set of FDs in the envelope assigned to query Qtk; this notation will be carried on in Chapters 5 and 6.

(11) This step and the next one will become clear in Chapter 5, where the sets PARtk and CLUtk^p are defined. In short, PARtk is a set of indexes where each index corresponds to a different partition of ACCtk.

(12) See Chapter 5. CLUtk^p is the partition of ACCtk which corresponds to index p in PARtk. Each element of CLUtk^p is called a cluster and may be visualized as one single relation.

(13) Self-explanatory.

(14) Given a partition CLUtk^p of ACCtk, let:

CLUtk^p = {q_1, ..., q_m} -> {R_1, ..., R_m},

where each cluster q_i is mapped to one relation R_i. Each relation R_i denotes a join of the FDs in q_i contained in ACCtk, i.e.:

R_i = join of the f_j for some cluster q_i = {f_1, ..., f_l} contained in ACCtk, with the union of the q_i equal to ACCtk and q_i intersect q_j = {} for i != j in {1, ..., m}.

The input list includes Size(L) for any L in L; therefore the size of each FD in ACCtk is known. Hence, the

size of each relation R_i may be computed using the technique of Section 4.4.1. Assuming that relations R_1, ..., R_m are located at different sites of a computer network, query Qtk may be processed as a distributed query using [CHUN 83]'s Algorithm H. query_tk^p denotes the input file of Algorithm H [CHUN 83] which represents query Qtk to be processed as a distributed query on the distinct relations R_1, ..., R_m.

(15) The symbols LAMBDAtk^{p,q}, LAMBDAtk^{p,q,r} and LAMBDAtk^{p,r,q} are data-volume rates and are defined in Chapter 5. All data-volumes take into account the frequency vtk of the query.

(16) For an update Usk, UpdateSpec refers to the identification of the FDs in H which possess one or more attributes with an index of one in Tsk.

(17) The TYPE is either "ADD" or "DELETE". The volume of one update message is assumed to be beta, and the volume of one piece of data needed for the update of one FD h is indicated in Algorithms AddTuple, DeleteTuple and DeleteValue in Section 4.5. Algorithm Distinguish is shown in Section 4.4.3.1. WRTsk denotes all the FDs updated in cascade by Usk, as given by the set ALTER in AddTuple, DeleteTuple and DeleteValue. Bsk^h denotes the data and message volume required to update h in WRTsk.

Each FD in H can be mapped to a CRDU. The next chapter will address PART 2 of the distributed database design problem: the distribution of CRDUs. This chapter is closed with an example.

4.7. Example

Consider the MRDB (medical record database) of Appendix C. The set of user-provided FDs is shown in Table C.1. It is shown below how to apply the design procedure Selection to MRDB. Only queries Qt,6 and Qt,7 are considered. Design procedure Selection is stepped through instruction by instruction:

(0) F is the user-provided set of FDs shown in Table C.1. GF is the F-graph which depicts F (Figure 4.8). The set of attributes U is represented by the v-nodes in GF.

(1) Applying Reduce on GF:

- By RULE 1, w-nodes w1, w2, w3, w4, w6 and w8 are combined into one w-node w1. Similarly, w-nodes w12 and w14 are combined into w12.
- RULE 2 need not be applied.
- By RULE 3, w13, (vd,w13), (ve,w13) and (w13,vf) are deleted.
- By RULE 4 (see Figure 4.1-ii) all transitive redundancies are eliminated. Therefore, (w1,vg), (vb,w9), w9, (w9,vf) and (w9,vi) are deleted. Also by RULE 4, arc (ve,w5) is deleted (Figure 4.2). Finally, using the notation employed in RULE 4, let vT = vg, wI = w7 and wJ = w11, with pi(wI) = {va, vi} and pi(wJ) = {vc, vd, ve}. Hence VI = {va} and VJ = {}. Therefore (wI,vT) = (w7,vg), w7, (vi,w7) and (va,w7) are deleted. See Figure 4.3.

At this step the fully reduced F-graph is GF', displayed in Figure 4.9.

(2) Since v-node va (labelled with attribute ORDER) is such that va >= v for any v in V, GF* = GF', i.e. the graph is lossless and there is no need to apply NoLoss.

(3) Let query Qt,6 be expressed as:

Qt,6: FIND (ORDER, PATIENT_NO, NAME)
Ct,6: WHERE WARD = 'Outpatient'.

The target list of query Qt,6 is {ORDER, PATIENT_NO, NAME, WARD}. The corresponding set of target nodes is T = {va, vd, vi, vk}. By applying Algorithm GetSources, it is found that Source = {w1}.

(4) By applying Algorithm GetNearestSource, it is found that NearestSource = {w1}, i.e. ws = w1.

(5) The link found by GetLink for Qt,6 is L(ws,T) = L(w1, {va, vd, vi, vk}). Link L(ws,T) is depicted in Figure 4.10.

(6) The link found in (5) for query Qt,6 depicts the following FDs:

f1,b: ORDER -> SERVICE
f1,d: ORDER -> PATIENT_NO
f8,f: SERVICE -> PHYSICIAN
f12,i: PATIENT_NO -> NAME
f10,k: PHYSICIAN -> WARD
f16,k: NAME -> WARD.

The statistical information OMEGA is contained in Tables C.4, C.5, C.6 and C.7. When the base FDs of GF, and in particular the FDs depicted in L(ws,T), are restricted by qualification clause Ct,6 (WARD = 'Outpatient'), the statistics are modified and become those of Tables 4.1 and 4.2 (where only the base FDs of L(ws,T) are kept).

FD      |f| tuples   bytes
f1,b    5075         9
f1,d    5075         8
f8,f    49           20
f12,i   84           19
f10,k   23           20
f16,k   84           20

Table 4.1. Cardinalities and widths of the FDs in the envelope of query Qt,6.

The process of calculating or estimating this new set of statistical information OMEGAt,6 has been referred to as LocalProc in Selection. Typically, LocalProc can consist in actually performing the relational algebra operation "selection" implied by Ct,6 directly on a centralized version of the MRDB database, or in multiplying the original statistics by an estimated user-provided ratio associated with Ct,6. Apart from |DOM(WARD)| = 1, the remaining statistics on widths of attributes and cardinalities of domains are not affected by LocalProc.

(7) There is only one possible source node ws = w1; therefore this step leads to the proper link L*(ws,T) of Figure 4.11.

FD      ORDER   SERVICE   PATIENT_NO   PHYSICIAN   NAME   WARD
f10,k                                  3                  1
f16,k                                              12     1

Table 4.2. Cardinalities of the FDs, in the envelope of Qt,6, projected on each referenced attribute.

(8) Applying Algorithms Metagraph and MinPropLink on L(ws,T), the metagraph of Figure 4.12 is obtained. For this metagraph, the arcs are defined as follows:

Ls1 = [<va,w1>; <vd,w12>]  -> {va, (va,w1), w1, (w1,vd), vd}
L12 = [<vd,w12>; <vi,w16>] -> {vd, (vd,w12), w12, (w12,vi), vi}
L2t = [<vi,w16>; <vk,->]   -> {vi, (vi,w16), w16, (w16,vk), vk}
Ls2 = [<va,w1>; <vi,w16>]  -> {va, (va,w1), w1, (w1,vd), vd, (vd,w12), w12, (w12,vi), vi}
Lst = [<va,w1>; <vk,->]    -> L*(ws,T)
L1t = [<vd,w12>; <vk,->]   -> {vd, (vd,w12), w12, (w12,vi), vi, (vi,w16), w16, (w16,vk), vk}.

Further, the arcs in L are associated with implicit or explicit FDs as follows:

Ls1 <-> f1,d: ORDER -> PATIENT_NO
L12 <-> f12,i: PATIENT_NO -> NAME
L2t <-> f16,k: NAME -> WARD
Ls2 <-> f1,i: ORDER -> NAME
Lst <-> f1,k: ORDER -> WARD

L1t <-> f12,k: PATIENT_NO -> WARD.

(9) It is understood that all FDs have been restricted by LocalProc by virtue of qualification clause Ct,6. The sizes of the FDs in (8) are calculated by SizeEstimate as follows:

Size(w1,vd) = 5075 x 8 = 40,600 bytes.
Size(w12,vi) = 84 x 19 = 1,596 bytes.
Size(w16,vk) = 84 x 20 = 1,680 bytes.

Size(w1,vi): |f1,d [PATIENT_NO] f12,i| = 5075 x 84 / 126 = 3,383 tuples; |f1,b[ORDER]| x |f12,i[NAME]| = 5075 x 84 > 3,383, therefore

Size(w1,vi) = |(f1,d [PATIENT_NO] f12,i)[ORDER,NAME]| x (4 + 15) = 3,383 x 19 = 64,277 bytes.

Size(w1,vk): |(f1,d [PATIENT_NO] f12,i)[NAME] f16,k| = 5075 x 84 x 12 / (126 x 126) = 322 tuples; |f1,b[ORDER]| x |f16,k[WARD]| = 5,075 x 1 = 5,075 > 322, therefore

Size(w1,vk) = |((f1,d [PATIENT_NO] f12,i)[NAME] f16,k)[ORDER,WARD]| x (4 + 5) = 322 x 9 = 2,898 bytes.

Size(w12,vk): |f12,i [NAME] f16,k| = 84 x 12 / 126 = 8 tuples.

|f12,i[NAME]| x |f16,k[WARD]| = 84 x 1 = 84 > 8, therefore Size(w12,vk) = 8 x (4 + 5) = 72 bytes.

(10) Metagraph (Y,L) is shown in Figure 4.13 with the weights of all the arcs. (Note that the weight of an arc in L is kept to be the size of the FD that L stands for, rather than a large number minus that size.) The application of Algorithm OutTree on (Y,L) is straightforward, and the result is shown in thick lines in Figure 4.13. Note that arc Ls1, representing FD ORDER -> PATIENT_NO, must be included in any feasible solution because of the lossless join requirement. The envelope for Qt,6 is:

ACCt,6 = {f1,d, f12,i, f12,k}.

(11-12) All the feasible partitions of ACCt,6 are shown in Table 4.3. PARt,6 = {1, 2, 3, 4, 5} and, for example, q = 2 in CLUt,6^3 corresponds to the relation scheme <{PATIENT_NO, NAME, WARD}, {PATIENT_NO -> NAME, NAME -> WARD}>.

(14) For each partition of ACCt,6 a distributed query processing algorithm such as [CHUN 83]'s is run. Appendix F shows query_t,6^2, the input file for "Algorithm H" [CHUN 83] corresponding to partition p = 2 of ACCt,6.

(15) "Algorithm H" [CHUN 83] is then run with respect to relation schemes <ORDER, PATIENT_NO, NAME> and <PATIENT_NO, WARD>, since Table 4.3 indicates that these are the clusters to consider for partition p = 2. For this example, the output is:

LAMBDAt,6^{2,1} = 88.936 kilobits x vt,6.
LAMBDAt,6^{2,2} = 0.864 kilobits x vt,6.

p   q : cluster in CLUt,6^p
1   1 <ORDER, PATIENT_NO, NAME, WARD>
2   1 <ORDER, PATIENT_NO, NAME>
    2 <PATIENT_NO, WARD>
3   1 <ORDER, PATIENT_NO>
    2 <PATIENT_NO, NAME, WARD>
4   1 <ORDER, PATIENT_NO, WARD>
    2 <PATIENT_NO, NAME>
5   1 <ORDER, PATIENT_NO>
    2 <ORDER, NAME>
    3 <PATIENT_NO, WARD>

Table 4.3. Possible partitions of the referenced attributes for query Qt,6.

LAMBDAt,6^{2,1,2} = 0.464 kilobits x vt,6.
LAMBDAt,6^{2,2,1} = 0.

For example, if site 3 is the origin of the query, then t = 3, and from Table C.3, v3,6 = 3 (1/minute). Therefore:

LAMBDA3,6^{2,1} = 266.808 kilobits/minute.
LAMBDA3,6^{2,2} = 2.592 kilobits/minute.
LAMBDA3,6^{2,1,2} = 1.392 kilobits/minute.
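The greedy selection of step (10) can be reproduced mechanically. In the sketch below (the node names s, n1, n2, t are shorthand for the metagraph nodes <va,w1>, <vd,w12>, <vi,w16> and <vk,->, an assumption of this illustration), maximizing the weight S - size is equivalent to scanning the arcs in order of increasing size and rejecting any arc whose head node already has an incoming arc:

```python
def out_tree(arcs):
    """Greedy maximum-weight spanning out-tree (Algorithm OutTree).

    arcs: list of (tail, head, size) triples; since every weight is a
    common large number S minus the FD size, the weightiest remaining
    arc is the one of smallest size, and an arc is rejected only if
    its head already has a chosen in-arc (one in-arc per node of Y).
    """
    chosen, heads = [], set()
    for tail, head, size in sorted(arcs, key=lambda a: a[2]):
        if head not in heads:
            chosen.append((tail, head))
            heads.add(head)
    return chosen

# Arc sizes from step (9) / Figure 4.13.
arcs = [("s", "n1", 40600),   # Ls1: ORDER -> PATIENT_NO
        ("n1", "n2", 1596),   # L12: PATIENT_NO -> NAME
        ("n2", "t", 1680),    # L2t: NAME -> WARD
        ("s", "n2", 64277),   # Ls2: ORDER -> NAME
        ("s", "t", 2898),     # Lst: ORDER -> WARD
        ("n1", "t", 72)]      # L1t: PATIENT_NO -> WARD
print(sorted(out_tree(arcs)))
# -> [('n1', 'n2'), ('n1', 't'), ('s', 'n1')], i.e. the envelope
#    {f1,d, f12,i, f12,k} found in step (10).
```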

Let steps (3) through (12) be repeated for query Qt,7.

(3) Query Qt,7 is expressed as:

Qt,7: FIND (RESULT, ORDER, TEST)
Ct,7: WHERE NAME = 'Rowlands' AND DATE = '06-20-83'.

The target list of Qt,7 is {RESULT, TEST, ORDER, NAME, DATE}. The set of target nodes is T = {va, vc, ve, vg, vi}. By applying Algorithm GetSources, Source is found to be {w1}.

(4) ws = w1, since this is the only source node that leads to all the nodes in T.

(5) GetLink produces L(ws,T) = L(w1, {va, vc, ve, vg, vi}). L(ws,T) for query Qt,7 is shown in Figure 4.14.

(6) The link for Qt,7 depicts the following FDs:

f1,c: ORDER -> TEST
f1,d: ORDER -> PATIENT_NO
f1,e: ORDER -> DATE
f11,g: TEST, PATIENT_NO, DATE -> RESULT
f12,i: PATIENT_NO -> NAME.

When those FDs are restricted by the qualification clause Ct,7 (NAME = 'Rowlands' AND DATE = '06-20-83'), the statistics of OMEGA are modified by LocalProc as shown in Tables 4.4 and 4.5. Also |DOM(DATE)| = 1, |DOM(NAME)| = 1, and |DOM(TEST, PAT_NO, DATE)| = 150.

(7) This step is not relevant.

FD      |f| tuples   bytes
f1,c    5075         14
f1,d    5075         8
f1,e    318          12
f11,g   324          44
f12,i   1            19

Table 4.4. Cardinalities and widths of the FDs belonging to the envelope of query Qt,7.

|f[A]| tuples, where A is ...
FD      ORDER   TEST   PAT_NO   DATE   RESULT   NAME
f1,c    5075    76
f1,d    5075           84
f1,e    5075                    632
f11,g           70     15       1      5432
f12,i                                           1

Table 4.5. Cardinalities of the FDs, in the envelope of query Qt,7, projected on each referenced attribute.

(8) The metagraph of Figure 4.15 is obtained.

The arcs are defined as follows:

Ls1 = [<va,w1>; <{vc,vd,ve},w11>] -> {va, (va,w1), w1, (w1,vc), (w1,vd), (w1,ve), vc, vd, ve}
Ls2 = [<va,w1>; <vd,w12>]         -> {va, (va,w1), w1, (w1,vd), vd}
L12 = [<{vc,vd,ve},w11>; <vd,w12>] -> {vd}
L13 = [<{vc,vd,ve},w11>; <vg,->]  -> {vc, vd, ve, (vc,w11), (vd,w11), (ve,w11), w11, (w11,vg), vg}
L24 = [<vd,w12>; <vi,->]          -> {vd, (vd,w12), w12, (w12,vi), vi}
Ls3 = [<va,w1>; <vg,->]           -> {va, (va,w1), w1, (w1,vc), (w1,vd), (w1,ve), vc, vd, ve, (vc,w11), (vd,w11), (ve,w11), w11, (w11,vg), vg}
Ls4 = [<va,w1>; <vi,->]           -> {va, (va,w1), w1, (w1,vd), vd, (vd,w12), w12, (w12,vi), vi}.

Further, the arcs in L are associated with implicit or explicit FDs as follows:

Ls1 <-> f1,c: ORDER -> TEST
        f1,d: ORDER -> PATIENT_NO
        f1,e: ORDER -> DATE
Ls2 <-> f1,d: ORDER -> PATIENT_NO

L12 <-> f11,d: TEST, PATIENT_NO, DATE -> PATIENT_NO
L13 <-> f11,g: TEST, PATIENT_NO, DATE -> RESULT
L24 <-> f12,i: PATIENT_NO -> NAME
Ls3 <-> f1,g: ORDER -> RESULT
Ls4 <-> f1,i: ORDER -> NAME.

(9) The sizes of the FDs in (8) are calculated by SizeEstimate as follows (the subscripts (t,k) = (t,7) are omitted in the notation):

Size(w1,{vc,vd,ve}): |f1,c [ORDER] f1,d| = 5075 x 5075 / 5075 = 5075 tuples. |(f1,c [ORDER] f1,d)[ORDER] f1,e| = 5075 x 318 / 5075 = 318 tuples. Therefore |((f1,c [ORDER] f1,d)[ORDER] f1,e)[ORDER, TEST, PAT_NO, DATE]| = 318 tuples, and Size(w1,{vc,vd,ve}) = 318 x (4 + 10 + 4 + 8) = 8,268 bytes.

Size(w1,vd) = 5075 x 8 = 40,600 bytes.
Size(w11,vd) = 0 bytes.
Size(w11,vg) = 324 x 44 = 14,256 bytes.
Size(w12,vi) = 1 x 19 = 19 bytes.

Size(w1,vg): |((f1,c [ORDER] f1,d)[ORDER] f1,e)[TEST, PAT_NO, DATE] f11,g| = 318 x 324 / 150 = 687 tuples. |f1,c[TEST]| x |f1,d[PAT_NO]| x |f1,e[DATE]| = 76 x 84 x 632 > 318, and 318 x 5432 > 687, therefore:

121

Size(w1, vg) = 687 × (10 + 4 + 8 + 22) = 30,228 bytes.

Size(w1, vi):
|f1,d [PATIENT_NO] f12,i| = (5075 × 1) / 126 = 40 tuples.
|f1,d[ORDER]| × |f12,i[NAME]| = 5075 × 1 > 40, therefore Size(w1, vi) = 40 × (4 + 15) = 760 bytes.

(10) Metagraph (Y,L) is shown in Figure 4.16 with the weights (sizes) of all the arcs. The envelope for Qt,7 is:

ACCt,7 = {f1,c, f1,d, f1,e, f11,g, f11,d, f12,i}.

Strictly speaking, f1,c, f1,d and f1,e are three distinct FDs; however, since they have the same key, they may be combined into their natural join over attribute ORDER. f1,cde denotes the join of f1,c, f1,d and f1,e over ORDER. Note also that f11,d is a "dummy" FD, and therefore the envelope of Qt,7 really is:

ACCt,7 = {f1,cde, f11,g, f12,i}.

(11-12) All the feasible partitions of ACCt,7 are shown in Table 4.6. For example, q = 3 in CLUt,7^4 corresponds to the relation scheme <{PAT_NO,NAME}, {PAT_NO → NAME}>.

An update example is now worked out. Let the update order be: Us,2: ADD <PATIENT_NO,NAME> = <7785,Smith>. This is identical to update 2 in Appendix C, without reference to attribute AGE. Suppose the following information has previously been added: <ORDER,TEST,RESULT,DATE> = <18,Glucose,70,06-21-83>, and <NAME,WARD> = <Smith,Outpatient>.
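The SizeEstimate arithmetic in step (9) follows a selectivity-style estimate, |R join_A S| ≈ |R|·|S| / V, where V is the number of distinct joining values. The sketch below reproduces the Qt,7 numbers under that assumption; the helper names are invented, and the thesis's actual SizeEstimate procedure is the one defined earlier in Chapter 4.

```python
# Sketch of the selectivity-based size estimates used in step (9).
# Assumption: |R join_A S| ~ |R| * |S| / V, with V the number of distinct
# values of the joining attribute(s) A. Names are illustrative only.

def est_join_card(card_r, card_s, distinct_a):
    """Estimated cardinality of R joined with S over attribute(s) A."""
    return round(card_r * card_s / distinct_a)

def size_bytes(card, widths):
    """Volume of a relation: cardinality times total tuple width."""
    return card * sum(widths)

# Reproducing the Qt,7 numbers:
t1 = est_join_card(5075, 5075, 5075)    # f1,c [ORDER] f1,d -> 5075 tuples
t2 = est_join_card(t1, 318, 5075)       # ... [ORDER] f1,e  -> 318 tuples
print(size_bytes(t2, [4, 10, 4, 8]))    # Size(w1,{vc,vd,ve}) = 8268 bytes
t3 = est_join_card(318, 324, 150)       # ... [TEST,PAT_NO,DATE] f11,g
print(t3, size_bytes(t3, [10, 4, 8, 22]))   # 687 tuples, 30228 bytes
```

The same two helpers also reproduce Size(w1, vi) = 40 × (4 + 15) = 760 bytes.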

122

p ∈ PARt,7   q ∈ CLUt,7^p   Cluster scheme
1            1              <ORDER,TEST,PAT_NO,DATE,RESULT,NAME>
2            1              <ORDER,TEST,PAT_NO,DATE,RESULT>
             2              <PAT_NO,NAME>
3            1              <ORDER,TEST,PAT_NO,DATE>
             2              <TEST,PAT_NO,DATE,RESULT,NAME>
4            1              <ORDER,TEST,PAT_NO,DATE,NAME>
             2              <TEST,PAT_NO,DATE,RESULT>
             3              <PAT_NO,NAME>

Table 4.6. Possible partitions of the attributes referenced by query Qt,7.

Assume that the F-graph of Figure 4.9 depicts the final FDs to be distributed, i.e., the FDs of H, the union of all the query envelopes.

(16) DELs,2 = {(w12,vi)}, since (w12,vi) represents FD PATIENT_NO → NAME, which itself features the attributes referenced by Us,2.

(17) Initially, ALTER = {(w12,vi)}.

L*+ = {va, (va,w1), w1, (w1,vd), vd, (vd,w12), w12, (w12,vi), vi, (vi,w15), w15, (w15,vk), vk}
L*- = {va, (va,w1), w1, (w1,vb), vb, (vb,w8), w8, (w8,vf), vf, (vf,w10), w10, (w10,vk), vk}.

Note that both L*+ and L*- are proper links from va to vk.

123

{(w12,vi)} ⊆ L*+.

dt = { t ∈ <ORDER,PAT_NO,NAME,WARD>[PAT_NO,NAME,WARD] : t[PAT_NO,NAME] = <7785,Smith> } [WARD] = Outpatient.

rs = { t ∈ <ORDER,PAT_NO,NAME,WARD>[ORDER,PAT_NO] : t[PAT_NO] = 7785 } [ORDER] = {18}.

k = 0, d0 = Outpatient, r[0] = {Outpatient}, X0 = WARD, X1 = PHYSICIAN.

r[1] = { t ∈ <ORDER,SERVICE,PHYSICIAN,WARD>[PHYSICIAN,WARD] : t[WARD] = Outpatient } [PHYSICIAN] = {Dr. Akers}.

r[2] = { t ∈ <ORDER,SERVICE,PHYSICIAN,WARD>[SERVICE,PHYSICIAN] : t[PHYSICIAN] = Dr. Akers } [SERVICE] = {S-1, S-3}.

r[3] = { t ∈ <ORDER,SERVICE,PHYSICIAN,WARD>[ORDER,SERVICE] : t[SERVICE] ∈ {S-1, S-3} } [ORDER] = {05, 07, 10, 15}.
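The steps r[1], r[2], r[3] above are a chain of select-project operations over the <ORDER,SERVICE,PHYSICIAN,WARD> relation. A sketch with an invented relation instance consistent with the worked values:

```python
# Backward navigation of the cascade check: each step selects the tuples
# matching the previous result and projects the next attribute.
# The relation instance below is invented, but consistent with the
# values worked out in the text.

R = [  # <ORDER, SERVICE, PHYSICIAN, WARD>
    ("05", "S-1", "Dr. Akers", "Outpatient"),
    ("07", "S-1", "Dr. Akers", "Outpatient"),
    ("10", "S-3", "Dr. Akers", "Outpatient"),
    ("15", "S-3", "Dr. Akers", "Outpatient"),
]

r1 = {t[2] for t in R if t[3] == "Outpatient"}   # project PHYSICIAN
r2 = {t[1] for t in R if t[2] in r1}             # project SERVICE
r3 = {t[0] for t in R if t[1] in r2}             # project ORDER

print(sorted(r1), sorted(r2), sorted(r3))
print("18" in r3)   # False: ORDER 18 has no antecedent, so a cascade is needed
```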

124

rs = {18} is not contained in {05, 07, 10, 15}; therefore, there is no antecedent for some cutset in L*-.

s[3] = {18}, X3 = ORDER, X2 = SERVICE.

s[2] = { t ∈ <ORDER,SERVICE,PHYSICIAN,WARD>[ORDER,SERVICE] : t[ORDER] = 18 } [SERVICE] = ∅.

r[2] = {S-1, S-3}; therefore ORDER = 18 should be assigned SERVICE = S-1 or SERVICE = S-3. Suppose Mr. Smith is in service S-1; then the tuple <ORDER,SERVICE> = <18,S-1> must be added to the database. Thus ALTER := ALTER ∪ {(w1,vb)}. Hence, for this particular update, the cascade is indicated by:

ALTER = {(w12,vi), (w1,vb)}.

Any update of PATIENT_NO → NAME is to be followed by an update of ORDER → SERVICE.

WRTs,2 = {f12,i: PATIENT_NO → NAME, f1,b: ORDER → SERVICE}.

Let the user site be site s = 4. From Table C.3, v4,2 = 2 (1/minute). Therefore:

B(12,i) = (4 bytes + 15 bytes + 2 bytes) × 2/minute = 42 bytes/minute.

Similarly, B(1,b) = (4 + 5 + 2) × 2 = 22 bytes/minute.

(It was assumed that α = 2 bytes. The widths of the attributes are provided by Table C.6.)
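The update rates above are simply (total attribute width + overhead α) × update frequency. As a sketch (the helper name is invented, the numbers are those of the example):

```python
# Update flow rate for one FD in the cascade WRTs,2:
# B = (sum of attribute widths + alpha) * frequency, with alpha = 2 bytes
# of message overhead and frequencies in 1/minute (Table C.3); the
# attribute widths come from Table C.6.

ALPHA = 2  # bytes of overhead, as assumed in the text

def update_flow_rate(widths, freq_per_min, alpha=ALPHA):
    """Bytes per minute of update traffic for one FD."""
    return (sum(widths) + alpha) * freq_per_min

# f12,i: PATIENT_NO (4 bytes) -> NAME (15 bytes), frequency 2/minute
print(update_flow_rate([4, 15], 2))   # 42 bytes/minute
# f1,b: ORDER (4 bytes) -> SERVICE (5 bytes), frequency 2/minute
print(update_flow_rate([4, 5], 2))    # 22 bytes/minute
```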

125

Figure 4.8. An F-graph depiction of the Medical Record Database

126

Figure 4.9. A minimum F-graph for the MRDB

127

Figure 4.10. The link of query Qt,6

128

Figure 4.11. A proper link for query Qt,6

129

Figure 4.12. The metagraph for query Qt,6

130

Figure 4.13. The weighted metagraph for query Qt,6

131

Figure 4.14. The link for query Qt,7

132

Figure 4.15. The metagraph for query Qt,7

133

Figure 4.16. The weighted metagraph for query Qt,7

CHAPTER 5

A MODEL FOR DATABASE DISTRIBUTION

The cross-referencing data units (CRDUs) contained in all query envelopes are to be distributed (with possible duplication) over the nodes of a computer network. In this chapter, a CRDU-distribution model is proposed. The objective is to minimize the communication cost.

It is assumed that all the transactions will be processed according to distributed query processing strategies that feature the cross-referencing of relations, as explained in Chapter 2. For a single distributed query processing session, some flow is produced by file-to-file communications and is associated with the reducing operations. Some flow is produced by file-to-user communications and is associated with the assembly operations. It is assumed that the assembly node, the node where the answer of a query is assembled, is the user node itself.

The sizes of the initial CRDUs are estimated as demonstrated in Chapter 4. For each query Qtk, an envelope can be determined as shown in Chapter 4, and the set of CRDUs mapped to the FDs of the query envelope is denoted ACCtk. The sizes of the CRDUs contained in ACCtk after local processing may be obtained by applying the "selection" relational operator implied by the user-provided qualification clauses contained in Ctk. Furthermore, it is assumed that an appropriate distributed processing algorithm can be applied to each feasible partition of the set of CRDUs ACCtk to determine the data-volume flow rates, which will serve as inputs for this part of the distributed database design.

Each feasible partition p of a set of CRDUs ACCtk consists of a set CLUtk^p of clusters. Each cluster can be visualized as one individual relation. Therefore, a query Qtk can be solved as a distributed query over the distinct relations of CLUtk^p. After distribution, a synthesis algorithm such as [KAMB 78] can be used to combine FDs into third-normal-form relational tables. Assumptions regarding the model are summarized in Appendix C.

The next section contains a notation list with concise definitions. In Section 5.2 a so-called "naive" distribution model is developed. This naive model is to enlighten the reader as to the difficulties of distributing CRDUs. Typically, the clustering effect discussed in Chapter 2 is at the source of the problems that arise when the modeling of data-volume flow rates is attempted.

5.1. Notations

U: Universe of attributes.
H: Set of CRDUs to be distributed.
N: Set of nodes, or sites, in a computer network.
A: Set of arcs, or network links, in a computer network.
c_ij: Communication cost for arc (i,j) ∈ A.
cbar_ij: Communication cost corresponding to the shortest path from i to j in network (N,A).
K_ij: Flow capacity of link (i,j) in network (N,A).
t: Usually stands for a single user-node, t ∈ N. At the same time t denotes a sink node for some flow.
Qt: Set of queries originated at user-node t: {Qtk : 1 ≤ k ≤ Kt} for some Kt.
Qtk: k-th query originated at user-node t.

136

Stk: Set of nodes in N that user-node t needs access to before answering Qtk. Only used in the naive distribution model.
Stk(s): Subset of Stk such that there exists a reducing flow from s ∈ Stk to all the nodes of Stk(s), and possibly to the user-node t. Only used in the naive distribution model.
s: Usually denotes a source node for some flow.
Usk: k-th update originated at user-node s.
ACCtk: Envelope of query Qtk, i.e., the set of CRDUs needed to process Qtk.
WRTsk: Cascade of update Usk, i.e., the set of CRDUs to be updated by Usk.
PARtk: Set of feasible partitions of an envelope ACCtk.
CLUtk^p: One feasible partition of ACCtk corresponding to a partitioning label p ∈ PARtk.
p: Denotes a partition label, i.e., one element of some PARtk.
q, r: Denote clusters in some set of clusters CLUtk^p. Each cluster q or r stands for some relation obtained by joining CRDUs of ACCtk.
Atk^pq: Data-volume flow rate corresponding to the shipment, for assembly, of relation q to a user-node (sink-node) t after all the local processing and all the reducing operations have been completed. q is some relation, or cluster, of CLUtk^p, which is itself the p-th feasible partition of the envelope ACCtk (p ∈ PARtk) of some query Qtk originated at user-node t.
Atk^pqr: Data-volume flow rate corresponding to the shipment of all the reducing data sent from relation q to relation r of CLUtk^p in the context of some critical file-to-file cross-referencing strategy.
A'tk^pqr: As above, but strictly for the noncritical reducing operation from relation q to relation r.

137

Nh: Number of relations q that contain CRDU h ∈ H, over all clusters, partitions, and queries.
f^g: Denotes a CRDU held by a source node with respect to some flow. Only used in the naive distribution model.
S(f): Attribute scheme of FD f.
Rtk(f^g): Covering of some envelope ACCtk such that f^g ∈ ACCtk sends reducing data (semi-joining columns of relations) to it: = {f^h : f^h ∈ ACCtk - {f^g} and CRDU f^g sends reducing data to CRDU f^h}. Only used in the naive distribution model.
Atk^g: Data-volume flow rate corresponding to the shipment of CRDU f^g ∈ ACCtk from some source-node s to node t after all the local processing and all the reducing operations have been performed. Contributes to the assembly flow for query Qtk. Only used in the naive distribution model.
Atk^gh: Data-volume flow rate corresponding to the shipment of some reducing data from CRDU f^g ∈ ACCtk to CRDU f^h ∈ ACCtk. Contributes to the reducing flow for query Qtk. Only used in the naive distribution model.
Bsk^h: Data-and-message-volume flow rate corresponding to some updating information sent from user-node s to CRDU f^h ∈ WRTsk for some update Usk.
a_tks^g: Flow rate which corresponds to the shipment, for assembly, of a processed copy of CRDU f^g ∈ ACCtk located at source-node s ∈ Stk, to user-node t, the sink-node. Only used in the naive distribution model.
i_tks^g: Flow rate which corresponds to the shipment of reducing data from CRDU f^g ∈ ACCtk located at s ∈ Stk, to all the nodes in Stk holding CRDUs involved in cross-referencing with f^g. Only used in the naive distribution model.

138

x_tks^g: Flow rate corresponding to the total of the reducing and assembly flow rates for the shipment of data of CRDU f^g ∈ ACCtk located at s ∈ Stk. Only used in the naive distribution model.
ASMsi: Assembly flow-rate value at node i, due to the total assembly flow with source node s.
RDCsi: Reducing flow-rate value at node i, due to the total file-to-file flow with source node s.
UPDsi: Update flow-rate value at node i, due to the total update flow with source node s.
xs: Total flow rate over all the flows with source node s.
Yi^h: 1 if CRDU h ∈ H is located at node i ∈ N, 0 otherwise.
Ztks^h: 1 if node s holds a copy of h ∈ H for Qtk, 0 otherwise. Only used in the naive distribution model.
Xtk^p: 1 if partition p ∈ PARtk is chosen, 0 otherwise.
Xtki^pq: 1 if node i holds cluster q for query Qtk and partition p ∈ PARtk, 0 otherwise.
wti: 1 if i = t, 0 otherwise.
x(i,N): Shorthand for the sum over j ∈ N of x(i,j), where x(i,j) denotes any type of flow rate from node i to node j; similarly, x(N,i) sums x(j,i) over j ∈ N.

139

The next section deals with the "naive" distribution model.

5.2. "Naive" distribution modeling

The following "naive" modeling approach is to stress the difficulty of capturing the clustering effect discussed earlier in Chapter 2. Each user-node t is associated with a set of queries Qt = {Qtk : 1 ≤ k ≤ Kt} for some Kt. To each query Qtk corresponds an envelope ACCtk ⊆ H. Let Stk ⊆ N denote the unknown set of nodes in N that user-node t needs to communicate with in the context of a distributed query processing session aiming at answering Qtk. Let s ∈ Stk be the node that holds a copy of CRDU f^g ∈ ACCtk. Further, let Stk(s) ⊆ Stk be such that for any node j in Stk(s) there exists some file-to-file communication from s to j. Two types of flows resulting from the access of CRDU f^g may be considered:

(1) An assembly flow, denoted a_tks^g, taking source at node s and terminating at node t, such that for any s ∈ Stk, where s carries a copy of CRDU f^g, the following flow equations hold:

-a_tks^g(j,N) + a_tks^g(N,j) = -Atk^g  for j = s,
                             = 0       for any j outside {s} ∪ {t},
                             = Atk^g   for j = t.

It was assumed that s ≠ t.

(2) A reducing flow, denoted i_tks^g, taking source at node s and terminating at the nodes of Stk(s). For all s ∈ Stk such that s carries CRDU f^g, and such that the set of nodes with which s communicates for cross-referencing is Stk(s), the following flow equations hold:

-i_tks^g(j,N) + i_tks^g(N,j) = -Σ_{f^h ∈ Rtk(f^g)} Atk^gh  for j = s,

140

                             = 0       for any j outside {s} ∪ Stk(s),
                             = Atk^gh  for any j ∈ Stk(s) such that j carries a copy of f^h ∈ ACCtk.

This is so assuming that all the CRDUs in ACCtk are held at different nodes, i.e., that |Stk(s)| = |ACCtk| - 1.

The assembly and reducing flows may be summed up. It is meaningful to do so as long as only flows with identical sources are dealt with. Thus, for a distributed query processing session aimed at answering a query Qtk, the total flow is such that:

-x_tks^g(i,N) + x_tks^g(N,i) = -[Atk^g + Σ_{f^h ∈ Rtk(f^g)} Atk^gh] wsi + Atk^g wti + Σ_{f^h ∈ Rtk(f^g)} Atk^gh Ztki^h,  for any i ∈ N.

Given a source-node s, all such flows may be added up over all user-nodes t, all queries Qtk, and all CRDUs f^g ∈ ACCtk. However, the development of the naive distribution model is interrupted at this point to point out a difficulty that brings up the need for a more complete model.

Example: Consider Figure 5.1. Suppose that node s, which holds a copy of f^g, needs to seek access to h1 and h2 to perform semi-joins f^g[A>h1 and f^g[A>h2, both over attribute A. Assume that both CRDUs h1 and h2 are placed at node j. Let Atk^gh1 and Atk^gh2 be the data volumes corresponding respectively to semi-joins f^g[A>h1 and f^g[A>h2. It is clear that the shipment of Atk^gh1 + Atk^gh2 units of data volume from node s to node j would be quite inefficient. Indeed, this strategy would imply that a column of A values is sent twice from s to j. This problem may be summarized thus: the placement, at the same site, of two or more CRDUs which contain common information results in a drastic nonlinearity in data-volume flows. Earlier, in Chapter 2, this problem was identified as the clustering effect. It is

141

therefore quite inappropriate to deal with individual CRDUs such as f^g in a flow model. Rather, clusters of CRDUs must be considered to allow the input of meaningful data volumes. In the next section, the effect of clustering is taken into consideration, as well as the quality of the semi-join operations, i.e., whether they are critical or not.

5.3. PART 2: A model for the distribution of CRDUs

The CRDU-distribution problem is captured in the form of a mathematical optimization model. The assumptions on which the final model for the combined distribution and materialization of CRDUs is based are as follows. (1) The underlying communication network of the distributed system has a fixed topology. (2) Appropriate distributed query processing sequences can be determined for all possible clustering configurations of CRDUs. An optimization algorithm, which will be part of the query processing subsystem during the operational phase of the distributed database, is readily available [CHUN 83] (see Sections 4.6 and 4.7). (3) The distributed query processing strategies are based on the principle of static materialization. (4) Directories, programs and materialization look-up tables are available at each site. (5) Effects of concurrency control mechanisms are not considered.

The following are assumed to be given:

(N,A): The underlying directed graph of a computer network.
c_ij: Communication cost for arc e_ij = (i,j) representing a communication link.
cbar_ij: Communication cost corresponding to the shortest path from i to j.
K_ij: Flow capacity of communication link e_ij = (i,j).
H: Set of CRDUs to be distributed.
Qtk: Denotes a query k sent from site t.

142

ACCtk: Subset of H to be accessed by a user identified with a site t ∈ N, for a query issued at t identified by an index k.
WRTsk: Subset of H to be updated by a user identified with a site s ∈ N, for an update issued at s identified by an index k.
PARtk: Set of possible partition indexes for ACCtk, where p ∈ PARtk denotes the index corresponding to the choice of one partition, or clustering configuration.
CLUtk^p: p-th partition of ACCtk.
Nh: Number of clusters q that contain h ∈ H, over all queries.
Atk^pq: Volume of data and messages obtained from cluster q ∈ CLUtk^p after all the reducing operations have been performed.
Atk^pqr: Sum of the volumes of data and messages sent from cluster q ∈ CLUtk^p to cluster r ∈ CLUtk^p, corresponding to the combined critical reducing operations from q to r.
A'tk^pqr: Volume of data and messages sent from cluster q ∈ CLUtk^p to cluster r ∈ CLUtk^p, corresponding to the noncritical reducing operation (if any) from q to r.
Bsk^h: Volume of data and messages sent from site s to update CRDU h ∈ H.

The following variables are to be evaluated:

Flow variables:
xs(i,j): Total message and data volume flow on arc (i,j) for queries and updates with s as the source site.

Allocation variables:

143

Yi^h: 1 if h ∈ H is at i ∈ N, 0 otherwise.

Partition variables:
Xtk^p: 1 if ACCtk is partitioned according to configuration p of PARtk, 0 otherwise.

Materialization variables:
Xtki^pq: 1 if cluster q, corresponding to partition p ∈ PARtk for query Qtk, is accessed at node i; 0 otherwise.

The following notations are also used:
wti: 1 if node i ∈ N is identical to node t, 0 otherwise.
Y^h = Σ_{i ∈ N} Yi^h.

The assembly flow value at any node i, corresponding to the total assembly flow with source node s, is:

ASMsi = Σ_{Qtk} Σ_{p ∈ PARtk} Σ_{q ∈ CLUtk^p} Atk^pq Xtks^pq [wsi - wti],  for all s ∈ N, i ∈ N,

where Atk^pq Xtks^pq is the assembly outflow value for source node s, and the assembly inflow value for sink node t (the user node). For any node which is neither s, the source node, nor t, the destination node, this flow value is equal to zero.
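The sign convention of ASMsi can be checked numerically: over all nodes, the values for one query cancel out, injecting the volume at the source and absorbing it at the user node. A tiny invented instance:

```python
# Numeric check of the assembly flow values ASM_si: for one query, the
# net injections over all nodes sum to zero (source +, sink -).
# Invented instance: 3 nodes, one cluster of volume A placed at s = 1,
# user node t = 3.

N = [1, 2, 3]
s, t, A = 1, 3, 500          # source site, user site, volume Atk^pq

def w(a, i):                 # the indicator w_ai
    return 1 if a == i else 0

asm = {i: A * (w(s, i) - w(t, i)) for i in N}
print(asm)                   # {1: 500, 2: 0, 3: -500}
print(sum(asm.values()))     # 0: the flow is conserved across the network
```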

144

The reducing flow value at any node i, corresponding to the total file-to-file flow with source s, is:

RDCsi = Σ_{Qtk} Σ_{p ∈ PARtk} Σ_{q,r ∈ CLUtk^p} Σ_{j ∈ N, j ≠ s} [Atk^pqr + A'tk^pqr (1 - Xtkt^pr)] [wsi - wji] Xtks^pq Xtkj^pr,  for all s ∈ N, i ∈ N.

The term [1 - Xtkt^pr] is zero whenever the semi-join from q to r in the query processing sequence is noncritical and cluster r is located at user node t; it is equal to one otherwise. This means that cluster r need not be reduced any further, since no other cluster outside t would benefit from this reduction. Now wti = 1 if and only if i = t, and therefore Xtkj^pr Xtkt^pr = Xtkj^pr wtj. Hence RDCsi can be rewritten as:

RDCsi = Σ_{Qtk} Σ_{p ∈ PARtk} Σ_{q,r ∈ CLUtk^p} Σ_{j ∈ N, j ≠ s} [Atk^pqr + A'tk^pqr (1 - wtj)] [wsi - wji] Xtks^pq Xtkj^pr.

The update flow value at any node i, corresponding to the total update flow with source s, is:

UPDsi = Σ_{Usk} Σ_{h ∈ WRTsk} Bsk^h [Y^h wsi - Yi^h],  for all s ∈ N, i ∈ N.

Following is the general optimization model:

minimize Z = Σ_{(i,j) ∈ A} c_ij Σ_{s ∈ N} xs(i,j)

145

subject to:

Flow conservation:

-xs(i,N) + xs(N,i) = -ASMsi - RDCsi - UPDsi   (mip-1)

for all s ∈ N, i ∈ N.

Flow capacity:

0 ≤ Σ_{s ∈ N} xs(i,j) ≤ Kij   (mip-2)

for all (i,j) ∈ A.

Partitioning: Given a query Qtk and a set of CRDUs to access, ACCtk, only one clustering configuration p ∈ PARtk must be selected:

Σ_{p ∈ PARtk} Xtk^p = 1   (ip-1)

for all Qtk.

Materialization: Any cluster q ∈ CLUtk^p must be associated with exactly one node in the network:

Σ_{i ∈ N} Xtki^pq = Xtk^p   (ip-2)

for all Qtk, p ∈ PARtk, q ∈ CLUtk^p.

Allocation: For each cluster, the CRDUs contained in it must be available:

Σ_{Qtk} Σ_{p ∈ PARtk} Σ_{q ∈ CLUtk^p, h ∈ q} Xtki^pq ≤ Nh Yi^h   (ip-3)

for all i ∈ N, h ∈ H.

The mathematical model presented above is a (min-cost multi-commodity flow) linear mixed-integer program.
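The integer side of the model can be written out mechanically. The sketch below enumerates the constraint rows (ip-1) through (ip-3) for a tiny invented instance; only the shapes of the constraints follow the model, and every name and number is illustrative.

```python
# Sketch: enumerating constraint rows (ip-1)-(ip-3) for one query on a
# tiny invented instance (2 sites, 2 partitions, 2 CRDUs). Only the
# constraint shapes follow the model.

N = [1, 2]                                   # network sites
PAR = {1: ["a"], 2: ["b", "c"]}              # partition p -> its clusters
members = {"a": ["h1", "h2"], "b": ["h1"], "c": ["h2"]}
H = ["h1", "h2"]

constraints = []

# (ip-1): exactly one clustering configuration p is selected.
constraints.append(("ip-1", "sum over p of X[p] = 1"))

# (ip-2): each cluster of each partition is materialized at exactly one
# node, and only if that partition is chosen.
for p, clusters in PAR.items():
    for q in clusters:
        constraints.append(("ip-2", f"sum over i of X[{p},{q},i] = X[{p}]"))

# (ip-3): node i may host a cluster containing CRDU h only if a copy
# Y[h,i] is allocated there; N_h bounds the number of referencing clusters.
for i in N:
    for h in H:
        qs = [(p, q) for p, cs in PAR.items() for q in cs if h in members[q]]
        constraints.append(("ip-3", f"sum over {qs} of X[p,q,{i}] <= N_{h} * Y[{h},{i}]"))

print(len(constraints))   # 1 + 3 + 4 = 8 constraint rows
```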

146

Example: Let the following CRDUs be defined with respect to the relations displayed in Appendix B: h1 = R1[NAME,AGE], h2 = R2[NAME,PHYSICIAN], h3 = R2[ORDER,NAME]. Let the target list of a query issued at site t be (NAME, AGE, PHYSICIAN, ORDER). In the following notation the subscripts "tk" will be dropped, since we are considering one user site and one query originated at that site. The set of CRDUs to access is ACC = {h1, h2, h3}. There are five possible clustering configurations for ACC, therefore PAR = {1, 2, 3, 4, 5}. The clustering configurations that ACC may assume are:

CLU^1 = {<h1>, <h2>, <h3>},
CLU^2 = {<h1>, <h2 h3>},
CLU^3 = {<h1 h2>, <h3>},
CLU^4 = {<h1 h3>, <h2>},
CLU^5 = {<h1 h2 h3>},

where <hi hj> stands for hi clustered with hj. In CLU^2, let q = 1 refer to <h1> and q = 2 refer to <h2 h3>. For this query only, let t be the user site, let site 1 hold a copy of h1, and let site 2 hold a copy of h2 and a copy of h3 (Figure 5.2). Then X1^21 = 1, X2^22 = 1, X1^22 = 0, X2^21 = 0. The volume of messages and data A^212 corresponds to the column h1[NAME], and the volume of messages and data A^221 corresponds to one column h2[NAME] ∩ h3[NAME]. Hence it is evident that, because the different possible clustering configurations of CRDUs are considered, and not the CRDUs individually, the problems of redundancy are taken care of; i.e., there is no need to send one column h2[NAME] and another column h3[NAME] to site 1. The volume of messages and data A^21 corresponds to h1[NAME,AGE] after the semi-join h1<NAME](h2[NAME]h3) occurs at site 1. Similarly, A^22 corresponds to (h2[NAME]h3)[NAME,ORDER,PHYSICIAN] after the semi-join h1[NAME>(h2[NAME]h3) has taken place at site 2.

When the flow capacity of the communication network is virtually infinite, constraints (mip-1) and (mip-2) are omitted, and the objective for the uncapacitated model becomes:

147

minimize Σ_{s ∈ N} [ Σ_{Qtk} Σ_{p ∈ PARtk} Σ_{q ∈ CLUtk^p} [ ( Atk^pq Xtks^pq + Σ_{r ∈ CLUtk^p} Atk^pqr Xtks^pq Xtkt^pr ) cbar_st + Σ_{r ∈ CLUtk^p, r ≠ q} Σ_{i ∈ N, i ≠ t} ( Atk^pqr + A'tk^pqr ) Xtks^pq Xtki^pr cbar_si ] + Σ_{Usk} Σ_{h ∈ WRTsk} Σ_{i ∈ N} Bsk^h Yi^h cbar_si ]

subject to (ip-1), (ip-2) and (ip-3). In this case, the optimization model is a quadratic 0-1 integer program.

One might ask whether a constraint should be included stating that two different FD-clusters of the same partition cannot be accessed at the same node, in order to prevent the shipment of redundant data. For instance, if two CRDUs belonging to two different clusters located at the same site s are submitted to semi-joins with CRDUs outside s on an identical domain, say D, then values of D will be sent twice out of s (and into s). The answer is that if an optimal solution is sought, more favorable partitions can be chosen. Therefore such an additional constraint need not be enforced: it will be satisfied whenever it helps to decrease the cost. Similarly, a constraint stating that it is useless to consider a partition containing m clusters in a network with n nodes if m is larger than n is unnecessary; the remark made above applies to this case as well. It is nevertheless up to the designer to decide whether, for some query Qtk and its envelope ACCtk, a partition p associated with a set of clusters CLUtk^p is useful or not.

The next chapter will be devoted to computational considerations and techniques to solve the large-scale optimization models proposed above.
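The five clustering configurations of ACC = {h1, h2, h3} in the example above are exactly the set partitions of a three-element set (the Bell number B3 = 5). A short sketch enumerates them; the recursion is a generic textbook one, not a procedure from the thesis.

```python
# Enumerate all set partitions (clustering configurations) of an envelope.

def partitions(items):
    """Yield every partition of a list as a list of clusters (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # put `first` in a cluster of its own ...
        yield [[first]] + part
        # ... or join it to each existing cluster in turn
        for k in range(len(part)):
            yield part[:k] + [[first] + part[k]] + part[k + 1:]

acc = ["h1", "h2", "h3"]
configs = list(partitions(acc))
print(len(configs))   # 5, matching PAR = {1, 2, 3, 4, 5}
```

For larger envelopes the count grows quickly (Bell numbers), which is why the designer may prune the set PARtk by hand, as noted above.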

148

Figure 5.1. The placement of two redundant CRDUs at the same site

Figure 5.2. A distributed query processing scheme
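The inefficiency depicted in Figure 5.1 is easy to quantify. The toy numbers below (all invented) show why per-CRDU flow rates cannot simply be added when two co-located CRDUs share a semi-joining column:

```python
# Toy illustration of the clustering effect: the semi-join column f^g[A]
# shipped from s to j is shared by h1 and h2 when both sit at node j,
# so per-CRDU flow rates do not simply add.

col_a_bytes = 1000           # volume of the column f^g[A] (hypothetical)
flow_gh1 = col_a_bytes       # Atk^gh1: column sent for f^g[A>h1
flow_gh2 = col_a_bytes       # Atk^gh2: column sent for f^g[A>h2

naive_total = flow_gh1 + flow_gh2    # the naive model sums the flows: 2000
clustered_total = col_a_bytes        # one shipment serves both CRDUs: 1000

print(naive_total, clustered_total)  # 2000 1000
# The per-cluster volumes Atk^pqr of the full model are measured between
# cluster pairs, so this redundancy never enters the flow equations.
```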

CHAPTER 6

COMPUTATIONAL TECHNIQUES

When the constraints over the total flow of information on each channel, due to finite capacity limitations, are omitted, the mathematical model for the distribution of cross-referencing data units becomes a 0-1 integer program with quadratic cost and linear constraints. If the capacity constraints are included, the model assumes the form of a mixed integer program with a linear cost in real (flow) variables and with linear constraints that include both the real variables and the integer 0-1 variables. The first model will be referred to as the uncapacitated model, and the second as the capacitated model.

An algorithm which seeks an optimal solution, and an algorithm which systematically generates improved feasible solutions, for the uncapacitated model will be described and discussed. Further, a heuristic technique which seeks near-optimal solutions for the capacitated model will also be described and discussed. Numerical results and comparisons will follow.

6.1. Solving the uncapacitated model by integer programming

If plenty of computing budget and time is available, and if the size of the problem permits, then it may be desirable, and indeed it is possible, to find an optimal solution. Classical integer programming techniques such as Branch and Bound (BNB) can be used to accomplish that goal. In the next section we show that the BNB method can be applied conveniently because only a subset of the 0-1 variables actually needs to be considered for branching. The method is then applied successfully to a 350-variable problem in Section 6.1.2. In Section

150

6.1.3 it is explained why more efficient techniques, even though non-optimal, must be considered.

6.1.1. Description of a branch-and-bound algorithm

For some not-so-large problems, optimal integer programming techniques can be applied to the uncapacitated model. It is now shown that there exists some dependency among the 0-1 variables of the uncapacitated model. The result of the following proposition will be used in the application of the BNB algorithm.

Proposition 17: Let X* = (..., Xtk^p*, ..., Xtki^pq*, ..., Yi^h*, ...) be an optimal solution of the uncapacitated model such that all the Yi^h* and Xtk^p* variables are 0 or 1, and such that all the variables Xtki^pq* satisfy 0 ≤ Xtki^pq* ≤ 1. Then there exists an optimal solution Xbar = (..., Xbar_tki^pq, ...) such that Xbar_tki^pq = 0 if Xtki^pq* = 0, Xbar_tki^pq = 1 if Xtki^pq* = 1, and Xbar_tki^pq = 0 or 1 if Xtki^pq* is fractional.

Proof: Suppose that, under the conditions of the proposition, there exist at least one Qtk, p and q, and some s1, s2, ..., sn ∈ N such that Xtks1^pq*, ..., Xtksn^pq* are all different from zero, constraints (ip-1), (ip-2) and (ip-3) are satisfied, and the cost is optimal. The expression for the cost contains a term of the form:

E^s1 Xtks1^pq + E^s2 Xtks2^pq + ... + E^sn Xtksn^pq,

151

where, for each s ∈ {s1, ..., sn},

E^s = [ Atk^pq + Σ_r Atk^pqr Xtkt^pr ] cbar_st + Σ_r Σ_{j ∈ N, j ≠ t} ( Atk^pqr + A'tk^pqr ) Xtkj^pr cbar_sj.

Let s* ∈ {s1, s2, ..., sn} be such that E^s* ≤ E^si for all i = 1, ..., n. Then let us set Xbar_tks*^pq = 1 and Xbar_tksh^pq = 0 for any sh ∈ {s1, ..., sn} - {s*}. Equations (ip-1), (ip-2) and (ip-3) are still satisfied after the substitution. Yet

Σ_i E^si Xtksi^pq ≥ E^s* Σ_i Xtksi^pq = E^s*,

since Σ_i Xtksi^pq = Xtk^p = 1 by (ip-2). Therefore, after the substitution, a solution is obtained that is at least as good as the previously found one. This type of substitution can be extended to the other variables of the type Xtki^pq that are still fractional.

The BNB algorithm, written in pseudocode, follows. It is self-explanatory. The branch-and-bound technique of partial enumeration has been reported in [MURT 76, SALK 75, CHEN 80].

ALGORITHM Branch_And_Bound:
  Incumbent := ∅                     /* Current feasible all-integer problem */
  List := { CP^0 }                   /* List of candidate problems */
  Branch
  if Incumbent = ∅ then NoSolution else Optimal := Incumbent
end Branch_And_Bound.

ALGORITHM Branch:
while List ≠ ∅ do
begin
  List := List - { CP' }             /* Use a LIFO search */
  CP'0 := Branch0(CP')               /* Set a fractional variable to 0 */

152

  CP'1 := Branch1(CP')               /* Set it to 1 */
  LB'0 := LowerBound(CP'0)
  LB'1 := LowerBound(CP'1)
  UpdateIncumbent(CP'0); K0 := Dangle
  UpdateIncumbent(CP'1); K1 := Dangle
  if K0 and K1 then                  /* both children still good: push worse bound first */
  begin
    if LB'0 < LB'1 then
    begin
      List := List ∪ { CP'1 }
      List := List ∪ { CP'0 }        /* lower bound on top, explored first */
    end
    else
    begin
      List := List ∪ { CP'0 }
      List := List ∪ { CP'1 }
    end
  end
  else if K0 then List := List ∪ { CP'0 }   /* only CP'0 is still good */
  else if K1 then List := List ∪ { CP'1 }   /* only CP'1 is still good */
  /* if neither child is still good, both branches are pruned */
end
end Branch.

ALGORITHM UpdateIncumbent(CP'):
/* LBIncumbent is global; it is initialized to ∞ before the search */
Dangle := true
LB' := LowerBound(CP')
if Fathom then                       /* CP' is solved all-integer */
begin
  if LB' < LBIncumbent then begin Incumbent := CP'; LBIncumbent := LB' end   /* Update incumbent */

153

  /* also prune the CPs in List that have a higher cost */
  Dangle := false
  return
end
if Infeasible then Dangle := false   /* No more branching */
if LB' ≥ LBIncumbent then Dangle := false
  /* No need to branch: no hope of improving the incumbent along this branch */
end UpdateIncumbent.

Note that, in accordance with Proposition 17, only the variables of type Yi^h and Xtk^p are considered for possible branching; i.e., they are the only ones which are deliberately fixed to 0 or 1 in the BNB algorithm. This constitutes a numerical asset, as will be shown below in Section 6.1.2, where a problem which normally requires 350 0-1 variables need only be branched over 99 of them.

To compute the cost of a candidate problem (LowerBound(CP')), the choice of a method for large-scale optimization is crucial, because in the uncapacitated model the number of variables can easily exceed one thousand. It appears that algorithms based on gradient projection methods and reduced gradient methods [HIMM 72] are best suited to our model, because they take advantage of the linearity of the constraints; for that reason the optimization package MINOS [MUR 77, MUR 78], which is based on such techniques, was chosen. MINOS is a program that solves large-scale, bounded-variable, linear or nonlinear programming problems with linear constraints. It performs very well and can handle models with several thousand variables.

The BNB algorithm described above has been implemented in FORTRAN. The LowerBound procedure is realized with a call to MINOS. One problem was solved numerically. The results are discussed in the next section.
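The control flow of the pseudocode can be condensed into a few lines. The following toy sketch mirrors the LIFO search and the pruning rule, with a trivial partial-assignment bound standing in for the call to MINOS; the instance and all names are invented.

```python
# Compact LIFO (depth-first) branch-and-bound over 0-1 variables,
# mirroring Branch_And_Bound above. lower_bound here is a trivial
# partial-assignment bound standing in for the MINOS relaxation call.

import math

costs = [(4, 1), (3, 5), (2, 6)]   # (cost if x=0, cost if x=1) per variable

def lower_bound(fixed):
    """Cost of the fixed variables plus the best case for the free ones."""
    total = sum(costs[i][v] for i, v in enumerate(fixed))
    total += sum(min(c0, c1) for c0, c1 in costs[len(fixed):])
    return total

def branch_and_bound():
    incumbent, best = None, math.inf   # Incumbent and LBIncumbent
    stack = [[]]                       # List of candidate problems, LIFO
    while stack:
        fixed = stack.pop()
        lb = lower_bound(fixed)
        if lb >= best:                 # Dangle := false, prune this branch
            continue
        if len(fixed) == len(costs):   # Fathom: solved all-integer
            incumbent, best = fixed, lb
            continue
        # Branch0 / Branch1: push the child with the worse bound first,
        # so the LIFO search explores the more promising child first.
        kids = sorted((fixed + [v] for v in (0, 1)),
                      key=lower_bound, reverse=True)
        stack.extend(kids)
    return incumbent, best

print(branch_and_bound())   # ([1, 0, 0], 6)
```

The bound here happens to be exact, so the search fathoms the optimum immediately; with a genuine LP relaxation the same skeleton simply prunes less eagerly.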

154

6.1.2. A practical problem

A practical problem which can be modeled using the uncapacitated model is solved optimally. The MRDB of Appendix C is continued (see also Section 4.7). Although there is no constraint as to the amount of information flow in the network, the organization wishes to distribute the data that will be subject to scheduled transactions in such a way that the total communication cost is minimized over all the queries and updates issued at all the sites.

The user-provided FDs are listed in Table C.1, but after the application of Algorithm Reduce (Section 4.7) an LR-minimum cover of the FDs is as shown in Table 6.1. Attribute TEST includes the TEST_TYPE, the CODE_NO, and the TECHNIQUE used. RESULT includes the actual numerical result (TOTAL) for a test, the PRECISION, the ACCURACY, the COST, and the amount of TIME spent.

The types of queries and updates which are commonly asked are listed in a simplified form in Table 6.2. Table 6.3 shows which queries and updates are being initiated at each site. Each query and update index is followed by the corresponding

Functional dependencies for the MRDB:
ORDER → SERVICE
ORDER → TEST
ORDER → PATIENT_NO
ORDER → DATE
TEST → TEST_TYPE, CODE_NO, TECHNIQUE
TEST, DATE, PATIENT_NO → RESULT
RESULT → TOTAL, PRECISION, ACCURACY, COST, TIME
PATIENT_NO → NAME
PATIENT_NO → AGE
SERVICE → PHYSICIAN
PHYSICIAN → WARD
NAME → WARD

Table 6.1. Some functional dependencies for the MRDB.
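The core subroutine behind reducing an FD cover is attribute-set closure, which decides whether an FD is derivable from the others. The sketch below is the generic textbook routine, not the thesis's exact Algorithm Reduce, applied to a subset of the Table 6.1 dependencies:

```python
# Attribute closure, the workhorse for testing FD redundancy when
# reducing a cover (generic textbook routine, not the exact Reduce
# procedure of the thesis).

def closure(attrs, fds):
    """Closure of `attrs` under `fds`; each FD is a (lhs, rhs) pair of frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

FDS = [
    (frozenset({"ORDER"}), frozenset({"SERVICE"})),
    (frozenset({"SERVICE"}), frozenset({"PHYSICIAN"})),
    (frozenset({"PHYSICIAN"}), frozenset({"WARD"})),
    (frozenset({"PATIENT_NO"}), frozenset({"NAME"})),
    (frozenset({"NAME"}), frozenset({"WARD"})),
]

# ORDER determines WARD transitively, so an explicit ORDER -> WARD
# would be redundant in a minimum cover:
print("WARD" in closure({"ORDER"}, FDS))   # True
```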

155

estimated frequency (preceded by a "×" sign) expressed in 1/minute units. For instance, site 4, which stands for the main hospital laboratory, is only concerned with queries 4, 6, 9 and 7, and is the only site to perform updates on the MRDB.

After processing the initial FDs as discussed in Section 4.4.4 (see also Section 4.7), each query is assigned an envelope of CRDUs as shown in Table 6.4. The derived FDs are denoted h1, ..., h13, or simply 1, ..., 13. At this point it is possible to calculate a distributed query processing sequence for each possible clustering situation of the CRDUs belonging to the envelope of a specific query. (See steps (14) and (15) of Algorithm Select in Sections 4.6 and 4.7.) This is displayed in Table 6.5, where only the information pertinent to site 2 is shown: query 3 needs to access CRDUs 4, 5 and 6, and there are five possible clustering situations to be considered (Table 6.5, column 3). The

Typical scheduled queries for the MRDB:
Number   Given attributes          Target attributes
1        ORDER                     PHYSICIAN, WARD
2        ORDER                     RESULT
3        DATE, PATIENT_NO, TEST    ORDER
4        WARD                      PATIENT_NO
5        ORDER                     NAME, WARD
6        PHYSICIAN                 ORDER, PATIENT_NO
7        NAME, DATE, TEST          RESULT
8        ORDER                     SERVICE
9        PHYSICIAN                 SERVICE, PATIENT_NO

Typical scheduled updates:
Number   New attributes
1        ORDER, TEST, RESULT, DATE
2        PATIENT_NO, NAME, AGE

Table 6.2. Typical user-provided transactions for the MRDB.

Usage of MRDB transactions
  Site  Queries                                   Updates
  1     1 × 2,  2 × 1,  3 × 1,  5 × 1,  8 × 1     none
  2     1 × 1,  3 × 2,  9 × 1                     none
  3     5 × 3,  7 × 4,  8 × 4                     none
  4     4 × 1,  6 × 1,  9 × 1,  7 × 3             1 × 1,  2 × 2

Table 6.3. Typical frequencies for the MRDB transactions.

CRDUs accessed for query processing
  Query  CRDUs in envelope                        Possible partitions of CRDUs
  1      h1 = ORDER → PHYSICIAN                   <h1, h2>
         h2 = PHYSICIAN → WARD                    <h1> <h2>
  2      h3 = ORDER → RESULT                      <h3>
  3      h4 = ORDER → TEST                        <h4, h5, h6>
         h5 = ORDER → DATE                        <h4> <h5> <h6>
         h6 = ORDER → PATIENT_NO                  <h4, h5> <h6>
                                                  <h4> <h5, h6>
                                                  <h4, h6> <h5>
  4      h7 = PATIENT_NO → WARD                   <h7>
  5      h8 = ORDER → NAME                        <h8> <h9>
         h9 = NAME → WARD                         <h8, h9>
  6      h1 = ORDER → PHYSICIAN                   <h1> <h6>
         h6 = ORDER → PATIENT_NO                  <h1, h6>
  7      h10 = PATIENT_NO, DATE, TEST → RESULT    <h10> <h11>
         h11 = PATIENT_NO → NAME                  <h10, h11>
  8      h12 = ORDER → SERVICE                    <h12>
  9      h13 = SERVICE → PHYSICIAN                <h13, h12, h6>
         h12 = ORDER → SERVICE                    <h13> <h12> <h6>
         h6 = ORDER → PATIENT_NO                  <h13, h12> <h6>
                                                  <h13> <h12, h6>
                                                  <h13, h6> <h12>

Table 6.4. All the FDs selected for distribution, and the possible partitioning configurations, for the MRDB.

The matrix [A_t]^{q,k} (Table 6.5, column 5) is obtained, for each user's site-query couple, after running a distributed query processing algorithm for each possible clustering situation of CRDUs, and after adding up all the data volume flows taking place between each couple of CRDU clusters. The entries are in kilobits. All the user-provided information is input to the

BNB algorithm in the form shown in Appendix D. The MRDB problem considers a network of 4 sites, with 3 to 5 typical scheduled query types per site, 2 update types, and a total of 13 CRDUs to be distributed. The uncapacitated version of the model may be used. The total number of 0-1 variables turns out to be 350 for this problem. However, as was explained in the preceding section, since only the variables of the types Y and X_t^{p,k}, namely the allocation and the partition variables, need to be considered by the BNB algorithm, the total number of variables is decreased to 99. There are a

Query profile of site 2
  k   ACC_2^k    p ∈ PAR_2^k            q ∈ CLU          data volume flows (kilobits)
  1   1, 2       1 = {<1, 2>}           1 = <1, 2>       450
                 2 = {<1>, <2>}         1 = <1>          400 ...
                                        2 = <2>          ...
  3   4, 5, 6    1 = {<4, 5, 6>}        1 = <4, 5, 6>    200
                 2 = {<4>, <5>, <6>}    1 = <4>          600 0 0
                                        2 = <5>          0 800 0
                                        3 = <6>          200 200 200
                 3 = {<4, 5>, <6>}      1 = <6>          600 0
                                        2 = <4, 5>       200 ... 500
                 4 = {<4>, <5, 6>}      1 = <4>          ...
                                        2 = <5, 6>       200 200
                 5 = {<5>, <4, 6>}      1 = <5>          ...
                                        2 = <4, 6>       0 800

Table 6.5. An example of numerical input (corresponding to site 2 in the MRDB problem) to be provided to the distribution program.
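Tables 6.4 and 6.5 enumerate the clustering situations of a query envelope; the number of such situations for an envelope of m CRDUs is the Bell number B_m (1, 2 and 5 for m = 1, 2, 3). A minimal sketch (Python, purely illustrative) that enumerates them for the three-CRDU envelope of query 3:

```python
def partitions(items):
    """All set partitions of a list, built recursively: place the first
    element into each block of a partition of the rest, or into a new
    singleton block of its own."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for p in partitions(rest):
        for i in range(len(p)):
            yield p[:i] + [[first] + p[i]] + p[i + 1:]
        yield [[first]] + p

crdus = ["h4", "h5", "h6"]        # envelope of query 3 in Table 6.4
parts = list(partitions(crdus))
print(len(parts))                 # -> 5 clustering situations, as in Table 6.4
```

The five results correspond exactly to the five partitions listed for query 3: one three-CRDU cluster, three two-plus-one splits, and three singletons.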

total of 132 constraints for this problem, not including the constraints of the type 0 ≤ X ≤ 1, where X is any of X_{t,i}^{p,q,k}, Y or X_t^{p,k}. The BNB algorithm was run with MINOS on the Amdahl 470V/8 of the Michigan Terminal System to solve this problem. A final 0-1 optimal solution was obtained in approximately 4 minutes* of CPU time after branching 22 times. The optimal total communication flow was found to be 3390 kilobits/hour, and the corresponding configuration of the CRDUs, for site 1 and its queries, is as shown in Table 6.6, as given by the allocation variables. The non-zero materialization variables indicate the optimal way to access the CRDUs for each query from the selected site (Table 6.6). For query 5, had an alternative setting of the allocation variables been chosen rather than the one found in the present solution, there would have been no difference in communication cost, since the total assembly flow is equal to 100 kilobits/hour in both cases. Furthermore, CRDU <8>, which stands for <ORDER → NAME>, sends no data volume to the user node (node 1), whereas CRDU <9> (<NAME → WARD>) sends 100 kilobits/hour of data volume to the user site; therefore, although site 1 would be expected to access the two clusters <8> and <9> at different sites (refer to the remark made in Section 5.2, and to the example, also in Section 5.2, associated with Figure 5.1), in this particular example it is irrelevant whether CRDUs <8> and <9> are accessed in one common relational table or in two distinct relational tables, because the assembly flow is identical in both cases. (Of course, if CRDUs <8> and <9> were positioned at distinct sites, we would have to account for an additional 100 kilobits/hour of reducing flow.)

6.1.3. Practical limitations

The BNB algorithm allows the finding of an optimal solution for the uncapacitated model. However, there are well-known disadvantages:

*The package MINOS was available as a stand-alone program only; therefore the BNB algorithm was run semi-manually.
The total CPU time was obtained after compensating for repetitious calls to various initialization routines and input-output operations.

Solution for the MRDB problem and the input data of Appendix D (only the results relevant to user site 1 are shown)
  k  ACC_1^k    partition p          clusters and sites
  1  {1, 2}     2 = {<1>, <2>}       <1> at site 2;  <2> at site 2
  2  {3}        1 = {<3>}            <3> at site 1
  3  {4, 5, 6}  1 = {<4, 5, 6>}      <4, 5, 6> at site 4
  5  {8, 9}     2 = {<8>, <9>}       <8> at site 4;  <9> at site 4
  8  {12}       1 = {<12>}           <12> at site 4

Table 6.6. A partial solution for the MRDB problem.

(1) The classical file allocation problem is NP-complete [ESWA 74]; so is, a fortiori, the allocation of CRDUs [APER 80]. In practice the BNB algorithm may not be able to

overcome the curse of cardinality. (2) Even if an optimal solution can be arrived at with BNB for a specific problem, the price to be paid in CPU time and computer money may be unacceptable. Therefore, although the BNB algorithm remains a possible option to fully solve the uncapacitated model, clearly, other less expensive and faster algorithms must be considered. In the next section, such an algorithm is described and applied to the MRDB problem. The final section of this chapter deals with the solution of the capacitated model.

6.2. Improving feasible solutions for the uncapacitated model

Since it is not always feasible to seek an optimal solution to the uncapacitated model, a more efficient, although non-optimal, algorithm is proposed to generate "good" feasible solutions for the uncapacitated model. How good those suboptimal solutions are will be discussed shortly. The forthcoming Algorithm XrefU is based on an adaptive search method developed in [HOLL 75]. This algorithm can be applied to optimize a performance function evaluated over a set of character strings with finite length and fixed representation. In particular, it can be applied to unconstrained 0-1 integer programming models where a character string corresponds to a vector of 0's and 1's, and the performance function is mapped to the cost (in our case, the communication cost). Since the theory of adaptive algorithms has been studied elsewhere [HOLL 75], the desirable properties that adaptive algorithms exhibit will be reiterated without proof: (1) They can handle problems that feature high dimensionality and many local optima. Says [HOLL 75]:

The algorithm's power is most evident when it is confronted with problems involving high dimensionality (hundreds to hundreds of thousands of attributes, as in genetics and economics) and multitudes of local optima.
(2) They feature intrinsic parallelism, in the sense that the solution space is searched uniformly and there cannot be a terminal entrapment on a local optimum. (3) They feature the

quality of a robust search, in the sense that the number of trials allocated to the observed best solutions increases exponentially with respect to the remainders. The adaptive algorithm maintains a list of strings of finite length, called structures. String manipulation operations are used to improve the list, such that the 0's or 1's (or combinations thereof) at specific positions which tend to improve the performance of structures will persist consistently within the structures appearing in subsequent lists, whereas the 0's or 1's which tend to deteriorate the performance of structures will eventually recede. [HOLL 75] is entirely devoted to the treatment of adaptive algorithms. If the structures are simply strings of 0's and 1's of length l, then any such string with an arbitrary arrangement of 0's and 1's may appear in the list (after a repetitive application of the string manipulation operations) at some stage in the execution of the algorithm, and thus constitutes an acceptable structure. Obviously, this is not satisfactory in the context of solving a constrained integer optimization program of the type featured by the uncapacitated model. Indeed, if a structure is mapped to a vector of 0-1 variables, then only those vectors that are feasible would have to be considered. Furthermore, if the list to be maintained by the adaptive algorithm were to contain feasible solution vectors only, there would be no guarantee that a repetitive application of the string manipulation operations on the 0-1 strings which represent those vectors would not interfere with feasibility. Therefore, when solving the uncapacitated model, the 0-1 strings of the list cannot simply be mapped in a one-to-one manner to vectors of 0-1 variables in the mathematical model.
The structures of the list must be defined in such a way that they represent feasible solutions (in an encoded form), and such that they remain meaningful at all times in the course of the execution of the algorithm, when submitted to any number of applications of the string manipulation operations. The adaptive algorithm used to solve the uncapacitated model is described in the next section.

6.2.1. Adaptive random search for the uncapacitated model

It is shown how feasible solutions of the uncapacitated model are encoded into structures which can be used by an adaptive random search algorithm. Any element of the list maintained by the adaptive algorithm will be a string of 0's and 1's of length v, such that it can be decoded and mapped to the variables of the uncapacitated model. (These variables correspond to the components of the vector (..., X_t^{p,k}, ..., X_{t,i}^{p,q,k}, ..., Y_{i,h}, ...).) The code for a structure is as follows: a structure is divided into detectors (numbered 1 through w, where w is equal to the total number of couples (t,k) in the mathematical model). Each detector has a fixed width (in bits) which can be equal to a constant width1, width2 or width3; these constant widths are sufficient to represent a partition index of CRDUs with, respectively, one, two or three clusters of CRDUs*. Width1 is defined to be the number of bits to be allocated to a detector which corresponds to a query that needs to access only one CRDU. Therefore width1 is simply equal to the number of bits required to identify one node in the computer network, when the nodes are numbered in binary form (the length, in bits, of a node binary representation is referred to as nodesize). For instance, if there are four nodes in the computer network, then width1 is equal to 2 bits. Width2 is defined to be the number of bits to be allocated to a detector which corresponds to a query that may access up to two distinct CRDUs. Therefore width2 is two nodesize bits long (to account for up to two nodes to be accessed) plus one bit to indicate the partition index (i.e., whether the two CRDUs are clustered at one node or located at different nodes). Thus, for a four-node network, width2 is equal to five. Similarly, width3 is defined for queries which may access up to three distinct CRDUs.
Therefore width3 is three nodesize bits long plus three bits (to indicate the partition index, which can be 1 through 5). Thus, for a four-node network, width3 is equal to 9. Figure 6.1 contains some useful notations. In Figure 6.3-i, a structure with twelve site-query couples is shown. The encoding scheme is indicated in Fig-

*Obviously, if the simultaneous access of more than three CRDUs is allowed at a time, for any query, then the present representation needs to be extended accordingly.

ure 6.2. In Figure 6.3-i the first row of the structure contains the partition index for each query. The second, third and fourth rows contain, respectively, the node indexes of the first, the second and the third cluster of each partition. Each column corresponds to a detector as defined earlier. Note that queries Q1 and Q3 have only one CRDU in their envelope (m = 1); the next group of queries have two CRDUs in their envelopes (m = 2), and so on. Figure 6.3-ii depicts the binary representation of the same structure, following the encoding indicated in Figure 6.2-ii. Thus, query Q1, which has one possible partition and one cluster, is represented only by the node at which the cluster is located, which in this particular case is node 4 (11 in binary form). Similarly, the node to which the unique cluster of query Q3 is assigned is node 3 (10 in binary form). Next, query Q2 is shown to be solved through the access of its envelope in the form of partition 1. Thus, p2 ∈ PAR2 is equal to 1 (0 in binary form). This particular partition of ACC2 results in CLU_2^1. The unique cluster in CLU_2^1 is shown to be accessed at site 2 (01 in binary form). Since there is no other cluster in CLU_2^1, the entry for the node of cluster 2 is X, which means "don't care". The other detectors are similarly represented. An algorithm to decode a structure and extract the actual values of its corresponding solution vector follows:

ALGORITHM EvalX:
for all Q_t^k do
begin
    p := decodepar(π_t(Q_t^k))
    X_t^{p,k} := 1
    for all p' ∈ PAR_t^k such that p' ≠ p do
    begin
        X_t^{p',k} := 0
        for all q ∈ CLU_t^{p',k} do
            for all i ∈ N do X_{t,i}^{p',q,k} := 0
    end
    for all q ∈ CLU_t^{p,k} do
    begin
        r := decodenode(ρ_t(q,p))
        X_{t,r}^{p,q,k} := 1
        for all h ∈ q do Y_{r,h} := 1
        for all j ≠ r do X_{t,j}^{p,q,k} := 0
    end
end
end EvalX.

Using this decoding technique, the structures can be submitted to the string manipulation operations without fear of interfering with the feasibility of solutions. Indeed, any string of bits of length v, where v is the sum of the detector widths width(i) ∈ {width1, width2, width3}, maps to a unique feasible solution to the uncapacitated model.
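The detector decoding can be sketched as follows (Python). The helper names decode_node and decode_detector are illustrative, not the thesis's decodepar/decodenode routines, and the partition-code table for m = 3 follows the encoding of Figure 6.2 for a four-node network.

```python
NODESIZE = 2                      # four-node network: 2 bits per node index

def decode_node(bits):
    return int(bits, 2) + 1       # nodes numbered 1..4

def decode_detector(bits, m):
    """Decode one detector for a query whose envelope holds m CRDUs.
    Returns (partition index, [node of cluster 1, node of cluster 2, ...])."""
    if m == 1:                            # width1 = NODESIZE bits
        return 1, [decode_node(bits)]
    if m == 2:                            # width2 = 1 + 2 * NODESIZE bits
        p = int(bits[0]) + 1
        nodes = [decode_node(bits[1:1 + NODESIZE])]
        if p == 2:                        # two singleton clusters
            nodes.append(decode_node(bits[1 + NODESIZE:1 + 2 * NODESIZE]))
        return p, nodes
    # m == 3: width3 = 3 + 3 * NODESIZE bits; partition codes as in Figure 6.2
    p = {"001": 1, "010": 2, "110": 2, "011": 3, "111": 3,
         "100": 4, "000": 4, "101": 5}[bits[:3]]
    nclusters = {1: 1, 2: 2, 3: 2, 4: 2, 5: 3}[p]
    nodes = [decode_node(bits[3 + i * NODESIZE:3 + (i + 1) * NODESIZE])
             for i in range(nclusters)]
    return p, nodes

print(decode_detector("11", 1))      # -> (1, [4]): a single cluster at node 4
print(decode_detector("00101", 2))   # -> (1, [2]): both CRDUs clustered at node 2
```

Note that every bit pattern decodes to some legal (partition, nodes) pair, which is exactly the property that lets the string manipulation operations run freely without breaking feasibility.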

r = (t,k): a site-query couple
Q^m = { Q_r : |ACC_r| ≤ m }
Q_r ∈ Q^m → PAR_r ⊆ P^m,  P^m = {1, 2, ..., 5} (for m = 3)
π_r : Q^m → P^m,  random assignment
p ∈ P^m → CLU_r^p,  R^m = {1, 2, 3} (for m = 3)
ρ_r : R^m × P^m → N ∪ {X},  random assignment
X: don't care

Figure 6.1. Random search assignments.

<STRUCTURE> ::= <PORTION-1> <PORTION-2> <PORTION-3>
<PORTION-m> ::= <SUBSTR-m> | empty
<SUBSTR-m> ::= <DETEC-m> | <SUBSTR-m> <DETEC-m>
<DETEC-1> ::= <NODE>
<DETEC-2> ::= <PARTITION-2> <NODE> <NODE>
<PARTITION-2> ::= 0 | 1
<DETEC-3> ::= <PARTITION-3> <NODE> <NODE> <NODE>
<PARTITION-3> ::= 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
<NODE> ::= 00 | 01 | 10 | 11

|<STRUCTURE>| = |Q^1| × nodesize + |Q^2| × (2 nodesize + 1) + |Q^3| × (3 + 3 nodesize) = v
|Q^1| + |Q^2| + |Q^3| = w

Code for a structure when |N| = 4 and |ACC| ≤ 3.

0-1 representation of a <PARTITION-m> field in a detector:
  m   Partition index p = π_r(Q_r)   Binary representation   Partitioning
  2   1                              0                       <AB>
  2   2                              1                       <A> <B>
  3   1                              001                     <ABC>
  3   2                              010 or 110              <A> <BC>
  3   3                              011 or 111              <AB> <C>
  3   4                              100 or 000              <AC> <B>
  3   5                              101                     <A> <B> <C>

Structure definition for Algorithm XrefU when N = {1, 2, 3, 4}.

Figure 6.2. Encoding scheme for the adaptive algorithm.
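The structure-length formula of Figure 6.2 can be checked with a one-line sketch (Python); the query-type counts used in the example call are hypothetical, not the MRDB figures.

```python
def structure_length(q1, q2, q3, nodesize):
    """v = |Q1|*nodesize + |Q2|*(2*nodesize + 1) + |Q3|*(3 + 3*nodesize),
    i.e. the total number of bits across all detectors."""
    return q1 * nodesize + q2 * (2 * nodesize + 1) + q3 * (3 + 3 * nodesize)

# Hypothetical mix: 2 one-CRDU, 4 two-CRDU and 6 three-CRDU queries, |N| = 4
# (nodesize = 2): widths are 2, 5 and 9 bits per detector.
print(structure_length(2, 4, 6, 2))   # -> 2*2 + 4*5 + 6*9 = 78 bits
```

The number of detectors w is just q1 + q2 + q3 (12 in this example, matching the twelve-couple structure of Figure 6.3).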

[Figure 6.3-i shows a structure with twelve site-query couples at some step i: the queries are grouped by m = 1, m = 2 and m = 3; the first row gives the partition index p_r = π_r(Q_r) for each query, and the next three rows give the nodes ρ_r[1,p], ρ_r[2,p] and ρ_r[3,p] assigned to the first, second and third clusters (X = don't care). Figure 6.3-ii gives the physical binary representation of the same structure, for N = {1, 2, 3, 4}.]

Figure 6.3. A structure for the adaptive algorithm.

The adaptive algorithm used to solve the model follows shortly. At any step t in the execution of the algorithm, a list of u structures (structure[.]) is maintained. Perf[i] denotes the performance of structure structure[i]. Each structure is decoded using EvalX, shown earlier, and the resulting vector is plugged into a cost-evaluation routine. Typically, the performance decreases when the cost increases. Ptercur and ptercurdown are two pointers in the current list of structures at step t. Ptrnext is a pointer in the new list strucnext[.], which will become the current list at step t + 1. Procedure EvalOffspring evaluates the offspring of each structure in the current list. Offspring[i] is defined to be an integer function of perf[i] / perfave, where perfave is the average performance for the current list. If offspring[ptercur] is non-zero, then structure[ptercur] participates in string-manipulation operations (OperateLeft, OperateRight) with another structure structure[ptercurdown] to produce one or several new structures strucnext[ptrnext] in the new list. The string-manipulation operations include mutation, inversion and cross-over as defined in [HOLL 75]. They are summarized in Figure 6.4. Algorithm XrefU follows:

ALGORITHM XrefU:
/* Adaptive random search to solve the uncapacitated model */
t := 0
ReadList /* Arbitrary initial list of structures */
while t < tmax do /* Termination criterion */
begin
    for i := 1 to u do perf[i] := eval(structure[i]) /* Performance of each structure */
    t := t + 1
    ptrnext := 0
    EvalOffspring /* Calculate the offspring of each structure as a function of its performance */
    ptercur := 1
    while (ptercur ≤ u) and (ptrnext < u) do
    begin
        ptercurdown := ptercur
        while offspring[ptercur] ≠ 0 and ptrnext < u and ptercurdown < u do
        begin
            if offspring[ptercurdown] ≠ 0 then
            begin
                offspring[ptercurdown] := offspring[ptercurdown] - 1
                offspring[ptercur] := offspring[ptercur] - 1
                ptrnext := ptrnext + 1
                strucnext[ptrnext] := OperateLeft(structure[ptercur], structure[ptercurdown])
                strucsecond := OperateRight(structure[ptercur], structure[ptercurdown])
                if ptrnext < u then
                begin
                    ptrnext := ptrnext + 1
                    strucnext[ptrnext] := strucsecond
                end
            end
            else if ptercurdown = u then
            begin
                ptrnext := ptrnext + 1
                strucnext[ptrnext] := Operate(structure[ptercur])
                offspring[ptercur] := offspring[ptercur] - 1
            end
            ptercurdown := ptercurdown + 1
        end
        ptercur := ptercur + 1
    end
    if ptrnext < u then
        for i := ptrnext + 1 to u do strucnext[i] := structure[u]
    structure := strucnext
end
end XrefU.

6.2.2. Solving the MRDB problem with XrefU

Some properties of Algorithm XrefU are listed. Algorithm XrefU may be initiated provided that an initial list of structures (encoded feasible solutions) is readily available. The length of this initial list is arbitrary, and so is its content; i.e., any structures may be used to fill it in. The longer the list is, the more information is generated in the course of the execution of the algorithm; however, the performance of the algorithm will tend to degrade. Algorithm XrefU was implemented in PL/1. Generally, it was found that the list may contain between 15 and 40 structures for best overall results. Five runs were conducted, for the MRDB problem, on the Honeywell 870 of the Rome Air Development Center (via the Multics system). Each run was associated with a different, and completely arbitrary, list of structures. The results are shown in Figure 6.5.
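XrefU instantiates a Holland-style adaptive (genetic) search. The following is a minimal, generic sketch of such a search on bit strings (Python), maximizing a toy "count the 1-bits" performance; the population size, mutation rate and fitness-proportional selection scheme are illustrative and do not reproduce XrefU's EvalOffspring/OperateLeft/OperateRight bookkeeping or its feasibility-preserving encoding.

```python
import random

def crossover(a, b, pos):
    """Exchange the tails of two strings beyond position pos (cross-over)."""
    return a[:pos] + b[pos:], b[:pos] + a[pos:]

def mutate(s, rate=0.02):
    """Flip each bit independently with the given probability (mutation)."""
    return "".join("10"[int(c)] if random.random() < rate else c for c in s)

def adaptive_search(perf, length=20, popsize=20, tmax=50, seed=0):
    """Fitness-proportional reproduction with cross-over and mutation."""
    random.seed(seed)
    pop = ["".join(random.choice("01") for _ in range(length))
           for _ in range(popsize)]
    for _ in range(tmax):
        scores = [perf(s) for s in pop]
        new = []
        while len(new) < popsize:
            a, b = random.choices(pop, weights=scores, k=2)
            x, y = crossover(a, b, random.randrange(1, length))
            new += [mutate(x), mutate(y)]
        pop = new[:popsize]
    return max(pop, key=perf)

# Toy performance ("OneMax"): reward strings with many 1-bits.
best = adaptive_search(lambda s: s.count("1") ** 2 + 1)
print(best.count("1"))
```

As the quoted properties suggest, good bit patterns (schemata) accumulate in the list over the generations while poor ones recede; no single local optimum traps the whole population.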

Although the results obtained are not optimal, several positive remarks can be made: (1) The algorithm quickly eliminates the feasible solutions which are too expensive. (2) Even when the list is only 20 structures long, the algorithm is still capable of improving solutions at a steady rate within an average of 21 iterations, or an estimated 6.69 seconds of the Amdahl. Stagnation signs may only appear after the 30th iteration, or so. Refer to remark (5) for possible improvements. (3) The best cost obtained was 4360 kilobits/hour, observed after the 47th iteration of run 1 (which used a list of 34 structures), after an estimated 24.15 seconds of the Amdahl. This is relatively close to the optimal cost of 3390 kilobits/hour, since the lowest cost in the initial list was observed to be 12,900 kilobits/hour, and the highest cost was 52,200 kilobits/hour. Also, as can be seen in Figure 6.5, the difference between the optimal cost and the best cost obtained with XrefU is 970 kilobits/hour, which is of the order of magnitude corresponding to the shipment of the volume of data for one query. (4) Although the best feasible solution observed with XrefU, after a reasonable amount of CPU time, is not likely to be the true optimal solution, XrefU generates low-cost 0-1 feasible solutions that can be used to great advantage by a more conventional BNB algorithm such as the one described earlier. In other words, an algorithm such as XrefU can be used as a preprocessing tool to reduce the initial size of the enumeration tree in the BNB method. (5) It is still possible to further improve XrefU by "refreshing" the current list periodically and by superposing some localized optimization based on greedy techniques.

Discussed in the next section is an optimization technique to tackle the problem of allocating cross-referencing data units when the communication network has a finite flow capacity on each of its channels.

6.3. The capacitated model

When the flow in each communication channel of the network must be kept below a fixed level, called the flow capacity for this channel, the constraints of type (mip-1) and

(mip-2) must be added to the mathematical model. The capacitated model is a mixed-integer (0-1) program where the objective function as well as the constraints are linear. More specifically, the model assumes the form of a minimum-cost, multi-commodity flow problem with additional linear constraints involving the 0-1 variables. The algorithm presented in the coming section is based on the decomposition* technique for mixed-integer programs [MURT 76].

6.3.1. Decomposition of the capacitated model

The capacitated model, with fixed 0-1 variables, and its dual, are written in a convenient form for future calculations. The mixed-integer program (MIP) was first presented in Section 5.2. When all the integer (0-1) variables are fixed, (MIP) assumes the form of a standard linear program (LP). For clarity the model will be rewritten in the following form:

(LP): minimize  Σ_{(i,j)∈A} Σ_{s∈N} c_{ij} x_s(i,j)

subject to:

−x_s(i,N) + x_s(N,i) = −ASM_{s,i} − RDC_{s,i} − UPD_{s,i}    (lp-1)
0 ≤ x_s(i,j) ≤ K_{(i,j)}    (lp-2)

The only difference between (LP) and (MIP) is that in (LP) the variables to deal with are the real flow variables. (LP) is indeed a capacitated minimum-cost multi-commodity flow problem. Let x^(s,i), or simply x^r, denote the node-arc flow vector for the r-th commodity (associated with a source s and a sink i), for r = 1 to p (p = |N| × |N|). Thus

x_s(i,j) = Σ_{i'∈N} x^(s,i')(i,j),  for any (i,j) ∈ A.

Introducing these new variables, (LP) can be rewritten using the node-arc (na) formulation:

*Also called projection or partitioning.

(NA-LP): minimize  Σ_r Σ_{(i,j)} c_{ij} x^r(i,j)

subject to:

(i) Capacity constraints:

x^1 + x^2 + ... + x^r + ... + x^p ≤ K    (na-lp-1)

for r = (s,i), s, i ∈ N, where x^r = (x^r_1, ..., x^r_{|A|})^T and K = (K_1, ..., K_{|A|})^T (K_j is the flow capacity of arc j ∈ A).

(ii) Flow conservation constraints:

E x^r − q_r V_r = 0    (na-lp-2)    r = 1 to p
x^r ≥ 0    (na-lp-3)

where V_r = ASM_r + RDC_r + UPD_r is the flow value of node-arc flow vector x^r, E is a (|N| × |A|) node-arc matrix, and q_r is a (|N| × 1) node-arc vector corresponding to arc r = (s,i):

q_r = (0, 0, ..., 0, 1, 0, ..., 0, −1, 0, ..., 0)^T

with the 1 in position s and the −1 in position i. The arc capacity constraints of (i) lead to the master program of (NA-LP), and the flow conservation equations of (ii) lead to the subprogram of (NA-LP). The subprogram breaks up into p independent subproblems which do not have variables in common. For the r-th subproblem, the constraints are:

E x^r − q_r V_r = 0
x^r ≥ 0

Every basic vector for (na-lp-2) consists of flow variables associated with arcs in a spanning tree in G. In the basic feasible solution corresponding to a feasible basic vector for (na-lp-2), the flow value on all the arcs in the simple chain from s to i in the associated spanning tree is equal to V_r, and the flow on all the other arcs is zero. Thus the r-th subproblem can be

interpreted as a shortest chain problem from s to i in G. The arc-chain formulation of (LP) (as opposed to the node-arc formulation above) makes use of the chains D^1_r, D^2_r, ..., D^{d_r}_r from s to i, and decides how much to ship along each of these chains [MURT 81]. Following the notation of [MURT 81], let x^r_h represent the quantity of the r-th commodity shipped along the chain D^h_r (h = 1, ..., d_r, and r = 1, ..., p); the amount of flow of the r-th commodity on an arc (i,j) is:

x^r(i,j) = Σ_{h = 1, ..., d_r : (i,j) ∈ D^h_r} x^r_h

and

V_r = Σ_{h=1}^{d_r} x^r_h.

The master program corresponding to the multi-commodity minimum-cost flow problem (NA-LP), stated in arc-chain formulation, is:

(AC-MLP): minimize  z = Σ_{r=1}^{p} Σ_{h=1}^{d_r} ( Σ_{u ∈ D^h_r} c_u ) x^r_h

subject to:

Σ_{r=1}^{p} Σ_{h : u ∈ D^h_r} x^r_h ≤ K_u    (ac-mlp-1)    u = 1, ..., |A|
Σ_h x^r_h = V_r    (ac-mlp-2)    r = 1, ..., p
x^r_h ≥ 0    (ac-mlp-3)    r = (s,i),  h = 1, ..., d_r

The number of constraints in the master program is |A| + |N|^2, and hence the basic vectors of the master program will consist of |A| + |N|^2 variables.
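Each subproblem above is a shortest-chain computation from s to i. A minimal Bellman-Ford sketch (Python; the 4-node arc data in the example are hypothetical, not the MRDB network):

```python
def shortest_chain(arcs, source, sink, n):
    """Bellman-Ford on a directed graph given as a list of (i, j, cost)
    arcs over nodes 1..n. Returns (cost, chain) for the cheapest chain
    from source to sink."""
    INF = float("inf")
    dist = {v: INF for v in range(1, n + 1)}
    pred = {v: None for v in range(1, n + 1)}
    dist[source] = 0.0
    for _ in range(n - 1):                 # n-1 relaxation passes suffice
        for i, j, c in arcs:
            if dist[i] + c < dist[j]:
                dist[j], pred[j] = dist[i] + c, i
    chain, v = [sink], sink                # walk predecessors back to source
    while pred[v] is not None:
        v = pred[v]
        chain.append(v)
    return dist[sink], chain[::-1]

# Hypothetical 4-node network: unit-cost path 1-2-3-4, a 1.5-cost shortcut
# 1-3, and an expensive direct arc 2-4.
arcs = [(1, 2, 1.0), (2, 3, 1.0), (3, 4, 1.0), (1, 3, 1.5), (2, 4, 3.0)]
print(shortest_chain(arcs, 1, 4, 4))   # -> (2.5, [1, 3, 4])
```

In the decomposition, one such computation per commodity r = (s,i) replaces the flow-conservation subprogram, since the whole flow value V_r travels along a single cheapest chain.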

c^r = ( Σ_{u ∈ D^h_r} c_u )_{h=1,...,d_r}    (1 × d_r) vector
x^r = (x^r_1, ..., x^r_{d_r})^T    (d_r × 1) vector
B^r = [a^r_{uh}]    (|A| × d_r) matrix, where
    a^r_{uh} = 1 if arc u is on chain D^h_r, 0 otherwise
b = (K_1, ..., K_{|A|})^T    (|A| × 1) vector
d^r = (1, 1, ..., 1)    (1 × d_r) vector
V^r = ASM_r + RDC_r + UPD_r    (1 × 1) scalar
V = (V^1, ..., V^{|N|^2})^T    (|N|^2 × 1) vector
r = (s,i) for some s ∈ N, i ∈ N

Table 6.7. Notations for the capacitated model.

After some further change in notation (Table 6.7), (AC-MLP) can be rewritten as:

(AC-MLP): minimize  z = c^1 x^1 + ... + c^r x^r + ... + c^p x^p

subject to:

B^1 x^1 + ... + B^r x^r + ... + B^p x^p ≤ b    (ac-mlp-1)

d^1 x^1 = V^1    (ac-mlp-2)
...
d^p x^p = V^p

x^r ≥ 0    (ac-mlp-3)

There are |A| constraints of type (ac-mlp-1) and |N|^2 constraints of type (ac-mlp-2). Let π be a (1 × |A|) dual vector corresponding to constraints (ac-mlp-1) and μ be a (1 × |N|^2) dual vector corresponding to constraints (ac-mlp-2). The dual of (AC-MLP), which we will call (DL), is:

(DL): maximize  −π b + μ V

subject to:

−π B^1 + μ_1 d^1 ≤ c^1    (dl-1)
...
−π B^p + μ_p d^p ≤ c^p

π ≥ 0 and μ unrestricted.    (dl-2)

(DL) has |A| + |N|^2 variables but has a very large number of constraints, namely Σ_{r=1}^{p} d_r constraints. In the next sections an algorithm (XrefC) that generates improved feasible solutions for the capacitated model is described and discussed.

6.3.1.1. Optimizing the capacitated model

A projection technique [MURT 76, SALK 75] is used to solve the capacitated model. We have shown above that the capacitated model can be formulated as a minimum-cost, multi-commodity flow, arc-chain problem (AC-MLP) when all the 0-1 variables are fixed. Its dual, denoted (DL), was stated. Furthermore, the domain

Δ = { (π, μ) : (dl-1) and (dl-2) }

is independent of the 0-1 variables and is convex.

Proposition 18: The domain Δ is a convex polyhedron.

Proof: That Δ is a polyhedron is obvious, since the constraints (dl-1) and (dl-2) are linear. Next, if the dual problem (DL) is feasible and its objective function is unbounded above (on the dual feasible domain Δ), then the primal problem (AC-MLP) cannot have a feasible solution. (This follows from a corollary of the weak duality theorem.) If the objective were unbounded over Δ, then any feasible dual solution (π, μ) would have an unbounded objective, which would mean that for any value of the 0-1 variable vector, the corresponding primal solution (the flow) is infeasible. But the flow is only infeasible in the particular situation where there is not enough capacity to accommodate it for a specific repartition of data. We assume that there is enough capacity in the network to satisfy the flow for at least one allocation of data, and therefore the objective must be bounded over Δ.

For simplicity it will be assumed further that Δ is bounded. In practice this means that for any value of the 0-1 variables X_{t,i}^{p,q,k}, X_t^{p,k} and Y_{i,h}, there exists a feasible flow (although not necessarily the flow that would have been obtained when the capacity constraints are omitted). In the context of this thesis, this is a very reasonable assumption, because the problem under study is the design and usage of a distributed database on an existing computer network. The maximum of −π b + μ V over Δ occurs at an extreme point of Δ. We denote the extreme points by (π^t, μ^t) (t = 1, ..., T). Therefore problem (DL) may be written as:

(DL'): maximize_{t = 1, ..., T}  −π^t b + μ^t V.

And thus (MIP) becomes a full integer problem (IP):

(IP): minimize  [ maximum_t ( −π^t b + μ^t V ) ]

subject to: (mip-1), (mip-2), (mip-3) and (mip-4). Algorithm XrefC follows:

ALGORITHM XrefC:
NoXtremePts := 0
X := X_0 /* a feasible 0-1 solution, where X stands for (..., X_t^{p,k}, ..., X_{t,i}^{p,q,k}, ..., Y_{i,h}, ...) */
UpperBnd := +∞
while z < UpperBnd − ε do
begin
    (π, μ)[NoXtremePts + 1] := Solve(DL)
    NoXtremePts := NoXtremePts + 1
    if UpperBnd > max { −π^t b + μ^t V(X) } (t = 1, ..., NoXtremePts) then
        UpperBnd := max { −π^t b + μ^t V(X) }
    X := Solve(IP)
end
Flow := Solve(AC-MLP) /* Solve for x^r_h, for all r = 1 to p and all h = 1 to d_r */
end XrefC.

To generate the extreme points of Δ, the arc-chain formulation of the primal program (AC-MLP) is solved rather than the direct program (DL). A column generation version of the revised simplex method with the product form of the inverse tableau [MURT 76, MURT 81] is used. It is known that the computational effort involved tends to be smaller, and the numerical stability greater, than would be the case in carrying out the same algorithm using the explicit form of inverses. Furthermore, the pivot matrices occupy very little storage space. The dual solution (π, μ) is read directly from the inverse tableau. Therefore (AC-MLP) can be solved optimally and very efficiently to find the dual solution (π, μ) and the flow vector x^(s,i). The main obstacle to finding an optimal solution for the capacitated model comes from the difficulty of solving (IP) optimally in XrefC, as was the case when solving the uncapacitated model with BNB. For the uncapacitated model, an adaptive random search algorithm (XrefU) was described that can handle the constraints on the 0-1 variables, and can generate good

feasible solutions in a relatively reasonable amount of CPU time. Of course, it cannot be guaranteed that XrefU will generate an optimal solution for the uncapacitated problem, but it may be argued that looking for the true optimal solution is a luxury not worth paying for anyway. In the context of XrefC, the problem of solving a large integer program (IP) still remains, and the questions that must be addressed are the following. (1) In each main iteration of Algorithm XrefC, is it acceptable to use suboptimal values for the 0-1 variables X and still preserve the meaningfulness of the underlying projection technique? (2) If the answer to (1) is yes, and if an adaptive random search is used to solve (IP), how robust will XrefC be? In the next section question (1) will be answered and the issue raised by question (2) will be discussed.

6.3.1.2. Convergence and robustness of XrefC

It is argued that Algorithm XrefC features the robustness of an adaptive search. Problem (DL) has a finite number of solutions (π, μ). However, if (IP) is not solved optimally within each iteration of XrefC, then a dual solution (π, μ) may be produced more than once. If (IP) is allowed to terminate whenever z_{t+1} < z_t − ε, for some positive ε, then the incumbent objective z must be improved at each repetition of a dual solution. But since the optimal value of (MIP) is bounded below, a dual solution can only be repeated a finite number of times. Therefore, if the algorithm used to solve (IP) eventually converges, then XrefC must converge too. If (IP) is solved with an adaptive random search, as suggested in the previous section, nothing can be said about the convergence of XrefC, because it is known that the adaptive search does not necessarily converge [HOLL 75]. However, each iteration of the adaptive algorithm used to solve (IP) will generate improved solutions for (MIP).
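In the column generation step mentioned earlier, a chain D^h_r prices into the basis when its reduced cost, c_h + Σ_{u∈D^h_r} π_u − μ_r with c_h = Σ_{u∈D^h_r} c_u, is negative; the pricing problem is therefore a shortest-chain computation with modified arc costs c_u + π_u. A toy sketch of the reduced-cost test (Python; the arc costs and dual values are hypothetical):

```python
def reduced_cost(chain_arcs, c, pi, mu_r):
    """Reduced cost of a candidate chain, given arc costs c[u], dual arc
    prices pi[u] (from the capacity constraints (ac-mlp-1)) and the dual
    mu_r of the flow-value constraint (ac-mlp-2) for commodity r."""
    return sum(c[u] + pi[u] for u in chain_arcs) - mu_r

# Hypothetical data: three arcs, a candidate chain using arcs a1 and a2.
c = {"a1": 2.0, "a2": 1.0, "a3": 4.0}
pi = {"a1": 0.5, "a2": 0.0, "a3": 1.5}
print(reduced_cost(["a1", "a2"], c, pi, mu_r=4.0))   # -> -0.5: column enters
```

When no chain for any commodity has negative reduced cost, the current basis is optimal and the dual solution (π, μ) can be read off, as described for the product-form inverse tableau.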
Finally, it is interesting to remark that the convergence of the algorithm used to solve (IP) is a sufficient condition for the convergence of XrefC, but it is not necessary. Therefore the answer to question (1) is: XrefC is still meaningful even though (IP) is not solved optimally during each

iteration. Before discussing question (2), it is helpful to recall what is meant by the notion of robustness in the context of a random search algorithm; as was mentioned in Section 6.2, a random search is said to be robust if the number of trials allocated to the observed best solutions increases exponentially with respect to the remainders in an observed sample. An adaptive search such as XrefU exhibits such a property. What can be concluded about XrefC when an adaptive search is used to solve (IP)? In the following paragraphs the notation shown in Table 6.8 will be used. Consider Figure 6.6, which helps to visualize the execution of XrefC with respect to a discrete variable t_j (j ≥ 0) corresponding to the iteration process in the adaptive algorithm used on (IP). On the axis of Figure 6.6, each interval [t_{j−1}, t_j) corresponds to one execution of

Structure-schema ξ: the set of all structures that display a predefined pattern of 0's and 1's at specific positions (with "don't care" at the remaining positions).

For each structure-schema ξ:
  M_ξ(t)   Number of instances of ξ in the list of structures, at step t.
  N_ξ(t)   Number of trials allocated, from t_0 to t, to structures which are instances of ξ.
  n_ξ(t)   Number of trials allocated, from t_0 to t, to structures which are not instances of ξ.
  t_0      Initial step.

Table 6.8. Additional definitions for the adaptive algorithm.

the main "do while" loop of XrefC, except for the line X := Solve(IP), whereas an interval [t_j, t_{j+1}−1) corresponds to one execution of the line X := Solve(IP). Given a structure-schema ξ, the following was proven in [HOLL 75]:

N_ξ(t) ≥ M(t_j) e^{c_ξ (t − t_j)},  t_j ≤ t < t_{j+1},

for some positive rate c_ξ. In this context, for any t the following can be written:

N_{t_0}(t) = Σ_{j ≥ 0: t_j < t} N_ξ(t_{j+1}−1) + N_ξ(t) ≥ Σ_{j ≥ 0: t_j < t} M(t_j) e^{c_ξ (t_{j+1}−1−t_j)} + N_ξ(t).

In the course of Algorithm XrefC, a new constraint is added to (IP) whenever a new dual solution is generated; therefore

M(t_j) = M(t_j − 1) − r(j) M(t_j − 1),

where 0 ≤ r(j) < 1 and lim_{j→∞} r(j) = 0. With m(t_j) = M(t_j) − M(t_j − 1), the lower bound for N_{t_0}(t) can now be rewritten as

N_{t_0}(t) ≥ Ñ_{t_0}(t) − Σ_{j ≥ 0: t_j < t} m(t_j) e^{c_ξ (t_{j+1}−1−t_j)} + N_ξ(t),

where Ñ_{t_0}(t) = M(t_0) e^{c_ξ (t − t_0)}. Both an increase in j and an increase in t contribute to making m(t_j) decrease, so that eventually

N_{t_0}(t) ≫ Σ_{j ≥ 0: t_j < t} m(t_j) e^{c_ξ (t_{j+1}−1−t_j)},

in which case N_{t_0}(t) ≈ M(t_0) e^{c_ξ (t − t_0)}, which indicates robustness. Informally, XrefC is said to exhibit the same kind of robustness as does XrefU "in the course" of its execution. In the next chapter some practical remarks are drawn from the usage of the CRDU-distribution model.
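The termination argument above can be illustrated with a small sketch (all names here are invented for the illustration; this is not the thesis code): even an inner solver for (IP) that merely beats the incumbent by ε forces the outer loop to converge, because every accepted step lowers a cost that is bounded below.

```python
def xref_c_sketch(cost, x0, eps=1.0, max_outer=100):
    """Toy analogue of XrefC's outer loop. The inner (IP) "solver" is
    deliberately suboptimal: it returns the FIRST single-bit flip that
    beats the incumbent by eps, not the best one. Because every accepted
    step improves the incumbent by at least eps, and the cost is bounded
    below, the loop still terminates."""
    x, z = list(x0), cost(x0)
    history = [z]                            # incumbent objective values
    for _ in range(max_outer):
        step = None
        for i in range(len(x)):              # suboptimal inner search on (IP)
            y = x[:]
            y[i] = 1 - y[i]
            if cost(y) <= z - eps:           # accept any z+ <= z* - eps
                step = y
                break
        if step is None:                     # a dual solution would repeat
            break                            # with no improvement: stop
        x, z = step, cost(step)
        history.append(z)
    return x, z, history
```

With cost(x) = sum(x) and x0 = [1, 1, 1, 1], the incumbent decreases strictly at every outer iteration and the loop stops at the zero vector.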

[Figure 6.4: The string manipulation operations used in an adaptive search — (i) cross-over of two structures at position 5, (ii) inversion of a structure between positions 5 and 7, (iii) mutation of a structure at positions 5 and 8.]
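The three operators of Figure 6.4 are easy to state precisely. The sketch below uses 0-indexed positions and represents structures as Python bit strings (conventions assumed for this illustration; the figure's own examples use fixed-length binary structures).

```python
def crossover(a, b, k):
    """One-point cross-over: exchange the tails of two structures at position k."""
    return a[:k] + b[k:], b[:k] + a[k:]

def inversion(a, i, j):
    """Reverse the segment of a structure between positions i and j (inclusive)."""
    return a[:i] + a[i:j + 1][::-1] + a[j + 1:]

def mutation(a, positions):
    """Flip the bits of a structure at the given set of positions."""
    return "".join(("1" if c == "0" else "0") if p in positions else c
                   for p, c in enumerate(a))
```

For example, crossover("01011001", "10001101", 5) yields the pair ("01011101", "10001001"), i.e. the two tails beyond position 5 are swapped.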

[Figure 6.5: The performance of Algorithm XrefU — best objective value (x 10^3) versus number of iterations (0 to 90), with CPU times marked at several points; the curve approaches the optimal value.]

[Figure 6.6: A schematic visualization of the execution of Algorithm XrefC — the number of trials N_0, N_1, N_2, N_3 plotted over the steps t_0, t_1, t_2, t_3, t_4, one interval per iteration j = 0, 1, 2, 3, 4.]

CHAPTER 7 EXPERIMENTAL RESULTS Some practical observations are made concerning the distribution of CRDUs in a computer network. The effect of the reducing and assembly rates* on the simultaneous referencing of CRDUs is investigated (Section 7.1). The effect of the updating rates on cross-referencing is also investigated (Section 7.2). 7.1. Effect of the reducing rates on cross-referencing A first experiment consists of investigating the effect of the reducing rates, given by the entries A^fp, on the cross-referencing aspect of distributed query processing. The results of such an experiment can be of practical interest to the designer of a distributed database. Indeed, cross-referencing is acknowledged to be one cause of the need to restructure standard file-allocation models (see Chapter 2; the other cause is the clustering effect). Hence, if it is observed that, beyond a certain threshold, the reducing rates become too large compared with the assembly rates, then it can be concluded that cross-referencing is not a viable alternative to CRDU duplication. In practice, this implies that any query envelope is likely to be clustered at a single site rather than partitioned and distributed over several sites. In this case, the CRDU-distribution problem becomes quite similar to a file-allocation problem, where each query envelope is a file. On the other hand, if it is observed that the reducing rates are not likely to impair the performance of transaction processing in the computer network, then data redundancy is not likely to supersede the practice of cross-referencing. *In this chapter, the term "rate" will generally be favored over the term "flow", since the former has a more numerical connotation than the latter.

In the latter case, file-allocation techniques may not be applied, since they do not satisfactorily capture the transaction processing environment of the distributed database. Ten input data files, corresponding respectively to Problems 1-10, were set up for Algorithm XrefU. Each input-file specified the MRDB problem as before (Chapter 6 and Appendix D), but the entries of the type A^fp, and only those, were incrementally decreased until they were set equal to zero in the tenth input-file. All other parameters were kept constant; in particular, the updating rates were fixed. Some new definitions are made. It is assumed that the input-file of a problem is given in the form of tables as shown in Appendix D. All the definitions below apply to the input-file of a problem, and to the values of the variables appearing in a feasible solution of that problem. The measure of cross-referencing, ξ, is defined to be:

ξ = (number of non-zero materialization variables corresponding to clusters in the same partition but at different sites) / (total number of non-zero materialization variables).

The measure of redundancy, ρ, is defined to be:

ρ = (number of non-zero allocation variables) / (total number of allocation variables).

In the notation that follows, the subscripts t and s, which normally denote a user-site, are omitted; for example, the symbol A_ts^fp is written simply as A^fp. In this context, we are interested in the volumes of data associated with query processing sequences (entries in Tables 3 and 5, Appendix D), not in the rates; the frequencies of occurrence of queries are not considered. The total transaction volume, r, is defined to be:

r = Σ_k [ Σ_{p ∈ PAR_k} Σ_{q ∈ CLU_k^p} (A_kq^fp + A_kq^ap) + Σ_{h ∈ WRT_k} B_k^h ].

The assembly volume, a, is defined to be:

a = Σ_k Σ_{p ∈ PAR_k} Σ_{q ∈ CLU_k^p} A_kq^ap.

The reducing volume, f, is defined to be:

f = Σ_k Σ_{p ∈ PAR_k} Σ_{q ∈ CLU_k^p} A_kq^fp.

The reducing volume ratio is defined to be f/a. Figure 7.1 displays the results of the experiment: the measure of redundancy, ρ, is plotted with respect to the reducing volume ratio, f/a. Each one of the ten problems was run with XrefU. The number of iterations varied between 9 and 19, and the CPU time varied correspondingly between 27.83 seconds and 60.27 seconds on the Honeywell 870. For each problem, XrefU was initiated with the same list of 20 structures. In Figure 7.1, we observe that redundancy tends to decrease with decreasing reducing volumes. This seems to confirm initial expectations, namely, that decreasing the penalty associated with cross-referencing reduces the need for clustering query envelopes, and thus reduces the need for CRDU duplication. Further, the cross-referencing measure ξ is found to start at 22% for Problem 10, and to plateau at 10% for the first problems. Note that, when the reducing volume ratio becomes quite large,

we expect the cross-referencing to be quite small, and therefore CRDU distribution can be solved with conventional file-allocation techniques. The next experiment investigates the effect of the updating volume, to be defined, on cross-referencing. 7.2. Effect of updating on cross-referencing A second experiment consists of investigating the effect of the updating volume on cross-referencing. As in Section 7.1, a measure of cross-referencing is provided by ξ. The measure of updating volume, β, is defined as:

β = Σ_k Σ_{h ∈ WRT_k} B_k^h.

As before, the input-files of the test problems were set up for Algorithm XrefU. All input-files had identical entries for the MRDB problem, except for the entries B_k^h, and only those, which were increased incrementally. Figure 7.2 displays the results of this experiment. It shows that for small values of β, redundancy, as measured by ρ, increases. This is expected since, when updates are cheaper, it is practical to duplicate CRDUs at different sites, and thus the need for cross-referencing becomes insignificant. Indeed, whenever the updating volume is small, it is probably quite appropriate to make use of file-allocation techniques rather than CRDU-distribution techniques to solve the distribution problem. For high values of β, the duplication of CRDU copies becomes more expensive, and thus ρ decreases. On the other hand, cross-referencing becomes a practical aspect of distributed query processing, since the cost associated with the reducing flow can compete with the cost associated with the updating flow. In this latter case, where updating rates are significant, file-allocation techniques are certainly inappropriate to solve the distribution problem. 7.3. Lower and upper bounds for the communication cost If all the constraints of the uncapacitated model are eliminated, then the lowest possible cost is obviously equal to zero. For the unconstrained model, the upper bound for the cost is simply equal to r (page 187). For the problem solved earlier in Section 6.1.2 (page 159), this upper bound r is equal to 187,920 kilobits/hour. However, the communication cost is never allowed to attain the value of r, since r corresponds to an infeasible solution of the CRDU-distribution model. In practice, setting all the zero-one variables to one leads to a possible but highly unreasonable solution. Indeed, every query is solved through the simultaneous accessing of all the possible partitions of its pre-assigned query envelope. Further, since all the allocation variables are equal to one, there is usually more CRDU duplication than is needed to solve most queries. With the constraints, the lower bound for the cost is of course the optimal solution sought. The upper bound for the cost is, a priori, as hard to find as the optimal cost. However, we observe that a feasible solution with largest cost can be obtained as follows: (1) Set all the allocation variables to one. (2) For every query Q_tk, select a partition p of the pre-assigned set ACC_tk which maximizes the total assembly and reducing volume Σ_{q ∈ CLU_k^p} (A_kq^ap + A_kq^fp). (3) Whenever possible, for every set ACC_tk and its chosen partition p, with clusters q, r, ... ∈ CLU_k^p, choose the materialization variables so as to place cluster q at a site i different from t, cluster r at a site j different from both i and t, and so on. For most non-trivial models, i.e. when the computer network contains at least three sites, finding this upper value of the cost is simple. For the problem solved in Section 6.2.2, this upper bound is found to be 94,960 kilobits/hour.
Note that the best feasible solution found by Algorithm XrefU (4,360 kilobits/hour, page 171) has an error equal to ((4,360 − 3,390)/(94,960 − 3,390)) x 100 = 1.06% of the cost range when compared with the optimal solution (3,390 kilobits/hour, page 159).
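The bound-and-error bookkeeping of this section can be sketched as follows (a minimal illustration; the per-cluster volume encoding passed to the first function is invented for the example):

```python
def worst_partition(cluster_volumes):
    """Step (2): for one query, pick the partition of the pre-assigned
    envelope whose clusters carry the largest total assembly + reducing
    volume. cluster_volumes maps a partition name to a list of
    per-cluster volumes (a hypothetical encoding for this sketch)."""
    return max(cluster_volumes, key=lambda p: sum(cluster_volumes[p]))

def range_error_percent(best_feasible, optimal, upper_bound):
    """Error of a feasible solution, expressed as a percentage of the gap
    between the optimal cost and the worst feasible cost."""
    return (best_feasible - optimal) / (upper_bound - optimal) * 100.0
```

With the costs quoted above, range_error_percent(4360, 3390, 94960) evaluates to about 1.06.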

[Figure 7.1: Variations of the measure of redundancy ρ and the measure of cross-referencing ξ with respect to the reducing volume ratio.]

[Figure 7.2: Variations of the measure of redundancy ρ and the measure of cross-referencing ξ with respect to the updating volume ratio.]

CHAPTER 8 CONCLUSION This research addresses the distributed database design problem. In this context, the problem consists of selecting and distributing the building blocks of a database which will be shared by the nodes of a computer network. No a priori data-files or relational tables are assumed to be provided by the user. Section 8.1 restates the problem of interest and reviews the objectives and accomplishments of the research. Section 8.2 discusses unsolved issues. Section 8.3 outlines the main contributions of this thesis. Finally, Section 8.4 suggests some possible improvements and issues which can be the subject of further investigation. 8.1. The problem A summary of the problem follows. User-provided inputs include: (1) A universal relation scheme, which consists of (i) a universe of data-items or attributes, (ii) a set of semantic constraints, of the functional dependency (FD) type, holding on the attributes, and (iii) a set of statistical information for each attribute, for the domain from which the attribute draws its values, and for each FD. (2) A computer network with tightly or loosely connected nodes. It is assumed that the communication network has a fixed topology and that there is enough overall channel capacity. Communication costs and channel capacities are assumed to be given. (3) The user's scheduled queries and updates. A query is assumed to consist of a target list of attributes, a frequency of usage, and a set of qualification clauses holding on attributes. An update is assumed to consist of a target list of attributes, a type, i.e. whether the update is a "delete" or an "add", and a frequency of occurrence. The first phase of the design process, the selection, consists of identifying the data-units which can serve as atomic building blocks for a distributed database, and which allow efficient distributed query and update processing. The data-units are to be involved in cross-referencing operations in the context of query processing sessions. They may be visualized as relational tables or fragments thereof, and are referred to as cross-referencing data-units (CRDUs). The CRDUs are mapped to functional dependencies holding on attributes. The selection part of the design process involves the consideration of semantic issues, such as the correctness of query strategies and the preservation of database consistency under updating. Since most logical accesses translate into data-volume flows due to cross-referencing and assembling operations, optimization assumes a critical role in selecting CRDUs. The modeling of semantic constraints using graphs and matroids results in efficient optimization algorithms. At the end of this design phase, the CRDUs which can be subject to distribution are well defined. The second phase of the design process, the distribution, concentrates on the optimization of the operational communication cost. Typical issues addressed in this stage are: (i) The modeling of the cross-references of physically distant CRDUs. These cross-references are at the origin of file-to-file inter-site accesses, which add to the usual file-to-user accesses. Typically, modern strategies of distributed query processing cannot be captured by classical file-allocation models, and the inclusion of the "cross-referencing" ingredient in data-allocation changes substantially the structure of the mathematical programming models. (ii) The clustering of the CRDUs, e.g.
whether they belong to a common relational table or not, and the resulting impact on the file-to-file and file-to-user flows of data-volume. (iii) The materialization, in the static sense, of the CRDUs, i.e. the choice of the relevant instances of CRDUs to access, for each query session, when there are duplicates. (iv) The accessing routes in the

communication network, especially when flow capacity constraints must be accounted for. The distribution of the CRDUs is modeled in the form of a manageable multi-commodity mixed-integer program, with quadratic cost and linear constraints. Branch-and-bound, adaptive random search and projection techniques are used to solve the model. In addition to providing a complete procedure for distributed database design, this research draws some practical conclusions as to the distribution of CRDUs in a computer network. For instance, the effect of the file-to-file data-volume, relative to the file-to-user data-volume over all queries, on the simultaneous referencing of CRDUs is experimentally studied. Also, the effect of the flow-capacity constraints on optimal distributions and materializations of CRDUs, and on optimal access routes, is illustrated. The consideration of semantic issues gives a new dimension to the classical data-allocation problem. Indeed, the dependence among CRDUs (in this context, FDs) becomes explicit through both query and update processing strategies. Thus, a CRDU-allocation model is shown to require more modeling power than classical file-allocation models. 8.2. Unsolved Issues Some issues, mentioned below, have been left unsolved because we believe that existing techniques to solve them are appropriate, and therefore expanding upon those issues would not bring about an original research contribution. However, the designer of a distributed database who wishes to implement the proposed design methodology must be aware of those issues and must address them fully. Of importance to the designer of a distributed database is the following question: what to do once the CRDUs, i.e. the FDs, have been distributed? Firstly, the designer must be aware that after Step-b of the design procedure, i.e. after the distribution of FDs, some redundant FDs may have been allocated at any one site.
Therefore, a nonredundant subset of FDs must be identified at each site, and not just any nonredundant cover. This problem has been

addressed in the literature; e.g. refer to [KAMB 78]. Secondly, the synthesis of relational tables in third normal form, or better, may be accomplished using Kambayashi's algorithm [KAMB 78]. If, however, the local databases are to be physically implemented in the form of network databases, then techniques such as Housel and Yao's [HOUS 79] may be considered. The implementation of distributed query strategies, using the principles of semi-joins for data reduction, will then require the addition of network-to-relational interfaces at each node of the computer network. The problems mentioned above require some implementation effort. The contributions of the research are now reviewed. 8.3. Research contributions The main research contributions are now discussed. In Chapter 4, a new algorithm, Reduce, is proposed to derive an LR-minimum cover for a set of FDs. The algorithm is shown to be at least as good as Maier's algorithm [MAIE 80]. Further, it is argued that because Algorithm Reduce uses exclusively a graph structure during its execution, it can be more readily implemented, and is more direct in its approach, than Maier's algorithm. A major contribution of this research, however, is the new dimension added to data-allocation. This is so because the database distribution problem involves the distribution of data-dependent CRDUs, as opposed to data-independent files. Data-dependence is at the origin of cross-referencing operations in distributed query processing, and of cascade operations in update processing. Further, the problem of data distribution to enhance system performance is enlarged with the so-called envelope optimization problem for scheduled queries, and the problem of update cascades determination, in the first phase of the methodology (Chapter 4). The envelope optimization problem is modeled with simple graph-theoretical tools which lend themselves to efficient graph processing algorithms. Similarly, the update

cascades determination problem is tackled through direct graph processing techniques. The whole design methodology can thus be implemented as an automatic design package*. In the second phase of the methodology (Chapter 5), in addition to the novelty of addressing the problem of distributing data-dependent units, i.e. CRDUs mapped to FDs, we managed to model the CRDU-distribution problem in a compact form while adding important features previously absent from file-allocation models. Indeed, the distribution model includes the modeling of (i) the flow associated with cross-references, and (ii) the effect, on the data-volume rates, of clustering CRDUs at the same site; (i) is accomplished by setting up the optimization program as a multi-commodity flow problem, and (ii) is handled by introducing zero-one partition variables. Note that without those latter variables, the distribution model would be much more complex, i.e. highly nonlinear in terms of the allocation variables, and therefore much harder to solve. The mathematical programming model assumes the form of a quadratic program, i.e. the cost has a quadratic structure and all the constraints are linear. This latter feature allows the usage of a wide variety of computational techniques. Some of those techniques are discussed in Chapter 6. A problem of reasonable size is solved optimally. Heuristic techniques are also discussed and implemented. Chapter 6 reports several positive numerical results. Chapter 7 indicates how several useful experiments can be performed, and suggests, through numerical experiments, that simplifying design decisions can be made by looking carefully at the user-provided inputs. In the next section, areas in need of further research are pointed out. 8.4. Improvements and further research The proposed methodology can be improved if some issues are dealt with more thoroughly. These issues are discussed in the following paragraphs. *With some reservations for updates.
Refer to Section 4.5.

One important notion in database theory is the notion of user views. In this thesis, the semantic framework consists only of functional dependencies and join dependencies. Higher-level semantic notions may be introduced to capture wider updating features, such as the independence of views with respect to updating [FEDA 80]. Those semantic notions may also seek to control update cascades to simple and predictable ranges. In general, such semantic considerations have the effect of restricting the domain of possible CRDU-distribution solutions. The notion of a universal relation scheme is found to be somewhat controversial by some researchers (e.g. see [AZKE 83]). However, this assumption is not central to the proposed design methodology, and can be relaxed. The notions of query, link and envelope would then have to be specified with respect to a set of user views rather than a single universal relation scheme. The concept of relation synthesis from FDs [BERN 76] is implicit in the proposed design methodology. Non-functional relationships can usually be handled by FD modeling techniques [BERN 76], and are compatible with synthesis algorithms [BERN 76, KAMB 78]. However, the issue of handling non-functional relationships needs to be captured in a more formal framework than the one available in [BERN 76]. Another critical notion for database distribution is the notion of a distributed query processing strategy. It has been shown that some types of database schemata are highly desirable to simplify distributed query processing sequences [BEER 81]. Thus, some further restrictions may be imposed on a distributed database, and feasible CRDU-distribution strategies may be restricted accordingly. It was pointed out that distributed query processing strategies are closely related to database distribution. Strategies based on static materializations have been assumed. This means that the processing of a query starts after the access of a non-redundant portion of the database, i.e.
a non-redundant query envelope. However, if parallelism is taken advantage of in query processing [WONG 81], other distribution configurations may be sought as a result.

On the optimization side, other more efficient heuristics must be investigated. Indeed, if a network contains hundreds of nodes and a large number of user queries and CRDUs, then one may reasonably question the usefulness of the algorithms proposed for CRDU distribution. We claim that the distribution model itself is practical, even for large applications, but the solution techniques may be improved to reduce storage requirements and to keep CPU time within reasonable bounds. Larger and larger problems can now be solved optimally using standard optimization techniques, and it can be expected that problems with tens of thousands or even hundreds of thousands of variables will be solvable in reasonable CPU times. However, efficient heuristic techniques, parallel algorithms and distributed algorithms may be considered as possible present alternatives for enhanced computing performance. Finally, more thorough testing should be accomplished. In Chapter 7, the results, although plausible and reasonable, are dampened by the fact that the distribution algorithms are quite expensive to execute. We were therefore content to obtain good solutions for each problem that was tackled, but the experiments would have benefited from closer-to-optimal solutions and from a larger sample of solutions. Also, the notions of cross-referencing and redundancy are very input-data-dependent. Therefore, careful experiments should be based on the control of such parameters as the noncriticality of cross-references, the average size of query envelopes, etc. In general, the interdependence of the various parameters should be explored more thoroughly. In the same spirit, measures of performance, as well as measures of system characteristics, should be studied in a formal way using stochastic modeling and statistical analysis.

APPENDICES 199

APPENDIX A INDEX

Antecedents: The left-hand side of a functional relation from a set of data values to another set of data values, i.e. the "domain" of the function (in the usual sense of the term).

Assembly flow: The files to be read during the querying of the database are processed locally and reduced in an effort to eliminate unnecessary information. When this is accomplished, the reduced files are sent to a common node in the network and assembled at that node to produce a final answer. The flow produced by this latter process is called assembly flow.

Attribute: In APPENDIX B, PATIENT and WARD are two different attributes (also called "data-items").

Autonomous: Autonomous computers work in an "equal partnership" relationship. No computer receives orders from another one.

Availability: The availability of a file is the probability that at any instant of time, at least one copy of that file in the network resides at a site which is accessible from all other sites in the network [KHAB 80].

Centralized computing system: A computing system that operates under the sole supervision and control of a single selected computing center.

Centralized database: A complete database, located in one host computer, which features a minimum amount of data duplication.

Computer network: A network that has computers for nodes, and communication channels for edges. For large networks, the channels are usually telephone lines or satellite links.

Conceptual schema: The overall data model which captures, in a formal way, the information relevant to an organization.

Data-unit: A non-dissociable piece of data which will be copied at one or several sites of a computer network.

Data-item: The atomic data object. In this work, "data-item" and "attribute" are synonymous.

Decomposition: The application of formal rules on a data file to derive file fragments which can be recombined to recover the original file. Data integrity is preserved.

Deletion cascade: A sequence of deletions of data, triggered by a single deletion. Deletion cascades preserve database consistency.

Distributed database: The geographical distribution of a database among the sites of a computer network. An important feature of distributed databases is data duplication, for increased data availability and parallelism.

Distributed computing system: A computing system based on the distribution of processing and control functions over the sites of a computer network.

Distributed query processing: The concurrent access of distant files in a computer network to provide an answer to a query. Distributed query processing leads to the transmission of data between some of the files accessed, in order to eliminate non-relevant information. The reduced files are then moved to one single site to be assembled.

Domain: In APPENDIX B, CHARACTER(15) is the domain of attributes NAME and PHYSICIAN. The domain of an attribute is the value set from which this attribute can pick its values.

Envelope: A relevant portion of the database that is sufficient to solve a query, and that contains no extraneous information, so as to provide a correct answer to that query.

External schema: A subset of the conceptual schema that is relevant to one user of an organization.

Fragment: A file reduced through local processing at a computer site.

Functional dependency: A semantic constraint holding on some attributes. FDs are used to identify data logical accesses and constitute a basis for data integrity.

Horizontal network: A network of autonomous, logically equal, computers (see Autonomous).

Integrity: The ability to disclose correct information when correct reading and querying methods are applied.

Inter-file flow: The files to be accessed in a querying of the database are processed locally and reduced by each other, whenever possible or advantageous, in an effort to eliminate unnecessary information. This is accomplished by the transmission of common data between each pair of files.

Join: A formal operation aiming at merging relations which possess common data. In APPENDIX B, relation R12 is the join of relations R1" and R2' over the domain of NAME. We write R12 = R1"[NAME]R2'.

Logical database schema: A set of records (relations), each one identified by a heading or list of data-items (attributes).

Logical design: The definition of clusters of data-items (attributes) to determine records (relations), taking into consideration the cost, measured in data volume, of accessing a set of data-items for every query.

Lossy: That causes the creation of misleading information, due to the omission of relevant initial information.

LR-minimum cover: A set of FDs F is LR-minimum if: (1) there is no set G with fewer FDs than F such that G+ = F+ (Section 3.1.1); (2) for every FD X → Y in F, there is no X' properly contained in X with X' → Y in F+; (3) replacing FD X → Y in F by X → Y', with Y' properly contained in Y, alters the closure of F.

Materialization: A representative of the data sought by a distributed query, chosen to take advantage of data redundancy and parallelism. In this work, a nonredundant representative of the data to be accessed for the solving of one query.

Normalization: Applied to relational tables in a relational database: the restructuring, if required, of relations to force certain rules between attributes to be satisfied. The normal form of a relation is defined with respect to the functional dependencies holding on the attributes representing this relation (see Functional dependency, Attribute, Relation).

Projection: In APPENDIX B, relation R1' is a projection of relation R1 over attributes NAME, WARD. We write R1' = R1[NAME,WARD].

Query: A question to the database. In the general sense, queries are represented by a set of referenced attributes, or target list.

Query optimization: The process of choosing a set of data, and a strategy to process the data, so as to solve a query and minimize its cost.

Record: It is identified by a list of data-item names (see Relation).

Reducing operation: A query processing operation aiming at eliminating data that are of no interest for a query result. Each reducing operation can be mapped to a Projection, Selection or Join (see Inter-file flow, Projection, Selection, Join, Semi-join).
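Membership of an FD X → Y in the closure F+, used in conditions (2) and (3) of the LR-minimum cover entry, is decidable with the standard attribute-closure computation; a minimal sketch, with FDs given as (left, right) pairs of attribute sets:

```python
def attribute_closure(attrs, fds):
    """Compute X+: all attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs  # lhs is covered, so rhs is determined too
                changed = True
    return result

def in_closure(fds, lhs, rhs):
    """Test whether the FD lhs -> rhs belongs to F+."""
    return set(rhs) <= attribute_closure(lhs, fds)
```

For example, with F = {A → B, B → C}, the FD A → C is in F+ but C → A is not.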

Redundancy: One data element can have many duplicates in the actual database. In particular, in a distributed database, data may be duplicated at different sites.

Relation: A table of rows and columns. Each column is identified by an attribute name. Each row, also called a tuple, contains one value from the domain of each column-attribute (APPENDIX B; see Record, Logical database schema).

Relational algebra: A set of operations with which relations may be manipulated to achieve any desired tabular representation, i.e. to "cut" them or to combine them whenever possible. The three main operations are Selection, Projection, and Join.

Relational model: A two-dimensional tabular representation of data. APPENDIX B shows relations R1, R2 and R3 representing a database in the relational model. The relational model was defined to enhance easy understanding by people with little or no training in programming (nonprocedural), evolution (additions and deletions) without major logical restructuring, and flexibility of use of data (see Attribute, Domain, Relation, Tuple, Relational algebra).

Reliability: The network reliability is the probability that at any instant of time the network is connected.

Selection: In APPENDIX B, relation R2' is a restriction of relation R2 based upon the clause TEST=GLUCOSE.

Semantic: Relating to the meaning of data.

Semi-join: The semi-join of relation R by relation S on attributes X, denoted by R<X]S, is defined as (R[X]S)[AttR] (see Projection, Join), where AttR denotes the attributes of R. If R and S are stored at different sites, R<X]S can be obtained by moving S[X] to the site that holds R. In APPENDIX B, relation R12' is the semi-join of relation R1" by relation R2'.

Site: A node of a computer network. A site corresponds to a computing element, also called a host computer.

205

Storage redundancy: The duplication of identical data at one site is not desirable if it leads to extraneous information at that site. Storage redundancy at one site indicates that data can be eliminated at that site without any loss of information, thus reducing storage cost. On the other hand, storage redundancy across the network (see Redundancy) increases parallelism and decreases communication.

Synthesizing: Constructing relational tables (in general, logical schemata) from a set of functional dependencies (see Functional dependency, Relation).

Tuple: A list of data values corresponding to a row of a relation. Each component of the tuple is a value extracted from the domain of a single column-attribute of the relation. In APPENDIX B, any row of relation R1 is a tuple of R1 (see Attribute, Domain, Relation).

Update: A transaction on a database that involves some writing (or deleting) operations. Data item values are added to (or erased from) the database.

View: The data needs of an organization or a user. In this work, a view is a set of functional dependencies.
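The grouping step at the heart of synthesis (cf. [BERN76]) can be sketched as follows. This is a deliberate simplification: a full synthesizer first computes a minimum cover and may add a relation for a key, but the core idea is that FDs sharing a left side are merged into one schema.

```python
def synthesize(fds):
    """Group FDs by left side; each group becomes one relation schema."""
    schemas = {}
    for lhs, rhs in fds:
        # set(lhs) copies the left side so the input FDs are not mutated.
        schemas.setdefault(frozenset(lhs), set(lhs)).update(rhs)
    return schemas

F = [({'ORDER'}, {'SERVICE'}),
     ({'ORDER'}, {'RESULT'}),
     ({'SERVICE'}, {'PHYSICIAN'})]
schemas = synthesize(F)
# Two schemas result: one over {ORDER, SERVICE, RESULT}, keyed by ORDER,
# and one over {SERVICE, PHYSICIAN}, keyed by SERVICE.
```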

206

APPENDIX B

THE RELATIONAL DATA MODEL

A RELATIONAL DATABASE EXAMPLE: HOSPITAL DATABASE FOR BLOOD TESTS

PATIENT_NO  NAME       AGE  WARD
4234        Stout      63   General Med.
5621        Rohkemper  34   Outpatient
1213        Sawdon     46   Outpatient
7784        Litkovitk  58   General Med.
2442        Kraphol    55   Intensive Care
3001        Buck       61   Intensive Care
6886        Rowlands   62   General Med.
2444        Kostedt    42   Ambulatory
3457        Rogulsky   53   General Med.
1972        Winter     39   Ambulatory

Relation R1

207

ORDER  NAME       TEST       RESULT  PHYSICIAN
01     Rohkemper  Glucose    70      Dr Newberry
03     Stout      Sodium     112     Dr Farah
05     Rohkemper  Potassium  2.5     Dr Akers
07     Kraphol    Glucose    50      Dr Morehouse
08     Kraphol    Sodium     160     Dr Morehouse
10     Rowlands   Glucose    80      Dr Jagiela
11     Rowlands   Chloride   25      Dr Jagiela
15     Winter     Glucose    100     Dr Yee
16     Stout      Glucose    60      Dr Andersland
17     Stout      Chloride   20      Dr Andersland

Relation R2

NAME       WARD
Stout      General Med.
Rohkemper  Outpatient
Sawdon     Outpatient
Litkovitk  General Med.
Kraphol    Intensive Care
Buck       Intensive Care
Rowlands   General Med.
Kostedt    Ambulatory
Rogulsky   General Med.
Winter     Ambulatory

R1' = R1[NAME,WARD]

208

PATIENT_NO  NAME      AGE  WARD
4234        Stout     63   General Med.
3001        Buck      61   Intensive Care
6886        Rowlands  62   General Med.

R1" = R1[AGE > 60]

ORDER  NAME      TEST     RESULT  PHYSICIAN
07     Kraphol   Glucose  50      Dr Morehouse
10     Rowlands  Glucose  80      Dr Jagiela
15     Winter    Glucose  100     Dr Yee
16     Stout     Glucose  60      Dr Andersland

R2' = R2[TEST = Glucose]

209

NAME      AGE  WARD          ORDER  TEST     RESULT  PHYSICIAN
Stout     63   General Med.  16     Glucose  60      Dr Andersland
Rowlands  62   General Med.  10     Glucose  80      Dr Jagiela

R12 = (R1"[NAME]R2')[NAME,AGE,WARD,ORDER,TEST,RESULT,PHYSICIAN]

PATIENT_NO  NAME      AGE  WARD
4234        Stout     63   General Med.
6886        Rowlands  62   General Med.

R12' = R1"<NAME]R2'
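The derivations in this appendix can be replayed mechanically. Below is an illustrative sketch (the tuple representation is our own; the data are copied from relations R1 and R2) computing R1", R2' and the semi-join R12':

```python
# (PATIENT_NO, NAME, AGE, WARD) -- relation R1
R1 = [(4234, 'Stout', 63, 'General Med.'), (5621, 'Rohkemper', 34, 'Outpatient'),
      (1213, 'Sawdon', 46, 'Outpatient'), (7784, 'Litkovitk', 58, 'General Med.'),
      (2442, 'Kraphol', 55, 'Intensive Care'), (3001, 'Buck', 61, 'Intensive Care'),
      (6886, 'Rowlands', 62, 'General Med.'), (2444, 'Kostedt', 42, 'Ambulatory'),
      (3457, 'Rogulsky', 53, 'General Med.'), (1972, 'Winter', 39, 'Ambulatory')]
# (ORDER, NAME, TEST) -- relation R2; RESULT and PHYSICIAN omitted here
R2 = [('01', 'Rohkemper', 'Glucose'), ('03', 'Stout', 'Sodium'),
      ('05', 'Rohkemper', 'Potassium'), ('07', 'Kraphol', 'Glucose'),
      ('08', 'Kraphol', 'Sodium'), ('10', 'Rowlands', 'Glucose'),
      ('11', 'Rowlands', 'Chloride'), ('15', 'Winter', 'Glucose'),
      ('16', 'Stout', 'Glucose'), ('17', 'Stout', 'Chloride')]

R1_sel = [t for t in R1 if t[2] > 60]            # R1"  = R1[AGE > 60]
R2_sel = [t for t in R2 if t[2] == 'Glucose']    # R2'  = R2[TEST = Glucose]
names = {t[1] for t in R2_sel}                   # R2'[NAME]: all that must travel
R12_semi = [t for t in R1_sel if t[1] in names]  # R12' = R1" <NAME] R2'
print([t[1] for t in R12_semi])  # ['Stout', 'Rowlands']
```

Only the name set R2'[NAME] crosses site boundaries, yet R12' retains exactly the R1" tuples (Stout and Rowlands) that contribute to the full join R12.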

210

APPENDIX C

THE MEDICAL RECORD DATABASE PROBLEM

A large health organization possesses four geographically separated sites: a main hospital with its own laboratory, a clinical laboratory that can provide additional products and services, a center for medical scientific research, and a department of health technology which comprises a statistical center and a section for general medical organization and development. The four sites of this organization are equipped with their individual computing facilities and have permanent access to a public, fully connected local communication network. The staff members at the four sites share substantial information about the medical records of the patients in the hospital. For that reason, the organization plans to design an integrated distributed database, which will be referred to as the medical record database (MRDB).

211

User-provided FDs for MRDB

ORDER → SERVICE
ORDER → RESULT
ORDER → TEST
ORDER → PATIENT_NO
ORDER → DATE
ORDER → NAME
ORDER,NAME → RESULT
TEST → TEST_TYPE,CODE_NO,TECHNIQUE
TEST,DATE,PATIENT_NO,NAME → RESULT
RESULT → TOTAL,PRECISION,ACCURACY,COST,TIME
PATIENT_NO → NAME
PATIENT_NO,NAME → AGE
PATIENT_NO → AGE
SERVICE → PHYSICIAN
SERVICE → WARD
PHYSICIAN → WARD
NAME → WARD

Table C.1. The user-provided functional dependencies (FDs).

212

Typical scheduled queries for MRDB

Query i.d.  Given attributes      Target attributes
1           ORDER                 PHYSICIAN,WARD
2           ORDER                 RESULT
3           DATE,PATIENT_NO,TEST  ORDER
4           WARD                  PATIENT_NO
5           ORDER                 NAME,WARD
6           PHYSICIAN             ORDER,PATIENT_NO
7           NAME,DATE,TEST        RESULT
8           ORDER                 SERVICE
9           PHYSICIAN             SERVICE,PATIENT_NO

Typical scheduled updates

Number  New attributes
1       ORDER,TEST,RESULT,DATE
2       PATIENT_NO,NAME,AGE

Table C.2. Typical scheduled transactions for the MRDB.
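As a sanity check on Tables C.1 and C.2, the targets of query 1 are derivable from its given attribute by chasing ORDER → SERVICE, SERVICE → PHYSICIAN and SERVICE → WARD. A small attribute-closure sketch (our own code; the FDs below are a subset copied from Table C.1):

```python
FDS = [({'ORDER'}, {'SERVICE'}), ({'ORDER'}, {'RESULT'}),
       ({'ORDER'}, {'TEST'}), ({'ORDER'}, {'PATIENT_NO'}),
       ({'ORDER'}, {'DATE'}), ({'ORDER'}, {'NAME'}),
       ({'PATIENT_NO'}, {'NAME'}), ({'PATIENT_NO'}, {'AGE'}),
       ({'SERVICE'}, {'PHYSICIAN'}), ({'SERVICE'}, {'WARD'}),
       ({'PHYSICIAN'}, {'WARD'}), ({'NAME'}, {'WARD'})]

def closure(attrs, fds):
    """Closure of an attribute set under FDs given as (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# Query 1: given ORDER, targets PHYSICIAN and WARD are both derivable.
print({'PHYSICIAN', 'WARD'} <= closure({'ORDER'}, FDS))  # True
```

Note that not every query is an FD derivation: for query 4, PATIENT_NO is not in the closure of WARD, so that query is answered by selection (all patients in a ward) rather than by functional derivation.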

213

Usage of MRDB transactions

Site  Queries (index x frequency)        Updates (index x frequency)
1     1 x 2, 2 x 1, 3 x 1, 5 x 1, 8 x 1  none
2     1 x 1, 3 x 2, 9 x 1                none
3     5 x 3, 7 x 4, 8 x 4                none
4     4 x 1, 6 x 1, 9 x 1, 7 x 3         1 x 1, 2 x 2

Table C.3. Needs of each site in the MRDB computer network, and frequencies of occurrence of all the transactions.

214

FD      tuples  bytes
f1,b    5075    9
f1,c    5075    14
f1,d    5075    8
f1,e    5075    12
f8      49      20
f10,k   193     20
f11,g   5075    44
f12,i   84      19
f12,j   84      6
f13,k   810     20

Table C.4. Cardinalities and widths of the functional dependencies for the MRDB.

215

Table C.5. Cardinalities of the functional dependencies, for the MRDB, when projected on attributes. Each entry gives |f[A]|, the number of tuples of FD instance f projected on attribute A, for A among ORDER, SERVICE, TEST, PATIENT_NO, DATE, PHYSICIAN, RESULT, NAME, AGE and WARD.

216

A            bytes
ORDER        4
SERVICE      5
TEST:
  TEST_TYPE  3
  CODE_NO    4
  TECHNIQUE  3
RESULT:
  NUMERIC    10
  PRECISION  10
  ACCURACY   10
  COST       5
  TIME       5
PATIENT_NO   4
DATE         8
NAME         15
AGE          2
PHYSICIAN    15
WARD         5

Table C.6. Attribute widths for the MRDB.
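The byte widths of Table C.4 should be the sums of the widths of the attributes involved in each FD. A quick cross-check (our own arithmetic; associating the results with particular rows of Table C.4 is our reading of its labels):

```python
# Attribute widths copied from Table C.6.
WIDTH = {'ORDER': 4, 'SERVICE': 5, 'PATIENT_NO': 4, 'DATE': 8,
         'NAME': 15, 'AGE': 2, 'PHYSICIAN': 15, 'WARD': 5}

def fd_width(lhs, rhs):
    """Width in bytes of one tuple of the FD instance lhs -> rhs."""
    return sum(WIDTH[a] for a in lhs) + sum(WIDTH[a] for a in rhs)

print(fd_width(['ORDER'], ['SERVICE']))  # 9, cf. the 9-byte FD of Table C.4
print(fd_width(['ORDER'], ['DATE']))     # 12, cf. the 12-byte FD of Table C.4
```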

217

A           |DOM(A)|
ORDER       5075
SERVICE     48
TEST        76
RESULT      6561
PATIENT_NO  126
DATE        632
NAME        126
AGE         45
PHYSICIAN   28
WARD        10

Table C.7. Cardinalities of the domains for the MRDB.

218

APPENDIX D

INPUT FOR THE DISTRIBUTION ALGORITHMS

Number of sites: 4
Number of CRDUs: 13

Uncapacitated

User site  Query index  Query frequency
ℓ ∈ N      k            vℓk
1          2            1
1          8            1
3          8            4
4          4            1
1          1            2
1          5            1
2          1            1
3          5            3
3          7            4
4          6            1
4          7            3
1          3            1
2          3            2
2          9            1
4          9            1

219

Query  Partition  Cluster  # of FDs  FD 1  FD 2  FD 3
k      p          q
1      1          1        2         1     2     0
1      2          1        1         1     0     0
1      2          2        1         2     0     0
2      1          1        1         3     0     0
3      1          1        3         4     5     6
3      5          1        1         4     0     0
3      5          2        1         5     0     0
3      5          3        1         6     0     0
3      3          1        2         4     5     0
3      3          2        1         6     0     0
3      4          1        1         4     0     0
3      4          2        2         5     6     0
3      2          1        2         4     6     0
3      2          2        1         5     0     0
4      1          1        1         7     0     0
5      2          1        1         8     0     0
5      2          2        1         9     0     0
5      1          1        2         8     9     0
6      2          1        1         1     0     0
6      2          2        1         6     0     0
6      1          1        2         1     6     0
7      2          1        1         10    0     0
7      2          2        1         11    0     0
7      1          1        2         10    11    0
8      1          1        1         12    0     0
9      1          1        3         13    12    6
9      5          1        1         13    0     0
9      5          2        1         12    0     0
9      5          3        1         6     0     0
9      3          1        2         13    12    0
9      3          2        1         6     0     0
9      4          1        1         13    0     0
9      4          2        2         12    6     0
9      2          1        2         13    6     0
9      2          2        1         12    0     0

220

User site  Update index  Update frequency
ℓ          k             vℓk
1          1             0
1          2             0
2          1             0
2          2             0
3          1             0
4          1             1
3          2             0
4          2             2

Update index  FD  Update volume
k                 Bk
1             1   500
1             3   500
1             4   500
1             5   500
1             6   500
1             8   700
1             10  700
1             12  700
2             6   450
2             7   450
2             8   450
2             9   450
2             10  900
2             11  900

221

Query  Partition  Cluster  Cluster  Data volume  Noncritical
k      p          q        r        Akp,qr       (1 = yes, 0 = no)
1      1          1        1        450          1
1      2          1        1        400          1
1      2          1        2        5            0
1      2          2        1        4            1
1      2          2        2        10           1
2      1          1        1        1000         1
3      1          1        1        100          1
3      5          1        1        300          1
3      5          1        2        0            0
3      5          1        3        0            1
3      5          2        1        0            1
3      5          2        2        400          1
3      5          2        3        0            1
3      5          3        1        100          0
3      5          3        2        100          1
3      5          3        3        100          1
3      3          1        1        300          1
3      3          1        2        0            1
3      3          2        1        100          0
3      3          2        2        100          1
3      4          1        1        250          1
3      4          1        2        0            1
3      4          2        1        100          0
3      4          2        2        100          0
3      2          1        1        100          1
3      2          1        2        100          1
3      2          2        1        0            1
3      2          2        2        400          1
4      2          1        1        10000        1
5      2          1        1        0            1
5      2          1        2        100          0
5      2          2        1        0            1
5      2          2        2        100          1
5      1          1        1        100          1
6      2          1        1        0            1
6      2          1        2        10000        0

222

Query  Partition  Cluster  Cluster  Data volume  Noncritical
k      p          q        r        Akp,qr       (1 = yes, 0 = no)
6      2          2        1        0            1
6      2          2        2        15000        1
6      1          1        1        15000        1
7      2          1        1        100          1
7      2          1        2        0            0
7      2          2        1        100          1
7      2          2        2        0            1
7      1          1        1        100          1
8      1          1        1        150          1
9      1          1        1        100          1
9      5          1        1        0            1
9      5          1        2        200          1
9      5          1        3        0            0
9      5          2        1        0            1
9      5          2        3        10000        1
9      5          3        1        0            1
9      5          3        2        100          1
9      5          3        3        100          1
9      3          1        1        100          1
9      3          1        2        10000        1
9      3          2        1        100          0
9      3          2        2        100          1
9      4          1        1        0            1
9      4          1        2        200          0
9      4          2        1        0            1
9      4          2        2        200          1
9      2          1        1        200          1
9      2          1        2        200          0
9      2          2        1        10000        1
9      2          2        2        0            1

223

For the capacitated model:

[cij]: (N x N) matrix, for all (i,j) ∈ A.
[Kij]: (N x N) matrix, for all (i,j) ∈ A.

For both the capacitated and uncapacitated models:

[Nh]: (1 x |H|) matrix: (6 4 1 10 10 22 14 4 4 4 11 10)

224

APPENDIX E

OPTIMAL SOLUTION FOR THE MRDB

(Corresponding to the inputs of Appendix D)

All the following variables are equal to one (all the other variables are equal to zero):

x 2,1 X 2,2 X 1,1 IrI 1,1 11 X1 2,1 1,il 7 1,1,29 1,2,1 7 1,3,4 1, I,4 X1 4, 1,8,4 2,1,I9 X2,2 X~ 1,1 -,I X 2,1 11 12 X 1,1 1,1 2,1,2, 2,3,4 2,9,4, 3,5,4 3,Y,4 3i,4 3,8,4 4,4,4 x 2,1 X 1,1 X 2,1 yl y3 y2 yl y4 y5 4,6,4, 4,9,4 4,4 1 1 2 4 4 4, y 6 y7 y 8 y 9 y 10 yl y 12 y 13 y 2,, 6 4 4 4 4 4 4 4 1,1 XI] 1 XII X I X X2 XI X I X2 X1 1,2 1,3 1,5, 1,8, 2,1, 2,3, 2,9, 3,5, 3,7, Xl f X~r1 12 X1 X2 3,8 7 4,4 4,B, 4,9 X 4,7.

BIBLIOGRAPHY 225

226

BIBLIOGRAPHY

[AHO77] Aho, A. V., Beeri, C., and Ullman, J. D., "The theory of joins in relational databases", Proc. IEEE Symp. on Foundations of Computer Science, 1977.

[APER81] Apers, P. M. G., "Redundant allocation of relations in a communication network", Proc. of the 5th Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., 1981.

[APER83] Apers, P. M. G., Hevner, A. R., and Yao, S. B., "Optimization algorithms for distributed queries", IEEE Trans. on Software Engin., Vol. SE-9, No 1, Jan. 1983.

[FEDA81] Al-Fedaghi, S., and Scheuermann, P., "Mapping considerations in the design of schemas for the relational model", IEEE Trans. on Soft. Eng., Vol. SE-7, No 1, Jan. 1981.

[ARMS74] Armstrong, W. W., "Dependency structures of data base relationships", Information Processing 74, North-Holland Publ. Co., Amsterdam, 1974, pp 580-583.

[BEER80] Beeri, C., "On the membership problem for functional and multivalued dependencies in relational databases", ACM TODS, Vol. 5, No 3, Sept. 1980.

[BEER81] Beeri, C., Fagin, R., Maier, D., Mendelzon, A., Ullman, J., and Yannakakis, M., "Properties of acyclic database schemes", Proceedings of the ACM Symposium on Principles of Database Systems, 29-31 March 1982.

[BERN76] Bernstein, P. A., "Synthesizing third normal form relations from functional dependencies", ACM TODS, Vol. 1, No 4, Dec. 1976.

[BISK79] Biskup, J., Dayal, U., and Bernstein, P. A., "Synthesizing independent database schemas", ACM SIGMOD International Conference on the Management of Data, 1979.

[BOOT81] Booth, G. M., The Distributed System Environment: Some Practical Approaches, McGraw-Hill, 1981.

227

[BUCK79] Buckles, B. P., and Hardin, D. M., "Partitioning and allocation of logical resources in a distributed computing environment", Distributed System Design, Tutorial, IEEE Catalog No EHO261-1, 1979.

[CASE72] Casey, R. G., "Allocation of copies of a file in an information network", AFIPS Conference Proceedings, Vol. 41, Part I, 1972.

[CHAN77] Chandy, K. M., "Models of distributed systems", Proceedings VLDB Conference, Oct. 1977.

[CHEN80] Chen, P. P.-S., and Akoka, J., "Optimal design of distributed information systems", IEEE Trans. on Computers, Vol. C-29, No 12, Dec. 1980.

[CHU73] Chu, W. W., "Optimal file allocation in a computer network", Computer-Communications Networks, N. Abramson and F. Kuo (eds.), Prentice-Hall, Englewood Cliffs, N.J., 1973.

[CHU79] Chu, W. W., and Hurley, P., "A model for optimal query processing for distributed databases", Digest of Papers, COMPCON, Spring 1979.

[CHU76] Chu, W. W., "Performance of file directory systems for databases in star and distributed networks", AFIPS Conference Proceedings, Vol. 45, 1976 NCC.

[CHUN83] Chung, C.-W., A Query Optimization in Distributed Database Systems, PhD thesis, The University of Michigan, 1983.

[CODD79] Codd, E. F., "Extending the database relational model to capture more meaning", ACM SIGMOD International Conf. on Management of Data, p. 161, 1979.

[CODD72] Codd, E. F., "Further normalization of the data base relational model", Data Base Systems, R. Rustin (ed.), Prentice-Hall, 1972.

[CCA80] Computer Corporation of America, A Distributed Database Management System for Command and Control Applications, Technical Report CCA-80-04, Jan. 30, 1980.

[DATE78] Date, C. J., An Introduction to Database Systems, Addison-Wesley, 1978.

[DELO78] Delobel, C., "Normalization and hierarchical dependencies in the relational data model", ACM TODS, Vol. 3, No 3, Sep. 1978.

[DORA77] Doray, M., "Le teletraitement" [Teleprocessing], Societe Nationale des Chemins de Fer, Service Informatique (Paris), Division IAT, May 1977.

228

[ESWA74] Eswaran, K. P., "Placement of records in a file and file allocation in a computer network", Information Processing 74, IFIPS, North-Holland Publishing Co., 1974.

[ETCH77] Etcheberry, J., "The set-covering problem: A new implicit enumeration algorithm", Operations Research, Vol. 25, No 5, Sept.-Oct. 1977.

[FAGI77] Fagin, R., "Multivalued dependencies and a new normal form for relational databases", ACM TODS, Vol. 2, No 3, Dec. 1977.

[FAGI82] Fagin, R., Mendelzon, A. O., and Ullman, J. D., "A simplified universal relation assumption and its properties", ACM TODS, Vol. 7, No 3, Sep. 1982.

[FISH80] Fisher, M. L., and Hochbaum, D. S., "Database location in computer networks", Journal of the ACM, Vol. 27, No 4, Oct. 1980.

[FRY78] Fry, J. P., and Teorey, T. J., "Design and performance tools for improving database usability and responsiveness", Databases: Improving Usability and Responsiveness, Academic Press, 1978.

[GARE79] Garey, M. R., and Johnson, D. S., Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, 1979.

[HAMM79] Hammer, M., and Niamir, B., "A heuristic approach to attribute partitioning", Proc. ACM SIGMOD Int. Conf. on the Management of Data, May 1979, pp 93-101.

[HEVN79] Hevner, A. R., and Yao, S. B., "Query processing in distributed database systems", IEEE Trans. on Soft. Engin., Vol. SE-5, No 3, May 1979.

[HIMM72] Himmelblau, D. M., Applied Nonlinear Programming, McGraw-Hill, 1972.

[HOLL75] Holland, J. H., Adaptation in Natural and Artificial Systems, The University of Michigan Press, 1975.

[HOUS79] Housel, B. C., Waddle, V., and Yao, S. B., "The functional dependency model for logical database design", IEEE, 1979.

[KAMB78] Kambayashi, Y., "Equivalent key problem of the relational database model", Mathematical Studies of Information Processing, Proceedings, Kyoto, Japan, 1978, E. K. Blum, M. Paul and S. Takasu (eds.), Springer-Verlag.

[KHAB80] Khabbaz, N. G., A Combined Communication Network Design and File Allocation for Distributed Databases, PhD thesis, The University of Michigan, Feb. 1980.

229

[LAWL76] Lawler, E., Combinatorial Optimization, Holt, Rinehart and Winston, 1976.

[LEVI75] Levin, K. D., and Morgan, H. L., "Optimizing distributed databases: A framework for research", AFIPS Conference Proceedings, Vol. 44, 1975 NCC.

[LOZI78] Lozinskii, E. L., "Performance considerations in relational database design", Databases: Improving Usability and Responsiveness, Academic Press, 1978.

[LOZI80] Lozinskii, E. L., "Construction of relations in relational databases", ACM TODS, Vol. 5, No 2, June 1980.

[MAHM76] Mahmoud, S., and Riordon, J. S., "Optimal allocation of resources in distributed information networks", ACM TODS, Vol. 1, No 1, Mar. 1976.

[MAIE80] Maier, D., "Minimum covers in the relational database model", Journal of the ACM, Vol. 27, No 4, Oct. 1980.

[MAIE83] Maier, D., and Ullman, J. D., "Maximal objects and the semantics of universal relation databases", ACM TODS, Vol. 8, No 1, Mar. 1983.

[MORG77] Morgan, H. L., and Levin, K. D., "Optimal program and data locations in computer networks", Communications of the ACM, Vol. 20, No 5, May 1977.

[MURT76] Murty, K., Linear and Combinatorial Programming, John Wiley and Sons, 1976.

[MURT81] Murty, K., Network Flow Algorithms, lecture notes, Department of Industrial Engineering and Operations Research, The University of Michigan, 1981.

[MUR77] Murtagh, B. A., and Saunders, M. A., MINOS: A Large-Scale Nonlinear Programming System, User's Guide, Technical Report SOL 77-9, Feb. 1977, Systems Optimization Lab., Department of Operations Research, Stanford University.

[MUR78] Murtagh, B. A., and Saunders, M. A., "Large scale linearly constrained optimization", Mathematical Programming, Vol. 14, pp. 41-72, 1978.

[NAVA76] Navathe, S. B., and Fry, J. P., "Restructuring for large databases: Three levels of abstraction", ACM TODS, Vol. 1, No 2, Jun. 1976.

[PALM79] Palmer, D. F., "Distributed computing system design at the subsystem/network level", Conf. VLDB, IEEE, 1979.

[RAM79-a] Ramamoorthy, C. V., and Wah, B. W., "Data management in distributed data bases", Proc. National Conf., AFIPS Press, 1979.

230

[RAM79-b] Ramamoorthy, C. V., and Wah, B. W., "The placement of relations on a distributed relational database", The 1st International Conference on Distributed Computing Systems, Huntsville, Alabama, Oct. 1-5, 1979.

[RISS77] Rissanen, J., "Independent components of relations", ACM TODS, Vol. 2, No 4, Dec. 1977.

[ROSE81] Rosenthal, A. S., "Note on the expected size of a join", SIGMOD Record, Vol. 11, No 4, Jul. 1981.

[SALK75] Salkin, H. M., Integer Programming, Addison-Wesley, 1975.

[TANE81] Tanenbaum, A. S., Computer Networks, Prentice-Hall, 1981.

[ULLM80] Ullman, J. D., Principles of Database Systems, Computer Science Press, 1980.

[ULLM82] Ullman, J. D., "The U.R. strikes back", Proceedings of the ACM Symp. on Principles of Database Systems, Los Angeles, California, March 1982.

[WANG75] Wang, C. P., and Wedekind, H. H., "Segment synthesis in logical database design", IBM J. Res. Dev., Vol. 19, No 1, Jan. 1975.

[WONG77] Wong, E., "Retrieving dispersed data from SDD-1: A system for distributed databases", Proc. of the 2nd Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., May 1977.

[WONG81] Wong, E., "Dynamic re-materialization: Processing distributed queries using redundant data", Proc. of the 5th Berkeley Workshop on Distributed Data Management and Computer Networks, Lawrence Berkeley Lab., 1981.