Division of Research
Graduate School of Business Administration
The University of Michigan

November, 1980

A PRACTICAL GUIDE TO THE DESIGN OF DIFFERENTIAL FILES
FOR RECOVERY OF ONLINE DATABASES

Working Paper No. 239

Houtan Aghili and Dennis G. Severance*
The University of Michigan

* This work was supported by the National Science Foundation under Grant MCS-78-26597 and by the David Taylor Naval Ship Research and Development Center under Contract N00167-97-G0002.

ABSTRACT

The concept of a differential file has previously been proposed as an efficient means of collecting database updates for online systems. This paper studies the problem of database backup and recovery for such systems. An analytic model of the operations is presented. Five key design decisions are identified and an optimization procedure for each is developed. A design procedure is presented which quickly provides parameters for a near-optimal differential file architecture on the basis of a series of table look-ups.

Key Words and Phrases: backup and recovery, database maintenance, differential files, hashing, numerical methods, optimization, reorganization

CR Categories: 3.73, 3.80, 4.33, 5.0

OUTLINE

1. INTRODUCTION
2. A QUANTITATIVE MODEL OF DIFFERENTIAL FILE OPERATION
   2.1 Previous Analysis of Operational Characteristics
   2.2 Backup, Recovery, and Reorganization Procedures
   2.3 Accounting for Costs
3. MINIMIZING TOTAL OPERATING COSTS
   3.1 Designing the Search Filter
   3.2 Selecting an Optimal Differential File Dumping Policy
   3.3 Selecting the Frequency of Main Data File Backup
   3.4 Selecting the Reorganization Interval
4. AN EXAMPLE OF HEURISTIC DESIGN
5. SUMMARY
6. APPENDIX
REFERENCES


1. INTRODUCTION

Corporations have increasingly come to depend upon the continuous availability of online databases in support of such critical business functions as order entry, inventory control, credit verification, and customer service. Loss of such systems for even short periods of time can create havoc within the business and may result in significant financial loss. As a result, rapid database recovery is an important issue in the design of these systems. It has been argued that database recovery might be speeded considerably by holding updates in a separate file of changes called a differential file [8,14]. A practitioner considering this idea, however, has little specific guidance available for designing such an architecture as an alternative to a conventional update inplace strategy. This paper analyzes the backup and recovery operations required for an online database which employs a differential file and provides useful guidelines for selecting an efficient set of operating parameters. The potential for a dramatic improvement in recovery speed at a slight increase in operating cost is illustrated.

There are two basic causes of data loss in online systems: (1) partial completion of update operations caused by program or system failures, which renders parts of the

database inaccurate or inaccessible, and (2) physical destruction of storage media, which renders all or part of a database unreadable. Canning [6] provides insights into the nature and impact of such data losses and an overview of techniques used in commercial database systems to recover from them.

A simple and widely used database recovery procedure involves a periodic dumping of the database to a backup file. Once the dump is taken, all processed transactions are dated and recorded on a transaction log so that in the event of a data loss the latest dump may be recopied and all transactions reapplied. Recopying of the latest dump is generally quite fast, requiring as little as a few minutes for an entire disk pack. The reprocessing of transactions, however, may require several hours if the time since the dump has been long. Sayani [11] and Lohman and Muckstadt [8] study the trade-offs between dumping cost and recovery speed by means of an analytic model which provides guidance in selecting an efficient dumping frequency.

Database recovery can be speeded considerably with the use of an after-image log, in which a sequential file is used to store an identified and dated copy of each new or changed database page, record, or record segment at the time that it is modified. With this file, transaction

reprocessing is not required once the dump is restored. Rather, the log is sorted by identifier and date, and the latest version of each modified record is selected and written directly into the database.

For large databases with moderate or naturally concentrated update activity, differential files offer an alternative strategy for rapid backup and recovery. A differential file isolates a database from physical change by directing all new and modified records onto a separate and relatively small file of changes. Since the main file is never changed, it can always be recovered quickly from its dump in the event of a loss. Transaction reprocessing is required only in the event of damage to the differential file. Since this file is small, it can be backed up quickly and frequently to minimize the reprocessing time. It can also be duplexed at reasonable cost as insurance against physical damage to one of the copies.

While differential files offer a number of other operational advantages [14], this paper focuses exclusively upon their value in speeding backup and recovery operations for online databases. Specifically, we will analyze these operations to establish the frequency with which a main data file and its differential file should be subjected to

backup, as well as the frequency with which they should be reorganized into a new main database. An analytic cost model for this purpose is developed in Section 2. Its solution in Section 3 leads naturally to a series of tables which enable a designer to quickly determine a near-optimal differential file architecture for a typical operating environment. For environments which differ substantially, a FORTRAN program is provided in the Appendix as an alternative means of developing an efficient design.

2. A QUANTITATIVE MODEL OF DIFFERENTIAL FILE OPERATION

2.1 Previous Analysis of Operational Characteristics

Consider a database of N records. Assume that updates are independent and uniformly distributed over all records. Severance and Lohman [14] showed that the expected proportion of distinct main file records, R_n, which are updated after n updates have been received is given by

    R_n = 1 - (1 - 1/N)^n ≈ 1 - e^(-n/N), for large N.     (1)

For n ranging from 0 to N, Figure 1 depicts the growth over time of n/N and R_n. Respectively, these curves characterize the size of a differential file which maintains every record change and one which maintains only the most recent image of each changed record.
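The form of (1) is easily checked: each update independently strikes any given record with probability 1/N, so a record remains untouched after n updates with probability (1 - 1/N)^n ≈ e^(-n/N). As a purely illustrative calculation (our arithmetic, not a model result): for N = 10^6 records and n = 10^5 updates, R_n ≈ 1 - e^(-0.1) = 0.095, so a history-keeping differential file holds 100,000 change records while one keeping only latest images holds about 95,000.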

[Figure 1 - General Shape of n/N and R_n (horizontal axis: updates per main file record)]

To assure access to the most current image of a record, one should search the differential file for every record retrieval. If the required record is not found there, then the original copy is accessed from the main database. In general, both the main database and the differential file utilize some form of index to speed these searches [12,13]. Unsuccessful searches of the differential file can be largely eliminated by use of a presearch filtering

mechanism. The filtering scheme, devised originally by Bloom [5], associates the differential file with a main memory bit vector of length M, and some number X of hashing functions which map record identifiers into bit addresses. When the differential file is initially empty, all bits in the filter are set to 0. Thereafter, whenever a record is stored in the differential file, each hashing transformation is applied to its identifier and each of the selected bits is set to 1.

Retrieval of database records now proceeds as follows. The identifier of the record to be retrieved is mapped through each transformation. If any addressed bit is 0, it is certain that the most recent version of the record still resides in the main data file; the differential file search is skipped and the main data file is accessed immediately. If all bits have value 1, then an updated copy of the record is likely to be found in the differential file, which is therefore searched. There is a possibility that this search will prove fruitless, since the bits associated with a given identifier might be set to 1 coincidentally by mappings from other updated records. Only in the event of such a filtering error are both files searched during a record retrieval.

Severance and Lohman [14] show that the probability, P_n, of a filtering error after the accumulation of n updates in the differential file is the product of the probability

that the required record is yet unchanged and the probability that all bits addressed have been previously set to 1; that is,

    P_n(X,M) = e^(-n/N) (1 - e^(-nX/M))^X.     (2)

Figure 2 illustrates the general shape of P_n(X,M) for different values of X. This function appears as one component of the analytic cost model for differential file operations developed in the next section.

[Figure 2 - General Shape of P_n(X,M) for X = 1, 2, 3 (horizontal axis: updates per main file record)]
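To make the filtering mechanism concrete, the short FORTRAN sketch below builds and probes a Bloom filter. It is purely illustrative: the record identifiers, the vector size M = 1024, and the simple modular hashing functions are our own assumptions, not the transformations analyzed above.

      PROGRAM BLOOMF
C     A minimal sketch of Bloom's presearch filter: a bit vector
C     of length M and X hashing functions.  The modular hashes
C     below are illustrative stand-ins only.
      INTEGER X
      PARAMETER (M = 1024, X = 3)
      LOGICAL BITVEC(M)
      INTEGER ID, K
C     Illustrative JK-th hashing function applied to identifier JD
      IHASH(JD,JK) = MOD(JD*(2*JK+1)+JK, M) + 1
C     Differential file initially empty: every filter bit is 0
      DO 10 K = 1,M
   10 BITVEC(K) = .FALSE.
C     Record 40867 is written to the differential file: set the
C     X bits addressed by its hashed identifier
      ID = 40867
      DO 20 K = 1,X
   20 BITVEC(IHASH(ID,K)) = .TRUE.
C     Retrieval of record 551: any 0 bit proves that the current
C     version of the record is still in the main data file
      ID = 551
      DO 30 K = 1,X
        IF (.NOT. BITVEC(IHASH(ID,K))) GOTO 40
   30 CONTINUE
      WRITE(6,*) 'SEARCH DIFFERENTIAL FILE FIRST'
      STOP
   40 WRITE(6,*) 'SKIP DIFFERENTIAL FILE; ACCESS MAIN FILE'
      STOP
      END

With the single record stored, the probe of identifier 551 finds a 0 bit on its first hash and skips the differential file -- exactly the fast path described above.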

2.2 Backup, Recovery, and Reorganization Procedures

We analyze a differential file operating environment which cycles through a series of backup and reorganization processes. Initially, the main database is dumped to a backup copy, the differential file is empty, and the Bloom filter is set to 0. As update transactions arrive over time, they are recorded in a (sequential) transaction file. Resulting updates are posted both in a (direct access) differential file and on a (sequential) after-image log. Appropriate filter bits are set and, finally, the successful completion of the transaction processing is noted in the transaction log. A detailed description of processes which utilize these data to recover from a variety of data losses and system failures is provided by Lohman [7].

As the after-image log grows, the time required for differential file recovery increases. Periodically, therefore, the cumulative effect of reposting these changes is preserved by dumping the current differential file to a backup copy. If the differential file grows sequentially, maintaining a history of all changes, then only the incremental portion of the file accumulated since the last dump is copied. The entire file is copied if an update inplace strategy is used.

As updates continue, the search filter becomes less

effective and the differential file grows toward the limit of its space allocation. A periodic reorganization therefore resets the bit vector to zero and empties the differential file by merging all changes into the main database. After one or more such reorganizations, the time required to recover the main database via after-image processing justifies another dump, and the entire cycle begins again.

Figure 3 illustrates the relationships among the various files and backup procedures and highlights five important parameters whose values must be selected during differential file system design:

    M: size of the Bloom filter bit vector,
    X: number of hashing transformations,
    R: number of updates before reorganization,
    D: number of differential file dumps during a reorganization interval, and
    B: number of reorganizations before the main database dump.

[Figure 3 - Interrelationship of Model Components and Decision Variables]

2.3 Accounting for Costs

Selection of efficient values for the five design parameters is affected by the 21 problem variables defined in Table 1. Typical values for each are given there and will be used for calculations in subsequent sections. There are nine major components of system cost to consider. An equation defining the expected cost per processed update for each is given below.

The expected main database recovery cost is the sum of the costs to recopy the latest dump and to repost from the after-image log all updates made prior to the last reorganization, and is given by

    C1 = (λ/μ)[rN + u'R(B-1)/2].     (3)

The expected differential file dump and recovery cost is the sum of the costs of all differential file dumps taken during a reorganization interval plus, in the event of loss, the expected cost to recopy the latest differential file dump and to repost all updates accumulated in the after-image log since that dump. A general analysis of this cost is provided in Section 3.2. In the case where no differential file dump is taken between reorganizations, the cost is given by

    C2 = (λ'/μ)u″(R-1)/2.     (4)

The expected differential file storage cost is, in general, given by

    C3 = (s″/μR) Σ (j=0 to R-1) E[differential file size | j updates].
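For a feel of the magnitudes involved (our arithmetic, with illustrative values): take the Table 1 entries N = 10^6, λ = λ' = 1/month, μ = 100 updates per minute, and w = 10 (so r = $0.005, u' = $0.1, u″ = $0.02), and assume R = 100,000 and B = 12. A month of 10-hour days contains 12,000 operating minutes, so λ/μ = 8.33x10^-7 failures per update. Then C1 = 8.33x10^-7 [0.005(10^6) + 0.1(10^5)(11)/2] ≈ $0.05 per update, while C2 with no intermediate dumps is 8.33x10^-7 (0.02)(10^5 - 1)/2 ≈ $0.0008: the main data file dominates the recovery provision.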

    Parameter  Description                                              Typical Value

    N          number of records in the database                        10^5 - 10^7
    μ          record updates per unit of time                          1-100 per minute
    ρ          read-only requests per unit of time                      1-100 per minute
    λ          rate of main database failures                           1-360 per year
    λ'         rate of differential file failures                       1-360 per year
    C0         set-up cost associated with dump and
                 reorganization operations                              $2.00
    d          cost of dumping one record from the main data file       $0.0005
    d'         cost of dumping one record from the differential file    $0.0005
    u          cost of posting an update to the main data file
                 during reorganization                                  $0.01
    a          cost of a record access from the main data file          $0.01
    a'         cost of a record access from the differential file       $0.01
    a″         cost of an unsuccessful search of the differential file  $0.01
    s          main storage cost per bit per unit of time               $1.7x10^-6 /bit/minute
    s'         cost of storing a record of the main data file
                 per unit of time                                       $3x10^-7 /record/minute
    s″         cost of storing a record of the differential file
                 per unit of time                                       $3x10^-7 /record/minute
    h'         cost of executing a hashing function in Bloom filter     $10^-6
    w          weighting factor reflecting relative importance of
                 recovery costs vis-a-vis other cost components         10
    r          cost of restoring a record of the main data file
                 from its dump                                          ($0.0005)w
    r'         cost of restoring a differential file record
                 from its dump                                          ($0.0005)w
    u'         cost of posting an update to the main data file
                 during recovery                                        ($0.01)w
    u″         cost of posting an update to the differential file
                 during recovery                                        ($0.002)w

    Table 1 - Problem Parameters

When the history of all database updates is stored sequentially in the differential file, this expression reduces to

    C3 = (s″/μ)(R-1)/2;     (5)

when only the latest version of each changed record is recorded, then

    C3 = (s″/μ)N[1 - (1 - e^(-R/N))/(R/N)].     (5-1)

The latter strategy of update inplace precludes incremental dumping of the differential file. Moreover, in our environment of independent, uniformly distributed updates, the space savings that it offers is not significant. (Specifically, differential files are found to lose much of their attractiveness at sizes above 10 to 20 percent of the main data file; and below this limit the storage costs of equations (5) and (5-1) are nearly equivalent.) We shall assume in this paper that the differential file is relatively small (R/N < 0.20), that it grows sequentially, and that C3 is given by equation (5). The reader interested in an analysis of the update inplace strategy for an environment in which updates cluster on a high-activity subset of records, e.g., airline reservation systems, is directed to Aghili and Severance [2].

The expected main database backup cost is given by the cost of a dump divided by the number of updates between dumps,

    C4 = (Nd)/(BR).     (6)

The expected reorganization cost is the ratio of the cost of posting the latest version of each updated record, plus the set-up cost C0, to the total number of updates made:

    C5 = [uN(1 - e^(-R/N)) + C0]/R.     (7)
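The near-equivalence claimed above for equations (5) and (5-1) below the 20 percent limit is easily verified (our arithmetic): at R/N = 0.1, equation (5) gives an average size of (R-1)/2 ≈ 0.050N records, while equation (5-1) gives N[1 - (1 - e^(-0.1))/0.1] = 0.048N records -- a difference of about 3 percent.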

The cost of unsuccessful differential file accesses for updates and read-only requests is given by

    C6 = (1 + ρ/μ)(a″/R) Σ (n=0 to R-1) P_n(X,M)
       = (1 + ρ/μ) a″ Σ (j=0 to X) C(X,j) (-1)^j [1 - e^(-jXR/M - R/N)] / [R(jX/M + 1/N)],     (8)

where C(i,k) denotes the number of k-combinations of i objects (C(i,k) = i!/(k!(i-k)!)).

The expected record access cost includes filtering costs plus the cost of either a differential file or a main data file access,

    C7 = (1 + ρ/μ)[h'X + a' + (a - a')N(1 - e^(-R/N))/R].     (9)

The cost of storing the bit vector is equal to

    C8 = (s/μ)M.     (10)

The main data file storage cost is fixed and given by

    C9 = (s'/μ)N.     (11)

Combining all of these cost components, the total expected cost per processed update transaction is given by

    C(X,M,R,B,D) = C1 + C2 + C3 + C4 + C5 + C6 + C7 + C8 + C9.     (12)

3. MINIMIZING TOTAL OPERATING COSTS

To design a differential file architecture for a given database, we select the combination of design parameter values which minimizes the total expected operating cost per processed update transaction over the operating cycle; that is,

    Minimize  C(X,M,R,D,B)
    Subject to:  R ≤ maximum allowable differential file growth,
                 R, B ≥ 1,
                 X, M, D ≥ 0,
                 R, B, X, M, D integers.

No exact optimization method exists to solve this nonlinear objective function in integer decision variables. In general, such problems require techniques such as Branch and Bound [2] to structure and search their large solution space, and numerical methods such as Pattern Search [4], Newton-Raphson [4], or Powell's method [9,10] to determine optimal parameter values. A precise methodology for the solution of the differential file design problem is presented in [1].

Experience with this precise analysis has yielded a number of useful insights into the nature of the optimization problem and the characteristics of its solution. These insights have led to the development of a much simpler

heuristic procedure which can quickly generate a near-optimal differential file architecture via a series of table look-ups. The latter design procedure is presented here.

The solution procedure effectively decomposes our design problem into three independent parts, as follows. First, a near-optimal number of hashing functions (X*) and bit vector size (M*) is computed as a function of the reorganization interval, R. Values for this function are presented in tabular form. Next, the optimal number of differential file dumps (D*) and the optimal number of reorganizations between main data file dumps (B*) are computed as closed-form functions of the reorganization interval, R. Finally, the functional expressions for X*, M*, D*, and B* are substituted into the total expected cost expression (equation 12), defining it entirely as a function of R. The optimal reorganization interval R* is then obtained by minimizing the total cost expression, using numerical methods. Having computed a specific value for R*, corresponding values of X* and M* can be extracted from their table, while values of D* and B* are obtained by substituting R* into their corresponding expressions.
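The flow of this three-step procedure can be seen in a condensed FORTRAN fragment; it is essentially the sampling loop of the Appendix main program (CXM, CDR, and CBR are the subroutines listed there, and the variables used are defined in its COMMON block):

C     Step 1: near-optimal (X*,M*) and filtering cost at this R;
C     Step 2: closed-form dump policies D*(R) and B*(R);
C     Step 3: assemble C(R) of (25) and keep the minimizing sample.
      BEST = 1.0E10
      DO 100 IR = 100, 2*N/10, 100
        R = FLOAT(IR)
        CALL CXM(R, IX, XM, CFILT)
        CALL CDR(R, D, CDFDMP)
        CALL CBR(R, B, CMDDMP)
        COST = CFILT + CDFDMP + CMDDMP
     &       + (UO*N*(1.-EXP(-R/N)) + CO)/R
     &       + S2/MUE*(R-1.)/2. + S1/MUE*N
        IF (COST .LT. BEST) THEN
          BEST = COST
          RSTAR = R
        ENDIF
  100 CONTINUE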

3.1 Designing the Search Filter

Selection of an efficient combination of X (number of hashing functions) and M (bit vector size) is affected by the sum of three costs: C6 for unsuccessful differential file accesses; C7 for main data file record access; and C8 for bit vector maintenance. Given a reorganization interval R, the optimal (X*,M*) combination is obtained by solving:

    Minimize  C6 + C7 + C8     (13)
      X,M
    Subject to:  X, M ≥ 0, X and M integers.

Analyzing this problem precisely, Aghili [1] has obtained several useful results. He shows (1) that the optimal value of M is never greater than [a″(μ+ρ)]/s; (2) that for reasonable ratios of h'/a″ (< 0.0001), the optimal value of X is never greater than 8; (3) that for R > [a″(μ+ρ)]/s, filtering mechanisms are ineffective and should not be used at all; and (4) that the total cost function (12) is insensitive to changes in X and M in a relatively wide range of values surrounding X* and M*. The latter result implies that a precise determination of (X*,M*) is not of practical concern, and therefore approximations which permit us to locate a near-optimal combination quickly are reasonable.
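With the typical values of Table 1, these bounds are easily evaluated (our arithmetic): for a″ = $0.01, μ + ρ = 200 accesses per minute, and s = $1.7x10^-6 per bit-minute, a″(μ+ρ)/s ≈ 1.2x10^6, so the bit vector never usefully exceeds about 1.2 million bits (roughly 150,000 characters), and filtering should be abandoned altogether once the reorganization interval exceeds about 1.2 million updates.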

Assuming that the size of the main data file is much larger than the reorganization interval, e.g., R/N < 0.2, problem (13) may be reformulated as

    Minimize  (μ+ρ)[h'X + a″ Σ (n=0 to R-1) (1 - e^(-nX/M))^X / R] + sM,     (13-1)
      X,M
    Subject to:  X, M ≥ 0, X and M integers.

This problem is solved by iterating on the values of X from 1 to 8, approximating the optimal M* for each X, and selecting as "optimal" the combination with minimum cost in equation (13-1). Specifically, the approximation to M* for a given X is obtained by solving(1)

    Minimize  a″(μ+ρ)[1 - (1 - e^(-XR/M))/(XR/M)] + sM,     (13-2)
      M
    Subject to:  0 < M ≤ [a″(μ+ρ)]/s, M integer;

which is easily accomplished using a Fibonacci search [4].

---
(1) For reasonable parameter values, s/[a″(μ+ρ)] << 1, and for any given value of X, the solution of (13-2) provides a tight upper bound for M* (Aghili [1]).
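The Fibonacci search itself is compact; the following self-contained sketch solves (13-2) for M at one value of X. The parameter values are the typical ones of Table 1, while X = 2 and R = 20,000 are assumptions chosen only for illustration (the Appendix subroutine CXM embeds the same search):

      PROGRAM FIBSCH
C     Minimal sketch: Fibonacci search for the M minimizing (13-2)
C     at a fixed X.  A2, AMR (=mue+rho), and S are Table 1 values;
C     X and R are assumed for illustration only.
      REAL FIB(15), XL, XR, X1, X2, Y1, Y2, XM
      PARAMETER (A2 = 0.01, AMR = 200., S = 1.7E-06,
     &           X = 2., R = 20000.)
      DATA FIB /1.,1.,2.,3.,5.,8.,13.,21.,34.,55.,89.,144.,
     &          233.,377.,610./
C     Objective of (13-2); unimodal in M on (0, a"(mue+rho)/s)
      F(XM) = A2*AMR*(1.-(1.-EXP(-X*R/XM))/(X*R/XM)) + S*XM
C     Search interval from result (1) above: 0 < M <= a"(mue+rho)/s
      XL = 1.
      XR = A2*AMR/S
C     Thirteen Fibonacci steps shrink the interval by FIB(15) = 610
      DO 10 K = 1,13
        X1 = XL + FIB(14-K)/FIB(16-K)*(XR-XL)
        X2 = XL + FIB(15-K)/FIB(16-K)*(XR-XL)
        Y1 = F(X1)
        Y2 = F(X2)
        IF (Y1 .LE. Y2) XR = X2
        IF (Y1 .GT. Y2) XL = X1
   10 CONTINUE
      WRITE(6,*) 'NEAR-OPTIMAL M (BITS) =', (XL+XR)/2.
      STOP
      END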

The procedure above has been used to approximate (X*,M*) for a variety of reorganization intervals and access intensities (μ+ρ), using the typical values of a″, h', and s given by Table 1. The results are presented in Table 2. Note that this table is in fact a tabular representation of the function

    C_XM(R): R → (X*, M*),     (13-3)

which maps values of R into corresponding near-optimal values of X and M. It will be used as such by the procedure of Section 3.4 which determines an optimal reorganization interval R*.

[Table 2 - Near-Optimal (X*,M*) Combinations for Different Reorganization Intervals and Access Intensities (M is given in hundreds of bits). Rows: reorganization intervals R from 100 to 2,000,000 updates; columns: access intensities (μ+ρ) of 1, 10, and 100 per minute, each pairing X* with M*. Beyond R = [a″(μ+ρ)]/s the entries become (0,0), i.e., no filter is used.]

3.2 Selecting an Optimal Differential File Dumping Policy

The differential file is dumped periodically to speed its recovery in the event of loss. The combined cost of dumping and recovery of the differential file was referred to simply as C2 in Section 2.3. A detailed analysis of this cost is now presented.

Consider the general situation in which D dumps are taken during a reorganization interval, with Y_i updates accumulated into the differential file between the (i-1)-th and i-th dumps. Three cost components affect the selection of D and Y = {Y_i | i = 1,...,D}:

    (i)   the cost of differential file dumps,
    (ii)  the expected cost of recopying the latest differential file dump, and
    (iii) the expected cost of reprocessing all transactions since the last dump.

Specifically, for given integer values of R, D, and Y, the combined cost of these components is given by

    C2 = CD(R,D)/R,     (14)

where

    CD(R,D) = C0 D + d'Y_1 + ... + d'Y_D     (14-1)
            + (λ'/μ)r'[Y_2 Y_1 + Y_3(Y_1+Y_2) + ... + Y_D(Y_1+...+Y_(D-1))]
            + (λ'/μ)r'(R - Y_1 - ... - Y_D)(Y_1 + ... + Y_D)
            + (λ'/μ)u″[Y_1(Y_1-1) + ... + Y_D(Y_D-1)]/2
            + (λ'/μ)u″(R - Y_1 - ... - Y_D)(R - Y_1 - ... - Y_D - 1)/2.

Relaxing the integer constraint on values of Y_i for our environment of sequential differential file growth, Aghili [1] has shown that the vector Y* which minimizes CD(R,D) has components

    Y_i* = R/(D+1),  i = 1,...,D;     (15)

that is, differential file dumps should be equally spaced over the reorganization interval. The optimal solution to the original integer problem is guaranteed to be one of the integer points surrounding Y* (i.e., a vertex of a hypercube in D-dimensional space); for reasonable problems these values are identical for all practical purposes.

Substituting Y* into (14-1) yields

    CD(R,D) = C0 D + d'(RD)/(D+1)     (16)
            + (λ'/μ)r'(RD)/[2(D+1)]
            + (λ'/μ)u″(R/2)[R/(D+1) - 1],

which may be rewritten as

    CD(R,D) = C0 D + d1/(D+1) + d2,     (17)

where

    d1 = {(λ'/μ)[(u″/2)R - (r'/2)] - d'}R,     (17-1)
and
    d2 = [(λ'/μ)(r' - u″)/2 + d']R.     (17-2)

The optimal number of differential file dumps with this

restatement of the problem is determined by solving dCD(R,D)/dD = 0 to obtain

    D* = 0,                    if d1/C0 ≤ 1;
       = (d1/C0)^(1/2) - 1,    otherwise.     (18)

The optimal integer value D* of the original problem (16) is guaranteed to be one of the two integers nearest this continuous solution. It can be quickly determined and substituted for D in (17) to yield

    CD(R,D*) = d1 + d2,                       if d1/C0 ≤ 1;
             = C0 D* + d1/(D*+1) + d2,        otherwise.     (19)

Observe that if d1/C0 ≤ 1, then D* = 0 and C2 = (λ'/μ)u″(R-1)/2, in agreement with equation (4) of Section 2.3.

Tables 3(a) and 3(b) respectively present optimal values of the number of dumps D* and the time between dumps (R/μ)/(D*+1) for the values of C0, u″, r', and d' given in Table 1. A 10-hour day, a 5-day week, a 4-week month, and a 12-month year are assumed in our tables. Table 3(b) shows that the time between differential file dumps is rather insensitive to the reorganization interval R. This is reasonable, since for large values of R, D* is

approximately

    D* = 0,                            if d1/C0 ≤ 1;
       = R[(λ'u″)/(2μC0)]^(1/2),       otherwise,     (20)

and hence the time between dumps is approximated by

    (R/μ)/D* = undefined,                  if d1/C0 ≤ 1;
             = [(2C0)/(μλ'u″)]^(1/2),      otherwise.     (20-1)
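A worked instance (our arithmetic, with Table 1 values and an assumed R): let λ' = 1/month = 1/12,000 per minute, μ = 100 per minute, u″ = $0.02, r' = $0.005, d' = $0.0005, C0 = $2, and R = 100,000. Then by (17-1), d1 = [8.33x10^-7 (0.02x10^5 - 0.005)/2 - 0.0005] 10^5 ≈ 33, so d1/C0 ≈ 16.7 > 1 and (18) gives D* ≈ √16.7 - 1 ≈ 3.1. Three dumps are taken, spaced (R/μ)/(D*+1) = 1000/4 = 250 minutes apart -- a few dumps per 10-hour operating day, consistent in magnitude with Table 3(b).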

[Table 3(a) - Optimal Number of Differential File Dumps, D*, during a Reorganization Interval at Update Intensities 1, 10, and 100 per Minute. Rows: reorganization intervals R from 100 to 2,000,000 updates; columns: differential file failure rates λ' of 1/day, 1/week, 1/month, and 1/year.]

[Table 3(b) - Optimal Time (in Days) Between Differential File Dumps at Update Intensities 1, 10, and 100 per Minute; same row and column structure as Table 3(a).]

3.3 Selecting the Frequency of Main Data File Backup

The main data file is dumped periodically to speed recovery in the event of loss. The expected main database recovery cost, C1, and the main data file backup cost, C4, combine to affect the selection of the optimal main data file dump frequency, B*. To determine B* in terms of the reorganization interval R, we define

    CB(R,B) = C1 + C4     (21)
            = (λ/μ)[rN + u'R(B-1)/2] + (Nd)/(BR).

Relaxing the integer constraint on B and solving dCB(R,B)/dB = 0, we obtain

    B* = d3/R,    if R < d3;
       = 1,       otherwise,     (22)

where

    d3 = [(2Ndμ)/(λu')]^(1/2).     (22-1)

Tables providing optimal values of B* for a wide variety of problems have been produced. Rather than including them here, we present a single table for a normalized problem from which the entries of the other tables can be generated. Assuming the typical values for r, d, u', and w given in Table 1, Table 4 displays optimal values B_1* for a number of reorganization intervals R_1, assuming N = 10^7 records,

μ_1 = 100 updates per minute, and λ_1 = 1 failure per year. This table and equation (22) permit us to closely approximate an optimal B_2* for arbitrary values of R_2, N_2, μ_2, and λ_2. Specifically,

    B_2* = 1,        if αB_1* < 1;
         = αB_1*,    otherwise,     (23)

where α is given by

    α = 0.9x10^-5 (R_1/R_2)[N_2(μ_2/λ_2)]^(1/2),

and μ_2 and λ_2 are, respectively, the average number of updates per minute and the average number of main data file failures per month. Given this continuous value, the optimal integer B* is guaranteed to be one of the two integer values nearest it. Substituting B* for B in (21), the expected main data file backup and recovery cost per processed update transaction is given by

    CB(R,B*) = (λ/μ)[rN + u'R(B*-1)/2] + (Nd)/(B*R),    if R < d3;
             = (λ/μ)rN + (Nd)/R,                        otherwise.     (24)

This is the final cost component needed by the procedure which locates R*, as described in the next section.
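Equation (22-1) reproduces Table 4 directly (our arithmetic): with N = 10^7, d = $0.0005, μ = 100 per minute, u' = $0.1, and λ = 1/year = 1/144,000 per minute (144,000 = 60x10x5x4x12 operating minutes per year), d3 = [2(10^7)(0.0005)(100)(144,000)/0.1]^(1/2) = 1.2x10^6, so B_1* = 1.2x10^6/R -- for example, B_1* = 12.00 at R = 100,000, as tabled.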

    Reorganization    Number of
    Interval (R)      Reorganizations (B*)

           100            12000.00
           200             6000.00
           300             4000.00
           400             3000.00
           500             2400.00
           600             2000.00
           700             1714.29
           800             1500.00
           900             1333.33
          1000             1200.00
          2000              600.00
          3000              400.00
          4000              300.00
          5000              240.00
          6000              200.00
          7000              171.43
          8000              150.00
          9000              133.33
         10000              120.00
         20000               60.00
         30000               40.00
         40000               30.00
         50000               24.00
         60000               20.00
         70000               17.14
         80000               15.00
         90000               13.33
        100000               12.00
        200000                6.00
        300000                4.00
        400000                3.00
        500000                2.40
        600000                2.00
        700000                1.71
        800000                1.50
        900000                1.33
       1000000                1.20
       2000000                1.00

    Table 4 - Optimal Number B* of Main Data File Reorganizations Between Dumps for N = 10,000,000, μ = 100/minute, λ = 1/year

3.4 Selecting the Reorganization Interval

Section 3.1 describes a table look-up procedure for selecting a near-optimal (X*,M*) combination for a given reorganization interval R. Similarly, Sections 3.2 and 3.3 provide near-optimal D* and B* values as functions of R. As a result, the total cost equation (12) can be reduced to the nonlinear function of R given by

    C(R) = CD(R,D*)/R    differential file dump and recovery cost;     (25)
         + CB(R,B*)      main data file dump and recovery cost;
         + C_XM(R)       sum of unsuccessful differential file access cost,
                         record access cost, and cost of maintaining the bit vector;
         + C3            differential file storage cost;
         + C5            reorganization cost; and
         + C9            main data file storage cost;

where CD(R,D*), CB(R,B*), C_XM(R), C3, C5, and C9 are given by expressions (19), (24), (13-3), (5), (7), and (11), respectively. The optimal reorganization interval, R*, can now be found by minimizing C(R) over the range of integers 1 ≤ R ≤ 0.2N.

Although the nonlinear nature of C(R) precludes a closed-form solution, a complete enumeration on R is not necessary. Figure 4 illustrates the typical shape of C(R). One finds that while C(R) generally has more than one local minimum, the curve is relatively shallow at all minima. A simple search procedure (as an alternative to Pattern Search [4], Newton-Raphson [4], or

Powell's method [9,10]) can therefore locate an efficient reorganization interval quickly. Rather than enumerating all values of R, determining (M*,X*,D*,B*) for each and selecting the R* with minimal cost, one can take sample values of R -- say 100 updates apart -- and confidently choose the sample value with the minimum cost as effectively optimal.

Using this technique for the parameter values given in Table 1, R* has been computed for a wide variety of problem situations. Results are presented in Table 5, and corresponding values for X*, M*, D*, and B* can be readily obtained using Tables 2, 3, and 4.

[Figure 4 - General Shape of Cost Function C(R) (horizontal axis: reorganization interval)]
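The effort this sampling implies is modest (our arithmetic, for an assumed N of 10^6 records): spacing samples 100 updates apart over 1 ≤ R ≤ 0.2N requires only 0.2x10^6/100 = 2,000 evaluations of C(R); this is precisely the loop DO 100 IR = 100,MAX,100 with MAX = 2N/10 in the Appendix program.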

[Table 5 - Optimal Reorganization Interval, R*, in Thousands of Updates, for Different Failure Rates and Database Sizes at Update Intensities 1, 10, and 100 per Minute. Rows: database sizes N from 20 thousand to 10 million records; column groups: main database failure rates λ of 1/day, 1/week, 1/month, and 1/year, each at the three update intensities. For small databases (N ≤ 100,000 records) the tabled R* equals the growth limit 0.2N; for the largest database (N = 10^7) R* is roughly 184, 590, and 2000 thousand updates at intensities 1, 10, and 100 per minute.]

4. AN EXAMPLE OF HEURISTIC DESIGN

The rather detailed analysis given in the last section was not intended to suggest that heuristics are inappropriate when designing a database backup and recovery strategy. On the contrary, we describe elsewhere [1] a more complex and time-consuming design procedure which determines optimal differential file parameters more precisely than the table look-up procedure presented here. We believe, however, that such precision and effort are rarely justified. Model optimality is generally not a practical concern. Seldom with any model will all assumptions be completely satisfied (e.g., independent and uniformly distributed updates in our case) or will all problem parameters be precisely known (e.g., update and retrieval intensities, failure rates, etc.). Moreover, in a changing environment, optimality is always short-lived. An analyst's primary responsibility in database design is to avoid "bad" designs; all "good" solutions are effectively equivalent, and "exact" answers to imprecise problems provide only unwarranted comfort. An exemplary problem serves to clarify the spirit in which we offer our results and the means by which we feel "good" designs can be achieved.

Given a design problem in which updates are reasonably independent and uniformly distributed over an online

database, an analyst may proceed to design an efficient differential file architecture in one of two ways. He may accept the operating parameters of Table 1 as reasonably descriptive of the problem environment and select design values (X*,M*,D*,B*,R*) via interpolation in Tables 2, 3, 4, and 5. Alternatively, the Appendix provides the FORTRAN source code for a program which will accept an arbitrary operating environment and problem description and return parameters for a near-optimal differential file design directly.

Consider a real-time inventory system which maintains 125,000 records and receives an average of 40 updates per minute. Assume that the expected failure rate for any data file is once per month, and that other problem parameters are suitably described by Table 1. For this problem N = 125,000, μ = 40/minute, λ = 1/month, and R* may be obtained by interpolating on the relevant entries from Table 5, shown below:

    Main Data         λ = 1/Month
    File Size (N)    μ=10      μ=100

    100,000         20,000     20,000
    200,000         36,000     40,000

Specifically, optimal reorganization intervals R_1* and R_2* for μ = 40 and N = 100,000 and 200,000, respectively, are approximated by

    R_1* = [(100-40)/(100-10)](20000 - 20000) + 20000 = 20000
    R_2* = [(40-10)/(100-10)](40000 - 36000) + 36000 = 37333.

An optimal R* for 125,000 records is then computed from

    R* = [(125000-100000)/(200000-100000)](R_2* - R_1*) + R_1* ≈ 24300.

With R* now established, a similar interpolation in Table 4 provides the normalized main data file dumping frequency B_1* = 51.4. Inserting this value and our problem parameters into equation (23), we obtain

    B* = 0.9x10^-5 [(125000)(40)]^(1/2) (51.4) = 1.034,

and hence B* = 1. In a similar manner, optimal values X* = 2, M* = 45,200 bits, and D* = 0 are obtained from Tables 2 and 3.

Operationally, this solution translates to a differential file architecture in which the main data file is reorganized and dumped once a day (24,300 updates at 40 per minute fill about one 10-hour operating day), while two hashing functions and a 5,700-character bit vector are used as a Bloom filter. Calculations show the expected recovery time for the differential file to be less than 1 minute, while 10.5 minutes is required to recopy the main data file in the

event of loss. An efficient update inplace strategy [7] would also dump the main data file at the end of each day and would cost 3.6 percent less to operate each month. Again, 10.5 minutes is needed to recopy the dump in the event of loss; but now, on average, a much larger 8.2 minutes is required to recover the updates.(1)

With either strategy above, recopying of the main file could be avoided if a second disk copy were made (and moved offline if necessary) at the time the main data file was dumped to tape.(2) A differential file architecture would now look especially attractive, offering approximately an 88 percent improvement in recovery speed. Table 6 summarizes operating statistics for both the conventional and differential file strategies, with and without a duplexed main data file.

Table 6 also demonstrates solution sensitivity to w, the recovery cost weighting factor (see Table 1), by providing statistics for model solutions obtained with each strategy when w takes on the alternative values 10, 5, and 1. As w decreases from our assumed value of 10, file dumps are

---
(1) Since the times required to diagnose data loss and to load and ready appropriate storage devices are affected by many operational factors, they are not included here.
(2) The probability of more than one loss in a day is less than 0.1 percent for this problem.

taken less frequently, while reorganization occurs more frequently and filtering errors decrease as a result. As one would expect, the improvement in recovery speed diminishes as this speed is valued less; when w = 1 (recovery operations and normal operations have equal weight), the improvement is a modest 7 percent. We clearly observe here that the differential file architecture offers its greatest recovery speed when the main data file is dumped with each reorganization, so that transaction reprocessing is not required for its recovery (i.e., B* = 1 and T2 = 0).

              Table 6(a) - Differential File Architecture (M* in hundreds of bits)

         Design Variables       Recovery Times         Monthly     Filtering
    w    (R*,X*,M*,B*,D*)       (in minutes)           Operating   Error
                                T1      T2      T3     Cost

    10   (25000,2,452,1,2)      10.5     0.0    1.00   $13,595     38%
     5   (4400,2,289,8,0)       10.5    10.2    0.20   $12,854     15%
     1   (3800,2,269,20,0)      10.5    24.0    0.15   $11,711     14%

              Table 6(b) - Conventional Update Inplace System

         Optimal Dump           Recovery Times         Monthly
    w    Interval B*            (in minutes)           Operating
         (in updates)           T1      T2             Cost

    10   24544                  10.5     8.2           $13,117
     5   34710                  10.5    11.6           $12,090
     1   77615                  10.5    25.9           $10,885

              Table 6(c) - Change in Operating Cost and Recovery Speed

         Increase in Monthly    Improvement in Recovery Speed
    w    Operating Cost         Single Main        Duplex Main
                                Data File          Data File

    10   3.6%                   38%                88%
     5   6.3%                    6%                12%
     1   7.6%                    4%                 7%

    T1: Time to Recopy Main Data File from Its Dump
    T2: Expected Time to Repost Updates to Main Data File
    T3: Expected Time to Recover Differential File

    Table 6 - Model Solutions and Performance Statistics for Different Values of w

5. SUMMARY

Backup and recovery of online databases using a differential file have been discussed and described in terms of an analytic model. Values for five decision variables (number of hashing functions, bit vector size, number of updates before reorganization, number of differential file dumps between reorganizations, and number of reorganizations before main data file backup) combine to define an optimal differential file operating strategy. We describe elsewhere [1] a complex optimization procedure which determines these values precisely, but we believe that such precision is rarely justified. In realistic situations, where parameters such as system costs, access intensities, and failure rates must be approximated, the simple design procedure presented here, which locates near-optimal solutions quickly, is more appropriate.

APPENDIX

This appendix presents the FORTRAN source code used to compute values for R*, X*, M*, D*, and B*. The package consists of a BLOCK DATA subprogram, a main program, and three subroutines. The BLOCK DATA subprogram initializes the values of the operating environment (which presumably will be modified by an analyst choosing to implement this program). The subroutines CXM, CDR, and CBR respectively return the optimal values (X*,M*), D*, and B* for a given value of R. The main program samples and evaluates R values at increments of 100 updates. The sample R* with minimum cost, and corresponding values for X*, M*, D*, and B*, are printed at the end of the run. Exhaustive enumeration or a more sophisticated search procedure for R* may easily be incorporated through modification of the main program; the remaining code would remain unchanged.
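For example, to run the inventory problem of Section 4, an analyst would supply NAMELIST input naming only the parameters that differ from the BLOCK DATA defaults. The sketch below is illustrative; the delimiter convention (&INPARS ... &END versus $INPARS ... $END) varies among FORTRAN compilers of this vintage, and 0.000083 per minute is one failure per 12,000-minute month:

 &INPARS  N=125000, MUE=40., LO=0.000083, L1=0.000083  &END
 Y

The trailing Y answers the program's "All OK?" prompt and starts the search.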

      BLOCK DATA
      COMMON /PRBPAR/ N,MUE,RHO,LO,L1,CO,DO,D1,RO,R1,
     &                UO,U1,U2,AO,A1,A2,SO,S1,S2,H1,W
C
C     N:   size of the main data file
C     MUE: average rate of updates to the database
C     RHO: average rate of retrievals from the database
C     LO:  rate of main data file failures
C     L1:  rate of differential file failures
C     CO:  system set-up cost for dump and reorganization operations
C     DO:  cost of dumping a main data file record
C     D1:  cost of dumping a differential file record
C     UO:  cost of posting an update during reorganization
C     AO:  cost of accessing a record from the main data file
C     A1:  cost of accessing a record from the differential file
C     A2:  cost of an unsuccessful differential file search
C     SO:  cost of maintaining a bit in main memory per unit of time
C     S1:  cost of storing a main data file record per unit of time
C     S2:  cost of storing a differential file record per unit of time
C     H1:  cost of executing a Bloom filter hashing function
C     W:   weighting factor for RO, R1, U1, U2
C     RO:  cost of restoring a main data file record
C     R1:  cost of restoring a differential file record
C     U1:  cost of reposting an update to the main data file
C     U2:  cost of reposting an update to the differential file
C
C     Assumptions:  unit of time: minute;  unit of cost: $;
C                   a character: 8 bits
C
      REAL MUE,LO,L1
      DATA N,MUE,RHO,LO,L1,CO /100000,100.,0.,0.000083,0.000083,2./
      DATA DO,D1,W,RO,R1 /0.0005,0.0005,10.,0.0005,0.0005/
      DATA UO,U1,U2 /0.01,0.01,0.002/
      DATA AO,A1,A2 /3*0.01/
      DATA SO,S1,S2,H1 /1.7E-06,3.0E-07,3.0E-07,1.0E-06/
      END
C
C     Main line of the FORTRAN package to design
C     optimal differential file architectures
C
      COMMON /PRBPAR/ N,MUE,RHO,LO,L1,CO,DO,D1,RO,R1,
     &                UO,U1,U2,AO,A1,A2,SO,S1,S2,H1,W
      NAMELIST /INPARS/ N,MUE,RHO,LO,L1,CO,DO,D1,RO,R1,
     &                  UO,U1,U2,AO,A1,A2,SO,S1,S2,H1,W
      REAL R,M,MMIN,MUE,LO,L1,REPLY,YES/'Y'/
      WRITE(6,1) N,SO,AO,UO,MUE,S1,A1,U1,RHO,S2,A2,U2,
     &           LO,RO,DO,H1,L1,R1,D1,W,CO
   50 WRITE(6,2)
      READ (5,INPARS,END=900)
      WRITE(6,1) N,SO,AO,UO,MUE,S1,A1,U1,RHO,S2,A2,U2,
     &           LO,RO,DO,H1,L1,R1,D1,W,CO
      WRITE(6,3)
      READ(5,4) REPLY
      IF (REPLY .NE. YES) GOTO 50
C
C     Take samples from the R values, and select the one with
C     minimum cost
C
      CMDSTR = S1/MUE*N
      TOTMIN = 1.0E10
      MAX = 2*N/10
      DO 100 IR = 100,MAX,100
      R = FLOAT(IR)
      CALL CXM(R,IX,M,BFLCST)
      CALL CDR(R,D,CDFDMP)
      CALL CBR(R,B,CMDDMP)
      CREORG = (UO*N*(1.-EXP(-R/N))+CO)/R
      CDFSTR = S2/MUE*(R-1.)/2.
      TOTCST = BFLCST+CDFDMP+CMDDMP+CREORG+CDFSTR+CMDSTR
      IF (TOTMIN .LT. TOTCST) GOTO 100
      TOTMIN = TOTCST
      RMIN = R
      IXMIN = IX
      MMIN = M
      DMIN = D
      BMIN = B
  100 CONTINUE
C
      WRITE(6,5) RMIN,IXMIN,MMIN,BMIN,DMIN,TOTMIN
      GOTO 50
  900 STOP
    1 FORMAT('1Current parameter values are:',//,
     & ' ',I9,'=N      ',F9.7,'=s      ',F9.5,'=a      ',F9.5,'=u',/,
     & ' ',F9.0,'=mue    ',F9.7,'=s''     ',F9.5,'=a''     ',
     &     F9.5,'=u''',/,
     & ' ',F9.0,'=rho    ',F9.7,'=s''''    ',F9.5,'=a''''    ',
     &     F9.5,'=u''''',/,
     & ' ',F9.7,'=lamda  ',F9.5,'=r      ',F9.5,'=d      ',
     &     F9.7,'=h''',/,
     & ' ',F9.7,'=lamda''',F9.5,'=r''     ',F9.5,'=d''     ',
     &     F9.1,'=w',/,
     & ' ',F9.2,'=CO')
    2 FORMAT('0Enter any modification to parameter values.')
    3 FORMAT('0All OK? (Y/N)')
    4 FORMAT(A1)
    5 FORMAT('0       R*  X*       M*       B*       D*  Total Cost',
     & /,' ',F9.0,I4,3F9.0,F12.7)
      END
      SUBROUTINE CXM(R,IX,M,BFLCST)
C
C     Find an efficient X-M combination at a given R (the solution
C     of 13-1).  M is returned in blocks of 100 bits.
C
      COMMON /PRBPAR/ N,MUE,RHO,LO,L1,CO,DO,D1,RO,R1,
     &                UO,U1,U2,AO,A1,A2,SO,S1,S2,H1,W
      REAL R,M,MUE,LO,L1,FIBER(15)/1.,1.,2.,3.,5.,8.,13.,21.,34.,
     & 55.,89.,144.,233.,377.,610./,CNK(45)/1.,2.,1.,3.,3.,1.,
     & 4.,6.,4.,1.,5.,10.,10.,5.,1.,6.,15.,20.,15.,6.,1.,7.,21.,
     & 35.,35.,21.,7.,1.,8.,28.,56.,70.,56.,28.,8.,1.,9.,36.,84.,
     & 126.,126.,84.,36.,9.,1./
C     Cost (13-2) of a filter with JX hashing functions and XM bits
      COSTM(RR,JX,XM) = (RHO+MUE)*A2*(1.-(1.-EXP(-RR*JX/XM))
     &                  /(RR*JX/XM)) + SO*XM
C
C     Begin with X-M of 0-0, iterate on X, select an efficient M
C     for each case (using Fibonacci search), and revise X-M if
C     necessary
C
      IX = 0
      M = 0.
      BFLCST = A2*(1.+RHO/MUE)
C
      DO 200 IXX = 1,9
      XL = 1.
      XR = A2/100.*(RHO+MUE)/SO
      DO 100 K = 1,13
      X1 = XL + FIBER(14-K)/FIBER(16-K)*(XR-XL)
      X2 = XL + FIBER(15-K)/FIBER(16-K)*(XR-XL)
      Y1 = COSTM(R,IXX,100.*X1)
      Y2 = COSTM(R,IXX,100.*X2)
      IF (Y1 .LE. Y2) XR = X2
      IF (Y1 .GT. Y2) XL = X1
  100 CONTINUE
C
      TM = IFIX((XL+XR)/2.)*100.
      TC = COSTM(R,IXX,TM)
      TC1 = COSTM(R,IXX,TM+100.)
      IF (TC .GT. TC1) TM = TM+100.
C
C     Compute the cost of this X-M, and revise the optimal X*-M*
C     if necessary
C
      TC = R
      DO 150 J = 1,IXX
  150 TC = TC + CNK(IXX*(IXX-1)/2+J)*(-1.)**J
     &        *(1.-EXP(-R*J*IXX/TM))/(J*IXX/TM)
      TC = (1.+RHO/MUE)*(A2*TC/R+H1*IXX) + SO/MUE*TM
      IF (TC .GE. BFLCST) GOTO 200
      IX = IXX
      M = IFIX(TM/100.+0.5)
      BFLCST = TC
  200 CONTINUE
C
      BFLCST = BFLCST + (1.+RHO/MUE)*(A1+(AO-A1)*N*(1.-EXP(-R/N))/R)
      RETURN
      END
      SUBROUTINE CDR(R,D,CDFDMP)
C
C     Compute the optimal number of differential file dumps
C
      COMMON /PRBPAR/ N,MUE,RHO,LO,L1,CO,DO,D1,RO,R1,
     &                UO,U1,U2,AO,A1,A2,SO,S1,S2,H1,W
      REAL MUE,LO,L1
C
C     T1 and T2 are the terms d1 and d2 of (17-1) and (17-2)
      T1 = ((L1/MUE)*W*(U2*R-R1)/2.-D1)*R
      T2 = ((L1/MUE)*W*(R1-U2)/2.+D1)*R
      D = 0.
      IF (T1/CO .GT. 1.) D = SQRT(T1/CO)-1.
C
C     The integer optimum is one of the two integers nearest D
      D = IFIX(D)
      CDFDMP = (CO*D+T1/(D+1.)+T2)/R
      TC = (CO*(D+1.)+T1/(D+2.)+T2)/R
      IF (CDFDMP .LE. TC) RETURN
      D = D+1.
      CDFDMP = TC
      RETURN
      END
      SUBROUTINE CBR(R,B,CMDDMP)
C
C     Compute the optimal main data file backup period
C
      COMMON /PRBPAR/ N,MUE,RHO,LO,L1,CO,DO,D1,RO,R1,
     &                UO,U1,U2,AO,A1,A2,SO,S1,S2,H1,W
      REAL MUE,LO,L1
C
C     T3 is the term d3 of equation (22-1)
      T3 = SQRT(2.*N*DO*MUE/LO/U1/W)
      B = T3/R
      IF (R .GT. T3) B = 1.
C
C     The integer optimum is one of the two integers nearest B
      B = IFIX(B)
      CMDDMP = (LO/MUE)*W*(RO*N+U1/2.*R*(B-1.))+N*DO/R/B
      TC = (LO/MUE)*W*(RO*N+U1/2.*R*B)+N*DO/R/(B+1.)
      IF (CMDDMP .LE. TC) RETURN
      B = B+1.
      CMDDMP = TC
      RETURN
      END

REFERENCES

1. Aghili, H. Differential File Architecture: Analysis and Applications. Ph.D. thesis, The University of Michigan, Ann Arbor, Michigan, to appear.

2. Aghili, H., and Severance, D.G. The Use of Differential Files in an 80/20 Update Environment. Unpublished working paper.

3. Becker, J.R. "EXPLORE: A Computer Code for Solving Nonlinear Continuous Optimization Problems." Technical Report No. 73-6, IOE Department, The University of Michigan, Ann Arbor, Michigan, 1973.

4. Beightler, C.S.; Phillips, D.T.; and Wilde, D.J. Foundations of Optimization. Englewood Cliffs, N.J.: Prentice-Hall, 1979.

5. Bloom, B.H. "Space/Time Trade-offs in Hash Coding with Allowable Errors." Communications of the ACM 13,7 (July 1970), pp. 422-426.

6. Canning, R.G. "Recovery in Data Base Systems." EDP Analyzer 14,11 (November 1976).

7. Lohman, G.M. Optimal Data Storage and Organization in Computerized Information Processing Systems Subject to Failure. Ph.D. thesis, Cornell University, Ithaca, N.Y., January 1977.

8. Lohman, G.M., and Muckstadt, J.A. "Optimal Policy for Batch Operations: Backup, Checkpointing, Reorganization, and Updating." ACM Transactions on Database Systems 2,3 (September 1977), pp. 209-222.

9. Powell, M.J.D. "A FORTRAN Subroutine for Solving Systems of Nonlinear Algebraic Equations." Report No. R-5947, A.E.R.E. Harwell, Didcot, Berkshire, England, 1968.

10. Powell, M.J.D. "A Hybrid Method for Nonlinear Equations." Report No. T.P. 364, A.E.R.E. Harwell, Didcot, Berkshire, England, 1969.

11. Sayani, H.H. "Restart and Recovery in a Transaction-Oriented Information Processing System." Proc. 1974 ACM SIGMOD Workshop on Data Description, Access, and Control (May 1974), pp. 351-366.

12. Severance, D.G., and Carlis, J.V. "A Practical Approach to Selecting Record Access Paths." Computing Surveys 9,4 (December 1977), pp. 259-272.

13. Severance, D.G., and Duhne, R.A. "A Practitioner's Guide to Addressing Algorithms." Communications of the ACM 19,6 (June 1976), pp. 315-326.

14. Severance, D.G., and Lohman, G.M. "Differential Files: Their Application to the Maintenance of Large Databases." ACM Transactions on Database Systems 1,3 (September 1976), pp. 256-267.
