THE UNIVERSITY OF MICHIGAN
COMPUTING RESEARCH LABORATORY

ENSURING FAULT-TOLERANCE OF PHASE-LOCKED CLOCKS

C.M. Krishna, Kang G. Shin and Ricky W. Butler

CRL-TR-29-84

June 1984

Room 1079, East Engineering Building, Ann Arbor, Michigan 48109 USA
Tel: (313) 783-8000

ENSURING FAULT-TOLERANCE OF PHASE-LOCKED CLOCKS¹

C. M. Krishna, Kang G. Shin
Computing Research Laboratory
Department of Electrical Engineering and Computer Science
The University of Michigan, Ann Arbor, Michigan 48109

and

Ricky W. Butler
NASA Langley Research Center, Hampton, VA 23665

ABSTRACT

Processors within a real-time multiprocessor must be synchronized with as little overhead as possible. Although synchronization can be achieved via both software (e.g. interactive convergence and interactive consistency algorithms) and hardware (e.g. multistage synchronizers and phase-locked clocks), phase-locked clocks are most attractive due to their small overheads. Despite the fact that synchronization of the multiprocessor with phase-locked clocks is totally different in nature from the interactive consistency algorithm [1], we prove that it must satisfy the same condition N ≥ 3m+1, where N is the total number of clocks in the multiprocessor and m the maximum number of faults tolerable. We also present results showing how to design phase-locked clocks so as to be impervious to up to a given arbitrary number of malicious failures.

Index Terms - Synchronization, interactive consistency and interactive convergence algorithms, phase-locked clocks, malicious failure.

¹The work of Krishna and Shin was supported in part by NASA Grant No. 1-296. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of NASA. All correspondence regarding this report should be addressed to Professor Kang G. Shin.

1. INTRODUCTION

One problem in designing ultra-reliable multiprocessors for the control of real-time systems is synchronizing the processors. It is difficult to reliably synchronize processors within a multiprocessor because a failed unit, such as a clock or a processor, can behave arbitrarily maliciously or disruptively. That is to say, when a unit fails, it need not always go dead, or behave otherwise predictably. A failed clock may, for instance, send out conflicting time signals (or lie) to different parts of a system. As we shall see in Section 3, even highly redundant systems can, if they are not designed with great care, be vulnerable to just two malicious faults. In the absence of models for computing the probability of malicious failures,² we must assume for safety that malicious behavior is always possible. The most difficult job in synchronizing a highly reliable system is ensuring that synchronization can be achieved in the presence of such malicious units.

The interactive consistency [1] and the interactive convergence [2] algorithms -- both implemented in software -- can be used to synchronize the multiprocessor even in the face of malicious failures. Although these algorithms are elegant and theoretically appealing, they consume a great deal of time for a large number N of processors, and time is precious in real-time systems. For example, the SIFT aircraft-control computer [2,3] uses the interactive convergence algorithm to periodically (100 ms period) synchronize its processors.

²At present, there is no known method of characterizing the behavior of malicious failures.

Based on data obtained from this system, we projected in [4] the overhead that this algorithm would impose on a similar system for different clock drifts (see Figure 1). The largeness of the overhead imposed by that algorithm is evident. The interactive consistency algorithm performs even worse (see [4] for a detailed overhead analysis). By comparison, the time overhead of hardware synchronization is insignificant.

There are two ways of implementing synchronization in hardware to ensure correct operation in the face of malicious failures. The first is the multistage synchronizer of Davies and Wakerly [5], illustrated in Figure 2. Unfortunately, the number of hardware components in that arrangement is a quadratic function of the number of processors (or clocks) to be synchronized. The other way is to use phase-locked clocks, which are the subject of this report. As we shall see, phase-locked clocks can be designed without any of the disadvantages mentioned above.

Phase-locked clocks were first used to ensure that the processors of FTMP [6] operated in lock step. We consider here a total of N clocks to be synchronized in the face of up to m faulty clocks. The basic theory behind their operation is simple. In Figure 3, we provide a schematic diagram of an individual clock. Each clock consists of a receiver which monitors the clock pulses of the N-1 other clocks in the arrangement, and these are used to generate a reference signal. By comparing this reference with its own pulse, the receiving clock computes an estimate of its own phase error. This estimated phase error is then put into an appropriate filter, and the output of the filter controls the clock oscillator's frequency. By thus controlling the frequency of the individual clocks, they can be kept in phase-lock, and therefore synchronized, for as long as the initial phase error is below a prescribed bound, i.e. for as long as the clocks started reasonably in step and their drifts are sufficiently low. A discussion of clock stability can be found in [7].
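As a concrete (if anachronistic) illustration, the control action of a single clock during one global clock cycle can be sketched in modern Python. All names below are ours and the loop filter is reduced to a single proportional gain; the report constrains only how the reference tick is selected, so this is a minimal sketch rather than the FTMP circuit.

GAIN = 0.5  # loop-filter gain; a real design derives this from a stability analysis

def select_reference(other_ticks, f):
    """Trigger on the f-th earliest tick among the N-1 other clocks (f is 1-based)."""
    return sorted(other_ticks)[f - 1]

def update_period(own_tick, own_period, other_ticks, f):
    """One correction step: estimate phase error, filter it, adjust the period."""
    reference = select_reference(other_ticks, f)
    phase_error = own_tick - reference      # positive means this clock ticked late
    correction = GAIN * phase_error         # first-order loop filter
    return own_period - correction          # shorten the period to speed up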

The arrangement for N=4, m=1 is, to our knowledge, the only phase-locked clock constructed and fully analyzed [8]. Unfortunately, when one attempts to increase m without care, synchronization can be lost due to the presence of malicious faults. In this report, we show how to design phase-locked clocks to tolerate a given arbitrary number of malicious failures. Our work is a generalization of the original design [8], which can tolerate at most one failed clock. As we shall see, the generalization requires the proof of two non-trivial theorems, and interestingly it leads to the same necessary condition N ≥ 3m+1 as the interactive consistency algorithm.

The report is organized as follows. In Section 2, we present preliminary notation and definitions. In Section 3, we show how damaging malicious failures can be. Our main result is contained in Section 4, where we prove two important theorems related to the design of phase-locked clocks so as to tolerate up to a given arbitrary number of faults, and we conclude with Section 5.

2. NOTATION AND DEFINITIONS

The following notation and definitions are used in this report.

Definition 1: If the overall system of clocks is properly synchronized, all individual non-faulty clocks must agree closely with each other. A well-synchronized system thus has global clock cycles. Global clock cycle i is the interval between the i-th tick of the fastest non-faulty clock (i.e. the non-faulty clock that has its i-th tick before those of all the other non-faulty clocks) and the (i+1)-th tick of the fastest non-faulty clock. For brevity, we shall denote global clock cycle i by gcci.

Definition 2: Each of the clocks "sees", through its receiving circuitry, the ticks of the other clocks. These ticks, together with the receiving clock's own tick, can be totally ordered in any gcci by the relation "prior or equal to". Such an ordered set, called a scenario, for clock a in gcci is denoted by S_a^i. We shall frequently drop the superscript for convenience: where this is done, it will be understood that we are talking about some gcci. If a non-faulty clock c does not receive a tick from clock d within a given timeout period in any global clock cycle, the tick for d is arbitrarily assumed by c to be at the end of that timeout period. The scenario of every non-faulty clock therefore has exactly N elements.

Definition 3: If clock a has clock b as its reference in some gcci, it is said to trigger on b in that gcci.

Definition 4: Given the various triggers, we can draw a directed graph with the clocks as the vertices, and the directed arcs reflecting the relationship "triggers" in some gcci. Such a graph is called the trigger graph. For example, in Figure 4, a triggers b and c, and is itself triggered by d, while d is triggered by b. A clique of clocks is a component [9] of the trigger graph, i.e. a set of connected vertices. In Figure 4, there are two cliques: {a,b,c,d} and {e,f,g}.
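Since a clique is simply a connected component of the trigger graph, it can be computed mechanically. The sketch below (ours; plain Python, no graph library) rebuilds the Figure 4 example. The arcs among e, f and g are not spelled out in the text, so an arbitrary choice is made here purely for illustration.

def cliques(ref):
    """ref: dict mapping each clock to the clock it triggers on.
    Cliques (Definition 4) are components of the trigger graph, so we
    treat each arc as an undirected edge and find connected components."""
    nodes = set(ref) | set(ref.values())
    nbrs = {c: set() for c in nodes}
    for c, r in ref.items():
        nbrs[c].add(r)
        nbrs[r].add(c)
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            c = stack.pop()
            if c not in comp:
                comp.add(c)
                stack.extend(nbrs[c] - comp)
        seen |= comp
        comps.append(comp)
    return comps

# Figure 4: b and c trigger on a, a on d, d on b; arcs among e, f, g assumed.
ref = {"b": "a", "c": "a", "a": "d", "d": "b", "e": "f", "g": "f", "f": "e"}
print(cliques(ref))   # two cliques: {a, b, c, d} and {e, f, g}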

Notation: G and NG are the set of clocks and the set of non-faulty clocks, respectively, in the system. There are N clocks in all, and up to m failures must be sustained.

Definition 6: A partition of G is defined as P = {G1, G2}, where G1 and G2 are subsets of G with the following properties:

(i) G = G1 ∪ G2,
(ii) G1 ∩ G2 ∩ NG = ∅,
(iii) Gi ∩ NG ≠ ∅, i = 1, 2.

From (i), each clock must belong to at least one of G1 and G2. From (ii), only faulty clocks may belong to both G1 and G2. From (iii), there must be at least one non-faulty clock in each of G1 and G2.

Definition 7: A clock a is said to be faster than a clock b in scenario S if a precedes b in S. In a partition P = {G1, G2}, G1 is said to be faster than G2 if every non-faulty clock in G1 is faster than every non-faulty clock in G2.

Notation: Given a partition P = {G1, G2}, NG1 and NG2 are the non-faulty clocks in G1 and G2, respectively. By Definition 6, neither NG1 nor NG2 can be empty, and NG1 ∩ NG2 = ∅.

Definition 8: Cliques A and B (of clocks) are said to be non-overlapping if the non-faulty clocks of A are either all faster than those of B, or vice versa.
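Definition 6 translates directly into set arithmetic. A minimal sketch (the function name is ours; clocks are arbitrary hashable labels):

def is_valid_partition(G, NG, G1, G2):
    """Definition 6: G1 and G2 cover G, only faulty clocks may lie in both
    halves, and each half contains at least one non-faulty clock."""
    return (G1 | G2 == G                   # (i)  every clock is covered
            and not (G1 & G2 & NG)         # (ii) no non-faulty clock in both
            and bool(G1 & NG)              # (iii) a non-faulty clock in G1...
            and bool(G2 & NG))             #      ...and one in G2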

Notation: Denote the position of a clock c in its own scenario S_c^i in gcci by p_c^i. Again, we shall frequently drop the superscript for convenience. The reference signal (i.e. the trigger) is a function of N and of p_c. It is denoted by f_{p_c}(N). By this, we mean that clock c triggers on the f_{p_c}(N)-th signal in S_c, not counting itself.

For the system to operate satisfactorily, all the non-faulty clocks must have their ticks close together. Also, they should tell good time, i.e. the length of every global clock cycle should be about the length of an ideal (or absolute-time) clock's inter-tick interval. These requirements dictate the following two conditions of correctness, C1 and C2.

Definition 9: Each of the following conditions of correctness must be satisfied in every gcci if the system is to operate correctly.

C1. For all partitions P = {G1, G2} of the set of clocks G, in which the non-faulty clocks in G1 are all faster than those in G2, each of the following (K1 and K2) must apply:

K1. If, in gcci, all clocks in NG1 trigger on clocks in G1, then there is at least one clock in NG2 that triggers on a clock in G1. Furthermore, if no clock in NG2 triggers on a clock in NG1, at least one clock k ∈ NG2 must trigger on a faulty clock h ∈ G1 such that in the scenario S_k, there is at least one clock r ∈ NG1 that is slower than the clock h.

K2. If, in gcci, all clocks in NG2 trigger on clocks in G2, then there is at least one clock in NG1 that triggers on a clock in G2. Furthermore, if no clock in NG1 triggers on a clock in NG2, at least one clock k ∈ NG1 must trigger on a faulty clock h ∈ G2 such that in S_k, there is at least one clock r ∈ NG2 that is faster than h.

C2. If a non-faulty clock z triggers on a faulty clock y, then there must exist non-faulty clocks z1 and z2 such that z1 is faster than or equal to y, and y is faster than or equal to z2. Either z1 or z2 may be z itself.

C1 prevents the formation of non-overlapping cliques, which would obviously destroy synchrony. C2 ensures that the system keeps good time, i.e. that each global clock cycle is close to being the clock cycle of an ideal clock. The non-faulty clocks define the range of acceptable timing values, and so any faulty clock value between two non-faulty values is also acceptable. If C2 did not hold, the entire system could be subject to the wiles of an erratically fast or slow faulty clock.

Finally, we assume that the transmission of clock signals through the system takes negligible time. This ensures that all non-faulty clocks are seen by all clocks in the same mutual order.
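Condition C2 is easy to state operationally: represent a scenario as a list of clock labels, fastest first (simultaneous ticks are folded into the total order of Definition 2). The following sketch (ours) tests the sandwich property:

def c2_holds(scenario, reference, NG):
    """scenario: clocks fastest-first; reference: the clock triggered on;
    NG: the set of non-faulty clocks."""
    if reference in NG:
        return True                    # a non-faulty reference is always fine
    i = scenario.index(reference)
    faster_ok = any(t in NG for t in scenario[:i])      # some non-faulty z1
    slower_ok = any(t in NG for t in scenario[i + 1:])  # some non-faulty z2
    return faster_ok and slower_ok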

3. MALICIOUS FAILURE AND CORRECT SYNCHRONIZATION

The phase-locked clock system for N=4, m=1 is simple enough to be proved correct by an exhaustive enumeration of all eventualities. It is, to our knowledge, the only phase-locked clock actually constructed and fully analyzed [8]. Here, the reference used is the second incoming pulse (in temporal order), i.e. the median pulse. Such a clock is proof against the malice of a single faulty clock. To give the reader a feeling for why this is so, and to enhance his intuition about malicious failure, we provide below a simple example.

Call the four clocks a, b, c, and d. Let d be the maliciously faulty clock. Because d is malicious, it may provide different timing signals (i.e. lie) to different receiving clocks. Since the non-faulty clocks by definition send their ticks at the same moment (i.e. do not lie) to all the other receiving clocks, the mutual ordering of the non-faulty clocks within every scenario is the same for all non-faulty clocks. That is to say, if clock b sees clock a faster than clock c in some gcci (i.e. clock a sends its i-th tick to b before clock c does so), then a will appear faster than c to both the other non-faulty clocks in the system, i.e. to a and c, in that gcci. Clock d, however, may appear in different positions in the scenarios of the non-faulty clocks, since it is malicious.

One way of proving that a four-clock arrangement works despite d's being malicious is to enumerate all possible actions of d and show that the system still continues to satisfy the conditions of correctness. Assume without loss of generality that a is prior or equal to b, which in turn is prior or equal to c, in some gcci. Consider a sample set of scenarios for our four-clock example. The clock each scenario's owner triggers on is noted in parentheses.

S_a: a < b < c < d   (a triggers on c)
S_b: a < d < b < c   (b triggers on d)
S_c: a < b < d < c   (c triggers on b)

The scenario S_d is irrelevant, since d is faulty. Notice first that the position of the faulty clock d changes relative to the others, while the mutual ordering of the non-faulty clocks remains unchanged, as indeed it should. It is easy to see that both conditions of correctness will be satisfied, and that the clock will operate correctly, if the above scenarios hold. It is not difficult to write down all the 4^3 = 64 possible sets of scenarios (with the ordering of the non-faulty clocks fixed as above) that are made possible by the arbitrary positioning of d, and to convince oneself that, for all possible scenarios, C1 and C2 are satisfied.
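The enumeration just described is small enough to run mechanically. The following sketch (ours, not part of the original analysis) checks the sandwich property of C2 in all 64 cases; checking C1 in full generality needs the partition machinery of Section 4 and is omitted here.

from itertools import product

GOOD = ["a", "b", "c"]          # non-faulty, in fixed mutual order a < b < c

def trigger(scenario, me):
    others = [t for t in scenario if t != me]
    return others[1]            # second incoming pulse: the median rule

for slots in product(range(4), repeat=3):   # d's position in S_a, S_b, S_c
    for me, slot in zip(GOOD, slots):
        scenario = GOOD[:slot] + ["d"] + GOOD[slot:]
        ref = trigger(scenario, me)
        if ref == "d":                       # C2: a faulty reference must be
            i = scenario.index("d")          # sandwiched by non-faulty ticks
            assert any(t in GOOD for t in scenario[:i]), (slots, me)
            assert any(t in GOOD for t in scenario[i + 1:]), (slots, me)
print("C2 holds in all 4**3 = 64 scenario combinations")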

Unfortunately, if we try to allow for m = 2, 3, ... by expanding the system arbitrarily, without sufficient care, the conditions of correctness can be violated. In fact, it is even possible for a system to contain an arbitrarily large number of clocks and still be vulnerable to just two malicious failures. To see this, consider the following example. Let us choose, for each clock y in the system, f_{p_y}(N) as the median clock signal in the scenario, not counting clock y. If N is odd (and there is thus an even number of "other" clocks), choose the slower of the two middle clocks. Then f_{p_y}(N) is only a function of N; we therefore drop the subscript for this example. Choosing the median signal is certainly good intuition.

Let there be only two faulty clocks, x1 and x2, and n = N-2 non-faulty clocks a_1, ..., a_n.

Case 1: N ≥ 7. Consider some gcci. Assume that a_k is faster than a_l in gcci if k < l. Now, let x1 and x2 present themselves as the fastest two clocks to a_1, ..., a_p, and as the slowest two clocks to the other non-faulty clocks, i.e. a_{p+1}, ..., a_n, where p = ⌈n/2⌉ = f(N)-1. Then, the set of scenarios can be represented as in Figure 5. Recalling that a clock triggers on the f(N)-th tick in its scenario, not counting itself, we can draw the trigger graph as in Figure 6. It follows that {a_1, ..., a_p} and {a_{p+1}, ..., a_n} will be two non-overlapping cliques, no matter how large n may be. It is easy to work out the case for N=7 to convince oneself of this fact.

Case 2: N < 7. This is trivial, and showing that the system is incapable of sustaining even two maliciously faulty clocks is left to the reader.

This has been a cautionary tale of the unbridled use of intuition in designing phase-locked clocks. Assured now that a more careful approach is needed, we turn in the following section to showing how to expand phase-locked clocks.
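The failure is easy to reproduce in simulation. In this sketch (ours), the two liars x1 and x2 place themselves at the head of the fast group's scenarios and at the tail of the slow group's, exactly as in Figure 5; every trigger then stays inside its own half, producing two non-overlapping cliques no matter how large N is.

N = 9                                  # any N >= 7 exhibits the failure
n = N - 2                              # non-faulty clocks a1 < a2 < ... < an
p = -(-n // 2)                         # ceil(n/2): size of the "fast" group
good = [f"a{i}" for i in range(1, n + 1)]
f_N = (N + 1) // 2                     # the median trigger rule

def scenario(i):
    """Scenario seen by non-faulty clock a_i (1-based), per Figure 5."""
    liars = ["x1", "x2"]
    return liars + good if i <= p else good + liars

triggers = {}
for i, me in enumerate(good, start=1):
    others = [t for t in scenario(i) if t != me]
    triggers[me] = others[f_N - 1]     # f(N)-th tick, not counting itself

fast = {triggers[a] for a in good[:p]}
slow = {triggers[a] for a in good[p:]}
print("fast group triggers:", fast)    # never a clock of the slow group
print("slow group triggers:", slow)    # never a clock of the fast group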

4. MAIN RESULT

Our job is to (i) find the lower bound, N, on the size of a system of clocks that must sustain up to m maliciously faulty clocks, and (ii) find the functions f_x(N) for x = 1, ..., N.

We begin with the following two lemmas.

Lemma 1: Condition C2 is satisfied if and only if there exist functions f_x(N) for x = 1, ..., N such that

min{m, x-1} < f_x(N) < max{N-m, x}     (1)

Proof: Let k be a non-faulty clock such that p_k = x. We must show that Eq. (1) holds for all x for which p_k is defined iff condition C2 holds.

Suppose that there exist functions f_x(N) for x = 1, ..., N satisfying Eq. (1). This implies min{m, x-1} + 1 < max{N-m, x} for all x ∈ {1, 2, ..., N}, leading to N > 2m+1. Hence, it is sufficient to consider the following three cases:

(i) x ≤ m: Clearly, max{N-m, x} = N-m and min{m, x-1} = x-1, and therefore x-1 < f_x(N) < N-m. If the reference clock is non-faulty, we have nothing to prove. If it is faulty, then since there are at most m faulty clocks, there must be at least one non-faulty clock slower than the reference clock, from the right half of the inequality, f_x(N) < N-m. Also, from the left half of the inequality, f_x(N) > x-1, and since clock k is non-faulty, there is a non-faulty clock (i.e. k itself) faster than the reference clock. So C2 is satisfied.

(ii) N-m ≥ x > m: Here min{m, x-1} = m and max{N-m, x} = N-m, and therefore m+1 ≤ f_x(N) ≤ N-m-1. Since at most m faulty clocks exist, if the reference clock in S_k were faulty, it must appear in S_k as slower than at least one non-faulty clock (the right half of the inequality) and faster than at least one non-faulty clock (the left half of the inequality), and C2 is satisfied.

(iii) N ≥ x > N-m: Here min{m, x-1} = m and max{N-m, x} = x, and m+1 ≤ f_x(N) ≤ x-1. As with the previous cases, there must appear in S_k at least one non-faulty clock that is faster than the reference clock, if the reference clock is faulty. Also, since k is non-faulty and appears in the x-th (i.e. p_k-th) position, there is at least one non-faulty clock, in particular clock k, that is slower than the reference clock in S_k, thus satisfying C2.

Conversely, suppose f_x(N) ≤ min{m, x-1}. Then C2 is violated when faulty clocks appear in positions 1, ..., f_x(N) of S_k. Similarly, if f_x(N) ≥ max{N-m, x}, C2 is violated when faulty clocks appear in positions f_x(N)+1, ..., N of S_k.³ Q.E.D.

³Once again, the addition of 1 occurs because a clock does not count itself when counting to f_x(N).
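The bound of Eq. (1) is mechanical to test for any candidate trigger rule. A sketch (ours; a rule is an ordinary function of the position x and of N):

def satisfies_lemma1(f, N, m):
    """Eq. (1): min{m, x-1} < f_x(N) < max{N-m, x} at every position x."""
    return all(min(m, x - 1) < f(x, N) < max(N - m, x)
               for x in range(1, N + 1))

fastest = lambda x, N: 1                # always trigger on the fastest tick
median = lambda x, N: (N + 1) // 2      # the Section 3 rule
print(satisfies_lemma1(fastest, 9, 2))  # False: m liars can mislead it
print(satisfies_lemma1(median, 9, 2))   # True

Note that the median rule passes for m = 2: Lemma 1 governs only C2, and what broke in the Section 3 counterexample was C1, not C2.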

Lemma 2: If all clocks in NG1 trigger only on clocks in G1 (where the notation is the same as in Definition 9), then the following are equivalent:

(i) q1 ≥ min_{k∈NG2} f_{p_k}(N), where q1 is the number of non-faulty clocks in G1;

(ii) K1 is satisfied.

Proof: (i) implies (ii): If (i) holds, then it is easy to see that no matter how the up to m faulty clocks in G arrange themselves, K1 is satisfied.

(ii) implies (i): Suppose, to the contrary, that q1 < min_{k∈NG2} f_{p_k}(N). Consider the nonempty set L = {y : y ∈ NG2 and f_{p_y}(N) = min_{k∈NG2} f_{p_k}(N)}. Assume that there are i ≤ m faulty clocks in G1. Since the faulty clocks may present themselves in any position in any scenario, consider the case where they present themselves in the scenario of every y ∈ L in positions q1+1, ..., q1+i. Then there is no non-faulty clock in G1 that is slower than the reference clock of any clock in NG2, a contradiction. Q.E.D.

The two theorems below yield the main result of this report.

Theorem 1: To ensure that, despite up to m malicious failures, the conditions of correctness are satisfied, the system must have N ≥ 3m+1 clocks.

Proof: We will only consider here the case of partitions P = {G1, G2} in which all clocks in NG1 trigger on clocks in G1. The other case (i.e. K2) can be dealt with similarly. Let there be q1 and q2 clocks, respectively, in NG1 and NG2. Let M = {y : y ∈ NG1 and f_{p_y}(N) = max_{k∈NG1} f_{p_k}(N)}. Let i ≤ m be the number of faulty clocks that belong to G1. Then the assumption that all non-faulty clocks in G1 trigger on clocks in G1 is equivalent to saying that one of the following Eqs. (2) and (3) must apply:

q1 + i ≥ max_{k∈NG1} f_{p_k}(N) + 1 = f_{p_y}(N) + 1     (2)

which applies if there exists at least one p_y, y ∈ M, such that p_y ≤ f_{p_y}(N). The addition of 1 follows from the fact that clock y does not count itself when counting to f_{p_y}(N). If p_y > f_{p_y}(N) for all y ∈ M, the following Eq. (3) applies:

q1 + i ≥ max_{k∈NG1} f_{p_k}(N)     (3)

First consider the case where Eq. (2) applies. The condition that C1 (more specifically, K1) holds implies, from Lemma 2, that

q1 ≥ min_{k∈NG2} f_{p_k}(N)     (4)

Since this must be true for all partitions of G, K1 can be written as: for all q1 = 1, ..., N-1,

q1 ≥ max_{k∈NG1} f_{p_k}(N) - i + 1  ⟹  q1 ≥ min_{k∈NG2} f_{p_k}(N)

In particular, this is true for q1 = max_{k∈NG1} f_{p_k}(N) - i + 1. Thus,

max_{k∈NG1} f_{p_k}(N) - i + 1 ≥ min_{k∈NG2} f_{p_k}(N)

or

max_{k∈NG1} f_{p_k}(N) - min_{k∈NG2} f_{p_k}(N) ≥ i - 1     (5)

Recall that this is true if Eq. (2) applies. Similarly, if Eq. (3) applies, we have from an identical argument

max_{k∈NG1} f_{p_k}(N) - min_{k∈NG2} f_{p_k}(N) ≥ i     (6)

Eqs. (2)-(6) must hold for all possible i. Since there are at most m faulty clocks, we must have

max_{k∈NG1} f_{p_k}(N) - min_{k∈NG2} f_{p_k}(N) ≥ m - 1     (5')

if Eq. (2) applies, and

max_{k∈NG1} f_{p_k}(N) - min_{k∈NG2} f_{p_k}(N) ≥ m     (6')

if Eq. (3) applies.

We first consider the case where Eq. (2) applies. We claim that it implies that N > 3m. To see why, let y be the slowest clock in M and z the slowest clock in L (with L defined as in Lemma 2). Then, due to Lemma 1 and Eq. (5'), the following inequality must hold:

max{N-m, p_y} > max_{k∈NG1} f_{p_k}(N) ≥ m - 1 + min_{k∈NG2} f_{p_k}(N) > m - 1 + min{m, p_z - 1}     (7)

The up to m faulty clocks in the system can arrange themselves in any order. In particular, they can so order themselves in S_y that p_y ≤ N-m, and so order themselves in S_z that p_z > m.

Since Eq. (7) must hold always, no matter what the faulty clocks do, we must have

N - m > max_{k∈NG1} f_{p_k}(N) ≥ (m-1) + min_{k∈NG2} f_{p_k}(N) ≥ (m-1) + (m+1)     (8)

from which we arrive at

N > 3m     (9)

Recall that this applies whenever Eq. (2) holds. If, instead, Eq. (3) applies, we can similarly show that

N > 3m + 1     (10)

Since we seek the smallest N that satisfies the conditions of correctness, we are done if we can show that there exist functions f_x(N) such that Eq. (2) always applies (and therefore Eq. (3) never applies), and for which Eq. (8) is satisfied. But we can always construct f_x(N) to (i) be monotonically non-increasing functions of x and (ii) satisfy Eq. (8): an example of such a construction is provided in the statement of Theorem 2 below. Hence Eq. (2) always applies, and N ≥ 3m+1 is the necessary condition. The case when all clocks in NG2 trigger on clocks in G2 can be treated similarly. Q.E.D.

Theorem 2: If N ≥ 3m+1 and

f_x(N) = 2m   if x < N-m,
f_x(N) = m+1  if x ≥ N-m,

then the conditions of correctness are satisfied.

Proof: f_x(N) as defined here satisfies Lemmas 1 and 2 and is monotonically non-increasing in x. Clearly, C2 holds. Also, it is easy to see that if N > 3m and Eq. (2) implies Eq. (4), then case K1 in Definition 9 will hold. We therefore only have to show that the definition of f_x(N) as given above satisfies Eq. (4) if Eq. (2) is satisfied. This can easily be verified by direct substitution. Case K2 can similarly be seen to hold. Q.E.D.
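Theorem 2's construction, together with the properties the proof leans on (the Lemma 1 bound, monotonicity in x, and the max-min gap of Eq. (5')), can be re-checked mechanically for small m. A sketch (ours):

def theorem2_f(x, N, m):
    """Theorem 2's trigger rule: 2m for fast positions, m+1 for slow ones."""
    return 2 * m if x < N - m else m + 1

for m in range(1, 6):
    N = 3 * m + 1                              # the minimum admissible size
    f = [theorem2_f(x, N, m) for x in range(1, N + 1)]
    assert all(min(m, x - 1) < fx < max(N - m, x)     # Lemma 1's Eq. (1)
               for x, fx in enumerate(f, start=1))
    assert all(a >= b for a, b in zip(f, f[1:]))      # non-increasing in x
    assert max(f) - min(f) == m - 1                   # the Eq. (5') gap
print("Theorem 2's rule verified for m = 1..5 at N = 3m+1")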

It should be noted that the set of functions f_x(N) is not always unique. From the proofs of Theorems 1 and 2, the following inequalities are sufficient:

(i) m+1 ≤ f_x(N) ≤ N-m-1 for all x = 1, ..., N;

(ii) f_x(N) ≥ m-1+f_{N-m}(N) for all x ≤ m+1;

(iii) f_{N-m}(N) ≤ f_x(N) ≤ f_{m+1}(N) for N-m > x > m+1;

(iv) f_i(N) ≥ f_j(N) if i < j.

The intervals x ≥ N-m and x ≤ m+1 arise from the up to m faulty clocks in the system. All that we can tell about the fastest non-faulty clock g in the system (this clock must have the maximum value of f_x(N)) is that, in clock g's scenario, it is among the first m+1 clocks. Similarly, all that we can tell about the position of the slowest non-faulty clock s in the system (which must have the minimum value of f_x(N)) is that it occupies a place among the last m+1 clocks. This leads at once to the intervals x ≥ N-m and x ≤ m+1.
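These four inequalities can be tested directly for any candidate family of functions. A sketch (ours; the encoding of (ii) and (iii) follows the reconstruction of the inequalities given above):

def admissible(f, N, m):
    """Sufficient conditions (i)-(iv) on a candidate family f_x(N)."""
    v = {x: f(x, N) for x in range(1, N + 1)}
    return (all(m + 1 <= v[x] <= N - m - 1 for x in v)                  # (i)
            and all(v[x] >= m - 1 + v[N - m] for x in range(1, m + 2))  # (ii)
            and all(v[N - m] <= v[x] <= v[m + 1]
                    for x in range(m + 2, N - m))                       # (iii)
            and all(v[x] >= v[x + 1] for x in range(1, N)))             # (iv)

m = 3
print(admissible(lambda x, N: 2 * m if x < N - m else m + 1,
                 3 * m + 1, m))    # True: Theorem 2's rule is one instance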

It is interesting to note that if conditions C1 and C2 are both satisfied, and the functions f_x(N) are monotonically non-increasing in x, then a stronger condition than C2 automatically holds.

Corollary: If the conditions of correctness are satisfied, with the f_x(N) being defined as monotonically non-increasing functions of x, then the following condition C3 holds.

C3. Every non-faulty clock necessarily triggers on either a non-faulty clock, or a faulty clock that is sandwiched between non-faulty clocks.

Proof: C3 follows immediately from C2 for all but the fastest and slowest non-faulty clocks. Consider the fastest non-faulty clock. In the course of proving Theorem 1, it was established that N-m > f_x(N) > m for all x = 1, ..., N, and that max_{k∈NG} f_{p_k}(N) - min_{k∈NG} f_{p_k}(N) ≥ m-1, leading to 2m as the smallest value for max_{k∈NG} f_{p_k}(N), where NG is the set of non-faulty clocks. From the monotonic nature (in p_k) of the f_{p_k}(N), the trigger for the fastest non-faulty clock must lie in the interval 2m+1, ..., N-m. But, since N ≥ 3m+1, any faulty clock in this interval must be sandwiched between non-faulty clocks. The proof for the slowest non-faulty clock is similar. Q.E.D.

5. DISCUSSION

In this report, we have shown how to construct phase-locked clocks that operate correctly in the face of up to a given arbitrary number of malicious failures. As we saw in the Introduction, the high overhead of the other methods of synchronization -- time overhead for software synchronization (i.e. the interactive consistency and interactive convergence algorithms) and hardware overhead for the Davies and Wakerly arrangement -- makes phase-locked clocks very attractive candidates for fault-tolerant clocking arrangements. Table 1 is a comparison of the techniques for synchronization in the face of malicious failure.

The phase-locked clock scheme treated here is completely different in nature from the other ways -- the interactive convergence and interactive consistency algorithms -- of handling malicious failure. Nevertheless, we have proved in this report that it requires the same necessary condition, N ≥ 3m+1, as software synchronization. It would be interesting to make a comparative study of the various algorithms that handle malicious failure and try to establish a unified theory from which the proofs for the interactive convergence and interactive consistency algorithms, as well as those of Theorems 1 and 2 of this report, would follow. We believe that the N ≥ 3m+1 requirement, which is common to all three algorithms, might be a fruitful starting point in the search for such a unified theory.

REFERENCES

[1] M. Pease, et al., "Reaching Agreement in the Presence of Faults," J. ACM, Vol. 27, No. 2, pp. 228-234, April 1980.

[2] J. Goldberg, et al., "Development and Analysis of the Software Implemented Fault-Tolerance (SIFT) Computer," NASA CR-172146, June 1983.

[3] J. H. Wensley, et al., "SIFT: Design and Analysis of a Fault-Tolerant Computer for Aircraft Control," Proc. IEEE, Vol. 66, No. 10, pp. 1240-1255, October 1978.

[4] C. M. Krishna, K. G. Shin, and R. W. Butler, "Synchronization and Fault-Masking in Redundant Real-Time Systems," Digest of Papers, FTCS-14, pp. 152-157, June 1984.

[5] D. Davies and J. F. Wakerly, "Synchronization and Matching in Redundant Systems," IEEE Trans. Comput., Vol. C-27, No. 6, pp. 531-539, June 1978.

[6] A. L. Hopkins, et al., "FTMP -- A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft," Proc. IEEE, Vol. 66, No. 10, pp. 1221-1239, October 1978.

[7] A. W. Holt and J. M. Myers, "An Approach to the Analysis of Clock Networks," NASA Contract Report CR-166028, November 1982.

[8] T. B. Smith, "Fault-Tolerant Clocking System," Digest of Papers, FTCS-11, pp. 262-264, 1981.

[9] F. Harary, Graph Theory, Addison-Wesley Publishing Co., New York, 1969.

[Figure 1. Overhead of the Interactive Convergence Algorithm: projected overhead versus resynchronization interval (100 ms), plotted for clock drift rates ranging from ρ = 1×10⁻⁶ to ρ = 1×10⁻³.]

[Figure 2. Davies and Wakerly's Multistage Synchronizer: a processor cluster followed by synchronizer clusters 1 through m, exchanging Go/Ready signals.]

[Figure 3. Component of a Phase-Locked Clock: external clock signals enter a receiver/voting circuit; the resulting reference feeds a comparator, whose phase-error output passes through a filter that controls the oscillator generating the local clock.]

[Figure 4. Trigger Graph: An Example]

S_{a_1}:     x1  x2  a_1  a_2  ...  a_n
  ...
S_{a_p}:     x1  x2  a_1  a_2  ...  a_n
S_{a_{p+1}}: a_1  a_2  ...  a_n  x1  x2
  ...
S_{a_n}:     a_1  a_2  ...  a_n  x1  x2

Figure 5. Scenarios for the Example in Section 3

[Figure 6. Trigger Graph for the Example in Section 3: the non-faulty clocks split into two non-overlapping cliques, {a_1, ..., a_p} and {a_{p+1}, ..., a_n}.]

Technique                             Min. cluster size   I/O ports          Overhead
Expanded phase-locked clock           3m+1                9m²+3m             Negligible
Multistage synchronizers              2m²+3m+1            8m³+16m²+10m+2     Negligible
  (Davies and Wakerly)
Interactive convergence algorithm     3m+1                9m²+3m             Considerable
  (Goldberg, et al.)

m = number of faulty processors accommodated

Table 1. Comparison of Synchronization Methods