THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY OPTIMAL DESIGN AND USE OF RETRY IN FAULT TOLERANT REAL-TIME COMPUTER SYSTEMS Yann-Hang Lee and Kang G. Shin CRL-TR-28-84 May 1984 Room 1079, East Engineering Building Ann Arbor, Michigan 48109 USA Tel: (313) 763-8000

OPTIMAL DESIGN AND USE OF RETRY IN FAULT TOLERANT REAL-TIME COMPUTER SYSTEMS1 Yann-Hang Lee and Kang G. Shin ABSTRACT In this report, we present a new method for (i) determining an optimal retry policy and (ii) using retry for fault characterization. First, we derive an optimal retry policy for a given fault characteristic, which determines the maximum allowable retry durations so as to minimize the total task completion time. Then, we carry out the combined fault characterization and retry decision, in which the characteristics of fault are estimated simultaneously with the determination of the optimal retry policy. We have developed two solution approaches; one is based on the point estimation and the other on the Bayes sequential decision. The maximum likelihood estimators are used for the first approach, and the backward induction for testing hypotheses in the second approach. We also present numerical examples in which all the durations associated with faults (i.e. active, benign, and inter-failure durations) have monotone hazard rate functions, e.g., exponential, Weibull and gamma distributions. These are standard distributions commonly used for modeling and analyses of faults. Categories and Subject Descriptors: B.2.3 [Arithmetic and Logic Structures]: Reliability, Testing and Fault-Tolerance -- hazard rate function, recovery overhead, optimal retry policy, fault characteristic; G.3 [Probability and Statistics] -- estimation, censored sampling, likelihood ratio, sequential or Bayes decision problem, hypotheses testing. 'This work was supported in part by NASA under Grant NAG 1-296. Any opinions, findings, and conclusions or recommendations expressed in this report are those of the authors and do not necessarily reflect the views of NASA. Authors' address: Division of Computer Science and Engineering, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109. All correspondence should be addressed to Prof. Kang G. Shin at the above address.

1. INTRODUCTION There are three types of fault in computer systems: transient, intermittent, and permanent [1]. Transient faults die within a certain time of their generation, intermittent faults cycle between being active and inactive, and permanent faults are (as the term indicates) permanent. It has been found that permanent faults form but a small fraction of the faults in computer systems [2,3]. This makes the purging of any faulty components as soon as they have been discovered an inefficient means for handling redundancy. If the active duration of a transient or intermittent fault is short, the continuation of the task with the same resource after the disappearance of the fault may be more efficient than that of using other recovery methods. Unfortunately, it is impossible to tell at its first occurrence whether or not the fault is permanent and also impossible to know its active duration if the fault is intermittent or transient. Moreover, it would be much more efficient (time-wise) and accurate to characterize faults on-line, and then take the appropriate recovery actions. In this report, we propose (i) determining an optimal retry policy so as to minimize the task completion time, and (ii) using retry in conjunction with statistical estimation and decision theory to characterize faults. We obtain the optimal retry duration in the face of uncertainty about the nature of a detected fault. Since our focus is on real-time systems, we are principally concerned with skewing the density function of the task completion time as much to the left as possible. For this reason, we shall concentrate on maximizing reductions in response time. As the term implies, retry consists of restoring the affected process to some faultfree initial state, and then re-running it on the same processor. Clearly, retry is only applicable when the error induced by a fault is confined and the process can be restored to integrity. The most efficient means for fault confinement are signal-level detection 1

mechanisms which can detect faults immediately upon their occurrence [4]. For the process to be restorable to integrity, scratchpad registers are needed. Results obtained by Carter, et al. [5] indicate that self-checking and retry mechanisms can be incorporated into processors inexpensively, and without substantially degrading performance. Currently, several commercial machines incorporate retry. In the Honeywell 6000 [6], instruction retry is reported to approach an effectiveness rate of 100%. Retry in the IBM 360 and 370 series machines is widely used in the peripheral areas (I/O and storage) as well as in the central processor [7]. The UNIVAC 1100/60 uses a hardware timer that goes off after an interval judged to be long enough to allow transient faults to die out, upon which retry can be effected [8]. However, no discussion or justification about the retry duration or the number of retry attempts used has been addressed. The usefulness of retry mechanisms arises, as we said above, from (i) the smallness of the proportion of permanent faults in any computer system, and (ii) the fast recovery from non-permanent faults and thus the small task completion time. In the case of a permanent fault, to retry a process on the affected processor is worse than useless: it is a waste of time. To hasten the completion of the executing task when a fault is detected, we must control the duration of retry to maximize the difference between the expected gain in response time that results from using retry when the fault is transient or intermittent (in some cases), and the expected loss that results from using it when the fault is permanent or intermittent (in other cases). Our object in this report is to derive the maximum allowable retry duration r' when a fault is detected. If the retry succeeds with this duration, the execution continues. If not, other methods for error recovery, e.g., rollback or restart following the system reconfiguration, must be used. 2

In addition to the performance gain in case of a successful retry, the characteristic of a fault can be monitored through retries. For instance, a retry which succeeds after the retry duration r' implies that the active duration of the fault is also less than or equal to r.2 Even when the retry fails, it indicates that the active duration of the fault is greater than r'. On the other hand, the detection of fault gives information regarding the duration of fault occurrence and the benign duration of an intermittent fault. Thus, it becomes possible to observe the nature of fault through both retry and detection mechanisms. Note, however, that the information obtained from retry is censored, since for example, in case of an unsuccessful retry, the sampling via retry is stopped while the associated fault is still active. With the censored information, the problem of estimating the nature of fault is the same as the design of experiments in the sequential analysis where the experiments are described by the retry policy, and the sampling is analogous to detection and retry. The report first presents a brief description of fault models in Section 2. It should be obvious that r' will depend on fault behavior, and in Section 3, we begin with how to derive it, given the fault characteristics. When quantitative descriptions of fault behavior are hard to come by in the real environments, the combination of retry and detection enables us to observe the fault characteristics, while determining the optimal retry policy. We counter this in Section 4 by showing how to use statistical estimation theory to create a system that learns via retry the fault characteristics as it goes, and therefore becomes increasingly more "optimal" in the sense of minimizing the task completion time. Due to its repetitive reappearances, retry of an intermittent fault is a renewal process. In Section 5, we apply the Bayes sequential decision to fault characterization and This is not really true due to the fault latency. The fault latency [4J which is defined as the interval between the moments of fault occurrence and error generation has no effect on retry. Thus, we simply ignore the fault latency in the consideration of retry. 3

retry decision. The backward induction for testing hypotheses is also presented as a solution to the sequential decision problem. The report concludes with Section 6. In what follows, we will use continuous retry durations instead of the number of retry attempts.3 2. BASIC MODEL OF FAULTS Let a unit be the smallest hardware system component that the detection mechanisms can distinguish when a fault is detected, e.g., a processor module can be viewed as an assembly of such units. Hence, the term "module" will mean a system component larger than a unit. Typically, a module is formed by a set of units. For each unit, the fault's behavior can be modeled by a three-state stochastic process as in Figure 1. Denote these three states, namely non-faulty, fault-active, and fault-benign by NF, F and FB (see [4] for their detailed descriptions). At NF, no fault exists in the unit. Transition from NF to F indicates the occurrence of a fault. If the fault is intermittent and becomes benign following an active duration, the state of the unit changes to FB. The unit may move back to F when this intermittent fault recurs -- this is referred to as the reappearance of the intermittent fault. If the fault is transient and disappears, the unit will transfer from F back to NF. The model similar to this has been widely used in the reliability analyses and the modeling of faults [4,9,10]. Let Tf, 7T, T! and T1 denote the duration between two successive fault occurrences, the active duration of a transient fault, and the active and benign durations of an intermittent fault, respectively. These durations are random variables with distributions Ff, Ft, F? and F!, and density functions fi, ', f, and A, respectively. For simplicity, we 3Conversion between a retry duration and its corresponding number of retry attempts is not difficult as discussed in Conclusion. 4

assume that these durations are mutually independent and that the causes of triggering different types of fault are not correlated. The latter assumption implies that the occurrence of any type of fault can be modeled as a Bernoulli process with probabilities Pt, P,, and pp for transient, intermittent and permanent faults, respectively. Thus, the characteristics of a fault can be represented by a 7-tuple CIE (Pt, Pi, Pp, F F, FT,,, ) I Pt+P.+p,=1}. Usually, the mean time between the occurrences of fault, E[T%, is much larger than any other durations. Thus, it is reasonably accurate to assume that there is at most one fault in a given unit at any moment. In addition, it is assumed that the reappearance of an intermittent fault is never mistaken for the occurrence of a new fault within a unit.4 Following a successful retry, the detection mechanisms should be able to recognize the type of fault in a unit by continuously monitoring normal operation. When the detection mechanisms find the same unit failing again within a short period, the unit is declared to have an intermittent fault. If the fault has disappeared for a long period, it is regarded as a transient fault. 3. OPTIMAL RETRY POLICY FOR GIVEN Cr 3.1. General Problem Statement Once a fault is detected, it is necessary to take a proper sequence of actions such as fatllt isolation, system reconfiguration, and recovery. For convenience, define the recovery overhead as the total time required to resume the normal system operation in case of the detection of a fault; this is a system-oriented view. On the other hand, the occuirence of fault may delay the completion of the executing task; this is a task4This is the very reason why the term "unit" is introduced here. 5

oriented view. These two views are equivalent when the fault is transient or permanent. For an intermittent fault, since it appears and disappears repetitively, the accumulated overhead of retry could become unacceptably large (eventually infinite). In view of this fact, an intermittent fault has the same undesirable effects on computer performance as a permanent fault when retry is used as the sole means of recovery. Consequently, as far as the minimization of the expected recovery overhead (i.e. the system-oriented view) is concerned, an intermittent fault can be regarded as a permanent fault and hence retry should not be used when detection mechanisms find again the same fault that was detected but became inactive during the last retry. Once intermittent faults are treated just like permanent faults, the optimal retry policy of minimizing the total recovery overhead becomes equivalent to a special retry policy (for transient faults) which minimizes the task completion time. More on this will be discussed near the end of this subsection. A most attractive gain from retry is the rescuing of the executing task, i.e., the task-oriented view of retry. Suppose there is enough redundancy so that the system may be reconfigured and the affected task may be migrated to other fault-free modules when a module becomes faulty (due to one or more faulty units within the module). It is obvious that no task should be started on any faulty or potentially faulty module (having one or more units with benign intermittent fault(s)). Consider a practical case in which a module (i) becomes faulty once and gets back to normal during execution of a task, and (ii) never becomes faulty again before the task is completed. In such a case, it is possible to avoid the overhead of migrating and restarting the task by means of a successful retry, leading to a fast completion of the task. Even if the fault that occurred was intermittent, retry is the best recovery method when its active duration is short and benign duration is long, insofar as the completion time of the executing task is

concerned. Considering the task-oriented view of retry, we will derive an optimal retry policy for a given fault characteristic, Cf, which minimizes the expected task completion time when a fault is detected during the task execution. Such an optimal policy would be of significant value to real-time applications where small task completion times are of paramount importance. Let x0 denote the computation time initially needed to complete the task under a fault-free condition. When a fault is detected, the amount of computation remaining to complete the task, i.e., residual computation, is denoted by x, where O<z<zx. For the n-th detection of the same fault when the residual computation x and the characteristic Cl are both given, the optimal retry policy should specify a mazimum allowable retry duration, r,(x,Cf). When n=l, the detected fault may be transient, intermittent, or permanent, since the fault type is unknown; but it is intermittent if n>2. Since the optimal retry policy is to minimize the time necessary to complete the residual computation x regardless of what has happened in the past, we have the following lemma. Lemma 1: r,'(,C) = r*(x,Cf) for all i, j > 2. Thus, we have to consider two maximum retry durations for two different cases: the case when a new fault is detected, and the case when an old intermittent fault is detected again. Let R= {(rl(z,Cf), r2(z,C/)) I O<z<zo) be a retry policy where the maximum retry durations is r1(z,C1) or r2(X, C) for the detection of a new fault or an old intermittent fault with the residual computation z and fault characteristic C,. For notational simplicity, we shall use r,, whenever convenient, in the sequel, to represent r,(z,Cl), i=1,2. Also, denote the expected times needed to complete the residual compu7

tation z by Vl(z,C, R), V 2(,C,,R), V3(z,Cf,R), and V4(x,C1,R) when the system is in the following situations: execution starts/resumes on a non-faulty module, a new fault is detected, an old intermittent fault is detected again, and execution continues following a successful retry for an intermittent fault, respectively. Based on transitions among these situations, one can derive the following recursive equations: V(x,C/,R) = {1-F1(Z)}z + o {t + V2(z-t, C,R)}dFl(t) (1) V2( z,CR) = p, fo { t+ Vl( Z CfR)} dF t) + i (+ V4(z c, CR)) d~F(t) + (1 - Pt F(r) - p rl")}({ V1( a(z), C,R) + r, + t8} (2) V3(z,Cf,R)= {1-F(r2)}{ V(a(z),CfR)+r2+to} + fo{t+ V4(z,Cf,R)}dFt) (3) V4(,, Cf,R) = {1-F;(x)}+ f+ o {+ v(- t, CR) } dF( t) (4) where a(z) is the residual computation needed when the system applies recovery methods other than retry, e.g., x(z)=az if restart is used following an unsuccessful retry, and t8 is the set-up time necessary for system reconfiguration and re-initialization. The optimal retry policy, R'={(rl(z,Cf ), r2(z,G)) I O<z<zo}, should minimize both V2(z,CfR) and V3(x,Cj,R) for all x, since retry is directly applicable only to the second and third situations. (As such, they are explicitly dependent on rl and r2.) Obviously, this policy also minimizes V1(x, C,IR) and V4(z,C,,R). Since the mean time between failures is usually much longer than the other durations, Vl(x,Cf,R) can be accurately approximated by z. In general, there are no closed form solutions for r;(z,Cf) and r2(z,Cf). However, these optimal retry durations can be calculated numerically without difficulty as explained below. With the initial condition V4(O,C1,R)=O, V3(Z,C1,R) and V4(z,C1,R) can be calculated iteratively using Eqs. (3) and 8

(4) for any given R. Thus, one can numerically determine r2(z,C) so as to minimize V3(z,C, R); V4(z,CiR) is also determined. Once V4(z,CR) is known, one can easily compute V2(z, C,R) and therefore r*(z, Cf). When the recovery overhead in place of the task completion time is to be minimized, r2(z,Ci)=O for all zE(O,zo). In this case, the recovery overhead can be expressed as V2(x,C,R)- Vl(z,Ci,R), which is the time spent to restore the system to its state immediately before the fault is detected. The optimal retry duration r;(x,Cf) can be determined through Eqs. (1)-(4) just as we can compute that for minimizing the task completion time. Consequently, we will in the sequel deal with the task completion time only. 3.2. Fault Active Durations with Monotone Hazard Rate Functions Since T: is a continuous random variable, one can assume that f(t) is continuous in [O,oo). The hazard rate function of the active duration of an intermittent fault is defined by h(t) ---. When the hazard rate function of the active duration of an intermittent fault is monotonically increasing, constant, or monotonically decreasing, the optimal retry duration r2 exhibits interesting properties. These properties play a significant role in determining the optimal retry policy, since the time durations associated with faults are usually modeled to have monotone hazard rate functions. Typical distributions with monotonically increasing hazard rate functions include the gamma and the Weibull distributions with the shape parameters greater than 1. When their shape parameters are less than 1, they have monotonically decreasing hazard rate functions. The exponential distribution has a constant hazard rate. Consider first the nondecreasing hazard rate function which leads to the following theorem. 8

Theorem 1: When ha(t) is monotonically non-decreasing in t, r2=O or r2 —. Proof: Differentiating Eq. (3) with respect to r2, we obtain aV(CR)= -f(r2)[V4(x, CR) + - - { Vl(a(z),C,R) + t,}] Or2 hI(r2) =/ f (r)[ V(x,CR) + 1 - {a()+ }] (5) f' (r2) Since V4(x,Cf,R) is independent of the past and current retry durations r2(y,Ci) where y> 2,5 V4(z, CfR) + - { a(x)+ t} is non-increasing in r2(zx, C). If there is an r such hl(r2) that V4(x,Cf,R) + 1 - {a(x)+t,}<O, then the first derivative of V3(x,Cf,R) with respect to r2 is negative for all r2> r. Thus, r =oo. If such an r does not exist, the derivative is always non-negative, implying that V3(x, CR) is monotonically nondecreasing. This results in r2=O. Q.E D. Following the definition of r (z,C1), r2(z,Cf)=O implies that no retry be attempted for reappearing intermittent faults, whereas r2(z,C)=oo means that the retry should be applied until the intermittent fault becomes benign. Corollary 1: When ha(t) is monotonically non-decreasing in t and if there exists an z2 such that a(x2) + t8 - x2- (R(42)-1)E[Tl] = 0 where RM(x) is the renewal function [11] corresponding to the distribution F1(t), then r2(x,Cf)=oo if x~< and r(z,C1)=0 otherwise. Proof: From Theorem 1, r2(x,Cf) is either oo or 0. When r2(x,C1)=oo, there exists an r 5 Note that the probability of having a zero benign duration of an intermittent fault should be zero, i.e. Prob( 1T=0)=0. Otherwise, no useful computation can be done. 10

such that Eq. (5) becomes negative. Since V4(x,Cf,R) is monotonically non-decreasing functions of x, there also exists an r such that Eq. (5) becomes negative when the residual computation is less then z. Thus, r2(y,C1)=oo for all y<z. Since both the active and benign durations are mutually independent, we have V4(X,Cf,R3 = x + {E[N(x)] - 1}EIT:I where Nix) is the number of reappearances of the intermittent fault during the residual computation x, namely N(z)=inf {n; 7 l~k~>} where TXk is the benign duration folk=1 lowing the k-th occurrence of the intermittent fault. The expected value of N(x), E[Mx)], is equivalent to the renewal function Rb(x) corresponding to the distribution Fb(t). Also, r2(x,C )=oo if and only if V3(x,CR)I2=oo < V3(Z,Cf,R)lr0, i.e., f0(t+ V4(x,CfR)dFr(t) = E[T?+V4(x,CfR) < a(z)+t_ From the equality in the right-hand side of the above equation, we obtain 2 and thus the Corollary is proved. Q.E.D. Theorem 1 can also be viewed as below using the concept of atochastic ordering between two random variables. A random variable X is said to be stochastically larger than the other random variable Y if Prob(X>t)_Prob(Y>t) for all t [12]. Let T?(lr) be the remaining life of the intermittent fault after retry has been applied for the duration r. When the hazard rate function is non-decreasing, T?(lr) is stochastically larger than Ta(ls) provided r<s. Thus, for all s>r, if it is worth continuing retry beyond the retry duration r (in the sense of minimizing the task completion time), then we should continue the retry even after the retry duration a. Consequently, the retry continues until the intermittent fault disappears. 11

Note that when the hazard rate function is non-decreasing, Z2 is determined by the mean active duration and is independent of the shape of the distribution. x could become negative when E[l1i is large, that is, intermittent faults have a long active duration. In such a case, Corollary 1 implies that no retry be applied for intermittent faults. On the other hand, if the set-up overhead t8 is large, z2 could be even larger than 7o, implying that retry be used as a sole means of recovering from an intermittent fault. When the hazard rate h?(t) is decreasing, the nice properties stated in both Theorem 1 and Corollary 1 do not exist. However, there exists at most one root of Eq. (5) that minimizes V3. In such a case, since there is no closed form expression of V4(z, Cf,R'), we have to resort to (less elegant) numerical techniques for determining both r.(z, C1) and r;(x, C) as was previously mentioned. Several numerical examples, in which restart is used as a sole means of backup recovery, i.e. a(x)=xo, are shown in Figures 2 to 4. In these figures, the durations are normalized with respect to x0, and the active duration of the intermittent fault is assumed to have the gamma or Weibull distribution. Figure 2 presents x2's when the shape parameters a's of the gamma and Weibull distributions, respectively, are greater than or equal to 1. Figures 3 and 4 show the optimal retry duration r2(Z,C/); the solid lines for a< and the dashed lines for ac-1. Note that for the gamma distribution h?(t) approaches - as t —oo where /9 is the scale parameter. Thus it is possible for the derivative of V3 to be negative, (i.e. Eq. (5) becomes negative), implying r2(z,Cf)=o. For the Weibull distribution with a<l1, r2 never becomes oo since h?(oo)=0. Consider the case where Ti, Ta and iT' are all exponentially distributed with the parameters r, p, v for the transient fault disappearance rate, the intermittent fault disappearance and reappearance rates, respectively. Since 4(t)=ve-vt, the renewal 12

function R!(z) becomes 1+vz. From Corollary 1, we have r2(z, C/)=oo if zx< 2 and r(x,Cfj )=0 if Z>Zx, where z2 -= p (zO + ts- 1 ). Since V4(z,CI,R) implicitly depends L+L p on r2 via V3, we can express V4(x,Cf,R ) as: (1 +-)z if z<z V4(,CJ,R') = +t e if x> (6) + to + -e - +-4 V p v The derivative of V2(x,Cfi,R) with respect to r1 becomes: a V2(,C1'R) p= p + {1 - (z+t-)r} + pe'[1 - {zo+t,-V4(zx,cfR)}lp (7) With r2(x,Cf) as determined in Corollary 1 and V4(z,Cj,R) as in Eq. (6), Eq. (7) can have at most two roots. The optimal retry duration r*(z,Cf) can be obtained by examining V2(x,Cf,R) at the boundaries, rl=0 and rl=oo, and the roots of Eq. (7). Note that r; cannot be infinite as long as pp>O. Unlike r2, r; does not have to be zero when z>z2. Several cases of V2(z,CfR) as a function of rl are shown in Figure 5 where all parameters are normalized with respect to x0. The case 2 in Figure 5 shows an example for which two positive roots of Eq. (7) exist. Figure 6 presents some numerical results on rj(x,C1) as a function of x. Note that 22 depends upon the ratio of v to p, whereas r; varies as Pt, Pi and pp change. 4. OPTIMAL RETRY POLICY AND PARAMETER ESTIMATION In Section 3, we have derived an optimal retry policy for a given fault characteristic Cf. It is, however, very difficult in practice to know a priori the fault characteristic. Even if the fault characteristic is measured during device manufacture, it may well vary as the execution environment and the executing tasks change. Another factor that 13

makes the fault characteristics time-variant is the aging of components, e.g., the bathtub curve of the failure rate as a function of time [1]. Thus, it is important to determine an optimal retry policy for uncertain fault characteristics. Note that retry not only provides an efficient recovery of task execution, but also monitors the behavior of a fault present in a hardware unit. Naturally, it is desirable to integrate the estimation of the fault characteristics and the control of the maximum retry durations into a single decision problem. In such a case, the computer system has to adjust its retry policy using the information on the fault behavior collected during its past retries. The detection mechanisms6 can be useful in estimating the duration between two successive fault occurrences or the benign duration of an intermittent fault. Note that this information is crucial in specifying the behavior of fault occurrence or reappearance. Consequently, the fault characteristics would become well-defined if a good estimator were used. Moreover, retry may lead to an indication of the active duration of a transient or intermittent fault, which is, on the other hand, affected by the retry policy applied. The information collected is incomplete in case of an unsuccessful retry, since the retry is stopped while the associated fault is active. In what follows, we consider the estimation of the characteristic of an active fault and the simultaneous determination of an optimal retry policy which minimizes the task completion time. Note that the probabilities of having a permanent, transient, or intermittent fault are crucial to the determination of r;(z,Cl) but unrelated to that of r2(zx,C). It implies that correlations among successive retry durations during the execution of a task do not depend on these probabilities. Thus, to minimize the task completion time, it is assumed that these probabilities are determined a priori from the previous observations 6As was pointed out, we mean here the signal-level detection of faults [4]. 14

of fault occurrence. These probabilities can be estimated accurately if a sufficiently large amount of data of fault occurrence were collected. If a sufficient number of samples has not yet been obtained, the measured results as in [2,3] have to be used instead. Recall that in the determination of r2, the transient fault is excluded. Also, if pt, P, pp are determined c priori, the effects of retry on the task completion time is a linear combination of the effects of transient and intermittent faults when a new fault is detected. This implies that the same technique can essentially be used to deal with the unknown parameters for both transient and intermittent faults. Consequently, we confine ourselves to the case where the density function of the active durations of transient faults is known and the active durations of intermittent faults have the density function form Jf(tlO) with the parameter 0 unknown. (Note that 0 could be a vector if there are two or more parameters, e.g., the shape parameter a and the scale parameter, for the Weibull and gamma distributions.) The samples obtained from retry can be represented by a 2-tuple (I, t) where I is a single-bit flag and t indicates a duration. I=0 represents a successful retry, and hence t indicates the active duration of the fault. On the other hand, when a retry fails, I=1 and t is the retry duration. Let (Ii, ti) i=1,2,..n denote the past samples related to the active duration of an intermittent fault. These resulting samples are type I progressively censored, following Cohen's definition in [13] with continuous censoring times. There are several different types of estimators conceivable for estimating the parameter 0 on the basis of these progressively censored samples. For the Weibull anld gamma distributions, the maximum likelihood estimators have been widely studied as in [13-17] when the samples are progressively censored. For simplicity (but not because of difficulty), we shall employ the maximum likelihood estimator 0 of 0 in the sequel. 15

When the fault is still active even after the current retry has been applied for the duration r, we shall have collected an additional sample (1, r) via the current retry. Let 0(r) be the maximum likelihood estimator of 0 which is based on the samples including up to the current sample (1, r). Following the current retry duration r, the maximum likelihood function becomes L() = t —17(Ii,tj,) t/(1,r,$) (8) where r(I,t,0) is defined as: AI(tlO) if I=0 ~(I-,t,0) = I -rtie) if I-1 The maximum likelihood estimator 0(r) should maximize L(@) or log L(0). Let the optimal retry durations based on the estimated 0(r) be denoted by rk(z,(r)) k=1,2 for a newly detected fault and an old intermittent fault, respectively. Use the notation C/0(r)) to indicate that the active durations of intermittent faults have the density function rf(tlj(r)), and let R(r) denote the policy that the maximum allowable retry duration for the current retry is r. Then, the direct solution of the optimal retry duration is to find the minimum of Vk(x,CXI(r)),R(r)) k=1,2,3,4. Notice that the retry duration r not only appears in the integral equations (2) and (3), but also affects the fault characteristic C1. Under certain conditions, it can be proven that r*(xz,(r)) is a non-increasing function of r. We will first derive the results under such conditions, the application of which to a more general case is then discussed later in this section. For the former case, the optimal retry duration r for the current retry can be readily obtained by the following 18

theorem. Theorem 2: When (a) the active duration of an intermittent fault has the density tl (tj)) function f~(tlO), and (b) for t,>~tk the ratio -- a non-decreasing likelihood ratio funtin litO) ad b) ort r(tlo(tk)) [18] -- is non-decreasing in t, then the optimal retry duration is determined by: r = inf { r;,e( r)) < r} (9) To prove this theorem, we need the following three lemmas. Lemma 2. Under the same conditions as in Theorem 2, let T, and Tk be random variables with the density functions I?(tlO(tj)) and?(tl(tk)), respectively, and 'I(t) be a non-decreasing function of t, then E[I( T)]~>_E[*I,(Tk)] provided ti> tk. Proof of this lemma follows immediately from Lemma 2 of Chapter 3 in [18]. Let ha( tlO(t,)) be the hazard rate function when the density function of i' is fl(tle(tj)). The following Lemma gives the ordering of h?(tjl(tj)) with respect to tj. Lemma 3. Under the same conditions as in Theorem 2, h(tlO(tj)) is a non-decreasing function of tj for every fixed t. Proof: For t > tk, we have in( i)) < II( ti)) for all a>t. This inequality implies. f(IO(tk)) - js(810(tk)) ha tlo(t)) f< f(uIl(tk))du 1-P(tle(tk)) Thus, h(tlI(tj))>hI(tIl(tk)) if t> tk. /r(tle(q)) -- t r(u I(I tk))du I-PPI))du Q.E.D. 17

Let V(Xz,0(t)) = minVk(z,O(t),R) k-2,3,4 where 0(t) is used in place of CXO(t)). R Note in this case that the active duration of the intermittent fault is distributed with the parameter O(t) and that all the other distributions are known. Lemma 4. Under the same conditions defined as in Theorem 2, if tl> t2, then (i) Vk(z,0(t)) > Vk(z,0(t2)) k=-2,3,4, (ii) r;(z,B(tl)) < r*(xz,(t2)) for k=1,2. Proof: The proof for k=3,4 is done by mathematical induction. Let Vk,n(zx,(ti),r2(n,j)) k-3,4 be the expected times needed to complete the residual computation z when there are at most n retries to be attempted following the current one, and let r2(n,j) be the maximum retry duration allowed. Also, let the optimal retry duration to achieve the minimum V,(xz,0(tj)) be r2(n,j). For n=O, V4,0(x,(tj))=z and 00 V3{( (XO(t1)) - V3,,0(X't2),r2*(O,)) )= fSo 4(t,x,r(0,1)) { tlIO(t))-f,(tlO(t2))} dt where '(t,z,y)=t+x when t<y and y+-Zo+t8 when t>y. Since I(t,x,r~) is non-decreasing in t, the right-hand side of the above equation is non-negative as a result of Lemma 2. Also, since V3,0(x,0(t2)) is the minimum when the active duration of the intermittent fault has the density function jT(tl0(t2)), then V3,O(x,O(tl)) > V3,0(X,8(tl),r2(O,1)) > V,0(,t2)). Suppose that V*,,(x,0(t1)) > Vn(x,0(t2)) and V;n(z,(tl)) V4,,n(z,(t2)) for all z provided t1>t2. It is obvious to see from Eq. (4) that V;,+l(z,0(tl)) > V4,,+,(x,(t2)) for all x. Thus, V3*,n+(^tl))( - V3s,+1(x8(t2),r(n+,1)) > 18

I(t, V;.nl(zx,( tl)),r;(n+l,1)) is non-decreasing in t<r2~(n+1,1). Also, since r2(n+1,1) is the optimal retry duration, V 0,+l(x,B(tl)) < x0+t8. Hence, '(t,V4,n+l(x,O(tl)),r2(n+1,1)) is always non-decreasing in t. The right-hand side of the above equation becomes nonnegative, resulting in V3 n+l(Z,(tl))~ V3, n+l(x,0(t2),r2(n+1,1))> V* n(x( 2))* By mathematical induction, we have V(x,O(tl))_ V;(x,O(t2)) for k —3,4. To prove r2(xz,O(tl))r2(xz,(t2)), the following cases are examined. When r2(x,O(tl))=O, the relation is always true. When r~(z,O(tl))>O, using Lemma 3 and the first part of this proof, the derivative of V3(x,6(tj),R) with respect to the retry duration r has the following ordering relationship for all r and tl t2. V;(Z,~( tl))- + -{ (z)+t} > V4(z, t t2)) + -{ a( )+ tt} where all retries after the current one are assumed to employ the optimal policy. Thus, for t1>t2, r2(Z,O(t2))=oo when r2(z,O(t))=oo, and r2(x,O(t2))>r2*(x,O(tj)) when r2(xz,O(tl)) is finite. For the case of k —2, it is easy to see that V2(Xz,(tj),R) is a linear combination of the effects of both transient and intermittent faults. Thus, V2(z,X(t),R*) V2(x,O(tk),R*). Also, the handling of V2 with respect to r1 has the same ordering relationship as that of V3 with respect to r2. Thus, V2(x,O(tj),R)~- V2(x,O(tk),R), and r;(O(tj))<r;(O(tk)) when tj tk,. Q.E.D. Lemma 4 shows that rk(z,O(t,)) is non-increasing in t/ for k=1,2. Thus, there exists an r such that r> rk*(,(r)). The proof of Theorem 2 is given as follows: Proof of Theorem 2: Suppose that the retry has been applied for the period r but the fault is still active. When r(z,O(r))>r, the retry should be continued since it decreases 19

the expected task completion time. Thus, rk(x)>sup {r; rk(x,O(r))>r}. Suppose there is an rE{r; r(x,U(r))<r}. Then Vk (x,O(r,),rj) V, (x,(rj))~ V, (z,O(r)) where r, is defined in Eq. (9) and k =k+l. Thus, the theorem follows. Q.E.D. For the same example in Section 3.2, suppose the active duration of an intermittent fault is exponentially distributed with an unknown disappearance rate p. Using a method similar to the Cohen's derivation in [13], the maximum likelihood estimator j(r) for an exponential distribution -- which maximizes log L(p) -- is obtained as: n:,)(Z+) (10) l=1 Theorem 2 gives the optimal stopping time for the current retry. Note that the true value of p is unknown and its maximum likelihood estimator is to determine the optimal retry duration. In the case of retry for a reappearing intermittent fault, the optimal retry duration for a given p is either 0 or oo as shown in Corollary 1. Using Theorem 2 and Eq. (10), we get the optimal retry duration as follows: r = max 0, I(n- I,)j t - t (11) L.= 1+V Ii /=] Note that the gamma distribution has a non-decreasing likelihood ratio for both a and d [18]. Furthermore, the estimators provided by Cohen [15] show that both the estimated ca and # are increasing in the current retry period r. Thus, Theorem 2 can be applied directly when the active duration of the intermittent fault has the gamma distribution. When the distribution of the active duration is Weibull, Theorem 2 cannot be applied directly, but still provides a good approximate solution. This is due to the fact that the Weibull distribution has a non-decreasing likelihood ratio with respect to its 20

scale parameter only. Since (i) the variation of the estimated shape parameter a with respect to the current retry duration r is always less than that of the estimated scale parameter a, and (ii) the estimated f is increasing in r when a is fixed, a reasonably good approximation can be obtained by assuming that a is constant during the current retry and B is estimated using both the past and current samples as discussed in the above. There are some shortcomings when the maximum likelihood estimator is used for the progressively censored samples. Particularly, the estimator is biased when the samples are censored. Also, in the case of the exponential distribution, jt does not contain sufficient statistics of p when the samples are censored and incomplete, i.e., when there exists at least one sample (I;,ti) with Ij=1. These shortcomings can be seen easily from a trivial example: pj becomes zero when I,=1 for all i=1,2,..n. In fact, as shown by van Zwet [19], for most practical cases it is impossible to obtain unbiased estimators when the samples are Type I censored in a semi-infinite interval. Note, however, that there is no restriction about which estimator to be used in the foregoing determination of the optimal retry policy, meaning that estimators other than the maximum likelihood estimator can be used without altering our method described thus far. 5. BAYES SEQUENTIAL ANALYSIS AND OPTIMAL RETRY In the previous section, the unknown parameters of a distribution are estimated first, and the optimal retry policy is then determined using the estimated results. In this section, the same problem is attacked by taking the Bayes approach. Since the reappearances of an intermittent fault during the execution of a task are a renewal process, there could be a sequence of retries for the same intermittent fault. Thus, it is natural 21

to incorporate the Bayes sequential analysis to both characterize the intermittent fault and determine the optimal retry policy. Retry is then considered as sampling of the fault behavior, and the retry duration controls both the task completion time and the information to be collected. Since the retry for a newly detected fault only occurs once within the execution of a task (under the assumption that E[TA is much larger than the other durations), we focus on the behavior of intermittent faults and thus the determination of r2. 5.1. Optimal Retry and Bayes Decision Let the distribution of T? be governed by some unknown parameters Wi. Note that Wi may be a scalar (e.g. for the exponential distribution) or a vector (e.g. for the gamma or Weibull distribution). The a priori information concerning WVV is expressed in terms ofa probability distribution function defined on Q. Let the density function of Wj be (,(w). Denote further the fault characteristics, given wi and the prior density function,, by CAlw and Ca,, respectively. To apply the Bayes decision theory, the risk with a retry policy R, given ei and the residual computation x, is defined as follows: pA(xz,,R) fInV,(ZxCa,'R) Ci(w)dw k=3,4 (12) Thus, the (optimal) Baye8 riak is given as p*(x,) = inf p(z,f,R) k=3,4 (13) Since we are now concerned only with the retry for an intermittent fault, R consists of r2 only. The optimal retry duration in case of the detection of an old intermittent fault, r2(z, CA)), abbreviated by r(z,~i), yields the Bayes risk p(z,3i). Similarly, the Bayes 22

risk of the retry for a newly detected fault can be defined with Eqs. (12) and (13). However, the determination of r1(z,f,) is a one stage Bayes decision problem. Once p4(Z,X4) and r2(z,f) are obtained, the normal form of analysis [22] can be applied directly for the solution of r;(xz,J). Following a retry attempt for an intermittent fault, regardless whether it fails or succeeds, an event related to the fault active duration T? is observed. The event observed during a retry of the duration r is either "success" or "fail". The "success" event, denoted by e8(t), occurs when the detected fault disappears after the retry duration t which is less than or equal to the maximum allowable retry duration r. The "fail" event, denoted by ef(r), occurs when the detected fault does not disappear by the end of the retry duration r. Let Str)={e8(t); t<r} U ({e(r)}. With the prior density function (,(w), the posterior density function following the observation of eES(r), denoted by ~,(wle) i=1,2, becomes W(el W) GX W) (,twfe). (14) (w = g(elw),(Aw) d(14) where g(elw) is the generalized conditional density function for the event e as in [20], i.e., f The density function of l: at t if e=e;(t) and tKr (15) g(ejw)= Prob(ef(r)) e=ef(r) This posterior density function will become the prior density for the next retry. Consequently, the system's behavior is similar to a sequential decision procedure which determines first a retry policy and then observes the resulting sample. The procedure will be repeated with a new prior distribution which is determined on the basis of the new sample observed and the old prior distribution. The decision on retry and the sampling for fault characterization will continue as long as there is an occurrence of fault. 23

The problem of selecting the optimal retry policy can also be treated as the optimal stopping problem with continuous observations [21]. Suppose an intermittent fault is I(etccte(I again when the residual computation i3 x. Then, retry is applied for a specified stopping time r. The task will be continued, without applying recovery methods other than retry, if the fault disappears during the retry period r. Otherwise, it has to be restarted from the beginning.7 The posterior density function of w; becomes (,(wle8(t)) or (,(wle(r)), depending on the outcome of retry. The cost of an observation is the amount of time used for monitoring the fault until its disappearance, i.e., c(e8(t))=t, or until the end of retry, i.e., c(e(r))=r. The costs associated with the termination of retry are defined as the amount of time necessary to complete the residual computation x as follows: L(z,r, I4e(t)) p;(=,Xwle (t))) L4,r,(il (r))= =0 + t8 The expected loss for the stopping time r' is the same as the Bayes risk defined in Eq. (13). According to the theory presented by Irle in [21], there always exists an optimal stopping time, r;E[0, oo), satisfying Eq. (13). We will in the next section solve the sequential decision problem using the backward induction [20] for testing hypotheses where the prior and posterior distributions are confined within the open unit interval, i.e., (0,1). Note that the minimax method in [22] cannot be used to solve Eqs. (12) and (13), since the decision space -- which consists of all possible maximum retry durations -- is neither compact nor finite. 7For simplicity, it is assumed that there is only one alternative to the retry recovery, i.e., restart. 24

5.2. Optimal Retry and Hypotheses Testing Suppose that there are a primary and some alternative hypotheses concerning the characteristics of an intermittent fault. Consider the sequential testing of these hypotheses and the simultaneous determination of the optimal retry policy; this is not difficult to solve since both the prior and posterior probabilities lie in the same unit interval (0,1). For given hypotheses, the initial prior distribution can be assumed to be equally likely among the hypotheses. To be more specific, take a demonstrative example8 in which the active duration of an intermittent fault is assumed to be exponentially distributed with an unknown parameter p. Let there be two hypotheses on p, Ho and H1 for p=Po and p=l8, respectively, and let p0o>p1. The uncertainty associated with these hypotheses can be represented by the probability h of having p= po. WVe will first determine the optimal retry policy for all hE(0,1). Then, we will consider the problem of testing hypotheses as well as estimating the expected sample size to reach a certain significance level under the optimal retry policy. Consider the optimal retry duration r2(z,h) upon detection of an old intermittent fault. In this case, we get the posterior probabilities given the events e8(t) and e{(r), denoted by h(t) and h(r), respectively as follows: h(t) -hpo ' where t<r (16) hpoae-~'~+(1-h)ple-por h(r)= he (17) he_~or+( 1_h)e-mr 8As will be pointed out near the end of this section, the results obtained from this example are applicable/extendable to more general cases. 25

As was discussed in Section 3.2, we can compute x2 for a given,i, denoted by Xz(ui) i=-0,1, such that (i) r2(z,1)=oo if x<x2z(p0), or 0 otherwise, and (ii) r2(z,0)=oo if x<x2(pl), or 0 otherwise. Since x2(#0)>2(*L1), r*(x,h)=oo if x<2(pj1), and r2(x,h)-O if x> z2(P). Note that the above represents extreme cases of retry, i.e., retries of duration zero or infinite. For the non-extreme case, i.e., the case of () < x<2(), let h= 8up{h J r2(,h)=O}. Since r2(x,1)=oo and r2(x,O)=O for Z2(0o)<Z<Z2(p1), we get O<h*<l. For all h>h*, r2(z,h)>O, i.e., retry must be applied upon detection of a fault. Suppose retry has been applied for a small duration 6r<r2(z,h). Then, the memoryless property of the exponential distribution leads to the following equation: br px(z,h) = (1 - F?(6rlh)) (6r + p3(z,h(6r))) + 0 f(tlh) (t + p4(z,h(t))dt (18a) By letting br-O and changing variables, Eq. (18a) becomes dp = (-}(l ) ( Ph( zag h(o)) - P3( 2 d) h ( (r) dh 0 -Ph) dr Ir==oj hpo + (1-h)pl p4(zh(O))- p3(zh) + } (18b) 1 I-'h)(p0 —pl)0- h)+ 0io+(1-h)p1 On the other hand, p(Zh) = 20 + t* for all h<h'. Using the same approach as in Theorem 1, we can prove that h* satisfies the following equation p (z,h'(O)) = -0 + t, - (19) h *poO+(1-hp) p1 From Eq. (4) and the definition of p; in Eq. (13), p;(x,h) is expressed as: p4(z,h) = 1 (l-ei) + e_-zf veVYp3(y,h)dy (20) 26

(20), we can calculate pk(x,h) k=3,4 for all xE(zx(pl), xz(#0)) with the following numerical algorithm: Al. Set h —1. A2. Calculate p(x,l1) and p4(',,1) for all zE(x2z(p), Zl(p0)). dp;(x,h) A3. Calculate dh using Eq. (18) and p;(z,h-6h) for all zE(z2(p), 2'(#0)). (Note p3(x,h) and p;(x,h(0)) are both known.) A4. Calculate p(xz,h-6h) using Eq. (20) for all xE(2(#l), x2(#0)). (Note p3(z,h-6h) is known for all x.) AS. Set h=h-Sh. If h<0, terminate the algorithm. A6. If p(x,h(0)) < 0 + t, - +(h)), go to A3. Otherwise, set p3(z,h-6h) = z0 + t, and go to A4. From the test at Ad, one can determine h' for all zE(X4(p1), z(p0)) so as to satisfy Eq. (19). Due to the memoryless property of the exponential distribution, r2(x,h)=0 when h<h' or satisfies Eq. (17) with h(r)=h' if h>h'. In Figure 7, r2 versus the prior probability h is plotted for various values of the residual computation x. Intersections of the curves in Figure 7 with the horizontal axis give the values of h for different values of z. Remark 1: in case the active duration of an intermittent fault has a general distribution (instead of an exponential distribution), a differential equation similar to Eq. (18b) cannot be obtained. In such a case, the original integral equation of p3(z,&;), i.e., the combination of Eqs. (3) and (12), has to be used instead. 27

From the foregoing discussion we can determine the optimal retry policy that is based on the prior probability h. Under this optimal retry policy, we can also determine trajectories of the posterior probabilities after a large number of occurrences and reappearances of intermittent faults have been observed. Let each retry be numbered by a two-tuple (m,nm) on the basis of occurrences and reappearances of intermittent faults. The (m,nm)-th retry is used to recover from the m-th occurrence of fault in case of nm=O or from the n,-th reappearance of the m-th intermittent fault if nm~O0. For the hypotheses Hi i=0,1, let h(m,nm) represent the posterior probability after the (m,nm)-th retry is applied. Also, let n4 be the total number of reappearances of an intermittent fault during the execution of a task and h,(m) be the prior probability before the m-th occurrence of fault, which is equal to h,(m-l,n 1) by definition. There are now two main problems to be addressed: (i) Will h,(m) converge to either 0 or 1, namely to the true fault characteristic as m-+oo?; (ii) If converges, how fast will it converge? For convergence, we get the following theorem. Theorem 3. Let M= inf {m I h,(m)>l-E, or h,(m)<E} where 0<e<1. If O<h,(O)<i, and zxo+t8- >0 for all hypotheses Hi and all tasks, then Prob(M<oo) = 1 and pi E[M < oo. ha(m) Proof: Let S,(m)=log hs(m) for jy4i. Thus, M can be defined as inf (m I IS,(m)l>K}, h(m) where Klog( ). Let z,(m,n)=log [e( ), where e(m,nm) is the event g[e( m, nm)lli)] observed at the (m,nm)-th retry and g(eljp) is the generalized conditional density function defined in Eq. (15). (When the retry duration defined by a retry policy is zero, e(m,nm) is null and z,(m,nm)=o. Also, when nm0, the retry duration is r; since the fault type is 28

not known at its first occurrence.) From Bayes theorem, we have n,,W nrt SmM(-) I ', h(O) sx(. )- S,(m' -1) + z,(z,n) = E E z,(m,n) + log n=0 m= —ln=O hO) mut Let y,(m)= z,(m,n) under the optimal retry policy. S,(m) becomes the sum of indepenn=O dent random variables. After an event is observed, the expected value of z,(m,nm) is the Kullback-Leibler information number and is greater than or equal to zero when Hi is true [231. In this case, E[z,(m,nm)]=O if and only if the prior probability before the (m,nm)-th retry is 0 or 1. Since xz+t,-1 >0 for all hypotheses Hi and all tasks executed, Prob(rij40)>O i —1,2. Hence, Prob[y,(m)=0]<1 for all tn<M. Following the proof in [241 that the sampling of a sequential probability-ratio test (SPRT) terminates with probability 1, Prob(M<oo)=-1 and E[A]<oo are obtained. Q.E.D. Remark 2: Since the tasks affected by intermittent faults do not have to be identical, the random variables y,(1), y(2), * are independently but not identically distributed. Moreover, for a fixed m, z,(m,nm)'s are dependent on one another because the events observed are controlled by the retry duration which is in turn a function of the moment of reappearance. However, all z(m,nm)>O when Hi is true. The condition, zo+t8 — >0 Pi for all hypotheses Hi and all tasks executed, indicates that retry is always a useful recovery when an intermittent fault is detected. In fact, this condition is not necessary true for all tasks, but Theorem 3 holds as long as Prob(r >0)'~40. Theorem 3 shows that the expected number of faults observed -- that makes the posterior probability reach either E or 1-e -- is finite. This also holds for other distributions and retry policies as long as rly'0 and r2y0 for some x. However, it does not pro29

vide the average sample size, E[MI HJ], that is necessary to reach these termination bound(laries K and -K. Also, one has to justify whether or not the posterior probability at the termination implies the true fault characteristic. In other words, it is important to know the error probability, Prob(S(M)<-K I Hi). There are two difficult aspects in the evaluation of E[M I HJ and Prob(SJM)<-K I HJ); one is that yXm)'s are not identically distributed, and the other is the non-existence of closed form solutions for both rl and r2. If the same task is executed repeatedly under the condition zo+t, — >0 for all hypotheses, then y,{m)'s #Ii become independently and identically distributed. Assume further that initially, both hypotheses are equally likely, i.e., ho(0)=h1(0)=0.5. Using the characteristics of SPRT in [20], the error probability is approximated by: iK-K Proq(S,{")<-K I H.) K -K e e -e Even if the same task is executed repeatedly, it is very difficult to obtain an exact solution for E[yJ because of the dependency between the optimal retry durations and the observed samples of the active durations. This fact in turn makes it impossible to obtain the exact solution of E[M I HJ. Due to the above difficulties, in what follows, we will derive upper and lower bounds of E[M I HJ] instead of an exact solution. Suppose there are two retry policies R~ and R' with the retry durations (r~,r~) and (rl,r), respectively. r~(x,h) and rl(z,h) are defined the same as r;(z,h). r4(z,h) is equal to oo if z<zx(p) and 0 otherwise for j=0,1. Let y(m) and 3M be 3 z,(m,n) and the n-0 number of faults observed to reach the termination boundaries under the retry policy RI', respectively. Then, (i) Prob(MI<oo)=l and E[M/]<oo, and (ii) EWlyIE[viyE[y, ]. 30

(Note that the indices m are omitted because of the distributions being identical.) Once / ([i I 1151 j=0,1 is calculated as in the Appendix, the expected sample size to reach the boundaries 1 -l and ( is bounded as: E[Af I H,] EIMI Hs] ~ ELM I H,] where E[f I Hj] ~ E H j=0,1 (see [20] for more on this). The above equations give the error probability and the bounds of the expected sample size when a certain level of significance is to be achieved. These bounds of E[MI H.IJ become tight when the difference between 0o and Pl is small. Of course, the expected sample size under the optimal retry policy is larger than that for the case when the complete information about active duration is observed, i.e., rl-r2=oo. Thus far, we have discussed solutions to the problem of sequential retry decision and hypotheses testing only for the case of exponentially distributed durations. Notice, however, that (i) the same method, with little modification, can be applied to the cases with any other kind of distributions, and (ii) Theorem 3 holds as long as Prob[y,(m)=01<1. Moreover, the method can be extended to the testing of multiple alternative hypotheses by specifying the prior and posterior probabilities as a vector, each element of which represents the probability that the corresponding hypothesis is true. 6. CONCLUSION In this report, we have investigated optimal retry policies with known and unknown fault characteristics. Retry not only saves the recovery overhead but provides a means to estimate the unknown characteristics. Although the data resulting from retries are censored, they are the only significant means of monitoring the fault characteristics. 31

May 4, 1984 Naturally, the monitored results are different from those obtained during device manufacture. In the discussion of retry policies, retry durations are assumed to be continuous. In fact, the retry durations should be discrete since the time required for repeated execution of an operation cannot be cascaded into a single continuous duration. Since the expected risk is a continuous function of the retry duration, it is not difficult to find the optimal retry policy which is specified as a number of retry attempts. As was pointed out in the discussion of the expected sample size for reaching a certain level of confidence in hypotheses testing, the test under the optimal retry policy turns out to be inefficient in the sense of maximizing the information observed. This is due to the fact that the optimal retry policy is defined to minimize the total completion time of the task affected by the occurrence of fault. Thus, the retry policy is a local optimum; "optimal" only for the task involved. Clearly, the retry policy which gives complete maximum information should have infinite retry durations, although such a retry policy is totally unacceptable in reality. It would be interesting to examine the trade-off between the two extreme objectives, i.e., minimizing the local task completion time and maximizing the information to be collected. This problem can be formulated as the minimization of the asymptotically accumulated risk, lim I E[pk(x,C5)], where m-*oo m j=1 j and m are used to number the successive retries and CO is the measured fault characteristic at the j-th retry. It can also indicates that the global optimal retry policy should collect more information (it is definitely not complete though) from the beginning to speed up the estimation of the true fault characteristics and then implement the local optimal retry policy once the true characteristics are obtained. 32

May 4, 1984 Another important aspect is the choice of an accurate model for the fault behavior. As was discussed in Sections 4 and 5, the optimal retry policy and the measurement of the fault characteristics are dependent on the family of density functions that are initially selected. The suitability of chosen models can be validated through goodness-of-fit tests, e.g., chi-square goodness-of-fit. Although sometimes the expected task completion time may not be minimized because of the poor choice of model, the information collected via retries can still be used to check the suitability of the model. Thus, after a large number of samples have been obtained, it is possible to select an appropriate form of density function and then achieve the minimum task completion time. The other approach is to begin with hypotheses of various forms of density functions. As sampling progresses, the parameters associated with the density function forms are estimated and then the hypotheses are tested. The work presented in this report is to incorporate the capability of real-time estimation (of the fault characteristics) and decision (on optimal retry policies) into the computer system. The results are a self-adjustable (thus intelligent) system and a powerful measurement of the fault characteristics. This idea can also be extended to other applications, e.g. the measurement of program behavior and the simultaneous decision of system configuration or scheduling. Such extensions would be significant contributions towards the construction of highly intelligent computer systems. ACKNOWLEDGMENT The authors are grateful to C. M. Krishna at the University of Michigan for his careful comments on the first draft of this report, and to both R. Butler and M. Holt at 33

May 4, 1984 NASA Langley Research Center for their technical and financial assistance. REFERENCES [1] D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Digital Press, 1982. [2] O. Tasar and V. Tasar, "A Study of Intermittent Faults in Digital Computers," 1977 AFIPS Conference Proceedings, Vol. 46, 1977, pp. 807-811. [3] M. Ball and F. Hardie, "Effects and Detection of Intermittent Failures in Digital Systems," 1969 AFIPS Conference Proceedings, Vol. 35, Fall 1969, pp. 329-335. [4] K. G. Shin and Y. H. Lee, "Error Detection Process: Model, Design, and Its Impact on Computer Performance," To appear in IEEE Trans. on Computers, Vol. C-33, No. 6, June 1984. [5] W. C. Carter, et al., "Cost Effectiveness of Self Checking Computer Design," Proc. of the 7th Annual Int'l Symp. on Fault-Tolerant Computing, 1977, pp. 117-123. [6] G. H. Maestri, "The Retryable Processor," 1972 AFIPS Conference Proceedings, Vol. 41, Fall 1972, pp. 273-277. [7] D. L. Droulette, "Recovery through Programming System/360-System/370," 1971 AFIPS Conference Proceedings, Vol. 38, Spring 1971, pp. 467-476. [8] L. A. Boone, H. L. Liebergot and R. M. Sedmak, "Availability, Reliability, and Maintainability Aspects of the SPERRY UNIVAC 1100/60," Proc. of the 10th Annual Intl Symp. on Fault-Tolerant Computing, 1980, pp. 3-9. [9] J. J. Stiffler and L. A. Bryant, "CARE III Phase Report - Mathematical Description," NASA Report, No. 3566, Nov. 1982. [10] I. Koren and S. Y. H. Su, "Reliability Analysis of N-Modular Redundancy Systems with Intermittent and Permanent Faults," IEEE Trans. on Computers, Vol. 28, No. 7, July 1979, pp. 514-520. [11] E. Cinlar, Introduction to Stochastic Processes, Prentice-Hill Inc., 1975. 34

May 4, 1984 [12] S. M. Ross, Stochastic Processes, John Wiley & Sons, 1983. [13] A. C(. Cohen, "Progressively Censored Samples in Life Testing," Technometrics, Vol. 5, No. 3, Aug. 1963, pp. 327-339. [14] A. C. Cohen, "Multi-censored Sampling in the Three Parameter Weibull Distribution," Technomitics, Vol. 17, No. 3, Aug. 1975, pp. 347-351. [15] A. C. Cohen, "Progressively Censored Sampling in the Three-Parameter Gamma Distribution," Technometrics, Vol. 19, No. 3, Aug. 1977, pp. 333-340. [16] G. H. Lemon, "Maximum Likelihood Estimation for the Three Parameter Weibull Distribution Based on Censored Samples," Technometrics, Vol. 17, No. 2, 1975, pp. 247-254. [17] D. R. Wingo, "Solution of the Three-Parameter Weibull Equations by Constrained Modified Quasilinearization (Progressively Censored Samples)," IEEE Trans. on Reliability, Vol. R-22, No. 2, June 1973, pp. 96-102. [18] E. L. Lehmann, Testing Statistical Hypotheses, John Wiley & Sons, 1959. [19] W. R. van Zwet, "Bias in Estimation from Type I Censored Samples," Statist. Neerlandica, Vol. 20, 1966, pp. 143-148. [20] M. H. DeGroot, Optimal Statistical Decision, McGraw-Hill Book Company, 1970. [21] A. Irle and N. Schmitz, "Decision Theory for Continuous Observations I: Bayes solutions," Trans. of the 7th Prague Conf. on Information Theory, Statistical Decision Functions and Random Processes, 1974, pp. 209-221. [22] I. O. Berger, Statistical Decision Theory, Springer-Verlag, New York, 1980. [23] H. Chernoff, Sequential Analysis and Optimal Design, S.I.A.M., 1972. [24] C. Stein, "A note on cumulative sums," Ann. Math. Stat., Vol. 17, 1946, pp. 489 -499. 35

May 4, 1984 Appendix: The Expression of E[yjlHjl The retry d(uration r2 under the retry policy R' is equal to oo if x~'(p,(lj) and O othcrwisc. 'Thus, the complete information will be gathered if an old intermittent fault is detected again at z~<4(p,) and no information will be obtained if detected at z>z24(Pj). Hence, if the retry for a newly detected intermittent fault when the residual computation is z succeeds, we expect to collect information from the successive retries before the task completion as follows: ~ vz (logo- -) i 4121 ~ = E[~ ilm~n~lzl Pi Bi if Z< z2(j) El2.n)) = ) ( og lPi _ i-j ) otherwise Let the maximum retry duration for a newly detected fault be r;(z) when the residual computation is x. Also, let 1(Zd) be the density function of the detection time of a new intermittent fault, zd, given that it is detected during the task execution. Then, X where X,=pX. Thus, we have EJ1H] as follows: 0 z.or,(z) EqIjlHj = Ifp(XO-z)e-"pri(~)(r;(x))dz + f f le-'t"(Zxo-Z){(( (t)+E[I2jlzl}dtd 0 0 0 where I(r)=-(,,-lj)r is the information collected from an unsuccessful retry of the maximum retry duration r and l~(r)=log- - iJr is the resulting information when the Pi Pi retry succeeds after the duration r. 36

N F NF: non-faultv F: faulty FB: fault-benign Figure 1. The Model of Faults.

case 1: gamma distribution, B=O.1. case 2: Weibull distribution, /=0.I. case 3: Weibull distribution, /6=0.2. case 4: gamma distribution, /8=0.2. CNI co C0 C i xJ K \case 1 4,, 1- II '1.00 1.80 2.60 3.40 4.20 5.00 Shape parameter (>1) Figure 2. x2(x,Cl) versus the Shape Parameter a when the Hazard Rate is Increasing.

a ct0. 8 '=~ 0.2 |a,=0.7 | I | (t)=3c-3t:: X CDt~~~! =1. C I o 1 -o I 0.00 0.520 0.40 0.60 0.80 1.0 Residual computation (x) The density function of active duration f,(t)-_ de-1 e f Figure 3. The Optimal Retry Duration r(x,Cj) for Weibull Distributions with Decreasing Hazard Rate.

a==-0.6 a=0.8 a=0.4 crE. 1 - (NJ I -o I -0.o0006.. 04~1) ~ ~ ~ ~ ~ ~ ~ ~ ~ 10 C 0 -'" Residual conputation (x) The density function of active duration f?(t)- At! ' Figure 4. The Optimal Retry Duration r(z,Cf) for Gamma Distributions with Decreasing Hazard Rate. cm~~~~~~~~~~~~~ FE~~~~~~~~~~~~~~ +-j~~~~~~~~~~~~~~~~~

O r7z=0.8, r=10, p=12 case 1: p; —0.1, pt=0.5, p-0.4, v=6 case 2: p=.O-5, p=O0.05, p 5 0.4, v=i2 case 3: pp==0.05, pt=0.6, p,=0.35, v=-6 e case 1 co -4 La -9 tcase 2 aI Co oli o 0 0..." 4 O0.00 0.10 0.20 0.30 0.40 0.50 Retry duration rl Figure 5. V2(x, CfR) - z versus the Retry Duration rl for Exponentially Distributed Durations.

am case 1: pp=0.1, pt=0.3, Pi=0.6, r, p —5, v=3 case 2: pp=O.1, pe=0.6, p=-0.3, r-6, p-5, v —3 case 3: pp-0.1, p=-0.3, pi=0.6, r=12, p=10, v=6 case 4: p=O0.1, p,=0.6, pi=0.3, r=12, 1t=10, v=6 ~r case 1 case 2 tC H case 3 Ca case 4 0 C) ub.o0 0.20 0.40 0.60 0.80 1.00 Residual computation (x) Figure 6. The Optimal Retry Duation rl(z,C1) for Exponentially Distributed Duartions.

C3 z= 0.48 z=0.52 x Z2,-(po)=O.63 "~~ I z== —0.56 CN l2(I11-43 t z~0.60 -O L d-. C1 C~.00 0.20 0.40 0.60 0.80 1.00 Prior probability h Figure 7. The Optimal Retry Duration r (z,h) versus the Prior Probability h. (P.o-10, p1=5, v=5)

UNIVERSITY OF MICHIGAN 3 901 5 03023 076011111 3 9015 03023 0760