THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY1 OPTIMAL RECONFIGURATION STRATEGY FOR A DEGRADABLE MULTI-MODULE COMPUTING SYSTEM Yann-Hang Lee and Kang G. Shin CRL-TR-41-84 September 1884 Room 1079, East Engineering Building Ann Arbor, Michigan 48109 USA Tel: (313) 783-8000 1Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agency.

OPTIMAL RECONFIGURATION STRATEGY FOR A DEGRADABLE MULTI-MODULE COMPUTING SYSTEM' Yann-Hang Lee and Kang G. Shin Division of Computer Science and Engineering Department of Electrical Engineering and Computer Science The University of Michigan Ann Arbor, MI 48109. ABSTRACT In this report, we present a new quantitative approach to the problem of reconfiguring a degradable multi-module system. The approach is concerned with both assigning some modules for computation and arranging others for reliability. Conventionally, a fault-tolerant system performs reconfiguration only upon a subsystem failure. Since there exists an inherent tradeoff between the computation capacity and fault-tolerance of a multi-module computing system, the conventional approach is a passive action and does not yield a configuration which provides an optimal compromise for the tradeoff. Using the expected total reward as the optimal criterion, we show the need and existence of an active reconfiguration strategy in which the system reconfigures itself on the basis of not only the progre8ssion of the mission but also the occurrence of a failure. Following the problem formulation, we investigate some important properties of an optimal reconfiguration strategy which specify (i) the times at which the system should undergo reconfiguration, and (ii) the configurations that the system should change to. Then, the optimal reconfiguration problem is converted to integer nonlinear knapsack and fractional programming problems. The algorithms for solving these problems and a demonstrative example are given. Additional extensions of the optimal reconfiguration problem are also discussed. 'This work was supported in part by NASA under both Grant NAG 1-296 and Grant NAG 1-492. Any opinions, findings, and conclusions or recommendations expressed in this report are those of the authors and do not necessarily reflect the views of NASA. All correspondence should be addressed to Prof. Kang G. Shin at the above address.

Lee and Shin: Optimal Reconfiguration September 27, 1984 1. INTRODUCTION Reconfiguration of a system is the process of changing an already-existing system organization or interconnections among its subsystems. In general, the system needs to perform reconfiguration for two reasons. The first reason is the dynamic variations of incoming tasks. The system reconfigures itself to match the special demands made by the incoming tasks and then executes the tasks more efficiently than with the previous configuration. In this case, reconfiguration depends on the tasks to be executed. Reconfiguration of a system could have many different ways such as the change of partition [1], word size [2], etc. The second reason is to make the system tolerate faults that may occur dynamically and randomly during the mission lifetime. Reconfiguration allows the system to remain operational, perhaps in a degraded mode, even in the case of subsystem failures. Typical examples of reconfiguration for fault tolerance include the handling of an extra-stage in permutation networks [3], the reconfiguration algorithm to maintain the system in a safe state [4], and the like. For the purpose of fault-tolerance, several authors have proposed the procedures and principles of reconfiguration of computer systems [5,6,7]. These procedures are all intended to make the system operational in the face of subsystem failures. Saheban and Friedman have investigated the degradation of computation capability and diagnosability in terms of the number of switches to connect modules [8,9]. They have also proposed a methodology for the design of reconfigurable multi-module systems. Fortes and Raghavendra have examined the design of reconfigurable array processors with VLSI chips and analyzed the improved reliability, performability, computation capability, and the additional hardware cost [10]. 1

Lee and Shin: Optimal Reconfiguration September 27, 1984 We classify the conventional system reconfigurations discussed above as a passive action, since it is performed only when a fault is detected. Moreover, they assume that there is only one configuration that the system will be changed to following each reconfiguration. For instance, the system degrades from an m module-parallel system to an m-1 module-parallel system when a module failure occurs. Thus, there is no choice concerning when the reconfiguration should be performed and what configurations the system should construct. In this report, we are concerned with developing a quantitative method for design and analysis of the reconfiguration of a multi-module system. Particularly, we will derive an optimal reconfiguration strategy under which the system is able to be optimally reconfigured to have high performance and survive the occurrences of subsystem failure during the entire mission lifetime. The term "module" is used here to mean processor, memory, or bus. We assume that the environment and workload of the system are in the steady state throughout the mission lifetime. The system is capable of assigning a module to be a redundant unit, a functioning unit, or a spare unit. Functioning modules are executing computation tasks. Redundant modules are associated with functioning modules for verifying the correctness of computation results or masking erroneous results. Spare modules do not execute any useful task before they replace failed modules. Although there is no difference between functioning and associated redundant modules in the execution of tasks, they do have different purposes in a logical sense. It is well known that the goals of reconfiguring a multi-module system are to enhance both computation capacity and eyatem reliability. In most cases, it is easy to see a trade-off between these two goals. For example, if there were no module failures 2

Lee and Shin: Optimal Reconfiguration September 27, 1984 and therefore, no module redundancy were necessary, then the computation capacity would increase as the number of functioning modules increases. On the other hand, if module failures are allowed, then providing greater module redundancy enhances the system reliability at the expense of the computation capacity. When the number of available modules is finite, it becomes necessary to make a suitable compromise between the system reliability and the computation capacity. It is the optimal reconfiguration that is desired for the most suitable compromise in some sense. >From the standpoints of reliability and performance, it is natural to consider more than one possible way for reconfiguring multi-module systems. An extreme example is whether the system should be configured to an m-module redundant system or to an m-parallel server system. Obviously, the former offers higher reliability, while the latter providing higher performance. Some criterion is needed to judge the goodness of different configurations. Based on the criterion, it is possible that the best configuration at a particular moment no longer becomes the best one at another moment. In such a case, reconfiguration is needed even if there is no occurrence of failure; we term this active reconfiguration. Thus, we need to have an optimal reconfiguration strategy which specifies the optimal configurations for the whole mission lifetime. Since it is invoked for both the cases of failure occurrence as well as no occurrence of failure, the active reconfiguration subsumes the conventional passive reconfiguration. This report is organized as follows. In Section 2, we introduce several convenient concepts, notations, and terminologies for reconfiguration of multi-module systems. Then, we develop a criterion that will be used for judging the goodness of configurations. The need of active reconfiguration is also justified in this section. Section 3 examines the properties of an optimal reconfiguration strategy during the mission lifetime. Also, 3

Lee and Shin: Optimal Reconfiguration September 27, 1984 presented is an algorithm for determining an optimal reconfiguration strategy. Actual determination of the optimal configurations is the subject of Section 4, where we develop solution algorithms for two integer nonlinear optimization problems. The report concludes with Section 5. 2. RECONFIGURATION STRATEGY Consider a multi-module system which begins its mission with m0 identical modules. Let the mission lifetime be to during which no system repair is allowed, i.e. a non-repairable system. We consider here only the failures which are caused by hardware faults and will -- along with the progression of the mission -- trigger system reconfiguration. Transient and intermittent faults from which the system can be recovered through retry 111,12] are not considered, since they do not cause reconfiguration. As we shall see, the optimal configurations are generated off-line in table form and, therefore, the (on-line) overhead of reconfiguration is simply task switching times. These do not generate any significant impacts on the determination of an optimal reconfiguration strategy, since in practice the system has to undergo only a few active reconfigurations during the mission lifetime. Consequently, the overhead of reconfiguration is assumed to be negligible in the following discussion. 2.1. Notation and Definitions Classical reliability analyses treat the system failure probability as a function of the components' failure probabilities using combinatorial mathematics. These approaches neglect the effects of failure recovery overhead on executing tasks. These effects are significant, particularly when real-time applications are considered. For example, a delay in 4

Lee and Shin: Optimal Reconfiguration September 27, 1984 task execution due to failure recovery overhead can lead to increased system operational costs or even a system crash. Due to these reasons, a reliability analysis including the impact of failure handling on executing tasks is more general and powerful than the classical methods [13]. In addition, system performance should depend upon the tasks completed by the system within its mission lifetime. Thus, it is natural and essential to consider the execution of tasks when reconfiguration strategies have to be determined. Consider a set of tasks that are to be executed during the mission lifetime to. Group the tasks into k classes such that the tasks in the same class will have the same influence on the mission. More specifically, the system will gain the same reward for each task within a class when it is completed successfully, and will suffer the same penalty if its execution is unsuccessful. Although the pattern of the incoming tasks could change in reality, we assume for simplicity that the combined workload in each task class is stationary throughout the entire mission, i.e. the task arrival rate and the required computations in each task class are constant.2 Because of the failures of individual modules during the mission, when the remaining mission lifetime(RML) is tE[O,t0o the system may have to operate with only m(t)E O,...,mo} modules. Let mi (t) be the number of modules assigned to the class i tasks. With these mi (t) modules, n (t) computing clusters are to be constructed. Each computing cluster is a computation unit and could have some redundant modules for reliability reasons. Let ri (t) be the total number of redundant modules used for the task class i. These redundant modules are used in constructing computing clusters as simplexes (without redundancy), dyads (with one redundant module each), triads (with two redundant modules each), etc. For notational simplicity, we will leave out the time As we shall discuss in Section 5, this assumption can be easily relaxed. 6

Lee and Shin: Optimal Reconfiguration September 27, 1984 dependency of mi, n,, and ri in the rest of this report as long as it does not cause any k ambiguity. Obviously, ni + ri = mi and mi < m.3 Since all computing clusters i=1 assigned to the same task class are homogeneous insofar as their capabilities of computation and fault-tolerance are concerned, the r, redundant modules should be equally distributed over n, computing clusters. Thus, for the task class i there are ri - n, L J ni r computing clusters with the module redundancy Li J +2, and n,(l+Lni J) - r, computing clusters with the module redundancy L- J+1, where LzJ is the maximum integer that is less than or equal to z. Let ~lm be the set of all feasible configurations of m modules and be given as m m-{((n 1,r l),(n2,r2),..,(nk,rk )) I (ni +ri )<m, ni,ri EI+, ri =O0 if ni =0 i=1,2,..,k where I+ is the set of non-negative integers. Also, denote the set of all configurations of the system by n U nm. m=0 Let ym: R + -m-* be a configuration function where R + is the set of non-negative real numbers. A reconfiguration strategy RStm is defined as RSt,m y(t) t[Ot] MEO,...m Hence, given the initial reconfiguration strategy RSgomo, the system uses the configuration ym (t)ERSt0mo when RML=t and there are m healthy modules available. 3 The "less than" or "equal to" relationship is used to include the case when the system has standby spare modules.

Lee and Shin: Optimal Reconfiguration September 27, 1984 2.2. Reconfiguration Model During the mission lifetime, system degradation is unavoidable due to module failures. A simple model for system degradation is presented in Figure 1 where state Sm represents the availability of m healthy modules. The transition from Sm to S,-I, m> 1, implies failure of one module and the subsequent recovery of system operation. However, failuire of one module could be fatal to the system, for example, coverage failure [14,15j or dynamic failure [13], which results in loss of the whole mission. In such a case, the system will transfer directly to the total system failure state So. When the system state is Sm and RML=t, the system has the configuration im (t). It is assumed that the times to failure for all modules are independently and identically distributed random variables. Distribution of the times to failure is assumed to be exponential with rate X. Define a stochastic process M(t) which is equal to (i) the number of available modules when the system is operational and RML=t, or (ii) 0 if the system crashed when RML> t. This stochastic process is governed by the failure and recovery processes which in turn depend on the system configurations during the mission lifetime. Within a reconfiguration strategy RSgtmo, system configurations are another stochastic process, called the reconfiguration process and denoted by RFt,,o(t), tE[Otol. RF,lmo(t) includes a configuration 'MA)(t)ERSto,mo to be used at RML=t. The transitions between configurations within RFtom(t) depend on failure and recovery processes as well as on the reconfiguration strategy RStm0,. A sample path of RFtmo is called a configuration trajectory which represents a configuration history of the system. When RMfL=t and the number of available modules is m, the system reconrigures itself from im (t) to am l(t) if a recoverable failure occurs, or to am (t-6t) if there is no failure during 7

Lee and Shin: Optimal Reconfiguration September 27, 1984 6t. Note that if y, (t)='Ym (t)=ym (t-6t) for all tE[t-6t,t] and if there is no failure during this 6t, then the system does not have to change its configuration for the period 6t. For the system which is capable of graceful degradation and reconfiguration, Meyer's performability 116,17,18] is a useful measure of the system capability. Performability is a composite measure of performance and reliability which automatically takes into account the performance degradation due to component failures. To incorporate the concept of performability, two functions associated with the configuration wEnl are introduced here. The first is a non-negative and bounded function p(w) called the reward rate, which represents the average reward per unit time corresponding to the computation performed by the system with the configuration w. This function is analogous to the reward structure defined by Furchtgott [18], and the reward function defined by Donatiello and Iyer [19]. Clearly, p(w)=O when wEQ0, i.e. zero reward rate in case of no module available or a system crash. Thus, the total reward accumulated during the mission lifetime to is given as to Wt m0 = f P(fyt)(t))dt (1) Note that the definition of Wtm0 is the same as Meyer's performability [16,17,18]. The second function, a(w), called the crash probability, represents the probability that a module failure causes a system crash. This function indicates the system's vulnerability to a module failure when the system configuration is w. a(w) may be either the coverage failure [14,15], or the probability of dynamic failure 113] associated with the configuration w. The reward rate p(w) is an implicit function of the number of computing clusters in each task class. We assume that redundancy in each computing cluster does not affect 8

Lee and Shin: Optimal Reconfiguration September 27, 1984 its performance.4 On the other hand, a(w) will depend on the number of redundant modules associated with each computing cluster. Since a configuration is specified by n and r where n=[n 1,n2,,nl] and r=[rl,r2,...,r], the functions p(w) and a(w) will be used interchangeably with p(n) and a(n,r), respectively in the rest of this report. Several authors have derived the distribution and the moments of performability for gracefully degradable systems under the restriction that system reconfiguration is allowed only upon failure [16,17,18,19]. Moreover, such a system has exactly one new configuration to choose from upon detection of a module failure. This, however, is an unnecessarily limiting factor (it might result in a configuration with less performance and reliability than the actual system's capacity), since there are usually several alternative configurations available for a system with multiple modules. For example, when there are four modules available upon failure, we can construct one 4-module redundant computing cluster, or one triad and one simplex, or two dyads, or four simplexes. Conventional reconfiguration concepts becomes more inappropriate when we consider the fact that the remaining mission lifetime has to play an important role in deciding a new configuration. This fact can be seen easily with the following two, simple cases. One is the case when the remaining mission lifetime is very short in which the probability of having failures is very small. Thus, the computation capability is more important than reliability for higher rewards. The other case is when the remaining mission lifetime is long. In this case, the probability of having failures becomes large and any good configuration should be able to tolerate module failures and minimize the possibility of a sys. tem crash. 4As will be discussed in the Conclusion, this assumption can be relaxed.

Lee and Shin: Optimal Reconfiguration September 27, 1984 For these reasons, we must first examine the effects of the remaining mission lifetime on system reconfiguration, i.e. the progression of the mission, and then choose a new configuration. Specifically, it is necessary to determine an optimal reconfiguration strategy, which maximizes the expected reward E[Wto,]J. This optimization problem is equivalent to controlling system reconfiguration such that the system follows a certain configuration trajectory to provide a maximum expected reward even in the face of random failures. 3. DERIVATION OF OPTIMAL RECONFIGURATION STRATEGY Denote the optimal reconfiguration strategy by RStmo,('(|m(t) tE[O,to], mE(0,...,m0}} under which the total expected reward E[WtnmJ is maximized. Since the optimal configuration 'ry(t) is completely specified by the values of ni' and r;', the problem is to determine ni'(t) and ri'(t)5 for all tE[O, to] and mE(O,...,m0} that maximize E[Wt,,,J. 3.1. Problem Formulation Based on the assumption of an exponential distribution of failure occurrence, the probability of having a failure during a small interval 6t is approximately equal to X Et. If Yin (t+6t) is the configuration used at RML=t+6t, then the reward rate at that time, Wt+6t,m, can be expressed as: 6Although dependence of nf and rg on t is re-introduced here for clarity, it will be omitted again throughout the report. 10

Lee and Shin: Optimal Reconfiguration September 27, 1984 P(Ym (t+6t))6t + WIm with probability l-mX 6t Wt6tm — = P('Tm (t+6t))6t with probability a(t, (t+6t)) mX 6t (2) p(m, (t+6t))6t + W,,~,, with probability (l-a(ym (t+6t))) mX 6t Thus, a recursive expression for the expected reward is derived as follows. EIWt+bt,m ] = (1-mX 6t) E[p(ym (t+6t))6t + Wt, m + a(Ym (t+6t)) mX 6t (p(Ym (t+6t))bt) + (1 - c(ym (t+bt))) mX ft E[p(y. (t+6t))6t + Wt,,,] (3) Assume that at any moment there is at most one occurrence of failure implying that the maximum jump in M(t) is one. Notice that when the system reconfigures itself into a new configuration ym(t) from y, (t+6t) or from ym+i(t+6t), the system must be in this configuration for a nonzero mission interval; otherwise, there is no need to move in that configuration at all. Let the optimal strategy RS+,m +I lm RSt;tm and the configuration Im (t+) lim 'n (t+6t). Then the following lemma gives a recursive representation of the optimal reconfiguration strategy. Lemma 1. RSt+m = {'m(t+)} U RSt, U RSt,,l. Proof. Suppose that the configuration chosen at RML=t+6t by a reconfiguration strategy will be used at least for the period 6t if there is no occurrence of module failure, and also assume that there could be at most one failure occurrence during 6t. Let bW be the reward gained during this 6t. When there is no failure, E[Wi+6tm J=E[Wtm I + E[6WJ. Thus, to have a maximum expected reward in this case, RSt 6t, m D{'ym(t+6t)}URSt,. Similarly, RSt+6,m JD{'m( t+6t)}URSt,, when we consider the case of one failure occurrence. On the other hand, by definition, the system does not transform into any configuration other than (i) ym(t+6t) and (ii) the 11

Lee and Shin: Optimal Reconfiguration September 27, 1984 configurations belonging to RSt'm and RSt,' 1. Thus, Lemma 1 follows immediately, since the assumption at the beginning of this proof becomes true as 6t —0. Q.E.D. The above lemma can be used to determine recursively the optimal reconfiguration strategy. With the knowledge of RSt'm, RSt,,'l, and their respective expected rewards E[ Wtm] and E[ Wtm l], the optimal configuration - (t +) can be determined by Theorem 1 below. Theorem 1. rm(t +) maximizes Jt,m (w) for wEnfm where J,,m (w) = p(w) - a(w)mX E[wt,-l (4) Proof: >From Eq. (3), the variation of E[Wt,m I in 6t is ElWt+6t,m I - E[Wt,m = P(ym (t+6t)) 6t - a('I, (t+6t)) mX 6t E[lW,m,,1 + (5) mX 6t E[ Wf,,,M- WI,m i Dividing both sides of Eq. (5) by Et and taking limits as 6t -- 0, we get the first derivative of E[Wt.m ] with respect to t: dE[Wt,m I dt = P(7, (t)) - (m, (t +)) mX E [W,,,, - mX E[ W,, m - Wt,, (6) The last term in the right-hand side of Eq. (6) is independent of the configuration Im (t+). By Lemma 1, -,(t+) maximizes the first two terms of the right-hand side of Eq. (6) over all wEf2l. Since this is true for all E[W,_,1J including E[Wt'm 1, the theorem is proved. QIE.D We define the optimal reconfiguration problem for -y,(t+) in the following form: Problem OR: maximize Jt,m (n,r)= p(n)- a(n,r) mX E[Wt,1] n1, r 12

Lee and Shin: Optimal Reconfiguration September 27, 19,84 k subject to (ni +ri) < m, and n, ri E1+ for i=1,2,..,k. i=1 Though Lemma 1 and Theorem 1 provide a recursive relationship necessary for obtaining RSt'.o, the relationship does not yield an acceptable solution. It requires a solution to the OR problem for all tE(O,tol (thus infinitely many times). As a remedy to this difficulty, we will in the next section convert the OR problem so that it has to be solved only when -, '(t +)~%Ay(t), instead of for all tE[O,tol. 3.2. Problem Transformation Once an optimal configuration at a moment during the mission is determined, it will be used for a nonzero interval. Given m, it is therefore natural to look for the 8witch time8 at which the optimal configuration has to be changed to maximize the expected total reward. That is, we only have to solve the OR problem at those switch times instead of for all tE[O,t0o. Let a' E[O0,o] j=0,1,2,... and m=1,2,...,m0 be the switch times for RS ' where am =o, and by definition For two configurationss w1 and wv2, one can say that w1 is better than w2 if a(w)=(w2) and (w1)>p(w2). Thus, in the following discussions we are 2 for the case only (with configurations with different crash probabilities, i.e. ws ) ar), for allned only. For convenience define a function Am: nm Xm R Ras follows. 13

Lee and Shin: Optimal Reconfiguration September 27, 1984 Am (i,W2) -= Am (W2, wI) X=(w - (2) ma(w) - ma(ow2) where a(w1)J3ca(w2) as mentioned above. Also, define two convenient sets Lm ( a )-E I a(w)<a(m( )) and G. (so)- t )> I(Y (J )) Then, the following theorem elucidates a useful property of the switch times 4s. Theorem 2. For all tE(8 -1,a, min Am(,(),w) > X E[Wtm- > max A, (-'Y(),W) (7) wELm ( mn) E Gm (,j ) Proof: Since -y (t)='yy(,() for all tE(al,-',l, we have the following relationship from Theorem 1: p(-Ym (am - )) mX E[Wt,m-l > p(w) - a(w) mX E[W, 1] for all wEflm. Therefore, n(. m ( f) ma(w) > XE[Wt' 1] for all w satisfying a(w)<a(j '(t)), and fn (^ifn m )Dma(W) ApQYm(4 ))-p(w) < XE[W, 1] for all w satisfying a(w)>a(%,,(t)). m c>( am ( am >/ cb'(0) Q.E.D. Note that E[Wtm- 1 is a function of t. However, the function Am ('m( ),) is dependent upon configurations only. Once the optimal configuration -y(, ( ) is known, min Am (%( ),w) and max Am ('(a ),w) can be determined. The only compuwELm (8 ),.m (, ) tation needed to determine sal is to calculate E[Wt',l recursively by Eq. (5) until Eq. (7) becomes invalid. Thus, Theorem 2 provides the solution for eJ+t which is the infimum of {t I t>4 and t that does not satisfy Eq. (7) }. If Eq. (7) is valid for all t >4, then ae=oo. Theorem 2 also presents a special property of an optimal 14

Lee and Shin: Optimal Reconfiguration September 2.7, 1984 configuration. With an optimal configuration m(s84 ), Am (-y(, ),w) partitions nm into two sets, Lm (84) and Gm (.4), as shown above. An arbitrary configuration without this property can never be an optimal configuration regardless of the remaining mission lifetime. It is important to notice that.4 represents the reconfiguration time independent of the presence of a module failure or not. Moreover, the solution to the OR problem is embedded in the process of solving for e.. To explore this property and make full use of it, we present the following lemmas and theorem. Lemma 2. E[Wtm] is continuous and non-decreasing in t. Proof: >From Eq. (6), we can write a first order differential equation of E[Wt,, with respect to t as follows. dE W,, ] + mX E W,m ] = p('m (t)) + mX (1-ca('m (t +)) E[Wt,.m1] Since p(,m(t)) is bounded for all 'Ym(t)Eflm, so is E[Wt ml for all tE[O, to0 and mE{,...,mo}. Also, O<a(-ym(t))<1 for all -Y (t)Eflm. Thus, [t ] is finite dt implying the continuity of E[ Wt. and, therefore, the continuity of E[ Wt ]. To prove that E[Wt',] is non-decreasing in t, we must show that E[W4.At,j >_ E[Wt,lI for all At>O. Let RSt+At,,,M { (t) I 0<t<t+- At,,m(t)=y((t-At) if t>At, or -,(O+) if t<At }. Then, it is easy to see that the expected reward based on RSS+Atm is larger than E[Wt i. Thus, E[W*, ]~,,>E[Wtm 1 Q.E.D. Lemma 3. For real numbers ai and bi, i=1,2,3, where bl >b2>b3, al-a2 a1-a3 ii al-a2 a2-a3 al-a2 a> - i < if and only if <__-_ Also, > - a if b 1-b 2 b'l-b 3 b l-b 2 - b2-b3 b l-b2 b 61-b3 16

Lee and Shin: Optimal Reconfiguration September 27, 1984 t l- 3 fia2-a3 and only if b >-b3 2-63 b -b3 - b2-b3 Proof: The proof is trivial and thus omitted. Let 4,m (s )- Enfm, Am ('( ), )=- mi Am (' ( ),w) } Then, WEL. (OM) <> (8m )-4q if Lm (4s ). The following theorem gives a recursive relationship between m( am) and (,,(+) when 4i <oo. Theorem 3. If 4 <oo, then y'(Am+1)E (f), and c(%,(4i+l))_ a(v) for all Proof: >From the definition of switch time, 'y,(4i+) exists if 4 <oo. First, we prove that a(y,,(a,+))<a( j(4' )). Since E[WtmJ is non-decreasing, we have max Am ( ( ),WJ) < XE WJ ~ XE[W],1] wEL. (8I) for tEc(-,j+l]. Thus, Jt m(m,(4))>Jt,m (W) for all wEGm (). If a('y( ))>(y,7(4 )), then 'y,(4) is also an optimal configuration at tE(am 4+1], which is contradictory to the definition of switch time. Therefore, a(m (J+1))<a( (am)). This result also implies that 4m (4m)yb. Next, we prove -y,(4m+1)E4'm (ej). When there is one and only one wELm (,a ) the theorem automatically holds in case of a4 < o00, since,(+l) exists and a(-,,( )) cannot be larger than a(7,m (4 )). Consider a configuration wE{; I a( )<a( ( 4))<a(*(J')), jEfl) }. Since (sj+l1) is optimal and E[Wt',] is continuous (and so is E[Wm 1J), the following order can be obtained from Theorem 2: 16

Lee and Shin: Optimal Reconfiguration September 27, 1984 Am (m (J, )iw) > XE[Wal1J >I Am ( (8 + i ) )) where tE(.,,,m~m+l. Appling Lemma 3, we have Am ('p(Jm ),w) ~ Am (-y()m,'),(m )). For wEim satisfying a(y' (J+l)) <a(w)< a(m (.i )), Am (eY(a ) (,m )) E[ W,iln- I and E[Wt II >Am (,7(m m+'),w) for all tE(s,m The continuity in E[W,, 1] implies Am(7(('m+ (, ),(w). >From the second part of Lemma 3, we also obtain Am(y,'(84),w) ~ A.m(-y, (a\),-Yp(Jm)). Thus, (a..J+1')E4, (,' ). If there is only one configuration in 4M (.J ), the theorem is proved. Suppose there exist two configurations, wl and w2, then Am (.' (a, ),wl) = Am(-(),)w2) Am (W,w2). To satisfy Theorem 2, the configuration with the smallest a(w), WEE'm (s8), becomes optimal for all tE(.a,a 1J. Q.E.D. Theorem 3 indicates that, while solving for 8o+1l in which Am(m(m ),w) has to be minimized over Lm (4), the optimal configurations,,(,+l) can be obtained simultaneously. Also, we can exclude the configurations wE Gm ((s ) during the determination of,i+l and ym( i+l). Note that even if.m(,(J ) is not empty, 'ym(.i+l) does not always exist since a}, may be infinite. When there is more than one element in 4m (s4) and, <oo, the configuration with the smallest crash probability in 1m (J ) is optimal for the interval (.4, /+1]. Note, however, that all other configurations in m (.,) are optimal only at UrM (,j +6t). Theorem 3 also provides additional properties concerning t-0+s the optimal reconfiguration strategy as shown in the following corollaries. Corollary 1. a(am(,,m ))<((,4( )) and p('~m (, ))<p(7( (4,)) for all i>j. 17

Lee and Shin: Optimal Reconfiguration September 27, 1984 Proof: In Theorem 3, we proved a( 4,('))<a('y,(a )) Since Am m _(( )r(e )) >. E t-][W > 0, we have p(m ( +))< P(( )). Q. Definition: A reconfiguration process is said to be acyclic if for t 7t 2, 'AAt 3J(t 1)=- ~2)(t2) implies 'D,)(t )='ytI )(t)=y4,)(t2) for all tE[t l,t l. >From Corollary 1, ym(8m )3t'm(e4 ) for all i3j. Also, if no repair is allowed during the mission, the system degrades from the m module system to the nm-l module system in case of a module failure. Thus, we immediately get the following Corollary 2. Corollary 2. The reconfiguration process based on RSt4mo, denoted by RFtmo, is acyclic. Corollary 3. For all mE({0,...,mo) and j>o, if Am <oo, then p(.. (-i*')) p(7. (am)) a( a( -+)) - ( )) Proof: Consider a remaining mission lifetime tE(8(s,8+1. Since E[Wt. j is nondecreasing and continuous in t, t' l >0 and therefore, we have dt - (XE[Wtm- 1 from Eq. (6). >From Theorem 3, we get Ma(j( )) XE[ Wt.m-i] > A m (-m (a ),' ( +l)). (8'( Thu s, > (7 (am)) Q.E.D. Definition: The coordinates of a configuration w are defined as (a(w),p(w)) in an x-y plane. 18

Lee and Shin: Optimal Reconfiguration September 27, 1984 Using Figure 3, we can explain the relationships obtained from Corollary 3. Note that the slope of the line segment between (a(y',(s )),p(',m(s ))) and (a +)),p( +l))) is equal to mAm((),(+l)). It is easy to see from Figure 3 or Theorem 3 that Am (',(,(a M ),Y,*m( i+)) is increasing in j. Figure 3 also indicates that the coordinates of the optimal configuration, i.e. (a(y, (e8+l)),p(', (Ph+x))), is located within the triangle surrounded by (0,0), (0,p(% (4 )), and (a(-ym (84 )),p(-,,y ( I))). When there is no configuration whose coordinates are within this area or Eq. (7) is valid for all t >4,es j, becomes infinite. Though the optimal configurations can be indicated from their coordinates, the switch times have to be computed based on Theorem 2 in which no optimization problem is involved. As a final solution step for the optimal reconfiguration strategy, we have developed the following algorithm A(RS), in which 'y,,(, ) and E[Wtm] are calculated using 4m-A(4 ) and E[Wtmil for all tE[Oto]J. Complete algorithms to solve for -y (e ) will be presented in the next section. When to is large, a different algorithm, where -y,,(t+Et) and E[Wtbtm1] are computed based on y '(t) and E[Wtm for all mE({,...,mo), is better than Algorithm A(RS) insofar as the storage requirement is concerned. However, the underlying principles are the same. Algorithm A(RS): Step 1. initialization l.a For m=-,2,..m0, find -ym(m) in which n maximizes p(w), wEfm.6 When there is more than one configuration which maximizes p(w), the configuration with the smallest crash probability is chosen. eIt is easy to prove that am(es) maximizes p(w) from Theorem 1 and E[W,m ]=0. 19

Lee and Shin: Optimal Reconfiguration September 27, 1984 1.b Find E[W]=p(y1( ) (1-e-Xtex )or tE[O,t0o. (Note that reconfiguration is not needed when m=l). 1.c Choose 6t to digitize tEl[,t0]. l.d Set m:=1 and start the following recursive steps. Step 2. Set m:=m+l, t:=O, and j:=2. If m>mo, stop the algorithm. Step 3. Find I ( )Fm (s~1) and Ca(-(s i))_ a(c) for 3jE4, (/-). Step 4. Using -y (em-l) as the optimal configuration, calculate E[Wt,; m,, by Eq. (5). Step 5. Set t:=t+6t. If t>to, go to Step 2 else if A (am ( ),m ( * I))EWi go to Step 4 else j:-=j+1 and go to Step 3. 4. DETERMINATION OF OPTIMAL CONFIGURATION As we defined in Section 2, a configuration WEnl, can be specified by (n,r) where n=[nl,n2,..,nkJ represents the number of computing clusters for the task class i, and r=[r 1,r 2,..,rk I is the degree of redundancy associated with the computing clusters in the task class i. It is assumed that the rewards gained from executing different task classes k are independent. Thus, the system reward rate, p(n), is equal to pi (ni) where pi is i=1 the reward rate for the task class i. Furthermore, since all modules within the system are identical, the crash probability, a(n,r), can be represented by 1 k - - (ni+ri) ai(ni,ri) where ai (ni,ri) is the crash probability when a module assigned to the task class i fails, given there are n, computing clusters and r, redundant modules in this task class. 20

Lee and Shin: Optimal Reconfiguration September 27, 1984 Consider the execution of class i tasks. When there are ni computing clusters, the task class can be treated as an n -server system. Let Jx) denote the performance of an x-sever system. Ideally, the performance of the system is additive in x, i.e. f(x1+X2) = fx )+f(2). However, due to task communications, resource sharing, and other overheads, the performance of the system is sub-additive, i.e. zx )+A z2) ~ f(zl+zX2) In most cases, Az) is a non-decreasing concave function of z as evidenced in [20,21,22,23]. Following the evidenced tradition, we will consider here the cases in which the reward rate for each task class, p,(ni), is a non-decreasing concave function of the number of computing clusters, ni. Let h,(n) be the crash probability of an n-module redundant computing cluster in the task class i. We assume that hi (n) is a decreasing convex function of n for all i- 1,2,...,k. For simplicity, we assume that n hi(n) is also convex in n.7 Thus, the crash probability of the task class i is given by: rii +2i ri a, (nir, = (r -n r,) hi(r, +2) + (ni -ri +n ri ) where +r nT + rh where r =L= J- This implies that (ni +ri ) aj (n,,r ) is a piecewise linear and convex ni function of ri. When ni ri < ri <ni (ri +), r EI+, the slope of the corresponding linear piece is (ri +2) hi (r, +2) - (ri +1) hi (r; +1). As indicated in Algorithm A(RS), there are two nonlinear integer programming problems to be solved for deriving the optimal reconfiguration strategy. The first, called P0, is a nonlinear integer knapsack programming problem which has to be solved for determining ym ( 1n) 7Relaxation of these two assumptions will be discussed in Section 5. 21

Lee and Shin: Optimal Reconfiguration September 27, 1984 k Po: max, (ni) (8) n, k subject to ni < m and ni E+I for i=1,2,...,k. The second, called P1, is an integer fractional programming problem which is to determine 'm '(ami+1), given 'm(4 ). P1 can be expressed as: k P(wo) - pi (ni) P1: min given woEfjm (9) i m(wo) - (n, +ri ) a (ni,ri ) k k i=:1 11 subject to E(n+ri) aj(ni,ri)<ma(w0), Epi(ni)<p(O) and E (ni +ri )< m, ni,ri EI+ for i-1,2,...,k. i=1 -4.1 An Algorithm for Solving P0 The nonlinear integer knapsack problem has been considered for various applications such as resource allocation, portfolio-selection, capital budgeting, etc. Several methods have been proposed for solving this problem; for example, dynamic programming approaches [24,25], the shortest path algorithm [26], and ranking methods [27,28]. Michaeli and Pollatschek investigated the problem and provided a necessary and sufficient condition for the optimal solution [29]. In this report, we develop a simple but elegant algorithm for problem Po, under the assumption that p,(n,) is a concave function of ni.8 Let Ai (ni )==pi (ni +1) - pi (ni). The principle of the algorithm is to allocate 80ur algorithm is extremely efficient when m is not too large. According to [291, a general solution for the problem P0 is known to be intractable. However, the algorithm we developed is very simple as will be shown below. 22

Lee and Shin: Optimal Reconfiguration September 27, 1984 modules such that the system can have maximum return.9 Algorithm A o Step 1. Set ni:-=O for all i=1,2,...,k. Step 2. Select i such that Ai.(n,.) = max A,(n). 1< i<k If Ai(ni,)<0, then terminate the algorithm. Step 3. ni,.:=n.+l1 and m:=m-1. If m=O, then terminate the algorithm. Otherwise, go to Step 2. Clearly, it is not necessary to sort all Ai (n,) for every m. Ai (0) must be sorted during the first iteration. However, since n,, is changed only in later iterations, A.(n,.) has to be evaluated and inserted into the previous sorted sequence. Note that there are at most m iterations required for this algorithm to terminate. Theorem 4. The result obtained from Algorithm A0, denoted by n'=I [n,n2,., is a solution to P0. Proof: First, we introduce an additional task class k+l where pk+i(j)=O for all jEI+.10 Thus, the problem is converted to: k+1 max Cpi(ni) k+1 subject to n ni =m and n, E1+ for i-1,2,...,k+l 5=1 QThe problem Po is not in the general form of the nonlinear integer knapsack programming problem, since the coefficients of the linear constraint are all equal as indicated in Eq. (8). 10 The modules in this task class can be regarded as consisting of spare modules. So, the reconfiguration problems for a system with and without spare modules can be treated as the same. 23

Lee and Shin: Optimal Reconfiguration September 27, 1984 As shown in [29], a necessary and sufficient condition" for the above problem is pi (n, )+p (n) Pi (ni +z)+pi (n -z) for each pair i, j (i=1,2,...,k+1; j=1,2,...,k+1; and is4j) and integer z where max (-l,-n,) < z < min (n,1). Thus, to prove the theorem, we only need to show that the above equation is valid for z=1 when ni >0. If ni' and ni' are obtained from Algorithm Ao, then Ai(nj'-l)~ _,(ni,). Thus, the theorem follows. Q.E.D. Another advantage of Algorithm Ao is that all ym(R 1) can be solvred at once. By assuming m0 at the beginning, y(a 1), mE({0,...,m0} is obtained at the end of the m-th iteration. 4.2 An Algorithm for Solving PI The solution of P1 can be divided into two levels: the lower level is to determine ri, i=1,2,..,k by minimizing the objective function (9) for a given n; the higher level problem is to determine ni' by minimizing the objective function (9) with the calculated ri, from the lower level. Since the only place that ri's appear is in the denominator of the objective function, the lower level problem can be stated as follows: k P2: min (ni+ri) ai(ni,ri) (10) r, =l k k subject to ri < m- ni and ri EI' for i=1,2,...,k i=1 -i=1 The problem P2 is an integer knapsack programming problem which is to minimize the sum of nonlinear functions. If (ni +ri) aj(ni,ri) is convex with respect to ri, we n The necessary and sufficient condition given in [29j is used for minimizing the sum of convex functions. Here, we consider the maximization of the sum of concave functions. 24

Lee and Shin: Optimal Reconfiguration September 27, 1984 can apply an algorithm similar to A 0 in which we choose min ((n,+ri +) at(ni,r+l) -(ni+ri) ai(ni,ri)} in place of max (p (n, +1)-p, (n, )) in Algorithm A0. When (n, +r ) as(n,,r ) is not convex, the lower level problem becomes a non-convex programming problem. There is no guarantee that the solution obtained by Algorithm A 0 is the global minimum. k Let /B(n)=min C (ni +ri ) ac (ni,ri ) obtained from solving P2, given n. Thus, the i i-=1 problem P1 can be converted to the following form: P3: min (11) "I mta(wo)-#(n) subject to d?(n)<ma(wo) and p(n)<p(w0) k ni < m and ni EI+ for i —1,2,...,k. i —=1 >From Corollary 3, the triangle area surrounded by (0,0), (0,p(n~)), and (i1 (no),p(n0)) is a feasible region described in terms of the configuration coordinates. m Since there does not exist an explicit form of mapping from the configuration coordinates to n, an enumeration is the only means to find whether a configuration is within this triangle region of the configuration coordinates. Consider the use of an explicit enumeration (or brute force enumeration) for solving k P3. For every possible combination of ni satisfying the constraint n- <m, we map the 3=1 p(wo)p(n) combination into ri,, (n), and ma(wo)-_(n). Then, the configuration with the minimum ratio is chosen as the optimal configuration. When there are k task classes and m modules available, the total number of combinations in n to satisfy the constraint 25

Lee and Shin: Optimal Reconfiguration September 27, 1984 k Enj <m is (mk). Thus, when the size of the system is moderate, the explicit i-=1 enumeration approach is reasonable. For example, when k=3 and m=12, the total number of enumerations is 455. Moreover, there are two important and advantageous properties associated with the explicit enumeration: (i) The coordinates of all feasible configurations for m available modules, that is (1 Bf(n),p(n)), can be obtained from the enumeration. Thus, from m Theorem 3, we can easily determine all optimal configurations, ",(81 ); (ii) When we solve problem P2 for ri' with m=mo, the other ri's are determined simultaneously for all mE{ 3 n,...,m 0)}. We obtain the coordinates, incorporated with p(n), of all feasi-=1 ble configurations for mEl{1,...,mo). Therefore, the optimal configurations yt,(a, ) for mE({,...,mo) can be determined. These two properties lead to a situation in the determination of reconfiguration strategy that the explicit enumeration has to be conducted only once. When both k and m are large, the explicit enumeration becomes intractable. An immediate candidate for approximate solutions is to aggregate task classes into a small number of groups and then perform the explicit enumerations for the system and for each group. Remarks on Fractional Programming Problems For a general fractional programming problem, the objective function in Eq. (11) is no longer convex or concave with respect to ni even if fl(n) is convex or concave with respect to ni [30]. For a continuous nonlinear fractional programming problem, some equivalent or dual problems have been proposed in [31,32]. With integer constraints, 28)

Lee and Shin: Optimal Reconfiguration September 27, 1984 Chandra and Chandramohan [33] suggested to apply a branch and bound method after solving the continuous problem. However, no example or analysis is provided to show the efficiency of their algorithm. A recent survey on the methods of solving the fractional programming problem [34] has discussed three different state-of-the art approaches. The first approach uses variable transformation and is probably the most efficient method for linear fractional programming problems. The second approach deals with the problem as a nonlinear programming problem and applies a suitable search algorithm to find the solution. The third approach uses an auxiliary parameterized problem which is briefly discussed below. Let F3 denote the feasible region for P3. Define the auxiliary problem Q(q) by Q(i): min p(wo) - p(n) - -(ma(wo) - (n)) n E F3 Let z(q) be the minimum value of Q(q/). It is shown that if z(iq)=O, then an optimal solution of Q(r/) is also an optimal solution of P3 [34]. Hence, an algorithm is needed to search for q' such that z(rtq )0O by solving Q(qr) iteratively. Thus, the complexity of this algorithm is based on the efficiency of solving Q(tl) and the search algorithm, e.g. Newton's method, binary search, etc. Meggido [35] proposed an algorithm which combines the search for r' and the dynamic programming approach to Q(q). The resulting algorithm requires O(km2 log m) evaluations of Bf(n) and p(n). Note that the problem Q(q/) is the same as the problem OR, i.e. a nonlinear knapsack programming problem. It might be more efficient to solve the OR problem directly instead of searching for /'. When d(n) is convex with respect to n;, the objective function becomes concave with respect to n, implying that Algorithm A0 can be applied. 27

Lee and Shin: Optimal Reconfiguration September 27, 1984 The algorithm A 1 which includes Algorithm A 0 is provided below using an operator Pi and a function A, defined by: I (n)_(n l,n2,,nj-l ni +l,ni+l,. ~ ~,nk ) Ai (n) -p(Pj (n)) - p(n) - XE[Wt,,,](f(Pi (n))-l(n)) Algorithm A 1 Step 1. Set L:=O and U:=t0. Step 2. (a). Set th:=L + h(U-L) for h=1,2,...,N. N (b). For h=1,2,...,N (bi). Set dm:=m and ni:=O for i=1,2,...,k. (b2). Solve for #(Pi (n)) as indicated in the solution of P2 and Ai (n) for i= 1,2,...,k. (b3). Select i~ such that A,. = max A i (n). If Ai,(n)O0, go to (b5). 1<i<k (b4). n:=P,.(n) and dm:=dm-1. If dm40, go to (b2). (b5). Set y (th ):=n. Algorithm A 1 determines the optimal configurations for the small number of partitions of the mission lifetime. If the optimal configurations at th and th+1 are different, the solution can be refined by setting L=th, U=-th+ and then repeating Step 2 of A 1. This will lead to a more accurate switch time between th and th+l. Although Algorithm A 1 only provides one level partitions, it is easy to extend to more refined partitions. 4.3. An Example of the Optimal Reconfiguration Strategy Consider a system with 12 modules at the beginning of the mission. The module failure rate is assumed to be 0.0005 and the mission lifetime is 1000. The tasks to be 28

Lee and Shin: Optimal Reconfiguration September 27, 1984 executed during the entire mission are grouped into three classes whose respective reward rates pi(ni) and crash probabilities h,(n) for i=1,2,3 are listed in Table 1. In addition, we include the constraints ni >1 for i=-1,2,3, indicating that at least one computing cluster must be available for each task class throughout the entire mission. Applying the explicit enumeration given in Section 4.2, we can easily find all m(sl ) for 4<m<12 which are listed in Table 2. Table 3 gives the switch times for each optimal configuration. The optimal reconfiguration strategy is obtained from the combination of both of these two tables. Notice that certain optimal configurations will never be used. For instance, '75( 3) is one one of them since 3=00. As shown in this example, the optimal reconfiguration strategy is derived off-line in table form before the mission. For on-line use of the strategy the system only needs to look up the tables of optimal configurations and switch times (e.g. Tables 2 and 3). In this way, the system will follow an optimal reconfiguration trajectory using (i) the switch times in the same row of the two tables for the case of no module failure or (ii) row changes in the tables and then the switch times in case of module failures. In Figure 4, the configuration trajectories, reward rates, and crash probabilities are plotted if the module failures occur when RML are 800, 520, 400, and 310.12 5. DISCUSSION 5.1. Concluding Remarks In this report, we have addressed the problem of reconfiguring non-repairable multi-module systems. Since we treated the problem for general multi-module computing systems, both the problem formulation and the solution approach of this report have 2These are random and known only via failure detection mechanisms. For a demonstrative purpose, we used arbitrarily chosen values. 29

Lee and Shin: Optimal Reconfiguration September 27, 1984 high potential use for designing the growing number of fault-tolerant multiprocessors and computer networks. Given multiple modules, computing clusters with appropriate redundancy for each task class are formed so that the resulting system may meet both requirements of performance and reliability in an optimal fashion. Because of the inherent trade-off between performance and reliability, we need to determine system configurations which specify the number of computing clusters and redundant modules for each task class. In addition to the conventional prassive reconfiguration strategy which is invoked only upon detection of a module failure, we have shown the need of an active reconfiguration strategy which allows the system to reconfigure itself as the mission progresses, regardless of the occurrence of module failure. Thus, the active reconfiguration strategy provides the optimal configurations by taking into account both the progression of the mission and the degradation of the system due to module failures. Using the expected total reward or Meyer's performability as the criterion for determining the optimal configurations, we have explored the properties of the optimal configurations, which are useful for deriving solutions. A feasible region is described in terms of the configuration coordinates. Although it is easy to find the configuration coordinates, the inverse mapping from the coordinates to a configuration does not exist in closed form, thereby requiring less elegant enumerations. In order to derive the optimal configurations, two nonlinear integer programming problems have to be solved. The first is a knapsack problem for which we have developed a simple but elegant algorithm. The second is a fractional programming problem, which is basically a non-convex programming problem. For the fractional programming problem we have used an explicit enumeration which is effective for small m and k. 30

Lee and Shin: Optimal Reconfiguration September 27, 1984 We have also discussed the other approaches known for solving the fractional programming problem. They do not appear any better than the explicit enumeration or solving the optimization problem, OR, directly. Also, note that the elimination of this assumption is equivalent to dynamic classification of incoming tasks. As shown in the example of Section 4.3, the optimal reconfiguration strategy can be represented by two tables: one is for optimal configurations; the other for switch times. Although the solution procedures to obtain these two tables are complex, they can be computed off-line. The real reconfiguration during the mission is performed just by looking up these two tables. 5.2. Extensions Several assumptions that we used can be relaxed at the expense of a more complex optimization procedure. For instance, the reward rate could be affected by the degree of redundancy incorporated in a task class. In such a case, we need to change p(n) to p(n,r). The optimization problem P1 can no longer be decomposed into two levels. When the concavity of pi (n) and the convexity of a(n,r,) do not exist, the optimization problems become non-convex and thus, are more difficult to solve. We can also remove the assumption that the reward rate and the crash probability must be stationary, i.e. independent of the mission lifetime. For such a case, let p(w,t) and c(w,ct) be used in place of p(w) and a(w), respectively. The problem OR, then, becomes to maximize Jt,m (w)-p(w,t) - a(w,t)mXE[ Wtm-1] for wEn,. Obviously, this time dependency does not increase any complexity if we solve the problem OR directly. 31

Lee and Shin: Optimal Reconfiguration September 27, 1984 In place of the total expected reward, a generalized objective function for the optimal configurations could be defined as E[R(Womo)]. Note that R(W) should be non-decreasing but may not be continuous. It could be a step function as in the example in [19], or any other discontinuous function with finite jumps. When R( W) is not additive, i.e. R(W1+W2)yQR(W1)+R(W2), the optimal configuration at the current moment is dependent on the total reward accumulated up to the current moment. Hence, the determination of an optimal configuration must consider both the past and future configuration trajectories, thereby making the optimization problem very complex and difficult. The difficulty can be foreseen by comparing the optimal configuration problem with the general stochastic optimal control problem. The optimal control problem usually uses the summation (or integration) of the functions of variables as an objective function to be optimized. However, the optimal configuration problem requires the use of a function of the summation of variables, making the decomposition of these variables impossible. This is a matter of further research. ACKNOWLEDGMENT The authors wish to thank Michael Woodbury for carefully reading this report and making many useful comments. Also, they are indebted to Rick Butler and Milton Holt of NASA for their technical and financial support. 32

Lee and Shin: Optimal Reconfiguration September 27, 1984 REFERENCES [1] H. J. Siegel, R. J. McMillen, and P. T. Mueller, Jr., "A Survey of Interconnection Methods for Reconfigurable Parallel Processing System," Proc. AFIPS 1979 Nat. Comp. Conf., pp. 529-542, June 1979. [2] S. L. Kartashev and S. P. Kartashev, "A Multicomputer System with Dynamic Architectures" IEEE Trana. Comput., Vol. C-28, No. 10, pp. 704-721, Oct. 1979. [3] G. B. Adams and H. J. Siegel, "A Fault-tolerant Interconnection Network for Supersystems," IEEE Trana. Comput., Vol. C-31, pp. 443-454, May 1982. [4] R. Troy, "Dynamic Reconfiguration: An Algorithm and its Efficiency Evaluation," Proc. 9th Int. Syrmp. Fault-tolerant Comput., pp. 44-49, 1979. [5] G. Barigazzi, A. Ciuffoletti, and L. Strigini, "Reconfiguration Procedure in a Distributed Multiprocessor System," Proc. 12th Int. Symp. Fault-tolerant Comput., pp. 73-80, 1982. [61 K. H. Kim, " Error Detection, Reconfiguration and Recovery in Distributed Processing System," Proc. 1st Int'l. Conf. on Distributed Computing Systems, pp. 284 -295, Oct. 1979. [7] R. Y. Kain and W. R. Franta, "Interprocess Communication Schemes Supporting System Reconfiguration," IEEE COMPSAC, pp. 365-371, 1980. [8] F. Saheban and A. D. Friedman, "Diagnostic and Computational Reconfiguration in Multiprocessor System," A CM Proc. of 1978 Annual Con., pp. 68-78, 1978. [9] F. Saheban and A. D. Friedman, "A Survey and Methodology of Reconfigurable Multi-Module System," IEEE COMPSAC, pp. 790-796, 1978. [10] J. A. B. Fortes and C. S. Raghavendra, "Dynamically Reconfigurable FaultTolerant Array Processors," Proc. 14th Int. Symp. Fault-tolerant Comput., pp. 386-392, 1984. [11] K. G. Shin and Y. H. Lee, "Error Detection Process-Model, Design and Its Impact on Computer Performance," IEEE Trans. Comput., Vol. C-33, No. 6, pp. 529-540, June 1984. [12] Y. H. Lee and K. G. Shin, "Optimal Design and Use of Retry in Fault Tolerant Real-time Computer Systems," CRL-TR-28-84, Computing Research Lab, The University of Michigan, Ann Arbor, MI, May 1984. 33

Lee and Shin: Optimal Reconfiguration September 27, 1984 [13] C. M. Krishna and K. G. Shin, "Performance Measures for Multiprocessor Controllers," Performance '83: Ninth Int'l Symp. Comp. Perf., Meas., and Eval., pp. 229-250, 1983. [14] J. J. Stiffler and L. A. Bryant, "CARE III Phase Report - Mathematical Description," NASA Rep. 3566, Nov. 1982. [15] K. Trivedi, J. B. Dugan, R. Geist, and M. Smotherman, "Modeling Imperfect Coverage in Fault-Tolerant System," Proc. 14th Int. Symp. Fault-tolerant Comput., pp.77-82, 1984. [16] J. F. Meyer, "On Evaluating the Performability of Degrading Computer Systems," IEEE Trans. Comput., Vol. C-29, No. 8, pp. 720-731, August 1980. [17] J. F. Meyer, "Closed-form Solutions of Performability," IEEE Trans. Comput., Vol. C-31, No. 7, pp. 648-657, July 1982. [18] D. G. Furchtgott, "Performability Models and Solutions," CRL-TR-8-84, Computing Research Laboratory, The University of Michigan, Ann Arbor, Michigan, 1984. [19] L. Donatiello and B. R. Iyer, "Analysis of a Composite Performance Reliability Measure for Fault Tolerant Systems," RC-10325, IBM Thomas J. Watson Research Center, Yorktown Heights, NY, Jan. 1984. [20] W. W. Chu, L. J. Holloway, M.-T. Lan, and K. Efe, "Task Allocation in Distributed Data Processing," Computer, Vol. 13, No. 11, pp. 57-70, Nov. 1980. [21] D. P. Bhandarkar, "Some Performance Issues in Multiprocessor System," IEEE Trans. Comput., Vol. C-26, No. 5, pp. 506-511, May 1977. [22] S. H. Fuller, J. K. Ousterhout, L. Raskin, P. L. Rubinfield, P. J. Sindhu and R. J. Swan, "Multi-microprocessors: An Overview and Working Example," Proc. IEEE, Vol. 66, No. 2, pp. 216-228, Feb. 1978. [23] W. A. Wulf and C. G. Bell, "C.mmp - A Multi-mini-Processor," 1972 Fall Joint Computer Conf., AFIPS Cof. Proc., Vol. 41, pp. 765-777, 1972 [24] M. W. Cooper, "The Use of Dynamic Programming Methodology for the Solution of a Class of Nonlinear Programming Problem," Naval Research Logistics Quarterly, Vol. 27, pp.89-95, 1980 [25] T. L. Morin and R. E. Marsten, "An Algorithm for Non-linear Knapsack Problem," Management Science, Vol. 22, No. 10, pp. 1147-1158, June 1976. 34

Lee and Shin: Optimal Reconfiguration September 27, 1984 [26] A. M. Frieze, "Shortest Path Algorithms for Knapsack Type Problem," Mathematical Programming, Vol. 11, pp.150-157, 1976. [27] H. Luss and S. K. Gupta, "Allocation of Effort Resources among Competing Activities," Operations Research, Vol. 5, pp. 613-629, 1975. [281 P. H. Zipkin, "Simple Ranking Methods for Allocation of one Resource," Management Science, Vol. 26, No. 1, pp. 34-43, Jan. 1980. [291 I. Michaeli and M. A. Pollatschek, "On Some Nonlinear Knapsack Problem," Annals, of Discrete Math., Vol. 1, pp. 403-414, 1977. [30] S. Schaible, "Minimization of Ratios," Journal of Optimization Theory and Application, Vol. 19, No. 2, pp. 347-352, June 1976. [31] B. D. Craven and B. Mond, "On Fractional Programming and Equivalence," Naval Research Logistica Quarterly, Vol. 22, pp. 405-410, 1975. [32] S. Schaible, "A Survey of Fractional Programming," Generalized Concavity in Optimization and Economics, edited by S. Schaible and W. T. Ziemba, Academic Press, pp. 417-440, 1981. [33] S. Chandra and M. Chandramohan, "A Branch and Bound Method for Integer Nonlinear Fractional Programs'," Z. Angew. Math. and Mech., Vol. 60, No. 12, pp. 735-737, 1980. [34] T. Ibaraki, "Solving Mathematical Programming Problems with Fractional Objective Functions," Generalized Concavity in Optimization and Economics, edited by S. Schaible and W. T. Ziemba, Academic Press, pp. 441-472, 1981. [35] N. Megiddo, "Combinatorial Optimization with Rational Objective Functions," Mathematics of Operations Research, Vol. 4, pp. 414-424, 1979. 35

Figure 1. A Model of System Degradation.

optimal conriguration IMO mYM~~( M(n ) V 3() 8- 'Ym( 8f) yf( r1) 0m 8 m to remaining mission lifetime Figure 2. The Switch Times and Optimal Configurations.

/ - tan 0 - m/ Figur 3. he Coo iate (s( (< (o Opi a Configuation! J j / / ~,!! igre.Tonasl (w Figure 3. The Coordinates of Optimal Configurations.

reward rate crash probability 6.0 reward rates *s) 1 51~(82~ (2 '(4 5.0 - ';o(o10) 0.5 4 0.4.,l(,l() crash probability r ---J -0.3 4.0 I _ I ~~~,,,J~~~ 10.2 1'2(812) 8 0.1 3.0 1000 75500 250 0 Remaining Mission Lifetime Module fails when the remaining lifetime is 800, 520, 400, 310. Figure 4. Example of Optimal Reconfiguration Trajectory.

n 1 2 3 4 5 6 7 8 hl(n) 0.6 0.3 0.1 0.05 0.005 0.003 0.002 0.001 pl(n) 1.0 1.8 2.4 2.9 3.3 3.6 3.8 3.9 h2(n) 0.5 0.25 0.1 0.05 0.025 0.013 0.005 0.002 0.8 1.5 2.2 2.8 3.4 4.0 4.6 5.1 h:~(n) 0.3 0.1 0.05 0.01 0.05 0.03 0.02 0.01 p:3( A) 0.6 1.2 1.7 2.2 2.7 3.2 3.6 4.0 Table 1. Crash Probabilities h,(n) and Reward Rates p,(n) Used in the Example.

|SM(9) | E( 8,m) |f tM'(9m|atm 8,mtMm zm(Um) |Y a m m(. ) |Y am(8 ) t (2s 2 ( 1 2 O O 0 1 0 _ t,=5 (2 2 1 (2 1 2 ( 11 ( II I. ~ ~~ O O 0_,O 1 0 1 1._ ~ood oo~ o~ l m =6 (2 3 1 (2 2 2 2 1 1 1 2 111 o ( 0 o O 0 01 1 1 2 1 ~~~~......... tn=7 61 2 3 (2 1 2 I12 1( || 0o O( 2 o0 0 2 1 (r2 1 1 na-8 (2 4 2 2 3 3 C2 3 (r2 2 2fo2 ' 1 1 2E 111 0 0 ( 0 ooa 1 0 0 2 0 ( 2 1 1) 2 2 1 m,=I) (252 5 23 (252 2 22 (2 21 (1 12) (1 1 0 0 (Y 0 0 O 2 0 0} 2 1 OJ 3 1 ( 2 1 2.3 2 1 I 0=lO 2 26 2235, (2233 2 232 2 22 2 1 2 1(2 1 1) If1 1 2r11, 0 0 o 0 0 2 0 2 1(Y 2 2 4 1 4 1 1.222 33 1,,l-i1 (,2 2, 2.3 2 3 4 (2 (2 2 {2 1 28) (211 (111 ooO 0 oo 2( 0 22d 32( 4 1 42 (3,32) a=-12.3 7.3 72 2 7 3) 2 4 6 (2 3 61 2 3 58 f2 3 2 1 2 12 2 1I'2 11 1 0 0 0 0 o 0 0 1 0 2 0 2 3 0 4 1 2 5.2 1 4 32 configuration: nr r2 r31 Table 2. Optimal Configurations for the Example

0 1 23 48 2 5 6 7 8 |2 mm 5m 8m 8m 8m 8m m-4 0 1194 oo - - n —5 0 436 1391 oo - m-6 0 349 938 1150 oo - - - - m=7 0 755 1364 2813 - - m —8 0 254 573 635 739 1379 oo - - m=9 0 227 506 643 760 1108 0c - - m 10 0 206 455 541 574 807 1098 1611 oo m — 11 0 188 413 490 602 716 1321 oo - m ---12 0 104 173 314 378 447 646 oo oo Table 3. Switch Times for the Example

UNIVERSITY OF MICHIGAN 3 9015 03023 075211111111 11111111111111 3 9015 03023 0752