THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY

PERFORMANCE MEASURES FOR MULTIPROCESSOR CONTROLLERS

C.M. Krishna
K.G. Shin

CRL-TR-1-82

OCTOBER 1982

Room 1079, East Engineering Building, Ann Arbor, Michigan 48109 USA
Tel: (313) 763-8000

PERFORMANCE MEASURES FOR MULTIPROCESSOR CONTROLLERS

C. M. Krishna and K. G. Shin
Department of Electrical and Computer Engineering
The University of Michigan
Ann Arbor, MI 48109

ABSTRACT

In this report, we consider some new performance measures to characterize fault-tolerant multiprocessors used in the control of critical processes. Our performance indices are based on controller response time. By relating this to the needs of the application, we have been able to derive indices that faithfully reflect the performance of the multiprocessor in the context of the application, that permit the design, evaluation, and objective comparison of rival computer systems, and that can be either definitively estimated or objectively measured. Using the example of a controller in an idealized satellite application, the computation and some uses of the performance measures are illustrated.

1. INTRODUCTION

1.1. Motivation

Over the last few years, inexpensive, powerful, and reliable microprocessors have become available. At the same time, analytical, simulation, and modelling techniques for use in computer communication networks have been developed. Multiprocessors are therefore becoming attractive. One special application to which they can contribute a great deal is reliable control. Several reliable multiprocessors have been proposed and a smaller number built. Among the latter, one might mention the Fault Tolerant Multiprocessor (FTMP) [1] and the Software Implemented Fault Tolerance (SIFT) machine [2], both built under contract to NASA for the control of civilian aircraft of the next decade. Reliability requirements are stringent: the benchmark figure employed by NASA is that the probability of controller failure should not exceed 10^-9 for a ten-hour flight. Naturally, such a benchmark begs the question of how "failure" or the "performance" of a complex multiprocessor system is to be defined.
In this report, we present some measures that are appropriate in this context and, by extension, in the context of other control applications -- nuclear reactors, life-support systems, etc. -- where controller failure can lead to catastrophic consequences.

1.2. The Nature of the Control Function

Any real-time system (of which a computer controller is an example) consists of three communicating parts: the data acquisition, data processing, and output sections. The data acquisition section consists of sensors, input panels, and other associated equipment; the processing section consists of the computer (in our case the multiprocessor); and the output section consists of mechanical actuators, displays, and other output devices. The system may logically be regarded as a three-stage pipeline.

The set of tasks to be executed by a control system is predetermined, and the nature and behavior of the software are known in advance -- at least in outline -- to the designer. The multiprocessor controller is therefore a rather specialized device; it might indeed be considered custom-built for a particular application. This fact makes it both easier and more necessary to obtain a reasonably good performance analysis of the system. The controller software in the processing section consists of a set of tasks, each of which corresponds to some job to be performed in response to particular sets of environmental stimuli.

1.3. The Need For New Performance Indices

The determining characteristic of a computer controller's performance is a combination of reliability and high throughput. The throughput requirements arise from the need for quick system response to environmental stimuli. Speed is of the essence in a real-time controller since failure can occur not only through hardware failure in the system, but also on account of the system not responding fast enough to events in the environment. This fact places a premium on controller response time and leads naturally to the work presented in this report. As a result of these special performance requirements, performance measures used to characterize general-purpose uniprocessor systems are no longer appropriate for multiprocessor controllers.
Conventional reliability, throughput, and availability by themselves have little meaning in the context of control; a suitable combination of these is necessary. New performance measures are required: measures that are congruent to the application, that permit the expression of specifications reflecting without contortion the true system characteristics and application requirements, and that allow an objective comparison of rival systems for particular applications. We cannot stress too heavily that it is meaningless to speak of the performance of a computer out of the context of its application. The form the performance measures take must reflect the needs of the application, and the computer system must be modelled within this context. The multiprocessor controller and the controlled process form a synergistic pair, and any effort to study the one must take account of the needs of the other. It is important that performance measures should depend on parameters that can be definitively estimated or objectively measured. Much of the work published so far on characterizing the "goodness of fit" between the attributes offered by a computer system (reliability, throughput, etc.) and those required by the application depends on parameters obtained through essentially subjective means. In contrast, our measures depend on objectively measurable quantities, i.e., controller response time for the various system tasks.

1.4. Organization of Paper

In Section 2, we discuss our new performance measures, and in Section 3 means for practically obtaining these quantities. An example is presented in Section 4, and the report concludes with Section 5.

2. PERFORMANCE MEASURES

2.1. Survey

The concept of systems that are, through built-in redundancy, much more reliable than any of their components is not new. Some pioneering work was done by von Neumann [4] and by Moore and Shannon [5], both in 1956. Work in modelling fault-tolerant multiprocessors was done by Bouricius et al. [6], who took into account the impact of transient and permanent failures, by Borgerson and Freitas [7], who studied the PRIME computer as a gracefully degrading system, and by other researchers who, like their precursors, used Markov models in the representation of these systems [8]-[11]. With only one exception ([8]), the work referred to above assumed traditional performance measures. Other workers recognized the shortcomings of these and defined new ones. Among them, one might mention Beaudry [12] and Meyer [13]. The chief drawback of Beaudry's measures is that they express the capability of the system as a whole, not with respect to particular tasks. The computer is treated as a monolith in terms of the services it delivers. The implicit presupposition is that all tasks are qualitatively of the same form and have practically the same "cost" as a function of the response time2. However, the tasks that a real-time computer controller executes typically vary widely both in the demands they make on various entities of the multiprocessor and in their relative importance. The performance measures of Beaudry thus express the performance of the computer without reference to the particular needs of the application, i.e., without discriminating between the impact on performance of individual jobs.
Meyer's concept of performability represents a useful advance in the search for appropriate performance measures. However, the actual performance measures used sometimes require further refinement. For example, in [13], Meyer employs as the performance measure Y_S, the fraction of arrived tasks that are processed by a system S in an interval of time [0,t]. This, as in the case of Beaudry's performance measures, allows for no discrimination between individual controller tasks. Again, in none of the work cited above is there any attempt to systematically study the actual cost of the overhead incurred in computation. To remove, at least partially, the limitations of the work cited above, we present here some new performance measures. We begin with some definitions and some basic concepts.

2. Or equivalently, that it does not matter if they do not.

2.2. Definitions and Basic Concepts

Because of the stochastic nature of the transient and/or permanent failures of components in a multiprocessor, the load-dependent blocking at shared resources, and conditional branches in the software, a probabilistic model is required to properly represent the behavior of a multiprocessor controller. A state-space model can be formulated, with the states representing the current operational capacity of the system3. Multiprocessor response time is a function of the state. System response time is the time interval between the moment a task is triggered and the moment the resulting actuator and/or display output occurs. A task is triggered when some set of events in the operating environment initiates its execution. If the environment is stochastically stationary, the intertrigger intervals have stationary distributions. Every time a task is triggered, a unique version of it is created. This version is called an extant version until its execution is complete. Versions are numbered in sequence of triggering, successive versions of task i being denoted by i.1, i.2, etc. The response time associated with a version is denoted by RESP(i,j), under the assumption that the system continues operating until the version completes executing. (This is clearly only valid so far as the expected response time is finite.) The extant time of a particular extant version of task i is the difference between the absolute (or system) time at present and that at the moment of triggering. When an extant version finishes executing, its extant time is frozen by definition. Thus, the extant time at time t of a version i.j that was triggered at time τ_{i,j}, with the system in state n_{i,j}, is defined by

    EXT[i, j, τ_{i,j}, n_{i,j}, t] = min[ t − τ_{i,j}, RESP(i,j) ].

A controller task may be critical or non-critical. A critical task has a hard deadline [15], the violation of which leads to dynamic failure. The deadlines are generally random variables. Non-critical tasks have no such deadlines4.
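The extant-time definition above fits in a few lines of code. The sketch below is illustrative only; the function and variable names are not from the report.

```python
# Sketch of the extant-time definition EXT(i, j, tau_ij, n_ij, t) from
# Section 2.2, assuming RESP(i, j) for the version is already known.
# Names (ext_time, resp) are illustrative, not from the report.

def ext_time(t: float, trigger_time: float, resp: float) -> float:
    """Extant time of a version triggered at trigger_time, observed at t.

    While the version runs, its extant time grows as t - trigger_time;
    once it completes (after resp time units), the extant time is frozen
    at resp, exactly as in EXT[...] = min[t - trigger_time, RESP(i, j)].
    """
    assert t >= trigger_time, "version not yet triggered"
    return min(t - trigger_time, resp)

# A version triggered at t = 2.0 with response time 1.5:
print(ext_time(3.0, 2.0, 1.5))  # still running: 1.0
print(ext_time(5.0, 2.0, 1.5))  # completed: frozen at 1.5
```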
The mission lifetime is the duration of operation between successive service stages.

2.3. The Performance Measures

We define a cost function C_i(ξ) associated with system response time ξ for task i. For critical tasks, the cost function takes the following form:

    C_i(ξ) = g_i(ξ)   if ξ ≤ t_di
           = ∞        otherwise                                                    (1)

where g_i(ξ) is a suitable continuous function of ξ that increases monotonically5 in [0, t_di] and is zero for all ξ > t_di, and t_di is the hard deadline associated with critical task i. For non-critical tasks, C_i(ξ) is continuous and monotonically increasing, and therefore always bounded for a finite response time. For consistency, we assume that the costs accrue as execution proceeds. In other words, if task i was triggered τ units ago and has not yet been completed, its contribution to the cost so far accrued is C_i(τ).

3. The actual definition of the states depends on the system that is being modelled. Choosing a suitable formulation to facilitate system analysis is often a challenging task.
4. Note that dynamic failure encompasses the traditional notion of catastrophic hardware failure as well as other causes (software crashes, system reconfiguration, electromagnetic interference on the communications network, etc.) of missing one or more hard deadlines.
5. In practice, all that can be assured of the behavior of g_i is that it is monotonically nondecreasing. However, we shall need to invert the cost function, and so we assume a strictly increasing cost function in [0, t_di].
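As a concrete illustration of the cost function of Eq. (1), here is a minimal sketch assuming a linear g_i; the report only requires g_i to be continuous and increasing on [0, t_di], so the linear form, the names, and the numbers below are all illustrative assumptions.

```python
import math

# Illustrative cost function C_i from Eq. (1), with an assumed linear g_i.

def make_cost_fn(t_d: float, slope: float = 1.0, critical: bool = True):
    """Return C_i(xi). For a critical task the cost is g_i(xi) up to the
    hard deadline t_d and infinite beyond it (dynamic failure); for a
    non-critical task it is finite for every finite response time."""
    def cost(xi: float) -> float:
        if critical and xi > t_d:
            return math.inf          # deadline violated -> dynamic failure
        return slope * xi            # assumed g_i(xi) = slope * xi
    return cost

c1 = make_cost_fn(t_d=0.5, slope=2.0, critical=True)
print(c1(0.3))   # within the deadline: 0.6
print(c1(0.7))   # past the deadline: inf
```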

The cumulative cost function associated with task i, Γ_i(t), is defined as follows:

    Γ_i(t) = Σ_{j=1}^{q_i(t)} C_i( EXT(i, j, τ_{i,j}, n_{i,j}, t) )                (2)

where there have been q_i(t) triggers for task i in the interval [0,t]. The system cost function in a system with r different tasks is defined as:

    S(t) = Σ_{i=1}^{r} Γ_i(t).                                                     (3)

Let L(t) = Prob{Mission Lifetime ≤ t}, Q the set of states of the controller, and A ⊂ Q the subset of states in which failure is not certain. Note that failure is certain when the expected response time goes to infinity. This occurs either when the system is overloaded, with a utilization exceeding unity, or when hardware failure is sufficiently massive to render the system dead for all practical purposes. Define B = Q − A. Our performance measures are then6:

    Cost Index,                      K(x) = ∫_0^∞ Prob{S(t) ≤ x} dL(t)                                  (4)
    Probability of dynamic failure,  p_dyn = ∫_0^∞ Prob{S(t) = ∞} dL(t)                                 (5)
    Mean Cost,                       M = ∫_0^∞ E{S(t) | system never enters state set B} dL(t)          (6)
    Variance Cost,                   V = ∫_0^∞ Var{S(t) | system never enters state set B} dL(t)        (7)

We define also the following auxiliary measures:

    Cost Index for Task i,     K_i(y) = ∫_0^∞ Prob{Γ_i(t) ≤ y} dL(t)                                    (8)
    Mean Cost for Task i,      M_i = ∫_0^∞ E{Γ_i(t) | system never enters state set B} dL(t)            (9)
    Variance Cost for Task i,  V_i = ∫_0^∞ Var{Γ_i(t) | system never enters state set B} dL(t)          (10)

The auxiliary measures are principally for use during the design phase. To accurately compute the mean and variance cost indices would require us to take account of the fact that the mission might end before one or more computations is completed. Termination of the mission implies a cessation of all controller activity. One would therefore have to incorporate into the calculations the probability of the mission ending before some tasks have been completely evaluated.
This is not difficult to do, but it is tedious, and when the mission lifetime is long compared to both the expected task service time and the mean intertrigger interval, it is often useful to substitute the following modified parameters:

    Modified Cost Index,  K̃(x) = (1 − p_dyn) ∫_0^∞ Prob{S̃(t) ≤ x} dL(t)   for x < ∞
                          K̃(x) = 1                                          for x = ∞     (11)

(the remaining probability mass, p_dyn, is concentrated at x = ∞).

6. The Cost Index is generally only for use when rival multiprocessors are being compared.

    Mean Modified Cost,      M̃ = ∫_0^∞ E{S̃(t) | system never enters state set B} dL(t)     (12)
    Variance Modified Cost,  Ṽ = ∫_0^∞ Var{S̃(t) | system never enters state set B} dL(t)    (13)

and the auxiliary measures K̃_i, M̃_i, and Ṽ_i are defined analogously, with

    Γ̃_i(t) = Σ_{j=1}^{q_i(t)} g_i( RESP(i,j) )                                              (14)

and S̃(t) = Σ_{i=1}^{r} Γ̃_i(t). Also, from physical considerations, there exists a T < ∞ such that for all t ≥ T, L(t) = 1, so that the above integrals always converge. For values of t much greater than either the mean task completion time or the intertrigger interval, the modified costs are good approximations to their exact counterparts. This is of value, since it applies to most mission lifetimes (for aircraft, mission lifetime can range from 30 minutes to 10 hours, while for spacecraft, it can range up to several months), and the modified parameters are much easier to compute. In the examples presented in this report, we evaluate modified values throughout.

2.4. Expressions for the Performance Measures

Let P_v(t) be the probability that at time t the system is in state v ∈ Q, and P̂_a(t) the probability that the system is in state a ∈ A, given that in a mission lifetime of ℓ, it never enters a state in B. Let p_arr(i)(t)dt be the probability of arrival of one version of task i in [t, t+dt], and w_{i,a} and W_{i,a} the density and distribution functions, respectively, of the controller response time for task i when in state a. Further, denote the distribution of the hard deadline for task i, if critical, by F_di. Then,

    p_dyn(i) ≈ Σ_{a∈A} ∫_0^∞ ∫_0^∞ ∫_0^ℓ [1 − W_{i,a}(c)] P_a(t) p_arr(i)(t) dt dL(ℓ) dF_di(c)
               + Σ_{b∈B} ∫_0^∞ P_b(t) dL(t)                                                  (15)

    p_dyn ≈ Σ_{i=1}^{r} p_dyn(i) − (r − 1) Σ_{b∈B} ∫_0^∞ P_b(t) dL(t)                        (16)

where the approximations hold whenever (as is almost invariably the case in practice) each of the integrals is much less than one. Let Z_i be a random variable denoting the contribution to Γ̃_i of a single version of task i, with the system in state a ∈ A, Γ̃_tot(i) the value of Γ̃_i over a mission lifetime given that the system remains in state-set A throughout, and S̃_syst = Σ_{i=1}^{r} Γ̃_tot(i).
Let Π_arr(i)(u, δ) be the probability of u arrivals of versions of task i in an interval of length δ, and H_a(·) the distribution of the sojourn time of the system in state a in a mission of length ℓ. Then, define the following characteristic functions:

    φ_{i,a}(s) = ∫_{t=0}^{∞} ∫_{q=0}^{g_i(t)} e^{−sq} w_{i,a}( g_i^{−1}(q) ) dq dF_di(t)   if task i is critical
               = ∫_{q=0}^{∞} e^{−sq} w_{i,a}( g_i^{−1}(q) ) dq                             otherwise        (17)

For the evaluation of a single system, the other three measures are sufficient.

for all a ∈ A. Then,

    φ_tot(i)(s) = Σ_{a∈A} ∫_0^∞ ∫_0^∞ Σ_{u=0}^{∞} [φ_{i,a}(s)]^u Π_arr(i)(u, δ) dH_a(δ) dL(t)    (18)

and

    φ_syst(s) = Π_{i=1}^{r} φ_tot(i)(s).                                                          (19)

The reader will have noticed two implicit assumptions: (a) that the service time for a task is much less than the sojourn time of the system in any one state7, and (b) that the costs incurred by each task-version are independent. The Cost Indices can now be determined by inverting φ_tot(i)(s) and φ_syst(s), and weighting with the appropriate probability of dynamic failure.

2.5. Application of the Performance Measures

The mean cost, variance cost, and the probability of dynamic failure can all be used in the design and evaluation of individual systems. The design of every multiprocessor is the result of a multiplicity of decisions regarding scheduling strategy, individual component redundancies, speed differentials, etc. The performance measures can be used as optimization criteria in this context. Specifically, the cost function can be used in the following.

(i) Placement of recovery blocks for backward error recovery.
(ii) Dynamic control of queues at processors and memory.
(iii) Deriving tradeoffs between control overhead and hardware cost.
(iv) Deriving tradeoffs between dynamic failure probability and mean control overhead for each multiprocessor.
(v) Derivation of optimal (or quasi-optimal) task allocation and reallocation strategies.

The above is only a sampling of the uses to which the performance measures can be put. Indeed, any decision that affects in any way the response time of the controller can be objectively and rigorously evaluated. For the evaluation of existing designs, it is suggested that the measures be used in a two-stage process. First, the probability of dynamic failure of the designed system is computed and compared with that required by the specifications.
In the event that the system meets the dynamic failure requirements, expressions for the mean and variance costs are derived, and fine-tuning of the design is carried out by studying the sensitivity of the measures to changes in system structure, speed, etc. If the system does not meet the dynamic failure requirements, the mean and variance costs are not computed. Instead, the system is redesigned to the point where the failure requirements are met before the mean and variance costs are considered. Figure 1 summarizes the approach. The System and the Task Cost Indices may be regarded as the distribution functions of random variables that we shall call the system cost and the task cost, respectively. The system cost can be used in the comparison of two or

7. State changes occur when hardware or software failures occur in the system. The mean time between such changes is far greater than the mean task completion time. Also, of course, g_i is invertible in [0, t_di].

more rival computer systems. Since the system cost function can be regarded as representing the actual running costs (in money terms, for example), some function of it can be used to rank candidate systems. The actual function used depends on the desired criterion. For instance, if it is desired to rank candidate architectures on the basis of expected running cost, and the performance functionals for the controlled process are suitably chosen, the expectation of the system cost function would be used and the architectures would be ranked according to this criterion. The task costs can be used as indicators of the contribution of individual tasks to the cost of control.

3. ON DETERMINING THE PERFORMANCE MEASURES

In what follows, the real-time system consists of the controlled process (often referred to simply as the "process") and the controller (i.e., the multiprocessor). Four items need to be determined prior to computing the performance indices. These are the distribution of the hard deadline t_di for each critical task i, the finite cost function g_j for each task j, the multiprocessor response time distribution as a function of its state, and P_v(t) for all v ∈ Q. We concentrate here on the first two, referring the reader to the example presented in Section 4 and the queueing theory and probability literature for the last two.

3.1. Derivation of the Hard Deadlines

Typically, critical tasks are associated with life-critical activity, and so one is especially concerned with finding practical means for accurately estimating p_dyn. In order to do so, the hard deadlines must first be derived. In most cases, since the environment is only stochastically known, these take the form of probability distributions. To explain the derivation of the distribution of t_di, a state-variable approach is useful. A general process may be described as

    ẋ(t) = f( x(t), u(t), t )                                                          (20)

where x ∈ R^n is the state vector, u ∈ R^m is the input vector, and t represents time.
The input vector can be partitioned as

    u^T = [ u_E^T | u_C^T ]                                                            (21)

where u_E is the environmental input subvector8. It represents the effects of the environment on the controlled process. For example, a gust of wind is an environmental input when the controlled process is an aircraft. u_C is the control input subvector. It represents the input delivered under controller command in response to an environmental input. For obvious reasons, it is always bounded. The process is required to perform within a certain subset of the state space. Let the admissible set of the state vector be bounded and denoted by X_A. Note that the admissible state space is not necessarily static. The control u_C is employed to keep the process in X_A. The control is clearly a function of u_E. Since X_A is bounded, there is a bound on the controller response time allowed. This bound is the hard deadline. Since the process

8. By definition, an input from an I/O panel is also an environmental input.

dynamics and the distribution of u_E are known, the distribution of each hard deadline can in theory be determined. For clarity, we restate the above. The system response time -- the interval between a trigger and the termination of the resulting controller response -- can be divided into two portions: controller think-time (i.e., the controller response time) and the actuation time during which the process reacts to the environmental trigger under controller directives. The hard deadline t_di associated with a control task i which is triggered by an environmental input is the maximum controller think-time permitted, consistent with the environmental conditions, the process dynamics, and the requirement to keep the system within the admissible state space X_A. Clearly, in order to derive the hard deadlines, a precise formulation of the process dynamics is required. Since such a formulation is a required part of the design of a critical process, no additional requirements are imposed on the system designer. For an example of determining hard deadlines, the reader is directed to Section 4.

3.2. Derivation of Finite Cost Functions

Very little work has been published in this area. Cost functions of this type are usually -- if at all -- specified in an ad-hoc manner. The main problem here is that the cost functions must be linked to the controlled process to have any concrete meaning. Contemporary workers (e.g., [3]) have skirted the problem -- with less than convincing results -- by ascribing qualitatively-defined "weights" to controller attributes (conventional reliability, throughput, etc.) and attempting to match these to corresponding weights for the application (also qualitatively obtained). Controlled processes have performance measures that are functionals of system state and input, and which express the cost of running the process [17], [18].
The traditional formulation is:

    J(t) = f_0( x(t), u(t), t )                                                        (22)

where J(t) is the instantaneous performance measure, f_0 the functional, and the other symbols have their usual meanings. The cost of running the process over, say, an interval [t_0, t_f] is

    E = ∫_{t_0}^{t_f} J(t) dt.                                                         (23)

We claim that a good representation for g_i(ξ) is given by9

    g_i(ξ) = Φ_i(ξ) − Φ_i(0)   for 0 ≤ ξ ≤ t_di
           = 0                 otherwise                                               (24)

where:

9. It is easy to see that many other suitable representations of the cost function can exist. For instance, one can modify Φ_i(η) to include higher moments of the contribution to E. However, for most practical purposes, the above measure should suffice. It is usually only when the intertrigger or service time distributions exhibit strongly non-Markovian behavior that such a modification needs to be considered. Also, g_i is 0 for values of response time greater than the associated hard deadline, since to speak of finite cost after failure has occurred is meaningless.

    Φ_i(η) = expected contribution of u_Ci to E if the response time of task i is η,
    u_Ci   = control input subvector associated with task i.

See Section 4 for an example.

3.3. Remarks

By connecting the activity of the controller with that of the controlled process, a proper foundation has been provided for the definition of controller performance. All quantities relating to the performance measures can be calculated from quantities that may be estimated or measured. It is true that our methods presuppose a knowledge of the controlled process and its dynamics that may not at first be accurately available. However -- and this is the crucial point -- an understanding of the dynamics of the process will surely increase with experience (say, with a mathematical model or a prototype). There is therefore a learning curve associated with process operation, and hence, by induction, another for the performance measures. Note that we are here considering critical processes such as nuclear reactors or aircraft, and that such processes call for extensive testing and analysis before release or installation. Also, the operating environment for these critical processes is usually well known, so that a good model generally exists for the intertrigger distributions. One can express uncertainty about the accuracy of the individual cost functions in terms of a confidence measure. The sensitivity of the system cost to changes in the cost functions is given indirectly by the individual task costs. The latter can therefore be used with confidence intervals for the cost functions to obtain a confidence interval for the system cost. The vital point to note here is that this present effort is probably the first in which the computer controller is designed specifically with a particular controlled process in mind. Carrying through with this idea, we note that the design of the controller will depend on the nature of the controlled process.
This is a departure from tradition, since controlled processes and computer controllers have evolved separately as distinct and independent disciplines.

4. EXAMPLE

In this section, we present an example of the determination of hard deadline distributions and mean and variance costs, and the analysis of a multiprocessor system. Both the example process and the multiprocessor are somewhat idealized, the principal purpose being to illustrate what has gone before.

4.1. The Process

4.1.1. Description

The controlled process is a communications satellite of mass m, floating in free space. One critical controller task (task 1) is to keep it within a sphere of radius R centred on a "rest position", x_0. The environment is characterized by a series of impulses, deriving from meteorite impacts and arriving along random directions. The meteorites arrive according to a Poisson process with parameter λ. The satellite must be restored to rest at x_0 before the next meteorite comes in. The energy the meteorites impart to the satellite is assumed to be constant at k units per impact. The controller employs rockets to respond to the impulses and can impose a maximum thrust of a units of force in any direction. The rockets must be adjusted to the right direction, and this deployment takes T_e units of time.

The performance index with respect to this one task is E = total energy expended by the rockets. The problem is to determine the distribution of t_d1 and the finite cost function g_1(ξ).

4.1.2. Derivation of the t_d1 Distribution

Catastrophic failure will occur either if the satellite is not restored to rest at x_0 by the time the next impulse comes along, or if the satellite leaves the admissible state space. t_d1 is the maximum think-time, and finding it is equivalent to solving for the minimum actuation time. We clearly have a bang-bang control solution. Using standard methods from optimal control theory, we conclude that

    t_d1 = max{ min[ A(T), t_1 ], 0 }                                                  (25)

where T is the interval between successive meteorite arrivals, and, with v = √(2k/m) denoting the speed imparted by an impact,

    A(T) = T − T_e + (m v)/a − √( (4 m v T)/a + (2 m² v²)/a² )                          (26)

    t_1 = (R − k/a)/v − T_e.                                                           (27)

The distribution function of t_d1 is given by:

    F_d1(ξ) = 0                      if ξ < 0
            = 1 − e^{−λ A^{−1}(ξ)}   if 0 ≤ ξ ≤ t_1                                    (28)
            = 1                      otherwise.

4.1.3. Derivation of g_1

Notice that, since the environment is stochastic, the controller will always order maximum thrust. The duration for which the thrust is maintained will depend on the response time ξ. All computations are made under the condition that ξ ≤ t_d1. Using the laws of mechanics, we arrive at the following result, which the reader may easily verify:

    Φ_1(ξ) = expected contribution of u_C1 to E if the response time is ξ
           = Φ_1(0) + a v ξ.                                                           (29)

We can therefore now write:

    g_1(ξ) = a ξ √(2k/m)   for 0 ≤ ξ ≤ t_d1.                                           (30)

The cost function C_1(ξ) is now completely determined. It only remains for us to obtain the controller response time distribution and the state probability functions.

Remark: Notice the tradeoff between t_d1 and g_1. As the rocket thrust a is increased, the expected value of t_d1 increases, meaning that more time is available for the controller, and the chances of catastrophic failure are reduced. This is paid for in the form of increased operating cost, i.e., g_1 rises faster.
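The hard deadline can also be computed numerically rather than in closed form. The sketch below finds the largest feasible think-time by bisection, using the standard minimum-time (bang-bang) result for a double integrator, and then samples exponential intertrigger intervals to build an empirical picture of the t_d1 distribution. All parameter values are made up, and the one-dimensional, worst-case treatment of the impact (velocity directed radially outward) is an assumption.

```python
import math
import random

# Numerical sketch of Section 4.1: hard deadline t_d1 for the satellite
# task, under the stated idealization. All parameter values are made up.
m, k, R = 100.0, 2.0, 5.0          # mass, energy per impact, admissible radius
a_thr, T_e, lam = 1.0, 0.1, 0.01   # max thrust, deployment time, meteorite rate

v = math.sqrt(2.0 * k / m)         # speed imparted by one impact
alpha = a_thr / m                  # max acceleration

def min_return_time(d0: float, v0: float) -> float:
    # Standard time-optimal (bang-bang) double-integrator result:
    # min time to bring state (d0, v0), d0 >= 0, v0 >= 0, to rest at origin.
    return (v0 + 2.0 * math.sqrt(v0 * v0 / 2.0 + alpha * d0)) / alpha

def t_d1(T: float) -> float:
    """Largest think-time xi such that the satellite is back at rest within T
    and never leaves the sphere of radius R (found by bisection on xi)."""
    def feasible(xi: float) -> bool:
        d0 = v * (xi + T_e)                            # drift before thrusting
        within_R = d0 + v * v / (2.0 * alpha) <= R     # max excursion bound
        in_time = xi + T_e + min_return_time(d0, v) <= T
        return within_R and in_time
    if not feasible(0.0):
        return 0.0
    lo, hi = 0.0, T
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if feasible(mid) else (lo, mid)
    return lo

# Empirical distribution of t_d1 over exponential intertrigger intervals:
random.seed(1)
samples = sorted(t_d1(random.expovariate(lam)) for _ in range(2000))
print("median t_d1 ≈", samples[len(samples) // 2])
```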

4.2. The Multiprocessor

4.2.1. Description

The multiprocessor is configured as shown in Figure 2. It has c processors and a dispatcher with an infinite buffer, which dynamically assigns tasks to processors as they come in and as the processors become free. The input rate is Poisson, with parameter λ_i for task i. There are r tasks in the system, and all are critical. All tasks have an identical service time distribution10. There is no common memory in the system; each processor is assumed to have all the system applications software stored in its private memory11. The dispatcher schedule is FCFS. Processors fail at an exponential rate λ_p; the dispatcher (which is assumed to have internal redundancy) fails with rate λ_d, and each of the q redundant buses (of which only one is active at any one time) fails with rate λ_b. Components fail independently of each other, and coverage is 100%. Queueing delays at buses are small enough to be ignored12.

4.2.2. Determining the Task Execution Rate

The time taken to execute a task on a processor is a random variable whose distribution is affected by operating conditions and system parameters. The distribution is determined by the operating system characteristics, processor failure rates (both transient and permanent), and the probability of incorrect transmission over the intercommunication medium, together with the conditional branches within the executed code. In any reliable system, the effects of the perturbations due to hardware failures or electromagnetic interference are very small due to their low probability of occurrence. It is generally assumed in analyses of this type that the software execution time, assuming no perturbation due to hardware failure or interference, is exponentially distributed. Since the perturbations are small, we approximate them by assuming the service rate still to be exponential, but shifting its parameter slightly to allow for the perturbations. Let this parameter be μ.
To determine μ, we use the probability model shown in Figure 3. π_0 is the density function for the execution time of the software assuming no perturbations due to failure, incorrect transmissions, etc. This quantity can be derived through experiment, and naturally depends on the nature of the code. π_j is the density function of the j-th perturbation, and α_j is the probability that this perturbation occurs during one execution of the code. Both these quantities can be determined by experiment and/or analysis. Assuming, for analytical simplicity, that the perturbations are mutually independent, the characteristic function of the service time pdf for the code, Φ(s), can be written down by inspection as:

    Φ(s) = Φ_0(s) ∏_j [ α_j Φ_j(s) + 1 − α_j ]    (31)

where Φ_j(s) is the characteristic function of π_j(t).

10. This assumption (used by many authors, e.g., [14]) is removed by us in [20].
11. This assumption is not as unrealistic as it might seem. With memory densities rising and costs falling, there are emerging designs based on this idea (e.g., the CM2FCS system [21]). The removal of this assumption entails analysis of a blocking problem. For a study of this type of system, see [22].
12. This is characteristic of this type of system. Messages transmitted are invariably either control messages or very small packets of data.
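Differentiating (31) at s = 0 gives the moments needed in the next step: with independent Bernoulli-gated perturbations, the mean is M = m_0 + Σ_j α_j m_j and the variance is V = v_0 + Σ_j [α_j v_j + α_j(1 − α_j) m_j²], where (m_j, v_j) are the mean and variance of the j-th perturbation. A minimal sketch; the function name and the numeric values are ours, purely illustrative:

```python
def service_moments(m0, v0, perturbations):
    """Mean M and variance V of the composed service time implied by Eq. (31).

    Models the service time as X0 + sum_j B_j * Y_j, with B_j ~ Bernoulli(alpha_j)
    and Y_j the j-th perturbation (mean m_j, variance v_j), all independent.
    `perturbations` is a list of (alpha_j, m_j, v_j) tuples.
    """
    M = m0 + sum(a * m for a, m, _ in perturbations)
    V = v0 + sum(a * v + a * (1.0 - a) * m * m for a, m, v in perturbations)
    return M, V

# Illustrative numbers (not from the report): base execution takes 1.0 ms on
# average; one rare retry perturbation (probability 0.01) adds 5.0 ms on average.
M, V = service_moments(1.0, 0.25, [(0.01, 5.0, 1.0)])
mu = min(1.0 / M, 1.0 / V)   # the conservative service-rate estimate of Sec. 4.2.2
```

The variance term α_j(1 − α_j)m_j² captures the extra spread contributed by the perturbation occurring only sometimes.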

The mean and variance of the execution time can be determined from Φ(s). Let them be M and V, respectively. Then, the value of μ is taken to be min{1/M, 1/V}. This is generally a rather conservative estimate.

4.2.3. Analysis of the Multiprocessor

(a) Response Time Density Function: System failure occurs when the dispatcher fails, all buses fail, or when there are n = ⌊λ/μ⌋ or fewer processors functioning. The state description is: y = 0 when there is system failure, and y = n+1, ..., c when there is no system failure and there are n+1, ..., c processors, respectively, functioning in the system. Clearly, A = {n+1, ..., c} and B = {0}.

Let λ = Σ_{i=1}^{r} λ_i. The system at state x (x > 0) is essentially an M/M/x queue when the (small) dispatcher service time is ignored. Therefore, the pdf of the response time of the system in state x (x > 0) for each task i = 1, ..., r is given by [23]:

    w_i(t) = { μ e^{−μt} [λ − xμ + μ W_x(0)] − [1 − W_x(0)] (λ − xμ) μ e^{−(xμ−λ)t} } / [λ − (x−1)μ]    (32)

where:

    W_x(0) = 1 − [ (λ/μ)^x / ( x! (1 − λ/(xμ)) ) ] · [ Σ_{y=0}^{x−1} (λ/μ)^y / y! + (λ/μ)^x / ( x! (1 − λ/(xμ)) ) ]^{−1}    (33)

(b) State Probability Functions: Let p_p(t), p_d(t), and p_b(t) be the probabilities of failure of a processor, the dispatcher, and a bus by time t, respectively, where:

    p_p(t) = 1 − e^{−λ_p t}    (34)
    p_b(t) = 1 − e^{−λ_b t}    (35)
    p_d(t) = 1 − e^{−λ_d t}    (36)

Then, the probability of the system being in state 0 at time t is

    P_0(t) = Σ_{i=0}^{n} C(c,i) [p_p(t)]^{c−i} [1 − p_p(t)]^{i} + [p_b(t)]^q + p_d(t)    (37)

and the corresponding probability for state a, for a = n+1, ..., c, is

    P_a(t) = [ 1 − {[p_b(t)]^q + p_d(t)} ] C(c,a) [p_p(t)]^{c−a} [1 − p_p(t)]^{a}    (38)

where C(c,i) denotes the binomial coefficient.

(c) Sojourn Time Distribution Function, H_{t_a}(t): If t_a is the random variable representing the sojourn time of the system in state a, it is easy to see that:

    t_a = max[ min(T − σ_a, τ_a), 0 ]    (39)

where T is the mission lifetime, τ_a is the time to the next processor failure in state a, and

    σ_a = Σ_{j=a+1}^{c} τ_j  if a < c;    σ_c = 0    (40)

and τ_a is exponentially distributed with parameter aλ_p. The derivation is conceptually straightforward; we therefore present only the result. For a = n+1, ..., c−1:

    H_{t_a}(t) = 0                                                        for t < 0
               = 1 − F_{σ_a}(T−t) [1 − F_{τ_a}(t)] [1 − F_{σ'_a}(t)]      for 0 ≤ t < T    (41)
               = 1                                                        for t ≥ T

where:

    F_{σ_a}(t) = 0                                            for t < 0
               = Σ_{j=a+1}^{c} A_j^{(a)} (1 − e^{−jλ_p t})     for 0 ≤ t < T    (42)
               = 1                                            for t ≥ T

    A_j^{(a)} = ∏_{k=a+1, k≠j}^{c} k / (k − j)    (43)

    F_{τ_a}(t) = 0 for t < 0;    1 − e^{−aλ_p t} for t ≥ 0    (44)

    F_{σ'_a}(t) = 1 − ∫_{k=t}^{∞} F_{σ_a}(k − t) dF_s(k)    (45)

    F_s(t) = e^{−λ_d t}(1 − e^{−λ_b t})^q + (1 − e^{−λ_d t})[1 − (1 − e^{−λ_b t})^q] + (1 − e^{−λ_d t})(1 − e^{−λ_b t})^q    (46)

Here F_s is the distribution of the time to dispatcher or total bus failure, and F_{σ'_a} is the distribution of the residual time from entry into state a until such a failure. Also,

    H_{t_c}(t) = 0                                              for t < 0
               = F_{τ_c}(t) + F_s(t) − F_{τ_c}(t) F_s(t)         for 0 ≤ t < T    (47)
               = 1                                              for t ≥ T

where

    F_{τ_c}(t) = 1 − e^{−cλ_p t}    (48)

4.3. Numerical Results

We consider the contribution to the probability of dynamic failure of the task considered above. It is assumed that R = 1.75, k = 0.1, m = 2, a = 1, and that there is an average of one impact an hour. In addition to the triggers created by meteorite impacts there are others, and the total load upon the system can be characterized by an input arrival rate of 39 per hour. The range of mission lifetimes considered is zero to ten hours. Failure rates for individual components of the multiprocessor are as follows: λ_p = 10^{−3}, λ_b = 10^{−5}, λ_d = 10^{−8}, where the time units are hours. (Dispatcher and memory failure rates are extremely low since it is assumed that they possess internal redundancy.) For a system of this type, the only design variables are the number of processors, the number of buses, and the speed of the hardware. The graphs that make up Figure 4 show the marginal benefit to be gained from the addition of processors when μ = 20. Addition of buses, it was found, has practically no impact on the probability of dynamic failure or on the mean and variance costs

for the range of lifetimes considered. (Mean and variance costs are equal in this system.) The marginal benefit to be gained from raising the processor count from 2 to 3 is considerable; above a processor count of 6, the marginal benefits are practically non-existent for the probability of dynamic failure. (Note that these remarks hold good only for mission lifetimes in the range considered.) The failure rate at the asymptotic level of infinitely many processors is principally due to the probability of working processors exceeding the hard deadline, and the mean and variance costs become the cost associated with the mean processor service time. If better performance than that provided by the asymptotic level is required, it can only be obtained by the introduction of faster hardware. We study the results of a variation in hardware speed in Figure 5, which shows the asymptotic value of the probability of dynamic failure and the mean cost for different processor speeds, i.e., different values of μ. Here again, one can compute the marginal benefit to be gained in terms of mean cost and dynamic failure probability, this time as a function of processor speed. One could, of course, generalize this process and attempt to derive a function p_dyn(t) = f(μ, c, t), and similar functions for the mean and variance costs. Several quantities of great practical importance to designers and evaluators can be derived from such functions, but this is beyond the scope of the present report.

4.4. Remarks

With reference to the above example, several points are worth elaborating. First, note that Δ^{−1}(0) is a positive quantity; this in turn means that the probability that the hard deadline is zero is also positive. Since Δ^{−1}(0) depends only upon the parameters of the controlled process, this means that there is, inherent in the controlled process itself, a certain predisposition to failure that cannot be removed by upgrading the quality of control provided.
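The irreducible floor just described can be made concrete with a toy decomposition of the dynamic failure probability into a fixed part and a speed-dependent part. Every constant below is illustrative, not taken from the report:

```python
import math

# Hypothetical decomposition of the dynamic failure probability:
# an irreducible floor plus a deadline-miss term that shrinks as the
# service rate mu (or, equivalently, processing power) grows.
P_PROCESS = 0.01     # stands in for 1 - exp(-lam * Delta_inv(0)), process-inherent
P_STATIC = 0.002     # probability of entering a failure state (state-set B)

def p_miss(mu):
    """Hypothetical deadline-miss probability, vanishing with processor speed."""
    return 0.05 * math.exp(-mu)

def p_dyn(mu):
    return P_PROCESS + P_STATIC + p_miss(mu)

# Asymptote: faster hardware drives p_dyn toward this floor but no further.
floor = P_PROCESS + P_STATIC
```

Only p_miss responds to faster hardware; the floor can be lowered solely by changing the controlled process or the failure characteristics of the components themselves.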
Specifically, 1 − e^{−λΔ^{−1}(0)} is the probability of dynamic failure basic to the controlled process itself. That is, the probability of dynamic failure cannot be lowered beyond this point without altering the characteristics of the controlled process. Next, consider the effects of adding processors to the controller system. Our curves show that after a certain point, the marginal benefit to be gained from further additions is vanishingly small. Also, as processor speed is increased without limit, without altering the failure characteristics of the processors themselves, another asymptotic limit is reached. This is because the dynamic failure probability consists of two components: the static part and the non-static part. The non-static part, which is the probability of missing the hard deadline when the processor is in a state a ∈ A, drops with increasing processor speed. However, the static component, which represents the probability of the system entering a state in state-set B, remains unaltered. Together with the underlying process-determined probability of failure, this ultimately becomes the dominant term in the dynamic failure probability expression and sets the asymptotic limit. It is easy to see how the above can be extended to provide quite valuable tradeoff curves in multiprocessor design. When the cost of the hardware is included in the reckoning, for example, it becomes possible to study the tradeoff between the initial capital cost of the controller and controller overhead. Limitations of space do not permit us to pursue this last point further. However, our case that the performance indices presented in this report can be applied to a far wider range than can conventional computer performance

measures should be apparent from the above.

5. DISCUSSION

In this report, we have presented performance indices that are objective in the way they measure performance. Due to this objectivity, they can be used to advantage in both the design and evaluation phases of the development of a controller system. As for design, the measures should help identify quasi-optimal architectures and operating systems. In the former category are included decisions regarding the interconnection structure and component speed differentials. In the latter category, we include the choice of schedule in the access of shared resources, the design of failure recovery procedures (especially the optimal placement of recovery points), and algorithms controlling the allocation and reallocation of tasks to the individual processors. The lack of objective indices has so far forced contemporary workers to employ overly simplistic performance indices, most notably in solving the task allocation problem (such as ascribing unit cost to each transaction on the interconnection network [24]). In the sphere of evaluation, the measures can be used to provide either an absolute or a relative index of controller performance. In the former case, the performance functional, f_0, has to be defined with particular care, and the indices provide a measure of the overhead cost incurred in control. A second important application is the comparison of rival controller systems for particular applications. We stated earlier that any decision that has any impact on controller response time can be rigorously and objectively evaluated by use of our performance measures. The cost functions are therefore of universal applicability in the real-time field; indeed, their units might properly be considered the basic currency in controller performance evaluation. This universality and underlying rigor has been the principal contribution of this report.
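Since every index above funnels through controller response time, it may help to make the M/M/x sojourn-time results of Eqs. (32) and (33) concrete. The sketch below is a direct transcription of those standard formulas (see Gross and Harris [23]); the function names and symbol spellings (lam, mu, x for λ, μ, x) are ours:

```python
import math

def W0(lam, mu, x):
    """Probability of zero wait in a stable M/M/x queue (complement of
    the Erlang C delay probability), per Eq. (33)."""
    rho = lam / mu
    assert rho < x, "queue must be stable (lam < x * mu)"
    wait = rho**x / (math.factorial(x) * (1.0 - rho / x))
    p0_inv = sum(rho**y / math.factorial(y) for y in range(x)) + wait
    return 1.0 - wait / p0_inv

def response_pdf(t, lam, mu, x):
    """Sojourn-time (queueing + service) density of an M/M/x FCFS queue,
    per Eq. (32). Undefined at the removable singularity lam == (x-1)*mu."""
    w0 = W0(lam, mu, x)
    num = (mu * math.exp(-mu * t) * (lam - x * mu + mu * w0)
           - (1.0 - w0) * (lam - x * mu) * mu * math.exp(-(x * mu - lam) * t))
    return num / (lam - (x - 1) * mu)
```

For x = 1 the density collapses to the familiar M/M/1 form (μ − λ)e^{−(μ−λ)t}, which provides a convenient sanity check; the density also integrates to one for any stable configuration.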
Acknowledgements

The authors wish to thank Rick Butler and Milton Holt of the NASA Langley Research Center and Y. H. Lee of The University of Michigan for technical discussions.

REFERENCES

[1] A. L. Hopkins et al., "FTMP -- A Highly Reliable Fault-Tolerant Multiprocessor for Aircraft," Proc. IEEE, Vol. 66, No. 10, pp. 1221-1239, October 1978.

[2] J. H. Wensley et al., "SIFT -- Software Implemented Fault Tolerance," Proc. IEEE, Vol. 66, No. 10, pp. 1240-1256, October 1978.

[3] M. J. Gonzalez and B. W. Jordan, "A Framework for the Quantitative Evaluation of Distributed Computer Systems," IEEE Trans. Comput., Vol. C-29, No. 12, pp. 1087-1094, December 1980.

[4] J. von Neumann, "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components," in Automata Studies, pp. 43-98, Princeton University Press, Princeton, NJ, 1956.

[5] E. F. Moore and C. E. Shannon, "Reliable Circuits Using Less Reliable Relays," J. Franklin Inst., Vol. 262, Pt. I, pp. 191-208, and Pt. II, pp. 281-297, 1956.

[6] W. G. Bouricius et al., "Reliability Modeling Techniques for Self-Repairing Computer Systems," Proc. ACM 1969 Annual Conf., pp. 295-309, August 1969.

[7] R. R. Borgerson and R. F. Freitas, "A Reliability Model for Gracefully Degrading and Standby-Sparing Systems," IEEE Trans. Comput., Vol. C-24, pp. 517-525, May 1975.

[8] G. N. Cherkesov, "Semi-Markovian Models of Reliability of Multichannel Systems with Unreplenishable Reserve of Time," Engineering Cybernetics, Vol. 18, pp. 65-78, March/April 1980.

[9] J. Losq, "Effects of Failures on Gracefully Degradable Systems," Seventh Annu. Int. Conf. on Fault-Tolerant Computing, Los Angeles, CA, pp. 29-34, March 1977.

[10] R. Troy, "Dynamic Reconfiguration: An Algorithm and its Efficiency Evaluation," op. cit., pp. 44-49.

[11] J. F. Meyer et al., "Performability Evaluation of the SIFT Computer," IEEE Trans. Comput., Vol. C-29, No. 6, pp. 501-509, June 1980.

[12] M. D. Beaudry, "Performance-Related Reliability Measures for Computing Systems," IEEE Trans. Comput., Vol. C-27, No. 6, pp. 540-547, June 1978.

[13] J. F. Meyer, "On Evaluating the Performability of Degradable Computing Systems," IEEE Trans. Comput., Vol. C-29, pp. 720-731, August 1980.

[14] J. F. Meyer, "Closed-Form Solutions of Performability," IEEE Trans. Comput., Vol. C-31, No. 7, pp. 648-657, July 1982.

[15] G. K. Manacher, "Production and Stabilization of Real-Time Task Schedules," J. ACM, Vol. 14, No. 3, pp. 439-465, July 1967.

[16] J. A. White et al., "Multimicroprocessor Flight Control: System Architectural Concepts," Proc. AIAA Comp. in Aerosp. Conf., pp. 87-92, October 1979.

[17] D. E. Kirk, Optimal Control Theory, Prentice-Hall, Englewood Cliffs, NJ, 1970.

[18] A. P. Sage, Optimum Systems Control, Prentice-Hall, Englewood Cliffs, NJ, 1977.

[19] M. Reiser and K. M. Chandy, eds., Computer Performance, North-Holland Publishing Co., Amsterdam, 1977.

[20] C. M. Krishna and K. G. Shin, "Approximate Analysis of Queues with Multiple Job Classes," in preparation.

[21] S. J. Larimer and S. K. Maher, "The Continuously Reconfiguring Multiprocessor," NATO-AGARD Meeting on Tactical Airborne Computing, Roros, Norway, 1981.

[22] C. M. Krishna and K. G. Shin, "Queueing Analysis of a Canonical Model of Real-Time Multiprocessors," submitted for publication.

[23] D. Gross and C. M. Harris, Fundamentals of Queueing Theory, John Wiley, New York, 1974.

[24] W. W. Chu et al., "Task Allocation in Distributed Data Processing," Computer, Vol. 13, No. 11, pp. 57-70, November 1980.

[Figure 1. System Refinement: a flowchart that evaluates expressions for p_dyn, M, and V; checks whether the requirements are met; updates the design based on the sensitivity of p_dyn, M, and V to design changes (or rejects the changes as unsuitable); and iterates until a new design is no longer significantly better than the preceding one.]

[Figure 2. A Real-Time Multiprocessor: tasks arriving at rates λ_i, i = 1, ..., r, enter the dispatcher's queue and are assigned to the c processors; results go to the actuators.]

[Figure 3. Determining Task Response Time: incoming tasks pass through the base execution-time density π_0 and the perturbation branches, each taken with probability α_j and bypassed with probability 1−α_j, before output to the actuators.]

[Figure 4. Sensitivity of Performance to Change in Processor Number: probability of dynamic failure (logarithmic scale) vs. mission lifetime, 0 to 10 hours, for processor counts c = 2, 3, ...]

[Figure 4 (cont.). Sensitivity of Performance to Change in Processor Number: mean cost vs. mission lifetime, 0 to 10 hours, for c = 3, 4, 5, 6.]

[Figure 5. Sensitivity of Performance to Change in Processor Speed: probability of dynamic failure vs. mission lifetime for μ = 0.1, 1.0, 2.0.]

[Figure 5 (cont.). Sensitivity of Performance to Change in Processor Speed: mean cost vs. mission lifetime, 0 to 10 hours, for μ = 0.1, 1.0.]