THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY

MODELS FOR EVALUATING THE PERFORMABILITY OF DEGRADABLE COMPUTING SYSTEMS

Liang Tai Wu

CRL-TR-7-82

JUNE 1982

Room 1079, East Engineering Building, Ann Arbor, Michigan 48109 USA
Tel: (313) 763-8000

This work was supported by NASA Grant NSG 1306. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the authors and do not necessarily reflect the views of the funding agencies.

ABSTRACT

MODELS FOR EVALUATING THE PERFORMABILITY OF DEGRADABLE COMPUTING SYSTEMS

by Liang Tai Wu

Chairman: John F. Meyer

Recent advances in multiprocessor technology have established the need for unified methods to evaluate computing system performance and reliability. In response to this modeling need, this dissertation considers a general modeling framework that permits the modeling, analysis and evaluation of degradable computing systems. Within this framework, several user-oriented performance variables are identified and shown to be proper generalizations of the traditional notions of system performance and reliability. Furthermore, a time-varying version of the model is developed to generalize the traditional fault-tree reliability evaluation methods for phased missions.

The modeling and evaluation methods considered in this dissertation provide a relatively straightforward approach to integrating reliability and availability measures with performance measures. The hierarchical decomposition approach permits the modeling and evaluation of a computing system's subsystems (e.g., hardware, software, peripherals, interfaces, user demand systems) as a whole, rather than evaluating these subsystems independently as traditional methods do. Accordingly, it becomes possible to evaluate

the performance of the system software and the reliability of the system hardware simultaneously in order to measure the effectiveness of the system design. Moreover, since the performance variables considered in this study permit the characterization of system performance according to the application needs of a system, the results obtained represent more accurate assessments of the system's ability to perform than existing performance or reliability measures.

TABLE OF CONTENTS

LIST OF ILLUSTRATIONS

CHAPTER

1. INTRODUCTION
   1.1 Background
   1.2 Research Objectives

2. PERFORMABILITY EVALUATION OF COMPUTING SYSTEMS
   2.1 Introduction
   2.2 System Models

3. OPERATIONAL MODELS
   3.1 Introduction
   3.2 Recoverability
   3.3 Evaluation of Computing Systems Using Functionals of a Markov Process
   3.4 Performability of a Triplicated Fault-Tolerant Computing System

4. MODELING AND EVALUATION OF DEGRADABLE MULTIPROCESSOR SYSTEMS

   4.1 Introduction
   4.2 System Model
   4.3 Evaluation of Two Degradable Multiprocessor Systems

5. PHASED MODELS
   5.1 Phased Missions
   5.2 Structural Properties of Phased Models
   5.3 Probability Computation of Cartesian Trajectory Sets
   5.4 Operational Models as Intraphase Processes

6. CONCLUSION AND FURTHER RESEARCH

APPENDIX

REFERENCES

LIST OF ILLUSTRATIONS

Figure
2.1 A Model Hierarchy
3.1 Markov Model of a TMR System with Software Error Recovery
3.2 Performability of S
3.3 Performability of S
4.1 Single Resource Model for Systems with Safe Faults and Unsafe Faults
4.2 Single Resource Model for Systems with Instantaneous Detections and Recoveries
4.3 Head-of-the-Line Priority Queue
4.4 State-Transition Diagram of the Base Model for S1 and S2
4.5 Performability of S1
4.6 Performability of S2
5.1 An Order-Preserving Mapping
5.2 Underlying Markov Process of a Phased Model
A.1 Markov Model of a Multicomputer

LIST OF TABLES

Table
4.1 State Transition Rates of a General Resource Model
4.2 Base Model Parameters
4.3 Transition Rates for Systems with Instantaneous Detections and Recoveries
5.1 An Organizing Structure
5.2 Operational Structures of a Phased Model

CHAPTER 1

INTRODUCTION

1.1 Background

Recent developments in multiprocessor systems have stimulated a growing interest in degradable computing systems, which are designed to provide a high degree of performance and reliability by reallocating the computer's resources when faults are detected. To assess the effectiveness of these computing systems, it has been found that the traditional way of evaluating performance and reliability as distinct attributes of a computer is no longer adequate [1], [2]. Traditional performance evaluation methods generally assume that the computer to be evaluated is fault free and are concerned with quantifying how effectively the computer's resources handle a specific application (see [3] and [4], for example). Traditional reliability evaluation methods, on the other hand, deal with the measurement of a computer's ability to remain operational in the event of physical failures (see [5] through [8]). Since the level of performance of a degradable computing system may decrease with successive failures, the performance and the reliability of the system must be dealt with simultaneously to measure the extent to which the user can benefit from the tasks accomplished by the computer.

In response to this modeling need, some recent investigations have attempted to formulate new modeling and evaluation methods that combine both the performance and the reliability characteristics of computing systems. In particular, Beaudry [9] has considered performance-reliability measures that reflect the computational capacity of a system, defined as the amount of useful computation available per unit of time, and has shown that these measures can be evaluated in terms of a transformed Markov process. By examining the set of jobs executed by a computing system, Mine and Hatayama [10] have considered the reliability of the system with respect to a specific job, called job-related reliability. Although the above models have shown the feasibility of combining performance and reliability measures into a single measure, their efforts have focused mainly on the maximum capacity at which a computer can handle its computation. The effect of interactions between the demand for computation (by the user) and its supply (by the computer) has not been considered explicitly.

Another approach to quantifying the unified performance and reliability of computing systems is based on Markov reward processes [11]. By assigning a throughput rate to each state of a Markov process that describes the resource availability of a computing system, Gay [12] has considered the expected system throughput and the throughput availability of a system. A similar model has also been used in [13] by De Souza to estimate the reduction in operating cost when fault-tolerance features are incorporated in commercial systems. More recently, based on renewal process models, Castillo and

Siewiorek [14] have considered the apparent capacity and the expected elapsed time required to execute a program correctly.

In contrast to the above efforts to formulate specific performance measures for degradable computing systems, Meyer [1] has developed a general modeling framework that permits the definition, formulation and evaluation of user-oriented performance measures. A hierarchical model is defined [1] which assumes that the probabilistic nature of the total system S (the computer and its environment) is modeled by a stochastic process $X_S$. It is further assumed that the process $X_S$ can be used to determine the probability distribution function of a random variable $Y_S$ which describes the user's view of how well the system performs. The probability distribution function of $Y_S$ is shown to induce a useful performance measure, referred to as the performability of S, in the context of degradable computing system performance.

1.2 Research Objectives

In this investigation, among other things, we wish to extend the modeling framework in [1] to provide a more concrete basis for studying the evaluation of degradable computing systems. By introducing extra ingredients into the modeling framework, we wish to develop a general stochastic process model of degradable computing systems that satisfies the following objectives:

(1) The model should be general enough to permit uniform formulation of different performance measures.

(2) The model should be specific enough to permit derivation of computational algorithms and formulas.

(3) The model should be flexible enough to be related to traditional performance and reliability models, so that it may serve as a basis for unifying traditional computing system evaluation methods.

(4) The model should be able to reflect the information processing needs of the user as well as internal structural changes of the system caused by component failures.

In addition to the above efforts of model development, we also wish to apply the results obtained to evaluate a large class of fault-tolerant computing systems known as "degradable multiprocessor systems." By comparing the effectiveness of various design strategies, we wish to illustrate the tradeoffs among different techniques for incorporating fault tolerance in the design of a multiprocessor system.

Chapter 2 puts this work in context with respect to the general modeling framework considered in [1]. It describes the components of a performability model and formalizes the relationships among these components. The major results of this chapter include the precise formulation of the notion of system performance in a broad context and the clarification of what it means for a stochastic process model to support the evaluation of system performance.

Chapter 3 introduces a general notion of recoverability and establishes necessary and sufficient conditions for an operational model to be recoverable. For both recoverable and nonrecoverable models, it examines solution methods for a generally defined performance variable in which performance is identified with the minimum value of a functional. The modeling approach and the evaluation methods are then illustrated through the evaluation of a multiprocessor system. The results obtained indicate that the performance variable is, indeed, a proper generalization of the traditional notions of system performance and reliability. The modeling and evaluation methods proposed thus represent a unifying approach for integrating the performance and reliability measures of computing systems.

Chapter 4 presents a specific operational model for evaluating the performability of degradable multiprocessor systems. The model is constructed according to a hierarchical decomposition of a system's behavior. A Markovian base model is developed to represent the resource availability of the system, and priority queueing models are used to determine the operational rates of the resource states. The model not only demonstrates the generality of an operational model but also illustrates the feasibility of modeling and evaluating system performance via a step-by-step hierarchical approach. The methods developed in this chapter thus represent a straightforward approach to producing a composite picture of a computer's ability to meet overall throughput goals.

Chapter 5 extends the concept of an operational model to phased

missions, where the environment of a system can vary in time. Both the combinatorial and the probabilistic properties of the extended model are examined in detail. In addition, an example is constructed to illustrate the performability evaluation of a phased mission with multiple accomplishment levels. The results obtained in this chapter represent an important step toward the understanding and development of a more general time-varying operational model.

Chapter 6 summarizes the results of this study and suggests topics for further research.

CHAPTER 2

PERFORMABILITY EVALUATION OF COMPUTING SYSTEMS

2.1 Introduction

The concept of hierarchical organization has become an important tool in the design of computing systems. Hardware components are typically formed by putting together basic modules or building blocks, and software components are often structured into subroutines in a top-down manner. By carefully organizing the structure of a computer into a hierarchy of components, it becomes possible to increase greatly the capability and the functional features of the computer. Although this concept of hierarchical organization has been used extensively in the design of computing systems since the invention of the first electronic computer, its implications for computing system performance evaluation have been exploited only recently.

As suggested in [2], a computing system can be described by a hierarchy of system models that vary in "scope" and "level of abstraction" (see Figure 2.1 for an example of what we call a model hierarchy). In this representation, a higher level model has a larger scope and a higher level of abstraction, i.e., it describes a larger portion of the computer and its environment, but possibly in less detail. In particular, the top model has a scope that includes all the subsystems that can influence the computational process of the system (e.g., hardware and software, peripherals,

interfaces, maintenance systems, user demand system, etc., collectively referred to as the total system). The level of abstraction at the top level is expressed in a form easily usable by the system user. On the other hand, the bottom model may involve low level representations of the computer's hardware and operating system structure.

Based on the above model hierarchy of the total system, various performance and reliability measures can then be associated with models at each level of the hierarchy. The part of the total system that one is interested in evaluating must first be identified with a specific level in the hierarchy (referred to as the object system). The part of the total system outside the object system is then regarded as the environment of the object system. The choice of an object system is, to a large extent, determined by the particular problem one is interested in solving. For example, if the performance or reliability of a data-base system is to be evaluated, the object system will include not only the hardware and operating system but also the data bases and their supporting programs. Once a specific object system is selected, the performance of the system can be defined as how well the object system satisfies the computational demands (also referred to as the workload) imposed by its environment [4]. The performance is typically considered as a random variable

Level            Scope                                                       Level of Abstraction
Top              Total System                                                Mission
Intermediate 1   Computer, Peripherals, Interfaces and Supporting Systems    Functional Tasks
Intermediate 2   Computer, Peripherals and Interfaces                        Computational Tasks
Bottom           Computer                                                    Hardware and Software Functions

Figure 2.1 A Model Hierarchy

referred to as a performance variable or performance index. Thus we can talk about the mean, variance, distribution function, and the like of a performance variable.

If we regard computer performance measures as measurements of the quality of a computer according to the above broad definition, some observations about the relationships among different performance measures can now be made. First, we note that there are no essential differences between what are traditionally called performance measures (e.g., throughput rate, response time or utilization rate) and what are traditionally called reliability measures (e.g., reliability, availability or maintainability). They differ only in the way performance criteria are formed. Any performance or reliability measure, viewed in the broadest context, must account for both the workload and the probabilistic nature of the object system. Accordingly, in the discussion that follows, the term performance measure will be used to include both the performance and the reliability aspects of a system. Second, we also note that different performance measures can be associated with different levels of the model hierarchy. Thus, for each level of the hierarchy, it is possible to formulate various performance measures according to the application needs as well as the modeling requirements of the object system.

In the following section, we first define more precisely the basic elements of a performance study by considering the basic components of a

performability model. The relationships among those components are then formalized via the concept of capability functions. Finally, the notion of supporting the evaluation of system performance using a stochastic process is made precise by relating the probabilistic nature of the performance variable to the known properties of the underlying stochastic process.

2.2 System Models

A major objective of this section is to formulate precisely what we mean by a "stochastic model" for system performability. It is assumed that the total system S = (C,E) contains a computer C operating in an environment E. The computer C is composed of several processors, memory modules, input/output devices, buses, etc., and the environment E includes man-made components (e.g., interface circuits and peripheral subsystems), operational rules (e.g., job submitting policies and maintenance procedures) and other conditions (e.g., weather) that can influence the computer's effectiveness. At this level of abstraction, it is appropriate to view S as a network of interconnected subsystems with simultaneous information flow among the subsystems. Accordingly, S can be described as an autonomous state transition system that changes state due to events occurring in time.

Given the above characterization of the total system, the behavior of S can be viewed as a stochastic process

$$X_S = \{X_t \mid t \in T\} \tag{2.1}$$

where T is the time range involved (called the utilization period) and, for each $t \in T$, $X_t \colon \Omega \to Q$ is a random variable defined on a common probability space $(\Omega, E, P)$ and taking values in the state space Q of the total system. In the following discussions, it will be assumed that T is a set of real numbers and Q is a discrete set. Thus, without loss of generality, the states of $X_S$ will often be named by positive integers, viz. $Q = \{1,2,3,\ldots\}$ or, when Q is finite, $Q = \{1,2,\ldots,n\}$. The stochastic process $X_S$ will be referred to as the base model of S and is denoted simply as X when the system context is clear.

Although the base model X provides a detailed description of the system's state behavior, the description is generally invisible to the users. It is assumed that the users are concerned only with distinguishing different "levels

- 13 - of accomplishment" when judging how well the system has performed. Accordingly, the user's view of the system's behavior can be formulated as a random variable with respect to the underlying probability space (f,E,P), i.e., Y: 2 —,A (2.2) where, for each e fQ, Y(w) takes a value in the accomplishment set A. Depending on the application, the accomplishment set A can be any set of real numbers where the elements of A are taken to be the various degrees of user satisfaction such that a>b if a is preferred over b (i.e., the ordering relation > as implied by the user preference coincides with the natural ordering of real numbers). For example, to evaluate the reliability of a nondegradable system, the accomplishment set can be taken to be A-{0,1} where 1 = "system success" and 0 = "system failure." On the other hand, if the user is interested in evaluating the system throughput, A can be taken to be an interval of real numbers. The random variable Y will be referred to as a performance variable of S. As generally defined above, the performance variable Y clearly can be used to characterize either the performance or the reliability aspects of a system. Thus a natural measure that can be used in. the evaluation of computing systems is the probability measure induced by Y (e.g., see [15] p.97). This unified performance-reliability measure is referred to as the performability of S which, in terms of our modeling framework, can be defined as the function perfs where, for each measurable set BCA (i.e.,

$\{\omega \mid Y(\omega) \in B\} \in E$) of accomplishment levels,

$$\mathrm{perf}_S(B) = P(\{\omega \mid Y(\omega) \in B\}), \tag{2.3}$$

i.e., $\mathrm{perf}_S(B)$ is the probability that S performs at a level in B. The requirement that B be measurable insures the existence of the probability on the right side of (2.3).

In theory, it is possible to determine the performability of S from the underlying probability space $(\Omega, E, P)$ of the performance variable Y. However, in practice, the underlying probability space is generally unknown and, consequently, the performability must be determined from known properties of the base model X. Hence an important step in determining the performability of S is to establish relations between the base model X and the performance variable Y based on the given properties of X. Following a common practice in probability theory, we assume that the base model X is specified by its finite-dimensional distributions or by information that determines these distributions (e.g., Markov assumptions together with a transition function and an initial distribution). Based on these finite-dimensional distributions, we then construct a "coordinate probability space" (see [16] or [17] for the details of this construction) and express the performance variable Y in terms of an equivalent random variable defined on the new probability space. The advantage of this approach is that the resulting

probability space is considerably structured and, hence, questions about the probabilistic nature of X and Y can be addressed more easily by dealing with the coordinate probability space. The construction of this probability space described in [16] is summarized as follows.

Suppose that the base model X is described by a family of finite-dimensional distributions

$$\Phi = \{\, F_{t_1,\ldots,t_n} \mid t_1,\ldots,t_n \in T \text{ and } n \in N \,\} \tag{2.4}$$

where T is the utilization period and N is the set of all natural numbers. Then the coordinate probability space is a probability space (U, F, Pr) where

1. The coordinate sample space U is the set of all functions $u \colon T \to Q$ where Q is the state space of X. In other words, U is the T-fold direct product of the state space Q.

2. To construct the event space F, let $B^n$ be the smallest $\sigma$-algebra generated by the relative topology of $Q^n$ (also referred to as the topology of $Q^n$ induced by the n-dimensional Euclidean space). For each set B in $B^n$ and given $t_1,\ldots,t_n$ in T, let

$$C = \{\, u \in U \mid [u(t_1),\ldots,u(t_n)] \in B \,\}, \tag{2.5}$$

i.e., C is the set of all functions u in U such that the values of u at $t_i$ ($1 \le i \le n$), when regarded as an n-tuple, form an element of B. If we let $F_0$ be the set of all C such that C is obtained from (2.5) for all $n \in N$, all $B \in B^n$, and all $t_1,\ldots,t_n \in T$, then $F_0$ is a field. Finally, the event space F is taken to be the completion of the smallest $\sigma$-algebra containing $F_0$.

3. To construct the probability measure Pr, we first define a measure $\mu$ on $F_0$ such that for each $C \in F_0$,

$$\mu(C) = \int_B dF_{t_1,\ldots,t_n}(x_1,\ldots,x_n) \tag{2.6}$$

where C is generated by the set B with indices $t_1,\ldots,t_n$ (see (2.5)) and $F_{t_1,\ldots,t_n}$ is a multivariate distribution in $\Phi$ (see (2.4)). Then the probability measure Pr is the completed version of the above measure $\mu$.

Given the coordinate probability space (U, F, Pr) of X, we can construct an equivalent process of X defined on (U, F, Pr) such that both processes have the same multivariate distributions $\Phi$. More precisely, let us define $\tilde{X} = \{\tilde{X}_t \mid t \in T\}$ where, for all $t \in T$ and all $u \in U$,

$$\tilde{X}_t(u) = u(t). \tag{2.7}$$

Furthermore, for all B in $B^n$ and all $t_1,\ldots,t_n$ in T, let us assign (see (2.6))

$$\int_B dF_{t_1,\ldots,t_n}(x_1,\ldots,x_n)$$

to be the probability of the event $\{\, u \in U \mid [\tilde{X}_{t_1}(u),\ldots,\tilde{X}_{t_n}(u)] \in B \,\}$. Then, for each $t \in T$, $\tilde{X}_t$ is a function from the set U to the set Q, and the family of functions $\tilde{X}$ is a stochastic process defined on (U, F, Pr) having the multivariate distributions $\Phi$ (see [17], pp. 10-11). Since, for each $u \in U$, u(t) specifies the state of $\tilde{X}$ at t (see (2.7)), each element of U will be referred to as a state trajectory and the set U will be referred to as the trajectory space.

The notion of a coordinate probability space permits us to answer questions about the probabilistic nature of X by relating the questions to the state behavior of $\tilde{X}$. In particular, for each fixed $\omega \in \Omega$ of the underlying probability space $(\Omega, E, P)$, let us define a function $u_\omega \colon T \to Q$ such that, for all $t \in T$,

$$u_\omega(t) = X_t(\omega). \tag{2.8}$$

Moreover, given an event V in F, let

$$W = \{\, \omega \mid u_\omega \in V \,\} \tag{2.9}$$

be a subset of $\Omega$. Then, it is known (see [17], pp. 621-622) that W is an event in E and that

$$P(W) = \Pr[V]. \tag{2.10}$$

On the other hand, given a subset W of $\Omega$ measurable with respect to the induced probability space of X, there exists an event V in F such that (2.10) is satisfied.

In the following discussions, we consider the question of what we need to know about X in order to determine the distribution function of the performance variable Y. Formally, we say that X supports Y if there exists a random variable

$$\gamma \colon U \to A \tag{2.11}$$

defined with respect to the coordinate probability space (U, F, Pr) such that, for each $\omega \in \Omega$,

$$Y(\omega) = \gamma(u_\omega) \tag{2.12}$$

where $u_\omega$ is the state trajectory associated with the outcome $\omega$. Since $\gamma$ can be regarded as the user's performance criterion for judging the "capability" of the total system, it is referred to as the capability function of S.

When X supports Y, the capability function permits us to determine the performability of S using the finite-dimensional distributions of X. To substantiate this claim, let us define a function $h \colon \Omega \to U$ such that, for all $\omega \in \Omega$, $h(\omega) = u_\omega$, where $u_\omega$ is the state trajectory associated with the outcome $\omega$. Then, by (2.12), X supports Y implies $Y = \gamma \circ h$, i.e., Y is the functional composition of $\gamma$ and h, applying h first. Accordingly, taking preimages on both sides, we have

$$Y^{-1} = h^{-1} \circ \gamma^{-1}.$$

Hence, for any measurable set $B \subseteq A$, if we let

$$V = \{\, u \mid \gamma(u) \in B \,\} \subseteq U \quad \text{and} \quad W = \{\, \omega \mid Y(\omega) \in B \,\} \subseteq \Omega,$$

then

$$W = \{\, \omega \mid u_\omega \in V \,\} = h^{-1}(V).$$

Accordingly, by (2.10), $P(W) = \Pr[V]$, which, in turn, implies

$$\mathrm{perf}_S(B) = P(\{\, \omega \mid Y(\omega) \in B \,\}) = \Pr[\gamma^{-1}(B)]. \tag{2.13}$$

Since the probability $\Pr[\gamma^{-1}(B)]$ can be determined directly from the finite-dimensional distributions of X, we have shown that X together with $\gamma$ suffice to support an evaluation of the performability $\mathrm{perf}_S$.

In view of what has been observed, if X supports Y, then the pair $(X, \gamma)$ is said to constitute a performability model of S. If B is a measurable set of accomplishment levels, the inverse image $\gamma^{-1}(B)$ is referred to as the trajectory set of B, where its determination requires an analysis of how levels in B relate back down via $\gamma^{-1}$ to trajectories of the base model. In the following

discussions, since we will be dealing with $\tilde{X}$ instead of X, the induced process $\tilde{X}$ will be called the base model and denoted simply as X.

Given a performability model $(X, \gamma)$, equation (2.13) permits us to evaluate the performability of S for a set B of accomplishment levels by i) determining the trajectory set $\gamma^{-1}(B)$ and ii) calculating $\Pr[\gamma^{-1}(B)]$. In general, the trajectory set $\gamma^{-1}(B)$ is difficult to obtain because the "distance" between the base model X and the performance variable Y may be considerable. The difficulty can be alleviated by introducing intermediate models between X and Y based on the concept of a model hierarchy discussed in Section 2.1. The use of a model hierarchy allows the capability function or, more accurately, the trajectory set $\gamma^{-1}(B)$ to be derived step-by-step in a top-down manner from more elementary components in a clearly conceived way. In particular, by introducing an intermediate model called an "operational model," we show in the following chapter that the performability of S can be determined by evaluating the intermediate model.

Finally, we note that the role of a capability function in performability evaluation is similar to that of a structure function [18] in reliability evaluation. However, even when performability is restricted to reliability, the concept of a capability function is still more general, because a capability function must take into account the behavior of S throughout the utilization period while a structure function is restricted to modeling the instantaneous behavior of S at a given moment in time [2].
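To make the pair $(X, \gamma)$ concrete, the following minimal sketch estimates a performability value by sampling trajectories of a base model and applying a capability function, as in (2.13). Everything in it is hypothetical rather than drawn from the text: the base model is a three-state continuous-time Markov process specified by a generator matrix, and the capability function assigns accomplishment level 1 ("system success") exactly to those trajectories that never enter the failed state, so it depends only on the states visited.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical three-state base model given by a generator matrix; state 2
# is an absorbing "failed" state. None of these numbers come from the text.
A = np.array([[-3.0, 2.0, 1.0],
              [0.0, -1.0, 1.0],
              [0.0, 0.0, 0.0]])
t_end = 5.0                            # utilization period T = [0, 5]

def sample_trajectory(x0=0):
    """Sample the sequence of states visited by X during [0, t_end]."""
    states = [x0]
    t, x = 0.0, x0
    while -A[x, x] > 0.0:
        t += rng.exponential(1.0 / -A[x, x])     # exponential holding time
        if t >= t_end:
            break
        rates = np.maximum(A[x], 0.0)            # jump rates out of x
        x = rng.choice(len(A), p=rates / rates.sum())
        states.append(x)
    return states

def gamma(states):
    """Hypothetical capability function: level 1 ("success") iff the failed
    state 2 is never entered during the utilization period."""
    return 0 if 2 in states else 1

# perf_S({1}) = Pr[gamma^{-1}({1})] as in (2.13), estimated by sampling.
n = 100_000
est = sum(gamma(sample_trajectory()) for _ in range(n)) / n
print(f"estimated perf_S({{1}}) = {est:.4f}")
```

The simulation simply estimates the probability of the trajectory set $\gamma^{-1}(\{1\})$; the analytic solution methods developed in the next chapter avoid such sampling when the base model is Markovian.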

CHAPTER 3

OPERATIONAL MODELS

3.1 Introduction

When degradable computing systems are modeled by stochastic processes for system performance or reliability evaluation, the models used are typically Markov processes (see [8], for example) or models which can be analyzed in terms of embedded Markov processes (for example, certain queueing models such as M/G/1 or GI/M/m queues; see [21]). However, to ensure the validity of the Markov assumption, it is usually necessary to model the structure and behavior of the system at a low level, e.g., a level describing the system's physical resources (processing units, memory units, input buffers, etc.). Performance and reliability measures, on the other hand, often quantify the system's behavior in terms of high-level, user-oriented variables (throughput, response time, operational status, etc.) which, if viewed as stochastic processes, are seldom Markovian. In such cases, an essential part of the modeling effort is to establish a "connection" between the low and high levels in order to resolve the probabilistic nature of the measure in question.

Historically, in the context of reliability modeling, this connection has taken a form that lies at one of two extremes. At one extreme, system "success" is defined in terms of the underlying structural resources (at least so

many fault-free processors, at least so many fault-free memory units, etc.), in which case the connection between structure (available fault-free resources) and performance (success or failure) is immediate. At the other extreme, the object of the modeling effort is the connection, per se, and the resulting model is typically some form of event-tree or fault-tree (see [18], for example).

In general, as discussed in the previous chapter, the nature of this connection can be formalized as a capability function of the system. In this setting, a total system S, comprising a computing system and its computational environment, is modeled at a low level by a stochastic process X (the base model of S). Then, relative to a high level variable Y (the performance of S), the capability function of S is a function $\gamma$ which translates state trajectories (sample paths) of the process X into corresponding values of the performance variable Y. Knowing X and $\gamma$, it is possible to solve for the probability distribution function of Y and, hence, determine the performability of S.

When the performance variable Y is far removed from the base model X, solution procedures can be simplified by introducing intermediate models at levels between X and Y. One use of such a model hierarchy is a step-by-step formulation of the preimage of $\gamma$, beginning at Y and terminating at the base model X. If Y is discrete, the performability of S can then be evaluated by determining the probabilities of certain trajectory sets that correspond (under $\gamma^{-1}$) to performance values of Y. Another role that can be

played by an intermediate model, and the one we explore in this chapter, is to represent the probabilistic nature of S at a level that is higher than the base model and thus "closer" to the performance variable.

To characterize the behavior of an intermediate model, Section 3.2 introduces a general notion of recoverability and shows that a performance process is nonrecoverable if and only if the state behavior of the process can be determined by taking a "snapshot" at the end of the utilization period. For both recoverable and nonrecoverable models, Section 3.3 examines solution methods for a generally defined performance variable where the performance is identified with the minimum value of a functional. The modeling and solution methods are then illustrated in Section 3.4 through the evaluation of a degradable computing system. The results of the evaluation indicate that the performance variable considered in this chapter is a proper generalization of the traditional notions of system performance and reliability. The modeling and evaluation methods considered thus provide a unifying approach for evaluating the integrated performance and reliability of degradable computing systems.

3.2 Recoverability

Generally, in reliability modeling, a system is said to be repairable or nonrepairable according to whether maintenance actions are permitted during its utilization to reduce the incidence of system failure or to return a failed system to an operating state. The classification is useful because, when a

system is nonrepairable, the computation of system reliability at a time t amounts to calculating the probability that the system functions at that moment in time (see [18], for example). On the other hand, when a system is repairable, the computation requires a more complete knowledge of the system's behavior during the entire utilization period T.

The above classification and properties of reliability models can be extended to models of degradable computing systems by considering the way in which system performance may change in time. The generalization not only permits us to obtain a better understanding of the performance degradation of a degradable computing system, but also provides us with a common basis for unifying traditional performance and reliability methods.

To begin, let us define an operational model to be a stochastic process $Z = \{Z_t \mid t \in T\}$ with $Z_t \colon \Omega \to Q$ such that the state space Q of Z is partially ordered by some partial ordering $\le$. The partial ordering $\le$ can be interpreted as the ranking of system states according to the degree of user satisfaction with the system operating in a given state (hence the term "operational"). Although operational models are introduced here to characterize intermediate-level models, it should be noted that operational models can often be defined at the base model level with some natural ordering of states. For instance, consider a system S

containing m subsystems, each of which can be in one of two operational conditions: functioning or failed. Then one natural way to define a state space for S is by taking

$$Q = \{0,1\}^m \tag{3.1}$$

and, assuming no compensating effects of successive failures, the state space can be ordered by taking the Cartesian product of the component ordering relations, i.e., for all $(a_1,a_2,\ldots,a_m)$ and $(b_1,b_2,\ldots,b_m)$ in Q, let

$$(a_1,a_2,\ldots,a_m) \le (b_1,b_2,\ldots,b_m) \tag{3.2}$$

if and only if $a_i \le b_i$ for all $1 \le i \le m$. The above ordering of component states is a standard practice in reliability theory and plays an important role in fault-tree analysis (see [18], for example). The applicability of operational models in modeling degradable computing systems will be discussed in more detail in the next section.
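As a small illustration of (3.1) and (3.2), the following sketch (with hypothetical values for m and the example states) implements the componentwise partial order on $Q = \{0,1\}^m$; note that distinct states are frequently noncomparable, which is why $\le$ is only a partial ordering.

```python
from itertools import product

def leq(a, b):
    """The componentwise order of (3.2): a <= b iff a_i <= b_i for every i."""
    return all(x <= y for x, y in zip(a, b))

# Q = {0,1}^m for a hypothetical m = 3 subsystems (1 = functioning, 0 = failed).
Q = list(product((0, 1), repeat=3))

print(leq((0, 1, 0), (1, 1, 0)))   # True: the second state dominates the first
print(leq((0, 1, 0), (1, 0, 1)))   # False...
print(leq((1, 0, 1), (0, 1, 0)))   # ...and False: these states are noncomparable
```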

Given an operational model Z, the concept of repairability can be extended as follows. We say Z is nonrecoverable if, for all $s,t \in T$ ($s \le t$) and all $i,j \in Q$,

$$\Pr[Z_s = i,\, Z_t = j] > 0 \text{ implies } i \ge j.$$

In less formal terms, an operational model is nonrecoverable iff its operational status can only degrade monotonically in time. Moreover, by considering the contrapositive form of the above condition, it follows that Z is nonrecoverable iff, for all $s,t \in T$ ($s \le t$) and all $i,j \in Q$,

$$i \not\ge j \text{ implies } \Pr[Z_s = i,\, Z_t = j] = 0. \tag{3.3}$$

Similarly, we say Z is recoverable if it is not nonrecoverable, i.e., if there exist $s,t \in T$ ($s \le t$) and $i,j \in Q$ such that

$$i \not\ge j \text{ and } \Pr[Z_s = i,\, Z_t = j] > 0. \tag{3.4}$$

In other words, Z is recoverable if there is a nonzero probability that the state of the system may "recover" from a degraded state i to a higher level state j ($j > i$) or to a noncomparable state j ($j \not\le i$ and $j \not\ge i$).

The notion of nonrecoverability can be characterized in a number of useful ways, as indicated by the following theorem.

Theorem 3.1: Let Z be an operational model with state space Q. Then the following statements are equivalent:

(1) Z is nonrecoverable.

(2) For all $s,t \in T$ ($s \le t$) and all $k \in Q$,
$$\Pr[Z_s \not\ge k,\, Z_t = k] = 0.$$

(3) For all $s,t \in T$ ($s \le t$) and all $k \in Q$,
$$\Pr[Z_s \not\ge k,\, Z_t \ge k] = 0.$$

(4) For all $s,t \in T$ ($s \le t$) and all $k \in Q$,
$$\Pr[Z_s \ge k,\, Z_t \ge k] = \Pr[Z_t \ge k].$$

Proof. (1) implies (2): Suppose that Z is nonrecoverable. By (3.3), for all $s,t \in T$ ($s \le t$) and all $i,j \in Q$, $i \not\ge j$ implies $\Pr[Z_s = i,\, Z_t = j] = 0$. Since Q is denumerable, we then have, for all $k \in Q$,

$$\Pr[Z_s \not\ge k,\, Z_t = k] = \sum_{i \not\ge k} \Pr[Z_s = i,\, Z_t = k] = 0.$$

(2) implies (3): Suppose that, for all $s,t \in T$ ($s \le t$) and all $j \in Q$, $\Pr[Z_s \not\ge j,\, Z_t = j] = 0$. Then, since Q is denumerable,

$$\Pr[Z_s \not\ge k,\, Z_t \ge k] = \sum_{j \ge k} \Pr[Z_s \not\ge k,\, Z_t = j] \le \sum_{j \ge k} \Pr[Z_s \not\ge j,\, Z_t = j] = 0.$$

(3) implies (4): Note first that, for all $s,t \in T$ ($s \le t$) and all $k \in Q$,

$$\Pr[Z_t \ge k] = \Pr[Z_s \ge k,\, Z_t \ge k] + \Pr[Z_s \not\ge k,\, Z_t \ge k].$$

Hence, for all $s,t \in T$ ($s \le t$) and all $k \in Q$,

$$\Pr[Z_s \not\ge k,\, Z_t \ge k] = 0 \tag{3.5}$$

if and only if

- 30 - Pr[Z4>:k, Ztk] = Pr[Zt>k]. Thus (3) and (4) are equivalent and, in particular, (3) implies (4). (4) implies (1): From (3.5), when condition (4) is satisfied, we have, for all s,tET (s't) and all jeQ, Pr[Zij, Ztj]- =0. Thus, ij implies Pr[Z-i, Z=j] < Pr[Zs-j, Zt-j] - 0, in particular, it implies Pr[Z=i, Z=j] = 0, i.e., as characterized in (3.3), Z is nonrecoverable. This circle of implications thus completes the proof of Theorem 3.1. An alternative way to characterize the recoverability of an operational model is by examining the state behavior of the model over the entire utilization period. In this regard, let us restrict our attention to operational models that are separable in the sense as defined in [17], i.e., there exists a denumerable subset R of T and an event A of probability 0 such that,

for any closed interval B and any open interval I in $(-\infty,+\infty)$, we have

$$\bigcap_{s \in I_R} \{Z_s \in B\} - \bigcap_{s \in I_T} \{Z_s \in B\} \subseteq \Lambda \tag{3.6}$$

where $I_R = I \cap R$ and $I_T = I \cap T$. The set R is referred to as a separability set.

When Z is separable, we are able to show that Z is nonrecoverable if and only if its state behavior over any time interval can be summarized by observing the state of Z at the end of the interval.

Theorem 3.2: Suppose Z is a separable operational model. Then Z is nonrecoverable if and only if, for all $r,t \in T$ ($r < t$) and all $k \in Q$,

$$\Pr[Z_s \ge k,\ r \le s \le t] = \Pr[Z_t \ge k]. \tag{3.7}$$

Proof. Suppose Z is nonrecoverable and, given a state $k \in Q$, let

$$E_s = \{\omega \mid Z_s(\omega) \ge k\} \quad (s \in T).$$

Then, in terms of this notation, (3.7) says

$$\Pr\Big[\bigcap_{s \in [r,t]} E_s\Big] = \Pr[E_t]. \tag{3.7'}$$

Furthermore, let us denote the intersection of two sets A and B by AB. Since Z is separable, there exists a denumerable subset R of T and a null event $\Lambda$ such that, for all $r,t \in T$,

$$\bigcap_{s \in (r,t)_R} E_s - \bigcap_{s \in (r,t)} E_s \subseteq \Lambda.$$

Accordingly,

$$E_r E_t \bigcap_{s \in (r,t)_R} E_s - E_r E_t \bigcap_{s \in (r,t)} E_s \subseteq \Lambda$$

and, hence,

$$\bigcap_{s \in [r,t]_{R'}} E_s - \bigcap_{s \in [r,t]} E_s \subseteq \Lambda \tag{3.8}$$

where $R' = R \cup \{r,t\}$. Clearly, the first set in (3.8) is measurable because $[r,t]_{R'}$ is denumerable. Thus, under the separability hypothesis, the second set (which is contained in the first) is likewise measurable and has the same probability (assuming the probability measure Pr is complete; see the construction in Section 2.2). More

precisely, for all $r,t \in T$ ($r < t$),

$$\Pr\Big[\bigcap_{s \in [r,t]} E_s\Big] = \Pr\Big[\bigcap_{s \in [r,t]_{R'}} E_s\Big]. \tag{3.9}$$

Hence, if we can show that the probability on the right side of (3.9) equals $\Pr[E_t]$, we establish the desired result, i.e., (3.7)'. If we denote

$$D = \bigcap_{s \in [r,t]_{R'}} E_s,$$

then, since $t \in [r,t]_{R'}$, we have $D = D E_t$. Thus it suffices to show that

$$\Pr[D E_t] = \Pr[E_t]$$

or, equivalently,

$$\Pr[\bar{D} E_t] = 0 \tag{3.10}$$

where $\bar{D}$ denotes the complement of D. Next, note that

$$\bar{D} E_t = \Big(\bigcup_{s \in [r,t]_{R'}} \bar{E}_s\Big) E_t = \bigcup_{s \in [r,t]_{R'}} \bar{E}_s E_t.$$

Accordingly, we have

$$\Pr[\bar{D} E_t] \le \sum_{s \in [r,t]_{R'}} \Pr[\bar{E}_s E_t]$$

or, equivalently (in our original notation),

$$\Pr[\bar{D} E_t] \le \sum_{s \in [r,t]_{R'}} \Pr[Z_s \not\ge k,\, Z_t \ge k].$$

Since Z is nonrecoverable, by Theorem 3.1 (condition (3)), each term on the right side of the above inequality is zero, whence $\Pr[\bar{D} E_t] = 0$. This proves (3.10) and thus establishes the necessity of (3.7).

To prove that (3.7) is sufficient, suppose (3.7) holds. Then, if we let $s,t \in T$ and $k \in Q$, by Theorem 3.1, it suffices to show that

$$\Pr[Z_s \ge k,\, Z_t \ge k] = \Pr[Z_t \ge k]$$

or, equivalently, using the notation introduced above,

$$\Pr[E_s E_t] = \Pr[E_t]. \tag{3.11}$$

The above equality is trivially true when $s = t$, so let us suppose $s < t$. By (3.7), it follows that

$$\Pr\Big[\bigcap_{u \in [s,t]} E_u\Big] = \Pr[E_t].$$

Then, since

$$\Pr[E_s E_t] \le \Pr[E_t]$$

and

$$\Pr[E_s E_t] \ge \Pr\Big[\bigcap_{u \in [s,t]} E_u\Big] = \Pr[E_t],$$

it follows that $\Pr[E_s E_t] = \Pr[E_t]$, which establishes the sufficiency of (3.7).

Using Theorem 3.2 and the fact that Q is denumerable, nonrecoverability can also be characterized by each of the following alternative conditions (the proofs are immediate and are omitted):

(1) For all $r,t \in T$ ($r < t$) and all $k \in Q$,
$$\Pr[Z_s > k,\ r \le s \le t] = \Pr[Z_t > k]. \tag{3.12}$$

(2) For all $r,t \in T$ ($r < t$) and all $k \in Q$,
$$\Pr[Z_s \ge k,\ r \le s < t,\ Z_t = k] = \Pr[Z_t = k]. \tag{3.13}$$

Theorem 3.2 provides us with a convenient way of relating the concept of recoverability to the traditional notion of repairability. To see this, let us define the level-k reliability of Z at time t to be

$$R_k(t) = \Pr[Z_s \ge k,\ 0 \le s \le t] \tag{3.14}$$

and define the level-k availability of Z at time t to be

$$A_k(t) = \Pr[Z_t \ge k]. \tag{3.15}$$

Clearly, when $Q = \{0,1\}$ where 0 = failure and 1 = success, $R_1(t)$ and $A_1(t)$ reduce to the usual notions of system reliability and system availability, respectively. Moreover, when Z is nonrecoverable, Theorem 3.2 implies that,

for all $t \in T$ and all $k \in Q$,

$$A_k(t) = R_k(t). \tag{3.16}$$

In other words, when Z is nonrecoverable, its level-k reliability reduces to the level-k availability for all $k \in Q$. The significance of this observation is that, when Z is nonrecoverable, the calculation of the level-k reliability at a time t amounts to calculating the probability that the system operates at a level greater than or equal to k at that particular moment in time.

On the other hand, let $T = [0,h]$ and suppose (3.16) holds. Then, since, for all $r,t \in T$ ($r < t$) and all $k \in Q$,

$$\Pr[Z_s \ge k,\ 0 \le s \le t] \le \Pr[Z_s \ge k,\ r \le s \le t] \le \Pr[Z_t \ge k],$$

we have that $A_k(t) = R_k(t)$ implies

$$\Pr[Z_s \ge k,\ r \le s \le t] = \Pr[Z_t \ge k],$$

i.e., condition (3.16) is necessary and sufficient for nonrecoverability. Thus, by taking the negation of the above condition, we also have the following alternative characterization of recoverability:

Theorem 3.3: Let Z be a separable operational model with state space Q. Then Z

is recoverable if and only if there exist $t \in T$ and $k \in Q$ such that

$$A_k(t) > R_k(t). \tag{3.17}$$

Roughly speaking, the above theorem says that Z is recoverable if and only if, for some $k \in Q$, the level-k availability is subject to improvement by maintenance actions.

Theorems 3.2 and 3.3 not only provide us with a useful tool for characterizing the behavior of operational models, they also provide us with a basis for evaluating the performability of degradable computing systems. Each of equations (3.14) and (3.15) defines an important class of performance measures that are proper generalizations of the traditional notions of system reliability and system availability. When the operational model is nonrecoverable, both classes convey the same information and the behavior of the system can be determined by taking a "snapshot" at the end of the utilization period. Motivated by the above properties of an operational model, we consider in the following section a single user-oriented performance variable that integrates these notions of system reliability and availability.
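The contrast between (3.14) and (3.15) can be computed directly for a Markovian example. In the following sketch the three-state generator, the rates f, and the initial distribution are hypothetical; $A_k(t)$ comes from the transition function of X, while $R_k(t)$ is obtained by the standard device (not prescribed by the text) of making every state below level k absorbing, so that trajectories which ever drop below k stay trapped. Since the model includes repair transitions, Theorem 3.3 predicts $A_k(t) > R_k(t)$.

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical three-state model with repair: state 2 = fully up,
# 1 = degraded, 0 = down; the 0 -> 1 and 1 -> 2 transitions are repairs.
A = np.array([[-0.5, 0.5, 0.0],    # generator of X (rows sum to zero)
              [1.0, -3.0, 2.0],
              [0.0, 1.0, -1.0]])
f = np.array([0.0, 1.0, 2.0])      # operational rates f(i)
p0 = np.array([0.0, 0.0, 1.0])     # start in the fully-up state
t, k = 2.0, 1.0                    # evaluate the level-k measures at time t

up = f >= k                        # states operating at level >= k

# Level-k availability (3.15): A_k(t) = Pr[Z_t >= k].
avail = (p0 @ expm(A * t))[up].sum()

# Level-k reliability (3.14): absorb the below-level states, so that a
# trajectory that ever drops below k can never again be counted as "up".
A_abs = A.copy()
A_abs[~up, :] = 0.0
rel = (p0 @ expm(A_abs * t))[up].sum()

print(f"A_k(t) = {avail:.4f}, R_k(t) = {rel:.4f}")
# A_k(t) > R_k(t) here, so by Theorem 3.3 this model is recoverable.
```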

3.3 Evaluation of Computing Systems Using Functionals of a Markov Process

When describing system behavior in user-oriented terms, it is often possible to identify various operational "modes" for the system (including a failure mode) which result in different degrees of user satisfaction. Moreover, for a given mode of operation, the extent of user satisfaction can often be quantified as a real number "rate" at which that operation benefits or penalizes the user. Depending on the application, these rates can have a variety of interpretations relating to the system's productivity, responsiveness, etc., or, at a higher level, to such things as economic benefit (e.g., the worth rate measured, say, in dollars/unit time) associated with a given mode of operation.

Under the above conditions, a user-oriented model can be constructed in a natural way. As in the previous discussions, let S denote the total system in question and suppose that we have already determined a base model X and a capability function $\gamma$ relative to some specified performance variable Y. Suppose further that the base model process X is defined relative to a continuous time interval T (the utilization period), that is,

$$X = \{X_t \mid t \in T\} \tag{3.18}$$

where the random variables $X_t$ take values in a denumerable state space Q (see (2.1) for the definition of a base model). Finally, we presume that, at the base level, the system model is Markovian with a time-invariant structure, that is, X is a continuous-time, time-homogeneous Markov process. Unless otherwise specified, it will be assumed that Q is countably infinite throughout the following discussions.

Within this framework, let us now consider the situation discussed

above where, at a higher level, one is able to identify various operational modes for S, each having an associated operational rate. If, further, each state of the base model can be classified according to some mode of operation, then there is a naturally defined real-valued function

$$f \colon Q \to R \tag{3.19}$$

where, for each $i \in Q$, f(i) is the operational rate associated with the mode containing i. Moreover, if we let $\hat{Q}$ denote the range of f (i.e., $\hat{Q} = \{f(i) \mid i \in Q\}$) and, for each variable $X_t$ of X (see (3.18)), we let

$$Z_t = f(X_t), \tag{3.20}$$

it follows that

$$Z = \{Z_t \mid t \in T\} \tag{3.21}$$

is a stochastic process with state space $\hat{Q}$, referred to generally as a functional of the underlying Markov process X (see [22], for example).

When f is not 1-1 (i.e., some different states have the same mode of operation), the derived process Z will typically represent a simpler, higher level view of the system and is generally non-Markovian unless certain stringent conditions are satisfied. (Conditions under which the derived processes become

Markovian are discussed in the Appendix.) To qualify Z as an intermediate model, we must also require that Z be compatible with the performance variable Y to the extent that the probability distribution function of Y can be determined from Z. More precisely, letting $\kappa$ denote the translation of trajectories of X to trajectories of Z (i.e., $\kappa(u) = \hat{u}$ where $\hat{u}(t) = f(u(t))$ for all $t \in T$), there must exist a capability function $\hat{\gamma}$ for Z such that

$$\hat{\gamma} \circ \kappa = \gamma \tag{3.22}$$

where $\circ$ denotes functional composition, first applying $\kappa$. Although the above condition appears somewhat formidable, it says simply that the higher level model Z must remain detailed enough to permit solution of the system's performability. This condition can typically be satisfied in practice if the definition of performance (i.e., Y) is taken into account when identifying the various modes of operation and assigning rates to these modes.

If f, as defined in (3.19), satisfies condition (3.22), then we refer to f as an operational structure of S and, since states inherit the rates assigned to modes, the value f(i) is referred to as the operational rate of i or, when context permits, simply the "rate of i." Likewise, the corresponding functional Z is referred to as an operational model of S or, alternatively, a model of S at the operational level.

In reliability modeling where, at the operational level, a system is

typically viewed as either operating or not operating, the concept of an operational structure reduces to the familiar notion of a structure function [18]. Technically, a function $f \colon Q \to R$ is a structure function if Q has binary coordinates, i.e., $Q = \{0,1\}^m$, and f(i) is 1 or 0 according as S is operating or not operating in state i. More recently, operational structures have been employed at least implicitly in the context of performance-reliability modeling, where the operational rates are referred to as computational capacities [9], [12]. Although capacity (which typically refers to the maximum rate at which a computer can "supply" computations) is a legitimate interpretation of operational rate, it should be emphasized that, in general, such rates can represent an interaction of supply (by the computer) and demand (from the environment). This is because, as generally conceived, a state i of the base model represents a particular status of both the computer and its environment; hence, both supply and demand can be accounted for when translating i, via f, to its corresponding operational rate f(i).

In various special forms, then, the concept of an operational structure is no stranger to performance and reliability modeling. On the other hand, the general nature of the associated functional Z, how it relates to the base model, how it can be exploited in solution procedures, etc., appear to be subjects that deserve further investigation.

In the following discussions, we focus our investigation on the evaluation of performability with respect to a generally defined performance

variable. This variable is defined in terms of an arbitrary operational model which is generally non-Markovian. However, by relating this variable to the underlying Markov process, it is shown that system performability can still be evaluated using traditional Markov process methods.

The performance variable is motivated by the level-q reliability

$$R_q(t) = \Pr[f(X_s) \ge q,\ 0 \le s \le t] \quad (\text{where } q \in \hat{Q})$$

discussed in Section 3.2. Recall that, by (3.16), the operational model $Z = \{Z_t \mid t \in T\}$ is nonrecoverable if and only if, for all $t \in T$, $Z_t$ is the "worst case" rate experienced by Z during [0,t]. On the other hand, if Z is recoverable, it was shown in Theorem 3.3 that the operational rate at the end of the utilization period (i.e., the value $Z_t$) will generally not convey the worst case rate. Motivated by the above considerations, a performance variable $Y_t$, indicating the worst case operational rate during [0,t], can be defined on Z as follows:

$$Y_t = \min\{Z_s \mid 0 \le s \le t\}. \tag{3.23}$$

As defined above, we note first that $Y_t$ is a discrete performance variable, since the base model X has a denumerable number of states and, hence, there are a denumerable number of operational rates. Therefore the

performability $\mathrm{perf}_S$ of S (see (2.3)) is simply the probability distribution of $Y_t$, i.e.,

$$\mathrm{perf}_S(q) = \Pr[Y_t = q]. \tag{3.24}$$

Before attempting to solve for the performability of S, let us consider the recoverability of Z in more detail. Since the underlying base model is a time-homogeneous Markov process, significant insights can be obtained regarding the relationship between Z and X by expressing the recoverability of Z in terms of the probabilistic nature of X. In this regard, let us restrict our attention to Markov processes X which are regular in the sense that their transition probabilities are uniquely determined by a generator matrix or, equivalently, a state-transition diagram (see [23], for example). Moreover, borrowing the terminology from [22], we say that i leads to j (where $i,j \in Q$) and write $i \to j$ if and only if there exists a $t > 0$ in T such that

$$\Pr[X_t = j \mid X_0 = i] > 0.$$

Then, it can be shown that

Theorem 3.4: Let Z be an operational model associated with the base model X and the operational structure f, and let X be a time-homogeneous Markov process. Then Z is recoverable if and only if, for some $s \in T$ and some $i,j \in Q$, both of the following conditions are satisfied:

(1) $\Pr[X_s = i] > 0$,
(2) $i \to j$ and $f(i) < f(j)$. $\quad$ (3.25)

Proof. By (3.4), Z is recoverable if and only if there exist $s,t \in T$ ($s \le t$) and $q,r \in \hat{Q}$ such that

$$q < r \text{ and } \Pr[Z_s = q,\, Z_t = r] > 0. \tag{3.26}$$

(Here, since $\hat{Q}$ is a totally ordered set, we are able to replace $\not\ge$ with $<$ in (3.4).) Now, since Q is denumerable,

$$\Pr[Z_s = q,\, Z_t = r] = \sum_{f(i) = q,\ f(j) = r} \Pr[X_s = i,\, X_t = j]$$

and, hence, equation (3.26) holds if and only if, for some $i,j \in Q$,

$$f(i) = q, \quad f(j) = r \quad \text{and} \quad \Pr[X_s = i,\, X_t = j] > 0. \tag{3.27}$$

Moreover, since

$$\Pr[X_s = i,\, X_t = j] > 0 \text{ if and only if } \Pr[X_s = i] > 0 \text{ and } i \to j, \tag{3.28}$$

it follows that the conditions stated in (3.25) are necessary and sufficient for Z to be recoverable.

By taking the negation of (3.25), a similar result can also be derived to characterize the nonrecoverability of Z as follows:

Corollary: Let Z be an operational model associated with the base model X and the operational structure f, and let X be a time-homogeneous Markov process. Then Z is nonrecoverable if and only if, for all $i,j \in Q$ and all $s \in T$, at least one of the following conditions is satisfied:

(1) $\Pr[X_s = i] = 0$,
(2) $i \to j$ implies $f(i) \ge f(j)$. $\quad$ (3.29)

Note that if we assume Q is the minimal state space of X in the sense defined in [22], i.e., for all $i \in Q$ there exists an s in T such that $\Pr[X_s = i] > 0$,

then the first conditions in (3.25) and (3.29) can both be eliminated. To show this, we first observe that $\Pr[X_s = i] > 0$ if and only if

$$\Pr[X_0 = k] > 0 \text{ and } \Pr[X_s = i \mid X_0 = k] > 0$$

for some $k \in Q$. Now, since X is regular, the transition probability $\Pr[X_t = i \mid X_0 = k]$, as a function of t, vanishes either everywhere or nowhere in T (see [23], p. 240). Thus, it must be the case that $\Pr[X_t = i \mid X_0 = k] > 0$ for all $t \in T$. Clearly, it then follows that

$$\Pr[X_t = i] \ge \Pr[X_0 = k] \cdot \Pr[X_t = i \mid X_0 = k] > 0$$

for all $t \in T$.

The recoverability of Z can also be characterized in terms of partitions induced by f on the state space Q. Note first that the binary relation $\to$ induces an equivalence relation on Q as follows (see [22], for example): we say i communicates with j (denoted $i \leftrightarrow j$) if and only if $i \to j$ and $j \to i$. Let $[i]_c$ be the communicating class containing i. Then the partition of Q induced by the communication relation can be denoted as a set

$$\pi_c = \{[i]_c \mid i \in Q\}. \tag{3.30}$$
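For a finite state space, both the leads-to relation and the partition $\pi_c$ of (3.30) are mechanically computable. A short sketch follows, assuming the process is specified by a finite generator matrix, so that i leads to j iff j is reachable from i through arcs with positive rates; the example generator is hypothetical.

```python
import numpy as np
from itertools import product

def leads_to(A):
    """Reachability of the leads-to relation: L[i, j] = True iff i -> j.
    For a regular chain, i -> j iff there is a directed path of positive-rate
    arcs from i to j (and i -> i always, since holding in i for time t has
    positive probability)."""
    n = len(A)
    L = (np.asarray(A) > 0) | np.eye(n, dtype=bool)
    for k, i, j in product(range(n), repeat=3):   # Warshall transitive closure
        L[i, j] |= L[i, k] & L[k, j]
    return L

def communicating_classes(A):
    """The partition pi_c of (3.30): i and j share a block iff i <-> j."""
    L = leads_to(A)
    comm = L & L.T
    return {frozenset(np.flatnonzero(comm[i])) for i in range(len(A))}

# Hypothetical generator: 0 -> 1 at rate 1, and 1 <-> 2 at rate 1 each way.
A = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0],
              [0.0, 1.0, -1.0]])
print(communicating_classes(A))   # {frozenset({0}), frozenset({1, 2})}
```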

Note also that the operational structure $f \colon Q \to R$ induces another partition on Q,

$$\pi_f = \{[i]_f \mid i \in Q\}, \tag{3.31}$$

such that i and j belong to the same equivalence class if $f(i) = f(j)$. In terms of the above partitions, and assuming that X is a time-homogeneous Markov process with minimal state space Q, we can then show that

Lemma: If Z is nonrecoverable, then $\pi_c$ is finer than $\pi_f$ (denoted $\pi_c \le \pi_f$).

Proof: Suppose that i and j belong to the same block of $\pi_c$. It must be the case that $i \leftrightarrow j$ which, in turn, implies that $f(i) \ge f(j)$ and $f(j) \ge f(i)$ because Z is nonrecoverable. Thus, it follows that $f(i) = f(j)$, i.e., i and j belong to the same block of $\pi_f$.

The converse of the above lemma is generally not true. For example, let X be a Markov process with the transition graph

[state-transition diagram: 1 → 2, and 2 ↔ 3]

and initial distribution $\Pr[X_0 = 1] = 1$. Suppose the operational structure $f \colon Q \to R$ is given by $f(1) = 0$ and $f(2) = f(3) = 1$. Then

$$\pi_c = \{\{1\},\{2,3\}\} = \pi_f.$$

However, Z is recoverable, because $1 \to 2$ but $f(1) < f(2)$.

The above example also suggests a necessary and sufficient condition for Z to be a nonrecoverable model. Under the same assumptions as those for the above lemma, we first define a partial ordering of the set $\pi_c$: for all $[i]_c$ and $[j]_c$ in $\pi_c$, let

$$[i]_c \to [j]_c \text{ if } i \to j. \tag{3.32}$$

Clearly, this partial ordering, as a relation, is reflexive, transitive and antisymmetric. Furthermore, let us define a mapping

$$h \colon \pi_c \to \hat{Q} \tag{3.33}$$

such that $h([i]_c) = f(i)$. Then,

Theorem 3.5: Z is nonrecoverable if and only if h is well-defined and order-preserving.

Proof. Suppose Z is nonrecoverable. Then, by the above lemma, $\pi_c \le \pi_f$. In other words, for all i and j in Q, $i \leftrightarrow j$ implies $f(i) = f(j)$ and, hence, $h([i]_c) = f(i)$ is well-defined. Moreover, suppose $[i]_c \to [j]_c$. Then, by (3.32), we have $i \to j$ and, hence,

$$h([i]_c) = f(i) \ge f(j) = h([j]_c),$$

i.e., h is order-preserving.

Conversely, let us suppose that h is well-defined and order-preserving. Then, for all $i,j \in Q$, $i \to j$ implies $[i]_c \to [j]_c$.

Now, since h is well-defined, we have $h([i]_c) = f(i)$ and $h([j]_c) = f(j)$. Applying the order-preserving assumption on h, it then follows that

$$f(i) = h([i]_c) \ge h([j]_c) = f(j),$$

i.e., Z is nonrecoverable.

Corollary: If Z is nonrecoverable, then the following diagram commutes, where $p \colon Q \to \pi_c$ is the natural mapping $p(i) = [i]_c$:

$$f = h \circ p, \quad \text{i.e.,} \quad Q \xrightarrow{\ p\ } \pi_c \xrightarrow{\ h\ } \hat{Q} \ \text{ agrees with } \ Q \xrightarrow{\ f\ } \hat{Q}.$$

Theorem 3.5 permits us to determine whether an operational model is nonrecoverable by comparing the state diagram of the underlying Markov process with the operational structure. To illustrate, let us consider a base model X with the following state diagram

[state diagram: states 0, 1 and 2, where 1 and 2 communicate with each other and state 0 is reachable from {1,2}]

and initial distribution Pr[X_0 = 2] = 1. Suppose the operational structure is given by f(2) = f(1) = 1 and f(0) = 0. Then

π_c = {{0},{1,2}} = π_f, {1,2} ≥ {0}, and h({1,2}) = 1 > h({0}) = 0.

Thus, by Theorem 3.5, Z is nonrecoverable.

It should also be noted that a Markov process X may induce a nonrecoverable model with respect to one operational structure but a recoverable model with respect to another. For example, using the same base model as above and the operational structure g(2) = 2, g(1) = 1 and g(0) = 0, it is clear that π_c is not finer than π_g. Hence, by the contrapositive of the above lemma, the model Z induced by g is recoverable.

Returning now to the problem of evaluating the performability of S (with respect to the performance variable Y_t as defined by (3.23)), we consider the problem in two cases based on the recoverability of the operational model Z. If the operational model Z is nonrecoverable, Theorem 3.2 shows that the behavior of Z during [0,t] is determined by the state of Z at the time instant t. In particular, we have

Pr[Z_s ≥ q, 0 ≤ s ≤ t] = Pr[Z_t ≥ q] (3.34)

and

Pr[Z_s > q, 0 ≤ s ≤ t] = Pr[Z_t > q]. (3.35)

Hence the performability of S (see (3.24)) can be obtained by evaluating the

finite-dimensional distribution Pr[Z_t = q]. More precisely, we have

perf_S(q) = Pr[Y_t = q]
          = Pr[min{Z_s | 0 ≤ s ≤ t} = q]
          = Pr[Z_s ≥ q, 0 ≤ s ≤ t] − Pr[Z_s > q, 0 ≤ s ≤ t]
          = Pr[Z_t ≥ q] − Pr[Z_t > q]
          = Pr[Z_t = q]. (3.36)

Thus, in this case, evaluating performability (i.e., determining the probability distribution function of Y_t) is tantamount to evaluating the transition function of X, i.e.,

perf_S(q) = Σ_{f(j)=q} Pr[X_t = j] = Σ_{f(j)=q} Σ_{i∈Q} Pr[X_t = j | X_0 = i] · Pr[X_0 = i]. (3.37)

On the other hand, if Z is recoverable, more elaborate solution methods are required, since (3.36) is no longer satisfied. When the operational model is recoverable, the performability perf_S of S can be obtained by calculating the conditional probabilities

m_ij^q(t) = Pr[Y_t = q, X_t = j | X_0 = i] (3.38)

where i,j ∈ Q, q is an operational rate in f(Q), and t ≥ 0. Note that, when t = 0, the minimum operational rate is just the operational rate associated with the initial state; in short,

m_ij^q(0) = 1 if i = j and q = f(i), and m_ij^q(0) = 0 otherwise. (3.39)

Then, by summing over some of the indices of m_ij^q(t), we have

perf_S(q) = Σ_{i,j∈Q} m_ij^q(t) · p_i (3.40)

where p_i = Pr[X_0 = i]. There are several ways of expressing the conditional probabilities m_ij^q(t) in terms of the state transition probabilities of the underlying Markovian base model X [24]. First, let us introduce another stochastic process based on X and the performance variable Y_t, namely

X̃ = {(X_t, Y_t) | t ∈ T}. (3.41)

Then, since

Y_t = min[Y_s, inf{Z_r | s ≤ r ≤ t}] for all s ≤ t,

and since X is a Markov process, it can be shown that X̃ is a Markov process. Clearly, the generator matrix of X̃ can be expressed in terms of the state transition rates of the underlying Markov process X. For all i,j ∈ Q (i ≠ j), let λ_ij denote the transition rate of X from state i to state j. Then the generator matrix of X is the |Q| × |Q| matrix A = [a_ij] where, for all i,j ∈ Q,

a_ij = λ_ij if i ≠ j,
a_ii = −Σ_{j≠i} λ_ij. (3.42)

If, further, we let the generator matrix of X̃ be denoted by Ã = [ã_yz], where y = (k,r) and z = (j,q) belong to Q × f(Q), then

ã_yz = a_kj if either (1) f(k) ≥ r, f(j) ≥ q and r = q, or (2) f(k) ≥ r, f(j) = q and r > q;
ã_yz = 0 otherwise. (3.43)

Accordingly, if we denote the transition functions of X̃ by

p̃_xz(t) = Pr[X̃_t = z | X̃_0 = x],

where x,z ∈ Q × f(Q), then the transition functions of X̃ satisfy the system of differential equations (see [23], pp. 254-255, Theorem 4.5)

(d/dt) p̃_xz(t) = Σ_{y∈Q×f(Q)} p̃_xy(t) · ã_yz. (3.44)

Furthermore, notice that when x = (i, f(i)) and z = (j, q),

p̃_xz(t) = Pr[X̃_t = (j,q) | X̃_0 = (i, f(i))]
        = Pr[X_t = j, Y_t = q | X_0 = i, Y_0 = f(i)]
        = m_ij^q(t).

Hence, by varying the value of y over Q × f(Q) and replacing ã_yz by (3.43), the above system of differential equations (3.44) can be expressed more

conveniently as

(d/dt) m_ij^q(t) = Σ_{f(k)≥q} m_ik^q(t) · a_kj            if f(i) ≥ q and f(j) > q,
(d/dt) m_ij^q(t) = Σ_{r≥q} Σ_{f(k)≥r} m_ik^r(t) · a_kj    if f(i) ≥ q and f(j) = q, (3.45)
(d/dt) m_ij^q(t) = 0                                      otherwise.

It was suggested in [24] that the above equations can also be obtained by relating (3.38) to the notion of taboo probabilities [22].

Another approach to determining the probability distribution of the random variable Y_t is to modify the underlying Markov process X by making some of the states in Q absorbing [24]. More precisely, let us rename and rearrange the operational rates of Z so that they form an increasing sequence {1,2,...}. For each rate q, let B_q be the subset of Q given by

B_q = {i | f(i) < q}

where f: Q → R is the operational structure of Z. We then replace the state space of X by a reduced one in which a single absorbing state b_q replaces the states in B_q, and denote the transformed process by

X^q = {X_t^q | t ∈ T}.

Moreover, let us denote the corresponding generator matrix of X^q by A^q = [a_ij^q]; then, for all i,j ∈ {k | f(k) ≥ q} ∪ {b_q},

a_ij^q = a_ij                  if f(i) ≥ q and f(j) ≥ q,
a_ij^q = Σ_{k∈B_q} a_ik        if i ≠ b_q and j = b_q, (3.46)
a_ij^q = 0                     if i = b_q.

If, further, we let the transition functions of X^q be denoted by

p_ij^q(t) = Pr[X_t^q = j | X_0^q = i] (3.47)

where i,j ∈ {k | f(k) ≥ q} ∪ {b_q}, then the conditional probabilities m_ij^q(t) can be expressed as follows. For all i,j ∈ Q and all rates q, where f(i) ≥ q and f(j) ≥ q,

m_ij^q(t) = Pr[Y_t ≥ q, X_t = j | X_0 = i] − Pr[Y_t ≥ q+1, X_t = j | X_0 = i]
          = Pr[Z_s ≥ q, 0 ≤ s ≤ t, X_t = j | X_0 = i] − Pr[Z_s ≥ q+1, 0 ≤ s ≤ t, X_t = j | X_0 = i]
          = Pr[X_t^q = j | X_0^q = i] − Pr[X_t^{q+1} = j | X_0^{q+1} = i]
          = p_ij^q(t) − p_ij^{q+1}(t). (3.48)

Hence, if we solve the transition probabilities p_ij^q(t) for all q, the performability of S can be computed by

perf_S(q) = Σ_{f(i)≥q} [1 − p_{i,b_q}^q(t)] · p_i − Σ_{f(i)≥q+1} [1 − p_{i,b_{q+1}}^{q+1}(t)] · p_i (3.49)

where p_i = Pr[X_0 = i] are the initial probabilities of X.

When the total system is modeled by a recoverable operational model, either (3.45) or (3.49) can be used to evaluate the performability of the system. The solution method described in (3.45) requires evaluating a large system of differential equations. On the other hand, the solution method described in (3.49) decomposes the system of differential equations of (3.45) into smaller subsystems. Thus, evaluating (3.49) amounts to a step-by-step

iterative solution of (3.45). Finally, we also note that, in addition to the applications illustrated in the following example and the next chapter, (3.45) and (3.49) can be used to compute the "intraphase transition probabilities" considered in the context of phased models (see Section 5.4).
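To make the absorbing-state method of (3.46)-(3.49) concrete, the following is a minimal sketch (our own illustration; the function names and state encoding are assumptions, not part of the original text) that computes perf_S(q) for a finite recoverable model by building each reduced generator A^q, evaluating its transition matrix as the matrix exponential P^q(t) = e^{A^q t}, and forming the differences in (3.49). Operational rates are assumed to have been renamed to the integers 0, 1, ..., max f, as described above.

import numpy as np
from scipy.linalg import expm

def performability(A, f, p0, t):
    """perf_S(q) for q = 0,...,max(f), via (3.46)-(3.49).
    A: generator matrix of the base model X; f: integer operational rate of
    each state; p0: initial distribution; t: length of the utilization period."""
    A, f, p0 = np.asarray(A, float), np.asarray(f), np.asarray(p0, float)
    fmax = int(f.max())
    def surv(q):                                  # Pr[Y_t >= q]
        keep = np.where(f >= q)[0]                # states with f(i) >= q
        if keep.size == 0:
            return 0.0
        n = keep.size
        Aq = np.zeros((n + 1, n + 1))             # last index plays the role of b_q
        Aq[:n, :n] = A[np.ix_(keep, keep)]
        # a_{i,b_q} = sum over B_q of a_ik; rows of A sum to zero, so this is
        # minus the row sum over the kept columns
        Aq[:n, n] = -Aq[:n, :n].sum(axis=1)
        Pq = expm(Aq * t)                         # P^q(t) = e^{A^q t}, as in (3.47)
        return float(sum(p0[i] * (1.0 - Pq[r, n]) for r, i in enumerate(keep)))
    S = [surv(q) for q in range(fmax + 2)]        # S[q] = Pr[Y_t >= q]
    return {q: S[q] - S[q + 1] for q in range(fmax + 1)}   # differences, per (3.49)

Each call to surv solves one of the smaller subsystems into which (3.49) decomposes (3.45), which is precisely the step-by-step character noted above.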

3.4 Performability of a Triplicated Fault-Tolerant Computing System

To illustrate the solution methods for the evaluation of system performabilities, let us consider a degradable fault-tolerant computing system wherein resources are triplicated and voted (triple modular redundancy). This type of resource redundancy has been employed in a variety of hardware architectures (see [25] for an example of current usage) and, in a more general form (N-modular redundancy), has been investigated as a means for implementing fault-tolerant software (see [26], for example). Although such a system will generally possess a number of triplicated resources (e.g., the various "triads" of the FTMP architecture [25]), let us restrict our attention to a single resource, say, a triplicated processor consisting of three identical processor modules and a voter.

With respect to hardware faults, we assume that the processor modules fail independently and that each fails permanently with a constant failure rate λ (failures/hr.). The system's ability to recover from a hardware fault in a processing module is accounted for by a coverage parameter c (see [27]). When the system is free of hardware faults, we further assume that it has some capability of recovering from errors due to design faults in the software. Such errors are presumed to occur at a constant rate σ (errors/hr.). Transitions from the error state, with or without recovery, occur at a constant rate µ (transitions/hr.), which we assume to be much larger than the processor module failure rate λ. The probability of software error recovery, given a

software error, is a constant d (the software error coverage parameter); error recovery thus occurs at a rate dµ. Lack of recovery from a software error is presumed to cause a crash (system failure).

If a processor module becomes faulty and the fault is tolerated via successful voting, the input-output behavior of the system remains the same. With this loss of a processing resource, however, we assume that the system is no longer capable of software error recovery. Hence any further errors, due to a software fault or a second faulty processor module, result in failure of the system.

Under the above assumptions, the system can be conveniently represented by a 4-state Markovian base model, where the states are interpreted as follows:

State    Processor Fault    Software Error
1        No                 No
2        Yes                No
3        No                 Yes
4        System failure

[State-transition-rate diagram omitted; its transition rates are those of the generator matrix A given below.]

Figure 3.1 Markov Model of a TMR System with Software Error Recovery

The state-transition-rate diagram of the model is depicted in Figure 3.1. Note that when the system is attempting recovery from a software error (state 3), there are no transitions representing the occurrence of a hardware fault. This is a consequence of our assumption that µ ≫ λ, in which case such occurrences are negligible.

As for performance, let us suppose the user is interested in three levels of accomplishment: full performance (as would be exhibited by a fault-free version of the system), degraded performance (at least one software error during utilization but successful recovery in each case), and system failure. To obtain an appropriate operational model that can support this view of performance, we see that states 1 and 2 can be identified with one mode of operation while states 3 and 4 must be distinguished. Moreover, because the mode {1,2} is preferred over {3} and {3} is preferred over {4}, we find that the following function will suffice as an operational structure:

i    f(i)
1    1
2    1
3    1/2
4    0

To establish that the performance variable in question can indeed be supported by the functional Z of X (Fig. 3.1) induced by this choice of f, suppose that Y_t is formulated as in (3.23), i.e., Y_t is the minimum operational rate experienced during the utilization period [0,t]. Then it is easily verified that Y_t conveys the desired information, i.e.,

Value of Y_t    Interpretation
1               Full performance
1/2             Degraded performance
0               Failure

Note also that the operational model Z is recoverable in the sense defined in (3.4), due to the error/recovery cycle from rate 1 to 1/2 and then back to rate 1. The performability of the system can thus be evaluated using either of the methods discussed in Section 3.3. To illustrate the solution method described in (3.49), note that the generator matrix of X is

    | −(3λ+σ)    3cλ        σ     (1−c)3λ |
A = |  0        −(2λ+σ)     0     2λ+σ    |
    |  dµ        0         −µ     (1−d)µ  |
    |  0         0          0     0       |

Thus, with respect to each operational rate q = 1, 1/2 or 0, the generator matrices of the corresponding transformed Markov processes are given by (3.46), viz.,

     | −(3λ+σ)    3cλ        σ+(1−c)3λ |
A¹ = |  0        −(2λ+σ)     2λ+σ      |
     |  0         0          0         |

and A^{1/2} = A⁰ = A. Hence, if we denote the transition functions of each transformed process by a matrix

P^q(t) = [p_ij^q(t)]   (i,j ∈ {k | f(k) ≥ q} ∪ {b_q})

where p_ij^q(t) is defined by (3.47), then P^q(t) is determined by A^q uniquely by the well-known formula P^q(t) = e^{A^q t}. In particular, we have

        | e^{−(3λ+σ)t}    3c[e^{−(2λ+σ)t} − e^{−(3λ+σ)t}]    c₁ |
P¹(t) = | 0               e^{−(2λ+σ)t}                        c₂ |
        | 0               0                                   1  |

where

c₁ = 1 − e^{−(3λ+σ)t} − 3c[e^{−(2λ+σ)t} − e^{−(3λ+σ)t}],
c₂ = 1 − e^{−(2λ+σ)t},

and

            | d₁    d₂              d₃    1−(d₁+d₂+d₃)   |
P^{1/2}(t) = | 0     e^{−(2λ+σ)t}    0     1−e^{−(2λ+σ)t} |
            | d₄    d₅              d₆    1−(d₄+d₅+d₆)   |
            | 0     0               0     1              |

where, if we let

z = [9λ² + σ² + µ² + 6λσ − 6λµ − 2σµ + 4dµσ]^{1/2},
x = (3λ + σ + µ + z)/2,
y = (3λ + σ + µ − z)/2,

then

d₁ = [(µ − 3λ − σ + z)/(2z)]·e^{−yt} − [(µ − 3λ − σ − z)/(2z)]·e^{−xt}

d₂ = [3λc(y − µ)/(z(y − 2λ − σ))]·e^{−yt} − [3λc(x − µ)/(z(x − 2λ − σ))]·e^{−xt}
     + [3λc(µ − 2λ − σ)/((x − 2λ − σ)(y − 2λ − σ))]·e^{−(2λ+σ)t}

d₃ = (σ/z)·e^{−yt} − (σ/z)·e^{−xt}

d₄ = (dµ/z)·e^{−yt} − (dµ/z)·e^{−xt}

d₅ = [6λcdµ/((λ + σ − µ + z)z)]·e^{−yt} − [6λcdµ/((λ + σ − µ − z)z)]·e^{−xt}
     + [12λcdµ/((λ + σ − µ + z)(λ + σ − µ − z))]·e^{−(2λ+σ)t}

d₆ = [(3λ − µ + σ + z)/(2z)]·e^{−yt} − [(3λ − µ + σ − z)/(2z)]·e^{−xt}

Accordingly, applying (3.40), the performability of S can be expressed as

perf_S(1) = p₁·(1 − c₁) + p₂·(1 − c₂)

perf_S(1/2) = p₁·(d₁+d₂+d₃) + p₂·e^{−(2λ+σ)t} + p₃·(d₄+d₅+d₆) − perf_S(1)

perf_S(0) = 1 − perf_S(1/2) − perf_S(1) (3.50)

where p_i = Pr[X_0 = i]. By expressing the performability of the system in terms of the above closed-form solution of (3.49), various design tradeoffs can then be investigated by varying the parameter values. To illustrate, let us fix the following base model parameters:

λ = 5×10⁻⁴    c = .99999    µ = 10³    d = .9

and assume that the system initially has all three modules operational, i.e.,

p = [1 0 0 0].

Then, depending on the choice of the software failure rate σ, the

performability of the system exhibits various kinds of relationships between different levels of accomplishment. In particular, Figures 3.2 and 3.3 display the performability of S as a function of t (the duration of the utilization) for σ = 10⁻² and σ = 10⁻³, respectively. In both figures, the performability of S is represented by three curves: I, II and III. Curve I is the probability of fault-free operation throughout the utilization; this probability decreases from 1 to 0 as t goes to infinity. Curve II is the probability that the system suffered performance degradation due to software faults while remaining operational throughout the utilization period. Finally, we note that curve III is the familiar S-shaped function for system unreliability.

When we compare Figure 3.2 with Figure 3.3, we find that substantial performance improvement is obtained by reducing the number of design faults in the software. For example, consider the case when the duration of the utilization is 400 hours. With the reduction in the software failure rate by a factor of 10, the probability of degraded performance and the probability of system failure are reduced, respectively, to 1/2 and 1/3 of their original values; more significantly, the probability of full performance is increased by a factor of 37.

Although the above example has established the feasibility of the hierarchical modeling approach for the evaluation of system performabilities, the operational model constructed in the example is based on a specific state-

transition diagram (Figure 3.1). Thus, to extend the generality of the model, we consider in the next chapter a systematic approach for describing the underlying Markov process together with a general method for determining the operational structures of degradable computing systems. Moreover, we also note that the results developed in this chapter are further extended in Chapter 5 to permit the modeling and evaluation of systems with a time-varying environment.

[Plot of performability versus t (hours), 0 to 1500: curve I (nondegraded performance), curve II (degraded performance) and curve III (failure). Parameter values: λ = 5×10⁻⁴, σ = 10⁻², µ = 10³, c = .99999, d = .9.]

Figure 3.2 Performability of S

[Plot of performability versus t (hours), 0 to 1500: curve I (nondegraded performance), curve II (degraded performance) and curve III (failure). Parameter values: λ = 5×10⁻⁴, σ = 10⁻³, µ = 10³, c = .99999, d = .9.]

Figure 3.3 Performability of S
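As a numerical cross-check on the closed forms in (3.50) and on Figures 3.2 and 3.3, the same performability values can be computed directly from the generators A and A¹ by matrix exponentiation, following (3.49). The sketch below is our own illustration under the parameter values fixed above (with σ = 10⁻², as in Figure 3.2); it assumes SciPy's matrix exponential and is not part of the original solution:

import numpy as np
from scipy.linalg import expm

lam, sig, mu, c, d = 5e-4, 1e-2, 1e3, 0.99999, 0.9

# Generator of the 4-state base model X (states 1,2,3,4 of Figure 3.1)
A = np.array([[-(3*lam+sig), 3*c*lam, sig, (1-c)*3*lam],
              [0.0, -(2*lam+sig), 0.0, 2*lam+sig],
              [d*mu, 0.0, -mu, (1-d)*mu],
              [0.0, 0.0, 0.0, 0.0]])

# Reduced generator A^1: states {1,2} with b_1 absorbing (states 3 and 4 lumped)
A1 = np.array([[-(3*lam+sig), 3*c*lam, sig+(1-c)*3*lam],
               [0.0, -(2*lam+sig), 2*lam+sig],
               [0.0, 0.0, 0.0]])

def perf(t):
    # Pr[Y_t >= 1]: never leave {1,2}; Pr[Y_t >= 1/2]: never reach state 4
    p_ge_1 = 1.0 - expm(A1*t)[0, 2]       # start in state 1 (row 0)
    p_ge_half = 1.0 - expm(A*t)[0, 3]
    return p_ge_1, p_ge_half - p_ge_1, 1.0 - p_ge_half

print(perf(400.0))   # (perf_S(1), perf_S(1/2), perf_S(0)) at t = 400 hours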

CHAPTER 4

MODELING AND EVALUATION OF DEGRADABLE MULTIPROCESSOR SYSTEMS

4.1 Introduction

The design of a distributed multiprocessor system (see [28]-[30], for example) is generally approached in a sequential manner beginning with the identification of the computing system's application. The problem identification phase is followed by a functional breakdown of the application into major subtasks to be performed by the system. Following these phases, the designer then specifies the performance and reliability requirements in terms of the resource requirements for each task, the time relationships between tasks, the executive software overhead for system control, etc. Finally, based on the performance and reliability requirements of the system, alternative hardware and software architectures are considered to optimize cost, performance and other trade-off criteria.

Moreover, in the design of multiprocessor systems for real-time control applications, tasks to be performed are often partitioned into several priority groups, and priority interrupt mechanisms are used to meet the stringent constraints of fast response time (see [29] and [30], for example). Normally,

all tasks are executed iteratively to generate sample-time updates of control variables. However, when the computer's resources decrease due to faults to a point that only some of the tasks can be completed in time, tasks from the higher priority groups are given preferential treatment over tasks from the lower priority groups. Computing systems capable of performance degradation as above are often referred to as gracefully degradable computing systems or simply degradable computing systems.

Performance degradation in degradable multiprocessor systems is typically realized through the following steps [31]:

1) Error detection: Basic techniques for error detection include error detecting/correcting codes, time-out counters, memory protection, majority voting, periodic testing, etc.

2) Fault location and hardware reconfiguration: Once an error is detected, diagnostic programs and testing strategies are used to localize the faulty components. The hardware reconfiguration program is then called upon to establish an operational configuration.

3) Computation recovery: Computation recovery concerns the restoration of a valid system state from which the system can resume its operation. The restoration of a valid system state can be achieved by rollback and retry, the use of traces, program roll-aheads, etc.

4) Software reconfiguration: Fault-tolerant software permits the replacement of suspected software modules with their alternative versions at

run time. Current approaches include N-version programming [26] and the use of recovery blocks [32].

There are many different designs of degradable multiprocessor systems that are potential candidates for real-time control applications. In general, these designs can be characterized in terms of the degree of redundancy built into their basic components (e.g., simplex, duplex and triple-modular redundancy). The characterization is useful because each class of systems can be attributed with certain specific performance and reliability trade-offs.

Replicated components are often used in a degradable multiprocessor system to enhance system reliability. For example, when triple-modular redundancy (TMR) is used, not only can all single faults be tolerated, but procedures for error detection, fault location and system reconfiguration are also simplified considerably. However, since there is no parallel processing within a TMR configuration, the configuration represents a substantial loss of computing power. Systems using triplicated components in their design include C.vmp [33], SIFT [30] and FTMP [25].

When the application of a degradable multiprocessor system requires not so much reliability (i.e., uninterrupted operation throughout the utilization period) as the ability to recover from failures, simplex components are often used to improve the performance/cost ratio of the system (e.g., PRIME [34] and PLURIBUS [29]). Since such systems rely mostly on complicated self-

testing logic for error detection and location, the reliability of these systems is generally lower than that of systems using replicated components. This is because single faults may cause such systems to crash, and the detection latency [35] (the time period between the first error and the first detected error) of these systems is generally longer.

In this chapter, a comprehensive model for degradable multiprocessor systems is presented for studying the trade-offs between systems with different degrees of component redundancy. The model is based on the approach considered in the previous chapter; a Markovian base model is described to represent the physical resources of the system, and priority queuing models are used to determine the operational structure associated with the base model. Since our model supports the evaluation of system performability, it differs substantially from those considered by Borgerson [36] and Losq [37], which stress hardware-oriented measures such as reliability or availability.

4.2 System Model

As compared with existing Markov models for degradable multiprocessor systems (see [36] and [37], for example), the model presented here has the advantages that (i) the partitioning of the system is based on the system's available resources as well as the computational requirements of the user's application, and (ii) the hierarchical representation (i.e., the base model together with the operational structure) permits the formulation and evaluation of user-oriented performance variables.

4.2.1 Base Model for System Resources

Degradable multiprocessor systems typically can be divided into several physical resources, each of which is made up of one or more identical components. These resources generally form a pool shared by tasks to be performed by the system. For example, as described in [25], the FTMP computer includes 15 processors, 9 memory units, 5 busses and 48 bus guardian units to be shared by aircraft functional tasks. Generally, the amount of resources that a system can provide varies from time to time depending on the intrinsic hardware failure rates, the effectiveness of fault tolerance mechanisms and the repair procedures. Hence, if we assume that (i) the occurrences of failures and repairs are independent among different components, (ii) each component has constant failure and repair rates, and (iii) fault tolerance mechanisms are 'memoryless' in the sense that they are determined by the current state of the system, then the resource availability of the system can be represented as a Markov process.

Before attempting to describe a general Markov model for system resources, let us consider first the effects of failures on the system resources. As suggested in [37], two classes of hardware faults can be distinguished according to the characteristics of fault tolerance mechanisms. The first class corresponds to hardware faults that are detected and recovered from as soon as they occur. Hardware faults of the first class are referred to as safe faults because failed components are removed instantaneously from the resource pool

to avoid data contamination or faulty control. The second class of hardware faults, referred to as unsafe faults, contains those that are either latent faults or those that are in the process of fault recovery. Depending on the degree of component redundancy, failure to tolerate unsafe faults may cause the system to crash because of the loss or corruption of important information.

Since the performance of a system is determined by the amount of fault-free resources, the state of the resource model can be defined to be the numbers of safe faults and unsafe faults. More precisely, suppose that the multiprocessor system contains N resources, where the ith resource contains n_i identical components. Suppose further that each component of the ith resource has failure rate λ_i and repair rate µ_i. If we assume that the simultaneous occurrence of more than one event such as a failure or a repair completion has negligible probability, then the system can be represented as a Markov process

X = {X_t | t ∈ T} (4.1)

where, for each t ∈ T, X_t is a random variable taking values in the state set

Q = {(a₁,b₁,a₂,b₂,...,a_N,b_N) | 0 ≤ a_i + b_i ≤ n_i for all 1 ≤ i ≤ N}.

For each state (a₁,b₁,a₂,b₂,...,a_N,b_N) in Q, a_i denotes the number of safe faults of the ith resource and b_i denotes the number of unsafe faults of the ith

resource.

Four types of state transitions can be distinguished: repair, safe fault, unsafe fault and fault recovery (see Table 4.1). On the completion of a repair, the number of safe faults is decreased by one. Once a hardware fault has occurred, either the number of safe faults or the number of unsafe faults is increased by one, depending on whether the fault is safe or unsafe. Finally, the successful recovery of an unsafe fault decreases the number of unsafe faults by one and increases the number of safe faults by one.

To derive the transition rates of the above state transitions, the system's ability to recover from a hardware fault is modeled by a single parameter c, defined to be the conditional probability that the system will be able to recover once a hardware fault has occurred (referred to as the coverage of the system; see Section 3.6). It is further assumed that the probability of a failure being transient is represented by another parameter a. Based on the above assumptions about the failure and repair characteristics of the system, the transition rates of the above Markov process can be expressed as in Table 4.1 (where the parameters are summarized in Table 4.2).

Type of transition    Next state                                  Transition rate
repair                (a₁,b₁,...,a_i−1,b_i,...,a_N,b_N)           a_i·µ_i
safe fault            (a₁,b₁,...,a_i+1,b_i,...,a_N,b_N)           (n_i−a_i−b_i)·(1−a)·c·λ_i
unsafe fault          (a₁,b₁,...,a_i,b_i+1,...,a_N,b_N)           (n_i−a_i−b_i)·(1−c)·λ_i
recovery              (a₁,b₁,...,a_i+1,b_i−1,...,a_N,b_N)         b_i·ν_i

Present state = (a₁,b₁,...,a_i,b_i,...,a_N,b_N)

Table 4.1 State Transition Rates of a General Resource Model
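For a single resource (N = 1), Table 4.1 translates directly into a generator matrix over the states (a,b) with 0 ≤ a + b ≤ n. The following sketch is our own illustrative construction (the function and variable names are assumptions, not from the text; the transient probability a of Table 4.2 is renamed a_tr to avoid clashing with the state component a):

import numpy as np

def resource_generator(n, lam, mu, nu, a_tr, c):
    """Generator for the single-resource model of Table 4.1 (N = 1).
    States are pairs (a, b): a safe faults, b unsafe faults, a + b <= n."""
    states = [(a, b) for a in range(n + 1) for b in range(n + 1 - a)]
    idx = {s: k for k, s in enumerate(states)}
    A = np.zeros((len(states), len(states)))
    for (a, b) in states:
        i, ok = idx[(a, b)], n - a - b               # ok = fault-free components
        if a > 0:                                     # repair, at rate a*mu
            A[i, idx[(a - 1, b)]] += a * mu
        if ok > 0:
            A[i, idx[(a + 1, b)]] += ok * (1 - a_tr) * c * lam   # safe fault
            A[i, idx[(a, b + 1)]] += ok * (1 - c) * lam          # unsafe fault
        if b > 0:                                     # recovery, at rate b*nu
            A[i, idx[(a + 1, b - 1)]] += b * nu
        A[i, i] = -A[i].sum()                         # rows of a generator sum to zero
    return states, A

Such a generator can be handed directly to the solution methods of Chapter 3 (e.g., the absorbing-state sketch following Section 3.3) once an operational structure over the states has been fixed.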

When the system contains only one type of resource (i.e., N = 1), the model can be represented more conveniently using transition graphs, as in Figure 4.1. This single resource model, as a special case of the above model, is the same as the one considered by Losq in [37], except that states are named differently here to simplify the formulation of operational rates.

Another special case of the above general model is obtained when all unsafe faults result in system failure. In this case, since unsafe faults cause system resources to vanish, the state of the system can be taken to be the number of fault-free components at each moment in time. In other words, the state set Q defined in (4.1) can be represented as

Q = {(a₁,a₂,...,a_N) | 0 ≤ a_i ≤ n_i for all 1 ≤ i ≤ N} (4.2)

where, for each (a₁,a₂,...,a_N) in Q, a_i is the number of fault-free components of the ith resource. The corresponding simplification of the transition rates is described in Table 4.3. Again, as a special case, the single resource model can be represented by a transition graph, as in Figure 4.2.

Parameter    Interpretation
λ_i          Component failure rate of the ith resource
µ_i          Component repair rate of the ith resource
ν_i          Component recovery rate of the ith resource following unsafe faults
a            Probability of a fault being transient
c            Probability of a fault being safe

Table 4.2 Base Model Parameters

[Transition graph omitted; the states and rates are those of Table 4.1 with N = 1.]

Figure 4.1 Single Resource Model for Systems with Safe Faults and Unsafe Faults

Type of transition    Next state                       Transition rate
repair                (a₁,...,a_i+1,...,a_N)           (n_i−a_i)·µ_i
hardware fault        (a₁,...,a_i−1,...,a_N)           a_i·(1−a)·c·λ_i
failure               (a₁,...,0,...,a_N)               a_i·(1−c)·λ_i

Present state = (a₁,...,a_i,...,a_N)

Table 4.3 Transition Rates for Systems with Instantaneous Detections and Recoveries

[Transition graph omitted: states n, n−1, ..., 1, 0, with covered-fault transitions i → i−1 at rate i(1−a)cλ, failure transitions i → 0 at rate i(1−c)λ, and repairs at rate (n−i)µ, as in Table 4.3.]

Figure 4.2 Single Resource Model for Systems with Instantaneous Detections and Recoveries

4.2.2 Performance Variable

The performance and reliability requirements of a real-time system often differ from those of a general purpose computer because of the stringent constraints of fast response time. Thus, to characterize the performance and the reliability of a real-time system by a single performance variable, the performance variable must take into account both the resource availability and the promptness of the system response.

As an example, let us consider first a typical real-time environment encountered by a control computer. In a study concerning the design of fault-tolerant computers for aircraft, Ratner et al. [39] identified 26 computational tasks most likely to be performed by the control computer of an advanced commercial aircraft. These computational tasks can conceptually be regarded as short programs stored in a common memory of a multiprocessor system, and each task is scheduled to be executed periodically according to a predetermined frequency. However, the actual execution of a task may be delayed from its scheduled execution time because of resource sharing, interfaces between tasks and the overhead of running system software. Since a prolonged starting-time delay may cause dangerous conditions to develop, the computational tasks are grouped into 5 priority classes and priority interrupt mechanisms are used to reduce the delay times of the more critical tasks.

The above example clearly shows that the concept of the starting-time delay is a useful tool for specifying the performance criteria of a real-time

control application: a task is regarded as failing to satisfy the real-time constraint if its starting-time delay, on a regular basis, exceeds a certain predetermined value. More precisely, let d be the estimated length of time that would have to elapse before an undesirable condition is noticed. Then the real-time constraints can typically be stated as: either the average starting-time delay or the percentile starting-time delay must be less than d. For example, a percentile constraint might state that the starting-time delay should not exceed d for 99% of the times when a task is scheduled to be executed. For simplicity, it will be assumed in the following discussion that the average starting-time delay is used to specify the real-time constraints.

To generalize the above notion of the starting-time delay, a user-oriented performance variable can be formulated as follows. First, let us assume that the tasks to be performed by the computer belong to one of a set of K different priority classes. Then, relative to the utilization period T = [0,t], the performance variable can be defined as a random variable Y taking values in the set {0,1,...,K} such that

Y = k iff k is the largest nonnegative integer less than or equal to K such that all tasks from the first k priority groups are executed within the real-time constraints throughout T.

In other words, if we adopt the convention that the priority groups are numbered in reverse order (i.e., the smaller the number, the higher the priority), then the performance variable Y can be regarded as the degree of user satisfaction relative to how well the more critical tasks are executed by the computer to satisfy the real-time constraints. In particular, Y = 0 can be interpreted as a system crash, since, in this case, the computer does not even have enough resources to meet the demand of the most critical tasks (i.e., those from priority group 1). On the other hand, Y = K represents nondegraded performance, since all tasks are executed properly within their real-time constraints.

4.2.3 Operational Structure

To ease the evaluation of system performability, the connection between the performance variable Y and the base model X can be established more easily by introducing intermediate models to account for the internal structure of the computer. Under the assumptions described in Sections 4.2.1 and 4.2.2, a natural representation of the system's behavior at the operational

level is the behavior of the computer's control programs. Typically, to resolve conflicting demands on system resources, the control software of a multiprocessor system must provide a scheduler (also called a supervisor or executive) that allocates system resources to application tasks and handles interfaces between tasks. When the computer is degradable, the control software must also provide a mechanism (called a reconfiguration mechanism) which, upon detection of faults, appropriately changes the scheduler to facilitate error recovery. In other words, given a general resource model as described in Figure 4.1, it is possible to associate each state of the model with a scheduler. Accordingly, depending on how well the application tasks are performed within each resource state according to a given scheduler, various operational rates can be identified to reflect different degrees of user satisfaction.

To measure the effectiveness of the scheduling algorithms associated with the resource states, it is assumed that each scheduler is modeled by a resource-sharing priority queuing model (see [21], for example). For each resource state of a resource model, it is further assumed that arriving tasks form a single queue according to the "head-of-the-line" (HOL) queuing discipline (see Figure 4.3); that is, an arrival from priority class k joins the queue behind all "customers" from priority class k (and higher) and in front of all "customers" from priority class k+1 (and lower). Moreover, the value of

[Figure omitted: a single queue in which a new arrival is inserted behind its own priority class.]

Figure 4.3 Head-of-the-Line Priority Queue

a customer's priority remains constant in time. Thus, while the customers with the highest priorities are selected for service ahead of those with lower priorities, customers from the same priority class are served on a first-come, first-served (FCFS) basis. Finally, we also note that two possible refinements of the priority mechanism can be distinguished depending on whether the execution of a low-priority task is interrupted when a task of higher priority arrives.

Since we are concerned with the ability of a system to satisfy the real-time constraints, the effectiveness of each scheduler can be measured in terms of the 'expected waiting times' of the corresponding queuing model. More precisely, for each state q of a general resource model and for each priority class k, let τ_k^q be a random variable denoting the time spent waiting in the queue by a priority k 'customer' with respect to the queuing model of resource state q (see [40], p. 189). Then the expected value of τ_k^q is a close approximation of the average starting-time delay when (i) the communication delays between tasks are negligible, and (ii) the transition rates of the given resource model are much lower than the arrival and service rates of the computational tasks. The first condition can typically be satisfied by treating each task as an atomic unit for resource allocation. The second condition is usually satisfied automatically because failures and repairs of a computer occur much less frequently than the iteration rates of the computational tasks (thus the expected waiting times rapidly approach steady state once the system enters a particular state).

The above relationship between the average starting-time delays and the expected waiting times provides us with a basis for partitioning the resource states into operational modes. First, we note that the ability of a resource state q to satisfy the real-time constraints of a priority k task can be expressed as

E[τ_k^q] < d (4.4)

where d is the predetermined time length described in the previous subsection. The use of average starting-time delays to specify the real-time constraints is reasonable because, in general, the effect of a starting-time delay on the behavior of the system is proportional to the duration of the delay.

In addition to the real-time constraints, the partitioning of system states into operational modes must also take into account the effects of unsafe faults. For example, consider a multiprocessor system with triplicated components. The occurrence of an undetected double fault in a triad will cause the system to fail regardless of the duration of starting-time delays. Accordingly, if we assume that the probabilistic nature of the system resources is specified as a Markov process X with state space Q (see (4.1)), then an operational structure can be given as a function

f: Q → {0,...,K}

where, for each q ∈ Q,

f(q) = k if unsafe faults in q are tolerated and, for some 1 ≤ k ≤ K, E[τ_k^q] < d and E[τ_{k+1}^q] ≥ d (with the convention that E[τ_{K+1}^q] = ∞); (4.5)
f(q) = 0 otherwise.

Note that, in the above formulation, we have assumed that tasks of higher priorities have shorter expected waiting times, i.e., k < k' implies E[τ_k^q] ≤ E[τ_{k'}^q]. The assumption is satisfied when the scheduling algorithm (associated with q) uses the usual priority queuing disciplines in resource allocation. Having established the operational structure, the performance variable described in Section 4.2.2 can now be expressed as

Y = min{f(X_t) | t ∈ T}.

Finally, to conclude the construction of the operational model, we note that the analysis of a priority queuing system is generally more difficult than that of a nonpriority system. In particular, for the multiple channel (multiple server) case, it is usually required to assume no service-time

distinctions between priorities, or else the mathematics becomes intractable. However, as illustrated in the following section, even under these stringent conditions the scheduling algorithms of a single resource model can still be modeled satisfactorily using a priority M/M/m queue (see [40], pp. 193-194). For multiple resource models, the scheduler associated with each resource state can be modeled by an open or a closed queuing network (also called a network of queues; see [4] and [21], for example). In addition to the basic assumptions governing the arrival and service rates of each "service station" (see [4], pp. 161-163), it is also required to assume that the priority queuing discipline at each service station is work-conserving, in the sense that priority interrupts do not impose extra work on the server. Solutions for the expected waiting times at each service station can be obtained by incorporating solutions for priority M/M/m queues in the multiple resource models.

4.3 Evaluation of Two Degradable Multiprocessor Systems

In this section, the performances of two degradable multiprocessor systems are evaluated and compared to determine the tradeoffs between two design approaches. Both of the computers to be evaluated are assumed to be multiprocessor systems containing 4 identical processors and a common memory module. They differ in the way that the system resources are allocated to perform computations. One computer (S1) is assumed to operate in simplex mode, i.e., the processors are allowed to operate independently. The other

computer (S2) is assumed to operate in duplex mode, i.e., the processors are paired into duplex subsystems to enhance system reliability. While the simplex mode of operation generally can improve the performance/cost ratio of the system, the duplex mode of operation provides better detection and fault coverage relative to the simplex mode of operation.

With respect to hardware faults, we assume that the processor modules fail independently and permanently with a constant failure rate λ₂ (failures per hour). The memory module is assumed to have a constant failure rate λ₁ and to fail independently of the other subsystems. We further assume that the system's ability to recover from a failure is accounted for by the coverage factors c₁ and c₂, respectively, for S1 and S2. Since simplex subsystems rely on error detecting codes and self-checking logic for error detection and location, the coverage of S1 is generally lower than that of S2. This is because, in the simplex mode of operation, undetected single faults may cause the system to crash, and the detection latency (i.e., the time duration between the first error and the first detected error) of a simplex subsystem is generally longer than that of a replicated subsystem. Accordingly, it will be assumed that

c₁ ≤ c₂. (4.6)

Under the above assumptions, both S1 and S2 can be represented as

a Markov process according to the general resource model described in (4.1). However, since neither the simplex mode of operation nor the duplex mode of operation can tolerate unsafe faults, the model can be reduced to the special case whose state space is given by (4.2). In particular, each of S1 and S2 can be conveniently represented by a 10-state Markovian base model X_i (i = 1 or 2), where the state space Q_i (i = 1 or 2) can be represented as

Q_i = { (k,j) | k = 0,1 and j = 0,1,2,3,4 }. (4.7)

For each state (k,j) in Q_i, k denotes the operational status of the shared memory module (0 = failed and 1 = working) and j denotes the number of working processor units. Thus, for instance, the fault-free configuration is encoded as state (1,4). The state-transition diagram of the model is depicted in Figure 4.4.

[State-transition diagram omitted: covered processor failures move (1,j) to (1,j−1) at rate j·c_i·λ₂, while memory failures and uncovered faults lead to the failed states (0,j).]

Figure 4.4 State-Transition Diagram of the Base Model for S1 and S2

To provide a useful comparison between S1 and S2, their performance must be evaluated with respect to the same work environment. In this regard, let us consider a typical real-time control application whose control computer uses priority interrupt mechanisms to meet the stringent constraints of fast response time. For simplicity, let us assume that each computational task of the system is assigned a priority of 1 or 2, denoting, respectively, high or low priority. Normally, all tasks are executed iteratively to generate sample-time updates of control variables. However, when the computer's resources decrease, because of failures, to a point that only some of the tasks can be completed in time, tasks from the high-priority group are given preferential treatment over tasks from the low-priority group. Based on the above assumptions about the work environment, a user-oriented performance variable can then be formulated as

Y_i = 2 if the system is operating as prescribed,
Y_i = 1 if only high priority tasks meet the average response time requirement, (4.8)
Y_i = 0 if high priority tasks cannot meet the average response time requirement.

In other words, the performance of S_i conveys the following information:

Value of Y_i    Interpretation
2               normal
1               degraded
0               failure

On closer examination of the relationship between Y_i and X_i, we find it necessary to introduce an intermediate model between them, because X_i is not detailed enough to support the user's view of system performance Y_i. One way to introduce such an intermediate model is to identify an operational model by taking into account the workload environment of the computer. In this regard, let us assume that tasks from each priority group i (i = 1 or 2) arrive in a Poisson stream at 1 task per millisecond. We also assume that each task, regardless of its priority, requires a service time exponentially distributed with mean service time 1/3 milliseconds. Furthermore, we assume that the HOL queuing discipline is used, but there is no preemption [5]. Then, if the behavior of the system in a state q is modeled as a queuing system with m servers, the average starting-time delay of tasks from priority group i can be approximated by the expected waiting time of the M/M/m priority queue.

As for the operational structure, the states of the base model X_i can now be partitioned into three operational modes according to their average

starting-time delay. In particular, suppose that the average starting-time delays of both priority groups shall not exceed d milliseconds. Then the operational structure of S_i (i = 1 or 2) can be expressed as a function f_i: Q_i → R such that, for each q ∈ Q_i,

f_i(q) = 2 if E[τ₁^q] < d and E[τ₂^q] < d,
f_i(q) = 1 if E[τ₁^q] < d and E[τ₂^q] ≥ d, (4.9)
f_i(q) = 0 otherwise.

If d = 1/2 millisecond, the operational structures of S1 and S2 can be tabulated as follows:

State q of S_i    Operational rate    Operational rate
(i = 1 or 2)      f₁(q) of S1         f₂(q) of S2
(1,4)             2                   2
(1,3)             2                   2
(1,2)             2                   1
(1,1)             1                   1
(1,0)             0                   0
(0,4)             0                   0
(0,3)             0                   0
(0,2)             0                   0
(0,1)             0                   0
(0,0)             0                   0

The operational rate assignments in the table are determined by the degree of subsystem redundancy. For instance, suppose that the system is in state (1,2), i.e., the memory module is operational and two processors are working. Then, since in the duplex mode of operation both processors must perform the same function in parallel, the behavior of S2 in state (1,2) can be modeled as an M/M/1 priority queue. Thus, using the existing formulas for computing the expected queuing times (see [40], for example), the average starting-time delays can be approximated by

T₁^q = 1/3 milliseconds and T₂^q = 1 millisecond

where q = (1,2). Since, in this case, T₁^q < d and T₂^q > d, we have f₂(q) = 1 by equation (4.9). On the other hand, since the behavior of S1 in state (1,2) can be modeled as an M/M/2 priority queue, the average starting-time delays can be estimated as

T₁^q = 1/30 milliseconds and T₂^q = 1/20 milliseconds

where q = (1,2). Accordingly, it follows by (4.9) that f₁(q) = 2.
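These delay figures can be reproduced with the standard result for the non-preemptive (HOL) M/M/m priority queue with a common exponential service rate across classes: E[τ_k] = C(m,a) / (mµ(1−s_{k−1})(1−s_k)), where a = λ/µ is the offered load, C(m,a) is the Erlang-C probability of waiting, and s_k is the load of classes 1,...,k normalized by mµ. The following is a minimal sketch (the helper names are ours, not from the text):

import math

def erlang_c(m, a):
    """Erlang-C probability that an arrival must wait (M/M/m, offered load a)."""
    b = 1.0
    for k in range(1, m + 1):            # Erlang-B recursion
        b = a * b / (k + a * b)
    rho = a / m
    return b / (1.0 - rho + rho * b)

def hol_wait(lams, mu, m):
    """Expected waits E[tau_k], non-preemptive HOL M/M/m priority queue,
    identical exponential service (rate mu) for all classes, in priority order."""
    a = sum(lams) / mu
    w0 = erlang_c(m, a) / (m * mu)       # mean delay due to work found in system
    waits, s_prev = [], 0.0
    for lam in lams:
        s_k = s_prev + lam / (m * mu)
        waits.append(w0 / ((1.0 - s_prev) * (1.0 - s_k)))
        s_prev = s_k
    return waits

# State (1,2): lambda_1 = lambda_2 = 1 task/ms, mean service 1/3 ms (mu = 3)
print(hol_wait([1.0, 1.0], 3.0, 2))   # S1, M/M/2: [1/30, 1/20]
print(hol_wait([1.0, 1.0], 3.0, 1))   # S2, M/M/1: [1/3, 1]

Both printed pairs agree with the starting-time delays quoted above for state (1,2).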

Given the above performance variables Y_i and the operational structures f_i, it can easily be verified that, for each X_i (i = 1 or 2),

Y_i = min{ f_i(X_t) | t ∈ T }

where T is the utilization period. Thus the solution methods described in the previous chapter can now be used to solve for the system performability (i.e., the probability distribution function of Y_i). Suppose that the system is initially fault-free, i.e., Pr[X_0 = (1,4)] = 1. Then, relative to the utilization period T = [0,t], the performability of S1 is given by

perf₁(2) = Pr[ Y₁ = 2 ]
         = e^{−(4λ₂+λ₁)t} + 4c₁[e^{−(3λ₂+λ₁)t} − e^{−(4λ₂+λ₁)t}]
           + 6c₁²[e^{−(2λ₂+λ₁)t} − 2e^{−(3λ₂+λ₁)t} + e^{−(4λ₂+λ₁)t}],

perf₁(1) = Pr[ Y₁ = 1 ]
         = 4c₁³[e^{−(λ₂+λ₁)t} − 3e^{−(2λ₂+λ₁)t} + 3e^{−(3λ₂+λ₁)t} − e^{−(4λ₂+λ₁)t}],

and

perf₁(0) = Pr[ Y₁ = 0 ] = 1 − perf₁(2) − perf₁(1).

Similarly, the performability of S2 is given by

perf₂(2) = Pr[ Y₂ = 2 ] = e^{−(4λ₂+λ₁)t} + 4c₂[e^{−(3λ₂+λ₁)t} − e^{−(4λ₂+λ₁)t}],

perf₂(1) = Pr[ Y₂ = 1 ]
         = 4c₂³e^{−(λ₂+λ₁)t} + 6(c₂² − 2c₂³)e^{−(2λ₂+λ₁)t}
           + 12(c₂³ − c₂²)e^{−(3λ₂+λ₁)t} + (6c₂² − 4c₂³)e^{−(4λ₂+λ₁)t},

and

perf₂(0) = Pr[ Y₂ = 0 ] = 1 − perf₂(2) − perf₂(1).

When expressed as functions of the duration of the utilization, the above equations can be plotted, respectively, as in Figures 4.5 and 4.6 for the indicated parameter values of S1 and S2. When we compare Figure 4.5 with Figure 4.6, S2 clearly results in better system reliability than S1 in the sense that perf₂(0) < perf₁(0).

Moreover, S2 also has a much higher probability of degraded performance, even though its probability of nondegraded performance is somewhat lower than that of S1. In general, we may conclude that, if the system's ability to recover from faults is low, then the performance of the system can be made to degrade more gracefully by allowing the subsystems to operate in duplex mode.
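The comparison can be reproduced numerically by evaluating the closed forms above at the parameter values indicated in Figures 4.5 and 4.6 (λ₁ = 0.0001, λ₂ = 0.001, c₁ = 0.9, c₂ = 0.999). A minimal sketch:

import math

l1, l2, c1, c2 = 1e-4, 1e-3, 0.9, 0.999

def perf1(t):
    e = [math.exp(-(k*l2 + l1)*t) for k in range(5)]   # e[k] = exp(-(k*l2 + l1)*t)
    p2 = e[4] + 4*c1*(e[3] - e[4]) + 6*c1**2*(e[2] - 2*e[3] + e[4])
    p1 = 4*c1**3*(e[1] - 3*e[2] + 3*e[3] - e[4])
    return p2, p1, 1.0 - p2 - p1

def perf2(t):
    e = [math.exp(-(k*l2 + l1)*t) for k in range(5)]
    p2 = e[4] + 4*c2*(e[3] - e[4])
    p1 = (4*c2**3*e[1] + 6*(c2**2 - 2*c2**3)*e[2]
          + 12*(c2**3 - c2**2)*e[3] + (6*c2**2 - 4*c2**3)*e[4])
    return p2, p1, 1.0 - p2 - p1

for t in (500.0, 1000.0, 2000.0):
    print(t, perf1(t), perf2(t))   # (normal, degraded, failure) for S1 and S2

At each of these mission times the printed failure probability of S2 is below that of S1, and its degraded-performance probability is substantially higher, consistent with the qualitative comparison drawn from the figures.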

[Plot of performability versus t (hours), 0 to 2000: Normal (Pr[Y₁ = 2]), Degraded (Pr[Y₁ = 1]) and Failure (Pr[Y₁ = 0]) curves. Parameter values: λ₁ = 0.0001, λ₂ = 0.001, c₁ = 0.9.]

Figure 4.5 Performability of S1

[Plot of performability versus t (hours), 0 to 2000: Normal (Pr[Y₂ = 2]), Degraded (Pr[Y₂ = 1]) and Failure (Pr[Y₂ = 0]) curves. Parameter values: λ₁ = 0.0001, λ₂ = 0.001, c₂ = 0.999.]

Figure 4.6 Performability of S2

CHAPTER 5

PHASED MODELS

5.1 Phased-Missions

5.1.1 Introduction

The performability models considered in the previous chapters assume that the environment of the system is invariant in time, in the sense that the underlying processes are time-homogeneous and the operational structures of the system remain the same throughout the utilization period. Although this assumption is appropriate for certain applications, there are many cases where the user's demands on the computing system can change appreciably during different phases of its utilization. This is particularly true for real-time control applications in which the computing system is required to execute different sets of computational tasks during different phases of a control process.

One approach to dealing with a time-varying environment is to decompose the system's utilization period into consecutive time periods (usually referred to as a decomposition of the system's "mission" into "phases"; see [42]-[45]). Demands on the system are then allowed to vary from phase to phase; within a given phase, however, they are assumed to be time-invariant.

This permits intraphase behaviors to be evaluated in terms of conventional time-homogeneous models, but raises the interesting question of how the intraphase results are to be combined. This is the essential question addressed in investigations of phased-mission reliability evaluation methods (e.g., [42]-[45]), where the problem has been constrained as follows. It is assumed, first, that a success criterion (formulated, say, by a structure function; see [18] for example) can be established for each phase, where the criterion is independent of what occurs during other phases. It is required further that successful performance of the system be identified with success during all phases; that is, the system performs successfully if and only if, for each phase, the corresponding success criterion is satisfied throughout that phase.

Although the above constraints are reasonable for certain types of systems, they exclude systems where successful performance involves nontrivial interactions among the phases of the mission. In more exact terms, it has been shown (see [46], Theorem 6) that "structure-based" formulations of success are possible if and only if the phases are functionally independent in a precisely defined manner. What we wish to do, therefore, is to examine the utility of phased-mission evaluation methods in a less restrictive context. In addition to removing the above constraints, we extend the domain of application to include the evaluation of computing system performability. Moreover, by representing intraphase models in terms of operational models, we are able to obtain useful results even without the typical no-repair

assumption of the traditional phased-mission reliability methods. Finally, unlike the models used in phased-mission reliability evaluation methods, we permit the intraphase models to differ from phase to phase. Thus, the modeling of a particular phase can be tailored not only to the computational demands of each phase but also to the relevant properties of the total system that influence performance during the phase.

5.1.2 Formulation

Intuitively, phased missions are real-time control processes whose utilization period can be decomposed into phases. During each phase, the system is required to execute a predetermined set of computational tasks. A typical example of a phased mission is an 'unmanned space mission' during which the spacecraft's on-board computer must complete different phases of the mission. The analysis of such a system is usually complicated because of the time-varying nature of the system's performance criteria.

To generalize the notion of a phased mission in the context of performability modeling, we assume that (i) the behavior of each single phase can be characterized by a single performance variable regardless of the interactions among phases, and (ii) interactions among phases can be characterized without reference to the detailed behavior of each phase. More precisely, let us suppose that the utilization period T is the continuous interval T = [0,h]. Suppose further that T is divided into a finite number of consecutive phases (time intervals)

T₁ = [t₀,t₁], T₂ = [t₁,t₂], ..., T_m = [t_{m−1},t_m]

where 0 = t₀ < t₁ < ··· < t_m = h. During each phase T_k, we assume that the system can be modeled by a performability model (X^k, γ_k) where

X^k = {X_t^k | t ∈ T_k}

is a continuous-time stochastic process such that, for each t in T_k, X_t^k is a random variable taking values in the phase k state space Q_k, and

γ_k: U_k → A_k

is a function that maps the phase k trajectory space U_k to the phase k accomplishment set A_k. X^k is referred to as the intraphase process (of phase k) and γ_k is called the phase k capability function.

When each phase can be represented by an intraphase performability model, a performability model (X,γ) of S is referred to as a phased model if it can be constructed by the following steps:

(i) X = ∪_{k=1}^m X^k = ∪_{k=1}^m {X_t^k | t ∈ T_k}; (5.1)

(ii) there exists a function (referred to as an organizing structure)

Φ: A₁ × A₂ × ··· × A_m → A (5.2)

such that, for all u ∈ U,

γ(u) = Φ(γ₁(u₁),...,γ_m(u_m))

where u_k is the restriction of u to T_k (k = 1,2,...,m).

On examining X, we see that it is similar to a base model except that, for each time instant t_k (1 ≤ k ≤ m−1), the state of the system is represented by two random variables X_{t_k}^k and X_{t_k}^{k+1}, whose values, respectively, are the final state of the kth phase and the initial state of the (k+1)th phase. Since we permit the state sets of the intraphase models to differ from phase to phase, X_{t_k}^k and X_{t_k}^{k+1} can also be different. However, if we consider an augmented utilization period

T̃ = T ∪ {t_k' | k = 1,2,...,m−1}

(where t_k' can be interpreted as the initial time of phase k+1), then X can be expressed as X = {X_t | t ∈ T̃} where

X_t = X_t^k            if t ∈ T_k,
X_t = X_{t_k}^{k+1}    if t = t_k'. (5.3)

If, further, we regard the state space of X as the union

Q = ∪_{k=1}^m Q_k,

then X is a base model in the sense defined in Chapter 2. When X is so constructed from intraphase processes, we will refer to it as a phased base model.

Generally, there are two types of dependencies among phases that can affect the performability evaluation of phased models. The first type, encountered when computing intraphase performabilities, is caused by statistical dependencies among phases. For example, if we assume that the kth intraphase process X^k is a Markov process with a given transition probability function, then the performability of the system during phase k can be computed once the initial distribution of X^k is known. However, in general, the initial distribution of X^k is determined by that of the first phase together with the behavior of the previous phases in realizing a specific level of performance of the total system.
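To illustrate this first type of dependence, the sketch below propagates a state distribution through a sequence of Markovian intraphase processes. The interphase mapping H_k (here, a stochastic matrix from Q_k to Q_{k+1}) and all names are illustrative assumptions of ours, not constructs defined in the text; the point is simply that the initial distribution of phase k+1 is the final distribution of phase k carried across the phase boundary:

import numpy as np
from scipy.linalg import expm

def propagate(p0, phases):
    """p0: initial distribution over Q_1.
    phases: list of (A_k, d_k, H_k) with generator A_k, phase duration d_k,
    and H_k a |Q_k| x |Q_{k+1}| stochastic matrix (H_k = None for the last phase)."""
    p = np.asarray(p0, float)
    for A, dur, H in phases:
        p = p @ expm(np.asarray(A, float) * dur)   # intraphase evolution over T_k
        if H is not None:
            p = p @ np.asarray(H, float)           # carry final state into Q_{k+1}
    return p                                       # distribution at mission time h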

The second type of dependence occurs when combining intraphase performabilities to determine system performability. This type of dependence is determined by the algebraic relationships among phases, i.e., those relationships which do not involve probability concepts. In other words, the relationship is analogous to structure functions, which are concerned with the structural representation of multicomponent systems (see [47], for example). Clearly, a complete analysis of phased models will require a detailed knowledge of both types of dependencies and their effects on the performability of the system.

In the following section, we first study the above algebraic relationships among phases via an extended definition of structure functions. In Section 5.3, we then consider the probabilistic aspects of the dependencies among phases. In both cases, we only assume that the behavior of the system during each phase can be summarized by a performance variable Y_k defined by the phase k performability model (X^k, γ_k). Finally, in Section 5.4, computational methods and formulas are derived for the case when Y_k can be defined as the minimum value assumed by a functional of X^k.

5.2 Structural Properties of Phased Models

In system theory, the structure of a system is generally taken to be the interactions among subsystems in performing certain specific tasks. The interaction may involve the physical interconnection of the subsystems or, more generally, "functional dependence" among subsystems [46]. Thus, if we

regard the phases of a phased model as subsystems, then the structure of a phased model can effectively be regarded as the interaction among phases in the realization of various degrees of system performance.

Since the relationship among phases can be represented as an organizing structure, and since the accomplishment sets of the phases are totally ordered sets (see (2.2)), there is a natural connection between the structure of a phased model and the mappings of partially ordered sets. More precisely, for each phase k, let us denote the phase k performance variable as

Y_k: Ω → A_k

defined by the phase k performability model (X^k, γ_k) according to (2.12). Then, by (5.2), the performance variable of the total system S can be represented as a random variable Y: Ω → A where, for each ω ∈ Ω,

Y(ω) = Φ(Y₁(ω),...,Y_m(ω)).

Thus, the structural relationship between Y₁, Y₂, ..., Y_m can be characterized in terms of the properties of the mapping Φ: A₁ × ··· × A_m → A. The product of

sets A₁ × ··· × A_m is a partially ordered set because, by extending the ordering relations of the individual phases, an ordering relation for A₁ × ··· × A_m can be defined as follows: for all (a₁,a₂,...,a_m) and (b₁,b₂,...,b_m) in A₁ × ··· × A_m,

(a₁,a₂,...,a_m) ≤ (b₁,b₂,...,b_m) iff a_k ≤ b_k for all k = 1,2,...,m. (5.4)

Our interest will be restricted to the case when Φ: A₁ × ··· × A_m → A is order-preserving in the sense that, for all (a₁,a₂,...,a_m) and (b₁,b₂,...,b_m) in A₁ × ··· × A_m,

(a₁,a₂,...,a_m) ≤ (b₁,b₂,...,b_m) implies Φ(a₁,a₂,...,a_m) ≤ Φ(b₁,b₂,...,b_m).

Order-preserving mappings as defined above can be used to characterize systems whose performance does not deteriorate as a result of performance improvements in the subsystems. Thus, the notion of an order-preserving mapping is a proper generalization of the notion of a coherent structure function [18], because the coordinates of A₁ × ··· × A_m are not restricted to binary-valued sets.

Our investigation will be focused on the properties of Φ

- 118 - which permit us to simplify the evaluation of system performability. In particular, given a subset B of A, we wish to consider methods for representing the set -l'(B) without enumerating its elements. The representation methods are important because, as shown in the following discussions, such representations are amenable to iterative methods of evaluation. We note first that the effects of *I on AIX * * XAm is that it imposes an order structure on the equivalence kernal of the order-preserving mapping. First, let us define, for all aeA, C(a) {qlqcAlX ~ * XAm and '(q)>a) (5.5) and D(a) = {qlqAlX * * XAm and I(q)a}. (5.6) In words, D(a) is the set of elements in AX * *.. XAm that result in at least level a accomplishment. Clearly, when P is an order-preserving mapping, for all a,bEA, a < b implies C(a)DC(b) and D(a)DD(b). (5.7) Accordingly, if we let

- 119 - and C ={C(a)laeA} D ={D(a) aEA}, (5.8) (5.9) then, because A is a totally ordered set, it follows that C and D are totally ordered sets with respect to set inclusion. Moreover, if the elements in A are expressed as a sequence -* <al<a2< * <a'< *i* where, for all i, ai+l covers ai in the sense that ai+ >ai and for no acA, ai+l>a>ai, then the corresponding elements in C and D can be expressed in a like manner, i.e., * C- 2( C(a)C(a2)D ~ DC(ai)D - - and.2- D)DDl)DD(a2)D *~* D(ai)_ - - i Hence, if we denote the difference between two sets X and Y by X-Y, then it can easily be shown that

Theorem 5.1: Let A be a totally ordered and countable set and let Φ: A_1 × ··· × A_m → A be an order-preserving mapping. Then, for all a, b ∈ A such that a covers b, and for all q, r ∈ A_1 × ··· × A_m:

(1) Φ(q) = Φ(r) = a if and only if both q and r belong to C(b) − C(a);

(2) Φ(q) = Φ(r) = b if and only if both q and r belong to D(b) − D(a).

The above results imply that, when evaluating system performability based on a phased model, the sets C(a) and D(a), a ∈ A, can be used as building blocks for describing events that characterize system performance. In particular, given an order-preserving mapping Φ: A_1 × ··· × A_m → A and a closed interval B = [a,b] ⊆ A, we show in the following theorems that Φ⁻¹(B) can be specified by D(a) and C(b). The set Φ⁻¹(B) has practical significance in performability modeling because its probability quantifies the ability of S to perform within the specified limits a and b.

First, we note that Cartesian subsets of A_1 × ··· × A_m are amenable to iterative methods of evaluation. A subset V ⊆ A_1 × ··· × A_m is called a Cartesian subset if V = π_1(V) × ··· × π_m(V), where π_k(V) is the projection of V onto its kth coordinate. To illustrate the use of Cartesian sets in the evaluation of performabilities, let us consider the evaluation of the probability of a specific Cartesian subset V. We first let

    B_k = {q ∈ A_1 × ··· × A_m | π_k(q) ∈ π_k(V)}

be the set of elements of A_1 × ··· × A_m that assume values in π_k(V) at the kth coordinate. Then the probability of B_k can be expressed as a one-dimensional distribution of the phase k performance variable Y_k, i.e.,

    Pr[B_k] = Pr[(Y_1, Y_2, ..., Y_m) ∈ B_k] = Pr[Y_k ∈ π_k(V)].

Accordingly, when V is a Cartesian set, V can clearly be represented as the intersection of these elementary sets B_k, i.e.,

    V = π_1(V) × ··· × π_m(V) = ∩_{k=1}^m B_k.

By iteratively applying the definition of conditional probability, then

    Pr[V] = Pr[B_m | ∩_{k=1}^{m-1} B_k] · Pr[B_{m-1} | ∩_{k=1}^{m-2} B_k] ··· Pr[B_2 | B_1] · Pr[B_1]
          = Pr[Y_m ∈ π_m(V) | Y_{m-1} ∈ π_{m-1}(V), ..., Y_1 ∈ π_1(V)] ···
            Pr[Y_2 ∈ π_2(V) | Y_1 ∈ π_1(V)] · Pr[Y_1 ∈ π_1(V)].    (5.10)

Since each term in the product involves only the elementary sets B_k, we show in the following section that Pr[V] can be determined iteratively using matrix multiplications.

An important class of Cartesian subsets of A_1 × ··· × A_m is the class of "intervals." For all q, r ∈ A_1 × ··· × A_m where q = (a_1, ..., a_m), r = (b_1, ..., b_m) and q ≤ r, a closed interval [q,r] is defined to be

    [q,r] = {q' ∈ A_1 × ··· × A_m | q ≤ q' ≤ r}.

It follows that

    [q,r] = π_1([q,r]) × ··· × π_m([q,r]) = [a_1,b_1] × ··· × [a_m,b_m],

hence [q,r] is Cartesian. The open interval (q,r) and the half-open intervals (q,r] and [q,r) can be defined in a like manner. Moreover, if we assume that A_1 × ··· × A_m has greatest and least elements I and O, satisfying O ≤ q and q ≤ I for all q in A_1 × ··· × A_m, then A_1 × ··· × A_m can itself be expressed as the closed interval [O,I].

Given an interval B of A, an important problem then is to find a "representation" of Φ⁻¹(B) in terms of intervals of A_1 × ··· × A_m. The problem is interesting not only because such a representation permits us to evaluate Pr[Φ⁻¹(B)] using the iterative algorithm described above, but also because the representation reduces the number of elements needed to describe the set Φ⁻¹(B). To express Φ⁻¹(B) in terms of intervals, we note first that, since Φ is order-preserving, for all q, r, x ∈ A_1 × ··· × A_m, we have

    q ≤ x ≤ r  implies  Φ(q) ≤ Φ(x) ≤ Φ(r).    (5.11)

Thus, if B is an interval of A and q, r ∈ Φ⁻¹(B), it follows that

    [q,r] ⊆ Φ⁻¹(B).

Moreover, for each a ∈ A, let C(a) and D(a) be the sets defined in (5.5) and (5.6), and denote the set complement of C(a) by ¬C(a). If we define

    M(a) = {q ∈ A_1 × ··· × A_m | q is a maximal element of ¬C(a)}    (5.12)

and

    m(a) = {q ∈ A_1 × ··· × A_m | q is a minimal element of D(a)},    (5.13)

then, applying (5.11), it follows that

Lemma: If Φ: A_1 × ··· × A_m → A is order-preserving, then

    Φ⁻¹([a,b]) ⊇ ∪ {[q,r] | q ∈ m(a), r ∈ M(b)}.    (5.14)

In general, the relation ⊇ in (5.14) cannot be replaced by an equality. However, we find that the equality holds when A_1 × ··· × A_m satisfies the chain conditions [48]. A partially ordered set L is said to satisfy the ascending chain condition when every nonempty subset of L has a maximal element. Similarly, L is said to satisfy the descending chain condition when every nonempty subset of L has a minimal element. L is said to satisfy the chain conditions when it satisfies both chain conditions. When the chain conditions are imposed on the above lemma, we then have the following result:

Theorem 5.2: If Φ: A_1 × ··· × A_m → A is order-preserving and A_1 × ··· × A_m satisfies the chain conditions, then, for all a, b ∈ A,

    Φ⁻¹([a,b]) = ∪ {[q,r] | q ∈ m(a), r ∈ M(b)}.    (5.15)

Proof: We only need to show that

    Φ⁻¹([a,b]) ⊆ ∪ {[q,r] | q ∈ m(a), r ∈ M(b)},

i.e., for all x ∈ A_1 × ··· × A_m, we have to show that a ≤ Φ(x) ≤ b implies q ≤ x ≤ r for some q ∈ m(a) and r ∈ M(b). First, for each fixed x ∈ A_1 × ··· × A_m, let

    K_1 = {y ∈ A_1 × ··· × A_m | Φ(y) ≤ b and x ≤ y}

and

    K_2 = {y ∈ A_1 × ··· × A_m | a ≤ Φ(y) and y ≤ x}.

Then K_1 and K_2 are clearly nonempty since x ∈ K_1 and x ∈ K_2. Moreover, since A_1 × ··· × A_m satisfies the chain conditions, K_1 must satisfy the ascending chain condition and K_2 must satisfy the descending chain condition. Hence K_1 contains a maximal element, say r, and K_2 contains a minimal element, say q, such that q ≤ x ≤ r. Finally, we note that r being a maximal element of K_1 implies r ∈ M(b), and q being a minimal element of K_2 implies q ∈ m(a), i.e., x ∈ [q,r] for some q ∈ m(a) and r ∈ M(b).

Note that, for each a ∈ A, the set M(a) is an unordered set in the sense that, for all q_1 and q_2 in M(a), neither q_1 ≤ q_2 nor q_1 ≥ q_2 unless q_1 = q_2. Hence the number of elements of M(a) is bounded by the width of A_1 × ··· × A_m, defined to be the natural number n such that there is an unordered subset K of A_1 × ··· × A_m of n elements and all unordered subsets of A_1 × ··· × A_m have no more than n elements [48]. Similarly, for all a ∈ A, the cardinality of m(a) is no larger than the width of A_1 × ··· × A_m. Since a finite partially ordered set always satisfies the chain conditions and the width of a finite partially ordered set is finite, we also obtain the following result:

Corollary: If Φ: A_1 × ··· × A_m → A is order-preserving and A_1 × ··· × A_m is finite, then, for all intervals B in A, Φ⁻¹(B) can be expressed as the union of a finite number of intervals of A_1 × ··· × A_m.
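For finite accomplishment sets, the sets m(a) and M(b) and the representation (5.15) can be computed by brute force. The sketch below (ours) uses a hypothetical order-preserving map, the coordinate-wise minimum, as a stand-in for Φ and verifies Theorem 5.2 on a small two-phase example.

    from itertools import product

    A1, A2 = (0, 1, 2), (0, 1, 2)      # totally ordered accomplishment sets
    phi = lambda q: min(q)             # a hypothetical order-preserving map

    U = list(product(A1, A2))
    leq = lambda q, r: all(x <= y for x, y in zip(q, r))

    def interval_representation(a, b):
        """m(a): minimal elements of D(a) = {q : phi(q) >= a};
           M(b): maximal elements of the complement of C(b), {q : phi(q) <= b}."""
        D = [q for q in U if phi(q) >= a]
        notC = [q for q in U if phi(q) <= b]
        m_a = [q for q in D if not any(r != q and leq(r, q) for r in D)]
        M_b = [q for q in notC if not any(r != q and leq(q, r) for r in notC)]
        return m_a, M_b

    a, b = 1, 1
    m_a, M_b = interval_representation(a, b)
    union = {x for q in m_a for r in M_b if leq(q, r)
             for x in U if leq(q, x) and leq(x, r)}
    assert union == {q for q in U if a <= phi(q) <= b}    # Theorem 5.2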

In addition to the conditions of Theorem 5.2, Equation (5.15) also holds when Φ: A_1 × ··· × A_m → A is a lattice homomorphism. A partially ordered set L is called a lattice when any two of its elements x and y have a least upper bound, denoted x ∨ y, and a greatest lower bound, denoted x ∧ y [48]. Clearly, every totally ordered set L is a lattice because, for any q, r ∈ L,

    q ∨ r = r  and  q ∧ r = q   if q ≤ r,
    q ∨ r = q  and  q ∧ r = r   otherwise.

Moreover, since A_1, A_2, ..., A_m are totally ordered sets, the least upper bound and the greatest lower bound of any two elements (a_1, a_2, ..., a_m) and (b_1, b_2, ..., b_m) of A_1 × ··· × A_m can be defined as

    (a_1, a_2, ..., a_m) ∨ (b_1, b_2, ..., b_m) = (a_1 ∨ b_1, a_2 ∨ b_2, ..., a_m ∨ b_m)    (5.16)

and

    (a_1, a_2, ..., a_m) ∧ (b_1, b_2, ..., b_m) = (a_1 ∧ b_1, a_2 ∧ b_2, ..., a_m ∧ b_m).    (5.17)

Now, since A_1 × ··· × A_m is a partially ordered set, it follows immediately from (5.16) and (5.17) that A_1 × ··· × A_m is also a lattice. When the mapping Φ: A_1 × ··· × A_m → A is a lattice homomorphism in the sense that, for all q, r ∈ A_1 × ··· × A_m,

    Φ(q ∨ r) = Φ(q) ∨ Φ(r)

and

    Φ(q ∧ r) = Φ(q) ∧ Φ(r),

we are able to show that

Theorem 5.3: If Φ: A_1 × ··· × A_m → A is a lattice homomorphism and A_1 × ··· × A_m satisfies the chain conditions, then, for any interval B in A, Φ⁻¹(B) can be expressed as an interval of A_1 × ··· × A_m in one and only one way.

Proof: We note first that a lattice homomorphism is order-preserving because, when q ≥ r,

    Φ(q ∧ r) = Φ(q) ∧ Φ(r) = Φ(r)

and

    Φ(q ∨ r) = Φ(q) ∨ Φ(r) = Φ(q).

Moreover, given an interval B in A, Φ⁻¹(B) contains at most one maximal element. To prove this statement, let us suppose that Φ⁻¹(B) contains two maximal elements q and r. Since A is a totally ordered set, we have either Φ(q) ≥ Φ(r) or Φ(q) ≤ Φ(r). If we assume Φ(q) ≥ Φ(r), then

    Φ(q ∨ r) = Φ(q) ∨ Φ(r) = Φ(q),

so that q ∨ r also belongs to Φ⁻¹(B). But this implies that q cannot be a maximal element of Φ⁻¹(B) unless q ∨ r = q or, equivalently, q ≥ r. However, since r is also a maximal element of Φ⁻¹(B), q ≥ r implies q = r. In other words, Φ⁻¹(B) contains at most one maximal element. By a similar argument, we can also show that Φ⁻¹(B) contains at most one minimal element.

If we assume that Φ⁻¹(B) is nonempty then, since A_1 × ··· × A_m satisfies the chain conditions, Φ⁻¹(B) contains a unique maximal element, say q_2, and a unique minimal element, say q_1. Now, since Φ is order-preserving, we have

    Φ⁻¹(B) ⊇ [q_1, q_2].

Thus, it remains to be shown that

    Φ⁻¹(B) ⊆ [q_1, q_2],

i.e., x ∈ Φ⁻¹(B) implies q_1 ≤ x ≤ q_2. This follows immediately since q_2 is also a maximal element of the set {y ∈ Φ⁻¹(B) | y ≥ x} and q_1 is a minimal element of the set {y ∈ Φ⁻¹(B) | y ≤ x}.

Theorems 5.2 and 5.3, in our opinion, establish a feasible method for the evaluation of system performability based on the notion of a phased model. To illustrate the application of the method, let us consider the following hypothetical two-phase model. During each phase, it is assumed that the phase k performance variable is given by

    Y_k: Ω → A_k   (k = 1 or 2)

where A_k = {0,1,2} and

    Y_k = 2  if the system operates in the nondegraded mode throughout phase k,
          1  if the system enters the degraded mode during phase k while remaining operational throughout the phase,
          0  otherwise.

It is further assumed that the performance variable of the total system S is given by

    Y: Ω → A

where A = {0,1,2} and

    Y = 2  if the system operates in the nondegraded mode throughout all phases,
        1  if the system remains operational during at least one phase,
        0  otherwise.

Then the organizing structure can be expressed as an order-preserving mapping

    Φ: {0,1,2} × {0,1,2} → {0,1,2}

as illustrated in Figure 5.1. Note that in the figure the domain and the codomain of the mapping are both represented by diagrams in which two nodes q and r are connected by a downward line when q covers r.

Suppose that the user is interested in evaluating the probability of the event Y = 1 or Y = 0, i.e., the probability of encountering degraded performance or system failure. Applying Theorem 5.2, the event Φ⁻¹([0,1]) can be expressed as

    Φ⁻¹([0,1]) = [(0,0),(2,1)] ∪ [(0,0),(1,2)]

where [(0,0),(2,1)] and [(0,0),(1,2)] are intervals of the partially ordered set {0,1,2} × {0,1,2}. Note that the intersection of [(0,0),(2,1)] and [(0,0),(1,2)] is also an interval, namely [(0,0),(1,1)]. Accordingly, the probability of Φ⁻¹([0,1]) can be expressed as

    Pr[Φ⁻¹([0,1])] = Pr[[(0,0),(2,1)] ∪ [(0,0),(1,2)]]
                   = Pr[[(0,0),(2,1)]] + Pr[[(0,0),(1,2)]] − Pr[[(0,0),(1,1)]].

Since an interval of {0,1,2} × {0,1,2} is a Cartesian subset, each of the last three terms of the above equality can be evaluated iteratively using the solution method described in Section 5.3 for Cartesian subsets.
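As a quick check of this decomposition, the following sketch (ours; it encodes our reading of the mapping illustrated in Figure 5.1) verifies by enumeration that the two intervals cover exactly the event Φ(Y_1, Y_2) ≤ 1.

    from itertools import product

    U = list(product(range(3), repeat=2))

    def phi(q):                     # the mapping of Figure 5.1, as we read it
        if q == (2, 2):
            return 2
        return 1 if max(q) >= 1 else 0

    interval = lambda lo, hi: {q for q in U
                               if all(l <= x <= h for l, x, h in zip(lo, q, hi))}
    cover = interval((0, 0), (2, 1)) | interval((0, 0), (1, 2))
    assert cover == {q for q in U if phi(q) <= 1}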

Figure 5.1  An Order-Preserving Mapping Φ: {0,1,2}² → {0,1,2}

In general, given an order-preserving mapping Φ: A_1 × ··· × A_m → A and an interval B of A, let us suppose Φ⁻¹(B) can be expressed as the set union of a finite number of intervals I_1, I_2, ..., I_N, i.e.,

    Φ⁻¹(B) = I_1 ∪ I_2 ∪ ··· ∪ I_N.

To generalize the above evaluation method, let us define

    S_1 = Σ_i Pr[I_i]
    S_2 = Σ_{i<j} Pr[I_i ∩ I_j]
    S_3 = Σ_{i<j<k} Pr[I_i ∩ I_j ∩ I_k]
    ⋮
    S_N = Pr[I_1 ∩ I_2 ∩ ··· ∩ I_N]

where 1 ≤ i < j < k < ··· ≤ N, so that in each sum every combination appears once and only once; hence S_N has exactly one term. Then, since the intersection of intervals is an interval, each of the sums can be evaluated by repeatedly applying the computational algorithms described in the following section.
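Anticipating formula (5.18) below, the following sketch (ours) implements this inclusion-exclusion scheme over a list of intervals; the routine pr_interval, which returns the probability of a single interval, is a hypothetical stand-in for the matrix algorithm of Section 5.3.

    from itertools import combinations

    def intersect(I, J):
        """Intersection of two intervals (lo, hi); None if empty."""
        lo = tuple(max(a, b) for a, b in zip(I[0], J[0]))
        hi = tuple(min(a, b) for a, b in zip(I[1], J[1]))
        return (lo, hi) if all(l <= h for l, h in zip(lo, hi)) else None

    def prob_union(intervals, pr_interval):
        """Inclusion-exclusion: Pr[I_1 u ... u I_N]."""
        total, N = 0.0, len(intervals)
        for k in range(1, N + 1):
            for combo in combinations(intervals, k):
                I = combo[0]
                for J in combo[1:]:
                    I = intersect(I, J)
                    if I is None:
                        break
                if I is not None:
                    total += (-1) ** (k + 1) * pr_interval(I)
        return total

    # Toy check against the two-phase example above: under a uniform
    # distribution on {0,1,2} x {0,1,2}, Pr[Y <= 1] = 8/9.
    size = lambda I: (I[1][0] - I[0][0] + 1) * (I[1][1] - I[0][1] + 1)
    print(prob_union([((0, 0), (2, 1)), ((0, 0), (1, 2))],
                     lambda I: size(I) / 9.0))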

Hence, by the method of inclusion and exclusion (see [51], p. 89), the probability of Φ⁻¹(B) can be computed by the well-known formula

    Pr[Φ⁻¹(B)] = S_1 − S_2 + S_3 − S_4 + ··· ± S_N.    (5.18)

5.3 Probability Computation of Cartesian Trajectory Sets

Given a phased model (X,γ) satisfying assumptions (5.1) and (5.2), the performability model can be simplified as follows. The simplified base model is taken to be the embedded discrete-time process

    X_s = {Y_k | k = 1, 2, ..., m}

where, for each k = 1, 2, ..., m, Y_k is the phase k performance variable. The trajectory space of X_s can be effectively regarded as the product space

    U = A_1 × A_2 × ··· × A_m

where A_k is the accomplishment set of phase k. The corresponding simplification of γ is the organizing structure

    Φ: U → A

as defined in (5.2). It follows that, for all a in A, the probability that S performs at level a is

    perf(a) = Pr[γ⁻¹(a)] = Pr[Φ⁻¹(a)]

and hence the performability model (X_s, Φ) can be used to evaluate the performability of S. We will thus refer to (X_s, Φ) as being equivalent to the model (X,γ).

Generally, given an equivalent performability model (X_s, Φ), the evaluation of Pr[Φ⁻¹(a)] requires a detailed knowledge of how intraphase processes cooperate to accomplish level a, i.e., a thorough understanding of their functional dependencies (see [46]). The difficulties are further aggravated by statistical dependencies between phases. However, we show in the following discussions that when a trajectory set V ⊆ U is Cartesian in the sense that, for every phase k, there exists R_k ⊆ A_k such that V = R_1 × R_2 × ··· × R_m, then Pr[V] can be determined iteratively using matrix multiplications. Moreover, given this ability to compute the probabilities of Cartesian sets, the probabilities of more general sets can be determined by decomposing them into Cartesian components (see (5.18)). Hence the problem reduces to that of computing the probabilities of Cartesian trajectory sets.

For each phase k, let n_k be the number of states in Q_k. Then, for a Cartesian trajectory set V = R_1 × R_2 × ··· × R_m, the conditional intraphase transition matrix of the kth phase is the n_k × n_k matrix P_{V,k} where, for all i, j ∈ Q_k,

    P_{V,k}(i,j) = Pr[Y_k ∈ R_k, X_{t_k} = j | X_{t_{k-1}'} = i, Y_{k-1} ∈ R_{k-1}, ..., Y_1 ∈ R_1]    (5.19)

where X_{t_{k-1}'} and X_{t_k} are, respectively, the initial and final states of the phase k intraphase process (see (5.1)). In other words, P_{V,k}(i,j) is the probability of having performance levels in R_k during phase k while the intraphase process initiates in state i and ends up in state j, conditioned by the first k−1 components of V. Similarly, for all but the first phase, the conditional interphase transition matrix is the n_{k-1} × n_k matrix H_{V,k} where, for all i ∈ Q_{k-1} and all j ∈ Q_k,

    H_{V,k}(i,j) = Pr[X_{t_{k-1}'} = j | X_{t_{k-1}} = i, Y_{k-1} ∈ R_{k-1}, ..., Y_1 ∈ R_1].    (5.20)

In other words, H_{V,k}(i,j) is the probability that the kth phase initiates in state j given that the final state of the (k−1)th phase is i, conditioned by the first k−1 components of V. Finally, for consistency, we let H_{V,1} be the n_1 × n_1 identity matrix.

In terms of the above matrices, we are able to establish the following matrix formula for computing the probability of a Cartesian trajectory set V. Given X, let p denote the initial state distribution, i.e., p = [p_1 p_2 ··· p_{n_1}] where p_i = Pr[X_{t_0} = i], and let F_k denote the n_k × 1 column matrix with "1" in each entry. Then, by induction on k, it can be established that

Theorem 5.4: If V = R_1 × ··· × R_k × Q_{k+1} × ··· × Q_m, then

    Pr[V] = p · [∏_{ℓ=1}^k H_{V,ℓ} · P_{V,ℓ}] · F_k.

Proof: For k = 1,

    p · H_{V,1} · P_{V,1} = p · P_{V,1} = [α_1 ··· α_j ··· α_{n_1}]    (5.21)

where

    α_j = Σ_{i∈Q_1} Pr[X_{t_0} = i] · Pr[Y_1 ∈ R_1, X_{t_1} = j | X_{t_0} = i]
        = Pr[Y_1 ∈ R_1, X_{t_1} = j].

Multiplied by F_1,

    p · H_{V,1} · P_{V,1} · F_1 = Σ_{j∈Q_1} Pr[Y_1 ∈ R_1, X_{t_1} = j]
                                = Pr[Y_1 ∈ R_1]
                                = Pr[Y_1 ∈ R_1, Y_2 ∈ Q_2, ..., Y_m ∈ Q_m]
                                = Pr[V].

Suppose that the formula holds for k < m; then

    p · [∏_{ℓ=1}^{k+1} H_{V,ℓ} · P_{V,ℓ}] · F_{k+1}
        = p · [∏_{ℓ=1}^{k} H_{V,ℓ} · P_{V,ℓ}] · H_{V,k+1} · P_{V,k+1} · F_{k+1}
        = A_1 · H_{V,k+1} · P_{V,k+1} · F_{k+1}

where A_1 = [b_1 ··· b_j ··· b_{n_k}] and

    b_j = Pr[X_{t_k} = j, Y_k ∈ R_k, ..., Y_1 ∈ R_1],

by applying the equation for k. When we iteratively compute the matrix product, beginning from the left, the first two terms become

    A_2 = A_1 · H_{V,k+1} = [c_1 ··· c_j ··· c_{n_{k+1}}]

where

    c_j = Σ_{i∈Q_k} b_i · H_{V,k+1}(i,j)
        = Σ_{i∈Q_k} Pr[X_{t_k} = i, Y_k ∈ R_k, ..., Y_1 ∈ R_1]
              · Pr[X_{t_k'} = j | X_{t_k} = i, Y_k ∈ R_k, ..., Y_1 ∈ R_1]
        = Pr[X_{t_k'} = j, Y_k ∈ R_k, ..., Y_1 ∈ R_1].

The next partial product is the result of multiplying A_2 by the transition matrix P_{V,k+1}, which yields

    A_3 = A_2 · P_{V,k+1} = [d_1 ··· d_j ··· d_{n_{k+1}}]

where

    d_j = Σ_{i∈Q_{k+1}} c_i · P_{V,k+1}(i,j)
        = Σ_{i∈Q_{k+1}} Pr[X_{t_k'} = i, Y_k ∈ R_k, ..., Y_1 ∈ R_1]
              · Pr[Y_{k+1} ∈ R_{k+1}, X_{t_{k+1}} = j | X_{t_k'} = i, Y_k ∈ R_k, ..., Y_1 ∈ R_1]
        = Pr[X_{t_{k+1}} = j, Y_{k+1} ∈ R_{k+1}, ..., Y_1 ∈ R_1].

The product is completed by multiplying A_3 by the summing vector F_{k+1}, that is,

    p · [∏_{ℓ=1}^{k+1} H_{V,ℓ} · P_{V,ℓ}] · F_{k+1} = A_3 · F_{k+1}
        = Σ_{j∈Q_{k+1}} Pr[X_{t_{k+1}} = j, Y_{k+1} ∈ R_{k+1}, ..., Y_1 ∈ R_1]
        = Pr[Y_1 ∈ R_1, ..., Y_{k+1} ∈ R_{k+1}]
        = Pr[Y_1 ∈ R_1, ..., Y_{k+1} ∈ R_{k+1}, Y_{k+2} ∈ Q_{k+2}, ..., Y_m ∈ Q_m]
        = Pr[R_1 × ··· × R_{k+1} × Q_{k+2} × ··· × Q_m].

Accordingly, the equation holds for all k ≤ m, which completes the proof of Theorem 5.4.

In particular, for k = m, we have

Corollary: For any Cartesian set V = R_1 × R_2 × ··· × R_m,

    Pr[V] = p · [∏_{k=1}^m H_{V,k} · P_{V,k}] · F_m.    (5.22)

Although Equation (5.22) provides us with a general formula for computing the probability of a Cartesian set, its disadvantages derive from the fact that the H_{V,k} and P_{V,k} matrices may be difficult to obtain in practical applications. In particular, these matrices will generally depend on V as well as X and, moreover, will generally depend on the history of X before phase k. However, the latter objections disappear when the transition probabilities are "memoryless." More precisely, let the (unconditional) intraphase transition matrix of the kth phase be the n_k × n_k matrix P̂_{V,k} where, for all i, j ∈ Q_k,

    P̂_{V,k}(i,j) = Pr[X_{t_k} = j, Y_k ∈ R_k | X_{t_{k-1}'} = i],    (5.23)

i.e., the probability of having performance levels in R_k during phase k while the intraphase process initiates in state i and ends up in state j. Similarly, let the (unconditional) interphase transition matrix be the n_{k-1} × n_k matrix H_k where, for all i ∈ Q_{k-1} and j ∈ Q_k,

    H_k(i,j) = Pr[X_{t_{k-1}'} = j | X_{t_{k-1}} = i],    (5.24)

i.e., the probability that the kth intraphase process initiates in state j given that the (k−1)th intraphase process ends up in state i. Then the intraphase transitions of (X,γ) are memoryless for V at phase k if

    P_{V,k} = P̂_{V,k}.

Similarly, the interphase transitions of (X,γ) are memoryless for V at phase k if

    H_{V,k} = H_k.

Accordingly, when transitions are memoryless through phase k, by the definitions and Theorem 5.4, we obtain

Theorem 5.5: If V = R_1 × R_2 × ··· × R_k × Q_{k+1} × ··· × Q_m and the intraphase and interphase transitions of X are memoryless for V through phase k, then

    Pr[V] = p · [∏_{ℓ=1}^k H_ℓ · P̂_{V,ℓ}] · F_k.    (5.25)

Corollary: For any Cartesian set V, if the intraphase and interphase transitions of X are memoryless for V at all phases, then

    Pr[V] = p · [∏_{k=1}^m H_k · P̂_{V,k}] · F_m.    (5.26)

When V is a Cartesian set and R_ℓ = Q_ℓ for ℓ = 1, 2, ..., k−1, then the intraphase and interphase transitions of X are memoryless for V through phase k. Accordingly, applying Theorem 5.5, we obtain the following formula for the probability of the trajectory set V = Q_1 × ··· × Q_{k-1} × R_k × Q_{k+1} × ··· × Q_m which, alternatively, is the probability of the event Y_k ∈ R_k.

Theorem 5.6: If V = Q_1 × ··· × Q_{k-1} × R_k × Q_{k+1} × ··· × Q_m, then

    Pr[V] = p · [∏_{ℓ=1}^k H_ℓ · P̂_{V,ℓ}] · F_k.    (5.27)

By Theorems 5.5 and 5.6, when certain intraphase and interphase transitions are memoryless for V, the probability of a Cartesian set V is easily obtainable. However, such results may still be difficult to use because, even though the transitions are memoryless for V, they may not be memoryless for other Cartesian sets. Accordingly, we have sought to identify stronger conditions under which the formulas will hold for all Cartesian trajectory sets.
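A minimal sketch of the corollary (ours), assuming the interphase matrices H_k and the unconditional intraphase matrices P̂_{V,k} are given (H_1 being the identity):

    import numpy as np

    def pr_cartesian(p, H, PV):
        """Pr[V] = p . (H_1 PV_1) ... (H_m PV_m) . F   (Equation (5.26)).
        p is the initial distribution; H[k] and PV[k] are the interphase
        and intraphase transition matrices for the Cartesian set V."""
        v = np.asarray(p, dtype=float)
        for Hk, Pk in zip(H, PV):
            v = v @ Hk @ Pk
        return float(v.sum())     # right-multiplying by F sums the entries

For a two-phase model with n states per phase one would call, e.g., pr_cartesian(p, [np.eye(n), H2], [PV1, PV2]).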

First, by extending the previous definitions, the intraphase (interphase) transitions of (X,γ) are memoryless at phase k if they are memoryless for all Cartesian sets V at phase k; the intraphase (interphase) transitions of (X,γ) are memoryless if they are memoryless at all phases. The advantages of memoryless transitions are obvious, for by their definitions and the corollary to Theorem 5.4, we have

Theorem 5.7: If (X,γ) is a phased model and the intraphase and interphase transitions of X are memoryless, then, for all Cartesian sets V,

    Pr[V] = p · [∏_{k=1}^m H_k · P̂_{V,k}] · F_m.    (5.28)

Moreover, we find that the memoryless property is relatively easy to characterize; that is, we are able to show the following characterization conditions for the memoryless property. Note that the conditions do not involve any specific Cartesian sets.

Theorem 5.8:

(1) The intraphase transitions of X are memoryless at phase k if and only if, for all i, j ∈ Q_k and all a_ℓ ∈ A_ℓ (ℓ = 1, 2, ..., k),

    Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i, Y_{k-1} = a_{k-1}, ..., Y_1 = a_1]
        = Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i].    (5.29)

(2) The interphase transitions of X are memoryless at phase k if and only if, for all i ∈ Q_{k-1}, j ∈ Q_k and all a_ℓ ∈ A_ℓ (ℓ = 1, 2, ..., k−1),

    Pr[X_{t_{k-1}'} = j | X_{t_{k-1}} = i, Y_{k-1} = a_{k-1}, ..., Y_1 = a_1]
        = Pr[X_{t_{k-1}'} = j | X_{t_{k-1}} = i].    (5.30)

Proof: Suppose the intraphase transitions are memoryless for all Cartesian sets V = R_1 × R_2 × ··· × R_m at phase k. By taking R_ℓ to be the singleton set {a_ℓ} (ℓ = 1, 2, ..., k),

    Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i, Y_{k-1} = a_{k-1}, ..., Y_1 = a_1]
        = P_{V,k}(i,j) = P̂_{V,k}(i,j) = Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i].

Now suppose that, for all a_ℓ ∈ A_ℓ (ℓ = 1, 2, ..., k),

    Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i, Y_{k-1} = a_{k-1}, ..., Y_1 = a_1]
        = Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i].

Then, for any Cartesian set V = R_1 × R_2 × ··· × R_m,

    P_{V,k}(i,j) = Pr[X_{t_k} = j, Y_k ∈ R_k | X_{t_{k-1}'} = i, Y_{k-1} ∈ R_{k-1}, ..., Y_1 ∈ R_1]

                   Σ_{a_1∈R_1, ..., a_k∈R_k} c_ij(a_1, ..., a_k) · d_i(a_1, ..., a_{k-1})
                 = ─────────────────────────────────────────────────────────────────────
                   Pr[X_{t_{k-1}'} = i, Y_{k-1} ∈ R_{k-1}, ..., Y_1 ∈ R_1]

where

    c_ij(a_1, ..., a_k) = Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i, Y_{k-1} = a_{k-1}, ..., Y_1 = a_1]

and

    d_i(a_1, ..., a_{k-1}) = Pr[X_{t_{k-1}'} = i, Y_{k-1} = a_{k-1}, ..., Y_1 = a_1].

Thus, by the assumption, each c_ij(a_1, ..., a_k) can be replaced by Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i]. Factoring these terms out of the sum over a_1, ..., a_{k-1}, the remaining sum of the d_i terms cancels the denominator, and we have

    P_{V,k}(i,j) = Σ_{a_k∈R_k} Pr[X_{t_k} = j, Y_k = a_k | X_{t_{k-1}'} = i] = P̂_{V,k}(i,j),

which completes the proof of part (1) of the theorem. Part (2) is proven in a like manner.

5.4 Operational Models as Intraphase Processes

To illustrate the application of the above general results, we consider in this section performability evaluation methods which assume that the phased base model of a phased model (see (5.3)) is Markovian and that the intraphase performance variables are defined as the minimum operational rates experienced by the system during a particular phase (see (3.23)). The representation of intraphase processes as operational models permits us to calculate the intraphase transition probabilities using Equations (3.45) and (3.49). By formulating the intraphase performance variables in terms of functionals, we also obtain a considerable gain in the expressive power of a phased model, particularly in representing the effects of intraphase "repair." Moreover, since the operational models are allowed to vary from phase to phase, the modeling of a particular phase can be tailored to the computational requirements of the phase. In other words, this special class of phased models can be regarded as the time-varying version of the models introduced in Chapter 3.

5.4.1 Derivation of Intraphase Transition Probabilities

Recall that, in order to represent the phased base model as a one-parameter family of random variables X = {X_t | t ∈ T̃}, the augmented utilization period is taken to be

    T̃ = T ∪ {t_k' | k = 1, 2, ..., m−1}

where T = [0,h] is the original utilization period. Hence, to describe X as a Markov process, T̃ must be specified as a totally ordered set by inheriting the ordering relation of T and by assuming t_k < t_k', t_k' < x, and y < t_k' for all k = 1, 2, ..., m−1 and all x, y ∈ T such that t_k < x and y ≤ t_k.

By Theorem 5.8, the Markov assumption implies that the intraphase and interphase transitions of X are memoryless. Hence, applying (5.28), for all Cartesian sets V ⊆ A_1 × ··· × A_m,

    Pr[V] = p · [∏_{k=1}^m H_k · P̂_{V,k}] · F_m.    (5.31)

In other words, the probability of V can be expressed in terms of the initial distribution of the first phase, p = [p_1 p_2 ··· p_{n_1}], the interphase transition probabilities

    H_k(i,j) = Pr[X_{t_{k-1}'} = j | X_{t_{k-1}} = i]

where k = 1, 2, ..., m, i ∈ Q_{k-1} and j ∈ Q_k, and the intraphase transition probabilities

    P̂_{V,k}(i,j) = Pr[X_{t_k} = j, Y_k ∈ R_k | X_{t_{k-1}'} = i]

where k = 1, 2, ..., m and i, j ∈ Q_k. Assuming p and H_k(i,j) can be determined from the known properties of the system, the problem of computing Pr[V] is reduced to that of computing the intraphase transition probabilities.

To compute the intraphase transition probabilities, we assume that each phase of the mission is represented by an operational model. More precisely, for each phase k, the intraphase process X^k = {X_t^k | t ∈ T_k} is a time-homogeneous Markov process and the phase k performance variable Y_k is given by

    Y_k = min {f_k(X_t^k) | t ∈ T_k}

where f_k is an operational structure defined on the state space of X^k. Clearly, the intraphase transition probability P̂_{V,k}(i,j) can be determined from the probabilities

    Pr[X_{t_k} = j, Y_k = q | X_{t_{k-1}'} = i]   (q an operational rate of phase k)

which, in turn, can be determined by the conditional probabilities m_iq(t) described in Section 3.2. More precisely, for each 1 ≤ k ≤ m and each operational rate q of phase k, let us define the n_k × n_k matrix M_{k,q} where, for all i, j ∈ Q_k,

    M_{k,q}(i,j) = Pr[X_{t_k} = j, Y_k = q | X_{t_{k-1}'} = i].

Then, since X^k is a time-homogeneous Markov process, M_{k,q}(i,j) can be computed by applying either (3.45) or (3.49). Moreover, it follows that the intraphase transition probabilities can then be obtained from the M_{k,q} by matrix additions, i.e.,

    P̂_{V,k} = Σ_{q∈R_k} M_{k,q}.
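Assuming the matrices M_{k,q} have already been computed from (3.45) or (3.49) and are supplied, e.g., as a dictionary keyed by operational rate, the assembly of P̂_{V,k} is then a matter of matrix addition, as in the following sketch (ours):

    import numpy as np

    def intraphase_matrix(M_k, R_k, n_k):
        """P_{V,k} = sum over q in R_k of M_{k,q}, where M_k maps each
        operational rate q to the n_k x n_k matrix
        [ Pr[X_{t_k}=j, Y_k=q | X_{t_{k-1}'}=i] ]."""
        P = np.zeros((n_k, n_k))
        for q in R_k:
            P += M_k[q]
        return P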

5.4.2 Evaluation of a Two-Phase Mission

In constructing a phased model that can support an evaluation of system performability, we must specify the intraphase processes and an organizing structure with respect to a specific computer and its computational environment. The intraphase processes together with the organizing structure determine an equivalent performability model that can be evaluated using the solution methods developed in Section 5.3. To illustrate these modeling and evaluation methods, a comprehensive phased model has been examined in [49], involving the SIFT computer with an environment taken to be the control of a transoceanic air transport mission. The model represents the internal structure of the SIFT computer as well as conditions of its environment in terms of Markov processes (see [49], Figure 3 and Table III). State trajectories of the equivalent base model are then related to accomplishment levels of the mission via a capability function (i.e., an organizing structure) which is formulated in terms of a three-level model hierarchy (see [49], Figure 2). After the capability function is formulated, solution methods are then applied to determine the performability of the total system.

Although the performability modeling and evaluation effort of the SIFT computer has shown the essential aspects of the phased model method, it has emphasized the construction of realistic higher level models. Simple Markovian models for nonrepairable systems are used to reduce the complexity of the performability calculation. Hence, to show the solution algorithms in more detail, we consider in this subsection a phased model that uses operational models as intraphase processes.

We consider a total system S = (C,E) comprised of a control computer C operated in the environment E of a two-phase mission. The computer initially operates as a multiprocessor system with three identical components. However, system reconfiguration can occur in the computer due to phase change, hardware faults, or software faults. During the first phase T_1 = [t_0, t_1], all three subsystems are required to perform all computational tasks successfully. But when fewer than three subsystems are available, the system can still survive by executing a reduced set of computational tasks. Depending on the amount of resources available, the system exhibits the following levels of accomplishment during phase 1:

    Phase 1 accomplishment level    Interpretation
    3                               Full performance
    2                               Noncritical performance degradation
    1                               Critical performance degradation
    0                               Failure

During the second phase T_2 = [t_1, t_2], the system is reconfigured into a TMR system with software recovery to obtain a high degree of reliability (see Section 3.4). Hence, the system exhibits the following three accomplishment levels:

    Phase 2 accomplishment level    Interpretation
    2                               Full performance
    1                               Degraded performance
    0                               Failure

To describe the aspects of system performance that the users consider important, we assume that users are interested in distinguishing only three levels of mission performance A = {2,1,0}, where the accomplishment levels convey the following information:

    Mission accomplishment level    Interpretation
    2                               Full performance
    1                               Degraded performance
    0                               Failure

We further assume the following characteristics of the mission:

1) To achieve level 2 mission accomplishment, if performance degrades noncritically during the first phase, then performance degradation is not permitted during the second phase. If phase 1 performance is not degraded, then phase 2 performance is allowed to degrade.

2) Mission performance is degraded if phase 1 performance is noncritically degraded and phase 2 performance is degraded. A critically degraded performance in phase 1 will result in level 1 mission accomplishment only if phase 2 performance is not degraded.

3) The mission accomplishment level is 0 if the system enters the failure mode during any phase.

Under the above assumptions, the organizing structure of the phased model (i.e., the capability function of the equivalent performability model) can be tabulated as in Table 5.1. Clearly, the organizing structure Φ: A_1 × A_2 → A is order-preserving and satisfies the conditions of Theorem 5.2.

    Accomplishment Levels
    Phase 1    Phase 2    Mission (Φ)
    3          2          2
    3          1          2
    3          0          0
    2          2          2
    2          1          1
    2          0          0
    1          2          1
    1          1          0
    1          0          0
    0          2          0
    0          1          0
    0          0          0

    Table 5.1  An Organizing Structure
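The following sketch (ours) encodes Table 5.1 and confirms that Φ is order-preserving:

    PHI = {  # Table 5.1: (phase 1 level, phase 2 level) -> mission level
        (3, 2): 2, (3, 1): 2, (3, 0): 0,
        (2, 2): 2, (2, 1): 1, (2, 0): 0,
        (1, 2): 1, (1, 1): 0, (1, 0): 0,
        (0, 2): 0, (0, 1): 0, (0, 0): 0,
    }

    leq = lambda q, r: all(x <= y for x, y in zip(q, r))
    assert all(PHI[q] <= PHI[r]
               for q in PHI for r in PHI if leq(q, r))   # order-preserving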

Hence, for each mission accomplishment level a ∈ A, Φ⁻¹(a) can be expressed as a finite union of Cartesian subsets of A_1 × A_2. In particular, one such representation is

    Φ⁻¹(2) = {3}×{1,2} ∪ {2}×{2}    (5.32)

    Φ⁻¹(1) = {2}×{1} ∪ {1}×{2}    (5.33)

    Φ⁻¹(0) = {0}×{0,1,2} ∪ {0,1}×{0,1} ∪ {0,1,2,3}×{0}.    (5.34)

Note that each Cartesian set in the above equalities is an interval as defined in Section 5.2.

To specify the equivalent base model, it is necessary that the state spaces of the intraphase processes be refined enough to support the evaluation of system performability. This condition can be satisfied if the phased base model (5.3) is chosen to be a time-homogeneous Markov process with a state space detailed enough to distinguish the different operational modes of the intraphase processes. In other words, even though the operational models vary from phase to phase, they share the same underlying Markov process throughout the whole mission.

Assume that the computer has the same failure characteristics as the one considered in Section 3.4. Then, if we denote the model parameters by

    λ = component failure rate
    α = software failure rate
    c = hardware error coverage
    d = software error coverage
    δ = rate of software recovery,

a common time-homogeneous Markov process for both phases can be specified as in Figure 5.2. Each state of the graph (except state 0) represents a specific number of subsystems that are free from hardware faults; a prime (') is appended to the number if the system is attempting recovery from a software error. State 0 represents any other configuration. Using this Markov process, the probabilistic nature of phase 1 and phase 2 can be represented, respectively, by the operational models induced by the operational structures shown in Table 5.2. Note that the operational rates are chosen to convey the same information as the accomplishment levels of each phase, so that the intraphase performance variables Y_k (k = 1 or 2) can be defined as

    Y_k = min {f_k(X_t) | t ∈ T_k}

where f_k is the operational structure of the kth phase.

To compute the performability of the total system, note also that the use of a common Markov process for all phases implies that each interphase transition matrix is an identity matrix. Accordingly, once the initial

    Figure 5.2  Underlying Markov Process of a Phased Model
    (state-transition-rate diagram over the states 3, 3', 2, 2', 1, 0, with
    hardware-fault transitions at rates c3λ and c2λ and software-fault
    transitions at rate α)

    Phase    State    Operational rate
    1        3        3
             3'       2
             2        2
             2'       1
             1        1
             0        0
    2        3        2
             3'       1
             2        2
             2'       0
             1        0
             0        0

    Table 5.2  Operational Structures of a Phased Model
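Reading Table 5.2 as the pair of functions below, each intraphase performance variable is the minimum operational rate over the states visited during the phase; the sketch (ours, with the state names and rates as we read them from the table) illustrates the definition:

    # Operational structures of Table 5.2 (states 3, 3', 2, 2', 1, 0).
    f1 = {'3': 3, "3'": 2, '2': 2, "2'": 1, '1': 1, '0': 0}   # phase 1
    f2 = {'3': 2, "3'": 1, '2': 2, "2'": 0, '1': 0, '0': 0}   # phase 2

    def phase_performance(f, visited_states):
        """Y_k = min over the phase of the operational rate (see (3.23))."""
        return min(f[s] for s in visited_states)

    print(phase_performance(f1, ['3', "3'", '3']))   # -> 2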

distribution of the first phase model is known, the probabilities of Cartesian sets can be determined by repeatedly applying (5.31). First, let us assume the following parameter values:

    Parameter    Value
    λ            5×10⁻⁴
    α            10⁻²
    δ            10³
    c            .99999
    d            .9

Then, if we assume that the durations of the two phases are both 10 hours, we have (with rows and columns ordered by the states 3, 3', 2, 2', 1, 0)

    M_{1,3} = [ .89137   0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ],

    M_{1,2} = [ .08393   .00001   .01402   0        0        0 ]
              [ .87777   .00001   .01262   0        0        0 ]
              [ 0        0        .89583   0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ],

    M_{1,1} = [ 0        0        .00065   0        .00005   0 ]
              [ 0        0        .00058   0        .00005   0 ]
              [ 0        0        .08436   .00001   .00846   0 ]
              [ 0        0        .88218   .00001   .00846   0 ]
              [ 0        0        0        0        .90032   0 ]
              [ 0        0        0        0        0        0 ],

    M_{2,2} = [ .89137   0        .01338   0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        .89583   0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ],

and

    M_{2,1} = [ .08393   .00001   .00065   0        0        0 ]
              [ .87777   .00001   .01262   0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ]
              [ 0        0        0        0        0        0 ].

Accordingly, the trajectory sets in (5.32) and (5.33) can be evaluated by applying (5.28) and (5.18), i.e.,

    Pr[Φ⁻¹(2)] = Pr[{3}×{1,2}] + Pr[{2}×{2}]
               = p·M_{1,3}·(M_{2,1}+M_{2,2})·F + p·M_{1,2}·M_{2,2}·F

and

    Pr[Φ⁻¹(1)] = Pr[{2}×{1}] + Pr[{1}×{2}]
               = p·M_{1,2}·M_{2,1}·F + p·M_{1,1}·M_{2,2}·F

where p is the initial distribution of the first phase and F is a 6×1 column matrix with a "1" in every entry. In particular, if the computer is fault-free at the beginning, i.e., p = [1 0 0 0 0 0], the performability of the phased mission is as follows:

    perf(2) = Pr[Φ⁻¹(2)] = .97036
    perf(1) = Pr[Φ⁻¹(1)] = .00769
    perf(0) = Pr[Φ⁻¹(0)] = .02195
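For the interested reader, the quoted values can be reproduced from the matrices above (a sketch of ours; since p puts all mass on the fault-free state and the interphase matrices are identities, only the matrix entries reachable from that state are needed, so M_{1,1} is entered here by its first row alone):

    import numpy as np

    # Phase matrices from the example (states ordered 3, 3', 2, 2', 1, 0).
    M13 = np.zeros((6, 6)); M13[0, 0] = .89137
    M12 = np.zeros((6, 6))
    M12[0, :3] = [.08393, .00001, .01402]
    M12[1, :3] = [.87777, .00001, .01262]
    M12[2, 2] = .89583
    M11 = np.zeros((6, 6)); M11[0, 2], M11[0, 4] = .00065, .00005
    M22 = np.zeros((6, 6)); M22[0, 0], M22[0, 2], M22[2, 2] = .89137, .01338, .89583
    M21 = np.zeros((6, 6))
    M21[0, :3] = [.08393, .00001, .00065]
    M21[1, :3] = [.87777, .00001, .01262]

    p = np.array([1., 0, 0, 0, 0, 0])
    F = np.ones(6)
    perf2 = p @ M13 @ (M21 + M22) @ F + p @ M12 @ M22 @ F
    perf1 = p @ M12 @ M21 @ F + p @ M11 @ M22 @ F
    print(round(perf2, 5), round(perf1, 5), round(1 - perf2 - perf1, 5))
    # -> 0.97036  0.00769  0.02195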

CHAPTER 6

CONCLUSION AND FURTHER RESEARCH

The objective of this research has been to develop a general stochastic process model for evaluating the performability of degradable computing systems. This objective was established to fulfill the need for unified performance and reliability evaluation of distributed multiprocessor systems. To accomplish the objective, a precise formulation of system performance is developed in a broad context, and the concept is then applied to analyze the performance of degradable computing systems. Furthermore, a simple and useful user-oriented performance variable is identified and shown to be a proper generalization of the traditional notions of system performance and reliability.

In addition to the above modeling framework, a specific two-level hierarchical model is developed. The model is constructed according to a hierarchical decomposition of a system's behavior: priority queueing models are used to analyze the system's detailed program behavior, and the results are combined via a Markov reward process to characterize the overall system performance. Although the modeling approach resembles the top-down structured approach of software development, the decomposition considered

here is based on a more precise classification of a system's short-term and long-term equilibrium behavior. Accordingly, the modeling approach permits the evaluation of a computing system's hardware and software as a whole, and it becomes possible to deal with the performance and the reliability of a computing system simultaneously in order to measure the extent to which the user can benefit from tasks accomplished by the computer.

Finally, a time-varying version of the model is considered to analyze the performance of phased missions. By representing intraphase models in terms of operational models, we are able to obtain useful results even without the typical no-repair assumption of the traditional phased-mission reliability methods. Moreover, since the model considered does not require the structure function representation of system success, the approach represents an important generalization of traditional fault-tree analysis.

Although the investigation efforts documented in this thesis were carried to the point where the research objectives described in Section 1.2 were satisfactorily accomplished, there remain several problems that must be resolved before the performability modeling techniques can become a major tool in the design and analysis of computing systems. First, to extend the usefulness of operational models, more effort can be made to formulate various user-oriented performance variables that are suitable for a wide variety of computer applications. Since solution methods for these performance variables may differ considerably from those obtained in this

study, new solution techniques should also be explored.

Several generalizations of the phased model are possible, depending on how the notion of phasing is relaxed. For example, by allowing the duration of each phase to vary, the model can be extended to a larger class of systems with time-varying environments. One may also extend the phased model by allowing the selection of a succeeding phase to be made at the time of a phase change so as to improve mission performance.

Another important problem that may have significant influence on the modeling of degradable computing systems is the modeling of software faults. The problem becomes even more interesting when both software and hardware faults are considered simultaneously. Note that the model considered in Chapter 4 measures the effect of hardware faults while taking into account the behavior of the system software. On the other hand, software reliability models (see [55,56], for example) are typically concerned with the effect of software faults on the system performance, assuming that the hardware is fault-free. Clearly, useful performance measures can be obtained by combining performability with the results of software reliability analysis.

Finally, we note that the hierarchical decomposition method considered in Chapter 4 may also be extended to the performance modeling of computing systems in general, in addition to the performance modeling of degradable computing systems. By classifying the physical and logical resources of a computing system according to their frequency of access, various models

can be constructed to facilitate the step-by-step approximation of system performance. To make the approach useful, however, it then becomes necessary to have a better understanding of the error bound of the approximation.

APPENDIX

MARKOVIAN FUNCTIONALS OF MARKOV PROCESSES

This appendix is reference material for Chapter 3. Conditions under which operational models become Markovian are stated in terms of slightly modified forms of what can be found in the literature. The use of such conditions, together with their limitations, is illustrated through examples.

When modeling computing systems as operational models, there are many situations in which the state space of the underlying base model is much larger than needed to distinguish operational rates via the operational structure. Accordingly, to simplify the evaluation of system performability, one question that arises naturally is whether the operational models can be described as Markov processes and, if so, whether they are time-homogeneous. More precisely, let us suppose the total system is modeled by a time-homogeneous Markov process

    X = {X_s | 0 ≤ s ≤ t}

with a denumerable state space Q and, relative to an operational structure f: Q → R, consider the functional

    Z = {f(X_s) | 0 ≤ s ≤ t}.

Since the above question does not involve the actual values of f, Z may be regarded here as a "lumped" [52] version of X, where states i and j are in the same lump if and only if f(i) = f(j). In other words, lumps coincide with the operational modes of S. If f is 1-1 then Z is obviously both Markovian and time-homogeneous since, in this case, the lumping is trivial. If f is properly a many-to-one function, the answer is no longer obvious, and indeed the question needs further clarification. To begin, let us suppose that the underlying process X is specified by its generator matrix A and an initial distribution p. Then our original inquiry can be reduced to the following questions:

Q1) Given A, p and f, is Z Markovian?

In many applications, however, one wants the freedom to alter the initial distribution p without losing the Markov property. In this case we are asking:

Q2) Given A and f, is Z Markovian for arbitrary p?

Finally, we can raise our sights even higher and ask:

Q3) Given A and f, is Z Markovian for arbitrary p and, moreover, is the transition function of Z independent of p?

Adopting the terminology of [52] (which investigates the discrete-time versions of Q1 and Q3), if the answer to Q1 is "yes" then the process X specified by A and p is weakly lumpable (with respect to f). A "yes" answer to Q2 is stronger but, generally, these Markov processes will not be time-homogeneous. If the answer to Q3 is "yes" then, for all initial distributions p, the Markov processes Z have the same transition function and, by the homogeneity of X, it follows that this function is invariant under time shifts, i.e., the processes are time-homogeneous. In this case we say that the processes X specified by A are strongly lumpable (with respect to f).

Addressing first the question of weak lumpability (Q1): if Z is to be a Markov process, we must ensure that it has the "memoryless" property, that is, any sequence of past observations of Z provides the same information as the last of those observations. To formalize this requirement, if 0 ≤ t_1 < t_2 < ··· < t_k is a sequence of observation times and q_i ∈ Q̄ = f(Q) is the state of Z observed at time t_i, then, for each underlying state j ∈ Q, let

    M_j(t_1, ..., t_k; q_1, ..., q_k) = Pr[X_{t_k} = j | Z_{t_1} = q_1, ..., Z_{t_k} = q_k];    (A.1)

the 1 × |Q| matrix

    M(t_1, ..., t_k; q_1, ..., q_k) = [M_j( ; )]

is then the probability distribution of the states of X at time t_k, as conditioned by these observations of Z. In particular, since Z_{t_k} = q_k, it follows that M_j( ; ) is nonzero only if f(j) = q_k, where {M_j( ; ) | f(j) = q_k} gives the probability distribution of the states inside the lump f⁻¹(q_k). Moreover, since X is a Markov process, M( ; ) permits us to represent conditional probabilities in terms of the transition function of X, i.e., it can be shown that, for all 0 ≤ t_1 < t_2 < ··· < t_k ≤ t and s ≥ 0,

    M(t_1, ..., t_k; q_1, ..., q_k) · P(s) = [r_1 r_2 ···]    (A.2)

where, for each i ∈ Q,

    r_i = Pr[X_{t_k+s} = i | Z_{t_1} = q_1, ..., Z_{t_k} = q_k].

To translate distributions of X back up into distributions of Z, relative to some specified ordering of the lumps, let Q_ℓ denote the ℓth lump, i.e., the collection of sets {Q_ℓ | 1 ≤ ℓ ≤ |Q̄|} is the partition of Q induced by f. Accordingly, if π = [π_1 π_2 ···] is a probability distribution over the states of X, we let π̄ denote the corresponding distribution over the lumps, that is,

    π̄ = [π̄_1 π̄_2 ···]    (A.3)

where

    π̄_ℓ = Σ_{i∈Q_ℓ} π_i   (1 ≤ ℓ ≤ |Q̄|).

In terms of the above notation, weak lumpability can then be characterized in a manner similar to its discrete-time analog.

Theorem A.1: Let X be a time-homogeneous Markov process with transition function P and a fixed initial state probability distribution p. Let Z be a functional of X and, for all t ≥ 0, define

    A_p(t) = {M(t_1, ..., t_k; q_1, ..., q_k) | 0 ≤ t_1 < ··· < t_k ≤ t and q_1, ..., q_k ∈ Q̄}.

Then Z is a Markov process if and only if, for all s ≥ 0 and for all π, π' ∈ A_p(t),

    π̄ = π̄'  implies  (π·P(s))¯ = (π'·P(s))¯.    (A.4)

Theorem A.1 can be proved in a way similar to that of its discrete-time analog (see [52], pp. 132-134). To illustrate its application, let us suppose that the system in question is a multicomputer comprised of three identical computer modules. Suppose further that the modules fail independently and that each fails permanently with a constant failure rate λ. Then we can take the base model X to be the Markov process depicted by the state-transition-rate diagram of Figure A.1.

    Figure A.1  Markov Model of a Multicomputer
    (state-transition-rate diagram: 1 → 2 → 3 → 4 with rates 3λ, 2λ, and λ)

As for operational rates, let us assume they are normalized so that at full capacity the rate is 1, and with the loss of one or two modules the rate is 1/2; the loss of a third module results in total failure. Accordingly, the operational structure here is the function

    i    f(i)
    1    1
    2    1/2
    3    1/2
    4    0

and hence the functional Z takes values in the state set Q̄ = {1, 1/2, 0}. On taking the inverse of f, these states correspond as follows to lumps of Q:

    1 ↔ {1},   1/2 ↔ {2,3},   0 ↔ {4}.

If we now examine the probabilistic nature of Z, we find that the conditional probability

    Pr[Z_{t+s} = 0 | Z_t = 1/2]

depends on the time that Z enters state 1/2 from state 1 if the latter event is possible (i.e., if the probability of initially being in state 1 is nonzero). Thus, for example, if X is initially in state 1 with probability 1, i.e., p = [1 0 0 0] is the initial state probability distribution, then we have such a dependence (on the past history of Z) and therefore Z is not a Markov process. On the other hand, let us suppose the initial distribution is p = [0 0 1 0], which is not a likely choice from a functional point of view, but it serves to illustrate the role of p. In this case A_p(t) (as defined in the statement of Theorem A.1) is the same for any time t in T, i.e., it is the set

    A_p(t) = {[0 0 1 0], [0 0 0 1]}.

Accordingly, the conditions of Theorem A.1 are vacuously satisfied, and therefore Z is a Markov process for this choice of p. Moreover, it should be obvious that Z, in this case, is time-homogeneous. Other distributions, such as p = [0 1 0 0], can be shown to result in Markov processes that are not time-homogeneous.

Regarding the second question (Q2), a necessary and sufficient condition can be obtained by extending the previous theorem to arbitrary initial distributions. More precisely, it can be shown that Z is a Markov process whatever the initial distribution if and only if, for all t, s ≥ 0 and any initial distribution p, condition (A.4) holds for all π, π' in A_p(t). Although this characterization is useful from a conceptual point of view, it is difficult to use in practical applications. A more desirable form of this result can be found in [53] (pp. 1113-1114, Theorem 4), which assumes that X has a finite state space. (The theorem was generalized later in [54] to allow for an arbitrary state space.) Stating the desired form of this result in terms of the notation defined above, we have:

Theorem A.2: Let X be a time-homogeneous Markov process with generator matrix A = [a_ij] and let Z be a functional of X determined by f. Then Z is a Markov process, whatever the initial distribution of X, if and only if, for each

q ∈ Q̄, taken separately, either

(i) for all i, j ∈ Q such that f(i) ≠ q and f(j) = q,

    a_ij = 0,

or

(ii) for all r ∈ Q̄ such that r ≠ q, the sum

    Σ_{f(j)=r} a_ij

is the same for all i ∈ Q such that f(i) = q.

Although the conditions of Theorem A.2 guarantee that Z is a Markov process relative to any initial distribution p for X, note that the specific nature of Z (as specified by its transition function) will generally depend on p. Moreover, the process Z need not be time-homogeneous. To illustrate Theorem A.2 and the above observations, let us again consider the Markov process X having the state-transition-rate diagram given by Figure A.1. The generator matrix of X is the 4×4 matrix

    A = [ −3λ   3λ    0    0 ]
        [  0   −2λ   2λ    0 ]
        [  0    0    −λ    λ ]
        [  0    0     0    0 ].

Suppose, however, that the operational structure here is one that corresponds to triplication with voting (TMR), i.e., the function

    i    f(i)
    1    1
    2    1
    3    0
    4    0.

Then Q̄ = {1, 0} and, applying Theorem A.2, we see that state 1 (i.e., lump {1,2}) satisfies condition (i) and state 0 (i.e., lump {3,4}) satisfies condition (ii). Hence the functional Z is a Markov process. To determine the probabilistic nature of Z, let us rename state 0 (in Q̄) as state 2 (permitting the use of standard matrix notation) and let P̄(s,t) denote the transition function of Z, i.e.,

    P̄(s,t) = [p̄_ij(s,t)]   (s ≤ t)

where

    p̄_ij(s,t) = Pr[Z_t = j | Z_s = i].

Then, relative to an initial distribution p = [p_1 p_2 p_3 p_4] for X, if we let

    d = p_1 / (p_1 + p_2),

it can be shown that the matrix P̄(s,t) has the following entries:

    p̄_11(s,t) = e^{−2λ(t−s)} · (1 + 2d(1 − e^{−λt})) / (1 + 2d(1 − e^{−λs}))
    p̄_12(s,t) = 1 − p̄_11(s,t)
    p̄_21(s,t) = 0
    p̄_22(s,t) = 1.    (A.5)

From the above equations, we see that the transition function P̄(s,t) depends on d and hence on the initial distribution p = [p_1 p_2 p_3 p_4]. Moreover, we observe that Z is time-homogeneous (i.e., the values of P̄(s,t) depend only on the time difference t−s) only when d = 0. In other words, by the definition of d, Z is time-homogeneous if and only if p_1 = 0, i.e., there is a zero probability that the underlying process X is initially in state 1 (all three modules fault-free). However, with our interpretation of Z as a TMR model, this special case is pathological, and hence for most practical purposes Z will not be time-homogeneous.
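The entries (A.5) can be checked numerically against the base model (a sketch of ours; the failure rate and initial distribution are arbitrary choices). For the pure-death chain of Figure A.1, Pr[Z_t = 1 | Z_s = 1] is the ratio of the probabilities of occupying lump {1,2} at times t and s:

    import numpy as np

    lam = 0.01
    p = np.array([0.5, 0.3, 0.15, 0.05])   # hypothetical initial distribution
    d = p[0] / (p[0] + p[1])

    def lump_prob(t):
        """Pr[X_t = 1] + Pr[X_t = 2] for the chain of Figure A.1."""
        p1 = p[0] * np.exp(-3 * lam * t)
        p2 = (3 * p[0] * (np.exp(-2 * lam * t) - np.exp(-3 * lam * t))
              + p[1] * np.exp(-2 * lam * t))
        return p1 + p2

    def p11_direct(s, t):
        """Pr[Z_t = 1 | Z_s = 1], computed from the base model."""
        return lump_prob(t) / lump_prob(s)

    def p11_closed(s, t):
        """Equation (A.5)."""
        return (np.exp(-2 * lam * (t - s))
                * (1 + 2 * d * (1 - np.exp(-lam * t)))
                / (1 + 2 * d * (1 - np.exp(-lam * s))))

    assert abs(p11_direct(2.0, 7.0) - p11_closed(2.0, 7.0)) < 1e-12
    # Time-inhomogeneity: same t - s, different values (unless d = 0).
    print(p11_closed(0.0, 5.0), p11_closed(5.0, 10.0))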

Finally, turning to the question of strong lumpability (see Q3 above), the answer can be characterized by removing condition (i) of Theorem A.2 and modifying the proof to accommodate this change. More precisely, we have

Theorem A.3: Let X be a time-homogeneous Markov process with generator matrix A = [a_ij] and let Z be a functional of X determined by f. Then Z is a Markov process, whatever the initial distribution p of X and with a transition function that is independent of p, if and only if, for each q ∈ Q̄, the following is satisfied: for all r ∈ Q̄ such that r ≠ q, the sum

    Σ_{f(j)=r} a_ij    (A.6)

is the same for all i ∈ Q such that f(i) = q.

To illustrate Theorem A.3, suppose X is specified by the generator matrix

    A = [ −3λ   λ    λ    λ    0    0    0 ]
        [  0   −2λ   0    0    λ    λ    0 ]
        [  0    0   −2λ   0    λ    0    λ ]
        [  0    0    0   −2λ   0    λ    λ ]    (A.7)
        [  0    0    0    0    0    0    0 ]
        [  0    0    0    0    0    0    0 ]
        [  0    0    0    0    0    0    0 ]

and f is the function

    q    f(q)
    1    1
    2    2
    3    3
    4    3
    5    4
    6    4
    7    4.    (A.8)

Testing condition (A.6) for states 1 and 2 in Q̄, we see that it holds trivially since these states correspond to singleton lumps. As for state 3 ∈ Q̄ (corresponding to lump {3,4}), with respect to states 1, 2 ∈ Q̄ the sums are zero for both i = 3 and i = 4; with respect to state 4 ∈ Q̄ the sum is 2λ for both i = 3 and i = 4. Thus condition (A.6) holds for state 3. Finally, (A.6) is likewise satisfied for state 4 ∈ Q̄, and we conclude that Z is a Markov process with a transition function that is independent of p.

In general, if X is strongly lumpable (as characterized by Theorem A.3), it is easily shown that Z must inherit the time-homogeneity of X. In other words, a strongly lumped process will always be time-homogeneous and, accordingly, it can be specified by a constant generator matrix. More precisely, if we rename the states in Q̄ (if not already so named) with the integers from 1 to |Q̄|, the generator matrix Â = [â_qr] of Z can be constructed directly from A, where entry â_qr (q ≠ r) is given by the invariant sum of condition (A.6) for any i such that f(i) = q. (The diagonal entries â_qq are then determined by the condition that the rows must sum to zero.) Thus, for the example just considered (see (A.7) and (A.8)), the generator matrix of Z is the 4×4 matrix

    Â = [ −3λ   λ   2λ    0 ]
        [  0  −2λ    0   2λ ]
        [  0    0  −2λ   2λ ]
        [  0    0    0    0 ].

As illustrated through the above examples, a strongly lumped process is clearly the most desirable type of operational model. On the other hand, by Theorem A.3, it is evident that such models require a relatively restricted "match" between the probabilistic nature of the base model (as specified by A) and the operational structure f. The conditions of Theorem A.2 are somewhat weaker although, when satisfied, the transition rates of the resulting Markov functional are generally time-varying and dependent on the initial distribution of the underlying process. Of significance here is that, even without strong lumpability, one can obtain operational models that are Markovian and admit feasible, closed-form analytic solutions (see Equations (A.5), for example). What must be

used here, then, are solution techniques for arbitrary (discrete-state) Markov processes, as opposed to the more special (and much more familiar) techniques that apply only to time-homogeneous Markov processes.

Finally, regarding weak lumpability (Theorem A.1), the requirement here is even less restrictive. However, depending on A, p, and f, it may be difficult to decide whether the condition of Theorem A.1 is satisfied. Moreover, we currently know of no general means of solving such models without resorting to detailed computations at the base model level. The utility of weak lumpability is also curtailed by the fact that the initial state distribution of the base model is fixed. This may be satisfactory in certain applications, but one often wishes to examine the influence of different initial distributions. In such cases, one must derive a solution for each of the given distributions, provided, of course, that each admits weak lumpability.

Theorems A.1-A.3 thus provide formal support for what we and others in the field have observed through experience: at higher, more user-oriented levels of abstraction, it is difficult to maintain a Markovian representation of system behavior. As a consequence, we should seek means of accommodating operational models (functionals) that are not Markovian. The latter task is less formidable than it might appear if we bear in mind that, when evaluating a system S, an operational model Z plays an intermediate role in support of a specific performance variable Y. Thus, our knowledge of Z can be restricted to that required to solve for the probability distribution of Y, i.e., the performability of S. The latter observation serves as the guiding principle for the work described in Chapter 3.

REFERENCES

[1] J. F. Meyer, "On evaluating the performability of degradable computing systems," Proc. 1978 Int'l Symp. on Fault-Tolerant Computing, Toulouse, France, pp. 44-49, June 1978.

[2] J. F. Meyer, "On evaluating the performability of degradable computing systems," IEEE Trans. Comput., vol. C-29, no. 8, pp. 720-731, Aug. 1980.

[3] D. Ferrari, Computer Systems Performance Evaluation. Englewood Cliffs, NJ: Prentice-Hall, 1978.

[4] H. Kobayashi, Modeling and Analysis: An Introduction to System Performance Evaluation Methodology. Reading, MA: Addison-Wesley, 1978.

[5] W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability modeling techniques for self-repairing computing systems," Proc. 24th ACM National Conference, pp. 295-305, Aug. 1969.

[6] F. P. Mathur, "On reliability modeling and analysis of ultra-reliable fault-tolerant digital systems," IEEE Trans. Comput., vol. C-20, no. 11, pp. 1376-1382, Nov. 1971.

[7] Y.-W. Ng and A. Avizienis, "A reliability model for gracefully degrading and repairable fault-tolerant systems," Proc. 1977 Int'l Symp. on Fault-

Tolerant Computing, Los Angeles, CA, pp. 22-28, June 1977.

[8] A. Costes, C. Landrault, and J. C. Laprie, "Reliability and availability models for maintained systems featuring hardware failures and design faults," IEEE Trans. Comput., vol. C-27, no. 6, pp. 548-560, June 1978.

[9] M. D. Beaudry, "Performance related reliability measures for computing systems," Proc. 1977 Int'l Symp. on Fault-Tolerant Computing, Los Angeles, CA, pp. 16-21, June 1977.

[10] H. Mine and K. Hatayama, "Performance evaluation of a fault-tolerant computing system," Proc. 1979 Int'l Symp. on Fault-Tolerant Computing, Madison, WI, pp. 59-62, June 1979.

[11] R. A. Howard, Dynamic Probabilistic Systems, Vol. II: Semi-Markov and Decision Processes. New York, NY: John Wiley, 1971.

[12] F. A. Gay and M. L. Ketelsen, "Performance evaluation for gracefully degrading systems," Proc. 1979 Int'l Symp. on Fault-Tolerant Computing, Madison, WI, pp. 51-58, June 1979.

[13] J. M. De Souza, "A unified method for the benefit analysis of fault-tolerance," Proc. 1980 Int'l Symp. on Fault-Tolerant Computing, Kyoto, Japan, pp. 201-203, Oct. 1980.

[14] X. Castillo and D. P. Siewiorek, "A performance-reliability model for computing systems," Proc. 1980 Int'l Symp. on Fault-Tolerant Computing, Kyoto, Japan, pp. 187-192, Oct. 1980.

[15] W. B. Davenport, Jr., Probability and Random Processes. New York, NY: McGraw-Hill, 1970.

[16] E. Wong, Stochastic Processes in Information and Dynamical Systems. New York, NY: McGraw-Hill, 1971.

[17] J. L. Doob, Stochastic Processes. New York, NY: John Wiley, 1953.

[18] R. E. Barlow and F. Proschan, Statistical Theory of Reliability and Life Testing. New York, NY: Holt, Rinehart and Winston, 1975.

[19] J. F. Meyer, "Models and techniques for evaluating the effectiveness of aircraft computing systems," Semiannual Status Rep. 3, NASA Rep. CR-158992 (NTIS Rep. N79-17564/2GA), Jan. 1978.

[20] J. F. Meyer, "Models and techniques for evaluating the effectiveness of aircraft computing systems," Semiannual Status Rep. 4, NASA Rep. CR-158993 (NTIS Rep. N79-17563/4GA), July 1978.

[21] L. Kleinrock, Queueing Systems, Volume II: Computer Applications. New York, NY: John Wiley, 1976.

[22] K. L. Chung, Markov Chains with Stationary Transition Probabilities. Berlin: Springer-Verlag, 1960.

[23] E. Cinlar, Introduction to Stochastic Processes. Englewood Cliffs, NJ: Prentice-Hall, 1975.

[24] Discussions with F. J. Beutler.

[25] A. L. Hopkins, Jr., T. B. Smith, III, and J. H. Lala, "FTMP - A highly reliable fault-tolerant multiprocessor for aircraft," Proc. IEEE, vol. 66, no. 10, pp. 1221-1239, Oct. 1978.
[26] A. Avizienis and L. Chen, "On the implementation of N-version programming for software fault-tolerance during program execution," Proc. 1977 COMPSAC (Int'l Computer Software and Applications Conf.), Chicago, IL, pp. 149-155, Nov. 1977.
[27] T. F. Arnold, "The concept of coverage and its effect on the reliability model of a repairable system," IEEE Trans. Comput., vol. C-22, no. 3, pp. 251-254, March 1973.
[28] C. Weitzman, Distributed Micro/Minicomputer Systems. Englewood Cliffs, NJ: Prentice-Hall, 1980.
[29] D. Katsuki, E. S. Elsam, W. F. Mann, E. S. Roberts, J. C. Robinson, F. S. Skowronski, and E. W. Wolf, "Pluribus - An operational fault-tolerant multiprocessor," Proc. IEEE, vol. 66, no. 10, pp. 1146-1159, Oct. 1978.
[30] J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt, P. M. Melliar-Smith, R. E. Shostak, and C. B. Weinstock, "SIFT: Design and analysis of a fault-tolerant computer for aircraft control," Proc. IEEE, vol. 66, no. 10, pp. 1240-1255, Oct. 1978.
[31] A. Avizienis, "Fault-tolerance: The survival attribute of digital systems," Proc. IEEE, vol. 66, no. 10, pp. 1109-1125, Oct. 1978.

[32] H. Hecht, "Fault-tolerant software for real-time applications," ACM Computing Surveys, vol. 8, no. 4, pp. 391-407, Dec. 1976.
[33] D. P. Siewiorek, V. Kini, H. Mashburn, S. McConnel, and M. Tsao, "A case study of C.mmp, Cm* and C.vmp: Part I - Experiences with fault tolerance in multiprocessor systems," Proc. IEEE, vol. 66, no. 10, pp. 1178-1199, Oct. 1978.
[34] H. B. Baskin, B. R. Borgerson, and R. Roberts, "PRIME - A modular architecture for terminal-oriented systems," 1972 Spring Joint Comput. Conf., AFIPS Conf. Proc. Vol. 40. Washington, DC: Spartan, pp. 431-437.
[35] J. J. Shedletsky and E. J. McCluskey, "The error latency of a fault in a sequential digital circuit," IEEE Trans. Comput., vol. C-25, no. 6, pp. 655-659, June 1976.
[36] B. R. Borgerson and R. F. Freitas, "A reliability model for gracefully degrading and standby-sparing systems," IEEE Trans. Comput., vol. C-24, no. 5, pp. 517-525, May 1975.
[37] J. Losq, "Effect of failure on gracefully degradable systems," Proc. 1977 Int'l Symp. on Fault-Tolerant Computing, Los Angeles, CA, pp. 29-34, June 1977.
[38] J. C. Laprie, "Reliability and availability of repairable structures," Proc. 1975 Int'l Symp. on Fault-Tolerant Computing, Paris, France, pp. 87-92, June 1975.
[39] R. S. Ratner, E. B. Shapiro, H. M. Zeidler, S. E. Wahlstrom, C. B. Clark, and J. Goldberg, "Design of a fault tolerant airborne digital computer, Volume II - Computational Requirements and Technology," NASA Contract NAS1-10920, Stanford Research Institute, Menlo Park, CA, Oct. 1973.
[40] D. Gross and C. M. Harris, Fundamentals of Queueing Theory. New York, NY: John Wiley, 1974.
[41] W. N. Toy, "Fault-tolerant design of local ESS processors," Proc. IEEE, vol. 66, no. 10, pp. 1126-1145, Oct. 1978.
[42] H. S. Winokur, Jr., and L. J. Goldstein, "Analysis of mission-oriented systems," IEEE Trans. Reliability, vol. R-18, no. 4, pp. 144-148, Nov. 1969.
[43] J. L. Bricker, "A unified method for analyzing mission reliability for fault tolerant computer systems," IEEE Trans. Reliability, vol. R-22, no. 2, pp. 72-77, June 1973.
[44] J. D. Esary and H. Ziehms, "Reliability analysis of phased missions," Reliability and Fault-Tree Analysis. Philadelphia, PA: SIAM, pp. 213-236, 1975.
[45] G. R. Burdick, J. B. Fussell, D. M. Rasmuson, and J. R. Wilson, "Phased mission analysis: A review of new developments and applications," IEEE Trans. Reliability, vol. R-26, no. 1, pp. 43-49, April 1977.
[46] R. A. Ballance and J. F. Meyer, "Functional dependence and its application to system evaluation," Proc. 1978 Conf. on Information Sciences and Systems, The Johns Hopkins University, Baltimore, MD, March 1978.
[47] Z. W. Birnbaum, J. D. Esary, and S. C. Saunders, "Multicomponent systems and structures and their reliability," Technometrics, vol. 3, no. 1, pp. 55-77, Feb. 1961.
[48] G. Birkhoff, Lattice Theory. Providence, RI: American Math. Society, 1966.
[49] J. F. Meyer, D. G. Furchtgott, and L. T. Wu, "Performability evaluation of the SIFT computer," IEEE Trans. Comput., vol. C-29, no. 6, pp. 501-509, June 1980.
[50] G. Gratzer, General Lattice Theory. Stuttgart, Germany: Birkhauser, 1978.
[51] W. Feller, An Introduction to Probability Theory and Its Applications, Volume I. New York, NY: John Wiley, 1957.
[52] J. G. Kemeny and J. L. Snell, Finite Markov Chains. Princeton, NJ: Van Nostrand, 1960.
[53] C. J. Burke and M. Rosenblatt, "A Markov function of a Markov chain," Annals of Math. Statistics, vol. 29, pp. 1112-1122, 1958.

[54] M. Rosenblatt, Markov Processes: Structure and Asymptotic Behavior. New York, NY: Springer-Verlag, 1971.
[55] J. D. Musa, "A theory of software reliability and its application," IEEE Trans. Software Engineering, vol. SE-1, no. 3, Sept. 1975.
[56] B. Littlewood, "A semi-Markov model for software reliability with failure costs," Proc. Symp. Comput. Software Eng., New York, NY, pp. 281-300, April 1976.
