THE UNIVERSITY OF MICHIGAN COMPUTING RESEARCH LABORATORY

PERFORMABILITY EVALUATION OF COMPUTING SYSTEMS USING REWARD MODELS

D. G. Furchtgott and J. F. Meyer

CRL-TR-27-83

August 1983

Room 1079, East Engineering Building, Ann Arbor, Michigan 48109 USA
Tel: (313) 763-8000


PERFORMABILITY EVALUATION OF COMPUTING SYSTEMS USING REWARD MODELS

D. G. Furchtgott and J. F. Meyer
Computing Research Laboratory
The University of Michigan
Ann Arbor, MI 48109

ABSTRACT

An evaluation algorithm is developed for a broad class of performability models wherein system performance is identified with "reward." More precisely, for a system S and a utilization period T, the performance variable of the model is the reward derived from using S during T. The state behavior of S is represented by a finite-state stochastic process (the base model); reward is determined by reward rates associated with the states of the base model. It is assumed that the corresponding reward model is a nonrecoverable process in the sense that a future state (reward rate) of the model cannot be greater than the present state. For this model class, we obtain a general method for determining the probability distribution function of the performance (reward) variable and, hence, the performability of the corresponding system. Moreover, this is done for bounded utilization periods, an assumption which demands a relatively complex solution. The result is an integral expression which can be solved either analytically or numerically. A program written for numerical solutions is discussed and an example illustrating the method is presented.

I. INTRODUCTION

Much of the recent work in analytical evaluation of degradable fault-tolerant computing systems has concerned the evaluation of the total "benefit" derived from the computing system during some specified interval of time [1]-[19]. In these studies, a variety of concepts and terminology have been used to formalize benefit, including "work" [1], "capacity" [2], "reward" [7], and "performance" [4]. Other studies have concerned the evaluation of "cost" in conjunction with the evaluation of some form of benefit [8], [9], [11], [12], [15], [17]. Much of this work [3], [5]-[9], [11], [12], [16]-[18], [20] has considered systems in steady state which are utilized over an unbounded period of time: repair of components is allowed (indeed, required), and the evaluated quantity is the expected rate at which benefit is derived under steady-state conditions. However, many important applications of degradable systems have bounded utilization periods. In such cases, a transient solution of the system benefit is usually required. Moreover, in many such applications, the user is interested in how benefit is distributed probabilistically (i.e., the probability distribution function of the benefit variable). This further complicates the evaluation process as compared, say, with the evaluation of expected benefit. Conforming with terminology used in some of our earlier studies [10], [14], [21]-[23], we view system "performance" as a relatively general concept which includes various types of benefit as possible specializations. Formally, we define the performance of a system S over a specified time period T to be a random variable Y taking values in a set A. Elements of A are accomplishment levels representing how well S performs during T. Relative to a designated performance variable Y, performability is the probability measure p induced by Y where, for any measurable set B of accomplishment levels (B ⊆ A), p(B) is the probability that S performs at a level in B.
Evaluation of performability thus requires solution of the probability distribution function of Y. This solution is based on an underlying stochastic process X, called the base model (of S), which represents the dynamics of the system's structure, internal state, and environment during utilization. Specific interpretations of performability thus depend on how values of Y (the accomplishment levels A) are interpreted. In particular, if elements of A represent levels of benefit then performability describes the system's ability to benefit its user. In the discussion that follows, we consider this type of performability where benefit is equated with reward (see Howard [24] for example) derived from using the system during a specified period T. We are interested in the case where the time period T is bounded, although unbounded periods are considered during an intermediate step of the evaluation process. Also, by the definition of performability (see above), we seek to determine (either analytically or numerically) the probability distribution function of the reward variable Y. Prior work relating to this problem has dealt primarily with evaluations of expected reward, i.e., the expected value of the reward variable Y. This background is nevertheless relevant and two approaches are of particular interest. The first is based on the concept of the "potential" of a semi-Markov process; see Cinlar [25] for an introductory discussion of potentials. Gay [20] (see also Gay and Ketelsen [3]) has applied potentials of Markov processes to modeling the performance of degradable systems. Most of this work deals with steady-state solutions of systems having repairable components. The second approach considers transient solutions of the expected reward and is based on semi-Markov reward models such as those discussed by Howard [24]. The formulation is in terms of a set of integrations and variables which must be solved in order to determine the expected reward.
Because of the many inherent convolutions, the equations can be conveniently written in terms of Laplace transforms. However, the equations are generally difficult to solve (and if done in transform space, to invert), and so practical applications are limited. De Souza [7] has applied Howard's work to fault-tolerant computing systems via a unified reliability/cost model; however, these methods presume that the system is in steady state. Though some of the underlying characteristics (especially the nature of transition graphs) of these potential and reward models are allowed to be more general than those considered in this paper, these models are restricted to be semi-Markov and, more importantly, the evaluation results are

expected values rather than probability distribution functions. Relatively little work has been done on obtaining transient solutions of the probability distribution function (PDF) of the reward variable. Notably, Cherkesov [26] presents an elegant solution for the probability of accumulating a minimum reward during a finite interval when the underlying process is semi-Markov. The length of the interval can itself be random. However, the solution is in terms of a system of equations of two-dimensional Laplace-Carson transforms (see p. 39 of [27] for a description of this transform). Solving and inverting these equations is intractable except in the simplest of cases. Instead, Cherkesov employs the transforms to obtain expected values. This work is mainly of theoretical interest. In the presentation that follows, we obtain a method for determining the PDF of the reward variable, subject to certain conditions imposed on the base model and on the corresponding reward model. Specifically, we assume that the base model is a finite-state process and is "acyclic" in the sense that states are not revisited by the process. However, the process need not be Markov or even semi-Markov. We assume further that the corresponding reward model is a nonrecoverable process in the sense that a future state (reward rate) of the model cannot be greater than the present state. These conditions reflect properties that are typically exhibited by degradable, nonrepairable computing systems. For this model class, we are able to obtain the probability distribution function of the performance (reward) variable and, hence, the performability of the corresponding system. Moreover, this is done for bounded utilization periods, an assumption which demands a relatively complex solution. The approach is based on the strategy used in [10], [14] to solve performability, once the performance (reward) rates were obtained via a queueing model analysis.
(There, in reward terms, the reward rate of a given structure state was taken to be the normalized throughput rate; performance was viewed as average reward, i.e., total reward divided by the duration of utilization.) This earlier work, however, does not suggest specific techniques for implementing the prescribed steps. The computational example presented in [10], [14] (namely a 3-state process) is solved using graphical arguments to determine appropriate regions of integration. Such an approach becomes more difficult when the number of states is four and becomes intractable when the number of states is five or more. This paper describes a solution technique wherein regions of integration are determined in a tractable, algorithmic fashion. Section II discusses the model; Section III presents the solution and describes a program we have written for implementing the solution procedure. Finally, Section IV presents an example which illustrates the use of this methodology.

II. THE MODEL

A. Reward Models

We consider reward models of the type discussed in [24], although the stochastic process need not be semi-Markovian. We restrict our attention, however, to the case where reward is determined solely by reward rates (referred to in [28]-[30] as "operational rates") associated with states of the model. We assume further that these rates are constant, i.e., time-invariant. More general reward structures (time-varying rates and "bonuses" associated with state transitions) might also be accommodated by our approach, but they are not considered in the presentation that follows. A state, in this context, typically represents a certain "operational status" of the system, including configurations wherein the system is not operating (system failure). In a given state, the associated reward rate reflects the pace at which the system rewards its user.
Such rates can thus be identified with aspects of system performance such as productivity, responsiveness, and utilization, or, at a higher level, with broader measures such as economic benefit. We discuss reward rates in greater detail below: what such rates can represent and how to obtain them. More formally, let X_τ be the random variable denoting the state of the system at time τ.

Accordingly, the base model of the system is the stochastic process

    X = { X_τ | τ ∈ [0,∞) }.   (1)

For reasons given later in the discussion, we restrict X to be finite-state, i.e., X_τ ∈ Q = {q_N, q_{N-1}, ..., q_0}. Suppose further that each q_i ∈ Q has an associated reward rate r_i (a nonnegative real number). Then there is a natural function

    r: Q → ℝ⁺   (2)

where r(q_i) = r_i is the rate associated with q_i; r is referred to as a reward structure of X. Let X′_τ, τ ∈ [0,∞), be a random variable [taking values in r(Q)] representing the reward rate of S at time τ, that is,

    X′_τ = r(X_τ).   (3)

Then the stochastic process

    X′ = { X′_τ | τ ∈ [0,∞) }   (4)

is a reward model of the system. Such reward models are a special class of the type of "operational models" investigated by Wu [30]. (Also, compare with the "capacity models" of Gay [20], Gay and Ketelsen [3].) The performability models we consider consist of a reward model X′ along with a performance (reward) variable Y representing the total reward accrued during utilization of the system. More precisely, given a reward model X′ and a utilization period T = [0,t], we take Y to be the random variable

    Y = ∫₀ᵗ X′_τ dτ.   (5)

Equivalently, in terms of the base model X and the reward structure r,

    Y = ∫₀ᵗ r(X_τ) dτ.   (6)

B. An Example of a Reward-Based Performability Model

Consider a computing system S consisting of N processors, each processor having a computational capacity of 6 jobs/hour. In addition, there is an air conditioner which does not affect the computational capacity of the system (though it may affect the failure characteristics of the processors). The base model X is a 2(N+1)-state (not necessarily Markov) stochastic process whose state-transition diagram appears in Fig. 1. In state (i,j), i ∈ {0,1,...,N} denotes the number of operational processors and j is 1 if the air conditioner is operational and 0 otherwise. Assuming system capacity is proportional to the number of operational processors, the reward structure is the function r(i,j) = 6i.
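As a concrete illustration of Eq. (6), the following sketch (not part of the original presentation; all names and the sample trajectory are illustrative) computes the accumulated reward Y for one piecewise-constant reward trajectory:

```python
# Sketch: accumulated reward Y = integral of r(X_tau) over [0, t] for a
# trajectory with piecewise-constant reward rates, per Eq. (6).
# The trajectory is given as (state, entry_time) pairs; names are illustrative.

def accumulated_reward(trajectory, reward_rate, t):
    """Total reward over the utilization period T = [0, t].

    trajectory  -- list of (state, entry_time), entry times increasing from 0
    reward_rate -- dict mapping state -> nonnegative reward rate r(q)
    t           -- end of the utilization period
    """
    total = 0.0
    for k, (state, entered) in enumerate(trajectory):
        if entered >= t:
            break
        # Sojourn in this state is truncated at t (bounded utilization period).
        left = trajectory[k + 1][1] if k + 1 < len(trajectory) else float("inf")
        total += reward_rate[state] * (min(left, t) - entered)
    return total

# N = 2 processors at 6 jobs/hour, so r(i, j) = 6 i regardless of j:
rates = {(2, 1): 12.0, (1, 1): 6.0, (0, 1): 0.0}
path = [((2, 1), 0.0), ((1, 1), 2.0), ((0, 1), 5.0)]
print(accumulated_reward(path, rates, t=4.0))  # 2 h at 12 + 2 h at 6 = 36.0
```

Note that the third state is never charged: the period ends at t = 4 while the process is still in state (1,1), which is exactly the truncation effected by the W_i variables introduced in Section III.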
Accordingly, the reward model has N+1 states corresponding to the N+1 different reward rates. Relative to a specified utilization period T, the performance (reward) variable Y is the

number of jobs processed during T.

C. Determination of Reward Rates

An important consideration when applying the methodology described in this paper is the interpretation and derivation of the reward rates. In this section, we delineate several possible interpretations and quantifications of reward. This survey is by no means complete; for a slightly more involved discussion, see [31]. 1) Capacity and Workload: One basic measure of a computing system's reward rate is the speed at which the system is able to perform computations. This is commonly referred to as capacity. The works of Beaudry [2], Gay [20], Gay and Ketelsen [3], Castillo and Siewiorek [6], Oda, Tohma, and Furuya [12], and Munarin [18] model capacity. A related concept is that of workload, which is the demand for computation placed on the system by the environment. Studies by Gay [20] and Gay and Ketelsen [3] examine workload models. 2) Queues and Networks of Queues: A detailed view of the system's behavior often can be obtained by studying queueing models of the system. Examples of such models abound in the performance evaluation literature; see Kobayashi [32], Ferrari [33], Chandy and Sauer [34], and Trivedi [35]. For instance, S could be a multiprocessor (k servers) and a buffer (queue) for storing arriving tasks; S is then a G/G/k queueing system. If arrival and service rates are exponential and the buffer length is L, then S is the M/M/k/L+k system considered by Meyer [14]. The reward rate and total reward associated with a queueing system depend on what is of concern to the analyst. One could associate reward with throughput rate, system (or service) time of customers, server utilization, number of customers in the system, etc. In [14], for instance, the reward was based on a "normalized" throughput rate.
3) Profit and Expense: An important measure of the reward derived from a system is the profit (say, in terms of dollars) obtained from the system, or, in a negative context, the expense of the system. Each system state has an associated rate of profit or expense. Such values may be obtained, for instance, by economic analysis or by utility techniques, e.g., Raiffa [36]. Koren and Berg [8] and Huslende [17] have based analyses on economic factors. 4) Control Theoretic: When the system is a real-time control processor, the reward rate associated with a given state could be specified as a function of such control theoretic concepts as response time. For instance, Krishna and Shin [15] have examined such descriptions, while Gai and Adams [19] have discussed tradeoffs between "optimal" and "robust" response times.

D. Nonrecoverable Processes

Consider the "desired" behavior of a degradable computing system used over a bounded period of time. Given a set of available resources, a well-designed degradable computing system configures itself to maximize the reward rate. Thus, when a component fails, the system does not reconfigure itself in a manner that causes the reward rate to be greater than before the failure. We will assume there is no repair (i.e., component replacement via an external source), so an increase in the reward rate due to the acquisition of additional components cannot occur. Transient faults that could lower the reward rate temporarily, thereby raising the rate again when the fault is corrected, are not considered. Under these conditions, the reward rate of the system is non-increasing in time, which has the interpretation that the system does not become a "better" system after a change in state (e.g., a component failure). These assumptions are formalized by requiring that the reward model X′ be a nonrecoverable process [30] relative to the usual ordering of real numbers. In other words, for all states

q_1, q_2 ∈ Q and all times τ, v ∈ [0,∞) such that τ < v, we require that

    Prob[X′_τ = r(q_1) and X′_v = r(q_2)] > 0  implies  r(q_2) ≤ r(q_1).   (7)

Consider the process X of the multiprocessor/air conditioner example discussed above. Clearly, by the transition diagram of Fig. 1 and the definition of the reward structure, the reward rate of the process cannot increase (with positive probability). Hence, the associated reward model X′ is an example of a nonrecoverable process.

E. Acyclic Processes

An essential step of our model solution procedure (Section III) is to first evaluate conditional performabilities, conditioned on the sequence of states entered by the process during the unbounded period [0,∞). For this reason, we restrict our attention to finite-state base models. To ensure a finite number of state sequences and hence conditions, we also require that the base model be "acyclic" in the sense that states are not revisited by the process. More precisely, let N_i be the random variable denoting the total number of visits of the process X to state q_i ∈ Q during the interval [0,∞). X is an acyclic process if, for all q_i ∈ Q, Prob[N_i > 1] = 0. When evaluating fault-tolerant systems where repair is not allowed, the acyclic condition is not restrictive since once the system leaves a state due to a fault, the system can never return to that state. Often, however, one is interested in analyzing systems with transient faults or temporary failures. Models of such systems are generally cyclic because if the system recovers from the fault, the system returns to some previously entered state. Such recoveries can occur an unbounded number of times during [0,t], resulting in an infinite number of state sequences. One approach for approximating cyclic base models is to "unravel" the process into an acyclic base model. This is done by simply choosing a finite set of state sequences to consider.
For example, the set chosen can be those sequences which have at least some threshold probability of occurring during the utilization period [0,t]. In other words, those sequences of sufficiently low probability are ignored. Alternatively, the set could include all sequences less than or equal to some specified length. A second approach is to "lump" all the states in each cycle. Because each state in a cycle must have the same reward rate (or else the reward model would not be nonrecoverable), the reward rate of each lumped state is the rate of each of the component states. The difficulty with this approach is determining the equivalent stochastic behavior of the lumped model.
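For an acyclic base model, the finite set of state sequences can be enumerated directly by depth-first search over the transition graph. A minimal sketch (not from the paper; the transition structure shown is an assumed reading of Fig. 1 for N = 2):

```python
# Sketch: enumerate all state sequences u of an acyclic base model, from the
# initial state to an absorbing state.  `succ` maps each state to its possible
# successor states; absorbing states map to an empty list.

def state_sequences(succ, initial):
    """Return every state sequence from `initial` to an absorbing state."""
    seqs = []

    def extend(path):
        state = path[-1]
        if not succ[state]:          # absorbing state: sequence complete
            seqs.append(tuple(path))
            return
        for nxt in succ[state]:
            extend(path + [nxt])     # acyclic, so no visited-set is needed

    extend([initial])
    return seqs

# Assumed transitions for two processors and one air conditioner (Fig. 1):
succ = {
    (2, 1): [(1, 1), (2, 0)],
    (2, 0): [(1, 0)],
    (1, 1): [(0, 1), (1, 0)],
    (1, 0): [(0, 0)],
    (0, 1): [],
    (0, 0): [],
}
for u in state_sequences(succ, (2, 1)):
    print(u)
```

Because the graph is acyclic and finite, the recursion terminates and the returned list is exactly the range of the random variable U used in Section III.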

III. MODEL SOLUTION

A. A Partition of the Trajectory Space

As noted in [10], one approach to determining performability is to partition the trajectory space and solve for each of the resulting classes of trajectories. During the unbounded period [0,∞), the base model process X will, with probability 1, pass through some finite sequence of distinct states, say u = (u_n, u_{n-1}, ..., u_0), where state u_n is the initial state and state u_0 is an absorbing state. X is nonrecoverable (see above), so there is a sequence of reward rates (r(u_n), r(u_{n-1}), ..., r(u_0)), corresponding to u, such that

    r(u_n) ≥ r(u_{n-1}) ≥ ... ≥ r(u_0).   (8)

Let U be a random variable denoting the sequence of states visited by the base model during [0,∞). Since the base model is acyclic, and since there are N < ∞ base model states, there are only a finite number of possible state sequences, i.e., sequences u such that Prob[U = u] > 0. Let Y be the performance (reward) variable defined in (5) and let F_Y be the PDF of Y. Further, if F_{Y|U} is the conditional PDF of Y given U, then F_Y can be expressed as the following summation over a finite index set:

    F_Y(y) = Σ_u F_{Y|U}(y|u) Prob[U = u].   (9)

B. Notation

At this point, it is convenient to introduce a body of notation for dealing with the time the process spends in various states during both T = [0,t] and [0,∞). Unless otherwise noted, the remarks that follow assume U to be some arbitrary sequence u = (u_n, u_{n-1}, ..., u_0) such that Prob[U = u] > 0. For each state sequence u = (u_n, u_{n-1}, ..., u_0) there is a vector-valued random variable V_u = (V_n, V_{n-1}, ..., V_0), taking values in (ℝ⁺ ∪ {∞})^{n+1}, which describes the time the process resides in each state in u. For 0 ≤ i ≤ n, V_i is a random variable representing the time the base model process resides in state u_i during the interval [0,∞). State u_0 is an absorbing state, and so V_0 is ∞. We will be interested in the PDF for V_u conditioned on U = u, i.e.,

    F_{V_u|U}(v_u|u) = Prob[V_u ≤ v_u | U = u]   (10)
                     = Prob[V_n ≤ v_n and V_{n-1} ≤ v_{n-1} and ... and V_0 ≤ v_0 | U = u].

If the conditional probability density function for V_u exists, it will be written f_{V_u|U}. We suppose that f_{V_u|U} exists, and we find it convenient to expand it as follows:

    f_{V_u|U}(v_u|u) = f_{V_n|U}(v_n|u) f_{V_{n-1}|V_n,U}(v_{n-1}|v_n,u) ... f_{V_0|V_n,V_{n-1},...,V_1,U}(v_0|v_n,v_{n-1},...,v_1,u).   (11)

For a given u, define W_i to be a random variable denoting the amount of time the process is in state u_i during the utilization period T. The basic relationship between W_i and V_i

can be expressed as

    W_i = max( min( t − Σ_{j=i+1}^n V_j , V_i ), 0 ).   (12)

Note that if U = u, then the value of W_i can always be expressed by one of the following alternatives: V_i, t − Σ_{j=i+1}^n V_j, or 0.

C. The Approach

We seek to determine the PDF F_Y of the reward variable Y (Eq. 9). The algorithm delineated in [10] will be employed:

Algorithm 1: (Determining the probability distribution function F_Y)
1) Determine all state sequences u, i.e., the range of the random variable U.
2) For each of these u, determine Prob[U = u].
3) For each u such that Prob[U = u] > 0, determine F_{Y|U}(y|u).
4) Apply Eq. 9 to arrive at F_Y.

For the present, we assume that steps 1) and 2) have been completed. Consider step 3), the calculation of F_{Y|U}. Using the notation introduced in the previous section, the reward variable Y can be directly expressed as a linear combination of the random variables W_j: if U = u, then

    Y = Σ_{j=0}^n r(u_j) W_j.   (13)

If the W_j were independent random variables, we could obtain the PDF F_Y by mathematical convolution. However, as discussed in [10], [14], the W_j are statistically dependent and a probabilistic characterization is generally difficult to obtain. Such complications are avoided by formulating Y as a function γ_u of V_u (if it is known that U = u). Although the basic concept is simple to state, the details are somewhat complex. Using the relationship between W_i and V_i of Eq. 12 (see Eq. 6),

    Y = γ_u(V_u) = r(u_j)t + Σ_{k=j+1}^n (r(u_k) − r(u_j)) V_k,  if Σ_{k=j+1}^n V_k ≤ t and Σ_{k=j}^n V_k > t, for n > j ≥ 0;
                   r(u_n)t,  if V_n ≥ t.   (14)
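Eqs. (12) and (13) can be sketched directly in code (an illustrative rendering, not part of the paper; indexing follows the paper, so the first entry of the vector is the initial state's sojourn v_n):

```python
# Sketch of Eqs. (12)-(13): sojourn times W_i within T = [0, t] derived from
# the unbounded sojourn times (v_n, ..., v_0), and the resulting total reward.

def truncated_sojourns(v, t):
    """W_i = max(min(t - sum of earlier sojourns, v_i), 0), per Eq. (12)."""
    w, elapsed = [], 0.0
    for vi in v:
        w.append(max(min(t - elapsed, vi), 0.0))
        elapsed += vi
    return w

def reward(v, rates, t):
    """Y = sum of r(u_i) * W_i, per Eq. (13); `rates` lists r(u_n), ..., r(u_0)."""
    return sum(r * wi for r, wi in zip(rates, truncated_sojourns(v, t)))

# (r(u_2), r(u_1), r(u_0)) = (2, 1, 0) with t = 4; the absorbing state's
# sojourn v_0 is infinite, as required by the model:
print(reward([1.0, 2.0, float("inf")], [2, 1, 0], t=4.0))  # 2*1 + 1*2 + 0*1 = 4.0
```

The statistical dependence among the W_i is visible here: each W_i is a deterministic function of all the earlier sojourn times, which is precisely why the solution works with V_u rather than with the W_i directly.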

See Fig. 2 and compare with Eq. 24 of [14]. Let B_y be the set of accomplishment levels (reward outcomes) not greater than y, i.e., B_y = { b ≥ 0 | b ≤ y }, and let C_y = γ_u^{-1}(B_y) be the set of all base model trajectories which traverse the state sequence u and provide reward no greater than y. Then

    F_{Y|U}(y|u) = Prob[V_u ∈ C_y | U = u] = ∫_{C_y} f_{V_u|U}(v_u|u) dv_u.   (15)

Since the probability density function f_{V_u|U}(·|·) is assumed to exist and be known, to formulate F_{Y|U} we must determine C_y.

D. Formulation of F_{Y|U}

In the next section, we show that, depending on the values of u and y, C_y can be easily partitioned, i.e., expressed as a union of disjoint regions. Such a partition provides the ability to break up F_{Y|U} into a sum of smaller, less complex integrations. In particular, if the length of the sequence u is n+1, C_y will be decomposed into n+2 disjoint regions

    { C_y^{n+1}, C_y^n, ..., C_y^0 }.   (16)

Let v_u = (v_n, v_{n-1}, ..., v_0) be a point in C_y^i, and let

    C_{y,n}^i = { v_n | there exist v_{n-1}, v_{n-2}, ..., v_0 such that (v_n, v_{n-1}, ..., v_0) ∈ C_y^i }   (17)

and, for n > j ≥ 0, let

    C_{y,j}^i(v_n, v_{n-1}, ..., v_{j+1}) = { v_j | there exist v_{j-1}, v_{j-2}, ..., v_0 such that (v_n, v_{n-1}, ..., v_0) ∈ C_y^i }.   (18)

In other words, C_{y,j}^i(v_n, v_{n-1}, ..., v_{j+1}) is the set of all values v_j that are "contained" in some v_u ∈ C_y^i. For conciseness, we shall write C_{y,j}^i in place of C_{y,j}^i(v_n, v_{n-1}, ..., v_{j+1}) when the vector (v_n, v_{n-1}, ..., v_{j+1}) is implicit. Consider γ_u (Eq. 14). The regions C_y^i are not generally Cartesian, that is, C_y^i is usually not expressible as the cross-product of the C_{y,j}^i. Rather, a different relationship based on "i-resolvability" (see the next section) will be employed to define the C_y^i. By the definitions of the C_y^i and C_{y,j}^i, Eq. 15 can now be expressed as

    F_{Y|U}(y|u) = Σ_{i=0}^{n+1} Prob[V_u ∈ C_y^i | U = u]   (19)
                 = Σ_{i=0}^{n+1} ∫_{C_{y,n}^i} ∫_{C_{y,n-1}^i} ... ∫_{C_{y,0}^i} f_{V_u|U}(v_u|u) dv_u.

The expansion of f_{V_u|U}(v_u|u) in Eq. 11 is especially appropriate in this representation. Finally, combining Eqs. 9, 11, and 19 with the characterization of the C_{y,j}^i developed below,

we arrive at an integral solution for the PDF F_Y.

E. i-Resolvability

In this section, we describe a method of partitioning the set C_y = γ_u^{-1}(B_y) into the C_y^i (Eq. 16) and characterizing the C_{y,j}^i (Eqs. 17 and 18). The immediate difficulty with characterizing the C_y^i arises from the nature of γ_u (Eq. 14), viz., γ_u is a sum of a random number of random variables V_n, V_{n-1}, ..., V_{j+1}. We introduce the notion of "i-resolvability," which allows the decomposition of the single large problem of describing the sum of a varying number of random variables into a set of smaller problems, each consisting of describing the sum of a fixed number of random variables. The number of random variables to be considered will be determined by i-resolvability, a notion which takes full advantage of both the monotonicity of γ_u and the finite utilization period. The concept can be loosely stated: "For a given subtrajectory, based on the past and regardless of the future, what can we say about the entire trajectory?" We are interested in determining the probability that the total accumulated reward γ_u(V_u) ≤ y. Recall that the time the process resides in each state of the sequence u is given by the random variable V_u = (V_n, V_{n-1}, ..., V_0). Suppose that the sojourn times for the first n−i+1 of these values are known, i.e., (V_n, V_{n-1}, ..., V_i) = (v_n, v_{n-1}, ..., v_i). Knowledge of these times conveys significant information regarding the set of possible sojourn times (V_{i-1}, V_{i-2}, ..., V_0) that can be associated with the remaining states and still satisfy γ_u(V_u) ≤ y. As an informal introduction, imagine that someone chooses at random some vector v_u of sojourn times and starts dealing, in order and one at a time, the values v_n, v_{n-1}, ..., v_0. We look at the values as they are dealt.
If after the value v_i has been dealt, but not before, we can determine with certainty that γ_u(v_u) ≤ y, then we shall say that v_u is such that γ_u(v_u) ≤ y is i-resolvable based on v_u, or, more succinctly, v_u is i-resolvable (y and γ_u are implicit). If at any time we can determine with certainty that γ_u(v_u) > y, then v_u ∉ C_y; we are not interested in that value v_u (since it does not contribute to Prob[Y ≤ y]) and so we reject it. We now formalize this property. Let y ∈ ℝ and let v_u = (v_n, v_{n-1}, ..., v_0) be such that γ_u(v_u) ≤ y. In addition, let γ_u(v_n, v_{n-1}, ..., v_i, ·, ..., ·) be the function of the remaining variables derived from γ_u(·, ·, ..., ·) by setting V_n = v_n, V_{n-1} = v_{n-1}, ..., V_i = v_i. When y is an upper bound¹ of the function γ_u(v_n, v_{n-1}, ..., v_i, ·, ..., ·) but not of γ_u(v_n, v_{n-1}, ..., v_{i+1}, ·, ..., ·), we shall say that v_u is i-resolvable. [As a special case, when y bounds γ_u(·, ·, ..., ·), then v_u is (n+1)-resolvable.] In other words, when a determination as to whether γ_u(v_u) ≤ y cannot be made based solely on the values (v_n, v_{n-1}, ..., v_{i+1}), but such a determination can be made with the additional knowledge of the value of v_i, we shall say that v_u is i-resolvable. [When v_u is (n+1)-resolvable, the determination can be made with no knowledge of v_u.] One can define i-resolvability recursively, as follows. For a given v_u = (v_n, v_{n-1}, ..., v_0):

i) v_u is (n+1)-resolvable if and only if γ_u(v_u) ≤ y regardless of the values (v_n, v_{n-1}, ..., v_0).
ii) v_u is i-resolvable if v_u is not j-resolvable for any j > i but γ_u(v_n, v_{n-1}, ..., v_i, ·, ..., ·) ≤ y regardless of the values of (v_{i-1}, v_{i-2}, ..., v_0).

To motivate our interest in i-resolvability, consider the following example. Suppose U is such that (r(u_2), r(u_1), r(u_0)) = (2, 1, 0) and that y = 5 and t = 4. If V_2 = 1, then

for γ_u(V_u) ≤ y to be true, we must have W_1 ≤ t − V_2 = 4 − 1 = 3 (see Eq. 12), and

    Σ_{j=0}^2 r(u_j) W_j = 2W_2 + W_1 = 2 + W_1 ≤ 2 + 3 = 5 = y.

Now W_1 ≤ 3 irrespective of V_1, so V_1 can therefore take on any value in the interval [0,∞) and still satisfy γ_u(V_u) ≤ y. In other words, once V_2 = 1 is known, then, without further examination of V_1 or V_0, we know γ_u(V_u) ≤ y. Thus, every (1, v_1, v_0) is 2-resolvable.

F. Solution

The partition {C_y^i} of C_y that we desire is the one induced by i-resolvability. Specifically, let a trajectory v_u ∈ C_y be in C_y^i if v_u is i-resolvable. One can easily see that every v_u ∈ C_y is i-resolvable for exactly one i. The recursive definition of i-resolvability yields the following algorithm for determining the C_y^i:

Algorithm 2: (Determining the C_y^i) For a given B_y,
i) Determine C_y^{n+1}, i.e., all v_u ∈ γ_u^{-1}(B_y) such that v_u ∈ γ_u^{-1}(B_y) regardless of the values of (v_n, v_{n-1}, ..., v_0). These are the v_u which are (n+1)-resolvable. (C_y^{n+1} will be either empty or all of C_y, depending on the value of y.)
ii) For each i = n, n−1, ..., 0, determine all v_u ∈ γ_u^{-1}(B_y) such that v_u ∉ C_y^{n+1} ∪ C_y^n ∪ C_y^{n-1} ∪ ... ∪ C_y^{i+1} and v_u ∈ γ_u^{-1}(B_y) regardless of the values of (v_{i-1}, v_{i-2}, ..., v_0). These are the v_u which are i-resolvable.

To show that a given vector v_u is i-resolvable, the definition of i-resolvability may be used directly. Unfortunately, space does not allow the full derivation of the characterization of the C_{y,j}^i (see [31]). An important intermediate result is the following (compare with Eq. 14 and see Eq. 12 for the relation between V_i and W_i): v_u is i-resolvable (n ≥ i > 0) if and only if

    y ≥ r(u_{i-1})t + Σ_{j=i}^n (r(u_j) − r(u_{i-1})) W_j   (21)

and

    y < r(u_i)t + Σ_{j=i+1}^n (r(u_j) − r(u_i)) W_j,  if n > i > 0;
    y < r(u_n)t,  if i = n.   (22)

As mentioned above, the W_j are statistically dependent and thus inconvenient to use. Hence, Eqs. 21 and 22 must be rewritten in terms of the v_j (see Eq. 12). We use the properties of i-resolvability and the monotonicity of γ_u to obtain the main result of this section:

¹A constant y is an upper bound of the function f: D → ℝ if for all x ∈ D, f(x) ≤ y.
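Because the reward rates are nonincreasing, the largest reward any completion of a prefix (v_n, ..., v_i) can still achieve is obtained by spending all remaining time at the next rate r(u_{i-1}); the prefix is resolved as soon as that bound drops to y or below. The sketch below operationalizes this (illustrative only, not part of METAPHOR; function and variable names are our own):

```python
# Sketch: find the i for which a sojourn vector v is i-resolvable.
# `rates` and `v` are listed in order of visit: rates[0] = r(u_n), ...,
# rates[n] = r(u_0), and v[0] = v_n, etc.

def resolvability_index(v, rates, y, t):
    """Return n+1 if gamma_u <= y is certain with no knowledge of v; else the
    index i at which certainty is first reached; else None (gamma_u(v) > y).
    Since u_0 is absorbing (V_0 = infinity), i = 0 can never be returned."""
    n = len(rates) - 1
    if rates[0] * t <= y:
        return n + 1                    # (n+1)-resolvable
    accrued, elapsed = 0.0, 0.0
    for k, vk in enumerate(v):          # k = 0 corresponds to index n
        accrued += rates[k] * max(min(t - elapsed, vk), 0.0)   # Eq. (12)
        elapsed += vk
        next_rate = rates[k + 1] if k + 1 <= n else 0.0
        # Largest reward any completion of this prefix can still achieve:
        # spend all of the remaining period at the next (highest) rate.
        if accrued + max(t - elapsed, 0.0) * next_rate <= y:
            return n - k                # i-resolvable, with i = n - k
    return None                         # rejected: gamma_u(v) > y

# The worked example above: (r(u_2), r(u_1), r(u_0)) = (2, 1, 0), y = 5, t = 4.
print(resolvability_index((1.0, 7.0, float("inf")), (2, 1, 0), y=5.0, t=4.0))
# prints 2: every (1, v_1, v_0) is 2-resolvable, as in the text
```

With v_2 = 1, the bound after the first deal is 1·4 + (2 − 1)·1 = 5 ≤ y, so the vector is resolved immediately, reproducing the hand calculation.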

Characterization of the C_{y,j}^i:

i) If r(u_i) = r(u_{i-1}), n ≥ i > 0, then no v_u is i-resolvable: C_y^i = ∅, and for each k such that n ≥ k ≥ 0:

    C_{y,k}^i = ∅.   (23)

ii) Let y ∈ [r(u_n)t, ∞). Then every v_u is (n+1)-resolvable: C_y^{n+1} = C_y, C_y^i = ∅ for n ≥ i ≥ 0, and for each j such that n ≥ j ≥ 0 and for each k such that n ≥ k ≥ 0:

    C_{y,j}^{n+1} = [0, ∞)   (24)

    C_{y,k}^i = ∅.   (25)

iii) Let y ∈ [r(u_{l-1})t, r(u_l)t), n ≥ l > 0. Then for every i such that l ≥ i > 0 and r(u_i) ≠ r(u_{i-1}), there exist vectors v_u such that v_u is i-resolvable.

a) If i = n, then

    i) C_{y,n}^n = [ 0, (y − r(u_{n-1})t) / (r(u_n) − r(u_{n-1})) ]   (26)

and for each j such that n > j ≥ 0,

    ii) C_{y,j}^n = [0, ∞),  n > j > 0, and   (27)

    iii) C_{y,0}^n = {∞}.   (28)

b) If i ≠ n, then

    i) C_{y,n}^i = ( (y − r(u_{n-1})t) / (r(u_n) − r(u_{n-1})) , (y − r(u_{i-1})t) / (r(u_n) − r(u_{i-1})) ]_{>0}   (29)

and for each j such that n > j > i,

    ii) C_{y,j}^i = ( (y − r(u_{j-1})t − Σ_{k=j+1}^n (r(u_k) − r(u_{j-1})) v_k) / (r(u_j) − r(u_{j-1})) ,
                     (y − r(u_{i-1})t − Σ_{k=j+1}^n (r(u_k) − r(u_{i-1})) v_k) / (r(u_j) − r(u_{i-1})) ]_{>0}   (30)

and for j = i,

    iii) C_{y,i}^i = [ 0, (y − r(u_{i-1})t − Σ_{k=i+1}^n (r(u_k) − r(u_{i-1})) v_k) / (r(u_i) − r(u_{i-1})) ]   (31)

and for each j such that i > j ≥ 0,

    iv) C_{y,j}^i = [0, ∞),  i > j > 0, and   (32)
        C_{y,0}^i = {∞}.

c) For each k such that n+1 ≥ k > l (or k = 0) and for each j such that n ≥ j ≥ 0,

    C_{y,j}^k = ∅,   (33)

where the notation ( a, b ]_{>0} means

    ( a, b ]_{>0} = [0, b]  if a < 0;  ( a, b ]  if a ≥ 0.   (34)

If r(u_j) = r(u_{j-1}), then in Eqs. 29 and 30 the sets whose expressions have denominator r(u_j) − r(u_{j-1}) [or r(u_n) − r(u_{n-1})] are replaced by ∅, and Eqs. 26 and 31 are replaced by ∅ [see case i)]. Also, in Eq. 30, if r(u_j) = r(u_{i-1}) for n > j > i, then r(u_i) = r(u_{i-1}) and Eq. 30 is replaced by ∅ [see case i)]. An integral solution for the PDF F_Y (and hence the system's performability) is thus obtained by employing Eqs. 9, 11, 19, and 24-33.

F. METAPHOR

The remaining difficulty is calculating the integrations. For certain classes of probability densities f_{V_i | V_n ··· V_{i+1} U}, it may be possible to perform the integrations symbolically to obtain a true closed-form solution. For example, a closed-form solution for the performability of a queueing system is derived in [14]. The symbolic integrations may be performed either (laboriously) by hand or by computer programs that manipulate symbolic quantities, e.g., MACSYMA [37] and REDUCE 2 [38]. A recursive version of Eqs. 19 and 24-33 is derived in [31] and may help in reducing the amount of work necessary to perform the required integrations symbolically. However, when performability is evaluated for complex systems with large state spaces, one is primarily interested in numerical results as opposed to closed-form solutions. For this purpose, we have written a numerical program based on Eqs. 9, 11, 19, and 24-33. This program is now part of a larger performability evaluation package, METAPHOR (see [22], [31] for descriptions of METAPHOR). To obtain a system's performability, METAPHOR is given information about each trajectory sequence u, including the sequence itself, the conditional densities f_{V_i | V_n ··· V_{i+1} U}, and the utilization period T = [0, t]. METAPHOR then computes the regions of integration C^i_{y,j} and, using Gaussian quadrature, computes F_Y.

IV. MULTIPROCESSOR/AIR CONDITIONER EXAMPLE

To illustrate the application of the solution procedure for F_Y, and to exhibit its ability to deal with non-semi-Markov base models, we consider the multiprocessor/air conditioner example discussed in Section II.B (see Fig. 1). Recall that the performance (reward) variable Y is the number of jobs processed during some utilization period T = [0, t]. State (i,j) reflects the number of processors operational (i) and whether the air conditioner is operational (j); each processor is identical and has a computation rate of δ jobs/hour. The reward structure is r(i,j) = i·δ.
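To fix notation for what follows, the example's state sequences and reward rates can be sketched as follows; the naming is ours (this is not METAPHOR's input format), and the indexing matches Section III, where u_n is the first state entered.

```python
DELTA = 100.0  # delta: jobs/hour per processor (the value used in Fig. 3)

def reward_rate(i, j):
    """Reward structure of the example: r(i, j) = i * delta."""
    return i * DELTA

def sequence_rates(N, k):
    """Rates r(u_0), ..., r(u_n) for the state sequence in which the air
    conditioner fails while k processors are still operational (cf. Eq. 37).
    The sequence is built first-to-last, then reversed so the returned list
    is ascending in the index i of u_i."""
    seq = [(N, 1)] + [(m, 1) for m in range(N - 1, k - 1, -1)] \
                   + [(m, 0) for m in range(k, -1, -1)]
    return [reward_rate(i, j) for (i, j) in reversed(seq)]
```

For N = 3 and k = 3 (the air conditioner fails first), this yields the rate vector (0, 100, 200, 300, 300) used in the computations below.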
For simplicity, the example constrains the system to a single air conditioner. (A more complete example would include multiple air conditioners.) Suppose that, relative to a specified threshold y (y ≥ 0), the system user is interested in processing more than y jobs during the bounded utilization period. For instance, the system might be a computer in a business or university during a working day, or it might be a processor handling financial transactions overnight. Note that, especially in the latter category of applications, availability over the entire utilization period is not required, since most of the computation could be done early in the period or, alternatively, spread out over the entire period. In performability terms, we are considering (for a given value of y) the set of accomplishment levels B^y = {b | b > y}. The performability p(B^y), i.e., the probability that the number of jobs processed (reward) Y is greater than y, can then be obtained from F_Y since p(B^y) = Prob[Y > y] = 1 − F_Y(y).

The stochastic properties of the processors are affected by the failure time of the air conditioner in such a way that the base model is not semi-Markov. At the beginning of the utilization period, the air conditioner and all of the processors are functioning, and the temperature of the room containing the equipment is 20°C. Since the amount of heat which the air conditioner must dissipate places stress on the compressor, the air conditioner's failure rate is influenced by the number of processors it must cool. Assume the air conditioner fails with an exponential failure rate λ_AC(N) = 0.05·N failures/hour, where N is the number of processors in the room. If the air conditioner fails, the ambient temperature R in the room begins to increase to a higher steady-state temperature with an exponential rise time; the number of processors in the room affects the speed at which the temperature rises. More precisely,

R(Δr) = 55 + 2N − (55 + 2N − 20)·exp(−λ_s·N·Δr),   (35)

where (55 + 2N)°C is the new steady-state temperature, Δr is the time (in hours) since the air conditioner failed, and λ_s = 10 degrees/hour/processor is a constant reflecting the rate of temperature increase. If a processor fails, it does not shut off and so continues contributing heat to the room. The failure behavior of the processors is influenced by the ambient temperature; each processor fails at an exponential rate which varies linearly with room temperature (over the range 20°C to 55°C) from 0.001 failures/hour to 0.1 failures/hour. If the room temperature is R°C,

λ_P(R) = 0.001 + (0.1 − 0.001)·(R − 20)/(55 − 20).   (36)

There are N+1 state sequences u such that Prob[U = u] > 0 (see Fig. 1):

{u} = {((N,1),(N,0),(N−1,0),...,(0,0)),
       ((N,1),(N−1,1),(N−1,0),...,(0,0)), ...,
       ((N,1),(N−1,1),(N−2,1),...,(0,1),(0,0))}.   (37)

Let u = ((N,1),(N−1,1),...,(k,1),(k,0),...,(0,0)), i.e., the air conditioner fails when k processors remain operational. Writing λ_P = λ_P(20°) and λ_AC = λ_AC(N), the probability of the sequence u is

Prob[U = u] = (λ_AC / (kλ_P + λ_AC)) · Π_{j=k+1}^{N} (jλ_P / (jλ_P + λ_AC)),  if k < N,
Prob[U = u] = λ_AC / (Nλ_P + λ_AC),  if k = N.   (38)

Suppose u_i = (j,1); the conditional probability density function for the time the process spends in state u_i is

f_{V_i | V_n ··· V_{i+1} U}(v_i | v_n, ..., v_{i+1}, u) = (jλ_P(20°) + λ_AC)·exp(−(jλ_P(20°) + λ_AC)·v_i).   (39)

Suppose u_i = (j,0) and that the first state in u representing a failed air conditioner is u_k, k ≥ i; the conditional probability density function for the time the process spends in state u_i is

f_{V_i | V_n ··· V_{i+1} U}(v_i | v_n, ..., v_{i+1}, u) = K·jλ_P(R(Δr))·exp(−jλ_P(R(Δr))·v_i),   (40)

where Δr = v_k + v_{k−1} + ··· + v_i, and K is a normalization constant determined by

K^{−1} = ∫_0^∞ jλ_P(R(Δr'))·exp(−jλ_P(R(Δr'))·v) dv,   (41)

where Δr' = v_k + v_{k−1} + ··· + v_{i+1} + v. That the base model is not semi-Markov can be seen from the time-varying failure rates associated with states (i,0) [Eq. 40]; these rates are functions of the process's history and cannot be inferred from the present state, the present time, and the time of entry to the state.

It is clear how Eqs. 38, 39, and 40 can be applied to Eqs. 9 and 11. To see how Eqs. 19 and 24-33 are employed to calculate F_Y, consider N = 3, u = ((3,1),(3,0),(2,0),(1,0),(0,0)), and y = 1500 ∈ [r(u_1)t, r(u_2)t). From Eq. 19, we see that F_Y is the sum of six multiple integrals, three of which are 0 since they are integrated over empty C^i_y (these are C^5_y, C^4_y, and C^0_y). The other three C^i_y correspond to the sets of base model trajectories which are 3-, 2-, and 1-resolvable.

Consider C^2_y. From Eq. 29,

C^2_{y,4} = [0, (1500 − 1000)/(300 − 100)] = [0, 2.5];   (42)

from Eq. 30,

C^2_{y,3} = ( (1500 − 2000 − (300 − 200)v_4)/(300 − 200), (1500 − 1000 − (300 − 100)v_4)/(300 − 100) ]_{≥0}
          = [0, (500 − 200v_4)/200];   (43)

from Eq. 31,

C^2_{y,2} = [0, (1500 − 1000 − (300 − 100)v_3 − (300 − 100)v_4)/(200 − 100)]
          = [0, (500 − 200(v_4 + v_3))/100];   (44)

and from Eq. 32,

C^2_{y,1} = [0, ∞) and C^2_{y,0} = [0, ∞).   (45)

Once the regions C^3_{y,j} and C^1_{y,j} are similarly obtained, Eq. 19 can then be evaluated. This process must be repeated for each state sequence u, and then Eq. 9 can be applied to obtain the performability p(B^y).
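To make the example concrete, the following sketch (our naming throughout, not METAPHOR's) implements the example's stochastic ingredients (Eqs. 35, 36, and 38) and evaluates one nested integral of the kind METAPHOR computes by Gauss-Legendre quadrature (Section III.F). For brevity the density is replaced by 1, so the result is simply the volume of the region of Eqs. 42-44; in Eq. 19 the integrand would instead be the joint density of (V_4, V_3, V_2).

```python
import math
import numpy as np

LAMBDA_P_COLD, LAMBDA_P_HOT = 0.001, 0.1  # processor rates at 20 C and 55 C

def lambda_ac(N):
    """Air conditioner failure rate: 0.05 failures/hour per processor."""
    return 0.05 * N

def room_temp(N, dr, lam_s=10.0):
    """R(delta_r) of Eq. 35 (as reconstructed here): exponential rise from
    20 C toward the new steady state (55 + 2N) C after the AC fails."""
    return 55 + 2 * N - (55 + 2 * N - 20) * math.exp(-lam_s * N * dr)

def lambda_p(R):
    """Eq. 36: processor failure rate, linear in room temperature R."""
    return LAMBDA_P_COLD + (LAMBDA_P_HOT - LAMBDA_P_COLD) * (R - 20) / (55 - 20)

def seq_prob(N, k):
    """Eq. 38: probability that the AC fails when exactly k processors are
    still operational (k = N means the AC fails first)."""
    lam_ac, lam_p = lambda_ac(N), lambda_p(20)
    p = 1.0
    for j in range(k + 1, N + 1):        # j processors fail before the AC does
        p *= j * lam_p / (j * lam_p + lam_ac)
    if k > 0:                            # then the AC fails from state (k, 1)
        p *= lam_ac / (k * lam_p + lam_ac)
    return p

NODES, WEIGHTS = np.polynomial.legendre.leggauss(16)

def gauss(f, a, b):
    """Gauss-Legendre quadrature of f over [a, b] (0 if the interval is empty)."""
    if b <= a:
        return 0.0
    x = 0.5 * (b - a) * NODES + 0.5 * (b + a)
    return 0.5 * (b - a) * sum(w * f(xi) for w, xi in zip(WEIGHTS, x))

def c2_volume(y=1500.0):
    """Volume of the (v_4, v_3, v_2) region bounded by Eqs. 42-44."""
    inner = lambda v4, v3: gauss(lambda v2: 1.0, 0.0,
                                 (y - 1000 - 200 * (v4 + v3)) / 100)
    middle = lambda v4: gauss(lambda v3: inner(v4, v3), 0.0,
                              (y - 1000 - 200 * v4) / 200)
    return gauss(middle, 0.0, (y - 1000) / 200)
```

The N + 1 sequence probabilities of Eq. 38 must sum to one (the events are exhaustive), which is a useful sanity check; the volume integral evaluates to 125/24, and the quadrature reproduces it exactly because the integrand is polynomial.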

Fig. 3 shows a plot of p(B^y) = 1 − F_Y(y) for N = 1, 2, 3, and 4 processors, t = 10 hours, and δ = 100 jobs/hour. Note that these evaluations provide a considerable amount of information regarding the system's ability to perform (provide reward) in the presence of faults. It is easy to show, for example, that the performability p(B^y) is 0 when y is greater than or equal to N·δ·t (e.g., for N = 2, p(B^2000) = 0; see Fig. 3). Hence, to obtain a nonzero performability, the number of processors must be greater than y/(δ·t). For instance, to have a nonzero probability of accomplishing more than 1500 jobs, one must have at least 2 processors. Generally, for the values of N shown on the plot, there is a significant gain in p(B^y) for values of y above 1000 when additional processors are included in the system. For values of y below 1000, there is relatively little gain from having more than a single processor. Indeed, if the specified minimum reward is between about 500 and 1000 jobs, a single processor provides a greater probability of performing within B^y than do two processors.

In non-critical applications, a system designer may choose to settle for a lower performability to avoid the cost of additional processors. Information such as that provided by Fig. 3 can be quite useful in investigating such tradeoffs. For example, suppose the threshold is y = 1500. The difference between the performability with N = 3 and with N = 4 is about 0.03, while the difference between N = 2 and N = 3 is about 0.12. The probability of accomplishing more than 1500 jobs with 3 processors is about 0.96. If this probability is adequate for the application, and the additional 0.03 probability is not worth the cost of an extra processor, the designer may well choose a 3-processor system.

V. CONCLUSION

In summary, the techniques described in this paper provide a new means of evaluating performability for a broad class of systems and for a general class of performance variables.
The evaluation procedure is admittedly complex and requires a programmed implementation for practical applications. This is true, however, for most evaluation methods, particularly those applicable to degradable fault-tolerant systems. Moreover, although not the subject of this paper, the feasibility of such implementations has been demonstrated (see Section III.F). Future research will seek to accommodate more general reward structures (time-varying rates and "bonuses" associated with state transitions) and more generally defined performance variables, e.g., variables defined with respect to random utilization periods.

REFERENCES

[1] R. Troy, "Dynamic reconfiguration: An algorithm and its efficiency evaluation," in Proc. 1977 Int. Symp. on Fault-Tolerant Computing, Los Angeles, CA, pp. 44-49, June 1977.
[2] M. D. Beaudry, "Performance-related reliability measures for computing systems," IEEE Trans. Comput., vol. C-27, pp. 540-547, June 1978.
[3] F. A. Gay and M. L. Ketelsen, "Performance evaluation for gracefully degrading systems," in Proc. 1979 Int. Symp. on Fault-Tolerant Computing, Madison, WI, pp. 51-58, June 1979.
[4] H. Mine and K. Hatayama, "Performance related reliability measures for computing systems," in Proc. 1979 Int. Symp. on Fault-Tolerant Computing, Madison, WI, pp. 59-62, June 1979.

[5] T. C. K. Chou and J. A. Abraham, "Performance/availability model of shared resource multiprocessors," IEEE Trans. Reliability, vol. R-29, pp. 70-76, April 1980.
[6] X. Castillo and D. P. Siewiorek, "A performance-reliability model for computing systems," in Proc. 1980 Int. Symp. on Fault-Tolerant Computing, Kyoto, Japan, pp. 187-192, Oct. 1980.
[7] J. M. De Souza, "A unified method for the benefit analysis of fault-tolerance," in Proc. 1980 Int. Symp. on Fault-Tolerant Computing, Kyoto, Japan, pp. 201-203, Oct. 1980.
[8] I. Koren and M. Berg, "A module replacement policy for dynamic redundancy fault-tolerant computing systems," in Proc. 1981 Int. Symp. on Fault-Tolerant Computing, Portland, ME, pp. 90-95, June 1981.
[9] S. V. Makam and A. Avizienis, "Modeling and analysis of periodically renewed closed fault-tolerant systems," in Proc. 1981 Int. Symp. on Fault-Tolerant Computing, Portland, ME, pp. 134-141, June 1981.
[10] J. F. Meyer, "Closed-form solutions of performability," in Proc. 1981 Int. Symp. on Fault-Tolerant Computing, Portland, ME, pp. 66-71, June 1981.
[11] T. Nakagawa, K. Yasui, and S. Osaki, "Optimum maintenance policies for a computer system with restart," in Proc. 1981 Int. Symp. on Fault-Tolerant Computing, Portland, ME, pp. 148-150, June 1981.
[12] Y. Oda, Y. Tohma, and K. Furuya, "Reliability and performance evaluation of self-reconfigurable systems with periodic maintenance," in Proc. 1981 Int. Symp. on Fault-Tolerant Computing, Portland, ME, pp. 142-147, June 1981.
[13] R. Huslende, "A combined evaluation of performance and reliability for degradable systems," in ACM/SIGMETRICS Conf. on Measurement and Modeling of Computer Systems, Las Vegas, NV, pp. 157-164, Sept. 1981.
[14] J. F. Meyer, "Closed-form solutions of performability," IEEE Trans. Comput., vol. C-31, pp. 648-657, July 1982.
[15] C. M. Krishna and K. G.
Shin, "Performance measures for multiprocessor controllers," CRL-TR-1-82, Computing Research Laboratory, The University of Michigan, Ann Arbor, MI, Oct. 1982.
[16] J. Arlat and J. C. Laprie, "Performance-related dependability evaluation of supercomputer systems," in Proc. 1983 Int. Symp. on Fault-Tolerant Computing, Milano, Italy, June 1983.
[17] R. Huslende, "Optimal cost/reliability allocation in communication networks," in Proc. 1983 Int. Symp. on Fault-Tolerant Computing, Milano, Italy, June 1983.

[18] J. A. Munarin, "Dynamic workload model for performance/reliability analysis of gracefully degrading systems," in Proc. 1983 Int. Symp. on Fault-Tolerant Computing, Milano, Italy, pp. 290-295, June 1983.
[19] E. Gai and M. Adams, "Measures of merit for fault-tolerant systems," in Proc. Guidance and Control Conf., Gatlinburg, TN, Aug. 1983.
[20] F. A. Gay, "Performance Modeling for Gracefully Degrading Systems," Ph.D. Dissertation, Northwestern University, Evanston, IL, June 1979.
[21] J. F. Meyer, "On evaluating the performability of degradable computing systems," in Proc. 1978 Int. Symp. on Fault-Tolerant Computing, Toulouse, France, pp. 44-49, June 1978.
[22] J. F. Meyer, D. G. Furchtgott, and L. T. Wu, "Performability evaluation of the SIFT computer," IEEE Trans. Comput., vol. C-29, pp. 501-509, June 1980.
[23] J. F. Meyer, "On evaluating the performability of degradable computing systems," IEEE Trans. Comput., vol. C-29, pp. 720-731, Aug. 1980.
[24] R. A. Howard, Dynamic Probabilistic Systems, Vol. II: Semi-Markov and Decision Processes. New York, NY: Wiley, 1971.
[25] E. Cinlar, Introduction to Stochastic Processes. Englewood Cliffs, NJ: Prentice-Hall, 1975.
[26] G. N. Cherkesov, "Semi-Markovian models of reliability of multichannel systems with unreplenishable reserve of time," Engineering Cybernetics, vol. 18, March 1981.
[27] V. A. Ditkin and A. P. Prudnikov, Operational Calculus in Two Variables and its Applications. New York: Pergamon Press, 1962.
[28] J. F. Meyer and L. T. Wu, "Evaluation of computing systems using functionals of a Markov process," in Proc. 14th Annual Hawaii Int. Conf. on System Sciences, Honolulu, HI, pp. 74-83, Jan. 1981.
[29] L. T. Wu, "Models for evaluating the performability of degradable computing systems," Ph.D. Thesis, University of Michigan, Ann Arbor, MI, 1982.
[30] L. T. Wu, "Operational models for the evaluation of degradable computing systems," in ACM/SIGMETRICS Conf. on
Measurement and Modeling of Computer Systems, Seattle, WA, pp. 179-185, Aug. 1982.
[31] D. G. Furchtgott, "Performability models and solutions," Ph.D. Thesis, University of Michigan, Ann Arbor, MI, 1983.

[32] H. Kobayashi, Modeling and Analysis: An Introduction to System Performance Evaluation Methodology. Reading, MA: Addison-Wesley, 1978.
[33] D. Ferrari, Computer Systems Performance Evaluation. Englewood Cliffs, NJ: Prentice-Hall, 1978.
[34] C. H. Sauer and K. M. Chandy, Computer Systems Performance Modeling. Englewood Cliffs, NJ: Prentice-Hall, 1981.
[35] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[36] H. Raiffa, Decision Analysis: Introductory Lectures on Choices under Uncertainty. Reading, MA: Addison-Wesley, 1968.
[37] J. Moses, "Symbolic integration: The stormy decade," Comm. ACM, vol. 14, Aug. 1971.
[38] A. C. Hearn, "REDUCE 2: A system and language for algebraic manipulation," in Proc. Second Symp. on Symbolic and Algebraic Manipulation.

Figure 1. (State-transition diagram of the multiprocessor/air conditioner base model, with states (i,j) from (N,1) through (0,0).)


Figure 3. (Plot of p(B^y) = 1 − F_Y(y) versus y (×10^3 jobs, 0 to 4.00) for N = 1, 2, 3, and 4 processors.)