Machine Monitoring Using Probability Thresholds and System Operating Characteristics

Stephen M. Pollock, Department of Industrial and Operations Engineering, The University of Michigan, Ann Arbor MI 48109-2117, USA

and Jeffrey M. Alden, G.M. R&D Center, Warren MI 48090

Technical Report 95-14

1. Introduction

We are concerned with detecting a change in the underlying condition of a process when the available observations are related only probabilistically to this condition. This situation has long been of concern to statisticians, engineers, economists, epidemiologists, etc. In this paper we address the specific context of monitoring a production machine, with the objective of effectively determining when to shut the machine down for maintenance or replacement. Applications to other areas such as quality control, health, military surveillance, or economic analysis should be readily apparent.

Before presenting any details, we list some definitions and initial assumptions, in order to clarify the general setting that governs our analysis:

a) There is an underlying time interval that characterizes the operation of the machine. It is convenient to think of this time interval as the "part production cycle time" of a machine that produces discrete parts (although it may also be advantageous to use natural time units such as minutes or hours). All times and intervals are subsequently measured in units of this time interval.

b) The machine can be in only one of two conditions: "good" or "bad" (denoted G and B, respectively). By the G condition we mean the machine is operating in such a way that it is "in control" or "normal" or "producing acceptably"; by the B condition we mean it is "out of control" or "failed" or "producing bad parts (scrap)".

c) The machine starts in G but at some operating time T goes to B. This is called a "failure event", or more simply, a "failure". The time T, called the failure time, is a random variable with known probability density function (p.d.f.) φ(t), cumulative distribution function Φ(t), and expected value E(T).

d) Observations of "signals" which are probabilistically related to the machine's condition are made at fixed, pre-determined times.
e) Immediately following any observation one of two possible actions can be taken: "do nothing" or "take an action consistent with believing the machine is about to go into (or is

in) B". The latter action is called a check.

f) When a check is made, production is stopped and the condition of the machine becomes known with certainty. A check that finds the machine in G, called a false alarm, returns the machine to operation (in G) after an interval of length g. A check that finds the machine in B, called a true alarm, re-sets it to "as-new" condition - or, equivalently, replaces it by a new (identical) machine - after an interval of length b. This event is called a renewal. Typically g < b, since renewal often requires fixing or replacing something, while checking when the machine is in G may only require a brief inspection to ascertain that it is in fact in G.

g) The process (observations, failures, checking, etc.), which we call "monitoring", continues indefinitely, with successive failure times assumed to be independent and identically distributed (I.I.D.).

All of these assumptions, of course, must eventually be relaxed to conform to the realities of any actual processes under consideration. On the other hand, the important underlying relations among performance measures, and their dependence upon machine parameters and checking strategies, are exhibited using this set of simplifying assumptions.

Our fundamental problem, then, is to determine a policy, i.e. when to check, knowing the complete history of the process, including its age, the elapsed time since the last check and the values of all observations made to date. There are two competing concerns that underlie the determination of an effective policy: checking soon enough so that the machine does not operate too long in condition B, while not checking so often that the machine is shut down unnecessarily. In other words, an economic (or other) advantage is gained when a policy raises an alarm that detects the occurrence of B soon after it happens.
However, it is also desirable to avoid costly false alarms that result in shutting the machine down to check it when it is still in condition G. Trading off (or constraining) the costs of delayed failure detection and false alarms is the basis of most quality control and control chart procedures developed over the past sixty

years [see Shewhart (1931), Roberts (1966), Johnson and Leone (1962), Montgomery (1980), Lorenzen and Vance (1986), for a historical perspective and related formulations]. The most common approaches to "optimizing" these procedures [e.g. Moskowitz, Plante and Chun (1989), Saniga (1989)] use economic models that explicitly incorporate costs ascribable to false alarms and delayed checking. Such trade-off analyses and economic approaches depend upon specific policy structures that, although they have intuitive appeal and lead to easy computation or evocative charting methods, can be inefficient or arbitrary in nature. More important, such methods often essentially ignore what is known about the machine's failure time distribution. In contrast, we explicitly make use of this distribution to present and evaluate a checking policy that is "optimal" according to a broad set of criteria.

Our approach is also motivated by a natural inclination to develop checking policies that become, in appropriate limits, those that have been shown in the literature to be optimal for the well-studied situations where either no observations or perfect observations are made. The former is usually analysed under the rubric of "optimal maintenance policies" [see, for example, Barlow et al. (1963)]; the latter has been the subject of "optimal replacement policies" [see, for an early example, Page (1954)]. These special cases "bracket" the capabilities of any realizable system.

2. Performance Measures

The use of any checking policy ultimately results in performance measures of interest to decision makers. Before one can find effective checking policies, then, it is important to define these measures and understand the relations among them. We choose to list these measures in three general categories, defined below, along with parameters and variables that will be used in our analysis.

Type One (False Alarm) Measures.
False alarms are costly since resources are used to process each alarm, and production time is often lost as well. Measures that serve as proxies for these costs include:

r_f = false alarm rate = the expected number of false alarms per unit time,

p_f = fraction of total time spent processing false alarms,

μ = expected number of false alarms until failure,

ARL (sometimes denoted ARL_0) = the expected machine operating time until a check, given the machine starts and remains in G.

Type Two (Late Detection) Measures.

Being slow to stop a machine that is in B leads to the production of bad parts or simply to lost production. Measures of the cost of such delayed failure detection include:

D = (random variable) time between failure and the next check,

δ = expected detection time = E(D), also called EDD, "expected delay in detection" - see Marcellus (1993),

r_t = true alarm rate = the expected number of true alarms per unit time,

p_t = fraction of total time spent processing true alarms,

ARL_1 = the expected machine operating time until a check, given the machine starts and remains in B.

We note here that, in spite of their popularity in the literature (and in practice), the use of the average run length measures ARL and ARL_1 is questionable for two reasons: a) the hypothetical situation (required for computing ARL) where the machine is "forced" to remain in G until the first alarm is hard to justify. Its interpretation is particularly unclear if a policy allows the machine to fail before the first alarm; b) ARL_1 is defined only for the situation where the machine starts and remains in condition B - a situation that is hard to imagine being realized in practice.

This shortcoming has been pointed out before by Woodall (1986), Svoboda (1991) and others.

Composite Measures.

Other performance measures of operational interest can be expressed as simple functions of Type One and Type Two measures. In these definitions, the term total time means operating time plus checking time.

p_s = fraction of total time the machine is in B and producing scrap,

p_B = fraction of total time the machine is in B (either producing scrap or being replaced),

p_G = fraction of total time the machine is in G (i.e. producing usable parts),

r = total alarm rate = the expected number of alarms of any type per unit time,

p_o = fraction of total time spent processing alarms of any type.

One can combine performance measures to obtain the overall cost per unit time. Computing this cost rate, however, using the simplest linear model, requires the following parameters:

K_f = fixed cost per false alarm,

V_f = cost per unit time spent processing a false alarm, including lost production,

K_t = fixed cost per true alarm,

V_t = cost per unit time spent processing a true alarm (i.e. when the machine is stopped and found to be in B), including lost production,

V_d = cost per unit time of producing scrap while in condition B, including lost production.

The resulting total cost per unit time, C_T, can then be expressed as:

C_T = K_f r_f + V_f p_f + K_t r_t + V_t (p_B - p_s) + V_d p_s    (1)
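The linear cost model of equation (1) is a direct sum of rates times coefficients, and can be sketched in a few lines. All numerical values below are hypothetical, chosen only to illustrate the computation (note that p_B - p_s = p_t, the fraction of time processing true alarms):

```python
# Illustrative computation of the linear cost rate C_T of equation (1).
# Cost coefficients and performance-measure values are hypothetical.

def cost_rate(Kf, Vf, Kt, Vt, Vd, rf, pf, rt, pB, ps):
    """C_T = Kf*rf + Vf*pf + Kt*rt + Vt*(pB - ps) + Vd*ps."""
    return Kf * rf + Vf * pf + Kt * rt + Vt * (pB - ps) + Vd * ps

# Hypothetical operating point: one false alarm per 100 time units, etc.
ct = cost_rate(Kf=50.0, Vf=10.0, Kt=200.0, Vt=25.0, Vd=5.0,
               rf=0.01, pf=0.01, rt=0.002, pB=0.006, ps=0.004)
print(ct)  # 0.5 + 0.1 + 0.4 + 0.05 + 0.02 = 1.07
```

As the text notes, minimizing C_T only requires the ratios of these coefficients, but even those are often hard to elicit in practice.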

Computing C_T requires knowing all these cost coefficients (minimizing C_T requires knowing at least their ratios), which in many cases are difficult (if not impossible) to obtain. For this reason, and because the method we propose can be implemented without the availability of these cost coefficients, we do not directly pursue cost-minimization. Our approach concentrates, instead, on computing non-cost performance measures and using them to choose among various checking policies. Moreover, the policy we produce can be readily shown to provide a decision rule that minimizes C_T.

3. System Operating Characteristics

We are ultimately concerned with providing decision makers with checking policies that: a) are easy to understand and implement; b) readily allow input of, and sensitivity analysis with respect to, important parameters, including the moments of the machine failure time, and the discriminatory capabilities of various signal observation devices; and c) do not require an explicit assessment of the hard-to-estimate cost coefficients listed above.

The means by which we present the consequences of using any particular monitoring policy is the System Operating Curve or System Operating Characteristic (SOC) [see Pollock (1964) and Rapoport et al. (1979) for early development]. The SOC is a simple graphical plot involving two axes - one showing a Type One measure, the other a Type Two (or composite) measure. A single point on the SOC represents a pair of performance measures attainable by using a particular checking policy with a machine characterized by specified parameter values. A family of such points represents the range of output measures attainable by changing one or more policy variables available to the decision maker.

For example, a SOC could be a plot of p_s (the fraction of time the machine is producing scrap) versus p_o (the fraction of time spent checking alarms of any type). Consider a set of

such SOC's, one for each of three different hypothetical monitoring situations A, B and C, as shown in Figure 1. Each curve represents the set of operating points (i.e., values of p_s and p_o) achievable by using different values of a particular policy variable.

[Figure 1: Operating Characteristics (OC) for three hypothetical procedures A, B and C. Vertical axis: p_s = fraction of time producing scrap; horizontal axis: p_o = fraction of time processing alarms.]

Assuming that the costs of implementing A, B and C are all the same, situation C is clearly better than A or B, since it has a lower p_s for any given p_o or a lower p_o for any given p_s. (Perhaps C allows the monitoring of signals that are not available to either A or B). If only A or B were available, A would be preferred to B when the advantage of achieving early detection (i.e. small p_s) is more important than the disadvantage of having a large fraction of time spent checking alarms.

The SOC is similar to the Receiver Operating Characteristic (ROC) used in telecommunication and signal detection theory, and is also related to the power curve of simple hypothesis testing. The word system is used to emphasise the fact that it is a combination of the machine, the monitoring device and a checking policy that is being represented. Note that equation (1) can be used to directly produce values of C_T for any point on the SOC curve, given appropriate cost coefficients.

[Figure 2: A single cycle with n checks (n - 1 of which are false alarms). Arrows indicate checking actions and renewals; T = operating time until failure, which occurs at time T + (n - 1)g; D = detection time; t_i = time of the ith check (a false alarm for i < n); t_n = time condition B is detected; and t_n + b is the time the machine re-enters condition G (which ends the cycle).]

4. Relationships Among Performance Measures

We now present general relations among the performance measures that exist for any reasonable checking/monitoring procedure. To obtain these, we define a cycle to be the time between renewals. Figure 2 shows a cycle that contains n checks at times t_1, t_2, ..., t_n. There are n - 1 checks that find the machine in G (i.e., there are n - 1 false alarms), each requiring time g. The last check in the cycle finds the machine in B and takes time b. Since the machine is stopped during checks of any type, in any cycle the operating time until failure, T, differs from the total time until failure, which equals T plus the time spent checking (while in G) prior to failure.

From Figure 2 we see that L, the expected cycle length when there are n checks (hence n - 1 false alarms), is

L = E(T) + g(n - 1) + δ + b.    (2)

As previously defined, μ is the expected number of false alarms until failure (also called

EFA). An elementary use of the fundamental renewal theorem gives

r_f = μ / [E(T) + gμ + δ + b]    (3)

and

r_t = 1 / [E(T) + gμ + δ + b].    (4)

The ratio of these two equations gives

μ = r_f / r_t.    (5)

Solving for δ in equation (4) gives

δ = 1/r_t - E(T) - gμ - b = 1/r_t - E(T) - g(r_f/r_t) - b.    (6)

The expected time the machine spends in B is δ + b. Since true alarms occur at a rate r_t,

p_B = (δ + b) r_t.    (7)

The times to process a false alarm and a true alarm are g and b, respectively. Since these occur at rates r_f and r_t,

p_f = g r_f,    (8)

p_t = b r_t.    (9)

Finally, by definition,

r = r_f + r_t,    (10)

p_G + p_B = 1,    (11)

p_s = p_B - p_t,    (12)

p_o = p_t + p_f.    (13)

Relations (2) through (13) hold for any checking procedure. Thus having computed any pair of Type One and Type Two measures (such as r_t and r_f), and knowing E(T), b, and g, allows the calculation of δ, p_B, p_G, p_t and p_f, etc.
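The chain of substitutions from (r_f, r_t) to the remaining measures can be sketched directly. The input values below are hypothetical; only the relations (5)-(13) are from the text:

```python
# Derive the remaining performance measures from a Type One / Type Two pair
# (rf, rt) plus E(T), g and b, using relations (5)-(13).

def derived_measures(rf, rt, ET, g, b):
    mu = rf / rt                         # (5) expected false alarms until failure
    delta = 1.0 / rt - ET - g * mu - b   # (6) expected detection delay
    pB = (delta + b) * rt                # (7) fraction of time in B
    pf = g * rf                          # (8) fraction of time on false alarms
    pt = b * rt                          # (9) fraction of time on true alarms
    r = rf + rt                          # (10) total alarm rate
    pG = 1.0 - pB                        # (11)
    ps = pB - pt                         # (12) fraction of time producing scrap
    po = pt + pf                         # (13) fraction of time processing alarms
    return dict(mu=mu, delta=delta, pB=pB, pf=pf, pt=pt, r=r, pG=pG, ps=ps, po=po)

# Hypothetical inputs: E(T) = 150, g = 1, b = 5, rf = 0.02, rt = 0.005.
m = derived_measures(rf=0.02, rt=0.005, ET=150.0, g=1.0, b=5.0)
print(m["mu"], m["delta"])  # 4 expected false alarms, delta = 200-150-4-5 = 41
```

This is exactly the "compute any pair, derive the rest" usage described above.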

For expositional simplicity, in the remainder of this paper we set the lengths of time needed to perform checks to unity, i.e. b = g = 1, since computing results for arbitrary b and g values (even when they are zero) is straightforward, as shown in Appendix C.

5. The Basic Monitoring Process

We now present additional assumptions and notation needed to define the monitoring process, determine an optimal checking policy, and compute performance measures that result from its use. Let

C_t = condition of the machine at time t,

so that

Φ(t) = the cumulative distribution for the failure time T = prob.{C_t = B | C_0 = G},

i = observation number, i = 1, 2, ...,

τ_i = time at which the ith observation is made.

Since observations (and therefore checking opportunities) are limited to the times τ_i, it is useful to define the function f(i), the probability that failure occurs between the (i - 1)st and ith observation, so that

f(i) = Φ(τ_i) - Φ(τ_{i-1}),  i = 1, 2, 3, ...,    (14)

where τ_0 = 0. The observation made at time τ_i is the random variable X_i, having a p.d.f. f_X(·) that depends upon the machine condition as follows:

f_X(x) = p(x) if C_{τ_i} = G,  i = 1, 2, ...;
f_X(x) = q(x) if C_{τ_i} = B,  i = 1, 2, ....    (15)

The random vector X_n of observations is defined as X_n = (X_1, X_2, ..., X_n), and its realization is x_n = (x_1, x_2, ..., x_n). Any checking policy can be viewed as a decision rule that determines whether or not to check the machine at time τ_n given the set of observations x_n. For example, using the ordinary Shewhart chart (see, e.g. Montgomery [1991]), the decision is based upon only the last observation x_n: if this value falls outside pre-determined control limits then a check (i.e. a "search for an assignable cause") is made. The control limits are policy variables, and varying them will produce an associated SOC. For Shewhart charts with supplementary runs tests, if K out of the last N observations fall within a pre-specified zone, then a check is made. Here, K and N and the control limits are policy variables that, when varied, will produce the associated SOC. Other policies, such as those using CUSUM or EWMA charts, make use of various functions of some (or all) of the observations x_n. It is important to note, however, that these policies are in some sense ad hoc. By contrast, we introduce a class of policies based upon certain optimality conditions.

6. The Probability Threshold Rule (PTR)

A particularly attractive form of checking policy can be based upon a simple proposition: since the observations at times τ_1 through τ_n provide information about machine condition, this information can be used to "update" the probability that C_{τ_n} = B. Specifically, by defining

P_n(x_n) = prob.{C_{τ_n} = B | C_0 = G, X_1 = x_1, X_2 = x_2, ..., X_n = x_n},

we can use this probability as the fundamental element in the definition of the
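The Shewhart decision rule described above uses only the most recent observation. A minimal sketch (control-limit values here are hypothetical policy variables, not values from the text):

```python
# Minimal sketch of the ordinary Shewhart decision rule: check whenever the
# latest observation x_n leaves the control band. The center line and
# half-width are the policy variables; the values used below are hypothetical.

def shewhart_check(x_n, center=0.0, half_width=3.0):
    """Return True (check, i.e. search for an assignable cause) iff x_n
    falls outside the pre-determined control limits."""
    return abs(x_n - center) > half_width

print(shewhart_check(1.2))   # inside the limits: do nothing
print(shewhart_check(-3.5))  # outside the limits: check
```

Varying `center` and `half_width` traces out the associated SOC; note that, unlike the PTR developed next, this rule discards all observations before the last one.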

Probability Threshold Rule (PTR): Check when P_n(x_n) first equals or exceeds some threshold probability p*.

If Φ(t) is geometric, it is well known (e.g., see Girshick and Rubin [1952], Shiryaev [1963], or Pollock [1965] for early references, or Pollak [1987] for a more recent one) that the PTR is "optimal" in that it allows one to select the policy variable p* so as to minimize either: a) the expected detection time δ for a given value of false alarm rate r_f; or b) the false alarm rate r_f for a given value of expected detection time δ; or c) the cost per unit time, C_T, given by equation (1).

From an operational point of view this means that, given a particular value of p*, a decision maker can achieve the associated performance level shown as a point on the SOC. Or, equivalently, from an analysis or strategic point of view, given the ability to efficiently compute P_n(x_n), the decision maker can vary values of p* to explore trade-offs among various performance measures. For example, if curve C of Figure 1 represents a typical SOC associated with using a PTR policy: large p_s and small p_o are produced when p* → 1, and small p_s and large p_o are produced when p* → 0. Note that the PTR has only one "free" parameter, p*.

7. Computing P_n(x_n)

The computation of P_n(x_n) follows from a straightforward use of the definition of conditional probability:

P_n(x_n) = prob.{T ≤ τ_n | X_n = x_n} = prob.{T ≤ τ_n, X_n = x_n} / prob.{X_n = x_n},    (16)

where

prob.{T ≤ τ_n, X_n = x_n} = Σ_{j=1}^{n} f(j) Π_{i=1}^{j-1} p(x_i) Π_{k=j}^{n} q(x_k)    (17)

and

prob.{X_n = x_n} = prob.{T ≤ τ_n, X_n = x_n} + Σ_{j=n+1}^{∞} f(j) Π_{i=1}^{n} p(x_i).    (18)

Substituting (17) and (18) into (16), and dividing numerator and denominator by Π_{i=1}^{n} p(x_i), gives

P_n(x_n) = [Σ_{j=1}^{n} f(j) Π_{k=j}^{n} L(x_k)] / [Σ_{j=1}^{n} f(j) Π_{k=j}^{n} L(x_k) + F(n)],    (19)

where

L(x_i) = q(x_i)/p(x_i)    (20)

is the likelihood ratio for condition B given X_i = x_i, and

F(n) = Σ_{i=n+1}^{∞} f(i) = prob.{T > τ_n}.

Although computing P_n(x_n) directly from equation (19) is straightforward, it is advantageous, instead, to use the "odds in favor of condition B" (also called the odds ratio)

R_n(x_n) = P_n(x_n) / [1 - P_n(x_n)].    (21)

(For notational convenience, the argument x_n is suppressed for the remainder of this paper, e.g., R_n(x_n) is written as R_n.) This odds ratio can be obtained directly from equation (19) as

R_n = [F(n)]^{-1} Σ_{j=1}^{n} f(j) Π_{k=j}^{n} L(x_k).    (22)

This allows a recursive representation for R_n:

R_n = [L(x_n)/F(n)] [F(n - 1) R_{n-1} + f(n)],    (23)

which can be confirmed by substitution into equation (22). Equation (23) is an excellent way to compute R_n, and thus P_n = R_n/(1 + R_n), since R_n is calculated from the previously obtained R_{n-1} and each new observation x_n by means of a simple addition and multiplication.
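The recursion of equation (23) is a one-line update. The sketch below uses a hypothetical geometric failure time (so that f and F have closed forms) and hypothetical Normal observation densities with a mean shift under condition B; none of these distributional choices comes from this section:

```python
# Recursive odds update of equation (23):
#   R_n = [L(x_n)/F(n)] * [F(n-1)*R_{n-1} + f(n)],  L(x) = q(x)/p(x).
from math import exp, sqrt, pi

def normal_pdf(x, mean, sd=1.0):
    return exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * sqrt(2.0 * pi))

def update_odds(R_prev, x, n, f, F, p, q):
    """One step of equation (23)."""
    L = q(x) / p(x)
    return (L / F(n)) * (F(n - 1) * R_prev + f(n))

# Hypothetical geometric inter-observation failure probability a:
a = 0.05
f = lambda n: a * (1 - a) ** (n - 1)      # prob. failure in (tau_{n-1}, tau_n]
F = lambda n: (1 - a) ** n                # prob. no failure by tau_n
p = lambda x: normal_pdf(x, mean=0.0)     # observation density under G
q = lambda x: normal_pdf(x, mean=2.0)     # observation density under B

R = 0.0                                   # R_0 = 0: machine starts in G
for n, x in enumerate([0.1, 2.3, 1.8], start=1):
    R = update_odds(R, x, n, f, F, p, q)
P = R / (1.0 + R)                         # posterior probability of condition B
```

Each observation costs one multiplication and one addition, which is the computational appeal of (23) over the direct sum (19).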

Equation (23) clarifies the challenge of computing performance measures associated with the PTR. In particular, consider computing when P_n first equals or exceeds p* (at which time the PTR produces a check). This is equivalent to finding the first time that R_n equals or exceeds the "odds threshold"

R* = p* / (1 - p*).    (24)

When x_{n+1} is replaced by the r.v. X_{n+1}, we see that equation (23) can be viewed as the generator of a Markov Process R_n. This process has as a state space the non-negative real line R+, with transitions at observation times τ_n governed by the stochastic behavior of X_n, which in turn is governed by the p.d.f.s of equation (15). When an observation X_{n+1} = x_{n+1} is made, either (a) R_{n+1} is less than the threshold R*, and the process continues; or (b) R_{n+1} equals or exceeds R* and a check is made.

8. Performance Measures for Two Limiting Cases

In this section we "bracket" the performance of the PTR for any realizable monitoring system by computing performance measures for two limiting, worst-case and best-case, extremes of monitoring.

No Observations. As a "worst case" bound on the SOC, we can consider the limiting situation where observations provide no information, which is equivalent to having p(x_i) = q(x_i), so that L(x_i) = 1, for i = 1, 2, .... In this case, equation (23) reduces to

R_n = [F(n)]^{-1} [F(n - 1) R_{n-1} + f(n)],    (25)

which is a deterministic difference equation with solution (given boundary condition R_0 = 0):

R_n = [1 - F(n)] / F(n).    (26)

From the definition of R_n in equation (21), this gives

P_n = 1 - F(n) = Φ(τ_n),    (27)

a result that holds for any F(n). Clearly, observations of x have no effect on P_n, which is simply the cumulative distribution for the failure time T evaluated at T = τ_n.

Two performance measures that can be readily calculated are μ and δ. By definition of the PTR, the (deterministic) time to first check, t_1, is the smallest monitoring time such that the threshold probability is exceeded, i.e.

t_1 = min_{t ∈ {τ_i}} {t : Φ(t) ≥ p*}.    (28)

Similarly, the time of the jth check, t_j, can be shown to be

t_j = min_{t ∈ {τ_i}} {t : [Φ(t) - Φ(t_{j-1})] / [1 - Φ(t_{j-1})] ≥ p*;  t > t_{j-1}},    (29)

where t_0 = 0. The computation of μ and δ becomes straightforward if we assume that the inequalities of expressions (28) and (29) are satisfied as equalities. This situation holds if the checking times t_j are not constrained to be in the set {τ_j} - a reasonable assumption if the monitoring provides no information. In this case, the t_j are easily shown to be given by

Φ(t_j) = 1 - (1 - p*)^j.    (30)

The probability that the failure time T is between the (j - 1)st and the jth check, i.e. that {t_{j-1} ≤ T < t_j}, is Φ(t_j) - Φ(t_{j-1}). Since all previous checks will have produced false alarms, we have

μ = Σ_{j=1}^{∞} (j - 1)[Φ(t_j) - Φ(t_{j-1})] = 1/p* - 1.    (31)

The time late, given t_{j-1} ≤ T < t_j, is t_j - T, so by simple decomposition of expectations

δ = Σ_{j=1}^{∞} ∫_{t_{j-1}}^{t_j} (t_j - t) φ(t) dt = Σ_{i=1}^{∞} p*(1 - p*)^{i-1} Φ^{-1}[1 - (1 - p*)^i] - E(T).    (32)
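With no informative observations the PTR reduces to checking at the deterministic times given by equation (30), t_j = Φ^{-1}[1 - (1 - p*)^j]. A sketch using an exponential failure-time distribution, an assumption made here purely for illustration (the relations hold for any Φ); for the memoryless exponential the resulting checks turn out to be equally spaced:

```python
# Check times of equation (30) under an exponential failure time
# Phi(t) = 1 - exp(-lam*t): invert Phi at the levels 1 - (1-p*)^j.
from math import log

def check_times(p_star, lam, n_checks):
    return [-log((1.0 - p_star) ** j) / lam for j in range(1, n_checks + 1)]

p_star, lam = 0.5, 0.01          # hypothetical threshold and failure rate
ts = check_times(p_star, lam, 4)

# The inter-check gaps, which are constant for the exponential case:
gaps = [ts[0]] + [ts[j] - ts[j - 1] for j in range(1, len(ts))]

mu = 1.0 / p_star - 1.0          # equation (31): expected false alarms until failure
```

With p* = 0.5 the policy tolerates, on average, exactly one false alarm before failure; smaller p* checks more often and drives μ up as 1/p* - 1.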

This result is equivalent to similar ones contained in the literature, and serves as the basis for most cost-minimizing maintenance policies, for example those in Barlow and Proschan [1965].

Perfect Monitoring. The limiting case of "perfect" information represents another special situation. In this case the supports of p(x_i) and q(x_i) are disjoint: L(x_i) = 0 for all i such that τ_i < T, and L(x_i) = ∞ for all i such that τ_i > T. From equation (19), we see that any non-zero threshold is exceeded for all n such that τ_n > T. In this case

δ = E_T[min_{t ∈ {τ_i}} (t - T : t > T)],    (33)

and the machine is checked only when it fails. From this, it is trivial to show that the true alarm rate is r_t = 1/(E(T) + b), the false alarm rate is r_f = 0, the fraction of time spent processing true alarms is p_t = b/(E(T) + b), and the fraction of time processing false alarms is p_f = 0.

9. Behavior of R_n for Geometric Failure Time Using the PTR

Unfortunately, computation of performance measures for the PTR is extremely difficult for general failure time distributions Φ(t). (For a recent example dealing with a uniform distribution, see Wang [1995].) However, suppose the failure time T has the geometric p.m.f. with expected value E(T) = 1/h,

φ(t) = h(1 - h)^{t-1},  t = 1, 2, 3, ...,    (34)

so that the cumulative distribution is

Φ(t) = 1 - (1 - h)^t,  t = 1, 2, 3, ....    (35)

Furthermore, assume that observations are made at equal intervals, i.e. at times τ_1 = s, τ_2 = 2s, τ_3 = 3s, .... (This assumption is often consistent with machine monitoring practice. However, the determination of a "good", much less "optimal", interval - or, if not at equal

intervals, the set of planned observation times that should be used - is still very much an open research question.) The associated probability that a failure will occur between observations (i - 1) and i is readily shown to be

f(i) = a(1 - a)^{i-1},  i = 1, 2, 3, ...,    (36)

with

F(i) = (1 - a)^i,  i = 1, 2, 3, ...,    (37)

where

a = 1 - (1 - h)^s.    (38)

Using this distribution, equation (23) reduces to:

R_n = ℓ(x_n)[R_{n-1} + a],    (39)

where

ℓ(x) = L(x)/(1 - a)    (40)

can be interpreted to be a "normalized" likelihood ratio. In this case the conditions for continuing and checking, respectively, using equation (39), are:

a) if ℓ(x_{n+1}) < R*/(R_n + a), continue; and

b) if ℓ(x_{n+1}) ≥ R*/(R_n + a), check.

The absorption behavior of this process, and in particular the distribution of the (random variable) time until R_n first equals or exceeds the odds threshold R*, has a long and important history of study (see Shiryaev [1978]), resulting in computational (as contrasted to structural) methods for limiting cases or approximations (e.g. Pollak [1985]). In the sections that

follow, we present a Markov Chain approximation that extends these results, and provides an efficient method for computing performance measures of interest.

10. Markov Process Representation for Geometric Failure Time

All performance measures of interest can be obtained by computing the steady state probabilities of a mixed continuous-discrete state Markov Process MPR, created by combining R_n with the machine condition C_n. MPR is defined such that:

a) Transitions occur immediately after observations and any associated checking;

b) The state at the end of the nth transition is denoted as S_n ∈ S, n = 1, 2, ...;

c) The state space S is the union of five sub-spaces: three each containing a single state, and two each containing elements that are mixed continuous-discrete states. These sub-spaces of S are:

S_G = {(R, G): R ∈ (0, R*)}, the set of states where R_n is between 0 and R*, and C_n = G;

S_B = {(R, B): R ∈ (0, R*)}, the set of states where R_n is between 0 and R*, and C_n = B;

S_0, the renewal state: the state entered after the machine is renewed, i.e., when R_n = 0 (or, equivalently, P_n = 0) and C_n = G;

S_G* = {(R*, G)}, the false alarm state: the state entered after checking while the machine is in G, i.e., when R_n ≥ R* and C_n = G;

S_B* = {(R*, B)}, the true alarm state: the state entered after checking while the machine is in B, i.e., when R_n ≥ R* and C_n = B.

Given the distribution of the random variable X_{n+1} shown by equation (15) and the geometric failure time distribution of equation (36), the transition probabilities among the states in these sets are governed by the evolution of R_n described by equation (39). Having a geometric distribution of failure time is equivalent to having a probability of failure between successive observations that does not depend on the machine's age (except after failure, where this probability is zero); this is the key to establishing the Markovian properties of MPR.

We note some of the properties of the Markov Process MPR:

a) it is ergodic, since there is a single closed communicating class of states;

b) the probability of transition from S_G* or S_B* to S_0 is 1, reflecting the one transition (since b = g = 1) needed to check after a false alarm or a true alarm;

c) due to the geometric failure time distribution of equation (35), the single-step transition probability is a for transitions:

i) from the set S_G to the set S_B ∪ S_B*;

ii) from the state S_0 to the set S_B ∪ S_B*.

The steady-state probabilities for the singleton states are defined as

π_0 = lim_{n→∞} prob{S_n = S_0},

π_G* = lim_{n→∞} prob{S_n = S_G*},

π_B* = lim_{n→∞} prob{S_n = S_B*}.

In addition, we define the steady state cumulative distribution functions for the sets S_G and S_B as

Π_G(r) = lim_{n→∞} prob{S_n ∈ [(t, G): t ≤ r]},  0 < r < R*,

Π_B(r) = lim_{n→∞} prob{S_n ∈ [(t, B): t ≤ r]},  0 < r < R*,

and define Π_G(0) = Π_B(0) = 0, Π_G(R*) = lim_{ε→0} Π_G(R* - ε), and Π_B(R*) = lim_{ε→0} Π_B(R* - ε).

Performance measures are easily obtained from these steady state probabilities and distributions. In particular, by appealing to the ergodic theorem for Markov Processes, we know that π_0 = the fraction of time the process is in the renewed state. Thus,

p_o = p_t + p_f = π_0,

since MPR is in S_0 for exactly one time unit per check. Similarly,

p_f = π_G* = the expected fraction of time the process is in the false alarm state,

p_t = π_B* = the expected fraction of time the process is in the true alarm state.

Given these values for p_f and p_t, the analysis in Section 4 and Appendix C allows computation of all other measures of interest.

The equations needed to calculate π_G* and π_B* are given in Appendix A. In particular, equations (A4) and (A5) are special cases of the Fredholm equation of the second kind, which has a long history of theoretical study (e.g., Groetsch (1984) or Brunner (1982)) and of numerical means of solution (Schippers (1983)). Indeed these equations have an analogue to those developed by Pollak [1987] to compute an ARL_0-type measure. However, as Pollak notes, a solution method is lacking for even the simplest forms of monitoring distributions p(x) and q(x). The computational literature is problem-specific and essentially suggests using transformation of variables and discretization approximations tailored to the problem at hand.

11. Markov Chain Approximation

We now present an approximation useful for finding performance measures for specific cases of p(x) and q(x), by showing that the process MPR, which has continuous elements in its state space, can be approximated by a Markov Chain ("MCR") with a discrete state space. To construct MCR, we identify values of the odds ratio R_n that lie in the interval (0, R*), but are restricted to the finite set

S1 = {r_1, r_2, ..., r_{m-1}}.

The key to this restriction is to find a set of r_i values that "cover" the interval (0, R*) in such a way that the sums over probabilities in S1 well approximate the integrals over S implicit in equations (A2) to (A5). (Obtaining such a set of r-values is shown, for example, in Section 12 for Bernoulli monitoring; at this point we will assume that S1 is available.) Given the finite elements of S1, we define a finite state space R for MCR:

R = S_0 ∪ S_G* ∪ S_B* ∪ R_G ∪ R_B.

The first three sub-spaces are the singletons previously defined in Section 10 (corresponding to the renewal, false alarm and true alarm states, respectively), and the last two are

R_G = {(r_i, G): r_i ∈ S1},  i = 1, 2, ..., m - 1,

R_B = {(r_i, B): r_i ∈ S1},  i = 1, 2, ..., m - 1.

Thus R_G represents a discrete subset of R_n values while the machine is in G, and R_B represents a discrete subset of R_n values while the machine is in B. A simple arbitrary numbering of states allows us to represent MCR as a (2m + 1)-state ergodic Markov Chain, which we will refer to as "MC", with

σ_n = the state of MC after the nth transition;

state space I_{2m+1} = {0, 1, 2, ..., 2m}; and

transition matrix P with elements

[P]_ij = p_ij = prob{σ_n = j | σ_{n-1} = i},  for i, j ∈ I_{2m+1}, n = 1, 2, ....

Details of the relation between MCR and MC, and of finding the values of p_ij, are contained in Appendix B. Given P, and defining

π_i = lim_{n→∞} prob{σ_n = i},  i = 0, 1, ..., 2m,

the steady-state probability vector π = {π_0, π_1, ..., π_2m} can be obtained by solving the set of linear equations:

π = πP    (41)

1 = π1    (42)

where 1 = {1 1 1 ... 1}^t is the transpose of the unit (2m + 1)-vector. Performance measures can be immediately obtained from these equations, since π_0 is immediately given, and π_G* = π_m and π_B* = π_2m.

The next sections show specific results for two examples: monitoring Bernoulli and Normal observations.

12. Markov Chain MCR for Bernoulli Observations and Geometric Failure Time

The "discretization" of MPR to MCR depends in general on the monitoring distributions p(x) and q(x). This section deals with Bernoulli observations, by which we mean that the observations X_n ∈ {0, 1}, n = 1, 2, ..., and

p(x) = 1 - α if x = 0;  α if x = 1,

q(x) = β if x = 0;  1 - β if x = 1.

This situation is a form of classical hypothesis testing: x = 0 is "evidence" of condition G (e.g. no defect in an observed manufactured product) and x = 1 is evidence of condition B
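Solving equations (41)-(42) is a standard linear-algebra step: replace one of the balance equations π = πP with the normalization π1 = 1. A sketch using a small hypothetical 3-state matrix as a stand-in for the (2m + 1)-state chain MC (only the solution method is the point; numpy is assumed available):

```python
# Steady-state vector of an ergodic chain: solve pi = pi P together with
# the normalization pi 1 = 1, as in equations (41)-(42).
import numpy as np

def steady_state(P):
    n = P.shape[0]
    # First n-1 balance equations (P^T - I) pi^T = 0, plus sum(pi) = 1:
    A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)

# Hypothetical 3-state ergodic transition matrix:
P = np.array([[0.90, 0.08, 0.02],
              [0.00, 0.50, 0.50],
              [1.00, 0.00, 0.00]])
pi = steady_state(P)
```

For the actual chain MC one would build the (2m + 1) x (2m + 1) matrix P of Appendix B and read off π_0, π_m and π_2m from the solution.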

(e.g. a defect is observed). Thus α is analogous to an "error of the first kind," and β to an "error of the second kind." The likelihood ratio is

   L(x) = β/(1 − α) if x = 0,
   L(x) = (1 − β)/α if x = 1.

By defining

   w_0 = β / [(1 − α)(1 − a)],
   w_1 = (1 − β) / [α(1 − a)],

equation (39) can be written

   R_n = w_0(R_{n−1} + a) if x_n = 0,
   R_n = w_1(R_{n−1} + a) if x_n = 1.   (43)

For a realistic problem:

a) the value of a (the probability that the machine goes from G to B on any transition) is usually quite small;

b) α and β, the "misclassification" errors in a single observation of x, are both less than 0.5, so that α + β < 1.

In order to identify a useful set S1 we can consider the evolution of the R_n process of equation (43) when the process starts with R_0 = 0 (i.e., P_0 = 0). The possible values of R_n that can be generated after the first three observations, assuming none exceed p*, are given in Table 1:

Observation Number n    Possible R_n Values
1                       w_0;  w_1
2                       (a + w_0)w_0;  (a + w_0)w_1;  (a + w_1)w_0;  (a + w_1)w_1
3                       (a + (a + w_0)w_0)w_0;  (a + (a + w_0)w_0)w_1;
                        (a + (a + w_0)w_1)w_0;  (a + (a + w_0)w_1)w_1;
                        (a + (a + w_1)w_0)w_0;  (a + (a + w_1)w_0)w_1;
                        (a + (a + w_1)w_1)w_0;  (a + (a + w_1)w_1)w_1

Table 1: Possible values of R_n after n = 1, 2, 3 observations.

After h observations the maximum number of possible distinct values of R_n that could be generated is clearly 2^{h+1} − 2. However, an important effect reduces this number, which allows a practical means of discretizing MCR to MC: as h becomes large, some values of R_h satisfy R_h ≥ p*, in which case either state S*_G or S*_B occurs and R_{h+1} becomes 0. Thus we can generate S1 = {r_1, r_2, ..., r_{m−1}} by simply computing all possible R_n-values achievable over a "horizon" of h observations of x_n, n = 1, 2, ..., h. The advantages of this approach are:

a) only feasible values of r_i are generated,

b) it actually represents the R_n process for n = 1, 2, ..., h over the horizon h, and

c) the accuracy of the discrete approximation can be improved by increasing h.

An algorithm for generating m and S1 is:

1. Set S_0 = {0, p*}, choose a horizon h ≥ 1, and set n = 1.
2. Generate the set s_0 = {w_0(r_i + a) for all r_i ∈ S_{n−1} such that w_0(r_i + a) < p*}.
3. Generate the set s_1 = {w_1(r_i + a) for all r_i ∈ S_{n−1} such that w_1(r_i + a) < p*}.
4. Set S_n = {r : r ∈ s_0 ∪ s_1 and r ∉ ∪_{i<n} S_i}.
5. Increment n by 1.
6. If n ≤ h then go to step 2.
7. Set S1 = ∪_{i=1}^{h} S_i, m = |S1| + 1, and stop.

At termination S1 will be a set of m − 1 distinct r-values which, when sorted by increasing value, can be labeled r_1, r_2, ..., r_{m−1}. Computational experience (see Section 13) has shown that even though m increases nearly exponentially in h, the accuracy of the discrete approximation, for a wide range of parameters, becomes excellent for h ≤ 10.

The key to converting MCR into MC is identifying, for any possible realized value of R_n that is not an element of S1, the "closest" element to it. Thus the entire evolution of the process, and in particular values of R_n for n > h, can be approximately contained within the states {0, S*_G, S*_B, S1 × G and S1 × B}. A formal procedure for doing this is to define W = {0} ∪ S1 ∪ {p*}, with 0 as its 0th element (i.e., r_0 = 0) and p* as its mth element (i.e., r_m = p*), and define the indices

   J_0(i) = the index of the closest element in W to R_{n+1}, given R_n = r_i and x_{n+1} = 0
          = arg min_{k ∈ {0,1,...,m}} |w_0(r_i + a) − r_k|

and

   J_1(i) = the index of the closest element in W to R_{n+1}, given R_n = r_i and x_{n+1} = 1
          = arg min_{k ∈ {0,1,...,m}} |w_1(r_i + a) − r_k|.

The associated probability transition matrix P can be created (see Appendix B) from the two probability transition submatrices P^G and P^B, corresponding to state transitions placing the machine in G and B, respectively, as shown in Figure 5. The composition of these submatrices is:

   [P^G]_ij = 1 − α if j = J_0(i), i ∈ {0, 1, 2, ..., m−1},
            = α if j = J_1(i), i ∈ {0, 1, 2, ..., m−1},   (44)
            = 0 otherwise, i ∈ {0, 1, ..., m−1}, j ∈ {0, 1, ..., m};

   [P^B]_ij = β if j = J_0(i) + 1, i ∈ {0, 1, ..., m−1},
            = 1 − β if j = J_1(i) + 1, i ∈ {0, 1, ..., m−1},   (45)
            = 0 otherwise, i ∈ {0, 1, ..., m−1}, j ∈ {1, 2, ..., m}.   (46)

13. Numerical Results for Bernoulli Monitoring

To obtain numerical results for Bernoulli monitoring we first generate the set S1 using the algorithm of the preceding section. Figures 6a, 6b, and 6c show how m = |S1| increases with increasing horizon h for various values of α, β, a, and p*. These figures show that the number of states becomes nearly exponential in h only for large p*, small a, and large α and β, conditions that are not likely to be attained in practice. We next compute, using equations (41) and (42), the steady-state probability vector π. Figures 7a and 7b show π_i, i = 0, 1, ..., m, plotted against the values of r_i that were generated by the algorithm, for α = β = 0.2, a = 0.01, and p* = 0.2.
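The state-generation step is compact enough to sketch directly. The fragment below is our illustration, not the paper's code; the parameter values are illustrative, and w_0, w_1 are as defined for equation (43).

```python
# Sketch (ours, not the paper's code) of the S1-generation algorithm of
# Section 12, for Bernoulli monitoring with illustrative parameters.

def generate_S1(a, alpha, beta, p_star, h):
    """Return the sorted set S1 of reachable odds-ratio values below p*."""
    w0 = beta / ((1.0 - alpha) * (1.0 - a))
    w1 = (1.0 - beta) / (alpha * (1.0 - a))
    seen = {0.0, p_star}          # S_0 of step 1; used to discard repeats
    frontier = set(seen)
    S1 = set()
    for _ in range(h):            # steps 2-6: one pass per observation epoch
        new = set()
        for r in frontier:
            for w in (w0, w1):
                v = w * (r + a)
                if v < p_star and v not in seen:   # keep only feasible new values
                    new.add(v)
        seen |= new
        S1 |= new
        frontier = new
    return sorted(S1)             # r_1 < ... < r_{m-1}, with m = len(S1) + 1

S1 = generate_S1(a=0.01, alpha=0.35, beta=0.35, p_star=0.1, h=5)
m = len(S1) + 1
```

Exact floating-point comparison suffices for a sketch; a production version would deduplicate to a tolerance.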

Figure 5: Schematic representation of the transition matrix P from state i to state j, showing the use of submatrices P^G and P^B. [P^B]_{i≠0} denotes the submatrix created by removing the row associated with i = 0 from P^B. Shaded regions represent zero transition probabilities.

Note the "jumpy" nature of the steady-state probabilities and the "gaps" between the values of r_i. This behavior suggests that a) any solution approach which uses a continuous approximation to the state space would be unwieldy, and b) using a simple evenly spaced "grid" over the r axis to represent the possible r_i values

Figure 6a: The number of states generated increases as the horizon used to generate the r-state space increases (a = 0.1, α = β = 0.35; curves for several values of p*).

Figure 6b: The number of states generated versus the horizon h (a = 0.01).

Figure 6c: The number of states generated versus the horizon h (α = β = 0.1).

Figure 7a: Steady-state probabilities π_i plotted against the generated r_i values under Bernoulli monitoring (a = 0.01, α = β = 0.2).

Figure 7b: Increasing the horizon h from 8 (as in Figure 7a) to 12 (above) introduces some additional r_i values, but has a minor effect on the probabilities associated with the old r_i values.

would be inefficient due to the resulting inclusion of highly unlikely (or even impossible) values of r_i.

Figure 7b shows that increasing the horizon h from 8 to 12 (which increases the number of generated states, and thus the computational effort) has little effect on the probabilities associated with the r_i values. Figure 8 illustrates how computational accuracy (represented by the computed value of π_0) generally stabilizes, and for a wide range of parameter settings produces a maximum absolute percent error of less than 5 percent, for h ≥ 7. For this reason, we use h = 7 for the remaining computational results.

The resulting performance measures can be shown by means of an SOC. In particular, Figures 9a, 9b and 9c each show curves that plot p_S = Pr{producing scrap} versus π_0 = Pr{down for checking} for selected parameter values. Figure 9a shows that for a fairly non-informative sensor (α = β = 0.3), varying p* from 0.01 to 0.1 produces a wide range of operating points (i.e. possible values of p_S and π_0). Figure 9b shows the effect of increasing sensor sensitivity to α = β = 0.2. Figure 9c shows that with an even more sensitive sensor (α = β = 0.1), only a few operating points are possible for p* between 0.01 and 0.5. Indeed, for a = 0.1 there is only one feasible operating point (p_S = 0.00747, π_0 = 0.146) in this range of p*.

Figure 8: Percent change in π_0 as the horizon used to generate the r-state space increases from h − 1 to h. The figure displays 18 curves corresponding to all possible parameter settings such that α = β ∈ {0.15, 0.25, 0.35}, a ∈ {0.01, 0.1}, and p* ∈ {0.1, 0.4, 0.7}. Six of the parameter settings had zero percent change for all h shown. The two worst cases are plotted using diamonds (α = β = 0.35, a = 0.01, p* = 0.4) and triangles (α = β = 0.35, a = 0.01, p* = 0.1).

Figure 9a: System Operating Characteristics under Bernoulli monitoring: Pr{Producing Scrap} versus Pr{System Down}, generated by varying p* from 0.01 to 0.5 in steps of 0.01. There are three curves, one for each a ∈ {0.01, 0.05, 0.1}, with α = β = 0.3 in each case. Any apparent non-convexity of these curves is due to round-off errors in the steady-state probabilities used to calculate Pr{producing scrap} and Pr{system down}. Points associated with the same value of p* are connected by solid lines for p* = 0.01, 0.3 and 0.5.

Figure 9b: System Operating Characteristics of Figure 9a but with α = β decreased from 0.3 to 0.2. Note that increasing the failure rate tends to collapse ranges of smaller p* into one operating point; for example, p* ∈ (0.01, 0.3) produces a wide range of operating points when a = 0.01, but produces only one operating point when a = 0.1. Comparing this figure with Figures 9a and 9c shows this collapse is more pronounced with smaller α and β.

Figure 9c: System Operating Characteristics of Figure 9a with α = β reduced from 0.3 to 0.1. Note the extreme collapse of operating points to just one for all p* ∈ (0.01, 0.5) associated with a high failure rate (a = 0.1) and informative sensors.

Figure 10: System Operating Characteristics with an asymmetric sensor: α = 0.3, β = 0.2. Decreasing β from 0.3 (as in Figure 9a) to 0.2 (above) improves both performance measures.

Figure 11: Decreasing α and β of a symmetric sensor from 0.3 to 0.1 dramatically improves the System Operating Characteristic for scrap production versus system down time.

Figure 10 shows an SOC for an "asymmetric" sensor with α = 0.3 and β = 0.2. Improving the sensor error probability β = Pr{x = 0 | B} from β = 0.3 (shown in Figure 9a) to 0.2 improves both measures plotted on the SOC: decreasing β reduces the detection time as well as the probability the system is down; a lower β gives greater confidence that observing x = 0 implies condition G. This reduces the upward drift of the R_n process for any given set of "zero" observations, which delays the time to a false alarm. Figure 11 compares different sensors when a = 0.1, dramatically showing the advantage of having a more sensitive sensor.

Figure 12 shows an interesting alternative form of an SOC. The two attributes are p_S and Pr{Producing good product} = π_0 + p_G. The latter measure is important since checking

Figure 12: Operating Characteristic Curve for scrap production versus good production. "Better" is towards the point (1, 0): no scrapping and always producing good product (no down time). Note the non-optimal operating points above the dotted line: when a = 0.01, for example, for each p* above 0.26 there exists a p* below 0.26 with the same throughput of good product and a lower scrap rate. Similar "optimality threshold boundaries" for a = 0.05 and a = 0.1 are near 0.32 and 0.36, respectively.

time is taken from production capacity even though less scrap is produced. This SOC shows operating points (above the dotted line) that will never be optimal with respect to these measures. For example, when a = 0.01, there exists for each value of p* above 0.26 another p* below 0.26 that produces the same throughput of good product with a lower scrap rate.

14. Elements of P for Normal Monitoring

For Normal monitoring the observations X_n are independent normally distributed random variables depending only on machine condition. In particular, we assume the distributions of X_n in equation (15) are given by

   p(x) = (1/√(2π)) e^{−x²/2}

   q(x) = (1/√(2π)) e^{−(x−μ)²/2}.

The likelihood ratio is thus L(x) = q(x)/p(x) = e^{μx − μ²/2}, and equation (22) becomes

   R_{n+1} = γ e^{μX_n} [a + R_n],   (47)

where

   γ = e^{−μ²/2} / (1 − a).   (48)

Again, we first assume a set of r-values {r_0 = 0, r_1, r_2, ..., r_{m−1}, r_m = p*} is given. Using the governing equation (47), given R_n = y, the conditional distribution of R_{n+1} is:

   prob{R_{n+1} ≤ r | R_n = y} = prob{γ e^{μX_n}(a + y) ≤ r} = prob{X_n ≤ (1/μ) ln[r / (γ(a + y))]}.

When the machine condition is G, X_n has pdf p(x). Thus the elements of P^G are given by:

   [P^G]_ij = prob{R_{n+1} = r_j | R_n = r_i}
            = prob{R_{n+1} ≤ r_j | R_n = r_i} − prob{R_{n+1} ≤ r_{j−1} | R_n = r_i}
            = Φ((1/μ) ln[r_j / (γ(a + r_i))]) − Φ((1/μ) ln[r_{j−1} / (γ(a + r_i))]),   i, j = 1, 2, ..., m−1;

   [P^G]_mj = 0,   j = 1, 2, ..., m;

   [P^G]_im = prob{R_{n+1} > p* | R_n = r_i} = 1 − Φ((1/μ) ln[p* / (γ(a + r_i))]),   i = 1, 2, ..., m−1;

   [P^G]_0j = Φ((1/μ) ln[r_j / (γa)]) − Φ((1/μ) ln[r_{j−1} / (γa)]),   j = 1, 2, ..., m−1;

   [P^G]_0m = 1 − Φ((1/μ) ln[p* / (γa)]);

where Φ(·) is the standard normal cumulative distribution function.
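The row construction of P^G can be sketched numerically as a check on these formulas (our code, with illustrative parameters; Φ is computed from math.erf). Because the interior entries telescope, each row should sum to nearly one, short only of the small sliver of probability in (r_{m−1}, p*].

```python
import math

# Sketch (ours) of the [P^G] rows for Normal monitoring, using the formulas
# above on an equally spaced grid r_i = i p*/m with illustrative parameters.

def Phi(z):
    """Standard normal c.d.f. via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

a, mu, p_star, m = 0.01, 1.5, 0.2, 50
gamma = math.exp(-mu * mu / 2.0) / (1.0 - a)
r = [i * p_star / m for i in range(m + 1)]   # r_0 = 0, ..., r_m = p*

def z(rj, ri):
    """The argument (1/mu) ln( r_j / (gamma (a + r_i)) ) appearing in Phi."""
    return math.log(rj / (gamma * (a + ri))) / mu

def PG_row(i):
    """Row i of P^G (transitions from r_i while the machine stays in G)."""
    row = [0.0] * (m + 1)
    prev = 0.0                    # Phi at r_0 = 0: ln 0 -> -inf, so Phi = 0
    for j in range(1, m):
        cur = Phi(z(r[j], r[i]))
        row[j] = cur - prev       # mass of the bin (r_{j-1}, r_j]
        prev = cur
    row[m] = 1.0 - Phi(z(p_star, r[i]))   # alarm bin: R_{n+1} > p*
    return row
```

For large m the unassigned sliver vanishes, which is the limit exploited in the continuous re-formulation below.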

Similarly, while the system is in condition B, X_n has pdf q(x), and so

   [P^B]_ij = Φ((1/μ) ln[r_j / (γ(a + r_i))] − μ) − Φ((1/μ) ln[r_{j−1} / (γ(a + r_i))] − μ),   i, j = 1, 2, ..., m−1;

   [P^B]_mj = 0,   j = 1, 2, ..., m;

   [P^B]_im = 1 − Φ((1/μ) ln[p* / (γ(a + r_i))] − μ),   i = 1, 2, ..., m−1;

   [P^B]_0m = 1 − Φ((1/μ) ln[p* / (γa)] − μ).

Solving equations (41) and (42) depends crucially upon obtaining a set S1 such that these linear equations have numerically stable solutions. Because of the peculiar nature of the matrices P^G and P^B, the task of generating such a set is not straightforward. Since X_n is a continuous random variable, however, a re-statement of equations (41) and (42) is possible. This re-formulation allows a solution to be obtained by means of existing techniques from numerical analysis. In particular, we can define the elements r_i to be

   r_i = i p*/m,   i = 0, 1, 2, ..., m,

so that δ = p*/m is the interval between the m + 1 equally spaced points from R_n = 0 to R_n = p*. By taking the limit as δ → 0, this allows the definition of continuous analogues of the steady-state probability vectors. To do this, we define f_g(r) and f_b(r) to be the steady-state probability density functions satisfying:

   π_0 f_g(r) δ = prob{r < R_n ≤ r + δ ∩ system is in condition G}
   π_0 f_b(r) δ = prob{r < R_n ≤ r + δ ∩ system is in condition B}

and define the vectors

   g_i = π_i,   i = 1, 2, ..., m−1
   b_i = π_{i+m}.

Thus f_g(r_i)δ = [Π_g]_i and f_b(r_i)δ = [Π_b]_i. Equation (41) can then be written

   [Π_g]_j = (1 − a) Σ_{i=1}^{m} [Π_g]_i [P^G]_ij + (1 − a) π_0 [P^G]_0j.   (50)

Using the definition of f_g(·), the first set of these becomes:

   f_g(r_j) δ = (1 − a) Σ_{i=1}^{m} f_g(r_i) δ [P^G]_ij + (1 − a) π_0 [P^G]_0j,   j = 1, 2, ..., m−1.   (51)

From the definitions of [P^G]_ij and [P^G]_0j given above, it can readily be shown that, for small δ,

   [P^G]_ij ≈ (δ / (μ r_j)) φ((1/μ) ln[r_j / (γ(a + r_i))])   (52)

and

   [P^G]_0j ≈ (δ / (μ r_j)) φ((1/μ) ln[r_j / (γa)]),

where φ(·) is the standard normal probability density function. Dividing equation (51) by δ and taking the limit as δ → 0 gives

   f_g(r) = (1 − a) ∫_0^{p*} f_g(x) (1/(μr)) φ((1/μ) ln[r / (γ(a + x))]) dx + (1 − a) π_0 (1/(μr)) φ((1/μ) ln[r / (γa)]),   0 < r < p*.   (53)

By a similar argument,

   f_b(r) = ∫_0^{p*} [f_b(x) + a f_g(x)] (1/(μr)) q((1/μ) ln[r / (γ(a + x))]) dx + a π_0 (1/(μr)) q((1/μ) ln[r / (γa)]),   0 < r < p*.   (54)

Equations (53) and (54) are the continuous equivalents of equations (41) and (42). The solution method is to first solve equation (53) for f_g(r), and then solve equation (54) for f_b(r).

Substituting j = m into equation (50) and using the continuous approximations for [P^G]_im and [P^G]_0m, we can derive the alarm probabilities:

   [Π_g]_m = (1 − a) ∫_0^{p*} f_g(x) ∫_{p*}^{∞} (1/(μr)) φ((1/μ) ln[r / (γ(a + x))]) dr dx + (1 − a) π_0 ∫_{p*}^{∞} (1/(μr)) φ((1/μ) ln[r / (γa)]) dr   (55)

   [Π_b]_m = ∫_0^{p*} [f_b(x) + a f_g(x)] ∫_{p*}^{∞} (1/(μr)) q((1/μ) ln[r / (γ(a + x))]) dr dx + a π_0 ∫_{p*}^{∞} (1/(μr)) q((1/μ) ln[r / (γa)]) dr.   (56)

Finally, the normalizing equation equivalent to equation (42) is

   1/π_0 = 1 + ∫_0^{p*} f_g(x) dx + ∫_0^{p*} f_b(x) dx + [Π_g]_m + [Π_b]_m.   (57)

The integral equations for f_g(·) and f_b(·) are Fredholm equations of the second kind, which have been examined extensively in the literature. The key to their solution is the nature of their kernels, that is, the behavior of

   k_g(x, r) = (1 / (μr√(2π))) exp{−(1/(2μ²)) ln²[r / (γ(a + x))]}

and

   k_b(x, r) = (1 / (μr√(2π))) exp{−(1/2) ((1/μ) ln[r / (γ(a + x))] − μ)²}.

In order to obtain a numerical solution to the f_g equation, a FORTRAN code was written using a subroutine (NAG, 1983) that solves this linear non-singular Fredholm integral equation of the second kind using the method of El-Gendi (1969). The function f_g(·), appearing on both sides of the equation, is approximated by truncating a Chebyshev series to form an nth order polynomial. The subroutine solves for the resulting polynomial coefficients c_i^g, i = 1, 2, ..., n, such that

   f_g(r) ≈ Σ_{i=1}^{n} c_i^g cos((i − 1) cos^{−1}(2r/p* − 1))

with the property

   ∫_0^{p*} f_g(r) dr ≈ p* Σ_{i odd} c_i^g / (1 − (i − 1)²).

The subroutine also provides f_g(x_i) evaluated over the set of Chebyshev points x_i, where

   x_i = (p*/2)(1 + cos(π(i − 1)/(n − 1))),   i = 1, 2, ..., n.
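The NAG/El-Gendi routine is not reproduced here. As a self-contained illustration of how a Fredholm equation of the second kind is solved by collocation, the sketch below (our code, with a toy kernel and source term whose exact solution is the constant f = 2) uses the simpler Nystrom method: replace the integral by a trapezoidal quadrature and solve the resulting linear system.

```python
# Nystrom sketch (ours): solve f(r) = s(r) + INT_lo^hi k(x, r) f(x) dx by
# collocation on a uniform grid, using Gaussian elimination with pivoting.

def nystrom(k, s, lo, hi, n):
    """Approximate f at n grid points for a Fredholm equation of the 2nd kind."""
    h = (hi - lo) / (n - 1)
    xs = [lo + i * h for i in range(n)]
    w = [h] * n
    w[0] = w[-1] = h / 2.0        # trapezoidal quadrature weights
    # assemble (I - K) f = s, where K[i][j] = w_j k(x_j, x_i)
    A = [[(1.0 if i == j else 0.0) - w[j] * k(xs[j], xs[i]) for j in range(n)]
         for i in range(n)]
    b = [s(x) for x in xs]
    for col in range(n):          # forward elimination with partial pivoting
        piv = max(range(col, n), key=lambda row: abs(A[row][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            fac = A[row][col] / A[col][col]
            for c in range(col, n):
                A[row][c] -= fac * A[col][c]
            b[row] -= fac * b[col]
    f = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        f[i] = (b[i] - sum(A[i][j] * f[j] for j in range(i + 1, n))) / A[i][i]
    return xs, f

# toy problem: f(r) = 1 + 0.5 INT_0^1 f(x) dx, whose exact solution is f = 2
xs, f = nystrom(k=lambda x, r: 0.5, s=lambda r: 1.0, lo=0.0, hi=1.0, n=9)
```

The Chebyshev-series approach of El-Gendi plays the same role but yields the expansion coefficients c_i directly, along with the integral property quoted above.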

The f_b equation is then solved, using the same NAG subroutine, by approximating f_b(·) with a different Chebyshev series and using the f_g(·) obtained above. This produces the coefficients c_i^b, i = 1, 2, ..., n, such that

   f_b(r) ≈ Σ_{i=1}^{n} c_i^b cos((i − 1) cos^{−1}(2r/p* − 1))

with the property

   ∫_0^{p*} f_b(r) dr ≈ p* Σ_{i odd} c_i^b / (1 − (i − 1)²).

The performance measures [Π_g]_m and [Π_b]_m are obtained by trapezoidal approximation of the integrals in equations (55) and (56), using the values of f_g(x_i) and f_b(x_i) at the Chebyshev points x_i, i = 1, 2, ..., n. Finally, π_0 is obtained by substituting these quantities into the normalizing equation (57).

15. Numerical Results for Normal Monitoring

Figure 8.1 shows the probability density function π_0 f_g(r) for Normal monitoring. Compared to Bernoulli monitoring (Figure 7.2) this distribution is smooth and well behaved, except near zero, where it can be shown that lim_{r→0} f_g(r) = 0.

Figure 8.2 shows the Operating Characteristic Curves for p_B = Pr{Producing scrap} versus π_0 = Pr{Producing good product} for two values of a (0.05 and 0.1), while fixing μ = 1.5. As in Figure 7.4a, there is an improvement in the OC with smaller a. Since 1/a is the expected number of inter-monitoring intervals until system failure, a can be reduced by either increasing the actual life of a machine or by decreasing the inter-monitoring interval.

Figure 8.3 shows OC sensitivity to changing μ, the shift in the expected observation value when the system fails. As μ increases, the power of the sensor to discriminate between conditions G and B increases, and this improves the OC curve. This has obvious implications in evaluating sensors with different μ values.

Figure 8.4 shows the alternative OC curves for Pr{Producing scrap} versus Pr{Producing good product} for a = 0.05 and μ ∈ {0.5, 1.0, 1.5}. As in Figure 7.7, if this OC represents a significant trade-off, then there is a wide range of probability thresholds that can be ignored when selecting an operating point. For example, when μ = 0.5 and a = 0.05, all p* above 0.15 can be ignored.

16. Conclusion

The value of the System Operating Characteristic as a tool for the evaluation of monitoring procedures and policies has yet to be established. We believe, however, that it is an important and evocative tool for the comparison of policies and the comparison of alternative observation technologies. The structure set forth in this paper provides a method for the computation of critical measures, such as the expected detection time and total alarm rate (δ and r), needed to express various SOCs. For the case of general sampling functions p(x) and q(x), Appendix A presents the equations that need to be solved to find key performance measures. In a companion paper we explore the effectiveness of promising numerical techniques that can be used to efficiently obtain SOCs.

Acknowledgments

This research was supported jointly by General Motors Research Laboratories and the National Science Foundation under contract #SES-9108956. The support of Walter Albers, Jr., who was instrumental in arranging for this unique funding collaboration, is greatly appreciated. The research and presentation were also appreciably strengthened by many substantive discussions with Richard Marcellus. Much of the computation was assisted by Wei-Ching Wang and Julie Simmons.

APPENDIX A: Steady State Properties of MPR

In this Appendix, we show how the Chapman-Kolmogorov (C-K) equations can be used to write expressions for steady-state probabilities and distributions for MPR. Although the notation and details may appear formidable, the method is a straightforward extension of the use of C-K equations for finding steady-state solutions to a finite state Markov chain.

In the development that follows, it might be helpful to refer to the schematic flow diagram of Figure A1. In this diagram, the "states" {R_G} and {R_B} refer to R-values in the open interval (0, p*) while the machine is in condition G and B, respectively. S*_G, S*_B, and 0 are the singleton "false alarm," "true alarm," and "renewal" states as discussed in Section 10. The labels on the transition arrows to singleton states represent the governing probabilities. The arrows to {R_G} and {R_B} represent the complementary distribution functions associated with exceeding the threshold p*, given machine condition G and B, respectively.

Figure A1: Schematic representation of transitions among the states in MCR. Note that 0, S*_B and S*_G are singleton states, while {R_G} and {R_B} represent a continuum of states in the open interval (0, p*).

Consider the computation of Π_G(r). By definition

   Π_G(r) = lim_{n→∞} prob{S_n ∈ {(t, G) : 0 < t ≤ r}},   0 < r < p*

          = lim_{n→∞} prob{0 < R_n ≤ r ∩ C_n = G},

which, by conditioning on the value of R_{n−1} and noting that C_n = G is only possible if C_{n−1} = G, gives

   Π_G(r) = ∫ lim_{n→∞} prob{0 < R_n ≤ r ∩ C_n = G | R_{n−1} = y ∩ C_{n−1} = G} × prob{R_{n−1} ∈ (y, y + dy) ∩ C_{n−1} = G}.   (A1)

The first probability in (A1) can be obtained by using: a) equation (23), which governs the behavior of R_n when R_{n−1} = y, and b) prob{C_n = G | C_{n−1} = G} = 1 − a, which is independent of the value of R_{n−1}. Thus

   Π_G(r) = ∫_{y=0}^{p*} lim_{n→∞} (1 − a) prob{0 < ℓ(X_n)(y + a) ≤ r | C_{n−1} = G} × prob{R_{n−1} ∈ (y, y + dy) ∩ C_{n−1} = G}.

By definition, the second probability in this expression is

   lim_{n→∞} prob{R_{n−1} ∈ (y, y + dy) ∩ C_{n−1} = G} = dΠ_G(y),   0 < y ≤ p*.

Hence, after taking the limit as n → ∞,

   Π_G(r) = ∫_0^{p*} (1 − a) prob{0 < ℓ(X_n)(y + a) ≤ r | C_{n−1} = G} dΠ_G(y) + (1 − a) π_0 prob{0 < ℓ(X_n) a ≤ r | C_{n−1} = G}.

By defining the region C(r, t) = {x : ℓ(x) ≤ r/(a + t)}, C(p*, y) becomes the set of "continuation" values of the observation x_n when R_{n−1} = y. Using this, and the fact that the p.d.f. for X_n, given condition G, is p(x),

   Π_G(r) = (1 − a) ∫_{y=0}^{p*} ∫_{x∈C(r,y)} p(x) dx dΠ_G(y) + (1 − a) π_0 ∫_{x∈C(r,0)} p(x) dx.   (A2)

Similarly, it can be shown that

   Π_B(r) = a ∫_{y=0}^{p*} ∫_{x∈C(r,y)} q(x) dx dΠ_G(y) + a π_0 ∫_{x∈C(r,0)} q(x) dx + ∫_{y=0}^{p*} ∫_{x∈C(r,y)} q(x) dx dΠ_B(y)   (A3)

   π_G = ∫_{y=0}^{p*} ∫_{x∉C(p*,y)} (1 − a) p(x) dx dΠ_G(y) + π_0 ∫_{x∉C(p*,0)} (1 − a) p(x) dx   (A4)

   π_B = ∫_{y=0}^{p*} ∫_{x∉C(p*,y)} a q(x) dx dΠ_G(y) + π_0 ∫_{x∉C(p*,0)} a q(x) dx + ∫_{y=0}^{p*} ∫_{x∉C(p*,y)} q(x) dx dΠ_B(y)   (A5)

and, finally,

   π_0 = π_G + π_B = ∫_{y=0}^{p*} ∫_{x∉C(p*,y)} [a q(x) + (1 − a) p(x)] dx dΠ_G(y) + ∫_{y=0}^{p*} ∫_{x∉C(p*,y)} q(x) dx dΠ_B(y) + π_0 ∫_{x∉C(p*,0)} [a q(x) + (1 − a) p(x)] dx.   (A6)
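For Bernoulli observations (Section 12) the region C(r, y) is a subset of {0, 1}, so the inner integrals in (A2)-(A6) collapse to finite sums. A small sketch (ours; here l(x) denotes L(x)/(1 − a), consistent with the weights of equation (43), and the parameter values are illustrative):

```python
# Sketch (ours): for Bernoulli observations the inner integrals over C(r, y)
# in (A2)-(A6) reduce to sums over the two possible observation values.

a, alpha, beta = 0.01, 0.2, 0.2

def ell(x):
    """l(x) = L(x)/(1 - a), so that R_n = l(x_n)(R_{n-1} + a)."""
    L = beta / (1.0 - alpha) if x == 0 else (1.0 - beta) / alpha
    return L / (1.0 - a)

def p(x):
    """Observation p.m.f. while the machine is in condition G."""
    return 1.0 - alpha if x == 0 else alpha

def integral_over_C(r, y, f):
    """Sum of f(x) over the continuation region C(r, y) = {x : l(x) <= r/(a+y)}."""
    return sum(f(x) for x in (0, 1) if ell(x) <= r / (a + y))
```

For example, integral_over_C(0.01, 0.0, p) includes only x = 0 (since l(1) exceeds the bound) and returns 1 − α = 0.8.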

Figure B1: Schematic representation of the MC states, the MCR sets, and the R-values generated from equation (23).

APPENDIX B: Correspondence between Formulations MC and MCR

The correspondence between the elements of I_{2m+1} = {0, 1, 2, ..., 2m} of the chain MC and the elements of MCR which lie in the set S1 × {G, B} is shown in Figure B1. State 0 is the "starting" state of MC. Since it represents the situation where P_n = 0 (and thus R_n = 0), it is also the "renewal" state, with machine condition G; state m is the false alarm state, since R_n ≥ p* with machine condition G; state 2m is the true alarm state, since R_n ≥ p* with machine condition B. The set R_G represents states where R_n lies between 0 and p* with machine condition G; R_B represents states where R_n lies between 0 and p* with

machine condition B.

Transition probabilities for MC, i.e. among the states in I_{2m+1}, are obtained by noting that:

a) when R_{n+1} ≥ p*, the state entered after transition n + 1 is either m or 2m, depending on whether the machine condition is G or B;

b) once in state m or 2m, since b = g = 1, the next transition is into state 0 with probability 1;

c) for any value of R_n < p*, if the machine condition is G, then with probability a condition B will hold on the next transition;

d) the only transition from condition B to condition G must be made via the true alarm state 2m.

Thus, transitions from alarm states to the renewal state are

   p_{m,0} = 1,   p_{2m,0} = 1,   p_{m,j} = 0 if j ≠ 0,   p_{2m,j} = 0 if j ≠ 0,

and p_{ij} = 0 for i = m+1, m+2, ..., 2m−1; j = 0, 1, 2, ..., m.

For the rest of the elements of P, we define the sub-matrices P^G and P^B such that

   [P^G]_ij = prob{σ_n = j | σ_{n−1} = i ∩ C_n = G} for i = 0, 1, ..., m−1; j = 0, 1, ..., m,
   [P^B]_ij = prob{σ_n = j + m | σ_{n−1} = i ∩ C_n = B} for i = 0, 1, ..., m−1; j = 1, 2, ..., m.

The elements of these submatrices are given by the nature of the distributions p(·) and q(·) of equation (15). In terms of these submatrices, the remaining elements of P are given

by:

   p_ij = (1 − a)[P^G]_ij for i = 0, 1, ..., m−1; j = 0, 1, ..., m;
   p_ij = a[P^B]_{i, j−m} for i = 0, 1, ..., m−1; j = m+1, ..., 2m;
   p_ij = [P^B]_{i−m, j−m} for i = m+1, m+2, ..., 2m−1; j = m+1, m+2, ..., 2m.

The first equation reflects transitions from C_{n−1} = G to C_n = G; the second represents transitions from C_{n−1} = G to C_n = B; the third reflects transitions while the machine condition is B. Matrix P is shown schematically in Figure 5.
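These assembly rules, together with equations (41)-(42), can be sketched end to end. The code below is our illustration; the submatrices are small hypothetical row-stochastic examples, not ones derived from p(·) and q(·).

```python
# Sketch (ours): assemble the full (2m+1)-state matrix P from submatrices
# PG (rows i = 0..m-1, cols j = 0..m) and PB (rows i = 0..m-1, cols j = 1..m),
# then find the steady-state vector of (41)-(42) by power iteration.

def assemble_P(PG, PB, a):
    m = len(PG)
    N = 2 * m + 1
    P = [[0.0] * N for _ in range(N)]
    for i in range(m):                       # machine in G: states 0..m-1
        for j in range(m + 1):
            P[i][j] = (1.0 - a) * PG[i][j]   # remain in condition G
        for j in range(1, m + 1):
            P[i][m + j] = a * PB[i][j - 1]   # transition G -> B
    for i in range(1, m):                    # machine in B: states m+1..2m-1
        for j in range(1, m + 1):
            P[m + i][m + j] = PB[i][j - 1]
    P[m][0] = 1.0                            # false alarm -> renewal
    P[2 * m][0] = 1.0                        # true alarm  -> renewal
    return P

def steady_state(P, iters=20000):
    """Power iteration: repeated application of pi <- pi P."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# hypothetical m = 2 example; each PG row sums to 1 over j = 0..2,
# each PB row sums to 1 over j = 1..2
PG = [[0.7, 0.2, 0.1],
      [0.1, 0.6, 0.3]]
PB = [[0.5, 0.5],
      [0.2, 0.8]]
P = assemble_P(PG, PB, a=0.05)
pi = steady_state(P)
```

A production version would use a sparse linear solver rather than power iteration, but the fixed point is the same vector π of equations (41)-(42).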

APPENDIX C: Performance Measures for Arbitrary b and g

In our development, we assumed checking times of b = g = 1. Calculating performance measures for arbitrary b and g is described here. Recalling that L is the expected cycle time, we can re-express equation (**) as:

   L(b, g) = E(T) + g(n − 1) + δ + b   (58)

which explicitly incorporates as arguments the checking times b and g for conditions B and G, respectively. By definition, p_f(b, g) is then the associated fraction of time the machine is in the false alarm state. It is clear that, when b = g = 1, p_f(1, 1)L(1, 1) is the resulting expected number of false alarms per cycle. For arbitrary b and g, the cycle time is L(1, 1) plus b − 1 (for the check while in B) plus p_f(1, 1)L(1, 1)(g − 1) (for the false alarm checks). Hence, the expected cycle time for arbitrary b and g can be written

   L(b, g) = L(1, 1) + (b − 1) + p_f(1, 1)L(1, 1)(g − 1).

Using this result, the fraction of time spent in the false alarm state is

   p_f(b, g) = p_f(1, 1)L(1, 1)g / L(b, g)

and the fraction of time spent in the true alarm state is

   p_t(b, g) = b / L(b, g).
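These conversion formulas are easy to sanity-check numerically. The sketch below is ours; the inputs L(1, 1) and p_f(1, 1) are hypothetical values, not quantities computed in the paper.

```python
# Sketch (ours): rescale cycle measures computed under b = g = 1 to arbitrary
# checking times b (condition B) and g (condition G), per Appendix C.

def rescale(L11, pf11, b, g):
    """Return (L(b,g), pf(b,g), pt(b,g)) from L(1,1) and pf(1,1)."""
    L = L11 + (b - 1.0) + pf11 * L11 * (g - 1.0)
    pf = pf11 * L11 * g / L
    pt = b / L
    return L, pf, pt

L, pf, pt = rescale(L11=100.0, pf11=0.02, b=5.0, g=2.0)
```

With b = g = 1 the formulas reduce to L(1, 1), p_f(1, 1) and 1/L(1, 1), as expected.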

References

Barlow, R. E., Hunter, L. C. and Proschan, F., "Optimum Checking Procedures," J. Soc. Indust. Appl. Math., 11, 1078-1095 (1963).

Barlow, R. E. and Proschan, F., Mathematical Theory of Reliability, Wiley, New York (1965).

Boland, P. J. and Proschan, F., "Periodic Replacements with Increasing Minimal Repair Costs at Failure Time," Operations Research, 8, 1183-1189 (1982).

Brunner, H., "A Survey of Recent Advances in the Numerical Treatment of Volterra Integral and Integro-differential Equations," J. Comput. Appl. Math., 8, 213-229 (1982).

Duncan, A. J., "The Economic Design of X̄ Charts Used to Maintain Current Control of a Process," Journal of the American Statistical Association, 51, 228-242 (1956).

Girshik, M. A. and Rubin, H., "A Bayes Approach to a Quality Control Model," Ann. Math. Stat., 23, 114-125 (1952).

Groetsch, C. W., The Theory of Tikhonov Regularization for Fredholm Equations, Pitman, Boston (1984).

Johnson, N. L. and Leone, F. C., "Cumulative Sum Control Charts: Mathematical Principles Applied to Their Construction and Use," Parts I, II, III, Industrial Quality Control, 18, 15-21; 19, 29-36; 20, 22-28 (1962).

Lorden, G., "Procedures for Reacting to a Change in Distribution," Ann. Math. Stat., 1897-1908 (1971).

Lorenzen, T. J. and Vance, L. C., "The Economic Design of Control Charts: A Unified Approach," Technometrics, 28, 3-10 (1986).

Marcellus, R. L. and Jasmani, Z., "A Comparative Study of Cusum Control Charts and Bayesian Process Control," in Productivity and Quality Management Frontiers - III, Institute of Industrial Engineers, Norcross, GA (1991).

Montgomery, D. C., "The Economic Design of Control Charts: A Review and Literature Survey," Journal of Quality Technology, 12, No. 2, 75-87 (1980).

Montgomery, D. C., Introduction to Statistical Quality Control, second edition, John Wiley and Sons, New York (1991).

Moskowitz, H., Plante, R. and Chun, Y. H., "Economic Design of Continuous Shift Model X̄ Process Control Charts," Krannert Graduate School of Management, Purdue University (1989).

Page, E. S., "Continuous Inspection Schemes," Biometrika, 41, 100-115 (1954).

Pollak, M., "Average Run Lengths of an Optimal Method of Detecting a Change in Distribution," Ann. Stat., 15, No. 2, 749-779 (1987).

Pollak, M. and Siegmund, D., "Approximations to the Expected Sample Size of Certain Sequential Tests," Ann. Stat., 6, 1267-1282 (1975).

Pollak, M., "Optimal Detection of a Change in Distribution," Ann. Stat., 13, No. 1, 206-227 (1985).

Pollock, S. M., "Minimum Cost Checking Using Imperfect Information," Management Science, 13, No. 7, 206-227 (1965).

Rapoport, A. and Burkheimer, G. J., "Parameters of Discrete Time Models of Detection of Change," Management Science, 19(9), 973-984 (1973).

Roberts, S. W., "A Comparison of Some Control Chart Procedures," Technometrics, 8, 411-430 (1966).

Schippers, H., Multiple Grid Methods for Equations of the Second Kind, Mathematisch Centrum, Amsterdam (1983).

Shewhart, W. A., The Economic Control of the Quality of Manufactured Product, Macmillan, New York (1931).

Shiryaev, A. N., Optimal Stopping Rules, Springer-Verlag, New York (1978).

Shiryaev, A. N., "On Optimum Methods in Quickest Detection Problems," Theory Prob. Appl., 8, 22-46 (1963).

Svoboda, L., "Economic Design of Control Charts: A Review and Literature Survey," in Statistical Process Control in Manufacturing, Marcel Dekker, New York (1991).

Woodall, W. H., "Weaknesses of the Economic Design of Control Charts," Technometrics, 28(4), 408-409 (1986).

Woodall, W. H., "The Statistical Design of Quality Control Charts," The Statistician, 34, 155-160 (1985).