OPERATING CHARACTERISTICS FOR DETECTION OF PROCESS CHANGE
PART I: THEORETICAL DEVELOPMENT

Stephen M. Pollock
Department of Industrial & Operations Engineering
University of Michigan
Ann Arbor, MI 48109-2117

and

Jeffrey M. Alden
General Motors Research Laboratories
Warren, MI 48090

August 1992
Technical Report 92-34

1. Introduction

Detecting the occurrence of an underlying process change, when observations are related only probabilistically to the state of the process, has long been a concern of statisticians, engineers, economists, epidemiologists, etc. In this paper we present an analysis of this "detection of change" problem in the context of machine monitoring. Applications to areas such as quality control, health, military surveillance, or economic analysis should be readily apparent.

Fundamentally, the problem is to determine when a system goes "out of control" or "fails," and to do this "as soon as possible." We use the term "policy" to describe a procedure that prescribes when to take an "action" consistent with a process change, after observing available information. Usually, an economic or other advantage is gained when a policy raises an "alarm" that detects the change soon after it happens. However, it is also desirable to avoid costly "false alarms" that may occur while the system is still "in control," i.e., when no process change has occurred.

The balancing of these two criteria - quick detection and few false alarms - is the basis of most quality control and control chart procedures (e.g., CUSUM) developed over the past sixty years (see Shewhart (1981), Roberts (1966), Johnson and Leone (1962), and Montgomery (1980) for a historical perspective and related formulations). The most common approach to "optimizing" these procedures (e.g., Moskowitz, Plante and Chun (1989)) uses an economic model that explicitly assumes costs are ascribable to these two criteria, or to equivalent operational measures of the procedure. These economic approaches depend upon specific policy structures which, although intuitively appealing and leading to easy computation or evocative charting methods, are not necessarily optimal in any sense, or even, in some cases, consistent.
Thus, a full understanding of how to effectively and appropriately operate a monitoring (or "surveillance" or "inspection") system, and how to compute the associated performance

measures, is still lacking. As a step towards such an understanding, we address four issues, in the context of a rather simple but generalizable framework, that are relatively neglected in both the literature and practice.

1. A traditional performance measure is the average run length (ARL): the expected time until a false alarm occurs given the system remains "in control." This attempts to capture the relative cost associated with false alarms - a policy with a longer ARL typically engenders lower false alarm costs. We propose instead to use the reasonable, computable, and economically supportable measures of "false alarm rate" and "fraction of time processing false alarms" during the entire monitoring process. These measures contribute to the typical costs associated with false alarms: a fixed cost per false alarm (e.g., to allocate resources), which is captured by the false alarm rate, and a variable cost for time spent processing false alarms (e.g., labor hours and lost operating time), which is captured by the fraction of time spent processing false alarms.

2. A traditional "quickness of detection" measure is the expected time to detect that the system has gone "out of control" or "failed." This detection time is used to estimate the cost of operating the system while in a failed condition, or "down" (e.g., while producing scrap). However, there is more. Typically failure costs also depend on the failure occurrence (e.g., replacement part costs) and on the time spent after detection to renew the system (e.g., lost production time). To understand total costs over time, the frequency of these cost-incurring events is needed, particularly when one wishes to minimize average cost per unit time.
These costs are captured by the failure frequency or "true alarm rate" (assuming all failures are eventually detected), the fraction of time the system is down (i.e., operating while in a failed condition or halted for renewal after a failure detection), and the expected time to detect a failure. Hence, we focus on calculating the true alarm rate, fraction of time down, and expected detection time

while routinely using the monitoring policy. This is in contrast to similar measures traditionally computed given that the monitoring process starts when the system first fails.

3. Most analyses of monitoring system performance produce only asymptotic results, involve computational difficulties, are relatively complex and difficult to explain to potential users, or require unrealistic conditions on sampling processes. Our approach, instead, provides a method for the computation of operating performance measures which (although approximate) is computationally straightforward, easy to understand and implement, and holds for a wide range of parameters.

4. Our approach allows sensitivity analysis with respect to important parameters such as the expected operating time until system failure and the relative discriminatory power of various sampling devices. We do not require an explicit assessment of hard-to-estimate costs associated with false alarms and late detections.

2. Formulation of the Basic Monitoring Problem

Consider a system that can be in one of two conditions, G or B (for "Good" and "Bad"). The system starts in G and enters B after some operating time T, where T is a discrete random variable (r.v.) with probability mass function (p.m.f.) f(t), t = 1, 2, .... Once the system enters B, it remains there until some exogenous action is taken. When this action, called checking, is taken, the system is halted (i.e., it stops operating) and its condition is assessed with certainty. Checking may find the system in G or B and, in either case, we assume it is renewed (i.e., made "good-as-new") before operations continue. The time it takes to renew after checking depends on system condition. In particular, this time is c or b if the system is found in G or B, respectively. Typically c < b, since renewal from B often requires fixing something, while renewal from G may only require a brief inspection to ascertain that it is in G. Since the system is halted during renewals, the operating time T until failure differs from the system time until failure. The system time until failure is T plus the total time spent checking and renewing (while in G) prior to failure.

For convenience, we use language appropriate to a manufacturing process that starts producing good parts (hence "G") and at some operating time T later "fails" and then produces bad parts (hence "B") or no parts at all. We do not consider the explicit economic or physical consequences of the process while in either condition. As we will see, these can be subsumed into appropriate performance measures of a procedure for determining if the system has entered B. We use the following notation:

{G, B} = set of possible system conditions,

Ct = condition of the system at time t, t = 0, 1, 2, ..., and

Pt ≡ prob{Ct = B | C0 = G}, t = 1, 2, 3, ....

We now consider the use of an (imperfect) information-gathering procedure, with a sequence of events as shown in Figure 1. At every observation time i = 1, 2, 3, ..., a random variable Xi is observed, with probability density function (p.d.f.) fXi(.). This "monitoring" of Xi provides information about the system condition, since its p.d.f. depends upon the condition. This information is used to immediately "update" f(t) and determine if checking should be done. In particular, let

    fXi(x) = p(x) if Ci = G, i = 1, 2, ...,
             q(x) if Ci = B, i = 1, 2, ...,     (1)

define the random vector Xn ≡ (X1, X2, ..., Xn) and its realization xn ≡ (x1, x2, ..., xn), and define

    Pn(xn) ≡ prob{Cn = B | C0 = G, X1 = x1, X2 = x2, ..., Xn = xn}.

It is well known (e.g., see Girshick and Rubin [1952], Shiryaev [1963], or Pollock [1965] for early references, or Pollak [1987] for a more recent one) that for this situation the following form of decision rule is optimal (in that it minimizes average cost per unit time) for a variety of reasonable economic and/or statistical criteria:

Probability Threshold Rule (PTR): Decide (or take appropriate action consistent with such a decision) that the system is in B when Pn(xn) equals or exceeds some threshold p*.

[Figure 1 here.]

Figure 1: Sequence of events involving possible system "failure," monitoring by observing r.v. Xi, and deciding whether or not to check.

The PTR policy effectively sets the structure of the monitoring procedure: continually compute Pn(xn) and react when it equals or exceeds p*. By varying the threshold p*, the decision maker, given the ability to compute Pn(xn), can explore trade-offs among various criteria and performance measures. Note that there is only this one parameter, p*, to vary.

The event when Pn(xn) equals or exceeds p* is called an alarm; subsequent exploration to determine the system's actual condition is called checking; and making the system "as good as new" is called renewing. An alarm is called a false alarm if, upon checking, the system is found to be in G; otherwise it is a true alarm.

This paper first develops a way to conveniently compute Pn(xn) for any set of observed values {x1, x2, ..., xn}. Given this "probability the system has failed given xn," we compute five performance measures arising from two competing criteria that characterize the "goodness" of the procedure: the cost associated with false alarms and the cost associated with failures. The first is measured by the false alarm rate and the fraction of time spent processing false alarms. The latter is measured by the true alarm rate, the fraction of time down (i.e., in B), and the expected detection time. These performance measures are formally defined as:

rf = false alarm rate: the expected number of false alarms per unit time,

pf = fraction of time spent processing false alarms,

rt = true alarm rate: the expected number of true alarms per unit time,

pB = fraction of time the system is down (in B), and

δ = expected detection time: the expected time after failure until Pn(xn) equals or exceeds the threshold p*.

These measures capture the essence of the operational performance of PTR: Pn(xn) allows the rule to be followed (given any assigned value of p*); rt, pB, and δ represent the waste

due to failures (failure frequency, fraction of time down, and expected detection time); rf and pf represent the waste due to false alarms (false alarm frequency and fraction of time spent processing false alarms).

These measures also support a simple yet reasonable cost model. Given the following costs:

Kf = fixed cost per false alarm,

Vf = variable cost per unit of time spent processing a false alarm,

Kt = fixed cost per true alarm,

Vt = variable cost per unit of time spent while the system is down and a failure has been detected, and

Vd = variable cost per unit of time while the system is down and no failure has been detected,

then the average cost rate for the procedure can often be expressed as

    Kf rf + Vf pf + Kt rt + Vt (pB - rt δ) + Vd rt δ.

Even if such a linear cost model or its cost parameters cannot be specified, it is important to note that these measures serve to summarize the behavior of the policy.

An excellent way to summarize the performance of any procedure that purports to be appropriate for monitoring a system change process is to develop an Operating Characteristic (OC) for the procedure. Similar to the Receiver Operating Characteristic used in telecommunication and signal detection theory, and related to the power curve of simple hypothesis testing, an OC is simply a plot of achievable levels of two competing performance measures. For example, an OC could plot the expected fraction of time down, pB, versus the false alarm rate, rf. Consider such OCs for three different hypothetical monitoring procedures A, B

and C, as shown in Figure 2. Each curve represents the operating points (i.e., values of pB and rf) achievable by varying free parameters within the procedures. In the PTR policy, this is done by varying the threshold p*, which produces large (near 1) pB and small rf when p* → 1, and small pB and large rf when p* → 0. Procedure C is clearly better than A or B, since it has a lower pB for any given rf, or a lower rf for any given pB. (Perhaps procedure C is based upon observing variables xi that are not available to A or B.) Procedure A would be preferred to B only in those situations where early detection (i.e., small pB) is more important than avoiding high false alarm rates.

[Figure 2 here.]

Figure 2: Operating Characteristics (OC) for three hypothetical procedures A, B and C.
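The linear cost model above is easy to evaluate once the five measures are known. The following is a minimal sketch (the function name and the numbers in the usage note are ours, for illustration only):

```python
def average_cost_rate(Kf, Vf, Kt, Vt, Vd, rf, pf, rt, pB, delta):
    """Average cost per unit time: Kf rf + Vf pf + Kt rt + Vt(pB - rt delta) + Vd rt delta."""
    undetected_down = rt * delta          # fraction of time down but not yet detected
    detected_down = pB - undetected_down  # fraction of time down after detection
    return Kf * rf + Vf * pf + Kt * rt + Vt * detected_down + Vd * undetected_down
```

For example, with Kf = 10, Vf = 1, Kt = 50, Vt = 5, Vd = 20 and measures rf = 0.02, pf = 0.02, rt = 0.01, pB = 0.05, δ = 2, the cost rate evaluates to 1.27 per unit time.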

[Figure 3 here.]

Figure 3: A single cycle of the checking/monitoring procedure. Arrows indicate checking actions and renewals; T = operating time until failure, which occurs at system time T + (n-1)c; D = detection or response time; ti = time of the ith check - a false alarm for i < n; tn = time condition B is detected; and tn + b is the time the system re-enters condition G, which ends the cycle.

3. Some Relations Among Performance Measures

For a process governed by failure time p.m.f. f(t), where t is the operating time until failure, and any reasonable checking/monitoring procedure, there exist general relations among the performance measures. Figure 3 shows a single cycle of a procedure that starts with a renewal and ends with a renewal following the first check that finds the system in B. If this cycle contains n checks at times t1, t2, ..., tn, then n - 1 checks find the system in G, i.e., there are n - 1 false alarms, each requiring a time c to process. As shown in Figure 3, D is the (random variable) response time. The performance measure δ is E(D) by definition, and the expected cycle length is

    C(b, c) = E(T) + c E(n - 1) + δ + b,

where E(T) = Σt t f(t) is the expected operating time until system failure. Note that the system remains in B during the renewal time interval of length b, and it cannot fail during the

checking time interval(s) of length c while in G. Using the fundamental renewal theorem, if μ is the expected number of false alarms until failure, then

    rf = μ / (E(T) + cμ + δ + b)     (2)

and

    rt = 1 / (E(T) + cμ + δ + b).     (3)

The ratio of these two equations gives

    μ = rf / rt.     (4)

Solving for δ in equation (3) gives

    δ = 1/rt - E(T) - cμ - b.     (5)

Since δ + b is the time in B, which occurs at rate rt, and c is the lost operating time for each false alarm, which occurs at rate rf,

    pB = (δ + b) rt,     (6)

    pf = c rf.     (7)

These relationships hold for any checking or monitoring procedure. If rt and rf can be calculated, then, given the parameters E(T), b, and c, we can calculate μ, δ, pB, and pf.

A traditionally used Average Run Length measure, φG, is the expected time to first alarm given the system remains in condition G. In spite of its popularity, using φG as a performance measure of a monitoring procedure is questionable for several reasons:

a) the hypothetical situation where the system is "forced" to remain in G until the first alarm (an assumption required for computing φG) is hard to justify as a realistic one, and in any event never represents reality;

b) its interpretation is unclear if the system fails before the first false alarm: φG is really a conditional expectation, yet it is not computed as such in the literature;

c) in any event, in most situations any checking action (not only a false alarm) is costly; thus φG is less relevant than, for example, the total checking (thus cost-incurring) rate r = rf + rt.

For completeness, we note that if the inter-checking intervals t1, t2 - t1, t3 - t2, ... are i.i.d. random variables, then the unconditional expected time to first alarm (false or true), φ, is

    φ = EG(t1) = EG(t2 - t1) = EG(t3 - t2), etc.,

where EG is the expectation given the system was found in condition G when last checked. The total checking rate r is then r = rf + rt = 1/φ. We stress that for nontrivial decision policies the inter-checking intervals will not be i.i.d. For example, while using the PTR policy with any informative observations, the expected time between checks becomes smaller once the system fails.
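The relations (4)-(7) of Section 3 can be sketched directly in code. The function name below is ours; the usage note builds a cycle with known μ and δ and recovers them from the rates:

```python
def derived_measures(rt, rf, ET, b, c):
    """Equations (4)-(7): mu, delta, pB, pf from the alarm rates rt and rf."""
    mu = rf / rt                        # (4): expected false alarms per cycle
    delta = 1.0 / rt - ET - c * mu - b  # (5): expected detection time
    pB = (delta + b) * rt               # (6): fraction of time down
    pf = c * rf                         # (7): fraction of time processing false alarms
    return mu, delta, pB, pf
```

For example, with E(T) = 100, b = 2, c = 1, μ = 3 and δ = 4, the cycle length (2)-(3) is 109, so feeding rt = 1/109 and rf = 3/109 back in recovers μ = 3 and δ = 4.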

4. Computing Pn(xn)

The computation of Pn(xn) follows from a straightforward application of Bayes' Theorem:

    Pn(xn) = prob{T <= n | Xn = xn} = prob{T <= n ∩ Xn = xn} / prob{Xn = xn},     (8)

where

    prob{T <= n ∩ Xn = xn} = Σ_{j=1}^{n} f(j) Π_{i=1}^{j-1} p(xi) Π_{k=j}^{n} q(xk)     (9)

and

    prob{Xn = xn} = prob{T <= n ∩ Xn = xn} + Σ_{j=n+1}^{∞} f(j) Π_{i=1}^{n} p(xi).     (10)

Substituting (9) and (10) into (8), and dividing numerator and denominator by Π_{i=1}^{n} p(xi), gives

    Pn(xn) = [Σ_{j=1}^{n} f(j) Π_{k=j}^{n} L(xk)] / [Σ_{j=1}^{n} f(j) Π_{k=j}^{n} L(xk) + F̄(n)],     (11)

where

    L(xi) = q(xi)/p(xi)     (12)

is the likelihood ratio for condition B given Xi = xi, and

    F̄(n) ≡ Σ_{i=n+1}^{∞} f(i) = prob{T > n}.

For notational convenience in the remainder of this paper, the argument xn will be left out, e.g., Pn(xn) will be written as Pn. Although computing Pn directly from equation (11) is straightforward, it is advantageous to use the "odds in favor of condition B" ratio Rn = Pn/(1 - Pn), which is obtained directly from equation (11) as

    Rn = [F̄(n)]^{-1} Σ_{j=1}^{n} f(j) Π_{k=j}^{n} L(xk).     (13)

This allows a recursive representation for Rn:

    Rn+1 = [L(xn+1) / F̄(n+1)] [F̄(n) Rn + f(n+1)],     (14)

which can be confirmed by substitution into equation (13). Equation (14) is an excellent way to compute Pn = Rn/(1 + Rn), since Rn+1 is calculated from the previously obtained Rn after each new observation xn+1 by a simple addition and multiplication.

Equation (14) also clarifies the challenge of computing the expected time until Pn first equals or exceeds p* or, equivalently, the first time Rn equals or exceeds the "odds threshold" p*/(1 - p*). In particular, when xn+1 is replaced by the r.v. Xn+1, we see that equation (14) can be viewed as the generator of a Markov Process Rn. This process has as its state space the non-negative real line R+, with transitions governed by the stochastic behavior of the Xi, which in turn are governed by the p.d.f.s of equation (1).
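The recursion (14) can be sketched in a few lines. As a check, with L(x) ≡ 1 (uninformative observations) the recursion should reproduce Pn = F(n), as shown in Section 6; the function name is ours:

```python
def update_odds(R_n, x_next, n, f, Fbar, L):
    """One step of equation (14): R_{n+1} from R_n and observation x_{n+1}.
    f(j) is the failure-time p.m.f., Fbar(n) = prob{T > n}, L(x) = q(x)/p(x)."""
    return L(x_next) * (Fbar(n) * R_n + f(n + 1)) / Fbar(n + 1)
```

With a geometric failure time (a = 0.1) and L ≡ 1, iterating from R0 = 0 gives Pn = Rn/(1 + Rn) = 1 - 0.9^n, i.e., the c.m.f. of T, for every n.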

5. Special Case: Geometric Time to Failure

The remainder of this paper assumes the operating time from G to B follows the geometric p.m.f.

    f(t) = a(1 - a)^(t-1),  t = 1, 2, 3, ...,

with cumulative mass function (c.m.f.)

    F(t) = 1 - (1 - a)^t,  t = 1, 2, 3, ...,     (15)

and expectation E(T) = 1/a. Using this distribution in equation (14), the random variables Xi produce the generating equation for the Markov Process Rn:

    Rn+1 = [L(Xn+1) / (1 - a)] [Rn + a].

For notational convenience, and some advantage in computation, scaling and interpretation, the process Rn can be transformed into a new Markov Process Zn ≡ Rn/a, Zn ∈ R+. This new process has the governing equation

    Zn+1 = ℓ(Xn+1)[Zn + 1],     (16)

where ℓ(Xn) ≡ L(Xn)/(1 - a), and the process has an associated threshold

    z* = p* / [a(1 - p*)].

The absorption behavior (i.e., the distribution of the time until Zn first equals or exceeds z*) of this process has a long and important history of study (see Shiryaev 1978), with computational (as contrasted to structural) results essentially constrained to limit theorems and approximations (e.g., Pollak (1985)). Our emphasis is not on absorption, but on the steady-state behavior that results when the system is renewed after checking. This allows direct computation of the measures rt and rf, from which we get δ, pB and pf (see Section 3).
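The Z-process of equation (16) and its threshold are a one-line computation each. A small sketch (function names ours), whose consistency can be checked by recovering Pn = aZn/(1 + aZn) = F(n) in the uninformative case L ≡ 1:

```python
def z_threshold(p_star, a):
    """z* = p* / (a (1 - p*))."""
    return p_star / (a * (1.0 - p_star))

def z_update(Z, x, a, L):
    """Equation (16): Z_{n+1} = ell(x)(Z_n + 1), with ell(x) = L(x)/(1 - a)."""
    return (L(x) / (1.0 - a)) * (Z + 1.0)
```

For example, z_threshold(0.1, 0.01) returns 100/9 ≈ 11.11, the value quoted later for Figure 7.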

6. No Information and Perfect Information

Before exploring how the performance measures and OCs might be calculated in general, it is instructive to examine two special situations. First, consider the limiting case where observations provide no information, a situation equivalent to p(xi) = q(xi), or L(xi) = 1, for i = 1, 2, .... In this case, equation (14) reduces to

    Rn+1 = [F̄(n+1)]^{-1} [F̄(n) Rn + f(n+1)],

which is a deterministic difference equation with solution (given boundary condition R0 = 0)

    Rn = F(n) / F̄(n).

From the definition of Rn, this gives Pn = F(n). Clearly, observations of xn have no effect on computing Pn, which is simply the cumulative distribution for the failure time T. This result holds for any f(t).

To calculate the performance measures, we derive expressions for μ (expected number of false alarms per cycle), δ (expected response time) and the expected cycle time. The remaining performance measures follow immediately from their definitions. The (deterministic) time to the first alarm, φ, is the smallest (integer) n such that F(n) >= p*, i.e.,

    φ = min{n : F(n) >= p*}.

Since there is a constant probability F(φ) that an alarm is a true alarm (which is the last alarm in a cycle), the expected number of false alarms per cycle is

    μ = 0·F(φ) + 1·(1 - F(φ))F(φ) + 2·(1 - F(φ))^2 F(φ) + ... = (1 - F(φ)) / F(φ).

Computation of δ is slightly more complex. If T <= φ, then the time between T and when action is taken - the response time D - is

    D = φ - T.

If kφ < T <= (k+1)φ for k = 1, 2, ..., then there are k false alarms (at system times φ, 2φ + c, 3φ + 2c, ..., and kφ + (k-1)c), followed by a "true detection" with response time D = (k+1)φ - T. If the system were not renewed, then

    δ = Σ_{k=0}^{∞} Σ_{t=kφ+1}^{(k+1)φ} [(k+1)φ - t] f(t).

However, since f(t) is geometric, the process renews after a false alarm, and

    δ = [Σ_{t=1}^{φ} (φ - t) f(t)] / F(φ).     (17)

From equation (15) for F(t),

    φ = min{n : n >= ln(1 - p*)/ln(1 - a)}.

Using this in equation (17), after some algebra, results in

    δ = [(1 - a)^φ - 1 + aφ] / (a[1 - (1 - a)^φ]).

Given δ, μ, and E(T) = 1/a, the expected cycle time is

    C(b, c) = 1/a + cμ + δ + b,

which allows calculation of the remaining performance measures from their definitions:

    rf = μ/C(b, c),    rt = 1/C(b, c),    pf = μc/C(b, c),    pB = (δ + b)/C(b, c).
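The closed forms above can be sketched and cross-checked against the defining sum in equation (17). The function name and return layout are ours:

```python
import math

def no_info_measures(a, p_star, b, c):
    """Closed-form performance measures for the 'no information' case."""
    # phi: deterministic time to first alarm, smallest n with F(n) >= p*
    phi = math.ceil(math.log(1 - p_star) / math.log(1 - a))
    F_phi = 1 - (1 - a) ** phi
    mu = (1 - F_phi) / F_phi                              # false alarms per cycle
    delta = ((1 - a) ** phi - 1 + a * phi) / (a * F_phi)  # eq. (17) after algebra
    C = 1 / a + c * mu + delta + b                        # expected cycle time
    return phi, mu, delta, {"rf": mu / C, "rt": 1 / C,
                            "pf": mu * c / C, "pB": (delta + b) / C}
```

For a = 0.01 and p* = 0.1, this gives φ = 11, and the algebraic form of δ agrees with the direct sum Σ_{t=1}^{φ}(φ - t) f(t) / F(φ).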

Figure 4 shows an example operating characteristic curve: plots of r versus δ as p* varies, for b = c = 1 and selected values of a. Figure 5 contains the same information presented in "normalized" time units (i.e., in units of E(T) = 1/a), where r' = r/a is the normalized checking rate. Since these represent worst-case results, any actual information-gathering procedure should produce achievable points below the curves in Figures 4 and 5.

The limiting case of "perfect" information represents another special situation. In this case the supports of p(x) and q(x) are disjoint, so that L(xi) = 0 for i = 1, 2, ..., T - 1, and L(xi) = ∞ for i = T, T + 1, .... From equation (11), we see that any non-zero threshold is exceeded for all n >= T, so that δ = 0. Since the system is checked only when it fails, μ = 0 and the true alarm rate is rt = 1/(E(T) + b) = a/(1 + ab), the false alarm rate is rf = 0, the fraction of time down is pB = b/(E(T) + b) = ab/(1 + ab), and the fraction of time processing false alarms is pf = 0.
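The perfect-information limit is a one-liner per measure; a minimal sketch (function name ours):

```python
def perfect_info_measures(a, b):
    """Limiting 'perfect information' case: no false alarms, zero detection delay."""
    ET = 1.0 / a
    return {"rf": 0.0, "rt": 1.0 / (ET + b),   # = a / (1 + a b)
            "pf": 0.0, "pB": b / (ET + b),     # = a b / (1 + a b)
            "delta": 0.0}
```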

[Figure 4 here.]

Figure 4: Operating Characteristic for the "no information" case: r versus δ (expected time late), for a = .00001, .0001, .001, and .01.

[Figure 5 here.]

Figure 5: Normalized Operating Characteristic for the "no information" case: r' = r/a versus the normalized expected time late, for a = .00001, .0001, and .001.

[Figure 6 here.]

Figure 6: Evolution of the Zn process for an observation that (b) leads to checking, and (a) does not lead to checking.

7. An Extension of the Process Zn

When information (in the form of the observations Xi) is gathered, the process Zn described by equation (16), along with the system condition, determines the behavior of the monitoring procedure. Figure 6 shows the two possible results of computing Zn+1 from Zn once an observation of xn+1 is made: either

a) Zn+1 is less than the threshold z*, and the process continues; or

b) Zn+1 equals or exceeds z*, and action (check and renew) is taken.

Recall that z* ≡ p*/[a(1 - p*)] is the threshold value for Zn, and that Zn >= z* means a decision is made to check; when this action is taken, the system is renewed (either because condition B is discovered, or because of the memoryless property of the geometric distribution if condition G is discovered).

The conditions for (a) and (b), respectively, from equation (16) are

a) ℓ(xn+1) < z*/(Zn + 1), and

b) ℓ(xn+1) >= z*/(Zn + 1).

Recall that the xi are realizations of random variables Xi, which are generated with p.d.f. p(x) while the system is in G, and with p.d.f. q(x) while the system is in B. In order to directly compute the various performance measures, we characterize the checking/monitoring procedure by means of a mixed continuous-discrete state Markov Process (called, for convenience, "MPZ") based upon combining the process Zn with the system condition Cn. The performance measures can be obtained from the appropriate stationary state probabilities of this MPZ.

The state of MPZ at the end of the nth transition will be denoted as Sn ∈ S, n = 1, 2, .... The state space S is the union of five sub-spaces: three singletons and two mixed continuous-discrete. These sub-spaces are:

S0 = {0}, renewal state: the state entered after the system is renewed, i.e., when Zn = 0 (or, equivalently, Pn = 0 or Rn = 0) and Cn = G;

SG* = {(z*, G)}, false alarm state: the state entered after checking while the system is in G, and occupied until system renewal, i.e., when Zn >= z* and Cn = G;

SB* = {(z*, B)}, true alarm state: the state entered after checking while the system is in B, and occupied until system renewal, i.e., when Zn >= z* and Cn = B;

SG = {(z, G) : z ∈ (0, z*)}, set of states where Zn is between 0 and z*, and Cn = G;

SB = {(z, B) : z ∈ (0, z*)}, set of states where Zn is between 0 and z*, and Cn = B.

The transition probabilities among the states in these sets are governed by the evolution of Zn described by equation (16), the behavior of the random variable Xn+1 given in equation (1), and the geometric failure time distribution of equation (15), which implies a probability a of the system condition going from G to B at each transition (except from SG*, where this probability is zero, since the system cannot fail while being checked). In the following development we assume b = c = 1, since computing results for arbitrary b and c values (even zero values) is straightforward - see Appendix C.

The steady state equations for MPZ are derived in Appendix A. We note some of the properties of this Markov Process:

a) it is ergodic, since there is a single closed communicating class of states;

b) the probability of transition from SG* or SB* to S0 is 1. Both represent the one transition (since we assume b = c = 1) needed to check and renew the system (the former after a false alarm, the latter after a true alarm);

c) due to the geometric failure time distribution of equation (15), the single-step transition probability

i) from the set SG to the set SB ∪ SB*, and

ii) from the state 0 to the set SB ∪ SB*,

is a.

The steady-state probabilities for the singleton states are defined as

    π0 = lim_{n→∞} prob{Sn = 0},

    πG* = lim_{n→∞} prob{Sn = SG*},

    πB* = lim_{n→∞} prob{Sn = SB*}.

In addition, we define the steady state cumulative distribution functions for the sets SG and SB as

    ΠG(z) = lim_{n→∞} prob{Sn ∈ {(t, G) : t <= z}},  0 < z < z*,

    ΠB(z) = lim_{n→∞} prob{Sn ∈ {(t, B) : t <= z}},  0 < z < z*,

and set, for convenience, ΠG(0) = ΠB(0) = 0, ΠG(z*) = lim_{ε→0} ΠG(z* - ε), and ΠB(z*) = lim_{ε→0} ΠB(z* - ε).

If these steady state probabilities and distributions can be computed from the equations in Appendix A, then the performance measures are easily obtained. In particular, by appealing to the ergodic theorem for Markov Processes, we know that

    π0 = expected fraction of time the process is in the renewed state.

Thus, r = rt + rf = π0, since MPZ is in S0 for exactly one time unit per cycle. Similarly, since

    πG* = expected fraction of time the process is in the false alarm state,

    πB* = expected fraction of time the process is in the true alarm state,

we know that rf = πG* and rt = πB* (recall that c or b, the time to renew from either G or B, is one time unit). Given these values for rf and rt, we have from the analysis in Section 3 (with b = c = 1 and E(T) = 1/a):

    μ = πG* / πB*,     (18)

    δ = (1 - πG*) / πB* - 1/a - 1,     (19)

    pB = 1 - πG* - πB*/a,     (20)

    pf = πG*.     (21)

Thus we can present OC plots (such as rf vs. pB), given the steady-state probabilities πB* and πG*.
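Equations (18)-(21) are a direct translation of the Section 3 relations with b = c = 1 and E(T) = 1/a; a minimal sketch (function and argument names ours), with an internal consistency check that pB = (δ + b) rt:

```python
def measures_from_stationary(pi_G, pi_B, a):
    """Equations (18)-(21): pi_G, pi_B are the stationary false/true alarm
    probabilities (pi_G*, pi_B*), with b = c = 1 and E(T) = 1/a."""
    mu = pi_G / pi_B                       # (18)
    delta = (1 - pi_G) / pi_B - 1 / a - 1  # (19)
    pB = 1 - pi_G - pi_B / a               # (20)
    pf = pi_G                              # (21)
    return mu, delta, pB, pf
```

For example, with πG* = 0.02, πB* = 0.005 and a = 0.01, the sketch gives μ = 4, δ = 95, and pB = 0.48 = (δ + 1)πB*, as required by equation (6).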

8. Markov Chain Approximation

The steady-state equations used to calculate πG* and πB* - (A4) and (A5) in Appendix A - are special cases of the Fredholm equation of the second kind, which has a long history of theoretical study (e.g., Groetsch (1984) or Brunner (1982)) and of numerical means of solution (Schippers (1983)). Indeed, these equations have an analogue in those developed by Pollak [1987] to compute the ARL measure φG (rather than the measures, such as rf and pB, that we seek). However, as Pollak notes, a solution method is still lacking for even the simplest forms of p(.) and q(.). The computational literature is problem-specific, and essentially suggests using variable transformations and discretization approximations tailored to the problem at hand. Following this approach, we now develop an approximation for some specific cases of p(x) and q(x).

In the context of our problem, this approximation method is equivalent to proposing that the process MPZ can be approximated by a Markov Chain ("MCZ") with finite state space. To construct MCZ, we restrict the values of Zn so that they do not cover the interval [0, z*], but instead lie in the finite set {0} ∪ S1 ∪ {z*}, where

    S1 ≡ {z1, z2, ..., zm-1}.

The key to this restriction is to find values of zi that are numerous enough, and "cover" the interval (0, z*) in such a way, that the sums over probabilities in S1 well approximate the integrals over S implicit in equations (A2) to (A5). Obtaining such a set of z-values is discussed below; at this point we assume that S1 is available.

Given the finite elements of S1, we define a finite state space Z for MCZ:

    Z = S0 ∪ SG* ∪ SB* ∪ ZG ∪ ZB,

where the first three sub-spaces are the singletons (corresponding to the renewed, false alarm

and true alarm "states" of the system, respectively) and

    ZG ≡ {zi : zi ∈ S1, Cn = G, i = 1, 2, ..., m-1, n = 1, 2, ...},

    ZB ≡ {zi : zi ∈ S1, Cn = B, i = 1, 2, ..., m-1, n = 1, 2, ...}.

Thus ZG represents a subset of Zn values while the system is in G, and ZB represents a subset of Zn values while the system is in B. A simple re-numbering of states now allows us to represent MCZ as a (2m + 1)-state ergodic Markov Chain, which we will refer to as "MC," with state space

    I2m+1 ≡ {0, 1, 2, ..., 2m},

σn, the state of MC after the nth transition, and transition matrix P with elements

    [P]ij = pij = prob{σn = j | σn-1 = i}  for i, j ∈ I2m+1 and n = 1, 2, ....

Details of this representation of MCZ and MC are contained in Appendix B. Given the elements of P developed in Appendix B, the steady-state probability vector π = {π0, π1, π2, ..., π2m}, where

    πi ≡ lim_{n→∞} prob{σn = i},  i = 0, 1, ..., 2m,

can be obtained by solving the set of linear equations

    π = πP,    π1 = 1,

where 1 ≡ {1 1 1 ... 1}' is the unit (2m + 1)-vector. The key performance measures of interest can be immediately obtained from these equations, since πG* = πm and πB* = π2m.
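Solving π = πP with π1 = 1 is a standard linear-algebra step; one common device (not specific to this paper) is to replace one balance equation with the normalization constraint. A sketch for any ergodic chain:

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi = pi P, pi 1 = 1 by replacing the last balance equation
    (pi (P - I) = 0) with the normalization constraint."""
    n = P.shape[0]
    A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
    rhs = np.zeros(n)
    rhs[-1] = 1.0
    return np.linalg.solve(A, rhs)
```

For a small two-state test chain P = [[0.9, 0.1], [0.5, 0.5]] the solver returns (5/6, 1/6); in the present application the same call applied to the (2m + 1)-state matrix of Appendix B yields πG* = πm and πB* = π2m.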

9. Special Case: Bernoulli Observations

The general conditions under which the "discretization" of MPZ to MCZ is valid, and which allow an explicit determination of πG* and πB* from p(x) and q(x), are not addressed in this paper. We consider, instead, the following special case of Bernoulli observations, where Xn = 0 or 1, n = 1, 2, ..., and

    p(x) = 1 - α if x = 0,
           α     if x = 1,

    q(x) = β     if x = 0,
           1 - β if x = 1.

This situation is a form of classical hypothesis testing: x = 0 is "evidence" of condition G (e.g., no defect in an observed manufactured product) and x = 1 is evidence of condition B (e.g., a defect is observed). Thus α is analogous to an "error of the first kind," and β to an "error of the second kind." The likelihood ratio is

    L(x) = β/(1 - α)   if x = 0,
           (1 - β)/α   if x = 1.

By defining

    w0 ≡ β / [(1 - a)(1 - α)],    w1 ≡ (1 - β) / [(1 - a)α],

then

    ℓ(x) = w0 if x = 0,
           w1 if x = 1,

so that equation (16) can be written

    Zn+1 = w0(Zn + 1) if Xn+1 = 0,
           w1(Zn + 1) if Xn+1 = 1.     (22)
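The Bernoulli case reduces the whole update (22) to two fixed multipliers; a minimal sketch (function names ours), using the parameter values that appear later in Figure 7:

```python
def bernoulli_weights(a, alpha, beta):
    """w0, w1 of equation (22) for Bernoulli observations."""
    w0 = beta / ((1 - a) * (1 - alpha))   # multiplier when x = 0 (evidence of G)
    w1 = (1 - beta) / ((1 - a) * alpha)   # multiplier when x = 1 (evidence of B)
    return w0, w1

def z_step(Z, x, w0, w1):
    """Equation (22): Z_{n+1} = w0 (Z_n + 1) or w1 (Z_n + 1)."""
    return (w1 if x == 1 else w0) * (Z + 1.0)
```

With a = 0.01, α = 0.2, β = 0.1 this gives w0 ≈ 0.126 and w1 ≈ 4.545, satisfying the ordering 0 < w0 < 1 < w1 assumed in the steady-state equations.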

Before writing the steady-state equations for this case, it is convenient to note that, for a realistic problem:

a) the value of a (the probability that the system goes from G to B on any transition) is usually quite small;

b) α and β, the "misclassification" errors in a single observation of x, are relatively small compared to one (generally α, β ≤ .1).

Under these conditions 0 < w_0 < 1 < w_1 < z*, and equations (A2) to (A6) can be written, after some reduction and algebra:

    Π_G(z) = { 0                                                     if 0 ≤ z < w_0,
              { (1-a)(1-α)[Π_G(z/w_0 - 1) + π_0]                      if w_0 ≤ z < w_1,     (23)
              { (1-a)(1-α)[Π_G(z/w_0 - 1) + π_0]
                  + (1-a)α[Π_G(z/w_1 - 1) + π_0]                      if w_1 ≤ z ≤ z*,

    Π_B(z) = { 0                                                     if 0 ≤ z < w_0,
              { β[aΠ_G(z/w_0 - 1) + Π_B(z/w_0 - 1) + aπ_0]            if w_0 ≤ z < w_1,     (24)
              { β[aΠ_G(z/w_0 - 1) + Π_B(z/w_0 - 1) + aπ_0]
                  + (1-β)[aΠ_G(z/w_1 - 1) + Π_B(z/w_1 - 1) + aπ_0]    if w_1 ≤ z ≤ z*,

    π_G* = (1-a)(1-α)[Π_G(z*) - Π_G(z*/w_0 - 1)]
               + (1-a)α[Π_G(z*) - Π_G(z*/w_1 - 1)],                                        (25)

    π_B* = β[aΠ_G(z*) + Π_B(z*) - aΠ_G(z*/w_0 - 1) - Π_B(z*/w_0 - 1)]
               + (1-β)[aΠ_G(z*) + Π_B(z*) - aΠ_G(z*/w_1 - 1) - Π_B(z*/w_1 - 1)],           (26)

    π_0 = π_G* + π_B*,                                                                     (27)

where Π_G(0) = Π_B(0) = 0, Π_G(z) = Π_G(z*) for z > z*, and Π_B(z) = Π_B(z*) for z > z*.
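Pending an exact solution of equations (23) to (27), the quantities π_0, π_G*, and π_B* can be estimated by direct Monte Carlo simulation of the Bernoulli case with b = c = 1. The sketch below assumes one particular ordering of events within a transition (the condition may change G → B, then an observation is drawn, then Z is updated and compared with z*); the ordering is an assumption of this illustration, not a statement of the report's model.

```python
import random

def simulate(a, alpha, beta, z_star, n_steps, seed=0):
    """Monte Carlo estimate of (pi_0, pi_G*, pi_B*) for Bernoulli
    observations with b = c = 1.

    The condition moves G -> B with probability a per transition;
    observations follow p (in G) or q (in B); Z evolves by recursion
    (22) and, whenever Z >= z_star, an alarm is declared and the
    process renews (Z reset to 0, condition reset to G).
    """
    rng = random.Random(seed)
    w0 = beta / ((1 - alpha) * (1 - a))
    w1 = (1 - beta) / (alpha * (1 - a))
    z, cond = 0.0, "G"
    counts = {"renewal": 0, "false_alarm": 0, "true_alarm": 0}
    for _ in range(n_steps):
        if cond == "G" and rng.random() < a:
            cond = "B"
        p1 = alpha if cond == "G" else 1 - beta   # prob{x = 1}
        x = 1 if rng.random() < p1 else 0
        z = (w1 if x else w0) * (z + 1.0)
        if z >= z_star:                            # alarm: renew the process
            counts["false_alarm" if cond == "G" else "true_alarm"] += 1
            counts["renewal"] += 1
            z, cond = 0.0, "G"
    n = float(n_steps)
    return (counts["renewal"] / n,
            counts["false_alarm"] / n,
            counts["true_alarm"] / n)

pi0, piG, piB = simulate(a=0.01, alpha=0.2, beta=0.1,
                         z_star=11.11, n_steps=200_000)
```

By construction the estimates satisfy π_0 = π_G* + π_B*, the b = c = 1 form of equation (27).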

10. Markov Process Discretization

In order to create the MC version of equations (23) to (27), we now consider the evolution of the Z_n process of equation (22), when the process starts with Z_0 = 0 (i.e., R_0 = P_0 = 0). The possible values of Z_n that can be generated after the first three observations, assuming none exceed z*, are given in Table 1:

    Observation number n | Possible Z_n values         | Label
    ---------------------+-----------------------------+------
            1            | w_0                         |   1
                         | w_1                         |   2
            2            | (1 + w_0)w_0                |   3
                         | (1 + w_0)w_1                |   4
                         | (1 + w_1)w_0                |   5
                         | (1 + w_1)w_1                |   6
            3            | (1 + (1 + w_0)w_0)w_0       |   7
                         | (1 + (1 + w_0)w_0)w_1       |   8
                         | (1 + (1 + w_0)w_1)w_0       |   9
                         | (1 + (1 + w_0)w_1)w_1       |  10
                         | (1 + (1 + w_1)w_0)w_0       |  11
                         | (1 + (1 + w_1)w_0)w_1       |  12
                         | (1 + (1 + w_1)w_1)w_0       |  13
                         | (1 + (1 + w_1)w_1)w_1       |  14

    Table 1: Possible values of Z_n after n = 1, 2, 3 observations.

Figure 7 shows all possible values of Z_n for a = 0.01, α = 0.2, β = 0.1, and n = 6. Each distinct value of Z_n is assigned an arbitrary "label" number. After n observations

Figure 7: Possible Z_n values after 6 observations for a = 0.01, α = 0.2, β = 0.1, and p* = .1 (or z* = 11.11). Since calculated Z values exceeding z* are set to zero and not used subsequently, only 90 out of the 254 possible Z_n realizations after 6 observations are greater than zero. (The plot shows Z value against label number, sorted by increasing Z value.)

the number of possible distinct values of Z_n, hence the number of different labels, is clearly 2^{n+1} - 2. However, two important effects allow us to reduce this number, which in turn allows a practical means of discretizing MCZ to MC:

1. As n becomes large, some values of Z_n satisfy Z_n > z*, in which case either state S_G* or S_B* occurs and Z_{n+1} becomes 0;

2. As n becomes large, the values of Z_n less than z* tend to "cluster" around a finite set W of m - 1 points (m - 1 = |W|). Thus, we can approximately incorporate into a single discrete value z_i all those Z_n that are within some small range ε > 0 of z_i.

Then the entire evolution of the process can be represented by a MC with states 0, S_G*, and S_B* plus the 2m - 2 states from the cross product of the set W and the set of system conditions {G, B}. The associated probability transition matrix P can be easily created (see Appendix B) from the two probability transition submatrices P^G and P^B corresponding to state transitions placing the system in G and B, respectively. Figure 8 depicts the composition of the transition matrix P using these submatrices:

    [P^G]_{ij} = { 1-α   if j = J_0(i),   i ∈ {0, 1, 2, ..., m-1},
                  { α     if j = J_1(i),   i ∈ {0, 1, 2, ..., m-1},                  (28)
                  { 0     otherwise,  i ∈ {0, 1, ..., m-1},  j ∈ {0, 1, ..., m},

    [P^B]_{ij} = { β     if j = J_0(i) + 1,   i ∈ {0, 1, ..., m-1},
                  { 1-β   if j = J_1(i) + 1,   i ∈ {0, 1, ..., m-1},                 (29)
                  { 0     otherwise,  i ∈ {0, 1, ..., m-1},  j ∈ {1, 2, ..., m},     (30)

where, after letting W_0 ≡ {0} ∪ W ∪ {z*} and defining 0 as its 0th element (i.e., z_0 = 0) and

z* as its mth element (i.e., z_m = z*), respectively,

    J_0(i) = the index of the closest element in W_0 to Z_{n+1}, given Z_n = z_i and x_{n+1} = 0
           = arg min_{k : z_k ∈ W_0} |w_0(z_i + 1) - z_k|,

and

    J_1(i) = the index of the closest element in W_0 to Z_{n+1}, given Z_n = z_i and x_{n+1} = 1
           = arg min_{k : z_k ∈ W_0} |w_1(z_i + 1) - z_k|.
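The three ingredients of the discretization — enumerating the reachable Z_n values, clustering them within ε, and the nearest-element index maps J_0 and J_1 — can be sketched as follows. The greedy clustering rule below is one simple choice, an assumption of this illustration rather than a procedure prescribed by the report.

```python
import bisect

def enumerate_z(w0, w1, z_star, n):
    """All Z values reachable from Z_0 = 0 in up to n steps of recursion
    (22), dropping any path as soon as it reaches z_star (an alarm resets
    Z to 0, so such values are not propagated)."""
    values, frontier = [], [0.0]
    for _ in range(n):
        frontier = [w * (z + 1.0) for z in frontier for w in (w0, w1)
                    if w * (z + 1.0) < z_star]
        values.extend(frontier)
    return values

def cluster(values, eps):
    """Greedy clustering: scan the sorted values and merge any value
    within eps of the current representative, yielding the finite set W."""
    reps = []
    for v in sorted(values):
        if not reps or v - reps[-1] > eps:
            reps.append(v)
    return reps

def nearest_index(w_all, z):
    """J_0/J_1-style lookup: index of the element of W_0 closest to z.
    w_all must be sorted, with w_all[0] = 0 and w_all[-1] = z_star."""
    k = bisect.bisect_left(w_all, z)
    if k == 0:
        return 0
    if k == len(w_all):
        return len(w_all) - 1
    return k if z - w_all[k - 1] > w_all[k] - z else k - 1

w0, w1, z_star = 0.126, 4.545, 11.11
vals = enumerate_z(w0, w1, z_star, 3)
reps = cluster(vals, eps=1e-3)
```

With z* removed (no truncation), the enumeration after n = 3 steps reproduces the 2^{n+1} - 2 = 14 labels of Table 1.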

Figure 8: Schematic representation of the transition matrix P from state i to state j, showing the use of the submatrices P^G and P^B (as the blocks (1-a)P^G, aP^B, and [P^B]_{i≠0}). [M]_{i≠0} denotes the submatrix created by removing the row associated with i = 0 from M. Shaded regions represent zero transition probabilities.

11. Conclusion

The value of the Operating Characteristic as a tool for the evaluation of monitoring procedures and policies has yet to be established. We believe, however, that it is an important and evocative tool for the comparison of policies and the comparison of alternative observation technologies. The structure set forth in this report provides a method for the computation of critical measures, such as the expected detection time and the total alarm rate (δ and r), needed to express various OCs. It is an open question, however, whether the approach outlined in Section 10 will prove computationally viable in the case of Bernoulli observations. For the case of general sampling functions p(x) and q(x), Appendix A presents the equations that need to be solved to find the key performance measures. Again, demonstrated numerical techniques for their solution are not yet available. In a companion paper we explore the effectiveness of the discretization method outlined in Section 10 as a means of obtaining numerical representations of useful OCs.

Acknowledgments

This research was supported jointly by General Motors Research Laboratories and the National Science Foundation under contract #SES9108956. The support of Walter Albers, Jr., who was instrumental in arranging for this unique funding collaboration, is greatly appreciated.

APPENDIX A: Steady-State Properties of MPZ

In this Appendix, we show how the Chapman-Kolmogorov (C-K) equations can be used to write expressions for the steady-state probabilities and distributions for MPZ. Although the notation and details may seem formidable, the method is a straightforward extension of the use of C-K equations for finding the steady-state solutions to a finite-state Markov chain. In the development that follows, it might be helpful to refer to the suggestive flow diagram of Figure A1. In this diagram, the "states" {Z_G} and {Z_B} refer to Z-values in the open interval (0, z*) while the system is in condition "G" and "B," respectively. S_G*, S_B*, and 0 are the singleton "false alarm," "true alarm," and "renewal" states, as discussed in Section 7. The labels on the transition arrows represent governing probabilities (to singleton states) and densities (to {Z_G} and {Z_B}), where P̄ and Q̄ are the complementary distribution functions associated with p and q.

Consider the computation of Π_G(z). By definition

    Π_G(z) = lim_{n→∞} prob{S_n ∈ {(t, G)} : 0 < t < z} = lim_{n→∞} prob{0 < Z_n < z ∩ C_n = G},

which, by conditioning on the value of Z_{n-1}, and noting that C_n = G is only possible if C_{n-1} = G, gives

    Π_G(z) = ∫_{y=0}^{z*} lim_{n→∞} prob{0 < Z_n < z ∩ C_n = G | Z_{n-1} = y ∩ C_{n-1} = G}
                              × prob{Z_{n-1} ∈ (y, y+dy) ∩ C_{n-1} = G}.                    (A1)

The first probability in (A1) can be obtained by using: a) equation (16), which governs the behavior of Z_n when Z_{n-1} = y, and b) prob{C_n = G | C_{n-1} = G} = 1 - a, which is independent of the value of Z_{n-1}.

Figure A1: Schematic representation of transitions among the states in MPZ. Note that 0, S_G*, and S_B* are singleton states, while {Z_G} and {Z_B} represent a continuum of states in the open interval (0, z*).

Thus

    Π_G(z) = lim_{n→∞} (1-a) ∫_{y=0}^{z*} prob{0 < ℓ(X_{n-1})(y + 1) < z | C_{n-1} = G}
                                  × prob{Z_{n-1} ∈ (y, y+dy) ∩ C_{n-1} = G}.

By definition,

    lim_{n→∞} prob{Z_{n-1} ∈ (y, y+dy) ∩ C_{n-1} = G} = dΠ_G(y),   0 < y < z*.

Hence, after taking the limit as n → ∞,

    Π_G(z) = (1-a) ∫_{y=0}^{z*} prob{0 < ℓ(X_{n-1})(y + 1) < z | C_{n-1} = G} dΠ_G(y)
                 + (1-a) π_0 prob{0 < ℓ(X_{n-1}) < z | C_{n-1} = G}.

For convenience, we define the region

    C(z, t) ≡ {x : ℓ(x) < z/(1 + t)}.

Using this notation, C(z*, Z_n) is the set of "continuation" values of the observation x_{n+1}. Finally, since the p.d.f. for X_{n-1}, given C_{n-1} = G, is p(x),

    Π_G(z) = (1-a) ∫_{y=0}^{z*} ∫_{x∈C(z,y)} p(x) dx dΠ_G(y) + (1-a) π_0 ∫_{x∈C(z,0)} p(x) dx.      (A2)

Similarly, it can be shown that

    Π_B(z) = a ∫_{y=0}^{z*} ∫_{x∈C(z,y)} q(x) dx dΠ_G(y) + a π_0 ∫_{x∈C(z,0)} q(x) dx
                 + ∫_{y=0}^{z*} ∫_{x∈C(z,y)} q(x) dx dΠ_B(y),                                        (A3)

    π_G* = ∫_{y=0}^{z*} ∫_{x∈C̄(z*,y)} (1-a) p(x) dx dΠ_G(y) + π_0 ∫_{x∈C̄(z*,0)} (1-a) p(x) dx,      (A4)

    π_B* = a ∫_{y=0}^{z*} ∫_{x∈C̄(z*,y)} q(x) dx dΠ_G(y) + a π_0 ∫_{x∈C̄(z*,0)} q(x) dx
                 + ∫_{y=0}^{z*} ∫_{x∈C̄(z*,y)} q(x) dx dΠ_B(y),                                       (A5)

where C̄(z, t) denotes the complement of C(z, t).

    π_0 = π_G* + π_B*
        = ∫_{y=0}^{z*} ∫_{x∈C̄(z*,y)} [a q(x) + (1-a) p(x)] dx dΠ_G(y)
              + ∫_{y=0}^{z*} ∫_{x∈C̄(z*,y)} q(x) dx dΠ_B(y)
              + π_0 ∫_{x∈C̄(z*,0)} [a q(x) + (1-a) p(x)] dx.                                          (A6)

Figure B1: Schematic representation of the MC states, the MCZ sets, and the Z-values generated from equation (16). System condition transitions are governed by a geometric probability distribution from G to B.

APPENDIX B: Correspondence between Formulations MC and MCZ

The correspondence between the elements of I_{2m+1} ≡ {0, 1, 2, ..., 2m} of the chain MC and the elements of MCZ which lie in the set S_1 × {G, B} is shown in Figure B1. State 0 is the "starting" state of MC. Since it represents the situation where P_n = 0 (and thus R_n = 0 and Z_n = 0), it is also the "renewal" state, with system condition G; state m is the false alarm state, since Z_n = z* and the system condition is G; state 2m is the true alarm state, since Z_n = z* and the system condition is B. The set Z_G represents states

where Z_n lies between 0 and z* when the system condition is G; Z_B represents states where Z_n lies between 0 and z* and the system condition is B.

Transition probabilities for MC, i.e., among the states in I_{2m+1}, are obtained by noting that:

a) when Z_{n+1} > z*, the state entered after transition n + 1 is either m or 2m, depending on whether the system condition is G or B;

b) once in state m or 2m, as long as b = c = 1, the next transition is into state 0 with probability 1;

c) for any value of Z_n < z*, if the system condition is G then with probability a condition B applies on the next transition;

d) the only transition from condition B to condition G must be made via the true alarm state 2m.

Thus, transitions from the alarm states to the renewal state are

    p_{m,0} = 1,   p_{2m,0} = 1,   p_{m,j} = 0 if j ≠ 0,   p_{2m,j} = 0 if j ≠ 0,

and p_{ij} = 0 for i = m+1, m+2, ..., 2m-1; j = 0, 1, 2, ..., m. For the rest of the elements of P, we define the sub-matrices P^G and P^B, such that

    [P^G]_{ij} = prob{α_n = j | α_{n-1} = i ∩ C_n = G}       for i = 0, 1, ..., m-1;  j = 0, 1, ..., m,
    [P^B]_{ij} = prob{α_n = j + m | α_{n-1} = i ∩ C_n = B}   for i = 0, 1, ..., m-1;  j = 1, 2, ..., m.

The elements of these submatrices are given by the nature of the distributions p(·) and q(·) of equation (1). In terms of these submatrices, the remaining elements of P are given by:

    p_{ij} = (1-a)[P^G]_{ij}      for i = 0, 1, ..., m-1;         j = 0, 1, ..., m,
    p_{ij} = a[P^B]_{i,j-m}       for i = 0, 1, ..., m-1;         j = m+1, ..., 2m,
    p_{ij} = [P^B]_{i-m,j-m}      for i = m+1, m+2, ..., 2m-1;    j = m+1, m+2, ..., 2m.

The first equation reflects transitions from C_{n-1} = G to C_n = G; the second represents transitions from C_{n-1} = G to C_n = B; the third reflects transitions while the system is in B. Matrix P is shown schematically in Figure 8.
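The block structure above translates directly into code. The sketch below assembles P from given submatrices; the contents of PG and PB in the example are arbitrary row-stochastic stand-ins (not derived from any particular p and q), used only to exercise the assembly.

```python
import numpy as np

def assemble_P(PG, PB, a):
    """Assemble the (2m+1) x (2m+1) matrix P from the submatrices PG
    (m x (m+1), rows summing to 1) and PB (m x m, rows summing to 1):
    G->G block (1-a)*PG, G->B block a*PB, B->B block PB with its first
    row removed, and alarm states m and 2m returning to state 0."""
    m = PG.shape[0]
    P = np.zeros((2 * m + 1, 2 * m + 1))
    P[0:m, 0:m + 1] = (1 - a) * PG                # C_{n-1}=G -> C_n=G
    P[0:m, m + 1:2 * m + 1] = a * PB              # C_{n-1}=G -> C_n=B
    P[m + 1:2 * m, m + 1:2 * m + 1] = PB[1:, :]   # transitions while in B
    P[m, 0] = 1.0                                 # false alarm -> renewal
    P[2 * m, 0] = 1.0                             # true alarm  -> renewal
    return P

# Tiny illustrative example with m = 2:
PG = np.array([[0.5, 0.3, 0.2],
               [0.1, 0.6, 0.3]])
PB = np.array([[0.7, 0.3],
               [0.4, 0.6]])
P = assemble_P(PG, PB, a=0.01)
```

Each row of the assembled P sums to one, since the G-rows contribute (1-a)·1 + a·1 = 1, which is a quick sanity check on the block equations above.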

APPENDIX C: State Probabilities for Arbitrary b and c

In our development, we assumed renewal times of one time unit, i.e., b = c = 1. The calculation of the steady-state probabilities for arbitrary b and c is described here. Recall that C(b, c) is the expected cycle time (between two consecutive renewals from B), given that b and c are the renewal times after a checking action detects condition B or G, respectively. Let π_G*(b, c) be the associated steady-state probability that the system is in the false alarm state given b and c. It is clear that π_G*(1,1)C(1,1) is the expected number of false alarms per cycle when b = c = 1. For arbitrary b and c, the cycle time is C(1,1) plus b - 1 (for the renewal while in B) plus π_G*(1,1)C(1,1)(c - 1) (for the false alarm renewals). Hence, the expected cycle time for arbitrary b and c is

    C(b, c) = C(1,1) + (b - 1) + π_G*(1,1)C(1,1)(c - 1).

Note that prob{S | b = c = 1}·C(1,1) is the expected time per cycle that the system is in state S, for any state S ∈ S, and that this time is independent of b and c for all states except the two renewal states, i.e., the false alarm and true alarm renewal states. The probability of any non-renewal state S ∈ S is then

    prob{S} = prob{S | b = c = 1} C(1,1) / C(b, c).

Using these results, the probability of the false alarm renewal state is

    π_G*(b, c) = π_G*(1,1) C(1,1) c / C(b, c),

and the probability of the true alarm renewal state is

    π_B*(b, c) = π_B*(1,1) C(1,1) b / C(b, c).
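These rescaling formulas are a small amount of arithmetic, sketched below. The numerical values of C(1,1), π_G*(1,1), and π_B*(1,1) are illustrative stand-ins; in practice they would come from the steady-state solution of the b = c = 1 chain.

```python
def cycle_time(C11, piG11, b, c):
    """Expected cycle time C(b, c) from C(1,1) and the b = c = 1
    false-alarm probability piG11 = pi_G*(1,1)."""
    return C11 + (b - 1) + piG11 * C11 * (c - 1)

def rescaled_probs(C11, piG11, piB11, b, c):
    """Rescale the b = c = 1 renewal-state probabilities to arbitrary b, c."""
    Cbc = cycle_time(C11, piG11, b, c)
    piG = piG11 * C11 * c / Cbc   # false alarm renewal state
    piB = piB11 * C11 * b / Cbc   # true alarm renewal state
    return piG, piB, Cbc

# Illustrative stand-in numbers:
piG, piB, Cbc = rescaled_probs(C11=50.0, piG11=0.02, piB11=0.02, b=3, c=2)
```

Setting b = c = 1 recovers the original quantities unchanged, which is a convenient consistency check on the formulas.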

References

Brunner, H., "A Survey of Recent Advances in the Numerical Treatment of Volterra Integral and Integro-differential Equations," J. Comput. Appl. Math., Vol. 8, 213-229 (1982).

Girshick, M.A. and Rubin, H., "A Bayes Approach to a Quality Control Model," Ann. Math. Stat., Vol. 23, 114-125 (1952).

Groetsch, C.W., The Theory of Tikhonov Regularization for Fredholm Equations, Pitman, Boston, 104 pp. (1984).

Johnson, N.L. and Leone, F.C., "Cumulative Sum Control Charts: Mathematical Principles Applied to Their Construction and Use," Parts I, II, III, Industrial Quality Control, Vol. 18, 15-21; Vol. 19, 29-36; Vol. 20, 22-28 (1962).

Lorden, G., "Procedures for Reacting to a Change in Distribution," Ann. Math. Stat., 1897-1908 (1971).

Montgomery, D.C., "The Economic Design of Control Charts: A Review and Literature Survey," Journal of Quality Technology, Vol. 12, No. 2, 75-87 (1980).

Moskowitz, H., Plante, R. and Chun, Y.H., "Economic Design of Continuous Shift Model X Process Control Charts," Krannert Graduate School of Management, Purdue University (1989).

Pollak, M., "Average Run Lengths of an Optimal Method of Detecting a Change in Distribution," Ann. Stat., Vol. 15, No. 2, 749-779 (1987).

Pollak, M. and Siegmund, D., "Approximations to the Expected Sample Size of Certain Sequential Tests," Ann. Stat., Vol. 3, 1267-1282 (1975).

Pollak, M., "Optimal Detection of a Change in Distribution," Ann. Stat., Vol. 13, No. 1, 206-227 (1985).

Pollock, S.M., "Minimum Cost Checking Using Imperfect Information," Management Science, Vol. 13, No. 7, 206-227 (1965).

Roberts, S.W., "A Comparison of Some Control Chart Procedures," Technometrics, Vol. 8, 411-430 (1966).

Schippers, H., Multiple Grid Methods for Equations of the Second Kind, Mathematisch Centrum, Amsterdam, 133 pp. (1983).

Shewhart, W.A., Economic Control of Quality of Manufactured Product, Macmillan, New York (1931).

Shiryaev, A.N., Optimal Stopping Rules, Springer-Verlag, New York (1978).

Shiryaev, A.N., "On Optimum Methods in Quickest Detection Problems," Theory of Probability and Its Applications, Vol. 8, 22-46 (1963).