Division of Research, School of Business Administration
The University of Michigan, Ann Arbor, Michigan 48109

Working Paper #554, March 1988

RECALL-PRECISION TRADE-OFF: A DERIVATION

Michael Gordon
Manfred Kochen

FOR DISCUSSION PURPOSES ONLY. None of this material is to be quoted or reproduced without the expressed permission of the Division of Research. Copyright 1988, University of Michigan School of Business Administration.

1. Introduction

It has been widely observed that document retrieval systems produce a trade-off between furnishing relevant documents and failing to furnish irrelevant documents. At one extreme, a literature searcher can ensure that he retrieves all relevant documents by scanning the entire collection (and thereby examining a vast number of irrelevant documents). At the other, he may wish to avoid all irrelevant documents and may attempt to do so by examining only a very few likely relevant documents (thereby missing many relevant ones). In practice, retrieval usually falls between these two extremes. Similar trade-offs have been discussed in signal detection theory (Peterson et al., 1954; Van Meter and Middleton, 1954), medical diagnostics and decision theory (Wald, 1950), and pattern recognition studies (Tanner and Swets, 1954). In all cases, we have objects that may possess a certain characteristic and also judgments about whether they really do. For instance, a document may or may not be relevant, and a retrieval system may or may not retrieve it based on its matching rule, which is a prediction of relevance. Similarly, a person may be diseased or disease-free, a determination a physician diagnosing this individual attempts to make. Consequently, a condition which exists may be detected (which we call a "hit" or "true positive") or not ("miss" or "false negative"). Alternatively, a non-existing condition may be erroneously detected ("false positive" or "false alarm") or correctly judged as absent ("true negative").

Two variables commonly characterize the performance of a document retrieval system: 1) Recall (or hit rate, h), which we describe as the conditional probability of retrieving a document given it is relevant; and 2) Precision (or acceptance, a), which is the conditional probability that a document is relevant given it is retrieved. We may also interpret h as sensitivity and a as accuracy (Swets, 1963).

A third variable, Fallout (which we denote b), the conditional probability of retrieving a document given it is irrelevant, is sometimes reported instead of precision. Figure 1 gives an example of these variables for a single query posed for a collection of 1,000 documents. Recall (hit rate, h) is estimated to be 1/3 (10 out of 30 relevant documents retrieved); precision (acceptance, a) 1/10 (10 of 100 retrieved documents are relevant); and fallout (b) 9/97 (90 of 970 irrelevant documents retrieved).

                                Retrieved              Not Retrieved
                                (Signal Noted)         (No Signal Noted)
                                (Positive Diagnosis)   (Negative Diagnosis)   Total

    Relevant (Signal Present,   HIT                    MISS
    Condition Exists)           10                     20                        30

    Not Relevant (Signal        FALSE ALARM            TRUE NEGATIVE
    Absent, Condition Absent)   90                     880                      970

    Total                       100                    900                    1,000

                                        Figure 1
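To make the bookkeeping in Figure 1 concrete, the following minimal sketch computes the three measures directly from the four cells of such a contingency table. The cell counts are Figure 1's; the function name is ours, chosen for illustration.

```python
# A minimal sketch computing the three performance measures from a
# 2x2 retrieval contingency table (cell counts taken from Figure 1).

def retrieval_measures(hits, misses, false_alarms, true_negatives):
    relevant = hits + misses                     # 30 in Figure 1
    retrieved = hits + false_alarms              # 100 in Figure 1
    irrelevant = false_alarms + true_negatives   # 970 in Figure 1
    recall = hits / relevant                     # h = P(retrieved | relevant)
    precision = hits / retrieved                 # a = P(relevant | retrieved)
    fallout = false_alarms / irrelevant          # b = P(retrieved | irrelevant)
    return recall, precision, fallout

h, a, b = retrieval_measures(10, 20, 90, 880)
print(h, a, b)   # 0.333..., 0.1, 0.0927... (= 1/3, 1/10, 9/97)
```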

Despite many attempts to explicate "relevance" (Goffman, 1964; Saracevic, 1975; Cooper, 1971), there is still no completely satisfactory conceptualization, either operationally or theoretically. Nonetheless, h and a have been estimated, and relations between them have been postulated (Cleverdon, 1972) and empirically plotted (Cleverdon, 1970; Salton, 1971; Lancaster, 1968; Lancaster, 1979; Blair and Maron, 1985). The intuitively plausible trade-off between a (accuracy) and h (sensitivity) is typified by Figure 2 (adapted from Salton and McGill, 1983, p. 168). The first mathematical derivation of this inverse relation (Kochen, 1974) was based on an assumption of retrieval based on the presence of query terms in the full text of a document, and used rank-frequency laws for those query terms. In what follows, we derive a mathematical model in an effort to explain the origin of the recall-precision (h vs. a) trade-off.

2. General Considerations

We first establish a mathematical relationship between h (recall, i.e., hit rate) and a (precision, i.e., acceptance):

    a = hp/r                                                           (1)

where

    p = the probability that a randomly chosen document is relevant to a selected query, i.e., P(relevant);
    r = hp + b(1-p), i.e., the probability that an arbitrary document has been retrieved;    (2)
    b = P(document is retrieved | it is irrelevant).

To see this, note that h = P(document is retrieved | it is relevant) = P(retrieved and relevant)/P(relevant), so that hp = P(retrieved and relevant). Similarly, b(1-p) = P(retrieved and irrelevant). Thus, r = P(retrieved and relevant) + P(retrieved and irrelevant) = P(retrieved). Hence hp/r = P(retrieved and relevant)/P(retrieved) = P(relevant | retrieved) = a (acceptance or precision), as we wished to show.
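As a quick numerical check (a sketch using Figure 1's values, which give p = 30/1000), identity (2) reproduces the observed retrieval probability and identity (1) reproduces the observed precision:

```python
# Verifying identities (1) and (2) on Figure 1's numbers (a sketch).
p = 30 / 1000          # P(relevant): 30 relevant documents out of 1,000
h = 10 / 30            # recall: 10 of 30 relevant documents retrieved
b = 90 / 970           # fallout: 90 of 970 irrelevant documents retrieved

r = h * p + b * (1 - p)   # equation (2): P(retrieved)
a = h * p / r             # equation (1): precision

print(r)   # 0.1  (indeed, 100 of 1,000 documents were retrieved)
print(a)   # 0.1  (matching Figure 1's precision of 10/100)
```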

If we assume, as an idealization that does not diminish the value of the argument, that a is a continuous and twice differentiable function of h, we obtain an expression for the derivative, ∂a/∂h. We first show the conditions under which ∂a/∂h must be negative, that is, the conditions under which precision (a) decreases monotonically as recall (h) increases, by proving the following:

Theorem 1. ∂a/∂h < 0 iff ∂r/∂h > r/h.

Proof: Differentiate equation (1) with respect to h to obtain:

    ∂a/∂h = p/r - (hp/r²)·∂r/∂h = (p/r)[1 - (h/r)·∂r/∂h]               (3)

Since p/r > 0, ∂a/∂h < 0 iff

    1 - (h/r)·∂r/∂h < 0,                                               (4)

or, equivalently, iff ∂r/∂h > r/h. Q.E.D.

In words, we have shown that precision and recall are inversely related only when ∂r/∂h > r/h. We show next that the condition ∂r/∂h > r/h is a characteristic of operational retrieval systems. We note that

    ∂r/∂h > r/h
    iff d(num-ret/D)/d(num-r-ret/pD) > (num-ret/D)/(num-r-ret/pD)
    iff d(num-ret)/d(num-r-ret) > num-ret/num-r-ret                    (5)

where num-ret stands for the number of documents retrieved, num-r-ret stands for the number of relevant documents retrieved, and D is the number of documents in the collection. Again, p refers to the probability of an arbitrary document being judged relevant to the given query under consideration.

Intuitively, we argue that, for any number of relevant documents that a searcher has already retrieved, m, he will have to examine more documents to find the next relevant one than he had to, on average, in retrieving those relevant documents he has found already.
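The differentiation in (3) can be checked symbolically (a minimal sketch; we use sympy merely as a convenient computer algebra system, which is our choice, not the paper's):

```python
# Symbolic check of equation (3): d(a)/dh = (p/r)[1 - (h/r) dr/dh].
import sympy as sp

h, p = sp.symbols('h p', positive=True)
r = sp.Function('r')(h)          # retrieval probability as a function of recall

a = h * p / r                    # equation (1): precision
da = sp.diff(a, h)               # left-hand side of equation (3)
target = (p / r) * (1 - (h / r) * sp.diff(r, h))   # right-hand side of (3)

print(sp.simplify(da - target))  # prints 0, confirming the identity
```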

Pictorially, Figure 3 illustrates this point: the slope of the hypotenuse of the triangle with base [0, m] is always less than the slope of the tangent to the curve at m, for all values of m. This observation about marginal retrieval performance informally validates (5).

More formally, we examine the relationship between ∂r/∂h and r/h by considering the hit rate as a continuous function of time, h(t), and computing its Maclaurin expansion to prove the following:

Theorem 2. ∂r/∂h > r/h iff h''(0) < 0, where h(t) measures hit rate as a function of time and h'(t) and h''(t) are its first and second derivatives, dh/dt and d²h/dt², respectively.

Proof: Express h(t) and h'(t) by their Maclaurin series:

    h(t) = h(0) + t·h'(0) + (t²/2!)h''(0) + (t³/3!)h'''(0) + ...       (6)
    h'(t) = h'(0) + t·h''(0) + (t²/2!)h'''(0) + (t³/3!)h^(4)(0) + ...  (7)

Let us suppose it takes 1 time unit to search the entire document collection and examine every document for relevance, and that the rate at which documents are searched (retrieved) is constant. Thus

    r(t) = t for 0 ≤ t ≤ 1;  r(t) = 1 for t > 1.                       (8)

As long as we continue to retrieve, therefore,

    dr/dt = 1.                                                         (9)

Additionally,

    h(0) = 0.                                                          (10)

As an approximation to (6) and (7), we assume h^(k)(0) is negligible for k ≥ 3.

It follows from these approximations and equations (7) and (9) that, until retrieval ceases:

    ∂r/∂h = (dr/dt)/(dh/dt) = 1/h'(t) = 1/(h'(0) + t·h''(0)).          (11)

Also, by equations (6), (8), (10) and our approximations:

    r/h = r(t)/h(t) = t/(h(0) + t·h'(0) + (t²/2!)h''(0))
        = t/(t·h'(0) + (t²/2!)h''(0)) = 1/(h'(0) + (t/2)·h''(0)).      (12)

Since both denominators are positive while retrieval continues, (11) exceeds (12) iff h'(0) + t·h''(0) < h'(0) + (t/2)·h''(0), that is, iff (t/2)·h''(0) < 0. We see, therefore, that for any positive time t, (11) > (12) iff h''(0) < 0. Thus we have shown ∂r/∂h > r/h iff h''(0) < 0. Q.E.D.

Together, Theorems 1 and 2 reveal that precision (a) falls with increasing recall (h) whenever h''(0) < 0. The condition h''(0) < 0 indicates that the retrieval of relevant documents is decelerating at t = 0, when retrieval begins. More generally, we assume that h''(t) is negative for all values of t. This implies that h'(t) is monotonically decreasing. In other words, the rate of retrieving relevant documents (dh/dt) is maximal at t = 0. This condition is perfectly plausible. For a recall-precision curve representing aggregate retrieval over several queries, it means that the earlier a document is presented, the more likely it is to be relevant. A similar interpretation holds for a recall-precision curve derived from a single query. These interpretations agree with Robertson's "probability ranking principle" (Robertson, 1977).
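The inequality can also be seen numerically (a sketch under assumed values: the quadratic h(t) below is a hypothetical concave hit-rate curve built from truncated series (6) and (7), not the model derived later, and the values of p and h''(0) are ours):

```python
# Numeric illustration of Theorem 2. With r(t) = t and a concave h(t)
# (h''(0) < 0), dr/dh should exceed r/h for every t > 0.
p = 0.05                      # assumed fraction of relevant documents
h1, h2 = 1.0 / p, -30.0       # h'(0) = 1/p; h''(0) < 0 (assumed values)

def h(t):                     # truncated Maclaurin series, equation (6)
    return h1 * t + 0.5 * h2 * t * t

def h_prime(t):               # truncated series, equation (7)
    return h1 + h2 * t

for t in (0.01, 0.02, 0.03):
    dr_dh = 1.0 / h_prime(t)      # equation (11): dr/dt = 1, so dr/dh = 1/h'(t)
    r_over_h = t / h(t)           # equation (12)
    print(t, dr_dh > r_over_h)    # True for each t, as Theorem 2 predicts
```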

3. Examination of Hit-Rate as a Function of Time

We have seen, logically, that precision (acceptance, a) will decline with increasing recall (hit rate, h) iff h''(0) < 0. Further, we have plausibly postulated that h''(t) < 0 for all t. In this section, we examine some further constraints on h(t) and discuss them in relation to various functional forms that might describe h(t).

It is straightforward to provide a best- and worst-case analysis for recall (h), fallout (b), and precision (a). For a collection with p·100% relevant documents requiring 1 time unit to be searched in its entirety, optimal retrieval occurs if all relevant documents are selected before any non-relevant document is selected. Thus,

    h_opt(t) = (1/p)t for t ≤ p;  h_opt(t) = 1 for t > p.

Equation (2) shows that, for optimal retrieval, fallout, b, is 0 through the retrieval of the last relevant document at t = p. That is,

    b_opt(t) = (r(t) - p·h_opt(t))/(1-p) = (t - p·(1/p)t)/(1-p) = 0 through t = p.

Additionally, optimal precision is shown by equation (1) to be

    a_opt(t) = h_opt(t)·p/r(t) = ((1/p)t·p)/t = 1 through t = p.

Under worst-case retrieval, we assume the retrieval system performs exactly as if it were randomly selecting documents. Thus h_worst(t) = r(t) (the hit rate is the same as the retrieval rate). Again using equation (2), we obtain

    b_worst(t) = (r(t) - p·h_worst(t))/(1-p) = (r(t) - p·r(t))/(1-p) = r(t) through t = p.

For such retrieval, equation (1) shows that

    a_worst(t) = h_worst(t)·p/r(t) = r(t)·p/r(t) = p through t = p.
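These bounds are easy to tabulate (a minimal sketch; p = 0.03 is an arbitrary illustrative value, and the function names are ours):

```python
# Best- and worst-case retrieval bounds, following the section's formulas.
p = 0.03                             # illustrative fraction of relevant documents

def r(t):                            # equation (8): constant search rate
    return min(t, 1.0)

def h_opt(t):                        # optimal: all relevant documents come first
    return min(t / p, 1.0)

def h_worst(t):                      # worst case: random selection, h tracks r
    return r(t)

def fallout(t, h):                   # equation (2) rearranged: b = (r - p*h)/(1-p)
    return (r(t) - p * h) / (1.0 - p)

def precision(t, h):                 # equation (1): a = h*p/r
    return h * p / r(t)

t = p                                # the instant optimal retrieval completes
print(precision(t, h_opt(t)), fallout(t, h_opt(t)))      # ~1.0 and ~0.0
print(precision(t, h_worst(t)), fallout(t, h_worst(t)))  # ~p and ~r(t), i.e. 0.03
```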

All of these results are as expected.

We have postulated already that h(0) = 0. To continue our analysis, we also assume for now that h(t) coincides with the graph of best retrieval, h_opt(t), at t = 0. (Intuitively, we are suggesting that the first document retrieved is relevant.) Thus

    h'(0) = h_opt'(0) = 1/p.                                           (13)

We consider several functional forms of h(t) consistent with these constraints in an effort to model hit-rate (recall, h) as a function of time. We first consider quadratic functions, that is, those of the form

    h(t) = αt² + βt + γ,                                               (14)

and try to determine values for the parameters α, β and γ. We see that h'(t) = 2αt + β and, thus, h'(0) = β = 1/p. Similarly, h(0) = γ = 0. To obtain a value for the coefficient α, we specify the time at which the hit rate, h(t), first becomes 1.00 to be k, p < k < 1. Putting

    h(k) = αk² + (1/p)k = 1                                            (15)

and solving, we get α = (p - k)/(k²p). Substituting this into (14), we see that, under these assumptions,

    h(t) = ((p - k)/(k²p))t² + (1/p)t for t ≤ k;  h(t) = 1 for t > k   (16)

and, thus,

    h'(t) = 2((p - k)/(k²p))t + 1/p                                    (17)

until the last relevant document is retrieved at t = k.
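As a sketch (the values of p and k are ours, chosen to expose the flaw; the peak time used below is derived in the next paragraph), one can evaluate this quadratic candidate and watch where its derivative vanishes:

```python
# The quadratic candidate of equations (16)-(17), as a sketch.
p, k = 0.05, 0.3                       # illustrative values with k > 2p

alpha = (p - k) / (k * k * p)          # from solving equation (15)
beta = 1.0 / p                         # h'(0) = 1/p, equation (13)

def h(t):                              # equation (16)
    return alpha * t * t + beta * t if t <= k else 1.0

def h_prime(t):                        # equation (17)
    return 2.0 * alpha * t + beta

t_peak = k * k / (2.0 * (k - p))       # where h'(t) = 0 (derived next)
print(t_peak, t_peak < k)              # 0.18, True: h(t) peaks *before* t = k
print(h(t_peak))                       # 1.8: the model overshoots a hit rate of 1
```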

We note that hit-rate must be a non-decreasing function from t = 0 until t = k. We see, in fact, that h'(t) > 0 only when 2((p - k)/(k²p))t + 1/p > 0. That is, since k > p, h'(t) > 0 only when t < k²/(2(k - p)). In other words, hit rate increases until t = k²/(2(k - p)). For this to be a realistic model of retrieval, k²/(2(k - p)) must be at least as big as k (the time when hit-rate first attains its maximum). Solving k²/(2(k - p)) ≥ k, we see that k ≤ 2p. In words, this means that, under this model of retrieval, the last relevant document must be retrieved by twice the time required to retrieve all (and only) the relevant documents. Said differently, at most 50% of the retrieved documents can be non-relevant by the time of retrieving the last relevant one. That is, precision is always at least 50%. Since this behavior is contradicted by ordinary retrieval experience, we reject the possibility that h(t) is quadratic under our assumptions.

We next consider the possibility that h(t) is a logarithmic function of t, that is, h(t) = log(f(t)), where log denotes the natural logarithm. This is plausible because the time to find a record in a file is known to be proportional to the logarithm of the file size for many good search algorithms. Since h(0) = 0, we see that log(f(0)) = 0, or, equivalently, that

    f(0) = 1.                                                          (18)

Additionally, h'(0) = 1/p means f'(0)/f(0) = 1/p, which implies

    f'(0) = 1/p.                                                       (19)

We use these two constraints in examining hit-rates which are the logarithms of linear functions, that is, those in which f(t) = αt + β. From equation (18), we get β = 1. Using f'(t) = α, together with equation (19), shows α = 1/p. Thus, under our assumptions, if h(t) is the logarithm of a linear function of t, we must have

    h(t) = log((1/p)t + 1).                                            (20)

This equation increases with t (for all values of p) until the last possible relevant document is retrieved. We see, too, that all relevant documents are retrieved when h(t) = 1, that is, when log((1/p)t_f + 1) = 1, or

    t_f = p(e - 1)                                                     (21)

where e is the transcendental number 2.718.... Since this time, t_f, must be less than the time at which we retrieve the last document in the collection (at t = 1.00), we must ensure that p(e - 1) < 1.00, or

    p < 0.582 (approximately).                                         (22)

This constraint, that no more than 58.2% of the documents be relevant to a query, does not restrict the class of queries in any significant way. However, by equation (1), if all relevant documents are retrieved at t_f = p(e - 1) ≈ p·1.718, then, at that time, when the hit-rate is perfect (100%), precision, a, is 1/1.718 ≈ 58.2%. Unfortunately, precision is normally far below this level even for levels of recall considerably less than 100%. Thus, we reject the possibility that hit-rate is modelled by equation (20).
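A sketch makes the rejection concrete (p is an arbitrary illustrative value here): whatever p is, this model pins precision at full recall to 1/(e - 1) ≈ 58.2%.

```python
# The logarithmic candidate of equation (20), as a sketch.
import math

p = 0.03                                  # illustrative fraction of relevant docs

def h(t):                                 # equation (20)
    return math.log(t / p + 1.0)

t_f = p * (math.e - 1.0)                  # equation (21): time of full recall
print(h(t_f))                             # 1.0: all relevant documents retrieved
print(p / t_f)                            # 0.582...: precision at full recall is
                                          # 1/(e-1) by equation (1), for any p
```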

The last candidate formula for hit-rate that we will consider is

    h(t) = log(αt² + βt + γ).                                          (23)

We consider two sets of constraints in determining values for the parameters in this equation. First, using h(0) = 0, we obtain γ = 1. Similarly, h'(0) = 1/p implies that β = 1/p. The value of α may be made to depend on the time at which all relevant documents become retrieved, which we denote by k, p < k < 1. Solving log(αk² + (1/p)k + 1) = 1 for α, we see that

    h(t) = log(((e - (k/p + 1))/k²)t² + (1/p)t + 1),                   (24)

which is the first form of equation (23) we consider. For equation (24) to be suitable, it must be an increasing (or non-decreasing) function in the interval [0, k]. Differentiating (24), we get

    h'(t) = ((2/(pk²))(p(e-1) - k)t + 1/p) / ((1/(pk²))(p(e-1) - k)t² + t/p + 1).   (25)

We set the numerator of equation (25) equal to 0 and solve for t to see that h(t) has a local extremum at

    t = k²/(2(k - p(e-1))).                                            (26)

We see that

    t > 0 when k/p > e - 1.                                            (27)

Also,

    t < k when k/p > 2(e-1).                                           (28)

Thus, by equations (27) and (28), h(t) has a local extremum between t = 0 (when retrieval begins) and t = k (when hit-rate reaches 1.00) whenever k/p > 2(e-1) ≈ 3.436. Notice that p/k is the level of precision when recall reaches 100%. The condition k/p > 3.436, therefore, is equivalent to the realistic condition that precision be less than 29.1% when recall reaches 100%. Since h(t) will not be monotonic under these conditions, we must reject equation (24) as a candidate function for h(t) for general values of p and k.

Instead, we consider h(t) = log(αt² + βt + γ) with an alternative set of hypotheses to derive a function consistent with the retrieval behavior we expect. In particular, we maintain the hypotheses that h(0) = 0 and h(k) = 1.00, but we replace the hypothesis that h'(0) = 1/p by h'(k) = 0. The intuition behind this assumption is that the hit-rate ceases to increase when the last relevant document is retrieved.
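Again a sketch (with illustrative p and k of our choosing, satisfying k/p > 2(e - 1)) exposes the non-monotonicity of (24):

```python
# The first log-quadratic candidate, equation (24), as a sketch.
import math

p, k = 0.05, 0.4                     # k/p = 8 > 2(e-1) = 3.436, a realistic regime

alpha = (math.e - (k / p + 1.0)) / (k * k)   # from solving h(k) = 1

def h(t):                            # equation (24)
    return math.log(alpha * t * t + t / p + 1.0)

t_star = k * k / (2.0 * (k - p * (math.e - 1.0)))   # equation (26)
print(0.0 < t_star < k)              # True: an interior extremum exists
print(h(t_star) > h(k))              # True: h rises above 1, then falls back,
                                     # so (24) is not monotonic on [0, k]
```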

As before, h(0) = 0 implies γ = 1. Further, by stipulating that h(k) = 1, we see log(αk² + βk + 1) = 1, or

    αk² + βk + 1 = e.                                                  (29)

From h'(k) = 0 we obtain

    (2αk + β)/(αk² + βk + 1) = 0.                                      (30)

Solving (29) and (30) for α and β, we see that

    h(t) = log(((1-e)/k²)t² + (2(e-1)/k)t + 1).                        (31)

Note that differentiating h(t) = log(αt² + βt + 1) gives h'(t) = (2αt + β)/(αt² + βt + 1), which has only one local extremum, at t = -β/(2α), for any values of α or β. By hypothesis, (31) has its extremum at t = k. Thus, equation (31) has no extrema in the interval (0, k) and is, therefore, monotonically increasing during this interval, as we desire. By differentiating equation (31) and evaluating at t = 0, we see that

    h'(0) = 2(e-1)/k.                                                  (32)

Since the instantaneous hit-rate at t = 0 must not exceed the rate of optimal retrieval (equation (13)), we get 2(e-1)/k ≤ 1/p, or, equivalently,

    k/p ≥ 2(e-1) ≈ 3.436.                                              (33)

This restriction on the values k can take in relation to p indicates that, under this model, recall will fall short of 100% at least until one has examined about 3.44 times as many documents as ultimately prove to be relevant. That is, approximately 29% is the best precision possible for those seeking perfect recall.
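The accepted model (31) can be checked directly (a sketch; p and k are illustrative values of ours, chosen to satisfy (33)):

```python
# The accepted model, equation (31), as a sketch.
import math

p, k = 0.05, 0.4                      # satisfies k/p = 8 >= 2(e-1), equation (33)

A = (1.0 - math.e) / (k * k)          # alpha in equation (31)
B = 2.0 * (math.e - 1.0) / k          # beta in equation (31)

def h(t):                             # equation (31), valid for 0 <= t <= k
    return math.log(A * t * t + B * t + 1.0)

print(h(0.0))                         # 0.0: retrieval starts with no hits
print(round(h(k), 10))                # 1.0: full recall exactly at t = k
print(B)                              # h'(0) = 2(e-1)/k, equation (32)
print(all(h(k * i / 100) < h(k * (i + 1) / 100) for i in range(100)))
                                      # True: monotonically increasing on (0, k)
```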

4. Conclusion

A theoretical model of the recall-precision (a vs. h) trade-off will be useful in several ways. One, it can shed light on the limits of effective retrieval: what precision must one tolerate to achieve a specified level of recall? Two, a theoretical formulation of this trade-off can help in the building of document retrieval simulations from which operational systems can be derived.

In developing such a model, we have shown that a recall-precision trade-off derives from recall, h(t), decelerating as more documents are retrieved. We have examined several such mathematical functions modeling h(t) to see which seems most consistent with expected retrieval behavior. We have concluded that the best form is h(t) = log(((1-e)/k²)t² + (2(e-1)/k)t + 1), primarily since it is monotonic until h(t) reaches 1.00, predicts realistic levels of precision, and imposes no unrealistic restrictions on when hit-rate reaches 1.00. (In this model, 1 time unit is required to search the entire database, a fraction p of which is relevant, and the last relevant document is retrieved at t = k, p < k < 1.) Curves showing the recall-precision behavior predicted by this model are shown in Figure 4.

Note that equation (31) defines a family of curves parameterized by k. (k can be regarded as the time at which the last relevant document is retrieved or, equivalently, the fraction of the entire collection that has been searched when the last relevant document has been retrieved.) Since equation (31) is monotonic from t = 0 through t = k, knowing what percentage of a real collection must be searched to achieve any specified level of recall determines a particular value for the parameter k.

Future research should be directed at comparing how well this theoretical model predicts actual retrieval and at estimating its parameter, k, for particular retrieval methods. Justification of the model, together with a determination that different values of k are appropriate for different retrieval methods,

presents an analogy to coding methods in Shannon and Weaver's mathematical theory of communication (Shannon and Weaver, 1964): like channel capacity, our model may offer a bound on retrieval effectiveness for various retrieval methods.

REFERENCES

1. Peterson, W. W., Birdsall, T. G., and Fox, W. C., IRE Trans. Info. Theory, 4, 1954, p. 171.
2. Van Meter, D., and Middleton, D., IRE Trans. Info. Theory, 4, 1954, p. 119.
3. Wald, A., Statistical Decision Functions, New York: Wiley, 1950.
4. Tanner, W. P., Jr., and Swets, J. A., Psychol. Rev., 61, 1954, p. 401.
5. Swets, J. A., "Information Retrieval Systems," Science, 141, July 19, 1963, pp. 245-250.
6. Goffman, W., "On Relevance as a Measure," Info. Storage and Retrieval, 2 (3), December 1964, pp. 201-203.
7. Saracevic, T., "Relevance: A Review of and a Framework for the Thinking on the Notion in Information Science," JASIS, 26 (6), November-December 1975, pp. 321-343.
8. Cooper, W. S., "A Definition of Relevance for Information Retrieval," Info. Storage and Retrieval, 7 (1), June 1971, pp. 19-37.
9. Cleverdon, C. W., "On the Inverse Relationship of Recall and Precision," J. Doc., 28, 1972, pp. 195-201.
10. Cleverdon, C. W., "Progress in Documentation: Evaluation Tests of Information Retrieval Systems," J. Doc., 26 (1), March 1970, pp. 55-67.
11. Salton, G., The SMART Retrieval System: Experiments in Automatic Document Processing, Englewood Cliffs, N.J.: Prentice-Hall, 1971.
12. Lancaster, F. W., Evaluation of the MEDLARS Demand Search Service, National Library of Medicine, Bethesda, January 1968.
13. Lancaster, F. W., Information Retrieval Systems: Characteristics, Testing and Evaluation, 2nd ed., New York: Wiley, 1979.
14. Blair, D. C., and Maron, M. E., "An Evaluation of Retrieval Effectiveness for a Full-Text Document-Retrieval System," Comm. ACM, 28 (3), March 1985, pp. 289-299.
15. Salton, G., and McGill, M. J., Introduction to Modern Information Retrieval, New York: McGraw-Hill, 1983.
16. Kochen, M., Principles of Information Retrieval, Los Angeles: Melville (Wiley-Becker and Hayes), 1974, pp. 156-163.
17. Robertson, S. E., "The Probability Ranking Principle in IR," J. Doc., 33, 1977, pp. 294-304.
18. Shannon, C., and Weaver, W., The Mathematical Theory of Communication, Urbana: University of Illinois Press, 1964.

[Figure 2: Typical Recall-Precision Trade-Off. Precision plotted against recall, both axes running from 0 to 1.0.]

[Figure 3: Retrieved vs. Relevant-Retrieved Documents.]

[Figure 4: Derived Recall-Precision Relations. Precision plotted against recall (hit-rate, h), both axes running from 0 to 1.0; p = P(relevant); k = time at which the last relevant document is retrieved.]