Division of Research
Graduate School of Business Administration
The University of Michigan
October 1984

AN EXTENSIONAL SEMANTIC ANALYSIS OF DOCUMENT INDEXING

Working Paper No. 395

David C. Blair

FOR DISCUSSION PURPOSES ONLY
None of this material is to be quoted or reproduced without the expressed permission of the Division of Research.


One of the persistent problems in indexing theory is maintaining the distinction between the indexing language (used to tag documents in a particular collection) and the English language (from which the indexing language is derived). There is an undeniable semantic relation between a term used as a descriptor in the indexing language and that same term used as a word in Standard English. But the semantic relation between these different usages of the same term is more like a family resemblance than an identity relationship. In other words, the dictionary definition of a word may be useful in determining the meaning of that word in Standard English, but it could be quite misleading when used to determine the meaning of that same word when used as a descriptor in an indexing language. This lack of close semantic correspondence between index terms and English words is what Maron and Kuhns refer to as "semantic noise":

    It turns out that given any term there are many possible subjects that it could denote (to a greater or lesser extent), and, conversely, any particular subject of knowledge (whether broad or narrow) usually can be denoted by a number of different terms. This situation may be characterized by saying that there is 'semantic noise' in the index terms. [pp 218-219]

This semantic noise is complicated by the fact that while there does exist a Standard English whose semantics is generally similar in a wide variety of contexts and glosses, there does not appear to be a similar standard indexing language. That is, if the same set of terms were used in two different document collections (eg, a psychology collection and a geology collection), the "meanings" (semantics) of these terms in the two collections (ie, the kinds of information they represent) would be quite different. So, in effect, instead of having two usages of one indexing language, we have two similar, yet distinct indexing languages. They are similar in that they use the same set of descriptors, but different because the same descriptor may (and usually does) have a different meaning in each language. One could think of these indexing languages as dialects of English. The purpose of this paper is to demonstrate a method which reduces the "semantic noise" of a given indexing language, and enables semantic definitions of terms to be made which are specific to that indexing language alone. There are two principal semantic aspects of indexing descriptors which this paper attempts to clarify: 1. The main semantic categories in the indexing language; 2. The

semantic relations between descriptors in the indexing language. Of course, all this information can be obtained from the Library of Congress list of subject headings or Roget's Thesaurus, but I maintain that this would be a misleading way of understanding the semantic relations between these indexing descriptors. The tacit semantic relations extant in an indexing language represent the indexing philosophy of that particular document collection, and this indexing philosophy will vary markedly from document collection to document collection.

FOUNDATIONS

The central issue here is how to make explicit this tacit indexing philosophy. To do this, two questions must be asked: 1. What are the "facts" of a document collection, and how would we describe them? (That is, what are the basic units or "things" which we can examine in a document retrieval system?) 2. What can we infer from these facts? What do these facts "mean?" There are two major types of facts in a document retrieval system: 1. The set of documents in the collection. 2. The set of indexing descriptors actually assigned to the documents in the collection. The documents in the collection

can be counted (eg, the number retrieved at a given time) and can be distinguished one from the other (eg, document 1 is distinguishable from document 7). In terms of the present study this is all we can directly say about the documents: that they are countable and distinguishable. Nothing more can be said about the documents other than what is implied by the indexing descriptors assigned to them. Like the documents in a collection, the indexing descriptors can be both counted and distinguished. More specifically, they can be counted and distinguished in several significant ways: 1. We can calculate the total number of times an individual descriptor is used in a document collection. This is the "breadth" of a given descriptor. 2. We can determine how many descriptors are assigned to each document. This is the "depth" of assignment for a given document. 3. Since we can distinguish different documents and distinguish different terms, we can therefore determine the frequency of co-occurrence of descriptors on documents within the collection. These, then, are the basic "facts" of a document retrieval system, and it is only from these fundamental, observable units that inferences will be made. Previously I had said that the purpose of this paper is to reduce the "semantic noise" of the indexing

language used in a document collection. More specifically, this paper will attempt to show how the basic "facts" of a retrieval system can be used to clarify the indexing philosophy of the system, and provide some further indication of the content of a document which a descriptor assigned to that document cannot give by itself (even if its meanings in Standard English are considered).

SEMANTIC CATEGORIES: RELATIVE BREADTH

The first step in making the indexing philosophy of a document collection explicit is the calculation of the relative "breadth" of each descriptor in the system. Given the set of descriptors actually used in the document collection, T1, T2, T3, ..., Tn, and the aggregate of documents, D, in the collection, then the relative breadth of descriptor Ti is the value ri, such that:

    ri = (the number of times Ti is used in D) / (the number of times the most frequent descriptor is used in D)

ri thus reflects the number of times the descriptor Ti is used in D. For example, if Ti is the most frequently assigned descriptor in the collection then ri = 1. If Ti is not the

most frequently used term, then ri will equal something like, for instance, .65.

SPECIFICITY AND GENERALITY

How shall we interpret the relative breadth (ri) of each descriptor in the collection? First of all, the values of ri, when ranked in order of magnitude, give us an indication of the major and minor categories of the indexing language (the major categories being those with the highest values for ri, and the minor categories being those with the lowest values). More importantly, ri gives us an indication of the "generality" (ri → 1) or "specificity" (ri → 0) of the descriptor whose relative breadth = ri. But note that this specificity/generality is defined solely in terms of how the indexing descriptors are used in this system, and not in terms of the semantic estimation of their specificity/generality in Standard English. For example, in Standard English usage the descriptor "philosophy of science" would be considered quite general, while "tectonic plates" would be considered fairly specific (at least compared to "philosophy of science"). But if, in a given document collection, "philosophy of science" is used only 5 times, and "tectonic plates" is used 150 times, and their

resulting relative breadths (ri) come out to be something like .08 and .85, respectively, then in terms of that collection "philosophy of science" is a very specific descriptor, while "tectonic plates" is a quite general one. Now that we have a quantitative definition of specificity and generality for individual descriptors, we also have a method for estimating the specificity or generality of individual documents. For each document di in D construct the vector <ra, rb, rc, ..., rn>, where ra, rb, ..., rn are the relative breadths of the descriptors Ta, Tb, Tc, ..., Tn which are assigned to di. Next calculate the quantity:

    Si = [(ra)^2 + (rb)^2 + ... + (rn)^2]^1/2

This quantity, Si, is the norm (or length) of the vector composed of the relative breadths of the terms assigned to di, and represents the specificity/generality of the document di in terms of the descriptors [Ta, Tb, Tc, ...] assigned to di. In a less rigorous form, Si represents the approximate specificity of the document di. Note one important thing: the specificity/generality of di is contingent on the independent specificity/generality of the descriptors assigned (as reflected in the norm of the vector), and not on the number of descriptors assigned. In other words, a

given document with only one descriptor could be either quite specific, quite general, or neither, depending on the overall frequency of use of the descriptor. The value of ri reflects a certain attitude which the indexers have towards how the descriptor Ti is used, and the value of Si roughly reflects this same attitude in terms of the document di. In other words, if a document is indexed with several descriptors which are all quite general (have high ri values) then, in terms of that collection, that document would be quite general. Note also that the specificity/generality of a term (ri) affects, but does not control, the specificity/generality of the document (Si) to which that term is assigned (except in the trivial case where Ti is the only descriptor assigned to the document). For example, if a user wanted the most "general" document which had the descriptor Tk assigned to it, then this estimation of the generality of these documents would not be a function of Tk alone, but of the relative generality of the other descriptors which were assigned to the documents to which Tk was assigned. Certainly it must be understood that the measurement of the specificity/generality of descriptors and (especially) documents is a very imprecise notion. A small difference

between two values (eg, .50 and .60) would not necessarily be significant. But when the differences are large (eg, .20 and .80) then Si is a valuable heuristic for estimating specificity. The advantage of calculating Si for the documents in the collection is that it provides an immediate short cut for searches in which the user asks for a "general" or "specific" document on a given topic. The retrieved set could then be the two or three documents (to which the descriptors indicated by the user are assigned) with the lowest/highest Si. Alternatively, Si could be a means of giving an ordering to the entire set of retrieved documents. Given that the specificity/generality of terms is roughly equal to relative breadth, and that the specificity/generality of documents is, at best, a very imprecise notion, it may still be objected that the specificity of terms may in fact be misleading when generalized to documents. Suppose that in a geology library there is one book indexed with the descriptor "philosophy of science" and that book is (in the estimation of a philosophy professor) a very general book on the philosophy of science. Since it is the only work in the collection indexed with this descriptor, the procedure I outlined above would label it a

very specific book on the subject. Who's right? Well, it depends on your point of view. From the perspective of a philosopher, the book is, in fact, quite general. But from the point of view of a geologist (ie, a typical user of this library) any book on philosophy of science would be a very specific book. [Actually, the specificity of this hypothetical work is really moot. A geologist who wanted a "specific" or "general" work on "philosophy of science" would presumably go to the philosophy or main library to search for it. But it is still important to keep "philosophy of science" as a "specific" descriptor in itself, since it may affect the calculation of Si when it is used with other terms on a given document.] My basic arguments, then, for accepting relative breadth as an estimation of specificity/generality are: 1. the specificity/generality of a descriptor/document must be defined in terms of the collection in which these descriptors are used, not in the broader context of English semantics. 2. the number of times a descriptor is used in the collection, relative to the frequency of the most frequent descriptor, is an indication of the specificity/generality of that descriptor. That is, a descriptor which is assigned a great many times in a collection describes a very large

set of documents which have something to do with the concept expressed by that descriptor. This descriptor, therefore, by describing such a large set, is not very specific.

SEMANTIC RELATIONS: DEFINITION OF A TOPIC

There are two ways a given descriptor, T, can be seen: 1. It is a descriptor (word) in the indexing language. 2. It is a word in the English language. The word itself, regardless of which language it is in, can again be seen two ways: 1. As a word. 2. As the designation of a topic or subject area (a word with "semantic noise"). The definition of a word in either the indexing language or the English language is straightforward, but it is not entirely clear what we mean by a "topic," especially in terms of the indexing language. For example, to understand the topic "statistics" one must have a familiarity with how the word "statistics" is used in Standard English. This is a very subjective notion, but there is a certain agreement as to the topic "statistics" among speakers of English. An individual may understand "statistics" to have something to do with mathematics, &/or sampling, &/or experimental design, &/or verification, &/or averaging, &/or sociology, &/or the Gallup Poll, etc. The agreement

upon what the topic "statistics" means is not important here (we are not lexicographers). What is important to understand is that the notion of a "topic" is contingent on the varieties of usage of the word in Standard English. That is, a "topic" designated by a word in a language is understood by the context(s) of that word in the language. Thus, since the topic Tk in the English language must be understood in the context of Standard English, it is reasonable to assume that the topic Tk in the indexing language must be defined in terms of the context(s) of Tk in the indexing language. But what is a "context" in the indexing language? In Standard English the context of a word was defined as the varieties of that word's usage. This appears to be a good rough definition of context (certainly more heuristic than "context" or "topic" by itself). It follows, then, that we must define the topic of Tk in the indexing language in terms of the usage of that descriptor in the indexing language. This usage of Tk consists, of course, in the assignment of Tk to documents in the collection, and the varieties of these assignments are distinguished by noting that the assignments are made to "different" documents (just like English words are used in "different" contexts). Since the contents

of these documents are not differentiated, the only way to distinguish their content is by noting the different descriptors assigned to them. The different contexts in which Tk may appear, therefore, are the groups of descriptors which are assigned to documents to which Tk is also assigned. For example, suppose that Tk is assigned to documents da, db, and dc, and the descriptors assigned to each of these documents are as follows:

    da = Tk, Ta, Te, Tg
    db = Tk, Ta, Te, Th
    dc = Tk, Ta, Tg, Tf

Then the contexts in which Tk appears are Ta, Te, Tg; Ta, Te, Th; and Ta, Tg, Tf, which is nothing more than the distribution of co-occurrences of Tk with other descriptors in the collection. Thus the definition of the "topic" Tk in terms of the indexing language of the system in which it is used is as follows: Tk = <za, zb, zc, ...> such that za, zb, zc, ... correspond to descriptors Ta, Tb, Tc, ... which have concurrent assignment with Tk (ie, they co-occur) on documents in the collection, and:

    zj = (number of times Tj co-occurs with Tk) / (maximum number of times Tk co-occurs with any descriptor)

In the example above of documents da, db, dc, the topic vector for Tk is:

    Tk = <zj> such that za, ze, zf, zg, zh = 1, .67, .33, .67, .33, respectively (all other values of zj = 0)

Topologically speaking, this vector defines the "region" of the topic Tk. As in the definition of specificity, the definition of the topic Tk is solely in terms of the retrieval system in which Tk is used, not in terms of the semantics of Standard English. It may be argued that the co-occurrence of descriptors can be purely coincidental, and that to define a topic in such arbitrary terms would be mistaken. Suppose every time "psychology" is used on a document in a collection the descriptor "statistics" is also used. It may be argued that the notions of "psychology" and "statistics" are independent, and that any relation between the two is tenuous at best. But there is a confusion here. It is true that in terms of Standard English, "psychology" and "statistics" have very little in common. But in an information retrieval system, this does not necessarily

concern us. The relation between "psychology" and "statistics" should be made solely in the indexing language, in terms of the retrieval system itself. In such a case, if "statistics" is used to index every document which is indexed by "psychology" then, in terms of that collection of documents, there is a strong and clear connection between the two descriptors. [Of course, the connection is not symmetrical. If "statistics" is used to index every document which is indexed by "psychology" then there is a strong dependence of "statistics" on "psychology." But unless most of the occurrences of "psychology" coincide with occurrences of "statistics," then "psychology" will not be strongly dependent on "statistics."]

DEFINITION OF TOPIC AND RETRIEVAL REQUESTS

The notion of a topic in terms of the indexing language can be a very powerful tool for ordering relatively unspecific requests (that is, requests which retrieve a large number of documents). For example, a user can request a set of documents "about" the topic designated by descriptor Tk. In such a case, the documents with the descriptor Tk are retrieved and this retrieved set is ordered according to the following procedure: 1. The topic vector for Tk

is recalled (for example, Tk = <za, zb, zc, zd>, where Tk co-occurs with descriptors Ta, Tb, Tc, and Td). 2. The retrieved documents are ranked according to these categories: a. documents indexed with all terms in the topic region (Tk, Ta, Tb, Tc, Td); b. the remainder of the documents are ranked according to the decreasing values of the dot product of the document vectors and the topic vector for Tk. For example:

    Tk = <zj> such that za, zb, zc, zd = 1, .65, .4, .3, respectively (all other values of zj = 0)

d1 is a document indexed with descriptors Ta, Td, and Te, and the corresponding document vector for d1 = <zj> such that za, zd = 1, and all other values of zj = 0. d1's ranking number equals the dot product of the document vector and the topic vector for Tk:

    <1, .65, .4, .3> · <1, 0, 0, 1> = 1.3

d2 is indexed with Tb and Te, and the corresponding document vector for d2 = <zj> such that zb = 1, and all other values of zj = 0. d2's ranking number equals:

    <1, .65, .4, .3> · <0, 1, 0, 0> = .65

d3 is indexed with Ta, Tb, Td, and Te. The document vector for d3 = <zj> such that za, zb, zd = 1. d3's ranking number works out to 1.95.

d4 is indexed with Ta, Tb, Tc, and Te. The document vector for d4 = <zj> such that za, zb, zc = 1. d4's ranking number works out to 2.05.
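The ranking arithmetic above can be checked mechanically. The following sketch (a Python rendering of the paper's procedure, using the hypothetical descriptor names and z values of the example) computes the same dot products:

```python
# Topic vector for Tk over its topic region [Ta, Tb, Tc, Td],
# using the illustrative z values from the example above.
topic = {"Ta": 1.0, "Tb": 0.65, "Tc": 0.4, "Td": 0.3}

# Descriptors assigned to each retrieved document; a term outside the
# topic region (here Te) contributes nothing to the dot product.
docs = {
    "d1": ["Ta", "Td", "Te"],
    "d2": ["Tb", "Te"],
    "d3": ["Ta", "Tb", "Td", "Te"],
    "d4": ["Ta", "Tb", "Tc", "Te"],
}

def rank_number(terms, topic):
    # Dot product of the 0/1 document vector with the topic vector.
    return sum(topic.get(t, 0.0) for t in terms)

# Order the retrieved set by decreasing ranking number.
ranking = sorted(docs, key=lambda d: rank_number(docs[d], topic), reverse=True)
# ranking is d4, d3, d1, d2, with scores 2.05, 1.95, 1.3, .65
```
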

The final ranking of the retrieved set:

    d4 (2.05)
    d3 (1.95)
    d1 (1.30)
    d2 (.65)

If the topic region for Tk is defined as the set of co-occurring descriptors {Ta, Tb, Tc, Td}, then it would be possible (if necessary) to retrieve documents which were similar in subject matter to what Tk designates, but did not actually have Tk as a descriptor. This request would take the Boolean form R = Ta ∩ Tb ∩ Tc ∩ Td, and could be an alternative retrieval method if retrieving with Tk and ranking according to Tk's topic region proved unsatisfactory.

DEFINITION OF TOPIC AND BOOLEAN REQUESTS

When the retrieval request is of Boolean form, a topic region can be calculated for the combined descriptors and used to rank the documents retrieved. For example:

1. Request = Tk ∩ Tn

    Topic region for Tk = {Ta, Tb, Tc, Td}
    Topic region for Tn = {Ta, Tc, Te, Tf}
    Therefore the topic region for Tk ∩ Tn = {Ta, Tb, Tc, Td} ∩ {Ta, Tc, Te, Tf} = {Ta, Tc}

The topic vector for Tk ∩ Tn = <zj> such that za, zc = 1. All other zj values = 0.

The retrieved set of documents are those indexed with both Tk and Tn, and they are ranked according to the values of the dot products of the document vectors of the retrieved documents and the topic vector for Tk ∩ Tn (similar to the document ranking method outlined above).

2. Request = Tk ∪ Tn

If the topic regions are the same as above, then:

    Topic region for Tk ∪ Tn = {Ta, Tb, Tc, Td} ∪ {Ta, Tc, Te, Tf} = {Ta, Tb, Tc, Td, Te, Tf}

The topic vector for Tk ∪ Tn = <zj> such that za, zb, zc, zd, ze, zf = 1. All other zj values = 0.

The retrieved set of documents are those indexed with Tk or Tn or both, and are ranked according to the values of the dot products of the document vectors of the retrieved documents and the topic vector for Tk ∪ Tn (same as above).

If searching under Tk ∩ Tn or Tk ∪ Tn proves unsatisfactory, then, as an alternative, documents indexed with neither Tk nor Tn can be retrieved and ranked according to how closely their document vectors approximate the topic vectors for either Tk ∩ Tn or Tk ∪ Tn.
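A sketch of how the combined topic regions and vectors might be formed (in Python; the regions are the hypothetical ones used in the example above):

```python
region_k = {"Ta", "Tb", "Tc", "Td"}   # topic region for Tk
region_n = {"Ta", "Tc", "Te", "Tf"}   # topic region for Tn

# An AND-request takes the intersection of the two regions, an
# OR-request the union; every descriptor in the combined region
# receives z = 1.
region_and = region_k & region_n      # {Ta, Tc}
region_or = region_k | region_n       # {Ta, Tb, Tc, Td, Te, Tf}

def topic_vector(region):
    return {t: 1.0 for t in region}

def rank_number(doc_terms, vector):
    # Dot product of the 0/1 document vector with the combined topic vector.
    return sum(vector.get(t, 0.0) for t in doc_terms)
```

A document indexed with, say, Ta and Tc would receive ranking number 2 against either combined vector.
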

AUTOMATIC WEIGHTING OF ASSIGNED INDEXING DESCRIPTORS

The method of calculating topic regions for index terms (the set of co-occurring descriptors) and topic vectors (the values of zj associated with the descriptors in a topic region) can now be used to automatically calculate weights for terms which have already been assigned to a document. The procedure is as follows: 1. The indexer assigns a set of descriptors to a document (eg, do = {Td, Tm, Tp}). 2. The topic regions for Td, Tm, and Tp are recalled:

    (eg) Td = {Tm, To, Tp, Tr, Ts}
         Tm = {Td, Tn, Tp}
         Tp = {Ta, Td, Tm, Tr}

3. The corresponding topic vectors are derived:

    (eg) Td = <.7, 1, .1, .5, .3>  [zm, zo, zp, zr, zs]
         Tm = <.1, .6, 1>  [zd, zn, zp]
         Tp = <1, .6, .8, .2>  [za, zd, zm, zr]

4. The individual weights of Td, Tm, and Tp are calculated for document do: a. To determine the weight of Td on do, find the average of the z values for Tm and Tp in Td's topic vector, ie: the weight of Td on do = (zm + zp)/2 = (.7 + .1)/2 = .4. The weight of Tm on do is the average of the

z values for Tp and Td in Tm's topic vector. Thus the weight of Tm = (zp + zd)/2 = (1 + .1)/2 = .55. And finally, the weight of Tp on do is the average of the z values for Td and Tm in Tp's topic vector. The weight of Tp = (zd + zm)/2 = (.6 + .8)/2 = .7.

The resulting weights of the descriptors assigned to do: do = {Td(.4), Tm(.55), Tp(.7)}

The basic assumption behind this weighting method is that the topic region for a given descriptor is the semantic definition of the descriptor in a given document retrieval system. In such a case the weight of a descriptor (say, Tk) on a document, ie, the degree to which Tk applies to a document, is a function of how strongly the other descriptors on that document are related to Tk. If the other descriptors on a document do not correlate highly (have low z values in Tk's topic vector), then it seems reasonable to claim that Tk is not

highly correlated with the document and should receive a correspondingly low weight. The advantage of this automatic weighting procedure is that it keeps the subjective decisions of the indexer down to one (ie, whether to assign a term or not) while increasing the descriptive power of term assignment by indicating what the assignment of certain terms to a document implies about that document.

A MEASUREMENT OF RETRIEVAL EFFECTIVENESS

The measurement of the effectiveness of a retrieval system has generally been guided by the touchstones of precision and recall (the related notion of fallout is also sometimes used). The calculations of precision and recall are full of difficulties. While it is not the purpose of this paper to offer an extended criticism of these two notions, I would like to point out some of the salient difficulties involved as a preliminary to discussing my own measure of retrieval effectiveness. The first difficulty is with calculating these measures. Although the determination of precision can be done relatively easily by asking the user which retrieved documents he believes are relevant (fatigue factors aside,

and user co-operation assumed), the estimation of recall is not quite so straightforward. That is, while the user can determine which retrieved documents are relevant to his need/request, who is going to determine which documents of the large unretrieved set of documents are relevant to this request? [For simplicity, I won't distinguish between relevance to need and relevance to request, although the difference is certainly non-trivial.] Since it is unlikely that the user would have the time or inclination to peruse the entire unretrieved collection, the assessment of whether certain unretrieved documents are relevant must be made by some other person(s). In other words, some "experts" are trying to tell the user what is relevant to his need/request. I am sceptical whether any two individuals' judgements of relevance can be compared meaningfully in this manner at all. But beyond this, even if we have accurate and reliable means for determining precision and recall, there appears another question: do precision and recall measure what we really want to measure when measuring the effectiveness of a retrieval system? More specifically, it is often the case that a user doesn't want all the documents which are relevant to his request/need. In other words it would be perfectly consistent for him to say,

"Yes, all 50 of these documents which you retrieved for me are relevant to my request, but I only want this one [exit]." Every information system contains a great deal of redundant information, and although redundant information may be in fact highly relevant, it may not be required by the user. Certainly too much redundant, though relevant, information would make for poor retrieval efficiency even though precision and recall were both 100%. Clearly, then, the measurement of the effectiveness of retrievals should be based on some other measure. My proposal is that the measurement of effectiveness for a retrieval system should be whether a set of retrieved documents, ranked according to their estimated relevance to the user's need as expressed in a formal request, coincides with the user's preference order of the retrieved documents. For example, suppose the system retrieves ten documents and ranks them as numbers one through ten in decreasing estimated relevance. If the user considers document number one the best, and doesn't want the other nine, then the system should be given the highest mark for retrieval effectiveness. Likewise, if the user wants all ten documents and prefers them in order one through ten, then the system should still be given the highest

mark for effectiveness. A statistical measure such as Spearman's formula for rank correlation could be used to calculate the goodness of fit between the system's ranking of the retrieved documents and the user's ranking of the retrieved documents:

    r = 1 - (6 ΣD^2) / (N(N^2 - 1))

    D = the difference between ranks of corresponding values of X and Y.
    N = the number of pairs of values (X, Y) in the data (set of retrieved documents).

The document ranking procedure for the measurement of retrieval effectiveness is the same as the one outlined in the previous sections.

SEMANTIC DISTANCE BETWEEN TOPICS

Given two topic vectors such as Td = <za, zb, zc, zd, ze> and Te = <za', zb', zc', zd', ze'>, the quantitative estimation of the semantic distance between the topics represented by Td and Te is merely the distance between their respective topic vectors:

    Semantic distance = [(za - za')^2 + (zb - zb')^2 + ... + (ze - ze')^2]^1/2

    [N.B. all other values for zj in these vectors = 0]
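Both the rank-correlation measure and the semantic distance reduce to a few lines of computation. This Python sketch assumes rankings given as lists of rank numbers and topic vectors given as descriptor-to-z mappings:

```python
import math

def spearman(system_ranks, user_ranks):
    # r = 1 - 6*sum(D^2) / (N(N^2 - 1)), where D is the difference
    # between the system's and the user's rank for each document.
    n = len(system_ranks)
    d2 = sum((s - u) ** 2 for s, u in zip(system_ranks, user_ranks))
    return 1 - (6 * d2) / (n * (n * n - 1))

def semantic_distance(t1, t2):
    # Euclidean distance between two topic vectors; a descriptor
    # absent from a vector has z = 0.
    keys = set(t1) | set(t2)
    return math.sqrt(sum((t1.get(k, 0.0) - t2.get(k, 0.0)) ** 2 for k in keys))

# Perfect agreement between the rankings gives r = 1; identical topic
# vectors are distance 0 apart.
```
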

Of course it is easy to see that if the two topic vectors being compared are equivalent, the semantic distance between them will be zero. Thus the closer the semantic distance between two topic vectors approaches zero, the closer the two topics are to being synonymous (in terms of that collection). It is easy to see how many of the measures outlined here could be extended to produce alternative searching procedures. Although this may be a worthwhile exercise, the purpose of this paper has only been to provide the fundamental framework for the quantification of semantic noise in indexing terms. In such a case, if the arguments of this paper are accepted, then these alternative or extended mathematical formulations can only be thought of as subsidiary to the formulations worked out in this paper.

BIBLIOGRAPHY

Cooper, William S. "On higher level association measures," JASIS, v 22, n 5, Sept-Oct 1971, p 354-355.

Gries, David. Compiler Construction for Digital Computers, John Wiley and Sons, Inc., New York, 1971.

Kleene, Stephen C. Mathematical Logic, John Wiley and Sons, Inc., New York, 1967.

Maron, M. E. and J. L. Kuhns. "On relevance, probabilistic indexing and information retrieval," Journal of the Association for Computing Machinery, v 7, n 3, July 1960, p 216-244.

Siegel, Sidney. Non-Parametric Statistics for the Behavioral Sciences, McGraw-Hill, New York, 1956.

Tarski, Alfred. Logic, Semantics, Metamathematics, Oxford at the Clarendon Press, 1956.

Zadeh, L. A. "Fuzzy Sets," Information and Control, v 8, 1965, p 338-353.

---. "The concept of a linguistic variable and its application to approximate reasoning," Dept. of Electrical Engineering and Computer Science, Electronics Research Laboratory, UC Berkeley, Memorandum no. ERL-M411, 15 Oct. 1973.

Zunde, P. and M. Dexter. "Indexing Consistency and Quality," JASIS, v 20, n 3, July 1969, p 259-267.