Conjugacy Class Prior Distributions on Metric-Based Ranking Models

JAYANTI GUPTA, University of Michigan, Ann Arbor, MI
PAUL DAMIEN, University of Michigan, Ann Arbor, MI

SUMMARY A new class of prior distributions for metric-based models in the analysis of fully ranked data is developed. This class has two attractive features: first, it provides a meaningful way to encapsulate prior information about the parameters of the model; second, a full Bayesian solution is made possible via stochastic simulation methods. The ideas are exemplified using illustrative analyses.

Key words: Ranked data, Mallows model, Permutation group, Bayesian inference, Gibbs sampling.

1 Introduction

Suppose that each person in a random sample of n people is asked to rank his preferences among a fixed set of k items. Each person produces a ranking π = [π(1), ..., π(k)], where π(i) denotes the rank assigned to item i, (i = 1, ..., k). The problem here is to characterize the population based on a random sample π_j, j = 1, ..., n, of rankings. Historically, models for random rankings grew out of the literature on paired comparisons. For example, the Thurstone (1927) model specified that item i would be preferred to item j if X_i > X_j, where the X's are independent normal random variables with different means and equal variance. Mosteller (1951) provided simple forms for the least squares estimators of the means of the X's under this model. MacKay and Chaiy (1982) used Monte Carlo methods to compare estimators of the means of the X's in the above model with those under the unequal variance model. Clearly, these paired comparison models can be extended to rankings by letting π_i = rank of X_i. Mallows (1957) also started with models for paired comparisons and used a conditional argument to extend these to models for rankings. His two-parameter models are unimodal, with the probability of a ranking π decreasing as the distance in a certain metric between π and the mode increases. These models were popularized by Feigin and Cohen (1978) and Schulman (1979), who provided tables for their use. Other models for random rankings have been introduced by Luce (1959), Plackett (1975), Fienberg and Larntz (1976), Henery (1981), Berry (1979), and Tallis and Dansie (1983). Gordon (1979) introduced a model based on Ulam's distance, while Fligner and Verducci (1986) investigated Cayley's distance and Kendall's τ-distance. Fligner and Verducci (1990) gave a Bayesian analysis of the generalized Mallows model by introducing prior distributions on the parameters of the model. Diaconis (1988) developed a second-order analysis for ranked data.

In this paper, we discuss the notion of conjugacy classes on the space of permutations and use it to define two classes of prior distributions on the space of rankings. The first, called the conjugacy class prior distribution, uses properties of the permutation group to define the nature of the prior. The second, called the metric-based prior distribution, uses the notion of metrics on conjugacy classes to define prior distributions on rankings. We use the Gibbs sampling algorithm to generate random variates from the distributions of the parameters of the Mallows model, leading to a full Bayesian analysis under both priors. The paper is organised in the following manner. Section 2 describes the Mallows model and defines some of the different metrics on fully ranked data. Section 3 discusses the notion of conjugacy classes of the permutation group and describes the conjugacy class prior distribution for the Mallows model. Section 4 defines metrics between conjugacy classes and discusses the metric-based prior distribution. Section 5 extends these priors to partially ranked data, Section 6 illustrates the priors via examples, and Section 7 concludes the paper with a discussion of the ideas proposed here.

2 Models and Metrics

2.1 The Mallows Model

Mallows (1957) proposed a non-null probability model, that is, a model distinct from the uniform model (the model under which all k! possible rankings of the k items are equally likely) for fully ranked data. The model specifies a particular ranking π₀ ∈ S_k, the permutation group on k objects, which can be interpreted as the most likely or modal ranking of the k items, and states that the probability of any other ranking π decreases exponentially with the distance from π to π₀. So,

P(π) = K(λ)e^{−λd(π, π₀)}, for all π ∈ S_k,

where d(·,·) is a metric on S_k, π₀ is the location parameter, and λ > 0 is a dispersion parameter. The normalizing constant, K(λ), is defined by K(λ)⁻¹ = Σ_{σ∈S_k} e^{−λd(σ, π₀)}, and is independent of the choice of π₀. The model is centered about the ranking π₀, and as λ increases the distribution becomes more and more peaked about π₀.

2.2 Some metrics on fully ranked data

By suitable choice of a metric on S_k, some well-known measures of association between two permutations are obtained; see, for example, Diaconis (1987). These are:

R(π, σ) = (Σ_{i=1}^k (π(i) − σ(i))²)^{1/2}, Spearman's rho distance.

F(π, σ) = Σ_{i=1}^k |π(i) − σ(i)|, Spearman's footrule.

T(π, σ) = number of pairs of items (i, j) such that π(i) < π(j) and σ(i) > σ(j), Kendall's τ.

H(π, σ) = #{i = 1, ..., k : π(i) ≠ σ(i)}, Hamming distance.

U(π, σ) = k − the maximal number of items ranked in the same order by π and σ, Ulam's distance.

C(π, σ) = minimum number of transpositions needed to transform π into σ, Cayley's distance.

It should be noted that all these metrics are right invariant, that is, they remain unchanged under arbitrary relabeling of the items.

3 Conjugacy Class Prior on S_k

Fligner and Verducci (1990) performed a Bayesian analysis of ranked data by introducing prior distributions on the parameters of the Mallows model. They assume a uniform prior for the modal ranking π₀ and an independent conjugate prior for the scale parameter given π₀. In this section, we develop a new class of prior distributions for the modal ranking in the Mallows model;

the uniform prior, of course, is obtained as a special case.

3.1 Conjugacy Classes of the Permutation Group

Definition 3.1.1 For two permutations π₁, π₂ ∈ S_k, π₁ is said to be a conjugate of π₂ in S_k (written π₁ ∼ π₂) if there exists an element σ in S_k such that π₁ = σπ₂σ⁻¹. Conjugacy is an equivalence relation on S_k, and so splits the group into equivalence classes, called conjugacy classes.

Definition 3.1.2 Given an integer k, we say the positive integers k₁, k₂, ..., k_r, with k₁ ≤ k₂ ≤ ⋯ ≤ k_r, constitute a partition of k if k = k₁ + k₂ + ⋯ + k_r.

Definition 3.1.3 The set of integers (i₁, i₂, ..., i_r) is said to be a cycle of the permutation π ∈ S_k if π sends i₁ into i₂, i₂ into i₃, ..., i_{r−1} into i_r and i_r into i₁, and leaves all other items fixed.

Definition 3.1.4 A permutation π ∈ S_k has the cycle decomposition {k₁, k₂, ..., k_r} if it can be written as the product of disjoint cycles of lengths k₁, k₂, ..., k_r, k₁ ≤ k₂ ≤ ⋯ ≤ k_r.

For example, in S₉ the permutation π = (1 3 2 5 6 4 7 9 8) (in one-line notation) = (1)(2, 3)(4, 5, 6)(7)(8, 9) has cycle decomposition {1, 1, 2, 2, 3}, and 1 + 1 + 2 + 2 + 3 = 9.

Let p(k) denote the number of partitions of k. Each time we break a given permutation in S_k into a product of disjoint cycles, we obtain a partition of k; for if the cycles appearing have lengths k₁, k₂, ..., k_r respectively,

k₁ ≤ k₂ ≤ ⋯ ≤ k_r, then k = k₁ + k₂ + ⋯ + k_r. A well-known result in algebra states that two permutations in S_k are conjugate if and only if they have the same cycle decomposition. The reason is the following formula for computing the conjugate: if π written in cycle notation is (a ⋯ b)(c ⋯ d) ⋯ (e ⋯ f), then σπσ⁻¹ = (σ(a) ⋯ σ(b))(σ(c) ⋯ σ(d)) ⋯ (σ(e) ⋯ σ(f)). This results in the following lemma.

Lemma The total number of conjugacy classes in S_k is p(k), the number of partitions of k.

Proof: There is one conjugacy class for each partition of k: thus the identity forms a class, the transpositions or 2-cycles {(ij)} form a class, and the 3-cycles {(ijk)}, the products of 2-2 cycles {(ij)(kl)}, and so on, each form a conjugacy class; whence the lemma.

3.2 Choice of prior distribution on S_k

The Mallows model P(π) = K(λ)e^{−λd(π, π₀)} has two parameters: π₀, the location parameter, and λ > 0, the scale parameter. It is well known [Serre, 1977, pp. 32-33] that a natural choice for a measure on a topological group is the Haar measure of that group. For the finite permutation group S_k with the discrete topology, however, the Haar measure is simply the uniform distribution on the group. In the Mallows model, a prior distribution for the scale parameter λ could be P(λ) ∝ e^{−α₀λ}, λ ∈ ℝ⁺, i.e. Exponential(α₀). Interest here is in developing a prior distribution for the modal ranking π₀.
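The cycle-type machinery above is easy to verify computationally. The sketch below is ours, not part of the paper: it computes the cycle decomposition of a permutation given in one-line notation and checks the stated fact that conjugation preserves cycle type. Function names are our own choices.

```python
from itertools import permutations

def cycle_type(perm):
    # perm is one-line notation: perm[i-1] is the image of i.
    # Returns the sorted cycle lengths, i.e. a partition of k.
    k, seen, lengths = len(perm), set(), []
    for start in range(1, k + 1):
        if start not in seen:
            length, x = 0, start
            while x not in seen:
                seen.add(x)
                x = perm[x - 1]
                length += 1
            lengths.append(length)
    return sorted(lengths)

def conjugate(sigma, pi):
    # Returns sigma * pi * sigma^{-1} in one-line notation, using the
    # relabeling formula: (sigma pi sigma^{-1})(sigma(i)) = sigma(pi(i)).
    k = len(pi)
    result = [0] * k
    for i in range(1, k + 1):
        result[sigma[i - 1] - 1] = sigma[pi[i - 1] - 1]
    return result

# The S_9 example from the text: (1)(2,3)(4,5,6)(7)(8,9).
pi = [1, 3, 2, 5, 6, 4, 7, 9, 8]
print(cycle_type(pi))  # [1, 1, 2, 2, 3]

# Conjugation by every sigma in S_5 preserves the cycle type of rho.
rho = [2, 1, 4, 5, 3]
for sigma in permutations(range(1, 6)):
    assert cycle_type(conjugate(list(sigma), rho)) == cycle_type(rho)
```

The loop over S₅ is a brute-force confirmation of the conjugacy result quoted above for one permutation; the same check passes for any ρ.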

We develop a new approach to constructing a prior distribution for π₀. This approach must satisfy two features: (a) can one exploit the structure of the permutation group in which the data are observed to encapsulate prior information about the random quantity π₀? (b) does the prior distribution have a "sensible" interpretation? Since the notion of conjugacy classes is central in the study of permutation groups, we wish to exploit this feature of the group to construct a prior. In particular, the prior distribution on π₀ will be taken to be constant on conjugacy classes, and proportional to the number of elements in the conjugacy class to which it belongs. Is this "sensible?" We think so, because we are assigning equal prior probabilities to all permutations that permute an equal number of items and leave the remaining items unchanged. Let us consider a simple example to illustrate this. The group S₄ has 24 elements. Since four can be partitioned in five ways, there are five conjugacy classes in S₄. These classes are listed here along with the number of elements in each class.

Partition    Conjugacy class    # of elements
(1,1,1,1)    {(1)(1)(1)(1)}     1
(1,1,2)      {(1)(1)(2)}        6
(1,3)        {(1)(3)}           8
(2,2)        {(2)(2)}           3
(4)          {(4)}              6

The elements within a conjugacy class are similar, as they represent rankings that were assigned by permuting the order in which the items were observed in a similar manner. In the above example, the first conjugacy class consists of the identity element, in which the items were ranked in the same order in which they were observed, i.e. the item that was observed first received the first rank, the item observed second was ranked second, and so on. The second conjugacy class, {(1)(1)(2)}, consists of those permutations in which two of the items were assigned the same rank as the order in which they were observed, corresponding to the two 1's in this partition, while the remaining two had their ranks interchanged, corresponding to the 2 in the partition. For example, in the permutation (4231), which belongs to this class, the items that were observed second and third were assigned ranks 2 and 3 respectively, while the items appearing first and fourth received ranks 4 and 1 respectively. Because of this similarity between rankings within a conjugacy class, we assign equal prior probabilities to all these rankings. Also, conjugacy classes are invariant to relabeling of the items; i.e., if all the items were observed in a different order, the conjugacy classes would remain unchanged. Further, a conjugacy class that has more elements is more likely to contain the modal ranking, hence these prior probabilities are also taken proportional to the number of elements in the conjugacy class a ranking belongs to or, more generally, to some function of the number of elements in that class. Recall that in the Mallows model the probability of any ranking decreases exponentially as its distance from the modal ranking π₀ increases. Analogously, we may argue that the prior probability of the modal ranking increases exponentially as the number of elements in its conjugacy class increases. This gives rise to another rule for assigning prior probabilities to rankings:

P(π) ∝ e^{βc(π)},

where c(π) = number of elements in the conjugacy class of π, and β > 0 represents the strength of our prior belief. Larger values of β indicate stronger prior beliefs about the distribution of the modal ranking, and vice versa, with β = 0 yielding the uniform distribution for the modal ranking. Since

the space of all rankings S_k is finite, the prior distribution on π₀ is proper, which implies that the posterior on π₀ will also be proper.

3.3 A Gibbs sampler for the Mallows model

Here, we develop a Gibbs sampler algorithm to sample from the posterior distribution of the parameters of the Mallows model using the prior distribution developed earlier. Suppose a group of k items is ranked by n people. Let π₁, π₂, ..., π_{k!} be the set of all possible rankings of these k items. Let the data be (α_i, i = 1, ..., n), the rankings of these items by the n people. The joint likelihood is given by:

P(α₁, ..., α_n | π₀, λ) = (K(λ))ⁿ e^{−λ Σ_{i=1}^r n_i d(α_i, π₀)},

where r = # of distinct rankings in the data and n_i = # of people with ranking α_i.

The prior distribution of the parameters is:

P(π₀, λ) = P(π₀) · P(λ) = [e^{βc(π₀)} / Σ_{j=1}^{k!} e^{βc(π_j)}] · α₀e^{−α₀λ},

where c(π₀) = # of elements in the conjugacy class of π₀.

The conditional density of λ | data, π₀ is:

P(λ | data, π₀) = P(data | π₀, λ)P(λ) / ∫₀^∞ P(data | π₀, λ′)P(λ′) dλ′ ∝ (K(λ))ⁿ e^{−λ(α₀ + Σ_{i=1}^r n_i d(α_i, π₀))}.
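Since the normalizing constant and the likelihood above involve only sums over S_k, they can be checked by brute-force enumeration when k is small. The sketch below is ours, not from the paper: it implements Kendall's τ from Section 2.2 and evaluates K(λ)⁻¹ and the joint log-likelihood by enumerating S_k. The function names and the toy counts are assumptions for illustration.

```python
import math
from itertools import permutations

def kendall_tau(p, s):
    # Number of item pairs (i, j) with p(i) < p(j) but s(i) > s(j).
    k = len(p)
    return sum(1 for i in range(k) for j in range(k)
               if p[i] < p[j] and s[i] > s[j])

def K_inv(lam, pi0, d=kendall_tau):
    # K(lambda)^{-1} = sum over S_k of exp(-lambda * d(pi, pi0));
    # by right invariance it does not depend on pi0.
    k = len(pi0)
    return sum(math.exp(-lam * d(p, pi0))
               for p in permutations(range(1, k + 1)))

def log_likelihood(data, pi0, lam, d=kendall_tau):
    # data: list of (ranking, count) pairs, i.e. (alpha_i, n_i).
    n = sum(n_i for _, n_i in data)
    return (-n * math.log(K_inv(lam, pi0))
            - lam * sum(n_i * d(a, pi0) for a, n_i in data))

pi0 = (2, 1, 4, 3)
data = [((2, 1, 4, 3), 5), ((1, 2, 4, 3), 2)]   # toy counts, ours
print(log_likelihood(data, pi0, 0.4))
```

Enumeration is feasible only for small k, but it is a useful cross-check on any sampler built from these full conditionals.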

The conditional density of π₀ | data, λ is:

P(π₀ | data, λ) = P(data | π₀, λ)P(π₀) / Σ_{π_j∈S_k} P(data | π_j, λ)P(π_j) = e^{−λ Σ_{i=1}^r n_i d(α_i, π₀)} e^{βc(π₀)} / Σ_{j=1}^{k!} e^{−λ Σ_{i=1}^r n_i d_{ij}} e^{βc_j},

where d_{ij} = d(α_i, π_j), c_j = c(π_j), and j runs over all k! permutations in S_k.

Using the full conditional densities as described above, a hybrid Gibbs/Metropolis-Hastings algorithm can be used to obtain posterior estimates of the parameters. Details of the algorithm are given in the appendix.

Computation of c(π₀)

Let π₀ correspond to the partition λ₁ + λ₂ + ⋯ + λ_r = k, i.e. π₀ can be written as the product of r disjoint cycles of lengths λ₁, λ₂, ..., λ_r respectively. Suppose that r₁ of these are distinct. Let λ₁*, λ₂*, ..., λ_{r₁}* denote these distinct values and let the number of λ's equal to λ_i* be k_i, i = 1, ..., r₁. Then c(π₀), the # of elements of S_k that belong to the above partition, is:

c(π₀) = [k! / ∏_{i=1}^{r₁} k_i!] (λ₁*)^{−k₁} (λ₂*)^{−k₂} ⋯ (λ_{r₁}*)^{−k_{r₁}}.

With our choice of prior distribution on the parameters, we would now be interested in finding the modal ranking of the posterior distribution. For π₁, π₂ ∈ S_k, what are the conditions under which π₁ would have a higher posterior probability than π₂? When would the ranking with the highest proportion in the observed data also have the highest posterior probability? Is there a relationship between λ, the scale parameter in the model, and β, the scaling factor in the prior distribution, that determines which ranking gets the highest posterior probability? The following theorem and its corollaries answer some of these questions.

Theorem 3.1 Let D(π_j) = Σ_{i=1}^n d(α_i, π_j), the sum of the distances of the observed rankings from π_j. For π₁, π₂ ∈ S_k, and given λ, β > 0, π₁ will have a higher posterior probability than π₂ iff c(π₁) − c(π₂) > γ(D(π₁) − D(π₂)), where γ = λ/β.

Proof:

P(π₁ | data, λ) > P(π₂ | data, λ)
iff (K(λ))ⁿ e^{−λD(π₁)} e^{βc(π₁)} / Σ_{j=1}^{k!} e^{βc(π_j)} > (K(λ))ⁿ e^{−λD(π₂)} e^{βc(π₂)} / Σ_{j=1}^{k!} e^{βc(π_j)}
iff e^{−λD(π₁) + βc(π₁)} > e^{−λD(π₂) + βc(π₂)}
iff e^{β(c(π₁) − c(π₂))} > e^{λ(D(π₁) − D(π₂))}
iff β(c(π₁) − c(π₂)) > λ(D(π₁) − D(π₂))
iff c(π₁) − c(π₂) > γ(D(π₁) − D(π₂)).

Corollary 3.1 Let π̂ be the m.l.e. of the modal ranking in the Mallows model, i.e. π̂ minimizes Σ_{i=1}^n d(α_i, π). If c(π₁) < c(π̂), then π₁ will have a lower posterior probability than π̂.

Corollary 3.2 For π₁, π₂ ∈ S_k, if c(π₁) = c(π₂), then π₁ will have a higher posterior probability than π₂ iff D(π₁) < D(π₂).

Corollary 3.3 For π₁, π₂ ∈ S_k, if c(π₁) > c(π₂) and D(π₁) < D(π₂), then π₁ will have a higher posterior probability than π₂.
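The class-size formula c(π₀) = k!/∏ᵢ(kᵢ! (λᵢ*)^{kᵢ}) can be confirmed against a brute-force count for small k. The sketch below is ours, not from the paper; it implements the formula and checks it over S₄, where the five class sizes are 1, 6, 8, 3, and 6.

```python
import math
from collections import Counter
from itertools import permutations

def cycle_type(perm):
    # Sorted cycle lengths of a permutation in one-line notation.
    k, seen, lengths = len(perm), set(), []
    for start in range(1, k + 1):
        if start not in seen:
            n, x = 0, start
            while x not in seen:
                seen.add(x)
                x = perm[x - 1]
                n += 1
            lengths.append(n)
    return tuple(sorted(lengths))

def class_size(partition):
    # c(pi0) = k! / prod_i (k_i! * (lambda_i*)^{k_i}), where the
    # distinct cycle lengths lambda_i* occur k_i times each.
    k = sum(partition)
    mult = Counter(partition)          # lambda_i* -> k_i
    size = math.factorial(k)
    for lam_star, k_i in mult.items():
        size //= math.factorial(k_i) * lam_star ** k_i
    return size

# Brute-force check over S_4.
counts = Counter(cycle_type(p) for p in permutations(range(1, 5)))
for part, n in counts.items():
    assert class_size(part) == n
print(dict(counts))
```

The same `class_size` function gives c(π₀) for the π₀-conditional above without ever enumerating the class.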

4 A Metric-Based Prior Distribution for π₀

4.1 Metrics on Conjugacy Classes

For any set X endowed with a bounded metric d, the induced Hausdorff distance d* between any two closed non-empty subsets of X is well defined and satisfies the axioms for a metric on such subsets of X [Kuratowski, 1966, pp. 214-215]. Since conjugacy classes are subsets of S_k, each of the metrics on S_k can be extended to compute the induced Hausdorff metric on conjugacy classes, given by:

d*(C_π, C_σ) = max{ max_{α∈C_π} min_{β∈C_σ} d(α, β), max_{β∈C_σ} min_{α∈C_π} d(α, β) },

where C_π and C_σ are the conjugacy classes of π and σ respectively. The induced metric d* is right-invariant if d is.

Theorem 4.1 If a metric d on S_k is bi-invariant, then its induced Hausdorff metric d* on the conjugacy classes may be computed according to the simpler formula:

d*(C_π, C_σ) = min_{β∈C_σ} d(π, β) = min_{α∈C_π} d(α, σ).

Proof:

max_{β∈C_σ} min_{α∈C_π} d(α, β)
= max_{τ₂∈S_k} min_{τ₁∈S_k} d(τ₁πτ₁⁻¹, τ₂στ₂⁻¹)
= max_{τ₂∈S_k} min_{τ₁∈S_k} d(τ₁π, τ₂στ₂⁻¹τ₁)   (right invariance)
= max_{τ₂∈S_k} min_{τ₁∈S_k} d(π, τ₁⁻¹τ₂στ₂⁻¹τ₁)   (left invariance)
= max_{τ₂∈S_k} min_{τ₃∈S_k} d(π, τ₃στ₃⁻¹),   where τ₃ = τ₁⁻¹τ₂ ranges over S_k for each fixed τ₂,
= min_{τ₃∈S_k} d(π, τ₃στ₃⁻¹)
= min_{β∈C_σ} d(π, β).

Similarly, max_{α∈C_π} min_{β∈C_σ} d(α, β) = min_{β∈C_σ} d(π, β).

Hence, d*(C_π, C_σ) = min_{β∈C_σ} d(π, β). Also,

min_{β∈C_σ} d(π, β) = min_{τ∈S_k} d(π, τστ⁻¹) = min_{τ₁,τ₂∈S_k} d(τ₁πτ₁⁻¹, τ₂στ₂⁻¹) = min_{α∈C_π} d(α, σ).

Among the six metrics defined in Section 2.2, only two are bi-invariant: Hamming distance and Cayley's distance. The above theorem can thus be used to derive simpler forms for these two metrics. However, any metric can be made invariant by averaging it [Diaconis, 1987, pp. 114-115]. Hence the above theorem can be used to compute induced Hausdorff distances for all of the metrics. The corollaries below provide the simplified forms of the metrics induced by Cayley's and Hamming distance.

Corollary 4.1 The Hausdorff metric between two conjugacy classes induced by Cayley's distance is the minimum number of transpositions required to transform a permutation of one of the conjugacy classes into a permutation of the other class. This transformation occurs in the following manner. Let the ranking π belong to the conjugacy class having the cycle decomposition {k₁, k₂, k₃, ..., k_r}. Suppose π is given by:

π = (a₁, ..., a_{k₁})(b₁, ..., b_{k₂}) ⋯ (p₁, ..., p_{k_r}).

A transposition applied to π can be one of two possible types: the two numbers being permuted could either belong to the same cycle of π (say a_i and a_j), or they could belong to different cycles of π (say a_i and b_j). A transposition of the first type results in a ranking with cycle decomposition {(j−i), k₁−(j−i), k₂, ..., k_r}, i.e. the cycle containing a_i and a_j is split

into two cycles with lengths depending on the positions of a_i and a_j in the cycle, while a transposition of the second type gives the cycle decomposition {(k₁+k₂), k₃, ..., k_r}, where the two cycles containing a_i and b_j are combined to form one cycle of length k₁ + k₂.

Definition 4.1 For two conjugacy classes C_π and C_σ in S_k, with partitions {k₁, k₂, ..., k_l} and {p₁, p₂, ..., p_m} respectively, the relative partition of C_π w.r.t. C_σ, denoted by C_π|C_σ, is the set of k_i's without those that are equal to some distinct p_j.

For example, consider the conjugacy classes C_π = {2, 2, 1} and C_σ = {3, 2} of S₅. Then the relative partition of C_π w.r.t. C_σ is given by C_π|C_σ = {2, 1}, while the relative partition of C_σ w.r.t. C_π is given by C_σ|C_π = {3}.

Definition 4.2 For two conjugacy classes C_π and C_σ in S_k, with partitions {k₁, k₂, ..., k_l} and {p₁, p₂, ..., p_m} respectively, C_π and C_σ are said to be divisible if there exist K ⊆ C_π|C_σ and P ⊆ C_σ|C_π, with K, P ≠ ∅, such that Σ{k_i : k_i ∈ K} = Σ{p_j : p_j ∈ P}. Then {K, P} is called a pair of divisible subpartitions of {C_π, C_σ}.

For example, the conjugacy classes of S₁₁ given by C_π = {5, 4, 1, 1} and C_σ = {3, 2, 2, 2, 2} are divisible. A pair of divisible subpartitions would be K = {5} and P = {3, 2}. K₁ = {4} and P₁ = {2, 2} form a second pair of divisible subpartitions, which is disjoint from the first pair.

Corollary 4.2 The Hausdorff metric induced by Hamming distance between two conjugacy classes C_π and C_σ that are not divisible is the number of integers in C_π|C_σ plus the number of integers in C_σ|C_π minus 1, i.e.

d*(C_π, C_σ) = #{C_π|C_σ} + #{C_σ|C_π} − 1.

Proof: Let {k₁, k₂, ..., k_l} and {p₁, p₂, ..., p_m} be the partitions of C_π and C_σ respectively. Without loss of generality, suppose k₁ < p₁. Then of the k₁ elements corresponding to the first cycle of π, at most k₁ − 1 of them would be equal to the elements in the corresponding positions of σ. Hence the minimum contribution to Hamming distance from these k₁ elements would be 1. Now consider k₂ and (p₁ − k₁). As before, if k₂ < (p₁ − k₁), the minimum contribution to Hamming distance from these k₂ elements of π is 1. If k₂ > (p₁ − k₁), the minimum contribution from the (p₁ − k₁) elements would again be 1. Since C_π and C_σ are not divisible, k₂ ≠ (p₁ − k₁). Continuing this pairwise comparison of cycle lengths of C_π and C_σ, we see that for each comparison the minimum contribution to Hamming distance is 1. Since the numbers of integers in the partitions of C_π and C_σ are l and m respectively, there would be l + m − 1 such comparisons; hence the corollary.

Corollary 4.3 When C_π and C_σ are divisible, let r be the maximum number of disjoint pairs of divisible subpartitions of {C_π, C_σ}. Then the induced Hamming distance between them is given by

d*(C_π, C_σ) = #{C_π|C_σ} + #{C_σ|C_π} − r.

Proof: The proof of this result is the same as the previous one, except that now, since C_π and C_σ are divisible, it is possible that k₂ = (p₁ − k₁) when K = {k₁, k₂} and P = {p₁} form a pair of divisible subpartitions. Then the minimum contribution to Hamming distance from these elements would still be 1, but the next pairwise comparison would now be between k₃ and p₂. So we now lose one pairwise comparison, hence one contribution to the distance. If r is the maximum number of disjoint pairs of divisible subpartitions of {C_π, C_σ}, then in the pairwise comparison of cycle lengths of C_π and C_σ we would lose a maximum of r − 1 such comparisons; hence the result.
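Theorem 4.1 makes the induced Hausdorff distance cheap to evaluate for a bi-invariant metric: fix any π in one class and minimize over the other class. The sketch below is ours, not from the paper; it does this by enumeration for the two bi-invariant metrics, Hamming and Cayley's distance, and reproduces the S₅ example above: the classes {1, 2, 2} and {2, 3} are at induced Hamming distance 2 (as Corollary 4.3 predicts) and induced Cayley distance 1 (one transposition merges the 1-cycle with a 2-cycle).

```python
from itertools import permutations

def cycle_type(perm):
    # Sorted cycle lengths of a permutation in one-line notation.
    k, seen, lengths = len(perm), set(), []
    for start in range(1, k + 1):
        if start not in seen:
            n, x = 0, start
            while x not in seen:
                seen.add(x)
                x = perm[x - 1]
                n += 1
            lengths.append(n)
    return tuple(sorted(lengths))

def hamming(p, s):
    return sum(a != b for a, b in zip(p, s))

def cayley(p, s):
    # k minus the number of cycles of p*s^{-1}: the minimum number of
    # transpositions carrying s to p.
    k = len(p)
    sinv = [0] * k
    for i, v in enumerate(s):
        sinv[v - 1] = i + 1
    comp = tuple(p[sinv[i] - 1] for i in range(k))
    return k - len(cycle_type(comp))

def induced_hausdorff(part1, part2, d, k):
    # Theorem 4.1 (bi-invariant d): d*(C_pi, C_sigma) equals
    # min over beta in C_sigma of d(pi, beta), for any fixed pi in C_pi.
    all_perms = list(permutations(range(1, k + 1)))
    pi = next(p for p in all_perms if cycle_type(p) == part1)
    return min(d(pi, b) for b in all_perms if cycle_type(b) == part2)

print(induced_hausdorff((1, 2, 2), (2, 3), hamming, 5))  # 2
print(induced_hausdorff((1, 2, 2), (2, 3), cayley, 5))   # 1
```

Because the minimization fixes a single representative π, the cost is one pass over C_σ rather than over C_π × C_σ.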

4.2 Prior Distribution via Metrics on Conjugacy Classes

In this section we introduce another prior distribution on rankings, based on the distance between conjugacy classes. Let us specify a ranking π* which we believe a priori to be the modal ranking. Let C_{π*} be the conjugacy class containing π*. We argue that the prior probability of a conjugacy class decreases exponentially as its distance from C_{π*} increases, i.e.

P(C_π) ∝ e^{−λ* d*(C_π, C_{π*})}, λ* > 0,

where λ* is a scalar that determines how peaked the distribution is around C_{π*}. This prior distribution on conjugacy classes induces a prior distribution on rankings if we assume that all rankings within a conjugacy class have the same prior probability. The induced prior distribution on rankings is given by:

P(π) = e^{−λ* d*(C_π, C_{π*})} / Σ_{j=1}^{k!} e^{−λ* d*(C_{π_j}, C_{π*})}.

Let us develop the Gibbs sampler algorithm for the Mallows model using the above prior on π₀, the modal ranking. Let the prior distribution on λ be Exp(α₀). The conditional density of λ | data, π₀ is:

P(λ | data, π₀) = P(data | π₀, λ)P(λ) / ∫₀^∞ P(data | π₀, λ′)P(λ′) dλ′.

The conditional density of π₀ | data, λ is:

P(π₀ | data, λ) = P(data | π₀, λ)P(π₀) / Σ_{π_j∈S_k} P(data | π_j, λ)P(π_j)

= e^{−λ Σ_{i=1}^r n_i d(α_i, π₀)} e^{−λ* d*(C_{π₀}, C_{π*})} / Σ_{j=1}^{k!} e^{−λ Σ_{i=1}^r n_i d_{ij}} e^{−λ* d*(C_{π_j}, C_{π*})}.

5 Prior Distributions on Partially Ranked Data

5.1 Partially Ranked Data

Given a set of n items, a partial ranking of k out of these n items is a ranking in which only the first k choices are specified. An example would be an election in which 10 candidates are contesting and people are asked to rank only the 5 candidates they most favor. A partial ranking of this type forms an element of the coset space S_n/S_{n−k}, where S_{n−k} is the subgroup of S_n consisting of all permutations which leave the first k integers fixed:

S_{n−k} = {τ ∈ S_n : τ(i) = i, 1 ≤ i ≤ k}.

The equivalence relation ∼ defined on S_n by: for π, σ ∈ S_n, π ∼ σ ⟺ πσ⁻¹ ∈ S_{n−k}, partitions S_n into equivalence classes such that for any π ∈ S_n, the equivalence class containing π, denoted by S_{n−k}π, is {τπ : τ ∈ S_{n−k}} and is called a right coset of S_{n−k}. It follows that to each partial ranking of k out of n items there corresponds a unique right coset of S_{n−k}, and two full permutations π, σ ∈ S_n belong to the same right coset of S_{n−k} iff π and σ induce the same partial ranking of k out of n items, i.e. π⁻¹(i) = σ⁻¹(i), 1 ≤ i ≤ k.

All the metrics on fully ranked data discussed in Section 2.2 can be extended to form metrics on partially ranked data, and the Mallows model for such data [Critchlow, 1985, pp. 100-101] can be written as

P(πᵖ) = C(λ)e^{−λ d_p(πᵖ, π₀ᵖ)}

for all partial rankings πᵖ ∈ S_n/S_{n−k}. Here, π₀ᵖ is a location parameter representing the modal partial ranking and λ > 0 is a dispersion parameter; d_p(·,·) is the induced Hausdorff metric on the coset space S_n/S_{n−k} and C(λ) is the normalizing constant.

5.2 Prior Distributions via Coset Classes on Partially Ranked Data

Recall that for fully ranked data we divided the space of all rankings into conjugacy classes, because we believed that rankings within a conjugacy class were similar to each other, as these rankings were assigned by permuting the order in which the items were observed in a similar manner. In the case of partially ranked data, we wish to argue that among all the people who rank k out of the n items, those who choose to rank the same set of k items are similar in some sense, as they have the same choice of k favorite items but may choose to rank them differently. With this in mind, let us partition the coset space S_n/S_{n−k} into coset classes, where each coset class consists of all partial rankings that choose the same set of k items out of the n items but rank them differently. On fully ranked data, the conjugacy class prior on π₀ assigned equal probabilities to rankings within a conjugacy class, proportional to the number of elements in the class, c(π₀), and was given by:

P(π_j) ∝ e^{βc(π_j)}.

In the case of partially ranked data, each coset class has the same number of elements, so the analog of the conjugacy class prior here would be:

P(π_jᵖ) ∝ e^{β_j},

where β_j is a scalar quantity representing our prior belief in the coset class containing π_jᵖ. So our choice of β_j for each coset class should be proportional

to the strength of our belief in the k items being ranked in that class. Using this prior on the modal partial ranking, and an independent exponential prior for the scale parameter λ, the conditional density of π₀ᵖ | data, λ is calculated to be

P(π₀ᵖ | data, λ) = e^{−λ Σ_{i=1}^m d_p(α_iᵖ, π₀ᵖ)} e^{β(π₀ᵖ)} / Σ_j e^{−λ Σ_{i=1}^m d_p(α_iᵖ, π_jᵖ)} e^{β(π_jᵖ)},

where α_iᵖ, i = 1, ..., m, are the observed partial rankings, the sum in the denominator runs over all partial rankings π_jᵖ ∈ S_n/S_{n−k}, β(π_jᵖ) denotes the β for the coset class containing π_jᵖ, and n_k = (n choose k) is the number of coset classes in S_n/S_{n−k}.

The metric-based prior on fully ranked data can be extended in a similar manner to form the analogous prior distribution on partially ranked data. Since the coset classes are bounded non-empty subsets of S_n/S_{n−k}, all the metrics defined on partially ranked data can be extended to form the corresponding induced Hausdorff metrics on the coset classes [see Section 4.1]. Then, following the same argument as in the case of fully ranked data, if C_{π*}ᵖ is our choice of the modal coset class, the prior distribution on any coset class C_πᵖ is given by:

P(C_πᵖ) ∝ e^{−λ* d_p*(C_πᵖ, C_{π*}ᵖ)},

where d_p*(·,·) is the induced metric on coset classes. Assuming that all partial rankings in a coset class have the same prior probability, this induces a prior distribution on partially ranked data given by:

P(πᵖ) = e^{−λ* d_p*(C_πᵖ, C_{π*}ᵖ)} / Σ_j e^{−λ* d_p*(C_{π_j}ᵖ, C_{π*}ᵖ)}.

The full conditional densities can be calculated as in the previous case and the Gibbs sampler algorithm developed in a similar manner to simulate from the posterior densities of π₀ᵖ and λ.
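The coset-class structure is simple to enumerate for small n and k. The sketch below is ours, not from the paper: it identifies each partial ranking with the ordered tuple of its first k choices, so a coset class is the set of partial rankings sharing the same underlying k-item set. For n = 4, k = 2 there are 4!/2! = 12 partial rankings falling into C(4, 2) = 6 coset classes of k! = 2 elements each, as claimed above.

```python
from itertools import permutations
from collections import defaultdict

n, k = 4, 2

# A partial ranking of k out of n items <-> the ordered tuple of the
# first k choices; this labels the right cosets of S_{n-k}.
partial_rankings = {p[:k] for p in permutations(range(1, n + 1))}

# Coset classes: partial rankings that rank the same set of k items.
coset_classes = defaultdict(list)
for pr in partial_rankings:
    coset_classes[frozenset(pr)].append(pr)

print(len(partial_rankings))                     # n!/(n-k)! = 12
print(len(coset_classes))                        # C(n, k) = 6
print({len(v) for v in coset_classes.values()})  # every class has k! = 2 elements
```

Since every coset class has the same size, the class-dependent scalars β_j (rather than class sizes) must carry the prior information here, exactly as argued above.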

6 Illustrative Analyses

The first example discussed here is a simulated example: the motivation is to illustrate the posterior distributions obtained under different prior/metric combinations. The second example provides a comparative illustration with the data set analyzed by Fligner and Verducci (1990).

Example 1: Let us illustrate the conjugacy class prior on S₄ with a simulated example. With the modal ranking π₀ = (2143) and λ = 0.4, samples of size 80 were generated according to the Mallows model using four different metrics, and for different prior settings the posterior probabilities of all the rankings were computed.

β     Kendall's τ      Footrule         Hamming          Spearman's ρ
0     0.9901 (2143)    0.9999 (2143)    0.9997 (2143)    0.9952 (2143)
0.5   0.9569 (2143)    0.9999 (2143)    0.9987 (2143)    0.9789 (2143)
1     0.8321 (2143)    0.9998 (2143)    0.9940 (2143)    0.9119 (1243)
1.5   0.5250 (1243)    0.9991 (2143)    0.9717 (2143)    0.6973 (1243)
5     0.4287 (3142)    0.5824 (1243)    0.9581 (4132)    0.3444 (4132)
10    0.6026 (3241)    0.7700 (4132)    0.9645 (1243)    0.4399 (4132)

The conjugacy class containing the true modal ranking, (2143), corresponds to the partition (2,2) and contains only 3 elements. When β = 0, the prior is uniform over all rankings, hence the ranking with the highest posterior probability is the true modal ranking under all four metrics. Changing β to 0.5 gives similar results. When β = 1.0, the highest posterior probability is given to the ranking (1243) by one of the metrics; this ranking belongs to the conjugacy class corresponding to the partition (1,1,2), which has a larger number of elements (6). When β = 5, all metrics give the highest posterior probability to some ranking other than the true modal ranking, belonging to conjugacy classes that have larger numbers (6 or 8) of elements. Thus, the choice of β can be used to define the strength of our belief in the prior distribution.

Using the same data but the metric-based prior, the analysis was repeated using the Hausdorff metric induced by Hamming distance. Two different choices of π* were used: π* = (2143), the true modal ranking, and π* = (2134). For six different choices of λ*, the modal rankings in the posterior models are tabulated below along with their probabilities.

λ*    π* = (2143)      π* = (2134)
0     0.9998 (2143)    0.9998 (2143)
0.5   0.9999 (2143)    0.9996 (2143)
1     0.9999 (2143)    0.9989 (2143)
1.5   0.9999 (2143)    0.9973 (2143)
5     1.0000 (2143)    0.6228 (2134)
10    1.0000 (2143)    0.8320 (2134)

So for the correct choice of π*, the posterior model puts most of the mass at the true modal ranking, whereas with the incorrect choice of π*, on increasing the value of λ*, maximum posterior probability is given to this incorrect ranking.

Example 2: This example is taken from Fligner and Verducci (1990), in which the Graduate Record Examination Board sampled 98 college students who were asked to rank five words according to the strength of association with a target word. For the target word "idea", the five choices were (A) thought, (B) play, (C)

theory, (D) dream, and (E) attention. They fit both the Mallows and the generalized Mallows model to the data. For the Mallows model, they assume a uniform prior for the modal ranking π₀ and an independent conjugate prior for the scale parameter λ. The conjugacy class prior for the Mallows model with β = 0 and Kendall's τ metric was used to analyze these data, and a Gibbs sampling algorithm was used to obtain posterior estimates for the rankings. The results are tabulated below along with the results from the previous analyses. The first 7 rankings with the highest observed frequencies are listed along with their mean posterior probabilities under the two methods. The numbers in parentheses in the third and fourth columns denote the ranks of the corresponding permutations in the posterior models.

Obs. ranking    Freq (prop.)    Mallows (unif)    Mallows (conj. class)
15234           33 (.337)       .032 (1)          .8658 (1)
15324           18 (.184)       .026 (2)          .1039 (2)
14235           12 (.122)       .022 (3)          .0220 (3)
14325           8 (.082)        .019 (4)          .0032 (4)
15243           6 (.061)        .019 (5)          .0021 (5)
25134           5 (.051)        .019 (6)          .0021 (6)
15423           5 (.051)        .016 (7)          .0003 (7)

The posterior mean for λ is 0.083. So our method not only picks the correct modal ranking and most of the subsequent rankings, but the posterior probability of the modal ranking is also much higher than that obtained by the previous analysis, reflecting its high proportion in the observed data.
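The sensitivity to β seen in Example 1 can be traced directly to how the conjugacy class prior reallocates mass across the classes of S₄. The sketch below is ours, not from the paper: it computes the total prior mass under P(π) ∝ e^{βc(π)} carried by each class for a few values of β. As β grows, the classes with 6 and 8 elements absorb essentially all the prior mass, which is why large β can pull the posterior mode away from (2143), whose class has only 3 elements.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def cycle_type(perm):
    # Sorted cycle lengths of a permutation in one-line notation.
    k, seen, lengths = len(perm), set(), []
    for start in range(1, k + 1):
        if start not in seen:
            n, x = 0, start
            while x not in seen:
                seen.add(x)
                x = perm[x - 1]
                n += 1
            lengths.append(n)
    return tuple(sorted(lengths))

perms = list(permutations(range(1, 5)))
sizes = Counter(cycle_type(p) for p in perms)   # class -> c(pi)

for beta in (0.0, 1.0, 5.0):
    weights = {p: math.exp(beta * sizes[cycle_type(p)]) for p in perms}
    total = sum(weights.values())
    mass = defaultdict(float)
    for p, w in weights.items():
        mass[cycle_type(p)] += w / total
    print(beta, {c: round(m, 3) for c, m in sorted(mass.items())})
```

At β = 0 each class's share is just its size divided by 24 (so the (2,2) class gets 3/24), while at β = 5 the 8-element class (1,3) dominates the prior almost entirely.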

7 Discussion

In this paper, we discussed the concept of conjugacy classes in permutation groups and introduced the use of these classes to define two types of prior distributions on metric-based ranking models. The conjugacy class prior is useful when we do not have any prior knowledge of what the modal ranking could be, whereas the metric-based prior is useful when we have some idea about what the most popular ranking is. Each of these priors can control the strength of our prior belief through the scale parameters β and λ* respectively.

References

Berry, D.A. (1979), "Detecting Trends in the Arrangements of Ordered Objects: A Likelihood Approach," Scandinavian Journal of Statistics, 6, 169-174.

Critchlow, D.E. (1985), Metric Methods for Analyzing Partially Ranked Data, New York: Springer-Verlag.

Diaconis, P. (1987), Group Representations in Probability and Statistics, Hayward, CA: Institute of Mathematical Statistics.

Diaconis, P. (1988), "A Generalization of Spectral Analysis with Applications to Ranked Data," Annals of Statistics, 17, 949-979.

Feigin, P., and Cohen, A. (1978), "On a Model for Concordance Between Judges," Journal of the Royal Statistical Society, Ser. B, 40, 203-213.

Fienberg, S.E., and Larntz, K. (1976), "Log-Linear Representation for Paired and Multiple Comparison Models," Biometrika, 63, 345-354.

Fligner, M.A., and Verducci, J.S. (1986), "Distance Based Ranking Models," Journal of the Royal Statistical Society, Ser. B, 48, 859-869.

Fligner, M.A., and Verducci, J.S. (1988), "Multistage Ranking Models," Journal of the American Statistical Association, 83, 892-901.

Fligner, M.A., and Verducci, J.S. (1990), "Posterior Probabilities for a Consensus Ordering," Psychometrika, 55, No. 1, 53-63.

Gordon, A.D. (1979), "A Measure of Agreement Between Rankings," Biometrika, 66, 7-15.

Henery, R.J. (1981), "Permutation Probabilities as Models for Horse Races," Journal of the Royal Statistical Society, Ser. B, 43, 86-91.

Kuratowski, K. (1966), Topology: Volume I, New York: Academic Press.

Luce, R.D. (1959), Individual Choice Behavior, New York: John Wiley.

MacKay, D.B., and Chaiy, S. (1982), "Parametric Estimation for the Thurstone Case III Model," Psychometrika, 47, 353-359.

Mallows, C.L. (1957), "Non Null Ranking Models I," Biometrika, 44, 114-130.

Michael, E. (1951), "Topologies on Spaces of Subsets," Transactions of the American Mathematical Society, 71, 152-182.

Mosteller, F. (1951), "Remarks on the Method of Paired Comparisons, I: The Least Squares Solution Assuming Equal Standard Deviations and Equal Correlations," Psychometrika, 16, 3-9.

Plackett, R.L. (1975), "The Analysis of Permutations," Applied Statistics, 24, 193-202.

Schulman, R.S. (1979), "Ordinal Data: An Alternative Distribution," Psychometrika, 44, 3-20.

Serre, J.P. (1977), Linear Representations of Finite Groups, New York: Springer-Verlag.

Tallis, G.M., and Dansie, B.R. (1983), "An Alternative Approach to the Analysis of Permutations," Applied Statistics, 32, 110-114.

Thurstone, L.L. (1927), "A Law of Comparative Judgment," Psychological Review, 34, 273-286.