JUDGMENTS OF PROBABILITY AND UTILITY FOR DECISION-MAKING
First Annual Report
30 September 1971
Contract Number N00014-67-A-0181-0034
NR 197-014
Cameron R. Peterson
Engineering Psychology Laboratory
The University of Michigan
Ann Arbor, Michigan
Prepared for:
The Department of the Navy
Office of Naval Research
Arlington, Virginia
Attention: Dr. Martin A. Tolcott
Director
Engineering Psychology Programs

DOCUMENT CONTROL DATA - R & D
1. ORIGINATING ACTIVITY: Department of Psychology, University of Michigan, Ann Arbor, Michigan 48105
2a. REPORT SECURITY CLASSIFICATION: Unclassified
3. REPORT TITLE: MAN-MACHINE DECISION MAKING: JUDGMENTS OF PROBABILITY AND UTILITY FOR DECISION MAKING
4. DESCRIPTIVE NOTES (Type of report and inclusive dates): Scientific - Annual Report for 12 months ending 30 September 1971
5. AUTHOR(S): Cameron R. Peterson
6. REPORT DATE: 30 September 1971
7a. TOTAL NO. OF PAGES: 41
7b. NO. OF REFS: 6
8a. CONTRACT OR GRANT NO.: N00014-67-A-0181-0034
8b. PROJECT NO.: NR 197-014
9a. ORIGINATOR'S REPORT NUMBER(S): none
9b. OTHER REPORT NO(S): none
10. DISTRIBUTION STATEMENT: This document has been approved for release and sale; its distribution is unlimited.
11. SUPPLEMENTARY NOTES: none
12. SPONSORING MILITARY ACTIVITY: Engineering Psychology Programs, Office of Naval Research
13. ABSTRACT
This annual report describes research intended to develop procedures
for eliciting judgments of probability and utility that could be employed
efficiently in a decision theoretic analysis. Experiments on probability
estimation investigated (1) the reinforcing effects of a proper scoring
rule upon probability estimates, (2) the use of Bayesian procedures to
revise probability estimates, and (3) the relative merits of probabilities
and odds as response modes. Research on the decomposition of utility
estimates showed that (1) the model of a weighted linear average is
relatively insensitive to nonadditive combination rules but highly sensitive
to the nonlinearity of utility functions, (2) when utilities are estimated
it makes relatively little difference whether or not the judgments are
decomposed, and (3) the procedures of decomposing utility estimates were
feasible for use on a real-world problem where water-quality engineers rather
than college students served as subjects. Finally, a study is described
which illustrated how the probability-revision procedures work on an actual
problem of submarine surveillance.

14. KEY WORDS:
decision making
probability
utility
Bayes's theorem
proper scoring rules
multiattribute utility

TABLE OF CONTENTS
INTRODUCTION
PROBABILITY
Validation of Probability Judgments
Use of Scoring Rules to Calibrate Probability Estimators
Bayesian Procedures to Revise Probabilities
Probability Versus Odds Estimates
UTILITY
Validation of Utility Estimates
Robustness of a Linear Model —A Simulation
Experiment Decomposing Multidimensional Utilities
Field Research on Multidimensional Utilities
SUBMARINE SURVEILLANCE
REFERENCES

Introduction
This is a report of research supported by the Engineering Psychology
Programs, Office of Naval Research, and conducted under ONR Contract
Number N00014-67-A-0181-0034 during the period from October 1, 1970
to September 30, 1971.
The purpose of the psychological research was to develop procedures
for eliciting judgments of probability and utility that could be employed
efficiently in decision theoretic analyses. Decision theory is a tool
that is becoming increasingly popular in the field of business for making
decisions about such things as building a new plant or introducing a new
product. The basic approach is that a complicated decision is divided
into its component parts. These components are evaluated and then
arithmetic prescribed by decision theory is used to aggregate the evaluations into a final recommendation for a decision.
This approach offers promise for naval decisions as well as business
decisions, but there is a serious stumbling block in that the evaluations
for naval decisions are typically more complicated than for business
decisions. The goal of naval decisions is not merely to maximize the
expected rate of return; many non-monetary factors are important as well.
Consequently, the evaluations which would serve as the input to a decision
theoretic analysis of naval decisions must be highly subjective. It is the
goal of the research being conducted under the current contract to develop
psychological procedures for eliciting subjective inputs to decision analyses.
The inputs to a decision analysis are of two kinds; they serve to
answer the following question: "What are the stakes and what are the odds?"

The first kind of input is called a utility. It is a number that
measures the relative degree of attractiveness of a consequence of an
action. The second measure is a probability, a number referring to the
likelihood that the consequence will result if a decision is taken.
The first section of the report describes three sets of experiments
on the problem of eliciting probability estimates; the second section
describes a simulation and two experiments on the problem of eliciting
utility judgments; and the third section describes attempts to test
the feasibility of research on probability estimation in an operational
naval environment, on the problem of submarine surveillance.

Probability
Nearly everyone understands that a probability is a number that refers to
the likelihood of occurrence. The problem is that, with the exception of the
forecasting of precipitation probabilities by weather forecasters, people have
relatively little practice in attaching numerical estimates to their opinions
about whether or not events will occur. Consequently, a major portion of the
research conducted during the past year was aimed at finding procedures for
eliciting probability estimates. This is necessary because such numerical judgments form one of the natural inputs to decision-making systems.
Validation of Probability Judgments
Research on how to elicit good judgments of probabilities presumes that
the experimenter can identify a good judgment when he sees one. For probability
judgments, researchers have used four criteria: optimality, accuracy, agreement
with relative frequency, and consistency. The most stringent criterion, optimality,
requires that the experimenter know the optimal probabilities. This is typically
accomplished in experiments by using physical processes that produce equally
likely elementary events, such as rolling a fair die, to generate events. There is
currently some debate about which scale to use for measuring the discrepancy
between judged and optimal probabilities, but a log odds scale has desirable
properties and is pretty well accepted for many situations.
The second criterion of a probability distribution, accuracy, can be measured
by scoring rules, even without knowing the corresponding optimal probabilities, if
it is possible to find out which of the alternative events turns out to be true.
For example, a weather forecaster who estimates the probability of rain has no
way of finding the optimal probability, but can find out whether or not it does

indeed rain. A high probability of rain is good if it does rain and a low one if
it does not. Scoring rules essentially rate probability distributions as accurate
to the extent that they pile up a lot of the probability on the event that turns
out to be true. A scoring rule is said to be "proper" when the expected score can
be maximized only by setting the judged probabilities equal to the corresponding
optimal probabilities, when optimal probabilities are defined, or to observed
relative frequencies, in situations in which it is possible to collect enough
observations so that relative frequencies are good estimators of probabilities.
It is therefore also true that the subjectively expected score can be maximized
only by setting the judged probabilities equal to the corresponding subjective
probabilities.
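The logic of a proper scoring rule can be illustrated with a short computational sketch. The quadratic (Brier-type) rule below is a standard example of a proper rule, not necessarily the rule designed for the experiments reported here, and the numbers are purely illustrative.

    # A quadratic (Brier-type) scoring rule: higher scores are better.
    def quadratic_score(reported_p, event_occurred):
        outcome = 1.0 if event_occurred else 0.0
        return 1.0 - (outcome - reported_p) ** 2

    # Expected score when the event truly occurs with probability true_p.
    def expected_score(reported_p, true_p):
        return (true_p * quadratic_score(reported_p, True)
                + (1.0 - true_p) * quadratic_score(reported_p, False))

    # The expected score is maximized by honest reporting: with a true
    # probability of 0.7, the best report is 0.7 itself.
    candidates = [i / 100 for i in range(101)]
    print(max(candidates, key=lambda r: expected_score(r, 0.7)))  # 0.7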
The third criterion requires that a distribution of judged probabilities
agrees with the corresponding distribution of relative frequencies. This criterion
is useful when a subject estimates probabilities of repeatable events, and the
experimenter is able to measure the relative frequencies. It has been used
extensively in experiments on learning, where there is interest in the degree to
which estimated probabilities come to match the relative frequencies. They
typically come very close indeed (Peterson & Beach, 1967). (Note that even in
frequentistic situations this criterion is different from the accuracy criterion.
The reasons why are subtle; it would lead the discussion too far afield to go
into them here.)
A set of probabilities meets the fourth criterion, consistency, if it
obeys the rules of probability theory. Optimal probabilities are consistent, but
consistent probabilities are often not optimal. Therefore, even though there is a strong tendency for
probability judgments to be internally consistent (Beach, 1966; Peterson, Ulehla,
Miller, Bourne, & Stilson, 1965), the criterion of consistency will be studied in
the proposed research only as it relates to optimality or accuracy.

Use of Scoring Rules to Calibrate Probability Estimators
The use of quantitative judgmental inputs to a decision analysis requires
that the person making the judgments be well-calibrated with respect to what
the quantities mean. Thus, if a person is making a probability estimate he must
have a good, intuitive appreciation of the probability or the odds scale upon
which he is estimating. If he is making a value judgment he must understand the
value scale that he is using. With respect to probabilities, research shows
that some people are well-calibrated and others are not. Research shows, for
example, that across all days for which a weather forecaster has estimated, say, a
20% chance of rain, it turns out to rain nearly 20% of the time; and when he
estimates a 60% chance of rain it turns out to rain approximately 60% of the time.
This quality of calibration, however, does not hold for college students.
For example, I conducted an experiment several years ago in which each subject's
task was to estimate whether or not a weak signal was contained in a noisy
background. The subjects responded by estimating the probability that a signal
was present in the noisy background. A smooth function could be used to describe
the relation between percentage of trials containing signals and estimated
probability of signal for each subject, but that function was not the identity
line. When one subject estimated a 60% chance of signal it turned out that a
signal was present about 80% of the time and when he estimated a 40% chance of
signal, it was actually present only 20% of the time. That subject was conservative
in his judgments about how much he had learned from his observations about whether
or not a signal was contained in the noisy background. Across all trials on
which he said that he had learned enough to do a 60:40 job of separating signals
from noise, he had actually learned enough to do an 80:20 job. He had actually
learned more about whether or not a signal was contained in the noise than he
said he had learned. If such "hedging" were to occur in decision analyses of Naval

problems, the analyses would certainly turn out to be suboptimal.
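A calibration check of this kind can be computed directly by binning a subject's estimates and comparing each bin with the observed relative frequency. The following sketch uses invented data chosen to reproduce the conservative pattern just described; it is illustrative only.

    from collections import defaultdict

    # Invented trials: each pairs a probability estimate with whether a
    # signal was actually present.
    estimates      = [0.6, 0.6, 0.6, 0.6, 0.6, 0.4, 0.4, 0.4, 0.4, 0.4]
    signal_present = [True, True, True, True, False,
                      True, False, False, False, False]

    bins = defaultdict(list)
    for p, present in zip(estimates, signal_present):
        bins[p].append(present)

    for p in sorted(bins):
        freq = sum(bins[p]) / len(bins[p])
        print(f"estimated {p:.2f} -> signal present {freq:.2f} of the time")
    # Prints 0.40 -> 0.20 and 0.60 -> 0.80: the estimates hug the middle of
    # the scale although the subject has learned enough for an 80:20 job.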
Why are expert weather forecasters better calibrated than college students?
There are probably several reasons, but one possibility is that weather forecasters have had considerable experience having their precipitation probabilities
evaluated with a proper scoring rule. On days that it actually turns out to rain,
a precipitation probability of 60% receives a higher score than does a precipitation
probability of 40%. But how much better is the 60% estimate? Proper scoring rules
have been developed to answer that question. A proper scoring rule has the
property that a weather forecaster can maximize his expected score when estimating
a precipitation probability only by estimating his true subjective probability.
That is, if a weather forecaster expects that there is a 70% chance of rain
then he will maximize his expected proper score only by estimating 70%. If a
scoring rule is not proper it may be possible to receive a higher expected score
by estimating a probability other than the true subjective probability.
Under the hypothesis that experience with a proper scoring rule was one
of the factors that led to the better calibration on the part of weather forecasters, we designed the following experiment. College students were presented
a long string of questions that were taken from encyclopedias, almanacs, and
general knowledge examinations. An example of such a question is: "Is there
a higher per capita income in Washington, D. C. or in the State of California?"
The subject's task was to indicate which of the two answers he thought was more
likely to be correct and then to estimate just how likely in terms of odds. For
example, a subject may say that he thinks that California has the higher per
capita income and that he estimates odds of 3:1 that he is correct.
There were two groups of subjects, experimental and control. During
Stage One, the pretest, all subjects answered fifty questions like the one above

without receiving any kind of feedback about the right answer. Then, in Stage
Two for the experimental group, subjects answered another group of seventy-five
questions, and after each response they received feedback both in terms of which
answer was right and also the proper score associated with the response given.
The score was determined by the proper scoring rule designed for this experiment.
In the control group subjects in Stage Two merely received feedback about which
answer was correct; they were not informed that such a thing as a score existed.
For Stage Three, all subjects took a post-test comprised of the same questions as
the pre-test. That is, they once again gave odds estimates about questions asked
in Stage One and once again they received no feedback.
The primary thing that distinguished the experimental group from the control
group was that subjects in the experimental group received feedback that was
generated by a proper scoring rule. Thus, if experience with scoring
rules serves to calibrate subjects with respect to odds estimates, then, from
Stage One to Stage Three, odds estimates made by subjects in the experimental group should
improve more than the corresponding odds estimates made by subjects in the
control group. Any greater improvement in the experimental group than in the
control group should be attributable to recalibration by means of feedback
generated through the use of scoring rules.
The results of the experiment support the experimental hypothesis. The proper
scoring rule that was used in the training portion for the experimental group was
also used to evaluate the quality of the odds estimates for both the pre-test
given in Stage One and the post-test given in Stage Three. Of the twelve subjects
in the control group, six improved from the pretest to the post-test and the
scores for the other six became worse. This is essentially random performance,
showing no more propensity to improve than not to improve. In the experimental

group, however, 10 of the 12 subjects improved and, on the average, the degree
of improvement was rather substantial. Additional analyses of these results
continue.
Encouraged by the results of this experiment, we administered the experimental
procedure to 15 different intelligence analysts in different agencies in
Washington, D.C., where probability forecasts are beginning to be used. Results
showed that the estimates for 12 of the 15 analysts improved from the pre-test
to the post-test. Furthermore, the average amount of improvement was even more
substantial than obtained for the college students. In addition to the quantitative
results, several of the analysts volunteered comments to the effect that their
intuitive "feelings" for the meaning of probability and odds changed as the result
of the experience. Most analysts seemed to become more conservative in their
estimates; they learned that it is costly to make estimates in the range of 50:1
without being very sure of the correct answer. On the other hand, at least one
analyst who had a reputation for being cautious in probability estimates seems
to have been making more extreme estimates since participating in the experiment.
The scoring rule seems to be useful, at least for correcting extreme miscalibration.
We are conducting additional analyses in an attempt to understand the mechanism
by which experience with a scoring rule improves calibration. When the analyses
are complete we will write up the results and submit them to an appropriate
scientific journal.
At the same time, we are beginning to develop a more streamlined version of
the scoring rule test that can be self-administered to anyone who faces the task
of estimating probabilities or odds. It can also be taken by people who are in
a position to interpret probabilities estimated by someone else.

Bayesian Procedures to Revise Probabilities
In many practical situations, diagnosis about what the environment is like
should precede deciding what to do about the environment. But unaided human
intuition is subject to large and systematic biases in the process of diagnosis.
When a person estimates probabilities about which of a pair of hypotheses is true,
and then revises his estimate in the light of new information, his revision is
usually conservative. The revision is too small when compared with the optimal
rule, Bayes's Theorem, for the revision of opinion.
This conservative bias obviously has important implications for any application
of decision theory in which probability estimates must be updated. The biasing
of inputs results in poor final decisions. Professor Ward Edwards at the University
of Michigan has developed a set of procedures designed to reduce the conservative
bias in updating opinion. He calls these procedures a probability information
processing (PIP) system. It is based upon the assumption that people err in
revising probability estimates because they misaggregate the impact of several
data. A PIP system requires a person to estimate the diagnostic value of each
individual datum (a likelihood ratio) and then assigns to a computer the task
of aggregating across the individual likelihood ratios by means of Bayes's Theorem.
Several experiments have shown that the degree of conservatism can be substantially
reduced by the use of a PIP system.
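The machine side of a PIP system is a one-line application of Bayes's Theorem in odds form. The sketch below, with invented numbers, shows the aggregation step: posterior odds are the prior odds multiplied by the individually judged likelihood ratios.

    from functools import reduce

    def posterior_odds(prior_odds, likelihood_ratios):
        # Bayes's Theorem in odds form: multiply the prior odds by each
        # judged likelihood ratio in turn.
        return reduce(lambda odds, lr: odds * lr, likelihood_ratios, prior_odds)

    # Four hypothetical likelihood-ratio estimates favoring H1 over H2:
    print(posterior_odds(1.0, [2.0, 3.0, 0.5, 4.0]))  # 12.0, i.e., 12:1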
The second experiment conducted during the last year of this contract was
designed to investigate alternative procedures for using a PIP system. Some of
our current on-line research with Naval officers is concerned with the problem of
ship surveillance, so it is appropriate that the general task faced by subjects
in this experiment was one of trying to figure out the destination of a merchant
ship. There were two possibilities. The ship was either going to Port A, which
was relatively near and along a coastal route or the ship was going to Port B,

which was much further away and along an open-sea route. The experimenter
displayed items of information upon which the subject was to base his judgment
about the destination of the ship. These items of information included the amount
of fuel taken on, the age of the ship in years, the size of the ship in terms of
capacity for cargo and the percentage of capacity used on this trip. These data
were related in obvious ways to the two ports and this relation was displayed to
subjects by means of frequency distributions. For example, a display contained
information about the amount of fuel that had been taken on by each of the last 100
ships that had sailed to Port A and it also contained the same kind of information
about each of the last 100 ships that had sailed to Port B. On the average, of
course, more fuel was taken on in the past by ships that were sailing to the more
distant port.
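A likelihood ratio can be read off such displays by comparing the two relative frequencies for the observed datum. The counts below are invented for illustration; they are not the distributions actually shown to subjects.

    # Of the last 100 ships bound for each port, how many took on a given
    # amount of fuel (invented counts):
    fuel_counts_port_a = {"low": 60, "medium": 30, "high": 10}  # near port
    fuel_counts_port_b = {"low": 10, "medium": 30, "high": 60}  # distant port

    def likelihood_ratio(fuel_level):
        # P(datum | bound for Port B) / P(datum | bound for Port A)
        return ((fuel_counts_port_b[fuel_level] / 100)
                / (fuel_counts_port_a[fuel_level] / 100))

    print(likelihood_ratio("high"))  # 6.0: heavy fueling favors the far port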
Each subject made judgments about the destination of 26 different ships and
he based his estimates on the four kinds of information presented in a counterbalanced sequence. That is, sometimes the subject learned about the age of the
ship first and in other conditions he learned about amount of fuel first. The
six different experimental conditions referred to six different procedures that
were used to elicit judgments that could be used to calculate odds estimates about
the destination of each ship. Those conditions are described below.
1. Unaided odds estimates. —Since this experiment employed judgments about
ship destination, it was different from tasks used in previous experiments. So the
goal of the first two conditions was to measure the degree to which results from
previous experiments (which typically involved abstract data-generating
processes such as the sampling of colored chips from a bag) could be generalized
to the present task. The first condition employed direct odds estimates for
which the subject simply wrote down which of the two ports he thought was the more
likely destination of the ship; then he wrote down his revision of that odds estimate

after observing each of the four items of information. This is the kind of
procedure that has typically produced conservative revisions, and we anticipate
that the same will happen here.
2. Direct likelihood-ratio estimation. —A second condition that has been
used frequently in past research on PIP systems requires the subject to estimate
a likelihood-ratio upon observing each datum. Then after the experiment is complete
the experimenter multiplies the likelihood-ratios together, as prescribed by
Bayes's Theorem, in order to calculate the posterior odds with respect to ship
destination. If the ship is rather new the likelihood-ratio should favor the
distant, open-sea port and if the ship takes on only a small amount of fuel the
likelihood-ratio should favor the near port. Previous research had shown that
this procedure, using direct likelihood-ratio estimates as inputs to Bayes's Theorem,
substantially reduces conservatism in the revision of odds estimates. That is the
result we expect with this task.
3. Likelihood-ratio estimation with posterior odds feedback. —On-line research
that we have recently conducted with intelligence analysts suggests that the direct
estimation of likelihood-ratios as used in Condition Two will be unacceptable in
practice. The reason is that it insulates the likelihood-ratio estimator from
posterior odds. Therefore, after estimating a sequence of likelihood-ratios
the analyst does not really know whether "system opinion" favors Port A or Port B
but it is typically the analyst, who has specialized knowledge, who must
defend that system opinion to his superiors and to operational officers. Accordingly, I doubt that the procedure of direct likelihood-ratio estimation will ever
be used in practice, even though it may serve to reduce the conservative bias.
A natural extension of direct likelihood estimation is to display the posterior
odds to the estimator. That is, each time the likelihood-ratio estimator makes an
estimate, the implied posterior odds are calculated and then displayed. At least one

previous experiment conducted by Ward Edwards and his associates has shown that
such posterior odds feedback cuts down on the efficiency of a PIP system in
reducing conservatism. A serious difficulty with that experiment, however, was
that there was no way of calculating the optimal posterior odds. It is therefore
possible that the introduction of posterior odds feedback after the estimation of
likelihood-ratios reduced the magnitude of the posterior odds but not in a manner
that was suboptimal. The current experiment will permit us to make that test. That
is because the experimenter will control the process that generates the data about
amount of fuel taken on, age of ship, and so on.
We expect the posterior odds resulting from the procedure in Condition Three
will be less extreme than those of Condition Two, and we expect this "hedging"
to be in a suboptimal direction, but the question is sufficiently important to
warrant an experimental answer.
4. Likelihood-ratio estimation on a logarithmic scale with consistency checks. —
A potential benefit of using a Bayesian procedure is that the availability of
individual likelihood-ratios permits the estimator to check the internal consistency
of the likelihood-ratios. For example, after observing the sequence of data
does the likelihood-ratio estimator really believe that the datum with the largest
likelihood-ratio was the most diagnostic? Since Bayes's Theorem is multiplicative
(posterior odds equal the product of the likelihood-ratio and prior odds), the
appropriate measure of the diagnostic value of a datum is the log of the
likelihood-ratio. For example, if the log of the likelihood-ratio for a datum is
four times as great as the log of another likelihood-ratio, the first datum is
four times as diagnostic as the second; it will take exactly four data like the
second one to move the odds estimate as far as the single first datum.
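This additivity is easy to verify with invented numbers: a datum whose log likelihood ratio is four times another's moves the odds exactly as far as four of the weaker data.

    import math

    strong_lr = 16.0   # log2(16) = 4
    weak_lr = 2.0      # log2(2)  = 1
    print(math.log2(strong_lr) / math.log2(weak_lr))  # 4.0

    prior_odds = 1.0
    print(prior_odds * strong_lr)     # 16.0 after the single strong datum
    print(prior_odds * weak_lr ** 4)  # 16.0 after four weak data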
The fourth condition was designed to exploit the possibility of consistency

checks. Does the procedure of requiring the likelihood-ratio estimator to
make consistency checks among the logs of his likelihood-ratio estimates
actually yield more optimal posterior odds?
These checks were made possible in the following manner. Within each
sequence of data for Condition Four, the subjects drew four vertical lines on
a logarithmic likelihood-ratio scale. After observing all four data, the likelihood-ratio estimator was asked to check the consistency among his estimates.
5. Odds and probability estimates on a log-odds scale. — The fifth
experimental condition was developed by the principal investigator and Mr. Clint
Kelly while conducting on-line Bayesian experimentation with intelligence
analysts in Washington, D. C. Several events combined to indicate that the
use of direct likelihood-ratio estimates would be unacceptable in practice. The first
was mentioned above; it will in general be necessary for the expert who estimates
likelihood-ratios to know what the posterior odds are in order to defend those
odds. In addition, although analysts find it reasonable to estimate likelihoodratios under some conditions, they find it difficult or awkward in other conditions.
For example, when the hypotheses about which the odds are being estimated have an
apparent causal effect upon the datum observed, analysts find it more reasonable
to estimate likelihood-ratios than when the reverse is true, that is, when the data
seem to cause the hypotheses. For example, assume that an analyst is interested in
whether or not King Hussein will remain in power for another year and the relevant
datum is that there is an uprising by the fedayeen. The datum, the uprising,
could have the ultimate effect of toppling Hussein from power, and that is the kind
of situation in which analysts find it difficult to estimate likelihood-ratios.
Finally, analysts seem to find it more difficult to estimate likelihood-ratios
when the datum is the nonoccurrence of an event rather than its occurrence. It is
apparently psychologically easier to reason through plausible chains of events

that could result in the occurrence of an event rather than a chain that could
result in a nonoccurrence. It is this kind of difficulty with Bayesian procedures
in on-line research that led to the development of the procedure used in Condition
Five.
This procedure requires that the analyst work with a combination of odds and
likelihood-ratios. It is perhaps best understood by observing the response sheet
on the following page, a response sheet that has been used extensively with intelligence analysts. The vertical axis of this response sheet indicates the odds on
a logarithmic scale. The horizontal axis indicates the day of the month. The
heavy line that traces a path through these coordinates is an example of how an
analyst might estimate odds comparing the likelihoods of a pair of hypotheses
as a function of events that occur on the indicated days. In this case, assume
that an analyst's odds start out on the first of the month at 2:1 in favor of
H1 (say, that a ship is sailing to the Atlantic) rather than H2 (that it is
sailing to the Mediterranean). Then on the 3rd of the month an event occurs that
raises these odds to 4:1. On the 6th another event raises the odds to 8:1 and
on the 7th of the month a third event drops the odds back to 2:1. This example
yields the following interpretation: datum A and datum B have equal strength in
that both are associated with likelihood-ratios of 2:1. Datum C, on the other hand,
is as diagnostic as the combination of A and B. As an analyst moves his odds on
this response form he may respond both in terms of odds and likelihood-ratios.
The distance that he moves his odds as a result of a datum is a likelihood-ratio.
When he attends to where he ends up rather than how far he is moving he is responding
in odds. Intelligence analysts typically use both characteristics of this scale
when responding. They attend both to how far they move as a result of the datum
and to the resulting posterior odds.
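The dual reading of the chart can be made concrete with the example above: successive positions on the scale are odds, and the ratio of successive positions (the distance moved on the log scale) is each datum's likelihood ratio. A minimal sketch:

    odds_path = [2.0, 4.0, 8.0, 2.0]  # start, then after data A, B, and C

    # The distance moved for each datum, read as a ratio of successive
    # odds, is that datum's likelihood ratio.
    likelihood_ratios = [after / before
                         for before, after in zip(odds_path, odds_path[1:])]
    print(likelihood_ratios)  # [2.0, 2.0, 0.25]
    # A and B each carry a 2:1 likelihood ratio; C (1:4) is as diagnostic
    # as A and B combined, but in the opposite direction.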
Furthermore, after having processed several data it is possible for the analyst

to modify his estimates as a result of a consistency analysis. That is, he
can examine the relative distances moved for the different data items and then
adjust these movements for inconsistencies, in case he thinks that one datum
was more important even though he originally moved further for another. If a reanalysis indicates
to him that his original posterior odds were either too high or too low, it
is possible to modify that estimate, but only by making corresponding revisions
in movements along the way.
The illustrated form served to structure the response procedure for
condition 5. The subjects used essentially the same form except that the
horizontal axis referred to the datum number within a sequence rather than to
the day of the month. Each sequence, of course, contained four data. Previous
experience already indicated that analysts find this form acceptable for use.
The purpose of the experiment was to evaluate the degree to which the use of
this form could eliminate the conservatism which results from verbal odds
estimates. A hypothesis is that condition 5 will result in less conservatism
than condition 1 because the use of this chart requires the odds estimator
to consider the impact of each data item individually. To a degree, the use
of this chart reduces some of the requirements for intuitive aggregation inherent
in direct odds estimates as used in condition 1.
6. The log-odds chart with probabilities deleted. — Note that the
right hand side of the chart in Figure 1 indicates the probability in favor
of hypothesis 1 as implied by the corresponding odds that are displayed on
the left side of the chart. However, the use of bounded probabilities, as
displayed in Figure 1, may inhibit movement toward either extreme. In order
to test that hypothesis, subjects in condition 6 followed exactly the same

procedure as did subjects in condition 5, except that the probabilities were
omitted from the response sheet for subjects in condition 6. The experimental
hypothesis is that the deletion of the probability scale in condition 6 will
lead to estimates that are more extreme than corresponding estimates in
condition 5.
Data analysis. — Subject running is now complete for this rather complicated
experiment. A total of 40 subjects were run individually. Five subjects served
in each of conditions 1, 2, 3, and 6, and 10 subjects served in each of the
conditions 4 and 5 because those were the conditions of major experimental
interest. The subjects in the first three conditions participated for one to
one and a half hours, whereas subjects in conditions 4, 5, and 6 participated
for three to four hours each. Greater subject-running time was required
for the later conditions because of two factors. First of all, more training
was required in order for subjects to understand the rationale behind the
logarithmic response scale. In addition, subjects in the later three conditions
were required to make internal consistency checks of the estimated or implied
log likelihood ratios for each sequence of four data.
Analysis of the results has now begun. The major hypotheses about relative
degree of conservatism will be tested by converting each subject's estimates
to final posterior odds for each of the 26 sequences. These posterior odds
will then be compared with optimal odds and also with corresponding posterior
odds for other conditions. The effect of the consistency checks will be
analyzed by inferring log likelihood ratios for each trial and each subject, and
then correlating these log likelihood ratios with the corresponding optimal
values. We hypothesize that the process of consistency checks will generate
higher correlations and that procedures to decrease conservatism will lead to
higher regression slopes.
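The planned analysis can be sketched as follows, with invented numbers standing in for the experimental estimates. A correlation near 1 indicates internally consistent judgments, and a regression slope below 1 indicates conservatism.

    import math

    inferred_lr = [1.5, 2.2, 0.8, 3.0, 0.6]   # hypothetical judged ratios
    optimal_lr  = [2.0, 4.0, 0.5, 9.0, 0.25]  # ratios from the known generator

    x = [math.log(v) for v in optimal_lr]
    y = [math.log(v) for v in inferred_lr]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)

    print("slope:", sxy / sxx)                        # below 1: conservatism
    print("correlation:", sxy / math.sqrt(sxx * syy))  # near 1: consistency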

Probability Versus Odds Estimates
Previous research has shown that subjects are more conservative when they
revise probability estimates than when they revise odds estimates. That research,
however, used only two hypotheses, and an odds estimate, which is a ratio of
two probabilities, seems naturally suited for only two hypotheses. It was
important to learn, therefore, whether or not odds estimates will remain more
effective with more than two hypotheses. That was the goal of the experiment
to be described next.
The experiment made use of abstract stimuli; the subject's task was to
infer the proportion of red chips in an urn that contained red chips and blue
chips. The number of hypotheses were two, three and five, depending upon the
experimental condition. With two hypotheses, the proportion of red chips could
take on only two different values; with three hypotheses there were three possible
proportions; and with five hypotheses the proportion of red chips could take
on five different values. The task of the subject was to revise his estimate,
either a probability or an odds estimate, on the basis of observing the color
of chips sampled at random from the urn.
Because of large individual differences with this type of task, we used
a within-subject design in which each of the twelve subjects participated in
all three conditions. On some trials they made probability estimates and on
others they made odds estimates. Probability revisions were made by redistributing 100 washers among different troughs where each trough represented
one of the hypotheses. Odds estimates were made on a logarithmic scale; each
individual estimate compared the probabilities for a pair of hypotheses.
Results. — The results clearly demonstrated that the advantage associated
with odds estimates does not decrease as the number of hypotheses increases.

On the contrary, although odds estimates did not show much of an advantage
with two or three hypotheses, they were substantially better than probability
estimates when the number of hypotheses increased to five.
This experiment left us puzzled about why the odds estimates were not
substantially more efficient than the probability estimates with only two
hypotheses. But the experiment left no doubt about the fact that odds estimates are
substantially less conservative than probability estimates with several hypotheses,
and this is the basic question that we had set out to answer. Our next step
was to incorporate conditions 5 and 6 into the Bayesian experiment described in the
preceding section in order to evaluate the relative merits of a joint probability-odds scale vs. odds alone. As indicated, the results of that experiment are
now being analyzed.

Utility
One of the inputs to a decision analysis is a probability estimate
and the other is an estimate of value or utility. The primary purpose
of the decision-making system is to permit the choice of the best course
of action, the one with the highest expected utility. Several approaches
have been taken to measuring the utility or attractiveness of the consequences
of an action. The simplest is to equate utility with dollars. Bernoulli
understood problems with this approach as early as 1738 and argued that
utility should be a function of money rather than simply equal to money.
In this way he managed to capture the attitude of a decision maker toward
risk as well as toward money. But many decisions are sufficiently complicated
so that the possible outcomes of a course of action cannot be measured in
terms of only money and risk. Frequently, other attributes such as time,
effort, safety, and public opinion are also important. This problem has
led to a serious interest in the study of multiattribute utilities. The
basic problem is to discover the appropriate set of trade-off functions
that will collapse several dimensions of value into a single dimension.
Validation of utility estimates. —There has been much less psychological
research on eliciting utilities than on eliciting probabilities. The reason
for the lack of research on utility is that it is difficult to validate
estimates. A utility is usually considered to be a subjective quantity
that characterizes the person making the judgment. Accordingly, a judgment
of a utility is valid to the extent that it is close to the subjective
quantity that it describes. The continuing difficulty in coming up with
an independent measure of that subjective quantity has been the major
stumbling-block to empirical research.

But there are ways around this problem. Our approach so far has
been to postpone validation as long as possible. Suppose, for example,
that we are interested in two different procedures for eliciting utility
judgments. We wish to know which would be the best one to use in an
applied setting. So far, we have adopted the policy of comparing the
output of both procedures. If both outputs are the same, then a decision
about which procedure to use can be based upon such practical considerations
as convenience. Only when the two procedures differ with respect to
output is it important to find which of the two is better. Accordingly,
our present strategy has been to map out the kinds of situations in which
different procedures yield similar results and the situations in which
they yield different results; only when a difference exists will it be
necessary to find the independent measure of utility in order to learn
which procedure is best.
As will be shown below, our primary emphasis so far has been to compare
procedures that decompose the utility problem into a dimensional analysis
with procedures that require the subject to aggregate across the dimensions
intuitively. We have been finding a surprising degree of agreement between
the output of the two procedures.
When we finally get to the stage of validation, we plan to use the
concept of an organizational utility. The utility need not be restricted
to the person making the judgment. In applications of decision-making
systems, it is often the case that decisions should be made in such a way
that they maximize the expected utility accruing to an organization, rather
than the expected utility accruing to any individual within the organization.

This conception of a utility as a property of the organization makes it
external to the person estimating it, and therefore opens the path to
validation. The experimental procedure will be to create an imaginary
organization with an externally specified multidimensional value system,
to train subjects about that value system, and then to invite the subjects
to estimate utilities for multidimensional stimuli. The composite utility
estimates can then be evaluated by comparing them with corresponding values
for the organization.
Robustness of a Linear Model —A Simulation
A weighted linear average is the approach most frequently proposed
for decomposing multidimensional utilities. With this approach, values
along each dimension are multiplied by the relative weight of the dimension
and then the products are added to form a composite utility. This approach
requires the estimation only of weights for dimensions, not the utility
functions. This simple linear model may not mirror the corresponding value
system, but linear models frequently can account for much of the variance
in systems that are largely non-linear. For example, Yntema and Torgerson
in 1961 showed that the additive model could account for about 94% of
the variance in a system with a non-additive, multiplicative combination
rule.
Before completing the design of any experiment on the machine
aggregation of utilities, however, we conducted further investigation
of the degree to which different kinds of non-linear systems can be
described by a linear model. In order to make this test we created
four different kinds of two-dimensional value systems. One was linear
and additive; so the linear additive model fit perfectly. A second

was linear and multiplicative; it was similar to the one used by
Yntema and Torgerson. The third was non-linear but additive and the
fourth was non-linear and multiplicative.
With each type of value system, we used two variables, each of
which took on twenty values, and then measured the linear, multiple
correlation for the set of all four hundred possible pairs of these two
variables. The square of the multiple correlation denotes proportion
of the variance that can be explained by the linear, additive model.
For the two cases with non-linear value systems, the relations
between value and the underlying dimension were all monotonic. They
included logarithmic, exponential, and power functions (of the S-curved
type). In each case we started with functions that were nearly linear
and progressed toward functions that were highly non-linear.
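The structure of this simulation can be reconstructed in a few lines. The functional forms below are assumptions chosen to illustrate each case, not necessarily the exact generators used, but the qualitative pattern matches Table 1.

    import numpy as np

    levels = np.linspace(0.0, 1.0, 20)                       # 20 values each
    a, b = [g.ravel() for g in np.meshgrid(levels, levels)]  # all 400 pairs

    def linear_r2(y):
        # Proportion of variance explained by the best-fitting linear
        # additive model w0 + w1*a + w2*b.
        X = np.column_stack([np.ones_like(a), a, b])
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        return 1.0 - resid.var() / y.var()

    print(linear_r2(a + b))                          # 1.0: linear additive
    print(linear_r2((1 + a) * (1 + b)))              # near 1: linear multiplicative
    print(linear_r2(np.exp(8 * a) + np.exp(8 * b)))  # far lower: nonlinear additive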
Table 1 shows the results of this analysis. The columns refer to the four
different forms of value systems, and the rows refer to the degree of
non-linearity for the non-linear systems; the first row is the most linear
and the fifth row is the least linear.

Table 1
Goodness of fit between the linear additive model (measured by multiple R2)
and various value systems for the two-variable case

                      Type of Data Generator
Case      1. Linear-   2. Linear-         3. Nonlinear-   4. Nonlinear-
Number    Additive     Multiplicative     Additive        Multiplicative
1.        1.0          .87                .93             .80
2.        1.0          .92                .88             .80
3.        1.0          .89                .65             .52
4.        1.0          .95                .67             .64
5.        1.0          .97                .19             .07

These results indicate,
of course, that the linear, additive model accounts for all of the
variance in the linear additive system. Furthermore, replicating
the results of Yntema and Torgerson, the linear additive model accounts
for nearly all of the variance in the linear, multiplicative system as
shown in column two. It accounts for about 90% of the variance, on the
average. But column three, showing the results of a non-linear, additive
system, indicates that the amount of variance accounted for by the linear-additive model deteriorates markedly when the component value functions
become nonlinear. Progressing from the first to the fifth row, the amount
of the variance accounted for drops from 93% to 19%. Column four shows
the same result. Predictability decreases markedly as the component value
functions become nonlinear.
The implication of this analysis is simple and important. The linear,
additive combination rule can be applied successfully to a value system only
when the component value functions are highly linear. It makes little
difference whether the combination rule is additive or multiplicative, but
it makes a great deal of difference whether the component functions are
linear or nonlinear.
This result is surprising in light of the degree to which the
linear, additive model has been advertised as robust; and it was strong
enough to deter us from using the linear additive approach for decomposing
utility judgments in the experiments to be described below.

Experiment Decomposing Multidimensional Utilities
The purpose of this first laboratory experiment was to elicit
utilities for multidimensional stimuli by using two different approaches -
wholistic and decomposition - and then measure the agreement in resulting
utilities from the two different procedures. To the degree that the
outputs are alike, there is no need to worry about which approach is better.
For the wholistic approach, subjects estimated the utility of a stimulus
with a single number; for the decomposition approach, they estimated utility
functions and weights for the component dimensions.
The purpose of the experiment required familiar stimuli with many
value dimensions. After considering several kinds of stimuli, compact
cars were chosen. The subject's task was to judge the relative utility of
different cars. These cars varied along several dimensions, such as top
speed in terms of miles per hour, economy as measured by miles per gallon,
comfort as rated by expert judges, braking performance, judged handling
performance, and so on.
Two conditions differed in the amount of aggregation required. One
condition used compact cars that differed only in three dimensions, whereas
the condition requiring more aggregation used compact cars that differed
in nine dimensions. The assumption was that any difference between the two
procedures should increase with the amount of aggregation required.
The experiment employed three different response modes. The first
two involved only direct estimation; a dollar scale in one case and a value
scale ranging from 0 to 100 in the other. Thus, in the first condition
the subject evaluated each car in terms of dollars (presumably a dimension
with which he was familiar) and in the second case he evaluated the relative

values of the displayed compact cars on a 0 to 100 scale. A lottery was used
for the third response mode because of an increasing interest in the concept
of additive utilities. For example, would a subject gamble with a 50-50
chance of winning the most attractive car rather than the least attractive
car; or would he prefer having a moderately attractive car for sure?
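The arithmetic behind the lottery response mode is the standard expected-utility bookkeeping, sketched below with a hypothetical indifference judgment; the scale anchors and the 65% figure are invented, not data from this experiment.

    # Anchor the worst car at 0 and the best car at 100 on the value scale.
    def utility_from_indifference(p, u_worst=0.0, u_best=100.0):
        # At indifference, the sure thing's utility equals the lottery's
        # expected utility: u = p * u_best + (1 - p) * u_worst.
        return p * u_best + (1.0 - p) * u_worst

    # Hypothetical judgment: indifferent at a 65% chance of the best car.
    print(utility_from_indifference(0.65))  # 65.0 on the 0-100 scale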
These, then, are the components of a three-dimensional factorial
design: (1) decomposed vs. wholistic judgments; (2) 3 attributes vs. 9
attributes; and (3) the response modes of dollars, rating scales, and
lotteries.
Results
Table 2 presents the correlational analysis of the results of the
experiment. It displays correlations between the additive decomposed utility
models and the wholistic utility models.

Table 2
Correlations between decomposed and wholistic utility judgments

1. Three Dimensions    S#1    S#2    S#3    S#4    S#5    Median
a. Rating Scales       .99    .92    .95    .96    .94    .95
b. Dollars            1.00    .92    .97    .92    .92    .92
c. BRLTs               .91    .94    .94    .93    .93    .93
2. Nine Dimensions
a. Rating Scales       .97    .96    .93    .91    .86    .93
b. Dollars             .98    .89    .97   1.00    .91    .97
c. BRLTs               .82    .85    .89    .84    .92    .85

If the two approaches yield
essentially the same utility functions, then the correlations should be
about 1.0. In all cases the correlations are high; they are systematically
below .90 in only one of the six conditions —when a lottery response mode
was used with nine dimensions.
Table 3 presents a similar pattern of results. It displays differences
between the wholistic and decomposed approaches rather than a correlation
between the two.

Table 3
Mean squared differences between decomposed and wholistic utility judgments
(all scores normalized from 0 to 100)

1. Three Dimensions    S#1    S#2    S#3    S#4    S#5    Median
a. Rating Scales        39    143     41     86    128     91
b. Dollars               3    234     73    204    151    151
c. BRLTs               272    306    119    155    183    183
2. Nine Dimensions
a. Rating Scales        45    103    127    216    257    127
b. Dollars              32    159     60      6    129     60
c. BRLTs               697    345    204    400    151    345

The entries in Table 3 are mean squared differences between the
decomposed utility judgments and the wholistic utility judgments. All scales
were normalized on a 0 to 100 scale to make results comparable throughout.
Once again, the poorest performance resulted from the use of lotteries with

nine dimensions; next poorest from lotteries with three dimensions.
Taken together, these results suggest that there is a high correlation
between decomposed and wholistic approaches. The absolute disagreement between
the two approaches is greatest when the lottery response mode is used. Our
first impression was that the use of a lottery was less efficient because
of the greater complexity of the response. Additional reflection, however,
suggested an alternative explanation. The direct estimation procedures
resulted essentially in riskless utility functions, because they were
derived from situations in which the subject assumed for sure that he would
receive the stimulus car. The lottery procedure, on the other hand, generated
utility functions that incorporated the judges' attitude toward risk, because
there was uncertainty about whether or not the imaginary car would be received.
Therefore, in addition to the greater natural complexity of the lottery task,
there was also more to aggregate —attitude toward risk as well as the utility
of dimensions along which the cars differed.
This result has prompted us to reconsider the entire theoretical
approach to aggregating multidimensional utilities by means of an additive
model. In many real-world situations to which decision analysis will be
applied, there is uncertainty about whether or not the objects whose
utilities are estimated will be received. In such a case, it is important
to incorporate the decision-maker's attitude toward risk as well as his attitude
toward the values of each of the value dimensions. But that does not imply
that the risk attitude must be tapped when measuring each of the value
dimensions. An alternative procedure is to use one of the direct estimation
procedures, such as rating scales or dollars, in order to collapse the set
of value dimensions into a single dimension. Then, once that simplification

has been achieved, it is possible to find a single utility function which
incorporates attitude toward risk on that single dimension, using traditional
procedures for eliciting "utility for money" functions. We will propose
additional research aimed at disentangling some of these ideas in the near
future.
Field Research on Multidimensional Utilities
The field research that was conducted in parallel with laboratory
research on multidimensional utilities is now complete. It forms Mr.
Michael O'Connor's PhD dissertation, and the final draft is now being
prepared. The purpose of this field research was to evaluate the degree
to which procedures developed in the laboratory would work in the real
world. Whereas laboratory research typically uses college students as
subjects, and the tasks are relatively abstract, it is important to learn
the degree to which these same procedures will be applicable when used with
professional people who have a deep understanding of the problem on which
the analysis is being applied.
This field work involved the construction of a water quality index.
The construction of such an index is a task that is ideally suited for
using decomposition procedures designed to cope with multidimensional utilities.
Water quality is currently evaluated by gathering samples
and measuring amounts of such polluting variables as fecal coliforms, phosphates,
and so on. The problem is that there is no physical model that describes the
trade-off functions between these dimensions. How much of an increase in
fecal coliforms is required to exactly offset some specified decrease in
suspended solids? Assuming that the less polluted water has more utility,
Mr. O'Connor set about the task of finding these trade-off functions by
using procedures of multi-attribute utility.

Mr. O'Connor made field trips to several different water quality
engineers for the purpose of eliciting judgments of utility functions. The
first problem was to select a set of criteria to be included in the index.
Almost immediately this posed a serious problem. What was desired was an
overall measure of quality, but it became apparent very quickly that a very
relevant question was utility for whom, or for what purpose. For example,
consider the attribute of water temperature. If water is to be used for
bathing, then a higher temperature is generally considered a good thing,
but if it is to be used for industrial cooling then a higher temperature
is a bad thing. Therefore, before constructing the quality index, it
was decided to measure quality for two quite different purposes: for
public water supply and for fish and wildlife. One purpose of selecting
two very different purposes of the utility function was to learn the degree
to which the specific purpose was important in determining the function.
Then a list of dimensions or attributes was selected for each of the
two functions by eliciting judgments from water quality engineers about
which dimensions they felt were most important for each of the two functions.
After settling upon the lists of dimensions, each engineer generated utility
functions in the following way: for each attribute, he first drew a function
describing the manner in which quality of water changed as the amount of the
attribute increased. These functions were usually monotonic, but typically
curvilinear. In a few instances they were nonmonotonic; the function increased
until the concentration of the attribute was at an optimum, and then it
decreased as the attribute exceeded that optimum. After estimating a quality

function for each attribute, the engineers then estimated relative weights.
They did this by assigning a weight of 100 to the attribute considered most
important, assigning a weight of 0 to an irrelevant, completely unimportant
attribute, and then estimated intermediate weights for all of the other
attributes.
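The resulting index is a weighted average of the single-attribute quality functions. The sketch below is illustrative only; the attributes, functional forms, and weights are invented, not the engineers' actual judgments.

    # Hypothetical single-attribute quality functions (0 = worst, 1 = best).
    def q_fecal_coliform(count_per_100ml):
        return max(0.0, 1.0 - count_per_100ml / 2000.0)

    def q_dissolved_oxygen(mg_per_liter):
        return min(1.0, mg_per_liter / 9.0)

    # Judged importance weights on the 0-100 scale described above.
    weights = {"fecal_coliform": 100.0, "dissolved_oxygen": 80.0}

    def water_quality_index(sample):
        total = sum(weights.values())
        score = (weights["fecal_coliform"]
                 * q_fecal_coliform(sample["fecal_coliform"])
                 + weights["dissolved_oxygen"]
                 * q_dissolved_oxygen(sample["dissolved_oxygen"]))
        return 100.0 * score / total  # index on a 0-100 scale

    print(water_quality_index({"fecal_coliform": 400.0,
                               "dissolved_oxygen": 6.3}))  # about 75.6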
These inputs were then sufficient to construct a multidimensional
utility function for each of the water quality engineers. The functions
were, of course, different, and so a modified Delphi procedure was used
to resolve the differences. Reasonably good consensus was achieved by
the process.
A final index has been achieved for each of the two purposes: for
public water supply and for fish and wildlife. It was evaluated in two ways.
First, the 11 participating water quality engineers agreed on the face
validity of the index. Then several different samples of water were
rated by the index. It turned out that even when the engineers differed
with respect to their final utility weights or utility functions, the
resulting correlations between engineers across water samples remained
high. However, there was a much lower correlation between the two
indices. These results indicate that small differences among the judges
are not critical, but that it is important to determine exactly what the
utility function is to be used for before generating the function. General
purpose utility functions may be misleading.

Submarine Surveillance
Although it was not possible to field test the laboratory research on
multidimensional utilities in a Naval setting, this was possible for research
on the estimation of probabilities. Toward the end of the current contract
year, on-line research was begun on the problem of submarine surveillance. This
research was conducted with Naval intelligence officers in OP-942U (CNO USW
Flag-plot) who are responsible for the surveillance of Soviet submarines. The
research was conducted by Dr. Cameron Peterson and Mr. Clint W. Kelly; it was
supported in part by the IBM Corporation and in part by this contract.
Procedures developed in the laboratory research described above were field
tested on the problem of submarine surveillance. First, the scoring rule test
was administered to the intelligence officers for purposes of calibration. Then
we employed several different approaches to decomposition intended to improve
the forecasts. The following example is an actual case study that is described
here in hypothetical terms. It illustrates several of the decomposition
procedures that have been used in other cases. In each instance, the numbers
are those that were estimated by Naval intelligence officers, but some of
the substantive portions of the case have been modified or disguised for
purposes of security.
The problem is as follows: a submarine has been sighted leaving the
Mediterranean Sea. The analysts were attempting to infer whether or not it was
a nuclear submarine; was it an SSN or an SS? The following example illustrates
several procedures used for estimating this probability.
Consider the two hypotheses: the first hypothesis, (H1), is that it is
an SSN and the second hypothesis, (H2), is that it is an SS. Some
historical data are relevant to this question. Six similar submarines
-31

have been observed previously; five were SSNs and one was an SS. The task
is to add that historical information to the analyst's theoretical knowledge in
order to arrive at a probability estimate about this particular submarine. This
estimate was achieved through the use of second-order beta probabilities and
is illustrated in Figure 1.
[Figures 1 and 2: Figure 1 gives the second-order beta distributions over P(H1), the proportion of SSNs, with the rectangular prior in the top graph and the posterior in the bottom graph. Figure 2 gives the log-odds chart, PRIORS P(H1)/P(H2), with odds (10:1, 5:1, 3:1, 1:1) on the left vertical axis and the corresponding probabilities P(H1) (0.909, 0.833, 0.750, 0.500) on the right.]
The top graph of Figure 1 refers to the percentage of SSNs
among all submarines that might be sent out. This is represented by P(H1)
and is the horizontal axis of the top graph. The horizontal function is a
rectangular probability distribution estimated by an intelligence analyst. This
implies that, based upon theoretical knowledge and ignoring the historical
frequencies, he expects that all proportions of SSNs are equally likely. It is
-32

just as likely that they would send 15% as 40% or 80% SSNs. It is possible,
through the use of beta functions, to combine this rectangular prior probability
distribution with the historical frequency information in order to obtain a
posterior probability distribution. The two parameters of the prior distribution,
r=1 and s=1, are simply added to the relative frequencies in order to arrive
at the appropriate parameters for the posterior probability distribution: r=6
and s=2. This posterior probability distribution is displayed in the bottom portion
of Figure 1. It is a probability distribution over the proportion of SSNs.
The mean or expectation of this probability distribution is .75, and so
that is the number we selected as the prior probability that this
particular submarine was an SSN. It served as the starting point for
analyzing the following data.
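The arithmetic of this beta revision is compact enough to sketch directly;
the sketch below simply restates the computation described above, with
illustrative variable names.

    # Beta revision: a rectangular prior, Beta(r=1, s=1), over the
    # proportion of SSNs, updated with the historical frequencies.
    r, s = 1, 1                          # rectangular (uniform) prior
    ssns_seen, sss_seen = 5, 1           # six similar submarines observed
    r, s = r + ssns_seen, s + sss_seen   # posterior is Beta(6, 2)
    prior_p_ssn = r / (r + s)            # mean of Beta(6, 2)
    print(prior_p_ssn)                   # 0.75, the starting probability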
Figure 2 shows the logarithmic chart that was used in the experiment on
Bayesian procedures described above. The prior odds of 3:1, indicated on
the left vertical axis, and the prior probability of .75, shown on the
right vertical axis, mark the starting point derived above.
The first datum is that another submarine was sighted on a homeward-bound
course. For some well-considered reasons, this datum slightly favors the
hypothesis that the submarine being observed is an SSN: the estimated
probability that this second submarine would have been homeward bound in
about this time interval is .55 if the submarine being observed were an
SSN, but only .50 if it were an SS. Therefore the likelihood ratio of the
first datum, that a homeward-bound submarine was observed, is .55/.50,
which is equal to 1.1. This likelihood ratio is now inserted into the
log-odds chart as shown in Figure 3. It increases the odds in favor of an
SSN by an almost imperceptible amount.
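In odds form, Bayes' theorem reduces this revision to a single
multiplication, which is what the log-odds chart accomplishes graphically.
A minimal sketch of the computation for the first datum:

    # Bayes' theorem in odds form for datum 1 (homeward-bound submarine).
    p_datum_given_ssn = 0.55
    p_datum_given_ss = 0.50
    lr = p_datum_given_ssn / p_datum_given_ss        # likelihood ratio
    prior_odds = 0.75 / 0.25                         # 3:1 in favor of an SSN
    posterior_odds = prior_odds * lr                 # 3.3:1
    posterior_p = posterior_odds / (1.0 + posterior_odds)
    print(round(lr, 2), round(posterior_odds, 2), round(posterior_p, 3))
    # 1.1  3.3  0.767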
-33

[Figure 3: log-odds chart with datum 1, L = 1.1, entered just above the 3:1 prior. Figure 4: DATUM 2, STRAIGHT TRACK; beta distributions over P(D/H1) and P(D/H2) used to elicit the likelihood ratio.]
The second datum is that the submarine followed a straight track. Figure 4
illustrates the manner in which the likelihood ratio associated with the
straight track was elicited. This procedure also employed beta distributions.
The upper portion of the figure refers to the prior beta distributions, ignoring historical frequencies of straight tracks. The left-hand function is
the second order probability distribution over the proportion of straight
tracks given an SSN as estimated by the analyst. The analyst expected that
many more of the SSNs would follow straight tracks than evasive tracks.
-34

The upper right-hand graph refers to the second order probability distribution for
straight track given the second hypothesis, an SS. This is a uniform distribution. The middle portion of the figure shows the historical frequencies.
All five SSNs previously observed had followed a straight track, whereas the
one SS did not. Addition of these frequencies to the parameters of the beta
distributions shown at the top of the figure implies the beta distributions shown
at the bottom of the figure. The resulting likelihood ratio associated with
datum 2 is therefore .90, the expectation of the bottom left-hand distribution,
divided by .33. This likelihood ratio, 2.7, is now added to the log-odds chart
as illustrated in Figure 5. The observation that the submarine is moving in
a straight track has boosted the odds in favor of an SSN to approximately
9 to 1. This second datum is a very strong one, indeed.
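The computation behind Figure 4 can be sketched as follows. The analyst's
prior given an SSN is not stated numerically in the text; Beta(4, 1) is
assumed here because it reproduces the posterior mean of .90 quoted above.
The prior given an SS is uniform, Beta(1, 1), as stated.

    # Likelihood ratio for datum 2 (straight track) from beta distributions.
    straight_ssn, evasive_ssn = 5, 0   # all five earlier SSNs went straight
    straight_ss, evasive_ss = 0, 1     # the one earlier SS did not

    # Posterior means; the Beta(4, 1) prior for the SSN is an assumption.
    mean_ssn = (4 + straight_ssn) / (4 + 1 + straight_ssn + evasive_ssn)  # 0.90
    mean_ss = (1 + straight_ss) / (1 + 1 + straight_ss + evasive_ss)      # 0.33
    print(round(mean_ssn / mean_ss, 1))   # 2.7, the likelihood ratio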
[Figure 5: log-odds chart with datum 1 (L = 1.1) and datum 2 (L = 2.7) entered, raising the odds in favor of an SSN from 3:1 to approximately 9:1.]
-35

The analysis of the third datum is displayed in Figure 6. A political
event that was expected to be observed was not observed.
[Figure 6: conditional probability tree, DATUM 3 = EXPECTED POLITICAL EVENT NOT OBSERVED, with branches for the true hypothesis, the occurrence of the event, and the observation of the event, and with path probabilities of .40 given an SSN and .20 + .50 = .70 given an SS.]
A conditional probability tree has been used to estimate the likelihood
ratio. The left-hand branch of the tree refers to the two hypotheses, the
SSN versus the SS. The next branch refers to the probability that the
expected event
would occur given each of the hypotheses. The analysts estimated that
the event was certain to occur if they were observing an SSN, but the
probability was only.50 that it would occur given an SS. The third
column of branches refers to the probability of observing the event if
it occurs. The analysts reviewed some experimental literature and
concluded that there was a 60% chance that they would observe the event
-36

if it actually occurred, but it was certain that they would not observe
the event if it did not occur. That is, there was no chance of a false
alarm. These branch probabilities implied the path probabilities displayed
in the boxes at the right-hand side of the tree. Each probability was calculated
by taking the product of the component branch probabilities. Thus, the .60
in the top branch is equal to 1.0 times .60. It is now a simple matter to
find the likelihood ratio. The probability of not observing the expected
event given an SSN is equal to .40 (the probability that the event occurs
but is not observed) plus zero (the probability that it does not occur at
all). The probability of not observing the expected event given an SS is
equal to .70, so the resulting likelihood ratio is 1/1.75.
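The tree arithmetic can be sketched directly from the branch probabilities
given above:

    # Path probabilities for datum 3 (expected event not observed).
    p_occurs = {"SSN": 1.0, "SS": 0.5}   # P(event occurs | hypothesis)
    p_seen_if_occurs = 0.6               # P(observed | event occurred)
    p_seen_if_not = 0.0                  # no chance of a false alarm

    def p_not_seen(h):
        # Sum over the two paths: occurs-but-missed, and never occurs.
        return (p_occurs[h] * (1.0 - p_seen_if_occurs)
                + (1.0 - p_occurs[h]) * (1.0 - p_seen_if_not))

    print(round(p_not_seen("SSN"), 2), round(p_not_seen("SS"), 2))  # 0.4 0.7
    print(round(p_not_seen("SSN") / p_not_seen("SS"), 3))           # 0.571, i.e., 1/1.75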
This likelihood ratio of 1/1.75 associated with not observing an
expected event is now drawn on the log-odds chart in Figure 7.
[Figure 7: log-odds chart with datum 3, L = 1/1.75, entered below datum 2.]
-37

It drops the odds in favor of an SSN to approximately 5 to 1.
The fourth datum is that there has been no contact with the
submarine for an extended period of time. The analysts estimated a
probability of 90% for an extended lack of contact with an SSN, but
a probability of only 20% of no contact with an SS. Accordingly, the
likelihood ratio associated with the fourth datum is .9/.2, or 4.5. This
likelihood ratio is now added to the log odds chart as shown in Figure 8.
[Figure 8: log-odds chart with datum 4, L = 4.5, entered above datum 3.]
It raises the odds in favor of the SSN to about 25 to 1.
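The running product of the prior odds and the four likelihood ratios
derived above reproduces this result; a minimal bookkeeping sketch:

    # Running the prior odds through the four likelihood ratios.
    odds = 3.0                            # starting odds from the beta prior
    for lr in (1.1, 2.7, 1 / 1.75, 4.5):  # data 1 through 4
        odds *= lr
    print(round(odds, 1))                 # about 22.9, read as roughly 25:1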
The fifth datum later turned out to be a false alarm. An event
had begun to develop, and the analysts concluded that if it continued to
develop it would favor the hypothesis that the submarine was an SS. The
effect of this datum is displayed on the log odds chart in Figure 9.
-38

Several attempts yielded no appropriate procedure for decomposing the
estimate of the likelihood ratio associated with this particular event.
[Figure 9: log-odds chart with datum 5 entered intuitively, moving the odds down in favor of an SS.]
Therefore, the analysts simply moved the odds on the log-odds chart on
an intuitive basis. After considerable discussion, they decided that
if the event did not turn out to be a false alarm, it favored the SS. It was
somewhat more diagnostic than the datum of not observing the expected
political event, and not quite as diagnostic as the straight track. They
therefore moved the log-odds down just slightly less than they had moved
it up as a function of the straight track. The resulting likelihood
ratio turned out to be 2.3. The next morning, after completing this
analysis, the final datum was identified as a false alarm and the odds
estimates were therefore returned to about 25 to 1, as is shown in Figure 10.
-39

[Figure 10: log-odds chart after removal of the false-alarm datum, with the odds in favor of an SSN restored to about 25 to 1.]
Some weeks later, after receiving much more information, it was
concluded that this submarine was indeed an SSN. It is interesting
to note that four of the analysts had discussed the problem thoroughly
throughout the day and remained both uncertain and divided about which of
the two hypotheses was correct. Yet the two analysts who made the estimates
for this problem agreed with the final conclusion, that the data very strongly
favored the hypothesis that they were observing an SSN.
We have now conducted several other case studies similar to the one
described here. In general, the results have been encouraging with respect
to the operational feasibility of techniques that are currently being
developed for eliciting probability estimates and for revising those
probability estimates in the light of new information. But these case
studies have also highlighted some weaknesses of the laboratory procedures.
Because the field setting has proved so useful in exposing such weaknesses,
we intend to increase the field research with the intelligence analysts
at OP-942U during the next contract year.
-40


Personnel List
Cameron Peterson Principal Investigator
Patricia Domas Research Assistant
Gregory Fischer Research Assistant
Jane Hoffman Research Assistant
Michael O'Connor Research Assistant
Linda Bender Secretary
Patricia Homan Secretary
Annette Johnson Secretary


[Appendix: blank log-odds recording chart, with odds P(H1):P(H2) from 1000:1 down to 1:1000 on the left vertical axis, the corresponding probabilities P(H1) from .999 down to about .001 on the right vertical axis, and a Day/Month scale along the bottom for recording successive revisions.]