Statistical Methods for Eliciting Probability Distributions

Elicitation is a key task for subjectivist Bayesians. Although skeptics hold that elicitation cannot (or perhaps should not) be done, in practice it brings statisticians closer to their clients and subject-matter expert colleagues. This article reviews the state of the art, reflecting the experience of statisticians informed by the fruits of a long line of psychological research into how people represent uncertain information cognitively and how they respond to questions about that information. In a discussion of the elicitation process, the first issue to address is what it means for an elicitation to be successful; that is, what criteria should be used. Our answer is that a successful elicitation faithfully represents the opinion of the person being elicited. It is not necessarily "true" in some objectivistic sense, and cannot be judged in that way. We see elicitation as simply part of the process of statistical modeling. Indeed, in a hierarchical model it is ambiguous at which point the likelihood ends and the prior begins. Thus the same kinds of judgment that inform statistical modeling in general also inform elicitation of prior distributions. The psychological literature suggests that people are prone to certain heuristics and biases in how they respond to situations involving uncertainty. As a result, some ways of asking questions about uncertain quantities are preferable to others, and appear to be more reliable. However, data are lacking on exactly how well the various methods work, because it is unclear, other than by asking using an elicitation method, just what the person believes. Consequently, one is reduced to indirect means of assessing elicitation methods. The tool chest of methods is growing. Historically, the first methods involved choosing hyperparameters of conjugate prior families, at a time when these were the only families for which posterior distributions could be computed. Modern computational methods, such as Markov chain Monte Carlo, have freed elicitation from this constraint. As a result, both parametric and nonparametric methods are now available for low-dimensional problems. High-dimensional problems are probably best thought of as lacking another hierarchical level, which has the effect of reducing the as-yet-unelicited parameter space. Special considerations apply to the elicitation of group opinions. Informal methods, such as Delphi, encourage the participants to discuss the issue in the hope of reaching consensus. Formal methods, such as weighted averages or logarithmic opinion pools, each have mathematical characteristics that are uncomfortable. Finally, there is the question of what a group opinion even means, because it is not necessarily the opinion of any participant.


Introduction
Elicitation is the process of formulating a person's knowledge and beliefs about one or more uncertain quantities into a (joint) probability distribution for those quantities. In the context of Bayesian statistical analysis, elicitation most often arises as a method for specifying the prior distribution for one or more unknown parameters of a statistical model. In this context, the prior distribution will be combined with the likelihood through Bayes' theorem to derive the posterior distribution. However, this is not the only context in which elicitation is important.
Much of the literature on elicitation has been concerned with formulating a probability distribution for uncertain quantities when there are no data with which to augment the knowledge expressed in that distribution. This situation arises in decision making where uncertainty about the "state of nature" needs to be expressed as a probability distribution to derive (and then maximize) expected utility. Similarly, it arises in the use of mechanistic models. Such models are built in almost all areas of science and technology to describe, understand, and predict the behavior of complex physical processes. The user is required to specify the values of appropriate model inputs to run the model and obtain outputs, but there is generally uncertainty about the "true" values of the inputs. It is then important to formulate that uncertainty and to propagate it through the model so as to quantify the uncertainty in model outputs.
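To make the propagation step concrete, here is a minimal Monte Carlo sketch; the model function and the lognormal input distribution are hypothetical stand-ins for whatever mechanistic model and elicited distribution are at hand.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(k, t=10.0):
    # Hypothetical mechanistic model: output is exponential decay at time t.
    return np.exp(-k * t)

# Stand-in for an elicited input distribution: a lognormal for the decay
# rate k with median 0.1 and roughly a factor-of-2 spread.
k = rng.lognormal(mean=np.log(0.1), sigma=np.log(2.0), size=100_000)

# Propagate the input uncertainty through the model by simple Monte Carlo.
out = model(k)
print("output median:", np.median(out))
print("output 95% interval:", np.percentile(out, [2.5, 97.5]))
```

The summaries of the output sample then quantify the uncertainty in the model output that is induced by the elicited uncertainty in the input.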
It is convenient to think of the elicitation task as involving a facilitator, who helps the expert formulate the expert's knowledge in probabilistic form. In the context of eliciting a prior distribution for a Bayesian analysis, it is the expert's prior knowledge that is being elicited, but in general the objective is to express the expert's current knowledge in probabilistic form. If the expert is a statistician, or is very familiar with statistical concepts, then there may be no formal need for a facilitator, but this is rare in practice. Elicitation is a complex process that demands a range of skills if it is to be done well, and the role of facilitator is an important one.
What does it mean for an elicitation to be done well? It is important to distinguish between the quality of an expert's knowledge and the accuracy with which that knowledge is translated into probabilistic form. An elicitation is done well if the distribution that is derived accurately represents the expert's knowledge, regardless of how good that knowledge is. The expert might, for instance, believe very strongly in a certain scientific hypothesis. Then the elicitation is accurate if it derives a suitably high probability for that hypothesis being true, even if it is subsequently found to be false. Even if the rest of the scientific community is much more skeptical and inclined to give the hypothesis a low probability, this expert believes in the hypothesis, and therefore accurate elicitation of this expert's knowledge and beliefs should derive a high probability for it.
To achieve accurate elicitation is by no means straightforward, even if we wish to elicit the expert's beliefs about just a single event or hypothesis (or, equivalently, for a binary random variable). In this case we require only a single probability, but the expert may be unfamiliar with the meaning of probabilities. Even when the expert is familiar with probabilities and their meaning, it is not easy to accurately assess a probability value for an event.
If we now consider the task of eliciting a distribution for a continuous random variable X, then implicitly this involves eliciting an infinite collection of probabilities F (x) = P (X ≤ x) for all of the possible values of x. This is clearly impossible, and in practice an expert can make only a finite number (and usually a rather small number) of statements of belief about X. These statements might take the form of individual probabilities or quantiles of the distribution [i.e., P (X ≤ x) for a few distinct values of x] or might be some other summaries of the distribution, such as modes. When it comes to a joint distribution for a collection of random quantities, the magnitude of the elicitation task is much larger still.
Given the difficulty involved, why is it worth the effort to attempt elicitation? One reason has to do with the use of elicitations to make decisions. Often a reasonable goal for elicitation is to capture the "big message" in the expert's opinion. The details (e.g., the exact shape of the expert's opinion) may not matter for the decision to be reached. Even when the decision is sensitive to the exact shape of the elicited distribution, it is not the decision, but rather the expected utility of the decision, that matters. And expected utility of the optimal decision is very often robust to details of the expert's opinion.
A second reason why elicitation is worthwhile has to do with the use of elicitations to make inferences, particularly for making possible the calculation of posterior distributions. In such a situation, elicitation encourages the expert and the facilitator to consider the meaning of the parameters being elicited. This has two helpful consequences. First, it brings the analysis closer to the application by demanding attention to what is being modeled and what is reasonable to believe about it. Second, it helps make the posterior distributions, once calculated, into meaningful quantities.
Elicitation is properly conceived of as part of the familiar process of statistical modeling. Statisticians are used to stating a likelihood for an applied problem. This is an opinion about how the data are generated, conditional on certain parameters. Hierarchical models, such as random-effects models and models with latent variables, involve distributions on some parameters, conditional on yet other parameters. What we are calling "elicitation" in this article is merely the final step in this process, the statement of probability distributions of the highest-level parameters in such a hierarchy. One should keep in mind that the usual principles of statistical modeling apply to elicitation as well.

The Elicitation Process
Schematically, the elicitation process comprises four broad stages:
1. The first stage is to set up the elicitation. This involves identifying and recruiting the expert(s), training them in the relevant probabilistic concepts, and deciding which aspects of the problem to elicit.
2. The next stage is to elicit specific summaries of the experts' distributions for those aspects. This is obviously the core of the process, and one where psychologists have contributed at least as much to the methodology as statisticians.
3. The next stage is to fit a (joint) probability distribution to those summaries. In practice, this stage often blurs with the previous stage, in the sense that the choice of what summaries to elicit is often influenced by the choice of what distributional form the facilitator intends to fit.
4. Elicitation is almost invariably an iterative process, and the fourth stage involves assessing the adequacy of the elicitation, with the option then of returning to the second stage and eliciting more summaries from the expert(s).

This article is structured in accordance with this schematic. The remainder of Section 1 concerns topics relevant to the setup of the elicitation, Section 2 deals with the interaction with the expert to elicit specific summaries, Section 3 addresses how to fit a probability distribution to the elicited summaries, and Section 4 deals with assessing the accuracy of elicitation. Section 5 considers questions that arise when beliefs are elicited from several experts, and Section 6 offers some discussion and challenges for future research.

Whose Beliefs?
We have presented elicitation as the process of formulating in probabilistic terms the beliefs of an expert, but who is the expert? The use of the term "expert" suggests an emphasis on persons to whom society and/or his or her peers attribute special knowledge about the matters being elicited. In practice, we often seek to identify the best available knowledge about the quantities of interest, and for this purpose we can consider the "expert" a real expert. There are other uses of elicitation, however, in which the expert has little or no expertise in the usual sense of that word. For example, to study adolescent decision making around risky behaviors, one might want to ask adolescents how they perceive those risks. Here the very point of the study is the lack of expertise of the "experts."
The simple answer to the question of "who is the expert" is that the expert is the person whose knowledge we wish to elicit; the term "expert" does not necessarily signify any more than that. An important point to bear in mind when eliciting from an acknowledged expert is that expertise can bring biases if the expert has some kind of personal interest in the result. Suppose, for example, that a radiation expert is asked for an opinion about how serious a health problem is engendered by the radiation release at Chernobyl. Such an expert may have spent much of his or her adult life becoming an expert on radiation. How well that expertise pays off in terms of social attention (and grants) depends on how urgent society perceives the issues the expert studies to be. Hence such an expert has an incentive to emphasize the dangers. (For more on this kind of bias, see Kadane and Winkler 1988.) In Section 5 we consider the case of multiple experts, where often the desire is to combine the expertise of several people. In such a case it is sensible to try to ensure that the experts' knowledge is complementary. Where their knowledge overlaps, it is more difficult to account for this (as we discuss in Sec. 5), and there is less gain from using the extra experts.

Conducting the Elicitation
We outline here various aspects of good practice in the conduct of elicitations. Many of these aspects can be ignored in an informal elicitation, but they are important considerations wherever substantive decisions or inferences may depend on the expert's distribution:
• The objective is to elicit a distribution to represent the expert's current knowledge. It is very useful to have a summary of what that knowledge is based on.
• Any financial or personal interest that the expert might have in the inferences or decisions that will depend (even marginally) on the expert's distribution must be declared.
• Training should be given to familiarize the expert with the interpretation of probability and with whatever concepts and properties of probability will be required in the elicitation. It is useful to run through a dummy elicitation exercise to provide practice in the protocol that the facilitator proposes to use.
• A record should be kept of the elicitation. This should ideally set out all of the questions that were asked by the facilitator together with the expert's responses, as well as the process by which a probability distribution was fitted to those responses.
Note that these recommendations apply whatever the specific protocol that the facilitator will use to elicit the expert's beliefs. We now proceed with a detailed discussion of how to construct a suitable protocol.

PSYCHOLOGICAL CONSIDERATIONS AND ELICITING SUMMARIES
An elicitation method forms a bridge between an expert's opinions and an expression of these opinions in a statistically useful form. Thus the development of an elicitation method requires some understanding of both the psychological part and the statistical part of the bridge. As Hogarth (1975, p. 284) pointed out, "assessment techniques should be designed both to be compatible with man's abilities and to counteract his deficiencies." In this section we present some results from psychological research that should be taken into account when forming methods of quantifying an expert's opinion. Much of the fundamental work in this area stems from the 1960s and early 1970s, and good reviews of this research have been given by Hampton, Moore, and Thomas (1973), Hogarth (1975), Huber (1974), Lichtenstein, Fischhoff, and Phillips (1982), Peterson and Beach (1967), Slovic and Lichtenstein (1971), and Tversky (1974). Later reviews have been given by Chaloner (1996), Cooke (1991), Hogarth (1987), Kadane and Wolfson (1998), Meyer and Booker (2001), Morgan and Henrion (1990), Wallsten and Budescu (1983), and Wolfson (1995).

Heuristics and Biases
A body of psychological research has been concerned with the question of how a person assesses the probability of an event, or how he or she judges which of two or more events is the more likely to occur. It appears that intuitive judgments in these tasks are based on a limited number of mental operations, or heuristics. In general these heuristics are quite effective, but they can lead to severe errors and systematic bias. Elicitation techniques commonly require novel assessment tasks. An appreciation of the strategies that people use to quantify their opinions can give an indication of how (and how well) these tasks might be performed (Meyer and Booker 2001).
One commonly used heuristic is judgment by representativeness. This is applicable for questions of the form: "what is the probability that an object A belongs to a class B?" or "what is the probability that event A will generate an event B?" In answering these questions, which in effect require assessment of the probability P (B|A), people typically compare the main features of A and B and assign a probability depending on the degree of similarity between them. A common error made with this kind of judgment is that little or no attention is paid to the unconditional probability of B. As an illustration, consider the problem of evaluating the probabilities that an individual, Mr. X, who has been described as "meticulous, introverted, meek, and solemn," is engaged in one of the following occupations: farmer, salesman, pilot, librarian, or physician. Most people perform this task by assessing the similarity of Mr. X to the stereotype of that occupation, and order the occupations by the extent to which Mr. X is representative of these stereotypes. They completely ignore base rates, such as the relative number of salesmen to librarians, and assign a high probability to Mr. X being a librarian. Similar results have been obtained by Hammerton (1975) and Nisbett, Borgida, Crandall, and Reed (1976).
Another commonly used heuristic is judgment by availability. This is used when a person estimates the frequency of a class or the probability of an event by the ease with which examples are recalled or occurrences come to mind. Examples from large classes are usually recalled better and faster than examples from less frequent classes, and likely occurrences are easier to imagine than unlikely ones. Hence mental availability is often a helpful indicator for the assessment of frequency and probability, but availability is also affected by factors other than frequency or probability. For example, suppose that one is asked whether a randomly chosen word from an English text is more likely to start with an "r" or to have "r" as its third letter. It is easier to recall words by their starting letter (e.g., red, rank, rogue, road, rope) than by their third letter (e.g., park, bird, wire). Hence most people judge that "r" is more likely to be the first letter of a word than the third letter, although the reverse is actually true. Recall is also affected by such factors as familiarity, salience, and recency, and newsworthy events also impact disproportionately on memory, so one might overestimate the probability of a plane crash with fatalities, for example, particularly if such a crash has happened recently. Thus the judgment-by-availability heuristic, though useful, can lead to marked error.
Perhaps the most widely used heuristic for probability assessment is judgment by anchoring and adjustment. With this strategy, a person estimates an unknown quantity by starting from some initial value and then adjusting it to obtain a final estimate. The starting value, which is usually termed the anchor, could be suggested by the nature of the problem or the way it is formulated. Regardless of the source of the starting value, the adjustment is usually too small (Slovic 1972), a phenomenon called anchoring. An experiment conducted by Tversky and Kahneman (1974) elegantly demonstrated this effect. Subjects were asked to estimate various quantities, stated in percentages (e.g., the percentage of African countries in the United Nations). They were given randomly chosen starting values and first had to decide whether the value they had been given was too high or too low, and then adjust it until they reached their best estimate. Through insufficient adjustment, subjects whose starting values were high ended up with substantially higher estimates than those who started with low values. For example, the median estimate of the percentage of African countries in the United Nations was 25% for subjects who received 10% as their starting point and 45% for those who received 65% as their starting point.
As mentioned earlier, when subjects use the judgment-by-representativeness heuristic to assess probabilities, they tend to ignore prior probabilities. However, many experiments have shown that if subjects are made aware of the prior probabilities and are asked to modify these in the light of fresh sample data, then the assessed posterior probabilities are too close to the prior probabilities compared with the revision indicated by Bayes' theorem; that is, the subjects' modifications to the prior probabilities are insufficient. This type of insufficiency when modifying probabilities to reflect new data is called conservatism (Edwards and Phillips 1964). One possible explanation of this phenomenon is that subjects use the anchoring and adjustment strategy; the prior probability acts as the anchor and the adjustment is insufficient.
A typical experiment to demonstrate conservatism is one in which subjects assess the probability that colored poker chips have been drawn from one of two bookbags, where the two bags contain different compositions of chips (e.g., 70% red and 30% blue in one bag and 30% red and 70% blue in the other). A coin is tossed to select one of the bags, so the prior probability of each bag is .50, and the experimenter then draws a succession of chips with replacement from the selected bag, indicating their color to the subject. Having observed the sample evidence, the subject states his or her posterior probability that the bookbag sampled contained the predominantly red or the predominantly blue proportion of chips. Subjects' revisions are normally conservative compared with the objective probability calculated by Bayes' theorem. For example, when the sample contains eight red chips and four blue chips, subjects commonly give a probability for the "red" bookbag of about .75, whereas the posterior probability calculated by Bayes' theorem is .97.
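The quoted posterior follows directly from Bayes' theorem; the short calculation below (a sketch, in which the binomial coefficients cancel between the two likelihoods) reproduces the .97 figure.

```python
# Two bags: the "red" bag is 70% red chips, the "blue" bag 30% red chips.
# A coin toss selects a bag (prior .50 each); the sample is 8 red, 4 blue.
red, blue = 8, 4
lik_red_bag = 0.7**red * 0.3**blue
lik_blue_bag = 0.3**red * 0.7**blue

# Bayes' theorem (the binomial coefficients cancel in the ratio).
posterior = 0.5 * lik_red_bag / (0.5 * lik_red_bag + 0.5 * lik_blue_bag)
print(round(posterior, 2))  # 0.97, versus typical subject assessments near .75
```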
Several studies have attempted to counteract conservatism through varying the experimental procedure. In case subjects avoided approaching the bounds of the probability scale, unbounded odds estimates have been used instead of probability estimates. Rewards have been added to provide an incentive to perform well, and sample sizes, sequence lengths, and prior probabilities have been varied. Some of these changes have influenced the degree of conservatism, but they have not eliminated it (see, e.g., Peterson and Beach 1967, pp. 32-33, for a review of such experiments). In some experiments, however, the basic experimental situation has been modified so as to make it more complex, and in these more complex situations, conservatism is not always a dominating influence (Youssef and Peterson 1973).
Research has also demonstrated that in many circumstances, people expect a sample from a population to represent all of the essential characteristics of that population, even if the sample is small. Tversky and Kahneman (1971) referred to this false logic as the "law of small numbers," which asserts that the law of large numbers applies to small numbers as well. One experiment that demonstrated this fallacious belief used, as subjects, audiences at two psychology meetings. Through a questionnaire, the psychologists were asked to decide sample sizes and also to relate sample sizes to inferences that they would make from hypothesis tests. The following is a typical example of the questions asked: "Suppose you have run an experiment on 20 subjects and have obtained a significant result which confirms your theory (z = 2.23, p < .05, two-tailed test). You now have cause to run an additional group of 10 subjects. What do you think the probability is that the results will be significant, by a one-tailed test, separately for this group?" (Tversky and Kahneman 1971, p. 105). The median answer from the two groups was .85. However, if one assumes a noninformative prior distribution for the mean before the first sample was taken, then the true probability is only .48. The error is readily attributable to belief in the "law of small numbers"; people expect all samples to have virtually identical characteristics. For similar reasons, answers to other parts of the questionnaire indicated that most of the respondents (a) were too easily convinced by early results from a small experiment, (b) tested their research hypotheses on small samples without realizing the high odds against detecting the effects being studied, and (c) rarely attributed unexpected results to sampling variability, because they found a causal explanation for every observed effect.
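The .48 figure can be reproduced under one standard formalization: unit-variance normal data and a flat prior on the mean, so that the significance of the replication depends on the posterior predictive distribution of the second sample mean. The sketch below (requiring SciPy) makes those assumptions explicit.

```python
from math import sqrt
from scipy.stats import norm

# Assumptions: unit-variance normal data, flat (noninformative) prior on the
# mean. The first sample (n1 = 20) gave z = 2.23, so the posterior for the
# mean is N(m, 1/n1) with m = z / sqrt(n1).
n1, n2, z1 = 20, 10, 2.23
m = z1 / sqrt(n1)

# Predictive distribution of the second sample mean: N(m, 1/n1 + 1/n2).
# The replication is significant (one-tailed, .05) if the new sample mean
# exceeds 1.645 / sqrt(n2).
threshold = norm.ppf(0.95) / sqrt(n2)
prob = 1 - norm.cdf(threshold, loc=m, scale=sqrt(1 / n1 + 1 / n2))
print(round(prob, 2))  # roughly 0.48, far from the median answer of .85
```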
The results demonstrating the "law of small numbers" and those demonstrating conservatism appear somewhat contradictory. The former suggest that people overestimate the value of sample evidence, whereas the latter suggest that people underestimate it. One conjecture is that conservatism has an effect if people first formulate their opinion before being given the sample evidence, whereas the "law of small numbers" has an effect if people obtain the sample evidence before first formulating their opinion (Garthwaite 1983, p. 17).
Another form of systematic error that affects people's judgments is hindsight bias, which can arise when people are asked to assess their a priori probability of an event that has actually occurred. For example, they might be asked whether the dismantling of the Berlin Wall was predictable before it happened: "In 1988, what was the probability that the Berlin Wall would come down within the next 5 years?" In 1988 it may have seemed unlikely that communism would soon collapse and East and West Germany would reunite, but with hindsight the economic problems of communist countries make it seem almost inevitable. Knowledge of what has occurred tends to distort memory, and people tend to exaggerate their a priori probability for an event that has occurred. An experiment that clearly demonstrates this phenomenon was conducted by Fischhoff and Beyth (1975). Just before President Nixon's visit to China and the USSR in 1972, subjects were asked to assess the probabilities of various possible outcomes of his visit, such as "President Nixon will meet Mao at least once" and "the United States and the USSR will agree to a joint space program." Shortly after the visit and without forewarning, the subjects were asked to recall the probabilities they had given to these events before Nixon's visit, and they were also asked which events they thought had actually occurred. Results showed that subjects generally overestimated their a priori probabilities for the events they thought had occurred and underestimated their a priori probabilities for the events they thought had not occurred.
We have described only some of the heuristics that people use to make numeric judgments, and the biases that affect such judgments. These heuristics were chosen for their relevance to elicitation methods. For example, conservatism, hindsight bias, and the "law of small numbers" relate to people's response to data, and various elicitation methods use the impact of sample data on opinion to quantify beliefs about variances and covariances, as we discuss in Section 3. An extensive list of heuristics and biases in human judgment was given by Hogarth (1987, pp. 216-222).

What Summaries to Elicit
In designing an elicitation method, there is usually choice as to which quantities the expert is asked to assess. If possible, quantities should be chosen that are usually assessed reasonably competently. People's ability to estimate simple statistical quantities, such as means and variances, has been examined in psychological research over several decades. Such quantities could constitute part of the elicitation method for almost any form of prior distribution.
Several experiments have investigated subjects' capability to judge sample proportions (Erlick 1964; Nash 1964; Pitz 1965, 1966; Shuford 1961; Simpson and Voss 1961; Stevens and Galanter 1957). In these experiments, binary data were displayed to subjects for a limited period of time, and the subjects were then asked to estimate one of the sample proportions. For example, Shuford (1961) projected 20 × 20 matrices onto a screen, one at a time. The elements of each matrix were red squares and blue squares, and subjects observed a matrix for 1 second in some trials and for 10 seconds in others. After each trial, subjects had to estimate the proportion of squares that had been, say, red. In this and similar experiments, subjects generally assessed the sample proportion very accurately, with the mean of subjects' estimates differing from the true sample proportion by less than .05 in most cases.
Similar experiments have been used to investigate people's ability to estimate measures of central tendency (Beach and Swenson 1966; Peterson and Miller 1964; Spencer 1961, 1963). Typically, a sample of numbers is displayed to subjects, who are then asked to estimate the mode, median, or mean of the sample. When the sample distribution is approximately symmetric, so that these three measures are numerically similar, subjects' estimates have shown a high degree of accuracy (Beach and Swenson 1966; Spencer 1961). However, an experiment conducted by Peterson and Miller (1964) used a sample drawn from a population whose distribution was highly skewed; subjects' assessments of the median and mode were again reasonably accurate, but assessments of the mean were biased toward the median.
In most applications, determining a prior distribution requires estimation of the variances of unknown scalar quantities and/or sample errors. Regrettably, it seems that people are poor both at interpreting the meaning of "variance" and at assigning numerical values to it. When estimating relative variability, empirical evidence indicates that people are influenced by the mean of the stimuli and estimate the coefficient of variation, rather than the variance. For example, Hofstatter (1939) obtained assessments of the variability in the lengths of sticks tied in bundles. He found that the assessed value increased with the sample variance, as it should, but as the means increased, the assessed value decreased. Lathrop (1967) replicated this result. Even allowing for the effect of means, systematic differences still arise between intuitive judgments of sample variance and objective values. If large deviations from the mean predominate, as when, for example, the sample is drawn from a population whose distribution is bimodal, then the variance is overestimated. On the other hand, if small deviations from the mean predominate, as when, for example, the population distribution is normal, then the variance is underestimated (Beach and Scopp 1968).
One way of eliciting variances that avoids their direct assessment is to elicit credible intervals; such intervals are useful in their own right and can yield estimates of variances if suitable distributional assumptions are made (cf. Sec. 3.2). There are two main approaches to assessing credible intervals for a scalar quantity, the fixed interval method and the variable interval method. Let X denote the scalar quantity of interest. With the fixed interval method, the range of values that X can take is partitioned into intervals by the statistician/psychologist who organizes the elicitation method. For each interval, the expert assesses the probability that X will fall in that interval. (In principle, assessed probabilities must sum to 1.) The probabilities are also sometimes elicited through odds assessment. The expert first indicates the interval that he or she believes is most likely to contain X. Then, for each of the other "less likely intervals," he or she states odds that X will take a value in the "less likely interval" as opposed to the "most likely interval." The probabilities associated with each interval are then calculated by imposing the constraint that their sum must be 1.
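The odds-based variant reduces to a simple normalization. A sketch (the function and the odds values are hypothetical):

```python
# Hypothetical helper: the expert names the most likely interval, then
# states odds r_i = p_i / p_most_likely for each remaining interval.
def probabilities_from_odds(odds):
    p_most_likely = 1.0 / (1.0 + sum(odds))
    return [p_most_likely] + [r * p_most_likely for r in odds]

# Illustrative responses: three less likely intervals judged at odds of
# 1:2, 1:4, and 1:10 against the most likely interval.
probs = probabilities_from_odds([0.5, 0.25, 0.1])
print([round(p, 3) for p in probs])  # [0.541, 0.27, 0.135, 0.054]; sums to 1
```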
With the variable interval method, the expert identifies points that correspond to specified percentiles of his or her subjective distribution. A method of bisection is often used that entails a sequence of questions of the following form:
Q1. Can you determine a value (the expert's median) such that X is equally likely to be less than or greater than this point?
Q2. Suppose you were told that X is below your assessed median; can you now determine a new value (the lower quartile) such that it is equally likely that X is less than or greater than this value?
Q3. Suppose you were told that X is above your assessed median; can you now determine a new value (the upper quartile) such that it is equally likely that X is less than or greater than this value?
An advantage of this line of questioning is that only judgments of equal odds are required, an intuitively easier task than specifying percentiles that divide a probability in the ratio of, say, 4:1. Many experiments have examined people's performance at assessing credible intervals. If the credible intervals calibrate well with reality, then the proportion of p% central credible intervals that contain the true value of X (the quantity to which they relate) should be about p%. (The concept of calibration is discussed further in Sec. 4.4.) These experiments have demonstrated that assessing credible intervals is a task that people perform reasonably well, but there is a clear tendency for central credible intervals to be too short, so that this proportion is less than p%. This bias is called overconfidence; people believe they are more accurate in estimating X than is justified (Keren 1991). Lichtenstein et al. (1982, pp. 325-326) tabulated 28 sets of results in which 50% central credible intervals were elicited for continuous scalar quantities. Less than 50% of the intervals contained the true value of the scalar in 23 of the 28 sets, and the proportion never exceeded 57% in any of the sets. The evidence is conflicting as to whether the fixed-interval method or the variable-interval method gives better calibration; Seaver, von Winterfeldt, and Edwards (1978) found that the fixed-interval method performed better, whereas Murphy and Winkler (1974) found the converse. With the variable-interval method, it is also unclear which percentiles should be elicited. The median and quartiles are most commonly assessed (using the method of bisection), and although this has sometimes yielded good results (e.g., Murphy and Winkler 1974; Peterson, Snapper, and Murphy 1972), other empirical work has found that overconfidence is less if the 33rd and 67th percentiles are assessed (Barclay and Peterson 1973; Garthwaite and O'Hagan 2000). Most empirical work has involved scalar quantities, and the results may not generalize to more complex models. For example, Garthwaite (1989) elicited 50% central predictive intervals for the dependent variable in a simple linear regression model and found that far more than 50% of the intervals contained correct values. The subjects had drawn graphs to help make their assessments, and this may have improved the accuracy of their median assessments and led to predictive intervals that were more likely to contain a true value.
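Assessing calibration of the kind described here amounts to computing an empirical coverage proportion. A sketch with an invented record of 50% central credible intervals:

```python
import numpy as np

def coverage(lower, upper, truth):
    # Proportion of elicited intervals that contain the true values.
    lower, upper, truth = map(np.asarray, (lower, upper, truth))
    return float(np.mean((lower <= truth) & (truth <= upper)))

# Invented record of ten 50% central credible intervals and the values
# later observed; good calibration would give coverage near .50.
lo = [2, 10, 0.1, 5, 30, 1, 8, 100, 4, 0.5]
hi = [4, 20, 0.4, 9, 60, 3, 12, 150, 6, 0.9]
x  = [5, 12, 0.8, 6, 75, 4, 13, 160, 5, 0.2]
print(coverage(lo, hi, x))  # 0.3: well below .50, the overconfidence pattern
```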
Much empirical research has investigated subjects' ability to assess the extreme "tails" of a distribution. For example, Alpert and Raiffa (1969) used the variable interval method to elicit 98% central credible intervals. They asked "almanac" questions of the following kind: How many foreign cars were imported into the U.S. in 1968? (a) Make a high estimate such that you feel there is only a 1% probability the true answer would exceed your estimate. (b) Make a low estimate such that you feel there is only a 1% probability the true answer would be below this estimate (Alpert and Raiffa 1969, pp. 16-17).
It should have been somewhat of a "surprise" to a subject to find the true value of a quantity falling outside an interval; 43% of all assessments produced such surprises. This information was given as feedback before a second session. In this second session, 23% of the assessments produced surprises, still a very high percentage. One reason for the large number of surprises is that assessing tails of distributions is a difficult task, mainly because it requires the consideration of events that are unlikely, so that comparisons do not come readily to mind. (It is unfortunate that assessing probabilities for rare events is difficult, because expert opinion is of paramount importance when sample data are scarce.) Task characteristics have an effect on how an expert views a problem and the assessments that are elicited. Visual aids to help people quantify their opinions have been tried, such as urns full of colored balls (Raiffa 1968), light pens on colored screens (Barclay and Randall 1975), or simply asking assessors to mark a point on a line whose endpoints are 0 and 1 (for probabilities) or 0-100% (for proportions). Probability wheels (Spetzler and Stael von Holstein 1975) are another visual aid that has been used with some success (Morgan and Henrion 1990, p. 126). In its simplest form, a probability wheel comprises a round pie-shaped disc of one color that is partly covered by a "slice" of a different color and a pointer. The size of the slice can be varied, and the expert adjusts its size so that if the pointer is spun, then the probability that it lands within the slice is equal to the expert's probability for some specified event.
Efforts to influence how people consider probabilities have also been explored, such as asking them to suggest scenarios that would lead to an unlikely event (Slovic and Fischhoff 1977), influence diagrams (Howard and Matheson 1984), getting subjects to think carefully about the substantive details of each judgment (Koriat, Lichtenstein, and Fischhoff 1980), and disaggregating an implicit hypothesis into its constituent hypotheses (Johnson, Hershey, Meszaros, and Kunreuther 1993). As an example of the effect of disaggregation, Fischhoff, Slovic, and Lichtenstein (1978) questioned experts (i.e., car mechanics) about the probable reasons for a car not starting. The experts assessed the probability that it would not start "for some reason other than the battery, engine or fuel system," and their average probability was .22. They also assessed the probability that it would not start for more specific reasons: failure of the ignition system, failure of the starting system, and so on. Combining the probabilities of the latter disaggregated reasons gave an average of .44 as the probability that the car would not start for a reason other than the battery, engine, or fuel system. This result is consistent with other empirical work; there is ample evidence that the sum of separate probability assessments for constituent hypotheses generally gives a much larger probability than a single probability assessment of the combined hypothesis that they form (see Tversky and Koehler 1994 for a review). Morgan and Henrion (1990, p. 116) commented that "it has become something of an article of faith in the decision analysis community that disaggregation of an elicitation problem holds the potential for significantly improved performance on many assessment tasks." There has been research aimed at converting probabilistic phrases (such as "quite likely" and "extremely probable") into numeric values (Wallsten, Budescu, Rapoport, Zwick, and Forsyth 1986; Mosteller and Youtz 1990), and at differentiating situations where verbal expressions of probability are preferable to numeric ones, or vice versa (Windschitl and Wells 1996). People are generally more comfortable expressing their uncertainty in verbal terms rather than numerically. Unfortunately, there is considerable variation in the probabilities that different people attach to the same phrase, and the context also affects the probability that a person associates with a phrase (Lichtenstein and Newman 1967; Beyth-Marom 1982; Wallsten et al. 1986). The response mode in which subjects are asked to give assessments also affects judgments. For example, Gigerenzer (1996) found that numeric expressions that are formally equivalent, such as frequencies and probabilities, are not always treated as equivalent in subjective uncertainty judgments. Also, it seems better to elicit probabilities in terms of populations of events, such as "what proportion of students starting a Ph.D. will complete it within 5 years?" rather than as single (one-shot) events, like "if a new Ph.D. student is picked at random, what is the probability that he or she will complete the Ph.D. within 5 years?" (Gigerenzer 1996; Koehler 1996).
As noted earlier, a good elicitation method should yield a probability distribution that accurately reflects the expert's opinion, but this is hard to check. A pragmatic alternative is to compare assessed distributions with true values when these are known; see the discussion of calibration in Section 4.4. Several experiments have attempted to train subjects to improve the calibration/objective accuracy of their assessments. These have typically found that objective accuracy is substantially improved by training, but that such biases as overconfidence are tempered rather than eliminated (Schaefer and Borcherding 1973; Lichtenstein and Fischhoff 1980). In these experiments, training usually took the form of feedback, in which subjects were told the correct values after making the assessments and the trainer stressed the direction of biases and how the expert might reduce them. The benefits of effective feedback can be seen in the performance of weather forecasters. Weather forecasters make regular predictions for the same quantities each day (e.g., temperature, probability of precipitation), and thus soon learn the accuracy of their forecasts. Experiments have generally found them to be quite well calibrated. For example, Peterson et al. (1972) conducted an experiment with two meteorologists who gave forecasts of the maximum and minimum temperatures on the following day. Using the method of bisection, the meteorologists expressed their forecasts in terms of 50% central credible intervals. They showed good calibration; out of 55 forecasts, 28 fell inside the 50% credible intervals, 18 fell outside the 50% credible intervals, and 9 fell on the boundaries.
In the main, research concerned with descriptive statistics for scalars has produced relatively clear-cut conclusions. People are capable of estimating proportions, modes, and medians of samples, but are slightly less proficient at assessing sample means if the sample distribution is highly skewed and often have serious misconceptions about variances. People are reasonable at quantifying opinions as credible intervals using the fixed-interval and variable-interval methods. However, there is a general tendency for the assessed distributions to imply a greater degree of confidence than is justifiable. Practice, coupled with feedback, can reduce this bias, but assessing the extreme tails of distributions (e.g., 98% credible intervals) is difficult, and although training should reduce bias, it will not eradicate it. Visual aids can prove useful, and task characteristics often have a marked impact on the assessments that are elicited.

Multivariate Elicitation
When the expert's opinion is sought on two or more unknown variables, the output of the elicitation should be the expert's joint probability distribution for those variables. The task is now more complex than when eliciting a distribution for a single variable, and the facilitator must inevitably ask more complex questions.
An important special case is where the variables are independent, meaning that if the expert were to obtain new information about some of the variables, it would not change his or her beliefs about the others. The concept of independence is straightforward to explain, and independence between variables is a relatively simple judgment for the expert to make. It is also a very convenient judgment, because when all of the variables are independent, their joint distribution is just the product of their marginals. The elicitation exercise then reduces to eliciting the expert's beliefs about each variable separately, so only univariate elicitation techniques are required. Using independence to decompose a multivariate elicitation task into simpler univariate tasks is consistent with the idea of disaggregation.
Psychological research also indicates that when events are independent, joint probabilities should be assessed via univariate probabilities, because people exhibit systematic bias when making joint probability assessments. In particular, people tend to overestimate the probability of conjunctive events and underestimate the probability of disjunctive events. For example, Bar-Hillel (1973) found that people tended to overestimate the probability of drawing a red marble seven times in succession from a bag containing 90% red marbles and 10% white marbles, and to underestimate the probability of drawing a red marble at least once in seven successive draws from a bag containing 10% red marbles and 90% white marbles. These errors can be explained as the result of anchoring; the probability of an elementary event provides an obvious starting point for estimating the probability of both conjunctive and disjunctive events. For conjunctive events, the probability of the elementary event must be reduced, which is done insufficiently, and for disjunctive events it must be increased, which again is done insufficiently.
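For reference, the true probabilities in the two Bar-Hillel tasks are easy to compute, and the quantity subjects overestimated is in fact almost the same size as the one they underestimated:

```python
# True probabilities in Bar-Hillel's (1973) marble tasks.
p_conjunctive = 0.9**7        # all 7 draws red, bag with 90% red marbles
p_disjunctive = 1 - 0.9**7    # at least 1 red in 7 draws, bag with 10% red
print(round(p_conjunctive, 2))  # 0.48: typically overestimated by subjects
print(round(p_disjunctive, 2))  # 0.52: typically underestimated by subjects
```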
Discussion of the physical or historical relationships among variables can make judgments of independence or conditional independence clear. With many elicitation methods, it is transparent what assessments would correspond to independence, and in applications of these methods, subjective independence between some pairs of parameters is often observed (see examples in Garthwaite and Dickey 1991, 1992). It should be noted, however, that assumptions of independence often make assessment tasks easier for an expert. For example, if an expert has assessed the marginal distribution of X, and X and Y are independent, then the conditional distribution of X|Y is easily specified as "no change." It may be that experts are too willing to accept independence where it does not strictly apply.
Even where variables are dependent, it may be possible to restructure the problem by expressing it in terms of independent variables. An example might be where we seek a medical expert's opinion on the effectiveness of two treatments in a clinical trial. Letting X and Y denote the relevant measures of effectiveness of the two treatments, we would not typically have independence between X and Y . If the expert learned that X, the effectiveness of the first treatment, was higher than he or she originally expected, then this would generally lead him or her to have an increased expectation of the effectiveness of the second treatment, Y . This may be because the expert believes that the treatments act in similar ways, but it may also be because of uncertainty in patient recruitment. That is, if X is smaller than expected, say, this may be because the trial has recruited patients who are more ill and will thereby suggest a smaller value for Y . However, the expert might be willing to accept independence between two functions of X and Y . For instance, it may be reasonable to suppose independence between X and Z = Y/X. Here Z is the relative effectiveness of treatment 2 over treatment 1. Such a structure is often appropriate where treatment 1 is standard care or placebo and treatment 2 is a new or active treatment. Where both treatments are new, the asymmetry of the aforementioned structure may not be appealing, but the expert might be happy to express independence between (X + Y )/2, the mean effectiveness, and the difference Y − X. Bayesian hierarchical models are natural examples of structuring dependent variables in terms of conditional independence. O'Hagan (1998) emphasized the role of structuring as an aid to elicitation, and Kadane and Schum (1996) provided an extended example of complex structuring of beliefs.
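A practical payoff of such restructuring is that the expert's implied joint distribution for the original variables can be recovered by transformation from the independently elicited components. A minimal Monte Carlo sketch, with purely illustrative lognormal distributions standing in for whatever would actually be elicited for X and Z = Y/X:

```python
import numpy as np

rng = np.random.default_rng(1)

# Purely illustrative stand-ins for the independently elicited components:
# X (effectiveness of treatment 1) and Z = Y/X (relative effectiveness).
x = rng.lognormal(mean=np.log(2.0), sigma=0.3, size=100_000)
z = rng.lognormal(mean=np.log(1.2), sigma=0.2, size=100_000)

# The implied joint beliefs about (X, Y) follow by transformation, and the
# construction induces the expected positive dependence between X and Y.
y = x * z
print("correlation(X, Y):", round(float(np.corrcoef(x, y)[0, 1]), 2))
```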
In cases where variables are dependent and obviously cannot be reduced to independence in this way, we cannot escape from the complexity of multivariate elicitation. We can (and generally should) elicit summaries of the expert's marginal distributions, but these no longer characterize the joint distribution completely. The question then arises as to which summaries of the expert's joint distributions are most effective and reliable to elicit.
Although statisticians usually model dependence in terms of correlations, directly eliciting correlation coefficients or covariances might be expected to encounter at least as many problems as directly eliciting means and variances in univariate elicitation. Psychological research has primarily considered eliciting correlation between variables that might be considered as being drawn from some population. For example, Clemen, Fischer, and Winkler (2000) examined the following six methods of eliciting a correlation between weight and height in a population of male MBA students:
1. Verbal (nonnumeric) description of the strength of a correlation on a seven-point scale ranging from "very strong negative relationship" to "very strong relationship." [Clemen et al. (2000) made strong assumptions to convert the verbal assessments to correlations.]
2. Direct assessment of the correlation by specifying a value between −1 and 1.
3. Ask the subject to imagine that a person has been picked at random from the population. Give the person's percentile for one variable, and ask the subject to assess the person's percentile for the second variable.
4. Ask the subject to imagine that two people, A and B, have been picked at random from the population. Conditional on A being greater than B for one variable, ask the subject to assess the probability that A is also greater than B for the other variable.
5. Ask the subject to imagine that a person has been picked at random from the population. Then ask the subject to assess the probability that for both variables, the person is below a specified percentile.
6. Ask the subject to imagine that a person has been picked at random from the population. Conditional on the person being below a specified percentile for one variable, ask the subject to assess the probability that the person is also below that percentile for the second variable.
Clemen et al. (2000) found that method 2 performed best. This is surprising, because others have suggested that the direct assessment of moments is a poor method of quantifying opinion (Morgan and Henrion 1990; Kadane and Wolfson 1998; Gokhale and Press 1982). Method 4 asks for a concordance probability to be assessed, which can be equated to a value of Kendall's τ. Assumptions of normality are then made so as to relate Kendall's τ to the Pearson correlation coefficient. Assessment of concordance probabilities to examine correlation has been examined by Gokhale and Press (1982), who found it preferable to alternative methods that they considered, and by Kunda and Nisbett (1986), who concluded that reasonably accurate correlation estimates were obtained provided that subjects were very familiar with observations from the population in question and the data related naturally to a numeric scale.
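For method 4, the route from an elicited concordance probability to a Pearson correlation is explicit: the probability equals (1 + τ)/2 for Kendall's τ, and under bivariate normality ρ = sin(πτ/2) (Greiner's relation). A sketch of that conversion:

```python
from math import pi, sin

def pearson_from_concordance(p):
    # p is the elicited probability that A exceeds B on the second variable,
    # given that A exceeds B on the first. This equals (1 + tau)/2 for
    # Kendall's tau; under bivariate normality, rho = sin(pi * tau / 2).
    tau = 2 * p - 1
    return sin(pi * tau / 2)

print(round(pearson_from_concordance(0.75), 2))  # 0.71
```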
In several experiments, subjects were shown samples from a bivariate population and then asked to judge the "degree of relatedness" between the two variables; the subjects made use of only a limited portion of the available data, sometimes basing their judgments on just the proportion of time that the positive outcome for one of the binary variables occurred with a positive outcome for the other (Smedslund 1963; Inhelder and Piaget 1958; Jenkins and Ward 1965; Ward and Jenkins 1965).
Statistically, it is important to distinguish between eliciting an expert's beliefs about a population correlation coefficient and eliciting the value of the correlation in the expert's beliefs about two variables. In the first case, the correlation coefficient is the variable whose probability distribution we wish to elicit. This task was addressed by, for instance, Gokhale and Press (1982). The second case is the situation that arises in multivariate elicitation, where the correlation (or some other measure of association) is to be elicited as one summary of the expert's joint distribution for two variables. Where the two variables can be considered single draws from a population, then some of the aforementioned methods may be appropriate. However, many cases of multivariate elicitation do not fit this situation. Consider, for example, eliciting someone's beliefs about the fuel economy and the acceleration of a new car. This car is not some random draw from any population. Methods 4-6 in Clemen et al.'s study are no longer appropriate.
None of the methods examined by Clemen et al. (2000) use graphs in any way. It seems likely that graphical methods could perform better, especially because it is very natural to plot a graph to describe the relationship between two variables, such as height and weight. This approach would represent association between variables in terms of regression, which is related to correlation. For two variables X and Y, for instance, we might try to elicit the regression function m(x) = E(Y |X = x). If the expert accepts the proposition that this function is linear, then we might simply elicit m(x1) and m(x2) for any x1 ≠ x2. Eliciting more than two points on the function ("overfitting"; see Sec. 4.1) would allow for checking of the assumption of linearity, or for more accurate fitting of a straight line. Here, as elsewhere, it may be preferable to elicit medians rather than means.
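As a sketch of this overfitting check (the height and weight figures below are invented for illustration): elicit the conditional median at four values of x, fit a line by least squares, and inspect the residuals.

```python
import numpy as np

# Invented elicitation: conditional median weights (kg) at four heights (cm).
x_pts = np.array([160.0, 170.0, 180.0, 190.0])
m_pts = np.array([58.0, 66.0, 75.0, 81.0])

# Fit a straight line by least squares; with four points rather than two,
# the residuals provide an informal check of the linearity assumption.
slope, intercept = np.polyfit(x_pts, m_pts, 1)
print(f"m(x) = {intercept:.1f} + {slope:.2f} x")
print("residuals:", np.round(m_pts - (intercept + slope * x_pts), 2))
```

Large or patterned residuals would cast doubt on the linear form and prompt either a transformation or further questioning of the expert.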
A body of psychological research has examined multiple regression. In this research, the x-variables are generally referred to as "cues," Y is termed the "criterion," and the regression coefficients are referred to as "cue weights." Subjects predict the value of the criterion, basing their predictions on the known values of the cues. In a wide variety of situations, it has been found that a subject's responses can be represented quite well by a linear model that relates the criterion to the cues. The correlations between subjects' responses and the responses predicted by linear models (fitted to the same responses that determined the model) have generally taken values in the .70's when the judgmental task is from a "real world" situation, and in the .80's and .90's for less complex artificial tasks. In some studies, the model derived from one sample of predictions was used to forecast a second sample of predictions. The forecasts produced in this way were only slightly less accurate than those produced by a model actually based on the second sample of predictions (Einhorn 1971;Slovic and Lichtenstein 1968;Wiggins and Hoffman 1968). Experiments also show that, provided that cues are monotonically related to the predicted variable, a simple linear combination of main effects will do a remarkably good job at forecasting a subject's assessments, even if subjects know that interactions exist. One implication is that when eliciting the dependence of one variable on one or more other variables, it is reasonable to constrain an expert's assessments to fit a linear model and to ignore interactions, unless it becomes clear that some interactions are important. An extensive review of cue-weighting experiments was given by Slovic and Lichtenstein (1971).
A joint distribution involves more than modeling conditional means or medians. Hence eliciting a joint distribution is more complex than determining an expert's cue weights. Generally, conditional probabilities are a natural way to augment marginal probabilities when trying to specify a joint probability distribution, and in particular they allow conditional dispersion to be elicited and modeled. Conditional medians and other quantiles were extensively exploited by, for instance, Kadane, Dickey, Winkler, Smith, and Peters (1980), Dickey, Dawid, and Kadane (1986), and Kadane (1996).
An alternative to conditional probabilities is joint probabilities. Having elicited P (X ≤ x), for instance, we might elicit the joint probability P (X ≤ x, Y ≤ y) or, equivalently, the conditional probability P (Y ≤ y|X ≤ x). Note, however, that conditional probabilities are usually elicited in the form of P (Y ≤ y|X = x), and that conditioning on X ≤ x rather than on X = x may be cognitively more complex. But we have already noted that experts do not assess joint probabilities accurately, even when the variables are independent. We might also expect joint probabilities like P (X ≤ x, Y ≤ y) to be subject to a kind of representativeness bias and so to be positively/negatively biased if the association between X and Y is positive/negative, although this does not appear to have been studied.

FITTING A DISTRIBUTION
Once the facilitator has obtained from the expert a number of specific statements, the elicitation task is completed by converting these into a probability distribution. Different levels of complexity are found in the fitting of a probability distribution to the expert's statements. If the elicitation is to obtain a prior distribution that will then be updated in a Bayesian analysis of some additional data, then it is usual to fit a probability distribution using standard parametric families of distributions. But if the elicitation is to formulate uncertainty about inputs to a decision problem or a mathematical model, such as in the risk assessment of a complex engineering project, then much more simplistic elicitation and fitting are common.

Uniform and Triangular Distributions
The simplest form of elicitation is to ask the expert to specify a range [a, b] in which the parameter is believed to lie. If this is all that is elicited from the expert, then it is common to assume a uniform probability distribution over [a, b]. This can be criticized as being too simplistic in at least two respects. First, the expert almost certainly would not believe that the unknown quantity in question is as likely to be very close to the limits a and b as to be at a more central point in the interval. Second, unless the range [a, b] represents absolute physical limits to the possible values of the quantity (in which case the first criticism applies even more strongly), it is unreasonable to give zero probability to the event that the quantity lies outside the range.
As a simple response to the first criticism, another common practice is to use a triangular distribution. For this purpose, the expert is asked to also specify a mode, say c. Then the assumed distribution has the density

f(x) = 2(x − a)/{(b − a)(c − a)} for a ≤ x ≤ c,  f(x) = 2(b − x)/{(b − a)(b − c)} for c ≤ x ≤ b.

The acceptability of uniform and triangular distributions as representations of uncertainty about model inputs in engineering applications is indicated by their featuring strongly in the work of Oberkampf, Helton, Joslyn, Wojtkiewicz, and Ferson (2004), but O'Hagan and Oakley (2004) criticized this practice as a failure to elicit adequately. Even where substantially more information is elicited from the expert, uniform distributions may be assumed over intervals. Suppose that in addition to the range [a, b] the expert also specifies probabilities p_1, p_2, . . . , p_k that the uncertain quantity lies in the intervals [a, c_1], (c_1, c_2], . . . , (c_{k−1}, b]. (This may be done by fixing the c_i's and asking for the probabilities, or by fixing the p_i's and asking for quantiles.) Then the facilitator may simply assign the histogram distribution with density

f(x) = p_i/(c_i − c_{i−1}) for c_{i−1} < x ≤ c_i, i = 1, . . . , k,

where c_0 = a and c_k = b. Although the expert's beliefs would generally be better represented by a distribution with smooth density function, this histogram form may be adequate, particularly if k is not small. Feedback to the expert may be useful at this point.
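As a concrete illustration, the following sketch shows how the triangular and histogram densities above might be constructed in practice; the range, mode, cut points, and probabilities are invented values, not taken from any elicitation.

```python
import numpy as np
from scipy import stats

a, b, c = 10.0, 50.0, 20.0  # elicited range [a, b] and mode c (assumed values)

# Triangular density on [a, b] with mode c, as in the formula above.
tri = stats.triang(c=(c - a) / (b - a), loc=a, scale=b - a)

# Histogram distribution: elicited probabilities over fixed cut points.
cuts = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # c_0 = a, ..., c_k = b
probs = np.array([0.3, 0.4, 0.2, 0.1])           # elicited p_i, summing to 1

def histogram_pdf(x):
    """Density p_i / (c_i - c_{i-1}) on the interval containing x."""
    x = np.asarray(x, dtype=float)
    i = np.clip(np.searchsorted(cuts, x, side="right") - 1, 0, len(probs) - 1)
    inside = (x >= cuts[0]) & (x <= cuts[-1])
    return np.where(inside, probs[i] / np.diff(cuts)[i], 0.0)

print(tri.pdf(25.0))                # triangular density at 25
print(histogram_pdf([15.0, 35.0]))  # histogram density at 15 and 35
```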

Fitting Parametric Distributions
More complex elicitation methods usually impose structure on an expert's opinion by assuming that his or her knowledge can be well represented by some member of a specified family of distributions. If the expert has specified information that fits a convenient parametric distribution, then it makes sense to use it. This strategy has costs and benefits similar to those for statistical modeling in general. Members of the hypothesized family are distinguished by parameters (called hyperparameters), and the elicitation task then reduces to choosing appropriate hyperparameter values to capture the main features of the expert's opinion. If the expert's opinion does not correspond approximately to any member of the family, then discrepancies are likely to show up in the expert's answers to elicitation questions, much as an ill-fitting sampling model shows up in standard diagnostic checks. The family of distributions is typically chosen to be the natural conjugate family (or a tractable extension of that family), which facilitates subsequent analysis if sample data become available, although advances in Bayesian computation through Markov chain Monte Carlo (MCMC) methods make it viable to use other families. For many sampling models, the conjugate family is reasonably flexible and can represent a variety of opinion through suitable choice of hyperparameters. Further flexibility is available through mixtures of conjugates. Dalal and Hall (1983) and Diaconis and Ylvisaker (1985) demonstrated that such mixtures can represent any actual belief to arbitrary accuracy, although we are not aware of any work in which an expert's opinion is elicited in terms of a mixture, other than for variable selection problems (Garthwaite and Dickey 1992, 1996).
Two elicitation tasks that have attracted substantial attention are quantifying opinion about a Bernoulli process and quantifying opinion about a linear regression model. We focus on each of these problems in turn before briefly discussing elicitation for other sampling models. The judgmental tasks that are central in much of this work are the assessment of means, medians, and quantiles; revision of opinion when sample data become available; and specification of relevant aspects of a "prior sample" whose information content would approximately equate to one's knowledge. As a guiding principle, experts should be asked questions about quantities that are meaningful to them. This suggests that questions should generally concern observable quantities rather than unobservable parameters, although questions about proportions and means might also be considered suitable, because psychological research suggests that people can relate to these quantities. However, in some application areas, particular statistical models are so familiar to experts that their parameters have acquired well-understood scientific meaning. It then may be appropriate to ask experts directly about such parameters, as discussed by Kadane (1980) and Winkler (1980). The facilitator should always try to understand what terms make the expert most comfortable for elicitation.
Four basic methods have been used to quantify subjective opinion about p, the unknown parameter of a Bernoulli process. In illustrating assessment questions, we suppose that p is the proportion of students at the University of Chicago who are male, which is the example used by Winkler (1967), the first author to address this elicitation problem.

1. One method is to ask the expert to specify his or her median estimate of p and to give one or more quantiles (usually at least two) of his or her subjective distribution for p. These may be plotted and a smooth cumulative distribution function drawn through them, giving a nonparametric representation of the expert's opinion. More commonly, it is assumed that the expert's opinion can be well represented by a beta distribution (the conjugate distribution for Bernoulli sampling), and a beta distribution is selected whose quantiles are similar to those that the expert gave. The beta distribution might be selected using a table presented by Winkler (1972, table 5) that lists several quantiles for a variety of parameter values; a numerical sketch of this fitting step is given after this list. We call this method of elicitation the quantile method; it is also often called the credible interval method.

2. The second method is the hypothetical future sample (HFS) method. The expert first estimates the proportion under consideration (e.g., the proportion of students who are male) and then revises his or her opinion in the light of information from additional (hypothetical) samples. For example, he or she might be asked questions of the following form: "Suppose that a random sample of 50 students was taken and 20 of them were male. Now what is the probability that one additional student, chosen at random, is male?" Again it is assumed that the expert's opinion corresponds to a beta distribution; its parameters are uniquely determined by the expert's prior and posterior (given the hypothetical sample data) estimates of the proportion. In general, the expert is confronted with several hypothetical samples in which the number of students and the proportion of these who are male are varied. Each sample yields a separate estimate of the hyperparameters, and some form of averaging is used to reconcile them. A general issue arises here in the case where the resulting hyperparameter estimates are very disparate. This may reflect assessment inaccuracy; that is, the beta distribution may be correct and the elicitation method may be the best available, but the expert's answers to questions are subject to substantial variability. It also warns that the elicited distribution could be a poor representation of the expert's opinion, either because his or her opinion does not correspond to a beta distribution or because the elicitation method asks questions that are hard to answer accurately. (See also the discussion of internal consistency in Sec. 4.1.)

3. The third method is the equivalent prior sample (EPS) method, in which an expert expresses his or her knowledge as an equivalent prior sample. Quoting Winkler (1967, p. 779), "you have some knowledge concerning University of Chicago students. Can you determine two numbers r and n such that your knowledge would be roughly equivalent to having observed exactly r males in a random sample of n University of Chicago students, assuming that you had very little knowledge about the sex of University of Chicago students before seeing this sample?" The prior distribution is taken to be a beta distribution with parameters r and n − r.
4. The fourth method is the probability density function (PDF) method, in which the expert specifies the most likely value of p, say p̂, and then assesses other points of the pdf of p. For example, Winkler (1967) asked the expert to give two values of p (one on each side of p̂) that he or she considers half as likely as p̂ and to also specify quantiles, which in this context are defined as points that divide the area under the graph of the pdf in specified proportions. (There is obviously strong similarity between the PDF method and the quantile method, because both make use of quantile assessments.) A nonparametric estimate of the expert's opinion may be obtained by asking the expert to draw a graph of his or her pdf, taking into account the assessments that he or she has given.
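In place of Winkler's table, the fitting step of the quantile method can be done numerically. The sketch below is a minimal illustration with invented quartile assessments; the optimizer and starting values are arbitrary choices, not part of any published method.

```python
import numpy as np
from scipy import optimize, stats

elicited = {0.25: 0.45, 0.50: 0.53, 0.75: 0.60}  # assumed expert quartiles for p

def loss(log_params):
    """Squared distance between beta quantiles and the elicited values."""
    a, b = np.exp(log_params)  # keep the beta parameters positive
    q = stats.beta.ppf(list(elicited.keys()), a, b)
    return np.sum((q - np.array(list(elicited.values()))) ** 2)

res = optimize.minimize(loss, x0=np.log([2.0, 2.0]), method="Nelder-Mead")
a, b = np.exp(res.x)
print(a, b, stats.beta.ppf([0.25, 0.5, 0.75], a, b))  # fitted quartiles
```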
All four of these methods tend to produce prior distributions that are unrealistically "tight" (i.e., they have variances that are too small). For example, Schaefer and Borcherding (1973) conducted an experiment in which 22 subjects used the EPS and quantile methods to quantify their opinions about various proportions. Before any training, each subject assessed 50% central credible intervals for 18 different proportions using each method. The proportion of these intervals that contained true values was 15.7% for the EPS method and 22.5% for the quantile method. The task in the HFS method, revising opinion in the light of additional data, is similar to the task in the "bookbag and poker chips" experiment, where conservatism has a marked effect. Insufficient revision of opinion when given the hypothetical data would result in a distribution that is too tight. Winkler (1967) felt that conservatism also influenced the EPS method, hypothesizing that subjects equated their knowledge to too large a sample size, through not realizing the value of sample information. The quantile method tends to yield distributions that are again too tight, but slightly less tight than the PDF method and much less tight than the HFS and EPS methods (Winkler 1967). On this basis, the quantile method seems preferable, and it also seems to be the method of choice when judged by scoring rules. (Scoring rules are discussed in Section 4.3.) Some experiments have examined the effect of training and found that the bias of tightness was reduced for all three methods with particularly marked improvement for the EPS method (Stael von Holstein 1971; Schaefer and Borcherding 1973).
Elicitation methods have also been devised that avoid direct questions about p (which is not an observable quantity) and instead ask questions about sampling distributions, such as "the number of students who would be male in a random sample of 20 University of Chicago students." Chaloner and Duncan (1983) used this form of question in a variant of the PDF method. The expert states the most likely number of males in the sample, say x, and then assesses the relative likelihood of x males rather than x − 1 males, and of x males rather than x + 1 males. Assessments are used to estimate the parameters of a beta distribution, and the implied endpoints of the shortest 50% predictive interval are then calculated. These endpoints are given as feedback to the expert, who comments on the length of the interval. If it is found to be too short or too long, then the parameter estimates are revised repeatedly until he or she is satisfied with its length. The process is repeated for a variety of sample sizes, and the resulting parameter estimates are amalgamated in some way. Gavaskar (1988) also used questions about the sampling distribution in a variant of the HFS method. The expert first specifies the most likely number of males in a random sample of some specified size (as in the method of Chaloner and Duncan), then revises his or her assessment after being given hypothetical sample data, as in the HFS method. Assessments are again used to determine the parameters of a beta distribution. This is done for a variety of sample sizes. Interestingly, Gavaskar (1988) conducted a computer simulation to assess the sensitivity of parameter estimates to errors in an expert's assessments. He compared his method with that of Chaloner and Duncan (1983) and found that his method is much less sensitive. However, in a simulation study of this nature it is difficult to choose appropriate error distributions for the different methods, and Gavaskar may have underestimated the magnitude of errors induced by hypothetical samples.
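The feedback step in Chaloner and Duncan's method lends itself to a short sketch. Assuming current beta parameter estimates and a sample size (all values below are invented), the predictive distribution for the number of males is beta-binomial, and the shortest 50% predictive region can be found by accumulating the most probable counts (for a unimodal pmf this region is an interval):

```python
import numpy as np
from scipy import stats

alpha, beta, n = 6.0, 4.0, 20    # assumed current fit and sample size
counts = np.arange(n + 1)
pmf = stats.betabinom.pmf(counts, n, alpha, beta)

order = np.argsort(pmf)[::-1]    # counts, most probable first
cum = np.cumsum(pmf[order])
m = np.searchsorted(cum, 0.5) + 1        # smallest set covering 50%
region = np.sort(counts[order[:m]])
print(region[0], region[-1])     # interval endpoints fed back to the expert
```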
Turning to multiple linear regression, we suppose that the model is Y = x′β + ε, where ε ∼ N(0, σ²). The usual conjugate prior distribution specifies that σ² equals ωδ times the reciprocal of a chi-squared random variable with δ degrees of freedom and that, given σ, β has a normal distribution with mean b and variance-covariance matrix σ²R. For this prior distribution, ω, δ, b, and R are the hyperparameters that must be determined from the expert's assessments.
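For concreteness, a small sketch of this conjugate prior is given below; the hyperparameter values are invented, and drawing from the prior makes its two-stage structure explicit.

```python
import numpy as np

rng = np.random.default_rng(0)
omega, delta = 2.0, 5.0                 # assumed scale and degrees of freedom
b = np.array([1.0, 0.5])                # assumed prior mean of beta
R = np.array([[1.0, 0.2], [0.2, 2.0]])  # assumed prior scale matrix

def sample_prior():
    """One draw: sigma^2 = omega*delta / chi^2_delta, then beta | sigma."""
    sigma2 = omega * delta / rng.chisquare(delta)
    beta = rng.multivariate_normal(b, sigma2 * R)
    return beta, sigma2

print(sample_prior())
```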
To quantify opinion about these quantities, Zellner (1972) suggested questioning an expert about the regression coefficients. Some experts may be able to think of regression coefficients directly, but, as noted earlier, it is usually better to question people about observable quantities, such as Y, than to ask direct questions about unobservable quantities, such as regression coefficients. Toward this end, elicitation methods generally ask the expert about observations Y_1, . . . , Y_m at values x_1, . . . , x_m of x. But this has the possible disadvantage that uncertainty about Y results both from the expert's uncertainty about the values of the regression coefficients and from random variation. To separate these sources of uncertainty, Garthwaite and Dickey (1988) asked questions about Ȳ_i, the average value of Y if a large number of observations were taken at a single value x_i, arguing that averages are quantities to which people can relate and that an expert can give assessments about Ȳ_i without the need to consider random error. We briefly outline the elicitation methods that have been proposed by Kadane et al. (1980), Oman (1985), and Ibrahim and Laud (1994), all of which question an expert about Y, and the method of Garthwaite and Dickey (1988). In this section we refer to these articles as KDWS&P, Oman, I&L, and G&D. Let X = (x_1, . . . , x_m)′. We refer to x_i as a design point and X as a design matrix. To estimate the prior mean vector b, at each design point KDWS&P elicited the median of Y, G&D elicited the median of Ȳ, Oman elicited the mean of Y, and I&L elicited the expert's "best guess" (or prior prediction) of Y. Denoting the vector of assessments by y, the methods equate b to (X′X)^{−1}X′y, which would be the least squares estimate if y were a vector of observations at X. (Oman also offered two less attractive alternative methods of eliciting b: asking the expert to assess b directly and asking the expert to specify prior expected covariances.) With each elicitation method, the distribution of Y at any design point is unimodal and symmetric, so that in principle it should not matter which point estimate of Y is assessed. In practice, however, the distribution of Y may be skew. Then the (unanswered) question arises as to which feature of a skew distribution should correspond to the point of symmetry of a symmetric distribution that is used to represent it.
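A minimal sketch of this fitting step, with an invented design and invented assessed medians, follows.

```python
import numpy as np

X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])  # design matrix
y = np.array([2.1, 2.9, 4.2, 4.8])  # assumed elicited medians of Y at each x_i

b, *_ = np.linalg.lstsq(X, y, rcond=None)  # solves (X'X) b = X'y
print(b)                                   # prior mean vector for beta
```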
To elicit ω and δ, G&D and I&L elicited assessments that depend only on experimental error. In the approach of G&D, the expert is asked to suppose that two observations are taken at the same design point. It is pointed out to the expert that the observations will not be identical because of random variation, and the median of their absolute difference is elicited. The expert is then given a hypothetical datum and states the updated median of the absolute difference. The approach of I&L uses assessments of the mean and variance of the precision (σ^{−2}) to determine ω and δ, and in related work (Laud and Ibrahim 1995), I&L used assessments of the median and the 95th percentile of the distribution of the precision. Oman did not elicit ω and δ, and restricted his posterior analysis to inferences that depend only on a point estimate of σ², which he obtained using empirical Bayes methods. KDWS&P elicited δ by asking the expert to assess the median (y_.50), upper quartile (y_.75), and 93.75th percentile (y_.9375) at a design point. (y_.9375 is elicited through repeated bisections: y_.50 → y_.75 → y_.875 → y_.9375.) The ratio (y_.9375 − y_.50)/(y_.75 − y_.50) depends only on δ and hence provides an estimate of it. KDWS&P repeated the assessments at several design points and averaged the estimates of δ that each set of assessments yielded. The method used by KDWS&P to elicit ω is complex; see that article for details. A central task is to ask the expert to suppose that two independent observations are taken at a specified design point. The expert assesses the median of one of the observations and then, given a hypothetical value for that observation, assesses the conditional median of the other observation.
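The ratio-based assessment of δ can be inverted numerically. The sketch below assumes invented quantile assessments at one design point; the root-bracketing interval is chosen for illustration.

```python
from scipy import optimize, stats

y50, y75, y9375 = 10.0, 12.0, 15.5  # assumed assessments at one design point
ratio = (y9375 - y50) / (y75 - y50)

def gap(delta):
    """For a t-distribution the quantile-gap ratio depends only on delta."""
    return stats.t.ppf(0.9375, delta) / stats.t.ppf(0.75, delta) - ratio

delta = optimize.brentq(gap, 1.0, 200.0)  # bracket assumed wide enough
print(delta)
```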
All of the aforementioned methods may be criticized for using assessment tasks that people are not very good at performing. Direct questions about the distribution of a precision (I&L) are surely hard to answer; G&D used conditional assessments that will be biased by conservatism and also elicited only the minimum number of assessments needed to determine the hyperparameters; KDWS&P found that extreme values of δ were not uncommon with their method. Because δ is a difficult hyperparameter to assess, it is a good idea to elicit more than one estimate of it and then reconcile the different estimates in some way. If estimates of δ are to be combined arithmetically, then empirical evidence favors taking their geometric mean rather than their average (Al-Awadhi and Garthwaite 1998).
The most complicated hyperparameter to elicit is the variance-covariance matrix, σ 2 R; this may contain a large number of elements that each need to be elicited, and the matrix itself must be positive definite. In KDWS&P, a crucial step in assessing this matrix is the elicitation of a variance-covariance matrix for a multivariate t-distribution. We next describe that part of their method in some detail, because it can be useful in a variety of elicitation problems. Also, it requires sophisticated mathematics and statistics, so it illustrates that statisticians need to be involved in the development of elicitation methods; in the past some statisticians have suggested that their development should be left to psychologists.
Suppose that Y = (Y_1, . . . , Y_m)′ has a multivariate t-distribution on δ degrees of freedom, Y ∼ t_δ(a, P). Then P is referred to as the spread of Y, S(Y), and if δ > 2, then the variance of Y equals [δ/(δ − 2)]P. (The variance does not exist if δ ≤ 2.) To elicit P, KDWS&P proceeded as follows.
The expert assesses the median and upper quartile of Y_1 (the median is denoted by y*_1) and then, for i = 2, . . . , m, the conditional median and upper quartile of Y_i given that Y_1 = y^0_1, . . . , Y_{i−1} = y^0_{i−1}, where the conditioning values y^0_i are constructed from the expert's earlier answers as described below. In elicitation methods, the most widely used method of estimating variances or spreads from quartile assessments is to divide the assessed interquartile range or semi-interquartile range by the corresponding range of a standard distribution. Let t_[δ,.75] denote the semi-interquartile range of a univariate t-distribution on δ degrees of freedom. The variance of Y_i does not exist if δ ≤ 2, so KDWS&P determined spreads, estimating S(Y_1) = {(y_{1,.75} − y*_1)/t_[δ,.75]}² and, for i = 1, . . . , m − 1, the conditional spreads S(Y_{i+1} | y^0_1, . . . , y^0_i) from the corresponding conditional semi-interquartile ranges in the same way. The order of the conditional assessments enables the conditioning to be based on the expert's earlier answers; y^0_1 is set equal to y*_1 + (1/2){S(Y_1)}^{1/2}, and for i = 2, . . . , m − 1, y^0_i is set equal to the conditional median (y_{i,.50} | y^0_1, . . . , y^0_{i−1}) plus (1/2){S(Y_i | y^0_1, . . . , y^0_{i−1})}^{1/2}. An iterative method is used to calculate P. Let P_i denote the spread matrix of Y_1, . . . , Y_i, and put P_1 = S(Y_1). Equations given by KDWS&P determine P_{i+1} after P_1, . . . , P_i have been estimated, using the conditional medians and spreads; the procedure stops when P_m = P has been obtained. Results of KDWS&P show that P is certain to be a positive-definite matrix.
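A sketch of the semi-interquartile-range device, with invented values; the division and squaring reflect the fact that a spread, like a variance, scales with the square of the quantile gaps.

```python
from scipy import stats

delta = 8.0                     # degrees of freedom (e.g., from the ratio step)
y50, y75 = 10.0, 12.0           # assessed median and upper quartile of Y_1
t75 = stats.t.ppf(0.75, delta)  # semi-interquartile range of the standard t
S1 = ((y75 - y50) / t75) ** 2   # estimated spread S(Y_1)
print(S1)
```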
A general feature of the method of KDWS&P is that for every hyperparameter, it elicits more assessments than necessary, then uses some form of averaging to obtain hyperparameter estimates. This is obviously sound practice. Also, the largest deviations from the averages are reported to the expert, so that he or she can judge the extent to which some of his or her answers may be in error and require changing. The x-values used in the elicitation method will affect the quality with which expert opinion is captured; Kadane and Wolfson (1998) have suggested a procedure for their selection.
Oman did not attempt to relate R to the expert's opinion. Instead, he chose design points that cover the region of interest and that gave a design matrix X that is as near orthogonal as possible. He then set R equal to τ(X′X)^{−1} and estimated τ using empirical Bayes methods. I&L were concerned with the analysis of designed experiments. They let X be the design matrix for which data will be gathered and, like Oman, assumed that R = τ(X′X)^{−1}. The hyperparameter τ was then chosen to reflect the weight that should be attached to the expert's opinion, relative to the weight that should be attached to the data from the experiment. Hence, I&L chose R partly to reflect the expert's knowledge, but their approach is pragmatic and is not a serious attempt to determine the prior variance of β. (Otherwise, their approach implies that prior opinion about the relationship between Y and the x-variables depends on the experiment to be conducted.) G&D developed an alternative method of eliciting P, the spread of Y. It is based on a novel assessment task that requires the expert to first select the design point at which he or she can predict Y most accurately, and then repeat this task several times with an increasing set of restrictions on the x-values that he or she can choose. The method was developed for the use of industrial chemists and exploits their experience in choosing design points to conduct experiments. It is not as flexible as the method of KDWS&P, in that it cannot be used with polynomial regression or with x-variables that are factors. However, it can be extended to elicit prior distributions that are suitable for variable selection problems (Garthwaite and Dickey 1992).
A feature common to the methods of KDWS&P and G&D is that a structured set of sequential questions is used to ensure that P is a positive-definite matrix.
A drawback of the KDWS&P method is that for many of its tasks, the expert must update his or her beliefs on the basis of hypothetical data, so assessments are likely to be biased by conservatism. Al-Awadhi and Garthwaite (1998) suggested a modification whereby the diagonal elements of P are estimated from unconditional assessments. This modification involves scaling P in such a way that estimates of correlations are unchanged, so P is still certain to be a positive-definite matrix while the impact of conservatism should be reduced. Al-Awadhi and Garthwaite also suggested estimating the spreads of univariate distributions from assessments of both the lower and upper quartiles (KDWS&P used medians and upper quartiles), so that an estimated spread reflects both halves of the subjective distribution. Eliciting both quartiles also enables marked asymmetry to come to light.
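The rescaling idea can be sketched in a few lines. Here P and the unconditionally assessed spreads s are invented; the congruence transform keeps the correlations and the positive definiteness intact because the diagonal scaling matrix is nonsingular.

```python
import numpy as np

P = np.array([[4.0, 1.0], [1.0, 2.0]])  # spread matrix from conditional assessments
s = np.array([6.0, 2.5])                # spreads from unconditional assessments

D = np.diag(np.sqrt(s / np.diag(P)))
P_star = D @ P @ D                      # diagonal now equals s; correlations unchanged
print(P_star)
```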
Turning to elicitation methods for other sampling models, Al-Awadhi and Garthwaite (1998) gave a method of eliciting a conjugate prior distribution for sampling from a multivariate normal distribution. This work was extended by Garthwaite and Al-Awadhi (2001) so as to elicit a more flexible nonconjugate prior distribution. The methods follow the approach of KDWS&P to assess spread matrices (with the modifications mentioned earlier) and follow the approach of G&D to elicit degrees of freedom parameters. The method of KDWS&P was also exploited by Dickey et al. (1986) to develop assessment methods for matrix-t models, and a closely related method was used by Garthwaite and Al-Awadhi (2003) for logistic regression. An elicitation method for logistic regression was also given by Chen, Ibrahim, and Yiannoutsos (1999), using ideas similar to those of I&L. Chaloner and Duncan (1987) extended their method of quantifying opinion about a Bernoulli process (Chaloner and Duncan 1983) so as to elicit a Dirichlet distribution that represents prior opinion about a multinomial sampling model. Other models that have attracted attention include the proportional hazards model (Chaloner, Church, Matts, and Louis 1993), Weibull lifetime distributions (Singpurwalla and Song 1987), AR(1) time series models (Kadane, Chan, and Wolfson 1996), and ANOVA models (Black and Laskey 1989). Graphical feedback is an important component in the methods of Chaloner and Duncan (1983, 1987), Chaloner et al. (1993), and Garthwaite and Al-Awadhi (2003), and it seems to provide a potentially powerful means of improving the quality of assessed distributions.
Almost all of the foregoing methods represent expert opinion by some form of conjugate distribution. This approach has limitations, however, notably when the sampling model is a multivariate normal distribution. In a frequentist analysis for a multivariate normal distribution, the sample variance-covariance matrix is both an estimate of the population variance matrix and, after division by the sample size, an estimate of the variance of the sample mean. In the standard conjugate distribution, a single variance matrix again fulfills both of these roles. An experiment by Al-Awadhi and Garthwaite demonstrated that this is inappropriate. The experiment examined different forms of assessment task and compared alternative ways of estimating hyperparameters. To quantify opinion about the vector of means, it proved preferable to ask directly about the means rather than about individual observations, whereas to quantify opinion about the variance matrix, it was better to ask about deviations from the mean. One alternative to the conjugate distribution is to assume that the population mean and variance are independent in the prior distribution, an approach followed by Garthwaite and Al-Awadhi (2001).
Elicitation methods for broader classes of problem have also been proposed. Bedrick, Christensen, and Johnson (1996) suggested a method for generalized linear models in which the predictive distributions at different design points are elicited and then combined to form a prior distribution. For convenience, Bedrick et al. (1996) mainly considered the case where the predictive distributions are independent of each other, so that combining them is straightforward. They discussed the properties of their method and showed its similarities to data augmentation. Clemen and Reilly (1999) considered the general problem of constructing a joint prior distribution for several hyperparameters. They used a copula to form the joint distribution from an expert's subjective judgments of marginal distributions and correlations. No restrictions are placed on the marginal distributions, and dependence between the marginals is modeled by the copula that underlies a multivariate normal distribution. People are not good at assessing correlations; Clemen and Reilly addressed this problem by discussing various techniques for their assessment. They reported a small empirical study that forms part of the basis for their well-informed views; Clemen et al. (2000) reported a larger study of methods for assessing correlations, as discussed in Section 1.2.3.
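A brief sketch of the copula construction follows, with invented marginals and an invented correlation; the techniques that Clemen and Reilly discussed for assessing the correlation itself are not shown.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
corr = np.array([[1.0, 0.6], [0.6, 1.0]])  # elicited correlation (assumed)

# Normal copula: normal dependence structure, arbitrary elicited marginals.
z = rng.multivariate_normal(np.zeros(2), corr, size=1000)
u = stats.norm.cdf(z)                              # uniform marginals
x1 = stats.gamma.ppf(u[:, 0], a=3.0, scale=2.0)    # assumed marginal for X1
x2 = stats.beta.ppf(u[:, 1], 2.0, 5.0)             # assumed marginal for X2
print(np.corrcoef(x1, x2)[0, 1])                   # induced correlation
```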

Nonparametric Fitting
Any fitting of a parametric distribution to the expert's stated summaries implies assumptions about the form of the expert's underlying probability distribution. Although the distributional form may be acceptable to the expert, he or she is rarely in a position to question the assumptions critically. This is even more true in the case of multivariate distributions.
Just as there are many statisticians who are uncomfortable with parametric assumptions in modeling data, it is arguably preferable in elicitation not to make parametric assumptions. (Indeed, in the subjective Bayesian framework, the likelihood is just as much a judgment as the prior distribution, and in principle should also be formally elicited.) A number of nonparametric approaches are possible in elicitation.
One such approach is simply to decline to fit a distribution at all and use just the expert's stated summaries and nothing more. It is then necessary to find ways to make use of this limited specification of beliefs. In the context of eliciting a prior distribution for a Bayesian analysis, Bayes linear methods have been advocated by Goldstein (1999). The Bayes linear approach is based on eliciting only first- and second-order moments (i.e., means, variances, and covariances). The underlying theory is a Bayesian analog of the Gauss-Markov theorem, but advocates of the Bayes linear approach have developed a substantial body of methodology to facilitate applications. Bearing in mind the difficulties, noted earlier, in asking experts to assess moments, the Bayes linear approach places higher demands on the statistical understanding of the expert, or else requires a substantial training input. Berger and O'Hagan (1988) effectively allowed the expert's prior distribution for a single unknown parameter to be any unimodal distribution having specified quantiles. They computed the range of posterior inferences over this range of prior distributions for given data. This is a fully nonparametric approach, although it may be difficult to implement in more complex situations. It also allows distributions that the expert would easily be able to reject as inaccurate representations of his or her knowledge.
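Returning to the Bayes linear idea, a minimal sketch of the standard adjustment formulas is given below, with invented moment assessments: beliefs about B are adjusted by an observation of D via E_D(B) = E(B) + Cov(B, D)Var(D)^{−1}(D − E(D)).

```python
EB, ED = 5.0, 10.0   # elicited prior means of B and D (assumed values)
varB, varD = 4.0, 9.0
covBD = 3.0          # elicited covariance between B and D
d = 13.0             # observed value of D

adjusted_mean = EB + covBD / varD * (d - ED)
adjusted_var = varB - covBD ** 2 / varD   # resolved uncertainty removed
print(adjusted_mean, adjusted_var)
```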
A more recent approach is that of Oakley and O'Hagan (2002), which adopts the framework of modern Bayesian nonparametrics. The expert's beliefs about the random variable X are supposed to be represented by a probability density function f(x). To the facilitator, f is an unknown function. The facilitator has a prior distribution for f, and this is updated to a posterior distribution after "observing" the "data" comprising the expert's summary values. It is worth looking at this formulation in a little more detail.
Because f is a function, the facilitator's prior is a distribution over the space of possible values of the whole function. Formally, Oakley and O'Hagan (hereafter O&O) supposed that f has a Gaussian process prior distribution. Its prior expectation is a member g(x; θ) of some parametric family indexed by θ. The prior variance of f(x) is g(x; θ)²σ², and the correlation between f(x) and f(x′) is a decreasing function of the distance between x and x′. The hyperparameters θ and σ² are given weak prior distributions. The method is nonparametric because f is not assumed to be exactly a member of any parametric family, and indeed it is allowed to take any form. Nevertheless, the facilitator expects the expert's true density function f to be close, in some sense, to some member of the parametric family defined by g. The closeness is governed by the hyperparameter σ², which O&O learn about from the expert's summary "data." The correlation in the Gaussian process ensures that departures of f(x) from g(x; θ) are smooth.
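The following is a much-simplified sketch of this setup, not O&O's algorithm: it treats a few assessed density ordinates as direct, error-free observations of f, fixes θ and σ² rather than giving them prior distributions, and computes the Gaussian process posterior mean. O&O instead condition on elicited quantiles, which involve integrals of f. All numbers are invented.

```python
import numpy as np
from scipy import stats

def g(x):  # prior mean of f: a normal density with assumed theta
    return stats.norm.pdf(x, loc=0.0, scale=1.5)

def k(x, y, sig2=0.05, ell=0.8):  # prior covariance: var f(x) = sig2 * g(x)^2
    return sig2 * g(x)[:, None] * g(y)[None, :] * \
           np.exp(-0.5 * ((x[:, None] - y[None, :]) / ell) ** 2)

x_obs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])      # assessment points
f_obs = np.array([0.10, 0.25, 0.20, 0.25, 0.10])   # assumed density ordinates

x_new = np.linspace(-4, 4, 9)
K = k(x_obs, x_obs) + 1e-8 * np.eye(len(x_obs))    # jitter for stability
w = np.linalg.solve(K, f_obs - g(x_obs))
f_mean = g(x_new) + k(x_new, x_obs) @ w            # GP posterior mean for f
print(np.round(f_mean, 3))
```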
Although in this approach the facilitator does not constrain the expert's distribution to fit the parametric family, it is clear that the facilitator is supplying some information through the Gaussian process prior distribution. There is the belief that f should be "close" to g, and that it should be a "smooth" density function. Technically, from the assumption of a Gaussian process, the facilitator has a normal distribution for each f(x). Strictly speaking, this cannot be completely realistic, because f(x) must be nonnegative, but the normality is important for tractability of the O&O approach, and O&O argued that it is relatively innocuous.
A significant benefit of the O&O approach is that it yields a complete posterior distribution for f. The posterior mean can be considered an estimate of f based on the expert's summaries, and hence as the elicited probability distribution. But O&O then have a posterior distribution that quantifies the possible inaccuracy of the elicitation, as called for by Dickey (1980). They gave the following synthetic example in which the expert's true probability distribution is a bimodal mixture of two normal distributions. The parametric family g is the normal family. Figure 2 shows the posterior median and 95% pointwise credible intervals for f based on seven elicited quantiles from the expert. We see that, despite believing initially that the expert's distribution would be similar to a normal distribution, the facilitator's posterior distribution accurately reproduces the expert's true bimodal f.

TESTING ADEQUACY OF ELICITATION
In view of the many practical difficulties of elicitation, how can one know whether the elicited distribution is in any sense an adequate representation of the expert's knowledge? Before addressing this question, we should consider whether there is, in some sense, a "true" representation. Does the expert have a true personal probability distribution for the uncertain quantities? Winkler (1967, p. 778) noted that "the assessor has no built-in prior distribution which is there for the taking. That is, there is no 'true' prior distribution. Rather, the assessor has certain prior knowledge which is not easy to express quantitatively without careful thought. An elicitation technique used by the statistician does not elicit a 'true' prior distribution, but in a sense helps to draw out an assessment of a prior distribution from the prior knowledge. Different techniques may produce different distributions because the method of questioning may have some effect on the way the problem is viewed." In contrast, O'Hagan (1988) explicitly defined "true" probabilities as those that would result if the expert were capable of perfectly accurate assessments of his or her own beliefs. O'Hagan regarded different "stated" probabilities that might result from different elicitation methods as more or less inaccurate attempts to specify the expert's underlying "true" probabilities. This differs from Winkler's position, which seems to be that the results of different elicitations are all assessments of slightly different probabilities. A possible reconciliation is that a "true" distribution would be the result of a method that leads the expert to view the problem from as complete and unbiased a perspective as possible through appropriate use of cognitive tools.
In this section we first consider how to test the internal consistency of the expert's statements, together with any assumptions made by the facilitator. We then discuss assessing the adequacy of the elicitation, in terms of whether the acknowledged inaccuracies in the elicitation process matter.

Internal Consistency
A system of probability statements is coherent if the probabilities are all consistent with the laws of probability. If, for instance, an expert states that P(E) = .4, P(F) = .3, and P(E or F) = .6, when E and F are mutually exclusive events, then these probabilities are incoherent. One way to check the quality of an expert's statements is for the facilitator to ask for sets of probability assessments that allow tests of coherence. We must expect that the expert's elicited statements will fail coherence tests. This is almost inevitable in view of the imprecision with which the expert can make these judgments. The question then arises of how we should reconcile the internal inconsistency of the elicited statements.
In the case of an incoherent set of individual probabilities, as in the example of mutually exclusive events, the simplest answer is to confront the expert with the inconsistency and to invite him or her to revise one or more of the stated values. In general, we should expect this revision to improve the expert's assessments.
A careful examination of reconciling incoherent probability assessments was given by Lindley, Tversky, and Brown (1979, hereafter LT&B). In their approach, reconciliation is done by the person whom we have called the facilitator. The facilitator takes a view of the accuracy with which the expert will have been able to assess the stated probabilities. Thus, in the example of mutually exclusive events, the facilitator needs to formulate a joint probability distribution, p(e, s), for the expert's underlying true probabilities, e = (P(E), P(F)), and for the expert's stated probabilities, s, for the three events E, F, and "E or F." The facilitator's beliefs about the expert's true probabilities would then be expressed by p(e | s = (.4, .3, .6)). In practice, LT&B envisage the facilitator formulating the joint distribution via a prior distribution p(e) for e and a "likelihood," p(s|e), for the expert's assessment errors, so that the facilitator's posterior distribution, p(e | s = (.4, .3, .6)), is derived by Bayes' theorem. LT&B referred to this solution as the "internal approach" and contrasted it with an "external approach" that we consider in Section 5.1.
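A toy, grid-based version of this internal approach for the example above can make it concrete. The uniform prior on e, the normal error model, and the error standard deviation are all illustrative assumptions, not LT&B's specification.

```python
import numpy as np
from scipy import stats

grid = np.linspace(0.01, 0.99, 99)
E1, E2 = np.meshgrid(grid, grid, indexing="ij")
prior = (E1 + E2 <= 1.0).astype(float)       # uniform prior on the simplex

s, sd = (0.4, 0.3, 0.6), 0.05                # stated probabilities; assumed error sd
like = (stats.norm.pdf(s[0], E1, sd) *
        stats.norm.pdf(s[1], E2, sd) *
        stats.norm.pdf(s[2], E1 + E2, sd))   # stated "E or F" vs. true e1 + e2

post = prior * like
post /= post.sum()
print((post * E1).sum(), (post * E2).sum())  # posterior means of P(E), P(F)
```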
As in the simpler method of asking the expert to revise his or her own probabilities, this reconciliation can lead to more accurate assessments. However, the improvement is now formally expressed by the familiar Bayesian property that the facilitator's posterior distribution of e will generally be more concentrated than the prior distribution.
In the context of eliciting a parametric probability distribution, sometimes only as many summaries are elicited as are required to identify unique values of the required hyperparameters. It is then usually the case that any set of elicited summaries will be consistent with the fitted distribution, and hence with each other. It is therefore not possible to find any noncoherence. However, if the facilitator asks for at least one more summary from the expert, then it becomes possible to test for coherence. This has been called overfitting.
Note, however, that inconsistent statements from the expert may simply indicate that the expert's distribution cannot be adequately represented by a member of the chosen family. For instance, if the expert's beliefs are sought concerning a proportion π, then the facilitator might choose to work with the assumption of a beta distribution. Then if the expert specifies that the median is .4, the lower quartile is .3, and the upper quartile is .5, no beta distribution will fit these specifications exactly. The Be(4.733, 7.1) distribution fits the median and lower quartile, but has an upper quartile of .494, whereas the Be(4.16, 6.24) distribution fits the median and upper quartile but has a lower quartile of .293. But it is unreasonable to expect the expert to specify these quantiles to such accuracy, and in practice the assessments would not be seen as challenging the assumption of a beta distribution. Instead, a compromise such as Be(4.4, 6.6), with lower quartile .296 and upper quartile .497, clearly fits the elicited values very well.
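The compromise can be checked in one line; the quantiles quoted above are reproduced (up to rounding) by:

```python
from scipy import stats

# Lower quartile, median, and upper quartile of the compromise Be(4.4, 6.6).
print(stats.beta.ppf([0.25, 0.50, 0.75], 4.4, 6.6))  # approx. .296, .40, .497
```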
If, on the other hand, the expert had specified a lower quartile of .2, then the beta family assumption is called into question. The expert's distribution appears to be negatively skewed but with a median below .5, two properties that are together inconsistent with a beta distribution. In general, given a set of expert statements that is larger than needed to identify unique hyperparameters, we can choose an elicited distribution that in some sense fits the elicited statements as closely as possible; an example of this was given by O'Hagan (1998). The quality of fit can be seen as indicating the accuracy of the elicitation, whereas a sufficiently poor fit casts doubt on the assumed family of distributions.
In the case of parametric elicitation, then, overfitting has the potential to either refine the specification of hyperparameters or refine the assumed distributional family. Notice, however, that to decide between these two options, the facilitator needs to have some idea of the accuracy of the expert's judgments. Given that this kind of judgment by the facilitator is required, it may be that an extension of the approach of LT&B might be developed for this case, but we are not aware of any published work in this direction.
O&O's nonparametric elicitation method of Section 3.3 also adopts essentially the approach of LT&B, in the sense that the facilitator derives a posterior distribution for the expert's underlying density function, using the elicited statements as data. Note, however, that O&O treated the expert's summaries as error-free, and so they did not consider an analog of the likelihood p(s|e).
An idea similar to overfitting is feedback. For instance, in a parametric elicitation when we have elicited enough summaries to fit a unique member of the chosen family, instead of eliciting one or more further summaries, the facilitator informs the expert of the values of those summaries that are implied by the expert's statements so far (and the assumed distributional form). In general, feedback entails displaying the implications of other statements and inviting the expert to confirm or deny that these are reasonable expressions of his or her beliefs. Whereas overfitting will almost invariably expose inconsistencies in the expert's statements, feedback often simply results in the expert confirming the implied values. As such, overfitting is generally preferred, but feedback can be very useful to show complex implications, such as displaying the fitted density function graphically.
It is also worth noting that the expert will often make qualitative statements during elicitation that can be checked informally against quantitative summaries. For instance, the expert may appear uncertain and have difficulty specifying a credible interval, and yet may actually give a narrower interval than in another task where he or she informally indicated more certainty. The facilitator should be alert for any opportunity to assist the expert by checking the consistency of his or her opinions, whether expressed or implied.

Fitness for Purpose
Although overfitting and coherence checking have the potential to improve the elicitation process, appreciable imprecision will inevitably remain in the elicited summaries and in the fitted distribution. Whether this imprecision matters depends on the purpose for which the elicitation is performed.
Recognition that an elicited prior distribution does not necessarily reflect the expert's knowledge accurately has led to quite widespread use of sensitivity analysis in Bayesian statistics (O'Hagan and Forster 2004, chap. 8). This may involve varying the hyperparameters of a parametric fit or performing a more sophisticated variation of all aspects of the distribution. The general thrust of the robust Bayesian movement was to allow the true prior distribution to lie in a nonparametric class of distributions containing the elicited distribution, and then to derive bounds on posterior inferences as the prior varies across the class of possible priors. Berger (1994) reviewed this body of research. A more common use of sensitivity analysis in practical Bayesian analyses has been simply to explore, in an ad hoc way, a small number of alternative prior distributions.
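An ad hoc exploration of this kind is easy to sketch. Below, invented data (12 successes in 30 trials) are combined with an elicited beta prior and two arbitrary variants of it, using the beta-binomial conjugacy.

```python
s, n = 12, 30                                      # invented data: successes, trials
for a, b in [(4.4, 6.6), (3.5, 5.5), (5.5, 8.0)]:  # an elicited fit and two variants
    post_mean = (a + s) / (a + b + n)              # mean of Beta(a + s, b + n - s)
    print((a, b), round(post_mean, 4))             # how much does the answer move?
```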
Whether the elicitation is done to obtain a prior distribution for some Bayesian analysis, obtain expert judgments for inputs of some decision model, or for some other purpose, sensitivity analysis has the same objective. It is to determine whether, when the elicited distribution is varied to other distributions that might also be consistent with the expert's knowledge, the results derived from that distribution change appreciably. If not, then the elicitation has adequately represented the expert's knowledge.
How can we determine whether the result changes "appreciably" as the elicited distribution changes? There is a clear answer to this question when the result is a decision that is to be made optimally with respect to some utility function. Then it is not the change in the decision that matters, but rather the change in expected utility. We consider the difference between, on the one hand, the expected utility obtained by the optimal decision with respect to the elicited prior and, on the other hand, the maximum expected utility that can be obtained by the optimal decision with respect to any other distribution in the class. This difference represents the potential gain in expected utility that might be obtained by more careful elicitation (see Kadane and Chuang 1978; Chuang 1984).
The problem with any sensitivity analysis is to specify the class of distributions. If we allow the distribution to vary more from the elicited distribution, then we can expect greater discrepancies in the results. The classes of priors used in robust Bayesian analysis are arbitrary and not based on analysis of the elicitation process. The "internal approach" of LT&B and the method of O&O both yield the facilitator's posterior distribution for the expert's underlying probabilities or density function. They therefore provide formal expression of the uncertainty around the particular elicited distribution, which can in principle form the basis for subsequent sensitivity analysis. Their formal structures are more complex to apply, but otherwise there seems no alternative to the kind of informal ad hoc sensitivity analysis most commonly used.

Scoring Rules
In empirical work, probability distributions may be elicited for uncertain quantities whose actual values are known to the experimenters. In other circumstances, such as weather forecasting, predictive distributions may be assessed for quantities whose values become known subsequently. In both cases it can be useful to compare an assessed probability distribution with the observed data to provide an objective measure of its accuracy. This is the purpose of a scoring rule.
Formally, a scoring rule is a formula for awarding a score to the expert, which can be thought of as a reward. It is a function both of the expert's elicited probability distribution for the uncertain quantity and of that quantity's true value.
One common application of scoring rules is in the comparison of alternative elicitation methods or different variants of an elicitation method. In empirical research, one elicitation method is often judged to be better than another if it gets better scores. Note, however, that better scores result both from the expert assessing his or her beliefs more accurately and from the expert having more (or more accurate) knowledge.
Other purposes of a scoring rule are to provide an incentive for experts to record their opinions well and to help train experts to quantify their opinions accurately. Toward this end, it is important that a scoring rule should encourage experts to record their true beliefs. More precisely, "the scoring rule is constructed according to the basic idea that the resulting device should oblige each participant to express his true feelings, because any departure from his own personal probability results in a diminution of his own average score as he sees it" (de Finetti 1962, p. 359). A scoring rule with this property is termed proper. Various proper scoring rules have been proposed; several, including those most commonly used, have been described by Matheson and Winkler (1976).
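Two standard proper scoring rules, the quadratic (Brier) score and the logarithmic score, can be sketched directly. The illustration below, with an invented "true" belief q = .7, shows that the expected score under either rule is maximized by reporting one's true probability.

```python
import numpy as np

def quadratic_score(p, x):
    """Brier-type score for stated probability p and outcome x in {0, 1}."""
    return 1.0 - (x - p) ** 2

def log_score(p, x):
    return np.log(p) if x == 1 else np.log(1.0 - p)

q = 0.7                                   # the expert's true belief (assumed)
for p in [0.5, 0.6, 0.7, 0.8]:            # candidate reports
    eq = q * quadratic_score(p, 1) + (1 - q) * quadratic_score(p, 0)
    el = q * log_score(p, 1) + (1 - q) * log_score(p, 0)
    print(p, round(eq, 3), round(el, 3))  # both expected scores peak at p = q
```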

Calibration
There is a large and somewhat murky literature on the subject of calibration. At its simplest, the idea of calibration is that a person's elicited probabilities may show a particular flaw: of the events to which the person assigns probability p, some proportion g(p) actually occur. The thought is then that when the person announces p as his or her probability of some event, the user of this information, knowing better, takes g(p) as his or her own probability (Lichtenstein et al. 1982). Such a program has the following flaw. Suppose that the person being elicited is faced with a coin that he or she believes to be fair, and hence announces p = 1/2 as the elicited probability of "tails." What values can g(1/2) take? Because the g(·) are supposed to be probabilities, and "tails" and "heads" are mutually exclusive and exhaustive, we must have g(1/2) + g(1/2) = 1, that is, g(1/2) = 1/2. Now suppose that there are three events that are equally likely in the view of the person being elicited. The same argument shows that g(1/3) = 1/3 and g(2/3) = 2/3. Indeed, this argument demonstrates that g(r) = r for every rational number r. An assumption of continuity or measurability of g then suffices to show that g(x) = x for all real numbers 0 < x < 1. Hence recalibration on this basis contradicts the coherence of either the pretransformation or post-transformation probabilities. Note that this argument does not apply to functions g that involve more than the elicited probabilities. For example, it is not a contradiction of coherence to think that a person may be overconfident in the sense that the probability (1/2) assigned to the interquartile range is too high, and hence the probability assigned to the tails is too low. Similarly, it might be noticed that a weather forecaster systematically overpredicts rain and hence underpredicts the event of no rain.
In practice, calibration is relevant where, as in the case of weather forecasters, experts regularly make similar probability statements, so that it is possible to check calibration and feedback is immediate, relevant, and frequent. Even without a formal calibration check, receiving regular feedback would tend to ensure that their forecasts are reasonably well calibrated.

Synthesizing Separate Elicitations
Where important decisions or inferences are to be made, it is common to wish to draw on the expertise of several experts. A number of approaches have been proposed for eliciting and synthesizing the different experts' knowledge. Formal methods of combining probability distributions have been reviewed by Genest and Zidek (1986), who provided a very useful annotated bibliography, and by French (1985), among others. We first consider the situation where the experts do not interact. Separate probability distributions are elicited from the experts in separate elicitation sessions. It is then natural to ask how we can synthesize these different distributions into a single distribution.
Two of the most popular methods fall into the category known as opinion pools. The linear opinion pool is a convex combination (a weighted average) of the individual probability distributions composing it, and the logarithmic opinion pool is a normalized weighted geometric mean (equivalent to applying a linear pool to the logarithms of the individual probability densities and then normalizing the result). An important property that an opinion pool might be expected to have is that it be externally Bayesian (Madansky 1978), meaning that when there is an agreed-on likelihood function, the opinion pool of the posterior distributions should coincide with the posterior distribution obtained from the opinion pool of the prior distributions. Except in trivial cases, the linear opinion pool fails to have this property, whereas the logarithmic pool does have it, when the weights sum to 1. However, a second property that we might require is invariance to event combination. Suppose, for instance, that we elicit the experts' probabilities for two mutually exclusive events, A and B. Letting C be the event 'A or B,' each expert's probabilities (assuming that they are coherent) satisfy P(C) = P(A) + P(B). Combination invariance would then require that the same property should hold for the pooled probabilities of A, B, and C. McConway (1981) showed that only the linear opinion pool satisfies a general marginalization criterion of this type. It is therefore not possible to find a mechanistic opinion pooling method that is externally Bayesian and also satisfies the marginalization criterion.
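Both pools are easy to state in code; the distributions and weights below are invented.

```python
import numpy as np

p1 = np.array([0.2, 0.5, 0.3])  # expert 1's probabilities over three outcomes
p2 = np.array([0.4, 0.4, 0.2])  # expert 2's probabilities
w = np.array([0.6, 0.4])        # weights summing to 1

linear = w[0] * p1 + w[1] * p2          # linear pool: weighted average
log_pool = p1 ** w[0] * p2 ** w[1]      # logarithmic pool: weighted geometric mean
log_pool /= log_pool.sum()              # ... normalized
print(linear, log_pool)
```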
Note that the logarithmic opinion pool also suffers from the fact that a single expert's opinion that a certain set has zero probability implies that the pool must also assign zero probability to that set. (See Genest and Zidek 1986 for a wide-ranging discussion of these issues.) Both linear and logarithmic pools allow the assignment of different weights to the experts, which can be used to give more weight to experts whose probability distributions are believed to be more accurate. Cooke (1991) described a method of choosing weights based on the experts' performance in assessing distributions for seed variables, quantities whose true value is known to the facilitator but not to the experts. Evidence that this produces better elicitation than simple equal weighting of the experts was presented by Cooke and Goossens (2000).
Mechanistic pooling methods can lead to a form of double counting of expertise if the knowledge of some of the experts overlaps substantially. Then it is inappropriate to weight them all equally with other experts, but the method of seed variables will also tend to overweight such a group.
Another criticism of all of these pooling methods is that it is not clear whose opinion (if anyone's) the resulting probability distribution represents. A quite different approach to putting multiple experts' opinions together is to imagine each opinion as data input to a single "supra Bayesian," who uses these opinions to update his or her views. This is the "external approach" of Lindley et al. (1979). It was proposed earlier by Morris (1974) and further developed by Lindley (1985), French (1985), and Genest and Schervish (1985); see also the discussion of Genest and Zidek (1986). This approach is completely Bayesian, but requires a very substantial elicitation of the supra Bayesian's opinions about the expert opinions to be pooled.

Group Elicitation Methods
We now consider approaches where the experts interact as a group. One simple and practical group elicitation approach is to bring the experts together to discuss the uncertain quantity or quantities about which their beliefs are to be elicited, and through this sharing of their expertise to seek a consensus view. In effect this treats the group as a single individual. Phillips (1999) presented a formal justification of this behavioral aggregation approach. There are, however, new psychological issues that arise in the interaction between the members of the group. This kind of group elicitation requires a knowledgeable and experienced facilitator who needs to be aware of the possibilities of strong personalities in the group having too much weight in the discussion, or of judgments based on overlapping experience being overweighted through being repeated in the discussions. It may also be that the pressure to reach consensus leads to the experts suppressing dissenting views, or, alternatively, it may not be possible to reach consensus.
The Delphi method is a formal technique for managing group interaction. The method proceeds by first eliciting the various experts' opinions separately, then feeding each expert's views to all of the other experts along with some explanation of that expert's reasoning. The experts are then invited to revise their own elicitations. The method then operates iteratively, feeding back the revised elicitations to all of the experts, with explanation of the reasons for revisions, and so on. Although some of the complications of group interaction are removed, the method is likely to produce a less efficient sharing of knowledge than the behavioral aggregation approach. It is also still necessary for the facilitator to manage the interaction, because one expert's reasons may have undue influence if expressed very forcefully. Pill (1971) reviewed the Delphi technique. In addition, there is a truly vast literature on its use in political science and government.
A variant of Delphi was discussed by DeGroot (1974), who proposed that each expert revise his or her opinion by applying a linear opinion pool to all of the experts' distributions, with weights that reflect the importance that each pool member puts on the opinions of each of the other participants. The system of revisions then forms a Markov chain whose transition matrix is the matrix of weights, and DeGroot obtained conditions for them to converge to a consensus (see also Lehrer 1976).
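The convergence behavior is easy to illustrate numerically. The following is a minimal sketch (the weight matrix and initial opinions are hypothetical, chosen only for illustration):

```python
import numpy as np

# Row-stochastic weight matrix: W[i, j] is the weight that expert i
# places on expert j's current opinion (hypothetical values).
W = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Initial opinions, e.g., each expert's probability for some event.
p = np.array([0.9, 0.4, 0.6])

# DeGroot revision: in each round, every expert replaces his or her
# opinion with a weighted average of all current opinions, p <- W p.
for _ in range(50):
    p = W @ p

print(p)  # all three entries agree (to machine precision): a consensus
```

Because this W is irreducible and aperiodic, its powers converge to a matrix with identical rows, so the opinions converge to a common weighted average of the initial ones; DeGroot's conditions characterize exactly when such convergence occurs.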
In contrast with these practical group elicitation methods, we can also take a more axiomatic approach and try to identify what would be rational ways for experts to seek a combined expression of beliefs. Bayesian theory is fundamentally a theory of rational individual decision making. It imposes a minimal condition on an individual's statements of what bets would be acceptable, namely the avoidance of a Dutch book, a collection of acceptable bets that together guarantee a sure loss (a small numerical illustration follows this paragraph), and accepts all responses that meet that condition. How can this theory be extended to groups?
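As an illustration of the Dutch book condition (the numbers here are ours, not from any particular source): suppose an individual would pay 0.7 for a ticket that pays 1 if event A occurs, and would also pay 0.5 for a ticket that pays 1 if A does not occur. An opponent who sells the individual both tickets collects 0.7 + 0.5 = 1.2 and pays out exactly 1 whichever way A turns out, a guaranteed profit of 0.2. The implicit probabilities, 0.7 and 0.5, sum to more than 1; invulnerability to such sure-loss books of bets forces them to sum to exactly 1 and, more generally, forces betting rates to obey the probability axioms.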
To make progress on this question, one needs to know how the group decision making structure works, and how it relates to the views of the individuals in the group. Before starting, it is necessary to eliminate one obvious special case, that of a dictatorship. In a dictatorship, the choices of only one individual (i.e., of the dictator) matter for the group's choices. Hence if the dictator behaves individually in accordance with the Bayesian axioms, then the group will as well. By eliminating this case, we insist that more than one person's views matter in how the group makes decisions.
Suppose instead that the group makes decisions by majority vote, and that each member of the group has a transitive ranking of the alternatives. Surely in this case, something reasonable should be true of the decisions of such a group. Well, perhaps not. Suppose that the group consists of three people (the minimum for interesting majority votes) who are expressing preferences among three alternatives, A, B, and C. Suppose that voter 1 ranks them in that order, that is, prefers A to B to C. These preferences will be denoted by A > B > C. Suppose that voter 2's preferences are B > C > A and that voter 3's are C > A > B. In a choice between A and B, voters 1 and 3 prefer A to B. Between B and C, voters 1 and 2 prefer B to C. Finally, between A and C, voters 2 and 3 prefer C to A. Hence the majority preferences can be summarized as A > B > C > A. No group utility function can summarize such choices, because they are not transitive. This simple example has been generalized to every nondictatorial group decision making process in a celebrated theorem of Arrow (1951, 1963). A huge literature has grown up around this result, mainly under the heading of political economy.

A second approach to the problem of Bayesian group decision making asks for preferences between risky outcomes for two or more Bayesians. However they make their decisions, these Bayesians seek to compromise. The only condition imposed on their compromises is that they obey the Pareto Principle: If each member of the group prefers A to B, then the compromise cannot prefer B to A. Suppose that there are two Bayesians and three alternatives. Two trivial cases must be dealt with first. If the Bayesians agree in their probabilities, then every nontrivial convex combination of their utilities, together with their agreed-on probability, provides a Bayesian compromise satisfying the Pareto Principle. Similarly, if their utilities coincide, then every nontrivial convex combination of their probabilities will suffice. The result of Seidenfeld, Schervish, and Kadane (1989) is that these are the only cases in which a Bayesian compromise can be found. When there are more than two Bayesians involved, the Pareto Principle is less binding, so in certain cases a Bayesian compromise can be found (see Goodman 1988).
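In symbols (a paraphrase of the result in our own notation): suppose Bayesian i holds the probability-utility pair (p_i, u_i) for i = 1, 2. If p_1 = p_2 = p, then for any 0 < alpha < 1 the pair

\[
\bigl(p, \; \alpha u_1 + (1 - \alpha) u_2\bigr)
\]

is a Pareto-respecting Bayesian compromise, and if u_1 = u_2 = u, then (alpha p_1 + (1 - alpha) p_2, u) is one. The Seidenfeld, Schervish, and Kadane result says that when the two Bayesians differ in both their probabilities and their utilities, no pair (p, u) other than the two original ones ranks risky options by expected utility while respecting the strict Pareto Principle.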
These results cast serious doubt on what might be meant by the probability and utility function of a group seeking to be Bayesian. In what sense can the probability and utility function of the group be representative of the decisions that the group might make?

Discussion
From the 1960s to at least the early 1980s, research in elicitation was substantial and was characterized by some close collaborations between statisticians and psychologists. More recently, there seems to have been much less research in both the statistics and psychology communities, and collaboration between them has lapsed. There are signs that this is changing, however. The growing sophistication of Bayesian computational methods has led to a dramatic increase in the breadth and complexity of Bayesian applications. Bayesians are beginning to show more interest in elicitation, and, freed from the computational constraint to use tractable conjugate priors, they need processes capable of eliciting complex, nonstandard distributions. A recent review of case studies and software has been provided by Kadane and Wolfson (1998). This in turn is likely to lead to renewed collaboration with psychologists.
Despite the existence of a broad and diverse literature in elicitation, which has provided many valuable procedures and insights, there remains very considerable scope for further research. Some important topics are the following:

• Multivariate elicitation. Where it is necessary to elicit a joint probability distribution for two or more quantities, there has been relatively little investigation by psychologists (particularly when the quantities cannot be regarded as instances of a larger population). In this context, also, the available multivariate parametric families typically impose unrealistic constraints on experts' beliefs.

• Nonparametric elicitation. We find the use of uniform, triangular, or histogram distributions unrealistic, but fitting parametric distributions also imposes constraints that may be unrealistic. There has been little work on nonparametric fitting, and although a nonparametric fit might represent an expert's beliefs more accurately, it is not clear whether this will actually matter in practice (see the discussion in Sec. 4.2); a minimal sketch contrasting the two approaches follows this list.

• Graphical tools. We believe that there is substantial, as-yet relatively unexplored, potential in graphical methods.
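As an indication of what is at stake in the nonparametric bullet above, the following sketch (the elicited quantiles are invented for illustration) contrasts a parametric fit with a piecewise-linear, nonparametric one:

```python
import numpy as np
from scipy import stats, optimize

# Elicited (probability, value) pairs -- hypothetical numbers chosen
# to represent right-skewed expert judgments.
probs = np.array([0.05, 0.25, 0.50, 0.75, 0.95])
vals  = np.array([1.0, 2.0, 2.6, 4.0, 7.5])

# Parametric fit: choose (mu, sigma) so that normal quantiles match
# the elicited ones in a least-squares sense.
def loss(params):
    mu, sigma = params
    return np.sum((stats.norm.ppf(probs, mu, sigma) - vals) ** 2)

mu, sigma = optimize.minimize(loss, x0=[2.6, 2.0]).x

# Nonparametric fit: piecewise-linear CDF through the elicited points.
def cdf_np(x):
    return np.interp(x, vals, probs, left=0.0, right=1.0)

# The skew in the judgments is preserved by the piecewise-linear CDF
# but lost in the symmetric normal fit; compare the implied medians:
print("normal median:  ", mu)       # pulled toward the long right tail
print("elicited median:", vals[2])
```

Whether the discrepancy between the two fits matters in a given application is exactly the open question raised above.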
An aim of much statistical research is to wring as much from data as we possibly can, but using expert opinion better (or using it at all) could add more information than the slight improvements in efficiency that come from better techniques of data analysis. Too often, ad hoc methods must be used when an expert's opinion is to be quantified; ideally, there should be a range of tried and tested elicitation methods in a statistician's toolbox.

[Received June 2004. Revised November 2004.]