The Delphi methodology
Norman C Dalkey

Delphi is a method of combining the judgements of knowledgeable individuals. It is relevant when there are no determinate answers (e.g., hard data or well-established theory) available, but some persons (often called experts) have relevant information about the topic of concern. It is especially pertinent in the common case of disagreement among experts. In effect, it is a procedure for aggregating the information known to the panel. The resultant group judgement is obviously no better than the composite information within the group.

The method was formulated in the early 1950s at the RAND Corporation by Olaf Helmer and myself. Since then it has been applied in hundreds of investigations around the world, dealing with topics as diverse as forecasting long-term technological developments, the health effects of various concentrations of atmospheric pollutants, and the reliability of components in nuclear reactors. The method has proved to be a practical and efficient way to obtain 'best estimates' in uncertain contexts.

In its simplest form, the method has three features:

Anonymity. Each member of the panel submits his own independent answer(s) to the relevant question(s) by questionnaire or computer query.

Controlled feedback and iteration. The results of a given round of responses are summarised and reported to the group, who are then asked to reassess their replies in light of the feedback.

Formal group judgement. Given the final set of individual answers, the group answer is expressed as a formal aggregation; e.g., if the questions involve numerical answers, the group judgement may be formulated as the mean, median, or other measure of central tendency.

Anonymity is a restriction to combat the biasing effects of group pressure, dominant individuals, and the like, and assures, with formal group judgement, that the answer of every individual in the group is taken into account in the final group judgement. Iteration with feedback allows a certain amount of interchange among the members of the group but in a controlled manner. Formal aggregation gives a well-defined group response.
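
The summary and aggregation steps can be illustrated with a short sketch. The Python below is a minimal illustration, not the procedure used in the original studies: the function name summarise_round, the choice of median and quartiles as the feedback summary, and the sample answers are all assumptions made for the example.

```python
from statistics import mean, median, quantiles

def summarise_round(answers):
    """Summarise one round of anonymous numerical answers.

    The returned statistics could be reported back to the panel as
    feedback and, after the final round, one of the central values
    could serve as the formal group judgement.
    (Hypothetical helper; the choice of statistics is an assumption.)
    """
    if len(answers) < 2:
        raise ValueError("need at least two answers")
    q1, _, q3 = quantiles(answers, n=4)  # quartile cut points
    return {
        "n": len(answers),
        "mean": mean(answers),
        "median": median(answers),
        "lower_quartile": q1,
        "upper_quartile": q3,
    }

# Made-up answers from a five-member panel, for illustration only.
round_1 = [120, 150, 200, 90, 400]
summary = summarise_round(round_1)
print(summary["median"], summary["mean"])
```

In this made-up round the single high answer of 400 pulls the mean well above the median, which is one reason an investigator might prefer the median as the group response.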

I have sometimes used the statement 'N heads are better than one' - a generalisation of the age-old adage, 'Two heads are better than one' - to express the basic value of the Delphi method. However, in the area of uncertain matters, many scholars have been wary of 'group think': 'A camel is a horse designed by a committee'; 'A group judgement is the lowest common denominator of the individual judgements'; and so on. The Delphi procedures are designed to meet the many concerns about group think, but they clearly needed validation. There are two routes to validation, experimental and theoretical.

In an extensive series of experiments at the RAND Corporation and the University of California at Los Angeles, the effectiveness of the procedures, as well as many variations, was tested, using graduate students as subjects and 'almanac-type' questions where the subjects did not know the answers, but had some relevant knowledge. The experimenters could obtain the answers from reference sources. A typical question was 'How many telephones are there in Africa?' A typical group size was 20 to 30 subjects, and typically, 20 questions would be involved in a given session. In general, the subjects were able to make reasonable 'estimates' for these, to them, uncertain questions. The basic results of these studies are:

Size of group. The average error of the group responses declined monotonically with the size of the group, with decreasing returns with increasing size. Roughly, one half of the individual error was observed with groups of 7 members. An additional 20 members reduced the average group error by an additional 10%. The reduction of error with size of group is analogous to, but not identical with, the rule for the dispersion of the sample mean in random sampling.
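
For comparison, the random-sampling rule referred to above states that the standard deviation of the mean of n independent draws is the standard deviation of a single draw divided by the square root of n. The sketch below only tabulates that idealised factor for a few group sizes; it does not model the experimental data, and the function name is an assumption.

```python
import math

def sampling_rule_factor(n):
    """Spread of the mean of n independent, identically distributed
    estimates, relative to the spread of a single estimate."""
    if n < 1:
        raise ValueError("group size must be at least 1")
    return 1.0 / math.sqrt(n)

# Group sizes chosen to mirror those mentioned in the text: 1, 7, and 7 + 20.
for n in (1, 7, 27):
    print(n, round(sampling_rule_factor(n), 3))
```

The idealised rule gives a factor of about 0.38 for a group of 7, against the roughly one half reported for the experiments. The two quantities are measured differently, so this is only a loose comparison, but it is in line with the remark that the analogy is not exact.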

Iteration with feedback. There was monotonic reduction in the dispersion of individual responses (convergence) with iteration, again with decreasing effect with additional iterations. However, the accuracy of the group answer improved with the first iteration and fluctuated with additional iterations. It is my present belief that a single iteration furnishes the major benefit obtainable with iteration.

Dispersion. There is a generally held belief that greater agreement (smaller dispersion) is associated with a greater likelihood of the group being correct. This was borne out in the experiments. Roughly, the average group error was about 2/3 of the observed dispersion.

Individual and group self-ratings. In many of the exercises, individuals were asked to rate their confidence in their answer to each question on a scale of 1-5, where 5 meant 'I know the answer' and 1 meant 'I'm just guessing'. A group self-rating could then be computed for each question by taking the average of the individual ratings. Between a group self-rating of 1.2 and 4, average error dropped by a factor of 5.

Group self-rating and sample dispersions are thus valid, if rough, indicators of accuracy. Combined, they are even stronger. Thus, between cases with self-ratings of 2 or better and dispersions of 0.5 or less, and cases with self-ratings of 2 or less and dispersions of 1.5 or more, there is a factor of 10 difference in accuracy. Since these results were for a specific type of subject matter, and a specific type of 'expert', the precise relationships cannot be transferred directly to other investigations, but they can be used as guides in evaluating the solidity of elicited judgements. Low self-ratings and large dispersions probably indicate unreliable results.
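
The two indicators can be combined in a simple screening rule. The sketch below is an assumption-laden illustration: the function names, the labels, and the middle 'unclear' category are hypothetical, the thresholds are merely the figures quoted above exposed as parameters, and the dispersion is taken as an input because the text does not say how it was measured.

```python
from statistics import mean

def group_self_rating(ratings):
    """Average of the individual confidence ratings on the 1-5 scale."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)

def solidity(self_rating, dispersion,
             rating_cut=2.0, low_disp=0.5, high_disp=1.5):
    """Rough three-way flag for one question.

    Thresholds default to the figures quoted in the text; the labels
    and the 'unclear' category are assumptions for illustration.
    """
    if self_rating >= rating_cut and dispersion <= low_disp:
        return "more solid"
    if self_rating <= rating_cut and dispersion >= high_disp:
        return "less solid"
    return "unclear"

# Made-up ratings and dispersions, for illustration only.
rating = group_self_rating([3, 4, 2, 3])
print(rating, solidity(rating, 0.4))
print(solidity(1.5, 1.8))
```

As the text cautions, thresholds of this kind would need to be recalibrated for a different subject matter or pool of experts.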

Theoretical validations have suffered from the lack of a suitable model for the estimation process. The elementary fact that the squared error of the group response is equal to the average individual squared error minus the sample dispersion (this is not a statistical statement; it is simple arithmetic) is a reason for preferring a group judgement for uncertain estimates, but it is not sufficient. There is also a result to the effect that, given the appropriate kinds of estimates by the individual members of a group, there is an aggregation technique that guarantees that the group response is better than that of any individual. However, the type of individual estimate required appears to be unfeasible in most applied contexts. The result is thus more of a 'reassurance' than a useable procedure.
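
The arithmetic identity mentioned above can be checked numerically. The sketch below assumes that the group response is the arithmetic mean of the individual estimates and that the dispersion is the mean squared deviation from that mean (dividing by n, not n - 1); the function name and the numbers are made up for the example.

```python
from statistics import mean

def error_decomposition(estimates, truth):
    """Return (group squared error, average individual squared error,
    dispersion) for a set of estimates of a known true value.

    Assumes the group response is the arithmetic mean and the
    dispersion is the mean squared deviation from it.
    """
    group = mean(estimates)
    group_sq_error = (group - truth) ** 2
    avg_individual_sq_error = mean([(x - truth) ** 2 for x in estimates])
    dispersion = mean([(x - group) ** 2 for x in estimates])
    return group_sq_error, avg_individual_sq_error, dispersion

# Made-up estimates of a quantity whose true value is 120.
g, a, d = error_decomposition([80.0, 100.0, 150.0, 70.0], truth=120.0)
print(g, a, d, a - d)
```

Because the dispersion can never be negative, the identity implies that the squared error of the mean response cannot exceed the average individual squared error, whatever the estimates happen to be.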

In summary, the Delphi process has strong experimental support for obtaining answers to uncertain questions, and non-trivial theoretical support. In applications, the intuition of the investigator and the specifics of the subject matter and available pool of experts may dictate modifications to the 'laboratory' procedures.