A solution to the single-question crowd wisdom problem

Journal name: Nature
Volume: 541
Pages: 532–535
DOI: 10.1038/nature21054

Once considered provocative [1], the notion that the wisdom of the crowd is superior to any individual has become itself a piece of crowd wisdom, leading to speculation that online voting may soon put credentialed experts out of business [2, 3]. Recent applications include political and economic forecasting [4, 5], evaluating nuclear safety [6], public policy [7], the quality of chemical probes [8], and possible responses to a restless volcano [9]. Algorithms for extracting wisdom from the crowd are typically based on a democratic voting procedure. They are simple to apply and preserve the independence of personal judgment [10]. However, democratic methods have serious limitations. They are biased for shallow, lowest common denominator information, at the expense of novel or specialized knowledge that is not widely shared [11, 12]. Adjustments based on measuring confidence do not solve this problem reliably [13]. Here we propose the following alternative to a democratic vote: select the answer that is more popular than people predict. We show that this principle yields the best answer under reasonable assumptions about voter behaviour, while the standard ‘most popular’ or ‘most confident’ principles fail under exactly those same assumptions. Like traditional voting, the principle accepts unique problems, such as panel decisions about scientific or artistic merit, and legal or historical disputes. The potential application domain is thus broader than that covered by machine learning and psychometric methods, which require data across multiple questions [14–20].
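
For a binary question, the selection principle reduces to a single comparison between the observed vote share and the crowd's predicted vote share. The sketch below is a minimal illustration of that rule, not the authors' implementation; the function name, the tie-breaking choice, and the example numbers are ours.

```python
from statistics import mean

def surprisingly_popular(votes, predicted_yes_shares):
    """Return the 'surprisingly popular' answer to a yes/no question.

    votes: list of booleans (True = yes, False = no), one per respondent.
    predicted_yes_shares: list of floats in [0, 1], each respondent's
        prediction of the fraction of respondents who will vote yes.
    """
    actual_yes = mean(1.0 if v else 0.0 for v in votes)   # observed yes share
    predicted_yes = mean(predicted_yes_shares)            # crowd's predicted yes share
    # Pick the answer whose actual popularity exceeds its predicted popularity.
    return "yes" if actual_yes > predicted_yes else "no"

# Hypothetical numbers in the spirit of the Philadelphia question (Fig. 1a):
# most respondents vote yes, but yes is even more popular in prediction than in fact.
votes = [True] * 65 + [False] * 35
predictions = [0.9] * 65 + [0.75] * 35
print(surprisingly_popular(votes, predictions))   # -> "no"
```

In this invented example, yes wins the vote (65%) but falls short of its predicted popularity (about 85%), so no is the surprisingly popular answer, mirroring the Philadelphia pattern described in Figure 1.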


Figures

  1. Figure 1: Two example questions from Study 1c, described in text.

    a, Majority opinion is incorrect for question (P). b, Majority opinion is correct for question (C). c, d, Respondents give their confidence that their answer is correct from 50% (chance) to 100% (certainty). Weighting votes by confidence does not change majority opinion, since respondents voting for both answers are roughly equally confident. e, Respondents predict the frequency of yes votes, shown as estimated per cent agreement with their own answer. Those answering yes believe that most others will agree with them, while those answering no believe that most others will disagree. The surprisingly popular answer discounts the more predictable votes, reversing the incorrect majority verdict in (P). f, The predictions are roughly symmetric, and so the surprisingly popular answer does not overturn the correct majority verdict in (C).

  2. Figure 2: Why ‘surprisingly popular’ answers should be correct, illustrated by simple models of Philadelphia and Columbia questions with Bayesian respondents.

    a, The correct answer is more popular in the actual world than in the counterfactual world. b, Respondents’ vote predictions interpolate between the two possible worlds. In both models, interpolation is illustrated by a Bayesian voter with 2/3 confidence in yes and a voter with 5/6 confidence in no. All predictions lie between actual and counterfactual percentages. The prediction of the yes voter is closer to the percentage in the yes world, and the prediction of the no voter is closer to the percentage in the no world. c, Actual votes. The correct answer is the one that is more popular in the actual world than predicted—the surprisingly popular answer. For the Philadelphia question, yes is less popular than predicted, so no is correct. For the Columbia question, yes is more popular than predicted, so yes is correct. The example also proves that any algorithm based on votes and confidences can fail even with ideal Bayesian respondents. The two questions have different correct answers, while the actual vote splits and confidences are the same. Confidences 2/3 and 5/6 follow from Bayes’ rule if the actual world is drawn according to prior probabilities that favour yes by 7:5 odds on Philadelphia, and favour no by 2:1 odds on Columbia. The prior represents evidence that is common knowledge among all respondents. A respondent’s vote is generated by tossing the coin corresponding to the actual world. A respondent uses their vote as private evidence to update the prior into posterior probabilities via Bayes’ rule. For example, a yes voter for Philadelphia would compute a posterior probability, that is, a confidence of 2/3 that yes is correct, which is the same confidence computed by a yes voter for Columbia. (A numeric sketch of this two-world model is given below, after the figure captions.)

  3. Figure 3: Selection of stimuli from Study 4 in which respondents judged the market price of 20th century artworks.

    a, Roshan Houshmand, Rhythmic Structure. b, Abraham Dayan, dance in the living room. c, Matthew Bates, Botticelli e Filippino. d, Christopher Wool, Untitled, 1991, enamel on aluminum, 90″ × 60″ ©Christopher Wool; courtesy of the artist and Luhring Augustine, New York. e, Anna Jane McIntyre, Conversation With a Spoonbill. f, Tadeusz Machowski, Abstract #66.

  4. Figure 4: Results of aggregation algorithms on studies discussed in the text.

    Study 1a, b, c: n (items per study) = 50; Studies 2 and 3: n = 80; Study 4a, b: n = 90. Agreement with truth is measured by Cohen’s kappa, with error bars showing standard errors. Kappa = (A − B)/(1 − B), where A is the per cent of correct decisions across items in a study, and B is the probability of a chance correct decision, computed according to the answer percentages generated by the algorithm. (A short kappa calculation is sketched below, after the figure captions.) Confidence was not elicited in Studies 1a, b and 4a, b. However, in 4a, b we use scale values as a proxy for confidence [27], giving extreme categories (on a four-point scale) twice as much weight in scale-weighted voting, and 100% of the weight in maximum scale. The results for the method labelled ‘Individual’ are the average kappa across all individuals. SP (the surprisingly popular algorithm) is consistently the best performer across all studies. Results using the Matthews correlation coefficient, F1 score, and per cent correct are similar (Extended Data Figs 1, 2, 3).

  5. Figure 5: Logistic regressions showing the probability that an artwork is judged expensive (above $30,000) as a function of actual market price.

    Thin purple lines are individual respondents in the art professionals and laypeople samples, and the yellow line shows the average respondent. Price discrimination is given by the slope of the logistic lines, which is significantly different from zero (χ2, P < 0.05) for 14 of 20 respondents in the professional sample and for 5 of 20 respondents in the laypeople sample. Performance is unbiased if a line passes through the red diamond, indicating that an artwork with a true value of exactly $30,000 has a 50:50 chance of being judged above or below $30,000. The bias against the higher price category, which characterizes most individuals, is amplified when votes are aggregated into majority opinion (blue line). The surprisingly popular algorithm (green line) eliminates the bias, and matches the discrimination of the best individuals in each sample.
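
The two-world model of Figure 2 can be reproduced numerically. The sketch below is our reconstruction, not published code: the priors (7:5 odds for yes on Philadelphia, 2:1 odds for no on Columbia) and the confidences 2/3 and 5/6 come from the caption, while the coin biases are the values we derive as consistent with those numbers, so treat them as illustrative rather than as the published parameters.

```python
from fractions import Fraction as F

def bayes_posterior_yes(prior_yes, p_yes_vote_yes_world, p_yes_vote_no_world):
    """A yes voter's posterior confidence that the actual world is 'yes' (Bayes' rule)."""
    num = p_yes_vote_yes_world * prior_yes
    return num / (num + p_yes_vote_no_world * (1 - prior_yes))

def bayes_posterior_no(prior_yes, p_yes_vote_yes_world, p_yes_vote_no_world):
    """A no voter's posterior confidence that the actual world is 'no' (Bayes' rule)."""
    num = (1 - p_yes_vote_no_world) * (1 - prior_yes)
    return num / (num + (1 - p_yes_vote_yes_world) * prior_yes)

# name: (prior P(yes world), P(yes vote | yes world), P(yes vote | no world), actual world)
# Priors come from the caption; the coin biases are our reconstruction, chosen so that
# Bayes' rule reproduces the 2/3 and 5/6 confidences stated there.
models = {
    "Philadelphia": (F(7, 12), F(20, 21), F(2, 3), "no"),   # Philadelphia is not the capital
    "Columbia":     (F(1, 3),  F(2, 3),  F(1, 6), "yes"),   # Columbia is the capital
}

for name, (prior, a, b, actual_world) in models.items():
    conf_yes = bayes_posterior_yes(prior, a, b)   # 2/3 in both models
    conf_no = bayes_posterior_no(prior, a, b)     # 5/6 in both models
    actual_yes = a if actual_world == "yes" else b          # yes-vote share in the actual world
    # Each voter's predicted yes share interpolates between the two worlds,
    # weighted by that voter's posterior over worlds (Fig. 2b).
    pred_yes_voter = conf_yes * a + (1 - conf_yes) * b
    pred_no_voter = (1 - conf_no) * a + conf_no * b
    predicted_yes = actual_yes * pred_yes_voter + (1 - actual_yes) * pred_no_voter
    sp_answer = "yes" if actual_yes > predicted_yes else "no"
    print(f"{name}: votes {float(actual_yes):.0%} yes, predicted {float(predicted_yes):.0%} yes, "
          f"confidences ({conf_yes}, {conf_no}), surprisingly popular answer: {sp_answer}")
```

With these numbers both questions produce identical vote splits (2/3 yes) and identical confidences, yet the surprisingly popular rule returns no for Philadelphia and yes for Columbia, which is the point of the example in the caption.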

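The agreement measure used in Figure 4 is easy to state in code. Below is a minimal sketch of the kappa formula given in that caption; the numbers in the example are invented for illustration, not values from the studies.

```python
def cohens_kappa(percent_correct, chance_percent_correct):
    """Kappa = (A - B) / (1 - B), as defined in the Fig. 4 caption.

    percent_correct: A, the algorithm's fraction of correct decisions across items.
    chance_percent_correct: B, the probability of a chance correct decision,
        computed from the answer percentages the algorithm itself generates.
    """
    A, B = percent_correct, chance_percent_correct
    return (A - B) / (1 - B)

# Hypothetical illustration: A = 0.75 correct decisions, with a chance-correct
# rate B = 0.52 implied by the algorithm's answer percentages.
print(cohens_kappa(0.75, 0.52))   # ~0.479
```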

References

  1. Galton, F. Vox populi. Nature 75, 450–451 (1907)
  2. Sunstein, C. Infotopia: How Many Minds Produce Knowledge (Oxford University Press, USA, 2006)
  3. Surowiecki, J. The Wisdom of Crowds (Anchor, 2005)
  4. Budescu, D. V. & Chen, E. Identifying expertise to extract the wisdom of crowds. Manage. Sci. 61, 267–280 (2014)
  5. Mellers, B. et al. Psychological strategies for winning a geopolitical forecasting tournament. Psychol. Sci. 25, 1106–1115 (2014)
  6. Cooke, R. M. & Goossens, L. L. TU Delft expert judgment data base. Reliab. Eng. Syst. Saf. 93, 657–674 (2008)
  7. Morgan, M. G. Use (and abuse) of expert elicitation in support of decision making for public policy. Proc. Natl Acad. Sci. USA 111, 7176–7184 (2014)
  8. Oprea, T. I. et al. A crowdsourcing evaluation of the NIH chemical probes. Nat. Chem. Biol. 5, 441–447 (2009)
  9. Aspinall, W. A route to more tractable expert advice. Nature 463, 294–295 (2010)
  10. Lorenz, J., Rauhut, H., Schweitzer, F. & Helbing, D. How social influence can undermine the wisdom of crowd effect. Proc. Natl Acad. Sci. USA 108, 9020–9025 (2011)
  11. Chen, K., Fine, L. & Huberman, B. Eliminating public knowledge biases in information-aggregation mechanisms. Manage. Sci. 50, 983–994 (2004)
  12. Simmons, J. P., Nelson, L. D., Galak, J. & Frederick, S. Intuitive biases in choice versus estimation: implications for the wisdom of crowds. J. Consum. Res. 38, 1–15 (2011)
  13. Hertwig, R. Psychology. Tapping into the wisdom of the crowd–with confidence. Science 336, 303–304 (2012)
  14. Batchelder, W. & Romney, A. Test theory without an answer key. Psychometrika 53, 71–92 (1988)
  15. Lee, M. D., Steyvers, M., de Young, M. & Miller, B. Inferring expertise in knowledge and prediction ranking tasks. Top. Cogn. Sci. 4, 151–163 (2012)
  16. Yi, S. K., Steyvers, M., Lee, M. D. & Dry, M. J. The wisdom of the crowd in combinatorial problems. Cogn. Sci. 36, 452–470 (2012)
  17. Lee, M. D. & Danileiko, I. Using cognitive models to combine probability estimates. Judgm. Decis. Mak. 9, 259–273 (2014)
  18. Anders, R. & Batchelder, W. H. Cultural consensus theory for multiple consensus truths. J. Math. Psychol. 56, 452–469 (2012)
  19. Oravecz, Z., Anders, R. & Batchelder, W. H. Hierarchical Bayesian modeling for test theory without an answer key. Psychometrika 80, 341–364 (2015)
  20. Freund, Y. & Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 119–139 (1997)
  21. Goldstein, D. G. & Gigerenzer, G. Models of ecological rationality: the recognition heuristic. Psychol. Rev. 109, 75–90 (2002)
  22. Cooke, R. Experts in Uncertainty: Opinion and Subjective Probability in Science (Oxford University Press, USA, 1991)
  23. Koriat, A. When are two heads better than one and why? Science 336, 360–362 (2012)
  24. Prelec, D. A Bayesian truth serum for subjective data. Science 306, 462–466 (2004)
  25. John, L. K., Loewenstein, G. & Prelec, D. Measuring the prevalence of questionable research practices with incentives for truth telling. Psychol. Sci. 23, 524–532 (2012)
  26. Arrow, K. J. et al. Economics. The promise of prediction markets. Science 320, 877–878 (2008)
  27. Lebreton, M., Abitbol, R., Daunizeau, J. & Pessiglione, M. Automatic integration of confidence in the brain valuation signal. Nat. Neurosci. 18, 1159–1167 (2015)


Author information

Affiliations

  1. Sloan School of Management, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

    • Dražen Prelec
  2. Department of Economics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

    • Dražen Prelec
  3. Department of Brain & Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA

    • Dražen Prelec
    • John McCoy
  4. Princeton Neuroscience Institute and Computer Science Department, Princeton University, Princeton, New Jersey 08544, USA

    • H. Sebastian Seung

Contributions

All authors contributed extensively to the work presented in this paper.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:

Reviewer Information: Nature thanks A. Baillon, D. Helbing and the other anonymous reviewer(s) for their contribution to the peer review of this work.


Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: Performance of all methods across all studies, shown with respect to the Matthews correlation coefficient.

    Error bars are bootstrapped standard errors. Details of studies are given in Fig. 4 of the main text.

  2. Extended Data Figure 2: Performance of all methods across all studies, shown with respect to the macro-averaged F1 score.

    Error bars are bootstrapped standard errors. Details of studies are given in Fig. 4 of the main text.

  3. Extended Data Figure 3: Performance of all methods across all studies, shown with respect to percentage of questions correct.

    Error bars are bootstrapped standard errors. Details of studies are given in Fig. 4 of the main text.

  4. Extended Data Figure 4: Performance of aggregation methods on simulated datasets of binary questions, under uniform sampling assumptions.

    A pair of coin biases (that is, signal distribution parameters) and a prior over worlds are each drawn from independent uniform distributions. Combinations of coin biases and prior that result in recipients of both coin tosses voting for the same answer are discarded. An actual coin is sampled according to the prior, and tossed a finite number of times to produce the votes, confidences, and vote predictions required by the different methods (see Supplementary Information for simulation details; a minimal version of this simulation is sketched below). As well as showing how sample size affects the different aggregation methods, the simulations also show that majorities become more reliable as consensus increases. A majority of 90% is correct about 90% of the time, while a majority of 55% is not much better than chance. This is not due to sampling error, but reflects the structure of the model and the simulation assumptions. According to the model, an answer with x% endorsements is incorrect if counterfactual endorsements for that answer exceed x% (Theorem 2), and the chance of sampling such a problem diminishes with x.
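
A minimal sketch of a simulation of this kind, written to the assumptions stated in the caption (uniform prior and coin biases, discarded uninformative draws, finite samples of Bayesian voters who vote for the answer with posterior above 1/2). The number of respondents, the seed, and all names are our choices; the actual simulation details are in the Supplementary Information.

```python
import random

def posterior_yes_world(signal_is_yes, prior_yes, a, b):
    """P(yes world | private signal), by Bayes' rule.
    a = P(yes signal | yes world), b = P(yes signal | no world)."""
    like_yes_world = a if signal_is_yes else 1 - a
    like_no_world = b if signal_is_yes else 1 - b
    num = like_yes_world * prior_yes
    return num / (num + like_no_world * (1 - prior_yes))

def simulate_question(n_respondents=20, rng=random):
    """Simulate one binary question; return (truth, majority answer, SP answer) as booleans."""
    # Draw a prior over worlds and a pair of coin biases uniformly, discarding
    # combinations where recipients of both coin tosses would cast the same vote.
    while True:
        prior_yes, a, b = rng.random(), rng.random(), rng.random()
        yes_signal_votes_yes = posterior_yes_world(True, prior_yes, a, b) > 0.5
        no_signal_votes_yes = posterior_yes_world(False, prior_yes, a, b) > 0.5
        if yes_signal_votes_yes != no_signal_votes_yes:
            break
    # Yes-vote shares implied by each possible world.
    yes_share_if_yes_world = a if yes_signal_votes_yes else 1 - a
    yes_share_if_no_world = b if yes_signal_votes_yes else 1 - b
    # Sample the actual world, then a finite number of coin tosses (signals),
    # which generate each respondent's vote and vote prediction.
    world_is_yes = rng.random() < prior_yes
    p_yes_signal = a if world_is_yes else b
    votes, predictions = [], []
    for _ in range(n_respondents):
        signal = rng.random() < p_yes_signal
        post = posterior_yes_world(signal, prior_yes, a, b)   # voter's confidence in yes
        votes.append(post > 0.5)
        # Predicted yes-vote share interpolates between worlds, weighted by the posterior.
        predictions.append(post * yes_share_if_yes_world + (1 - post) * yes_share_if_no_world)
    actual_yes = sum(votes) / n_respondents
    majority_says_yes = actual_yes > 0.5
    sp_says_yes = actual_yes > sum(predictions) / n_respondents
    return world_is_yes, majority_says_yes, sp_says_yes

random.seed(0)
results = [simulate_question() for _ in range(5000)]
print("majority vote correct:", sum(t == m for t, m, _ in results) / len(results))
print("surprisingly popular correct:", sum(t == s for t, _, s in results) / len(results))
```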

Supplementary information

PDF files

  1. Supplementary Information

    This file contains Supplementary Text and Data sections 1-3 – see contents page for details.
