The quality of decisions depends on the accuracy of estimates of relevant quantities. According to the wisdom of crowds principle, accurate estimates can be obtained by combining the judgements of different individuals1,2. This principle has been successfully applied to improve, for example, economic forecasts3,4,5, medical judgements6,7,8,9 and meteorological predictions10,11,12,13. Unfortunately, there are many situations in which it is infeasible to collect judgements of others. Recent research proposes that a similar principle applies to repeated judgements from the same person14. This paper tests this promising approach on a large scale in a real-world context. Using proprietary data comprising 1.2 million observations from three incentivized guessing competitions, we find that within-person aggregation indeed improves accuracy and that the method works better when there is a time delay between subsequent judgements. However, the benefit pales against that of between-person aggregation: the average of a large number of judgements from the same person is barely better than the average of two judgements from different people.
Many human decisions, whether in the business, political, medical or personal domain, require the decision-maker to estimate unknown quantities. One way to improve accuracy is to combine the estimates of a group of individuals. Aggregated estimates generally outperform most and sometimes all of the underlying estimates, and are often close to the true value. This phenomenon has become known as 'the wisdom of crowds'1,2. It arises from the statistical principle that aggregation of imperfect estimates diminishes the role of errors15,16,17,18. Generally, one has to combine only a few estimates to get most of the effect19.
The phenomenon was first described in Nature by the renowned British scientist Sir Francis Galton20. Galton witnessed a weight judging competition at the 1906 West of England Fat Stock and Poultry Exhibition, where visitors could win a prize by paying six pence and estimating the weight of an exhibited ox after it had been “slaughtered and dressed”. Galton collected all 800 tickets with estimates and found that the aggregate judgement of the group closely approximated the true value: the mean judgement was 1,197 lb, and the true value was 1,198 lb21,22. Similar results have since been observed in a wide range of experiments23,24,25,26,27,28,29.
Recent research proposes that the same principle applies to repeated judgements from the same person14. Laboratory experiments confirm that estimation accuracy can indeed be improved by aggregating estimates from a single individual16,30,31,32,33,34,35. The benefit of within-person aggregation reflects what has been dubbed ‘the wisdom of the inner crowd’, and can potentially boost the quality of individual decision making36.
This paper analyses within-person aggregation outside the psychological laboratory. We use three large proprietary data sets from three incentivized natural (‘naturally occurring’) experiments that resemble the one observed by Galton over a century ago. We show that within-person aggregation indeed improves accuracy, but not as much as between-person aggregation: the average of a large number of judgements from the same person is barely better than the average of two judgements from different people, even if the advantages of time delay between estimations are being exploited.
Our data are from three promotional events organized by the Dutch state-owned casino chain Holland Casino. During the last 7 weeks of 2013, 2014 and 2015, anybody who visited one of the casinos received a voucher with a login code. Via a terminal inside the casino and via the Internet, this code granted access to a competition in which participants were asked to estimate the number of objects in a transparent plastic container located just inside the entrance. This container, shaped to represent a champagne glass, was filled with small objects that represented pearls in 2013, pearls and diamonds in 2014 and casino chips in 2015 (Supplementary Fig. 1). Both the container and the exact number of objects were the same at every location. There were 12,564 objects in the container in 2013, 23,363 in 2014, and 22,186 in 2015. A prize of €100,000 was shared equally by those whose estimate was closest to the actual value. In 2013, the prize money was awarded to 16 people, and in 2014 and 2015, the entire amount was won by one person. All winners had submitted exactly the right number.
Our pseudonymized data sets contain all entries for the three years: a total of 369,260 estimates from 163,719 different players in 2013, 388,352 estimates from 154,790 players in 2014, and 407,622 estimates from 162,275 players in 2015. Many players submitted multiple estimates (Supplementary Fig. 2). Across the combined data sets, 60% of the participants were male and the average age was 39 yr. The Supplementary Information provides further details about the data.
The distributions of the estimates have a log-normal, right-skewed shape (Supplementary Figs. 7 and 8). Such a shape is in line with the tendency to estimate large numerical values in a logarithmically compressed manner29,32,37. This tendency seems to be the result of an innate intuition for numbers, with numbers logarithmically encoded in the brain38,39,40,41,42,43.
Immediately after Galton published his classic article, the aggregation measure to be used became a topic of debate21,44. The arithmetic mean is now the most commonly adopted aggregation measure45,46,47,48,49; however, with log-normal distributions, the preferred metric of central tendency is the geometric mean29,32,33. For our data, the geometric mean indeed performs much better than the arithmetic mean. The arithmetic mean overestimates the true value by ≥346% (Table 1), and is more accurate than only 10–14% of the underlying individual estimates across the three years. The geometric mean overestimates the true value by 86% in 2015, and is 19% and 32% below the true value in 2013 and 2014, respectively. In 2013 and 2014, the geometric mean is better than respectively 90% and 84% of the underlying individual estimates, and in 2015, it outperforms approximately 50%. Restricting the data to participants’ first estimate gives a similar picture (Supplementary Table 1).
Given the log-normal distributions of the estimates, our analyses follow the convention of using a logarithmic transformation29,32,33. After a logarithmic transformation, the arithmetic mean corresponds to the logarithm of the geometric mean of the original values. To make the distributions comparable across the three competitions, we divide the estimates by the true value before taking the logarithm. This two-step transformation yields approximately normal distributions (Supplementary Fig. 9), where zero represents the true value and deviations from zero measure the positive or negative estimation error. Our accuracy measure is the mean squared error (MSE). The Supplementary Information presents similar results for the mean absolute error and for the untransformed data.
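The two-step transformation and the accuracy measure can be illustrated with a minimal sketch. The estimates below are hypothetical and the function names are ours, not the authors'; only the true value of 12,564 is taken from the 2013 competition.

```python
import math

def log_ratio_errors(estimates, true_value):
    """Divide each estimate by the true value, then take the natural logarithm."""
    return [math.log(e / true_value) for e in estimates]

def mean_squared_error(errors):
    """MSE of errors that are already measured relative to zero (the true value)."""
    return sum(err ** 2 for err in errors) / len(errors)

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical estimates for a container holding 12,564 objects.
true_value = 12_564
estimates = [3_000, 9_500, 12_564, 20_000, 150_000]

errors = log_ratio_errors(estimates, true_value)
mean_error = sum(errors) / len(errors)

# The arithmetic mean of the transformed values equals the logarithm of the
# geometric mean of the raw estimates, expressed relative to the true value.
assert math.isclose(mean_error, math.log(geometric_mean(estimates) / true_value))

print(f"MSE of the individual estimates: {mean_squared_error(errors):.3f}")
print(f"Squared error of the aggregate:  {mean_error ** 2:.3f}")
```

An error of zero corresponds to an exactly correct estimate, and positive and negative errors of equal size correspond to over- and underestimation by the same factor.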
For every event, approximately 60,000 participants submitted more than one estimate. In 2013, the average of their first two estimates was more accurate than either estimate alone (MSE1 = 3.12, MSE2 = 2.73, MSE1&2 = 2.47, with t(60,869) > 21.90 and two-sided P < 0.0001 in the two comparisons). This was also true in 2014 (MSE1 = 3.07, MSE2 = 2.77, MSE1&2 = 2.50, t(59,156) > 23.20, P < 0.0001), and in 2015 (MSE1 = 3.45, MSE2 = 3.30, MSE1&2 = 2.96, t(61,893) > 31.73, P < 0.0001). However, the effect sizes are relatively small: Cohen’s d varies between 0.08 and 0.11 for the three comparisons between the average and the first estimate, and between 0.05 and 0.06 for the three comparisons between the average and the second estimate.
If judgements can be improved by aggregating two estimates, aggregating a greater number of estimates is likely to lead to further improvements. The MSE of aggregations across the first t consecutive estimates for players who provided at least K = 5 or K = 10 estimates in a given year is plotted in Fig. 1 (in black; see Supplementary Fig. 10 for alternative values of K). In all cases, the MSE declines with t, at a decreasing marginal rate.
Figure 1 also plots the MSE of the average of T different players’ first estimates (in dark grey), showing that aggregating across individuals works substantially better than aggregating judgements from the same individual. The ‘outer crowd’ MSE declines with the number of estimates, but at a much faster rate than the MSE of the inner crowd.
To more formally compare the wisdom of the inner and the outer crowd, we define $T_t^*$ as the number of estimates one needs to average across individuals to achieve the same squared error as the squared error that results from averaging t estimates from a single individual (see Methods)33.
Depending on the sample that we use, $T_5^*$ varies between 1.44 and 1.66, and $T_{10}^*$ varies between 1.63 and 1.96. This implies that averaging five or ten estimates from the same individual is, in expectation, inferior to averaging two estimates from randomly selected individuals.
Aggregating even more estimates yields hardly any additional benefits. The MSE of the inner crowd can be approximated by the hyperbolic function $\mathrm{MSE}(IC_t) = a/t + b$, where a represents the average individual variance and b represents the average individual squared error (see Methods)33. Integrating an infinite number of estimates from a single individual therefore yields MSE = b in expectation. The number of estimates needed to obtain this MSE by aggregating across individuals, $T_\infty^*$, varies between 1.59 and 2.06 across the samples. Hence, the expected potential benefit from within-person aggregation barely exceeds the expected benefit from aggregating the judgements of two randomly selected individuals.
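As an illustration of the hyperbolic approximation, the sketch below fits a curve of the form a/t + b by ordinary least squares. The form is an assumption chosen to be consistent with the limit MSE = b stated above. The input values are hypothetical, and the fitting routine is ours rather than the authors' procedure.

```python
def fit_hyperbola(ts, mses):
    """Least-squares fit of mse = a / t + b, which is linear in z = 1 / t."""
    zs = [1.0 / t for t in ts]
    n = len(zs)
    z_mean = sum(zs) / n
    y_mean = sum(mses) / n
    cov_zy = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, mses))
    var_z = sum((z - z_mean) ** 2 for z in zs)
    a = cov_zy / var_z
    b = y_mean - a * z_mean
    return a, b

def share_of_max_gain(t):
    """Fraction of the total attainable MSE reduction (a) reached after t estimates."""
    return 1 - 1 / t

# Hypothetical inner-crowd MSEs for t = 1..10 (illustrative, not the study's data).
ts = list(range(1, 11))
mses = [1.0 / t + 2.0 for t in ts]

a, b = fit_hyperbola(ts, mses)
print(f"a = {a:.3f} (removable by averaging), b = {b:.3f} (limit for t -> infinity)")
print(f"Share of the attainable reduction reached at t = 2: {share_of_max_gain(2):.0%}")
```

Under this functional form, half of the attainable reduction is already realized with two estimates, and the remaining error b cannot be averaged away by the same person.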
Figure 1 also shows the MSE of the jth individual estimate (in light grey). Throughout the competitions, no information was revealed about the contents of the container, but players could potentially improve their estimates over time by using the power of aggregation. Communication was not restricted, and players therefore had the opportunity to aggregate not only their own estimates but also those of their peers. Earlier research indicates that people underestimate the merits of averaging judgements across individuals50,51, and that they do not average their own estimates as often as they ideally should32,34,52. The patterns of the MSE of individual consecutive estimates in our guessing competitions are in line with these findings: estimates improve over time, but the improvements do not match the improvements that could have been obtained by averaging. Of course, the decreasing MSE can also be the consequence of other forms of learning, such as better approaches and better comprehension.
In the previous analyses, the benefit of aggregating estimates from the same person may partly derive from such learning effects. For practical purposes, the exact sources and their contributions to the gain from within-person aggregation are unimportant, but here we are also interested in the strength of within-person aggregation in the absence of learning. Therefore, we have analogously investigated the pattern of the MSE when the first K estimates from the same person are aggregated in a random order (Supplementary Fig. 11). To ensure an equal base of comparison, we similarly used all first K estimates to determine the MSEs of between-person aggregation, not just the very first ones as we previously did. Depending on the sample, with random ordering, $T_5^*$ varies between 1.34 and 1.41, $T_{10}^*$ between 1.43 and 1.49, and $T_\infty^*$ between 1.46 and 1.57. Hence, the ‘pure’ within-person aggregation benefit is considerably lower than the benefit of aggregating two judgements from different individuals.
When learning effects are absent, the benefit of within-person aggregation relative to between-person aggregation is entirely driven by the degree to which the variation in estimates is due to variation within individuals (random noise) versus variation in individual-level systematic error (idiosyncratic bias). Aggregating multiple estimates from a single individual eliminates the influence of random noise only, whereas aggregating across different individuals eliminates the influence of both random noise and idiosyncratic bias. If we express the error of the jth estimate of person i, $x_{i,j}$, as an additive function of the overall bias in the population $\mu$, idiosyncratic bias $u_i$ and random noise $v_{i,j}$ (that is, $x_{i,j} = \mu + u_i + v_{i,j}$), and assume that $u_i \sim N(0, \tau^2)$ and $v_{i,j} \sim N(0, \sigma^2)$, then $T_\infty^* = 1 + \sigma^2/\tau^2$ (see Methods). Hence, the previous estimates of 1.46–1.57 for $T_\infty^*$ imply that the variance of idiosyncratic bias (across individuals; $\tau^2$) is about twice as large as the variance of random noise (within individuals; $\sigma^2$). Direct estimations of those variances for each of the various subsamples confirm this ratio and the values of $T_\infty^*$ (Supplementary Table 2).
When we estimate the two variances across all entries of all participants for each of the three competitions, the implied values of $T_\infty^*$ range between 1.36 and 1.45 (Supplementary Table 3). Again, aggregating estimates from a single individual clearly fails to approach the benefit of aggregating estimates from only two randomly selected individuals.
Previous studies show that the accuracy gain from within-person aggregation is higher if people are asked to base their second estimate on different knowledge or assumptions than their first31,34,36. Such new perspectives happen naturally when people forget, and it has indeed been observed that accuracy gains are larger for individuals with lower working memory spans53 and increase with the delay between estimates14. However, the beneficial effect of delay was not found in a pre-registered replication study 54.
We exploit the variation in the timing between players’ first and second estimates to investigate the effect of delay on the benefit of aggregation. Because this variation happened naturally and was therefore not exogenously imposed, the results need to be interpreted with some caution. To quantify the benefit of aggregation, we define a participant’s accuracy gain as the resulting percentage decrease of the squared error (squared error of the average of the estimates relative to the average squared error of the individual estimates). Figure 2a shows that the accuracy gain increases almost monotonically with the delay. For two estimates provided at a single point in time—a participant could enter up to five estimates simultaneously—the average accuracy gain from aggregation is 16–18%. For estimates submitted more than 5 weeks apart, the average accuracy gain is approximately 30%.
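The accuracy gain defined above can be written out for a single participant as follows. This is a minimal sketch that takes the participant's two log-ratio errors as input. The function name and the example values are ours, and how gains are averaged across participants is not specified here.

```python
def accuracy_gain(error_1, error_2):
    """Percentage decrease in squared error obtained by averaging two estimates.

    Compares the squared error of the average with the average squared error
    of the two individual estimates. Inputs are errors relative to the true
    value (zero means exactly right).
    """
    avg_individual_se = (error_1 ** 2 + error_2 ** 2) / 2
    if avg_individual_se == 0:
        return 0.0  # both estimates are exact, so there is nothing to gain
    se_of_average = ((error_1 + error_2) / 2) ** 2
    return 100 * (1 - se_of_average / avg_individual_se)

# Hypothetical pairs of errors.
print(accuracy_gain(0.5, 0.5))    # identical errors: no gain
print(accuracy_gain(1.0, 0.0))    # partly independent errors: intermediate gain
print(accuracy_gain(0.5, -0.5))   # errors cancel exactly: maximum gain
```

The gain is zero when both estimates share the same error and reaches 100% when the two errors cancel exactly, which is why a lower correlation between estimates translates into a higher gain.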
Figure 2b indicates that the increase in accuracy gain is a consequence of the decrease in correlation between the estimates. The Pearson correlation coefficient decreases from more than 0.8 when people entered the estimates simultaneously to approximately 0.5 when multiple weeks passed between the attempts.
Two estimates are said to bracket the true value if they fall on opposite sides of it. Bracketing is an important driver of aggregation benefits, and the degree of bracketing is sometimes used as an indicator for the wisdom of crowds29,31,50. Figure 2c shows that the bracketing rate increases if estimates are made further apart in time: bracketing rates are about 15% for estimates made at a single time-point, and increase to >25% when multiple weeks passed. Overall, our data thus yield evidence of substantial delay benefits. These benefits are similar across the three independent data sets, suggesting that the advantageous effect of delay is more robust than previously thought14,54.
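A bracketing rate of the kind reported above can be computed as in the following sketch, which assumes that each pair holds a participant's first and second estimation error relative to the true value. The pairs are hypothetical and the function names are ours.

```python
def brackets(error_1, error_2):
    """True if the two errors lie on opposite sides of zero (the true value)."""
    return error_1 * error_2 < 0

def bracketing_rate(error_pairs):
    """Share of pairs in which the two estimates bracket the true value."""
    return sum(brackets(a, b) for a, b in error_pairs) / len(error_pairs)

# Hypothetical (first error, second error) pairs for four participants.
pairs = [(0.4, -0.1), (0.3, 0.2), (-0.5, -0.2), (-0.3, 0.6)]
print(f"Bracketing rate: {bracketing_rate(pairs):.0%}")
```

When a pair brackets the true value, the two errors partly cancel, so the absolute error of the average is strictly smaller than the average absolute error of the two estimates.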
Figure 3 depicts estimates for $T_5^*$ as a function of the median time between the first five estimates for players who provided five or more estimates in a given year. Across the three competitions, $T_5^*$ varies between only 1.29 and 1.44 if the median delay is no longer than half a day, and increases to values between 1.74 and 1.93 if the median delay is more than 6 days. Averaging an infinite number of estimates with a median delay of more than 6 days allows an individual to outperform the aggregated estimate of two randomly selected individuals, but not by much: $T_\infty^*$ then varies between 1.94 and 2.48. Even though delay can be used to increase the relative merit of aggregating estimates from a single individual, between-person aggregation remains substantially more powerful.
Note that in situations where decision time is limited, there is a trade-off between making additional estimates and taking more time between estimates. For example, aggregating five estimates with a median delay of only half a day or less is roughly equivalent to or better than aggregating two estimates that are made more than six days apart. Under time pressure, making multiple estimates in short succession can therefore be the better option.
As before, the above values also reflect the improvements from learning that we observed earlier. When we control for learning by aggregating estimates in a random order, we still observe delay benefits, and as expected, the values are lower (Supplementary Fig. 12). For the category with the longest median delay, $T_\infty^*$ decreases to values between 1.65 and 1.75. However, these values need not reflect the full potential of within-person aggregation, because a median delay of more than six days does not guarantee that the correlation between estimates has reached its minimum. Indeed, Fig. 2b indicates that the correlation decreases with longer delays, and only stabilizes when the delay spans multiple weeks (at values of about 0.5).
To capture the maximum delay effect, we decompose the estimation error as before, but we now allow the covariance between estimates from the same person to have a delay-dependent part that declines exponentially with the duration of the delay (see Methods). Estimations of the error components on the full data sets indeed confirm that a threshold of six days is not sufficient for convergence in the covariance to occur; the delay-dependent part of the covariance halves about every eight days, meaning that it takes multiple weeks until most of it has dissipated (Supplementary Table 4). More importantly, the estimation results again show the limited efficacy of within-person aggregation; even if we fully exploit the advantageous effect of delay by allowing the delay-dependent part of the covariance to completely evaporate (which can be seen as allowing a person to take infinitely long delays between consecutive estimates), $T_\infty^*$ remains relatively low at values between 1.75 and 1.99 (Supplementary Table 4).
In conclusion, the present study finds that the effectiveness of within-person aggregation is considerably lower than that of between-person aggregation: the average of a large number of judgements from the same person is barely better than the average of two judgements from different people. The efficacy difference is a consequence of the existence of individual-level systematic errors (idiosyncratic bias). The effect of these errors can be eliminated by combining estimates from multiple people, not by combining multiple estimates from a single person.
In the context of our guessing competitions, all individuals were exposed to the same (visual) information about the container and the objects in it, and the sources of variation in idiosyncratic bias were limited to differences in individuals’ comprehension of the task, visual perception, and geometric skills. In many other real-world contexts, additional sources of idiosyncratic bias exist, which can be expected to lower the comparative benefit of within-person aggregation even more.
Within-person aggregation is potentially useful in situations where only one individual can make sufficiently informed estimates. This may be the case, for example, in strictly personal matters and under extreme degrees of specialization. Because of the relatively limited accuracy gains from within-person aggregation, between-person aggregation should be preferred whenever practicable.
The diversity prediction theorem and $T_t^*$
Estimates made by different individuals are considered to be realizations of a random variable X. The diversity prediction theorem says that the crowd’s error, or population bias, equals the average error minus the diversity in estimates. More formally, it states that the collective squared error (CSE) relative to the true value $\theta$, $\mathrm{CSE}(X) = (E(X) - \theta)^2$, equals the MSE of the individual estimates, $\mathrm{MSE}(X) = E((X - \theta)^2)$, minus the variance of the estimates, $\mathrm{VAR}(X) = E(X^2) - E(X)^2$ (refs 2,55):

$$\mathrm{CSE}(X) = \mathrm{MSE}(X) - \mathrm{VAR}(X)$$
The theorem can be used to mathematically determine the MSE of the average of T estimates from different individuals, $\bar{X}_T$ (ref. 33):

$$\mathrm{MSE}(\bar{X}_T) = \mathrm{CSE}(X) + \frac{\mathrm{VAR}(X)}{T}$$
It can also be used to compare between-person and within-person aggregation. We define $T_t^*$ as the number of estimates one needs to average across individuals to achieve the same squared error as the squared error of $IC_t$, which represents the arithmetic mean of t estimates from one individual (the inner crowd). From the above framework, it follows that33:

$$T_t^* = \frac{\mathrm{VAR}(X)}{\mathrm{MSE}(IC_t) - \mathrm{CSE}(X)}$$
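The framework above lends itself to a short numerical check. The sketch below verifies the identity CSE(X) = MSE(X) − VAR(X) on a small hypothetical sample and computes the number of different individuals needed to match a given inner-crowd MSE. It assumes that estimates from different individuals are independent, so that the MSE of the average of T of them is CSE(X) + VAR(X)/T. All names and numbers are illustrative.

```python
import math

def cse(xs, theta):
    """Collective squared error: squared error of the mean estimate."""
    mean = sum(xs) / len(xs)
    return (mean - theta) ** 2

def mse(xs, theta):
    """Mean squared error of the individual estimates."""
    return sum((x - theta) ** 2 for x in xs) / len(xs)

def var(xs):
    """Population variance, E(X^2) - E(X)^2."""
    mean = sum(xs) / len(xs)
    return sum(x * x for x in xs) / len(xs) - mean ** 2

def outer_crowd_mse(xs, theta, T):
    """Expected MSE of the average of T estimates from different individuals."""
    return cse(xs, theta) + var(xs) / T

def t_star(xs, theta, mse_inner):
    """Number of different individuals whose average matches an inner-crowd MSE."""
    return var(xs) / (mse_inner - cse(xs, theta))

# Hypothetical transformed estimates; the true value corresponds to theta = 0.
xs = [-1.0, 0.5, 1.0, 2.5]
theta = 0.0

# Diversity prediction theorem: collective error = average error - diversity.
assert math.isclose(cse(xs, theta), mse(xs, theta) - var(xs))

# An inner crowd that is exactly as accurate as two different people has T* = 2.
print(t_star(xs, theta, outer_crowd_mse(xs, theta, 2)))
```

Values of `mse_inner` close to the collective squared error give large values of `t_star`, whereas values close to the individual MSE give values near one.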
Estimation error decomposition and $T_\infty^*$
In the absence of learning, the error of the jth estimate of individual i, $x_{i,j}$, can be decomposed into population bias $\mu$, idiosyncratic bias $u_i$, and random noise $v_{i,j}$:

$$x_{i,j} = \mu + u_i + v_{i,j}$$
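We assume that $u_i \sim N(0, \tau^2)$ and $v_{i,j} \sim N(0, \sigma^2)$. The MSE of the average of T estimates from different individuals, $\bar{X}_T$, is then given by:

$$\mathrm{MSE}(\bar{X}_T) = \mu^2 + \frac{\tau^2 + \sigma^2}{T}$$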
and the MSE of the arithmetic mean of t estimates from one individual, $IC_t$, is then given by:

$$\mathrm{MSE}(IC_t) = \mu^2 + \tau^2 + \frac{\sigma^2}{t}$$
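$T_\infty^*$ is the number of estimates needed to average across individuals to achieve the same squared error as the squared error of the average of an infinite number of estimates from one individual. Equating $\mathrm{MSE}(\bar{X}_T)$ and $\mathrm{MSE}(IC_t)$, and solving for T if $t \to \infty$, gives:

$$T_\infty^* = \frac{\tau^2 + \sigma^2}{\tau^2} = 1 + \frac{\sigma^2}{\tau^2}$$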
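The decomposition can be turned into a small calculator. In the sketch below, the expressions for the two MSEs are derived from the additive error model under the assumption that idiosyncratic bias and noise are independent with mean zero. The function names and the parameter values (bias variance twice the noise variance, as suggested by the random-order results) are ours.

```python
import math

def mse_outer(mu, tau2, sigma2, T):
    """MSE of the average of T estimates from T different individuals."""
    return mu ** 2 + (tau2 + sigma2) / T

def mse_inner(mu, tau2, sigma2, t):
    """MSE of the average of t estimates from a single individual."""
    return mu ** 2 + tau2 + sigma2 / t

def t_star_from_variances(tau2, sigma2, t=None):
    """Outer-crowd size that matches an inner crowd of size t.

    With t=None the limit for an infinite number of estimates is returned.
    """
    if t is None:
        return 1 + sigma2 / tau2
    return (tau2 + sigma2) / (tau2 + sigma2 / t)

# Hypothetical variances: idiosyncratic bias varies twice as much as noise.
tau2, sigma2 = 2.0, 1.0
for t in (5, 10, None):
    label = "infinity" if t is None else t
    print(f"t = {label}: T* = {t_star_from_variances(tau2, sigma2, t):.2f}")
```

The population bias does not enter the result because it affects the inner and the outer crowd equally. Only the ratio of noise variance to bias variance matters.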
Estimation error decomposition with delay-dependent covariance
We modify the error decomposition to allow for delay-dependent individual-level noise. Estimation errors can be decomposed into population bias $\mu$, individual-level bias $u_i$ that remains irrespective of the delay, and delay-dependent individual-level noise $v_{i,j}$:

$$x_{i,j} = \mu + u_i + v_{i,j}$$
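We assume that $u_i \sim N(0, \tau^2)$ and $v_i \sim N(0, \Sigma)$, where $\Sigma$ is a variance–covariance matrix with constant variance $\sigma^2$ and delay-dependent covariances:

$$\Sigma_{j,j'} = \begin{cases} \sigma^2 & \text{if } j = j' \\ \sigma^2 f(\Delta(j,j')) & \text{if } j \neq j' \end{cases}$$

We assume that $f(\Delta(j,j'))$ decays exponentially with the delay $\Delta(j,j')$ between estimates $x_{i,j}$ and $x_{i,j'}$ from the same individual:

$$f(\Delta(j,j')) = \delta \exp(-\lambda \Delta(j,j'))$$

where $\lambda$ determines the speed of decay, and $(1 - \delta)$ allows for a discontinuous jump such that estimates provided simultaneously are not required to be perfectly correlated. The half-life of the delay-dependent covariance, $t_{1/2}$, is:

$$t_{1/2} = \frac{\ln 2}{\lambda}$$

The (overall) covariance between two estimates $x_{i,j}$ and $x_{i,j'}$ from the same individual is then given by:

$$\mathrm{Cov}(x_{i,j}, x_{i,j'}) = \tau^2 + \sigma^2 \delta \exp(-\lambda \Delta(j,j'))$$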
which converges to $\tau^2$ if $\Delta(j,j') \to \infty$. $T_\infty^*$ then converges to $1 + \sigma^2/\tau^2$, which is the highest possible value of $T_\infty^*$ that can be obtained by exploiting the benefits of delay.
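To make the delay mechanism concrete, the sketch below assumes one specific exponential form for the decay, f(Δ) = δ·exp(−λΔ), and a within-person covariance of τ² + σ²·f(Δ). Both are assumptions consistent with the description above, not a confirmed transcription of the authors' equations. The parameter values are hypothetical, apart from the half-life of about eight days reported in the main text.

```python
import math

def delay_factor(delay_days, lam, delta):
    """Assumed decay function f: delta * exp(-lam * delay)."""
    return delta * math.exp(-lam * delay_days)

def half_life(lam):
    """Delay after which the delay-dependent covariance has halved."""
    return math.log(2) / lam

def within_person_covariance(delay_days, tau2, sigma2, lam, delta):
    """Covariance between two estimates from the same person."""
    return tau2 + sigma2 * delay_factor(delay_days, lam, delta)

def within_person_correlation(delay_days, tau2, sigma2, lam, delta):
    """Correlation between two estimates from the same person."""
    cov = within_person_covariance(delay_days, tau2, sigma2, lam, delta)
    return cov / (tau2 + sigma2)

# Hypothetical parameters, with lam chosen to give a half-life of eight days.
tau2, sigma2, delta = 2.0, 2.0, 0.8
lam = math.log(2) / 8

for days in (0, 8, 35, 10_000):
    rho = within_person_correlation(days, tau2, sigma2, lam, delta)
    print(f"delay = {days:>6} days: correlation = {rho:.2f}")
```

With these illustrative values the correlation falls from 0.9 for simultaneous estimates towards a floor of τ²/(τ² + σ²) = 0.5. The part of the covariance that stems from idiosyncratic bias does not decay with delay.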
Life Sciences Reporting Summary
Further information on experimental design is available in the Life Sciences Reporting Summary.
The code used to generate the results in this study is available in the Supplementary Information.
The data used in this study are from Holland Casino. In accordance with the Dutch Personal Data Protection Act, the data were provided in pseudonymized form, under non-disclosure agreements, and for scientific purposes only. Because of the non-disclosure agreements, the data are not publicly available. For reproducibility, the authors will archive the data on a secure VU Amsterdam server for at least five years after publication (contact: D.v.D.).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank Holland Casino for providing the data, and A. Baillon, S. Herzog, A. Lucas, L. Molleman, A. Opschoor, R. Potter van Loon, V. Spinu, and L. Wolk for their constructive and valuable comments. The paper has benefited from discussions with seminar participants at the Max Planck Institute for Human Development, Carnegie Mellon University and the University of Nottingham, and with participants of the 2015 NIBS workshop, SPUDM 2015 Budapest, WESSI 2016 Abu Dhabi, IMEBESS 2016 Rome, TIBER 2016 Tilburg and BFWG 2017 London. We gratefully acknowledge support from the Netherlands Organisation for Scientific Research (NWO) and from the Economic and Social Research Council via the Network for Integrated Behavioural Sciences (ES/K002201/1). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.