## Main

Can social scientists predict societal change? Governments and the general public often rely on experts, on the basis of a general belief that they make better judgements and predictions of the future in their domain of expertise. The media also seek out experts to render their judgements and opinions about what to expect in the future1,2. Yet research on predictions in many domains suggests that experts may not be better than purely stochastic models in predicting the future. For example, portfolio managers (who are paid for their expertise) do not outperform the stock market in their predictions3. Similarly, in the domain of geopolitics, experts often perform at chance levels when forecasting occurrences of specific political events4. On the basis of these insights, one might expect that experts would find it difficult to accurately predict societal change.

At the same time, social science researchers have developed rich, empirically grounded models to explain social science phenomena. By examining sampled data, social scientists strive to develop theoretical models about causal mechanisms that, in ideal cases, reliably describe human behaviour and societal processes5. Therefore, it is possible that explanatory models afford social science experts an advantage in predicting social phenomena in their domain of expertise. Here we test these possibilities, examining the overall predictability of trends in social phenomena such as political polarization, racial bias or well-being, and whether experts in social science are better able to predict those trends than non-experts.

Prior forecasting initiatives have not fully addressed this question for two reasons. First, forecasting initiatives with subject matter experts have focused on examining the probability of occurrence for specific one-time events4,6 rather than the accuracy of ex ante predictions of societal change over multiple units of time. In a sense, predicting events in the future (ex ante) is the same as predicting events that have already happened, as long as the experts (the research participants) don’t know the outcome. Yet, there are reasons to think that future prediction is different in an important way. Consider stock prices: participants could predict stock returns for stocks in the past, except that they know many other things that have happened (conflicts, bubbles, Black Swans, economic trends, consumption trends and so on). Post hoc, those making predictions have access to the temporal variance or occurrence for each of these variables and hence are more likely to be successful in ex post predictions. Predictions about past events thus end up being more about testing people’s explanations rather than their predictions per se. Moreover, all other things being equal, the likelihood of a prediction regarding a one-off event being accurate is by default higher than that of a prediction regarding societal change across an extended period. Binary predictions for the one-off event do not require accuracy in estimating the degree of change or the shape of the predicted time series, which are extra challenges in forecasting societal change.

The second reason is that past research on forecasting has concentrated on predicting geopolitical4 or economic events7 rather than broader societal phenomena. Thus, in contrast to systematic studies concerning the replicability of in-sample explanations of social science phenomena8, out-of-sample prediction accuracy in the social sciences remains understudied9,10. Similarly, little is known about the rationales and approaches that social scientists use to make predictions for societal trends. For example, are social scientists more apt to rely on data-driven statistical methods or on theory and intuitions when generating such predictions?

To address these unknowns, we performed a standardized evaluation of forecasting accuracy9 among social scientists in well-studied domains for which systematic, cross-temporal data are available—namely, subjective well-being, racial bias, ideological preferences, political polarization and gender–career bias. With the onset of the COVID-19 pandemic as a backdrop, we selected these domains on the basis of data availability and theoretical links to the pandemic. Prior research has suggested that each of these domains may be impacted by infectious disease11,12,13,14 or pandemic-related social isolation15. To understand how scientists made predictions in these domains, we documented the rationales and processes they used to generate forecasts, and we then examined how different methodological choices were related to accuracy.

## Research overview

We present results from two forecasting tournaments conducted through the Forecasting Collaborative—a crowdsourced initiative among scientists interested in ex ante testing of their theoretical or data-driven models. Examining performance across two tournaments allowed us to test the stability of forecasting accuracy in the context of unfolding societal events and to investigate how social scientists recalibrate their models and incorporate new data when asked to update their forecasts.

The Forecasting Collaborative was open to behavioural, social and data scientists from any field who wanted to participate in the tournament and were willing to provide forecasts over 12 months (May 2020 to April 2021) as part of the initial tournament and, upon receiving feedback on initial performance, again after 6 months for a follow-up tournament (the recruitment details are in the Methods, and the demographic information is in Supplementary Table 1). To ensure a “common task framework”9,16,17, we provided all participating teams with the same time series data for the United States for each of the 12 variables related to the phenomena of interest (that is, life satisfaction, positive affect, negative affect, support for Democrats, support for Republicans, political polarization, explicit and implicit attitudes towards Asian Americans, explicit and implicit attitudes towards African Americans, and explicit and implicit associations between gender and specific careers).

The participating teams received historical data that spanned 39 months (January 2017 to March 2020) for Tournament 1 and data that spanned 45 months for Tournament 2 (January 2017 to September 2020), which they could use to inform their forecasts for the future values of the same time series. Teams could select up to 12 domains to forecast, including domains for which team members reported a track record of peer-reviewed publications as well as domains for which they did not possess relevant expertise (see the Methods for the multi-stage operationalization of expertise). By including social scientists with expertise in different subject matters, we could examine how such expertise may contribute to forecasting accuracy above and beyond general training in the social sciences. The teams were not constrained in terms of the methods used to generate time-point forecasts. They provided open-ended, free-text responses for the descriptions of the methods used, which were coded later. If they used data-driven methods, they also provided the model and any additional data used to generate their forecasts (Methods). We also collected data on team size and composition, area of research specialization, subject domain and forecasting expertise, and prediction confidence.

We benchmarked forecasting accuracy against several alternatives. First, we evaluated whether social scientists’ forecasts in Tournament 1 were better than the wisdom of the crowd (that is, the average forecasts of a sample of lay participants recruited from Prolific). Second, we compared social scientists’ performance in both tournaments with naive random extrapolation algorithms (that is, the average of historical data, random walks and estimates based on linear trends). Finally, we systematically evaluated the accuracy of different forecasting strategies used by the social scientists in our tournaments, as well as the effect of expertise.

## Results

Following the a priori outlined analytic plan (https://osf.io/7ekfm; the details are in the Supplementary Methods) to determine forecasting accuracy across domains, we examined the mean absolute scaled error (MASE)18 across forecasted time points for each domain. The MASE is an asymptotically normal, scale-independent scoring rule that compares predicted values against the predictions of a one-step random walk. Because it is scale independent, it is an adequate measure when comparing accuracy across domains on different scales. A MASE of 1 reflects a forecast that is as good out of sample as the naive one-step random walk forecast is in sample. A MASE below 1.76 is superior to median performance in prior large-scale data science competitions7. See the Supplementary Information for further details of the MASE method.
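As an illustration of the scoring rule described above, the following minimal sketch computes a MASE for a single forecast. It assumes the conventional formulation in which the out-of-sample mean absolute error is divided by the in-sample mean absolute error of a one-step random walk; the function name `mase` and the example values are hypothetical and are not taken from the tournament materials.

```python
from typing import Sequence


def mase(history: Sequence[float],
         actual: Sequence[float],
         forecast: Sequence[float]) -> float:
    """Mean absolute scaled error of `forecast` against `actual`.

    The scaling term is the in-sample mean absolute error of a one-step
    random walk, i.e. each historical value predicted by its predecessor.
    """
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and equal in length")
    if len(history) < 2:
        raise ValueError("need at least two historical observations")

    # In-sample error of the naive one-step random walk (denominator).
    naive_errors = [abs(b - a) for a, b in zip(history, history[1:])]
    scale = sum(naive_errors) / len(naive_errors)
    if scale == 0:
        raise ValueError("history is constant, so the MASE is undefined")

    # Out-of-sample error of the forecast being scored (numerator).
    forecast_mae = sum(abs(f - y) for f, y in zip(forecast, actual)) / len(actual)
    return forecast_mae / scale


# Hypothetical monthly values, for illustration only.
history = [10.0, 12.0, 11.0, 13.0]
actual = [14.0, 15.0]
forecast = [13.0, 13.0]
print(mase(history, actual, forecast))  # roughly 0.9: slightly better than the naive scale
```

Under this formulation, values above 1 mean that the forecast erred more out of sample than the random walk did in sample.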

In addition to absolute accuracy, we assessed the comparative accuracy of social scientists’ forecasts using several benchmarks. First, during Tournament 1, we obtained forecasts from a non-expert crowdsourced sample of US residents (N = 802) via Prolific19 who received the same data as the tournament participants and filled out an identically structured survey to provide a wisdom-of-the-(lay-)crowd benchmark. Second, for both tournaments, we simulated three different data-based naive approaches to out-of-sample forecasting using the time series data provided to the tournament participants: (1) the historical mean, calculated by randomly resampling the historical time series data; (2) a naive random walk, calculated by randomly resampling historical change in the time series data with an autoregressive component; and (3) extrapolation from linear regression, based on a randomly selected interval of the historical time series data (see the Supplementary Information for the details). This latter approach captures the expected range of predictions that would have resulted from random, uninformed use of historical data to make out-of-sample predictions (as opposed to the naive in-sample predictions that form the basis of MASE scores).
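The three naive benchmarks can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the procedure used in the tournaments: the exact resampling schemes are described in the Supplementary Information, and the function names, the bootstrap with replacement, the cumulative treatment of resampled changes and the minimum interval length are all choices made here for the example.

```python
import random
from statistics import mean


def historical_mean_forecast(history, horizon, rng):
    """Resample the history with replacement and repeat the resampled mean."""
    sample = [rng.choice(history) for _ in history]
    return [mean(sample)] * horizon


def random_walk_forecast(history, horizon, rng):
    """Start at the last observation and add resampled historical changes.

    Each step builds on the previous forecast, which is the assumed
    autoregressive component of this sketch.
    """
    changes = [b - a for a, b in zip(history, history[1:])]
    path, level = [], history[-1]
    for _ in range(horizon):
        level += rng.choice(changes)
        path.append(level)
    return path


def linear_trend_forecast(history, horizon, rng, min_len=3):
    """Fit least squares to a random contiguous interval and extrapolate."""
    n = len(history)
    length = rng.randint(min_len, n)      # interval length (assumed rule)
    start = rng.randint(0, n - length)    # interval start
    xs = list(range(start, start + length))
    ys = history[start:start + length]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # Forecast the time points that follow the end of the full history.
    return [intercept + slope * (n + h) for h in range(horizon)]


rng = random.Random(2020)  # seeded only to make the illustration repeatable
history = [3.1, 3.4, 3.0, 3.6, 3.8, 3.5, 3.9]  # hypothetical series
for simulate in (historical_mean_forecast, random_walk_forecast, linear_trend_forecast):
    print(simulate.__name__, simulate(history, 3, rng))
```

Repeating each simulation many times would give a distribution of naive forecasts per domain, which is the sense in which these benchmarks reflect random, uninformed use of the historical data.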

### How accurate were behavioural and social scientists at forecasting?

Figure 1 shows that in Tournament 1, social scientists’ forecasts were, on average, inferior to in-sample random walks in nine domains. In seven domains, social scientists’ forecasts were inferior to median performance in prior forecasting competitions (Supplementary Fig. 1 shows the raw estimates; Supplementary Fig. 2 reports measures of uncertainty around the estimates). In Tournament 2, the forecasts were on average inferior to in-sample random walks in eight domains and inferior to median performance in prior forecasting competitions in five domains. Even winning teams were still less accurate than in-sample random walks for 8 of 12 domains in Tournament 1 and one domain (Republican support) in Tournament 2 (Supplementary Tables 1 and 2 and Supplementary Figs. 4–9). One should note that inferior performance to the in-sample random walk (MASE > 1) may not be too surprising; errors of the in-sample random walk in the denominator concern historical observations that occurred before the pandemic, whereas the accuracy of scientific forecasts in the numerator concerns the data for the first pandemic year. However, average forecasting accuracy did not generally beat more liberal benchmarks such as the median MASE in data science tournaments (1.76)7 or the benchmark MASE for ‘good’ forecasts in the tourism industry (Supplementary Information). Except for one team, the top forecasters from Tournament 1 did not appear among the winners of Tournament 2 (Supplementary Tables 1 and 2).

We examined the accuracy of scientific and lay forecasts in a linear mixed-effect model. To systematically compare results for different forecasted domains, we tested a full model with expertise (social scientist versus lay crowd), domain and their interaction as predictors, and log(MASE) scores nested in participants. We observed no significant main effect difference between the accuracy of social scientists and that of lay crowds (F(1, 1,747) = 0.88, P = 0.348, partial R2 < 0.001). However, we observed a significant interaction between social science training and domain (F(11, 1,304) = 2.00, P = 0.026). Simple effects show that social scientists were significantly more accurate than lay people when forecasting life satisfaction, polarization, and explicit and implicit gender–career bias. However, the scientific teams were no better than the lay sample in the remaining eight domains (Fig. 1 and Table 1). Moreover, Bayesian analyses indicated that only for life satisfaction is there substantial evidence in favour of the difference, whereas for eight domains the evidence was in favour of the null hypothesis. See the Supplementary Information for further details and the interpretation of the multiverse analyses of domain-general accuracy.

### Cross-validation of domain-general accuracy via forecast-versus-trend comparisons

The most elementary analysis of domain-general accuracy involves inspecting trends for each group and comparing them against the ground truth and historical time series in each domain. Figure 2 allows us to inspect individual trends of social scientists and the naive crowd per domain in Tournament 1, along with historical and ground truth markers for each domain. For social scientists, one can observe the diversity of forecasts from individual teams (light blue) along with a lowess regression and 95% confidence interval (CI) around the trend (blue). For the naive crowd, one can see an equivalent lowess trend and the 95% CI around it (salmon). In half of the domains—explicit bias against African Americans, implicit bias against Asian Americans, negative affect, life satisfaction, and support for Democrats and Republicans—lowess curves from both groups were overlapping, suggesting that the estimates from both social scientists and the naive crowd were identical. Moreover, except for the domain of life satisfaction, the forecasts of scientists and the naive crowd were far off the mark vis-à-vis ground truth. In one further domain—implicit bias against African Americans—the naive crowd estimate was in fact closer to the ground truth marker than the estimate from the lowess curve of the social scientists. In the other five domains, which concerned explicit and implicit gender–career bias, explicit bias against Asian Americans, positive affect and political polarization, social scientists’ forecasts were closer to the ground truth markers than those of the naive crowd. We note, however, that these visual inspections may be somewhat misleading because the CIs don’t correct for multiple tests. This caveat aside, the overall message remains consistent with the results of the statistical tests above: for most domains, social scientists’ predictions were either similar to or worse than the naive crowd’s predictions.

### Comparisons with naive statistical benchmarks

Next, we compared scientific forecasts against three naive statistical benchmarks by creating benchmark/forecast ratio scores (a ratio of 1 indicates that the social scientists’ forecasts were equal in accuracy to the benchmarks, and ratios greater than 1 indicate greater accuracy). To account for interdependence of social scientists’ forecasts, we examined estimated ratio scores for domains from linear mixed models, with responses nested in forecasting teams. To reduce the likelihood that social scientists’ forecasts beat naive benchmarks by chance, our main analyses focused on performance across all three benchmarks (see the Supplementary Information for the rationale favouring this method over averaging across the three benchmarks), and we adjusted the CIs of the ratio scores for simultaneous inference of 12 domains in each tournament by simulating a multivariate t distribution20. Figures 1 and 3 and Supplementary Fig. 2 show that social scientists in Tournament 1 were significantly better than each of the three benchmarks in only 1 out of 12 domains, which concerned explicit gender–career bias (1.53 < ratio ≤ 1.60, 1.16 < 95% CI ≤ 2.910). In the remaining 11 domains, scientific predictions were either no different from or worse than the benchmarks. The relative advantage of scientific forecasts over the historical mean and random walk benchmarks was somewhat larger in Tournament 2 (Supplementary Fig. 1). Scientific forecasts were significantly more accurate than the three naive benchmarks in 5 out of 12 domains. These domains reflected explicit racial bias (African American bias, 2.20 < ratio ≤ 2.86, 1.55 < 95% CI ≤ 4.05; Asian American bias, 1.39 < ratio ≤ 3.14, 1.01 < 95% CI ≤ 4.40) and implicit racial and gender–career biases (African American bias, 1.35 < ratio ≤ 2.00, 1.35 < 95% CI ≤ 2.78; Asian American bias, 1.36 < ratio ≤ 2.73, 1.001 < 95% CI ≤ 3.71; gender–career bias, 1.59 < ratio ≤ 3.22, 1.15 < 95% CI ≤ 4.46). 
In the remaining seven domains, the forecasts were not significantly different from the naive benchmarks. Moreover, as Fig. 3 shows, scientific forecasts for political polarization in Tournament 2 were significantly less accurate than estimates from a naive linear regression (ratio = 0.51; 95% CI, (0.38, 0.68)). Figure 3 also shows that in most domains at least one of the naive forecasting methods produced errors that were comparable to or less than those of social scientists’ forecasts (11 out of 12 in Tournament 1 and 8 out of 12 in Tournament 2).
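The logic of the benchmark/forecast ratio scores can be illustrated with a short sketch. It covers only the point estimates: the analyses reported here additionally rely on linear mixed models and CIs adjusted for simultaneous inference, which the sketch does not attempt. The function names and the MASE values are hypothetical.

```python
def ratio_scores(forecast_mase, benchmark_mases):
    """Benchmark error divided by forecast error, per naive benchmark.

    A ratio above 1 means that the forecast had less error than the benchmark.
    """
    if forecast_mase <= 0:
        raise ValueError("forecast MASE must be positive")
    return {name: value / forecast_mase for name, value in benchmark_mases.items()}


def beats_all_benchmarks(ratios):
    """True only if the forecast outperformed every benchmark."""
    return all(ratio > 1 for ratio in ratios.values())


# Hypothetical MASE values for one domain, for illustration only.
benchmarks = {"historical_mean": 3.0, "random_walk": 2.5, "linear_regression": 1.0}
ratios = ratio_scores(2.0, benchmarks)
print(ratios)                        # ratios of 1.5, 1.25 and 0.5
print(beats_all_benchmarks(ratios))  # False: the linear regression had less error
```

Requiring a forecast to beat all three benchmarks is the stricter criterion; a forecast can exceed the average of the benchmarks while still losing to one of them, as in this example.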

To compare social scientists’ forecasts against the average of the three naive benchmarks, we fit a linear mixed model with forecast/benchmark ratio scores nested in forecasting teams and examined the estimated means for each domain. In Tournament 1, scientists performed better than the average of the naive benchmarks in only three domains, which concerned political polarization (95% CI, (1.06, 1.63)), explicit gender–career bias (95% CI, (1.23, 1.95)) and implicit gender–career bias (95% CI, (1.17, 1.83)). In Tournament 2, social scientists performed better than the average of the naive benchmarks in seven domains (1.07 < 95% CIs ≤ 2.79), but they were statistically indistinguishable from the average of the naive benchmarks when forecasting four of the remaining five domains: ideological support for Democrats (95% CI, (0.76, 1.17)) and for Republicans (95% CI, (0.98, 1.51)), explicit gender–career bias (95% CI, (0.96, 1.52)), and negative affect on social media (95% CI, (0.82, 1.25)). Moreover, in Tournament 2, social scientists’ forecasts of political polarization were inferior to the average of the naive benchmarks (95% CI, (0.58, 0.89)). Overall, social scientists tended to do worse than the average of the three naive statistical benchmarks in Tournament 1. While scientists did better than the average of the naive benchmarks in Tournament 2, this difference in overall performance was small (mean forecast/benchmark inaccuracy ratio, 1.43; 95% CI, (1.26, 1.62)). Moreover, in most domains, at least one of the naive benchmarks was on par with if not more accurate than social scientists’ forecasts.

### Which domains were harder to predict?

Figure 4 shows that some societal trends were significantly harder to forecast than others (Tournament 1: F(11,295.69) = 41.88, P < 0.001, R2 = 0.450; Tournament 2: F(11,469.49) = 26.87, P < 0.001, R2 = 0.291). Forecast accuracy was the lowest in politics (underestimating Democratic support, Republican support and political polarization), well-being (underestimating life satisfaction and negative affect on social media) and racial bias against African Americans (overestimating; also see Supplementary Fig. 1). Differences in forecast accuracy across domains did not correspond to differences in the quality of ground truth markers: on the basis of the sampling frequency and representativeness of the data, most reliable ground truth markers concerned societal change in political ideology, obtained via an aggregate of multiple nationally representative surveys by reputable pollsters, yet this domain was among the most difficult to forecast. In contrast, some of the least representative markers concerned racial and gender bias, which came from Project Implicit—a volunteer platform that is subject to self-selection bias—yet these domains were among the easiest to forecast. In a similar vein, both life satisfaction and positive affect on social media were estimated via texts on Twitter, even though forecasting errors between these domains varied. Though measurement imprecision undoubtedly presents a challenge for forecasting, it is unlikely to account for between-domain variability in forecasting errors (Fig. 4).

Domain differences in forecasting accuracy corresponded to differences in the complexity of historical data: domains that were more variable in terms of standard deviation and mean absolute difference (MAD) of historical data tended to have more forecasting error (as measured by the rank-order correlation between median inaccuracy scores across teams and variability scores for the same domain) (Tournament 1: ρ(s.d.) = 0.19, ρ(MAD) = 0.20; Tournament 2: ρ(s.d.) = 0.48, ρ(MAD) = 0.36), and domain changes in the variability of historical data across tournaments corresponded to changes in accuracy (ρ(s.d.) = 0.27, ρ(MAD) = 0.28).
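The rank-order correlation used above can be sketched as follows: a Spearman ρ computed as the Pearson correlation of average ranks, applied to one variability score and one median inaccuracy score per domain. The helper names and the four data points are hypothetical and serve only to show the calculation.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman rank-order correlation."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical per-domain scores: historical s.d. and median MASE across teams.
historical_sd = [0.2, 0.5, 0.9, 1.4]
median_mase = [1.1, 2.0, 1.6, 3.2]
print(spearman(historical_sd, median_mase))  # approximately 0.8
```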

### Comparison of accuracy across tournaments

Forecasting error was higher in the first tournament than in the second tournament (Fig. 4) (F(1, 889.48) = 64.59, P < 0.001, R2 = 0.063). We explored several possible differences between the tournaments that may account for this effect. One possibility is that the characteristics of teams differed between tournaments (such as team size, gender, number of forecasted domains, field specialization and team diversity, number of PhDs on a team, and prior experience with forecasting). However, the difference between the tournaments remained equally pronounced when we ran parallel analyses with team characteristics as covariates (F(1, 847.79) = 90.45, P < 0.001, R2 = 0.062).

Another hypothesis is that forecasts for 12 months (Tournament 1) include further-removed data points than forecasts for 6 months (Tournament 2), and the greater temporal distance between the tournament and the moment to forecast resulted in greater inaccuracy in Tournament 1. To test this hypothesis, we zeroed in on Tournament 1 inaccuracy scores for the first and the last six months, while including domain type as a control dummy variable. By focusing on Tournament 1 data, we kept other characteristics such as team composition as constants. Contrary to this seemingly straightforward hypothesis, error for the forecasts for the first six months was in fact significantly greater (MASE = 3.16; s.e. = 0.21; 95% CI, (2.77, 3.60)) than for the last six months (MASE = 2.59; s.e. = 0.17; 95% CI, (2.27, 2.95)) (F(1, 621.41) = 29.36, P < 0.001, R2 = 0.012). As Supplementary Fig. 1 shows, for many domains, social scientists underpredicted societal change in Tournament 1, and this difference between predicted and observed values was more pronounced in the first than in the last six months. This suggests that for several domains, social scientists anchored their forecasts on the most recent historical data. Figure 2 further indicates that many domains showed unusual shifts (vis-à-vis prior historical data) in the first six months of the pandemic and started to return to the historical baseline in the following six months. For these domains, forecasts anchored on the most recent historical data were more inaccurate for the May–October 2020 forecasts than for the November 2020–April 2021 forecasts.

Finally, we tested whether providing the teams an additional six months of historical data capturing the onset of the pandemic in Tournament 2 may have contributed to lower error than in Tournament 1. To this end, we compared the inaccuracy of forecasts for the six-month period of November 2020–April 2021 done in May 2020 (Tournament 1) and those done when provided with more data in October 2020 (Tournament 2). We focused only on participants who completed both tournaments to keep the number of participating teams and team characteristics constant. Indeed, Tournament 1 forecasts had significantly more error (MASE mean, 2.54; s.e. = 0.17; 95% CI, (2.23, 2.90)) than Tournament 2 forecasts (MASE mean, 1.99; s.e. = 0.13; 95% CI, (1.74, 2.27)) (F(1, 607.79) = 31.57, P < 0.001, R2 = 0.017), suggesting that it was the availability of new (pandemic-specific) information rather than temporal distance that contributed to more accurate forecasts in the second than in the first tournament.

### Consistency in forecasting

Despite variability across scientific teams, domains and tournaments, the accuracy of scientific predictions was highly systematic. Accuracy in one subset of predictions (ranking of model performance across odd months) was highly correlated with accuracy in the other subset (ranking of model performance across even months) (first tournament: multilevel r across domains = 0.88; 95% CI, (0.85, 0.90); t(357) = 34.80; P < 0.001; domain-specific 0.55 < rs ≤ 0.99; second tournament: multilevel r across domains = 0.72; 95% CI, (0.67, 0.75); t(544) = 23.95; P < 0.001; domain-specific 0.24 < rs ≤ 0.96). Furthermore, the results of a linear mixed model with MASE scores in Tournament 1, domain, and their interaction predicting MASE in Tournament 2 showed that for 11 out of 12 domains, accuracy in Tournament 1 was associated with greater accuracy in Tournament 2 (median of standardized βs = 0.26).
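The odd-month versus even-month comparison can be illustrated with a minimal sketch that splits each team's absolute errors into the two subsets and ranks the teams within each. It assumes months are counted from 1 and uses unscaled absolute error for brevity; the multilevel correlation reported above is not reproduced. Team names and values are hypothetical.

```python
from statistics import mean


def split_half_errors(actual, forecast):
    """Return (odd_month_error, even_month_error), counting months from 1."""
    abs_err = [abs(f - y) for f, y in zip(forecast, actual)]
    if len(abs_err) < 2:
        raise ValueError("need at least two forecasted months")
    odd = abs_err[0::2]   # months 1, 3, 5, ...
    even = abs_err[1::2]  # months 2, 4, 6, ...
    return mean(odd), mean(even)


def rank_teams(scores):
    """Team names ordered from lowest to highest error."""
    return sorted(scores, key=scores.get)


# Hypothetical observed values and team forecasts for four months.
actual = [5.0, 6.0, 7.0, 8.0]
forecasts = {
    "team_a": [5.0, 6.5, 7.0, 8.5],
    "team_b": [7.0, 8.0, 9.0, 10.0],
    "team_c": [6.0, 7.0, 8.0, 9.0],
}

odd_scores, even_scores = {}, {}
for team, forecast in forecasts.items():
    odd_scores[team], even_scores[team] = split_half_errors(actual, forecast)

print(rank_teams(odd_scores))   # ranking based on odd months
print(rank_teams(even_scores))  # ranking based on even months
```

When the two rankings agree closely, as they do for these illustrative values, differences in accuracy between teams are unlikely to reflect month-to-month noise alone.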

Moreover, the ranking of models based on performance in the initial 12-month tournament corresponded to the ranking of the updated models in the follow-up 6-month tournament (Fig. 4). Harder-to-predict domains in the initial tournament remained the most inaccurate in the second tournament. Figure 3 shows one notable exception. Bias against African Americans was easier to predict than other domains in the second tournament. This exception appears consistent with the idea that George Floyd’s death catalysed movements in racial awareness just after the first tournament, although this explanation is speculative (see Supplementary Fig. 14 for a timeline of major historical events).

### Which strategies and team characteristics promoted accuracy?

Finally, we examined forecasting approaches and individual characteristics of more accurate forecasters in the tournaments. In the main text, we focused on central tendencies across forecasting teams, whereas in the supplementary analyses we reviewed strategies of winning teams and characteristics of the top five performers in each domain (Supplementary Figs. 4–11). We compared forecasting approaches relying on (1) no data modelling (but possible consideration of theories), (2) pure data modelling (but no consideration of subject matter theories) and (3) hybrid approaches. Roughly half of the teams relied on data-based modelling as a basis for their forecasts, whereas the other half of the teams in each tournament relied only on their intuitions or theoretical considerations (Fig. 5). This pattern was similar across domains (Supplementary Fig. 3).

In both tournaments, pre-registered linear mixed model analyses with approach as a factor, domain type as a control dummy variable and MASE scores nested in forecasting teams as a dependent variable revealed that forecasting approaches significantly differed in accuracy (first tournament: F(2, 149.10) = 5.47, P = 0.005, R2 = 0.096; second tournament: F(2, 177.93) = 5.00, P = 0.008, R2 = 0.091) (Fig. 5). Forecasts that considered historical data as part of the forecast modelling were more accurate than models that did not (first tournament: F(1, 56.29) = 20.38, P < 0.001, R2 = 0.096; second tournament: F(1, 159.11) = 8.12, P = 0.005, R2 = 0.084). Model comparison effects were qualified by a significant model type × domain interaction (first tournament: F(11, 278.67) = 4.57, P < 0.001, R2 = 0.045; second tournament: F(11, 462.08) = 3.38, P = 0.0002, R2 = 0.028). Post-hoc comparisons in Supplementary Table 4 revealed that data-inclusive (data-driven and hybrid) models were significantly more accurate than data-free models in three domains (explicit and implicit racial bias against Asian Americans and implicit gender–career bias) in Tournament 1 and two domains (life satisfaction and explicit gender–career bias) in Tournament 2. There were no domains where data-free models were more accurate than data-inclusive models. Analyses further demonstrated that, in the first tournament, data-free forecasts of social scientists were not significantly better than lay estimates (t(577) = 0.87, P = 0.385), whereas data-inclusive models tended to perform significantly better than lay estimates (t(470) = 3.11, P = 0.006, Cohen’s d = 0.391).

To examine the incremental contributions of specific forecasting strategies and team characteristics to accuracy, we pooled data from both tournaments in a linear mixed model with inaccuracy (MASE) as a dependent variable. As Fig. 6 shows, we included predictors representing forecasting strategies, team characteristics, domain expertise (quantified via publications by team members on the topic) and forecasting expertise (quantified via prior experience with forecasting tournaments). We further included domain type as a control dummy variable and nested responses in teams.

The full model fixed effects explained 31% of the variance in accuracy (R2 = 0.314), though much of it was accounted for by differences in accuracy between domains (non-domain R2 (partial), 0.043). Consistent with prior research21, model sophistication—that is, considering a larger number of exogenous predictors, COVID-19 trajectory or counterfactuals—did not significantly improve accuracy (Fig. 6 and Supplementary Table 5). In fact, forecasting models based on simpler procedures turned out to be significantly more accurate than complex models, as evidenced by the negative effect of statistical model complexity for accuracy (B = 0.14, s.e. = 0.06, t(220.82) = 2.33, P = 0.021, R2 (partial) = 0.010).

On the one hand, experts’ subjective confidence in their forecasts was not related to the accuracy of their estimates. On the other, people with expertise made more accurate forecasts. Teams were more accurate if they had members who had published academic research on the forecasted domain (B = −0.26, s.e. = 0.09, t(711.64) = 3.01, P = 0.003, R2 (partial) = 0.007) and who had taken part in prior forecasting competitions (B = −0.35, s.e. = 0.17, t(56.26) = 2.02, P = 0.049, R2 (partial) = 0.010) (also see Supplementary Table 5). Critically, even though some of these effects were significant, only two factors—complexity of the statistical method and prior experience with forecasting tournaments—showed a non-negligible partial effect size (R2 above 0.009). Additional testing of whether the inclusion of US-based scientists influenced forecasting accuracy did not yield significant effects (F(1, 106.61) < 1).

In the second tournament, we provided the teams with the opportunity to compare their original forecasts (Tournament 1, May 2020) with new data at a later time point and to update their predictions (Tournament 2, November 2020). We therefore tested whether updating improved people’s predictive accuracy. Of the initial 356 forecasts in the first tournament, 180 were updated in the second tournament (from 37% of teams for life satisfaction to 60% of teams for implicit Asian American bias). The updated forecasts in the second tournament (November) were significantly more accurate than the original forecasts in the first tournament (May) (t(94.5) = 6.04, P < 0.001, Cohen’s d = 0.804), but so were the forecasts from the 34 new teams recruited in November (t(75.9) = 6.30, P < 0.001, Cohen’s d = 0.816). Furthermore, the updated forecasts were not significantly different from the forecasts provided by new teams recruited in November (t(77.8) < 0.10, P = 0.928). This observation suggests that updating did not lead to more accurate forecasts (Supplementary Table 6 reports additional analyses probing different updating rationales).

## Discussion

How accurate are social scientists’ forecasts of societal change22? The results from two forecasting tournaments conducted during the first year of the COVID-19 pandemic show that for most domains, social scientists’ predictions were no better than those from a sample of the (non-specialist) general public. Furthermore, apart from a few domains concerning racial and gender–career bias, scientists’ original forecasts were typically not much better than naive statistical benchmarks derived from historical averages, linear regressions or random walks. Even when we confined the analysis to the top five forecasts by social scientists per domain, a simple linear regression produced less error roughly half of the time (Supplementary Figs. 5 and 9).
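To make the three naive benchmarks concrete, the following minimal sketch shows one plausible way to generate each of them from a monthly series. The function names and the toy series are hypothetical, and the exact benchmark specifications used in the tournaments may differ (for instance, a random walk can also be simulated with noise rather than carried forward deterministically).

```python
from statistics import mean

def historical_mean_forecast(history, horizon):
    """Repeat the mean of the observed series for every future step."""
    return [mean(history)] * horizon

def random_walk_forecast(history, horizon):
    """Carry the last observed value forward (random walk without drift)."""
    return [history[-1]] * horizon

def linear_trend_forecast(history, horizon):
    """Fit y = intercept + slope * t by least squares and extrapolate."""
    n = len(history)
    t = list(range(n))
    t_bar, y_bar = mean(t), mean(history)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, history))
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return [intercept + slope * (n + h) for h in range(horizon)]

# Hypothetical monthly values, not data from the tournaments.
history = [1.0, 2.0, 3.0, 4.0]
mean_fc = historical_mean_forecast(history, 2)
walk_fc = random_walk_forecast(history, 2)
trend_fc = linear_trend_forecast(history, 2)
```

On this toy series, the three benchmarks diverge as soon as the history contains a trend, which is why comparing a forecast against all three is more informative than comparing it against any single one.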

Forecasting accuracy systematically varied across societal domains. In both tournaments, positive sentiment and gender–career stereotypes were easier to forecast than other phenomena, whereas negative sentiment and bias towards African Americans were the most difficult to forecast. Domain differences in forecasting accuracy corresponded to historical volatility in the time series. Differences in the complexity of positive and negative affect are well documented23,24. Moreover, racial attitudes showed more change than attitudes regarding gender during this period (perhaps due to movements such as Black Lives Matter).

Which strategies and team characteristics were associated with more effective forecasts? One defining feature of more effective forecasters was that they relied on prior data rather than theory alone. This observation fits with prior studies on the performance of algorithmic versus intuitive human judgements21. Social scientists who relied on prior data also performed better than lay crowds and were overrepresented among the winning teams (Supplementary Figs. 4 and 8).

Forecasting experience and subject matter expertise on a forecasted topic also incrementally contributed to better performance in the tournaments (R2 (partial) = 0.010). This is in line with some prior research on the value of subject matter expertise for geopolitical forecasts6 and for the prediction of success of behavioural science interventions25. Notably, we found that publication track record on a topic, rather than subjective confidence in domain expertise or confidence in the forecast, contributed to greater accuracy. It is possible that subjective confidence in domain expertise conflates expertise and overconfidence26,27,28 (versus intellectual humility). There is some evidence that overconfident forecasters are less accurate29,30. These findings, along with the lack of a domain-general effect of social science expertise on performance compared with the general public, invite consideration of whether what usually counts as expertise in the social sciences translates into a greater ability to predict future real-world trends.

The nature of our forecasting tournaments allowed social scientists to self-select any of the 12 forecasting domains, inspect three years of historical trends for each domain and update their predictions on the basis of feedback on their initial performance in the first tournament. These features emulated typical forecasting platforms (for example, metaculus.com). We argue that this approach enhances our ability to draw externally valid and generalizable inferences from a forecasting tournament. However, this approach also resulted in a complex, unbalanced design. Scholars interested in isolating psychological mechanisms that foster superior forecasts may benefit from a simpler design whereby all forecasting teams make forecasts for all requested domains.

Another issue in designing forecasting tournaments involves the determination of domains that one may want participants to forecast. In designing the present tournaments, we provided the participants with at least three years of monthly historical data for each forecasting domain. An advantage of making the same historical data available for all forecasters is that it establishes a “common task framework”9,16,17, ensuring that the main sources of information about the forecasting domains remain identical across all participants. However, this approach restricts the types of social issues that participants can forecast. A simpler design without the inclusion of historical data would have had the advantage of greater flexibility in selecting forecasting domains.

Why were forecasts of societal change largely inaccurate, even though the participants had data-based resources and ample time to deliberate? One possibility concerns self-selection. Perhaps the participants in the Forecasting Collaborative were unusually bad at forecasting compared with social scientists as a whole. This possibility seems unlikely. We made efforts to recruit highly successful social scientists at different career stages and from different subdisciplines (Supplementary Information). Indeed, many of our forecasters are well-established scholars. We thus do not expect members of the Forecasting Collaborative to be worse at forecasting than other members of the social science community. Nevertheless, only a random sample of social scientists (albeit impractical) would have fully addressed the self-selection concern.

Second, it is possible that social scientists were not adequately incentivized to perform well in our tournaments. We provided reputational incentives by announcing the winners and rankings of participating teams, but like other big-team science projects8,31, we did not provide performance-based monetary incentives32, because they may not be key motivating factors for intrinsically motivated social scientists33. Indeed, the drop-out rate between Tournaments 1 and 2 was marginal, suggesting that the participating teams were motivated to continue being part of the initiative. This reasoning aside, it is possible that stronger incentives for accurate forecasting (whether reputation-based or monetary) could have stimulated some scientists to perform better in our forecasting tournament; future work could address this question directly.

Third, social scientists often deal with phenomena whose small effect sizes are overestimated in the literature8,31,34. Additionally, social scientists frequently study social phenomena under conditions that maximize experimental control but may have little external validity; some argue that this not only limits the generalizability of findings but also reduces their internal validity. In the world beyond the laboratory, where more factors are at play, such effects may be smaller than social scientists expect on the basis of their lab studies, and they may even be spurious given the lack of external validity35,36. Social scientists may thus overestimate or misestimate the impact of the effects they study in the lab on real-world phenomena37,38.

Fourth, social scientists tend to theorize about individuals and groups and conduct research at those scales. However, findings from such work may not scale up when predicting phenomena on the scale of entire societies39. Like other dynamical systems in economics, physics or biology, societal-level processes may also be genuinely stochastic rather than deterministic. If so, stochastic models will be hard to outperform.

Fifth, training in predictive modelling is not a requirement in many social sciences programmes10. Social scientists often prioritize explanations over formal predictions5. For instance, statistical training in the social sciences typically emphasizes unbiased estimation of model parameters in the sample over predictive out-of-sample accuracy40. Moreover, typical graduate curricula in many areas of social science, such as social or clinical psychology, do not require computational training in predictive modelling. The formal empirical study of societal change is relatively uncommon in these disciplines. Most social scientists approach individual- or group-level phenomena in an atemporal fashion39. Scientists may favour post hoc explanations of specific one-time events rather than the future trajectory of social phenomena. Although time is a key theoretical variable for foundational theories in many subfields of the social sciences, such as field theory41, it has remained an elusive concept.

Finally, perhaps it is unreasonable to expect theories and models developed during a relatively stable post–World War II period to accurately predict societal trends during a once-in-a-century health crisis. Precisely for this reason, we targeted predictions in domains possessing pandemic-relevant theoretical models (for instance, models about the impact of pathogen spread or social isolation). In this way, we sought to provide a stress test of ostensibly relevant theoretical models in a context (a pandemic-induced crisis) where change was most likely to be both meaningful and measurable. Nevertheless, the present work suggests that social scientists may not be particularly accurate at forecasting societal trends in this context, though it remains possible that they would perform better during more ‘normal’ times. The considerations above notwithstanding, future work should seek to address this question.

How can social scientists become better forecasters? Perhaps the first steps might involve probing the limits of social science theories by evaluating whether a given theory is suitable for making societal predictions in the first place or whether it is too narrow or too vague5,42. Relatedly, social scientists need to test their theories using representatively designed experiments. Moreover, social scientists may benefit from testing whether a societal trend is deterministic and hence can benefit from theory-driven components, or whether it unfolds in a purely stochastic fashion. For instance, one can start by decomposing a time series into the trend, autoregressive and seasonal components, examining each of them and their meaning for one’s theory and model. One can further perform a unit root test to examine whether the time series is non-stationary. Training in recognizing and modelling the properties of time series and dynamical systems may need to become more firmly integrated into graduate curricula in the field. A classic insight in the time series literature is that the mean of the historical time series may be among the best multi-step-ahead predictors for a stationary time series43. Using such insights to build predictions from the ground up can afford greater accuracy. In turn, such training can open the door to more robust models of social phenomena and human behaviour, with a promise of greater generalizability in the real world.
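As an illustration of the unit root check mentioned above, the following sketch computes a basic (non-augmented) Dickey-Fuller statistic with an intercept, using only the Python standard library. The simulated series, the function names and the seed are hypothetical. The value of about −2.86 is the commonly tabulated large-sample 5% critical value for the constant-only test. In applied work an established routine (for example, `adfuller` in statsmodels) would be preferable to this hand-rolled version.

```python
import random
from statistics import mean

def dickey_fuller_t(series):
    """t-statistic of rho in: y[t] - y[t-1] = alpha + rho * y[t-1] + error."""
    lagged = series[:-1]
    diffs = [b - a for a, b in zip(series[:-1], series[1:])]
    n = len(diffs)
    x_bar, d_bar = mean(lagged), mean(diffs)
    sxx = sum((x - x_bar) ** 2 for x in lagged)
    rho = sum((x - x_bar) * (d - d_bar) for x, d in zip(lagged, diffs)) / sxx
    alpha = d_bar - rho * x_bar
    rss = sum((d - alpha - rho * x) ** 2 for x, d in zip(lagged, diffs))
    std_err = (rss / (n - 2) / sxx) ** 0.5
    return rho / std_err

def simulate_ar1(phi, n, seed):
    """Simulate y[t] = phi * y[t-1] + noise; phi = 1 gives a random walk."""
    rng = random.Random(seed)
    y = [0.0]
    for _ in range(n - 1):
        y.append(phi * y[-1] + rng.gauss(0.0, 1.0))
    return y

# Hypothetical simulated series: one stationary, one with a unit root.
stationary = simulate_ar1(phi=0.3, n=200, seed=1)
random_walk = simulate_ar1(phi=1.0, n=200, seed=1)
t_stationary = dickey_fuller_t(stationary)
t_random_walk = dickey_fuller_t(random_walk)

# A statistic well below roughly -2.86 speaks against a unit root.
CRITICAL_5PCT = -2.86
```

If the unit root is rejected, the series is plausibly stationary and the historical mean becomes a sensible multi-step-ahead baseline. If it is not rejected, carrying the last value forward is usually the more defensible naive forecast.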

Given the broad societal impact of phenomena such as prejudice, political polarization and well-being, the ability to accurately predict trends in these variables is crucially important for policymakers and the experts guiding them. But despite common beliefs that social science experts are better equipped to accurately predict these trends than non-experts1, the current findings suggest that social and behavioural scientists have a lot of room for growth44. The good news is that forecasting skills can be improved. Consider the growing accuracy of forecasting models in meteorology in the second part of the twentieth century45. Greater consideration of representative experimental designs, temporal dynamics, better training in forecasting methods and more practice with formal forecasting all may improve social scientists’ ability to accurately forecast societal trends going forward.

## Methods

The study was approved by the Office of Research Ethics of the University of Waterloo under protocol no. 42142.

### Pre-registration and deviations

The forecasts of all participating teams along with their rationales were pre-registered on the Open Science Framework (https://osf.io/6wgbj/registrations). Additionally, in an a priori specific document shared with the journal in April 2020, we outlined the operationalization of the key dependent variable (MASE), the operationalization of the covariates and benchmarks (that is, the use of naive forecasting methods), and the key analytic procedures (linear mixed models and contrasts between different forecasting approaches; https://osf.io/7ekfm). We did not pre-register the use of a Prolific sample from the general public as an additional benchmark before their forecasting data were collected, though we did pre-register this benchmark in early September 2020, prior to data pre-processing or analyses. Deviating from the pre-registration, we performed a single analysis with all covariates in the same model rather than performing separate analyses for each set of covariates, to protect against inflating P values. Furthermore, due to scale differences between domains, we chose not to feature analyses concerning absolute percentage errors of each time point in the main paper (but see the corresponding analyses on the GitHub site for the project, https://github.com/grossmania/Forecasting-Tournament, which replicate the key effects presented in the main manuscript).
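For readers unfamiliar with the mean absolute scaled error (MASE), the sketch below follows its standard non-seasonal definition: the mean absolute error of the forecast divided by the in-sample mean absolute error of a one-step naive (last-value) forecast. The function name and the toy numbers are hypothetical, and we assume here that the scaling uses the historical series supplied to forecasters. The project's own code on GitHub remains the authoritative implementation.

```python
def mase(actual, forecast, history):
    """Mean absolute scaled error (standard, non-seasonal definition)."""
    # Out-of-sample mean absolute error of the submitted forecast.
    forecast_mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # In-sample mean absolute error of the one-step naive forecast.
    naive_mae = sum(abs(b - a) for a, b in zip(history[:-1], history[1:])) / (len(history) - 1)
    return forecast_mae / naive_mae

# Hypothetical values, not data from the tournaments.
history = [10.0, 12.0, 11.0, 13.0]
actual = [14.0, 12.0]
forecast = [13.0, 13.0]
score = mase(actual, forecast, history)
```

Because the error is scaled by the typical month-to-month change in the historical data, values below 1 indicate a forecast that beats the in-sample naive forecast, and scores become comparable across domains measured on different scales.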

### Participants and recruitment

We initially aimed for a minimum sample of 40 forecasting teams in our tournament after prescreening to ensure that the participants possessed at minimum a bachelor’s degree in the behavioural, social or computer sciences. To ensure a sufficient sample for comparing groups of scientists employing different forecasting strategies (for example, data-free versus data-inclusive methods), we subsequently tripled the target size of the final sample (N = 120) and accomplished this target by the November phase of the tournament (see Supplementary Table 1 for the demographics).

The Forecasting Collaborative website that we used for recruitment (https://predictions.uwaterloo.ca/faq) outlined the guidelines for eligibility and the approach for prospective participants. We incentivized the participating teams in two ways. First, the prospective participants had an opportunity for co-authorship in a large-scale citizen science publication. Second, we incentivized accuracy by emphasizing throughout the recruitment that we would be announcing the winners and would share the rankings of scientific teams in terms of performance in each tournament (per domain and in total).

As outlined in the recruitment materials, we considered data-driven (for example, model-based) or expertise-based (for example, general intuition or theory-based) forecasts from any field. As part of the survey, the participants selected which method(s) they used to generate their forecasts. Next, they elaborated on how they generated their forecasts in an open-ended question. There were no restrictions, though all teams were encouraged to report their education as well as areas of knowledge or expertise. The participants were recruited via large-scale advertising on social media; mailing lists in the behavioural and social sciences, the decision sciences, and data science; advertisement on academic social networks including ResearchGate; and word of mouth. To ensure broad representation across the academic spectrum of relevant disciplines, we targeted groups of scientists working on computational modelling, social psychology, judgement and decision-making, and data science to join the Forecasting Collaborative.

The Forecasting Collaborative started at the end of April 2020, during which time the US Institute for Health Metrics and Evaluation projected the initial peak of the COVID-19 pandemic in the United States. The recruitment phase continued until mid-June 2020, to ensure that at least 40 teams joined the initial tournament. We were able to recruit 86 teams for the initial 12-month tournament (mean age, 38.18; s.d. = 8.37; 73% of the forecasts were made by scientists with a doctorate), each of which provided forecasts for at least one domain (mean = 4.17; s.d. = 3.78). At the six-month mark, after the 2020 US presidential election, we provided the initial participants with an opportunity to update their forecasts (44% provided updates), while simultaneously opening the tournament to new participants. This strategy allowed us to compare new forecasts against the updated predictions of the original participants, resulting in 120 teams for this follow-up six-month tournament (mean age, 36.82; s.d. = 8.30; 67% of the forecasts were made by scientists with a doctorate; mean number of forecasted domains, 4.55; s.d. = 3.88). Supplementary analyses showed that the updating likelihood did not significantly differ between data-free and data-inclusive models (z = 0.50, P = 0.618).

### Procedure

Information for this project was available on the designated website (https://predictions.uwaterloo.ca), which included objectives, instructions and prior monthly data for each of the 12 domains that the participants could use for modelling. Researchers who decided to partake in the tournament signed up via a Qualtrics survey. The survey asked them to upload their estimates for the forecasting domains of their choice in a pre-programmed Excel sheet, which presented the historical trend and automatically juxtaposed their point estimate forecasts against it on a plot (Supplementary Appendix 1), and to answer a set of questions about their rationale and forecasting team composition. Once all data were received, the de-identified responses were used to pre-register the forecasted values and models on the Open Science Framework (https://osf.io/6wgbj/).

At the halfway point (that is, at six months), the participants were provided with a comparison summary of their initial point estimate forecasts versus actual data for the initial six months. Subsequently, they were provided with an option to update their forecasts, provide a detailed description of the updates and answer an identical set of questions about their data model and rationale for their forecasts, as well as the consideration of possible exogenous variables and counterfactuals.

### Materials

#### Forecasting domains and data pre-processing

Computational forecasting models require enough prior time series data for reliable modelling. On the basis of prior recommendations46, in the first tournament we provided each team with 39 monthly estimates—from January 2017 to March 2020—for each of the domains that the participating teams chose to forecast. This approach enabled the teams to perform data-driven forecasting (should the teams choose to do so) and to establish a baseline estimate prior to the US peak of the pandemic. In the second tournament, conducted six months later, we provided the forecasting teams with 45 monthly time points—from January 2017 to September 2020.

Because of the requirement for rich standardized data for computational approaches to forecasting9, we limited the forecasting domains to issues of broad societal importance. Our domain selection was guided by the discussion of broad social consequences associated with these issues at the beginning of the pandemic47,48, along with general theorizing about psychological and social effects of threats of infectious disease49,50. An additional pragmatic consideration concerned the availability of large-scale longitudinal monthly time series data for a given issue. The resulting domains include affective well-being and life satisfaction, political ideology and polarization, bias in explicit and implicit attitudes towards Asian Americans and African Americans, and stereotypes regarding gender and career versus family. To establish the common task framework—a necessary step for the evaluation of predictions in data science9,17—we standardized methods for obtaining relevant prior data for each of these domains, made the data publicly available, recruited competitor teams for a common task of inferring predictions from the data and a priori announced how the project leaders would evaluate accuracy at the end of the tournament.

Furthermore, each team had to (1) download and inspect the historical trends (visualized on an Excel plot; an example is in the Supplementary Information); (2) add their forecasts in the same document, which automatically visualized their forecasts against the historical trends; (3) confirm their forecasts; and (4) answer prompts concerning their forecasting rationale, theoretical assumptions, models, conditionals and consideration of additional parameters in the model. This procedure ensured that all teams, at the minimum, considered historical trends, juxtaposed them against their forecasted time series and deliberated on their forecasting assumptions.

### Affective well-being and life satisfaction

We used monthly Twitter data to estimate markers of affective well-being (positive and negative affect) and life satisfaction over time. We relied on Twitter because no polling data for monthly well-being over the required time period exists, and because prior work suggests that national estimates obtained via social media language can reliably track subjective well-being51. For each month, we used previously validated predictive models of well-being, as measured by affective well-being and life satisfaction scales52. Affective well-being was calculated by applying a custom lexicon53 to message unigrams. Life satisfaction was estimated using a ridge regression model trained on latent Dirichlet allocation topics, selected using univariate feature selection and dimensionally reduced using randomized principal component analysis, to predict Cantril ladder life satisfaction scores. Such Twitter-based estimates closely follow nationally representative polls54. We applied the respective models to Twitter data from January 2017 to March 2020 to obtain estimates of affective well-being and life satisfaction via language on social media.

### Ideological preferences

We approximated monthly ideological preferences via aggregated weighted data from the Congressional Generic Ballot polls conducted between January 2017 and March 2020 (https://projects.fivethirtyeight.com/polls/generic-ballot/), which ask representative samples of Americans to indicate which party they would support in an election. We weighted the polls on the basis of FiveThirtyEight pollster ratings, poll sample size and poll frequency. FiveThirtyEight pollster ratings are determined by their historical accuracy in forecasting elections since 1998, participation in professional initiatives that seek to increase disclosure and enforce industry best practices, and inclusion of live-caller surveys to cell phones and landlines. On the basis of these data, we then estimated monthly averages for support of the Democratic and Republican parties across pollsters (for example, Marist College, NBC/Wall Street Journal, CNN and YouGov/Economist).
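The aggregation step can be sketched as a weighted mean per month. In the minimal example below, each poll carries a single weight that stands in for whatever combination of pollster rating, sample size and poll frequency is applied. That single-weight simplification, the function name and the poll values are all hypothetical.

```python
from collections import defaultdict

def weighted_monthly_support(polls):
    """Weighted mean of poll results per month.

    Each poll is a tuple (month, support_pct, weight). The weight is a
    hypothetical stand-in for the combined pollster rating, sample size
    and poll frequency adjustments.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # month -> [weighted sum, weight sum]
    for month, support, weight in polls:
        totals[month][0] += support * weight
        totals[month][1] += weight
    return {month: ws / w for month, (ws, w) in totals.items()}

# Hypothetical polls, not FiveThirtyEight data.
polls = [
    ("2020-01", 48.0, 2.0),
    ("2020-01", 44.0, 1.0),
    ("2020-02", 47.0, 1.0),
]
monthly = weighted_monthly_support(polls)
```

In the toy data, the January estimate sits closer to the more heavily weighted poll than a simple average would.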

### Political polarization

We assessed political polarization by examining differences in presidential approval ratings by party identification from Gallup polls (https://news.gallup.com/poll/203198/presidential-approval-ratings-donald-trump.aspx). We obtained a difference score as the percentage of Republican versus Democratic approval ratings and estimated monthly averages for the period of interest. The absolute value of the difference score ensures that possible changes following the 2020 presidential election do not change the direction of the estimate.
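The role of the absolute value can be seen in a two-line example. The sketch below uses hypothetical approval percentages and a hypothetical function name. It shows only that the score is unchanged when the partisan direction of approval flips, as would be expected after a change of president.

```python
def polarization_score(republican_approval, democratic_approval):
    """Absolute partisan gap in presidential approval, in percentage points."""
    return abs(republican_approval - democratic_approval)

# Hypothetical approval percentages before and after a change of president.
before = polarization_score(republican_approval=90.0, democratic_approval=10.0)
after = polarization_score(republican_approval=10.0, democratic_approval=90.0)
```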

### Explicit and implicit bias

Given the natural history of the COVID-19 pandemic, we sought to examine forecasted bias in attitudes towards Asian Americans (versus European Americans). To further probe racial bias, we sought to examine forecasted racial bias in attitudes towards African American (versus European American) people. Finally, we sought to examine gender bias in associations of the female (versus male) gender with family versus career. For each domain, we sought to obtain both estimates of explicit attitudes55 and estimates of implicit attitudes56. To this end, we obtained data from the Project Implicit website (http://implicit.harvard.edu), which has collected continuous data concerning explicit stereotypes and implicit associations from a heterogeneous pool of volunteers (50,000–60,000 unique tests on each of these categories per month). Further details about the website and test materials are publicly available at https://osf.io/t4bnj. Recent work suggests that Project Implicit data can provide reliable societal estimates of consequential outcomes57,58 and when studying cross-temporal societal shifts in US attitudes59. Despite the non-representative nature of the Project Implicit data, recent analyses suggest that the bias scores captured by Project Implicit are highly correlated with nationally representative estimates of explicit bias (r = 0.75), indicating that group aggregates of the bias data from Project Implicit can reliably approximate group-level estimates58. To further correct possible non-representativeness, we applied stratified weighting to the estimates, as described below.

Implicit attitude scores were computed using the revised scoring algorithm of the IAT60. The IAT is a computerized task comparing reaction times to categorize paired concepts (in this case, social groups—for example, Asian American versus European American) and attributes (in this case, valence categories—for example, good versus bad). Average response latencies in correct categorizations were compared across two paired blocks in which the participants categorized concepts and attributes with the same response keys. Faster responses in the paired blocks are assumed to reflect a stronger association between those paired concepts and attributes. Implicit gender–career bias was measured using the IAT with category labels of ‘male’ and ‘female’ and attributes of ‘career’ and ‘family’. In all tests, positive IAT D scores indicate a relative preference for the typically preferred group (European Americans) or association (men–career).
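The core of the D score can be sketched as a difference in mean latencies divided by the standard deviation of all trials from both paired blocks. The version below is a deliberate simplification: it omits the error penalties, latency trimming and averaging across practice and test block pairs that the full revised algorithm specifies. The function name, the block labels and the latencies are hypothetical.

```python
from statistics import mean, stdev

def simplified_d_score(compatible_ms, incompatible_ms):
    """Mean latency difference scaled by the SD of all pooled trials.

    compatible_ms: latencies (ms) in the stereotype-congruent pairing.
    incompatible_ms: latencies (ms) in the stereotype-incongruent pairing.
    Positive values mean faster responses in the congruent pairing.
    """
    pooled_sd = stdev(compatible_ms + incompatible_ms)
    return (mean(incompatible_ms) - mean(compatible_ms)) / pooled_sd

# Hypothetical response latencies in milliseconds.
compatible = [600.0, 700.0, 650.0, 750.0]
incompatible = [800.0, 900.0, 850.0, 950.0]
d = simplified_d_score(compatible, incompatible)
```

Dividing by the pooled standard deviation makes the score less sensitive to overall differences in response speed between respondents.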

Respondents whose scores fell outside of the conditions specified in the scoring algorithm did not have a complete IAT D score and were therefore excluded from analyses. Restricting the analyses to only complete IAT D scores resulted in an average retention of 92% of the complete sessions across tests. The sample was further restricted to include only respondents from the United States to increase shared cultural understanding of the attitude categories. Finally, the sample was restricted to respondents with complete demographic information on age, gender, race/ethnicity and political ideology.

For explicit attitude scores, the participants provided ratings on feeling thermometers towards Asian Americans and European Americans (to assess Asian American bias) and towards white and Black Americans (to assess racial bias), on a seven-point scale ranging from −3 to +3. Explicit gender–career bias was measured using seven-point Likert-type scales assessing the degree to which an attribute was female or male, from strongly female (−3) to strongly male (+3). Two questions assessed explicit stereotypes for each attribute (for example, the association of career with female/male, and, separately, the association of family with female/male). To match the explicit bias scores with the relative nature of the IAT, relative explicit stereotype scores were created by subtracting the ‘incongruent’ association from the ‘congruent’ association (for example, (male–career versus female–career) − (male–family versus female–family)). Thus, for racial bias, −6 reflects a strong explicit preference for the minority (Asian American or African American) group over the majority (European American) group, and +6 reflects a strong explicit preference for the majority over the minority group. Similarly, for gender–career bias, −6 reflects a strong counter-stereotype association (for example, male–family/female–career), and +6 reflects a strong stereotypic association (for example, female–family/male–career). In both cases, the midpoint of 0 represents equal liking of both groups (or, for gender–career bias, no differential association).
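The relative score for gender–career bias amounts to a simple difference between two ratings. The sketch below assumes that each rating is already coded from −3 (strongly female) to +3 (strongly male). The function name is hypothetical.

```python
def relative_stereotype_score(career_rating, family_rating):
    """Congruent minus incongruent association, ranging from -6 to +6.

    Both ratings run from -3 (strongly female) to +3 (strongly male).
    """
    return career_rating - family_rating

# Career rated strongly male, family rated strongly female: fully stereotypic.
stereotypic = relative_stereotype_score(career_rating=3, family_rating=-3)
# The reverse pattern: fully counter-stereotypic.
counter = relative_stereotype_score(career_rating=-3, family_rating=3)
# Identical ratings for both attributes: no differential association.
neutral = relative_stereotype_score(career_rating=1, family_rating=1)
```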

We used explicit and implicit bias data for January 2017–March 2020 and created monthly estimates for each of the explicit and implicit bias domains. Because of possible selection bias among the Project Implicit participants, we adjusted the population estimates by weighting the monthly scores on the basis of their representativeness of the demographic frequencies in the US population (age, race, gender and education, estimated biannually by the US Census Bureau; https://www.census.gov/data/tables/time-series/demo/popest/2010s-national-detail.html). Furthermore, we adjusted the weights on the basis of political orientation (1, ‘strongly conservative’; 2, ‘moderately conservative’; 3, ‘slightly conservative’; 4, ‘neutral’; 5, ‘slightly liberal’; 6, ‘moderately liberal’; 7, ‘strongly liberal’), using corresponding annual estimates from the General Social Survey. With the weighted values for each participant, we computed weighted monthly means for each attitude test. These procedures ensured that the weighted monthly averages approximated the demographics of the US population. We cross-validated this procedure by comparing the weighted annual scores to nationally representative feeling-thermometer estimates for African Americans and Asian Americans from the American National Election Studies in 2017 and 2018.

An initial procedure was developed for computing post-stratification weights for African American, Asian American and gender–career bias (implicit and explicit) to ensure that the sample was as representative of the US population at large as possible. Originally, we computed weights for the entire year, which were then applied to each month in the year. After we received feedback from co-authors, we adopted a better approach wherein weights were computed on a monthly rather than yearly basis. This was necessary because demographic characteristics varied from month to month within each year, which meant that yearly weights had the potential to amplify bias instead of reducing it. Consequently, our new procedure ensured that sample representativeness was maximized. This change affected the forecasts of seven teams who had submitted them before the change. The teams were informed, and four teams chose to provide updated estimates using the newly weighted historical data.
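The logic of monthly post-stratification can be illustrated with a single stratification variable. In the sketch below, each respondent in a given month is weighted by the ratio of the stratum's population share to its share of that month's sample. The actual procedure involved several demographic variables and political orientation, so this one-variable cell weighting is a simplifying assumption. The function name, strata and scores are hypothetical.

```python
from collections import Counter

def poststratified_mean(respondents, population_share):
    """Weighted mean score for ONE month of data.

    respondents: list of (stratum, score) pairs for that month.
    population_share: stratum -> proportion in the target population.
    """
    counts = Counter(stratum for stratum, _ in respondents)
    n = len(respondents)
    weighted_sum = 0.0
    weight_total = 0.0
    for stratum, score in respondents:
        # Population share divided by the share in this month's sample.
        weight = population_share[stratum] / (counts[stratum] / n)
        weighted_sum += weight * score
        weight_total += weight
    return weighted_sum / weight_total

# Hypothetical month in which one stratum is over-represented.
month_sample = [("young", 1.0), ("young", 1.0), ("young", 1.0), ("old", 0.0)]
shares = {"young": 0.5, "old": 0.5}
estimate = poststratified_mean(month_sample, shares)
```

Recomputing the weights from each month's own sample composition, rather than reusing a yearly composition, is what keeps a month with an unusual demographic mix from being over- or under-corrected.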

For each of these domains, the forecasters were provided with 39 monthly estimates in the initial tournament (45 estimates in the follow-up tournament), as well as detailed explanations of the origin and calculation of the respective indices. We thereby aimed to standardize the data source for the purpose of the forecasting competition9. See Supplementary Appendix 1 for example worksheets provided to the participants for submissions of their forecasts.

#### Forecasting justifications

For each forecasting model submitted to the tournament, the participants provided detailed descriptions. They described the type of model they had computed (for example, time series, game theoretic models or other algorithms), the model parameters, additional variables they had included in their predictions (for example, the COVID-19 trajectory or the presidential election outcome) and the underlying assumptions.

#### Confidence in forecasts

The participants rated their confidence in their forecasted points for each forecast model they submitted. These ratings were on a seven-point scale from 1 (not at all) to 7 (extremely).

#### Confidence in expertise

The participants provided ratings of their teams’ expertise for a particular domain by indicating their extent of agreement with the statement “My team has strong expertise on the research topic of [field].” These ratings were on a seven-point scale from 1 (strongly disagree) to 7 (strongly agree).

#### COVID-19 conditional

We considered the COVID-19 pandemic as a conditional of interest given links between infectious disease and the target social issues selected for this tournament. In Tournament 1, the participants reported whether they had used the past or predicted trajectory of the COVID-19 pandemic (as measured by the number of deaths or the prevalence of cases or new infections) as a conditional in their model, and if so, they provided their forecasted estimates for the COVID-19 variable included in their model.

#### Counterfactuals

Counterfactuals are hypothetical alternative historical events that would have been expected to affect the forecast outcomes had they occurred. The participants described the key counterfactual events between December 2019 and April 2020 that they theorized would have led to different forecasts (for example, US-wide implementation of social distancing practices in February). Two independent coders evaluated the distinctiveness of the counterfactuals (interrater κ = 0.80). When discrepancies arose, the coders discussed individual cases with other members of the Forecasting Collaborative to make the final evaluation. In the primary analyses, we focus on the presence of counterfactuals (yes/no).

#### Team expertise

Because expertise can mean many things2,61, we used a telescopic approach and operationalized expertise in four ways of varying granularity. First, we examined broad, domain-general expertise in the social sciences by comparing social scientists’ forecasts with forecasts provided by the general public without the same training in social science theory and methods. Second, we operationalized the prevalence of graduate training on a team as a more specific marker of domain-general expertise in the social sciences. To this end, we asked each participating team to report how many team members had a doctorate in the social sciences and calculated the percentage of doctorates on each team. Moving to domain-specific expertise, we instructed the participating teams to report whether any of their members had previously researched or published on the topic of their forecasted variable, operationalizing domain-specific expertise through this measure. Finally, moving to the most subjective level, we asked each participating team to report their subjective confidence in their team’s expertise in a given domain (Supplementary Information).

#### General public benchmark

In parallel to the tournament with 86 teams, on 2–3 June 2020, we recruited a regionally, gender- and socio-economically stratified sample of US residents via the Prolific crowdworker platform (targeted N = 1,050 completed responses) and randomly assigned them to forecast societal change for a subset of domains used in the tournaments: well-being (life satisfaction and positive and negative sentiment on social media), politics (political polarization and ideological support for Democrats and Republicans), Asian American bias (explicit and implicit trends), African American bias (explicit and implicit trends) and gender–career bias (explicit and implicit trends). During recruitment, the participants were informed that, to receive the payment of 3.65 GBP, they had to be able to open an Excel worksheet and upload their forecasts in it.

We retained responses if they provided forecasts for 12 months in at least one domain and if the predictions did not exceed the possible range for a given domain (for example, polarization above 100%). Moreover, three coders (intercoder κ = 0.70 unweighted, κ = 0.77 weighted) reviewed all submitted rationales from lay people and excluded any submissions where the participants either misunderstood the task or wrote bogus bot-like responses. Coder disagreements were resolved through discussion. Finally, we excluded responses if the participants spent under 50 seconds making their forecasts, which included reading instructions, downloading the files, providing forecasts and re-uploading their forecasts (final N = 802, 1,467 forecasts; mean age, 30.39; s.d. = 10.56; 46.36% female; education: 8.57% high school/GED, 28.80% some college, 62.63% college or above; ethnicity: 59.52% white, 17.10% Asian American, 9.45% African American/Black, 7.43% Latinx, 6.50% mixed/other; median annual income, $50,000–$75,000; residential area: 32.37% urban, 57.03% suburban, 10.60% rural).
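The screening rules above can be expressed as a simple filter. The sketch below is illustrative only: the function name, the argument layout and the flag labels are hypothetical, and the coder flags themselves come from manual review rather than from code.

```python
def keep_forecast(values, seconds_spent, valid_range, coder_flag=None, min_seconds=50):
    """Return True if a single-domain forecast passes all screening rules.

    values: the monthly forecasts for one domain (None marks a missing month).
    valid_range: (low, high) bounds of possible values for that domain.
    coder_flag: None, "misunderstood" or "bot-like" (hypothetical labels).
    """
    low, high = valid_range
    # Rule 1: forecasts for all 12 months must be present.
    if len(values) != 12 or any(v is None for v in values):
        return False
    # Rule 2: no forecast may fall outside the possible range.
    if any(v < low or v > high for v in values):
        return False
    # Rule 3: rationales flagged by the coders are excluded.
    if coder_flag in ("misunderstood", "bot-like"):
        return False
    # Rule 4: responses completed in under 50 seconds are excluded.
    return seconds_spent >= min_seconds
```

For a percentage-based domain, for example, `keep_forecast([50] * 12, 120, (0, 100))` passes, whereas the same forecast with one value of 101 does not.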

#### Exclusions of the general public sample

Supplementary Table 7 outlines exclusions by category. In the initial step, we considered all submissions via the Qualtrics platform, including partial submissions without any forecasting data (N = 1,891). After removing incomplete responses without forecasting data and duplicate submissions from the same Prolific IDs, we removed 59 outliers whose data exceeded the range of possible values in a given domain. Subsequently, we removed responses that the independent coders flagged as either misunderstood (n = 6) or bot-like bogus responses (n = 26). See Supplementary Appendix 2 for verbatim examples of each screening category and the exact coding instructions. Next, we removed responses where the participants took less than 50 seconds to provide their forecasts (including reading instructions, downloading the Excel file, filling it out, re-uploading the Excel worksheet and completing additional information on their reasoning about the forecast). Finally, one response was removed on the basis of open-ended information where the participant indicated they had made forecasts for a country other than the United States.

#### Naive statistical benchmarks

There is evidence from data science forecasting competitions that the dominant statistical benchmarks are the Theta method, ARIMA and ETS7. Given the socio-cultural context of our study and to avoid loss of generality, we decided to employ more traditional benchmarks: the naive/random walk, the historical average and the basic linear regression model, which is the most widely used method in practice and science. In short, we selected three benchmarks on the basis of their common application in the forecasting literature (historical mean and random walk are the most basic forecasting benchmarks) or the behavioural/social science literature (linear regression is the most basic statistical approach to test inferences in the sciences). Furthermore, these benchmarks target distinct features of performance (historical mean speaks to the base rate sensitivity, linear regression speaks to sensitivity to the overall trend and random walk captures random fluctuations and sensitivity to dependencies across consecutive time points). Each of these benchmarks may perform better in some circumstances than in others. Consequently, to test the limits of scientists’ performance, we examined whether social scientists’ performance was better than each of the three benchmarks. To obtain metrics of uncertainty around the naive statistical estimates, we chose to simulate these three naive approaches for making forecasts: (1) random resampling of historical data, (2) a naive out-of-sample random walk based on random resampling of historical change and (3) extrapolation from a naive regression based on a randomly selected interval of historical data. We describe each approach in Supplementary Information.
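The three simulated benchmarks can be sketched as follows. This is one plausible reading of the verbal descriptions above, not the implementation documented in Supplementary Information; the function names, the minimum interval length and the way the regression interval is drawn are all assumptions.

```python
import random
from statistics import mean


def resample_history(history, horizon, rng):
    """(1) Draw each forecast point at random from the historical values."""
    return [rng.choice(history) for _ in range(horizon)]


def random_walk(history, horizon, rng):
    """(2) Start at the last observation and add resampled historical changes."""
    changes = [b - a for a, b in zip(history, history[1:])]
    path, level = [], history[-1]
    for _ in range(horizon):
        level += rng.choice(changes)
        path.append(level)
    return path


def linear_extrapolation(history, horizon, rng, min_len=3):
    """(3) Fit a least-squares line to a random contiguous interval and extrapolate."""
    n = len(history)
    # Assumed sampling scheme: random start, then random end at least min_len later.
    start = rng.randrange(0, n - min_len + 1)
    end = rng.randrange(start + min_len, n + 1)
    xs = list(range(start, end))
    ys = history[start:end]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # Forecast the time points immediately following the historical series.
    return [intercept + slope * (n + h) for h in range(horizon)]
```

Calling any of these functions repeatedly with a seeded generator, for example `rng = random.Random(0)`, on the 39 historical estimates with `horizon=12` yields a distribution of naive forecasts whose spread can serve as the uncertainty around each benchmark.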

### Analytic plan

#### Categorization of forecasts

We categorized the forecasts on the basis of modelling approaches. Two independent research associates categorized the forecasts for each domain on the basis of the following justifications: (1) theoretical models only, (2) data-driven models only or (3) a combination of theoretical and data-driven models, that is, computational models that rely on specific theoretical assumptions. See Supplementary Appendix 3 for the exact coding instructions and a description of the classification (interrater κ = 0.81 unweighted, κ = 0.90 weighted). We further examined the modelling complexity of approaches that relied on the extrapolation of time series from the data we provided (for example, ARIMA or moving average with lags; yes/no; see Supplementary Appendix 4 for the exact coding instructions). Disagreements between coders here (interrater κ = 0.80 unweighted, κ = 0.87 weighted) and on each coding task below were resolved through joint discussion with the lead author of the project.

We tested how the presence and number of additional variables as parameters in the model impacted forecasting accuracy. To this end, we ensured that additional variables were distinct from one another. Two independent coders evaluated the distinctiveness of each reported parameter (interrater κ = 0.56 unweighted, κ = 0.83 weighted).
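For readers unfamiliar with the agreement statistics reported in this section, the sketch below computes Cohen's κ for two coders. The weighted variant uses linear disagreement weights over ordered categories, which is an assumption: the weighting scheme behind the reported values is not stated here.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters; categories must be listed in rank order.

    With weighted=True, disagreements are penalized in proportion to the
    distance between categories (linear weights, an assumed choice).
    """
    n = len(rater_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    observed = Counter(zip(rater_a, rater_b))
    marginal_a = Counter(rater_a)
    marginal_b = Counter(rater_b)

    def disagreement(i, j):
        if weighted:
            return abs(i - j) / (k - 1)
        return 0.0 if i == j else 1.0

    observed_dis = expected_dis = 0.0
    for a in categories:
        for b in categories:
            w = disagreement(index[a], index[b])
            observed_dis += w * observed[(a, b)] / n
            expected_dis += w * (marginal_a[a] / n) * (marginal_b[b] / n)
    # Undefined (division by zero) if both raters use a single category throughout.
    return 1.0 - observed_dis / expected_dis
```

With more than two ordered categories, the weighted and unweighted values can diverge substantially, because near-misses count as partial agreement only in the weighted version.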

#### Categorization of teams

We next categorized the teams on the basis of compositions. First, we counted the number of members per team. We also sorted the teams on the basis of disciplinary orientation, comparing behavioural and social scientists with teams from computer and data science. Finally, we used information that the teams provided concerning their objective and subjective expertise levels for a given subject domain.

#### Forecasting update justifications

Given that the participants received both new data and a summary of diverse theoretical positions that they could use as a basis for their updates, two independent research associates scored the participants’ justifications for forecasting updates on three dummy categories: (1) the new six months of data that we provided, (2) new theoretical insights and (3) consideration of other external events (interrater κ = 0.63 unweighted/weighted). See Supplementary Appendix 5 for the exact coding instructions.

#### Statistical analyses

A priori (https://osf.io/6wgbj/), we specified a linear mixed model as a key analytical procedure, with MASE scores for different domains nested in participating teams as repeated measures. Prior to the analyses, we inspected the MASE scores to determine violations of linearity, which we corrected via log-transformation before performing the analyses. All P values refer to two-sided t-tests. For simple effects by domain, we applied Benjamini–Hochberg false discovery rate corrections. For 95% CIs by domain, we simulated a multivariate t distribution20 to adjust the scores for simultaneous inference of estimates for 12 domains in each tournament.
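Two of the quantities named above can be made concrete with a short sketch. It assumes the standard definition of MASE (forecast mean absolute error scaled by the in-sample mean absolute error of a one-step naive forecast) and the standard Benjamini–Hochberg step-up adjustment; the function names are illustrative, and the mixed model itself is not reproduced here.

```python
import math


def mase(actual, forecast, history):
    """Mean absolute scaled error, under the standard definition (assumed)."""
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    # Scale: in-sample MAE of predicting each point by its predecessor.
    scale = sum(abs(b - a) for a, b in zip(history, history[1:])) / (len(history) - 1)
    return mae / scale


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Example: one forecast scored against a short history, then log-transformed.
score = mase(actual=[5, 6], forecast=[4, 8], history=[1, 2, 3, 4])
log_score = math.log(score)
```

A MASE below 1 means the forecast was more accurate than the one-step naive forecast was within the historical data, so the log-transformed score is negative in that case and positive otherwise.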

### Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.