It is important for research users to know how likely it is that reported research findings are true. The Social Science Replication Project finds that, in highly powered experiments, only 13 of 21 high-profile reports could be replicated. Investigating the factors that contribute to reliable results offers new opportunities for the social sciences.
There is much concern that the scientific literature contains findings that cannot be replicated by others. Writing in Nature Human Behaviour, Camerer and colleagues report replications of 21 experiments in the social sciences published between 2010 and 2015 in Science and Nature1. They find significant effects in the same direction as the original study for only 13 of 21 replications; magnitudes of the observed effects were only around half the size of those seen in the original studies. Prospective replication projects in psychology2 and experimental economics3 have illuminated the problem in those fields, but did not focus on publications in high-impact journals, where there may be a premium on novelty, and where one might expect greater editorial scrutiny.
A cardinal feature of scientific endeavour is held to be self-correction, so why the need for formal replications? If science were operating as intended, should this not happen automatically? It is important that all scientific results be sound, because the findings from one study are often used as the starting point for future work. Therefore, if these initial results are faulty, or hold only under specific conditions, then the work built on them is compromised.
In the Experimental Economics Replication Project, 11 of 18 replication studies found significant effects in the same direction as the original study, with average replicated effect sizes of 66% of the originals3. In the Reproducibility Project: Psychology, 35 of 97 replication studies found significant effects in the same direction as the original study, with average replicated effect sizes of 49%2. In the ongoing Reproducibility Project: Cancer Biology, 4 of 10 studies have successfully reproduced important parts of the original and a further 2 have been partially successful4.
The current study is important because it suggests that findings published in high-impact journals are no more likely to be replicated successfully than those published elsewhere; so (as with risks of bias in reports of in vivo research5) the impact factor of the journal in which work was published is no guarantee that the findings are true. Further, when the authors asked a community of almost 400 peers to predict whether or not a given finding could be reproduced, the predictions correlated well with both whether a finding did replicate and the replication effect size observed; that is, the ‘crowd’ was right much more often than it was wrong.
The current study also begins to establish more clearly the possible outcomes from a replication study. Some studies replicated the direction of effect and the magnitude, others replicated the direction but with lower magnitude, and in others, there appeared to be no effect in the replication study. Replication of direction but not magnitude might be due to any number of aspects of study design or reporting that led to an overstatement of efficacy in the originator study. Greater diligence in research design, conduct and reporting will go some way to addressing this.
Of greater interest are the studies where there was no effect in the replication study. This may well be due to some latent independent variable present in one experiment but not the other. Importantly, identifying such latent variables and understanding their influence provides an opportunity to better understand the phenomenon being investigated. For instance, the failure to replicate drug effects on extending lifespan in worms led initially to despair but, many years later, inspired the multicentre National Institute on Aging-funded Caenorhabditis Intervention Testing Program6. After much effort, the authors were able to show bimodal patterns of survival that could not be explained by any known biology, and differential effects of drugs, some with broad efficacy and others whose effects varied across species and strains7. This has provided a starting point for research that should deepen our understanding of the biology of ageing.
Important questions remain. While these formal replication projects1,2,3 are prospective studies with prior registration of detailed protocols, the selection of the findings to be reproduced is not random. Isager identified 85 independent replication efforts and found that selection strategies fell into categories based on theoretical impact; personal interest; academic, public or social impact; or methodological concerns8. For the present study, selection was from studies published in Nature and Science, and these are unlikely to be typical examples of research in the social sciences. As the authors note, existing reproducibility studies have “a relatively small sample of studies with idiosyncratic inclusion criteria and unknown generalizability”.
Nonetheless, the extent of observed non-reproducibility across disciplines means that research users cannot reliably conclude that reported findings are true. Either we attempt to reproduce every research finding we want to use, or we find ways of identifying which study characteristics are associated with findings that can be reproduced, so that reports bearing those characteristics are more credible. We know from the reproducibility projects to date that reproducibility is more likely where the original study had a large effect size and a small P value. I may believe that in vivo studies at low risk of bias are more likely to be successfully replicated, but at present I have no direct evidence to support this view. Disappointingly, to my knowledge, none of the reproducibility studies to date has considered whether the extent to which authors addressed possible risks of bias in the design of the original study might predict the chances of successful reproduction.
Further, if resources for replication studies are limited, it would be desirable for these to be targeted where they could do most good, by substantially increasing (or decreasing) our confidence in a body of research findings that manifested certain shared characteristics.
To do this would require a reproducibility study with selection of originator studies by random sampling of studies with or without the characteristics of interest. Given that the chances that a replication study will ‘correct’ the literature are in the order of 30–65%, these studies also represent a high rate of return for research funders.
Researchers have experienced difficulties in publishing work that challenges received wisdom9, against a background where the importance of such work is seldom recognized for career advancement. The various reproducibility projects, the plaudits that the authors have rightly received, and the development of dedicated funding streams for such studies go some way to addressing this imbalance.
‘Failed’ reproducibility, approached scientifically, affords great research opportunities. Exploiting these opportunities usually requires a collaborative, multicentre approach, and pursuit of these opportunities is likely to lead to improved and better considered research designs. For these reasons, current issues of the reproducibility of research should not be seen as a crisis, but as an opportunity.