# Prediction of robust scientific facts from literature

A preprint version of the article is available at arXiv.

## Abstract

The growth of published science in recent years has escalated the difficulty that human and algorithmic agents face in reasoning over prior knowledge to select the next experiment. This challenge is increased by uncertainty about the reproducibility of published findings. The availability of massive digital archives, machine reading, extraction tools and automated high-throughput experiments allows us to evaluate these challenges computationally at scale and identify novel opportunities to craft policies that accelerate scientific progress. Here we demonstrate a Bayesian calculus that enables positive prediction of robust scientific claims with findings extracted from published literature, weighted by scientific, social and institutional factors demonstrated to increase replicability. Illustrated with the case of gene regulatory interactions, our approach automatically estimates and counteracts sources of bias, revealing that scientifically focused but socially and institutionally diverse research activity is most likely to replicate. This results in updated certainty about the literature, which accurately predicts robust scientific facts on which new experiments should build. Our findings allow us to identify and evaluate policy recommendations for scientific institutions that may increase robust scientific knowledge, including sponsorship of increased diversity of and independence between investigations of any particular scientific phenomenon, and diversity of scientific phenomena investigated.

## Data availability

To illustrate our pipeline, we used the publicly available GeneWays and Literome datasets (available at https://github.com/KnowledgeLab/geneways and https://github.com/KnowledgeLab/literome), linked with Clarivate’s Web of Science database of bibliographic information. While we cannot share the Web of Science data, we share a linked file at https://github.com/KnowledgeLab/nmi_robust_facts_supplemenary, which includes all claims of interest and the citation metadata required to perform the described analyses.

## Code availability

Our code is publicly available at https://github.com/alexander-belikov/datahelpers and https://github.com/KnowledgeLab/bm_support.

## Acknowledgements

We thank V. Sitnik, V. Danchev and P. Saleiro for fruitful discussions, Y. Babuji for technical help, H. Poon for suggestions regarding the formulation of the project and I. Mayzus, R. Melamed and O. Kel-Margoulis for help with the annotation and the interpretation of biological datasets. We are grateful for comments from participants of the MetaScience Conference at Stanford (2019), and for meetings associated with the Defense Advanced Research Projects Agency (DARPA) Big Mechanism programme. We acknowledge funding from DARPA (14145043, J.E. and A.V.B.; HR00111820006, J.E., A.V.B. and A.R.), the Air Force Office of Scientific Research (FA9550-19-1-0354, J.E.; FA9550-15-1-0162, J.E.), the National Science Foundation (SBE-1829366, J.E.; 1422902, J.E.; 1158803, J.E.) and the John Templeton Foundation to the ‘Metaknowledge Network’ (J.E. and A.R.).

## Author information

### Contributions

A.V.B. proposed and implemented the methodology, validated the model, analysed the data and drafted the paper. J.E. was responsible for conception and funding of the project, contributed to the design of the methodology and drafted the paper. A.R. provided feedback on the experimental work and data interpretation, and participated in drafting the paper. All authors read and approved the final manuscript.

### Corresponding authors

Correspondence to Alexander V. Belikov or James Evans.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Peer review

### Peer review information

Nature Machine Intelligence thanks Luis Amaral and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Extended data

### Extended Data Fig. 1 Illustration of core interaction and claim variables.

Directed regulatory interactions between genes constitute communities of researchers who study them. Features regarding the position of a claim within prior knowledge are derived from its relationship to other genetic regulatory interactions. Features regarding the breadth and independence of support are derived from the connection between publications making claims about the same interaction.

### Extended Data Fig. 2 Correlation between claim value and experimental strength across the claim frequency distribution.

Correlation of mean claim value $$\hat \mu ^\alpha$$ and interaction strength $$\hat \pi ^\alpha$$ from LINCS L1000 as a function of threshold on minimum claim sequence length per interaction for GeneWays (a) and Literome (b).

### Extended Data Fig. 3 Data-driven thresholds to partition interactions into neutral, negative and positive interactions for analysis.

$$C_0$$, $$C_-$$ and $$C_+$$ correspond to the classes of neutral, negative and positive genetic regulatory interactions. Distance between $$C_0$$ and $$C_-$$ ($$W\left( {g_0,g_ - ,\theta _ - ,\theta _ + } \right)$$, solid green) and between $$C_0$$ and $$C_+$$ ($$W\left( {g_0,g_ + ,\theta _ - ,\theta _ + } \right)$$, solid blue), together with the number of claims in $$C_-$$ (dotted green) and in $$C_+$$ (dotted blue), for GeneWays (a) and Literome (b). Distance between $$C_0$$ and $$C_-$$ ($$W\left( {g_0,g_ - ,\theta _ - ,\theta _ + } \right)$$) in GeneWays (c) and Literome (d); distance between $$C_0$$ and $$C_+$$ ($$W\left( {g_0,g_ + ,\theta _ - ,\theta _ + } \right)$$) in GeneWays (e) and Literome (f).

### Extended Data Fig. 4 Pearson correlation between core analysis variables in both GeneWays and Literome datasets.

Heat map indicating correlation between: (a) claim correctness $$y_i^\alpha$$ and batch-level features for GeneWays (top row) and Literome (bottom row); (b) claim correctness $$y_i^\alpha$$ and claim-level features for GeneWays (top row) and Literome (bottom row); (c) interaction non-neutrality $$\pi _0^\alpha$$ and interaction-level features for GeneWays (top row) and Literome (bottom row); (d) interaction positivity $$\pi _ + ^\alpha$$ and interaction-level features for GeneWays (top row) and Literome (bottom row).

### Extended Data Fig. 5 Variable importance and significance in models of the non-neutrality and positivity of genetic regulatory interactions.

Family importances of the random forest model (left, darker shade) and logistic regression coefficients (right, lighter shade) for the model of classification of neutral interactions (top) and positive interactions (bottom) for GeneWays (left) and Literome (right). Vertical centered lines show the 95% confidence interval for the mean of the corresponding importance or coefficient.

### Extended Data Fig. 6 Analysis of the relationship between the distribution of claims per interaction and overall certainty about those interactions.

Examples of the claim number distribution $$\rho(n^\alpha)$$ per interaction for test subsamples from GeneWays (a) and Literome (b). Information gain as a function of the slope β of the claim number distribution for GeneWays (c) and Literome (d). Solid lines correspond to binned averages and shaded regions denote a one-standard-deviation interval of the data.

### Extended Data Fig. 7 Survival functions (complements of the cumulative distribution functions) of claim number per interaction.

Survival functions for GeneWays (a) and Literome (b), for all interactions and for nonzero (non-neutral) interactions, where the probability distribution function is modeled as $$\rho \propto n^{-\gamma}$$. The exponents γ equal 2.26 and 2.01 for GeneWays for all and non-neutral interactions, respectively, and 2.5 and 2.26 for Literome for all and non-neutral interactions. The exponents were obtained by maximum likelihood estimation.
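As an illustration of the two quantities named in this caption, the following sketch computes an empirical survival function and a maximum likelihood estimate of a power-law exponent. It assumes a decaying power law (density proportional to n^(-γ) with γ > 1) and uses the standard continuous-variable estimator, γ̂ = 1 + N / Σ ln(x_i / x_min). The function names, the cutoff `x_min` and the synthetic data are assumptions made for this example; the estimator the authors actually used is not specified here, and real claim counts are integers, for which the continuous formula is only an approximation.

```python
import math
import random


def survival_function(counts):
    """Empirical survival function S(n) = P(N >= n) at each observed value n."""
    total = len(counts)
    return {n: sum(1 for c in counts if c >= n) / total for n in sorted(set(counts))}


def powerlaw_mle(samples, x_min=1.0):
    """Continuous MLE of gamma for a density proportional to x^(-gamma), x >= x_min."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum <= 0.0:
        raise ValueError("need at least one sample strictly above x_min")
    return 1.0 + len(tail) / log_sum


def sample_powerlaw(gamma, x_min, size, rng):
    """Inverse-transform sampling from a continuous power law (synthetic data only)."""
    return [x_min * (1.0 - rng.random()) ** (-1.0 / (gamma - 1.0)) for _ in range(size)]


# Synthetic check: draw from a known exponent and try to recover it.
rng = random.Random(0)
synthetic = sample_powerlaw(gamma=2.26, x_min=1.0, size=50_000, rng=rng)
gamma_hat = powerlaw_mle(synthetic, x_min=1.0)
print(f"estimated exponent: {gamma_hat:.3f}")

# Survival function of a tiny, made-up list of claim counts per interaction.
toy_survival = survival_function([1, 1, 2, 3])
print(toy_survival)
```

With 50,000 synthetic samples the estimate should land close to the generating exponent; the asymptotic standard error of this estimator is roughly (γ − 1)/√N.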

### Extended Data Fig. 8 Model selection using ROC AUC values for all models.

Neutral interaction models (a-c,g-i) and positive interaction models (d-f,j-l). Left: the distribution of ROC AUC as a function of random forest depth (a,d,g,j). Center: the distribution of ROC AUC as a function of minimum number of samples in a decision tree leaf (b,e,h,k). Right: the distribution of ROC AUC as a function of the number of trees in a random forest (c,f,i,l).
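The selection criterion described in this caption can be sketched in a few lines. The example below computes ROC AUC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one (ties count one half), then picks the hyperparameter setting with the highest mean AUC across validation splits. The function names, the setting labels and all numbers are made up for illustration; this is not the authors' selection code, and no random forest is trained here.

```python
from statistics import mean


def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_setting(auc_by_setting):
    """Return the setting whose AUC values have the highest mean."""
    return max(auc_by_setting, key=lambda setting: mean(auc_by_setting[setting]))


# Toy labels and scores (hypothetical).
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Hypothetical AUC values per tree depth, one value per validation split.
auc_by_depth = {
    "max_depth=2": [0.70, 0.72],
    "max_depth=4": [0.75, 0.77],
    "max_depth=8": [0.74, 0.73],
}
chosen = best_setting(auc_by_depth)
print(auc, chosen)
```

The pairwise computation is quadratic in the number of examples, which is fine for a sketch; a rank-based implementation would be preferable on large validation sets.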

### Extended Data Fig. 9 Science policy experiments revealing the relationship between community independence, collective attention, and certainty about genetic regulatory interactions (complement to Fig. 4).

a, Relationship between the number of communities studying a particular genetic regulatory interaction and the average AUC of out-of-sample predictions for positive interactions. b, Distribution of the average AUC curves for Literome for interactions with 1, 2–3 and greater than 4 communities. c, Relationship between the shape of the distribution of the number of claims per interaction and the AUC of out-of-sample predictions for positive interactions. β represents the slope of the claim number per interaction distribution for Literome.

### Extended Data Fig. 10 Positivity bias in published effects and prediction results for Literome (complement to Fig. 3); random forest Gini Importance scores and logistic regression coefficients for features from Literome (complement to Fig. 2b).

a, Joint plot of the mean experimental interaction strength (x-axis) and the mean value of the published claim (y-axis) for each genetic interaction. More intense hues of red (and greater marker size) correspond to interactions in Literome with 10 or more claims per interaction; for less intense hues (and smaller marker size) the cutoff is absent, representing the complete distribution. (See Fig. 3a for the comparable GeneWays distribution.) b, We first predicted the nonexistence or existence of each published gene–gene regulatory interaction (Literome). c, Then, if the interaction was deemed existent, we predicted whether each claim (of positivity or negativity) from the literature was correct. d, Using Bayesian inference, we estimated the sign (positive versus negative) of all genetic regulatory interactions. Mean ROC curves in bold are complemented by 95% c.i. contours, with fainter individual lines corresponding to ROC curves for 60 models corresponding to different training/validation samples. (Complement to Fig. 3 in the main manuscript.) e, Gini importance, or mean decrease in impurity, for features in the random forest models (left vertical scale, bold colors), and coefficients from the logistic regression models (right vertical scale, fainter colors) for Literome. Vertical bars represent the 95% c.i. for the mean value of the estimate.
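The step in panel d, combining per-claim correctness predictions into a belief about the sign of an interaction, can be sketched with a simple Bayesian update. The example below is a simplification and not the authors' model: it assumes claims are conditionally independent given the true sign, that each claim i is correct with a known probability q_i (standing in for the output of a claim-level classifier), and that the prior probability of a positive interaction is supplied by the caller. The function name and all inputs are hypothetical.

```python
import math


def posterior_positive(claims, p_correct, prior_positive=0.5):
    """Posterior probability that an interaction is positive.

    claims: +1 for a published claim of positivity, -1 for negativity.
    p_correct: assumed probability that each claim is correct, strictly in (0, 1).
    Assumes claims are conditionally independent given the true sign.
    """
    if len(claims) != len(p_correct):
        raise ValueError("claims and p_correct must have the same length")
    if not 0.0 < prior_positive < 1.0:
        raise ValueError("prior_positive must be strictly between 0 and 1")
    log_pos = math.log(prior_positive)
    log_neg = math.log(1.0 - prior_positive)
    for sign, q in zip(claims, p_correct):
        if not 0.0 < q < 1.0:
            raise ValueError("each p_correct must be strictly between 0 and 1")
        if sign > 0:
            # A positive claim is correct if the interaction is truly positive.
            log_pos += math.log(q)
            log_neg += math.log(1.0 - q)
        else:
            # A negative claim is correct if the interaction is truly negative.
            log_pos += math.log(1.0 - q)
            log_neg += math.log(q)
    # Normalise in log space to avoid underflow with many claims.
    top = max(log_pos, log_neg)
    w_pos = math.exp(log_pos - top)
    w_neg = math.exp(log_neg - top)
    return w_pos / (w_pos + w_neg)


# Two positive claims and one negative claim with made-up reliabilities.
posterior = posterior_positive([+1, +1, -1], [0.9, 0.6, 0.7])
print(round(posterior, 4))
```

Under these assumptions, two equally reliable claims of opposite sign cancel exactly, and a single claim moves a flat prior to its own reliability. Relaxing the independence assumption, for example for claims from overlapping author communities, would require a richer model.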

## Supplementary information

### Supplementary Information

Supplementary Discussion and Tables 1–3.

## Rights and permissions

Belikov, A.V., Rzhetsky, A. & Evans, J. Prediction of robust scientific facts from literature. Nat Mach Intell 4, 445–454 (2022). https://doi.org/10.1038/s42256-022-00474-8

