Analysis

Reproducibility of computational workflows is automated using continuous analysis

Nature Biotechnology volume 35, pages 342–346 (2017)

Abstract

Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.
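The rerun-on-change idea described in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: a continuous integration service would normally trigger the rerun on each commit, whereas this sketch detects changes locally by hashing the input files. The image name, file names, log format and the helper names `fingerprint`, `docker_command` and `continuous_analysis` are all hypothetical.

```python
import hashlib
import json
import subprocess
from pathlib import Path


def fingerprint(paths):
    """Return a SHA-256 digest over the names and contents of the given files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(str(path).encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def docker_command(image, workdir, script):
    """Build a `docker run` command that executes the script inside a pinned image."""
    return [
        "docker", "run", "--rm",
        "-v", f"{Path(workdir).resolve()}:/analysis",  # mount code and data
        "-w", "/analysis",                              # run from the mount point
        image,
        "python", script,
    ]


def continuous_analysis(inputs, image, workdir, script, log_path, run=subprocess.run):
    """Rerun the analysis when inputs changed and append an audit-trail record.

    Returns True if the analysis was rerun, False if nothing had changed.
    `run` is injectable so the logic can be exercised without Docker installed.
    """
    log_path = Path(log_path)
    history = json.loads(log_path.read_text()) if log_path.exists() else []
    current = fingerprint(inputs)
    if history and history[-1]["fingerprint"] == current:
        return False  # same code and data as the last recorded run
    command = docker_command(image, workdir, script)
    run(command, check=True)  # raises if the containerized analysis fails
    history.append({"fingerprint": current, "image": image, "command": command})
    log_path.write_text(json.dumps(history, indent=2))
    return True
```

A call such as `continuous_analysis(["analysis.py", "data.csv"], "example/analysis:1.0", ".", "analysis.py", "audit.json")` would rerun the containerized analysis only when the script or data changed. The JSON log stands in for the audit trail the abstract mentions, recording which inputs and which image produced each run.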

Accessions

Primary accessions

Gene Expression Omnibus

Sequence Read Archive

Referenced accessions

Gene Expression Omnibus

Acknowledgements

We would like to thank D. Balli (University of Pennsylvania) for providing the RNA-seq analysis design, K. Siewert (University of Pennsylvania) for providing the phylogenetic analysis design and A. Whan (Commonwealth Scientific and Industrial Research Organization) for contributing a Travis-CI implementation. We also thank M. Paul, Y. Park, G. Way, A. Campbell, J. Taroni and L. Zhou for serving as usability testers during the implementation of continuous analysis. This work was supported by the Gordon and Betty Moore Foundation under a Data Driven Discovery Investigator Award to C.S.G. (GBMF 4552). B.K.B.-J. was supported by a Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health and by US National Institutes of Health grants AI116794 and LM010098.

Author information

Affiliations

  1. Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

    • Brett K Beaulieu-Jones
  2. Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

    • Casey S Greene

Contributions

B.K.B.-J. and C.S.G. conceived the study and designed the solution. B.K.B.-J. implemented continuous analysis. B.K.B.-J. and C.S.G. wrote the manuscript.

Competing interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to Casey S Greene.

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–6

  2. Supplementary Data 1

    Top 104 most recent papers citing the manuscript used and the Custom CDF version used. The 104 most recent papers identified using Web of Science on November 14, 2016 that reference the BrainArray manuscript and the version of the Custom CDF that was specified.

  3. Supplementary Data 2

    Top 116 most cited papers citing the manuscript used and the Custom CDF version used. The 116 most cited papers identified using Web of Science on November 14, 2016 that reference the BrainArray manuscript and the version of the Custom CDF that was specified.

CSV files

  1. Supplementary Data 3

    Complete P values for Custom CDF version 18.

  2. Supplementary Data 4

    Complete P values for Custom CDF version 19.

  3. Supplementary Data 5

    Complete P values for Custom CDF version 20.

Zip files

  1. Supplementary Source Code

    Continuous analysis source code. This includes template workflows for multiple distinct continuous analysis providers.

About this article

DOI

https://doi.org/10.1038/nbt.3780
