Reproducibility of computational workflows is automated using continuous analysis

Abstract

Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.
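The rerun-on-change idea in the abstract can be illustrated with a minimal sketch. It assumes a content fingerprint stands in for the trigger; in practice a continuous integration service reacts to pushes to a repository. The function names `fingerprint` and `needs_rerun` and the file names in `demo` are hypothetical and not taken from the article.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(paths):
    """Return one SHA-256 digest covering the names and contents of all files."""
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_rerun(paths, last_fingerprint):
    """Compare the current fingerprint with the one stored after the last run.

    Returns (rerun_required, current_fingerprint).
    """
    current = fingerprint(paths)
    return current != last_fingerprint, current


def demo():
    """Show that editing the data, but not an unchanged state, triggers a rerun."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "analysis.py"   # hypothetical analysis script
        data = Path(tmp) / "samples.csv"     # hypothetical input data
        script.write_text("print('analysis')\n")
        data.write_text("sample,value\na,1\n")

        first, stored = needs_rerun([script, data], None)
        unchanged, stored = needs_rerun([script, data], stored)

        data.write_text("sample,value\na,1\nb,2\n")  # a new sample is added
        after_edit, stored = needs_rerun([script, data], stored)
        return first, unchanged, after_edit
```

Calling `demo()` reports whether a rerun is required on the first run, on an unchanged second run and after the data file is edited.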

Figure 1: Reporting of Custom CDF file descriptors in published papers.
Figure 2: Research computing versus container-based approaches for differential gene expression analysis of HeLa cells.
Figure 3: Setting up continuous analysis.
Figure 4: Reproducible workflows with continuous analysis.

Accession codes

Primary accessions

Gene Expression Omnibus

NCBI Reference Sequence

Sequence Read Archive

Referenced accessions

Gene Expression Omnibus

Acknowledgements

We would like to thank D. Balli (University of Pennsylvania) for providing the RNA-seq analysis design, K. Siewert (University of Pennsylvania) for providing the phylogenetic analysis design and A. Whan (Commonwealth Scientific and Industrial Research Organization) for contributing a Travis-CI implementation. We also thank M. Paul, Y. Park, G. Way, A. Campbell, J. Taroni and L. Zhou for serving as usability testers during the implementation of continuous analysis. This work was supported by the Gordon and Betty Moore Foundation under a Data Driven Discovery Investigator Award to C.S.G. (GBMF 4552). B.K.B.-J. was supported by a Commonwealth Universal Research Enhancement (CURE) Program grant from the Pennsylvania Department of Health and by US National Institutes of Health grants AI116794 and LM010098.

Author information

Contributions

B.K.B.-J. and C.S.G. conceived the study and designed the solution. B.K.B.-J. implemented continuous analysis. B.K.B.-J. and C.S.G. wrote the manuscript.

Corresponding author

Correspondence to Casey S. Greene.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 Example continuous integration log.

Continuous integration log showing quantification of RNA transcript abundances from RNA-seq data using kallisto.

Supplementary Figure 2 Example continuous analysis branch workflow.

Code changes are made on development branches. When completed, changes are merged into the staging branch and continuous integration runs. If the continuous integration process succeeds, changes are merged into the master branch and pushed along with regenerated figures and results.
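The branch workflow described in this caption can be sketched as a small in-memory model. This is an illustration under stated assumptions, not the authors' implementation: the `Repository` class, its fields and the function names are hypothetical, and a real setup would use a version control system and a continuous integration service.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Toy model of a repository with a staging and a master branch."""
    staging: list = field(default_factory=list)
    master: list = field(default_factory=list)
    published_results: dict = field(default_factory=dict)


def merge_to_staging(repo, change):
    """Merge a completed development change into the staging branch."""
    repo.staging.append(change)


def run_continuous_integration(repo, analysis):
    """Run the analysis on staging; promote to master only if it succeeds.

    `analysis` is any callable that takes the list of staged changes and
    returns a dict of regenerated results, raising an exception on failure.
    """
    try:
        results = analysis(list(repo.staging))
    except Exception:
        return False  # master and the published results stay untouched
    repo.master = list(repo.staging)
    repo.published_results = results
    return True
```

With this model, a failing analysis leaves the master branch and the last published results as they were, which is the property the staging branch is there to provide.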

Supplementary Figure 3 Example basic YAML file structure.

Example .yml file structure: choose your Docker image, run tests, perform analysis and then publish results.
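The four stages named in this caption can be modeled as an ordered list of steps that stops at the first failure. The sketch below is written in Python rather than YAML and is only an assumption about how such a configuration behaves; the image name `example/analysis:1.0`, the stage functions and the `context` keys are all hypothetical.

```python
def run_pipeline(stages, context):
    """Run (name, step) pairs in order and stop at the first failing step.

    Returns the names of the stages that completed.
    """
    completed = []
    for name, step in stages:
        try:
            step(context)
        except Exception as error:
            context["failed_stage"] = name
            context["error"] = str(error)
            break
        completed.append(name)
    return completed


def choose_image(context):
    # Hypothetical image name; a real configuration would name a Docker image.
    context["image"] = "example/analysis:1.0"


def run_tests(context):
    if "image" not in context:
        raise RuntimeError("no image selected")


def perform_analysis(context):
    values = context["data"]
    context["results"] = {"mean": sum(values) / len(values)}


def publish_results(context):
    context["published"] = dict(context["results"])


STAGES = [
    ("choose_image", choose_image),
    ("run_tests", run_tests),
    ("perform_analysis", perform_analysis),
    ("publish_results", publish_results),
]
```

Running `run_pipeline(STAGES, {"data": [1.0, 2.0, 3.0]})` completes all four stages, whereas an empty data list fails in the analysis stage and nothing is published.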

Supplementary Figure 4 Consensus phylogenetic tree tracked between two continuous analysis runs.

The effect of adding the HumanTw2 sequence to the constructed phylogenetic tree in two different continuous analysis runs.

Supplementary Figure 5 Principal component analysis plot of kallisto transcript quantification.

The effect of adding an additional organoid derived from pancreatic adenocarcinoma on principal component analysis using kallisto’s estimated counts.

Supplementary Figure 6 Differential expression analysis before and after adding an additional sample.

A volcano plot of the P value versus the log fold change. Adding an additional organoid derived from pancreatic adenocarcinoma leads to one more gene being marked as significantly differentially expressed after Benjamini-Hochberg correction.
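The Benjamini-Hochberg correction mentioned in this caption can be sketched in a few lines of plain Python. This is the standard step-up procedure, not the authors' analysis code, and the P values in the usage note are invented for illustration only.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down so each adjusted value is the
    # minimum of p * m / rank over its own rank and all larger ranks.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def significant(p_values, alpha=0.05):
    """Flag tests whose adjusted P value is at or below alpha."""
    return [q <= alpha for q in benjamini_hochberg(p_values)]
```

As a usage example with made-up numbers, `significant([0.001, 0.02, 0.04, 0.30])` flags the first two tests, while `significant([0.001, 0.02, 0.03, 0.30])` flags the first three. A small shift in one P value, such as might follow from adding a sample, can change which genes pass the threshold.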

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–6 (PDF 650 kb)

Supplementary Data 1

The 104 most recent papers citing the BrainArray manuscript and the Custom CDF version each specified. Papers were identified using Web of Science on November 14, 2016. (PDF 444 kb)

Supplementary Data 2

The 116 most cited papers citing the BrainArray manuscript and the Custom CDF version each specified. Papers were identified using Web of Science on November 14, 2016. (PDF 490 kb)

Supplementary Data 3

Complete P values for Custom CDF version 18. (CSV 961 kb)

Supplementary Data 4

Complete P values for Custom CDF version 19. (CSV 966 kb)

Supplementary Data 5

Complete P values for Custom CDF version 20. (CSV 948 kb)

Supplementary Source Code

Continuous analysis source code. This includes template workflows for multiple distinct continuous analysis providers. (ZIP 1345 kb)

Cite this article

Beaulieu-Jones, B., Greene, C. Reproducibility of computational workflows is automated using continuous analysis. Nat Biotechnol 35, 342–346 (2017). https://doi.org/10.1038/nbt.3780

Further reading

  • TaxisPy: A Python-based software for the quantitative analysis of bacterial chemotaxis

    Miguel Á. Valderrama-Gómez, Rebecca A. Schomer, Michael A. Savageau & Rebecca E. Parales

    Journal of Microbiological Methods (2020)

  • Curious Containers: A framework for computational reproducibility in life sciences with support for Deep Learning applications

    Christoph Jansen, Jonas Annuscheit, Bruno Schilling, Klaus Strohmenger, Michael Witt, Felix Bartusch, Christian Herta, Peter Hufnagl & Dagmar Krefting

    Future Generation Computer Systems (2020)

  • Toward a Standardized Strategy of Clinical Metabolomics for the Advancement of Precision Medicine

    Nguyen Phuoc Long, Tran Diem Nghi, Yun Pyo Kang, Nguyen Hoang Anh, Hyung Min Kim, Sang Ki Park & Sung Won Kwon

    Metabolites (2020)

  • Responsible, practical genomic data sharing that accelerates research

    James Brian Byrd, Anna C. Greene, Deepashree Venkatesh Prasad, Xiaoqian Jiang & Casey S. Greene

    Nature Reviews Genetics (2020)

  • MEMOTE for standardized genome-scale metabolic model testing

    Christian Lieven, Moritz E. Beber, Brett G. Olivier, Frank T. Bergmann, Meric Ataman, Parizad Babaei, Jennifer A. Bartell, Lars M. Blank, Siddharth Chauhan, Kevin Correia, Christian Diener, Andreas Dräger, Birgitta E. Ebert, Janaka N. Edirisinghe, José P. Faria, Adam M. Feist, Georgios Fengos, Ronan M. T. Fleming, Beatriz García-Jiménez, Vassily Hatzimanikatis, Wout van Helvoirt, Christopher S. Henry, Henning Hermjakob, Markus J. Herrgård, Ali Kaafarani, Hyun Uk Kim, Zachary King, Steffen Klamt, Edda Klipp, Jasper J. Koehorst, Matthias König, Meiyappan Lakshmanan, Dong-Yup Lee, Sang Yup Lee, Sunjae Lee, Nathan E. Lewis, Filipe Liu, Hongwu Ma, Daniel Machado, Radhakrishnan Mahadevan, Paulo Maia, Adil Mardinoglu, Gregory L. Medlock, Jonathan M. Monk, Jens Nielsen, Lars Keld Nielsen, Juan Nogales, Intawat Nookaew, Bernhard O. Palsson, Jason A. Papin, Kiran R. Patil, Mark Poolman, Nathan D. Price, Osbaldo Resendis-Antonio, Anne Richelle, Isabel Rocha, Benjamín J. Sánchez, Peter J. Schaap, Rahuman S. Malik Sheriff, Saeed Shoaie, Nikolaus Sonnenschein, Bas Teusink, Paulo Vilaça, Jon Olav Vik, Judith A. H. Wodke, Joana C. Xavier, Qianqian Yuan, Maksim Zakhartsev & Cheng Zhang

    Nature Biotechnology (2020)