Reproducibility of computational workflows is automated using continuous analysis

Journal name: Nature Biotechnology
Volume: 35
Pages: 342–346
DOI: 10.1038/nbt.3780

Abstract

Replication, validation and extension of experiments are crucial for scientific progress. Computational experiments are scriptable and should be easy to reproduce. However, computational analyses are designed and run in a specific computing environment, which may be difficult or impossible to match using written instructions. We report the development of continuous analysis, a workflow that enables reproducible computational analyses. Continuous analysis combines Docker, a container technology akin to virtual machines, with continuous integration, a software development technique, to automatically rerun a computational analysis whenever updates or improvements are made to source code or data. This enables researchers to reproduce results without contacting the study authors. Continuous analysis allows reviewers, editors or readers to verify reproducibility without manually downloading and rerunning code and can provide an audit trail for analyses of data that cannot be shared.
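
As a rough illustration of the workflow described above, the sketch below assembles the two Docker commands a continuous integration job could run on each push: pull a pinned image, then run the analysis script inside it with the checked-out code mounted. The function name, image name, script name and paths are hypothetical and are not taken from the paper's repository. The commands are only built and printed, not executed.

```python
import shlex
from typing import List


def continuous_analysis_commands(image: str, script: str, checkout_dir: str,
                                 workdir: str = "/analysis") -> List[List[str]]:
    """Build (but do not run) the Docker commands for one CI-triggered rerun.

    All arguments are illustrative assumptions: `image` is a pinned Docker
    image, `script` runs the analysis from start to finish, and
    `checkout_dir` is the absolute path where the CI service checked out
    the code.
    """
    # Require an explicit tag or digest so every rerun uses the same environment.
    name = image.rsplit("/", 1)[-1]
    if ":" not in name and "@" not in name:
        raise ValueError(f"image {image!r} should be pinned to a tag or digest")

    pull = ["docker", "pull", image]
    run = [
        "docker", "run",
        "-v", f"{checkout_dir}:{workdir}",  # mount the checked-out code and data
        "-w", workdir,                      # run from inside the mounted directory
        image,
        "bash", script,
    ]
    return [pull, run]


if __name__ == "__main__":
    for cmd in continuous_analysis_commands(
            "example/analysis:1.0", "run_analysis.sh", "/ci/checkout"):
        print(shlex.join(cmd))
```

Rejecting an unpinned image is a design choice of this sketch: an untagged image can change between runs, which would undermine the fixed computing environment the approach relies on.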

At a glance

Figures

  1. Figure 1: Reporting of Custom CDF file descriptors in published papers.

    (a,b) CDF version reporting in the 100 most recent (a) and the 100 most cited (b) papers citing Dai et al. (ref. 13) that use Custom CDF. Each circle represents one manuscript; color coding indicates the Custom CDF version used.

  2. Figure 2: Research computing versus container-based approaches for differential gene expression analysis of HeLa cells.

    (a,b) Numbers of significantly differentially expressed genes identified using different versions of software packages (a) and a container-based approach with a defined computing environment (b). n = 3 biological replicates per group (wild-type or double-knockdown HeLa cells).

  3. Figure 3: Setting up continuous analysis.

    Continuous analysis can be set up in three steps. First, the researcher creates a Docker container with the required software (1). The researcher then configures a continuous integration service to use this Docker image (2) and pushes code that includes a script capable of running the analyses from start to finish (3). The continuous integration provider runs the latest version of the code in the specified Docker environment without manual intervention. This generates a Docker container with intermediate results that allows anyone to rerun the analysis in the same environment, produces updated figures and stores logs describing what occurred. Example configurations are available in Online Methods and at https://github.com/greenelab/continuous_analysis.

  4. Figure 4: Reproducible workflows with continuous analysis.

    (a,b) Phylogenetic tree building with four mRNA samples (MouseTw1, HumanTw1, MouseTw2 and FlyTw) (a) and an additional gene (HumanTw2) (b). (c,d) RNA-seq differential expression experiment principal component (PC) analysis before (c) and after (d) addition of a sample (mT8).

  5. Supplementary Fig. 1: Example continuous integration log.

    Continuous integration log showing quantification of the abundances of RNA transcripts from RNA-seq data using Kallisto.

  6. Supplementary Fig. 2: Example continuous analysis branch workflow.

    Code changes are made on development branches. When completed, changes are merged into the staging branch and continuous integration runs. If the continuous integration process succeeds, changes are merged into the master branch and pushed along with regenerated figures and results.

  7. Supplementary Fig. 3: Example basic YAML file structure.

    Example .yml file structure: choose your Docker image, run tests, perform analysis and then publish results.

  8. Supplementary Fig. 4: Consensus phylogenetic tree tracked between two continuous analysis runs.

    The effect of adding the HumanTw2 sequence to the constructed phylogenetic tree in two different continuous analysis runs.

  9. Supplementary Fig. 5: Principal component analysis plot of kallisto transcript quantification.

    The effect of adding an additional organoid derived from pancreatic adenocarcinoma on principal components analysis using Kallisto’s estimated counts.

  10. Supplementary Fig. 6: Differential expression analysis before and after adding an additional sample.

    A volcano plot of the P value versus the log fold change. Adding an additional organoid derived from pancreatic adenocarcinoma leads to an additional gene being marked as significantly differentially expressed after Benjamini–Hochberg correction.
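
The Benjamini and Hochberg correction mentioned in the last caption can be sketched in a few lines. The code below is a plain-Python illustration of the standard step-up procedure, not the authors' analysis code. The P values are made up to show how a small change in one gene's P value, such as might follow from adding a sample, can move that gene under a 0.05 threshold after correction.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, scaling each by m / rank and
    # keeping a running minimum so adjusted values never increase with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_genes(p_values: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Indices of genes whose adjusted P value is at most alpha."""
    return [i for i, q in enumerate(benjamini_hochberg(p_values)) if q <= alpha]


if __name__ == "__main__":
    # Hypothetical raw P values for four genes, before and after adding a sample.
    before = [0.001, 0.02, 0.04, 0.30]
    after = [0.001, 0.02, 0.03, 0.30]
    print("significant before:", significant_genes(before))
    print("significant after: ", significant_genes(after))
```

In this made-up example the third gene is not significant in the first run but is in the second, which is the kind of change a continuous analysis run would surface automatically in its regenerated figures and logs.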

Accession codes

Primary accessions

Gene Expression Omnibus

Sequence Read Archive

Referenced accessions

Gene Expression Omnibus

References

  1. Anonymous. Rebooting review. Nat. Biotechnol. 33, 319 (2015).
  2. Anonymous. Software with impact. Nat. Methods 11, 211 (2014).
  3. Peng, R.D. Reproducible research in computational science. Science 334, 1226–1227 (2011).
  4. McNutt, M. Reproducibility. Science 343, 229 (2014).
  5. Anonymous. Illuminating the black box. Nature 442, 1 (2006).
  6. Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016).
  7. Garijo, D. et al. Quantifying reproducibility in computational biology: the case of the tuberculosis drugome. PLoS One 8, e80278 (2013).
  8. Kinnings, S.L. et al. The Mycobacterium tuberculosis drugome and its polypharmacological implications. PLoS Comput. Biol. 6, e1000976 (2010).
  9. Ioannidis, J.P.A. et al. Repeatability of published microarray gene expression analyses. Nat. Genet. 41, 149–155 (2009).
  10. Hothorn, T. & Leisch, F. Case studies in reproducibility. Brief. Bioinform. 12, 288–300 (2011).
  11. Groves, T. & Godlee, F. Open science and reproducible research. Br. Med. J. 344, e4383 (2012).
  12. Boettiger, C. An introduction to Docker for reproducible research, with examples from the R environment. ACM SIGOPS Oper. Syst. Rev. 49, 71–79 (2015).
  13. Dai, M. et al. Evolving gene/transcript definitions significantly alter the interpretation of GeneChip data. Nucleic Acids Res. 33, e175 (2005).
  14. Núñez, M., Sánchez-Jiménez, C., Alcalde, J. & Izquierdo, J.M. Long-term reduction of T-cell intracellular antigens reveals a transcriptome associated with extracellular matrix and cell adhesion components. PLoS One 9, e113141 (2014).
  15. Docker v.1.12.5, build 7392c3b (Docker, 2016).
  16. Duvall, P., Matyas, S. & Glover, A. Continuous Integration: Improving Software Quality and Reducing Risk (Addison-Wesley Professional, 2007).
  17. Pérez, F. & Granger, B.E. IPython: a system for interactive scientific computing. Comput. Sci. Eng. 9, 21–29 (2007).
  18. Jupyter v.4.1.0 (Project Jupyter, 2016).
  19. RStudio: Integrated Development for R: v.0.98.1083 (RStudio Inc., 2015).
  20. Baumer, B., Cetinkaya-Rundel, M., Bray, A., Loi, L. & Horton, N.J. R Markdown: integrating a reproducible analysis tool into introductory statistics. Technol. Innov. Stat. Educ. 8, uclastat_cts_tise_20118 (2014).
  21. Leisch, F. Sweave: dynamic generation of statistical reports using literate data analysis. Proc. Comput. Stat. 2002, 575–580 (2002).
  22. Beaulieu-Jones, B.K. & Greene, C.S. Semi-supervised learning of the electronic health record for phenotype stratification. J. Biomed. Inform. 64, 168–178 (2016).
  23. Katoh, K., Misawa, K., Kuma, K. & Miyata, T. MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res. 30, 3059–3066 (2002).
  24. Felsenstein, J. PHYLIP - phylogeny inference package (version 3.2). Cladistics 5, 164–166 (1989).
  25. Boj, S.F. et al. Organoid models of human and mouse ductal pancreatic cancer. Cell 160, 324–338 (2015).
  26. Bray, N.L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 34, 525–527 (2016).
  27. Ritchie, M.E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).
  28. Smyth, G.K. Linear models and empirical bayes methods for assessing differential expression in microarray experiments. Stat. Appl. Genet. Mol. Biol. 3, e3 (2004).
  29. Pimentel, H.J., Bray, N., Puente, S., Melsted, P. & Pachter, L. Differential analysis of RNA-seq incorporating quantification uncertainty. Preprint at bioRxiv https://doi.org/10.1101/058164 (2016).
  30. Souilmi, Y. et al. Scalable and cost-effective NGS genotyping in the cloud. BMC Med. Genomics 8, 64 (2015).
  31. Stodden, V. et al. Enhancing reproducibility for computational methods. Science 354, 1240–1241 (2016).
  32. Pollard, K.S., Dudoit, S. & van der Laan, M.J. Multiple testing procedures: the multtest package and applications to genomics. in Bioinformatics and Computational Biology Solutions Using R and Bioconductor (eds. Gentleman, R. et al.) (Springer New York, 2005).
  33. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).

Author information

Affiliations

  1. Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

    • Brett K Beaulieu-Jones
  2. Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA.

    • Casey S Greene

Contributions

B.K.B.-J. and C.S.G. conceived the study and designed the solution. B.K.B.-J. implemented continuous analysis. B.K.B.-J. and C.S.G. wrote the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Example continuous integration log. (166 KB)

    Continuous integration log showing quantification of the abundances of RNA transcripts from RNA-seq data using Kallisto.

  2. Supplementary Figure 2: Example continuous analysis branch workflow. (68 KB)

    Code changes are made on development branches. When completed, changes are merged into the staging branch and continuous integration runs. If the continuous integration process succeeds, changes are merged into the master branch and pushed along with regenerated figures and results.

  3. Supplementary Figure 3: Example basic YAML file structure. (20 KB)

    Example .yml file structure: choose your Docker image, run tests, perform analysis and then publish results.

  4. Supplementary Figure 4: Consensus phylogenetic tree tracked between two continuous analysis runs. (253 KB)

    The effect of adding the HumanTw2 sequence to the constructed phylogenetic tree in two different continuous analysis runs.

  5. Supplementary Figure 5: Principal component analysis plot of kallisto transcript quantification. (79 KB)

    The effect of adding an additional organoid derived from pancreatic adenocarcinoma on principal components analysis using Kallisto’s estimated counts.

  6. Supplementary Figure 6: Differential expression analysis before and after adding an additional sample. (154 KB)

    A volcano plot of the P value versus the log fold change. Adding an additional organoid derived from pancreatic adenocarcinoma leads to an additional gene being marked as significantly differentially expressed after Benjamini–Hochberg correction.

PDF files

  1. Supplementary Text and Figures (666 KB)

    Supplementary Figures 1–6

  2. Supplementary Data 1 (454 KB)

    The 104 most recent papers citing the manuscript and the Custom CDF version each used.
    The 104 most recent papers that reference the BrainArray manuscript, identified using Web of Science on November 14, 2016, with the version of the Custom CDF that each specified.

  3. Supplementary Data 2 (501 KB)

    The 116 most cited papers citing the manuscript and the Custom CDF version each used.
    The 116 most cited papers that reference the BrainArray manuscript, identified using Web of Science on November 14, 2016, with the version of the Custom CDF that each specified.

Other

  1. Supplementary Data 3 (984 KB)

    Complete P values for Custom CDF version 18.

  2. Supplementary Data 4 (989 KB)

    Complete P values for Custom CDF version 19.

  3. Supplementary Data 5 (971 KB)

    Complete P values for Custom CDF version 20.

Zip files

  1. Supplementary Source Code (1377 KB)

    Continuous analysis source code.
    This includes template workflows for multiple distinct continuous analysis providers.
