Butler enables rapid cloud-based analysis of thousands of human genomes

A Publisher Correction to this article was published on 12 February 2020

This article has been updated

Abstract

We present Butler, a computational tool that facilitates large-scale genomic analyses on public and academic clouds. Butler includes innovative anomaly detection and self-healing functions that improve the efficiency of data processing and analysis by 43% compared with current approaches. Butler enabled processing of a 725-terabyte cancer genome dataset from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project in a time-efficient and uniform manner.

Fig. 1: Butler framework architecture.
Fig. 2: Butler performance comparison.

Data availability

PCAWG’s final callsets, somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium are described in ref. 7 and available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier that does not require access approval. To access potentially identifying information, such as germline alleles and underlying sequencing data, researchers need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) for access to the TCGA portion of the dataset and to the ICGC Data Access Compliance Office (DACO; http://icgc.org/daco) for access to the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers also need to obtain dbGaP authorization.

Code availability

The source code for Butler is freely available at http://github.com/llevar/butler under the GPL v3.0 license.

The project-specific deployment settings, configurations, analysis definitions, and workflows are available at the following:

PCAWG Germline Project: https://github.com/llevar/pcawg-germline

EOSC Pilot: https://github.com/llevar/eosc_pilot

Pan-Prostate Cancer Group: https://github.com/llevar/pan-prostate

The R source code for the analysis is available at https://github.com/llevar/butler_perf_analysis.

The core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public at https://dockstore.org/search?search=pcawg under the GNU General Public License v3.0, which allows for reuse and distribution.

References

1. Habermann, N., Mardin, B. R., Yakneen, S. & Korbel, J. O. Using large-scale genome variation cohorts to decipher the molecular mechanism of cancer. C. R. Biol. 339, 308–313 (2016).

2. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).

3. Vivian, J. & Paten, B. Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol. 35, 314–316 (2017).

4. Mashl, R. J. et al. GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res. 27, 1450–1459 (2017).

5. Stein, L. D., Knoppers, B. M., Campbell, P., Getz, G. & Korbel, J. O. Data analysis: create a cloud commons. Nature 523, 149–151 (2015).

6. Molnár-Gábor, F., Lueck, R., Yakneen, S. & Korbel, J. O. Computing patient data in the cloud: practical and legal considerations for genetics and genomics research in Europe and internationally. Genome Med. 9, 58 (2017).

7. Pan-cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature https://doi.org/10.1038/s41586-020-1969-6 (2020).

8. Soergel, D. A. Rampant software errors may undermine scientific results. F1000 Res. 3, 303 (2014).

9. Gormley, C. & Tong, Z. Elasticsearch: The Definitive Guide (O’Reilly Media, 2015).

10. Leipzig, J. A review of bioinformatic pipeline frameworks. Brief. Bioinformatics 18, 530–536 (2017).

11. Merkel, D. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 2 (2014).

12. Amstutz, P. et al. Common Workflow Language, v1.0. https://w3id.org/cwl/v1.0/; https://doi.org/10.6084/m9.figshare.3115156.v2 (2016).

13. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

14. Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).

15. Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).

16. Raine, K. M. et al. cgpPindel: identifying somatically acquired insertion and deletion events from paired end sequencing. Curr. Protoc. Bioinformatics 15, 15.7.11–15.7.12 (2015).

17. Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).

18. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

19. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

Acknowledgements

We acknowledge the contributions of the many clinical networks across ICGC and TCGA who provided samples and data to the PCAWG Consortium, and the contributions of the Technical Working Group and the Germline Working Group of the PCAWG Consortium for collation, realignment and harmonized variant calling of the PCAWG cancer genomes. We thank the patients and their families for their participation in the individual ICGC and TCGA projects. We also thank the PPCG project, and J. Weischenfeldt for assistance with the PPCG data. We are grateful to C. Yung, B. O’Connor, J. Zhang and L. Stein for their assistance and invaluable advice throughout the project and to A. Cafferkey, C. Short, D. Ocaña, D. Vianello, E. van den Bergh, S. Newhouse and E. Birney for invaluable support with the EMBL-EBI Embassy Cloud used largely for the computing in this study. We also acknowledge The Cancer Genome Collaboratory, Amazon Web Services, Google Compute Platform and Microsoft Azure for providing computing or cloud infrastructure. J.O.K. acknowledges support by the EOSC Pilot study (European Commission award number 739563), the BMBF (de.NBI project 031A537B), the European Research Council (336045) and the Heidelberg Academy of Sciences and Humanities. S.W. was supported through an SNSF Early Postdoc Mobility fellowship (P2ELP3_155365) and an EMBO Long-Term Fellowship (ALTF 755-2014).

Author information

Contributions

This manuscript was written by S.Y. and J.O.K., with input from all authors. S.Y. and J.O.K. are responsible for study conception. S.Y. designed, implemented, and executed the Butler software framework in the context of the analyses described in this manuscript. S.M.W. designed workflows and assessed the integrity of the framework. S.Y. led the data analysis, and S.M.W., M.G. and J.O.K. contributed to data analysis. The PCAWG Technical Working Group provided invaluable assistance and feedback. M.G. and J.O.K. provided supervision and project oversight.

Corresponding authors

Correspondence to Sergei Yakneen or Jan O. Korbel.

Ethics declarations

Competing interests

G.G. receives research funds from IBM and Pharmacyclics and is an inventor on patent applications related to MuTect, ABSOLUTE, MutSig, MSMuTect, MSMutSig and POLYSOLVER.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Freebayes workflow.

The Freebayes workflow can be used for small variant discovery and genotyping. It splits into tasks by chromosome, and each task can run in parallel (not all tasks are shown in the figure, to save space). The workflow is started and ended by the standard start_analysis_run and end_analysis_run tasks, which keep track of Analysis state. The validate_sample task makes sure that access to the data is available.
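The per-chromosome fan-out described in this caption can be illustrated with a minimal Python sketch. It is not Butler's implementation: the chromosome list, the `call_variants` stand-in and the use of a thread pool are assumptions made for illustration, and only the task names start_analysis_run, validate_sample and end_analysis_run come from the caption.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical contig list; the caption does not state which chromosomes are used.
CHROMOSOMES = [str(i) for i in range(1, 23)] + ["X", "Y"]


def validate_sample(sample_id):
    """Placeholder for the check that the sample's data is accessible."""
    return bool(sample_id)


def call_variants(sample_id, chromosome):
    """Stand-in for one per-chromosome task; returns an output label only."""
    return f"{sample_id}.chr{chromosome}.vcf"


def run_workflow(sample_id, max_workers=4):
    """Run the bracketing tasks in order and the chromosome tasks in parallel."""
    executed = ["start_analysis_run"]
    if not validate_sample(sample_id):
        raise ValueError("sample data is not accessible")
    executed.append("validate_sample")
    # One independent task per chromosome; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda c: call_variants(sample_id, c), CHROMOSOMES))
    executed.append("end_analysis_run")
    return executed, outputs
```

Because the chromosome tasks share no state, they can be scheduled in any order, which is what allows the parallel execution the caption mentions.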

Supplementary Figure 2 Freebayes task durations.

Boxplot of freebayes task durations during the SNV genotyping stage across 5668 samples. Durations are highly correlated with chromosome length (Pearson’s r=0.92). n=5668 biologically independent samples. Boxplot center line corresponds to the median, lower and upper hinges to the 25th and 75th percentiles, and whiskers to ±1.5 × the interquartile range from the hinges. The experiment was performed once.
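For readers who want to reproduce the summary statistics named in this caption, the following Python sketch computes the median, hinges, whisker ends and Pearson's r. It is an illustration only: the quantile method and the convention of ending each whisker at the most extreme data point within 1.5 × IQR of the hinge are assumptions, since the caption does not specify them, and the published analysis was done in R.

```python
import statistics


def boxplot_stats(values):
    """Median, 25th/75th percentile hinges and whisker ends for a boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Assumed convention: whiskers stop at the most extreme point inside the fences.
    return {
        "median": median,
        "lower_hinge": q1,
        "upper_hinge": q3,
        "lower_whisker": min(v for v in values if v >= lower_fence),
        "upper_whisker": max(v for v in values if v <= upper_fence),
    }


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return covariance / (spread_x * spread_y)
```

Applied to per-chromosome durations, `boxplot_stats` would give one box per chromosome, and `pearson_r` would relate durations to chromosome lengths.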

Supplementary Figure 3 Delly workflow durations.

(a) Distribution of Delly workflow durations for genotyping of 244,889 germline deletions across 5668 PCAWG samples. (b) Distribution of Delly workflow durations for genotyping of 217,433 germline duplications across 5668 PCAWG samples. n=5668 biologically independent samples. The experiment was performed once.

Supplementary Figure 4 Analysis Tracker UML diagram.

The Analysis Tracker consists of four entities that are necessary for keeping track of the state of scientific analyses run in Butler. The Workflow object keeps a registry of known workflows and their attributes. The Analysis object keeps track of analyses that are being performed. An Analysis Run represents an instance of running a particular workflow under a particular analysis on a particular sample. Configuration objects keep track of the parameters supplied to the workflow invocation.
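The four entities can be pictured as plain Python data classes. This is a sketch under stated assumptions, not Butler's schema: every field name below (`name`, `version`, `params`, `sample_id`) is hypothetical, and only the entity names and their relationships are taken from the caption.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    """Parameters supplied to a workflow invocation (field name is hypothetical)."""
    params: dict = field(default_factory=dict)


@dataclass
class Workflow:
    """Registry entry for a known workflow and its attributes."""
    name: str
    version: str
    config: Configuration = field(default_factory=Configuration)


@dataclass
class Analysis:
    """An analysis that is being performed."""
    name: str
    config: Configuration = field(default_factory=Configuration)


@dataclass
class AnalysisRun:
    """One workflow, run under one analysis, on one sample."""
    workflow: Workflow
    analysis: Analysis
    sample_id: str
    config: Configuration = field(default_factory=Configuration)
```

Modelling the run as a reference to one workflow, one analysis and one sample mirrors the caption's definition of an Analysis Run.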

Supplementary Figure 5 Analysis Run state transitions.

Each Analysis Run keeps track of its state and has a set of rules governing allowable state transitions. A Run is created in the Ready state from which it may be scheduled for execution. Once the corresponding workflow task is picked up for execution it is transitioned to In-Progress. Upon successful completion it is marked Completed. At any point a failure may put this run in an Error state from which it can recover only to the Ready state to initiate a re-execution of the corresponding workflow.
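The transition rules in this caption can be sketched as a small state machine in Python. The sketch makes two assumptions the caption leaves open: scheduling is modelled as a separate `SCHEDULED` state, and `COMPLETED` is treated as terminal. The class and method names are hypothetical.

```python
from enum import Enum


class RunState(Enum):
    READY = "Ready"
    SCHEDULED = "Scheduled"      # assumed intermediate state
    IN_PROGRESS = "In-Progress"
    COMPLETED = "Completed"
    ERROR = "Error"


# Allowed transitions; a failure can occur from any state before completion.
ALLOWED = {
    RunState.READY: {RunState.SCHEDULED, RunState.ERROR},
    RunState.SCHEDULED: {RunState.IN_PROGRESS, RunState.ERROR},
    RunState.IN_PROGRESS: {RunState.COMPLETED, RunState.ERROR},
    RunState.COMPLETED: set(),            # assumed terminal
    RunState.ERROR: {RunState.READY},     # recovery only via Ready
}


class TrackedRun:
    """Holds the current state and rejects transitions the rules do not allow."""

    def __init__(self):
        self.state = RunState.READY

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition: {self.state.value} -> {new_state.value}"
            )
        self.state = new_state
```

Forcing recovery through Ready means a failed run always re-enters the normal scheduling path instead of resuming mid-workflow.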

Supplementary Figure 6 Hierarchical tri-level configuration.

Configuration can be applied at three levels of granularity within Butler: Workflow, Analysis, and Analysis Run. Each higher-level configuration may override and augment the configurations supplied at lower levels. At runtime all three levels of configuration are resolved into an “effective configuration”, which is then applied for execution.
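One way to picture the resolution step is a recursive dictionary merge, sketched below in Python. This is an assumption-laden illustration rather than Butler's code: it assumes configurations are nested dictionaries and that precedence runs Workflow, then Analysis, then Analysis Run, with the most specific level winning. The function names and example keys are hypothetical.

```python
import copy


def merge(base, override):
    """Return base with override applied; nested dicts are merged key by key."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)   # augment nested settings
        else:
            result[key] = copy.deepcopy(value)        # override or add the key
    return result


def effective_configuration(workflow_cfg, analysis_cfg, run_cfg):
    """Resolve the three levels; later arguments take precedence (assumed order)."""
    return merge(merge(workflow_cfg, analysis_cfg), run_cfg)


workflow_cfg = {"threads": 4, "caller": {"mode": "discovery", "min_qual": 20}}
analysis_cfg = {"caller": {"mode": "genotyping"}}
run_cfg = {"sample": "S1", "threads": 8}
resolved = effective_configuration(workflow_cfg, analysis_cfg, run_cfg)
```

In this example the run-level `threads` overrides the workflow default, the analysis level changes only `caller.mode`, and `caller.min_qual` is inherited unchanged.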

Supplementary Figure 7 Butler compute cluster performance metrics during germline deletion genotyping for PCAWG.

(a) Overall load per VM in the Butler cluster: no load prior to analysis kick-off, then steady load throughout the analysis, and a drop-off in load at the end when VMs start running out of work. (b) CPU profile shows highly variable CPU utilization that is typical of Delly executions. (c) Memory profile is stable and similar between all VMs that are running the analysis. Similar measurements were observed in the other five analyses performed with Butler during PCAWG, although the exact pattern of CPU and memory utilization depends on the algorithms that comprise the workflow being executed.

Supplementary Figure 8 SQL Database state monitoring dashboard.

SQL database health can be ascertained from logs harvested on the database server. (a) 75th, 99th, and 99.5th percentiles of query response times. (b) Query counts by type. (c) Database READ and WRITE counts. (d) Data throughput in and out. These measurements were collected over a single 2-hour run of the software and serve as an example of visualization capabilities, not an indication of typical database performance.

Supplementary information

Supplementary Materials

Supplementary Figures 1–8, Supplementary Tables 1–3 and Supplementary Notes 1 and 2

Reporting Summary

About this article

Cite this article

Yakneen, S., Waszak, S.M., PCAWG Technical Working Group. et al. Butler enables rapid cloud-based analysis of thousands of human genomes. Nat Biotechnol 38, 288–292 (2020). https://doi.org/10.1038/s41587-019-0360-3
