We present Butler, a computational tool that facilitates large-scale genomic analyses on public and academic clouds. Butler includes innovative anomaly detection and self-healing functions that improve the efficiency of data processing and analysis by 43% compared with current approaches. Butler enabled processing of a 725-terabyte cancer genome dataset from the Pan-Cancer Analysis of Whole Genomes (PCAWG) project in a time-efficient and uniform manner.
PCAWG’s final callsets, somatic and germline variant calls, mutational signatures, subclonal reconstructions, transcript abundance, splice calls and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium are described in ref. 7 and are available for download at https://dcc.icgc.org/releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier that does not require access approval. To access potentially identifying information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) for access to the TCGA portion of the dataset and to the ICGC Data Access Compliance Office (DACO; http://icgc.org/daco) for access to the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorization.
The source code for Butler is freely available at http://github.com/llevar/butler under the GPL v3.0 license.
The project-specific deployment settings, configurations, analysis definitions, and workflows are available at the following:
PCAWG Germline Project: https://github.com/llevar/pcawg-germline
EOSC Pilot: https://github.com/llevar/eosc_pilot
Pan-Prostate Cancer Group: https://github.com/llevar/pan-prostate
The R source code for the analysis is available at https://github.com/llevar/butler_perf_analysis.
The core computational pipelines used by the PCAWG Consortium for alignment, quality control and variant calling are available to the public at https://dockstore.org/search?search=pcawg under the GNU General Public License v3.0, which allows for reuse and distribution.
Habermann, N., Mardin, B. R., Yakneen, S. & Korbel, J. O. Using large-scale genome variation cohorts to decipher the molecular mechanism of cancer. C. R. Biol. 339, 308–313 (2016).
Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35, 316–319 (2017).
Vivian, J. & Paten, B. Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol. 35, 314–316 (2017).
Mashl, R. J. et al. GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res. 27, 1450–1459 (2017).
Stein, L. D., Knoppers, B. M., Campbell, P., Getz, G. & Korbel, J. O. Data analysis: create a cloud commons. Nature 523, 149–151 (2015).
Molnár-Gábor, F., Lueck, R., Yakneen, S. & Korbel, J. O. Computing patient data in the cloud: practical and legal considerations for genetics and genomics research in Europe and internationally. Genome Med. 9, 58 (2017).
Pan-cancer Analysis of Whole Genomes Consortium. Pan-cancer analysis of whole genomes. Nature https://doi.org/10.1038/s41586-020-1969-6 (2020).
Soergel, D. A. Rampant software errors may undermine scientific results. F1000 Res. 3, 303 (2014).
Gormley, C. & Tong, Z. Elasticsearch: The Definitive Guide (O’Reilly Media, 2015).
Leipzig, J. A review of bioinformatic pipeline frameworks. Brief. Bioinformatics 18, 530–536 (2017).
Merkel, D. Docker: lightweight Linux containers for consistent development and deployment. Linux J. 2014, 2 (2014).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Garrison, E. & Marth, G. Haplotype-based variant detection from short-read sequencing. Preprint at https://arxiv.org/abs/1207.3907 (2012).
Rausch, T. et al. DELLY: structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
Raine, K. M. et al. cgpPindel: identifying somatically acquired insertion and deletion events from paired end sequencing. Curr. Protoc. Bioinformatics 15, 15.7.11–15.7.12 (2015).
Nik-Zainal, S. et al. Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54 (2016).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
We acknowledge the contributions of the many clinical networks across ICGC and TCGA who provided samples and data to the PCAWG Consortium, and the contributions of the Technical Working Group and the Germline Working Group of the PCAWG Consortium for collation, realignment and harmonized variant calling of the PCAWG cancer genomes. We thank the patients and their families for their participation in the individual ICGC and TCGA projects. We also thank the PPCG project, and J. Weischenfeldt for assistance with the PPCG data. We are grateful to C. Yung, B. O’Connor, J. Zhang and L. Stein for their assistance and invaluable advice throughout the project and to A. Cafferkey, C. Short, D. Ocaña, D. Vianello, E. van den Bergh, S. Newhouse and E. Birney for invaluable support with the EMBL-EBI Embassy Cloud used largely for the computing in this study. We also acknowledge The Cancer Genome Collaboratory, Amazon Web Services, Google Compute Platform and Microsoft Azure for providing computing or cloud infrastructure. J.O.K. acknowledges support by the EOSC Pilot study (European Commission award number 739563), the BMBF (de.NBI project 031A537B), the European Research Council (336045) and the Heidelberg Academy of Sciences and Humanities. S.W. was supported through an SNSF Early Postdoc Mobility fellowship (P2ELP3_155365) and an EMBO Long-Term Fellowship (ALTF 755-2014).
G.G. receives research funds from IBM and Pharmacyclics and is an inventor on patent applications related to MuTect, ABSOLUTE, MutSig, MSMuTect, MSMutSig and POLYSOLVER.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
The freebayes workflow can be used for small variant discovery and genotyping. It splits into tasks by chromosome, and each task can run in parallel (not all tasks are shown in the figure, to save space). The workflow is started and ended by the standard start_analysis_run and end_analysis_run steps, which keep track of the Analysis state. The validate_sample step makes sure that access to the data is available.
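The per-chromosome fan-out described in this caption can be sketched as a small dependency graph. This is an illustrative sketch only, not Butler's workflow definition: the task names of the form freebayes_<chromosome>, the chromosome list, and the placement of validate_sample directly after start_analysis_run are assumptions made for the example. It uses the standard-library graphlib module (Python 3.9 or later) to group tasks into waves that could run concurrently.

```python
from graphlib import TopologicalSorter

# Assumption: human autosomes plus X and Y; the caption does not list chromosomes.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def build_task_graph(chromosomes):
    """Map each task name to the list of tasks it depends on."""
    graph = {
        "start_analysis_run": [],
        # Assumption: validation runs right after the run is marked as started.
        "validate_sample": ["start_analysis_run"],
    }
    per_chromosome = []
    for chrom in chromosomes:
        task = f"freebayes_{chrom}"  # hypothetical task name
        graph[task] = ["validate_sample"]
        per_chromosome.append(task)
    # The closing step waits for every per-chromosome task.
    graph["end_analysis_run"] = per_chromosome
    return graph


def parallel_waves(graph):
    """Group tasks into waves; tasks within a wave have no mutual dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorter.get_ready()
        waves.append(sorted(ready))
        sorter.done(*ready)
    return waves


waves = parallel_waves(build_task_graph(CHROMOSOMES))
```

Under these assumptions the graph resolves into four waves: the start step, the validation step, the per-chromosome tasks (which are independent of one another), and the end step.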
Boxplot of freebayes task durations during the SNV genotyping stage across 5668 samples. Durations are highly correlated with chromosome length (Pearson’s r = 0.92). n = 5668 biologically independent samples. The boxplot center line corresponds to the median, the lower and upper hinges to the 25th and 75th percentiles, and the whiskers to ±1.5 × the interquartile range from the hinges. The experiment was performed once.
(a) Distribution of Delly workflow durations for genotyping of 244,889 germline deletions across 5668 PCAWG samples. (b) Distribution of Delly workflow durations for genotyping of 217,433 germline duplications across 5668 PCAWG samples. n=5668 biologically independent samples. The experiment was performed once.
The Analysis Tracker consists of four entities that are necessary for keeping track of the state of scientific analyses run in Butler. The Workflow object keeps a registry of known workflows and their attributes. The Analysis object keeps track of analyses that are being performed. An Analysis Run represents an instance of running a particular workflow under a particular analysis on a particular sample. Configuration objects keep track of the parameters supplied to the workflow invocation.
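The four entities can be pictured as plain data classes. The sketch below is a hypothetical model written for illustration, not Butler's actual schema: the field names (name, version, sample_id, values), the optional config attribute on each entity, and the example identifiers are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Configuration:
    """Parameters supplied to a workflow invocation."""
    values: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Workflow:
    """Registry entry for a known workflow and its attributes."""
    name: str
    version: str  # hypothetical attribute
    config: Optional[Configuration] = None


@dataclass
class Analysis:
    """An analysis that is being performed."""
    name: str
    config: Optional[Configuration] = None


@dataclass
class AnalysisRun:
    """One workflow, run under one analysis, on one sample."""
    workflow: Workflow
    analysis: Analysis
    sample_id: str
    config: Optional[Configuration] = None


# Example identifiers are placeholders, not real project names or sample IDs.
workflow = Workflow(name="freebayes", version="1.0")
analysis = Analysis(name="example-germline-genotyping")
runs = [AnalysisRun(workflow, analysis, sample_id=s) for s in ("sample-A", "sample-B")]
```

In this model, running one workflow across a cohort yields one AnalysisRun per sample, all sharing the same Workflow and Analysis objects.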
Each Analysis Run keeps track of its state and has a set of rules governing allowable state transitions. A Run is created in the Ready state from which it may be scheduled for execution. Once the corresponding workflow task is picked up for execution it is transitioned to In-Progress. Upon successful completion it is marked Completed. At any point a failure may put this run in an Error state from which it can recover only to the Ready state to initiate a re-execution of the corresponding workflow.
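These transition rules amount to a small state machine, sketched below. This is an interpretation of the caption rather than Butler's code: modeling "scheduled" as its own state, allowing a failure from any state other than Completed, and the class and exception names are assumptions.

```python
from enum import Enum


class RunState(Enum):
    READY = "Ready"
    SCHEDULED = "Scheduled"  # assumption: scheduling is modeled as a separate state
    IN_PROGRESS = "In-Progress"
    COMPLETED = "Completed"
    ERROR = "Error"


# Allowed transitions; Error is assumed reachable from every non-final state.
ALLOWED = {
    RunState.READY: {RunState.SCHEDULED, RunState.ERROR},
    RunState.SCHEDULED: {RunState.IN_PROGRESS, RunState.ERROR},
    RunState.IN_PROGRESS: {RunState.COMPLETED, RunState.ERROR},
    RunState.COMPLETED: set(),
    RunState.ERROR: {RunState.READY},  # recovery leads only back to Ready
}


class InvalidTransition(Exception):
    """Raised when a requested state change violates the transition rules."""


class AnalysisRunState:
    """Tracks the state of a single analysis run (hypothetical helper class)."""

    def __init__(self):
        self.state = RunState.READY  # runs are created in the Ready state

    def transition(self, target):
        if target not in ALLOWED[self.state]:
            raise InvalidTransition(
                f"{self.state.value} -> {target.value} is not allowed"
            )
        self.state = target


# A run that fails while in progress and is then reset for re-execution.
run = AnalysisRunState()
for step in (RunState.SCHEDULED, RunState.IN_PROGRESS, RunState.ERROR, RunState.READY):
    run.transition(step)
```

Keeping the rules in a single table makes illegal moves, such as jumping from Ready straight to Completed, fail loudly instead of silently corrupting the tracked state.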
Configuration can be applied at three levels of granularity within Butler: Workflow, Analysis, and Analysis Run. Each higher-level configuration may override and augment the configurations supplied at lower levels. At runtime, all three levels of configuration are resolved into an “effective configuration”, which is then applied for execution.
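One way to picture the resolution step is as a layered dictionary merge. The sketch below assumes that the levels are applied in the order listed (Workflow, then Analysis, then Analysis Run), that a later level wins on conflicting keys, and that nested settings are merged recursively; the caption does not spell out these details, and the keys and values shown are invented for illustration.

```python
from typing import Any, Dict, Mapping


def deep_merge(base: Mapping[str, Any], override: Mapping[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which `override` augments and overrides `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, Mapping) and isinstance(merged.get(key), Mapping):
            # Both sides are nested mappings: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Otherwise the overriding level replaces the value outright.
            merged[key] = value
    return merged


def effective_configuration(workflow_cfg, analysis_cfg, run_cfg) -> Dict[str, Any]:
    """Resolve three configuration levels into one (assumed precedence order)."""
    result: Dict[str, Any] = {}
    for level in (workflow_cfg, analysis_cfg, run_cfg):
        result = deep_merge(result, level)
    return result


# Hypothetical settings at each level.
workflow_cfg = {"caller": {"threads": 1, "min_mapq": 20}}
analysis_cfg = {"caller": {"threads": 4}, "reference": "example-reference.fa"}
run_cfg = {"sample_id": "sample-A"}

effective = effective_configuration(workflow_cfg, analysis_cfg, run_cfg)
```

Here the analysis-level thread count replaces the workflow default, while the workflow-level min_mapq value and the run-level sample_id are both carried into the effective configuration.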
Supplementary Figure 7 Butler compute cluster performance metrics during germline deletion genotyping for PCAWG.
(a) Overall load per VM in the Butler cluster, showing no load before analysis kick-off, steady load throughout the analysis, and a drop-off at the end as VMs start running out of work. (b) The CPU profile shows highly variable CPU utilization, which is typical of Delly executions. (c) The memory profile is stable and similar across all VMs running the analysis. Similar measurements were observed for the other 5 analyses performed with Butler during PCAWG, although the exact pattern of CPU and memory utilization depends on the algorithms that make up the workflow being executed.
SQL database health can be ascertained from logs harvested on the database server. (a) 75th, 99th, and 99.5th percentiles of query response times. (b) Query counts by type. (c) Database READ and WRITE counts. (d) Data throughput in and out. These measurements were collected over a single 2-hour run of the software and serve as an example of visualization capabilities, not as an indication of typical database performance.
About this article
Cite this article
Yakneen, S., Waszak, S.M., PCAWG Technical Working Group. et al. Butler enables rapid cloud-based analysis of thousands of human genomes. Nat Biotechnol 38, 288–292 (2020). https://doi.org/10.1038/s41587-019-0360-3