Main

Cloud computing offers easy and economical access to computational capacity at a scale previously available only to the largest research institutions. To take advantage of this capacity, researchers increasingly analyze large biological datasets on public, private and hybrid clouds1 with the aid of workflow systems. When employed in global projects, such systems must be flexible enough to operate in different environments, including academic clouds, so that researchers can bring their computational pipelines to the data, especially when the raw data themselves cannot be moved. The recently developed cloud-based scientific workflow frameworks Nextflow2, Toil3 and GenomeVIP4 largely target individual commercial cloud computing environments, mostly Amazon Web Services, and lack complete functionality for other major providers. This limits their use in studies that require multi-cloud operation for practical and regulatory reasons5,6. Butler, in contrast, provides full support for operation on OpenStack-based commercial and academic clouds, Amazon Web Services, Microsoft Azure and Google Cloud Platform, and can thus enable international collaborations that analyze hundreds of thousands of samples via distributed cloud-based computation across jurisdictions5,6,7.

A key lesson from large-scale projects, including the PCAWG project7, a study of 2,658 cancer genomes sequenced by the International Cancer Genome Consortium and The Cancer Genome Atlas, is that analysis of biological data of heterogeneous quality, generated at multiple locations under varying standard operating procedures, frequently suffers from artifacts that cause many computational job failures and can considerably limit a project's progress. Sequencing library artifacts, sample contamination and nonuniform sequencing coverage8 can produce data and software anomalies that challenge current workflows. Delays in recognizing and resolving these failures can markedly reduce the data processing rate and increase project duration and costs. In contrast to previous tools, Butler provides an operational management toolkit that quickly discovers and resolves both expected and unexpected failures (Fig. 1a,b and Supplementary Note 1).

Fig. 1: Butler framework architecture.
figure 1

a, The framework consists of several interconnected components, each running on a separate virtual machine (VM). See Methods and Supplementary Note 1 for details. b, Metrics flow from all VMs into a time series database. The self-healing agent detects anomalies and takes appropriate action. See Supplementary Note 1 for details. Solid arrows indicate information flow; dashed arrows indicate metrics flow; dashed-and-dotted arrows indicate configuration instructions.

The toolkit functions at two levels of granularity: host level and application level. Host-level operational management is facilitated via a health metrics system that collects system measurements at regular intervals from all deployed virtual machines (VMs). These metrics are aggregated and stored in a time-series database within Butler's monitoring server. A set of graphical dashboards reports system health to users while supporting advanced querying capabilities for in-depth troubleshooting (Supplementary Fig. 8). Application-level monitoring is facilitated via systematic log collection (Supplementary Fig. 4) and extraction, wherein the logs are stored in a queryable search index9. These tools provide multidimensional visibility into operational bottlenecks and error conditions as they occur, aggregated across hundreds of VMs. On top of these data, a rule-based anomaly detection engine defines normal operating conditions that, when breached, trigger handling routines that notify the user via e-mail, Slack or Telegram messages and automatically restart offending workflows, underlying services or entire VMs, allowing the cluster to self-heal (Fig. 1b).
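As a minimal illustration of this rule-and-handler pattern (the rule name, threshold and handler below are hypothetical, not Butler's actual configuration), an anomaly rule can be expressed as a breach test over recent metric samples with an attached corrective action:

```python
# Minimal sketch of a rule-based anomaly check with a self-healing handler.
# Rule names, thresholds and actions are hypothetical, not Butler's actual rules.
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Rule:
    name: str
    is_breached: Callable[[Sequence[float]], bool]  # test over recent metric samples
    handler: Callable[[str], None]                  # corrective or notification action


def restart_scheduler(host: str) -> None:
    # Placeholder: a real handler would restart the offending workflow, service
    # or VM, and/or send an e-mail, Slack or Telegram notification.
    print(f"[self-heal] restarting workflow scheduler on {host}")


rules = [
    # Example rule: flag a host whose recent load samples all exceed 16.
    Rule("high_load", lambda samples: min(samples) > 16.0, restart_scheduler),
]


def evaluate(host: str, recent_metrics: Sequence[float]) -> None:
    for rule in rules:
        if rule.is_breached(recent_metrics):
            rule.handler(host)


evaluate("worker-042", [17.2, 18.1, 16.9])  # breaches "high_load" and self-heals
```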

These monitoring and operational management capabilities set Butler apart from current scientific workflow frameworks2,3,4,10 (Supplementary Table 1), which do not contain anomaly detection modules and are therefore unable to automatically resolve key issues that frequently occur during large-scale analyses. For example, Butler's operational modules can identify and resolve failures of the cloud workflow scheduler, workflows that run perpetually and never finish (indicative of underlying problems), and crashed or unresponsive VMs that, in practice, may prevent workflows from setting a failed status, which in other workflow systems would block the triggering of error-handling logic.

These capabilities enable highly efficient data processing in studies, such as PCAWG, in which analyses are run by multiple groups at different times and on different clouds. Butler can invoke a variety of analysis algorithms, including genome alignment, variant calling and execution of R scripts. These can either be preinstalled or run as Docker11 images or Common Workflow Language (CWL)12 tools and workflows. Butler's workflows accept parameters via JavaScript Object Notation (JSON) configuration files, which are stored in a database to maintain reproducibility. Workflow tasks scheduled for execution are deposited into a distributed task queue from which available worker nodes pick them up, allowing analyses to be distributed over thousands of computing nodes. Note that for some small-scale projects executed over relatively short timelines, the added complexity of setting up and running these monitoring systems may render Butler less practicable than simpler workflow systems.
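A minimal sketch of such a JSON parameter file, with illustrative keys rather than Butler's actual configuration schema, might look as follows:

```python
# Illustrative JSON parameter file for a Butler workflow run; the keys shown
# here are hypothetical, not Butler's actual configuration schema.
import json

config = {
    "workflow_name": "freebayes-germline",
    "reference": "GRCh37.fa",                    # hypothetical reference name
    "sample_location": "/shared/pcawg/samples",  # hypothetical shared-storage path
    "freebayes": {"min-alternate-fraction": 0.05},
}

# Configurations like this are persisted (in Butler's case, in a database)
# so that every run remains reproducible.
with open("analysis_config.json", "w") as fh:
    json.dump(config, fh, indent=2)
```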

We assessed Butler’s ability to facilitate large-scale analyses of patient genomes in the context of the PCAWG study, where Butler was deployed on 1,500 CPU cores, 5.5 terabytes of random access memory (RAM), 1 petabyte of shared storage and 40 terabytes of local solid-state drive storage. Using Butler, we implemented and successfully tested a genomic alignment workflow using BWA13, germline variant calling workflows based on FreeBayes14 (Supplementary Fig. 5) and Delly15, as well as several tools for somatic mutation calling, including Pindel16 and BRASS17. We carried out whole-genome variant discovery and joint genotyping of 90 million germline genetic variants (single nucleotide polymorphisms (SNPs), indels and structural variants) across a 725-terabyte dataset comprising the full PCAWG cohort (including samples that were later blacklisted) of 2,834 cancer patients7. Additionally, we performed sequence alignment and called both germline and somatic variants on 232 high-coverage prostate cancer tumor–normal sample pairs in the context of the PanProstate Cancer Group (PPCG) Consortium. We executed and successfully completed over 2.5 million computational jobs using 546,552 CPU hours. The management overhead of employing Butler for these analyses was less than 2% of the overall computational cost.

To assess Butler's performance in the field relative to other large-scale workflow systems, we compared the observed historical performance of Butler, recorded during PCAWG, against that of the 'core' somatic PCAWG consortium pipelines (Fig. 2), which represent the current state of the art in cloud software7 on the basis of recency of development, scale of deployment, dataset size and analysis duration; Butler itself achieves nearly complete feature parity with several available cloud-based scientific workflow frameworks2,3,4,10 (Supplementary Table 1). These PCAWG pipelines used the same information technology infrastructure and computed over the same samples, but did not use Butler. Our metric for the highest achievable processing rate of an analysis, which we refer to as the 'target processing rate', is defined as the smallest proportion of overall analysis time required to process 5% of all samples. It is measured from the difference between each sample's calendar completion date and time and the analysis start date, and thus accounts for time spent on failed and repeated runs and on cluster downtime, which are major contributors to analysis duration. To establish how well a pipeline performs relative to its potential, we calculated the ratio of the actual processing rate to the target processing rate (Fig. 2a,b). Butler-operated pipelines came markedly closer to the target processing rate (mean actual/target rate ratio 0.696) than the core PCAWG pipelines (mean actual/target rate ratio 0.490) (Fig. 2c). Consequently, Butler-based analyses ran for 1.43 times the ideal target duration, whereas core PCAWG pipelines ran for 2.04 times the ideal target duration, that is, 43% longer. Additionally, core PCAWG pipelines exhibited a highly nonuniform processing rate (Fig. 2d), deviating 23.1% on average (minimum 0.0%, maximum 57.8%, s.d. 15.0%) from the ideally uniform trajectory of processing 1% of samples in 1% of analysis time, whereas Butler-based pipelines (Fig. 2e) progressed substantially more uniformly, deviating only 4.0% on average (minimum 0.0%, maximum 15.6%, s.d. 3.7%) over the same sample set (Methods). These time savings and controlled execution led to the adoption of Butler for genomics-oriented analyses in the European Open Science Cloud (EOSC) Pilot (http://eoscpilot.eu) and its further adoption within the PPCG (http://melbournebioinformatics.org.au/project/ppgc).

Fig. 2: Butler performance comparison.
figure 2

a,b, Ratio of actual to target progress rates for core PCAWG pipelines (a) versus Butler pipelines (b). See Methods for details. c, Mean actual/target progress rate ratio across pipelines for core PCAWG (mean 0.49) versus Butler (mean 0.70) pipelines, each of which was run once over the entirety of the PCAWG samples available to us. d,e, Progress rate uniformity of core PCAWG pipelines (d) versus Butler pipelines (e). See Methods for details. In all panels, samples are arranged by completion date. Runtime includes time spent on failed attempts. Comparison between Butler and the core pipelines was possible in the context of the PCAWG project; a similar comparison between Butler and other frameworks is presently impractical at this scale owing to the high costs and complexity involved.

Butler can be applied generally to any large-scale analysis and could, for example, readily extend to studies such as GTEx (http://gtexportal.org), ENCODE (http://encodeproject.org) and the Human Cell Atlas Project (http://humancellatlas.org). A standard Butler workflow generically parallelizes R script execution across thousands of VMs, facilitating its use in other research contexts and with other data types (for example, single-cell 'omics' data and microbiomes).

We have developed Butler to meet the challenges of working with diverse cloud computing environments in the context of large-scale scientific data analyses. The operational management tools provided with Butler address a key determinant of analysis duration, the ability to autonomously detect, diagnose and address issues in a timely manner, thus allowing researchers to spend less time handling error conditions and considerably reducing analysis duration and cost. The comprehensive nature of the Butler toolkit sets it apart from current scientific workflow managers2,3,4,10 (Supplementary Table 1), offering an efficient and scalable solution for modern global cloud-based big data analyses.

Methods

The Butler system

Overall, the Butler system is composed of four distinct subsystems. The first, Cluster Lifecycle Management, handles creating and tearing down clusters on various clouds, including defining VMs, storage devices, network topology and network security rules. The second, Cluster Configuration Management, handles configuration and software installation for all VMs in the cluster. The third, the Workflow System, allows users to define and run scientific workflows on the cloud. Finally, the Operational Management subsystem provides tools for ensuring continuous successful operation of the cluster and for troubleshooting error conditions. Supplementary Note 1 contains an in-depth description of each subsystem and how it works within Butler, while the Installation Guide (http://butler.readthedocs.io/en/latest/installation.html) provides detailed setup instructions.

Butler deployment

Butler has been validated for production use on the EMBL-EBI Embassy Cloud (http://www.embassycloud.org), an academic cloud computing center that runs an OpenStack-based environment (Fig. 1). The Embassy Cloud has played a key role in the PCAWG project by donating substantial storage and cloud computing capacity over the course of 3 years. The total amount of resources dedicated to the project by the Embassy Cloud was as follows:

  • 1 PB Isilon storage shared over NFS

  • 1,500 computational cores

  • 5.5 TB RAM

  • 40 TB local solid-state drive storage

  • 10-gigabit network

These resources have been used to host one of the six PCAWG data repositories that exist worldwide, as well as to perform scientific analyses for the project. We have used Butler extensively on the Embassy Cloud to carry out the analyses of the PCAWG Germline Working Group. To deploy Butler on the 1,500-core cluster, we set up five different VM profiles, each playing several different roles (Supplementary Table 2).

Each profile was defined separately via Terraform and uses Saltstack roles for configuration. Users can check out the Butler GitHub repository to their local machine and, once Terraform is installed locally, drive the entire provisioning process from that machine via Terraform.

The cluster is bootstrapped via the Salt-master VM. This VM is started first whenever the cluster needs to be recreated from scratch. The monitoring-server role is responsible for installing and configuring InfluxDB and other monitoring components, as well as registering them with Consul so that metric recording can begin. We also attach a 1-TB block storage volume for the metrics database so that it can survive cluster crashes and teardowns. If the monitoring server needs to be recreated, the block storage volume simply needs to be reattached to the new monitoring server VM.

The tracker VM is responsible for running various Airflow components, such as the Scheduler, Webserver and Flower. Additionally, we deploy the Butler tracker module to this VM, and thus the tracker VM acts as the main control point of the system from which analyses are launched and monitored. This VM additionally has the Elasticsearch role that designates it as the location of the Logstash and Elasticsearch components. To persist the search index, we attach an additional 1-TB block storage volume.

The job queue VM is responsible for hosting the RabbitMQ server, which holds all of the in-flight workflow tasks. Because the resources of the job queue are heavily taxed by communication with all of the worker VMs in the cluster, we do not assign any additional roles to this host.

The db-server is responsible for hosting most of the databases used by Butler. This VM runs an instance of PostgreSQL Server and hosts the Run Tracking DB, Airflow DB and Sample Tracking DB. The 1-TB block storage volume serves as the backing storage mechanism.

The worker VMs are the workhorses of the Butler cluster. For analyses by the PCAWG Germline Working Group, we employed 175 eight-core worker machines dedicated to running Butler workflows. The worker role ensures that Airflow client modules are installed and loaded on each worker. The germline role also loads the workflows and analyses that are relevant to the PCAWG Germline Working Group.

Because the Butler framework covers far more scope than a traditional workflow framework (provisioning, configuration management, operational management, anomaly detection and so on), its setup and deployment are more complex: multiple VMs must be set up and configured to interact with each other in a secure environment fit for handling sensitive information. Although Butler features comprehensive documentation (http://butler.readthedocs.io), usage examples, and automated deployment and configuration scripts, prospective users should ideally have a working understanding of cloud computing, server administration, networking, security and other DevOps concepts to make full use of the system. And while smaller-scale projects may benefit less from Butler's feature set owing to its complexity and learning curve, this feature set is essential to the success of current and future generations of large-scale bioinformatics computing on the cloud.

PCAWG germline analyses

To assess Butler’s performance on real data, we carried out several large-scale data analyses using Butler on the Embassy Cloud and over the entirety of the 725 TB of raw PCAWG data, including the following:

  • discovery of germline single nucleotide variants (SNVs) and small indels in normal genomes.

  • genotyping of common SNVs occurring at minor allele frequency (MAF) >1% in the 1000 Genomes Project18.

  • genotyping of germline SNVs and small indels in tumor and normal genomes (Supplementary Fig. 6).

  • discovery and genotyping of structural variant deletions in tumor and normal genomes (Supplementary Fig. 7).

  • discovery and genotyping of structural variant duplications in tumor and normal genomes (Supplementary Fig. 7).

Overall, most Butler analysis workflows follow a similar structure (Supplementary Fig. 1): an analysis run is started, access to the sample is validated, the analysis steps are carried out (possibly with branching) and the analysis run is completed. Because workflows largely share this structure, a large degree of code reuse is possible; most of the methods reside in the workflow_common submodule of the Analysis Tracker and are invoked by each workflow.
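The sketch below illustrates this common shape in Python; the helper functions mirror the roles played by methods in workflow_common but are hypothetical stand-ins rather than Butler's actual API:

```python
# Minimal sketch of the common Butler workflow shape (Supplementary Fig. 1).
# The helpers below are hypothetical stand-ins for workflow_common methods.
import uuid


def start_analysis_run(sample_id: str) -> str:
    run_id = str(uuid.uuid4())
    print(f"run {run_id}: started for sample {sample_id}")  # would write to the tracking DB
    return run_id


def validate_sample(sample_id: str) -> None:
    print(f"validating access to {sample_id}")  # e.g., check file presence and permissions


def complete_analysis_run(run_id: str) -> None:
    print(f"run {run_id}: completed")  # would update run status in the tracking DB


def run_analysis(sample_id: str) -> None:
    run_id = start_analysis_run(sample_id)
    validate_sample(sample_id)
    print("carrying out analysis steps (possibly with branching)")
    complete_analysis_run(run_id)


run_analysis("sample-0001")
```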

Common variant genotyping was performed across the PCAWG cohort using a site list of 12 million variants occurring with at least 1% minor allele frequency in the 1000 Genomes Project18 phase 3 cohort, interrogating 34 billion sites overall. This analysis used 130,152 computing hours to complete 70,850 workflow tasks, with an additional 2,688 CPU hours used for cluster management; management overhead thus accounted for 2% of the overall computational resource costs. Using 1,000 cores, the analysis took less than 6 d to complete. Supplementary Fig. 2 shows the distribution of job runtimes by chromosome (runtimes correlate strongly with chromosome length, r = 0.92). Using a site list of 60 million variants obtained from the FreeBayes variant discovery analysis, we ran the Butler FreeBayes workflow in genotyping mode to calculate genotypes at 170 billion genomic positions. This analysis completed 76,518 workflow tasks using 302,071 CPU hours (10 d wall time), of which 5,040 CPU hours were cluster management overhead, accounting for 1.6% of total resource utilization.

For structural variants, 244,889 deletions were evaluated across 5,668 samples (tumor and normal), for a total of 1,388,030,852 genomic sites genotyped. Overall wall time was 13 d, using 265,200 CPU hours, of which 6,240 CPU hours went to cluster management, an overhead of 2.2%. In addition, 217,433 duplications were genotyped across the same 5,668 samples, for a total of 1,232,410,244 genomic variants genotyped. The wall time for this analysis was only 4.5 d, using 151,200 CPU hours, with a management overhead of 2,160 h, or 1.4% of the total. This comparatively low cluster management overhead was achieved by scaling the cluster up to 1,400 cores without requiring additional management resources. Supplementary Fig. 3 shows the distribution of workflow run durations.

In total, we carried out several analyses on a 725-TB dataset of genomic samples from 2,834 cancer patients, consuming 546,552 CPU hours. Each analysis took no longer than 2 weeks to complete and used only 1.5–2.2% of the overall computing capacity for management overhead. On several occasions we were able to detect large-scale cluster instability and program crashes using the operational management system and take corrective action with minimal impact on overall productivity.

Comparing Butler with the core PCAWG somatic pipelines

We evaluated the relative effectiveness of Butler-based pipelines in comparison to a set of pipelines operating under similar conditions and over the same dataset, namely the 'core' PCAWG somatic pipelines used to accomplish genome alignment and somatic variant calling for the PCAWG Technical Working Group7. The core PCAWG pipeline set consists of five pipelines (BWA, Sanger, Broad, DKFZ/EMBL and OxoG detection) run over the course of 18 months over all samples in PCAWG. The Butler-based pipeline set consists of two pipelines, FreeBayes and Delly, used to accomplish four analyses (germline SNV discovery, germline SNV genotyping, germline structural variant deletion genotyping and germline structural variant duplication genotyping), also run over all samples in PCAWG (725 TB in total). We assessed and compared pipeline performance with respect to an estimated optimal performance (based on available hardware), as well as with respect to the uniformity of analysis progress over time.

For core PCAWG pipelines, we used the date of data upload to the official data repository as the most reliable sample completion date. However, approximately 25% of the DKFZ/EMBL pipeline results were uploaded in two batches on two separate days, and thus do not accurately represent the real analysis progress rate. For this reason, we excluded this pipeline from the optimal performance analysis. Butler sample completion dates are based on timestamps collected in Butler’s analysis tracking database.

Our assessment of pipeline performance is based on establishing an 'optimal' progress rate for a pipeline given a hardware allocation. We divided the sample set into 20 bins based on completion time (each bin comprising 5% of all samples) and defined the optimal progress rate for each pipeline as the smallest proportion of overall analysis time required to process all samples of a bin (scaled to a per-1% rate):

$$r_{\mathrm{opt}} = \min_{b \,\in\, \mathrm{bins}} \left\{ \frac{\mathrm{duration}_b}{\mathrm{duration}_{\mathrm{total}}} \Big/ 5 \right\}$$

We observed that the mean $r_{\mathrm{opt}}$ was markedly higher for Butler-based pipelines, at 0.46, than for the core PCAWG pipelines, at 0.13 (Supplementary Table 3). For each pipeline and each 1% of the samples under analysis, we then computed an effectiveness metric $e$, defined as the proportion of $r_{\mathrm{opt}}$ actually achieved:

$$e = \frac{r_{\mathrm{act}}}{r_{\mathrm{opt}}}$$

Comparing the core PCAWG and Butler pipelines with respect to $e$ (Fig. 2a–c), we observed that effectiveness was on average lower for PCAWG pipelines ($\mu_{e_{\mathrm{PCAWG}}} = 0.49$) than for Butler pipelines ($\mu_{e_{\mathrm{Butler}}} = 0.70$). Assessing the expected analysis duration for the two sets of pipelines, we observed

$$d_{\mathrm{PCAWG}} = \frac{d_{\mathrm{opt}}}{\mu_{e_{\mathrm{PCAWG}}}} = 2.04\,d_{\mathrm{opt}}$$
$$d_{\mathrm{Butler}} = \frac{d_{\mathrm{opt}}}{\mu_{e_{\mathrm{Butler}}}} = 1.43\,d_{\mathrm{opt}}$$
$$d_{\mathrm{PCAWG}} = 1.43\,d_{\mathrm{Butler}}$$

Thus, the estimated duration for PCAWG pipelines was 43% longer than that for Butler-based pipelines.
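A sketch of this computation from per-sample completion times follows (synthetic data; we assume that expressing the per-bin durations as rates inverts the actual/target ratio, so that $e \le 1$ as in Fig. 2):

```python
# Sketch of the r_opt and e computation from per-sample completion times.
# Synthetic data; assumes that expressing per-bin durations as rates inverts
# the actual/target ratio, so that e <= 1 as in Fig. 2.
import numpy as np

rng = np.random.default_rng(0)
completion = np.sort(rng.uniform(0, 1, 2000))  # completion times scaled to [0, 1]

bins = np.array_split(completion, 20)          # 20 bins, 5% of samples each
ends = np.array([b[-1] for b in bins])
starts = np.concatenate(([0.0], ends[:-1]))
durations = (ends - starts) / 5.0              # time fraction per 1% of samples

r_opt = durations.min()                        # fastest observed per-1% duration
e = r_opt / durations                          # effectiveness per bin (<= 1)
print(f"r_opt = {r_opt:.4f}, mean e = {e.mean():.2f}")
```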

We further compared core PCAWG pipelines with Butler pipelines on the basis of uniformity of rate of progress through an analysis. Given a constant resource allocation, an ideal analysis execution processes 1% of all samples in 1% of the analysis runtime. We divided the sample set into 100 equal-size bins and measured the percentage of overall analysis time spent processing each bin (Fig. 2d,e). Deviations from the diagonal indicate inefficiencies in data processing. Measuring this deviation, we observed that PCAWG pipelines deviated 23.1% from the diagonal on average (minimum 0.0%, maximum 57.8%, s.d. 15.0%) while Butler pipelines over the same sample set only deviated 4.0% (minimum 0.0%, maximum 15.6%, s.d. 3.7%) from the diagonal on average. This indicates that Butler pipelines are considerably less affected by various causes that slow an analysis (for example, job and infrastructure failures).
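The deviation measure can be sketched as follows (synthetic completion times; the exact deviation measure used here may differ in detail from the paper's):

```python
# Sketch of the progress-uniformity measure: deviation of the cumulative
# progress curve from the ideal diagonal (1% of samples per 1% of time).
# Synthetic data; the paper's exact deviation measure may differ in detail.
import numpy as np

rng = np.random.default_rng(0)
completion = np.sort(rng.uniform(0, 1, 2000))         # completion times in [0, 1]

bins = np.array_split(completion, 100)                # 100 bins, 1% of samples each
percent_time = np.array([b[-1] for b in bins]) * 100  # % of runtime elapsed per bin
percent_samples = np.arange(1, 101)                   # the ideal diagonal

deviation = np.abs(percent_time - percent_samples)
print(f"mean deviation {deviation.mean():.1f}% "
      f"(min {deviation.min():.1f}%, max {deviation.max():.1f}%)")
```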

Adapting Butler to new projects and domains

Butler is a highly general workflow framework built on generic open-source components: in principle, it can work with any data in any scientific domain, deploy onto more than 20 cloud types and run on any operating system, and it includes a rich set of tools for installing and configuring software. Adapting Butler to a new application is straightforward, as described below.

Butler ships with a prebuilt library of workflows that focus on handling genomic data and can support a large variety of studies based on next-generation sequencing, such as variant discovery, common and rare variant association studies, cancer genome analysis and expression quantitative trait locus (eQTL) mapping. Using one of these workflows is simply a matter of providing configuration values in JSON format for the underlying tools (such as FreeBayes, Delly, samtools19 or bcftools). Notably, Butler also supplies a generic workflow that executes arbitrary R scripts across the entire Butler cluster. Given the wide cross-community usage of R, this functionality can facilitate a broad range of studies across disciplines, communities and analysis types.
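As an illustration of the generic R workflow idea (the paths and file names below are hypothetical; in the real system each invocation becomes a task on Butler's distributed queue rather than a local loop):

```python
# Illustration of the generic "run an R script per sample" idea. Paths and
# file names are hypothetical; in the real system each invocation becomes a
# task on Butler's distributed queue rather than a local loop.
import subprocess

samples = ["sample_001.vcf", "sample_002.vcf"]
for sample in samples:
    subprocess.run(["Rscript", "analysis.R", sample], check=True)
```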

If the prebuilt workflows do not meet a user's requirements as-is, they can be customized, or entirely new workflows can be written. Each Butler workflow is a Python program, typically only 100–200 lines of code. There are three principal avenues for developing new workflows, suited to a wide variety of needs.

The easiest avenue involves adapting tools that are already available as Docker images. Butler has prebuilt configurations for setting up all the infrastructure necessary to run Docker containers, so the user only needs to wrap the Docker command line in existing boilerplate code that sets up access to the data to be analyzed. Once appropriate configuration parameters are supplied, Butler can run the workflow seamlessly.
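A sketch of such a wrapper is shown below; the image name, mount points and arguments are illustrative rather than taken from an actual Butler workflow:

```python
# Sketch of wrapping a Docker-packaged tool as a workflow task. The image
# name, mount points and arguments are illustrative, not from a Butler workflow.
import subprocess


def run_docker_task(sample_path: str) -> None:
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{sample_path}:/data:ro",   # expose the sample read-only
            "-v", "/shared/results:/results",  # hypothetical output location
            "example/variant-caller:1.0",      # hypothetical image
            "--input", "/data", "--output", "/results",
        ],
        check=True,
    )
```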

Only slightly more sophisticated is the setup of workflows described in CWL. Butler has built-in functionality for installing and configuring cwl-runner, the reference implementation of CWL. To set up a new CWL-based workflow within Butler, users prepare an appropriate JSON parameter file according to the CWL definition, using Butler's configuration functionality. The genome alignment and somatic variant calling workflows that accompany the Butler framework already provide full functionality in this regard and can serve as examples for new users. Because a number of workflows from various scientific fields have already been described in CWL, this approach opens a relatively straightforward avenue for adopting Butler in a wide variety of additional studies.
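A minimal sketch of this pattern, with an illustrative tool description and parameter names, follows:

```python
# Sketch of preparing a CWL job file and invoking cwl-runner, the CWL
# reference implementation. The tool description (alignment.cwl) and the
# parameter names are illustrative.
import json
import subprocess

job = {
    "reference": {"class": "File", "path": "GRCh37.fa"},  # hypothetical inputs
    "reads": {"class": "File", "path": "sample.bam"},
}
with open("job.json", "w") as fh:
    json.dump(job, fh)

subprocess.run(["cwl-runner", "alignment.cwl", "job.json"], check=True)
```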

Potentially the most complex, but also the most powerful, approach is to write workflows using the native constructs of the underlying Apache Airflow workflow framework. This provides users with the full power of the Python language and standard library, as well as prebuilt Airflow components for interacting with a wide variety of distributed systems and engines, such as HDFS, Apache Spark, Apache Cassandra, databases such as PostgreSQL and SQLite, email engines and many more. Several of the prebuilt Butler workflows, such as the FreeBayes, Delly and R workflows, use this approach, and users can employ them as templates for new workflows built in this style.
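The sketch below shows the general shape of such a workflow using the Airflow 1.x API; the DAG and task logic are placeholders, not one of Butler's shipped workflows:

```python
# Minimal sketch of a workflow written with native Airflow constructs
# (Airflow 1.x API); the DAG and task logic are placeholders, not one of
# Butler's shipped workflows.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator


def call_variants(**context):
    print("invoking a variant caller for run", context["run_id"])


dag = DAG(
    dag_id="example_variant_calling",
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,  # triggered per sample rather than on a schedule
)

call_variants_task = PythonOperator(
    task_id="call_variants",
    python_callable=call_variants,
    provide_context=True,  # Airflow 1.x: pass the execution context as kwargs
    dag=dag,
)
```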

Because of the wide variety of workflow authoring and customization styles available, the existing examples, and the generic nature of the underlying open-source components, applying Butler to new projects and analysis domains can be accomplished with minimal effort and at a complexity level matched to the requirements of the project. Individual workflow steps can be debugged and tested on a local machine, without deployment to any cloud, using Python's extensive testing and debugging functionality. The typical life cycle for developing a new workflow is a few hours to a few days, and usually much shorter than a week.

Because new projects frequently require the installation and configuration of new software packages, Butler integrates a full-featured configuration management solution, Saltstack, which is used to set up and configure Butler internals as well as any additional software the user requires. Recipes for configuring dozens of software packages are included with the Butler system, and hundreds more are available as community contributions to the Saltstack project. Arbitrary new configurations can be defined to meet custom requirements; to support this, the user would typically set up a new GitHub repository that acts as a customization layer on top of the core Butler configurations. Within this custom repository, users can define new configuration recipes or override pre-existing Butler settings according to the needs of their scientific project. We provide several examples of such repositories under 'Code availability' to help users become familiar with Butler.

Statistics

No formal sample size and power calculations were performed as we made use of all 5,668 of the samples available to us via the PCAWG consortium. The analyses in Fig. 2, performed over the entirety of PCAWG samples available to us, were run once (rather than multiple times) owing to the multi-year nature and high costs of the PCAWG project.

Ethical compliance

The authors have complied with all relevant ethical regulations with regard to the subjects described in this manuscript.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.