Design considerations for workflow management systems use in production genomics research and the clinic

The changing landscape of genomics research and clinical practice has created a need for computational pipelines capable of efficiently orchestrating complex analysis stages while handling large volumes of data across heterogeneous computational environments. Workflow Management Systems (WfMSs) are the software components employed to fill this gap. This work provides an approach to, and systematic evaluation of, key features of popular bioinformatics WfMSs in use today: Nextflow, CWL, and WDL, together with some of their executors, along with Swift/T, a workflow manager commonly used in large-scale physics applications. We employed two use cases: a variant-calling genomic pipeline and a scalability-testing framework, both run locally, on an HPC cluster, and in the cloud. This allowed us to evaluate these four WfMSs in terms of language expressiveness, modularity, scalability, robustness, reproducibility, interoperability, and ease of development, along with adoption and usage in research labs and healthcare settings. This article seeks to answer the question: which WfMS should be chosen for a given bioinformatics application, regardless of analysis type? The choice of a given WfMS is a function of both its intrinsic language and engine features. Within bioinformatics, where analysts are a mix of dry- and wet-lab scientists, the choice is also governed by collaborations and adoption within large consortia, and by the technical support provided by the WfMS team and community. As the community and its needs continue to evolve along with computational infrastructure, WfMSs will also evolve, especially those with permissive licenses that allow commercial use. In much the same way that the dataflow paradigm and containerization are now well understood to be very useful in bioinformatics applications, we will continue to see innovations in tools and utilities for other purposes, like big data technologies, interoperability, and provenance.


Design considerations in building the variant calling pipeline
Design considerations for building a Swift/T-defined Variant Calling pipeline are detailed elsewhere [2], so in this section we focus only on modularity, using an architecture that allows consistent evaluation of Swift/T alongside the other three WfMSs: Nextflow, CWL, and WDL.
In this context, modularity means the ability to construct a complete workflow from a set of smaller, independent processes, apps, CommandLineTools, or tasks/subworkflows, as per the semantics of each WfMS (see section Methods: Nomenclature).
To meet the modularity constraint, source code is arranged as per Fig Supplementary 2a into folders corresponding to each WfMS language, within which there are folders for calling tasks, unit-testing those tasks, and defining the logic of workflows composed of these tasks (except for Nextflow). The tasks themselves were written as conditional-free, stand-alone bash scripts that provide consistent output definitions and logging functionality regardless of the input specifications and the bioinformatics tools being called (Fig Supplementary 2b). These bash scripts are also free from streaming (i.e., piping) between processes, for more robustness and easier identification of the source of failure when debugging is needed. This further allows seamless switching between Sentieon-based [3] and GATK-based [4,5] tools (or others) while using the same WfMS (and vice versa). To ease working with input JSON files (in the case of WDL- and CWL-defined pipelines), helper parser and validator scripts were written in Python to populate values from an easy-to-construct configuration file into the needed input JSON. This added a layer of abstraction and independence between the processing logic of the workflow (the conditions and loops defining the DAG) and the underlying invocations of the bioinformatics tools. Additionally, it allowed a head-to-head comparison between the three languages (see sections Results: Language expressiveness to Results: Support for modularity).
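As a sketch of how this abstraction layer can be driven (the file and script names below are illustrative placeholders, not the actual repository contents), a flat configuration file is parsed and validated by the Python helper, which then fills in the WfMS-specific input JSON:

# Illustrative only: a flat, easy-to-edit configuration file
cat > run.config <<'EOF'
SampleName=HG001,HG002
Aligner=bwa
OutputDir=/path/to/results
EOF

# Hypothetical helper invocation: validate run.config and populate the WDL/CWL input JSON
python parse_config.py --config run.config --template inputs.template.json > inputs.json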
The four WfMSs considered here (Nextflow, Swift/T, CWL and WDL) all have engines that adopt the dataflow paradigm. This means an inherent and implicit parallelism, where computations run based on data (and resource) availability rather than their location within a script, making these systems well suited to sprouting parallel jobs rather easily (compared with, say, native bash parallelization and other top-down sequential languages). A complete evaluation of these parallelization and run-time features follows in the main results sections (Data dependencies and parallelism to Workflow dependency graph resolution and visualization). Performance aspects are discussed in the subsequent results sections (Executor-level differences to Robustness). Operational aspects, like debugging and cross-compatibility, are examined in the remaining results sections (Debugging workflows to Cross-compatibility and conformance to standards).

Figure Supplementary 1: Analysis stages in a typical Genome Analysis Toolkit (GATK)-based multi-sample Variant Calling pipeline, where each yellow slice is a sample. Gray blocks denote functional equivalence recommendations [1], with re-alignment (after Deduplication) not shown. Red arrows denote parallel stages, and green arrows denote optional stages; the pipeline thus needs a WfMS that supports sequential, parallel (looping), and conditional processing of analysis stages, and also nesting these within an overall loop.

Workflow Invocations
Custom Nextflow invocation
$ nextflow run workflow.nf -c backend_runtime_and_input.conf
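As a rough illustration of what such a combined backend/runtime/input configuration could contain (the settings below are assumptions for illustration, not the actual file used in this work), standard Nextflow configuration syntax lets the executor, default process resources, and pipeline parameters live in a single file:

# Illustrative sketch of backend_runtime_and_input.conf
cat > backend_runtime_and_input.conf <<'EOF'
process.executor  = 'slurm'         // backend: dispatch each process as a Slurm job
process.cpus      = 4               // runtime: default resources per process
process.memory    = '8 GB'
params.SampleName = 'HG001,HG002'   // input: comma-separated sample list consumed by the workflow
EOF

nextflow run workflow.nf -c backend_runtime_and_input.conf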

Coding examples
Nextflow DSL-1 process
// Defining and using a process (in Tasks/alignment.nf):
SampleNamesChannel = Channel.from(params.SampleName.tokenize(',')) // Comma-separated list of strings input
process Alignment {
    input:
    val SampleName from SampleNamesChannel // Implicit parallelism over channel elements

    """
    /bin/bash alignment.sh ... # Script in the 'shell' directory
    """
}

Swift/T leaf function and its usage
// Defining a leaf function (in bioapps/align_dedup.swift):
@dispatch = WORKER
app (<outputs>) bwa_mem (<inputs>) { <command invocation> }
app (<outputs>) samtools_view (...

The workflow used here is the 1-step version of the pipeline used for testing scalability. There is a hostname process that is run twice in parallel, then the unique hostnames are sorted and collated in a file. Testing was done on a local machine, and relevant comments accompany each workflow run. The complete code can be found in our scalability-tst repo here: https://github.com/azzaea/scalability-tst
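To make the structure of this test concrete, the following is a rough bash equivalent of the 1-step workflow just described; it only illustrates the logic (two parallel hostname tasks followed by a sort/collate step) and is not code from the scalability-tst repo:

# Two "hostname" tasks run in parallel (backgrounded), mimicking the parallel process
hostname > host1.txt &
hostname > host2.txt &
wait
# Collate and keep only the unique hostnames, mimicking the final cat/sort step
sort -u host1.txt host2.txt > unique_hostnames.txt
cat unique_hostnames.txt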

Nextflow
Nextflow defaults to creating a work directory wherever it is run. Each process invocation gets its own hash-named subdirectory holding all of its inputs, outputs, intermediates, and logs. There is no need for a dedicated cat process with Nextflow, since it has efficient channel operators for collecting and organizing such outputs.
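For example, any single task's work subdirectory can be inspected directly; the hash path below is illustrative, while the hidden .command.* helper files are the standard per-task artifacts Nextflow writes:

# Inspect one Nextflow task directory (the two-level hash path is illustrative)
ls -a work/a1/b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6/
# Typical contents:
#   .command.sh                    the bash snippet actually executed for this task
#   .command.run                   the wrapper script used by the executor
#   .command.out / .command.err    captured stdout / stderr
#   .exitcode                      the task's exit status
# plus staged input files and any outputs the task produced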
In our code, the output of the workflow is published to a specific directory; its contents are shown in the snippet below.

WDL: Cromwell
Cromwell defaults to creating a cromwell-executions directory wherever it is run. Each workflow gets its own directory there, and different runs appear as different hash-named subfolders within it. Tasks further have their own directories, nested within their parent sub-workflows or scatter pattern, if present. Similar to Nextflow, each process directory hosts all of its inputs, outputs, intermediates, and logs.
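A sketch of the resulting directory tree (the workflow name, run UUIDs, and task names below are illustrative):

# Illustrative cromwell-executions layout after a couple of runs
# cromwell-executions/
# └── myWorkflow/
#     ├── 1a2b3c4d-.../                 one hash-named subfolder per run
#     │   ├── call-alignment/
#     │   │   └── execution/            script, stdout, stderr, rc, plus the task's outputs
#     │   └── call-merge/
#     │       └── execution/
#     └── 5e6f7a8b-.../
tree -L 4 cromwell-executions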
The output of the workflow is sent to a specific directory via the workflow.options.json file. Its contents are shown in the snippet below.
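A minimal sketch of how such an options file is passed on the command line; the output path and file names are illustrative, and final_workflow_outputs_dir is the standard Cromwell workflow option for copying final outputs to a chosen directory:

# Illustrative workflow options file and invocation (file names and paths are examples)
cat > workflow.options.json <<'EOF'
{
  "final_workflow_outputs_dir": "/path/to/final/outputs"
}
EOF

java -jar $crom run host_process.wdl -i inputs.json -o workflow.options.json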

WDL: toil-wdl-runner
toil-wdl-runner defaults to deleting the working directory, and does not understand command-line arguments that are otherwise acceptable to Toil. Hence, below we explicitly generate a Python equivalent of our WDL code and edit it to accept command-line options for specifying a working directory and for not deleting it upon a successful workflow run.
Additionally, Toil does not seem to offer the ability to place output files in a user-specified destination; instead, it puts them in the current directory from which it is run. The hostnames retrieved in this case look unusual, preceded by the letter b and wrapped in apostrophes, most likely the textual representation of undecoded Python bytes objects (b'...').

WDL: miniWDL
miniWDL defaults to creating a timestamped working directory for each workflow run, suffixed with the workflow name. It requires that only inputs the workflow actually uses be present in the input JSON file. Under the hood, for miniWDL to run locally, Docker needs to be installed with proper user permissions, and a parallelized workflow is consequently run in Docker swarm mode. This explains the output in the example below: the hostnames come from this Docker swarm environment, not the local environment.
Similar to Toil, miniWDL does not offer the possibility of placing outputs in a user-defined destination. It does not place outputs in the current directory either, but the execution log points to their location within the execution directory.
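A minimal sketch of such a local run, assuming miniWDL's -i (input JSON) and --dir (run directory location) options; the workflow and input file names are illustrative:

# Illustrative miniWDL invocation; workflow and input file names are examples
miniwdl run host_process.wdl -i inputs.json --dir runs/
# The timestamped run directory is created under runs/; the end of the execution
# log reports where the outputs live inside it.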

CWL: Cromwell
The general notes from the WDL: Cromwell section above apply here, except that the --options directive is not respected by Cromwell when running CWL, and hence there is no ready way to specify the final destination of output files.
Instead, the log gives the complete path to where outputs are stored within the cromwell-executions directory.

Cromwell invocation
$ java -jar $crom --version
cromwell 42
$
$ java -jar $crom run host_process.cwl -i host_process_workflow.yml --type cwl
## log omitted, containing final outputs location within cromwell-executions dir
$
$ cat /home/azza/github_repos/varCall/scalability-tst/src/cwl/cromwell-executions/host_process.cwl/b13c231f-b3aa-4880-b503-afc0edf541e8/call-catsortStep/execution/log.txt
azza-Satellite-P845
$
$ tree cromwell-executions

Cromwell was run with the in-memory database (the default), in run mode. Cromwell uses this database to track the execution of workflows and store outputs. For features like call caching, having a separate MySQL database is necessary. This may have an effect on the observed CPU utilization.
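For completeness, a rough sketch of wiring Cromwell to a persistent MySQL database, following the pattern documented for Cromwell configuration overrides; the host name, credentials, and file names here are placeholders rather than values used in this study:

# Illustrative only: point Cromwell at a MySQL database instead of the in-memory default
cat > cromwell_mysql.conf <<'EOF'
include required(classpath("application"))

database {
  profile = "slick.jdbc.MySQLProfile$"
  db {
    driver = "com.mysql.cj.jdbc.Driver"
    url = "jdbc:mysql://localhost/cromwell?rewriteBatchedStatements=true"
    user = "cromwell_user"
    password = "change_me"
    connectionTimeout = 5000
  }
}
EOF

java -Dconfig.file=cromwell_mysql.conf -jar $crom run host_process.cwl -i host_process_workflow.yml --type cwl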

On Biocluster, Recent WfMS versions
The testing reported below was done in 2021, using the most recent versions of the runners available at the time: Cromwell 63, Toil 5.3.0, and Nextflow 21.04.1.5556. Experiments were performed on the normal queue of Biocluster, composed of 5 Supermicro SYS-2049U-TR4 nodes, each with 72 cores. The cluster is not dedicated, so the data is affected by the queue load at the time.

On Biocluster, Older WfMS versions
The experiments reported below were done in 2019, using the then most recent versions of the runners: Cromwell 47 and Nextflow 19.04.1.5072. They were run on the normal queue, composed of 5 Supermicro SYS-2049U-TR4 nodes, each with 72 cores. The cluster is not dedicated, so the data is affected by the queue load at the time.

Abbreviations
DAG Directed Acyclic Graph.
WDL Workflow Description Language.
WfMS Workflow Management System.