With the advances of high-throughput technologies in biomedicine, the volume of data has drastically increased in the last decade across scientific disciplines1. This influx of data has provided researchers with the ability to discover and utilize data of various types and structural characteristics that aid in carrying out leading-edge research. However, when heterogeneous and multimodal data types are produced in large quantities, the data become much more complex to process, making conventional computational processing methods inadequate and calling for new solutions2,3. These conventional methods encompass the use of scripting languages to process this data lacking (i) resource management capabilities (compute and memory); (ii) ability to aggregate data from multiple sources; (iii) support for modular processing; (iv) ability to handle unstructured data; and (v) ability to transform data to be used with other tools/algorithms4,5. Moreover, pipelines harnessing complicated methods for processing cancer pharmacogenomic data, which is data measuring the way a cancer’s genome affects its response to drug therapy (multiple gene-drug associations), may be difficult to reproduce6,7. These methods include the use of convoluted scripts that deploy multiple genomic tools and statistical methods/algorithms to compute drug response and identify molecular features7,8. Studying the effects of a drug on a single gene (single gene-drug association) or a few genes, is referred to as a pharmacogenetic analysis6. A challenge subsequently arises, as there becomes a plethora of pipelines for pharmacogenomic datasets that utilize different complex methods, which all aim to perform the same goal, but will yield different results9. These limitations hinder scalability and the use of pharmacogenomic data generated by drug screening facilities worldwide, to its full potential. There is therefore a need for the development of more sophisticated computational pipelines to address these issues10.

To address the issues of scalability, reproducibility and standardization with processing and analyzing pharmacogenomic datasets, we created open-source processing pipelines using the Common Workflow Language (CWL), a popular data workflow language in the data science and bioinformatics community11. We leveraged PharmacoGx within our pipelines, an R/Bioconductor package that provides computational approaches to simplify the processing and analysis of such large datasets12. We pushed our CWL pipelines to Code Ocean13, which process two large breast cancer pharmacogenomic datasets14,15,16,17 and create fully documented data objects shared through a persistent, unique digital object identifier (DOI) on Harvard Dataverse18. Our study demonstrates how existing computational tools and platforms can be used to standardize the processing of pharmacogenomic data in a transparent and reproducible way, and how these processing pipelines and resulting datasets can be shared with the scientific community.

Pharmacogenomic Datasets

The first dataset is the Oregon Health and Science University (OHSU) breast cancer screen generated within Dr. Joe Gray’s laboratory (GRAY)14,17,19. The two most recent versions of the GRAY dataset were published in 2013 and 2017, where the latest update collectively includes 91 cell lines and 107 drugs, with 9,756 drug sensitivity experiments for 72 cell lines screened against 107 drugs, after our curations14,17,19. The dataset includes processed SNP (n = 77), exon array (n = 56), U133A expression (n = 51), RNA-seq (n = 54), RPPA (n = 49), and methylation (n = 55) profiles with the use of various technologies and processing methods (Table 1)17,19. Multiple cell lines were added to the GRAY molecular profile data after the 2013 release, but before the update to the drug response data in the 2017 release, resulting in our curation of 91 cell lines for both versions of the dataset14,17,19.

Table 1 Summary of cell line and drug curations, sensitivity experiments, and molecular profile processing for GRAY and UHNBreast datasets.

The second dataset is the University Health Network (UHN) breast cancer screen (UHNBreast) with molecular and pharmacological profiles released in 201616 and 201715, respectively. The dataset includes processed SNP (n = 79), RNA-seq (n = 82), RPPA (n = 79), and miRNA (n = 82) (Table 1)16. We provide the most recent update to UHNBreast with four new drugs (trastuzumab, olaparib, BYL719, and UNC0642), for a total of 85 cell lines, 8 drugs, and 689 drug sensitivity experiments, after our curations, where 56 cell lines were screened against 8 drugs15,16.

The convergence of the 2017 update of GRAY and our 2019 update to UHNBreast yield an intersection of 72 cell lines and 5 drugs after curation through our pipelines (Fig. 1).

Fig. 1
figure 1

Convergence of drugs and cell lines between GRAY (2017) and UHN Breast (2019) after curation through our CWL pipelines.

Reproducible and Transparent Processing of Data

Due to the scale and complexity of data that are produced through high-throughput platforms, the data processing and analysis pipelines should possess a robust and flexible infrastructure4,5. It is therefore important for pipelines to support interoperability, such as where different tools can be allocated to different data20. However, pipelines that are interoperable by consisting of multiple components/stages are difficult to reproduce21. To solve this issue, we developed our PharmacoGx pipelines in CWL, which allowed us to standardize the way we executed our multi stage processing and analysis of both breast cancer datasets in a reproducible and transparent manner (Fig. 2) (see Methods)11. Importantly, PharmacoGx implements the PharmacoSet (PSet) class, allowing us to create shareable R objects integrating all aspects of pharmacogenomic datasets, from cell line and drug annotations to the molecular and pharmacological data12. Each CWL pipeline is allocated a specific subroutine that is required for PSet creation, which includes curating cell and drug annotations, computing drug response, and incorporating processed molecular profiles for a given dataset (Table 2)11,12. To accomplish this in a semi-automatic fashion, we incorporated each pipeline into a CWL workflow, where PharmacoGx computes each stage of a pipeline and assembles their corresponding outputs into a PSet. This workflow not only transparently indicates the pipelines that are being executed, but also ensures that each pipeline is executed in the same manner if replicated, enforcing reproducibility11. In addition, the support of interoperability through CWL can be highlighted, as each pipeline stage that generates a corresponding output interacts with subsequent stages, which can be further enforced through specifying file-specific ontologies. These pipeline interactions validate the integrity of each given output object and PSet generated to ensure that it can be used for secondary analyses11,12. Interoperability through shared ontologies is also supported by PharmacoGx, as our pipelines curate and assign unique identifiers to each cell line and drug compound in each dataset, where the identifiers are used in subsequent pipeline stages to verify that the data is correctly compiled12. Therefore, the unique identifiers not only validate PSets that are generated, but also maximize consistency across existing PSets. However, because every dataset requires a different way of transforming and processing the data, due to variability in the way the data were initially shared and structured for each study, GRAY and UHNBreast possess their own CWL pipelines and workflow to accommodate for the differences14,15,16,17,19. Because CWL is a standardized language, each pipeline must include input and output definitions, base commands, and requirements (e.g., resource, Docker)11. In addition, each CWL pipeline and workflow must be accompanied by a YAML (YAML Ain’t Markup Language) or JSON file, which consists of an object array that defines a class and path for each input in the respective pipelines. In order for our CWL pipelines to execute successfully, they must specify the following: hints (docker requirement to run PharmacoGx), inputs that declare a type and input binding position (Rscripts, annotation files, raw drug data, processed molecular data), outputs that declare a type and output binding (e.g, processed drug sensitivity R objects, PSets), and a base command (to run Rscript), in the specified CWL file11,12. Therefore, in order for our CWL workflows to be fully documentented and reproducible, each pipeline must be defined as an input and possess a successful runtime independently11. Having to explicitly specify these parameters required to run each pipeline, along with the inputs and outputs in CWL provides an added layer of transparency to the pipelines, as well as allowing users to have control over data provenance. One of the highlights of our CWL workflows is the computation of drug response data for both datasets, which include AAC (Area Above the drug-dose response Curve), IC50 (maximal drug concentration to achieve 50% cell growth inhibition), Hill-Slope (measurement of slope of a drug-dose response curve), Einf (maximum theoretical inhibition), and EC50 (drug concentration for which 50% of maximum response is observed) (see Methods). Computed AAC was later utilized in a post-PSet analysis to determine the concordance between a gene-drug association in both datasets through calculating the concordance index (CI) between respective RNA-seq and the AAC data (see Methods)14,15,16,17,19. For GRAY, we computed AAC, IC50, Hill slope values, and included published GI50, GR50, GEC50, GRmax, GRinf, hGR, and GRAOC, data14,17 (see Methods). For UHNBreast, recomputation of AAC, IC50, and Hill slope was also performed, along with Einf, and EC5015.

Fig. 2
figure 2

Breast cancer PharmacoSet (PSet) generation and DOI assignment through execution of a reproducible PharmacoGx CWL workflow.

Table 2 CWL workflow pipelines and their respective data streams to produce a PharmacoSet (PSet) for GRAY and UHNBreast datasets.

Tracking Data Provenance and Validating Pipeline Integrity

Tracking data provenance with CWL can be further enhanced through the use of the provenance flag (–provenance) when executing the PSet workflows11. Here, a Research Object is automatically generated, which is a directory that acts as a bundled container for all of the resources utilized and produced within our workflows, including metadata that annotates each resource11,22. Within this object is a “data” directory that contains each input file used in the workflow with a unique and fixed checksum11. We are given granular transparency across the entire workflow at every stage, as we are able to map each checksum to a respective input file and location in the “data” directory, including all of the Rscripts that were utilized within a pipeline, through a workflow metadata file that is generated. In addition to a checksum, each PSet is also assigned a Universally Unique Identifier (UUID), which provides an additional layer of provenance to accurately identify the PSet that was generated by the workflow11,12. Moreover, this is accompanied by a provenance metadata file, which provides users with the ability to use checksums and UUID’s to accurately identify when each file was called and generated along the entire execution of a workflow11. Therefore, a Research Object confirms the reproducibility of our CWL workflows and validates the PSet that was generated with a respective runtime by providing rich metadata that tracks data provenance at each stage of a workflow.

Harnessing Docker to Create a Reproducible Runtime

PharmacoGx integrates seamlessly with CWL, as we leverage CWL’s Docker capabilities to containerize the package and run all of our pipelines in an isolated environment11,12,21. Docker is a tool that allows for PharmacoGx to be uniformly deployed with all software dependencies, in a containerized runtime environment where all of our computations are performed and PSets are produced12,21. The Docker container is invoked upon CWL workflow execution, where all the input files for a given pipeline become mounted into the container and all output files produced in the isolated environment are recovered into a local environment11,23,24. Another advantage of Docker is the ability of containers to utilize and share the hardware resources of the environment it is being run in25. Therefore, PharmacoGx deployment is not only consistent, but also portable across both cloud and high performance computing environments, as our Docker image is also publicly available through Docker Hub (,25. The ability to standardize the manner in which PSets are produced through CWL and develop an additional layer of abstraction for pipeline execution through Docker, allowed us to create and deploy reproducible and transparent pharmacogenomic pipelines that can be shared with the research community and replicated.

Sharing of Data and Pipelines

In order for a study to be computationally reproducible, data and pipelines must be well documented, uniquely identified, and easily accessible in a persistent location to other researchers26. To accomplish this, we utilized the Harvard Dataverse to share our PSets for both breast cancer pharmacogenomic datasets, along with Code Ocean to share our CWL pharmacogenomic pipelines13,18. Harvard Dataverse is an online data repository for transparently preserving and sharing research data with other researchers18. By creating a container known as a “dataverse” within the platform, researchers are able to deposit their datasets and corresponding metadata, in an organized fashion and make them easily discoverable for others to download and share. Each dataset can be also assigned a unique DOI, which allows a dataset to possess a persistent location, as well as allow researchers to accurately identify and share a specific dataset of interest. In addition, subsequent updates (versions) to a dataset can be uploaded, with accompanying metadata that explains the update and its changes, providing a layer of data provenance to the research community.

We also transferred our reproducibility measures to the pipeline level, as we deposited and shared our CWL workflows through Code Ocean, a reproducibility platform that allows for researchers to upload, share, and run published and configured code13. Data is uploaded into a “capsule”, which provides a computational environment for others to run code in the capsule, without the need to manually execute it locally with the addition of installing any dependencies13,27. Moreover, code can also be assigned a persistent DOI, providing the ability to accurately share and retrieve pipelines, as well as verify the reproducibility of published results directly through the compute capsule. Because Code Ocean does not currently support running multi-container pipelines, and therefore our CWL workflows, we used the platform to host our workflows and raw data, provide execution instructions, and run a post-PSet analysis for biomarker discovery.

Our PSets can be found on Harvard Dataverse at the following Our CWL workflows can be found on Code Ocean at the following

Utilization of PSets for Biomarker Discovery

In order to demonstrate the utilization of our PSets for cancer research, we identified ERBB2 expression as a biomarker for lapatinib in both the GRAY 2017 and UHNBreast 2019 datasets (Fig. 3) (see Methods). To investigate this pharmacogenetic association6, we utilized processed RNA-seq expression and computed drug response (AAC) from each PSet14,15,16,17,19. We subsequently identified 39 cell lines from the GRAY PSet and 50 cell lines from the UHNBreast PSet that include both gene expression data for ERBB2 and drug response data for lapatinib and computed the strength of significance of this gene-drug association using the concordance index (CI). CI estimates the probability that random pairs of samples will be similarly ranked by two variables, in order to identify the agreement (concordance) between the two variables30,31,32. We found that ERBB2 expression was strongly predictive in both the GRAY and UHNBreast datasets (CI = 0.73, p-value = 4.8E-15 in GRAY and CI = 0.63, p-value = 0.015 in UHNBreast). This argues against the null hypothesis that ERBB2 expression is independent of lapatinib response. This analysis can be reproduced through our Code Ocean capsule29.

Fig. 3
figure 3

ERBB2 expression as a biomarker for lapatinib in GRAY 2017 and UHN Breast 2019. N: number of samples; C-index: concordance index calculated for respective PSet; P-value; p-value calculated for respective PSet. Meta analysis represents combined concordance index and p-value across PSets.


The utilization of CWL allows us to create and execute transparent and reproducible pharmacogenomic pipelines that can be validated and easily shared with the scientific community11. The standardized architecture of the language allows users to create language-agnostic pipelines and workflows that enforce strict parameter specifications to ensure execution is consistent. In addition, users are able to incorporate Docker into their runtimes, where data ingestion, analysis, and exportation all occur in an isolated container environment that promote repeatable execution11,23,24. Users are also able to track data provenance across the entire execution time by creating Research Objects in CWL, which validates each portion of data flow from input to output, through checksums and UUID’s11,22. Lastly, CWL pipelines and workflows are scalable and portable across many computing environments, such as the cloud, which gives users the ability to easily share their analyses and harness a plethora of various hardware resources to successfully execute their workloads that would not be possible with using on premise resources11,24,25. A common practice in pharmacogenomics is sharing study data as supplementary files through a journal, or through online sharing platforms/repositories such as Synapse and GitHub, which was the case for both the GRAY and UHNBreast datasets14,15,16,17,19. However, the challenge becomes assembling these data into a form that can be successfully analyzed and interpreted when shared. We were able to accomplish this in a reproducible manner by utilizing study data from a variety of sources and assembling it into a meaningful and useful form for cancer researchers, which are PSets, through CWL and PharmacoGx11,12. Therefore, our pipelines form the bridge between raw pharmacogenomic data and assembly in a transparent fashion. With our pipelines utilizing the versatile PharmacoSet class, many data types from other datasets can be easily encapsulated for PSet generation, such as methylation, chromatin accessibility (e.g., ATAC-seq), metabolomics, protein expression33, and radiation therapy response34, given that it follows the PharmacoSet data structure12. This provides many opportunities for researchers to process and analyze a plethora of data for their studies. However, our workflows do have limitations, including the inability to identify changes to pipelines, input data, and PSets, at the file level, when updates are pushed, and the files are taken into an environment outside of Harvard Dataverse and Code Ocean. However, with storing our data on Harvard Dataverse and pipelines on Code Ocean with rich metadata, users will be able to retrieve any updated files on both repositories and accurately identify the exact changes to each file. In addition, CWL Research Objects provide checksums and UUID’s only after a runtime is complete, which are bound to the file name and not persistently attached to a file for use in subsequent workflow runs11. Thus, if an input file is updated and re-utilized in a workflow, we must manually keep track of all checksums and UUID’s that were assigned to it by CWL over time. In the future, we hope to increase transparency and reproducibility by automating these pharmacogenomic pipelines in a manner that keeps track of all input and output data at the file level through the use of automatically generated unique identifiers that are persistent. Moreover, we hope to provide users with an interface that provides options for processing drug sensitivity and molecular profiles in a generated PSet.


Computation of drug response data

Our CWL pipelines process raw pharmacological data of the GRAY and UHNBreast datasets14,15,17. This encompasses the computation of AAC, IC50, Hill-Slope, Einf, and EC50. With regard to the sensitivity metrics, drug potency and efficacy is a measure of AAC, potency is a measure of IC50 and EC50, while Einf is a measure of efficacy35,36. Our pipelines address the issues of metric summarization inconsistency and processing reproducibility across studies through the utilization of PharmacoGx, which efficiency standardizes the computation of drug sensitivity parameters for any pharmacogenomic dataset12,30.

The calculate From Raw function within PharmacoGx was used to compute the GRAY drug response data, while the computeSensitivity function was utilized to compute the UHNBreast drug response data12. The two functions reflect the data structure and formatting differences of the drug response data between the two breast cancer datasets.

Incorporating published drug response data

The GRAY dataset includes published processed drug response data14,17. The published data was curated, annotated, and compiled into a PSet using PharmacoGx12. These metrics include growth inhibition (GI) and growth rate inhibition (GR): GI50, GR50, GEC50, GRmax, GRinf, hGR, and GRAOC. The sensitivity metrics can be defined as14,17:

GI50: the drug concentration for 50% inhibition of cell proliferation.

GR50: the drug concentration (c) to achieve GR(c = GR50) = 0.5.

GEC50: the drug concentration for which 50% of maximal effect is observed.

GRmax: the GR observed at the highest drug concentration.

GRinf: the effect of the utilization of an infinite drug concentration.

hGR: the fitted curve Hill coefficient.

GRAOC: the effect of a drug across AOC estimated concentrations.

CWL pipeline execution steps

Each CWL pipeline within a workflow executes a custom R script with computational processing procedures for generating each PSet, which follow the same structure, regardless of the dataset being analyzed11,12. Each PSet that is generated begins with the execution of an R script that gathers curated identifiers for each cell line, tissue, and drug compound within each dataset. The curated identifiers are then used to collect cell line and drug metadata and generate a data array of the corresponding cell line and drug response experiment. The raw drug response data is then processed using PharmacoGx12, while the published drug response data is annotated and compiled. The pre-processed molecular profiles from each dataset are later organized into an ExpressionSet, which are data structures with processed data in the form of matrices with associated feature, phenotypic, and annotation data37. The last pipeline in our workflow compiles the curated unique identifiers, cell line and drug metadata, computed drug response data, published drug response data, and molecular profile ExpressionSets into a PSet through the PharmacoSet class in PharmacoGx12.

To execute a CWL workflow, cwltool must be run on the CWL and YAML files that are defined for a dataset workflow, where the –-provenance flag generates a Research Object11:

cwltool –-provenance /outputdir getUHN2017_Workflow.cwl getUHN2017_Workflow.yml

Biomarker discovery

We utilized the GRAY 2017 and UHNBreast 2019 PSets to identify an association between ERBB2 expression and lapatinib drug response across cell lines14,15,16,17,19. We identified 39 and 50 cell lines from the GRAY and UHNBreast PSet’s, respectively, that possessed both ERBB2 gene expression and drug response data (AAC) for lapatinib. With AAC being one of the most commonly used drug sensitivity metrics, we utilized the gene expression and computed AAC data within the GRAY and UHNBreast PSets to assess this gene-drug association (expression-based biomarker) through calculating the concordance index and p-value15,30,31.

Our code for this analysis utilizes the summarizeSensitivityProfiles and summarizeMolecularProfiles functions in PharmacoGx to extract lapatinib response and ERBB2 expression data from the GRAY 2017 and UHNBreast 2019 PSets12. This response and expression data was subsequently used to compute the concordance index and p-value between them, for both PSets, using the concordance index function within the survcomp R package38.