MetaGxData: Clinically Annotated Breast, Ovarian and Pancreatic Cancer Datasets and their Use in Generating a Multi-Cancer Gene Signature

A wealth of transcriptomic and clinical data on solid tumours are under-utilized due to unharmonized data storage and format. We have developed the MetaGxData package compendium, which includes manually-curated and standardized clinical, pathological, survival, and treatment metadata across breast, ovarian, and pancreatic cancer data. MetaGxData is the largest compendium of curated transcriptomic data for these cancer types to date, spanning 86 datasets and encompassing 15,249 samples. Open access to standardized metadata across cancer types promotes use of their transcriptomic and clinical data in a variety of cross-tumour analyses, including identification of common biomarkers, and assessing the validity of prognostic signatures. Here, we demonstrate that MetaGxData is a flexible framework that facilitates meta-analyses by using it to identify common prognostic genes in ovarian and breast cancer. Furthermore, we use the data compendium to create the first gene signature that is prognostic in a meta-analysis across 3 cancer types. These findings demonstrate the potential of MetaGxData to serve as an important resource in oncology research, and provide a foundation for future development of cancer-specific compendia.


Results
MetaGxData characterization and curation. The MetaGxData compendium integrates three packages containing curated and processed expression datasets for breast (MetaGxBreast), ovarian (MetaGxOvarian), and pancreatic (MetaGxPancreas) cancers. Our current framework extends the standardized framework we previously generated for curatedOvarianData 29 . Our proposed enhancements facilitate rapid and consistent maintenance of our data packages as newer datasets are added, and provide enhanced user versatility in terms of data rendering across single or multiple datasets. All of these datasets can be downloaded through the MetaGxBreast, MetaGxOvarian and MetaGxPancreas R data packages, which are publicly available through the Bioconductor ExperimentHub [37][38][39] . Vignettes outlining how to access the MetaGxBreast, MetaGxOvarian and MetaGxPancreas datasets in R are available through the Bioconductor website.
We developed semi-automatic curation scripts to standardize gene and clinical annotations of our breast, ovarian and pancreatic cancer datasets based on the nomenclature used in The Cancer Genome Atlas (TCGA) (Supplementary File S1) 2,29 . At its core, the MetaGxData compendium represents a unified pipeline for processing datasets within a given form of cancer, and for providing cancer-specific data packages to users with standardized gene and clinical annotations (Fig. 1). Such annotations include a host of relevant categorical variables that reflect tumour histology (stage, grade, primary site, etc.), as well as categorical and numerical variables crucial for survival analysis and prognostication in these cancers (including overall survival, recurrence-free survival, distant-free survival, and metastasis-free survival) (Supplementary Fig. S2). Most importantly, we have provided a number of comparable and overlapping clinicopathological features across breast, ovarian and pancreatic cancer samples, such as age at diagnosis, tumour grade, or vital status (Fig. 2). Where some datasets lack vital status or other endpoints, we have included information on other endpoints, such as relapse-free survival (breast and ovarian cancer datasets) and distant metastasis-free survival (breast cancer datasets only). Additional common variables between the datasets can be seen in the supplementary figures (Supplementary Figs S3-S5). We also provide critical tumour-specific annotations for each tumour type, including, for example, biomarker status (HER2, ER, PR) in breast cancer, and TNM status for pancreatic datasets. Treatment information across the cancers is provided when available.
www.nature.com/scientificreports
For subsequent analyses presented in this work, overall survival was used as the primary endpoint, and datasets lacking vital status were excluded from the analysis. For pancreatic cancer, survival information was obtained exclusively using overall survival as the primary endpoint.
Analysis of prognostic genes in breast, ovarian, and pancreatic cancer. The wealth and breadth of transcriptomic datasets in MetaGxData can be used as a framework for translational cancer research. As an example of the versatility of our packages, we conducted a meta-analysis of the prognostic value of well-studied prognostic genes in ovarian cancer and pancreatic cancer, as well as our previously published gene modules in breast cancer, using the MetaGxBreast, MetaGxPancreas and MetaGxOvarian packages (Figs 3-5) 22,23,27,28 . A total of 6 ovarian genes (PTCH1, TGFBR2, CXCL14, POSTN, FAP, and NUAK1), 36 pancreas genes from the gene signature developed by Haider et al. 40 , and 7 breast cancer gene modules (ESR1, ERBB2, STAT1, CASP3, PLAU, VEGF, and AURKA) were tested. Each breast cancer gene module comprises a set of highly correlated genes (selected using Gram-Schmidt variable selection) relating to specific cancer biological processes that we previously demonstrated to have prognostic utility in breast cancer 23,28 . For simplicity, each module is identified by a standard 'prototype gene'; as an example, the 'AURKA' module contains genes that are highly correlated with the proliferation gene AURKA (Fig. 3a).
Figure 1. Diagrammatic representation of the data processing pipeline for packages that are part of the MetaGxData compendium. Depicted are the processes involved in downloading a dataset, and standardization of molecular (gene) and clinical (patient) data to produce cancer-specific compendia that abide by the MetaGxData framework.
The hazard ratio of tested genes and gene modules was determined by calculating the D.index, which is an estimate of the log hazard ratio (HR) comparing two equal-sized groups. We observed that the direction of hazard ratios of these genes (HR > 1 or HR < 1) was fairly consistent, largely deviated from HR = 1, and was statistically significant across datasets. Genes with hazard ratios closer to 1 demonstrated greater variability in the direction of the HR index across datasets, owing to their decreased prognostic relevance (Fig. 3a). Furthermore, log rank tests were used to determine whether splits in the survival curves generated by using the genes to group patients into high and low score groups were statistically significant.
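The grouping of patients into high and low score groups can be illustrated with a minimal Python sketch. The Methods state that samples are split at the median within each dataset; the function name `median_split`, the patient identifiers, the expression values, and the handling of ties are assumptions made for illustration only.

```python
from statistics import median

def median_split(scores):
    """Label each patient 'high' or 'low' relative to the dataset's median score."""
    cutoff = median(scores.values())
    # Assumption: values exactly at the median go to the 'low' group;
    # the text does not say how ties are handled.
    return {pid: ("high" if s > cutoff else "low") for pid, s in scores.items()}

# Hypothetical expression values of one gene in one dataset.
dataset = {"p1": 2.1, "p2": 5.4, "p3": 3.3, "p4": 7.8}
groups = median_split(dataset)
```

The split is applied separately in each dataset, so platform-specific expression scales never need to be compared directly. The resulting labels would then be passed to a log rank test in a survival analysis library.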
Unsurprisingly, higher gene expression levels of the proliferation gene AURKA indicate poorer survival in breast cancer (log rank p = 1.1e-16, n = 4,161) (Fig. 3c). This supports previous findings regarding the importance of this gene in biology-driven signatures of breast cancer, and its comparable prognostic effect with other multi-gene prognostic signatures 22,23,35,41,42 . We have also observed that the NUAK1 gene is associated with the worst prognosis in ovarian cancer (log rank p = 6.2e-9, n = 2,450) (Fig. 4c). We have previously demonstrated the utility of NUAK1 in the development of a debulking signature that can predict the outcome of cytoreductive surgery 28 . Figure 5 demonstrates the results of the 6 most statistically significant genes from the Haider et al. pancreatic gene signature 40 . Of these genes, we have observed that adrenomedullin (ADM) is associated with the worst prognosis in pancreatic cancer (Fig. 5c). High expression levels of ADM were associated with poor outcomes in patients, which is consistent with previous findings that ADM is overexpressed in PDAC and enhances pancreatic cancer cell invasion 43 .

Meta-analysis of gene expression prognosis across cancers.
Our single-gene prognostic analysis can easily be extended to a genome-wide meta-analysis within individual cancer types, or across several cancer types combined. To this end, we first determined the prognostic capability of 22,410 genes that are common across the two predominantly female cancers, breast and ovarian (Supplementary File S6). We identified 30 genes that are significantly prognostic in both tumour types (False Discovery Rate [FDR] < 5%). From this list of prognostic genes, we subsequently identified 12 genes that share same-direction hazard ratios in both breast and ovarian cancers: for 3 genes, elevated expression is indicative of worse prognosis in both cancers (HR > 1), and for 9 genes, elevated expression is indicative of better prognosis (HR < 1) (Supplementary File S6). Such analyses can be used to test pan-cancer hypotheses across much larger sample sizes than previously possible, and will allow deeper study of relationships between cancer subtypes.
We additionally conducted a genome-wide analysis of all the genes present across the MetaGxPancreas datasets in order to identify highly prognostic genes (Supplementary File S6). Only genes present in at least 6 of the 12 datasets containing overall survival information were considered in the search for the most prognostic genes (n = 19,245 genes). The 3 genes that led to the poorest outcomes when overexpressed (largest HR) with FDR-adjusted p-values under 5% were FAM83A (HR = 1.83), HMGA2 (HR = 1.73), and KRT7 (HR = 1.72). The genes whose expression was most indicative of better outcomes (smallest HR), with FDR-adjusted p-values under 5%, were FAM189A2 (HR = 0.68), PPP1R10 (HR = 0.69), FRZB (HR = 0.70), and GATA6 (HR = 0.71). Notably, FAM189A2 was also identified in our analysis as the only gene whose low expression is indicative of worse outcome (FDR < 0.05, HR < 1) across breast, ovarian, and pancreatic cancers (Supplementary File S6).

MetaGx gene signature creation and prognosis in breast, ovarian and pancreatic cancer.
We developed a gene signature that is prognostic in both breast and ovarian cancers by running a single-gene, genome-wide prognostic analysis on 22,410 genes as above, but excluding several large breast and ovarian datasets for use as validation cohorts. The METABRIC dataset (n = 2,136 samples) from MetaGxBreast, and 5 of the largest ovarian datasets (GSE9891, GSE32062, GSE49997, GSE26712, GSE51088) were removed from the analysis for later use as the validation cohort to test the signature. Using only the training sets, meta-analysis identified 53 genes with significant hazard ratios in both cancers (FDR < 5%, HR > 1.125 or HR < 0.875), which were used to form the MetaGx signature (Table 1). The direction of association of the genes comprising the signature was chosen based on the hazard ratios (HR > 1 positive direction).
Notably, the MetaGx signature included 3 genes (DDB2, GSTZ1, and FAM189A2) that had been previously identified from the set of 12 genes sharing same-direction hazard ratios in the meta-analysis of breast and ovarian cancers (Supplementary File S6).
The top 5 signatures from our recent review of ovarian gene signatures were evaluated alongside the MetaGx signature, and each signature was tested in the molecular subtypes identified by The Cancer Genome Atlas Research Network (immunoreactive, proliferative, mesenchymal, and differentiated subtypes) 1,27 . The MetaGx signature was the most prognostic of the ovarian signatures tested in an analysis containing all the patients (HR 2.02, n = 1,069) and was the only signature providing statistically significant prognostic capabilities within each subtype (log rank tests p < 0.05). Although its D index was prognostic in the differentiated subtype (HR 1.85, n = 427) and was the most prognostic of the signatures tested in the mesenchymal subtype (HR 1.95, n = 229), the MetaGx signature did not yield statistically significant D indices in the immunoreactive and proliferative subtypes (Fig. 6a-e).
In breast cancer, the MetaGx signature was benchmarked against the clinically relevant MammaPrint and Oncotype DX signatures [44][45][46] . Our three gene (ER, HER2, and AURKA) subtype classification model (SCM) was [...]
We further tested the prognostic value of the MetaGx signature in pancreatic cancer and benchmarked it against pancreatic signatures from the literature. A signed average approach was implemented for evaluation, where the direction of association of the genes comprising the signature was chosen based on the hazard ratios (HR > 1 positive direction) 40,[47][48][49] . Briefly, in each patient, the expression values of signature genes whose expression led to poor outcomes (HR > 1) were added together, and those of genes whose expression led to a favorable prognosis (HR < 1) were subtracted. Accordingly, higher signature scores (i.e., signed average) were associated with poorer outcomes. Information pertaining to the genes comprising each of the pancreatic signatures can be found in Supplementary File S8.
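The signed average score can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the authors' code: the function name `signed_average`, the placeholder gene names, the expression values, and the choice to skip signature genes missing from a profile are all assumptions.

```python
def signed_average(expression, directions):
    """Signed average of the signature genes for one patient.

    expression: gene -> expression value for this patient
    directions: gene -> +1 if HR > 1 (poor outcome), -1 if HR < 1 (favorable)
    """
    # Assumption: signature genes absent from the profile are skipped.
    present = [g for g in directions if g in expression]
    if not present:
        raise ValueError("no signature genes found in the expression profile")
    total = sum(directions[g] * expression[g] for g in present)
    return total / len(present)

# Hypothetical three-gene signature and one hypothetical patient profile.
directions = {"GENE_A": 1, "GENE_B": 1, "GENE_C": -1}
patient = {"GENE_A": 2.0, "GENE_B": 4.0, "GENE_C": 3.0, "GENE_D": 9.0}
score = signed_average(patient, directions)  # (2.0 + 4.0 - 3.0) / 3
```

Because genes with HR < 1 enter with a negative sign, a higher score always points toward a poorer expected outcome, which is what allows a single median cutoff to define the high-risk group.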
Of the 5 signatures tested, the MetaGx signature was the most prognostic in the analysis of all the patients (HR 1.64, n = 903) and was the only signature that yielded a statistically significant difference in survival within both the basal (log rank p = 1.1e-3, n = 375) and the classical (log rank p = 1.3e-2, n = 528) pancreatic cancer molecular subtypes identified by Moffitt et al. (Table 2, Fig. 6j-l) 50 .
Each KM plot represents patients of a specific tumour grade. Within each plot, patients are split into 'high' and 'low' based on whether they fall above or below the median NUAK1 gene expression. The asterisks above the D indices indicate whether the D index was statistically significant (p < 0.05).
We determined the Spearman correlation between patients' signature scores and our gene modules in order to investigate the biological processes present in our signature (Supplementary Fig. S9). In all 3 cancers, the signature [...] pancreatic cancer, ovarian cancer and breast cancer test datasets against 1,000 random signatures of equal size 51 . In all three cases, the magnitude of the hazard ratio from the MetaGx signature was larger than the random signatures' hazard ratio (p = 0.001 for all three cancers) (Supplementary Fig. S10).
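A comparison against random signatures of this kind is usually summarized as an empirical p-value. The Python sketch below shows one common way to compute it. The function name `empirical_p`, the add-one correction, the comparison on the absolute log scale, and the made-up null hazard ratios are assumptions; the text does not state how the reported p = 0.001 was derived.

```python
import math

def empirical_p(observed_hr, random_hrs):
    """Share of random signatures whose |log HR| is at least as large as the observed one."""
    obs = abs(math.log(observed_hr))
    exceed = sum(1 for hr in random_hrs if abs(math.log(hr)) >= obs)
    # Assumption: add-one correction so the estimate is never exactly zero.
    return (exceed + 1) / (len(random_hrs) + 1)

# 1,000 fabricated null hazard ratios between 1.0 and 1.4995, for illustration only.
random_hrs = [1.0 + 0.0005 * i for i in range(1000)]
p = empirical_p(2.02, random_hrs)  # 2.02 is the ovarian HR quoted in the text
```

With 1,000 random signatures and none matching the observed magnitude, this estimator gives 1/1001, which rounds to the 0.001 scale reported above.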

Discussion
Meta-analysis of multiple cancer types is an area of high interest, with ongoing research continually supporting the growing relationship between these malignancies and suggesting common patterns of tumour biology 52 . We provide an integrative, standardized, and comprehensive platform to facilitate analysis between breast, ovarian, and pancreatic cancer. This platform provides a flexible framework for data assimilation and unified nomenclature, with standardized data packages hosting the largest compendia of breast, ovarian, and pancreatic cancer transcriptomic and clinical datasets available to date.
Integration of genomic data into standardized frameworks is challenged by the inconsistency of the clinical curations across datasets and across tumour types. Annotation of clinicopathological variables may vary widely due to different protocols in different laboratories, institutions, and across international boundaries. We have standardized, as much as possible, the catalog of clinical variables within each tumour type. For characteristics pertaining to a specific tumour type, including ER, PGR, and HER2 IHC status in breast cancer samples, we have generated a semantic positive/negative variable to reflect IHC status. This facilitates searching across all patients irrespective of the original assay annotations, which may have been binary, numeric, or qualitative. Similarly, a binary variable has been assigned to ovarian cancer patients to reflect whether they had been treated with platinum, taxol, or neoadjuvant therapy. Many of the annotated variables (e.g., stage and tumour grade in MetaGxOvarian) have also been standardized to facilitate comparisons across multiple studies. Further analyses using our previously developed packages (curatedOvarianData) have indicated good consistency across datasets, and ultimately facilitated uniform and consistent investigations on the prognostic effect of biomarkers in ovarian cancer survival 53,54 .
The scale of MetaGxData facilitates identification of gene signatures that are prognostic across multiple forms of cancer. Using this compendium, we developed a gene signature that is prognostic for breast, ovarian, and pancreatic cancers. Requiring genes to be prognostic across multiple datasets should help distinguish between general and disease-specific processes affecting patient survival, and allow signatures to generalize better to new datasets, as opposed to conventional signature creation methods that select genes based on Cox proportional hazard models in a single dataset. We have demonstrated that the multi-cancer MetaGx signature outperformed the top ovarian signatures identified in our previous review in an analysis conducted on all patients with overall survival as the endpoint. It was also more prognostic than the clinically relevant MammaPrint and Oncotype DX signatures in the ER−/HER2− breast cancer subtype, and more prognostic than pancreas-specific signatures in pancreatic cancer. Furthermore, it was the only signature that was prognostic in each molecular subtype of pancreatic cancer, and was highly prognostic in the basal-like subtype. Notably, the MetaGx signature was not prognostic in the HER2− breast subtype or the immunoreactive and proliferative ovarian subtypes. One possible explanation for this behavior is that the number of patients with those subtypes is small compared to the majority of patients that were used as the training set. This is particularly true for the HER2− subtype in breast cancer (n = 236 HER2− patients, in a training set of n = 1,969 breast cancer patients). However, we are unaware of any gene signature to date that is prognostic across each subtype based on a meta-analysis of multiple datasets. Indeed, the clinically used MammaPrint signature, as an example, is only used for ER+/HER2− patients.
The large number of datasets offered as part of MetaGxData provides researchers with the ability to select different datasets for their respective analyses. As such, it is conceivable that researchers may select particular datasets to highlight the significance of signatures. However, the magnitude of the samples and datasets provided by the compendium arguably makes it difficult for researchers to justify why some datasets have been retained and others dismissed. In the current literature, many existing publications have derived prognostic signatures based on a comparison of 3-5 datasets. With the release of MetaGxData, researchers now need to develop signatures that harness the full compendium. Hopefully, this will result in the production of more rigorous signatures, as these signatures would need to be prognostic across an entire meta-analysis.
To our knowledge, the MetaGx signature represents the first signature demonstrated to be prognostic in a meta-analysis across three cancers. This includes pancreatic cancer, which had been selected as an independent validation set for testing the signature. Our signature predicts poor outcomes associated with metastases for patients, based on our observations that patients' signature scores across all three cancers consistently had strong positive correlations with our PLAU tumor metastases module. Furthermore, since the signature was consistently negatively correlated with the ESR1 module in all three cancers, and high signature scores led to poor outcomes, we believe the signature also models the poor outcomes associated with increased ER pathway activity in patients. Our signature provides additional support for the role of CLDN4 in pancreas, breast and ovarian malignancies. Higher expression levels of this gene placed patients in the high score group that had poorer outcomes in all 3 of these cancers. This is in agreement with numerous studies that have shown CLDN4 to be overexpressed in pancreatic, ovarian, and breast tumors relative to normal tissue [55][56][57][58][59] . It is also interesting to observe that FAM189A2 was one of the top genes across all 3 cancers that was indicative of worse outcomes when expression levels were low (HR < 1), which is consistent with what has been shown in lung and thyroid cancer 60,61 .
In conclusion, the MetaGxBreast, MetaGxOvarian and MetaGxPancreas packages follow a unified framework that facilitates integration of oncogenomic and clinicopathological data. We have demonstrated how our packages facilitate easy meta-analysis of gene expression and prognostication in breast, ovarian and pancreatic cancer. We have also demonstrated that leveraging this data in meta-analysis can lead to gene signatures that outperform clinically relevant breast signatures in ER−/HER2− patients, and outperform ovarian signatures developed from single datasets, as well as a number of published pancreatic cancer signatures. These packages have the potential to serve as an important resource in oncology and methodological research and provide a foundation for future development of cancer-specific compendia.

Methods
Breast cancer data acquisition. Breast cancer datasets were extracted from our previous meta-analysis of breast cancer molecular subtypes, which includes 39 microarray datasets from a variety of commercially available microarray platforms published from 2002 to 2014 35 . Additional datasets were extracted from the Gene Expression Omnibus (GEO) and manually curated. Gene expression and clinical annotation for METABRIC were downloaded from EBI ArrayExpress and combined into a dataset of 2,136 samples 62 . The cgdsr R package was used to extract 1,098 tumour samples from The Cancer Genome Atlas (TCGA), and matching clinical annotations for these samples were downloaded from the TCGA Data Matrix portal (https://tcga-data.nci.nih.gov/tcga/) 2,63 . Combining these studies produced a total of 39 breast cancer microarray expression datasets spanning 10,004 samples. Of these 10,004 samples, survival information is available for 6,847 patients, including overall survival (n = 4,425), metastasis free survival (n = 2,695), and relapse free survival (n = 1,858).
Ovarian cancer data acquisition. Ovarian microarray expression datasets were obtained from our recent update of the curatedOvarianData data package, to which we have added 5 expression datasets relative to the originally published version 29 , for a total of 26 microarray datasets spanning 3,526 samples. To obtain these datasets we first used the curatedOvarianData pipeline to generate the "FULLcuratedOvarianData" version of the package, which differs from the public version in that probe sets for the same gene are not merged (https://bitbucket.org/lwaldron/curatedovariandata). Of the 3,526 samples, survival information is available for 2,726 patients, including overall survival (n = 2,712) and relapse free survival (n = 1,928).
Pancreatic cancer data acquisition. Pancreatic ductal adenocarcinoma (PDAC) datasets were obtained by curating datasets available from the literature. A total of 21 datasets were curated for a total of 1,719 patient transcriptomic profiles. Of the 21 datasets, overall survival data was present for 12 studies. Consequently, of the 1,719 samples, survival information is available for 1,000 patients; all of it is overall survival (n = 1,000), with no relapse free survival data.
Processing of gene expression datasets. The processing of breast and ovarian cancer microarray datasets was previously described 29,35 . The pancreatic cancer datasets were processed in the manner described within the original studies from which they were obtained; the only exception is the Kirby dataset, which had been aligned using Kallisto and whose expression values are calculated using the logarithm of the transcripts per kilobase million (TPM).
Across all datasets, we used GEO platform descriptions as the primary source of probe and gene annotations when available; otherwise, original annotations as published by the authors were used for non-standard gene expression profiling platforms. The full set of gene annotation platforms across all expression sets can be found in the metadata files associated with each Bioconductor package, and is additionally provided in Supplementary Tables S11-S13. Gene symbols and Entrez Gene identifiers that matched the probeset ids of a given expression set were subsequently saved as part of the featureData (fData) pertaining to that expression set. For genes with multiple probesets, the IQR function within R was used to measure the variability (interquartile range) of each probe across the dataset; only the probe with the highest variability across the dataset was used to calculate the prognostic value of the gene. Standardization of gene expression values across datasets was addressed through meta-analysis: each gene was evaluated within each dataset, and a final estimate was determined for each gene via the survcomp combine.est function (further details are provided below).
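The probe selection rule for genes with multiple probesets can be sketched as follows. The authors used R; this Python version is only an illustration of the rule, and the function names, probe identifiers, and expression values are hypothetical.

```python
from statistics import quantiles

def iqr(values):
    """Interquartile range (third quartile minus first quartile)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

def most_variable_probe(probes):
    """Return the probe id whose expression has the largest IQR across samples."""
    return max(probes, key=lambda pid: iqr(probes[pid]))

# Two hypothetical probes mapping to the same gene, measured in five samples.
probes = {
    "probe_1": [5.0, 5.1, 5.2, 5.1, 5.0],
    "probe_2": [3.0, 6.0, 9.0, 4.0, 8.0],
}
best = most_variable_probe(probes)
```

Choosing the probe with the widest spread favors probes that actually vary between patients, on the reasoning that a nearly constant probe carries little prognostic information.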
MetaGxData package implementation. The breast, ovarian, and pancreatic cancer datasets are available through the MetaGxBreast, MetaGxOvarian, and MetaGxPancreas R data packages hosted on Bioconductor's ExperimentHub. The MetaGxData packages allow users to select and filter the finalized curated datasets using the loadOvarianDatasets, loadBreastDatasets and loadPancreasDatasets functions of MetaGxOvarian, MetaGxBreast and MetaGxPancreas, respectively. Users are provided options for filtering samples based on clinical parameters, availability of survival data, and sample replicates (patients with highly correlated transcriptomic profiles; Spearman correlation > 0.98). Users are also provided other options including, but not limited to, the ability to remove datasets based on the number of samples and the number of survival events present in the data. Importantly, users can select only primary tumour samples, or several tissue types (primary tumours, healthy tissue, etc.), using the sample type information found in the clinical data.
Collectively, our data compendium, referred to as MetaGxData, encompasses 86 processed datasets, containing in total 15,249 breast, ovarian and pancreas samples. Information pertaining to the platform, number of samples, number of probes, and number of unique genes present in the breast, ovarian, and pancreas datasets can be found in the supplementary files (Supplementary Tables S11, S12 and S13). Expression datasets are represented as SummarizedExperiment objects with attached clinical data (pData) and feature data (fData), and can be loaded into R with a single function call, allowing for fast and flexible analysis 38 . Hosting the datasets within the Bioconductor ExperimentHub facilitates rapid integration of new datasets into the existing framework and allows for easy extension of newer studies into the package in future iterations of MetaGxData.
Prognostication of breast and ovarian cancer genes and signature generation. Cox proportional hazards analysis was performed using the R package survcomp (version 1.29.4) to estimate the prognostic value (hazard ratio) and significance (corresponding p-value) of the genes in each dataset 64 . In these analyses, overall survival was used as the primary endpoint when determining the hazard ratio. After determining the hazard ratio in each dataset, a final combined estimate of the hazard ratio was calculated using a random-effects model (combine.est from survcomp) 65 . Expression data from non-tumor samples were removed from all analyses. When stratifying samples into groups to generate survival curves, samples within each dataset were stratified into two groups based on the median expression of the gene or the median gene signature/module score for all the samples within that dataset. For the gene signatures, risk prediction scores were determined using the signed average of the patients' gene expression, with the sign of each gene determined by its direction of association with the survival outcome (HR > 1 positive direction). Datasets that did not include the 3 genes in our SCM gene subtype classification model were removed from the survival analyses. For example, the UNC4 breast cancer dataset was excluded, as the ER probe was deemed poor quality by the manufacturer and removed from the annotations. Furthermore, the ICGCSEQ dataset in MetaGxPancreas was excluded, due to overlap of a subset of patients with the ICGCMICRO dataset. To generate the MetaGx gene signature, the aforementioned analysis was performed on common genes in MetaGxBreast and MetaGxOvarian to determine the hazard ratios of each gene. The METABRIC dataset (n = 2,136 samples) from MetaGxBreast, and 5 of the largest ovarian datasets (GSE9891, GSE32062, GSE49997, GSE26712, GSE51088, totaling 1,116 samples) were removed from the analysis for later use as the validation cohort to test the signature.
The 53 genes with significant hazard ratios in both cancers (FDR < 5%, and either HR > 1.125 or HR < 0.875) were selected for the MetaGx gene signature.
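The pooling and selection steps can be illustrated with a self-contained Python sketch. It is not the survcomp implementation. The DerSimonian-Laird estimator, the Benjamini-Hochberg adjustment, and the normal-approximation p-value are assumptions standing in for details the text does not specify, and all gene names and numbers are hypothetical. Direction concordance between cancers is not checked here, since the text does not state it as a criterion.

```python
import math

def random_effects(log_hrs, ses):
    """Pooled log HR and standard error (DerSimonian-Laird, an assumed estimator)."""
    w = [1.0 / se ** 2 for se in ses]
    sw = sum(w)
    fixed = sum(wi * y for wi, y in zip(w, log_hrs)) / sw
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, log_hrs))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-dataset variance, truncated at zero.
    tau2 = max(0.0, (q - (len(w) - 1)) / c) if c > 0 else 0.0
    w_re = [1.0 / (se ** 2 + tau2) for se in ses]
    pooled = sum(wi * y for wi, y in zip(w_re, log_hrs)) / sum(w_re)
    return pooled, math.sqrt(1.0 / sum(w_re))

def two_sided_p(estimate, se):
    """Two-sided p-value under a normal approximation."""
    return math.erfc(abs(estimate / se) / math.sqrt(2.0))

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adjusted, running_min = [0.0] * n, 1.0
    for k, i in enumerate(order):
        running_min = min(running_min, pvals[i] * n / (n - k))
        adjusted[i] = running_min
    return adjusted

def select_genes(pooled_by_cancer, hr_hi=1.125, hr_lo=0.875, fdr=0.05):
    """Genes passing the FDR and HR thresholds in every cancer type."""
    passing = []
    for stats in pooled_by_cancer.values():
        genes = list(stats)
        adj = bh_adjust([two_sided_p(*stats[g]) for g in genes])
        ok = set()
        for g, q in zip(genes, adj):
            hr = math.exp(stats[g][0])
            if q < fdr and (hr > hr_hi or hr < hr_lo):
                ok.add(g)
        passing.append(ok)
    return set.intersection(*passing)

# Hypothetical pooled (log HR, SE) values per gene and cancer type.
pooled = {
    "breast": {"G1": random_effects([0.30, 0.30], [0.05, 0.05]),
               "G2": (0.02, 0.05), "G3": (-0.40, 0.08)},
    "ovarian": {"G1": (0.25, 0.06), "G2": (0.30, 0.05), "G3": (-0.05, 0.07)},
}
selected = select_genes(pooled)
```

In this toy input only G1 clears both thresholds in both cancer types. G2 and G3 are each prognostic in one cancer only and are therefore dropped.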
Correlation between the signature scores and gene modules. Correlations between the MetaGx signature and the gene modules were determined by finding the individual Spearman correlation coefficients between the signature's risk predictions and the gene modules' risk predictions in each individual dataset. A meta-estimate for the correlation coefficient was then determined from the individual correlation coefficients and their associated standard errors via the survcomp package (combine.est function) using a random effects model.
Statistical analysis. The hazard ratios were computed via the R survcomp package as D indices by using risk predictions for the signatures along with the patients' corresponding survival times and overall survival statuses. The D-index is a robust estimate of the traditional Cox's hazard ratio, more precisely an estimate of the hazard ratio comparing two equal-sized prognostic groups 64,66 . This is a scale-free measure of separation between two independent survival distributions under the proportional hazards assumption. All individual estimates were combined into a meta-estimate via survcomp in a random effects model to obtain a single best estimate of the D index; this metric is reported throughout the present work. The patient groups, survival times and overall survival status of the patients from all the datasets were used within the survival package to generate Kaplan-Meier survival