Integrative analysis of TCGA data identifies miRNAs as drug-specific survival biomarkers

Biomarkers predictive of drug-specific outcomes are important tools for personalized medicine. In this study, we present an integrative analysis to identify miRNAs that are predictive of drug-specific survival outcome in cancer. Using the clinical data from TCGA, we defined subsets of cancer patients who suffered from the same cancer and received the same drug treatment, which we call cancer-drug groups. We then used the miRNA expression data in TCGA to evaluate each miRNA’s ability to predict the survival outcome of patients in each cancer-drug group. As a result, the identified miRNAs are predictive of survival outcomes in a cancer-specific and drug-specific manner. Notably, most of the drug-specific miRNA survival markers and their target genes showed consistency in terms of correlations in their expression and their correlations with survival. Some of the identified miRNAs were supported by published literature in contexts of various cancers. We explored several additional breast cancer datasets that provided miRNA expression and survival data, and showed that our drug-specific miRNA survival markers for breast cancer were able to effectively stratify the prognosis of patients in those additional datasets. Together, this analysis revealed drug-specific miRNA markers for cancer survival, which can be promising tools toward personalized medicine.


Background
One important aspect of personalized medicine in cancer therapy is the ability to predict an individual patient's response to drug treatments. Currently, there is an unmet need for such drug-specific predictive biomarkers, which can spare patients from ineffective toxic agents and optimize treatment for individual patients 1 . Utilizing various powerful sequencing technologies, significant efforts have been spent in genomics and proteomics research to profile cancer patients. Such molecular data of cancer enabled studies of the relationship between molecular expression profiles and clinical outcomes, and revealed genetic and epigenetic alterations as biomarkers predictive of survival [2][3][4] . However, in existing studies, variations in drug treatment among patients were often overlooked, even though it has been reported that drug exposure could affect specific histones, modify gene expression, and impact survival outcomes 5 . In our opinion, there are two main reasons why drug treatment was often not considered in the search for survival biomarkers for cancer. One reason is statistical power. Since larger sample size often leads to improved statistical power for identifying cancer survival biomarkers, studies usually chose to include as many relevant patients as possible, while ignoring the fact that not all included patients received the same drug treatment. The other reason is data availability. In publicly available datasets of cancer genomics and survival studies, drug treatment data of patients is often unavailable. In rare cases where drug treatment data of patients is provided, the data is often in inconsistent and non-standardized formats, making it difficult to use 6 . As a result, the existing biomarkers are often general to the cancer or cancer subtype being studied, but not specific to any drug treatment. Although existing studies have identified useful survival biomarkers, it would be more beneficial if drug-specific survival biomarkers can be identified 7 .
As a valuable data resource for cancer research, The Cancer Genome Atlas (TCGA) provides an opportunity for exploring biomarkers predictive of drug-specific survival outcomes. The TCGA data resources contain comprehensive molecular characterizations of ∼11,000 cancer patients across 33 different cancer types, as well as drug treatment data and clinical outcomes of the patients. The molecular characterizations include mutation, copy number variation, methylation, gene expression, miRNA expression and protein expression. Although the drug treatment data in TCGA contains a lot of variations in naming conventions and spelling issues, our group www.nature.com/scientificreports/ have previously manually standardized the data to eliminate the nomenclature problems 8 . This previous effort enabled integration of the drug treatment data into bioinformatics pipelines for genomics and survival analysis.
In this study, we focus on exploring whether miRNAs can be used as drug-specific survival biomarkers, because the potential connection between miRNAs and cancers has been particularly well appreciated by many studies. It has been showed that miRNA involvements in the development of cancer, including cell proliferation, metastasis, differentiation, and evasion of apoptosis 9 . Moreover, it is proposed that expression of more than one third of human genes are under miRNAs control, which means many drug target genes may be regulated by miRNAs 10 . In addition, the functional pattern of miRNA-mRNA interaction network has been shown in the onset and development of cancer [11][12][13] . Two previous studies took the advantage of network-based models and explored the regulatory mechanism between miRNA and mRNA in colorectal and pancreatic cancer by using the matched specimens of human cancer tissue and adjacent non-tumorous mucosa. Such network-based analysis identified cancer-specific biomarkers, as well as the relationship between deregulated miRNA expression and biological pathways involved in the cancers 14,15 . Since miRNAs play regulatory roles on numerous genes involved in oncology and pharmacology, differences in the levels of circulating miRNAs can contribute to inter-individual variability in response to cancer therapies 16 .
To identify miRNAs that can serve as drug-specific survival biomarkers, we first performed survival analysis on cancer-stratified and drug-stratified subpopulations of patients, and identified miRNAs that significantly correlated with survival outcome in a drug-specific manner. Further, in order to assess the prognostic power of miRNA markers, we investigated the regulatory mechanism of these significant miRNAs by examining their corresponding coding genes. Moreover, we performed a literature survey to assess whether there are existing evidences that supported the correlations between identified miRNAs and drug responses. Finally, we examined several additional breast cancer datasets that provided miRNA expression and survival outcomes, and showed that our drug-specific miRNA survival markers for breast cancer were able to effectively stratify the prognosis of patients in those additional datasets.

Results
Identify miRNAs predictive of survival in cancer-drug groups. We took patients who suffered from the same cancer type and were exposed to the same drug, and assigned them into one cancer-drug group. A patient can be a member of multiple cancer-drug groups, if the patient received multiple different drugs as reflected in the drug treatment records. The cancer-drug groups containing at least 15 patients with miRNA expression data available were selected for the subsequent analysis to identify miRNA markers for drug-specific survival. A total of 110 cancer-drug groups were selected, which involved 23 cancer types and 43 different drugs. The heatmap in Fig. 1 shows the number of patients in each of the 110 cancer-drug groups. The heatmap was generated by using the R (https:// www.r-proje ct. org/, version 4.1.1) 17 package 'ggplot2' 18 .
For each of the 110 cancer-drug groups, we applied survival analysis for the expression data of each of the 1,881 miRNAs measured in TCGA, which amounted to a total of 206,910 miRNA-cancer-drug combinations examined. For each miRNA, patients in the cancer-drug group were stratified into a highly-expressed class and a lowly-expressed class by applying StepMiner 19 to the expression data of the miRNA. Specifically, we sorted the expression data across all patients for an miRNA, and then fitted a step function to the sorted data that minimizes the square error between the original and the fitted values, which provides a global threshold to binarize the expression of the miRNA. Since this threshold is derived based on the data of all patients across all cancer types, it is able to robustly define high and low expression of the miRNA. In order to avoid situations where the sample size is too small to extrapolate significant findings, we only performed survival analysis on those miRNAs with at least 5 low-expressed patients and 5 high-expressed patients within the cancer-drug group, which excluded 16,535 out of 206,910 miRNA-cancer-drug combinations in our analysis. Log-rank test was applied to evaluate whether there is significant difference in survival between the two classes. Benjamini-Hochberg multiple tests were used to calibrate the false discovery rate (FDR). MiRNAs with adjusted FDR < 0.1 were selected as predictive markers whose expression levels were predictive of patients' survival outcome in the corresponding cancer-drug groups. In order to assess whether the identified miRNAs are specifically associated to the drug, we performed the same analysis on all patients in the corresponding cancer type, and only selected the miRNA markers significant in the cancer-drug groups, but not significant in the corresponding cancer type.
In total, we identified 115 significant miRNA-cancer-drug combinations in 44 cancer-drug groups (Table 1) with an FDR threshold of < 0.1, which involved 71 unique miRNAs that were significant in at least one cancerdrug group. All the identified miRNAs significantly associated with survival outcome of specific cancer in a drugspecific manner. We found 10 significant miRNAs in the sarcoma (SARC)-gemcitabine group and 9 significant miRNAs in the ovarian serous cystadenocarcinoma (OV)-paclitaxel group. 31 identified miRNAs were found to be significant in more than one cancer-drug group. These miRNAs are able to predict the survival of patients with different cancers and treated with different drugs, which may be promising biomarkers for drug response in multiple cancer contexts. Among these 31 repeated miRNA markers, two miRNAs were significantly associated with survival in four cancer-drug groups. We observed that the expression level of miR-577 was related to the outcomes of patients with breast cancer (BRCA) and treated with docetaxel, tamoxifen, anastrozole, and NOS, respectively. Another observation was miR-576, which was found to be associated with drug response to fluorouracil, leucovorin in rectum adenocarcinoma (READ), and gemcitabine, docetaxel in SARC.
Literature survey of identified miRNA markers. The significant drug-specific survival miRNA markers included miRNAs that have been previously implicated, as well as the novel ones that have never been mentioned in the literature. The PubMed database was used to search for potentially relevant studies for each of the 115 identified miRNA-cancer-drug combinations, and we have found supportive literature for multiple predic- www.nature.com/scientificreports/ tive miRNAs associate with drug responses in various cancer types. The number of papers linked to each identified miRNA-cancer-drug combination is shown in Table 1. For example, two previous studies 20, 21 have reported that the expression of miR-330-5p is related to the promotion of gemcitabine response, one in the context of pancreatic cancer and the other in the context of colon cancer. Coincidentally, we found that increased miR-330 was associated with prolonged survival outcomes of bladder urothelial carcinoma (BLCA) patients exposed to gemcitabine (log-rank, p-value= 0.003), and worse survival outcomes of SARC patients exposed to gemcitabine (log-rank, p-value< 0.001) (Fig. 2). As another example, our analysis suggested that low expression of miR-296 led to increased cisplatin sensitivity in cervical squamous cell carcinoma and endocervical adenocarcinoma(CESE) (log-rank, p-value= 0.002) and increased gemcitabine sensitivity in SARC (log-rank, p-value= 0.031), as showed in Fig. 3. This is supported by several previous studies on the role of miR-296 in drug response [22][23][24][25] . One of those studies indicated that miR-296 was able to sensitize lung adenocarcinoma (LUAD) cells to cisplatin in vitro and in vivo 2 . The inhibition of miR-296 also resulted in increased chemosensitivity in esophageal carcinoma (ESCA) cell lines to standard chemotherapeutic agents such as 5-fluorouracil and cisplatin. Another study proved that miRNA-296-5p could promote drug resistance by targeting Bcl2-related ovarian killer, leading to a poor prognosis in pancreatic cancer 26 . As a third example, we found that the over-expression of miR-483 resulted in decreased cisplatin sensitivity in BLCA and CESE (log rank, p-value= 0.003; p-value< 0.001) (Fig. 4). This is consistent with previous studies showing that miR-483-5p is significantly associated with cisplatin sensitivity and with overall survival in patients with tongue squamous cell carcinoma (TSCC) 27,28 . In addition, there has been evidence that miR-483-3p expression is increased in platinum drug-resistant ovarian carcinoma cells and indicated an association between increased miR-483-3p expression and cisplatin resistance 29,30 .  www.nature.com/scientificreports/ Correlation between drug-specific miRNAs and their target genes. To examine the correlation between the identified drug-specific miRNAs and their associated genes in the cancer-drug groups where the miRNAs were identified, we performed drug-specific survival analysis on the expression data of genes that are targets of the identified miRNAs. For each of the 115 identified miRNA-cancer-drug combinations, we extracted the gene expression data of their target genes for patients in that cancer-drug group. We binarized the gene expression data using the StepMiner, the same method we used to binarize the miRNA expression data. The target genes were determined according to the experimentally verified miRNA-target gene pairs incorporated from multiple databases [31][32][33][34] . Similar to the analysis of miRNAs, for each cancer-drug group and each gene under   www.nature.com/scientificreports/ consideration, we stratified patients in the cancer-drug group to highly-expressed and lowly-expressed classes according to the binarized data for the gene, and only performed survival analysis on gene-cancer-drug combinations with at least 5 lowly-expressed patients and 5 highly-expressed patients. A total of 8,513 genes targeted by identified miRNAs were examined in the corresponding cancer-drug groups. Log-rank test was performed to examine whether each of the target genes is significantly correlated with survival outcome in the corresponding cancer-drug groups. We applied FDR adjusted significance p-value < 0.01 as thresholds to identify genes whose expression were also predictive of drug-specific survival, in the same context as its associated miRNA. 48 genecancer-drug combinations exceeded the FDR threshold, which involved 42 unique genes and 9 cancer-drug groups. Based on the 48 significant gene-cancer-drug combinations, ∼37% of the identified miRNA-cancer-drug combinations showed significance in their corresponding genes, leading to 59 miRNA-mediated gene-cancerdrug combinations in total, as shown in Table 2. In order to examine the biological role of the 42 target genes, we performed Gene Ontology (GO) 35 enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) 36 pathway enrichment analysis by using The Database for Annotation, Visualization and Integrated Discovery (DAVID) 37 , searching against the entire Homo sapiens genome. We used a p-value threshold of <0.05 to identify enriched processes and pathways. However, due to the small number of target genes identified, this functional enrichment analysis may not have sufficient statistical power to be conclusive, even though it was encouraging to observe the target genes were involved in relevant biological processes including "developmental growth", "beta-catenin-TCF complex assembly", and "negative regulation of transcription from RNA polymerase II promoter", as well as general pathways such as "MicroRNAs in cancer" and "Pathways in cancer". In each of these 59 miRNA-mediated gene-cancer-drug combinations, since both the miRNA and the target gene were predictive of survival difference in the corresponding cancer-drug group, these combinations may reflect the functional role of the miRNA markers in affecting drug response through modulating the expression of their target genes. Interestingly, we observed miR-18a and its target gene SMAD4 were predictive of carboplatin response in HNSC, which was consistent with our previous study on the relationship between SMAD4 and carboplatin response 38 . Additionally, we also identified miR-18a to be predictive of the prognosis of HNSC patients treated with cetuximab. In the literature, it has been showed that SMAD4 down-regulation leads to cetuximab resistance in HNSC cell lines 39 , therefore, we can reasonably hypothesize that the miR18a-SMAD4 axis may be involved in the resistance mechanism of cetuximab response in HNSC.

Prognostic performance of drug-specific miRNAs in independent datasets. The prognostic
power of the identified drug-specific miRNA biomarkers in survival prediction was further tested in several independent miRNA expression datasets generated by microarray platforms. When searching for independent datasets, we considered cancer-drug groups that contained at least 150 patients and produced at least 1 significant miRNA survival marker. Four cancer types were considered, which were BRCA, brain lower grade glioma(LGG), OV and UCEC. After an extensive search on GEO for datasets that contained both miRNA expression data and survival outcome data of these four cancer types, three appropriate datasets were found (GSE40267, GSE19783, and GSE22220), all of which were on BRCA. Using these datasets, we examined the drug-specific miRNAs we identified in BRCA-Doxorubicin, BRCA-Cyclophosphamide, BRCA-Docetaxel, BRCA-Tamoxifen, and BRCA-Anastrozole groups. In these five BRCA-drug groups, 6 drug-specific miRNAs were identified, which were miR-20b, miR-363, miR-628, miR-7.2, miR-577, and miR-3677 as shown in Table 3. Since expression data for miR-7.2 and miR-3677 were not available in these three datasets, we only examined the remaining 4 drugspecific miRNAs for their ability to predict survival outcomes. For each dataset, we performed survival analysis for each of the 4 drug-specific miRNAs. Patients in the datasets were divided into highly-expressed class and lowly-expressed class by the median of expression value of individual miRNAs, and log-rank test was applied to examine the ability of the drug-specific miRNA markers for stratifying patients with different prognosis.
Specifically, in the GSE40267 dataset, stratification of patients by miR-363 expression was significantly correlated with patients' overall survival (OS) (Fig. 5A, log-rank test, p-value = 0.01), where patients with high miR-363 expression had significantly longer overall survival than patients with low expression of miR-363. In the GSE19783 dataset, disease-free survival (DFS) was used as the survival endpoint. High expression of miR-20b and miR-628 were significantly associated with improved survival outcome (Fig. 5B,C, log-rank test, p-value = 0.021 and 0.026, respectively). In the GSE22220 dataset with distant-relapse free survival (DRFS) as the survival endpoint, miR-577 expression was significantly correlated with distant-relapse free survival, where low expression of miR-577 resulted in prolonged survival outcomes of patients (Fig. 5D, log-rank test, p-value = 0.003).
Overall, Fig. 5 shows the Kaplan-Meyer curves comparing highly-vs. lowly-expressed patients according to these 4 drug-specific miRNAs. Among the 4 drug-specific miRNAs examined in the three independent datasets, all of them turned out to be significantly predictive of survival outcomes.

Discussion
This study represents an integrative analysis using TCGA data. By performing survival analysis on miRNA expression data for patients with the same cancer type and exposed to the same drug, we identified several miRNAs significantly associated with survival outcomes in a drug-specific manner. We further investigated the regulatory mechanism of identified miRNAs by examining the relationship between drug-specific miRNA markers and their target genes, which revealed consistencies between the identified miRNAs and their target genes in terms of their expression levels and their correlations with survival outcomes.
For each miRNA-cancer-drug combination being examined, we considered the regulation direction of the miRNA on its target genes, as well as the directions of correlations with the survival outcome. Using the R package 'lm' , we examined the correlation between the expression levels of miRNAs and their target genes in each of the 59 miRNA-mediated gene-cancer-drug combinations in Table 2 www.nature.com/scientificreports/ significant (p-value <0.05). We then examined the consistency of significant miRNAs and their target genes by comparing both their survival directions and regulatory directions. For example, for a given miRNA-mediated gene-cancer-drug combinations where both the miRNA and the target gene were predictive of survival outcome in the same cancer-drug group, we compared the direction of their associations with the survival outcome. If the miRNA inhibits the expression level of the target gene, and high miRNA expression and low gene expression led to better survival, or low miRNA expression and high gene expression led to better survival, then we considered that the miRNA and its target gene were consistent in their directions of correlation with survival   Here, 22 ( ∼69%) of the 32 with inhibition relationship showed consistency in the survival directions with the miRNAs and target genes, which illustrated the consistency between the drug-specific miRNAs and the target genes regarding their correlations with survival outcome. Although the remaining 10 ( ∼31%) showed inconsistent survival direction, this may be because gene expression is not only regulated by miRNAs alone, but can be controlled at various steps,including transcription, pre-mRNA splicing and export, mRNA stability, translation, protein modification, and protein half-life 40 . Some of the identified drug-specific miRNAs (such as miR-296, miR-330 and miR-483) have been previously reported to be potential biomarkers that are highly correlated with drug sensitivity and resistance in various cancer contexts. In several independent datasets on breast cancer, we observed that multiple of our drug-specific miRNA markers for breast cancer (i.e., miR-20b, miR-268, miR-363 and miR-577) were predictive of overall survival and other measures of clinical outcome. Notably, we observed that miR-577 was predictive of survival in multiple BRCA-drug groups ( Table 3), indicating that miR-577 may potentially serve as key biomarkers to multiple drug responses in breast cancer. We found that decreased miR-577 expression leads to prolonged survival outcomes of BRCA patients treated with docetaxel, tamoxifen, or anastrozole. However, there is previously literature indicating that down-regulation of miR-577 was correlated with increased invasion and metastasis in breast cancer 41 . Despite this inconsistency in miR-577's impact on survival between the previous literature and the observations in this study, miR-577 can be a promising biomarker for both predicting clinical outcomes and studying drug mechanisms.
In summary, this study demonstrates that miRNAs can be effective biomarkers predictive of drug-specific survival outcomes in cancer. The miRNA markers we identified are promising and may be useful tools toward personalized medicine.

Methods
Data access. MiRNA expression data and gene expression data were downloaded from TCGA Genomic Data Commons (GDC) using the GDC Data Transfer Tool. The miRNA and gene expression data covered ∼ 11,000 patients across 33 cancer types. Clinical data were also downloaded from GDC, which included patients' drug treatment records and survival outcomes. The drug treatment data contains 10,863 treatment entries for 4,328 patients. The drug names in the treatment records were standardized to remove inconsistency in terms of naming conventions and spelling errors. The standardization was performed based on a manually curated list previously created by our group 8 . The survival data contains the survival outcome for 11,082 patients across 33 cancer types. After removing duplicates in the molecular data and filtering for samples with treatment and survival data, we narrowed down to a total of 9,559 patients across 31 cancer types and 264 unique drugs in this study.
To examine the functional relevance of the identified miRNA markers, we obtained experimentally verified miRNA-target pairs from four databases including mir2Disease, miRecords, TarBase and miRTarBase. The miRNA-target pairs obtained from the four databases include: 96 pairs from mir2Disease, 518 pairs from miRecords, 26388 pairs from TarBase, and 50381 pairs from miRTarBase. These amount to a total of 57,863 human specific miRNA-target gene pairs, involving 14,652 genes and 579 miRNAs.
We also downloaded and log-transformed independent miRNA expression datasets from the Gene Expression Omnibus (GEO, https:// www. ncbi. nlm. nih. gov/ geo/) database to assess the prognostic power of identified drug-specific miRNAs. A total of 490 patients from 3 independent datasets were analysed in this study. Detailed information about these 3 datasets is shown in Table 4. Data preprocessing. The miRNA expression data and gene expression data downloaded from TCGA have been normalized by FPKM-UQ 42 , and we subsequently transformed the expression data by log-transformation. For each miRNA and gene feature, we used StepMiner 19 to compute a global threshold based on all patients across all cancer types. We first sorted the expression data of a given feature for all patients and then fitted a step function to minimize the square error between the original and the fitted values. Since this threshold is derived based on the data of all patients across all cancer types, it is able to robustly define high and low expression. The Table 4. Predictive miRNAs identified in BRCA-drug groups. www.nature.com/scientificreports/ threshold of each miRNA or gene was used to binarize the data, so that the patients could be divided into two classes (highly-expressed class vs. lowly-expressed class) according to the binarized data of an individual miRNA or gene feature (Supplementary files 1, 2, 3, 4).

Survival analysis.
We grouped patients based on cancer type and drug exposure. Patients with a given cancer and exposed to a given drug were assigned into one cancer-drug group. For each miRNA, patients in the cancer-drug group were further stratified into highly-or lowly-expressed classes according to the binarized data of that miRNA. Log-rank test was used to assess the statistical significance of the survival difference between highly-and lowly-expressed classes. P-values of the log-rank test were adjusted for multiple testing using Benjamini-Hochberg method with a false-discovery rate (FDR) <0.1. Kaplan-Meier analysis and log-rank test were performed using the R package 'survival' .
Literature survey. A literature search was performed using PubMed, accessed via the National Library of Medicine PubMed interface (http:// www. ncbi. nlm. nih. gov/ pubmed). We programmatically searched the Pub-Med database with custom Python scripts, using keywords of drugs AND identified drug-specific miRNAs. We searched through PubMed for all keywords in all filed, including the title, abstract and main texts of the articles.

Availability of data and materials
All data used in this analysis can be found at the GDC data portal. The code and processed data were provided as supplementary materials.