A 14-gene gemcitabine resistance gene signature is significantly associated with the prognosis of pancreatic cancer patients

To identify a gemcitabine resistance-associated gene signature for risk stratification and prognosis prediction in pancreatic cancer. Pearson correlation analysis was performed with gemcitabine half maximal inhibitory concentration (IC50) data of 17 primary pancreatic cancer lines from Genomics of Drug Sensitivity in Cancer (GDSC) and the transcriptomic data from GDSC and Broad Institute Cancer Cell Line Encyclopedia, followed by risk stratification, expression evaluation, overall survival (OS) prediction, clinical data validation and nomogram establishment. Our biomarker discovery effort identified a 14-gene signature, most of which featured differential expression. The 14-gene signature was associated with poor OS in E-MTAB-6134 (HR 2.37; 95% CI 1.75–3.2; p < 0.0001), pancreatic cancer-Canada (PACA-CA) (HR 1.76; 95% CI 1.31–2.37; p = 0.00015), and 4 other independent validation cohorts: pancreatic cancer-Australia (PACA-AU) (HR 1.9; 95% CI 1.38–2.61; p < 0.0001), The Cancer Genome Atlas (TCGA) (HR 1.73; 95% CI 1.11–2.69; p = 0.014), GSE85916 (HR 1.97; 95% CI 1.14–3.42; p = 0.014) and GSE62452 (HR 1.82; 95% CI 1.02–3.24; p = 0.039). Multivariate analysis revealed that the 14-gene risk score was an independent pancreatic cancer outcome predictor in E-MTAB-6134 (p < 0.001) and TCGA (p = 0.006). A nomogram including the 14-gene was established for eventual clinical translation. We identified a novel gemcitabine resistance gene signature for risk stratification and robust categorization of pancreatic cancer patients with poor prognosis.

Pancreatic cancer is one of the most dangerous malignancies worldwide, with a 5-year overall survival of less than 5% and an approximately 6-8-month median survival after diagnosis 1,2 . Surgical resection followed by adjuvant treatment with the nucleoside analog drug gemcitabine is the standard management of pancreatic cancer patients in clinical practice 3 . Moreover, nab-paclitaxel plus gemcitabine (Nab-Gem) and FOLFIRINOX (folinic acid, 5-fluorouracil, irinotecan and oxaliplatin) represent the standard regimens for the first-line treatment of locally advanced and metastatic pancreatic cancer 4,5 . However, the majority of patients treated with gemcitabine chemotherapy eventually develop gemcitabine resistance 6 . Obviously, the effectiveness of gemcitabine chemotherapy is related to the prognosis of pancreatic cancer and therefore could be employed as a baseline for treatment and prognosis improvement of this fatal disease.
Many studies have been carried out to elucidate the possible mechanisms involved in gemcitabine resistance [7][8][9] , which are considered to be related to transport and metabolism behavior and are thought to involve multiple enzymes and signaling pathways, including human concentrative nucleoside transporters (hCNTs), human equilibrative nucleoside transporters (hENTs), deoxycytidine kinase (dCK), and ribonucleotide reductase (RR), or increased activity of the detoxifying enzyme cytidine deaminase (CDA), and PI3K/Akt, MAPK and NF-κB pathways. Based on these findings, changes in multiple genes at the functional and expression levels could be a reasonable phenomenon during the development of gemcitabine resistance. Recent advances in microarray gene chip technology and high-throughput sequencing have resulted in protocols for evaluating the expression level changes of multiple genes at the same time. Moreover, the generation of large public databases with In the present study, by including genes related to gemcitabine resistance in pancreatic cancer, we performed a systematic and comprehensive discovery and validation of biomarkers to identify and generate a gene expression signature for the effective prognostic prediction of patients with pancreatic cancer by using multiple datasets. We identified a novel 14-gene signature comprising genes related to gemcitabine resistance, which could offer excellent accuracy for patient risk stratification and prognosis prediction.
Identification of a 14-gene risk stratification signature for the prognosis of patients with pancreatic cancer and exploration of the expression pattern. Subsequently, we used multiple datasets to validate the 14-gene signature. The results of Kaplan-Meier curve uncovered significantly favorable overall survival in patients with a low risk score from the E-MTAB-6134 (HR 2.37; 95% CI 1.75-3.2; p < 0.0001), PACA-CA (HR 1.76; 95% CI 1.31-2.37; p = 0.00015), PACA-AU (HR 1.9; 95%CI 1.38-2.61; p < 0.0001), TCGA (HR 1.73; 95% CI 1.11-2.69; p = 0.014), GSE85916 (HR 1.97; 95% CI 1.14-3.42; p = 0.014) and GSE62452 (HR 1.82; 95% CI 1.02-3.24; p = 0.039) ( Fig. 2A,D,G,J,M,P) datasets but not in the GSE71729 dataset (HR 0.98; 95% CI 0.63-1.51; Supplementary Fig. 1). Moreover, the prediction efficacy of the 14-gene signature was also evaluated by ROC curves; in most of the datasets, the AUCs for 1-, 3-and 5-year OS was larger than 0.6 and close to 0.7. The highest AUC was found in the E-MTAB-6134 set, and the AUCs for 1-, 3-and 5-year OS were 0.757, 0.716 and 0.717, respectively, while the AUCs for 3-and 5-year OS were relatively low in the TCGA set (0.560 and 0.458, respectively) ( Fig. 2B,E,H,K,N,Q and Supplementary Fig. 1). Detailed information on the risk score and survival status can be found in Supplementary Fig. 2. In addition, we evaluated the expression pattern of these 14 genes in the above datasets, and the heatmap results showed a pattern of differential expression of these genes in patients with high and low risk scores (Fig. 2C,F,I,L,O,R and Supplementary Fig. 1). These results suggested  Supplementary Fig. 3). Further multivariate analysis revealed that the 14-gene risk score was an independent predictor in pancreatic cancer patients from the E-MTAB-6134 (p < 0.001) and TCGA (p = 0.006) datasets but not the GSE62452 dataset (p = 0.241) ( Fig. 3A-C). Since there were relatively large numbers of patients in the E-MTAB-6134 and TCGA datasets, our results verified that our 14-gene signature is quite robust and can be used as an independent factor for survival prediction in pancreatic cancer patients.
The 14-gene signature robustly identifies poor molecular subtypes in pancreatic cancer. Moreover, recently discovered transcriptomic data-based molecular subtypes, including the squamous subtype defined by Bailey 18 , the quasimesenchymal (QM) subtype by Collisson 19 , the basal-like subtype by Moffitt 20 , and the pure basal-like/stroma activated subtype by Puleo 21 , are predicted to have poor survival outcomes among PDAC patients. Due to the incomplete clinical information in some of the datasets, we only verified the effectiveness of our 14-gene signature for molecular subtype prediction in the E-MTAB-6134 and GSE71729 sets, and an AUC value between 0.8 and 0.96 could be achieved by using our signature (Fig. 4).

Establishment of a risk nomogram for survival prediction in pancreatic cancer patients. To
further strengthen the accuracy of the predictive power of our 14-gene signature, we established a ready-to-use and clinically applicable risk nomogram for survival prediction in pancreatic cancer patients from E-MTAB-6134 to assess the performance of this signature via a combination of other univariate significant patient characteristic parameters (e.g., grade and N-stage). As shown in Fig. 5A, a higher total score calculated by adding the assigned numbers of each factor included in the nomogram was associated with worse 3-year and 5-year overall survival rates. For example, a patient with a higher grade and a higher 14-gene score could generate a total of 168 points   To validate the prediction model, the performance of the nomogram was evaluated by the discrimination index calculation and calibration plot adjusting for the 1-year, 3-year and 5-year survival. The accuracy of the predictive nomogram was higher than that of the 14-gene signature alone, with an accompanying C-statistic discriminatory index value of 0.67. This higher C-statistic revealed that the 14-gene signature in combination with sex, tumor grade, resection margin and N-stage was considerably robust in discriminating subjects with variable outcomes. The 284-patient bootstrapped calibration plots for the prediction of 1-year, 3-year and 5-year overall survival rates are shown in Fig. 5B-D. The calibration plots of the 14-gene signature with sex, grade, resection margin and N stage exhibited excellent consistency between the observed outcomes and predicted survival. These results further support the clinical significance of our 14-gene signature in combination with sex, grade, resection margin and N stage, which has superior overall predictive power for distinguishing survival outcomes in pancreatic cancer patients.

Discussion
In addition to the application of conventional clinicopathological parameters or AJCC staging for prognosticating disease progression, molecular prognostic markers with the features of easy detection and simple quantification by standard methods could be employed not only for early diagnosis and prognosis prediction but also for novel therapeutic target discovery. Moreover, the combination of a panel of molecular biomarkers could overcome the hurdles resulting from tumor heterogeneity. In order to strengthen the survival and prognosis of pancreatic cancer patients, elucidate the mechanisms involved in gemcitabine resistance and discover novel diagnostic, prognostic and therapeutic-related biomarkers, we performed a preset study focusing on the relationship of gemcitabine resistance and overall survival. A novel 14-gene signature comprising CCDC148, SH3RF2, CACNA1D, POLD3, PARP1, AP1M2, C4orf19, ANO1, VGLL1, SCEL, INPP4B, NET1, INSIG2 and BVES and a corresponding risk score formula were established for pancreatic cancer risk stratification and prognosis prediction. Patients in the low-risk group had significantly better prognosis than those in the high-risk group. In addition, we confirmed the effectiveness of the 14-gene signature in predicting 1-year, 3-year and 5-year overall survival via AUC calculation. As a widely employed prognosis evaluation modality in clinical oncology, nomograms can integrate different prognostic determinants, including molecular and clinicopathological characteristics 22 . The different probabilities of clinical events can be visualized and calculated into a total score for risk illustration. Compared to conventional staging methods, nomograms may more effectively strengthen clinical decision-making regarding prognosis prediction and thereby be beneficial for implementing timely treatment. The C-index and calibration curves identified that the 14-gene signature was effective for predicting 1-year, 3-year and 5-year overall survival. An integrated nomogram using the 14-gene signature and clinicopathological characteristics was set up for accurate overall survival prediction. As a supplement to AJCC staging and conventional clinicopathological characteristics, the 14-gene signature and nomogram could be employed as useful high-risk indicators and overall survival predictors.
With the advances in gene sequencing technologies, several groups have reported their own mRNA-level prognostic gene signatures that have the ability to predict overall survival in pancreatic cancer. Birnbaum et al.   ) gene signature comprising TNFRSF4,  TNFSF18, TNFSF10, RFC4, PVRL3, PLD4, KDM6B, INHBA, IL32, IL4, IFIT3, FOXP3, CDC20, CD160, and AHR with favorable prediction of poor OS and identification of the molecular subtype of PDAC 24 . Considering that recent progress in combination therapies includes various chemotherapeutic regimens, especially the frequent use of gemcitabine in neoadjuvant and palliative therapies of pancreatic cancer patients, genes related to gemcitabine resistance could be a group of strong candidate genes in the identification of robust prognostic and predictive risk-stratification biomarkers for pancreatic cancer patients. Moreover, plenty of the available clinical and gene expression information from previous studies (including International Cancer Genome Consortium [ICGC], TCGA and GEO) could be used to validate the predictive power of these genes. Among our panel of 14 genes included in the signature, 6 genes were previously studied in pancreatic cancers. PARP-1 plays critical roles in DNA damage repair during intrinsic cell death, and cytoplasmic PARP-1 was recently confirmed to promote pancreatic cancer tumorigenesis and resistance 25 . ANO1 located on amplicon 11q13 is usually amplified in human cancers with poor prognosis and is pivotal in PDAC cell migration 26 . SCEL is a potential biomarker for the prognosis of pancreatic cancer confirmed by Cheng et al 27 . INPP4B, which was first described by Zhai et al., is an oncogenic gene in pancreatic cancer and could serve as a potential diagnostic marker and an independent prognostic marker, suggesting that it is a novel therapeutic target for pancreatic cancer 28 . INSIG2 overexpression is related to the malignant phenotype of pancreatic cancer under hypoxic conditions 29 . BVES, as a novel regulator of the Rac1 and Cdc42 signaling cascades, controls cell shape and movement in multiple cancers, including pancreatic cancer 30 . In addition, CACNA1D was found to be www.nature.com/scientificreports/ expressed on pancreatic islets and play a vital role in insulin secretion, which is involved in β-cell physiology and pathophysiology 31 , and AP1M2 expression was found to be upregulated only in pancreatic cancer 32 . Other genes, such as CCDC148 33 , SH3RF2 34 , VGLL1 35 and NET1 36 , were found to regulate different cell behaviors (such as cell proliferation and migration, cell death and cell cycle) involved in the development and progression of gastric cancer, whereas POLD3 was reported to be required for cell cycle progression and DNA synthesis and to be responsible for the high frequency of genomic duplications in human cancers 37 . No study was found on role of C4orf19 in cancers. Since most of the genes were previously reported to be involved in cancer biology, the findings of our panel are believable and could be useful. Moreover, since the 14 genes were selected using the Lasso regression method, the inconsistency in the expression patterns of the 14 genes at different levels (cell lines, normal vs. cancer, and specimens of different disease stages) could be neglected. In addition, we did not observe statistically significant correlations between the gemcitabine IC50 data and the mRNA expression There are several limitations in this study. First, although datasets from several databases were employed, including the E-MTAB-6134, ICGC, TCGA and GEO databases, the main information on clinical characteristics and gene expression was from white, black or Hispanic populations mainly living in Europe and the United States; therefore, caution should be taken when applying the results to other ethnicities. Second, the establishment and verification of the nomogram were only performed using the E-MTAB-6134 database due to the lack of complete clinical information in other datasets. We did not include a clinical cohort of pancreatic cancer for validation due to failure to obtain enough clinical samples and related follow-up information in a relatively short study period. Third, missing clinical information (age, etc.) in some of the datasets (E-MTAB-6134, GSE85916, GSE62452) resulted in the inability to perform univariate and multivariate prognostic analyses. Fourth, the molecular mechanisms of the genes identified here and their roles in the pathogenesis and progression of pancreatic cancer require further experimental studies. Therefore, sample collection with complete experimental and clinical information should be performed for future validation.
In conclusion, we believe that the 14-gene signature obtained in the present study could provide an efficient platform for better managing patients with this devastating malignancy in clinical practice.
The mRNA expression and related experimental and clinical data of pancreatic cancer samples were downloaded from GEO (http://www.ncbi.nlm.nih.gov/geo/) using the search terms "pancreatic cancer", "gemcitabine resistance" and "expression profiling by array". The gene expression microarray datasets GSE140077, GSE19650, GSE62165, GSE62452, GSE71729, GSE85916 and GSE91035 were selected and downloaded. The criteria for dataset selection were as follows: cell lines with gemcitabine resistance features or clinical samples with detailed clinical, survival and gene expression information. Among these datasets, GSE140077, GSE62165, GSE91035 and GSE19650 were used for validation of the expression level of the 14 identified gemcitabine resistance genes at the cellular and tissue levels, whereas clinical and gene expression data from GSE85916, GSE62452, and GSE71729 were used for validation of 14-gene signature in predicting prognosis. Detailed information on these microarray datasets is listed in Supplementary Table 1.
Microarray gene expression data from E-MTAB-6134 (N = 288; training cohort), transcriptomic data of PACA-CA (N = 214; validation cohort) and PACA-AU (N = 266; validation cohort) from ICGC and transcriptomic data from the TCGA (N = 148; validation cohort) were used for validation of 14-gene signature in predicting prognosis. The clinical and normalized ICGC RNA-sequencing data for PACA-CA and array-based gene expression data for PACA-AU were downloaded from the ICGC Data Portal (https ://dcc.icgc.org/), and the clinical and microarray-based gene expression data from E-MTAB-6134 were downloaded from ArrayExpress (Supplementary Table 2). The basic clinical information of these data is listed in Table 3.
Identification of survival-related genes from gemcitabine resistance-related genes and establishment of the prognostic gene signature. Genes related to gemcitabine resistance were identified with gemcitabine IC50 data of pancreatic cancer cell lines from GDSC and the gene expression data from both the E-MTAB-3610 and CCLE datasets using Pearson correlation with the criterion of p < 0.05 as determined cor.test () function in the R base package. Among the cells with IC50 data, a total of 17 and 15 primary pancreatic carcinoma-derived cell lines were obtained from GDSC and CCLE, respectively, and detailed information on the cell lines used in the present study is listed in Supplementary Table 3. The intersecting genes from the above E-MTAB-3610 and CCLE datasets were then used for survival-related gene screening by the combined use of the clinical data from the E-MTAB-6134 and PACA-CA datasets by univariate Cox analysis with the criterion of p < 0.05. Lasso-penalized Cox regression analysis based on the glmnet package in R was performed with the intersection of the survival-related genes to further minimize the gene numbers in the selected panel while maintaining the best predictive performance using tenfold cross validation. A pancreatic cancer patient prognostic gene signature using the risk score was calculated based on a linear combination of the regression coefficients (β) from the Lasso Cox regression model multiplied by the corresponding expression level of mRNA, www.nature.com/scientificreports/ and an optimal 14-gene combination was obtained using Ln (lambda.min). Risk scores were divided into highand low-risk groups based on the median risk score. The flowchart of the data processing procedure is shown in Fig. 6.

Statistical analysis.
Statistical analysis was carried out using R version 3.6.0. Student's t-test was used for between-group comparisons, while one-way ANOVA was used for multiple-group comparisons. Kaplan-Meier (KM) analysis was carried out to determine survival outcomes. The median values were used as cutoff thresholds to plot the KM curves, and the statistical significance was evaluated by the log rank test (survival and survminer R packages). The survival probability prediction for 1-, 3-and 5-year survival was calculated by receiver operator characteristic (ROC) curves and the area under the curve (AUC) (survivalROC and ROCR R packages). Univariate and multivariate Cox regression analyses with all the available clinical information were performed for survival predictive nomogram construction and validation (forestplot and rms R packages). The hazard ratio (HR) and 95% confidence interval (CI) were employed to confirm genes associated with overall survival. Unless specifically mentioned, P < 0.05 was considered statistically significant. Table 3. Basic information of patient from the included datasets. *The downloaded data from PACA-AU on T stage, N stage and AJCC stage was not complete and the number represents the available data. NA Not Available.   Figure 6. Flowchart of the data processing procedure.