Pancancer survival analysis of cancer hallmark genes

Cancer hallmark genes are responsible for the most essential phenotypic characteristics of malignant transformation and progression. In this study, our aim was to estimate the prognostic effect of the established cancer hallmark genes in multiple distinct cancer types. RNA-seq HTSeq counts and survival data from 26 different tumor types were acquired from the TCGA repository. DESeq was used for normalization. Correlations between gene expression and survival were computed using the Cox proportional hazards regression and by plotting Kaplan–Meier survival plots. The false discovery rate was calculated to correct for multiple hypothesis testing. Signatures based on genes involved in genome instability and invasion reached significance in most individual cancer types. Thyroid and glioblastoma were independent of hallmark genes (61 and 54 genes significant, respectively), while renal clear cell cancer and low grade gliomas harbored the most prognostic changes (403 and 419 genes significant, respectively). The eight genes with the highest significance included BRCA1 (genome instability, HR 4.26, p < 1E−16), RUNX1 (sustaining proliferative signaling, HR 2.96, p = 3.1E−10) and SERPINE1 (inducing angiogenesis, HR 3.36, p = 1.5E−12) in low grade glioma, CDK1 (cell death resistance, HR = 5.67, p = 2.1E−10) in kidney papillary carcinoma, E2F1 (tumor suppressor, HR 0.38, p = 2.4E−05) and EREG (enabling replicative immortality, HR 3.23, p = 2.1E−07) in cervical cancer, FBP1 (deregulation of cellular energetics, HR 0.45, p = 2.8E−07) in kidney renal clear cell carcinoma and MYC (invasion and metastasis, HR 1.81, p = 5.8E−05) in bladder cancer. We observed unexpected heterogeneity and tissue specificity when correlating cancer hallmark genes and survival. These results will help to prioritize future targeted therapy development in different types of solid tumors.

www.nature.com/scientificreports/ treatment modalities. Those patients who are negative for all three markers are designated as triple-negative breast cancer; these patients have generally worse prognoses and conversely need a more aggressive systemic therapy. Establishing prognostic multigene classification protocols can contribute to the understanding of tumor biology and to better prediction of cancer progression and cancer treatment strategies. One important issue is the selection of the proper method for the combination of the genes. First, genes can be utilized independently in a decision tree, where each node can be based on a single gene. Second, when multiple genes are combined, the most widespread approach is to compute their mean expression and to use this new value as a surrogate for the activity of the entire signature. A third option is to combine multiple genes after assigning a different weight to each of them. With breast cancer as an example, such combined signatures are utilized in FDA-approved multigene signature platforms, including the 76-gene signature, 21-gene signature and 70-gene signature platforms; all three of these can predict the prognosis of cancer under different conditions 10-12 . In this study, our goal was to rank established cancer hallmark genes according to their correlation to survival in a large cohort of distinct cancer types. We also aimed to correlate the relevance of each cancer hallmark in each of the available tumor types by assessing the prognostic power of signatures comprising hallmark genes.

Results
Transcriptomic database. The complete dataset of RNA-seq samples with follow-up comprised 9663 specimens from 26 distinct tumor types with breast cancer as the largest (n = 1090) and thymoma as the smallest set (n = 118). Across the entire database, the median follow-up for overall survival (OS) was 24.3 months, and for relapse-free survival (RFS), it was 23.8 months. Most datasets contained both OS and RFS data, with the exception of AML, glioblastoma, melanoma and thymoma, which only had RFS data. Ovarian cancer patients had the highest median OS, while gastric and head and neck cancer patients had the shortest OS (Fig. 1C). In addition, glioma and liver cancer patients had the longest and the shortest median RFS at 23.8 and 6.7 months, respectively (Fig. 1C).
Clinico-pathological characteristics of patients, including stage, grade, sex and race, were available for 6301, 4126, 9720 and 9471 patients, respectively (Table 1). According to the stage, head and neck cancer had the most patients in stage 4, and testicular cancer had the most patients in stage 0 or stage 1. The proportion of patients by tumor grade indicates that an unfavorable high grade was more common in bladder cancer, while a favorable low grade was restricted to head and neck cancer. Sex and ethnicity data of the patients showed that the number of males with cancer is higher than the number of females with cancer and that Caucasians give the majority in the TCGA database ( Table 1).
The strongest cutoff value in the survival analysis. We demonstrate the calculation of the best cutoff via the CDK1 gene in kidney papillary carcinoma and ovarian cancer in Fig. 1A,B. To validate the robustness of CDK1 expression in kidney papillary carcinoma, we performed multivariate survival analysis for OS using the somatic mutation data of 278 renal cancer patients including CDK1 expression and the mutations of the top five mutated genes. These include MET (proportion of patient samples with a mutation in kidney renal papillary carcinomas: 24%), MUC16 (20%), KMT2C (19%), SETD2 (17%) and FAT1 (15%). In the multivariate survival analysis, we found that the association between the CDK1 expression retained its significance (p = 1.55E−07) when including the mutation status of MET (p = 0.952), MUC16 (p = 5.65E−01), KMT2C (p = 0.909), SETD2 (p = 0.04) and FAT1 (p = 0.948) genes.
Prognostic significance of hallmark-associated genes across 26 types of cancer. Cox regression analysis was performed using the RNA-seq expression of 671 cancer hallmark genes. The results of survival analysis across 26 types of cancer for each gene are listed in Supplemental Table S1. We computed the proportion of significant genes in each hallmark and in each tumor type (Fig. 2). Hierarchical clustering was performed to correlate different tumor types and cancer hallmark-associated genes. In this analysis, genes associated with invasion and metastasis activation, genome instability, sustained proliferative signaling and cellular energetics deregulation clustered into separate cohorts (Fig. 2). The top five tumors that contained the highest proportion of established cancer hallmark genes significantly associated with overall survival were kidney renal clear cell carcinoma, low grade glioma, melanoma, thymoma, and liver cancer.

Hallmark signatures and survival in different types of tumors.
The expression signature of hallmark features was determined for each sample, and the prognostic effect of these signatures was investigated in different types of cancer. Significant p values (p < 0.05) are illustrated as forest plots in Fig. 3A.
Of the eight hallmark feature signatures, seven showed a significant association with OS in low grade glioma. On the other hand, lung squamous carcinoma, uterine, ovarian, sarcoma, bladder and esophageal cancer contained only one significant hallmark signature (Fig. 3B).
Tumor mutation burden was also determined, and it showed a significant association with OS in glioma (HR 3.25, p = 6.3E−11), melanoma (HR 0.41, p = 6.5E−10), bladder cancer (HR 0.49, p = 5.6E−06), uterine cancer (HR 0.33, p = 2.5E−05), ovarian cancer (HR 0.69, p = 3.8E−03), stomach cancer (HR = 0.62, p = 4.2E−03) and kidney renal clear cell carcinoma (HR 2.26, p = 2.0E−04) (Fig. 3C). To demonstrate the reliability of these results, we selected breast cancer and performed univariate survival analysis for the significant cancer hallmark signatures using an independent gene expression dataset of 1976 samples obtained from the METABRIC study 13  Genes with the greatest prognostic power in multiple tumor types. In at least ten tumor types, there were 39 genes whose expression was associated with OS ( Fig. 4A). We pinpointed the genes with the highest prognostic power in each cancer hallmark feature: BRCA1 associated with genome instability in low grade glioma (  In addition, multivariate Cox regression analysis was also performed using the expression of the 39 most significant genes and the available clinical variables, including race, sex, age, tumor stage and tumor grade. Of the clinical parameters, age and tumor stage were the variables that reached significance in the Cox model in most tumors (for detailed results, see Supplemental Table S2).
Gene set enrichment analysis. In glioma, the expression of BRCA1, RUNX1, and SERPINE1 were analyzed using GSEA. High expression of BRCA1 was associated with the enrichment of cell cycle checkpoint genes (p < 1E−16) and DNA repair genes (p = 0.038) that have important role in genome instability. High expression of RUNX1 was associated with several proliferation signaling genes such as JAK-STAT (p < 1E−16), KRAS (p < 1E−16) and TGFB (p = 0.007) signaling genes. In patients with high expression of SERPINE1 angiogenesis associated genes (p = 0.02), apoptosis genes (p < 1E−16) and hypoxia related genes (p < 1E−16) were overrepresented.
In cervical cancer, the high expression of E2F1 was associated with the enrichment of tumor suppressor genes such as E2F signaling pathway genes (p = 0.002) and the high expression of EREG was associated with TGF-beta (p < 1E−16) signaling pathway genes.
In renal papillary carcinoma, the high expression CDK1 was associated with the enrichment of apoptosis genes (p = 0.025). In renal clear cell cancer the high expression of FBP1 gene was associated with enrichment of metabolic genes such as fatty acid metabolism (p < 1E−16), reactive oxygen species pathway (p = 0.015), and www.nature.com/scientificreports/ bile acid metabolism (p = 0.002). In bladder cancer, the high expression of MYC was associated with metastasis related genes that takes role in apical junction (p = 0.002) and MYC signaling pathway genes (p = 0.008). Overall, the GSEA identified cancer hallmark gene sets are in line with our previous results.

Discussion
In this study, we examined the prognostic significance of previously established cancer hallmark genes 5 . For the survival analysis, we utilized an RNA-seq database from the TCGA that contains 9720 patients of 26 tumor types with clinical annotations. Kidney renal clear cell carcinoma, low grade glioma and melanoma had the highest proportion of cancer hallmark genes that correlated with survival. Hierarchical clustering analysis showed that some cancer hallmark genes clustered together, such as those involved with invasion and metastasis activation, genome instability, sustained proliferative signaling and cellular energetics deregulation (distance was based on the percentage of significant genes per hallmark in each tumor type). A transcriptomic surrogate signature for each hallmark was also determined; this is based on the means of the average expression of the cancer genes associated with the given hallmark. The prognostic significance of these factors was examined in different types of cancers. Among the eight main hallmark signatures, those associated with oncogene activation, genome instability, cellular energetics, invasion and metastasis and cell death resistance were significant in at least five tumor types.
It is important to mention that in this analysis we did not simply averaged genes whose overexpression worsens the prognosis and those whose loss worsens prognosis. Rather, we use a pre-selected set of genes linked to a single cancer hallmark. Therefore, not the mean of the genes but their relative change influences the final classification. Within a single hallmark, we do not expect to have a perfect negative or positive correlation between the genes, and their mean will be representative for the overall activity of the hallmark. This approach is supported by the observation that many genes have inverse expression patterns-a negative correlation in terms absolute gene expression levels. For example, for CDKN2A and CCND1 this was observed in multiple studies [14][15][16][17] . In case of a negative correlation, exactly those genes should be combined for which www.nature.com/scientificreports/ the higher expression of one is linked to worse prognosis and the low expression of another also leads to worse prognosis. By combining these into a single signature the overall power of detecting the combined effect will increase. Because of the large number of genes involved in each cancer hallmark we believe that the combined signature is satisfactorily robust. Of note, this issue is complicated by the fact that different genes have different correlation to survival in different tumor types. For example both CDKN2A and CCND1 had increase expression in senescent fibroblasts 18 .
Oncogenes have a major role in the control of cell proliferation, differentiation and survival during tumorigenesis. c-MYC was the first characterized oncogene that is activated by chromosome translocation in human Burkitt's lymphomas 19 . Expression of the altered c-MYC gene is increased in tumor cells and is associated with extensive cell proliferation and contributes to tumor development. The association between c-MYC expression and patient survival remains controversial 19 , and we observed a worse prognosis in patients with higher expression of c-MYC. Similar results were present in the case of the ERBB2 gene, which encodes a cell surface protein-tyrosine kinase receptor that is associated with the progression of breast cancer 20 and higher expression of genes in the Wnt-β-catenin pathway. This pathway is mutated in more than 85% of colorectal cancers 21 . β-catenin (CTNNB1) is the most frequently mutated gene, and it can be detected in more than 80% of colorectal tumors. In addition, high expression of CTNNB1 is associated with shorter survival in colorectal cancer 21 . Finally, overexpression of cyclin D1 (CCND1), a member of the cyclin family, also correlated with poor survival in esophageal squamous cell carcinoma 22 .
Chromosomal instability (CIN) and microsatellite instability (MSI) are the two main types of genomic instability in human cancers 4 . The expression of genomic instability-related genes is higher in metastatic samples than in primary tumors 23 . In breast cancer, Habermann et al. performed gene expression profiling in which they examined the correlation between gene expression, genome instability and clinical outcomes 24 and identified a 12-gene aneuploidy-specific signature that is an independent predictor of clinical outcome. In our analysis, the transcriptomic signature consisting of 150 genes contributing to genome instability 5 was prognostic in eight tumors. Among these, high signature expression was associated with poor survival in low grade glioma, liver cancer, kidney papillary cancer, lung adenocarcinoma and sarcoma. In cervical cancer, renal clear cell carcinoma and thymoma, the high expression of the hallmark signature was correlated with a favorable outcome.
Altered energy metabolism involves an increased rate of glycolysis and limited oxidative phosphorylation. These features of proliferating cancer cells enable the retention of macromolecules, which help to drive constitutive cell growth and proliferation 4 . Among the numerous metabolic pathway-associated genes, the high expression of GLUT1, G6PD, TKTL1 and PGI/AMF are significantly correlated with decreased survival in breast cancer 25 . The FAS gene is upregulated at an early stage in multiple cancers, including breast 26 , stomach 27 and prostate cancers 28 ; its expression is positively correlated with poor survival. Our results show that the high expression of the transcriptomic signature of cancer metabolism-associated genes is linked to decreased survival www.nature.com/scientificreports/ in acute myeloid leukemia, head and neck cancers, breast cancer, lung adenocarcinoma and melanoma. However, in kidney renal clear cell carcinoma, kidney papillary cancer and low grade glioma, the high expression of the signature was associated with a better outcome. Epithelial-mesenchymal transition (EMT) is a multistep process that contributes to the migratory and invasive capacity of cells, which are essential for the development and metastasis of cancer 4 . In many types of cancer, including breast and head and neck cancers, developmental EMT pathways such as Notch have been reported to be dysregulated, and activation of these pathways often correlates with poor survival 29 . The suppression of EMT results in the increase of cell proliferation with increased expression of nucleoside transporters in pancreatic tumors. These changes lead to enhanced sensitivity to gemcitabine treatment and increased overall survival in mice 30 . The importance of EMT is supported by our observation that the transcriptomic signature of the tumor invasion and metastasis activation-associated genes 5 had prognostic significance in the highest number of tumors. Among the tumors, the high expression of the signature was linked to poor survival outcome in low grade glioma, liver cancer, acute myeloid leukemia, cervical cancer, head and neck cancers, pancreas cancer, bladder cancer and lung adenocarcinoma.
The resistance of cancer cells to apoptosis is a fundamental aspect of cancer development, which includes the upregulation of antiapoptotic proteins and the downregulation of proapoptotic proteins 31 . The number of gene expression signature studies of apoptotic genes is limited, and studies more commonly reflect on single apoptotic genes. Holleman et al. performed a microarray gene expression study in which they examined the expression pattern of 70 key apoptotic genes in acute lymphoblastic leukemia (ALL) and concluded that leukemia subtypes have a unique expression pattern of apoptosis genes and that select genes are linked to cellular drug resistance and prognosis in childhood B-lineage ALL 32 . Another study investigated 40 genes involved in the extrinsic and intrinsic pathways in myeloma cells, and these genes were linked to poor prognosis and were overexpressed in normal plasmablastic cells 33 . In our study, the cell death resistance signature based on a set of 119 genes 34,35 was linked to poor survival in liver and pancreatic cancers and good survival in melanoma, kidney renal clear cell carcinoma, breast cancer and thyroid cancer.
In brief, RNA-seq-based transcriptomic data were utilized to perform survival analysis across 26 different types of cancer. Strikingly, the signatures constructed from the cancer hallmark genes showed tumor type-specific correlations with survival. Individual cancer hallmark genes showing prognostic significance in more than 10 Table 2. Multivariate Cox regression analysis of hallmark gene signatures after including sex, race, stage, grade and age. Significant p (p < 0.05) and HR values in univariate and both uni-and multivariate survival analyses are bold and italics, respectively. HR values with asterisk (*) shows that there are not any events in one of the groups in the survival analysis*.

Methods
Database setup. All data processing steps and statistical analyses were performed in the R v3.5.2 statistical environment (http://www.r-proje ct.org). The source code are available at GitHub: https ://githu b.com/adam-nagy9 1/panca ncer_survi val_analy sis. RNA sequencing (RNA-seq) data were utilized from the Cancer Genome Atlas (TCGA, https ://cance rgeno me.nih.gov/). Only tumor types with more than 100 cancer specimens were included to ensure a robust sample number in each analysis. The RNA-seq HTSeq count data generated by the Illumina HiSeq 2000 RNA Sequencing Version 2 platform were used in the expression analyses. The "DESeq" package based on the negative binomial distribution was used to normalize the raw count data 36 . The Bioconductor "AnnotationDbi" package (http://bioco nduct or.org/ packa ges/Annot ation Dbi/) was applied to annotate Ensembl transcript IDs with gene symbols (n = 25,228). A second scaling normalization was performed to set the mean expression of all genes in each patient sample to 1000 to reduce batch effects.
For each sample, the preprocessed and annotated Mutation Annotation Format (MAF) data files that were generated by using MuTect2 for variant detection were used to compute the tumor mutation burden. The "maftools" package (http://bioco nduct or.org/packa ges/mafto ols/) was used for the aggregation and visualization of mutation data.
Defining cancer hallmark signatures. Altogether, 671 cancer genes were grouped into eight hallmarks 4 , based on gene assignment to hallmarks as described previously 5 . The surrogate hallmark expression signature was calculated by computing the mean expression of all genes associated with the given hallmark in each tumor sample.
Survival analysis and calculation of the strongest cutoff. Cox proportional hazards regression analysis was performed to examine the correlation between gene expression and overall survival (OS). The "survival" R package v2.38 (http://CRAN.R-proje ct.org/packa ge=survi val/) was utilized to calculate log-rank P values, hazard ratios (HR) and 95% confidence intervals (CI). In addition, the survival differences were visualized by generating Kaplan-Meier survival plots. www.nature.com/scientificreports/ To maximize the sensitivity of the analysis and to uncover any potential correlation to survival independent of a preset cutoff value (e.g., median), we computed each possible cutoff between the lower and upper quartiles of expression. Then, each of these cutoff values was used in a separate Cox regression analysis. The false discovery rate (FDR) was computed to correct for multiple hypothesis testing, and the result was only accepted as significant in the case of FDR < 10%. The best performing cutoff with the lowest p value was used in the final analysis when drawing the Kaplan-Meier plot.
In addition, multivariate survival analysis was performed for the gene expression and clinical features to assess independence from known epidemiological and clinical variables, including race, sex, age, tumor stage and tumor grade. Data visualization. Hierarchical clustering was applied to group and to visualize the survival-associated cancer hallmark genes in different types of cancer using the Genesis software 37 . The "forestplot" R package (https ://CRAN.R-proje ct.org/packa ge=fores tplot ) was used to examine the association of cancer hallmark gene signatures with OS across different types of cancer. The "survplot" R package (http://www.cbs.dtu.dk/~eklun d/ survp lot/) was used to generate the Kaplan-Meier plots.
Gene set enrichment analysis (GSEA). Gene set enrichment analysis (GSEA) 38 was performed for the most significant cancer hallmark genes (Fig. 4B-I). Patients were divided into high and low expression groups based on the expression of the selected gene across all patients within each tumor type. To categorize patients into two groups, we used the same cutoff point also used in the survival analysis. These categories were to designate the "phenotype labels" in the gene set enrichment analysis. The normalized RNA-seq expression and the built in "hallmark cancer genes" sets were used as expression datasets and gene set database, respectively.