Identification and validation of an 18-gene signature highly-predictive of bladder cancer metastasis

We found two deviant groups that were unpredictable with clinical models predicting bladder cancer metastasis. The group G consists of patients at high risk of pN+ , but they have pN0. The group P consists of patients at low risk of pN+ , but they have pN+ . We aimed to determine the genetic differences between these two groups. 1603 patients from SEER database were enrolled to build a multivariate model. This model was applied to patients from the TCGA database to distinguish groups G and P. Differentially expressed genes between the two groups were identified. RT-qPCR was used to validate the results in a cohort from FUSCC. Two deviant groups were identified both in the SEER population and the TCGA population. Expression of 183 genes was significantly different between the two groups. 18 genes achieved significant statistical power in predicting lymph node metastasis excluding these two deviant groups. The 18-gene signature outperformed 3 other bladder cancer lymph node prediction tools in 2 external GEO datasets. RT-qPCR results of our own cohort identified NECTIN2 (P = 0.036) as the only gene that could predict metastasis. Our study showed a novel gene screening method and proposed an 18-gene signature highly predictive of bladder cancer metastasis.

With an incidence of approximately 7% and 4% mortality, bladder cancer has become the fourth most common cancer and the eighth most common cause of death in men 1 . In China, 80,500 new bladder cancer cases are expected with 32,900 estimated deaths for both sexes in 2015 2 . Urothelial carcinoma is the dominant histological subtype of bladder cancer, except for in certain areas in Africa and the Middle East 3 . However, despite considerable progress in management of treatment of bladder cancer, 50% of patients eventually develop metastasis [4][5][6] . Furthermore, bladder cancer spreads from the bladder in a predictable stepwise manner to the lymph nodes and then to visceral organs. A total of 80% of patients with pN1 disease experience recurrence of disease, while only 30% have recurrence in those with extravesical and pN0 disease [7][8][9] Lymph node metastasis is a powerful predictor of cancer-specific survival 10 . Therefore, knowledge of nodal status plays a crucial rule in counseling of patients, clinical decision-making, and adjuvant chemotherapy 11,12 .
To date, many prediction models for predicting non-organ confined bladder cancer (pT3-4/N + ) have been created and properly externally validated 9,13,14 . Among these prediction tools, the nomogram developed by Karakiewicz represents the first step at defining objective, systematic, standardized, multivariate models 9 . This nomogram includes transurethral resection (TUR) stage and TUR grade to provide individual pN stage predictions. However, we found that there are two deviant groups, which are unpredictable with clinical parameters in these models. One group is at high risk of pN+, but actually has pN0 (good prognosis group, group G) and the other group is at low risk of pN+, but actually has pN+ (poor prognosis group, group P). Clinicopathological factors cannot predict outcomes in these two deviant groups. Therefore, we hypothesized that there are some genetic differences between the two populations that lead to vastly different outcomes.
In the present study, we built prediction models for pN+ disease based on the Surveillance, Epidemiology and End Results (SEER) database 15 . This study aimed to identify these two deviant groups in The Cancer Genome Atlas (TCGA) database 16 , as well as the gene signatures that are expressed differently between them. Furthermore, we validated the results in a cohort from Fudan University Shanghai Cancer Center (FUSCC).
imental design and main procedures is shown in Fig. 1. The demographic characteristics of the patients in the SEER database are shown in Table 1. Among a total of 1603 patients who were available for construction of the model, 1185 (73.9%) were men and 418 (26.1) were women, with a mean age of 69 years (interquartile range: 61-76 years). The majority of patients (1107, 62.8%) had grade 4 diseases. The median number of examined regional lymph nodes was 11 (interquartile range: 6-18) and 497 (30.9%) patients harbored at least one positive lymph node.
Identification of the deviant groups and genes that were expressed differently between them in the TCGA database. The baseline characteristics of patients in TCGA database are shown in Table 1.
Among a total of 248 patients, the same prediction model was applied and it achieved an area under the curve (AUC) of 0.674 (95%CI: 0.612-0.732) (Supplementary Figure 2). Then, the possibility of pN + disease was generated. The cutoffs of the tertiles for the spectrum of possibilities were 0.321 and 0.417. Thirty-seven and 17 patients were identified in groups G and P respectively (Fig. 2b) and survival outcome was provided (Supplementary Figure 1). The t-test showed that 183 genes were significantly different between the two groups, as shown in the heatmap ( Fig. 3 and Supplementary Table 1). To determine whether these genes could predict pN1 disease in a more common population, we excluded these two deviant groups from the total cohort. We then performed univariate logistic regression analysis with each of the 183 genes. Finally, 18 genes achieved significant statistical power (P < 0.05, Table 3).
Validating and comparing the predictive accuracy of the 18-genes signature with 3 other bladder cancer lymph node prediction tools in 2 external datasets. We performed a receiver-operating characteristic (ROC) curve analysis of the 18-gene signature as well as 3 other signatures [17][18][19] in two Gene Validating the identified genes in the FUSCC cohort. The demographic characteristics of this cohort are shown in Table 1. After the prediction model was applied, a total of 18 and 4 patients were subsequently identified in groups G and P, respectively. Validation using RT-qPCR in the FUSCC cohort showed that NECTIN2 was the only gene that was differently expressed between the two groups (Supplementary Figure 4).

Discussion
The present study provides important insight into the relationship between gene expression profiles and lymph node metastasis. We took advantage of the large cohort in the SEER database to construct an accurate prediction model of lymph node metastasis. Using the gene expression profile in the TCGA database, we identified the two deviant groups (G and P, Fig. 2) with a completely different genetic background, as shown by the gene expression heatmap (Fig. 3). Furthermore, we narrowed down the identified genes from 183 to 18 by examining their predictive ability in a more common population. To the best of our knowledge, for the first time, we combined the two most widely used external databases and studied the different gene expression status between extreme populations whose outcomes could not be predicted by clinicopathological factors. With external validation with RT-qPCR in a consecutive FUSCC cohort, we first give a clue that NECTIN2 might be a trigger for metastasis in bladder cancer pending larger cohorts and basic research. Most importantly, the 18-gene signature we proposed that is highly predictive of bladder cancer metastasis outperformed three other published signatures in another two GEO datasets.
Generally, most of the studies on genetic biomarkers can be classified into one of two categories. In the first approach, some type of machine-learning algorithm is applied to the data, and thereby a panel of biomarkers is obtained. Examples of such approaches include a study carried out by Wang et al. In their study, 57 genes (mRNA levels) were used to classify patients with urothelial cancer at each stage into high or low risk for progression  20 . In another study, a five-gene expression signature was developed with this type of approach to predict progression in T1G3 bladder cancer 21 .
In the second category, the authors had a particular candidate biomarker. They then separated the available patient pool into two groups according to the endpoint that they were interested in. The mean values of the candidate biomarker across each group were generated. The candidate biomarker is considered to have passed one filter for utility if there is a significant difference between these mean values. Biomarkers that are identified with this approach include PTP4A3 22 and FGR3 23 . In such studies, there is the implicit assumption that only the putative biomarker achieves a significant variation between the groups, while other confounders may not be well balanced. Moreover, there may be more than one gene that shows a significant difference in mean values between the studied groups. Therefore, examining one or a few genes in isolation may lead to incorrect conclusions. In the present study, instead of defining criteria, identification of extreme groups was based on objective observations with the assistance of a prediction model. Furthermore, instead of choosing one or a few genes in isolation, we performed narrowing down and validation processes without prejudice.
In the development of Smith's signature, the primary differentially expressed genes were identified simply by comparing patients pN1-3 disease and pN0 disease 19 . Although the resulted signature performed well in external validation cohorts, they didn't take the influence of other clinicopathological factors into consideration, such as the number of lymph nodes examined. The number of lymph nodes examined has been shown a good surrogate  marker of adequate sampling which is associated with the probability of getting true node status 24,25 . In the development of our 18-gene signature, by incorporating the variable 'number of lymph nodes examined' in the logistic regression model, the probability of a positive lymph node gets higher with the increasing number of lymph nodes examined and vice versa. To some extent, we are looking for patients with 'extremely adequate' sampling genes that were significantly (P < 0.05) different between group G (n = 37) and group P (n = 17). Genes are arranged by linkage distance, using unsupervised hierarchical clustering of average expression across genes as illustrated by dendrograms. Groups G and P are represented by black and red squares within the sidebar, respectively.
but have pN0 and patients with 'extremely inadequate' sampling but have pN+. In this manner, we have the best chance to get their true lymph node status which would minimize the influence of including the whole cohort instead of patient only with adequate sampling. MYBL2 and RCC2 has been reported to play a crucial role in epithelial-to-mesenchymal transition (EMT), in which epithelial cells lose their polarity and gain migratory and invasive abilities. It has been proposed that MYBL2 might mediate EMT and cancer cell invasion by upregulates the expression of major EMT regulator SNAIL in breast cancer 26 . RCC2 was also reported to play a pivotal role in lung adenocarcinoma metastasis by inducing EMT via activation of MAPK-JNK signaling 27 . NECTIN2, which was validated in FUSCC cohort, belongs to a family consisting of four Ca 2+ -independent cell adhesion molecules (NECTIN1 to 4) 28 . Recent studies have shown that NECTIN2 also contributes to tumorigenesis. Oshima et al. observed overexpression of NECTIN2 in various cancer tissues and reported that NECTIN2 is a potential target for antibody therapy 29 . However, few of these genes in bladder cancer has been previously studied. Our study only represents the first step towards defining the role of these genes in bladder cancer. Further validation, as well as more basic research, is still required.
The major limitation of this study was its retrospective design and validation was performed in a single center. We were only able to validate NECTIN2 out of the 18 genes probably due to different populations with different genetic background and life styles. Moreover, the different platforms like RNAseq in TCGA and RT-qPCR in our study may also contribute to this problem. Considering the very limited patients in the extreme groups, we included the whole cohort instead of patients only with adequate sampling (number of examined lymph nodes ≥15) to ensure the stability of the generated signature. Despite these limitations, our study provides a useful method to screen genes between two extreme populations. This may help identify the most crucial genes in determining totally opposite outcomes. Most importantly, the proposed 18-gene signature for muscle-invasive bladder cancer patients that is highly-predictive of LN+ would help with selecting patients for neoadjuvant chemotherapy, which would benefit high-risk patients while sparing other patients' toxic effects and delay to cystectomy.

Patients and Methods
Patients and samples. SEER is program that is sponsored by the National Cancer Institute and collects information on patients with cancer in the USA. This database covers approximately 26% of the American population, and it is considered representative of the USA by demographic composition, incidence of cancer, and mortality 15 . In the SEER database, we identified 2761 patients who were diagnosed with muscle-invasive urothelial cell carcinoma of bladder and were treated with radical cystectomy between 2000 and 2010. We excluded 424 patients with no examination of regional lymph nodes, 151 with missing regional lymph node data, and 583 with missing data of tumor size from the analysis. Finally, this process resulted in a cohort of 1603 patients.
The TCGA project aims to assess cancer-causing genome alternations in large cohorts of human tumors with high-throughput genomic technologies 16 . In the TCGA database, data on 408 patients who were diagnosed with muscle-invasive urothelial cell carcinoma of the bladder and were treated with radical cystectomy were collected. Samples comprised RNA sequencing data and clinical information. Of the 408 patients, we excluded 139 with  missing data of the number of lymph nodes, seven with missing data of positive lymph nodes, three with missing pathological T stage, six with a history of neoadjuvant chemotherapy, and five with missing gene expression data. Finally, 248 patients were available for analysis. In external validation of different signatures predicting lymph node metastasis, two previously published datasets (GSE13507 and GSE31684) were retrieved. A total of 164 patients from GSE13507 and 72 patients from GSE 31684 were available for analysis after exclusion of patient with missing lymph node status and other histologic type except for transitional urothelial carcinoma. 3 previously published and appropriately validated lymph node prediction tools [17][18][19] were used to compare the predictive accuracy with our signature.
In validation of our own cohort, we retrospectively recruited a consecutive cohort of 130 patients with muscle-invasive urothelial cell carcinoma of the bladder between 2011 and 2015 from FUSCC. Three patients were excluded because of missing data of lymph node status. Our study was approved by the ethical committee of FUSCC and all experiments were performed in accordance with relevant guidelines and regulations. Each patient provided written informed consent before participation.
RNA extraction and reverse transcription qPCR (RT-qPCR) analysis. For the validation cohort, 127 frozen tissue samples (100 mg) were harvested and ground into a fine powder. TRIzol ® reagent (15596-026; Invitrogen, CA, USA) were used to isolate total RNA. First-strand cDNA was synthesized from total RNA with the PrimeScript RT reagent kit (K1622; Thermo Scientific, MA, USA). SYBR Green real-time PCR assays were subsequently performed using an ABI 7900HT (Applied Biosystems, CA, USA). β-actin was used as the internal reference to normalize the expression level of RNA. The primers were synthesized by Sangon (Shanghai, China). The sequences of these primers are shown in Supplementary Table 4. Data analysis. All statistical analyses were performed by R software (https://www.r-project.org). Categorical data are shown as frequency and percentage, and continuous data as mean and interquartile range. Univariate and multivariate logistic regression analyses were performed to assess the parameters of potential risk factors using the RMS package. The possibility of pN+ was generated in each patient and patients were divided into low-, moderate-and high-risk groups according to the tertiles of the possibilities. Groups G and P were subsequently distinguished.
Level 3 RNAseq data from bladder carcinoma samples were obtained from the TCGA data portal (https://cancergenome.nih.gov) and GEO database (https://www.ncbi.nlm.nih.gov/geo/). Gene expression levels of the two groups were compared by the t-test. The expression status of the identified genes was observed with a heatmap using the pheatmap package (https://CRAN.R-project.org/package=pheatmap). We used logistic regression analysis to identify the relationship of gene expression and lymph node metastasis. Hazard ratios (HR) and 95% confident intervals (95% CI) were calculated. Discrimination ability of classifiers were compared using AUC on ROC curves. Distributions of overall survival were estimated using Kaplan-Meier method. Two-sided P < 0.05 was considered as statistically significant.
Data availability statement. The data used in this study were available publicly.