Mining The Cancer Genome Atlas gene expression data for lineage markers in distinguishing bladder urothelial carcinoma and prostate adenocarcinoma

For poorly differentiated carcinomas arising at the bladder neck, distinguishing bladder urothelial carcinoma from prostate adenocarcinoma requires a panel of lineage markers. Publicly available gene expression data from The Cancer Genome Atlas (TCGA) provide an avenue to examine the utility of these markers. This study aimed to verify the expression of urothelial and prostate lineage markers in the respective carcinomas and to determine the relative importance of these markers in making this distinction. Expression data for the corresponding genes were downloaded from the TCGA Pan-Cancer database for bladder and prostate carcinomas. Differential expression of these genes was analyzed. Standard linear discriminant analyses were applied to establish the relative importance of these markers in lineage determination and to construct the model that best makes the distinction. All urothelial lineage genes except the gene for uroplakin III were expressed at significantly higher levels in bladder urothelial carcinomas (p < 0.001). In descending order of importance for distinguishing them from prostate adenocarcinomas, the genes for uroplakin II, S100P, GATA3 and thrombomodulin had high discriminant loadings (> 0.3). All prostate lineage genes were expressed at significantly higher levels in prostate adenocarcinomas (p < 0.001). In descending order of importance for distinguishing them from bladder urothelial carcinomas, the genes for NKX3.1, prostate-specific antigen (PSA), prostate-specific acid phosphatase, prostein, and prostate-specific membrane antigen had high discriminant loadings (> 0.3). A combination of gene expressions for uroplakin II, S100P, NKX3.1 and PSA approached 100% accuracy in tumor classification in both the training and validation sets. Based on mined gene expression data, a combination of four lineage markers helps distinguish between bladder urothelial carcinomas and prostate adenocarcinomas.

Histological examination of a carcinoma from transurethral resection specimens, especially from the bladder neck, always triggers diagnostic consideration for the origin of the carcinoma as either bladder or prostate. The distinction is crucial as it impacts further management and prognosis. For advanced bladder urothelial carcinomas, the treatment options include neoadjuvant chemotherapy followed by cystectomy 1 , whereas for advanced prostate adenocarcinomas, the treatment options include radiotherapy and androgen deprivation therapy 2 .
For low-grade carcinomas, distinction between bladder urothelial carcinomas and prostate adenocarcinomas is usually possible based on morphological features. However, for high-grade bladder urothelial carcinomas and prostate adenocarcinomas, conclusive distinction based on morphology alone is difficult due to overlapping morphological features between these two types of carcinomas. In such cases, immunohistochemistry is performed, employing a panel of antibodies to interrogate the presence of certain proteins that act as urothelial lineage or prostate lineage markers 3 . A number of urothelial lineage markers such as GATA3 and p63, and prostate lineage markers such as prostate-specific antigen (PSA) and prostate acid phosphatase (PAP) are routinely used, acknowledging the variable sensitivities and specificities of these markers 4 .
For the past decades, the joint effort between the National Cancer Institute and the National Human Genome Research Institute has uncovered the genomic profiles of different types of cancers via large-scale genome sequencing and integrated multi-dimensional analyses. In particular, the Pan-Cancer analysis project under The Cancer Genome Atlas (TCGA) research network incorporates datasets across tumor types as well as across platforms by broad normalization efforts, enabling analyses for commonalities, differences and emergent themes 6 . Capitalizing on the publicly available transcriptomic data for bladder urothelial carcinomas and prostate adenocarcinomas, firstly, this study aims to verify that genes corresponding to urothelial lineage and prostate lineage markers employed in diagnostic immunohistochemistry are indeed significantly expressed in the corresponding groups of carcinomas. Secondly, this study aims to establish the relative importance of expressions of these genes in distinguishing between bladder urothelial carcinomas and prostate adenocarcinomas.
Lastly, a model incorporating expressions of urothelial lineage and prostate lineage genes is constructed to best distinguish between bladder urothelial carcinomas and prostate adenocarcinomas.
Heat maps of these genes were drawn in Xena Browser. Differential gene expression analyses between these two groups of carcinomas were performed using RNA-seq data in units of log(TPM + 0.001). Graphical display was done in R version 4.0.3 with the ggplot2 and ggpubr packages 8,9 . Welch's t-test was applied in SPSS version 24.0. To address the multiple testing problem, the significance level α was adjusted by the Bonferroni correction (α corrected = 0.05/14 tests = 0.003) 10 .
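The testing step described above can be illustrated with a minimal sketch. It assumes plain Python with the standard library rather than SPSS, and the expression values are invented toy numbers, not TCGA data. The function names `welch_t` and `bonferroni_alpha` are hypothetical helpers introduced only for this illustration. The sketch computes the Welch t statistic with Welch-Satterthwaite degrees of freedom and the Bonferroni-adjusted significance level for 14 tests.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return Welch's t statistic and the Welch-Satterthwaite degrees of freedom."""
    # Squared standard errors of each group mean (unequal variances allowed).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


def bonferroni_alpha(alpha, n_tests):
    """Divide the family-wise significance level by the number of tests."""
    return alpha / n_tests


# Toy log-expression values for one gene (assumed for illustration only).
bladder = [5.1, 6.3, 5.8, 6.0, 5.5]
prostate = [1.2, 0.8, 1.9, 1.1, 1.4]

t_stat, dof = welch_t(bladder, prostate)
alpha_corrected = bonferroni_alpha(0.05, 14)
print(f"t = {t_stat:.2f}, df = {dof:.2f}, corrected alpha = {alpha_corrected:.4f}")
```

A p-value would then be read from the t distribution with `dof` degrees of freedom and compared against `alpha_corrected`; that lookup is omitted here to keep the sketch dependency-free.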
The cases were randomly divided into a training set (about 70% of cases) and a validation set (the remainder) using randomly generated Bernoulli variates with probability parameter 0.7. To determine which gene expressions best distinguish between bladder urothelial carcinomas and prostate adenocarcinomas, standard linear discriminant analysis was performed on the training set and then validated on the validation set in SPSS version 24.0.
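A Bernoulli split of this kind can be sketched as follows. This is an illustrative Python version, not the SPSS procedure used in the study. The function name `bernoulli_split`, the seed, and the case identifiers are assumptions. Each case is assigned independently, so the training fraction is only approximately 70%.

```python
import random


def bernoulli_split(case_ids, p_train=0.7, seed=0):
    """Assign each case to training with probability p_train, else validation."""
    rng = random.Random(seed)  # fixed seed (an assumption) makes the split reproducible
    train, validation = [], []
    for case_id in case_ids:
        if rng.random() < p_train:
            train.append(case_id)
        else:
            validation.append(case_id)
    return train, validation


# 902 placeholder identifiers, matching the 407 + 495 samples reported in the Results.
cases = [f"case_{i}" for i in range(902)]
train, validation = bernoulli_split(cases)
print(len(train), len(validation))
```

Because the draw is made per case rather than by shuffling and cutting, the two set sizes vary from run to run unless the seed is fixed.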

Results
A total of 407 bladder urothelial carcinoma samples and 495 prostate adenocarcinoma samples were included in this study. Relevant clinical data of these bladder and prostate carcinoma samples are summarized in Table 1.
A heat map was drawn for expressions of genes corresponding to the urothelial lineage markers for both bladder urothelial carcinomas and prostate adenocarcinomas (Fig. 1A). The corresponding genes for GATA3, uroplakin III, thrombomodulin, p63, CK5/6, S100P and uroplakin II are GATA3, UPK3A, THBD, TP63, KRT5, S100P and UPK2, respectively. For CK5/6, only KRT5 gene expression was included. Similarly, a heat map for expressions of genes corresponding to the prostate lineage markers was drawn (Fig. 1B). The corresponding genes for PSA, PSAP, P501S, PSMA, NKX3.1, AR and AMACR are KLK3, ACPP, SLC45A3, FOLH1, NKX3-1, AR and AMACR, respectively. Figure 2 displays the boxplots of urothelial and prostate lineage gene expressions, comparing bladder urothelial carcinomas with prostate adenocarcinomas. All urothelial lineage genes had significantly higher expressions in bladder urothelial carcinomas except UPK3A, which had significantly higher expression in prostate adenocarcinomas than in bladder urothelial carcinomas (all p < 0.001). All prostate lineage genes had significantly higher expressions in prostate adenocarcinomas as compared to those in bladder urothelial carcinomas (all p < 0.001).
Standard discriminant analysis was used to see if the model could predict the group membership of the dependent variable of either bladder urothelial carcinoma or prostate adenocarcinoma based on urothelial lineage gene expressions except UPK3A. This was first analyzed in the training set and then validated in the validation set. Table 2 shows the hit ratios for the training set and the validation set; predictive accuracies of the model for the training set and the validation set were 93.1% and 93.6% respectively. In descending order of importance for the urothelial lineage gene expressions, UPK2, S100P, GATA3 and THBD were the most important predictors for bladder urothelial carcinoma based on the discriminant loading > 0.3 (Tables 3, 4).
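To make the notion of a discriminant loading concrete, the following is a minimal two-group, two-variable sketch in plain Python. It is not the SPSS implementation used in the study. The data are invented toy values, the helper names (`pooled_cov`, `fit_lda`, `predict`) are hypothetical, and equal prior probabilities are assumed for the cutoff. Loadings are computed here as pooled within-group correlations between each variable and the discriminant score, which is one common definition and is assumed to correspond to the loadings reported in the tables.

```python
import math
from statistics import mean


def pooled_cov(g0, g1):
    """Pooled within-group covariance terms (sxx, sxy, syy) for rows of (x, y)."""
    sxx = sxy = syy = 0.0
    for rows in (g0, g1):
        mx = mean(r[0] for r in rows)
        my = mean(r[1] for r in rows)
        for x, y in rows:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dof = len(g0) + len(g1) - 2
    return sxx / dof, sxy / dof, syy / dof


def fit_lda(g0, g1):
    """Return discriminant weights, classification cutoff and loadings."""
    sxx, sxy, syy = pooled_cov(g0, g1)
    m0 = (mean(r[0] for r in g0), mean(r[1] for r in g0))
    m1 = (mean(r[0] for r in g1), mean(r[1] for r in g1))
    d = (m1[0] - m0[0], m1[1] - m0[1])
    det = sxx * syy - sxy ** 2
    # Weights w = S^-1 (m1 - m0), using the closed-form 2x2 inverse.
    w = ((syy * d[0] - sxy * d[1]) / det, (sxx * d[1] - sxy * d[0]) / det)
    # Equal priors (an assumption): cutoff is the midpoint of the projected means.
    cutoff = 0.5 * (w[0] * (m0[0] + m1[0]) + w[1] * (m0[1] + m1[1]))
    # Within-group covariance of each variable with the score, and score variance.
    cov_xz = (sxx * w[0] + sxy * w[1], sxy * w[0] + syy * w[1])
    var_z = w[0] * cov_xz[0] + w[1] * cov_xz[1]
    loadings = (
        cov_xz[0] / math.sqrt(sxx * var_z),
        cov_xz[1] / math.sqrt(syy * var_z),
    )
    return w, cutoff, loadings


def predict(row, w, cutoff):
    """Classify a row as group 1 if its discriminant score exceeds the cutoff."""
    return 1 if w[0] * row[0] + w[1] * row[1] > cutoff else 0


# Toy expression values for two hypothetical genes; the first separates the groups well.
g0 = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.8), (2.2, 2.5)]
g1 = [(5.0, 2.2), (6.0, 3.0), (5.5, 1.9), (6.3, 2.8)]

w, cutoff, loadings = fit_lda(g0, g1)
print("loadings:", [round(v, 3) for v in loadings])
```

In this toy example the first variable carries most of the separation and therefore has the larger loading, which is the sense in which a loading above 0.3 marks a variable as an important predictor.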
Similarly, standard discriminant analysis was performed based on prostate lineage gene expressions to see if the model could predict the group membership of the dependent variable of either bladder urothelial carcinoma or prostate adenocarcinoma. Table 5 shows the hit ratios for the training set and the validation set; predictive accuracies of the model for the training set and the validation set were 99.8% and 100.0% respectively. In descending order of importance for the prostate lineage genes, NKX3-1, KLK3, ACPP, SLC45A3 and FOLH1 were the most important predictors for prostate adenocarcinoma based on the discriminant loading > 0.3 (Tables 6, 7).
Standard discriminant analysis was performed based on the two most important urothelial lineage genes and the two most important prostate lineage genes to see if the model could predict the group membership of the dependent variable of either bladder urothelial carcinoma or prostate adenocarcinoma. Table 8 shows the hit ratios for the training set and the validation set; predictive accuracies of the model for the training set and the validation set were 99.8% and 100.0% respectively. The prostate lineage genes NKX3-1 and KLK3 appeared to be more important predictors than the urothelial lineage genes UPK2 and S100P (Tables 9, 10).
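The hit ratio reported in the classification tables is the proportion of cases whose predicted group matches the actual group. A small sketch of that calculation is given below. It uses invented labels rather than the study's results, and a hypothetical helper named `hit_ratio`.

```python
from collections import Counter


def hit_ratio(actual, predicted):
    """Return the fraction of correct classifications and the cross-tabulation."""
    table = Counter(zip(actual, predicted))  # keys are (actual, predicted) pairs
    hits = sum(n for (a, p), n in table.items() if a == p)
    return hits / len(actual), table


# Toy labels (assumed): one bladder case is misclassified as prostate.
actual = ["bladder"] * 4 + ["prostate"] * 6
predicted = ["bladder", "bladder", "bladder", "prostate"] + ["prostate"] * 6

ratio, table = hit_ratio(actual, predicted)
print(f"hit ratio = {ratio:.1%}")
```

Computing the ratio separately for the training and validation sets, as the study does, shows whether the accuracy holds on cases not used to fit the model.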

Discussion
To distinguish urothelial carcinomas from prostate adenocarcinomas, many studies have employed immunohistochemistry to investigate the use of several lineage markers. GATA3, uroplakin III, thrombomodulin, S100P, and uroplakin II are commonly recommended as urothelial lineage markers 5 . In addition, urothelium expresses squamous cell-associated markers such as CK5/6 and p63; expression of these markers helps distinguish urothelial carcinomas from adenocarcinomas 5 . This study showed that genes corresponding to these urothelial lineage markers, with the exception of UPK3A, were indeed expressed at significantly higher levels in urothelial carcinomas than in prostate adenocarcinomas. Surprisingly, the gene for uroplakin III, UPK3A, was more highly expressed in prostate adenocarcinomas than in urothelial carcinomas. In contrast, no expression of uroplakin III was observed by immunohistochemistry in prostate adenocarcinomas across many studies [11][12][13][14] , yielding a specificity of 100% in determining the origin as the bladder. This discrepancy between UPK3A transcripts and uroplakin III protein expression in the prostate has been previously documented 15 . The presence of UPK3A transcripts in the absence of uroplakin III protein is likely related to interactions between UPK1B gene expression and translation of UPK3A transcripts 15 .
Standard discriminant analysis in this study demonstrated that, in descending order of importance for the urothelial lineage markers, UPK2, S100P, GATA3 and THBD were the most important predictors for urothelial carcinoma by gene expression. These results corroborate studies in which expression of these urothelial lineage markers was examined immunohistochemically 12,14,16,17 . Among these, GATA3 has been widely studied as a urothelial lineage marker and has a wide range of sensitivities (67-100%) across different studies 16 . Although most studies reported 0% staining in prostate adenocarcinomas, GATA3 generally lacks specificity because a variety of other tumors express this protein, especially breast carcinomas, cutaneous basal cell carcinomas, and trophoblastic and endodermal sinus tumors 18 . The corresponding protein for UPK2, uroplakin II, is a relatively new marker for urothelial lineage. The reported sensitivities and specificities for uroplakin II to differentiate urothelial carcinomas from prostate adenocarcinomas were 66-78% and 95-100%, respectively 12,[19][20][21] . For S100P, the sensitivities and specificities were 71-100% and > 95% respectively in cases where antibody clone 16 was used 16 . Thrombomodulin has been used as a urothelial lineage marker with sensitivities of 46-81% and specificities of 95-100% to differentiate from prostate adenocarcinomas 16,17 . Thrombomodulin also stains a small number of carcinomas from the lung, breast, ovary, and pancreas 14 .
On the other hand, recommended prostate lineage markers are PSA, PSAP, P501S, PSMA, NKX3.1, AR, and AMACR 4 . This study confirms that genes corresponding to these prostate lineage markers were indeed expressed at significantly higher levels in prostate adenocarcinomas than in urothelial carcinomas. Standard discriminant analysis in this study demonstrated that many of the prostate lineage marker genes were important predictors; in descending order of importance, these were NKX3-1, KLK3, ACPP, SLC45A3 and FOLH1, which correspond to NKX3.1, PSA, PSAP, P501S, and PSMA respectively. Among these, PSA is a sensitive and specific marker for the prostatic lineage, with sensitivities and specificities of 85-100% and 88-100%, respectively, to differentiate from urothelial carcinomas 17 . PSAP is another conventional prostate lineage marker with high sensitivities and specificities of 92-95% and 81-100% respectively 17 . PSMA also has a similar range of sensitivities (87-100%) and specificities (83-100%) as a prostate lineage marker 3,17,22 . However, PSMA is also expressed in a few other tumor tissues such as squamous cell carcinomas and adenocarcinomas from stomach, colon and pancreas 22 . NKX3.1 and P501S are relatively newer prostate lineage markers. Sensitivities and specificities for NKX3.1 were 69-100% and 99-100%, and for P501S were 94-100% and 99-100%, respectively 3,17,23 . NKX3.1 is especially useful as it is expressed in many PSA-negative prostate adenocarcinomas 24 . This study showed that with a combination of the four lineage markers with the highest discriminant loadings, i.e. UPK2 and S100P for urothelial lineage and NKX3-1 and KLK3 for prostate lineage, classification accuracy in both the training set and the validation set approached 100%. Importantly, the prostate lineage genes took precedence over urothelial lineage genes as major predictors. A combination of NKX3.1, PSA, uroplakin II and S100P is therefore proposed as the favored immunohistochemical panel to resolve the dilemma of distinguishing between bladder urothelial carcinomas and prostate adenocarcinomas.
This is in line with the recommendations of the International Society of Urologic Pathology that a combination of both lineage markers should be applied in such a scenario, with more weight given to prostate lineage markers 4 .
A few limitations of this study are acknowledged. Although findings of this study generally support the results of the previous studies, this study employed gene expression data of tumor tissue as compared to the visual evaluation of the lineage markers expressed on tumor cells by immunohistochemistry. Thus, discrepancy in expression between gene transcripts and proteins may arise as quantification of transcripts is dependent on tumor cellularity in the tumor tissue. Furthermore, in this study, 5.2% of bladder urothelial carcinomas were low grade and 9.1% of prostate adenocarcinomas had a Gleason score of six. Inclusion of these low-grade carcinomas, as retrieved from the public databases, differs from studies focusing on high-grade carcinomas. Nevertheless, the findings of this study should remain valid as total loss of expression of all lineage markers in high-grade carcinomas is a rare event. Although this study readily provides a combination of four lineage gene expressions as an algorithm to resolve the distinction between bladder urothelial carcinomas and prostate adenocarcinomas, transition to application by immunohistochemistry in routine diagnostic practice requires future validation.

Conclusions
By mining TCGA expression data for urothelial and prostate lineage markers, this study establishes that, in descending order of importance, the genes for uroplakin II, S100P, GATA3 and thrombomodulin are the most important urothelial lineage markers for distinguishing bladder urothelial carcinoma from prostate adenocarcinoma. In descending order of importance, the genes for NKX3.1, PSA, PSAP, P501S and PSMA are the most important prostate lineage markers. Classification of a carcinoma of either bladder urothelial carcinoma or