MicroRNA and transcription factor co-regulatory networks and subtype classification of seminoma and non-seminoma in testicular germ cell tumors

Recent studies have revealed that feed-forward loops (FFLs) as regulatory motifs have synergistic roles in cellular systems and their disruption may cause diseases including cancer. FFLs may include two regulators such as transcription factors (TFs) and microRNAs (miRNAs). In this study, we extensively investigated TF and miRNA regulation pairs, their FFLs, and TF-miRNA mediated regulatory networks in two major types of testicular germ cell tumors (TGCT): seminoma (SE) and non-seminoma (NSE). Specifically, we identified differentially expressed mRNA genes and miRNAs in 103 tumors using the transcriptomic data from The Cancer Genome Atlas. Next, we determined significantly correlated TF-gene/miRNA and miRNA-gene/TF pairs with regulation direction. Subsequently, we determined 288 and 664 dysregulated TF-miRNA-gene FFLs in SE and NSE, respectively. By constructing dysregulated FFL networks, we found that many hub nodes (12 out of 30 for SE and 8 out of 32 for NSE) in the top ranked FFLs could predict subtype-classification (Random Forest classifier, average accuracy ≥90%). These hub molecules were validated by an independent dataset. Our network analysis pinpointed several SE-specific dysregulated miRNAs (miR-200c-3p, miR-25-3p, and miR-302a-3p) and genes (EPHA2, JUN, KLF4, PLXDC2, RND3, SPI1, and TIMP3) and NSE-specific dysregulated miRNAs (miR-367-3p, miR-519d-3p, and miR-96-5p) and genes (NR2F1 and NR2F2). This study is the first systematic investigation of TF and miRNA regulation and their co-regulation in two major TGCT subtypes.

www.nature.com/scientificreports www.nature.com/scientificreports/ (5,362/6,350) and 93.1% (6,930/7,447) of regulation pairs were TF-target regulations for NSE-specific analysis and SE-specific analysis, respectively. Among them, ~68% were determined to be positively correlated for both NSE-specific analysis and SE-specific analysis (3, www.nature.com/scientificreports www.nature.com/scientificreports/ reduce false positive interactions, a common issue in regulatory analysis at large-scale 43 . Specifically, they are TF repressed FFLs (TRFs), TF activated FFLs (TAFs), and miRNA repressed FFLs (MRFs). As summarized in Table 2, we identified 164 TRFs, 386 TAFs and 114 MRFs in NSE, and 86 TRFs, 163 TAFs, and 39 MRFs in SE. Although they had similar number of significant regulations (6,350 regulation pairs in NSE, and 7,447 regulation pairs in SE), the frequency of FFLs in NSE was more than twice that in SE (Hypergeometric test, p-value = 4.76 × 10 −53 ). For example, 386 out of 664 FFLs were TAFs in NSE while for SE, 163 out of 288 FFLs were TAFs. Besides, we found only a few FFLs shared by these two subtypes, i.e. 1 shared TRF, 16 TAFs, and 5 MRFs.
Common and subtype-specific regulatory networks. Topological properties of regulatory networks. We constructed miRNA and TF mediated regulatory networks in two TGCT subtypes from the identified regulation pairs ( Table 1). The NSE-specific regulatory network contained 194 nodes (44 TFs, 23 miRNAs, and 127 genes) and 834 links while the SE-specific regulatory network had 168 nodes (41 TFs, 35 miRNAs, and 92 genes) and 508 links. The average degrees were 8.5 and 5.9 in these two networks, respectively. Therefore, the NSE-specific regulatory network was more strongly connected than the SE-specific regulatory network. This feature, plus more nodes and edges, indicated that NSE was more complex than SE in its regulatory mechanism. This feature might also reflect more heterogeneous samples of NSE than SE. Since the regulatory networks were directed networks, we investigated their out-degree distribution, in-degree distribution, and clustering coefficient distribution ( Supplementary Fig. S2A,B). We determined that most molecules had small out-degrees and only a few genes had high out-degrees for both NSE and SE specific networks. TFs and miRNAs regulated target genes, and also regulated each other, but only a few TFs and miRNAs regulated a large number of targets. One difference  www.nature.com/scientificreports www.nature.com/scientificreports/ between NSE and SE is that NSE had more nodes with the out-degree greater than 20. The in-degree values were more evenly distributed than those out-degree values. In addition, we determined that their average clustering coefficient distributions were similar. There were only four nodes having an average clustering coefficient greater than 0.2 for both subtypes. We searched the reported TGCT-related genes and miRNAs in related databases, including OMIM 44 , COSMIC 45 , candidate caused TGCT genes from Litchfield et. al. 6 , HMDD 46 , miR2Disease 47 , and PhenomiR 48 . Only 1 gene GAB2 was regulated in our FFLs, 5 and 10 related miRNAs were found in our FFLs for NSE and SE, respectively (Supplementary file S2).
The miRNAs in the miR-183/182/96, miR-302/367 and miR-371-3 clusters were enriched in both NSE and SE. On the other hand, the miRNAs in C19MC cluster were enriched in the NSE type only, and the miRNAs in miR-141/200c cluster were enriched in the SE type only. We investigated the miRNAs at the cluster level. The miR-183/182/96 cluster consisted of miR-96, miR-182, and miR-183, which shared almost identical seed sequences. The miRNAs in this cluster act as oncomiRs across cancer types, including prostate, breast, and ovary cancers 50 . Furthermore, these miRNAs have an important role in regulating major cellular pathways in cancer, including apoptosis, DNA repair, metabolism, and others 50 . In our previous study, we reported that miR-96-5p and miR-183-5p were overexpressed across 12 cancer types (not including TGCT) 26 . In this work, we determined that miR-182-5p and miR-96-5p were significantly overexpressed (with fold-change 3.62 and 3.94, adjusted p-value 6.45 × 10 −8 and 8.45 × 10 −10 , respectively) in the SE samples versus the NSE samples, and they were involved in 149 and 34 FFLs in NSE and SE, respectively. This observation indicated that they might have important regulatory roles in the pathology of TGCT. The miR-302/367 cluster consisted of 5 miRNAs (miR-302a, miR-302b, miR-302c, miR-302d, and miRNA-367), which were demonstrated to have vital roles in various biological processes and cellular signaling pathways 51 . The miRNAs in this cluster were activated by some TFs, including GATA6, POU5F1, NANOG, and SOX2 51 , and were related to TGCT 2,47,52 . The C19MC cluster (chromosome 19 miRNA cluster) and the miR-371-3 cluster are located on chromosome 19 and were involved in stem cell biology and tumorigenesis 53,54 . The miRNAs in the miR-371-3 cluster were biomarkers for TGCT 2,46,47 . In addition, the miR-141/200c cluster, which is part of the miR-200 family, has been reported to be associated with breast cancer 55 , whereas miR-200c-3p was found to be associated with TGCT 47 .

Enrichment analysis of genes in subtype-specific regulatory networks.
We conducted pathway enrichment analysis of the genes from TGCT subtype-specific regulatory networks by using Kyoto Encyclopedia of Genes and Genomes (KEGG) 57 pathway annotations and WebGestalt tool 58 . By setting FDR (Benjamini-Hochberg adjusted p-value) threshold 0.05, we identified 4 oncogenic pathways that were significantly over-represented in both NSE and SE subtypes from the top 10 pathways (Table 3). Furthermore, Wnt signaling (FDR = 0.007), and Calcium signaling (FDR = 0.013) were significantly enriched in NSE subtype 57 . Of note, TGCT is male-specific cancer. For the SE subtype, we observed several relevant pathways were enriched in the context of TGCT, including Transcriptional dysregulation in cancer (FDR = 0.011), and Jak-STAT signaling (FDR = 0.017) pathways. The Jak-STAT signaling pathway is a well-known oncogenic and stemness-related pathway.

Regulatory features of Yamanaka factors in TGCT subtypes. Yamanaka factors include four tran-
scription factors [KLF4, MYC, POU5F1 (OCT3/OCT4), and SOX2]. They are highly expressed in embryonic stem cells. The imbalanceness in their expression (e.g., over-expression) can induce pluripotency in both mouse and human somatic cells 59,60 . The expression of Yamanaka factors have previously been detected in testicular cancer 61 . In addition, two TF-coding genes (POU5F1 and SOX2) are candidate biomarkers in TGCT 56 . Their roles have been reported in testicular cancer 7,8 . This motivated us to explore the regulatory features of the Yamanaka factors in two TGCT subtypes: NSE and SE.
Subtype prediction based on top FFLs. For NSE and SE, we applied Random Forest classifier to each of the top 5 FFLs belonging to each FFL category to classify corresponding experimental or control class label (e.g., NSE or SE here). Using 10-fold cross-validation with 10 repeats, we obtained the classification performance on the samples for each FFL. In our experiment, the majority of the FFLs provided high classification accuracy (>= 90%) and area under the curve (AUC) (>0.9). For example, the FFL (TFAP2C-miR-520d-3p-LYPD6) in the category of NSE TRF produced the highest average accuracy (0.991) as well as the highest AUC (>0.999). FFL ARID5B-miR-367-3p-STARD13 in the category of NSE MRF generated the second highest average accuracy (0.982) as well as the second highest AUC (0.999). FFL NR2F2-miR-141-3p-EPHA2 in the category of SE TAF had the third highest average accuracy (0.979) as well as the third highest AUC (0.998). The details of average sensitivity, average specificity, average precision, average accuracy and AUC scores for the top 5 FFLs of each category are summarized in Supplementary Fig. S3, Supplementary Fig. S4, and Table 4. Since these hub genes in top 5 FFLs were important for the regulatory mechanism of TGCT, we evaluated their regulatory patterns using a validation dataset (GEO GSE99420) 62 below.
Subtype-specific hub regulators and targets for NSE. There were 2 TFs (NR2F1 and NR2F2), 7 miR-NAs (miR-367-3p, miR-519d-3p, miR-520b, miR-520c-3p, miR-520d-3p miR-520e, and miR-96-5p,) and 1 gene (DCAF5) identified as hubs in the top 5 FFLs that were specific for NSE. Only 2 TF-coding genes (NR2F1 and NR2F2) and 3 miRNAs (miR-367-3p, miR-519d-3p, and miR-96-5p) were expressed in the GEO dataset, and their expression patterns were represented in Fig. 4. All these five genes represented the same regulatory patterns in TCGA and GEO datasets (see Supplementary Table S3). Specifically, two miRNAs (miR-96-5p, and miR-367-3p) were up-regulated, whereas the other miRNA is down-regulated. Both of the two TFs were up-regulated in NSE subtype.  Table S3). Among the 5 miRNAs, 3 showed the same regulatory pattern in the two datasets, i.e., miR-200c-3p, and miR-302a-3p were down-regulated and miR-25-3p was up-regulated. In TCGA dataset, miR-141-3p was down-regulated and miR-29b-3p was up-regulated, whereas in the GEO dataset, these two miRNAs showed similar expression levels for both NSE and SE. Of note, miR-200c-3p and miR-302a-3p had stronger molecular signatures when compared to miR-25-3p in SE. Since the TFs were the top four hubs according to their out-degrees ranked from high to low score, they might play vital roles in regulating targets. JUN was down-regulated and KLF4 and SPI1 were up-regulated in both of the two datasets, even though SPI1 was slightly up-regulated in the GEO dataset. While GATA3 was down-regulated in TCGA dataset, it was slightly up-regulated in the GEO dataset. Hence, JUN and KLF4 were likely reliable molecular signatures for SE samples. All four hub genes (EPHA2, PLXDC2, RND3, and TIMP3) were down-regulated in both datasets. By exploring the FFLs in which these hub genes were involved (Supplementary Table S2), we determined that they were regulated by several hub TFs, including SPI1, KLF4, JUN, GATA3, NR2F2, and SOX9. The miRNAs included miR-302a/d-3p, miR-372/373-3p, miR-520a-e, and miR141/200c, all of which have been discussed above.

Discussion
So far, studies have been conducted to characterize the genetic, epigenetic, and molecular mechanisms of TGCT 2-4,6,7 , but not much in regulatory investigation. In this study, we first identified TGCT subtype-specific differentially expressed genes (mRNA and miRNA) 35,36 . Next, we collected TF-target gene pairs using TRANSFAC and miRNA-target gene pairs using four miRNA-target curation databases. Then, we formed FFLs by three categories: TRFs, TAFs, and MRFs. These FFLs were further used to build TF-miRNA-target gene regulatory network in two TGCT subtypes (NSE and SE). Our network analyses (such as detecting the hub nodes for TFs, miRNAs, and genes) and subtype classification analyses pinpointed subset of the FFLs that might have a significant role in the pathogenesis of TGCT subtypes. The TFs, miRNAs, and genes in the top FFLs represented promising molecular signatures in classifying TGCT types. From the dysregulated FFL networks, we assessed that most of the top FFLs could generate higher than 90% average subtype-classification accuracy through Random Forest classifier. Our study generated several SE-specific dysregulated miRNAs (miR-200c-3p, miR-25-3p, miR-302a-3p), SE-specific dysregulated genes (EPHA2, JUN, KLF4, PLXDC2, RND3, SPI1, and TIMP3), NSE-specific dysregulated miR-NAs (miR-367-3p, miR-519d-3p, and miR-96-5p) and NSE-specific dysregulated genes (NR2F1 and NR2F2). Furthermore, we validated the hub molecules using an independent dataset for TGCT. The validation analysis indicated that they had the similar expression patterns. Our FFL based analysis could identify distinct regulatory molecules, their interaction modules, and other features in two TGCT subtypes. www.nature.com/scientificreports www.nature.com/scientificreports/ One important limitation of the study is that the dataset did not include matched control samples. This limitation was due to the original TGCT study by The Cancer Genome Atlas (TCGA), which represented the largest dataset in the field. Therefore, our results only represented the difference in expression and regulation between the two TGCT subtypes, not between TGCT tumors versus controls. Future work should include a more comprehensive understanding of the regulatory mechanisms to further uncover complex diseases like TGCT using additional multiple omics data (e.g., methylation and copy number) and regulatory relations (e.g., enhancer-gene associations). The analytical approaches proposed in this study can be applied to similar data in other cancers or complex diseases.

Materials and Methods
Clinical information. We downloaded TCGA generated TGCT patients' clinical pathological information deposited in Xena database (https://xenabrowser.net/datapages, Accessed date: October 20, 2017). There was a total of 156 samples in the original clinical data file. We filtered the samples by the following two conditions as in our previous study 10 : (1) the age range of the patients was between 18 and 45; and (2) all the samples belonged to NSE or SE were verified histologically. This resulted in 48 NSE samples and 55 SE samples. Subtype-specific differentially expressed genes and miRNAs. Both the mRNA and miRNA expression profiles for the TGCT patient samples were downloaded from TCGA. We filtered the genes and miRNAs using the same procedure as in our previous study 10 . Briefly, for gene expression profile, we removed the genes having a log2-transformed RSEM expression level less than 1 in more than 50% of the samples 10,67 . For miRNA expression profiles, we removed the miRNAs with missing values in more than 10% of the samples, and only retained those miRNAs that had log2-transformed RSEM expression levels greater than 3.46 in more than 10% of the samples 10,68 .
Since the matched normal samples were unavailable in TCGA, we identified the differentially expressed genes and miRNAs between NSE and SE using statistical tool Limma implemented in R package 35,36 . A gene (or miRNA) was considered differentially expressed in NSE samples versus SE samples if they had at least 2-fold change with the adjusted p-value < 0.05. The same applied in the comparison of SE versus NSE. The analysis identified 2,950 genes and 167 miRNAs that were significantly highly expressed in NSE samples (i.e., NSE versus SE) and 1,969 genes and 58 miRNAs significantly highly expressed in SE samples (SE versus NSE).
Transcriptional regulations of TF-gene and TF-miRNA. TRANSFAC is a comprehensive TF-target relation database 37 . We identified TF-gene pairs and TF-miRNA pairs according to the pipeline in previous studies 15,18 using TRANSFAC data (release April 6, 2016). First, we retrieved the promoter region sequences, ranging from −1500 to +500 bp around each transcription start site (TSS) of human genes and miRNAs obtained from UCSC Table Browser 69 . We employed MATCH software 38 to find the binding sites. We applied a pre-calculated stringent threshold to create a high-quality matrix, and we required a core score of 1.00 and a matrix score of 0.95 for each pair. Moreover, we only selected those TF-gene pairs that were conserved among human, mouse and rat.
Post-transcriptional regulations of miRNA-gene and miRNA-TF. We selected three reliable miRNA-target prediction databases, TargetScan 39 (release 7.1, June 2016), miRanda 40 (release August 2010), and PITA 41 (release Thursday, December 09, 2010). Furthermore, we regarded miRNA-target pairs from miRTar-Base 42 (release 7.0, September 15, 2017) in which the data were curated from low and high-throughput experimental procedures. We retained the pairs if they were present in at least two databases, which resulted in the identification of 170,544 miRNA-target pairs having a total of 697 unique miRNAs and a total of unique 12,507 target genes. Among them, a subset of the target genes was denoted as TFs.
Significant transcriptional and post-transcriptional regulations. Before evaluating FFLs in regulatory networks, we defined significant regulations in our study using Pearson's correlation coefficient (PCC)

FFLs in NSE and SE.
Since FFLs are directional, reflecting specific biological regulation, we define FFLs by three subcategories: TF represses FFLs (TRFs), TF activates FFLs (TAFs), and miRNA represses FFLs (MRFs). In the TRF model, a TF activates its target miRNA to repress a target gene indirectly, whereas the same TF also represses the same target gene directly. In the TAF model, a TF represses its target miRNA to repress a target gene indirectly, whereas the same TF activates the same target gene directly overcoming the effect of suppression by the target miRNA. In the MRF model, a miRNA represses its target TF to repress a target gene indirectly, whereas the same miRNA represses the same target gene directly. Of note, these three models represent biologically coherent FFLs 43 . In this study, we formed FFLs from the significantly correlated regulator-target pairs in NSE and SE, separately.
Subtype-specific regulatory network construction and analysis. TGCT type-specific regulatory networks were constructed through integrating the identified FFLs in NSE and SE. We examined common and distinct properties between these two networks. We visualized the networks using Cytoscape, the network visualization software (version 3.7.1, https://cytoscape.org/) 70,71 . We analyzed the topological properties of the regulatory networks with Cytoscape plugin and identified hubs 49 . Validate of hub molecules. The expression patterns of three types of molecules (TF, miRNA and gene) identified as hubs were evaluated using an independent dataset from GEO (ID: GSE99420) 62 . The original study was to find gene signatures for relapse after 2 and 3 years of surveillance of TGCT. It had all the samples belonged to stage I, and could be divided into relapse or non-relapse, as well as NSE versus SE. The expression data was generated by Expression profiling by array platform. We used expression of 30 NSE and 30 SE samples from this dataset.

Subtype classification based on top FFLs.
To evaluate the classification ability of the resultant FFLs in terms of sample classification, we selected the top five FFLs from each category of FFL in NSE and SE subtype, individually. All the participating biomolecules (TF, miRNA and gene) belonging to each FFL were then used as features to perform two-class classification on the samples of the data using Random Forest classifier using R package caTools 72 . We utilized five measures [sensitivity, specificity, precision, accuracy, area under the receiver operating characteristic curve (AUC)] to evaluate the performance 36  Specificity is the true negative rate i.e., the proportion of actual negative test set tuples which are correctly classified. In other words, specificity is the fraction of true negatives to the total number of true negatives and false positives.

TN TN FP
Specificity (2) Accuracy is the proportion of all actual positive and negative test set tuples which are correctly classified, i.e., the fraction of the total number of true positives and true negatives to the total numbers of true positives, true negatives, false positives and false negatives. Precision is the positive predictive rate, i.e., the fraction of the retrieved test tuples that are relevant. In other words, precision is the fraction of the true positives to the total number of true positives and false positives.
= + TP TP FP Precision (4) For the experiment, we applied 10-fold cross-validation by repeating 10 times. Finally, we computed the average score of each evaluation metric.