Genetic co-expression networks contribute to creating predictive model and exploring novel biomarkers for the prognosis of breast cancer

Genetic co-expression network (GCN) analysis augments the understanding of breast cancer (BC). We aimed to propose GCN-based modeling for BC relapse-free survival (RFS) prediction and to discover novel biomarkers. We used GCN and Cox proportional hazard regression to create various prediction models using mRNA microarray of 920 tumors and conduct external validation using independent data of 1056 tumors. GCNs of 34 identified candidate genes were plotted in various sizes. Compared to the reference model, the genetic predictors selected from bigger GCNs composed better prediction models. The prediction accuracy and AUC of 3 ~ 15-year RFS are 71.0–81.4% and 74.6–78% respectively (rfm, ACC 63.2–65.5%, AUC 61.9–74.9%). The hazard ratios of risk scores of developing relapse ranged from 1.89 ~ 3.32 (p < 10–8) over all models under the control of the node status. External validation showed the consistent finding. We found top 12 co-expressed genes are relative new or novel biomarkers that have not been explored in BC prognosis or other cancers until this decade. GCN-based modeling creates better prediction models and facilitates novel genes exploration on BC prognosis.

The risk scores of GCN-based models succeed in predicting RFS of BC. We used a partial Cox hazard proportional regression 27 to compute the risk score of each model. The risk scores were used to predict the RFS of BC in the form of continuous and categorical variables (Fig. 2). All the risk scores significantly predicted the RFS of BC, and HRs ranged from 1.89 ~ 3.32 (p < 10 -8 ) under the control of the node status. The risk score of Model 4 has the best discrimination for whether the high/low risk group developed recurrence (high risk group: HR 3.25, p ~ 0) under the control of the node status ( Fig. 3 and Supplementary Table 3). According to the time-dependent the prediction error and AUCs in Supplementary Figs. 1, 2, model 4 was also the most precise prediction model followed by Model 3, Model 2, Model 5/Model 6, and Model 1, in that order. In summary, the GCN-based models outperformed the reference model, and the larger GCN-based model performed better.
Validation of model prediction using independent data sets. We used the public independent mRNA microarray data of 1056 eligible primary BC tissues from KM plotter websites to do external validation.
The outcomes are concordant with the modeling results. Model 4 has the best AUC 72.1-74.8% on predicting the RFS in 3, 5 and 10 years (Table 4 and Supplementary Fig. 3). All models predict best on 3-year RFS.
GCN-based models outperform with less genetic predictors. We integrated the 46 significant genes from Model 1 to 4 and filtered out the most important genes using the criteria of uni-variable cox hazard proportional regression p < 0.001, HR > 1.5 or HR < 0.7 (= 1/1.5) under the control of the node status. Finally, we chose 12 genes and used stepwise forward cox hazard proportional regression to create Model 5, which was composed of eight genes, including AACS, C10orf5, CCNE2, EEF1E1, IDUA, LMNB1, MGC27165, and RORC. In comparison to the 21-gene model (Model 6) from Chou's study 28 using the same integrated GSE data sets, the GCN-based Model 5 (ACC 66.5-74.4%, AUC 64.1-74.8%) predicted as accurately as Model 6 (ACC 68.1-70.9%, AUC 63.4-74.5%) with only eight genetic predictors and had a better AUC (Fig. 1). This suggested that the GCN-based models not only reduced the dimension of the predictors but also filtered out the most representative and important genetic predictors.
Larger GCNs effectively provide more information on the novel genes related to recurrence. We also confirmed that larger GCNs effectively provided more information on the novel genes related   Predictive pathway of importance novel genes. The importance of the genes related to RFS was assessed by computing the proportion of the chi-square of each gene in each model. We plotted the relative importance of the genes in the order of the sum and average of importance indices (Supplementary Table 5 and Supplementary Fig. 4). Top 12 important genes are selected to plot the predictive pathways (Fig. 4). The function of these genes are clustered into 4 groups "Circadian Cycle (CCNA2, CCNE2, RORC, TIMELESS)", "Peptidase (IDUA, CPZ)", "Immune (MGC27165, FCER1G)" and others (C10orf56, KIF14, EBP, RFC2). The dysregulation of CCNA2 triggers the consequent reaction of other genes and leads to TIMELESS that influence the recurrence of BC.

Discussion
Gene expression profiling of BC has shifted from differentially expressed genes (DEGs) to GCN analyses. GCN analyzes have been shown to aid comprehensive understanding of genomes regulation 10,16,17 . In this study, we proposed GCN-based modeling and SNM to create better prediction models of RFS in BC and explored novel significant co-expression genes. This is the first study to evaluate the prediction of GCN-based models created by various sizes of GCNs. We found that GCN-based modeling from larger GCNs created good prediction gene panels for BC recurrence either in the training and validation datasets (Table 4 and Fig. 3). Genetic predictors selected merely from the DEGs (DEG-based model) may miss key genetic information. GCN-based models can make up for the weakness and increase the prediction accuracy. Though SNM model (Model 5) did not reach the best prediction outcome yet it predicted as accurately as Model 6 (from Chou's study 28 ), which created by our previous study contain 21 genes with less genetic predictors (eight genes). It indicated that GCN-based and SNM modelings are better approaches than DEG-based model (Model 1) for creating good gene panels with fewer genes but higher prediction accuracy.
Analysis of integrated microarray data sets facilitates the gene expression profiling of BC, but the lack of complete clinical characteristics is a big issue. Besides, most of the GCN studies 16,18,[29][30][31] are analyzed on already known biological pathways, the gene-gene interaction from text mining science articles or re-calculation using public genome data. However, those studies sometimes miss the information of unknown GCNs or biomarkers whose function has not yet been identified.
Our GCN-based modeling can make up for these deficiencies. 34 KCGs were selected by choosing the overlapping genes in five BC prognosis-related studies 28,[32][33][34][35][36] . It is suggested that these genes play essential and stable roles in the mechanism of recurrence of BCs. The GCNs of 34 KCGs were assumed to include more significant novel genes related to recurrence even without considering the clinical characteristics. Therefore, even though we only considered one clinical variable node status, the results are still informative.
In  www.nature.com/scientificreports/ and 15 years. All the models applied in the short-term 3-year RFS performed with the best prediction (Fig. 1). GCN-based models effectively provide more novel and significant genetic information related to BC recurrence while the GCNs grow larger. The top 12 important genes from all the models were identified to plot the predictive pathway ( Fig. 4) (7) nucleic acid binding and poly(A) RNA binding (C10orf56) 37 . Through literature review, these genes are biologically related to BC or other cancers 11,[38][39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54][55][56] in line with our bioinformatics findings. It is notable that the role of TIMELSS in the progression of BC has not been well-characterized until these few years [57][58][59][60] . The overexpression of TIMELSS upregulated the expression and the trans-activity of the well-known oncogene MYC. Inhibition of MYC significantly blocked the effects of TIM on cancer stem cell population, cell invasion and anchor-independent cell growth 60 . Therefore, the functional gene-gene interactions in the pathway warrant further study to understand more about the mechanism of BC progression.
CCNA2 is a regulator of the cell cycle 61 . Its overexpression increased BC proliferation 62 and is an early/transient/proliferation response biomarker 63 for the prognosis of ER + BC and the monitoring of tamoxifen efficacy 38 .
CCNE2 has the same function as CCNA2. It might play an important role in acquired trastuzumab resistance in HER2 + breast cancer 40 .
IDUA(Iduronidase, Alpha-L) is associated with visceral organ metastatic disease in breast cancer 49 . CPZ(Carboxypeptidase Z) expression was significantly lower ovarian cancers and may be relevant to the biology of high-grade serous ovarian cancers 66 . But no study discussed its role in BC.
MGC27165(IGHA1) was associated with BC survival time 47 and was suppressed in triple negative breast cancer patients with poor prognosis 11 .
FCER1G (High-Affinity Immunoglobulin Epsilon Receptor Subunit Gamma) Receptors for immunoglobulins [Fc-receptors (FcRs)] are widely expressed throughout the immune system 67 and are related to antibody-based therapeutics, such as trastuzumab 68 .
C10orf56 (Chromosome 10 open reading frame 56, ZCCHC24) participated in tumorigenesis by inhibiting BET family proteins 69 . Its specific methylation pattern affected the expression level and was related to BC subtypes detection 56 .

RFC2 (Replication Factor C Subunit 2)
is only mentioned in one study for its indirect association with BC and involvement in DNA repair 55 .
EBP (emopamil-binding protein) is a human sterol isomerase (hSI) that is associated with a poorer BC disease-free survival 71 . SR31747A (sigma receptor ligand) binds with EBP and other proteins to exhibit antitumoral activity 72 . Few studies discuss its role in BC.
There are many missing data of clinical variables in validation data sets obtained from public database. Only significant lymph node status which had sufficient samples was included while modeling. It's the limitation of the study. However, it will not alter the main findings that bigger genetic co-expression networks are more likely to produce good prediction models. In addition, all the significant identified biomarkers were reported to be associated with BC or other cancers. However, it will be more comprehensive to include as many clinical factors as possible for analysis.

Conclusion
We proposed GCN-based modeling and SNM method to construct more precise models than DEGs-based models with fewer genetic predictors. The results showed that all the GCN-based models outperformed the DEG-based model. Our framework systematically facilitates the discovery of novel co-expressed genes for BC prognosis without prior biological information of genes. We succeeded in finding relatively new co-expressed genes, such as TIMELESS, IDUA, CPZ, MGC27165, C10orf56 and so on that were found to be associated with BC or other cancers until this decade. In general, our GCN-based models and SNM facilitate studies to create prediction models and discover novel biomarkers.

Methods and materials
Microarray data sets. The mRNA microarray data of BC retrieved from Chou' study 28 , including GSE 2034 (n = 286) 32 , GSE 2990 (n = 189) 34 , GSE 4922 (n = 249) 35 , and GSE 7390 (n = 198) 33 of the NCBI GEO, and comprised a total of 922 cases and 13,452 genes, with 354 cases showing recurrence of breast cancer (38%) at the end of follow-up. A total of 111 node-positive cases (12%) and a total of 796 negative cases (86%) were included. The four data sets showed no difference in determining the distribution of recurrence (Supplementary Table 6). The pre-processing of the microarray data was denoted in Chou's study 28 . They used quantile normalization 73,74 to normalize all the mRNA expression values and calculated the median probe expression in a gene to represent the mRNA expression level. GSE7390 33 was used as the reference standard, and the other three data sets were log transformed to fit the former distribution. The workflow of study is shown in Fig. 5      www.nature.com/scientificreports/ tering, coefficient of variation, and Cox hazard proportional regression were computed using the cor, hclust, co.var, and coxph functions. Due to variations in genotype and recurrence, the data were divided into two data sets by recurrence status for analysis. The 34 KCGs were essential nodes in the GCNs. A Spearman's rank correlation analysis was applied on the 34 KCGs and all the other genomes of 13,418 genes. The genes with an r value over the threshold we set were selected and drawn in the GCNs. If the edges, starting from each node (gene) in the GCNs, were over two, only the most associative two were kept in order to identify the most important GCNs.
The regulation of the genetic networks of recurrence versus no recurrence is different. The GCNs for recurrence and no recurrence cases were drawn separately. The regulation of the recurrence of BCs was what we cared most about. Therefore, we plotted 2-4 times the size of 34 KCG GCN of recurrence data with the correlation thresholds of 0.82, 0.80 and 0.79 (Table 2).
Create GCN-based cox hazard proportional regression models and SNM. We obtained four groups of genetic predictors of four GCNs under r thresholds of 0.82, 0.80 and 0.79 to create prediction models of RFS in BCs (Table 2). Stepwise forward cox hazard proportional regression was used to select significant genes which variance inflation factor (VIF) < 10, a chosen significance level for entry (SLE) = 0.08 and the chosen significance level for stay (SLS) = 0.05. We gathered all significant genes from Models 1-4 and filtered out the most important genes based on the criteria of the uni-variable cox hazard proportional regression p < 0.001, HR > 1.5 or HR < 0.7 (1/1.5) under the control of the node status. Using these important genes, We these important genes and stepwise forward cox hazard proportional regression to create model 5. We named the procedure used to create Model 5 as stepwise network modeling (SNM).

Partial cox hazard proportional regression.
To compare the prediction of the various models, we used a partial cox hazard proportional regression for constructing the mutually uncorrelated components of the genetic predictors in each model using the function of PCRf developed by Li and Gui 27 . This method was useful in building a parsimonious predictive model that accurately predicted the survival based on the mRNA expression profile. Predictive components whose p values were less than 0.05 (uni-variable Cox hazard proportional regression) were selected to rebuild a model [as shown in formula (1)]. The risk scores were computed by summing up the multiplication of the coefficient and selecting the component scale in the Cox hazard proportional regression [as shown in formula (2)]. Since the mean of each component scale was zero, we set zero as the cut-off point to categorize the patients into the high/low risk score groups. The Cox hazard proportional regression of the components is written as shown in formula (1). xi is the component extracted by the partial Cox hazard proportional regression. The risk scores were computed by formula (2).

Time-dependent AUC and prediction error.
The area under the curve (AUC) of the ROC curve is a well-established indicator for assessing how well a prediction model performs. AUC ranges from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0; one whose predictions are 100% correct has an AUC of 1 77 . The classical approach of the AUC analysis considers the event (disease) status and marker value to be fixed over time. However, in practice, both the disease status and marker value change over time. Thus, a time-dependent AUC is more appropriate 78 . We evaluated the prediction performance of each model by computing the timedependent AUC and prediction error using the R package of "survAUC" 79 .
Predictive pathway graph. Understanding cause-effect relationships between variables is of primary interest in cancer science. Usually, experimental intervention is used to confirm these relationships, but this can be infeasible because of time and cost. However, Kalisch et al. 80 introduced "pcalg" to effectively explore causal relationships of important biomarkers in BC recurrence. Therefore, we used the mRNA dataset and R packages of "pcalg" and "Rgraphviz" to plot the predictive pathway. The alpha was set at 0.01. Ethics approval and consent to participate. The data used in the article are public available data from NCBI GEO (https:// www. ncbi. nlm. nih. gov/ geo/) and KM plotter (https:// kmplot. com/ analy sis/). According to item two of the regulation "得免取得研究對象同意之人體研究案件範圍(Scope of human research cases exempt from obtaining consent) " (https:// www. mohw. gov. tw/ dl-45112-b708e 126-a9c4-4842-ac83-ba656 1948a 2f. html), it denotes that " Use legally publicly known information and use the information for its publicly known purpose " meets the scope of the exemption.

Data availability
The data sets used during the present study are available from the corresponding author on reasonable request and can be downloaded in GEO Data sets (https:// www. ncbi. nlm. nih. gov/ gds/).