Integrating ensemble systems biology feature selection and bimodal deep neural network for breast cancer prognosis prediction

Breast cancer is a heterogeneous disease. To guide proper treatment decisions for each patient, robust prognostic biomarkers, which allow reliable prognosis prediction, are necessary. Gene feature selection based on microarray data is an approach to discover potential biomarkers systematically. However, standard pure-statistical feature selection approaches often fail to incorporate prior biological knowledge and select genes that lack biological insight. Moreover, because of the high-dimensionality, low-sample-size nature of microarray data, selecting robust gene features is an intrinsically challenging problem. We therefore combined systems biology feature selection with ensemble learning in this study, aiming to select genes with biological insight and robust prognostic predictive power. In addition, to capture breast cancer's complex molecular processes, we adopted a multi-gene approach to predict prognosis status using deep learning classifiers. We found that all ensemble approaches improved feature selection robustness, with the hybrid ensemble approach leading to the most robust result. Among all prognosis prediction models, the bimodal deep neural network (DNN) achieved the highest test performance, further verified by survival analysis. In summary, this study demonstrates the potential of combining ensemble learning and bimodal DNN in guiding precision medicine.


B. Systems biology feature selector
The input samples were divided into two groups according to the assigned binary split criterion, for example, ER+ samples and ER− samples. Genes without significant differential expression between the two groups were excluded by ANOVA. Next, we constructed an interaction network for each group based on the interaction information documented in the BioGRID database. We used gene expression data to estimate the interaction abilities between genes and to exclude false-positive links. The constructed interaction networks are therefore disease-specific and tailored to the input group of samples.
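The screening-and-estimation pipeline above can be sketched in outline. The following is a minimal numpy illustration on synthetic data: a fixed F cutoff stands in for the ANOVA p-value screen, ordinary least squares stands in for the LMMSE estimator, and the AIC/t-test model selection step is omitted; all function names are hypothetical.

```python
import numpy as np

def anova_filter(expr, labels, f_crit=4.0):
    """Two-group one-way ANOVA screen: keep genes whose F statistic exceeds
    f_crit. expr is genes x samples; a real pipeline would convert F to a
    p-value instead of using a fixed cutoff."""
    g0, g1 = expr[:, labels == 0], expr[:, labels == 1]
    n0, n1 = g0.shape[1], g1.shape[1]
    m0, m1, grand = g0.mean(1), g1.mean(1), expr.mean(1)
    ss_between = n0 * (m0 - grand) ** 2 + n1 * (m1 - grand) ** 2
    ss_within = ((g0 - m0[:, None]) ** 2).sum(1) + ((g1 - m1[:, None]) ** 2).sum(1)
    f = ss_between / (ss_within / (n0 + n1 - 2) + 1e-12)
    return f > f_crit

def interaction_matrix(expr, partners):
    """Estimate interaction abilities a_ij by regressing each gene on its
    documented partners (least squares as a stand-in for LMMSE)."""
    A = np.zeros((expr.shape[0], expr.shape[0]))
    for i, js in partners.items():
        coef, *_ = np.linalg.lstsq(expr[js].T, expr[i], rcond=None)
        A[i, js] = coef
    return A

def prv(A_pos, A_neg):
    """Prognosis relevance value: per-gene sum of |a+_ij - a-_ij|."""
    return np.abs(A_pos - A_neg).sum(axis=1)
```

Genes can then be ranked by their PRV, with the highest-scoring genes taken as candidate prognosis biomarkers.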
The main assumption of the network construction method is that the expression level of a gene is affected by other genes and can thus be represented as a linear combination of the expression levels of its interaction partners:

    x_i[n] = \sum_{j \in G_i} a_{ij} x_j[n] + \epsilon_i[n], \quad i = 1, \ldots, M, \; n = 1, \ldots, N

where x_i[n] is the expression level of gene i for patient n; a_{ij} is the interaction ability between genes i and j; G_i is the set of genes related to gene i according to the BioGRID database; \epsilon_i[n] is the model residual; M is the number of genes left after excluding those without differential expression by ANOVA; and N is the sample size. The interaction abilities were estimated by linear minimum mean square error (LMMSE) estimation. Afterwards, we performed model selection and excluded false-positive interactions through the Akaike information criterion (AIC) and t-tests. If a_{ij} was not equal to a_{ji}, we took the one with the larger absolute value as the final interaction ability between genes i and j. After calculating all the interaction abilities, we obtained the final interaction ability matrix A. By constructing interaction networks for the two sample groups (e.g., ER+ and ER− samples), we obtained A^+ and A^- for each group. We define the difference matrix D as

    d_{ij} = | a^+_{ij} - a^-_{ij} |

where d_{ij} is the difference in interaction ability between genes i and j. The prognosis relevance value (PRV) of gene i is then defined as

    PRV_i = \sum_{j \in G_i} d_{ij}    (4)

which is the summarized difference in interaction ability between gene i and its interaction partners. For each gene, a higher PRV implies a greater difference in its interaction abilities between the two networks. Since the two networks represent different prognosis statuses, genes with high PRVs can be selected as potential prognosis biomarkers, which serve as an extension of the original prognosis-relevant split criterion.

C. Bimodal deep neural network
Figure S1 illustrates the structure of the bimodal DNN. The bimodal DNN processes the gene expression input and the clinical input with two separate subnetworks.
The outputs of the two subnetworks were then merged, processed by successive hidden layers, and turned into a final prediction output. The weights of the subnetworks were pre-trained. During the first phase of training, we froze the subnetwork weights and trained only the weights between the merged layer and the final output. During the second phase, all weights were unfrozen to allow fine-tuning of the whole network.
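The two-phase schedule can be illustrated with a toy numpy stand-in (this is not the authors' implementation; the layer sizes, synthetic data, and randomly initialized "pre-trained" weights are all illustrative assumptions): phase 1 updates only the merge-to-output weights, and phase 2 backpropagates into both subnetworks as well.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy bimodal inputs: expression profile (8 features) and clinical data (3 features).
Xg = rng.normal(size=(32, 8))
Xc = rng.normal(size=(32, 3))
y = (Xg[:, 0] + Xc[:, 0] > 0).astype(float)

# "Pre-trained" subnetwork weights (random stand-ins here).
Wg = rng.normal(size=(8, 4)) * 0.1    # expression subnetwork
Wc = rng.normal(size=(3, 4)) * 0.1    # clinical subnetwork
Wout = rng.normal(size=(8, 1)) * 0.1  # merged layer (4 + 4 units) -> output

def forward():
    h = np.concatenate([np.tanh(Xg @ Wg), np.tanh(Xc @ Wc)], axis=1)
    p = 1.0 / (1.0 + np.exp(-(h @ Wout).ravel()))  # sigmoid output
    return h, p

def bce(p):
    return -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

def step(lr, freeze_subnets):
    global Wg, Wc, Wout
    h, p = forward()
    err = (p - y)[:, None] / len(y)   # d(BCE)/d(logit) for a sigmoid output
    if not freeze_subnets:            # backprop into the subnetworks only in phase 2
        dh = err @ Wout.T
        Wg -= lr * Xg.T @ (dh[:, :4] * (1 - np.tanh(Xg @ Wg) ** 2))
        Wc -= lr * Xc.T @ (dh[:, 4:] * (1 - np.tanh(Xc @ Wc) ** 2))
    Wout -= lr * h.T @ err

loss_initial = bce(forward()[1])
for _ in range(200):                  # phase 1: subnetworks frozen
    step(0.3, freeze_subnets=True)
for _ in range(200):                  # phase 2: fine-tune the whole network
    step(0.1, freeze_subnets=False)
loss_final = bce(forward()[1])
```

In a framework such as Keras this corresponds to toggling the trainable flag on the subnetwork layers and recompiling between the two phases.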

D. Data perturbation ensemble approach
For the data perturbation random sampling setting, we tried subsampling 90%, 80%, and 70% of the whole dataset each time. In addition, we tried different numbers of subsamples: repeating 5, 10, 20, or 30 times. There were therefore 12 possible combinations of subsampling rate and number of subsamples. We evaluated the 12 settings through random validation, summing the areas of the seven feature selectors into a final summarized area.
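The 12-setting grid search can be sketched as follows; the summarized area is taken as the sum over the seven selectors as described above, while `best_setting` and its tie-break toward fewer subsamples are illustrative assumptions.

```python
from itertools import product

RATES = (0.90, 0.80, 0.70)   # subsampling rates
REPEATS = (5, 10, 20, 30)    # numbers of subsamples

# The 12 candidate (rate, repeats) combinations.
settings = list(product(RATES, REPEATS))

def best_setting(area_per_selector):
    """area_per_selector: {(rate, k): [summarized area for each of the 7
    selectors]}. Pick the setting with the largest summed area, breaking
    ties toward fewer subsamples (lower computational cost)."""
    return max(area_per_selector,
               key=lambda s: (sum(area_per_selector[s]), -s[1]))
```

With measured areas plugged in, the same comparison reproduces the choice of the 70%x5 setting.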
From the comparison (Fig. S2), we found that data perturbation outperformed the original feature selection under all 12 settings. Among them, subsampling 70% with 5 repeats achieved the highest performance and became the final data perturbation setting.
Ensemble learning approaches require a diverse set of base learners to cooperate such that the final prediction is strong and robust. Even with the subsampling rate set to 70%, candidate gene features that frequently appear in most subsamples have a high chance of being selected as the final gene features. Such genes may have significant biological meaning, since they repeatedly achieve high PRV scores across a set of highly varied data subsamples. Moreover, given the marginal differences among settings under the 70% subsampling rate, a small ensemble is preferred for its low computational complexity. As a result, we chose the 70%x5 setting, which achieves both the best performance and the lowest complexity.
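The subsample-and-count idea can be sketched as below, assuming a generic base selector that returns gene indices ranked best-first (e.g., by PRV); the function name is hypothetical, and the defaults follow the chosen 70%x5 setting.

```python
import numpy as np

def ensemble_select(expr, labels, select_fn, rate=0.7, n_runs=5, top_k=50, seed=0):
    """Data-perturbation ensemble: run a base feature selector on several
    random subsamples of the patients and count how often each gene lands
    in the top_k. expr is genes x samples."""
    rng = np.random.default_rng(seed)
    n = expr.shape[1]
    counts = np.zeros(expr.shape[0], dtype=int)
    for _ in range(n_runs):
        idx = rng.choice(n, size=int(rate * n), replace=False)
        chosen = select_fn(expr[:, idx], labels[idx])  # genes ranked best-first
        counts[chosen[:top_k]] += 1
    order = np.argsort(-counts)  # most frequently selected genes first
    return counts, order
```

Genes that survive most subsamples rise to the top of `order`, mirroring the robustness argument above.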
After determining the final data perturbation setting, we compared the seven original feature selectors with their data-perturbation versions. From pairwise comparisons for each feature selector (Fig. S3), we found that data perturbation improved robustness in most cases, except for the PR-selector. The improvement was verified through one-tailed paired t-tests, which showed that the summarized-area distributions of the data-perturbation results for the ER, HER2, TN, HP, MKI67, and PLAU-selectors were all significantly higher than those of their corresponding original feature selection results.

E. Function perturbation ensemble approach
Through random validation (Fig. S4), we found that the rank-mean and top-N-overlap strategies produced the best performance. However, since the top-N-overlap strategy cannot output a feature ranking score and does not let the user determine the number of final selected genes, it is more limited in real applications. Therefore, we adopted rank-mean as our final aggregation strategy for function perturbation.
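The rank-mean aggregation adopted here can be sketched in a few lines of numpy (`rank_mean` is a hypothetical helper name): each selector contributes a ranking, and genes are reordered by their average rank.

```python
import numpy as np

def rank_mean(rankings):
    """Function-perturbation aggregation: average each gene's rank across
    the rankings produced by different feature selectors (lower = better).
    rankings: list of 1-D arrays, each a permutation of gene indices
    ordered best-first."""
    n_genes = len(rankings[0])
    mean_rank = np.zeros(n_genes)
    for r in rankings:
        pos = np.empty(n_genes)
        pos[r] = np.arange(n_genes)  # rank of each gene under this selector
        mean_rank += pos
    mean_rank /= len(rankings)
    return np.argsort(mean_rank)     # consensus ordering, best-first
```

Unlike top-N overlap, the averaged ranks also serve as a feature ranking score, and the user can cut the consensus ordering at any desired number of genes.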
Having determined the aggregation strategy, we compared the function-perturbation results with those of the seven original feature selectors. As shown in Fig. S5, function perturbation brought significant improvement over the original feature selection results, statistically verified by one-tailed paired t-tests.

F. The final selected gene set
The top 50 genes selected via the hybrid ensemble approach are listed in Table S3. The final selected genes are the first 16, which produced the peak performance. We compared the performance, measured by 100 random-validation summarized areas, among three gene sets: the final feature selection result (16 genes), the top 50 selected genes in Table S3 (50 genes, including the final 16), and all genes before feature selection (24,338 genes). A detailed box plot illustrating these results is available in Fig. S6.

H. Test performance of selected genes as single biomarkers
Through test performance evaluation, we found that models with gene features alone (Table 1a) achieve an AUC between 0.7443 and 0.7672. This performance, achieved by a multi-gene approach, is higher than the AUC of any component gene as a single biomarker (Table S6). This indicates that a multi-gene approach can indeed model the complex molecular processes of breast cancer more comprehensively through joint evaluation of multiple genes.
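The AUCs compared above can be computed for any score, whether a multi-gene model output or a single gene's expression level, with a rank-based (Mann-Whitney) AUC; the small stand-in below is self-contained and does not handle tied scores.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: the probability that a randomly chosen positive
    sample scores higher than a randomly chosen negative one.
    labels: 0/1 array; scores: continuous, assumed tie-free."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

Evaluating the same `auc` on a multi-gene model score and on each component gene alone makes the comparison in Tables 1a and S6 concrete.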

I. Gene feature selection frequency curves
Following [1], we plotted the gene feature selection frequency curves for all function perturbation approaches and the hybrid ensemble approach, as illustrated in Fig. S7.
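A selection frequency curve of this kind can be computed as below (a sketch; the helper name is hypothetical): count how often each gene appears across the ensemble's selected sets, then sort the frequencies in descending order.

```python
import numpy as np

def selection_frequency_curve(selected_sets, n_genes):
    """Fraction of ensemble runs in which each gene was selected, sorted in
    descending order -- the frequency curve over ranked genes."""
    counts = np.zeros(n_genes)
    for s in selected_sets:
        counts[list(s)] += 1
    return np.sort(counts)[::-1] / len(selected_sets)
```

Plotting the returned array against gene rank yields a curve like those in Fig. S7: a sharp initial plateau indicates a stable core of genes selected in nearly every run.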

J. Independent validation test performance
From Table 1 in [2], we chose GSE21653 as the independent validation dataset, which records labels for disease-free survival (DFS). However, since certain clinical features are missing, predictions were made with microarray expression profiles only. We focused on the patients with DFS events and categorized those who survived less than and more than 5 years (60 months) as poor- and good-prognosis patients, respectively. Consequently, 14 patients belong to the poor prognosis class and 69 patients to the good prognosis class. We trained the support vector machine (SVM), random forest (RF), and microarray deep neural network (DNN) classifiers with the optimal hyper-parameters detailed in Tables S4-6. These models were then tested on the independent validation dataset, with results summarized in Table S8. We observe that the final 16 gene biomarkers are quite robust: their predictive power is sufficient, as all models still perform decently in stratifying breast cancer patients with DFS events in the independent validation dataset.

Table S8. Test performance of the independent validation GEO dataset
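The 5-year DFS binarization described above can be sketched as follows (a minimal illustration; the helper name is hypothetical): patients without a DFS event are excluded, and event patients are split at 60 months.

```python
def prognosis_labels(dfs_months, dfs_event):
    """Label DFS-event patients: event before 60 months -> poor prognosis (1),
    event at or after 60 months -> good prognosis (0). Patients without a
    DFS event are excluded (label None)."""
    labels = []
    for months, event in zip(dfs_months, dfs_event):
        if not event:
            labels.append(None)       # no event recorded: not used
        else:
            labels.append(1 if months < 60 else 0)
    return labels
```

Applied to GSE21653, this rule yields the 14 poor-prognosis and 69 good-prognosis patients reported above.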