Abstract
In predictive model development, gene expression data is associated with the unique challenge that the number of samples (n) is much smaller than the number of features (p). This “n ≪ p” property has prevented the application of deep learning techniques, which have proved powerful in “n > p” scenarios in other fields such as image classification, to the classification of gene expression data. Further, the sparsity of effective features with unknown correlation structures in gene expression profiles brings additional challenges for classification tasks. To tackle these problems, we propose a new classifier named Forest Deep Neural Network (fDNN), which integrates a deep neural network architecture with a supervised forest feature detector. Using this built-in feature detector, the method is able to learn sparse feature representations and feed them into a neural network, mitigating the overfitting problem. Simulation experiments and real data analyses using two RNA-seq expression datasets are conducted to evaluate fDNN’s capability. The method is demonstrated to be a useful addition to current predictive models, with better classification performance and more meaningful selected features compared to ordinary random forests and deep neural networks.
Introduction
In the field of bioinformatics, the development of computational methods for predicting clinical outcomes using profiling datasets with a large number of variables has drawn great interest. In such datasets, the sample sizes tend to be very small compared to the number of predictors (genes), resulting in the \(n\ll p\) issue. Moreover, the existence of complex unknown correlation structures among predictors has brought more difficulty to prediction and feature extraction. Therefore, the prediction task has been formulated as a classification problem combined with feature representations, and related work has tried to solve the problem by utilizing machine learning approaches such as random forests^{1,2}, neural networks^{3}, sparse linear models^{4,5} and support vector machines^{6}. While the primary goal of these methods is to achieve high classification accuracy, efforts have also been put into learning effective feature representations. Literature shows that among the machine learning techniques, random forests^{7} (RF) have been an excellent tool for learning feature representations^{8,9}, given their robust classification power and easily interpretable learning mechanism. This can be useful in building robust predictive models, especially when the underlying structures in the feature space are complex and unknown.
Classification methods have been developed considering known functional links between features. For example, a variant of the Random Forest method has been proposed where the feature subsampling was conducted according to spatial information of genes on a known functional network^{10}. Objective functions of the support vector machine and the logistic regression were modified by adding relational penalty terms, again based on known functional information^{11,12,13}. Very recently, a method embedding a protein-protein interaction feature graph directly into the deep neural network structure has also been proposed^{14}. The authors of these methods have demonstrated that incorporating feature relation structures results in better classification performance. However, considering the functional relation structures explicitly requires external information in addition to gene expression values. This requirement cannot always be satisfied, as the functional structure can be unknown or incomplete.
Aiming to develop a powerful classifier that can implicitly extract sparse feature relations from an extremely large feature space, we intend to incorporate a forest “feature detector” with deep neural networks (DNN), one of the state-of-the-art learning techniques^{15}. Although in recent years deep learning models have proved to be powerful tools in classification, their application in bioinformatics is limited due to the \(n\ll p\) issue^{16}. This is because cell populations and clinical subject populations exhibit large heterogeneity and data characteristics across various laboratories are inconsistent, resulting in gene expression datasets with limited numbers of samples compared to the large numbers of features. On the other hand, deep learning usually requires a large amount of training samples, such as in image classification^{17}, and this contradiction obstructs the use of deep learning techniques in the field of bioinformatics. Based on these facts, modified deep learning models suitable for disease outcome classification using gene expression data with \(n\ll p\) are needed.
Building a supervised feature detector on top of DNN classifiers is a natural choice to achieve sparse learning with fewer parameters than the usual DNN, for the following reasons: (1) the detector detects effective features in a supervised manner, i.e. using the information of training outcomes, resulting in accurate feature representations; (2) the input of the downstream DNN, which is the output of the feature detector, has a much smaller dimension than the original feature set. Also, the rationale for employing random forests over other models lies in two aspects: (1) as an ensemble model, RF is able to output prediction results from all its base learners rather than a single predicted probability score; (2) the importance of features in each base learner can be easily obtained. The first aspect allows us to build a downstream DNN following the feature detector, which cannot be achieved if the detector only outputs a single prediction, as in support vector machines and logistic regressions. The second aspect facilitates the feature evaluation process for the entire integrated model, while other classifiers such as kernel-based methods may not naturally embrace a feature selection mechanism. To the best of our knowledge, no work has been done along this track for gene expression data. In the field of traditional machine learning research such as computer vision, the idea of stacking classifiers^{18} has been implemented and is now very popular in Kaggle data science competitions (https://www.kaggle.com). Nevertheless, stacking methods are mainly intended to cross-validate a large number of multi-level models, and consequently require, without exception, a much larger number of instances (samples) than the number of features. In contrast, our new fDNN classifier with a supervised forest feature detector is developed for \(n\ll p\) sparse learning.
In this paper, we justify our approach by demonstrating the classification performance on both synthetic data and real RNA-seq datasets.
Methods and Materials
Forest deep neural networks
Our newly proposed forest deep neural network (fDNN) model consists of two parts. The forest part serves as a feature detector to learn sparse representations from raw inputs with the supervision of training outcomes, and the DNN part serves as a learner to predict outcomes with the new feature representations. In the forest part, independent decision trees^{19} are constructed, and the forest is then an ensemble of the trees. Therefore, a natural choice for building the forest is the Random Forest model^{7}. Other forest constructions are also possible. For example, one can use the network-guided forests^{10} if the feature space is structured and known, or the forest can be simply built through bagging trees^{20}. In this paper, we only employ random forests as the feature detector.
In the fDNN model, a forest \( {\mathcal F} \) is a collection of decision trees

$$ {\mathcal F} =\{{T}_{1}({{\rm{\Theta }}}_{1}),\ldots ,{T}_{M}({{\rm{\Theta }}}_{M})\},$$

where M is the total number of trees in the forest and Θ = {Θ_{1}, …, Θ_{M}} represents the parameters in \( {\mathcal F} \). In random forests, Θ includes splitting variables and their splitting values. In the feature detection stage, \( {\mathcal F} \) is fitted by the training data X and y, where \({\bf{X}}\in {{ {\mathcal R} }}^{n\times p}\) is the input data matrix with n samples and p features and \({\bf{y}}\in {{ {\mathcal R} }}^{n}\) is the outcome vector containing classification labels. Through the fitted forest, for any observation x_{i}, i = 1, …, n, we obtain the prediction from each tree in \( {\mathcal F} \):

$$f({{\bf{x}}}_{i};{\rm{\Theta }})=({T}_{1}({{\bf{x}}}_{i};{{\rm{\Theta }}}_{1}),\ldots ,{T}_{M}({{\bf{x}}}_{i};{{\rm{\Theta }}}_{M}){)}^{T},$$

where \({T}_{m}({{\bf{x}}}_{i};{{\rm{\Theta }}}_{m})={\hat{y}}_{im}\) is the binary prediction of observation x_{i} given by the m-th tree. Hence, denoting f_{i} := f(x_{i}; Θ) for simplicity, f_{i} is a binary vector summarizing the signals detected by the forest, and it later serves as the new input feature vector to be fed into the DNN.
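As a concrete sketch of this feature-detection step, the per-tree predictions can be collected from a fitted scikit-learn random forest through its `estimators_` attribute; the data below are toy stand-ins, not the paper's datasets:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in data (n samples, p features); in practice n << p.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 200))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

M = 30  # number of trees in the forest F
forest = RandomForestClassifier(n_estimators=M, random_state=0).fit(X, y)

# f_i for each sample: the binary prediction of every individual tree,
# collected into an n x M matrix of new features for the downstream DNN.
F = np.column_stack([tree.predict(X) for tree in forest.estimators_])
```

Each row of `F` is one observation's f_{i}; each column is the output of one tree T_{m}.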
Following the new feature representations provided by the forest, the deep neural network with l hidden layers has a standard architecture

$$\begin{array}{rcl}{{\bf{Z}}}_{1} & = & \sigma ({\bf{F}}{{\bf{W}}}_{in}+{{\bf{b}}}_{in}),\\ {{\bf{Z}}}_{k+1} & = & \sigma ({{\bf{Z}}}_{k}{{\bf{W}}}_{k}+{{\bf{b}}}_{k}),\quad k=1,\ldots ,l-1,\\ {{\bf{Z}}}_{out} & = & {{\bf{Z}}}_{l}{{\bf{W}}}_{out}+{{\bf{b}}}_{out},\end{array}$$

where F = (f_{1}, …, f_{n})^{T} is the forest matrix with n samples and M tree predictions, y again is the classification outcome vector, Ψ denotes all the parameters in the DNN model, and Z_{out} and Z_{k} are the output and hidden neurons with corresponding weight matrices W_{out}, W_{k} and bias vectors b_{out}, b_{k}. The dimensions of Z and W depend on the numbers of hidden neurons h_{in} and h_{k}, k = 1, …, l, as well as the input dimension M and the number of classes h_{out}. For binary classification problems, h_{out} ≡ 2 since the elements of y are binary. Usually, the number of hidden neurons decreases from the input layer, namely h_{in} = M > h_{1} > h_{2} … > h_{out}. σ(⋅) is the activation function, such as the sigmoid, the hyperbolic tangent or rectifiers. g(⋅) is the softmax function converting values of the output layer into probability predictions, i.e.

$${\hat{p}}_{i}=\Pr ({y}_{i}=1|{{\bf{f}}}_{i})=g({{\bf{z}}}_{i}^{out}),$$

where

$$g({{\bf{z}}}_{i}^{out})=\frac{\exp ({z}_{i1}^{out})}{\exp ({z}_{i0}^{out})+\exp ({z}_{i1}^{out})},$$

with \({{\bf{z}}}_{i}^{out}=({z}_{i0}^{out},{z}_{i1}^{out})\) the i-th row of Z_{out}, where i = 1, …, n.
The parameters to be estimated in the DNN are thus all the weights and biases. The model can be trained using a stochastic gradient descent (SGD) based algorithm^{21} by minimizing the cross-entropy loss function

$$\text{loss}({\rm{\Psi }})=-\frac{1}{n}\sum _{i=1}^{n}[{y}_{i}\,\log \,{\hat{p}}_{i}+(1-{y}_{i})\,\log (1-{\hat{p}}_{i})],$$

where again Ψ denotes all the model parameters, and \({\hat{p}}_{i}\) is the fitted value of p_{i}. More details about DNN can be found in standard deep learning reviews^{21}. The entire architecture of the fDNN model is visualized in Fig. 1.
Details of model training
The training of the fDNN classifier consists of two stages. In the first stage, the training data including labels are used to fit the forest; predictions from each tree in the forest for all instances are then fed into the fully-connected DNN for training in the second stage. After the two-stage training, given a testing instance, the testing prediction is calculated through the entire model by the fitted forest and DNN. Note that for implementation purposes, the forest prediction feature f_{i}, ∀i is one-hot encoded as shown in Fig. 1. This is the same operation as with the label vectors y_{i}, since the final output dimension of the DNN is two. Consequently, the actual input for the DNN in our implementation is an n × M × 2 tensor rather than an n × M matrix F.
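The one-hot encoding of the binary tree predictions into an n × M × 2 tensor can be sketched in NumPy (the matrix `F` below is a random toy stand-in for actual forest outputs):

```python
import numpy as np

# Binary per-tree predictions F (n x M), e.g. from a fitted forest.
rng = np.random.default_rng(1)
F = rng.integers(0, 2, size=(8, 5))

# One-hot encode each tree's 0/1 prediction, yielding an n x M x 2 tensor,
# the actual DNN input described in the text.
F_onehot = np.eye(2)[F]
```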
For the DNN model, the activation functions are the rectified linear unit (ReLU)^{22} with the form (in the scalar case)

$$\sigma (x)=\,\max \,(x,0).$$
This activation has an advantage over the sigmoid and the hyperbolic tangent as it can avoid the vanishing gradient problem^{23} during optimization. For the optimization algorithm, we choose the Adam optimizer^{24}, as it is the most widely used variant of traditional gradient descent algorithms in deep learning nowadays. Also, we use the mini-batch training strategy, by which the optimizer randomly trains on a small proportion of the samples in each iteration. Details about the Adam optimizer and mini-batch training can be found in the deep learning literature^{21,24}.
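A minimal NumPy sketch of the forward pass through the DNN part, with ReLU hidden layers and a binary softmax output; the weights here are random placeholders, whereas the actual model is trained with Adam in TensorFlow:

```python
import numpy as np

def relu(x):
    # Rectified linear unit: max(x, 0), applied elementwise.
    return np.maximum(x, 0.0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dnn_forward(F, weights, biases):
    """One forward pass through the fully-connected DNN part:
    ReLU hidden layers followed by a softmax output layer."""
    Z = F
    for W, b in zip(weights[:-1], biases[:-1]):
        Z = relu(Z @ W + b)
    return softmax(Z @ weights[-1] + biases[-1])

# Toy dimensions mirroring the simulation setup:
# M = 300 tree features -> 256 -> 64 -> 16 -> 2 classes.
rng = np.random.default_rng(2)
dims = [300, 256, 64, 16, 2]
weights = [rng.normal(scale=0.05, size=(a, b)) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
probs = dnn_forward(rng.normal(size=(4, 300)), weights, biases)
```

Each row of `probs` sums to one, matching the softmax probability output described above.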
The classification performance of the fDNN model is associated with hyperparameters for both the forest and the DNN. Forest hyperparameters include the number of trees in the forest and tree-related parameters such as tree depth, minimum splitting sample size, etc. DNN hyperparameters are architecture-related parameters such as the number of layers and the number of hidden neurons in each layer, regularization-related parameters such as the dropout proportion and the penalty scale of regularizers, and model training-related parameters such as the learning rate and the batch size. These hyperparameters can be fine-tuned using advanced hyperparameter optimizing algorithms such as Bayesian optimization^{25}. However, in this work, our primary interest is to examine the performance of fDNN compared to ordinary classifiers under the same or similar settings, instead of fine-tuning a “best-of-all” model for specific datasets. Therefore, the hyperparameters are simply chosen by convention or tuned using grid search with synthetic validation datasets in a feasible hyperparameter space. The method is implemented in Python with the packages Scikit-learn^{26} and TensorFlow^{27}.
Synthetic data generation
The goal of the simulation experiments is to mimic disease outcome classification using gene expression data, where \(n\ll p\) and effective features are extremely sparse and correlated, and to explore the performance of our new model compared to ordinary classification methods. We compare our fDNN method with the usual random forests and DNN, which account for the two parts of fDNN respectively. Through the numerical experiments, we intend to show that fDNN is able to improve the classification performance of pure random forests or DNN, and that this better performance cannot be achieved simply by increasing the complexities of the two ordinary classifiers. Robustness is also tested, as we simulate datasets that do not fully satisfy the correlated feature assumption and apply the new method to examine whether it can still achieve reasonable performance.
For a given number of features p, we first generate a latent feature network using the preferential attachment algorithm^{28}. Each node of the network represents one feature. The resulting network is scale-free with a power-law degree distribution, meaning that only a few features in this network have a relatively large number of “neighbors”. Defining the distance between two features in the network as the shortest path between them, a p × p distance matrix D recording pairwise distances among features is then calculated. Next, the distance matrix is transformed into a covariance matrix Σ by letting

$${{\rm{\Sigma }}}_{jk}={\rho }^{{D}_{jk}},\quad 0 < \rho < 1.$$

Here by convention the diagonal elements of D are all zeros, meaning the distance from a feature to itself is zero, and thus the diagonal elements of Σ are all ones. Since only a few features are highly connected in the feature network, this sparsity is reflected in the Σ matrix. Utilizing Σ as the covariance matrix for generating sample instances, we are then able to achieve the goal that features have sparse and correlated structures. n multivariate Gaussian samples are simulated, forming the data matrix X = (x_{1}, …, x_{n})^{T}, i.e.

$${{\bf{x}}}_{i}\mathop{ \sim }\limits^{iid}{\mathscr{N}}_{p}({\bf{0}},{\rm{\Sigma }}),\quad i=1,\ldots ,n,$$
where \(n\ll p\) for imitating real gene expression situations.
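A small sketch of this sampling scheme, using a toy path-graph distance matrix in place of network shortest paths, and a hypothetical decay rate `rho = 0.7` for the distance-to-covariance transform:

```python
import numpy as np

# Toy pairwise distance matrix D for p features on a path graph
# (a stand-in for shortest-path distances on a scale-free network).
p, n = 50, 20
idx = np.arange(p)
D = np.abs(idx[:, None] - idx[None, :])

# Decaying transform: rho is a hypothetical decay rate. D's zero
# diagonal maps to a unit diagonal for Sigma, as in the text.
rho = 0.7
Sigma = rho ** D

# n << p multivariate Gaussian samples with the induced correlation.
rng = np.random.default_rng(3)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
```

Nearby features (small D entries) end up strongly correlated, while distant features are nearly independent.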
To generate outcome variables, we select a subset of all features to be the “true” predictors. The selection is conducted as follows: in the generated feature network mentioned above, we randomly select part of the high-degree features as “cores”, and a portion of their neighbors are also randomly selected. In this way, the true predictors satisfy (1) sparsity, since only a few features are high-degree and only part of their neighbors are selected, and (2) a correlated structure, since the “core” features have much higher correlation with their neighbors than with other distant features. Denoting the number of true predictors as p_{0}, we sample a set of parameters \({\boldsymbol{\beta }}=({\beta }_{1},\ldots ,{\beta }_{{p}_{0}}{)}^{T}\) and an intercept β_{0} within a certain range. In our experiments, we first sample β’s from the interval (0.05, 0.1), and some of the parameters are randomly turned negative, so that we accommodate both positive and negative coefficients. Finally, the outcome variable y is generated through a logistic regression model

$${y}_{i}= {\mathcal L} \,({{\rm{logit}}}^{-1}({\beta }_{0}+\sum _{j=1}^{{p}_{0}}{\beta }_{j}{x}_{ij}) > t),\quad i=1,\ldots ,n,$$

where x_{i1}, …, x_{ip_{0}} denote the values of the selected true predictors for observation i, \( {\mathcal L} \)(·) is the indicator function, t is a threshold, and logit(⋅) is the logit function

$${\rm{logit}}(q)=\,\log \,\frac{q}{1-q}.$$

The inverse logit^{−1} is equivalent to a binary-class softmax function.
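The outcome-generation step can be sketched as follows; the coefficient range follows the text, while the intercept and threshold values are hypothetical choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, p0 = 200, 100, 10          # p0 "true" predictors
X = rng.normal(size=(n, p))
true_idx = rng.choice(p, size=p0, replace=False)

# Coefficients sampled from (0.05, 0.1) with random signs, as described.
beta = rng.uniform(0.05, 0.1, size=p0) * rng.choice([-1, 1], size=p0)
beta0 = 0.0                       # intercept (hypothetical value)

# Inverse-logit of the linear predictor, thresholded at t to give labels.
eta = beta0 + X[:, true_idx] @ beta
prob = 1.0 / (1.0 + np.exp(-eta))
t = 0.5                           # threshold (hypothetical value)
y = (prob > t).astype(int)
```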
Following the above procedure, we simulate a set of synthetic datasets with 5,000 features and 400 samples. Since we are considering cases with extremely low signal-to-noise ratios, we examine different numbers of true predictors, i.e. 10, 20, 30, 40, and 50, corresponding to 1–5 cores among all the high-degree features. Also, in reality, the true predictors may not be distributed only at the high-degree nodes and their neighbors in the latent feature network; instead, a few of the true predictors can be quite scattered. To test model robustness in this possible circumstance, in addition to generating datasets following the above procedure, we also simulate another series of datasets where 50% of the true predictors are randomly selected from the entire feature network rather than from high-degree features and their neighbors. We call these two sets of data the “clustered” case and the “scattered” case respectively, according to the property of the predictor structures.
Real datasets
We apply the fDNN method to two real datasets. The first is the single-cell RNA-seq data on bone marrow cells^{29} (GSE99095). The dataset consists of a gene expression matrix with 17,258 genes in 391 control cells from healthy donors and 588 cells from 5 patients with bone marrow failure. The original study found that the cell populations are diverse both in patients and in healthy donors, with patient cells showing higher diversity due to the existence of aneuploid cells^{29}. We obtain the normalized expression matrix from the Gene Expression Omnibus (GEO). Our interest is to test the method’s capability to classify the source of the cells, i.e. healthy/diseased, despite the presence of cell diversity within each class.
The other dataset we study is GSE106291, which contains the RNA-seq expression profiles of 23,368 genes from 250 acute myeloid leukemia patients under intensive treatment^{30}. The primary clinical outcome is treatment resistance. Each patient was labeled as either resistant or sensitive to the treatment. We aim at classifying the two responses with the gene expression data. From the original normalized expression matrix, we delete genes with more than 10% zero measurements, resulting in a final data matrix with 11,068 features and 250 samples. For each feature, the expression value is Z-score transformed, i.e. the expression value minus the mean across all patients, divided by the standard deviation.
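The preprocessing described above (zero filtering followed by per-gene Z-scores) can be sketched with NumPy on a toy expression matrix:

```python
import numpy as np

# Toy expression matrix: genes x patients (non-negative, some zeros);
# a stand-in for the real normalized expression data.
rng = np.random.default_rng(5)
expr = rng.gamma(shape=1.0, scale=2.0, size=(500, 250))
expr[rng.random(expr.shape) < 0.05] = 0.0

# Drop genes with more than 10% zero measurements across patients.
keep = (expr == 0).mean(axis=1) <= 0.10
expr = expr[keep]

# Z-score each gene: subtract its mean across patients, divide by its SD.
expr_z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)
```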
The two datasets are suitable for testing our method, because the datasets fall into the \(n\ll p\) category, the features are correlated due to their functional relations, and only a small portion of the features are expected to contribute to the biological mechanism that generated the class membership.
Results and Discussion
Simulation results
In our simulation studies, the fDNN had 300 trees in the forest part and three hidden layers in the DNN part, with 256, 64 and 16 hidden neurons respectively. Note the 300-tree forest served as a supervised feature detector for the downstream DNN, and the feature space was shrunk from p = 5000 to M = 300, mitigating the intractable \(n\ll p\) situation since n and the input dimension M are now of the same magnitude. We also observed that adding more hidden layers to the DNN resulted in similar prediction performance, hence the three-hidden-layer architecture was finalized as a parsimonious choice. For comparison, we also recorded the prediction performance of the 300-tree forests (RF_300) in fDNN and experimented with a DNN classifier (DNN_3_256) with the same architecture as the one in fDNN. Moreover, we tested additional random forests with 500 trees (RF_500) and a DNN with one more hidden layer (1024 neurons) at the top (DNN_4_1024), for the reason mentioned in the Methods section. For each of the data generation settings, 10 datasets were generated, and all the methods mentioned above were applied to the data. For each simulated dataset, we randomly split the data into training and testing sets at a 4:1 ratio. The final testing classification performances were then averaged across the ten datasets. All classification results were evaluated by the area under the receiver operating characteristic (ROC) curve (AUC).
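The evaluation protocol (4:1 train/test split, AUC on held-out samples) can be sketched with scikit-learn; the data and the RF baseline below are illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 100))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

# 4:1 train/test split, then AUC of the predicted class-1 probabilities.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])
```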
Table 1 and Fig. 2 show the results of the simulation experiments. Corresponding to the “clustered” case, Fig. 2(a) shows that the fDNN method outperformed the RF_300 and DNN methods, and performed better than RF_500 in most cases. As the number of true predictors increased, there were increasing trends for all of the methods, with a few exceptions due to the randomness of data generation. The trends for fDNN and RF_300 are quite parallel; however, the downstream DNN in fDNN always improved the prediction from the forests. Note that DNN_4_1024 was actually worse than DNN_3_256, which makes sense because under \(n\ll p\) the deeper neural network is more prone to overfitting. This in turn demonstrated the necessity of constructing a model with reduced feature dimension, as in fDNN, to get around this issue.
As for the “scattered” case (Fig. 2(b)), fDNN was still the best among the five methods, while the overall AUCs slightly decreased compared to Fig. 2(a). This is because neural networks inherently exploit correlation among features: when the correlation among features decreased, the performance of DNN_3_256 and DNN_4_1024 in (b) became worse than in Fig. 2(a). At the same time, although not as directly affected as the DNN methods, the decreased feature correlation also deteriorated the performance of RF_300 and RF_500. Recall that we only selected a proportion of the high-degree features’ neighbors as the true predictors. The remaining neighbors could also be informative when constructing decision trees, due to their high correlation with the true predictors. Consequently, compared to the scattered case, where half of the selected true predictors could hardly be connected to others in the feature network, the clustered case is easier for random forests, as the chance of selecting “relevant” predictors is higher.
In summary, the simulation experiments demonstrated that our newly proposed fDNN classifier had better classification performance than ordinary random forests or deep neural networks alone, in the situation where \(n\ll p\) and signals are sparse and correlated. Moreover, the improved performance could not be achieved by simply increasing the model complexities of random forests and DNN. The method was also robust, as it outperformed the other methods in both the clustered and scattered cases.
Real data results
For each of the two datasets, we again randomly divided all samples into training and testing sets. For GSE99095, we had 700 training and 279 testing samples; for GSE106291, the numbers were 200 and 50 respectively. The three classifiers, fDNN, DNN, and RF, were trained on the training samples. Hyperparameters were chosen by cross-validation using the training datasets. The classification performance was again evaluated by the testing AUC of ROC. The computation times of fDNN training plus testing were 69.9 seconds for GSE99095 and 40.8 seconds for GSE106291 respectively, on a workstation with dual Xeon Gold 6136 processors, 192 GB RAM, and a single Nvidia Quadro P5000 GPU. Tables 2 and 3 list the detailed architectures of each classification method and summarize the testing results for GSE99095 and GSE106291 respectively. Corresponding ROC plots are shown in Fig. 3(a,b).
From the tables and the ROC plots, we see that fDNN was able to obtain better classification results in terms of ROC compared to the traditional DNN and RF classifiers. All three methods performed reasonably well on GSE99095, which contained close to 1,000 samples. Although RF performed better than DNN in the simulations, on this real dataset DNN achieved slightly better testing results. Our fDNN method, by learning a sparse representation using RF as a feature detector, improved over the two methods in terms of testing data classification. GSE106291 had a smaller sample size of 250, which tested the limits of the methods. The small sample size may be the reason why all three methods performed worse. Still, fDNN achieved slightly better testing classification performance, indicating its applicability to gene expression datasets with relatively small sample sizes.
In real analysis of gene expression data, one may not only be concerned about the prediction results, but also be interested in features with major contribution to the classification, as those significant genes can reveal biological mechanisms. After fitting the fDNN model, we employed a newly developed variable ranking mechanism, which combined the variable importance calculation in ordinary random forests and the Connection Weights (CW) method^{31} introduced in neural networks, to calculate a score for each gene as the variable importance in fDNN.
In random forests, variable importance is quantified by accumulating the decrease in impurity caused by splitting on a certain feature across all the trees. Based on this fact, the forest \( {\mathcal F} \) in fDNN is also able to record feature importance during fitting. Moreover, importance scores of the features in each individual tree are also available, resulting in a p × M tree importance matrix S, where again p is the number of features and M is the number of trees in \( {\mathcal F} \).
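A sketch of building the p × M tree importance matrix S from a fitted scikit-learn forest, using each tree's impurity-based `feature_importances_` (toy data below):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(7)
X = rng.normal(size=(80, 30))
y = (X[:, 0] > 0).astype(int)

M = 25
forest = RandomForestClassifier(n_estimators=M, random_state=0).fit(X, y)

# p x M tree importance matrix S: column m holds the impurity-decrease
# importances of all p features within tree m.
S = np.column_stack([tree.feature_importances_ for tree in forest.estimators_])
```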
For an ordinary DNN, the CW method quantifies the contribution of an input variable by summing the absolute values of the weights connecting the variable to the first hidden layer, assuming all input data are standardized. Mathematically, we have

$${u}_{j}=\sum _{k=1}^{{h}_{1}}|{w}_{jk}^{(in)}|,$$

where u_{j} is the importance score for feature j, and w^{(in)} denotes the weights between the input and the first hidden layer. The same logic applies in fDNN, but instead of calculating the importance of each original feature, the CW method computes the importance of each tree output by

$${v}_{j}=\sum _{k=1}^{{h}_{1}}|{w}_{jk}^{(in)}|,$$
and here v_{j} is the importance score for the j-th tree. Now we have quantified both the feature importance in the forest part and the tree importance in the DNN part. Denoting \({{\bf{v}}}^{\ast }={({v}_{1}^{\ast },\ldots ,{v}_{M}^{\ast })}^{T}\) as the normalized importance scores for all the trees in the DNN with \({\sum }_{i=1}^{M}\,{v}_{i}^{\ast }=1\), we finally combine the two parts of fDNN with

$${\boldsymbol{\lambda }}={\bf{S}}{{\bf{v}}}^{\ast },$$

where λ = (λ_{1}, …, λ_{p})^{T} is the final importance of the original features.
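A NumPy sketch of this combined ranking, assuming the final importance weights each tree's within-tree feature importances by its normalized Connection Weights score; `S` and `W_in` below are random placeholders for the forest importance matrix and the DNN's first-layer weights:

```python
import numpy as np

rng = np.random.default_rng(8)
p, M, h1 = 40, 15, 8

# Placeholders: S is the p x M tree importance matrix from the forest;
# W_in holds the DNN weights between the M tree inputs and the first
# hidden layer (M x h1).
S = rng.random((p, M))
W_in = rng.normal(size=(M, h1))

# Connection Weights score of each tree: sum of absolute first-layer weights.
v = np.abs(W_in).sum(axis=1)
v_star = v / v.sum()          # normalize so the scores sum to one

# Final feature importance: forest importances weighted by tree scores.
lam = S @ v_star
```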
Applying this feature evaluation procedure to our real data examples, we obtained ranked gene importance lists for GSE99095 and GSE106291 respectively. For GSE99095, we analyzed the top 1% ranked genes from both fDNN and RF for comparison purposes. RF was chosen for comparison because it is commonly used as a variable importance ranking tool. Among the top 1% (172) genes selected by the two methods, 52 genes overlap. GO enrichment results are shown in Table 4.
Among the top 10 fDNN-selected pathways, one of the major themes is related to the mitochondria. The top three biological processes include the synthesis of mitochondrial proteins, mitochondrial transport, and the process of energy generation through oxidative phosphorylation. The results indicate that at the cellular level, mitochondrial biogenesis and energy production are associated with the bone marrow failure outcome. Comparatively, RF also identified oxidative phosphorylation as a significant process, but not the mitochondrial protein biosynthesis and transport processes. Both fDNN and RF found the hematopoietic process and the integrin pathway, which are integral parts of blood cell development. The fDNN selected the hemostasis pathway, which is another important part of blood cell regulation.
For the GSE106291 data, as the total number of genes under study is smaller, to maintain sufficient statistical power in gene set enrichment analysis, we compared the top 2% most important genes from fDNN and RF. Between the two lists of top 221 genes, 81 genes overlap. As shown in the enrichment analysis results (Table 5), both fDNN and RF selected chemotaxis as the top GO term. However, the second term selected by fDNN, “myeloid leukocyte activation”, was not found by RF. Given the nature of the disease, and the clinical response under study, i.e. treatment resistance, it is expected that the myeloid activation pathways play a critical role. The fDNN also selected small molecule metabolism, exocytosis, and cAMP response as important processes in the resistance to the treatment, which have been implicated in chemotherapy response in other types of cancer. Overall, fDNN selected different important genes from RF, and the biological functions overrepresented by the fDNN selected genes are plausible.
Conclusion
We presented a new forest deep neural network (fDNN) classifier aimed at \(n\ll p\) classification problems for clinical outcome prediction using gene expression data. Its machinery relies on learning feature representations from a supervised forest and training a deep neural network classifier on those representations. Simulation experiments showed its relatively higher classification accuracy compared to existing methods, and the real data applications demonstrated the utility of the new model.
References
1. Kursa, M. B. Robustness of random forest-based gene selection methods. BMC Bioinformatics 15, 8 (2014).
2. Cai, Z. et al. Classification of lung cancer using ensemble-based feature selection and machine learning methods. Mol. BioSystems 11, 791–800 (2015).
3. Chen, Y.-C., Ke, W.-C. & Chiu, H.-W. Risk classification of cancer survival using ANN with gene expression data from multiple laboratories. Comput. Biology and Medicine 48, 1–7 (2014).
4. Liang, Y. et al. Sparse logistic regression with a L_{1/2} penalty for gene selection in cancer classification. BMC Bioinformatics 14, 198 (2013).
5. Algamal, Z. Y. & Lee, M. H. Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst. with Appl. 42, 9326–9332 (2015).
6. Vanitha, C. D. A., Devaraj, D. & Venkatesulu, M. Gene expression data classification using support vector machine and mutual information-based gene selection. Procedia Comput. Sci. 47, 13–21 (2015).
7. Breiman, L. Random forests. Mach. Learning 45, 5–32 (2001).
8. Vens, C. & Costa, F. Random forest based feature induction. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, 744–753 (IEEE, 2011).
9. Tang, A. & Foong, J. T. A qualitative evaluation of random forest feature learning. In Recent Advances on Soft Computing and Data Mining, 359–368 (Springer, 2014).
10. Dutkowski, J. & Ideker, T. Protein networks as logic functions in development and cancer. PLoS Computational Biology 7, e1002180 (2011).
11. Zhu, Y., Shen, X. & Pan, W. Network-based support vector machine for classification of microarray samples. BMC Bioinformatics 10, S21 (2009).
12. Lavi, O., Dror, G. & Shamir, R. Network-induced classification kernels for gene expression profile analysis. J. Comput. Biol. 19, 694–709 (2012).
13. Kim, S., Pan, W. & Shen, X. Network-based penalized regression with application to genomic data. Biometrics 69, 582–593 (2013).
14. Kong, Y. & Yu, T. A graph-embedded deep feedforward network for disease outcome classification and feature selection using gene expression data. Bioinformatics, https://doi.org/10.1093/bioinformatics/bty429 (2018).
15. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
16. Min, S., Lee, B. & Yoon, S. Deep learning in bioinformatics. Briefings in Bioinformatics bbw068 (2016).
17. Russakovsky, O. et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. (IJCV) 115, 211–252, https://doi.org/10.1007/s11263-015-0816-y (2015).
18. Wolpert, D. H. Stacked generalization. Neural Networks 5, 241–259 (1992).
19. Quinlan, J. R. Induction of decision trees. Mach. Learning 1, 81–106 (1986).
20. Breiman, L. Bagging predictors. Mach. Learning 24, 123–140 (1996).
21. Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).
22. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), 807–814 (2010).
23. Hochreiter, S. et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies (2001).
24. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
25. Mockus, J. Bayesian Approach to Global Optimization: Theory and Applications, vol. 37 (Springer Science & Business Media, 2012).
26. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
27. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org (2015).
28. Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
29. Zhao, X. et al. Single-cell RNA-seq reveals a distinct transcriptome signature of aneuploid hematopoietic cells. Blood 130, 2762–2773 (2017).
30. Herold, T. et al. A 29-gene and cytogenetic score for the prediction of resistance to induction treatment in acute myeloid leukemia. Haematologica (2017).
31. Olden, J. D. & Jackson, D. A. Illuminating the “black box”: a randomization approach for understanding variable contributions in artificial neural networks. Ecol. Modelling 154, 135–150 (2002).
Acknowledgements
This study was partially funded by NIH grant R01GM124061 and R37AI051231. The authors thank Dr. Hao Wu for helpful discussions.
Author information
Contributions
T.Y. conceived the study. Y.K. programed the algorithm, conducted simulation and real data experiments. T.Y. and Y.K. analyzed the results. Y.K. and T.Y. drafted the manuscript. Both authors reviewed the manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Kong, Y., Yu, T. A Deep Neural Network Model using Random Forest to Extract Feature Representation for Gene Expression Data Classification. Sci Rep 8, 16477 (2018). https://doi.org/10.1038/s41598-018-34833-6
Keywords
 Deep Neural Networks (DNN)
 Random Forest (RF)
 Gene Expression Data Classification
 Extract Feature Representations
 Predictive Model Development