
# Permutation-based identification of important biomarkers for complex diseases via machine learning models

## Abstract

The study of human disease remains challenging due to convoluted disease etiologies and complex molecular mechanisms at genetic, genomic, and proteomic levels. Many machine learning-based methods have been developed and widely used to alleviate some analytic challenges in complex human disease studies. While enjoying modeling flexibility and robustness, these model frameworks suffer from non-transparency and difficulty in interpreting each individual feature due to their sophisticated algorithms. However, identifying important biomarkers is a critical pursuit towards assisting researchers to establish novel hypotheses regarding prevention, diagnosis and treatment of complex human diseases. Herein, we propose a Permutation-based Feature Importance Test (PermFIT) for estimating and testing the feature importance, and for assisting interpretation of individual features in complex frameworks, including deep neural networks, random forests, and support vector machines. PermFIT (available at https://github.com/SkadiEye/deepTL) is implemented in a computationally efficient manner, without model refitting. We conduct extensive numerical studies under various scenarios, and show that PermFIT not only yields valid statistical inference, but also improves the prediction accuracy of machine learning models. With the application to the Cancer Genome Atlas kidney tumor data and the HITChip atlas data, PermFIT demonstrates its practical usage in identifying important biomarkers and boosting model prediction performance.

## Introduction

With the advancement of high-throughput technologies, massive amounts of high-dimensional omics data have been generated and made available through large public databases due to great data sharing efforts by the research community, such as The Cancer Genome Atlas (TCGA)1. These data are valuable in elucidating the molecular mechanisms of disease phenotypes2,3. However, study of complex human disease remains challenging due to convoluted disease etiologies and underlying intricate molecular mechanisms at genetic, genomic, and proteomic levels. Many popular machine learning algorithms, such as non-linear kernel support vector machines (SVMs), random forests (RFs), and deep neural networks (DNNs) in artificial intelligence areas, have been developed to build more powerful predictive models for biomedical and bio-omics data regarding clinical outcomes, e.g. drug response4, and medical imaging classification5. While enjoying the modeling flexibility and robustness, these model frameworks suffer from non-transparency and difficulty in interpreting the role of each individual feature due to their sophisticated algorithms, compared with those more interpretable parametric models, such as linear regressions, logistic regressions, and decision trees. Nonetheless, identifying important biomarkers associated with complex human disease is a critical pursuit towards assisting researchers to establish novel hypotheses regarding prevention, diagnosis and treatment of complex human diseases. Accurate identification of important biomarkers associated with complex human disease not only provides valuable insights into their underlying genetic architecture and disease etiology but also offers great potentials for early disease diagnosis, improved precision medicine, innovative treatment development, and accurate disease risk and progression prediction6.

To address the non-transparency in the association study between disease outcomes and predictors using machine learning models, the feature importance score strategy has been proposed and extensively investigated7,8,9,10,11,12,13, including surrogate models7,14, Shapley value-based methods15,16, conditional randomization tests (CRTs)10, knockoff models (i.e., model-X)10,12, and permutation-based feature importance8. Surrogate modeling methods approximate the complex models by using explanatory surrogate models, such as linear models or decision trees. While enjoying the great flexibility in choosing the surrogate models, the feature importance is still restricted to the selected explanatory models that might be misspecified13. Shapley value-based methods, such as SHAP16, provide localized feature characterization based on game theory, while they are computationally intensive and do not guarantee a valid test. Both CRT and model-X knockoff were proposed in Candes et al.10, while CRT is less preferred due to its expensive computational cost. Model-X knockoff is more computationally efficient in performing feature importance test via constructing knockoff features. Recently, model-X knockoff was adopted for DNN12 models. Tansey et al.11 proposed the holdout randomization test (HRT) to reduce the computational cost of CRT via avoiding model refitting.

The overall disadvantage of CRT, HRT, and model-X knockoff is that they all depend on the assumption of a known covariance structure10. When the covariance structure is not accurately estimated, their performance could be severely impacted17. Although KnockoffGAN18, an extension of model-X knockoff, does not suffer from this disadvantage, training its adversarial networks is difficult19 and it requires more tuning. Another strategy that avoids the known covariance structure assumption is based on Gaussian mirrors20,21,22. Specifically, Xing et al.21 proposed the individual neural Gaussian mirror (INGM) and the simultaneous neural Gaussian mirror (SNGM). However, INGM requires repetitive model fitting, which is computationally costly, while SNGM is efficient but could suffer performance loss21. The permutation-based feature importance learning method, another popular approach for feature selection, measures the change in prediction error due to the shuffling of a feature. The larger the increase in prediction error, the more impact a feature has on the outcome of interest. Unlike CRT, HRT, or model-X knockoff, permutation-based feature selection does not require prior knowledge of feature distributions and is thus more statistically robust. Several permutation-based feature importance methods have been proposed, with applications mainly to random forests and DNNs8,9,23. These methods either do not conduct any statistical inference or cannot offer valid inference on the feature importance. For example, Putin et al.23 applied permutation-based importance scores to DNNs to identify biomarkers associated with human aging, but provided no formal statistical testing. Notably, Altmann et al.9 proposed a corrected permutation-based importance score approach for random forests, which, however, is difficult to generalize to other machine learning model frameworks.

To overcome the aforementioned challenges, we propose a general permutation-based feature importance test (abbreviated as PermFIT) for complex machine learning models, which (i) couples a permutation test with cross-fitting to obtain a valid importance score test that properly controls the type-I error; and (ii) uses the important features selected by PermFIT to further improve the accuracy of these predictive models. We implement PermFIT for DNN, RF, and SVM models. More specifically, PermFIT first approximates the function that maps features to the outcome, and then evaluates the importance score of each feature, defined as the expected increase in prediction error due to the permutation of the feature. Computationally, the PermFIT framework does not require refitting the models. To reduce the bias in importance score estimation caused by potential model overfitting, we adopt cross-fitting to ensure the validity of the test statistics. PermFIT is motivated by two benchmark datasets: the Reverse Phase Protein Arrays (RPPA) data from three kidney cancer studies in The Cancer Genome Atlas (TCGA) and the HITChip Atlas microbiome data regarding body mass index (BMI). However, PermFIT has broad applicability to a wide variety of biomedical data and beyond.

## Results

To evaluate the performance of PermFIT, we first conduct comprehensive simulation studies under various scenarios with different sample sizes and correlation structures among features. Moreover, it is applied to two real-world datasets: the Reverse Phase Protein Arrays (RPPA) data from three kidney cancer studies in TCGA and the HITChip Atlas microbiome data. We apply PermFIT to three commonly used machine learning methods: DNN24,25, RF8, and SVM26, denoted as PermFIT-DNN, PermFIT-RF and PermFIT-SVM, respectively. We also compare PermFIT with several existing popular feature selection methods for DNN, RF, and SVM: SHAP16, LIME14, holdout randomization test11, and simultaneous neural Gaussian mirror21 with DNN (denoted as SHAP-DNN, LIME-DNN, HRT-DNN, and SNGM-DNN, respectively), RF importance evaluation of Breiman8 (denoted as Vanilla-RF, i.e., an ensemble approach based on decision trees), and SVM with recursive feature elimination27 (denoted as RFE-SVM). SHAP-DNN, LIME-DNN, and RFE-SVM utilize an importance score to rank input features, from which top features are selected. For each feature, Vanilla-RF provides an importance score estimate and its associated standard error, with which the statistical significance of the feature importance can be tested. HRT provides a p-value for each feature without importance scores. We evaluate these methods as follows: (i) we apply each method to the training data with all the input features, estimate the feature importance scores, p values, and assess the type-I error; (ii) we refit each model with its corresponding top ranked important variables, and re-evaluate its goodness-of-fit and prediction improvement.

### Simulation studies

We examine the performance of the proposed methods with the following simulation scenarios. First, we generate the continuous data from the following model,

$$Y={X}_{1}+2\, {\mathrm{log}}\,\Big(1+2{X}_{{p}_{0}+1}^{2}+{\big({X}_{2{p}_{0}+1}+1\big)}^{2}\Big)+{X}_{3{p}_{0}+1}{X}_{4{p}_{0}+1}+\epsilon ,$$
(1)

where X is a p-dimensional random variable drawn from MVN(0, Σ), p = 10p0, p0 = 10, Σ = diag{Σ1, . . . , Σ10} is a block-diagonal matrix, $${{{\Sigma }}}_{1}=...={{{\Sigma }}}_{10}={\{{\sigma }_{ij}\}}_{0\,{<}\,i,j\le {p}_{0}}$$ are p0 × p0 matrices with σij = 1 for i = j and σij = ρ for i ≠ j, and ϵ ~ N(0, 1). N independent observations are drawn from the distribution of (Y, X) for the training set and 10,000 for the test set, which is used to evaluate model fitting performance. To mimic real-world data, we introduce correlations among variables by blocks, and let one variable from each of the first 5 blocks carry a signal. We define S0 and S1 as the sets that contain all the null features that are correlated and uncorrelated with the causal features, respectively, i.e., S0 = {Xj: j ≤ 5p0 and j ≠ 1, p0 + 1, 2p0 + 1, 3p0 + 1, 4p0 + 1}, S1 = {Xj: j > 5p0}. We consider various simulation settings with ρ ∈ {0, 0.2, 0.5, 0.8} and N ∈ {1000, 5000}. Each simulation scenario is replicated 100 times.
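The data-generating process of Eq. (1) can be sketched in a few lines. The function name `simulate_continuous` and its arguments are hypothetical and not taken from the PermFIT software. As an implementation shortcut, the sketch builds each equicorrelated block from one shared standard normal factor plus independent noise, which yields unit variances and within-block correlation ρ for 0 ≤ ρ ≤ 1.

```python
import math
import random


def simulate_continuous(n, p0=10, rho=0.5, seed=0):
    """Draw n observations (X, y) following Eq. (1) with p = 10 * p0 features.

    Hypothetical helper for illustration; assumes 0 <= rho <= 1.
    """
    rng = random.Random(seed)
    n_blocks = 10
    X, y = [], []
    for _ in range(n):
        row = []
        for _ in range(n_blocks):
            # One shared factor per block induces correlation rho within the block.
            shared = rng.gauss(0.0, 1.0)
            for _ in range(p0):
                noise = rng.gauss(0.0, 1.0)
                row.append(math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise)
        # 0-based positions of the causal features X_1, X_{p0+1}, ..., X_{4p0+1}.
        x1, x2, x3, x4, x5 = (row[k * p0] for k in range(5))
        mean = x1 + 2.0 * math.log(1.0 + 2.0 * x2 ** 2 + (x3 + 1.0) ** 2) + x4 * x5
        X.append(row)
        y.append(mean + rng.gauss(0.0, 1.0))  # epsilon ~ N(0, 1)
    return X, y
```

Calling `simulate_continuous(1000, rho=0.2)` would correspond to one training set of the N = 1000, ρ = 0.2 scenario; a separate call with a different seed could play the role of the test set.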

The results are displayed in Fig. 1 and Table 1. Figure 1a displays detailed feature importance scores generated from each method that we consider. Since HRT does not provide importance scores, we use $$-{\mathrm{log}}_{10}$$(p value) instead. Note that the estimated importance scores from PermFIT methods and Vanilla-RF are on the same scale, while the ones from SHAP-DNN, LIME-DNN and RFE-SVM are not. For X1, whose effect is linear, the importance scores from PermFIT-DNN and PermFIT-SVM are higher than those from the RF-based frameworks, owing to the restrictive tree-based nature of RF. In addition, the RF-based frameworks can barely detect the interaction between $${X}_{3{p}_{0}+1}$$ and $${X}_{4{p}_{0}+1}$$ because the split rule in tree-based methods is less effective in dealing with such interactions. As expected, as the within-block correlation ρ increases, the estimated importance scores from all methods deviate further from their estimands. However, PermFIT-SVM retains high power in detecting the true positive features. As ρ increases, Vanilla-RF and PermFIT-SVM noticeably tend to identify the null features that are correlated with the causal features. Compared with Vanilla-RF, PermFIT-RF has fewer false positive discoveries. Overall, PermFIT-DNN provides the most precise and stable importance measure in differentiating the true positive from null features.

The frequency (percentage) of the important variables detected by each method is presented in Table 1. For Vanilla-RF and PermFIT methods that provide p values, the significance level is controlled at 0.05, while for RFE-SVM, the top 10 features with the largest importance scores are selected. First of all, at ρ = 0, PermFIT controls the rate of significant findings across all null features at around 0.05, suggesting that the type-I error is well controlled by PermFIT, while Vanilla-RF has a type-I error of 0.09, nearly double that of PermFIT. When N = 1000, the type-I error of HRT-DNN is slightly inflated. Besides, LIME-DNN and SNGM-DNN show a limited ability to identify features with nonlinear effects, such as $${X}_{{p}_{0}+1}$$, $${X}_{3{p}_{0}+1}$$, and $${X}_{4{p}_{0}+1}$$. On the other hand, SHAP-DNN is able to assign high rankings to the important features based on the importance scores. However, it fails to offer a valid test for its importance scores; specifically, its type-I error and power depend on correctly specifying the number of important features. When ρ increases to 0.5 or 0.8, RFE-SVM tends to select the null features that are correlated with the true causal features, i.e., those in S0, more frequently than $${X}_{3{p}_{0}+1}$$ and $${X}_{4{p}_{0}+1}$$, the two causal variables that interact with each other, demonstrating its limited capability in detecting variables with interaction effects when correlation exists. In contrast, PermFIT-SVM is capable of identifying $${X}_{3{p}_{0}+1}$$ and $${X}_{4{p}_{0}+1}$$ consistently at a much higher frequency than all the null features. Compared with PermFIT-RF, Vanilla-RF has a higher power in detecting $${X}_{3{p}_{0}+1}$$ and $${X}_{4{p}_{0}+1}$$, but also produces remarkably more false positive findings among features in S0. For example, at ρ = 0.8 and N = 1000, it results in a >80% false positive rate in S0, suggesting a far inferior feature selection performance. In all these scenarios, PermFIT-DNN can consistently identify causal features while controlling false positive findings at a much lower rate than those of Vanilla-RF, PermFIT-RF, and PermFIT-SVM.

After important feature selection, the prediction performance of almost all the compared models improves. Figure 1b displays the mean squared prediction error (MSPE) of each model, (i) with full input features, respectively denoted as DNN, RF, and SVM; and (ii) with top selected features from PermFIT methods and HRT-DNN at the significance level of 0.1, and top 20 features from SHAP-DNN, LIME-DNN, SNGM-DNN and RFE-SVM. Selected features help boost the prediction accuracy of all models, except RFE-SVM, LIME-DNN, and SNGM-DNN, across all simulation scenarios. LIME-DNN and SNGM-DNN fail to identify certain important features, which leads to deterioration of the model performance. In addition, at ρ = 0.8, corresponding to high correlation among within-block input features, RFE-SVM fails to improve the model fitting over SVM because of feature selection failure, in particular, on $${X}_{3{p}_{0}+1}$$ and $${X}_{4{p}_{0}+1}$$; its inferior performance to PermFIT-SVM is clearly observed. Moreover, PermFIT-RF outperforms Vanilla-RF in terms of MSPE, because the latter yields more false positives and cannot effectively reduce the feature dimension. We note that PermFIT-DNN and HRT-DNN consistently outperform all other methods in comparison, due to their high success rates in identifying true positive features while maintaining a considerably low false positive rate at the same time. In particular, PermFIT-DNN has a lower MSPE than HRT-DNN when N = 1000 and ρ ≤ 0.2, and similar MSPE values in other scenarios.

To further investigate the small sample performance of these methods, we conduct additional simulations with (N = 300, p = 100) and (N = 500, p = 200), and report the results in Table 2. The type-I errors of PermFIT-based methods are not affected much by the change of N and p in these more challenging cases, while those for HRT-DNN are severely inflated, which is likely because, for more challenging data with a smaller N or a larger p, HRT-DNN fails to make an accurate estimation of the covariance matrix of the input features.

We further conduct a simulation study on binary outcomes generated from the following model:

$$P(Y=1| X)={\rm{expit}}\Big(4{X}_{1}+8\, {\mathrm{log}}\,\big(1+2{X}_{{p}_{0}+1}^{2}+{\big({X}_{2{p}_{0}+1}+1\big)}^{2}\big)+4{X}_{3{p}_{0}+1}{X}_{4{p}_{0}+1}-11\Big),$$
(2)

where $${\rm{expit}}(x)=1/(1+\exp (-x))$$. All the other data structures, including X, are generated in the same way as in the continuous case. Similar conclusions are observed with the details presented in Supplementary Table 1 and Supplementary Fig. 1.
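For reference, the success probability in Eq. (2) can be computed as below. This is only an illustrative sketch: `expit` mirrors the definition in the text but is written in a form that avoids overflow for large negative arguments, and `prob_y1` is a hypothetical helper that assumes the same 0-based feature layout as the continuous case (causal features at positions 0, p0, 2p0, 3p0, 4p0).

```python
import math


def expit(x):
    """Logistic function 1 / (1 + exp(-x)), arranged to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def prob_y1(row, p0=10):
    """P(Y = 1 | X = row) following Eq. (2); hypothetical helper."""
    x1, x2, x3, x4, x5 = (row[k * p0] for k in range(5))
    eta = (4.0 * x1
           + 8.0 * math.log(1.0 + 2.0 * x2 ** 2 + (x3 + 1.0) ** 2)
           + 4.0 * x4 * x5
           - 11.0)
    return expit(eta)
```

A binary outcome for one observation could then be drawn as `1 if random.random() < prob_y1(row) else 0`.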

### TCGA kidney tumor data

A large collection of clinical and multiple omics data have been made publicly available by TCGA research project1. In our analysis, we included three studies of kidney-related cancer types from TCGA research network: kidney renal clear cell carcinoma (KIRC, N1 = 537), kidney renal papillary cell carcinoma (KIRP, N2 = 291), and kidney chromophobe (KICH, N3 = 113). We defined long-term survivor (LTS) as patients who survived more than five years after diagnosis, and short-term survivor (STS) as patients who died within 5 years. We aimed to predict the probability of a patient being in the LTS group and to identify significant biomarkers that contribute to classification of the LTS/STS status. We included 188 LTS and 178 STS subjects with the known survival status in our analysis. We focused our analysis on expression data of 118 proteins extracted from reverse phase protein arrays (RPPAs)—a highly sensitive, reproducible, and high-throughput proteomic method for protein expression profiling28.

The negative $${\mathrm{log}}_{10}$$(p value)s and the estimated importance scores from each method are presented in Fig. 2 and Supplementary Fig. 3. HRT-DNN, Vanilla-RF and PermFIT models control the FDR at 0.1, and SHAP-DNN, LIME-DNN, SNGM-DNN and RFE-SVM select the 10 features with the largest importance scores. We notice that moderate correlations generally exist among the proteins (see Supplementary Fig. 2). However, six proteins, SRC, RAF1, RB1, RPS6, YWHAZ, and EGFR, are highly correlated and clustered together by hierarchical clustering in Fig. 2. Among them, EGFR, YWHAZ, RPS6, RB1, and SRC are identified by Vanilla-RF, and RPS6, RB1 and SRC are selected by RFE-SVM, while none of these biomarkers is selected by any PermFIT procedure. According to our observations in simulation studies, both Vanilla-RF and RFE-SVM tend to identify false positive biomarkers in the presence of high correlation among features, casting some doubt on the validity of their biomarker selection results. In addition, LIME-DNN identifies a very different set of important biomarkers compared to SHAP-DNN, HRT-DNN, SNGM-DNN, and PermFIT-DNN.

Since the underlying genetic truth is unknown, we alternatively use the model performance improvement estimated via 5-fold cross-validation, randomly repeated 100 times (see Fig. 3a, b; Supplementary Table 2), as a surrogate measure for evaluating the relative quality of the selected features. Similar to the simulation study, we set the feature inclusion criteria to p values smaller than 0.1 for HRT-DNN, Vanilla-RF, and PermFIT methods, and the top 20 features for SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM. PermFIT-RF improves the accuracy from 0.694 to 0.732 on average, while Vanilla-RF only improves it to 0.713. Moreover, PermFIT-SVM elevates the accuracy from 0.69 to 0.744, outperforming RFE-SVM (0.709). Similar to the simulation results, PermFIT-DNN and HRT-DNN achieve the highest accuracy (0.751 and 0.750, respectively), better than those from SHAP-DNN (0.731), LIME-DNN (0.650) and SNGM-DNN (0.723). The same conclusion is further confirmed by the area under the ROC curve (AUC) results. In summary, PermFIT procedures consistently perform more efficient and accurate feature selection across various machine learning frameworks.

Regarding the identified biomarkers, four genes (CDKN1A, EIF4EBP1, INPP4B, and SERPINE1) are identified by all three PermFIT methods as significantly associated with the survival status. Interestingly, all four genes have been reported to be cancer related. In particular, INPP4B, identified as the most significant biomarker by all three methods (p value = 1.3E − 05 by PermFIT-DNN, 9.1E − 07 by PermFIT-RF, and 4.5E − 05 by PermFIT-SVM), encodes inositol polyphosphate-4-phosphatase, type II, a dual specificity phosphatase. Low INPP4B was recently reported to be associated with shorter survival in kidney clear cell, liver hepatocellular, and bladder urothelial carcinomas, and with long survival in pancreatic adenocarcinoma29. It is also related to acute myeloid leukemia, breast cancer and bladder cancer30,31,32. SERPINE1 encodes plasminogen activator inhibitor-1, which plays an important role in various diseases, in particular, kidney pathology and renal cell cancer33,34,35. In addition, the CDKN1A encoded protein, CDK-interacting protein 1, was reported as a prognostic marker for renal cell cancer36, and has an effect on kidney cancer cell death37 as well as kidney cancer survival38. Similarly, EIF4EBP1 affects disease progression in renal cell carcinoma39.

Moreover, the DNA repair protein XRCC1, identified by PermFIT-DNN and PermFIT-SVM, is shown to be associated with bladder cancer40. ANXA7, identified by PermFIT-DNN and PermFIT-RF, is reported to be associated with prostate cancer and breast cancer41,42, and its encoded protein has an impact on prostate cancer and breast cancer43,44. Furthermore, MYH9 and NRG1 are identified by PermFIT-DNN. Myosin-9, encoded by MYH9, has been discussed for its role as a tumor suppressor45, and NRG1 is also reported to be related to multiple cancer types46,47. Last, PermFIT-RF identifies a novel gene, STK11, whose role in kidney cancer is unknown. However, it has been reported that inactivation of STK11 is a common event in lung adenocarcinomas48.

### HITChip atlas data

In the HITChip Atlas study, the data were collected from 1006 adults in 15 western countries49 by using the HITChip, and they are publicly available in the R package “microbiome”50. Besides demographic and clinical variables, the HITChip Atlas data include microbiome measurements from 130 taxonomic groups summarized at the genus level, which cover the major types of bacterial diversity in the human intestinal microbiota. Many of the 130 taxonomic groups are highly correlated, which is reflected in the correlation heat map and the hierarchical clustering dendrogram (see Fig. 4 and Supplementary Fig. 4). We then investigated the importance of demographic factors, including gender and nationality, together with 129 microbial genera (1 was removed due to the use of compositional values), in predicting the baseline BMI level. Our analysis includes 900 subjects with BMI measurements.

The feature selection and biomarker identification criteria remain the same as those in the TCGA example. The improvement of the model performance from variable selection is presented in Fig. 3c, d and Supplementary Table 2. Besides the MSPE, we report the Pearson correlation between the predicted and the true values. We notice that high correlation among microbiome features leads to large inflation in importance scores from Vanilla-RF, corresponding to high false positive rate. As a result, it fails to improve the model performance and the reduced model with selected features from RFE-SVM performs worse than the full model. Again, this is likely due to the fact that highly correlated biomarkers are falsely selected by RFE-SVM. For instance, Streptococcus mitis et rel, Streptococcus bovis et rel and Streptococcus intermedius et rel, highly correlated to each other, are among top 10 biomarkers identified by RFE-SVM. In contrast, PermFIT yields the most remarkable improvements in all these models, reflected in both MSPE and correlation.

Figure 4 and Supplementary Fig. 5 show the negative $${\mathrm{log}}_{10}$$(p value)s and the importance scores estimated from each method. Among all the features, as expected, age is identified as the most significant factor. Nationality is also selected by PermFIT-DNN and PermFIT-SVM. Among all the microbiome features, Megasphaera elsdenii is identified by all the PermFIT methods. M. elsdenii is shown by prior studies to be one of the ruminal and intestinal lactate- and sugar-fermenting species51. M. elsdenii is also reported to massively reside in patients with an increase in BMI after bariatric surgery52. In addition, Eggerthella lenta is identified by PermFIT-SVM. E. lenta is not well-studied, but its potential role as an emerging pathogen has been increasingly recognized in recent years53. Lastly, uncultured clostridiales is identified by PermFIT-RF.

## Discussion

It is difficult to distinguish the contribution of individual input features in complex machine learning models, although these models offer more robustness and flexibility in modeling complex human diseases than parametric models. In this paper, we introduce PermFIT, a computationally efficient permutation-based feature importance test with applicability to various machine learning models such as DNN, RF, and SVM, to identify important features. As demonstrated by the applications to TCGA kidney cancer data and HITChip Atlas BMI data, PermFIT procedures show superior performance over all the other competing methods considered in the paper, which severely suffer from false positive or negative findings, leading to inferior prediction performance with top selected features. In contrast, feature selection via PermFIT procedures remarkably improves the performance of these predictive models. However, it is worth pointing out that the prediction improvement of PermFIT is restricted by the capability of each machine learning model framework. For example, RF is relatively inefficient in modeling interaction terms, thus the performance of PermFIT-RF may be limited for complex traits with strong gene-gene interactions. Overall, PermFIT coupled with DNN consistently shows superior empirical performance.

The proposed analytical tool, PermFIT, is computationally efficient and has broad applicability in addressing real-world problems. It can be implemented and incorporated into a variety of machine learning models with different types of outcomes, and without the need of model refitting. PermFIT provides researchers a useful tool for deciphering complex genetic architecture and disease etiologies of complex traits.

## Methods

### PermFIT

We start with the case of a continuous outcome. Let $$X\in {\mathcal{X}}$$, $$Y\in {\mathcal{Y}}$$, where X = (X1, . . . , Xp) is a p-dimensional covariate vector and the outcome variable Y is a continuous scalar. We assume $$E(Y| X)=\mu (X)$$, where $$\mu (\cdot )$$ is an unknown mapping from $${\mathcal{X}}$$ to $${\mathcal{Y}}$$, the residual ϵ = Y − μ(X) is independent of X, and $$0\,{<}\,{\sigma }^{2}=E({\epsilon }^{2})\,{<}\,\infty$$.

We define the feature importance score Mj for Xj, the jth feature in X(j = 1, . . . , p), as the expected squared difference between μ(X) and $$\mu \left({X}^{(j)}\right)$$, where $${X}^{(j)}=({X}_{1},...,{X}_{j-1},{X}_{j^{\prime} },{X}_{j+1},...,{X}_{p})$$ is equal to X with the jth covariate replaced by a random vector $${X}_{j^{\prime} }$$ whose elements are independently drawn from the distribution of Xj. The importance score Mj can be expressed as,

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{\left[\mu (X)-\mu \left({X}^{(j)}\right)\right]}^{2}.$$
(3)

Assuming X does not have redundant features, Mj is zero only when $$\mu (X)\equiv \mu \left({X}^{(j)}\right)$$ on $${\mathcal{X}}$$, implying that the jth element of X does not have any impact on μ(X); and is non-zero otherwise.

To obtain a clear understanding of Mj, we take the linear model as an example where μ(X) = Xβ + β0, with β = (β1, . . . , βp) consisting of p parameters. Under the linear assumption, (3) becomes:

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{({X}_{j}-{X}_{j^{\prime} })}^{2}{\beta }_{j}^{2}=2{\beta }_{j}^{2}{\mathrm{Var}}\,({X}_{j}).$$
(4)

Here, (4) is proportional to the squared standardized coefficient, which has been recognized as a popular measure of variable importance in multiple linear regression.
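As a quick sanity check of Eq. (4), the expectation can be approximated by Monte Carlo and compared with the closed form. The sketch below is illustrative only: the function names are hypothetical, and it assumes Xj is Gaussian with mean zero, although Eq. (4) itself only requires a finite variance.

```python
import random


def linear_importance_closed_form(beta_j, var_j):
    """Right-hand side of Eq. (4): 2 * beta_j^2 * Var(X_j)."""
    return 2.0 * beta_j ** 2 * var_j


def linear_importance_mc(beta_j, sd_j, n_draws=200_000, seed=0):
    """Monte Carlo estimate of E[(X_j - X_j')^2] * beta_j^2 for Gaussian X_j."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        x = rng.gauss(0.0, sd_j)        # original feature value
        x_prime = rng.gauss(0.0, sd_j)  # independent replacement draw
        total += (beta_j * (x - x_prime)) ** 2
    return total / n_draws
```

With `beta_j = 1.5` and `sd_j = 2.0`, the closed form gives 18, and the Monte Carlo estimate should land close to that value.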

Furthermore, Mj can be simply decomposed as follows:

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{\left[Y-\mu \left({X}^{(j)}\right)\right]}^{2}-{E}_{X}{[Y-\mu (X)]}^{2}.$$
(5)
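The decomposition in (5) follows from the model assumptions in one step. The short derivation below is added for completeness and assumes, in addition to what is stated above, that ϵ is independent of the replacement draw $${X}_{j^{\prime} }$$ as well as of X. Writing Y = μ(X) + ϵ,

$${E}_{X,{X}_{j^{\prime} }}{\left[Y-\mu \left({X}^{(j)}\right)\right]}^{2}=E({\epsilon }^{2})+2E\left[\epsilon \left\{\mu (X)-\mu \left({X}^{(j)}\right)\right\}\right]+{E}_{X,{X}_{j^{\prime} }}{\left[\mu (X)-\mu \left({X}^{(j)}\right)\right]}^{2}={\sigma }^{2}+{M}_{j},$$

where the cross term vanishes because E(ϵ) = 0 and ϵ is independent of $$(X,{X}_{j^{\prime} })$$. Since $${E}_{X}{[Y-\mu (X)]}^{2}=E({\epsilon }^{2})={\sigma }^{2}$$, subtracting the two expressions leaves Mj.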

Ideally, given the true form of $$\mu (\cdot )$$, Mj could be estimated through permutation based on (5). Let (Yi, Xi1, . . . , Xip), (i = 1, . . . , N) be N independent observations drawn from the distribution of (Y, X1, . . . , Xp). A permutation on one covariate Xj = (X1j, . . . , XNj) randomly samples the elements in Xj without replacement to generate a permuted version $$X^{\prime} =({X}_{{s}_{1},j},...,{X}_{{s}_{N},j})$$. The empirical permutation importance score is then,

$${M}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{{Y}_{i}-\mu \left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\mu ({X}_{i\cdot })\right\}}^{2}\right],$$
(6)

where $${X}_{i\cdot }=({X}_{i1},...,{X}_{ip})$$ and $${X}_{i\cdot }^{(j)}=({X}_{i1},...,{X}_{i,j-1},{X}_{{s}_{i},j},{X}_{i,j+1},...,{X}_{ip})$$. Let $${M}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\nolimits_{i = 1}^{N}{M}_{ij}^{(P)}$$, where $${M}_{ij}^{(P)}={\left\{{Y}_{i}-\mu \left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\mu ({X}_{i\cdot })\right\}}^{2}$$, then

$$E\left[{M}_{j}^{(P)}\right]=E\left[{M}_{ij}^{(P)}\right]=\frac{N-1}{N}{M}_{j}.$$
(7)

When N is large, Mj could be well approximated by $${M}_{j}^{(P)}$$. Besides, $${\mathrm{Var}}\,\left[{M}_{j}^{(P)}\right]\approx \frac{1}{N}{\mathrm{Var}}\,\left[{M}_{ij}^{(P)}\right]$$ which can be approximated by the empirical variance of $${M}_{ij}^{(P)}$$.
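The factor (N − 1)/N in (7) can be understood with a short argument, sketched here under the assumption that the permutation is drawn uniformly at random. With probability 1/N the index si equals i, in which case $${X}_{i\cdot }^{(j)}={X}_{i\cdot }$$ and $${M}_{ij}^{(P)}=0$$. Otherwise $${X}_{{s}_{i},j}$$ comes from a different, independent observation, and the expectation of $${M}_{ij}^{(P)}$$ equals Mj by (5). Hence

$$E\left[{M}_{ij}^{(P)}\right]=\frac{1}{N}\cdot 0+\frac{N-1}{N}{M}_{j}=\frac{N-1}{N}{M}_{j}.$$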

Let $$\widehat{\mu }(\cdot )$$ be the fitted approximation of $$\mu (\cdot )$$. According to (6), we propose to estimate $${M}_{j}^{(P)}$$ by

$${\widehat{M}}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{{Y}_{i}-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}\right].$$
(8)

If feature Xj is not associated with Y, then $$\mu \left({X}_{i\cdot }^{(j)}\right)=\mu ({X}_{i\cdot })$$ with corresponding $${M}_{j}^{(P)}=0$$, and Eq. (8) becomes,

$${\widehat{M}}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{\mu ({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{\mu ({X}_{i\cdot })-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}+2{\epsilon }_{i}\left\{\widehat{\mu }({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}\right].$$
(9)

Under universal consistency, the three terms are expected to converge to zero as N goes to infinity. However, for data with a finite sample size, the model $$\widehat{\mu}(\cdot )$$ may overfit, leading to $${\left\{\mu ({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}> {\left\{\mu ({X}_{i\cdot })-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}$$ and thus to a biased estimate of $${M}_{j}^{(P)}$$. To overcome this bias, we employ a cross-fitting strategy that separates the input data into training and validation sets, with one set being used to obtain $$\widehat{\mu }(\cdot )$$ and the other set to estimate $${\widehat{M}}_{j}^{(P)}$$. Let $${\widehat{\mu }}_{T}(\cdot )$$ be the estimate of $$\mu (\cdot )$$ from the training set, and $${{\mathcal{D}}}_{V}={\{{Y}_{i},{X}_{i\cdot }\}}_{i = 1}^{{N}_{V}}$$ be the validation set. Then

$${\widehat{M}}_{j}^{(P)}=\frac{1}{{N}_{V}}\mathop{\sum }\limits_{i=1}^{{N}_{V}}\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{T}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{T}({X}_{i\cdot })\right\}}^{2}\right],$$
(10)
$$\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P)}\right]=\frac{1}{{N}_{V}}\mathop{\sum }\limits_{i=1}^{{N}_{V}}{\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{T}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{T}({X}_{i\cdot })\right\}}^{2}-{\widehat{M}}_{j}^{(P)}\right]}^{2}.$$
(11)
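
Equations (10) and (11) can be sketched in a few lines of code. The following is a minimal illustration rather than the deepTL implementation: `permfit_single_split` and the `predict` callable are hypothetical names, the model is assumed to have been fitted on the training set only, and the standard error is taken as the square root of (11) divided by N_V, an assumption that follows the approximation Var[M_j] ≈ Var[M_ij]/N stated above. A one-sided normal p value is computed from the resulting z statistic.

```python
import math
import random


def permfit_single_split(predict, X_val, y_val, j, seed=0):
    """Permutation importance of feature j on a held-out validation set.

    predict: callable mapping one feature row (list of floats) to a
             prediction; assumed to be fitted on the training set only.
    Returns (importance, standard_error, one_sided_p_value).
    """
    rng = random.Random(seed)
    n = len(y_val)
    # Permute column j across validation subjects; other columns stay fixed.
    permuted_col = [row[j] for row in X_val]
    rng.shuffle(permuted_col)

    diffs = []
    for row, y, xj in zip(X_val, y_val, permuted_col):
        row_perm = list(row)
        row_perm[j] = xj
        # Per-subject term of Eq. (10): permuted loss minus original loss.
        diffs.append((y - predict(row_perm)) ** 2 - (y - predict(row)) ** 2)

    importance = sum(diffs) / n                                # Eq. (10)
    var_terms = sum((d - importance) ** 2 for d in diffs) / n  # Eq. (11)
    # Assumption: divide by n again to get the variance of the mean.
    se = math.sqrt(var_terms / n)
    if se == 0.0:
        # Degenerate case (all per-subject terms identical); 0.5 by convention.
        p_value = 0.5
    else:
        z = importance / se
        p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z > z), Z ~ N(0, 1)
    return importance, se, p_value
```

A feature that the fitted model ignores yields per-subject differences of exactly zero, so its importance is zero and the sketch returns a p value of 0.5.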

The one-sided p value can be obtained by assuming normality. To increase the power of important feature identification, K-fold cross-fitting can be adopted. Here, we randomly divide the data into K folds, denoted as V1, . . . , VK. For each Vk, k = 1, . . . , K, let $${\overline{V}}_{k}$$ denote the complementary set of Vk, which is used to fit the model $${\widehat{\mu }}_{k}(\cdot )$$. Then

$${\widehat{M}}_{ij}^{(P,CV)}=\mathop{\sum }\limits_{k=1}^{K}{\rm{I}}\left(i\in {V}_{k}\right)\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{k}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{k}({X}_{i\cdot })\right\}}^{2}\right],$$
(12)
$${\widehat{M}}_{j}^{(P,CV)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\widehat{M}}_{ij}^{(P,CV)},$$
(13)
$$\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P,CV)}\right]=\frac{1}{N}\mathop{\sum }\limits_{k=1}^{K}{\mathop{\sum} _{i\in {V}_{k}}}{\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{k}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{k}({X}_{i\cdot })\right\}}^{2}-{\widehat{M}}_{j}^{(P,CV)}\right]}^{2}.$$
(14)

The algorithm of PermFIT with cross-fitting is illustrated in Algorithm 1.

### Algorithm 1

Algorithms for PermFIT

1: Randomly divide the data into K folds.

2: for k = 1 to K do

3:  Denote the data in the kth fold as Vk and the rest of the data as $${\overline{V}}_{k}$$.

4:  Build the machine learning model with $${\overline{V}}_{k}$$, denoted as $${\widehat{\mu }}_{k}(\cdot )$$.

5:  for j = 1 to p do

6:    Calculate $${\widehat{M}}_{ij}^{(P,CV)}$$ for subjects in Vk.

7:  end for

8: end for

9: for j = 1 to p do

10:  Calculate $${\widehat{M}}_{j}^{(P,CV)}$$ and estimate $$\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P,CV)}\right]$$. Calculate the p value by assuming normality.

11: end for
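
The steps of Algorithm 1 can be sketched as follows. This is an illustrative outline under stated assumptions, not the deepTL code: `permfit_cross_fit` and the `fit` callable (which takes training data and returns a prediction function) are hypothetical, feature j is permuted within each held-out fold, and the standard error is taken as the square root of (14) divided by N so that it refers to the mean.

```python
import math
import random


def permfit_cross_fit(fit, X, y, n_folds=5, seed=0):
    """K-fold cross-fitted permutation importance for every feature.

    fit: callable (X_train, y_train) -> predict, where predict maps one
         row to a prediction (hypothetical interface for any learner).
    Returns a list of (importance, standard_error, p_value) per feature.
    """
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]   # step 1
    diffs = [[0.0] * n for _ in range(p)]               # per-subject terms, Eq. (12)

    for fold in folds:                                  # steps 2-8
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        base = {i: (y[i] - predict(X[i])) ** 2 for i in fold}
        for j in range(p):
            # Permute feature j among the held-out subjects only.
            shuffled = [X[i][j] for i in fold]
            rng.shuffle(shuffled)
            for i, xj in zip(fold, shuffled):
                row = list(X[i])
                row[j] = xj
                diffs[j][i] = (y[i] - predict(row)) ** 2 - base[i]

    results = []
    for j in range(p):                                  # steps 9-11
        m = sum(diffs[j]) / n                           # Eq. (13)
        var = sum((d - m) ** 2 for d in diffs[j]) / n   # Eq. (14)
        se = math.sqrt(var / n)  # assumption: variance of the mean
        # One-sided normal p value; 0.5 by convention if se is zero.
        p_val = 0.5 * math.erfc(m / se / math.sqrt(2.0)) if se > 0 else 0.5
        results.append((m, se, p_val))
    return results
```

Each model is fitted once per fold and then reused for all p features, which reflects the point that no refitting is needed per feature.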

### Binary outcome

For a binary outcome Y ∈ {0, 1}, we have μ(X) = E(Y | X) = Pr(Y = 1 | X) and define Mj as the expectation of the binomial deviance,

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}\left[Y\, {\mathrm{log}}\,\left(\frac{\mu (X)}{\mu ({X}^{(j)})}\right)+(1-Y)\, {\mathrm{log}}\,\left(\frac{1-\mu (X)}{1-\mu ({X}^{(j)})}\right)\right].$$
(15)

The empirical estimate of Mj can be similarly obtained by plugging in the estimate of μ(X(j)) and μ(X) as in the continuous data scenario.
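
A plug-in version of Eq. (15) might look like the sketch below. The function names are hypothetical, the inputs are assumed to be predicted probabilities from an already fitted classifier (with and without feature j permuted), and the clipping constant `eps` is an added numerical safeguard that is not part of the source.

```python
import math


def deviance_importance_terms(y, prob, prob_perm, eps=1e-12):
    """Per-subject terms of the binomial-deviance importance in Eq. (15).

    y: list of 0/1 labels.
    prob: predicted Pr(Y = 1 | X) on the original features.
    prob_perm: predicted Pr(Y = 1 | X^(j)) after permuting feature j.
    eps: clips probabilities away from 0 and 1 (assumption, avoids log(0)).
    """
    terms = []
    for yi, p, q in zip(y, prob, prob_perm):
        p = min(max(p, eps), 1.0 - eps)
        q = min(max(q, eps), 1.0 - eps)
        terms.append(
            yi * math.log(p / q) + (1 - yi) * math.log((1.0 - p) / (1.0 - q))
        )
    return terms


def deviance_importance(y, prob, prob_perm):
    """Empirical mean of the per-subject terms, the analogue of Eq. (10)."""
    terms = deviance_importance_terms(y, prob, prob_perm)
    return sum(terms) / len(terms)
```

The per-subject terms play the same role as the squared-error differences in the continuous case, so their empirical variance can be used for the normal-approximation test in the same way.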

### DNN with bootstrap aggregating

In this paper, we use feedforward and fully-connected deep neural networks (DNNs) to approximate the function μ(·). The DNN model contains L hidden layers of (n1, . . . , nL) hidden nodes that transform the initial input covariates X to the estimation of the continuous outcome Y. Let θ denote all the parameters in the DNN model. The fitted DNN, $$\widehat{\mu }(X;{\boldsymbol{\theta }})$$, is obtained by minimizing the empirical risk function,

$$\arg \mathop{\min }\limits_{{\boldsymbol{\theta }}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\ell \{{Y}_{i},\mu ({X}_{i};{\boldsymbol{\theta }})\}+\lambda {{\Omega }}({\boldsymbol{\theta }}),$$
(16)

where $$\ell (\cdot ,\cdot )$$ is the loss function dependent on the outcome type, Ω(θ) is a penalty on θ, and λ is a hyperparameter that controls the degree of regularization. The minimization is carried out via the mini-batch stochastic gradient descent algorithm, with Adam54 used to adjust the learning rate.

To increase the robustness and accuracy of DNNs, bootstrap aggregating (bagging) is applied55. In addition, due to the randomness of the initial parameters, some DNNs may not converge to a stable solution and hence perform poorly. In neural network ensembles, it is argued that “many could be better than all”, meaning that using a subset of bagged DNNs that fit the data well could be better than using all bagged DNNs25,56. Therefore, we adopt a scoring system to select the optimal subset of DNNs in the bagging procedure, following Mi et al.25. DNN with bagging has been implemented in the R package “deepTL” (available at https://github.com/SkadiEye/deepTL)57. Following Mi et al.25, for all the numerical analyses reported in this paper, we set the bagging size to 100, the batch size to 50, the number of hidden layers to 4 with 50, 40, 30, and 20 hidden nodes in successive layers, the penalty weight λ to 1E−4, and use rectified linear units as the activation function.
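
The idea of bagging followed by subset selection can be illustrated with a generic learner. The sketch below is a simplified stand-in and not the scoring system of Mi et al.: `bagged_fit` and the `fit` callable are hypothetical, and models are ranked by out-of-bag mean squared error purely for illustration.

```python
import random


def bagged_fit(fit, X, y, n_bags=100, keep=0.5, seed=0):
    """Bootstrap-aggregate a learner and keep the best-scoring fraction.

    fit: callable (X_train, y_train) -> predict (hypothetical interface).
    keep: fraction of bagged models retained, ranked by out-of-bag MSE.
          This ranking is an assumption standing in for the scoring
          system referenced in the text.
    Returns (ensemble_predict, number_of_models_kept).
    """
    rng = random.Random(seed)
    n = len(y)
    scored = []
    for _ in range(n_bags):
        boot = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        in_bag = set(boot)
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue  # no out-of-bag subjects to score this model
        predict = fit([X[i] for i in boot], [y[i] for i in boot])
        oob_mse = sum((y[i] - predict(X[i])) ** 2 for i in oob) / len(oob)
        scored.append((oob_mse, predict))
    if not scored:
        raise ValueError("no bagged model could be scored out of bag")
    scored.sort(key=lambda pair: pair[0])  # lowest out-of-bag error first
    n_keep = max(1, int(keep * len(scored)))
    kept = [predict for _, predict in scored[:n_keep]]

    def ensemble_predict(row):
        # Average the predictions of the retained models.
        return sum(predict(row) for predict in kept) / len(kept)

    return ensemble_predict, len(kept)
```

With `n_bags=100` this mirrors the bagging size quoted above; the `keep` fraction has no counterpart in the source and is only a placeholder for a data-driven selection rule.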

### SHAP, LIME, SNGM, and HRT

Shapley and LIME values are calculated using the R package “iml”. Feature importance scores of SHAP and LIME are defined as the mean absolute values of the Shapley and LIME values, respectively. HRT is implemented in R. SNGM-DNN is implemented in R, following Xing et al.21. To be consistent with the other approaches that do not provide p values, the implementation of SNGM-DNN also focuses on selecting top features. In SHAP-DNN, LIME-DNN, SNGM-DNN and HRT-DNN, the above-described bagged DNNs are applied.
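
The aggregation of local attributions into a global score, as described above, amounts to a column-wise mean of absolute values. A minimal sketch, assuming the Shapley or LIME values have already been computed and stored as a subjects-by-features matrix (the function names are hypothetical):

```python
def mean_abs_importance(attributions):
    """Global importance per feature from local attribution values.

    attributions: list of rows, one per subject, each holding one
                  Shapley (or LIME) value per feature.
    """
    n = len(attributions)
    p = len(attributions[0])
    return [sum(abs(row[j]) for row in attributions) / n for j in range(p)]


def rank_features(importance):
    """Feature indices ordered from most to least important."""
    return sorted(range(len(importance)), key=lambda j: -importance[j])
```

Because these scores come without p values, they support ranking and top-feature selection only.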

### RF and SVM

RF is implemented via the R package “randomForest”. Vanilla-RF importance and its standard error are generated from the “randomForest” function with 1000 trees. SVM is implemented through the R package “e1071” with radial kernels. The hyper-parameters in SVMs are searched via fivefold cross-validation. RFE-SVM is implemented with the function “rfe” in the R package “caret”.

### The simulation study and real data applications

In the simulation studies, PermFIT is performed by randomly splitting the samples into training (80%) and validation (20%) sets, and the importance score is estimated via (10) and (11). In real applications, HRT and PermFIT are conducted with 5-fold cross-fitting through (13) and (14). To eliminate the impact of the randomness of cross-fitting and other random factors in model fitting, we repeat each method 100 times and report the mean and standard deviation of MSPE, Pearson correlation, AUC or accuracy, and the median of the importance scores and p values. Features presented in figures are ordered by hierarchical clustering, which is implemented by the “hclust” function in the R package “stats”, where the dissimilarity is set to one minus the Pearson correlation.
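
The dissimilarity used for ordering features can be written out explicitly. The sketch below only builds the matrix of one minus the Pearson correlation between features; the helper names are hypothetical, and the clustering step itself and its linkage method are not specified in the text, so they are left out.

```python
import math


def pearson(u, v):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    # Raises ZeroDivisionError for a constant feature (correlation undefined).
    return cov / (su * sv)


def correlation_dissimilarity(features):
    """Symmetric matrix of 1 - Pearson correlation between features.

    features: list of feature vectors, each measured on the same subjects.
    """
    p = len(features)
    d = [[0.0] * p for _ in range(p)]
    for a in range(p):
        for b in range(a + 1, p):
            d[a][b] = d[b][a] = 1.0 - pearson(features[a], features[b])
    return d
```

Perfectly correlated features have dissimilarity 0 and perfectly anti-correlated features have dissimilarity 2, so strongly co-varying features end up adjacent in the ordering.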

For the TCGA kidney cancer application, RPPAs at the gene level are analyzed. We first remove the proteins that are not common across all three TCGA datasets (KIRC, KIRP, and KICH). In addition, we remove the proteins with perfect multicollinearity, after which 118 proteins are kept for further analysis.

For the HITChip Atlas data, the BMI level was originally grouped into six groups: underweight, lean, overweight, obese, severeobese and morbidobese, which we transform into numerical levels from 1 to 6 in our analysis. A total of 900 subjects remain for the analysis after subjects with missing BMI are excluded. Missing information on nationality is grouped into a new category named “Unknown”. Missing values in the microbiome data are imputed with the median values across all samples. The analysis of the microbiome data is based on the compositional values; because these values are constrained to sum to 1, we remove the cell proportion from the last group, after which a log-transformation is applied to the remaining compositions.
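
The preprocessing steps described above can be sketched as small helper functions. The names are hypothetical, missing values are represented as `None`, the median is assumed to be taken per feature across samples, and the compositions are assumed to be strictly positive (the text does not say how zeros, if any, are handled).

```python
import math
import statistics

# BMI groups in the order given in the text, mapped to levels 1..6.
BMI_LEVELS = ["underweight", "lean", "overweight", "obese",
              "severeobese", "morbidobese"]


def encode_bmi(group):
    """Map a BMI group label to its numerical level (1 to 6)."""
    return BMI_LEVELS.index(group) + 1


def impute_median(column):
    """Replace missing entries (None) with the median of observed values."""
    observed = [v for v in column if v is not None]
    med = statistics.median(observed)
    return [med if v is None else v for v in column]


def log_compositions(row):
    """Drop the last component (sum-to-1 constraint) and log the rest.

    Assumes every retained proportion is strictly positive.
    """
    return [math.log(v) for v in row[:-1]]
```

Dropping one component removes the exact linear dependence among the proportions, so the remaining log-transformed columns are not perfectly collinear by construction.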

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

## Data availability

The TCGA datasets are available at the LinkedOmics website (http://linkedomics.org), among which three studies, KIRC, KICH, and KIRP, are used (dbGaP Study Accession: phs000178). The HITChip Atlas data is available in R package “microbiome” (https://microbiome.github.io/). We provide final datasets used in our analysis (https://github.com/SkadiEye/deepTL/tree/master/permfit/code/cleaned-dat.RDS). Source data are provided with this paper.

## Code availability

PermFIT is implemented in our R package “deepTL” (https://github.com/SkadiEye/deepTL)57. We also provide source code for replicating the simulation studies and real data applications (https://github.com/SkadiEye/deepTL/tree/master/permfit/code).

## References

1. Cancer Genome Atlas Research N. et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 45, 1113–1120 (2013).

2. Hanahan, D. & Weinberg, R. A. Hallmarks of cancer: the next generation. Cell 144, 646–674 (2011).

3. Wilhelm, M. et al. Mass-spectrometry-based draft of the human proteome. Nature 509, 582–587 (2014).

4. Gawehn, E., Hiss, J. A. & Schneider, G. Deep learning in drug discovery. Mol. Inf. 35, 3–14 (2016).

5. Erickson, B. J., Korfiatis, P., Akkus, Z. & Kline, T. L. Machine learning for medical imaging. Radiographics 37, 505–515 (2017).

6. Zhang, A. et al. Metabolomics in diagnosis and biomarker discovery of colorectal cancer. Cancer Lett. 345, 17–20 (2014).

7. Craven, M. & Shavlik, J. W. Extracting tree-structured representations of trained networks. In Advances in Neural Information Processing Systems, 24–30 (1996).

8. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

9. Altmann, A., Toloşi, L., Sander, O. & Lengauer, T. Permutation importance: a corrected feature importance measure. Bioinformatics 26, 1340–1347 (2010).

10. Candes, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: ‘model-X’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc.: Series B (Statistical Methodology) 80, 551–577 (2018).

11. Tansey, W., Veitch, V., Zhang, H., Rabadan, R. & Blei, D. M. The holdout randomization test: principled and easy black box feature selection. arXiv preprint arXiv:1811.00645 (2018).

12. Lu, Y., Fan, Y., Lv, J. & Noble, W. S. Deeppink: reproducible feature selection in deep neural networks. In Advances in Neural Information Processing Systems, 8676–8686 (2018).

13. Molnar, C. Interpretable machine learning, https://christophm.github.io/interpretable-ml-book/ (2019).

14. Ribeiro, M. T., Singh, S. & Guestrin, C. Why should i trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135–1144 (ACM, 2016).

15. Shapley, L. A value for n-person games, contributions to the theory of games. (ed. Harold W. Kuhn) (1953).

16. Lundberg, S. M. & Lee, S.-I. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, 4765–4774 (2017).

17. Horel, E. & Giesecke, K. Computationally efficient feature significance and importance for machine learning models. arXiv preprint arXiv:1905.09849 (2019).

18. Jordon, J., Yoon, J. & van der Schaar, M. Knockoffgan: Generating knockoffs for feature selection using generative adversarial networks. In International Conference on Learning Representations (2018).

19. Arjovsky, M. & Bottou, L. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862 (2017).

20. Xing, X., Zhao, Z. & Liu, J. S. Controlling false discovery rate using gaussian mirrors. arXiv preprint arXiv:1911.09761 (2019).

21. Xing, X., Gui, Y., Dai, C. & Liu, J. S. Neural gaussian mirror for controlled feature selection in neural networks. arXiv preprint arXiv:2010.06175 (2020).

22. Dai, C., Lin, B., Xing, X. & Liu, J. S. False discovery rate control via data splitting. arXiv preprint arXiv:2002.08542 (2020).

23. Putin, E. et al. Deep biomarkers of human aging: application of deep neural networks to biomarker development. Aging 8, 1021–1033 (2016).

24. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

25. Mi, X., Zou, F. & Zhu, R. Bagging and deep learning in optimal individualized treatment rules. Biometrics 75, 674–684 (2019).

26. Suykens, J. A. K. & Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 9, 293–300 (1999).

27. Guyon, I., Weston, J., Barnhill, S. & Vapnik, V. Gene selection for cancer classification using support vector machines. Mach. Learn. 46, 389–422 (2002).

28. Tomczak, K., Czerwińska, P. & Wiznerowicz, M. The cancer genome atlas (tcga): an immeasurable source of knowledge. Contemporary Oncol. 19, 68–77 (2015).

29. Dzneladze, I. Pan-cancer study of INPP4B reveals its unexpected oncogene-like role and prognostic significance. PhD thesis (2017).

30. Sun, Y., Ding, H., Liu, X., Li, X. & Li, L. Inpp4b overexpression enhances the antitumor efficacy of parp inhibitor ag014699 in mda-mb-231 triple-negative breast cancer cells. Tumor Biology 35, 4469–4477 (2014).

31. Hsu, I. et al. Estrogen receptor alpha prevents bladder cancer development via inpp4b inhibited akt pathway in vitro and in vivo. Oncotarget 5, 7917–7935 (2014).

32. Dzneladze, I. et al. Subid, a non-median dichotomization tool for heterogeneous populations, reveals the pan-cancer significance of inpp4b and its regulation by evi1 in aml. PLoS ONE 13, e0191510 (2018).

33. Eddy, A. A. Plasminogen activator inhibitor-1 and the kidney. Am. J. Physiol.-Renal Physiol. 283, F209–F220 (2002).

34. Małgorzewicz, S., Skrzypczak-Jankun, E. & Jankun, J. Plasminogen activator inhibitor-1 in kidney pathology. Int. J. Mol. Med. 31, 503–510 (2013).

35. Hofmann, R. et al. Prognostic value of urokinase plasminogen activator and plasminogen activator inhibitor-1 in renal cell cancer. J. Urol. 155, 858–862 (1996).

36. Weiss, R. H. et al. p21 is a prognostic marker for renal cell carcinoma: implications for novel therapeutic approaches. J. Urol. 177, 63–69 (2007).

37. Inoue, H., Hwang, S. H., Wecksler, A. T., Hammock, B. D. & Weiss, R. H. Sorafenib attenuates p21 in kidney cancer cells and augments cell death in combination with dna-damaging chemotherapy. Cancer Biol. Ther. 12, 827–836 (2011).

38. Zaman, M. S. et al. Up-regulation of microrna-21 correlates with lower kidney cancer survival. PLoS ONE 7, e31060 (2012).

39. Campbell, L., Jasani, B., Griffiths, D. F. R. & Gumbleton, M. Phospho-4e-bp1 and eif4e overexpression synergistically drives disease progression in clinically confined clear cell renal cell carcinoma. Am. J. Cancer Res. 5, 2838–2848 (2015).

40. Akhmadishina, L. Z. et al. DNA repair xrcc1, xpd genes polymorphism as associated with the development of bladder cancer and renal cell carcinoma. Genetika 50, 481–490 (2014).

41. Srivastava, M. et al. Anx7 as a bio-marker in prostate and breast cancer progression. Dis. Mark. 17, 115–120 (2001).

42. Srivastava, M. et al. Anxa7 expression represents hormone-relevant tumor suppression in different cancers. Int. J. Cancer 121, 2628–2636 (2007).

43. Smitherman, A. B., Mohler, J. L., Maygarden, S. J. & Ornstein, D. K. Expression of annexin i, ii and vii proteins in androgen stimulated and recurrent prostate cancer. J. Urol. 171, 916–920 (2004).

44. Srivastava, M. et al. Prognostic impact of anx7-gtpase in metastatic and her2-negative breast cancer patients. Clin. Cancer Res. 10, 2344–2350 (2004).

45. Schramek, D. et al. Direct in vivo rnai screen unveils myosin iia as a tumor suppressor of squamous cell carcinomas. Science 343, 309–313 (2014).

46. De Boeck, A. et al. Bone marrow-derived mesenchymal stem cells promote colorectal cancer progression through paracrine neuregulin 1/her3 signalling. Gut 62, 550–560 (2013).

47. Huang, H.-E. et al. A recurrent chromosome breakpoint in breast cancer at the nrg1/neuregulin 1/heregulin gene. Cancer Res. 64, 6840–6844 (2004).

48. Sanchez-Cespedes, M. et al. Inactivation of lkb1/stk11 is a common event in adenocarcinomas of the lung. Cancer Res. 62, 3659–3662 (2002).

49. Lahti, L., Salojärvi, J., Salonen, A., Scheffer, M. & De Vos, W. M. Tipping elements in the human intestinal ecosystem. Nature Commun. 5, 4344 (2014).

50. Lahti, L. & Shetty, S. Microbiome r package, 2012–2019.

51. Marounek, M., Fliegrova, K. & Bartos, S. Metabolism and some characteristics of ruminal strains of megasphaera elsdenii. Appl. Environm. Microbiol. 55, 1570–1573 (1989).

52. Federico, A. et al. Gastrointestinal hormones, intestinal microbiota and metabolic homeostasis in obese patients: effect of bariatric surgery. In Vivo 30, 321–330 (2016).

53. Gardiner, B. J., Korman, T. M. & Junckerstorff, R. K. Eggerthella lenta bacteremia complicated by spondylodiscitis, psoas abscess, and meningitis. J. Clin. Microbiol. 52, 1278–1280 (2014).

54. Byrd, R. H., Chin, G. M., Nocedal, J. & Wu, Y. Sample size selection in optimization methods for machine learning. Math. Program. 134, 127–155 (2012).

55. Breiman, L. Bagging predictors. Mach. Learn. 24, 123–140 (1996).

56. Zhou, Z. H., Wu, J. X. & Tang, W. Ensembling neural networks: many could be better than all. Artif. Intell. 137, 239–263 (2002).

57. Mi. Skadieye/deeptl: Second release, https://doi.org/10.5281/zenodo.4568807 (February, 2021).

## Acknowledgements

The work was partially supported by the National Institutes of Health Grants R01AI143886, R01CA219896, CCSG P30 CA013696, and P30 ES010126.

## Author information

### Contributions

X.M. implemented the algorithms in “deepTL” for the proposed method and performed numerical analyses. All authors contributed to the methodology development and writing the manuscript.

### Corresponding author

Correspondence to Jianhua Hu.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

Peer review information Nature Communications thanks Ming Li and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

Mi, X., Zou, B., Zou, F. et al. Permutation-based identification of important biomarkers for complex diseases via machine learning models. Nat Commun 12, 3008 (2021). https://doi.org/10.1038/s41467-021-22756-2
