Introduction

With the advancement of high-throughput technologies, massive amounts of high-dimensional omics data have been generated and made available through large public databases, thanks to great data sharing efforts by the research community, such as The Cancer Genome Atlas (TCGA)1. These data are valuable for elucidating the molecular mechanisms of disease phenotypes2,3. However, the study of complex human disease remains challenging due to convoluted disease etiologies and intricate underlying molecular mechanisms at the genetic, genomic, and proteomic levels. Many popular machine learning algorithms, such as non-linear kernel support vector machines (SVMs), random forests (RFs), and deep neural networks (DNNs), have been developed to build more powerful predictive models for biomedical and bio-omics data regarding clinical outcomes, e.g., drug response4 and medical imaging classification5. While enjoying modeling flexibility and robustness, these frameworks suffer from non-transparency and difficulty in interpreting the role of each individual feature, owing to their sophisticated algorithms, compared with more interpretable parametric models such as linear regression, logistic regression, and decision trees. Nonetheless, identifying important biomarkers associated with complex human disease is a critical pursuit that helps researchers establish novel hypotheses regarding the prevention, diagnosis, and treatment of complex human diseases. Accurate identification of such biomarkers not only provides valuable insights into the underlying genetic architecture and disease etiology but also offers great potential for early disease diagnosis, improved precision medicine, innovative treatment development, and accurate prediction of disease risk and progression6.

To address the non-transparency of machine learning models in association studies between disease outcomes and predictors, feature importance score strategies have been proposed and extensively investigated7,8,9,10,11,12,13, including surrogate models7,14, Shapley value-based methods15,16, conditional randomization tests (CRTs)10, knockoff models (i.e., model-X)10,12, and permutation-based feature importance8. Surrogate modeling methods approximate complex models with explanatory surrogate models, such as linear models or decision trees. While they enjoy great flexibility in the choice of surrogate, the resulting feature importance is still restricted to the selected explanatory model, which might be misspecified13. Shapley value-based methods, such as SHAP16, provide localized feature characterization based on game theory, but they are computationally intensive and do not guarantee a valid test. Both CRT and model-X knockoff were proposed in Candes et al.10; CRT is less preferred due to its expensive computational cost, whereas model-X knockoff performs the feature importance test more efficiently by constructing knockoff features. Recently, model-X knockoff was adapted to DNN models12. Tansey et al.11 proposed the holdout randomization test (HRT) to reduce the computational cost of CRT by avoiding model refitting.

The common disadvantage of CRT, HRT, and model-X knockoff is that they all depend on the assumption of a known covariance structure10; when the covariance structure is not accurately estimated, their performance can be severely impacted17. Although KnockoffGAN18, an extension of model-X knockoff, does not suffer from this disadvantage, adversarial networks are difficult to train19 and require more tuning. Another strategy that avoids the known covariance structure assumption is the family of approaches based on Gaussian mirrors20,21,22. Specifically, Xing et al.21 proposed the individual neural Gaussian mirror (INGM) and the simultaneous neural Gaussian mirror (SNGM). However, INGM requires repetitive model fitting, which is computationally costly, while SNGM is efficient but can suffer performance loss21. The permutation-based feature importance learning method, another popular approach for feature selection, measures the change in prediction error due to the shuffling of a feature: the larger the increase in prediction error, the greater the impact a feature makes on the outcome of interest. Unlike CRT, HRT, or model-X knockoff, permutation-based feature selection does not require prior knowledge of the feature distribution and is thus more statistically robust. Several permutation-based feature importance methods have been proposed, with applications mainly to random forests and DNNs8,9,23. These methods, however, either do not conduct any statistical inference or cannot offer valid inference on the feature importance. For example, Putin et al.23 applied permutation-based importance scores to DNNs to identify biomarkers associated with human aging, but provided no formal statistical testing. Notably, Altmann et al.9 proposed a corrected permutation-based importance score approach for random forests, which, however, is difficult to generalize to other machine learning frameworks.

To overcome the aforementioned challenges, we propose a general permutation-based feature importance test (abbreviated as PermFIT) for complex machine learning models, which (i) couples the permutation test with cross-fitting to obtain a valid importance score test that properly controls the type-I error; and (ii) selects important features to further improve the accuracy of the predictive models. We implement PermFIT for the following machine learning models: DNN, RF, and SVM. More specifically, PermFIT first approximates the function that maps features to the outcome; based on this approximation, it then evaluates the importance score of each feature, defined as the expected increase in prediction error due to the permutation of the feature. Computationally, the PermFIT framework does not require refitting the model. To reduce the bias in importance score estimation caused by potential model overfitting, we adopt cross-fitting to ensure the validity of the test statistics. PermFIT is motivated by two benchmark datasets: the Reverse Phase Protein Arrays (RPPA) data from three kidney cancer studies in The Cancer Genome Atlas (TCGA) and the HITChip Atlas microbiome data regarding body mass index (BMI). Nevertheless, PermFIT has broad applicability to a wide variety of biomedical data and beyond.

Results

To evaluate the performance of PermFIT, we first conduct comprehensive simulation studies under various scenarios with different sample sizes and correlation structures among features. We then apply it to two real-world datasets: the Reverse Phase Protein Arrays (RPPA) data from three kidney cancer studies in TCGA and the HITChip Atlas microbiome data. We apply PermFIT to three commonly used machine learning methods: DNN24,25, RF8, and SVM26, denoted as PermFIT-DNN, PermFIT-RF, and PermFIT-SVM, respectively. We also compare PermFIT with several existing popular feature selection methods for DNN, RF, and SVM: SHAP16, LIME14, the holdout randomization test11, and the simultaneous neural Gaussian mirror21 with DNN (denoted as SHAP-DNN, LIME-DNN, HRT-DNN, and SNGM-DNN, respectively), the importance evaluation of Breiman8 for RF, an ensemble approach based on decision trees (denoted as Vanilla-RF), and SVM with recursive feature elimination27 (denoted as RFE-SVM). SHAP-DNN, LIME-DNN, and RFE-SVM utilize an importance score to rank input features, from which top features are selected. For each feature, Vanilla-RF provides an importance score estimate and its associated standard error, with which the statistical significance of the feature importance can be tested. HRT provides a p value for each feature without importance scores. We evaluate these methods as follows: (i) we apply each method to the training data with all the input features, estimate the feature importance scores and p values, and assess the type-I error; (ii) we refit each model with its corresponding top-ranked important variables, and re-evaluate its goodness-of-fit and prediction improvement.

Simulation studies

We examine the performance of the proposed methods under several simulation scenarios. First, we generate continuous data from the following model,

$$Y={X}_{1}+2\, {\mathrm{log}}\,\Big(1+2{X}_{{p}_{0}+1}^{2}+{\big({X}_{2{p}_{0}+1}+1\big)}^{2}\Big)+{X}_{3{p}_{0}+1}{X}_{4{p}_{0}+1}+\epsilon ,$$
(1)

where X is a p-dimensional random variable drawn from MVN(0, Σ), p = 10p0, p0 = 10, Σ = diag{Σ1, . . . , Σ10} is a block-diagonal matrix, \({{{\Sigma }}}_{1}=...={{{\Sigma }}}_{10}={\{{\sigma }_{ij}\}}_{0\,{<}\,i,j\le {p}_{0}}\) are p0 × p0 matrices with σij = 1 for i = j and σij = ρ for i ≠ j, and ϵ ~ N(0, 1). N independent observations are drawn from the distribution of (Y, X) for the training set and 10,000 for the test set, the latter being used to evaluate model fitting performance. To mimic real-world data, we introduce correlations among variables by blocks, and let one variable from each of the first 5 blocks carry a signal. We define S0 and S1 as the sets containing all the null features that are, respectively, correlated and uncorrelated with the causal features, i.e., S0 = {Xj: j ≤ 5p0 and j ≠ 1, p0 + 1, 2p0 + 1, 3p0 + 1, 4p0 + 1} and S1 = {Xj: j > 5p0}. We consider various simulation settings with different values of ρ ∈ {0, 0.2, 0.5, 0.8} and N ∈ {1000, 5000}. Each simulation scenario is replicated 100 times.
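To make the design concrete, a minimal R sketch of this data-generating process is given below; the seed and the use of a Cholesky factor are our own choices, while the block structure and effects follow Eq. (1):

set.seed(1)
p0 <- 10; p <- 10 * p0; N <- 1000; rho <- 0.5
## one p0 x p0 within-block covariance: unit variance, within-block correlation rho
Sigma_block <- matrix(rho, p0, p0); diag(Sigma_block) <- 1
R <- chol(Sigma_block)
## X ~ MVN(0, Sigma) with Sigma = diag{Sigma_1, ..., Sigma_10}
X <- do.call(cbind, lapply(1:10, function(b) matrix(rnorm(N * p0), N, p0) %*% R))
## outcome from Eq. (1): linear, nonlinear, and interaction effects plus N(0, 1) noise
Y <- X[, 1] + 2 * log(1 + 2 * X[, p0 + 1]^2 + (X[, 2 * p0 + 1] + 1)^2) +
  X[, 3 * p0 + 1] * X[, 4 * p0 + 1] + rnorm(N)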

The results are displayed in Fig. 1 and Table 1. Figure 1a displays the detailed feature importance scores generated by each method under consideration. Since HRT does not provide importance scores, we use \(-{\mathrm{log}}_{10}\)(p value) instead. Note that the estimated importance scores from the PermFIT methods and Vanilla-RF are on the same scale, while those from SHAP-DNN, LIME-DNN, and RFE-SVM are not. For X1, whose effect is linear, the importance scores from PermFIT-DNN and PermFIT-SVM are higher than those from the RF-based framework, due to the restricted tree-based modeling nature of RF. In addition, the RF-based framework can barely detect the interaction between \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\), because the split rule in tree-based methods is less effective at capturing such interactions. As expected, as the within-block correlation ρ increases, the estimated importance scores from all methods deviate further from their estimands. However, PermFIT-SVM retains high power in detecting the true positive features. It is also noticeable that, as ρ increases, Vanilla-RF and PermFIT-SVM tend to identify null features that are correlated with the causal features. Compared with Vanilla-RF, PermFIT-RF has fewer false positive discoveries. Overall, PermFIT-DNN provides the most precise and stable importance measure for differentiating the true positive features from the null features.

Fig. 1: Simulation results on continuous outcomes.

a Estimated feature importance for the five true causal features: X1, \({X}_{{p}_{0}+1}\), \({X}_{2{p}_{0}+1}\), \({X}_{3{p}_{0}+1}\), \({X}_{4{p}_{0}+1}\), and the two null feature sets: S0 and S1. b Mean squared prediction error (MSPE) for the methods in comparison. DNN, RF, or SVM: the respective model with all features; PermFIT-DNN, SHAP-DNN, LIME-DNN, HRT-DNN, SNGM-DNN, PermFIT-RF, Vanilla-RF, PermFIT-SVM, or RFE-SVM: the respective model after feature selection. Data are presented as mean values ± s.d. Simulations in each scenario are repeated 100 times. Source data are provided as a Source Data file.

Table 1 Simulation results on continuous outcomes.

The frequency (percentage) with which the important variables are detected by each method is presented in Table 1. For Vanilla-RF and the PermFIT methods, which provide p values, the significance level is controlled at 0.05, while for RFE-SVM, the top 10 features with the largest importance scores are selected. First, at ρ = 0, PermFIT controls the rate of significant findings across all null features at around 0.05, suggesting that the type-I error is well controlled, while Vanilla-RF has a type-I error of 0.09, nearly double that of PermFIT. When N = 1000, the type-I error of HRT-DNN is slightly inflated. Moreover, LIME-DNN and SNGM-DNN show limited ability to identify features with nonlinear effects, such as \({X}_{{p}_{0}+1}\), \({X}_{3{p}_{0}+1}\), and \({X}_{4{p}_{0}+1}\). SHAP-DNN, on the other hand, is able to rank the important features highly based on its importance scores; however, it fails to offer a valid test for those scores, and specifically, its type-I error and power depend on correctly specifying the number of important features. When ρ increases to 0.5 or 0.8, RFE-SVM tends to select the null features that are correlated with the true causal features, i.e., those in S0, more frequently than \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\), the two causal variables that interact with each other, demonstrating its limited capability to detect variables with interaction effects in the presence of correlation. In contrast, PermFIT-SVM consistently identifies \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\) at a much higher frequency than all the null features. Compared with PermFIT-RF, Vanilla-RF has higher power in detecting \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\), but also produces remarkably more false positive findings among features in S0. For example, at ρ = 0.8 and N = 1000, it yields a false positive rate above 80% in S0, indicating far inferior feature selection performance. In all these scenarios, PermFIT-DNN consistently identifies the causal features while controlling false positive findings at a much lower rate than Vanilla-RF, PermFIT-RF, and PermFIT-SVM.

After feature selection, the prediction performance of almost all the models improves. Figure 1b displays the mean squared prediction error (MSPE) of each model, (i) with the full set of input features, denoted as DNN, RF, and SVM, respectively; and (ii) with the top features selected by the PermFIT methods and HRT-DNN at the significance level of 0.1, or the top 20 features from SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM. The selected features boost the prediction accuracy of all models, except RFE-SVM, LIME-DNN, and SNGM-DNN, across all simulation scenarios. LIME-DNN and SNGM-DNN fail to identify certain important features, which leads to deterioration of model performance. In addition, at ρ = 0.8, corresponding to high correlation among within-block input features, RFE-SVM fails to improve the model fit over SVM because of feature selection failure, in particular on \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\); its inferior performance relative to PermFIT-SVM is clearly observed. Moreover, PermFIT-RF outperforms Vanilla-RF in terms of MSPE, because the latter yields more false positives and cannot effectively reduce the feature dimension. We note that PermFIT-DNN and HRT-DNN consistently outperform all the other methods in comparison, due to their high success rate in identifying true positive features while maintaining a considerably low false positive rate. In particular, PermFIT-DNN has a lower MSPE than HRT-DNN when N = 1000 and ρ ≤ 0.2, and similar MSPE values in the other scenarios.

To further investigate the small-sample performance of these methods, we conduct additional simulations with (N = 300, p = 100) and (N = 500, p = 200), and report the results in Table 2. The type-I errors of the PermFIT-based methods are not much affected by the change of N and p in these more challenging cases, while those of HRT-DNN are severely inflated, likely because, with a smaller N or a larger p, HRT-DNN fails to estimate the covariance matrix of the input features accurately.

Table 2 Simulation results on continuous outcomes with smaller sample size and/or larger dimension.

We further conduct a simulation study on binary outcomes generated from the following model:

$$P(Y=1| X)={\rm{expit}}\Big(4{X}_{1}+8\, {\mathrm{log}}\,\big(1+2{X}_{{p}_{0}+1}^{2}+{\big({X}_{2{p}_{0}+1}+1\big)}^{2}\big)+4{X}_{3{p}_{0}+1}{X}_{4{p}_{0}+1}-11\Big),$$
(2)

where \({\rm{expit}}(x)=1/(1+\exp (-x))\). All other data structures, including X, are generated in the same way as in the continuous case. Similar conclusions are observed, with details presented in Supplementary Table 1 and Supplementary Fig. 1.
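A corresponding R sketch for the binary design, reusing X, p0, and N from the continuous-case sketch above:

expit <- function(x) 1 / (1 + exp(-x))
prob <- expit(4 * X[, 1] +
              8 * log(1 + 2 * X[, p0 + 1]^2 + (X[, 2 * p0 + 1] + 1)^2) +
              4 * X[, 3 * p0 + 1] * X[, 4 * p0 + 1] - 11)
Y <- rbinom(N, 1, prob)  ## binary outcome from Eq. (2)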

TCGA kidney tumor data

A large collection of clinical and multi-omics data has been made publicly available by the TCGA research project1. In our analysis, we included three studies of kidney-related cancer types from the TCGA research network: kidney renal clear cell carcinoma (KIRC, N1 = 537), kidney renal papillary cell carcinoma (KIRP, N2 = 291), and kidney chromophobe (KICH, N3 = 113). We defined long-term survivors (LTS) as patients who survived more than five years after diagnosis, and short-term survivors (STS) as patients who died within five years. We aimed to predict the probability of a patient being in the LTS group and to identify significant biomarkers that contribute to the classification of LTS/STS status. We included 188 LTS and 178 STS subjects with known survival status in our analysis. We focused our analysis on the expression data of 118 proteins extracted from reverse phase protein arrays (RPPAs)—a highly sensitive, reproducible, and high-throughput proteomic method for protein expression profiling28.

The negative \({\mathrm{log}}_{10}\)(p value)s and the estimated importance scores from each method are presented in Fig. 2 and Supplementary Fig. 3. HRT-DNN, Vanilla-RF, and the PermFIT models control the FDR at 0.1, while SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM select the 10 features with the largest importance scores. We notice that moderate correlations generally exist among the proteins (see Supplementary Fig. 2). However, six proteins, SRC, RAF1, RB1, RPS6, YWHAZ, and EGFR, are highly correlated and clustered together by hierarchical clustering in Fig. 2. Among them, EGFR, YWHAZ, RPS6, RB1, and SRC are identified by Vanilla-RF, and RPS6, RB1, and SRC are selected by RFE-SVM, while none of these biomarkers is selected by any PermFIT procedure. According to our observations in the simulation studies, both Vanilla-RF and RFE-SVM tend to identify false positive biomarkers in the presence of high correlation among features, casting doubt on the validity of their biomarker selection results here. In addition, LIME-DNN identifies a very different set of important biomarkers compared to SHAP-DNN, HRT-DNN, SNGM-DNN, and PermFIT-DNN.

Fig. 2: Negative \({\mathrm{log}}_{10}\)p values for TCGA kidney cancer data.

Important features selected by each method are marked in red. Since SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not produce p values, their importance scores are presented instead, and the 10 features with the top importance scores are marked. The highly correlated features (see the dendrogram on the right for details) selected by RFE-SVM, Vanilla-RF, and SNGM-DNN, but not by the PermFIT methods, are highlighted. Source data are provided as a Source Data file.

Since the underlying genetic truth is unknown, we instead use the model performance improvement estimated via 5-fold cross-validation, randomly repeated 100 times (see Fig. 3a, b; Supplementary Table 2), as a surrogate measure for evaluating the relative quality of the selected features. As in the simulation study, we include features with p values smaller than 0.1 for HRT-DNN, Vanilla-RF, and the PermFIT methods, and the top 20 features for SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM. PermFIT-RF improves the accuracy from 0.694 to 0.732 on average, while Vanilla-RF only improves it to 0.713. Moreover, PermFIT-SVM elevates the accuracy from 0.69 to 0.744, outperforming RFE-SVM (0.709). Similar to the simulation results, PermFIT-DNN and HRT-DNN achieve the highest accuracy (0.751 and 0.750, respectively), better than SHAP-DNN (0.731), LIME-DNN (0.650), and SNGM-DNN (0.723). The same conclusion is further confirmed by the area under the ROC curve (AUC) results. In summary, it is evident that the PermFIT procedures consistently perform more efficient and accurate feature selection across the various machine learning frameworks.

Fig. 3: Model performance improvement from feature selection.

a, b Fivefold cross-validated prediction accuracy and AUC for the TCGA kidney cancer data. c, d Fivefold cross-validated MSPE and Pearson correlation (between the true outcome and the prediction) for the HITChip Atlas data. The fivefold cross-validation evaluation is randomly repeated 100 times. Data are presented as mean values ± s.d. Source data are provided as a Source Data file.

Regarding the identified biomarkers, four genes—CDKN1A, EIF4EBP1, INPP4B, and SERPINE1—are identified by all three PermFIT methods as significantly associated with the survival status. Interestingly, all four genes have been reported to be cancer related. In particular, INPP4B, identified as the most significant biomarker by all three methods (p value = 1.3E − 05 by PermFIT-DNN, 9.1E − 07 by PermFIT-RF, and 4.5E − 05 by PermFIT-SVM), encodes inositol polyphosphate-4-phosphatase, type II, a dual specificity phosphatase. Low INPP4B expression has recently been reported to be associated with shorter survival in kidney clear cell, liver hepatocellular, and bladder urothelial carcinomas, and with longer survival in pancreatic adenocarcinoma29. It is also related to acute myeloid leukemia, breast cancer, and bladder cancer30,31,32. SERPINE1 encodes plasminogen activator inhibitor-1, which plays an important role in various diseases, in particular kidney pathology and renal cell cancer33,34,35. In addition, the CDKN1A-encoded protein, CDK-interacting protein 1, has been reported as a prognostic marker for renal cell cancer36, and has an effect on kidney cancer cell death37 as well as kidney cancer survival38. Similarly, EIF4EBP1 affects disease progression in renal cell carcinoma39.

Moreover, the DNA repair protein XRCC1, identified by PermFIT-DNN and PermFIT-SVM, has been shown to be associated with bladder cancer40. ANXA7, identified by PermFIT-DNN and PermFIT-RF, is reported to be associated with prostate cancer and breast cancer41,42, and its encoded protein has an impact on both diseases43,44. Furthermore, MYH9 and NRG1 are identified by PermFIT-DNN. Myosin-9, encoded by MYH9, has been discussed for its role as a tumor suppressor45, and NRG1 is reported to be related to multiple cancer types46,47. Lastly, PermFIT-RF identifies a novel gene, STK11, whose role in kidney cancer is unknown; however, inactivation of STK11 has been reported as a common event in lung adenocarcinomas48.

HITChip atlas data

In the HITChip Atlas study, data were collected from 1006 adults in 15 western countries49 using the HITChip, and are publicly available in the R library “microbiome”50. Besides demographic and clinical variables, the HITChip Atlas data include microbiome measurements from 130 taxonomic groups summarized at the genus level, covering the major types of human intestinal bacterial diversity. Many of the 130 taxonomic groups are highly correlated, as reflected in the correlation heat map and the hierarchical clustering dendrogram (see Fig. 4 and Supplementary Fig. 4). We investigated the importance of demographic factors, including gender and nationality, together with 129 microbial genera (one was removed due to the use of compositional values), in predicting the baseline BMI level. Our analysis includes 900 subjects with BMI measurements.

Fig. 4: Negative \({\mathrm{log}}_{10}\)p values for HITChip Atlas data.

Important features selected by each method are marked in red. Since SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not produce p values, their importance scores are presented instead, and the 10 features with the top importance scores are marked. The highly correlated features (see the dendrogram on the right for details) selected by RFE-SVM and Vanilla-RF, but not by the PermFIT methods, are highlighted. Source data are provided as a Source Data file.

The feature selection and biomarker identification criteria remain the same as those in the TCGA example. The improvement in model performance from variable selection is presented in Fig. 3c, d and Supplementary Table 2. Besides the MSPE, we report the Pearson correlation between the predicted and the true values. We notice that high correlation among microbiome features leads to large inflation of the Vanilla-RF importance scores, corresponding to a high false positive rate; as a result, Vanilla-RF fails to improve the model performance. The reduced model with features selected by RFE-SVM performs worse than the full model, again likely because highly correlated biomarkers are falsely selected by RFE-SVM. For instance, Streptococcus mitis et rel, Streptococcus bovis et rel, and Streptococcus intermedius et rel, which are highly correlated with each other, are among the top 10 biomarkers identified by RFE-SVM. In contrast, PermFIT yields the most remarkable improvements across all these models, reflected in both MSPE and correlation.

Figure 4 and Supplementary Fig. 5 show the negative \({\mathrm{log}}_{10}\)(p value)s and the importance scores estimated by each method. Among all the features, as expected, age is identified as the most significant factor. Nationality is also selected by PermFIT-DNN and PermFIT-SVM. Among the microbiome features, Megasphaera elsdenii is identified by all the PermFIT methods. M. elsdenii has been shown by prior studies to be one of the ruminal and intestinal lactate- and sugar-fermenting species51, and has been reported to reside massively in patients with an increase in BMI after bariatric surgery52. In addition, Eggerthella lenta is identified by PermFIT-SVM. E. lenta is not well studied, but its potential role as an emerging pathogen has been increasingly recognized in recent years53. Lastly, uncultured Clostridiales is identified by PermFIT-RF.

Discussion

It is difficult to distinguish the contribution of individual input features in complex machine learning models, even though such models enjoy greater robustness and flexibility in modeling complex human diseases compared with parametric models. In this paper, we introduce PermFIT, a computationally efficient permutation-based feature importance test that is applicable to various machine learning models, such as DNN, RF, and SVM, to identify important features. As demonstrated by the applications to the TCGA kidney cancer data and the HITChip Atlas BMI data, the PermFIT procedures show superior performance over all the other competitors considered in this paper, which suffer severely from false positive or false negative findings, leading to inferior prediction performance with their top selected features. In contrast, feature selection via the PermFIT procedures remarkably improves the performance of the predictive models. It is worth pointing out, however, that the prediction improvement of PermFIT is restricted by the capability of each machine learning framework. For example, RF is relatively inefficient in modeling interaction terms, and thus the performance of PermFIT-RF may be limited for complex traits with strong gene-gene interactions. Overall, PermFIT coupled with DNN consistently shows superior empirical performance.

The proposed analytical tool, PermFIT, is computationally efficient and broadly applicable to real-world problems. It can be implemented for and incorporated into a variety of machine learning models with different types of outcomes, without the need for model refitting. PermFIT provides researchers a useful tool for deciphering the complex genetic architecture and disease etiologies of complex traits.

Methods

PermFIT

We start with the case of a continuous outcome. Let \(X\in {\mathcal{X}}\) and \(Y\in {\mathcal{Y}}\), where X = (X1, . . . , Xp) is a p-dimensional covariate vector and the outcome variable Y is a continuous scalar with E(Y∣X) = μ(X), where μ(⋅) is an unknown mapping from \({\mathcal{X}}\) to \({\mathcal{Y}}\). The residual ϵ = Y − μ(X) is independent of X, with 0 < σ2 = E(ϵ2) < ∞.

We define the feature importance score Mj for Xj, the jth feature of X (j = 1, . . . , p), as the expected squared difference between μ(X) and \(\mu \left({X}^{(j)}\right)\), where \({X}^{(j)}=({X}_{1},...,{X}_{j-1},{X}_{j^{\prime} },{X}_{j+1},...,{X}_{p})\) equals X with the jth covariate replaced by \({X}_{j^{\prime} }\), an independent draw from the distribution of Xj. The importance score Mj can be expressed as,

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{\left[\mu (X)-\mu \left({X}^{(j)}\right)\right]}^{2}.$$
(3)

Assuming X has no redundant features, Mj is zero only when \(\mu (X)\equiv \mu \left({X}^{(j)}\right)\) on \({\mathcal{X}}\), implying that the jth element of X has no impact on μ(X); it is non-zero otherwise.

To obtain a clear understanding of Mj, we take the linear model as an example, where μ(X) = Xβ + β0, with β = (β1, . . . , βp) consisting of p parameters. Under the linearity assumption, (3) becomes:

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{({X}_{j}-{X}_{j^{\prime} })}^{2}{\beta }_{j}^{2}=2{\beta }_{j}^{2}{\mathrm{Var}}\,({X}_{j}).$$
(4)

Here, (4) is proportional to the squared standardized coefficient, a popular measure of variable importance in multiple linear regression.
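The equivalence in (4) is easy to verify numerically; the following R sketch, with hypothetical coefficients of our own choosing, permutes one feature under a known linear μ(⋅) and recovers \(2{\beta }_{j}^{2}{\mathrm{Var}}\,({X}_{j})\):

set.seed(2)
N <- 1e5; beta <- c(2, 1, 0)
X <- matrix(rnorm(N * 3), N, 3)
mu <- drop(X %*% beta)                          ## true linear mean function
j <- 1
Xperm <- X; Xperm[, j] <- sample(X[, j])        ## permute feature j
M_j <- mean((mu - drop(Xperm %*% beta))^2)      ## empirical version of Eq. (3)
c(M_j, 2 * beta[j]^2 * var(X[, j]))             ## both approximately 8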

Furthermore, Mj can be simply decomposed as follows:

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{\left[Y-\mu \left({X}^{(j)}\right)\right]}^{2}-{E}_{X}{[Y-\mu (X)]}^{2}.$$
(5)

Ideally, given the true form of μ(⋅), Mj could be estimated from (5) through permutation. Let (Yi, Xi1, . . . , Xip), i = 1, . . . , N, be N independent observations drawn from the distribution of (Y, X1, . . . , Xp). A permutation of one covariate Xj = (X1j, . . . , XNj) randomly samples the elements of Xj without replacement to generate a permuted version \({X}_{j}^{\prime} =({X}_{{s}_{1},j},...,{X}_{{s}_{N},j})\). The empirical permutation importance score is then,

$${M}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{{Y}_{i}-\mu \left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\mu ({X}_{i\cdot })\right\}}^{2}\right],$$
(6)

where Xi = (Xi1, . . . , Xip) and \({X}_{i\cdot }^{(j)}=({X}_{i1},...,{X}_{i,j-1},{X}_{{s}_{i},j},{X}_{i,j+1},...,{X}_{ip})\). Let \({M}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\nolimits_{i = 1}^{N}{M}_{ij}^{(P)}\), where \({M}_{ij}^{(P)}={\left\{{Y}_{i}-\mu \left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\mu ({X}_{i\cdot })\right\}}^{2}\), then

$$E\left[{M}_{j}^{(P)}\right]=E\left[{M}_{ij}^{(P)}\right]=\frac{N-1}{N}{M}_{j}.$$
(7)

When N is large, Mj can be well approximated by \({M}_{j}^{(P)}\). Moreover, \({\mathrm{Var}}\,\left[{M}_{j}^{(P)}\right]\approx \frac{1}{N}{\mathrm{Var}}\,\left[{M}_{ij}^{(P)}\right]\), where \({\mathrm{Var}}\,\left[{M}_{ij}^{(P)}\right]\) can be approximated by the empirical variance of \({M}_{ij}^{(P)}\).

Let \(\widehat{\mu }(\cdot )\) be the fitted approximation of μ(⋅). Following (6), we propose to estimate \({M}_{j}^{(P)}\) by

$${\widehat{M}}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{{Y}_{i}-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}\right].$$
(8)

If feature Xj is not associated with Y, then \(\mu \left({X}_{i\cdot }^{(j)}\right)=\mu ({X}_{i\cdot })\) and the corresponding \({M}_{j}^{(P)}=0\), so Eq. (8) becomes,

$${\widehat{M}}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{\mu ({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{\mu ({X}_{i\cdot })-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}+2{\epsilon }_{i}\left\{\widehat{\mu }({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}\right].$$
(9)

Under universal consistency, the three terms are expected to converge to zero as N goes to infinity. However, for data with a finite sample size, the model \(\widehat{\mu}(\cdot )\) may overfit, leading to \({\left\{\mu ({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}> {\left\{\mu ({X}_{i\cdot })-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}\) and hence a bias in estimating \({M}_{j}^{(P)}\). To overcome this bias, we employ a cross-fitting strategy that separates the input data into a training set and a validation set, with one set used to obtain \(\widehat{\mu }(\cdot )\) and the other to estimate \({\widehat{M}}_{j}^{(P)}\). Let \({\widehat{\mu }}_{T}(\cdot )\) be the estimate of μ(⋅) from the training set, and let \({{\mathcal{D}}}_{V}={\{{Y}_{i},{X}_{i\cdot }\}}_{i = 1}^{{N}_{V}}\) be the validation set; then

$${\widehat{M}}_{j}^{(P)}=\frac{1}{{N}_{V}}\mathop{\sum }\limits_{i=1}^{{N}_{V}}\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{T}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{T}({X}_{i\cdot })\right\}}^{2}\right],$$
(10)
$$\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P)}\right]=\frac{1}{{N}_{V}}\mathop{\sum }\limits_{i=1}^{{N}_{V}}{\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{T}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{T}({X}_{i\cdot })\right\}}^{2}-{\widehat{M}}_{j}^{(P)}\right]}^{2}.$$
(11)

The one-sided p value can be obtained by assuming normality. To increase the power of important feature identification, K-fold cross-fitting can be adopted. Here, we randomly divide the data into K folds, denoted as V1, . . . , VK. For each Vk, k = 1, . . . , K, let \({\overline{V}}_{k}\) denote the complement of Vk, which is used to fit the model \({\widehat{\mu }}_{k}(\cdot )\). Then

$${\widehat{M}}_{ij}^{(P,CV)}=\mathop{\sum }\limits_{k=1}^{K}{\rm{I}}\left(i\in {V}_{k}\right)\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{k}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{k}({X}_{i\cdot })\right\}}^{2}\right],$$
(12)
$${\widehat{M}}_{j}^{(P,CV)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\widehat{M}}_{ij}^{(P,CV)},$$
(13)
$$\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P,CV)}\right]=\frac{1}{N}\mathop{\sum }\limits_{k=1}^{K}{\mathop{\sum} _{i\in {V}_{k}}}{\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{k}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{k}({X}_{i\cdot })\right\}}^{2}-{\widehat{M}}_{j}^{(P,CV)}\right]}^{2}.$$
(14)

The algorithm of PermFIT with cross-fitting is illustrated in Algorithm 1.

Algorithm 1

Algorithms for PermFIT

1: Randomly divide the data into K folds.

2: for k = 1 to K do

3:  Denote the data in the kth fold as Vk and the rest of the data as \({\overline{V}}_{k}\).

4:  Build the machine learning model on \({\overline{V}}_{k}\), denoted as \({\widehat{\mu }}_{k}(\cdot )\).

5:  for j = 1 to p do

6:    Calculate \({\widehat{M}}_{ij}^{(P,CV)}\) for subjects in Vk.

7:  end for

8: end for

9: for j = 1 to p do

10:  Calculate \({\widehat{M}}_{j}^{(P,CV)}\) and estimate \(\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P,CV)}\right]\). Calculate the p value by assuming normality.

11: end for
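For concreteness, a compact base-R sketch of Algorithm 1 for a continuous outcome is given below; fit_fun and predict_fun are hypothetical placeholders for any machine learning fit/predict pair, and the sketch is our own illustration of Eqs. (12)-(14), not the deepTL implementation:

permfit_cv <- function(X, Y, fit_fun, predict_fun, K = 5) {
  N <- nrow(X); p <- ncol(X)
  fold <- sample(rep(seq_len(K), length.out = N))   ## random K-fold assignment
  M <- matrix(NA_real_, N, p)                       ## per-subject scores M_ij
  for (k in seq_len(K)) {
    tr <- fold != k; va <- fold == k
    mu_k <- fit_fun(X[tr, , drop = FALSE], Y[tr])   ## fit on the complement of V_k
    err0 <- (Y[va] - predict_fun(mu_k, X[va, , drop = FALSE]))^2
    for (j in seq_len(p)) {
      Xp <- X[va, , drop = FALSE]
      Xp[, j] <- sample(Xp[, j])                    ## permute feature j within V_k
      M[va, j] <- (Y[va] - predict_fun(mu_k, Xp))^2 - err0   ## cf. Eq. (12)
    }
  }
  Mj <- colMeans(M)                                 ## cf. Eq. (13)
  se <- sqrt(apply(M, 2, var) / N)                  ## standard error of the mean, cf. Eq. (14)
  p.value <- pnorm(Mj / se, lower.tail = FALSE)     ## one-sided test assuming normality
  data.frame(importance = Mj, se = se, p.value = p.value)
}

Pairing this sketch with, e.g., fit_fun = function(X, Y) randomForest::randomForest(X, Y) and predict_fun = predict would yield a PermFIT-RF-style test.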

Binary outcome

For a binary outcome Y ∈ {0, 1}, we have μ(X) = E(Y∣X) = Pr(Y = 1∣X) and define Mj as the expectation of the binomial deviance difference,

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}\left[Y\, {\mathrm{log}}\,\left(\frac{\mu (X)}{\mu ({X}^{(j)})}\right)+(1-Y)\, {\mathrm{log}}\,\left(\frac{1-\mu (X)}{1-\mu ({X}^{(j)})}\right)\right].$$
(15)

The empirical estimate of Mj can be obtained similarly, by plugging in the estimates of μ(X(j)) and μ(X) as in the continuous case.
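A plug-in sketch of the per-observation deviance difference underlying (15), where mu_hat and mu_hat_perm denote the fitted probabilities \(\widehat{\mu }({X}_{i\cdot })\) and \(\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\); the small clipping constant is our own numerical safeguard:

deviance_diff <- function(Y, mu_hat, mu_hat_perm, eps = 1e-8) {
  mu_hat <- pmin(pmax(mu_hat, eps), 1 - eps)            ## clip away from 0 and 1
  mu_hat_perm <- pmin(pmax(mu_hat_perm, eps), 1 - eps)
  Y * log(mu_hat / mu_hat_perm) +
    (1 - Y) * log((1 - mu_hat) / (1 - mu_hat_perm))     ## Eq. (15), per observation
}

Averaging deviance_diff over a validation set, in place of the squared-error differences in (12), gives the binary-outcome version of the test.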

DNN with bootstrap aggregating

In this paper, we use feedforward, fully-connected deep neural networks (DNNs) to approximate the function μ(⋅). The DNN model contains L hidden layers of (n1, . . . , nL) hidden nodes that transform the input covariates X to an estimate of the outcome Y. Let θ denote all the parameters in the DNN model; the fitted DNN, \(\widehat{\mu }(X;{\boldsymbol{\theta }})\), is obtained by minimizing the empirical risk function,

$$\arg \mathop{\min }\limits_{{\boldsymbol{\theta }}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\ell \{{Y}_{i},\mu ({X}_{i};{\boldsymbol{\theta }})\}+\lambda {{\Omega }}({\boldsymbol{\theta }}),$$
(16)

where \(\ell (\cdot ,\cdot )\) is a loss function dependent on the outcome type, Ω(θ) is a penalty on θ, and λ is a hyperparameter that controls the degree of regularization. The minimization is performed via a mini-batch stochastic gradient descent algorithm with Adam54 to adjust the learning rate.

To increase the robustness and accuracy of DNNs, bootstrap aggregating (bagging) is applied55. In addition, due to the randomness of the initial parameters, some DNNs may not converge to a stable solution and hence perform poorly. In neural network ensembles, it has been argued that “many could be better than all”, meaning that using a well-fitting subset of the bagged DNNs can be better than using all of them25,56. We therefore adopt the scoring system of Mi et al.25 to select the optimal subset of DNNs in the bagging procedure. DNN with bagging has been implemented in the R package “deepTL” (available at https://github.com/SkadiEye/deepTL)57. Following Mi et al.25, for all the reported numerical analyses in this paper, we set the bagging size to 100, the batch size to 50, and the number of hidden layers to 4, with 50, 40, 30, and 20 hidden nodes in successive layers, the penalty weight λ to 1E − 4, and rectified linear units as the activation function.
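A generic bagging wrapper, sketched below in R, conveys the aggregation step; the actual deepTL implementation additionally scores the bagged DNNs and keeps only a well-performing subset25, which this sketch omits:

bag_fit <- function(X, Y, fit_fun, predict_fun, n_bag = 100) {
  models <- lapply(seq_len(n_bag), function(b) {
    idx <- sample(nrow(X), replace = TRUE)            ## bootstrap resample
    fit_fun(X[idx, , drop = FALSE], Y[idx])
  })
  function(Xnew)                                      ## aggregated predictor
    rowMeans(sapply(models, function(m) predict_fun(m, Xnew)))
}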

SHAP, LIME, SNGM, and HRT

Shapley and LIME values are calculated using the R package “iml”. The feature importance scores of SHAP and LIME are defined as the mean absolute Shapley and LIME values, respectively. HRT is implemented in R. SNGM-DNN is implemented in R following Xing et al.21. To be consistent with the other approaches that do not provide p values, the implementation of SNGM-DNN also focuses on selecting top features. In SHAP-DNN, LIME-DNN, SNGM-DNN, and HRT-DNN, the bagged DNNs described above are applied.

RF and SVM

RF is implemented via the R package “randomForest”. The Vanilla-RF importance and its standard error are generated by the “randomForest” function with 1000 trees. SVM is implemented through the R package “e1071” with radial kernels. The hyperparameters in SVMs are tuned via fivefold cross-validation. RFE-SVM is implemented with the “rfe” function in the R package “caret”.
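As an illustration, the Vanilla-RF test can be assembled from the “randomForest” output as sketched below; the one-sided z-test construction reflects our reading of the method rather than a documented interface:

library(randomForest)
## X and Y as in the simulation sketch above
rf <- randomForest(x = X, y = Y, ntree = 1000, importance = TRUE)
imp <- importance(rf, type = 1, scale = FALSE)[, 1]   ## permutation importance
se <- rf$importanceSD                                 ## its standard error
p.value <- pnorm(imp / se, lower.tail = FALSE)        ## one-sided z-test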

The simulation study and real data applications

In the simulation studies, PermFIT is performed by randomly splitting the samples into training (80%) and validation (20%) sets, and the importance score is estimated via (10) and (11). In the real data applications, HRT and PermFIT are conducted with 5-fold cross-fitting through (13) and (14). To eliminate the impact of the randomness of cross-fitting and other random factors in model fitting, we repeat each method 100 times and report the mean and standard deviation of the MSPE, Pearson correlation, AUC, or accuracy, and the median of the importance scores and p values. Features presented in the figures are ordered by hierarchical clustering, implemented with the “hclust” function in the R package “stats”, where the dissimilarity is set to one minus the Pearson correlation.
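The figure ordering can be reproduced with a two-line sketch:

d <- as.dist(1 - cor(X))          ## dissimilarity: one minus Pearson correlation
feature_order <- hclust(d)$order  ## feature order used on the figure axes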

For the TCGA kidney cancer application, RPPAs at the gene level are analyzed. We first remove the proteins that are not common across all three TCGA datasets (KIRC, KIRP, and KICH). In addition, we remove the proteins with perfect multicollinearity, after which 118 proteins are kept for further analysis.

For the HITChip Atlas data, the BMI level was originally recorded in six groups: underweight, lean, overweight, obese, severeobese, and morbidobese, which we transform into numerical levels 1 to 6 in our analysis. In total, 900 subjects remain for the analysis after subjects with missing BMI are excluded. Missing information on nationality is grouped into a new category named “Unknown”. Missing values in the microbiome data are imputed with the median values across all samples. The analysis of the microbiome data is based on the compositional values, but we remove the cell proportion of the last group due to the sum-to-one constraint on the compositional values, after which a log-transformation is applied to the remaining compositions.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.