Introduction

With the advancement of high-throughput technologies, massive amounts of high-dimensional omics data have been generated and made available through large public databases, thanks to great data sharing efforts by the research community, such as The Cancer Genome Atlas (TCGA)1. These data are valuable for elucidating the molecular mechanisms of disease phenotypes2,3. However, the study of complex human disease remains challenging due to convoluted disease etiologies and intricate underlying molecular mechanisms at the genetic, genomic, and proteomic levels. Many popular machine learning algorithms, such as non-linear kernel support vector machines (SVMs), random forests (RFs), and deep neural networks (DNNs), have been developed to build more powerful predictive models for biomedical and bio-omics data regarding clinical outcomes, e.g., drug response4 and medical imaging classification5. While enjoying modeling flexibility and robustness, these frameworks suffer from non-transparency and difficulty in interpreting the role of each individual feature, owing to their sophisticated algorithms, compared with more interpretable parametric models such as linear regression, logistic regression, and decision trees. Nonetheless, identifying important biomarkers associated with complex human disease is a critical pursuit that helps researchers establish novel hypotheses regarding the prevention, diagnosis, and treatment of complex human diseases. Accurate identification of such biomarkers not only provides valuable insights into the underlying genetic architecture and disease etiology but also offers great potential for early disease diagnosis, improved precision medicine, innovative treatment development, and accurate prediction of disease risk and progression6.

To address the non-transparency of machine learning models in association studies between disease outcomes and predictors, feature importance score strategies have been proposed and extensively investigated7,8,9,10,11,12,13, including surrogate models7,14, Shapley value-based methods15,16, conditional randomization tests (CRTs)10, knockoff models (i.e., model-X)10,12, and permutation-based feature importance8. Surrogate modeling methods approximate complex models with explanatory surrogate models, such as linear models or decision trees. While they enjoy great flexibility in the choice of surrogate, the resulting feature importance is still restricted to the selected explanatory model, which might be misspecified13. Shapley value-based methods, such as SHAP16, provide localized feature characterization based on game theory, but they are computationally intensive and do not guarantee a valid test. Both CRT and model-X knockoff were proposed in Candes et al.10; CRT is less preferred due to its expensive computational cost, whereas model-X knockoff performs the feature importance test more efficiently by constructing knockoff features. Recently, model-X knockoff was adapted to DNN models12. Tansey et al.11 proposed the holdout randomization test (HRT) to reduce the computational cost of CRT by avoiding model refitting.

The common disadvantage of CRT, HRT, and model-X knockoff is that they all depend on the assumption of a known covariance structure10; when the covariance structure is not accurately estimated, their performance can be severely impacted17. Although KnockoffGAN18, an extension of model-X knockoff, does not suffer from this disadvantage, adversarial networks are difficult to train19 and require more tuning. Another strategy that avoids the known covariance structure assumption is the family of approaches based on Gaussian mirrors20,21,22. Specifically, Xing et al.21 proposed the individual neural Gaussian mirror (INGM) and the simultaneous neural Gaussian mirror (SNGM). However, INGM requires repetitive model fitting, which is computationally costly, while SNGM is efficient but can suffer performance loss21. The permutation-based feature importance learning method, another popular approach for feature selection, measures the change in prediction error due to the shuffling of a feature: the larger the increase in prediction error, the greater the impact a feature makes on the outcome of interest. Unlike CRT, HRT, or model-X knockoff, permutation-based feature selection does not require prior knowledge of the feature distribution and is thus more statistically robust. Several permutation-based feature importance methods have been proposed, with applications mainly to random forests and DNNs8,9,23. These methods, however, either do not conduct any statistical inference or cannot offer valid inference on the feature importance. For example, Putin et al.23 applied permutation-based importance scores to DNNs to identify biomarkers associated with human aging, but provided no formal statistical testing. Notably, Altmann et al.9 proposed a corrected permutation-based importance score approach for random forests, which, however, is difficult to generalize to other machine learning frameworks.

To overcome the aforementioned challenges, we propose a general permutation-based feature importance test (abbreviated as PermFIT) for complex machine learning models, which (i) couples the permutation test with cross-fitting to obtain a valid importance score test that properly controls the type-I error; and (ii) selects important features to further improve the accuracy of the predictive models. We implement PermFIT for the following machine learning models: DNN, RF, and SVM. More specifically, PermFIT first approximates the function that maps features to the outcome; based on this approximation, it then evaluates the importance score of each feature, defined as the expected increase in prediction error due to the permutation of the feature. Computationally, the PermFIT framework does not require refitting the model. To reduce the bias in importance score estimation caused by potential model overfitting, we adopt cross-fitting to ensure the validity of the test statistics. PermFIT is motivated by two benchmark datasets: the Reverse Phase Protein Arrays (RPPA) data from three kidney cancer studies in The Cancer Genome Atlas (TCGA) and the HITChip Atlas microbiome data regarding body mass index (BMI). Nevertheless, PermFIT has broad applicability to a wide variety of biomedical data and beyond.

Results

To evaluate the performance of PermFIT, we first conduct comprehensive simulation studies under various scenarios with different sample sizes and correlation structures among features. We then apply it to two real-world datasets: the Reverse Phase Protein Arrays (RPPA) data from three kidney cancer studies in TCGA and the HITChip Atlas microbiome data. We apply PermFIT to three commonly used machine learning methods: DNN24,25, RF8, and SVM26, denoted as PermFIT-DNN, PermFIT-RF, and PermFIT-SVM, respectively. We also compare PermFIT with several existing popular feature selection methods for DNN, RF, and SVM: SHAP16, LIME14, the holdout randomization test11, and the simultaneous neural Gaussian mirror21 with DNN (denoted as SHAP-DNN, LIME-DNN, HRT-DNN, and SNGM-DNN, respectively), the importance evaluation of Breiman8 for RF, an ensemble approach based on decision trees (denoted as Vanilla-RF), and SVM with recursive feature elimination27 (denoted as RFE-SVM). SHAP-DNN, LIME-DNN, and RFE-SVM utilize an importance score to rank input features, from which top features are selected. For each feature, Vanilla-RF provides an importance score estimate and its associated standard error, with which the statistical significance of the feature importance can be tested. HRT provides a p value for each feature without importance scores. We evaluate these methods as follows: (i) we apply each method to the training data with all the input features, estimate the feature importance scores and p values, and assess the type-I error; (ii) we refit each model with its corresponding top-ranked important variables, and re-evaluate its goodness-of-fit and prediction improvement.

Simulation studies

We examine the performance of the proposed methods under several simulation scenarios. First, we generate continuous data from the following model,

$$Y={X}_{1}+2\, {\mathrm{log}}\,\Big(1+2{X}_{{p}_{0}+1}^{2}+{\big({X}_{2{p}_{0}+1}+1\big)}^{2}\Big)+{X}_{3{p}_{0}+1}{X}_{4{p}_{0}+1}+\epsilon ,$$
(1)

where X is a p-dimensional random variable drawn from MVN(0, Σ), p = 10p0, p0 = 10, Σ = diag{Σ1, . . . , Σ10} is a block-diagonal matrix, \({{{\Sigma }}}_{1}=...={{{\Sigma }}}_{10}={\{{\sigma }_{ij}\}}_{0\,{<}\,i,j\le {p}_{0}}\) are p0 × p0 matrices with σij = 1 for i = j and σij = ρ for i ≠ j, and ϵ ~ N(0, 1). N independent observations are drawn from the distribution of (Y, X) for the training set and 10,000 for the test set, the latter being used to evaluate model fitting performance. To mimic real-world data, we introduce correlations among variables by blocks, and let one variable from each of the first 5 blocks carry a signal. We define S0 and S1 as the sets containing all the null features that are, respectively, correlated and uncorrelated with the causal features, i.e., S0 = {Xj: j ≤ 5p0 and j ≠ 1, p0 + 1, 2p0 + 1, 3p0 + 1, 4p0 + 1} and S1 = {Xj: j > 5p0}. We consider various simulation settings with different values of ρ ∈ {0, 0.2, 0.5, 0.8} and N ∈ {1000, 5000}. Each simulation scenario is replicated 100 times.
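To make the design concrete, a minimal R sketch of this data-generating process is given below; the seed and the use of a Cholesky factor are our own choices, while the block structure and effects follow Eq. (1):

set.seed(1)
p0 <- 10; p <- 10 * p0; N <- 1000; rho <- 0.5
## one p0 x p0 within-block covariance: unit variance, within-block correlation rho
Sigma_block <- matrix(rho, p0, p0); diag(Sigma_block) <- 1
R <- chol(Sigma_block)
## X ~ MVN(0, Sigma) with Sigma = diag{Sigma_1, ..., Sigma_10}
X <- do.call(cbind, lapply(1:10, function(b) matrix(rnorm(N * p0), N, p0) %*% R))
## outcome from Eq. (1): linear, nonlinear, and interaction effects plus N(0, 1) noise
Y <- X[, 1] + 2 * log(1 + 2 * X[, p0 + 1]^2 + (X[, 2 * p0 + 1] + 1)^2) +
  X[, 3 * p0 + 1] * X[, 4 * p0 + 1] + rnorm(N)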

The results are displayed in Fig. 1 and Table 1. Figure 1a displays the detailed feature importance scores generated by each method under consideration. Since HRT does not provide importance scores, we use \(-{\mathrm{log}}_{10}\)(p value) instead. Note that the estimated importance scores from the PermFIT methods and Vanilla-RF are on the same scale, while those from SHAP-DNN, LIME-DNN, and RFE-SVM are not. For X1, whose effect is linear, the importance scores from PermFIT-DNN and PermFIT-SVM are higher than those from the RF-based framework, due to the restricted tree-based modeling nature of RF. In addition, the RF-based framework can barely detect the interaction between \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\), because the split rule in tree-based methods is less effective at capturing such interactions. As expected, as the within-block correlation ρ increases, the estimated importance scores from all methods deviate further from their estimands. However, PermFIT-SVM retains high power in detecting the true positive features. It is also noticeable that, as ρ increases, Vanilla-RF and PermFIT-SVM tend to identify null features that are correlated with the causal features. Compared with Vanilla-RF, PermFIT-RF has fewer false positive discoveries. Overall, PermFIT-DNN provides the most precise and stable importance measure for differentiating the true positive features from the null features.

Fig. 1: Simulation results on continuous outcomes.

a Estimated feature importance for the five true causal features: X1, \({X}_{{p}_{0}+1}\), \({X}_{2{p}_{0}+1}\), \({X}_{3{p}_{0}+1}\), \({X}_{4{p}_{0}+1}\), and the two null feature sets: S0 and S1. b Mean squared prediction error (MSPE) for the methods in comparison. DNN, RF, or SVM: the respective model with all features; PermFIT-DNN, SHAP-DNN, LIME-DNN, HRT-DNN, SNGM-DNN, PermFIT-RF, Vanilla-RF, PermFIT-SVM, or RFE-SVM: the respective model after feature selection. Data are presented as mean values ± s.d. Simulations in each scenario are repeated 100 times. Source data are provided as a Source Data file.

Table 1 Simulation results on continuous outcomes.

The frequency (percentage) with which the important variables are detected by each method is presented in Table 1. For Vanilla-RF and the PermFIT methods, which provide p values, the significance level is controlled at 0.05, while for RFE-SVM, the top 10 features with the largest importance scores are selected. First, at ρ = 0, PermFIT controls the rate of significant findings across all null features at around 0.05, suggesting that the type-I error is well controlled, while Vanilla-RF has a type-I error of 0.09, nearly double that of PermFIT. When N = 1000, the type-I error of HRT-DNN is slightly inflated. Moreover, LIME-DNN and SNGM-DNN show limited ability to identify features with nonlinear effects, such as \({X}_{{p}_{0}+1}\), \({X}_{3{p}_{0}+1}\), and \({X}_{4{p}_{0}+1}\). SHAP-DNN, on the other hand, is able to rank the important features highly based on its importance scores; however, it fails to offer a valid test for those scores, and specifically, its type-I error and power depend on correctly specifying the number of important features. When ρ increases to 0.5 or 0.8, RFE-SVM tends to select the null features that are correlated with the true causal features, i.e., those in S0, more frequently than \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\), the two causal variables that interact with each other, demonstrating its limited capability to detect variables with interaction effects in the presence of correlation. In contrast, PermFIT-SVM consistently identifies \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\) at a much higher frequency than all the null features. Compared with PermFIT-RF, Vanilla-RF has higher power in detecting \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\), but also produces remarkably more false positive findings among features in S0. For example, at ρ = 0.8 and N = 1000, it yields a false positive rate above 80% in S0, indicating far inferior feature selection performance. In all these scenarios, PermFIT-DNN consistently identifies the causal features while controlling false positive findings at a much lower rate than Vanilla-RF, PermFIT-RF, and PermFIT-SVM.

After feature selection, the prediction performance of almost all the models improves. Figure 1b displays the mean squared prediction error (MSPE) of each model, (i) with the full set of input features, denoted as DNN, RF, and SVM, respectively; and (ii) with the top features selected by the PermFIT methods and HRT-DNN at the significance level of 0.1, or the top 20 features from SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM. The selected features boost the prediction accuracy of all models, except RFE-SVM, LIME-DNN, and SNGM-DNN, across all simulation scenarios. LIME-DNN and SNGM-DNN fail to identify certain important features, which leads to deterioration of model performance. In addition, at ρ = 0.8, corresponding to high correlation among within-block input features, RFE-SVM fails to improve the model fit over SVM because of feature selection failure, in particular on \({X}_{3{p}_{0}+1}\) and \({X}_{4{p}_{0}+1}\); its inferior performance relative to PermFIT-SVM is clearly observed. Moreover, PermFIT-RF outperforms Vanilla-RF in terms of MSPE, because the latter yields more false positives and cannot effectively reduce the feature dimension. We note that PermFIT-DNN and HRT-DNN consistently outperform all the other methods in comparison, due to their high success rate in identifying true positive features while maintaining a considerably low false positive rate. In particular, PermFIT-DNN has a lower MSPE than HRT-DNN when N = 1000 and ρ ≤ 0.2, and similar MSPE values in the other scenarios.

To further investigate the small-sample performance of these methods, we conduct additional simulations with (N = 300, p = 100) and (N = 500, p = 200), and report the results in Table 2. The type-I errors of the PermFIT-based methods are not much affected by the change of N and p in these more challenging cases, while those of HRT-DNN are severely inflated, likely because, with a smaller N or a larger p, HRT-DNN fails to estimate the covariance matrix of the input features accurately.

Table 2 Simulation results on continuous outcomes with smaller sample size and/or larger dimension.

We further conduct a simulation study on binary outcomes generated from the following model:

$$P(Y=1| X)={\rm{expit}}\Big(4{X}_{1}+8\, {\mathrm{log}}\,\big(1+2{X}_{{p}_{0}+1}^{2}+{\big({X}_{2{p}_{0}+1}+1\big)}^{2}\big)+4{X}_{3{p}_{0}+1}{X}_{4{p}_{0}+1}-11\Big),$$
(2)

where \({\rm{expit}}(x)=1/(1+\exp (-x))\). All other data structures, including X, are generated in the same way as in the continuous case. Similar conclusions are observed, with details presented in Supplementary Table 1 and Supplementary Fig. 1.
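A corresponding R sketch for the binary design, reusing X, p0, and N from the continuous-case sketch above:

expit <- function(x) 1 / (1 + exp(-x))
prob <- expit(4 * X[, 1] +
              8 * log(1 + 2 * X[, p0 + 1]^2 + (X[, 2 * p0 + 1] + 1)^2) +
              4 * X[, 3 * p0 + 1] * X[, 4 * p0 + 1] - 11)
Y <- rbinom(N, 1, prob)  ## binary outcome from Eq. (2)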

TCGA kidney tumor data

A large collection of clinical and multi-omics data has been made publicly available by the TCGA research project1. In our analysis, we included three studies of kidney-related cancer types from the TCGA research network: kidney renal clear cell carcinoma (KIRC, N1 = 537), kidney renal papillary cell carcinoma (KIRP, N2 = 291), and kidney chromophobe (KICH, N3 = 113). We defined long-term survivors (LTS) as patients who survived more than five years after diagnosis, and short-term survivors (STS) as patients who died within five years. We aimed to predict the probability of a patient being in the LTS group and to identify significant biomarkers that contribute to the classification of LTS/STS status. We included 188 LTS and 178 STS subjects with known survival status in our analysis. We focused our analysis on the expression data of 118 proteins extracted from reverse phase protein arrays (RPPAs)—a highly sensitive, reproducible, and high-throughput proteomic method for protein expression profiling28.

The negative \({\mathrm{log}}_{10}\)(p value)s and the estimated importance scores from each method are presented in Fig. 2 and Supplementary Fig. 3. HRT-DNN, Vanilla-RF, and the PermFIT models control the FDR at 0.1, while SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM select the 10 features with the largest importance scores. We notice that moderate correlations generally exist among the proteins (see Supplementary Fig. 2). However, six proteins, SRC, RAF1, RB1, RPS6, YWHAZ, and EGFR, are highly correlated and clustered together by hierarchical clustering in Fig. 2. Among them, EGFR, YWHAZ, RPS6, RB1, and SRC are identified by Vanilla-RF, and RPS6, RB1, and SRC are selected by RFE-SVM, while none of these biomarkers is selected by any PermFIT procedure. According to our observations in the simulation studies, both Vanilla-RF and RFE-SVM tend to identify false positive biomarkers in the presence of high correlation among features, casting doubt on the validity of their biomarker selection results here. In addition, LIME-DNN identifies a very different set of important biomarkers compared to SHAP-DNN, HRT-DNN, SNGM-DNN, and PermFIT-DNN.

Fig. 2: Negative \({\mathrm{log}}_{10}\)p values for TCGA kidney cancer data.

Important features selected by each method are marked in red. Since SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not produce p values, their importance scores are presented instead, and the 10 features with the top importance scores are marked. The highly correlated features (see the dendrogram on the right for details) selected by RFE-SVM, Vanilla-RF, and SNGM-DNN, but not by the PermFIT methods, are highlighted. Source data are provided as a Source Data file.

Since the underlying genetic truth is unknown, we instead use the model performance improvement estimated via 5-fold cross-validation, randomly repeated 100 times (see Fig. 3a, b; Supplementary Table 2), as a surrogate measure for evaluating the relative quality of the selected features. As in the simulation study, we include features with p values smaller than 0.1 for HRT-DNN, Vanilla-RF, and the PermFIT methods, and the top 20 features for SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM. PermFIT-RF improves the accuracy from 0.694 to 0.732 on average, while Vanilla-RF only improves it to 0.713. Moreover, PermFIT-SVM elevates the accuracy from 0.69 to 0.744, outperforming RFE-SVM (0.709). Similar to the simulation results, PermFIT-DNN and HRT-DNN achieve the highest accuracy (0.751 and 0.750, respectively), better than SHAP-DNN (0.731), LIME-DNN (0.650), and SNGM-DNN (0.723). The same conclusion is further confirmed by the area under the ROC curve (AUC) results. In summary, it is evident that the PermFIT procedures consistently perform more efficient and accurate feature selection across the various machine learning frameworks.

Fig. 3: Model performance improvement from feature selection.

a, b Fivefold cross-validated prediction accuracy and AUC for the TCGA kidney cancer data. c, d Fivefold cross-validated MSPE and Pearson correlation (between the true outcome and the prediction) for the HITChip Atlas data. The fivefold cross-validation evaluation is randomly repeated 100 times. Data are presented as mean values ± s.d. Source data are provided as a Source Data file.

Regarding the identified biomarkers, four genes—CDKN1A, EIF4EBP1, INPP4B, and SERPINE1—are identified by all three PermFIT methods as significantly associated with the survival status. Interestingly, all four genes have been reported to be cancer related. In particular, INPP4B, identified as the most significant biomarker by all three methods (p value = 1.3E − 05 by PermFIT-DNN, 9.1E − 07 by PermFIT-RF, and 4.5E − 05 by PermFIT-SVM), encodes inositol polyphosphate-4-phosphatase, type II, a dual specificity phosphatase. Low INPP4B expression has recently been reported to be associated with shorter survival in kidney clear cell, liver hepatocellular, and bladder urothelial carcinomas, and with longer survival in pancreatic adenocarcinoma29. It is also related to acute myeloid leukemia, breast cancer, and bladder cancer30,31,32. SERPINE1 encodes plasminogen activator inhibitor-1, which plays an important role in various diseases, in particular kidney pathology and renal cell cancer33,34,35. In addition, the CDKN1A-encoded protein, CDK-interacting protein 1, has been reported as a prognostic marker for renal cell cancer36, and has an effect on kidney cancer cell death37 as well as kidney cancer survival38. Similarly, EIF4EBP1 affects disease progression in renal cell carcinoma39.

Moreover, the DNA repair protein XRCC1, identified by PermFIT-DNN and PermFIT-SVM, has been shown to be associated with bladder cancer40. ANXA7, identified by PermFIT-DNN and PermFIT-RF, is reported to be associated with prostate cancer and breast cancer41,42, and its encoded protein has an impact on both diseases43,44. Furthermore, MYH9 and NRG1 are identified by PermFIT-DNN. Myosin-9, encoded by MYH9, has been discussed for its role as a tumor suppressor45, and NRG1 is reported to be related to multiple cancer types46,47. Lastly, PermFIT-RF identifies a novel gene, STK11, whose role in kidney cancer is unknown; however, inactivation of STK11 has been reported as a common event in lung adenocarcinomas48.

HITChip atlas data

In the HITChip Atlas study, data were collected from 1006 adults in 15 western countries49 using the HITChip, and are publicly available in the R library “microbiome”50. Besides demographic and clinical variables, the HITChip Atlas data include microbiome measurements from 130 taxonomic groups summarized at the genus level, covering the major types of human intestinal bacterial diversity. Many of the 130 taxonomic groups are highly correlated, as reflected in the correlation heat map and the hierarchical clustering dendrogram (see Fig. 4 and Supplementary Fig. 4). We investigated the importance of demographic factors, including gender and nationality, together with 129 microbial genera (one was removed due to the use of compositional values), in predicting the baseline BMI level. Our analysis includes 900 subjects with BMI measurements.

Fig. 4: Negative \({\mathrm{log}}_{10}\)p values for HITChip Atlas data.

Important features selected by each method are marked in red. Since SHAP-DNN, LIME-DNN, SNGM-DNN, and RFE-SVM do not produce p values, their importance scores are presented instead, and the 10 features with the top importance scores are marked. The highly correlated features (see the dendrogram on the right for details) selected by RFE-SVM and Vanilla-RF, but not by the PermFIT methods, are highlighted. Source data are provided as a Source Data file.

The feature selection and biomarker identification criteria remain the same as those in the TCGA example. The improvement in model performance from variable selection is presented in Fig. 3c, d and Supplementary Table 2. Besides the MSPE, we report the Pearson correlation between the predicted and the true values. We notice that high correlation among microbiome features leads to large inflation of the Vanilla-RF importance scores, corresponding to a high false positive rate; as a result, Vanilla-RF fails to improve the model performance. The reduced model with features selected by RFE-SVM performs worse than the full model, again likely because highly correlated biomarkers are falsely selected by RFE-SVM. For instance, Streptococcus mitis et rel, Streptococcus bovis et rel, and Streptococcus intermedius et rel, which are highly correlated with each other, are among the top 10 biomarkers identified by RFE-SVM. In contrast, PermFIT yields the most remarkable improvements across all these models, reflected in both MSPE and correlation.

Figure 4 and Supplementary Fig. 5 show the negative \({\mathrm{log}}_{10}\)(p value)s and the importance scores estimated by each method. Among all the features, as expected, age is identified as the most significant factor. Nationality is also selected by PermFIT-DNN and PermFIT-SVM. Among the microbiome features, Megasphaera elsdenii is identified by all the PermFIT methods. M. elsdenii has been shown by prior studies to be one of the ruminal and intestinal lactate- and sugar-fermenting species51, and has been reported to reside massively in patients with an increase in BMI after bariatric surgery52. In addition, Eggerthella lenta is identified by PermFIT-SVM. E. lenta is not well studied, but its potential role as an emerging pathogen has been increasingly recognized in recent years53. Lastly, uncultured Clostridiales is identified by PermFIT-RF.

Discussion

It is difficult to distinguish the contribution of individual input features in complex machine learning models, even though such models enjoy greater robustness and flexibility in modeling complex human diseases compared with parametric models. In this paper, we introduce PermFIT, a computationally efficient permutation-based feature importance test that is applicable to various machine learning models, such as DNN, RF, and SVM, to identify important features. As demonstrated by the applications to the TCGA kidney cancer data and the HITChip Atlas BMI data, the PermFIT procedures show superior performance over all the other competitors considered in this paper, which suffer severely from false positive or false negative findings, leading to inferior prediction performance with their top selected features. In contrast, feature selection via the PermFIT procedures remarkably improves the performance of the predictive models. It is worth pointing out, however, that the prediction improvement of PermFIT is restricted by the capability of each machine learning framework. For example, RF is relatively inefficient in modeling interaction terms, and thus the performance of PermFIT-RF may be limited for complex traits with strong gene-gene interactions. Overall, PermFIT coupled with DNN consistently shows superior empirical performance.

The proposed analytical tool, PermFIT, is computationally efficient and broadly applicable to real-world problems. It can be implemented for and incorporated into a variety of machine learning models with different types of outcomes, without the need for model refitting. PermFIT provides researchers a useful tool for deciphering the complex genetic architecture and disease etiologies of complex traits.

Methods

PermFIT

We start with the case of a continuous outcome. Let \(X\in {\mathcal{X}}\) and \(Y\in {\mathcal{Y}}\), where X = (X1, . . . , Xp) is a p-dimensional covariate vector and the outcome variable Y is a continuous scalar with E(Y∣X) = μ(X), where μ(⋅) is an unknown mapping from \({\mathcal{X}}\) to \({\mathcal{Y}}\). The residual ϵ = Y − μ(X) is independent of X, with 0 < σ2 = E(ϵ2) < ∞.

We define the feature importance score Mj for Xj, the jth feature of X (j = 1, . . . , p), as the expected squared difference between μ(X) and \(\mu \left({X}^{(j)}\right)\), where \({X}^{(j)}=({X}_{1},...,{X}_{j-1},{X}_{j^{\prime} },{X}_{j+1},...,{X}_{p})\) equals X with the jth covariate replaced by \({X}_{j^{\prime} }\), an independent draw from the distribution of Xj. The importance score Mj can be expressed as,

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{\left[\mu (X)-\mu \left({X}^{(j)}\right)\right]}^{2}.$$
(3)

Assuming X has no redundant features, Mj is zero only when \(\mu (X)\equiv \mu \left({X}^{(j)}\right)\) on \({\mathcal{X}}\), implying that the jth element of X has no impact on μ(X); it is non-zero otherwise.

To obtain a clear understanding of Mj, we take the linear model as an example, where μ(X) = Xβ + β0, with β = (β1, . . . , βp) consisting of p parameters. Under the linearity assumption, (3) becomes:

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{({X}_{j}-{X}_{j^{\prime} })}^{2}{\beta }_{j}^{2}=2{\beta }_{j}^{2}{\mathrm{Var}}\,({X}_{j}).$$
(4)

Here, (4) is proportional to the squared standardized coefficient, a popular measure of variable importance in multiple linear regression.
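The equivalence in (4) is easy to verify numerically; the following R sketch, with hypothetical coefficients of our own choosing, permutes one feature under a known linear μ(⋅) and recovers \(2{\beta }_{j}^{2}{\mathrm{Var}}\,({X}_{j})\):

set.seed(2)
N <- 1e5; beta <- c(2, 1, 0)
X <- matrix(rnorm(N * 3), N, 3)
mu <- drop(X %*% beta)                          ## true linear mean function
j <- 1
Xperm <- X; Xperm[, j] <- sample(X[, j])        ## permute feature j
M_j <- mean((mu - drop(Xperm %*% beta))^2)      ## empirical version of Eq. (3)
c(M_j, 2 * beta[j]^2 * var(X[, j]))             ## both approximately 8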

Furthermore, Mj can be simply decomposed as follows:

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}{\left[Y-\mu \left({X}^{(j)}\right)\right]}^{2}-{E}_{X}{[Y-\mu (X)]}^{2}.$$
(5)

Ideally, given the true form of μ(⋅), Mj could be estimated from (5) through permutation. Let (Yi, Xi1, . . . , Xip), i = 1, . . . , N, be N independent observations drawn from the distribution of (Y, X1, . . . , Xp). A permutation of one covariate Xj = (X1j, . . . , XNj) randomly samples the elements of Xj without replacement to generate a permuted version \({X}_{j}^{\prime} =({X}_{{s}_{1},j},...,{X}_{{s}_{N},j})\). The empirical permutation importance score is then,

$${M}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{{Y}_{i}-\mu \left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\mu ({X}_{i\cdot })\right\}}^{2}\right],$$
(6)

where Xi = (Xi1, . . . , Xip) and \({X}_{i\cdot }^{(j)}=({X}_{i1},...,{X}_{i,j-1},{X}_{{s}_{i},j},{X}_{i,j+1},...,{X}_{ip})\). Let \({M}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\nolimits_{i = 1}^{N}{M}_{ij}^{(P)}\), where \({M}_{ij}^{(P)}={\left\{{Y}_{i}-\mu \left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\mu ({X}_{i\cdot })\right\}}^{2}\), then

$$E\left[{M}_{j}^{(P)}\right]=E\left[{M}_{ij}^{(P)}\right]=\frac{N-1}{N}{M}_{j}.$$
(7)

When N is large, Mj can be well approximated by \({M}_{j}^{(P)}\). Moreover, \({\mathrm{Var}}\,\left[{M}_{j}^{(P)}\right]\approx \frac{1}{N}{\mathrm{Var}}\,\left[{M}_{ij}^{(P)}\right]\), where \({\mathrm{Var}}\,\left[{M}_{ij}^{(P)}\right]\) can be approximated by the empirical variance of \({M}_{ij}^{(P)}\).

Let \(\widehat{\mu }(\cdot )\) be the fitted approximation of μ(⋅). Following (6), we propose to estimate \({M}_{j}^{(P)}\) by

$${\widehat{M}}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{{Y}_{i}-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}\right].$$
(8)

If feature Xj is not associated with Y, then \(\mu \left({X}_{i\cdot }^{(j)}\right)=\mu ({X}_{i\cdot })\) and the corresponding \({M}_{j}^{(P)}=0\), so Eq. (8) becomes,

$${\widehat{M}}_{j}^{(P)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\left[{\left\{\mu ({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{\mu ({X}_{i\cdot })-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}+2{\epsilon }_{i}\left\{\widehat{\mu }({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}\right].$$
(9)

Under universal consistency, the three terms are expected to converge to zero as N goes to infinity. However, for data with a finite sample size, the model \(\widehat{\mu}(\cdot )\) may overfit, leading to \({\left\{\mu ({X}_{i\cdot })-\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}> {\left\{\mu ({X}_{i\cdot })-\widehat{\mu }({X}_{i\cdot })\right\}}^{2}\) and hence a bias in estimating \({M}_{j}^{(P)}\). To overcome this bias, we employ a cross-fitting strategy that separates the input data into a training set and a validation set, with one set used to obtain \(\widehat{\mu }(\cdot )\) and the other to estimate \({\widehat{M}}_{j}^{(P)}\). Let \({\widehat{\mu }}_{T}(\cdot )\) be the estimate of μ(⋅) from the training set, and let \({{\mathcal{D}}}_{V}={\{{Y}_{i},{X}_{i\cdot }\}}_{i = 1}^{{N}_{V}}\) be the validation set; then

$${\widehat{M}}_{j}^{(P)}=\frac{1}{{N}_{V}}\mathop{\sum }\limits_{i=1}^{{N}_{V}}\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{T}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{T}({X}_{i\cdot })\right\}}^{2}\right],$$
(10)
$$\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P)}\right]=\frac{1}{{N}_{V}}\mathop{\sum }\limits_{i=1}^{{N}_{V}}{\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{T}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{T}({X}_{i\cdot })\right\}}^{2}-{\widehat{M}}_{j}^{(P)}\right]}^{2}.$$
(11)

The one-sided p value can be obtained by assuming normality. To increase the power of important feature identification, K-fold cross-fitting can be adopted. Here, we randomly divide the data into K folds, denoted as V1, . . . , VK. For each Vk, k = 1, . . . , K, let \({\overline{V}}_{k}\) denote the complement of Vk, which is used to fit the model \({\widehat{\mu }}_{k}(\cdot )\). Then

$${\widehat{M}}_{ij}^{(P,CV)}=\mathop{\sum }\limits_{k=1}^{K}{\rm{I}}\left(i\in {V}_{k}\right)\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{k}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{k}({X}_{i\cdot })\right\}}^{2}\right],$$
(12)
$${\widehat{M}}_{j}^{(P,CV)}=\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}{\widehat{M}}_{ij}^{(P,CV)},$$
(13)
$$\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P,CV)}\right]=\frac{1}{N}\mathop{\sum }\limits_{k=1}^{K}{\mathop{\sum} _{i\in {V}_{k}}}{\left[{\left\{{Y}_{i}-{\widehat{\mu }}_{k}\left({X}_{i\cdot }^{(j)}\right)\right\}}^{2}-{\left\{{Y}_{i}-{\widehat{\mu }}_{k}({X}_{i\cdot })\right\}}^{2}-{\widehat{M}}_{j}^{(P,CV)}\right]}^{2}.$$
(14)

The algorithm of PermFIT with cross-fitting is illustrated in Algorithm 1.

Algorithm 1

Algorithms for PermFIT

1: Randomly divide the data into K folds.

2: for k = 1 to K do

3:  Denote the data in the kth fold as Vk and the rest of the data as \({\overline{V}}_{k}\).

4:  Build the machine learning model on \({\overline{V}}_{k}\), denoted as \({\widehat{\mu }}_{k}(\cdot )\).

5:  for j = 1 to p do

6:    Calculate \({\widehat{M}}_{ij}^{(P,CV)}\) for subjects in Vk.

7:  end for

8: end for

9: for j = 1 to p do

10:  Calculate \({\widehat{M}}_{j}^{(P,CV)}\) and estimate \(\widehat{\mathrm{Var}}\left[{\widehat{M}}_{j}^{(P,CV)}\right]\). Calculate the p value by assuming normality.

11: end for
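For concreteness, a compact base-R sketch of Algorithm 1 for a continuous outcome is given below; fit_fun and predict_fun are hypothetical placeholders for any machine learning fit/predict pair, and the sketch is our own illustration of Eqs. (12)-(14), not the deepTL implementation:

permfit_cv <- function(X, Y, fit_fun, predict_fun, K = 5) {
  N <- nrow(X); p <- ncol(X)
  fold <- sample(rep(seq_len(K), length.out = N))   ## random K-fold assignment
  M <- matrix(NA_real_, N, p)                       ## per-subject scores M_ij
  for (k in seq_len(K)) {
    tr <- fold != k; va <- fold == k
    mu_k <- fit_fun(X[tr, , drop = FALSE], Y[tr])   ## fit on the complement of V_k
    err0 <- (Y[va] - predict_fun(mu_k, X[va, , drop = FALSE]))^2
    for (j in seq_len(p)) {
      Xp <- X[va, , drop = FALSE]
      Xp[, j] <- sample(Xp[, j])                    ## permute feature j within V_k
      M[va, j] <- (Y[va] - predict_fun(mu_k, Xp))^2 - err0   ## cf. Eq. (12)
    }
  }
  Mj <- colMeans(M)                                 ## cf. Eq. (13)
  se <- sqrt(apply(M, 2, var) / N)                  ## standard error of the mean, cf. Eq. (14)
  p.value <- pnorm(Mj / se, lower.tail = FALSE)     ## one-sided test assuming normality
  data.frame(importance = Mj, se = se, p.value = p.value)
}

Pairing this sketch with, e.g., fit_fun = function(X, Y) randomForest::randomForest(X, Y) and predict_fun = predict would yield a PermFIT-RF-style test.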

Binary outcome

For a binary outcome Y ∈ {0, 1}, we have μ(X) = E(Y∣X) = Pr(Y = 1∣X) and define Mj as the expectation of the binomial deviance difference,

$${M}_{j}={E}_{X,{X}_{j^{\prime} }}\left[Y\, {\mathrm{log}}\,\left(\frac{\mu (X)}{\mu ({X}^{(j)})}\right)+(1-Y)\, {\mathrm{log}}\,\left(\frac{1-\mu (X)}{1-\mu ({X}^{(j)})}\right)\right].$$
(15)

The empirical estimate of Mj can be obtained similarly, by plugging in the estimates of μ(X(j)) and μ(X) as in the continuous case.
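A plug-in sketch of the per-observation deviance difference underlying (15), where mu_hat and mu_hat_perm denote the fitted probabilities \(\widehat{\mu }({X}_{i\cdot })\) and \(\widehat{\mu }\left({X}_{i\cdot }^{(j)}\right)\); the small clipping constant is our own numerical safeguard:

deviance_diff <- function(Y, mu_hat, mu_hat_perm, eps = 1e-8) {
  mu_hat <- pmin(pmax(mu_hat, eps), 1 - eps)            ## clip away from 0 and 1
  mu_hat_perm <- pmin(pmax(mu_hat_perm, eps), 1 - eps)
  Y * log(mu_hat / mu_hat_perm) +
    (1 - Y) * log((1 - mu_hat) / (1 - mu_hat_perm))     ## Eq. (15), per observation
}

Averaging deviance_diff over a validation set, in place of the squared-error differences in (12), gives the binary-outcome version of the test.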

DNN with bootstrap aggregating

In this paper, we use feedforward, fully-connected deep neural networks (DNNs) to approximate the function μ(⋅). The DNN model contains L hidden layers of (n1, . . . , nL) hidden nodes that transform the input covariates X to an estimate of the outcome Y. Let θ denote all the parameters in the DNN model; the fitted DNN, \(\widehat{\mu }(X;{\boldsymbol{\theta }})\), is obtained by minimizing the empirical risk function,

$$\arg \mathop{\min }\limits_{{\boldsymbol{\theta }}}\frac{1}{N}\mathop{\sum }\limits_{i=1}^{N}\ell \{{Y}_{i},\mu ({X}_{i};{\boldsymbol{\theta }})\}+\lambda {{\Omega }}({\boldsymbol{\theta }}),$$
(16)

where \(\ell (\cdot ,\cdot )\) is a loss function dependent on the outcome type, Ω(θ) is a penalty on θ, and λ is a hyperparameter that controls the degree of regularization. The minimization is performed via a mini-batch stochastic gradient descent algorithm with Adam54 to adjust the learning rate.

To increase the robustness and accuracy of DNNs, bootstrap aggregating (bagging) is applied55. In addition, due to the randomness of the initial parameters, some DNNs may not converge to a stable solution and hence perform poorly. In neural network ensembles, it has been argued that “many could be better than all”, meaning that using a well-fitting subset of the bagged DNNs can be better than using all of them25,56. We therefore adopt the scoring system of Mi et al.25 to select the optimal subset of DNNs in the bagging procedure. DNN with bagging has been implemented in the R package “deepTL” (available at https://github.com/SkadiEye/deepTL)57. Following Mi et al.25, for all the reported numerical analyses in this paper, we set the bagging size to 100, the batch size to 50, and the number of hidden layers to 4, with 50, 40, 30, and 20 hidden nodes in successive layers, the penalty weight λ to 1E − 4, and rectified linear units as the activation function.
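A generic bagging wrapper, sketched below in R, conveys the aggregation step; the actual deepTL implementation additionally scores the bagged DNNs and keeps only a well-performing subset25, which this sketch omits:

bag_fit <- function(X, Y, fit_fun, predict_fun, n_bag = 100) {
  models <- lapply(seq_len(n_bag), function(b) {
    idx <- sample(nrow(X), replace = TRUE)            ## bootstrap resample
    fit_fun(X[idx, , drop = FALSE], Y[idx])
  })
  function(Xnew)                                      ## aggregated predictor
    rowMeans(sapply(models, function(m) predict_fun(m, Xnew)))
}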

SHAP, LIME, SNGM, and HRT

Shapley and LIME values are calculated using the R package “iml”. The feature importance scores of SHAP and LIME are defined as the mean absolute Shapley and LIME values, respectively. HRT is implemented in R. SNGM-DNN is implemented in R following Xing et al.21. To be consistent with the other approaches that do not provide p values, the implementation of SNGM-DNN also focuses on selecting top features. In SHAP-DNN, LIME-DNN, SNGM-DNN, and HRT-DNN, the bagged DNNs described above are applied.

RF and SVM

RF is implemented via the R package “randomForest”. The Vanilla-RF importance and its standard error are generated by the “randomForest” function with 1000 trees. SVM is implemented through the R package “e1071” with radial kernels. The hyperparameters in SVMs are tuned via fivefold cross-validation. RFE-SVM is implemented with the “rfe” function in the R package “caret”.
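As an illustration, the Vanilla-RF test can be assembled from the “randomForest” output as sketched below; the one-sided z-test construction reflects our reading of the method rather than a documented interface:

library(randomForest)
## X and Y as in the simulation sketch above
rf <- randomForest(x = X, y = Y, ntree = 1000, importance = TRUE)
imp <- importance(rf, type = 1, scale = FALSE)[, 1]   ## permutation importance
se <- rf$importanceSD                                 ## its standard error
p.value <- pnorm(imp / se, lower.tail = FALSE)        ## one-sided z-test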

The simulation study and real data applications

In the simulation studies, PermFIT is performed by randomly splitting the samples into training (80%) and validation (20%) sets, and the importance score is estimated via (10) and (11). In the real data applications, HRT and PermFIT are conducted with 5-fold cross-fitting through (13) and (14). To eliminate the impact of the randomness of cross-fitting and other random factors in model fitting, we repeat each method 100 times and report the mean and standard deviation of the MSPE, Pearson correlation, AUC, or accuracy, and the median of the importance scores and p values. Features presented in the figures are ordered by hierarchical clustering, implemented with the “hclust” function in the R package “stats”, where the dissimilarity is set to one minus the Pearson correlation.
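The figure ordering can be reproduced with a two-line sketch:

d <- as.dist(1 - cor(X))          ## dissimilarity: one minus Pearson correlation
feature_order <- hclust(d)$order  ## feature order used on the figure axes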

For the TCGA kidney cancer application, RPPAs at the gene level are analyzed. We first remove the proteins that are not common across all three TCGA datasets (KIRC, KIRP, and KICH). In addition, we remove the proteins with perfect multicollinearity, after which 118 proteins are kept for further analysis.

For the HITChip Atlas data, the BMI level was originally recorded in six groups: underweight, lean, overweight, obese, severeobese, and morbidobese, which we transform into numerical levels 1 to 6 in our analysis. In total, 900 subjects remain for the analysis after subjects with missing BMI are excluded. Missing information on nationality is grouped into a new category named “Unknown”. Missing values in the microbiome data are imputed with the median values across all samples. The analysis of the microbiome data is based on the compositional values, but we remove the cell proportion of the last group due to the sum-to-one constraint on the compositional values, after which a log-transformation is applied to the remaining compositions.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.