Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Article
  • Published:

Deep neural networks with controlled variable selection for the identification of putative causal genetic variants

Abstract

Deep neural networks (DNNs) have been successfully utilized in many scientific problems for their high prediction accuracy, but their application to genetic studies remains challenging due to their poor interpretability. Here we consider the problem of scalable, robust variable selection in DNNs for the identification of putative causal genetic variants in genome sequencing studies. We identified a pronounced randomness in feature selection in DNNs due to its stochastic nature, which may hinder interpretability and give rise to misleading results. We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies. The merit of the proposed method includes: flexible modelling of the nonlinear effect of genetic variants to improve statistical power; multiple knockoffs in the input layer to rigorously control the false discovery rate; hierarchical layers to substantially reduce the number of weight parameters and activations, and improve computational efficiency; and stabilized feature selection to reduce the randomness in identified signals. We evaluate the proposed method in extensive simulation studies and apply it to the analysis of Alzheimer’s disease genetics. We show that the proposed method, when compared with conventional linear and nonlinear methods, can lead to substantially more discoveries.

This is a preview of subscription content, access via your institution

Access options

Buy this article

Prices may be subject to local taxes which are calculated during checkout

Fig. 1: Overview of the workflow.
Fig. 2: Power and FDR comparison.
Fig. 3: Confirmatory-stage analysis of candidate regions.
Fig. 4: Functionally informed analysis of pQTLs.
Fig. 5: Stabilized HiDe-MK improves the stability of FIs compared with a single HiDe-MK run.
Fig. 6: The hierarchical layers improve computational efficiency.

Similar content being viewed by others

Data availability

Alzheimer’s disease genetic cohort data can be obtained for approved research (see the description in the work by Le Guen and colleagues62). Simulation datasets are available on our GitHub repository: https://github.com/Peyman-HK/De-randomized-HiDe-MK (ref. 70).

Code availability

The code for the generation and reproduction of the simulation studies of SKAT haplotype data is written in R. The code for HiDe-MK training, prediction and evaluation were written in Python with Keras and Tensorflow. The codes are feely available at: https://github.com/Peyman-HK/De-randomized-HiDe-MK. The doi of the code can be found at https://doi.org/10.5281/zenodo.6872386 (ref. 70). The pseudo code for simulation studies can be found in Supplementary Section 4.

References

  1. Sierksma, A., Escott-Price, V. & De Strooper, B. Translating genetic risk of Alzheimer’s disease into mechanistic insight and drug targets. Science 370, 61–66 (2020).

    Article  Google Scholar 

  2. Visscher, P. M. et al. 10 Years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).

    Article  Google Scholar 

  3. Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).

    Article  Google Scholar 

  4. Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. The mystery of missing heritability: genetic interactions create phantom heritability. Proc. Natl Acad. Sci. USA 109, 1193–1198 (2012).

    Article  Google Scholar 

  5. Ma, Y. et al. Analysis of whole-exome sequencing data for Alzheimer disease stratified by APOE Genotype. JAMA Neurol. 76, 1099–1108 (2019).

    Article  Google Scholar 

  6. Jun, G. R. et al. Transethnic genome-wide scan identifies novel Alzheimer’s disease loci. Alzheimers. Dement. 13, 727–738 (2017).

    Article  Google Scholar 

  7. Belloy, M. E. et al. Association of klotho-VS heterozygosity with risk of Alzheimer disease in individuals who carry APOE4. JAMA Neurol. 77, 849–862 (2020).

    Article  Google Scholar 

  8. He, L. et al. Exome-wide age-of-onset analysis reveals exonic variants in ERN1 and SPPL2C associated with Alzheimer’s disease. Transl. Psychiatry 11, 146 (2021).

    Article  Google Scholar 

  9. Sims, R., Hill, M. & Williams, J. The multiplex model of the genetics of Alzheimer’s disease. Nat. Neurosci. 23, 311–322 (2020).

    Article  Google Scholar 

  10. Costanzo, M. et al. A global genetic interaction network maps a wiring diagram of cellular function. Science 353, aaf1420 (2016).

    Article  Google Scholar 

  11. Kuzmin, E. et al. Systematic analysis of complex genetic interactions. Science 360, eaao1729 (2018).

    Article  Google Scholar 

  12. Phillips, P. C. Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9, 855–867 (2008).

    Article  Google Scholar 

  13. Moore, J. H. & Williams, S. M. Epistasis and its implications for personal genetics. Am. J. Hum. Genet. 85, 309–320 (2009).

    Article  Google Scholar 

  14. Cordell, H. J. Detecting gene–gene interactions that underlie human diseases. Nat. Rev. Genet. 10, 392–404 (2009).

    Article  Google Scholar 

  15. Scarselli, F. & Chung Tsoi, A. Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results. Neural Netw. 11, 15–37 (1998).

    Article  Google Scholar 

  16. Koo, P. K. & Ploenzke, M. Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat. Mach. Intell. 3, 258–266 (2021).

    Article  Google Scholar 

  17. Cao, Y., Geddes, T. A., Yang, J. Y. H. & Yang, P. Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2, 500–508 (2020).

    Article  Google Scholar 

  18. Manifold, B., Men, S., Hu, R. & Fu, D. A versatile deep learning architecture for classification and label-free prediction of hyperspectral images. Nat. Mach. Intell. 3, 306–315 (2021).

    Article  Google Scholar 

  19. Song, Z. & Li, J. Variable selection with false discovery rate control in deep neural networks. Nat. Mach. Intell. 3, 426–433 (2021).

    Article  Google Scholar 

  20. Ghorbani, A., Abid, A. & Zou, J. Y. Interpretation of neural networks is fragile. In Proc. AAAI Conference on Artificial Intelligence Vol. 33 3681–3688 (AAAI, 2019); https://doi.org/10.1609/aaai.v33i01.33013681

  21. Barber, R. F. & Candès, E. J. Controlling the false discovery rate via knockoffs. Ann. Stat. 43, 2055–2085 (2015).

    Article  MathSciNet  MATH  Google Scholar 

  22. Candès, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. B 80, 551–577 (2018).

    Article  MathSciNet  MATH  Google Scholar 

  23. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58, 267–288 (1996).

    MathSciNet  MATH  Google Scholar 

  24. Sesia, M., Katsevich, E., Bates, S., Candès, E. & Sabatti, C. Multi-resolution localization of causal variants across the genome. Nat. Commun. 11, 1093 (2020).

    Article  Google Scholar 

  25. Lu, Y. Y., Fan, Y., Lv, J. & Noble, W. S. DeepPINK: reproducible feature selection in deep neural networks. In Proc. 32nd International Conference on Neural Information Processing Systems 8690–8700 (Curran Associates, 2018).

  26. He, Z. et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nat. Commun. 12, 3512 (2021).

  27. Lu, L., Shin, Y., Su, Y. & Karniadakis, G. E. Dying ReLU and initialization: theory and numerical examples. Commun. Comput. Phys. 5, 1671–1706 (2020).

    Article  MathSciNet  MATH  Google Scholar 

  28. Clevert, D.-A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR, 2016).

  29. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

    Article  Google Scholar 

  30. He, Z., Xu, B., Buxbaum, J. & Ionita-Laza, I. A genome-wide scan statistic framework for whole-genome sequence data analysis. Nat. Commun. 10, 3018 (2019).

    Article  Google Scholar 

  31. Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).

    Article  Google Scholar 

  32. Dai, C., Lin, B., Xing, X. & Liu, J. False discovery rate control via data splitting. J. Am. Stat. Soc. https://doi.org/10.1080/01621459.2022.2060113 (2020).

  33. Tibshirani, J. F., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).

    Google Scholar 

  34. Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008).

    MATH  Google Scholar 

  35. Lee, S., Zhao, Z., Miropolsky, L., Wu, M. SKAT: SNP-Set (Sequence) Kernel Association Test, R package, version 2.2.4. (2022)

  36. Gimenez, J. R. & Zou, J. Improving the stability of the knockoff procedure: multiple simultaneous knockoffs and entropy maximization. In Proc. 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) (PMLR, 2018).

  37. Ren, Z., Wei, Y. & Candès, E. Derandomizing knockoffs. J. Am. Stat. Assoc. https://doi.org/10.1080/01621459.2021.196272 (2021).

  38. He, Z. et al. Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics. Am. J. Hum. Genet. 108, 2336–2353 (2021).

    Article  Google Scholar 

  39. Shea J, A., Fulton-Howard, B. & Goate, A. Interpretation of risk loci from genome-wide association studies of Alzheimer’s disease. Lancet Neurol. 19, 326–335 (2020).

    Article  Google Scholar 

  40. Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721 (2021).

    Article  Google Scholar 

  41. Sesia, M., Bates, S., Candès, E., Marchini, J. & Sabatti, C. False discovery rate control in genome-wide association studies with population structure. Proc. Natl Acad. Sci. USA 118, e2105841118 (2021).

    Article  Google Scholar 

  42. Schaffner, S. F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res 15, 1576–1583 (2005).

    Article  Google Scholar 

  43. Sesia, M., Sabatti, C. & Candès, E. J. Gene hunting with hidden Markov model knockoffs. Biometrika 106, 1–18 (2019).

    Article  MathSciNet  MATH  Google Scholar 

  44. Plassman, B. L. et al. Prevalence of dementia in the United States: the aging, demographics, and memory study. Neuroepidemiology 29, 125–132 (2007).

    Article  Google Scholar 

  45. Escott-Price, V., Shoai, M., Pither, R., Williams, J. & Hardy, J. Polygenic score prediction captures nearly all common genetic risk for Alzheimer’s disease. Neurobiol. Aging 49, 214.e7–214.e11 (2017).

    Article  Google Scholar 

  46. Guen, Y. Le et al. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimer’s Res. Ther. 13, 72 (2021).

  47. Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).

    Article  Google Scholar 

  48. Beecham, G. W. et al. The Alzheimer’s disease sequencing project: study design and sample selection. Neurol. Genet. 3, e194–e194 (2017).

    Article  Google Scholar 

  49. Weiner, M. W. et al. The Alzheimer’s disease neuroimaging initiative: progress report and future plans. Alzheimers. Dement. 6, 202–211.e7 (2010).

    Article  Google Scholar 

  50. Bennett, D. A. et al. Overview and findings from the rush memory and aging project. Curr. Alzheimer Res. 9, 646–663 (2012).

    Article  Google Scholar 

  51. Kunkle, B. W. et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat. Genet. 51, 414–430 (2019).

    Article  Google Scholar 

  52. Kunkle, B. W. et al. Novel Alzheimer disease risk loci and pathways in African American individuals using the African genome resources panel: a meta-analysis. JAMA Neurol. 78, 102–113 (2021).

    Article  Google Scholar 

  53. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).

    Article  Google Scholar 

  54. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).

    Article  Google Scholar 

  55. Chen, C.-Y. et al. Improved ancestry inference using weights from external reference panels. Bioinformatics 29, 1399–1406 (2013).

    Article  Google Scholar 

  56. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).

    Article  Google Scholar 

  57. Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).

    Article  Google Scholar 

  58. Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).

    Article  Google Scholar 

  59. Andrews, S. J., Fulton-Howard, B. & Goate, A. Interpretation of risk loci from genome-wide association studies of Alzheimer’s disease. Lancet Neurol. 19, 326–335 (2020).

    Article  Google Scholar 

  60. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

    Article  Google Scholar 

  61. Hechtlinger, Y. Interpretation of prediction models using the input gradient. Preprint at https://arxiv.org/abs/1611.07634 (2016).

  62. Le Guen, Y. et al. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimers. Res. Ther. 13, 72 (2021).

    Article  Google Scholar 

  63. Saha, S. et al. Hierarchical deep learning neural network (HiDeNN): an artificial intelligence (AI) framework for computational science and engineering. Comput. Methods Appl. Mech. Eng. 373, 113452 (2021).

    Article  MathSciNet  MATH  Google Scholar 

  64. Roy, D., Panda, P. & Roy, K. Tree-CNN: a hierarchical deep convolutional neural network for incremental learning. Neural Netw. 121, 148–160 (2020).

    Article  Google Scholar 

  65. Kim, J., Kim, B., Roy, P. P. & Jeong, D. Efficient facial expression recognition algorithm based on hierarchical deep neural network structure. IEEE Access 7, 41273–41285 (2019).

    Article  Google Scholar 

  66. Xu, Y. et al. A hierarchical deep learning approach with transparency and interpretability based on small samples for glaucoma diagnosis. npj Digit. Med. 4, 48 (2021).

    Article  Google Scholar 

  67. Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. 13th International Conference on Artificial Intelligence and Statistics (AISTATS) Vol. 9, 249–256 (JMLR, 2010).

  68. LeCun, Y. A., Bottou, L., Orr, G. B. & Müller, K.-R. in Neural Networks: Tricks of the Trade (eds. Müller, K.-R. et al.) 2nd edn, 9–48 (Springer, 2012); https://doi.org/10.1007/978-3-642-35289-8_3

  69. Jha, N. K., Mittal, S. & Mattela, G. The ramifications of making deep neural networks compact. Preprint at https://arxiv.org/abs/2006.15098 (2020).

  70. Peyman-HK/Stabilized-HiDe-MK: Stabilized HiDe-MK (Zenodo, 2022); https://doi.org/10.5281/zenodo.6872386

Download references

Acknowledgements

This research was supported by NIH/NIA award AG066206 (ZH).

Author information

Authors and Affiliations

Authors

Contributions

P.H.K., and Z.H. developed the concepts for the manuscript and proposed the method. P.H.K., F.L., Y.L.G. and Z.H. designed the analyses and applications and discussed the results. P.H.K., Z.H. and F.L. conducted the analyses. Z.H., Y.L.G. and M.E.B. helped interpret the results of the real data analyses. P.H.K., Z.H., F.L. and Y.L.G. prepared the manuscript and contributed to editing the paper.

Corresponding author

Correspondence to Zihuai He.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Yue Cao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Tables 1–4 and discussions of ‘Notes on the real data preparation’ and ‘Model configurations’.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kassani, P.H., Lu, F., Le Guen, Y. et al. Deep neural networks with controlled variable selection for the identification of putative causal genetic variants. Nat Mach Intell 4, 761–771 (2022). https://doi.org/10.1038/s42256-022-00525-0

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/s42256-022-00525-0

This article is cited by

Search

Quick links

Nature Briefing AI and Robotics

Sign up for the Nature Briefing: AI and Robotics newsletter — what matters in AI and robotics research, free to your inbox weekly.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing: AI and Robotics