Deep neural networks with controlled variable selection for the identification of putative causal genetic variants

Kassani, Peyman H.; Lu, Fred; Le Guen, Yann; Belloy, Michael E.; He, Zihuai

doi:10.1038/s42256-022-00525-0

Article
Published: 15 September 2022

Nature Machine Intelligence volume 4, pages 761–771 (2022)

Abstract

Deep neural networks (DNNs) have been successfully utilized in many scientific problems for their high prediction accuracy, but their application to genetic studies remains challenging due to their poor interpretability. Here we consider the problem of scalable, robust variable selection in DNNs for the identification of putative causal genetic variants in genome sequencing studies. We identified a pronounced randomness in feature selection in DNNs due to their stochastic nature, which may hinder interpretability and give rise to misleading results. We propose an interpretable neural network model, stabilized using ensembling, with controlled variable selection for genetic studies. The merits of the proposed method include: flexible modelling of the nonlinear effect of genetic variants to improve statistical power; multiple knockoffs in the input layer to rigorously control the false discovery rate; hierarchical layers to substantially reduce the number of weight parameters and activations, and improve computational efficiency; and stabilized feature selection to reduce the randomness in identified signals. We evaluate the proposed method in extensive simulation studies and apply it to the analysis of Alzheimer’s disease genetics. We show that the proposed method, when compared with conventional linear and nonlinear methods, can lead to substantially more discoveries.
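
To make the idea of knockoff-based false discovery rate control more concrete, the following is a minimal NumPy sketch of the single-knockoff 'knockoff+' filter of Barber and Candès (cited in the reference list). It is not the authors' HiDe-MK implementation, which uses multiple knockoffs and a neural network to produce feature importances; the function names, the importance-difference statistic and the default target level `q` are assumptions made here for illustration only.

```python
import numpy as np


def knockoff_plus_threshold(w, q):
    """Return the knockoff+ threshold for statistics w at target FDR q.

    The threshold is the smallest t among the nonzero |W_j| such that
    (1 + #{j: W_j <= -t}) / max(1, #{j: W_j >= t}) <= q.
    Returns infinity when no such t exists (nothing is selected).
    """
    w = np.asarray(w, dtype=float)
    candidates = np.sort(np.unique(np.abs(w[w != 0])))
    for t in candidates:
        estimated_false = 1 + np.sum(w <= -t)  # knockoffs that beat originals
        n_selected = max(1, np.sum(w >= t))    # originals that beat knockoffs
        if estimated_false / n_selected <= q:
            return float(t)
    return float("inf")


def knockoff_select(importance_original, importance_knockoff, q=0.1):
    """Select feature indices with the knockoff+ filter (illustrative only).

    Both inputs are hypothetical per-feature importance scores, for example
    from a fitted model, for each original feature and its knockoff copy.
    """
    orig = np.abs(np.asarray(importance_original, dtype=float))
    knock = np.abs(np.asarray(importance_knockoff, dtype=float))
    w = orig - knock  # W_j > 0 suggests the original outranks its knockoff
    tau = knockoff_plus_threshold(w, q)
    return np.flatnonzero(w >= tau)
```

Because of the `1 +` term in the numerator, this filter can only return a non-empty selection of at least 1/q features. For instance, if exactly 12 features clearly outrank their knockoffs, they can be selected at q = 0.1 but not at q = 0.05.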

**Fig. 3: Confirmatory-stage analysis of candidate regions.**

**Fig. 4: Functionally informed analysis of pQTLs.**

**Fig. 5: Stabilized HiDe-MK improves the stability of FIs compared with a single HiDe-MK run.**

**Fig. 6: The hierarchical layers improve computational efficiency.**

Data availability

Alzheimer’s disease genetic cohort data can be obtained for approved research (see the description in the work by Le Guen and colleagues⁶²). Simulation datasets are available on our GitHub repository: https://github.com/Peyman-HK/De-randomized-HiDe-MK (ref. ⁷⁰).

Code availability

The code for generating and reproducing the simulation studies of SKAT haplotype data is written in R. The code for HiDe-MK training, prediction and evaluation is written in Python with Keras and TensorFlow. The code is freely available at https://github.com/Peyman-HK/De-randomized-HiDe-MK, and its DOI is https://doi.org/10.5281/zenodo.6872386 (ref. ⁷⁰). The pseudocode for the simulation studies can be found in Supplementary Section 4.

References

Sierksma, A., Escott-Price, V. & De Strooper, B. Translating genetic risk of Alzheimer’s disease into mechanistic insight and drug targets. Science 370, 61–66 (2020).
Visscher, P. M. et al. 10 Years of GWAS discovery: biology, function, and translation. Am. J. Hum. Genet. 101, 5–22 (2017).
Lee, S., Abecasis, G. R., Boehnke, M. & Lin, X. Rare-variant association analysis: study designs and statistical tests. Am. J. Hum. Genet. 95, 5–23 (2014).
Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. The mystery of missing heritability: genetic interactions create phantom heritability. Proc. Natl Acad. Sci. USA 109, 1193–1198 (2012).
Ma, Y. et al. Analysis of whole-exome sequencing data for Alzheimer disease stratified by APOE Genotype. JAMA Neurol. 76, 1099–1108 (2019).
Jun, G. R. et al. Transethnic genome-wide scan identifies novel Alzheimer’s disease loci. Alzheimers. Dement. 13, 727–738 (2017).
Belloy, M. E. et al. Association of klotho-VS heterozygosity with risk of Alzheimer disease in individuals who carry APOE4. JAMA Neurol. 77, 849–862 (2020).
He, L. et al. Exome-wide age-of-onset analysis reveals exonic variants in ERN1 and SPPL2C associated with Alzheimer’s disease. Transl. Psychiatry 11, 146 (2021).
Sims, R., Hill, M. & Williams, J. The multiplex model of the genetics of Alzheimer’s disease. Nat. Neurosci. 23, 311–322 (2020).
Costanzo, M. et al. A global genetic interaction network maps a wiring diagram of cellular function. Science 353, aaf1420 (2016).
Kuzmin, E. et al. Systematic analysis of complex genetic interactions. Science 360, eaao1729 (2018).
Phillips, P. C. Epistasis — the essential role of gene interactions in the structure and evolution of genetic systems. Nat. Rev. Genet. 9, 855–867 (2008).
Moore, J. H. & Williams, S. M. Epistasis and its implications for personal genetics. Am. J. Hum. Genet. 85, 309–320 (2009).
Cordell, H. J. Detecting gene–gene interactions that underlie human diseases. Nat. Rev. Genet. 10, 392–404 (2009).
Scarselli, F. & Chung Tsoi, A. Universal approximation using feedforward neural networks: a survey of some existing methods, and some new results. Neural Netw. 11, 15–37 (1998).
Koo, P. K. & Ploenzke, M. Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat. Mach. Intell. 3, 258–266 (2021).
Cao, Y., Geddes, T. A., Yang, J. Y. H. & Yang, P. Ensemble deep learning in bioinformatics. Nat. Mach. Intell. 2, 500–508 (2020).
Manifold, B., Men, S., Hu, R. & Fu, D. A versatile deep learning architecture for classification and label-free prediction of hyperspectral images. Nat. Mach. Intell. 3, 306–315 (2021).
Song, Z. & Li, J. Variable selection with false discovery rate control in deep neural networks. Nat. Mach. Intell. 3, 426–433 (2021).
Ghorbani, A., Abid, A. & Zou, J. Y. Interpretation of neural networks is fragile. In Proc. AAAI Conference on Artificial Intelligence Vol. 33 3681–3688 (AAAI, 2019); https://doi.org/10.1609/aaai.v33i01.33013681
Barber, R. F. & Candès, E. J. Controlling the false discovery rate via knockoffs. Ann. Stat. 43, 2055–2085 (2015).
Candès, E., Fan, Y., Janson, L. & Lv, J. Panning for gold: ‘Model-X’ knockoffs for high dimensional controlled variable selection. J. R. Stat. Soc. B 80, 551–577 (2018).
Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B 58, 267–288 (1996).
Sesia, M., Katsevich, E., Bates, S., Candès, E. & Sabatti, C. Multi-resolution localization of causal variants across the genome. Nat. Commun. 11, 1093 (2020).
Lu, Y. Y., Fan, Y., Lv, J. & Noble, W. S. DeepPINK: reproducible feature selection in deep neural networks. In Proc. 32nd International Conference on Neural Information Processing Systems 8690–8700 (Curran Associates, 2018).
He, Z. et al. Identification of putative causal loci in whole-genome sequencing data via knockoff statistics. Nat. Commun. 12, 3512 (2021).
Lu, L., Shin, Y., Su, Y. & Karniadakis, G. E. Dying ReLU and initialization: theory and numerical examples. Commun. Comput. Phys. 5, 1671–1706 (2020).
Clevert, D.-A., Unterthiner, T. & Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations (ICLR, 2016).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
He, Z., Xu, B., Buxbaum, J. & Ionita-Laza, I. A genome-wide scan statistic framework for whole-genome sequence data analysis. Nat. Commun. 10, 3018 (2019).
Li, X. et al. Dynamic incorporation of multiple in silico functional annotations empowers rare variant association analysis of large whole-genome sequencing studies at scale. Nat. Genet. 52, 969–983 (2020).
Dai, C., Lin, B., Xing, X. & Liu, J. False discovery rate control via data splitting. J. Am. Stat. Assoc. https://doi.org/10.1080/01621459.2022.2060113 (2020).
Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33, 1–22 (2010).
Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R. & Lin, C.-J. LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 9, 1871–1874 (2008).
Lee, S., Zhao, Z., Miropolsky, L. & Wu, M. SKAT: SNP-Set (Sequence) Kernel Association Test. R package version 2.2.4 (2022).
Gimenez, J. R. & Zou, J. Improving the stability of the knockoff procedure: multiple simultaneous knockoffs and entropy maximization. In Proc. 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) (PMLR, 2018).
Ren, Z., Wei, Y. & Candès, E. Derandomizing knockoffs. J. Am. Stat. Assoc. https://doi.org/10.1080/01621459.2021.196272 (2021).
He, Z. et al. Genome-wide analysis of common and rare variants via multiple knockoffs at biobank scale, with an application to Alzheimer disease genetics. Am. J. Hum. Genet. 108, 2336–2353 (2021).
Andrews, S. J., Fulton-Howard, B. & Goate, A. Interpretation of risk loci from genome-wide association studies of Alzheimer’s disease. Lancet Neurol. 19, 326–335 (2020).
Ferkingstad, E. et al. Large-scale integration of the plasma proteome with genetics and disease. Nat. Genet. 53, 1712–1721 (2021).
Sesia, M., Bates, S., Candès, E., Marchini, J. & Sabatti, C. False discovery rate control in genome-wide association studies with population structure. Proc. Natl Acad. Sci. USA 118, e2105841118 (2021).
Schaffner, S. F. et al. Calibrating a coalescent simulation of human genome sequence variation. Genome Res. 15, 1576–1583 (2005).
Sesia, M., Sabatti, C. & Candès, E. J. Gene hunting with hidden Markov model knockoffs. Biometrika 106, 1–18 (2019).
Plassman, B. L. et al. Prevalence of dementia in the United States: the aging, demographics, and memory study. Neuroepidemiology 29, 125–132 (2007).
Escott-Price, V., Shoai, M., Pither, R., Williams, J. & Hardy, J. Polygenic score prediction captures nearly all common genetic risk for Alzheimer’s disease. Neurobiol. Aging 49, 214.e7–214.e11 (2017).
Le Guen, Y. et al. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimers. Res. Ther. 13, 72 (2021).
Das, S. et al. Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287 (2016).
Beecham, G. W. et al. The Alzheimer’s disease sequencing project: study design and sample selection. Neurol. Genet. 3, e194–e194 (2017).
Weiner, M. W. et al. The Alzheimer’s disease neuroimaging initiative: progress report and future plans. Alzheimers. Dement. 6, 202–211.e7 (2010).
Bennett, D. A. et al. Overview and findings from the rush memory and aging project. Curr. Alzheimer Res. 9, 646–663 (2012).
Kunkle, B. W. et al. Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing. Nat. Genet. 51, 414–430 (2019).
Kunkle, B. W. et al. Novel Alzheimer disease risk loci and pathways in African American individuals using the African genome resources panel: a meta-analysis. JAMA Neurol. 78, 102–113 (2021).
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
Chen, C.-Y. et al. Improved ancestry inference using weights from external reference panels. Bioinformatics 29, 1399–1406 (2013).
Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Taliun, D. et al. Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299 (2021).
Andrews, S. J., Fulton-Howard, B. & Goate, A. Interpretation of risk loci from genome-wide association studies of Alzheimer’s disease. Lancet Neurol. 19, 326–335 (2020).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Hechtlinger, Y. Interpretation of prediction models using the input gradient. Preprint at https://arxiv.org/abs/1611.07634 (2016).
Le Guen, Y. et al. A novel age-informed approach for genetic association analysis in Alzheimer’s disease. Alzheimers. Res. Ther. 13, 72 (2021).
Saha, S. et al. Hierarchical deep learning neural network (HiDeNN): an artificial intelligence (AI) framework for computational science and engineering. Comput. Methods Appl. Mech. Eng. 373, 113452 (2021).
Roy, D., Panda, P. & Roy, K. Tree-CNN: a hierarchical deep convolutional neural network for incremental learning. Neural Netw. 121, 148–160 (2020).
Kim, J., Kim, B., Roy, P. P. & Jeong, D. Efficient facial expression recognition algorithm based on hierarchical deep neural network structure. IEEE Access 7, 41273–41285 (2019).
Xu, Y. et al. A hierarchical deep learning approach with transparency and interpretability based on small samples for glaucoma diagnosis. npj Digit. Med. 4, 48 (2021).
Glorot, X. & Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. In Proc. 13th International Conference on Artificial Intelligence and Statistics (AISTATS) Vol. 9, 249–256 (JMLR, 2010).
LeCun, Y. A., Bottou, L., Orr, G. B. & Müller, K.-R. in Neural Networks: Tricks of the Trade (eds. Müller, K.-R. et al.) 2nd edn, 9–48 (Springer, 2012); https://doi.org/10.1007/978-3-642-35289-8_3
Jha, N. K., Mittal, S. & Mattela, G. The ramifications of making deep neural networks compact. Preprint at https://arxiv.org/abs/2006.15098 (2020).
Peyman-HK/Stabilized-HiDe-MK: Stabilized HiDe-MK (Zenodo, 2022); https://doi.org/10.5281/zenodo.6872386

Acknowledgements

This research was supported by NIH/NIA award AG066206 (ZH).

Author information

Authors and Affiliations

Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, USA
Peyman H. Kassani, Yann Le Guen, Michael E. Belloy & Zihuai He
Department of Statistics, Stanford University, Stanford, CA, USA
Fred Lu
Quantitative Sciences Unit, Department of Medicine (Biomedical Informatics Research), Stanford University, Stanford, CA, USA
Zihuai He

Contributions

P.H.K. and Z.H. developed the concepts for the manuscript and proposed the method. P.H.K., F.L., Y.L.G. and Z.H. designed the analyses and applications and discussed the results. P.H.K., Z.H. and F.L. conducted the analyses. Z.H., Y.L.G. and M.E.B. helped interpret the results of the real data analyses. P.H.K., Z.H., F.L. and Y.L.G. prepared the manuscript and contributed to editing the paper.

Corresponding author

Correspondence to Zihuai He.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Yue Cao and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Tables 1–4 and discussions of ‘Notes on the real data preparation’ and ‘Model configurations’.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kassani, P.H., Lu, F., Le Guen, Y. et al. Deep neural networks with controlled variable selection for the identification of putative causal genetic variants. Nat Mach Intell 4, 761–771 (2022). https://doi.org/10.1038/s42256-022-00525-0

Received: 24 September 2021
Accepted: 26 July 2022
Published: 15 September 2022
Issue Date: September 2022
DOI: https://doi.org/10.1038/s42256-022-00525-0
