Abstract
Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association studies (GWAS), which use sophisticated ML techniques to impute phenotypes and then perform GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.
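To make the summary-statistics idea concrete, the following is a minimal sketch of a post-prediction-style combination for a single variant. It is not the POP-TOOLS interface or necessarily the exact POP-GWAS estimator. It assumes three per-variant GWAS results are available: the observed phenotype in labeled samples, the imputed phenotype in the same labeled samples, and the imputed phenotype in independent unlabeled samples. The function name pop_style_combine, the parameter r (an assumed correlation between the sampling errors of the two labeled-sample estimates) and the variance-minimizing weight are illustrative assumptions.

```python
import math
from statistics import NormalDist


def pop_style_combine(beta_y_lab, se_y_lab,
                      beta_yhat_lab, se_yhat_lab,
                      beta_yhat_unlab, se_yhat_unlab,
                      r):
    """Combine three single-variant GWAS estimates (illustrative sketch).

    beta_y_lab / se_y_lab:         observed phenotype, labeled samples
    beta_yhat_lab / se_yhat_lab:   imputed phenotype, same labeled samples
    beta_yhat_unlab / se_yhat_unlab: imputed phenotype, unlabeled samples
    r: assumed correlation between the sampling errors of beta_y_lab and
       beta_yhat_lab (hypothetical input; it must be estimated elsewhere).
    """
    # Covariance of the two labeled-sample estimates under the assumed r.
    cov_lab = r * se_y_lab * se_yhat_lab
    # Variance of (beta_yhat_unlab - beta_yhat_lab); the labeled and
    # unlabeled samples are assumed to be independent.
    var_diff = se_yhat_lab ** 2 + se_yhat_unlab ** 2
    # Weight that minimizes the variance of the combined estimate.
    omega = cov_lab / var_diff
    # The imputed-phenotype difference has expectation zero, so the combined
    # estimate still targets the observed-phenotype effect for any omega.
    beta = beta_y_lab + omega * (beta_yhat_unlab - beta_yhat_lab)
    var = se_y_lab ** 2 + omega ** 2 * var_diff - 2 * omega * cov_lab
    se = math.sqrt(var)
    z = beta / se
    p = 2 * NormalDist().cdf(-abs(z))  # two-sided normal p-value
    return {"beta": beta, "se": se, "z": z, "p": p, "omega": omega}


if __name__ == "__main__":
    result = pop_style_combine(0.05, 0.02, 0.04, 0.02, 0.045, 0.005, r=0.8)
    print(result)
```

Under these assumptions, an uninformative imputation (r = 0) gives a weight of zero and the result reduces to the labeled-sample GWAS, whereas a more accurate imputation shrinks the standard error. This illustrates how inference of this kind can remain valid regardless of imputation quality.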
Data availability
GWAS summary statistics for skeletal site-specific DXA-BMD are available at https://qlu-lab.org/data.html and in the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/home) under accession numbers GCST90446627–GCST90446644.
Code availability
POP-GWAS software and the power calculator app for ML-assisted GWAS are publicly available at https://github.com/qlu-lab/POP-TOOLS (ref. 64). The analysis code is available at https://github.com/jmiao24/POP-GWAS_analysis (ref. 65).
References
Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Primers 1, 59 (2021).
Dahl, A. et al. Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder. Nat. Genet. 55, 2082–2093 (2023).
An, U. et al. Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries. Nat. Genet. 55, 2269–2276 (2023).
Burstein, D. et al. Genome-wide analysis of a model-derived binge eating disorder phenotype identifies risk loci and implicates iron metabolism. Nat. Genet. 55, 1462–1470 (2023).
Cosentino, J. et al. Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models. Nat. Genet. 55, 787–795 (2023).
Kun, E. et al. The genetic architecture and evolution of the human skeletal form. Science 381, eadf8009 (2023).
Sethi, A., Ruby, J. G., Veras, M. A., Telis, N. & Melamud, E. Genetics implicates overactive osteogenesis in the development of diffuse idiopathic skeletal hyperostosis. Nat. Commun. 14, 2644 (2023).
Alipanahi, B. et al. Large-scale machine-learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology. Am. J. Hum. Genet. 108, 1217–1230 (2021).
Dahl, A. et al. A multiple-phenotype imputation method for genetic studies. Nat. Genet. 48, 466–472 (2016).
Yun, T. et al. Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction. Nat. Genet. 56, 1604–1613 (2024).
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
Sun, B. B. et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature 622, 329–338 (2023).
Zhao, B. et al. Common genetic variation influencing human white matter microstructure. Science 372, eabf3736 (2021).
Zhao, B. et al. Heart-brain connections: phenotypic and genetic insights from magnetic resonance images. Science 380, abn6598 (2023).
Ramírez, J. et al. Analysing electrocardiographic traits and predicting cardiac risk in UK biobank. JRSM Cardiovasc. Dis. 10, 20480040211023664 (2021).
Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat. Commun. 14, 604 (2023).
He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Hormozdiari, F. et al. Imputing phenotypes for genome-wide association studies. Am. J. Hum. Genet. 99, 89–103 (2016).
Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50, 229–237 (2018).
McCaw, Z. R., Gao, J., Lin, X. & Gronsbell, J. Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nat. Genet. 56, 1527–1536 (2024).
Mazumder, R., Hastie, T. & Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010).
Mahajan, A. et al. Multi-ancestry genetic study of type 2 diabetes highlights the power of diverse populations for discovery and translation. Nat. Genet. 54, 560–572 (2022).
Dornbos, P. et al. A combined polygenic score of 21,293 rare and 22 common variants improves diabetes diagnosis based on hemoglobin A1C levels. Nat. Genet. 54, 1609–1614 (2022).
Wheeler, E. et al. Impact of common genetic determinants of hemoglobin A1c on type 2 diabetes risk and diagnosis in ancestrally diverse populations: a transethnic genome-wide meta-analysis. PLoS Med. 14, e1002383 (2017).
Sarnowski, C. et al. Impact of rare and common genetic variants on diabetes diagnosis by hemoglobin A1c in multi-ancestry cohorts: the trans-omics for precision medicine program. Am. J. Hum. Genet. 105, 706–718 (2019).
Leong, A. & Meigs, J. B. Type 2 diabetes prevention: implications of hemoglobin A1c genetics. Rev. Diabet. Stud. 12, 351–362 (2015).
Chen, J. et al. The trans-ancestral genomic architecture of glycemic traits. Nat. Genet. 53, 840–860 (2021).
Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).
Miao, J., Miao, X., Wu, Y., Zhao, J. & Lu, Q. Assumption-lean and data-adaptive post-prediction inference. Preprint at https://arxiv.org/abs/2311.14220 (2023).
Zheng, H. F. et al. Whole‐genome sequencing identifies EN1 as a determinant of bone density and fracture. Nature 526, 112–117 (2015).
Estrada, K. et al. Genome-wide meta-analysis identifies 56 bone mineral density loci and reveals 14 loci associated with risk of fracture. Nat. Genet. 44, 491–501 (2012).
Haseltine, K. N. et al. Bone mineral density: clinical relevance and quantitative assessment. J. Nucl. Med. 62, 446–454 (2021).
Boer, C. G. et al. Deciphering osteoarthritis genetics across 826,690 individuals from 9 populations. Cell 184, 4784–4818 (2021).
Nethander, M. et al. An atlas of genetic determinants of forearm fracture. Nat. Genet. 55, 1820–1830 (2023).
Medina-Gomez, C. et al. Bone mineral density loci specific to the skull portray potential pleiotropic effects on craniosynostosis. Commun. Biol. 6, 691 (2023).
Nethander, M. et al. Assessment of the genetic and clinical determinants of hip fracture risk: genome-wide association and Mendelian randomization study. Cell Rep. Med. 3, 100776 (2022).
Trajanoska, K. et al. Assessment of the genetic and clinical determinants of fracture risk: genome wide association and mendelian randomisation study. BMJ 362, k3225 (2018).
Mullin, B. H. et al. Expression quantitative trait locus study of bone mineral density GWAS variants in human osteoclasts. J. Bone Miner. Res. 33, 1044–1051 (2018).
Mullin, B. H. et al. Characterisation of genetic regulatory effects for osteoporosis risk variants in human osteoclasts. Genome Biol. 21, 80 (2020).
Wen, Y. et al. COL4A2 in the tissue-specific extracellular matrix plays important role on osteogenic differentiation of periodontal ligament stem cells. Theranostics 9, 4265 (2019).
Del Mare, S., Kurek, K. C., Stein, G. S., Lian, J. B. & Aqeilan, R. I. Role of the WWOX tumor suppressor gene in bone homeostasis and the pathogenesis of osteosarcoma. Am. J. Cancer Res. 1, 585–594 (2011).
Morris, J. A. et al. An atlas of genetic influences on osteoporosis in humans and mice. Nat. Genet. 51, 258–266 (2019).
Park, S. et al. Unlike LGR4, LGR5 potentiates Wnt–β-catenin signaling without sequestering E3 ligases. Sci. Signal. 13, eaaz4051 (2020).
Olbertová, K. et al. Role of LGR5-positive mesenchymal cells in craniofacial development. Front. Cell Dev. Biol. 10, 810527 (2022).
Morita, H. et al. Neonatal lethality of LGR5 null mice is associated with ankyloglossia and gastrointestinal distension. Mol. Cell. Biol. 24, 9736–9743 (2004).
Wang, S., McCormick, T. H. & Leek, J. T. Methods for correcting inference based on outcomes predicted by machine learning. Proc. Natl Acad. Sci. USA 117, 30266–30275 (2020).
Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I. & Zrnic, T. Prediction-powered inference. Science 382, 669–674 (2023).
Daetwyler, H. D., Villanueva, B. & Woolliams, J. A. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE 3, e3395 (2008).
De Vlaming, R. et al. Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet. 13, e1006495 (2017).
Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).
Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).
Medina-Gomez, C. et al. Life-course genome-wide association study meta-analysis of total body BMD and assessment of age-specific effects. Am. J. Hum. Genet. 102, 88–102 (2018).
Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).
Wallace, C. Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses. PLoS Genet. 16, e1008720 (2020).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).
Lu, Q. et al. Systematic tissue-specific functional annotation of the human genome highlights immune-related DNA elements for late-onset Alzheimer’s disease. PLoS Genet. 13, e1006933 (2017).
Watanabe, K., Taskesen, E., Van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).
De Leeuw, C. A., Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11, e1004219 (2015).
Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).
Li, M.-X., Yeung, J. M., Cherny, S. S. & Sham, P. C. Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum. Genet. 131, 747–756 (2012).
Miao, J. & qlu-lab. jmiao24/POP-TOOLS: POP-TOOLS v1.1.0. Zenodo https://doi.org/10.5281/zenodo.13334219 (2024).
Miao, J. jmiao24/POP-GWAS_analysis: POP-GWAS analysis v1.0.0. Zenodo https://doi.org/10.5281/zenodo.13334325 (2024).
Acknowledgements
We gratefully acknowledge research support from the National Institutes of Health (NIH; grant U01 HG012039) and support from the University of Wisconsin–Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. We also acknowledge the use of the facilities of the Center for Demography of Health and Aging at the University of Wisconsin–Madison, funded by the National Institute on Aging (NIA) Center Grant (P30 AG017266). We thank members of the Social Genomics Working Group at the University of Wisconsin for their helpful comments. The font choice in Fig. 2b is inspired by pop art. The funders had no role in study design, data collection and analysis, the decision to publish or the preparation of the manuscript.
Author information
Contributions
J.M. conceived the study and developed the statistical framework. J.M., Y.W. and Z.S. performed data analysis. Y.W. implemented the software. X.M. developed the method to account for selection bias. T.L. advised on result interpretation. J.Z. and Q.L. advised on statistical issues. Q.L. advised on genetic issues. J.M. and Q.L. wrote the manuscript. All authors contributed to manuscript editing and approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Note and Figs. 1–21.
Supplementary Tables
Supplementary Tables 1–11.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Miao, J., Wu, Y., Sun, Z. et al. Valid inference for machine learning-assisted genome-wide association studies. Nat Genet (2024). https://doi.org/10.1038/s41588-024-01934-0
DOI: https://doi.org/10.1038/s41588-024-01934-0