Valid inference for machine learning-assisted genome-wide association studies

Abstract

Machine learning (ML) has become increasingly popular in almost all scientific disciplines, including human genetics. Owing to challenges related to sample collection and precise phenotyping, ML-assisted genome-wide association studies (GWAS), which use sophisticated ML techniques to impute phenotypes and then perform GWAS on the imputed outcomes, have become increasingly common in complex trait genetics research. However, the validity of ML-assisted GWAS associations has not been carefully evaluated. Here, we report pervasive risks for false-positive associations in ML-assisted GWAS and introduce Post-Prediction GWAS (POP-GWAS), a statistical framework that redesigns GWAS on ML-imputed outcomes. POP-GWAS ensures valid and powerful statistical inference irrespective of imputation quality and choice of algorithm, requiring only GWAS summary statistics as input. We employed POP-GWAS to perform a GWAS of bone mineral density derived from dual-energy X-ray absorptiometry imaging at 14 skeletal sites, identifying 89 new loci and revealing skeletal site-specific genetic architecture. Our framework offers a robust analytic solution for future ML-assisted GWAS.
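
To illustrate how inference on ML-imputed outcomes can rely on summary statistics alone, the following is a minimal sketch of a generic control-variate (prediction-powered) combination for a single variant, in the spirit of refs. 31 and 49. It is not claimed to be the POP-GWAS implementation. The function name, the argument names and the approximation that the covariance between the two labeled-sample estimates equals `r` times the product of their standard errors are all assumptions made for illustration.

```python
import math


def combine_summary_stats(beta_y_lab, se_y_lab,
                          beta_yhat_lab, se_yhat_lab,
                          beta_yhat_unlab, se_yhat_unlab, r):
    """Combine three per-variant GWAS results (hypothetical helper).

    beta_y_lab / se_y_lab:       observed phenotype, labeled sample
    beta_yhat_lab / se_yhat_lab: imputed phenotype, same labeled sample
    beta_yhat_unlab / se_yhat_unlab: imputed phenotype, independent unlabeled sample
    r: correlation between observed and imputed phenotype in the labeled sample
    """
    var_y = se_y_lab ** 2
    # Variance of (beta_yhat_unlab - beta_yhat_lab); the two samples are independent.
    var_diff = se_yhat_lab ** 2 + se_yhat_unlab ** 2
    # Assumption: small per-variant effects, so the covariance of the two
    # labeled-sample estimates is approximately r * se_y_lab * se_yhat_lab.
    cov = r * se_y_lab * se_yhat_lab
    # Variance-minimizing weight; w = 0 recovers the labeled-only GWAS.
    w = cov / var_diff
    # The correction term has mean zero when both samples come from the same
    # population, so the estimate targets the observed-phenotype effect for any w.
    beta = beta_y_lab + w * (beta_yhat_unlab - beta_yhat_lab)
    var = var_y + w ** 2 * var_diff - 2 * w * cov
    se = math.sqrt(var)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P value
    return beta, se, z, p, w


if __name__ == "__main__":
    # Made-up numbers for one variant, for illustration only.
    print(combine_summary_stats(0.05, 0.02, 0.04, 0.02, 0.045, 0.005, r=0.8))
```

In this sketch, an uninformative imputation (`r = 0`) gives a weight of zero and returns the labeled-sample estimate unchanged, while a more accurate imputation yields a smaller standard error.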

Fig. 1: Pervasive false-positive associations in the GWAS on imputed T2D.
Fig. 2: Comparison of POP-GWAS and a conventional design for ML-assisted GWAS.
Fig. 3: Simulation results.
Fig. 4: Effective sample size calculation for ML-assisted GWAS.
Fig. 5: POP-GWAS for DXA-BMD across 14 skeletal sites.
Fig. 6: LGR5 as a head-specific GWAS signal.

Data availability

GWAS summary statistics for skeletal site-specific DXA-BMD are available at https://qlu-lab.org/data.html and the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/home) under accession numbers GCST90446627–GCST90446644.

Code availability

POP-GWAS software and the power calculator app for ML-assisted GWAS are publicly available at https://github.com/qlu-lab/POP-TOOLS (ref. 64). The analysis code is available at https://github.com/jmiao24/POP-GWAS_analysis (ref. 65).

References

  1. Uffelmann, E. et al. Genome-wide association studies. Nat. Rev. Methods Primers 1, 59 (2021).

  2. Dahl, A. et al. Phenotype integration improves power and preserves specificity in biobank-based genetic studies of major depressive disorder. Nat. Genet. 55, 2082–2093 (2023).

  3. An, U. et al. Deep learning-based phenotype imputation on population-scale biobank data increases genetic discoveries. Nat. Genet. 55, 2269–2276 (2023).

  4. Burstein, D. et al. Genome-wide analysis of a model-derived binge eating disorder phenotype identifies risk loci and implicates iron metabolism. Nat. Genet. 55, 1462–1470 (2023).

  5. Cosentino, J. et al. Inference of chronic obstructive pulmonary disease with deep learning on raw spirograms identifies new genetic loci and improves risk models. Nat. Genet. 55, 787–795 (2023).

  6. Kun, E. et al. The genetic architecture and evolution of the human skeletal form. Science 381, eadf8009 (2023).

  7. Sethi, A., Ruby, J. G., Veras, M. A., Telis, N. & Melamud, E. Genetics implicates overactive osteogenesis in the development of diffuse idiopathic skeletal hyperostosis. Nat. Commun. 14, 2644 (2023).

  8. Alipanahi, B. et al. Large-scale machine-learning-based phenotyping significantly improves genomic discovery for optic nerve head morphology. Am. J. Hum. Genet. 108, 1217–1230 (2021).

  9. Dahl, A. et al. A multiple-phenotype imputation method for genetic studies. Nat. Genet. 48, 466–472 (2016).

  10. Yun, T. et al. Unsupervised representation learning on high-dimensional clinical data improves genomic discovery and prediction. Nat. Genet. 56, 1604–1613 (2024).

  11. Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).

  12. Sun, B. B. et al. Plasma proteomic associations with genetics and health in the UK Biobank. Nature 622, 329–338 (2023).

  13. Zhao, B. et al. Common genetic variation influencing human white matter microstructure. Science 372, eabf3736 (2021).

  14. Zhao, B. et al. Heart-brain connections: phenotypic and genetic insights from magnetic resonance images. Science 380, abn6598 (2023).

  15. Ramírez, J. et al. Analysing electrocardiographic traits and predicting cardiac risk in UK biobank. JRSM Cardiovasc. Dis. 10, 20480040211023664 (2021).

  16. Julkunen, H. et al. Atlas of plasma NMR biomarkers for health and disease in 118,461 individuals from the UK Biobank. Nat. Commun. 14, 604 (2023).

  17. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 770–778 (IEEE, 2016).

  18. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  19. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).

  20. Hormozdiari, F. et al. Imputing phenotypes for genome-wide association studies. Am. J. Hum. Genet. 99, 89–103 (2016).

  21. Turley, P. et al. Multi-trait analysis of genome-wide association summary statistics using MTAG. Nat. Genet. 50, 229–237 (2018).

  22. McCaw, Z. R., Gao, J., Lin, X. & Gronsbell, J. Synthetic surrogates improve power for genome-wide association studies of partially missing phenotypes in population biobanks. Nat. Genet. 56, 1527–1536 (2024).

  23. Mazumder, R., Hastie, T. & Tibshirani, R. Spectral regularization algorithms for learning large incomplete matrices. J. Mach. Learn. Res. 11, 2287–2322 (2010).

  24. Mahajan, A. et al. Multi-ancestry genetic study of type 2 diabetes highlights the power of diverse populations for discovery and translation. Nat. Genet. 54, 560–572 (2022).

  25. Dornbos, P. et al. A combined polygenic score of 21,293 rare and 22 common variants improves diabetes diagnosis based on hemoglobin A1C levels. Nat. Genet. 54, 1609–1614 (2022).

  26. Wheeler, E. et al. Impact of common genetic determinants of hemoglobin A1c on type 2 diabetes risk and diagnosis in ancestrally diverse populations: a transethnic genome-wide meta-analysis. PLoS Med. 14, e1002383 (2017).

  27. Sarnowski, C. et al. Impact of rare and common genetic variants on diabetes diagnosis by hemoglobin A1c in multi-ancestry cohorts: the trans-omics for precision medicine program. Am. J. Hum. Genet. 105, 706–718 (2019).

  28. Leong, A. & Meigs, J. B. Type 2 diabetes prevention: implications of hemoglobin A1c genetics. Rev. Diabet. Stud. 12, 351–362 (2015).

  29. Chen, J. et al. The trans-ancestral genomic architecture of glycemic traits. Nat. Genet. 53, 840–860 (2021).

  30. Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).

  31. Miao, J., Miao, X., Wu, Y., Zhao, J. & Lu, Q. Assumption-lean and data-adaptive post-prediction inference. Preprint at https://arxiv.org/abs/2311.14220 (2023).

  32. Zheng, H. F. et al. Whole‐genome sequencing identifies EN1 as a determinant of bone density and fracture. Nature 526, 112–117 (2015).

  33. Estrada, K. et al. Genome-wide meta-analysis identifies 56 bone mineral density loci and reveals 14 loci associated with risk of fracture. Nat. Genet. 44, 491–501 (2012).

  34. Haseltine, K. N. et al. Bone mineral density: clinical relevance and quantitative assessment. J. Nucl. Med. 62, 446–454 (2021).

  35. Boer, C. G. et al. Deciphering osteoarthritis genetics across 826,690 individuals from 9 populations. Cell 184, 4784–4818 (2021).

  36. Nethander, M. et al. An atlas of genetic determinants of forearm fracture. Nat. Genet. 55, 1820–1830 (2023).

  37. Medina-Gomez, C. et al. Bone mineral density loci specific to the skull portray potential pleiotropic effects on craniosynostosis. Commun. Biol. 6, 691 (2023).

  38. Nethander, M. et al. Assessment of the genetic and clinical determinants of hip fracture risk: genome-wide association and Mendelian randomization study. Cell Rep. Med. 3, 100776 (2022).

  39. Trajanoska, K. et al. Assessment of the genetic and clinical determinants of fracture risk: genome wide association and mendelian randomisation study. BMJ 362, k3225 (2018).

  40. Mullin, B. H. et al. Expression quantitative trait locus study of bone mineral density GWAS variants in human osteoclasts. J. Bone Miner. Res. 33, 1044–1051 (2018).

  41. Mullin, B. H. et al. Characterisation of genetic regulatory effects for osteoporosis risk variants in human osteoclasts. Genome Biol. 21, 80 (2020).

  42. Wen, Y. et al. COL4A2 in the tissue-specific extracellular matrix plays important role on osteogenic differentiation of periodontal ligament stem cells. Theranostics 9, 4265 (2019).

  43. Del Mare, S., Kurek, K. C., Stein, G. S., Lian, J. B. & Aqeilan, R. I. Role of the WWOX tumor suppressor gene in bone homeostasis and the pathogenesis of osteosarcoma. Am. J. Cancer Res. 1, 585–594 (2011).

  44. Morris, J. A. et al. An atlas of genetic influences on osteoporosis in humans and mice. Nat. Genet. 51, 258–266 (2019).

  45. Park, S. et al. Unlike LGR4, LGR5 potentiates Wnt–β-catenin signaling without sequestering E3 ligases. Sci. Signal. 13, eaaz4051 (2020).

  46. Olbertová, K. et al. Role of LGR5-positive mesenchymal cells in craniofacial development. Front. Cell Dev. Biol. 10, 810527 (2022).

  47. Morita, H. et al. Neonatal lethality of LGR5 null mice is associated with ankyloglossia and gastrointestinal distension. Mol. Cell. Biol. 24, 9736–9743 (2004).

  48. Wang, S., McCormick, T. H. & Leek, J. T. Methods for correcting inference based on outcomes predicted by machine learning. Proc. Natl Acad. Sci. USA 117, 30266–30275 (2020).

  49. Angelopoulos, A. N., Bates, S., Fannjiang, C., Jordan, M. I. & Zrnic, T. Prediction-powered inference. Science 382, 669–674 (2023).

  50. Daetwyler, H. D., Villanueva, B. & Woolliams, J. A. Accuracy of predicting the genetic risk of disease using a genome-wide approach. PLoS ONE 3, e3395 (2008).

  51. De Vlaming, R. et al. Meta-GWAS accuracy and power (MetaGAP) calculator shows that hiding heritability is partially due to imperfect genetic correlations across studies. PLoS Genet. 13, e1006495 (2017).

  52. Bulik-Sullivan, B. et al. An atlas of genetic correlations across human diseases and traits. Nat. Genet. 47, 1236–1241 (2015).

  53. Willer, C. J., Li, Y. & Abecasis, G. R. METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics 26, 2190–2191 (2010).

  54. Medina-Gomez, C. et al. Life-course genome-wide association study meta-analysis of total body BMD and assessment of age-specific effects. Am. J. Hum. Genet. 102, 88–102 (2018).

  55. Bulik-Sullivan, B. K. et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat. Genet. 47, 291–295 (2015).

  56. Wallace, C. Eliciting priors and relaxing the single causal variant assumption in colocalisation analyses. PLoS Genet. 16, e1008720 (2020).

  57. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).

  58. Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421–1427 (2017).

  59. Lu, Q. et al. Systematic tissue-specific functional annotation of the human genome highlights immune-related DNA elements for late-onset Alzheimer’s disease. PLoS Genet. 13, e1006933 (2017).

  60. Watanabe, K., Taskesen, E., Van Bochoven, A. & Posthuma, D. Functional mapping and annotation of genetic associations with FUMA. Nat. Commun. 8, 1826 (2017).

  61. De Leeuw, C. A., Mooij, J. M., Heskes, T. & Posthuma, D. MAGMA: generalized gene-set analysis of GWAS data. PLoS Comput. Biol. 11, e1004219 (2015).

  62. Zhu, Z. et al. Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nat. Genet. 48, 481–487 (2016).

  63. Li, M.-X., Yeung, J. M., Cherny, S. S. & Sham, P. C. Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum. Genet. 131, 747–756 (2012).

  64. Miao, J. & qlu-lab. jmiao24/POP-TOOLS: POP-TOOLS v1.1.0. Zenodo https://doi.org/10.5281/zenodo.13334219 (2024).

  65. Miao, J. jmiao24/POP-GWAS_analysis: POP-GWAS analysis v1.0.0. Zenodo https://doi.org/10.5281/zenodo.13334325 (2024).

Acknowledgements

We gratefully acknowledge research support from the National Institutes of Health (NIH; grant U01 HG012039) and support from the University of Wisconsin–Madison Office of the Chancellor and the Vice Chancellor for Research and Graduate Education with funding from the Wisconsin Alumni Research Foundation. We also acknowledge the use of the facilities of the Center for Demography of Health and Aging at the University of Wisconsin–Madison, funded by the National Institute on Aging (NIA) Center Grant (P30 AG017266). We thank members of the Social Genomics Working Group at the University of Wisconsin for their helpful comments. The font choice in Fig. 2b is inspired by pop art. The funders had no role in study design, data collection and analysis, the decision to publish or the preparation of the manuscript.

Author information

Contributions

J.M. conceived the study and developed the statistical framework. J.M., Y.W. and Z.S. performed data analysis. Y.W. implemented the software. X.M. developed the method to account for selection bias. T.L. advised on result interpretation. J.Z. and Q.L. advised on statistical issues. Q.L. advised on genetic issues. J.M. and Q.L. wrote the manuscript. All authors contributed to manuscript editing and approved the manuscript.

Corresponding author

Correspondence to Qiongshi Lu.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Note and Figs. 1–21.

Reporting Summary

Peer Review File

Supplementary Tables

Supplementary Tables 1–11.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Miao, J., Wu, Y., Sun, Z. et al. Valid inference for machine learning-assisted genome-wide association studies. Nat Genet (2024). https://doi.org/10.1038/s41588-024-01934-0

  • DOI: https://doi.org/10.1038/s41588-024-01934-0
