  • Review Article

Artificial intelligence powered statistical genetics in biobanks

Abstract

Large-scale, sometimes nationwide, prospective genomic cohorts that biobank rich biological specimens such as blood, urine and tissues have been established in several countries and have released vast amounts of data. These genetic and epidemiological resources are expected to allow investigators to disentangle the genetic and environmental components that confer susceptibility to common complex diseases. There are, however, two major challenges for statistical genetics in pursuing this goal: (1) small sample size combined with high dimensionality, and (2) multilayered, heterogeneous endophenotypes. Rather counterintuitively, biobank data generally have a small sample size relative to their dimensionality, which comprises genomic variation, lifestyle questionnaire items, and sometimes their interactions. This is a widely acknowledged difficulty in data analysis, the so-called “p ≫ n problem” in statistics or the “curse of dimensionality” in the machine-learning field. On the other hand, we have too many measurements of individual health status, that is, endophenotypes, such as health check-up data, images and psychological test scores, in addition to metabolomics and proteomics data. These endophenotypes are rich but not very tractable because of their even higher dimensionality and the substantial correlations, and sometimes unclear causal relationships, among them. We have tried to overcome these problems inherent to biobank data using statistical machine-learning and deep-learning technologies.
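
The p ≫ n difficulty mentioned above can be illustrated with a small simulation. The sketch below is not taken from the article: it is a minimal, standard-library-only example in which every predictor and the outcome are independent Gaussian noise, so any correlation found is spurious. The function names, the sample size of 50 and the values of p are hypothetical choices made for illustration. Because the candidate sets are nested, the strongest spurious correlation cannot shrink as p grows, and in practice it rises well above what a single test would suggest.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n_samples, p_values, seed=0):
    """For each p, return the largest |r| between a pure-noise outcome
    and the first p pure-noise predictors (hypothetical helper)."""
    rng = random.Random(seed)
    # Outcome with no true dependence on any predictor.
    y = [rng.gauss(0, 1) for _ in range(n_samples)]
    abs_r = []
    for _ in range(max(p_values)):
        x = [rng.gauss(0, 1) for _ in range(n_samples)]
        abs_r.append(abs(pearson(x, y)))
    # Nested candidate sets: the first p predictors for each requested p.
    return {p: max(abs_r[:p]) for p in p_values}


if __name__ == "__main__":
    for p, r in max_spurious_correlation(50, [10, 100, 1000]).items():
        print(f"p = {p:5d}  max |r| against a pure-noise outcome = {r:.3f}")
```

With a fixed number of samples, screening more variables makes it increasingly likely that some noise variable looks associated with the outcome, which is one reason regularization and careful model selection matter in this setting.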

Acknowledgements

This work was supported by a grant for the Tohoku Medical Megabank Project and the Centre for Advanced Intelligence Project from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. We are grateful to Dr. Satoshi Makino and Miho Kuriki for their special assistance.

Author information

Corresponding author

Correspondence to Gen Tamiya.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Narita, A., Ueki, M. & Tamiya, G. Artificial intelligence powered statistical genetics in biobanks. J Hum Genet 66, 61–65 (2021). https://doi.org/10.1038/s10038-020-0822-y

  • DOI: https://doi.org/10.1038/s10038-020-0822-y
