Artificial intelligence powered statistical genetics in biobanks

Narita, Akira; Ueki, Masao; Tamiya, Gen

doi:10.1038/s10038-020-0822-y

Review Article
Published: 11 August 2020

Journal of Human Genetics volume 66, pages 61–65 (2021)

Abstract

Large-scale, sometimes nationwide, prospective genomic cohorts that biobank rich biological specimens such as blood, urine and tissues have been established in several countries and have released vast amounts of data. These genetic and epidemiological resources are expected to allow investigators to disentangle the genetic and environmental components underlying common complex diseases. There are, however, two major challenges to statistical genetics in reaching this goal: small sample size relative to high dimensionality, and multilayered, heterogeneous endophenotypes. Rather counterintuitively, biobank data generally have a small sample size relative to their dimensionality, which comprises genomic variation, lifestyle questionnaire items, and sometimes their interactions. This is a widely acknowledged difficulty in data analysis, known as the "p ≫ n problem" in statistics or the "curse of dimensionality" in the machine-learning field. On the other hand, there are very many measurements of individual health status, that is, endophenotypes, such as health check-up data, images and psychological test scores, in addition to metabolomics and proteomics data. These endophenotypes are rich but not easily tractable because they raise the dimensionality further and are substantially correlated, sometimes with unclear causal relationships among them. We have tried to overcome these problems inherent to biobank data using statistical machine-learning and deep-learning technologies.
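The p ≫ n difficulty mentioned in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method: it uses simulated Gaussian data, and the function names (`solve_square`, `max_abs_residual`, `noise_fit_demo`) and the sizes n = 10, p = 50 are hypothetical choices made for illustration. It shows that when there are more features than samples, even an outcome that is pure noise can be fitted exactly on the training data, while the same coefficients fail on fresh data.

```python
import random


def solve_square(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix [A | b]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def max_abs_residual(X, y, beta):
    """Largest absolute difference between X @ beta and y."""
    return max(
        abs(sum(xij * bj for xij, bj in zip(row, beta)) - yi)
        for row, yi in zip(X, y)
    )


def noise_fit_demo(n=10, p=50, seed=0):
    """Fit a noise outcome with p > n features; return (train, test) residuals."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to X by construction
    # Any n of the p features already suffice for an exact fit,
    # so the remaining p - n coefficients are simply set to zero.
    beta = solve_square([row[:n] for row in X], y) + [0.0] * (p - n)
    train = max_abs_residual(X, y, beta)
    # Fresh samples from the same process: the exact fit does not carry over.
    X_new = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y_new = [rng.gauss(0, 1) for _ in range(n)]
    test = max_abs_residual(X_new, y_new, beta)
    return train, test


train, test = noise_fit_demo()
print("max training residual:", train)
print("max residual on new data:", test)
```

The training residual is zero up to floating-point error, whereas the residual on new data is not, which is why unpenalized regression is uninformative in this regime and some form of screening, regularization or dimensionality reduction is generally needed.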

Acknowledgements

This work was supported by a grant for the Tohoku Medical Megabank Project and the Centre for Advanced Intelligence Project from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. We are grateful to Dr. Satoshi Makino and Miho Kuriki for their special assistance.

Author information

Authors and Affiliations

Tohoku Medical Megabank Organization, Tohoku University, Sendai, Japan
Akira Narita & Gen Tamiya
RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
Masao Ueki & Gen Tamiya

Corresponding author

Correspondence to Gen Tamiya.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Narita, A., Ueki, M. & Tamiya, G. Artificial intelligence powered statistical genetics in biobanks. J Hum Genet 66, 61–65 (2021). https://doi.org/10.1038/s10038-020-0822-y

Received: 22 June 2020
Revised: 15 July 2020
Accepted: 26 July 2020
Published: 11 August 2020
Issue Date: January 2021
DOI: https://doi.org/10.1038/s10038-020-0822-y

This article is cited by

FAIR principles for AI models with a practical application for accelerated high energy diffraction microscopy
- Nikil Ravi
- Pranshu Chaturvedi
- Ian Foster
Scientific Data (2022)
