Abstract
Large-scale, sometimes nationwide, prospective genomic cohorts that biobank rich biological specimens such as blood, urine, and tissue have been established in several countries and have released vast amounts of data. These genetic and epidemiological resources are expected to allow investigators to disentangle the genetic and environmental components underlying common complex diseases. Two major challenges, however, stand between statistical genetics and this goal: small sample size combined with high dimensionality, and multilayered, heterogeneous endophenotypes. Rather counterintuitively, biobank data generally have a small sample size relative to their dimensionality, which comprises genomic variation, lifestyle questionnaire items, and sometimes their interactions. This widely acknowledged difficulty in data analysis is known as the “p ≫ n problem” in statistics or the “curse of dimensionality” in machine learning. On the other hand, biobanks hold a great many measurements of individual health status, namely endophenotypes, such as health check-up data, images, and psychological test scores, in addition to metabolomic and proteomic data. These endophenotypes are rich but hard to handle because they further increase dimensionality and are substantially correlated with one another, sometimes with causal relationships among them that are difficult to untangle. We have tried to overcome these problems inherent to biobank data using statistical machine-learning and deep-learning technologies.
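As an illustration of why the p ≫ n setting is difficult, the following minimal sketch (not the authors' method) simulates an outcome and candidate predictors that are all pure noise, then reports the largest absolute correlation found. The sample size, feature counts, seed, and function names are hypothetical choices made only for this example. With the sample size held fixed, scanning more predictors can only raise the strongest spurious association, which is one reason naive one-at-a-time screening can mislead in high dimensions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def max_spurious_correlation(n_samples, n_features, seed=0):
    """Largest |r| between a pure-noise outcome and pure-noise features."""
    rng = random.Random(seed)
    # The outcome is unrelated to every feature by construction.
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson_r(feature, outcome)))
    return best


# Hypothetical sizes: the same n, with increasingly many candidate predictors.
few = max_spurious_correlation(n_samples=50, n_features=10)
many = max_spurious_correlation(n_samples=50, n_features=2000)
print(f"p=10:   max |r| = {few:.2f}")
print(f"p=2000: max |r| = {many:.2f}")
```

Because both calls share a seed, the first ten features are identical in the two runs, so the second maximum is never smaller than the first. Real genomic predictors are correlated with one another rather than independent, so this sketch shows only the direction of the effect, not its size in actual biobank data.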
Acknowledgements
This work was supported by a grant for the Tohoku Medical Megabank Project and the Centre for Advanced Intelligence Project from the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of Japan. We are grateful to Dr. Satoshi Makino and Miho Kuriki for their special assistance.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Narita, A., Ueki, M. & Tamiya, G. Artificial intelligence powered statistical genetics in biobanks. J Hum Genet 66, 61–65 (2021). https://doi.org/10.1038/s10038-020-0822-y
DOI: https://doi.org/10.1038/s10038-020-0822-y