Deep neural network improves the estimation of polygenic risk scores for breast cancer

Badré, Adrien; Zhang, Li; Muchero, Wellington; Reynolds, Justin C.; Pan, Chongle

doi:10.1038/s10038-020-00832-7

Article
Published: 02 October 2020

Adrien Badré¹,
Li Zhang²,
Wellington Muchero³,
Justin C. Reynolds¹ &
…
Chongle Pan¹,² (ORCID: orcid.org/0000-0003-2860-0334)

Journal of Human Genetics volume 66, pages 359–369 (2021)

Abstract

Polygenic risk scores (PRS) estimate an individual's genetic risk for a complex disease from many genetic variants across the whole genome. In this study, we compared a series of computational models for estimating breast cancer PRS. A deep neural network (DNN) outperformed alternative machine learning techniques and established statistical algorithms, including BLUP, BayesA, and LDpred. In the test cohort with 50% prevalence, the area under the receiver operating characteristic curve (AUC) was 67.4% for DNN, 64.2% for BLUP, 64.5% for BayesA, and 62.4% for LDpred. BLUP, BayesA, and LDpred all generated PRS that followed a normal distribution in the case population. However, the PRS generated by DNN in the case population followed a bimodal distribution composed of two normal distributions with distinctly different means. This suggests that DNN was able to separate the case population into a high-genetic-risk case subpopulation, with an average PRS significantly higher than that of the control population, and a normal-genetic-risk case subpopulation, with an average PRS similar to that of the control population. This allowed DNN to achieve 18.8% recall at 90% precision in the test cohort with 50% prevalence, which can be extrapolated to 65.4% recall at 20% precision in a general population with 12% prevalence. Interpretation of the DNN model identified salient variants that were assigned insignificant p values by association studies but were important for DNN prediction. These variants may be associated with the phenotype through nonlinear relationships.
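
The extrapolation from a balanced test cohort to a general population rests on how precision depends on prevalence. The following is a minimal Python sketch of that relationship, assuming that recall (true positive rate) and false positive rate at a fixed score threshold do not change with prevalence. The function names are illustrative, and this is not presented as the authors' procedure.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision (positive predictive value) at a fixed threshold.

    Assumes tpr and fpr are properties of the classifier threshold only,
    so that only the case/control mix changes with prevalence.
    """
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def fpr_from_precision(tpr: float, precision: float, prevalence: float) -> float:
    """False positive rate implied by an observed recall/precision pair."""
    true_pos = tpr * prevalence
    false_pos = true_pos * (1.0 - precision) / precision
    return false_pos / (1.0 - prevalence)


# Operating point quoted in the abstract: 18.8% recall at 90% precision
# in a cohort with 50% prevalence.
fpr_strict = fpr_from_precision(tpr=0.188, precision=0.90, prevalence=0.50)

# Same threshold, re-weighted to a population with 12% prevalence.
ppv_general = precision_at_prevalence(tpr=0.188, fpr=fpr_strict, prevalence=0.12)

print(f"implied FPR at the strict threshold: {fpr_strict:.4f}")
print(f"precision at 12% prevalence, same threshold: {ppv_general:.3f}")
```

Under these assumptions, the threshold that gives 90% precision at 50% prevalence gives roughly 55% precision at 12% prevalence, with recall unchanged at 18.8%. The 65.4% recall at 20% precision quoted above therefore corresponds to a different, more permissive threshold.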

Acknowledgements

We thank the OU Supercomputing Center for Education & Research (OSCER) for supercomputing technical support, the DRIVE project for the GWAS data, NIH dbGaP for data access authorization, and Dr. Xu Chao for helpful discussions. The study was funded by Dr. Pan’s startup funding from the University of Oklahoma and by Laboratory Directed Research and Development (LDRD) funding from the Oak Ridge National Laboratory (ORNL). Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract Number DE-AC05-00OR22725.

Author information

Authors and Affiliations

School of Computer Science, University of Oklahoma, Norman, OK, USA
Adrien Badré, Justin C. Reynolds & Chongle Pan
Department of Microbiology and Plant Biology, University of Oklahoma, Norman, OK, USA
Li Zhang & Chongle Pan
Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
Wellington Muchero

Corresponding author

Correspondence to Chongle Pan.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Table 1

Supplementary Table 2

Supplementary Table 3

Supplementary Table 4

Supplementary Table 5

Supplementary Figures

About this article

Cite this article

Badré, A., Zhang, L., Muchero, W. et al. Deep neural network improves the estimation of polygenic risk scores for breast cancer. J Hum Genet 66, 359–369 (2021). https://doi.org/10.1038/s10038-020-00832-7

Received: 09 April 2020
Accepted: 10 August 2020
Published: 02 October 2020
Issue Date: April 2021
DOI: https://doi.org/10.1038/s10038-020-00832-7
