Polygenic risk scores (PRS) estimate an individual's genetic risk for a complex disease from many genetic variants across the whole genome. In this study, we compared a series of computational models for estimating breast cancer PRS. A deep neural network (DNN) was found to outperform alternative machine learning techniques and established statistical algorithms, including BLUP, BayesA, and LDpred. In the test cohort with 50% prevalence, the area under the receiver operating characteristic curve (AUC) was 67.4% for DNN, 64.2% for BLUP, 64.5% for BayesA, and 62.4% for LDpred. BLUP, BayesA, and LDpred all generated PRS that followed a normal distribution in the case population. However, the PRS generated by DNN in the case population followed a bimodal distribution composed of two normal distributions with distinctly different means. This suggests that DNN was able to separate the case population into a high-genetic-risk case subpopulation, with an average PRS significantly higher than that of the control population, and a normal-genetic-risk case subpopulation, with an average PRS similar to that of the control population. This allowed DNN to achieve 18.8% recall at 90% precision in the test cohort with 50% prevalence, which can be extrapolated to 65.4% recall at 20% precision in a general population with 12% prevalence. Interpretation of the DNN model identified salient variants that were assigned insignificant p values by association studies but were important for DNN prediction. These variants may be associated with the phenotype through nonlinear relationships.
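The dependence of precision on prevalence mentioned above can be illustrated with a short calculation. The sketch below is not the authors' extrapolation procedure; it only assumes that recall and the false-positive rate of a fixed score threshold do not change with prevalence, and applies Bayes' rule. The function names `fpr_from_precision` and `precision_at_prevalence` are hypothetical helpers introduced for illustration. It reuses the 18.8% recall and 90% precision reported for the 50%-prevalence test cohort and asks what precision the same threshold would give at 12% prevalence.

```python
def fpr_from_precision(recall, precision, prevalence):
    """False-positive rate implied by a (recall, precision) pair at a given prevalence."""
    true_pos = recall * prevalence                       # fraction of cohort that is TP
    false_pos = true_pos * (1.0 - precision) / precision  # fraction of cohort that is FP
    return false_pos / (1.0 - prevalence)                # FP as a fraction of controls


def precision_at_prevalence(recall, fpr, prevalence):
    """Precision of the same threshold when applied at a different prevalence."""
    true_pos = recall * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Operating point reported for the balanced test cohort (50% prevalence).
recall = 0.188
fpr = fpr_from_precision(recall, precision=0.90, prevalence=0.50)

# Assumption: the threshold's recall and FPR carry over unchanged to 12% prevalence.
precision_12 = precision_at_prevalence(recall, fpr, prevalence=0.12)

print(f"implied false-positive rate: {fpr:.4f}")
print(f"precision at 12% prevalence, same threshold: {precision_12:.3f}")
```

Under these assumptions the same threshold yields a precision of roughly 55% at 12% prevalence. The 65.4% recall at 20% precision quoted in the abstract corresponds to a different, more permissive threshold, and reproducing it would require the full score distributions, which are not given here.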
We would like to thank the OU Supercomputing Center for Education & Research (OSCER) for supercomputing technical support, the DRIVE project for the GWAS data, NIH dbGaP for data access authorization, and Dr. Xu Chao for helpful discussions. The study was funded by Dr. Pan’s startup funding from the University of Oklahoma and by Oak Ridge National Laboratory (ORNL) Laboratory Directed Research and Development (LDRD) funding. Oak Ridge National Laboratory is managed by UT-Battelle, LLC for the U.S. Department of Energy under Contract Number DE-AC05-00OR22725.
Conflict of interest
The authors declare that they have no conflict of interest.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Badré, A., Zhang, L., Muchero, W. et al. Deep neural network improves the estimation of polygenic risk scores for breast cancer. J Hum Genet 66, 359–369 (2021). https://doi.org/10.1038/s10038-020-00832-7