An ensemble penalized regression method for multi-ancestry polygenic risk prediction

Great efforts are being made to develop advanced polygenic risk scores (PRS) to improve the prediction of complex traits and diseases. However, most existing PRS are primarily trained on European ancestry populations, limiting their transferability to non-European populations. In this article, we propose a novel method for generating multi-ancestry Polygenic Risk scOres based on enSemble of PEnalized Regression models (PROSPER). PROSPER integrates genome-wide association studies (GWAS) summary statistics from diverse populations to develop ancestry-specific PRS with improved predictive power for minority populations. The method uses a combination of L1 (lasso) and L2 (ridge) penalty functions, a parsimonious specification of the penalty parameters across populations, and an ensemble step to combine PRS generated across different penalty parameters. We evaluate the performance of PROSPER and other existing methods on large-scale simulated and real datasets, including those from 23andMe Inc., the Global Lipids Genetics Consortium, and All of Us. Results show that PROSPER can substantially improve multi-ancestry polygenic prediction compared to alternative methods across a wide variety of genetic architectures. In real data analyses, for example, PROSPER increased out-of-sample prediction R² for continuous traits by an average of 70% compared to a state-of-the-art Bayesian method (PRS-CSx) in the African ancestry population. Further, PROSPER is computationally highly scalable for the analysis of large SNP contents and many diverse populations.
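
The ensemble step mentioned above can be illustrated with a minimal sketch. It is not the PROSPER implementation: the candidate scores are simulated rather than produced by penalized regression, ordinary least squares stands in for whatever learner the method actually uses, and every name in the code (`fit_ensemble`, `predict_ensemble`, `r_squared`, `simulate`) as well as the toy signal and noise levels are assumptions made for illustration. The sketch fits combination weights in a tuning sample and reports R² in an independent test sample, compared with the single candidate PRS that looked best in the tuning sample.

```python
import numpy as np


def fit_ensemble(prs_tune, y_tune):
    """Least-squares weights (intercept first) for combining candidate PRS."""
    design = np.column_stack([np.ones(len(y_tune)), prs_tune])
    weights, *_ = np.linalg.lstsq(design, y_tune, rcond=None)
    return weights


def predict_ensemble(weights, prs):
    """Apply the fitted intercept and per-score weights to new samples."""
    return weights[0] + prs @ weights[1:]


def r_squared(y, y_hat):
    """Squared Pearson correlation between phenotype and prediction."""
    return float(np.corrcoef(y, y_hat)[0, 1] ** 2)


def simulate(n, n_scores, rng):
    """Toy data (hypothetical): each candidate PRS is a noisy view of one
    shared genetic signal; the phenotype has that signal plus noise."""
    g = rng.standard_normal(n)
    noise_sd = np.linspace(0.5, 2.0, n_scores)
    prs = g[:, None] + rng.standard_normal((n, n_scores)) * noise_sd
    y = np.sqrt(0.4) * g + np.sqrt(0.6) * rng.standard_normal(n)
    return prs, y


rng = np.random.default_rng(0)
prs_tune, y_tune = simulate(10_000, 5, rng)   # tuning sample
prs_test, y_test = simulate(10_000, 5, rng)   # independent test sample

# Ensemble: weights learned on the tuning sample, evaluated on the test sample.
weights = fit_ensemble(prs_tune, y_tune)
r2_ensemble = r_squared(y_test, predict_ensemble(weights, prs_test))

# Baseline: pick the single best candidate on the tuning sample.
tune_r2 = [r_squared(y_tune, prs_tune[:, k]) for k in range(prs_tune.shape[1])]
best = int(np.argmax(tune_r2))
r2_single = r_squared(y_test, prs_test[:, best])

print(f"best single PRS R2: {r2_single:.3f}, ensemble R2: {r2_ensemble:.3f}")
```

In this toy setting the candidate scores carry independent noise, so a weighted combination recovers more of the shared signal than any one score. Real candidate PRS obtained under neighbouring penalty parameters are typically highly correlated, so a regularized learner may be preferable to plain least squares.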

Per-SNP heritability is assumed to be the same across all populations, and the correlation in effect sizes for shared SNPs between all pairs of populations is fixed at 0.6. The sample sizes for GWAS training data are assumed to be (a) n = 15,000 and (b) n = 80,000 for the four non-EUR target populations, and fixed at n = 100,000 for the EUR population. PRS generated from all methods are tuned in n = 10,000 samples and then tested in n = 10,000 independent samples in each target population.

Supplementary Figure 3: Performance of alternative methods on simulated data generated with different sample sizes and different genetic architectures. Data are simulated for a continuous phenotype under a no negative selection model and three different degrees of polygenicity (top panel: 0.01, middle panel: 0.001, and bottom panel: 5 × 10⁻⁴). Common SNP heritability is fixed at 0.4 across all populations, and the correlation in effect sizes for shared SNPs between all pairs of populations is fixed at 0.8. The sample sizes for GWAS training data are assumed to be (a) n = 15,000 and (b) n = 80,000 for the four non-EUR target populations, and fixed at n = 100,000 for the EUR population. PRS generated from all methods are tuned in n = 10,000 samples and then tested in n = 10,000 independent samples in each target population. The PRS-CSx package is restricted to SNPs from HM3, whereas the other alternative methods use SNPs from either HM3 or MEGA. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 3.
Performance of alternative methods on simulated data generated with different sample sizes and different genetic architectures. Data are simulated for a continuous phenotype under a strong negative selection model and three different degrees of polygenicity (top panel: 0.01, middle panel: 0.001, and bottom panel: 5 × 10⁻⁴) ... The PRS-CSx package is restricted to SNPs from HM3, whereas the other alternative methods use SNPs from either HM3 or MEGA. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 5.

... all tuning parameters and ancestries) and PROSPER for prediction of simulated data generated with different sample sizes and genetic architectures under strong negative selection and fixed common-SNP heritability. The data used to generate this figure are the same as in Figure 2. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 15.

... all tuning parameters and ancestries) and PROSPER for prediction of simulated data generated with different sample sizes and different genetic architectures under mild negative selection. The data used to generate this figure are the same as in Supplementary Figure 2. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 15.

... all tuning parameters and ancestries) and PROSPER for prediction of simulated data generated with different sample sizes and different genetic architectures under no negative selection. The data used to generate this figure are the same as in Supplementary Figure 3. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 15.

... all tuning parameters and ancestries) and PROSPER for prediction of simulated data generated with different sample sizes and genetic architectures under strong negative selection and fixed per-SNP heritability. The data used to generate this figure are the same as in Supplementary Figure 4. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 15.

... all tuning parameters and ancestries) and PROSPER for prediction of simulated data generated with different sample sizes and genetic architectures under strong negative selection and lower genetic correlation. The data used to generate this figure are the same as in Supplementary Figure 5. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 15.

... all tuning parameters and ancestries) and PROSPER for prediction of four blood lipid traits (GLGC training and UKBB tuning/validation). The data used to generate this figure are the same as in Figure 5. Bars show the adjusted R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 16.

... all tuning parameters and ancestries) and PROSPER for prediction of two anthropometric traits (AoU training and UKBB tuning/validation). The data used to generate this figure are the same as in Figure 6. Bars show the adjusted R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 16.

[Figure legend: multi-ancestry methods (weighted PRS): weighted CT, weighted LDpred2, weighted lassosum2; multi-ancestry methods (existing methods); panel b]

Supplementary Figure 4: Performance of alternative methods on simulated data generated with different sample sizes and different genetic architectures. Data ... ,000 independent samples in each target population. The PRS-CSx package is restricted to SNPs from HM3, whereas the other alternative methods use SNPs from either HM3 or MEGA. Bars show the prediction R² of each method in each dataset. Colors are described on the right side of the figure. Source data are provided in Supplementary Data 4.