Contextualizing genetic risk score for disease screening and rare variant discovery

Studies of the genetic basis of complex traits have demonstrated a substantial role for common, small-effect variant polygenic burden (PB) as well as for large-effect variants (LEV, primarily rare). We identify sufficient conditions under which GWAS-derived PB may be used for well-powered rare pathogenic variant discovery or as a sample prioritization tool for whole-genome or exome sequencing. Through extensive simulations of genetic architectures and generative models of disease liability, with parameters informed by empirical data, we quantify the power to detect, among cases, a lower PB in LEV carriers than in non-carriers. Furthermore, we uncover clinically useful conditions wherein the risk derived from the PB is comparable to the LEV-derived risk. The resulting summary-statistics-based methodology (with publicly available software, PB-LEV-SCAN) makes predictions on PB-based LEV screening for 36 complex traits, which we confirm in several disease datasets with available LEV information in the UK Biobank, with important implications for clinical decision-making.


Supplementary Notes
The case-only study design
We compared the polygenic burden (PB) between large-effect variant (LEV) carriers and non-carriers among cases. This case-only test is fundamentally different from a test of association between PB and LEV. A higher PB among non-carriers than among carriers cannot be interpreted as a negative correlation between PB and LEV in the general population; the difference arises from conditioning on case status (a collider bias). In fact, PB and LEV are uncorrelated in the general population by study design. (In a secondary analysis, we also examined the scenario in which the PB and LEV are dependent [see Results].) Generally speaking, collider bias can bias genetic associations, or associations between variables that influence study participation, and should be avoided 1,2. Spurious associations (e.g., genetic associations or associations between variables) can arise in the absence of a true correlation in the intended study population 3. In this study, however, our interest lies elsewhere (i.e., not in inference on effect size in the intended study population).

[3 or 4]). For T1D, the proportion of high-risk HLA-DRB1 is provided in parentheses. 'Variance explained' is the proportion of variance in lrgCNV carrier status (for TS and OCD) and in the number of HLA-DRB1 risk alleles (for T1D) explained by the PGS. For TS and OCD, we calculated the difference in Nagelkerke's pseudo R² between the full model and a reduced model that excluded the PGS term. For T1D, we report the R² from linear regression on the number of high-risk alleles present. 'Wilcoxon rank sum test (p-value)' provides the p-value for the (one-sided) non-parametric test of lower adjusted PGS group means in lrgCNV carriers than in non-carriers for TS and OCD. For T1D, we provide the p-value for the non-parametric test of adjusted PGS group means between HLA-DRB1 high-risk homozygotes and the combined group of heterozygotes and low-risk homozygotes.
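The one-sided rank-sum comparison described above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `rank_sum_less`, the normal approximation without tie correction, and the toy scores are all assumptions introduced here.

```python
import math
import random
from statistics import NormalDist


def rank_sum_less(x, y):
    """One-sided Wilcoxon rank-sum p-value for H1: x tends to be lower than y.

    Uses the large-sample normal approximation and no tie correction,
    so it assumes continuous scores (an assumption of this sketch).
    """
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    # Sum of the ranks (starting at 1) held by the first group.
    w = sum(rank for rank, (_, grp) in enumerate(pooled, start=1) if grp == 0)
    n1, n2 = len(x), len(y)
    u = w - n1 * (n1 + 1) / 2                      # Mann-Whitney U of group x
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return NormalDist().cdf((u - mean_u) / sd_u)   # lower tail = "x is lower"


# Toy adjusted scores among cases (hypothetical numbers, not study data).
random.seed(0)
carrier_scores = [random.gauss(-1.0, 1.0) for _ in range(40)]
noncarrier_scores = [random.gauss(0.0, 1.0) for _ in range(400)]
p_value = rank_sum_less(carrier_scores, noncarrier_scores)
print(f"one-sided p-value: {p_value:.3g}")
```

A small p-value here supports lower scores in carriers than in non-carriers among cases; as noted in the text, it says nothing about the correlation in the general population.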
Under the liability-threshold model and a polygenic genetic architecture, we assumed that 1% of the samples (Ntotal = 10,000) are cases. For the cases, the LEV burden (R) and the PB (A) are displayed on the x-axis and y-axis, respectively. Panels (a) and (b) assume one LEV and ten independent LEVs (MAF = 0.01), respectively. We assumed two subtypes, major and minor. The heterogeneous effect (λ; see Methods) in subjects with the minor subtype and the proportion of cases with the minor subtype are reflected in the rows and columns, respectively. A linear regression line (red) was fitted for each sub-panel, with the 95% confidence interval shown. Source data are provided as a Source Data file.
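A minimal sketch of the liability-threshold set-up in this caption, assuming a single LEV and no subtype heterogeneity (λ is not modelled). The heritability values, the larger sample size, and all variable names are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 200_000        # assumption: larger than the caption's 10,000 for stable toy summaries
prevalence = 0.01  # 1% of samples are cases, as in the caption
maf = 0.01         # LEV allele frequency, as in the caption
h2_pb = 0.3        # assumption: liability variance explained by the PB
h2_lev = 0.03      # assumption: liability variance explained by the LEV

# LEV burden R: centred genotype times an effect size chosen to give h2_lev.
genotype = rng.binomial(2, maf, size=n)
beta = np.sqrt(h2_lev / (2 * maf * (1 - maf)))
lev_burden = beta * (genotype - 2 * maf)

# Polygenic burden A and residual E, drawn independently of the LEV.
pb = rng.normal(0.0, np.sqrt(h2_pb), size=n)
residual = rng.normal(0.0, np.sqrt(1 - h2_pb - h2_lev), size=n)

# Cases are the top `prevalence` fraction of the liability distribution.
liability = pb + lev_burden + residual
case = liability > np.quantile(liability, 1 - prevalence)
carrier = genotype > 0

corr_population = np.corrcoef(pb, lev_burden)[0, 1]
mean_pb_carrier_cases = pb[case & carrier].mean()
mean_pb_noncarrier_cases = pb[case & ~carrier].mean()

print(f"population correlation of PB and LEV burden: {corr_population:.4f}")
print(f"mean PB in carrier cases:     {mean_pb_carrier_cases:.3f}")
print(f"mean PB in non-carrier cases: {mean_pb_noncarrier_cases:.3f}")
```

In this toy set-up the PB and the LEV burden are independent in the full sample by construction, so any difference in mean PB between carrier and non-carrier cases comes only from selecting on case status.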

Supplementary Figure 4. Odds ratio (OR) comparison between the LEV and the PB under a genetic architecture consistent with negative selection and liability-threshold model of disease risk.
Similar to Figure 6, we calculated the OR for LEV and PB while varying the parameters. The point estimates and the 95% confidence intervals (CI) are shown as dots and horizontal lines, respectively. The vertical broken line at OR = 1 shows the null. In panel (a), we fixed the heritability of the LEV, the allele frequency of the LEV, and the prevalence (K) at 0.03, 0.005, and 0.01, respectively, while varying the common SNP-based heritability. In panel (b), we fixed the common SNP-based heritability, the allele frequency of the LEV, and K at 0.3, 0.005, and 0.01, respectively, while varying the heritability of the LEV. In panel (c), we fixed the common SNP-based heritability, the heritability of the LEV, and K at 0.3, 0.03, and 0.01, respectively, while varying the allele frequency of the LEV. In panel (d), we fixed the common SNP-based heritability, the heritability of the LEV, and the allele frequency of the LEV at 0.3, 0.03, and 0.005, respectively, while varying K. Source data are provided as a Source Data file.
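As a hedged illustration of how an OR point estimate and its 95% CI can be obtained from a 2×2 table, the sketch below uses the standard Wald interval on the log-OR scale. The caption does not state which estimator the authors used, so the function `odds_ratio_ci` and the counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio with a Wald confidence interval from a 2x2 table.

    a: exposed cases      b: exposed controls
    c: unexposed cases    d: unexposed controls
    "Exposed" could mean LEV carrier, or a PB above a chosen cutoff
    (the cutoff itself is an assumption of this sketch).
    """
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # SE of the log odds ratio
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts, for illustration only.
estimate, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval that includes 1 corresponds to a horizontal line crossing the vertical broken line in the figure.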

Supplementary Figure 6. Comparison of the odds ratio (OR) of the PB between LEV carriers and non-carriers while varying the proportion of non-causal variants (π0).
In these simulations, we fixed the heritability of the LEV, the allele frequency of the LEV, and the prevalence (K) at 0.03, 0.005, and 0.01, respectively, while varying π0 and the common SNP-based heritability (h²). The OR of the PB was estimated under the liability-threshold model while varying the proportion of non-causal variants (π0) from 0 to 0.9 (panels a to e). In each panel, we also varied the common SNP-based heritability (h²) from 0.1 to 0.5. The OR of the PB was estimated in LEV carriers and in non-carriers separately. In simulations, we assumed 10,000 samples. The median of the OR across simulations is shown as a dot, while the 2.5th and 97.5th percentiles of the OR across simulations are represented by the horizontal segments. Source data are provided as a Source Data file.

Supplementary Figure 7. Change in the OR of PB (per sd change) with respect to change in parameter differs between LEV carriers and non-carriers.
The OR of the PB was calculated under a genetic architecture consistent with negative selection while varying the common SNP-based heritability, the heritability of the LEV, the allele frequency of the LEV, and the prevalence (K). In simulations, we assumed 10,000 samples. The median of the OR across simulations is shown as a dot, while the 2.5th and 97.5th percentiles of the OR across simulations are represented by the horizontal segments. The results under the polygenic model are presented here, as the same pattern was found for the other genetic architecture models. In panel (a), we fixed the heritability of the LEV, the allele frequency of the LEV, and K at 0.03, 0.005, and 0.01, respectively, while varying the common SNP-based heritability. In panel (b), we fixed the common SNP-based heritability, the allele frequency of the LEV, and K at 0.3, 0.005, and 0.01, respectively, while varying the heritability of the LEV. In panel (c), we fixed the common SNP-based heritability, the heritability of the LEV, and K at 0.3, 0.03, and 0.01, respectively, while varying the allele frequency of the LEV. In panel (d), we fixed the common SNP-based heritability, the heritability of the LEV, and the allele frequency of the LEV at 0.3, 0.03, and 0.005, respectively, while varying K. Among the parameters tested here, h² is the most important determinant of how differently the OR of the PB changes between carriers and non-carriers, as can be seen from the "slope" at each point. Source data are provided as a Source Data file.
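The stratified, per-SD OR summary described in this caption can be sketched as below. This is a simplified illustration under stated assumptions: a normally distributed PB rather than the negative-selection architecture, a plain Newton-Raphson logistic fit, 20 replicates, and 200,000 samples per replicate (not the 10,000 used in the paper) so that the carrier stratum is large enough for a stable fit. The helper names `simulate` and `per_sd_odds_ratio` are hypothetical.

```python
import numpy as np


def simulate(n, h2_common, h2_lev, freq, prevalence, rng):
    """Liability-threshold toy model; returns standardized PB, carrier flag, case status."""
    genotype = rng.binomial(2, freq, size=n)
    beta = np.sqrt(h2_lev / (2 * freq * (1 - freq)))
    pb = rng.normal(0.0, np.sqrt(h2_common), size=n)
    residual = rng.normal(0.0, np.sqrt(1 - h2_common - h2_lev), size=n)
    liability = pb + beta * (genotype - 2 * freq) + residual
    case = liability > np.quantile(liability, 1 - prevalence)
    z = (pb - pb.mean()) / pb.std()  # per-SD unit defined on the whole sample
    return z, genotype > 0, case.astype(float)


def per_sd_odds_ratio(z, y, n_iter=25):
    """Logistic regression of case status on PB via Newton-Raphson; returns exp(slope)."""
    X = np.column_stack([np.ones_like(z), z])
    w = np.array([np.log(y.mean() / (1 - y.mean())), 0.0])  # start at the null model
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        gradient = X.T @ (y - p)
        hessian = (X * (p * (1 - p))[:, None]).T @ X
        w = w + np.linalg.solve(hessian, gradient)
    return float(np.exp(w[1]))


rng = np.random.default_rng(1)
ors_carrier, ors_noncarrier = [], []
for _ in range(20):
    z, carrier, case = simulate(200_000, 0.3, 0.03, 0.005, 0.01, rng)
    ors_carrier.append(per_sd_odds_ratio(z[carrier], case[carrier]))
    ors_noncarrier.append(per_sd_odds_ratio(z[~carrier], case[~carrier]))

# Median and 2.5th / 97.5th percentiles across replicates, as plotted in the figure.
summary_carrier = np.percentile(ors_carrier, [2.5, 50, 97.5])
summary_noncarrier = np.percentile(ors_noncarrier, [2.5, 50, 97.5])
print("carriers     (2.5th, median, 97.5th):", np.round(summary_carrier, 2))
print("non-carriers (2.5th, median, 97.5th):", np.round(summary_noncarrier, 2))
```

Repeating the loop over a grid of one parameter while holding the others fixed would give one panel of the kind described above.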

Supplementary Figure 8. Change in the OR of PB (per sd change) with respect to change in parameter differs between LEV carriers and non-carriers.
The OR of the PB was calculated under the LD-adjusted kinship-based genetic architecture while varying the common SNP-based heritability, the heritability of the LEV, the allele frequency of the LEV, and the prevalence (K). In simulations, we assumed 10,000 samples. The median of the OR across simulations is shown as a dot, while the 2.5th and 97.5th percentiles of the OR across simulations are represented by the horizontal segments. The results under the polygenic model are presented here, as the same pattern was found for the other genetic architecture models. In panel (a), we fixed the heritability of the LEV, the allele frequency of the LEV, and K at 0.03, 0.005, and 0.01, respectively, while varying the common SNP-based heritability. In panel (b), we fixed the common SNP-based heritability, the allele frequency of the LEV, and K at 0.3, 0.005, and 0.01, respectively, while varying the heritability of the LEV. In panel (c), we fixed the common SNP-based heritability, the heritability of the LEV, and K at 0.3, 0.03, and 0.01, respectively, while varying the allele frequency of the LEV. In panel (d), we fixed the common SNP-based heritability, the heritability of the LEV, and the allele frequency of the LEV at 0.3, 0.03, and 0.005, respectively, while varying K. Among the parameters tested here, h² is the most important determinant of how differently the OR of the PB changes between carriers and non-carriers, as can be seen from the "slope" at each point. Source data are provided as a Source Data file.

Supplementary Figure 9. Utility of PB-LEV correlation under logit risk model.
We simulated the effect sizes for common and rare variants and calculated the probability of disease for each individual using a logit link function. Cases were drawn stochastically from the binomial distribution using this probability. Across the columns, we varied, in turn, the heritability attributable to the common-variant polygenic component, the heritability captured by the LEV, and its allele frequency, while fixing all other parameters. Across the rows, we simulated the effect sizes for common variants under different genetic architectures. A one-sided two-sample Wilcoxon test was performed among cases to test whether the PB was lower in LEV carriers than in non-carriers. The utility was calculated as the proportion of significant (P < 0.05) test results in 500 simulations with different seeds for sampling causal variants and subjects. The broken line at 80% marks a reasonable utility threshold. Source data are provided as a Source Data file.

Supplementary Figure 15. Prediction of the framework on proportion of LEV carriers in cases with a given PB profile matches empirical observations, providing an LEV screening approach (logit risk model).
Using empirically derived parameters (including the common variant-based heritability, disease prevalence, allele frequency, and odds ratio of the LEV) estimated from UK Biobank, we performed simulations for four traits under the genetic architecture in line with negative selection. We varied π0 (the proportion of non-causal variants in the polygenic risk score) at (a) 0, (b) 0.5, and (c) 0.8. Cases were defined under the liability-threshold model and classified into 'low-PB' and 'high-PB' groups (using the median as the cutoff). The proportion of LEV carriers was defined as the number of cases carrying an LEV over the total number of cases in each group. The distribution of the proportion (among 500 simulated sets) is shown in the boxplot. The median of the proportion is shown as a black segment in the middle of the box. The lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles). The upper and lower whiskers extend from the hinges to the largest and smallest values, respectively, that lie no further than 1.5 × IQR from the hinge (where IQR is the inter-quartile range, or the distance between the first and third quartiles). The observed proportion of LEV carriers for each PB profile from empirical data (with the same parameters as the simulations) is marked as a red diamond. Thus, the prediction of the framework and the empirical dataset were concordant. Source data are provided as a Source Data file.

Supplementary Figure 16. Cases with low polygenic risk score have higher probability of carrying an LEV.
Based on empirically derived parameters from the UK Biobank, we performed simulations (under the liability-threshold model and the genetic architecture in line with negative selection) and compared the number of LEV carriers per 1000 cases across polygenic risk score bins for 31 UK Biobank traits (additional to Figure 7). For each trait, we grouped the cases into 10 equally sized polygenic risk score bins. For each bin, the mean ± 1 sd of the number of LEV carriers per 1000 cases is displayed as blue circles and bars. The distribution of the polygenic risk score is shown as a histogram. Thus, the framework (assuming prespecified simulation parameters) provides testable predictions on the number of LEV carriers per 1000 cases with a given PB profile and on the sample polygenic risk score profile to optimize LEV screening. Detailed results can be found in Supplementary Data 2.
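The binning of cases by polygenic risk score and the median split into 'low-PB' and 'high-PB' groups can be sketched as follows. This is an illustration under simplifying assumptions, not the PB-LEV-SCAN implementation: the score equals the true PB (π0 = 0), the PB is normally distributed, and the parameter values, sample size, and the helper `carriers_per_1000_by_bin` are hypothetical.

```python
import numpy as np


def carriers_per_1000_by_bin(score, carrier, n_bins=10):
    """Split cases into equally sized score bins (lowest first) and
    return the number of LEV carriers per 1000 cases in each bin."""
    order = np.argsort(score)
    return np.array([1000.0 * carrier[idx].mean()
                     for idx in np.array_split(order, n_bins)])


rng = np.random.default_rng(2)
n, prevalence, freq = 200_000, 0.01, 0.01   # illustrative values (assumption)
h2_common, h2_lev = 0.3, 0.03               # illustrative values (assumption)

genotype = rng.binomial(2, freq, size=n)
beta = np.sqrt(h2_lev / (2 * freq * (1 - freq)))
pb = rng.normal(0.0, np.sqrt(h2_common), size=n)
residual = rng.normal(0.0, np.sqrt(1 - h2_common - h2_lev), size=n)
liability = pb + beta * (genotype - 2 * freq) + residual
case = liability > np.quantile(liability, 1 - prevalence)

# Restrict to cases, then summarise carriers by score bin and by median split.
case_pb = pb[case]
case_carrier = genotype[case] > 0
per_decile = carriers_per_1000_by_bin(case_pb, case_carrier)

low_pb = case_pb < np.median(case_pb)
prop_low = case_carrier[low_pb].mean()
prop_high = case_carrier[~low_pb].mean()

print("LEV carriers per 1000 cases, lowest to highest score bin:")
print(np.round(per_decile, 1))
print(f"carrier proportion, low-PB cases:  {prop_low:.3f}")
print(f"carrier proportion, high-PB cases: {prop_high:.3f}")
```

Repeating this over many simulated sets and taking the mean and sd per bin, or a boxplot of the low-PB and high-PB proportions, would give summaries of the kind shown in the two figures.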