Assessment of polygenic architecture and risk prediction based on common variants across fourteen cancers

Genome-wide association studies (GWAS) have led to the identification of hundreds of susceptibility loci across cancers, but the impact of further studies remains uncertain. Here we analyse summary-level data from GWAS of European ancestry across fourteen cancer sites to estimate the number of common susceptibility variants (polygenicity) and underlying effect-size distribution. All cancers show a high degree of polygenicity, involving, at a minimum, thousands of loci. We project that sample sizes required to explain 80% of GWAS heritability vary from 60,000 cases for testicular to over 1,000,000 cases for lung cancer. The maximum relative risk achievable for subjects at the 99th risk percentile of underlying polygenic risk scores (PRS), compared to average risk, ranges from 12 for testicular to 2.5 for ovarian cancer. We show that PRS have potential for risk stratification for cancers of breast, colon and prostate, but less so for others because of modest heritability and lower incidence. In cancer, many gene variants may contribute to disease etiology, but the effect size of a given variant may vary. Here, the authors analyse summary statistics of genome-wide association studies from fourteen cancers, and show that the utility of polygenic risk scores may vary depending on cancer type.

Genome-wide association studies (GWASs) have led to the identification of hundreds of independent cancer susceptibility loci containing common, low-risk variants 1,2. The number of discoveries varies widely across cancers, largely driven by available sample size, which reflects, in part, disease incidence in the general population. However, specific cancers, e.g., chronic lymphocytic leukemia (CLL) 3 and testicular cancer 4, are notable for unexpectedly high numbers of genome-wide significant discoveries from GWASs of relatively small sample size. Previous studies have also reported that these two cancers have high heritability 5. Across cancer types, polygenic risk scores (PRSs) show varying levels of risk stratification depending on the heritability explained by the identified variants and the disease incidence rates in the population [6][7][8][9][10][11][12]. Their potential clinical utility depends not only on the level of risk stratification but also on other factors, such as the availability of appropriate risk-reducing interventions for those identified as at high risk.
Estimates of heritability due to additive effects of all single-nucleotide polymorphisms (SNPs) included in GWAS arrays 13, referred to as GWAS heritability in this article, have shown that common variants have substantial potential to identify individuals at different levels of risk for many cancer types 14. It remains unclear, however, how large GWAS sample sizes need to be to reap the full potential of PRS-based risk prediction. Herein we apply our recently published method 15 to estimate the degree of polygenicity and the effect-size distribution associated with common variants (minor allele frequency (MAF) > 0.05) across 14 different cancer types, based on summary-level association statistics from available GWASs [16][17][18][19][20][21][22][23][24][25][26][27][28] from populations of European ancestry (Supplementary Table 1). From these inferred parameters, we then provide projections of the expected number of common variants to be discovered and the predictive performance of the associated PRS as a function of increasing sample size for future GWASs. Finally, by incorporating age-specific incidence 29 from population-based cancer registries, we explore the magnitude of absolute risk stratification potentially achievable by PRS.

Results
Cancer polygenicity. We found that cancers are highly polygenic, like other complex traits 15,30,31. Estimates of the number of susceptibility variants with independent risk associations vary from ~1000 to 7500 across the 14 cancer sites (Table 1). For comparability, effect-size distributions are shown in groups of similarly sized GWASs with similar power for detecting associations (Fig. 1). For GWASs with <10,000 cancer cases (group 1), CLL and testicular cancer are each associated with 2000-2500 variants and are characterized by a much larger proportion of variants with larger estimated effect sizes than the other group 1 cancers, as reflected by wider effect-size distributions with heavier tails (Fig. 1, Table 1). GWAS heritability estimates indicate that, in aggregate, common variants explain a high degree of variation of risk for these two cancers. In contrast, esophageal and oropharyngeal cancers, also in group 1, are associated with a larger proportion of variants with substantially smaller effect sizes.
For GWASs with 10,000-25,000 cases (group 2), melanoma is noteworthy because it is associated with a wider effect-size distribution than other group 2 cancers. The estimated number of susceptibility variants in this group ranges from 1000 to 2000. GWAS heritability estimates indicate that aggregated common variants make a relatively small contribution to ovarian and endometrial cancer susceptibility. Finally, for the three GWASs with >25,000 cases each (group 3), prostate cancer is remarkable for having more variants with large effect sizes; that is, the underlying effect-size distribution has a heavier tail compared with cancers of the breast and lung (Fig. 1). In this group, all three cancer types tend to have large numbers of associated variants (>4500) compared with cancer sites in other groups, but this pattern could partially be due to the very large sample sizes of group 3 GWASs 15.
For a large majority of the 14 cancer sites, a two-component normal-mixture model for non-null effects provides a substantially better fit to observed summary statistics than a single normal distribution; this indicates the presence of a fraction of variants with distinctly larger effect sizes than the remainder (Supplementary Figs. 1 and 2). In contrast, a single normal distribution appears to be adequate for esophageal and oropharyngeal cancer, indicating the presence of a large number of variants with a continuum of small effects, similar to our previous findings for traits related to mental health and abilities 15. Across all 14 cancers, the predicted number of discoveries and their associated genetic variance explained for current GWAS sample sizes match well to those observed empirically (Supplementary Table 2), indicating good fit of our model to the observed data.
Future GWAS projections. GWAS heritability estimates indicate that the potential of PRS for risk discrimination in the population varies widely among cancer types (Table 1). The area under the curve (AUC) statistic associated with the best achievable PRS varies from 64% (endometrial and ovarian cancer) to 88% (testicular cancer) and is in the range of 70-80% for most cancers. The percentage of GWAS heritability explained by known variants varies widely, depending on study sample size and the underlying trait genetic architecture (Fig. 2). Known variants explain more than a quarter of heritability for cancer sites based on very large sample sizes (e.g., breast and prostate cancer) or for cancer sites that have susceptibility variants with relatively large effect sizes (e.g., CLL, melanoma, and testicular cancer). Oropharyngeal cancer, in contrast, has both a small sample size and small effect sizes; its percentage heritability currently explained is almost zero.
The sample size needed to identify common variants that could explain approximately 80% of the total GWAS heritability for the cancers evaluated is generally very large, requiring 200,000-1,000,000 cancer cases, with a comparable number of controls (Fig. 2). However, for three sites, namely, testicular cancer, CLL, and melanoma, the required sample size is smaller (60,000, 80,000, and 110,000 cases, respectively) due to the large effect sizes of their associated variants. By quadrupling the sample sizes of currently published GWASs, the percentage of GWAS heritability explained would rise to >40% across all cancers, except for oropharyngeal cancer. Such sample size increases would also lead to appreciable improvements in PRS discriminatory power across all these sites (Figs. 3 and 4). For cancers that were found to be the most polygenic and that had small effect sizes (e.g., cancers of breast, lung, and oropharynx), improvement would occur at a slower rate as sample sizes increase, and these sites would require the largest sample sizes to generate PRSs with discriminatory power close to theoretical limits. Of note, for a number of cancers, the achievable relative risks for subjects at the 99th percentile of the PRS distribution, compared with those at average risk, are comparable to those for monogenic disorders 32 (e.g., relative risk >3- to 4-fold) (Fig. 4). Across all 14 cancer types, inclusion of SNPs using more liberal but optimized p value thresholds (see "Methods") would improve performance of PRS-based risk prediction versus using the stringent genome-wide significance level, but the anticipated gains would be generally modest (Supplementary Figs. 3 and 4).
Projections of residual lifetime cancer risks for the US non-Hispanic white population show that the discriminatory power of PRS built from current or foreseeable studies will depend heavily on the underlying cancer incidence in the population (Fig. 5, Supplementary Figs. 5-7). The potential clinical utility of PRS depends on the degree of risk stratification and specific prevention or early detection strategies for a given cancer, should they exist. For common cancers, such as breast, colorectal, and prostate, a PRS with even modest discriminatory power (maximum AUC of approximately 70%, Fig. 3) can provide substantial stratification of absolute risk in the population. In contrast, for CLL and testicular cancer, even though their PRSs could achieve a higher AUC (e.g., in the range 80-90%, Fig. 3), the degree of absolute risk stratification will be modest because of the infrequency of these cancers. Thus a PRS by itself has the least impact on risk stratification for cancer sites that are infrequent and/or that have low heritability. However, it is possible that PRS could have clinical utility for some of these cancers in the presence of, or in combination with, other risk factors and biomarkers. For example, a PRS for lung cancer may provide larger stratification of absolute risk among smokers than among never smokers because of the higher baseline risk in smokers.

Discussion
Our study is subject to several limitations. We may have underestimated the number of underlying common susceptibility loci, especially for those cancers for which current GWAS have small sample sizes 15. Thus the interpretation of comparisons of the underlying genetic architecture across cancer types with very different sample sizes requires caution. Nevertheless, the major patterns are unlikely to be due to differences in sample size. For example, we estimated oropharyngeal and esophageal cancers to be two of the most polygenic sites, though the GWAS sample sizes for these two sites were relatively small. Further, Q-Q plots of observed and expected p values indicate that the inferred models for effect-size distributions explain observed GWAS summary statistics well, regardless of GWAS sample size.
Another important limitation is that we only included data from subjects of European ancestry, since GWAS data for other ancestries are currently too small to permit reliable projections for most cancer sites. In addition, several cancers (e.g., lung, ovary, glioma, and breast) consist of etiologically heterogeneous subtypes that were not considered in our analyses due to lack of adequate sample sizes for appropriate subtypes for most of these cancer sites. Further studies of ancestry- and subtype-specific genetic architectures are needed to address these limitations. In our projections, we assume standard agnostic association analysis of SNPs without incorporating any external information on population genetics or functional characteristics of SNPs. It is, however, possible to incorporate various types of external information to improve power for discovery of associations [33][34][35][36] and genetic risk prediction 37. We have evaluated the merit of future GWAS only in terms of their ability to explain heritability and improve risk prediction. However, current and future discoveries have other major implications, including providing insights into biological pathways and mechanisms, potential gene-environment interactions, and understanding causal relationships through Mendelian randomization analyses 38. A number of these cancers are known to have rare high-penetrance risk variants, but for this study we have focused on estimating the effect-size distribution associated with common variants. Furthermore, heritability analyses indicate that uncommon and rare variants could explain a substantial fraction of the variation of complex traits 39, and thus it is likely that there are many unknown uncommon and rare variants associated with these cancers as well.
In the future, characterization of heritability and effect-size distribution associated with the full spectrum of allele frequencies will require individual-level sequencing data on a substantially larger number of cases and controls. The observed differences in the underlying genetic architecture of susceptibility across cancers could be due to various factors, including the effect of negative selection 30,40, tissue-specific genetic regulation of gene expression 41, cell of origin 42, the number of biological steps needed to transition from normal to malignant tissue 43, mediation of genetic effects by underlying environmental exposures 44, and the presence of heterogeneous cancer-specific subtypes 21,25,27,28. A number of cancer types, including those of lung, oropharynx, and esophagus, which were associated with large numbers of SNPs with small average effect sizes, have known strong environmental risk factors and distinct etiologic subtypes. It is also noteworthy that testicular cancer stands out for a large number of discoveries in cross-tissue expression quantitative trait loci analyses, likely indicating a stronger effect of SNPs on gene expression levels for this tissue compared to others 41.
[Figure caption fragment] Colored dots correspond to sample sizes for the largest published GWAS and those for doubled and quadrupled sizes. For oropharyngeal cancer, the projections at the "current sample size" are based on a sample size of 25K cases and 25K controls. For breast and esophageal cancer, the projections at the "current sample size" are based on the current largest GWAS sample sizes: 123K cases and 106K controls and 10K cases and 17K controls, respectively. For all other cancer sites, the projections at the "current sample size" are based on the GWAS sample sizes in Supplementary Table 1. CLL chronic lymphocytic leukemia.
NATURE COMMUNICATIONS | (2020) 11:3353 | https://doi.org/10.1038/s41467-020-16483-3 | www.nature.com/naturecommunications
In conclusion, our comprehensive analysis of 14 cancer sites in adults of European ancestry reveals that, while all sites have polygenic influences, there is substantial diversity observed in their underlying genetic architectures, which reflects important biology and also influences the utility of polygenic risk prediction for individual cancers. Our projections for future yields of GWAS across these cancers provide a roadmap for important returns from future investment in research, including the potential clinical utility of polygenic risk prediction for stratification of absolute risks in the population.

Methods
Description of GWAS studies. We analyzed summary data from GWAS studies across 14 cancer types. For select cancer sites 26,28 , we downloaded publicly available genome-wide summary-level statistics from the latest consortium-based analyses. For others, we obtained access to data through collaborative efforts with individual consortia. Details about individual studies, including the number of cases and controls, are provided in Supplementary Table 1.
Linkage disequilibrium (LD) reference panel selection. We consider a reference panel with ~1.07 million SNPs that are included in HapMap3 and that had MAF > 0.05 in the 1000 Genomes European ancestry sample. Based on known LD among common variants, we expect this set of variants to provide high coverage of all common variants in populations of European ancestry, and thus loss of information due to imperfect tagging of causal variants to be fairly minimal.
Fig. 4 Projections of relative risks for individuals at or higher than the 99th percentile of PRS as sample size for GWAS increases. x-axis: total sample size assuming a 1:1 case:control ratio (in thousands); y-axis: relative risk for people at the 99th centile compared to the average risk of the population, presented on a log10 scale. Results are shown where PRS is built based on SNPs at an optimized p value threshold. The dotted horizontal red line indicates the maximum relative risk achievable according to the estimate of GWAS heritability. Colored dots correspond to sample sizes for the largest published GWAS and those for doubled and quadrupled sizes. For oropharyngeal cancer, the projections at the "current sample size" are based on a sample size of 25K cases and 25K controls. For breast and esophageal cancer, the projections at the "current sample size" are based on the current largest GWAS sample sizes: 123K cases and 106K controls and 10K cases and 17K controls, respectively. For all other cancer sites, the projections at the "current sample size" are based on the GWAS sample sizes in Supplementary Table 1. CLL chronic lymphocytic leukemia.
Quality control for summary GWAS data. Across all cancers, we applied several filtering steps analogous to those used earlier for estimation of heritability 45,46 and effect-size distribution using summary-level data 15. First, we restricted analysis to SNPs within a reference set of ~1.07 million SNPs that are included in HapMap3 and that had MAF > 0.05 in the 1000 Genomes European ancestry sample. Second, we excluded SNPs having substantial amounts of missing genotype data: sample sizes <0.67 times the 90th percentile of the distribution of sample sizes across all SNPs.
Third, we excluded SNPs within the major histocompatibility complex region (i.e., SNPs between 26,000,000 and 34,000,000 base pairs on chromosome 6), which is known to have very complex allelic architecture and can have uncharacteristically large effects on some traits. Fourth, we removed regions that have SNPs with extremely large effect sizes to reduce their possible undue influence on estimation of parameters associated with the overall effect-size distributions. Using PLINK --clump, we identified all top SNPs with associated chi-square statistics >80 (i.e., odds ratio (in standardized scale) >2.19) and removed all SNPs that were within 1 Mb of, or had an estimated squared LD >0.1 with, those top SNPs. We added back the contribution of these top independent SNPs in the final reporting of the total number of susceptibility SNPs, estimates of total heritability, and various projections we made as a function of sample size of the GWAS.
Statistical model. We inferred the common-variant genetic architecture of the different cancers using GENESIS 15, a method we recently developed to characterize underlying effect-size distributions in terms of the total number of susceptibility SNPs (polygenicity) and a normal-mixture model for the distribution of their effects. Specifically, it is assumed that standardized effects of common SNPs in an underlying logistic regression model on the risk of a cancer follow a mixture distribution of the form $\beta_m \sim (1 - \pi_c)\,\delta_0 + \pi_c\,N(0, \sigma^2)$ (two-component model) or $\beta_m \sim (1 - \pi_c)\,\delta_0 + \pi_c\,\{p_1 N(0, \sigma_1^2) + p_2 N(0, \sigma_2^2)\}$ (three-component model), where $\delta_0$ is the Dirac delta function indicating that a fraction, $1 - \pi_c$, of the SNPs have null effects and the remaining fraction, $\pi_c$, have non-null effects. Under the three-component model, $p_2 = 1 - p_1$ denotes the proportion of non-null SNPs allocated to the mixture component with the larger variance (assuming $\sigma_2^2 > \sigma_1^2$).
Under these models, $M\pi_c$ characterizes the degree of polygenicity, i.e., the number of susceptibility SNPs with independent effects on disease risk. Under both models, we defined the "GWAS heritability" of a disease as $h^2 = M\pi_c E(\beta^2)$, where $E(\beta^2)$ denotes the average effect-size variance of the non-null SNPs. We note that, under the above model, $h^2$ is also the population variance of the underlying "true" PRS, defined as $\mathrm{PRS} = \sum_{m=1}^{M} \beta_m G_m$, where $G_m$ denotes the standardized genotype associated with the $m$th SNP. Under the two-component model, which assumes a single normal distribution for the effects of all susceptibility SNPs, we have $E(\beta^2) = \sigma^2$. Under the three-component model, which allows a mixture of two normal distributions with distinct variance components and thus can better accommodate the presence of a group of susceptibility SNPs with much larger effects than others, we have $E(\beta^2) = p_1\sigma_1^2 + p_2\sigma_2^2$. Under the three-component model, we use the fraction $\upsilon = p_2\sigma_2^2 / (p_1\sigma_1^2 + p_2\sigma_2^2)$ to characterize the proportion of heritability explained by SNPs associated with the larger variance component. As we removed SNPs with extremely large effects ($\chi_i^2 > 80$) and the associated regions from the analysis, in reporting the final heritability estimates we added back the contribution of the independent top SNPs from these excluded regions as $\sum_i (\hat{\beta}_i^2 - \tau_i^2)$, where $\hat{\beta}_i$ is the estimate of the log odds ratio (in standardized scale) and $\tau_i$ is the corresponding standard error for the $i$th SNP.
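To make the heritability definitions above concrete, the following minimal Python sketch evaluates $h^2 = M\pi_c E(\beta^2)$ and the fraction $\upsilon$ under the three-component model. The function names and all parameter values are hypothetical and chosen only for illustration; they are not estimates from this study and are not part of the GENESIS software.

```python
def gwas_heritability(M, pi_c, p1, var1, var2):
    """GWAS heritability h^2 = M * pi_c * E(beta^2).

    Under the three-component model E(beta^2) = p1*var1 + p2*var2 with
    p2 = 1 - p1. Setting var1 == var2 recovers the two-component model,
    for which E(beta^2) equals the single variance sigma^2.
    """
    p2 = 1.0 - p1
    mean_squared_effect = p1 * var1 + p2 * var2
    return M * pi_c * mean_squared_effect


def large_component_fraction(p1, var1, var2):
    """Fraction of heritability due to the larger-variance component."""
    p2 = 1.0 - p1
    return p2 * var2 / (p1 * var1 + p2 * var2)


# Hypothetical inputs (illustrative only, not results from the paper).
M = 1_070_000   # assumed number of SNPs in the reference panel
pi_c = 0.002    # assumed proportion of SNPs with non-null effects
p1 = 0.9        # assumed share of non-null SNPs in the small-variance component
var1 = 5e-5     # assumed sigma_1^2
var2 = 5e-4     # assumed sigma_2^2 (must exceed sigma_1^2)

n_susceptibility_snps = M * pi_c  # degree of polygenicity, M * pi_c
h2 = gwas_heritability(M, pi_c, p1, var1, var2)
upsilon = large_component_fraction(p1, var1, var2)
```

With these made-up inputs, one tenth of the susceptibility SNPs carry slightly more than half of the heritability, which is the kind of heavy-tailed pattern the three-component model is meant to capture.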
Genetic variance projection. Given the estimated effect-size distribution, we calculated the expected number of discoveries (ED) and the genetic variance explained at significance level $\alpha$ and sample size $n$ from the power $\mathrm{pow}_{\alpha,n}(\beta)$ of detecting a SNP with effect size $\beta$, with $\Phi(\cdot)$ the standard normal cumulative distribution function and $c_\alpha = \Phi^{-1}(1 - \alpha)$ the $(1 - \alpha)$th quantile of the standard normal distribution. Similar to the heritability calculations, we added back the contributions of independent top SNPs with very large effects to the number of expected discoveries and the associated variance explained through the quantities $\sum_i \mathrm{pow}_{\alpha,n}(\hat{\beta}_i)$ and $h^{-2} \sum_i (\hat{\beta}_i^2 - \tau_i^2)\,\mathrm{pow}_{\alpha,n}(\hat{\beta}_i)$. We observed that, for projections involving sample sizes larger than that of the current study, $\mathrm{pow}_{\alpha,n}(\hat{\beta}_i)$ for the large-effect SNPs will all be very close to 1.0.
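The projection described above can be sketched numerically. The Python sketch below assumes the two-component model, so that non-null effects follow $N(0, \sigma^2)$, and it assumes a standard two-sided z-test power function, $1 - \Phi(c - \sqrt{n}\beta) + \Phi(-c - \sqrt{n}\beta)$ with $c = \Phi^{-1}(1 - \alpha/2)$. The exact power expression used in the paper is not reproduced in this text, so that form, the treatment of $n$ as an effective sample size, the function names, and all parameter values are illustrative assumptions rather than the authors' implementation.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(beta, n, alpha):
    """Assumed two-sided z-test power for a standardized effect beta."""
    c = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = math.sqrt(n) * beta
    return 1.0 - STD_NORMAL.cdf(c - shift) + STD_NORMAL.cdf(-c - shift)


def project(M, pi_c, sigma2, n, alpha=5e-8, grid=4001, width=8.0):
    """Return (expected discoveries, fraction of h^2 explained).

    Integrates power against the N(0, sigma2) effect-size density with the
    trapezoidal rule on a grid spanning +/- `width` standard deviations.
    """
    sd = math.sqrt(sigma2)
    effect_dist = NormalDist(0.0, sd)
    step = 2.0 * width * sd / (grid - 1)
    discovery_integral = 0.0
    variance_integral = 0.0
    for k in range(grid):
        beta = -width * sd + k * step
        weight = step * (0.5 if k in (0, grid - 1) else 1.0)
        density = effect_dist.pdf(beta)
        pw = power(beta, n, alpha)
        discovery_integral += weight * pw * density
        variance_integral += weight * beta * beta * pw * density
    expected_discoveries = M * pi_c * discovery_integral
    # M * pi_c cancels when dividing by h^2 = M * pi_c * sigma2.
    fraction_explained = variance_integral / sigma2
    return expected_discoveries, fraction_explained


# Hypothetical architecture: 2000 susceptibility SNPs, sigma^2 = 1e-4.
ed_small, frac_small = project(1_000_000, 0.002, 1e-4, n=20_000)
ed_large, frac_large = project(1_000_000, 0.002, 1e-4, n=200_000)
```

Comparing the two calls shows the qualitative behaviour discussed in the Results: a tenfold larger study detects more of the susceptibility SNPs and captures a larger share of the heritability, but neither quantity reaches its ceiling while many effects remain too small to detect.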
Projection for AUC and relative risk at top 1%. As we quantify heritability in terms of the variability of the underlying "true" PRS, we used the formula 12 $\mathrm{AUC} = \Phi(\sqrt{h^2/2})$ to characterize the best discriminatory power achievable in the limit using a common-variant PRS. We used the same formula to calculate the AUC associated with PRSs that could be built using SNPs either reaching genome-wide significance (p value $< 5 \times 10^{-8}$) or a weaker but optimized threshold for a GWAS of given sample size, based on the projected variance of the respective PRS. Given the sample size of a GWAS and an effect-size distribution for the underlying cancer, an optimal threshold for SNP selection that will maximize the expected predictive performance of the PRS is calculated using an analytic formula we have derived earlier 48. The relative risk for those estimated to be at the 99th percentile or higher of the distribution of a PRS (compared to the average risk of the population) was calculated using the formula 12 $\exp(-h^2/2 + \Phi^{-1}(0.99)\sqrt{h^2})$, where $h^2$ is the population variance of the PRS.
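As a small worked illustration of the two formulas in this subsection, the Python sketch below evaluates the AUC, $\Phi(\sqrt{h^2/2})$, and the relative risk at the 99th percentile, $\exp(-h^2/2 + \Phi^{-1}(0.99)\sqrt{h^2})$, for a given PRS variance. The function names and the example variance are hypothetical; only the two formulas come from the text.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def auc_from_prs_variance(h2):
    """AUC of a normally distributed PRS with population variance h2."""
    return STD_NORMAL.cdf(math.sqrt(h2 / 2.0))


def relative_risk_at_percentile(h2, percentile=0.99):
    """Risk at the given PRS percentile relative to the population average."""
    z = STD_NORMAL.inv_cdf(percentile)
    return math.exp(-h2 / 2.0 + z * math.sqrt(h2))


# Hypothetical PRS variance on the log-relative-risk scale.
example_h2 = 1.0
example_auc = auc_from_prs_variance(example_h2)
example_rr = relative_risk_at_percentile(example_h2)
```

Both quantities depend on the PRS only through its variance, which is why the same functions can be applied to the limiting PRS (using the GWAS heritability) and to a PRS built from a GWAS of finite size (using its projected variance).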
Absolute risk projection. For each cancer site, we projected the distribution of residual lifetime risk (up to age 80 years) for non-Hispanic white individuals in the general US population according to PRSs that could be built from GWASs of different sample sizes. For any given age, we first obtained the distribution of residual lifetime risks based on a model for absolute risks developed using the iCARE tool that we have described earlier 12,29. The iCARE tool uses projected standard deviations of PRS at different GWAS sample sizes and age-specific cancer incidence rates available from the US National Cancer Institute-Surveillance, Epidemiology, and End Results Program (NCI-SEER) (2015) to obtain absolute risk distributions. In deriving absolute risks, we adjusted for competing risk of mortality due to other causes using the age-specific mortality rates from the Centers for Disease Control and Prevention WONDER database (2016). We then weighted the projected residual lifetime risk distribution at different baseline ages (in 5-year categories) based on the US population distribution of ages within 30-75 years, as observed in the estimated 2016 US Census. For cancers of the reproductive system, weights were based on the age distributions among males or females, as appropriate.
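The idea of combining a PRS with age-specific incidence and competing mortality can be illustrated with a simplified calculation. The Python sketch below is not the iCARE implementation: it assumes a proportional-hazards model in which the PRS multiplies the cancer incidence rate, approximates the baseline by scaling population incidence so that the average relative risk is one, and uses yearly piecewise-constant rates. All rates, the PRS standard deviation, and the function name are hypothetical.

```python
import math
from statistics import NormalDist


def residual_lifetime_risk(prs, prs_sd, incidence, mortality,
                           start_age, end_age=80):
    """Probability of developing the cancer between start_age and end_age.

    prs       : PRS value on the log-relative-risk scale (population mean 0)
    prs_sd    : population standard deviation of the PRS
    incidence : dict mapping age -> annual population cancer incidence rate
    mortality : dict mapping age -> annual competing mortality rate
    """
    # For a normal PRS, E[exp(PRS)] = exp(prs_sd^2 / 2); dividing by it makes
    # the population-average relative risk equal to one (rare-disease approx.).
    relative_risk = math.exp(prs - prs_sd ** 2 / 2.0)
    event_free = 1.0
    risk = 0.0
    for age in range(start_age, end_age):
        hazard = incidence[age] * relative_risk
        total = hazard + mortality[age]
        if total > 0.0:
            # Chance that cancer is the first event during this year.
            risk += event_free * hazard / total * (1.0 - math.exp(-total))
        event_free *= math.exp(-total)
    return risk


# Hypothetical age-specific rates for ages 30-79 (illustrative only).
ages = range(30, 80)
incidence = {a: 0.0005 + 0.00005 * (a - 30) for a in ages}
mortality = {a: 0.001 * math.exp(0.08 * (a - 30)) for a in ages}

prs_sd = 0.6
z99 = NormalDist().inv_cdf(0.99)
risk_low = residual_lifetime_risk(-z99 * prs_sd, prs_sd, incidence, mortality, 30)
risk_median = residual_lifetime_risk(0.0, prs_sd, incidence, mortality, 30)
risk_high = residual_lifetime_risk(z99 * prs_sd, prs_sd, incidence, mortality, 30)
```

Because the PRS acts multiplicatively on incidence, the same relative risk translates into a much wider spread of absolute risk when the underlying incidence is high, which is the point made in the Results about common versus infrequent cancers.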
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability
The code for running the analysis in the paper is freely available from the CancerEffectSize GitHub repository (https://github.com/yandorazhang/CancerEffectSize).