Genetic association studies of alterations in protein function expose recessive effects on cancer predisposition

The characterization of germline genetic variation affecting cancer risk, known as cancer predisposition, is fundamental to preventive and personalized medicine. Studies of genetic cancer predisposition typically identify significant genomic regions based on family-based cohorts or genome-wide association studies (GWAS). However, the results of such studies rarely provide biological insight or functional interpretation. In this study, we conducted a comprehensive analysis of cancer predisposition in the UK Biobank cohort using a new gene-based method for detecting protein-coding genes that are functionally interpretable. Specifically, we conducted proteome-wide association studies (PWAS) to identify genetic associations mediated by alterations to protein function. With PWAS, we identified 110 significant gene-cancer associations in 70 unique genomic regions across nine cancer types and pan-cancer. In 48 of the 110 PWAS associations (44%), estimated gene damage is associated with reduced rather than elevated cancer risk, suggesting a protective effect. Together with standard GWAS, we implicated 145 unique genomic loci with cancer risk. While most of these genomic regions are supported by external evidence, our results also highlight many novel loci. Based on the capacity of PWAS to detect non-additive genetic effects, we found that 46% of the PWAS-significant cancer regions exhibited exclusive recessive inheritance. These results highlight the importance of recessive genetic effects, without relying on familial studies. Finally, we show that many of the detected genes exert substantial cancer risk in the studied cohort determined by a quantitative functional description, suggesting their relevance for diagnosis and genetic consulting.


Results
Genetic and protein-function analysis of cancer predisposition in the UK Biobank. From the UKB, we derived a study cohort of 274,830 unrelated white individuals (Fig. 2). Based on their medical records, we determined that 56,634 of those individuals had a history of cancer. In addition to this pan-cancer view, we also considered nine distinct cancer types: melanoma, leukemia, breast, ovarian, prostate, lung, skin, pancreatic and colorectal cancer. We studied each of the ten defined cancer types (the nine specific cancers and pan-cancer) using both GWAS and PWAS independently. With GWAS we found 883 significant variant-cancer associations in 100 unique genomic regions, and with PWAS we found 110 significant gene-cancer associations in 70 unique genomic regions. Altogether, the GWAS and PWAS associations mapped to 145 unique genomic regions, available at Supplementary Table S1. The full GWAS and PWAS results across the ten cancer types and all tested variants and genes are available at Supplementary Table S2 and Supplementary Table S3.
We compared the 145 discovered genomic regions against prominent cancer predisposition gene panels used in the clinics (CleanPlex TMB 500 and Invitae Multi-Cancer). We also compared our results to three catalogues of cancer driver genes: Census 34 , the pan-software catalogue 35 and FABRIC 36,37 . We find that a majority of the 145 significant cancer regions are indeed supported by external evidence (clinical panels, cancer driver catalogues, or both), but 51 of the 145 regions (35%) appear as new discoveries (Fig. 3). We also observe that external validation is slightly stronger for significant regions that are supported by both GWAS and PWAS.

Single-gene loci with direct interpretation.
To demonstrate the discovered genetic signal of cancer predisposition, we begin by focusing on high-confidence associations with plausible causal mechanisms. One of the benefits of PWAS is its capacity to suggest concrete protein-coding genes underlying genetic associations. However, like GWAS, PWAS is also sensitive to linkage disequilibrium, which can lead to the genetic signal being spread across an extended genomic region and mistakenly attributed to an unrelated nearby gene. For this reason, it is of interest to examine significant associations that are restricted to single genes. We define a significant cancer region to be a soliton region if i) it has exactly one gene significant by PWAS, and ii) the significance is very strong (FDR q-value < 1E−03). Of the 145 cancer regions, 14 are solitons according to this definition ( Table 1). 5 of the 14 soliton genes are supported by gene panels (CleanPlex or Invitae), and 3 of them are supported by Census and other cancer driver catalogues. Importantly, the criteria of external evidence for the solitons listed in Table 1 are stricter than those used for the genomic regions in Fig. 3. Whereas in Fig. 3 we considered evidence for any gene within the same region (which may include up to 138 genes; see Supplementary Table S1), in Table 1 we only consider evidence matching the exact soliton gene. www.nature.com/scientificreports/ In addition to recovering genomic regions that are missed by GWAS, a major contribution of PWAS is highlighting the exact genes that drive known associations. Of the 14 soliton genes, 8 are also found by standard GWAS, whereas the other 6 are below the exome-wide significance threshold (5E−07). IRF3 (interferon regulatory factor 3), for example, is associated with non-melanoma skin cancer according to PWAS with overwhelming significance (FDR q-value = 7E−6), but not according to GWAS (p = 1E−6). IRF3 is part of the innate immune system, whose one of its major roles is to recognize and destroy infected and tumorous cells. In-vivo and expression studies have implicated IRF3 in melanoma 38,39 and other cancer types 40,41 .
The list of solitons also includes novel discoveries. For example, UBE3A (ubiquitin protein ligase E3A; also known as E6AP) is known for resulting in the ubiquitylation and degradation of the p53 tumor suppressor in Figure 1. Proteome-wide association study (PWAS). PWAS finds associations between phenotypes (such as cancer) and protein-coding genes that are mediated by genetic alterations to protein function. (A) When studying a protein-coding gene, PWAS collects all the called variants in the genome of an individual that affect the coding sequence of that protein. Using a machine-learning model, it predicts how damaging these variants are to the protein function (for example, how likely an enzyme is to retain its catalytic activity given each of the mutations, which might affect its 3D structure). It then aggregates the predictions made for all the variants into overall gene effect scores. Each gene is assigned two such scores, corresponding to either a dominant or a recessive genetic effect. The dominant effect score attempts to capture the event that the gene is affected by at least one damaging variant, whereas the recessive effect score attempts to capture the event that it is affected by at least two damaging variants. (B) Given a cohort of cases and controls, PWAS assigns each individual dominant and recessive effect scores for the studied protein-coding gene based on his or her genetics, as described in (A). By comparing the distributions of gene effect scores assigned to cases and controls, PWAS can detect statistically significant differences between the two groups (measured by p-values and FDR q-values). These differences reflect the estimated functional damage to the gene, finding, for example, that the studied protein appears to be more damaged in cases than in controls. www.nature.com/scientificreports/ However, E6AP hasn't been linked to pancreatic cancer, an association that we observe here (PWAS q-value = 8E−04). Another example is AVP (arginine vasopressin; also known as ADH), a gene producing a hormone involved in water balance homeostasis and complex behavioral traits 46 . Despite vast literature about the hormone, there doesn't appear to be any direct evidence linking the gene to cancer or specifically to melanoma, as observed here (PWAS q-value = 9E−04).

Scientific
Functional damage to coding genes is often associated with reduced cancer risk. An advantage of PWAS is that it provides clear interpretation of the directionality of significant associations, namely whether gene damage is associated with elevated or reduced cancer risk. Of the 14 solitons, which are high-confidence PWAS associations, 8 genes (57%) are associated with elevated and 6 (43%) are associated with reduced cancer risk (Table 1). For example, CTLA4 (also known as CD152) is an immune checkpoint and a preferred target in cancer immunotherapy. Reduced CTLA4 function, either through natural genetic variation or immunotherapy treatment, activates T cells and suppresses tumor (while, on the downside, increases the risk for autoimmunity) 47 . Therefore, it is not surprising that CTLA4 is characterized by PWAS as a protective gene, with gene damage being associated with reduced cancer prevalence in the UKB cohort. The characterization of genes as being associated with elevated or reduced cancer risk according to PWAS is based on either the dominant or recessive effect scores calculated by the method, depending on the heritability found significant for each association. Notably, the 4 soliton genes showing both dominant and recessive inheritance (potentially with different effect sizes and significance levels) all exhibit consistent trends. In other words, when estimated damage to these genes is associated with elevated (or reduced) cancer risk according to the dominant model, it shows the same trend according to the recessive model. Across all 110 significant PWAS associations we observe similar patterns, with 62 (56%) being associated with elevated cancer risk and the other 48 (44%) with reduced risk. Among the 16 PWAS associations that are significant for both dominant and recessive effects, 15 show consistent directionality and only one association shows opposite trends between the dominant and recessive models.

Genetic signal varies between cancers.
The uniformity of the UKB as a prospective cohort that covers all major cancer types provides a unique opportunity to perform a comparative analysis between cancer types. We start by considering the number of significant genomic regions recovered in each cancer type, according to GWAS, PWAS or both ( Table 2). We observe that the number of recovered regions vary substantially across cancer types, ranging from 2 in epithelial ovarian cancer to 58 in non-melanoma skin cancer. As statistical power is determined by sample sizes, it is not surprising that cancers with fewer cases tend to recover fewer significant regions. However, this is not the only factor. Chronic lymphocytic leukemia, for example, has the fewest cases (764), yet it recovers 20 significant regions, more than many other cancers. This could be explained by a strong polygenic signal associated with leukemia, or more likely, it could be a spurious effect of mistaking somatic mutations as germline variants in blood cancers (as UKB genotypes are based on blood samples) 48 . Colorectal cancer includes 3351 cases but only 4 recovered regions, compared for example to melanoma with 14 recovered regions, despite its slightly lower number of cases (3122). Interestingly, although pan-cancer has the highest between cancer types also reinforces the contribution of both GWAS and PWAS to our analysis, with each of the two methods dominating the discovery in certain cancer types. Major differences in polygenic signal between cancer types are also observed with respect to significance levels ( Fig. 4). For example, the 6 significant regions of lung cancer are statistically weaker than the 18 significant regions of breast cancer, although this could simply reflect the different numbers of cases. Among the 145 significant cancer predisposition loci, 65 are associated with melanoma or non-melanoma skin cancer. Genes affecting skin pigmentation were implicated with both cancer types. Of all the PWAS associations recovered in this study, the most significant is the gene MC1R (Melanocortin 1 receptor; q-value = 1E−109 in non-melanoma skin cancer and q-value = 2E−53 in melanoma), which plays a central role in skin pigmentation, melanin formation and melanocyte differentiation. Many variants in the gene have been associated with pigmentation differences and elevated risk of skin cancer and melanoma 49 . Additional major pigmentation-related genes identified in this study include TYR (tyrosinase; PWAS q-value = 6E−13 in non-melanoma skin cancer and q-value = 0.005 in melanoma) and SLC45A2 (solute carrier family 45 member 2; PWAS q-value = 1E−20 in non-melanoma skin cancer and q-value = 8E−5 in melanoma) 50 .
Predisposition genomic loci are mostly cancer-type specific. We also consider the extent to which cancer predisposition genetic signal overlaps between cancers, by counting the number of significant cancer regions shared between cancer types (Fig. 5). For example, melanoma and non-melanoma skin cancer share a substantial genetic signal, with 7 of the 14 melanoma regions being also significant in non-melanoma skin cancer (Fig. 5B). A shared genetic signal between these two cancer types is expected, given their relatedness. More surprising is the fact that 3 of the 18 breast-cancer regions are shared with prostate cancer (with overall 32 regions), while none of the 2 significant ovarian-cancer regions is shared with breast cancer (or any other cancer type). One of the 3 genomic regions shared between breast and prostate cancer is found in 11q13.3. Most significant in this region is the intergenic variant rs7130881 (lowest GWAS p-value = 3E−36). The other 2 regions shared between breast and prostate cancer contain soliton genes listed in Table 1: CHEK2 (22q12.1-q12.2) and POU5F1B (8q24.21). All six associations (the three regions with respect to the two cancer types) have strong support in literature [51][52][53][54] .
We observe that the pan-cancer phenotype does not substantially add to the recovery of significant genomic regions over the nine specific cancer types. Of the 26 significant pan-cancer regions, only 3 are not recovered by any of the specific cancer types. From a statistical perspective, this stands out given the high number of pancancer cases (56,634).
The genomic region with the most dominant pan-cancer signal is 5p15.33, which is also significant in nonmelanoma skin cancer, melanoma, prostate cancer and lung cancer. Among the most significant variants in this region are rs401681 (GWAS p-value = 7E−32) in the intronic region of the gene CLPTM1L (Cisplatin Resistance Related Protein), and rs2853677 (GWAS p-value = 9E−22) in the intronic region of TERT (Telomerase Reverse Transcriptase). TERT is a well-known cancer predisposition gene implicated in numerous cancer types 55 . Mutations in TERT can cause abnormal activities of the telomerase and prevent telomere shortening, thereby leading to cell immortality. rs2853677 has been implicated in various Chinese and Indian populations 56,57 . We provide evidence for the importance of this variant in the white British population. CLPTM1L has also been linked to cancer in multiple studies. For example, overexpression of the gene has been shown to enhance the growth of pancreatic cancer cells in vitro and in vivo 58 .
Another relatively major genetic overlap is observed between prostate cancer and non-melanoma skin cancer, which share 3 predisposition regions. On top of 5p15.33 (the locus of CLPTM1L and TERT), prostate and non-melanoma skin cancer also share 6p22.1-p21.31 and 21q22.2-q22.3. The 21q22.2-q22.3 region seems to be a novel discovery of our analysis, without any existing support from the literature. The most significant variant  Table S1).
Other than the handful of shared regions mentioned here, most cancer types show little overlap with other cancers (Fig. 5). Overall, of the 145 cancer predisposition regions, only 27 are shared by multiple cancers. Disregarding pan-cancer, only 14 regions are shared by two or more specific cancer types.
Recessive effects are prevalent in cancer predisposition. Unlike GWAS, PWAS explicitly models dominant and recessive effects at the gene level and has the capacity to capture genetic effects that transcend the individual variant. It therefore provides the opportunity to assess the importance of these effects to cancer predisposition. We observe that recessive effects are indeed common among significant PWAS regions, and especially among the regions that are not captured by GWAS (Fig. 6A). The enrichment of recessive effects in (B) Partitioning of the 145 significant cancer genomic regions by the combination of cancer types in which they were found significant. Each node in the graph represents a combination of cancer types. Two nodes are connected when one combination is a direct subset of the other. Each node is labeled with two numbers: the number of significant regions matching the exact combination of cancer types represented by the node (e.g. 12 genomic regions that are unique to breast cancer), and the total number of significant regions matching any combination of cancer types that includes this combination (e.g. overall 18 regions associated with breast cancer, including 6 regions shared with other cancers). Node sizes reflect the number of significant regions that match the combinations exactly. Nodes representing combinations of more than one cancer type are colored dark blue. Pan-cancer is abbreviated as P-C.   www.nature.com/scientificreports/ PWAS-exclusive regions is explained by the fact that GWAS (under standard settings) models additive genetic effects. Unlike dominant effects, which can be approximated by an additive model, recessive effects cannot be effectively recovered by an additive GWAS model. The same trend shows when considering the evidence for each genomic region according to continuous p-values: significant regions with little evidence according to GWAS tend to be recovered by the recessive model of PWAS (Fig. 6B).
To illustrate the dominant and recessive genetic effects captured by PWAS, we selected three soliton associations with strong effects (see Table 1): MITF (melanocyte inducing transcription factor) associated with melanoma (q-value = 1E−08; both dominant and recessive effects are significant), HOXB13 (homeobox B13) associated with prostate cancer (q-value = 2E−31; only dominant effect is significant) and SLC45A2 (solute carrier family 45 member 2) associated with non-melanoma skin cancer (q-value = 1E−20; only recessive effect is significant). For each of the three associations, we plot cancer risk among UKB individuals as a function of PWAS effect scores (Fig. 6C). PWAS effect scores are assigned to each individual in the cohort with respect to each studied protein-coding gene based on their genotype. They reflect the overall estimated functional damage affecting genes at the molecular level, with lower effect scores reflecting more damage. For each individual, each coding gene is assigned two effect scores that aggregate all the variants within the gene that were genotyped or imputed for that individual: a dominant effect score (reflecting the likelihood of at least one damaging hit) and a recessive effect score (reflecting the likelihood of at least two damaging hits). These effect scores are used by PWAS to detect significant genes (if they correlate with a history of cancer), and they can also be used to assess individual-level risk (with respect to specific genes).
In the case of MITF, we observe that among the 2,372 individuals with the lowest dominant effect scores (comprising 0.9% of the cohort), 61 (2.57% of that group) have had melanoma, compared to only 1.12% risk of melanoma in the rest of the cohort (leading to observed risk-ratio of 2.3). When considering recessive effect scores, we find that 2 out of the 3 individuals with the lowest scores have had melanoma. We conclude that individuals with functional damage to the MITF protein have substantially higher rates of melanoma. As the name of the gene suggests, MITF is a transcription factor that controls the development and function of melanocytes, which are pigment-producing cells. The causal connection of MITF to melanoma is well established 59 .
Unlike MITF, showing both dominant and recessive effects (which might be interpreted as approximating an additive effect), HOXB13 shows a clear dominant effect according to PWAS. Among the 479 males with the lowest dominant effect scores (0.4% of the male cohort), 99 of them (20.7%) have a history of prostate cancer, compared to 5.87% rate of prostate cancer in the rest of the male cohort (risk-ratio of 3.5). HOXB13 regulates cellular response to androgen, an important male hormone. Its link to prostate cancer and early onset of the disease has been established in both GWAS and family studies 60 .
A clear recessive effect is exemplified by SLC45A2 (also known by the names AIM1, MATP and OCA4). Among the 13,196 individuals with highest recessive effect scores, i.e. the 5% of the UKB cohort least likely to have two damaged copies of the gene, 648 of them (4.91%) are cases of non-melanoma skin cancer, compared to 6.99% of cases in the rest of the cohort (risk-ratio of 1.4). SLC45A2 is a transporter protein likely involved in the production of melanin. The gene has indeed been implicated in melanoma and skin cancer through albinism, a disorder known for recessive inheritance 50 .
These three examples demonstrate the applicability of PWAS effect scores to clinical risk assessment. Similar plots for all 110 significant PWAS associations are available in Supplementary Fig. S2.
To further examine the recessive associations, we list the 30 genomic regions with only significant recessive effects (Fig. 6A) in Table 3 (full summary statistics for these and all other significant regions are available in Supplementary Table S1 and Supplementary Table S3). We find that 29 of the 30 recessive regions support exactly one significant association of gene and cancer-type, and we therefore identify the regions by the significant genes (the only exception is the region of SEMA3F on chromosome 3, in which another gene, MST1R, is also significant, but less so). Notably, 9 of the 30 recessive genes are enzymes (or part of enzyme complexes), which are considered a standard model for recessive effects 61 . Additional 4 genes are transporters associated with the solute carrier family (or regulators of such transporters). In fact, of all 110 significant PWAS associations, these genes are the only transporters of that family (namely, all 4 genes associated with the solute carrier family that we have recovered exhibit pure recessive effects). Both SLC12A9 and SLC45A2, which we associate with melanoma and non-melanoma skin cancer, are implicated with pigmentation according to the Open Targets Platform 62 . We also observe that many of the 30 recessive genes are involved in cellular processes of apoptosis, ubiquitination and immunology. Interestingly, leukemia seems to be particularly enriched with recessive effects. Of the 16 genomic regions associated with leukemia according to PWAS, 12 show exclusive recessive effects and are listed in Table 3. The possibility of mistaking somatic mutations as germline variation in blood cancers should be taken into account when interpreting this finding. For example, the high prevalence of recessive heritability in leukemia might be a reflection of the "second-hit" hypothesis 63 . As an external validation scheme, Table 3 also lists published cancer genetic associations curated by the Open Targets Platform. We find that the cancer types reported by the Open Targets Platform are usually in agreement with our results. Nevertheless, most recessive genes appear to represent novel genetic associations, maybe as a result of recessive genetic studies being largely neglected in case-control cohorts. When considering the external validation used earlier (Fig. 3), only 2 of the 30 recessive genes are supported by clinical panels or catalogues of cancer drivers. Specifically, MUTYH and BIRC3 are both established as cancer drivers, and they both appear in clinical panels. Overall, these 30 recessive genes highlight the importance of modeling recessive inheritance in non-familial (case-control) cancer studies.

Discussion
We performed a comprehensive analysis of cancer predisposition in the UKB cohort and identified 145 significant genomic regions by combining both GWAS and PWAS. The two methods complemented each other. While GWAS identifies variants with additive genetic effects, PWAS identifies coding genes with dominant or recessive effects (or a combination of these effects). The capacity of PWAS to detect non-additive effects has proven substantial in our analysis. We detected 30 genomic regions with pure recessive effects, indicating that recessive inheritance is substantial for cancer predisposition also outside of family studies. Many of the discovered genes with significant recessive effects had not been previously implicated with cancer predisposition ( Table 3), indicating that non-additive effects are neglected in contemporary genetic association studies. The main reason for the scarcity of studies on recessive effects is that GWAS and current variant-level methods are not designed to capture such genetic effects, especially when compound heterozygosity is involved. By exposing the importance of recessive inheritance to cancer predisposition and presenting a methodology for recovering it, we aim to motivate future investigation of this underexplored source of cancer risk (and of other conditions).
Many of the genomic loci recovered in this study are supported by external evidence (clinical panels and cancer drivers), while many are novel associations (Fig. 3, Table 3). While proving definite causal links is beyond the scope of this work, many of our novel discoveries appear as promising candidates. In particular, genes defined as solitons (Table 1) are likely less sensitive to linkage-disequilibrium and are therefore more likely to prove causal.
Many of the reported genes are associated with substantial cancer risk (Fig. 6C). For example, we show that individuals in the male cohort with substantial damage to HOXB13 have an increased risk of prostate cancer (20.7%, compared to 5.9% in the rest of the male cohort). We provide a similar cancer risk analysis for all 110 PWAS gene associations (Supplementary Fig. S2). The capacity for risk assessment at the resolution of whole www.nature.com/scientificreports/ genes is enabled by PWAS through the aggregation of all variants affecting an individual. Integration of such data into the clinics could have a substantial impact on individual and familial cancer risk evaluation. Even when variants of unknown significance are observed after the initial study, PWAS can still estimate gene damage and compare it to the population distribution, thereby providing risk assessment. The presented analysis was conducted over nine specific cancer types and pan-cancer. We observe that most cancer predisposition loci are cancer-type specific. Only 14 of the 145 loci are shared by two or more specific cancer types, with 7 of them shared only by melanoma and non-melanoma skin cancer. Only 2 loci are shared by three or more specific cancer types (Fig. 5). The pan-cancer analysis recovered only 3 unique associations not discovered in specific cancers. This observation is in stark contrast to somatic tumor genomics, where most discovered cancer drivers are pan-cancer 36 . Our results suggest that cancer predisposition of germline genetic variation, unlike positive selection of somatic mutations in tumors, are mostly cancer-type specific. The cancer type with the highest number of recovered loci is non-melanoma skin cancer, with 58 of the 145 genomic regions associated with that cancer type. While this could be explained by the high number of cases for that cancer type, it highlights a substantial polygenic signal for non-melanoma skin cancer. Such strong polygenicity is compatible with the substantial role of environmental factors such as UV radiation and viral infection in this cancer 64 .
An important limitation of our analysis is its restriction to the white British population, as the UKB cohort does not include a substantial number of individuals from other ethnicities required for sufficient statistical power (see Methods). This leads to a lack of validation of the discovered associations on other populations, as well as oversight of additional associations only present in other ethnic groups. Another limitation of the UKB is the lack of sequencing data for the vast majority of the cohort. In this analysis we relied solely on genotyping of ~ 800 K predefined genetic markers and ~ 600 K variants in coding regions imputed from the original set of markers 26 . This limitation is particularly relevant for PWAS, which underestimates genetic damage when nongenotyped variants are involved, leading to diminished statistical power (but, critically, not to false discoveries) 33 .
This study demonstrates the power of exhaustive genomic and protein-centric analysis over large biobanks for expanding the discovery of cancer predisposition loci beyond family studies. In particular, we have demonstrated that recessive effects are also prevalent in case-control cohorts when looked for by suitable methods. We have also demonstrated the benefit of cohorts with diverse cancer-type annotations for cross-cancer comparisons. Finally, our results show promise towards individual clinical risk assessment of cancer.

Methods
Study cohort. From the entire UK Biobank (UKB) cohort of 502,520 individuals (application ID 26664) 25,26 , we filtered out 227,690 individuals (see Fig. 2) for the following reasons: i) 14 individuals had asked to withdraw from the UKB, ii) 92,922 were non-whites (according to self-reported ethnicity or their genetics), iii) 312 had self-reported sex mismatching their genetics, iv) 75,848 were family-related to other individuals in the cohort (from each group of related individuals we chose only one representative), and v) 58,594 didn't have any recorded ICD-10 codes (which were required to determine cancer history).
Following these filtrations, we remained with a study cohort of 274,830 individuals, 56,634 (20.06%) of them had a history of cancer according to their ICD-10 codes (the number of cases and controls per cancer type are listed in Table 2).
In addition to cancer status, we also extracted from the UKB the following set of covariates: sex, year of birth, 40 genetic principal components (PCs), genotyping batch (105 categories) and UKB assessment center (25 categories). Altogether, 173 covariates (including a constant intercept) were included in the GWAS and PWAS analyses. We chose using 40 PCs, which is the maximal number of PCs provided by the UKB, to minimize the risk of false discoveries due to unaccounted population structure (at the cost of slightly reduced statistical power).
While this work would clearly benefit from trans-ethnic analysis (in particular, validation of the reported associations on independent cohorts of other ethnicities) 65 , the UKB does not include a sufficient number of samples from any non-white ethnic group to support such analysis. The largest minority group in the UKB is the Indian group which includes ~ 6000 samples. Such sample size is sufficient in the presence of continuous traits or very common binary traits, but not binary traits with a prevalence of 1-2% (which is the typical prevalence of most cancer types; see Table 2). Specifically, a sample size of ~ 6000 and a disease prevalence smaller than 2% entails only up to ~ 100 cases. In other words, under such settings, the number of events would be smaller than the number of variables (having 173 covariates), while PLINK recommends at least 10 events per variable 66 .
The computational pipeline for processing the UK Biobank data used in this work is an open-source project available at https:// github. com/ nadav bra/ ukbb_ parser.
Where ukb_cal_chr1_v2.bed, ukb_snp_chr1_v2.bim and ukb2666_cal_chr1_v2_s488366.fam are the files with the genetic data of chr1, provided by the UKB. The files breast_cancer.txt and covariates.txt are PLINK-formatted tab-separated data files with the cancer phenotype (case-control status) and covariates of each of the samples in the study cohort, respectively. The --glm argument specifies the use of a linear/logistic model; the --1 flag specifies that cases and controls are encoded as 1 and 0, respectively; the hide-covar flag indicates that summary statistics are to be reported only with respect to the phenotype (but not the individual covariates); the no-x-sex flag indicates that sex is not to be re-added as a covariate (since it's already included among the provided covariates); www.nature.com/scientificreports/ --mac 20 limits the analysis to variants with at least 20 occurrences of the minor allele (according to PLINK's recommendation), so that the linear model is properly calibrated; --covar-variance-standardize linearly transforms all quantitative covariates to mean-zero, variance 1; -freq specifies that allele frequencies are to be calculated and reported; and --threads 32 --memory 10,000 allocate 32 CPU cores and ~ 10 GB of RAM for the process. We ran GWAS independently over ten cancer phenotypes (nine specific cancer types and pan-cancer) and over 26 chromosomes (22 autosomal chromosomes, chrX, chrY, chrXY and chrMY, as provided by the UKB). The complete GWAS results with all the summary statistics, including 883 variant-cancer associations below the exome-wide significance threshold (5E−07), are available per cancer-type in Supplementary Table S2. PWAS. Proteome-Wide Association Study (PWAS) is a protein-centric, gene-based method for genetic association studies (Fig. 1). The method is fully described and validated in a previous work 33 . In the case of cancer predisposition, PWAS detects protein-coding genes that are significantly more (or less) damaged in individuals with a history of cancer. To determine to what extent a given gene is damaged within a given individual (based on his or her genetics), PWAS employs a machine-learning model that assigns functional effect score to each variant affecting the coding sequence of the gene. Specifically, PWAS uses the FIRM predictor, a random-forest classifier that assigns each variant a score reflecting the probability of the gene to retain its function in the presence of that variant 36  , the tested null hypothesis is that the coefficients of both the dominant and recessive effect scores are zero when fitting a logistic regression model to explain cancer status. In other words, a significant PWAS association indicates that at least one of the two effect scores (or some linear combination of them) is associated with cancer predisposition.
We used version 1.0.4 of the PWAS software (available at https:// github. com/ nadav bra / pwas) and executed the standard PWAS pipeline (as specified on that GitHub page). We derived the dominant and recessive PWAS effect scores for the entire UKB cohort (which depend solely on the genetic information) and tested associations between 18,053 protein-coding genes and the ten cancer phenotypes. The complete PWAS results with full summary statistics, which include 110 FDR-significant gene-cancer associations, are available per cancer type in Supplementary Table S3.
Multiple testing correction. We accounted for multiple testing by using the accepted exome-wide significance threshold ( p < 5E −7 ) in GWAS, and false discovery rate (FDR) with the standard significance threshold ( q < 0.05 ) across all 18,053 tested genes in PWAS. Nonetheless, some interpretations of our results may warrant stricter significance threshold. First, the genome-wide significance threshold (5E−08) is more appropriate when considering non-exonic regions by GWAS. Second, as our analysis considers ten different cancer types, another factor of 10 in the significance threshold might be justified (e.g. 0.005 instead of 0.05 in PWAS). On the other hand, if one considers only a specific cancer type of interest, then this would not be unnecessary. A conservative interpretation of our results may warrant further division of the significance threshold by another factor of 2, to account for the fact that both GWAS and PWAS have been used (even though this might be somewhat too conservative, as the two methods are applied over the same data and are not completely independent). As always, p-and q-values are best interpreted as continuous measures of statistical evidence rather than dichotomous markers of significance. A prudent interpretation of statistical analysis should always consider their exact values, and not only whether or not they are below a certain threshold. Due to the considerations outlined here, GWAS results with p-values between 2.5E−09 and 5E−07 and PWAS results with q-values between 0.0025 and 0.05 may be seen as borderline. To allow easy interpretation of our results and comparison between GWAS and PWAS in exonic regions, we decided to stick with the simple thresholds of p < 5E −7 and q < 0.05 throughout this work, while making available the full summary statistics across all regions (Supplementary Table S1), variants (Supplementary Table S2) and genes (Supplementary Table S3).
Merging the significant associations into unique genomic regions. We merged the 883 significant GWAS associations and 110 significant PWAS associations into 145 unique genomic regions by extending each association by 500,000 base-pairs on both sides and iteratively merging overlapping regions. GWAS associations started as 1 bp and were extended into 1,000,001 bp regions (the position of the variant, plus 500,000 bp on each side) while PWAS associations started as the region of the gene's coding-sequence (i.e. from the genomic coordinate of the start codon to the last before the stop codon). This merging procedure resulted in 145 nonoverlapping genomic loci of sizes of at least 1Mbp, available in Supplementary Table S1 Table S4). The Census gene list was downloaded in May 2020, and the two panel gene lists were downloaded in June 2020. The list of 299 pan-software driver genes was downloaded from Supplementary Table S1 of the publication 35 . The list of 593 FABRIC driver genes was downloaded from Supplementary Table S1 of the publication 36 . We mapped the obtained lists of genes onto our 18,053 analyzed genes according to their UniProt ID (in the case of FABRIC) or gene symbols (in the case of the other resources). For each of the 145 genomic regions listed in Supplementary Table S1, we include all the genes in the region that are supported by each of these external catalogues. In Fig. 3, we consider a genomic locus to be supported by clinical panels or driver catalogues if it contains at least one gene listed in those external resources. In Table 1 and Table 3, we consider the external evidence to be relevant only if it matches the exact gene listed on the table.
In Table 3 we also include cancer associations of the listed genes according to the Open Targets Platform 62 . We consider a reported association to be relevant only if (i) it is related to cancer or neoplasm, and (ii) it is supported by genetic association (as opposed to the overall association score reported by the platform, which also includes other sources of evidence, including somatic mutations).
Determining whether PWAS-estimated gene damage is associated with elevated or reduced cancer risk. In Table 1 we indicate for each of the genes recovered by PWAS whether estimated protein damage is associated with cancer risk or protection, based on the signs the Cohen's d summary statistics reported by PWAS. Cohen's d is the (normalized) difference in mean gene effect score between cases and controls (calculated independently for both dominant and recessive effect scores). When negative, it indicates that cases have a lower mean effect score (i.e. more gene damage) compared to controls, meaning that gene damage is associated with elevated cancer risk. Conversely, a positive Cohen's d indicates that gene damage is associated with reduced risk.
Cancer risk as a function of PWAS effect scores. In Fig. 6C and Supplementary Fig. S2 we illustrate the potential clinical utility of PWAS associations by plotting the risk for the relevant cancer types as a function of dominant and recessive effect scores (which capture the overall amount of gene damage to the relevant gene, per individual). The underlying data for these figures (cancer case-control status and PWAS effect scores) was taken directly from the study cohort (see the relevant sections in the Methods). In the continuous line plots, we examine the entire distribution of PWAS effect scores by considering all possible cutoffs of the relevant cohort, i.e. all subsets with effect scores that are either below or above a certain threshold. Since the precise effect-score values are not very informative, we describe the relevant sub-populations by percentiles (e.g. "bottom 0.9%", referring to the 0.9% of the cohort with the lowest scores, namely the 0.9% with the greatest estimated functional damage to the gene, or "top 5%", referring to the 5% of the cohort with the highest scores, namely the 5% with the least estimated functional damage to the gene). Each of the continuous plots also includes a reference to the entire cohort, marked on the x-axis as "All". Cutoffs of bottom percentiles are shown to the left of the entirecohort reference point, and cutoffs of top percentiles are shown to its right (both in log scale). On the y-axis of these plots we show the estimated prevalence of the relevant cancer type in the study cohort. Note that the estimated cancer prevalence is not the naïve fraction of individuals who were diagnosed with the relevant cancer type (among the group of individuals described by the relevant partition). Rather, it is based on the 95% confidence interval for the fraction of cases in the relevant group, calculated by Wilson score interval's method for binomial proportion. When the confidence interval overlaps with the baseline cancer rate (the dashed horizontal line in the plot), then the estimated prevalence is taken to be that baseline value. When the entire confidence interval is strictly above or below the baseline rate, then the more conservative bound of the confidence interval (that is closer to the baseline rate) is taken as the estimated prevalence. In other words, when the confidence interval is above the baseline rate, then its lower bound is taken, and when it is below the baseline rate, then its higher bound is taken. By relying on the 95% confidence interval, rather than naïve frequency of cases, more reliable estimates are obtained when dealing with small groups at the tails of the distributions (which could show wild random variations due to small numbers).
In addition to the continuous plots that capture the entire distribution of PWAS effect scores, we also illustrate the associations by choosing specific cutoffs showing the strongest differences. Specifically, we tested each cutoff by Fisher's exact test for the association between the effect-score cutoff to cancer case-control status, selecting the cutoff with the lowest p-value (and display the plot only if it's lower than 1E−03).
Note that these plots do not account for covariates, and should therefore not be considered as rigorous statistical evidence for the associations (unlike the summary p-values reported by PWAS, which account for covariates and consider the entire distribution of effect scores).