Expanding cancer predisposition genes with ultra-rare cancer-exclusive human variations

It is estimated that up to 10% of cancer incidents are attributed to inherited genetic alterations. Despite extensive research, there are still gaps in our understanding of genetic predisposition to cancer. It has been theorized that ultra-rare variants partially account for the missing heritable component. We harness the UK Biobank dataset of ~500,000 individuals, 14% of whom were diagnosed with cancer, to detect ultra-rare, possibly high-penetrance cancer predisposition variants. We report on 115 cancer-exclusive ultra-rare variations and nominate 26 variants with additional independent evidence as cancer predisposition variants. We conclude that population cohorts are a valuable source for expanding the collection of novel cancer predisposition genes.

Discovery of cancer predisposition genes (CPGs) has the potential to impact personalized diagnosis and advance genetic consulting. Genetic analysis of family members with high occurrences of cancer has led to the identification of variants that increase the risk of developing cancer 1 . In addition to family-based studies, efforts to identify CPGs focus on pediatric patients where the contribution of environmental factors is expected to be small. Forty percent of pediatric cancer patients belong to families with a history of cancer 2 .
Tumorigenesis results from mis-regulation of one or more of the major cancer hallmarks 3 . Therefore, it is anticipated that CPGs overlap with genes that are often mutated in cancerous tissues. Indeed, CPGs most prevalent in children (TP53, APC, BRCA2, NF1, PMS2, RB1 and RUNX1) 2 are known cancer driver genes that function as tumor suppressors, oncogenes or have a role in maintaining DNA stability 4 . Many of the predisposed cancer genes are associated with pathways of DNA-repair and homologous recombination 5 . The inherited defects in cells' ability to repair and cope with DNA damage are considered as major factors in predisposition to breast and colorectal cancers 6 .
Complementary approaches for seeking CPGs are large-scale genome/exome-wide association studies (GWAS), which are conducted solely based on statistical considerations without prior knowledge of cancer-promoting genes 7 . Identifying CPGs from GWAS is a challenge for the following reasons: (1) limited contribution of genetic heritability in certain cancer types; (2) low effect size/risk associated with each individual variant; (3) low penetrance in view of the individual's background 8 , and (4) low statistical power. Large cohorts of breast cancer show that ~2% of cancer cases are associated with mutations in BRCA1 and BRCA2, which are also high-risk ovarian cancer susceptibility genes. Additionally, TP53 and PTEN are associated with early-onset and high-risk familial breast cancer. Mutations in ATM and HRAS1 mildly increase the risk for breast cancer but strongly increase the risk for other cancer types, and a collection of DNA mismatch repair genes (MLH1, MSH2, MSH6, PMS2) is associated with a high risk of developing cancer 9 . A large cohort of Caucasian patients with pancreatic cancer reveals 6 high-risk CPGs that overlap with other cancer types (CDKN2A, TP53, MLH1, BRCA2, ATM and BRCA1) 10 .
Estimates for the heritable component of predisposition to cancer were extracted from GWAS, family-based and twin studies [11][12][13] . These estimates vary greatly with maximal genetic contribution associated with thyroid and endocrine gland cancers, and a minimal one with stomach cancer and leukemia 14 . Current estimates suggest that as many as 10% of cancer incidents can be attributed to inherited genetic alterations (e.g., single variants and structural variations) 15,16 . The actual contribution of CPGs varies according to gender, age of onset, cancer types and ethnicity [17][18][19][20] . It is evident that high risk variants with large effect sizes are very rare 21 . Actually, based on the heritability as reflected in GWAS catalog, it was estimated that only a fraction of existing CPGs is presently  1 with about half of the reported genes derived from family studies representing high-penetrance variants. An extended catalog was reported with a total of 152 CPGs that were tested against rare variants from TCGA germline data, covering 10,389 cancer patients from 33 cancer types and included known pediatric CPGs 23 . The contribution of BRCA1/2, ATM, TP53 and PALB2 to cancer predisposition was confirmed.
In this study we report on known and novel cancer predisposition candidate genes. We benefit from the UK-Biobank (UKBB), an invaluable resource of germline genotyping data for ~500,000 individuals. The UKBB reports on ~70,000 cancer patients and ~430,000 cancer-free individuals, considered as the control group. We challenge the possibility that CPGs can be identified from very rare events, henceforth called cancer-exclusive ultra-rare variants (CUVs). These CUVs are expected to exhibit high penetrance. Notably, the presented CUVs were extracted from the UKBB DNA array and therefore only cover the 803,804 SNPs pre-selected for the array. We report on 115 exome variations, 72 of which are heterozygous. The majority of the matching genes are novel CPG candidates. We provide indirect genomic support for some of the CUVs that occur within coding genes and discuss their contribution to tumorigenesis.

Results
The primary UKBB data set used in the article is comprised of 325,407 Caucasian UKBB participants (see Methods, Fig. 1c), 282,435 cancer-free (86.8%) and 42,972 diagnosed with at least one malignant neoplasm. Among participants with cancer, 55% were diagnosed with either skin or breast cancer. The clinical ICD-10 codes assembly is summarized in Supplementary Table S1. A total of 13.2% of the cancer-diagnosed individuals had two or more distinct neoplasms diagnosed. The validation UKBB data set includes 70,544 non-Caucasian participants, among them 63,585 are cancer-free (90.1%). Figure 1a,b provide further details on different cancer type prevalence in these sets.
Non-melanoma skin cancer is mostly attributed to environmental factors rather than genetic association 24 . However, based on evidence for hereditary links for non-melanoma skin cancer predisposition 25,26 , we included these individuals in our analysis. In addition, focusing on extremely rare variations enables the identification of existing, yet overlooked genetic associations.
Compilation of cancer-exclusive ultra-rare variants (CUVs). We scanned 803,804 genetic markers in our prime data set for cancer-exclusive variations. 183 variations met our initial criteria, appearing at least twice in individuals diagnosed with cancer and not appearing in cancer-free individuals. Among them, 95 were heterozygous and 88 were homozygous variations. In order to target variations with additional supporting evidence, we considered only coding exome and splice-region variants. To assure the CUVs' rarity in the general population, we applied an additional filter based on the gnomAD data set (see Methods). The resulting final list is comprised of 115 variants (associated with 108 genes), 72 heterozygous and 43 homozygous (Fig. 1d). The detailed list of all 115 CUVs can be found in Supplementary Table S2. Most (66%) of the CUVs are missense variants. There is a strong enrichment for loss of function (LoF) variants (i.e., frameshift, splicing disruption and stop gains), which account for 33% of the CUVs. Only a single homozygous CUV is synonymous (Fig. 2a). The distribution of variation types varies greatly between homozygous and heterozygous CUVs (Fig. 2b). Missense variants are 93% of the homozygous variant set, but only 50% of the heterozygous CUVs. The heterozygous CUVs are highly enriched for LoF variants, which constitute the other 50%.
Cancer-exclusive ultra-rare variants overlap with known cancer predisposition genes. From the listed CUVs, 26 variants (in 23 genes, Table 1) lie in genes previously defined as cancer-associated. Specifically, 22 CUVs within 19 genes appear in the updated CPG catalog 23 and 24 CUVs (within 21 genes) lie in known cancer driver genes (Fig. 3a), as determined by either COSMIC 27 or the consensus catalog of driver genes (listing 299 genes, coined C299) 28 . More than half of the cancer-associated variants result in LoF. Many of the affected genes are tumor suppressor genes (TSGs), among which are prominent TSGs such as APC, BRCA1 and BRCA2 (Table 1), each identified by two distinct CUVs. Notably, 10 of the variants had at least one appearance in non-melanoma skin cancer.
The heterozygous CUVs are enriched for known cancer predisposition genes. Twenty-five of the cancer-associated CUVs are heterozygous and one is homozygous. However, there is an inherent imbalance in the initial variant sampling performed by the UKBB. As the UKBB uses DNA arrays for obtaining genomic data, the identifiability of ultra-rare exome variants is restricted by the selection of SNP markers and the design of the DNA array. There are 6,450 heterozygous ultra-rare exome variants from 2,938 genes which pass our biobank-ethnic and gnomAD allele frequency filtration. A total of 1,604 of the filtered ultra-rare variants overlap with 105 known CPGs, as some genes are over-represented among the ultra-rare variants (Supplemental Table S3). For example, the exomic region of BRCA2 is covered by 226 such SNP marker variants, while most genes have none.
In order to account for the disproportionate number of ultra-rare variants in some CPGs, we calculated the expected number of cancer predisposed genes when gradually removing highly-represented genes from the collection of heterozygous ultra-rare variants. As shown in Fig. 3b, there is an enrichment towards CPGs, and even more so as we remove variants of over-represented genes (e.g., BRCA2). The statistical significance estimates (p-values) for each data-point are available in Supplemental Table S3 (see Methods).

Independent genetic validation.
Due to the extremely rare nature of the CUVs, we require additional support for the collection of the CPG candidates. We seek independent genetic validation of the non-cancer related CUVs. We apply three sources for validation: (1) the filtered Caucasian UKBB cohort; (2) the matched filtered, non-Caucasian UKBB cohort; (3) the collection of germline variants from TCGA, as reported in gnomAD. The complete list of genetically validated novel CPG candidates is given in Table 2. Ten out of the 23 novel CPGs were identified based on appearances in individuals with non-melanoma skin cancer.
Within the Caucasian cohort, we consider the following as additional genomic evidence: (1) a gene with 2 CUVs, or (2) any CUV seen in more than two individuals diagnosed with cancer. We found 7 genes that have 2 distinct CUVs, 3 of which are already known CPGs: BRCA1, BRCA2 and APC. The other 4 genes are likely novel
The non-Caucasian UKBB cohort provides additional independent genomic evidence. There are 5 CUVs that appear at least once in an individual with cancer from the non-Caucasian cohort. CUVs from the genes MYO1E, SARDH and ISLR appeared in two distinct individuals with cancer from this non-Caucasian cohort, while CUVs from PCDHB16 and the known CPG BMPR1A appeared in a single individual with cancer.
TCGA germline variants were obtained using exome sequencing and thus offer an additional separate source for CUV validation. Clearly, the appearance of CUVs in TCGA germline data is not anticipated, as we discuss variants that are ultra-rare in both UKBB and gnomAD. The TCGA collection within gnomAD includes only 7,269 samples. We identified 10 CUVs that were also observed in TCGA gnomAD germline data, one of a known cancer driver gene TGIF1, and 9 novel CPG candidates: PCDHB16, EGFLAM, AKR1C2, MAP3K15, MRPL39, DNAH3, WDFY4, HSPB2 and ZFC3H1.
Based on the above support, we compiled a list of 23 validated CPGs, which includes 21 genes that are novel CPGs. Among these genes, 12 CUVs are heterozygous, 8 are homozygous and MYBPC3 is supported by both heterozygous and homozygous CUVs. Two of these genes are supported by multiple lines of validation evidence. DNAH3 has a homozygous CUV which appears in 3 individuals with cancer in the Caucasian cohort and within the TCGA germline variant collection. PCDHB16 has a homozygous CUV which appeared in 3 individuals in the Caucasian cohort, one individual in the non-Caucasian cohort and in the TCGA gnomAD resource. In addition, non-CPG cancer-driver genes with validated CUVs include TGFBR2 and TGIF1, which are also very likely CPG candidates.
Some of the prominent genes in our list were highlighted by additional independent studies. For example, a novel oncolytic agent targeting ICAM1 against bladder cancer is now in phase 1 of a clinical trial 29 . Additionally, DNAH3 was identified as a novel predisposition gene using exome sequencing in a Tunisian family with multiple non-BRCA breast cancer instances 30 .

Somatic mutations in novel CPGs significantly decrease survival rate.
There is substantial overlap between CPGs and known cancer driver genes (Fig. 3a). This overlap suggests that somatic mutations in validated CPG candidates may have an impact on patients' survival rate. We tested this hypothesis for the 21 novel CPG candidates (Table 2) using a curated set of 32 non-redundant TCGA studies (compiled in cBioPortal 31,32 ) that cover 10,953 patients. By testing the impact of alteration in the 21 novel CPGs in somatic data we expect to provide a functional link between the germline CPG findings and the matched mutated genes in somatic cancer samples. Altogether, 3,846 (35%) of the patients had somatic mutations in one or more of the genes. The median survival of patients with somatic mutations in these genes is 67.4 months, while the median for patients without somatic mutations in any of these genes is much longer (86.3 months). Applying the Kaplan-Meier survival estimate yields a p value of 1.78e−4 in the Logrank test (Fig. 4a). The Kaplan-Meier disease/progression-free estimate was also worse for patients with somatic mutations in the 21 novel CPGs with a p value of 6.03e−3 (Fig. 4b). Cancer types in this analysis are represented by varied numbers of patients and percentages of individuals with somatic mutations in any of the novel CPGs (Supplemental Table S4). The trend in most cancer types matches the presented pan-cancer analysis. Survival and disease/progression estimates for each cancer type are available in Supplementary Figures S1-S24. Hazard Ratios and confidence intervals were calculated (see Materials and Methods and Supplemental Table S4). We conclude that the CUV-based CPG candidate genes from UKBB carry a strong signature that is manifested in patients' survival, supporting the notion that these genes belong to an extended set of previously overlooked CPGs.
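As an illustration of how a median survival time is read off a Kaplan-Meier curve, the following minimal sketch computes the product-limit estimate for two groups. The follow-up times and event flags are invented toy values, not TCGA data, and the function names are hypothetical; the Logrank comparison itself is not reproduced here.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times  -- follow-up time in months for each patient
    events -- 1 if death was observed, 0 if the patient was censored
    Returns a list of (time, survival probability) at each event time.
    """
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(order) and order[i][0] == t:
            deaths += order[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached within follow-up


# Toy data (assumption): patients with / without a somatic mutation.
mutated = kaplan_meier([5, 8, 12, 20, 30], [1, 1, 1, 0, 1])
wild_type = kaplan_meier([10, 15, 25, 40, 50], [1, 0, 1, 1, 0])
print(median_survival(mutated), median_survival(wild_type))
```

Censored patients leave the risk set without lowering the curve, which is why the median can differ from the plain median of the recorded follow-up times.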
Homozygous variations are mainly recessive. In order to ascertain whether the homozygous variations found are indicative of the heterozygous form of the variant as well, we viewed the heterozygous prevalence within the UKBB Caucasian population. In only a single variant in the gene MYO1E was the prevalence in healthy individuals significantly lower than in individuals with cancer (p value = 0.04). As most of the variations have a strong cancer predisposition effect as homozygous variations, it seems that their influence is explained by a recessive inheritance mode. This phenomenon might explain the significant depletion of known CPGs within the homozygous variations in our list.
Inspecting the heritability model of previously reported CPGs 1 is in accord with our findings, showing that while about two-thirds of the genes comply with a dominant inheritance, the rest are likely to be recessive. Notably, in the most updated CPG catalog, 15% of the genes were assigned with both inheritance patterns. In our ultra-rare list, only MYBPC3 is associated with both heterozygous and homozygous variations.

Discussion
We present a list of 115 CUVs from 108 genes. Among them, 26 variants (from 23 genes) are associated with known cancer genes. Most of these variants (22) overlap with known cancer predisposition genes. Expanding the number of currently identified CPGs is crucial for better understanding of tumorigenesis and identifying various processes causing high cancer penetrance. Genetic consulting, family planning and appropriate treatment are direct outcomes of an accurate and exhaustive list of CPGs.
Known cancer predisposition variants only partially explain the cases of inherited cancer incidents. CPG identification has already impacted cancer diagnostics, therapy and prognosis 1 . Genomic tests and gene panels for certain cancer predisposition markers are commonly used for early detection and in preventative medicine 33,34 . It is likely that CPGs based on ultra-rare variants are not saturated. For example, additional CPGs including CDKN2A and NF1 were associated with an increased risk for breast cancer 35 . Specifically, CDKN2A has also been detected as a CPG in families of patients with pancreatic cancer 36 . Inspecting the function of genes associated with the 108 identified genes further supports the importance of protein modification (e.g. kinase and phosphatase function), chromatin epigenetic signatures 37 , membrane signaling, DNA repair systems and more. Numerous CUVs are present in individuals with non-melanoma skin cancer. For the most part, non-melanoma skin cancers are attributed to environmental factors. Nevertheless, studies show that there are in fact genetic components associated with the majority of non-melanoma skin cancers 25,26 . Accordingly, CUVs can unveil such rare genetic associations.
We chose to focus on cancer-exclusive variants to shed light on mostly overlooked ultra-rare cancer predisposition variants. Naturally, additional ultra-rare variants in the data set are presumably cancer inducing. Detecting these variants requires developing a broader model expanding the scope to somewhat less rare, possibly lower-penetrance variants. The impending availability of UKBB exome sequencing (150,000 exomes) will enable us to revisit the identified variants, to further refine the list of candidate CPGs (i.e., removing false-positives and adding evidence to support true CPGs) and to develop a less strict detection model.
The inherently rare nature of CUVs raises concerns about the reliability of their initial identification 38 . We overcome this hurdle by only considering as candidate CPGs those genes that are supported by additional independent genomic evidence from either the UKBB or the TCGA cohort. We nominate 23 genes as CPG candidates, two of which are known cancer drivers. As we have shown (Fig. 4), somatic mutations in the non-driver validated CPG candidates resulted in a significant negative effect on the patients' survival rate.

Materials and methods
Study population. The UKBB has recruited ~ 500,000 people from the general population of the UK, using National Health Service patient registers, with no exclusion criteria 39 . Participants were between 40 and 69 years of age at the time of recruitment, between 2006 and 2010. To avoid biases due to familial relationships, we removed 75,853 samples keeping only one representative of each kinship group of related individuals. We derived the kinship group from the familial information provided by the UKBB .fam files. Additionally, 312 samples had mismatching sex (between the self-reported and the genetics-derived) and 726 samples had only partial genotyping.
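Keeping one representative per kinship group amounts to finding connected components among related samples. The sketch below is one possible way to do this with a union-find structure; the sample IDs, the list of related pairs and the rule of keeping the smallest ID in each group are illustrative assumptions, since the text does not state how the representative was chosen.

```python
def one_per_kinship_group(sample_ids, related_pairs):
    """Return one representative sample per group of related individuals.

    sample_ids    -- iterable of sample identifiers
    related_pairs -- iterable of (id_a, id_b) pairs flagged as related
    The representative is the smallest ID in each group (an assumption).
    """
    parent = {s: s for s in sample_ids}

    def find(x):
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in related_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as root so the result is deterministic.
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return sorted({find(s) for s in sample_ids})


# Hypothetical example: s1, s2 and s3 form one kinship group.
samples = ["s1", "s2", "s3", "s4", "s5"]
pairs = [("s1", "s2"), ("s2", "s3")]
kept = one_per_kinship_group(samples, pairs)
print(kept)
```

Unrelated samples form groups of size one and are always kept.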
We divided the remaining 395,951 participants into two groups: (1) 'Caucasians': individuals that were both genetically verified as Caucasians and declared themselves as 'white'; (2) 'non-Caucasians': individuals not matching the previous criterion. The Caucasian cohort includes 325,407 individuals (42,972 of whom had cancer) and the non-Caucasian cohort includes 70,544 individuals (6,959 had cancer). We used the Caucasian cohort for our primary analysis and the non-Caucasian cohort for additional validation purposes.
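The cohort sizes above can be cross-checked against the percentages quoted in the Results with a few lines of arithmetic. The sketch uses only the counts given in the text; the variable names are ours.

```python
# Counts as stated in the text.
caucasian_total, caucasian_cancer = 325_407, 42_972
non_caucasian_total, non_caucasian_cancer = 70_544, 6_959

# Cancer-free participants in each cohort.
caucasian_free = caucasian_total - caucasian_cancer
non_caucasian_free = non_caucasian_total - non_caucasian_cancer

# Total number of participants retained after sample filtering.
retained = caucasian_total + non_caucasian_total

# Cancer-free percentages, rounded to one decimal place.
caucasian_free_pct = round(100 * caucasian_free / caucasian_total, 1)
non_caucasian_free_pct = round(100 * non_caucasian_free / non_caucasian_total, 1)

print(retained, caucasian_free, caucasian_free_pct)
print(non_caucasian_free, non_caucasian_free_pct)
```

The derived values agree with the 282,435 (86.8%) and 63,585 (90.1%) cancer-free participants reported for the two cohorts.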

Variant filtration pipeline.
We considered a heterozygous variation as cancer-exclusive when there were at least 2 cancer patients exhibiting the variation and no healthy individuals with the variation in the Caucasian cohort. We considered a homozygous variation as cancer-exclusive when there were at least 2 cancer patients exhibiting the variation (i.e., homozygous for the alternative allele) and no healthy individuals with the homozygous variation in the Caucasian cohort. The Ensembl Variant Effect Predictor 40 was used to annotate the variants.
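The two cancer-exclusivity criteria can be expressed as a small per-variant test. The following is a minimal sketch under stated assumptions: genotypes are coded as the number of alternative alleles (0, 1, 2, or None for a missing call), the function name is hypothetical, and "no healthy individuals with the variation" is read as no cancer-free carrier of any zygosity for the heterozygous case.

```python
def classify_variant(dosages, has_cancer, min_cases=2):
    """Decide whether one variant is cancer-exclusive.

    dosages    -- alt-allele count per individual (0, 1, 2 or None)
    has_cancer -- True/False per individual, in the same order
    Returns a dict with one flag per zygosity class.
    """
    het_cases = hom_cases = 0
    carrier_controls = hom_controls = 0
    for dosage, cancer in zip(dosages, has_cancer):
        if dosage is None or dosage == 0:
            continue  # missing call or non-carrier
        if cancer:
            if dosage == 1:
                het_cases += 1
            else:
                hom_cases += 1
        else:
            carrier_controls += 1
            if dosage == 2:
                hom_controls += 1
    return {
        # At least min_cases heterozygous patients, no cancer-free carrier.
        "heterozygous": het_cases >= min_cases and carrier_controls == 0,
        # At least min_cases homozygous patients, no cancer-free homozygote.
        "homozygous": hom_cases >= min_cases and hom_controls == 0,
    }


# Toy example: two heterozygous patients, no cancer-free carriers.
print(classify_variant([1, 1, 0, 0], [True, True, False, False]))
```

Under this reading, a variant can be homozygous cancer-exclusive even when cancer-free heterozygous carriers exist, which matches the separate treatment of the two classes in the text.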
We applied two additional filtration steps for the exome/splicing-region variants. The first filter was applied using the 'non-Caucasian' data set: we filtered out heterozygous variations with MAF > 0.01% and homozygous variations with homozygous frequency > 0.01% in this set. This filtration step is meant to remove variations which are mostly ethnic artifacts. The second filter was applied to assure the variations' rarity. We applied the same filter (heterozygous variations with MAF > 0.01% and homozygous variations with homozygous frequency > 0.01%), using gnomAD v2.1.1 41 . The gnomAD threshold was based on the summation of gnomAD v2.1.1 exomes and genomes. We also used gnomAD for the TCGA-germline validation, by extracting TCGA appearances from the database.
Statistical analysis. The UKBB ultra-rare variants are enriched with CPG variants. We accounted for this imbalance by calculating the expected number of cancer predisposed genes when gradually removing highly-represented genes from the ultra-rare variant collection for heterozygotes. We calculated p-values for each data-point using a two-sided binomial test.
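The enrichment calculation can be sketched as follows. This is an illustrative reading of the procedure, not the exact implementation: the gene names and counts are invented, the background fraction of CPG variants is used as the binomial success probability, and the two-sided p-value sums the probabilities of all outcomes no more likely than the observed one (one common convention among several).

```python
from math import comb


def binom_two_sided(k, n, p):
    """Exact two-sided binomial p-value for k successes in n trials."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    cutoff = pmf[k] * (1 + 1e-7)  # small tolerance for float ties
    return min(1.0, sum(q for q in pmf if q <= cutoff))


def enrichment_curve(background, cuvs, cpgs):
    """Expected vs. observed CPG variants while removing top genes.

    background -- dict gene -> number of ultra-rare variants on the array
    cuvs       -- dict gene -> number of cancer-exclusive variants
    cpgs       -- set of known cancer predisposition genes
    Returns rows of (genes removed, expected, observed, p-value).
    """
    ranked = sorted(background, key=background.get, reverse=True)
    rows = []
    for n_removed in range(len(ranked)):
        kept = ranked[n_removed:]
        total_bg = sum(background[g] for g in kept)
        cpg_bg = sum(background[g] for g in kept if g in cpgs)
        n_cuv = sum(cuvs.get(g, 0) for g in kept)
        observed = sum(cuvs.get(g, 0) for g in kept if g in cpgs)
        if total_bg == 0 or n_cuv == 0:
            break
        frac = cpg_bg / total_bg
        p_value = binom_two_sided(observed, n_cuv, frac)
        rows.append((n_removed, frac * n_cuv, observed, p_value))
    return rows


# Hypothetical toy input (not UKBB data).
background = {"GENE_A": 40, "GENE_B": 10, "GENE_C": 30, "GENE_D": 20}
cuvs = {"GENE_A": 3, "GENE_B": 2, "GENE_C": 1, "GENE_D": 1}
cpgs = {"GENE_A", "GENE_B"}
rows = enrichment_curve(background, cuvs, cpgs)
for row in rows:
    print(row)
```

Each row corresponds to one data-point of the curve: the first row uses all genes, and every further row drops the gene with the most background variants.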
We downloaded survival data from cBioPortal. The data only included survival months. We used Cox regression without covariates to calculate Hazard Ratios and confidence intervals. The results are listed in Supplementary Table S4.
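For a single binary group indicator (mutated vs. not mutated), a Cox model can be fitted with a few Newton-Raphson steps on the partial likelihood. The sketch below is a minimal pure-Python illustration with invented toy data; the Breslow handling of ties, the 95% level (z = 1.96) and the function name are assumptions, and a dedicated survival package would normally be used in practice.

```python
from math import exp, sqrt


def cox_binary(times, events, group, max_iter=50):
    """Hazard ratio for a 0/1 group indicator with a Wald 95% interval.

    times  -- follow-up time per patient
    events -- 1 if the event was observed, 0 if censored
    group  -- 1 for the group of interest, 0 for the reference group
    Returns (hazard_ratio, ci_low, ci_high).
    """
    data = list(zip(times, events, group))
    beta = 0.0
    info = 0.0
    for _ in range(max_iter):
        grad = 0.0
        info = 0.0
        for t_i, e_i, x_i in data:
            if not e_i:
                continue  # censored patients contribute only to risk sets
            s0 = s1 = 0.0
            for t_j, _, x_j in data:
                if t_j >= t_i:  # patient j is still at risk at time t_i
                    w = exp(beta * x_j)
                    s0 += w
                    s1 += w * x_j
            mean = s1 / s0
            grad += x_i - mean
            # Variance of x in the risk set; x is 0/1, so x**2 == x.
            info += mean * (1.0 - mean)
        if info == 0.0:
            raise ValueError("no information: both groups must share risk sets")
        step = grad / info
        beta += step
        if abs(step) < 1e-10:
            break
    se = 1.0 / sqrt(info)
    return exp(beta), exp(beta - 1.96 * se), exp(beta + 1.96 * se)


# Toy data (assumption): first five patients carry a somatic mutation.
times = [5, 8, 12, 20, 30, 10, 15, 25, 40, 50]
events = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
group = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
hr, ci_low, ci_high = cox_binary(times, events, group)
print(hr, ci_low, ci_high)
```

A hazard ratio above 1 indicates a higher event rate in the group coded 1; an interval that includes 1 means the difference is not significant at the chosen level.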
Rare variants reliability. Our CUV collection includes variants that appeared at least twice in the filtered Caucasian cohort, thereby evading many SNP-genotyping inaccuracies 38 . We further ascertain the validity of prominent variants with additional genomic evidence.
Cancer type definition. The UKBB provides an ICD-10 code for each diagnosed condition. We considered individuals diagnosed with a malignant neoplasm (ICD-10 codes C00-C97) as individuals with cancer, and all others as cancer-free individuals. The codes were aggregated to improve data readability using the assembly described in Supplementary Table S1.
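The case definition reduces to a range check on the three-character ICD-10 category. The following sketch shows one way to apply it; the function names and the example diagnosis lists are hypothetical, and codes are accepted with or without a decimal point.

```python
def is_malignant(icd10_code):
    """True if the code falls in the malignant neoplasm block C00-C97."""
    code = icd10_code.strip().upper()
    if len(code) < 3 or code[0] != "C" or not code[1:3].isdigit():
        return False
    return 0 <= int(code[1:3]) <= 97


def has_cancer(diagnoses):
    """True if any recorded diagnosis is a malignant neoplasm."""
    return any(is_malignant(code) for code in diagnoses)


# Hypothetical participants with their recorded ICD-10 codes.
print(has_cancer(["I10", "C509"]))   # hypertension plus a C50 code
print(has_cancer(["D05.1", "E11"]))  # no code in C00-C97
```

Codes outside the block, such as in situ neoplasms (D00-D09), do not count toward the cancer group under this definition.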
Ethical approval. All methods were performed in accordance with the relevant guidelines and regulations.
UKBB approval was obtained as part of the project 26664. Ethical approval for this study was obtained from the