Introduction

The global burden of cancer is substantial, with an estimated 18.1 million individuals diagnosed each year and approximately 9.6 million deaths attributed to the disease1. Efforts toward cancer prevention, screening, and treatment are thus imperative, but they require a more comprehensive understanding of the underpinnings of carcinogenesis than we currently possess. While studies of twins2, families3, and unrelated populations4,5,6 have demonstrated substantial heritability and familial clustering for many cancers, the extent to which genetic variation is unique versus shared across different types of cancer remains unclear.

Genome-wide association studies (GWAS) of individual cancers have identified loci associated with multiple cancer types, including 1q32 (MDM4)7,8; 2q33 (CASP8-ALS2CR12)9,10; 3q28 (TP63)11,12; 4q24 (TET2)13,14; 5p15 (TERT-CLPTM1L)9,12; 6p21 (HLA complex)15,16; 7p1517; 8q2412,18; 11q1318,19; 17q12 (HNF1B)18,20; and 19q13 (MERIT40)21. In addition, recent studies have tested single-nucleotide polymorphisms (SNPs) previously associated with one cancer to discover pleiotropic associations with other cancer types22,23,24,25. Consortia, such as the Genetic Associations and Mechanisms in Oncology, have looked for variants and pathways shared by breast, colorectal, lung, ovarian, and prostate cancers26,27,28,29,30. Comparable studies for other cancers—including those that are less common—have yet to be reported.

In addition to individual variants, recent studies have evaluated genome-wide genetic correlations between pairs of cancer types4,5,6. One evaluated 13 cancer types and found shared heritability between kidney and testicular cancers, diffuse large B-cell lymphoma (DLBCL) and osteosarcoma, DLBCL and chronic lymphocytic leukemia (CLL), and bladder and lung cancers4. Another study of six cancer types found correlations between colorectal cancer and both lung and pancreatic cancers5. In an updated analysis with increased sample size, the same group identified correlations of breast cancer with colorectal, lung, and ovarian cancers and of lung cancer with colorectal and head/neck cancers6. While these studies provide compelling evidence for shared heritability across cancers, they lack data on several cancer types (e.g., cervix, melanoma, and thyroid).

Here, we present analyses of genome-wide SNP data on 18 cancer types, examining 408,786 individuals of European ancestry from two large, independent, and contemporary cohorts unselected for phenotype—the UK Biobank (UKB) and the Kaiser Permanente Genetic Epidemiology Research on Adult Health and Aging (GERA) cohorts. We seek to detect risk SNPs and pleiotropic loci and variants and to estimate the heritability of and genetic correlations between cancer types. We then conduct in silico functional analyses of pleiotropic variants to catalog biological mechanisms potentially shared across cancers. Leveraging the wealth of individual-level genetic and phenotypic data from both cohorts allows us to extensively interrogate the shared genetic basis of susceptibility to different cancer types, with the ultimate goal of better understanding common genetic mechanisms of carcinogenesis and improving risk assessment. We find widespread pleiotropy that offers further insights into the complex genetic architecture of cross-cancer susceptibility.

Results

Genome-wide association analyses of individual cancers

We found 21 previously unreported genome-wide significant associations between variants and cancers at P < 5 × 10−8 upon meta-analysis of the UKB and GERA results (Table 1). These included 20 unique variants, with 1 variant that was associated with two cancers (rs78378222). Nine of these 21 associations were in known susceptibility regions for the cancer of interest but independent of previously reported variants (r2 < 0.1; see “Methods”). The remaining 12 were in regions not previously associated with the cancer of interest in individuals of European ancestry. Fourteen of these 21 associations indicated pleiotropy in that the relevant variants were in regions previously associated with at least one of the other cancer types evaluated in this study (Table 1). The effect estimates for these 21 associations were not materially changed when stratified by age at diagnosis, Surveillance, Epidemiology, and End Results Program (SEER) grade, or SEER stage (heterogeneity P > 0.05/[number of strata and variants]; see “Methods”).

Table 1 Previously unreported genome-wide significant loci from meta-analysis of UKB and GERA SNPs for each cancer site.

In addition, there were nine previously unreported variants associated with cancers at P < 5 × 10−8 that were only genotyped or imputed in one cohort (Supplementary Data 1; yellow rows). For the sake of completeness and future efforts, Supplementary Data 1 also includes the 21 associations from Table 1 (green rows) and an additional 113 suggestive associations (P < 1 × 10−6) independent of previously reported results. Finally, we replicated 308 independent cancer risk variants identified as GWAS significant by previous studies (Supplementary Data 2; P < 1 × 10−6).

In genome-wide sensitivity analyses in the UKB cohort restricted to incident cases (i.e., excluding prevalent cases), our findings for significant and suggestive associations were essentially unchanged (heterogeneity P > 0.05/[number of variants per cancer]; see “Methods”; Supplementary Fig. 1). Similarly, genome-wide sensitivity analysis results in the UKB cohort for esophageal and stomach cancers separately were comparable to those for the two phenotypes combined (heterogeneity P > 0.05/6; see “Methods”; Supplementary Fig. 2).

Genome-wide heritability and genetic correlation

Array-based heritability estimates across cancers ranged from h2 = 0.04 (95% CI: 0.00–0.13) for oral cavity/pharyngeal cancer to h2 = 0.26 (95% CI: 0.15–0.38) for testicular cancer (Table 2). For some of the cancers, our array-based heritability estimates were comparable to twin- or family-based heritability estimates2,3 but were more precise. Several were also similar to array-based heritability estimates from consortia comprised of multiple studies4,5,6. One of our highest heritability estimates was observed for thyroid cancer (h2 = 0.21; 95% CI: 0.09–0.33), a cancer that has not been evaluated in other array-based studies.

Table 2 Heritability estimates (h2) and 95% confidence intervals (CIs) for each cancer based on the union set of UKB and GERA SNPs and previous estimates.

Among pairs of cancers, only colon and rectal cancers (rg = 0.85, P = 5.33 × 10−7) were genetically correlated at a Bonferroni-corrected significance threshold of P = 0.05/153 = 3.27 × 10−4 or using a false discovery rate (FDR) threshold of q < 0.1 (Fig. 1a, Table 3 and Supplementary Data 3). However, at a nominal threshold of P = 0.05, we observed suggestive relationships between 11 other pairs. Seven pairs showed positive correlations: esophageal/stomach cancer was correlated with Non-Hodgkin’s lymphoma (NHL; rg = 0.40, P = 0.0089), breast (rg = 0.26, P = 0.0069), lung (rg = 0.44, P = 0.0035), and rectal (rg = 0.32, P = 0.024) cancers; bladder and breast cancers (rg = 0.22, P = 0.017); melanoma and testicular cancer (rg = 0.23, P = 0.028); and prostate and thyroid cancers (rg = 0.23, P = 0.013). The remaining four pairs showed negative correlations: endometrial and testicular cancers (rg = −0.41, P = 0.0064); esophageal/stomach cancer and melanoma (rg = −0.27, P = 0.038); lung cancer and melanoma (rg = −0.28, P = 0.0048); and NHL and prostate cancer (rg = −0.21, P = 0.012).

Fig. 1: Cross-cancer genetic correlations (rg) calculated via LD-score regression (LDSC) and associated cancers from the locus-specific pleiotropy analysis.
figure 1

a Cancer pairs are connected if the genetic correlation had P < 0.05, width of the line is proportional to magnitude of rg, color of the line indicates direction of correlation (red is negative and blue is positive), and shading is proportional to strength of association according to P, where the Bonferroni-corrected threshold is 0.05/153 = 3.27 × 10−4; b cancer pairs are connected by a line (each line represents one region) if a region contains any SNPs associated with either cancer, where regions are formed around index SNPs with P < 5 × 10−8 for any cancer in the cancer-specific meta-analyses and SNPs are added if they have P < 5 × 10−8 for any cancer, are within 500 kb of the index SNP, and have LD r2 > 0.5 with the index SNP.

Table 3 Cross-cancer genetic correlations (rg) calculated via LD-score regression (LDSC) for all cancer pairs with P < 0.05.

Locus-specific pleiotropy

We detected 25 pleiotropic regions associated with more than one cancer (P < 5 × 10−8 for each cancer; independent regions were defined using our linkage disequilibrium [LD] clumping procedure; see “Methods”; Fig. 1b and Supplementary Table 1). Most were at known cancer pleiotropic loci: HLA (14 regions), 8q24 (7 regions), TERT-CLPTM1L (2 regions), and TP53 (1 region). All of the HLA regions were associated with both cervical cancer and NHL. Five regions in 8q24 were associated with prostate and colon cancers (one also associated with rectal cancer), and two were associated with prostate and breast cancers. Of the regions in TERT-CLPTM1L, one was associated with breast cancer and melanoma, and the other was associated with melanoma and cervical and pancreatic cancers. The TP53 region, indexed by rs78378222, was associated with melanoma and lymphocytic leukemia. The remaining pleiotropic region, indexed by rs6507874, was in SMAD7, which has been previously linked to colorectal cancer31, and we confirmed its association with colon and rectal cancers separately.

Genome-wide variant-specific pleiotropy

We assessed variant-specific pleiotropy by testing all variants genome-wide using the summary statistics for each cancer using ASSET. We found 85 independent (LD r2 < 0.1) one-directional pleiotropic variants with at least two associated cancers, the same direction of effect for all associated cancers, and an overall pleiotropic P < 5 × 10−8 (Supplementary Data 4). Of these one-directional pleiotropic variants, there were 17 for which the overall pleiotropic P was smaller than the P for each of the associated cancers (Fig. 2 and Table 4). While 84 of the 85 one-directional pleiotropic variants were in regions that have previously been associated with any cancer, 68 were associated with at least one cancer not previously reported. The variant in a region not previously associated with any cancer is rs150260898, intronic of RABIF5, which was associated with melanoma and oral cavity/pharyngeal cancer.

Fig. 2: Manhattan plot displaying one-directional variant-specific pleiotropy from ASSET.
figure 2

The red dashed line represents the genome-wide significance threshold (P < 5 × 10−8), and the black dotted line represents a suggestive threshold (P < 1 × 10−6). Highlighted in purple are genome-wide significant loci where the overall pleiotropic P is less than all individual P for the selected cancers. Highlighted in green are the genome-wide significant loci where the overall pleiotropic P is greater than at least one of the individual P for the selected cancers. All highlighted loci are independent of bidirectional SNPs with smaller overall P.

Table 4 Top independent variants from the one-directional variant-specific pleiotropic analysis.

We also considered bidirectional pleiotropic associations, wherein the same allele for a given variant was associated with an increased risk for some cancers but a decreased risk for others. We found 15 such variants with P < 5 × 10−8, all of which were independent from one another and from the one-directional pleiotropic variants (LD r2 < 0.1; Fig. 3, Table 5 and Supplementary Data 5). There were eight variants where the overall pleiotropic P was smaller than the P for the associated cancers. While all of the bidirectional pleiotropic variants were in regions that have previously been associated with cancer, six were independent of known risk variants, and all 15 were associated with at least one cancer not previously reported.

Fig. 3: Manhattan plot displaying bidirectional variant-specific pleiotropy from ASSET.
figure 3

The red dashed line represents the genome-wide significance threshold (P < 5 × 10−8), and the black dotted line represents a suggestive threshold (P < 1 × 106). Highlighted are loci with overall pleiotropic P < 5 × 10−8, the two directional P < 0.05, and not in LD with a one-directional SNP with smaller P. Loci in purple are genome-wide significant loci where the overall pleiotropic P is less than all individual P for the selected cancers, and loci in green are genome-wide significant loci where the overall pleiotropic P is greater than at least one of the individual P for the selected cancers.

Table 5 Top independent variants from the bidirectional variant-specific pleiotropic analysis.

For any pair of cancers associated with the same variant, the type of association falls in one of three categories: (1) SNPs identified in the one-directional analysis, where all associations are in the same direction; (2) SNPs identified in the bidirectional analysis, where both cancers in the pair are associated in the same direction (both risk increasing or both risk decreasing), even though at least one other cancer is associated in the opposite direction; and (3) SNPs identified in the bidirectional analysis, where the pair of cancers are associated in opposite directions (one risk increasing and one risk decreasing). For each of the possible 153 pairs of cancers, we tabulated how many of the 100 pleiotropic SNPs fall into each category (Fig. 4a and Supplementary Data 6). The number of one- and bidirectional SNPs shared by cancer pairs ranged from one (bladder and breast) to 13 (lymphocytic leukemia and testis) (Fig. 4a and Supplementary Data 6). For 30 cancer pairs, the shared associations had exclusively the same direction of effect (i.e., tabulating across the first two categories of pleiotropic SNPs). For three cancer pairs, at least 50% of the shared variants were associated in opposite directions.

Fig. 4: Summary of cancer pairs associated with and functional consequences of the 100 one- and bidirectional pleiotropic variants.
figure 4

a The number of pleiotropic variants (of the independent 100 one- and bidirectional variants with overall pleiotropic P < 5 × 10−8) associated with each pair of cancers by type of pleiotropic effect for select cancer pairs using ASSET: SNPs identified in the one-directional analysis, where all associations are in the same direction (navy); SNPs identified in the bidirectional analysis, where both cancers in the pair are associated in the same direction (both risk increasing or both risk decreasing), even though at least one other cancer is associated in the opposite direction (blue); and SNPs identified in the bidirectional analysis, where the pair of cancers are associated in opposite directions (one risk increasing and one risk decreasing) (green). b The distribution of variant consequences and corresponding enrichment, calculated using Fisher’s exact test comparing the proportion of variants belonging to each functional class observed among the 100 ASSET variants to all variants in the UK Biobank. Pleiotropic variants were enriched in intergenic (P = 0.043) and non-coding RNA transcripts (P = 0.015). c Venn diagram summarizing the number of variants with specific regulatory elements, based on analyses of chromatin features from Roadmap and expression quantitative trait loci (eQTL) associations. d Distribution of DeepSEA functional significance scores, providing an integrated summary score based on evolutionary conservation and chromatin data, with 0 denoting variants most likely to be functional.

For each of the 100 independent SNPs showing either one- or bidirectional pleiotropy (Supplementary Data 4–5), we assessed whether the results differed according to age at diagnosis, SEER grade, or SEER stage for any of the associated cancers. After correcting for the number of SNPs and strata tested, only a single one-directional pleiotropic SNP showed heterogeneity across case subtypes. rs111362352-C was significantly positively associated with the risk of low grade prostate cancer in GERA, while it was not associated with high-grade disease. These results are consistent with previous findings for this SNP (or SNPs in strong LD): the C allele has been associated with lower Gleason score, and it is located at KLK3, the prostate-specific antigen gene, which may reflect its previous association with lower grade prostate cancer32,33.

Functional characterization of pleiotropic variants

The biological significance of these 100 independent pleiotropic variants (Supplementary Data 4–5) was evaluated using in silico annotation tools (Supplementary Data 7)34,35,36. Pleiotropic variants were enriched in intergenic (P = 0.043) and non-coding RNA transcripts (P = 0.015) compared to all variants in the reference panel of UKB European descent individuals (Fig. 4b). The distribution of DeepSea functional significance scores was skewed toward 0 (P = 7.3 × 10−4), indicating a higher likelihood of regulatory effects compared to a reference distribution of 1000 Genomes variants (Fig. 4d). Suggestively functional variants (n = 26, DeepSEA score < 0.05) were also predicted to be pathogenic by Combined Annotation-Dependent Depletion36 (CADD; mean score of 10.66, corresponding to the top 10% of deleterious substitutions). Twenty-two of the 100 pleiotropic variants were characterized by active chromatin states, 33 were classified as enhancers, and 64 had significant (FDR < 0.05) effects on gene expression (Fig. 4c). Five variants belonged to all three classes (Fig. 4c).

Consistent with hypothesized pleiotropy, 78.1% of the 64 expression quantitative trait loci (eQTLs) identified among the pleiotropic variants had more than one target tissue, and 78.1% influenced the expression of more than one gene (Supplementary Fig. 3), for a total of 596 significant SNP-gene pairs. The most common expression tissues for eQTLs among pleiotropic variants were whole blood (49.2%), followed by adipose (14.8%) and esophageal (4.7%) tissues. Regulatory effects mediated by chromatin looping were observed for 28 variants, including 3 enhancer-promoter links in 6p21.23 (rs535777, rs73728618) and 22q13.2 (rs5759167, PACSIN2 promoter; Supplementary Fig. 4). Notably, rs5759167 is an eQTL for PACSIN2 in whole blood (BIOS QTL: P = 9.89 × 10−14; GTEx v8: P = 3.39 × 10−7).

The functional profile of the 100 pleiotropic variants was significantly different across multiple features when compared to a randomly selected set of 100 independent variants. Pleiotropic variants had a significantly higher proportion of enhancers (P = 3.38 × 10−4), eQTLs (P = 3.38 × 10−4; >1 tissue: P = 2.33 × 10−3; >1 gene: P = 1.34 × 10−4), and chromatin interactions (P = 3.48 × 10−4). Pleiotropic variants did not have a significantly higher proportion in active chromatin states (P = 0.48).

Genes represented by pleiotropic variants were significantly enriched for 36 KEGG pathways that formed two clusters broadly characterized by immune-related functions and cancer-specific genes (Supplementary Table 2 and Supplementary Fig. 5). Top-ranking pathways in the first cluster included antigen processing and presentation (P = 4.29 × 10−6), cell adhesion molecules (P = 4.29 × 106), allograft rejection (P = 4.29 × 10−6), cancer-related infections (human T cell leukemia virus 1: P = 4.35 × 106; Epstein-Barr virus: P = 3.49 × 105), and autoimmune diseases (type I diabetes: P = 9.84 × 106; inflammatory bowel disease: P = 1.16 × 103). The second cluster was enriched for genes related to multiple cancers (gastric: P = 9.94 × 105; small cell lung cancer: P = 3.14 × 10−3; prostate: P = 3.65 × 10−3), drug resistance (endocrine resistance: P = 2.55 × 10−4), and cellular senescence (P = 0.014).

Discussion

In this study of cancer pleiotropy in two large cohorts, we found multiple lines of evidence for a shared genetic basis of several cancer types. By characterizing pleiotropy at the genome-wide, locus-specific, and variant-specific levels for a large number of cancer sites, we generated several insights into cancer susceptibility. Specifically, we detected 21 previously unreported genome-wide significant variant associations across 11 of the 18 individual cancers examined. We also detected 100 independent variants displaying one- or bidirectional pleiotropy that were enriched for a number of regulatory functions that reflect hallmarks of carcinogenesis.

One notable finding from our cervical cancer GWAS was rs10175462 in PAX8 on 2q13, which, to our knowledge, is the first genome-wide significant cervical cancer risk SNP identified outside of the HLA region in a European ancestry population15. In a candidate SNP study of PAX8 eQTLs in a Han Chinese population, two variants in LD with rs10175462 in Europeans (rs1110839, r2 = 0.33; rs4848320, r2 = 0.34) were suggestively associated with cervical cancer risk in the same direction37. Several GWAS findings also provided evidence of pleiotropy, in that previously unreported risk variants for one cancer had known associations with one or more other cancers. For instance, rs9818780 was associated with melanoma and has been implicated in sunburn risk38. This intergenic variant is an eQTL for LINC00886 and METTL15P1 in skin tissue. The former gene has previously been linked to breast cancer39, and both genes have been implicated in ovarian cancer40. Beyond the previously unreported associations, our GWAS detected 308 independent associations with P < 1 × 10−6 that confirmed signals identified in previous GWAS with P < 5 × 108. This finding strengthened our confidence in using our genome-wide summary statistics for subsequent analyses of cancer pleiotropy.

In evaluating pairwise genetic correlations between the 18 cancer types, we observed the strongest signal for colon and rectal cancers—an expected relationship consistent with findings from a twin study41. We also identified several cancer pairs for which the genetic correlations were nominally significant. One pair supported by previous evidence is melanoma and testicular cancer; some studies have found that individuals with a family history of the former are at an increased risk for the latter42,43. Esophageal/stomach cancer was a component of five correlated pairs—with melanoma, NHL, and breast, lung, and rectal cancers. Despite some similarities between esophageal and stomach cancers, testing them as a combined phenotype may have inflated the number of correlated cancers.

Our genetic correlation results contrast with some previous findings4,5,6; we did not find several correlations that they did and found others that they did not. The differences may be partly due to a smaller number of cases in our cohorts for some sites. Further studies with larger sample sizes are necessary to validate our correlations, as those that did not attain Bonferroni-corrected significance may have been due to chance. However, we achieved comparable or higher cancer-specific heritability estimates for breast, colon, and lung cancers, which suggests that differences in study design may also play a role. Previous analyses aggregated case–control studies recruited during different time periods. While such meta-analyses can be effective at reducing residual population stratification, our extensive quality control processes also seemingly mitigated population stratification; the mean λGC across the 18 cancers was 1.02 (standard deviation = 0.027). Moreover, our design allowed for the assessment of cross-cancer relationships in the same set of individuals and the examination of several cancers that have yet to be studied in large consortia.

The assessment of pleiotropy at the locus level confirmed previously reported associations at 5p15.33, HLA, and 8q24 (refs. 9,12,15,16,18). Out of the 25 pleiotropic loci that we identified, most were at these known cancer pleiotropic loci. Over half, all in the HLA locus, were associated with cervical cancer and NHL. The two cancers were weakly negatively correlated in the two cohorts combined and nominally significantly negatively correlated in the UKB alone (Supplementary Data 8). The difference may reflect better coverage and imputation of the HLA region in the UKB than in GERA.

Variant-specific analyses provided further evidence in support of locus-specific cancer pleiotropy, including validation of previously reported signals at 1q32 (refs. 7,8) and 2q33 (refs. 9,10) (ALS2CR12). Interestingly, our lead 1q32 variant (rs1398148) maps to PIK3C2B and is in LD (r2 > 0.60) with known MDM4 cancer risk variants7,8, suggesting that the 1q32 locus may be involved in modulating both p53-and PI3K-mediated oncogenic pathways. The 100 independent pleiotropic variants (with overall pleiotropic P < 5 × 10−8) mapped to a total of 56 genomic locations (defined by cytoband), which included the six genomic locations to which all 25 of the regions identified from the locus-specific analysis map. Although 99 of the 100 variants showing one- or bidirectional pleiotropic associations are in regions previously associated with cancer, 83 of the 99 were associated with at least one cancer not previously reported.

Out of 100 independent variants identified from the variant-specific pleiotropy analyses, 17 were in 8q24 and 15 were in the HLA region. Different distributions of one- and bidirectional results highlight patterns of directional pleiotropy: of the 15 HLA variants, 7 were bidirectional, while only three of the 17 variants in 8q24 were bidirectional. The HLA region is critical for innate and adaptive immune response and has a complex relationship with cancer risk. Heterogeneous associations with HLA haplotypes have been reported for different subtypes of NHL44 and lung cancer45, suggesting that relevant risk variants are likely to differ within, as well as between, cancers. Studies have further demonstrated that somatic mutation profiles are associated with HLA class I (ref. 46) and class II alleles47. Specifically, mutations that create neoantigens more likely to be recognized by specific HLA alleles are less likely to be present in tumors from patients carrying such alleles. It is thus possible that some of the positive and negative pleiotropy we identified is related to mutation type. These results reinforce the importance of the immune system playing a role in cancer susceptibility.

In contrast to the HLA region, the majority of the 8q24 pleiotropic variants had the same direction of effect for all associated cancers, implying the existence of shared genetic mechanisms driving tumorigenesis across sites. The proximity of the well-characterized MYC oncogene makes it a compelling candidate for such a consistent, one-directional effect. It could work via regulatory elements, such as acetylated and methylated histone marks48. Consistent with this hypothesis, we observed heritability enrichment49 for variants with the H3K27ac annotation for breast (P = 3.09 × 10−4), colon (P = 4.44 × 10−4), prostate (P = 2.74 × 10−5), and rectal (P = 0.036) cancers—all of which share susceptibility variants in 8q24, according to our analyses and previous studies48.

In silico analyses found the 100 pleiotropic variants to be enriched across multiple regulatory domains compared to non-pleiotropic randomly selected variants and highlighted cross-cancer susceptibility loci. The 11q13.3 region includes rs12275055, which maps to active enhancers and is also an eQTL for TPCN2, a gene involved in controlling the angiogenic response to VEGF and extracellular vesicle trafficking in cancer cells50,51. An additional interesting region, 22q13.2, is indexed by rs5759167, an intergenic variant linked to prostate and lung cancers risk. Its pleiotropic effects are likely mediated by regulation of PACSIN2, which codes for a cyclin D1 binding partner that serves as a brake for CCND1-mediated cellular migration52. This is consistent with our observation that that the risk-increasing G-allele is associated with increased PACSIN2 expression in whole blood53. Lastly, our pathway analysis indicated that pleiotropic variants as a group are enriched for genes involved in immune regulation and infection, as well as cancer development and progression. Our in silico findings highlight loci that are good candidates for investigation in future in vivo studies.

It is important to acknowledge some limitations of our study. First, counts for some of the cancer types were limited. However, small sample sizes are partially offset by the advantages of using two population-based cohorts. Second, due to the complexity of the LD structure in the HLA region, we may have overestimated the number of distinct, independent signals. Slight overestimation, however, does not affect our overall conclusions regarding the pleiotropic nature of this region. Third, our analyses included both prevalent and incident cases. Nevertheless, sensitivity analyses restricted to incident cancers yielded comparable results. Fourth, we grouped esophageal and stomach cancers despite possible differences in their risk factor profiles. However, there is precedent for using a composite phenotype54, and analyses of stomach and esophageal tumors suggest that they have many overlapping molecular features55,56. In addition, sensitivity analyses for each cancer alone gave similar results, suggesting that they may have similar genetic bases despite potentially having different environmental risk factors. Fifth, we focused solely on individuals of European ancestry. Further analyses are needed to accurately assess patterns of pleiotropy in non-Europeans. Finally, the two distinct cohorts studied here—the UKB and GERA—were recruited from different populations and time periods and were genotyped with different versions of Axiom GWAS arrays. Only variants genotyped or well-imputed across the cohorts were combined in our meta-analysis. Moreover, studying two cohorts provides complementary evidence for pleiotropy.

The characterization of pleiotropy is fundamental to understanding the genetic architecture of cross-cancer susceptibility and its biological underpinnings. The availability of two large, independent cohorts provided an opportunity to efficiently evaluate the shared genetic basis of many cancers, including some not previously studied together. The result was a multifaceted assessment of common genetic factors implicated in carcinogenesis, and our findings illustrate the importance of investigating different aspects of cancer pleiotropy. Broad analyses of genetic susceptibility and targeted analyses of specific loci and variants may both contribute insights into different dimensions of cancer pleiotropy. Future studies should consider the contribution of rare variants to cancer pleiotropy and aim to elucidate the functional pathways mediating associations observed at pleiotropic regions. Such research, combined with our findings, has the potential to inform drug development, risk assessment, and clinical practice toward reducing the burden of cancer.

Methods

Study populations and phenotyping

The UKB is a population-based cohort of 502,611 individuals in the United Kingdom. Study participants were aged 40–69 at recruitment between 2006 and 2010, at which time all participants provided detailed information about lifestyle and health-related factors and provided biological samples57. GERA participants were drawn from adult Kaiser Permanente Northern California (KPNC) health plan members who provided a saliva sample for the Research Program on Genes, Environment and Health (RPGEH) between 2008 and 2011. Individuals included in this study were selected from the 102,979 RPGEH participants who were successfully genotyped as part of GERA and answered a baseline survey concerning lifestyle and medical history58,59.

Cancer cases in the UKB were identified via linkage to various national cancer registries established in the early 1970s57. Data in the cancer registries are compiled from hospitals, nursing homes, general practices, and death certificates, among other sources. The latest cancer diagnosis in our data from the UKB occurred in August 2015. GERA cancer cases were identified using the KPNC Cancer Registry, including all diagnoses captured through June 2016. Following SEER standards, the KPNC Cancer Registry contains data on all primary cancers (i.e., cancer diagnoses that are not secondary metastases of other cancer sites; excluding non-melanoma skin cancer) diagnosed or treated at any KPNC facility since 1988.

In both cohorts, individuals with at least one recorded prevalent or incident diagnosis of a borderline, in situ, or malignant primary cancer were defined as cases for our analyses. Individuals with multiple cancer diagnoses were classified as a case only for their first cancer. For the UKB, all diagnoses described by International Classification of Diseases (ICD)-9 or ICD-10 codes were converted into ICD-O-3 codes; the KPNC Cancer Registry already included ICD-O-3 codes. We then classified cancers according to organ site using the SEER site recode paradigm60. We grouped all esophageal and stomach cancers and, separately, all oral cavity and pharyngeal cancers to ensure sufficient statistical power. The 18 most common cancer types (except non-melanoma skin cancer) were examined. Testicular cancer data were obtained from the UKB only due to the small number of cases in GERA.

Controls were restricted to individuals who had no record of any cancer in the relevant registries, who did not self-report a prior history of cancer (other than non-melanoma skin cancer), and, if deceased, who did not have cancer listed as a cause of death. Individuals whose first cancer diagnosis was for a cancer not among our 18 cancers of interest were excluded. For analyses of sex-specific cancer sites (breast, cervix, endometrium, ovary, prostate, and testis), controls were restricted to individuals of the appropriate sex.

Quality control

For the UKB population, genotyping was conducted using either the UKB Axiom array (436,839 total; 408,841 self-reported European) or the UK BiLEVE array (49,747 total; 49,746 self-reported European)57. The former is an updated version of the latter, such that the two arrays share over 95% of their marker content. UKB investigators undertook a rigorous quality control (QC) protocol57. Genotype imputation was performed using the Haplotype Reference Consortium as the main reference panel and the merged UK10K and 1000 Genomes phase 3 reference panels for additional data, resulting in a unified set of 93,095,623 imputed SNPs57, which is used for all analyses. Ancestry principal components (PCs) were computed using fastPCA based on a set of 407,219 unrelated samples and 147,604 genetic markers57.

For GERA participants, genotyping was performed using an Affymetrix Axiom array (Affymetrix, Santa Clara, CA, USA) optimized for individuals of European race/ethnicity. Details about the array design, estimated genome-wide coverage, and QC procedures have been published previously59,61. The genotyping produced high-quality data with average call rates of 99.7% and average SNP reproducibility of 99.9%. Variants that were not directly genotyped (or that were excluded by QC procedures) were imputed to generate genotypic probability estimates. After pre-phasing genotypes with SHAPE-IT v2.5, IMPUTE2 v2.3.1 was used to impute SNPs relative to the cosmopolitan reference panel from 1000 Genomes. Ancestry PCs were computed based on 144,799 high-performing SNPs using the smartpca program in the EIGENSOFT4.2 software package58.

For both cohorts, analyses were limited to self-reported European ancestry individuals for whom self-reported and genetic sex matched. To further minimize potential population stratification, we excluded individuals for whom either of the first two ancestry PCs fell outside five standard deviations of the mean of the population. Based on a subset of genotyped autosomal variants with minor allele frequency (MAF) ≥ 0.01 and genotype call rate ≥97%, we excluded samples with call rates <97% and/or heterozygosity more than five standard deviations from the mean of the population. With the same subset of SNPs, we used KING to estimate relatedness among the samples. We excluded one individual from each pair of first-degree relatives, first prioritizing on maximizing the number of the cancer cases relevant to these analyses and then maximizing the total number of individuals in the analyses. Our study population ultimately included 408,786 UKB participants and 66,526 GERA participants. We excluded SNPs with imputation quality score (r2INFO) <0.3, call rate <95% (alternate allele dosage required to be within 0.1 of the nearest hard call to be non-missing; UKB only), Hardy–Weinberg equilibrium P among controls <1 × 10−5, and/or MAF < 0.01, leaving 8,876,519 variants for analysis for the UKB and 8,973,631 for GERA.

For indels, the r2INFO scores indicated extremely high accuracy, ranging from 0.81 to 0.99 in the UKB (median = 0.99) and from 0.72 to 0.99 in GERA (median = 0.99) (Supplementary Data 1–2). In addition, the correlation was very high between imputed and sequenced genotypes for 44 EUR samples from the 1000 Genomes Project genotyped with the Axiom UK Biobank array and imputed using the 1KGP WGS Phase 3 reference panel: the average r2 was 0.97 for SNPs and 0.90 for indels (MAF > 0.01; Jeremy Gollub, Personal Communication).

Genome-wide association analyses of individual cancers

We used PLINK to implement within-cohort logistic regression models of additively modeled SNPs genome-wide, comparing cases of each cancer type to cancer-free controls. All models were adjusted for age at specimen collection, sex (non-sex-specific cancers only), first ten ancestry PCs, genotyping array (UKB only), and reagent kit used for genotyping (Axiom v1 or v2; GERA only). Case counts ranged from 471 (pancreatic cancer) to 13,903 (breast cancer) in the UKB and from 162 (esophageal/stomach cancer) to 3978 (breast cancer) in GERA (Supplementary Table 3). Control counts were 359,825 (189,855 females) and 50,525 (29,801 females) in the UKB and GERA, respectively. After separate GWAS were conducted in each cohort, association results for the 7,846,216 SNPs in both cohorts were combined via meta-analysis. For variants that were only examined in one cohort (22% of the total 10,003,934 SNPs analyzed), original summary statistics were merged with the meta-analyzed SNPs to create a union set of SNP statistics for each cancer for use in downstream analyses (Supplementary Fig. 6).

To determine independent signals in our union set of SNPs, we implemented the LD clumping procedure in PLINK based on genotype hard calls from a reference panel comprised of a downsampled subset of 10,000 random UKB participants. For each cancer separately, LD clumps were formed around index SNPs with the smallest P not already assigned to another clump. While only variants with P < 5 × 10−8 were considered significant, to also identify suggestive variants for supplementary results, in each clump, index SNPs had a suggestive association based on P < 1×10−6, and SNPs were added if they were marginally significant with P < 0.05, were within 500 kb of the index SNP, and had r2 > 0.1 with the index SNP. To confirm independence, we implemented GCTA’s conditional and joint analysis (COJO) method with the aforementioned downsampled subset of UKB participants as a reference panel, performing stepwise selection of the index SNPs within a ±1000 kb region of one another. SNPs were deemed independent if they maintained a P < 1 × 10−6 in the joint model. The remaining independent variants were determined to be novel if they were independent of previously reported risk variants in European ancestry populations (as described below).

To identify SNPs previously associated with each cancer type, we abstracted all genome-wide significant SNPs from relevant GWAS published through June 2018. We determined that a SNP was potentially novel if it had LD r2 < 0.1 with all previously reported SNPs for the relevant cancer based on both the UKB reference panel and the 1000 Genomes EUR superpopulation via LDlink. As an additional filter for novelty, we again used COJO to condition each potentially novel SNP on previously reported SNPs for the relevant cancer using the UKB reference panel, and SNPs were not considered novel if they did not maintain P < 1 × 10−6 in the joint model. To confirm novelty and consider pleiotropy, we conducted an additional literature review to investigate whether these SNPs had previously been reported for the same or other cancers, including those not attaining genome-wide significance and those in non-GWAS analyses. For this additional review, we used the PhenoScanner database to search for SNPs of interest and variants in LD in order to comprehensively scan previously reported associations. We then supplemented with more in-depth PubMed searches to determine if the genes in which novel SNPs were located had previously been reported for the same or other cancers. Finally, for cancers with publicly available summary statistics (breast [>120,000 cases]39, prostate [~80,000 cases]62, and ovarian [~30,000 cases]40), we tested our potentially previously unreported SNPs with P < 1 × 10−6 for replication (defined as having the same direction of effect and P < 0.05). Tested SNPs that did not replicate were not considered previously unreported.

We considered whether clinical characteristics of the cases were informative about associated phenotypes by examining SEER stage and grade (both GERA only) and age at cancer diagnosis (UKB and GERA). For each clinical variable, we decomposed cases into one of two categories: grade 1—2 (well or moderately differentiated) or grade 3–4 (poorly or undifferentiated); stage 0–1 (in situ or localized) or stage 2–7 (regional or distant metastases); age < median or age ≥ median. The case counts for all cancer-outcome strata are tabulated in Supplementary Table 4. For each of the previously unreported GWAS SNPs, we conducted logistic regression comparing controls to each of the relevant case subtypes. We then compared the effect estimates across the strata for each clinical variable (e.g., for each relevant SNP–cancer pair, we compared the OR for grade 1–2 with the OR for grade 3–4) and calculated Cochran’s Q statistic to test for heterogeneity, adjusting for multiple testing for the number of strata and SNPs tested.

To assess whether our results were influenced by factors associated with survival, we conducted sensitivity analyses restricted to incident cases in the larger UKB cohort. For each cancer, we compared the independent SNPs that were suggestively associated in the analysis using both prevalent and incident cases (P < 1 × 10−6) with those in the incident only analysis. We assessed whether the effect sizes varied by calculating Cochran’s Q statistic to test for heterogeneity, adjusting for multiple testing across the number of SNPs tested for each cancer. Additional sensitivity analyses evaluated esophageal and stomach cancers as separate phenotypes in the UKB cohort. For independent SNPs with P < 1 × 10−6 in the analysis of the composite phenotype in UKB alone, we compared effect sizes for the composite phenotype to effect sizes for esophageal and stomach cancers separately and calculated Cochran’s Q statistic to test for heterogeneity, adjusting for multiple testing across the number of SNPs tested. For both of these sensitivity analyses, we assessed all SNPs with P < 1 × 10−6 to allow for a sufficient number of variants for comparison.

Genome-wide heritability and genetic correlation

We used LD-score regression (LDSC) on summary statistics from the union set of all SNPs genome-wide to calculate the genome-wide liability-scale heritability of each cancer type and the genetic correlation between each pair of cancer types. Internal LD scores were calculated using the aforementioned downsampled subset of UKB participants. To convert to liability-scale heritability, we adjusted for lifetime risks of each cancer based on SEER 2012–2014 estimates (Supplementary Table 5)63. LDSC was unable to estimate genetic correlations for testicular cancer with both oral cavity/pharyngeal and pancreatic cancers, likely due to small sample sizes.

Locus-specific pleiotropy

Using our union set of SNP-based summary statistics, we constructed pleiotropic regions of SNPs associated with more than one cancer with P < 5 × 10−8. Non-overlapping regions were iteratively formed around index SNPs associated with any cancer, beginning with the SNP associated with the smallest P. SNPs were added to a region if they were associated with any cancer with P < 5 × 10−8, were within 500 kb of the index SNP, and had LD r2 > 0.5 with the index SNP. We used a larger threshold for assessing pleiotropic regions (r2 > 0.5) than for identifying truly independent signals (r2 > 0.1; above) to ensure that all SNPs within a region were in LD. If all SNPs in a region were associated with the same cancer, the region was not considered pleiotropic.

Genome-wide variant-specific pleiotropy

We quantified one-directional and, separately, bidirectional variant-specific pleiotropy via the R package ASSET (association analysis based on subsets)64. Briefly, ASSET explores all possible subsets of traits for the presence of association signals, resulting in the best combination of traits to maximize the test statistic64. ASSET has two procedures: in one, all traits are assumed to be associated with a variant in the same effect direction (one-directional pleiotropy); in the other, variants can be associated with traits in opposite directions (bidirectional pleiotropy)64. In the one-directional pleiotropy analysis, an overall P across the selected traits is provided, and in the bidirectional pleiotropy analysis, a P for each direction is provided as well as an overall P for the total association signal for both directions combined. ASSET corrects for the internal multiple testing burden accrued by iterating through all possible trait subsets for each variant as well as controlling for shared samples among the traits64.

Genome-wide ASSET analyses were conducted on the union sets of summary statistics for all 18 cancers. Independent variants were determined via LD clumping, where index SNPs were suggestively significant (overall P < 1 × 10−6), and other SNPs were clumped with the lead variant if they had overall P < 0.05, were within 500 kb of the index SNP, and had r2 > 0.1 with the index SNP. While we only considered variants with an overall P < 5 × 10−8 significant, we used a suggestive significance threshold to comprehensively assess all potentially pleiotropic variants. A SNP was determined to have a one-directional pleiotropic association if the overall P was <1 × 10−6 and it was associated with at least two cancers. A SNP was determined to have a bidirectional pleiotropic association if the overall P was <1 × 106 and the P for each direction was <0.05. For one- and bidirectional SNPs in LD with each other, the SNP with the smaller overall P was retained. We deconstructed bidirectional associations into cancers with risk-increasing effects and cancers with risk-decreasing effects.

To assess whether clinical aspects of the cases could be informative about the pleiotropic variants, for each of the one-directional and bidirectional pleiotropic SNPs, we conducted logistic regression comparing controls to each of the relevant case subtypes described above and calculated Cochran’s Q statistic to test for heterogeneity between estimates across the strata for each clinical variable.

Functional characterization of pleiotropic variants

Functional consequences for the 100 pleiotropic variants identified in the ASSET analysis were obtained from ANNOVAR. Enrichment of functional classes was evaluated using Fisher’s exact test, comparing the distribution observed among the pleiotropic variants to that of all variants with INFO > 0.90 in the reference panel of UKB European descent individuals (16,972,700 SNPs total).

Overall functional significance was assessed using DeepSEA, a deep learning tool that prioritizes functional variants by integrating regulatory binding and ENCODE modification patterns of ~900 cell-factor combinations with evolutionary conservation features. Resulting functional significance scores, ranging from 0 to 1, represent the degree of deviation from a reference distribution of 1000 Genomes variants, with lower scores indicating a higher likelihood of functional significance. We also report CADD scores, which combine over 60 diverse annotations to predict deleteriousness36. CADD scores are transformed into a log10-derived rank score based on the genome-wide distribution of scores for 8.6 billion single-nucleotide variants in GRCh37/hg19 (i.e., CADD = 10 corresponds to top 10% most deleterious substitutions)36.

To assess more specific functional features, we annotated each SNP according to Roadmap’s 15-core chromatin states across 127 cell or tissue types35,65. Chromatin state was assigned by taking the most common state, with values ≤7 indicating open, accessible chromatin regions. Three-dimensional chromatin interactions were explored to identify significant interaction and enhancer-promoter links. We also explored associations with gene expression in using data from the GTEx v8 and BIOS QTL databases. The distribution of functional features among pleiotropic cancer risk variants was compared to a random sample of the same number of SNPs. Chromatin features and BIOS QTL annotations were obtained from the FUMA (Functional Mapping and Annotation) database. Differences in the proportion of variants belonging to each functional class were tested using a two-sample chi-squared test. Lastly, after annotating variants to their nearest gene, we conducted gene-set pathway enrichment analyses using the Kyoto Encyclopedia of Genes and Genomes (KEGG) database66 with an FDR q < 0.05 significance threshold.

Ethics

The study was approved by the University of California and KPNC Institutional Review Boards and the UKB data access committee, and informed consent was obtained from all participants.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.