Introduction

Keratinocyte cancers (KC), including basal cell carcinoma (BCC) and squamous cell carcinoma (SCC), are the most commonly diagnosed cancers globally. KC resulted in over 5.4 million diagnoses and $8 billion in expenditure in the US in 2011 alone1, while in Australia, they account for >24% of all cancer diagnoses2 and impose a huge economic burden on the health sector, costing over AUD $700 million for treatment annually3. KC is responsible for up to 8700 deaths a year in the United States4. The relative rates of, and morbidity from, KC are even higher in Australia5. BCC and SCC share many common risk factors, including sun exposure, skin and hair pigmentation and immunosuppression.

Skin cancers, pigmentation traits and autoimmune diseases share several susceptibility genes6,7,8,9. For example, several variants in the pigmentation genes ASIP/RALY, IRF4, MC1R, OCA2, SLC45A2 and TYR are associated with BCC, SCC and melanoma8,10. Shared immune-regulatory genes in the HLA and LPP regions have been found to influence susceptibility to BCC, SCC, melanoma and autoimmune diseases such as rheumatoid arthritis, vitiligo, type 1 diabetes and psoriasis6,7,8,9. There are also some tumourigenesis-related genes that are expressed in both KC and non-skin cancers. For example, the oncogene TNS3, which is overregulated in BCC, is also associated with breast, lung and prostate cancers8,11,12. Furthermore, HAL at 12q23.1 has been found to be associated with KC risk13 as well as vitamin D levels14. However, standard single-trait GWAS meta-analysis approaches are unable to utilise this multi-trait genetic overlap to further explore the genetic risk for BCC and SCC.

Multivariate GWAS approaches, such as multi-trait analysis of GWAS (MTAG)15, can draw on this overlapping genetics to identify new risk regions (here for BCC or SCC). MTAG is a generalisation of inverse-variance-weighted meta-analysis that, importantly, accounts for incomplete genetic correlation and sample overlap between GWAS. A key property of MTAG is that it outputs estimates of trait-specific effect sizes and P values for each of the input traits, in this case BCC or SCC. We have previously used MTAG to identify loci for KC based on the genetic correlation between BCC and SCC only13. BCC and SCC differ in polygenicity and aetiology; we therefore sought to identify genetic susceptibility loci for BCC and SCC by exploring their genetic overlap with melanoma, pigmentation traits, autoimmune diseases and blood biochemistry biomarkers in a multi-phenotype analysis of GWAS.

In this work, we show that BCC and SCC have a high genetic correlation with melanoma, pigmentation traits, autoimmune diseases, and blood biochemistry biomarkers. We use MTAG to leverage this genetic overlap and identify 78 and 69 independent genome-wide significant susceptibility loci for BCC and SCC, respectively; 19 BCC and 15 SCC loci are both previously unknown and replicated in a large independent cohort. The previously unknown risk loci are implicated in BCC/SCC development and progression, pigmentation, cardiometabolic pathways, and immune-regulatory pathways, including innate immunity, HIV-1 viral load modulation and disease progression. We also report an optimised BCC polygenic risk score (PRS) that enables effective risk stratification for KC.

Results

Genetic correlation

Using linkage disequilibrium score (LDSC) regression16, 20 phenotypes were significantly genetically correlated (P < 0.05, rg > 10%) with either BCC or SCC (Fig. 1 and Supplementary Data 1). In the first instance, 35 phenotypes that we considered as possibly correlated with skin cancer (including body mass index) were excluded for not meeting these criteria (Supplementary Data 2). Using the same selection criteria, no additional phenotypes were included following analysis of collated GWAS summary statistics (over 700 phenotypes) in the LD hub database17. In total, subsequent analyses included 22 genetically correlated traits, grouped as follows. Cancers: BCC and SCC GWAS from the UK Biobank (UKB)18,19, a cutaneous melanoma GWAS meta-analysis20, KC from the QSkin Sun and Health Study (QSkin)21, KC from the Electronic Medical Records and Genomics Network (eMERGE) cohort22,23 and all-cancer from the Resource for Genetic Epidemiology Research on Aging (GERA) cohort24. Skin and hair pigmentation-related traits: skin burn type (QSkin), red hair (QSkin), hair colour excluding red hair (UKB), skin colour (UKB) and mole count excluding melanoma cases (QSkin). Autoimmune conditions: type 1 diabetes and hypothyroidism25, and vitiligo26. Lifestyle-related traits: educational attainment in years spent in school27 and smoking (cigarettes per day)28. Blood biochemistry biomarkers from the UKB: aspartate aminotransferase, C-reactive protein, albumin, gamma-glutamyl transferase, glucose and vitamin D (adjusted for monthly variation). The sample sizes and phenotype measurements for all the included and excluded traits are presented in Supplementary Data 3 and 2, respectively.

Fig. 1: Heatmap for the genetic correlation between 22 traits with a significant correlation with either BCC or SCC.

Bivariate genetic correlation among the 22 traits that were significantly correlated (P < 0.05, rg > 10%) with the UKB BCC or SCC GWAS. BCC UKB basal cell carcinoma in the UK Biobank, SCC UKB squamous cell carcinoma in the UK Biobank, CM cutaneous melanoma, KC QSkin keratinocyte cancer in the QSkin cohort, KC eMERGE keratinocyte cancer in the eMERGE cohort, Hypothyr hypothyroidism, T1D type 1 diabetes, EA education attainment, VitD vitamin D, AAT aspartate aminotransferase, CRP C-reactive protein, GGT gamma-glutamyl transferase and Corr correlation. Source data are provided as a Source Data file.

Discovery of genome-wide significant susceptibility loci for BCC and SCC

Adding 20 traits genetically correlated with either BCC or SCC (rg > 0.1, P < 0.05) (from UKB) increased the effective sample sizes for BCC and SCC by 2.6 and 8.3 times, respectively. Using the MTAG approach, we identified 78 and 69 independent genome-wide significant (P < 5 × 10−8) susceptibility loci for BCC (Fig. 2 and Supplementary Data 4) and SCC (Fig. 3 and Supplementary Data 5), respectively. Although the results for the peak single nucleotide polymorphisms (SNPs) were more significant following the MTAG analysis due to the greater statistical power, the log (odds ratio) effect sizes for the MTAG output and the respective UKB BCC or SCC GWAS inputs were highly concordant. For BCC the Pearson’s correlation of effect sizes was 0.93 (95% confidence interval [CI] = 0.89–0.96, P < 2.20 × 10−16; Fig. 4a). Similarly, concordance was high for SCC loci (Pearson’s correlation = 0.71, 95% CI = 0.57–0.81, P = 7.34 × 10−12; Fig. 4c).
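The confidence intervals quoted for these Pearson correlations can be approximated from the correlation and the number of paired effect estimates alone. The following is a minimal sketch using the Fisher z-transformation; it assumes one effect-size pair per locus (n = 78 for BCC and n = 69 for SCC), which is our assumption and not stated in the text, and the function name `pearson_ci` is illustrative. Because the published correlations are rounded, the resulting intervals should be close to, but not necessarily identical to, the reported ones.

```python
import math


def pearson_ci(r: float, n: int, z_crit: float = 1.959964) -> tuple[float, float]:
    """Approximate two-sided 95% CI for a Pearson correlation.

    Uses the Fisher z-transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3) for n paired observations.
    """
    if n <= 3:
        raise ValueError("need more than 3 paired observations")
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Assumed inputs: reported correlations, with n taken as the number of loci.
bcc_lo, bcc_hi = pearson_ci(0.93, 78)
scc_lo, scc_hi = pearson_ci(0.71, 69)
print(f"BCC MTAG vs UKB: r = 0.93, approx. 95% CI = {bcc_lo:.3f} to {bcc_hi:.3f}")
print(f"SCC MTAG vs UKB: r = 0.71, approx. 95% CI = {scc_lo:.3f} to {scc_hi:.3f}")
```

The wider interval for SCC follows from both the lower correlation and the smaller number of loci.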

Fig. 2: Manhattan plot for basal cell carcinoma susceptibility.

The Manhattan plot shows the association between SNPs and basal cell carcinoma susceptibility based on the MTAG approach. The Y-axis represents the level of significance recorded in negative log 10 (P value) (two-tailed test), whilst the X-axis represents the chromosome 1–22, alternated with light blue and light pink colours. The horizontal blue line represents a suggestive level of significance at P value = 10−6, while the red one represents the genome-wide level of significance; P = 5 × 10−8. The green dots represent the 78 genome-wide significant independent loci for basal cell carcinoma susceptibility (after multiple correction for a million tests; 0.05/1,000,000). Only SNPs with a P value <0.01 were included. The source data file is provided as the BCC summary statistics in the GWAS Catalogue under accession code GCST90137411.

Fig. 3: Manhattan plot for squamous cell carcinoma susceptibility.

The Manhattan plot shows the association between SNPs and squamous cell carcinoma susceptibility based on the MTAG approach. The Y-axis represents the level of significance recorded in negative log 10 (P value) (two-sided test), whilst the X-axis represents the chromosome 1–22, alternated with light blue and light pink colours. The horizontal blue line represents a suggestive level of significance at P value = 10−6, while the red one represents the genome-wide level of significance; P = 5 × 10−8. The green dots represent the 69 genome-wide significant independent loci for squamous cell carcinoma susceptibility (after multiple correction for a million tests; 0.05/1,000,000). Only SNPs with a P value <0.01 were included. The source data file is provided as the SCC summary statistics in the GWAS Catalogue under accession code GCST90137412.

Fig. 4: Concordance of the log (OR) effect estimates for MTAG versus UK single-trait GWAS and 23andMe replication.

Figure 4 shows the comparison of the effect estimates in log (odds ratio) for both basal cell carcinoma (BCC) and squamous cell carcinoma (SCC) based on the respective MTAG approach results versus UKB single-trait GWAS and replication results from 23andMe. The blue line is the line of best fit with the 95% confidence intervals. The blue dots represent loci that overlap between BCC and SCC, whilst the red dots show the loci that are respectively unique to BCC or SCC. The dotted purple lines represent null effects (i.e. log (OR) = 0). The Y- and X-axes represent log (OR). a Shows BCC MTAG versus UKB BCC effect estimates, yielding a high concordance with a Pearson’s correlation of 0.93 (95% confidence interval [CI] = 0.89–0.96, two-sided test). b Shows BCC MTAG versus BCC replication (23andMe) effect estimates, yielding a high concordance, i.e. Pearson’s correlation = 0.97 (95% CI = 0.95–0.98, two-sided test). c Shows SCC MTAG versus UKB SCC effect estimates, yielding a high concordance (Pearson’s correlation = 0.71, 95% CI = 0.57–0.81, two-sided test). d Shows SCC MTAG versus SCC replication effect estimates, also resulting in a high correlation, i.e. 0.69 (95% CI = 0.55–0.80, two-sided test). UKB United Kingdom Biobank, MTAG multi-trait analysis of GWAS. Source data are provided as a Source Data file.

In the 23andMe, Inc. replication sample (252,931 cases and 2,281,246 controls), 71 of the 78 susceptibility loci for BCC replicated at the genome-wide level (P < 5 × 10−8), 74 replicated after Bonferroni correction (P < 6.49 × 10−4), and 77 loci replicated at a nominal P < 0.05 (Supplementary Data 4). There was high concordance between the MTAG and replication effect estimates for BCC, with Pearson’s correlation = 0.97 (95% CI = 0.95–0.98, P < 2.20 × 10−16; Fig. 4b). Of the 69 susceptibility loci for SCC, 25 replicated at the genome-wide level (P < 5 × 10−8), 31 replicated after Bonferroni correction (P < 7.24 × 10−4) and 38 loci replicated at a nominal P < 0.05 in the 23andMe cohort (135,214 cases and 2,404,735 controls) (Supplementary Data 5). For SCC, there was also high concordance between the MTAG and replication effect estimates, with Pearson’s correlation = 0.69 (95% CI = 0.55–0.80, P = 3.48 × 10−11; Fig. 4d).
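The three replication tiers are nested, so each reported count includes the loci that passed the stricter tiers. A minimal sketch of that tally is given below; the function name and the toy P values are hypothetical and are not taken from the replication data, and the Bonferroni threshold is simply 0.05 divided by the number of loci tested.

```python
GENOME_WIDE = 5e-8  # genome-wide significance threshold used in the text
NOMINAL = 0.05      # nominal significance threshold


def cumulative_replication_counts(p_values: list[float], n_tests: int) -> dict[str, int]:
    """Count loci passing each nested replication tier.

    The Bonferroni tier uses NOMINAL / n_tests, where n_tests is the
    number of loci taken forward to replication.
    """
    bonferroni = NOMINAL / n_tests
    return {
        "genome_wide": sum(p < GENOME_WIDE for p in p_values),
        "bonferroni": sum(p < bonferroni for p in p_values),
        "nominal": sum(p < NOMINAL for p in p_values),
    }


# Hypothetical replication P values for five loci (illustration only).
toy_p_values = [1e-20, 3e-9, 2e-5, 0.01, 0.2]
counts = cumulative_replication_counts(toy_p_values, n_tests=len(toy_p_values))
print(counts)
```

With five tests the Bonferroni threshold is 0.01, so the locus with P = 0.01 counts only towards the nominal tier because the comparison is strict.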

Description of the previously unknown loci for BCC and SCC

A locus was considered previously unknown for BCC or SCC if it had not been significantly associated with BCC, SCC or KC at the genome-wide level (P < 5 × 10−8) before, and if it replicated at a minimum of P < 0.05 in the 23andMe replication cohort. By this criterion, we identified 19 and 15 previously unknown loci for BCC (Table 1) and SCC (Table 2), respectively. The previously unknown loci were annotated to the pigmentation, cardiometabolic, cancer development/progression and immune-regulatory pathways (Figs. 5, 6), whilst others are known loci for cutaneous melanoma susceptibility (ATM and SOX6 for BCC, and GPR98 and DSTYK for both BCC and SCC). More details on these loci and the broader biological groups are discussed in the Supplementary Information (Supplementary Note 1). For loci that are unique to BCC or SCC, or overlap between BCC and SCC, refer to Tables 1 and 2.

Table 1 BCC susceptibility novel loci that replicated at P < 0.05 in 23andMe cohort
Table 2 SCC susceptibility novel loci that replicated at P < 0.05 in 23andMe cohort
Fig. 5: Basal cell carcinoma loci and biological pathways.

The broad biological pathways included pigmentation, immuno-regulatory, cardiometabolic, and cancer development and progression. FBRSL1, KITLG, ATM, GPR98 and DSTYK are not shown in this figure.

Fig. 6: Squamous cell carcinoma loci and biological pathways.

The broad biological pathways included pigmentation, immuno-regulatory, cardiometabolic, and cancer development and progression (cancer devt**). GPR98 and DSTYK are not shown in this figure.

Gene-set pathways

After correction for multiple testing (P = 0.05/18,188 genes; 2.75 × 10−6), gene-set analysis revealed curated and gene ontology (GO) pathways that are important in the development of keratinocyte cancer (Supplementary Table 1). A number of pathways are involved in melanogenesis (e.g. melanin biosynthesis, melanin biosynthetic process and melanosome membrane), a process which influences the nature of pigmentation traits and response to UV exposure. Genes in the “response to trabectedin” pathway are likely to play an important role in DNA damage response. Trabectedin is an alkylating agent that causes DNA damage and is used to treat certain cancers. Other pathways are important in the downregulation of the immune response (e.g. GO negative regulation of regulatory T cell differentiation), and enhancement of the immune response (IL2-PI3K pathway, MHC class II receptor activity, and nuclear factor of activated T cells (NFAT) pathway for development and function of regulatory T cells).

BCC MTAG-derived polygenic risk score for KC prediction in the Canadian longitudinal study on aging (CLSA)

During the validation of the PRSs, S5 (i.e. P < 10−4 with 273 SNPs for the MTAGPRS and 462 SNPs for the UKBPRS) was the optimal PRS model for both MTAGPRS and UKBPRS, with Nagelkerke R2 of 10.65% and 9.55%, respectively (Fig. 7a). The total number of SNPs differed between the two PRSs because the MTAG results have more power than the single BCC analysis and therefore it has more SNPs reaching significance. However, based on 'the nearest gene' analysis, 154 SNPs (Supplementary Data 8) overlapped between the MTAGPRS and UKBPRS. The correlation of the effect sizes for the PRS SNPs across the two sets was high (e.g. for the overlapping 154 SNPs, Pearson’s correlation = 0.94, 95% CI = 0.92–0.96, P < 2.2 × 10−16), meaning the extra MTAG SNPs are consistent but just better powered.

Fig. 7: Validation and application of the basal cell carcinoma MTAGPRS and UKBPRS in participants in the Canadian Longitudinal Study on Aging (CLSA).

PRS polygenic risk score, UKB United Kingdom Biobank, MTAG multi-trait analysis of GWAS, CI confidence intervals, SD standard deviation, % percent and BCC basal cell carcinoma. The red colour represents the UKB PRS version whilst cyan indicates the MTAG-derived PRS. The error bars represent the 95% confidence interval in b (odds ratio, two-sided test), c (net reclassification improvement index) and d (percentage reclassified). a Validation of the BCC MTAGPRS and UKBPRS models to select the best performing index based on clumped SNPs at S1 (P < 5×10−8), S2 (P < 10−7), S3 (P < 10−6), S4 (P < 10−5), S5 (P < 10−4), S6 (P < 10−3), S7 (P < 10−2) and S8 (P < 10−1) on the x-axis. The y-axis represents Nagelkerke’s R2 (%), a measure of model fit. PRS model S1 and S5 are the optimal PRS models for UKBPRS and MTAGPRS, respectively, in a selected validation sample of CLSA (N = 1911 individuals). b Shows and compares the association between the UKBPRS and MTAGPRS and KC risk in CLSA (N = 18,515 individuals) expressed in odds ratios per standard deviation (y-axis) increase in the PRS, and adjusted for age, sex and the 10 ancestral PCs. c Illustrates that the MTAGPRS performs better than the UKBPRS based on both the categorical and continuous net reclassification improvement indices in CLSA (N = 18,515 individuals). d Compares the percentage of people reclassified to an appropriate KC risk group after adding the MTAGPRS vs the UKBPRS to a model with age, sex and 10 ancestral principal components in CLSA (N = 18,515); MTAGPRS reclassified 36.57%, 95% CI = 35.89–37.26% of individuals compared to 33.23%, 95% CI = 32.56–33.91% by UKBPRS. Source data are provided as a Source Data file.

The SNPs for the optimal models are presented in Supplementary Data 6 and Supplementary Data 7 for the UKBPRS and MTAGPRS, respectively. When we tested the performance for both the UKBPRS and MTAGPRS in the CLSA (N = 18,933), the MTAGPRS outperformed the UKBPRS in terms of association with KC risk, KC risk prediction, and stratification. For example, after adjusting for age at recruitment, sex and the first ten PCs, the MTAGPRS outperformed the UKBPRS for association with KC risk i.e. MTAGPRS OR = 1.66, 95% CI = 1.55–1.79, P = 1.95 × 10−41 versus UKBPRS OR = 1.56, 95% CI = 1.45–1.67, P = 3.38 × 10−33 (Fig. 7b). In addition, the net reclassification index for KC risk was greater for MTAGPRS than the UKBPRS (Fig. 7c), when added to the base model containing age, sex and ten PCs. Consequently, the MTAGPRS compared to the UKBPRS reclassified more participants for KC risk to the appropriate risk group (low risk, moderate risk and high risk) (i.e. percentage of people reclassified; MTAGPRS = 36.57%, 95% CI = 35.89–37.26% versus UKBPRS = 33.23%, 95% CI = 32.56–33.91%) (Fig. 7d).
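The P-value thresholding behind models S1 to S8 can be illustrated with a small sketch. It assumes the SNPs have already been clumped, and it scores an individual as the weighted sum of risk-allele dosages, which is a common PRS form and may not match the authors' exact procedure. The `Snp` class, the function names and the toy SNPs are hypothetical; only the threshold values come from the Fig. 7 legend.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    beta: float     # log(OR) weight from the discovery GWAS
    p_value: float  # discovery P value


# Thresholds S1..S8 as listed in the Fig. 7 legend.
THRESHOLDS = {
    "S1": 5e-8, "S2": 1e-7, "S3": 1e-6, "S4": 1e-5,
    "S5": 1e-4, "S6": 1e-3, "S7": 1e-2, "S8": 1e-1,
}


def select_snps(snps: list[Snp], model: str) -> list[Snp]:
    """Keep (already clumped) SNPs whose discovery P value is below the model threshold."""
    cutoff = THRESHOLDS[model]
    return [s for s in snps if s.p_value < cutoff]


def polygenic_score(dosages: dict[str, float], snps: list[Snp]) -> float:
    """Weighted sum of risk-allele dosages (0 to 2); SNPs missing from `dosages` are skipped."""
    return sum(s.beta * dosages[s.rsid] for s in snps if s.rsid in dosages)


# Hypothetical SNPs and one hypothetical individual (illustration only).
toy_snps = [Snp("rsA", 0.10, 1e-9), Snp("rsB", -0.05, 5e-5), Snp("rsC", 0.02, 0.03)]
toy_dosages = {"rsA": 2.0, "rsB": 1.0, "rsC": 0.0}

score_s1 = polygenic_score(toy_dosages, select_snps(toy_snps, "S1"))
score_s5 = polygenic_score(toy_dosages, select_snps(toy_snps, "S5"))
print(f"S1 score = {score_s1:.2f}, S5 score = {score_s5:.2f}")
```

Relaxing the threshold from S1 to S5 adds SNPs with weaker evidence, which is the trade-off the validation step evaluates through Nagelkerke's R2.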

Discussion

In this large multi-trait GWAS analysis, we show that cutaneous melanoma, 'any-cancer', pigmentation traits, autoimmune diseases and other serum metabolic biomarkers are genetically correlated with BCC and SCC. We have leveraged this genetic correlation using the MTAG approach to identify 78 and 69 independent genome-wide significant loci for BCC and SCC risk, respectively, the most common skin cancers among fair-skinned people. Nineteen BCC and 15 SCC loci were previously unknown for any KC and replicated in the 23andMe cohort, indicating our study uncovers important findings relevant to KC biology.

First, we identify previously unknown loci in the pigmentation pathways for both BCC and SCC susceptibility. Due to the importance of sun exposure in keratinocyte cancer biology29, several new loci for BCC and SCC were linked to pigmentation traits, including skin colour, red hair, skin tanning response and sunburns. The gene-set analysis results also confirmed we identified biological pathways involved in melanin biosynthesis and DNA damage response.

Second, our study affirms the role of immune-regulatory processes and pathways in BCC and SCC susceptibility. We show that the previously unknown loci for BCC and SCC are implicated in immune-regulatory processes (Supplementary Information), including HIV viral load modulation30,31, innate immune response (through IFIH1)32,33,34 and autoimmunity. These cellular immune responses are important in cancer initiation and progression35. We also highlight a previously known locus (CTLA4) which is an immunotherapy target (anti-CTLA4 medication) in melanoma treatment36. Therefore, our identified loci implicated in immune response may be potential targets to improve immunotherapy for skin cancer. However, further functional genomic studies will be needed to establish their potential role in skin cancer prevention and treatment.

Third, immunosuppressive medications, including azathioprine and cyclosporin A, have been implicated in BCC and SCC risk37,38. While we uncovered KC loci linked to immune-related medication use, including anti-asthmatic inhalants and thyroid preparations39, it is likely that the medication-related loci identified here are just a proxy indicator for the underlying autoimmune disease. Thus, these medications are unlikely to cause BCC or SCC. In addition, even if these diseases were all treated with drugs that greatly increased the risk of KC, (a) they are too rare to lead to a cryptic genetic correlation as large as what we see here, e.g. for hypothyroidism (rg = −0.19, P = 1.05 × 10−4) (Supplementary Data 1), and (b) the genetic correlation, e.g. for hypothyroidism, was negative with BCC, whereas a drug-induced cryptic overlap would give a positive genetic correlation.

Fourth, our study also highlights the potential role of cardiometabolic biomarkers in BCC/SCC risk. Besides PUFA levels, whose causal association with BCC risk has been established through a Mendelian randomisation study40, our results highlight a potential causal relationship between cardiometabolic biomarkers (including diastolic and systolic blood pressure, lipids, serum glucose, cholesterol and adiposity) and the risk of BCC and SCC. As is the case for PUFA, downstream metabolism of these cardiometabolic biomarkers, such as lipids and cholesterol, results in oncogenic inflammatory biomarkers (e.g. prostaglandins E, thromboxane A2 and leukotriene B). However, some risk variants or loci in the cardiometabolic pathway could be influencing BCC and SCC risk through already known pigmentation and immune-regulatory biological pathways, e.g. rs1136165 in CKB and rs10774625 in ATXN241,42,43,44.

Fifth, we also unveil important genes with a potential role in BCC and SCC initiation and progression e.g. FAP, CDKL1, MARK3, RAB11FIP2, GAB2, SUOX and SOX6. Although some genetic variants within these genes have pleiotropic effects with pigmentation traits, the aforementioned genes have established roles in cancer cell proliferation, migration and invasion, and downregulation of apoptosis in melanoma, colorectal cancer and breast cancer45,46,47,48,49,50,51. Some of these loci are potential drug targets. For example, a previous study identified a potential drug 'PCC0208017' as an inhibitor of MARK3, suppressing glioma progression both in vitro and in vivo52. Fostamatinib, a drug used for treatment of chronic immune thrombocytopenia53, is an inhibitor of MARK354. Further studies are warranted to test these drugs for any anti-tumour activity in KC.

Our results further emphasise the shared biology between cutaneous melanoma and KC. In total, four previously unknown loci for BCC and SCC at ATM, DSTYK, GPR98 and SOX6 are known for CM20,55,56. Our MTAG results have also highlighted shared biology between BCC and SCC, whereby almost half (7) of the previously unknown loci are shared between BCC and SCC. However, our work also highlights loci distinct to either BCC (12) or SCC (8), indicating unique biological pathways (see Results) for each cancer.

We also note the difference in the replication success between BCC and SCC. Given the relatively high genetic correlation between the two traits, similar replication results are expected. However, at a subset of loci, the input data may suggest that a particular SNP is strongly associated only with, say, BCC but not SCC. Given that we have substantially more input data on BCC than SCC, power may also play a part in the strength of the results and in replication success. We have previously shown that BCC is twice as heritable as SCC (SNP-heritability estimates for BCC = 13.1%, 95% CI = 9.7–16.5% versus 6.8%, 95% CI = 0.9–12.7% for SCC)13, and it is more polygenic8,57. We believe these factors contributed to the differences in replication success.

One strength of the MTAG method is the increase in statistical power to identify loci that a standard single-trait GWAS would not have identified. For example, using MTAG, we increased our effective sample size by 2.6 times and 8.3 times for BCC and SCC, respectively. Owing to the great improvement in statistical power, our MTAG-derived BCC PRS outperformed (for KC risk stratification) the one derived from a single-trait BCC GWAS. We and others have previously shown that KC PRSs generated from the general population effectively stratify individuals for KC risk and multiplicity58,59,60. The optimised MTAG-derived PRS is likely to improve KC risk stratification in high-risk subpopulations, as previously shown in solid organ transplant recipients.

One caveat with the MTAG approach is that it assumes that the genetic variants have a homogeneous effect across all the included traits; if this assumption is violated, the results may be driven by a single trait and yield false positives15. Firstly, we compared the genetic correlation (Supplementary Fig. 1) and the MTAG results (Supplementary Fig. 2) before and after excluding genomic regions (HLA, ASIP, IRF4, MC1R, SLC45A2 and CDKN2A) with very large effect sizes for skin cancers and pigmentation traits, and found high concordance. Secondly, there was good replication of our results in an independent cohort, which counters concerns of false positives. In addition, in order to minimise biases arising from using several cohorts which might have phenotypes with different measures15, we selected only traits where the magnitude of the genetic correlation was larger than 0.1 (or less than −0.1 for negatively correlated traits); we also required, a priori, that the correlation reach at least nominal significance (P < 0.05). Also, studies with small sample sizes were not considered, as including such traits would only negligibly increase our effective sample size.

In conclusion, leveraging the genetic correlation between skin cancers, autoimmune diseases, pigmentation traits and serum biochemistry biomarkers revealed previously unknown susceptibility loci for SCC and BCC, implicated in KC development and progression, pigmentation, cardiometabolic and immune-regulatory pathways. We also report an optimised PRS for effective risk stratification for KC, which could facilitate skin cancer surveillance in high-risk subpopulations such as transplantees.

Methods

Cohorts

Discovery cohorts

Participants who contributed to the phenotype-specific genome-wide association studies were of homogeneous European ancestry, drawn from different cohorts from Australia, Europe and America. While there was sample overlap across the included GWAS, MTAG adjusts and corrects for biases due to sample overlap15. The major cohorts used included the UK Biobank (UKB)18,19, QSkin Sun and Health Study (QSkin) (Olsen et al. 2012), eMERGE (dbGaP, study accession: phs000360.v3.p1) and GERA (dbGaP, study accession: phs000674.v3.p3), a melanoma meta-analysis consortium (Supplementary Information; Supplementary Table 2)20 (dbGaP accession study code: phs001868.v1.p1), as well as publicly available GWAS summary statistics from international cohorts and consortia. Details for each cohort, including ethics oversight, are described in the Supplementary Information.

Replication cohort: 23andMe Research Cohort

23andMe, Inc. is a direct-to-consumer genetic company that collected both self-reported phenotypes and genetic data from participants who provided informed consent and participated in the research online, under a protocol approved by the external Association for the Accreditation of Human Research Protection Programme (AAHRPP)- accredited Institutional Review Board (IRB), Ethical & Independent Review Services (E&I Review). The BCC cohort included 2,523,630 participants of European ancestry; 251,963 BCC cases and 2,271,667 controls, and 44.65% males. The SCC dataset included 2,529,399 participants of European ancestry; 134,700 SCC cases and 2,394,699 controls, and 44.65% males. Further details on data collection, validation, genotyping, imputation and quality control have been published before8,57.

BCC PRS application cohort: the Canadian Longitudinal Study on Aging (CLSA)

The Canadian Longitudinal Study on Aging (CLSA) is a prospective large population-based cohort in Canada comprising about 50,000 participants (45–85 years) randomly recruited between 2010 and 2015 from ten provinces61,62. More information about the cohort has been published elsewhere61,62 and is summarised here. It consists of two cohorts: the 'Tracking cohort' of ~20,000 participants recruited through a telephone questionnaire in ten provinces, and the 'Comprehensive cohort' with ~30,000 individuals who provided data through an in-person questionnaire, clinical/physical tests and biological samples (e.g. for genetic data) in seven provinces.

In general, at baseline, information on relevant variables, including age and sex, was recorded, and participants were also asked whether they had been diagnosed with any cancer, including KC (yes/no), by a health professional. Between 2015 and 2018, the first follow-up assessment was conducted and participants were asked again if they had been diagnosed with cancer, including KC, during the follow-up period. Thus, the CLSA dataset we used included the 'Baseline Comprehensive Dataset version 4.0' and 'Follow-up 1 Comprehensive Dataset version 1.0'. At the time of analysis, ~30,000 individuals had genetic data available, genotyped using the 820 K UK Biobank Axiom Array (Affymetrix)61, and imputed using the TopMed imputation server63. The CLSA is overseen by the Canadian Institutes of Health Research (CIHR) and its protocol has been reviewed and approved by 13 research ethics boards in Canada. All participants provided written informed consent.

Firstly, for purposes of validation and selection of the optimal PRS models (as described below in Stage 6 analysis) we randomly selected 1523 cancer-free controls and 388 prevalent KC cases at the baseline. Thus, our validation sample included 1911 participants with a mean age of 65.81 years (sd = 10.25) and 52.75% males.

Secondly, we tested the BCC PRSs in a second sample (unrelated to the validation dataset) of 18,933 participants of European ancestry (49.63% males), with a mean age of 61.80 years (sd = 9.84), followed up for a mean duration of 2.9 years (sd = 0.3). Only participants with complete data on age, sex, cancer status and KC diagnosis were included. Thus, the sample comprised 18,139 controls with no history of any cancer (at follow-up 1) and 794 participants who developed KC during follow-up.

Statistical analysis

Stage 1: GWAS for BCC, SCC and related traits

We conducted two case-control GWAS using UKB data for BCC, N = 307,684 (20,791 cases and 286,893 controls) and SCC, N = 294,294 (7402 SCC cases and 286,892 controls) of European ancestry. We adjusted for age and sex as well as the first ten ancestral principal components (PCs) in order to control for biases from population stratification. We used the Scalable and Accurate Implementation of GEneralised mixed model (SAIGE) software for the analysis since it controls for sample relatedness and case-control imbalance25. Analysis was restricted to single nucleotide polymorphisms (SNPs) with minor allele frequency (MAF) >1% and an imputation quality score of 0.3. BCC/SCC cases were drawn from UK cancer registries. Further details on case ascertainment and definition are described in Supplementary Information.

In addition, we conducted GWAS for pigmentation traits (e.g. skin colour, hair colour, tanning response, skin burn, sunburn, etc.), all-cancer, autoimmune conditions, and blood biochemistry biomarkers (e.g. C-reactive protein, vitamin D, glucose, albumin, aspartate aminotransferase, gamma-glutamyl transferase, etc.) using data from international cohorts including UKB, QSkin and GERA, as described in Supplementary Information, Supplementary Data 2, 3. We also conducted GWAS on KC and all-cancer after accessing data from the eMERGE (dbGaP, study accession: phs000360.v3.p1) and GERA (dbGaP, study accession: phs000674.v3.p3) cohorts, respectively (Supplementary Information). We also accessed publicly available GWAS summary statistics, e.g. for cutaneous melanoma20, smoking28, education attainment27, body mass index64, hypothyroidism, type 1 diabetes, rheumatoid arthritis25 and vitiligo26 (Supplementary Information, Supplementary Data 2, 3).

Stage 2: Genetic correlation between BCC, SCC and related traits

We used LDSC (version 1.0.1)65 to compute the genetic correlation (rg)16 between BCC and a range of other traits, including other skin cancer types, pigmentation traits, autoimmune traits and biochemistry biomarkers (recently released in the UKB). We then repeated this process for SCC instead of BCC. We used data from publicly available GWAS, as well as GWAS data from international cohorts of participants of European ancestry (conducted in stage 1 above). Traits with a statistically significant (P < 0.05) rg greater than 10% with either BCC or SCC were selected and included in the MTAG model (Fig. 1 and Supplementary Table 1). We further sought additional traits that were genetically correlated with BCC or SCC using data from the LD hub catalogue17. Out of about 700 phenotypes, no additional phenotypes were selected to be included in the final MTAG model.
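The selection rule can be written as a simple filter over the genetic correlation results. The sketch below assumes the threshold applies to the magnitude of rg (so that negatively correlated traits are retained, as the Discussion indicates) and that a trait is kept if it passes with either BCC or SCC. The hypothyroidism row uses the value quoted in the Discussion; all other rows and the function names are hypothetical.

```python
def passes_selection(rg: float, p_value: float,
                     min_abs_rg: float = 0.1, alpha: float = 0.05) -> bool:
    """True if |rg| exceeds the threshold and the correlation is nominally significant."""
    return abs(rg) > min_abs_rg and p_value < alpha


def select_traits(rows: list[tuple[str, str, float, float]]) -> list[str]:
    """Return traits passing the rule with either BCC or SCC, in order of first appearance.

    Each row is (trait, target, rg, p_value), where target is "BCC" or "SCC".
    """
    selected: dict[str, bool] = {}
    for trait, _target, rg, p_value in rows:
        if passes_selection(rg, p_value):
            selected[trait] = True
    return list(selected)


rows = [
    ("hypothyroidism", "BCC", -0.19, 1.05e-4),  # value quoted in the Discussion
    ("toy_trait_a", "BCC", 0.05, 0.001),        # hypothetical: significant but |rg| too small
    ("toy_trait_a", "SCC", 0.08, 0.20),         # hypothetical: fails both criteria
    ("toy_trait_b", "SCC", 0.30, 0.04),         # hypothetical: passes via SCC only
    ("toy_trait_c", "BCC", 0.40, 0.30),         # hypothetical: large rg but not significant
]
print(select_traits(rows))
```

Requiring both criteria excludes traits whose correlation is significant only because of very large sample sizes, as well as traits with large but imprecise estimates.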

In total, 22 traits, including the initial input BCC and SCC GWAS from different cohorts of European ancestry, met the inclusion criteria. The 22 genetically correlated traits were: BCC, SCC, skin colour, hair colour excluding red hair, hypothyroidism, type 1 diabetes, gamma-glutamyl transferase, aspartate aminotransferase, serum vitamin D levels, albumin, C-reactive protein and glucose in the UK Biobank19; KC, red hair and mole count in QSkin21; KC in eMERGE (dbGaP, study accession: phs000360.v3.p1); all-cancer in the GERA cohort (dbGaP, study accession: phs000674.v3.p3); melanoma risk as measured by the latest and largest melanoma risk GWAS meta-analysis20; vitiligo26; education attainment27; and smoking28. All the above studies excluded 23andMe, to enable us to utilise the 23andMe data as a replication set. Details on the phenotypic measurements and definitions are described in Supplementary Information and Supplementary Data 2, 3.

Stage 3: Multi-trait analysis of GWAS summary statistics

Next, using a total of 22 genetically correlated traits, we conducted a multi-phenotype analysis of GWAS summary statistics (generated at stage 1 and selected in stage 2) using MTAG software version 1.0.8 (ref. 15). MTAG default settings were used. MTAG combines GWAS summary statistics for genetically correlated traits into a meta-analysis while accounting for genetic correlation and sample overlap, thereby maximising power to identify loci associated with the trait(s) of interest (here BCC and SCC)15. MTAG generates trait-specific results for each phenotype included in the model. BCC and SCC GWAS summary data from UKB from stage 1 were included as traits 1 and 2, respectively, in the model below:

$$\mathrm{MTAG\ model}: \mathrm{BCC}+\mathrm{SCC}+\mathrm{melanoma}+\mathrm{pigmentation\ traits}+\mathrm{autoimmune\ traits}+\ldots+\mathrm{trait}\ n.$$

After quality control, the analysis was restricted to 5,301,239 SNPs common to all 22 GWAS, with a minor allele frequency of >1% and no ambiguous alleles. MTAG boosts the statistical power of the single-trait GWAS15. We assessed the increase in statistical power (the effective, or GWAS-equivalent, sample size) when MTAG was applied to the single-trait GWAS by comparing the average chi-squared before and after MTAG for BCC and for SCC, using the following formula recommended by the MTAG authors15:

$$\frac{1-\mathrm{average}\ \chi^{2}_{\mathrm{MTAG\ output}}}{1-\mathrm{average}\ \chi^{2}_{\mathrm{MTAG\ input}}}$$

where MTAG input corresponds to the input for either BCC or SCC GWAS in the UKB dataset, and χ² is chi-squared.
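As a worked illustration of this ratio, the following Python sketch computes the average chi-squared from per-SNP z-scores and the resulting power gain. The function names and the numbers used are hypothetical. Multiplying the input sample size by the ratio to obtain a GWAS-equivalent sample size follows the usual MTAG convention, but it is shown here as an assumption rather than as the authors' exact code.

```python
from statistics import mean


def mean_chi2(z_scores):
    """Average chi-squared statistic, taking chi2 = z**2 for each SNP."""
    return mean(z * z for z in z_scores)


def power_gain(mean_chi2_output, mean_chi2_input):
    """Ratio (1 - mean chi2 of MTAG output) / (1 - mean chi2 of MTAG input)."""
    return (1 - mean_chi2_output) / (1 - mean_chi2_input)


def gwas_equivalent_n(n_input, mean_chi2_output, mean_chi2_input):
    """Assumed convention: scale the input GWAS sample size by the gain."""
    return n_input * power_gain(mean_chi2_output, mean_chi2_input)


# Made-up average chi-squared values, for illustration only.
gain = power_gain(mean_chi2_output=1.5, mean_chi2_input=1.2)
n_equiv = gwas_equivalent_n(100_000, 1.5, 1.2)
print(gain, n_equiv)
```

Because numerator and denominator are both negative whenever the average chi-squared exceeds 1, the ratio equals the more familiar form (average χ² output − 1)/(average χ² input − 1).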

We took forward the MTAG output summary statistics for (a) BCC and (b) SCC for further post-GWAS analysis in stage 4 and replication in stage 5. BCC and SCC Manhattan plots are presented in Figs. 2 and 3, respectively.

Stage 3.1: Sensitivity analyses

MTAG assumes a homogeneous effect across all the included traits15. However, some genomic regions violate this assumption because of their strong association with some input traits: CDKN2A, SLC45A2, IRF4 and HLA for autoimmune traits, and ASIP and MC1R for pigmentation or cutaneous melanoma (CM). We therefore conducted sensitivity analyses excluding these regions before implementing our MTAG model. Using the stage 1 BCC GWAS summary statistics, we removed extended regions for ASIP on chromosome 20 (30–36 megabases (Mb)), the MHC region on chromosome 6 (25–36 Mb), and MC1R on chromosome 16 (87–90.3 Mb). We also removed 2 Mb around the most significant SNP in each of the following regions: rs12203592 (6:396321) in the IRF4 region on chromosome 6, rs3731239 (9:21974218) in the CDKN2A region on chromosome 9, and rs16891982 (5:33951693) in the SLC45A2 region on chromosome 5. We compared the genetic correlation between BCC/SCC before and after removing the genomic regions with known strong associations and high LD (Supplementary Fig. 2), before running the full MTAG model of 22 traits described above. The MTAG results with and without the above genomic regions were also compared (Supplementary Fig. 1).
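The region exclusion can be sketched as a simple positional filter. The coordinates below are those given in the text. Treating "2 Mb around" a lead SNP as a window of 1 Mb on each side is an assumption, and the function names are hypothetical.

```python
# Extended regions removed in full: (chromosome, start bp, end bp).
EXTENDED_REGIONS = [
    ("20", 30_000_000, 36_000_000),  # ASIP
    ("6", 25_000_000, 36_000_000),   # MHC
    ("16", 87_000_000, 90_300_000),  # MC1R
]

# Lead SNPs around which a window is removed: (chromosome, position).
LEAD_SNPS = [
    ("6", 396_321),      # rs12203592, IRF4
    ("9", 21_974_218),   # rs3731239, CDKN2A
    ("5", 33_951_693),   # rs16891982, SLC45A2
]

# Assumption: the 2 Mb window is centred on the lead SNP (1 Mb each side).
HALF_WINDOW = 1_000_000


def excluded_regions():
    """Return all regions to drop, as (chromosome, start, end) tuples."""
    regions = list(EXTENDED_REGIONS)
    for chrom, pos in LEAD_SNPS:
        regions.append((chrom, max(0, pos - HALF_WINDOW), pos + HALF_WINDOW))
    return regions


def is_excluded(chrom, pos, regions):
    """True if the position falls inside any excluded region."""
    return any(c == str(chrom) and start <= pos <= end
               for c, start, end in regions)


def filter_snps(snps, regions):
    """snps: iterable of (rsid, chromosome, position) tuples."""
    return [s for s in snps if not is_excluded(s[1], s[2], regions)]


regions = excluded_regions()
kept = filter_snps([("rs_a", "1", 100), ("rs_b", "6", 30_000_000)], regions)
print(kept)
```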

Stage 4: Post-GWAS analysis

We used FUMA v.1.3.6 (ref. 66) to identify independent, genome-wide significant SNPs and the genomic risk loci, and to perform annotation of candidate SNPs in the genomic loci and functional gene mapping. We also conducted gene-based and pathway analyses using MAGMA v.1.7, as implemented in FUMA v.1.3.6 (ref. 67). For the gene pathway analysis, gene ontology (GO) and curated gene sets from MSigDB (v5.2)68 were used, with correction for multiple testing. The GWAS catalogue69 and the Open Targets platform70 were used to annotate the loci and their relationship with other traits.

Stage 5: Replication of the BCC and SCC MTAG results

Next, we sought to replicate the BCC and SCC susceptibility loci in a large independent cohort using data from the 23andMe research cohort. For BCC, the replication cohort included 251,963 self-reported cases and 2,271,667 controls, while the SCC replication cohort comprised 134,700 self-reported cases and 2,394,699 controls, all of European ancestry and filtered to remove close relatives.

Previous studies have shown high accuracy of 23andMe self-reported BCC/SCC cases8 and high genetic correlation (rg > 0.9) between the histologically confirmed UKB BCC/SCC data and 23andMe data13. Both analyses used logistic regression adjusted for age, sex, population stratification (five PCs) and genotyping platform, i.e.

$$\mathrm{BCC\ or\ SCC} \sim \mathrm{genotype}+\mathrm{age}+\mathrm{sex}+\mathrm{pc.0}+\mathrm{pc.1}+\mathrm{pc.2}+\mathrm{pc.3}+\mathrm{pc.4}+\mathrm{v2\_platform}+\mathrm{v3\_0\_platform}+\mathrm{v3\_1\_platform}+\mathrm{v4\_platform}.$$

The V2 genotyping platform was a variant of the Illumina HumanHap550+ BeadChip with ~560,000 SNPs, including about 25,000 custom SNPs selected by 23andMe. The V3 platform was based on the Illumina OmniExpress+ BeadChip with ~950,000 SNPs and custom content SNPs. The V4 platform is the current, fully custom array of ~950,000 SNPs and includes a lower redundancy subset of V2 and V3 SNPs71.

The BCC results were adjusted for a genomic control inflation factor of λ = 1.286. The equivalent inflation factor for 1000 cases and 1000 controls was λ1000 = 1.001, and for 10,000 cases and 10,000 controls, λ10000 = 1.006. Similarly, the SCC results were adjusted for a genomic control inflation factor of λ = 1.172. The equivalent inflation factor for 1000 cases and 1000 controls was λ1000 = 1.001, and for 10,000 cases and 10,000 controls, λ10000 = 1.007. Thus, this inflation was not concerning, as it is proportional to the large sample size72. We also explored any evidence of inflation in the discovery GWAS by assessing the LDSC intercept73, which showed no inflation (not substantially above 1) for both BCC (LDSC intercept = 0.96, 95% CI = 0.94−0.99) and SCC (LDSC intercept = 0.77, 95% CI = 0.75−0.79).
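The sample-size-equivalent inflation factors can be checked with the commonly used rescaling λ_target = 1 + (λ − 1) × (1/n_cases + 1/n_controls) / (2/target). Whether exactly this formula was used here is an assumption, although with the replication cohort sizes given above it yields values that round to those reported. The function name `rescale_lambda` is hypothetical.

```python
def rescale_lambda(lam, n_cases, n_controls, target=1000):
    """Rescale a genomic inflation factor to `target` cases and `target` controls.

    Assumption: the standard rescaling
    1 + (lam - 1) * (1/n_cases + 1/n_controls) / (1/target + 1/target).
    """
    return 1 + (lam - 1) * (1 / n_cases + 1 / n_controls) / (2 / target)


# Replication cohort sizes and inflation factors as stated in the text.
bcc_1000 = rescale_lambda(1.286, 251_963, 2_271_667, target=1000)
bcc_10000 = rescale_lambda(1.286, 251_963, 2_271_667, target=10_000)
scc_1000 = rescale_lambda(1.172, 134_700, 2_394_699, target=1000)
scc_10000 = rescale_lambda(1.172, 134_700, 2_394_699, target=10_000)

for name, value in [("BCC 1000", bcc_1000), ("BCC 10000", bcc_10000),
                    ("SCC 1000", scc_1000), ("SCC 10000", scc_10000)]:
    print(f"{name}: {value:.3f}")
```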

We also compared the concordance of the effect sizes (log OR) for the MTAG results versus the replication results (Fig. 4b, d). We further counted the number of loci that replicated at three levels: genome-wide significance (P = 5.0 × 10−8), significance after multiple testing correction (i.e. Bonferroni correction, P = 6.49 × 10−4 for BCC, correcting for 77 loci, and P = 7.24 × 10−4 for SCC, correcting for 69 loci), and nominal significance (P = 0.05).
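The three replication levels can be expressed as a small classifier. This is a sketch only: the tier labels and function names are hypothetical, and the Bonferroni threshold is taken as 0.05 divided by the number of loci tested, which matches the thresholds quoted above to the precision given.

```python
GENOME_WIDE = 5.0e-8
NOMINAL = 0.05


def bonferroni_threshold(n_loci, alpha=0.05):
    """Bonferroni-corrected significance threshold for n_loci tests."""
    return alpha / n_loci


def replication_tier(p, n_loci):
    """Return the strictest replication level that a replication P value meets."""
    if p < GENOME_WIDE:
        return "genome-wide"
    if p < bonferroni_threshold(n_loci):
        return "bonferroni"
    if p < NOMINAL:
        return "nominal"
    return "not replicated"


# 77 loci were tested for BCC and 69 for SCC, as stated in the text.
bcc_threshold = bonferroni_threshold(77)
scc_threshold = bonferroni_threshold(69)
print(bcc_threshold, scc_threshold)
print(replication_tier(1e-4, n_loci=77))
```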

Stage 6: Development and validation of the BCC Polygenic Risk Score in a selected sample of participants in CLSA

To construct two comparable polygenic risk scores (PRSs) for BCC, we separately used the BCC MTAG output (generated in stage 3) and the UKB BCC single-trait GWAS (generated in stage 1) summary statistics as the discovery data sets. MTAG15 drops SNPs with extremely significant associations (P < 2.22 × 10−308) with any input trait, which resulted in a number of previously reported pigmentation-associated SNPs being dropped from the model, although this same strong association confirms that they are important for a BCC PRS. Hence, in both the MTAG and UKB discovery GWAS summary statistics, we also included two functional SNPs (rs1805007 for MC1R, and rs12203592 for IRF4) that would otherwise have been dropped from the PRS, using the weights from a previously published BCC PRS74. In a sensitivity analysis excluding these SNPs, the MTAG BCC PRS still reclassified skin cancer cases to a higher risk group (41.27%) better than the single-trait BCC PRS (37.95%).

Next, using autosomal, non-ambiguous and bi-allelic SNPs overlapping with the CLSA cohort (MTAG discovery = 5,300,872 SNPs and UKB discovery = 5,300,868 SNPs), we performed LD clumping (r² = 0.005, LD window = 5000 kb, P = 1) using PLINK 1.90b6.8 (ref. 75), yielding 62,494 and 62,884 independent SNPs for the MTAGPRS and UKBPRS models, respectively. Using these clumped independent SNPs, we generated PRS models at varying P value thresholds, i.e. S1 (P < 5 × 10−8), S2 (P < 10−7), S3 (P < 10−6), S4 (P < 10−5), S5 (P < 10−4), S6 (P < 10−3), S7 (P < 10−2) and S8 (P < 10−1), in a validation sample of 1911 participants split from the CLSA cohort, using the log odds ratios (from the respective discovery GWAS; MTAG or UKB) as weights. PLINK2 (v2.00a3LM 5 May 2021 release)75 was used for generating the PRSs.
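The thresholding-and-weighting step can be illustrated in plain Python. The study itself used PLINK for clumping and scoring, so this sketch is not that pipeline: `select_snps` and `polygenic_score` are hypothetical helpers, the SNP identifiers and values are made up, and the score is shown as an unnormalised weighted sum of effect-allele dosages.

```python
# P value thresholds S1 to S8 as listed in the text.
THRESHOLDS = {
    "S1": 5e-8, "S2": 1e-7, "S3": 1e-6, "S4": 1e-5,
    "S5": 1e-4, "S6": 1e-3, "S7": 1e-2, "S8": 1e-1,
}


def select_snps(weights, p_max):
    """weights: dict mapping SNP id -> (log odds ratio, discovery P value).

    Returns the log odds ratios of clumped SNPs with P below the threshold.
    """
    return {snp: beta for snp, (beta, p) in weights.items() if p < p_max}


def polygenic_score(dosages, betas):
    """Weighted sum of effect-allele dosages (0 to 2) for one individual."""
    return sum(beta * dosages[snp] for snp, beta in betas.items() if snp in dosages)


# Made-up discovery weights and one made-up genotype, for illustration only.
weights = {"snp_a": (0.20, 1e-9), "snp_b": (-0.10, 5e-4), "snp_c": (0.05, 0.03)}
dosages = {"snp_a": 2, "snp_b": 1, "snp_c": 0}

scores = {name: polygenic_score(dosages, select_snps(weights, p_max))
          for name, p_max in THRESHOLDS.items()}
print(scores)
```

Each threshold yields one candidate score per participant; the candidate models are then compared in the validation sample.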

For both the MTAG and UKB PRS models, we used Nagelkerke’s R² (ref. 76), a metric of model fit, to select the optimal model. We computed R² by comparing the fit of models with a PRS (BCC ~ MTAGPRS or UKBPRS + age + sex + 10 PCs) against a null model, using the PredictABEL package77 in R software version 4.0.2 (ref. 78).
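For reference, Nagelkerke's R² can be computed from model log-likelihoods as sketched below. The study used the PredictABEL package in R, so this Python version is only illustrative. Taking the PRS contribution as the difference in R² between the full model and a covariates-only model is an assumption about how the comparison with the null model was made, and the function names are hypothetical.

```python
import math


def nagelkerke_r2(loglik_model, loglik_intercept_only, n):
    """Nagelkerke's R2: Cox-Snell R2 divided by its maximum attainable value."""
    cox_snell = 1 - math.exp(2 * (loglik_intercept_only - loglik_model) / n)
    max_cox_snell = 1 - math.exp(2 * loglik_intercept_only / n)
    return cox_snell / max_cox_snell


def incremental_r2(loglik_full, loglik_covariates, loglik_intercept_only, n):
    """Assumed convention: R2 of (PRS + covariates) minus R2 of covariates only."""
    return (nagelkerke_r2(loglik_full, loglik_intercept_only, n)
            - nagelkerke_r2(loglik_covariates, loglik_intercept_only, n))


# Made-up log-likelihoods for a sample of 200, for illustration only.
gain = incremental_r2(loglik_full=-85.0, loglik_covariates=-92.0,
                      loglik_intercept_only=-100.0, n=200)
print(gain)
```

The P value threshold whose model gives the largest R² would then be taken as the optimal PRS.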

Stage 7: Applying BCC polygenic risk score and keratinocyte cancer risk prediction in the Canadian Longitudinal Study of Aging

To determine the ability of our MTAG GWAS data to predict skin cancer, we used 18,933 participants of European ancestry with data on KC risk in the Canadian Longitudinal Study of Aging (CLSA). We included 18,139 controls with no history of any cancer (both at baseline and follow-up) and 794 cases who developed KC during the follow-up period (2.9 years on average) after baseline recruitment. Separate BCC and SCC data were unavailable in this cohort, and as ~80% of KC cases are BCC cases79, we tested the performance of the MTAGPRS vs the UKBPRS derived for BCC in predicting the risk of KC.

Using PLINK2 (v2.00a3LM 5 May 2021 release)75, we generated individual scores for CLSA participants for both the BCC MTAGPRS and UKBPRS, weighted by their respective effect sizes (log odds ratios). The genetic scores were standardised to a variance of 1 so that associations could be interpreted as odds ratios per standard deviation increase in the PRS. We compared the performance of the two BCC PRSs (MTAGPRS vs UKBPRS) based on the magnitude of the association (odds ratios) and the net reclassification improvement for KC risk, using R version 4.0.2 (ref. 78). For net reclassification improvement, we compared the net reclassification index and the percentage of participants who were reclassified to an appropriate risk group/tertile, i.e. low risk (bottom tertile), moderate risk (middle tertile) or high risk (top tertile), after adding the MTAGPRS vs the UKBPRS to the base model containing age, sex and the ten PCs.
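The standardisation and tertile-based reclassification can be sketched as follows. This is an illustrative implementation of a categorical net reclassification index, not the authors' R code. The function names are hypothetical, the inputs are made up, and assigning tertiles by rank (with ties broken by sort order) is an assumption.

```python
from statistics import mean, stdev


def standardise(scores):
    """Centre scores and scale them to unit variance."""
    mu, sd = mean(scores), stdev(scores)
    return [(s - mu) / sd for s in scores]


def tertiles(risks):
    """Assign each predicted risk to tertile 0 (low), 1 (moderate) or 2 (high)."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    categories = [0] * len(risks)
    for rank, i in enumerate(order):
        categories[i] = rank * 3 // len(risks)
    return categories


def categorical_nri(old_cat, new_cat, is_case):
    """Net reclassification index between two risk categorisations.

    Cases gain credit for moving up and controls for moving down.
    """
    case_net = control_net = 0
    for old, new, case in zip(old_cat, new_cat, is_case):
        move = (new > old) - (new < old)  # +1 up, -1 down, 0 unchanged
        if case:
            case_net += move
        else:
            control_net -= move
    n_cases = sum(is_case)
    n_controls = len(is_case) - n_cases
    return case_net / n_cases + control_net / n_controls


# Made-up categories: base model vs base model plus PRS.
base = [0, 1, 2, 0, 1, 2]
with_prs = [1, 1, 2, 0, 0, 2]
status = [True, True, True, False, False, False]
nri = categorical_nri(base, with_prs, status)
print(nri)
```

Here the "old" categories would come from the base model (age, sex and the ten PCs) and the "new" categories from the same model with either the MTAGPRS or the UKBPRS added.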

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.