The 9p21.3 risk of childhood acute lymphoblastic leukaemia is explained by a rare high-impact variant in CDKN2A

Genome-wide association studies (GWAS) have provided strong evidence for inherited predisposition to childhood acute lymphoblastic leukaemia (ALL) identifying a number of risk loci. We have previously shown common SNPs at 9p21.3 influence ALL risk. These SNP associations are generally not themselves candidates for causality, but simply act as markers for functional variants. By means of imputation of GWAS data and subsequent validation SNP genotyping totalling 2,177 ALL cases and 8,240 controls, we have shown that the 9p21.3 association can be ascribed to the rare high-impact CDKN2A p.Ala148Thr variant (rs3731249; Odds ratio = 2.42, P = 3.45 × 10−19). The association between rs3731249 genotype and risk was not specific to particular subtype of B-cell ALL. The rs3731249 variant is associated with predominant nuclear localisation of the CDKN2A transcript suggesting the functional effect of p.Ala148Thr on ALL risk may be through compromised ability to inhibit cyclin D within the cytoplasm.

Acute lymphoblastic leukaemia (ALL) is the major paediatric cancer in western countries, with B-cell precursor (BCP) ALL accounting for ~80% of cases 1 . To date the aetiology of ALL is poorly understood and while there is indirect evidence for an infective origin, no specific environmental risk factors have been identified. Evidence for inherited predisposition to ALL is provided by the increased risk seen in siblings of cases independent of the concordance in monozygotic twins, which has an in utero basis 2 .
By performing a genome-wide association study (GWAS) we have previously shown common SNPs at 9p21.3 annotating CDKN2A/CDKN2B influences ALL risk 3 . The tagSNPs associations discovered in such GWAS are however rarely themselves candidates for causality, but simply act as markers for functional variants.
To decipher the association signal at 9p21.3 we performed imputation of two GWAS datasets and subsequent replication SNP genotyping. Our analysis demonstrates that the 9p21.3 association for ALL can be ascribed to a rare high-impact haplotype at 9p21.3 which captures a CDKN2A exon 2 variant.

Results
Using Haploview we defined the haplotype blocks and recombination hotspots containing the tag sentinel SNP rs3731217 (hg19 chr9:g.21984661 A > C) previously found to be associated with ALL risk at 9p21.3. rs3731217 is contained within an 174-kb LD block (at hg19 chr9:g.21,942,000-22,116,000 base pairs [bp]) and to include the possibility of long-range synthetic associations, we considered the 9q21.3 region of association to be defined by a 400kb interval (rs10965124, hg19 chr9:g.21776615 T> G to rs117009334, hg19 chr9:g.22176265 C > T). To fine map the 9p21.3 risk locus we made use of data from two previously published GWAS -(i) The German GWAS comprising 1155 B-cell ALL cases ascertained through the German Berlin-Frankfurt-Munster (BFM) group childhood ALL trials and 2125 controls from the Heinz-Nixdorf Recall (HNR) study; (ii) the UK GWAS comprising 824 B-cell cases from United Kingdom Childhood Cancer Study (UKCCS), UK MRC ALL 97 trial (http://www.thelancet. com/protocol-reviews/97PRT-14) and Northern Institute for Cancer Research; 5,200 controls from the 1958BC and UK Blood donors (http://www.cls.ioe.ac.uk/). After QC the two GWAS provided genotype data on 1,658 B-cell ALL cases (824 UKGWAS and 834 German-GWAS samples) and 7,224 controls in total. To harmonise GWAS datasets and recover untyped SNPs we made use of imputation using 1000Genomes and UK10K (http://www.uk10k.org/) as reference panels thereby allowing for the comprehensive examination of SNPs with minor allele frequency (MAF) > 0.005 in the genomic region with IMPUTE2 software (https://mathgen.stats.ox.ac.uk/impute/impute).
For replication we genotyped an additional 519 B-cell ALL cases from the UK ALL2003 trial and 1,016 population controls from the National Study of Colorectal Cancer Genetics (NSCCG) study 4 . This analysis provided increased evidence for an association between SNPs rs3731249, rs113650570 and rs36228834 and risk of ALL (Table 1). Pooling genotype data for all of three series provided unequivocal evidence for a relationship between rs3731249 (P = 3.45× 10 −19 ), rs36228834 (P = 4.41 × 10 −19 ) and rs113650570 (P = 9.21 × 10 −19 ) and ALL risk. There was no evidence for between study heterogeneity and combined odds ratios (OR) and 95% confidence intervals (95% CI) were 2.42 (95% CI: 1.99-2.93), 2.41 (95% CI: 1.99-2.92) and 2.39 (95% CI: 1.97-2.89) respectively (Table 1). Additionally there was no evidence that one of the series had significantly biased study findings -for rs3731249 the P-values for Begg's and Egger's tests were 1.00 and 0.35 respectively.
The SNPs rs3731249, rs113650570 and rs36228834 are in linkage disequilibrium (LD) -pairwise D' and r 2 metrics between rs113650570 and rs36228834, 1 and 0.99; between rs113650570 and rs3731249, 0.99 and 0.98; between rs36228834 and rs3731249, 0.99 and 0.98 respectively. Hence, while the strongest association was provided by rs3731249 the three SNPs collectively define a single risk haplotype.
To examine the possibility that the haplotype association might capture additional coding changes in CDKN2A, we sequenced all exons of CDKN2A/CDKN2B in 93 of the UK GWAS cases which were carriers of the risk haplotype. Only one additional coding change in CDKN2A was identified, a heterozygous deletion of G within the 3′ UTR of NM 001195132 (hg19 chr9:g.21968623G> _). Collectively these data make it unlikely that the haplotype association is a consequence of association through a sequence change other than rs3731249, rs113650570 or rs36228834.
rs113650570 and rs36228834 are both non-coding SNPs which localise to intron 1 of CDKN2A-p14ARF (NM_058195.3). Neither of the SNPs reside within strong evolutionary conserved domains (GERP and PhastCon scores -0.67, 0.22 and 3.44, 0.05 respectively; Fig. 1) and the genomic regions do not exhibit regulatory function as defined by histone markers or transcription factor binding. rs3731249 is responsible for the exon 2 p.Ala148Thr change in the CDKN2A protein transcripts p16INK4a (NP_000068; NCBI, http://www.ncbi.nlm.nih.gov/) and predicted protein transcript cdkn2a isoform X3 (XP_005251400.1; Scientific RepoRts | 5:15065 | DOi: 10.1038/srep15065 NCBI). While the genomic position to which rs3731249 maps is not evolutionarily conserved (GERP and PhastCon scores 1.06 and 0.00 respectively), protein modelling using SuSPect 5 software implies that the p.Ala148Thr change is unlikely to be neutral and rs3731249 is predicted to be possibly damaging by PolyPhen (PolyPhen2 score = 0.49, protein accession: P42771-CD2A1_HUMAN). Direct evidence for rs3731249 being functional is provided by transfection studies 6 .
To explore the possibility that somatic loss of the wild-type allele preferentially occurs in the leukemic clones of rs3731249 carriers we examined the relationship between genotype carrier status and CDKN2A and CDKN2B deletion in the leukaemic clones of 43 UKGWAS B-cell and 519 replication series B-cell  (Table 2). There was, however, no evidence for a relationship between rs3731249 genotype and CDKN2A deletion (P = 0.59).
The association between 10q21.2 and 10p14 and risk of BCP-ALL are highly subtype specific 5 . To investigate the relationship between variation in 9p21.3 and ALL further we conducted a stratified analysis using data on 137 ETV6-RUNX1 and 169 B-high hyperdiploid cases and rs3731249. In this analysis there was no evidence that the influence of rs3731249 genotype on ALL risk is subtype specific (Table 3).

Discussion
Here we have provided evidence that the 9p21.3 association for risk of ALL can be ascribed to a recurrent rare high impact variant in CDKN2A. CDKN2A encodes both p16 (INK4A), a negative regulator of cyclin-dependant kinases, and p14 (ARF1), an activator of p53. CDKN2A and CDKN2B are inactivated in a number of different cancers, and mono-or biallelic deletion of CDKN2A are recurrent events in childhood ALL 7 . CDKN2A deletions are generally considered to arise as secondary events and cooperate with the initiating abnormality such as ETV6-RUNX1 fusions to drive disease. CDKN2A/CDKN2B deletions occur in approximately 30% of childhood ALL but are often acquired at relapse where the frequency is 40% [8][9][10] .
The p.Ala148Thr variant is located towards C-terminus of the p16INK4A and is thus not within ankyrin repeats 11 (Supplementary Figure 1)  studies while the variant protein has been shown to bind to CDK4 and CDK6 6 it has shown to be associated with predominant nuclear localisation of CDKN2A transcript. This is consistent with the functional effect of rs3731249 on ALL risk being through compromised ability to inhibit cyclin D within the cytoplasm 6 . Although a slightly higher colony count than the wild type protein has been reported in association with rs3731249, providing an indication of a subtle effect on inhibitory capabilities, the effect was not statistically significant and hence at this juncture the observation needs to with interpreted with caution. While we found no evidence for a relationship between rs3731249 genotype and CDKN2A deletion this does not of course preclude the possibility that the genotype might be associated with allelic imbalances within a subclone. Indeed subclonal heterogeneity in CDKN2A copy number was evident in these data (Supplementary Figure 2) consistent with the published assertion that CDKN2A alterations are secondary events in leukaemogenesis 8 .
In conclusion, we have been able to demonstrate that the 9p21 association for ALL can be ascribed to a rare high impact variant. Given that variation at 9p21.3 has been implicated in risk of other malignancies we assessed the effect of rs3731249 on risk of other cancers with known chr9p21.3 GWAS associations but observed no association between the variant and either glioma (glioblastoma and non-GBM cancers), colorectal, lung cancer or melanoma (MelGeneDB, http://www.melgene.org) suggesting an ALL-specific role for rs3731249 [12][13][14][15] . Further studies are required to determine the precise biological basis of the association. Finally our data also provide an important proof of principle that imputation based on large reference panels such as the UK10K can be used to detect novel, low frequency-large effect associations, thereby extending the utility of pre-existing GWAS data. Quality control of GWAS datasets. DNA samples with GenCall scores < 0.25 at any locus were considered "no calls". Any SNP was deemed to have failed if < 95% of DNA samples generated a genotype at the locus. Cluster plots were manually inspected for SNPs considered for replication. The same quality control metrics on the German GWAS data were applied as in the UK GWAS. We removed individuals aged > 16 years (n = 10); sex discrepancy issues (n = 2) and samples for whom < 95% of SNPs were successfully genotyped (n = 5) (supplemental Fig. 1). We computed identity-by-state (IBS) probabilities for all pairs to search for duplicates and closely related individuals among samples (defined as IBS ≥ 0.80, thereby excluding first-degree relatives). For all identical pairs the sample having the highest call  Linkage disequilibrium (LD) metrics were calculated in PLINK 18 using UK10K genomic data. LD blocks were defined on the basis of HapMap recombination rate, as defined by using the Oxford recombination hotspots, and on the basis of distribution of CIs 24,25 . Prediction of the effects of missence changes was performed using POLYPHEN. Genomic evolutionary rate profiling (GERP) 26 and PhastCons 27 scores were used to assess sequence conservation. SuSPect software (http://www.sbg.bio. ic.ac.uk/suspect/) was used to explore any deleterious effect of coding variant on protein function and structure 5 . The association and features plots were constructed using visPIG sofware 28 (www.vispig.icr. ac.uk). To explore epigenetic profile of association signals, we used chromatin state segmentation data learned by computationally integrating Chip-seq data inferred from ENCODE Histone Modification data (H4K20me1, H3K9ac, H3K4me3, H3K4me2, H3K4me1, H3K36me3, H3K27me3, H3K27ac, and CTCF) and binarized using a multivariate Hidden Markov Model 29 (http://genome.ucsc.edu/ENCODE/). We used RegulomeDB and HaploReg to examine if any of the SNPs annotate putative transcription factor binding/enhancer elements. cBioPortal (http://www.cbioportal.org) was used to explore the protein structure of the variant explored 11 .

Ethics. Collection of samples and clinico-pathological information from subjects was undertaken with
Scientific RepoRts | 5:15065 | DOi: 10.1038/srep15065 9p21.3 (CDKN2A) copy number analysis in ALL. To study the relationship between SNP genotype and 9p21.3 monoallelic/biallelic deletion in tumours we analysed data on NCRI CCLG ALL2003 trial samples cases. Genomic copy number at 9p21.3 was assayed in diagnostic tumour samples using FISH and MLPA as previously described 7,8 .