Introduction

Acute lymphoblastic leukaemia (ALL) is the most common paediatric malignancy, with B-cell precursor ALL (BCP-ALL) accounting for 85% of cases1. BCP-ALL is characterized by recurring somatic chromosomal abnormalities, many of which are important for diagnosis and risk stratification. Many of the translocations in BCP-ALL, such as the t(12;21)(p13;q22)[ETV6-RUNX1] and MLL rearrangements, involve transcriptional regulators of haematopoiesis and are important alterations that precede leukaemogenesis. However, the observation that these abnormalities can be found years before leukaemia onset and are generally not sufficient to induce disease in experimental models2,3 suggests the involvement of additional genetic or epigenetic susceptibility factors, some of which may be germline.

Overall, it has been estimated that the heritability of paediatric BCP-ALL is 24% (ref. 4). Within the last 10 years a number of loci have been identified by genome-wide association study (GWAS), which have shed considerable light on the genetics of BCP-ALL predisposition. These include variants in ARID5B5,6, IKZF1 (ref. 6), CEBPE6, BMI1-PIP42KA7, CDKN2A8, TP63 (ref. 9) and GATA3 (refs 10, 11), all of which have odds ratios (ORs) ranging from 1.23 to 1.91. Although their effect sizes are large relative to those of many of the variants identified for other complex diseases, these susceptibility loci explain only 8% of the genetic contribution to childhood BCP-ALL risk4. In addition, the mechanisms by which these variants contribute to disease remain to be elucidated.

In this study, to discover additional paediatric BCP-ALL risk variants, we perform a meta-analysis of four paediatric ALL GWAS comprising 1,210 cases and 4,144 controls of European ancestry9,12,13,14,15. We discover a locus at 9p21.3 tagged by rs77728904, which is independent from the previously reported risk locus in this region tagged by rs3731217 (ref. 8), and we replicate this association in 977 cases and 1,399 controls. To fine-map the associated region, we perform an association study in African Americans (AAs) and find that of single-nucleotide polymorphisms (SNPs) in this region that are both associated with BCP-ALL in Europeans and correlated with rs77728904, only rs662463 is significant in AAs, suggesting that it may be the disease-associated variant tagged by this locus. Functional analysis demonstrates that rs662463 is a cis-expression quantitative trait locus (cis-eQTL) for CDKN2B and disrupts a transcription factor-binding site (TFBS) for CEBPB, a transcription factor (TF) recurrently mutated in BCP-ALL16. In individuals homozygous for the rs662463 protective allele, CDKN2B expression is significantly correlated with CEBPB expression; however, in individuals carrying one or more risk alleles this correlation is abrogated. Our results suggest that rs662463 is associated with BCP-ALL by modifying the ability of CEBPB to regulate CDKN2B expression. We also investigate the function of rs3731217, the previously reported BCP-ALL-associated variant in this region. We find that rs3731217 is associated with alternative splicing of CDKN2A, suggesting that rs3731217 is associated with BCP-ALL by influencing the messenger RNA stability of the p16 and p14ARF tumour suppressors encoded by this gene. In summary, we conclude that common inherited genetic variation at 9p21.3 is associated with risk for BCP-ALL by modulating the regulation of CDKN2B and CDKN2A expression. Our results not only shed light on genetic variation predisposing towards BCP-ALL, but also suggest a path forward moving from GWAS associations to underlying mechanism.

Results

Identification of a paediatric BCP-ALL locus at 9p21.3

After performing quality control (QC) measures and imputation of four paediatric BCP-ALL data sets17, we interrogated 6,784,687 SNPs with minor allele frequencies (MAFs) >1% in 1,210 cases and 4,144 controls of European ancestry (Supplementary Figs 1–3). Each SNP was assessed for disease association in the discovery data sets using an additive allele-dosage logistic regression model and then meta-analysed. We confirmed previously reported associations between BCP-ALL and IKZF1 at 7p12.2 (refs 5, 6), ARID5B at 10q21.2 (refs 5, 6), CEBPE at 14q11.2 (ref. 6), BMI1-PIP4K2A at 10p12.31-12.2 (refs 7,10,18), CDKN2A at 9p21.3 (ref. 8) and TP63 at 3q28 (ref. 9) (Supplementary Table 1).

We discovered a novel BCP-ALL susceptibility locus at 9p21.3 indexed by rs77728904 (Pdiscovery=1.02 × 10−8, OR=1.71, 95% confidence interval=1.42–2.05; Table 1, Fig. 1, Supplementary Tables 2 and 3, and Supplementary Data 1), which we replicated in an independent set of 977 cases and 1,399 controls of European ancestry (Preplication=6.28 × 10−8; Pcombined=3.32 × 10−15, OR=1.72, 95% confidence interval=1.50–1.97). This risk locus (defined as all 1,000 Genomes Phase 1 SNPs in linkage disequilibrium (LD) with rs77728904 (r2EUR>0.6); Table 1 and Supplementary Data 1) spans CDKN2B and lies within the first eight introns of ANRIL (antisense non-coding RNA in the INK4A locus), a long non-coding RNA that acts as a negative regulator of gene expression19 (Fig. 1b).

Table 1 Association of SNPs in the 9p21.3 locus* with BCP-ALL in individuals of European ancestry.
Figure 1: Meta-analysis results for paediatric BCP-ALL.
figure 1

(a) Manhattan plot of associations for the discovery GWAS of 1,210 cases and 4,144 controls from four independent studies. The red line denotes the threshold for genome-wide significance. Peaks surpassing this threshold are found at the following: 7p12.2 (IKZF1), 10q21.2 (ARID5B) and 9p21.3 (CDKN2). (b) The top panel is the regional LocusZoom plot of the 9p21.3 locus extending 500 kb on either side of the index SNP, rs77728904 (shown in purple), and including all genotyped and imputed SNPs with MAF >0.01. The bottom panel is zoomed in (indicated by the solid black lines between plots) to include only the CDKN2 locus and surrounding recombination peaks. The r2 (visualized by colour) demonstrates that rs77728904 and rs3731217 are in independent linkage blocks between the recombination peaks. r2 for all SNPs in both panels is shown relative to rs77728904. (c) Forest plot of OR with 95% confidence intervals (CIs) for rs77728904 in each of the studies comprising the discovery meta-analysis (Disc (Meta)), the full meta-analysis, the replication analysis and the combined analysis. Squares represent the OR; horizontal lines represent the CI; the solid vertical line represents OR=1; the blue dashed vertical line represents the OR in the combined discovery+replication sets.

rs77728904 is not in LD with rs3731217, the CDKN2A SNP previously reported as associated with BCP-ALL8 (r2EUR=0.015)17, and its association with BCP-ALL remains significant after conditioning on rs3731217 (Pdiscovery-conditional=1.25 × 10−7; Pcombined-conditional=9.34 × 10−13; Table 1). Similarly, rs3731217 retains its significance after conditioning on rs77728904 (Supplementary Table 4). These results demonstrate that the rs77728904-tagged locus is associated with BCP-ALL independently of rs3731217.

The discovery set consisted predominantly of BCP-ALL cases with either an ETV6-RUNX1 rearrangement (t(12;21)) or double trisomy of chromosomes 4 and 10/high hyperdiploid (>50 chromosomes) tumour karyotype (DT/HHD)9,12,13,14. The replication set consisted of a heterogeneous group of BCP-ALL cases, most with neither a t(12;21) translocation nor a DT/HHD tumour karyotype7,14,18. The influence of rs77728904 on BCP-ALL did not differ between BCP-ALL subtypes within the discovery set (Cochran’s Pheterogeneity=0.606; Supplementary Table 5) nor did it differ between the discovery and replication studies (Cochran’s Pheterogeneity=0.858). Thus, rs77728904 appears to influence paediatric BCP-ALL risk irrespective of cytogenetic subtype.

To refine the risk locus tagged by rs77728904, we leveraged the reduced LD in admixed populations by testing the association between BCP-ALL and all 26 SNPs in the rs77728904-tagged risk locus in AAs and Hispanic Americans (HAs) (203/1,363 AA and 391/1,008 HA cases/controls)7. Whereas nearly all SNPs were associated with BCP-ALL in HAs (Pnominal<0.05) and had ORs similar to those in European Americans (EAs), only one variant, rs662463, retained its significance in AAs (Pnominal=0.003; Table 2).

Table 2 Association of SNPs in the 9p21.3 locus* with BCP-ALL in HAs and AAs.

Although we did not detect an association in AAs between BCP-ALL and either rs77728904 or any other SNP in the rs77728904-tagged locus except for rs662463, we were powered to do so; assuming the effect size of each SNP in AAs was the same as in EAs (Powerone-sided=0.98 for rs77728904; Supplementary Table 6). In addition, we observed that there is remarkably lower LD in this region in AAs (based on 1000 Genomes ASW population) as compared with that in EAs (r2rs662463/rs77728904-EUR=0.70 versus r2rs662463/rs77728904-ASW=0.58; Fig. 2 and Supplementary Table 6). To determine whether the LD structure in AAs is consistent with our finding in AAs that rs662463 is associated with BCP-ALL but rs77728904 is not, we generated 1,000 simulated data sets using haplotypes from the 1,000 Genomes ASW panel that exactly matched both (1) the number of cases and controls, and (2) the rs662463 genotype frequencies of the original data set. We found that the association P-value for rs77728904 was less significant in 26% of the simulated data sets than the association P-value for rs77728904 in the original data set. This indicates that it is common to find in AAs that the association between BCP-ALL and rs77728904 is not significant even when the association between BCP-ALL and rs662463 is significant.

Figure 2: Haploview LD map of the 9p21.3 locus showing differences in the LD structure by ancestry in 1,000 Genomes Phase 1 populations.
figure 2

LD is reported as r2 with r2=0.01 white; 0.01<r2<1 shades of grey; r2=1 black. ASW, African American ancestry; CEU, European ancestry; MXL, Hispanic ancestry. rs77728904 is highlighted in orange and rs662463 is highlighted in blue.

Taken together, these results suggest that rs662463 is the causal variant tagged by rs77728904.

Functional analysis of the rs77728904-tagged risk locus

As the functional consequence of variants identified by GWAS is often the regulation of gene expression20 and tissue context is a major determinant of gene expression21, we employed genotype and RNA sequencing (RNA-seq) data in whole blood from the GTEx (Genotype-Tissue Expression) project22, to investigate the association between the rs77728904-tagged risk locus and gene expression. We found that all SNPs in the locus found in GTEx are cis-eQTLs for CDKN2B, with increasing dosage of the risk allele associated with decreasing levels of CDKN2B mRNA (Padditive,rs77728904=6.1 × 10−7 and Padditive,rs662463=8.7 × 10−7; Fig. 3a and Supplementary Table 7). We confirmed that both rs77728904 and rs662463 are cis-eQTLs for CDKN2B in whole blood using an independent data set of genotype and RNA-seq data from 922 individuals of European descent (Padditive,rs77728904<2 × 10−16 and Padditive,rs662463<2 × 10−16)23.

Figure 3: Multi-ethnic and functional analysis suggesting rs662463 is the causative BCP-ALL SNP tagged by the rs77728904-defined locus.
figure 3

(a) eQTL analysis from GTEx22 in whole blood showing the association of CDKN2B expression with rs77728904 and rs662463 genotypes. For both SNPs, the minor allele is the risk allele. Each grey circle represents an individual. Each box plot shows the median rank normalized gene expression (black horizontal line), the first through third quartiles (purple box) and 1.5 × the interquartile range (whiskers). (b) Motifs derived from ChIP data showing the effect of rs662463 on CEBPB binding. The genomic sequence (Ref) surrounding rs662463 is shown below the CEBPB-binding site motif logo for K562 cells from Factorbook27, with the reference protective G-allele boxed and the risk A-allele in red. The CEBPB motif logo represents the position weight matrix (PWM) for each base. The PWM LOD scores calculated by HaploReg24 from TRANSFAC28 for two CEBPB-binding motifs (‘Tr1’ or TRANSFAC accession M00912 and ‘Tr2’ or TRANSFAC accession M00109) including either the protective or the risk allele of rs662463 demonstrate that the risk allele disrupts CEBPB binding. (c) RNA-seq analysis in European ancestry LCLs29, suggesting that the rs662463 genotype influences the correlation between CDKN2B and CEBPB expression, and that the rs662463 risk allele is associated with lower CDKN2B expression. Shown are the best-fit lines overall (black line), for LCLs homozygous for the protective allele (blue line) and for LCLs with at least one copy (one copy: green circles and two copies: red circles) of the risk allele (green line). (d) RNA-seq analysis in whole blood from an independent set of European ancestry individuals23, demonstrating that the rs662463 genotype significantly influences the correlation between CDKN2B and CEBPB expression. The rs662463 risk allele attenuates this correlation and is associated with lower CDKN2B expression. Shown are the best-fit lines overall (black line), for individuals homozygous for the protective allele (blue line) and for individuals with at least one copy (one copy: green circles and two copies: red circles) of the risk allele (green line).

Data from HaploReg24 and RegulomeDB25 show that the entire 9p21.3 locus is rife with regulatory motifs; a complete list of functional annotation has been provided in Supplementary Data 2–4. Of SNPs in the locus, rs662463 is not only a cis-eQTL for CDKN2B but it is also the most functionally compelling variant based on ENCODE annotation26: by DNase sequencing, it is in a DNase hypersensitivity peak in 72 cell lines; by chromatin immunoprecipitation sequencing (ChIP-seq), it is bound by 14 TFs; and by position-weight matrix score, the presence of the risk allele is predicted to disrupt four of the nine TFBSs in which rs662463 resides (CEBPB, HNF1, SOX and P300)26.

To explore the role of rs662463 in the proper tissue context, we evaluated ENCODE ChIP-seq data from lymphoblastoid cell lines (LCLs). We found that rs662463 lies within chromatin marked by H3k9me3, H3k27me3, H4k20me1, H3k4me1, H3k4me3 and H2az modifications, suggesting it resides in a TFBS motif regulating CDKN2B expression in BCPs. Intriguingly, the protective G-allele of rs662463 (associated with higher CDKN2B expression) is part of a ChIP-seq-validated TFBS for CEBPB (CCAAT/enhancer-binding protein β), a TF important for haematopoietic differentiation that is mutated in some BCP-ALL patients16, and substitution by the risk A-allele (associated with lower CDKN2B expression) disrupts this binding motif (Fig. 3b)26,27,28.

This confluence of genetic and functional evidence led us to hypothesize that rs662463 regulates CDKN2B expression by modifying CEBPB signalling. If so, we surmised that CDKN2B and CEBPB expression would be correlated in an rs662463-genotype-dependent manner, with the correlation decreasing with increasing risk allele dosage. To investigate the modifying effect of rs662463 genotype on the coexpression of CEBPB and CDKN2B, we performed a likelihood ratio test (LRT) using genotype and RNA-seq data from 358 European-ancestry LCLs29. We observed a positive correlation between the expression of these genes (Spearman’s correlation r=0.22, PSpearman=4.3 × 10−9). Although we did not find statistically significant evidence for genotype-dependent coexpression (PLRT,one-sided=0.12), the correlation between CDKN2B and CEBPB expression was consistent with our hypothesis, exhibiting greater correlation in the homozygous protective genotype group than in the risk allele-containing group (rhom prot=0.24 and rrisk allele+=0.15; Fig. 3c, Supplementary Fig. 4 and Supplementary Data 5). We then investigated rs662463-genotype-dependent coexpression between CDKN2B and CEBPB in whole blood using the same set of 922 European-ancestry individuals employed to confirm that rs662463 is a cis-eQTL for CDKN2B23. In this data set, we found that CDKN2B and CEBPB expression was significantly correlated in individuals homozygous for the rs662463 protective allele (r=0.08, PSpearman=0.02), but not in individuals with one or more risk alleles (r=−0.06, PSpearman=0.41). Moreover, the difference in correlation between these groups was significant (PLRT,one-sided=0.04; Fig. 3d). Thus, these results from two independent European ancestry data sets provide evidence, suggesting that CDKN2B and CEBPB expression is correlated in an rs662463-dependent manner, with the rs662463 risk allele associated with lower CDKN2B expression. In 326 African-ancestry LCLs, however, we found neither a significant correlation between CDKN2B and CEBPB expression (PSpearman=0.42) nor a significant influence of rs662463 genotype on their coexpression (PLRT,two-sided=0.63; Supplementary Data 6)30.

In European-ancestry LCLs, CDKN2B expression was also correlated with the expression of other TFs with TFBSs containing SNPs in the 9p21.3 locus, many with roles in lymphopoiesis. We investigated the coexpression between CDKN2B and all of these TFs (Supplementary Data 5). We found that only JUND, a component of the AP-1 TF, whose canonical TFBS is only two bases away from that of CEBPB, was nominally significantly coexpressed with CDKN2B in a genotype-dependent manner. Surprisingly, the SNP mediating this coexpression was again rs662463 (PLRT,one-sided=0.004; Supplementary Fig. 4). As JUND expression was not measured in Africans, we investigated coexpression between CDKN2B and cJUN, a JUND family member whose expression is highly correlated with that of JUND in Europeans (r=0.54, PSpearman<10−12). We found that although cJUN and CDKN2B expression was correlated in Africans in an rs662463-genotype-dependent manner (PLRT,two-sided=0.02), the risk allele was associated with higher CDKN2B expression, which is inconsistent with our results in European ancestry individuals and our eQTL results. In addition, in the European-ancestry whole blood data set we found that although JUND and CDKN2B expression was significantly correlated (r=−0.09, PSpearman=0.0058), there was no evidence that this correlation was modified by rs662463 genotype (PLRT,two-sided=0.67). Coexpression results between CDKN2B and all TFs with TFBSs containing SNPs in the p21.3 locus in African-ancestry LCLs are presented in Supplementary Data 6.

Taken together, our integrated genetic and functional analysis of multi-ethnic GWAS, eQTL and TF binding data suggests a model whereby rs662463 predisposes to BCP-ALL by regulating CDKN2B expression through CEBPB signalling. In individuals of European ancestry homozygous for the rs662463 protective G-allele, CEBPB signalling positively regulates CDKN2B expression. In European-ancestry individuals harbouring at least one risk A-allele, however, CEBPB binding is disrupted, thereby attenuating the influence of CEBPB signalling on CDKN2B transcriptional regulation and resulting in lower levels of CDKN2B expression. These data underscore the complexity of the relationship between this locus and BCP-ALL, and also highlight possible ancestry-based differences in molecular aetiology.

Functional analysis of rs3731217

We also investigated the functional basis for the association between BCP-ALL and rs3731217, the SNP intronic to CDKN2A at 9p21.3 previously found to be associated with BCP-ALL. Consistent with other studies8, we found no evidence that rs3731217 is a cis-eQTL for CDKN2A. However, we observed that the alleles of this SNP create two overlapping cis-acting intronic splice enhancer motifs31 (CCCAGG and CAGTAC; Fig. 4a and Supplementary Fig. 5), suggesting rs3731217 may regulate alternative splicing of CDKN2A. We assessed the relationship between CDKN2A exon usage and rs3731217 using genotype and RNA-seq data from 501 LCLs29. We found that the minor G-allele, which is protective against BCP-ALL8, is associated with increased expression of exon 3, containing the 3′-untranslated region for this gene (Padditive=0.01; Fig. 4b). CDKN2A encodes three proteins that function as tumour suppressors: p16INK4a and p16γ, which regulate cell cycle progression32,33, and p14ARF, which stabilizes p53 by binding MDM2 (ref. 34). As stable translation is dependent on the presence of a 3′-untranslated region35, these data may suggest increasing dosage of the rs3731217 G-allele is associated with protection against BCP-ALL by promoting transcript isoforms of CDKN2A containing exon 3, resulting in higher protein levels of the tumour suppressors encoded by this gene.

Figure 4: Functional analysis demonstrating rs3731217 is associated with CDKN2A exon 3 usage.
figure 4

(a) Cartoon showing the major protein isoforms encoded by CDKN2A: p14ARF (blue arrows), p16γ (red arrows) and p16INK4a (black arrow). The four main exons are labelled with exon 3 surrounded by a red box. rs3731217 is located in two overlapping intronic splicing elements between exon 1α and exon 1β. The image is modified from AceView70. (b) RNA-seq data from LCLs correlating exon usage with rs3731217 genotype showing that exon 3 usage is significantly associated with the protective G-allele (0, 1 and 2 on the x axis refer to G-allele dosage). Each box plot shows the median rank normalized gene expression (black horizontal line), the first through third quartiles (box) and 1.5 × the interquartile range (whiskers).

Discussion

It has long been recognized that CDKN2 at chromosome 9p21.3 is a common site for somatic mutation acquisition in BCP-ALL36,37,38. It was also shown in recent times that a germline variant in CDKN2A (rs3731217) influences risk for BCP-ALL. Here we report an independent risk locus tagged by rs77728904 implicating CDKN2B, which encodes the tumour suppressor p15, in the aetiology of BCP-ALL.

Using independent but complementary genetic and bioinformatic/functional strategies, our results suggest that rs662463 is the causal BCP-ALL variant tagged by the locus we discovered. Furthermore, our results imply that the rs662463 risk allele influences CDKN2B expression by disrupting a TFBS for CEBPB, although we observed evidence for this mechanism only in individuals of European ancestry. Thus, we speculate that the increased risk for BCP-ALL among individuals with the rs662463 risk allele results from diminished levels of the p15 tumour suppressor as compared with individuals with the protective allele, as a consequence of attenuated CEBPB signalling. Undoubtedly, many factors influence CDKN2B expression, making it unlikely that there would be a strong correlation between any one factor and CDKN2B expression. Hence, observing a consistent—albeit modest—rs662463 genotype-dependent relationship between CEBPB and CDKN2B expression in two independent data sets is compelling and deserving of further study.

Although GWAS results identify regions associated with disease, it is often challenging to progress from tagging SNP to causal variant, thereby limiting opportunities to investigate the biological basis for the association. Our results suggest that testing for genotype-dependent coexpression between TFs and eQTL target genes may be an efficient method to prioritize candidate variants, thereby facilitating hypothesis-driven functional follow-up studies.

We also obtained evidence to suggest that the previously reported BCP-ALL risk variant in this region, rs3731217, is associated with BCP-ALL by regulating alternative splicing of CDKN2A, which we hypothesize is associated with differences in the translation of the p16 and p14ARF tumour suppressors encoded by this gene. There are several examples of somatic mutations that disrupt the normal regulation of splicing in cancer39,40. Likewise, rare germline mutations in CDKN2A have been found that disrupt normal splicing of this gene and are associated with familial melanoma41,42.

Thus, our results reveal previously unsuspected complexity between the association of BCP-ALL and the CDKN2 locus. The remarkable confluence of germline susceptibility variants and somatic mutations point towards the central importance of this region in the aetiology of this disease and warrant further investigation of its role in the regulation of benign and malignant haematopoiesis.

Methods

Discovery analysis data sets

The discovery meta-analysis was performed using data from four GWAS consisting of 1,210 paediatric BCP-ALL cases and 4,144 controls of European ancestry (Supplementary Figs 1 and 2)9,12,13,15. The GWAS data sets are referred to as the ‘9,904-GAIN’ data set, the ‘Aus–French’ data set and the ‘German data set’, and are described in detail below.

Cases for the 9,904-GAIN data set were 568 children of European ancestry with BCP-ALL treated on The Children’s Oncology Group (COG) P9904 protocol ( https://members.childrensoncologygroup.org/Mtg/bookreports/Denver/reports/9904_Fall07SPR_FinalReport.pdf)14,43. All were diagnosed between the ages of 1 and 9 years (median=3.8 years of age; mean=4.2 years of age; s.d.=1.8 years). Three hundred and twenty cases were males and 248 were females. Two hundred and seventy-five cases (48%) had the t(12;21) ETV6-RUNX1 translocation tumour karyotype and 293 (52%) had the double DT tumor karyotype. Controls were obtained separately and included 1,014 healthy individuals (464 males and 550 females) from the Genetic Association Information Network (GAIN) Consortium schizophrenia study cohort44. Permission to use this data set was obtained from dbGaP (phs000021.v1.p1). Cases and controls were genotyped separately by COG and the Broad Institute, respectively, using the Affymetrix Genome-Wide Human SNP Array 6.0 (Santa Clara, CA) and called using the Birdseed-v2 algorithm45. We performed stringent QC before imputation (described below). Twenty individuals (20 cases and 0 controls) failed genotyping (missingness>2%) and 16 individuals (16 cases and 0 controls) were removed for high inbreeding coefficients (|F|>0.05). One hundred and fifty-one population outliers (95 cases and 56 controls) were also removed. Following QC, 437 cases and 958 controls remained and were included in the discovery meta-analysis.

The French data set, described previously, consisted of 223 paediatric BCP-ALL cases and 1,542 controls ascertained through the ESCALE (Etude Sur les Cancers et les Leucémies de l’Enfant) study12,13. Cases were all diagnosed before age 15 years and were identified through the French National Registry of Childhood Hematopoietic Malignancies. Eighty-one cases (36%) were of the ETV6-RUNX1 molecular subtype, 123 cases (55%) were of the HHD subtype and 21 (9%) cases were hyperdiploid (47–50 chromosomes). Two cases had both the ETV6-RUNX1 rearrangement and were HHD. Controls were healthy French adults of European descent from the SU.VI.MAX study46. Cases were genotyped using the Illumina Infinium Human CNV370-Quad BeadChip (San Diego, CA) and controls were genotyped using the Illumina Infinium Human-Hap300 (317 K) BeadChip, both by the Centre National du Génotypage13.

The Australian data set consisted of 142 paediatric BCP-ALL cases and 1,229 controls obtained through the Aus–ALL Consortium, as described12,15. Fifty-two cases (37%) were of the ETV6-RUNX1 molecular subtype, 28 cases (20%) were of the DT subtype and the remaining 62 (43%) cases had >46 chromosomes. All controls and 67 cases were genotyped using the Illumina Infinium Human610-Quad BeadChip; 75 cases were genotyped using the Illumina Infinium Human CNV370-Quad BeadChip.

As the French and Aus–ALL data sets are both part of the Childhood Leukemia International Consortium47, we merged them to create the combined ‘Aus–French’ data set, from which 11 cases and 59 controls were removed following QC before imputation (described below). The remaining 354 cases (129 ETV6-RUNX1, 146 HHD, 21 hyperdiploid and 60 >46 chromosomes) and 2,712 controls were included in the discovery meta-analysis.

The German data set consisted of 491 paediatric BCP-ALL cases, all with the ETV6-RUNX1 rearrangement, and 483 cancer-free controls, all of European ancestry9. Seventy-two cases and 9 controls were removed as previously described9. The remaining 419 cases and 474 controls were included in the discovery meta-analysis. Cases were obtained through the Austrian-German-Italian-Swiss multicentre clinical trial AIEOP-BFM ALL 2000 and controls were obtained through the German popgen biobank9,48. Cases and controls were genotyped together as part of the German National Genome Research Network GWAS initiative by Affymetrix (South San Francisco, CA) using the Affymetrix Genome-Wide Human SNP Array 5.0. Additional QC was performed as described below.

This study was approved by the Institutional Review Board of The University of Chicago.

Discovery GWAS QC and imputation

Before imputation, we performed the following QC measures in PLINK49. Samples were removed based on the following: (a) call rate <0.98, (b) ambiguous gender assignment, (c) cryptic relatedness (pi-hat >0.125) and (d) absolute value of the inbreeding coefficient >0.05. SNPs were excluded based on the following: (a) MAF <0.01 (<0.05 for the German data set), (b) genotype missingness rate >0.05 (>0.03 for the German data set), (c) deviation from Hardy-Weinberg equilibrium (PHWE<1 × 10−4), (d) differential missingness between cases and controls (Pmissingness<0.05), (e) ambiguous strand assignment and (f) failure to resolve reference genome assembly ambiguities in LiftOver50. Population outliers were identified and removed using principal component analysis implemented through EIGENSTRAT51, to ensure that all remaining cases and controls were well matched and of European ancestry. A flowchart detailing these QC measures is provided in Supplementary Fig. 1.

To improve the genome-wide coverage of SNPs for association testing, we performed imputation separately for each discovery data set (cases and controls together) using IMPUTE2 v2.1.2.3 (ref. 52) and the 1,000 Genomes Project17 Phase 1 v3 reference panel (all ancestries panel, build 37, released March, 2012). Pre-imputation data sets were updated to build 37 using LiftOver50. Sample haplotypes were phased by chromosome before imputation using SHAPEIT53. Poorly imputed SNPs, defined by info scores <0.3, were excluded from further analysis. The number of SNPs successfully imputed for each study is shown in Supplementary Fig. 2.

Discovery association analysis

Following imputation, we analysed the association of SNPs with BCP-ALL in SNPTEST54 separately for each of the three discovery data sets with a frequentist additive missing data likelihood score test (SNPTEST-method score), which uses the genotype posterior probabilities generated during imputation to account for uncertainty in the genotypes. To control for residual population stratification in each data set, significant principal components (PCs) inferred by EIGENSTRAT51 were included as covariates for association testing (PC1 for 9,904-GAIN, PC1 for Aus–French and PC1–PC2 for German). Following association testing in SNPTEST54, SNPs were excluded from analysis based on the following: (a) P=not available (NA), which occurs when a model cannot be fit to the data; (b) info score for β<0.3; (c) MAF <0.01; and (d) deviation from HWE (PHWE<1 × 10−4).

Discovery meta-analysis

Only SNPs passing genotyping and imputation QC and shared among all three data sets following post SNPTEST54 QC were included in the meta-analysis (Supplementary Fig. 2). The meta-analysis was performed using a fixed-effects model with inverse variance weighting, as implemented in METAL55. SNPs were excluded based on high heterogeneity among data sets (I2>50%)56. Genome-wide significance was defined as P<5 × 10−8. We assessed inflation of the meta-analysis test statistics by examining quantile–quantile plots of the expected and observed distributions of P-values and by estimating λGC57 (Supplementary Fig. 3). We found no evidence for systematic genotyping differences between cases and controls.

Regions associated with BCP-ALL were visualized using LocusZoom58 by plotting the −log10 of the meta-analysis P-values and local recombination rate. LD patterns were plotted with Haploview59 using LD estimates from the 1,000 Genomes Phase 1 CEU, ASW and MXL populations17 (NCBI build 37 assembly).

Fine-mapping of the 9p21.3 locus

We mapped the association with BCP-ALL of all SNPs within 500 kb of the index SNP rs77728904. The significance of association dropped precipitously as a function of r2 relative to rs77728904. Consequently, we chose to replicate and further analyse the SNPs listed in HaploReg24 in LD (r2EUR>0.60 in the 1,000 Genomes Phase 1 data set) with rs77728904 (Table 1). Results for all SNPs with r2>0.20 relative to rs77728904 are reported in Supplementary Data 1.

Replication set analysis

The association between paediatric BCP-ALL and the novel 9p21.3 locus was validated in an independent replication set of 977 cases and 1,399 controls of European ancestry. Cases comprise two cohorts of paediatric BCP-ALL patients: one consisting of patients enroled on the COG-P9905 protocol14 (‘9,905 cohort’) genotyped on the Affymetrix 6.0 array (520 cases) and the other consisting of patients enroled on either the St Jude Total Therapy XIIIB/XV (‘SJ’) protocol or the COG-P9906 protocol genotyped on the Affymetrix GeneChip Human Mapping 500K array and collectively referred to as the ‘SJ cohort’ (457 cases)7,18. Controls were from the Multi-Ethnic Study of Atherosclerosis (MESA) study (dbGaP phs000209.v9)60 genotyped on the Affymetrix 6.0 array. Permission to the use the MESA study was obtained from dbGaP.

Variants not directly genotyped were imputed using IMPUTE2 v2.1.2.3 (ref. 52) as described above. Association was tested in PLINK using a logistic regression model that fit an additive allele-dosage model for the SNP along with the top four eigenvectors from principal component analysis. SNPs with one-sided P<0.05 and the same risk allele as in the discovery set were considered validated.

Multi-ethnic set analysis

To identify the functional SNP tagged by rs77728904, we investigated the patterns of association between paediatric BCP-ALL and the 9p21.3 locus in patients of HA and AA ancestry. AA cases consisted of 203 individuals from the SJ (n=114), 9,904 (n=38) and 9,905 (n=51) data sets, as previously described7, and 1,363 AA controls from the MESA data set60. HA cases consisted of 391 individuals from the SJ (n=86), 9,904 (n=123) and 9,905 (n=182) data sets, as previously described7, and 1,008 HA MESA controls. Ancestry for both cases and controls was determined using STRUCTURE61,62 (version 2.2.3) with HapMap CEU, YRI, CHB/JPT and indigenous Native Americans63 as reference populations. European ancestry was defined as >95% CEU, African ancestry was defined as >70% YRI and Hispanic ancestry as >10% Native Americans>YRI.

Genotyping platforms are specified above and imputation was performed as described. Association testing was performed in PLINK, using an additive allele-dosage logistic regression model that fit genetic ancestries, defined by the first four principle components. SNPs with one-sided P<0.05 and the same risk allele as in the discovery set were considered validated.

Conditional tests of independence

To assess the independence of the association signal for SNPs in the rs77728904-tagged locus from the previously reported CDKN2A SNP, rs3731217 (ref. 8), in the discovery cohort, we used SNPTEST54 (-condition_on) to test each of the 26 SNPs in the novel locus for association with BCP-ALL conditional on rs3731217. For the replication set (Table 1) and admixed populations (Table 2), we performed logistic regression in PLINK49 using rs3731217 as a covariate.

We also tested the association of rs3731217 with BCP-ALL using SNPTEST54 (-condition_on) when conditioning on either the index SNP (rs77728904) or rs662463. These analyses were performed separately for each of the three discovery data sets and also for the discovery meta-analysis (Supplementary Table 4).

Simulation analysis of the 9p21.3 locus in AAs

Of all SNPs significant in the 9p21.3 locus in EAs, only rs662463 was significant in AAs (Prs662463=0.003 versus Prs77728904=0.403). To confirm that this difference in significance was consistent with the LD structure of the region in AAs, we performed simulations (Supplementary Table 6). To create each simulated data set, we sampled without replacement rs77728904-rs662463 haplotypes from the 122 individuals in the 1,000 Genomes ASW sample until 203 case and 1,363 control haplotype pairs were collected. The sampling was performed, to generate the same rs662463 genotype counts as in the original 203 AA cases and 1,363 AA controls. We then tested the association between rs77728904 and BCP-ALL in the simulated data set using a χ2-test of homogeneity. We repeated the sampling and association testing 1,000 times and recorded the P-values. We then estimated the probability of observing an association P-value at rs77728904 with the same or greater significance as observed in our AA sample.

Cis-eQTL analysis of the 9p21.3 locus

To determine whether SNPs in the 9p21.3 locus were cis-eQTLs for CDKN2B or other local genes, we queried the GTEx project database22 (Supplementary Table 7). Within GTEx, we accounted for tissue context by investigating eQTLs in whole blood22. To confirm our findings, we replicated our observation that rs77728904 and rs662463 were cis-eQTLs for CDKN2B in an independent data set of eQTLs in whole blood from 922 individuals23. All individuals used in the cis-eQTL analysis were of European ancestry.

Bioinformatics analysis of the 9p21.3 locus

To explore the function of SNPs in the 9p21.3 locus, we used the web-based Haploreg24 and RegulomeDB25,64 resources that integrate SNP frequency and LD information from 1,000 Genomes Phase 1 individuals and functional annotation from a variety of biological databases, including motif instances and enhancer annotations from the ENCODE26 and Roadmap Epigenetics65 Projects.

CDKN2B and CEBPB coexpression analysis

Our genetic results suggested that rs662463 may be the causal variant tagged by the 9p21.3 locus. As rs662463 is a cis-eQTL for CDKN2B and is a critical residue in a number of TFBSs (Supplementary Data 2–4), we hypothesized that it was associated with BCP-ALL by regulating CDKN2B expression in response to TF signalling. To test this, we first used the Haploreg24 and RegulomeDB25,64 databases to determine the set of TFs with TFBSs containing rs662463, whose expression was correlated with CDKN2B expression, and we then assessed the dependence of this coexpression on rs662463 genotype.

Of the nine TFs with binding motifs containing rs662463, the binding affinity of four (CEBPB, HNF1, SOX and P300) was substantially altered by rs662463 genotype status based on position-weight matrix analysis and ENCODE ChIP-seq data26 (Supplementary Data 2). We employed RNA-seq data in LCLs derived from 358 individuals of European ancestry (1,000 Genomes Phase 1 populations CEU, TSI, FIN and GBR)29 and Illumina expression array data derived from LCLs for 326 individuals of African ancestry (HapMap populations YRI, MKK and LWK)30, to determine whether the expression of any of these TFs was correlated with CDKN2B expression in Europeans and in Africans. We assessed the significance of the correlation using a t-test based on Pearson’s correlation coefficient. We found that only CEBPB expression was correlated with CDKN2B expression, but only in Europeans (Spearman’s correlation r=0.22, PSpearman=4.3 × 10−9; Supplementary Data 5 and 6) and not in Africans.

To determine whether rs662463 genotype status altered the relationship between CDKN2B and CEBPB expression, we used the European and African expression data sets described above, as well as genotype data from the 1,000 Genomes Project17 and HapMap 3 (ref. 66). We assessed separately in Europeans and Africans whether the correlation between the CDKN2B and CEBPB expression differed by rs662463 genotype class (0 risk alleles versus ≥1 risk allele) by fitting two competing linear regression models: the first model regressed CDKN2B expression on CEBPB expression and rs662463 genotype class (the reduced model), and the second model regressed CDKN2B expression on CEBPB expression, rs662463 genotype class and the interaction between CEBPB expression and rs662463 genotype class (the full model). The interaction term in the full model allows different correlations between CDKN2B and CEBPB in each rs662463 genotype class, whereas the reduced model requires the correlation to be the same in both genotype classes. Expression levels were scaled to have mean 0 and variance 1, to allow regression parameters to be interpretable as a difference in correlation. An LRT was used to determine whether the full model fit the data significantly better than the reduced model.

As the homozygous risk genotype class for rs662463 (and all other SNPs in the 9p21.3 locus) is rare, the full model includes only a single interaction term that estimates the difference in the CDKN2BCEBPB correlation between the homozygous protective and the risk allele-containing genotype classes. We used these estimates, their s.e. and the inverse variance weighting method, to calculate the population-averaged difference in correlation coefficients between the two genotype classes and to perform population-specific meta-analyses55. The results of this analysis are summarized in Supplementary Data 5 and 6.

To reproduce our findings, we repeated this analysis in a second independent data set. We used the same data set of 922 individuals of European ancestry previously employed to confirm that SNPs in the 9p21.3 locus were cis-eQTLs for CDKN2B. Here, rs662463 was directly genotyped, and CDKN2B and CEBPB expression was measured by RNA-seq in whole blood23. Tests for genotype-dependent coexpression were performed as described.

Coexpression between CDKN2B and other TFs in the 9p21.3 locus

Using the EA and AA LCL data sets described above29,30, we also investigated genotype-dependent coexpression between CDKN2B and every TF that either: (1) possessed a binding motif (defined by JASPER67 or TRANSFAC28) containing an SNP in the rs77728904-tagged locus or (2) was confirmed by ChIP-seq to bind to a sequence containing an SNP in the locus (Supplementary Data 3 and 4). For the European RNA-seq data, we also required the RPKM (reads per kilobase of transcript per million mapped reads) value68 to be >0.01 in >75% of samples for each TF investigated.

We calculated the Spearman’s correlation coefficient between CDKN2B expression and the expression of each TF, and assessed the significance of these correlation coefficients using both an asymptotic t approximation and bootstrapping. For each TF-SNP pair, we used an LRT comparing the full and reduced models, as described above, to determine whether the correlation between CDKN2B expression and TF expression varied by SNP genotype in either EAs or AAs.

Functional analysis of rs3731217

We examined the functional significance of rs3731217 using RNA-seq data for LCLs from the GEUVADIS Consortium29 by testing for an association between rs3731217 genotype and exon usage using an additive model. The analysis was conducted in 501 LCLs of European ancestry from the 1,000 Genomes CEU, FIN, GBR and TSI populations. Linear regression was performed on log2-transformed RNA expression data and the significance of the association between the SNP genotype and expression was assessed using a t-test.

Statistical analysis

Except when specified, all analyses were performed using R statistical software (version 2.15.1)69.

Additional information

How to cite this article: Hungate, E. A. et al. A variant at 9p21.3 functionally implicates CDKN2B in paediatric B-cell precursor acute lymphoblastic leukaemia aetiology. Nat. Commun. 7:10635 doi: 10.1038/ncomms10635 (2016).