Genome-wide association analysis implicates dysregulation of immunity genes in chronic lymphocytic leukaemia

Several chronic lymphocytic leukaemia (CLL) susceptibility loci have been reported; however, much of the heritable risk remains unidentified. Here we perform a meta-analysis of six genome-wide association studies, imputed using a merged reference panel of 1,000 Genomes and UK10K data, totalling 6,200 cases and 17,598 controls after replication. We identify nine risk loci at 1p36.11 (rs34676223, P=5.04 × 10−13), 1q42.13 (rs41271473, P=1.06 × 10−10), 4q24 (rs71597109, P=1.37 × 10−10), 4q35.1 (rs57214277, P=3.69 × 10−8), 6p21.31 (rs3800461, P=1.97 × 10−8), 11q23.2 (rs61904987, P=2.64 × 10−11), 18q21.1 (rs1036935, P=3.27 × 10−8), 19p13.3 (rs7254272, P=4.67 × 10−8) and 22q13.33 (rs140522, P=2.70 × 10−9). These new and established risk loci map to areas of active chromatin and show an over-representation of transcription factor binding for the key determinants of B-cell development and immune response.

C hronic lymphocytic leukaemia (CLL) is an indolent B-cell malignancy that has a strong genetic component, as evidenced by the eightfold increased risk seen in relatives of CLL patients 1 . Our understanding of CLL genetics has been transformed by genome-wide association studies (GWAS) that have identified risk alleles for CLL [2][3][4][5][6][7][8][9] . So far, common genetic variation at 33 loci has been shown to influence CLL risk. Although projections indicate that additional risk variants for CLL can be discovered by GWAS, the statistical power of the individual existing studies is limited.
To gain a more comprehensive insight into CLL predisposition, we analysed genome-wide association data from populations of European ancestry from Europe, North America and Australia, identifying nine new risk loci. Our findings provide additional insights into the genetic and biological basis of CLL risk.

Results
Association analysis. After quality control, the six GWAS provided single-nucleotide polymorphism (SNP) genotypes on 4,478 cases and 13,213 controls (Supplementary Tables 1 and 2). To increase genomic resolution, we imputed 410 million SNPs using the 1000 Genomes Project 10 combined with UK10K 11 as reference. Quantile-Quantile (Q-Q) plots for SNPs with minor allele frequency (MAF) 40.5% post imputation did not show evidence of substantive overdispersion (l between 1.00 and 1.10 across the studies; Supplementary Fig. 1). Meta-analysing the association test results from the six series, we derived joint odds ratios per-allele and 95% confidence intervals under a fixedeffects model for each SNP and associated P values. In this analysis, associations for the established risk loci were consistent in direction and magnitude of effect with previously reported studies ( Fig. 1 Table 3).
Several of the newly identified risk SNPs map in or near to genes with established roles in B-cell biology, hence representing credible candidates for susceptibility to CLL. The 4q24 association marked by rs71597109 (Fig. 2) Figure 1 | Manhattan plot of association P values. Shown are the genome-wide P values (two-sided) of 410 million successfully imputed autosomal SNPs in 4,478 cases and 13,213 controls from the discovery phase. Text labelled in red are previously identified risk loci, and text labelled in blue are newly identified risk loci. The red horizontal line represents the genome-wide significance threshold of P ¼ 5.0 Â 10 À 8 .
upstream of MDS2 (Fig. 2), which is the fusion partner of ETV6 in t(1;12)(p36;p13) myelodysplasia. Based on RNA sequencing (RNA-seq) data from patients, MDS2 is overexpressed in CLL versus normal cells and also differentially expressed between two experimentally determined CLL subgroups 14 . The SNP rs57214277 maps to 4q35.1 and resides B140 kb centromeric to IRF2 (interferon regulatory factor 2, Fig. 2). Interferon (IFN)ab, a family of antiviral immune genes, induces IRF2 that inhibits the reactivation of murine gamma herpesvirus 15 . Furthermore, SNPs in strong LD with rs57214277 are associated with increased expression of IRF2 as well as trans-regulation of a network of genes in lipopolysaccharide and IFNg-treated monocytes 16 . rs140522 maps to 22q13.33 (Fig. 2), which has previously been associated with multiple sclerosis risk 17 . This region of LD contains four genes, of which only NCAPH2 (non-SMC condensin II complex subunit H2) shows differential expression between CLL and normal B cells 14 (B2.5-fold lower levels in CLL), and plays an essential role in mitotic chromosome assembly and segregation. rs41271473, rs3800461, rs61904987 and rs1036935 mark genes that have roles in WNT signalling (RHOU), autophagy (C6orf106), transcriptional activation (CXXC1), kinetochore association (SKA1, ZW10) and protein degradation (USP28, TMPRSS5; Fig. 3).
New CLL risk SNPs and clinical phenotype. We tested for differences in the associations by sex or age at diagnosis for each of the nine risk SNPs using case-only analysis, and observed no relationships (Supplementary Data 1). In addition, case-only analysis in a subset of studies provided no evidence for associations between risk SNP genotypes and IGVH (immunoglobulin variable region heavy chain) mutation subtype (Supplementary Data 1) or overall patient survival (Supplementary Table 7). Collectively, these data suggest that  Supplementary Fig. 2). We also examined ATAC-seq profiles from CLL samples and primary B cells as a marker of chromatin accessibility 19,20 . Since the strongest associated GWAS SNP may not represent the causal variant, we examined signals across an interval spanning all variants in LD r 2 40.2 with the sentinel SNP (based on the 1000 Genomes EUR reference panel). These data revealed regions of active chromatin state at all nine risk loci, in at least one of the cell types. Furthermore, based on the analyses of Hnisz et al. 21 , five of the loci fall within regions designated as 'super enhancers' in either LCLs or CD19 B cells ( Supplementary Fig. 2). Overall, these findings suggest that the risk loci annotate regulatory regions and may, therefore, have an impact on CLL risk through modulation of enhancer or promoter activity. Given the possibility that SNPs might influence enhancer or promoter activity by causing changes in transcription factor (TF) binding, we next evaluated the SNPs at each GWAS locus based on their overlap with TF-binding sites. In the absence of comprehensive TF chromatin immunoprecipitation sequencing (ChIP-Seq) data from CLL samples, we used regions of chromatin accessibility defined by ATAC-seq data 19 as a surrogate marker for TF binding, identifying 47 SNPs in LD r 2 40.2 with the sentinel SNPs that also overlapped ATAC-seq peaks. Using motifbreakR 22 to predict whether these SNPs might disrupt TFbinding motifs, we found 478 potentially disrupted motifs, corresponding to 349 TF-binding sites (Supplementary  Table 7). Moreover, at 10 of the SNPs, the altered motif matched the TFs bound in ChIP-seq data from the ENCODE project (Supplementary Table 8 and Supplementary Fig. 3). In particular, we noted that rs13149699 at 4q35 (r 2 ¼ 0.83 with lead SNP rs57214277) was predicted to disrupt SPI1 binding. In addition, rs13149699 showed evidence of evolutionary constraint, and in LCL ChIP-seq data, the SNP was bound by SPI1 as well as other TFs with roles in B-cell function including IRF4, PAX5, POU2F2 (alias OCT2) and RELA (Supplementary Table 8).
We explored whether there was any association between the genotypes of the nine new risk SNPs and the transcript levels of genes within 1 Mb of each respective variant by performing expression quantitative trait loci (eQTL) analysis using gene expression profiles of 468 CLL cases. In addition, we interrogated publicly accessible expression data on whole blood and LCLs (Supplementary Data 2). There were significant (false discovery rate (FDR)o0.05) and consistent eQTLs between rs3800461 and C6orf106, rs1036935 and SKA1, rs140522 and ODF3B, and rs140522 and TYMP.
Biological inference of all CLL risk loci. Given our observation that the nine novel risk loci annotate putative regulatory regions, we sought to examine the epigenetic landscape of CLL risk loci on a broader scale, evaluating the enrichment of both histone    23 , we identified regions of strong LD (defined as r 2 40.8 and D 0 40.8) and determined the overlap between these variants and ENCODE ChIP-seq data. Imposing a P value threshold of 5.37 Â 10 À 4 (that is, 0.05/93, based on permutation), we identified a significant enrichment of histone marks associated with active enhancer and promoter elements (HK4Me1, H3K27ac and H3K9ac) as well as actively transcribed regions (H3K36me3). We also identified an over-representation of TF binding for POLR2A, IRF4, RUNX3, NFATC1, STAT5A, PML and WRNIP1 (Fig. 4). In addition, although not statistically significant, POU2F2 showed evidence for enriched binding (P ¼ 7.78 Â 10 À 4 ). Several of these TFs have established roles in B-cell function. OCT2, IRF4 and RUNX3 have been shown to be targeted for hypomethylation in B cells 24 . MYC is a direct target of IRF4 in activated B cells, with IRF4 being itself a direct target of MYC transactivation. It is noteworthy that variations at IRF4 and 8q24-MYC are recognized risk factors for CLL 2,3 . Collectively, these findings are consistent with CLL GWAS SNPs mapping within regions of active chromatin state that exert effects on B-cell cis-regulatory networks.
We investigated the genetic pathways between the gene products in proximity to the GWAS SNPs using the LENS pathway tool 25 . These gene products were primarily involved in immune response, BCR-mediated signalling, apoptosis and maintenance of chromosome integrity, as well as interconnectivity between the gene products (Fig. 5). Pathways that were enriched included those related to interferon signalling and apoptosis (Supplementary Data 3).
Impact of risk SNPs on heritability of CLL. By fitting all SNPs from GWAS simultaneously using Genome-wide Complex Trait Analysis, the estimated heritability of CLL attributable to all common variation is 34% ( ± 5%), thus having potential to explain 57% of the overall familial risk. This estimate represents the additive variance and, therefore, does not include the potential impact of interactions or dominance effects or gene-environment interactions, having an impact on CLL risk. The currently identified risk SNPs (newly discovered and previously identified) only account for 25% of the additive heritable risk.

Discussion
Besides providing additional evidence for genetic susceptibility to CLL, the new and established risk loci identified further insights into the biological basis of CLL development. These loci annotate genes that participate in interconnecting cellular pathways, which are central to B-cell development. In particular, we note the involvement of BCR-mediated signalling with immune responses and apoptosis. Importantly, gene discovery initiatives can have an impact on the successful development of new therapeutic agents 26 . In this respect it is notable that Ibrutinib 27 (a BTK inhibitor) and Idelalisib 28 (a PI3KCD inhibitor) mediate their effects through interference of BCR signalling, and Venetoclax 29 targets the anti-apoptotic behaviour of BCL-2. The power of our GWAS to identify common alleles conferring relative risks of 1.2 or greater (such as the rs35923643 variant) is high (B80%). Hence, there are unlikely to be many additional SNPs with similar effects for alleles with frequencies greater than 0.2 in populations of European ancestry. In contrast, our analysis had limited power to detect alleles with smaller effects and/or MAFo0.1. Hence, further GWAS studies in concert with functional analyses should lead to additional insights into CLL biology and afford the prospect of development of novel therapies.  Genome-wide association studies. The meta-analysis was based on six GWAS 2,6,7,9 (Supplementary Tables 1 and 2 Quality control of GWAS. Standard quality-control measures were applied to the GWAS 31 . Specifically, individuals with low call rate (o95%) as well as all individuals evaluated to be of non-European ancestry (using the HapMap version 2 CEU, JPT/CHB and YRI populations as a reference) were excluded. For apparent first-degree relative pairs, we removed the control from a case-control pair; otherwise, we excluded the individual with the lower call rate. SNPs with a call rate o95% were excluded as were those with a MAF o0.01 or displaying significant deviation from Hardy-Weinberg equilibrium (that is, Po10 À 6 ). GWAS data were imputed to 410 million SNPs with the IMPUTE2 v2.3 software 32 using a merged reference panel consisting of data from 1000 Genomes Project (phase 1 integrated release 3 March 2012) 10 and UK10K (ref. 11). Genotypes were aligned to the positive strand in both imputation and genotyping. Imputation was conducted separately for each study, and in each the data were pruned to a common set of SNPs between cases and controls before imputation. We set thresholds for imputation quality to retain potential risk variants with MAF40.005 for validation. Poorly imputed SNPs defined by an information measure o0.80 were excluded. Tests of association between imputed SNPs and CLL was performed using logistic regression under an additive genetic model in SNPTESTv2. 5 (ref. 33).
The adequacy of the case-control matching and possibility of differential genotyping of cases and controls were formally evaluated using Q-Q plots of test statistics (Supplementary Fig. 1). The inflation factor l was based on the 90% leastsignificant SNPs 34 . Where appropriate, principal components, generated using common SNPs, were included in the analysis to limit the effects of cryptic population stratification that otherwise might cause inflation of test statistics.
Eigenvectors for the GWAS data sets were inferred using smartpca (part of EIGENSOFT 35 ) by merging cases and controls with Phase II HapMap samples.
Replication studies and technical validation. The 16 SNPs in the most promising loci were taken forward for de novo replication (Supplementary Table 2). The UK replication series comprised 645 cases collected through the NCLLC and Leicester Haematology Tissue Bank and 2,341 controls comprised 2,780 healthy individuals ascertained through the National Study of Colorectal Cancer (1999-2006; ref. 36). These controls were the spouses or unrelated friends of individuals with malignancies. None had a personal history of malignancy at the time of ascertainment. Both cases and controls were British residents and had selfreported European ancestry. The Mayo replication series comprised 407 newly diagnosed cases and 1,207 clinic-based controls from the Mayo Clinic CLL case-control study 37 . The eligibility criteria of the cases were age 20 years and older, consented within 9 months of their initial diagnosis at presentation to Mayo Clinic and no history of HIV. The eligibility criteria for the controls were age 20 years and older, a resident of Minnesota, Iowa or Wisconsin at the time of appointment at Mayo Clinic, no history of lymphoma or leukaemia and no history of HIV infection. Controls were frequency matched to the regional case distribution on 5-year age group, sex and geographic area. In silico replication was performed in 444 cases and 609 controls from International Cancer Genome Consortium (ICGC), and 226 cases and 228 controls from the WHI study 38,39 . The fidelity of imputation as assessed by the concordance between imputed and directly genotyped SNPs was examined in a subset of samples (Supplementary Table 5). Replication genotyping of UK samples was performed using competitive allele-specific PCR KASPar chemistry (LGC, Hertfordshire, UK); replication genotyping of Mayo samples was performed using Sequenom MassARRAY (Sequenom Inc., San Diego, CA, USA). Primers are listed in Supplementary  Table 9. Call rates for SNP genotypes were 495% in each of the replication series.
To ensure the quality of genotyping in all assays, at least two negative controls and duplicate samples (showing a concordance of 499%) were genotyped at each centre. To exclude technical artefacts in genotyping, we performed cross-platform validation of 96 samples and sequenced a set of 96 randomly selected samples from each case and control series to confirm genotyping accuracy. Assays were found to be performing robustly; concordance was 499%.
Meta-analysis. Meta-analyses were performed using the fixed-effects inversevariance method based on the b estimates and s.e.'s from each study using META v1.6 (ref. 40). Cochran's Q-statistic to test for heterogeneity and the I 2 statistic to quantify the proportion of the total variation due to heterogeneity were Each arm represents a functional annotation term, each arc represents an interaction between two proteins and the distance from the centre of the plot corresponds to a greater number of protein-protein interactions (higher degree of the node). The left arm represents proteins annotated as being involved in BCR signalling; the top arm represents proteins annotated as immune response; the right arm represents proteins involved in apoptosis; and the bottom arm represents proteins involved in DNA damage and chromosomal integrity. Selected proteins known to be involved in CLL risk are shown. Mutational status. IGVH mutation status was determined according to the BIOMED-2 protocols as described previously 44 . Sequence analysis was conducted using the Chromas software version 2.23 (Applied Biosystems) and the international immunogenetics information system database. In accordance with published criteria, we classified sequences with a germline identity of Z98% as unmutated and those with an identity of o98% as mutated.
Association between genotype and patient outcome. To examine the relationship between SNP genotype and patient outcome, we analysed two patient series: (1) 356 patients from the UK Leukaemia Research Fund (LRF) CLL-4 trial 45 , which compared the efficacy of fludarabine, chlorambucil and the combination of fludarabine plus cyclophosphamide; (2) 377 newly diagnosed patients from Mayo Clinic who were prospectively followed. Cox-regression analysis was used to estimate genotype-specific hazard ratios and 95% CIs with overall survival. Statistical analyses were undertaken using R version 2.5.0. eQTL analysis. eQTL analyses were performed by examining the gene expression profiles of 452 CLL cases (Affymetrix Human Genome U219 Array) 46 . Additional data were obtained by querying publicly available eQTL mRNA expression data using MuTHER 47 , the Blood eQTL browser 48 and data from the GTEx consortium 49 . MuTHER contains expression data on LCLs, skin and adipose tissue from 856 healthy twins. The Blood eQTL browser contains expression data from 5,311 non-transformed peripheral blood samples. We used the whole-blood RNA-seq data from GTEx, which consisted of data from 338 individuals.
Functional annotation. Novel risk SNPs and their proxies (that is, r 2 40.2 in the 1000 Genomes EUR reference panel) were annotated for putative functional effect based upon histone mark ChIP-seq/ChIPmentation data for H3K27ac, H3K4Me1 and H3K27Me3 from GM12878 (LCL) 18 and primary CLL cells 19 . We searched for overlap with 'super-enhancer' regions as defined by Hnisz et al. 21 , restricting the analysis to the GM12878 cell line and CD19 þ B cells. We also interrogated ATAC-seq data from CLL cells 19 and primary B cells 20 . The novel risk SNPs and their proxies (r 2 40.2 as above) were intersected with regions of accessible chromatin in CLL cells, as defined by Rendeiro et al. 19 , which were used as a surrogate for likely sites of TF binding. SNPs falling within accessible sites (n ¼ 47) were taken forward to TF-binding motif analysis and were also annotated for genomic evolutionary rate profiling score 50 as well as bound TFs based on ENCODE project 18 ChIP-seq data.
TF-binding disruption analysis. To determine whether the risk variants or their proxies were disrupting motif-binding sites, we used the motifbreakR package 22 . This tool predicts the effects of variants on TF-binding motifs, using position probability matrices to determine the likelihood of observing a particular nucleotide at a specific position within a TF-binding site. We tested the SNPs by estimating their effects on over 2,800 binding motifs as characterized by ENCODE 51 , FactorBook 52 , HOCOMOCO 53 and HOMER 54 . Scores were calculated using the relative entropy algorithm.
TF and histone mark enrichment analysis. To examine enrichment in specific TF binding across risk loci, we adapted the variant set enrichment method of Cowper-Sal lari et al. 23 . Briefly, for each risk locus, a region of strong LD (defined as r 2 40.8 and D 0 40.8) was determined, and these SNPs were termed the associated variant set (AVS). TF ChIP-seq uniform peak data were obtained from ENCODE for the GM12878 cell line, which included data for 82 TF and 11 histone marks. For each of these marks, the overlap of the SNPs in the AVS and the binding sites was determined to produce a mapping tally. A null distribution was produced by randomly selecting SNPs with the same characteristics as the risk-associated SNPs, and the null mapping tally calculated. This process was repeated 10,000 times, and approximate P-values were calculated as the proportion of permutations where null mapping tally was greater or equal to the AVS mapping tally. An enrichment score was calculated by normalizing the tallies to the median of the null distribution. Thus, the enrichment score is the number of s.d.'s of the AVS mapping tally from the mean of the null distribution tallies.
Heritability analysis. We used genome-wide complex trait analysis 42 to estimate the polygenic variance (that is, heritability) ascribable to all genotyped and imputed GWAS SNPs. SNPs were excluded based on low MAF (MAFo0.01), poor imputation (info score o0.4) and evidence of departure from Hardy Weinberg Equilibrium (HWE) (Po0.05). Individuals were excluded for poor imputation and where two individuals were closely related. A genetic relationship matrix of pairs of samples was used as input for the restricted maximum likelihood analysis to estimate the heritability explained by the selected set of SNPs. To transform the estimated heritability to the liability scale, we used the lifetime risk 55,56 for CLL, which is estimated to be 0.006 by SEER (http://seer.cancer.gov/statfacts/html/ clyl.html). The variance of the risk distribution due to the identified risk loci was calculated as described by Pharoah et al. 57 , assuming that the relative risk when a first-degree relative has CLL is 8.5 (ref. 1).
Pathway analysis. To investigate the interaction between the gene products of the GWAS hits, we performed a pathway analysis. We selected the closest coding genes for the lead-associated SNPs and then performed pathway analysis using the LENS tool 25 , which identifies gene product and protein-protein interactions from HPRD 58 and BioGRID 59 . Enrichment of pathways was assessed using Fisher's exact test, comparing the overlap of the genes in the network with the genes in the pathway. Pathway data were obtained from REACTOME 60 . Cytoscape was used to perform network analyses 61 , and the Hive Plot was drawn using HiveR (academic.depauw.edu/Bhanson/HiveR/HiveR.html).
Data availability. Genotype data that support the findings of this study have been deposited in the database of Genotypes and Phenotypes (dbGAP) under accession code phs000802.v2.p1 and in the European Genome-phenome Archive (EGA) under accession codesEGAS00001000090, EGAD00001000195, EGAS00001000108, EGAD00000000022 and EGAD00000000024. Transcriptional profiling data from the MuTHER consortium that support the findings of this work have been deposited in the European Bioinformatics Institute (Part of the European Molecular Biology Laboratory, EMBL-EBI) under accession code E-TABM-1140. Data from Blood eQTL have been deposited in the EBI-EMBL under accession codes E-TABM-1036, E-MTAB-945 and E-MTAB-1708. GTEx data are deposited in dbGaP under accession code phs000424.v6.p1. The remaining data are contained within the paper and its Supplementary files or are available from the authors upon reasonable request.