Genome-wide association study identifies risk loci for progressive chronic lymphocytic leukemia

Prognostication in patients with chronic lymphocytic leukemia (CLL) is challenging due to heterogeneity in clinical course. We hypothesize that constitutional genetic variation affects disease progression and could aid prognostication. Pooling data from seven studies incorporating 842 cases identifies two genomic locations associated with time from diagnosis to treatment, including 10q26.13 (rs736456, hazard ratio (HR) = 1.78, 95% confidence interval (CI) = 1.47–2.15; P = 2.71 × 10−9) and 6p (rs3778076, HR = 1.99, 95% CI = 1.55–2.55; P = 5.08 × 10−8), which are particularly powerful prognostic markers in patients with early stage CLL otherwise characterized by low-risk features. Expression quantitative trait loci analysis identifies putative functional genes implicated in modulating B-cell receptor or innate immune responses, key pathways in CLL pathogenesis. In this work we identify rs736456 and rs3778076 as prognostic in CLL, demonstrating that disease progression is determined by constitutional genetic variation as well as known somatic drivers.

C hronic lymphocytic leukemia (CLL) is a prevalent hematological malignancy with approximately 3750 new cases diagnosed in the United Kingdom every year 1 . CLL is characterized by marked clinical and biological heterogeneity with a proportion of patients having stable, asymptomatic disease which never requires therapeutic intervention, whilst others suffer from a progressive disease with associated shortened survival. Treatment for symptomatic CLL has been transformed by the advent of highly effective chemoimmunotherapy regimens, B-cell receptor signaling pathway inhibitors (BCRi) and agents targeting anti-apoptotic proteins but the majority of patients who progress to the point of therapeutic intervention have historically experienced CLL-related mortality. However the advent of highly effective combination regimens utilizing both BCRi and antiapoptotic agents may improve patient outcomes 2 .
Prognostication in early stage CLL and identifying those patients at risk of progression remains challenging. Numerous methods of CLL prognostication exist, which range from clinical assessments to genomic techniques. Clinical assessments include the Rai 3 or Binet 4 scores, where adverse outcomes are associated with increasing tumor load and evidence of marrow failure, and the CLL international prognostic index (CLL-IPI) 5 which utilizes a combination of clinical and biological characteristics to assess prognosis [6][7][8][9] . Cases with unmutated immunoglobulin heavy chain variable region (IGHV) genes 10 and those with mutations/ deletions affecting TP53, NOTCH1, and SF3B1 have poor prognoses 11,12 , although these alterations are infrequent in newly diagnosed CLL patients 13,14 . Once patients progress to symptomatic disease then somatic drivers of disease, particularly mutations of TP53 and SF3B1, as well as complex karyotype, become dominant prognostic markers 15,16 .
CLL arises from an unidentified cell-of-origin in response to continued signaling through the BCR either in response to ongoing antigenic stimulation by auto-antigens or allo-antigens, or is driven by autonomous BCR activation 17 . The increased incidence of CLL in first-degree relatives of affected patients also points to an element of genetic susceptibility, which has been borne out in large scale genome-wide association studies (GWAS) which to date have identified over 40 risk alleles [18][19][20][21][22][23][24][25] . Given the important genetic contribution to CLL susceptibility we hypothesized that constitutional genetic variants also contribute to determining risk of disease progression.
In this work, we conduct a GWAS using data from a large United Kingdom multicentre cohort study of well characterized predominantly early-stage CLL cases to identify two risk alleles for progressive disease in patients with previously untreated disease.

Results
Meta-analysis of CLL genome-wide association studies. We conducted six genome-wide association studies for single nucleotide polymorphisms (SNPs) associating with progressive CLL incorporating cases of European ancestry diagnosed at clinical centers across the United Kingdom (Supplementary Fig. 6; Supplementary Table 1). Of these, data on time to first treatment (TTFT) was available for 755 cases. We combined the association test statistic for 5199911 autosomal SNPs common to all 6 GWAS after exclusion of those with an imputation info (imputation quality) score of <0.9 and a minor allele frequency (MAF) of <0.025, and conducted a meta-analysis under a fixedeffects model. Quantile-quantile plots of observed vs. expected P values for SNPs showed minimal inflation of the test statistic (λ GC = 1.027) excluding the possibility of hidden population substructure or cryptic relatedness ( Supplementary Fig. 9).
Pooling data from 6 GWAS identified 5 SNPs at two genomic locations that surpassed genome-wide significance (P ≤ 5 × 10 −8 ) for association with time from diagnosis to first treatment (Fig. 1). The strongest statistical evidence for an association with progressive CLL was for rs736456 (hazard ratio (HR) = 1.76, 95% confidence interval (CI) = 1.45-2.14; P = 1.26 × 10 −8 ), which maps to the TACC2 locus on chromosome 10q26.13 (Fig. 2). The second strongest association with progressive disease was for rs3778076 (HR = 2.03, 95% CI = 1.58-2.62; P = 3.89 × 10 −8 ) which maps to the SPDEF and C6ORF106 (ILRUN) genes on chromosome 6p (Fig. 2). We genotyped rs736456 and rs3778076 in a seventh cohort, bringing the total number of CLL cases analyzed to 842. Both markers showed consistent direction and magnitude of effect sizes across all seven studies with no evidence of heterogeneity (Fig. 3). There was no evidence of significant interaction between rs736456 and rs3778076 (P = 0.131), suggesting that each locus has an independent effect on CLL progression. Analysis conditioning on the lead SNP at each locus showed no evidence for other variants (P < 10 −4 ) associating with progressive disease ( Supplementary  Fig. 10).
rs736456 and rs3778076 predict disease progression in early stage CLL. In order to further investigate the associations with progressive disease we stratified patients by disease stage at presentation and established prognostic markers. rs736456 (chromosome 10) and rs3778076 (chromosome 6) significantly associate with disease progression in patients with Binet stage A disease at presentation, but have limited prognostic impact in patients with Binet stage B and C at presentation (Supplementary Fig. 12) (P = 3.81 × 10 −5 in Binet A and P = 0.12 in Binet B/C for rs736456 carriers vs. non-carriers; P = 1.93 × 10 −9 in Binet A and P = 0.69 in Binet B/C for rs3778076 carriers vs. non-carriers), consistent with a specific role in driving progression in early stage disease. Both genetic markers retained significance in multivariate models for disease progression restricted to Binet stage A patients and when adjusted for study in the model (Fig. 4). rs736456 and rs3778076 were particularly powerful markers for disease progression in stage A patients when considered together, where carriers of 1 risk allele (P = 1.4 × 10 −7 ) or 2 or more risk alleles (P = 7.9 × 10 −8 ) have a significantly shorter time to first treatment compared to non-carriers (Fig. 4c). Whilst not as powerful as IGHV status, rs736456 and rs3778076 had prognostic utility approximately equivalent to CD38 status. Moreover, rs736456 and rs3778076 both significantly associate with progressive disease in patients with otherwise low risk clinical markers, including IGHV mutation ( Supplementary Fig. 13), CD38 negativity ( Supplementary Fig. 14), low β2 microblobulin (Supplementary Fig. 15) and wild-type for TP53 ( Supplementary Fig. 16). As such, rs736456 and rs3778076 are particularly powerful prognostic markers in analysis restricted to patients with disease that is Binet stage A, IGHV mutated and CD38 negative (Supplementary Fig. 17). Taken together, these data identify rs736456 and rs3778076 as prognostic for disease progression in patients with otherwise low risk markers, but which have limited prognostic power in patients with high-risk markers or late stage disease.
Progressive disease loci at chromosomes 10 and 6. To identify cis-regulated genes at each locus associated with progressive disease we interrogated gene expression data derived from a meta-analysis of 31624 blood samples collated by the eQTLGen consortium 26 . Thirteen genes were annotated to within 1 Mb of the chromosome 10 association signal and rs736456 is eQTL for the PLEKHA1 gene (pleckstrin homology domain-containing family A; TAPP-1) (Benjamini-Hochberg corrected P-value [P BH ] = 1.29 × 10 −15 ) (Supplementary Table 2). Of the 27 genes annotated to within 1 Mb of the chromosome 6 signal, rs3778076 is eQTL for 5 genes including most significantly UHRF1BP1 (Ubiquitin-Like Containing PHD And RING Finger Domains 1-Binding Protein 1) (P BH = 6.73 × 10 −139 ) and ILRUN (Inflammation And Lipid Regulator With UBA-Like And NBR1-Like Domains; C6ORF106) (P BH = 3.54 × 10 −64 )(Supplementary Table 3).

Discussion
CLL susceptibility loci identified to date implicate genes central to B-cell development, BCR signaling and apoptotic responses 20 . These risk alleles may therefore function by promoting the generation of abnormal pro-B populations or to maintain the outgrowth of CLL-precursors in response to continuous BCR signaling. However, the current model would suggest that once the pre-leukemic CLL clone has established, disease progression becomes largely dependent on the classic multi-hit model of oncogenesis where acquired somatic drivers promote clonal expansion 11 , thus temporally limiting the impact of constitutional genetic variants to disease progression at an early stage.
The evidence presented herein suggests that, in addition to being etiological, constitutional risk alleles contribute to disease progression in CLL post-diagnosis. Our data suggest that rs736456 and rs3778076 are powerful prognostic markers in early stage CLL, but have less impact as disease progresses, predicted to be due to the acquisition of strongly prognostic somatic alterations which then dominate. Consistent with this model, the prognostic power of rs3778076 and rs735456 is weaker in patients with Binet stage B or C disease and in patients with high risk somatic markers, such as unmutated IGHV or CD38 positivity.
The progressive disease allele at chromosome 10 (rs736456) is associated with increased expression of PLEKHA1 (pleckstrin homology domain containing A1; TAPP1) whilst the allele at chromosome 6 (rs3778076) is associated with increased expression of ILRUN (Inflammation And Lipid Regulator With UBA-Like And NBR1-Like Domains; C6ORF106) and UHRF1BP1 (UHRF binding protein 1). PLEKHA1, a plekstrin homology (PH) adapter protein is a target for phosphatidylinositol-4,5bisphosphate 3-kinase (PI3K) signaling and a partner for phosphatidylinositol 4,5-bisphosphate (PIP2) which regulates the assembly of signaling complexes at the cell membrane 27 . PI3Kmediated signaling is a key element of BCR-mediated signal transduction, a receptor with a central, etiological role in CLL Fig. 1 Manhattan plot from chronic lymphocytic leukemia meta-analysis of 6 genome-wide association studies for variants associating with progressive disease. Manhattan plot of fixed-effects meta-analysis results for time to first treatment (TTFT). Study-specific single nucleotide polymorphism (SNP) effect sizes were combined using the inverse-variance-weighted approach, and nominal meta P values were calculated based on Zscore statistics. The x-axis represents SNP coordinates on autosomal chromosomes and the y-axis shows the fixed-effects meta P values in the −log 10 scale. All statistical tests were two-sided and red dashed line indicates significance at genome-wide level (P ≤ 5 × 10 −8 ). Risk loci for progressive CLL are annotated with chromosome position and local genes. Fig. 2 Regional association and linkage disequilibrium plots for loci associated with progressive CLL. Regional plots of time to first time treatment (TTFT) survival associations for rs736456 (a) and rs3778076 (b) reaching genome-wide significance (P ≤ 5 × 10 −8 ). Study-specific single nucleotide polymorphism (SNP) effects were combined in a fixed effect model using an inverse-variance-weighted method. Nominal meta-analysis P values were calculated based on Z-score statistics and all statistical tests were two-sided. Plots show SNP coordinates based on genomic build b37/hg19 on the x-axis, and −log 10 (P values) on the y-axis. Genotyped and imputed SNPs are represented by diamonds and circles, respectively, and are colored according to their linkage disequilibrium (pairwise r 2 ) with the lead SNP based on the 1000 Genomes European panel. Reference genes in the region are shown in the lower panel, with arrows indicating transcript direction, dense blocks representing exons and horizontal lines representing introns. Two additional SNPs at 10q26.13 (rs4752676 and rs736457) (a) and one additional SNP at 6p (rs11757517) (b) also reach genome-wide significance in meta-analysis ( Supplementary Figs. 60-62). The symbol for rs11757517 is obscured on the regional plot (b) by the symbol for rs3778076. pathogenesis 28 . Herold and colleagues identified and validated an 8-gene expression signature including PLEKHA1 that predicted time to first treatment in CLL 29 . Furthermore, the PLEKHA1 homologous protein TAPP2 is differentially expressed in CLL, with increased expression in ZAP70 positive/IGHV unmutated CLL, a subtype associated with enhanced signaling via the BCR with associated pro-survival effects on cultured CLL-cells 30,31 .
ILRUN (C6ORF106) regulates interferon regulatory factor 3 (IRF3) signaling which modulates cellular responses to viral antigens. Given that CLL may arise in response to chronic stimulation of the BCR by viral antigens, constitutional regulation of an IRF3-dependent response may therefore affect the expansion of CLL-clones 32 . IRF3 also regulates expression of miR-155 33,34 which is thought to be etiological in a subset of patients 33,35 and also associates with unmutated IGHV and aggressive disease in CLL [36][37][38][39][40] .
UHRF1BP1 has been identified as a susceptibility gene for systemic lupus erythematosus (SLE) 41 with differential methylation affecting regulation of this gene 42 , however UHRF1BP1 has not previously been associated with B-cell tumors. An etiological CLL GWAS identified susceptibility variants mapping to within the B-cell scaffold protein with ankyrin repeats 1 (BANK1) gene 20 . BANK1 is an adapter for BCR signals, modulates CD40 receptor-mediated signaling responses and has also been implicated in SLE and other inflammatory arthropathies [43][44][45] . Therefore, the association of both UHRF1BP1 and BANK1 with CLL and SLE could imply common etiological factors between the two diseases. Given that both CLL and SLE are considered to arise, in part, due to a loss of immunological tolerance to self-antigens or auto-antigens this may provide further indirect evidence for autoantigenic drive in the etiology of CLL.
Collectively the risk alleles identified herein map to genes known to be regulatory for B-cell development and the control of autoimmunity supporting the concept that either aberrant BCR signals and/or a breakdown in normal immunological tolerance contributes to CLL clonal expansion and the subsequent acquisition of somatic driver mutations.
Many patients with Binet stage A CLL never progress to the point of therapeutic intervention and the accurate identification of patients in this group has been the subject of intensive investigation. Multiple prognostic features and tests have been described to identify those with a low risk of progressive disease. Conversely, the identification of high-risk prognostic features in stage A CLL patients has been used to define a group likely to need close surveillance as opposed to community monitoring programmes. The failure, to date, to demonstrate a survival benefit for early treatment of high-risk asymptomatic patients is due to an inability to successfully identify an appropriate highrisk group and also the limits of pre-emptive chemotherapy-based approaches 46 . However, with the advent of highly-effective targeted therapies such as BCRi pre-emptive therapy in high-risk CLL patients is again under investigation 47 .
Genetic tests for constitutional genetic markers, such as rs736456 and rs3778076, have the advantage of being easy to perform, highly reproducible and inexpensive making them ideal for incorporation into multivariate prognostication models for disease progression in early stage CLL.   7 (1997-2017). IGHV gene sequences were analyzed using IgBlast (v1.3.0) (https://www.ncbi.nlm.nih.gov/igblast/) and those with at least 98% homology to germline were regarded as unmutated 10 . CD38 expression was determined at each individual clinical center via flow cytometric analysis of diagnostic samples and a cut-off of 30% was used for positivity 49 . Serum β2 microblobulin was performed by referring centers using locally established and validated immunoassays. Assessment of deletion 17p and TP53 mutation analysis was in line with ERIC recommendations on TP53 mutation analysis in chronic lymphocytic leukemia 50 .
Treatment was initiated in response to standard indications 51 . Given the historic nature of the treated cohorts in this study the treatment regimens employed were predominantly chemotherapeutic (chlorambucil monotherapy, fludarabine monotherapy, fludarabine, and cyclophosphamide) or chemoimmunotherapeutic Fig. 4 Multivariate prognostication models for time to first treatment (TTFT) and survival curves in Binet stage A CLL. Study-stratified multivariable Cox proportional hazard model for TTFT in Binet A CLL patients incorporating rs736456 (a) and rs3778076 (b). Age at diagnosis (continuous), gender (binary), CD38 (binary), VH (binary) and SNP allelic dosage for the effect allele (continuous) were included together in a study-stratified Cox proportional hazard model to estimate Hazard ratios (HR) and 95% confidence intervals (CI). CI and nominal P values from Wald tests based on the standard normal distribution were reported here. Cases with missing data for any of the variables were excluded from multivariate models. c Kaplan-Meier survival plot for TTFT by number of cumulative risk alleles (0, 1, >1) for rs3778076 and rs736456 in Binet A CLL patients. TTFT is defined as the time from diagnosis of CLL to treatment or last follow-up without treatment. The number of patients in each group at each timepoint are indicated in the table. Nominal P value from the log-rank test is shown. All statistical tests were two-sided.
(chemotherapy in combination with a CD20 monoclonal antibody). A minority of patients were treated with other regimens including radiotherapy, alemtuzumab, ibrutinib or ibrutinib in combination with venetoclax.
Unmutated IGHV, CD38-positivity, Binet stage B or C, serum β2 microblobulin >3.5 mg L −1 and deletion/mutation of TP53 at presentation were all significantly associated with shorter time to first treatment (TTFT) in univariate analysis (Supplementary Figs. 1-5; P < 0.05), confirming that established prognostic markers predict disease progression in patients recruited to this study.
Genotyping and quality control procedures. DNA was extracted from peripheral blood and genotyped on Illumina OmniExpress arrays (8v1, 8v1-2, or 8v1.3). Genotypes were determined using Illumina GenomeStudio software (v2.0) (Illumina) with 922269 SNPs common to all three arrays (Supplementary Fig. 6). Quality control and data analysis were performed using R v3.5.1 and PLINK v1.9b4.4 ( Supplementary Fig. 6). Given that CLL cases were genotyped using DNA extracted from tissue samples that include leukemic B cells we applied rigorous quality control to remove variants potentially affected by somatic copy number alterations. For each GWAS we excluded markers with departure from Hardy-Weinberg equilibrium (HWE; P ≤ 10 −3 ), a call rate <95% or with significant differences in minor allele frequency (P ≤ 10 −3 ) between genotype batches. Data from all six GWAS were combined for sample quality control processing. Samples were excluded if the call rate was < 95%, heterozygosity exceeded 3 standard deviations from the overall mean heterozygosity or were identified as non-European based on principal components analysis using 1000 genome data as a reference ( Supplementary Fig. 7). Samples were also removed such that there were no two individuals with estimated relatedness pihat ≥0.1875, with retention of the sample with the higher call rate. As an additional control check to ensure that the constitutional variants reported here are not located within recurrent regions of somatic copy number change we used Nexus Copy Number software 10 (build version 9665) (BioDiscovery, California) to interrogate B allele frequency and Log R ratio at variant positions.
Following the exclusion of poor quality markers and samples, haplotypes were estimated from genotypes using ShapeIT (v2.r790) 52 and genome-wide imputation was performed using the Michigan Imputation Server (https://imputationserver. sph.umich.edu/index.html) and the Haplotype Reference Consortium reference haplotype panel (http://www.haplotype-reference-consortium.org/). All variants with an imputation info score <0.9 were excluded from subsequent analysis.
Genome-wide analysis for markers that predict time to first treatment. The primary outcome assessed in this study was time to first treatment (TTFT), defined as the interval between CLL diagnosis and date of first treatment or last follow-up. For each study, allelic dosage was estimated for the minor allele at each variant position and included in a cox proportional hazard model to estimate hazard ratio (HR) and 95% confidence interval (CI). Variants were included in the meta-analysis if they had results from all six studies. Study-specific single nucleotide polymorphism (SNP) effects were combined using an inverse-variance-weighted method (fixed effects model) and the DerSimonian-Laird approach (random effects model) using R metafor package (v2.4-0) 53 . SNPs with fixed-effect P values of ≤5 × 10 −8 were deemed significant at genome-wide level. Cochran's Q test and I 2 were used to assess the heterogeneity across studies. Regional plots were generated using LocusZoom (V1.4) downloaded from https://github.com/statgen/locuszoom-standalone.
To test for secondary signals in regions achieving genome-wide significance we carried out conditional analysis on the index SNP by study and then combined individual conditional effect sizes from 6 studies under a fixed effects model. SNPs with a conditional meta P value <10 −4 were considered independent of the index SNP.
Heterogeneity of Kaplan-Meier survival curves was assessed by the log-rank test, including tests for established prognostic markers, the most likely index SNP genotypes (for genotypes with posterior probabilities >0.9), combined groups of established prognostic markers/SNPs and combined SNP genotypes. For joint analyses, pairwise log-rank tests, with false discovery rate (FDR) corrections for multiple testing, were used to test for significant differences between survival curves.
Study-stratified multivariate Cox proportional hazard models incorporating established prognostic markers including IGHV and CD38 status were used to determine the impact of variants on disease progression. Cases with missing data for any of the variables were excluded from multivariate models.
A secondary outcome was post-treatment survival, defined as the interval between first treatment for CLL-related symptoms and death or last follow-up. Data on post-treatment survival were available on 390 CLL cases from six studies (GWAS1-6), with 231 deaths and 159 censored at last follow-up.
Technical validation of variants associating with progressive CLL. rs3778076 was directly genotyped in all six GWAS and rs736456 was imputed to high quality in all six GWAS (info score 0.974-0.985). Fidelity of array genotyping and imputed dosages was confirmed using Sanger sequencing in a subset of samples for each sentinel variant with concordance of 100% for all variants (Supplementary Fig. 8). rs3778076 and rs736456 were directly genotyped in study 7 using Sanger sequencing.
Expression quantitative trait loci analysis. To identify cis expression quantitative trait loci (eQTLs) we made use of summary data from the eQTLGen Consortium for whole blood, which incorporates 37 independent datasets derived from a total of 31684 individuals (http://www.eqtlgen.org/cis-eqtls.html) 26 . Benjamini-Hochberg (BH)-adjusted P values were estimated for each gene annotated to within 1 Mb of rs736456 on 10q26.13 and rs3778076 on 6p.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.