The value of genome-wide over targeted driver analyses for predicting clinical outcomes of cancer patients is debated. Here, we report the whole-genome sequencing of 485 chronic lymphocytic leukemia patients enrolled in clinical trials as part of the United Kingdom’s 100,000 Genomes Project. We identify an extended catalog of recurrent coding and noncoding genetic mutations that represents a source for future studies and provide the most complete high-resolution map of structural variants, copy number changes and global genome features including telomere length, mutational signatures and genomic complexity. We demonstrate the relationship of these features with clinical outcome and show that integration of 186 distinct recurrent genomic alterations defines five genomic subgroups that associate with response to therapy, refining conventional outcome prediction. While requiring independent validation, our findings highlight the potential of whole-genome sequencing to inform future risk stratification in chronic lymphocytic leukemia.
Chronic lymphocytic leukemia (CLL), the most common adult hematological malignancy in Western countries, is characterized by diverse treatment outcomes even in the era of targeted agents. The full complement of genomic events contributing to this clinical diversity has yet to be determined. Thus far, only mutations in TP53 influence clinical practice1,2,3,4,5,6,7. Other prognostic markers, including the immunoglobulin heavy chain variable (IGHV) region mutational status, and existing molecular classifications have limited predictive value in individual patients7,8,9,10.
Previous sequencing studies of CLL have focused largely on mutations affecting protein-coding genes7,8,9,10,11,12,13, and whole-genome sequencing (WGS) has been reported for only a small number of CLL patients, mostly with low-risk disease1,2,3,4,5,6. Hence, the association between clinical parameters and genomic alterations has largely been restricted to driver coding mutations and copy number changes.
Here, to provide the largest and most comprehensive analysis of the entire genomic landscape of CLL and its relationship to clinical outcome, we performed WGS of 485 clinical trial patients recruited to the United Kingdom’s 100,000 Genomes Project. The results of our study provide additional insights into coding and noncoding single nucleotide mutations. We then exploit WGS data to provide a detailed map of structural alterations and global features, including telomere length, mutational signatures and genomic complexity (GC). Finally, we integrate the different modes of genetic alterations to define five genomic subgroups (GSs) of CLL and relate these to clinical outcome. Our results provide a springboard to in-depth functional validation of putative drivers, and our integrated genome-wide approach could, after independent clinical validation, refine current clinical outcome prediction.
We performed WGS of tumor and matched normal samples from 485 patients with treatment-naïve CLL enrolled in clinical trials to a median depth of 109× and 36×, respectively (Supplementary Tables 1–3). A second tumor sample was available for a subset of 25 patients at relapse. In addition, RNA sequencing (RNA-seq; n = 73) and assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq; n = 24) data were generated for a subset of CLL samples with recurrent noncoding mutations (Supplementary Table 4).
Coding mutations and structural variants
We initially identified putative coding drivers by (1) screening for genes impacted by single nucleotide variants (SNVs) and small insertion/deletions (indels) and (2) integrating SNV/indels with copy number alterations (CNAs) (Fig. 1a; Methods). We identified 36 known and 22 putative driver genes (Fig. 1b and Supplementary Fig. 1). The 22 putative genes had not been associated with CLL in the literature and were not prevalent above 1% in two landmark genomic studies in CLL3,7. They were therefore classified as previously unknown putative drivers and included the immune checkpoint regulator IRF2BP2 (4.3%) (Supplementary Table 4).
We identified 74 regions of the genome that were recurrently affected by CNAs in at least four samples (Fig. 1c, Extended Data Fig. 1a and Supplementary Table 6). Using DNA microarray data, 85% of CNAs could be validated (Supplementary Table 7). In addition to 14 well-known CNAs, including del13q14.2, del11q22.3 and del17p13.1, we identified a further 60 regions, of which 27 were previously not recognized. The breakpoints of the remaining 33 CNAs could be refined to a smaller minimally overlapping region14,15,16,17,18,19. By combining SNVs/indels with CNAs (discovery method 2; Methods), we predicted the most likely target gene for nine known regions, including TP53/del17p13.1, and seven additional regions including PCM1/del8p, IRF2BP2/del1q42.2q42.3 and SMCHD1/del18p11.32-p11.31 (Fig. 1d, Extended Data Fig. 1b,c and Supplementary Table 8). We also found 66 additional genes affected by recurrent CNAs using more permissive criteria (Methods). While these are potentially interesting, they were not considered to be putative CLL drivers and were not taken forward for downstream analyses (Supplementary Table 9).
A major advantage of WGS is the power to identify inversions and translocations. We identified 1,248 inversions (Extended Data Fig. 2a; Methods) with frequent breakpoints involving either the immunoglobulin light chain kappa (IGK) locus (n = 65, 13.4%), the immunoglobulin heavy chain (IGH) locus (n = 65, 13.4%) or chr13q14.2 (n = 40, 8.7%) (Extended Data Fig. 2b and Supplementary Tables 10 and 11). We detected 993 translocations, of which two occurred in more than ten samples and affected known genes with no previously documented role in CLL, including t(14;22) with breakpoints within WDHD1 (n = 12, 2.6%) and t(5;6) (CTNND2-ARHGAP18, n = 11, 2.4%) (Fig. 1e and Extended Data Fig. 2c).
The 22 potential coding driver genes were altered by truncating mutations or also affected by CNAs (Fig. 2a, Extended Data Fig. 3a–d, Supplementary Table 12 and Supplementary Figs. 2 and 3). Most mutations occurred in protein domains, and 62% of mutations were detectable in more than 50% of tumor cells (median cancer cell fraction (CCF) ≥0.5) and 89% in at least 20%. All previously unreported CNAs for which we could predict a target gene(s) were also clonal (median CCFs ≥0.8) (Fig. 2b and Extended Data Fig. 3e). Candidate driver mutations affected multiple biological pathways including the DNA damage/cell-cycle and RNA-ribosome processing (Fig. 2c).
Performing RNA-seq on representative CLL samples from 74 patients with known and potential coding mutations (for 40 of the 58 drivers, n variants = 118, Supplementary Table 4; Methods), we validated the expression of 73% of variants at the RNA level (Extended Data Fig. 4a and Supplementary Table 13). As expected, most (29/43) mutations that were either not detectable or were seen at low expression levels were truncating mutations consistent with nonsense-mediated decay (Supplementary Table 13). Additionally, allelic skewing and/or a reduction of mutant transcript expression compared with the mean expression of wild-type (WT) transcripts across the cohort was shown, notably for specific mutations in SPEN, SETD2, TP53 and IRF2BP2 (Fig. 2d). When considering all mutations, significantly reduced gene expression was demonstrated for TP53, ATM and SETD2 (refs. 20,21) (Extended Data Fig. 4b).
When we associated the 36 known and 22 putative drivers and regions of CNAs with other biological variables such as disease stage, TP53 alterations, IGHV mutation status (unmutated, u-IGHV; and hypermutated, m-IGHV) and stereotyped B cell receptor immunoglobulin subsets (BCR IG) including IGHV3-21 usage (Fig. 2e and Supplementary Table 14; Fisher’s exact test, false discovery rate (FDR) < 0.05), we found that SETD2/del3p21.31, del9p21.3 and gains of chr17q21.31 were associated with relapsed/refractory (R/R) disease and TP53 disruption, whereas MED12 and DDX3X mutations were associated with u-IGHV CLL. BCR IG subset 2, representing about 3% of all CLL, and known to be associated with poor prognosis22, was linked to the putative driver FAM50A. The IGHV3-21 rearrangement was also enriched for FAM50A and for ATM/del11q22, SF3B1 mutations and chr21q21.3-q22.3 gains.
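The association screen described above (Fisher’s exact test followed by FDR control) can be illustrated with a minimal sketch. It assumes a two-sided test on a 2 × 2 table per driver/feature pair and Benjamini-Hochberg adjustment; the correction procedure, the function names and the counts are illustrative assumptions, not taken from the study.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: rows = gene mutated / wild type, columns = u-IGHV / m-IGHV
tables = {"GENE_X": (12, 2, 200, 180), "GENE_Y": (5, 6, 207, 176)}
pvals = [fisher_exact_two_sided(*t) for t in tables.values()]
for gene, fdr in zip(tables, benjamini_hochberg(pvals)):
    print(gene, "significant" if fdr < 0.05 else "not significant", round(fdr, 4))
```

Exact enumeration is adequate for cohort-sized tables; a statistics library would normally be used in practice.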
Association of coding mutations with disease evolution
We examined the relationship between recurrent gene mutations and disease evolution in three different cohorts (Fig. 3a and Supplementary Table 4; Methods): (1) unpaired frontline-treated versus R/R (main cohort, unmatched, n = 443 versus 30—excluding the 12 early CLL); (2) paired samples from the CLL and Richter’s syndrome (RS) phases of the same patient (previously published cohort23, matched, n = 17) and (3) a second sample taken from a subset of the 485 patients at relapse who had already been profiled before frontline treatment: paired frontline-treated versus relapsed (main cohort, matched, n = 25/485).
Recurrent coding gene mutations were linked to disease evolution in all three cohorts. They presented higher mutation counts and frequency in the RS compared with the CLL phase (P = 2.1 × 10−2; Extended Data Fig. 4c,d) and higher CCFs at the more advanced stages with a median CCF > 0.8 (Fig. 3b and Extended Data Fig. 4e–g).
Restricting analysis to patients with information on long-term survival outcome (n = 243 / 485), 13 known or putative drivers and recurrent CNAs were significantly associated with progression-free survival (PFS) and 11 with overall survival (OS) (FDR < 0.05) (Fig. 3c and Supplementary Tables 15 and 16).
Out of the 22 putative drivers, 21 were also related to disease progression (Extended Data Fig. 4c–f), including two of the most commonly mutated ones. IRF2BP2 (interferon regulatory factor 2 binding protein 2), located in the minimally deleted region of chr1q42.3 (Fig. 3d), was also affected by deleterious mutations and CNAs (Fig. 3e) (in total, n = 28/485, 5.8%) with high CCFs (Fig. 3f, left panel). Mutations showed evidence of clonal expansion in more advanced disease (Fig. 3f, right panel) and altered RNA expression (Fig. 2d). This gene contributes to the differentiation of immature B cells and is associated with a familial form of common variable immunodeficiency disorder24.
Similarly, SMCHD1 (structural maintenance of chromosomes flexible hinge domain containing 1), previously reported as a candidate tumor suppressor in hematopoietic cancers25, was affected by copy number losses (del18p11.32-p11.31) (Fig. 3g) and truncating SNVs/indels with high CCFs (Fig. 3h) (n = 24/485, 5.0%). SMCHD1 mutations showed clonal expansion (Fig. 3i) and were associated with adverse OS (median = 48.2 months, P value < 1 × 10−4, log-rank test) (Fig. 3j).
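The survival comparisons reported here rest on median survival and the log-rank test. The following is a minimal sketch of both, assuming right-censored follow-up encoded as a time plus an event flag (1 = event, 0 = censored); the function names and the toy follow-up times are hypothetical and unrelated to the cohort data.

```python
from math import erfc, sqrt

def km_median(times, events):
    """Kaplan-Meier median: first time at which estimated survival is <= 0.5."""
    survival = 1.0
    for t in sorted(set(times)):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e)
        survival *= 1 - deaths / at_risk
        if survival <= 0.5:
            return t
    return None  # median not reached during follow-up

def logrank(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square, p-value) with 1 d.f."""
    all_times, all_events = times_a + times_b, events_a + events_b
    event_times = sorted({t for t, e in zip(all_times, all_events) if e})
    obs_minus_exp, variance = 0.0, 0.0
    for t in event_times:
        n_a = sum(1 for x in times_a if x >= t)  # at risk in group A
        n_b = sum(1 for x in times_b if x >= t)  # at risk in group B
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / variance
    return chi2, erfc(sqrt(chi2 / 2))  # chi-square survival function, 1 d.f.

# Hypothetical follow-up in months for mutated versus wild-type patients
mutated = ([6, 14, 20, 31, 40], [1, 1, 1, 1, 0])
wild_type = ([25, 48, 60, 72, 80], [1, 0, 1, 0, 0])
print(km_median(*mutated), logrank(*mutated, *wild_type))
```

The sketch recomputes the risk sets at every event time, which is quadratic but keeps each step of the test visible.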
Noncoding putative driver mutations
To gain insight into the significance of noncoding mutations, we first identified CLL-specific regulatory elements (REs) by integrating ATAC-seq and H3K27ac profiles26,27 as well as chromatin states28 from publicly available primary CLL (Fig. 4a; Methods). Out of the 29,224 promoters and 56,137 enhancers identified, 90% were present in CLL as a whole, whereas the remaining 10% were specific for IGHV subgroups and were used for the IGHV subtype-specific annotation (Methods). Mapping noncoding mutations to REs (Fig. 4b; Methods), we could identify 29 untranslated regions (UTRs), 25 enhancers (23 of them cataloged by the GeneHancer database29) and 72 promoters that had hotspot mutations or were recurrently mutated more frequently than expected (FDR < 0.1), defined as significantly mutated (Extended Data Fig. 5a and Supplementary Table 17).
Next, we defined the candidate target genes of these 126 mutated noncoding regulatory elements. Mutations within UTRs and promoters were annotated predominantly according to proximity (Methods). For enhancers, we calculated the correlation between H3K27ac levels for each regulatory element and the gene expression levels of surrounding genes located within the same topologically associated domain (TAD) of the B cell lymphoblastoid cell line GM12878 (ref. 30) (Methods). In total, 29 regulatory elements had target genes known to be CLL drivers or cancer drivers in the COSMIC database (Fig. 4c); 89 were linked to other genes (Fig. 4d) and 8 to none (Extended Data Fig. 5a and Supplementary Table 17). Four mutated regulatory elements were specific for u-IGHV (Extended Data Fig. 5b) and none for m-IGHV. Overall, genes targeted by mutated regulatory elements were enriched for gene ontology terms linked to the immune system, lymphocyte activation and cell death (Fig. 4e and Supplementary Table 18).
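The enhancer-to-gene assignment described above can be sketched as follows. This is a simplified illustration that assumes a Pearson correlation across samples and picks the best-correlated gene in the TAD; the actual correlation measure and thresholds are specified in the Methods and may differ, and the gene names and values below are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def best_target_gene(h3k27ac, expression_by_gene):
    """Return (gene, r) for the TAD gene best correlated with the enhancer signal."""
    scores = {gene: pearson(h3k27ac, expr) for gene, expr in expression_by_gene.items()}
    gene = max(scores, key=scores.get)
    return gene, scores[gene]

# Hypothetical per-sample H3K27ac signal at one enhancer and expression
# of the genes that share its TAD
enhancer_signal = [1.0, 2.0, 3.0, 4.0]
tad_genes = {
    "GENE_A": [2.0, 4.0, 6.0, 8.0],
    "GENE_B": [4.0, 3.0, 2.0, 1.0],
    "GENE_C": [1.0, 3.0, 2.0, 4.0],
}
print(best_target_gene(enhancer_signal, tad_genes))
```

Restricting candidates to one TAD keeps the search local, which is the design choice the text describes.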
Of the 29 mutated UTRs, 58% (n = 17) had a median CCF ≥ 0.5, and 83% had a CCF > 0.2, thus indicating their selection during CLL pathogenesis (Extended Data Fig. 5c). These included the 3′ UTR mutations of NOTCH1 creating a splice site that leads to increased gene expression3,31 (n = 16; FDR = 4.57 × 10−2). The NF-κB signaling gene NFKBIZ (n = 8, FDR = 2.38 × 10−2) was also significantly mutated, confirming previous findings6; such mutations are known to increase levels of mRNA and protein in lymphoma32,33. We observed clonal mutations in the 5′ UTR of IGLL5 (n = 28; FDR < 2.2 × 10−16), previously found to be associated with reduced expression4. Previously unreported significantly mutated UTRs included the 5′ UTR of BCL2 (n = 6; FDR = 1.01 × 10−6, Fig. 5a). We performed RNA-seq on samples carrying these mutations (Supplementary Table 4; Methods), demonstrating that 5′ UTR mutations were associated with BCL2 overexpression (P = 4.3 × 10−2; Fig. 5b), which is noteworthy given that BCL2 inhibitors are used therapeutically in CLL34.
A high clonality (>0.5) was also observed when considering the 97 significantly mutated promoters and enhancers; 72% had a median CCF >0.5 and 97% had a CCF >0.2 (Supplementary Fig. 4a). Six discrete regions spanning 117 kb contained 50 variants and were annotated in the previously reported PAX5 superenhancer3,6,35 (Extended Data Fig. 5a and Supplementary Fig. 4b). Another region spanning 325 kb on chr3q27.2 contained seven significantly mutated enhancers and was linked to BCL6 (Extended Data Fig. 5a and Supplementary Table 17). RNA-seq of eight samples with mutations in this region showed overall increased expression of BCL6, although the effect was heterogeneous (Fig. 5c), suggesting that some variants are more pathogenic than others and that variants might exert a positional effect (Fig. 5d).
When considering the 72 significantly mutated promoters, we found mutations of known CLL drivers including BIRC3 (n = 31, 6.4%, FDR < 1.15 × 10−15), IKZF3 (n = 12, 2.5%, FDR = 8.16 × 10−13) and TP53 affecting splicing regions of noncoding exons/5′ UTR/promoter region (n = 2, 0.4%, FDR = 5.55 × 10−6). Next, we investigated mutations in these promoters further to identify those predicted to change chromatin state, using DeepHaem36, a deep neural network trained on chromatin feature data of 73 immune cell types. Seventy-four variants were predicted to lead to a loss of open chromatin (that is, loss-of-function variants), including those in the BACH2 promoter (Fig. 6a and Extended Data Fig. 6a). A recent study showed that decreased BACH2 expression in CLL is associated with adverse outcomes37. Notably, the mutations we detected in this promoter were mostly clonal (median CCF = 0.99). We therefore investigated this promoter further by performing ATAC-seq and RNA-seq (Fig. 6b) on mutated samples, when available (13 variants investigated, Supplementary Table 12; Methods) to understand the impact of these variants on chromatin accessibility and gene expression. Three variants within a 14-bp region were associated with allelic skew in the ATAC-seq compared with WGS data, demonstrating a preference for accessibility on the reference allele (Fig. 6c), which mirrored the decrease in chromatin accessibility in that region compared with WT samples (Fig. 6d). This allelic skew was also detected at the RNA level (Fig. 6e and Extended Data Fig. 6b). In addition, the same three samples also showed decreased BACH2 gene expression (Fig. 6f).
Finally, we analyzed 20 cases with paired WGS, ATAC-seq and RNA-seq data (Supplementary Table 4). We identified five recurrently mutated promoters with allelic skewing of chromatin accessibility and RNA expression. Three, BTG2, CCND1 and ST6GAL1, were associated with allelic skewing towards the mutant allele, whereas ATAD1 and BIRC3 showed the opposite effect (Extended Data Fig. 6c). In the case of ATAD1, which plays a role in mitochondria protein degradation, we additionally observed reduced expression in promoter-mutated samples (P = 7.0 × 10−4) (Extended Data Fig. 6d–f).
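One way to quantify the allelic skewing discussed above is to compare the mutant-allele fraction among ATAC-seq or RNA-seq reads with the variant allele fraction (VAF) seen in WGS. The sketch below assumes an exact two-sided binomial test with the WGS VAF as the expected proportion; the statistical test actually used is not stated in this section, and the read counts are hypothetical.

```python
from math import comb

def binom_two_sided(k, n, p):
    """Exact two-sided binomial test: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * (1 + 1e-9)))

def allelic_skew(alt_reads, total_reads, wgs_vaf):
    """Return (shift in mutant-allele fraction relative to WGS, binomial p-value)."""
    observed = alt_reads / total_reads
    return observed - wgs_vaf, binom_two_sided(alt_reads, total_reads, wgs_vaf)

# Hypothetical promoter variant: WGS VAF of 0.45, but only 2 of 40 ATAC-seq
# reads carry the mutant allele, i.e. accessibility favors the reference allele
shift, p_value = allelic_skew(alt_reads=2, total_reads=40, wgs_vaf=0.45)
print(f"shift = {shift:+.2f}, p = {p_value:.2e}")
```

A negative shift corresponds to skewing towards the reference allele, a positive shift to skewing towards the mutant allele.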
Collectively, these data suggest that a small subset of the noncoding mutations in CLL have characteristics indicative of a driver and target regulatory elements of genes that are critical for B cell development and function as well as cancer progression. However, the effects on chromatin accessibility and gene expression levels were subtle and require further in-depth functional characterization.
Clinical impact of combined and global genome features
We recalculated the occurrence of mutations in each known or putative driver in CLL by combining coding mutations, noncoding mutations in regulatory elements and CNAs (Fig. 7a and Supplementary Table 19). In total, 33 of the 58 coding, known or putative driver genes were also affected by noncoding mutations in associated regulatory elements or by CNAs. Overall, 412 (29%) of all alterations in these genes were either CNAs or affected regulatory elements. ATM and BIRC3 were most frequently targeted by genetic lesions. The median number of mutated known or putative drivers in each tumor was 2 (0–7) or 5 (0–21) when excluding or including CNAs/copy neutral loss of heterozygosity (cnLOH) events and noncoding variants, respectively (Fig. 7b). A higher number of mutated genes was associated with worse PFS, especially when noncoding variants were included (Extended Data Fig. 7a,b and Supplementary Tables 15 and 16). Furthermore, the number of samples containing mutations in particular pathways also increased (by 3.3%) (Fig. 7c and Supplementary Fig. 5), in particular for the NOTCH and the transcriptional regulation pathways.
We explored whether global genomic features could also be associated with clinical outcome. Firstly, we evaluated telomere length and observed that it was reduced in CLL samples compared with paired germline (median length of 2.7 kb versus 3.8 kb, P < 2.2 × 10–16, median content of 405 versus 467, P = 3.9 × 10−6, paired Wilcoxon test) (Fig. 7d and Extended Data Fig. 8a,b). Shorter telomeres were significantly enriched in samples with p53 pathway alterations (P = 1.99 × 10−36; Fig. 7d), with R/R samples compared with frontline (FDR = 5.37 × 10−7; Supplementary Table 14) and were associated with poorer PFS (FDR = 4.39 × 10−4; Supplementary Table 15 and Extended Data Fig. 8c,d).
Secondly, we explored the clinical associations of mutation signatures including single base substitution (SBS), doublet base substitutions (DBS) and small insertions and deletions (ID)38 (Fig. 7e,f and Supplementary Table 20). Considering signatures with known or probable etiology, the most prevalent were SBS5 (clock-like), DBS11 (APOBEC activity) and ID2 followed by other clock-like signatures: SBS1 (deamination of 5-methylcytosines), SBS8, DBS2 and the AID signature SBS9. As previously documented, SBS9 was highly enriched in m-IGHV CLLs (FDR = 4.80 × 10−57, Fisher’s exact test; Supplementary Table 14), was mutually exclusive with TP53 alterations (2.29 × 10−3) and associated with good PFS (Supplementary Table 15 and Extended Data Fig. 8e). De novo signature ID83C was found associated with TP53 alterations (FDR = 2.53 × 10−2; Supplementary Table 14) and poorer PFS (1.57 × 10−2; Extended Data Fig. 8f and Supplementary Table 15). SBS1 was also associated with adverse outcome (3.70 × 10−2; Supplementary Table 15 and Extended Data Fig. 8g).
Thirdly, we analyzed GC using unsupervised clustering (multiple correspondence analysis (MCA)) of 17 features related to CNAs (Extended Data Fig. 9a,b; Methods). These defined eight groups (GC1–GC8) (Extended Data Fig. 9c,d) with distinct genomic profiles (Fig. 7g and Extended Data Fig. 9e). GC4 (presenting CN losses only, n = 210) was enriched in del13q14.2 (FDR = 3.26 × 10−23). GC7 (presenting both CN gains and losses, n = 127) was associated with ten recurrent CNAs and seven known coding drivers including XPO1 (FDR = 3.98 × 10−11) and TP53 (FDR = 8.36 × 10−9). Together with GC8 (presenting trisomy, CN gains and losses, n = 15), GC7 comprised the most patients with conventional genomic complexity, defined by the presence of at least four CNAs (Extended Data Fig. 9f). None of the genomic complexity groups was significantly enriched in stereotyped subsets (Extended Data Fig. 9g). For the subset of samples with survival data (n = 243), we combined genomic complexity groups with copy number gains only, copy number losses only and both copy number gains and losses to increase statistical power. Interestingly, the eight groups were associated with different PFS and OS (Extended Data Fig. 10a,b), independent of TP53 status (Extended Data Fig. 10c,d). Furthermore, patients with both TP53 mutations and GC7/8 changes had ultrahigh-risk disease (median PFS = 8 months, median OS = 15 months) and fared worse compared with patients with TP53 mutations but no GC7/8 status (P = 0.03; Fig. 8a,b).
Towards a patient classifier
To evaluate the potential clinical relevance of combining different genomic features, we first used penalized multivariate regression analysis with the least absolute shrinkage and selection operator (LASSO). This analysis led to the identification of 56 individual genomic features that predicted PFS and/or OS including SMCHD1/del18p11.32-p11.31, which retained significance as an independent predictor of OS (Extended Data Fig. 10e and Supplementary Fig. 6a).
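To show how a least absolute shrinkage and selection operator penalty selects features, here is a minimal sketch of the penalty applied to ordinary linear regression and solved by cyclic coordinate descent. This is a deliberate simplification: survival endpoints such as PFS and OS are normally modeled with a penalized Cox regression, and nothing below reproduces the study’s model. The data and the penalty value are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso(x, y, lam, iterations=200):
    """Minimize (1/2n)*||y - X b||^2 + lam*||b||_1 by cyclic coordinate descent."""
    n, p = len(x), len(x[0])
    beta = [0.0] * p
    for _ in range(iterations):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j
            rho = sum(
                x[i][j] * (y[i] - sum(x[i][k] * beta[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            norm = sum(x[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / norm
    return beta

# Hypothetical binary alteration matrix (rows = patients) and a continuous outcome;
# the first alteration has a strong effect, the second a negligible one
features = [[1, 0], [0, 1], [1, 0], [0, 1]]
outcome = [2.0, 0.1, 2.0, 0.1]
print(lasso(features, outcome, lam=0.1))  # the weak feature is dropped
print(lasso(features, outcome, lam=0.0))  # without the penalty, both are kept
```

The soft-thresholding step is what sets weak coefficients exactly to zero, which is why the penalty doubles as a feature selector.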
Next, we applied non-negative matrix factorization (NMF) to identify robust subgroups of CLLs sharing subsets of the 186 different genetic alterations (Supplementary Table 21; Methods). Considering the profound clinical impact of the IGHV mutational status, we initially divided patients into m-IGHV and u-IGHV. Using this approach, we identified five distinct GS: three were u-IGHV (u-GS1, 2 and 3) and two m-IGHV (m-GS1 and 2) (Fig. 8c,d and Supplementary Table 22).
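The idea behind NMF-based subgrouping can be sketched on a toy patient-by-alteration matrix. The example assumes the classic multiplicative-update algorithm for the squared-error objective and assigns each patient to the factor with the largest weight; the study’s actual NMF implementation, rank selection and robustness checks are described in the Methods and are not reproduced here. The matrix is hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def nmf(v, k, iterations=500, seed=0, eps=1e-9):
    """Factorize non-negative v (patients x alterations) as w @ h by multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)] for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)] for i in range(n)]
    return w, h

def assign_subgroups(w):
    """Assign each patient to the factor with the largest weight."""
    return [max(range(len(row)), key=row.__getitem__) for row in w]

# Hypothetical matrix: patients 0-1 share alterations 0-1, patients 2-3 share 2-3
alterations = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
weights, profiles = nmf(alterations, k=2)
print(assign_subgroups(weights))
```

Here the rows of `profiles` play the role of subgroup-defining alteration patterns, and the rows of `weights` give each patient’s loading on them.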
When considering u-IGHV CLL (Fig. 8c and Supplementary Table 23), u-GS1 was characterized by the presence of high-risk features including TP53 disruption, GC7, short telomeres and mutations in targetable pathways such as MAPK, PI3K and apoptosis, but there was no DNA damage response signature. By contrast, u-GS2 was defined by ATM/BIRC3/del11q22.2-22.3 alterations, as well as mutations in DNA damage response pathways, but without TP53 mutations or genomic complexity as defining features. Patients in u-GS2 were predominately male. u-GS3 had a high number of mutations in known and putative coding drivers, introns and UTRs, CN gains including trisomy 12, NOTCH1 mutations, and was enriched for older patients. All three subgroups included patients with BCR IG subsets 1 and 8, which are known to be associated with aggressive disease39 (Supplementary Fig. 6b). Although u-GS2 and u-GS3 were clearly distinct, they were associated with similar PFS after chemoimmunotherapy (Fig. 8e).
Regarding m-IGHV CLL (Fig. 8d), m-GS1 was similar to u-GS1 (cosine similarity of 0.81) and also to u-GS2 (cosine similarity of 0.7) (Supplementary Table 24). In contrast, m-GS1 was enriched for older men, BCR IG subset 2 (FDR = 2.96 × 10−6) and IGHV3-21 (FDR = 7.50 × 10−9) (Supplementary Fig. 6b), although most patients in m-GS1 did not have any defined CLL stereotype. m-GS2 had high mutation burden in enhancers, UTRs and promoters, was enriched for del13q14.2 but no other CNAs and had longer telomeres compared with the mean length in CLL. Additional clustering (Methods) further refined m-GS2 into two distinct clusters (Supplementary Fig. 6c). m-GS2 cluster 1 stood out by the high frequency of SBS9, the presence of GC4 and the absence of any other features. In comparison, m-GS2 cluster 2 had MYD88 mutations, trisomy 12 and other CN gains but no CN losses (Supplementary Table 25). Both clusters of m-GS2 had a very favorable PFS of 75% and showed a plateau of PFS, implying cure after chemoimmunotherapy (Fig. 8f). By contrast, patients belonging to m-GS1 had a shorter PFS than u-GS2/u-GS3 (median PFS = 38 versus 50 months; Fig. 8e) and there was no plateau.
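The cosine similarity used to compare subgroup profiles is the normalized dot product of two feature vectors. A minimal sketch follows, with hypothetical vectors standing in for subgroup feature weights.

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical feature weights for two subgroups (one value per genomic alteration)
profile_a = [0.9, 0.7, 0.1, 0.0]
profile_b = [0.8, 0.5, 0.3, 0.1]
print(round(cosine_similarity(profile_a, profile_b), 2))
```

For non-negative profiles the value ranges from 0 (no shared features) to 1 (identical relative weighting), regardless of overall magnitude.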
In our analysis of patients treated with chemoimmunotherapy, NMF subgroups could not be defined without the different acquired local and global noncoding genomic changes, since combining all known coding drivers and the four common recurrent CNAs did not cluster patients into the GSs (Supplementary Figs. 6d and 7). Based on this observation, we examined whether the NMF method could be used to prospectively and precisely assign individual patients into their subgroup for individualized outcome prediction in the clinic. Our validation, performed by subsetting the dataset (Methods) showed that a total of 15/16 m-IGHV samples and 48/51 of u-IGHV samples were assigned correctly to their respective subgroup (Fig. 8g).
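The assignment accuracy implied by these counts can be worked out directly. The sketch uses only the figures quoted above (15/16 and 48/51); the variable names are illustrative.

```python
from fractions import Fraction

# Correctly assigned / total held-out samples, as reported in the text
results = {"m-IGHV": (15, 16), "u-IGHV": (48, 51)}

def accuracy(correct, total):
    """Exact proportion of correctly assigned samples."""
    return Fraction(correct, total)

per_group = {name: accuracy(c, t) for name, (c, t) in results.items()}
overall = accuracy(
    sum(c for c, _ in results.values()),
    sum(t for _, t in results.values()),
)

for name, acc in per_group.items():
    print(f"{name}: {float(acc):.1%}")
print(f"overall: {float(overall):.1%}")
```

Both groups, and the pooled 63 of 67 samples, come out at roughly 94% correct assignment.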
Our study presents the first comprehensive WGS analysis of a large series of CLL patients requiring treatment. A main strength of our study is that it is based on patients enrolled into multicenter clinical trials, thereby reducing heterogeneity. This allowed us to not only define the genomic landscape of different stages of CLL3,4, but also to identify mutations associated with disease relapse and transformation.
Based on a strict pipeline for discovery of coding drivers, we selected the top ranked recurrently mutated genes, which comprised 36 known CLL drivers3,6,7 and 22 putative drivers. Only 32% of variants in those putative driver genes were missense variants, with most being truncating and stop-gain mutations. Although these putative drivers shared characteristics of known drivers (that is, damaging mutations in protein domains, impact on RNA expression, high CCF that further increased at disease progression, association with survival), we cannot exclude the possibility that some may simply represent passengers.
We defined recurrent translocations (with breakpoints in WDHD1; CTNND2-ARHGAP18) and 126 candidate noncoding drivers within REs pinpointing potentially druggable target genes (NOTCH1, DTX1, NFKBIZ, NTRK2 and BACH2). For a small subset of selected noncoding candidate mutations, we were able to demonstrate a modest impact on chromatin accessibility and/or target gene expression (5′ UTR of BCL2, enhancer of BCL6, promoter of BACH2 and promoter of ATAD1).
Exploring different layers of genomic data including coding, noncoding and genome-wide global changes allowed us to (1) derive a WGS-derived genomic complexity classification that further refines risk by identifying an independent ultrahigh-risk group associated with complex genomic alterations (GC7/8); (2) more precisely predict individual patients who achieve a plateau after chemoimmunotherapy (m-GS2) and are functionally cured, thereby clearly differentiating them from progressors in the m-GS1 subgroup.
Ideally, only genomic features experimentally validated as disease drivers should be included in any prognostic classification system, even if they were selected by very stringent criteria such as those applied in this study (see above). However, it is well recognized that some genomic features are clearly not disease drivers, yet carry prognostic relevance. For example, in CLL, the IGHV mutation status representing the cell-of-origin and telomere length reflecting proliferative activity are associated strongly with clinical outcome, but are not considered disease drivers.
In our NMF model, using only the known coding drivers and recurrent CNAs did not allow us to recover the same level of discrimination as that afforded by inclusion of additional local and global noncoding information. This observation implies that the combination of coding and noncoding information in the classifier increases the precision of clinical risk prediction, at least in our cohort of clinical trial patients.
Although treatment algorithms for CLL are shifting away from chemoimmunotherapy to targeted agents, the subgroups we define remain potentially clinically relevant as they reflect distinct biological entities. Collectively, our study provides a springboard for downstream functional analyses of putative coding and noncoding drivers. Robust testing on independent cohorts of patients undergoing targeted therapy will be required to further establish the clinical utility of this WGS-based classifier.
Patient cohorts, samples and ethics
All patients gave written informed consent and the study was approved under the 100,000 Genomes Project Ethics and the CLL Pilot ethics (MREC 09/H1306/54). A total of 485 patients with CLL were included in the study. A small subset was enrolled into CLEAR (CLL Empirical Antibiotic Regimen, early stage of the disease, n = 12, NCT01279252) and CLL210 (ref. 42) (relapsed/refractory patients, n = 30, EudraCT 2010-019575-29). All other patients were treatment-naïve and required treatment according to iwCLL criteria43. They were either fit patients receiving frontline fludarabine, cyclophosphamide, rituximab (FCR)-based treatment in ARCTIC44 (Attenuated dose Rituximab with ChemoTherapy In CLL, n = 61, EudraCT 2009-010998-20) or AdMIRe45 (Does the ADdition of Mitoxantrone Improve REsponse to FCR chemotherapy in patients with CLL, n = 65, EudraCT 2008-006342-25) or frail patients receiving ofatumumab with either bendamustine or chlorambucil chemoimmunotherapy in RIAltO (A Trial Looking at Ofatumumab for People With Chronic Lymphocytic Leukemia Who Cannot Have More Intensive Treatment, n = 92, NCT01678430). Patients recruited into FLAIR46 (Front-Line therapy in CLL: Assessment of Ibrutinib + Rituximab, n = 225, EudraCT 2013-001944-76) were randomized to ibrutinib alone or in combination with rituximab or venetoclax or standard first-line FCR treatment. In line with the studies’ data monitoring committees, baseline characteristics and clinical outcomes data were available only from studies once closed to recruitment (see Supplementary Table 1 for details of all patients recruited into the 100,000 Genomes Project). For patients recruited into the FLAIR study, these data are still awaited.
For a subset of 25 patients, we obtained a sample taken at relapse (Supplementary Table 4).
To investigate findings in more advanced disease, we reanalyzed WGS data coming from a cohort of 17 patients from whom two concurrent samples were collected: the CLL phase and the transformed phase (RS). This cohort includes samples and data generation as described in Klintman et al.23.
Only samples with a lymphocyte count of greater than 25 × 10⁹ l⁻¹ were included in the study, ensuring a tumor purity greater than 80%; the median lymphocyte count was 80 × 10⁹ l⁻¹ (range, 33.9–166.5) (Supplementary Table 1).
Peripheral blood mononuclear cells (PBMCs) and a saliva sample were collected from each patient, which served as a source of tumor and germline DNA, respectively. DNA was extracted from PBMCs and saliva using the QIAamp DNA mini kit (Qiagen) and the Oragene DNA saliva kit (DNA Genotek Inc), respectively, according to the manufacturer’s instructions. DNA quality was assessed using Nanodrop (Thermo Fisher Scientific) and quantified using Qubit (Thermo Fisher Scientific) technology. RNA was extracted from PBMCs using the RNeasy Mini Kit (Qiagen) according to the manufacturer’s instructions. The quality of RNA was assessed using the Agilent 4200 Tapestation System, using High Sensitivity tapes. The concentration was assessed using the Gemini XPS Microplate Spectrofluorometer from Molecular Devices and the Quant-iT HS RNA assay.
Whole-genome 125 bp paired-end TruSeq PCR–free libraries were sequenced using Illumina HiSeq2500 technology. Raw sequencing data was aligned with using Isaac v.03.16.02.19 to GRCh38. Alignment and coverage metrics were calculated using Picard v.2.12.1 and Bwtool47 showing a mean read depth of 36× and 109× for normal and tumor samples, respectively. All downstream analysis of WGS data was performed on the whole dataset of 485 samples, unless otherwise stated.
Libraries were prepared from samples of 74 patients using the Illumina Stranded Total RNA Prep, Ligation with Ribo-Zero Plus, with additional custom depletion probes, using 100 ng RNA. Libraries were sequenced on a NovaSeq 6000 system (Illumina) using 100 base paired-end chemistry (108–455 million read-pairs per sample). Sequencing reads were processed and aligned to Human Reference genome GRCh38 using the Illumina Dragen RNA pipeline v.3.8.4. Gentoyping was performed using bcftools mpileup48. Allele specific read counts were generated at sites of acquired SNVs determined by WGS.
ATAC-seq was performed as previously described49. Briefly, 7.5 × 10⁴ cells per technical replicate were resuspended in lysis buffer (10 mM Tris-HCl, pH 7.5, 10 mM NaCl, 3 mM MgCl₂, 0.1% IGEPAL CA-630). Nuclei were pelleted (500g for 10 min), PBS was discarded and nuclei were resuspended in tagmentation buffer (25 µl 2× tagmentation DNA buffer, 2.5 µl Tn5 Transposase (Illumina) and 22.5 µl water) then incubated (37 °C for 30 min). DNA was extracted using the MinElute PCR Purification Kit (Qiagen), half the DNA was amplified (NEBNext High-Fidelity 2× PCR Master Mix (New England Biolabs)) and purified with the QIAquick PCR Purification Kit (Qiagen). Libraries were sequenced using 40-bp paired-end reads (Illumina NextSeq).
Reads were mapped to GRCh38 using the PEPATAC pipeline with prealignment to the mitochondrial genome and default settings50. Genotyping was performed using bcftools mpileup48. Allele-specific ATAC-seq read counts were generated at sites of acquired SNVs determined by WGS.
Immunoglobulin gene characterization
To determine the IGHV status of our cohort, we prioritized data from Sanger sequencing, followed by WGS-derived data including IgCaller51 results and the presence of the noncanonical AID mutational signature (SBS9). This prioritization scheme resulted in 54% (264/485) of cases being classified by Sanger sequencing, 40% (194/485) by the IgCaller algorithm and 6% (27/485) by the mutational signature SBS9. The correlation between these three methodologies was high (Supplementary Table 26). In addition, the IgCaller algorithm was used to further characterize the IG genes, including to define the IGHV3-21 rearrangement in 10% (47/485) of cases and CLL stereotypy in 27% (132/485). To assign CLL stereotypes, the IgCaller output was used as input for the AssignSubsets online tool52, which annotates the 19 main subsets, including subsets 1, 2, 4 and 8, as recommended by ERIC guidelines39. In cases where more than one rearrangement was detected, we selected the rearrangement with the highest score to define the main CLL stereotype. In cases where a rearrangement was not assigned, but a proximal rearrangement was reported, we included this rearrangement in our analysis.
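The prioritization described above can be illustrated with a minimal Python sketch. The method keys (`sanger`, `igcaller`, `sbs9`), the status labels and the function name are hypothetical and chosen for illustration only; the sketch assumes each method either returns a call or no result.

```python
# Priority order taken from the text: Sanger sequencing first, then IgCaller,
# then the SBS9 mutational signature. Key names are illustrative assumptions.
PRIORITY = ("sanger", "igcaller", "sbs9")


def assign_ighv_status(calls):
    """Return (status, source) from the highest-priority method with a call.

    `calls` maps a method name to "mutated", "unmutated" or None (no result).
    Returns (None, None) if no method produced a call.
    """
    for source in PRIORITY:
        status = calls.get(source)
        if status is not None:
            return status, source
    return None, None
```

With this scheme, a sample lacking a Sanger result falls back to the IgCaller call, and SBS9 is consulted only when both are missing.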
Somatic variant calling and filtering
SNVs and indels were called using Strelka v.2.8.4 7 adopting default parameters. Filtering of SNVs/indels was performed as follows: depth greater than ten and allele fraction (AF) greater than 0.05; quality filter annotation ‘PASS’ and quality score greater than 30; and population allele frequency less than 0.05 in 1KGP phase 3 1405.34_GRCh38.p8 and ExAC v.0.3 data (annotated using Ensembl VEP GRCh38 release v.89.4 (ref. 53)). Additional filters according to the Illumina v.4 Genomics England annotation pipeline removed variants as follows: variants with a population germline frequency greater than 1% in either the Genomics England dataset or gnomAD v.3; recurrent somatic variants with a frequency greater than 5% in the Genomics England cohort; variants overlapping with LINE repeats or simple repeats found with Tandem Repeats Finder v.4.09 (ref. 54); calls within 50 bp either side of an indel where at least 10% of variants have been filtered due to quality; variants where the locus depth is greater than three times the mean chromosomal depth in the germline sample; variants containing multiple alternate alleles; variants where the germline sample is not homozygous reference or the indel Q-score is less than 30; variants with a variant quality score recalibration (VQSR) score less than 2.75; variants where most overlapping reads do not map uniquely to the variant position; variants within ten bases of a Genomics England in-house database or gnomAD v.3 germline indel with frequency greater than 1%; SNVs resulting from systematic mapping and calling artefacts; and variants failing the somatic panel of normals Phred cut-off (< 80).
The Supplementary Notes include details on cancer cell fraction calculation as well as coding and noncoding variant annotations. In addition, they describe our approach for assigning target genes of regulatory elements and for identifying coding and noncoding candidate drivers.
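As an illustration of the first tier of filters only (depth, allele fraction, PASS flag, quality score and population frequency), the following Python sketch applies the stated thresholds to a single variant record. The `Variant` fields and the function name are hypothetical; the additional Genomics England pipeline filters are not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Field names are illustrative assumptions, not the pipeline's schema.
    depth: int              # read depth at the variant site
    allele_fraction: float  # fraction of reads supporting the alternate allele
    filter_flag: str        # caller filter annotation, e.g. "PASS"
    quality: float          # caller quality score
    population_af: float    # highest 1KGP/ExAC allele frequency (0.0 if absent)


def passes_first_tier(v: Variant) -> bool:
    """Apply the thresholds stated in the text to one variant."""
    return (
        v.depth > 10
        and v.allele_fraction > 0.05
        and v.filter_flag == "PASS"
        and v.quality > 30
        and v.population_af < 0.05
    )
```

All comparisons are strict, mirroring the "greater than" and "less than" wording of the filter description.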
Structural variant identification
The structural variant (SV) calling pipeline for detection of inversions and translocations was as follows. (1) Delly55 was used to call variants in each tumor–germline pair, with the following steps: complete somatic prefiltering, genotype all potentially somatic sites across all CLL germline samples, postfilter for somatic SVs using control samples. Variants with an alternative AF less than 0.05 were removed. (2) Lumpy v.0.2.13 (ref. 56) and (3) Manta v.0.28.0 were also used to call SVs. Variants with an alternative AF < 0.05 or for which there was any evidence in the germline were removed for consistency. (4) The pcawg-merge-sv consensus calling pipeline57 was adapted for this analysis. SVs supported by two or more callers were reported.
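The "two or more callers" rule can be sketched in Python as below. This is not the pcawg-merge-sv implementation: the matching rule (same SV type and chromosomes, both breakpoints within a tolerance) and the 500 bp default are assumptions made for illustration, as are the record keys.

```python
def same_event(a, b, tol=500):
    """Assumed matching rule: same type and chromosomes, breakpoints within tol bp."""
    return (
        a["type"] == b["type"]
        and a["chrom1"] == b["chrom1"]
        and a["chrom2"] == b["chrom2"]
        and abs(a["pos1"] - b["pos1"]) <= tol
        and abs(a["pos2"] - b["pos2"]) <= tol
    )


def consensus_svs(calls_by_caller, min_callers=2, tol=500):
    """Greedily cluster calls and keep clusters seen by >= min_callers callers.

    `calls_by_caller` maps a caller name to a list of SV records (dicts).
    The first call of each cluster is returned as its representative.
    """
    clusters = []  # each cluster: {"rep": record, "callers": set of caller names}
    for caller, calls in calls_by_caller.items():
        for call in calls:
            for cluster in clusters:
                if same_event(cluster["rep"], call, tol):
                    cluster["callers"].add(caller)
                    break
            else:
                clusters.append({"rep": call, "callers": {caller}})
    return [c["rep"] for c in clusters if len(c["callers"]) >= min_callers]
```

Because support is counted over distinct caller names, two near-identical calls from the same caller do not make an SV eligible on their own.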
Identification of CNAs
We used both DNA microarray (n = 109 samples) and WGS (n = 485 samples) to determine CNAs and observed high concordance between the two methods. Of 282 CNAs detected by WGS, 240 (85%) were also reported by DNA microarray with high confidence (Supplementary Table 6). In addition, we further reduced false positive signals using a combination of intersects between several variant callers and visual inspection as detailed below.
Samples from a subset of 109 patients enrolled in the ARCTIC and AdMIRe trials were genotyped using HumanOmni2.5-8 BeadChip arrays (Illumina Inc.). Genotypes were called using GenomeStudio v.2009.2 (Illumina Inc.). CN gains and losses greater than 50 kb and cnLOH less than 5 Mb were reported using Nexus Copy Number v.10 (BioDiscovery, Inc.), as previously described16,58, with the following settings (SNPRank Segmentation): significance threshold, 1 × 10⁻⁵; max contiguous probe spacing (kb), 1000.0; minimum number of probes per segment, 5; high gain, 0.6; gain, 0.2; loss, –0.2; big loss, –1.0; 3:1 sex chromosome gain, 1.2; homozygous frequency threshold, 0.95; homozygous value threshold, 0.8; heterozygous imbalance threshold, 0.4; minimum LOH length (kb), 20; percentage outliers to remove, 3%. We also visually inspected all genomes using the Nexus visualization tool for changes not identified with these analysis settings.
In the case of WGS, Canvas v.1.3.1 (ref. 59) and Manta v.0.28.0 were used to call CNAs, filtering out centromeric and telomeric regions as defined in the UCSC cytoband table. Variants reported by Canvas with a quality score less than ten were filtered out. Variants reported by Manta were filtered out as follows: (1) variants with a normal sample depth near one or both variant break-ends three times higher than the chromosomal mean, and (2) variants with somatic quality score of less than 30.
For each remaining CNA, its presence and type (gain or loss) were confirmed by visually inspecting the genome-wide mean coverage and B-allele frequency data, derived from the aligned reads in 100 kb windows. Calls with continuous copy number changes of length greater than 100 kb were kept. The Supplementary Notes include details on cancer cell fraction calculation.
Counts of number of drivers
We calculated the total number of drivers in each patient using the following measures: (1) the total mutational burden, obtained by counting the number of functional variants (that is, those with the following exonic consequences: splice acceptor variant, splice donor variant, stop gained, frameshift variant, stop lost, start lost, transcript amplification, in-frame insertion, in-frame deletion, missense variant, protein-altering variant or incomplete terminal codon variant); (2) the number of coding drivers (out of 58) mutated by SNVs/indels; and (3) the number of mutated coding (SNVs/indels and CNAs) and noncoding drivers.
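The first two counts can be illustrated with a short Python sketch. The consequence names are written as Sequence Ontology terms in the form used by VEP; the input layout (a list of gene and consequence pairs) and the restriction of driver counting to functional variants are assumptions for illustration.

```python
# Functional consequences listed in the text, as VEP/Sequence Ontology terms.
FUNCTIONAL_CONSEQUENCES = {
    "splice_acceptor_variant", "splice_donor_variant", "stop_gained",
    "frameshift_variant", "stop_lost", "start_lost",
    "transcript_amplification", "inframe_insertion", "inframe_deletion",
    "missense_variant", "protein_altering_variant",
    "incomplete_terminal_codon_variant",
}


def count_drivers(variants, coding_drivers):
    """Return (mutational burden, number of mutated coding drivers).

    `variants` is a list of (gene, consequence) tuples for one patient
    (hypothetical layout); `coding_drivers` is the set of driver gene names.
    """
    burden = sum(1 for _, cons in variants if cons in FUNCTIONAL_CONSEQUENCES)
    mutated = {
        gene for gene, cons in variants
        if gene in coding_drivers and cons in FUNCTIONAL_CONSEQUENCES
    }
    return burden, len(mutated)
```

A driver gene hit by several functional variants contributes once to the driver count but once per variant to the burden.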
Two pathway datasets were used: PANCANCER containing 14 pathways from The NanoString PanCancer Pathways Panel and KEGG containing 23 signaling pathways60. For the six pathways in common between the two lists, the PANCANCER pathway was selected, resulting in 31 unique pathways included in the analysis. We counted the number of patients with mutations per pathway considering (1) a gene panel of the coding drivers (n = 58); (2) the exome (coding drivers plus exonic mutation with high impact according to VEP annotations: splice_acceptor_variant, splice_donor_variant, stop_gained, frameshift_variant, stop_lost, start_lost); (3) a larger driver panel containing both coding drivers and regulatory candidate drivers (n = 58 + 126) and (4) all of the above combined (coding and noncoding drivers plus exonic mutation with high impact according to VEP annotations).
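The merging rule (14 + 23 pathways, with the PANCANCER definition preferred for the six shared ones, giving 31) and the per-pathway patient count can be sketched in Python as follows. Pathway names, gene sets and the data layout are hypothetical.

```python
def merge_pathways(pancancer, kegg):
    """Merge two {pathway name: gene set} dicts, preferring PANCANCER on overlap.

    Returns {pathway name: (source, gene set)}.
    """
    merged = {name: ("KEGG", genes) for name, genes in kegg.items()}
    # Updating second makes the PANCANCER definition win for shared names.
    merged.update({name: ("PANCANCER", genes) for name, genes in pancancer.items()})
    return merged


def patients_per_pathway(merged, mutated_genes_by_patient):
    """Count patients with at least one mutated gene in each pathway.

    `mutated_genes_by_patient` maps a patient ID to a set of mutated genes.
    """
    return {
        name: sum(1 for mutated in mutated_genes_by_patient.values() if mutated & genes)
        for name, (_, genes) in merged.items()
    }
```

The four gene-panel definitions described above differ only in which genes enter `mutated_genes_by_patient`; the counting step is the same.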
Telomere analysis was carried out on all 485 CLL tumor–normal pairs. Telomere content was estimated using Telomere Hunter v.1.1.0 (ref. 61) and normalized by the total number of reads with a ‘telomere-like’ GC content (48–52%). Telomere length in base pairs was estimated using Telomerecat v.1.0 (ref. 62). We found that telomere content assessed using Telomere Hunter and telomere length assessed using Telomerecat were highly correlated (correlation coefficient, 0.84; P < 2.2 × 10⁻¹⁶; Extended Data Fig. 8a). We compared the telomere lengths and contents between CLL samples and matched saliva samples as germline63, considering that different cell types can naturally present different telomere lengths64.
Chromothripsis was identified using Shatterseek65, which aims to detect candidate regions on the basis of oscillating copy number states (using CNAs as previously described), as well as intersection with clusters of interleaved structural variants (SVs; that is, deletions, duplications, inversions and translocations) identified from the SV consensus pipeline previously described. Potential regions of chromothripsis were classified as ‘high confidence’ or ‘low confidence’ using criteria as per Cortés-Ciriano et al.65.
Extraction of SBS, DBS and small ID signatures was performed using SigProfilerExtractor v.1.0.1810 (ref. 66). SigProfilerExtractor de novo signature extraction and decomposition were carried out according to default parameters, with potential de novo extracted signature solutions tested between 1 and 25 signatures. Signatures were referenced to the Catalogue of Somatic Mutations in Cancer (COSMIC) v.3; SigProfilerExtractor signatures were decomposed based on a cosine similarity greater than 0.9. Following decomposition to COSMIC signatures, SigProfilerExtractor estimated the overall signature contributions per tumor, as well as the per tumor signature estimates for each mutation context. Through associating these context estimates back to the original mutations, signature estimates were attributed to individual driver mutations, as well as genomic regions (exome, promoters, UTRs, and so on).
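The cosine-similarity criterion used when referencing signatures to COSMIC can be illustrated as below. This Python sketch shows only the similarity measure and the 0.9 threshold; it does not reproduce SigProfilerExtractor's decomposition, which fits combinations of reference signatures. Function names and the two-element toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length non-negative vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_reference_match(signature, references, threshold=0.9):
    """Return (name, score) of the most similar reference signature.

    The name is None when no reference exceeds the threshold.
    """
    best_name, best_score = None, 0.0
    for name, ref in references.items():
        score = cosine_similarity(signature, ref)
        if score > best_score:
            best_name, best_score = name, score
    if best_score > threshold:
        return best_name, best_score
    return None, best_score
```

In practice each vector would hold the 96 trinucleotide-context probabilities of an SBS signature rather than two values.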
We investigated the presence of genomic complexity (GC) using an unsupervised multiple correspondence analysis with FactoMineR67. We included 17 genomic measures as binary data, including variables binned as less than median or greater than or equal to median: number of SNVs, number of indels, telomere lengths, telomere content and variables binned as presence/absence: SV breakpoint, CNA, CN gain, CN loss, cnLOH, trisomy, aneuploidy, CN gain excluding trisomy, CN loss excluding aneuploidy, cnLOH excluding whole chromosome cnLOH, inversion, translocation and chromothripsis.
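The two binning rules used to prepare the binary input can be written as a short Python sketch (function names are hypothetical; the multiple correspondence analysis itself is not shown).

```python
from statistics import median


def binarize_by_median(values):
    """Code each value as 1 if >= cohort median, else 0 (for example, SNV counts)."""
    m = median(values)
    return [1 if v >= m else 0 for v in values]


def binarize_by_presence(counts):
    """Code each sample as 1 if the event is present at least once, else 0."""
    return [1 if c > 0 else 0 for c in counts]
```

Values equal to the median are assigned to the upper bin, matching the "greater than or equal to median" wording.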
Genomic alterations in known risk factors and disease states
All genomic alterations derived from WGS were combined and included as follows: noncoding candidate drivers mutated in more than 5% of samples were included; coding drivers were combined according to the presence of SNVs/indels and CNAs (union); and recurrent CNAs that significantly co-occurred (mean square contingency coefficient, mu > 0.3) and were located on the same chromosome were combined (union). In addition, only genomic alterations with at least five occurrences across all the samples were included in the analysis. In total, 186 genomic alterations remained, including 58 coding drivers, 36 recurrent CNAs, 44 noncoding drivers, 12 pathways affected by genetic alterations, 28 global genomic features and mutational signatures, and eight genomic complexity groups (Supplementary Table 21).
We tested for enrichment (two-sided Fisher’s exact test, FDR ≤ 0.05) of each genomic alteration in several known risk factor and disease state groups for samples with available data: age (195 samples < median age versus 216 samples ≥ median age); sex (338 male versus 136 female); disease stage (443 frontline versus 30 R/R); TP53 status (420 WT versus 65 disrupted); IGHV mutational status (197 hypermutated versus 288 unmutated); minimal residual disease (MRD; 59 negative versus 57 positive); BCR IG subset 2 (33 presenting subset 2 versus 450 others); IGHV3-21 rearrangement (47 with versus 436 without).
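The mean square contingency coefficient for two binary alterations is the phi coefficient of their 2 × 2 contingency table. A minimal Python sketch of this coefficient and of the union used to combine co-occurring alterations is given below; function names and the 0/1 vector layout (one entry per sample) are assumptions.

```python
import math


def phi_coefficient(x, y):
    """Phi (mean square contingency) coefficient of two 0/1 vectors."""
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0  # undefined when one alteration is constant; treated as 0 here
    return (n11 * n00 - n10 * n01) / denom


def union(x, y):
    """Combine two alterations: present if either is present."""
    return [1 if a or b else 0 for a, b in zip(x, y)]
```

Under the rule in the text, two CNAs on the same chromosome would be merged with `union` when `phi_coefficient` exceeds 0.3.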
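For illustration, a two-sided Fisher's exact test and an FDR adjustment can be written with the Python standard library as below. The text does not name the FDR procedure; the Benjamini–Hochberg method is assumed here, and the function names are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    r1, r2, c1 = a + b, c + d, a + c
    total = comb(r1 + r2, c1)

    def prob(k):
        return comb(r1, k) * comb(r2, c1 - k) / total

    p_obs = prob(a)
    lo, hi = max(0, c1 - r2), min(r1, c1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Each genomic alteration would contribute one 2 × 2 table per risk group (alteration present or absent versus group membership), and alterations with an adjusted P value of 0.05 or below would be reported.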
Relationship between genomic alterations and patient outcome
We examined the relationship between each of the 186 genomic features as detailed above (Supplementary Table 21) and patient outcomes using Cox proportional hazards models on 243 patients for PFS and 245 patients for OS. FDR-corrected P values were reported as significant if less than 0.05. In addition, several particular comparisons with more than two groups were performed using Kaplan–Meier curves and the log-rank test. These were: number of mutated drivers, the eight genomic complexity groups and the combination of different structural rearrangements. We also performed a multivariate analysis using penalized Cox regression, as implemented in the R package glmnet68, to find a minimal set of predictors with maximal predictive power. An optimal value of the penalization parameter λ was selected using leave-one-out cross-validation; specifically, the value of λ that minimizes the cross-validation error.
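For the group comparisons based on Kaplan–Meier curves, the estimator itself can be sketched in a few lines of Python. This is a minimal illustration, not the analysis code: the function name and input layout (follow-up times with 1 for an event and 0 for censoring) are assumptions, and neither the log-rank test nor the Cox models are shown.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    `times` are follow-up times; `events` are 1 (event observed) or 0 (censored).
    Returns a list of (time, survival probability) at each distinct event time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all subjects sharing this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= removed  # events and censored subjects both leave the risk set
    return curve
```

Computing one curve per group (for example, per genomic complexity group) gives the stratified curves that are then compared.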
Patient stratification using non-negative matrix factorization
All 186 genomic features, as well as IGHV status including percent homology to germline (labeled MS), age and sex were selected for unsupervised clustering using non-negative matrix factorization69,70 using the NMF v.0.22.0 R package71 with the offset method72,73. Data were converted to a binary matrix using either presence or absence of a feature, or above or below the mean to avoid a mixture of binary and nonbinary data (Supplementary Note). After removal of samples without age information, samples were divided into m-IGHV (n = 168) and u-IGHV (n = 243) as defined above. The number of permitted NMF clusters in either the m- or u-IGHV subset was determined using a combination of rank estimation methods including the cophenetic correlation coefficient74,75,76. Data were randomized and the ranks estimated for comparison to avoid overfitting. NMF was carried out on each IG subset of samples separately to produce GSs.
DeconstructSigs v.1.9.0 (ref. 77) was designed to use the mutation catalog of a sample to define the linear combination of COSMIC signatures that best reconstruct that sample’s mutational profile. Here, we used this tool to define the linear combination of GSs calculated using the NMF method that best reconstruct the genomic features of a sample. The proportions of each GS within all patients were then clustered using mclust v.5.4.6 (ref. 78) and assigned a cluster that maximized parsimony whilst still producing an adequate prediction. The defined GSs were then compared with known subgroups such as BCR IG subsets and patients harboring an IGHV3-21 rearrangement.
Testing of the method was carried out as follows:
Data were randomly split into two trial groups, each representing 50% of the dataset, and further divided into m-IGHV and u-IGHV CLL. NMF was then performed on all genomic features in each group and evaluated using cosine similarity between group signature matrices (Supplementary Table 22);
all samples used for NMF were split at random into 80% training (m-IGHV: n = 133, u-IGHV: n = 195) and 20% testing (m-IGHV: n = 34, u-IGHV: n = 49) sets. NMF was performed on the training data as described above to produce GS matrices (m-GS, u-GS). The training data were then assigned to a GS by using deconstructSigs to identify the combination of GSs that best reconstructed a sample’s genomic feature matrix and then assigning the signature that occurred at the highest percentage. The signature assigned to the test samples was then compared with the signatures assigned to those same individuals when 100% of the data were used for both training and testing (Fig. 8g).
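The bookkeeping of this second test (random 80/20 split, assignment of each sample to its dominant GS, and comparison of assignments) can be sketched in Python as follows. NMF and deconstructSigs are not reproduced; the function names, the fixed seed and the GS labels are hypothetical.

```python
import random


def split_train_test(sample_ids, test_fraction=0.2, seed=0):
    """Randomly split sample IDs into (training, testing) lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def dominant_signature(proportions):
    """Return the GS with the highest reconstructed proportion for one sample."""
    return max(proportions, key=proportions.get)


def concordance(assign_a, assign_b):
    """Fraction of shared samples given the same GS in both assignments."""
    shared = assign_a.keys() & assign_b.keys()
    if not shared:
        return 0.0
    return sum(assign_a[s] == assign_b[s] for s in shared) / len(shared)
```

Here `assign_a` would hold the assignments of the test samples from the 80% training run and `assign_b` those obtained when all samples were used.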
Data wrangling and plotting
Plotting of data was performed using tidyverse v.1.3.0 (refs. 79,80) in R v.3.6.2 (ref. 81). Mutation hotspot graphics were plotted using the package GenVisR v.1.18.1 (ref. 82). Lollipop plots were generated with MutationMapper from cBioPortal (https://www.cbioportal.org/mutation_mapper). Genomic views were prepared using the UCSC genome browser83.
Statistics and reproducibility
The sample size calculation was critical to the success of this program. Our power calculations considered the heterogeneity of CLL and a background somatic mutation frequency of 0.8 mutations per megabase. This means that, to reliably detect somatic mutations recurring in 2% of patients with CLL, we need to sequence approximately 500 CLL genomes (Supplementary Fig. 8). No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
The National Genomic Research Library (NGRL) is a ‘reading library’; therefore, data cannot be extracted directly. All WGS data, BAM files and processed files cited can be viewed in situ via the Haematological Malignancy Genomics England Clinical Interpretation Partnership (GECIP), once an individual’s data access has been approved. Details on becoming a member of GECIP to gain access can be found at https://www.genomicsengland.co.uk/research/academic/join-gecip. The process involves an online application, verification by the applicant’s institution, completion of a short information governance training course (circa 30 min), and verification of approval by the Haematological Malignancy domain lead (A.S., see contact details for corresponding author). Please see https://www.genomicsengland.co.uk/research/academic for more information.
All RNA sequencing data has been deposited in the European Bioinformatics Institute (EMBL-EBI) ArrayExpress Archive of Functional Genomics Data database under accession number E-MTAB-12124.
The outcome of the clinical studies has been published (all references in Methods). Access to clinical datasets is subject to data sharing policies of the respective clinical trial units that provided legal sponsorship for the studies and can be made available on request to A. Pettitt (firstname.lastname@example.org; Department of Molecular and Clinical Cancer Medicine, University of Liverpool, Liverpool, UK) and P. Hillmen (email@example.com; St. James’s University Hospital, Leeds, UK). Source data are provided with this paper.
Puente, X. S. et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature 475, 101–105 (2011).
Schuh, A. et al. Monitoring chronic lymphocytic leukemia progression by whole genome sequencing reveals heterogeneous clonal evolution patterns. Blood 120, 4191–4196 (2012).
Puente, X. S. et al. Non-coding recurrent mutations in chronic lymphocytic leukaemia. Nature 526, 519–524 (2015).
Kasar, S. et al. Whole-genome sequencing reveals activation-induced cytidine deaminase signatures during indolent chronic lymphocytic leukaemia evolution. Nat. Commun. 6, 8866 (2015).
Zhao, Z. et al. Evolution of multiple cell clones over a 29-year period of a CLL patient. Nat. Commun. 7, 13765 (2016).
Burns, A. et al. Whole-genome sequencing of chronic lymphocytic leukaemia reveals distinct differences in the mutational landscape between IgHVmut and IgHVunmut subgroups. Leukemia 32, 332–342 (2018).
Landau, D. A. et al. Mutations driving CLL and their evolution in progression and relapse. Nature 526, 525–530 (2015).
Hallek, M. et al. Addition of rituximab to fludarabine and cyclophosphamide in patients with chronic lymphocytic leukaemia: a randomised, open-label, phase 3 trial. Lancet 376, 1164–1174 (2010).
Rossi, D. et al. Integrated mutational and cytogenetic analysis identifies new prognostic subgroups in chronic lymphocytic leukemia. Blood 121, 1403–1412 (2013).
Stilgenbauer, S. et al. Gene mutations and treatment outcome in chronic lymphocytic leukemia: results from the CLL8 trial. Blood 123, 3247–3255 (2014).
Quesada, V. et al. Exome sequencing identifies recurrent mutations of the splicing factor SF3B1 gene in chronic lymphocytic leukemia. Nat. Genet. 44, 47–52 (2011).
Skowronska, A. et al. Biallelic ATM inactivation significantly reduces survival in patients treated on the United Kingdom Leukemia Research Fund Chronic Lymphocytic Leukemia 4 trial. J. Clin. Oncol. 30, 4524–4532 (2012).
Fabbri, G. et al. Analysis of the chronic lymphocytic leukemia coding genome: role of NOTCH1 mutational activation. J. Exp. Med. 208, 1389–1401 (2011).
Edelmann, J. et al. High-resolution genomic profiling of chronic lymphocytic leukemia reveals new recurrent genomic alterations. Blood 120, 4783–4794 (2012).
Gunnarsson, R. et al. Array-based genomic screening at diagnosis and during follow-up in chronic lymphocytic leukemia. Haematologica 96, 1161–1169 (2011).
Knight, S. J. L. et al. Quantification of subclonal distributions of recurrent genomic aberrations in paired pre-treatment and relapse samples from patients with B-cell chronic lymphocytic leukemia. Leukemia 26, 1564–1575 (2012).
Malek, S. N. The biology and clinical significance of acquired genomic copy number aberrations and recurrent gene mutations in chronic lymphocytic leukemia. Oncogene 32, 2805–2817 (2013).
Brown, J. R. et al. Integrative genomic analysis implicates gain of PIK3CA at 3q26 and MYC at 8q24 in chronic lymphocytic leukemia. Clin. Cancer Res. 18, 3791–3802 (2012).
Lehmann, S. et al. Molecular allelokaryotyping of early-stage, untreated chronic lymphocytic leukemia. Cancer 112, 1296–1305 (2008).
Parker, H. et al. Genomic disruption of the histone methyltransferase SETD2 in chronic lymphocytic leukaemia. Leukemia 30, 2179–2186 (2016).
Austen, B. et al. Mutations in the ATM gene lead to impaired overall and treatment-free survival that is independent of IGVH mutation status in patients with B-CLL. Blood 106, 3175–3182 (2005).
Jaramillo, S. et al. Prognostic impact of prevalent chronic lymphocytic leukemia stereotyped subsets: analysis within prospective clinical trials of the German CLL Study Group (GCLLSG). Haematologica 105, 2598–2607 (2019).
Klintman, J. et al. Genomic and transcriptomic correlates of Richter transformation in chronic lymphocytic leukemia. Blood 137, 2800–2816 (2021).
Keller, M. D. et al. Mutation in IRF2BP2 is responsible for a familial form of common variable immunodeficiency disorder. J. Allergy Clin. Immunol. 138, 544–50.e4 (2016).
Brideau, N. J. et al. Independent mechanisms target SMCHD1 to trimethylated histone H3 lysine 9-modified chromatin and the inactive X chromosome. Mol. Cell. Biol. 35, 4053–4068 (2015).
De Paepe, A. Elucidating Regulatory Elements: Studies in Chronic Lymphocytic Leukemia and Multiple Myeloma. PhD thesis, Karolinska Institute (2018).
Beekman, R. et al. The reference epigenome and regulatory chromatin landscape of chronic lymphocytic leukemia. Nat. Med. 24, 868–880 (2018).
Ernst, J. & Kellis, M. Chromatin-state discovery and genome annotation with ChromHMM. Nat. Protoc. 12, 2478–2492 (2017).
Fishilevich, S. et al. GeneHancer: genome-wide integration of enhancers and target genes in GeneCards. Database (Oxford) 2017, bax028 (2017).
Rao, S. S. P. et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680 (2014).
Larrayoz, M. et al. Non-coding NOTCH1 mutations in chronic lymphocytic leukemia; their clinical impact in the UK CLL4 trial. Leukemia 31, 510–514 (2017).
Arthur, S. E. et al. Genome-wide discovery of somatic regulatory variants in diffuse large B-cell lymphoma. Nat. Commun. 9, 4001 (2018).
Fonte, E. et al. Toll-like receptor 9 stimulation can induce IκBζ expression and IgM secretion in chronic lymphocytic leukemia cells. Haematologica 102, 1901–1912 (2017).
Roberts, A. W. et al. Targeting BCL2 with venetoclax in relapsed chronic lymphocytic leukemia. N. Engl. J. Med. 374, 311–322 (2016).
Rose-Zerilli, M. J. J. et al. Longitudinal copy number, whole exome and targeted deep sequencing of ‘good risk’ IGHV-mutated CLL patients with progressive disease. Leukemia 30, 1301–1310 (2016).
Schwessinger, R. et al. DeepC: predicting 3D genome folding using megabase-scale transfer learning. Nat. Methods 17, 1118–1124 (2020).
Ciardullo, C. et al. Low BACH2 expression predicts adverse outcome in chronic lymphocytic leukaemia. Cancers (Basel) 14, 23 (2021).
Alexandrov, L. B. et al. The repertoire of mutational signatures in human cancer. Nature 578, 94–101 (2020).
Rosenquist, R. et al. Immunoglobulin gene sequence analysis in chronic lymphocytic leukemia: updated ERIC recommendations. Leukemia 31, 1477–1481 (2017).
Stanek, D. et al. Prot2HG: a database of protein domains mapped to the human genome. Database (Oxford) 2020, baz161 (2020).
Raudvere, U. et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 47, W191–W198 (2019).
Pettitt, A. R. et al. Lenalidomide, dexamethasone and alemtuzumab or ofatumumab in high-risk chronic lymphocytic leukaemia: final results of the NCRI CLL210 trial. Haematologica 105, 2868–2871 (2020).
Hallek, M. et al. iwCLL guidelines for diagnosis, indications for treatment, response assessment, and supportive management of CLL. Blood 131, 2745–2760 (2018).
Howard, D. R. et al. Results of the randomized phase IIB ARCTIC trial of low-dose rituximab in previously untreated CLL. Leukemia 31, 2416–2425 (2017).
Munir, T. et al. Results of the randomized phase IIB ADMIRE trial of FCR with or without mitoxantrone in previously untreated CLL. Leukemia 31, 2085–2093 (2017).
Collett, L. et al. Assessment of ibrutinib plus rituximab in front-line CLL (FLAIR trial): study protocol for a phase III randomised controlled trial. Trials 18, 387 (2017).
Pohl, A. & Beato, M. bwtool: a tool for bigWig files. Bioinformatics 30, 1618–1619 (2014).
Danecek, P. et al. Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218 (2013).
Smith, J. P. et al. PEPATAC: an optimized pipeline for ATAC-seq data analysis with serial alignments. NAR Genom. Bioinform. 3, lqab101 (2021).
Nadeu, F. et al. IgCaller for reconstructing immunoglobulin gene rearrangements and oncogenic translocations from whole-genome sequencing in lymphoid neoplasms. Nat. Commun. 11, 3390 (2020).
Bystry, V. et al. ARResT/AssignSubsets: a novel application for robust subclassification of chronic lymphocytic leukemia based on B cell receptor IG stereotypy. Bioinformatics 31, 3844–3846 (2015).
McLaren, W. et al. The Ensembl variant effect predictor. Genome Biol. 17, 122 (2016).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Rausch, T. et al. DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, 333–339 (2012).
Layer, R. M., Chiang, C., Quinlan, A. R. & Hall, I. M. LUMPY: a probabilistic framework for structural variant discovery. Genome Biol. 15, R84 (2014).
ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium Pan-cancer analysis of whole genomes. Nature 578, 82–93 (2020).
Klintman, J. et al. Clinical-grade validation of whole genome sequencing reveals robust detection of low-frequency variants and copy number alterations in CLL. Br. J. Haematol. 182, 412–417 (2018).
Roller, E., Ivakhno, S., Lee, S., Royce, T. & Tanner, S. Canvas: versatile and scalable detection of copy number variants. Bioinformatics 32, 2375–2377 (2016).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res. 45, 1–15 (2016).
Feuerbach, L. et al. TelomereHunter - in silico estimation of telomere content and composition from cancer genomes. BMC Bioinform. 20, 272 (2019).
Farmery, J. H. R., Smith, M. L. & Lynch, A. G. Telomerecat: a ploidy-agnostic method for estimating telomere length from whole genome sequencing data. Sci. Rep. 8, 1300 (2018).
Barthel, F. P. et al. Systematic analysis of telomere length and somatic alterations in 31 cancer types. Nat. Genet. 49, 349–357 (2017).
Demanelis, K. et al. Determinants of telomere length across human tissues. Science 369, eaaz6876 (2020).
Cortés-Ciriano, I. et al. Comprehensive analysis of chromothripsis in 2,658 human cancers using whole-genome sequencing. Nat. Genet. 52, 331–341 (2020).
Bergstrom, E. N. et al. SigProfilerMatrixGenerator: a tool for visualizing and exploring patterns of small mutational events. BMC Genomics 20, 685 (2019).
Lê, S., Josse, J. & Husson, F. FactoMineR: an R package for multivariate analysis. J. Stat. Softw. 25, 1–18 (2008).
Simon, N., Friedman, J., Hastie, T. & Tibshirani, R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J. Stat. Softw. 39, 1–13 (2011).
Paatero, P. & Tapper, U. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126 (1994).
Lee, D. D. & Seung, H. S. Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999).
Gaujoux, R. & Seoighe, C. A flexible R package for nonnegative matrix factorization. BMC Bioinf. 11, 367 (2010).
Badea, L. Extracting gene expression profiles common to colon and pancreatic adenocarcinoma using simultaneous nonnegative matrix factorization. Pac. Symp. Biocomput. 2008, 267–278 (2008).
Lee, D. D. & Seung, H. S. Algorithms for non-negative matrix factorization. Presented at: 14th Annual Neural Information Processing Systems Conference (NIPS 2000); November 27–30, 2000; Denver, CO.
Brunet, J. P., Tamayo, P., Golub, T. R. & Mesirov, J. P. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl Acad. Sci. USA 101, 4164–4169 (2004).
Hutchins, L. N., Murphy, S. M., Singh, P. & Graber, J. H. Position-dependent motif characterization using non-negative matrix factorization. Bioinformatics 24, 2684–2690 (2008).
Frigyesi, A. & Höglund, M. Non-negative matrix factorization for the analysis of complex gene expression data: identification of clinically relevant tumor subtypes. Cancer Inform. 6, CIN.S606 (2008).
Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S. & Swanton, C. deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biol. 17, 31 (2016).
Scrucca, L., Fop, M., Murphy, T. B. & Raftery, A. E. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 8, 289–317 (2016).
Wickham, H. et al. Welcome to the Tidyverse. J. Open Source Softw. 4, 1686 (2019).
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2019).
RStudio Team. RStudio: Integrated Development for R (RStudio, 2020).
Skidmore, Z. L. et al. GenVisR: genomic visualizations in R. Bioinformatics 32, 3012–3014 (2016).
Kent, W. J. et al. The Human Genome Browser at UCSC. Genome Res. 12, 996–1006 (2002).
Patient material was obtained from the UK CLL Biobank, University of Liverpool, which is funded by Blood Cancer UK. This work was supported by the Genomics England Research Consortium and the CLL pilot consortium (a full list of individual consortium authors is provided in the Supplementary Material). This research was made possible through access to the data and findings generated by the 100,000 Genomes Project. The 100,000 Genomes Project is managed by Genomics England Limited (a wholly owned company of the Department of Health and Social Care). The 100,000 Genomes Project is funded by the National Institute for Health Research and NHS England. The Wellcome Trust, Cancer Research UK and the Medical Research Council also funded research infrastructure. The 100,000 Genomes Project uses data provided by patients and collected by the National Health Service as part of their care and support. This work was supported by the National Institute for Health Research Oxford Biomedical Research Centre (A.S., D.V. and K.R.). The views expressed in this publication are those of the authors and not necessarily those of the Department of Health. The work of P.R. was supported by the Japan Society for the Promotion of Science Postdoctoral standard program. The work of R.S.H. is supported by Wellcome Trust (214388) and Cancer Research UK (C124388) grants. The work of A.R.P. and S.D. was supported by Blood Cancer UK. A.B. received D.Phil. funding from Health Education England and Genomics England. J.I.M.-S. is funded by the European Research Council under the European Union’s Horizon 2020 research and innovation program (Project BCLLATLAS, grant agreement 810287). J.C.S. is funded by Cancer Research UK (ECRIN-M3 accelerator award C42023/A29370, Southampton Experimental Cancer Medicine Centre grant C24563/A15581, Cancer Research UK Southampton Centre grant C34999/A18087 and program C2750/A23669).
The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.
In the past five years, A.S. has received in-kind contributions from Illumina and Oxford Nanopore Technologies and is a shareholder of Illumina. She is a company director and shareholder of SERENOx Ltd. A.S. has received honoraria from Exact Sciences, Janssen, AstraZeneca, AbbVie and BeiGene, non-restricted research grants from Janssen and AstraZeneca and an educational grant from AbbVie. A.R.P. receives research funding from Celgene/BMS, Gilead, Napp and Roche. N.A. received speaker fees from Gilead. P.A., T.J., U.M., M.R. and D.B. are employees of Illumina, a public company that develops and markets systems for genetic analysis. The remaining authors declare no competing interests.
Peer review information
Nature Genetics thanks Anton Langerak and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
a, Number of samples (y axis) with copy number gains (upper track, red) and losses (lower track, blue) according to the genomic coordinates of the full chromosome from 5’ to 3’ (x axis), for each chromosome (panels). b-c, MutComFocal scores for genes significantly affected by CN losses and mutations (b) and genes significantly affected by CN gains and mutations (c). Genes classified as tier 1 and 2 were selected for further investigation.
a, Number of samples with inversions (y axis) according to the genomic coordinates of the full chromosome from 5’ to 3’ (x axis), for each chromosome (panels). b-c, Distance between all inversion (b) and translocation (c) breakpoints across all 485 samples on each chromosome, highlighting hotspot breakpoints (named kataegis, in red).
a, Number of variants previously reported in the COSMIC database. b, Number of variants for each consequence. c, Distribution of cancer cell fractions binned in four groups: [1-0.75], ]0.75-0.5], ]0.5-0.25], ]0.25-]. The number of variants represented in each boxplot is detailed in Supplementary Table 12. d, Number of variants occurring in protein domains, including two types of protein domains: sites and regions, as defined by Prot2HG39. e, Distribution of cancer cell fractions of recurrent CN gains (red) and CN losses (blue). All boxplots show the minimum and maximum values and interquartile range. The number of CNAs represented in each boxplot is detailed in Supplementary Table 6.
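The four-group binning of cancer cell fractions in panel c can be illustrated with a minimal sketch. This is not the authors' code: the function name `ccf_bin` is hypothetical, the interval notation is read as including the lower value of each bin (so 0.75 falls in the top group), and every value below 0.25 is assumed to fall in the last group.

```python
def ccf_bin(ccf: float) -> str:
    """Assign a cancer cell fraction (0..1) to one of four bins.

    Assumption: each bin includes its lower bound, following the
    bracket notation used in the caption; the last bin holds every
    value below 0.25.
    """
    if not 0.0 <= ccf <= 1.0:
        raise ValueError("CCF must lie between 0 and 1")
    if ccf >= 0.75:
        return "[1-0.75]"
    if ccf >= 0.5:
        return "]0.75-0.5]"
    if ccf >= 0.25:
        return "]0.5-0.25]"
    return "]0.25-]"


# Hypothetical CCF values, used only to show the call.
labels = [ccf_bin(x) for x in (0.9, 0.6, 0.3, 0.1)]
```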
a, Distribution of variant allele fraction calculated from the RNA-seq data of variants detected by WGS. The number of variants represented in each boxplot is detailed in Supplementary Table 13. b, RNA-seq defined transcripts per million (TPM) of selected genes TP53 (n = 7 mutated vs. 66 WT), ATM (n = 6 mutated vs. 67 WT), and SETD2 (n = 7 mutated vs. 68 WT), according to the presence of a genomic alteration (ALT) or absence of alteration (WT). P-values were calculated using a two-sided Welch’s t-test. c, Number of driver genes per sample for CLL samples and paired Richter’s syndrome (RS) samples (n = 16 vs. 16) for all 58 drivers (left panel; p = 2.4 × 10^-3), the 36 known genes (central panel; p = 5.8 × 10^-3) and the 22 candidate genes (right panel; p = 2.1 × 10^-3) (two-sided Welch’s t-test). d, Difference in the number of mutated samples between RS samples and paired CLL samples. Each bar indicates the absolute number of mutated samples per group. e-f, Distribution of cancer cell fractions (CCFs) in frontline samples vs. relapsed/refractory (R/R) samples (cohort 1, unpaired samples) (e), and CLL vs. RS as well as frontline vs. relapsed (cohort 2 and 3, paired samples) (f). Corresponding variants are connected by a dotted line. Panels are ordered based on the direction of evolution: high stable CCFs and increasing CCFs, mixed increasing/decreasing CCFs, low/decreasing CCFs. Figures are not shown if no R/R / T2 sample carried a mutation. * indicates candidate drivers. g, Distribution of cancer cell fractions in frontline vs. relapsed (paired samples). Corresponding variants are connected by a dotted line. Panels are ordered based on the direction of evolution: high stable CCFs and increasing CCFs, mixed increasing/decreasing CCFs, low/decreasing CCFs. All boxplots show the minimum and maximum values and interquartile range and all datapoints are represented.
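For readers unfamiliar with the Welch's t-test used in panels b and c, the following minimal sketch computes the test statistic and the Welch-Satterthwaite degrees of freedom for two groups with unequal variances. The function name `welch_t` and the input values are hypothetical and not taken from the study; the p-value itself requires the t-distribution and is not computed here.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom).

    Uses the sample variance (n - 1 denominator) of each group and
    does not assume equal variances.
    """
    va = variance(a) / len(a)  # squared standard error of group a
    vb = variance(b) / len(b)  # squared standard error of group b
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical expression values for two groups, for illustration only.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

In practice, a library routine such as SciPy's `scipy.stats.ttest_ind` with `equal_var=False` returns the statistic together with a two-sided p-value.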
a, Genomic map of non-coding drivers per chromosome (panels) according to the genomic coordinates of the full chromosome from 5’ to 3’ (x axis). Methods of detection were discovery algorithms (disc. algorithms) and mutational hotspot analysis (mut. hotspots). b, Number of mutations in non-coding genomic elements (top panel) and proportion of variants with signature attributed to AID, APOBEC or other processes (bottom panel) for regulatory elements exclusively active in samples with u-IGHV. c, Distribution (showing the minimum and maximum values and interquartile range) of CCFs in significantly mutated UTRs. The number of variants represented in each boxplot is detailed in Supplementary Table 17.
a, Prioritization of significantly mutated promoters using DeepHaem prediction results. BACH2 showed the highest number of loss-of-function variants compared to the total number of variants in its promoter. b, Allelic skew of mutations in the BACH2 promoter (n = 3). c, Distribution of allelic skew in ATAC-seq and RNA-seq data compared to WGS data in all variants located in significantly mutated promoters. The RNA enrichment was calculated as the difference between expression in the mutated sample and the mean expression across the cohort. Variants in promoters of interest are located in the top right corner (increase of allelic fraction of mutant in ATAC-seq and RNA-seq) and bottom left corner (decrease of allelic fraction of mutant in ATAC-seq and RNA-seq). d, Fraction of mutant and WT reads for one ATAD1 promoter variant showing allelic skew in ATAC-seq and RNA-seq compared to WGS. e, Allelic skew of mutations in the ATAD1 promoter (n = 2). All boxplots show the minimum and maximum values and interquartile range. f, Gene expression of ATAD1 in transcripts per million (TPM) determined by RNA-seq in samples with BCL2 5’UTR mutations vs. WT. Black dots mark outliers. P-value was derived from a two-sided t-test.
a-b, Kaplan-Meier curves according to the number of drivers, defined as (a) number of mutated coding drivers (SNVs/indels), and (b) number of mutated coding drivers (SNVs/indels + CNAs). High denotes > median and low denotes ≤ median. P-values were derived from a log-rank test. Shading denotes confidence intervals and dotted lines denote median survival for each group.
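The median split and the Kaplan-Meier estimate described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `split_by_median` and `kaplan_meier` and all input values are hypothetical, and neither the log-rank test nor the confidence intervals are computed.

```python
from statistics import median


def split_by_median(counts):
    """Label each sample 'high' (> median) or 'low' (<= median)."""
    m = median(counts)
    return ["high" if c > m else "low" for c in counts]


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each observed event time.

    events: 1 if the event was observed at that time, 0 if censored.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Events and total departures (events + censored) at time t.
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical driver counts and follow-up data, for illustration only.
groups = split_by_median([0, 1, 2, 3, 4])
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Each group produced by the split would be passed to the estimator separately to obtain one curve per group.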
a, Pearson correlation between telomere content (assessed by Telomere Hunter v1.1.041) and telomere length (assessed by Telomerecat v1.042). b, Comparison of telomere content distribution (showing the minimum and maximum values and interquartile range) (Telomere Hunter) between normal samples and CLL samples. Lines link matched tumor-normal datapoints. Significance was assessed using a two-sided paired Wilcoxon test (p < 0.001; n = 485 tumor samples vs. 485 germline samples). c-g, Kaplan-Meier curves of genomic features with lowest false discovery rate (FDR) when tested against progression-free survival (PFS) using a Cox proportional-hazards model (univariate analysis). P-values were derived from a log-rank test. Shading denotes 95% confidence intervals (additional data in Supplementary Table 15).
a, Multiple correspondence analysis (MCA) plot showing the top two dimensions: dimension 1 (44% of variance) and dimension 2 (21% of variance). Each datapoint represents one sample. b, MCA variable representation according to the top two dimensions. Variables are represented as binary: “no” is “absence” and “yes” is “presence”. c, Number of samples and description of the 8 groups defining genomic complexity (GC): each group is presented with a different color (matching other figures). Dark grey squares denote the presence of the alteration and white squares denote the absence of alteration. d, MCA plot showing the 8 GC groups. Color keys are provided in c. e, Enrichment of genomic features for each group according to Fisher’s exact test for count data (FDR < 0.05 and estimate > 1). CNA: recurrent copy number alterations; coding: coding drivers; Global: genome-wide global lesions. f, Proportion of samples reported with GC (the conventional definition of ≥ 4 CNAs) in each GC group. g, Enrichment of GC groups for stereotyped subsets. No significant results were found using a two-sided Fisher’s exact test. Group #5 was not tested as sample size was lower than threshold.
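The enrichment testing in panels e and g relies on Fisher's exact test for a 2×2 table (feature present/absent by group membership). The sketch below is a minimal illustration, not the authors' implementation: the function name `fisher_exact_2x2` and the counts are hypothetical, the sample odds ratio stands in for the "estimate" (statistical packages may report a different estimator), and the FDR correction across features is omitted.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value). The p-value sums the
    hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(p_value, 1.0)


# Hypothetical counts: feature present/absent inside vs. outside a group.
odds, p = fisher_exact_2x2(3, 1, 1, 3)
```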
a-b, Kaplan-Meier curves for (a) progression-free survival (PFS) and (b) overall survival (OS) of all 8 genomic complexity (GC) groups. c-d, Kaplan-Meier curves for (c) PFS and (d) OS of all 8 GC groups independently from TP53 alterations. Group 5 is not presented due to small sample size. Groups were further combined to increase power, by focusing on presence / absence of CN gains and losses and omitting trisomy data. P-values were derived from a log-rank test. Shading denotes confidence intervals. e, Penalised Cox regression performed on 186 genomic features identified independent predictors of progression-free survival. Each panel shows the hazard ratios for those genomic features that jointly minimise the out-of-sample prediction error (estimated through leave-one-out cross-validation). The full list and details of each genomic feature are presented in Supplementary Table 21.
Cite this article
Robbe, P., Ridout, K.E., Vavoulis, D.V. et al. Whole-genome sequencing of chronic lymphocytic leukemia identifies subgroups with distinct biological and clinical features. Nat Genet 54, 1675–1689 (2022). https://doi.org/10.1038/s41588-022-01211-y