Whole-genome sequencing of chronic lymphocytic leukemia identifies subgroups with distinct biological and clinical features

The value of genome-wide over targeted driver analyses for predicting clinical outcomes of cancer patients is debated. Here, we report the whole-genome sequencing of 485 chronic lymphocytic leukemia patients enrolled in clinical trials as part of the United Kingdom’s 100,000 Genomes Project. We identify an extended catalog of recurrent coding and noncoding genetic mutations that represents a source for future studies and provide the most complete high-resolution map of structural variants, copy number changes and global genome features including telomere length, mutational signatures and genomic complexity. We demonstrate the relationship of these features with clinical outcome and show that integration of 186 distinct recurrent genomic alterations defines five genomic subgroups that associate with response to therapy, refining conventional outcome prediction. While requiring independent validation, our findings highlight the potential of whole-genome sequencing to inform future risk stratification in chronic lymphocytic leukemia.


Robbe et al. Whole-genome sequencing of chronic lymphocytic leukemia identifies subgroups with distinct biological and clinical features Supplementary Note
Cancer cell fraction calculation For each high-confidence CNA, the appropriately filtered, debiased and normalised median coverage of the genomic region of the CNA (as reported by Canvas) is modelled using the following expression: where CCF is the cancer cell fraction for the CNA, is the estimated copy number, is the copy number of the same genomic region in the normal tissue (here taken equal to 2 in all cases), and is the tumour fraction in the sample. From this, it follows that an estimate ̃ of the CCF for each CNA can be calculated as the ratio ̃= ( − ) ( • ( − )) ⁄ . For samples with more than 10 copy number aberrations, we further applied a clustering step, which is typically conducted in clonal deconvolution analyses. This is because CCF values tend to aggregate in groups, which reflects the fact that tumours typically consist of homogeneous cell populations, known as clones. Specifically, we used mclust 1 v5.4.9 (Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation) to group the logit-transformed CCF values calculated in the previous step in an optimal number of clusters in each sample. For each CNA in a particular sample, the final estimate of the cancer cell fraction was calculated using the expression = (1 + − ) −1 , where = ∑ =1 ̅ . In the previous expression, is the optimal number of estimated clusters, is the probability that the CNA comes from cluster , and ̅ is the centre of cluster , as reported by mclust.
For each mutation, the following relation holds between the variant allele fraction (VAF) and the cancer cell fraction (CCF) values (see Vavoulis et al. 2021 2 for a detailed derivation from first principles and references therein): where is the tumour fraction in the sample, is the total copy number at the locus of the SNV, is the minor allele count at the same locus, and is the copy number at the same locus in the normal tissue (here taken equal to 2 in all cases). Based on this expression, rough CCF estimates for each SNV were calculated in each sample as ̃= 2 • /( • − • ( − 2) • ). In a second stage, the logit-transformed values of ̃ were clustered in each sample using mclust and final estimates were calculated using the same approach as for the copy number aberrations (see previous paragraph).

Coding variant annotations
Variants were first annotated against protein-coding sequences, non-coding gene sequences as well as splice sites and UTRs defined as per Ensembl VEP GRCh38 release 89.4. 5bp padding was added to account for splice variants at intro-exon boundaries. UTRs overlapping with protein-coding sequences were subtracted. Variants in exons of coding-genes (except Ig loci, and those mapping to chrY and chrM) were considered functional if they were predicted by Ensembl VEP to lead to aberrant splicing, protein truncation or deleterious amino acid substitution.

Non-coding variant annotations
Variants outside coding regions were annotated against non-coding regulatory elements (RE) datasets (Fig. 4a). All promoter regions were defined as +/-250bp of GENCODE v29 transcription start sites. In addition, these REs were further complemented using previous available ChIP-seq data and chromatin state segmentation of primary CLL samples 3 . More specifically, these REs were defined by first intersecting previously identified H3K27ac peaks and open chromatin regions defined by ATAC-seq (derived from 104 and 106 primary CLL cases, respectively) 3 . To consider potential differences related to IGHV status, we performed a differential analysis between m-IGHV and u-IGHV CLL cases using H3K27ac and ATAC-seq peaks in 39 u-IGHV and 63 m-IGHV samples for H3K27ac ChIP-seq data, and 38 u-IGHV and 66 m-IGHV cases for ATAC-seq data. These differential analyses allowed us to precisely classify active/open chromatin regions in (1) significantly higher in u-IGHV CLL, (2) significantly higher in m-IGHV CLL and (3) no significantly different between u-and m-IGHV CLL, and therefore named "CLL".
Mutations were then annotated using this classification allowing us to identify potential enrichments in particular IGHV subgroups.
All non-coding variants were annotated with the chmm of the region they occurred in according to five grouped functional states: Variants overlapping with the chromatin states annotated as promoters and enhancers without significant genetic imprints of AID-mediated somatic hypermutation were considered for all downstream analyses. Previous work has established a cut-off of 35% of variants caused by off-target AID activity 5 in REs as highly enriched in AID signature, and thus was used in our analyses (Supplementary table 17). C>T/G mutations in WRCY motifs and their reverse complement (canonical AID) were considered to be AID-linked. The remaining variants not overlapping with protein-coding sequences, non-coding gene sequences, splice sites, UTRs, promoters or enhancers were annotated as non-regulatory regions.
Target genes of regulatory elements To link REs to their potential target genes, we first selected the regions previously classified as promoters (including WkProm, ActProm, StrEnh1 chmm) and enhancers (including WkEnh, StrEnh2 chmm) in at least 2/7 of the previous CLL samples. Next, these regions were intersected with the previous ChIP-seq H3K27ac data showing at least 1 peak in all the CLL samples analysed (n=104). This intersection ensures to obtain H3K27ac peaks associated with REs. Afterwards, we calculated correlation coefficients between H3K27ac levels of all the previously identified peaks related to RE with the gene expression levels of all the surrounding genes. The surrounding genes were defined according to the gene coordinates with a 2kb upstream extension with respect their start and being in the same TAD as the interrogated RE (TADs were defined by Hi-C in GM12878 6 ). This strategy allowed us to identify both proximal and distal associations between regulatory elements and their target genes. To perform the correlations, we used 74 previously available CLL samples with concomitant H3K27ac and RNA-seq 3 . For each defined RE, we calculated the Pearson correlation coefficients between H3K27ac levels and gene expression levels for all possible genes with the aforementioned restrictions. Potential target genes were considered to be those showing correlations of r≥0.3, FDR≤0.05 (FDR correction per each RE) with the RE. Correlations were only performed with protein coding and lncRNA genes defined by Ensembl Genes version 105. Annotation was retrieved using the biomaRt R package version 2.50. Preferentially, we have chosen the annotations from this strategy (named CLL dataset 1).
In the case where no target gene was identified using the previous strategy, we assigned the promoter class to a target gene based on the nearest TSS using the GENCODE database (for 7/72 promoters). For the enhancer class, we used additional publicly available datasets derived from CLL primary samples. This was the case of 16 out of 126 enhancers. Firstly, three-dimensional chromatin interactions publicly available for one CLL sample was downloaded (EGA, under accession number EGAD00001004046). We used this data to annotate genes found in the same TAD as the RE (CLL dataset 2, used for 2 enhancers). Secondly, we used a dataset containing target genes of enhancers predicted using correlation of RNAseq expression data and H3K27ac data and ATAC-seq from 12 matched CLL samples previously defined 7 (CLL dataset 3, used for 6 enhancers). Briefly, we required a minimum of 10 reads in at least 1 sample for transcripts detected by RNA-seq to be considered. Within each TAD defined using HOMER 8 from Hi-C data of the GM12878 cell line [GEO accession: GSM1181867 6 ], we performed a peak-transcript correlation (Spearman correlation). All correlations > 0.3 were selected. If the distance between the H3K27ac peak centre and the nearest TSS was <2kb the peak was classified as a promoter, if not, it was classified as an enhancer. Annotations for each enhancer-target are specified in Supplementary table 17. After using these three strategies to annotate enhancers to target genes, eight enhancers did not have any predicted target genes. These were left without CLL specific annotations and were intersected with the Double Elite enhancers reported in the public database GeneHancer 9 .
In addition, all enhancers reported as significantly mutated were further annotated as super enhancers using previously published CLL-specific super enhancers datasets obtained from chromatin accessibility and H3K27ac ChIP-seq data using 18 CLL samples 10 , available in Supplementary table 17. Identification of coding driver genes Two methodologies were used to identify coding driver genes. The first methodology included only SNVs and indels and combined results of four discovery algorithms, according to a "genes detected in at least two out of four" approach (Discovery method 1A) and a statistical approach deriving one pvalue from the four p-values (Discovery method 1B). The second methodology included CNAs in addition to SNVs/indels (Discovery method 2).
For discovery method 1A, the four discovery algorithms were MutSigCV v1.41 11 , OncodriveFML v2.2.0 12 , OncodriveCLUSTL v1.1.1 13 and dNdScv 14 . MutSigCV was used to detect drivers on the basis of recurrence, after adjusting for factors such as replication timing, gene expression and chromatin accessibility. A coverage file across protein-coding regions for the genome assembly GRCh38 was constructed using CovGen v1.0.2 (https://github.com/tgen/CovGen) and SNPeff v4.3 15 assuming full coverage across all protein-coding regions. Gene symbols used in the input MAF, coverage file and default gene covariates file were queried against gene name synonyms in HGNC to maximise unambiguous matching between input files. OncodriveFML was used to detect drivers enriched for mutations with a high functional impact. Default parameter settings were used. Genome-wide mutation frequencies were used to model the background mutation rate. Functional impact predictions were obtained from CADD v1.5 16 . OncodriveCLUSTL was used to detect drivers containing clusters of mutations. Default parameter settings were used. The same normalised mutation frequencies were used as for OncodriveFML. dNdSCV was used to detect drivers with significant evidence of positive selection. Default parameter settings were used. Each method derived a q-value per gene, and genes were selected as drivers when q < 0.001 for at least two out of four algorithms.
For discovery method 1B, p-values calculated for each gene from these four methods were combined as a single p-value using two methods. First, a Z-transformed ("Stouffer") weighted test of p-values using the R package "survcomp". For each gene test per-study weights were applied according to the median ranking of driver test P-values in a set of known CLL driver genes (e.g. a driver method where the median ranking is much higher for CLL genes compared to all genes tested would be weighted proportionally greater than a method where there is little difference in the median rankings of CLL genes and all genes). Second, a weighted harmonic mean p-value was used. FDR q-values of combined results were estimated from the R package "qvalue" at a significance level of q < 0.01. Any genes found significant by at least one of these two methods (1A and 1B) were selected as putative CLL driver genes.
For discovery method 2, first we identified minimally overlapping regions altered by CNAs, defined as the smallest genomic regions where CNAs of four samples overlap. 74 regions were defined by this method. Second, all genes falling within these loci were analysed further to select the most probable drivers of each region. MutComFocal v1.0 17 was used to rank the genes in each region, according to the following approach: 1. We ranked the genes in each region, calculating focality and recurrence scores based on SNV, indel and CNA data. The ranked genes were categorised into tiers based on the entropy H of the posterior distribution of the scores. Genes ranked in the top 2 tiers were selected (in our dataset, this was defined as H (P) > 2.48 x 10 -4 for "CN gain + mutation", and > 1.01 x 10 -4 for "CN loss + mutation).

2.
We refined the list of selected genes to keep only genes (a) affected by CN gains if they are suspected or proved oncogene from the literature, or (b) affected by CN loss if they are suspected or proved TSG from the literature. 3. Among these, we finally chose only genes with acquired SNVs/indels in at least 1% of the cohorts (at least 5 samples). The genes that did not satisfy this criterion were not classified as "candidate drivers'', but instead are listed in Supplementary table 9 in the "Permissive list of genes". We provide this gene list with combined recurrence and focality scores as calculated by MutComFocal as a resource for the community for future research on CLL.
Finally, coding drivers and other findings were annotated as "candidate or previously unrecognised" if they were reported in fewer than 1% of samples in previous large-scale CLL studies 18,19 , and not found in literature searches to have been associated with CLL. In addition to annotations from VEP, we also annotated the variant with the Prot2HG database to determine if a variant was within a protein domain, annotated as in a "site" (short sequence with known function) and as in a "region" (larger regions of about 100 amino acids associated with a function of the protein; more details in the original publication 20 ).

Identification of non-coding driver mutations
We considered non-coding SNVs and indels falling in REs as described above and in Fig. 4a, excluding all regions in immunoglobulin genes and other regions previously described as false-positives due to sequencing or mapping artefacts 5 . Two independent approaches were used to identify non-coding hits in the CLL dataset considering both mutational burden and functional impact (Fig. 4b).
Firstly, we used OncodriveFML v2.2.0 and OncodriveCLUSTL v1.1.1 with default parameters to identify REs enriched for functional mutations or linear mutation hotspots respectively. P values were corrected for multiple testing using the Benjamini-Hochberg method. Genomic elements with q-values < 0.1 in at least one out of two methods and with at least two samples mutated were retained as non-coding hits. All data are reported in Supplementary table 17. Secondly, we screened for the presence of single-site hotspots (i.e. same nucleotide mutated), and regions with high mutational density/kataegis. Regions of kataegis were identified as previously described 11 . Briefly, mutations from all tumour samples were combined and scanned for six mutations with less than two standard deviations below the median of intermutational distance of SNVs/indels, or at least 6 variants with <1 kb intermutational distance. Regions were tested for an excess of mutations using a methodology adapted from previous work 21 . P values were calculated per hotspot using the binomial distribution, where the raw data (q) were the number of mutations in a region, the sample size (n) was the length of the region and the probability (p) was calculated separately to produce a local and global estimation of the expected mutation rate per bp. Local mutation rate was calculated as the total number of mutations within 10kb either side of the hotspot region divided by the length of that region. Any bases overlapping with exonic regions were excluded from the counts. Global values were calculated as the total number of mutations per feature, i.e., UTRs, promoters (i.e., ActProm, WkProm, StrongEnhancer2) or enhancers (i.e., StrngEnhancer1, WkEhnacner chmm) divided by the sum total length of that feature. The observed number of mutations (q) was assumed to follow a binomial distribution, binomial (n, pi) for region i, with null hypothesis that the region was not excessively mutated. P values were corrected for multiple testing using the Benjamini-Hochberg method. Significance was considered where both the global and the local test were below a significance threshold FDR < 0.1. Single-site hotspots were defined as kataegis regions of length 1 bp, and were excluded from statistical calculations.
The 126 REs were annotated with the proportion of mutations attributed to off-target AID and APOBEC activities, which are known process described in CLL 5,22 . To do so, the proportion of mutations attributed to AID (C>T/G at WRCY, and the reverse, G>A/C at RGYW) and APOBEC (C>T/G at TCW, G>A/C WGA) activities were annotated for each genomic element reported as a non-coding hit. Previous work has established a cut-off of 35% of variants caused by off-target AID activity 5 in REs as highly enriched in AID signature, and thus was used in our analyses (Supplementary table 17). In addition, predicted target genes of these 126 REs (see above) were included in a non-ordered gene set enrichment analysis using the R package gprofiler2 v0.2.0 23 using the Gene Ontology (BP: biological process) and Human phenotype ontology (HPO) databases available within the package.
We used DeepHaem 24 to predict the effects of the non-coding mutations in the 72 promoters with open chromatin using 73 immune cell types with open chromatin assay data. Briefly, the model looks at 1 kb DNA sequence and calculates a probability of a DNA sequence belonging to an open chromatin site in each cell type. We use an empirical threshold to classify anything above a 20 % probability of being at least a weak open chromatin site. To assess the effect of each mutation, we predicted the probability score with and without the variant in the DNA sequence and calculated the damage by subtracting the variant probability score to the reference probability score. A positive score meant the variant reduced the chromatin accessibility, such as in the case of a TF binding site loss (loss of function). A negative damage score meant the variant increased chromatin accessibility, for example by creating a previously unrecognised transcription factor binding site. Empirical thresholds were set as an absolute damage above 0.1 and 0.05 to select candidate variants with varying levels of stringency (changes above 10 and 5%, respectively).
Patients' stratification using non-negative matrix factorisation Data were converted to a 0/1 matrix using either presence or absence of a feature, or above or below the mean, in the case of continuous variables, such as the presence of a mutation in a gene or pathway, and other not, such as telomere lengths. Although Binomial Matrix Factorization (BMF) methods exist, the NMF has been found to have a similar feature outcome capability to the BMF 25 . Further, the offset method 26,27 of the NMF takes the additional step of removing common signatures between groups, making it particularity useful for identification of uniquely group defining features. The number of clusters in each group to be used by the NMF was determined using a combination of rank estimation methods implemented by the NMF software [28][29][30] including the cophenetic correlation coefficient (See figures below).  NFKBIE  PTPN11  NFKBIA  ACSL6  MTOR  PRKAG3  PRKCQ  STAT3  TRAF2   ADIPOCYTOKINE  SIGNALING PATHWAY   NFKBIE  KRAS  NFKBIA  NFATC1  PRKCB  CD22  JUN  SOS1   B CELL RECEPT OR  SIGNALING PATHWAY   PRKCB  PLCB1  RYR2  ADRA1A  ADRB2  CACNA1D CACNA1E GRIN2C  GRIN2D  ITPKB  P2RX6  PDE1A  PDE1C  PHKA1  PLCZ1  PRKCA   CALCIUM SIGNALING  PATHWAY   BIRC3  ATM  TP53  MYD88  CREBBP  CCND2  BCL2  CCND1  NFKBIA  CDKN1B  ATR  CDC25C  CCND3  CDK6  CHEK2  CDKN1C  PCNA  PRKDC  RB1  SKP2  WNT the type of genes affected: coding mutation in coding driver gene (red), coding mutation with high functional impact in other genes not reported as CLL driver (green), non-coding mutation in regulatory element of a gene (blue).