Novel Mutation Hotspots within Non-Coding Regulatory Regions of the Chronic Lymphocytic Leukemia Genome

Mutations in non-coding DNA regions are increasingly recognized as cancer drivers. These mutations can modify gene expression in cis or by inducing high-order chormatin structure modifications with long-range effects. Previous analysis reported the detection of recurrent and functional non-coding DNA mutations in the chronic lymphocytic leukemia (CLL) genome, such as those in the 3′ untranslated region of NOTCH1 and in the PAX5 super-enhancer. In this report, we used whole genome sequencing data produced by the International Cancer Genome Consortium in order to analyze regions with previously reported regulatory activity. This approach enabled the identification of numerous recurrently mutated regions that were frequently positioned in the proximity of genes involved in immune and oncogenic pathways. By correlating these mutations with expression of their nearest genes, we detected significant transcriptional changes in genes such as PHF2 and S1PR2. More research is needed to clarify the function of these mutations in CLL, particularly those found in intergenic regions.

A major part of mutations in the cancer genome occur in non-coding DNA regions, and their function is still beginning to be understood 1 . Non-coding DNA comprises approximately 98% of the human genome, but recent research has proven that most of these regions are either part of regulatory motifs or actively transcribed to RNA 2,3 . These mutations can induce functional genomic changes by altering the binding of transcription factors or by inducing high-order chromatin structural modifications 2,4 . For example, mutations in 5′ and 3′ untranslated regions (UTRs) may disturb RNA structural conformation, modify microRNA binding sites or disrupt polyadenylation signals 2 . In a similar fashion, mutations affecting non-protein coding genes such as microRNA and long intergenic RNA genes (lincRNAs) are known cancer driver events 2,5 . Different studies have evidenced that the expression of genes such as BRCA1, CDH10, CCND1, MALAT1, PAX5, RB1, SDHD, TERT, TOX3, and TAL1 is influenced by non-coding DNA mutations in regulatory regions of the cancer genome 1,6,7 . The Pancancer Analysis of Whole Genomes (PCAWG) project has revealed the existence of common and tumor-specific recurrently mutated functional elements near known cancer drivers 7 . Some of these driver mutations can induce long-range changes in genome organization and trigger abnormal expression of distant oncogenes and tumor suppressors 8 . Furthermore, the sequence distribution of these driver mutations is not random. Hornshøj et al. (2018) identified a significant enrichment in conserved CCCT-binding factor (CTCF) binding sites among recurrently mutated non-coding DNA regions with cancer specificity 6 . Similarly, Line et al. (2019) identified 21 recurrently altered CTCF-rich insulator regions in the cancer genome, and elegantly demonstrated that some of these mutations drive tumor proliferation 9 .
Chronic Lymphocytic Leukemia (CLL) is among the most frequent lymphoproliferative disorders, and it is characterized by its remarkable clinical heterogeneity. Recent efforts by Puente et al. 10 enabled the discovery of 24 recurrently mutated non-coding genomic regions in the CLL genome, some of which are associated with functional changes such as mutations in the 3′UTR of NOTCH1 and in the PAX5 super-enhancer. Nevertheless, both the sparsity of annotations in non-coding DNA regions and the difficult functional classification of non-coding DNA mutations hinder a better understanding of the non-coding cancer genome, which probably harbors multiple deregulated elements yet to discover. In this analysis, we analyzed whole genome sequencing (WGS) data using a best-practice mutation detection pipeline. Then, we identified signals of positive selection of mutations in regulatory regions. Finally, our last attempt was to analyze if any of these recurrent mutations in non-coding DNA regions were associated with abnormal expression of the nearest gene. Our results point toward the existence of dozens of mutation-enriched regulatory regions near cancer and immune-related genes, some of which influence local gene expression.

Methods
Data origin. Whole genome sequencing files produced by the International Cancer Genome Consortium 11 were obtained from the European Genome-Phenome Archive under accession code EGAD00001001466. Gene expression from microarray data of the same set of patients was obtained from EGAD00010000875. Data analysis. 130 tumor-normal matched CLL whole genomes were processed using the bcbio-nextgen pipeline, which provides best practices for analyzing high throughput sequencing data 12 . Low complexity regions, areas with abnormally high coverage, sequences with single nucleotide stretches >50 bp and loci with alternative or unplaced contigs in the reference genome were not analyzed. Some polymorphic regions are prone to be classified as highly mutated due to artifacts or biases in the sequencing process, and suspicious elements were manually removed from downstream analysis. Single nucleotide and indel mutation detection was performed with vardict 13 , varscan 14 , mutect2 15 and freebayes 16 using default bcbio-nextgen parameters. Only variants with a minimum sequencing depth (DP) of 10 and a genotype quality (GQ) above 20 Phred in both tumor and normal samples were analyzed. A mutation was reported when detected by at least two different mutation callers. Mutations were annotated to the 1000G 17 , gnomAD 18 and ExAc 19 databases in order to filter likely germline variants. All mutations with a minimum allele frequency >0.001 in any population were discarded from the analysis.

Region annotation.
Annotations corresponding to promoter regions, 5′UTR, 3′UTR and lincRNAs were retrieved from Genecode version 18 20 . DNAse hypersensitivity (DHS) regions and Transcription Factor Binding Sites (TFBS) tracks from the ENCODE 21 project were obtained from Lochovsky et al. 22 . Similarly, we used enhancer regions from the GeneHancer database 23 , and analyzed those that were supported by two or more sources of evidence ("elite" enhancers). Regulatory regions within telomeric and centromeric positions were discarded.
Two different methods were used to identify areas with evidence of positive selection of mutations: LARVA 22 and OncodriveFML 24 . LARVA models the mutation counts of each target region as a β-binomial distribution in order to handle overdispersion. Furthermore, LARVA also includes replication timing information in order to estimate local mutation rate, and provides a β-binomial distribution adjusted for replication timing which is used to compute p-values. On the other hand, OncodriveFML is designed to analyze the pattern of somatic mutations across tumors in both coding and non-coding genomic regions. OncodriveFML uses functional predictions in order to identify signals of positive selection. OncodriveFML was run with CADD v1.3 scores and default parameters. TFBS tracks were not analyzed with OncodriveFML due to high computational demands. Regions were labeled as significantly mutated if the q-value was <0.05 with any of the two methods.

Gene expression analysis and association with recurrent non-coding DNA mutations.
Background correction, normalization and log2-transformation of microarray gene expression data was performed with the RMA algorithm 25 . In the case of genes targeted by multiple probes, the median expression was calculated. The Wilcoxon-Rank sum test was used to detect changes in gene expression between mutated and wild-type cases. Non-coding regulatory genomic regions cannot be directly ascribed to any gene, and they can affect the transcription of virtually any part of the genome. However, this study is underpowered to detect long-range interactions due to small sample size and the need of extreme p-values passing multiple-testing correction. Therefore, we centered our efforts on changes in expression of the nearest gene. We annotated the closest gene to each recurrently mutated non-coding genomic region as the nearest transcription start site to the middle position of the corresponding region. In the case of multiple overlapping regulatory regions, we selected the most significant one for downstream analysis. P-values were adjusted for multiple testing using the FDR method, with a significance threshold of 0.05.

Regions significantly enriched in mutations.
LARVA detected significant mutation enrichments (q-value < 0.05) in 120 TFBS, 16 DHS regions, 10 enhancers, 4 promoters, 2 5′UTRs and 1 lincRNA (Table 1,  Supplementary Tables 1-6). No relevant inflation in p-value distribution was observed. (Supplementary Fig. 1). These regions were located in 44 different genomic loci (Fig. 1). The most recurrently mutated promoters were those of TCL1A (q-value 3. Mutations associated with changes in gene expression. We studied the association of regions enriched in mutations with changes in the expression of their respective nearest genes. Although this type of analysis is limited by low sample size, we detected significant associations in some cases. We tested if patients with at least one mutation in these regulatory regions were accompanied by changes in expression of the nearest gene.  Table 9).

Discussion
Mutations in the non-coding part of the genome constitute the "dark-matter" of cancer genomics 2 . Growing evidence indicates that many of these mutations occur in conserved motifs and loci under epigenetic control, and some of these play fundamental roles in cancer biology and disease prognosis 1-3,6-9 . Using WGS data produced by the ICGC, we identified dozens of recurrently mutated regulatory regions in the CLL genome. Among these, 10 were previously reported by the original analysis performed by Puente et al. 10 , namely those near BACH2, BLC2, BCL6, BCL7A, BIRC3, S1PR2, PCDH15, ZCCHC7/PAX5 and ZFP36L1. Numerous novel regions were also enriched in non-coding DNA mutations, including transcription factor binding sites, DNAse hypersensitivity regions, 5′UTR regions, promoters, enhancers and non-coding RNAs. These events were frequently found in the vicinity of genes previously vinculated with oncogenic pathways. Indeed, the most significantly mutated regions were a SETB1 binding site within the first intron of DAP3, a GTP-binding protein that participates in the apoptosis pathway 26 ; and a DNAse hypersensitivity region downstream to ING2, a well-characterized tumor suppressor 27 . Other highly mutated regulatory regions affected cancer-related genes such as DACT2 28 40 ; and others were detected near genes involved in immunity, such as LTB 41 and MADCAM1 42 . Overall, only three of the novel genes (LTB, MALAT1 and ST6GAL1) were previously defined as targets of somatic hypermutation in B cell lymphomas 43 . Finally, it is worthwhile to mention that recurrent and even highly significant enrichments were detected around barely characterized genes (e.g. C21ORF89/LINC0334) and intergenic regions. www.nature.com/scientificreports www.nature.com/scientificreports/ The reported mutations can either be bystander or have functional implications related to their potential to modify gene expression or to induce high-order chromatin structural changes. Although limited by low sample size, we devised significant changes in the expression of PHF2, S1PR2 and RPL39L. These three genes are involved in the regulation of important oncogenic processes. PHF2 encodes a histone demethylase with tumor suppressor activity 35 . S1PR2 participates in the TGF-β pathway and acts as a tumor suppressor of B cell lymphomas 44 . Finally, RPL39L 45 is involved in cancer stem cell self-renewal and hypoxia response. These results are concordant with other reports of non-coding regulatory mutations driving gene expression changes in B-cell lymphomas [46][47][48] .
The combination of an optimized mutation detection pipeline with statistical tests specifically designed to handle non-coding DNA mutations has enabled the detection of novel putative regulatory driver regions in the CLL genome. These regions were mostly located in the vicinity of genes implicated in oncogenic and immune pathways, although several recurrently mutated intergenic regions were detected too. Furthermore, we could confirm the association of some of these events with altered expression of their respective genes. We expect that our results, along with those published by other groups, will promote an improved characterization of the non-coding mutational drivers of CLL.

Data availability
There is not data to deposit.