Genome-wide discovery of somatic coding and regulatory variants in Diffuse Large B-cell Lymphoma

Diffuse large B-cell lymphoma (DLBCL) is an aggressive cancer originating from mature B-cells. Many known driver mutations are over-represented in one of its two molecular subgroups, knowledge of which has aided in the development of therapeutics that target these features. The heterogeneity of DLBCL determined through prior genomic analysis suggests an incomplete understanding of its molecular aetiology, with a limited diversity of genetic events having thus far been attributed to the activated B-cell (ABC) subgroup. Through an integrative genomic analysis we uncovered genes and non-coding loci that are commonly mutated in DLBCL including putative regulatory sequences. We implicate recurrent mutations in the 3’UTR of NFKBIZ as a novel mechanism of oncogene deregulation and found small amplifications associated with over-expression of FC-γ receptor genes. These results inform on mechanisms of NF-κB pathway activation in ABC DLBCL and may reveal a high-risk population of patients that might not benefit from standard therapeutics.


27
It has been established that DLBCL, although genetically heterogeneous, can be robustly divided at 28 the gene expression level into two subgroups based on markers of B-cell differentiation and NF-κB 29 activity pathways, the latter being particularly active in ABC cases 1 . EZH2 2 , SGK1, GNA13 and 30 MEF2B 2 exemplify genes with mutations restricted to GCB cases, whereas MYD88 3 , CD79B 4 and 31 CARD11 5 have been reported as more commonly mutated in ABC. Some DLBCL cases have few 32 (if any) genetic alterations strongly associated with either subgroup, suggesting the possibility of 33 additional genetic or epigenetic changes that shape the malignancy. Similarly, the over-expression Although there have now been more than 1000 tumours analysed using targeted strategies 42 such as array-based copy number analysis 9 or whole exome sequencing (WES) 10 , a limited num-43 ber of DLBCL genomes have been described to date [11][12][13] , leaving the potential to uncover new 44 somatic structural variations (SVs), copy number alterations (CNAs) and other cis-acting regu-45 latory mutations that may be cryptic to other assays. The search for driver mutations has been 46 further confounded by aberrant somatic hypermutation (aSHM) affecting a substantial number of 47 genes in DLBCL 14 . Specifically, in DLBCL and several other lymphoid cancers the AID enzyme 48 (encoded by AICDA), in cooperation with POLη, induces mutations in actively transcribed genes 49 with a concentration in the first 1.5-2 kb 15 . Though the repertoire of known aSHM targets in 50 lymphoma continues to grow, it has become apparent that this process can also impact non-genic 51 loci associated with super-enhancers and some aSHM-mediated mutations may have regulatory 52 functions 16, 17 . 53 Whole genome sequencing (WGS) offers the possibility of cataloguing the sites affected by 54 3 this process along with concomitant determination of genes with potential cis-regulatory effects. 55 We analysed WGS data from 153 DLBCL tumour/normal pairs alongside existing WES data from 56 191 additional cases to uncover novel driver genes affected by somatic single nucleotide variants 57 or indels, collectively referred to as simple somatic mutations (SSMs). These affected many of the 58 genes that have been ascribed to DLBCL along with 4,386 regions we identified as enriched for 59 somatic mutations, the majority impacting non-coding loci. Analysis of matched RNA-seq data 60 uncovered recurrent structural alterations and mutated loci with potential roles in mediating the 61 transcriptional or post-transcriptional regulation of numerous genes with relevance to DLBCL.

63
The landscape of somatic CNAs in DLBCL has been addressed by multiple groups 9, 18, 19 but owing 64 to the technologies typically used, the breakpoints that underlie these events and putative copy-65 neutral alterations and smaller focal gains and losses can be missed by array-based approaches 11 . 66 The 153 genomes were analysed for SVs, revealing a total of 13,643 breakpoints (range: 0-390; 67 median 66). We determined the SVs likely to affect specific genes based on their proximity to in-68 dividual genes (Table 1). As expected, the genes with proximal SVs in the highest number of cases 69 were oncogenes relevant to DLBCL including BCL2, BCL6, FOXP1 and MYC. Tumour suppressor 70 genes (TSGs) typically exhibited focal deletions and commonly exhibited SV breakpoints within 71 the gene body, including TP53, CDKN2A, and CD58 (Extended Data Figure 1).

72
By intersecting with regions affected by recurrent copy number losses, we searched for pu-73 4 tative TSGs that might be disrupted by either deletion or SV breakpoints (Table 1). Some of these 74 were separately identified through subsequent analyses (below) revealing patterns of non-silent 75 exonic mutations and/or peaks of non-coding mutations whereas others rarely harboured simple 76 somatic mutations (SSMs) such as SNVs and indels. Many genes impacted by aSHM were also 77 enriched for somatic breakpoints. In contrast, TOX and WWOX harboured a substantial number of SVs were predominantly found in the genomes of ABC DLBCLs whereas hot spot mutations are 87 a feature of GCB. 88 We next searched for concomitant signals of recurrent copy number gain and SVs proximal 89 to genes, restricting our analysis to regions identified as peaks for amplification by GISTIC in 90 a separate large DLBCL cohort (Ennishi et al, unpublished). We utilised RNA-seq-derived ex-91 pression values from a subset of the cases to infer cis effects of these events on expression. This 92 uncovered several genes reported to act as oncogenes through focal amplification, with IKBKE, 93 NFKBIZ, FCGR2A/FCGR2B representing the strongest candidates due to significantly elevated 94 5 expression in cases having either a gain or proximal SV (Figure 1). This also revealed additional 95 known targets of aSHM (Extended Data Figure 2). Some breakpoints were within the gene body, 96 an observation seen in some known oncogenes such as FOXP1 22 . Such events can lead to novel 97 isoforms or fusion transcripts, such as those involving TBL1XR1 23 .

98
The most striking collection of focal gains was those affecting the FCγ receptor locus, a com-99 plex region of the genome comprising multiple paralogs that have arisen through a series of seg-100 mental duplications 24 (Figure 2A-B). These focal gains and, less commonly, deletion events were 101 corroborated by read pairing information in many cases. This observation could be confounded by 102 the presence of germline copy number alterations in this region, thus many of the single copy gains 103 could represent germline events. Using a custom multiplex droplet digital PCR (ddPCR) assay, we 104 confirmed the CNAs and identified four additional examples of amplifications and several addi-105 tional gains not detected by SNP arrays. Amplifications, but not gains, were significantly enriched 106 among GCB cases and had a striking correlation with elevated FCGR2B expression ( Figure 2C).

107
Although the prevalence of this genetic alteration is low, we found a compelling trend towards in-108 ferior outcome in GCB cases with FCGR2B amplification. Taking into account the apparent effect 109 of gains on FCGR2B over-expression and deletions on reduced FCGR2C, we hypothesised that 110 the tumours benefited from a relative increase in FCγRIIB protein relative to FCγRIIC and used 111 the log-ratio of the expression of these two genes to stratify GCB patients (Methods). This simple 112 classifier showed a strong separation of patients on disease-specific survival that was significant as We next sought genes with patterns of non-silent mutations, beginning with a meta-analysis of the 117 genomes and all available exome data for recurrently mutated genes. The genes significantly af-118 fected by SSMs had mostly been identified in prior studies and a large exome study published while 119 this manuscript was being completed 10 (Extended Data: Figure 4, Table 1). Among the genes that

124
Within the 153 genomes, we identified between 1689 and 121,694 SSMs (median: 14,026). 125 We searched genome-wide for patterns of mutation that may imply regulatory function without 126 directly impacting protein sequence. To accomplish this, we implemented a new strategy to infer  also revealed examples of non-coding loci with mutation peaks, for example the two adjacent 136 long noncoding RNA (lncRNA) genes NEAT1 and MALAT1 and the miR-142 locus ( Figure 3C).

137
Mutations at each of these loci have been previously noted in other DLBCL and FL and their 138 pattern is consistent with aSHM 27, 28 .

139
To determine the suitability of our approach to identify mutation clusters relevant to DL-140 BCL, we extended our peak analysis to include all mutations including the coding region (CDS). 141 We found a similar number of peaks (4,405), which comprised the bulk of the original non-coding 142 peaks as well as peaks in genes with mutation hot spots such as EZH2, FOXO1 and MYD88 (Fig-143 ure 3B). Aside from intergenic regions (2,244), the top three classes of annotation affected by 144 peaks were 5 flanking regions, 5 UTRs and introns. These are also the regions typically affected 145 by aSHM and, as expected, virtually all of the known targets of this process were represented 146 among these regions 12, 14 . We also noted multiple genes with mutation patterns consistent with 147 aSHM including the AICDA locus itself, PRDM1, DNMT1, and ACTB ( Figure 3E and Extended 148 Data Figure 3). If the mutations in these peaks were largely due to a single mutational process that 149 preferentially acts in certain regions, we expected to find differences in the mutational signatures 150 relative to the full set of SNVs. Interestingly, although the signature attributed to AID and POLν 151 activity was represented genome-wide and within the peaks, another signature that does not clearly 152 correspond to any of the previously described signatures was unique to the peaks (Extended Data 153 Figure 5).

154
To identify peaks with potential relevance in modulating transcription, we assessed the rela-155 8 tionship between gene expression and the presence of mutations in nearby peaks. All genes with 156 one or more proximal peak were tested for significant differences in expression between mutated 157 and un-mutated cases (Extended Data Figure 6). Most of the protein-coding loci identified were 158 known (including SERPINA9, CD44, PIM1) or the novel targets of aSHM we had identified (in-159 cluding DNMT1 and AICDA). The correlation between expression and aSHM is typically attributed 160 to an elevated AID activity at highly expressed genes and thus may act as a permanent marker of 161 sustained expression of these genes rather than representing driver mutations. Regardless, the 162 unprecedented breadth of mutations affecting potential regulatory regions including enhancers 163 proximal to these genes suggests the possibility of a regulatory effect and this warrants further 164 investigation. To enrich for genes with patterns unlikely to result from aSHM, we identified loci 165 for which the most common variant annotation in each peak was not among the classes attributable 166 to aSHM. Multiple genes showed distinct distributions of mutations seemingly inconsistent with 167 aSHM. This could imply a different mutational process or the action of selective pressure to retain 168 or alter function. Some of these genes had short 5 UTRs and thus had mutations within their CDS 169 and even 3 UTRs (e.g. MPEG1, HIST1H1C). In longer genes, such as NFKBIZ, 3 UTR mutations 170 cannot be readily attributed to aSHM and appear to indicate strong selective pressure.

172
Our genome analysis uncovered a striking pattern of mutations in NFKBIZ, a gene that has been 173 reported to act as an oncogene in DLBCL cases with copy number amplification affecting this 174 locus 29 though other somatic mutations affecting this region appear to be lacking. NFKBIZ was 175 9 significantly more commonly mutated in ABC cases when the 3 UTR mutations are considered and 176 even more strongly enriched in ABC when amplifications affecting this region are also considered 177 (P = 2.15 × 10 −5 , Fisher's Exact Test) . Combining the genome data and results from targeted 178 sequencing in a larger "validation" cohort, we confirmed our observation of a novel pattern of were uniformly re-analysed with the same methodology and show a much lower yield of these 183 variants. We also compared the prevalence of NFKBIZ 3 UTR mutations in other lymphoid cancers 184 with available WGS data including CLL, FL and BL. FL had the next highest prevalence with 185 these mutations appearing in less than 3% of cases ( Table 2). The number of cases also provided 186 sufficient power to determine patterns of mutually exclusive genes within ABC and GCB. One of 187 the few pairs of genes showing mutual exclusivity for mutation in ABC was MYD88 and NFKBIZ, 188 indicating a potentially redundant role of these two mutations in lymphomagenesis (Extended Data 189 Figure 8).

190
The 3 UTR of NFKBIZ is highly conserved and the mutated region has been previously 191 identified as a destabilising element that promotes rapid mRNA turnover 30 . We predict these mu-192 tations perturb this process thereby increasing mRNA longevity, which would in turn cause allelic 193 imbalance in mutant cases. We constructed a structural model for the highly conserved region 194 that contained the vast majority of SSMs (Extended Data Figure 9), which consists of one large
For genes with mutation hot spots or affected by CNVs or SVs, we considered these mutations separately from other missense variants. (B) Some genome-wide non-coding mutation peaks also showed cell-of-origin differences. Many of these are within genes that encode the immunoglobulin heavy and light chains. Unsurprisingly, the remaining genes overlap considerably with COOassociated genes that are also affected by coding mutations, mainly those affected by aSHM. The differential presence of aSHM activity, likely owing to expression differences, may explain why some of these genes are uniquely mutated in their respective subgroup. The BCL2 locus had multiple peaks that were commonly mutated among GCB cases including multiple intronic regions that appear, based on H3K27Ac patterns, to coincide with an enhancer. These mutations were not restricted to cases with BCL2 translocations. The AICDA locus, a novel aSHM target, is mutated mainly in ABC cases. The BCL6 and PAX5 super-enhancer was preferentially mutated in GCB cases. A peak in GRHPR near PAX5 was more commonly mutated in ABC cases ( Figure 3D).
The DNMT1 locus is near S1PR2 and both of these peaks were enriched for mutations in GCB, indicating the potential for co-regulation of these genes using a common set of regulatory regions.
(C) Genes with mutations significantly associated with one subgroup are shown above a heat map of the expression of several genes with strong COO-associated expression to highlight the mutual exclusivity between some gene pairs. In ABC, NFKBIZ and MYD88 mutations were mutually exclusive relative to other mutations involved in NF-κB signalling. In GCB, EZH2 and MEF2B hot spot mutations were common in BCL2-translocated cases and in those lacking mutations in  assay was applied to eight DLBCL cell lines to determine NFKBIZ mRNA expression levels.
Dark colours indicate total transcript counts and light colours indicate wild-type 3 UTR transcript counts. Cell line NFKBIZ mutations include amplifications (blue), 3 UTR deletions (green) or none (grey). One line (Pfeiffer) lacking any detectable NFKBIZ mutation had elevated NFKBIZ mRNA levels relative to un-mutated lines. We suspect this is due to a STAT3 mutation, as previous studies suggest that STAT3 plays a role in NFKBIZ activation 40,41