Global long terminal repeat activation participates in establishing the unique gene expression programme of classical Hodgkin lymphoma

Long terminal repeat (LTR) elements are wide-spread in the human genome and have the potential to act as promoters and enhancers. Their expression is therefore under tight epigenetic control. We previously reported in classical Hodgkin Lymphoma (cHL) that a member of the THE1B class of LTR elements acted as a promoter for the proto-oncogene and growth factor receptor gene CSF1R and that expression of this gene is required for cHL tumour survival. However, to which extent and how such elements participate in globally shaping the unique cHL gene expression programme is unknown. To address this question we mapped the genome-wide activation of THE1-LTRs in cHL cells using a targeted next generation sequencing approach (RACE-Seq). Integration of these data with global gene expression data from cHL and control B cell lines showed a unique pattern of LTR activation impacting on gene expression, including genes associated with the cHL phenotype. We also show that global LTR activation is induced by strong inflammatory stimuli. Together these results demonstrate that LTR activation provides an additional layer of gene deregulation in classical Hodgkin lymphoma and highlight the potential impact of genome-wide LTR activation in other inflammatory diseases.


Supplementary Figures
. HL cell lines display a global activation of long terminal repeat elements A) Full THE1B consensus sequence showing predicted transcription start site, splice site and THE1B RACE primer location B) Annotation of RACE-Seq peaks with the Repeat Masker hg19 repeat family annotation. C) Comparison of the THE1B primer sequence to the genomic sequence of active LTRs identified by RACE-Seq. D) Overlap of peaks identified in biological RACE-Seq replicates. Highlighted areas, in blue, represent those peaks which are shared by at least 2 replicates in a cell line (p<0.01, hypergeometric analysis) and are the peaks which were selected for further downstream analysis. Figure S2. LTR activation contributes to global deregulation of gene expression in HL cell lines. A) Comparison of RNA-Seq biological replicates Log2 FPKM values for each gene with an FPKM of at least 1 were plotted and linear regression calculated. B) UCSC genome browser screenshot showing quality of RNA-Seq data. . C) UCSC genome browser screenshot showing a THE1B LTR acting as a promoter for the CSF1R gene in the L428 HL cell line. D) H3K4me3 ChIP-Seq signal in L428 cells centred on active LTRs identified by RACE-Seq and at control random sites E) UCSC genome browser screenshot showing H3K4me3 ChIP-Seq signal at the cHL specific THE1C LTR producing a transcript of NLRP1 and at the endogenous promoter in Reh cells (1) F) Annotation of expressed LTRs identified by RACE-Seq to genomic regions in which they are located. G) Pearson correlation of gene expression patterns determined by RNA-Seq and clustered by hierarchical unsupervised clustering. H) The orientation of active LTRs identified by RACE-Seq was inferred from annotation of LTR orientation by repeat masker. The closest genes downstream of active LTRs were annotated and the expression values obtained from RNA-Seq data. The expression of these genes was correlated by Pearson correlation and the result clustered.  qPCR gene expression analysis using primers designed in exon 1 (p<.05 L1236 and KM-H2 vs Reh and Namalwa) and between the upstream LTR and Exon 2 (p<.05 HL cell lines vs control cell lines, paired Student t test). C) Normalised RNA-Seq FPKM values showing expression of CACN2AD1. D) qPCR gene expression analysis using primers designed for transcripts before and after the intragenic LTR (p<.01 KM-H2 compared to all other cell lines, paired Student t test). E) Normalised RNA-Seq FPKM values showing expression of CHD1L. F) qPCR gene expression analysis using primers designed in the LTR driven anti-sense transcript (p<.01, L428 and KM-H2 vs L1236 and control, paired Student t test) . G) qPCR gene expression analysis using primers designed in an LTR driven lncRNA transcript (p<.01, HL vs control cell lines, paired Students t test). Figure S5. TNFRSF11A-LTR transcript is downregulated by siRNA targeting exon 7 of TNFRSF11A and the WNT5A-LTR has a spliced and lncRNA transcript. A) qPCR gene expression analysis showing expression of a transcript between exon 2 & 3 and also between the upstream LTR and exon 2 following siRNA knockdown compared to non-targeting control. Error bars show standard deviation from 3 biological replicates (p<.05 L1236 vs control cell lines, paired Student t test). B) TNFRSF11A protein measured by Western blot following siRNA knockdown compared to non-targeting control. C) Spliced and un-spliced WNT5A-LTR transcript quantified by qPCR. Error bars show standard deviation of 3 biological replicates.   Figure S8. Constitutive NF-κB activation drives expression of a set of THE1 LTRs. A) THE1B LTR consensus sequence obtained from RepBase and annotated with transcription factor binding motifs, highlighting the presence of an NF-κB motif. B) Doxycycline (Dox) inducible NF-κB activation scheme. C) NF-κB activation in Reh cells confirmed by nuclear localisation of NF-κB as measured by western blotting. D) Relative quantification of inducible nuclear NF-κB localisation in Reh cells as measured by densitometry of western blots. E) Overlap of active LTR peaks identified by THE1B RACE-Seq in 3 biological replicates. F) Overlap of RACE-Seq LTR peaks before and after NF-κB activation. G) Overlap of RACE-Seq LTR peaks after NF-κB activation with merged peaks from the 3 HRS cell lines. Figure S9. Treatment of the Reh cell line with PMA induces global THE1B LTR activation. A) KEGG pathway analysis linking 2-fold dysregulated genes in HL was plotted as a network to show genes which are shared between multiple pathways. Orange represents genes which are upregulated in at least one HL cell line compared to the control cell lines (Reh and Namalwa) and blue represents genes which are down-regulated. The pie charts and colour of the lines for each pathway indicate the proportion of the dysregulated genes which are up-(orange) or downregulated (blue) in that pathway. B) Growth curve following treatment of Reh cells with 2ng/ml. Error bars show standard deviation over 3 biological replicates. C) Overlap of active LTR peaks identified by THE1B RACE-Seq following treatment of Reh cells following treatment of Reh cells with 2ng/ml for 8 hours in 3 biological replicates. D) Comparison of RNA-Seq biological replicates Log2 FPKM values for each gene with an FPKM of at least 1 were plotted and linear regression calculated. E) KEGG pathway analysis for genes up-regulated following treatment of Reh cells F) Differential gene expression measured by RNA-Seq represented by row z-score. Bracket show differentially expressed genes shared between Reh cells following PMA treatment and cHL cell lines (orangedownregulated, blueupregulated). G) Example up-regulated inflammatory response genes following PMA treatment of Reh cells.

5' RACE
RACE was performed using the THE1B specific primer CATGGCTGGGGAGGCCTCA and 5' RACE adaptor TCATACACATACGATTTAGGTGACACTATAGAGCGGCCGCCTGCAGGAAA to amplify cDNA products containing a conserved region of the THE1B LTR and the transcription start site. RACE was carried out based on the ExactSTART™ Eukaryotic mRNA 5'-& 3'-RACE Kit (Epicentre) using the supplied protocol. However, due to the discontinuation of supply of the Tobacco Acid Pyrophosphatase enzyme, a number of modifications were made.
Following second strand synthesis the fragments incorporating a biotinylated primer were selected using T1 Dynabeads (Thermo Fisher Scientific) as follows. T1 dynabeads (20 µl) were washed in B&W buffer (10 mM Tris-HCL, 1 mM EDTA, 2 M NaCl) using magnetic separation and resuspended in 50 µl of 2 x buffer B&W buffer. An equal volume of RACE product was added and samples were mixed on a slow rotating wheel at room temperature for 1 hour. The beads were then captured using a magnetic separator washed with B&W buffer and TE and were resuspended in 33.75 µl 1 x TE.

RACE-Seq
Libraries for genome wide RACE-Seq were produced using the MicroPlex Library Preparation Kit v2 (Diagenode). Purified RACE material (10 µl) was added to 2 µl of template preparation buffer and 1 µl of Template Preparation Enzyme and incubated at 22°C for 25 minutes followed by 55°C for 20 minutes. The prepared template was then mixed with 1 µl Library synthesis buffer according to the manufacturer's instructions and 1 µl Library synthesis enzyme and incubated at 22°C for 40 minutes. The libraries were then mixed with amplification buffer (25 µl), amplification enzyme (1 µl), H 2 O (4 µl) and a Barcoded Indexing Reagent (5 µl) to allow for the samples to be multiplexed for sequencing. The mix was split into 2 reactions of 25 µl which were amplified at 12 and 14 cycles to allow for selection of enough material to sequence without introducing clonal amplification (Extension and Cleavage: 72°C 3 minutes, 85°C 2 minutes, Denaturation: 98°C 2 minutes, Addition of Indexed oligonucleotides: 4 x 98°C 20 seconds, 67°C 20 seconds, 72°C 40 seconds, Library Amplification: 12 or 14 x 98°C 20 seconds, 72°C 50 seconds).
Finally size selection and purification of the libraries was carried out by running products on a 1.2% TAE agarose gel (with 0.05% ethidium bromide) and excising fragments between 190 and 300 bp. These were then extracted using the Qiagen mini-elute gel extraction kit and eluted twice in 12 µl H 2 O. The libraries were run on a Bioanalyzer 2100 with a High Sensitivity DNA Assay chip (Agilent) to determine the average fragment size and quantified by PCR using the Kappa Illumina Library Quantification Kit on an Applied Biosystems StepOne Plus RT PCR system.
The indexed libraries were pooled and sequenced on Illumina MiSeq using the 150-Cycle paired end kit or a NextSeq 500 as a fraction of a 150 cycle flow cell.
Libraries were generated using the Kapa Hyperkit protocol (Kapa Biosystems) according to manufacturer's instructions, using 16 cycles of amplification. 200-400 bp fragments were sizeselected using a 2% agarose gel then subsequently purified using the QiaQuick Gel Extraction kit (QiaGen) according to manufacturer's instructions. High-throughput sequencing was performed on an Illumina HiSeq 2500 sequencer (Illumina, USA).

RNA-Seq
For RNA-Seq performed in cell lines, reads were mapped to the hg19 human reference genome using Tophat2 (3) using --library-type fr-firstrand (for alignment rates see Supp. Table 5). Reads mapping to the sense and anti-sense strands were split into separate files using the following commands: samtools view -b -f 128 -F 16 and samtools view -b -f 80 for forward reads; samtools view -b -f 144 and samtools view -b -f 64 -F 16 for reverse reads (4). Bam files generated by separate samtools commands were subsequently merged by forward or reverse read status using samtoos merge. and histogram density plots were created from the mapped reads using bedtools genomecov with the '-d -split' option and uploaded to UCSC genome browser (5). To obtain normalised FPKM (fragments per kilobase of transcript per million mapped reads) values for gene expression using cuffnorm (6) with --library-type fr-firststrand. Further analysis was carried out using log 2 FPKM values in R and Microsoft Excel (7). Expressed genes were defined as any gene with a log 2 FPKM value of 0 or above and differential expression between cell lines was defined based on at least a 2-fold change in expression.
For RNA-Seq performed in laser capture micro-dissected primary material, reads were mapped to the hg19 human reference genome using hisat2 (8) using default parameters (for alignment rates see Supp. Table 6). To account for noise due to low cell numbers and to remove artefact reads in exons, FPKM values were computed using DESeq2 (9) following read count per feature using featureCounts --countSplitAlignmentsOnly to exclude artefacts corresponding to PCR duplicates in exons originating from genomic DNA (10).
To perform clustering of the RNA-seq data from the cell lines pairwise Pearson correlation of gene expression was used to produce a correlation matrix. Clustering of the RNA-Seq data by Pearson correlation was performed using R with the heatmap.2 function in the gplots package using hierarchical clustering with Euclidean distance and average linkage. For correlations of replicates, spearman correlation coefficients of log 2 (FPKM+1) values were used then clustered using the heatmap.2 function of the gplots package.
For DRG heatmaps from cell line RNA-Seq, DRGs were computed with cuffdiff for all pairwise HL vs NHL comparisons, i.e. all pairwise combinations of KM-H2, L1236, L428 versus Namalwa, Reh, using --library-type fr-firstrand as a parameter. Common HL upregulated genes were defined as significantly differentially regulated in all comparisons, i.e. q<0.05, intersecting using all up-and down-regulated genes from each comparison using the merge function of R, resulting in two lists corresponding to common up-and down-regulated genes. For DRG heatmaps for patient RNA-Seq, TU vs NTC DRGs were computed via DESeq2, using a p-adjusted cutoff of 0.5. Row Zscores were computed from FPKM values derived from cuffnorm as described above, using the following code in R: Z=t(scale(t(log (FPKM+1,2)),scale=T,center=T)). Heatmaps were plotted using the heatmap.2 function of the gplots package. For boxplots comparing the expression profiles from previously published LCM primary material microarray (2) and our own LCM primary material RNA-Seq, row Z-scores were described above retrieved for the top 150 upregulated genes from (2) and plotted for each sample using the boxplot function in R.

Gene Ontology analyses
Gene Ontology analysis was performed using DAVID on lists of up and down regulated genes as previously defined (11,12). KEGG Pathway analysis was carried out using the ClueGo and CluePedia packages in Cytoscape with lists of up and down regulated genes combined from each HL cell line compared to each control cell line (13)(14)(15). The network was produced based on KEGG terms with a pV < 0.05 and the layout was manually adjusted to enable all interactions to be visualised.

RACE-Seq
RACE-Seq reads were first trimmed using nested cutadapt -g <adaptor> commands to remove the RACE adaptor sequence (TCATACACATACGATTTAGGTGACACTATAGAGCGGCCGCCTGCAGGAAA) and the THE1B consensus sequence (TGAGGCCTCCCCAGCCATG). The trimmed reads were then mapped to the hg19 version of the human reference genome using Bowtie2 in paired end mode with the '-very-sensitive' parameter (16) (for alignment rates see Supp. Table 4). Multi-mapping reads were removed using samtools view -bq 2. Histogram density plots were produced for each biological replicate and for the merged replicates using bedtools genomecov and the resulting plots uploaded to UCSC genome browser. Regions of enrichment (peaks) were identified using Macs1.4 with the '--keep-dup=all' parameter. The resulting peaks from each biological replicate were overlapped to identify the shared peaks using the ChipPeakAnno package in R and venn diagrams produced. High-confidence RACE-Seq peaks were defined by the presence of a peak in at least 2 out of 3 biological replicates. High confidence peaks were selected using an in house bedtools script (utilising nested bedtools intersect commands) and used for all further analysis. All further venn diagrams comparing the RACE-Seq peaks between cell lines were performed using the ChipPeakAnno package in R and lists of overlapping and specific peaks were created using the intersect function in bedtools. Annotation of repeat elements also made use of the bedtools intersect function overlapping the datasets with the Repeat Masker annotation track obtained from UCSC genome browser.
The clustering of RACE data between cell lines was carried out by creating a matrix of the number of peaks shared between each pair of cell lines using the ChipPeakAnno package in R (17). To compare these binary datasets the Dice index coefficient was calculated for each pairwise comparison in the context of the entire population and clustering was carried out as previously described using the heatmap.2 function in the gplots package of R.
Annotation of the genomic regions in which the active LTRs (RACE peaks) resided was performed using the annotatePeaks function of Homer. Annotation of LTR peaks to repeat elements was carried out using the command bedtools closest -a KM-H2_peaks_in_min_2reps_peaks.bed -b ../../Annotations/hg19_rmsk.bed -t first -D b | awk ' $10==0 '. LTR element types were counted by aggregating the resulting object in R using the aggregate function.
Closest genes to RACE peaks were identified using bedtools closest and a hg19 gene annotation reference. To determine closest genes in the same orientation and on the same strand the expressed LTR strand was first determined using bedtools intersect with the '-wao' parameter to obtain strand annotation form the repeat masker annotation. LTRs whose closest downstream gene shared the same strand, excluding LTRs located within genes, promoter and 3' UTR regions were selected and annotated usingbedtools slop -i hg19_refFlat.bed -b 1000 -g hg19.chrom.sizes | bedtools intersect -a <LTR peaks> -b --v | bedtools closest -t first -iu -D a | awk ' $6==$12 '. The union of all closest genes was computed using the following command: cat <all LTR peaks with closest gene on same strand bed files> | awk '{ print $10 }' | sort | uniq. Files were subsequently split by strand using grep + or grep -commands. Average profiles were obtained using annotatePeaks < LTR peaks with closest gene on same strand split by + or -strand> hg19 -hist 10 -bedGraph <plus strand bedGraph> <,inus strand bedGraph> -size 500, by combining profiles with of matching and opposite strands, respectively, i.e. average profiles on the + andstrands for LTRs on the + andstrands, respectively, and on the + andstrands for LTRs on theand + strands, respectively.

ChIP-Seq
Alignment was performed using Bowtie2 with -x hg19 --very-sensitive-local as parameters . Peak detection and coverage track generation was carried out using MACS with -t <bam> -n <name> -g hs --keep-dup=auto -w -S as parameters. For average H3K4me3 profiles, average read counts were obtained using annotatePeaks with -hist 10 -size 2000 as parameters, around L428 LTRs mapped to strand by annotating to repeat elements as described above, and around 10,000 random elements as a control, generated using bedtools random -g hg19.chrom.sizes -l 100 -n 10000.

LTR presence by gene expression fold change
Pairwise RNA-Seq datasets were ranked by log 2 FPKM fold change, defined as FC=(log 2 sample A FPKM+1)/(log 2 sample B FPMK+1) to avoid dividing by 0, with all genes ranked accordingly. Separately, LTRs identified via RACE-Seq were annotated to the closest gene using bedtools closest -a <LTR peak file.bed> -b hg19_refGene.bed -t first as parameters (5). LTR presence for all genes was thus computed by performing a left outer join of all genes and genes annotated as closest to LTR peaks via the merge function of R, using merge(<all genes sorted by log 2 FPKM fold change>, <gene list of annotated LTR peak file>, all.x=T), then replacing all matches with the value 1 and NULL values by 0 in the column originating from the gene list of the annotated LTR peak file. Resulting files were subsequently written as text files via write.table in R, then visualised and saved as heatmap images using Java TreeView (18). To test for significance of enrichment between the presence of LTRs and up-and down-regulated genes, hypergeometric tests were carried out in R as follows: p=1-phyper(<overlap>,<LTR>,<total>-<LTR>,<DEG>) where <overlap> is the overlap of gene names of 1 log 2 FPKM fold change up-or down-regulated genes, <LTR> the total number of genes closest to LTRs, <total> the number of genes in the hg19_refFlat annotation, and <DEG> the number of 1 log 2 FPKM fold change up-or down-regulated genes.