Suppression of Transposable Elements in Leukemic Stem Cells

Genomic transposable elements (TEs) comprise nearly half of the human genome. The expression of TEs is considered potentially hazardous, as it can lead to insertional mutagenesis and genomic instability. However, recent studies have revealed that TEs are involved in immune-mediated cell clearance. Hypomethylating agents can increase the expression of TEs in cancer cells, inducing ‘viral mimicry’, causing interferon signalling and cancer cell killing. To investigate the role of TEs in the pathogenesis of acute myeloid leukaemia (AML), we studied TE expression in several cell fractions of AML while tracking its development (pre-leukemic haematopoietic stem cells, leukemic stem cells [LSCs], and leukemic blasts). LSCs, which are resistant to chemotherapy and serve as reservoirs for relapse, showed significant suppression of TEs and interferon pathways. Similarly, high-risk cases of myelodysplastic syndrome (MDS) showed far greater suppression of TEs than low-risk cases. We propose TE suppression as a mechanism for immune escape in AML and MDS. Repression of TEs co-occurred with the upregulation of several genes known to modulate TE expression, such as RNA helicases and autophagy genes. Thus, we have identified potential pathways that can be targeted to activate cancer immunogenicity via TEs in AML and MDS.

Using the previously mentioned combined sets, we tested paired measurements using the patient identifiers for each pairwise comparison (LSC-Blast, LSC-pHSC, and Blast-pHSC). For the analysis of MDS we compared high-risk to low-risk MDS. In each pairwise test we used Bonferroni-Adjusted significance thresholds to adjust for multiple testing. A read minimum cutoff of 2 was used to measure library normalised CPM counts using edgeR in units of log base 2. The pathway analyses were always conducted in units of CPM for all data sets and all pairwise comparisons.
Differential Expression was called on the merged 'Immune' and 'Inflammation' gene-sets comparing LSC (Comparison) to Blast (Reference) (Supplemental Table 5) and plotted in units of TPM (Supplemental Figure3, and Supplemental Figure 6). Similarly differential expression was determined comparing high-low risk MDS.

Individual Gene Testing Using ANOVA
We tested the expression of individual genes across the stages of AML, and in MDS, some of those genes included EVI-1, ATG5, HSP90AA1, and LAMP2. The individuals gene expression levels were tested using ANOVA in units of TPM or CPM.
For the ANOVA conducted on units in CPM, between samples normalisation was calculating using EdgeR to normalise library size with a minimum read cutoff of 1. The ANOVA was fit using a linear model and the pairwise comparisons used Bonferroni adjust p.values to adjust for multiple comparisons for testing genes in AML which had paired measurements; MDS did not have paired measurements and used a significance threshold of 0.05 (non-adjusted).

Unsupervised Clusters of Patient Samples Based on Coding Gene Expression Correspondence to TE Type Expression in AML
Unsupervised clustering of patient samples used an unsupervised clustering of average Euclidean distance of the gene expression of all samples. The hierarchical clustering for each sample clustered average distance of expression using the unique ENSEMBL gene identifier.
The clustering of coding gene expression corresponding TE type expression used units of CPM and library normalisation edgeR. We performed hierarchical clustering of aggregated CPM counts of gene-ID/gene name and calculated the corresponding TE expression in terms of CPM.
The hierarchical clustering performed used WGCNA 9,10 . The gene expression used in the Euclidean cluster was log base 2 (1+CPM), the minimum read counts per gene considered was a minimum of 2 read CPM. We required filtering criteria of the clustering branch with a minimum branch of 2.

Coding Gene Networking Co-Relationship To Specific Transposable Element Types
The gene network construction used a soft-power threshold of 6 in WGNCA 9,10 , which is the default parameter; further WGCNA publication suggests that the network module construction is robust per the soft thresholding power parameter value. Langfelder et. al suggest that the soft thresholding power level 6 suffices to achieve a scale-free-topology. In the construction of AML and MDS coding gene networks, we verified that the scale independence for both experiments was greater than 0.80 indicating a scale-free topology. Additionally, the gene network construction used a block construction and a signed network, setting the minimum gene module size to 30, and a maximum gene module size of 4000. For the construction of the AML gene network, we considered Blast and LSC gene expression, excluding pHSC. The gene network was formed using ENSEMBL gene identifiers, edgeR library normalisation, and a log 2 transformation of units of CPM. MDS network construction was formed identically.
For both MDS and AML, each gene network was then correlated to the expression of the TE types selected as MULE, Pseudogene SAT, snRNA, TcMAR, DNA Transposon, PiggyBac, ERV1, L1, satellite, Endogenous Retrovirus, Repetetive Element, MIR, tel, Mariner/Tc1, HSFAU, SVA, CR1, SINE/tRNA, hAT, L2, Transposable Element, ERV3, ERVL, Merlin, ERVK, centromeric, Alu, and LTR Retrotransposon. The association of gene 'modules' to TE types was calculated by correlating each module's Eigen-value (PC1) to the sample's expression of individual TE type 9,10 (Supplement figure 5 and 7). In order to determine statistically significant associations between the samples' gene expression in a given network to a specific TE type, a Pearson-correlation test was conducted on the samples' gene expression relatability of a given network to the samples' expression of a given TE type. A significance threshold of 0.05 was used.
Each gene module was tested for significant gene-set activation of the GO gene-sets version 5 (Supplement Figure 5, and 7). AML activation levels of the merged 'Interferon', 'Inflammation', and 'Immune' were plotted along the left-hand side the modules; note only the significantly activated genesets (p.value less than or equal to 0.05) were depicted (Figure 4, Figure 5D). We used units of CPM with library size normalisation in edgeR with a minimum read count of 2 for gene-set activation testing. For gene-set analysis, we used ENSEMBL gene names because MsigDB gene-sets were in terms of their naming symbols. The gene-set enrichment for MDS used 'Immune' and 'Inflammation' gene-sets only ( Figure 5D).
Network construction of the AML and MDS experiments were done identically. The unconnected module defined as 'Grey' was discarded from all analyses.

Analyzing Gene Modules: Predictive Gene Expression of Network Modules, and Signed Connectivity of Regulatory Hub Genes
The supplements figures (Supplement figure 5 and 7) depict the relatability of each samples' gene expression to the network by a bar plot which indicates the level of covariance to the 1 st principal component of the module Eigen-value; thus if a given sample has a high covariance to a given network module's Eigen-value (PC1), then this sample is likely to have high expression in this network (positively co-varies) with the adjacency matrix and vice versa. The Gene Module Eigen-values depicts which samples are predicted to covary with the network module and therefore have higher expression in this weighted network collection.
Most modules have more than 100 genes, so their names were not included in the supplementary figures 5 and 7. Most modules had genes, which were differentially expressed, and the DE genes' TPM expressions were included; if a module had less than 5 DE genes, the DE expression plot was omitted. The DE analysis was performed using previously described methods.
A power adjacency matrix is formed by calculating weighted pairwise correlations 9,10 . The regulatory intramodular connected elements is calculated by summing rows in a given adjacency matrix defined as Generalised Connectivity 9,10 . The top 50-65 connected genes in a given module with the most competitive intramodular connectivity scores of higher than 0.95 were plotted in units of TPM (supplemental figures 5 and 7).

Chi-Square Test and Fisher Test of Correlation and Activation Levels
After creating the gene module network and identifying the associations between the gene coexpression network modules to the expression of TE types, and testing for module set activations of the 'Interferon', 'Immune'/'Inflammation' gene-sets using previously described methods, we then tested the association between the correlation of TE types to the activation levels. For AML, we used the association table and created a contingency table by counting each TE type's positive/negative correlations and the corresponding positive/negative activity level for 'Interferon', 'Immune' and 'Inflammation' gene-set. Using two factors (+/-) correlations and (+/-) activity levels of the three merged gene-sets; we conducted a chi-square test by totaling every TE types' respective factors from the contingency table (ensuring large enough values). For individual TE types association between the two factors, a Fisher-Exact test was conducted using FDR controlling for multiple testing method titled 'Benjamani Hochberg' which adjusted for the 29 multiple Fisher-Exact tests on each of the 29 TE types (Supplement table 6). A significance level for the chi-square test measuring significance of association between correlations and enrichment activity of 3 gene-sets, a significance level used was 0.05. For individual pairwise Fisher Exact Test a significance level of 0.09 was used.
RNA Helicase genes were reported and reviewed by Umate et al 12 , and we expanded our analysis to the entire RNA DEAH-Box Helicases and analyzed the family using coding gene DE methods previously described.
In order to investigate the correlation of the differential expression of the post-transcriptional genes and RNA interference genes reviewed by Goodier, we fit the differential means to a linear model for each experiment (MDS and AML). Thus an R 2 value and a p-value were derived which measured the correlation of the differential means for each gene indicating a significant correlation of differentially expressed genes across experiments. This analysis was performed with a read cutoff of 2 in units of CPM normalised by edgeR.

ATACseq data analysis
ATACseq fastq reads were analyzed from Corces et al 1 . Patient trios were selected which had ATACseq data from pHSC, LSC and Blast samples. ATACseq reads were aligned against the hg19 reference genome using the bwa mem algorithm (https://arxiv.org/abs/1303.3997). Following the alignment, QC was performed for the samples. Only reads aligned to autosomes and sex chromosomes were considered. Mitochondrial reads were discarded.
All samples showed enrichment for transcription start sites (TSSs) sites. Enrichment was computed by comparing total reads falling into a window of 2kb just upstream of promoters to reads in a 5kb window distant from the TSS. An important visualisation tool for ATACseq data is the distribution of fragment sizes. A typical fragment size distribution showed a characteristic wiggle indicating large fraction of short nucleosome-free fragments and a lower fraction of fragments from regions protected by nucleosomes 13,14 . We used the getPESizes function from the csaw 15 library to interrogate the distribution of fragment sizes.
Our goal in this analysis was to compare changes in chromosome accessibility across celltypes. This falls within the framework of differential accessibility testing. Here we favored adoption of a window-based method 16 for detecting differentially accessible regions. Such approaches have been previously been implemented for ChIPseq data, notably in the package csaw 15 .
Reads were counted along the chromosomes in sliding windows of size 200bps. Windows with less than 10 reads were discarded. Further filtering was done by first computing a global background of reads distribution by counting reads in contiguous window size of 1000bps.
Composition bias across libraries was alleviated by normalising the libraries using the trimmed mean of M-vaues (TMM) method 17 . Read counts in contiguous windows of 1000bps were again considered for this purpose.
Post normalisation and filtering, the libraries were used for calling differentially accessible windows across cell-types. A paired design (Y ~ patient + cell type) was employed to perform a negative binomial regression using functions from the edgeR 18 .. Differentially accessible windows for LSC versus Blast samples and LSC versus pHSC samples were tested for. Post accessibility testing, neighbouring windows were combined to define a region and significance level of the region computed from the p-values of the window-level tests. Multiple testing correction was then done at the regionlevel using a False Discovery Rate (FDR) cut-off of 0.01. For interpretability of the regions, the number of log-fold change increase (logFC UP) and log-fold change decrease (logFC DOWN) windows it contained were also computed (Supplement table 4). In addition, the p-value of the best window and the direction of change were also reported. Finally, all regions were annotated to report their distance from neighbouring genes.