Transposable Element Expression in Acute Myeloid Leukemia Transcriptome and Prognosis

Over half of the human genome is comprised of transposable elements (TE). Despite large-scale studies of the transcriptome in cancer, a comprehensive look at TE expression and its relationship to various mutations or prognosis has not been performed. We characterized the expression of TE in 178 adult acute myeloid leukemia (AML) patients using transcriptome data from The Cancer Genome Atlas. We characterized mutation specific dysregulation of TE expression using a multivariate linear model. We identified distinct patterns of TE expression associated with specific mutations and transcriptional networks. Genes regulating methylation was not associated with significant change in TE expression. Using an unpenalized cox regression analysis we identified a TE expression signature that predicted prognosis in AML. We identified 14 candidate prognostic TE transcripts (TEP) that classified AML as high/low-risk and this was independent of mutation-based and coding-gene expression based risk-stratification. TEP was able to predict prognosis in independent cohorts of 284 pediatric AML patients and 19 relapsed adult AML patients. This first comprehensive study of TE expression in AML demonstrates that TE expression can serve as a biomarker for prognosis in AML, and provides novel insights into the biology of AML. Studies characterizing its role in other cancers are warranted.


Supplementary figure 2.
Module average dysregulation for multiple predictive features. The x-axis depicts the predictive features. The yaxis depicts the modules constructed by network analysis. The module dysregulation average was summarized from the significant altered non-TE, predicted by the multiple regression (supplement figure 1), that corresponded to the given module.

Acquisition of RNA sequencing fastq files for AML
The RNAseq data was obtained from The Cancer Genome Atlas (TCGA) via dbGAP access.
The transcriptome data for the 284 pediatric AML (TARGET 1 ) and relapsed AML 2 were also obtained through dbGAP.

Normalization of TE and non-TE in TCGA
We calculated transcript abundances using Kallisto 3  We considered normalized expression of TE and non-TE by collapsing all read counts with a minimum read cutoff of 16, as recommended the sequencing quality control consortium (SEQC) 21 . The non-TE and TE were normalized together as a source data using voom 5  heteroscedasticity, and the normalized counts were then saved in log 2 counts per million (CPM).

Normalization of TE and non-TE in TARGET
The cytogenetic variables included in the pediatric TARGET patient profile were t(8;21), inv (16)

Normalization of TE and non-TE in MSKCC
For the 19 relapsed AML patient profiles, the following predictive features were used in normalizing nonTE/TE expression and survival analysis. Demographic variables were age, and sex. Mutational variables were NPM1 point mutations, FLT3 point mutations. Cytogenetic variables were deletion 5 on the q-arm, deletion of 7 on the q-arm, Y deletion, trisomy 21, t(6;9), and inversion 16. Other variable white blood counts were also used.

Multivariate linear model: altered expression of transposable elements in TCGA
We fit the normalized TE expression in an empirical Bayesian linear model and identified altered expressed TE (using TE transcripts), using Benjamini-Hochberg FDR cutoff of 0.05, un-adjusted p.value threshold of 0.01, minimum absolute log-fold change of 0.5, and hierarchical multiple testing strategy across features as contrasts 6 .
The altered expressed TE (AE-TE) were then identified which satisfied the previously described filtering criteria for each corresponding feature such as mutations, cytogenetics, blood related factors, and clinical/demographic factors resulting in significant effects per feature. The total numbers of AE-TE were plotted in a barplot for each predictor variable. We compared the linear model to a random model by sampling randomly the predictor design matrix and re-fitting an empirical Bayesian random mutational model 7 .
The statistically significantly AE-TE identified after hierarchical testing were summarized by averaging the estimated coefficients for each corresponding AE-TE within each biotype. In other words, for each TE biotype, the significant AE-TE estimated coefficients from the linear model were averaged. The biotype information was identified using Arkas 4 , which includes RepBase annotation information at the transcript level. We hence were able to identify each TE transcript biotype dysregulation for each mutational predictor. Each of these was depicted using an R package 'riverplot'.

Multivariate linear model: altered expression of ENSEMBL non-TE TCGA
Normalized transcripts of non-TE were examined separately from TE. Non-TE were identified using ENSEMBL gene identifiers, which included coding gene, pseudogene, long noncoding RNA, short noncoding RNA elements, and small nuclear RNA. Arkas provided the annotation information and data stored of the experimental data. The multivariate linear model, and significant effects were similarly modeled as with TE; we used hierarchical filtering of the multivariate model coupled with Benjamini-Hochberg FDR filtering q.value threshold of 0.05, un-adjusted p-value 0.01 threshold, and minimum absolute log fold change threshold of 0.5.

Coding gene network association table with TE transcript biotypes
In order to comprehensively study the correlation patterns between TE and other transcripts including coding gene pathways, we constructed a weighted gene correlation network table 8 . We first constructed a non-TE expression network using WGCNA using soft-thresholding power of 6 after confirming that scale free topology and mean-connectivity scores were above 80%, which suffices for diverse nodal connectivity. The normalized non-TE transcript expression in log 2 CPM was used for ENSEMBL gene-IDs described earlier. The minimum genes for each eigengene cluster was set to 10, the maximum was set to 4000. We used 'blockwise' module construction, and module 50 is defined as the set of genes that remained un-connected in the networking context.

Prognostication of AML using TE transcript expression in TCGA:
We used all patients in TCGA to normalize. We prepared the TE expression of all patients by normalizing the collapsed estimated counts with respect to ENSEMBL transcript IDs using limma-voom 11 as previously described. We then subset the training and testing cohort after normalization. The corresponding survival times along with expiration status of each patient was obtained from the data acquisition. The matching survival data for each data split was matched by patient.

Identification of 14 transposable element prognosticators via nested sampling
We identified the 14 candidate transposable element prognosticators through nested sampling in TCGA (N=178). For each nested sample division, the TCGA subjects were divided 60% nested training cohort, and 40% nested testing cohort. Each nested cohort division ensured that selected covariates were prevalent within the subdivision proportional to the population prevalence of the corresponding covariate. The following covariates were frequency matched: TP53, FLT3,

Evaluation of TEP
The survival analysis was evaluated in three ways.
1. We implemented multivariate Cox regression using a ridge regression 13 and the Cstatistic was evaluated using cross-fold validation. For TCGA intra cohort validation, we used 3-fold cross validation (N=37).
2. Utilizing compound covariate prediction method, (identifying patient risk scores as a linear combination of TEP expression and TEP hazard estimates), the patient risk summary scores were cut into 2 risk groups using a quantile cut at the 75 th quartile. A Kaplan-Meier (KM) survival model was performed using the 2 groups, and a Wald test statistic was calculated on the classification of patient risk summary scores.

Simple sign averaging method for risk discrimination
Simple sign averaging of Cox hazard estimates was shown to perform no worse than LASSO, or ridge regression 15 . We calculated the sign average estimated from the TCGA intra cohort validation using the final TEP. Using the sign averages of the Cox hazard estimates, we perform a Kaplan-Meier (KM) survival analysis on the intra cohort validation set (N=37), TARGET, and

Cox proportional hazards model with random biotype effects TCGA
The TEP were fit into a multivariate Cox model with random effects which assumed a Gaussian distribution for the random effects 16  The 'blood' features were BM/PB blast %, and WBC. All variables were then mean centered and fit into a Cox proportional hazards model, each variable's contribution to the multivariate concordance index is depicted in Figure 6. The survival plots were drawn using the R package 'survminer' (Alboukadel, 2017, https://CRAN.R-project.org/package=survminer).

Validation in TARGET Pediatric AML
We validated the 14 TEP expression using 284 pediatric AML patients. The expression data was normalized using limma-voom. The design matrix considered mutation and cytogenetic information that was 5% prevalent in the TARGET cohort. The normalized TEP expression were fit into a Cox model 15 and the hazards estimates were used to identify patient risk scores as previously described. The patient risk summary scores classified patients into 2 groups based on a quantile cut at the 75 th quartile. The patient risk group classification was then fit into a KM survival plot depicted in Figure 3C.
The signed averages of the TARGET TEP hazard estimates were used as an alternative discrimination. The signed averages were computed as previously described and depicted in C and D). The C-statistic was calculated using 5-fold cross validation from the Cox model 15 .

Validation in Adult Relapsed AML Memorial Sloan Kettering Cancer Center (MSKCC) data set
We normalized the expression of MSKCC 19 relapsed AML patients using voom. The 14 TEP normalized expression was similarly used in a Cox model 15 , and the patient risk classification groups were identified. The KM survival model was constructed, and depicted in Figure 3D.
The Wald test statistic was computed from the KM model and depicted in supplementary figure   4F.
The effects of age, and sex covariates were examined in MSKCC. The patient ethnicity information was not available. The ANOVA was performed similar to TARGET and TCGA.
The were summarized into a group categorical covariate using 'coxme'.

TCGA stratified survival models based on mutational covariates
Using mutation status of 178 patients and their corresponding event free survival times (EFS), we predicted survival using Cox model variant that used ridge penalty 15 . The mutational status in the Cox model variant were centered, but not scaled 13 . Survival statistical testing was performed consistently as before. We thus identified 96 low risk patients determined by mutation covariates, and 82 patients as high risk based on mutations alone.
Using the 14 TEP and the 96 low-risk patients classified by mutations, we performed survival analysis using Cox model to sub-stratify the low-risk mutation based cohort. The TEP expression of the mutation based low-risk patient cohort was transformed into a categorical variable. The principal components of the 14 TEP expression 15 corresponding to the patients within the lowrisk cohort were cut at the 75 th quartile. The categorical TEP features were centered and hazards were estimated from a Cox model. This model was evaluated in a Kaplan-Meier survival model as previously described. The summary statistics, and statistical testing of the survival analysis was performed consistently as before. Although dichotomizing the continuous TEP covariates is not necessary, we attempted to negatively bias the TEP's predictive power for a conservative analysis.
High risk (N=82) patients classified by mutations were also sub-stratified. The TEP substratification was performed identically as the low-risk analysis.

TCGA stratified survival models based on cytogenetic covariates
Similar to mutational risk classifications, we identified risk groups by compounding the Cox model hazard estimates derived from cytogenetics previously described. Each cytogenetic risk cohort was analyzed similarly to the mutation sub-stratification analysis using the 14 TEP.

TCGA stratified survival models based on nonTE expression covariates
The nonTE expression of the 178 TCGA patients were used to identify patient risk groups. The 'voom' normalized expression was performed, and the principal components were identified.
The first 14 PCs were used as covariates in the Cox model 15 and were cut into a categorical variable using the 75 th quartile. These categorical features were centered, and then used in a Cox model and the concordance index was computed using a 3-fold cross validation. The nonTE risk groups were identified using the resultant hazard estimates that were compounded with the categorical non-TE PCs variables.
Similar to the identification of risk cohorts using the nonTE expression, the principal components of the TEP were similarly utilized. The nonTE based low-risk patient cohort was used as the subjects for sub-stratification analysis. The principal components of the subjects' corresponding 14 TEP expression were dichotomized into a categorical variable by cutting at the PCs 75 th quartile. These categorical variables were centered and used as the features in the Cox model which identified hazard estimates per each categorical variable. Compounding the Cox covariates with the estimated hazards determined 2 sub-stratified risk groups. These 2 risk groups were used in a Kaplan-Meier survival model ( Figure 4E).
The high-risk cohort identified by nonTE expression was modeled similarly ( Figure 4F).

TEP associated gene predictors identified through penalized regression
The normalized nonTE expression was used as the independent predictors for each of the 14 TEP (response variable) using 'glmnet' 18 . The LASSO (Gaussian family) was used with 10 fold crossvalidation (10). The largest 'lambda' within 1 error of the minimum cross validated error was used. The scaled predicted response was derived from the glmnet model which was compared to the scaled observed TEP expression. Figure 5 depicted the gene predictors which modeled the corresponding TEP expression (R 2 >0.5).
Previously the nonTE network identified module membership for each nonTE. The module membership for the selected TEP predictors was identified using the nonTE network ( Figure 2).

Functional annotation of TEP associated predictors
The DAVID 19 bioinformatic resource was used to identify GO function annotations associated with genes from module 2 and 3. We screened the enriched GO functions using an FDR cutoff of 0.05 and a Fisher p.value threshold of 0.05. GO biological processes (BP), molecular function (MF), and cellular components (CC) were identified for module 2 and 3.