A HML6 endogenous retrovirus on chromosome 3 is upregulated in amyotrophic lateral sclerosis motor cortex

There is increasing evidence that endogenous retroviruses (ERVs) play a significant role in central nervous system diseases, including amyotrophic lateral sclerosis (ALS). Studies of ALS have consistently identified retroviral enzyme reverse transcriptase activity in patients. Evidence indicates that ERVs are the cause of reverse transcriptase activity in ALS, but it is currently unclear whether this is due to a specific ERV locus or a family of ERVs. We employed a combination of bioinformatic methods to identify whether specific ERVs or ERV families are associated with ALS. Using the largest post-mortem RNA-sequence datasets available we selectively identified ERVs that closely resembled full-length proviruses. In the discovery dataset there was one ERV locus (HML6_3p21.31c) that showed significant increased expression in post-mortem motor cortex tissue after multiple-testing correction. Using six replication post-mortem datasets we found HML6_3p21.31c was consistently upregulated in ALS in motor cortex and cerebellum tissue. In addition, HML6_3p21.31c showed significant co-expression with cytokine binding and genes involved in EBV, HTLV-1 and HIV type-1 infections. There were no significant differences in ERV family expression between ALS and controls. Our results support the hypothesis that specific ERV loci are involved in ALS pathology.


Results
ERV locus HML6_3p21.31c is differentially expressed in ALS. We used the curated ERVMap database and protocol to test for differentially expressed ERVs comparing ALS (n = 80) and controls (n = 28) using RNA-sequence data from King's College London (primary motor cortex); see Table 1. 1654 of 3237 ERV transcripts had sufficient read-counts (> 10) across samples for testing. ERVs previously associated with ALS but not included in ERVMap were analysed independently (see Supplementary Table S1).
We then performed the same analysis using lateral motor cortex taken from the GSE137810 dataset (ALS n = 32, Controls n = 6). HML6_3p21.31c showed a significant increase in expression in ALS with a similar foldchange to the KCL cohort (log2 fold-change = 0.642, standard error = 0.220, p value = 0.003). See Fig. 1, Table 2,  and Supplementary Table S2a.
To test if other ERV transcripts were significantly differentially expressed across both King's College London and GSE137810 datasets we performed a Stouffer's meta-analysis, with log2 fold-change as effect direction and sample size used as weights. Only HML6_3p21.31c was significant after correcting for multiple-testing, with Stouffer's Z = 5.039, p value = 4.675 × 10 −7 (Bonferroni threshold = 3.546 × 10 −5 ). See Table 2 and Supplementary  Table S2a.
To test if HML6_3p21.31c was significantly expressed in other brain areas we performed the same analysis in medial motor cortex, cerebellum and across frontal cortex regions using RNA-seq data from GSE137810 and GSE67196. We found a similar relationship between HML6_3p21.31c and ALS in the medial motor cortex (log2 fold-change = 0.645, standard error = 0.326, p value = 0.04). In addition, using Stouffer's method, we found significant increased expression in cerebellum tissue using GSE137810 and GSE67196 datasets (Z = 2.721, p value = 0.006) but not in frontal regions (Z = 1.22, p value = 0.22). Please note that there are overlapping samples across cerebellum and prefrontal cortex datasets. See Fig. 1, Table 2, and Supplementary Table S2b,c.
The HML6 ERV family in general, modelled as a collapsed transcript element using TETranscripts, did not show significantly increased expression in ALS (see Supplementary Table S3). Previous research has identified specific ERV loci in the HML2 family as showing differential expression in ALS 40 . We did not find differential expression of these loci in motor cortex, cerebellum or prefrontal cortex. HML2 as a family of ERVs did not show differential expression between ALS and controls across tissue and studies (see Supplementary Table S3).
To assess whether differences in expression could be driven by differences in cell composition between ALS and control donors, we used the BRETIGEA 45 cell marker database to estimate relative proportions of neurons across KCL samples. We did not find significant differences in estimated neuron proportions between ALS and controls, while controlling for covariates gender, age, post-mortem delay, RIN and surrogate variables (beta = 0.035, standard error = 0.023, p value = 0.150). www.nature.com/scientificreports/ ERV locus HML6_3p21.31c co-expresses with genes involved in cytokine pathways and HIV. To assess the functional involvement of HML6_3p21.31c in post-mortem ALS we performed weighted co-expression network analysis integrating gene expression with ERV expression. Our network analyses included both ALS cases and controls. We performed network analyses using the KCL primary motor cortex cohort as well as the GSE137810 lateral motor cortex.
In the KCL primary motor cortex, HML6_3p21.31c showed significant co-expression in a network enriched for cytokine signalling and binding, viral protein interaction, and responses to EBV and HIV Type 1 infection (see Table 3; Fig. 2). For the genes driving the enrichment for these categories see Supplementary Table S4 and for an extended list of enrichment categories see Supplementary Table S5. This network also showed significant correlation with disease status (Pearson's r = 0.19, p = 0.05) indicating network-wide upregulation in ALS. HML6_3p21.31c was the most significant genomic element to associate with ALS in this network (see Table 3). For genes and ERVs that show differential expression in this network (with an uncorrected p value < 0.05) see Supplementary Table S6.
For the GSE137810 lateral motor cortex cohort HML6_3p21.31c also showed significant co-expression in a network enriched for cytokine binding and receptor activity, and genes involved in HIV Type 1 infection (see Supplementary Fig. S2; Supplementary Table S7).
The HML6_3p21.31c locus is immediately adjacent to the chemokine receptor cluster on chromosome 3, which has been shown to influence retrovirus HIV viral load and is approximately 50 kb downstream of Table 2. Differential expression analyses of retrovirus transcripts. This table displays log2 fold-change between ALS cases and controls for each tissue, and the p value from the Stouffer's meta-analysis. Transcripts are shown if the Stouffer's meta-analysis had p < 0.05, incorporating the KCL primary motor cortex dataset and the TA lateral motor cortex dataset. Bold: Significant Stouffer's meta-analysis p value, beyond Bonferroni multipletesting correction for 3237 transcripts (p < 1.5 × 10 −5 ). Cells highlighted in blue indicate increased expression in ALS samples; cells highlighted in red indicate decreased expression in ALS samples, compared to controls, where transcripts had a p < 1.5 × 10 −5 in any meta-analysis. KCL King's College London, MC motor cortex, Cere cerebellum, FCx frontal cortex (various), PFC prefrontal cortex, Lat lateral, Med medial, FC fold-change.   www.nature.com/scientificreports/ and LTF, they do not show significant differential expression between ALS and control donors after multiple testing correction (see Supplementary Table S8). Previous research has demonstrated that ERVs that show significant differential expression in ALS correlate with TARDBP expression. We performed expression correlation analyses between a list of 14 genes that associate with ALS and HML6_3p21.31c. A small but significant correlation was identified with TARDBP (Pearson's r = 0.21, p value = 0.04) and optineurin (Pearson's r = − 0.25, p value = 0.01). See Supplementary Fig. S3.

ERVs and genetic risk of ALS. To assess if ERVs that showed increased expression in post-mortem ALS
increased genetic risk of the disease, we performed ERV-set analysis using MAGMA and three ALS GWAS datasets. For each post-mortem RNA-seq dataset analysed, ERVs were divided by their direction of fold-change and filtered by their disease status association with a p value ≤ 0.05. ERVs that showed significant increased expression in the KCL primary motor cortex dataset were marginally enriched for SNPs that increased ALS risk (beta = 0.315, S.D. = 0.032, p value = 0.039). No other tissue or tissue-source showed enrichment for ALS risk SNPs (see Supplementary Table S9). In addition, we tested whether ERVs belonging to family HML6 show significant enrichment of ALS risk SNPs (beta = 0.199, S.D. = 0.025, p value = 0.087).

Discussion
Using RNA-sequence data from post-mortem motor cortex, we found that an individual ERV locus of the HML6 family, HML6_3p21.31c, showed significantly increased expression in ALS. We tested six additional post-mortem ALS RNA-sequence datasets and found a significant increase in expression of HML6_3p21.31c in four of them. Increased expression of the ERV locus was found in motor cortex and cerebellum, but not in prefrontal cortex. To understand the potential functional involvement of HML6_3p21.31c in ALS we performed co-expression network analyses. HML6_3p21.31c significantly co-expressed with genes involved in cytokine binding and signalling, interleukin, inflammatory responses, EBV infection, viral protein interaction, HTLV-1 and HIV Type-1 infection.
Previous research has reported differential expression of specific HERV-K HML2 loci on chromosome bands 3q21.2, 7q34, and 10p14 in post-mortem ALS, as well as differential expression of HML2 as a family 25,26 . We did not find evidence that these specific loci or the HML2 family were differentially expressed in ALS motor cortex, cerebellum or prefrontal cortex. Our findings concur with recent studies using qPCR in post-mortem ALS that also found no differential expression of these loci or the HML2 ERV family 41,42 . The qPCR primers used across all previous studies examining HML2 in ALS would not have amplified HERV-K HML6 loci and therefore could not show association with ALS. Two key differences exist between our study and the studies that originally found differential expression of HML2 loci. We have used methods that are specifically designed to quantify ERV locispecific transcripts, while accounting for sequence alignment ambiguity and repetitive genomic elements. In addition, the sample-size of our study is larger and has greater power to account for variation across ALS and control populations.
HML6_3p21.31c belongs to the HERV-K(HML6) family of ERVs. HML-6 expression has been found throughout the genome in healthy and diseased tissue 46 . The most recent clinical association was found comparing HIV-1 infected cell cultures with control T cells 47 . In a similar bioinformatic approach to that used in our paper 47 , Grandi et al. found upregulation of HML6_ 19q13.43 in HIV-positive cells, which was the highest expressed ERV of the 3250 tested. HML-6 has also shown association with breast cancer 48 and melanoma 49 .
HML6_3p21.31c is proximal to and shows significant co-expression with HIV associated genes CCR5, CCR2 and LTF. Outside of the HLA locus, CCR5 is the only gene that associates with HIV susceptibility 50 and viral load 51 through genetic inheritance. CCR5 encodes for a HIV co-receptor 52 , where genetic mutation (CCR5-delta 32) reduces receptor functionality and inhibits HIV's capacity to infect cells 53 . Lactoferrin also exhibits anti-HIV effects 54 , with preliminary evidence of LTF polymorphisms influencing maternal transmission of HIV-1 to offspring. Using network analyses, we found genes that co-expressed with HML6_3p21.31c were enriched for genes that associate with HIV Type-1 infection. These included IL4R, CCL3L1, CXCR1 and CCL2, in addition to genes proximal to the ERV locus (CCR5, CCR2, CCR1 and LTF).
Comorbidity exists between ALS and HIV and was first identified in 1985 55 . Estimates indicate that 3.5 in 1000 HIV patients develop ALS-like symptoms 56 , and this is now referred to as HIV-associated ALS. Clinical studies have found that HIV-associated ALS symptoms may be reversible through antiretroviral therapies 56 , and there are clinical trials 57 in ALS patients without HIV infection, currently ongoing. Our results present the first locus-specific link between ALS, ERVs and HIV associated genes. Given that none of our post-mortem samples had known HIV-associated ALS, our results support evidence that ERV and HIV-related pathways are important in the wider ALS clinical population.
One limitation of our study is the uncertainty about what is causing upregulation of HML6_3p21.31c in ALS. It may be driven by expression of HIV-associated genes nearby, however HML6_3p21.31c alone showed significant upregulation in this locus, whereas HIV-associated genes did not. HML6_3p21.31c co-expressed with cytokine and HIV-associated genes as part of a wider network. Within this network, HML6_3p21.31c was the most significant genomic element to associate with ALS. Given that ERVs are known to regulate gene expression, our results highlight the potential that HML6_3p21.31c may be key genomic elements in modifying HIV associated and cytokine pathways in ALS.
Enrichment analyses of genes that co-expressed with HML6_3p21.31c in a network included genes associated with infection by several exogenous viruses, including HIV-1, HTLV-1, and EBV, as well as other viral-associated immune response genes. A limitation to our study is that we cannot be certain that this enrichment is not driven by immunological pathways in response to ALS, that parallel immune responses to viruses. www.nature.com/scientificreports/ We found HML6.3p21.31c co-expressed with a network that correlated with both neuronal and nonneuronal cell-types. While it cannot be certain that differential expression of HML6.3p21.31c is not influenced by differences in cell composition between ALS and control donors, we integrated surrogate variable analyses into our model which controls for cell heterogeneity across samples 58 . In addition, we found no significant differences in neuron estimates between ALS and controls donors using cell composition analyses.
Anti-retroviral drugs are currently being trialled in ALS 57 . These trials are based upon decades of research showing differential expression of reverse transcriptase activity and ERV transcripts in ALS. While ERV families, and reverse transcriptase activity have shown association with the disease, findings regarding specific ERV transcripts have been less consistent. We found significant upregulation of a specific ERV locus (HML6_3p21.31c) in four post-mortem ALS RNA-sequence datasets across three studies. To date, this is the largest analysis of its kind in ALS. HML6_3p21.31c is located at the chemokine receptor gene cluster on chromosome 3 that modifies HIV susceptibility, viral entry into the cell and viral load. Network analyses of HML6_3p21.31c reveal significant correlation with genes involved in cytokine and interleukin activity, and infection with HIV and other exogenous viruses. In summary, our results (1) support the hypothesis that ERVs are involved in ALS pathology, (2) identify a specific ERV locus that shows consistent upregulation in ALS, and (3) identify major pathological pathways in the disease in which ERVs may play a significant role.  Table 1.

Material and methods
For KCL and the MRC London Neurodegenerative Diseases Brain Bank, frozen human post-mortem tissue was taken from primary motor cortex. For GSE137810, frozen human post-mortem tissue was taken from motor cortex (medial and lateral), frontal cortex (various locations) and cerebellum. For GSE67196, frozen human post-mortem tissue was taken from the lateral hemisphere of the cerebellum, the prefrontal cortex (Brodmann area 9/44), and the primary motor cortex (Brodmann area 4). For KCL and GSE67196 61 , all controls had Braak stage ≤ 2 with the exception of one control in GSE67196 61 . Co-expression and cell-type analyses. To identify which genes significantly co-expressed with ERVs (defined by ERVMap transcript IDs), we performed variance stabilizing transformations (VST) for all required datasets using DESeq2 63 . We corrected normalised expression estimates by covariates sex, age, post-mortem delay, RIN and surrogate variables, using jaffelab function cleaningY 64 to regress them out. Next we used BRETIGEA to estimate cell composition per individual sample using BRETIGEA's cell marker dataset (50 markers) 45 . We used weighted correlation network analysis (WGCNA) 65 alongside VST normalised expression data for the KCL primary motor cortex and GSE137810 lateral motor cortex datasets. WGCNA was performed using a deepsplit of 0 and a minimum module size of 30 genes\ERVs. Next, we analysed if WGCNA modules correlated with disease by performing Pearson's r correlation between disease status and module eigengene values. We performed the same analysis for cell-type estimates by sample.

RNA-sequencing of
To perform co-expression analysis of individual ERVs and genes we used Pearson's r analyses using PerformanceAnalytics in R (https:// github. com/R-Finan ce/ Perfo rmanc eAnal ytics).
Gene function enrichment analyses. Enrichment analyses of gene function was performed using g:profiler and Enrichr, which include KEGG pathways [66][67][68] . Ontological categories with a term size (number of genes) greater than 1000 are not reported.

ERV-set enrichment analysis.
To asses if ERVs were enriched for SNPs that modify ALS disease risk we used MAGMA 69 and three ALS GWAS summary statistics from Nicolas et al. 70 and van Rheenen and Shatunov et al. 9 . Note that these GWAS have overlapping samples but use alternative statistical models (please see references for more details).
We performed annotation of the 1000 genomes reference build GRCh37/hg19 Phase 3 using a 50 kb flanking window. A 50 kb flanking region was selected as the median length of a haplotype for chromosome 1 71  www.nature.com/scientificreports/ ERV loci regions are small in comparison to protein-coding genes, which diminishes the capacity to identify an enrichment of SNPs that associate with disease. Gene-level analyses were performed on an ad-hoc ERV loci file using ERVMap mapping coordinates lifted over to reference genome GRCh37/hg19 Phase 3. ERV-set analyses were then performed using standard MAGMA protocols.

Family-level ERV expression analyses.
To test if ERV families were differentially expressed in post-mortem ALS we used TETranscripts 72 . TETranscripts uses an expectation maximization algorithm to quantify multi-mapped reads, which is designed to estimate abundance of transcribed transposable elements (like ERVs) that have high sequence fidelity (like the sequence overlap that occurs between ERVs of the same family).
To perform TETranscripts we used RNA-seq bam files derived from all seven datasets tested. ERV families with total read-count less than 10 were discarded. All differential expression analyses of ERV families were performed using DESeq2 controlling disease status, gender, age, post-mortem delay, RIN, surrogate variables, where available.
Ethics declaration. Post-mortem tissue samples from King's College London were collected under the ethical approval of the MRC London Neurodegenerative Diseases Brain Bank and under the regulations of the Human Tissue Act UK 2014. All post-mortem tissue was donated to the MRC London Neurodegenerative Diseases Brain Bank under standard ethical and Human Tissue Act procedures, with informed consent provided by the next of kin. Data generated from this material were anonymized and analysed on a high-performance computing cloud (https:// www. mauds leybrc. nihr. ac. uk/ facil ities/ rosal ind/) with data protection protocols in accordance with Department of Health Policy (UK) and the security standards set by the National Data Guardian. Ethical approval to process and analyse post-mortem samples stored at King's College London was provided by a local ethics committee at the Institute of Psychiatry, Psychology & Neuroscience, King's College London, and the MRC London Neurodegenerative Diseases Brain Bank.

Data availability
Dataset are available, on reasonable request. The post-mortem genetic and RNA-sequence datasets are available from the corresponding author, the GWAS survival datasets are available through Dr Isabella Fogh, and the Braineac eQTL datasets are available through the UK Brain Expression Consortium.