Background & Summary

Amyotrophic lateral sclerosis (ALS) is a devastating neurodegenerative disorder that affects the motor neurons in the brain and spinal cord. It is characterised by the progressive degeneration and loss of these motor neurons, leading to a gradual deterioration of muscle control and function. As a result, ALS patients experience muscle weakness, paralysis, and ultimately, difficulty or inability in breathing, swallowing, and speaking. The molecular cause of ALS is complex and not fully understood. Diagnosing the disease is a significant challenge as ALS has similar symptoms to other diseases. Differentiating ALS from these conditions requires a number of medical tests, including genetic testing, electromyography (EMG), as well as MRI scans1. Currently, there are no effective proteomic, RNA, or other biomarkers to provide early predictions for this disorder.

ALS is a rare disease with a worldwide incidence of 1.75 cases per 100 000 person per years of follow-up2. The familial form (representing 5–10% of cases) is linked to specific gene mutations such as SOD1, C9orf72, TARDBP, and FUS, however, the list of discovered ALS gene mutations is continuously growing. The pathogenesis of sporadic ALS, which constitutes the majority of cases (around 90–95%), is less clear and likely involves a combination of genetic and environmental factors3,4,5. While specific gene mutations are not prevalent in sporadic ALS, genetic variants as risk factors can significantly contribute to disease susceptibility6. Pathogenic or likely-pathogenic variants that individually have only modest effects on ALS risk can collectively contribute to an individual’s predisposition (this concept is best supported by evidence in individuals carrying a p.N352S mutation in TARDBP)7. Common variants with small effect size and combinations of such variants may also confer genetic risk in sporadic ALS patients, but convincing data demonstrating this are still lacking8. Therefore, establishing new genomic data to identify these genetic variants and their combinations in individuals becomes of paramount importance to explain the “missing heritability”6 of high-risk (causative) ALS genes. On the other hand, Mendelian inheritance accounts for only a minority of cases, promoting a focus on environmental and epigenomic cues to explain the missing heritability of known ALS mutations9. In addition to genetic variants, epigenetic changes such as DNA methylation10,11,12,13 and R-loops14,15,16,17,18,19,20 are highly likely to play a role in the development of sporadic ALS. Although the precise pathophysiological mechanisms by which these factors contribute to motor neuron degeneration remain unclear, they may be applicable as diagnostic markers of the disease.

Here, we present a comprehensive epigenomic investigation in blood samples from seven sporadic ALS patients with well-defined clinical characteristics to uncover molecular factors linked to the disease. Our study employed three main techniques: clinical exome sequencing (CES)21 to analyze DNA variants, DNA-RNA hybrid immunoprecipitation sequencing (DRIP-seq)22 to investigate R-loop accumulation, and reduced representation bisulfite sequencing (RRBS)23 to explore DNA methylation levels (Fig. 1). By combining these datasets, we have compiled a compendium of genomic alterations associated with the studied ALS cases.

Fig. 1
figure 1

Outline of the next generation sequencing (NGS) experiments. The workflow began with obtaining blood samples from ALS patients with well-defined clinical records and age-matched control individuals. From these blood samples, genomic DNA was extracted that underwent three simultaneous investigations: 1. Clinical exome sequencing (CES) was performed on 4813 genes commonly linked to human diseases. The protein-coding regions (exons) of the target genes were enriched using target capture microarray hybridization prior to NGS. 2. Reduced representation bisulfite sequencing (RRBS) was employed to analyze gene regulatory regions that influence gene transcription through DNA methylation. 3. DNA-RNA hybrid immunoprecipitation sequencing (DRIP-seq) was utilized to capture chromosomal R-loops associated with genic and intergenic regions.

Methods

Patient selection

The clinical diagnosis of ALS cases fulfilled the ‘El Escorial revisited’ clinically definite diagnostic certainty and the complementary ‘Awaji’ criteria for amyotrophic lateral sclerosis24,25. All selected patients are considered to have sporadic ALS, as the disease did not occur in their families (Table 1). We confirm that informed consent was obtained from all subjects and that our experiments were approved by the Medical Research Council (Ministry of Interior, Hungary) in accordance with all relevant ethical and legal requirements (approval numbers: 47066-3/2013/EKU(556/2013); 11920-2/2017/EÜIG; 12702-5/2018/EÜIG). We confirm that the ethics agreement allows for the open publication of anonymized genetic data and the participants understood all risks for the open publication of these data, as per informed consent.

Table 1 Clinical characteristics of individuals examined in this study.

Genomic DNA extraction from whole blood

We prepared aliquots of 1 ml frozen whole blood sample from ALS and control patients and extracted 200 μL of it using the Macherey-Nagel™ NucleoSpin™ Tissue Kit according to the manufacturer’s protocol. In brief, we mixed 200 μL of the whole blood sample with 25 μL of Proteinase K and 200 μL of buffer B3, prewarmed to 65 °C. After vortexing the mixture, we incubated it at 65 °C for 10 minutes. Then, we added 210 μL of ethanol (96 – 100%) to each sample, vortexed again, and loaded the mixture onto a NucleoSpin® Tissue Column placed in a Collection Tube. We centrifuged it for 1 minute at 11,000 × g, followed by washing the silica membrane once with 600 μL of BW buffer and once with 600 μL of B5 buffer. After drying the membrane for five minutes, we eluted the DNA twice with 50 μL of prewarmed (55 °C) elution buffer (5 mM Tris pH 8.5). The same samples were pooled and DNA concentration was measured using a Nanodrop. The samples were then concentrated to 100 μL using a speed vacuum concentrator. Next, the genomic DNA was fragmented by digesting 25ug of DNA in a final volume of 100 μL with HindIII, EcoRI, BsrGI, XbaI, and SspI restriction enzymes at 37 °C for 4 hours. We performed a fragment analysis of 3 μL of the digested sample on a 1% agarose gel stained with GelRed (SigmaAldrich). DNA was cleaned from the enzyme reaction using the Macherey-Nagel™ NucleoSpin™ Gel and PCR Clean-up Kit according to the manufacturer’s protocol, eluting the fragmented DNA with 2 × 50 μL of nuclease-free water.

Clinical exome seq

Clinical exome sequencing was performed by TruSight One Sequencing Panel Kit (Illumina, San Diego, CA) that covers coding regions of 4813 genes associated with human disease. Library preparation was done in accordance with the manufacturer’s instructions. Sequencing was performed on the Illumina MiSeq instrument with the 2 × 150 paired-end sequencing mode. Read alignment and sequence variant analysis were performed using NextGENe Software version 2.4.2.3 (SoftGenetics, State College, PA). The reads were mapped against the human GRCh38 (hg38) reference genome. More than 84% of the target regions were observed with least 20-fold coverage. Variants were recorded when a minimum of 5 reads aligned to the variant position, and the variant was observed in at least 20% of those reads (Fig. 2). Variants were annotated using the gnomAD v2.1.1 (GRCh38 LiftOver) database. The clinical interpretation of variants were extracted from the ClinVar (NCBI) database. Analysis results of the samples were obtained from NextGENe as variant call format (vcf) files. Mutation reports (variant lists) were also exported as text tables (.tsv) and then merged, sorted and formatted using R v4.3.1 https://www.R-project.org). Reference files, databases used in the analyses, and non-default software settings are detailed in the Data Records section.

Fig. 2
figure 2

Representative clinical exome sequencing result of an ALS patient. Aligned reads are shown over the FUS gene. The list of variants can be found in the Data Records section.

Reduced Representation Bisulfite Sequencing (RRBS)

Reduced Representation Bisulfite Sequencing (RRBS) is an established technique for identifying cytosine methylation in genomic regions most relevant to gene regulation. The genomic DNA first undergoes digestion by the MspI restriction enzyme, which recognizes CCGG sites, resulting in genomic fragments starting and ending with a CpG dinucleotide. Since CpGs are unevenly distributed throughout the genome, MspI digestion is followed by size selection that enriches for CpG-rich regions, including CpG islands. These regions are typically located in gene promoters and control gene expression through DNA methylation. We employed the premium RRBS kit of Diagenode to carry out the RRBS protocol including NGS library preparation, as described26,27. Briefly, 100 ng of DNA was fragmented by the MspI restriction enzyme, which cleaves DNA without regard to the cytosine methylation state. Following adaptor ligation and size selection by Ampure beads, we pooled up to 6 samples together. These pooled samples underwent bisulfite treatment, purification, and PCR amplification, according to the manufacturer’s recommendations. The final RRBS libraries were quantified using the Qubit dsDNA HS Assay from Life Technologies, and the library’s size distribution profile was assessed using Agilent Bioanalyzer 2100 capillary electrophoresis. Sequencing was performed on an Illumina HiSeq 2500 instrument using 1 × 50 single-end mode. Adapter sequences were removed from the raw reads using Trim Galore v0.6.10 https://github.com/FelixKrueger/TrimGalore (with Cutadapt v4.428). Following an initial quality trimming, sequences were trimmed off the 3′ end of the reads if at least 5 bp overlapped with the adapter sequence. Only reads that were no shorter than 35 bp after trimming were kept. Trim Galore was run in non-directional RRBS mode, which performs additional trimming to avoid using cytosine positions in methylation calls that were filled-in during the end-repair step. Low quality bases at the end of the NGS reads were trimmed off (quality trimming using TrimGalore, see RRBS data processing) before the methylation analysis. After read trimming, bisulfite alignment to the GRCh38 (hg38) reference sequence and methylation calling were performed using Bismark v0.24.129 with bowtie2 v2.3.5.130. After mapping, sam files were sorted using samtools v1.1031.

To investigate methylation level differences between ALS and Control samples, DMSs (differentially methylated sites, per CpG) and DMRs (differentially methylated regions, per 1000 bp window) were identified using the MethylKit v1.26.032 software package (with R v4.3.1. Both methylated and unmethylated CpG cytosines were imported as an input. Then, sites with fewer than 10 mapped reads and sites with exceptionally high coverage above the 99.9 percentile were excluded from the analysis. To ensure comparability between samples, the CpG methylation data were normalised, setting the median coverage to be the same for each sample (Table 2).

Table 2 Methylation statistics of CpG sites.

For differentially methylated site (DMS) analysis, each CpG site’s DNA methylation status was evaluated individually33. A CpG site was retained for further analysis only if it had been sequenced and covered in at least 4 samples within each sample group. For the analysis of differentially methylated regions (DMRs), the genome was partitioned into 1000 bp tiling windows with a 1000 bp step-size, and DNA methylation levels were computed for each tiled region. A region was considered for analysis only if it had been sequenced and covered in at least 4 samples within each sample group, and it had a minimum of 5 covered CpG bases. DMS and DMR statistics were calculated by logistic regression using the ‘calculateDiffMeth’ function with default parameters. The threshold for significant methylation differences between sample groups was set to 20%, and the hits were further filtered by a significance value of q <= 0.01. The identified DMSs and DMRs were annotated by the overlapping genic regions, including promoter, exon, intron, and transcription start site (TSS). DMSs and DMRs resulted in extensive coverage of CpG islands (53–60%) and in sequences up to 2 kb distant termed CpG island shores (6–8%)34, while also maintaining substantial coverage of other genomic elements (32–41%) (Fig. 3). Description of all software parameter settings employed in the RRBS analysis, reference data and processed data utilized in the analysis can be found in the Data Records section.

Fig. 3
figure 3

Most DNA methylation occurs in CpG islands and in sequences up to 2 kb distant termed CpG island shores. (a) Proportion of hypo- and hypermethylated sites (DMSs) and regions (DMRs) in ALS samples relative to healthy controls. The dominant change is hypomethylation. N: Total number of DMSs and DMRs. (b) Annotation of DMSs and DMRs to CpG islands (CpGi) and CpG shores of the human genome.

DRIP-seq

DNA-RNA Immunoprecipitation (DRIP) sequencing was started by blocking and coating magnetic beads (Dynal protein G). 2 × 50 μL of protein G beads were washed with 3 × 1 ml of 5 mg/ml PBS/1 mM EDTA/1% BSA solution using a MagnaRack. To the beads, we added 500 µl of 5 mg/ml PBS/1 mM EDTA/1% BSA solution and 25 μg of the RNA-DNA hybrid-specific S9.6 monoclonal antibody35,36. The mixture was incubated for 4 hours with rotation at 4 °C in a cold room and then washed with 1 ml of PBS/1 mM EDTA/1% BSA twice. For immunoprecipitation, we diluted each sample to 700 µl with ChIP lysis Buffer (50 mM Hepes/KOH pH 7,5, 0,14 M NaCl, 5 mM EDTA, 1% Triton X-100, 0,1% Na-Deoxycholate). We mixed 200 μL of the sample with 500 μL of ChIP lysis Buffer and added 2 × 700 µl of this mixture to 50 µl of S9.6-coated beads. After incubating overnight at 4 °C with rotation, we performed a series of washing steps using different buffers as described22. Finally, the DNA was eluted from the beads by adding 100 µl of IP Elution buffer (50 mM Tris/HCl pH 8.0, 10 mM EDTA, 1% SDS), vortexing every 2 minutes, and incubating in a thermomixer for 15 minutes at 65 °C. We then used a MagnaRack to separate the supernatant and pipetted it into a new low-binding tube. To further purify the DNA, we conducted a PCR clean-up using the NTB M&N kit, eluting the DNA in 3 × 100 μL of H2O (pH > 7). Before NGS library preparation, we performed further fragmentation through sonication in a 1.5 ml low-bind tube (300 μl sample) using a Bioruptor (Diagenode) with 2 × 4 cycles of 30 seconds ON/OFF at LOW settings. Finally, we concentrated the samples to 30 μL using a speed vacuum concentrator and measured the concentration using the Qubit dsDNA HS Assay Kit. NGS library preparation and sequencing was performed as described36,37 using the Illumina TruSeq ChIP Sample Preparation protocol. Briefly, DRIP DNA was end-repaired and indexed adapters were ligated to the inserts. Purified ligation products were then amplified by PCR. Amplified libraries were sequenced (at the EMBL Genomics Core Facility, Heidelberg) using an Illumina HiSeq 2500 instrument in the 2 × 125 paired-end sequencing mode. Quality of raw DRIP-seq reads were checked using FastQC v0.11.9 and the results were summarized using MultiQC v1.1438. DRIP-seq reads were aligned to the GRCh38 (hg38) reference sequence using bowtie2 v2.3.5.130. Alignments were filtered using samtools v1.10 keeping only primary alignments of properly paired reads that had >30 mapping quality score. PCR and optical duplicates were also filtered out. RPKM normalized coverage (bigwig) files have been generated with 100 bp resolution using bamCoverage v3.5.139. A representative genome browser track is shown in Fig. 4. Detailed software parameter settings can be found in the Data Records section.

Fig. 4
figure 4

Representative genome browser track showing the RNA-DNA hybrid (R-loop) profile of ALS patients and healthy controls.

Reference data information

All reference data and databases used for CES, RRBS, and DRIP-seq analyses are listed in Table 3.

Table 3 Reference data and reference databases used in this study.

Data Records

All NGS datasets generated in this study (CES, RRBS, DRIP-seq) were deposited in Gene Expression Omnibus (GEO) under the accession number: GSE24247540. A detailed summary of all NGS experiments, NGS data, processed data, result tables, and identifiers to access relevant datasets for RRBS, CES and DRIP-seq can be found in Tables 46.

Table 4 Summary of RRBS data generated in this study.
Table 5 Summary of CES data generated in this study.
Table 6 Summary of DRIP-seq data generated in this study.

Technical Validation

CES was validated by FastQC v0.11.9 and summarized reports were generated using MultiQC v1.1438. Of the 7.4–10,1 million sequenced reads, greater than 60% of NGS reads were unique (Fig. 5a,b). Mean Phred quality scores for each read position in each RRBS sample were high (indicative of ‘very good quality’) (Fig. 5c). Per sequence Phred quality scores were higher than 28 (indicative of ‘very good quality’) for at least 95% of reads in each CES sample (Fig. 5d).

Fig. 5
figure 5

MultiQC validation of clinical exome seq (CES) data. (a) General NGS statistics of CES samples. (b) The Sequence Counts plot displays the total count of reads, categorized as either unique or duplicate. Identifying duplicates involves requiring an exact sequence match across the entire sequence length. To conduct this analysis, any reads longer than 75 bp were shortened to 50 bp. (c) Mean quality Phred scores are presented, with higher scores indicating better base calls. The graph’s background color partitions the y-axis into regions denoting very good quality calls (green zone), calls of moderate quality (orange zone), and calls of poor quality (red zone). (d) Per sequence quality Phred scores are depicted, indicating whether a subset of sequences exhibits consistently low-quality values. The background color of the graph distinguishes very good quality calls (green), moderate quality calls (orange), and poor quality calls (red).

In the RRBS setting, 8.5–37,0 million of sequenced reads were obtained (Fig. 6a). Duplication detection indicated 20–30% of NGS reads as unique (Fig. 6b). Mean Phred quality scores for each read position in each RRBS sample were higher than 28 (indicative of ‘very good quality’) (Fig. 6c). Per sequence Phred quality scores were higher than 28 (indicative of ‘very good quality’) for at least 95% of reads in each RRBS sample (Fig. 6d).

Fig. 6
figure 6

MultiQC validation of reduced representation bisulfite sequencing (RRBS) data. (a) General NGS statistics of RRBS samples. (bd) Same as in Fig. 2.

For the DRIP-seq experiments, raw sequence coverage ranged from 7.0–37.7 million reads (Fig. 7a), of which greater than 70 percent qualified as unique (Fig. 7b). Mean and Per sequence Phred quality scores for each read position in each DRIP-seq sample were high (indicative of ‘very good quality’) (Fig. 7c,d).

Fig. 7
figure 7

MultiQC validation of DNA-RNA hybrid immunoprecipitation (DRIP) sequencing data. (a) General NGS statistics of DRIP-seq samples. (bd) Same as in Fig. 2.

Usage Notes

The NGS results presented above reveal three distinct data layers in a small group of ALS patients, encompassing genetic changes, DNA methylation alterations, and R-loop modifications. Obtaining such data from individual patients is crucial as it enables the correlation of these variables both within and between individuals (Fig. 8). Such a correlative analysis is expected to unveil significant epigenetic patterns, potentially explaining the missing heritability (Mendelian inheritance) of gene mutations identified in ALS. The data presented here have been established using widely recognized software tools, making them readily adaptable for integration into any genome analysis workflow. We are confident that the data we provide can contribute to better understand the aetiology of this devastating disease .

Fig. 8
figure 8

Correlative analysis of CES, RRBS, and DRIP profiles in ALS patients. Lolliplots show the mutational spectra (SNPs) of the displayed gene (Reference base/Alternate base). Missense, synonymous, and noncoding variants are shown in different colours. DNA methylation is represented by the ratio of methylated/unmethylated CpG dinucleotides. + values: fw strand; - values: reverse strand. The position of CpG sites are shown as a proxy. DRIP stands for the RPKM normalized read coverage of RNA-DNA hybrid (R-loops) levels over the genomic region.

Table 7 Summary of software tools used in the current study.