DNA methylome, R-loop and clinical exome profiling of patients with sporadic amyotrophic lateral sclerosis

Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disorder characterized by the death of motor neurons, the aetiology of which is essentially unknown. Here, we present an integrative epigenomic study in blood samples from seven clinically characterised sporadic ALS patients to elucidate molecular factors associated with the disease. We used clinical exome sequencing (CES) to study DNA variants, DNA-RNA hybrid immunoprecipitation sequencing (DRIP-seq) to assess R-loop distribution, and reduced representation bisulfite sequencing (RRBS) to examine DNA methylation changes. The above datasets were combined to create a comprehensive repository of genetic and epigenetic changes associated with the ALS cases studied. This repository is well-suited to unveil new correlations within individual patients and across the entire patient cohort. The molecular attributes described here are expected to guide further mechanistic studies on ALS, shedding light on the underlying genetic causes and facilitating the development of new epigenetic therapies to combat this life-threatening disease.

The pathogenesis of sporadic ALS, which constitutes the majority of cases (around 90-95%), is less clear and likely involves a combination of genetic and environmental factors [3][4][5] .While specific gene mutations are not prevalent in sporadic ALS, genetic variants as risk factors can significantly contribute to disease susceptibility 6 .Pathogenic or likely-pathogenic variants that individually have only modest effects on ALS risk can collectively contribute to an individual's predisposition (this concept is best supported by evidence in individuals carrying a p.N352S mutation in TARDBP) 7 .Common variants with small effect size and combinations of such variants may also confer genetic risk in sporadic ALS patients, but convincing data demonstrating this are still lacking 8 .Therefore, establishing new genomic data to identify these genetic variants and their combinations in individuals becomes of paramount importance to explain the "missing heritability" 6 of high-risk (causative) ALS genes.On the other hand, Mendelian inheritance accounts for only a minority of cases, promoting a focus on environmental and epigenomic cues to explain the missing heritability of known ALS mutations 9 .In addition to genetic variants, epigenetic changes such as DNA methylation [10][11][12][13] and R-loops [14][15][16][17][18][19][20] are highly likely to play a role in the development of sporadic ALS.Although the precise pathophysiological mechanisms by which these factors contribute to motor neuron degeneration remain unclear, they may be applicable as diagnostic markers of the disease.
Here, we present a comprehensive epigenomic investigation in blood samples from seven sporadic ALS patients with well-defined clinical characteristics to uncover molecular factors linked to the disease.Our study employed three main techniques: clinical exome sequencing (CES) 21 to analyze DNA variants, DNA-RNA hybrid immunoprecipitation sequencing (DRIP-seq) 22 to investigate R-loop accumulation, and reduced representation bisulfite sequencing (RRBS) 23 to explore DNA methylation levels (Fig. 1).By combining these datasets, we have compiled a compendium of genomic alterations associated with the studied ALS cases.

Methods
Patient selection.The clinical diagnosis of ALS cases fulfilled the 'El Escorial revisited' clinically definite diagnostic certainty and the complementary ' Awaji' criteria for amyotrophic lateral sclerosis 24,25 .All selected patients are considered to have sporadic ALS, as the disease did not occur in their families (Table 1).We confirm that informed consent was obtained from all subjects and that our experiments were approved by the Medical Research Council (Ministry of Interior, Hungary) in accordance with all relevant ethical and legal requirements (approval numbers: 47066-3/2013/EKU(556/2013); 11920-2/2017/EÜIG; 12702-5/2018/EÜIG).We confirm that Genomic DNA extraction from whole blood.We prepared aliquots of 1 ml frozen whole blood sample from ALS and control patients and extracted 200 μL of it using the Macherey-Nagel ™ NucleoSpin ™ Tissue Kit according to the manufacturer's protocol.In brief, we mixed 200 μL of the whole blood sample with 25 μL of Proteinase K and 200 μL of buffer B3, prewarmed to 65 °C.After vortexing the mixture, we incubated it at 65 °C for 10 minutes.Then, we added 210 μL of ethanol (96 -100%) to each sample, vortexed again, and loaded the mixture onto a NucleoSpin ® Tissue Column placed in a Collection Tube.We centrifuged it for 1 minute at 11,000 × g, followed by washing the silica membrane once with 600 μL of BW buffer and once with 600 μL of B5 buffer.After drying the membrane for five minutes, we eluted the DNA twice with 50 μL of prewarmed (55 °C) elution buffer (5 mM Tris pH 8.5).The same samples were pooled and DNA concentration was measured using a Nanodrop.The samples were then concentrated to 100 μL using a speed vacuum concentrator.Next, the genomic DNA was fragmented by digesting 25ug of DNA in a final volume of 100 μL with HindIII, EcoRI, BsrGI, XbaI, and SspI restriction enzymes at 37 °C for 4 hours.We performed a fragment analysis of 3 μL of the digested sample on a 1% agarose gel stained with GelRed (SigmaAldrich).DNA was cleaned from the enzyme reaction using the Macherey-Nagel ™ NucleoSpin ™ Gel and PCR Clean-up Kit according to the manufacturer's protocol, eluting the fragmented DNA with 2 × 50 μL of nuclease-free water.
Clinical exome seq.Clinical exome sequencing was performed by TruSight One Sequencing Panel Kit (Illumina, San Diego, CA) that covers coding regions of 4813 genes associated with human disease.Library preparation was done in accordance with the manufacturer's instructions.Sequencing was performed on the Illumina MiSeq instrument with the 2 × 150 paired-end sequencing mode.Read alignment and sequence variant analysis were performed using NextGENe Software version 2.4.2.3 (SoftGenetics, State College, PA).The reads were mapped against the human GRCh38 (hg38) reference genome.More than 84% of the target regions were observed with least 20-fold coverage.Variants were recorded when a minimum of 5 reads aligned to the variant position, and the variant was observed in at least 20% of those reads (Fig. 2).Variants were annotated using the gnomAD v2.1.1 (GRCh38 LiftOver) database.The clinical interpretation of variants were extracted from the   ClinVar (NCBI) database.Analysis results of the samples were obtained from NextGENe as variant call format (vcf) files.Mutation reports (variant lists) were also exported as text tables (.tsv) and then merged, sorted and formatted using R v4.3.1 https://www.R-project.org).Reference files, databases used in the analyses, and non-default software settings are detailed in the Data Records section.

Reduced Representation Bisulfite Sequencing (RRBS). Reduced Representation Bisulfite Sequencing
(RRBS) is an established technique for identifying cytosine methylation in genomic regions most relevant to gene regulation.The genomic DNA first undergoes digestion by the MspI restriction enzyme, which recognizes CCGG sites, resulting in genomic fragments starting and ending with a CpG dinucleotide.Since CpGs are unevenly distributed throughout the genome, MspI digestion is followed by size selection that enriches for CpG-rich regions, including CpG islands.These regions are typically located in gene promoters and control gene expression through DNA methylation.We employed the premium RRBS kit of Diagenode to carry out the RRBS protocol including NGS library preparation, as described 26,27 .Briefly, 100 ng of DNA was fragmented by the MspI restriction enzyme, which cleaves DNA without regard to the cytosine methylation state.Following adaptor ligation and size selection by Ampure beads, we pooled up to 6 samples together.These pooled samples underwent bisulfite treatment, purification, and PCR amplification, according to the manufacturer's recommendations.The final RRBS libraries were quantified using the Qubit dsDNA HS Assay from Life Technologies, and the library's size distribution profile was assessed using Agilent Bioanalyzer 2100 capillary electrophoresis.Sequencing was performed on an Table 3. Reference data and reference databases used in this study.
Illumina HiSeq 2500 instrument using 1 × 50 single-end mode.Adapter sequences were removed from the raw reads using Trim Galore v0.6.10 https://github.com/FelixKrueger/TrimGalore(with Cutadapt v4.4 28 ).Following an initial quality trimming, sequences were trimmed off the 3′ end of the reads if at least 5 bp overlapped with the adapter sequence.Only reads that were no shorter than 35 bp after trimming were kept.Trim Galore was run in non-directional RRBS mode, which performs additional trimming to avoid using cytosine positions in methylation calls that were filled-in during the end-repair step.Low quality bases at the end of the NGS reads were trimmed off (quality trimming using TrimGalore, see RRBS data processing) before the methylation analysis.After read     trimming, bisulfite alignment to the GRCh38 (hg38) reference sequence and methylation calling were performed using Bismark v0.24.1 29 with bowtie2 v2.3.5.1 30 .After mapping, sam files were sorted using samtools v1.10 31 .
To investigate methylation level differences between ALS and Control samples, DMSs (differentially methylated sites, per CpG) and DMRs (differentially methylated regions, per 1000 bp window) were identified using the MethylKit v1.26.0 32 software package (with R v4.3.1.Both methylated and unmethylated CpG cytosines were imported as an input.Then, sites with fewer than 10 mapped reads and sites with exceptionally high coverage above the 99.9 percentile were excluded from the analysis.To ensure comparability between samples, the CpG methylation data were normalised, setting the median coverage to be the same for each sample (Table 2).
For differentially methylated site (DMS) analysis, each CpG site's DNA methylation status was evaluated individually 33 .A CpG site was retained for further analysis only if it had been sequenced and covered in at least 4 samples within each sample group.For the analysis of differentially methylated regions (DMRs), the genome was partitioned into 1000 bp tiling windows with a 1000 bp step-size, and DNA methylation levels were computed for each tiled region.A region was considered for analysis only if it had been sequenced and covered in at least 4 samples within each sample group, and it had a minimum of 5 covered CpG bases.DMS and DMR statistics were calculated by logistic regression using the 'calculateDiffMeth' function with default parameters.The threshold for significant methylation differences between sample groups was set to 20%, and the hits were further filtered by a significance value of q <= 0.01.The identified DMSs and DMRs were annotated by the overlapping genic regions, including promoter, exon, intron, and transcription start site (TSS).DMSs and DMRs resulted in extensive coverage of CpG islands (53-60%) and in sequences up to 2 kb distant termed CpG island shores (6-8%) 34 , while also maintaining substantial coverage of other genomic elements (32-41%) (Fig. 3).Description of all software parameter settings employed in the RRBS analysis, reference data and processed data utilized in the analysis can be found in the Data Records section.DRIP-seq.DNA-RNA Immunoprecipitation (DRIP) sequencing was started by blocking and coating magnetic beads (Dynal protein G). 2 × 50 μL of protein G beads were washed with 3 × 1 ml of 5 mg/ml PBS/1 mM EDTA/1% BSA solution using a MagnaRack.To the beads, we added 500 µl of 5 mg/ml PBS/1 mM EDTA/1% BSA solution and 25 μg of the RNA-DNA hybrid-specific S9.6 monoclonal antibody 35,36 .The mixture was incubated for 4 hours with rotation at 4 °C in a cold room and then washed with 1 ml of PBS/1 mM EDTA/1% BSA twice.For immunoprecipitation, we diluted each sample to 700 µl with ChIP lysis Buffer (50 mM Hepes/KOH pH 7,5, 0,14 M NaCl, 5 mM EDTA, 1% Triton X-100, 0,1% Na-Deoxycholate).We mixed 200 μL of the sample with 500 μL of ChIP lysis Buffer and added 2 × 700 µl of this mixture to 50 µl of S9.6-coated beads.After incubating overnight at 4 °C with rotation, we performed a series of washing steps using different buffers as described 22 .Finally, the  DNA was eluted from the beads by adding 100 µl of IP Elution buffer (50 mM Tris/HCl pH 8.0, 10 mM EDTA, 1% SDS), vortexing every 2 minutes, and incubating in a thermomixer for 15 minutes at 65 °C.We then used a MagnaRack to separate the supernatant and pipetted it into a new low-binding tube.To further purify the DNA, we conducted a PCR clean-up using the NTB M&N kit, eluting the DNA in 3 × 100 μL of H2O (pH > 7).Before NGS library preparation, we performed further fragmentation through sonication in a 1.5 ml low-bind tube (300 μl sample) using a Bioruptor (Diagenode) with 2 × 4 cycles of 30 seconds ON/OFF at LOW settings.Finally, we concentrated the samples to 30 μL using a speed vacuum concentrator and measured the concentration using the Qubit dsDNA HS Assay Kit.NGS library preparation and sequencing was performed as described 36,37 using the Illumina TruSeq ChIP Sample Preparation protocol.Briefly, DRIP DNA was end-repaired and indexed adapters were ligated to the inserts.Purified ligation products were then amplified by PCR.Amplified libraries were sequenced (at the EMBL Genomics Core Facility, Heidelberg) using an Illumina HiSeq 2500 instrument in the 2 × 125 paired-end sequencing mode.Quality of raw DRIP-seq reads were checked using FastQC v0.11.9 and the results were summarized using MultiQC v1.14 38 .DRIP-seq reads were aligned to the GRCh38 (hg38) reference sequence using bowtie2 v2.3.5.1 30 .Alignments were filtered using samtools v1.10 keeping only primary alignments of properly paired reads that had >30 mapping quality score.PCR and optical duplicates were also filtered out.RPKM normalized coverage (bigwig) files have been generated with 100 bp resolution using bamCoverage v3.5.1 39 .A representative genome browser track is shown in Fig. 4. Detailed software parameter settings can be found in the Data Records section.

Reference data information.
All reference data and databases used for CES, RRBS, and DRIP-seq analyses are listed in Table 3.
RRBS sample were higher than 28 (indicative of 'very good quality') (Fig. 6c).Per sequence Phred quality scores were higher than 28 (indicative of 'very good quality') for at least 95% of reads in each RRBS sample (Fig. 6d).
For the DRIP-seq experiments, raw sequence coverage ranged from 7.0-37.7 million reads (Fig. 7a), of which greater than 70 percent qualified as unique (Fig. 7b).Mean and Per sequence Phred quality scores for each read position in each DRIP-seq sample were high (indicative of 'very good quality') (Fig. 7c,d).

Usage Notes
The NGS results presented above reveal three distinct data layers in a small group of ALS patients, encompassing genetic changes, DNA methylation alterations, and R-loop modifications.Obtaining such data from individual patients is crucial as it enables the correlation of these variables both within and between individuals (Fig. 8).Such a correlative analysis is expected to unveil significant epigenetic patterns, potentially explaining the missing heritability (Mendelian inheritance) of gene mutations identified in ALS.The data presented here have been established using widely recognized software tools, making them readily adaptable for integration into any genome analysis workflow.We are confident that the data we provide can contribute to better understand the aetiology of this devastating disease .

Fig. 1
Fig.1Outline of the next generation sequencing (NGS) experiments.The workflow began with obtaining blood samples from ALS patients with well-defined clinical records and age-matched control individuals.From these blood samples, genomic DNA was extracted that underwent three simultaneous investigations: 1. Clinical exome sequencing (CES) was performed on 4813 genes commonly linked to human diseases.The protein-coding regions (exons) of the target genes were enriched using target capture microarray hybridization prior to NGS. 2. Reduced representation bisulfite sequencing (RRBS) was employed to analyze gene regulatory regions that influence gene transcription through DNA methylation.3. DNA-RNA hybrid immunoprecipitation sequencing (DRIP-seq) was utilized to capture chromosomal R-loops associated with genic and intergenic regions.

Fig. 3
Fig. 3 Most DNA methylation occurs in CpG islands and in sequences up to 2 kb distant termed CpG island shores.(a) Proportion of hypo-and hypermethylated sites (DMSs) and regions (DMRs) in ALS samples relative to healthy controls.The dominant change is hypomethylation.N: Total number of DMSs and DMRs.(b) Annotation of DMSs and DMRs to CpG islands (CpGi) and CpG shores of the human genome.

Fig. 4
Fig. 4 Representative genome browser track showing the RNA-DNA hybrid (R-loop) profile of ALS patients and healthy controls.

Fig. 5
Fig. 5 MultiQC validation of clinical exome seq (CES) data.(a) General NGS statistics of CES samples.(b) The Sequence Counts plot displays the total count of reads, categorized as either unique or duplicate.Identifying duplicates involves requiring an exact sequence match across the entire sequence length.To conduct this analysis, any reads longer than 75 bp were shortened to 50 bp.(c) Mean quality Phred scores are presented, with higher scores indicating better base calls.The graph's background color partitions the y-axis into regions denoting very good quality calls (green zone), calls of moderate quality (orange zone), and calls of poor quality (red zone).(d) Per sequence quality Phred scores are depicted, indicating whether a subset of sequences exhibits consistently low-quality values.The background color of the graph distinguishes very good quality calls (green), moderate quality calls (orange), and poor quality calls (red).

Fig. 8
Fig. 8 Correlative analysis of CES, RRBS, and DRIP profiles in ALS patients.Lolliplots show the mutational spectra (SNPs) of the displayed gene (Reference base/Alternate base).Missense, synonymous, and noncoding variants are shown in different colours.DNA methylation is represented by the ratio of methylated/unmethylated CpG dinucleotides.+ values: fw strand; -values: reverse strand.The position of CpG sites are shown as a proxy.DRIP stands for the RPKM normalized read coverage of RNA-DNA hybrid (R-loops) levels over the genomic region.

Table 1 .
Clinical 41aracteristics of individuals examined in this study.Five female and two male patients with at least two symptoms of the classic diagnostic hallmarks of ALS were recruited.None of the patients had any other comorbidities.Their median age was 51 years, and the median time from diagnosis to sampling was 15 months.Seven age-matched and gender-matched control individuals were also included in the studies.The severity of the diseases at the time of the onset of ALS patients is based on the ALSFRS * score41.The maximal score is 48: not having any signs and symptoms.Fig. 2 Representative clinical exome sequencing result of an ALS patient.Aligned reads are shown over the FUS gene.The list of variants can be found in the Data Records section.

Table 2 .
Methylation statistics of CpG sites.

Table 4 .
Summary of RRBS data generated in this study.

Table 5 .
Summary of CES data generated in this study.

Table 6 .
Summary of DRIP-seq data generated in this study.