Significant out-of-sample classification from methylation profile scoring for amyotrophic lateral sclerosis

We conducted DNA methylation association analyses using Illumina 450K data from whole blood for an Australian amyotrophic lateral sclerosis (ALS) case–control cohort (782 cases and 613 controls). Analyses used mixed linear models as implemented in the OSCA software. We found a significantly higher proportion of neutrophils in cases compared to controls, which replicated in an independent cohort from the Netherlands (1159 cases and 637 controls). The OSCA MOMENT linear mixed model has been shown in simulations to best account for confounders. When combined in a methylation profile score, the 25 most-associated probes identified by MOMENT significantly classified case–control status in the Netherlands sample (area under the curve, AUC = 0.65, CI95% = [0.62–0.68], p = 8.3 × 10^-22). The maximum AUC achieved was 0.69 (CI95% = [0.66–0.71], p = 4.3 × 10^-34) when cell-type proportion was included in the predictor.


C) Post-hoc statistical power calculation, based on the pre-determined NL cohort sample size (Ncases = 1159, Ncontrols = 637) and replication significance threshold p-value = 5.3 × 10^-4. Power is calculated as 1 minus the probability of a type II error, expressed as a percentage (y-axis), and is plotted against estimated effect sizes (x-axis). Red triangles represent the true effect sizes of probes with p-value < 5 × 10^-4 in MOMENT that are in common between datasets (m = 97), calculated from the AU sample. D) Post-hoc calculation of the sample size (y-axis) necessary to find a true association with the corresponding effect sizes (x-axis), based on pre-determined 80% power and replication significance threshold p-value = 5.3 × 10^-4.
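The two calculations described in panels C) and D) can be illustrated with a minimal sketch. It assumes a two-sided test of a standardized mean difference between cases and controls under a normal approximation; the caption does not state which test or effect-size scale was actually used, so the function names and the standardized effect size d are illustrative assumptions rather than the authors' procedure.

```python
import math
from statistics import NormalDist

_NORM = NormalDist()  # standard normal distribution


def power_two_group(d, n_cases, n_controls, alpha):
    """Approximate power of a two-sided test for a standardized
    mean difference d (assumed effect-size scale) at level alpha."""
    # Effective sample size for a two-group comparison of means.
    n_eff = n_cases * n_controls / (n_cases + n_controls)
    ncp = abs(d) * math.sqrt(n_eff)          # non-centrality of the z statistic
    z_crit = _NORM.inv_cdf(1 - alpha / 2)    # two-sided critical value
    # Probability of rejecting in either tail.
    return _NORM.cdf(ncp - z_crit) + _NORM.cdf(-ncp - z_crit)


def n_per_group_for_power(d, alpha, power=0.80):
    """Equal group size needed to reach the target power
    (standard formula that ignores the negligible opposite tail)."""
    z_alpha = _NORM.inv_cdf(1 - alpha / 2)
    z_beta = _NORM.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


if __name__ == "__main__":
    alpha = 5.3e-4  # replication threshold quoted in the caption
    for d in (0.1, 0.2, 0.3):  # hypothetical effect sizes
        pct = 100 * power_two_group(d, 1159, 637, alpha)
        print(f"d={d}: power={pct:.1f}%, n per group for 80%={n_per_group_for_power(d, alpha)}")
```

Under these assumptions, power at d = 0 equals the significance threshold itself, and the required sample size grows with the inverse square of the effect size.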
Out-of-sample classification results
The AUCs were then calculated for each ROC curve. Black: classification accuracy calculated from methylation profile scores based on BLUP solutions of methylation probes only; orange: classification accuracy calculated from methylation profile scores based on estimated fixed effects of predicted cell-type proportions (CTP) only; green: classification accuracy calculated from methylation profile scores based on the sum of the scores from BLUP and CTP. The combined BLUP+CTP score gives equal weight to the two contributing scores. Unequal weights may perform better, but another independent data set would be needed to estimate them.
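As a rough illustration of the quantities in this caption, the sketch below computes a rank-based AUC and an equally weighted sum of two scores. The function names and the optional weights are hypothetical; the caption only states that the two scores were summed with equal weight, and the toy values are not study data.

```python
def auc(scores, labels):
    """Rank-based AUC: probability that a randomly chosen case (label 1)
    scores higher than a randomly chosen control (label 0); ties count 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def combine_scores(blup_scores, ctp_scores, w_blup=1.0, w_ctp=1.0):
    """Weighted sum of two per-sample scores; the default weights give
    the equal-weight sum described in the caption."""
    return [w_blup * b + w_ctp * c for b, c in zip(blup_scores, ctp_scores)]


if __name__ == "__main__":
    # Toy values for illustration only.
    labels = [1, 1, 0, 0]
    blup = [0.4, 0.1, 0.2, -0.3]
    ctp = [0.2, 0.3, -0.1, 0.0]
    print("AUC (BLUP only):", auc(blup, labels))
    print("AUC (BLUP+CTP):", auc(combine_scores(blup, ctp), labels))
```

Any weights other than the defaults would have to be estimated in a data set independent of both the discovery and the target cohort, as the caption notes.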

Supplementary Table 1 - Area under the curve (AUC) of ALS-derived methylation profile scores, based on p-value thresholding from AU MOA or AU MOMENT, applied to an independent ALS cohort from the Netherlands. m - number of probes, CI95% - confidence interval at 95% level for AUC, P - p-values from logistic regression models.
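The p-value thresholding behind these scores can be sketched as follows, assuming the score for a sample is the sum of probe effect sizes multiplied by that sample's methylation values over all probes passing the threshold. The probe identifiers, data layout and function names are hypothetical and are not taken from the OSCA software.

```python
def select_probes(assoc, p_threshold):
    """Keep the effect sizes of probes whose discovery p-value is below
    the threshold. `assoc` maps probe id -> (effect size, p-value)."""
    return {probe: b for probe, (b, p) in assoc.items() if p < p_threshold}


def profile_score(sample_methylation, weights):
    """Methylation profile score for one sample: weighted sum of its
    methylation values over the selected probes (missing probes skipped)."""
    return sum(
        b * sample_methylation[probe]
        for probe, b in weights.items()
        if probe in sample_methylation
    )


if __name__ == "__main__":
    # Hypothetical discovery results and one target-cohort sample.
    assoc = {
        "probe_A": (2.0, 1e-6),
        "probe_B": (-1.0, 1e-3),
        "probe_C": (4.0, 0.2),
    }
    sample = {"probe_A": 0.5, "probe_B": 0.25, "probe_C": 0.75}
    weights = select_probes(assoc, p_threshold=1e-2)
    print(sorted(weights), profile_score(sample, weights))
```

Repeating this over a grid of thresholds yields one score per threshold, each of which can then be evaluated by AUC in the independent cohort.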

Supplementary Figure 5 - Scatter matrix and correlation of individual methylation profile scores (MPS) based on estimated effect sizes of DNA methylation sites from different mixed linear model (MLM) methods, and on effect sizes of cell-type proportions, excluding eosinophils. The latter were estimated from an OREML model. CTP - cell-type proportions.

Table 2 - DNA methylation (DNAm) sites with p < 1 × 10^-4 using the combined Australian cohort in MOMENT. Chr - chromosome number, Probe - probe identification number as provided by Illumina, bp - base pair position in the genome, Gene - closest genes the probe is annotated to, based on distance to transcription start site, Orientation - DNA strand orientation (F = forward, R = reverse), b_MOMENT - effect sizes (increase (positive sign) or decrease (negative sign) of methylation between cases and controls per standard deviation unit) of AU MOMENT, p_MOMENT - p-values of AU MOMENT.

Pre-adjusting DNAm probes has a more pronounced effect on standard linear regression MWAS results than on mixed linear model results

Supplementary Figure 6 - Comparison of MWAS results from linear and mixed linear model regression for the AU ALS cohort, before and after pre-adjustment of DNA methylation probes for technical and biological covariates. A), B) and C) -log10(p) of all probes in linear regression, MOA and MOMENT, respectively, before (y-axis) and after (x-axis) pre-adjustment, for the AU ALS dataset. Dashed blue lines mark the genome-wide significance threshold (p = 3.1 × 10^-7). Red dots mark all probes with p < 5 × 10^-4, as in D), E) and F). D), E) and F) Effect sizes of linear regression, MOA and MOMENT, respectively, before (y-axis) and after (x-axis) pre-adjustment, for the AU ALS dataset, of probes with p < 5x10

CGIs, but CpG shores [3]. These alterations were shown to have an inverse relationship with gene expression of associated genes, and they apply to shores located within 2 kb of an annotated transcriptional start site, but leave open the possibility of additional regulatory function for shores located in intragenic regions or gene deserts [3].
Additionally, four of the eight probes with co-annotations were annotated to intronic regions (Supplementary Figure 7), three of which were located in CpG shores (annotated to CXXC5, FKBP5 and AGRN) and one in a CGI (annotated to SYNPO). DNA methylation of intragenic regions is also correlated with higher levels of gene transcription and may be a mechanism that regulates the use of alternative promoters [5,6].

however, explain the contribution of neuroinflammation to the disease process, explain the initial protein misfolding, account for non-cell-autonomous influences or some of the muscle-specific modifiers of disease progression [9]. Indeed, due to the clear multifactorial nature of ALS and the extensive pleiotropic effects of disease-altering variants [10,11], one should be wary that some disease-related tissue abnormalities can interfere with potential therapeutic properties.

For example, AGRN has been extensively studied in the context of the neuromuscular junction (NMJ), where the encoded protein plays a key role as a regulator of postsynaptic differentiation [12]. Overexpressing AGRN (and other genes involved in NMJ formation) in mouse models of ALS (and other neuromuscular disorders) has been shown to have therapeutic benefits [12]. However, in the context of the immune system, expression of AGRN (and of many other genes) was found to be associated with activation in monocytes from rapidly progressing ALS patients [13].

As another example, insulin-like growth factor I (IGF-I) is a well-studied trophic factor for different tissues, including the nervous system and skeletal muscle. Animal studies have shown that hSOD1 G93A mice at "end stage of disease" had normal levels of muscle IGF-I protein expression, but decreased circulating levels of IGF-I and decreased skeletal muscle IGF-IRα protein expression [14,15]. More importantly, these changes varied with disease progression.
Interestingly, IGF-I-directed interventions prolong survival in a mouse model of ALS [16-18], whereas Growth-Hormone/IGF-I therapies were of no benefit in slowing disease progression in human ALS patients [19-21]. These seemingly inconsistent results illustrate the importance of acknowledging species-specific differences in disease progression, as well as disease stage-specific or tissue-specific interventions that may account for the improved outcome in mouse models of disease.
Protein misfolding is a key feature of all neurodegenerative disorders, including alpha-synuclein in Parkinson's disease, tau and Aβ in Alzheimer's disease (AD), prion protein in prion diseases, polyglutamine disease proteins in polyglutamine repeat diseases (e.g., huntingtin in Huntington's disease), and SOD1 in amyotrophic lateral sclerosis. Peptidyl-prolyl cis/trans isomerases (PPIases), a unique family of molecular chaperones, regulate protein folding at proline residues. As in the examples given above, some members of the PPIase family have been shown to exert positive effects on neuron function and others negative effects [22]. For example, FKBP5 is a PPIase that acts as a co-chaperone; it not only modulates glucocorticoid receptor activity in response to stressors [23] but has also been implicated in AD [24], where it shows neurotoxic effects. Interestingly, it was also shown to be overexpressed in monocytes of ALS patients compared to controls [13], once again highlighting the potentially widespread pleiotropic gene effects in complex disease traits.

CXXC5 and neurodegeneration
Genomic variation in the CXXC5 locus has not been identified as contributing to ALS risk but it may play a relevant functional role. CXXC5 encodes a retinoid inducible transcription factor containing a CXXC-type zinc finger motif. In the mature central nervous system (CNS), retinoids have been implicated in the maintenance of plasticity and neural stem cell production [25].
Retinoids can be converted into various retinoid species, including retinoic acid (RA) or retinol molecules, which are able to enter the cell. Once within the cell, RA can bind to a family of retinol-specific binding proteins, the cellular retinoic acid-binding proteins, which are involved in the metabolism and nuclear import of RA [26,27]. Elevated RA signaling in the adult correlates with axon outgrowth and nerve regeneration and has also been shown to be involved in the maintenance of the differentiated state of adult neurons [25]. Interestingly, it has been reported that disruption of RA signaling may also lead to degeneration of motor neurons [28]. All of these processes are relevant to ALS pathogenesis. CXXC5 is ubiquitously expressed throughout human tissues, with relatively high expression in the brain. Current human exome and genome data suggest an essential functional role for CXXC5, as the observed loss-of-function variation is lower than expected (pLI = 0.89, gnomAD). This may be due to its role in the CNS, which has been extensively studied in mouse models, where CXXC5 was characterized as a BMP4-regulated modulator of Wnt signaling in neural stem cells and also as an important myelination factor, controlling multiple genes involved in myelination in oligodendrocytes [29,30]. Moreover, ALS patient-derived oligodendrocytes (from both induced pluripotent stem cells and induced neural progenitor cells) have recently been found to play an active role in motor neuron death [31]. Thus, further research may be warranted to better understand the putative link between CXXC5 and neurodegeneration in the context of ALS.
Supplementary Note - QC parameters for exclusion of samples and probes.
6. Exclude samples whose mean value for control DNAm sites exceeded 5 SD from the mean across all DNAm sites;
7. Exclude samples whose median methylated vs unmethylated signal intensity for all control DNAm sites exceeded 3 SD;
8. Calculate the difference between median chromosome Y and chromosome X probe intensities ("XY diff"). The cutoff for sex differentiation was "XY diff" = -2. Exclude samples whose XY diff deviates by more than 5 SD (sex outliers);
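The SD-based exclusion rules in steps 6 to 8 can be sketched as below. This is an illustration under stated assumptions, not the pipeline actually used: it assumes an outlier is a sample lying more than a given number of sample standard deviations from the mean, and that XY diff outliers are assessed within the two groups on either side of the -2 cutoff. All function names and toy values are hypothetical.

```python
from statistics import mean, stdev


def flag_outliers(values, n_sd):
    """Indices of values lying more than n_sd sample standard deviations
    from the mean (assumed definition of an outlier)."""
    if len(values) < 2:
        return []
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mu) > n_sd * sd]


def split_by_cutoff(xy_diff, cutoff=-2.0):
    """Indices of samples below and at-or-above the XY diff cutoff.
    Which side corresponds to which sex is not stated in the note."""
    below = [i for i, v in enumerate(xy_diff) if v < cutoff]
    above = [i for i, v in enumerate(xy_diff) if v >= cutoff]
    return below, above


def sex_outliers(xy_diff, cutoff=-2.0, n_sd=5):
    """Samples whose XY diff is an outlier within their own cutoff group
    (the within-group comparison is an assumption)."""
    flagged = []
    for group in split_by_cutoff(xy_diff, cutoff):
        local = flag_outliers([xy_diff[i] for i in group], n_sd)
        flagged.extend(group[j] for j in local)
    return sorted(flagged)


if __name__ == "__main__":
    control_means = [0.0] * 20 + [100.0]  # toy values, one extreme sample
    print(flag_outliers(control_means, n_sd=3))
```

Note that with few samples a single extreme value inflates the standard deviation, so a strict threshold such as 5 SD can only flag samples when the group is reasonably large.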