Detecting methylation signatures in neurodegenerative disease by density-based clustering of applications with reducing noise

There have been numerous genetic and epigenetic datasets generated for the study of complex disease including neurodegenerative disease. However, analysis of such data often suffers from detecting the outliers of the samples, which subsequently affects the extraction of the true biological signals involved in the disease. To address this critical issue, we developed a novel framework for identifying methylation signatures using consecutive adaptation of a well-known outlier detection algorithm, density based clustering of applications with reducing noise (DBSCAN) followed by hierarchical clustering. We applied the framework to two representative neurodegenerative diseases, Alzheimer’s disease (AD) and Down syndrome (DS), using DNA methylation datasets from public sources (Gene Expression Omnibus, GEO accession ID: GSE74486). We first applied DBSCAN algorithm to eliminate outliers, and then used Limma statistical method to determine differentially methylated genes. Next, hierarchical clustering technique was applied to detect gene modules. Our analysis identified a methylation signature comprising 21 genes for AD and a methylation signature comprising 89 genes for DS, respectively. Our evaluation indicated that these two signatures could lead to high classification accuracy values (92% and 70%) for these two diseases. In summary, this framework will be useful to better detect outlier-free genetic and epigenetic signatures in various complex diseases and their developmental stages.


Scientific Reports
| (2020) 10:22164 | https://doi.org/10.1038/s41598-020-78463-3 www.nature.com/scientificreports/ neuron samples were used as the control samples. In the GEO dataset GSE74486, the AD sample IDs were GSM1921521-GSM1921523 and the DS sample IDs were GSM1921524-GSM1921532, whereas normal sample IDs were GSM1921533-GSM1921543. The total number of Reference IDs (CpGs along with the IDs started with "rs" and with "ch") was 485,577. First, we eliminated the feature IDs starting with "rs" or "ch", and kept only the CpGs. We then discarded the CpGs that had zero values in all samples or that had the missing value in any of the samples. Min-max normalization technique was used to normalize the data for each individual CpG. On the other hand, we collected the mapping information of CpG sites and official gene symbols ("UCSC_RefGene_Name") through the annotation file "Illumina HumanMethylation450 BeadChip" (HumanMethylation450_15017482) (NCBI Ref. ID: GPL13534-11288). In the annotation file, there is either a one-to-many, many-to-one, or many-to-many relationship between CpGs and genes. We first chose the CpGs connected to each matched gene symbol, performing an average operation on all the CpG sites of each individual gene symbol to obtain a unique methylation data vector for each gene. Then, we conducted analyses for the two neurodegenerative diseases, AD and DS.
Outlier detection through DBSCAN clustering algorithm. Since the sample size is small, applying the noise removal clustering algorithm prior to using any statistical test is extremely helpful. Specifically, we conducted DBSCAN clustering algorithm 62 using those unique gene vectors and filtered out the noisy features for each analysis. In detail, for each disease, we preliminarily estimated the knee point through KNN distance plot, and that knee value was used as the eps-neighborhood value. Other parameters were set by default (e.g., MinPts = 5, weights = NULL, borderPoints = FALSE). This generated a few density-based clusters, while each contained core, border and outlier (noisy) features. We then discarded these noisy features and further proceeded with noise-free features for the statistical analysis. The resultant cluster plot obtained by the DBSCAN was provided to visualize those core, border and outlier features clearly.
In DBSCAN clustering, two required parameters were epsilon (eps) and minimum points (MinPts). DBSCAN is somewhat sensitive to parameter settings of eps and MinPts, but there is no specific theory that can completely guide the setting of its parameters [62][63][64] . The eps, the radius of the neighborhood around any point, was considered as ǫ-neighborhood (epsilon-neighborhood) of the point. MinPts is the minimum number of neighbors inside the eps radius. If a point has a neighbor count value higher than or equal to MinPts, the point is stated www.nature.com/scientificreports/ as a core point. Whenever the number of the neighbors of a point is less than MinPts, but the point belongs to the ǫ-neighborhood of a core point, the point is called as border point. On the other hand, if a point is neither a core point nor a border point, that point is considered as a noisy point. Our aim is to find the dense regions that can be evaluated by the number of objects (points) close to a specified point. In our study, we preliminarily estimated the knee-point through KNN (K-nearest neighbor) distance plot for each disease. To evaluate the knee-point, first KNN distances were computed and then sorted. Thereafter, they were scaled in between 0 and 1 ([0,1]). The derivative was then estimated. Finally, the first point in which the derivative was higher than a certain value, 1, was considered as knee-point. That corresponding scaled distance value of the knee-point was considered as eps-neighborhood value.
Limma statistical analysis and identifying differentially methylated genes. After detection of the noise-free initial clusters through DBSCAN, we conducted Voom normalization 65 and Limma statistical analysis [66][67][68] , consecutively on the features belonging to the noise-free clusters for identifying differentially methylated genes for the two experiments. In Limma, empirical Bayes and moderated t-statistic had been utilized for design. The moderated t statistic used in Limma could be demonstrated as follows: where n denotes the sample size ( n = n 1 + n 2 ); while β g and s 2 g refer to the contrast estimator and posterior sample variance for the feature g, respectively. The statistic for evaluating the contrast estimator for feature g can be denoted as follows: where N is the normal distribution. However, the statistic to evaluate the posterior sample variance for the feature g is described as: where s 2 0 and d 0 ( < ∞ ) denote the prior variance and prior degrees of freedom, respectively, and s 2 g and d g ( > 0 ) are the experimental sample variance and experimental degrees of freedom for the feature g, respectively. After computing the t score by Limma, the p value for each feature (gene) g is evaluated. Whenever the p value of the gene is less than 0.05, the gene can be defined as differentially methylated gene.
Gene module detection through hierarchical clustering. After obtaining the differentially methylated genes, we estimated the power for determining the soft-thresholding, and then applied this power to compute the adjacency matrix using Pearson's correlation co-efficient. Next, we evaluated the topological overlap matrix (TOM) similarity and corresponding distance matrix from the adjacency matrix. Average linkage clustering and dynamic tree cut methods [69][70][71][72][73] were applied consecutively to generate the gene modules highlighted by different colors. After obtaining the gene modules, we estimated the scores of several cluster validity index parameters such as Ball_hall, Davies_bouldin, Dunn, G_plus, Gdi11, Gdi12, Gdi31 and Ray_turi.
Correlation analysis and detection of gene signature. After obtaining the gene modules, Pearson's correlation coefficient (PCC) was computed among each gene-pair belonging to each gene module. Finally, the average correlation score for each cluster was obtained. The cluster that had the highest average PCC, was selected as the potential gene signature consisting of all differentially methylated genes.
Evaluation of signature through sample group classification. To verify the classification performance of the resultant signature, we applied Random Forest (RF) classifier using k-fold cross-validation (CV) ( k = 2, 3, 4, . . . ) on all the samples using all the features of the signature to classify two groups (AD/DS and control). The entire process was repeated many times. Finally, we computed the average classification accuracy and the area under the curve (AUC).
Gene set enrichment analysis. In addition, we conducted gene set enrichment analysis using KEGG pathways and Gene Ontology (GO) terms available at DAVID online database (version 6.8) 74 . GO terms include three types, Biological Process (BP), Cellular Component (CC) and Molecular Function (MF). A KEGG pathway or GO term whose enrichment p value was less than 0.05, was considered as statistically significant. For more detail about the flowchart of the framework, see Supplementary Figure S5.

Results
Identification of non-redundant CpGs. We found a total of 485,577 features (IDs) in the initial analysis of the data (GEO GSE74486). After removing the redundant IDs that started with "rs" or "ch", the number of CpGs was reduced to 482,421. We then eliminated the CpGs that had zero values in all samples or had missing value in any of the samples. After this filtering, we obtained a total of 435,662 CpGs. Next, we performed minmax normalization technique to scale all the data for each CpG. We collected the mapping information of CpG sites and official gene symbols through the annotation file (see "Methods"). We first selected those CpGs related  Figure S1). This knee point was used as eps-neighborhood value. The DBSCAN generated two clusters, one of which contained 19,592 core features and 206 border features. The second cluster had only 10 core features and no border feature, whereas the number of noisy features were 439 (Table 1). We then discarded these noisy features and further proceeded with the features ( = 19,808 ) belonging to these two clusters for the statistical analysis. The clusters with core and border features were denoted in blue and orange whereas outliers were depicted by green dots (Supplementary Figure S2). Similarly, in the comparison of DS vs control samples, we identified the knee point ( = 0.4 represented in Fig. 2A and Supplementary Figure S3) that was applied as eps-neighborhood value. This resulted in three clusters. One cluster consisted of 18,148 core and 559 border features while the remaining two clusters contained only 10 and 5 core features, respectively, with no border feature ( Table 1). The number of noisy features were 1525 (Table 1). We then eliminated these noisy features, and further proceeded with the non-noisy features ( = 18,722 ) belonging to these three clusters for the statistical analysis. Figure 2B and Supplementary Figure S4 illustrated the three clusters containing the core, border and outlier features.

Identification of differentially methylated genes using limma.
After the pre-filtering analysis by DBSCAN clustering, we conducted Voom normalization and Limma statistical analysis, consecutively to identify differentially methylated genes for the two analyses. This resulted in a total of 229 differentially methylated genes, among which 133 were hyper-methylated and 96 were hypo-methylated in the comparison of the AD versus control samples. These numbers were 1062, among which 135 genes were hyper-methylated and the remaining 927 were hypo-methylated in the comparison of the DS versus control samples. Figure 3A presents the Voom plot for DS vs control.
Detection of gene modules. After finding the set of the differentially methylated genes, we first estimated the power value of soft-thresholding, and then used the power to compute the adjacency matrix using Pearson's correlation. Then the TOM score was computed and distance score was determined. Next, average linkage clustering and dynamic tree cut methods were used to identify gene modules. For the AD vs control analysis, we obtained a total of six modules. The number of participating differentially methylated genes for these six modules (illustrated by turquoise, brown, yellow, blue, red and green colors) were 83, 31, 28, 39, 21, and 27, respectively. Similarly, for the DS vs control analysis, using the power value ( Fig. 3B), we generated a total of six modules. The number of participating differentially methylated genes for these six modules (colored by turquoise, yellow, brown, green, red and blue) were 380, 164, 172, 89, 71 and 184, respectively. Figure 3C shows  Table 4 summarizes the classification accuracy, AUC and precision values for each disease analysis.   Table 5 summarizes the resultant significant GO terms with enrichment p value for Table 3. Comparison of various cluster validity indices between our proposed method and k-means clustering for DS vs control. Higher value signifies better than the other value in the same row (cluster validity index).

Cluster validity index
Proposed method K-means clustering   Table 6 summarizes the resultant significant GO terms with enrichment p value for DS versus control analysis.
Comparison with other method. In addition, we provided a comparative study of the scores of the eight cluster validity index measures between our proposed method and a well-known existing clustering method, k-means 77 . Of note, in the case of k-means clustering, we provided same cluster size that was estimated in our proposed method. For the experiment of AD vs control, Ball_hall scores for our proposed method and k-means clustering method were 0.331 and 0.111, respectively, whereas Davies_bouldin scores for them were 5.205 and 2.090, respectively. The detailed information for these eight measures for our proposed method and k-means for AD vs control and DS vs control was provided in Tables 2 and 3, respectively. In summary, we obtained better scores in our proposed DBSCAN based method than k-means for all of these eight validity index measures, indicating that our method has better performance than k-means clustering.
Validation. For independent validation of our findings obtained in the analysis of DS vs control in FC neuron tissue, we used another 5mC and 5hmC methylation profile of the same disease (DS). This disease is from the same GEO accession ID (GSE74486) but with different tissue (cerebellum). To validate our resultant gene signature obtained by previous dataset, first we selected all the features belonging to the gene signature ( = 89 ), and then identified the corresponding sub-data of those features from the external dataset. Next, we applied several types of CVs (2-fold and 8-fold) on all the samples and then applied Random Forest classifier to classify two classes (diseased and normal groups) with the repetition of 10 times. For the 2-fold CV, we obtained 97.00% average accuracy, while for the 8-fold CV, the average accuracy was 97.80%. This validation analysis supported that our resultant gene signature provided excellent average classification accuracy in the similar methylation data. Furthermore, we applied our entire proposed framework on the second new dataset (DS vs control: cerebellum tissue), and evaluated the average accuracy and AUC. Specifically, there were initially 482,421 features (IDs) in the beginning of the analysis along with 13 DS cerebellum and ten control samples. We initially selected 443,020 CpGs, and then obtained 20,252 gene-vectors from them. Next, in DBSCAN, four resultant clusters were generated of which one cluster consisted of 16,736 core and 753 border features, while the number of core features for the remaining three clusters were 5, 10 and 5, respectively, Notably, we found no border feature for these three remaining clusters. The number of outlier features were 2743. We then selected only the non-outlier   www.nature.com/scientificreports/ features belonging to these four clusters ( = 17,509 ) for the next analysis. Moreover, we identified 1467 differentially methylated genes of which 607 were hyper-methylated and remaining 860 were hypo-methylated. A total of two gene modules was then detected by dynamic tree cut method of which 701 genes included in blue module and 762 genes in turquoise module (Supplementary Figure S11). Turquoise cluster, which had higher average PCC value ( = 0.742 ) than the other cluster was denoted as potential gene signature. This signature had 762 genes. We applied 8-fold CV on all the samples of those 762 features and then used Random Forest classifier with a repetition of 10 times. We obtained 82.60% average accuracy, 86.20% average sensitivity, 78.00% average specificity, 83.60% average precision, and 0.818 AUC value (Supplementary Figure S8).
Generalized cis-regulatory enrichment analysis. We performed the generalized cis-regulatory enrichment analysis (i-cisTarget) using its web tool (https ://gbiom ed.kuleu ven.be/apps/lcb/i-cisTa rget/) 78,79 on the resultant 89-gene signature for DS vs control (FC Neuron). Specifically, we used the 89 genes as the input to the i-cisTarget tool. We found 16 enriched features containing the normalized enrichment score (NES) > 3.0 . Among these 16 enriched features, top three measured by NES score were 1) "GSM1208590_batch1_chrom1_LoVo_ ARNT_PassedQC_peaks_hg19" (NES 4.36), 2) "GSM1208674_batch1_chrom1_LoVo_SMAD2_PassedQC_ peaks_hg19" (NES 3.92), and 3) "GSM1208673_batch1_chrom1_LoVo_RXRA_PassedQC_peaks_hg19" (NES 3.82). The barplot of p value vs AUC in the TF binding sites for the prediction of regulatory features and cisregulatory modules for the 89-gene signature was represented in Supplementary Figure S9(A), while the plot of #predicted regions vs rank in the best (topmost) feature was also illustrated in Supplementary Figure S9(B). The significantly high ranked regions (mentioned in Supplementary Figure S9) in UCSC Genome Browser for the prediction of regulatory features and cis-regulatory modules for the 89-gene signature was also shown in Supplementary Figure S10. The seventeen significantly highly ranked regions for the topmost feature were provided in Supplementary Table S2. For example, top three region IDs were chr3-reg108831, chr3-reg108833 and chr3-reg108832, while their associated gene was RARRES1, as part of the 89-gene signature. Next four region IDs were chr12-reg6728, chr12-reg6726, chr12-reg6725 and chr12-reg6724 whose associated gene was LTBR that was part of the 89-gene signature. Notably, the list of 16 enriched features obtained from the generalized cis-regulatory enrichment analysis for the 89-gene signature by i-cisTarget web tool was mentioned in Supplementary File S1.
Additional analysis. The sample size in the previous validation is small, but the methylation data were generated from Fc neuron directly. To further validate our method, we used another recent real-life Alzheimer's Disease (AD) data with larger sample size [NCBI GEO ID: GSE134379]. This data consisted of 4,11,157 CpG cites (features) and a total of 404 samples including 225 AD samples and 179 control samples from cerebellum (CBL) brain region of Illumina 450K methylation array. We applied our proposed method on this dataset. After  www.nature.com/scientificreports/ the removal of redundant CpGs and the mapping between CpGs and corresponding genes, we obtained a total of 20,190 gene vectors (non-redundant features). After applying the DBSCAN clustering algorithm on the 20,190 gene vectors, we discarded the noisy features. In this regard, we first estimated the knee point through KNN distance plot ( = 2.5 , as marked by the red dotted line in Fig. 5C). This knee point was used as eps-neighborhood value. The DBSCAN generated two clusters, one of which consisted of 19,711 core features and 71 border features. The second cluster contained only 10 core features while no border feature. Finally, the number of noisy features were 398. We then omitted these 398 noisy features and further proceeded with the remaining features ( = 19,792 ) belonging to these two clusters for the statistical analysis. Figure 5D showed the two clusters containing the core, border and outlier features. Thereafter, we determined a total of differentially methylated genes using Limma-Voom statistical analysis. This resulted in a total of 887 differentially methylated genes (Fig. 5A). Thereafter, by using the soft-thresholding, adjacency matrix using Pearson's correlation, TOM score, distance score, average linkage clustering and dynamic tree cut methods, respectively, on these 887 differentially methylated genes, we detected five gene modules. The number of participating differentially methylated genes for these modules (illustrated by blue, brown, green, turquoise and yellow colors in Fig. 5B)   www.nature.com/scientificreports/

Discussion
So far, methylation gene signatures have been reported in many diseases or cellular conditions. They include two highly stable methylation variants, 5mC and 5hmC, that play some roles in several neurodegenerative diseases, such as AD and DS. Specially, one primary concern regarding the large but complex biological data analysis is the reduction of the noise. To meet this strong challenge, in this study, we proposed a new framework to identify outlier-free DNA methylation signatures through the utilization of the well-reputed DBSCAN clustering algorithm and hierarchical clustering, and applied it for the 5mC and 5hmC labeled methylation data (GEO ID: GSE74486) in the tissue of FC neuron for two neurodegenerative diseases: AD and DS. We first performed various pre-filtering analyses to initially remove the redundant CpGs. In this study, two sets of analyses had been conducted, (1) AD vs control, and (2) DS vs control, both using the methylation data from the FC neuron tissue, which is a critical tissue to study AD and DS. In our study, we preliminarily estimated the knee-point through KNN distance plot for each disease 62 . Next, the evolved knee-point would be used as a parameter eps in the next step, DBSCAN clustering algorithm where default settings of other parameters were also set to discard the outliers from the methylation data. Since the sample size was not very high, the DBSCAN was highly effective to filter out the outliers. Interestingly, DBSCAN is somewhat sensitive to parameter settings specially in eps and MinPts, but there is no specific theory which can provide entire guidance regarding the setting of its parameters [62][63][64] . However, Limma statistical method was then used to identify differentially methylated genes. Consecutive utilization of average linkage clustering and dynamic tree cut generated some colored gene modules ( = 6 for both the experiments). The module cluster having best average correlation score was considered as a potential methylated gene signature for AD (21-gene signature) as well as DS (89-gene signature). We obtained satisfactory classification accuracy for disease group from these signatures (92% for AD vs control, and 70% for DS vs control). Our proposed method had better performance than k-means clustering in terms of various cluster validity index measures.
There are a couple of reasons of obtaining big difference in the results: (1) the imbalanced dataset (i.e., higher number of features and very low number of samples) and (2) noise found in the datasets. The total number of initial features was 4,85,577, while total number of samples used for AD vs control experiment was 14 and total number of samples used for DS vs control experiment was 20. Thus, due to these extreme imbalanced datasets, the intermediate result up to the signature detection might be varied. For example, use of statistical test with smaller population size of samples are difficult. As a result, classification accuracy was be varied highly. The second reason is high noise in the dataset. To reduce noisy features, we applied DBSCAN clustering technique initially to detect noisy features for improve the later results. For the additional dataset (AD vs control in CBL brain region), we also obtained good average accuracy ( ∼ 80% ) to detect the two class classification for the respective signature genes that is obviously a good indication of the validation of the proposed method.
However, as future work, We will need to integrate the other types of data (e.g., gene expression, copy number variation, chromatin remodeling, etc.) in our analysis. Finally, our resultant outlier-free signatures might be useful for the potential detection of onset or progression of neurodegenerative diseases, as methylation is critical in such diseases.