Introduction

The number of people worldwide suffering from Alzheimer’s disease (AD) has been steadily increasing in recent decades1. AD is an irreversible disease that slowly and progressively destroys the brain. Specifically, AD affects memory dysfunction, representing a major cause of dementia in the aging population2,3. However, AD is not a normal component of aging, but is instead a complex disease entity, and the detailed pathogenic mechanisms underlying the disease remain unclear. It has been generally recognized that accumulation of plaques (beta-amyloid) and tangles (tau) is the leading cause of AD4, and the prime suspect contributing to the associated neuronal destruction5,6. Although the specific causes of beta-amyloid and tau accumulation are unknown, this pathogenic event is considered to be the result of various interacting genetic and environmental factors7. Therefore, it is important to address the complexity of AD by detecting the underlying characteristics.

One approach to disentangle a complex disease is gene expression analysis, including the identification of potential candidate genes or comparing expression values for specific AD-related genes. Indeed, several studies have discovered AD-related genes and mechanisms using genome-wide analyses8,9,10,11. In particular, AD analyses using blood samples from patients have received considerable attention as a novel method of diagnosis given the advantages of the non-invasive nature and less expensive process compared to traditional analyses using brain tissue or imaging. In fact, several differentially expressed proteins in the AD brain have been identified in the blood of AD patients12. Bu et al.13 showed that the amyloid-beta protein is produced not only in the brain, but also in the peripheral tissues of AD patients. Therefore, many studies have focused on machine-learning and statistical methods to improve early detection and for the identification of candidate genes that can serve as targets for AD treatment based on data obtained from peripheral samples14,15,16,17.

Moreover, recent studies have indicated an association of AD with other diseases18,19,20. Several genetic factors and pathophysiological mechanisms associated with AD are also shared with other AD-related diseases21. Therefore, identifying the relationships between AD and AD-related diseases can help to address the complexity of AD and bring us a step closer to realizing the personalized treatment of AD by reducing risk factors from comorbidities and giving a chance to treat common dysregulated pathways between AD and it related diseases22,23. Narayanan et al.24 investigated co-regulated genes in AD and Huntington’s disease, and identified common differentially co-expressed subnetworks in the two neurodegenerative diseases. In particular, type 2 diabetes (T2D) is highly associated with AD, and epidemiological studies have shown that patients with T2D have a higher risk of developing AD25. One of the hallmark pathophysiologic features in T2D patients is the deposition of amyloid converted from islet amyloid peptide (IAPP)26. Overproduction of IAPP secreted by pancreatic beta-cells may cause beta cell loss in T2D, and there are evidences that intracellular toxic amyloid peptide oligomers are associated with AD27. Given that several lines of evidence indicate a link between AD and T2D, AD can be considered as a brain-specific type of diabetes that has been dubbed “type 3 diabetes”, and inflammation, insulin resistance, and mitochondrial dysfunction were considered as common pathogenesis of two diseases28,29,30.

Accordingly, the aim of this study was to detect the common and distinct characteristics of AD and T2D using gene expression datasets. Because not all AD patients show an association with T2D, we extracted genes with significantly different expression profiles in AD and T2D patients separately using computational and statistical methods, and then defined subgroups of AD and T2D compared to their respective controls. Based on this analysis, we identified common genes with the same regulation directions (up or downregulated in the disease) in each pair of patient subgroups. Through this approach, we identified candidate genes among the common genes with significant differences in expression levels in the subgroups of the two diseases, which were tested as potential biomarkers for diagnostic or prognostic prediction using an independent Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset31. Figure 1 illustrates the procedure for relation extraction between two diseases.

Figure 1
figure 1

Procedure for relation extraction between two diseases. (a) Gene expression data of AD and T2D patients are given. (b) The convex non-negative matrix factorization (NMF) is used to decompose the input expression matrices. Gene and sample clusters are obtained from the NMF decomposed matrices. (c) DEG genes are assigned to clusters. (d) Common candidate genes are extracted from a related cluster pair between two diseases. (b) is generated by the R software (R version 3.6.1, https://www.r-project.org/) using an example dataset.

Methods

Data description

The mRNA expression datasets for AD and T2D were downloaded from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/). For AD analysis, we used integrated data on peripheral blood gene expression profiles from the GSE63060 and GSE63061 datasets using an R software32, which were generated on the Illumina HumanHT-12 v3.0 Expression BeadChip and the Illumina HumanHT-12 v4.0 Expression BeadChip, respectively. The GSE63060 dataset contains data for 145 AD patients and 104 control samples, and the GSE63061 dataset contains data for 140 AD patients and 135 control samples. For T2D, we used gene expression data (GSE78721) including 68 T2D patients and 62 healthy controls from adipocytes and infiltration macrophages because it was the largest T2D dataset in GEO, and adipose-derived transcription signature is associated with T2D33,34. Gene expression data were generated using the Affymetrix PrimeView Human Gene Expression Array.

For each gene expression data point from the GEO datasets, the probe ID was converted into the Entrez Gene ID using information from the platforms of each AD and T2D datasets (e.g., GPL6947, GPL10558, and GPL15207). Gene expression levels of the probe that did not match the Entrez Gene ID were removed. Among all the assigned Entrez Gene IDs, protein-coding genes were selected using database Homo_sapiens.GRCh38.94 (http://asia.ensembl.org/Homo_sapiens/Info/Index), where Ensembl IDs were converted to the Entrez Gene IDs using the “biomaRt” package in R software. For duplicated Entrez Gene IDs, the expression values of the same Entrez Gene ID were merged into the mean value. To analyze the AD data, we merged the two datasets (GSE63060 and GSE63061). The number of Entrez Gene IDs selected by the protein-coding gene database in each AD dataset was 16,730 and 14,957, respectively. We selected 14,134 common genes from the two datasets and used the “removeBatchEffect” function in the R package “limma”. As a result, we obtained the expression data of 14,134 genes from the 285 AD patients and 239 control samples. In addition, we normalized each data point in the patient expression data by applying a log2(fold change) conversion as follows:

$$\begin{aligned} y_{ij} = \log _{2}{\frac{patient_{ij}}{normal_i}} \end{aligned}$$

where \(patient_{ij}\) is the expression value of the gene i of the jth patient and \(normal_i\) is the mean expression value of the gene i in the normal control samples.

Decomposition of gene expression data sets

Most gene expression datasets contain information on thousands of genes, which is relatively large compared to the number of samples; therefore, several studies have applied dimensionality reduction methods to reduce the matrix dimension35,36. The non-negative matrix factorization (NMF) method has been widely used to reduce the dimension of an input matrix by decomposing a non-negative input matrix into two matrices37. Assuming an input matrix A consisting of the expression data of N genes and M samples, NMF produces the matrices W and H of size \(N\times k\) and \(k\times M\), respectively, in which the parameter k indicates the number of clusters desired in the input data. The NMF algorithm is a multiplication update algorithm that multiplies W and H to obtain the input A during iteration until convergence. After convergence, the matrices W and H are used to bi-cluster genes and samples38,39. Each row of \(W (gene\times k)\) and each column of \(H (k\times sample)\) can be represented by a positive linear combination of k. The element \(w_{ij}\) in the W matrix is the coefficient of gene i and cluster \(j (1\sim k)\), and the element \(h_{ij}\) in the H matrix is the coefficient of cluster \(i (1\sim k)\) and sample j (Fig. 1b).

Determination of the optimal number of clusters according to the rank k

To select the optimal number of meaningful clusters that correctly divide the input data, the cophenetic correlation coefficient should be taken into consideration37. NMF updates the \(W_i\) and \(H_i\) matrices of each ith iteration until convergence. Following this, a connectivity matrix \(C_i\) (with size \(sample\times sample\)) is defined by each sample assignment of the ith iteration from \(H_i\) by selecting the maximum index of each column. The elements of the connectivity matrix \(c_{ij}\) are filled with 1 s if the samples i and j are assigned to the same cluster and are filled with 0 s otherwise. The average of all connectivity matrices represents the consensus matrix \({\bar{C}}\), which is the probability that the samples i and j belong together during iterations. The cophenetic correlation coefficient is then calculated as the Pearson correlation coefficient between \(I-{\bar{C}}\) and the distance between samples in a hierarchical clustering of \({\bar{C}}\). The cophenetic correlation coefficient indicates the dispersion of the sample assignment, which refers to how consistently samples with similar gene expression profiles belong together during iterations. Therefore, the rank k with the highest cophenetic correlation indicates the optimal capacity of the model.

However, NMF can only process non-negative ranges of entries in the input matrix A, and the output matrices W and H also have non-negative ranges. Therefore, to analyze the log2(fold change) expression dataset of each disease, we needed to select a model that can cover both positive and negative values. If the range of the input matrix A is \({\pm }\), then the matrix \(A_{\pm }\) is decomposed such that \(A_{\pm } \approx W_{\pm }\times H_+\). In convex NMF, the basis vectors of the \(W_{\pm }\) matrix are considered to be convex combinations of the input matrix \(A_{\pm }\) (i.e., \(A_{\pm } \approx W_{\pm }\times H_+ \approx X_{\pm }\times F_+ \times H_+\)40) and there is an advantage in that the factors of \(F_+\) and \(H_+\) are sparse. Therefore, we apply the convex NMF to obtain the W and H matrices. \(X_{\pm }\times F_+\) from convex NMF is treated as a factor of W in the NMF used for gene clustering, and then \(F_+\) and \(H_+\) are updated alternatively as follows:

$$\begin{aligned} H_{ik} \leftarrow H_{ik} \sqrt{\frac{[(X^TX)^+F]_{ik}+ [HF^T(X^TX)^-F]_{ik}}{[(X^TX)^-F]_{ik} + [HF^T(X^TX)^+F]_{ik}}} \\ F_{ik} \leftarrow F_{ik}\sqrt{\frac{[(X^TX)^+H]_{ik} + [(X^TX)^-FH^TH]_{ik}}{[(X^TX)^-H]_{ik} + [(X^TX)^+FH^TH]_{ik}}} \end{aligned}$$

Gene and sample clusters

The basic method of gene and sample clustering in NMF using the factors W and H is the “Max” method, which selects the cluster with the highest coefficient37. In general, a gene is assigned to the cluster with the highest coefficient in each gene row in the W matrix, and a sample is assigned to cluster \(S_i\) when the ith coefficient is the highest coefficient in each sample column in the H matrix. Accordingly, the gene cluster obtained by the “Max” gene clustering method is a group of genes with relatively upregulated expression in the sample cluster \(S_i\) compared to other sample clusters. Conversely, to consider the gene cluster with relatively downregulated expression, a gene is assigned to the cluster with the minimum coefficient using the “Min” gene clustering method. We performed this bi-clustering method via NMF using the AD patient expression data to cluster AD patients into k sample clusters and genes into k relatively up and downregulated clusters, respectively. The T2D patient expression data were processed in the same manner.

However, in most gene expression analyses, the number of genes (features) is larger than the number of samples in the dataset. Even if the dimension of genes and samples in an expression dataset can be reduced using NMF, it is still difficult to analyze genes in k clusters because each gene belongs to one of the k clusters. In addition, each cluster may contain genes whose expression values in the samples of the given cluster are not different compared with those in samples of other clusters. Thus, some genes will be assigned to a gene cluster even if there are no relative differences in expression between sample clusters (Supplementary Fig. S1).

To address this issue, we filtered out genes in clusters by considering the original input matrix A and identified which genes in each cluster showed significantly different expression levels in a specific sample cluster. First, for each gene that was already assigned to the cluster, the distribution of expression levels was defined as \(D_i (i=1\sim k)\) for each k sample cluster. We then used the Kruskal–Wallis test to identify genes with a significantly different expression distribution in samples of a given cluster from those in samples of at least one other cluster. The p values of the test were adjusted according to the Bonferroni correction for multiple comparisons, and genes with a q value < 0.05 were selected. Further, the Dunn test was performed between the distribution of expression levels for each gene in all possible sample cluster pairs. If the expression level differences of the gene between the given cluster and other remaining clusters were significant (q values < 0.05 after Bonferroni correction), the gene was selected. For cluster i, these genes with relatively upregulated expression were denoted as \(G_i^+\) (Fig. 1b).

Similarly, for a gene assigned to the gene cluster by the “Min” gene clustering method, the Kruskal–Wallis test and Dunn test were subsequently applied to the genes in each cluster. For cluster i, these genes with relatively downregulated expression were selected and denoted as \(G_i^-\). This process could effectively reduce the number of genes in each cluster compared to the conventional clustering method of NMF, which facilitated the analysis.

Differentially expressed gene (DEG) subgroups

The Kruskal–Wallis–based gene clustering method described above can extract the genes with relatively up and downregulated expression in patients with a given disease in k sample clusters. However, we further needed to identify whether genes in the obtained \(G_i^+\) and \(G_i^-\) groups are differentially expressed in the sample cluster \(S_i\) compared to controls. In addition, even if \(G_i\) selected as the characteristic of \(S_i\) differs significantly from other sample clusters, genes in \(G_i\) need to be differentially expressed in \(S_j\) compared to controls, which means that it can also be the characteristic of the \(S_j\) sample cluster. Especially, because AD and T2D datasets are gene expression data from different tissue types, by extracting differentially expressed genes between each patient groups of AD and T2D and their healthy controls, we can remove tissue-specific genes and detect genes related to each disease. Therefore, we considered DEGs from all genes in \(G_i^+\) and \(G_i^-\) between each patient cluster and their respective controls. First, we collected all genes assigned to any k cluster. Second, the expression levels for each gene were compared between disease samples in the i cluster and control samples using the t-test followed by Bonferroni correction. The genes with a q value < 0.05 for the sample i cluster were assigned to \(M_i^+\) or \(M_i^-\) depending on the direction of the expression level difference (upregulation or downregulation, respectively) (Fig. 1c).

Because AD and T2D gene expression datasets were obtained from different tissue types, this subgrouping of genes based on DEG is necessary. We can remove tissue-specific genes and select genes related to each disease by using DEGs between each patient groups of AD and T2D and their healthy controls.

Extraction of AD and T2D-related subgroup pairs and candidate genes

We independently applied the NMF approach for the expression data of AD and T2D patients and their respective controls, and obtained AD and T2D DEG subgroups for specific sample clusters. We then aimed to find AD subgroups related to T2D and T2D subgroups related to AD. To this end, we performed pathway enrichment analysis. These enriched pathways were then compared with those of known AD- and T2D-related genes in DigSee41. In addition, we investigated whether the enriched pathways in the AD DEG subgroups overlapped with T2D-related pathways and vice versa. Then, we selected a AD subgroup and a T2D subgroup containing the largest number of T2D-related pathways and AD-related pathways, respectively, which are a pair of clusters related with each other. From these two clusters, we selected common candidate genes with the same regulation direction, which are referred to as candidate genes (Fig. 1d).

Afterwards, we used an independent AD dataset downloaded from the ADNI (http://adni.loni.usc.edu)31, which included gene expression data from 116 AD patients and 246 controls extracted from peripheral blood, to validate the candidate genes identified from the DEG subgroup pairs related to both the diseases. With this dataset, the expression levels of 20,384 protein-coding genes filtered using the database Homo sapiens.GRCh38.94 were used for the classification of AD and the control sample using logistic regression. Tenfold cross-validation was performed with zero initialization and a learning rate of 0.05, and the area under the curve (AUC) was calculated at each tenfold cross-validation to evaluate the predictive ability of the candidate genes.

Additionally, we collected independent T2D gene expression datasets: 25 T2D patient and 71 control samples extracted from beta-cells or pancreatic islets in GSE20966, GSE25724, and GSE3864242,43,44,45. Then, we selected protein-coding genes using Homo sapiens.GRCh38.94 from each dataset. By removing the batch effect using the “removeBatchEffect” function in the R package “limma”, we normalized the expression profiles for 10,490 common genes in the three datasets and merged them. Similar to AD, we validated the candidate genes using these T2D datasets. Because of the small number of T2D patients, we performed threefold cross-validation with zero initialization and a learning rate of 0.005 in a logistic regression model.

Results and discussion

Clustering of AD and T2D genes

The AD and T2D subgroups were independently defined using the log2(fold change) values from the expression data of 285 AD and 68 T2D patients using the convex NMF approach and NMF-based clustering method. First, to decompose the expression data of the patients into subgroups using the NMF approach, we needed to determine the optimal number of subgroups. In general, initialization of the matrices W and H affects the final outputs of NMF. Hence, we applied the NMF algorithm 10 times for each rank k from 2 to 10 with randomly initialized W and H matrices, and then calculated the average of the cophenetic correlation coefficient. We chose the rank k that had the largest average cophenetic correlation coefficient. Figure 2 shows the average cophenetic correlation coefficients for each rank k in each AD and T2D patient dataset. The optimal rank k of both datasets was 3. Thus, we used the factorized matrices W and H with the largest cophenetic correlation coefficient out of 10 iterations of rank 3 as the NMF output for each disease.

Figure 2
figure 2

Cophenetic correlation coefficients for the consensus matrix in (a) Alzheimer’s disease (AD) and (b) type 2 diabetes (T2D) datasets.

After applying the sample and gene assignment method to the decomposed matrices, we constructed three clusters containing a subset of samples for genes with relatively upregulated and downregulated expression, respectively. For sample clustering, the “Max” method was applied to matrix H in columns, and the 285 AD patients and 68 T2D patients were divided into three sample clusters of 93, 90, and 102 patients, and 17, 9, and 42 patients, respectively. All of the genes in both datasets were first assigned to one of the three gene clusters through the “Max” method. According to the Kruskal-Wallis and Dunn test, 5375 and 4479 genes were significantly upregulated (\(G_x^+\)) and downregulated (\(G_y^-\)), respectively, in expression from other clusters for AD. Similarly, 3461 and 7369 genes were upregulated and downregulated, respectively, for T2D (Table 1). Each gene can be assigned to both \(G_x^+\) and \(G_y^-\) when the distribution in \(S_x\) is relatively upregulated whereas that in \(S_y\) is relatively downregulated. Thus, in the union of upregulated and downregulated genes, 6729 and 10,051 genes emerged as showing significant differences in expression from other clusters for AD and T2D, respectively.

Each gene can be assigned to both \(G_x^+\) and \(G_y^-\) when the distribution in \(S_x\) is relatively upregulated whereas that in \(S_y\) is relatively downregulated. The Kruskal-Wallis based gene clustering method showed that samples and genes in AD patients were more evenly divided compared to those in T2D patients. However, in the T2D dataset, most of the samples were assigned to \(S_3\) and the gene cluster also showed a skewed distribution in \(G_3\). Because the number of clusters was decided by the optimal rank k, the cluster \(S_2\) with the small number samples can be generated when the total number of samples are small such as the T2D dataset. This small size cluster may have distinct characteristics that can be distinguished from other clusters.

Table 1 Clustered samples and genes for (a) Alzheimer’s disease (AD) and (b) type 2 diabetes (T2D).

To visually confirm that the Kruskal-Wallis–based gene clustering method removed inappropriate genes in each gene cluster compared to the conventional “Max”/“Min” method, the gene expression matrices were rearranged in cluster order for both genes and samples, which were visualized on a heatmap. The rectangular regions on the diagonal of the heatmap, indicating samples and genes assigned in the same cluster, demonstrate genes with relatively upregulated or downregulated expression in each cluster. Compared to the basic “Max” and “Min” method, genes selected by Kruskal-Wallis test generated more distinct regions, in which genes showed significant differences in expression levels that could be clearly observed on the heatmap for both the AD and T2D datasets (Fig. 3). Figure 3 was generated using “aheatmap” function in the R package “NMF”46 and “heatmap” function in the python library “seaborn”47.

Figure 3
figure 3

Heatmap of gene clustering method in Alzheimer’s disease (AD) and type 2 diabetes (T2D). (a) and (b) show genes and sample clusters from AD patients. (c) and (d) show genes and sample clusters from T2D patients. Top panels show clusters obtained using the basic “Max” ((a) and (c)) and “Min” ((b) and (d)) methods. Bottom panels show clusters obtained using the Kruskal–Wallis and Denn test for relatively upregulated genes ((a) and (c)) and relatively downregulated genes ((b) and (d)). This figure is generated by the R software (R version 3.6.1, https://www.r-project.org/) and python (version 3.6.8, https://www.python.org/).

Next, we constructed \(M_i^+\) and \(M_i^-\), where gene expressions differ between each disease patient group and its corresponding control group. For the subgroups of each sample cluster, we integrated the genes with relatively up and downregulated expression compared to patients in other clusters (\(G_i^+\) and \(G_i^-\)), resulting in 6729 and 10,051 integrated genes in AD and T2D, respectively. Then, genes were assigned into a subgroup i if they are differentially expressed between patients in the subgroup i and control samples based on the t-test with Bonferroni-corrected q value < 0.05. The numbers of these up and downregulated genes (\(M_i^+\) and \(M_i^-\)) are shown in Table 2.

We further examined the clinical characteristics such as age of these samples in AD modules because the age information is only in AD, not T2D. The result of a one-way ANOVA for age in the AD patient subgroup showed no significant differences between AD patient subgroups (Supplementary Table S2), which suggests that the grouping of samples does not depend on age.

Table 2 Number of DEGs in each cluster.

Selecting cluster pairs of interest

To find AD clusters related to T2D and T2D clusters related to AD, pathway enrichment analysis was performed for the up and downregulated genes (\(M_i^+\) and \(M_i^-\)) in each AD and T2D cluster using a total of 10,378 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms from MSigDB. The hypergeometric test for \(M_i^+\) and \(M_i^-\) in AD and T2D genes was used to determine significantly enriched KEGG pathways and GO terms (Tables 3 and 4). As references of AD- and T2D-related pathways, we extracted 1635 AD-related genes and 1658 T2D-related genes from DigSee41, and obtained 1675 and 1757 AD- and T2D-related pathways, respectively, using the hypergeometric enrichment test. We identified common pathways from the DEG subgroups in AD patients with 1757 T2D-related pathways from DigSee41. Among the AD clusters, patients in \(S_3\) were most likely to have an association with T2D compared to AD patients in the AD \(S_1\) and AD \(S_2\) clusters (Table 3). Likewise, T2D patients in cluster \(S_3\) were most likely to be associated with AD (Table 4). There was no clinical information in the datasets to confirm whether the AD \(S_3\) patients actually have T2D or whether the T2D \(S_3\) patients have AD; however, our data suggest that patients in these clusters might share genetic characteristics of the other disease.

In addition, we used brain gene expression data GSE5281 to determine which AD clusters have features in common with the AD brain48. We extracted 1831 DEGs from these gene expression data by performing the t-test with a q value < 0.05, and 160 pathways were extracted from these genes. As a result, genes in the AD \(S_3\) module most overlapped with AD brain-related pathways (Supplementary Table S1).

Table 3 Numbers of type 2 diabetes (T2D)-related pathways for each Alzheimer’s disease (AD) differentially expressed gene module.
Table 4 Numbers of Alzheimer’s disease (AD)-related pathways in each type 2 diabetes (T2D) differentially expressed gene module.

To obtain further evidence that AD \(S_3\) and T2D \(S_3\) are related, pathway enrichment analysis was performed for the common genes of possible cluster pairs in the up-regulated and down-regulated AD and T2D gene clusters. The common genes with the same regulation direction (i.e., up or downregulated) in both diseases were extracted, and their enriched pathways were compared with the 1498 pathways of the 671 intersecting genes between AD and T2D from DigSee41. We used the pathways associated with both AD and T2D as references to identify whether the common genes in each possible cluster pair were related. Indeed, the enriched pathways from common genes in the upregulated AD \(M_3^+\) and T2D \(M_3^+\) modules showed more overlap with pathways from DigSee compared to that in other possible pairs (Table 5). Similarly, the common genes from the downregulated AD \(M_3^-\) and T2D \(M_3^-\) modules showed the highest overlapping pathway ratio with pathways from DigSee (Table 5). This suggested that the cluster pairs AD \(M_3\) and T2D \(M_3\) were the most closely related to both AD and T2D.

Table 5 Numbers of common pathways between disease subgroup pairs and DigSee.

Extraction of candidate genes

We extracted 241 common genes from the cluster pair AD \(M_3\) and T2D \(M_3\), including 195 upregulated genes and 46 downregulated genes, which were selected as candidate genes associated with both AD and T2D (Supplementary Table S3). Among these 241 genes, 14 genes were common with genes related with both AD and T2D from DigSee41. In DigSee, 661 genes were common for both AD and T2D. With a hypergeometric test for significance of 14 genes out of 241, a p value was 0.03826, showing significance of these genes in their roles in AD and T2D. In addition, we collected more AD and T2D genes from AlzGene and T2DiACoD (Supplementary Table S3)49,50. When comparing the 241 genes with genes in the three databases of DigSee, AlzGene, and T2DiACoD, 56 genes were related to AD or T2D.

These candidate genes were enriched in 29 pathways (Supplementary Table S4) and 14 pathways that are common with AD and T2D-related pathways from DigSee are shown in Table 6. Pathways associated with common pathological features of AD and T2D such as the immune system-related pathways (T cell selection, positive T cell selection, and T cell differentiation)51,52 and chemokine signaling pathway53,54 were included. Immune system-related pathways are known to be common characteristics of AD in the brain and blood51, and there are some evidences that chemokines play an essential role in the central nervous system and neuroprotection55,56. Interestingly, 11 genes among the 241 candidate genes were involved in chemokine signaling pathway, and 6 of them were AD and T2D-related genes: RAF1, RAC1, RHOA, STAT3, AKT1, and PRKCD.

Table 6 Common pathways between candidate genes and DigSee genes.

To verify whether the 241 candidate genes could be informative markers for the classification of each disease patients and controls, we used data from the ADNI cohort31 for AD prediction, and a merged independent T2D dataset for predicting T2D. In AD prediction, candidate genes from the (AD \(M_3\), T2D \(M_3\)) pair showed the best diagnostic performance, with an AUC value of 0.6906, compared with genes from nine possible pairs (Table 7).

Table 7 Performance of classification of Alzheimer’s disease (AD) and controls in the ADNI cohort using different sets of genes. (\(M_i,M_j\)) represents the common genes between the AD \(M_i\) module and type 2 diabetes (T2D) \(M_j\) module used for classification.

For comparison, the performance of classifying AD in the ADNI cohort was measured for 250 random genes to match the size of the candidate genes. Classification using 250 random genes was performed 100 times, and the mean AUC value was 0.5658. The t-test showed that the candidate genes significantly outperformed the randomly selected genes in classification with a p value of \(5.723\times 10^{-52}\) (Fig. 4). As another comparison, we obtained 1,466 DEGs from the AD samples (GSE63060 and GSE63061 datasets) by the t-test and selected genes with a q value < 0.05 after Bonferroni correction. When these DEGs were used for AD classification in the ADNI cohort, the AUC value was 0.5757.

Figure 4
figure 4

Performance comparison of Alzheimer’s disease (AD) classification from controls based on the area under the curve (AUC) using different gene sets.

Furthermore, we examined serum glucose data in the ADNI dataset to consider the clinical characteristics of samples. When we compared glucose levels between the AD and the control groups by the t-test, there was no significant difference (p value = 0.513). In ADNI, there were 41 AD patients and 95 controls with high glucose levels, including prediabetes or diabetes samples (\(\ge\) 100 mg/dL). Using the logistic regression model that were constructed for classification of AD (Table 7), we examined the classification performance for these hyperglycemic samples in the test set of each fold on tenfold cross-validation. The prediction performance for these hyperglycemic samples using the candidate genes (AUC = 0.715) was the highest among those using other genes (Supplementary Table S5). We also found that the predictive power of the prediction model using candidate genes was higher for these hyperglycemic samples compared to those for the whole samples (0.715 in Supplementary Table S5 and 0.6906 in Table 7, respectively).

We also performed classification using the independent T2D datasets (25 T2D and 71 controls). Among the 241 candidate genes, 179 genes were included in the T2D samples. On the threefold cross-validation, we obtained the AUC value of 0.9543. The AUC value of randomly selected 180 genes was 0.9458 and the predictive performances using other possible gene pairs were also high (Supplementary Table S6). This indicates that gene expression levels between T2D samples and controls in pancreatic islets were significantly different for most genes.

Application of the proposed model to another AD dataset

We applied the proposed model to another AD dataset. For AD, ADNI was used instead of gene expression datasets of GSE63060 and GSE63061. For T2D, the same gene expression data (GSE78721) was used. When we try to find the optimal number of clusters for the ADNI dataset, the optimal rank k of the ADNI dataset was 2 (Supplementary Fig. S2). At least three clusters are required to determine significant differences in the distribution of each gene between clusters with the Kruskal-Wallis and Dunn test. Thus, as an alternative, we clustered the ADNI data with the non-optimal rank k = 3, resulting that 116 AD patients were clustered into three clusters with 17, 54, and 45 samples, respectively (Supplementary Table S7). When the pathway enrichment analysis was performed for genes in each cluster, the ADNI \(M_3^+\) gene cluster contained the largest number of T2D-related pathways and followed by \(M_1^+\). Among 41 hyperglycemic AD patients, the largest 19 belonged to ADNI S3. However, the proportion of hyperglycemic AD patients in each ADNI sample cluster was the highest in ADNI S1 followed by S3 (S1 = 0.47%, S2 = 0.259%, and S3 = 0.422%), which implies that the characteristics of patients of \(S_1\) and \(S_3\) are similar and can be merged for the high risk subgroup of T2D.

Additionally, there was no difference between the age of patients in \(S_1\) and \(S_3\), but the age of patients of \(S_2\) was significantly lower than those of \(S_1\) and \(S_3\) (p values were 0.0013 and 0.0207 with a one-way ANOVA test, respectively). We also performed a one-way ANOVA test for APOE4 among subgroups of AD patients and observed no significant difference between APOE4 (p value as 0.35). Therefore, the proposed method may cluster patients that have some similar clinical characteristics such as the age, but not all of subgroups were clustered by these characteristics.

Conclusion

We have provided a methodological and analytical approach for identifying correlations between AD and T2D at the genetic level. Since the AD dataset does not contain information about whether the AD patients have T2D or not, it is important to define subgroups of AD; the same is true for T2D. Because the conventional NMF is not suitable for this task, we developed a method of gene selection from gene expression data. After applying NMF to gene expression data, additional conditions were taken into account for detecting distinct characteristics of subgroups. Genes with significant differences in expression levels in each patient groups (AD and T2D) were first selected to screen patients with AD associated with T2D and patients with T2D associated with AD. We identified genes that characterize these specific AD and T2D patients and identified the potential relationship between the two diseases based on gene expression profiles. To validate these potential relationships from candidate genes, prediction errors of the classification between AD and controls from logistic regression were compared with randomly selected genes in an independent AD dataset. Inclusion of the candidate genes significantly increased the AUC values in classifying AD from controls compared with randomly selected genes.

In conclusion, we provide new insights for extracting differentially expressed genes with relative differences in a specific patient group. These genes were enriched with pathways related to both AD and T2D such as T cell selection and chemokine pathways. As AD patients have genetic heterogeneity, the investigation of commonly dysregulated pathways in AD and T2D can enhance personalized medical cares for a subgroup of AD. Further studies are needed to reveal the relationships among AD and other AD-related diseases which could improve the prevention and treatment of AD.