Correlation between Alzheimer’s disease and type 2 diabetes using non-negative matrix factorization

Alzheimer’s disease (AD) is a complex and heterogeneous disease that can be affected by various genetic factors. Although the cause of AD is not yet known and there is no curative treatment, its progression can be delayed. AD has recently been recognized as a brain-specific type of diabetes called type 3 diabetes, and several studies have shown that people with type 2 diabetes (T2D) have a higher risk of developing AD. It is therefore important to identify subgroups of patients with AD that are more likely to be associated with T2D. Here we describe a new approach to identify the correlation between AD and T2D at the genetic level. Subgroups of AD and T2D were each generated using a non-negative matrix factorization (NMF) approach, which produced clusters containing subsets of genes and samples. Within each gene cluster produced by the conventional NMF gene clustering method, we selected genes with significant differences in the corresponding sample cluster using the Kruskal–Wallis and Dunn tests. We then extracted differentially expressed gene (DEG) subgroups and took the genes with the same regulation direction at the intersection of the DEG subgroups of the two diseases as candidate genes. In total, we identified 241 candidate genes that represent common features related to both AD and T2D, and based on pathway analysis we propose that these genes play a role in the common pathological features of AD and T2D. Moreover, in the prediction of AD using logistic regression analysis with an independent AD dataset, the candidate genes achieved better prediction performance than DEGs. In conclusion, our study revealed a subgroup of patients with AD that is associated with T2D, together with candidate genes linking AD and T2D, which can help in providing personalized and suitable treatments.


Methods
Data description. The mRNA expression datasets for AD and T2D were downloaded from the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/). For the AD analysis, we used peripheral blood gene expression profiles from the GSE63060 and GSE63061 datasets, integrated using R software 32 . These datasets were generated on the Illumina HumanHT-12 v3.0 Expression BeadChip and the Illumina HumanHT-12 v4.0 Expression BeadChip, respectively. The GSE63060 dataset contains data for 145 AD patients and 104 control samples, and the GSE63061 dataset contains data for 140 AD patients and 135 control samples. For T2D, we used gene expression data (GSE78721) from adipocytes and infiltrating macrophages of 68 T2D patients and 62 healthy controls, because it was the largest T2D dataset in GEO and the adipose-derived transcription signature is associated with T2D 33,34 . These gene expression data were generated using the Affymetrix PrimeView Human Gene Expression Array.
For each gene expression data point from the GEO datasets, the probe ID was converted into the Entrez Gene ID using the platform annotation of each AD and T2D dataset (e.g., GPL6947, GPL10558, and GPL15207). Expression levels of probes that did not match an Entrez Gene ID were removed. Among all the assigned Entrez Gene IDs, protein-coding genes were selected using the database Homo_sapiens.GRCh38.94 (http://asia.ensembl.org/Homo_sapiens/Info/Index), where Ensembl IDs were converted to Entrez Gene IDs using the "biomaRt" package in R software. For duplicated Entrez Gene IDs, the expression values of the same Entrez Gene ID were merged into their mean value. To analyze the AD data, we merged the two datasets (GSE63060 and GSE63061). The number of Entrez Gene IDs selected by the protein-coding gene database in each AD dataset was 16,730 and 14,957, respectively. We selected the 14,134 genes common to the two datasets and removed batch effects using the "removeBatchEffect" function in the R package "limma". As a result, we obtained the expression data of 14,134 genes from 285 AD patients and 239 control samples. In addition, we normalized each data point in the patient expression data by applying a log2(fold change) conversion as follows:

$$\log_2(\text{fold change})_{ij} = \log_2\left(\frac{\text{patient}_{ij}}{\text{normal}_{i}}\right)$$

where $\text{patient}_{ij}$ is the expression value of gene i in the jth patient and $\text{normal}_{i}$ is the mean expression value of gene i in the normal control samples.
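The log2(fold change) conversion described above can be sketched as follows. This is a minimal illustration that assumes expression values on a linear, strictly positive scale (if the values were already log2-transformed, the ratio would become a difference). The function name and the toy values are hypothetical.

```python
import math

def log2_fold_change(patients, controls):
    """Return log2(patient_ij / normal_i) for every gene i and patient j.

    patients, controls: dict mapping a gene ID to a list of expression
    values (assumed linear scale and strictly positive).
    """
    result = {}
    for gene, values in patients.items():
        # normal_i: mean expression of the gene in the control samples
        normal_mean = sum(controls[gene]) / len(controls[gene])
        result[gene] = [math.log2(v / normal_mean) for v in values]
    return result

# Toy data (hypothetical): two genes, two patients, two controls
patients = {"GENE_A": [8.0, 2.0], "GENE_B": [3.0, 12.0]}
controls = {"GENE_A": [4.0, 4.0], "GENE_B": [5.0, 7.0]}
lfc = log2_fold_change(patients, controls)
```

Positive values then indicate expression above the control mean and negative values expression below it, which is why the resulting matrix has mixed signs.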

Decomposition of gene expression data sets.
Most gene expression datasets contain information on thousands of genes, which is large relative to the number of samples; therefore, several studies have applied dimensionality reduction methods to reduce the matrix dimension 35,36 . The non-negative matrix factorization (NMF) method has been widely used to reduce the dimension of an input matrix by decomposing a non-negative input matrix into two matrices 37 . Given an input matrix A consisting of the expression data of N genes and M samples, NMF produces the matrices W and H of size N × k and k × M, respectively, in which the parameter k indicates the number of clusters desired in the input data. The NMF algorithm is a multiplicative update algorithm that iteratively updates W and H so that their product W × H approximates the input A, until convergence. After convergence, the matrices W and H are used to bi-cluster genes and samples 38,39 . Each row of W (gene × k) and each column of H (k × sample) holds the non-negative coefficients of a gene or a sample, respectively, over the k clusters. The element $w_{ij}$ of the W matrix is the coefficient of gene i for cluster j ($j = 1, \dots, k$), and the element $h_{ij}$ of the H matrix is the coefficient of cluster i ($i = 1, \dots, k$) for sample j (Fig. 1b).
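To make the factorization concrete, the sketch below applies the standard Lee-Seung multiplicative updates for the Frobenius objective to a toy non-negative matrix. It illustrates plain NMF only, not the convex variant needed for mixed-sign data; the function names, iteration count, small constant `eps`, and toy matrix are all assumptions made for illustration.

```python
import random

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def frobenius_error(A, W, H):
    WH = matmul(W, H)
    return sum((a - b) ** 2 for ra, rb in zip(A, WH) for a, b in zip(ra, rb)) ** 0.5

def nmf(A, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative N x M matrix A into W (N x k) and H (k x M)."""
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    # Random strictly positive initialization of both factors
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    errors = []
    for _ in range(n_iter):
        # H <- H * (W^T A) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, A), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (A H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(A, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
        errors.append(frobenius_error(A, W, H))
    return W, H, errors

# Toy matrix (hypothetical): 4 genes x 4 samples with two obvious blocks
A = [[5, 5, 0, 0], [4, 4, 0, 0], [0, 0, 3, 3], [0, 0, 6, 6]]
W, H, errors = nmf(A, k=2)
```

Because every update multiplies by a non-negative ratio, W and H stay non-negative throughout, which is the property that makes the coefficients interpretable as cluster memberships.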
Determination of the optimal number of clusters according to the rank k. To select the optimal number of meaningful clusters that correctly divide the input data, the cophenetic correlation coefficient should be taken into consideration 37 . A connectivity matrix is obtained at the ith iteration from $H_i$ by selecting the maximum index of each column. The elements $c_{ij}$ of the connectivity matrix are set to 1 if the samples i and j are assigned to the same cluster and to 0 otherwise. The average of all connectivity matrices is the consensus matrix C, whose entries give the probability that the samples i and j are clustered together across iterations. The cophenetic correlation coefficient is then calculated as the Pearson correlation coefficient between $I - C$ and the distance between samples in a hierarchical clustering of C. The cophenetic correlation coefficient reflects the dispersion of the sample assignment, that is, how consistently samples with similar gene expression profiles are clustered together across iterations. Therefore, the rank k with the highest cophenetic correlation indicates the optimal capacity of the model.
However, NMF can only process non-negative entries in the input matrix A, and the output matrices W and H are also non-negative. To analyze the log2(fold change) expression dataset of each disease, we therefore needed a model that covers both positive and negative values. If the input matrix contains mixed-sign entries, denoted $A^{\pm}$, it is decomposed such that $A^{\pm} \approx W^{\pm} \times H^{+}$. In convex NMF, the basis vectors of the $W^{\pm}$ matrix are constrained to be convex combinations of the columns of the input matrix $A^{\pm}$ (i.e., $A^{\pm} \approx W^{\pm} \times H^{+} \approx X^{\pm} \times F^{+} \times H^{+}$ 40 ), with the advantage that the factors $F^{+}$ and $H^{+}$ are sparse. We therefore applied convex NMF to obtain the W and H matrices. $X^{\pm} \times F^{+}$ from convex NMF is treated as the factor W in the NMF used for gene clustering, and $F^{+}$ and $H^{+}$ are updated alternately until convergence following the convex NMF update rules 40 .
Gene and sample clusters. The basic method of gene and sample clustering in NMF using the factors W and H is the "Max" method, which selects the cluster with the highest coefficient 37 . In general, a gene is assigned to the cluster with the highest coefficient in its row of the W matrix, and a sample is assigned to cluster $S_i$ when the ith coefficient is the highest in its column of the H matrix. Accordingly, the gene cluster obtained by the "Max" gene clustering method is a group of genes with relatively upregulated expression in the sample cluster $S_i$ compared to other sample clusters. Conversely, to obtain the gene cluster with relatively downregulated expression, a gene is assigned to the cluster with the minimum coefficient using the "Min" gene clustering method. We performed this bi-clustering via NMF on the AD patient expression data to cluster AD patients into k sample clusters and genes into k relatively upregulated and k relatively downregulated clusters. The T2D patient expression data were processed in the same manner.
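The "Max"/"Min" assignment, and the consensus matrix used for rank selection, can be sketched as below. This is an illustrative outline with hypothetical function names and toy coefficients; the cophenetic correlation itself would additionally require a hierarchical clustering of the consensus matrix (for instance with `scipy.cluster.hierarchy.linkage` and `cophenet`), which is not shown here.

```python
def assign_genes(W, method="max"):
    """Assign each gene (row of W) to the cluster with the largest ("max")
    or smallest ("min") coefficient."""
    pick = max if method == "max" else min
    return [pick(range(len(row)), key=lambda c: row[c]) for row in W]

def assign_samples(H):
    """Assign each sample (column of H) to the cluster (row) with the
    largest coefficient."""
    return [max(range(len(H)), key=lambda r: H[r][j]) for j in range(len(H[0]))]

def connectivity(labels):
    """c_ij = 1 if samples i and j share a cluster label, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[j] else 0 for j in range(n)] for i in range(n)]

def consensus(label_runs):
    """Average the connectivity matrices over repeated NMF runs."""
    n = len(label_runs[0])
    total = [[0] * n for _ in range(n)]
    for labels in label_runs:
        c = connectivity(labels)
        for i in range(n):
            for j in range(n):
                total[i][j] += c[i][j]
    return [[v / len(label_runs) for v in row] for row in total]

# Toy factors (hypothetical): 3 genes x 3 clusters and 3 clusters x 4 samples
W = [[0.9, 0.1, 0.3], [0.2, 0.7, 0.05], [0.4, 0.4, 0.8]]
H = [[0.8, 0.1, 0.2, 0.3], [0.1, 0.9, 0.1, 0.3], [0.1, 0.0, 0.7, 0.4]]
up_clusters = assign_genes(W, "max")
down_clusters = assign_genes(W, "min")
sample_clusters = assign_samples(H)

# Sample labels from three hypothetical runs; the second run is the first
# with the label names swapped, which leaves its connectivity matrix unchanged
runs = [[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]
C = consensus(runs)
```

Working with connectivity matrices rather than raw labels is what makes runs comparable, since cluster indices from independent random initializations are arbitrary.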
However, in most gene expression analyses, the number of genes (features) is larger than the number of samples in the dataset. Even if the dimension of genes and samples in an expression dataset can be reduced using NMF, it is still difficult to analyze the genes in k clusters because every gene is assigned to one of the k clusters. In addition, each cluster may contain genes whose expression values in the samples of the given cluster do not differ from those in the samples of other clusters. Thus, some genes will be assigned to a gene cluster even if there are no relative differences in expression between sample clusters (Supplementary Fig. S1).
To address this issue, we filtered the genes in each cluster against the original input matrix A and identified which genes showed significantly different expression levels in a specific sample cluster. First, for each gene already assigned to a cluster, the distribution of expression levels was defined as $D_i$ ($i = 1, \dots, k$) for each of the k sample clusters. We then used the Kruskal-Wallis test to identify genes whose expression distribution in the samples of a given cluster differed significantly from that in the samples of at least one other cluster. The p values of the test were adjusted by Bonferroni correction for multiple comparisons, and genes with a q value < 0.05 were selected. Further, the Dunn test was performed between the distributions of expression levels of each gene for all possible pairs of sample clusters. If the expression level differences of the gene between the given cluster and each of the remaining clusters were significant (q values < 0.05 after Bonferroni correction), the gene was selected. For cluster i, these genes with relatively upregulated expression were denoted as $G_i^{+}$ (Fig. 1b). Similarly, for genes assigned to a gene cluster by the "Min" gene clustering method, the Kruskal-Wallis test and Dunn test were applied in the same way. For cluster i, these genes with relatively downregulated expression were selected and denoted as $G_i^{-}$. This process effectively reduced the number of genes in each cluster compared to the conventional clustering method of NMF, which facilitated the analysis.
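The Kruskal-Wallis filtering step can be illustrated as follows. This sketch is restricted to three sample clusters, where the chi-square approximation with two degrees of freedom has the closed form exp(-H/2); it does not include the subsequent Dunn test. Function names and toy expression values are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based average ranks and the tie term sum(t^3 - t)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_term = 0
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term

def kruskal_wallis_3groups(g1, g2, g3):
    """Kruskal-Wallis H statistic (tie-corrected) and p value for 3 groups."""
    groups = [g1, g2, g3]
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks, tie_term = average_ranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        r = sum(ranks[start:start + len(g)])
        h += r * r / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)
    return h, math.exp(-h / 2)  # chi-square survival function, 2 d.f.

def select_genes(expr_by_gene, labels, alpha=0.05):
    """Keep genes whose Bonferroni-adjusted Kruskal-Wallis p value is < alpha."""
    selected = {}
    n_tests = len(expr_by_gene)
    for gene, values in expr_by_gene.items():
        groups = [[v for v, c in zip(values, labels) if c == k] for k in range(3)]
        _, p = kruskal_wallis_3groups(*groups)
        q = min(1.0, p * n_tests)  # Bonferroni correction
        if q < alpha:
            selected[gene] = q
    return selected

# Toy data (hypothetical): 15 samples in three clusters of five
labels = [0] * 5 + [1] * 5 + [2] * 5
expr = {
    "GENE_SHIFTED": [10, 11, 12, 13, 14, 1, 2, 3, 4, 5, 5.5, 6, 7, 8, 9],
    "GENE_FLAT": [1, 4, 7, 10, 13, 2, 5, 8, 11, 14, 3, 6, 9, 12, 15],
}
h_shifted, p_shifted = kruskal_wallis_3groups(expr["GENE_SHIFTED"][:5],
                                              expr["GENE_SHIFTED"][5:10],
                                              expr["GENE_SHIFTED"][10:])
selected = select_genes(expr, labels)
```

A gene passing this filter would still need the pairwise Dunn comparisons before being placed in a $G_i^{+}$ or $G_i^{-}$ group.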
Differentially expressed gene (DEG) subgroups. The Kruskal-Wallis-based gene clustering method described above extracts the genes with relatively upregulated and downregulated expression in the k sample clusters of patients with a given disease. However, we further needed to determine whether the genes in the obtained $G_i^{+}$ and $G_i^{-}$ groups are differentially expressed in the sample cluster $S_i$ compared to controls. In addition, even if the genes in $G_i$, selected as characteristic of $S_i$, differ significantly from other sample clusters, they may also be differentially expressed in another sample cluster $S_j$ compared to controls, in which case they are also characteristic of $S_j$. Therefore, we identified DEGs among all genes in $G_i^{+}$ and $G_i^{-}$ between each patient cluster and its respective controls. First, we collected all genes assigned to any of the k clusters. Second, the expression levels of each gene were compared between the disease samples in cluster i and the control samples using the t-test followed by Bonferroni correction. The genes with a q value < 0.05 for sample cluster i were assigned to $M_i^{+}$ or $M_i^{-}$ depending on the direction of the expression level difference (upregulation or downregulation, respectively) (Fig. 1c).
Because the AD and T2D gene expression datasets were obtained from different tissue types, this DEG-based subgrouping of genes is necessary. By using DEGs between each patient group of AD and T2D and its healthy controls, we can remove tissue-specific genes and select genes related to each disease.

Extraction of AD and T2D-related subgroup pairs and candidate genes.
We independently applied the NMF approach to the expression data of AD and T2D patients and their respective controls, and obtained AD and T2D DEG subgroups for specific sample clusters. We then aimed to find AD subgroups related to T2D and T2D subgroups related to AD. To this end, we performed pathway enrichment analysis. The enriched pathways were then compared with those of known AD- and T2D-related genes in DigSee 41 . In addition, we investigated whether the enriched pathways in the AD DEG subgroups overlapped with T2D-related pathways and vice versa. We then selected the AD subgroup containing the largest number of T2D-related pathways and the T2D subgroup containing the largest number of AD-related pathways, which together form a pair of mutually related clusters. From these two clusters, we selected the common genes with the same regulation direction, which are referred to as candidate genes (Fig. 1d).
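The final selection step amounts to a direction-aware set intersection. A minimal sketch, with hypothetical gene identifiers standing in for the DEG subgroups of the selected cluster pair:

```python
def candidate_genes(ad_up, ad_down, t2d_up, t2d_down):
    """Return genes regulated in the same direction in both diseases.

    Each argument is an iterable of gene IDs from the M+ or M- subgroup
    of the selected AD or T2D cluster.
    """
    up = set(ad_up) & set(t2d_up)        # upregulated in both
    down = set(ad_down) & set(t2d_down)  # downregulated in both
    return up, down

# Hypothetical subgroups; GENE_C is upregulated in AD but downregulated
# in T2D, so it is excluded from both result sets.
ad_up = ["GENE_A", "GENE_B", "GENE_C"]
ad_down = ["GENE_D", "GENE_E"]
t2d_up = ["GENE_A", "GENE_B", "GENE_F"]
t2d_down = ["GENE_C", "GENE_E"]
up, down = candidate_genes(ad_up, ad_down, t2d_up, t2d_down)
```

Intersecting the up and down sets separately is what enforces the "same regulation direction" requirement; a plain intersection of all DEGs would also admit genes regulated in opposite directions.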
Afterwards, to validate the candidate genes identified from the DEG subgroup pairs related to both diseases, we used an independent AD dataset downloaded from the ADNI (http://adni.loni.usc.edu) 31 , which included peripheral blood gene expression data from 116 AD patients and 246 controls. With this dataset, the expression levels of 20,384 protein-coding genes filtered using the database Homo_sapiens.GRCh38.94 were used to classify AD and control samples by logistic regression. Tenfold cross-validation was performed with zero initialization and a learning rate of 0.05, and the area under the curve (AUC) was calculated for each fold to evaluate the predictive ability of the candidate genes.
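The validation procedure can be outlined as below: a logistic regression trained by batch gradient descent from zero-initialized weights with a learning rate of 0.05, evaluated by the mean AUC over ten folds. The number of epochs, the stratified fold construction, and the synthetic two-feature data are assumptions for illustration and are not taken from the study.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.05, epochs=500):
    n_features = len(X[0])
    w, b = [0.0] * n_features, 0.0  # zero initialization
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(w, b, X):
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]

def auc(scores, labels):
    """Probability that a random positive scores above a random negative."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cross_validated_auc(X, y, n_folds=10, seed=0):
    rng = random.Random(seed)
    pos = [i for i, l in enumerate(y) if l == 1]
    neg = [i for i, l in enumerate(y) if l == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    # Stratified folds (an assumption) so every fold contains both classes
    folds = [pos[f::n_folds] + neg[f::n_folds] for f in range(n_folds)]
    aucs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        w, b = train_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = predict(w, b, [X[i] for i in test_idx])
        aucs.append(auc(scores, [y[i] for i in test_idx]))
    return sum(aucs) / len(aucs)

# Synthetic data (hypothetical): 20 "patients" and 20 "controls", 2 features
rng = random.Random(1)
X = [[rng.gauss(1, 0.5), rng.gauss(1, 0.5)] for _ in range(20)] + \
    [[rng.gauss(-1, 0.5), rng.gauss(-1, 0.5)] for _ in range(20)]
y = [1] * 20 + [0] * 20
mean_auc = cross_validated_auc(X, y)
```

The AUC is computed from pairwise comparisons of held-out scores, so it depends only on how the model ranks samples and not on any classification threshold.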
Additionally, we collected independent T2D gene expression datasets comprising 25 T2D patient and 71 control samples extracted from beta-cells or pancreatic islets in GSE20966, GSE25724, and GSE38642 42-45 . We then selected protein-coding genes from each dataset using Homo_sapiens.GRCh38.94. After removing batch effects using the "removeBatchEffect" function in the R package "limma", we normalized the expression profiles of the 10,490 genes common to the three datasets and merged them. As for AD, we validated the candidate genes using these T2D datasets. Because of the small number of T2D patients, we performed threefold cross-validation with zero initialization and a learning rate of 0.005 in a logistic regression model.

Results and discussion
Clustering of AD and T2D genes. The AD and T2D subgroups were independently defined from the log2(fold change) values of the expression data of 285 AD and 68 T2D patients using the convex NMF approach and the NMF-based clustering method. First, to decompose the expression data of the patients into subgroups using the NMF approach, we needed to determine the optimal number of subgroups. In general, the initialization of the matrices W and H affects the final outputs of NMF. Hence, we applied the NMF algorithm 10 times for each rank k from 2 to 10 with randomly initialized W and H matrices, and then calculated the average cophenetic correlation coefficient. We chose the rank k with the largest average cophenetic correlation coefficient. Figure 2 shows the average cophenetic correlation coefficients for each rank k in the AD and T2D patient datasets. The optimal rank k of both datasets was 3. Thus, we used the factorized matrices W and H with the largest cophenetic correlation coefficient among the 10 runs at rank 3 as the NMF output for each disease. After applying the sample and gene assignment method to the decomposed matrices, we constructed three clusters, each containing a subset of samples, for genes with relatively upregulated and downregulated expression, respectively. For sample clustering, the "Max" method was applied to the columns of matrix H; the 285 AD patients were divided into three sample clusters of 93, 90, and 102 patients, and the 68 T2D patients into three sample clusters of 17, 9, and 42 patients. All of the genes in both datasets were first assigned to one of the three gene clusters through the "Max" method. According to the Kruskal-Wallis and Dunn tests, 5375 and 4479 genes were significantly upregulated ($G_x^{+}$) and downregulated ($G_y^{-}$), respectively, relative to other clusters for AD. Similarly, 3461 and 7369 genes were upregulated and downregulated, respectively, for T2D (Table 1).
Each gene can be assigned to both $G_x^{+}$ and $G_y^{-}$ when its distribution in $S_x$ is relatively upregulated whereas that in $S_y$ is relatively downregulated. Thus, in the union of upregulated and downregulated genes, 6729 and 10,051 genes showed significant differences in expression from other clusters for AD and T2D, respectively.
The Kruskal-Wallis-based gene clustering method showed that samples and genes in AD patients were more evenly divided than those in T2D patients. In the T2D dataset, by contrast, most of the samples were assigned to $S_3$ and the gene clusters also showed a distribution skewed toward $G_3$. Because the number of clusters was determined by the optimal rank k, a cluster with a small number of samples, such as $S_2$, can arise when the total number of samples is small, as in the T2D dataset. Such a small cluster may have distinct characteristics that distinguish it from the other clusters.
To visually confirm that the Kruskal-Wallis-based gene clustering method removed inappropriate genes from each gene cluster compared to the conventional "Max"/"Min" method, the gene expression matrices were rearranged in cluster order for both genes and samples and visualized as heatmaps. The rectangular regions on the diagonal of the heatmap, which correspond to samples and genes assigned to the same cluster, show genes with relatively upregulated or downregulated expression in each cluster. Compared to the basic "Max" and "Min" methods, the genes selected by the Kruskal-Wallis test produced more distinct regions, in which significant differences in expression levels could be clearly observed for both the AD and T2D datasets (Fig. 3). Figure 3 was generated using the "aheatmap" function in the R package "NMF" 46 and the "heatmap" function in the Python library "seaborn" 47 .
Next, we constructed $M_i^{+}$ and $M_i^{-}$, which contain genes whose expression differs between each disease patient group and its corresponding control group. For the subgroups of each sample cluster, we integrated the genes with relatively upregulated and downregulated expression compared to patients in other clusters ($G_i^{+}$ and $G_i^{-}$), resulting in 6729 and 10,051 integrated genes in AD and T2D, respectively. Genes were then assigned to subgroup i if they were differentially expressed between the patients in subgroup i and the control samples based on the t-test with a Bonferroni-corrected q value < 0.05. The numbers of these upregulated and downregulated genes ($M_i^{+}$ and $M_i^{-}$) are shown in Table 2. We further examined clinical characteristics of the samples in the AD modules; only age was examined because age information is available for the AD dataset but not for the T2D dataset. A one-way ANOVA for age showed no significant differences between the AD patient subgroups (Supplementary Table S2), which suggests that the grouping of samples does not depend on age.

Selecting cluster pairs of interest.
To find AD clusters related to T2D and T2D clusters related to AD, pathway enrichment analysis was performed for the upregulated and downregulated genes ($M_i^{+}$ and $M_i^{-}$) in each AD and T2D cluster using a total of 10,378 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms from MSigDB. The hypergeometric test for $M_i^{+}$ and $M_i^{-}$ in AD and T2D genes was used to determine significantly enriched KEGG pathways and GO terms (Tables 3 and 4). As references for AD- and T2D-related pathways, we extracted 1635 AD-related genes and 1658 T2D-related genes from DigSee 41 , and obtained 1675 AD-related and 1757 T2D-related pathways, respectively, using the hypergeometric enrichment test. We identified the pathways shared between the DEG subgroups in AD patients and the 1757 T2D-related pathways from DigSee 41 . Among the AD clusters, patients in $S_3$ were most likely to have an association with T2D compared to patients in the AD $S_1$ and AD $S_2$ clusters (Table 3). Likewise, T2D patients in cluster $S_3$ were most likely to be associated with AD (Table 4). There was no clinical information in the datasets to confirm whether the AD $S_3$ patients actually have T2D or whether the T2D $S_3$ patients have AD; however, our data suggest that patients in these clusters might share genetic characteristics of the other disease. In addition, we used the brain gene expression data GSE5281 to determine which AD clusters have features in common with the AD brain 48 . We extracted 1831 DEGs from these gene expression data by performing the t-test with a q value < 0.05, and 160 pathways were extracted from these genes. As a result, genes in the AD $S_3$ module overlapped most with AD brain-related pathways (Supplementary Table S1).
To obtain further evidence that AD $S_3$ and T2D $S_3$ are related, pathway enrichment analysis was performed for the common genes of all possible cluster pairs among the upregulated and downregulated AD and T2D gene clusters. The common genes with the same regulation direction (i.e., upregulated or downregulated) in both diseases were extracted, and their enriched pathways were compared with the 1498 pathways of the 671 intersecting genes between AD and T2D from DigSee 41 . We used the pathways associated with both AD and T2D as references to identify whether the common genes in each possible cluster pair were related. Indeed, the enriched pathways from the common genes in the upregulated AD $M_3^{+}$ and T2D $M_3^{+}$ modules showed more overlap with pathways from DigSee than those of other possible pairs (Table 5). Similarly, the common genes from the downregulated AD $M_3^{-}$ and T2D $M_3^{-}$ modules showed the highest overlapping pathway ratio with pathways from DigSee (Table 5). This suggested that the cluster pair AD $M_3$ and T2D $M_3$ was the most closely related to both AD and T2D.
[Figure caption fragment: this figure was generated using R software (R version 3.6.1, https://www.r-project.org/) and Python (version 3.6.8, https://www.python.org/).]
Extraction of candidate genes. We extracted 241 common genes from the cluster pair AD $M_3$ and T2D $M_3$, including 195 upregulated genes and 46 downregulated genes, which were selected as candidate genes associated with both AD and T2D (Supplementary Table S3). Among these 241 genes, 14 genes were shared with the genes related to both AD and T2D in DigSee 41 . In DigSee, 661 genes were common to both AD and T2D. A hypergeometric test for the overlap of 14 genes out of 241 gave a p value of 0.03826, indicating that these genes are significantly enriched for roles in AD and T2D. In addition, we collected further AD and T2D genes from AlzGene and T2DiACoD (Supplementary Table S3) 49,50 . When the 241 genes were compared with the genes in the three databases DigSee, AlzGene, and T2DiACoD, 56 genes were related to AD or T2D.
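A hypergeometric overlap test of this kind can be sketched with exact integer arithmetic as below. The background size needed to reproduce the reported p value is not stated in the text, so the example uses small hypothetical numbers rather than the study's counts; the function name is also an assumption.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) for the overlap X between a list of n selected genes and
    a reference set of K genes, both drawn from a background of N genes."""
    upper = min(K, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# Hypothetical example: background of 20 genes, 5 in the reference set,
# 4 selected genes, of which at least 2 are in the reference set.
p_value = hypergeom_sf(k=2, N=20, K=5, n=4)
```

The same function applies to pathway enrichment, with the reference set replaced by the genes annotated to a pathway.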
These candidate genes were enriched in 29 pathways (Supplementary Table S4), and the 14 pathways shared with the AD- and T2D-related pathways from DigSee are shown in Table 6. Pathways associated with common pathological features of AD and T2D, such as immune system-related pathways (T cell selection, positive T cell selection, and T cell differentiation) 51,52 and the chemokine signaling pathway 53,54 , were included. Immune system-related pathways are known to be common characteristics of AD in the brain and blood 51 , and there is some evidence that chemokines play an essential role in the central nervous system and in neuroprotection 55,56 . Interestingly, 11 of the 241 candidate genes were involved in the chemokine signaling pathway, and 6 of them were AD- and T2D-related genes: RAF1, RAC1, RHOA, STAT3, AKT1, and PRKCD.
Table 5. Numbers of common pathways between disease subgroup pairs and DigSee.
To verify whether the 241 candidate genes could be informative markers for classifying patients with each disease and controls, we used data from the ADNI cohort 31 for AD prediction and a merged independent T2D dataset for T2D prediction. In AD prediction, the candidate genes from the (AD $M_3$, T2D $M_3$) pair showed the best diagnostic performance, with an AUC value of 0.6906, compared with genes from the nine possible pairs (Table 7).
For comparison, the performance of classifying AD in the ADNI cohort was measured for 250 random genes, chosen to roughly match the number of candidate genes. Classification using 250 random genes was performed 100 times, and the mean AUC value was 0.5658. The t-test showed that the candidate genes significantly outperformed the randomly selected genes in classification, with a p value of $5.723 \times 10^{-52}$ (Fig. 4). As another comparison, we obtained 1,466 DEGs from the AD samples (GSE63060 and GSE63061 datasets) by the t-test, selecting genes with a q value < 0.05 after Bonferroni correction. When these DEGs were used for AD classification in the ADNI cohort, the AUC value was 0.5757. Furthermore, we examined serum glucose data in the ADNI dataset to consider the clinical characteristics of the samples. When we compared glucose levels between the AD and control groups by the t-test, there was no significant difference (p value = 0.513). In ADNI, there were 41 AD patients and 95 controls with high glucose levels (≥ 100 mg/dL), a range that includes prediabetes and diabetes. Using the logistic regression models that were constructed for the classification of AD (Table 7), we examined the classification performance for these hyperglycemic samples in the test set of each fold of the tenfold cross-validation. The prediction performance for these hyperglycemic samples using the candidate genes (AUC = 0.715) was the highest among those using other genes (Supplementary Table S5). We also found that the predictive power of the model using candidate genes was higher for these hyperglycemic samples than for all samples (0.715 in Supplementary Table S5 and 0.6906 in Table 7, respectively).
We also performed classification using the independent T2D datasets (25 T2D and 71 controls). Among the 241 candidate genes, 179 genes were present in the T2D data. With threefold cross-validation, we obtained an AUC value of 0.9543. The AUC value of 180 randomly selected genes was 0.9458, and the predictive performances using other possible gene pairs were also high (Supplementary Table S6). This indicates that gene expression levels in pancreatic islets differed significantly between T2D samples and controls for most genes.
For T2D, the same gene expression data (GSE78721) were used. When we searched for the optimal number of clusters for the ADNI dataset, the optimal rank k was 2 (Supplementary Fig. S2). However, at least three clusters are required to determine significant differences in the distribution of each gene between clusters with the Kruskal-Wallis and Dunn tests. As an alternative, we therefore clustered the ADNI data with the non-optimal rank k = 3, so that the 116 AD patients were divided into three clusters of 17, 54, and 45 samples, respectively (Supplementary Table S7). When pathway enrichment analysis was performed for the genes in each cluster, the ADNI $M_3^{+}$ gene cluster contained the largest number of T2D-related pathways, followed by $M_1^{+}$. Among the 41 hyperglycemic AD patients, the largest number, 19, belonged to ADNI $S_3$. However, the proportion of hyperglycemic AD patients in each ADNI sample cluster was highest in ADNI $S_1$, followed by $S_3$ ($S_1$ = 0.47, $S_2$ = 0.259, and $S_3$ = 0.422), which implies that the characteristics of the patients in $S_1$ and $S_3$ are similar and that these clusters can be merged into a subgroup at high risk of T2D.
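As a quick arithmetic check, the reported values 0.47, 0.259, and 0.422 are consistent with proportions of hyperglycemic patients per cluster. In the sketch below, the cluster sizes and the count of 19 for S3 come from the text, whereas the counts for S1 and S2 (8 and 14) are inferred from the reported proportions and the total of 41, and should be read as assumptions.

```python
# Cluster sizes of the 116 ADNI AD patients at rank k = 3 (from the text)
cluster_sizes = {"S1": 17, "S2": 54, "S3": 45}

# Hyperglycemic AD patients per cluster: 19 for S3 is reported;
# 8 and 14 are inferred values (assumption), chosen so the total is 41.
hyperglycemic = {"S1": 8, "S2": 14, "S3": 19}

proportions = {c: hyperglycemic[c] / cluster_sizes[c] for c in cluster_sizes}
total_hyperglycemic = sum(hyperglycemic.values())
```

Under these counts, S3 holds the most hyperglycemic patients in absolute terms while S1 has the highest proportion, matching the two statements made in the paragraph above.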
Additionally, there was no difference in age between the patients in $S_1$ and $S_3$, but the patients in $S_2$ were significantly younger than those in $S_1$ and $S_3$ (p values of 0.0013 and 0.0207 with a one-way ANOVA test, respectively). We also performed a one-way ANOVA test for APOE4 among the subgroups of AD patients and observed no significant difference (p value = 0.35). Therefore, the proposed method may cluster patients that share some clinical characteristics such as age, but not all subgroups were determined by these characteristics.

Conclusion
We have provided a methodological and analytical approach for identifying correlations between AD and T2D at the genetic level. Since the AD dataset does not contain information about whether the AD patients have T2D, it is important to define subgroups of AD; the same is true for T2D. Because conventional NMF is not suitable for this task, we developed a method of gene selection from gene expression data. After applying NMF to gene expression data, additional conditions were taken into account to detect the distinct characteristics of subgroups. Genes with significant differences in expression levels in each patient group (AD and T2D) were first selected to screen for patients with AD associated with T2D and patients with T2D associated with AD. We identified genes that characterize these specific AD and T2D patients and identified the potential relationship between the two diseases based on gene expression profiles. To validate these potential relationships, the performance of the candidate genes in classifying AD and controls by logistic regression was compared with that of randomly selected genes in an independent AD dataset. Using the candidate genes significantly increased the AUC values in classifying AD from controls compared with randomly selected genes.
In conclusion, we provide new insights for extracting differentially expressed genes with relative differences in a specific patient group. These genes were enriched in pathways related to both AD and T2D, such as T cell selection and chemokine pathways. As AD patients are genetically heterogeneous, the investigation of commonly dysregulated pathways in AD and T2D can enhance personalized medical care for a subgroup of AD. Further studies are needed to reveal the relationships between AD and other AD-related diseases, which could improve the prevention and treatment of AD.