Machine learning compensates fold-change method and highlights oxidative phosphorylation in the brain transcriptome of Alzheimer’s disease

Alzheimer’s disease (AD) is a neurodegenerative disorder causing 70% of dementia cases. However, the mechanism of disease development is still elusive. Despite the availability of a wide range of biological data, a comprehensive understanding of AD's mechanism from machine learning (ML) is so far unrealized, majorly due to the lack of needed data density. To harness the AD mechanism's knowledge from the expression profiles of postmortem prefrontal cortex samples of 310 AD and 157 controls, we used seven predictive operators or combinations of RapidMiner Studio operators to establish predictive models from the input matrix and to assign a weight to each attribute. Besides, conventional fold-change methods were also applied as controls. The identified genes were further submitted to enrichment analysis for KEGG pathways. The average accuracy of ML models ranges from 86.30% to 91.22%. The overlap ratio of the identified genes between ML and conventional methods ranges from 19.7% to 21.3%. ML exclusively identified oxidative phosphorylation genes in the AD pathway. Our results highlighted the deficiency of oxidative phosphorylation in AD and suggest that ML should be considered as complementary to the conventional fold-change methods in transcriptome studies.

According to Gong et al. 2018, the collective effects of multiple genes/insults may lead to the development and onset of AD 2 . Thus, multifactorial diagnosis and personalized treatment were emphasized since different combinations of etiological genes/insults may present in each individual. However, due to insufficient knowledge of AD's full spectrum, there is an urgent need to decipher the mechanism and risk factors of AD.
Machine learning (ML) is the process that computer systems use algorithms and statistical models to perform a prediction relying on patterns and inference without using explicit instructions. The application of ML on AD is focused on the diagnosis of AD from neuroimaging 4 . Despite the fact that the emergence of a wide range of biological data of AD, including genomic profiling and electronic health records, a comprehensive understanding of AD's mechanism from ML is so far unrealized, majorly due to the lack of needed data density 5 . We have previously identified MMP14 and dystonin potentially modulate the crosstalk between diabetes and AD by metaanalysis 6,7 . In this study, we applied ML to a publically available transcriptome dataset from AD postmortem to uncover the complex genetic network and compare the results with conventional fold-change (FC) methods.

Methods
Data source. The gene expression profile of the prefrontal cortex brain tissues of 310 AD patients and 157 non-demented control samples were retrieved from the GSE33000 dataset 8 of the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database. This dataset was selected. The processed data, which have been adjusted for the age, gender, RIN, pH, PMI, batch, and preservation of the samples, were downloaded from the Sample table. This dataset contains 39,279 detected probes, of which 13,798 were annotated, and a total of 9969 genes were profiled, while 31 probes were omitted due to mapping to more than one gene.
Another publically available microarray dataset GSE84422 9 , which profiled PFC from 56 postmortems with varying degrees of AD pathological abnormalities, was utilized as the unseen dataset to verify our models. The samples were classified into control or AD by CDR, Braak, and CERAD. Notably, due to the difference of microarray used, out of the 9966 attribute genes of the training dataset, 3680 genes were not profiled in the testing dataset. To conduct the testing, these 3680 gene profiles were artificially added with FC assigned as "1" for all samples.
Machine learning. RapidMiner Studio version 9.5 (WIN64 platform) was registered to Jack Cheng and was executed under the Windows 10 operating system with Intel Core i3-3220 CPU and 16 GB RAM. In addition to the samples' age and sex, the 9969 profiled genes were assigned as the regular attributes (potential contributing factors to be analyzed in modeling operator) in the modeling. The disease status (1 = AD; 0 = non-AD CTRL) was assigned as the Label attribute (the predicted class in modeling operator). The sample ID was assigned as the ID attribute (assigning the identity of the sample). The input matrix is supplied as Supplementary File 1.
Seven predictive operators or combinations of RapidMiner Studio operators were used to establish predictive models from the input matrix and assign a weight to each attribute. They were (1) AdaBoost + Decision Tree, (2) AdaBoost + Rule Induction, (3) AdaBoost + Decision Stump, (4) Generalized Linear Model, (5) Logistic Regression, (6) Gradient Boosted Trees, and (7) Random Forest + Weight by Tree Importance. The parameters of these operators are listed in the Parameters sheet of Supplementary File 2. Notably, in the Random Forest model, the number of trees was 500, and the depth of split was set to '-1' , which means the maximal depth parameter puts no bound on the depth of the trees. Moreover, the Generalized Linear Model is a regularized GLM, and the elastic net penalty was used for parameter regularization. Other operators under the category Models / Predictive were abandoned in this study due to the reasons listed in the Models sheet of Supplementary File 2.
The model's performance was estimated by cross-validation of models, which contains two subprocesses: a training subprocess and a testing subprocess. The training subprocess produces a trained model to be applied to the testing subprocess for the performance evaluation. In this study, the samples were randomly divided into ten subsets, with an equal number of samples. Each of the ten subsets was iterationaly used in the testing subprocess to evaluate the trained model from the other nine subsets. The convergence of each model's iteration was recorded and summarized in Supplementary File 3, which describes how genes were aggregated from these iterations. The performance of a model can be evaluated by its accuracy, precision, and recall, where accuracy = (TP + TN)/ (TP + FP + FN + TN), precision = TP/(TP + FP), recall = TP/(TP + FN), T = true, F = false, P = positive, and N = negative. The setup diagrams of the seven predictive models are illustrated in Supplementary File 4. We conducted the analysis of variance (ANOVA) test to determine the probability for the null hypothesis of the equal performance of the different ML models. The ANOVA result (f = 1.558, prob = 0.174, alpha = 0.050) could not reject the null hypothesis, indicating that the difference between the performance of the different ML models is not significant. The process was exported as "ANOVA.rmp" and was uploaded to GitHub at https:// github. com/ JackC heng-TW/ Rapid Miner-files/ Proce ss/.

Conventional fold-change method.
To check if our findings are not unique to a single dataset, we took another microarray-profiled PFC dataset GSE84422 as an unseen dataset to verify our models. Although nearly one-third of the training genes are missing in the test dataset, GSE84422 is currently the 2nd largest one after GSE33000. Upon testing, the accuracy was 28.57%, 58.93%, 76.79%, 82.14%, 71.43%, 44.64%, and 69.64% for decision tree, random forest, gradient boosted tree, generalized linear model, linear regression, decision stump, and rule induction, respectively. Since there are only a few attributes in decision tree/stump, missing one or two may largely limit the model performance. In contrast, models with more attributes like GLM outperform the others. The results indicate some models' generalizability and the difficulty of applying ML models on cross-platform datasets. The testing dataset "GSE84422_testing.xls" and exported processes were uploaded to GitHub at https:// github. com/ JackC heng-TW/ Rapid Miner-files/ testi ng/.

ML compensates conventional FC methods in gene identification.
To compare the differences in gene identification between ML and conventional FC-based methods, we adopted two independent strategies, as illustrated in Fig. 3. In one way, the uppermost 157 genes and the bottommost 157 genes of fold change were selected (Genes sheet of Supplementary File 6). In the other way, we selected 314 DEGs by firstly filtering with the fold change cutoffs 1.2, and followed by the rank of the p values (Genes sheet of Supplementary File 7). Surprisingly, there were only 67 (21.3%) or 80 (25.5%) genes overlapped with the ML-derived 314 genes for the two conventional FC-based methods, respectively.
Next, to figure out the differences in enriched pathways, the final gene list from ML and those from the two conventional FC-based methods were submitted to KEGG pathway enrichment analysis, respectively. The top 15 enriched KEGG pathways are summarized in Table 1 www.nature.com/scientificreports/ to conventional FC-based methods. Interestingly, the KEGG Alzheimer's pathway (hsa05010) was only enriched by the ML-derived genes, not conventional methods. However, this does not imply that ML is superior to or can replace the conventional methods since the latter also exclusively enriched several AD-related pathways, such as complement cascades (hsa04610), cytokine-cytokine receptor interaction (hsa04060), and phagosome (hsa04145). The mutual exclusivity of critical pathways demonstrates that ML compensates the conventional FC methods in gene identification.

ML highlights oxidative phosphorylation genes in the AD pathway.
When we looked into MLderived genes, which enriched the pathways, we found a considerable overlap of genes between the ML-exclusive pathways ( Table 2). These genes are ATP5C1, ATP5G1, NDUFA1, NDUFA4, NDUFA6, NDUFA12, NDUFB1, NDUFB2, NDUFB9, NDUFV1, NDUFV2, and UQCRFS1. They belong to the oxidative phosphorylation pathway (hsa00190), which is also a part of the KEGG Alzheimer's pathway (Fig. 4). Among them, NDUFA1, NDUFA4, NDUFA6, NDUFA12, NDUFB1, NDUFB2, NDUFB9, NDUFV1, NDUFV2 belong to the OXPHOS protein complexes (CX) I of the electron transport chain (ETC); while UQCRFS1 belongs to CX III of ETC. Moreover, ATP5C1 and ATP5G1 belong to the ATP synthase (CX V).   www.nature.com/scientificreports/ Co-predictive partners of the CX genes. Random Forest produces decision trees, which use combinations of the Attribute value, i.e., expression of genes, to predict the sample Label, i.e., AD or not. Figure 5 shows 13 decision trees involving ETC complexes subunit genes, and Table 3 summarizes the 12 CX genes and other 37 predictive genes in these trees. Notably, 32 out of the 37 genes are relevant to AD. The AD-relevance is established by association studies of the expression, genomics, or metabolomics, respectively, with references listed in Table 3. Figure 5A shows that the expression of ATP5G1 and ATN1 predicts AD. Although the exact function of ATN1 is unknown, it may act as a transcriptional co-repressor in neurons 13 . Moreover, alternative splicing of ATN1 was significantly detected in the frontal lobe of AD postmortem 14 . Figure 5B shows that the expression of NDUFV and CTXN1 predicts AD. CTXN1 encodes cortexin-1 and may mediate signaling of cortical neurons during forebrain development 15 , and it is highly dysregulated in the aging brain 16 . Figure 5C shows that slightly downregulation of two CX I genes, NDUFA6 and NDUFB1 predicts AD. Figure 5D shows that the expression of    17 . Besides, the expression of FRMPD4 was found to be significantly altered in the AD hippocampus 18 . Other AD-predictive genes in these decision trees will be discussed in groups according to their biological functions.

Discussion
We conducted machine learning (ML) analyses to train AD case/control classifiers using transcriptomic data and then compared the ML-derived gene features with that from the conventional differential expression analysis. ML exclusively highlighted oxidative phosphorylation but could not fully include the findings from the conventional methods. The pathways involving the identified genes and the limitation of the study are discussed below.
Oxidative phosphorylation. Oxidative phosphorylation in eukaryotes takes place at the electron transport chain in the mitochondrion. The oxidation of NADH or succinate from the citric acid cycle is the energy source of ATP synthase. During this process, several mitochondrial inner-membrane-embedded complexes, including CX I and CX III, pump protons out from the inner membrane to establish proton gradient, while CX V utilizes the energy of the influx of protons to generate ATP from ADP 19 . Abnormal mitochondrial morphology and functions, including glucose metabolism and ROS production, have been identified as early hallmarks of AD 20,21 . These phenotypes are directly related to the disruption of glycolytic processes and the impairment of the ETC complexes. In the '90 s, most research efforts have been devoted to investigating the role of CX IV in AD 22,23 . However, the evidence is not conclusive on whether dysregulation of any single ETC complex dominates AD progress. For example, besides expression, several mutations in ETC complex subunit genes may impair the complex activity 24 . Moreover, the ETC complex's dysregulation seems to be brain-region dependent, e.g., CX IV has no significant decrease in the temporal lobe, and CX I-III are decreased at certain cortex locations of AD 24 .
Recent studies also highlighted CX I's role, especially its deregulation, is tau-dependent in contrast to the Aβ-dependent CX IV 25 . Moreover, an SNP association study demonstrated the AD association for complex I genes but not for complexes II-V 26 . Furthermore, from a postmortem study of 18 AD and 44 controls, the downregulation of CX I-V in the hippocampus was identified 27    www.nature.com/scientificreports/ be monotonic during the AD progression. From a postmortem study of twelve AD and six controls, CX I genes are reported to decrease in the early stage and increase in the frontal cortex of definite AD patients 28 .
When we only see the symptom, most things look complex, especially the case for ETC complexes in AD. Could ML guide us through this misty forest with the aid of the Random Forest model by finding out potential partners of CX genes in predicting AD?
Neural maintenance or transmission. Among the AD-predictive genes, CACNA1G, FGF13, LRFN2, NPFF, and SHOX2 participate in neural maintenance or transmission. CACNA1G encodes voltage-dependent T-type calcium channel subunit alpha-1G. FGF13 is a fibroblast growth factor and plays a critical role in neuron polarization and migration 29 . LRFN2 promotes neurite outgrowth and increases the expression of the NMDA receptor 30 . NPFF is a neuropeptide, while SHOX2 may be a growth regulator in the neural system and involves processing somatosensory information 31 . In AD, pathological hallmarks include synaptic failure and neuronal loss 32 . Moreover, the critical role of mitochondria in supporting synaptic, as well as the evidence of dysfunction of mitochondria from both clinical postmortem 33 and animal models 34 of AD, support the mitochondriasynapse hypothesis of AD. Our findings that simultaneous dysregulation of CX and neuronal genes predict AD supports this hypothesis.
Immune system. The innate immunity, especially neuroinflammation mediated by microglia, is considered a hallmark of AD, whereas the role of the adapted immunity in AD is not conclusive 35 . Among the AD-predictive genes, CFHR1, CMTM4, HLA-DRA, IL18, MICA, MORN1, SCYE1, and SOCS4 participate in immunity. CMTM4 regulates PD-L1 protein 36 , which binds to PD-1 and suppresses the T-cells' adaptive arm, while HLA-DRA presents the extracellular-protein-derived peptides to, and MICA presents the stress-induced self-antigen to T-cells, respectively 37 . Moreover, MORN1 modulates functional Ca 2+ influx in T cells upon activation of T-cell receptors 38 . IL18 and SCYE1 (AIMP1) are pro-inflammatory cytokines, while SOCS4 is part of a negative feedback system that regulates cytokine signal transduction 39 . CFHR1 is an inhibitor of the complement pathway that blocks C5 convertase and controls complement activation along with complement factor H 40 . Our results indicate that the dysregulation of both innate and adaptive immunity genes may cooperate with CX genes to advance AD progression.
Phosphatase regulators. In AD, hyperphosphorylation of the microtubule-associated proteins, especially tau, disrupts the microtubules' assembly in neurons. Moreover, significantly lower type 1 phosphatase (PP1) activity in AD brains suggests the critical role of dysfunctional phosphatases in AD 41 . Among the AD-predictive genes, PPP1R14C and PPP1R7 belong to PP1 regulatory subunit 14 and subunit 7, respectively. Our results indicate that the dysregulation of PPP1R14C and PPP1R7, along with CX genes, may further advance AD progression by aggravating the microtubule-associated proteins' hyperphosphorylation.
Protein glycosylation. Protein glycosylation is a ubiquitous posttranslational modification of site-specific attachment of glycans and regulates the protein's folding and function. During the protein transport from Endoplasmic Reticulum to the Golgi apparatus, a series of attachment of oligosaccharides maturates a wide variety of complex N-or O-glycans. An N-glycosylation denotes the glycan's attachment to the amide nitrogen of an asparagine residue of the protein, whereas an O-glycosylation denotes the attachment to the oxygen atom of serine or threonine residues. Abnormal N-and O-glycosylation has been reported in AD 42,43 . Among the ADpredictive genes, FUT8 and GCNT4 mediate glycosylation in the Golgi apparatus. FUT8 catalyzes the addition of fucose to the GlcNAc residue, while GCNT4 is a glycosyltransferase mediating O-glycan branching 44 . Thus, the dysregulation of FUT8 and GCNT4 may aggravate AD progression by abnormal glycosylation under the condition of CX deficiency.
Other mitochondria machinery. Notably, among the AD-predictive genes, there are two mitochondrial genes besides the CX: Ornithine aminotransferase (OAT) and DnaJ homolog subfamily C member 30 (DNAJC30/ WBSCR18). OAT converts ornithine into pyrroline-5-carboxylate (P5C), which can serve as the precursor of proline and glutamate. Furthermore, since ornithine is an intermediate product in the urea cycle, OAT dysregulation may lead to abnormalities of both energy production machinery and the supply of neural transmitters. Recently, the OAT substrate ornithine has been proposed as an early diagnostic biomarker of AD 45 , and altered expression of the urea cycle enzymes have been identified in sporadic AD brains 46 . Our finding that simultaneous downregulation of OAT and CX I predicts AD indicates that the deficiency of the urea cycle and CX may co-operate to advance AD.
Meanwhile, DNAJC30 has been recently identified as an auxiliary component of ATP-synthase machinery in the mitochondria 47 . The removal of Dnajc30 in mice resulted in hypofunctional mitochondria, decreased integrity of CXs, and abnormal neocortical pyramidal neurons 47 . Our finding that the simultaneous downregulation of DNAJC30 and CX I predicts AD also supports the mitochondria deficiency hypothesis of AD.
Slightly dysregulated CX genes predict AD. From an overall observation on the CX-related decision trees (Fig. 5), a combination of down-regulated CX components and one or several partner genes mentioned above predicts AD. Notably, the margin is not conventional twofold, 1.5-fold, or even 1.2-fold. The margin is very subtle, and this is why the conventional FC method cannot identify them. With the criteria of p < 0.05 and FC 1.2, 1.5, or 2, the numbers of DEGs of the model dataset GSE33000 are 418, 10, and 0, respectively, as shown in Supplementary File 10 www.nature.com/scientificreports/ the number of intersection genes to be 60 (19.1%), which was compatible with the results of the conventional method 1 (21.3%) and the conventional method 2 (25.5%). Furthermore, it is difficult to identify complicated rules by conventional methods. Therefore, we suggest adopting machine-learning algorithms, especially decision trees, rule induction, and random forest, as complementary methods in transcriptome studies.
Limitations. There are several limitations to the interpretation of the results. (1) The samples are primarily of Caucasian ancestry. The biased sample race may limit the results to be applied to other races. (2) The samples are from the postmortem of a specific brain region. Since expressional heterogeneity, this may limit the results to be applied to other brain regions. (3) Due to the same reason, the results can hardly be applied to patient diagnosis purposes. (4) For the future application of the study pipeline, at least hundreds of samples might be required due to ML's nature. (5) ML models predict the patient disease labels but not the involvement of genes in disease, and additional genetic evidence is required to delineate any possible causal/reactive roles of these gene features in AD. (6) The performance difference in the independent dataset could be attributed to the detectable genes of different chip systems and the within-dataset variations. The absence of 36.9% attributes (genes) in the test set largely limited the performance of some models. Moreover, the limitation may also come from the difference in the sampling quality, which is reflected by the within-dataset variation (the average STD/INT were 8.2% and 21% for the modeling set and test set, respectively). Rapidminer models have also been used to identify the transcriptomic bio-signature of an infectious disease condition in the mammary gland of the cow 48 , with the performance ranging from 53 to 87%, which is compatible with the performance of this study. The differences in strategy majorly lay in whether pre-screening attributes (the so-called feature selection) before applying ML 49 . The benefits of feature selection include simplifying models, shorter training times, and avoidance of high dimensionality problems; however, the feature selection step using the entire dataset may strongly bias all downstream prediction, even when cross-validation is used 50 . Therefore, in this study, we decided to skip the feature selection step to achieve an unbiased understanding of AD.
Since decision trees were the final models to identify potential novel genes in this study, whether the data size is big enough is crucial. According to Vabalas et al. 51 , we conducted a series of train/test split to validate whether arbitrary partial subsets of data could generate decision trees to predict the "unseen" counterpart, with the same parameters used in this study. As shown in Supplementary File 8, the recall rates were saturated at n = 94, i.e., 20% of the total samples, which may imply the sample size was sufficient to conduct this study.
Although we did not combine datasets in this study, appropriate methods used for reducing the batch effect and differences between experiments 52 should be applied when combing datasets in future studies. We also noticed that random forest analysis dominated the identified gene features, indicating that future similar studies might focus on random forest first. However, other models may supply other 10% genetic cues on the investigator's demand.
Hypotheses developed from ML models. To discover and characterize the underlying pathophysiological pathways of AD are the main objectives of genetic research, including this ML study. Based on our findings, we postulate that two novel players, i.e., RNF157 and KIAA1715, may independently participate in AD pathophysiology by mediating the mitogen-activated protein kinase (MAPK) signaling pathway. MAPKs are serine/ threonine protein kinases regulating cellular processes in response to environmental stimuli and participate in hallmark events of AD, including tau phosphorylation, Aβ deposition, and chronic inflammation 53,54 .
In the #2 model of RF (Supplementary File 2), RNF157, EPHA2, and hCG_1776018 (also known as PIRT, an uncharacterized phosphoinositide-interacting protein) co-predict AD. EPHA2 is a membrane receptor tyrosine kinase, which regulates migration, adhesion, and blood-brain barrier through MAPK signaling 55 . RNF157 is an E3 ubiquitin ligase that acts as a downstream effector of PI3K/MAPK signaling 56 and regulates the survival of neurons by ubiquitinating APBB1 57 . Presently, there is no knowledge about the roles of these three genes in AD. We hypothesize that RNF157 may act as the downstream of EPHA2 and hCG_1776018, and regulate neural death upon cellular stress in the AD microenvironment. We also postulate that RNF157 agonist may act as a symptomatic treatment in AD.
In the #230 model of RF (Supplementary File 2), KIAA1715 and MAP3K9 co-predict AD. MAP3K9 is a serine/threonine kinase that is activated by environmental stress and acts as an upstream activator of the MKK/ JNK signal transduction cascade regulating apoptosis 58 . MAP3K9 dysregulation has been proposed as a possible marker in AD 59 . KIAA1715 (also known as LNPK) is an endoplasmic reticulum (ER) membrane protein, which stabilizes ER curvature and ER tubular junction network 60,61 . Mutations in KIAA1715 cause neurodevelopmental syndromes, such as intellectual disability and epilepsy 61 . Notably, disruption of ER-mitochondria contact has recently been found in AD postmortem 62 , while restoring ER-mitochondria contact rescues AD animal model 63 . However, there is no knowledge about the role of KIAA1715 in AD. We hypothesize that under the pro-inflammatory microenvironment of AD, KIAA1715 deficiency may lead to instability of ER structure, leading to disruption of ER-mitochondria contact and eventually aggravate AD progression.

Conclusion
Our study using machine learning techniques on the gene expression profile of the postmortem of the prefrontal cortex brain tissues of AD and controls highlighted the oxidative phosphorylation genes in the AD pathway. These genes were exclusively identified in ML but not in the conventional counterpart. Our results imply that ML should be considered complementary to the conventional FC methods in transcriptome studies. More importantly, we show that hypotheses underlying pathophysiological pathways of AD could be developed by further looking into ML models.