Systematic Analysis and Biomarker Study for Alzheimer’s Disease

Revealing the relationship between dysfunctional genes in blood and brain tissues from patients with Alzheimer’s Disease (AD) will help us to understand the pathology of this disease. In this study, we conducted the first such large systematic analysis to identify differentially expressed genes (DEGs) in blood samples from 245 AD cases, 143 mild cognitive impairment (MCI) cases, and 182 healthy control subjects, and then compared these with DEGs in brain samples. We evaluated our findings using two independent AD blood datasets and performed a gene-based genome-wide association study to identify potential novel risk genes. We identified 789 and 998 DEGs common to both blood and brain of AD and MCI subjects respectively, over 77% of which had the same regulation directions across tissues and disease status, including the known ABCA7, and the novel TYK2 and TCIRG1. A machine learning classification model containing NDUFA1, MRPL51, and RPL36AL, implicating mitochondrial and ribosomal function, was discovered which discriminated between AD patients and controls with an area under the curve of 85.9% and an accuracy of 78.1% (sensitivity = 77.6%, specificity = 78.9%). Moreover, our findings strongly suggest that mitochondrial dysfunction, NF-κB signalling and iNOS signalling are important dysregulated pathways in AD pathogenesis.

studies were performed on a variety of platforms with different initial feature sizes and relatively small sample sizes, very few potential biomarkers have so far been identified or replicated in larger cohort studies 14.
Our study has two parts. The first was a systematic analysis to identify differentially expressed genes (DEGs) and pathways in a large-scale human blood dataset, and to integrate these with results from brain tissue to comprehensively explore the correlations between blood and brain. The second part was to apply machine learning (ML) techniques to identify a panel of potential predictive biomarkers in the blood, and to see whether gene expression in the blood can be used as a biomarker for AD diagnosis.

Methods
Microarray gene expression profile in human blood. Two independent human whole blood normalized mRNA gene expression datasets were downloaded from GEO (http://www.ncbi.nlm.nih.gov/geo/): GSE63060 and GSE63061 from the AddNeuroMed Cohort 15. We merged these two normalized datasets (generated by different Illumina platforms) using the inSilicoMerging R package 16, and then extracted 143 patients with AD, 77 MCIs and 104 control subjects (CTLs) from GSE63060, and 102 patients with AD, 65 MCIs and 78 CTLs from GSE63061, with Western European and Caucasian ethnicity respectively. Probesets without annotation (Entrez_Gene_ID) were filtered out, which left 22756 probesets corresponding to 16928 unique genes. The limma R package 17 was then applied, adjusting for age and gender, to identify DEGs (a) between AD patients and CTLs, (b) between MCI patients and CTLs, and (c) between AD and MCI patients. These comparisons were carried out separately in the two GEO datasets and in the merged one (referred to as the merged discovery dataset). We focused on this merged discovery dataset for downstream analysis, using a Benjamini-Hochberg adjusted p-value (BH.pval) of 0.01 as the significance level for DEG identification.
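The BH.pval threshold described above can be illustrated with a minimal sketch of the Benjamini-Hochberg adjustment. This is not the study's code (the analysis used the limma R package); it is a plain-Python illustration in which the function name `bh_adjust` and the five nominal p-values are hypothetical.

```python
def bh_adjust(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvals)
    # Indices of the p-values sorted in ascending order.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the rank decreases (monotonicity step).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical nominal p-values for five probesets (not taken from the study).
pvals = [0.0001, 0.004, 0.03, 0.2, 0.0015]
bh = bh_adjust(pvals)

# Apply the BH.pval < 0.01 significance level used for the discovery dataset.
significant = [p < 0.01 for p in bh]
print(bh)
print(significant)
```

With these made-up inputs, three of the five probesets remain below the 0.01 level after adjustment.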
In order to evaluate the DEGs identified in the discovery dataset above, two additional datasets were downloaded for analysis. Firstly, the whole blood gene expression dataset (GSE6613) was downloaded from GEO. The Affymetrix U133A CEL profiles were normalized by the RMA 18 method implemented in the affy R package. Probesets were filtered out if (1) they were not annotated or were multiply annotated; or (2) they were present in less than 10 percent of the samples as determined by applying the MAS5 present/absent call algorithm (affy R package). DEGs were identified by applying limma, adjusting for age and gender. Nominal pval < 0.01 was used for significance because we observed that no DEG could pass multiple testing (BH.pval > 0.05, see Discussion section). This dataset includes samples for AD, MCI and CTL, as well as Parkinson's disease (PD). We excluded PD samples after data normalisation.
The second evaluation blood gene expression dataset was downloaded from the Alzheimer's Disease Neuroimaging Initiative website (ADNI, http://www.adni-info.org/). The ADNI was launched in 2003 as a public-private partnership led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. In our study, we focused on the ADNI2 Caucasian population with disease status according to baseline diagnosis. This cohort has APOE4 information for each individual participant. Limma was applied to each APOE4 group (APOE4 = 0, APOE4 = 1, APOE4 = 2) separately, adjusting for age, gender, RIN, and the RNA purity ratios A260/280 and A260/230, to detect DEGs between patients with AD and CTL, early MCI (EMCI) and CTL, and late MCI (LMCI) and CTL. A nominal p-value of <0.01 was used for significance since no DEG could pass multiple testing (see Discussion section). We present results on the APOE4 = 1 group because there were similar numbers of cases for each disease status in this group, but very few AD cases in the other two APOE4 groups.
Microarray gene expression profile in human brain. The GSE84422 dataset includes human post-mortem brain samples taken from 19 brain regions for an AD study 6. The cohort used is independent of the above blood cohorts. Gene expression profiles of 17 brain regions were generated by both Affymetrix U133A and U133B platforms, and profiles for the other two regions were generated by the U133plus2 platform. We processed the raw CEL files as above, identified DEGs for each platform separately using limma, adjusting for age, gender, post-mortem interval (PMI) and pH values as in the original study 6, and merged them together afterwards to obtain 19 lists of DEGs. Nominal pval < 0.01 was applied for significance, again since no DEG could pass multiple testing (i.e. BH.pval > 0.05). We only analysed definite AD and CTLs in the Caucasian ethnic group. Supplementary Table 1 indicates the sample size in each comparison group, including the cases for blood datasets.
To clarify, within our study, DEGs either refer to array probesets, when we discuss DEGs within the same data cohorts, or unique genes (Entrez_Gene_ID), when we compare results from different cohorts for blood and brain.
Pathway analysis for DEGs. We performed pathway analysis on the identified DEGs using commercial Ingenuity® Pathway Analysis (IPA®, QIAGEN Redwood City, www.qiagen.com/ingenuity) software. We chose as significant those canonical pathways with BH.pval < 0.01.

Gene-based Analysis of GWAS data. The International Genomics of Alzheimer's Project (IGAP) Consortium reported a large-scale AD GWAS dataset 1. The gene-based analysis tool MAGMA 19 was applied to the IGAP stage 1 whole genome summary statistics (including 17,008 AD and 37,154 CTLs), with the 1000 genomes European reference panel used to perform the joint SNP gene-based GWAS study. We searched for single-nucleotide polymorphisms (SNPs) within 20 kb up/downstream of each gene (NCBI37.3). Two significance levels were applied to identify significant genes in GWAS, nominal pval < 0.01 and Bonferroni BF.pval < 0.05; we refer to these genes as MAGMA genes. The qvalue package in R was also applied.
Biomarker discovery by machine learning. We attempted to identify blood biomarkers and classification models trained/learned from the GSE63060 dataset and tested in GSE63061, and vice versa. Data were adjusted for age and gender by a robust regression model (applying the rlm function in the MASS R package); for the probesets common to both datasets, the model residuals were further centred and scaled to a mean of zero and a standard deviation of one across all subjects in each dataset. We used the least absolute shrinkage and selection operator (LASSO) regression feature selection method 20, implemented in the glmnet R package, to select features, and investigated the prediction performance of different ML approaches, including support vector machine (SVM), random forest (RF) and logistic Ridge Regression (RR) models, with a voting strategy to detect optimal biomarkers and classification models that discriminate AD patients from control subjects. The majority outcome of these three ML algorithms determined the final predicted outcome. The LASSO approach shrinks the coefficients of variables with little or no discriminatory power to zero, while variables with non-zero coefficients remain in the final LASSO model and represent the joint discriminatory power to separate patients with AD from control subjects 21. An optimal penalty factor lambda was tuned during the cross-validation process.
We repeated this LASSO regression with 5-fold cross-validation (CV) 100 times, and the subset of features with the best CV area under the curve (AUC) value for the receiver operating characteristic (ROC), or the subset most frequently selected on the training dataset, was kept as the selected biomarker panel (feature set). However, if fewer than two variables were selected, then the feature set with sub-optimal AUC was selected instead. Feature selection by LASSO started from the full feature pool, i.e., the 22756 probesets common to GSE63060 and GSE63061. For SVM and RF, we used the default settings when calculating the prediction accuracy. For RR, we calculated the optimal cut-off from training with optimal AUC and accuracy, and then applied this cut-off to prediction in testing. Prediction performances of the classifiers were evaluated by AUC, test accuracy (ACC), sensitivity (Sens), and specificity (Spec). For comparison, the area under the precision-recall curve (AUPR) was also calculated using the PRROC R package. ROC curves were plotted using the ROCR R package 22. All this work was conducted using in-house R programs.
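The voting strategy and the performance measures named above (AUC, ACC, Sens, Spec) can be sketched as follows. This is an illustrative plain-Python sketch, not the in-house R programs used in the study: the function names, the three prediction lists and the score list are all hypothetical, and how the study derived a score for the voted AUC is not stated here, so the AUC is computed on an arbitrary made-up score.

```python
def majority_vote(*label_lists):
    """Per-sample majority over an odd number of 0/1 prediction lists."""
    n_models = len(label_lists)
    return [1 if 2 * sum(votes) > n_models else 0 for votes in zip(*label_lists)]


def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for binary labels (1 = AD, 0 = control)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "acc": (tp + tn) / len(pairs),
        "sens": tp / (tp + fn),
        "spec": tn / (tn + fp),
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a random case outscores a random control (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test labels and predictions from three classifiers.
y_true = [1, 1, 1, 0, 0, 0]
svm_pred = [1, 1, 0, 0, 0, 1]
rf_pred = [1, 0, 1, 0, 0, 0]
rr_pred = [1, 1, 0, 0, 1, 1]

voted = majority_vote(svm_pred, rf_pred, rr_pred)
m = confusion_metrics(y_true, voted)

# Hypothetical continuous scores from a single classifier.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]
auc = roc_auc(y_true, scores)
print(voted, m, auc)
```

Using three classifiers keeps the vote free of ties for a binary outcome, which is one reason an odd number of models is convenient for this kind of scheme.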

Results
Differentially expressed genes in blood were also found in the brain. DEGs 14) or frontal atrophy (OR > 1.2, pval < 8.4E-06) in the brain subjects with AD when mapped to the data in Zhang's brain study 4 ( Supplementary Fig. 3). Furthermore, 789 AD-DEGs in blood were also DEGs identified by our previous meta-analysis in brain prefrontal cortex (PFC) region 7 with significant enrichment (OR = 1.48, 95%CI 1.34-1.62, pval < 6.28E-16), and 77.9% of them showed the same regulation direction between blood and brain (pval < 2.2E-16, sign test). Similarly, we observed that 998 MCI-DEGs in blood are also DEGs in the brain of AD patients with significant enrichment (OR = 1.  23 , and we observed that AD-DEGs or MCI-DEGs in blood were likely to be ageing-associated genes (OR > 2.00, pval < 2.93E-36 for both, Supplementary Fig. 3). AD-DEGs in brain PFC region 7 were also enriched with these ageing-associated genes, although with a slightly lower level of enrichment (OR = 1.8, 95%CI 1.6-2.1, pval < 2.2E-16). Table 1 lists the top 20 DEGs common to both AD and MCI, the top 10 AD-only DEGs, and the top 10 MCI-only DEGs in blood (see Supplementary Table 2 for the whole list).  Table 1 were re-discovered in GSE6613, namely WDFY3, TCIRG1, and NEMF/SDCCAG1.

Validation using Gene expression in other blood datasets. Among the 374 DEGs identified in the GSE6613 validation dataset (see Methods and Supplementary
In the ADNI2 dataset, we identified 416, 630, and 157 DEGs (unique genes) for AD, early MCI (EMCI) and late MCI (LMCI) disease status respectively (see Supplementary Table 2). Both AD-DEGs and MCI-DEGs identified in the merged discovery cohort were enriched with DEGs identified in ADNI2 AD (OR = 1.88, 95%CI 1.53-2.33, pval = 6.11E-09; OR = 2.02, 95%CI 1.65-2.48, pval = 9.67E-12, for AD and MCI respectively, Supplementary Fig. 4). None of the top DEGs listed in Table 1 were re-discovered in the ADNI2 AD dataset. However, HELZ was identified as an early MCI-DEG in the sub-cohort of ADNI2 with APOE4 = 1 genotype. This gene had a 12% up-regulation in both blood of AD and blood of MCI in the merged discovery dataset. An exome sequencing study revealed that variants in HELZ are associated with intellectual disability 24. HELZ functions as an RNA helicase, and RNA helicases are involved in almost every RNA-related process, including transcription, splicing, ribosome biogenesis, translation and degradation. Therefore, HELZ may be associated with the pathogenesis of neurodegenerative diseases including AD 25.
Table 1. The top DEGs in blood and their relationships with AD brain. Data shown are from (top rows) the top 10 up-regulated and the top 10 down-regulated DEGs in AD blood that are also DEGs in MCI blood; (middle rows) the top 10 DEGs in AD blood that are not also DEGs in MCI blood; (bottom rows) the top 10 DEGs in MCI blood that are not also DEGs in AD blood. In addition, all these DEGs in blood were mapped to DEGs in the brain PFC region 7 (columns 7 to 9) and we show their correlation coefficients with Braak stage and brain frontal atrophy 4 in patients with AD. FC represents Fold Change in gene expression.
Differentially expressed genes not uniform across brain regions. In total, we identified 5552 AD-DEGs (unique genes) in 19 brain sub-regions (Supplementary Table 4). With such a divergent distribution across 19 brain regions (Table 2), we did not identify any super genes which were DEGs in all 19 brain regions. Two genes (AKAP9, NEBL) were identified as DEGs in eight brain regions, and 3640 DEGs were identified from only a single region. 1048 of these DEGs (18.9%) were identified in our previous meta-analysis in the brain PFC region (OR = 1.78, 95%CI 1.64-1.94, pval < 3.53E-42). Figure 2 illustrates the DEGs in these 19 brain regions and the overlap with AD-DEGs or MCI-DEGs in blood. Among these 19 brain regions, Prefrontal Cortex (PC), Occipital Visual Cortex (OVC), and Dorsolateral Prefrontal Cortex (DPC) are the top three regions with the highest proportion of brain DEGs mapped to blood. Only 15% of brain DEGs in hippocampus (HIP) were identified as AD-DEGs in blood. In addition, the mappings of brain AD-DEGs to blood AD-DEGs and of brain AD-DEGs to blood MCI-DEGs were highly correlated (R = 0.80, pval < 3.33E-05, Pearson test, Table 2).
Gene-based GWAS reveals potential new risk genes. In total, 18229 genes were identified in the IGAP stage 1 GWAS dataset by MAGMA, including all of the 39 GWAS risk genes in AD except INPP5D. Sixty-seven MAGMA genes passed BF.pval < 0.05, including 17 AD risk genes, and 15 AD-DEGs and 20 MCI-DEGs in blood (Table 3). Among them, MS4A6A, MS4A4A, ABCA7, HLA-DRA, MTSS1L, NDUFS3, and CD2AP were identified as DEGs in the brain PFC region in our previous brain meta-analysis; thirteen of them were differentially expressed in at least one brain region. ABCA7 showed significant expression fold changes of 17%, 19% and 13% in blood of AD, blood of MCI and brain of AD respectively; this gene may thus be a potential biomarker for early diagnosis. MS4A6A showed >10% down-regulation in blood, and >43% up-regulation in brain; NDUFS3 was >10% down-regulated, and HMHA1 >9% up-regulated, in blood and brain. Although HMHA1 is not a risk gene in AD, it has been reported that methylation sites in this gene have a strong relationship to ABCA7 and AD pathologies 26. In addition, BCL3, a proto-oncogene candidate, might be a potential novel risk gene for AD, because it was 27% up-regulated in AD brain and identified as a DEG in both AD blood and MCI blood. Supplementary Table 5 indicates the 751 IGAP MAGMA genes (nominal pval < 0.01) and the most significant SNPs in their 20 kb up/downstream regions. When FDR correction was applied, we identified 281 and 119 genes at the 0.05 and 0.01 significance levels respectively.
DEGs in blood did not show any enrichment for these IGAP MAGMA genes at the stringent significance level (BF.pval > 0.05). However, if we apply nominal pval < 0.01 for MAGMA (751 genes identified), both AD-DEGs and MCI-DEGs in blood show enrichment in IGAP genes (OR = 1.33, 95%CI 1.11-1.61, pval = 2.45E-03; OR = 1.36, 95%CI 1.14-1.62, pval = 5.33E-04, respectively). We previously identified 3124 AD-DEGs in the brain PFC region 7, and those DEGs were enriched with MAGMA genes at either BF.pval < 0.05 or nominal pval < 0.01 (OR = 2.27, 95%CI 1.23-4.02, pval = 5.67E-03; OR = 1.23, 95%CI 1.00-1.51, pval = 4.64E-02 respectively). These results revealed significant associations between genomics and gene expression in AD.
Creation of potential biomarker panels by machine learning. Our aim here was to identify a set of biomarkers and classification models (classifiers) which can discriminate patients with AD from healthy control subjects, e.g. 143 patients with AD from 104 controls in GSE63060 or 102 patients with AD from 78 controls in GSE63061. We trained classifiers in one dataset and tested them in the other dataset (see Methods). Figure 3a illustrates an optimal six-feature panel (named Full6set) that was identified by measuring area under the curve (AUC) performance for SVM, RR and RF (0.875, 0.874, 0.849 respectively). The voted AUC (the
Figure 2. Number of DEGs common to both the blood and the different brain regions. Overlap between DEGs (up-regulated and down-regulated) identified in the merged blood datasets and DEGs identified in each of the 10 brain regions is shown as an arc, the area of which is proportional to the number of overlapping DEGs (see full name of brain region in Table 2).
Table 3. Results of gene-based GWAS analysis. This table lists 67 genes identified by MAGMA (BF.pval < 0.05) from the IGAP stage 1 GWAS dataset, and compares their expression (fold-change and p-value) in AD and MCI blood datasets, and in the brain dataset from our previous study 7
Table 6 for further details. All features in Full6set and Full4set were down-regulated DEGs in the blood merged discovery dataset, except GALNT4 which was an up-regulated DEG (Supplementary Table 2); the two common features, ILMN_2097421 (MRPL51) and ILMN_2189933 (RPL36AL), were the top DEGs in the blood but not in the brain. In order to test the robustness of the classification models and features used, we swapped the training dataset and testing dataset, i.e. we trained classification models in GSE63060 using Full4set then tested in GSE63061, and we trained models in GSE63061 using Full6set and tested in GSE63060. Their testing performances are illustrated in Fig. 3c,d, and Supplementary Table 6. The robustness of the selected features was also tested by random selection (Supplementary Fig. 6). The models using Full6set demonstrated similar classification performances to the models using Full4set. Voting AUCs for Full6set models were 0.866 and 0.864 in the two testing datasets (GSE63060 and GSE63061 respectively) with an average of 0.865. For Full4set models, the values were 0.859 and 0.875 with an average of 0.867. Moreover, when we used the models trained from AD vs. controls to discriminate MCI from controls, most of the MCI samples (>72%) were predicted to be AD (Supplementary Table 7). Supplementary Fig. 7 shows the boxplots and swarm plots of each of the features in Full4set where MCI samples were also included, which demonstrates that each of the features had good classification performance.

Discussion
In this study, we observed that in blood samples more DEGs were identified comparing MCI to controls than comparing AD to controls. This suggests that the trajectory from control to MCI to AD is unlikely to be linear. In addition, under the current classification of MCI there are many clinical entities, not all evolving to AD in the same way or time (some MCI even revert to control). Therefore, it is possible that the increased differences we observed between MCI and controls reflect the dynamic and heterogeneous state of MCI. In contrast, overt AD is a more stable clinical entity with possibly a more defined gene expression signature. We also observed that AD-DEGs tended to have the same regulation direction as the MCI-DEGs in blood (only a few genes were identified as DEGs comparing AD to MCI samples), and the majority of those AD-DEGs that overlapped in the blood and brain showed consistent directions of regulation, suggesting that the biomarkers to be investigated in blood can be potential early diagnostic signatures.
Our study shows evidence for a role of ribosomal dysfunction. In blood, the top 10 up- and down-regulated AD-DEGs were also identified as MCI-DEGs, and included ribosomal protein genes such as MRPL51, RPL36AL, and RPS25. Ribosome dysfunction is an early event in AD 27, and the abnormal tau-ribosomal interactions in tauopathy lead to a decrease in RNA translation 28. Recent studies reported that reducing ribosomal protein S6 kinase 1 expression improves spatial memory and synaptic plasticity in a mouse model of AD 29, and that there are striking overlaps between non-steroidal anti-inflammatory drug (NSAID)-induced changes and gene expression in the blood of AD patients in the ribosome and oxidative phosphorylation pathways 30. A novel mutation discovered in the gene NDUFA1 may also lead to a progressive mitochondrial complex I specific neurodegenerative disease 31.
TYK2 and STAT3 were identified as up-regulated DEGs in both blood and brain (Supplementary Table 2). Tyk2/Stat3 signalling mediates beta-amyloid-induced neuronal cell death in AD 32. TYK2 encodes a tyrosine kinase that belongs to the Janus kinase (JAK) protein family, and inhibition of JAK1/JAK3 may provide an effective therapeutic approach for the treatment of inflammatory diseases 33, which might benefit AD patients as well, since inflammation drives progression of AD 34. It is interesting to note that TCIRG1 showed a greater than 20% up-regulation in blood of AD, blood of MCI and brain of AD. Mutations in this gene can cause lower absolute neutrophil count and may be responsible for infantile malignant osteopetrosis (IMO) disease 35,36. However, its role in AD or dementia is not yet proven, and it may be related to neutrophil function and immunity.
We observed, through enrichment analysis, that DEGs in blood have a high potential to be identified as DEGs in the brain prefrontal cortex (PFC) region. Table 2 shows that DEGs in the brain PFC, Superior Temporal Gyrus (STG) and Inferior Temporal Gyrus (ITG) regions are commonly DEGs in blood. Few DEGs were identified in the brain hippocampus (HIP) region, which may be due to the large shrinkage in HIP that radically reduces gene expression, and these DEGs have a low likelihood of being identified as DEGs in blood. It is well known that the hippocampus, a critical region for learning and memory, is especially vulnerable to damage at early stages of AD; hippocampal volume is one of the best AD biomarkers for diagnosis. The brain temporal cortex, including STG, ITG, HIP, etc., plays a critical role in cognitive processes, language comprehension, memory formation and recall 6. Functional segmentation analysis revealed that AD patients exhibit stronger hippocampus-PFC functional connectivity 37. Overall, 27.8% of all the DEGs in brain (1544/5552) are also DEGs in AD blood with a significant enrichment (OR = 1.27, 95%CI: 1.18-1.38, pval = 9.8e-10, Fisher test); 2154 DEGs in brain are also DEGs in MCI blood with an enrichment (OR = 1.44, 95%CI 1.34-1.55, pval = 2.2e-16, Fisher test). This shows that gene expression in the blood is a strong representation of gene expression in the brain.
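The enrichment statistics quoted above (odds ratio and Fisher test) follow from a 2x2 overlap table. The sketch below shows the arithmetic in plain Python under stated assumptions: it uses the sample odds ratio and a one-sided exact test, which may differ from the estimator and sidedness used in the study, and the small 2x2 table is hypothetical because the gene-universe counts are not given in the text. Only the 1544/5552 overlap fraction comes from the text.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio for the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value: P(overlap >= a) under the hypergeometric null."""
    row1 = a + b          # e.g. genes that are DEGs in brain
    col1 = a + c          # e.g. genes that are DEGs in blood
    n = a + b + c + d     # all genes considered
    total = comb(n, col1)
    upper = min(row1, col1)
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1))
    return tail / total


# Overlap fraction reported in the text: 1544 of 5552 brain DEGs are AD-DEGs in blood.
overlap_percent = round(100 * 1544 / 5552, 1)

# Hypothetical table (not the study's counts):
# rows = DEG in brain yes/no, columns = DEG in blood yes/no.
a, b, c, d = 8, 2, 1, 5
or_overlap = odds_ratio(a, b, c, d)
p_overlap = fisher_greater(a, b, c, d)
print(overlap_percent, or_overlap, p_overlap)
```

An odds ratio above 1 with a small p-value indicates that the two DEG lists overlap more often than expected by chance, which is how the OR values in this section should be read.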
Mitochondrial dysfunction and oxidative phosphorylation have previously been identified as dysregulated pathways in AD/MCI blood, AD brain and ageing brain, showing the relevance of mitochondrial function in AD 38. In our present study, we also found strong evidence for dysregulation of the mitochondrial and oxidative phosphorylation pathways in the blood of patients with AD and MCI.
IGAP provides a powerful data resource for the study of AD and it has been explored by several research teams 39,40. To our knowledge, our study is the first to integrate IGAP with datasets from the blood of AD, blood of MCI and brain of AD. Moreover, a recent trans-ethnic GWAS identified five novel AD risk genes 41 and three of them (TPBG, PFND1/HBEGF, BZRAP1-AS1) were MAGMA genes in our study. Fourteen out of 39 previously identified risk genes of AD were identified as DEGs in at least one brain region of this disease, including MAPT, APP, PSEN1 and ABCA7. Genes simultaneously differentially expressed in several brain regions may be AD-relevant risk genes. For example, AKAP9 was identified as a DEG in eight brain regions including the hippocampus, and two rare mutations in this gene were recently discovered as AD-associated loci by whole exome sequencing 42. This gene is also at the significance border in blood (BH.pval = 0.033 and 0.012 for AD and MCI respectively). Moreover, Low et al. discovered that variants of NEBL are relevant to atrial fibrillation (AF) susceptibility 43, and AF is recognized as a risk factor for cognitive decline and dementia 44; NEBL was identified as a DEG in eight brain regions.
Discovering biomarkers in blood for the diagnosis of AD at the earliest and mildest stages is a clinical need and would be hugely beneficial. Recently, Nakamura and colleagues demonstrated the ability of the amyloid-β precursor protein APP669-711/Aβ1-42 and Aβ1-40/Aβ1-42 ratios, and their composites in plasma, to predict brain amyloid-β burden with very high performance 45. Despite the relatively expensive IP-MS measurement method used, their results bring new hope for blood biomarker-based early diagnosis of AD.
In this study, we identified an optimal classification panel of four features, Full4set, by the LASSO feature selection approach. By applying classifiers with Full4set, 75.4% and 72.7% of MCI were predicted as AD in GSE63061 and GSE63060 respectively (Supplementary Table 7). All features in Full4set were DEGs in blood, and this small feature size panel may have the potential to be applied in Point-of-Care (PoC) diagnostic devices that will be developed and validated in the future.
Our study has a number of limitations. For the two blood datasets (GSE63060 and GSE63061), which are the main focus of this study, we applied multiple-testing correction for DEG identification. However, for the two validation blood datasets and the multi-region brain dataset, no DEGs could pass multiple-testing correction (BH.pval > 0.05), i.e. no significant genes were identified after allowing for multiple testing. We were therefore forced to apply a nominal p-value with a more stringent significance level (<0.01) for DEG detection. The sample sizes used in previous transcriptomic and proteomic studies of AD were generally small, particularly in post-mortem brain studies. Therefore, there was limited power to identify dysfunctional genes. We observed that most of our DEGs had small effect sizes, and the small sample sizes (particularly in the brain studies) gave us low statistical power, which resulted in a high level of false positives for DEG detection when nominal p-values were applied. Applying multiple-testing correction may lose information, and alternative network-based approaches could be applied for biomarker discovery 4,46. In addition, more accurate and sensitive techniques are required to measure such gene expression, for instance droplet digital polymerase chain reaction (ddPCR) 47 and RNA-seq 48. Aside from sample size, another limitation is that the classification effect of any genetic risk factors was not taken into account because this information was not available, e.g. for APOE, which may be the most important genetic risk factor for AD 49. This may be a major limitation, as the presence of the APOE4 allele has been shown to influence classification algorithms based on medical imaging and cerebrospinal fluid (CSF) biomarkers 50 (and in our unpublished work). Moreover, our classification model only included gene transcript information, and the effects of age and gender were adjusted for during data pre-processing.
Finally, although AUC-ROC together with Sensitivity/Specificity are frequently used as performance measurements in biomedical research, for example recently in Nakamura and colleagues' study 45, it has been reported that Precision/Recall and Area Under Precision Recall (AUPR) can provide more information for imbalanced datasets 51. We applied ROC with class-weight adjustment in our model training process, and so we compared these results to those obtained using AUPR to assess the effect of data imbalance (please see Supplementary Fig. 8 and Table 6). In general, AUPR values are somewhat lower than AUC-ROC values, indicating the effect of data imbalance in our case, and there may be room to improve classification performance by applying AUPR in the feature selection process.
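The difference between AUC-ROC and AUPR on imbalanced data can be seen in a small sketch. This is an illustration only: the study computed AUPR with the PRROC R package, whereas the code below uses average precision as a simple stand-in estimator of the area under the precision-recall curve, and the labels, scores and function names are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC-ROC as the fraction of (case, control) pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values taken at the rank of each true case."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / n_pos


# Hypothetical imbalanced test set: 2 cases and 6 controls.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(auc, ap)
```

In this made-up example the precision-recall summary is lower than the AUC-ROC because two controls outrank one of the two cases, and with few cases each such error costs more precision than it costs in the ROC ranking.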
In conclusion, our study revealed that genes differentially expressed in the blood were likely to be differentially expressed in the brain, and with the same regulation direction. Common pathways were identified and found to be shared among brain AD, blood AD and ageing brain. We also identified a four-feature panel classification model that discriminated between AD patients and controls with promising performance. A larger cohort study is now necessary to validate the reproducibility of this model's results, perhaps using target-based transcriptional measurement.

Data Availability Statement
This link provides seven datasets: Two initial datasets downloaded from GEO (GSE63060_series_matirx.txt, GSE63061_series_matrix.txt); one merged dataset for DEGs analysis (gse63060_61.merged.exp); two central-scaled datasets for training and testing ML models (files contain 22756 features and disease status for each sample: gse63060_ADMCICtr_Residual_normT_lab.txt, gse63061_ADMCICtr_Residual_normT_lab.txt); and two information files (Samples_gse63060.info, Samples_gse63061.info) extracted from the two GEO datasets. https://figshare.com/s/78839db30d17d3f75aca.