Insights into the changes in the proteome of Alzheimer disease elucidated by a meta-analysis

Mass spectrometry (MS)-based proteomics is a powerful tool to explore pathogenic changes of a disease in an unbiased manner and has been used extensively in Alzheimer disease (AD) research. Here, by performing a meta-analysis of high-quality proteomic studies, we address which pathological changes are observed consistently and therefore most likely are of great importance for AD pathogenesis. We retrieved datasets, comprising a total of 21,588 distinct proteins identified across 857 postmortem human samples, from ten studies using labeled or label-free MS approaches. Our meta-analysis findings showed significant alterations of 757 and 1,195 proteins in AD in the labeled and label-free datasets, respectively. Only 33 proteins, some of which were associated with synaptic signaling, had the same directional change across the individual studies. However, despite alterations in individual proteins being different between the labeled and the label-free datasets, several pathways related to synaptic signaling, oxidative phosphorylation, immune response and extracellular matrix were commonly dysregulated in AD. These pathways represent robust changes in the human AD brain and warrant further investigation.


Introduction
Mass spectrometry (MS)-based proteomics is a powerful technique, as it allows a simultaneous identification and quantification of proteins in complex biological samples such as brain tissue. These studies typically use a bottom-up approach in which proteins are first digested, the resulting peptides are then analyzed by liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS), and lastly the generated mass spectra of the peptide ions are compared against the theoretical spectra from databases for protein identification. Two main approaches, labeled or label-free, are commonly used for quantification of the relative abundance of the identified proteins. In the labeling strategies, such as tandem mass tags (TMTs) or isobaric tags for relative and absolute quantification (iTRAQ), stable tags are chemically attached to free amine on the N-termini of the peptides and lysine side-chains 1,2 . The principle of isobaric labeling strategies relies on the fact that labeled peptides

Results
Out of 10 studies using labeled (TMT, iTRAQ and 18 O labeling) 8-12,14 or label-free 6,7,19 MS approaches, 18 datasets that analyzed the proteome of frontal and temporal cortices were retrieved. Data containing isoform-specific information were combined by using the UniProtKB accession number as a common identifier between studies. This resulted in a dataset with 21,588 distinct protein IDs, 533 of which were identified in all 857 postmortem human samples. Protein intensities from the retrieved datasets were standardized (i.e., allowing 20% missing values per group, log2 conversion, median 0 and standard deviation 1). As sources of variance could arise from differences in biological samples and methodology (e.g., instrumentation, database search platforms, application of batch correction or other regression models), a random-effects-model was used to calculate the effect size (Fig. 1).

Samples of labeled vs label-free datasets appear in distinct clusters.
To investigate the largest variability between samples and to determine whether there were clustering patterns that could be explained by some of the methodological differences among the selected studies, principal component analysis (PCA) was performed using the log2 protein intensities of 533 proteins quantified in all 857 samples (547 AD and 310 control cases). Two main clusters pertaining to labeled or label-free MS approaches (Fig. 2a) suggested that relative protein intensities were specific to the MS method of choice, which may partly be explained by an underestimation of the fold change in the labeled samples due to ratio compression. No clear separation was observed by dataset (Fig. 2b), diagnosis (AD vs control, Fig. 2c), brain region (frontal vs temporal lobes, Fig. 2d) or lysis buffer (SDS- vs urea-based, Fig. 2e). The top 10 proteins contributing to the first (PC1) and second (PC2) components are also shown in Fig. 2f. Intriguingly, two datasets from Bai et al. 9 clustered further away from all other labeled datasets. While this cannot be explained by the choice of labeling approach, it is possible that differences in methodology (e.g., cell-type correction) could result in this disparity 9 .
Differences and similarities between the labeled and the label-free datasets. Based on the distinct sample distributions observed above (Fig. 2a), we performed separate meta-analyses for labeled and label-free datasets. The labeled data consisted of 17,296 distinct proteins quantified across 225 (123 AD and 102 control) cases while the label-free data contained 4,292 distinct proteins quantified across 632 (424 AD and 208 control) cases (Online-only Table 1). When comparing the total protein identifications between the labeled and the label-free data, 3,731 proteins were found to be shared (Fig. 3a). The input data that was used for meta-analyses can be found at the figshare database 20 .
The meta-analysis highlighted that 757 (391 down- and 366 upregulated) proteins were significantly altered in AD in the labeled data (p-value < 0.005 and false discovery rate (FDR) < 10%), compared to 1,195 (634 down- and 561 upregulated) proteins in the label-free data (p-value < 0.03 and FDR < 10%) (Fig. 3b). The findings of both meta-analyses can be found at the figshare database 20 . Notably, among significant alterations, 604 proteins (80%) were originally quantified in two or more labeled datasets and 1,038 proteins (87%) were quantified in more than one label-free dataset 20 .
Despite the large number of shared proteins between the labeled and the label-free datasets (Fig. 3a), only 71 proteins (1.9%) were found to be significantly altered in AD in both datasets (Fig. 3b). Of these, 33 proteins (46%) showed consistent alterations in AD; in other words, the mean difference between AD and control, computed by meta-analyses, was in line with the fold changes found in individual studies (Fig. 3c). Several of these proteins were found to be implicated in synaptic signaling pathways, apoptotic signaling and proteasomal protein catabolic processes. Importantly, our analysis led to the identification of novel proteins involved in AD (Fig. 3c), such as GDNF family receptor alpha-2 (GFRA2), voltage-dependent L-type calcium channel subunit beta-3 (CACNB3), utrophin (UTRN), sodium-dependent phosphate transporter 2 (SLC20A2) and synaptotagmin-11 (SYT11).
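The consistency criterion described above can be illustrated with a minimal Python sketch. It is not the authors' R code: the function name `is_consistent` is hypothetical, and it assumes that datasets in which a protein was not quantified (given as `None`) are simply ignored, which the text does not state explicitly.

```python
def is_consistent(pooled_md, study_log2fc):
    """Return True if every available per-study fold change has the same
    sign as the pooled mean difference from the meta-analysis.

    pooled_md    -- pooled mean difference (AD minus control)
    study_log2fc -- per-dataset log2 fold changes; None = not quantified
                    (assumption: such datasets are skipped)
    """
    observed = [fc for fc in study_log2fc if fc is not None]
    if not observed or pooled_md == 0:
        return False  # nothing to compare, or no direction to compare against
    pooled_up = pooled_md > 0
    # A fold change of exactly 0 has no direction and counts as inconsistent.
    return all(fc != 0 and (fc > 0) == pooled_up for fc in observed)
```

Under these assumptions, a protein with a negative pooled mean difference and negative fold changes in every dataset that quantified it would be counted among the consistently downregulated proteins.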
Gene set analysis (GSA) identifies common pathways in the meta-analysis of the labeled and the label-free data. To put single protein alterations into biological context, we performed GSA 21 . All proteins (in gene-centric format) were subjected to GSA (for input data, see 20 ), even though some of them were originally quantified in only one dataset or had opposite directional changes between meta-analyses and individual studies.
By using the KEGG 2019 database, GSA showed that synaptic signaling (e.g., synaptic vesicle cycle), oxidative phosphorylation and pathways related to neurodegenerative disorders, such as AD and Huntington disease (HD), were significantly downregulated in AD in the labeled data (FDR < 5%, Fig. 4a). In addition, complement and coagulation cascade and ECM-receptor interaction pathways were significantly upregulated in AD in the labeled data. Similar changes in biological processes were detected in the label-free data, with the exception that the glycolysis/gluconeogenesis pathway was significantly upregulated in AD (Fig. 4b). Similarly, using the GO biological process database, GSA highlighted pathways related to synaptic signaling, mitochondrial metabolism, ECM organization and immune response, thereby strengthening the notion that these pathways are involved in disease pathogenesis. All GSA findings are deposited in the figshare database 20 .

Discussion
We performed a meta-analysis of proteomic studies with the aim of identifying proteins and pathways that are robustly related to AD pathogenesis as well as elucidating discrepancies between the studies. Three different labeling techniques were employed in the original studies: TMT, 18 O and iTRAQ. Since sample distributions pertaining to these techniques showed a high degree of overlap in the PCA plot, we collectively refer to them as labeled datasets. In particular, by meta-analyzing the labeled and the label-free MS datasets separately and comparing the results, our study provides further knowledge on how the observed proteome might change depending on the MS method of choice. Given that these two MS approaches rely on different protein quantification methods 22 , it is not surprising that PCA showed a clear separation; however, we cannot rule out that other factors might also contribute to this clustering pattern.
Fig. 1 The workflow of the study. After literature search, 18 datasets from ten different MS-based proteomic studies, using either labeled or label-free quantification, were retrieved. These studies were done on postmortem human brain tissue, which was categorized into frontal or temporal lobes, regions severely affected by AD pathology, and consisted of AD (5 ≤ n ≤ 252) and control (5 ≤ n ≤ 94) cases. Subsequently, data pre-processing was done so that datasets would be comparable for further statistical analyses. Sample distributions and the presence of any clustering patterns were assessed using principal component analysis (PCA). The datasets generated by labeled and label-free MS approaches were concatenated separately, and meta-analysis using a random-effects-model was performed. Lastly, for better biological interpretation of our findings, gene set analysis was performed.
Our meta-analysis findings show that the observed protein alterations were clearly different between the two techniques, since only 71 significantly altered proteins (FDR < 10%) were shared. Several reasons may explain why these 71 proteins stood out from the other significant alterations: they may be robustly quantified independent of the MS approach, or they may be insensitive to discrepancies in sample preparation procedures. To get a better understanding of these 71 significantly altered proteins, we further considered whether the mean difference (i.e., the main outcome of meta-analysis) and the fold changes found in each individual dataset followed the same direction, i.e., the protein was either upregulated or downregulated across datasets. We found 22 consistently downregulated and 11 consistently upregulated proteins emerging as the most robust changes occurring in AD brain (Fig. 3c). Importantly, by performing this large dataset analysis, we were able to find novel proteins involved in AD, some of which were involved in synaptic or cell adhesion pathways, which will be discussed below. For instance, we detected upregulation of SLC20A2, which is involved in phosphate transport by absorbing phosphate from interstitial fluid. Interestingly, mutations in this gene are reported to cause primary familial brain calcification, which is often associated with movement disorders 23 . In addition, our findings validate the previously reported data on AD-related proteins. For instance, the astrocytic marker glial fibrillary acidic protein (GFAP) is well known to be upregulated in AD brain as part of concurrent gliosis. Clusterin (CLU) has been identified as a genetic risk factor for sporadic AD 24,25 . Compelling evidence suggests that CLU binds to the amyloid β-peptide (Aβ) and affects its deposition and clearance (see review by Foster et al. 26 ). Interestingly, Wojtas et al. recently demonstrated that CLU affects not only Aβ pathology 27 but also tau pathology 28 . Another interesting protein from the heatmap is the mitochondrial protein DNM1L (also known as DRP1), which is vital for mitochondrial fission events. Several studies reported a relationship between Aβ-mediated toxicity, DRP1 function and mitochondrial dysfunction 29,30 . In addition, a recent study suggested that DRP1 is required for proper synaptic function in CA1 hippocampal neurons 31 . Several proteins related to synaptic signaling were consistently altered in AD brain across datasets. Among them, synaptotagmin-1 (SYT1) and synaptotagmin-7 (SYT7), acting as calcium sensors triggering neurotransmitter release, are reported to interact with amyloid precursor protein (APP) and presenilin 1 (PS1), a component of γ-secretase that mediates the final cut in APP processing 32-35 . Hyperpolarization-activated cyclic nucleotide-gated channel 1 (HCN1), important for regulating neuronal activity, has been studied extensively in epilepsy research but not much in AD research. However, decreased levels of HCN1 have been observed in AD brain, in line with our finding, and loss of HCN1 gene function was reported to increase Aβ levels in mouse brain 36 . Recent studies identified dipeptidyl aminopeptidase-like protein 6 (DPP6), which modulates the activity of potassium channels, as a novel genetic factor contributing to dementia 37,38 . Functional studies suggested that DPP6 is important for synaptic structure, hippocampus-dependent learning, and memory 39 . Our meta-analysis confirms that DPP6 is of importance for AD pathogenesis and merits further studies.
Fig. 3 Summary of the meta-analysis findings. Venn diagrams showing (a) the total number of identified proteins (with distinct protein IDs) shared between the labeled and the label-free datasets, and (b) the statistically significant proteins with FDR < 10% identified by the meta-analysis of the labeled and the label-free datasets. Despite a large overlap found between the two data subsets (3,731 proteins), only 71 proteins were significantly altered in AD in both meta-analyses. The top 10 most significantly altered proteins are shown next to the Venn diagram. *These proteins were among the shared proteins. (c) Out of 71 significantly altered proteins, 33 were identified as the most robust changes in AD, since the direction of change remained the same across labeled and label-free datasets. Some of these proteins were found to be involved in synaptic signaling, apoptotic and proteasomal protein catabolic processes. Proteins that showed no statistically significant difference between AD and controls as well as the ones that were not identified in the original dataset are indicated by the white boxes.
While the observations on an individual protein level are interesting, we sought to test whether and how the meta-analyses of labeled and label-free datasets would show similar patterns in a wider perspective. Notably, our findings indicate that common biological processes were dysregulated in AD brain, even though the detectable individual protein alterations were clearly different between the two MS approaches. For instance, synaptic processes, such as synaptic vesicle cycle (KEGG hsa04721) and chemical synaptic transmission (GO:0007268), were significantly downregulated in AD. This is not surprising considering that synaptic dysfunction occurs early in AD pathogenesis and strongly correlates with cognitive decline 40 . Examples of synaptic proteins involved in these processes included components of the adaptor protein complex 2 (AP2A2 and AP2M1), disk large-associated protein 1 (DLGAP1), glutamate receptors (GRIA2-4 and GRM3), neuronal pentraxin-2 (NPTX2), RAB3A, SNAP25, SYT1 and V-ATPase subunit F (ATP6V1F). As illustrated by the heatmap, the synaptic proteins NPTX2 and SYT1 were consistently decreased in AD brain across studies, representing some of the most robust changes. Interestingly, AP2A2, involved in clathrin-dependent endocytosis, was among the proteins contributing most to the clustering pattern observed in the PCA plot. Many studies have reported a reduction in synaptic protein expression (e.g., SNAP25, GRIA2) in AD brain and shown that such a decrease was correlated with an increased rate of cognitive decline 8,41,42 . Besides the above-mentioned synaptic proteins, we also detected robust downregulation in the levels of GFRA2, which is a receptor for the neurotrophic factor neurturin. A recent genome-wide association study identified the GFRA2 locus as a potential risk modifier of frontotemporal dementia, proposing that GFRA2-related processes may hold potential as therapeutic targets 43 . Unlike other synaptotagmins, SYT11 does not bind calcium while mediating vesicle trafficking 44 , and it is also interesting that SYT11 is the only upregulated synaptic protein in our meta-analyses.
Another prominently downregulated pathway was related to oxidative phosphorylation (KEGG hsa00190) and mitochondrial energy metabolism. Mainly proteins of the electron transport chain (e.g., NDUFS1, NDUFS7, NDUFA7) were associated with these pathways. Altered mitochondrial dynamics and bioenergetic metabolism are commonly observed in neurodegenerative disorders 45 . This is not surprising since activities such as synaptic transmission and synaptic vesicle cycle demand high metabolic energy that is tightly regulated by mitochondria 46 . It is noteworthy that HD was the pathway with the highest number of protein hits. In this regard, it is interesting to note that our previous immunohistochemical studies have shown increased levels of huntingtin in pyramidal neurons both in AD brain 47 and in an AD mouse model (APP-NLF) 48 . Many of the altered proteins were mitochondrial, particularly those involved in the respiratory chain. There were also significant effects on proteins involved in clathrin-mediated endocytosis, as well as on postsynaptic receptors and signaling molecules. These findings support that HD and AD partially share disease pathways involved in mitochondrial function, clathrin-mediated endocytosis, postsynaptic function and signaling.
Several pathways were found to be significantly upregulated in AD, such as complement and coagulation cascade (KEGG hsa04610), cytokine-mediated signaling (GO:0019221), which reflects the presence of neuroinflammation, ECM-receptor interaction (KEGG hsa04512) and ECM organization (GO:0030198), indicating changes in the ECM. The ECM provides structural support that is essential for the cells and regulates many cellular processes such as neurogenesis, axonal outgrowth, synaptic plasticity, and immune response 49 . In line with previous findings 15,50 , we mainly detected upregulation of the ECM components, e.g., collagen, CD44, tenascin, integrin alpha-6, versican core protein and fibronectin-1. In addition, we detected robustly increased levels of UTRN (or dystrophin-related protein 1), which mediates the interaction between the plasma membrane, the cytoskeleton and the ECM.
Our study is the largest meta-analysis study analyzing proteomic data from multiple research centers. Thus far, only a few studies compared proteomic datasets in a systematic manner and identified several proteins as promising targets for maintenance of cognitive resilience 42,51 . Similar to the observations from Wingo et al. 42 , we found decreased levels of synaptic (e.g., GRIA2, AP2A2, AP2B1, BAIAP2, DMXL2, DLG4, SYNPO) and mitochondrial proteins (e.g., NDUFS1, PDHA1) but increased levels of proteins involved in myelination and apoptosis processes (e.g., GFAP, GSN, NEFL) in AD brain. Altogether these findings corroborate the involvement of these proteins and processes in disease pathogenesis.
Considering that proteomics data extensively relies on experimental setups such as sample preparation 52 , clinical characteristics of subjects, statistical tests of choice or even cut-offs for statistical significance, a meta-analysis could provide valuable and coherent information that a single dataset analysis could not. However, this study is not without challenges. From a statistical perspective, the random-effects-model could compute the mean difference between AD and control groups, even when a given protein was originally quantified in only one dataset or when proteins had originally opposite directional changes. This is possible because the statistical model gives larger weight to the datasets that contain larger effect size, larger sample size, and lower within-study variability. With that said, from a biological perspective, it is important and relevant to pinpoint proteins with consistent alterations across individual datasets, as they could represent the most robust alterations associated with disease. In this study, we address both perspectives by reporting all significantly altered proteins in the labeled (n = 757) and the label-free (n = 1,195) datasets as well as the robustly altered ones (n = 33). It is important to note that our study does not allow any direct comparison between labeled and label-free MS approaches, since the same biological samples were not studied using both methods. However, the observation of two distinct clusters pertaining to labeled and label-free datasets provided good reason for our stratification wherein we performed two different meta-analyses. At the protein level, our findings suggest that different sets of proteins were significantly altered in the labeled dataset compared to the label-free dataset. 
It is thus important to emphasize that this could be influenced by how each individual study has dealt with protein inference, which could overestimate the number of distinct protein IDs as well as underestimate commonalities between the datasets. We also noted that the total number of proteins (17,296 vs 4,292 proteins) and sample size (225 vs 632 cases) were rather different between the two methods, which could influence the meta-analysis outcome. Nevertheless, we observed consistent alterations at the pathway level. The analysis of raw MS data in a common pipeline, using the same database search engines for protein identification and quantification, could remove some of the methodological differences between the studies. However, this is not without challenges, especially when applied to such quantities of data.
Given the complex and multifactorial nature of AD, disentangling pathological mechanisms is of the utmost importance for development of treatment strategies for this debilitating disorder. This study provides a comprehensive analysis of 18 proteomic datasets and offers novel insights into single protein alterations related to AD. Depending on the MS method of choice (labeled vs label-free), different proteins appeared to be significantly dysregulated in AD. Nonetheless, pathway analyses of labeled and label-free MS datasets illustrated that processes related to synaptic signaling, oxidative phosphorylation, immune response and ECM were commonly dysregulated in AD. These observations are largely in line with previous reports, indicating that these mechanisms are central to AD pathogenesis.

Methods
Study selection and search strategy. Literature search on MS-based proteomic studies in AD was done using the following string "Alzheimer* AND proteome AND (proteomic OR mass spectrometry) AND "human brain"" on April 2, 2020 in PubMed (https://pubmed.ncbi.nlm.nih.gov/). The PRISMA statement was followed during study design 53 . Out of 95 articles, 16 datasets were manually retrieved and curated from high-quality studies performed in seven different research groups 6-12,14,19 (Online-only Table 1), based on the following criteria: (i) sample size of at least five cases per diagnosis group (AD and control), (ii) minimum number of 1000 quantified proteins, (iii) studies in which quantified protein intensities were reported for each individual biological sample, (iv) studies that are publicly available in repositories, and (v) studies in which the proteome of frontal and temporal cortices, regions severely affected by AD-related pathology, was analyzed. In addition, two unpublished datasets (Schedin-Weiss et al. in preparation) from our research group were included in this study, as they met the above-mentioned inclusion criteria (for pre-processed datasets, see 20 ). Lastly, studies using pooled biological samples or performing a targeted MS approach were excluded from the meta-analysis.
Neuropathological evaluation of neurofibrillary tangles (Braak staging) 54 and neuritic plaques (CERAD criteria) 55 was done in all postmortem human brain tissue included in the above-mentioned studies. While AD cases often had higher Braak stages (IV-VI), control subjects presented little or no pathological alterations beyond normal age-appropriate changes (0-III). Cases with non-AD pathological changes were reported to be excluded in these studies.
Data pre-processing. The protein identifiers were manually set to UniProtKB accession numbers in each dataset. Each dataset was standardized in the same manner by allowing 20% of missing values per group, converting protein intensities to log2 scale, applying median-centered normalization and scaling (median 0 and standard deviation 1) so that they would be comparable for further statistical analyses. Subsequently, sample distributions were visually inspected by PCA for each dataset as well as for the concatenated data, consisting of 533 proteins quantified in all 857 samples and all 18 datasets. All analyses were performed in R (version 4.0.1).
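The standardization steps above can be sketched as follows. This is an illustrative Python sketch rather than the R code used in the study; the function names are hypothetical, and it assumes that the 20% threshold is applied per protein within each diagnosis group and that centering and scaling are applied per sample, neither of which is specified in the text.

```python
import math
import statistics

def keep_protein(values, groups, max_missing=0.20):
    """Return True if, within every group, at most `max_missing` of the
    protein's values are missing (missing values are given as None)."""
    for group in set(groups):
        in_group = [v for v, g in zip(values, groups) if g == group]
        missing = sum(v is None for v in in_group)
        if missing / len(in_group) > max_missing:
            return False
    return True

def standardize_sample(intensities):
    """log2-transform one sample's raw intensities, then shift to median 0
    and scale to standard deviation 1. Missing values stay None.
    Needs at least two distinct observed values."""
    logged = [None if v is None else math.log2(v) for v in intensities]
    observed = [v for v in logged if v is not None]
    median = statistics.median(observed)
    sd = statistics.stdev(observed)  # sample standard deviation
    return [None if v is None else (v - median) / sd for v in logged]
```

Because the standard deviation is unaffected by shifting, centering on the median first and then dividing by the standard deviation yields values with median 0 and standard deviation 1 at the same time.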
Meta-analysis by random-effects-model. The meta-analysis was performed using the meta package (version 4.13) and the metacont function 56 . A random-effects-model, using the DerSimonian and Laird method 57 , was selected, due to methodological differences noted between the selected studies, such as clinical characteristics of postmortem human samples, brain region, sample preparation and LC-MS/MS experiments (Online-only Tables 1 and 2). For each protein, the effect size (i.e., mean difference between AD and control) was computed by taking into account the following parameters: mean of protein intensities per protein per group, standard deviation of protein intensities per protein per group and sample size per group. Particularly, this allowed accounting for group differences even when proteins showed opposite directional changes between datasets or had not been quantified in all datasets. The p-values were corrected for multiple hypothesis testing using Benjamini-Hochberg method and considered significant at ≤10% FDR.
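To make the calculation concrete, the following Python sketch pools per-dataset mean differences for one protein with the DerSimonian and Laird estimator and then adjusts p-values with the Benjamini-Hochberg procedure. It is an illustration only, not the meta package implementation: the function names are hypothetical, and it assumes an unpooled variance for each mean difference and a normal approximation for the p-value.

```python
import math

def dersimonian_laird(studies):
    """Pool mean differences (AD minus control) across datasets.

    studies -- list of (mean_ad, sd_ad, n_ad, mean_ctl, sd_ctl, n_ctl)
    Returns (pooled_md, standard_error, tau2, two_sided_p).
    """
    y = [m1 - m2 for m1, _, _, m2, _, _ in studies]
    # Assumption: unpooled variance of each mean difference.
    v = [s1 ** 2 / n1 + s2 ** 2 / n2 for _, s1, n1, _, s2, n2 in studies]
    w = [1.0 / vi for vi in v]                    # inverse-variance weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum_w
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    k = len(studies)
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    # Between-study variance; set to 0 when only one dataset is available.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_star = [1.0 / (vi + tau2) for vi in v]      # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    p = math.erfc(abs(pooled / se) / math.sqrt(2))  # normal approximation
    return pooled, se, tau2, p

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):                  # largest p-value first
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With a single dataset the between-study variance cannot be estimated and is set to zero here, so the pooled value reduces to that dataset's own mean difference. This is one way a pooled estimate can exist for proteins quantified in only one dataset.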

Gene set analysis.
To gain better insights into the biological processes underlying AD pathogenesis, GSA was performed using the piano R package 21 . Most of the selected proteomic studies were protein-centric, thereby containing the isoform-specific information. Prior to GSA, proteins from the meta-analyses were converted to genes, giving rise to multiple values for the same gene. If isoforms showed consistent mean difference in AD, those with the least significant changes were then excluded from GSA. On the other hand, all isoforms that did not follow the same direction of change were excluded from GSA. Thus, a reduced number of genes were subjected to GSA, compared to the protein list obtained from the meta-analysis 20 . Subsequently, the UniProtKB accession numbers were converted to gene symbol using the UniProt Retrieve/ID mapping tool. All genes together with mean difference and p-value, which were computed by the random-effects-model, were subjected to GSA, thus allowing for a comparison between significant alterations and all identified genes (background) from our meta-analysis data. Two databases, KEGG 2019 and GO biological process, were used as retrieved from Enrichr 58,59 . Minimum number of genes associated with a given geneset was set to five. To get a better understanding of how pathways could be dysregulated in AD, distinct upregulation and distinct downregulation categories were selected. Pathways with FDR < 5% were considered as statistically significant.
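The isoform-handling rule described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' script: `collapse_isoforms` is a hypothetical name, the inputs are assumed to be already mapped from UniProtKB accessions to gene symbols, and a mean difference of exactly zero is treated as not upregulated.

```python
from collections import defaultdict

def collapse_isoforms(records):
    """Reduce isoform-level meta-analysis results to one entry per gene.

    records -- iterable of (gene_symbol, mean_difference, p_value),
               one tuple per protein isoform
    Returns {gene_symbol: (mean_difference, p_value)}.
    """
    by_gene = defaultdict(list)
    for gene, mean_diff, p_value in records:
        by_gene[gene].append((mean_diff, p_value))

    collapsed = {}
    for gene, isoforms in by_gene.items():
        directions = {mean_diff > 0 for mean_diff, _ in isoforms}
        if len(directions) > 1:
            continue  # isoforms change in opposite directions: drop the gene
        # Same direction: keep the most significant isoform, drop the rest.
        collapsed[gene] = min(isoforms, key=lambda iso: iso[1])
    return collapsed
```

Genes represented by a single protein entry pass through unchanged, so only genes with several isoforms are affected by the filter.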
To support data sharing and reusability, all fully processed individual datasets are available on GitHub at https://github.com/hazhay/Meta-analysis_AD. In addition, input datasets used for meta-analyses and for GSAs as well as their findings are available at the figshare database associated with this article 20 .

Code availability
The R script used in this research is publicly available on GitHub (https://github.com/hazhay/Meta-analysis_AD).