A beginner’s guide into curated analyses of open access datasets for biomarker discovery in neurodegeneration

The discovery of surrogate biomarkers reflecting neuronal dysfunction in neurodegenerative diseases (NDDs) remains an active area of research. To support these efforts, we demonstrate the utility of publicly available datasets for probing the pathogenic relevance of candidate markers in NDDs. As a starting point, we introduce readers to several open access resources that contain gene expression profiles and proteomics datasets from patient studies in common NDDs, including proteomics analyses of cerebrospinal fluid (CSF). We then illustrate the method for curated gene expression analyses across select brain regions from four cohorts of Parkinson disease patients (and from one study in common NDDs), probing glutathione biogenesis, calcium signaling and autophagy. These data are complemented by findings on select markers in CSF-based studies in NDDs. Additionally, we enclose several annotated microarray studies and summarize reports on CSF proteomics across the NDDs, which readers can use for translational purposes. We anticipate that this “beginner’s guide” will benefit the NDD research community and will serve as a useful educational tool.


FIGURE S1 a) PICTORIAL OVERVIEW OF CURATED ANALYSES USING GEO2R
1) Search for the dataset at https://www.ncbi.nlm.nih.gov/geo/ by entering the dataset ID (e.g., GSE7621).
2) On the next page, scroll to the bottom and click "Analyze with GEO2R"; this loads the GEO2R interface.
3) Click "Profile Graph".
4) In the box "Enter ID", enter the unique probe ID for a marker of interest and click "Set". A plot will load, as shown below. Next to the plot, click "Sample values".
5) A pop-up table (highlighted by the blue box below) will be displayed.
6) Copy the table and paste it as text into data-processing software.
NOTE: A quicker way to navigate between datasets in the same window is to scroll to the top of the page, enter the dataset accession (e.g., GSE20146, as shown in the red box above) and click "Set". This loads the new dataset. Then enter the unique probe ID for the gene(s) of interest under "Profile Graph" and repeat steps 3-6 above.
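Once the "Sample values" table has been pasted as text (step 6), it can be read programmatically. The following is a minimal Python sketch, not part of the original workflow. It assumes the pasted text is tab-separated with columns named "Sample", "Title" and "Value", and that group membership can be inferred from the sample title; the column names, the grouping rule, the accessions and the numbers are all hypothetical and should be adapted to the table actually displayed.

```python
import csv
import io

# Hypothetical text pasted from the GEO2R "Sample values" pop-up.
# Column names, accessions and values are illustrative assumptions, not real data.
PASTED = """Sample\tTitle\tValue
GSM0000001\tcontrol 1\t8.10
GSM0000002\tcontrol 2\t7.90
GSM0000003\tPD 1\t6.40
GSM0000004\tPD 2\t6.80
"""


def parse_sample_values(text):
    """Return (sample_id, title, value) tuples from tab-separated text."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["Sample"], row["Title"], float(row["Value"])) for row in reader]


def group_by_label(rows, label_of):
    """Collect expression values per group using a caller-supplied labelling rule."""
    groups = {}
    for _sample, title, value in rows:
        groups.setdefault(label_of(title), []).append(value)
    return groups


rows = parse_sample_values(PASTED)
# Assumption: titles of control samples start with the word "control".
groups = group_by_label(
    rows, lambda title: "Ctrl" if title.lower().startswith("control") else "PD"
)
print(groups)
```

Passing the labelling rule as a function keeps the parsing step independent of how a given dataset names its samples.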

FIGURE S2 a) PICTORIAL OVERVIEW OF GLOBAL ANALYSES USING GEO2R
1) Search for the dataset at https://www.ncbi.nlm.nih.gov/geo/ by entering the dataset ID (e.g., GSE7621).
2) On the next page, scroll to the bottom and click "Analyze with GEO2R"; this loads the GEO2R interface.
3) Click "Define Groups" and assign each sample to its respective group by selecting the corresponding samples (while holding the SHIFT key) and then selecting the group. In the example below, we have created a Control and a PD group in GSE7621.
4) Once the groups are assigned, click "Analyze" (highlighted in STEP 3 above). This generates a table of differential gene expression profiles and plots for visualization, all of which can be downloaded.
5) The user can also modify the settings for statistical analyses and data plots; the defaults are shown below.
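The downloaded differential expression table from step 4 can be filtered outside the browser. The sketch below is an illustration only: it assumes a tab-separated file with columns named "ID", "adj.P.Val", "logFC" and "Gene.symbol", which should be checked against the header of the file actually downloaded. The cut-offs (adjusted p ≤ 0.05, |logFC| ≥ 1) are arbitrary example values rather than thresholds used in this study, and the probe IDs, gene symbols and numbers are placeholders.

```python
import csv
import io

# Hypothetical excerpt of a downloaded GEO2R result table (tab-separated).
# Column names are an assumption; all rows are made up for illustration.
TABLE = """ID\tadj.P.Val\tP.Value\tlogFC\tGene.symbol
probe_a\t0.003\t0.00001\t-1.40\tGENE1
probe_b\t0.200\t0.01000\t0.90\tGENE2
probe_c\t0.040\t0.00200\t1.10\tGENE3
probe_d\t0.010\t0.00050\t0.20\tGENE4
"""


def significant_probes(text, max_adj_p=0.05, min_abs_logfc=1.0):
    """Keep probes passing both cut-offs, sorted by adjusted p-value."""
    hits = []
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        adj_p = float(row["adj.P.Val"])
        logfc = float(row["logFC"])
        if adj_p <= max_adj_p and abs(logfc) >= min_abs_logfc:
            hits.append((row["ID"], row["Gene.symbol"], logfc, adj_p))
    return sorted(hits, key=lambda hit: hit[3])


hits = significant_probes(TABLE)
for probe_id, symbol, logfc, adj_p in hits:
    print(probe_id, symbol, logfc, adj_p)
```

To work on a real download, the `TABLE` string would be replaced by the contents of the saved file.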

FIGURE S3 a) PICTORIAL OVERVIEW OF CURATED ANALYSES OF CSF PROTEOMICS
1) Go to https://proteomics.uib.no/csf-pr/ and click "Search protein data".
2) In the Search box, enter the unique identifier (e.g., UniProt ID), select the input type (e.g., "Protein Accession") and the disease category, and then click "Search".
3) A graphical overview of the results will be displayed, which can be viewed for all markers together or for an individual marker. In the chart below, out of the 7 marker queries, 6 were found within the database and 1 was not found.

4) In the chart shown under Step 3, the number in the center of each marker (circle) indicates the number of individual datasets in which that marker has been reported. The relevant datasets are also color coded by disease category. To view the linked data, click "Load". A new interactive interface will appear, as shown below.
In this interface, the icons on the left side (green box above) are arranged as 3 panels:
A - Disease comparison: datasets arranged across the disease categories (and sub-categories). Here, datasets can be included or excluded based on the intended patient populations.
B - Protein Overview: pictorial representation of whether protein detection was increased (red circles), decreased (green circles) or equal/unaltered (blue circles). This menu also offers filtering, i.e., de-selecting markers whose levels were reported to be equal/unaltered between control and disease conditions.
C - Protein Table: based on the selection and filtering in Disease comparison and Protein Overview, this option links each marker to specific studies and the reported trend.
NOTE: Panel A also contains an overview of the datasets and linked publications.
5) In the Protein Table, select the desired marker. A new sub-menu, "Protein Details", will appear (highlighted in the red box below).
6) Clicking "Protein Details" will display the linked dataset(s), including peptide information.
7) Selected data can be downloaded as an Excel file by clicking the icon highlighted in red.
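The filtering described for the Protein Overview panel (de-selecting equal/unaltered markers) can also be reproduced on exported data. The sketch below is a hedged illustration: it assumes the exported records have been reduced to (protein, disease category, reported trend) triples with trends written as "increased", "decreased" or "equal". This layout, the protein names and the trends are hypothetical placeholders and do not reflect the actual export format or content of the database.

```python
from collections import Counter, defaultdict

# Hypothetical records transcribed from an export; the field layout, protein
# names and trends are placeholders, not real database content.
RECORDS = [
    ("PROTEIN_A", "AD", "increased"),
    ("PROTEIN_A", "AD", "equal"),
    ("PROTEIN_A", "MS", "increased"),
    ("PROTEIN_B", "PD", "decreased"),
    ("PROTEIN_B", "PD", "equal"),
    ("PROTEIN_C", "ALS", "equal"),
]


def summarize_trends(records, drop_equal=True):
    """Count the reported trends per protein, optionally ignoring 'equal' reports."""
    summary = defaultdict(Counter)
    for protein, _disease, trend in records:
        if drop_equal and trend == "equal":
            continue
        summary[protein][trend] += 1
    return {protein: dict(counts) for protein, counts in summary.items()}


summary = summarize_trends(RECORDS)
print(summary)
```

With `drop_equal=True`, a marker reported only as unaltered drops out of the summary entirely, which mirrors de-selecting the blue circles in the interface.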
The values across the dataset are expressed relative to the controls in each disease condition, i.e., mean value of control samples = 1 (a.u., arbitrary units). Error bars represent the standard deviation (s.d.). Pairwise comparisons were assessed by the Mann-Whitney test; only significant differences (* = p≤0.05, ** = p≤0.01, *** = p≤0.005) are highlighted. The number of controls and cases, microarray platforms and original studies are included in Table S1. Unique probe IDs within each dataset are included in Table S2. Legend: Ctrl (controls); AD (Alzheimer disease); ALS (motor neurone disease); HD (Huntington disease); MS (multiple sclerosis).
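The processing summarized in this legend (scaling to a control mean of 1, s.d. error bars, Mann-Whitney comparison and significance stars) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the software used for the figures: the input values are made up, and the two-sided p-value is obtained by enumerating every regrouping of the pooled values, which is only practical for small groups. For larger cohorts, a statistics package (e.g., `scipy.stats.mannwhitneyu`) would be the more usual choice.

```python
from itertools import combinations
from statistics import mean, stdev


def normalize_to_controls(controls, cases):
    """Scale both groups so that the mean of the control samples equals 1 (a.u.)."""
    ref = mean(controls)
    return [v / ref for v in controls], [v / ref for v in cases]


def u_statistic(a, b):
    """Mann-Whitney U for group a: each pair with a > b counts 1, each tie 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def mann_whitney_exact(a, b):
    """Two-sided exact p-value by enumerating all regroupings (small groups only)."""
    pooled = list(a) + list(b)
    n = len(a)
    centre = n * len(b) / 2  # expected U under the null hypothesis
    observed = abs(u_statistic(a, b) - centre)
    extreme = total = 0
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        grp_a = [pooled[i] for i in chosen]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        if abs(u_statistic(grp_a, grp_b) - centre) >= observed:
            extreme += 1
    return extreme / total


def stars(p):
    """Map a p-value to the significance marks used in the legend."""
    if p <= 0.005:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return ""


# Made-up expression values for illustration only.
ctrl_raw = [1.0, 1.2, 0.8, 1.1]
case_raw = [0.5, 0.4, 0.6, 0.3]

ctrl, case = normalize_to_controls(ctrl_raw, case_raw)
p_value = mann_whitney_exact(ctrl, case)
print(f"Ctrl: {mean(ctrl):.2f} +/- {stdev(ctrl):.2f} (s.d.)")
print(f"Case: {mean(case):.2f} +/- {stdev(case):.2f} (s.d.)")
print(f"p = {p_value:.4f} {stars(p_value)}")
```

Because the scaling divides every value by the same constant, it leaves the ranks unchanged, so the Mann-Whitney result is identical for raw and normalized values.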

(Table 2 continued) Summary of function:
Alpha-synuclein: Neuronal protein putatively involved in synaptic activity, whose misfolding and aggregation are associated with neurodegeneration.