Cerebrospinal fluid reference proteins increase accuracy and interpretability of biomarkers for brain diseases

Cerebrospinal fluid (CSF) biomarkers reflect brain pathophysiology and are used extensively in translational research as well as in clinical practice for diagnosis of neurological diseases, e.g., Alzheimer’s disease (AD). However, CSF biomarker concentrations may be influenced by non-disease-related inter-individual variability. Here we use a data-driven approach to demonstrate the existence of inter-individual variability in mean standardized CSF protein levels. We show that these non-disease-related differences cause many commonly reported CSF biomarkers to be highly correlated, thereby producing misleading results if not accounted for. To adjust for this inter-individual variability, we identified and evaluated high-performing reference proteins, which improved the diagnostic accuracy of key CSF AD biomarkers. Our reference protein method attenuates the risk of false-positive findings and improves the sensitivity and specificity of CSF biomarkers, with broad implications for both research and clinical practice.
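The two quantities named in the abstract, a mean standardized CSF protein level and a biomarker adjusted by a reference protein, can be illustrated with a minimal sketch. The data are synthetic, the function names are hypothetical, and expressing the adjustment as a simple ratio is an assumption made for illustration; the abstract does not specify the exact computation.

```python
import numpy as np

def mean_standardized_level(protein_matrix):
    """Per-participant mean of z-scored protein levels.

    protein_matrix: array of shape (n_participants, n_proteins).
    """
    x = np.asarray(protein_matrix, dtype=float)
    # z-score each protein (column) across participants
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    # average the z-scores across proteins for each participant
    return z.mean(axis=1)

def adjust_by_reference(biomarker, reference):
    """Assumed adjustment: biomarker expressed as a ratio to a reference protein."""
    return np.asarray(biomarker, dtype=float) / np.asarray(reference, dtype=float)

# Synthetic data: one participant-specific factor scales every protein,
# mimicking non-disease-related inter-individual variability.
rng = np.random.default_rng(0)
n_participants, n_proteins = 200, 50
individual_factor = rng.lognormal(mean=0.0, sigma=0.3, size=n_participants)
noise = rng.lognormal(mean=0.0, sigma=0.1, size=(n_participants, n_proteins))
proteins = individual_factor[:, None] * noise

mean_level = mean_standardized_level(proteins)
ratio = adjust_by_reference(proteins[:, 0], proteins[:, 1])

# Unrelated proteins correlate because they share the individual factor ...
corr_raw = np.corrcoef(proteins[:, 0], proteins[:, 1])[0, 1]
# ... the mean standardized level tracks that factor ...
corr_mean_factor = np.corrcoef(mean_level, individual_factor)[0, 1]
# ... and the ratio to a reference protein largely removes it.
corr_ratio_factor = np.corrcoef(ratio, individual_factor)[0, 1]
print(corr_raw, corr_mean_factor, corr_ratio_factor)
```

In this toy setting the shared factor alone makes two otherwise independent proteins correlate, which is the kind of spurious association the abstract warns about.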


Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

- The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
- A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
- The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
- A description of all covariates tested
- A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
- A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
- For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
- For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
- For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
- Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code
Data collection

Data analysis
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

Research involving human participants, their data, or biological material
Policy information about studies with human participants or human data. See also policy information about sex, gender (identity/presentation), and sexual orientation and race, ethnicity and racism.
Reporting on sex and gender
Reporting on race, ethnicity, or other socially relevant groupings

Recruitment
Ethics oversight
Note that full information on the approval of the study protocol must also be provided in the manuscript.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences
Behavioural & social sciences
Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.
We used the term "sex" throughout the manuscript. Sex was determined based on self-reporting. Statistical analyses included sex as a covariate. The study included two independent cohorts that together comprised balanced numbers of males (n=389 in BioFINDER-1 and 438 in BioFINDER-2) and females (n=515 in BioFINDER-1 and 392 in BioFINDER-2); therefore, we believe the findings apply to both sexes.
This study did not include categorization of race, ethnicity and/or other socially relevant groupings.
Detailed information is given in Table 1. In short, we present results for analyses from the BioFINDER-1 (mean age 73 years [SD 5.5]) and BioFINDER-2 (mean age 69 years [SD 12]) cohorts, both with similar demographics. Participants were included in the present study based on baseline (cross-sectional) CSF measures (Olink + Aβ40). Both cohorts consisted of individuals with either normal cognition (NC), subjective cognitive decline (SCD), mild cognitive impairment (MCI), dementia or another neurodegenerative disease. For participants with longitudinal data, these were used to assess conversion to AD dementia (based on the treating physician's assessments).
This project was done as part of the prospective Swedish BioFINDER study. All patients were recruited from the southern part of Sweden and underwent baseline examination from 2007 to 2015 (BioFINDER-1) or from 2017 to 2021 (BioFINDER-2). In BioFINDER-1, participants were consecutively recruited based on referrals (mostly from primary care) to participating memory clinics (in the towns of Malmö, Lund and Ängelholm in Sweden). In BioFINDER-2, patients were included after being referred to the memory clinic of Skåne University Hospital in Malmö, Sweden.
The study was approved by the Swedish Ethical Review Authority. All participants gave their informed consent to participate in the study and the data were collected according to the Declaration of Helsinki.
The study was conducted by maximizing the available sample sizes with regard to the number of participants and CSF proteins in the two cohorts, resulting in n=830 participants in BioFINDER-2 with 2944 measured CSF protein concentrations and n=904 participants in BioFINDER-1 with 369 measured CSF protein concentrations. There is no indication that we were insufficiently powered for these analyses.
The data were limited to the subsets of the source cohorts with available CSF biomarker data, demographic information (age and sex) and outcome measures (tau-PET, Aβ-PET and conversion to AD dementia). See flowchart in Fig. 1.
To ensure generalizability of the suggested reference protein candidates, all exploratory work was performed on a training dataset (80% of BioFINDER-2, n=658) using 10-fold cross-validation, and then evaluated on a held-out test set (20%, n=172), where significance testing was done with bootstrap resampling with replacement (number of iterations = 2000). All suggested reference protein candidates outperformed both using no reference protein and using the mean standardized CSF protein level as reference in three models. Furthermore, four reference proteins were validated in the independent BioFINDER-1 cohort, where three of the four (Aβ40, NTRK3 and NTRK2) were superior to both using no reference protein and using the mean standardized CSF protein level as reference in two models. The fourth reference protein (BLMH) showed trend-level improvement compared to using no reference protein, as did using the mean standardized CSF protein level. Significance was again assessed with bootstrap resampling with replacement (number of iterations = 2000). Not all suggested reference protein candidates could be validated in BioFINDER-1, as some were not measured in that cohort. As baseline tau-PET data did not exist in BioFINDER-1, the P-tau181"TauPET model was not evaluated on BioFINDER-1.
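The bootstrap comparison on a held-out test set described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the data are synthetic, the function names are hypothetical, and AUC is chosen as the performance metric only for the example, since the text does not name the metric at this point.

```python
import numpy as np

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def bootstrap_auc_difference(y_true, scores_a, scores_b, n_iter=2000, seed=0):
    """Bootstrap AUC(a) - AUC(b), resampling participants with replacement."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_iter:
        idx = rng.integers(0, n, size=n)  # sample indices with replacement
        y = y_true[idx]
        if y.min() == y.max():  # resample lacks one class; draw again
            continue
        diffs.append(auc(y, scores_a[idx]) - auc(y, scores_b[idx]))
    diffs = np.array(diffs)
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
    # Two-sided p-value: share of bootstrap differences on the far side of zero
    p = 2 * min((diffs <= 0).mean(), (diffs >= 0).mean())
    return float(diffs.mean()), (float(ci_low), float(ci_high)), float(min(p, 1.0))

# Synthetic test set of the same size as the held-out set (n=172)
rng = np.random.default_rng(1)
y_test = rng.integers(0, 2, size=172)
scores_with_reference = y_test + rng.normal(0.0, 0.5, size=172)     # less noisy
scores_without_reference = y_test + rng.normal(0.0, 2.0, size=172)  # noisier

mean_diff, (ci_low, ci_high), p_value = bootstrap_auc_difference(
    y_test, scores_with_reference, scores_without_reference
)
print(mean_diff, ci_low, ci_high, p_value)
```

Resampling whole participants keeps the two models' scores paired within each bootstrap draw, so the interval reflects the difference between models rather than each model's variability separately.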
In these two prospective cohort studies (observational studies), no allocation into experimental groups was performed; therefore, randomization is not relevant to this study. Statistical analyses were controlled for the potential confounding effects of age and sex.
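Controlling for age and sex can be illustrated with a minimal regression sketch. The model form (ordinary least squares with age and sex as additional regressors), the variable names and the synthetic data are assumptions for illustration only; the text does not specify the models at this point.

```python
import numpy as np

def ols_with_covariates(outcome, biomarker, age, sex):
    """Least-squares fit of outcome ~ intercept + biomarker + age + sex."""
    X = np.column_stack([np.ones(len(outcome)), biomarker, age, sex])
    coef, *_ = np.linalg.lstsq(X, np.asarray(outcome, dtype=float), rcond=None)
    return dict(zip(["intercept", "biomarker", "age", "sex"], coef))

# Synthetic data in which age influences both the biomarker and the outcome
rng = np.random.default_rng(2)
n = 500
age = rng.normal(70.0, 8.0, size=n)
sex = rng.integers(0, 2, size=n).astype(float)
biomarker = 0.05 * age + rng.normal(0.0, 1.0, size=n)
outcome = 0.5 * biomarker + 0.1 * age + 0.3 * sex + rng.normal(0.0, 0.5, size=n)

adjusted = ols_with_covariates(outcome, biomarker, age, sex)
# Unadjusted slope for comparison: outcome regressed on the biomarker alone
unadjusted_slope = np.polyfit(biomarker, outcome, 1)[0]
print(adjusted["biomarker"], unadjusted_slope)
```

Because age drives both variables in this toy example, the unadjusted slope overstates the biomarker effect, while the adjusted coefficient recovers a value close to the simulated 0.5.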
CSF analyses and PET analyses were performed by individuals who were blinded to the clinical data. Authors who performed the data preprocessing were blinded to demographic and clinical characteristics of individuals.

Linda
was used.

Code can be found in the Git repository https://github.com/karlssonlinda/reference_protein_project. The repository has been linked to Zenodo with release v1.0.0. Python version 3.9 (packages: sklearn 0.0, pandas 1.4.4, numpy 1.23.3, matplotlib 3.5.3, pingouin 0.5.3, tqdm 4.64.1, statsmodels 0.13.2) and R version 4.2 (packages: tidyverse 1.3.2, pROC 1.18.0, EWCE 1.4.0) were used. Seurat version 4.3.0 and WEB-based Gene SeT AnaLysis Toolkit (WebGestalt) 2019 were also used.

Pseudonymized BioFINDER-1 and BioFINDER-2 data can be shared with qualified academic researchers upon request (PI: OH) for the purpose of replicating procedures and results presented in the study. Data transfer must be performed in agreement with EU legislation regarding general data protection regulation and decisions by the Ethical Review Board of Sweden and Region Skåne. Human MTG 10x SEA-AD Allen Brain data from 2022 (ref. 60) are publicly available and can be downloaded from celltypes.brain-map.org/rnaseq. Tissue datasets from the Human Protein Atlas are also publicly available and can be downloaded from https://www.proteinatlas.org/about/download.

nature portfolio | reporting summary, April 2023