Adolescence is a period of life of profound changes in the biological, psychosocial, cognitive and emotional domains1,2,3. The dynamic brain development during youth opens a critical window for cognitive improvement, but also the onset and development of mental disorders4,5. Several mental disorders, such as attention deficit hyperactivity disorder, phobias, obsessive compulsive disorder, eating disorder, substance use, mood and social anxiety disorder, begin before the individual reaches adulthood, with a peak age of onset of 14 years5,6,7. Mental disorders negatively impact adolescent development, leading to morbidity, mortality and dysfunction in later life5,8. Therefore, identifying adolescents with a high risk of developing mental health issues and improving early diagnostics could improve the clinical outcomes and decrease the socio-economic impact.

Globally, the prevalence of mental health conditions in adolescents is estimated to be 10–20%, and most cases remain underdiagnosed and undertreated9,10. The social stigma of mental disorders, the adolescent and parent perception of mental health care needs, and the lack of mental health resources are some factors that contribute to the number of adolescents without proper diagnosis and treatment11,12. Furthermore, misdiagnosis or overdiagnosis could expose the adolescent to unnecessary treatment13. Diagnostics of mental disorders are based on the International Classification of Diseases (ICD) and the Diagnostic and Statistical Manual (DSM) classifications provided by the World Health Organization (WHO) and the American Psychiatric Association, respectively14. Clinical interviews and validated questionnaires used for symptom assessment (for example, the Beck Depression Inventory15) have a major role in mental health diagnostics. The complexity of adolescent behavior and the overlap of symptomatology of several mental disorders complicate the precise and objective diagnosis of diseases in youth. Furthermore, the difficulty in defining normative or atypical expected behavioral development in adolescence16, and lack of access to professional expertise, contribute to inaccurate judgment and imprecise definition of mental health conditions in adolescents17,18.

The Strengths and Difficulties Questionnaire (SDQ) is a screening questionnaire for emotional and behavioral problems in children and young people that assesses the impact of difficulties on the child’s life, including (1) emotional symptoms, (2) conduct problems, (3) hyperactivity/inattention, (4) peer relationship problems and (5) prosocial behavior19,20,21. Past validation studies have shown that the total SDQ score can be considered as a predictive factor for mental health disorders, as children with high SDQ scores have an increased probability for clinical disorders22,23. Differences in the total SDQ score seem to reflect the differences in prevalence of mental health disorders, although cross-national differences exist22,24. Thus, developing additional tools such as biological measurements for assessing mental health issues could improve identification of adolescents at high risk of mental health dysfunction, and enhance more precise diagnostics.

Although studying human brain tissue may be the most revealing method for measuring alterations related to mental disorders, it poses several severe limitations, including tissue access25 and high cost in the case of neuroimaging25. By contrast, biological fluids such as blood or urine are easier to access and are routinely used for clinical diagnostics. Alterations in gene expression levels, protein abundance and biological activity can serve as internal indicators present in biological fluids (biomarkers) of pathogenic processes or responses to an external exposome25,26. The blood connects the brain and periphery, and changes in plasma components such as proteins can reflect alterations in the brain associated with mental disorders due to the two-way communication between the central nervous system (CNS) and peripheral circulation27,28. Past studies have shown plasma proteomic changes associated with mental disorders29,30,31,32. For example, significant reductions in glia maturation factor beta and brain-derived neurotrophic factor were observed in patients with schizophrenia (SCZ) when compared with healthy volunteers29. Thus, peripheral blood plasma is a suitable biological fluid for investigating molecular alterations that reflect those associated with mental health issues, and for providing new understanding of the bidirectional communication between the brain and body27,28,29,30.

Limited knowledge exists on whether alterations in plasma proteins could serve as early susceptibility biomarkers to predict the risk of mental health issues, leading to proper clinical interventions before disease onset, even though alterations in pathways and molecules related to hormone signaling, energy metabolism, growth factors, inflammation, oxidation/reduction and protein synthesis have been commonly associated with psychiatric disorders28. However, studies by Mongan et al.33 and English et al.34 suggest that adolescents at high risk of psychosis could be identified on the basis of the changes in the blood proteome several years before the psychotic experiences manifest. The number of studies focusing on discovering new susceptibility or predictive plasma biomarkers for mental health diseases in adolescents is so far limited.

This explorative study aims to identify and characterize alterations of plasma proteins in adolescents at high risk of developing mental health issues. We identified 67 plasma proteins with abundances significantly associated with the SDQ score, offering new insight into using proteins as susceptibility biomarkers for early identification of adolescents at risk of mental health problems.

Plasma samples and behavioral outcome

The peripheral blood plasma samples were obtained and analyzed from a subsample of 91 adolescents, aged 11–16 years, of the WALNUTs regional Spanish study (Table 1). The samples were collected in 2016 at approximately the same time that the participants filled out the SDQ. This baseline subsample without any dietary intervention was selected on the basis of the availability of blood samples and filled SDQ questionnaires, as well as the total scores of the SDQ questionnaire. Based on the self-reported SDQ score, the plasma samples were categorized into lower (SDQ = 0–14) and raised (SDQ = 15–25) groups35. The plasma samples were stored undisturbed at −80 °C until they were thawed in 2021 for protein depletion and subsequent proteomic analysis. The studies were reviewed and approved by the CEIC Parc Salut Mar Clinical Research Ethics Committee (approval nos. 2015/6026, WALNUTs; 2020/9688, Equal-Life). Written informed consent to participate in the original WALNUTs study was provided by the participants’ legal guardian/next of kin.
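The grouping rule described above can be written down as a minimal sketch. The function name `sdq_group` is hypothetical, and the upper bound of 40 is the range of the total difficulties score of the SDQ instrument rather than a value taken from this study, in which the raised group spanned 15–25.

```python
def sdq_group(total_score: int) -> str:
    """Assign a self-reported total SDQ score to one of the two study groups.

    Cut-offs follow the text: 0-14 is 'lower', 15 and above is 'raised'.
    The 0-40 bound is the instrument's total-score range (an assumption here).
    """
    if not 0 <= total_score <= 40:
        raise ValueError("total SDQ score must lie between 0 and 40")
    return "lower" if total_score <= 14 else "raised"


# Toy scores, not study data.
scores = [3, 14, 15, 22]
groups = [sdq_group(s) for s in scores]
print(groups)
```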

Table 1 Sample characteristics

Plasma samples were pre-processed and analyzed using liquid chromatography electrospray ionization tandem mass spectrometry, which was performed at the Turku Proteomics Facility and supported by Biocenter Finland. The linear associations between the SDQ score and the protein abundances were investigated using linear modeling with DeqMS36. To characterize the biological processes and pathways related to the identified proteins, significantly differently abundant proteins (adjusted P-value ≤ 0.05) associated with the SDQ score were used in further bioinformatic data analyses. See the Methods for more details.


Protein identification

Using mass spectrometry-based proteomics we successfully identified 1,485 proteins in the WALNUTs plasma samples (N = 91; mean of 1,228 proteins per sample; standard error = 117). The full list of the proteins is presented in the Supplementary Information. Out of these, 77 were identified as contaminants, and were removed. After that, 983 proteins were detected in at least 80% of the samples, and therefore these proteins were used for subsequent analysis.
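The 80% detection criterion can be illustrated with a short sketch. The data layout (a dictionary mapping protein identifiers to per-sample abundances, with `None` for a missing value), the function name `filter_by_detection` and the toy values are all assumptions made for illustration; the study itself performed this step in R.

```python
from typing import Dict, List, Optional


def filter_by_detection(
    abundances: Dict[str, List[Optional[float]]],
    min_fraction: float = 0.8,
) -> Dict[str, List[Optional[float]]]:
    """Keep proteins quantified (not None) in at least min_fraction of samples."""
    kept = {}
    for protein, values in abundances.items():
        detected = sum(v is not None for v in values)
        if values and detected / len(values) >= min_fraction:
            kept[protein] = values
    return kept


# Toy data with five samples per protein (hypothetical identifiers).
toy = {
    "PROT_A": [10.1, 9.8, 10.4, 10.0, 9.9],   # detected in 5/5 samples
    "PROT_B": [8.2, None, 8.0, 8.5, 8.1],     # detected in 4/5 samples, kept
    "PROT_C": [None, None, 7.7, 7.9, 8.3],    # detected in 3/5 samples, dropped
}
filtered = filter_by_detection(toy)
print(sorted(filtered))
```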

The sex and age variables were added to the linear model to correct for their possible effects. In the analysis, 67 proteins had a linear relationship with the SDQ score, of which 48 were positively correlated with the SDQ score and 19 were negatively correlated (Fig. 1b). The proteins associated with the SDQ score are presented in Table 2.

Fig. 1: Significantly altered proteins identified in mass spectrometry-based proteomic analysis.

a, Volcano plot of proteins associated with the SDQ score; 58 proteins were significantly changed, indicated in blue (adjusted P-value < 0.05). The P-values were calculated using DeqMS with SDQ as a continuous variable, and adjusted using the Benjamini–Hochberg method. b, A heatmap of protein abundances (z-scores) of proteins significantly associated with the SDQ score. The upper group represents negatively correlated proteins (n = 19), and the lower group positively correlated proteins (n = 48). SDQ scores are shown as the gradient at the bottom of the heatmap, green (pink) indicates individuals with a low (raised) SDQ score.

Table 2 Plasma proteins with abundance changes associated with the SDQ score

All the significantly altered proteins were used to create a heatmap (Fig. 1b) that shows the protein abundances (z-scores) in relation to the SDQ score.
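For readers unfamiliar with the z-score scaling used in such heatmaps, the following sketch standardizes one protein's abundances across samples. It assumes the sample standard deviation (n − 1 denominator); the text does not state which variant was used, and the helper name `z_scores` is hypothetical.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize one protein's abundances across samples: (x - mean) / SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1), an assumption
    if s == 0:
        raise ValueError("cannot standardize a constant vector")
    return [(v - m) / s for v in values]


# Toy abundances for a single protein in three samples.
z = z_scores([1.0, 2.0, 3.0])
print(z)
```

After this scaling, every protein has a mean of zero and unit variance across samples, so rows of the heatmap are comparable regardless of absolute abundance.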

Enriched pathways and biological processes

Of the highly abundant proteins that were depleted from the plasma samples before the mass spectrometry analysis, nine were found in the data; these were considered as a possible source of bias and were excluded from these analyses, leaving 58 proteins significantly associated with the SDQ score.

These 58 proteins were used for a clustering analysis with the STRING database37. The clustering analysis yielded three groups of proteins, as shown in Fig. 2a. Cluster 1 contained up- and down-regulated proteins involved in neuron growth, synaptic function, glial cell migration and cholesterol transport. The second cluster contained only up-regulated proteins mostly involved in the complement and coagulation cascades. Cluster 3 contained three down-regulated proteins involved in the olfactory system and three up-regulated proteins involved in protein degradation.

Fig. 2: Enriched biological processes and pathways.

a, The results of the STRINGdb clustering analysis; the proteins positively correlated with the SDQ score are highlighted green, whereas negatively correlated proteins are highlighted red. The number and colors of the lines represent the evidence for the protein connection according to the default STRING database scheme. b, Significantly enriched Reactome pathways. The colors indicate functional grouping. The enrichment was performed using the enrichPathway function of the ReactomePA package, which uses the hypergeometric model. The P-value was adjusted using the Benjamini–Hochberg method.

We also performed an analysis of enriched pathways with Reactome38, using all of the identified proteins as the gene background. Enriched pathways were related to immune system, coagulation, complement cascade, and post-translational protein modification (Fig. 2b). In total, thirteen pathways were significantly (false discovery rate (FDR)-adjusted P-value < 0.05) enriched in the pathway analysis (Supplementary Table 1). The signaling pathways analysis in ingenuity pathway analysis (IPA) revealed that canonical pathways were associated with immune responses, coagulation, complement cascade and signaling, as in the Reactome analysis (Supplementary Table 2).
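The hypergeometric model behind this kind of over-representation analysis can be sketched as follows. This is a from-scratch illustration of the test, not the ReactomePA code; the function name and the pathway size and overlap in the example are hypothetical, while 983 and 58 are the numbers of analyzed and significant proteins reported in this study.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric P-value.

    Probability of drawing at least n_overlap pathway members when
    n_selected proteins are sampled without replacement from a background
    of n_background proteins, n_pathway of which belong to the pathway.
    """
    max_k = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, max_k + 1)
    )
    return tail / total


# Hypothetical pathway: 40 background proteins annotated, 10 of them significant.
p = hypergeom_enrichment_p(n_background=983, n_pathway=40, n_selected=58, n_overlap=10)
print(p)
```

Restricting the background to the proteins actually quantified, as done here, avoids inflating enrichment for pathways that are simply well represented in plasma.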

Predictive models generation

The relatively large number of samples made it possible to employ modern strategies to determine potentially predictive biomarkers for the low versus raised SDQ score groups. A novel QLattice algorithm39 was used to create models containing predictive biomarkers that best separate the two groups with low and raised SDQ scores. The Bayesian information criterion (BIC) was used to ensure that the resulting models generalize well from the training to the test set. We performed fivefold cross-validation, running a logistic regression model with QLattice on different partitions of the data and keeping the lowest BIC-scoring model from each partition. The receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) for each of the models are presented in Supplementary Fig. 2.
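The two quantities used to judge the models can be illustrated with textbook definitions. The sketch below computes the BIC of a binary classifier from its predicted probabilities (BIC = k ln n − 2 ln L) and the AUC as the fraction of positive/negative pairs ranked correctly. How the Feyn package computes these internally is not described here, so the helper names and toy numbers are assumptions.

```python
from math import log


def log_likelihood(labels, probs):
    """Bernoulli log-likelihood of 0/1 labels under predicted probabilities."""
    return sum(log(p) if y == 1 else log(1.0 - p) for y, p in zip(labels, probs))


def bic(labels, probs, n_params):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L); lower is better."""
    return n_params * log(len(labels)) - 2.0 * log_likelihood(labels, probs)


def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (1 = raised SDQ group) and predicted probabilities.
labels = [0, 0, 1, 1]
probs = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, probs)
print(auc, bic(labels, probs, n_params=2))
```

Because the penalty term grows with the number of parameters, two models with the same likelihood are ranked in favor of the simpler one.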

Five diverse models were created using a fivefold cross-validation scheme. These models bring similar, albeit complementary, insights, as the whole dataset was split into training and validation sets five times, and each round contained different sub-samples of the data. The five unique models (Table 3) contained eleven proteins in total (Supplementary Table 3). Four of the five models contained proteins with a previously shown connection to the CNS. The first model contained three such proteins: amyloid beta precursor-like protein 1 (APLP1; P51693), calcium/calmodulin dependent protein kinase II beta (CAMK2B; Q13554/Q13555) and Reticulon 4 (RTN4; Q9NQC3). The ROC parameters for the models are shown in Supplementary Fig. 2. Only the fifth model contained no proteins previously connected to brain development. The proteins present in the models can be investigated further as potential biomarkers.
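The fivefold partitioning can be sketched as below. This is a plain contiguous split written for illustration; whether the original analysis shuffled or stratified the samples is not stated, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k contiguous folds.

    Fold sizes differ by at most one sample. No shuffling or stratification
    is applied (an assumption of this sketch).
    """
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


# With the 91 samples of this study, each sample is held out exactly once.
folds = list(k_fold_indices(91, k=5))
sizes = [len(validation) for _, validation in folds]
print(sizes)
```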

Table 3 The models returned by the QLattice with lowest BIC score


Plasma proteomic biomarker studies in mental health diseases are a novel field. Increasing evidence shows alterations in plasma proteins associated with different mental disorders such as major depressive disorder (MDD), SCZ, psychotic disorders and bipolar disorders24,25,32. Most altered pathways in mental disorders (such as complement cascade and signaling by interleukins) seem to be common to the above-mentioned major psychiatric disorders32. Here we report plasma protein alterations related to immune responses, blood coagulation, complement cascade, neuronal degeneration and neurogenesis in adolescents at high risk of mental dysfunction, which was evaluated based on the self-reported SDQ score. It should be kept in mind that assessing the risk of mental health problems in adolescents is associated with ethical issues, which should be appropriately considered.

In this study we used the total SDQ score as an indicator of mental health dysfunction and predisposition to mental health issues in adolescents. Becker and co-workers have shown the predictive value of the self-reported SDQ in clinical diagnostics, especially combined with parent and/or teacher versions40. Furthermore, the self-reported SDQ was shown to be a reliable and valid method for the assessment of behavioral problems in children and adolescents40. Goodman et al. have shown that multi-informant (parents, teachers, older children) SDQs in community samples can identify children and adolescents with a psychiatric diagnosis with a specificity of 94.6% and a sensitivity of 63.3%; SDQ scores successfully identified over 70% of individuals with conduct, hyperactivity, depressive and some anxiety disorders19. The SDQ performs well as a screening tool, but it is not intended to be used as a psychiatric diagnostic instrument as such41. It is therefore considered a useful and valid tool for screening children and adolescents at a high risk of mental disorders19,41,42.

This study revealed 58 plasma protein alterations associated with the SDQ score in adolescents. The abundances of 39 proteins were enhanced in the raised SDQ score group, whereas 19 were reduced. We identified altered proteins such as clusterin, vitronectin, complement C2 and coagulation factor XI that have also been reported to be altered in past blood proteomics studies33,43,44.

Blood coagulation and immune responses, including the complement cascade, were the most enriched pathways altered among proteins significantly associated with the SDQ score. Our clustering analysis revealed up-regulation of complement and blood coagulation cascades. Past studies have also shown associations between early changes in complement and coagulation cascades and increased risk of psychotic disorders in adolescents33,34,45. Changes in immune responses and blood coagulation have also been reported in SCZ, MDD and bipolar disorder patients in several blood proteomic studies43,44,46. We found positive correlations with the SDQ score in coagulation factor XI, coagulation factor X and coagulation factor II (thrombin)—all of which are involved in blood coagulation. Increased levels of prothrombin and several coagulation factors (F5, F9, F12, F13A1) were also found by English et al. in adolescents who later developed a psychotic disorder34.

Several complement components and factors such as C6, C1S, and CFI were altered (mostly increased) in high-risk psychotic disorder adolescents33,34 and first-episode SCZ-patients47. In our study, complement proteins such as complement C1q and C1r subcomponents, complement factor I, complement factor H and complement C2 were significantly and mainly positively associated with the SDQ score. Jiang et al.48 suggested that complement activation together with metabolic up-regulation can increase oxidative stress, which can induce protein damage and cell apoptosis, and thus contribute to the development of SCZ. Altogether, our results are in line with the current understanding of the role of altered immune responses and blood coagulation in pathophysiology of mental disorders. Furthermore, our findings support the increasing evidence on early changes in coagulation and complement cascades in predisposition to—and development of—mental health issues in adolescents.

A symbolic-regression-based algorithm, QLattice, was used to gain an insight into the potential of proteins to predict the SDQ status. The formed models comprised eleven proteins, four of which have a previously reported connection to the CNS, neurogenesis or mental health. Of the eleven proteins, eight were reported to belong to the first cluster according to STRING database analysis. Of the proteins not previously connected to the CNS, the LYPD3 protein was reported to be an amyloid precursor protein interactor49, which can explain its coincidence with APLP1, CAMK2B and CD9 in our predicted models.

APLP1 is a protein residing predominantly in brain tissue, and it has been shown to be involved in brain development50 and synaptogenesis during post-natal development in mice51. We identified a significant negative association between APLP1 and the SDQ score. Pandolfo and colleagues have suggested that amyloid could be a marker of cognitive impairment and altered neurodevelopment in mental diseases. Decreased β-amyloid proteins in CSF have been reported in patients with SCZ and MDD, and altered amyloid precursor protein metabolism in patients with bipolar disorders52. However, although the role of APLP1 in the pathogenesis of mental disorders is still unknown, this protein—on the basis of our data—warrants further investigation in the context of adolescent mental health.

Two other proteins—RTN453,54 and CAMK2B—have been reported to be connected to neuronal development and neuroplasticity55,56. CAMK2B was positively associated with the SDQ score. It is a protein connected to dendritic spine and synapse formation, neuronal plasticity and regulation of sarcoplasmic reticulum Ca2+ transport in skeletal muscle57. The beta subunit was reported to be brain specific55, yet little is known of its involvement in mental disorders. Another protein present in the top model was RTN4, which was negatively associated with the SDQ score, and has been previously shown to be associated with SCZ53,58. RTN4 is a membrane shaping protein in the endoplasmic reticulum involved in the maintenance of the endoplasmic reticulum membrane tubular integrity. Impairment in the RTN4 process has been connected to neurodegeneration. The RTN4A-subtype, also known as Nogo-A, is localized in the CNS and has a role in neuronal growth and maturation during nervous system development59. Furthermore, RTN4 was shown to be connected with social behavior and spatial cognition in a study using mice with a missense mutation in the RTN4 receptor59,60. The involvement of RTN4 in adolescent mental health remains a topic worthy of detailed investigation. The fourth protein with reported connection to CNS was Cadherin 11 (P55287), which has been shown to be enriched in several brain areas during dendrite formation and synaptogenesis61,62,63. Given that the proteins reported as probable biomarkers belonged to the same cluster according to STRING database, and that four of the proteins were shown to be connected to the CNS and neuronal development, it is conceivable that all of the proteins from the cluster are connected to one process. Further investigation of this network might shed new light on the nuances of brain development and mental health in adolescents.

Girls in the raised SDQ score (≥15) group reported slightly earlier puberty changes compared with the lower SDQ score (≤14) group, whereas in boys the self-reported puberty changes between the lower and the raised SDQ score groups seemed to be the opposite (Supplementary Table 4); however, differences in puberty changes were minor and sex was added as a confounding factor in our linear models. We also examined other possible confounding factors available, including the education level for each parent, the levels of media consumption, levels of social media engagement, drug and alcohol use, and physical activity. The results showed no significant differences among the groups (P-value > 0.05). Furthermore, we used the school that the subjects attended as a random variable in linear modeling. No differences in the number of significant proteins were found. We thus concluded that these factors were not likely to confound the investigation of the connection of SDQ to the protein abundance levels.

As the main limitation in this study, the sample number is low in relation to the identified proteins. Linear modeling was performed along with some group-based comparisons to strengthen the statistical power of the analysis, and we managed to detect statistically significant alterations with these sample numbers. Similar N-numbers have also been used in past studies on serum and plasma protein biomarkers in SCZ, MDD and bipolar disorder patients45. Furthermore, overnight fasting samples are preferred for proteomics analysis as food intake can influence the protein composition and concentrations in blood64. The plasma samples used in the current study were non-fasting samples due to the practical and ethical issues related to the implementation of the WALNUTs study as blood samples were drawn from the adolescents at school in the afternoon.


In this explorative study, we identified protein-based susceptibility biomarker candidates associated with the self-reported SDQ score in adolescents reflecting a risk of developing mental health dysfunction. Significant alterations were found in proteins involved in the immune response, blood coagulation and hemostasis, neuronal degeneration and neurogenesis. Further studies are needed to confirm and validate these biomarker candidates in larger cohorts, as well as follow-up data and studies to evaluate whether these biomarkers are associated with the risk of transition to the clinical state and mental disorders.


Participant recruitment and sample collection

The studies were reviewed and approved by the CEIC Parc Salut Mar Clinical Research Ethics Committee (approval nos. 2015/6026, WALNUTs; 2020/9688, Equal-Life). Written informed consent to participate in the original WALNUTs study was provided by the participants’ legal guardian/next of kin. No additional consent was needed for this study. All of the participants were offered free tickets to the science museum of Barcelona. The specifics of the WALNUTs cohort formation were described in previous publications21,65. The current manuscript used a subset of 372 baseline blood samples, collected before any dietary intervention, originally described in a previous work65. For this study, a sub-group of 91 samples was used to perform the proteomics analysis. These samples were selected on the basis of the SDQ scores: 42 with the lowest SDQ score (SDQ = 0–14) and 49 with the highest SDQ score (SDQ = 15–25). Samples were drawn by a nurse using K2EDTA plus tubes, rested for 1 h, centrifuged at 2,500 × g for 20 min at 20 °C, refrigerated at 4 °C and frozen at −80 °C within 4 h after extraction65. They were stored at −80 °C and were not thawed until the protein depletion was performed before the proteomics analysis.

High-abundance protein depletion

Albumin and IgG represent more than 70% of total protein levels in human plasma samples. The depletion of high-abundance proteins is therefore essential to the identification and analysis of low-abundance proteins. A commercial kit (High Select Top14 Abundant Protein Depletion Mini Spin Columns, catalogue no. A36370, ThermoScientific) was used to deplete the 14 most abundant proteins from plasma before the proteomic analyses. The depleted proteins were albumin, IgG, IgA, IgM, IgD, IgE, kappa and lambda light chains, α1-acid glycoprotein, α1-antitrypsin, α2-macroglobulin, apolipoprotein A1, fibrinogen, haptoglobin and transferrin, according to the manufacturer’s manual. Briefly, 10 µl of total plasma was added to the mini spin columns and incubated for 10 min while rotating, followed by centrifugation of the columns (1,000 × g) for 2 min. The filtrate was collected in 2 ml plastic tubes and stored at −20 °C until preparation for mass spectrometry proteomic analyses, which were performed at the Turku Proteomics Facility supported by Biocenter Finland.

Protein precipitation and digestion

Samples were acetone precipitated and subjected to in-solution digestion. Briefly, four volumes of ice-cold acetone were used to precipitate proteins. Precipitated proteins were resuspended in 8 M urea, 50 mM Tris-HCl for protein denaturation, reduced with 5 mM dithiothreitol and alkylated with 13 mM iodoacetamide. Proteins were digested to peptides with trypsin (Promega) (enzyme:protein ratio 1:30) at 37 °C overnight. After digestion the peptides were desalted with a Sep-Pak C18 96-well plate (Waters), evaporated and stored at −20 °C.

Mass spectrometry analysis

Digested peptide samples were dissolved in 0.1% formic acid and peptide concentrations were determined with a NanoDrop device. Samples were spiked with iRT peptides (Biognosys) for retention time calibration. Equal amounts of samples were analyzed on a nanoflow HPLC system (Easy-nLC1200, Thermo Fisher Scientific) coupled to the Q Exactive HF Orbitrap mass spectrometer (Thermo Fisher Scientific) equipped with a nano-electrospray ionization source. Peptides were first loaded onto a trapping column and subsequently separated in-line on a 15 cm C18 column (75 μm × 15 cm, ReproSil-Pur 3 μm 120 Å C18-AQ, Dr. Maisch HPLC GmbH). The mobile phase consisted of water with 0.1% formic acid (solvent A) or acetonitrile/water (80:20 (v/v)) with 0.1% formic acid (solvent B). A 100 min gradient was used to elute peptides (50 min from 5% to 21% solvent B followed by 40 min from 21% to 36% solvent B). Mass spectrometry data were acquired automatically by using Thermo Xcalibur v.4.1 software (catalog no. OPTON-30965; Thermo Scientific). In a data-independent acquisition (DIA) method, a duty cycle contained one full scan (400–1,000 m/z) and 40 DIA MS/MS scans covering the mass range 400–1,000 with variable width isolation windows.

Protein identification and quantification analysis

Data analysis consisted of protein identifications and label-free quantifications of protein abundances. The data were analyzed using the Spectronaut software (Biognosys; v.17.1.221229). The direct DIA approach was used to identify proteins and label-free quantifications were performed with the MaxLFQ algorithm in Spectronaut. The main data analysis parameters in Spectronaut were: (1) enzyme (Trypsin/P); (2) up to two missed cleavages; (3) fixed modification (carbamidomethyl (cysteine)); (4) variable modifications (acetyl (protein N-terminus) and oxidation (methionine)); (5) the precursor FDR cutoff (0.01); (6) the protein FDR cutoff (0.01); (7) the quantification MS level (MS2); (8) the quantification type (area under the curve within integration boundaries for each targeted ion); (9) the protein database (Swiss-Prot 2022_05 Homo Sapiens66 and Universal Protein Contaminant database67); and (10) normalization (global median normalization). All of the peptides were used for quantification.

Statistical analysis

Data pre-processing and statistical analyses were performed using R (v.4.2.1). Principal component analysis was performed to assess the general quality of the dataset (Supplementary Fig. 1). Identified proteins with more than 20% missing values were excluded from the analysis. Sample normalization was performed using the medianCentering method from the proBatch package68. Missing values remaining in the dataset were imputed using the sample minimum method69. We compared the low SDQ and raised SDQ groups for the following variables: the education level for each parent, the levels of media consumption, levels of social media engagement, drug and alcohol use, and physical activity. We tested for differences in these potential confounding factors between the groups using one-way ANOVA. After correcting for multiple comparisons, no socio-economic, sociodemographic or other factor showed a significant difference in mean values between the groups.
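The normalization and imputation steps can be illustrated with a simplified sketch. It assumes log-scale abundances, shifts each sample so that its median matches the global median (one common form of median centering) and replaces each missing value with the smallest observed value in the same sample. This is not the proBatch implementation; the function names and toy values are hypothetical.

```python
from statistics import median


def median_center(samples):
    """Shift each sample so its median equals the global median.

    samples maps a sample ID to a list of log-abundances (None = missing).
    """
    observed = {s: [v for v in vals if v is not None] for s, vals in samples.items()}
    global_median = median([v for vals in observed.values() for v in vals])
    centered = {}
    for s, vals in samples.items():
        shift = global_median - median(observed[s])
        centered[s] = [None if v is None else v + shift for v in vals]
    return centered


def impute_sample_minimum(samples):
    """Replace missing values with the minimum observed value of the same sample."""
    imputed = {}
    for s, vals in samples.items():
        floor = min(v for v in vals if v is not None)
        imputed[s] = [floor if v is None else v for v in vals]
    return imputed


# Toy data: two samples, four proteins, one missing value.
toy = {"S1": [10.0, 12.0, None, 14.0], "S2": [11.0, 13.0, 15.0, 17.0]}
result = impute_sample_minimum(median_center(toy))
print(result)
```

Imputing with the sample minimum treats a missing value as a signal below the detection limit, which is a reasonable default for proteins that passed the 80% detection filter.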

Bioinformatic data analysis

The DeqMS (v.1.16.0) package was used for the differential abundance analysis36, with the SDQ score used as a continuous variable. The sex and age of the adolescents were included in the linear model to ensure that the proteins reported are associated with SDQ and were not influenced by confounding factors. The differences in protein abundances were expressed as log2-fold-change (the ratio of the means of raised (numerator) and low (denominator) SDQ groups). The P-values were adjusted using the Benjamini–Hochberg procedure.
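The Benjamini–Hochberg adjustment applied to the P-values can be sketched in a few lines. This is the standard step-up procedure written from scratch for illustration, not the R code used in the study; the function name and toy P-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order.

    Each P-value is scaled by m / rank, then a running minimum taken from
    the largest P-value downwards enforces monotonicity.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy P-values for four proteins.
adj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adj)
```

A protein is then called significant when its adjusted P-value falls at or below the chosen threshold, 0.05 in this study.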

Plasma proteomic datasets from adolescents make up a very small share of all proteomic datasets70, so to better investigate the functional enrichment the full list of all proteins found in this study was used as the background gene list in the enrichment analyses. To characterize the enriched pathways related to the identified proteins, significantly differently abundant proteins (adjusted P ≤ 0.05) associated with the SDQ score were used in further bioinformatic data analyses. Proteins depleted before mass spectrometry analysis that showed significant differences between groups were considered a possible source of bias and thus were excluded. The Reactome pathways were investigated using the ReactomePA (v.1.9.4) R package38. The enrichment was performed using the enrichPathway function of the ReactomePA package, which uses the hypergeometric model. The P-value was adjusted using the Benjamini–Hochberg method. We used IPA (Ingenuity Systems) for the further enrichment analyses. In the IPA core analysis, default software parameters were used (reference set: ingenuity knowledge base—genes only). The z-score values were used to identify canonical pathways that were expected to be changed by their activity. The STRINGdb package was used to get the protein–protein interaction information for the significantly differentially abundant proteins from the STRING database (v.11.5)37. The fastgreedy clustering function was used to extract gene clusters with strong associations. A novel symbolic-regression-based algorithm, QLattice, which is part of the Feyn package (v.3.0.3), was used to generate models combining proteins with the best predictive power for the SDQ score based on protein biomarkers39. Possible biomarkers were searched among the 58 proteins significantly associated with the SDQ score, with five rounds of cross-validation. The top model from each cross-validation round was retained. Result visualizations were performed using ggplot2 (v.3.4.0)71 and ComplexHeatmap72 (v.2.14.0) packages.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.