Plasma proteomics discovery of mental health risk biomarkers in adolescents

An estimated 10–20% of adolescents experience mental health conditions, and most of them remain underdiagnosed and undertreated. Discovering new susceptibility biomarkers is therefore important for identifying individuals at high risk of developing mental health problems, and for improving early prevention. Here we aimed to discover plasma protein-based susceptibility biomarkers in children/adolescents aged 11–16 years at risk of developing mental health issues. Risk was evaluated on the basis of self-reported Strengths and Difficulties Questionnaire (SDQ) scores, and plasma proteomic data were obtained for individuals participating in the Spanish WALNUTs cohort study by liquid chromatography–tandem mass spectrometry. Bioinformatic analyses were performed to identify the biological processes and pathways in which the identified biomarker candidates are involved; 58 proteins were significantly associated with the SDQ score. The most prominent enriched pathways related to these proteins included immune responses, blood coagulation, neurogenesis and neuronal degeneration. This exploratory study revealed several alterations of plasma proteins associated with the SDQ score in adolescents, which opens a new avenue to develop novel susceptibility biomarkers to improve early identification of individuals at risk of mental health problems. In a sample of children and adolescents aged 11–16 years old, authors used plasma-based proteomic techniques to identify biomarkers associated with risk for developing a mental health disorder.

Adolescence is a period of life of profound changes in the biological, psychosocial, cognitive and emotional domains [1][2][3] . The dynamic brain development during youth opens a critical window for cognitive improvement, but also the onset and development of mental disorders 4,5 . Several mental disorders, such as attention deficit hyperactivity disorder, phobias, obsessive compulsive disorder, eating disorder, substance use, mood and social anxiety disorder, begin before the individual reaches adulthood, with a peak age of onset of 14 years [5][6][7] . Mental disorders negatively impact adolescent development, leading to morbidity, mortality and dysfunction in later life 5,8 . Therefore, identifying adolescents with a high risk of developing mental health issues and improving early diagnostics could improve the clinical outcomes and decrease the socio-economic impact.
Globally the prevalence of mental health conditions in adolescents is estimated to be between 10-20%, and most cases remain underdiagnosed and undertreated 9,10 . The social stigma of mental disorders, the adolescent and parent perception of mental health care needs, and the lack of mental health resources are some factors that contribute to the number of adolescents without proper diagnosis and treatment 11,12 . Furthermore, misdiagnosis or overdiagnosis could expose the adolescent to unnecessary treatment 13 . Diagnostics of mental disorders are based on the International Classification of Diseases (ICD) and the Article https://doi.org/10.1038/s44220-023-00103-2

Plasma samples and behavioral outcome
The peripheral blood plasma samples were obtained and analyzed from a subsample of 91 adolescents, aged 11-16 years, of the WALNUTs regional Spanish study (Table 1). The samples were collected in 2016 at approximately the same time that the participants filled out the SDQ. This baseline subsample without any dietary intervention was selected on the basis of the availability of blood samples and completed SDQ questionnaires, as well as the total SDQ scores. Based on the self-reported SDQ score, the plasma samples were categorized into lower (SDQ = 0-14) and raised (SDQ = 15-25) groups 35 . The plasma samples were stored undisturbed at −80 °C until they were thawed in 2021 for protein depletion and subsequent proteomic analysis. The studies were reviewed and approved by the CEIC Parc Salut Mar Clinical Research Ethics Committee (approval nos. 2015/6026, WALNUTs; 2020/9688, Equal-Life). Written informed consent to participate in the original WALNUTs study was provided by the participants' legal guardian/next of kin.
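The grouping rule described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `sdq_group` and the example scores are hypothetical, and the 0-25 bound simply mirrors the score range reported for this subsample.

```python
def sdq_group(total_score: int) -> str:
    """Assign a total SDQ score to the 'lower' (0-14) or 'raised' (15-25) group."""
    if not 0 <= total_score <= 25:
        # The subsample described in the text only spans scores 0-25.
        raise ValueError(f"score {total_score} is outside the 0-25 range used here")
    return "lower" if total_score <= 14 else "raised"


# Hypothetical scores, for illustration only (not study data).
example_scores = [3, 14, 15, 22]
example_groups = [sdq_group(s) for s in example_scores]
print(example_groups)
```

The cut-off sits between 14 and 15, so a score of 14 falls in the lower group and a score of 15 in the raised group.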
Plasma samples were pre-processed and analyzed using liquid chromatography electrospray ionization tandem mass spectrometry at the Turku Proteomics Facility, which is supported by Biocenter Finland. The linear associations between the SDQ score and the protein abundances were investigated using linear modeling with DeqMS 36 . To characterize the biological processes and pathways related to the identified proteins, significantly differentially abundant proteins (adjusted P-value ≤ 0.05) associated with the SDQ score were used in further bioinformatic data analyses. See the Methods for more details.

Protein identification
Using mass spectrometry-based proteomics we successfully identified 1,485 proteins in the WALNUTs plasma samples (N = 91; mean of 1,228 proteins per sample; standard error = 117). The full list of the proteins is presented in the Supplementary Information. Out of these, 77 were identified as contaminants, and were removed. After that, 983 proteins were detected in at least 80% of the samples, and therefore these proteins were used for subsequent analysis.
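The two filtering steps described above (removing contaminants, then keeping proteins detected in at least 80% of samples) can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function name, the protein identifiers and the toy abundance values are all hypothetical, and missing values are represented as `None`.

```python
from typing import Dict, List, Optional, Set


def filter_proteins(
    abundances: Dict[str, List[Optional[float]]],
    contaminants: Set[str],
    min_detected_fraction: float = 0.8,
) -> Dict[str, List[Optional[float]]]:
    """Drop contaminants, then keep proteins detected in enough samples."""
    kept = {}
    for protein, values in abundances.items():
        if protein in contaminants:
            continue
        detected = sum(v is not None for v in values)
        if detected / len(values) >= min_detected_fraction:
            kept[protein] = values
    return kept


# Toy data for five hypothetical samples (not study data).
toy = {
    "PROT_A": [10.1, 9.8, 10.4, 10.0, 9.9],       # detected in 5/5 samples
    "PROT_B": [8.2, None, None, 8.0, 8.1],        # detected in 3/5 samples
    "PROT_C": [7.5, 7.7, None, 7.6, 7.4],         # detected in 4/5 samples
    "CONTAM_1": [12.0, 12.1, 11.9, 12.2, 12.0],   # flagged as a contaminant
}
filtered = filter_proteins(toy, contaminants={"CONTAM_1"})
print(sorted(filtered))
```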
The sex and age variables were added to the linear model to correct for their possible effects. In the analysis, 67 proteins had a linear relationship with the SDQ score, out of which 48 were positively correlated with the SDQ score, and 19 were negatively correlated (Fig. 1b). The proteins associated with the SDQ score are presented in Table 2.
All the significantly altered proteins were used to create a heatmap (Fig. 1b) that shows the protein abundances (z-scores) in relation to the SDQ score.

Enriched pathways and biological processes
Of the highly abundant proteins that were depleted from the plasma samples before the mass spectrometry analysis, nine were found in the data; these were considered as a possible source of bias and were Diagnostic and Statistical Manual (DSM) classifications provided by the World Health Organization (WHO) and the American Psychiatric Association, respectively 14 . Clinical interviews and validated questionnaires used for symptom assessment (for example, the Beck Depression Inventory 15 ) have a major role in mental health diagnostics. The complexity of adolescent behavior and the overlap of symptomatology of several mental disorders complicate the precise and objective diagnosis of diseases in youth. Furthermore, the difficulty in defining normative or atypical expected behavioral development in adolescence 16 , and lack of access to professional expertise, contribute to inaccurate judgment and precise definition of mental health conditions in adolescents 17,18 .
The Strengths and Difficulties Questionnaire (SDQ) is a screening questionnaire for emotional and behavioral problems in children and young people that assesses the impact of difficulties on the child's life, including (1) emotional symptoms, (2) conduct problems, (3) hyperactivity/inattention, (4) peer relationship problems and (5) prosocial behavior [19][20][21] . Past validation studies have shown that the total SDQ score can be considered as a predictive factor for mental health disorders, as children with high SDQ scores have an increased probability for clinical disorders 22,23 . Differences in the total SDQ score seem to reflect the differences in prevalence of mental health disorders, although cross-national differences exist 22,24 . Thus, developing additional tools such as biological measurements for assessing mental health issues could improve identification of adolescents at high risk of mental health dysfunction, and enhance more precise diagnostics.
Although studying human brain tissue may be the most revealing method for measuring alterations related to mental disorders, it poses several severe limitations, including tissue access 25 and high cost in the case of neuroimaging 25 . By contrast, biological fluids such as blood or urine are easier to access and are routinely used for clinical diagnostics. Alterations in the gene expression levels, proteins abundance and biological activity can serve as internal indicators present in biological fluids (biomarkers), of pathogenic processes or responses to an external exposome 25,26 . The blood connects the brain and periphery, and changes in plasma components such as proteins can reflect alterations in the brain associated with mental disorders due to the two-way communication between the central nervous system (CNS) and peripheral circulation 27,28 . Past studies have shown the plasma proteomic changes associated with mental disorders [29][30][31][32] . For example, significant reductions in glia maturation factor beta and brain-derived neurotrophic factor were observed in patients with schizophrenia (SCZ) when compared with healthy volunteers 29 . Thus, peripheral blood plasma is a suitable biological fluid for investigating molecular alterations that reflect those associated with mental health issues, and for providing new understanding on the bidirectional communication between the brain and body [27][28][29][30] .
Limited knowledge exists on whether alterations in plasma proteins could serve as early susceptibility biomarkers to predict the risk of mental health issues, leading to proper clinical interventions before disease onset, even though alterations in pathways and molecules related to hormone signaling, energy metabolism, growth factors, inflammation, oxidation/reduction and protein synthesis have been commonly associated with psychiatric disorders 28 . However, studies by Mongan et al. 33 and English et al. 34 suggest that adolescents at high risk of psychosis could be identified on the basis of the changes in the blood proteome several years before the psychotic experiences manifest. The number of studies focusing on discovering new susceptibility or predictive plasma biomarkers for mental health diseases in adolescents is so far limited.
This explorative study aims to identify and characterize alterations of plasma proteins in adolescents at high risk of developing mental health issues. We identified 67 plasma proteins with abundances significantly associated with the SDQ score, offering new insight into using proteins as susceptibility biomarkers for early identification of adolescents at risk of mental health problems.
excluded from these analyses, leaving 58 proteins significantly associated with the SDQ score. Those proteins were used for a clustering analysis of the differentially abundant proteins identified in our study using the STRING database 37 . The clustering analysis yielded three groups of proteins, as shown in Fig. 2a. Cluster 1 contained up- and down-regulated proteins involved in neuron growth, synaptic function, glial cell migration and cholesterol transport. The second cluster contained only up-regulated proteins mostly involved in the complement and coagulation cascades. Cluster 3 contained three down-regulated proteins involved in the olfactory system and three up-regulated proteins involved in protein degradation.
We also performed an analysis of enriched pathways with Reactome 38 , using all of the identified proteins as the gene background. Enriched pathways were related to the immune system, coagulation, the complement cascade and post-translational protein modification (Fig. 2b). In total, thirteen pathways were significantly (false discovery rate (FDR)-adjusted P-value < 0.05) enriched in the pathway analysis (Supplementary Table 1). The signaling pathway analysis in Ingenuity Pathway Analysis (IPA) revealed that canonical pathways were associated with immune responses, coagulation, complement cascade and signaling, as in the Reactome analysis (Supplementary Table 2).
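Over-representation analyses of this kind are commonly based on a hypergeometric test against a chosen background list. The sketch below shows that calculation in plain Python; it is not the ReactomePA implementation, the function name is hypothetical, and the counts in the example (background size, pathway size, overlap) are made-up numbers chosen only to show how the call works.

```python
from math import comb


def hypergeom_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail probability P(X >= k) for a hypergeometric draw.

    N: proteins in the background list
    K: background proteins annotated to the pathway
    n: proteins in the list of interest
    k: proteins of interest that are annotated to the pathway
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts, for illustration only.
p_value = hypergeom_enrichment_p(k=10, n=58, K=40, N=983)
print(f"enrichment P-value: {p_value:.3g}")
```

Restricting the background to proteins actually detected in the experiment, as the text describes, avoids counting pathways as enriched merely because plasma proteomics tends to detect them.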

Predictive models generation
The relatively large number of samples made it possible to employ modern strategies to determine potentially predictive biomarkers for the low versus raised SDQ score groups. A novel QLattice algorithm 39 was used to create models containing predictive biomarkers that best separate the two groups with low and raised SDQ scores. The Bayesian information criterion (BIC) was used to ensure that the resulting models generalize well from the training to the test set. We performed fivefold cross-validation, fitting logistic regression models with QLattice on different partitions of the data and keeping the lowest BIC-scoring model from each partition. The receiver operating characteristic (ROC) curves and area under the ROC curve (AUC) for each of the models are presented in Supplementary Fig. 2.
Five diverse models were created using a fivefold cross-validation scheme. These models bring similar, albeit complementary, insights, as the whole dataset was split into training and validation sets five times, and each round contained different sub-samples of the data. The five unique models (Table 3) contained eleven proteins in total (Supplementary Table 3). Four of the five models contained proteins with a previously shown connection to the CNS. The first model contained three such proteins: amyloid beta precursor-like protein 1 (APLP1) (P51693), calcium/calmodulin dependent protein kinase II beta (CAMK2B) (Q13554/Q13555) and Reticulon 4 (RTN4; Q9NQC3). The ROC parameters for the models are shown in Supplementary Fig. 2. Only the fifth model contained no proteins previously connected to brain development. The proteins present in the models can be investigated further as potential biomarkers.
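The evaluation scheme described in this section combines three ingredients: a fivefold split, a BIC score for model selection, and the AUC for assessing separation of the two groups. The sketch below illustrates each ingredient in plain Python. It does not use the Feyn/QLattice API, all function names are hypothetical, and stratifying the folds by group is an assumption that the text does not confirm. The labels mimic the reported group sizes (42 lower, 49 raised) but carry no real data.

```python
import math
import random
from typing import List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Split sample indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (raised, lower) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bic(log_likelihood: float, n_params: int, n_samples: int) -> float:
    """Bayesian information criterion; lower values are preferred."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# 0 = lower SDQ group, 1 = raised SDQ group (group sizes as reported in the text).
labels = [0] * 42 + [1] * 49
folds = stratified_folds(labels)
print([len(f) for f in folds])
```

In each round, one fold would serve as the validation set and the remaining four as the training set; the candidate model with the lowest BIC on the training data would then be scored by AUC on the held-out fold.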

Discussion
Plasma proteomic biomarker studies in mental health diseases are a novel field. Increasing evidence shows alterations in plasma proteins associated with different mental disorders such as major depressive disorder (MDD), SCZ, psychotic disorders and bipolar disorders 24,25,32 . Most altered pathways in mental disorders (such as complement cascade and signaling by interleukins) seem to be common to the above-mentioned major psychiatric disorders 32 . Here we report plasma protein alterations related to immune responses, blood coagulation, complement cascade, neuronal degeneration and neurogenesis in adolescents at high risk of mental dysfunction, which was evaluated based on the self-reported SDQ score. It should be kept in mind that assessing the risk of mental health problems in adolescents raises ethical issues, which should be appropriately considered.
In this study we used the total SDQ score as an indicator of mental health dysfunction and predisposition to mental health issues in adolescents. Becker and co-workers have shown the predictive value of the self-reported SDQ in clinical diagnostics, especially combined with parent and/or teacher versions 40 . Furthermore, the self-reported SDQ was shown to be a reliable and valid method for the assessment of behavioral problems in children and adolescents 40 . Goodman et al. have shown that multi-informant (parents, teachers, older children) SDQs in community samples can identify children and adolescents with a psychiatric diagnosis with a specificity of 94.6% and a sensitivity of 63.3%; SDQ scores successfully identified over 70% of individuals with conduct, hyperactivity, depressive and some anxiety disorders 19 . The SDQ performs well as a screening tool, but it is not intended to be used as a psychiatric diagnostic instrument as such 41 . It is therefore considered a useful and valid tool for screening children and adolescents at a high risk of mental disorders 19,41,42 .
This study revealed 58 plasma protein alterations associated with the SDQ score in adolescents. The abundances of 39 proteins were enhanced in the raised SDQ score group, whereas 19 were reduced. We identified altered proteins such as clusterin, vitronectin, complement C2 and coagulation factor XI that have also been reported to be altered in past blood proteomics studies 33,43,44 . Blood coagulation and immune responses, including the complement cascade, were the most enriched pathways altered among proteins significantly associated with the SDQ score. Our clustering analysis revealed up-regulation of complement and blood coagulation cascades. Past studies have also shown associations between early changes in complement and coagulation cascades and increased risk of psychotic disorders in adolescents 33

Proteins are presented with their UniProt accession number and corresponding protein name. The asterisks indicate the highly abundant proteins that were depleted in the pre-processing stage. The effect size indicates the log2-fold-change in expression that results from a unit change in SDQ. The P-values were calculated using DeqMS with the SDQ score as a continuous variable and adjusted using the Benjamini-Hochberg method. log2FC, log2-fold-change (ratio of means).

bipolar disorder patients in several blood proteomic studies 43,44,46 . We found positive correlations with the SDQ score in coagulation factor XI, coagulation factor X and coagulation factor II (thrombin), all of which are involved in blood coagulation. Increased levels of prothrombin and several coagulation factors (F5, F9, F12, F13A1) were also found by English et al. in adolescents who later developed a psychotic disorder 34 .
Several complement components and factors such as C6, C1S, and CFI were altered (mostly increased) in high-risk psychotic disorder adolescents 33,34 and first-episode SCZ patients 47 . In our study, complement proteins such as complement C1q and C1r subcomponents, complement factor I, complement factor H and complement C2 were significantly and mainly positively associated with the SDQ score. Jiang et al. 48 suggested that complement activation together with metabolic up-regulation can increase oxidative stress, which can induce protein damage and cell apoptosis, and thus contribute to the development of SCZ. Altogether, our results are in line with the current understanding of the role of altered immune responses and blood coagulation in the pathophysiology of mental disorders. Furthermore, our findings support the increasing evidence on early changes in coagulation and complement cascades in predisposition to, and development of, mental health issues in adolescents.
A symbolic-regression-based algorithm, QLattice, was used to gain an insight into the potential of proteins to predict the SDQ status. The formed models comprised eleven proteins, four of which have a previously reported connection to the CNS, neurogenesis or mental health. Of the eleven proteins, eight were reported to belong to the first cluster according to STRING database analysis. Of the proteins not previously connected to the CNS, the LYPD3 protein was reported to be an amyloid precursor protein interactor 49 , which can explain its coincidence with APLP1, CAMK2B and CD9 in our predicted models.
APLP1 is a protein residing predominantly in brain tissue, and it has been shown to be involved in brain development 50 and synaptogenesis during post-natal development in mice 51 . We identified a significant negative association between APLP1 and the SDQ score. Pandolfo and colleagues have suggested that amyloid could be a marker of cognitive impairment and altered neurodevelopment in mental diseases. Decreased β-amyloid proteins in CSF have been reported in patients with SCZ and MDD, and altered amyloid precursor protein metabolism in patients with bipolar disorders 52 . Although the role of APLP1 in the pathogenesis of mental disorders is still unknown, this protein, on the basis of our data, warrants further investigation in the context of adolescent mental health.
Two other proteins, RTN4 53,54 and CAMK2B, have been reported to be connected to neuronal development and neuroplasticity 55,56 . CAMK2B was positively associated with the SDQ score. It is a protein connected to dendritic spine and synapse formation, neuronal plasticity and regulation of sarcoplasmic reticulum Ca 2+ transport in skeletal muscle 57 . The beta subunit was reported to be brain specific 55 , yet little is known of its involvement in mental disorders. Another protein present in the top model was RTN4, which was negatively associated with the SDQ score, and has been previously shown to be associated with SCZ 53,58 . RTN4 is a membrane shaping protein in the endoplasmic reticulum involved in the maintenance of the endoplasmic reticulum membrane tubular integrity. Impairment in the RTN4 process has been connected to neurodegeneration. The RTN4A-subtype, also known as Nogo-A, is localized in the CNS and has a role in neuronal growth and maturation during nervous system development 59 . Furthermore, RTN4 was shown to be connected with social behavior and spatial cognition in a study using mice with a missense mutation in the RTN4 receptor 59,60 . The involvement of RTN4 in adolescent mental health remains a topic worthy of detailed investigation. The fourth protein with a reported connection to the CNS was Cadherin 11 (P55287), which has been shown to be enriched in several brain areas during dendrite formation and synaptogenesis [61][62][63] . Given that the proteins reported as probable biomarkers belonged to the same cluster according to the STRING database, and that four of the proteins were shown to be connected to the CNS and neuronal development, it is conceivable that all of the proteins from the cluster are connected to one process. Further investigation of this network might shed new light on the nuances of brain development and mental health in adolescents.
Girls in the raised SDQ score (≥15) group reported slightly earlier puberty changes compared with the lower SDQ score (≤14) group, whereas in boys, the self-reported puberty changes between the lower and the raised SDQ score groups seemed to be the opposite (Supplementary Table 4); however, differences in puberty changes were minor and sex was added as a confounding factor in our linear models. We also examined other possible confounding factors available, including the education level for each parent, the levels of media consumption, levels of social media engagement, drug and alcohol use, and physical activity. The results showed no significant differences among the groups (P-value > 0.05). Furthermore, we used the school that each subject attended as a random variable in linear modeling. No differences in the number of significant proteins were found. We thus concluded that these factors were not likely to confound the investigation of the connection of SDQ to the protein abundance levels.
As the main limitation in this study, the sample number is low in relation to the identified proteins. Linear modeling was performed along with some group-based comparisons to strengthen the statistical power of the analysis, and we managed to detect statistically significant alterations with these sample numbers. Similar N-numbers have also been used in past studies on serum and plasma protein biomarkers in SCZ, MDD and bipolar disorder patients 45 . Furthermore, overnight fasting samples are preferred for proteomics analysis as food intake can influence the protein composition and concentrations in blood 64 . The plasma samples used in the current study were non-fasting samples due to the practical and ethical issues related to the implementation of the WALNUTs study as blood samples were drawn from the adolescents at school in the afternoon.

Conclusion
In this explorative study, we identified protein-based susceptibility biomarker candidates associated with the self-reported SDQ score in adolescents reflecting a risk of developing mental health dysfunction. Significant alterations were found in proteins involved in the immune response, blood coagulation and hemostasis, neuronal degeneration and neurogenesis. Further studies are needed to confirm and validate these biomarker candidates in larger cohorts, as well as follow-up data and studies to evaluate whether these biomarkers are associated with the risk of transition to the clinical state and mental disorders.

Participant recruitment and sample collection
The studies were reviewed and approved by the CEIC Parc Salut Mar Clinical Research Ethics Committee (approval nos. 2015/6026, WALNUTs; 2020/9688, Equal-Life). Written informed consent to participate in the original WALNUTs study was provided by the participants' legal guardian/next of kin. No additional consent was needed for this study. All of the participants were offered free tickets to the science museum of Barcelona. The specifics of the WALNUTs cohort formation were described in previous publications 21,65 . The current manuscript used a subset of 372 baseline blood samples, taken before any dietary intervention, originally described in a previous work 65 . For this study, a sub-group of 91 samples was used to perform the proteomics analysis. These samples were selected on the basis of the SDQ scores: 42 with the lowest SDQ score (SDQ = 0-14) and 49 with the highest SDQ score (SDQ = 15-25). Samples were drawn by a nurse using K2EDTA plus tubes, rested for 1 h and then centrifuged at 2,500 × g for 20 min at 20 °C, refrigerated at 4 °C and frozen to −80 °C within 4 h after extraction 65 . They were stored at −80 °C and were not thawed until the protein depletion was performed before the proteomics analysis.

High-abundance protein depletion
Albumin and IgG represent more than 70% of total protein levels in human plasma samples. The depletion of high-abundance proteins is therefore essential to the identification and analysis of low-abundance proteins. A commercial kit (High Select Top14 Abundant Protein Depletion Mini Spin Columns, catalogue no. A36370, ThermoScientific) was used to deplete the 14 most abundant proteins from plasma before the proteomic analyses. The depleted proteins were human serum albumin, albumin, IgG, IgA, IgM, IgD, IgE, kappa and lambda light chains, α1-acid glycoprotein, α1-antitrypsin, α2-macroglobulin, apolipoprotein A1, fibrinogen, haptoglobin and transferrin, according to the manufacturer's manual. Briefly, 10 µl of total plasma was added to the mini spin columns and incubated for 10 min while rotating, followed by centrifugation of the columns (1,000 × g) for 2 min. The filtrate was collected in 2 ml plastic tubes and stored at −20 °C until preparation for mass spectrometry proteomic analyses, which were performed at the Turku Proteomics Facility supported by Biocenter Finland.

Protein precipitation and digestion
Samples were acetone precipitated and subjected to in-solution digestion. Briefly, four volumes of ice-cold acetone were used to precipitate proteins. Precipitated proteins were resuspended in 8 M urea, 50 mM Tris-HCl for protein denaturation, reduced with 5 mM dithiothreitol and alkylated with 13 mM iodoacetamide. Proteins were digested to peptides with trypsin (Promega) (enzyme:protein ratio 1:30) at 37 °C overnight. After digestion the peptides were desalted with a Sep-Pak C18 96-well plate (Waters), evaporated and stored at −20 °C.

Mass spectrometry analysis
Digested peptide samples were dissolved in 0.1% formic acid and peptide concentrations were determined with a NanoDrop device. Samples were spiked with iRT peptides (Biognosys) for retention time calibration. Equal amounts of samples were analyzed on a nanoflow HPLC system (Easy-nLC1200, Thermo Fisher Scientific) coupled to the Q Exactive HF Orbitrap mass spectrometer (Thermo Fisher Scientific) equipped with a nano-electrospray ionization source. Peptides were first loaded onto a trapping column and subsequently separated in-line on a 15 cm C18 column (75 µm × 15 cm, ReproSil-Pur 3 µm 120 Å C18-AQ, Dr. Maisch HPLC GmbH). The mobile phase consisted of water with 0.1% formic acid (solvent A) and acetonitrile/water (80:20 (v/v)) with 0.1% formic acid (solvent B). A 100 min gradient was used to elute peptides (50 min from 5% to 21% solvent B followed by 40 min from 21% to 36% solvent B). Mass spectrometry data were acquired automatically by using Thermo Xcalibur v.4.1 software (catalog no. OPTON-30965; Thermo Scientific).
In a data-independent acquisition (DIA) method, a duty cycle contained one full scan (400-1,000 m/z) and 40 DIA MS/MS scans covering the mass range 400-1,000 with variable width isolation windows.
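The elution gradient described in this section is piecewise linear, so the mobile-phase composition at any time point can be interpolated. The sketch below covers only the two segments stated in the text (5% to 21% solvent B over 50 min, then 21% to an assumed 36% solvent B over the next 40 min); the remaining 10 min of the 100 min run are not specified in the text and are left out. The function names are hypothetical.

```python
# Breakpoints as (time in min, % solvent B). The 36% end point is an assumption
# based on reading the second segment as running from 21% to 36% solvent B.
GRADIENT = [(0.0, 5.0), (50.0, 21.0), (90.0, 36.0)]


def percent_b(t_min: float) -> float:
    """Linearly interpolate the solvent B percentage at time t_min."""
    if not GRADIENT[0][0] <= t_min <= GRADIENT[-1][0]:
        raise ValueError("time is outside the two gradient segments described")
    for (t0, b0), (t1, b1) in zip(GRADIENT, GRADIENT[1:]):
        if t_min <= t1:
            return b0 + (b1 - b0) * (t_min - t0) / (t1 - t0)
    raise AssertionError("unreachable: the range check above covers all times")


def percent_acetonitrile(t_min: float) -> float:
    """Solvent B is acetonitrile/water 80:20 (v/v), so scale %B by 0.8."""
    return 0.8 * percent_b(t_min)


for t in (0, 25, 50, 70, 90):
    print(f"{t:>3} min: {percent_b(t):.1f}% B, {percent_acetonitrile(t):.1f}% acetonitrile")
```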

Statistical analysis
Data pre-processing and statistical analyses were performed using R (v.4.2.1). Principal component analysis was performed to assess the general quality of the dataset (Supplementary Fig. 1). Identified proteins with more than 20% missing values were excluded from the analysis. Sample normalization was performed using the medianCentering method from the proBatch package 68 . Missing values remaining in the dataset were imputed using the sample minimum method 69 . We compared the low SDQ and raised SDQ groups for the following variables: the education level for each parent, the levels of media consumption, levels of social media engagement, drug and alcohol use, and physical activity. We tested for between-group differences in these potential confounding factors using one-way ANOVA. After correcting for multiple comparisons, no socio-economic, sociodemographic or other factors showed any significant difference between the groups; that is, the group means of those factors did not differ significantly.
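The normalization and imputation steps described above can be illustrated with a short sketch. It is not the proBatch implementation: it assumes log-scale intensities, assumes that median centering shifts each sample so that its median matches the global median, and reads "sample minimum" as the smallest observed value within each sample. The function names and toy values are hypothetical.

```python
from statistics import median
from typing import List, Optional

# Rows are samples, columns are proteins; None marks a missing value.
Matrix = List[List[Optional[float]]]


def median_center(samples: Matrix) -> Matrix:
    """Shift each sample so that its median equals the global median."""
    global_median = median(v for row in samples for v in row if v is not None)
    centered = []
    for row in samples:
        shift = global_median - median(v for v in row if v is not None)
        centered.append([None if v is None else v + shift for v in row])
    return centered


def impute_sample_minimum(samples: Matrix) -> List[List[float]]:
    """Replace missing values with the smallest observed value in that sample."""
    imputed = []
    for row in samples:
        row_min = min(v for v in row if v is not None)
        imputed.append([row_min if v is None else v for v in row])
    return imputed


# Toy log-intensities for two hypothetical samples (not study data).
toy = [
    [10.0, 12.0, None, 11.0],
    [9.0, None, 10.0, 11.0],
]
normalized = median_center(toy)
complete = impute_sample_minimum(normalized)
print(complete)
```

Imputing with a low value reflects the assumption that a protein is usually missing because its abundance fell below the detection limit rather than at random.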

Bioinformatic data analysis
The DeqMS (v.1.16.0) package was used for the differential abundance analysis 36 , with the SDQ score used as a continuous variable. The sex and age of the adolescents were included in the linear model to ensure that the proteins reported are associated with SDQ and were not influenced by these confounding factors. The differences in protein abundances were expressed as log2-fold-change (the ratio of the means of the raised (numerator) and low (denominator) SDQ groups). The P-values were adjusted using the Benjamini-Hochberg procedure. Plasma proteomic datasets from adolescents represent a very small fraction of all proteomic datasets 70 , so to better investigate the functional enrichment, the full list of all proteins found in this study was used as the background gene list in the enrichment analyses. To characterize the enriched pathways related to the identified proteins, significantly differentially abundant proteins (P ≤ 0.05) associated with the SDQ score were used in further bioinformatic data analyses. Proteins depleted before mass spectrometry analysis that showed significant differences between groups were considered a possible source of bias and thus were excluded. The Reactome pathways were investigated using the ReactomePA (v.1.9.4) R package 38 . The enrichment was performed using the enrichPathway function of the ReactomePA package, which uses the hypergeometric model. The P-value was adjusted using the Benjamini-Hochberg method. We used IPA (Ingenuity Systems) for the further enrichment analyses. In the IPA core analysis, default software parameters were used (reference set: ingenuity knowledge base-genes only). The z-score values were used to identify canonical pathways that were expected to be changed in their activity. The STRINGdb package was used to get the protein-protein interaction information for the significantly differentially abundant proteins from the STRING database (v.11.5) 37 .
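Two of the quantities named above, the Benjamini-Hochberg adjusted P-value and the log2-fold-change as a ratio of group means, are simple enough to sketch directly. The code below is a plain-Python illustration rather than the DeqMS or R implementation; the function names are hypothetical and the input values are toy numbers, not study data.

```python
import math
from statistics import mean
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so each adjusted value is the
    # minimum of p * m / rank over all ranks at or above the current one.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def log2_fold_change(raised: Sequence[float], low: Sequence[float]) -> float:
    """log2 of the ratio of group means, raised (numerator) over low (denominator)."""
    return math.log2(mean(raised) / mean(low))


# Toy inputs, for illustration only.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print([round(p, 4) for p in adjusted])
print(log2_fold_change(raised=[4.0, 4.0], low=[1.0, 3.0]))
```

A positive log2-fold-change under this convention means the protein is more abundant in the raised SDQ group.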
The fastgreedy clustering function was used to extract gene clusters with strong associations. QLattice, a novel symbolic-regression-based algorithm that is part of the Feyn package (v.3.0.3), was used to generate models combining the proteins with the best predictive power for the SDQ score 39. Possible biomarkers were searched among the 58 proteins significantly associated with the SDQ score, with five rounds of cross-validation. The resulting models were the top models from each cross-validation round. Result visualizations were performed using the ggplot2 (v.3.4.0) 71 and ComplexHeatmap (v.2.14.0) 72 packages.
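The five-round cross-validation scheme can be sketched as follows. This is only an illustration of the resampling logic, not the QLattice algorithm or the Feyn API: a univariate least-squares line stands in for the symbolic-regression model, and the function names and toy data are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (stand-in model)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def cross_validate(x, y, k=5, seed=0):
    """Return the held-out mean squared error for each of the k folds."""
    errors = []
    for test_idx in k_fold_indices(len(x), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(x)) if i not in held_out]
        # Fit on the training folds only, then score on the held-out fold.
        intercept, slope = fit_line(
            [x[i] for i in train_idx], [y[i] for i in train_idx]
        )
        mse = sum(
            (y[i] - (intercept + slope * x[i])) ** 2 for i in test_idx
        ) / len(test_idx)
        errors.append(mse)
    return errors


# Toy data: one hypothetical protein abundance and a noisy score per sample.
rng = random.Random(1)
toy_abundance = [rng.gauss(0.0, 1.0) for _ in range(91)]
toy_score = [10.0 + 2.0 * a + rng.gauss(0.0, 1.0) for a in toy_abundance]
fold_errors = cross_validate(toy_abundance, toy_score, k=5)
print("held-out MSE per fold:", [round(e, 3) for e in fold_errors])
```

Each sample is held out exactly once, so the per-fold errors give an estimate of how well a model fitted on the remaining samples generalizes.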

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability
The data analyzed in this study are subject to the following licenses/restrictions: the WALNUTs data are not publicly available due to the restrictions of informed consent. The data contain personal information on children and, according to the ethical approval, they should be kept confidential. Data are available from the corresponding author on reasonable request for researchers who meet the criteria for access to confidential data. A data-use/transfer agreement is needed to ensure the protection of privacy and compliance with national data protection legislation, the content and specific clauses of which will depend on the nature of the requested data.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested
A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Data analysis
Data pre-processing and statistical analyses were performed using R (version 4.2.1). The DEqMS (v. 1.16.0) package was used for the differential abundance analysis 67, with SDQ score used as a continuous variable. QLattice, a symbolic-regression-based ML algorithm that is part of the Feyn package (v. 3.0.3), was used to build models to find the proteins with the best predictive power. Result visualisations were performed using the ggplot2 (v. 3.4.0) and ComplexHeatmap (v. 2.14.0) packages.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.

March 2021
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy

The String-DB (v. 11.5) database (https://string-db.org/) was used for data annotation. The Reactome (v. 83) database (https://reactome.org/) was used for data annotation. Ingenuity Pathway Analysis was performed in this study using IPA. The data analysed in this study are subject to the following licenses/restrictions: the WALNUTs data are not publicly available due to the restrictions of informed consent. The data contain personal information on children and, according to the ethical approval, they should be kept confidential. Data are available from the corresponding author upon reasonable request for researchers who meet the criteria for access to confidential data. To ensure the protection of privacy and compliance with national data protection legislation, a data use/transfer agreement is needed, the content and specific clauses of which will depend on the nature of the requested data.

Human research participants
Policy information about studies involving human research participants and Sex and Gender in Research.

Reporting on sex and gender
The biological sex information was obtained from school databases for the original WALNUTs manuscript.

Population characteristics
Peripheral blood plasma samples were obtained and analyzed from a subsample of 91 adolescents aged 11-16 years from the regional Spanish WALNUTs study (Table 1). The samples were collected in 2016-2018, at approximately the same time as the participants filled out the Strengths and Difficulties Questionnaire (SDQ). The SDQ is a screening questionnaire for emotional and behavioral problems in children and young people that assesses the impact of difficulties on the child's life across five areas: (i) emotional symptoms, (ii) conduct problems, (iii) hyperactivity/inattention, (iv) peer relationship problems, and (v) prosocial behavior. Based on the self-reported SDQ score, the plasma samples were categorized into low (0-14) and raised (>15-17) groups.

Recruitment
The original WALNUTs study performed age-, gender-, and maternal education-stratified random computerized sampling within each school to assign adolescents to one of the two groups.

Ethics oversight
The studies were reviewed and approved by the CEIC Parc Salut Mar Clinical Research Ethics Committee (approval numbers: 2015/6026 Walnuts and 2020/9688-Equal-life). Written informed consent to participate in the original WALNUTs study was provided by the participants' legal guardian/next of kin. No additional consent was needed for this study.
Note that full information on the approval of the study protocol must also be provided in the manuscript.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
This is an exploratory study investigating possible connections between the SDQ score and the abundances of plasma proteins in adolescents. All individuals with a raised SDQ score were analysed in this study, and an equal number of individuals with a low SDQ score were included as the control group.

Data exclusions
No data were excluded from the analysis.

Replication
Since the samples were taken from a cohort study, all the samples were analysed. There was no other comparable cohort available. There was no longitudinal data available.

Randomization
The original study was randomised. The samples were also randomised prior to protein sequencing.

Blinding
The SDQ score was calculated after the plasma samples were taken. The samples were anonymised for the protein depletion and the proteomics analysis step.