Identification of risk groups for mental disorders, headache and oral behaviors in adults during the COVID-19 pandemic

The dramatically changing situation during the COVID-19 pandemic is anticipated to provoke psycho-emotional disturbances and somatization, which will become a significant problem for global and regional healthcare systems. The aim of this study was to identify the predictors, risk factors and factors associated with mental disorders, headache and potentially stress-modulated parafunctional oral behaviors among the adult residents of North America and Europe as indirect health effects of the COVID-19 pandemic. This may help limit the long-term effects of this and future global pandemic crises. The data were collected from 1642 respondents using an online survey. The results demonstrated increased levels of anxiety, depression, headache and parafunctional oral behaviors during the COVID-19 pandemic in both North American and European residents. The results of this study facilitated the definition of the group most likely to experience the aforementioned secondary effects of the pandemic. This group included females younger than 28.5 years old, especially those who were single, less well educated and living in Europe. In this and other global crises, these findings will allow the most vulnerable groups to be defined faster and rapid, more targeted interventions to be provided.


Results
Background characteristics of the sample. During the study period, a total of 1642 subjects responded to the questionnaire. In total, 1130 were from North America and Europe, and 99.91% (N = 1129) fully completed the questionnaire; 843 subjects from North America responded to the questionnaire, out of whom 100% (N = 843) fully completed it, and 287 subjects from Europe responded to the questionnaire, out of whom 99.65% (N = 286) fully completed it. The respondents in both groups (North America and Europe) were adults; 47.74% (N = 539) were men and 52.26% (N = 590) were women. Among the respondents from North America, 43.42% (N = 366) were men and 56.58% (N = 477) were women. Among the respondents from Europe, 60.49% (N = 173) were men and 39.51% (N = 113) were women. The age of the respondents ranged from 18 to 72 years old, with a mean age ± standard deviation (SD) of 32.59 ± 9.18 years. The age of respondents from North America ranged from 18 to 72 years old, with a mean age ± SD of 32.65 ± 9.11 years. The age of respondents from Europe ranged from 18 to 66 years old, with a mean age ± SD of 32.09 ± 9.37 years.
Education level. Only 7 participants reported having a primary level of education. These respondents were included in the high school group. We observed a statistically significant effect of education level on the HADS-D (F(2,1126) = 5.20, p = 0.006), HADS-A (F(2,1126) = 9.06, p = 0.0001), HADS total (F(2,1126) = 8.61, p = 0.0002) and OBC (F(2,1126) = 8.97, p = 0.0001) scores. Participants with higher education levels had lower scores on all of the analyzed measures. While moderate scores were characteristic of people with a college education, the highest scores were identified in people with a high school education. In all cases, the post hoc Tukey test showed that differences between people with higher education and a high school education were statistically significant (all p values < 0.005). The differences between people with higher education and a college education and between those with a college education and a high school education were statistically nonsignificant (all p's > 0.09) except for HADS-A. Here, respondents with higher education differed significantly from those with a college education (p = 0.048). The difference between the latter group and people with a high school education was statistically nonsignificant (p = 0.14). The effect of education was statistically nonsignificant for the MIDAS score (F Welch (2, 612.90) = 2.94, p = 0.054). The details are presented in Table 4.
Trees for each target variable. Assumptions. To determine which groups of people are at a high risk for migraine-related disability, depression, anxiety and oral behaviors, we built separate decision trees for all analyzed variables. We expected each tree to maintain a proper balance between specificity and generalizability. If a tree is too specific (detailed), the model can be overfitted and may work well with the current dataset but not with new datasets. An adequate level of generalizability, however, allows us to capture the underlying structure of the data and draw conclusions that can be applied to new observations. Considering the above, we decided that the maximum depth of a tree should be 4 and the minimum number of samples to generate leaves should be 10% of the sample size, which was 113 individuals (those numbers were selected arbitrarily).
Hospital anxiety and depression scale (HADS) anxiety. An explanation of the decision tree construction and the method of tree interpretation is presented in section "Validation of the classification procedure". With regard to the HADS-A score (Fig. 1), place of residence and gender were important features (higher in Europe and higher in females). In Europe, age also affected the HADS-A score. Interestingly, older female individuals had lower HADS-A scores (12) than younger female individuals (14), but among males, the highest scores were identified in those between 29 and 34 years old (13). Younger males and older males had relatively lower scores (12 and 10, respectively). The lowest HADS-A score (9.0) was in the group of American males. The highest HADS-A score (15.0) was in European females younger than 27.5 years.
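The two tree constraints stated above (a maximum depth of 4 and at least 10% of the sample in every leaf) can be checked with a few lines of arithmetic. The snippet below is a minimal sketch, not the authors' code: the variable names are hypothetical, and the `tree_params` dictionary only assumes that the limits would be passed under the `max_depth` and `min_samples_leaf` parameter names used by scikit-learn's decision tree estimators.

```python
import math

# Values taken from the text: 1129 complete responses, maximum depth 4,
# and at least 10% of the sample in every leaf.
N_RESPONDENTS = 1129
MAX_DEPTH = 4
MIN_LEAF_FRACTION = 0.10

# Smallest whole number of respondents that satisfies the 10% rule.
min_samples_leaf = math.ceil(MIN_LEAF_FRACTION * N_RESPONDENTS)

# A binary tree of depth d has at most 2**d leaves. The leaf-size rule caps
# the count further, because the leaves partition the whole sample.
max_leaves_by_depth = 2 ** MAX_DEPTH
max_leaves_by_size = N_RESPONDENTS // min_samples_leaf

# Hypothetical keyword arguments (an assumption about how the limits are passed).
tree_params = {"max_depth": MAX_DEPTH, "min_samples_leaf": min_samples_leaf}

print(min_samples_leaf, max_leaves_by_depth, max_leaves_by_size)
```

Under these assumptions the leaf-size rule, rather than the depth limit, is the tighter constraint on how many groups a tree can distinguish.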
www.nature.com/scientificreports/
Hospital anxiety and depression scale (HADS) depression. An explanation of the decision tree construction and the method of tree interpretation is presented in section "Validation of the classification procedure". In regard to the HADS-D score (Fig. 2), for both males and females, education, age and relationship status were important. Interestingly, among males with education levels above high school, younger individuals (< 28.5) had lower HADS-D scores (6.0) than older individuals (≥ 29), who had an average score of 7.0. Among females with the same level of education, younger individuals had higher HADS-D scores (9.0) than older individuals (8.0). Males with education levels above high school who were younger than 28.5 years old had the lowest HADS-D scores (6.0). The highest HADS-D scores (9.0) were identified in females with a high school education or those with a relatively higher education level who were younger than 28.5 years.
Oral behavior checklist (OBC). An explanation of the decision tree construction and the method of tree interpretation is presented in section "Validation of the classification procedure". In terms of the OBC scores (Fig. 3), age and gender were important. Among females, younger people had higher OBC scores than older people. Among males, middle-aged people (30-34 years old) had the highest OBC score (13.0). Younger males (12.0) and older males (9.0) had relatively lower OBC scores, with the latter having the lowest OBC scores in the entire population. The highest OBC score (14.0) was found in younger (under 28.5 years old) females.
Migraine disability assessment (MIDAS). An explanation of the decision tree construction and the method of tree interpretation is presented in section "Validation of the classification procedure". The tree presented in Fig. 4 shows that among males, age was important for the MIDAS score, but among females, relationship status and education level were important. The tree shows that the lowest MIDAS score, which was 0, was identified in males older than 34.5 years. The highest score (average 15.5) was in females who were not single, had education levels other than higher education and were younger than 28.5 years. The groups with intermediate MIDAS scores can be seen on the tree. Based on each tree, we calculated the feature importance of each characteristic for each analyzed variable. These calculations are presented in Table 5. The most informative (with highest impact on the analyzed value) features are highlighted.
Validation of the classification procedure. When the assigning rule was applied to the dataset for Europeans, the most vulnerable group accounted for 27.6% of the population. The most vulnerable group in North America using the same rule accounted for 45.6% of the population. In other words, if the help was pro-
In this step, classifiers were generated separately for Europe and North America. As mentioned in the methods, we used 60% of the data to train the classifiers (create rules). The remaining 40% of the sample was used to test the efficiency of the classifier.
After the classification was applied to the remaining 40% of the sample, we obtained the predicted probability of the respondents belonging to the high-risk group.
As mentioned in the methods, we further selected the 1/3 of the individuals from the test sample who had the highest predicted probability of belonging to the highest-risk group. In this subsample, we calculated the precision metric, which is the fraction of positive respondents (i.e., those who met the high-risk definition) among those selected for the highest-risk subsample. The operation was performed separately for North American and European residents. For North American residents, 59.8% of the selected people were properly classified as the most vulnerable. For European residents, 44.7% were properly classified.
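The precision calculation described above can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name and the sample data are hypothetical, and rounding the size of the selected third to the nearest whole respondent is an assumption, since the text does not say how the cut was made.

```python
def precision_in_top_fraction(probabilities, is_high_risk, fraction=1 / 3):
    """Precision among the respondents with the highest predicted risk.

    probabilities: predicted probability of belonging to the high-risk group.
    is_high_risk:  True where the respondent actually meets the definition.
    """
    if not probabilities or len(probabilities) != len(is_high_risk):
        raise ValueError("inputs must be non-empty and of equal length")
    # Number of respondents to select (rounding rule is an assumption).
    n_selected = max(1, round(len(probabilities) * fraction))
    # Rank respondents from highest to lowest predicted probability.
    ranked = sorted(zip(probabilities, is_high_risk),
                    key=lambda pair: pair[0], reverse=True)
    selected = ranked[:n_selected]
    true_positives = sum(1 for _, label in selected if label)
    return true_positives / n_selected


# Made-up test sample of nine respondents, for illustration only.
example_probabilities = [0.91, 0.15, 0.78, 0.40, 0.66, 0.05, 0.52, 0.30, 0.22]
example_labels = [True, False, False, True, True, False, False, False, False]
example_precision = precision_in_top_fraction(example_probabilities, example_labels)
print(example_precision)
```

In this toy sample, the three highest-ranked respondents contain two true high-risk cases, so the precision in the selected third is two out of three.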

Discussion
Although COVID-19 primarily affects physical health, the secondary influence of issues related to the pandemic on mental health should also receive attention. Previously published surveys showed increased symptoms of depression, anxiety, and stress related to COVID-19, as a possible result of psychosocial stressors such as the fear of the disease, the loss of life and economic issues 10-27 . Moreover, the results of the aforementioned studies were heterogeneous, probably because of differences in methods, locations and the timing of the studies with regard to the stage of the pandemic 16 . Only two studies on oral behaviors have been published 28,29 . None of the published surveys assessed headache as a somatic symptom related to the COVID-19 pandemic. However, due to a large amount of worrisome information in social media, the general public is also experiencing overwhelming psychological pressure, which may lead to a variety of psychological conditions, stress-related parafunctional oral behaviors and somatization. Moreover, mental health aspects are relegated to the background due to the heavy burden on local and global health systems imposed by the immediate effects of the pandemic. Therefore, it is essential to establish a simple method of identifying the group at high risk of developing psychiatric disorders and to deliver early preventive measures or treatment to avoid further consequences. The present online survey was conducted to identify the predictors, risk factors and factors associated with mental disorders (anxiety, depression), headache, and oral parafunctional behaviors among the adult residents of North America and Europe during the COVID-19 pandemic.
As one of the most important results, we observed high HADS-A scores in the entire study group. The results are in agreement with data on the prevalence of anxiety during other epidemiological or natural catastrophes, such as the Ebola outbreak 30 , tsunamis 31 and the September 11th attacks 32 . Moreover, the most recent data on mental health problems in China during the COVID-19 outbreak have also shown increased risks of anxiety and depression 19 .
In the present study, we observed higher anxiety scores in younger participants and in females. The available literature shows that early-onset generalized anxiety disorder (GAD) is associated with female gender, higher education levels and higher levels of neuroticism, while late-onset GAD is associated with physical illnesses 33 . Moreover, gender-based differences in anxiety have been consistently found, and females are approximately twice as likely as males to have mood disorders 34 .
Numerous studies have shown that health and mortality outcomes for married persons are better than those for single persons 35 , especially among men 36 . In a recent study investigating the relationship between marriage and quality of life, single men were found to have a worse quality of life than married men, whereas single women were found to have a better quality of life than married, separated or divorced women 37 . However, data concerning marital status and anxiety are controversial. A higher anxiety level was observed in single individuals than in married persons among patients with epilepsy 38 . However, the level of anxiety was similar in married and single  39 . In the present study, we did not observe a relationship between the anxiety score and relationship status. In the present study, the anxiety score was also related to educational level. Respondents with higher education levels had lower anxiety scores than those with primary and high school educations. A similar effect of education on anxiety has also been demonstrated previously. The HUNT study showed that a higher educational level may protect against anxiety and depression 40 . Also, Cekirdekci and Bugan showed higher anxiety scores in patients with lower education levels in the population diagnosed with cardiac syndrome X 41 .
The place of residence also seems to influence the anxiety score. In the available literature, the prevalence of anxiety has been shown to be higher in North America than in European countries 42 . We also observed an effect of place of residence on anxiety. In the present study, North American respondents had generally higher anxiety levels than respondents from Europe. On the other hand, taking into account the decision tree analysis and predictors, the highest anxiety level was identified in European females who were younger than 27.5 years.
In the present study, we also tried to establish similar relations for depression scores, as mood disorders are highly prevalent in the global population, with prevalence ranging from 5.4 to 7.8% 43 . The available literature shows that the lifetime prevalence of depression in females is twice that in males 44 . Females with depression tend to have a younger age of onset 45 , longer duration 46 , more severe and recurrent episodes 47 and lower quality of life 48 than male patients. Education level has been associated with the risk of depression 49 . However, this relationship is not consistent, and some studies have shown that a lower education level is not related to a higher prevalence of major depression 50 .
In the present study, effects of gender and education level on the depression score were observed. The highest depression score was observed in females with a high school education or with an education level other than high school who were younger than 28.5 years. Taking into account the place of residence, depression scores were similar in North American and European respondents.
Another aspect studied in this manuscript was headache, as headache can be associated with somatization and is highly prevalent worldwide 51 . Numerous studies in the general population have consistently demonstrated that headache is more prevalent in women than in men 52,53 . The most important risk factors for headache include the overuse of acute migraine medication, ineffective acute treatment, obesity, depression, stressful life events, age, and low education level 53 . In the present study, females and subjects with lower education levels had higher MIDAS scores. As the effect of gender and education level on migraine has been previously described 51,52 , the results of this study are in line with the findings of previous studies. Taking into account the decision tree analysis, the highest MIDAS scores were observed in the group of non-single females with education levels other than higher education who were younger than 28.5 years.
Another aspect examined in the present study was potentially stress-related oral behaviors. To the best of our knowledge, this is the first study to investigate the prevalence of oral behaviors during the COVID-19 pandemic. Oral behaviors are frequently observed in the general population and can lead to serious clinical implications including temporomandibular disorders and orofacial pain 54 . The relationship between oral behaviors and temporomandibular disorder has been reported by several authors in children, adolescents and adults 54-57 . Oral parafunctions include teeth clenching, lip biting, thumb sucking, nail biting and other oral habits. Bruxism is the most common oral motor activity and is estimated to be present in 31% of the general population 58 . Nail biting and holding objects in the mouth are other oral parafunctions observed frequently in children and adolescents 59 . Winocur et al. 55 found that biting hard objects and nail biting were associated with tired jaws in adolescent females. Atsü demonstrated that TMD signs and symptoms were relatively more frequent in the adolescent female group (47.8%), and these results may be explained by biological differences, hormone levels and higher pain sensitivity in women 59 . In the present study, the effects of gender, age, place of residence, and education level on the OBC score were observed. Similar to the results regarding anxiety levels, younger females with lower education levels were in the highest risk group for parafunctional oral behavior.
The highest OBC scores were observed in females younger than 28.5 years. The design of the study was thorough and enabled the authors to obtain results online in an easy way and in a short time period. It allowed us to define risk groups rapidly, which, in the future, may allow the establishment of precisely targeted risk groups and the provision of the necessary prophylactic or treatment measures to the highest-risk population, thereby preventing the development of mental disturbances during this and other global crises. It is very important to achieve scientific advances even when access to patients is difficult, and assistance cannot be provided to everyone. The present study is the first to consider potentially stress-related parafunctional oral behaviors, the occurrence of headaches and the prevalence of mental disorders during the COVID-19 pandemic. Moreover, this is the first study to highlight the risk factors for mental disorders and high-risk groups who could potentially develop mental disorders during global crises. The strength of the study is the fact that it was conducted on a large and representative group of respondents and compared residents in two different continents: North America and Europe. The questionnaires used in the study were validated, established and highly specific tools. The obtained results are novel, interesting and clinically useful.
Despite its novelty and many strengths, this study is not without some limitations. First, the study was performed as an online survey, which, despite widespread access to the Internet, could possibly have defined or limited the study group. This form of data collection is expected to be more accessible to younger respondents, as the elderly may be less proficient with technology. This could influence the reliability of the data and introduce potential bias. Additionally, the fact that the survey was conducted in English was a limitation, especially for residents of Europe.
The present study demonstrated increased levels of anxiety, depression, headache, and oral behaviors during the COVID-19 pandemic in both North American and European residents. For the first time, we have also shown increased levels of oral parafunctional habits during the COVID-19 pandemic, which may result in an increased prevalence of orofacial pain and temporomandibular disorders in the future. Therefore, health care systems should be prepared for more patients with mental disorders, headache, orofacial pain and temporomandibular disorders during the current pandemic and future global crises. The results obtained in this study facilitated the identification of the group at highest risk for the mentioned secondary effects of the pandemic. This group was composed of females younger than 28.5 years old, especially those who were single, less well educated and living in Europe. These results indicate the need to perform further research in this population. Determining this risk group may allow the implementation of screening tests and the faster implementation of preventive and treatment measures, with the aim of reducing the long-term negative effects of this and future global crises. Because screening tests and access to large populations can be very difficult during almost every crisis, the authors recommend, on the basis of the present findings, that screening for psycho-emotional disturbances and somatization be performed first in the defined highest-risk groups. This will allow faster detection of people presenting disturbing symptoms and faster and more accurate implementation of interventions.

Methods
This study was conducted in accordance with the principles of the Declaration of Helsinki. The study was approved by the Ethics Committee of Wroclaw Medical University in Poland (ID: KB-302/2020). All the study participants provided informed consent before being included in the study.
Data collection procedure. To collect the data, the authors created an online questionnaire on the Google Original Platform (Google Forms) because (1) it is the platform with which they have the greatest experience and (2) according to them, it is the most user-friendly for both researchers and respondents. The questionnaire was posted on Reddit, an American social news aggregation platform that also allows users to be involved in discussions. The authors posted links to the questionnaire on several Reddit pages called "subreddits", including local American and European forums as well as SARS-CoV-2-related forums. Reaching out to these internet communities and explaining why such data are important enabled rapid data collection: in 3 days, from March 22nd to March 25th, 2020, the authors collected 1642 answers. It is worth mentioning that the Redditors (as Reddit users call themselves) who took part in the survey were satisfied with their participation, and many of them decided to share the link to the questionnaire with their families or friends. The authors chose to post the questionnaire on Reddit because they noticed that subreddits were thriving in the first quarter of 2020; for example, according to subredditstats.com, a website with statistical data about subreddits, /r/Coronavirus, which is currently the largest SARS-CoV-2-related subreddit, was created on the 20th of January 2020. It had 6 subscribers that day, but by the end of April 2020, it had grown to more than two million subscribers. The questionnaire was anonymous, and the authors did not collect any data that allowed them to identify the respondents.

Questionnaires.
Author sociodemographic survey. The sociodemographic portion of the survey asked basic questions about gender, age, place of residence (name of country), marital status (in a relationship, married or single), education level (primary school, high school or college or higher), and existing medical conditions. As the educational level information could vary between specific countries due to differences between the independent educational systems, we tried to mention all the possible universal types/levels of education: primary school, high school, college graduate, higher education (professional or post-graduate level). Then, for the purposes of statistical analysis, on the basis of the presented possible answers, we created 3 categories: primary education (primary school), secondary education (high school or college graduate) and higher education (professional or post-graduate level). This approach accounted for the differences between the educational systems of individual countries.

Hospital anxiety and depression scale (HADS).
The HADS is a widely used self-assessment of anxiety and depressive symptoms, focusing mostly on the cognitive and psychological aspects 60-62 . Somatic concerns and physical symptoms are not assessed by this scale. It is commonly used in general medical populations as well as in healthy populations 63 . The psychometric properties, including the internal consistency, discriminatory ability, validity and test-retest correlations, are considered satisfactory; thus, the HADS is one of the most commonly used self-assessment questionnaires for anxiety/depression symptom screening 62 .
The HADS consists of a total of 14 items in 2 separate subscales: anxiety (HADS-A) and depression (HADS-D), each of which includes 7 items. All items were scored by the participant using a Likert scale (4 points, from 0 to 3 points). The total score varies from 0 to 42 points, and both subscale scores vary from 0 to 21 points.
The originally recommended cutoff scores for the subscales were as follows: a score from 0 to 7 indicates a noncase, a score from 8 to 10 indicates a possible case, and a score from 11 to 21 indicates a probable case 63 . Currently, the categorization system includes more groups: 0-7, normal; 8-10, mild; 11-15, moderate; and ≥ 16, severe 61 .
In this study, scores of 11 or more were considered to indicate a "high risk of anxiety/depression", according to the cutoff values described above.
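The HADS scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and placing a score of exactly 16 in the severe band is an assumption made so that the bands cover the full 0-21 range.

```python
def hads_subscale_score(item_scores):
    """Sum the 7 items of one subscale (HADS-A or HADS-D), each scored 0-3."""
    if len(item_scores) != 7:
        raise ValueError("each HADS subscale has exactly 7 items")
    if any(score not in (0, 1, 2, 3) for score in item_scores):
        raise ValueError("each item is scored from 0 to 3")
    return sum(item_scores)


def hads_category(subscale_score):
    """Map a subscale score (0-21) to the four-band categorization."""
    if not 0 <= subscale_score <= 21:
        raise ValueError("subscale scores range from 0 to 21")
    if subscale_score <= 7:
        return "normal"
    if subscale_score <= 10:
        return "mild"
    if subscale_score <= 15:
        return "moderate"
    return "severe"  # assumption: 16 belongs to this band


def is_high_risk_hads(subscale_score):
    """Cutoff used in this study: 11 or more on a subscale."""
    return subscale_score >= 11


example_score = hads_subscale_score([2, 1, 3, 2, 1, 2, 1])
print(example_score, hads_category(example_score), is_high_risk_hads(example_score))
```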
Migraine disability assessment (MIDAS). The MIDAS is a short, 5-item tool designed for the rapid assessment of the consequences of migraine for a patient, focusing on time lost (in terms of lack of productivity) due to the headache. The patient indicates the number of days with significant disability due to migraines during the last 3 months before the assessment. The score is obtained by summing days mentioned in the responses to the 5 items, and this total score is classified in one of four clinical groups: little or no disability (0-5 days), mild disability (6-10 days), moderate disability (11-20 days) and severe disability (21 days or more) 64 . All properties, including the internal consistency, test-retest correlations and validity, are considered satisfactory and have been confirmed in several studies 65,66 .
In this study, scores of 21 or more were considered to indicate a "high risk of headache" with a significant impact of those headaches on daily functioning.
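The MIDAS grading described above can be expressed as a small helper. The sketch below is illustrative only: the function names and the example answers are hypothetical, and the day ranges are the ones listed in the text.

```python
def midas_total(days_per_item):
    """Sum the days reported for the 5 MIDAS items (last 3 months)."""
    if len(days_per_item) != 5:
        raise ValueError("the MIDAS has exactly 5 items")
    if any(days < 0 for days in days_per_item):
        raise ValueError("days cannot be negative")
    return sum(days_per_item)


def midas_grade(total_days):
    """Classify a MIDAS total into one of the four clinical groups."""
    if total_days < 0:
        raise ValueError("the total cannot be negative")
    if total_days <= 5:
        return "little or no disability"
    if total_days <= 10:
        return "mild disability"
    if total_days <= 20:
        return "moderate disability"
    return "severe disability"  # 21 days or more: the study's high-risk cutoff


example_total = midas_total([4, 6, 5, 3, 5])  # made-up answers
print(example_total, midas_grade(example_total))
```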

Oral behavior checklist (OBC).
The OBC is a self-assessment tool designed for the evaluation of the frequency of different oral behaviors during the day or at night. It consists of 21 items, out of which 2 refer to night-time behaviors, while the rest refer to daily oral function. For each item, a participant provides an answer describing the frequency of this behavior: during the night (how many nights in a week such behavior appears) or during the day (none of the time/a little of the time/some of the time/most of the time/all of the time). For each item, a score of 0-4 points is assigned, yielding a total sum in the range from 0 to 84 points. The score is interpreted as follows: 0, no risk of parafunctional oral activity; 1-24, low risk of parafunctional oral activity; 25-84, high risk of parafunctional oral activity.
During the design of the study, the internal consistency, test-retest correlations and validity were found to be good, and the OBC is the tool most commonly used for the assessment of oral behaviors 67 .
In this study, a score of 25 points or more was used as a cutoff value for a high risk of parafunctional oral activity.
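As an illustration of the OBC scoring described above, the sketch below sums 21 item scores and applies the risk bands. It rests on assumptions: the names are hypothetical, the mapping of the daytime answers to 0-4 points simply follows the order in which the options are listed, and the night-time items are taken as already converted to 0-4 points because the text does not give that conversion.

```python
# Assumed mapping of daytime answers to points, following the listed order.
DAYTIME_POINTS = {
    "none of the time": 0,
    "a little of the time": 1,
    "some of the time": 2,
    "most of the time": 3,
    "all of the time": 4,
}


def obc_total(item_points):
    """Sum the points of the 21 OBC items, each scored 0-4."""
    if len(item_points) != 21:
        raise ValueError("the OBC has exactly 21 items")
    if any(points not in range(5) for points in item_points):
        raise ValueError("each item is scored from 0 to 4")
    return sum(item_points)


def obc_risk(total):
    """Interpret an OBC total score (0-84)."""
    if not 0 <= total <= 84:
        raise ValueError("OBC totals range from 0 to 84")
    if total == 0:
        return "no risk"
    return "low risk" if total <= 24 else "high risk"


# Made-up respondent: 2 night-time items (already in points) + 19 daytime items.
example_points = [1, 0] + [DAYTIME_POINTS["some of the time"]] * 19
example_obc = obc_total(example_points)
print(example_obc, obc_risk(example_obc))
```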
Target group definitions. Given that the HADS does not include questions on somatic concerns, we defined several target "high-risk" groups by combining a score indicating a high risk of mental health issues (HADS-D/A) with a score indicating a high risk of somatic/physical issues (the MIDAS or OBC). In this way, 4 different high-risk target groups were established:
• anxiety and headaches (HADS-A score of 11 or more AND MIDAS score of 21 or more);
• anxiety and oral parafunctional activity (HADS-A score of 11 or more AND OBC score of 25 or more);
• depression and headaches (HADS-D score of 11 or more AND MIDAS score of 21 or more);
• depression and oral parafunctional activity (HADS-D score of 11 or more AND OBC score of 25 or more).
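The four group definitions above are simple logical conjunctions of the cutoffs, which can be sketched as follows. The function and key names are hypothetical; only the cutoff values come from the text.

```python
# Cutoffs stated in the text.
HADS_CUTOFF = 11
MIDAS_CUTOFF = 21
OBC_CUTOFF = 25


def target_groups(hads_a, hads_d, midas, obc):
    """Return membership in each of the 4 high-risk target groups."""
    anxiety = hads_a >= HADS_CUTOFF
    depression = hads_d >= HADS_CUTOFF
    headaches = midas >= MIDAS_CUTOFF
    oral = obc >= OBC_CUTOFF
    return {
        "anxiety_and_headaches": anxiety and headaches,
        "anxiety_and_oral_parafunction": anxiety and oral,
        "depression_and_headaches": depression and headaches,
        "depression_and_oral_parafunction": depression and oral,
    }


# Made-up respondent: high anxiety and many headache days, nothing else.
example_membership = target_groups(hads_a=12, hads_d=5, midas=30, obc=10)
print(example_membership)
```

A respondent can belong to several groups at once, for example when both subscale scores and both somatic scores exceed their cutoffs.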
Based on the definitions of these high-risk target groups, in the third stage of the research, we defined the group at highest risk of negative effects and applied a classification algorithm to predict if the likelihood of belonging to such a group was associated with any basic characteristics (age, gender, place of residence, relationship status and education level).
In this way, the possible combinations of the risks of two different mental disorders with the risks of two different somatic/physical manifestations were exhaustively analyzed. It is noteworthy that the use of the HADS does not allow any diagnosis of depressive or anxiety disorders; it only indicates a high risk of anxiety/depressive symptoms 62 and suggests that professional/institutional help should be administered.

Statistical analysis.
We analyzed our data in three stages. First, we analyzed the background characteristics of the entire sample. When our data did not satisfy the assumptions of standard parametric analyses, we used either nonparametric tests or relevant alternatives to classic parametric methods. Accordingly, Spearman's rank correlation coefficients were used to assess the relationships between nonnormally distributed continuous variables (participant age and MIDAS, HADS, OBC scores). To test for significant differences between groups of respondents defined by categorical variables (i.e., respondent gender, place of residence, relationship status, education level), we used either classic one-way ANOVA (if variances in the compared groups were homoscedastic) or Welch's one-way ANOVA (if group variances were heteroscedastic). We report the results from Welch's ANOVA analyses with the appropriate remarks. If Welch's ANOVA indicated statistically significant results and more than two groups were analyzed, post hoc pairwise comparisons were performed with the Games-Howell test. Post hoc pairwise comparisons for the classic one-way ANOVA were conducted with the Tukey test.
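For readers unfamiliar with Welch's one-way ANOVA, the sketch below computes its test statistic and degrees of freedom from raw group data using the standard textbook formulas. It is not the authors' code: the function name and the sample groups are hypothetical, and the p value is omitted because it requires the F distribution, which the Python standard library does not provide. The fractional second degree of freedom it returns is the reason Welch results are reported in the form F Welch (2, 612.90).

```python
from statistics import mean, variance


def welch_anova(groups):
    """Return (F, df1, df2) for Welch's one-way ANOVA.

    groups: list of lists of observations, one list per group. Each group
    needs at least 2 observations and a non-zero variance.
    """
    k = len(groups)
    if k < 2:
        raise ValueError("at least two groups are required")
    sizes = [len(group) for group in groups]
    means = [mean(group) for group in groups]
    # Each group is weighted by n / s^2, so noisy groups count for less.
    weights = [n / variance(group) for n, group in zip(sizes, groups)]
    total_weight = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / total_weight
    between = sum(w * (m - weighted_mean) ** 2
                  for w, m in zip(weights, means)) / (k - 1)
    # Correction term shared by the denominator of F and by df2.
    lam = sum((1 - w / total_weight) ** 2 / (n - 1)
              for w, n in zip(weights, sizes))
    f_stat = between / (1 + 2 * (k - 2) * lam / (k ** 2 - 1))
    df2 = (k ** 2 - 1) / (3 * lam)
    return f_stat, k - 1, df2


# Made-up groups with equal means but very different variances.
example_groups = [[1, 2, 3], [0, 2, 4], [1.5, 2, 2.5]]
example_f, example_df1, example_df2 = welch_anova(example_groups)
print(example_f, example_df1, example_df2)
```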
Decision tree-based analysis. In the second stage, we generated a regression decision tree to identify the multidimensional dependencies among all characteristics identified in the first stage (age, gender, place of residence, relationship status and education level) and analyzed the values (MIDAS, HADS and OBC).
The primary goal of using decision trees is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A decision tree is one of many predictive modeling methods. The clear advantage of the decision tree model over other methods (linear regression, supported vector machines, artificial neural networks and others) is its graphical representation enabling the straightforward interpretation of the rules explaining dependencies among variables. Unlike in the standard procedure, in which one subset of data is used to train the model and the second subset is used to validate its efficiency, here we built a model on the entire available dataset to visualize and discern the influence of the variables on the analyzed values. In this study, we used the common Python (version 3.7.4) programming language and generated decision trees with the CART algorithm. The authors used decision tree implementation provided in the scikit-learn library (one of the most popular machine learning libraries on GitHub. https:// scikit-learn. org/). Decision trees can be built only with continuous (numerical) variables that require prior transformation, and the categorical features (gender, place of residence, relationship status and education level) were encoded as a numeric array. We used the label binarization method to transform each categorical variable (LabelBinarizer method in sklearn.preprocessing, https:// scikit-learn. org/ stable/ modul es/ gener ated/ sklea rn. prepr ocess ing. Label Binar izer. html). If the variable had 2 possible values, e.g., gender (female, male) one result feature is generated. In www.nature.com/scientificreports/ the example of gender, the result would be 1 if person is a male and 0 if the person is a female. 
When the initial variable has more options, e.g., education level (higher, college, high school), the result of label binarization is a set of features, one for each possible option; each feature is assigned the value of 1 if the original row has that option and 0 otherwise.
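As an illustration, the encoding described above can be sketched with the LabelBinarizer class named in the text. The sample values below are hypothetical and are not taken from the study data; only the category names follow the examples given above.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical respondents (illustrative values, not study data).
gender = ["female", "male", "male", "female"]
education = ["higher", "college", "high school", "college"]

# A variable with 2 possible values yields a single 0/1 column.
# Classes are sorted alphabetically, so "male" becomes 1 and "female" 0.
gender_binarizer = LabelBinarizer()
gender_encoded = gender_binarizer.fit_transform(gender)

# A variable with more options yields one 0/1 column per option.
education_binarizer = LabelBinarizer()
education_encoded = education_binarizer.fit_transform(education)

print(gender_binarizer.classes_, gender_encoded.ravel())
print(education_binarizer.classes_)
print(education_encoded)
```

Each row of the education matrix contains exactly one 1, placed in the column of the option that the respondent reported.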
Although decision trees are a very informative and compelling method of data exploration and data mining 68, they are not very common in the biomedical literature; therefore, Fig. 5 presents a hypothetical decision tree together with brief guidelines for the appropriate interpretation of the trees generated in our study.
Each decision tree includes a root node, subnodes and leaf nodes, which are connected by branches. Nodes include the predicate (e.g., male ≤ 0.5), the size of the sample (samples), the predicted value at the current level (value) and the mean absolute error (mae). Prediction with decision trees is performed by navigating down the tree through the logical results of the predicates until a leaf is reached 69. If the logical test of the predicate results in "true", we follow the left-hand branch; otherwise, we follow the right-hand branch. The predicted value is presented in the leaf node. For example, to use the presented tree to predict the MIDAS score for a single (single = 1) woman (male = 0), we first evaluate male ≤ 0.5 (the predicate in the root node). In our case, the logical value "true" is returned, so we follow the "true" branch (left-hand side). The predicted value for females is 10.0. The next step in the prediction process is to evaluate single ≤ 0.5. In the given example (single = 1), the result is "false", so we follow the right-hand side. The final prediction is 6.0. We also know that there are 126 observations of single females in the initial dataset. The indicated mae is high, so this prediction is likely inaccurate.
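The navigation procedure described above can be sketched in a few lines of Python. The nested dictionary is a hypothetical stand-in for the tree in the worked example: only the predicates, the value 10.0 for females, the final prediction 6.0 and the 126 observations come from the text, whereas the two leaves marked as placeholders are invented solely to make the sketch runnable.

```python
# Hypothetical tree mirroring the worked example in the text.
tree = {
    "feature": "male", "threshold": 0.5,
    "true": {  # male <= 0.5, i.e. females; value at this level is 10.0
        "feature": "single", "threshold": 0.5, "value": 10.0,
        "true": {"value": 12.0},                  # placeholder leaf, not from the source
        "false": {"value": 6.0, "samples": 126},  # single females
    },
    "false": {"value": 8.0},                      # placeholder leaf, not from the source
}


def predict(node, person):
    """Follow the 'true' (left) or 'false' (right) branch until a leaf is reached."""
    while "feature" in node:
        passes = person[node["feature"]] <= node["threshold"]
        node = node["true"] if passes else node["false"]
    return node["value"]


single_woman = {"male": 0, "single": 1}
prediction = predict(tree, single_woman)
print(prediction)
```

For the single woman of the example, the traversal goes left at the root (male ≤ 0.5 is true) and right at the next node (single ≤ 0.5 is false), ending in the leaf with the value 6.0.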
The mae is a measure of the error between observations expressing the same phenomenon. It is calculated with the following formula:

$$\mathrm{mae} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - x_i\right|$$

where $y_i$ is the prediction, $x_i$ is the actual value and $n$ is the sample size.

Having generated a decision tree, we are able to evaluate the importance of each feature in the prediction process. Feature importance evaluation always pertains to a generated decision tree. To perform such an evaluation, the Gini importance score is calculated. Splits in a decision tree are determined by choosing the feature and splitting criterion that result in the greatest reduction in total impurity, which ultimately indicates the importance of that feature in the specific tree. A split that generates a large decrease in impurity is considered important; therefore, variables used to determine important splits are also considered important. Based on this idea, the importance for each variable X in terms of the reduction in impurity is computed as the sum of all the measures of the decrease in impurity at all nodes in the tree at which a split occurs based on X 70.
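The mae reported in each node, i.e. the mean of the absolute differences between the predictions and the actual values, can be computed as in the following minimal sketch. The function name and the sample scores are illustrative assumptions and are not taken from the study.

```python
def mean_absolute_error(predictions, actual):
    """Mean of |y_i - x_i| over n paired predictions y_i and actual values x_i."""
    if len(predictions) != len(actual) or len(predictions) == 0:
        raise ValueError("both sequences must be non-empty and of equal length")
    n = len(predictions)
    return sum(abs(y - x) for y, x in zip(predictions, actual)) / n


# Hypothetical leaf: every respondent in the leaf receives the same prediction.
predicted = [6.0, 6.0, 6.0]
observed = [4.0, 6.0, 11.0]
mae = mean_absolute_error(predicted, observed)
print(mae)
```

A large mae relative to the predicted value indicates that the observations in a node are widely scattered around the prediction.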
When the Gini importance score is 1, it means that one feature is sufficient to predict the analyzed value. If it is 0, such a feature is not represented in a tree at all. The sum of all Gini scores for all features is 1. The higher the Gini score, the more informative (important) a feature is (the more influence it has on an analyzed value).
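A minimal sketch of how such importance scores can be read from a fitted scikit-learn tree is given below. The data are synthetic and the feature names are hypothetical. The default splitting criterion is used to keep the sketch independent of the library version, although the mae shown in the tree nodes suggests that an absolute-error criterion was used in the study; this is an assumption rather than a reported setting.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

feature_names = ["male", "single", "age"]  # hypothetical feature set

# Synthetic respondents (not study data): binary male/single flags and age.
X = np.array([
    [0, 1, 22], [0, 0, 35], [1, 1, 28], [1, 0, 40],
    [0, 1, 25], [1, 0, 31], [0, 0, 45], [1, 1, 23],
])
# Synthetic target scores, invented for illustration only.
y = np.array([12.0, 6.0, 5.0, 3.0, 14.0, 4.0, 7.0, 5.0])

# The model is fitted on the entire dataset, as described in the text.
model = DecisionTreeRegressor(max_depth=2, random_state=0)
model.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = dict(zip(feature_names, model.feature_importances_))
for name, score in importances.items():
    print(f"{name}: {score:.3f}")
```

A feature that is never used for a split receives a score of 0, and the scores of all features add up to 1.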
Validation of the prediction of the high-risk group. The third stage of the analysis validated how accurate the prediction of the high-risk group was and involved the application of the knowledge acquired in the previous stages. The target group definition presented earlier was used in this stage. The primary aim of identifying dependencies among all characteristics selected in the first stage (age, gender, place of residence, marital status and education level) and the analyzed scales was to determine the most vulnerable individuals to enable the precise and efficient targeting of the provision of support. In this stage of the analysis, we validated the efficiency of the developed classification procedure. This was accomplished in several steps. First, we defined a rule assigning a person to the most vulnerable group according to the mentioned cutoff points for each of the analyzed measures. This rule was based on the 4 different high-risk target groups defined earlier. Consequently, the rule assigning a given person to the most vulnerable group was as follows:

If
Having identified this group, we first assumed, as a baseline, that relevant interventions would be delivered to the entire populations of North America and Europe, considered separately. However, the efficiency of such an approach is questionable since a great deal of effort will probably be devoted to diagnosing and providing help to people who do not actually need it. The efficiency of such an approach is calculated as the percent of the population that truly needs the help. The next step involved building a classifier using part of the initial dataset (the so-called training set) to train the model. The training set was 60% of the initial dataset. During the training process, we fed the algorithm the basic characteristics of the respondents: gender, relationship status, and education level. The target output was whether an individual belonged to the most vulnerable group.
The goal of the classifier is to assign a person who has not been diagnosed to a specific group, i.e., (i) the most vulnerable or (ii) the rest of the population, knowing only the mentioned basic characteristics and lacking an actual diagnosis.
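The training step can be sketched as follows. The text does not name the type of classifier, so the use of DecisionTreeClassifier is an assumption made for illustration, as are the synthetic data, the feature names, the stratified split and the labeling rule, none of which reproduce the rule or data used in the study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic binarized characteristics (not study data);
# hypothetical columns: male, single, higher_education.
n_respondents = 200
X = rng.integers(0, 2, size=(n_respondents, 3))

# Placeholder label: 1 = most vulnerable group, 0 = rest of the population.
# This rule is invented for the sketch and is NOT the rule from the study.
y = ((X[:, 0] == 0) & (X[:, 1] == 1)).astype(int)

# 60% of the respondents form the training set; the remaining 40% are held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

classifier = DecisionTreeClassifier(max_depth=3, random_state=0)
classifier.fit(X_train, y_train)

# Probability of belonging to the most vulnerable group for undiagnosed persons.
risk_probability = classifier.predict_proba(X_test)[:, 1]
print(risk_probability[:5])
```

Only the basic characteristics enter the model, so the trained classifier can be applied to persons for whom no questionnaire scores are available.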
After the classifier is trained, its efficiency should be verified. Classification quality is calculated based on the remaining part of the initial dataset that was not used in the training process. The test set was composed of the remaining 40% of respondents, and classification was performed on these samples. The classifier predicts the probability that a respondent belongs to the high-risk group. Assuming that help is delivered primarily to the persons predicted by the classifier to be the most vulnerable, we calculated the efficiency and compared it to the initial situation, in which we assumed the delivery of interventions to the entire population. We assumed that support could be provided to 1/3 of the population due to resource-related limitations; therefore, the results were calculated for the 1/3 of respondents for whom the classifier predicted the highest probability of a high level of risk.
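The comparison described above can be sketched in plain Python. The function names, the rounding of the group size and the sample values are illustrative assumptions; efficiency is taken, as in the text, to be the share of persons receiving help who truly need it.

```python
def baseline_efficiency(is_vulnerable):
    """Efficiency when help is delivered to everyone in the test set."""
    return sum(is_vulnerable) / len(is_vulnerable)


def targeted_efficiency(probabilities, is_vulnerable, fraction=1 / 3):
    """Efficiency when help goes only to the top `fraction` ranked by predicted risk."""
    n_selected = max(1, round(len(probabilities) * fraction))
    # Rank respondents from the highest to the lowest predicted probability.
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: probabilities[i], reverse=True)
    selected = ranked[:n_selected]
    return sum(is_vulnerable[i] for i in selected) / n_selected


# Hypothetical test set: predicted probabilities and true group membership.
probabilities = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7]
is_vulnerable = [1, 0, 1, 0, 0, 0]

baseline = baseline_efficiency(is_vulnerable)
targeted = targeted_efficiency(probabilities, is_vulnerable)
print(f"baseline: {baseline:.2f}, targeted: {targeted:.2f}")
```

In this toy example, one third of the respondents truly need help, so delivering help to everyone has an efficiency of 1/3, whereas both persons selected by predicted risk belong to the vulnerable group.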
The classifiers were created separately for North America and Europe and were validated in the corresponding sets.

Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.