Prediction of acute suicidal ideation in young adults using multi-dimensional scales: A graph neural network approach

Precise remote evaluation of both suicide risk and psychiatric disorders is critical for suicide prevention as well as psychiatric well-being during the COVID-19 crisis. Using questionnaires is an alternative to labour-intensive diagnostic interviews in a large general population, but previous models for predicting suicide attempts suffer from low sensitivity. We developed and validated a graph neural network model, MindWatchNet, which increased the prediction sensitivity of suicide risk in young adults (n = 17,482 for training; n = 14,238 for testing) using multi-dimensional questionnaires and suicidal ideation (SI) within 2 weeks as the prediction target. MindWatchNet achieved the highest sensitivity of 80.9% and an area under curve of 0.877 (95% confidence interval, 0.854–0.897). We demonstrated that multi-dimensional deep features covering depression, anxiety, resilience, self-esteem, and clinico-demographic information contribute to SI prediction. MindWatchNet might be useful in the remote evaluation of suicide risk in the general population of young adults for specific situations such as the COVID-19 pandemic.

lower sensitivity of logistic regression, and SVM models. a A total of 358 participants received structured interviews, of which 102 were diagnosed with MaDE, accounting for 0.32% of the total 31,720 participants. The results from a state-of-the-art study by Jung et al 33 were cited, where the confidence intervals of AUC were not reported.


Introduction
Suicide is the second leading cause of death in young adults (individuals 10-34 years old) in the US and is 2.5 times more frequent than homicide (48,344 vs. 18,830 deaths, respectively) 1 . Over the last two decades, the total suicide rate in the US increased by 35%, reaching an all-time high in 2018 1 . Suicidal ideation (SI) and suicide attempts (SAs), which are strong risk factors for completed suicide, are prevalent in the population (11-14% and 2.8-4.6%, respectively) 2 . Worldwide, the number of suicides is over 800,000 annually 3 , and 60-70% of those who die by suicide do so on the first or "index" attempt. Additionally, only approximately 30-40% of survivors received emergent hospital-level care 4,5 . Thus, accurate prediction of first SAs, or of individuals with imminent suicide risk, followed by immediate intervention, would be effective in suicide prevention, leading to decreased mortality in young adults.
During pandemics such as the novel coronavirus disease 2019 (COVID-19) pandemic, remote mental health evaluation of people who are self-isolating to prevent viral spread is critical. To date, more than 43 million confirmed cases and one million deaths have been attributed to the COVID-19 pandemic, and global lockdowns are considered effective interventions to combat the virus 6 . However, these interventions, as well as the pandemic itself, can increase the potential for adverse outcomes on suicide risk 7 . As with the Spanish influenza pandemic in the US during 1918-19 8 , the monthly suicide rate increased by 16% in Japan during the second wave of the COVID-19 pandemic 9 . Symptoms of anxiety and depressive disorder markedly increased in the US during April-June 2020 10 compared with the same period in 2019 11 . Pre-existing psychiatric disorders are associated with increased SI as a psychological impact of the COVID-19 pandemic 12 and contribute to predicting future individuals with SI in young adult populations 13 . Moreover, younger adults reported having experienced disproportionately worse mental health outcomes and more elevated SI than older adults 14 . Thus, the development of precise remote evaluation techniques for both suicide risk and psychiatric disorders is critical for suicide prevention as well as psychiatric well-being when lockdowns are prolonged.
However, there are many challenges in evaluating suicide risk in a large general population. In a pandemic situation, it is too labour intensive and clinician dependent to conduct structured interviews or scales for SI 15 to assess present and past mental health in an entire population. Moreover, cases may be missed during screening with simple questionnaires in general population studies because most studies further evaluate cases only when respondents report having SI, which could mask true patients at risk for suicide 16 . Existing prediction models 5,17-19 for suicidal behaviour achieved an accuracy over 80%, but, at the same time, these models had a very low sensitivity, which is due to the low incidence of SAs in the general population. Specifically, the model sensitivity is more important than the specificity for monitoring suicide risk: missing individuals who go on to attempt suicide because of the low sensitivity of the monitoring system, which results in irreversible events, is more critical than flagging individuals who will not attempt suicide because of a low specificity. Although knowing who is going to commit suicide is critical, it is difficult to predict who will attempt suicide with a prospective study. Active screening for COVID-19 risk 20 , which extends the scope of the risk group, could reduce the likelihood of missing an infected person compared to traditional approaches that target people with obvious symptoms. Similarly, expanding the range of suicide risk to SI instead of SAs could increase the sensitivity of the prediction model and may contribute to reducing the number of individuals who may attempt suicide.
In many countries, including South Korea, a large population of young adults is obliged to have regular check-ups, including mental status examinations, for work or when entering a dormitory for college, and these check-ups may be the only portal to access individuals who may attempt suicide. We employed the multiple scales included in regular mental status examinations to predict imminent suicide risk. We used acute SI within 2 weeks as a surrogate marker for prospective imminent suicide risk 21 . To extract a good representation of acute SI from scales, multi-dimensional questionnaires evaluating depression, anxiety, resilience, and self-esteem levels were used as input features of the neural network. To overcome these challenges, we present a novel graph neural network (GNN)-based model that employs multi-dimensional scale-based prediction of depression and acute SI with a high sensitivity and specificity, which could promote the deployment of the model in the real world.

Attention plots and interpretation
The raw averaged attention plots without normalization are given for the test set in Fig. 2a. In the attention plot comparing questionnaire items by using row-wise normalization (Fig. 2b), a high score (i.e., 4 points) for item 2 of the PHQ-9 (i.e., PHQ_2, feeling down, depressed, or hopeless) was the most salient positive feature (i.e., a feature that increases the prediction score of acute SI) among the 19 items of the questionnaires. The second most salient positive feature was a high total score for the STAI-S (i.e., the 4th quartile group), which represents a high level of anxiety. For low scores (i.e., 1 point), the most salient negative feature (i.e., a feature that decreases the prediction score of acute SI) was a low PHQ_2 score, which means that a low PHQ_2 score is the feature most strongly associated with reduced SI.
For intermediate scores (i.e., 2-3 points), the most salient positive features were PHQ_8, PHQ_6, and PHQ_9 (i.e., psychomotor symptoms, low self-esteem, and suicidal thoughts, respectively) to a similar degree. Moreover, for point 1, the RAS total score was the most salient negative feature for SI, which means that the 3rd quartile group (because of reversed order) of RAS total points had decreased SI.
In the attention plot comparing binary items by using column-wise normalization (Fig. 2c)
In the plot using the L1-norm of the attention vector, obtained for each column of the 19 questionnaire items (Fig. 2d)

Ablation study for PHQ-9 item 9
Because PHQ_9 is related to acute SI, the model performance with and without PHQ_9 was obtained for comparison: the sensitivity, specificity, accuracy, and AUC were 79.77%, 77.85%, 77.87%, and 0.869 (95% CI, 0.846-0.891), respectively, for the model without PHQ_9. There was no significant difference in the AUCs between models with and without PHQ_9 (AUC = 0.877 vs. 0.869; p = 0.150) (Supplementary Fig. 2).

Validity of the labels for acute SI: comparison study
Among the subjects in the test set, only n = 792 of 13,408 subjects completed the Korea Advanced Institute of Science and Technology (KAIST) Scale for Suicide Ideation (KSSI). Spearman's rank correlation coefficient 1) between the predicted scores by MindWatchNet and the total KSSI score was ρ_pred = 0.719 (p < 0.0001) and 2) between PHQ_9 and the total KSSI score was ρ_PHQ = 0.446 (p < 0.0001). In the comparison of the correlation coefficients, ρ_pred was larger than ρ_PHQ (p < 0.0001). A scatter plot between the raw predicted scores (i.e., the model output before applying the sigmoid function) by MindWatchNet and the total KSSI score is shown in Fig. 3.

Discussion
We developed a GNN model to predict acute SI within 2 weeks, which showed improved sensitivity compared to baseline models, and validated it in an external test set: the sensitivity, specificity, accuracy and AUC were 80.9%, 80.6%, 80.6%, and 0.877 (95% CI, 0.854-0.897), respectively, using an ensemble of GIN models with different sampling methods (i.e., MindWatchNet); these values were 15.03%, 99.81%, 98.72%, and 0.574 (95% CI, 0.552-0.597), respectively, using an SVM. Specifically, MindWatchNet, based on a GIN to predict SI, improved the sensitivity significantly at the cost of reductions in the specificity and accuracy. The low sensitivities of the baseline models prevented the identification of individuals who may attempt suicide before the attempt occurs, resulting in irreversible events 5 . In contrast, MindWatchNet achieves a significant increase in the sensitivity compared to previous baseline models, allowing more individuals who may attempt suicide to be identified, suggesting that this model can potentially be of great help in the real world.
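The trade-off described above can be illustrated with a small worked example. The sketch below computes sensitivity, specificity, and accuracy from confusion-matrix counts. The counts are hypothetical (100 positives among 10,000 subjects) and are not the study's data; they were chosen only to show how a model can reach roughly 99% accuracy while missing most positive cases when positives are rare.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)              # fraction of true positives detected
    specificity = tn / (tn + fp)              # fraction of true negatives detected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


# Hypothetical counts, NOT the study's data: 100 positives, 9,900 negatives.
# A conservative classifier that flags very few subjects:
conservative = confusion_metrics(tp=15, fn=85, tn=9880, fp=20)
# A sensitive classifier that flags many more subjects:
sensitive = confusion_metrics(tp=81, fn=19, tn=7980, fp=1920)

for name, (sens, spec, acc) in [("conservative", conservative),
                                ("sensitive", sensitive)]:
    print(f"{name}: sensitivity={sens:.3f} specificity={spec:.3f} accuracy={acc:.3f}")
```

With these illustrative counts, the conservative classifier has the higher accuracy but detects only 15 of the 100 positive cases, which is the pattern reported for the SVM baseline.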
Our model achieved a good performance by incorporating the following three factors. 1) The GNN extracts a good graph embedding. The GIN 29 , a variant of a spatial GNN designed specifically for graph classification, extracts an even better representation from the graph than other GNNs such as graph convolutional networks (GCNs) because GINs are equivalent to generalized convolutional neural networks (CNNs) for non-Euclidean data that can be represented as graph structures, such as brain connectivity 30,31 . 2) An ensemble method using under-sampling and over-sampling (i.e., SMOTE for nominal and continuous features (SMOTE-NC) 32 ) was designed to handle class imbalance issues. 3) Rich information from multi-dimensional scales and subject clinico-demographic information from large multi-centre datasets was used: 7 questionnaires covering domains such as depression, anxiety, resilience, and self-esteem were obtained from n = 31,720 individuals across 4 centres, including universities and hospitals. Jung et al 33 reported that the baseline models showed good performance in predicting SI in the past 12 months in a young population, with approximately 13 times more positive cases than in the current data (12.4% vs. 0.97%, see Table 1). However, it is challenging to predict acute SI within 2 weeks. In the present study, which has severe class imbalance, the SVM without the ensemble method, which is a baseline model, could not extract a good representation of the positive cases, resulting in a much lower sensitivity (~ 15%) than MindWatchNet, while a specificity and accuracy of nearly 99% were achieved. This finding suggests that dealing with class imbalance, such as with the ensemble method, should be considered to prevent prediction bias towards the majority class (i.e., the model always predicts SI-negative). It probably does not matter what kind of model is used, but this analysis is beyond the scope of the current study.
Interestingly, our model can show not only feature importance but also the association among features. Although PHQ_2 and STAI-S are features having the highest saliency value, the former was associated with other items of the PHQ-9 and the latter was associated with resilience and self-esteem (Fig. 2d).
We predicted MaDEs as a pseudo-label before the prediction of acute SI because pre-existing psychiatric disorders such as major depressive disorder (MDD) have been known to increase suicide risk 34 , which would be helpful in accurately predicting acute SI. In MaDE prediction, all the conventional and GIN models achieved AUCs and sensitivities over 90%. This finding suggests that both the PHQ-9 and other scales, including the GAD-7, contributed to predicting the MaDE labels. MaDE pseudo-labels were used as input to predict acute SI. Although the presence of a MaDE is 3.46 times more likely to indicate an individual with SI than its absence (Fig. 2c), its low saliency suggests that it may be indirectly associated with SI via its association with various PHQ-9 items and GAD_7 ("Feeling afraid, as if something awful might happen") (Fig. 2d and Supplementary Fig. 3d). Interestingly, lifetime SA achieved both the highest OR among the binary items (Fig. 2c) and a higher saliency score than MaDE. In addition, MaDE can be accurately predicted with conventional or GIN models. The results suggest that both gathering SA information and predicting MaDE with a model, instead of structured interviews for diagnosis, is an efficient approach for survey-based screening for suicide risk. Moreover, nearly identical attention plots for the training/validation set (Supplementary Fig. 3) and test set (Fig. 2) might suggest that the common "scale and clinico-demographic signature" of acute SI was extracted by using the GIN, which models the relationship between the scale items and clinico-demographic information in graph-structured data.
In the attention plots, the model recognized the salient items among the multi-dimensional questionnaires and other information (Fig. 2). Specifically, when comparing 19 questionnaire items, several PHQ-9 items (e.g., items 2, 4, 5, 6, and 8) and the total RAS and STAI-S scores showed high saliency values. Among these features with high saliency, depressed mood (PHQ_2) and high state and trait anxiety (STAI-S total score) were the two most salient features. The first 2 items of the PHQ-9 provide the two cardinal symptoms of depression, i.e., PHQ_1 (anhedonia) and PHQ_2 (depressed mood or hopelessness) 35 . Depressed mood (PHQ_2) mediates negative life events associated with SI 22 , and its severity is also strongly associated with SI 36 . In the network analysis of depressive symptoms, hopelessness (PHQ_2) was the most central criterion or central node (specifically, the node with the highest betweenness centrality) in the symptom network, showing a strong connection between PHQ_2 and PHQ_9 (suicide), as well as PHQ_2 and PHQ_6 (worthlessness), and a moderate connection between PHQ_2 and PHQ_1 (anhedonia) 37 , which are all salient features for SI in the attention plot (Fig. 2b and Supplementary Fig. 3b). In another network analysis of anxiety and depressive symptoms, the same symptom network was revealed, which represents the connections between PHQ_2 and PHQ_1; PHQ_2 and PHQ_6; and PHQ_2 and PHQ_9, making PHQ_2 a central node 38 . In addition, psychomotor symptoms (PHQ_8, i.e., "moving or speaking so slowly that other people could have noticed or the opposite") were another salient feature for acute SI (Fig. 2b). In a large population-based longitudinal study, anxiety disorders were found to be independent risk factors for suicidal behaviours (i.e., SI and SA), and an increased risk of SA in combination with a mood disorder was found 2 .
In our results, a high STAI-S total score was associated with increased acute SI, which is consistent with previous studies showing that both state and trait anxiety increase suicide risk 39,40 . It has been reported that resilience protects against symptoms of anxiety and depression and strongly influences the associations between symptoms and lifestyle factors 41 , which is also consistent with the findings that low resilience is strongly associated with mild depression and psychological resilience is linked to social support 42 , and might lead to an increased risk of SI compared to non-depressed subjects. Moreover, low resilience was a risk factor for suicidal behaviours 43 . In our study, a high RAS total score was associated with decreased SI, and vice versa, which is also consistent with a previous study showing that high resilience is one of the most protective features for SAs 26,44 .
In the ablation study of PHQ_9, which is related to acute SI, the model without PHQ_9 showed no significant difference in terms of the AUC compared to the model with this item (AUC = 0.869 vs. 0.877, respectively; p = 0.150), which indicates that the model did not "cheat" by predicting acute SI using only PHQ_9. In the validation study of the true labels for acute SI, the model prediction score showed a higher correlation with the KSSI score (i.e., it is a more accurate proxy for acute SI) than PHQ_9 (ρ = 0.719 vs. 0.664, respectively, p = 0.005; see the Validity of the labels for acute SI section in the Results section). Originally, the PHQ-9 was designed for screening depression and to assess severity, not to assess suicide risk 22 .
Interestingly, in a recent validation study, Na et al. 45 showed that PHQ_9 is an insufficient assessment tool for suicide risk and SI because of its limited utility in certain clinico-demographic and clinical subgroups, which is in line with our results. Our results indicate that our model-based predictions resulting from multi-dimensional information are more valid than those from only a single question (i.e., PHQ_9 and acute SI label) and are an alternative to a structured interview or a scale for suicide risk. While PHQ_9 itself may not be a valid measure for SI, our results (Fig. 2b) suggest that intermediate scores (i.e., 2-3 points) for this item should not be overlooked. This strategy should also apply to PHQ_6 (worthlessness) and PHQ_8 (psychomotor symptoms) (Fig. 2b).
It is worth noting that this multi-dimensional scale dataset was collected before the outbreak of COVID-19, and the specific representation of mental illness, including depression and anxiety, evoked by consequences of the COVID-19 pandemic may not be reflected by the scales used in the present study. Further research is needed to explore the effectiveness of MindWatchNet during the COVID-19 pandemic. In addition, the true labels for acute SI may be improved if we obtain the labels for suicidal behaviour from reference standards, such as structured interviews by clinicians for all subjects; however, this process is time consuming, impractical, and requires large amounts of research funding.
This study has several limitations. As prediction of major depressive episodes using a small dataset can lead to overfitting, the benefit of the pseudo-label 46 of MaDE to predict SI should be confirmed in future studies.
The type of institution cannot be generalized to other types of data obtained from workplaces. Although there was relatively low saliency of the type of institution (Fig. 2c) compared to lifetime SA or MaDE, its value for each individual might not be meaningful and must be interpreted carefully. Although beyond the scope of the current study, exploring the impact of edge and sparsity definitions on performance is necessary. To generalize the results of young adults to other populations, further studies of a wide range of ages are needed. Longitudinal cohort studies are needed to investigate factors that can predict future SAs or new SI cases. Verification studies are needed to determine whether predicting SI instead of SAs is effective in preventing SAs in the real world.
In conclusion, we developed and validated a deep-learning-based compensatory tool by using extracted deep features from multi-dimensional self-report questionnaires covering depression, anxiety, resilience, self-esteem, and clinico-demographic information of a large dataset to instantaneously predict suicide risk and monitor responses to suicide prevention strategies, which might be useful in remote clinical practice in the general population of young adults for specific situations such as the COVID-19 pandemic.

GIN as a graph neural network
A GIN, which is a variant of a GNN with representational/discriminative power for graph-structured data equal to that of the Weisfeiler-Lehman (WL) test, one of the most powerful existing tests for distinguishing a broad class of graphs 50 , was developed for graph classification and has achieved state-of-the-art performance 29 .

Dataset
More specifically, for each node, v, graph convolution aggregates the features of neighbouring nodes (i.e., nodes connected by edges). The node features were then summed to make a graph feature of the kth hidden layer, p_G^(k), which is known as sum-pooling. For the graph-level readout, all K graph features from the hidden layers were concatenated to make a final graph feature, p_G (Fig. 1), extracting an excellent graph representation 29 for positive and negative cases of acute SI. Finally, p_G is fed to the final classifier to calculate the sigmoid prediction score of acute SI. The overall model architecture is illustrated in Fig. 1, and the mathematical equations are described in the Supplementary material.

Semi-supervised learning-based input features: pseudo-labels for MaDE
MaDE labels are important information to predict acute SI; however, only a fraction of MaDE labels were available because only a fraction of subjects, 294 individuals in the training/validation set and 64 individuals in the test set, completed the MINI. Following the pseudo-labelling strategy frequently used in semi-supervised learning 46 , we generated pseudo-labels for MaDE via other questionnaires and clinico-demographic information, such as gender and type of institution, using the GIN-MaDE network prior to training the GINs for predicting acute SI. Details are described in the Supplementary method section.
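As a rough illustration of the neighbour aggregation, sum-pooling, and concatenated readout described earlier in this section, the following NumPy sketch may help. It follows the commonly used GIN update, h_v ← MLP((1 + ε)·h_v + Σ_u h_u) over neighbours u, which is an assumption here because the study's exact equations are given only in its Supplementary material. The graph, layer sizes, random weights, and function names are all hypothetical.

```python
import numpy as np


def gin_layer(h, adj, w1, w2, eps=0.0):
    """One GIN-style update: MLP((1 + eps) * h_v + sum of neighbour features)."""
    aggregated = (1.0 + eps) * h + adj @ h        # adj @ h sums neighbour features
    hidden = np.maximum(aggregated @ w1, 0.0)     # two-layer MLP with ReLU
    return np.maximum(hidden @ w2, 0.0)


def gin_graph_feature(h0, adj, layers):
    """Sum-pool node features after every layer, then concatenate the results."""
    pooled = []
    h = h0
    for w1, w2 in layers:
        h = gin_layer(h, adj, w1, w2)
        pooled.append(h.sum(axis=0))              # sum-pooling over all nodes
    return np.concatenate(pooled)                 # graph-level readout


# Hypothetical toy graph: 4 nodes, node 1 connected to nodes 0, 2, and 3.
rng = np.random.default_rng(0)
n_nodes, d_in, d_hidden, n_layers = 4, 3, 5, 2
adj = np.array([[0, 1, 0, 0],
                [1, 0, 1, 1],
                [0, 1, 0, 0],
                [0, 1, 0, 0]], dtype=float)
h0 = rng.normal(size=(n_nodes, d_in))
dims = [d_in] + [d_hidden] * n_layers
layers = [(rng.normal(size=(dims[k], d_hidden)),
           rng.normal(size=(d_hidden, d_hidden)))
          for k in range(n_layers)]

p_G = gin_graph_feature(h0, adj, layers)
print(p_G.shape)
```

Because each layer's node features are summed over nodes, the resulting graph feature does not depend on the order in which the nodes are listed, which is one reason sum-pooling is a natural readout for graph classification.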

Prediction of acute SI: subsampling strategy
To overcome the intrinsic challenge of SI prediction, namely the sparsity of positive cases of acute SI (i.e., the class imbalance problem), we utilized not only data augmentation for balancing the data but also ensembles of models with different subsamplings. First, a GIN model was developed to predict acute SI using the MaDE pseudo-labels as an additional input feature. Most machine learning models built on imbalanced datasets give predictions that are biased towards the majority class (i.e., negative cases); hence, the model tends to predict a case as negative even if it is positive. Specifically, to obtain different decision boundaries to be ensembled, which may largely depend on the subsampled data distribution, we built three different GIN models with different subsampling strategies: 1) GIN-u1 (under-sampling of the majority class with a balance ratio of 10); 2) GIN-u2 (under-sampling of the majority class with a balance ratio of 5); and 3) GIN-SMOTE 32 (over-sampling of the minority class with a balance ratio of 1), where the majority and minority classes have negative and positive SI labels, respectively, and the balance ratio is defined as the ratio of negative to positive cases in the subsampled data from the training set.
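The under-sampling step with a given balance ratio can be sketched as follows. This is a minimal illustration, assuming random under-sampling without replacement; the function name, the seed, and the toy labels are hypothetical, and the SMOTE-NC over-sampling used for GIN-SMOTE is not shown.

```python
import random


def undersample_majority(labels, balance_ratio, seed=0):
    """Keep every positive case and a random subset of negative cases.

    balance_ratio is the number of negatives kept per positive, matching the
    definition of the balance ratio as negatives / positives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg), balance_ratio * len(pos))
    kept_neg = rng.sample(neg, n_keep)            # sampling without replacement
    return sorted(pos + kept_neg)


# Hypothetical toy labels: 10 positives and 990 negatives.
labels = [1] * 10 + [0] * 990
idx_u1 = undersample_majority(labels, balance_ratio=10)   # ratio used by GIN-u1
idx_u2 = undersample_majority(labels, balance_ratio=5)    # ratio used by GIN-u2
print(len(idx_u1), len(idx_u2))
```

Only the training data would be subsampled in this way; the returned indices select the rows to keep.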
For the training and validation sets, datasets from centres 1-3 (SMC, Gachon, and KAIST) were used, and a dataset from centre 4 (SNU) was used for the test set. Note that the test set was never augmented.

Ensemble model
After training each of the three GIN models defined above, the best model for each subsampling strategy was saved at the epoch when the validation loss was minimized: GIN-u1-best, GIN-u2-best, and GIN-SMOTE-best. Next, the final ensemble GIN model was obtained using the three best models. Specifically, the sigmoid prediction scores from the best models were averaged to obtain the final prediction score of the ensemble model, which is a process known as "soft voting" 51,52 .
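Soft voting as described here amounts to averaging the per-subject sigmoid scores of the individual models. A minimal sketch, assuming each model returns one raw output (logit) per subject; the logits below are hypothetical values, not outputs of the trained models.

```python
import math


def sigmoid(x):
    """Map a raw model output to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_vote(logits_per_model):
    """Average the sigmoid scores of several models for each subject."""
    n_models = len(logits_per_model)
    n_subjects = len(logits_per_model[0])
    return [
        sum(sigmoid(model[i]) for model in logits_per_model) / n_models
        for i in range(n_subjects)
    ]


# Hypothetical raw outputs of three models for two subjects.
logits = [
    [2.0, -1.0],   # stand-in for GIN-u1-best
    [0.0, -3.0],   # stand-in for GIN-u2-best
    [-2.0, -2.0],  # stand-in for GIN-SMOTE-best
]
scores = soft_vote(logits)
print(scores)
```

Averaging scores rather than thresholded votes lets a confident model outweigh two uncertain ones.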

Evaluation
For the prediction of MaDE and acute SI, the sensitivity, specificity, and accuracy were calculated for all the models. To evaluate the diagnostic performance of the models, a ROC analysis was performed to obtain the AUC, and DeLong's method was used to compare the AUCs. For the comparison with conventional algorithms, logistic regression with LASSO and an SVM (detailed in the Supplementary material) were used for the prediction of MaDEs and acute SI. All statistical analyses were performed using R version 3.6.1 (R Foundation for Statistical Computing, Vienna, Austria).
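For readers unfamiliar with the AUC, the quantity can be computed directly as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The study's analyses were run in R; the Python sketch below is only an illustration of that definition on hypothetical toy data, and it does not implement DeLong's test for comparing AUCs.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two negatives followed by two positives.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)
```

The double loop is quadratic in the number of cases, which is fine for an illustration; rank-based formulas give the same value more efficiently on large datasets.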

Ablation study for PHQ_9
Because PHQ_9 ("Thoughts that you would be better off dead or of hurting yourself in some way") is related to acute SI, including PHQ_9 as a predictor could be redundant. Moreover, response to PHQ_9 has been reported to be a moderate predictor of a subsequent SA or death 53 . However, in studies for the validation of PHQ_9 using the Structured Clinical Interview for DSM Disorders (SCID) assessment as the reference standard, it had a good sensitivity, specificity, and negative predictive value but a low positive predictive value (PPV) in irritable bowel disease (20.8%) 54 and neurological disorders such as epilepsy (39.1%), migraine (54.5%), multiple sclerosis (41.7%), and stroke (57.1%) 55 . Here, to test the benefit of the inclusion of PHQ_9, the performance of the model without PHQ_9 was also assessed and compared with that of the model including PHQ_9. Specifically, the ROC comparison of the best GIN-based model with and without PHQ_9 was performed using DeLong's method. The saliency plots were also compared with and without PHQ_9 using the best GIN-based model.

Validity of the labels for acute SI: comparison study
Self-report instruments for the assessment of suicidal thinking, such as the Beck Scale for Suicidal Ideation, could be a reliable quantitative reference for acute SI 56-58 . The KSSI 13 is a comprehensive scale to evaluate suicide risk. The KSSI score for the previous 2 weeks was significantly correlated with the Beck Scale for Suicidal Ideation score (Kendall's τ = 0.35, p < 0.001) in our previous study 13 . To investigate the reliability of the model prediction score, we also compared Spearman's correlation coefficients between the KSSI total score and the prediction score or PHQ_9.
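Spearman's correlation coefficient is the Pearson correlation of the ranks of the two variables, with tied values sharing their average rank. The sketch below illustrates this on hypothetical values (the prediction scores and scale totals are made up for the example) and does not include the significance test used to compare two coefficients.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical raw prediction scores and scale totals for five subjects.
pred = [-2.1, -0.3, 0.4, 1.8, 2.5]
kssi = [0, 3, 2, 9, 14]
rho = spearman_rho(pred, kssi)
print(round(rho, 3))
```

Because only ranks enter the calculation, the coefficient is the same whether the raw model output or its sigmoid transform is used.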

Attention plots and interpretation
To interpret what the ensemble model "thinks" is important for the prediction of acute SI, we calculated the saliency/attention values, which are defined as the gradient of the model output with respect to the input.
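A gradient-based saliency value can be illustrated on a toy model. The sketch below assumes saliency is the partial derivative of the model output with respect to each input feature, and estimates it by central finite differences rather than by backpropagation through a network. The logistic model, its weights, and the input vector are hypothetical stand-ins, not the ensemble GIN.

```python
import math


def model_output(x, weights, bias):
    """Toy stand-in for the network: a logistic model returning a sigmoid score."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def saliency(f, x, eps=1e-5):
    """Central-difference estimate of d f(x) / d x_i for every input feature."""
    grads = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


weights, bias = [1.5, -0.8, 0.0], -0.2   # hypothetical parameters
x = [1.0, 2.0, 3.0]                      # hypothetical input features
grads = saliency(lambda v: model_output(v, weights, bias), x)
print(grads)
```

A positive gradient marks a feature whose increase raises the prediction score, a negative gradient marks one whose increase lowers it, and a gradient near zero marks a feature the model ignores at this input.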

Data availability
Due to potentially identifying information, the data that support the findings of this study are not publicly available, but can be obtained on reasonable request to the corresponding authors and with the permission of the Institutional Review Board.