Analysis and evaluation of explainable artificial intelligence on suicide risk assessment

This study explores the effectiveness of Explainable Artificial Intelligence (XAI) for predicting suicide risk from medical tabular data. Given the common challenge of limited datasets in health-related Machine Learning (ML) applications, we use data augmentation in tandem with ML to enhance the identification of individuals at high risk of suicide. We use SHapley Additive exPlanations (SHAP) for XAI and traditional correlation analysis to rank feature importance, pinpointing the primary factors influencing suicide risk and informing preventive measures. Experimental results show that the Random Forest (RF) model excels in accuracy, F1 score, and AUC (>97% across metrics). According to SHAP, anger issues, depression, and social isolation emerge as the top predictors of suicide risk, while individuals with high incomes, esteemed professions, and higher education present the lowest risk. Our findings underscore the effectiveness of ML and XAI in suicide risk assessment, offering valuable insights for psychiatrists and facilitating informed clinical decisions.


Introduction
Suicide accounted for 1.3% of global deaths and was the 17th leading cause of death in 2019. With more than 700,000 people dying by suicide yearly, 77% of suicidal behaviours occurred in developing countries 1 . An American survey also shows that 93% of adults believe that suicides can be delayed or prevented if psychiatrists intervene effectively and immediately. According to the World Health Organisation's statistics, suicidal adults may have attempted suicide 20 times before their death 2 . According to a report published by the Centers for Disease Control and Prevention, middle-aged white men have the highest suicide risk in America 2 , and suicide was the leading cause of death among Australian teenagers aged 15 to 24 in 2019 3 . Current tools and solutions for suicide prevention mostly rely either on self-reported measures, such as questionnaires and interviews, which can be subjective, or on multimodal data 4,5 , which is not easy to collect. Furthermore, traditional clinical risk assessment tools have been shown to be insufficiently accurate in identifying moderate- and high-risk patients 6 . Two recent systematic reviews 7,8 have evaluated various scales designed to predict suicide risk but found overall low Positive Predictive Value (PPV). Hence, there is a critical need to develop technologies and models that can assist psychiatrists and mental health professionals to accurately stratify risk, enable precision medicine, and allocate resources. Over the past decade, researchers have proposed various Machine Learning (ML) solutions and frameworks to enhance the performance of suicide prediction; however, since these are primarily "black box" units and not interpretable, it is challenging to use them in clinical treatments. The objectives of this study are threefold. First, we review the related works to summarise the existing ML models used for suicide prediction. Second, we select and integrate suitable ML algorithms and use data augmentation methods to assess the feasibility of ML
models for suicide prediction. Finally, we determine which variables contribute the most, using an Explainable Artificial Intelligence (XAI) framework to determine feature importance and visualize the underlying logic behind the predictions. ML is a branch of computer science that uses historical data to make predictions about future trends by building, testing, and improving models. In recent years, rapid growth and progress in the field of computer science, including ML, Computer Vision, Artificial Intelligence (AI), and Natural Language Processing (NLP), has led to the development of new tools and techniques to predict the risk of physical and psychological illnesses 9 . For instance, these technologies have been implemented to predict the possibility of heart attacks 10 , liver diseases 11,12 , alcohol disorders 13 , human emotion disorders 14,15 , depression 16 , etc. In the past decade, studies have also shown that ML can be effective in suicide risk prediction 17,18 . In recent years, numerous research studies have used ML techniques to predict suicide. For example, 19 integrated a C-Attention Network architecture with multiple ML models to identify individuals at risk for suicide. The three-stage suicide theory and prior work on emotions were also introduced to examine suicidal thoughts. In the sub-task of predicting suicide attempts within 30 days, traditional ML models outperformed the baseline, with an F1 score of 0.741 and an F2 score of 0.833 (a higher F-score indicates better performance 20 ). Moreover, when predicting suicide within a six-month period, the C-Attention method also outperformed the baseline, achieving an F1 score of 0.737 and an F2 score of 0.833. Other research has also utilized smartphone applications to gather data on outpatients' therapy and applied NLP techniques to assess patients' suicide risk levels 21 . The results showed that the Support
Vector Machine (SVM) and Logistic Regression produced satisfactory classification scores, while the extreme gradient model achieved the highest AUC value (0.78). The authors in 21 highlighted the importance of using XAI tools to address the lack of explainability in traditional ML models, as it is crucial for psychiatrists to trust and rely on ML models. Similarly, in 22 , the authors compared the performance of four traditional models, namely logistic regression, Lasso, Ridge, and Random Forest, using the epidemiological Early Developmental Stages of Psychopathology (EDSP) dataset. After conducting nested 10-fold cross-validation, they found that these models performed almost the same, with mean AUC values ranging from 0.824 to 0.829. Furthermore, the RF model achieved the highest PPV of 87%, which was significantly better than the other models. In suicide prediction research, various types of surveys, questionnaires, and scales have been used. For instance, in 23 the Korea National Health & Nutrition Examination Survey (KNHANES) and the Synthetic Minority Over-sampling TEchnique (SMOTE) were used to select patients with suicidal thoughts and to construct the dataset by resampling. After pre-processing, a Random Forest (RF) algorithm was applied, and the experimental results verified the feasibility of such techniques on the general population. The RF model achieved an AUC of 0.947 and an accuracy of 88.9%. Notably, the feature selection process identified days of feeling sick or in discomfort, daily smoking amount, and household composition as the most significant features contributing to the prediction. Traditional mathematical techniques produced less accurate results due to the complexity of input/output relationships in human behaviours. In 24 the authors used the Patient Health Questionnaire-9 (PHQ-9) to collect data from college students and used the Mini-International Neuropsychiatric Interview suicidality module to evaluate their suicidal ideation. They applied ML
models, including K-Nearest Neighbours (KNN), Linear Discriminant Analysis (LDA), and RF. Their results showed that the RF model had the best performance, with an AUC value of 0.841 and an accuracy of 94.3%. The positive and negative predictive values of the RF were also noteworthy, at 84.95% and 95.54%, respectively. RF models were also used in other research studies, such as in 25 , to predict suicide attempts on a self-report dataset collected from 4,882 Chinese medical students. The dataset included clinical features from multiple psychiatric scales, including the Self-rating Anxiety Scale (SAS), the Self-rating Depression Scale (SDS), the Epworth Sleepiness Scale (ESS), the Self-Esteem Scale (SES), and the Chinese version of the Connor-Davidson Resilience Scale (CD-RISC). After applying five-fold cross-validation, the experimental results showed that the RF model achieved significant performance, with an AUC value of 0.925 and an accuracy of 90.1% in suicide prediction. This study also made several noteworthy discoveries. For instance, it found that relationships with parents were among the top five predictors of college students' suicide risk, and participants receiving low care from their fathers were associated with a higher risk of suicide. ML algorithms have demonstrated potential in analyzing datasets from psychometric scales, such as the Suicide Crisis Inventory (SCI) and the Columbia Suicide Severity Rating Scale (CSSRS) 26 . In order to improve model performance, the researchers employed Gradient Boosting (GB) techniques to minimize prediction error and used SMOTE to generate artificial/synthetic data points. Their experimental results revealed that RF and GB algorithms performed the best, with precision values of 98.0% and 94%, respectively, for detecting short-term suicidal behaviours. An artificial neural network classifier with 31 psychiatric scales and 10 sociodemographic elements was proposed to predict suicide and assess the performance of ML models
as well as identify the most significant variables 27 . The classifier's accuracy for predicting suicide within one month, one year, and the whole lifetime was 93.7%, 90.8%, and 87.4%, respectively. In terms of AUC, the highest was for one-month detection (0.93), followed by lifetime prediction (0.89) and one-year prediction (0.87). In their study, the Emotion Regulation Questionnaire (ERQ) had the highest impact, followed by the Anger Rumination Scale (ARS) and the Satisfaction With Life Scale (SWLS) 27 . All the studies mentioned above applied standard machine learning techniques to predict suicide risk using their own private, imbalanced datasets with a small number of records. Additionally, conventional correlation analysis was commonly used to score the importance of each variable. In contrast, in this current work, we not only use conventional tools, but also employ data augmentation and state-of-the-art AI frameworks to effectively analyze and interpret data.
Table 1. Model performances (in %) in predicting suicidal behaviours. We observe that DT and RF perform the best in identifying patients at high risk of suicidal acts.

Data Visualization
The word cloud, a popular method in the NLP area, provides an intuitive illustration of word frequency, allowing readers to see which words appear most frequently in the dataset. In the word cloud illustrated in Fig. 1, we observe that among the reasons for death there are some high-frequency words, including "physical disabilities," "mental disorders," "chronic diseases," and "family disputes." We can assume that many suicidal patients also suffer from these physical and psychiatric disorders. Figure 2 visualizes the count of suicidal and non-suicidal patients in each occupation. Unemployment is the leading correlate of suicidal behaviours, while agriculture and forest-related workers have a higher risk of suicide. An interesting discovery is that few police officers die by suicide, and the suicide rate of administrative managers is relatively low among occupations. The conclusion of our visualization is similar to the results of previous studies 28 , which suggest that patients with a good income and people who gain respect from their occupations are less prone to suicide.
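A word cloud is simply a rendering of token frequencies. As a minimal sketch of the counting step behind Fig. 1 (the two example strings below are illustrative, not records from the actual dataset):

```python
from collections import Counter
import re

def term_frequencies(records):
    """Count how often each term appears across free-text records.

    `records` is a list of strings (e.g. a free-text reason-of-death field).
    This sketch counts single lowercased words only; multi-word phrases
    such as "mental disorders" would need an extra n-gram pass.
    """
    counts = Counter()
    for text in records:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return counts

# Invented example records for illustration:
records = [
    "mental disorders and chronic diseases",
    "chronic diseases after family disputes",
]
freqs = term_frequencies(records)
print(freqs.most_common(2))  # 'chronic' and 'diseases' each appear twice
```

A word-cloud library then maps each count to a font size; the counting itself is all that carries information.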

ML Performance
This section reports the performance of our ML algorithms. We implemented Random Forest (RF), Decision Tree (DT), Logistic Regression (LR), Support Vector Machine (SVM), linear Support Vector Classification (SVC), Perceptron, and eXtreme Gradient Boosting (XGBoost) to predict the risk of suicide based on patients' records. To avoid the influence of chance on the experimental results, we tried different sample sizes, train-test splitting percentages, and random seeds, then collected performance statistics and calculated the average performance. According to Table 1, DT achieves the highest accuracy, precision, and F1 score, at 95.23%, 96.98%, and 95%, respectively, while RF has the best recall at 93.28%. To enhance the credibility of our models and understand the underlying reasons for their high performance, this research introduces a correlation matrix and an XAI model to further investigate the reasons behind these performances.
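For reference, the reported metrics follow the standard confusion-matrix definitions. A minimal sketch of computing them and averaging over repeated runs with different seeds/splits (the toy label vectors are illustrative only, not our experimental data):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = at risk)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Averaging over repeated runs (different seeds/splits) smooths out chance.
# Each pair is (true labels, predicted labels) from one hypothetical run:
runs = [
    ([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]),
    ([1, 0, 0, 1, 1], [1, 0, 1, 1, 1]),
]
mean_f1 = sum(binary_metrics(t, p)[3] for t, p in runs) / len(runs)
```

In practice the same averaging is applied to every metric column reported in Table 1 (scikit-learn's `precision_recall_fscore_support` computes the per-run values directly).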

Correlation Analysis
In this section, we use the correlation function in the Seaborn library and its heat-map function to analyze the correlation among attributes in the dataset. Considering that most variables in our dataset are categorical and non-continuous, we use the Spearman correlation to perform the analysis. Figure 3 illustrates the correlation between each pair of variables.
According to the colour bar on the right-hand side, when the correlation between two variables is closer to 1, it is coded with a dark red colour, indicating a significant positive correlation. A red area in the bottom right corner of the figure indicates that these variables are highly related. The heat-map shows a strong correlation between suicide and anger problems, sleep problems, social isolation, depression problems, and humiliating experiences. Moreover, the light red area in the centre demonstrates a moderate correlation between a patient's suicide risk and past suicide attempts, suicidal thoughts, self-injuries, and psychiatric disorders. The above analysis shows that every single variable, most of which measure mental issues, can contribute considerably to the model prediction, and that the model becomes more powerful when all of these variables are combined for prediction.
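Under the hood (e.g. pandas' `df.corr(method='spearman')`, which Seaborn's heatmap then colours), Spearman's rho is the Pearson correlation of the rank-transformed data, which is why it suits ordinal and categorical-coded variables. The pure-Python version below is an illustrative sketch, not the code used in this study:

```python
def _ranks(values):
    """1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank-transformed data."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

rho = spearman([1, 2, 2, 4], [2, 3, 3, 9])  # → 1.0 (monotone, ties handled)
```

Computing `spearman` for every pair of columns yields exactly the matrix that Figure 3 visualizes.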

Analysis by Explainable AI
With the growing need to understand the underlying logic of ML models, studies have introduced the XAI framework to analyze the contribution of variables to model predictions. The generalization of SHapley Additive exPlanations (SHAP) and Local Interpretable Model-agnostic Explanations (LIME) methods extends the use of XAI in the ML domain. The Python package XGBoost provides library functions to calculate the importance of the features that contribute to the final model. Figure 4 demonstrates the features' importance in predicting suicide. It is evident that anger problem is the dominant variable correlated with suicidal behaviours. Mental health issues, including depression problems, social isolation, sleeping problems, and humiliating experiences, come next and need psychiatrists' attention. Meanwhile, past suicide attempts and suicidal ideation are important factors for a patient who dies by suicide. Some nonlinear models, such as XGBoost, have significantly stronger prediction accuracy. However, their characteristics also make their interpretability inferior to that of linear models, which impedes their adoption in practical clinical diagnosis. The Shapley value is a calculation method for fair distribution in cooperative game theory 29 . SHAP is an additive interpretation model based on the Shapley value. The model produces a prediction for each sample, and the SHAP value is the value assigned to each feature in the sample 30 . Suppose the i-th sample is x_i, the j-th feature of the i-th sample is x_{i,j}, the model's prediction for the i-th sample is y_i, and the baseline of the entire model (usually the mean value of the target variable over all samples) is y_base. Then the SHAP values satisfy y_i = y_base + f(x_{i,1}) + f(x_{i,2}) + … + f(x_{i,k}), where f(x_{i,j}) is the SHAP value of x_{i,j} and k is the total number of variables. Intuitively, f(x_{i,j}) is the amount that feature j contributes to forming y_i. When f(x_{i,j}) > 0, the feature increases the predicted value and has a positive effect. In contrast, negative
values mean that these features decrease the predicted value. Compared to the traditional feature importance method, the advantage of SHAP is that it can reflect the importance of variable values in each individual sample, and it also shows the positive and negative contributions of variables. This helps assign a contribution share to each variable, where large positive values and large negative values show strong direct and inverse correlation with the predicted risk, revealing which variables act as drivers of suicide and which variables protect against it. On the other hand, SHAP values around zero show that the variable is irrelevant to the output.
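The additive property can be checked on a toy model. The sketch below computes exact Shapley values by enumerating coalitions; the linear risk score, feature vector, and baseline are invented for illustration (SHAP's `TreeExplainer` uses a far more efficient algorithm for tree models, and this brute-force form is exponential in the number of features):

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample of a toy model.

    `predict` takes a full feature vector; "absent" features are replaced
    by their baseline value (a simplification of how SHAP marginalizes
    features out). For illustration only.
    """
    k = len(x)
    phi = [0.0] * k
    for j in range(k):
        others = [i for i in range(k) if i != j]
        for size in range(k):
            for coalition in combinations(others, size):
                # Classic Shapley weight |S|! (k-|S|-1)! / k!
                weight = factorial(size) * factorial(k - size - 1) / factorial(k)
                with_j = [x[i] if i in coalition or i == j else baseline[i]
                          for i in range(k)]
                without_j = [x[i] if i in coalition else baseline[i]
                             for i in range(k)]
                phi[j] += weight * (predict(with_j) - predict(without_j))
    return phi

# Invented toy risk score: anger and isolation raise risk, income lowers it.
predict = lambda v: 2.0 * v[0] + 1.5 * v[1] - 1.0 * v[2]
x, base = [1.0, 1.0, 0.0], [0.2, 0.3, 0.5]
phi = shapley_values(predict, x, base)
# Additivity: y_i = y_base + sum of SHAP values
assert abs(predict(x) - (predict(base) + sum(phi))) < 1e-9
```

For this linear toy model each phi equals the coefficient times the deviation from baseline, which makes the additive decomposition easy to verify by hand.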
In Table 2, we selected a random sample from the dataset and calculated its SHAP values. Figure 5 visualizes Table 2. Features in red indicate a positive contribution to the final decision, while features in blue indicate a negative contribution. For this sample, our XGBoost model predicts that the patient is at risk of suicide based on his age (56 years), past self-injury experiences, and social isolation problems (the most significant positive factors for this patient). Factors that prevent the XGBoost model from identifying this patient as having suicidal potential include: not having an anger problem or a depression problem, being a Christian (the most important negative factor for this patient), being a widow, and being a clerical worker. The patient was predicted to be at risk of suicide because the positive factors outweighed the negative ones. Note that higher positive values in the output indicate a higher risk of suicide.
The SHAP method also provides interfaces to visualize the overall feature contributions. Figure 6 (a) illustrates the overall SHAP values of the features in our dataset. Each patient is represented by a point. Red points indicate larger feature values, while blue points indicate lower feature values. It is noteworthy that for features such as past suicide behaviours and self-injury behaviours, when the values of these features are low, indicating that patients have few related experiences, these variables do not negatively impact the predicted value. However, when the values of these features are high, indicating that these samples include suicide attempts or self-injuries, these two features contribute significantly to a positive prediction.
Figure 6 (b) shows the feature importance as calculated by the SHAP package. Although there are some differences compared to Figure 4, the top three variables, namely anger problem, depression problem, and social isolation, remain the same. According to the importance ranking provided by SHAP, psychiatric hospitalization, occupation, and sleeping problem are also crucial features in predicting suicide. To further analyze the impact of different feature values, partial dependence plots from SHAP were used. For example, in Figure 7, each point represents one sample with a corresponding attribute value. It is observed that for most education levels the distributions are closer to zero and tend to be symmetric, indicating that these levels do not contribute significantly to the final result. Figure 7 reveals that feature contributions are more pronounced for patients with education level zero (grades one to seven) and level six (university degree or above). It can be observed that for most patients with education level zero, the SHAP values are positive, indicating a higher risk of suicide, while all patients with education level six have negative SHAP values, indicating a relatively low risk. This suggests that patients with lower levels of education have higher suicide risks, while those with university degrees have lower risks.
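A dependence plot of a categorical feature such as education level can be summarised numerically by averaging the SHAP values per level. A small illustrative sketch (the education codes follow the levels described above, but the SHAP values are invented, not taken from our model):

```python
from collections import defaultdict

def mean_shap_by_level(levels, shap_values):
    """Average SHAP value per categorical level (a dependence-plot summary)."""
    buckets = defaultdict(list)
    for level, s in zip(levels, shap_values):
        buckets[level].append(s)
    return {level: sum(v) / len(v) for level, v in buckets.items()}

# Hypothetical per-patient values: level 0 = lowest education, 6 = university+
education = [0, 0, 3, 3, 6, 6]
shap_edu = [0.40, 0.30, 0.05, -0.02, -0.25, -0.35]
means = mean_shap_by_level(education, shap_edu)
# A positive mean at level 0 and a negative mean at level 6 would mirror
# the pattern seen in Figure 7.
```

This reduces each vertical strip of the dependence plot to a single number, making the education-risk pattern easy to tabulate.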

Clinical Implications and Future Directions
Suicide prediction is difficult, and traditional self-report-based actuarial risk assessment tools have been found to have limitations in predicting suicide. The other commonly used methods are clinical judgment and structured professional judgment. Clinical judgment alone has been found to have a sensitivity of < 25% 31 . Most clinicians use structured professional judgment to determine risk. However, there is a need for tools or systems to validate the decision-making in suicide risk prediction. Carter et al. 7 undertook a meta-analysis of three types of instruments used to predict suicide death or self-harm: psychological scales, biological tests, and "third-generation" scales derived from statistical models. This review concluded that no instrument was sufficiently accurate to determine intervention. As in other areas of medicine, risk stratification is essential for accurate and precise treatment. The current paper has presented a methodology for improving suicide risk prediction using ML algorithms, which will hopefully increase the confidence of mental health professionals in using ML algorithms in conjunction with clinical risk assessment to improve suicide risk prediction and intervention, and ultimately help reverse the trend of increasing suicides worldwide. The next step of this study will be to develop a risk assessment interface that uses the identified factors and ML algorithms to provide clinicians with a predicted suicide risk for individual patients. This objective risk determination will enhance and refine clinical decision-making and further train the developed models. In future studies, it would be beneficial to investigate other modalities, such as speech, images, and video, as the current ML methods are trained only on text or tabular data.

Discussion
This section justifies the excellent performance of the ML algorithms by providing relevant academic evidence to support our experimental results. Firstly, the typical medical diagnosis dataset includes both numerical variables (such as age, past suicide attempts, and blood pressure) and categorical variables (such as gender, marital status, alcohol consumption, sleeping problems, and humiliating experiences). This type of data format is ideal for tree-based algorithms. DTs are the most basic tree structure, classifying each record by evaluating attributes at each node. RF is an advanced version of Decision Trees, which uses the Bagging technique to combine the results of multiple trees, resulting in improved predictive accuracy. It is observed that the DT model consistently exhibits the best performance when using ML for medical predictions.
The superior performance of RF algorithms in the healthcare domain has been well documented in several studies [32][33][34][35] . Research has found that RF models significantly outperform other algorithms in predicting chronic stress and cardiovascular disease risk, with higher accuracy even when using fewer feature variables. The outstanding performance of RF models improves the reliability of diagnosis and reduces the number of tests required for patients in hospitals. The results and observations made in this paper align with existing knowledge in psychiatry and provide a data-driven perspective to support it. The 50K synthetic-plus-real records analyzed in this paper make the results general and reliable. The most important variables identified in this study can serve as a foundation for future research in the field.
The current paper aims to evaluate the performance of Machine Learning (ML) algorithms in predicting suicide and to improve the interpretability of ML models by using XAI models. To achieve these objectives, the paper implements the entire process of using ML algorithms with XAI to predict suicide. Firstly, we conducted a literature review to summarise state-of-the-art suicide datasets, psychometric questionnaires, ML models, and model evaluation parameters. Secondly, to prevent under-fitting when building the models, CTGAN and Scikit-learn were used to generate an artificial dataset. The CTGAN method offers many powerful functions for data augmentation, but in terms of the distribution of feature values, the dataset generated by the Scikit-learn method more closely resembles the distribution of the original dataset. In this paper, seven models were built, and repeated experiments were conducted to evaluate their performance. The Decision Tree (DT), Random Forest (RF), and XGBoost models all showed excellent performance among the seven models. Correlation analysis revealed that mental health problems are strongly related to suicidal behaviours, which is consistent with existing research findings. Additionally, the XAI framework was applied to identify the dominant and key factors associated with suicide, which included anger problems, depression problems, social isolation, psychiatric hospitalization, and patients' occupation.
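Although our artificial dataset was generated with CTGAN and Scikit-learn, the interpolation idea behind SMOTE-style oversampling (used in the related work discussed earlier) can be sketched in a few lines. This is a simplified illustration, not the augmentation pipeline used in this study: real SMOTE interpolates toward k nearest neighbours, and CTGAN instead learns a full generative model of the table.

```python
import random

def smote_like(minority, n_new, seed=0):
    """Generate synthetic minority-class samples by interpolating pairs.

    Each synthetic point lies on the segment between two real minority
    samples, so it stays inside the region the minority class occupies.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # pick two distinct real samples
        t = rng.random()                 # interpolation factor in [0, 1)
        synthetic.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

# Invented 2-feature minority samples (e.g. two scaled questionnaire scores):
minority = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=4)
# Every synthetic coordinate stays within the range spanned by its pair.
```

Because the synthetic points are convex combinations of real ones, class balance improves without fabricating values outside the observed feature ranges.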

Figure 1 .
Figure 1. The word cloud of death reasons in the dataset. As shown, according to the investigated data, physical and psychiatric disorders such as "physical disabilities," "mental disorders," "chronic diseases," and "family disputes" are the most common reasons for death.

Figure 2 .
Figure 2. Counts of suicidal and non-suicidal patients across different occupations. The unemployed and agriculture and forest-related workers are among the high-risk patients; on the other hand, police officers and security personnel have the minimum risk.

Figure 3 .
Figure 3. The correlation matrix of suicide-related variables. Results show a strong correlation between suicidal acts and anger problems, sleep problems, social isolation, depression problems, humiliating experiences, past suicide attempts, suicidal thoughts, self-injuries, and psychiatric disorders.

Figure 4 .
Figure 4. Traditional feature importance analysis provided by the XGBoost prediction model. Similar to previous results, anger problem is the most important variable in suicide risk prediction.

Figure 5 .
Figure 5. Variables with positive and negative contributions using SHAP analysis for a random sample. For this particular sample, the model predicts that this person is at risk of suicide because of his/her age, past self-injury experiences, and social isolation problems.

Figure 6 .
Figure 6. (a) Overall SHAP values in the dataset. For each variable and sample, the contribution is shown by the SHAP value; a higher distinction between red and blue points shows higher importance in risk prediction. (b) Feature importance ranking of the SHAP analysis. The top three variables are the same as in Figure 4.

Table 1
lists the performance of each model under different evaluation indicators. According to these statistics, DT, RF, and XGBoost perform excellently among the seven ML models, with AUC values higher than 0.94. In terms of the other evaluation metrics, DT achieves the highest accuracy, precision, and F1 score (95.23%, 96.98%, and 95%, respectively), while RF has the best recall (93.28%).

Table 2 .
SHAP value of a single sample.