“Is there no way out of the mind?”

-Sylvia Plath

“The person in whom Its invisible agony reaches a certain unendurable level will kill herself the same way a trapped person will eventually jump from the window of a burning high-rise. Make no mistake about people who leap from burning windows. Their terror of falling from a great height is still just as great as it would be for you or me standing speculatively at the same window just checking out the view; i.e., the fear of falling remains a constant. The variable here is the other terror, the fire’s flames: when the flames get close enough, falling to death becomes the slightly less terrible of two terrors.

It’s not desiring the fall; it’s terror of the flames.”

- David Foster Wallace

Introduction

The prediction of suicide has been a challenge for decades, and to date, a method for anticipating individual suicides or stratifying patients according to suicide risk is still lacking [1]. Suicide is a worldwide phenomenon and ranks as the second most frequent cause of premature mortality in individuals between 15 and 29 years (preceded only by traffic accidents), and as the third in the age group 15–44 years [2].

Alarmingly, recent studies suggest that the detection of risk factors and the implementation of interventions are inadequate [3]. The majority of individuals who have attempted suicide are reported to consult with physicians prior to the attempt, suggesting that an opportunity to intervene might exist in these help-seeking subjects. The difficulty in predicting suicidal behaviors stems from the lack of clear psychiatric biomarkers and the poor predictive power of individual risk factors [4]. Suicidal behaviors, like many other psychiatric phenomena, are likely the result of a complex relationship between several environmental and trait variables that interact to modify the actual risk [4, 5]. Well-recognized risk factors for suicide encompass mental disorders, previous suicide attempts, early trauma, negative life events, and vulnerable periods, with important differences between sexes in terms of ideation and lethality [6, 7]. However, traditional suicide risk factors have only limited clinical predictive value and show relatively poor clinical utility in predicting suicide occurrence [8, 9], even in high-risk populations, such as depressed patients [10].

That is, to date, a method for anticipating suicides or stratifying patients according to risk for suicidal behaviors remains elusive, and no biomarkers have yet been established [9, 11].

Over the last decades, machine learning (ML) techniques have emerged as a potential new tool to improve the management of complex problems in psychiatry [12]. This form of multimodal learning has been shown to improve prognostic/predictive performance in various fields of medicine, e.g., cardiology and neurology [13, 14]. ML can be used to process high-dimensional sets of variables and determine the optimal model for classification. Importantly, such techniques allow predictions at the individual level, and therefore represent a promising tool to accurately characterize the complex nature of suicidal behavior.

In the last few years, several algorithms and procedures have been used to predict suicidal behaviors in different populations [11, 15,16,17]. Given that suicide is considered a transdiagnostic feature, a number of studies have been conducted in the general population, sometimes with very large and heterogeneous samples [6, 18]. One of the most solid findings emerging from studies focusing on the general population is that a formal psychiatric diagnosis is a strong predictor of suicidal risk in different samples across countries [1, 6, 18, 19]. This is not surprising, as up to 90% of all suicides occur in psychiatric populations [1, 20,21,22], with mood disorders being considered the leading cause of suicidality among mental disorders [23, 24].

Therefore, the inclusion of both healthy individuals and psychiatric patients into large sample ML studies may prevent the identification of more subtle risk factors specific to distinct psychiatric disorders by merely taking into account a previous psychiatric diagnosis as the driving factor for the analysis. Instead, by targeting vulnerable populations only, ML could uncover predictors of suicidal behaviors specific to distinct disorders and help in better stratifying patients according to the actual risk. This would translate into useful information that can be more easily applied in clinical and forensic settings [25].

In this context, we provide a systematic review of the results from ML studies in psychiatric clinical populations and discuss crucial issues in the ML literature, including the algorithms, features, and samples employed, with the aim of providing meaningful considerations for future research in the field of suicide prevention.

Material and method

The current systematic review followed the preferred reporting items for systematic reviews and meta-analyses (PRISMA) guidelines [26].

Search strategy

A systematic literature search was performed for articles published from inception through November 17, 2022 on PubMed, EMBASE, and Scopus, using the following search terms adapted for each database:

(suicid* AND (machine learning OR support vector machine OR deep learning OR neural network OR random forest OR xboost OR gradient boosting OR regression tree OR elastic net) AND (psychiatr* OR schizophren* OR depress* OR obsessive OR bipolar OR mania OR manic OR anxiety OR borderline OR personality))
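The three-part structure of this query (outcome AND method AND population) can be sketched programmatically. The snippet below is only an illustration: the helper names `or_group` and `build_query` are hypothetical, the term lists are copied from the string above as written, and the database-specific adaptations mentioned in the text are not reproduced.

```python
# Hypothetical sketch: assemble a generic Boolean query from three term groups.
# The terms are copied from the search string in the text; field tags and other
# database-specific syntax (PubMed, EMBASE, Scopus) are deliberately left out.

OUTCOME_TERMS = ["suicid*"]
METHOD_TERMS = [
    "machine learning", "support vector machine", "deep learning",
    "neural network", "random forest", "xboost", "gradient boosting",
    "regression tree", "elastic net",
]
POPULATION_TERMS = [
    "psychiatr*", "schizophren*", "depress*", "obsessive", "bipolar",
    "mania", "manic", "anxiety", "borderline", "personality",
]


def or_group(terms):
    """Join terms with OR; parenthesize only when there is more than one term."""
    if len(terms) == 1:
        return terms[0]
    return "(" + " OR ".join(terms) + ")"


def build_query(*groups):
    """Combine the OR-groups with AND and wrap the whole query in parentheses."""
    return "(" + " AND ".join(or_group(g) for g in groups) + ")"


query = build_query(OUTCOME_TERMS, METHOD_TERMS, POPULATION_TERMS)
print(query)
```

Keeping the term groups as separate lists makes it straightforward to add or drop a term and regenerate the query consistently for each database.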

Database searches were supplemented by hand-search, which encompassed an extensive search through the reference list of included papers, previous reviews, and the “Similar Articles” sections in PubMed (reported in Fig. 1 as “Other sources”).

Fig. 1: PRISMA flowchart of the study selection.

Flowchart summary of the study selection process (adapted from PRISMA guidelines; Page et al., 2021).

Two authors (A.P. and G.D.) independently performed the literature search. Documents were assessed according to the following inclusion criteria: (1) journal article available in English; (2) original investigation; (3) employment of ML methodology; (4) evaluation of a suicide risk outcome or self-harm; (5) evaluation of a psychiatric population. We also included studies if (a) the sample was composed of individuals with a confirmed psychiatric diagnosis, irrespective of the specific diagnosis and disease severity, and (b) the study used multiple psychiatric diagnoses or a transdiagnostic framework. The absence of a control group of healthy individuals was not considered an exclusion criterion. To be included, studies must have used ML as a primary or secondary analysis method to predict suicide attempts or suicide risk, or to stratify patients according to risk. No age restriction was applied. Disagreements that emerged during screening were resolved by discussion between the two authors (A.P. and G.D.) and a third party (P.B.).

Exclusion criteria were the following: (1) non-original investigations (reviews, expert opinions, meta-analyses); (2) article not in English; (3) employment of a methodology other than ML (logistic regression was excluded, except when it was compared to other ML approaches); (4) evaluation of outcomes other than suicide; (5) exclusive evaluation of non-psychiatric populations (e.g., general population, neurologic patients, high-risk populations, emergency department patients). Given that suicidal behaviors are reported across all ages, age-related variables were not considered an exclusion criterion.

We also excluded studies in which the sample was composed of “suicide attempters” without further differentiation in terms of the presence or absence of psychiatric diagnoses. A PRISMA flowchart (Fig. 1) (Page et al., 2021) was created to graphically depict the inclusion/exclusion of studies.

Data extracted

A preliminary data extraction form was designed by A.P.; it was then pilot-tested on five randomly selected studies and fine-tuned accordingly. The search was rerun on a weekly basis, and data from the newly included studies were added to the database accordingly.

For each article, the following variables were extracted:

  • General information (author, year of publication).

  • Sample characteristics (demographics, sample size, clinical data).

  • Type of ML algorithm(s) employed.

  • Number and characteristics of features employed for prediction.

  • ML performance metrics (AUC, Accuracy, Sensitivity, Specificity).

  • Number of psychiatric diagnoses assessed.

  • Type of psychiatric disorders assessed.

  • Findings regarding the prediction of suicide or the classification of risk.
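For reference, the performance metrics listed above can be computed as in the following minimal sketch. The labels and scores are made-up toy values, the 0.5 decision threshold is an assumption, and the function names are illustrative; none of this is taken from the reviewed studies.

```python
# Minimal sketch of the extracted metrics, using toy data (not study data).
# Labels: 1 = event of interest (e.g., suicide attempt), 0 = no event.

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def auc(y_true, scores):
    """AUC as the probability that a random positive case receives a higher
    score than a random negative case (ties count as one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)   # proportion of true events detected
specificity = tn / (tn + fp)   # proportion of non-events correctly ruled out
area_under_curve = auc(y_true, scores)

print(accuracy, sensitivity, specificity, area_under_curve)
```

Accuracy, sensitivity, and specificity depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason studies may report them side by side.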

Descriptive analyses

Given the different types of features and algorithms employed, the data were not homogeneous enough to be included in a quantitative meta-analysis. Descriptive analyses were employed to analyze study findings by key design characteristics such as the employed features, sample size, and ML algorithms.

Quality assessment

An assessment of the risk of bias was performed using the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines [27] (see Supplementary Materials for more details; see Supplementary Table 3 for risk of bias results).

Results

Based on the search strings and after the removal of duplicates, 745 unique studies were retrieved and screened for eligibility from direct database search and 109 from other sources (Fig. 1).

During this screening phase, 82 studies were rejected because they failed to fully meet the inclusion criteria. Subsequently, we reviewed the full texts of the remaining 663 studies plus 109 from other sources. Six hundred studies were further excluded since they did not meet the inclusion criteria (see Fig. 1 for a complete description).

As a result, the remaining 81 studies were included in the qualitative synthesis of the review; their characteristics are summarized in Table 1.

Table 1 Sociodemographic and clinical characteristics of the reviewed studies.

Description of outcome employed

Regarding the predicted outcome, 41 (51%) studies used ML to predict lifetime suicide attempts (e.g., retrospectively assessed past attempts), while only 16 (19.7%) longitudinally assessed the risk of suicide using future risk/attempts as an outcome. Specifically, five studies [28,29,30,31,32] predicted attempts/death at 1 month after the evaluation, the study by Chen and colleagues [33] predicted suicide attempts at both 1 and 3 months from the assessment, three studies [34,35,36] predicted suicide risk at 3 months, and Nock and colleagues [37] predicted suicide between 1 and 6 months. Three studies [38,39,40] predicted suicide attempts at 12 months, and one study [41] stratified suicide risk at 12 months after the assessment. Finally, three studies [42,43,44] predicted future hospitalization for suicide or future suicide attempts without defining a precise temporal window.

Moreover, 14 studies predicted suicide ideation alone [45,46,47,48,49,50,51,52,53,54,55] or in combination with suicide attempts [56,57,58,59,60]. Finally, other studies predicted self-harm [61,62,63,64], suicide risk [38, 55, 65,66,67,68,69,70], the number of suicide attempts [71], and the presence of a family history of suicide [72].

Description of ML algorithms used

Regarding the number and type of ML approaches employed in the studies, 46 (57%) of the retrieved papers used a single ML algorithm, while 35 (43%) employed more than one. Among those employing more than one ML method, the average number of ML algorithms used was 3.8, with a range from 2 to 7. The most used algorithms were random forest (RF) and support vector machine (SVM), which were employed 29 times each, followed by neural network-based approaches and decision tree-based approaches, employed 22 and 18 times, respectively. Other ML approaches were used less often: elastic net eight times, Bayesian-based approaches six times, and clustering methods only four times.

Among studies adopting only one ML algorithm, neural networks were used 12 times, SVM 11 times, RF five times, tree-based approaches four times, and elastic nets three times.

In the studies that compared more than one algorithm, ML methods always performed better than logistic regression (LR). Moreover, RF [32, 57, 73] and SVM [74, 75] were among the best-performing algorithms, often with comparable results [65, 76], when compared to other methods. Finally, when present, convolutional neural networks (CNN) outperformed other ML methods [49, 50, 62, 77], including SVM and RF (please see Supplementary Table 4 for further details).

Description of the sample sizes and most assessed diagnoses

Sample sizes varied substantially across studies, ranging from 37 [42] to 10,120,030 [61] individuals, with an average of 230,074.5 and a standard deviation of 1,392,637. In more detail, twelve studies (14.8%) enrolled fewer than 100 participants, 27 studies (33.3%) enrolled between 100 and 500 individuals, 12 studies (14.8%) between 500 and 1000, 15 studies (18.5%) between 1000 and 10,000, and the remaining ten studies (12.3%) more than 10,000 subjects. For six studies, it was not possible to retrieve the exact number of participants included in the analysis.

Given the relatively low prevalence of the event of interest (i.e., suicide), most of the samples were unbalanced in terms of the number of subjects in each group. For instance, in the studies conducted by Fan and colleagues [57] and Wang and colleagues [77], the difference in size between the suicidal group and the non-suicidal control group was more than tenfold (i.e., 205 subjects in the “suicide” group and 2963 in the “no suicide” group). Similarly, the difference in Xu et al. [41] was 20-fold, with 2323 patients reporting self-harm and 46,460 patients with no self-harm characteristics. It is important to note that, on the one hand, very large differences in group size require significant corrections in the predictive algorithm (e.g., the weighting of the hyperplane for uneven group sizes), whereas, on the other hand, they reflect real-world data, as the prevalence of suicidal events in the assessed population is typically low.
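As an illustration of weighting for uneven group sizes, the sketch below computes class weights inversely proportional to group frequency, using the group sizes quoted above. The formula (total / (number of classes × class count)) is one common heuristic and is an assumption here; the reviewed studies may have used different weighting schemes, and the function name is hypothetical.

```python
# Sketch: class weights inversely proportional to class frequency.
# Assumption: weight_c = n_total / (n_classes * n_c), a common "balanced"
# heuristic; not necessarily the correction applied in the reviewed studies.

def balanced_class_weights(counts):
    """Map each class label to a weight so that every class contributes
    the same total weight to the training loss."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


group_sizes = {"suicide": 205, "no suicide": 2963}  # sizes quoted in the text
weights = balanced_class_weights(group_sizes)

for label, w in weights.items():
    print(f"{label}: weight = {w:.3f}")
```

With these weights, a misclassified case from the smaller group costs roughly 14 times more than one from the larger group, which mirrors the ratio between the two group sizes.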

Regarding the psychiatric diagnoses, 45 studies (55.6%) included more than one diagnosis in their sample and assessed the risk of suicide in a transdiagnostic manner, whereas 36 studies (44.4%) focused on patients with a single specific diagnosis. Not all the studies reported full details regarding the diagnostic status of the included sample, with some of them only referring to “psychiatric patients” to describe the sample.

Among reports detailing patients’ diagnosis, mood disorders were prevalent in 64 studies (79%). Specifically, major depressive disorder (MDD) was studied in 37 investigations, and bipolar disorder (BD) in 21 publications. Six studies simply reported “mood disorders” to characterize the sample. Patients affected by schizophrenia were included in 14 studies, whereas four enrolled patients diagnosed with schizoaffective disorder and five simply reported “psychosis” as a sample description. Thirteen studies focused on anxiety disorders, eight on substance-use disorders and four on obsessive-compulsive disorders.

Finally, among studies focusing on a single diagnosis, MDD was the most represented (16 times), followed by BD, schizophrenia, and substance-use disorders, represented three times each.

Description of the number and types of features

The number of features employed in the prediction of suicidal behaviors varied considerably across studies, ranging from 10 [71] to 190,919 [64]. Specifically, 20 studies (24.7%) predicted suicide with fewer than 50 features, seven studies (8.6%) employed between 50 and 100 features, 11 (13.6%) between 100 and 200, ten (12.3%) between 200 and 500, and, lastly, 11 studies (13.6%) employed more than 500 features. In addition, 22 studies (27.1%) did not report the exact number of features fed to the algorithm for suicide prediction.

As far as the feature types are concerned, the majority of the studies (54, 66.6%) used clinical and sociodemographic variables. Among these, ten studies were based on electronic health records (EHR), which have become an important source of data in the last few years [78].

Ten studies employed brain imaging data to predict suicide: seven studies used resting-state MRI (rsMRI) [54, 55, 60, 68, 69, 79, 80], two used both rsMRI and structural MRI [58, 81], three used diffusion tensor imaging (DTI) [49, 59, 82], one used structural MRI in combination with clinical and demographic data [53], and one employed measures from spectroscopy [47]. Eight studies (13.6%) analyzed the text obtained from interviews and EHR using natural language processing (NLP).

Only four studies (4.9%) focused on genetic and epigenetic features in order to predict suicide, and a single study [46] explored the predictive value of the human metabolome, employing 123 plasma metabolites. Lastly, three studies [36, 51, 83] used blood biochemistry in association with clinical and sociodemographic data.

Description of AUC and accuracy ranges

A total of 62 studies (76.5%) reported at least the accuracy or the area under the curve (AUC) of their prediction, while the remaining studies reported different metrics (e.g., positive predictive value, sensitivity, F1 score [84]), partly because of the methods employed (e.g., clustering and neural networks [41, 68]).

Interestingly, 87% of studies (i.e., 54 out of 62) focusing on either prediction accuracy or AUC reported values above 70% or 0.70, respectively. Specifically, eleven studies reported an accuracy between 70 and 80%, 14 between 80 and 90%, and six studies above 90%. Regarding AUC, 14 studies showed an AUC between 0.70 and 0.80, 16 between 0.80 and 0.90, and eleven studies reported an AUC above 0.90. The AUC of selected studies is reported in Fig. 2 as a function of sample size and number of features. Nonetheless, besides a few notable exceptions [38, 42, 43], no studies tested their prediction on independent validation samples. It should be noted that, in highly unbalanced samples, the lack of an independent validation sample greatly reduces the overall generalizability. Therefore, these findings are likely to suffer from overfitting and should be regarded with caution [85].
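A further reason for caution when reading accuracy values from unbalanced samples can be shown with simple arithmetic. The sketch below takes the group sizes quoted earlier for one study (205 versus 2963) purely as an illustration and evaluates a trivial rule that labels every patient as non-suicidal; it is not a reanalysis of any reviewed study.

```python
# Illustrative arithmetic only: a trivial "always predict no suicide" rule
# evaluated on the group sizes quoted in the text for one study.

n_suicide = 205
n_no_suicide = 2963

# Every non-suicidal patient is classified correctly, every suicidal one is missed.
true_negatives = n_no_suicide
true_positives = 0

baseline_accuracy = (true_positives + true_negatives) / (n_suicide + n_no_suicide)
baseline_sensitivity = true_positives / n_suicide

print(f"accuracy = {baseline_accuracy:.3f}, sensitivity = {baseline_sensitivity:.1f}")
```

Such a rule reaches an accuracy of about 93.5% while detecting no suicidal patient at all, so accuracy is best read alongside sensitivity, specificity, or AUC when groups are this uneven.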

Fig. 2: Graphical representation of the AUCs as a function of the number of features and the sample size.

When the authors performed more than one analysis using the same features and sample, the highest prediction value was used for the present graph. The number of features and the sample size are reported on a logarithmic scale. The color bar indicates the prediction rate. Good predictions are reached even with a limited number of subjects and features. However, this graph does not hold any meta-analytic value, given the differences between the studies.

Most relevant features

Studies employing clinical and sociodemographic variables confirmed previously known suicide risk factors. Previous suicide attempts, suicidal behaviors, or self-harm acts were among the strongest and most replicated predictors [28, 32, 33, 37,38,39, 61, 63, 71, 73, 75, 86,87,88,89,90]. Similarly, the type and severity of the psychiatric diagnosis seem to be associated with an increased risk of suicide. In detail, diagnosis and severity of MDD [4, 33, 56, 86, 88, 89, 91], psychotic features alone or accompanied by mood disorder [4, 63, 91], borderline personality disorder [33, 86, 89], and previous psychiatric hospitalizations [91, 92] ranked among the most relevant features. Moreover, comorbid alcohol or substance use or abuse also emerged as a relevant feature, irrespective of the initial diagnosis [28, 57, 71,72,73, 90,91,92,93]. Interestingly, a significant effect on suicide prediction was reported for the use and dosage of psychiatric pharmacotherapy, specifically antipsychotics [33, 63, 64] and antidepressants, especially tricyclics [33, 64, 73]. Moreover, variable importance analysis in a sample of 390,000 US veterans showed that 51.1% of model performance was driven by psychopathological risk factors, 26.2% by social determinants of health, 14.8% by prior history of suicidal behaviors, and 6.6% by physical disorders [87].

In line with this result, other ML studies highlighted the importance of socio-occupational status and well-being [56, 63, 65, 87, 93]. Similarly, non-psychiatric health issues have been reported among the features able to predict suicide [38, 56, 94]; moreover, one study reported the use of commonly prescribed opioids (e.g., Fentanyl) as a relevant feature in the prediction [57].

Regarding demographic variables, sex and age differences also emerged. Sex was a significant predictor in five studies, showing either increased risk for males [39, 63, 92] or more complex relationships between biological sex and risk factors [29, 73]. Moreover, age ranked among the most predictive features in five studies [38, 39, 63, 71, 73, 94], with Lopez-Castroman and colleagues [71] also suggesting that the risk increases until middle age, but then tends to decrease in the elderly. Lastly, only two studies [72, 93] reported family history of suicide among the most relevant features assessed, whereas criminal or violent behavior was listed as predictive in two other investigations [28, 39].

Regarding the studies that assessed the predictive power of brain imaging data, the thickness and volume of the orbitofrontal, the anterior and posterior cingulate, and the temporal areas were selected by the algorithm as the best predictors of suicide attempts in a group of young individuals and MDD patients [53], while in a late-life depression sample, frontal areas and the precuneus emerged as the strongest predictors [58]. Moreover, measures of functional connectivity [69] of frontolimbic [79, 81] and fronto-temporal circuits, as well as of the default mode network (DMN) [54, 68, 81], the amygdala, the parahippocampus and the putamen [54, 81], attained classification accuracies above 70%.

Regarding clinical predictors in MDD populations, Ilgen and colleagues [92] reported that co-occurring substance use, male sex, and previous psychiatric hospitalizations increased the risk of suicide. Similarly, in a more recent publication [89], hospitalization, previous suicide attempts, and co-diagnosis with a personality disorder were the most relevant features to predict suicide, yielding an accuracy above 80%. Moreover, thyroxine plasma level and the severity of depression (measured via the Hamilton scale for depression - HAMD) were able to predict suicide with an accuracy of 70% [51].

In studies that involved a broader spectrum of diagnoses of mood disorders (including MDD, BD and also anxiety disorders), previous history of suicide or suicidal thoughts [56, 63], presence of psychotic features [63, 91], and socio-occupational functioning [56, 63, 65] ranked among the most important features in the prediction (all scoring above 70% accuracy). Lastly, Passos and colleagues [91] showed a significant contribution of substance use or dependence and of the number of previous hospitalizations to suicide risk, whereas Iorfino and colleagues [63] found that treatment with antipsychotics, sex, and age were relevant features in the prediction. A brief summary of the most important features is reported in Supplementary Table 5.

Discussion

The objective of our review was to summarize the results of ML studies in predicting suicidal behaviors in psychiatric clinical populations. Although the earliest publication in our review dates back to 1998, more than half of the reports were published between 2019 and 2022, suggesting that ML approaches in psychiatry, and especially in suicide prediction, are becoming increasingly frequent. It is, therefore, important to constantly update the literature evaluation in order to keep pace with a rapidly growing field. This translates into the opportunity to critically guide the nascent field and address key gaps in the existing literature. Compared to previous literature [95], our review focused only on psychiatric samples, in order to reduce the bias introduced by diagnoses in the general population. When focusing on broader samples, studies tend to find the presence of a psychiatric diagnosis to be one of the most predictive features. Since it is well known that psychiatric populations are at higher risk for suicidal behaviors, using the general population often does not add knowledge for suicide prevention and, moreover, might mask more subtle risk factors. Furthermore, compared to previous reviews in the field [95], we gave a more in-depth analysis of predictive features and also employed two different scoring systems specifically designed for ML studies (see Supplementary materials), in order to give the most precise overview of the literature. Critically, all these aspects might serve as a starting point for future studies.

Regarding our results, most studies classified lifetime suicide attempts, and fewer assessed suicide attempts in a follow-up time window [28,29,30,31,32, 38, 39, 96]. Moreover, some studies classified their sample for death by suicide [44], suicidal ideation [45, 46, 48,49,50,51, 56, 57], or risk stratification [38, 41, 65,66,67,68,69]. Differences in the outcomes and in the definition of risk pose a problem for the interpretation of the results, as risk factors for suicide are reported to be different from those for self-harm and suicidal ideation [1, 97]. In addition, studies also varied in terms of sample selection. Indeed, while most of the publications assessed suicide as a transdiagnostic outcome [38, 40, 63, 66, 67, 81, 98], only a few authors focused on patients with a specific diagnosis, mostly mood disorders [46, 51, 53, 58, 68, 75, 89, 92]. These differences limit the translation of the findings into clinical practice. Prediction models will likely improve prediction accuracy and inform clinical decisions if tailored not just to specific diagnostic groups but also to a dimensional approach to psychiatric disorders [16], as every diagnosis has a different and specific type of assessment and disease trajectory. This means that different patient groups might have different predictive features, with probable overlaps between diagnoses. Therefore, a focus on specific diagnostic groups should not divert attention from a comprehensive evaluation of the patient, given that both physical and psychiatric (especially substance abuse disorder) comorbidities proved to be among the most important predictive features.

Furthermore, another main issue regarding the reviewed studies is the imbalance between the prediction groups, given the low prevalence of the event of interest, with some studies including a control group more than ten times larger than the suicidal group [41, 57, 77]. Although an imbalance is intrinsic to this kind of study, given the prevalence of suicide in psychiatric disorders, some methods can be deployed to reduce the risk of false positives. Fan and colleagues [57] opted for oversampling in the training phase, a procedure that creates new samples by connecting inliers and outliers from the original dataset. This technique creates synthetic subjects to balance the sample and thereby foster the reliability of the ML analysis. Other analytical procedures to overcome the issue of imbalanced samples involve weighting the hyperplane for uneven group sizes, selecting a specific “weight” based on the difference between the groups.
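The general idea of creating synthetic minority-group subjects by interpolation can be sketched as follows. This is a simplified SMOTE-style procedure written for illustration under stated assumptions (Euclidean distance, k = 3 neighbours, toy two-feature data); it is not the exact oversampling variant used by Fan and colleagues [57], and the function names are hypothetical.

```python
import random

# Simplified SMOTE-style oversampling sketch (illustrative assumption, not the
# procedure of any specific reviewed study). Each synthetic subject is placed
# on the segment connecting a minority-group subject to one of its nearest
# minority-group neighbours.


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def oversample_minority(minority, n_new, k=3, seed=0):
    """Return n_new synthetic feature vectors interpolated between minority
    samples and their k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: squared_distance(base, m))[:k]
        partner = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (p - b) for b, p in zip(base, partner)])
    return synthetic


# Toy minority group with two features per subject (made-up values).
minority_group = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.5], [1.2, 2.2], [1.8, 2.1]]
new_subjects = oversample_minority(minority_group, n_new=10)
print(len(new_subjects), "synthetic subjects created")
```

Synthetic subjects should be generated from the training data only, after the split, so that no interpolated copy of a test subject leaks into training and inflates the apparent performance.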

Notably, in most cases, the variables employed as predictors were clinical and sociodemographic [48, 57, 87]. Several of the strongest predictors in ML studies are well-known risk factors for suicide, such as previous suicide attempts, previous hospitalizations, and severity of depression [28, 38, 51, 89, 91, 94, 96, 99]. Moreover, the presence of psychosis and a higher amount of pharmacological treatments, especially antipsychotics, proved to be highly predictive features in many investigations [4, 63, 64, 91, 100, 101]. Interestingly, the presence of psychiatric comorbidities was also one of the most valuable predictive features, in particular substance or alcohol use disorders [57, 61, 71, 72, 92]. These results emphasize the importance of a comprehensive evaluation of psychiatric patients and of the burden that comorbidities represent, also given their frequent occurrence [102]. This is particularly important for comorbid alcohol use and drug abuse, since they can reduce compliance with treatments [103] and increase impulsive behaviors [104], which in turn may act as risk factors for suicide. Besides the well-known suicide risk factors (i.e., history of suicide attempts, hospitalizations, etc.), more subtle risk factors emerged from the reviewed studies. In more detail, comorbidities emerged as important features in different studies, suggesting that not only psychiatric comorbidities but also physical health is important. Similarly, the use of specific drugs (i.e., antipsychotics), illness severity, and psychosis seemed to be highly predictive of suicide attempts. Finally, some studies suggested that laboratory tests, such as thyroid hormones, might also play a role in predicting suicidal behaviors, even at a subclinical level [51, 83].

Although most of the significant features identified by ML are well-known risk factors for suicide [6, 7], ML demonstrates a greater predictive ability when compared with classical univariate statistics (i.e., logistic regression) and clinician assessment of risk factors [8, 9]. In particular, ML attained higher accuracies as compared to logistic regression [46, 49, 57, 61, 63, 67, 69, 87, 105]. These results suggest that advanced methods may inform the clinical decision-making processes in a more precise manner, likely overcoming the poor predictive value provided by classical statistics and expert assessment of the same risk factors [8, 9]. Interestingly, when present, CNN seemed to perform better than other ML algorithms, including SVM and RF. This might indicate the possibility of using deep learning to better stratify suicide risk, at the cost of a slight loss of interpretability.

Lastly, only a few studies employed biological features, such as genes, SNPs, epigenetic loci [42, 43, 98, 106], and neuroimaging measures [47, 49, 53, 68, 69, 79, 81] to predict suicide. Surprisingly, just a single study [53] combined brain imaging with clinical data to predict suicidal behaviors. As one of the major strengths of ML is the possibility to combine data obtained through different modalities (e.g., genetics, brain imaging, clinical features) to increase prediction accuracy, this approach should be exploited in future suicide research, as is already occurring in other fields of medicine [14].

Limitations and future challenges

A number of limitations should be highlighted. Methods varied widely across studies in terms of ML approach, sample selection, features employed, and preprocessing pipeline. Moreover, distinct investigations focused on a variety of different outcomes, from lifetime attempts to death by suicide, from cross-sectional to longitudinal evaluations. Such differences call for increased uniformity in the assessment of suicidal behaviors and in the design of ML protocols to enhance predictions of risk that may translate into clinical practice.

For instance, the decision to use either a specific and unique ML framework or different algorithms should be motivated: the testing of several approaches at once seems confusing and rather exploratory, especially in the absence of an external validation dataset. Regarding the different algorithms, it is noteworthy to mention that, from our results, it emerged that deep learning methods (such as CNN) performed better than other ML algorithms in direct comparisons. Although important from a research point of view, deep learning algorithms tend to be less interpretable (more “black boxes”), and this aspect might prove crucial in the further development of AI techniques in medicine and psychiatry. This is true, especially in the field of mental health and suicide prediction, where AI tools should assist clinicians and not introduce further complexity. For an AI to become useful in clinical practice, it should prove to be trustworthy, therefore not only valid and reliable, but also easily understandable [107]. In the last years, the concept of explainable AI (“XAI”) emerged, as a possibility to close the gap between the algorithms and the clinicians, creating a human-understandable correspondence between inputs and outputs of the black-box model either through intrinsic transparency of the model or through post-hoc techniques. Given that clinical applications are high-stakes, we require understandability from the prediction tools, or either AI tools will grow in distrust [107].

Moreover, features should be carefully selected, and their number should not be excessive (see the curse of dimensionality), as it was in some of the studies [44, 61]. Collecting such a large amount of data may be feasible only in university centers, which reduces the translational value of the results. This comprehensive review should also help in choosing the right type and number of features. For example, pharmacological treatments, especially antipsychotics, were among the most important features in the studies that included them in the models. However, the pharmacological status of patients is often not reported (see Table 1), and in most cases the type and dosage of the different drugs are not included in the models. Based on the results of our review, it might be beneficial to include data on pharmacological therapy, since it could enhance the predictive power and clinical applicability of these models. The inclusion of pharmacological information might also help in defining protective features, not just risk factors, as suggested by studies showing that some stabilizers and antidepressants might actually reduce the risk of suicide [64]. Also, both psychiatric and physical comorbidities seem to have a predictive role in the presented models; in particular, substance abuse as a comorbid disorder proved highly predictive. This suggests that a comprehensive evaluation of the patient is needed to define the clinical risk.

In addition, most of the studies addressed the prediction of suicide using a cross-sectional approach, disregarding temporal aspects. Yet, time may represent a crucial feature for predictive models of suicide [17]. In this regard, defining in advance one or more prediction windows after the assessment is fundamental, as the prediction of short-term suicide risk may rely on different features than that of long-term risk. Similarly, the timing of a feature relative to the assessment point might differentially affect the accuracy of prediction. For instance, suicide attempts in the year prior to the assessment may be a stronger predictor of new short-term suicidal behaviors than attempts that occurred several years earlier.
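The two temporal ideas above can be made concrete with a minimal sketch: outcome labels defined for several pre-specified prediction windows, and the recency of a prior attempt encoded relative to the assessment date. The window lengths (30, 180, and 365 days), the function names, and the example dates are illustrative assumptions and are not taken from the reviewed studies.

```python
from datetime import date


def label_by_window(assessment, event, windows_days=(30, 180, 365)):
    """Return {window_in_days: 0 or 1} for each prediction window.

    A label is 1 if the event occurred after the assessment and within
    the window. Simplification: patients without an event are labeled 0
    for every window, which ignores censoring (follow-up shorter than
    the window).
    """
    labels = {}
    for window in windows_days:
        if event is None:
            labels[window] = 0
        else:
            delta = (event - assessment).days
            labels[window] = int(0 < delta <= window)
    return labels


def days_since_last_attempt(assessment, prior_attempts):
    """Days between the most recent prior attempt and the assessment.

    Returns None when no attempt precedes the assessment.
    """
    earlier = [attempt for attempt in prior_attempts if attempt < assessment]
    if not earlier:
        return None
    return (assessment - max(earlier)).days


# Hypothetical patient record.
assessment_date = date(2020, 1, 1)
outcome_labels = label_by_window(assessment_date, date(2020, 3, 15))
recency = days_since_last_attempt(
    assessment_date, [date(2015, 6, 1), date(2019, 10, 1)]
)
print(outcome_labels)  # {30: 0, 180: 1, 365: 1}
print(recency)         # 92
```

With labels of this kind, a separate model (or a separate evaluation) can be specified for each window, and the recency feature allows a recent attempt to be weighted differently from a remote one.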

Furthermore, despite the high heterogeneity, most of the studies (>80%) obtained good accuracy, namely 70% or higher. However, many studies did not report additional key metrics (e.g., PPV, F1-score) that are paramount for interpreting the actual usefulness of prediction models. Moreover, only a few studies tested their predictions on external validation samples; therefore, caution is needed when interpreting these findings, since they may suffer from overfitting.
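To illustrate why accuracy alone can be misleading for a rare outcome, the following minimal sketch derives accuracy, PPV, and F1-score from sensitivity, specificity, and prevalence. The function name and the input values (1% prevalence, 80% sensitivity, 80% specificity) are illustrative assumptions, not figures from the reviewed studies.

```python
def metrics_from_rates(sensitivity, specificity, prevalence):
    """Accuracy, PPV, and F1-score implied by the three input rates.

    All confusion-matrix cells are expressed as population fractions.
    """
    tp = sensitivity * prevalence              # true positives
    fn = (1 - sensitivity) * prevalence        # false negatives
    tn = specificity * (1 - prevalence)        # true negatives
    fp = (1 - specificity) * (1 - prevalence)  # false positives

    accuracy = tp + tn
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator > 0 else 0.0
    return {"accuracy": accuracy, "ppv": ppv, "f1": f1}


# Assumed scenario: a rare outcome and a seemingly well-performing model.
example = metrics_from_rates(sensitivity=0.80, specificity=0.80, prevalence=0.01)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

Under these assumed rates, accuracy is 80% while PPV is below 4%, meaning that fewer than 1 in 25 patients flagged by the model would be a true positive. This is why reporting PPV and F1-score alongside accuracy matters when the outcome is infrequent.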

Finally, further studies are clearly needed to examine the role of neurocognitive variables, dimensions of social support, loneliness, the extent and type of medical comorbidity and associated disability, the type of pharmacological interventions used in the context of specific diagnoses, and the presence of psychotherapies and their combination with medications, in relation to suicidal risk. Similarly, a more consistent use of ML is of paramount importance. CNN, RF, and SVM performed better than other algorithms, but these results should be further tested in the future.

Conclusions

The results of the reviewed studies lead to the conclusion that ML approaches attain greater accuracy than classical analysis methods in predicting suicidal behaviors across a variety of psychiatric disorders. In the reviewed ML studies, well-known risk factors for suicide emerged as relevant predictors, along with newer, subtler aspects, such as physical and psychiatric comorbidities, presence of psychotic symptoms, and subclinical lab tests, which should be further analyzed and confirmed in future studies. However, additional work is needed to improve the predictive strength of ML algorithms, resolve the systemic lack of external validation, and ultimately make them useful in clinical psychiatry. To do so, ML should integrate genetic, neurobiological, brain imaging, psychometric, and clinical data to achieve better predictions. Algorithms should then be presented in a way that is intuitive for both psychiatrists and patients, to foster their adoption and ease of use in the clinical setting. Although some attempts have been made, ML approaches are to date not routinely part of clinical practice in psychiatry. We believe ML development should aim to gain the trust of clinicians by proving to be valid, reliable, and understandable, so that it can realistically be included in decision processes. Our review indicates that ML tools can be valid in the context of suicide risk stratification; future studies should demonstrate that they are reliable and, even more importantly, easy for clinicians to understand. Multifactorial disorders require multifaceted approaches, and ML could help in this respect; however, AI tools should not introduce further complexity into decision processes, and explainable AI will therefore be crucial to the further clinical development of predictive tools.