Introduction

Schizophrenia is a heterogeneous illness and its long-term outcomes are highly variable1,2,3. Attempts to provide prognostic markers for long-term outcomes, such as Rümke’s “praecox feeling”, have appeared throughout medical history4. Yet despite an abundance of outcome predictors at the group level, such as sociodemographic characteristics, clinical markers, and neurocognitive markers5,6, no valid prediction model for the long-term outcome of schizophrenia at the patient level is currently available to clinicians7. An additional challenge is that “outcome” entails symptomatic, social, functional, and personal dimensions, which are only partly interrelated8,9 and may have differing significance for individual patients10,11. These matters complicate clinical decision-making, for example when considering an early switch to clozapine12, antipsychotic dose reduction or discontinuation strategies13, allocation of sheltered housing14, or occupational support15. From a public health perspective, reliable long-term outcome prediction and the resulting treatment stratification are important, as demand usually outweighs the capacity of mental health institutions, even in countries with high mental healthcare expenditure16.

Machine learning potentially offers a way to develop models that reliably predict individual outcomes for multifactorial and heterogeneous illnesses such as schizophrenia17,18,19,20,21. In clinical research, machine learning, or pattern recognition, refers to an algorithm that learns from a large multivariate dataset to make an adequate prediction for a patient, for example concerning future clinical outcome. Modern prospective multicenter studies facilitate the development of prediction models based on machine learning: they provide well-established outcome measures and large numbers of potential predictors (i.e. “features”), in study samples large enough to cover the heterogeneity of the target population19. A landmark study by Koutsouleris et al. recently demonstrated the potential of machine learning for individual outcome prediction in psychosis18. Pre-treatment data from a multicenter clinical trial were used to predict global outcomes after 4 and 52 weeks of treatment in first-episode psychosis. Predictive accuracy was significantly above chance, at 73.8–75.0%. Accuracy was largely retained (an average drop of 2.8%) when the models were tested on geographic sites left out of the model training procedure, suggesting validity in other samples. Unemployment, lower education, functional deficits, and unmet psychosocial needs were found most valuable in predicting 4- and 52-week outcomes.

Here, we extend the use of data-driven model development based on patient-reportable data to long-term (3 and 6 years) symptomatic and global outcomes of patients with schizophrenia-spectrum disorders. To this end, we include a heterogeneous population of schizophrenia-spectrum patients, with variable illness duration and baseline clinical status, from the Genetic Risk and Outcome in Psychosis (GROUP) cohort study22. We explore the use of a wide range of baseline markers of genetic and environmental risk and measures of past and baseline clinical state to predict 3- and 6-year symptomatic and global outcomes. We use data-driven selection to arrive at a model containing predictors from a limited number of measures, aiming at clinical applicability. We assess generalizability with leave-one-site-out (LOSO) cross-validation, testing our models on geographic study sites left out of model development. Additionally, we investigate the use of the features that have been found to predict 4- and 52-week outcomes of first-episode psychosis18 for 3- and 6-year outcomes in the GROUP sample.

Results

Sample characteristics

We included 523 patients with a schizophrenia-spectrum disorder who had outcome assessments three (T3) and six (T6) years after baseline. Demographic and clinical baseline characteristics of the study sample, and comparisons to patients excluded because of missing follow-up assessments, are listed in Table 1. Patients with unfavorable baseline characteristics were more likely to be lost to follow-up. At baseline, T3, and T6, respectively, 49%, 37%, and 41% of patients were in symptomatic remission (according to the consensus definition by Andreasen et al. (2005)), and 31%, 44%, and 36% had good global functioning (Global Assessment of Functioning (GAF) scale ≥ 65). For symptomatic outcome, 65% and 64% of patients were stable at T3 and T6 relative to baseline; for global outcome, 68% at both time points (Supplementary Fig. 2).

Table 1 Baseline demographic and clinical characteristics of patients who completed baseline and follow-ups and of those who were not included in the study.

Selection of modalities based on unimodal models

We included demographic information, illness-related variables, the Positive and Negative Syndrome Scale (PANSS; present-state clinician-rated symptomatology), and either the Camberwell Assessment of Need Short Appraisal Schedule (CANSAS; clinician-rated and self-reported need of care) or the Community Assessment of Psychic Experiences (CAPE; self-reported lifetime psychotic experiences) for multimodal modeling. Notably, this set is especially rich in indicators of clinical course up to inclusion in GROUP (i.e. it includes GAF, features from the PANSS, and, where applicable, the CAPE). The choice of these modalities was based on unimodal modeling performance for the following modalities: (1) demographic variables; (2) illness-related variables; (3) PANSS; (4) substance use characteristics; (5) neurocognitive task scores; (6) social cognitive task scores; (7) Premorbid Adjustment Scale items; (8) CANSAS; (9) CAPE; (10) extrapyramidal symptoms; (11) genetic features, and familial loading of psychotic disorder, bipolar disorder, and drug abuse; (12) environmental variables of urbanicity and living situation (see the “Methods” section; Supplementary Tables 2 and 4). As mentioned earlier, we additionally trained models using a prespecified set of features that had performed best in predicting 4- and 52-week outcomes of first-episode psychosis in the EUFEST study (22 and 24 features, respectively; see part C in Table 2, the “Methods” section, and Supplementary Table 1)18. A summary of the number of features and sample size for unimodal and multimodal models per outcome, and good versus poor outcome distributions, is provided in the Supplementary Information (Supplementary Tables 2 and 3).

Table 2 Internal validation with nested cross-validation and leave-one-site-out (LOSO) nested cross-validation predicting symptomatic and global outcome at T3 and T6 in multimodal models.
Table 3 Important baseline features by model.

Performance of prediction models: Internal validation

Using a repeated nested cross-validation design (see the “Methods” section, Creation of models (A)) including recursive feature elimination, support vector machine models were trained to predict individual patients’ outcomes based on their baseline data from four modalities. The outcome could be predicted with similar cross-validated balanced accuracies (BACs), ranging from 62.2% to 67.6%, regardless of the period, outcome, and the fourth modality included (i.e. CANSAS or CAPE) (Table 2). Model performance was well above that of benchmark models containing a single feature (GAF), and using threshold values for good/poor global functioning above and below GAF 65 did not result in higher model performance (Supplementary Note 4, Supplementary Table 15).

The 10% most influential features for symptomatic as well as global outcome, based on weight and frequency of selection, included items from all four modalities in the model: PANSS items, illness-related and demographic features, and either CANSAS or CAPE items (Supplementary Tables 5–12). As illustrated in Fig. 1, generally, the more often a feature is selected, the higher its average weight (see Supplementary Figs. 3 and 4 for overviews of selection frequency over weight for all models).

Fig. 1: Frequency of inclusion and weight of features.

Frequency of inclusion of a feature against its (average) weight in the model, shown for prediction of Global Assessment of Functioning (GAF) outcome at T3, containing Positive and Negative Syndrome Scale general, negative, and positive subscale (PANSS—Gen, Neg, and Pos), demographic, illness-related, and lifetime psychotic experiences (CAPE) features. A positive weight reflects that scoring higher on this feature contributes to being classified as ‘poor outcome’; for features with negative weights, the opposite holds.

Worse scores on the GAF symptoms and GAF disabilities subscales, worse scores on specific items in the positive and negative subdomains of the PANSS (i.e. judgment and insight, hallucinatory behavior, flat affect, unusual thought content, motor retardation), worse (health-related) quality of life, and the use of antipsychotics were associated with multiple poor outcome endpoints. This was supplemented by lower counts of “no need” and “met need” items, together with housing needs and unmet psychotic disorder needs, in models including CANSAS items. In models including the CAPE, important items mostly came from the depression subscale (i.e. guilty and tense feelings, suicidal thoughts, and lack of activity) (Table 3).

The following features were predictive of at least one-fourth of poor outcome endpoints, albeit selected in fewer models than those summarized in the previous paragraph: higher age, schizophrenia diagnosis, and higher levels of various present-state symptoms in the PANSS subdomains of positive, negative, and disorganization symptoms and emotional distress (i.e. delusions, suspiciousness/persecution, grandiosity, stereotyped thinking, lack of spontaneity, difficulty in abstract thinking, emotional withdrawal, depressive symptoms, and tension) (Table 3).

Comparison of frequently misclassified patients to those mostly correctly predicted showed the following. Patients with a good outcome who were incorrectly classified as having a poor outcome (21–37% over the models) showed less favorable baseline characteristics (e.g. higher PANSS scores, lower GAF scores, more chronicity, lower parental education, and more schizophrenia diagnoses) compared to the correctly classified group of patients. Conversely, patients with a poor outcome who were incorrectly classified as having a good outcome (3–14%) showed favorable baseline characteristics, such as lower PANSS and higher GAF scores, compared to the mostly correctly classified group of patients (see Supplementary Table 13 for a detailed overview of significant comparisons).

Generalization of the prediction models: LOSO validation

The generalizability of the models was evaluated by consecutively leaving out patients from one of the four geographic sites and testing the model trained on the three remaining sites in these patients (see the “Methods” section, Creation of models (B)). These LOSO-validated models had slightly lower accuracies (−2.3% on average) than models trained on the full dataset (Table 2). Average prediction accuracies for symptomatic outcome at T3 and T6 ranged from 59.9% to 63.8% (Table 2); site-specific BACs ranged from 53.0% to 69.7% (Table 4). For global outcome the range was 61.2–64.8% (Table 2), with site-specific BACs from 53.0% to 68.9% (Table 4). The difference between T3 and T6 prediction accuracy was small (mean BACs 63.4% and 61.9%, respectively), as was the difference between CANSAS-based and CAPE-based models (mean BACs 62.8% and 62.4%, respectively). LOSO models validated on the Utrecht site and single-site models trained on the Utrecht site tended to perform below average, in line with the differential patient profile (higher baseline symptom severity, lower GAF scores, and higher needs) found at this site relative to the other geographic sites, and with its smaller sample size (Supplementary Note 4.4; Supplementary Table 14).

Table 4 Leave-one-site-out cross-validation site performance by model.

External validation of EUFEST predictors

Predicting long-term outcomes based on the top 10% most predictive features for short-term outcome (EUFEST study; see the “Methods” section, Creation of models (C)) resulted in accuracies of 59.0–62.7% for symptomatic outcome and 56.5–66.4% for global outcome (Table 2).

Discussion

Using a rigorous machine learning approach, we developed individualized models to predict 3- and 6-year symptomatic and global outcomes of patients with schizophrenia-spectrum disorders based on patient-reportable data. The multicenter sample included 523 schizophrenia-spectrum patients with variable illness duration, mainly with established illness. Notably, baseline clinical status was variable, and outcome status remained poor at follow-up in a large share of patients. The data-driven nature of this study allowed us to explore the predictive value of a wide range of measures for the long-term outcome of psychosis. In keeping with clinical applicability, our aim was to arrive at lean models. We report nested cross-validated balanced accuracies ranging from 62.2% to 67.6%. Suggestive of generalization of model performance to out-of-study samples, leave-one-site-out cross-validation showed minor drops in accuracy, with balanced accuracies ranging from 59.9% to 64.8%. Models trained in our sample for long-term outcome prediction, utilizing short-term outcome predictors for first-episode psychosis18, yielded comparable balanced accuracies of up to 66.4%.

To the best of our knowledge, no prognostic models for the long-term global and symptomatic outcomes of psychosis are presently available23. Our results indicate that while state-of-the-art methods may yield robust (generalizable) performance estimates, predictions are only modestly accurate, similar to recent experimental machine-learning prognostic models for depression that predict long-term clinical outcomes from patient-reportable data17,24. The models did not reach the LOSO cross-validated accuracy of 71% reported in the study on the one-year outcome of first-episode psychosis18, presumably due to the uncertainty introduced by time, the care-as-usual setting, and the heterogeneity of baseline clinical status and illness duration within our target population.

Through a modality-wise learning strategy, a combination of baseline sociodemographic features and clinician-rated symptoms, complemented by self-rated lifetime psychotic experiences (CAPE items) or psychosocial needs (CANSAS items) was selected in the models. Interestingly, in unimodal models, these state-based and context-based modalities outperformed trait-based modalities, including genetic and cognitive task scores. This finding may be partly explained by the relatively large share of patients with a stable clinical state at inclusion and follow-up. We further argue that the performance of trait-based measures may improve if the interaction between genetics and environmental exposures in psychosis outcomes is taken into account25.

Features offering a clinician’s integration of the clinical picture and those with a broad underlying construct (e.g. GAF; insight; schizophrenia diagnosis; quality of life; summed no need/met need items; depression) proved to be the most important predictors. These resemble features found to be predictive of one-year outcome in first-episode psychosis: psychosocial needs, global functioning deficits, and affective symptoms (specific quality of life, CAPE, and PANSS items)18.

Comparing the aforementioned study with our work, we also note differences, suggesting that short-term and long-term clinical management of psychosis may call for different approaches. We found higher, not lower, symptom severity to predict poor long-term outcome18. In particular, lack of insight appears predictive of poor long-term outcomes across all the models. This may be mediated through poor adherence and eventual service disengagement26. Furthermore, we note that the most important social need in our models (i.e. housing) differs from those (company, daytime activities) predicting short-term outcomes. This could be explained by the lower level of social functioning found in our study cohort compared to first-episode patients27, suggesting that in a model suited for a functionally heterogeneous population, the entire range of social needs within the CANSAS instrument may be relevant. In interpreting the influence of features on the predictions, it should be noted that some frequently selected features show large variation in weight. This variation could originate in the heterogeneity of the disorder and should be the subject of future research.

Within our models, misclassification especially occurs in patients with an unfavorable clinical baseline status combined with a good outcome. This may reflect variation in baseline clinical context and acute state at the time of inclusion (i.e. admission due to relapse vs. outpatient treatment) and/or the availability of therapeutic or supportive resources. The higher baseline symptom severity, lower GAF, and higher psychosocial needs found in the one geographic site that underperformed in the LOSO validation relative to the others may support this possibility. To enhance model performance, these contextual factors may be taken into account in future models.

Our models on long-term outcomes of psychosis perform with reasonable accuracy, but at present are not suitable as a stand-alone tool to stratify treatment. Regardless, the machine learning model trained here and a clinician represent rather different takes on reality. The model sees the patient through the lens of a number of constructs, such as “insight”, whereas a clinician’s judgment is more globally constructed and starts from the moment the clinician meets the patient in the waiting room28,29. Models with modest accuracy may still be of use, depending on the level of uncertainty the parties involved in clinical decision-making are willing to accept from a model30,31, a level which to our knowledge is unknown for long-term outcomes of psychosis. In addition, the preferable mode of interaction between model and clinician remains to be addressed32. We suggest clinicians may inform their decision-making both by the prediction itself and by important features in it, for example core social needs, affective symptoms, or low quality of life. Apart from clinical practice, modestly accurate model predictions may serve intervention research by offering stratified randomization.

Future prediction tool development should be informed by end-user (i.e. patients and clinicians) needs concerning scope, predictive capacities, and potential clinical consequences. We need to learn how they weigh benefit and harm due to treatment choices against a given outcome probability30. Furthermore, the significance of any predicted outcome might differ per patient, per stage of illness, and per intervention10. Hence, presenting an array of outcome dimensions with accessible features might best fuel the clinician–patient dialog on intervention33. Moreover, clinical guidance on when and how to use prognostic tools might prove essential for future dissemination of prognostic models based on machine learning in psychiatry.

We note the following limitations. Although we present the largest machine learning study to date on outcome in psychosis based on patient-reportable data, the sample size may not be sufficient to account for the substantial heterogeneity of the out-of-study population with a schizophrenia-spectrum disorder34. It should be noted that the drop in performance was small (on average 2.3%) when the models were applied to patients from geographic sites not part of the training sample, suggesting transportability to samples with a comparable profile. Although we implemented a comprehensive validation procedure, we cannot rule out some unaccounted-for overfitting35, including that resulting from information leakage, because modality selection, imputation, and scaling were performed outside the nested cross-validation pipeline. Apart from this, our approach of combining the four best-performing data modalities from unimodal modeling runs does not necessarily yield the best-performing combination in a multimodal model. Instead, models may benefit from a combination of modalities containing a wider range of information, as has been suggested by studies combining patient-reportable and imaging data36,37. We suggest future research may address what constitutes a clinically parsimonious set of modalities, that is, an optimal trade-off between accuracy and the investment needed to obtain the data. Furthermore, we believe that imputation and scaling outside the cross-validation setup have not led to over-optimistic estimates of generalizability, because of the very low number of imputations (<0.5%) and the fact that most of our features’ scales are fixed and thus independent of our dataset.

The GROUP study sample is known to represent a relatively well-functioning subset of the population of schizophrenia-spectrum patients in need of specialist care. Generalization to other samples might be further hindered by the exclusion of the most severely affected patients, whether due to study drop-out, exclusion of patients with extensive missing data, or inability or unwillingness to give study consent38. The nature of the sample may also explain the association of antipsychotic use with worse outcomes, as antipsychotic use at baseline is likely to be confounded by a history or expectation of a more severe illness course. The observational sample obtained may, however, be more representative of clinical practice than samples stemming from clinical trials. To improve model reliability, the use of multicenter samples dedicated to prognostic model development, or of models informed by national registry data, as has been done to predict transition to psychosis from high-risk mental states and suicidal behavior, may be needed39,40,41.

Regarding outcomes, prediction of longitudinal patterns or adverse events, such as readmission to a psychiatric hospital, may also add clinical relevance, especially for long-term outcome42. We used baseline data for outcome prediction, whereas in clinical practice, decisions are typically based on longitudinal, rather than single, examinations. Longitudinally informed models are expected to yield better prediction accuracies. Furthermore, we propose that contextual information, such as baseline clinical context (e.g. acute inpatient or outpatient status, treatment status) or supportive resource status (e.g. family support), may further enhance model performance. The addition of biomarker modalities, including imaging data and genetic data derived from genome-wide association studies, possibly in interaction with environmental exposures, holds the same promise41. However, all additions come at the expense of time investment, model interpretability, and the requirement of larger training datasets20.

In conclusion, we demonstrate the feasibility of a machine-learning approach to long-term outcome prediction in a heterogeneous target population of schizophrenia-spectrum patients, based on a lean set of patient-reportable features overlapping with those predictive of the short-term outcome of first-episode psychosis. Future models may benefit from considering patients’ and clinicians’ needs, the appropriate nature of the training sample (i.e. sample similarity to the population of interest as well as richness in (contextual) features), and implementation of advances in machine learning methodology. Individual outcome prediction based on machine learning may inform the treatment stratification needed from both a patient and a public health perspective.

Methods

Participants and data selection

In the GROUP prospective longitudinal cohort study, in- and outpatients with a psychotic disorder presenting consecutively at selected representative mental health services in representative geographical areas in the Netherlands and Belgium were recruited from January 8, 2004 until February 6, 2008. Inclusion criteria were: (1) a psychotic disorder diagnosis according to the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition (DSM-IV)43; (2) age 16–50 years (extremes included); (3) Dutch language proficiency; and (4) ability to provide informed consent. Extensive genetic, cognitive, environmental, and outcome data were collected at baseline (T0) and at 3-year (T3) and 6-year (T6) follow-up. The full GROUP sample at baseline included 1119 patients with variable illness duration, including recent-onset psychosis22. Here, we used data of 523 participants with a schizophrenia-spectrum disorder (i.e. schizophrenia, schizophreniform disorder, schizoaffective disorder, delusional disorder, brief psychotic disorder, or psychotic disorder not otherwise specified) for whom outcome assessments at both T3 and T6 were available, assessed with the Comprehensive Assessment of Symptoms and History or the Schedules for Clinical Assessment in Neuropsychiatry (see Supplementary Fig. 1 for a flow chart of the selection process)44,45. We assessed selection bias by comparing our sample on demographic and clinical characteristics to GROUP patients not included in this study.

The study protocol was approved by the Medical Ethical Review Board of the University Medical Centre Utrecht and by local review boards of participating institutes. Participants provided written informed consent. Database release 5.0 was used in all analyses.

Long-term outcomes and baseline predictors

We selected two long-term outcome measures in a classification approach to outcome prediction: symptomatic remission and global functioning, measured at both T3 and T6. Symptomatic outcome was selected as it is traditionally a mainstay of clinical care. We followed the consensus definition of symptomatic remission by Andreasen et al., operationalized as a mild score (3 or less, implying no functional disturbance related to symptoms) on selected items of the PANSS (i.e. delusions, conceptual disorganization, hallucinatory behavior, mannerisms and posturing, blunted affect, social withdrawal, lack of spontaneity, and unusual thought content), maintained for at least 6 months46,47. For global outcome we followed Koutsouleris et al. in operationalizing global outcome as a dichotomization of the Global Assessment of Functioning (GAF) scale, considering a score of <65 a poor global outcome and ≥65 a good global outcome18,43, as GAF scores between 61 and 70 have been proposed as a threshold between at-risk mental states and illness and are widely used as markers of recovery within more complex criteria48,49. GAF was constructed as a mean composite score of the GAF symptoms and GAF disabilities subscales assessed in the GROUP project, and was normally distributed in our sample. To investigate the possibility that a threshold other than GAF 65 would better separate “good” from “poor” outcomes, we tested other cut-off values in the GAF 50–68 range post hoc (Supplementary Note 4.2).
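As an illustration, the two outcome labels could be derived roughly as follows. This is a minimal R sketch under our own assumptions: the data frame df and column names such as panss_p1 and gaf_symptoms are illustrative placeholders rather than the GROUP variable names, and the 6-month maintenance criterion for remission is omitted.

```r
# Symptomatic remission (cross-sectional part of the Andreasen criteria):
# all eight selected PANSS items rated mild (3) or less.
# Column names are hypothetical placeholders.
remission_items <- c("panss_p1", "panss_p2", "panss_p3", "panss_g5",
                     "panss_n1", "panss_n4", "panss_n6", "panss_g9")
df$symptomatic_remission <- apply(df[, remission_items] <= 3, 1, all)

# Global outcome: mean composite of the two GAF subscales, dichotomized at 65.
gaf <- rowMeans(df[, c("gaf_symptoms", "gaf_disability")])
df$global_outcome <- factor(ifelse(gaf >= 65, "good", "poor"),
                            levels = c("good", "poor"))
```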

All clinical variables assessed at baseline within the GROUP project that permitted a sample size of >250 patients were considered for inclusion as predictors, barring the models that contained a prespecified set of features based on the best-performing features in the EUFEST study by Koutsouleris et al. (see Supplementary Table 1)18. We clustered available candidate baseline predictors into modalities according to information type: (1) demographic variables, including age, sex, education, socioeconomic status, living situation, and employment; (2) illness-related variables: diagnosis, comorbidities, illness course, duration of untreated psychosis, quality of life, and medication use; (3) clinician-rated present-state symptoms as measured by the PANSS47; (4) substance use characteristics (i.e. illicit drug use, alcohol use, and smoking), indicated by urine analysis and the Composite International Diagnostic Interview50; (5) neurocognitive task scores of IQ, memory, processing speed/attention, and executive functioning, assessed with the Wechsler Adult Intelligence Scale-Third Edition short form, Word Learning Task, Continuous Performance Task-HQ, and Response Set-shifting Task, respectively; (6) social cognitive task scores of theory of mind, affect recognition, and facial recognition, assessed with the Hinting Task, Degraded Affect Recognition Task, and Benton Facial Recognition Test, respectively (for psychometric instrument references for cognitive testing, see Supplementary Note 1.5); (7) Premorbid Adjustment Scale items51, comprising social and cognitive functioning in childhood and adolescence; (8) need-of-care items, measured with the CANSAS52,53; (9) self-rated lifetime psychotic experiences, consisting of Community Assessment of Psychic Experiences questionnaire (CAPE) items54; (10) extrapyramidal symptoms, comprising akathisia, dyskinesia, and Parkinsonian symptoms; (11) genetic features, i.e. polygenic risk score for schizophrenia55 and familial loading of psychotic disorder, bipolar disorder, and drug abuse (measures based on the absence or presence of affected relatives of the patient)56; (12) environmental variables of urbanicity and living situation. For the global content of, and features within, the modalities, see Supplementary Note 1 and Supplementary Table 1. Within each modality, features and subjects with <20% missing values were retained, with missing entries imputed and features scaled; features and subjects with ≥20% missing values were excluded (also see Supplementary Note 2).
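A sketch of that missingness rule, assuming a data frame modality_df holding the raw features of one modality (a hypothetical name; the imputation step itself is left out):

```r
# Keep features with <20% missing values, then subjects with <20% missing
# values on the retained features; remaining gaps are imputed and scaled.
keep_features <- colMeans(is.na(modality_df)) < 0.20
modality_df <- modality_df[, keep_features, drop = FALSE]
keep_subjects <- rowMeans(is.na(modality_df)) < 0.20
modality_df <- modality_df[keep_subjects, , drop = FALSE]
```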

Creation of individual prediction models: machine learning strategy

We trained a linear support vector machine (SVM)57 to find the optimal separating hyperplane dividing patients into the two outcome classes (Fig. 2). For a given training dataset, each patient is represented by a labeled data point in an m-dimensional feature space. The position of the data point is determined by the scores on the m baseline predictors (input features) and its binary label is the outcome (−1: good outcome; +1: poor outcome). The SVM returns feature weights, reflecting the relative influence of predictors on outcome prediction. We used weighting by outcome class to account for imbalance between outcome group sizes and to blind the algorithm to the base-rate distribution, avoiding model bias towards the largest outcome group. Internal validation was performed with three-layer, 10-fold nested cross-validation, where the inner cross-validation layer optimized the cost parameter, representing a penalty imposed on cases violating the margin of the decision boundary of the model. The middle layer selected, by recursive feature elimination (RFE), the smallest predictor set with performance within 10% of the best-performing set. The outer layer provided performance estimates, reflecting the accuracy of the ensemble of k models taken together. This validation procedure was repeated 50 times to reduce dependency on the choice of train-test partitions. We employed the e1071 library (version 1.6.8) for the SVM and the caret package (version 6.0.76) for RFE, in R (version 3.4.0)58,59.
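As a sketch of the core learner, a class-weighted linear SVM with the cost grid described in Fig. 2 could be set up as follows with e1071. The helper and variable names are ours, not the study code; x_train (a feature matrix) and y_train (a good/poor factor) are assumed to exist.

```r
library(e1071)

# Inverse-frequency class weights blind the SVM to the outcome base rate,
# so the majority outcome class does not dominate the decision boundary.
fit_weighted_svm <- function(x_train, y_train, cost) {
  tab <- table(y_train)
  w <- setNames(as.numeric(length(y_train) / (2 * tab)), names(tab))
  svm(x = x_train, y = y_train, kernel = "linear",
      cost = cost, class.weights = w)
}

# Cost grid tuned in the inner cross-validation layer: 38 points
# equidistant on a log2 scale, from 0.0001 to 37.07 (see Fig. 2).
cost_grid <- 2^seq(log2(1e-4), log2(37.07), length.out = 38)
```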

Fig. 2: Machine learning training and validation design.

Machine learning pipeline. D = modality; M = model; training sets in dark blue, test/validation sets in yellow. a Data selection (see Supplementary Fig. 1 for details) and preprocessing, including scaling and imputation. b Unimodal models, to identify the most informative modalities. c Multimodal models consisting of 2–4 modalities, including recursive feature elimination (RFE). d External validation of multimodal models using leave-one-site-out (LOSO) validation, where one of the four geographic sites is held out of model training and used for external validation. SVM (support vector machine) setup (RFE is part of the SVM pipeline): (i) in the inner layer, a CV loop is used to find the optimal value for the cost hyperparameter C from 38 points equidistant in 2log, starting at 0.0001 and ending at 37.07; C sets a penalty for violating the margin of the hyperplane; (ii) the middle layer employs a CV loop for RFE, a feature selection algorithm, which starts by including all available features in the model and iteratively eliminates the least informative features until the stopping criterion is met; the smallest set of features with performance within 10% of the best-performing set is selected; (iii) in the outer layer, a CV loop is used to define feature weights in the training set (9/10th of the data) and test the accuracy of the model in the validation set (1/10th of the data). This procedure yields 10 models, and the whole procedure is repeated 50 times to reduce dependency on the choice of train-test partitions. The final prediction for a patient is an ensemble average over the 50 repetitions.

See Supplementary Note 3 for a detailed description of the machine learning pipeline, and Supplementary Note 4.3 for a comparison to an alternative nonlinear learning design (a random forest classifier with IsoMap dimensionality reduction, implemented in Neuropredict60), which yielded comparable performance in the study sample (Supplementary Table 16).
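To make the three layers concrete, the outer loop of one repeat might be skeletonized as below. This is a sketch reusing the fit_weighted_svm helper from the previous snippet; the middle-layer RFE and inner-layer cost tuning are reduced to comments, with features and best_cost standing in for their outputs.

```r
# Balanced accuracy: the mean of the per-class accuracies.
bac <- function(pred, truth) {
  mean(c(mean(pred[truth == "poor"] == "poor"),
         mean(pred[truth == "good"] == "good")))
}

set.seed(1)
folds <- sample(rep(1:10, length.out = nrow(df)))  # outer 10-fold partition

outer_bac <- sapply(1:10, function(k) {
  train <- df[folds != k, ]
  test  <- df[folds == k, ]
  # middle layer: recursive feature elimination on `train`  -> features
  # inner layer:  CV over `cost_grid` within each RFE step  -> best_cost
  model <- fit_weighted_svm(train[, features], train$outcome, best_cost)
  bac(predict(model, test[, features]), test$outcome)
})
mean(outer_bac)  # one repeat; the study averages over 50 such repeats
```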

Creation of individual prediction models: training and validation design

We employed a data-driven, modality-wise learning strategy with the aim of automatically identifying a concise set of features from a limited number of clinical instruments. We entered the best-performing modalities from preliminary unimodal modeling runs (Fig. 2b; Supplementary Table 2) together into the SVM to train a multimodal prediction model (Fig. 2c).

To align with best practice in prognostic model development19, our study included three components: model development, model validation, and comparison to existing models. (A) Data-driven model development, including internal validation using repeated 10-fold nested cross-validation (Fig. 2a–c). Single-feature models containing only baseline GAF as a predictor were additionally trained to benchmark model performance (Supplementary Note 4.1). (B) A test of generalization to out-of-study samples with leave-one-site-out (LOSO) validation (Fig. 2d). Each of the four geographical sites (Amsterdam, Utrecht, Groningen, and Maastricht) of the GROUP study was held out once, and the prediction model was trained on patients from the remaining three sites. This model was then tested on the held-out site, to yield prediction accuracy in a site geographically distinct from the sites the model was trained on. To estimate predictive power in unseen data, the average prediction accuracy over the four LOSO runs was calculated. We assessed differences between geographic sites on the measures included in the models and ran single-site models post hoc, to offer possible explanations for performance differences between LOSO runs (Supplementary Note 4.4). (C) Assessment of the applicability of 4-week and 52-week outcome predictors in first-episode psychosis for 3- and 6-year outcome prediction in a heterogeneous sample. We selected GROUP predictors matching the top 10% of 4- and 52-week global outcome predictors from the European First Episode Schizophrenia Trial (EUFEST; Supplementary Table 1)18, and trained the SVM to test their capability of predicting long-term outcomes within the GROUP sample.
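A minimal sketch of the LOSO scheme in (B), again reusing the fit_weighted_svm and bac helpers sketched above, and assuming a site column in df:

```r
sites <- c("Amsterdam", "Utrecht", "Groningen", "Maastricht")

loso_bac <- sapply(sites, function(s) {
  train <- df[df$site != s, ]  # three sites for model training
  test  <- df[df$site == s, ]  # held-out site for external validation
  model <- fit_weighted_svm(train[, features], train$outcome, best_cost)
  bac(predict(model, test[, features]), test$outcome)
})
mean(loso_bac)  # average prediction accuracy over the four held-out sites
```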

We assessed model performance by calculating sensitivity, specificity, balanced accuracy (BAC: the average of sensitivity and specificity), positive predictive value, and negative predictive value. To give an overview of important features for predicting long-term outcomes, we listed features with the highest selection chance per model (top 10%) that were selected in >1 model. Since the entire cross-validated RFE procedure was repeated 50 times, we were able to calculate the percentage of misclassified and correctly classified patients within these 50 repeats. To explore ways to enhance future model performance, we compared the profile of ≥90% misclassified patients with that of ≥90% correctly classified patients on sociodemographic and clinical characteristics, separately for the poor and good outcome groups. We made use of the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement61.
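All five metrics can be read from a 2×2 confusion table; a sketch, treating poor outcome as the positive class:

```r
# Sensitivity, specificity, balanced accuracy, PPV, and NPV from
# predicted and true good/poor labels ("poor" = positive class).
perf_metrics <- function(pred, truth) {
  tp <- sum(pred == "poor" & truth == "poor")
  tn <- sum(pred == "good" & truth == "good")
  fp <- sum(pred == "poor" & truth == "good")
  fn <- sum(pred == "good" & truth == "poor")
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  c(sensitivity = sens, specificity = spec,
    balanced_accuracy = (sens + spec) / 2,
    ppv = tp / (tp + fp), npv = tn / (tn + fn))
}
```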

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.