Schizophrenia spectrum disorders can have remarkably different life courses. Approximately half of people presenting with a first episode of psychosis (FEP) show good outcomes, such as remission1 or no need for long-term secondary care2. However, ~23–24% of patients with a FEP go on to develop treatment-resistant schizophrenia (TRS)3. TRS is typically defined as resistance to two antipsychotic treatments, each given at an adequate dose for at least 6 weeks, with evidence of medication adherence4. TRS is associated with reduced quality of life, substantial societal burden and up to tenfold higher healthcare costs5.

It is not currently possible to predict accurately whether someone with FEP will develop TRS. This is important because there is evidence that clozapine, the only treatment licensed for TRS6, is more effective the sooner it is prescribed7. Yet, in clinical practice there are often long delays before clozapine is considered8. This highlights the need to identify treatment resistance as soon as possible.

Risk prediction in psychosis is a flourishing field (Extended Data Fig. 1). However, existing studies have commonly included predictors that are not easy to deploy in routine clinical practice (for example, neuroimaging9 or genetic measures10); not routinely or reliably collected (for example, duration of untreated psychosis11, substance misuse12,13 or premorbid functioning14); or not available at FEP onset (for example, symptom patterns over time12,15). All these characteristics limit the potential clinical usefulness of existing efforts in TRS prediction.

In addition to limited clinical usefulness, most previous studies are limited by methodological difficulties or poor reporting practices, particularly a lack of assessment of model calibration, a lack of external validation to assess generalizability16,17, limited consideration of sample size and the risk of overfitting, and the inclusion of variables that cannot be known at FEP onset, such as medication during follow-up15.

Blood biomarkers, which are objective, are commonly used to predict clinical outcomes in routinely used, large-scale, risk-prediction algorithms based on the general population18. Indeed, biomarkers and clinical measures commonly taken at FEP onset can help predict metabolic outcomes in patients with psychosis19. Furthermore, inflammatory and metabolic alterations are already evident in antipsychotic-naïve patients with FEP, including impaired glucose tolerance, insulin resistance20, hypertriglyceridaemia21 and pro-inflammatory changes22. Biomarker alterations may additionally be associated with a more chronic psychiatric illness course2,23.

In this work, we aimed to predict clozapine use (as a proxy for TRS) up to eight years after FEP onset, using routinely collected, objective and measurable biomedical predictors at baseline, with the aim of producing the most parsimonious prediction model with the potential for clinical use. We used patient data from three UK early intervention in psychosis services (EIPs) to investigate the predictive potential of sociodemographic, lifestyle and biological data routinely recorded at FEP baseline. We aimed to follow best practice by including an external validation step to examine generalizability. We followed the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) guidelines (Supplementary Table 1).


Development of a TRS prediction model

The coefficients for MOZART and for the LASSO model are presented in Table 1. Histograms of predicted outcome probabilities are provided as Extended Data Figs. 2 and 3. Univariable logistic regression coefficients (clozapine~predictor) are presented in Supplementary Table 2.

Table 1 Model comparisons including coefficients for development and external validation

Internal validation

Measures of pooled internal validation performance of the models over 100 imputed datasets are shown in Table 1. The C statistic for the forced-entry model (MOZART) was 0.70 (95% confidence interval (CI): 0.63–0.76), while that for the LASSO model was 0.69 (0.63–0.77). Calibration plots showed good agreement between observed and expected risk at most predicted probabilities for both models, although the LASSO model showed slight overprediction of risk at lower predicted probabilities (Extended Data Figs. 4 and 5).

External validation

The external validation sample comprised 1,110 patients from the South London and Maudsley NHS Foundation Trust (SLaM) EIP (Table 1 and Fig. 1).

Fig. 1: Patient selection flow charts, by cohort.

The flow charts describe the application of inclusion and exclusion criteria for each cohort, starting from the sampling frames, and up to the analytic samples.

Applying the models developed in the joint development sample to the SLaM sample, the C statistic for MOZART was 0.63 (95% CI: 0.58–0.69), while that for the LASSO model was 0.64 (0.58–0.69; Table 1).

The calibration plot for MOZART showed good agreement between observed and expected risk (Fig. 2a), while that for the LASSO model showed evidence of mild overprediction of risk at higher predicted probabilities and of slight overprediction for very low risk (Fig. 2c). In all models, the 95% CIs widened as predicted probabilities became higher, owing to lower numbers of participants.

Fig. 2: Calibration plots for the main models based on the external validation sample.

a–d, Model calibration is the extent to which outcomes predicted by the model are similar to those observed in the validation dataset. Calibration plots illustrate agreement between the observed proportion of participants developing TRS (y axis) and predicted risk of TRS (x axis). Perfect agreement would trace the red line. Model calibration is shown by the continuous black line. Triangles denote grouped observations for participants at deciles of predicted risk, with 95% CIs indicated by the vertical black lines. Axes range between 0 and 0.4 because very few individuals received predicted probabilities greater than 0.3. a,b, Calibration plots based on the external validation sample for the forced-entry model (MOZART) before (a) and after (b) recalibration/revision. c,d, External validation calibration plots for the LASSO model before (c) and after (d) recalibration/revision. N = 1,110 participants in external validation sample.

External validation after model recalibration and revision

We applied logistic recalibration to both main models in the external validation sample. Additionally, the coefficient for lymphocyte count was selected for revision as the sign of the coefficient was reversed between the development and validation samples.

Table 1 shows that, after MOZART’s recalibration/revision, the C statistic was restored to values close to internal validation performance (0.67 (95% CI: 0.62–0.73)). The same procedure performed on the LASSO model, however, did not produce any improvement on the original model performance statistics.

The calibration plots for both recalibrated models are shown in Fig. 2b,d. Both showed good agreement between observed and expected risk.

Decision curve analysis and data visualization tool

Decision curve analysis for MOZART (Fig. 3) suggests that at propensity-to-intervene thresholds greater than 0.05 (revised model) or 0.06 (original model), the models provided greater net benefit than the competing extremes of treating all patients or none. The recalibrated model provided higher net benefit at most, if not all, thresholds over 0.05 than the original model.

Fig. 3: Decision curve analysis plot for forced-entry original and recalibrated models.

The plot reports the net benefit (y axis) of the forced-entry (MOZART) original and recalibrated models across a range of propensity-to-intervene thresholds (x axis), compared with intervening in all patients and intervening in no patients. The dashed red vertical lines represent the two thresholds we selected a priori to study the potential clinical value of low-risk (for example, monitoring) and high-risk (for example, starting clozapine) interventions.

Numerical decision curve analysis results (net benefit, standardized net benefit, sensitivity and specificity) are shown in Supplementary Table 3 across a range of propensity-to-intervene thresholds. For example, if a low-risk intervention such as close monitoring for TRS was considered suitable above a propensity-to-intervene threshold of 0.10 (>10% risk of clozapine use), the recalibrated model would provide a net benefit of 2% (95% CI: 1–4%), meaning that an additional 24% of patients could be closely monitored for the presence of TRS (standardized net benefit). However, for a potentially more invasive intervention such as starting clozapine treatment, at a propensity-to-intervene threshold of 0.50, the same model would provide no net benefit, owing to insufficient sensitivity.

We also developed an online data visualization tool for both the original and recalibrated MOZART models, which allows interactive exploration of the effect of each predictor and their combinations on the risk of clozapine use based on the predictors included in this study (

Sensitivity analysis with iterative improvements

To examine the added benefit of selected demographic and biological predictors, we examined iterative improvements in the forced-entry model. Model 1 (M1) included sex as the only predictor; M2 included all demographics; M3 included demographics plus triglyceride levels; M4 additionally included ALP. The internal coefficients and shrinkage factors for each model are presented in Supplementary Table 4. The C statistic increased from 0.56 (95% CI: 0.50–0.62) for M1 to 0.69 (0.62–0.76) for M4. Calibration plots showed good agreement between observed and expected risk at most predicted probabilities for M3 and M4 (shown, alongside histograms of predicted outcome probabilities, in Extended Data Figs. 6–9).


We examined the predictive potential of routinely collected sociodemographic, lifestyle and clinical information, obtained at the start of a FEP, for the risk of clozapine use, as a proxy for developing TRS. We developed two models: MOZART, based on forced-entry logistic regression, and another based on LASSO for coefficient generation and shrinkage. The two models performed adequately in both internal and external validation. MOZART performed better than LASSO in external validation, and its performance improved following recalibration/revision.

Decision curve analysis revealed that MOZART shows clinical utility at lower propensity-to-intervene thresholds, such as 10–20%. This model cannot yet be recommended for clinical use and requires prospective validation in larger samples, health technology assessment and regulatory approval. However, in future our model could allow implementation of low-risk strategies, for example, stratifying patients at higher risk of developing antipsychotic resistance for closer monitoring of TRS. These strategies have very low risk of causing harm and might show potential for earlier recognition and treatment of TRS. Clozapine is more effective when given soon after treatment resistance is established, although in clinical practice there are long delays to starting it7,8; therefore, starting treatment early might show potential in reducing symptoms and improving quality of life in people with unrecognized TRS. However, given the higher risk and licensing conditions of clozapine, and the lower sensitivity of the model at higher risk thresholds, this model alone will not be useful for initiating higher risk interventions, such as starting clozapine.

In the future, inclusion of genetic risk scores and other predictors might make clozapine prediction models more accurate, and therefore more clinically useful. Two existing studies found that polygenic risk scores for schizophrenia did not produce significant increases in predictive power of a model for TRS24,25. However, the publication since then of larger genome-wide association studies (GWAS) for schizophrenia26 and of a specific TRS GWAS27 will likely make the approach more powerful.

MOZART extends existing research by using only seven common predictors available at FEP baseline; by including an external validation analysis, a crucial step to demonstrate generalizability; and by following best practice guidelines28,29.

We show that simple blood-based biomarkers measured at the onset of psychosis can explain part of the variance of the risk of clozapine use, as demonstrated by the increased C statistic for the incremental model including biomarkers. This suggests that the variance of a psychiatric phenotype (development of TRS) may be explained, at least in part, by inflammatory, fat and liver biomarkers.

Previous studies using regression-based methods have shown that elevated triglycerides are associated with a worse psychiatric clinical outcome in psychosis at the group level2,23. We extend these findings by showing that elevated triglycerides at the individual level could aid in TRS prediction. We included ALP owing to the increasingly important role that liver dysfunction is thought to play in the psychosis spectrum30. In particular, elevated ALP might relate to the primary dysglycaemic and dysmetabolic phenotype of FEP20,31,32, or it might be its consequence (hyperlipidaemia leading to non-alcoholic fatty liver disease33, a phenotype that has been found in FEP30). Elevated ALP may also capture some of the variance of substance use in a more objective manner than self-reporting34,35.

As a proxy for inflammation, we selected lymphocyte count as a predictor, because the data are widely available across samples. In a previous analysis of mostly White participants with FEP, lymphocytes were associated with a worse psychiatric outcome2. However, cross-sectional studies have not found lymphocyte elevations in FEP36,37, and a recent Mendelian randomization study did not find evidence for a causal association with schizophrenia38, which makes a causal role for elevated lymphocytes in schizophrenia less likely. Further, we found that the drop in discrimination performance for the forced-entry model from internal to external validation was mostly due to differences in the lymphocyte predictor, with the sign of the coefficient switching direction between samples. In model updating, the C statistic could be partially preserved by updating the coefficient for lymphocytes. This might be explained by the different ethnic mix between the development sample (mainly White) and the external validation sample (mainly Black). It is known that inflammatory markers, including lymphocytes, show different distributions in different ethnic groups39,40. This might encourage repetition of the analysis using different inflammatory markers, such as C-reactive protein (CRP), in future research. We could not include CRP, because it was most often sampled in the included cohorts when there was suspicion of infection; therefore, data were only available for a small subset and likely showed strong selection bias.

The use of longitudinal EIP cohort data is the main strength of this study. Enrolment into an EIP fosters confidence in the psychiatric phenotype of included participants and in the naturalistic character of the sample, which includes many consecutive referrals with little possibility of selection bias from the sampling frame. Most EIPs in the UK National Health Service (NHS), including all three in this analysis, are the only treatment providers for FEP in each geographical area, thus covering a large proportion of all incident cases of FEP. Specifically, the Cambridgeshire and Peterborough Assessing, Managing and Enhancing Outcomes (CAMEO) EIP, used to develop our model, accepts people presenting with confirmed psychotic symptoms from any cause, including drug-induced psychoses and affective psychoses (including International Classification of Diseases (ICD)-10 codes F06.0-2, F20-F31, F32.3, F33.3 and F53.1). Therefore, MOZART is shown to work in real-life samples of FEP, which makes the results more likely to be clinically applicable (that is, to any patient presenting with a FEP). Because this study is based on real-life patient data from electronic health records (EHRs) from different regions, we were unable to address potential secular and regional trends in monitoring, laboratory testing and prescribing practice that could have biased results. However, using completely separate development and validation samples is required to adhere to best prediction modelling practice, which requires external validation on separate participants to avoid ‘high risk of bias’29.

Among the limitations of this study, we used clozapine treatment—a proxy measure for TRS—as the outcome, as in several previous studies14. Prevalence of clozapine use in our samples was lower than the expected prevalence of 13% (see calculation in Methods). In the UK, clozapine should be offered to all patients with TRS41. However, a recent national audit showed that only 52% of patients with FEP who have not responded adequately to at least two antipsychotics are offered clozapine42. Furthermore, EIPs accept patients with psychotic symptoms from any cause, thus including bipolar and unipolar mood disorders; this diagnostically inclusive nature of our FEP cohort might partially explain the relatively low rate of TRS. However, while our outcome definition may have a reduced sensitivity for capturing treatment resistance, the specificity is likely to be high. Indeed, the UK National Institute for Health and Care Excellence (NICE) guidance is that prescription of clozapine is reserved for those with schizophrenia in whom two trials of antipsychotics have failed43, and the only UK indication for clozapine other than TRS is Parkinson’s disease, which would be extremely rare in FEP cohorts only including adults up to 65 (mean age of 28 years; Table 2). Further, the literature suggests that clozapine in the UK is used ‘off label’ for treating refractory mania, psychotic depression, aggression in psychotic patients, the reduction of tardive dyskinesia symptoms and borderline personality disorder44, therefore the presence of such diagnoses among the cases cannot be excluded, and is a limitation of this study. However, a UK-based systematic investigation of off-label antipsychotic use in secondary care established that clozapine is the least likely to be used outside its approved indications, with only 1 of 46 patients (~2%) in the study using it off label45, which might be a consequence of the very strict regulations in place for clozapine use. 
Another UK-based study of TRS, including 14,299 patients, both inpatient and community-based, undergoing mandatory clozapine blood-monitoring, found 56 off-label clozapine prescriptions (0.4%)46. While these studies included any patient on antipsychotics, our cohorts are based on UK EIP teams, which only accept young patients with a FEP, and therefore it is likely that off-label clozapine use in this group is even rarer. Further, not all cohorts could provide information about time of clozapine initiation, and therefore time-to-event analysis could not be performed. Moreover, follow-up data were available for up to eight years following a FEP. This means that we might not have been able to capture ‘late onset’ TRS, which might develop after a number of relapses and over a number of years47; this might also help to explain the relatively low clozapine rate in our samples.

Table 2 Predictor comparisons between samples used in model development and internal/external validation

Predictor availability was limited to those markers that were available in all three study cohorts. No cohort included a symptom or severity measure, such as the Positive and Negative Syndrome Scale (PANSS); we could therefore not include symptoms at baseline as a predictor. The number of predictors that we could include was also limited by our sample size, although we took particular care in predictor selection and this may have helped to prevent model overfitting28,48. It must be pointed out that this work did not aim to make any assumptions about whether the included predictors might be causal to TRS: variables were selected if they were known to be associated – that is, likely capturing part of the outcome’s variance.

A further limitation of this work is the potential for the inclusion of patients already taking antipsychotic medication at baseline. Antipsychotics could influence the levels of the biomarkers. However, most patients admitted to an EIP are medication naïve or minimally treated. Blood tests were only used for prediction if performed within 100 days of referral to the EIP; it is likely that some patients were started on antipsychotic medication during this time, though the duration of treatment is likely to have been relatively short. However, participants were excluded if the outcome (starting clozapine) predated baseline blood collection.


In conclusion, we report that, based on three large samples of patients, routinely recorded demographics and biomarkers measured at presentation with a FEP could be useful in the individualized prediction of the risk of clozapine use (as a proxy for developing TRS) up to eight years later. Subject to further external validation and regulatory approval, MOZART appears useful for predicting the risk of TRS at lower propensity-to-intervene thresholds, thus potentially allowing implementation of low-risk strategies such as closer psychiatric monitoring for TRS in at-risk populations. This could potentially shorten the time from FEP onset to clozapine start, thus reducing delays in TRS recognition and treatment, and consequently reducing suffering and improving quality of life.

We suggest that future efforts in TRS risk prediction should seek to consider such routinely collected data. Doing so may improve both model predictive performance and likely clinical usefulness, both of which are crucial for the future routine deployment of a risk prediction model into clinical practice.


Model development

We used a forced-entry logistic regression model as the most parsimonious way to predict a binary outcome (such as clozapine use) from a small number of predictors. However, to explore whether additional predictors may improve performance in a manner that reduces risk of overfitting, we also used a LASSO-based selection model, which has the benefit of independently shrinking the predictors’ coefficients up to excluding them, and can be more robust to a slightly larger number of predictors.

Data from 785 patients were included in the pooled development sample: 539 from CAMEO and 246 from the Birmingham EIP (Table 2), following EHR searches and application of inclusion and exclusion criteria (Fig. 1, and a description of the included and excluded samples in Supplementary Table 5). Included patients had a mean age of 28.2 years, an average BMI of 25, and were 66% White and 41% smokers. In the pooled development sample, 58 (7.4%) patients were treated with clozapine.

Ethical approval

All research complied with relevant ethical regulations and underwent the local approval process in each of the three cohorts. CAMEO data were identified by anonymously searching for all EIP patients using the Cambridgeshire and Peterborough NHS Foundation Trust (CPFT) Research Database49—approved under UK NHS Research Ethics Service references 12/EE/0407, 17/EE/0442. Anonymized data for all patients enrolled in the Birmingham EIP were collected as part of the National Clinical Audit of Psychosis Quality Improvement Programme, and were enhanced locally with biomarker data; the work conformed to the Health Research Authority definition of service evaluation (confirmed by Birmingham Women’s and Children’s Hospital NHS Foundation Trust). The Clinical Records Interactive Search (CRIS) resource was used to capture anonymized data from SLaM EIP—approved under UK NHS Research Ethics Service references 18/SC/0372 and 08/H0606/71+5; National Institute for Health Research (NIHR) Biomedical Research Centre (BRC) CRIS Oversight Committee reference 20-005.


EIP currently represents the gold standard of care for people presenting with a FEP50. EIPs are built around a specialized, multidisciplinary team, which can deliver both pharmacological and psychological interventions, family and social support, support with employment and physical healthcare checks for up to five years after patients are diagnosed with a FEP. This model of care has been shown to be highly effective1, and, given the longitudinal follow-up by a highly specialized team, to guarantee a high degree of confidence that enrolled patients suffer from a FEP, including the first manifestations of both primary psychotic conditions such as schizophrenia and schizoaffective disorders, and psychotic mood disorders such as psychotic depression or mania.

Model development sample

We developed a risk prediction model using pooled longitudinal data from patients enrolled in the CAMEO psychosis EIP, searching for patients enrolled between 1 January 2013 and 31 May 2021 (sampling frame n = 1,660), or the Birmingham EIP, searching for patients enrolled between 1 January 2014 and 31 December 2018 (sampling frame n = 391). This was selected as the development sample for the present study as CAMEO data were recently used to examine group-level associations between mean biomarker levels and psychiatric outcomes2.

Predictors were assessed within 100 days of patient EIP enrolment. We excluded any participant who had missing data on more than 50% of predictor variables, and non-cases (patients who did not use clozapine) who had less than two years of follow-up to reduce the probability of including future TRS cases as non-cases. Further details on missing data management can be found in the Supplementary Notes. All patients who developed TRS were included regardless of duration of follow-up. As predictors must predate outcomes, we also excluded all cases where the outcome start date (clozapine treatment start date) predated the earliest available baseline bloods in the CAMEO cohort (and in the SLaM cohort, see the Model external validation sample section), or participants who started taking clozapine within 100 days of baseline in the Birmingham cohort. Patients were excluded if they died or moved out of the Trust’s catchment area during follow-up.

Model external validation sample

We used the CRIS resource to capture anonymized data from SLaM. Our sampling frame included 3,012 EIP patients, all those enrolled between 1 January 2012 and 20 November 2021. Patients were excluded and predictors and outcomes were assessed as for the development sample.


Owing to data availability, we adopted a pragmatic definition of TRS: patients were defined as having TRS if they had been treated with clozapine at any point during the follow-up period. Clozapine is the only clinically approved treatment for TRS in the UK, and provides an objective, easily quantifiable measure of TRS41. We calculated an expected prevalence of clozapine use of 13%. This was calculated as follows: starting from a population prevalence of 23%3,14,51, we expected to capture mostly ‘early onset’ cases, which represent ~84% of cases11. From previous literature, clozapine is given in ~68% of TRS cases11, so the expected prevalence was 0.23 × 0.84 × 0.68 = 0.13.
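
The arithmetic behind the expected prevalence can be reproduced in a few lines. The Python sketch below is illustrative only (the study analyses were run in R) and the variable names are hypothetical; the comparison with the pooled development sample uses the 58 of 785 patients reported in this paper.

```python
# Expected prevalence of clozapine use, from the three figures quoted in the text.
trs_prevalence = 0.23        # proportion of FEP patients who develop TRS
early_onset_fraction = 0.84  # share of TRS cases that are 'early onset'
clozapine_fraction = 0.68    # share of TRS cases that are given clozapine

expected_prevalence = trs_prevalence * early_onset_fraction * clozapine_fraction
print(f"expected prevalence: {expected_prevalence:.3f}")  # rounds to 0.13

# Comparison with the pooled development sample (58 of 785 patients).
n_patients, n_clozapine = 785, 58
observed_prevalence = n_clozapine / n_patients
expected_cases = expected_prevalence * n_patients
print(f"observed prevalence: {observed_prevalence:.3f}")
print(f"expected cases in {n_patients} patients: {expected_cases:.0f}")
```

The gap between the expected and observed figures is the lower-than-expected clozapine prevalence discussed among the limitations.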

Predictor variables

Routinely used clinical predictors were included based on a balance of clinical knowledge, existing research and likely clinical usefulness. Demographic variables were considered if they had shown evidence of potential predictive ability for TRS in existing prognosis research16,24. Biomarkers and clinical measures were considered if they showed evidence from past longitudinal association studies of biological measures at FEP using long-term clinical outcomes2,23. Predictors were only included if they were part of the suite of measurements that should be collected at baseline as part of local or national guidelines, to avoid ascertainment bias. We did not include variables that may only be recorded in specific circumstances, such as CRP, which may only be recorded when an infection is suspected. All predictors needed to be available in all three EIP samples. Therefore, we considered the following parameters, measured within 100 days of EIP start: sex (female or male); age (years); ethnicity (categorical: White European or not recorded (reference), Black or African-Caribbean, Asian, or other); triglyceride concentration (mmol l–1); lymphocyte and neutrophil blood cell counts (billion l–1); ALP levels (units l–1); smoking status (binary, at least one cigarette on average daily); BMI (kg m–2); and random glucose levels (mmol l–1).

See Supplementary Notes for full rationale and details of data extraction.

Statistical analysis

All data analyses were conducted in R (v. 4.2.1)52. We performed sample size calculations using the R package pmsampsize (v. 1.1.2)53; for details on sample size calculations, please see the Supplementary Notes. For data imputation we used the MICE package (v. 3.14)54. For logistic modelling we used base R and the pROC package (v. 1.18)55. For calibration plots we used the CalibrationCurves package (v. 0.1.5)56. For LASSO model development, we used the MAMI package (v. 0.9.13)57. Finally, for coefficient shrinking we used the psfmi package (v. 1.0)58.

Primary analysis

We performed sample size calculations using the R package pmsampsize48,53. The required sample size was derived from the estimated outcome prevalence, the a priori estimated R² of the model and the estimated required model shrinkage. For 11 predictors, the minimum sample required was 412. To reduce the risk of overfitting, we did not consider non-linear terms or interactions. See Supplementary Notes for detailed sample size calculations.

We handled missing data using multiple imputation by chained equations and pooled estimates using Rubin’s rules (see Supplementary Notes for details about predictor missingness). Internal validation involved bootstrap resampling (500 bootstraps) to obtain an estimate of the corrected calibration slope. The resulting pooled, corrected calibration slope (C slope) was then used as a shrinkage factor for our coefficients. After this step, predictive performance was assessed.
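
As an illustration of the pooling and shrinkage steps, the following Python sketch applies Rubin's rules to a single coefficient and then multiplies it by a uniform shrinkage factor. It is not the code used in this study (which relied on the R packages listed above); the function names, the example estimates and the shrinkage factor of 0.9 are hypothetical.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets using Rubin's rules.

    estimates: per-imputation coefficient estimates
    variances: per-imputation squared standard errors
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two imputations")
    pooled = sum(estimates) / m
    within = sum(variances) / m  # average within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between  # total variance
    return pooled, math.sqrt(total)


def apply_shrinkage(slopes, shrinkage_factor):
    """Multiply every slope by the same (uniform) shrinkage factor."""
    return [shrinkage_factor * b for b in slopes]


# Hypothetical estimates of one coefficient across five imputations.
est = [0.40, 0.44, 0.38, 0.42, 0.41]
var = [0.010, 0.011, 0.009, 0.010, 0.010]
pooled_est, pooled_se = pool_rubin(est, var)

# Hypothetical optimism-corrected calibration slope used as shrinkage factor.
shrunk = apply_shrinkage([pooled_est], 0.9)
print(pooled_est, pooled_se, shrunk)
```

After uniform shrinkage of the slopes, the model intercept is typically re-estimated so that the mean predicted risk still matches the observed outcome rate; that step is not shown here.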

We developed the risk calculator using two alternative model selection methods:

  1. A forced-entry logistic regression model, including all sociodemographic and three biological predictors (one lipid, one inflammatory and one liver marker), based on a balance of clinical knowledge, past research and likely clinical usefulness.

  2. A LASSO-based selection model, after predictor scaling and centring, including all 11 pre-selected sociodemographic, lifestyle and biological predictors. The inclusion of additional variables was enabled by LASSO including a predictor selection step, and by its more efficient coefficient shrinkage, leading to less risk of model overfit59. For the LASSO model we used 100-fold cross-validation to tune the penalty parameter in the development sample as implemented in glmnet60.

Both methods involved variable pre-selection, after ruling out predictor multi-collinearity to minimize risk of overfitting, as is recommended for smaller datasets61.

The models were applied to the external validation sample. The distribution of predicted outcome probabilities was inspected using histograms.

Model performance was assessed primarily with measures of discrimination (the ability of the model to distinguish participants with the outcome from those without), such as the C statistic, and calibration (the extent to which the outcome probabilities predicted by the model in specified risk-defined subgroups are similar to those observed in the validation dataset), assessed by inspection of calibration plots (presented as figures, e.g. Fig. 2).

The discrimination of the models was assessed using the C statistic. For binary outcomes, this is equivalent to the area under the receiver operating characteristic curve61, which plots sensitivity against 1 minus specificity. The C statistic normally ranges from 0.5 to 1, with a value of 1 representing perfect discrimination and a value of 0.5 representing discrimination no better than chance. C statistics were determined in relation to the observed binary outcomes (subsequent clozapine use or not).
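The C statistic can also be read as the proportion of all (event, non-event) pairs in which the participant with the event received the higher predicted probability, with ties counted as half. The sketch below computes it directly from that definition; the function name `c_statistic` and the toy data are hypothetical and serve only to illustrate the measure.

```python
def c_statistic(y_true, y_prob):
    """Concordance (C) statistic for a binary outcome.

    Counts, over all (event, non-event) pairs, how often the event has the
    higher predicted probability; tied predictions contribute 0.5.
    """
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("both outcome classes must be present")

    concordant = 0.0
    for p_event in events:
        for p_non_event in non_events:
            if p_event > p_non_event:
                concordant += 1.0
            elif p_event == p_non_event:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


# Hypothetical outcomes (1 = subsequent clozapine use) and predicted risks.
y = [1, 0, 1, 0, 0]
p = [0.8, 0.3, 0.4, 0.5, 0.1]
c = c_statistic(y, p)  # 5 of the 6 pairs are concordant
```

A model that assigns the same probability to everyone scores 0.5 under this definition, which is the chance-level value described above.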

We also recorded calibration intercepts (ideally close to 0) and Brier scores (an overall measure of model performance, ideally close to 0, with scores >0.25 generally indicating a poor model). For further details of our prediction methods, see ref. 19.
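The Brier score is the mean squared difference between the predicted probability and the observed binary outcome. The following sketch, with the hypothetical function name `brier_score` and toy data, shows why 0.25 is a natural reference point: a model that predicts a probability of 0.5 for every participant scores exactly 0.25 whatever the outcomes are.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical outcomes (1 = subsequent clozapine use).
y = [1, 0, 1, 0]

uninformative = brier_score(y, [0.5, 0.5, 0.5, 0.5])  # every prediction is 0.5
perfect = brier_score(y, [1.0, 0.0, 1.0, 0.0])        # predictions match outcomes
```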

Model recalibration/revision

Additionally, where performance at external validation differed from internal validation performance, we considered two recalibration approaches. First, we considered logistic recalibration. This method is used where the coefficients of the original model may have been over-fitted, affecting calibration performance. Logistic recalibration assumes similar relative effects of the predictors, but allows for a larger or smaller absolute effect of the predictors62. Further details are in Supplementary Notes. Second, where there was evidence of a clear difference in the association of a predictor with clozapine use between the development and validation samples, we considered logistic recalibration plus revising a single predictor in the model. We limited this model revision approach to a maximum of one model predictor, to preserve as much of the character of an external validation analysis as possible, though we note that all recalibrated/revised models will require a further external validation in an additional unseen sample.
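One common formulation of logistic recalibration refits only an intercept and a slope on the logit of the original predicted probabilities, which leaves the relative effects of the predictors unchanged. The sketch below follows that formulation using a small Newton-Raphson fit. It is an assumption about the general technique rather than a reproduction of the procedure in the Supplementary Notes, and the function names and data are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1 - p))


def expit(z):
    return 1 / (1 + math.exp(-z))


def logistic_recalibration(y_true, y_prob, iterations=25):
    """Fit outcome ~ a + b * logit(original prediction) by Newton-Raphson.

    Returns the recalibration intercept a and slope b.
    """
    lp = [logit(p) for p in y_prob]  # linear predictor of the original model
    a, b = 0.0, 1.0                  # start from "no recalibration needed"
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for y, x in zip(y_true, lp):
            mu = expit(a + b * x)
            w = mu * (1 - mu)
            g0 += y - mu             # score for the intercept
            g1 += (y - mu) * x       # score for the slope
            h00 += w                 # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * delta = score
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def recalibrate(y_prob, a, b):
    """Apply the fitted intercept and slope to the original predictions."""
    return [expit(a + b * logit(p)) for p in y_prob]


# Hypothetical validation data: original predicted risks and observed outcomes.
p_orig = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y_obs = [0, 0, 1, 0, 1, 0, 1, 1]
a, b = logistic_recalibration(y_obs, p_orig)
p_new = recalibrate(p_orig, a, b)
```

After fitting, the mean recalibrated probability equals the observed event rate in the data used for recalibration, which is a property of any logistic fit that includes an intercept. A slope below 1 indicates that the original predictions were too extreme, and a slope above 1 that they were not extreme enough.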

Decision curve analysis

Decision curve analysis was performed to assess potential clinical benefit63. The clinical net benefit of the prediction model is calculated against offering an intervention to all or no patients, and can be calculated at a range of propensity-to-intervene thresholds. The propensity-to-intervene threshold is defined as the minimum probability of clozapine use at which the intervention would be warranted, and net benefit = sensitivity × prevalence – (1 – specificity) × (1 – prevalence) × w, where w is the odds at the propensity-to-intervene threshold64. In decision curve analysis, it is usual to only consider the range of propensity-to-intervene thresholds that may be clinically relevant; these depend on how risky the intervention being offered might be.
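As a worked illustration of the net benefit formula, the sketch below uses the identities sensitivity × prevalence = true positives / n and (1 – specificity) × (1 – prevalence) = false positives / n. The function names and the toy data are hypothetical, and classifying a participant as high risk when the predicted probability is at or above the threshold is a convention assumed here.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of intervening on everyone with predicted risk >= threshold."""
    n = len(y_true)
    true_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    false_pos = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    w = threshold / (1 - threshold)  # odds at the threshold
    return true_pos / n - (false_pos / n) * w


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of intervening on every patient regardless of predicted risk."""
    prevalence = sum(y_true) / len(y_true)
    w = threshold / (1 - threshold)
    return prevalence - (1 - prevalence) * w


# Hypothetical outcomes (1 = subsequent clozapine use) and predicted risks.
y = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
p = [0.9, 0.6, 0.55, 0.2, 0.12, 0.3, 0.05, 0.15, 0.4, 0.08]

nb_model_50 = net_benefit(y, p, 0.50)      # 2 true positives, 1 false positive, w = 1
nb_all_50 = net_benefit_treat_all(y, 0.50)
nb_model_10 = net_benefit(y, p, 0.10)      # 3 true positives, 5 false positives, w = 1/9
nb_all_10 = net_benefit_treat_all(y, 0.10)
```

Treating no patients always has a net benefit of 0, so a model is useful at a given threshold only if its net benefit exceeds both 0 and the treat-all value.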

For starting clozapine, we selected a priori a propensity-to-intervene threshold of 0.50, representing a greater than 50% risk of developing TRS. We believe that such a threshold would represent a good balance between the potential benefits of early clozapine initiation and the relatively rare risks of clozapine treatment. We also selected a lower propensity-to-intervene threshold of 0.10 (>10% risk of developing TRS) for defining a ‘TRS-at risk population’ who may be eligible for close monitoring.

The decision curve plot (Fig. 3) visualizes the net benefit of both model versions (forced-entry original and recalibrated) over varying propensity-to-intervene thresholds, compared with treating all patients or none. Classical decision theory proposes that at a chosen propensity-to-intervene threshold, the choice with the greatest net benefit should be preferred63.

Sensitivity analysis

To examine the added benefit of selected demographic and biological predictors, we examined iterative improvements of the model. The first model included only a single demographic predictor, sex; the second added all demographics; the third included all demographics plus a single biological predictor (triglycerides); the last model included all the above plus a second biological predictor (ALP). We did not externally validate the incremental models.

Visual representation of the model

We developed an online data visualization tool using shiny for R65, allowing interactive exploration of the effect of sociodemographic, lifestyle, and clinical variables and their combinations on TRS risk in people with FEP. The tool is not yet suitable for clinical use.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.