Validation and recalibration of OxMIV in predicting violent behaviour in patients with schizophrenia spectrum disorders

Oxford Mental Illness and Violence (OxMIV) addresses the need in mental health services for a scalable, transparent and valid tool to predict violent behaviour in patients with severe mental illness. However, external validations are lacking. Therefore, we have used a Dutch sample of general psychiatric patients with schizophrenia spectrum disorders (N = 637) to evaluate the performance of OxMIV in predicting interpersonal violence over 3 years. The predictors and outcome were measured with standardized instruments and multiple sources of information. Patients were mostly male (n = 493, 77%) and, on average, 27 (SD = 7) years old. The outcome rate was 9% (n = 59). Discrimination, as measured by the area under the curve, was moderate at 0.67 (95% confidence interval 0.61–0.73). Calibration-in-the-large was adequate, with a ratio between predicted and observed events of 1.2 and a Brier score of 0.09. At the individual level, risks were systematically underestimated in the original model, which was remedied by recalibrating the intercept and slope of the model. Probability scores generated by the recalibrated model can be used as an adjunct to clinical decision-making in Dutch mental health services.

www.nature.com/scientificreports/
output were specified beforehand. Upon entry of the 16 items, OxMIV estimates the probability of violent offending within 1 year. This estimate is expressed as a percentage, capped at 20%. A classification of 'low risk' (< 5%) or 'increased risk' (≥ 5%) is also given. In external validation, OxMIV showed excellent discrimination (AUC = 0.89, 95% confidence interval [CI] 0.85-0.93) and calibration 8. A recent study in Germany found moderate discrimination (AUC = 0.72) for the prediction of inpatient violence in a forensic setting 9. However, further studies are needed to validate OxMIV for different countries, care settings and forms of violent behaviour. The last are relevant because all violence (not solely incidents leading to arrest and conviction) has negative consequences, including treatment disruption, morbidity in victims, costs to services 10 and stigmatisation of patients 11, and may accurately be predicted by OxMIV. Therefore, we have evaluated the performance of OxMIV in predicting interpersonal violence over a 3-year period in a Dutch sample of general psychiatric patients with schizophrenia spectrum disorders. We also explored the feasibility of adjusting OxMIV for this population and outcome.
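The output format described above (a percentage capped at 20%, plus a two-level classification at the 5% threshold) can be illustrated with a minimal Python sketch. The function name `oxmiv_output` and the returned fields are hypothetical and chosen for illustration only. The sketch takes an already computed probability as input and does not reproduce the OxMIV model itself; how the tool displays capped values is an assumption.

```python
def oxmiv_output(probability, cap=0.20, threshold=0.05):
    """Format a predicted 1-year probability as described in the text.

    Hypothetical helper: the probability itself must come from the model,
    which is not reproduced here.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    # The reported estimate is capped at 20%.
    capped = min(probability, cap)
    # 'low risk' is < 5%; 'increased risk' is >= 5%.
    category = "increased risk" if probability >= threshold else "low risk"
    return {
        "percent": round(capped * 100, 1),
        "capped": probability > cap,
        "category": category,
    }
```

For example, a predicted probability of 0.35 would be reported as 20% and classified as 'increased risk', whereas 0.03 would be reported as 3% and classified as 'low risk'.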

Methods
Setting and participants. Data were collected as part of a larger research project, called Genetic Risk and Outcome of Psychosis (GROUP). The GROUP project was conducted by four university hospitals and affiliated mental healthcare centres (k = 36) in the Netherlands. These institutions are located in representative geographical areas of the country and provide access to psychiatric treatment in a variety of settings (e.g., psychiatric hospitals, outpatient clinics, residential care) to approximately 75% of the population. Throughout 2004, consecutive patients were invited to participate if they met the following criteria: (1) age between 16 and 50; (2) good command of the Dutch language; and (3) Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision (DSM-IV-TR) 12 diagnosis of schizophrenia or other non-affective psychotic disorder. Their parents and siblings were also invited. In total, 1013 patients, 907 parents and 1061 siblings enrolled. Assessments took place at the university hospitals, with a follow-up at 3 years. The protocol for the GROUP project was approved centrally by the ethics committee of the University Medical Centre Utrecht and implemented in accordance with relevant guidelines. All participants gave written informed consent before the first assessment.
Predictors and outcome. We selected variables whose definitions most closely matched those in the derivation study. The definitions of the predictors (Table S1) and the instruments used to measure them (Table S2) can be found in the supplement. The psychometric properties of the instruments and training of research personnel have been described elsewhere 13. The patients themselves provided information on all predictors, apart from 'parental drug or alcohol misuse' (parents), 'parental violence' (parents) and 'sibling violence' (siblings). For the predictors 'parental violence', 'sibling violence' and 'personal income', data were only collected at three of the four university hospitals. We excluded cannabis from the predictors 'previous drug misuse' and 'parental alcohol or drug misuse', as their prevalence would otherwise have been considerably higher than in the derivation sample (46% vs. 12% and 20% vs. 11%, respectively). Furthermore, we have previously shown that, in the current sample, the contribution of cannabis misuse to violence is small and nonsignificant when adjusted for background factors 14. A possible explanation for both observations is that, unlike in Sweden, cannabis use is not criminalised in the Netherlands.
The outcome was physical abuse of another person (i.e., interpersonal violence) during the 3 years of follow-up, ascertained from clinical case notes and patient interviews. The definition (physical abuse vs. violent offending) and time period (3 years vs. 1 year) thus differed from the outcome OxMIV is designed to predict.
Statistical analysis. We aimed to validate and, if necessary, update OxMIV for a different population (i.e., general psychiatric patients with schizophrenia spectrum disorders in the Netherlands) and outcome (i.e., interpersonal violence over 3 years). For model updating, we followed an incremental strategy suggested previously 15,16. This strategy involves up to three steps: (1) recalibrating the intercept; (2) recalibrating the intercept and slope; and (3) re-estimating one or more coefficients. Performance was assessed in terms of calibration, both 'in the large' (with the ratio between predicted and observed events across the sample and the Brier score) and at the individual level (through calibration plots), and discrimination. We calculated the following discrimination metrics: AUC, sensitivity, specificity, and positive (PPVs) and negative (NPVs) predictive values. Wilson's method 17 was used to construct 95% confidence intervals around the last four of these. Discrimination was also visualised with a receiver operating characteristic (ROC) curve. To aid interpretation, we presented results for the following probability thresholds: 5% (the default), 10%, 15% and 20% (the cap).
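The first two updating steps can be sketched as a logistic recalibration of the original predictions, logit(p_new) = a + b × logit(p_old), where step 1 estimates a with b fixed at 1 and step 2 estimates both. The Python sketch below is an illustration under stated assumptions, not the code used in this study: the function names are hypothetical, the fit uses a plain Newton-Raphson routine without safeguards, and the handling of imputed data is not reproduced.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(z):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-z))


def recalibrate(probs, outcomes, fit_slope=True, iterations=50):
    """Estimate (a, b) in logit(p_new) = a + b * logit(p_old).

    fit_slope=False corresponds to updating step 1 (intercept only, b = 1);
    fit_slope=True corresponds to step 2 (intercept and slope).
    """
    lp = [logit(p) for p in probs]  # linear predictor of the original model
    a, b = 0.0, 1.0                 # start from the original model
    for _ in range(iterations):
        mu = [expit(a + b * x) for x in lp]
        w = [m * (1.0 - m) for m in mu]
        g_a = sum(y - m for y, m in zip(outcomes, mu))  # score for a
        h_aa = sum(w)
        if not fit_slope:
            a += g_a / h_aa
            continue
        g_b = sum((y - m) * x for y, m, x in zip(outcomes, mu, lp))
        h_ab = sum(wi * x for wi, x in zip(w, lp))
        h_bb = sum(wi * x * x for wi, x in zip(w, lp))
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * delta = gradient.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


def apply_recalibration(p, a, b):
    """Recalibrated probability for an original prediction p."""
    return expit(a + b * logit(p))
```

After step 1, the sum of the recalibrated probabilities equals the number of observed events, which corrects calibration-in-the-large. Step 2 additionally rescales the spread of the predictions. Step 3 (re-estimating individual coefficients) requires the full predictor data and is not shown.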
Based on its distribution in the Dutch population, the predictor 'personal income' was converted into deciles. Patients with 'unstable' incomes belonged to deciles 1-3, and those with 'stable' incomes to deciles 4-10 18. The corresponding model coefficients were averaged. Since the predictor 'recent (substance) dependence treatment' was not available, we assigned the derivation sample proportion to all patients. The same was done with patients who had ever been admitted to a psychiatric hospital for the predictor 'currently an inpatient'. Others were assumed to be outpatients. For partially missing predictors, we used multiple imputation by chained equations. As recommended, the outcome was excluded from the imputation models 18. We averaged values across 20 imputations. Among the predictors measured at all sites, proportions of missing data were modest (≤ 13%) (Table S3). Insofar as data were missing due to local practice, they can reasonably be assumed to be missing at random. Missingness on most predictors correlated significantly (p < 0.05) with values on at least one other predictor (Table S4).
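The averaging step mentioned above can be sketched as a cell-wise mean over the completed datasets. This is a minimal Python illustration with a hypothetical function name and data layout (one list of rows per imputation). The chained-equations imputation itself is not reproduced, and it is an assumption that averaging was applied to each predictor value in this way.

```python
def average_imputations(imputed_datasets):
    """Cell-wise mean across m completed datasets of identical shape.

    Each dataset is a list of rows (one per patient); each row is a list of
    numeric predictor values. Observed values are identical across datasets
    and are therefore unchanged; imputed values are averaged.
    """
    m = len(imputed_datasets)
    if m == 0:
        raise ValueError("at least one imputed dataset is required")
    n_rows = len(imputed_datasets[0])
    n_cols = len(imputed_datasets[0][0]) if n_rows else 0
    for data in imputed_datasets:
        if len(data) != n_rows or any(len(row) != n_cols for row in data):
            raise ValueError("all imputed datasets must have the same shape")
    return [
        [sum(data[i][j] for data in imputed_datasets) / m for j in range(n_cols)]
        for i in range(n_rows)
    ]
```

Note that, for a binary predictor, the average over imputations is a value between 0 and 1 rather than a 0/1 indicator.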
Outcome data were available for 637 (63%) patients. As outcomes should not be imputed in external validation 19, these patients formed the sample used in the analyses. They typically had attained a higher level of education (χ²[2] = 12.18, p = 0.002) and were less likely to receive benefits (χ²[1] = 5.42, p = 0.020) than patients without outcome data. No significant differences were observed for any of the other predictors (Table S5).
Table 1 outlines summary statistics for the predictors in the patient sample (N = 637). Patients were mostly male (n = 493, 77%) and, on average, 27 (SD = 7) years old. Previous violence (n = 115, 21%) and drug misuse (n = 118, 20%) were each present in about one in five patients. Almost all patients (n = 565, 95%) had taken antipsychotics in the past 6 months. Fifty-nine (9%) patients physically assaulted another person during the 3 years after baseline.

Results
Discrimination, as measured by the AUC, was moderate at 0.67 (95% CI 0.61-0.73). The ROC curve is shown in Fig. 1. The original model had low sensitivity (25%) and high specificity (90%) at the default threshold of 5%. The same pattern was observed for the PPV (21%) and NPV (92%) (Table 2). Calibration-in-the-large was satisfactory, with a ratio between predicted and observed events of 1.2 and a Brier score of 0.09 (Table 3). At the individual level, however, risks were systematically underestimated. This was remedied by recalibration of the intercept and slope (updating step 2) (Fig. 2 and, for the model formula, Table S6). Re-estimation of coefficients (updating step 3) was therefore not necessary. When using a threshold of 10%, the model with the recalibrated intercept and slope also offered a better balance between sensitivity (47%) and specificity (73%) than the original model (Table 2).
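The performance measures reported above can be computed from a vector of predicted probabilities and a vector of observed outcomes (1 = event, 0 = no event). The Python sketch below is a minimal illustration with hypothetical function names and is not the code used in this study. It assumes that a prediction at or above the threshold counts as a positive classification, and it uses the standard Wilson score interval with z = 1.96.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    return centre - half, centre + half


def proportion_with_ci(successes, n):
    """Return (estimate, (lower, upper)), or None if the denominator is 0."""
    if n == 0:
        return None
    return successes / n, wilson_interval(successes, n)


def threshold_metrics(probs, outcomes, threshold):
    """Sensitivity, specificity, PPV and NPV at a probability threshold."""
    pairs = list(zip(probs, outcomes))
    tp = sum(1 for p, y in pairs if p >= threshold and y == 1)
    fp = sum(1 for p, y in pairs if p >= threshold and y == 0)
    fn = sum(1 for p, y in pairs if p < threshold and y == 1)
    tn = sum(1 for p, y in pairs if p < threshold and y == 0)
    return {
        "sensitivity": proportion_with_ci(tp, tp + fn),
        "specificity": proportion_with_ci(tn, tn + fp),
        "ppv": proportion_with_ci(tp, tp + fp),
        "npv": proportion_with_ci(tn, tn + fn),
    }


def expected_observed_ratio(probs, outcomes):
    """Ratio between predicted and observed events (calibration-in-the-large)."""
    return sum(probs) / sum(outcomes)


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def auc(probs, outcomes):
    """Probability that a random event is ranked above a random non-event."""
    events = [p for p, y in zip(probs, outcomes) if y == 1]
    non_events = [p for p, y in zip(probs, outcomes) if y == 0]
    score = sum(
        1.0 if e > n else 0.5 if e == n else 0.0
        for e in events
        for n in non_events
    )
    return score / (len(events) * len(non_events))
```

A ratio between predicted and observed events of 1 indicates that the total number of predicted events matches the number observed, and lower Brier scores indicate better overall accuracy of the predicted probabilities.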

Discussion
In a Dutch sample of 637 general psychiatric patients with schizophrenia spectrum disorders, we evaluated the performance of a newly developed risk assessment tool (OxMIV) in predicting interpersonal violence over 3 years. We found OxMIV performed moderately well, especially considering that it is designed to predict a different outcome (i.e., violent offending within 1 year). The broader definition of the outcome and longer follow-up period may partly explain why we obtained a lower AUC (0.67, 95% CI 0.61-0.73) than previous validation studies of OxMIV 7,8. At the same time, it is comparable to AUCs reported by validation studies of other more resource-intensive tools 4, , and the current study is an external validation using a different clinically informative outcome. Furthermore, unlike other tools, where calibration has not been reported, OxMIV demonstrated good calibration in the large. In addition, we showed that the performance of OxMIV can be optimised with model updating: calibration at the individual level was adequate after recalibration of the intercept and slope. This is important methodologically as it provides an approach to test the performance of prediction models and risk assessment tools for a different outcome than in their derivation/development studies.
Strengths of this study include the representativeness of the sample, use of multiple data sources for the outcome, prespecification of the methods, and presentation of a wide range of performance measures. However, there are some limitations. First, most predictors were defined differently than in the derivation study (Table S1). The distribution of predictors differed as well. Of note, the proportion of men was larger (77% vs. 49%), mean age lower (27 years vs. 44 years) and recent treatment with antipsychotics more common (95% vs. 54%) in the current sample (Table 1) than in the derivation sample (Table S7). These differences may have hampered the performance of OxMIV. At the same time, they reflect the profile of patients presenting at mental health services in the Netherlands, where information is not always available to align predictors exactly with those in the derivation study. External validations in patient groups with different baseline characteristics also provide evidence on whether a tool's performance can be maintained in real-world clinical practice. Another limitation was the relatively small number of patients with the outcome. It has been suggested that ≥ 100 events are required to reliably measure predictive accuracy 21. For this reason, the findings may be considered preliminary rather than definitive. Finally, missing data may have introduced bias. However, multiple imputation is likely to have reduced this bias in the predictors 22, and patients with and without outcome data were similar on nearly all predictors.
Table 2. Discrimination metrics for the original and recalibrated models. Data are given as percentages, with 95% confidence intervals between parentheses. PPV positive predictive value, NPV negative predictive value. *The number of predicted events was too low (k < 5) to reliably calculate a confidence interval.
The findings suggest that OxMIV is suitable for predicting violent behaviour in Dutch patients with schizophrenia spectrum disorders. Clinicians are advised to use the probability scores generated by the model with the recalibrated intercept and slope, as it had the best individual-level calibration and discrimination was no worse at the chosen thresholds. This revised model can be accessed on the OxRisk website (https://oxrisk.com). The original model can be used to screen patients for low risk of violence (< 5%), as facilitated by the high NPV (92%). The low PPV (21%) suggests that patients should be assessed further if classified as increased risk (≥ 5%).
There remains a need for validation studies in which variable definitions more closely match those in the derivation study and the number of events is higher. Comparing the performance of OxMIV against other tools or investigating its clinical feasibility could also be considered.

Data availability
Supporting data for this study are not available, as the participants did not agree for these to be shared publicly.