Transdiagnostic individualized clinically-based risk calculator for the automatic detection of individuals at-risk and the prediction of psychosis: external replication in 2,430,333 US patients

The real-world impact of psychosis prevention is reliant on effective strategies for identifying individuals at risk. A transdiagnostic, individualized, clinically-based risk calculator to improve detection has been developed and externally validated twice in two different UK healthcare trusts with convincing results. The prognostic performance of this risk calculator outside the UK is unknown. All individuals who accessed primary or secondary health care services belonging to the IBM® MarketScan® Commercial Database between January 2015 and December 2017, and received a first ICD-10 index diagnosis of a nonorganic/nonpsychotic mental disorder, were included. Following the risk calculator's specification, age, gender, ethnicity, age-by-gender, and ICD-10 cluster diagnosis at the index date were used to predict development of any ICD-10 nonorganic psychotic disorder. Because patient-level ethnicity data were not available, city-level ethnicity proportions were used as a proxy. The study included 2,430,333 patients with a mean follow-up of 15.36 months and a cumulative incidence of psychosis at two years of 1.43%. There were profound differences compared to the original UK development database in terms of case-mix, psychosis incidence, distribution of baseline predictors (ICD-10 cluster diagnoses), availability of patient-level ethnicity data, follow-up time and availability of specialized clinical services for at-risk individuals. Despite these important differences, the model retained accuracy significantly above chance (Harrell's C = 0.676, 95% CI: 0.672-0.679). To date, this is the largest international external replication of an individualized prognostic model in the field of psychiatry. This risk calculator is transportable on an international scale to improve the automatic detection of individuals at risk of psychosis.


Introduction
Under standard care, clinical outcomes in psychosis are suboptimal; prevention and early intervention are essential to improve outcomes of this disorder 1 . Primary indicated prevention of psychosis revolves around the ability to detect, assess and care for individuals at risk of psychosis. The Clinical High Risk state for Psychosis (CHR-P) 2 includes individuals who present with attenuated psychotic symptoms, impaired functioning 3 and help-seeking behavior. Twenty percent of these individuals develop a psychotic disorder within two years 4 . Primary indicated prevention of psychosis through specialized CHR-P clinical services 5 is uniquely positioned to alter the course of the disorder and improve outcomes 1 .
The impact of the CHR-P approach is contingent on effective identification of individuals at risk of developing psychosis. Because of complex interactions between help-seeking behaviors, recruitment strategies and referral pathways 6 , detection of at-risk individuals is currently inefficient: only 5% 7 to 12% 8 of first-episode cases are identified by specialized or youth mental health CHR-P services. Moreover, these services are only available to a limited number of individuals, with only 48 services mapped worldwide 9 . To overcome these problems, a transdiagnostic, individualized, clinically-based risk calculator has been developed in the South London and Maudsley (SLaM) NHS Trust boroughs of Lambeth and Southwark (n = 33,820) 7 . This prognostic model uses core predictors that were selected on a priori meta-analytical knowledge 10 (age, gender, ethnicity, primary index diagnosis and age*gender interaction), and that are routinely collected in clinical care, to forecast individual level of psychosis risk up to six years. The model leverages electronic health record (EHR) data, therefore allowing for the automatic detection of at-risk individuals. It has shown adequate performance in a first external validation in the SLaM boroughs of Lewisham and Croydon (n = 54,716, Harrell's C = 0.79) 7 and in a second external validation in the Camden and Islington NHS Foundation Trust (C&I; n = 13,702, Harrell's C = 0.73) 11 , where Harrell's C is the probability that a randomly selected patient who experienced the event has a higher score than a patient who did not. This prognostic model is also currently being piloted for real-world implementation in clinical routine in the UK 12 .
Despite these promising results, it is not yet clear whether this prognostic model is transportable to international healthcare settings. External validation studies are scarce in psychiatry, undermining the translational impact of research discoveries. This study aims to investigate the international external validity of the original transdiagnostic, clinically-based, individualized risk calculator using large scale EHRs from the US.

Design
Retrospective cohort study using Electronic Health Records (EHRs) conducted according to the REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) statement 13 (see checklist reported in Table S1).

Data source
The IBM® MarketScan® Commercial Database (hereafter Commercial) contains data from approximately 65 million people across multiple geographically dispersed US states who are covered by employer-sponsored health insurance plans. The database includes all medical and pharmaceutical claims for these individuals and their dependents (Methods S1). It provides contemporaneous, 'real-world' data on both routine primary and secondary mental healthcare.

Study population
All patients accessing primary or secondary healthcare between 1 January 2015 and 31 December 2017 who received an ICD-10 primary index diagnosis of a nonorganic and nonpsychotic mental disorder were included (Methods S2). To ensure correct diagnosis classification, a lookback period of six months was applied to each patient (Methods S3).

Follow-up
Follow-up started at the time of the ICD-10 index diagnosis and ended when a transition to psychosis was recorded, or when the patient dropped out of the EHR (as documented by the last entry on Commercial).

Model specifications
The original transdiagnostic, clinically-based, individualized risk calculator was developed in a retrospective cohort study leveraging EHRs of the SLaM boroughs of Lambeth and Southwark, first validated in the SLaM boroughs of Croydon and Lewisham 7 and then validated in C&I 11 in the UK. In summary, a Cox proportional hazards model was used to predict the risk of developing any psychotic disorder over time (see Methods S2 for definition) as the primary outcome of interest. The predictors included age (at the time of the index diagnosis), gender, age*gender, self-assigned ethnicity, and cluster index diagnosis (ICD-10 diagnostic spectra: acute and transient psychotic disorders (ATPD), substance use disorders, bipolar mood disorders, nonbipolar mood disorders, anxiety disorders, personality disorders, developmental disorders, childhood/adolescence onset disorders, physiological syndromes, mental retardation). Self-assigned ethnicity and index diagnoses were operationalized as indicated in Tables S2 and S3. A weighted sum of the covariates, using the weights from the Cox model, yielded the Prognostic Index (PI). From this, the risk of the individual developing a psychotic disorder within a given time period (between one and six years) could be calculated 14 .
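As an illustration of how a PI translates into an individual risk estimate, the following sketch applies the standard Cox formulation, risk(t) = 1 - S0(t)^exp(PI). All coefficient values, covariate names and the baseline survival below are hypothetical placeholders, not the published model weights:

```python
import math

# Hypothetical coefficients for illustration only; the real model weights
# are reported in the original paper's supplementary tables.
coefs = {
    "age": 0.02,
    "gender_male": 0.30,
    "age_x_male": -0.01,   # age*gender interaction
    "dx_bipolar": 1.10,    # one dummy per ICD-10 cluster diagnosis
}

def prognostic_index(covariates, coefs):
    """PI = weighted sum of covariates (the Cox model's linear predictor)."""
    return sum(coefs[k] * covariates.get(k, 0.0) for k in coefs)

def risk_at_t(pi, baseline_survival_t):
    """Cox model risk at time t: 1 - S0(t) ** exp(PI)."""
    return 1.0 - baseline_survival_t ** math.exp(pi)

patient = {"age": 25, "gender_male": 1, "age_x_male": 25, "dx_bipolar": 1}
pi = prognostic_index(patient, coefs)
two_year_risk = risk_at_t(pi, baseline_survival_t=0.995)  # assumed S0(2 years)
```

With these placeholder numbers the patient's two-year risk works out to roughly 2.6%, illustrating how a higher PI inflates risk multiplicatively on the baseline survival.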
Since this model was originally developed on a retrospective cohort 7 , it excluded cases with an onset of psychosis within the first three months to minimize the short-term diagnostic instability of baseline ICD-10 index diagnoses. However, during the subsequent implementation study 12,15 an updated version of the model was adapted for prospective use (i.e., not excluding transitions occurring in the first three months), demonstrating similar prognostic performance (Table S4). Furthermore, a lookback period was additionally used in this study (see Methods S3) to minimize the risk of misclassification of the index diagnosis date. The specifications of the present model are fully detailed in Table S5. A main difference compared to the SLaM dataset was that there were no patient-level ethnicity data in Commercial. To mitigate this issue, aggregate ethnicity coefficients were generated for patients who had Metropolitan Statistical Area (MSA) and state-level ethnicity data, using Integrated Public Use Microdata Series (IPUMS) census data (www.ipums.org). The geographical information from IPUMS was matched with the geographical data available for each patient in the study population from Commercial, assigning each patient a vector of ethnic weights for each level of the ethnicity predictor. For example, for a patient matched to New York (NY) state and the Ithaca, NY MSA and diagnosed in 2016, the proportion of White individuals in the MSA in the year of the index diagnosis was 0.82, of Black individuals 0.03, of Asian individuals 0.10, of Mixed individuals 0.03, and of Other 0.01. For comparability purposes we also reported the performance of the original model 7 (i) without ethnicity as a predictor and (ii) with computed aggregate ethnicity using census data 16 (Table S6).
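In code, this proxy step amounts to a proportion-weighted sum of the per-level coefficients. The sketch below uses the MSA proportions from the Ithaca, NY example; the ethnicity coefficients are hypothetical placeholders, not the published model weights:

```python
# MSA-level ethnicity proportions from the Ithaca, NY example above.
msa_props = {"White": 0.82, "Black": 0.03, "Asian": 0.10,
             "Mixed": 0.03, "Other": 0.01}

# Hypothetical per-level Cox coefficients (White as the reference category).
eth_coefs = {"White": 0.0, "Black": 0.45, "Asian": 0.20,
             "Mixed": 0.30, "Other": 0.25}

# Aggregate ethnicity contribution to this patient's PI:
# a proportion-weighted sum of the level coefficients.
aggregate_coef = sum(msa_props[k] * eth_coefs[k] for k in msa_props)
```

The resulting single aggregate coefficient then enters the PI in place of the patient-level ethnicity dummy that the original SLaM model used.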

Statistical analysis
Model external validation followed the guidelines of Royston and Altman 17 , Steyerberg et al. 18 , and the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) 19 . The study protocol is uploaded in the Research Registry database (www.researchregistry.com, researchregistry5130).
For a general overview of prognostic modeling methods, including external validation procedures, see our recent review 20 . To interpret the performance of a risk model in the context of external validation, it is essential to first quantify the similarities between development and validation samples 21 . External validation only assesses model transportability if the validation sample has a different case-mix; the greater the difference in case-mix, the stronger the evidence that the model generalizes to other populations. Thus, we investigated the extent to which the SLaM and Commercial datasets comprised patients with sets of prognostically relevant predictors in common, comparable time-to-event outcomes with roughly similar follow-up times, and the same clinical condition observed in similar settings 22 .
As a first step, we described the Commercial patient population, including the configuration of clinical services, and compared it with SLaM. Baseline clinical and sociodemographic characteristics of the sample (including missing data) were described using means and SDs for continuous variables, and absolute and relative frequencies for categorical variables 22 .
In a second step, we visually compared the two Kaplan-Meier failure functions, showing the number of patients developing a psychotic disorder, as well as those still at risk, over time. The overall cumulative risk of psychosis onset in Commercial was visualized with the Kaplan-Meier failure function (1-survival) 23 and Greenwood 95% confidence intervals (CIs) 24 . Curves that vary noticeably may indicate systematic differences between the study populations 22 .
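For readers unfamiliar with these quantities, the failure function and its Greenwood intervals can be computed as in the following minimal from-scratch sketch (an illustration only; the study itself used the R survival package):

```python
import math

def km_failure(times, events):
    """Kaplan-Meier failure function 1 - S(t) with Greenwood 95% CIs.
    times: follow-up times; events: 1 = psychosis onset, 0 = censored."""
    pairs = sorted(zip(times, events))
    n_at_risk = len(pairs)
    surv, var_sum = 1.0, 0.0
    curve = []  # (time, cumulative incidence, 95% CI lower, 95% CI upper)
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        d = n = 0
        while i < len(pairs) and pairs[i][0] == t:
            d += pairs[i][1]  # events at time t
            n += 1            # subjects leaving the risk set at time t
            i += 1
        if d > 0:
            surv *= 1 - d / n_at_risk
            # Greenwood variance term for S(t)
            var_sum += d / (n_at_risk * (n_at_risk - d))
            se = surv * math.sqrt(var_sum)
            fail = 1 - surv
            curve.append((t, fail, max(0.0, fail - 1.96 * se),
                          min(1.0, fail + 1.96 * se)))
        n_at_risk -= n  # both events and censorings leave the risk set
    return curve

curve = km_failure(times=[1, 2, 3, 4], events=[1, 0, 1, 0])
```

In this toy example the cumulative incidence steps from 0.25 at the first event to 0.625 at the second, because the censored subject at time 2 shrinks the risk set without counting as a failure.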
In a third step, we reported the spread (SD) and mean of the PI in the two datasets. An increased (or decreased) variability of the PI would indicate more (or less) heterogeneity of case-mix between the two datasets, and therefore, of their overarching target populations 21 . Differences in the mean PI indicate differences in overall (predicted) outcome frequency, reflecting case-mix severity between the two datasets (and revealing the model's calibration-in-the-large in the Commercial database) 21 . Continuous variables were tested with independent t-tests.
We then performed the formal external validation, assessing the prognostic accuracy of the model in the Commercial database 22 . Accordingly, the regression coefficients obtained from our model developed in SLaM (see Table S6) were applied to each case in the external Commercial database, to generate the PI in the Commercial database. In the case of ethnicity, the aggregate ethnic weights were multiplied by their respective regression coefficients to provide an aggregate coefficient for that patient. The sum of an individual's regression coefficients resulted in an individualized PI. The greater the PI, the higher the risk of the individual developing a psychotic disorder.
Since we were interested in discrimination, the primary outcome measure for this study was the external model performance (accurate predictions discriminate between those with and those without the outcome) 18 , defined with the Harrell's C-index 17 . Harrell's C is a recommended measure for external validation of Cox models according to established guidelines 17 . Harrell's C is the probability that for a random pair of "case" and "control," the predicted risk of an event (PI) is higher for the "case" 25 . In addition, we estimated the overall model performance 18 using the Brier score (average mean squared difference between predicted probabilities and actual outcomes, which also captures calibration and discrimination aspects) 18 . Calibration (agreement between observed outcomes and predictions) 18 was assessed using the regression slope of the PI 17,18 .
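These two metrics can be illustrated with minimal from-scratch implementations of their standard definitions (a sketch only; the study used established R implementations):

```python
def harrells_c(times, events, pi):
    """Harrell's C: among comparable pairs, the fraction where the subject
    who fails earlier has the higher prognostic index (ties count 0.5)."""
    concordant = comparable = 0.0
    for i in range(len(times)):
        if not events[i]:
            continue  # a pair is usable only if the earlier time is an event
        for j in range(len(times)):
            if times[j] > times[i] or (times[j] == times[i] and not events[j]):
                comparable += 1
                if pi[i] > pi[j]:
                    concordant += 1.0
                elif pi[i] == pi[j]:
                    concordant += 0.5
    return concordant / comparable

def brier_score(observed, predicted):
    """Mean squared difference between outcomes (0/1) and predicted risks."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

# Perfectly concordant toy data: earlier events carry higher PIs.
c = harrells_c(times=[2, 4, 6], events=[1, 1, 0], pi=[3.0, 2.0, 1.0])
```

A C of 1.0 indicates perfect ranking, 0.5 chance-level discrimination; the Brier score is 0 for perfect probabilistic predictions and grows as predictions miscalibrate or misdiscriminate.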
As a further exploratory step, we updated the model using the regression slope on the PI as a shrinkage factor for recalibration, in line with the Royston et al. guidelines 22 .
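This updating step amounts to multiplying each PI by the estimated calibration slope before recomputing risks, as in this brief sketch (the example PIs are arbitrary; 0.93 is the slope observed in Commercial):

```python
def recalibrate(pi_values, slope):
    """Model updating: shrink each prognostic index by the calibration slope
    estimated in the validation sample. A slope below 1 pulls predictions
    toward the mean, correcting PIs that are too extreme in the new setting."""
    return [slope * pi for pi in pi_values]

# Arbitrary illustrative PIs, shrunk by the validation-sample slope.
shrunk = recalibrate([1.2, -0.4, 0.8], slope=0.93)
```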
All analyses were conducted in R version 3.3.2 26 , using the survival package, and significance was set at P < .05.

Commercial sample characteristics
A total of 3,828,791 patients accessing primary or secondary healthcare between January 2015 and December 2017 received an ICD-10 primary index diagnosis of a nonorganic and nonpsychotic mental disorder. Of these, 2,430,333 (63.5%) could be matched with ethnicity data and were included in the analysis, as detailed in the study flow diagram (Fig. 1). Patients in Commercial included in this study had an average age of 34.2 years (95% CI: 34.19-34.23), 59% were female, and White ethnicity was particularly common in patients' MSAs (79%). The most frequent index diagnosis was anxiety disorders (45%). Full sociodemographic information is provided in Table 1.

Differences between the Commercial and SLaM databases

Sociodemographic and service configuration differences
The most important difference is that while the SLaM database contains data on individuals accessing publicly funded secondary mental healthcare, Commercial is limited to individuals covered by employer-sponsored health insurance plans. Compared to the full population, the incidence of psychosis may be lower in those covered by private insurance, such as in the Commercial dataset. Similar to the C&I Trust that was the basis of the second external replication study, Commercial did not include CHR-P services; therefore, there were no CHR-P diagnoses. Additional differences are that Commercial incorporates both primary and secondary healthcare data, compared to solely secondary healthcare in SLaM and C&I, as well as the aggregation of ethnicity data as discussed in the "Methods" section. The average patient age in Commercial was 0.2 years lower than in SLaM (p = 0.03). Compared with SLaM, there was a lower incidence of ATPD, substance use disorders, bipolar mood disorders, personality disorders, developmental disorders, physiological syndromes and mental retardation in the Commercial dataset. Conversely, there were higher rates of nonbipolar mood disorders, anxiety disorders and childhood/adolescence onset disorders. Finally, there were fewer males in Commercial than in SLaM (Table 1).
Cumulative risk of psychosis in Commercial compared with the SLaM derivation dataset

External validation in the commercial database
The comparative model performance in the SLaM dataset using aggregate ethnicity data was 0.761 (Table S5). In the Commercial dataset, the model predicted significantly better than chance, with a Harrell's C of 0.676 (95% CI: 0.672-0.679; Harrell's C in SLaM = 0.79). The two-year Brier score was 0.013 (two-year Brier score in SLaM = 0.012). The model did not show major calibration issues, with a regression slope close to 1: 0.93, 95% CI: 0.91-0.94 (P < .001).
Updating the model optimized calibration (regression slope = 1) but conferred no substantial improvement in model performance (full model specifications are appended in Table S6).

Discussion
This is the largest ever replication study of a risk prediction model in psychiatry. The study demonstrates that the transdiagnostic, individualized risk calculator was able to detect individuals at risk of psychosis in an international setting with a prognostic discriminative performance that was significantly above chance.
To our knowledge, this is the largest ever external replication study of a risk calculator not only in early psychosis but also in clinical psychiatry. Importantly, this study included 24,941 events (transitions to psychosis), over one hundred times the minimum recommended number of 100 events required to produce accurate estimates of external prognostic accuracy 20,27 .
The previous largest external validation study of this kind was our first external replication of this calculator conducted in SLaM (n = 33,820) 7 , followed by a validation study of a calculator that predicts major depressive disorder (n = 29,621) 28 and by another calculator that predicts risk of violent crime in patients with severe mental illness (n = 16,387) 29 , all smaller than our sample size of 2,430,333. This is a substantial achievement given that prognostic modeling in psychiatry is affected by a severe scarcity of replication efforts 30 , to the point that replication has become equally as important as, or even more important than, discovery 31 . A systematic review and meta-analysis of clinical prediction models for predicting the onset of psychosis in CHR-P people uncovered 91 studies, none of which performed a true external validation of an existing model 32 . This is the only transdiagnostic clinical prediction model to be externally validated in three different populations (Lewisham & Croydon SLaM NHS Trust, C&I and now Commercial); another risk prediction model for use in CHR-P patients has also received three independent validations 33-35 . A full list of individualized risk prediction models that have been externally replicated in the field of early psychosis is detailed in Table 2.
An additional strength of this study is that it provides further empirical support for the use of EHRs in the context of precision psychiatry. Transporting risk prediction models across different EHRs representing heterogeneous clinical settings is complex because they reflect underlying differences in the patient population. A first empirical challenge is the availability of predictors and outcomes. The vast majority of predictors were available in the Commercial database, with the exception of patient-level ethnicity; aggregate ethnicity variables were computed to compensate for this. There was also a shorter follow-up time in Commercial compared to SLaM, as ICD-10 was only integrated into United States healthcare on 1 October 2015. Use of ICD-9 diagnoses was considered to extend follow-up, but converting diagnostic clusters to ICD-9 proved inexact and therefore inappropriate. A second challenge is to quantify the differences between development and validation databases to interpret the performance of a risk model in the context of external validation 21 . For example, compared with SLaM, where the model was developed, there were apparent differences in sociodemographic characteristics in Commercial (fewer males, fewer patients of Black ethnicity and a different frequency of ICD diagnoses, reflected by a smaller spread of the PI) and time to event (shorter). Furthermore, similar to our second replication in C&I 11 , there were no CHR-P services in Commercial and, therefore, no CHR-P designations. However, as ATPD diagnoses are typically not made in CHR-P or early intervention services 36 , the number of ATPD diagnoses in Commercial is unlikely to be affected by this difference in service configuration. Because of this case-mix, the incidence of psychosis in Commercial was about half that in SLaM (1.43% vs. 2.57% at two years, reflected by a lower mean value of the PI).
The most important difference is that, while previous replications were performed in data collected from publicly funded secondary mental healthcare alone, the Commercial database comprised both primary and secondary healthcare data from commercially insured patients. Given such relevant differences, it was expected that the risk calculator could not be easily transported to the Commercial setting and that it would achieve lower prognostic performance and calibration than observed in the first two external validations.
Despite these differences in clinical setting and populations, the overall prognostic accuracy of the transdiagnostic, clinically-based risk calculator remained significantly above chance. As expected, the level of prognostic performance (Harrell's C = 0.68) was suboptimal and lower than in our previous external validation (Harrell's C = 0.73) 11 . Yet, this level of accuracy is comparable to that of structural neuroimaging methods (i.e., gray matter volume) for detecting a first episode of psychosis at the individual level, with accuracies ranging from 0.5 to 0.63 37 . A recent machine-learning study externally validated a risk calculator to predict treatment outcome in depression in 151 patients. The study reported a one-year prognostic accuracy of 0.59 and concluded that, if implemented at scale, even performance only modestly above chance can be considered clinically useful 38 . Given that our risk calculator has been developed on real-world EHR data, it offers the potential for automatically screening large mental health populations. Psychiatry is undergoing a digital revolution 39 , and there is an ongoing expansion of EHR adoption worldwide. Moreover, this risk calculator was developed with a clear vision of future implementation as decision support in clinical routine and is currently being piloted in this capacity 12,15 . For example, it uses simple predictors that can easily be understood by clinicians, as compared to complex black-box machine-learning-derived algorithms 40 . Furthermore, harnessing data from EHRs is cheaper than other methods such as patient recruitment, because most of the predictors are available as part of clinical routine. There are no competing algorithms (CHR-P instruments are not usable for screening purposes) 41 to screen the at-risk population at scale.
Other risk prediction tools in early psychosis have shown promise, however they predominantly rely on clinical symptom scores 42,43 , which means they are more financially and labor intensive than this tool; potential for automation is therefore limited. Moreover, these tools are focused on identifying transition to psychosis and are reliant on prior identification of CHR-P, whereas our tool is able to predict psychosis risk transdiagnostically outside of this designation. Thus, there is potential benefit in utilizing this risk calculator to screen for psychosis risk in large numbers.
There is scope for optimization of the current risk calculator through stepped risk stratification and model refinement. As a first step, this risk calculator could be deployed in a screening pathway where an individual's risk is calculated upon entry into secondary mental health services. Individuals flagged by our risk calculator as being at risk for psychosis would progress to a more thorough clinical CHR-P assessment in the context of a staged sequential risk assessment 44,45 . This could supplement other detection strategies targeting the general population, such as the Youth Mental Risk and Resilience study (YouR-Study) 46 , which provided the first evidence of digital detection tools improving identification of psychosis in the general population. A potential further step would be combining the risk calculator with additional information (environmental, genetic or biomarkers) to improve prognostic accuracy further 44,45,47 , refine estimates of individuals' risk and stratify them accordingly. This is in keeping with the current clinical staging model of early psychosis, which aims to improve preventative care and reduce the duration of untreated psychosis to improve outcomes 1 . In addition to its clinical utility, this risk calculator could improve CHR-P research by aiding recruitment for much needed large-scale international collaborations in the vein of the HARMONY project, incorporating NAPLS (https://campuspress.yale.edu/napls/), PRONIA (https://www.pronia.eu/) and PSYSCAN (http://psyscan.eu), and the proposed 26-site ProNET cohort study. Furthermore, this prognostic model can be refined. In companion studies, we have tested whether using machine-learning methods and expanding the range of predictors 48 , or redefining them 49 , might improve the prognostic accuracy of this risk calculator.
The limitations of this study are largely inherited from the original study. We did not employ structured psychometric interviews to ascertain the type of emerging psychotic diagnoses at follow-up. However, we predicted psychotic disorders rather than specific ICD-10 diagnoses, a category which has good prognostic stability 50 . Therefore, while the psychotic diagnoses in our analyses are high in ecological validity (i.e., they represent real-world clinical practice), they have not been subjected to formal validation with research-based criteria. However, the use of structured diagnostic interviews can lead to selection biases, decreasing the transportability of models 51 . There is also meta-analytical evidence indicating that, within psychotic disorders, administrative data recorded in clinical registers are generally predictive of true validated diagnoses 52 .
Other limitations were inherent in the Commercial database, mostly due to the lack of patient-level ethnicity data and a short follow-up time. These two issues reduced the prognostic performance of the model a priori, in particular considering that risk for psychosis may well extend beyond two years 53 . It is therefore possible that the prognostic performance of this model in the longer term may actually be better than the performance reported here. A further limitation is that the study team for this replication is not completely independent from the team who completed the original study 54 , which is particularly relevant given the support of a pharmaceutical company. However, Lundbeck has no financial interests nor patents on this project. As this study involved a large commercial dataset and a refined version of the model, it was logistically impossible to conduct this research independently from the original team. To mitigate this overlap, we adhered to the Royston 22 , RECORD 13 , and TRIPOD 19 guidelines to ensure transparency. Finally, although we welcome further external validation studies, it must be noted that even strong replication does not automatically imply the potential for successful adoption in clinical or public health practice. Ideally, randomized clinical trials or economic modeling would be needed to assess whether our risk calculator effectively improves patient outcomes.

Conclusion
The largest international external replication of an individualized prognostic model in psychiatry confirms that precision medicine in this discipline is feasible even at large scale. The transdiagnostic, individualized, clinically-based risk calculator is potentially transportable on an international scale to improve the automatic detection of individuals at risk of psychosis. Further research should refine the model and test the benefit of implementing this risk prediction model in clinical routine.