Prediction of quality of life in schizophrenia using machine learning models on data from Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) schizophrenia trial

While research focus remains mainly on psychotic symptoms, it is questionable whether we are placing enough emphasis on improving the quality of life (QoL) of schizophrenia patients. To date, the ability to predict QoL has remained limited. Therefore, this study aimed to accurately predict QoL in schizophrenia using supervised learning methods. The authors report findings from participants of a large randomized, double-blind clinical trial for schizophrenia treatment. Potential predictors of QoL included all available and non-redundant variables from the dataset. By optimizing parameters, three linear LASSO regressions were calculated (N = 697, 692, and 786), including 44, 47, and 41 variables, with adjusted R-squared values ranging from 0.31 to 0.36. Best predictors included social and emotion-related symptoms, neurocognition (processing speed), education, female gender, treatment attitudes, and mental, emotional, and physical health. These results demonstrate that machine learning is an excellent predictive tool to process clinical data. It appears that the patient’s perception of their treatment has an important impact on patients’ QoL and that interventions should consider this aspect. Trial registration: ClinicalTrials.gov Identifier: NCT00014001.


INTRODUCTION
Schizophrenia is a chronic and severe mental disorder that can be disabling 1 . This disorder can greatly affect the quality of life (QoL) 2,3 , which is defined by the World Health Organization as an individual's perception of their position in life in the context of the culture and value systems in which they live and in relation to their goals, expectations, standards, and concerns 4 .
A shift has recently been observed in the objectives of schizophrenia treatment. While the goal was once to reduce symptoms only, this has changed to focus more on recovery through improving QoL and functioning 5,6 . Although complete recovery is often not possible for these patients, they can still recover in some way. This notably involves optimizing their wellbeing and functioning, which are key components of QoL. Over the past few years, factors that may promote better QoL have been identified in the literature, with mixed results. Some predictors that recur frequently are types of psychiatric or psychotic symptoms, but exactly which type best predicts QoL remains controversial 2,7-10 . These symptoms can be reduced with medication; however, even though response and adherence to antipsychotics can improve QoL 3 , some medication side effects such as weight gain 11 and sexual dysfunction 12 have been associated with a worsened outcome. Other predictors of higher QoL were also identified, e.g., better cognition and an older age at disorder onset 13-15 . On the other hand, stigma-related feelings and comorbid diagnoses predicted a poorer outcome regarding QoL 14 . In general, it seems that the highly heterogeneous factors presented in the current literature largely depend on the angle from which the authors choose to approach the question. Another issue is that the design is often cross-sectional, which does not allow for longitudinal predictions. Identifying the most important and essential factors could help identify which patients are better able to recover, and ultimately optimize every patient's recovery.
Several researchers have used multivariate models to predict the quality of life. Mohamed et al. 13 created a model excluding variables that may be redundant with QoL (e.g., functioning) using longitudinal data from the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) schizophrenia study. In doing so, they were able to explain 22% of the variance in total QoL with positive and negative symptoms, neurocognitive, and sociodemographic variables (age, race, ethnicity, gender, and time). In such studies, the explained variance is generally low 2,14 , possibly because authors did not include some factors that deviate from their research question and that may play a major role in QoL (e.g., physical health, patients' self-reported satisfaction, and medication adherence). With the emergence of supervised machine learning, it is now possible to reach an optimal model including the best predictors among fairly large datasets, and without a priori human assumptions about how variables are combined 16,17 . This new approach could thereby provide a better understanding of the various factors that influence QoL in individuals with schizophrenia, just as it successfully predicted other outcomes such as relapses 18,19 .
The aim of the current study was to identify, using machine learning, factors that predict QoL among people with schizophrenia. To do so, we computed important variables from the CATIE study, a large naturalistic clinical trial conducted between 2000 and 2004 in the United States.

RESULTS
Sample characteristics
Due to attrition and missing data, only 919 of the 952 participants with a longitudinal follow-up were included in a model (N = 697, 692, and 786 in models 1-3, respectively). Of these, 670 were male (73%) and the average age was 41.1 years (SD = 11.0; range: 18-67). One-quarter did not complete high school (25%), a minority was employed full-time at the time of the study (6%), and only a few were married (11%). Most of the sample had no comorbid psychiatric condition (60%). Detailed baseline sample characteristics are presented in Table 1. At the baseline, 6-month, and 12-month follow-up visits, the QoL total score was on average 2.8 (SD = 1.1), 2.9 (SD = 1.1), and 3.0 (SD = 1.1), respectively.

Linear regressions using machine learning
Three longitudinal models were calculated to predict QoL (1) 12 months after the baseline, (2) 6 months after the 6-month visit, and (3) 6 months after the baseline.
The first model attempting to predict the 12-month QoL with baseline variables attained an uncentered adjusted R-squared of 0.350 and comprised 45 predictors. All included variables and associated coefficients are presented in Table 2. The mean squared error (MSE) training result was 0.92 and the MSE testing result was 0.97.
As for the second model predicting the 12-month QoL using variables from the 6-month visit, the optimal regression (Table 3) comprised 47 predictors, and the uncentered adjusted R-squared was 0.365. The MSE training result was 0.86 and the MSE testing result was 0.98.
Finally, the QoL at 6 months was estimated using baseline variables in a third model. With 41 variables, an uncentered adjusted R-squared of 0.307 was obtained. The complete model and its parameters are presented in Table 4. The MSE training result was 0.93 and the MSE testing result was 0.96.
A summary of the results of the three prediction models is presented in Table 5. Among the strongest and most reliable predictors were having low/no passive apathetic social withdrawal, low/no emotional withdrawal, and having a high processing speed score. Many other variables were also present in all three models, including having educated parents, self-reporting high mental health, female gender, being treatment-responsive (CGIS), gaining weight as a side-effect, and having energy and interests. Being a veteran and being hopeless were negatively associated with QoL. Other predictors were strong but only present in one or two models; having a high level of total bilirubin, a higher education level, or believing that they had a mental problem was associated with a better QoL. Meanwhile, having a high clinical global impression of severity, social avoidance, poor rapport, stereotyped thinking, and dry mouth as a side-effect was associated with poorer outcomes.

DISCUSSION
This study aimed to accurately predict future QoL by identifying the characteristics that make individuals more prone to recover. By using machine learning to create optimal models, good predictions were reached despite adjustments to avoid any redundancy or collinearity of the data. Three models were calculated: (1) prediction of 12-month QoL with baseline variables, (2) prediction of 12-month QoL with 6-month variables, and (3) prediction of 6-month QoL with baseline variables. Adjusted R-squared values of 0.350, 0.365, and 0.307 were achieved for these models, respectively. Identified predictors included, among others, social and emotion-related symptoms, neurocognition (processing speed), education, female gender, veteran status, indicators of satisfaction with psychiatric treatment as well as elements of physical functioning. The performance of the model is consistent with the prediction score for human behavior modeling 20 .
Firstly, predictors of QoL include many symptoms related to social and emotional aspects of life (e.g., negative association with social and emotional withdrawal, social avoidance, poor rapport, and hopelessness), thereby highlighting the fact that socialization and social roles are central determinants of QoL. Notably, the patients' and their parents' education level, likely associated with social inclusion and socioeconomic status, were strong predictors, as previously demonstrated 21 . Similar results have previously been obtained with emotional discomfort 22 . It is indeed possible that the relationship between negative symptoms and the QoL observed in the literature is due to the patients' ability to interact with others as well as their environment. These factors might be related to social support as well, which is a key component of QoL 23 . The lack of social support is indeed a major problem for individuals with schizophrenia 23 , and it is, therefore, a crucial determinant to consider. Female gender was also associated with higher QoL; this predictor is, however, controversial in the current literature 24-28 . The backgrounds and origins of patients also seem to have an impact, since parental education level and veteran status were among identified predictors. This finding could be linked to the fact that schizophrenia patients with a greater trauma history tend to have a poorer QoL 29 . Secondly, as previously demonstrated with that database 13 , neurocognition had a significant impact on QoL. Considering each subscale separately, the processing speed was found to be the most predictive, even more than the total neurocognition score. This finding suggests that cognitive rehabilitation programs, which have already proven to be effective to improve cognitive performance, symptoms, and psychosocial functioning 30 , could be an important element to improving QoL as well 31 .
Many subjective factors were also classified as very strong predictors of QoL. For example, good mental health, evaluated by the physician or reported by the patient, contributed to a favorable outcome. Satisfaction toward mental health providers was also an important predictor, which was previously shown to be associated with a better QoL 32 . This finding suggests that the patients' subjective satisfaction is a very important factor when it comes to recovery. Additionally, having a good attitude toward the medication (e.g., thinking that medication is needed or that it prevents them from getting sick) also seemed important. These factors are likely to be associated with better medication adherence, as supported by other recent studies of people with schizophrenia 15,33 . Adherence was only found to be a weak predictor in one model; however, it should be noted that it was only a potential predictor in the second model as this was not measured at the baseline visit, since the patients were not yet taking the study medication. Antipsychotic medication is indeed considered important to improve the mental health of schizophrenia patients. However, while antipsychotics contribute to the improvement of the symptomatology, they also cause a lot of side effects, thereby having contradictory effects on QoL. In the current study, side effects and treatment attitudes seemed more important than specific drugs, suggesting that the ideal medication varies from patient to patient, and that adherence and observed changes are more important in predicting QoL. Nevertheless, response to treatment, measured using the CGIS questionnaire, was found to be a strong predictor in all three models. These results confirm those of Naber et al., who came to similar conclusions using the CATIE database 34 .
Finally, some physical health indicators were included in the models (e.g., bilirubin). Because physical comorbidities are very frequent in that population, these indicators could reflect the presence of metabolic disorders that greatly impact the QoL of some individuals. Tobacco use, which is well established to be associated with significant physical disorders, was also a predictor in two models. Similarly, predictors related to adverse events were also probably associated with physical health, which is unsurprisingly a great predictor of QoL in schizophrenia 35 . However, weight gain was found to be predictive of a better QoL in all models. This result is controversial since that side-effect is usually associated with poorer outcomes. However, compliant patients might be at higher risk of gaining weight from medication, which could explain that association 36 .
Although this study innovates by demonstrating that QoL can be predicted effectively in schizophrenia patients, a few limitations must be acknowledged. Although the prediction was good in this cohort, the cohort is not necessarily representative of the overall schizophrenia population. Subjects were excluded if they had certain psychiatric comorbid diagnoses that are fairly frequent in that population (e.g., mental retardation and schizoaffective disorders), and they were all willing to participate as well as able to provide informed consent. However, this is an issue that is common to all randomized controlled trials, and the researchers minimized that issue by including a large number of sites representative of the United States population. Nevertheless, more such studies will be needed to confirm the predictors identified. This model could also be tested on another population to assess to what extent it is generalizable.
In conclusion, this study allowed an excellent prediction of the QoL of patients with schizophrenia using machine learning algorithms. Among the best and most reliable predictors of QoL were notably characteristics linked to social and emotional symptoms, good attitude toward medication, satisfaction toward healthcare providers and patients' own mental health, neurocognition, female gender, and medication side-effects. Since good prediction levels can be achieved, the use of machine learning could have major implications for the future of prediction as it helps avoid human bias. Eventually, it will also become possible to create predictive algorithms that could be used on various clinical populations and guide clinicians in their decision-making. The study of the predictors identified by such algorithms also allows a bit more insight into how a disease such as schizophrenia manifests itself and into the mechanisms that could explain the outcome. Notably, in the present study, we were able to identify very precise symptoms and factors that could have a higher impact than expected on the QoL of people with schizophrenia (e.g., their subjective perception of their mental health). In doing so, it was notably observed that physical health variables, which are often omitted from mental health-related studies, seem to have an important impact on schizophrenia patients' QoL. Consequently, interventions aiming to increase QoL should also consider these aspects. More studies will be needed to confirm the results and their applicability for clinicians.

METHODS
Study sample
Data for this study were extracted from the CATIE schizophrenia study dataset. CATIE was a large, naturalistically designed clinical trial conducted by the National Institute of Mental Health (NIMH) between December 2000 and December 2004. A total of 1460 patients with a DSM-IV diagnosis of schizophrenia, based upon the Structured Clinical Interview for DSM-IV 37 , were followed for 18 months. The trial was approved by the institutional review board at each site, and the patients or their legal guardians provided their written informed consent. The detailed study description and design can be found elsewhere 38 .
A subsample of 952 patients was selected based on the longitudinal monitoring of their QoL, i.e., they had completed at least 2 visits among the baseline visit and the 6, 12, and 18-month follow-up visits. According to the protocol, participants should have been followed for 18 months, with a follow-up visit occurring every 3 months or so. However, the attrition rate was very high, and therefore some variables were missing for some participants. Consequently, only data up to 12 months were used, and 697 subjects could be included in the first model, whereas the second and the third comprised 692 and 786 individuals, respectively.
[Table footnotes: a Variables that were measured only during the screening or baseline visit, and that therefore could only be predictors in models 1 and 3. b Variable that was measured only during follow-up visits, and that therefore could only be a predictor in model 2. Bold: coefficient over 0.3. Italic: coefficient under 0.1.]

Dataset
The QoL was measured every 6 months using a well-validated clinician-rated scale, the Heinrichs-Carpenter Quality of Life Scale (QOL) 39 . The objective was to use total QoL score at 6 and 12 months as a continuous outcome, i.e., the dependent variable, while all other variables from the CATIE trial were used as potential predictors in linear regressions. These included a large number of questionnaire items as well as the total scores and other variables (dichotomous or continuous) that were in the database, for a total of 253 potential baseline predictors and 233 potential 6-month predictors. Notably, psychotic symptoms were assessed during each visit using the positive and negative syndrome scale 40 . Depressive symptoms were measured every 3 months using the Calgary depression rating scale 41 . Neurocognition was measured using a neurocognitive battery assessing verbal learning, vigilance, speed, reasoning, and working memory. Other potential predictors were selected based on what was available within the database. These included many variables, both demographic and clinical, and both psychiatric and somatic (e.g., sociodemographic variables, metabolic biomarkers, complete blood count, side effects severity, antipsychotic medication, insight and attitudes toward treatment, adherence, violence, drug use, general status, vitals, etc.). However, items that were considered too conceptually related to the concept of QoL (i.e., redundant with items of the QoL questionnaire) were removed from the database. Included variables are all detailed in the Supplementary Table. The models included not only the scales' totals but also every single item of each tool and questionnaire.

Statistical analysis
A Lasso supervised regularization algorithm was implemented to identify potential predictors for three models: (1) baseline predictors of 12-month QoL, (2) 6-month predictors of 12-month QoL, and (3) baseline predictors of 6-month QoL. This type of regularized regression was developed to perform feature (predictor) selection and to regularize the model in order to optimize prediction accuracy. By conducting multiple analyses in parallel, it is reasonable to assume that the variables that recur consistently across models are probably stronger predictors since these remain important over time.
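For reference, the Lasso estimate described above can be written as a penalized least-squares problem. The form below is the objective documented for the scikit-learn Lasso estimator, with the intercept omitted for brevity; the symbols n (number of participants), p (number of candidate predictors), X, y, and β are notation introduced here for illustration and do not appear in the original analysis.

```latex
\hat{\beta} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}}
\; \frac{1}{2n} \,\lVert y - X\beta \rVert_2^2 \;+\; \alpha \,\lVert \beta \rVert_1 ,
\qquad \alpha \ge 0
```

The first term measures the fit to the observed QoL scores, and the second term penalizes the sum of the absolute coefficient values. Larger values of α push more coefficients to exactly zero, which is what makes the method usable for predictor selection.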
The Lasso algorithm, from the scikit-learn library (version 1.0.1), was implemented in Python 3.9. To train the regularization algorithm, 70% of the dataset was used, whereas the remaining 30% was used for testing, a split that performed well in similar studies with datasets of this size in the literature 42,43 . The data were preprocessed prior to this division. Participants for whom 25% of data were missing were removed from the dataset. The remaining missing data were replaced with the mean value of the other participants, a technique called mean imputation that is often used to stabilize the selection of predictors. This approach is consistent with other studies conducted in the field of psychiatry. The best-performing hyperparameters were identified using the GridSearchCV algorithm provided by the scikit-learn library. An alpha = 0.01, max_iter = 100,000, and default values for the remaining parameters were selected by GridSearchCV.
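The preprocessing and fitting steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the original analysis code: the CATIE data are limited access, so the arrays, the random seeds, the candidate alpha grid, and the reading of the exclusion rule as "25% or more missing" are all assumptions made here for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the predictors and the QoL outcome (hypothetical data).
rng = np.random.default_rng(0)
n_participants, n_predictors = 300, 40
X = rng.normal(size=(n_participants, n_predictors))
true_coef = np.zeros(n_predictors)
true_coef[:5] = [0.8, -0.6, 0.5, 0.4, -0.3]  # only five informative predictors
y = X @ true_coef + rng.normal(scale=1.0, size=n_participants)

# Simulate missingness: about 10% of values at random, plus 10 sparse participants.
X[rng.random(X.shape) < 0.10] = np.nan
X[:10, :30] = np.nan

# Step 1: remove participants with too much missing data
# (assumed threshold: 25% or more of their predictors missing).
missing_fraction = np.isnan(X).mean(axis=1)
keep = missing_fraction < 0.25
X, y = X[keep], y[keep]

# Step 2: mean imputation, applied before the split as described in the text.
X = SimpleImputer(strategy="mean").fit_transform(X)

# Step 3: 70% of the data for training, 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

# Step 4: grid search over the penalty strength (the grid itself is hypothetical).
search = GridSearchCV(
    Lasso(max_iter=100_000),
    param_grid={"alpha": [0.001, 0.01, 0.1, 1.0]},
    scoring="neg_mean_squared_error",
    cv=10,
)
search.fit(X_train, y_train)

best_model = search.best_estimator_
selected = np.flatnonzero(best_model.coef_)  # indices of retained predictors
print("best alpha:", search.best_params_["alpha"])
print("number of retained predictors:", selected.size)
```

Note that imputing before the split, as the text describes, lets test-set values influence the imputed means; wrapping the imputer and the Lasso in a scikit-learn Pipeline would be one way to avoid that if a stricter separation were wanted.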
The performance of the algorithm for the three models was analyzed as follows. The MSE for the training set and for the testing set were calculated and compared. An R2 score was calculated for both the training and testing sets. The testing R2 score is representative of our predictive score: a score of 1 would indicate that the model explains all the variation of the dependent variable around its mean, whereas a score of 0 means that the model explains none of the observed variation. Collinearity between the different variables is accounted for in the Lasso algorithm by its regularizing nature: it keeps all the features as candidates but shrinks toward 0, and can set exactly to 0, the coefficients of variables that contribute little to predicting the dependent variable.
To validate the regression algorithm over the three models, tenfold cross-validation was conducted. This validation method divides the dataset randomly into ten parts; nine of those parts are used for training and the remaining one is used for testing, and the procedure is repeated ten times so that each part serves once as the test set.
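The performance measures named above can be computed as in the sketch below. The function names and the toy values are hypothetical. The source reports an "uncentered adjusted R-squared" without giving its formula, so the uncentered and adjusted variants shown here follow one common convention (total sum of squares taken around zero, and a degrees-of-freedom correction for the number of predictors) and should be read as an assumption rather than as the authors' exact computation.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between observed and predicted scores."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred, centered=True):
    """R2 = 1 - SS_res / SS_tot.

    centered=True  : SS_tot is taken around the mean of y (the usual R2).
    centered=False : SS_tot is taken around zero (uncentered R2).
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    baseline = y_true.mean() if centered else 0.0
    ss_tot = np.sum((y_true - baseline) ** 2)
    return float(1.0 - ss_res / ss_tot)

def adjusted_r2(r2_value, n_obs, n_predictors, centered=True):
    """Penalize R2 for the number of predictors kept in the model."""
    df_total = n_obs - 1 if centered else n_obs
    df_resid = df_total - n_predictors
    return 1.0 - (1.0 - r2_value) * df_total / df_resid

# Toy example with four observed and predicted QoL scores (made-up values).
observed = [2.0, 3.0, 4.0, 3.0]
predicted = [2.5, 2.5, 3.5, 3.5]
print("MSE:", mse(observed, predicted))
print("R2:", r2(observed, predicted))
print("uncentered R2:", r2(observed, predicted, centered=False))
print("adjusted R2:", adjusted_r2(r2(observed, predicted), n_obs=4, n_predictors=1))
```

Comparing the training and testing MSE, as the text does, is a quick check for overfitting: a testing error much larger than the training error would suggest that the model does not generalize.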
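Tenfold cross-validation of a Lasso model can be sketched with scikit-learn as shown below. The data are synthetic and the seeds are arbitrary; only alpha = 0.01 and max_iter = 100,000 are taken from the text, and the choice of `KFold` with shuffling is an assumption about how the random division into ten parts was done.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

# Synthetic predictors and outcome (hypothetical stand-in for the trial data).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
y = 0.7 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=1.0, size=200)

# Ten random, non-overlapping parts; each part is the test set exactly once.
folds = KFold(n_splits=10, shuffle=True, random_state=1)
model = Lasso(alpha=0.01, max_iter=100_000)

# One score per fold: the model is refit on the other nine parts each time.
fold_r2 = cross_val_score(model, X, y, cv=folds, scoring="r2")
fold_mse = -cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")

fold_sizes = [len(test_idx) for _, test_idx in folds.split(X)]
print("test-fold sizes:", fold_sizes)
print("mean R2 across folds:", fold_r2.mean())
print("mean MSE across folds:", fold_mse.mean())
```

Reporting the mean and spread of the per-fold scores gives a sense of how stable the model's performance is across different partitions of the same dataset.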

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.

DATA AVAILABILITY
The Clinical Antipsychotic Trials of Intervention Effectiveness Schizophrenia Trial is a limited access dataset available on request, under certain conditions, from the National Institute of Mental Health clinical trials.