Introduction

Schizophrenia is a chronic and severe mental disorder that can be disabling1. This disorder can greatly affect quality of life (QoL)2,3, which is defined by the World Health Organization as an individual’s perception of their position in life in the context of the culture and value systems in which they live and in relation to their goals, expectations, standards, and concerns4.

A shift has recently been observed in the objectives of schizophrenia treatment. While the goal was once to reduce symptoms only, the focus has moved toward recovery through improving QoL and functioning5,6. Although complete recovery is often not possible for these patients, partial recovery remains achievable. This notably involves optimizing their well-being and functioning, which are key components of QoL. Over the past few years, factors that may promote better QoL have been identified in the literature, with mixed results. Some predictors that recur frequently are types of psychiatric or psychotic symptoms, but exactly which type best predicts QoL remains controversial2,7,8,9,10. These symptoms can be reduced with medication; however, even though response and adherence to antipsychotics can improve QoL3, some medication side effects such as weight gain11 and sexual dysfunction12 have been associated with a worsened outcome. Other predictors of higher QoL were also identified, e.g., better cognition and an older age at disorder onset13,14,15. On the other hand, stigma-related feelings and comorbid diagnoses predicted a poorer outcome regarding QoL14. In general, it seems that the highly heterogeneous factors presented in the current literature largely depend on the angle from which the authors choose to approach the question. Another issue is that the design is often cross-sectional, which does not allow for longitudinal predictions. Identifying the most important factors could help identify which patients are better able to recover, and ultimately optimize every patient’s recovery.

Several researchers have used multivariate models to predict QoL. Mohamed et al.13 created a model excluding variables that may be redundant with QoL (e.g., functioning) using longitudinal data from the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) schizophrenia study. In doing so, they were able to explain 22% of the variance in total QoL with positive and negative symptoms, neurocognitive variables, and sociodemographic variables (age, race, ethnicity, gender, and time). In such studies, the explained variance is generally low2,14, possibly because authors did not include some factors that fall outside their research question but may play a major role in QoL (e.g., physical health, patients’ self-reported satisfaction, and medication adherence). With the emergence of supervised machine learning, it has become possible to build an optimal model that includes the best predictors from fairly large datasets, without a priori human assumptions about how variables are combined16,17. This approach could thereby provide a better understanding of the various factors that influence QoL in individuals with schizophrenia, just as it has successfully predicted other outcomes such as relapses18,19.

The aim of the current study was to identify, using machine learning, factors that predict QoL among people with schizophrenia. To do so, we analyzed a large set of variables from the CATIE study, a large naturalistic clinical trial conducted between 2000 and 2004 in the United States.

Results

Sample characteristics

Due to attrition and missing data, only 919 of the 952 participants with a longitudinal follow-up were included in a model (N = 697, 692, and 786 in models 1–3, respectively). Of these, 670 were male (73%), and the average age was 41.1 years (SD = 11.0; range: 18–67). One-quarter did not complete high school (25%), a minority were employed full-time at the time of the study (6%), and only a few were married (11%). Most of the sample had no comorbid psychiatric condition (60%). Detailed baseline sample characteristics are presented in Table 1. At the baseline, 6-month, and 12-month follow-up visits, the QoL total score was on average 2.8 (SD = 1.1), 2.9 (SD = 1.1), and 3.0 (SD = 1.1), respectively.

Table 1 Baseline sample characteristics. N = 919.

Linear regressions using machine learning

Three longitudinal models were calculated to predict QoL (1) 12 months after the baseline, (2) 6 months after the 6-month visit, and (3) 6 months after the baseline.

The first model attempting to predict the 12-month QoL with baseline variables attained an uncentered adjusted R-squared of 0.350 and comprised 45 predictors. All included variables and associated coefficients are presented in Table 2. The mean squared error (MSE) training result was 0.92 and the MSE testing result was 0.97.

Table 2 Linear regression of QoL at the 12-month visit using baseline variables. N = 697.

As for the second model predicting the 12-month QoL using variables from the 6-month visit, the optimal regression (Table 3) comprised 47 predictors, and the uncentered adjusted R-squared was 0.365. The MSE training result was 0.86 and the MSE testing result was 0.98.

Table 3 Linear regression of QoL at the 12-month visit using variables from the 6-month visit. N = 692.

Finally, the QoL at 6 months was estimated using baseline variables in a third model. With 41 variables, an uncentered adjusted R-squared of 0.307 was obtained. The complete model and its parameters are presented in Table 4. The MSE training result was 0.93 and the MSE testing result was 0.96.

Table 4 Linear regression of QoL at the 6-month visit using baseline variables. N = 786.

A summary of the results of the three prediction models is presented in Table 5. Among the strongest and most reliable predictors were having low/no passive apathetic social withdrawal, low/no emotional withdrawal, and having a high processing speed score. Many other variables were also present in all three models, including having educated parents, self-reporting good mental health, female gender, being treatment-responsive (CGIS), gaining weight as a side effect, and having energy and interests. Being a veteran and being hopeless were negatively associated with QoL. Other predictors were strong but only present in one or two models: having a high level of total bilirubin, a higher education level, or believing that they had a mental problem was associated with a better QoL. Meanwhile, a high clinical global impression of severity, social avoidance, poor rapport, stereotyped thinking, and dry mouth as a side effect were associated with poorer outcomes.

Table 5 Summary of variables favoring quality of life. Variables with a similar meaning (e.g., different scales for the same side effect) were merged. Predictors are presented in order of effect sizes.

Discussion

This study aimed to accurately predict future QoL by identifying the characteristics that make individuals more prone to recover. By using machine learning to create optimal models, good predictions were reached, despite adjustments made to avoid any redundancy or collinearity in the data. Three models were calculated: (1) prediction of 12-month QoL with baseline variables, (2) prediction of 12-month QoL with 6-month variables, and (3) prediction of 6-month QoL with baseline variables. R-squared values of 0.350, 0.365, and 0.307 were achieved for these models, respectively. Identified predictors included, among others, social and emotion-related symptoms, neurocognition (processing speed), education, female gender, veteran status, indicators of satisfaction with psychiatric treatment, as well as elements of physical functioning. The performance of the models is consistent with the prediction scores typically obtained in human behavior modeling20.

Firstly, predictors of QoL include many symptoms related to social and emotional aspects of life (e.g., negative association with social and emotional withdrawal, social avoidance, poor rapport, and hopelessness), thereby highlighting the fact that socialization and social roles are central determinants of QoL. Notably, the patients’ and their parents’ education levels, likely associated with social inclusion and socioeconomic status, were strong predictors, as previously demonstrated21. Similar results have previously been obtained with emotional discomfort22. It is indeed possible that the relationship between negative symptoms and QoL observed in the literature is due to the patients’ ability to interact with others as well as with their environment. These factors might be related to social support as well, which is a key component of QoL23. The lack of social support is indeed a major problem for individuals with schizophrenia23, and it is, therefore, a crucial determinant to consider. Female gender was also associated with higher QoL; this predictor is, however, controversial in the current literature24,25,26,27,28. The backgrounds and origins of patients also seem to have an impact, since parental education level and veteran status were among the identified predictors. This finding could be linked to the fact that schizophrenia patients with a greater trauma history tend to have a poorer QoL29.

Secondly, as previously demonstrated with that database13, neurocognition had a significant impact on QoL. Considering each subscale separately, processing speed was found to be the most predictive, even more than the total neurocognition score. This finding suggests that cognitive rehabilitation programs, which have already proven to be effective in improving cognitive performance, symptoms, and psychosocial functioning30, could be an important element in improving QoL as well31.

Many subjective factors were also classified as very strong predictors of QoL. For example, good mental health, evaluated by the physician or reported by the patient, contributed to a favorable outcome. Satisfaction toward mental health providers was also an important predictor, which was previously shown to be associated with a better QoL32. This finding suggests that the patients’ subjective satisfaction is a very important factor when it comes to recovery. Additionally, having a good attitude toward the medication (e.g., thinking that medication is needed or that it prevents them from getting sick) also seemed important. These factors are likely to be associated with better medication adherence, as supported by other recent studies of people with schizophrenia15,33. Adherence was only found to be a weak predictor in one model; however, it should be noted that it was only a potential predictor in the second model, as it was not measured at the baseline visit, when the patients were not yet taking the study medication. Antipsychotic medications are indeed considered important to improve the mental health of schizophrenia patients. However, while they contribute to the improvement of the symptomatology, they also cause many side effects, thereby having contradictory effects on QoL. In the current study, side effects and treatment attitudes seemed more important than specific drugs, suggesting that the ideal medication varies from patient to patient, and that adherence and observed changes are more important in predicting QoL. Nevertheless, response to treatment, measured using the CGIS questionnaire, was found to be a strong predictor in all three models. These results confirm those of Naber et al., who came to similar conclusions using the CATIE database34.

Finally, some physical health indicators were included in the models (e.g., bilirubin). Physical comorbidities are very frequent in that population, and these indicators could reflect the presence of metabolic disorders that greatly impact the QoL of some individuals. Tobacco use, which is well established to be associated with significant physical disorders, was also a predictor in two models. Similarly, predictors related to adverse events were probably also associated with physical health, which is unsurprisingly a strong predictor of QoL in schizophrenia35. However, weight gain was found to be predictive of a better QoL in all models. This result is controversial since that side effect is usually associated with poorer outcomes. One possible explanation is that compliant patients might be at higher risk of gaining weight from medication36.

Although this study innovates by demonstrating that QoL can be predicted effectively in schizophrenia patients, a few limitations must be acknowledged. Although the prediction performed well in that cohort, the cohort is not necessarily representative of the overall schizophrenia population. Subjects were excluded if they had certain psychiatric comorbid diagnoses that are fairly frequent in that population (e.g., mental retardation and schizoaffective disorders), and they were all willing to participate as well as able to provide informed consent. However, this issue is common to all randomized controlled trials, and the researchers minimized it by including a large number of sites representative of the United States population. Nevertheless, more such studies will be needed to confirm the predictors identified. This model could also be tested on another population to assess to what extent it is generalizable.

In conclusion, this study allowed an excellent prediction of the QoL of patients with schizophrenia using machine learning algorithms. Among the best and most reliable predictors of QoL were notably characteristics linked to social and emotional symptoms, good attitude toward medication, satisfaction toward healthcare providers and patients’ own mental health, neurocognition, female gender, and medication side effects. Since good prediction levels can be achieved, the use of machine learning could have major implications for the future of prediction as it helps avoid human bias. Eventually, it will also become possible to create predictive algorithms that could be used on various clinical populations and guide clinicians in their decision-making. The study of the predictors identified by such algorithms also provides further insight into how a disease such as schizophrenia manifests itself and into the mechanisms that could explain the outcome. Notably, in the present study, we were able to identify very precise symptoms and factors that could have a higher impact than expected on the QoL of people with schizophrenia (e.g., their subjective perception of their mental health). In doing so, it was observed that physical health variables, which are often omitted from mental health-related studies, seem to have an important impact on schizophrenia patients’ QoL. Consequently, interventions aiming to increase QoL should also consider these aspects. More studies will be needed to confirm the results and their applicability for clinicians.

Methods

Study sample

Data for this study were extracted from the CATIE schizophrenia study dataset. CATIE was a large, naturalistically designed clinical trial conducted by the National Institute of Mental Health (NIMH) between December 2000 and December 2004. A total of 1460 patients with a DSM-IV diagnosis of schizophrenia, based upon the Structured Clinical Interview for DSM-IV37, were followed for 18 months. The trial was approved by the institutional review board at each site, and the patients or their legal guardians provided their written informed consent. The detailed study description and design can be found elsewhere38.

A subsample of 952 patients was selected based on the longitudinal monitoring of their QoL, i.e., they had completed at least 2 visits among the baseline visit and the 6, 12, and 18-month follow-up visits. According to the protocol, participants should have been followed for 18 months, with a follow-up visit occurring every 3 months or so. However, the attrition rate was very high, and therefore some variables were missing for some participants. Consequently, only data up to 12 months were used, and 697 subjects could be included in the first model, whereas the second and the third comprised 692 and 786 individuals, respectively.

Dataset

QoL was measured every 6 months using a well-validated clinician-rated scale, the Heinrichs-Carpenter Quality of Life Scale (QOL)39. The objective was to use the total QoL score at 6 and 12 months as a continuous outcome, i.e., the dependent variable, while all other variables from the CATIE trial were used as potential predictors in linear regressions. These included a large number of questionnaire items as well as the total scores and other variables (dichotomous or continuous) that were in the database, for a total of 253 potential baseline predictors and 233 potential 6-month predictors. Notably, psychotic symptoms were assessed during each visit using the positive and negative syndrome scale40. Depressive symptoms were measured every 3 months using the Calgary depression rating scale41. Neurocognition was measured using a neurocognitive battery assessing verbal learning, vigilance, speed, reasoning, and working memory. Other potential predictors were selected based on what was available within the database. These included many variables, both demographic and clinical, and both psychiatric and somatic (e.g., sociodemographic variables, metabolic biomarkers, complete blood count, side effects severity, antipsychotic medication, insight and attitudes toward treatment, adherence, violence, drug use, general status, vitals, etc.). However, items that were considered too conceptually related to QoL (i.e., redundant with items of the QoL questionnaire) were removed from the database. Included variables are all detailed in the Supplementary Table. The models considered not only the scales’ totals but also every single item included in each tool and questionnaire.

Statistical analysis

A Lasso supervised regularization algorithm was implemented to identify potential predictors for three models: (1) baseline predictors of 12-month QoL, (2) 6-month predictors of 12-month QoL, and (3) baseline predictors of 6-month QoL. This type of regularized regression was developed to enable feature (predictor) selection and to apply regularization in order to optimize prediction accuracy. By conducting multiple analyses in parallel, it is possible to assume that the variables that recur consistently across models are probably stronger predictors since these remain important over time.
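For reference, the Lasso estimate can be written in the form minimized by the scikit-learn Lasso estimator. The notation is introduced here for illustration and is not taken from the original analysis: n is the number of participants, p the number of candidate predictors, y_i the QoL outcome of participant i, x_i the corresponding predictor vector, and alpha the regularization strength.

```latex
(\hat{\beta}_0, \hat{\beta}) \;=\; \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \left( y_i - \beta_0 - x_i^{\top}\beta \right)^2
\;+\; \alpha \sum_{j=1}^{p} \lvert \beta_j \rvert
```

The absolute-value penalty is what allows some coefficients to become exactly zero as alpha grows, which is why the method doubles as a predictor-selection procedure.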

The Lasso algorithm, from the scikit-learn library (version 1.0.1), was implemented in Python 3.9. To train the regularization algorithm, 70% of the dataset was used, whereas 30% was used for testing, a split that performed well in similar studies with datasets of this size in the literature42,43. Pre-processing of the data took place prior to this division. Participants for whom 25% of data were missing were removed from the dataset. The remaining missing data were replaced by the mean value of the other participants, a technique called mean imputation that is often used to stabilize the selection of predictors. This approach is consistent with other studies conducted in the field of psychiatry. The best-performing hyperparameters were identified using the GridSearchCV utility provided by the scikit-learn library. An alpha of 0.01, a max_iter of 100,000, and default values for the remaining parameters were selected by GridSearchCV.
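The pre-processing and tuning steps described above can be sketched as follows. This is a minimal illustration on synthetic data, not the CATIE dataset or the authors' actual script. The data-generating process, the candidate alpha grid, the reading of the missing-data rule as "25% or more", the random seeds, and the use of 10 folds inside the grid search are all assumptions made for the example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the predictor matrix (NOT the CATIE data):
# 300 hypothetical participants, 40 candidate predictors, 3 truly informative.
n_participants, n_predictors = 300, 40
X = rng.normal(size=(n_participants, n_predictors))
true_coef = np.zeros(n_predictors)
true_coef[:3] = [0.8, -0.5, 0.3]
y = X @ true_coef + rng.normal(scale=1.0, size=n_participants)

# Mimic missing answers: about 5% of entries at random, plus ten
# participants with at least half of their predictors missing.
X[rng.random(X.shape) < 0.05] = np.nan
X[:10, :20] = np.nan

# Step 1: remove participants with too much missing data
# (threshold read here as "25% or more", an assumption).
missing_fraction = np.isnan(X).mean(axis=1)
keep = missing_fraction < 0.25
X_kept, y_kept = X[keep], y[keep]

# Step 2: mean imputation, applied before the split as in the text.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_kept)

# Step 3: 70% training / 30% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y_kept, test_size=0.30, random_state=0
)

# Step 4: grid search over alpha (the grid itself is hypothetical).
param_grid = {"alpha": [0.001, 0.01, 0.1, 1.0]}
search = GridSearchCV(
    Lasso(max_iter=100_000),
    param_grid,
    cv=10,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)
best_model = search.best_estimator_

print("selected alpha:", search.best_params_["alpha"])
print("non-zero coefficients:", int(np.count_nonzero(best_model.coef_)))
```

A common variation is to place the imputer and the Lasso in a single scikit-learn Pipeline so that imputation means are learned from the training portion only; the sketch keeps the order stated in the text instead.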

The performance of the algorithm for the three models was analyzed as follows. The MSE was calculated for the training set and for the testing set, and the two were compared. An R2 score was calculated for both the training and testing sets. The testing R2 score represents our predictive score: a score of 1 would indicate that the model explains all the variation of the dependent variable around its mean, whereas a score of 0 means that the model explains none of the observed variation. Collinearity between the different variables is accounted for by the regularizing nature of the Lasso algorithm: all features enter the model, but the coefficients of variables that do not help predict the dependent variable are gradually shrunk, down to 0.
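The evaluation described above can be illustrated with a short sketch, again on synthetic data rather than the study data. The data-generating process and random seeds are assumptions. Note also that scikit-learn's r2_score computes the usual centered coefficient of determination, which may differ from the uncentered adjusted R-squared reported in the Results.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): 400 participants, 30 predictors, 4 informative.
X = rng.normal(size=(400, 30))
coef = np.zeros(30)
coef[:4] = [1.0, -0.7, 0.5, 0.3]
y = X @ coef + rng.normal(scale=1.0, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)

# Hyperparameter values as reported in the text.
model = Lasso(alpha=0.01, max_iter=100_000).fit(X_train, y_train)

# MSE and R2 on both partitions; a large train/test gap would suggest overfitting.
mse_train = mean_squared_error(y_train, model.predict(X_train))
mse_test = mean_squared_error(y_test, model.predict(X_test))
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"MSE train={mse_train:.2f} test={mse_test:.2f}")
print(f"R2  train={r2_train:.2f} test={r2_test:.2f}")

# Shrinkage in action: with a much stronger penalty, every coefficient
# is driven to exactly zero and no predictor is retained.
strong_penalty = Lasso(alpha=10.0, max_iter=100_000).fit(X_train, y_train)
print("predictors kept at alpha=0.01:", int(np.count_nonzero(model.coef_)))
print("predictors kept at alpha=10:  ", int(np.count_nonzero(strong_penalty.coef_)))
```

Comparing the two fitted models shows the trade-off the penalty controls: a small alpha keeps many predictors with modest shrinkage, while a large alpha removes them altogether.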

To validate the regression algorithm over the three models, tenfold cross-validation was conducted. This validation method divides the dataset randomly into ten parts; in each of ten repetitions, nine of those parts are used for training, whereas the remaining one is used for testing.
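As an illustration of the tenfold procedure, the sketch below uses scikit-learn's KFold and cross_val_score on synthetic data. The data, the seed, and the choice of MSE as the per-fold score are assumptions for the example, not details reported for the original analysis.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(2)

# Synthetic data (hypothetical): 500 participants, 20 predictors, 3 informative.
X = rng.normal(size=(500, 20))
coef = np.zeros(20)
coef[:3] = [0.9, -0.6, 0.4]
y = X @ coef + rng.normal(scale=1.0, size=500)

# Ten random, non-overlapping parts; each part serves once as the test fold.
cv = KFold(n_splits=10, shuffle=True, random_state=2)
fold_sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in cv.split(X)]

# scikit-learn reports negated MSE so that higher is better; flip the sign back.
neg_mse = cross_val_score(
    Lasso(alpha=0.01, max_iter=100_000),
    X,
    y,
    cv=cv,
    scoring="neg_mean_squared_error",
)
mse_per_fold = -neg_mse

print("train/test sizes per fold:", fold_sizes[0])
print("mean MSE across folds:", round(float(mse_per_fold.mean()), 3))
print("std of MSE across folds:", round(float(mse_per_fold.std()), 3))
```

The spread of the per-fold errors gives a rough sense of how stable the model's performance is across different partitions of the sample.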

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.