Daily estimates of clinical severity of symptoms in bipolar disorder from smartphone-based self-assessments

Currently, the golden standard for assessing the severity of depressive and manic symptoms in patients with bipolar disorder (BD) is clinical evaluations using validated rating scales such as the Hamilton Depression Rating Scale 17-items (HDRS) and the Young Mania Rating Scale (YMRS). Frequent automatic estimation of symptom severity could potentially help support monitoring of illness activity and allow for early treatment intervention between outpatient visits. The present study aimed (1) to assess the feasibility of producing daily estimates of clinical rating scores based on smartphone-based self-assessments of symptoms collected from a group of patients with BD; (2) to demonstrate how these estimates can be utilized to compute individual daily risk of relapse scores. Based on a total of 280 clinical ratings collected from 84 patients with BD along with daily smartphone-based self-assessments, we applied a hierarchical Bayesian modelling approach capable of providing individual estimates while learning characteristics of the patient population. The proposed method was compared to common baseline methods. The model concerning depression severity achieved a mean predicted R2 of 0.57 (SD = 0.10) and RMSE of 3.85 (SD = 0.47) on the HDRS, while the model concerning mania severity achieved a mean predicted R2 of 0.16 (SD = 0.25) and RMSE of 3.68 (SD = 0.54) on the YMRS. In both cases, smartphone-based self-reported mood was the most important predictor variable. The present study shows that daily smartphone-based self-assessments can be utilized to automatically estimate clinical ratings of severity of depression and mania in patients with BD and assist in identifying individuals with high risk of relapse.


Introduction
Bipolar disorder (BD) is a common and complex illness with an estimated prevalence of 1-2% and is regarded as one of the most important causes of disability worldwide 1,2 . BD is characterized by recurrent episodes of depression, (hypo)mania and mixed episodes intervened by periods of euthymia 3 and with a high degree of comorbidity and functional impairment 4 . BD is associated with an elevated risk of mortality due to suicide and medical comorbidities such as cardiovascular disease and diabetes [5][6][7] , and among people with BD, life expectancy is decreased 8-12 years 8,9 . In clinical practice, there are major challenges in diagnosing and treating BD 10 . Patients with BD are often misdiagnosed, and the correct diagnosis can be delayed for several years after illness onset [11][12][13] . Currently, due to the lack of objective tests, the diagnostic process and the clinical assessment of the severity of depressive and manic symptoms relies on subjective information, clinical evaluation and rating scales 14 . Periodic clinical evaluations using clinical rating scales such as the Hamilton Depression Rating Scale (HDRS) 15 and the Young Mania Rating Scale (YMRS) 16 are currently used as the golden standard for assessing the severity of depressive and manic symptoms in patients with BD. Each rating scale consists of a series of items reflecting various symptoms of depression and mania, and these items are finally added up to produce a total score summarizing the current severity of depressive (HDRS) or manic (YMRS) state of the patient. However, the use of clinical rating scales involves a risk of potential patient recall bias, other recall distortions, decreased illness insight (mainly during affective episodes) and individual clinician observer bias [17][18][19][20][21] . In addition, the clinical evaluations are time consuming and require a specialist who is trained and experienced in using the rating scales to produce consistent, valid and reliable results.
As part of treatment, patients may be asked to perform daily self-assessments to track changes in symptoms between clinical evaluations. Modern smartphones provide a unique platform for fine-grained real-time symptom monitoring and management, and a convenient means of self-assessment that have traditionally been carried out on paper [22][23][24] . A smartphone-based monitoring system enables users to ubiquitously record and review their own data, receive reminders, and even share data with carers and clinicians. From the perspective of health care providers, it offers efficient, online monitoring of a group of patients and enables intervention in case any deterioration is observed. Electronic self-monitoring has the additional benefit of making data available for immediate and automatic analysis that can help support monitoring and treatment tasks between outpatient visits.
Correlations between smartphone-based self-reported mood scores and clinical ratings of depressive and manic symptoms measured using the HDRS and the YMRS in patients with BD have already been demonstrated by previous work [25][26][27] , but to our knowledge this is the first study to predict scores of clinical ratings directly from combinations of smartphone-based self-assessed data in patients with BD. In related work, detection of daily self-reported mood from smartphone sensor and usage data is well studied 23,[28][29][30] , but remains a difficult problem due to noisy data. In ref. 31 , Grünerbl et al. classified affective states and state changes derived from clinical ratings and phone interviews of patients with BD from a combination of smartphone sensor modalities and argued that detecting deviations from the euthymic state is more important than the recognition of a particular affective state in practical applications.
Several studies in the field of affective computing have highlighted the need for personalized models to account for individual differences in order to achieve good predictive performance 29,30,32,33 . However, a separate analysis is not feasible until sufficient data about each individual is available. Hierarchical Bayesian modelling is a well-suited approach for providing individual models while borrowing statistical power from the population, which is especially useful when the individual datasets are too small to be analysed separately 34 .
The main of objective of this study was to examine the feasibility of producing daily estimates of clinical ratings of depression and mania based on smartphone selfassessments of symptoms collected from a group of patients with BD, who were followed as part of a randomized controlled trial (RCT) 35 . Additionally, we aimed to demonstrate how uncertainty in the estimated quantities could be used to compute individual, daily risk of relapse, useful for identifying high-risk individuals who need urgent assistance. Our assumption was that daily, automatic estimates of clinical ratings augmented with individual relapse risk scores are more interpretable and actionable results than observing the smartphone-based self-assessments directly and can be a valuable tool in continuous monitoring of illness activity and treatment of patients with BD.

Patients and study design
Data analysed in this study was collected between September 2014 and January 2018 during the MON-ARCA II RCT, investigating the effect of smartphonebased monitoring in patients with BD 35 . All patients with a diagnosis of BD who had previously been treated at the Copenhagen Clinic for Affective Disorder, Denmark, in the period from 2004 to January 2016 and who at the time of recruitment were being treated at community psychiatric centres, private psychiatrists and general practitioners were invited to participate in the trial. The clinic is a specialized outpatient clinic with a catchment area consisting of the Capital Region in Denmark corresponding to 1.4 million people. Patients with a newly diagnosis of BD or with treatment-resistant BD were referred to the clinic. The staff consists of specialists in psychiatry, psychologists, nurses, and a social worker, all with specific experience and knowledge regarding BD. Treatment at the clinic comprises a two-year program including combined evidence-based psychopharmacological treatment and supporting therapy, including group psychoeducation 36 . Patients were included in the study for a nine-month follow-up period if they had a BD diagnosis according to ICD-10 using the Schedules for Clinical Assessments in Neuropsychiatry (SCAN) 37 and previously were treated at the Copenhagen Clinic for Affective Disorder. Patients with schizophrenia, schizotypal or delusional disorders, previous use of the MONARCA system, pregnancy and lack of Danish language skills were excluded. Patients with other comorbid psychiatric disorders and substance use were eligible for the trial. As part of the MONARCA II trial, patients were randomized to either using a smartphone-based monitoring system (the Monsenso system) for daily self-monitoring (the intervention group) or to treatment as usual (the control group). Patients from the intervention group who successfully provided smartphone-based self-monitoring data were included in the analyses in the present study.

Data description Clinical assessments
The dataset consists of 280 clinical ratings collected from 84 patients with BD. Each clinical rating includes ratings for severity of depression and mania using the HDRS 15 and the YMRS 16 , respectively. Each participant was evaluated by a clinician up to 5 times during the study period (at baseline, after 4 weeks, 3 months, 6 months and 9 months). All clinical assessments were conducted by a researcher (MFJ), who was blinded to all smartphonebased data. Thus, data on the severity of depressive and manic symptoms were collected rater-blinded. On both rating scales, the first item indicates mood and low severity ratings indicate low levels of either depressive or manic symptoms while high severity ratings indicate severe symptoms. A score of 13 or more on either rating scale was classified as a depressive or manic episode, respectively, while a high score on both scales at the same time constituted a mixed episode. The cut-off on the HDRS and the YMRS of 13, in contrast to a lower cut-off, was chosen á priori to increase the validity of a current affective depressive or manic/mixed state (the more severe, the higher the validity). A euthymic state was defined as HDRS and YMRS less than 13 thereby also including affective states with partial remission. Clinical ratings with the HDRS and the YMRS were considered to be valid on the day of the assessment as well as the 3 previous days, thus each rating is attributed a total of 4 days in the present dataset.

Smartphone-based self-assessments
In addition to periodic clinical ratings, patients were instructed to carry out daily self-assessments via a smartphone application (the Monsenso system) configured for the present study. The smartphone application was developed using an iterative, user-centred design process involving patients, IT researchers, clinicians and clinical researchers, and the items chosen for the selfassessments were designed to capture clinically important symptoms of bipolar disorder 23 . The self-assessment included the following items: activity level (scored from −3 to +3); alcohol consumption (number of units from 0 to 10+); anxiety level (scored from 0 to 2); irritability level (scored from 0 to 2); cognitive problems (scored from 0 to 2); medicine adherence (not taken/taken/taken with changes); mixed mood (yes/no); mood (scored from −3 to +3 including −0.5 and +0.5); sleep duration (in hours); and stress level (scored from 0 to 2). The activity, medicine, mood and sleep items were mandatory items, which the patients evaluated daily. Additionally, the smartphone application enabled users to configure reminders and users were allowed to provide self-assessments retrospectively for up to 2 days in case they forgot the daily entry. The entered self-assessed data collected over time was visually presented to the users on their smartphone.

Statistical analysis Data preprocessing
Three smartphone-based self-assessment variables, mood, sleep and medicine, required preprocessing prior to analysis. We split the mood variable into a negative and positive component, mood negative and mood positive, allowing for non-linear relationships with the clinical ratings as we expected negative mood to be associated mainly with severity of depression (reflected by scores on the HDRS) and positive mood to be associated mainly with severity of mania (reflected by scores on the YMRS). Additionally, we expected the relationship between sleep duration and symptom severity to be non-linear as increased or decreased sleep duration can both represent signs of deterioration during depression and mania. To encode this, we subtracted the individual-level mean of the sleep duration variable and split the result into positive and negative components, sleep negative and sleep positive. When testing the outof-sample predictive performance of statistical models, the individual mean sleep duration was computed on the training set and applied to generate features in the training set and test set. The medicine adherence variable was categorical by design with categories: medicine not taken, medicine taken as prescribed, medicine taken with changes. To prepare the data for analysis, the three possible answers were encoded with two exclusive binary variables indicating if medicine was not taken, medicine omitted, or if medicine was taken with changes, medicine changed. The expected most common answer, medicine taken as prescribed, was not encoded to avoid collinearity in the regression models (a.k.a. "the dummy variable trap"). Finally, all variables were normalized by their allowed minimum and maximum values to allow for easier selection of model hyperparameters and interpretation of the inferred model weights.
It was a common problem for patients to occasionally forget to fill in their daily self-assessment, resulting in missing values in the dataset. In most cases, selfassessments were either complete for all items or missing, but in a few instances, they were only partially answered. To avoid discarding observations with only a few missing values, we experimented with filling in values from the previous day, which is a common method for dealing with missing values in time series data 38 .
However, it resulted in very few additional complete observations and we therefore decided to leave this step out.

Modelling approach
When analysing several related sets of measurements, such as data from individuals of a population, the two extreme approaches are to either pool the datasets in a one-size-fits-all solution or to analyse the datasets separately, the latter only being possible when sufficient data is available (also known as the cold start problem). A hierarchical Bayesian approach provides an intermediate solution that enables personalized models while learning the characteristics of the population 39 . In a hierarchical Bayesian regression model, individuals have their own set of regression intercept and weights, α j ,β j , sampled from a common population distribution parameterized by population-level means μ and variances τ determining the amount of pooling: where y ji is the ith observation of the target variable for individual j, x ji are the corresponding predictor variables and σ is the standard error. This hierarchical tying together of parameters means that data from the population helps regularize the individual-level weights. An additional benefit of the Bayesian approach is that it expresses uncertainty in all the model parameters and predictions by their posterior distributions, which is important for interpretability of the model. For further details, a complete description of the hierarchical Bayesian model is provided in the Supplementary Information (SI).
In the present study, we used Stan 40 to specify and perform inference in the Bayesian models and then compared the predictive results with pooled and separate naïve mean baselines and common machine learning methods: Ridge Regression from the scikit-learn machine learning library 41 and XGBoost regression from the XGBoost Python package 42 . Details of the Stan setup is also included in the SI. To estimate the predictive performance of the models we designed a cross-validation experiment where in each iteration we held out one randomly sampled clinical evaluation (consisting of up to 4 days of data) from each individual and used the remaining data to fit the models. This procedure was repeated K times and the predicted coefficient of determination (R 2 ) and root mean square error (RMSE) was computed on the held-out data in each iteration. We evaluated the models on the HDRS and the YMRS total scores as well as item 1 of each rating scale, since these items reflect mood only. Additionally, we evaluated the models using all smartphone-based self-assessment items, the mandatory self-assessment items (activity, medicine, mood and sleep) and using only the mood self-assessment item, respectively. Estimating scores on the HDRS and the YMRS with separate models enables prediction of high values of the HDRS and the YMRS at the same time, indicating a mixed episode.

Computing risk of relapse
In some practical applications, it may be more relevant to accurately identify high-risk individuals than to estimate the exact value of the severity score. Applying a Bayesian approach does not only provide a point estimate of the outcome of interest but provides a probability distribution of unobserved (future) outcomes given previously observed data, i.e. the posterior predictive distribution, which can be utilized to reason about uncertainty in the predictions. Specifically, samples from the posterior predictive distribution can be used to compute the probability that an unobserved outcome,ỹ ji , exceeds a predefined threshold, T: When estimating scores of clinical ratings, by applying a threshold T = 13 we can interpret this probability as the risk that an individual is experiencing severe symptoms and utilize it as a personal score indicating the risk of relapse.

Ethical considerations
The MONARCA II RCT was approved by the Regional Ethics Committee in the Capital Region of Denmark (H-2-2014-059) and the Danish Data protection agency (2013-41-1710). The law on handling of personal data was respected. All potential participants were given both written and oral information about the study before informed consent was obtained. Prior to commencement the trial was registered at ClinicalTrials.gov (NCT02221336). Electronic data collected from the smartphones were stored at a secure server at Concern IT, Capital Region, Denmark (I-suite number RHP-292 2011-03). The trial complied with the Helsinki Declaration of 1975, as revised in 2008.

Descriptive statistics
The MONARCA II dataset consists of 280 clinical evaluations, with a mean number of clinical evaluations per patients during the study of 3.33 (SD = 1.14), and a total of 15975 daily smartphone-based self-assessments with a mean number of smartphone-based self-assessments during the study of 190.18 (SD = 70.97) from 84 patients with BD assigned to the intervention group of the RCT. The age ranged from 21 to 71 years (mean = 43.1, SD = 12.4) and 61.9% (N = 52) were women. During the study period, most patients presented with rather low severity of depressive and manic symptoms resulting in low HDRS and YMRS scores. The mean HDRS total score was 7.56 (SD = 6.29) and 20.4% of scores were greater than or equal to 13. The mean YMRS total score was 2.85 (SD = 4.17) and 5.0% of scores were greater than or equal to 13. The mean HDRS item 1 score was 0.69 (SD = 0.85) and the mean YMRS item 1 score was 0.24 (SD = 0.53). Similarly, the majority of the smartphone-based self-reported mood scores were close to zero with a mean of −0.14 (SD = 0.48), indicating neutral mood (euthymia).
After filling back the clinical severity ratings 4 days (since the clinical rating scales reflect this time period) there were 764 observations with associated smartphonebased self-assessments. Figure 1 shows the association between the clinical ratings and the smartphone-based self-reported mood scores. Overall, a high score on the HDRS corresponded to neutral or depressed smartphonebased self-assessed mood (r = −0.40, P < 0.01) while a high score on the YMRS corresponded to neutral or elevated smartphone-based self-assessed mood (r = 0.22, P < 0.001). Only in a few instances were the HDRS and the YMRS rated high at the same time, indicating a mixed episode (r = 0.13, P = 0.02).

Model estimates
The hierarchical Bayesian regression model was evaluated on the entire dataset of clinical ratings combined with all self-assessed items of the completed smartphonebased self-assessments for all participants with at least two data points (N = 433). The model predicting total scores on the HDRS achieved an R 2 of 0.84, indicating that the model accounted for 84% of the variance in the data, and a residual RMSE of 2.41. The model predicting total scores on the YMRS achieved an R 2 of 0.81 and a residual RMSE of 2.07. The model predicting the HDRS item 1 score achieved an R 2 of 0.89 and a residual RMSE of 0.30, and the model predicting the YMRS item 1 score achieved an R 2 of 0.86 and a residual RMSE of 0.22.
The distributions of inferred population-level mean, μ, and variance, τ, parameters in the hierarchical Bayesian regression HDRS total and YMRS total models are summarized in Table 1. The absolute t-statistic of the mean parameters, computed as the mean scaled by the standard error of the parameter: t μ ¼ μ=SEðμÞ, is included as a measure of variable importance, following the intuition that larger absolute weights and lower variance implies importance 43 . This shows that negative mood was the most important predictor variable in the HDRS model while positive mood was the most important predictor and in the YMRS model. A visual presentation of the population-level parameters and a weight matrix summarising the individual parameters are included in the SI. A figure showing the effect size of each self-assessment item is also included in the SI.

Cross-validation results
The predictive performance of the hierarchical Bayesian model was evaluated in K = 100 cross-validation experiments on all data where participants had complete observations of clinical ratings and smartphone-based self-assessments from at least three different clinical evaluations (N = 329). In each iteration, data from one randomly sampled clinical evaluation from each patient was held out and the remaining data was used to fit the models. Models were fitted to predict HDRS total, YMRS total, HDRS item 1 and YMRS item 1, from (1) all; (2) mandatory and (3) mood self-assessment items, respectively. The hierarchical Bayesian model was compared to Fig. 1 Distributions of clinical ratings of symptom severity of depression (HDRS) and mania (YMRS) and smartphone-based self-reported mood scores. A negative mood score is expected to indicate a high HDRS score and a positive mood score is expected to indicate a high YMRS score. The HDRS and YMRS scores are rarely high at the same time (indicating mixed mood). Thus, data is expected to primarily occupy the white background areas of the scatter plots.
naïve pooled and separate mean models along with pooled and separate ridge regression and XGBoost regression models. Table 2 presents the cross-validation results of predicting HDRS total and YMRS total. Because of low variance in the data, the naïve mean models performed relatively well. Still the hierarchical Bayesian regression model achieved the best overall performance in every case and was significantly better than the separate mean model in both the HDRS and YMRS case according to independent t-tests (P < 0.001). Overall, the separate models performed better than their pooled counterparts. Table 3 presents the cross-validation results of predicting HDRS item 1 and YMRS item 1, indicating mood. The pooled XGBoost achieved the best result at predicting HDRS item 1 using all self-assessment items. When reducing the feature set to the mandatory or mood self-assessment items, the hierarchical Bayesian model was best. It was not The population-level regression weight means, μ, are summarized in the leftmost columns and sorted by variable importance computed as the absolute t-statistic of the mean parameter. The corresponding variances, τ, are summarized in the columns to the right and can be interpreted as the amount of pooling of the given variable in the hierarchical model. possible to predict YMRS item 1 significantly better than the naïve mean baselines.

Predicted risk of relapse scores
The results from cross-validation experiments predicting the HDRS total score and the YMRS total score using all self-assessment items presented in the previous section were used to compute risk of relapse scores The ability of the model to correctly assign high risk to instances with high ratings can be evaluated as a binary classification problem with severity ratings equal to or greater than the threshold T constituting the positive class. Figure 2  The hierarchical Bayesian regression model was able to account for information in the smartphone-based selfassessments as well as individual differences and achieved the highest AUC of 0.89 in the HDRS case and 0.84 in the YMRS case.

Discussion
In the present study, we analysed clinical ratings of depression reflected by the HDRS and mania reflected by the YMRS along with daily smartphone-based selfassessments including self-reported mood in a population Table 2 Results of K = 100 cross-validation experiments with the HDRS total score (left columns) and the YMRS total score (right columns) models based on all, mandatory and mood self-assessment items, respectively.
HDRS total score YMRS total score The hierarchical Bayesian model achieved the best overall performance in every case and could predict the clinical severity ratings within 4 points of RMSE on the original rating scales. The best HDRS total result was achieved using all self-assessment items while the best YMRS total result was achieved using only the mood selfassessment item. Bold values indicates the best results within each set of self-assessment items. a Coefficient of determination. Higher is better. b Root Mean Square Error. Lower is better.
of 84 patients with BD. As hypothesized, there was a negative correlation between the HDRS and self-reported mood and a positive correlation between the YMRS and mood. This confirms previous work [25][26][27] , and suggests that smartphone-based self-reported mood is a valid indicator of symptom severity in patients with BD and thereby a clinically relevant feature for monitoring and analysis.
Interestingly and as hypothesized, the proposed approach of applying hierarchical Bayesian regression models was able to fit the data distributions of the HDRS total score and the YMRS total score and all smartphone-based self-assessment items and accounted for more than 80% of the variance in the data according to R 2 . Using the absolute t-statistic of the populationlevel regression weights as a measure of variable importance, decreased and increased smartphone-based self-reported mood were the most important variables for predicting the severity of depression (HDRS) and mania (YMRS). This is not surprising since sampling of self-reported mood from the patients was designed to collect indicators on the patient's affective state and thus should reflect the clinically rated symptoms. Other important variables in the HDRS total model were decreased sleep and feelings of mixed mood and anxiety, while in the YMRS total model only mood ranked important (see Table 1).
To assess the predictive performance of the hierarchical Bayesian model compared to pooled and separate baseline models, we performed cross-validation experiments of estimating the HDRS total score, the YMRS total score, the HDRS item 1 score and the YMRS item 1 score using all smartphone-based self-assessment items, the four mandatory items and mood self-assessment item alone, respectively. Thus, we were able to estimate the total clinical rating scores using regression models based on Table 3 Results of K = 100 cross-validation experiments with the HDRS item 1 score (left columns) and YMRS item 1 score (right columns) models based on all, mandatory and mood self-assessment items, respectively.
HDRS item 1 score YMRS item 1 score All self-assessment items smartphone-based self-assessments. The hierarchical Bayesian model achieved the best performance in predicting the HDRS total and was significantly better than a naïve model using the observed individual (separate) mean as a prediction (P < 0.001). Similarly, the hierarchical Bayesian model was best at predicting the YMRS total score and was significantly better than the naïve separate mean model. Additionally, we tested models for predicting the first item of the HDRS and the YMRS, indicating mood. The pooled XGBoost model achieved the best result in predicting the HDRS item 1 score, while estimating the YMRS item 1 score could not be improved over the naïve baseline. In all the presented experiments, we found that models based only on self-assessed mood were able to retain most of the predictive performance of models based on all self-assessment items. This further shows that mood is the most important self-reported predictor variable for estimating scores of the HDRS and the YMRS. Overall, the YMRS models did not account for much of the variance in the data, indicated by the low R 2 scores. This could be mainly due to low variation in the observed YMRS data.
In clinical settings of monitoring illness activity in patients with bipolar disorder, detecting individuals with a high risk of relapse is highly important in order to enable intervention. Therefore, a sensitive indication if a symptom severity rating is above a critical threshold might be more useful than estimating the exact value of the severity rating itself. Thus, we demonstrated how uncertainty in the estimated total severity scores can be utilized to compute individual daily risk of relapse scores by considering samples from the posterior predictive distribution of the hierarchical Bayesian model. In the case of both the HDRS and the YMRS, using hierarchical Bayesian approach achieved substantial improvements over naïve models using pooled and separate means of observed data as predictions. Hence, including self-assessments in a regression model provided additional useful information for estimating the level of the clinical severity ratings and hence the relapse risk scores, which is a promising and clinically relevant result.
The findings that a combination of fine-grained daily smartphone-based self-assessment items can be used to estimate and predict clinical ratings are interesting and innovative. Daily longitudinal self-monitoring of mood symptoms gives valuable information of mood fluctuation experienced by patients with BD between clinical outpatient visits. Long-term monitoring of symptoms has been an essential part of the monitoring and treatment of BD for decades 44 and rapidly evolving smartphone technologies have made it possible to monitor symptoms more continuously, fine-grained and in real-time. This can be clinically relevant for detection of symptoms before the first or recurrent depressive or manic episodes 45 , and allow for early intervention on prodromal symptoms. In the latest version of the Diagnostic and Statistical Manual of Mental Disorders (DSM-V), increased activity level or energy is acknowledged as a core feature of hypomania and mania together with mood changes 46 . Several studies using factor analysis have described activation and not mood state as the primary symptom in manic episodes 47,48 . However, in the present study we found mood to be the most important predictor variable for estimating the HDRS and the YMRS severity ratings while activity presented with low importance in both models. Furthermore, sleep disturbances and anxiety has been identified as early symptoms of depression and mania 49,50 , which is in line with our findings in the HDRS model while sleep and anxiety were less important in the YMRS model.

Advantages
The patients included in the present study were clinically well characterized and were receiving treatment or had received treatment at the Copenhagen Clinic for Affective Disorders, Denmark. The clinical evaluations were conducted multiple times during follow-up by experienced researchers with a specific knowledge within BD. The smartphone-based self-assessment system used in the present studies (the Monsenso system) was developed by the authors and has been shown easy to use with a high usability, usefulness, ease of learning to use and interface quality-also when compared with other smartphone-based self-assessment systems 22,51 . The use of smartphones for fine-grained real-time monitoring reduced the risk of recall bias. The proposed hierarchical Bayesian modelling approach is well suited for analysis of small related datasets, especially when the individual datasets are too small to analyse separately. Additionally, the linear regression method and ability to express uncertainty in all estimated quantities makes the model easy to interpret, which is essential in a clinical setting. Overall, the findings from the present study are found to be innovative and generalizable to patients with BD not presenting with an acute affective episode and who are willing to use a monitoring tool during prolonged time periods.

Limitations
The dataset used in this study primarily contained clinical ratings of low severity of affective symptoms indicating most participants did not experience severe symptoms of depression or mania during the study period. Similarly, a large proportion of the self-reported mood scores were close to zero (indicating euthymia) and had low variance. Consequently, the naïve mean baseline models could fit the data well and achieved good performance in the prediction task. However, the best regression model was still significantly better than the naïve mean models, showing that it is possible to utilize smartphone-based self-reported data to produce more accurate estimates of the clinical ratings of symptom severity. Although we saw significant correlations between self-reported mood and the HDRS and the YMRS, respectively, the correlations were weaker than what has been reported in some other studies 45 . Furthermore, the absence of high ratings makes it difficult to reason about the performance of the models in detecting extreme cases, which are the most critical in a monitoring and intervention application.
Our analysis does not explore the distribution of missing data and thus assumes data is missing at random. However, it is reasonable to believe that individuals who are experiencing severe depression or mania have difficulties coping with selfassessment while euthymic individuals find it less relevant. Thus, analysing the missing data distribution might hold valuable information regarding symptom severity which can be explored further.
Lastly, our analysis did not include any temporal information in the models, but rather used smartphone self-assessment data from a given day to estimate clinical ratings on the same day and treated each day independently from other days. Thus, the analysis made no assumptions regarding temporal patterns of mood but relied entirely on relationship between data collected on the same day.

Perspectives and future implications
Smartphones have become a ubiquitous technology in modern society and can be utilized to provide improved and personalized illness management and monitoring in psychiatry. Smartphone-based self-assessment makes data available for immediate analysis and can enable new tools for improved illness monitoring. In particular, accurate, daily estimates of symptom severity could help identify critical cases and enable timely and individualized intervention. Additionally, advances in sensor technology and algorithms is making it possible to extract a growing range of increasingly accurate behavioural features directly from sensor data. Utilizing these automatically generated features to infer symptom severity scores could be used to eliminate the need for frequent, intrusive self-assessments and improve the user experience of illness monitoring systems in psychiatry going forward.
In this paper, we have explored the relationship between smartphone-based self-assessments and clinical ratings observed on the same day with the purpose of identifying current high-risk individuals. A related objective with possible great clinical potential would be to predict individual risk of relapse ahead of time. We see this as an important topic for future studies.

Conclusions
In the present study, clinical ratings of the severity of depression and mania were estimated from smartphonebased self-assessments collected from patients with BD. We found that our approach of applying a hierarchical Bayesian model could estimate severity of depression and mania with low error compared to commonly used baseline methods and within 4 points of RMSE on the HDRS and the YMRS rating scales. Furthermore, we showed how uncertainty in the estimates can be utilized to compute personal relapse risk scores suited for identifying critical cases of patients experiencing severe symptoms and that our approach achieved substantial improvements over naïve pooled and separate mean models. The results presented in this work show that it is feasible to compute daily estimates of clinical severity ratings of depression and mania from smartphone-based selfassessments, which can be used to improve and automate continuous disease monitoring and treatment of BD.