Differential temporal utility of passively sensed smartphone features for depression and anxiety symptom prediction: a longitudinal cohort study

While studies show links between smartphone data and affective symptoms, we lack clarity on the temporal scale, specificity (e.g., to depression vs. anxiety), and person-specific (vs. group-level) nature of these associations. We conducted a large-scale (n = 1013) smartphone-based passive sensing study to identify within- and between-person digital markers of depression and anxiety symptoms over time. Participants (74.6% female; M age = 40.9) downloaded the LifeSense app, which facilitated continuous passive data collection (e.g., GPS, app and device use, communication) across 16 weeks. Hierarchical linear regression models tested the within- and between-person associations of 2-week windows of passively sensed data with depression (PHQ-8) or generalized anxiety (GAD-7). We used a shifting window to understand the time scale at which sensed features relate to mental health symptoms, predicting symptoms 2 weeks in the future (distal prediction), 1 week in the future (medial prediction), and 0 weeks in the future (proximal prediction). Spending more time at home relative to one’s average was an early signal of PHQ-8 severity (distal β = 0.219, p = 0.012) and continued to relate to PHQ-8 at medial (β = 0.198, p = 0.022) and proximal (β = 0.183, p = 0.045) windows. In contrast, circadian movement was proximally related to (β = −0.131, p = 0.035) but did not predict (distal β = 0.034, p = 0.577; medial β = −0.089, p = 0.138) PHQ-8. Distinct communication features (i.e., call/text or app-based messaging) related to PHQ-8 and GAD-7. Findings have implications for identifying novel treatment targets, personalizing digital mental health interventions, and enhancing traditional patient-provider interactions. Certain features (e.g., circadian movement) may represent correlates but not true prospective indicators of affective symptoms. 
Conversely, other features like home duration may be such early signals of intra-individual symptom change, indicating the potential utility of prophylactic intervention (e.g., behavioral activation) in response to person-specific increases in these signals.


INTRODUCTION
Technological advances facilitating personal sensing, or passively collected signals from networked smartphone sensors 1, stand to address critical gaps in measuring and treating affective symptoms. Features assessed using smartphones could signal novel treatment targets; for example, the daily number of calls and texts made may signal changes in social behavior relevant to depression 2. Similarly, personal pronoun use in text messages has been linked with depression and anxiety symptoms [3][4][5], and reductions in I-pronoun use track broad improvements in therapy 6. Incorporating sensed data into clinical care may also enhance shared decision-making 7. For instance, deviations in GPS-location-based features could signal relevant changes to patient depression severity that could trigger a provider notification. Finally, better understanding how personal sensing can be leveraged to reliably signal current or prospective deterioration may address a key question about existing digital mental health interventions 8,9, which is how best to optimize the delivery of intervention components so that the right component is received at the right time, while minimizing user burden [10][11][12][13].
As a foundational step in realizing this potential, studies have evaluated how sensed features relate to affective symptom severity. Prior work shows that different sensor signals such as the number and type (i.e., incoming or outgoing) of phone calls and text messages relate to affective symptoms 14,15. Additional data suggest that the content of text messages predicts mood and anxiety symptoms [3][4][5],16. Even mobile phone keystroke patterns have been associated with mood states 17. Other smartphone signals such as GPS-location-derived features have demonstrated associations with affective symptoms across many different studies 4,14,18,19; however, due to challenges with replication and generalizability, there are calls for these findings to be replicated in larger and more heterogeneous samples 19,20.
Additional challenges stem from the dearth of studies on how temporal characteristics impact observed relationships between sensed features and symptoms, including the data window (i.e., interval over which sensor data are collapsed) and time lag (i.e., time between predictor and outcome measurement). Previous studies have used 24-h data windows to predict mental health outcomes lagged by short timeframes such as 1 h or 1 day 21. Other studies have used slightly larger data windows to predict mental health outcomes at lags of 1 or 2 weeks in the future 3,22,23. The predictive power of different sensor types may be more or less clinically meaningful depending on the data window and time lag used 22. For example, our recent study of text message language features and depression symptom severity found that a data window of 4 weeks was the optimal aggregation for prediction 5. Another example in social media data indicated that using a data window of 2 months to predict depression severity with time lags of between 2 and 4 weeks was the optimal analytic setup 24. Understanding how the relationships between sensed features and affective symptoms change depending on data windows and time lags is essential to informing the clinical utility of sensed data for mental health.
Our primary objective for this study was to evaluate smartphone sensor-based markers that prospectively relate to depression and anxiety symptoms. We examined sensed features' prospective relationships to symptom severity for depression and anxiety, as well as their utility as distal or proximal predictors of affective symptom severity, using a shifting 2-week sensor data window across various time lags to predict future affective symptoms.
Inclusion and exclusion criteria for waves 1 and 2 did not differ. We conducted stratified sampling based on baseline PHQ-8 scores such that a minimum of 50% experienced at least moderate depression symptoms (PHQ-8 ≥ 10). In wave 3, all participants were recruited to have at least moderate depression symptoms (PHQ-8 ≥ 10). Across all waves, participants were required to be at least 18 years old, a U.S. resident, able to read English, and own an Android smartphone with an active data and text messaging plan. Participants were excluded if they self-reported a diagnostic history of bipolar disorder, manic or hypomanic episode, schizophrenia, or other psychotic disorder.

METHODS
Participants were compensated up to $142 for completion of assessments, as well as bonuses delivered at the end of each assessment week for participants who were running the latest version of the app and had transmitted sensed data within the past 2 days.

Fig. 1 Timing of associations between sensed data and affective symptoms. Testing the influence of past 2-week sensor data on subsequent-week depression and anxiety symptoms (1a, medial prediction, 1-week lag), as well as the effect of shifting the sensor data time window on symptom prediction (1b, distal prediction, 2-week lag; and 1c, proximal prediction, 0-week lag). The orange boxes in each panel depict the sliding sensor window across various lag times.

Procedure
After providing written informed consent, participants enrolled in the study for 16 weeks. All participants downloaded the LifeSense app 25, which automatically collected GPS-based sensor data, app and device use data, and communication data from participants' smartphones (see Supplementary Table S1 for a list of sensors used and frequency acquired, consistent with Saeb et al., 2015). Participants responded to web-based surveys (e.g., GAD-7) 26 through the REDCap platform at baseline and every 3 weeks thereafter (i.e., weeks 1, 4, 7, 10, 13, 16) 27,28. Participants also completed PHQ-8 surveys via the LifeSense app at the beginning and end of every third week in the study 29. Because of this cadence, PHQ-8 instructions were modified to ask participants about their symptoms over the past week rather than the past two weeks. All procedures were approved by the Northwestern University Institutional Review Board.

Analytic methods
Multilevel regression models were tested in R using the lmerTest package with maximum likelihood estimation 30. Specifically, we evaluated the associations of clustered sensor features aggregated over a 2-week window (see Supplementary Table S2 for details on clustering) with subsequent depression and anxiety symptoms. The 2-week window was selected for three reasons: to permit sufficient density of sensor data, to align with gold-standard assessments of depression and anxiety symptoms that ask about the past 2 weeks 26,29, and to be consistent with prior sensing studies 4,31,32. The prediction window was shifted such that three different models were tested for each outcome: (1) medial prediction is at a 1-week lag (Fig. 1a), (2) distal prediction is shifted back 1 week for a 2-week lag (Fig. 1b), and (3) proximal prediction is shifted forward 1 week for a 0-week lag (Fig. 1c).
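The shifting window can be illustrated with a small Python sketch. It is not the study's code (the analyses were run in R); the function name `sensor_window`, the `LAGS` mapping, and the convention of indexing windows by whole study weeks are assumptions made here for illustration. The sketch assumes a 2-week sensor window whose last week falls `lag_weeks` before the assessment week.

```python
# Illustrative sketch only: names and week-indexing convention are assumptions.
LAGS = {"distal": 2, "medial": 1, "proximal": 0}  # lag in weeks, per the three models
WINDOW_WEEKS = 2  # length of the sensor data window


def sensor_window(assessment_week, lag_weeks, window_weeks=WINDOW_WEEKS):
    """Return (first_week, last_week) of sensor data paired with one assessment.

    The window ends `lag_weeks` before the assessment week and spans
    `window_weeks` consecutive study weeks.
    """
    last_week = assessment_week - lag_weeks
    first_week = last_week - window_weeks + 1
    return first_week, last_week


if __name__ == "__main__":
    # Sensor weeks that would be paired with a week-4 symptom assessment.
    for name, lag in LAGS.items():
        print(name, sensor_window(4, lag))
```

Under these assumptions, only the 0-week lag produces a window that includes the assessment week itself. A window whose first week is earlier than week 1 has no sensor data behind it, so such assessment-lag pairs would have to be dropped.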
While there was no overlap between the sensor window and symptom reporting for distal or medial prediction, proximal prediction involved taking sensor data from the week immediately before and the week concurrent with symptom reporting (e.g., weeks 3 and 4 of sensor data predicting the week 4 symptom assessment). Sensor predictors were person-mean centered, and for each sensor predictor, both a person mean term and a within-person deviation term were included in the model.
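Person-mean centering splits each sensor predictor into a between-person term (the participant's own average) and a within-person term (the deviation of each window from that average). The following Python sketch shows the decomposition only; it is not the study's R code, and the function name, the tuple layout, and the example values are hypothetical. Whether the person means were additionally grand-mean centered is not stated here, so the sketch leaves them raw.

```python
# Illustrative sketch only: function name, data layout, and values are hypothetical.
from collections import defaultdict
from statistics import mean


def person_mean_center(rows):
    """Split a predictor into person-mean and within-person deviation terms.

    rows: list of (person_id, value) pairs, one per sensor window.
    Returns a list of (person_id, person_mean, deviation) in the same order.
    """
    values_by_person = defaultdict(list)
    for person_id, value in rows:
        values_by_person[person_id].append(value)

    # Between-person term: each participant's average across their windows.
    person_means = {pid: mean(vals) for pid, vals in values_by_person.items()}

    # Within-person term: how far each window sits from that participant's average.
    return [(pid, person_means[pid], value - person_means[pid]) for pid, value in rows]


if __name__ == "__main__":
    # Hypothetical home-duration values for two participants.
    example = [("p1", 10.0), ("p1", 14.0), ("p2", 20.0)]
    for row in person_mean_center(example):
        print(row)
```

Entering both terms in the same model is what allows a coefficient to be read as "relative to one's own average" (the deviation term) versus "compared with other people" (the person-mean term).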

Primary results
Tables 2 (PHQ-8) and 3 (GAD-7) present results for all within-person and between-person effects of sensor data on symptoms over time; for parsimony, only features with at least some significant relationships to outcomes are described below in the text.

Location features.
Spending more time at home relative to one's own average (i.e., within-person) was associated with increased future PHQ-8 severity across prediction windows (distal β = 0.219, p = 0.012; medial β = 0.198, p = 0.022; proximal β = 0.183, p = 0.045). Within-person time spent at home was not significantly associated with GAD-7 severity across any of the time windows (Table 3). We observed no evidence that between-person effects for time spent at home were related to PHQ-8 or GAD-7 severity. People with greater GPS variability and mobility had less severe next-week PHQ-8 scores (medial β = −0.503, p = 0.046), but this signal was absent for distal (β = −0.464, p = 0.073) and proximal (β = −0.424, p = 0.093) associations. Two other sensed location features were reflective of near- or medial-term PHQ-8 severity but did not predict PHQ-8 severity far in the future. First, people spending time in more frequently visited venues relative to their own average were likely to have lower impending or concurrent PHQ-8 scores (medial β = −0.185, p = 0.003; proximal β = −0.168, p = 0.007); however, going to more frequently visited venues did not prospectively predict PHQ-8 severity in the more distant future (distal β = −0.064, p = 0.308). Second, people who showed more circadian movement (i.e., regularity in 24-h movement patterns) relative to their own average just before and at the time of reporting depression symptoms had less severe PHQ-8 scores than those who showed less circadian movement (proximal β = −0.131, p = 0.035); however, circadian movement did not prospectively predict PHQ-8 severity (distal β = 0.034, p = 0.577; medial β = −0.089, p = 0.138).
Communication features.
People spending more time on messaging apps relative to their own average reported more severe impending or concurrent PHQ-8 symptoms (proximal β = 0.162, p = 0.015), but this effect was non-significant for distal (β = 0.059, p = 0.385) and medial (β = 0.115, p = 0.083) prediction. While we did not see a significant association between within-person app-based messaging and GAD-7 at any of the time points, people engaging in more app-based messaging at the between-person level were more likely to report higher distal (β = 0.486, p = 0.041) and medial (β = 0.481, p = 0.046) GAD-7 severity; however, the association of between-person app-based messaging and GAD-7 severity was non-significant for proximal prediction (β = 0.466, p = 0.053). Additionally, calling and texting more relative to one's own average was associated with GAD-7 severity across all prediction windows (distal β = 0.279, p = 0.005; medial β = 0.386, p < 0.001; proximal β = 0.293, p = 0.003). There were no significant associations between PHQ-8 and call/text-based communication at either the within-person or between-person level.

DISCUSSION
In the present study, we aimed to identify passively sensed digital markers that relate to future depression and anxiety symptoms at both the within-person and between-person levels, and across multiple time windows. Location features were more strongly linked with depression symptoms, whereas communication features related to both depression and anxiety. Results highlighted the importance of the prediction lag in understanding personally sensed signals of affective symptoms: certain features (e.g., time spent at home) were consistent predictors of symptom severity across more distal and more proximal prediction windows, whereas others (e.g., circadian movement) were only associated with next-week or current symptoms. Overall, location features, and time spent at home in particular, were more strongly linked with depression symptoms than anxiety symptoms. The most robust predictor of depression symptoms was spending more time at home relative to one's own average, which signaled that a participant was likely to report increases in depressive symptoms 1-3 weeks later. This aligns with meta-analytic evidence indicating that greater time spent at home is one of the sensed features that most consistently relates to depression 14. Broadly, spending more time at home may be reflective of reductions in motivation or hedonic capacity 33; if this is the case, the finding that increases in time spent at home relate to future depression symptoms would align with the notion of anhedonia as an endophenotype of depression 34.

Table 3. Multilevel model results predicting GAD-7 from sensing data across shifting prediction windows.
Columns: Predictor; Sensing predicting GAD-7 with 2-week lag (R² = 0.058); Sensing predicting GAD-7 with 1-week lag (R² = 0.056); Sensing predicting GAD-7 with 0-week lag.

In contrast to location features, communication features related to both depression and anxiety symptoms, with a dissociation for communication type: messaging apps signaled impending depression, and both messaging apps and calling/texting signaled future anxiety. Social media messaging apps are feature-rich 35, such that their usage may reflect a range of different behaviors related to depression (e.g., "doomscrolling"; engaging in social comparison; ruminating; checking to see why others didn't respond to a message), and they tend to involve indirect conversations about a shared visual stimulus. Conversely, calling and texting are feature-poor and primarily facilitate direct communication with others 35; in the context of anxiety, within-person increases in these forms of communication may signal greater activation or reassurance seeking. In general, there were more consistent associations of communication data with anxiety symptoms than depression symptoms across prediction windows and communication modalities, suggesting that changes in communication (like changes in home duration for depression) may be an especially useful signal for understanding anxiety. While studies have linked changes in calling and texting with depression symptoms in bipolar disorder 36,37, the absence of an association with depression in our study aligns with prior research reporting null findings around communication changes in unipolar depression 31,38. Continued replication of these null findings may suggest that changes in call- and text-based communication are not a useful proxy for the social withdrawal and decreased motivational processes that characterize depression symptoms 39.
By using multilevel models to disaggregate within- and between-person effects over time, we identified differential relationships of sensed features with affective symptoms across time windows that have implications for identifying novel treatment targets, personalizing digital mental health interventions, and enhancing traditional patient-provider interactions 12. One of the predominant hypothesized methods for bringing personalized digital mental health interventions to fruition is understanding how personal sensing can be leveraged to reliably signal current or prospective worsening symptoms 8,9. Our findings underscore that the sensing context and timing (i.e., prediction lag) are critical factors impacting the utility of sensed features as a marker of affective symptoms. For example, prior studies have shown a broad correlation between circadian movement and depression symptoms 31,32. Given that within-person changes in circadian movement occur immediately before and contemporaneously with depression rather than predicting symptoms further in the future, interventions in response to decreased circadian movement may benefit from strategies focused on more immediate or impending depression symptoms. Conversely, in light of the prospective, within-person relationships between time at home and depression severity, developers may consider deploying prophylactic depression-focused content (e.g., behavioral activation) in response to person-specific increases in these signals. Finally, features that are significantly related to symptoms primarily at the between-person level (e.g., launcher use with PHQ-8 or app-based messaging with GAD-7) are unlikely to be helpful signals for individualized intervention or as signals of deterioration.
It is important to consider these implications in the context of the low overall amount of variance explained (approximately 5-6% across the different outcomes and lags), as compared to the larger effect sizes seen in early sensing studies, generally in small samples 4,31,32. While we opted to use multilevel models for explainability, future studies may consider machine learning models to optimize variance explained in light of the high dimensionality of sensor data 40,41; these models may also provide greater insight into prediction accuracy metrics (e.g., rates of false positives and false negatives) to inform algorithms designed to prospectively predict clinical symptoms. Additionally, although we lagged sensors and symptom assessments, these data are still correlational and should not be interpreted as implying causality. To the best of our knowledge, there has been no research to date that has attempted to change these sensed constructs through targeted interventions, which would provide stronger evidence of potential causality. It will also be important for future studies to vary the sensor data window (which we kept consistent at 2 weeks) along with the lag to determine impacts on predictive power, and to better understand the impact of missing data over time on observed relationships. Further, the declaration of a national emergency due to COVID-19 in March 2020 occurred partway through our second wave of data collection. We did not see differences across waves substantial enough to warrant separate analysis by wave. However, the variability in the environment since the onset of COVID-19 may have tempered some of the associations between certain features (e.g., geographic location) and symptoms due to changing routines. Additional limitations are the differences in delivery mechanism and timeframe of reporting instructions for the GAD-7 (REDCap; past 2 weeks) and PHQ-8 (in-app; past week), which may have influenced responses. Finally, given the relative lack of demographic diversity in our sample, it will be important for future studies to test whether these findings generalize across more diverse populations.
Overall, findings from this large-scale mobile sensing study point to location features as important in predicting depression symptoms, and communication features in predicting both depression and anxiety symptoms. The multilevel, longitudinal approach allowed us to identify that features such as home duration were true prospective markers of intraindividual change in depression symptoms, whereas others, such as circadian movement, may be more indicative of impending or concurrent depression symptoms.
Participants
Participants were recruited in 3 waves, with a total of 1,093 enrolled. Participants in wave 1 (July-September 2019) were recruited from the Center for Behavioral Intervention Technologies (CBITs) Health research registry and ResearchMatch.org, a national health volunteer registry supported by the National Institutes of Health. Participants in wave 2 (February-April 2020) were recruited from the CBITs Health and ResearchMatch.org registries, as well as from Focus Pointe Global, a market research data collection company. Participants in wave 3 (January-April 2021) were recruited from digital advertisements (e.g., posts on Instagram, Facebook, Twitter, craigslist, etc.), the CBITs Health and ResearchMatch.org registries, and Focus Pointe Global.

Table 2. Multilevel model results predicting PHQ-8 from sensing data across shifting prediction windows.