Introduction

Technological advances facilitating personal sensing, or passively collected signals from networked smartphone sensors1, stand to address critical gaps in measuring and treating affective symptoms. Features assessed using smartphones could signal novel treatment targets; for example, the daily number of calls and texts made may signal changes in social behavior relevant to depression2. Similarly, personal pronoun use in text messages has been linked with depression and anxiety symptoms3,4,5, and reductions in I-pronoun use track broad improvements in therapy6. Incorporating sensed data into clinical care may also enhance shared decision-making7. For instance, deviations in GPS-location-based features could signal relevant changes to patient depression severity that could trigger a provider notification. Finally, better understanding how personal sensing can be leveraged to reliably signal current or prospective deterioration may address a key question about existing digital mental health interventions8,9, which is how best to optimize the delivery of intervention components so that the right component is received at the right time, while minimizing user burden10,11,12,13.

As a foundational step in realizing this potential, studies have evaluated how sensed features relate to affective symptom severity. Prior work shows that different sensor signals such as the number and type (i.e., incoming or outgoing) of phone calls and text messages relate to affective symptoms14,15. Additional data suggest that the content of text messages predicts mood and anxiety symptoms3,4,5,16. Even mobile phone keystroke patterns have been associated with mood states17. Other smartphone signals such as GPS-location-derived features have demonstrated associations with affective symptoms across many different studies4,14,18,19; however, due to challenges with replication and generalizability, there are calls for these findings to be replicated in larger and more heterogeneous samples19,20.

Additional challenges stem from the dearth of studies on how temporal characteristics impact observed relationships between sensed features and symptoms, including the data window (i.e., interval over which sensor data are collapsed) and time lag (i.e., time between predictor and outcome measurement). Previous studies have used 24-h data windows to predict mental health outcomes lagged by short timeframes such as 1 h or 1 day21. Other studies have used slightly larger data windows to predict mental health outcomes at lags of 1 or 2 weeks in the future3,22,23. The predictive power of different sensor types may be more or less clinically meaningful depending on the data window and time lag used22. For example, our recent study of text message language features in relation to depression symptom severity found that a 4-week data window was the optimal aggregation for prediction5. Another example, using social media data, indicated that a 2-month data window with time lags of 2 to 4 weeks was the optimal analytic setup for predicting depression severity24. Understanding how the relationships between sensed features and affective symptoms change depending on data windows and time lags is essential to informing the clinical utility of sensed data for mental health.

Our primary objective for this study was to evaluate smartphone sensor-based markers that prospectively relate to depression and anxiety symptoms. We examined sensed features’ prospective relationships to symptom severity for depression and anxiety, as well as their utility as distal or proximal predictors of affective symptom severity, using a shifting 2-week sensor data window across various time lags to predict future affective symptoms.

Methods

Participants

Participants were recruited in 3 waves, with a total of 1,093 enrolled. Participants in wave 1 (July–September 2019) were recruited from the Center for Behavioral Intervention Technologies (CBITs) Health research registry and ResearchMatch.org, a national health volunteer registry supported by the National Institutes of Health. Participants in wave 2 (February–April 2020) were recruited from the CBITs Health and ResearchMatch.org registries, as well as from Focus Pointe Global, a market research data collection company. Participants in wave 3 (January–April 2021) were recruited from digital advertisements (e.g., posts on Instagram, Facebook, Twitter, craigslist, etc.), the CBITs Health and ResearchMatch.org registries, and Focus Pointe Global.

Inclusion and exclusion criteria for waves 1 and 2 did not differ. In these waves, we conducted stratified sampling based on baseline PHQ-8 scores such that a minimum of 50% of participants experienced at least moderate depression symptoms (PHQ-8 ≥ 10). In wave 3, all participants were recruited to have at least moderate depression symptoms (PHQ-8 ≥ 10). Across all waves, participants were required to be at least 18 years old, be a U.S. resident, be able to read English, and own an Android smartphone with an active data and text messaging plan. Participants were excluded if they self-reported a diagnostic history of bipolar disorder, a manic or hypomanic episode, schizophrenia, or another psychotic disorder.

Participants were compensated up to $142 for completion of assessments, as well as bonuses delivered at the end of each assessment week for participants who were running the latest version of the app and had transmitted sensed data within the past 2 days.

Procedure

After providing written informed consent, participants enrolled in the study for 16 weeks. All participants downloaded the LifeSense app25, which automatically collected GPS-based sensor data, app and device use data, and communication data from participants’ smartphones (see Supplementary Table S1 for a list of sensors used and frequency acquired, consistent with Saeb et al., 2015). Participants responded to web-based surveys (e.g., GAD-7)26 through the REDCap platform at baseline and every 3 weeks thereafter (i.e., weeks 1, 4, 7, 10, 13, 16)27,28. Participants also completed PHQ-8 surveys via the LifeSense app at the beginning and end of every third week in the study29. Because of this cadence, PHQ-8 instructions were modified to ask participants about their symptoms over the past week rather than the past 2 weeks. All procedures were approved by the Northwestern University Institutional Review Board.

Analytic methods

Multilevel regression models were tested in R using the lmerTest package with maximum likelihood estimation30. Specifically, we evaluated the associations of clustered sensor features aggregated over a 2-week window (see Supplementary Table S2 for details on clustering) with subsequent depression and anxiety symptoms. The 2-week window was selected for three reasons: to permit sufficient density of sensor data, to align with gold-standard assessments of depression and anxiety symptoms that ask about the past 2 weeks26,29, and to be consistent with prior sensing studies4,31,32. The prediction window was shifted such that three different models were tested for each outcome: (1) medial prediction, at a 1-week lag (Fig. 1a); (2) distal prediction, shifted back 1 week for a 2-week lag (Fig. 1b); and (3) proximal prediction, shifted forward 1 week for a 0-week lag (Fig. 1c). While there was no overlap between the sensor window and symptom reporting for distal or medial prediction, proximal prediction involved taking sensor data from the week immediately before and the week concurrent with symptom reporting (e.g., weeks 3 and 4 of sensor data predicting the week 4 symptom assessment). Sensor predictors were person-mean centered, and for each sensor predictor, both a person mean term and a within-person deviation term were included in the model. Additional model terms included time (week; centered around zero), the random intercept, and the demographic covariates of age (centered), gender, and urbanicity/rurality. See Supplementary Materials for more detail on modeling.
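The window shifting and person-mean centering described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code (the analyses were run in R with lmerTest): the function names `sensor_window` and `person_mean_center`, the constants, and the example feature values are all hypothetical. It assumes the sensor window ends `lag` weeks before the assessment week, which matches the proximal example in the text (weeks 3 and 4 predicting the week 4 assessment).

```python
from statistics import mean

# Lags in weeks for the three prediction setups (from the text).
LAGS = {"distal": 2, "medial": 1, "proximal": 0}
WINDOW_WEEKS = 2  # sensor data are aggregated over a 2-week window


def sensor_window(assessment_week, lag):
    """Return the study weeks whose sensor data feed one symptom assessment.

    Assumption: the window's last week is `assessment_week - lag`.
    """
    last = assessment_week - lag
    first = last - WINDOW_WEEKS + 1
    return list(range(first, last + 1))


def person_mean_center(values):
    """Split one person's windowed feature values into a person mean
    (between-person term) and deviations from that mean (within-person term)."""
    person_mean = mean(values)
    deviations = [v - person_mean for v in values]
    return person_mean, deviations


if __name__ == "__main__":
    for name, lag in LAGS.items():
        print(name, sensor_window(4, lag))
    # Hypothetical daily hours at home, averaged within three windows.
    print(person_mean_center([10.0, 12.0, 14.0]))
```

Under these assumptions, a week 4 assessment draws on weeks 1 and 2 (distal), weeks 2 and 3 (medial), or weeks 3 and 4 (proximal). In the model, the person mean would enter as the between-person predictor and each deviation as the within-person predictor.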

Fig. 1: Timing of associations between sensed data and affective symptoms.

Testing the influence of sensor data from the past 2 weeks on subsequent-week depression and anxiety symptoms (1a, medial prediction, 1-week lag), as well as the effect of shifting the sensor data time window on symptom prediction (1b, distal prediction, 2-week lag; and 1c, proximal prediction, 0-week lag). The orange boxes in each panel depict the sliding sensor window across various lag times.

Results

Data aggregation and demographics

Data were available from 1013 participants (74.6% female; mean age = 40.9 years [SD = 12.7]), including a total of 4731 PHQ-8 scores (of 5065 possible; 6.59% missing) and 4649 GAD-7 scores (of 5065 possible; 8.21% missing). Table 1 contains complete demographic data.
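The missingness percentages above follow directly from the reported counts, as the short check below shows. The helper name `percent_missing` is hypothetical. The observation that 5065 equals five assessments for each of the 1013 participants is an inference from the arithmetic, not something the text states.

```python
def percent_missing(observed, possible):
    """Percentage of possible scores that were not observed, to 2 decimals."""
    return round(100 * (possible - observed) / possible, 2)


participants = 1013
possible = 5065

# 5065 / 1013 = 5.0, i.e. five possible scores per participant (inferred).
print(possible / participants)
print(percent_missing(4731, possible))  # PHQ-8: 6.59
print(percent_missing(4649, possible))  # GAD-7: 8.21
```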

Table 1 Demographic data.

Primary results

Tables 2 (PHQ-8) and 3 (GAD-7) present results for all within-person and between-person effects of sensor data on symptoms over time; for parsimony, only features with at least some significant relationships to outcomes are described below in the text.

Table 2 Multilevel model results predicting PHQ-8 from sensing data across shifting prediction windows.

Location features

Spending more time at home relative to one’s own average (i.e., within-person) was associated with increased future PHQ-8 severity across prediction windows (distal β = 0.219, p = 0.012; medial β = 0.198, p = 0.022; proximal β = 0.183, p = 0.045). Within-person time spent at home was not significantly associated with GAD-7 severity across any of the time windows (Table 3). We observed no evidence that between-person effects for time spent at home were related to PHQ-8 or GAD-7 severity. People with greater GPS variability and mobility had less severe next-week PHQ-8 scores (medial β = −0.503, p = 0.046), but this signal was absent for distal (β = −0.464, p = 0.073) and proximal (β = −0.424, p = 0.093) associations.

Table 3 Multilevel model results predicting GAD-7 from sensing data across shifting prediction windows.

Two other sensed location features were reflective of near- or medial-term PHQ-8 severity but did not predict PHQ-8 severity far in the future. First, people spending time in more frequently visited venues relative to their own average were likely to have lower impending or concurrent PHQ-8 scores (medial β = −0.185, p = 0.003; proximal β = −0.168, p = 0.007); however, going to more frequently visited venues did not prospectively predict PHQ-8 severity in the more distant future (distal β = −0.064, p = 0.308). Second, people who showed more circadian movement (i.e., regularity in 24-h movement patterns) relative to their own average just before and at the time of reporting depression symptoms had less severe PHQ-8 scores than those who showed less circadian movement (proximal β = −0.131, p = 0.035); however, circadian movement did not prospectively predict PHQ-8 severity (distal β = 0.034, p = 0.577; medial β = −0.089, p = 0.138).

Communication features

People spending more time on messaging apps relative to their own average reported more severe impending or concurrent PHQ-8 symptoms (proximal β = 0.162, p = 0.015), but this effect was non-significant for distal (β = 0.059, p = 0.385) and medial (β = 0.115, p = 0.083) prediction. While we did not see a significant association between within-person app-based messaging and GAD-7 at any of the time points, people engaging in more app-based messaging at the between-person level were more likely to report higher distal (β = 0.486, p = 0.041) and medial (β = 0.481, p = 0.046) GAD-7 severity; however, the association of between-person app-based messaging and GAD-7 severity was non-significant for proximal prediction (β = 0.466, p = 0.053). Additionally, calling and texting more relative to one’s own average was associated with GAD-7 severity across all prediction windows (distal β = 0.279, p = 0.005; medial β = 0.386, p < 0.001; proximal β = 0.293, p = 0.003). There were no significant associations between PHQ-8 and call/text-based communication at either the within-person or between-person level.

Other phone use features

People who used the launcher more on average had lower PHQ-8 scores across time windows (distal β = −0.596, p = 0.008; medial β = −0.525, p = 0.018; proximal β = −0.653, p = 0.004). When people used the launcher more relative to their own average, they reported lower impending or concurrent PHQ-8 scores (proximal β = −0.161, p = 0.023). Launcher use was not found to be associated with GAD-7 severity at the within or between person level. People who on average had more screen-on time tended to have greater distal (β = 0.503, p = 0.016) and proximal (β = 0.541, p = 0.012) PHQ-8 severity; however, this association was non-significant for next-week prediction (medial β = 0.272, p = 0.196).

Demographic effects

Higher PHQ-8 and GAD-7 severity were found for younger people (β: [0.573–1.163], p: [<0.001–0.001]) and women (β: [0.360–0.563], p: [0.001–0.036]). People living in rural areas reported higher GAD-7 (β: [0.520–0.532], p: [0.002–0.002]), but not PHQ-8 (β: [0.307–0.320], p: [0.058–0.068]).

Time effects

There was a significant fixed effect of time, such that people reported decreasing PHQ-8 and GAD-7 severity over the course of the study (β: [−0.107 to −0.183], p: [<0.001 to <0.001]).

Overall variability explained

The models explained a modest amount of overall variability in PHQ-8 (distal R² = 0.049; medial R² = 0.048; proximal R² = 0.053) and GAD-7 (distal R² = 0.058; medial R² = 0.056; proximal R² = 0.057) symptom severity.

Discussion

In the present study, we aimed to identify passively sensed digital markers that relate to future depression and anxiety symptoms at both the within-person and between-person levels, and across multiple time windows. Location features were more strongly linked with depression symptoms, whereas communication features related to both depression and anxiety. Results highlighted the importance of the prediction lag in understanding personally sensed signals of affective symptoms: certain features (e.g., time spent at home) were consistent predictors of symptom severity across more distal and more proximal prediction windows, whereas others (e.g., circadian movement) were only associated with next-week or current symptoms.

Overall, location features—and time spent at home in particular—were more strongly linked with depression symptoms than anxiety symptoms. The most robust predictor of depression symptoms was spending more time at home relative to one’s own average, which signaled that a participant was likely to report increases in depressive symptoms 1–3 weeks later. This aligns with meta-analytic evidence indicating that greater time spent at home is one of the sensed features that most consistently relates to depression14. Broadly, spending more time at home may be reflective of reductions in motivation or hedonic capacity33; if this is the case, the finding that increases in time spent at home relate to future depression symptoms would align with the notion of anhedonia as an endophenotype of depression34.

In contrast to location features, communication features related to both depression and anxiety symptoms, with a dissociation for communication type: messaging apps signaled impending depression, and both messaging apps and calling/texting signaled future anxiety. Social media messaging apps are feature-rich35, such that their usage may reflect a range of different behaviors related to depression (e.g., “doomscrolling”; engaging in social comparison; ruminating; checking to see why others didn’t respond to a message), and they tend to involve indirect conversations about a shared visual stimulus. Conversely, calling and texting are feature-poor and primarily facilitate direct communication with others35; in the context of anxiety, within-person increases in these forms of communication may signal greater activation or reassurance seeking. In general, there were more consistent associations of communication data with anxiety symptoms than depression symptoms across prediction windows and communication modalities, suggesting that changes in communication, like changes in home duration for depression, may be an especially useful signal for understanding anxiety. While studies have linked changes in calling and texting with depression symptoms in bipolar disorder36,37, the absence of an association with depression in our study aligns with prior research reporting null findings around communication changes in unipolar depression31,38. Continued replication of these null findings may suggest that changes in call- and text-based communication are not a useful proxy for the social withdrawal and decreased motivational processes that characterize depression symptoms39.

By using multilevel models to disaggregate within- and between-person effects over time, we identified differential relationships of sensed features with affective symptoms across time windows that have implications for identifying novel treatment targets, personalizing digital mental health interventions, and enhancing traditional patient-provider interactions12. One of the predominant hypothesized methods for bringing personalized digital mental health interventions to fruition is understanding how personal sensing can be leveraged to reliably signal current or prospective worsening symptoms8,9. Our findings underscore that the sensing context and timing (i.e., prediction lag) are critical factors impacting the utility of sensed features as a marker of affective symptoms. For example, prior studies have shown a broad correlation between circadian movement and depression symptoms31,32. Given that within-person changes in circadian movement occur immediately before and contemporaneously with depression rather than predicting symptoms further in the future, interventions in response to decreased circadian movement may benefit from strategies focused on more immediate or impending depression symptoms. Conversely, in light of the prospective, within-person relationships between time at home and depression severity, developers may consider deploying prophylactic depression-focused content (e.g., behavioral activation) in response to person-specific increases in these signals. Finally, features that are significantly related to symptoms primarily at the between-person level (e.g., launcher use with PHQ-8 or app-based messaging with GAD-7) are unlikely to be helpful signals for individualized intervention or as signals of deterioration.

It is important to consider these implications in the context of the low overall amount of variance explained (approximately 5–6% across the different outcomes and lags), as compared to the larger effect sizes seen in early sensing studies, generally in small samples4,31,32. While we opted to use multilevel models for explainability, future studies may consider machine learning models to optimize variance explained in light of the high dimensionality of sensor data40,41; these models may also provide greater insight into prediction accuracy metrics (e.g., rates of false positives and false negatives) to inform algorithms designed to prospectively predict clinical symptoms. Additionally, although we lagged sensors and symptom assessments, these data are still correlational and should not be interpreted as implying causality. To the best of our knowledge, there has been no research to date that has attempted to change these sensed constructs through targeted interventions, which would provide stronger evidence of potential causality. It will also be important for future studies to vary the sensor data window, which we kept consistent at 2 weeks, along with the lag to determine impacts on predictive power, and to better understand the impact of missing data over time on observed relationships. Further, the declaration of a national emergency due to COVID-19 in March 2020 occurred partway through our second wave of data collection. We did not see differences across waves substantial enough to warrant separate analysis by wave. However, the variability in the environment since the onset of COVID-19 may have tempered some of the associations between certain features (e.g., geographic location) and symptoms due to changing routines. Additional limitations are the differences in delivery mechanism and timeframe of reporting instructions for the GAD-7 (REDCap; past 2 weeks) and PHQ-8 (in-app; past week), which may have influenced responses. Finally, given the relative lack of demographic diversity in our sample, it will be important for future studies to test whether these findings generalize across more diverse populations.

Overall, findings from this large-scale mobile sensing study point to location features as important in predicting depression symptoms, and communication features in predicting both depression and anxiety symptoms. The multilevel, longitudinal approach allowed us to identify that features such as home duration were true prospective markers of intraindividual change in depression symptoms, whereas others, such as circadian movement, may be more indicative of impending or concurrent depression symptoms.