Main

Depression is a leading contributor to the burden of disability worldwide1,2, and there is some evidence that disability attributed to depression is rising, particularly among young people3,4. A key challenge in reducing the prevalence of depression has been that it is often under-recognized5 as well as undertreated6.

Cognitive behavioural therapy (CBT) is the most widely researched psychotherapy for depression. It is equivalent to antidepressant medications in its short-term efficacy and shows superior outcomes in the long term7,8. In CBT, therapists work with their clients to identify depressogenic thinking patterns through lexical or verbal markers of rigid, distorted or overly negative interpretations9,10. For example, statements that include ‘should’ or ‘must’ are often challenged as reflecting overly rigid rules about the world (‘I shouldn’t be lazy’, ‘I must never fail’). This process often entails a series of conversations with the client to uncover and address statements that reflect these cognitive distortions.

The cognitive theory that underlies CBT argues that the ways in which individuals process and interpret information about themselves and their world are directly related to the onset, maintenance and recurrence of their depression11,12. This model is consistent with information-processing accounts of mood regulation13 and its dynamics14, as well as basic research that supports the role of cognitive reappraisal and language in emotion regulation15,16,17,18.

However, the critical assumption at the foundation of CBT, namely that depression is associated with changes in language that are indicative of distorted thinking, has not been directly confirmed from studies of the language of individuals with depression in real-world settings.

The idea that depression is associated with changes in language is supported by previous research. Specifically, it has been shown that individuals with depression more frequently use a variety of terms that describe negative emotions19,20,21, first-person pronouns (FPPs)21,22,23,24,25, common symptoms26 and linguistic inquiry and word count (LIWC) categories deemed to correspond to ‘absolutist’ language27. Machine learning approaches have shown good performance with respect to predicting whether social media users have depression28,29,30, identifying the most useful term features to render a prediction.

In this Article, we refine and expand on these data-driven approaches along several fronts. First, we empirically verified a crucial tenet of CBT theory, namely that individuals with depression, in their thinking, exhibit higher levels of cognitive distortions as conceived by CBT. This is distinct from attempting to estimate the morbidity of depression itself in the general population or to algorithmically discover any set of features that is useful to predict depression. Second, rather than sampling from data obtained in a clinical setting, possibly confounding the context of a specific therapeutic approach, we relied on naturalistic language recorded in an ex post hoc manner from large samples of social media users. Third, we conducted our analysis on the basis of a set of context-free semantic schemata (n-grams) that encode the semantics of patterns of thought, that is, cognitive distortions as hypothesized by CBT, not individual terms or features. In other words, we captured the structure of thought behind CBT’s notion of cognitive distortions. This is distinct from previous research that used term features that are either derived from general lexicons or discovered by supervised machine learning algorithms.

We compared the prevalence of a set of 241 cognitive distortion schemata (CDS)—patterns of thought represented by sequences of words (n-grams)—in the language of a large cohort of individuals with depression versus a random sample on social media (Twitter), excluding institutions and organizational accounts (see the ‘Data and sample construction’ section in the Methods). We show a set of examples of these CDS in Table 1. We designed our method to be platform-independent, but we chose Twitter because it (1) is a fast-paced real-time medium with hundreds of millions of active users (posting daily or regularly) who use colloquial language in a short text format that is especially suitable for our approach and (2) has been active since 2006, providing comprehensive longitudinal data spanning more than a decade.

Table 1 Common types of cognitive distortions associated with depression65 and their definitions

For our analysis, we built two cohorts of individuals: individuals with depression (D cohort) and a cohort of randomly selected individuals (R cohort). For our D cohort, following Coppersmith et al.31, we identified a cohort of social media users who (1) received a clinical diagnosis of depression and (2) posted an explicit report of this diagnosis on Twitter, that is, by stating a variant of ‘I was diagnosed with depression by my doctor’ (Methods). An overview of this approach is shown in Fig. 1.

Fig. 1: Cohort of individuals with depression.

We identified a cohort of individuals with depression who (1) received a clinical diagnosis of depression and (2) explicitly reported this diagnosis on social media using a variant of the statement ‘I was diagnosed with depression by my doctor’. (3) A team of experts rated each statement to ensure that the statement actually reports a personal, clinical diagnosis of depression, after which (4) the individual’s timeline (all tweets up to the limit allowed by the Twitter data service: the 3,200 most recent tweets) was downloaded and added to our analysis cohort. Twitter, tweet, retweet, and the Twitter logo are trademarks of Twitter, Inc. or its affiliates.

Supporting an important assumption underlying CBT, our results indicate that there is a significantly higher prevalence of most types of distorted thinking, marked by a set of CDS n-grams, in the individuals with depression, at both the within-individual and between-cohort levels. Notably, CDS in the ‘personalizing’ and ‘emotional reasoning’ types occur approximately two times more frequently in the online language of individuals with depression. We verified whether our results could be explained by gender or age differences, random variations in our user sample, our particular choice of CDS n-grams, the sentiment loadings of our CDS set and the known propensity of individuals with depression to make self-referential statements (see the ‘Robustness’ section). In all cases, we continued to find much higher levels of certain types of distorted thinking in the language of individuals with depression than in the random sample of online individuals.

We emphasize that, in contrast to some previous research, our goal was not to detect or classify users with depression on Twitter, but to compare the prevalence of expressions of cognitive distortions in the language of users who personally report having a diagnosis with that of users who do not.

Results

Sample demographics

The age and gender distributions of our D and R cohorts align with previous studies32,33,34, as indicated by the M3 classifier35 that we used to predict individuals’ gender (M3 Macro-F1: 0.915) and age (M3 Macro-F1: 0.425) categories. As shown in Table 2, our D cohort has a 2:1 female-to-male ratio similar to that observed in clinical depression studies32,33, indicating that the demographics of our Twitter cohort closely match previous clinical findings that women are twice as likely to be diagnosed with depression as men. Note that this gender disparity was not found to be associated with differences in the language used to express depression or depressive symptomologies in women versus men36,37. The indicated age distribution of our D cohort (although less reliable, Macro-F1: 0.425) is also consistent with clinical studies32,34; specifically, the number of individuals in the D cohort decreases with each successive age group. Our subsequent analysis accounts for the observed distributions of gender and age between the D and R cohorts by performing comparisons within identical demographic groups (men versus men, women versus women and so on), amounting to a stratified sampling approach.

Table 2 Demographic information predicted using the M3 Twitter-trained classifier for the D and R cohorts

Within-individual CDS prevalence

First, we compared the within-individual CDS prevalence between the D and R cohorts. For each individual, we counted the number of their tweets containing any of the 241 CDS and divided it by their total number of tweets, resulting in a single within-individual CDS prevalence (Methods). We next compared the density distribution of individual prevalence values between all of the individuals from the D and R cohorts as shown in Fig. 2a,b.

Fig. 2: Within-individual CDS prevalence and between-cohort PR values.

a, Box and whisker (box, 50% CI; whiskers, 95% CI; vertical line, median value) plots for within-individual CDS prevalence distributions compared between all individuals in the D and R cohorts and within the same demographic group (age and gender). All points that fall outside the 95% CI are indicated by dots. For all of the demographic subgroups (gender and age categories), we can reject the null hypothesis that the two distributions have the same mean on the basis of Welch’s unequal variances t-test. ***P < 0.001. b, The density of within-individual prevalence of tweets containing CDS for the D cohort (blue, \({\bar{P}}_{{{D}}}=0.232\)) versus the R cohort (orange, \({\bar{P}}_{{{R}}}=0.173\)). The dashed vertical lines indicate the median value for each cohort. A large fraction of individuals in the R cohort (9.756%) have no tweets that contain any CDS. c, Box and whisker (box, 50% CI; whiskers, 95% CI; vertical line, median value) plots of bootstrapped between-cohort PR values between the D and R cohort (exact median and 95% CI values are provided in Table 3). All points that fall outside the 95% CI are indicated by dots. The 95% CI of the distribution does not include 1.00 (vertical line), indicating a significantly higher prevalence of all CDS for the D cohort. d, Box and whisker (box, 50% CI; whiskers, 95% CI; vertical line, median value) plots of CDS PR values between the D and R cohort for each cognitive distortion type. All points that fall outside the 95% CI are indicated by dots. The D cohort showed a significantly higher use of CDS than the R cohort for most CDS types separately (PR > 1), with the exception of ‘catastrophizing’, ‘mindreading’ and ‘fortune-telling’. Further details about the PR values are provided in Table 3.

In Fig. 2b, we observed that the distribution of within-individual CDS prevalence is shifted to the right for the D cohort relative to that of the R cohort, indicating that individuals in the D cohort express significantly more CDS (mean prevalence, \({\bar{P}}_{{{D}}}=0.232\)) than individuals in the R cohort (mean prevalence, \({\bar{P}}_{{{R}}}=0.173\)). On the basis of a two-sided Welch’s unequal variances t-test, we rejected the null hypothesis that the two samples have equal means (t1,619 = 21.20, P < 0.001, Cohen’s d = 0.56). Data distribution was assumed to be normal, but this was not formally tested. Note that 9.756% of the individuals in the R cohort have no tweets with CDS, whereas only 0.386% of the individuals in the D cohort express no CDS.

When we compared the distributions of within-individual prevalence between demographic subgroups, as shown in Fig. 2a, we found that all distributions differ on the basis of Welch’s unequal variances t-test (male: t335 = 9.82, P < 0.001, Cohen’s d = 0.53; female: t1,127 = 16.81, P < 0.001, Cohen’s d = 0.62; aged 18 and under: t208 = 9.35, P < 0.001, Cohen’s d = 0.71; aged 19–29: t580 = 13.49, P < 0.001, Cohen’s d = 0.67; aged 30–39: t217 = 7.73, P < 0.001, Cohen’s d = 0.59; aged 40 and over: t103 = 3.49, P < 0.001, Cohen’s d = 0.30). Excluding individuals who have no tweets with CDS from our analysis led to similar results across all demographic subgroups.
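As an illustration of the statistical comparison used here, the sketch below (not the authors' code) shows how a two-sided Welch's unequal variances t-test and Cohen's d can be computed with SciPy; the per-individual prevalence arrays are placeholder data, not the study's values.

```python
import numpy as np
from scipy import stats

# Placeholder per-individual CDS prevalence values (illustrative only)
rng = np.random.default_rng(0)
prevalence_d = rng.beta(2, 7, size=1035)   # D cohort (n = 1,035)
prevalence_r = rng.beta(2, 9, size=7349)   # R cohort (n = 7,349)

# equal_var=False selects Welch's test (unequal variances and sample sizes)
t_stat, p_value = stats.ttest_ind(prevalence_d, prevalence_r, equal_var=False)

# Cohen's d using the pooled standard deviation of the two samples
pooled_sd = np.sqrt((prevalence_d.var(ddof=1) + prevalence_r.var(ddof=1)) / 2)
cohens_d = (prevalence_d.mean() - prevalence_r.mean()) / pooled_sd
print(t_stat, p_value, cohens_d)
```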

Between-cohort CDS prevalence

We conducted a between-cohort analysis to compare the prevalence of CDS between the D and the R cohorts. We did this by calculating the prevalence of CDS for all tweets from each cohort and calculating the prevalence ratio (PR) between the two cohorts (see the ‘PR values’ section in the Methods). A PR value of higher than 1 indicates that the presence of CDS in the tweets written by the D cohort is greater than in those written by the R cohort. To assess the sensitivity of our results to changes in our cohort samples, for example, a few ‘high-power’ users biasing our analysis, we repeatedly calculated the estimated PR over 10,000 random resamples (with replacement) of both groups, resulting in a distribution of PR values shown in Fig. 2 (see the ‘Bootstrapping estimates’ section in the Methods).

We found narrow distributions of the number of tweets in each resample (D cohort: 95% confidence interval (CI) = 1,454,068.75–1,566,230.325; R cohort: 95% CI = 6,630,441.375–6,941,408.2), indicating that our results are not biased by the presence of exceptionally active or inactive users in either cohort sample. Note that PR values express the relative difference in CDS prevalence between the two cohorts, not the absolute difference.

We observed in Fig. 2c that the median of this distribution of PR values for all of the individuals in the D and R cohorts is much greater than 1, and that its 95% CI does not include 1, indicating that we found a statistically significant higher prevalence of CDS in the D cohort (1.129×) compared with the R cohort. This result is robust to random changes in our cohort samples, indicating that outliers or exceptionally active or inactive users are not biasing our results. Furthermore, when we performed a between-cohort comparison within each of the gender and age categories, as shown in Fig. 2c, in all cases, we found a statistically significant higher prevalence of CDS in the D cohort, with median values ranging from 1.102× for individuals aged 40 and over to 1.164× for individuals aged 19–29.

To investigate the possible influence of the difference in the time intervals spanned by the two cohorts, we performed stratified sampling by month, using time-matched tweets as a control, and found similar results for all individual months (Supplementary Information Section 2). We found no indications of a time-dependent effect on CDS prevalence.

CDS prevalence by cognitive distortion type

The between-cohort PR values shown in Fig. 2c do not reflect specific distortion types; all CDS are equally and independently matched to all tweets. However, CDS types may differ in their prevalence between our cohorts. We therefore repeated the above analysis with CDS separated by cognitive distortion type.

As shown in Table 3 and Fig. 2d, the prevalence of CDS is significantly higher for nearly all cognitive distortion types in the tweets of the D cohort compared with those of the R cohort; PR values ranged from 2.084× to 1.056×, with the exception of ‘fortune-telling’, ‘mindreading’ and ‘catastrophizing’, which produce a PR that is not significantly different from parity. However, PR values vary by cognitive distortion type. The cognitive distortion types ‘personalizing’ and ‘emotional reasoning’ have the greatest PR values of 2.084× and 1.983×, respectively, followed by ‘overgeneralizing’ (1.441×), ‘mental filtering’ (1.296×), ‘disqualifying the positive’ (1.229×), ‘labelling and mislabelling’ (1.207×) and ‘dichotomous reasoning’ (1.131×). The cognitive distortion types ‘should statements’ and ‘magnification and minimization’ have significant PR values of lower than 1.1×. Table 4 shows the number and ratios of schemata for each cognitive distortion type that have PR values for which we can conclude that the D cohort uses these schemata more.

Table 3 PR and 95% CIs of CDS between the D and R cohort
Table 4 Statistics with respect to significance for our set of CDS, grouped in 12 cognitive distortion categories

We observed the highest individual PR scores for the CDS ‘if it only’, ‘because my’ and ‘because I feel’, and the lowest individual PR scores for ‘she will not believe’, ‘we will not think’ and ‘nobody will believe’ (which belong to the non-reflexive ‘mindreading’ type).

Robustness

In the following text, we discuss our efforts to verify whether our results may be explained by random variations in our sample of individuals, our particular choice of CDS n-grams, the sentiment loadings of our CDS set and the known propensity of individuals with depression to make self-referential statements. When accounting for these factors, in all cases, we continued to find much higher levels of distorted thinking in the language of the individuals in the D cohort compared with individuals in the R cohort. However, we caution that possible biases resulting from our data collection (for example, the veracity of the diagnosis statements or the degree to which individuals are willing to disclose a diagnosis) are difficult to assess, and are part of an ongoing discussion in the literature38,39.

Absence of sentiment effect

Previous research has shown that the language of individuals with depression is less positive (lower text valence) and contains higher levels of self-referential language19,40,41,42,43,44. To determine the degree to which our results can be explained by text sentiment or self-referential statements instead of distorted thinking, we examined the valence loadings of our collection of tweets and CDS, and reproduced our results with and without CDS containing self-referential statements.

First, we determined the valence values of each CDS n-gram in our set using the VADER sentiment analysis tool45, which was shown in a recent survey to outperform other available sentiment analysis tools for social media language46. VADER is particularly appropriate for this use, as its sentiment ratings take into account grammatical context, such as negation, hedging and boosting. We found that 75.9% of our CDS have either no sentiment-loaded content or are rated to have zero valence (neutral sentiment scores). The average valence rating of all CDS is −0.05 (n = 241) on a scale from −1.0 to +1.0. Figure 3a shows the VADER sentiment distribution of only CDS n-grams with non-zero ratings. Here we observed only a small negative skew of CDS sentiment for this small minority of CDS n-grams (24.1%).
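As a concrete illustration of this scoring step, the snippet below is a minimal sketch assuming the open-source vaderSentiment package; the phrases are example schemata quoted elsewhere in this article, not the full CDS set, and ‘compound’ is VADER’s normalized valence score in [−1.0, +1.0].

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
# Example schemata mentioned in the text; the full set contains 241 CDS
cds_examples = ["I am a", "everyone will believe", "nobody", "because I feel"]

for phrase in cds_examples:
    # 'compound' is the normalized valence score in [-1.0, +1.0]
    compound = analyzer.polarity_scores(phrase)["compound"]
    print(f"{phrase!r}: {compound:+.2f}")
```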

Fig. 3: CDS and tweet sentiment scores (VADER).

a, Density of VADER scores for CDS with non-zero sentiment values (58 out of 241 schemata). Most CDS carried no valence loading (75.9%). The average rating for the complete set of CDS is −0.05 (n = 241). b, Distributions and kernel density estimates of the VADER valence ratings for all individual tweets. Both cohorts show a clear right-hand skew towards positive sentiment. The D cohort has slightly more-extreme positive and negative sentiment values compared with the R cohort, but the distributions are largely comparable, indicating that there is only a small difference in sentiment values between the two cohorts. c, Distributions and kernel density estimates of the VADER valence ratings for all individual tweets that contain at least one CDS.

Furthermore, as shown in Fig. 3b, the sentiment distributions of all tweets for the D and R cohorts are both skewed towards positive sentiment (right side of distribution). This matches earlier findings that human language exhibits a Pollyanna effect47, a near-universal phenomenon that skews human language towards positive valence. VADER sentiment ratings in the range 0.70–1.00 seem to be slightly more prevalent among the tweets of the D cohort (Fig. 3b), possibly indicating increased emotionality (higher levels of both negative and positive affect). We found nearly identical distributions of sentiment for the tweets of the two cohorts, whether we performed the comparison for all tweets (Fig. 3b) or only for tweets containing at least one CDS (Fig. 3c). One particular deviation, in the sentiment range 0.40–0.45, was found to be associated with more frequent use of the ‘face with tears of joy’ emoji (VADER sentiment = 0.4404) by individuals in the R cohort than by individuals in the D cohort.

Taken together, these findings strongly suggest that the higher prevalence of CDS in the language of the D cohort can be attributed neither to a negative valence skew in the CDS set nor to the sentiment distribution of the tweets produced by either the D or R cohort.

Absence of personal pronoun effect

Research has shown that FPPs are more prevalent in the language of individuals with depression19,23. As many CDS contain FPPs (Supplementary Table 1, FPP(%)), our results may to a degree reflect this phenomenon instead of the ‘distorted’ nature of our CDS. To test the sensitivity of our results to the presence of FPPs in our set of CDS, we repeated our analysis entirely without CDS containing the FPPs ‘I’ (upper case), ‘me’, ‘my’, ‘mine’ and ‘myself’. As shown in Table 3 (PR1), we found that their removal does not alter the observed effect, except for the cognitive distortion type ‘fortune-telling’, which is not significantly different from parity in this case. The respective CIs resulting from our removal of FPP schemata changed slightly, but most overlap with those obtained from the analysis that included the full set of CDS (Table 3, PRA versus PR1), demonstrating that the presence of FPPs does not alter our results. Note that we could not determine any values for ‘personalizing’ because, by definition, its CDS all contain FPPs.
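A minimal sketch of this sensitivity check is shown below, using an illustrative subset of schemata: every CDS containing one of the listed FPPs is dropped before PR values are recomputed. Matching the non-‘I’ pronouns case-insensitively is an assumption of this sketch.

```python
def contains_fpp(schema: str) -> bool:
    # 'I' is matched in upper case, as in the article; the other pronouns are
    # matched case-insensitively (an assumption of this sketch)
    lowered = {"me", "my", "mine", "myself"}
    return any(tok == "I" or tok.lower() in lowered for tok in schema.split())

cds = ["I am a", "because I feel", "everyone will believe", "nobody"]
cds_without_fpp = [s for s in cds if not contains_fpp(s)]
print(cds_without_fpp)  # ['everyone will believe', 'nobody']
```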

Robustness to CDS changes

To determine the sensitivity of our results to the particular choice of CDS, we recalculated PR values between the D and R cohorts but, instead of resampling our D and R cohorts, we randomly resampled (with replacement) the set of 241 CDS n-grams. The 95% CI of the resulting distribution of PR values indicates how sensitive our results are to random changes in our CDS set. The results of this analysis are shown in Table 3 (PRC). We observed small changes in the dispersion of the resulting distribution of PR values, but the median values and 95% CIs remain largely unchanged. As before, the 95% CIs continue to exclude 1.000 for all of the cognitive distortion types except for ‘mindreading’, ‘should statements’, ‘fortune-telling’ and ‘catastrophizing’, and we can continue to reject the null hypothesis that PR values are similar between the D and R cohorts for nearly all cognitive distortion types. Furthermore, as shown in Table 3, the 95% CIs of PRC and PRA largely overlap across all cognitive distortion types, indicating that our results are robust to random changes in our CDS set as well as our D and R cohort samples. Furthermore, we examined whether URLs in tweets may bias CDS prevalence rates as they could be indicative of externally generated content that does not reflect the individual’s own state. However, we found similar PR values regardless of whether we included or excluded tweets with URLs (Supplementary Information).

Discussion

In a sample of online individuals, we used a theory-driven approach to measure the prevalence of linguistic markers that may indicate cognitive vulnerability to depression, according to CBT theory. We defined a set of CDS that we grouped along 12 widely accepted types of distorted thinking and compared their prevalence between two cohorts of Twitter users—the first included individuals who reported that they received a clinical diagnosis of depression and the second was a similar random sample.

As hypothesized, the individuals in the D cohort use significantly more CDS in their online language than individuals in the R cohort, particularly schemata associated with ‘personalizing’ and ‘emotional reasoning’. We observed significantly increased levels of CDS across nearly all cognitive distortion types, sometimes more than twice as high, but did not find a statistically significant increase in prevalence among the D cohort for two specific types, namely ‘fortune-telling’ and ‘catastrophizing’. This may be due to the difficulty of capturing these specific cognitive distortions in the form of a set of 1–5-grams: their expression in language can involve an interactive process of conversation and interpretation. Notably, our findings are not explained by the use of FPPs or more negatively loaded language. These results shed light on the degree to which the depression-related language of cognitive distortions is manifested in the colloquial language of social media platforms. This is of social relevance given that these platforms are specifically designed to propagate information through the social ties that connect individuals on a global scale.

An advantage of studying theory-driven differences between the language of individuals with and without depression, in contrast to a purely data-driven or machine learning approach, is that we can explicitly use the principles underpinning CBT to understand the cognitive and lexical components that may shape depression. Cognitive behavioural therapists have developed a set of strategies to challenge the distorted thinking patterns that are characteristic of depression. Preliminary findings suggest that specific language can be related to specific therapeutic practices and seems to be related to outcomes48. However, these practices have been largely shaped by a clinical understanding and not necessarily informed by objective measures of how patterns of language reflect cognitive distortions, which could be harnessed to facilitate the path of recovery.

Our results suggest a path for mitigation and intervention, including applications that engage individuals with mood disorders, such as major depressive disorder, through social media platforms and that challenge particular expressions and types of depression-related language. Future characterization of the relationship between depression-related language and mood may help in the development of automated interventions (such as ‘chatbots’) or suggest promising targets for psychotherapy. Another approach that has shown promise in leveraging social media for the treatment of mental health problems involves crowdsourcing the responses to cognitively distorted content49. These types of applications have the potential to be more-scalable mental health interventions compared with existing approaches such as face-to-face psychotherapy50. The extent to which user CDS prevalence can be used as a passive index of vulnerability to depression that may be expected to change with treatment could also be explored. Insofar as online language can be considered to be an index of cognitive vulnerability to depression, a better understanding of online language may help to tailor treatments, especially internet-based treatments, to the more-specific needs of individuals. For example, interventions that target depression-related thinking and language may be well-suited for individuals with depression who express relatively higher levels of these distortions, whereas interventions that target other mechanisms (such as physical activity, circadian rhythm) may be better suited for individuals who do not show relatively higher levels of CDS. More research towards understanding differences in language patterns in depression and related disorders, such as anxiety disorders, is recommended. However, when implementing these types of approaches, ethical considerations and privacy issues have to be adequately addressed38,39.

Several limitations of our theory-driven approach should be considered. First, we relied on individuals reporting their personal clinical depression diagnoses on social media. Although we verified that the statement indeed pertains to a clinical diagnosis, we do not have verification of the diagnosis itself nor of its accuracy. This may introduce individuals into the D cohort who might not have been diagnosed with depression or accurately diagnosed. Conversely, we have no verification that individuals in our random sample do not suffer from depression. However, the potential inaccuracy of this inclusion criterion will probably reduce the difference in depression rates between the two cohorts and, therefore, reduce the observed effect sizes (PR values between cohorts) due to the larger heterogeneity of our sample. As a consequence, our results are probably not an artefact of the accuracy of our inclusion criterion. Second, our approach is limited to discovering only individuals who are willing to disclose their diagnosis on social media. As this might skew our D cohort to a subgroup of individuals suffering from depression, we recommend caution when generalizing our findings to the level of all individuals who have depression. Third, our lexicon of CDS was composed and approved by a panel of ten experts who may have been only partially successful in capturing all of the n-grams used to express distorted ways of thinking. On a related note, the use of CDS n-grams implies that we measure distorted thinking by proxy, namely through language, and our observations may therefore be affected by linguistic and cultural factors. Common idiosyncratic or idiomatic expressions may syntactically represent a distorted form of thinking, but no longer do so in practice. For example, an expression such as ‘literally the worst’ may be commonly used to express dismay, without necessarily involving the speaker experiencing a distorted mode of thinking. Thus, the presence of a CDS does not point to a cognitive distortion per se. Fourth, both cohorts were sampled from Twitter, one of the leading social media platforms, the use of which may be associated with higher levels of psychopathology and reduced well-being51,52,53. We may therefore be observing increased or biased rates of distorted thinking in both cohorts as a result of platform effects. However, we report relative prevalence numbers with respect to a carefully constructed random sample also taken from Twitter, which probably compensates for this effect and the effect that individuals with depression might be more active than their random counterparts. Furthermore, recent analysis indicates that representative samples with respect to psychological phenomena can be obtained from social media content54. This is an important discussion in computational social science that will continue to be investigated. Data-driven approaches that analyse natural language in real time will continue to complement theory-driven work such as ours.

As we analysed individuals on the basis of inferred health-related information, we want to stress some additional considerations regarding ethical research practices and data privacy30,38,39. We limited our investigation strictly to comparing, in the aggregate, the publicly shared language of two deidentified cohorts of individuals (individuals who report that they have been diagnosed with depression and a random sample). We carefully deidentified all obtained data to protect user privacy and performed our analysis under the constraints of two IRB protocols (IU IRB Protocols 2010371843 and 1707249405). Whereas the outcomes of our analysis could contribute to a better understanding of depression as a mental health disorder, they could also inform approaches that detect traces of mental health issues in the online language of individuals, and as such contribute to future detection, diagnostics and intervention efforts. This may raise important ethical and user privacy concerns as well as risk of harm, including but not limited to the right to privacy, data ownership and transparency. For example, even though social media data are technically public, individuals do not necessarily realize nor consent to particular retrospective analysis when they share information on their public accounts55, nor can they consent to how these data may be leveraged in future approaches that may involve individualized interactions and interventions. Considering existing evidence that individuals are more willing to share biomedical data than social media data56, in future research, we hope to reach a larger sample of individuals who understand public data availability and increase transparency through a carefully managed consent process. We acknowledge that these considerations are part of an active and ongoing discussion in our community that we encourage and that we hope our research may contribute to.

We emphasize that not all use of CDS n-grams reflects depressive thinking, as these phrases are part of normal English usage, and it would therefore be wrong to try to diagnose depression merely on the basis of use of one or more such phrases. Such an approach would, as well as being inaccurate, potentially lead to harm in terms of stigmatizing individuals.

Methods

Data privacy and handling

Throughout our analysis we adhered to two Indiana University (IU) Institutional Review Board (IRB) protocols, namely IU IRB Protocol 2010371843 ‘Depressed individuals express more distorted thinking on social media’, which was reviewed specifically for this entire study and its research team, and IU IRB Protocol 1707249405, which previously covered the data collection and analysis. As this study analyses individuals on the basis of inferred health-related information, additional steps were taken to ensure the privacy of all of the individuals in our cohorts. We deidentified all data by assigning each tweet and each user a unique label, for example D2345960 or R17156599, in both cohorts, to remove all identifying information from our analysis. All raw data are stored on a protected IU server that is accessible only to members of the study team.

Demographic information

Twitter accounts are not generally associated with detailed demographic information about the individuals in question, other than what individuals may choose to self-report in their profiles and the content that they post. However, demographic information can be reliably inferred from a variety of account characteristics, such as the individual’s name and ‘screen name’, profile photograph and biography. To infer the demographic information of all Twitter accounts, we used the M3 system35, which is a highly accurate deep learning classifier that was trained on a massive Twitter dataset using profile images, screen names, names and biographies. The classifier is built to classify an account along three categories: (1) gender (male/female; Macro-F1: 0.915), (2) age (‘18 and below’, ‘19–29’, ‘30–39’ and ‘40 and up’; Macro-F1: 0.425) and (3) organization (individual versus organizational account; Macro-F1: 0.898). To ensure precision, we used a high threshold to assign a label to each account on the basis of the output of the M3 system. For the gender and organization categories, we set the threshold at 0.8. For age, we set the threshold at 0.6.
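The sketch below illustrates how such confidence thresholds can be applied; the probability dictionary mimics M3-style per-account output but is hypothetical, and this is not the m3inference API itself.

```python
def assign_label(probs: dict, threshold: float):
    """Return the most probable class if its probability clears the threshold, else None."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= threshold else None

# Hypothetical per-account class probabilities resembling M3 output
account = {
    "gender": {"male": 0.12, "female": 0.88},
    "age": {"<=18": 0.10, "19-29": 0.65, "30-39": 0.15, ">=40": 0.10},
    "org": {"is-org": 0.03, "non-org": 0.97},
}

gender = assign_label(account["gender"], threshold=0.8)  # 'female'
age = assign_label(account["age"], threshold=0.6)        # '19-29'
org = assign_label(account["org"], threshold=0.8)        # 'non-org'
```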

Data and sample construction

Using the Twitter application program interface (API) and the IUNI OSoMe57 (a service that provides searchable access to the Twitter Gardenhose, a 10% sample of all daily tweets), we searched for tweets that matched both ‘diagnos*’ and ‘depress*’. The resulting set of tweets was then filtered for matching the expressions ‘i’, ‘diagnos*’, ‘depres*’ in that order in a case-insensitive manner, allowing insertions to match the greatest variety of diagnosis statements; for example, a tweet that states ‘I was in fact just diagnosed with clinical depression’ would match. Finally, to ensure that we included only true self-referential statements of a depression diagnosis, a team of three experts manually excluded quotes, jokes and external references. The members of this team assessed the collection of tweets to verify that we included only explicit statements that the individual had received a clinical diagnosis. All quotes, retweets and external references to depression (for example, ‘My friend and I were practically diagnosed with depression over the Game of Thrones finale’) were removed. A similar approach was deemed to be most accurate in a comparative analysis of social media sampling methods58. As recommended previously59, we avoided the use of data-driven supervised machine learning approaches to draw conclusions with respect to the language features and population morbidity of depression58.
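A minimal sketch of this pre-filtering step, assuming plain regular expressions; the matching shown here is illustrative and does not replace the expert review described above.

```python
import re

# 'i', then 'diagnos*', then 'depres*', in order, case-insensitively,
# with arbitrary insertions allowed between the expressions
DIAGNOSIS_PATTERN = re.compile(r"\bi\b.*\bdiagnos\w*.*\bdepres\w*",
                               re.IGNORECASE | re.DOTALL)

tweets = [
    "I was in fact just diagnosed with clinical depression",
    "My friend and I were practically diagnosed with depression over the Game of Thrones finale",
]

candidates = [t for t in tweets if DIAGNOSIS_PATTERN.search(t)]
# Both tweets match the pattern; only the expert review removes jokes, quotes
# and external references such as the second example.
```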

We do not have certainty that the reported clinical depression diagnoses are in fact accurate. However, although the clinical recognition of depression is poor in some settings5, patients who are recognized as being depressed tend to, on average, have higher levels of depression compared with those who are not recognized60. This observation, along with research suggesting that depression is best understood as existing on a continuum (reviewed previously61), supports our use of an explicit report of a clinical depression diagnosis as the inclusion criterion for the D cohort.

For each qualifying diagnosis tweet, we retrieved the timeline of the corresponding Twitter user using the Twitter ‘user_timeline’ API endpoint (https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline). Subsequently, we excluded all non-English tweets (Twitter API machine-detected ‘lang’ field), all retweets and all tweets containing ‘diagnos*’ or ‘depress*’. As we wanted to analyse only personal accounts belonging to individuals, we excluded all accounts that M3 predicted to belong to an organization or institution, leading to a final D cohort of 1,035 individuals and 1,510,359 tweets.
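The exclusion steps can be sketched as follows, assuming Twitter-API-style tweet dictionaries with ‘lang’, ‘text’ and ‘retweeted_status’ fields; the helper names are illustrative, not the authors’ pipeline.

```python
import re

EXCLUDE_TERMS = re.compile(r"diagnos|depress", re.IGNORECASE)

def keep_tweet(tweet: dict) -> bool:
    return (
        tweet.get("lang") == "en"                             # English only (API-detected 'lang')
        and "retweeted_status" not in tweet                   # drop retweets
        and not EXCLUDE_TERMS.search(tweet.get("text", ""))   # drop diagnosis/depression mentions
    )

timeline = [
    {"lang": "en", "text": "nobody will ever believe me"},
    {"lang": "en", "text": "I was diagnosed with depression", "retweeted_status": {}},
    {"lang": "es", "text": "hola"},
]
filtered = [t for t in timeline if keep_tweet(t)]  # keeps only the first tweet
```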

To compare the CDS prevalence rates of the D cohort to a baseline, we constructed a random sample of individuals (R cohort). To do so, we collected a large sample of random tweets during three separate weeks (1–8 September 2017, 1–8 March 2018 and 1–8 September 2018) from the IUNI OSoMe57. We extracted all Twitter user identifiers from these tweets (n = 588,356) and included only those that specified their geographical location and were not already included in our D cohort. To control for platform, interface and behavioural changes over time, we selected a subsample of these individuals such that the distribution of their account creation dates matches that of the D cohort, resulting in an initial set of 9,525 random individuals. Finally, we collected the Twitter timelines of these users and filtered the obtained data in the same manner as described for the D cohort, again excluding accounts that the M3 classifier predicted to be an institution or organization, resulting in a final R cohort consisting of 7,349 individuals and 6,783,353 tweets.

No statistical methods were used to predetermine sample sizes, but our sample sizes are similar to those reported in previous publications21,29.

Construction of set of CDS n-grams

A. T. Beck introduced the concept of cognitive distortions to characterize the thinking of individuals with depression62,63. Subsequently, other clinicians expanded on his typology of distortions64—notably, clinical psychologist and CBT expert, J. Beck65. We drew on these latest lists, which consist of 12 types of cognitive distortions that may characterize the thinking of individuals with depression.

A panel of CBT experts (three co-authors and seven experts consulted) engaged in a process of collaborative design, followed by a consensus voting procedure (unanimous decision) to map a set of 241 CDS n-grams, each geared to express at least one type of cognitive distortion. The schemata in each category were formulated to capture the minimal semantic building blocks of distorted thinking for a particular type, avoiding expressions that are specific to depression-related topics, such as poor sleep or health issues. For example, the common 3-gram ‘I am a’ was included as a building block of expressing a variety of ‘labelling and mislabelling’ cognitive distortions, because it would be a highly likely (and nearly unavoidable) n-gram to express many self-referential (‘I’) expressions of labelling (‘am a’). We show a set of examples in Table 1. Where possible, higher-order n-grams were chosen to capture as much of the semantic structure of one or more distorted schemata as possible, for example, the 3-gram ‘everyone will believe’ captures both ‘overgeneralizing’ and ‘mindreading’. We did include 1-grams, such as ‘nobody’ and ‘everybody’, as they strongly correspond to the expression of ‘dichotomous reasoning’. The number of schemata per category in our CDS set along with the average n-gram size, as well as a number of relevant grammatical features, are provided in Supplementary Table 1. The complete set of CDS is provided in Supplementary Table 2.

PR values

For each Twitter user u in our sample, we retrieved a timeline Tu of their time-ordered k most recent tweets, Tu = {t1, t2, ..., tk}. We also defined a set C = {c1, c2, ..., cn} of n = 241 CDS n-grams (Table 4), each containing between 1 and 5 terms. The elements of set C are intended to represent the lexical building blocks of expressing cognitive distortions (Table 4 and Supplementary Table 2). We introduced a CDS matching function \({{\mathcal{F}}}_{C}(t)\to \{0,1\}\), which maps each individual tweet t to either 0 or 1 according to whether the tweet contains one or more of the schemata in set C. Note that the range of \({{\mathcal{F}}}_{C}(t)\) is binary; therefore, a tweet that contains more than one CDS still counts as 1.
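A minimal sketch of the matching function FC(t) is given below; whole-word, case-insensitive matching is an assumption of this sketch, and the schemata shown are an illustrative subset of C.

```python
import re

def make_patterns(cds):
    # One compiled pattern per schema; word boundaries avoid partial-word matches
    return [re.compile(r"\b" + re.escape(s) + r"\b", re.IGNORECASE) for s in cds]

def f_c(tweet_text, patterns):
    # 1 if the tweet contains at least one CDS, else 0 (multiple matches still count as 1)
    return int(any(p.search(tweet_text) for p in patterns))

C = ["I am a", "because I feel", "nobody"]        # illustrative subset of the 241 CDS
patterns = make_patterns(C)
f_c("Nobody will ever get it", patterns)          # -> 1
f_c("Looking forward to the weekend", patterns)   # -> 0
```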

The within-individual prevalence of tweets for individual u is defined as the ratio of tweets that contain a CDS in C over all tweets in their timeline Tu:

$${P}_{C}(u)=\frac{\sum _{t\in {T}_{u}}{{\mathcal{F}}}_{C}(t)}{\left|{T}_{u}\right|}$$
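Continuing the sketch above, the within-individual prevalence PC(u) is simply the fraction of a user’s tweets that match any CDS; the timeline below is illustrative.

```python
def within_individual_prevalence(timeline, patterns):
    # Fraction of tweets in the timeline containing at least one CDS
    if not timeline:
        return 0.0
    return sum(f_c(text, patterns) for text in timeline) / len(timeline)

timeline = ["Nobody will ever get it", "I am a mess today", "Great game last night"]
within_individual_prevalence(timeline, patterns)  # -> 2/3, i.e. about 0.667
```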

Our sample is separated into two cohorts—one of 1,035 individuals with depression and another of 7,349 randomly sampled individuals. We denoted the set of all individuals with depression D = {u1, u2, ..., u1,035} and random sample cohort R = {u1, u2, ..., u7,349}. Thus, the sets of all tweets written by users in the D and R cohorts are defined as:

$${T}_{D}=\bigcup _{u\in D}{T}_{u}\ {\rm{and}}\ {T}_{R}=\bigcup _{u\in R}{T}_{u}$$
(1)

We can then define the prevalence (P) of tweets with CDS C for each of the D and R cohorts as follows:

$${P}_{C}(D)=\frac{{\sum }_{t\in {T}_{D}}{{\mathcal{F}}}_{C}(t)}{\left|{T}_{D}\right|} \ {\mathrm{and}}\ {P}_{C}(R)=\frac{{\sum }_{t\in {T}_{R}}{{\mathcal{F}}}_{C}(t)}{\left|{T}_{R}\right|}$$
(2)

or, informally, the ratio of tweets that contain any CDS over all tweets written by the individuals of that cohort.

As a consequence, the PR of CDS in set C between the two cohorts D and R, denoted PRC(D,R), is defined simply as the ratio of their respective CDS prevalence values PC(D) and PC(R), computed over the tweet sets TD and TR, respectively:

$${\mathrm{PR}}_{C}(D,R)=\frac{{P}_{C}(D)}{{P}_{C}(R)}$$
(3)

If PRC(D,R) ≈ 1, the prevalence of CDS in the tweets of the D cohort is comparable to their prevalence in the tweets of the R cohort. However, a value of PRC(D,R) ≫ 1 or PRC(D,R) ≪ 1 may indicate a significantly higher prevalence in the respective cohort. Here we use ≫1 and ≪1 to signify that a PR value is significantly higher or lower than 1, respectively, which we assess on the basis of whether its 95% CI includes 1 or not (see the ‘Bootstrapping estimates’ section below).
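Using the same matching sketch as above, cohort-level prevalence and the prevalence ratio of equations (2) and (3) can be computed as follows; the helper names are illustrative, not the authors’ code.

```python
def cohort_prevalence(tweets, patterns):
    # P_C for a cohort: fraction of all tweets written by the cohort that contain any CDS
    return sum(f_c(text, patterns) for text in tweets) / len(tweets)

def prevalence_ratio(tweets_d, tweets_r, patterns):
    # PR_C(D, R) = P_C(D) / P_C(R)
    return cohort_prevalence(tweets_d, patterns) / cohort_prevalence(tweets_r, patterns)
```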

Bootstrapping estimates

The estimated P and PR values can vary with the particular composition of (1) set C (our CDS n-grams) or (2) the set of individuals in our D and R cohorts. We verified the reliability of our results by randomly resampling either C or both D and R, with replacement. This was repeated B = 10,000 times, leading to a set of resampled cognitive distortion sets or cohort samples. Each of these B resamples of either (1) the set of CDS C or (2) the cohorts D and R results in a corresponding P or PR value, yielding:

$${P}^{*}=\{{P}_{1}^{*},{P}_{2}^{*},\ldots ,{P}_{B}^{*}\}\quad {\rm{and}}\quad {\mathrm{PR}}^{*}=\{{\mathrm{PR}}_{1}^{*},{\mathrm{PR}}_{2}^{*},\ldots ,{\mathrm{PR}}_{B}^{*}\}$$
(4)

The distributions of P* and PR* were then characterized by their median (μ50) and their 95% CI (μ2.5, μ97.5). A 95% CI of a PR that does not contain 1 is held to indicate a significant difference in prevalence between the two cohorts.
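A minimal sketch of the cohort bootstrap described above: individuals are resampled with replacement B times, PR is recomputed for each resample (using the prevalence_ratio helper sketched earlier), and the median and 95% CI are taken from the resulting distribution. The function name and the user-to-timeline dictionaries are illustrative.

```python
import numpy as np

def bootstrap_pr(timelines_d, timelines_r, patterns, B=10_000, seed=0):
    rng = np.random.default_rng(seed)
    users_d, users_r = list(timelines_d), list(timelines_r)
    pr_samples = []
    for _ in range(B):
        # Resample individuals (not tweets) with replacement in each cohort
        sample_d = rng.choice(users_d, size=len(users_d), replace=True)
        sample_r = rng.choice(users_r, size=len(users_r), replace=True)
        tweets_d = [t for u in sample_d for t in timelines_d[u]]
        tweets_r = [t for u in sample_r for t in timelines_r[u]]
        pr_samples.append(prevalence_ratio(tweets_d, tweets_r, patterns))
    lo, med, hi = np.percentile(pr_samples, [2.5, 50, 97.5])
    return med, (lo, hi)  # the difference is deemed significant if the 95% CI excludes 1.0
```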

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.