Identifying subtypes of depression in clinician-annotated text: a retrospective cohort study

Current criteria for depression are imprecise and do not accurately characterize its distinct clinical presentations. As a result, its diagnosis lacks clinical utility in both treatment and research settings. Data-driven efforts to refine criteria have typically focused on a limited set of symptoms that do not reflect the disorder’s heterogeneity. By contrast, clinicians often write about patients in depth, creating descriptions that may better characterize depression. However, clinical text is not commonly used to this end. Here we show that clinically relevant depressive subtypes can be derived from unstructured electronic health records. Five subtypes were identified amongst 18,314 patients with depression treated at a large mental healthcare provider by using unsupervised machine learning: severe-typical, psychotic, mild-typical, agitated, and anergic-apathetic. Subtypes were used to place patients in groups for validation; groups were found to be associated with future outcomes and characteristics that were consistent with the subtypes. These associations suggest that these categorizations are actionable due to their validity with respect to disease prognosis. Moreover, they were derived with automated techniques that might theoretically be widely implemented, allowing for future analyses in more varied populations and settings. Additional research, especially with respect to treatment response, may prove useful in further evaluation.

from clinical text into natural subtypes. A range of service outcomes were then chosen for further analysis of predictive validity. We hypothesized that the subtypes would stratify patients into coherent groups with respect to outcome data.

Methods
Participants. Unstructured EHR data were accessed from the South London and Maudsley NHS Foundation Trust (SLaM). SLaM provides specialist mental healthcare to approximately 1.3 million residents of four London boroughs, and has used an EHR for all its services since 2006. The Clinical Record Interactive Search (CRIS) data platform was developed between 2007 and 2008 to make de-identified data from SLaM's EHR available for research within a robust governance framework 19,20. CRIS data have been substantially enhanced over the last 10 years by a series of natural language processing (NLP) algorithms designed to extract data of interest from free text fields in the EHR 21. Use of CRIS as a data source for secondary analyses has received IRB approval (Oxford Research Ethics Committee C reference 18/SC/0372); the methods presented here were conducted in compliance with the relevant guidelines. No identifying information was used as a part of this study. De-identified data from 18,314 patients treated at SLaM from January 1st, 2007 to November 1st, 2018 were analyzed. Patients were included if they received a primary diagnosis of depression (ICD-10 F32 or F33) within 3 months of their first face-to-face interaction with SLaM. Demographic information for the total sample is included as a part of Table 1.
Measures. Subtypes were created from fifty psychiatric symptoms spanning psychotic, bipolar and depressive disorders, each derived from unstructured EHRs with rule-based algorithms. The symptoms are listed in Supplementary eTable 1. The algorithms were developed prior to this study; detailed methodologies and performance metrics for each algorithm are documented by the CRIS NLP service 21. Each algorithm seeks to determine whether or not a patient experienced a symptom, excluding irrelevant mentions such as negated statements. A symptom was considered present in a patient if it was extracted from text fields drawn from the first month of clinical contact. These binary variables were used for the subtype generation process described below.
Outcomes. Predictive validity of the derived subtypes was evaluated with respect to the occurrence of a mental health crisis as a primary outcome. This was defined as any admission to mental health inpatient care, or an episode of home-treatment team care (an alternative to admission), within the window between 3 and 15 months after a patient's first face-to-face encounter with SLaM. In addition, the following secondary outcomes were studied within the same period: (1) occurrence of an emergency room presentation; (2) number of days active at SLaM within the window; (3) number of recorded face-to-face contacts with SLaM clinical staff; (4) mortality within the window, excluding deaths after August 6th, 2020; (5) number of years of follow-up.
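The outcome window described above can be illustrated with a short sketch. This is not the study's code: the helper names, the use of whole calendar months, the clamping of the day to the end of shorter months, and the inclusive endpoints are all assumptions made for illustration.

```python
import calendar
from datetime import date
from typing import List, Tuple


def add_months(start: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping the day to the month's end."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def outcome_window(first_contact: date) -> Tuple[date, date]:
    """Return (start, end) of the window 3 to 15 months after first contact."""
    return add_months(first_contact, 3), add_months(first_contact, 15)


def had_crisis(first_contact: date, event_dates: List[date]) -> bool:
    """True if any admission or home-treatment episode falls inside the window.

    Endpoints are treated as inclusive, which is an assumption.
    """
    start, end = outcome_window(first_contact)
    return any(start <= d <= end for d in event_dates)


# Hypothetical patient: first seen 30 November 2015, with two later episodes.
first_contact = date(2015, 11, 30)
print(outcome_window(first_contact))
print(had_crisis(first_contact, [date(2016, 1, 15), date(2016, 6, 2)]))
```

In this sketch the first episode falls before the window opens and is ignored; only the second counts toward the primary outcome.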
Additionally, covariates were investigated: age, gender, ethnic group (classified into White, Black, Asian, Mixed, Other), year of first SLaM contact, and neighborhood deprivation (Index of Multiple Deprivation, a standard metric derived from national census data and applied at the level of the Lower Super Output Area, a national administrative unit with an average of 1500 residents). Information from the Health of the Nation Outcome Scales (HoNOS) was also extracted. HoNOS is a clinician-rated instrument composed of 11 scales quantifying different elements of mental health and general function, where each scale is rated from 0 to 4. A score of 2 corresponds to a mild problem; as a result, patients were considered to have a HoNOS-defined problem if they scored between 2 and 4.
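The HoNOS dichotomization described above amounts to a simple threshold rule. The sketch below is illustrative only: the scale names are hypothetical placeholders, and returning None for a missing rating is an assumption, since the handling of missing scores is not described here.

```python
from typing import Dict, Optional

PROBLEM_THRESHOLD = 2  # a score of 2 corresponds to a mild problem
MAX_SCORE = 4          # each HoNOS scale is rated from 0 to 4


def honos_problem_flags(scores: Dict[str, Optional[int]]) -> Dict[str, Optional[bool]]:
    """Map each scale score to True (problem), False (no problem) or None (missing)."""
    flags: Dict[str, Optional[bool]] = {}
    for scale, score in scores.items():
        if score is None:
            flags[scale] = None  # assumption: missing ratings stay undetermined
        elif not 0 <= score <= MAX_SCORE:
            raise ValueError(f"{scale}: score {score} is outside 0-{MAX_SCORE}")
        else:
            flags[scale] = score >= PROBLEM_THRESHOLD
    return flags


# Hypothetical ratings for one patient (scale names are placeholders).
example = {"self_injury": 3, "hallucinations": 0, "depressed_mood": 2, "physical_illness": None}
print(honos_problem_flags(example))
```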
Finally, different types of medications received during the window were studied. The results are presented in Supplementary eTable 2.

Analyses.
A latent Dirichlet allocation (LDA) model was developed to identify different subtypes of depression based on patient symptoms. LDA is a topic modeling method; it was chosen in order to reflect the fact that the underlying data was text.
LDA decomposes individual patient symptom data into mixtures of distributions. Here, distributions were seen as subtypes of depression, where each distribution predicts the likelihood of the presence of each symptom. A more detailed introduction to LDA is included in Supplementary eFig. 1.
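For orientation, the standard LDA generative model can be written as below. The notation is introduced here and is not taken from the paper (whose own introduction is in Supplementary eFig. 1): K is the number of subtypes, p indexes patients, v indexes symptoms, and alpha and eta are Dirichlet hyperparameters.

```latex
\begin{align}
\beta_k &\sim \mathrm{Dirichlet}(\eta), & k &= 1, \dots, K \\
\theta_p &\sim \mathrm{Dirichlet}(\alpha) \\
z_{p,i} \mid \theta_p &\sim \mathrm{Categorical}(\theta_p) \\
s_{p,i} \mid z_{p,i}, \beta &\sim \mathrm{Categorical}(\beta_{z_{p,i}}) \\
\Pr(s_{p,i} = v \mid \theta_p, \beta) &= \sum_{k=1}^{K} \theta_{p,k}\, \beta_{k,v}
\end{align}
```

Under this reading, each row beta_k plays the role of a subtype (a distribution over symptoms), and theta_p is the mixture of subtypes describing patient p.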
The number of subtypes, n, is not known a priori. It was chosen primarily by comparing the outputs of models with between 2 and 8 subtypes for construct validity within the co-author team. Perplexity, a common metric for evaluating language models, was also used. However, it produced ambiguous results that were not helpful in this context; more details can be found in Supplementary eTable 3. Subtypes were chosen prior to any evaluation of predictive validity.
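As a reminder of what perplexity measures, its conventional definition is given below. The symbols are introduced here for illustration: s_p denotes the symptoms recorded for patient p, N_p their count, and P the number of patients in the evaluated set. Lower values indicate a better fit, and in practice the log-likelihood is usually replaced by an approximation such as a variational bound.

```latex
\begin{equation}
\mathrm{perplexity} = \exp\!\left( - \frac{\sum_{p=1}^{P} \log \Pr(s_p)}{\sum_{p=1}^{P} N_p} \right)
\end{equation}
```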
After the number of subtypes was chosen, k-means clustering was used to create patient groups based on the decomposed data produced by the final LDA model. K-means clustering partitions the data into a predetermined number of clusters so as to minimize the variance within each cluster. The number of clusters was chosen to be n to reflect the notion that patients can be described by a single subtype of depression.
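The two-step procedure (LDA followed by k-means on the resulting mixtures) might be sketched as follows with the scikit-learn classes named in this section. This is a minimal illustration on synthetic data, not the study's pipeline: the random symptom matrix, the sample size, and the explicit random_state and n_init values are assumptions added for reproducibility, whereas the study reports using default settings apart from the number of subtypes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)

# Synthetic stand-in for the patient-by-symptom matrix:
# 200 hypothetical patients, 50 binary symptom indicators.
n_patients, n_symptoms, n_subtypes = 200, 50, 5
X = (rng.random((n_patients, n_symptoms)) < 0.12).astype(int)

# Step 1: LDA decomposes each patient's symptoms into a mixture over subtypes.
lda = LatentDirichletAllocation(n_components=n_subtypes, random_state=0)
theta = lda.fit_transform(X)  # shape (n_patients, n_subtypes); each row sums to 1

# Per-subtype symptom distributions, normalized so that each row sums to 1.
beta = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

# Step 2: k-means on the mixtures assigns each patient to exactly one group.
kmeans = KMeans(n_clusters=n_subtypes, n_init=10, random_state=0)
groups = kmeans.fit_predict(theta)

print(theta.shape, beta.shape)
print(np.bincount(groups, minlength=n_subtypes))  # group sizes
```

Clustering the mixtures, instead of the raw symptom indicators, means that patients are grouped by their overall subtype profile.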
The process of producing patient groups is illustrated in Supplementary eFig. 2. Both LDA (sklearn.decomposition.LatentDirichletAllocation) and k-means clustering (sklearn.cluster.KMeans) were performed using version 0.22 of scikit-learn 22, a machine learning package for Python 3. Outside of the number of subtypes, the default settings for both classes were used.
After the final model was chosen, demographic and clinical characteristics were compared between groups using chi-squared tests, first across all derived groups. Afterwards, the subsample comprising the groups deemed mildest was evaluated to determine whether observed group differences persisted at this level. Presence or absence of events (crisis, emergency presentation, mortality) and mean service use (days active, number of contacts) were similarly compared. Regression analyses were then used to compare outcomes between groups, adjusting all models for age, gender, ethnic group and neighborhood deprivation score: logistic regression (generating odds ratios) for crisis event and emergency presentation, and Poisson regression (generating incidence rate ratios) for days active and number of contacts.

Results
Subtype selection. Each subtype can be characterized by distributions of symptoms. Figure 1 illustrates the differences between distributions by comparing the likelihood of the top two symptoms per subtype. Complete distribution information is included in Supplementary eTable 5.
For the purpose of labelling groups, two presentations were judged to form a severe set. Group 1 had an average of 7.11 (s = 3.95) recorded symptoms and Group 2 had 8.62 (s = 5.58). On the other hand, Groups 3, 4, and 5 had on average 5.99 (s = 3.0), 5.70 (s = 4.85), and 4.50 (s = 2.79) recorded symptoms respectively. Thus Groups 1 and 2 were viewed as forming a severe set, and Groups 3, 4, and 5 as forming a mild set.
Group 1 was felt to be more reflective of severe emotional distress given its emphasis on hopelessness and worthlessness. On the other hand, Group 2 featured psychotic symptoms, such as hallucinations, more prominently. Thus Group 1 was labelled severe-typical and Group 2 psychotic.
Distinct features were also identified for the milder set. Group 3 was characterized by tearfulness and poor concentration, the most common symptoms in the cohort, as the primary symptoms. Additionally, because hopeless and worthless ideation were unlikely amongst this subtype, it was labelled as mild-typical. Group 4 was labelled an agitated subtype as insomnia, agitation, and aggression were its most common features. Finally, the prominence of low energy and poor motivation in Group 5 supported an anergic-apathetic label.
Group analysis. Table 1 presents demographic information for each group; Table 2 presents adjusted regression analyses of group outcomes; Table 3 presents HoNOS problems. Comparisons between group outcomes and unadjusted analyses are included in Supplementary eTable 6; analyses for the mild set are presented in Supplementary eTables 7 and 8. Supplementary eTable 9 presents the years in which patients were first active at SLaM. Each table presents p values for the total sample as well as the mild set. The differences presented here are significant for both cases unless otherwise noted.
Demographic information. Differences in demographic information, as seen in Table 1, were mostly significant across the groups. However, there were no significant differences in the first year that patients were active at SLaM, and no significant differences in mean deprivation score.
There was a gender gap skewing towards women for every group. In the total sample, the difference was 24.3% (62.1% female versus 37.8% male). The largest gender gap was exhibited by the mild-typical group with a difference of 42.1% (71.0% female versus 28.9% male). The smallest gender gap was exhibited by the psychotic group, with a difference of 11.5% (55.7% female versus 44.2% male). Group differences in ethnicity were statistically significant across the total sample, but not within the mild set. The largest differences were within the psychotic group. White patients were underrepresented; they made up 54.0% of the psychotic group even though they comprised 57.1% of the total sample. Asian patients were overrepresented (6.2% versus 5.0%); Black patients were also overrepresented (18.7% versus 14.9%). Differences in other groups were small, often less than half a percent in magnitude.
With respect to age in the total sample, the mild-typical and agitated groups featured more patients under the age of 18. Patients over the age of 49 were more likely to be a part of the psychotic group, whereas the opposite was true for patients under the age of 18. Patients between the ages of 18 and 34 were 3.9 percentage points more prominent in the anergic-apathetic group (36.4% versus 32.5%).
Group outcomes. Generally, patients within the severe set had worse outcomes than the mild set, as seen in Table 2. For example, patients in the severe-typical group had the highest mortality within the outcomes window (HR = 1.24, 95% CI = 1.12 to 1.37, p < 0.001) and mild-typical patients demonstrated the lowest mortality (HR = 0.86, 95% CI = 0.79 to 0.95, p < 0.001). Patients in the psychotic group were the most likely to have a crisis event (OR = 2.45, 95% CI = 2.15 to 2.80, p < 0.001), and those within the anergic-apathetic group were less likely to have this outcome (OR = 0.64, 95% CI = 0.54 to 0.77, p < 0.001). The same pattern held for emergency presentations when patients in the psychotic group were compared with those in the agitated group.
The severe-typical patients diverged from psychotic patients with respect to the last two outcomes: days active at SLaM and number of face-to-face contacts. They were closer to the mild set, which tended to have fewer active days at SLaM; the severe-typical group had the fewest active days. On the other hand, the psychotic group engaged with SLaM the most. They had the most days active at SLaM (IRR = 1.14, 95% CI = 1.13 to 1.15, p < 0.001) and the most face-to-face contacts (IRR = 1.52, 95% CI = 1.54 to 1.15, p < 0.001).
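As a reading aid for the odds ratios and incidence rate ratios reported in this subsection, both are exponentiated coefficients from the adjusted regression models. The notation below is introduced here and is not taken from the paper: g_i indicates membership of a given group, x_i holds the adjustment covariates, y_i is a binary event and c_i a count. The comparison category is not specified in this section, and the Wald-type interval is an assumption about how the intervals were computed.

```latex
\begin{align}
\log \frac{\Pr(y_i = 1)}{1 - \Pr(y_i = 1)} &= \beta_0 + \beta_g g_i + \gamma^{\top} x_i, & \mathrm{OR} &= e^{\beta_g} \\
\log \mathbb{E}[c_i] &= \beta_0 + \beta_g g_i + \gamma^{\top} x_i, & \mathrm{IRR} &= e^{\beta_g} \\
95\%\ \mathrm{CI} &= \exp\!\left( \hat{\beta}_g \pm 1.96\, \mathrm{SE}(\hat{\beta}_g) \right)
\end{align}
```

A ratio above 1 indicates higher odds (or a higher rate) for the group in question, and an interval that excludes 1 corresponds to significance at the 5% level.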
HoNOS problems. HoNOS problems were well-aligned with the primary symptoms of each subtype. For example, patients in the psychotic group had the most HoNOS problems, with the exception of self-injury, which was more common in the severe-typical group. Compared with the other groups, patients in the mild set generally displayed fewer HoNOS problems. However, drug misuse and physical illness were not significantly different. Differences in several HoNOS problems were not significant within the mild set: drug misuse, cognition, physical illness, depression, living conditions, and occupation. The primary differences within the mild set were the higher prevalence of some symptoms amongst the agitated group relative to the mild-typical group and the lower prevalence in the anergic-apathetic group.

Discussion
Construct validity. In this study, we identified depressive subtypes in symptom data derived from unstructured EHRs. Five distinct subtypes were identified based upon patient data collected within a month after an initial face-to-face encounter with SLaM: severe-typical, psychotic, mild-typical, agitated, and anergic-apathetic. They were then used to create patient groups for validation. To this end, follow-up characteristics and outcomes recorded at least 3 months after the initial window were studied. Outcomes were extracted and evaluated after the subtypes had been created and finalized.
Each subtype was defined by several symptoms that were not prominent in any other group and were well characterized from a qualitative perspective. In other words, subtypes were more representative of the way clinicians described their patients. Moreover, they were predictive of a variety of future outcomes, such as crisis events, emergency presentations, mortality, and service utilization. Unsurprisingly, this was especially true for the psychotic and mild-typical groups.
Subtypes were aligned well with future mental and behavioral issues found in the structured data: patients in the severe-typical group had more problems with self-injury; those in the psychotic group had more problems with hallucinations as rated on the HoNOS structured instrument; patients in the mild-typical group had the fewest problems. Within the mild set, patients in the anergic-apathetic group were more likely to be described as depressed, and agitated patients were more likely to have HoNOS problems.
These results are reflective of some patterns found in the clinical literature. For example, several studies have found that African American patients are more likely to be described as exhibiting hallucinatory behavior and seek treatment for depression at lower rates than Caucasian patients 23-26. Depression severity is correlated with increased emergency department visits and healthcare utilization 27-29. Patients most likely to be later described as depressed featured anergia, the second most common residual symptom of depression, and one that poses significant problems for daily living 30. There was a sizable gender gap skewing towards women in every group, but this gap was the smallest amongst the psychotic group. This finding aligns with existing research that suggests that unlike mood or anxiety disorders, the prevalence of psychosis is approximately even between men and women 31,32.
However, our findings showed some inconsistencies with other studies. For example, there were no statistically significant differences in problems with physical illness between groups, even though associations between physical illness and depression severity have been reported 33. Intuitively, problems with daily living and living conditions might have been expected to differ between groups, yet significant differences only existed for the former within the mild set. The number of patients per group was spread reasonably evenly, though the rate of different types of depression need not be distributed in this way 33-38. Severe-typical patients were not especially likely to have an emergency presentation, considering the number of outcome variables, even though severity is correlated with hospitalization 27,39. Similarly, severity was not as predictive of drug misuse problems on the HoNOS scale compared to other outcomes, though this has been reported for substance abuse broadly 40.
Additionally, some factors do not lend themselves to easy interpretation. For example, a significantly larger gender gap was present in the mild-typical group relative to the other subtypes within the mild set; this gap may have multiple causes, but it is not possible to discern which combination applies. And while the results presented here are statistically significant, some are smaller in magnitude than what may be expected, such as the odds of having an emergency presentation: severe-typical patients were only 1.17 times more likely than their mild counterparts. However, this might reflect the fact that all patients were receiving care from a specialist mental health service, and so represent a relatively severe subset of all community cases of depression, potentially diluting differences between symptom cluster groups.
There are also issues of representation, such as the differences in the availability of HoNOS scores.
As a result of these discrepancies, although these subtypes provide clinically relevant information, they should still be understood as complementary to current diagnostic tools.
Study context. Previous studies have focused on studying small samples of patients with a narrow set of depressive symptoms. They typically employ latent class analysis (LCA) and factor analysis to identify subtypes, though some also use k-means clustering 5,6,41-44. Generally, groups are stratified across severity. For example, one LCA study 45 produced the following groups: "severe typical", "mild typical", "severe atypical", "mild atypical", "intermediate", and "minimal symptoms". A k-means study identified a "vital" and "nonvital" group amongst depressed men, where individuals in the former were more likely to have each symptom compared to those in the latter. We address these issues, in part, by analyzing a large cohort and including a broader set of symptoms.
This study also differed in that the underlying data comprised free text recorded by clinicians, as opposed to checklists from research instruments applied to screened samples. While clinical text has been analyzed in other medical specialties, it has seen limited use for depression, though text mining for psychiatry has seen increased use within the last decade 46 . It is not clear, a priori, what types of information are important for different applications. Moreover, clinicians often write narratives about their patients, as opposed to any set of semi-structured information, such as a list of symptoms or surgeries. As a result, contextual issues make accurate data extraction difficult 47 ; research to this end is also hampered due to a lack of data access within healthcare settings 48 .
Here, we have shown that the symptom data captured by clinicians can be used to define meaningful constructs to categorize patient experiences in early stages of specialist care. In particular, the constructs are qualitative in nature, in that they relate directly to patient symptoms, and are relevant to future outcomes. Thus, the use of unstructured EHRs for this task merits further exploration.
One approach could involve studying how to better identify constructs. In this study, one set of subtypes was chosen for further analysis based upon potential clinical use, i.e. the subtypes should describe clinically relevant patient profiles, and not goodness-of-fit, which poses issues surrounding model interpretability. K-means clustering was used to group patients, but other methods, like organizing patients based upon their most prominent subtype, could have been used. Realistically, many patients will not fit cleanly into one subtype; allowing for additional clusters could allow patient groups with more complicated profiles to emerge.
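The alternative mentioned above, grouping patients by their most prominent subtype, can be sketched in a few lines. This is an illustration, not a method used in the study: the function names and the example mixtures are hypothetical, and the margin between the two largest weights is an assumption about how patients who do not fit cleanly into one subtype might be flagged.

```python
from typing import List, Sequence


def most_prominent_subtype(mixture: Sequence[float]) -> int:
    """Index of the subtype with the largest weight (ties go to the lowest index)."""
    return max(range(len(mixture)), key=lambda k: mixture[k])


def dominance_margin(mixture: Sequence[float]) -> float:
    """Gap between the two largest weights; small gaps suggest a mixed profile."""
    top, second = sorted(mixture, reverse=True)[:2]
    return top - second


def assign_groups(mixtures: Sequence[Sequence[float]]) -> List[int]:
    """Assign every patient to the subtype that dominates their mixture."""
    return [most_prominent_subtype(m) for m in mixtures]


# Hypothetical subtype mixtures for three patients (each row sums to 1).
mixtures = [
    [0.70, 0.10, 0.10, 0.05, 0.05],  # clearly dominated by subtype 0
    [0.05, 0.05, 0.45, 0.40, 0.05],  # split between subtypes 2 and 3
    [0.20, 0.20, 0.20, 0.20, 0.20],  # no dominant subtype
]
print(assign_groups(mixtures))
print([round(dominance_margin(m), 2) for m in mixtures])
```

Unlike k-means, this rule needs no fitting step, but it discards information about secondary subtypes, which is why the margin may be worth tracking alongside the assignment.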
Subtypes should also be leveraged to predict a broader range of outcomes, such as medication efficacy. One way to do so is to simply extract a wider range of symptoms as well as other relevant characteristics in unstructured EHRs. This can also include information commonly collected from depression scales, such as symptom temporality or severity. To the latter point, prior analyses with structured data have already created promising predictive models for treatment response 49,50.
Limitations. This study has several limitations. First, the choice of symptoms was limited in scope. While new variables are constantly being extracted from CRIS, some symptoms classically associated with depression, such as anxiety, were not available for use in this study. This biases which subtypes can be derived from the data. For example, mood reactivity and weight gain are two symptoms that have not been extracted, making it difficult to identify and study atypical depression in this cohort.
Second, like other cluster analyses, the results presented here are sensitive to methodological changes. For example, had ten groups been chosen instead of five, the differences between groups might have been too slight to detect. Alternatively, fewer groups could have been generated, potentially obscuring important subtypes. Had we chosen two groups, distinctions between depressed patients with moderate, severe, or psychotic symptoms would have been harder to detect.
Third, patients treated in a setting like SLaM will have more severe mental health issues, since all will have either been first seen and referred by a general practitioner or will have been identified as emergency care presentations. The results presented here are specific to patients diagnosed primarily with, and treated for, depression. This excludes several relevant populations, including patients with a different primary diagnosis and people who have depression but have not yet sought treatment.
Additionally, noise is introduced into unstructured EHRs from several different sources. The symptom data here is less precise than information provided by depression scales, which track the severity of individual items, whereas entities extracted from clinical text tend to be binary: present or not recorded. Scales also specify time periods, e.g. within the previous 2 weeks, whereas it is generally difficult to extract temporal relations from text. Moreover, clinicians do not record information consistently. For example, questionnaires will always include an item for low mood or lack of interest, but this information was not always recorded for patients in this study.

Conclusion
In this study, we decomposed depression, a highly heterogeneous disorder, into 5 subtypes using a broad set of symptom data derived from unstructured EHRs. Previous studies have typically relied on a limited set of symptoms related to depression, whereas symptoms used here included those related to psychosis and bipolar disorder in addition to depression. These subtypes (severe-typical, psychotic, mild-typical, agitated, and anergic-apathetic) were created using an unsupervised latent model and validated by examining their relationship to a variety of different clinical outcomes, including those that captured future health conditions. Broadly, these subtypes tended to be significantly different in ways that corresponded well to their defining symptoms. For example, subtypes that were intuitively severe tended to have more mental and behavioral problems compared to milder presentations. Thus, they were clinically relevant and, given that they were automatically generated, could potentially be implemented in different settings to guide clinicians. Additionally, focusing on data in unstructured EHRs, which include symptoms not captured by depression scales, opens new avenues to study depression in relation to other disorders. To these ends, future work could focus on more clinical outcomes, such as antidepressant efficacy, and on leveraging more information, such as more symptom data, different data sources, or a more holistic use of clinical text.

Data availability
Data from this study is not publicly available, but access can be obtained by contacting the Clinical Record Interactive Search (CRIS) team.