Introduction

Early hospital readmission has been identified as a preventable driver of health-care costs and a key quality metric for Medicare.1 Among psychiatric patients, whose disorders are often characterized by chronicity and high rates of recurrence, readmission is a substantial concern; the period immediately following hospitalization carries elevated risk for outcomes such as suicide and relapse into substance use.

A central challenge in reducing readmission is identifying the individuals at greatest risk, who might benefit from personalized interventions.2 Few studies in psychiatry have attempted to develop clinically actionable prediction rules,3, 4 despite the widespread use of prediction methodologies in other areas of medicine. A particular obstacle to developing prediction models in psychiatry is the paucity of data on symptoms or severity in the diagnostic codes available in electronic health records (EHRs) or claims data sets. We have recently shown, for example, that ICD9 severity codes in major depressive disorder are no more reliable than chance in characterizing actual severity.5

The narrative notes in EHRs often include substantial clinical detail, including details of presentation and prior course, that may otherwise be unavailable. In general, efforts to parse narrative notes in EHR data focus on an individual diagnosis or symptom. Typical strategies use information extraction systems to identify terms manually preselected by clinicians, sometimes supplemented by additional vocabulary derived from those terms.5, 6, 7, 8, 9

Approaches in which each term must be selected by experts do not scale well if the goal is to characterize populations in terms of overall psychopathology, rather than aspects of a single diagnosis or outcome. As an alternative, more robust and generalizable term-agnostic approaches to automatic clinical concept extraction have also met with some success in other medical domains.10, 11, 12 However, the resolution of these methods for psychopathology per se appears to be limited, as they still rely on standard diagnostic categories.

Another approach to deriving information from narrative notes relies on a form of natural language processing known as topic modeling. This strategy uses word co-occurrence patterns to learn the latent set of topics discussed in the text. We have previously demonstrated that this approach can extract meaningful concepts from a set of intensive care unit notes, which were then applied to generate highly accurate classifiers to predict mortality;13, 14 another recent report utilized an alternative approach to discriminate veterans at risk for suicide based on narrative notes.15 Here we hypothesized that topic modeling could also be applied to identify topics from psychiatric discharge notes, and that these topics would usefully improve prediction of hospital readmission compared with diagnosis or single terms extracted from the note text.

Materials and Methods

Cohort derivation

The Partners HealthCare EHR includes sociodemographic data, billing codes, laboratory results, problem lists, medications, vital signs, procedure reports and discharge notes from the Massachusetts General Hospital and the Brigham and Women’s Hospital, as well as from community and specialty hospitals. We used the Informatics for Integrating Biology and the Bedside (i2b2) software (https://www.i2b2.org; i2b2 v1.6, Boston, MA, USA)16 to access and manipulate these EHR data. For this study, we selected all patients admitted to an inpatient unit of the Massachusetts General Hospital with a principal diagnosis of major depressive disorder (MDD; ICD9 296.2x and 296.3x) between January 1994 and December 2012. Only individuals age 18 or older were included; no other exclusion criteria were applied except for data-cleaning considerations. The Partners Institutional Review Board approved all aspects of this study.

Outcome definition

The primary analysis examined 30-day psychiatric readmission, corresponding to a standard health system quality metric.

Word extraction and topic model derivation

Text was extracted from all clinical notes by removing all non-letter characters (for example, digits and punctuation) and then splitting the text into words on whitespace. Stop words were removed using the Python scikit-learn17 library, which incorporates the Glasgow stop word list (http://ir.dcs.gla.ac.uk/resources/).
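As a minimal sketch of this preprocessing step (assuming scikit-learn's bundled English stop word list as a stand-in for the Glasgow list; the function name and lowercasing are illustrative assumptions):

import re
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

def tokenize(note_text):
    # Keep letters only, lowercase (assumed), split on whitespace and drop stop words.
    letters_only = re.sub(r"[^A-Za-z]+", " ", note_text)
    words = letters_only.lower().split()
    return [w for w in words if w not in ENGLISH_STOP_WORDS]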

We used term frequency–inverse document frequency (TF–IDF) scores18 to identify the 1000 most informative words in each patient's notes, and then limited our overall vocabulary to this set. Because the 1000 most informative words were selected separately for each patient, and some of these words were shared across patients, the total number of words used in topic modeling was 66 429. As notes were not deidentified, this step also helped to remove words unique to a single patient, such as surnames. Patients were excluded if their notes contained fewer than 100 non-stop words in total or if their admission and death dates were inconsistent (for example, if they were marked as being ‘readmitted’ after their date of death).
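The per-patient vocabulary selection can be sketched as follows (an illustrative implementation rather than the authors' exact code; tokenize is the function sketched above and n_top corresponds to the 1000-word cut-off):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def select_vocabulary(patient_docs, n_top=1000):
    # patient_docs: one string per patient, concatenating that patient's notes.
    vectorizer = TfidfVectorizer(tokenizer=tokenize, lowercase=False)
    tfidf = vectorizer.fit_transform(patient_docs)      # patients x words
    terms = np.asarray(vectorizer.get_feature_names_out())
    vocabulary = set()
    for row in tfidf:                                   # one patient at a time
        scores = row.toarray().ravel()
        top = np.argsort(scores)[::-1][:n_top]          # indices of the N highest TF-IDF scores
        vocabulary.update(terms[top[scores[top] > 0]])  # union of per-patient selections
    return vocabulary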

A latent Dirichlet allocation (LDA) model19 was trained on the full corpus, and topic proportions were then generated for each individual note. The LDA model produces a distribution over words for each topic, which can be used to score each note for its membership in the different topics. Our initial experiments, using a training data set derived from 70% of the clinical cohort, found no statistically significant difference in held-out prediction accuracy across a range of 25–75 topics. Below, we show the results for 75 topics (Supplementary Table 2).

Following a previously suggested strategy,20 we set the hyperparameters of the Dirichlet priors to α=50/k for the document–topic distributions and β=200/W for the topic–word distributions, where k is the number of topics and W is the vocabulary size. Topic distributions were sampled using Markov chain Monte Carlo with 2500 iterations. This topic-modeling step resulted in a k-dimensional vector of topic proportions for each note.
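The following sketch illustrates this step with the open-source lda package, which implements collapsed Gibbs sampling; the choice of library and the variable names (notes, vocabulary) are assumptions made for illustration:

import lda
from sklearn.feature_extraction.text import CountVectorizer

k = 75                                                   # number of topics
vocab = sorted(vocabulary)                               # ~66 429 words in this cohort
counts = CountVectorizer(vocabulary=vocab, tokenizer=tokenize,
                         lowercase=False).fit_transform(notes)   # word counts, one row per note
model = lda.LDA(n_topics=k, n_iter=2500,                 # 2500 sampling iterations
                alpha=50.0 / k, eta=200.0 / len(vocab))  # Dirichlet priors as above
note_topics = model.fit_transform(counts)                # k-dimensional topic proportions per note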

Supervised learning and model evaluation

The clinical cohort was randomly split 70%/30% into training and test sets, maintaining the proportion of the positive class (subjects readmitted for MDD) in each. Training and test sets were also balanced on age, gender, use of public insurance and the age-adjusted Charlson comorbidity index.21, 22
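A minimal sketch of this stratified split (using scikit-learn; features and y are assumed names for the feature matrix and the 30-day readmission labels, and the additional covariate balancing would require a separate matching step not shown here):

from sklearn.model_selection import train_test_split

# y = 1 if the patient was readmitted for MDD within 30 days, else 0
X_train, X_test, y_train, y_test = train_test_split(
    features, y, test_size=0.30, stratify=y, random_state=0)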

We trained a linear two-class support vector machine (SVM) model to predict MDD-related readmissions within 30 days. A separate SVM model was trained for each of the following feature configurations: (1) baseline clinical features, including age, gender, use of public health insurance and Charlson comorbidity index;21, 22 (2) baseline features+top-N bag-of-words features using the N most informative words from each patient’s record, with N ranging between 1 and 1000; and (3) baseline features+topic features derived from the k-topic LDA model, with k=75. These configurations are sketched schematically below.
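Schematically, the three feature configurations might be assembled as follows (a sketch with assumed variable names: demographics holds the baseline features, bow_counts the top-N bag-of-words matrix and note_topics the LDA topic proportions from above):

import numpy as np
from scipy.sparse import hstack

baseline = demographics                                  # age, gender, public insurance, Charlson index
config_1 = baseline                                      # (1) baseline only
config_2 = hstack([baseline, bow_counts]).tocsr()        # (2) baseline + top-N bag of words
config_3 = np.hstack([baseline, note_topics])            # (3) baseline + 75 LDA topic proportions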

The most informative words for the second configuration (baseline+top-N words) were selected separately for each patient; that is, the bag-of-words features comprised the union of the N words with the highest TF–IDF scores from each patient’s record. As a result, the top-1 configuration used 3013 unique words, the top-10 configuration used 18 173 unique words and the top-100 configuration used 63 438 unique words. The top-1000 configuration used nearly all the words in the vocabulary (66 429/66 451).

All features were scaled to a range of 0–1 before training. The SVM loss parameters were selected using threefold cross-validation on the training data, with area under the receiver-operating characteristic curve as the objective. The selected parameters were then used to fit a model on the entire training set and to make predictions on the test data. The same set of test data was consistently held out and never used for model selection or parameter tuning. We experimented with randomly sub-sampling the negative class in the training set to produce a minimum 70%/30% ratio between the negative and positive classes, but this did not affect the results and is not addressed further.
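As a sketch of this training procedure (a scikit-learn pipeline with an illustrative penalty grid, not the values actually searched; a dense feature matrix is assumed):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

pipeline = Pipeline([
    ("scale", MinMaxScaler(feature_range=(0, 1))),        # scale features to 0-1
    ("svm", SVC(kernel="linear")),                        # linear two-class SVM
])
search = GridSearchCV(pipeline,
                      param_grid={"svm__C": [0.01, 0.1, 1, 10]},  # illustrative grid
                      cv=3, scoring="roc_auc")            # threefold CV, AUC objective
search.fit(X_train, y_train)                              # refits the best model on the full training set
test_scores = search.decision_function(X_test)            # held-out decision values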

Test set class distributions were never modified, so that they reflect the class imbalance encountered in practice, and reported performance reflects those distributions. Topic features were derived from the LDA model trained on the complete data set, including both training and test subsets. Because deriving topics does not incorporate any knowledge of future readmission, the inclusion of the test set does not lead to overfitting or inflated estimates of discrimination.

To illustrate the discrimination of the SVM models, we generated two separate unstratified Cox regression models for two feature sets: the baseline features, and the baseline features plus 75 topics. Both models used time to psychiatric readmission as the outcome.

After a model was built for each feature set, the corresponding Kaplan–Meier curves were plotted; these are shown in Figures 1 and 2. In each figure, a single model was built for the chosen feature set, and the output of the linear SVM classifier was then used to generate a ‘group’ prediction. Note that the Cox regression models themselves ignore this group assignment variable; the variable was used only to select different subsets of the patient population for plotting.
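A sketch of this illustration using the lifelines library (an assumed choice of software; the data frame and column names are hypothetical):

import matplotlib.pyplot as plt
from lifelines import CoxPHFitter, KaplanMeierFitter

# df: pandas DataFrame with one row per patient, the chosen features, days to
# psychiatric readmission, an event indicator and the SVM-predicted group.
cph = CoxPHFitter()
cph.fit(df.drop(columns=["svm_group"]), duration_col="days_to_readmit", event_col="readmitted")

km = KaplanMeierFitter()
ax = plt.gca()
for label, grp in df.groupby("svm_group"):                # 'likely' vs 'unlikely' readmission
    km.fit(grp["days_to_readmit"], event_observed=grp["readmitted"], label=label)
    km.plot_survival_function(ax=ax)                      # one curve per predicted group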

Figure 1

Kaplan–Meier survival curve for time to psychiatric hospital readmission, for a model built using baseline sociodemographic and clinical variables only. Patients are plotted separately for two groups identified by the support vector machine model as: (1) likely psychiatric readmissions in red; and (2) unlikely psychiatric readmissions in blue.

Figure 2

Kaplan–Meier survival curve for time to psychiatric hospital readmission, for a model built using the baseline variables and 75 topics. Patients are plotted separately for two groups identified by the support vector machine model as: (1) likely psychiatric readmissions in red; and (2) unlikely psychiatric readmissions in blue.

Results

We identified 4687 inpatient discharge summaries for unique patients admitted with a principal billing diagnosis of MDD; of these, 470 patients were readmitted within 30 days with a psychiatric diagnosis, 2977 were readmitted with a nonpsychiatric diagnosis and 1240 were not readmitted. Clinical features of the full cohort are summarized in Table 1. Supplementary Table 1 shows univariate associations with readmission based on logistic regression models applied to the training data set.

Table 1 Socioeconomic and clinical features of the psychiatric hospital readmission cohort

The topics derived by the 75-topic LDA model trained on the full data set were annotated by one of the authors (RHP) based on manual inspection of the individual terms loading on each topic. Table 2 summarizes the top 10 topics identified and the clinician annotation; words particularly informative for annotation are italicized. The identified topics included many linked to psychiatric symptoms (suicide, severe depression, anxiety, trauma, eating/weight and panic) and MDD comorbidities (infection, postpartum, brain tumor, diarrhea and pulmonary disease).

Table 2 Example topics for MDD patients readmitted with a psychiatric diagnosis within 30 days

Across the configurations described above, performance in the test data set improved as measured by the area under the receiver-operating characteristic curve, from 0.618 for baseline, to 0.682 for baseline+1000 words, to 0.784 for baseline+75 topics (Table 3). Note that merely discarding less-informative words does not improve performance, as shown by the very similar performance of the baseline+top-100 and baseline+top-1000 word configurations. Figures 1 and 2 show Kaplan–Meier survival curves for the baseline model alone and for the full model incorporating baseline+75 topics. Table 3 also presents sensitivity and specificity in the test data set at a default cut-point; of note, the baseline model is highly sensitive but nonspecific, whereas the full model is somewhat less sensitive with substantially increased specificity.

Table 3 Comparison of models with and without inclusion of LDA topics

Discussion

In this investigation of 4687 individuals admitted to a psychiatric inpatient unit with a principal diagnosis of MDD, we developed statistical models to predict psychiatric readmission within 30 days. These predictions substantially exceed those expected by chance and may provide an initial means of quantifying risk. More generally, we demonstrate the utility of applying topic modeling to psychiatric narrative notes to improve risk prediction accuracy.

Notably, the specific topics identified represent both psychiatric and general medical illness features not otherwise captured in coded data, lending this approach some face validity. Greater psychiatric comorbidity such as substance use or eating disorders may increase readmission risk, along with general medical illness such as infection or dementia. Not surprisingly, markers of greater depression severity, previously shown to be poorly captured in ICD9 codes,23 may also improve prediction of readmission. Beyond prediction, it is also possible that identifying particularly high-risk topics will facilitate the development of interventions to reduce readmission risk in particular subgroups.

Outside of psychiatry, topic modeling has been demonstrated to improve prediction of mortality in intensive care unit populations.13, 14, 24 Another natural language-processing approach has recently been demonstrated to improve prediction of suicide risk in a small case–control study of narrative notes from military veterans.15 However, to our knowledge, topic modeling has not previously been applied for predicting psychiatric hospital readmissions.

We emphasize several important limitations in interpreting our results. First, although significantly better than chance, these predictions may not be sufficient by themselves to target interventions, particularly costly ones. Clinical assessment remains particularly important in this regard, and the increasing emphasis on symptom quantification may provide another means of improving prediction based on EHRs.25 Another valuable comparison would consider clinician estimates of risk at the time of hospital discharge, and the extent to which the variance explained by the present model overlaps with or complements those estimates. The goal of this initial study was rather to establish a baseline, using only artifacts of routine clinical care, which may be improved with more detailed assessments. Although discrimination may not be sufficient for clinical application (that is, sensitivity of 75% but specificity of 63%), the optimal threshold for this trade-off will depend on the nature of the intervention contemplated for those individuals characterized as high risk.

Further, as we were unable to identify a similar clinical cohort available for collaboration, we could not examine the generalizability and portability of our models. This challenge highlights the importance of establishing deidentified clinical cohorts to allow new research and clinical tools relying on natural language processing to be validated, as has been done for obesity, for example.26 Such validation would be a necessary next step before attempting to disseminate risk stratification models beyond this New England health system. We would expect, however, that topic models would generalize better than (for example) rules-based classifiers that rely on individual terms.

Finally, as the health system examined here is not a closed system—that is, it is possible that individuals could be readmitted to another hospital and go undetected—some misclassification is to be expected. In the state of Massachusetts, the vast majority of individuals requiring readmission are (per payer and regulatory requirements) readmitted to the index hospital. Such misclassification would tend to decrease the discrimination of predictive models, so the estimates provided here are likely conservative.

Despite these caveats, these results suggest the potential power of narrative notes to identify meaningful predictors of outcome not available in coded data alone. They indicate that it may be possible to develop models to identify psychiatric patients at high risk for hospital readmission, a necessary first step in developing interventions to address this risk.