Super-utilizers are patients with frequent acute (e.g., emergency department) healthcare encounters, often because of complex physical, behavioral, and social needs. Criteria for defining healthcare super-utilizers vary across the literature, but the Centers for Medicare & Medicaid Services (CMS) defines super-utilizers as “patients who accumulate large numbers of emergency department (ED) visits and hospital admissions, which might have been prevented by relatively inexpensive early interventions and primary care”.1 Super-utilization can result from or contribute to uncoordinated care, avoidable utilization of inpatient and emergency room services, and poorer health outcomes overall2. The cost implications of super-utilization are significant: a 2012–2013 analysis3 in the United States (US) showed that 50% of healthcare expenditures were attributable to just 5% of the population. Further costs are incurred through the comorbidities common among super-utilizers, including mental health and substance use diagnoses4.

Conventional approaches to managing this population rely on comprehensive care coordination, community-based care, and greater attention to complex social needs. However, combined approaches are difficult to implement and expensive4 because they are applied broadly, in contrast to narrower interventions targeting specific patient populations and their socio-contextual and behavioral needs. One example of this comprehensive approach that received national attention was carried out by the Camden Coalition of Healthcare Providers. Results of their randomized controlled trial revealed that the intervention—which involved care coordination among nurses, social workers, and community health workers—did not have any significant effect on reducing readmission rates 180 days following hospital discharge5. Patient navigator interventions targeting super-utilizers have shown mixed or moderate success in reducing hospital utilization6,7.

The Camden Coalition and others rely on the incorporation of social determinants of health into their proposed interventions8. However, as Iovan et al. (2019) found in a comprehensive review of super-utilizer interventions, most interventions target the downstream determinants surrounding patients’ material conditions (e.g. access to housing, food, and transportation), with only a select few offering interventions targeting the more fundamental social determinants of health with referrals to education, job opportunities, and vocational training9,10,11. Tackling these fundamental determinants requires a more targeted approach that can be tailored to the individual and engages the patient, their families, and caregivers beyond conventional healthcare visits.

Digital engagement is increasingly sought to effectively ‘hover’ over patients outside of their traditional healthcare encounters12. Smartphones, wearable devices, and social media may offer opportunities to engage patients, including super-utilizers, in care that meets them where they are—both physically and metaphorically—while simultaneously offering a more cost-effective and proactive approach to solving for patient needs. Facebook—used by 69% of adults in the US, of whom 74% report using it at least once a day13—provides both metadata and user-generated content, and thus offers a unique window into the lives of patients, and may reveal potential opportunities for interventions.

In this study, we sought to understand the online activity of consenting healthcare super-utilizers, specifically their day-to-day lifestyle, behaviors, and emotions. We compared their entire timeline of Facebook posts (quantified by open-vocabulary topics, dictionary-based psycholinguistic categories, and linguistic markers of anger, stress, and loneliness expressions) with those of a control group matched by propensity score on Charlson comorbidity index, gender, and age.


Characteristics of study subjects

We defined super-utilizers a priori as any patient who had six or more ED encounters within a 12-month time period within our urban health system14. Of the 1830 participants who shared Facebook data and electronic health records as part of the Social Mediome study cohort15, 109 met our criteria and were thus categorized as super-utilizers (median age of 28 and 83% women) (Table 1). The control group included 109 participants (median age of 31 and 86% women). In our cohort, super-utilizers had more documented diagnoses of injury and poisoning, respiratory symptoms, skin disorders, anxiety, depression, and drug use than the control group (Chi-squared statistic significant at p < .001), consistent with previous findings on the prevalence of comorbidities among super-utilizers1. Super-utilizers also had on average about 1.6 times as many posts (N = 1537 posts/user) in their social media profiles as the control group (N = 963 posts/user), a difference significant at p < .05 (two-tailed t-test).

Table 1 Demographic characteristics and diagnoses distribution of healthcare super-utilizers and the control group.

Differentially expressed language features in the super-utilizer group


Super-utilizers used more self-references, first person pronouns (Cohen’s D = .75, [0.47, 1.05]), words indicating present focus (D = .57, [.29, .86]), and function words such as adverbs (D = .53, [.26, .81]) and negations (D = .49, [.22, .78]) (Table 2). They also used words indicative of cognitive processes including differentiation (D = .47, [.2, .76]), tentativeness (D = .38, [.11, .66]), and discrepancies (D = .37, [.1, .64]).

Table 2 Linguistic Inquiry and Word Count (LIWC) categories significantly associated with language used in Facebook posts of healthcare super-utilizers.


Compared with the control group, super-utilizers were more likely to post about confusion and negativity (‘erked’, ‘pissed’, ‘upset’, ‘confused’, ‘rite’, D = .65, 95% CI [.38, .95]), self-reflection (‘mind’, ‘thinking’, ‘alot’, ‘much’, ‘head’, D = .63 [.35, .91]), avoidance (‘wanna’, ‘away’, ‘far’, ‘stay’, ‘cry’, D = .62 [.34, .90]), seeking help and attention (‘need’, ‘help’, ‘someone’, ‘please’, ‘come’, ‘save’, D = .61 [.33, .89]), sleep (‘fall’, ‘sleep’, ‘asleep’, ‘bed’, ‘down’, D = .60 [.32, .88]), self-agency (‘make’, ‘sure’, ‘things’, ‘move’, ‘decisions’, D = .56 [.29, .85]), swearing (D = .52 [.24, .79]), and psychosomatic symptoms (‘pain’, ‘hurt’, ‘killing’, ‘ugh’, ‘feeling’, D = .49 [.22, .77]) (Table 3). Some of the highly correlated words are colloquial spellings used on social media (e.g., ‘erked’ and ‘rite’).

Table 3 Topics (clusters of co-occurring words) significantly associated with language used in Facebook posts of healthcare super-utilizers.

Mental well-being

Super-utilizers were more likely to have posts containing language associated with anger (D = .51, [.24, .79]), stress (D = .46, [.19, .73]), and lonely expressions (D = .44, [.17, .71]). Language related to depression (D = .23, [−.03, .50]) and anxiety (D = .20, [−.06, .47]) was only slightly elevated compared to the control group.


In this study, we identified themes and contexts associated with ED super-utilizer posts on Facebook that reflected stress, anger, avoidance, attention-seeking, self-reflection, and health symptoms. Many of the topics reflect social-contextual challenges that may be contributing to healthcare seeking behaviors. Prior work has shown that super-utilizers are more likely to have complex physical, behavioral, and social needs16. Our work demonstrates that these complex circumstances are in fact reflected in the social media behaviors of this patient population as measured through linguistic characteristics that demonstrate stress, conflict, and loneliness. Future studies could investigate the extent to which social media posting and behavior—including language, images, and ‘lurking’ time—accurately reflect the lived experience of patients. Any approach involving personalized interventions would require significant technical infrastructure and thorough ethical review to guard against further stigmatization of an already vulnerable population.

Super-utilizers tend to have more severe and uncontrolled chronic illness2; the volume of language about psychosomatic symptoms posted by super-utilizers compared with the control group in this sample supports this finding. Attention-seeking language may reflect unmet needs in the daily experiences of super-utilizers and could also be a marker of loneliness, social isolation, or underlying mental health diagnoses. The burden of mental health in populations of super-utilizers has been well documented2, so the relationship between psychiatric conditions, social vulnerability, and language on social media is plausible.

Implications for intervention design

Much of the existing literature uses payer data to identify commonalities among super-utilizer patient profiles4,14,17. Among the published interventions, these data are augmented with patient interviews and assessments to gauge access to resources (e.g., social supports, living and working circumstances, and food security), which can then inform a case management method for providing targeted support to the patient14,18,19,20. While our findings support the characterizations of super-utilizers published in previous literature, they also suggest a potential application in future targeted interventions. Utilizing nontraditional digital sources to characterize the expressions of super-utilizers may allow care teams—particularly social workers and care coordinators—to understand essential elements of a patient’s daily life that may allow for a more tailored course of action to address healthcare and other needs. Such a model would require patients to share social media data with their care teams, which has several technical and ethical ramifications.

Social media analysis can potentially be used to supplement offline community care services with online social support interventions, given the high engagement of super-utilizers on social media. Engaging patients online also holds potential for increased interactive support21. A study of digital social support groups for cancer patients found that online environments provide a platform for asking questions, communicating personal experiences, and sharing emotions22. By harnessing the dynamic nature of these platforms, interventions targeted to super-utilizers could respond and adapt to these highly engaged patients in an easily accessible and familiar environment. Opportunities are also growing in the development of new digital health technologies. Prior work explored super-utilizer receptivity to digital technologies for care management and outlined key takeaways from focus groups, including widespread interest in digital health tools, challenges navigating healthcare delivery, and age-based differences in digital literacy23. Our data provide further insight into super-utilizers’ digital presence that could benefit future development of digital health technologies targeted to this population.

Ethics and privacy

Maintaining privacy and confidentiality is critical when looking toward healthcare applications of social media data24. Potential stigmatization of already vulnerable service users once they have been flagged as potential super-utilizers could be problematic and should be guarded against. Specific guidelines for social media health research should include strict protocols for protecting sensitive data, deidentification whenever possible, and data storage on HIPAA-compliant servers25. Such safeguards are one approach for protecting against any downstream insurance or employment consequences in the event of data breaches. Furthermore, any personalized interventions utilizing such data should place high value on maintaining patient agency and avoid any prescriptive measures based unilaterally on social media insights. Lastly, it is important to preserve trust in the relationship between provider and patient, especially among vulnerable populations. A note of caution: introducing social media data into the patient-physician relationship can leave patients feeling that their privacy has been violated, or can influence a provider’s treatment decisions26.


This study has several limitations. First, although the demographics of our sample are similar to the overall population served by the ED in urban hospitals15, our sample is not representative of the general population and is skewed towards younger African American females. Payer data revealed that super-utilizers with Medicaid coverage were older than other Medicaid patients, with an average age of 32.3 years for super-utilizers compared to 24.2 for patients with fewer than six hospital visits per year14. We prioritized matching on gender, the Charlson comorbidity score, and age (in that order), and found age and race to be significantly different across groups. Prior work found that gender has the largest effect on language and that language changes little after age 45, which motivated this prioritization27. Previous literature found that super-utilizers, compared to ‘low-utilizers’, are more likely to be male and African American28 and Hispanic/Latino29.

Second, the EHR visit data were obtained from one health system, whereas patients might have received care from other systems not captured in our analysis. Third, although excluding non-English-speaking participants avoids cultural confounders given the specific recruitment location, it introduces sampling bias. Further, patients who are willing to share social media data may tend to be “over-sharers”, so the conclusions drawn may not generalize to all ED super-utilizers, especially because eligibility was limited to English speakers and English-language posts.

In summary, social media language offers a window into patients’ characteristics that cannot be gleaned from their health records alone and may eventually lead to new ways to identify needs at the individual or population level. Healthcare super-utilizers’ social media posts reveal themes that suggest lifestyles, behaviors, and emotions that reflect negativity, conflict, sleep deprivation, and psychosomatic symptoms. While these findings need to be replicated in other studies before implementing interventions, this study is a step towards considering the inclusion of patient-generated data, with explicit consent, in understanding healthcare needs and sequelae—providing insight and a comprehensive view of the challenges these patients face beyond their medical presentation.


Study design and setting

The study was approved by the University of Pennsylvania Institutional Review Board. From March 2014 through December 2017, patients receiving care in the emergency department (ED) of an urban academic hospital system were recruited as a convenience sample and approached about participating in a study to merge social media and electronic health record (EHR) data15. All participants gave written informed consent to use their data for this study.

Selection of participants

We retrieved Facebook status updates up to 5 years prior to the ED index visit for all participants who consented to share their Facebook posts (N = 4587). We did not access data from the Facebook pages of study participants’ friends or from posts on the study participants’ pages made by anyone other than the participant. We excluded non-English posts and selected users with a minimum of 400 words, determined from prior work to be the minimum threshold for reliably predicting user traits from language30, retaining 1830 participants with Facebook data.

Extracting data from the EHR, we identified the ED visits for these participants which coincided with years when they also had Facebook data. We first identified all years (from 2009 to 2016) in which participants had six or more ED visits. For each patient, we obtained primary ICD-9 codes of every ED and inpatient visit available in the EHR. Then, we used these ICD-9 codes to obtain the diagnoses by mapping them onto the categories in Elixhauser comorbidity codes31. We used these categories to identify differences in diagnoses across super-utilizer and control groups. Further, using the same ICD-9 codes, we calculated the Charlson comorbidity index to obtain a measure of severity of disease for every patient. We characterized patients with six or more ED visits in any year from 2009 to 2016 as super-utilizers—as most of them had contiguous hospital visits in these years16. Since healthcare utilization varies based on demographics and severity of illness, we identified a propensity score matched group of control users based on the Charlson comorbidity index, gender, and age of our super-utilizer set in a retrospective case-control manner.
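The cohort construction described above can be illustrated with a minimal sketch. It assumes ED visits are available as (patient, calendar year) pairs and that a propensity score has already been estimated for each patient from the Charlson comorbidity index, gender, and age. The function names, the greedy 1:1 nearest-neighbor strategy, and matching without replacement are illustrative assumptions; the text above does not specify the matching algorithm used.

```python
from collections import Counter


def flag_super_utilizers(ed_visits, threshold=6):
    """Return patients with at least `threshold` ED visits in any one calendar year.

    ed_visits: iterable of (patient_id, year) pairs, one pair per ED visit.
    """
    visits_per_year = Counter(ed_visits)  # (patient_id, year) -> number of visits
    return {pid for (pid, _year), n in visits_per_year.items() if n >= threshold}


def greedy_match(case_scores, control_scores):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    Both arguments map patient_id -> precomputed propensity score
    (hypothetical input; estimating the score itself is not shown here).
    Returns a dict {case_id: matched_control_id}.
    """
    available = dict(control_scores)
    matches = {}
    # Process cases in order of increasing score so the result is deterministic.
    for case_id, score in sorted(case_scores.items(), key=lambda kv: kv[1]):
        if not available:
            break  # fewer controls than cases
        nearest = min(available, key=lambda cid: abs(available[cid] - score))
        matches[case_id] = nearest
        del available[nearest]  # each control is used at most once
    return matches


if __name__ == "__main__":
    visits = [("a", 2012)] * 6 + [("b", 2012)] * 3 + [("b", 2013)] * 3
    print(flag_super_utilizers(visits))
    print(greedy_match({"a": 0.30}, {"x": 0.72, "y": 0.28}))
```

In this toy input, patient "b" has six visits in total but never six within a single year, so only "a" is flagged.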

Linguistic attributes

We characterized posts using three sets of language features: (a) dictionary-based psycholinguistic features, (b) open-vocabulary topics32, and (c) mental well-being attributes, such as anger, anxiety, depression, stress, and lonely expressions by applying previously developed predictive models33,34,35.


From each post, we extracted the relative frequency of words/tokens. We removed words used by fewer than 1% of users. We then compared the posts of the super-utilizer and control groups against the 73 psycholinguistic categories from the Linguistic Inquiry and Word Count (LIWC)36. For each participant, we measured the proportion of tokens (including words, emoticons, etc.) falling in each LIWC category.
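As an illustration of the dictionary-based features, the following sketch computes the share of a user's tokens that fall in each category and builds a vocabulary restricted to words used by at least a given fraction of users. The two-category lexicon is a hypothetical stand-in (LIWC itself is a licensed dictionary and is not reproduced here), and the simple word tokenizer is an assumption that ignores emoticons.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT the LIWC dictionary.
TOY_LEXICON = {
    "first_person": {"i", "me", "my"},
    "negation": {"no", "not", "never"},
}


def tokenize(text):
    """Lowercase word tokenizer (assumption: emoticons are not handled)."""
    return re.findall(r"[a-z']+", text.lower())


def category_proportions(posts, lexicon=TOY_LEXICON):
    """Proportion of one user's tokens that belong to each lexicon category."""
    tokens = [tok for post in posts for tok in tokenize(post)]
    total = len(tokens)
    counts = Counter(tokens)
    return {
        category: (sum(counts[w] for w in words) / total if total else 0.0)
        for category, words in lexicon.items()
    }


def frequent_vocabulary(tokens_by_user, min_user_fraction=0.01):
    """Words used by at least `min_user_fraction` of users.

    tokens_by_user: dict mapping user_id -> list of tokens.
    """
    n_users = len(tokens_by_user)
    # Count each word once per user, regardless of how often that user wrote it.
    users_per_word = Counter(
        w for tokens in tokens_by_user.values() for w in set(tokens)
    )
    return {w for w, n in users_per_word.items() if n / n_users >= min_user_fraction}


if __name__ == "__main__":
    print(category_proportions(["I am not okay", "my head hurts"]))
```

Counting each word once per user, rather than once per occurrence, is what makes the 1% threshold a statement about users and not about raw frequency.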


We also used an open-vocabulary approach. Two hundred latent Dirichlet allocation (LDA) topics (groups of co-occurring words) were generated using Facebook posts contributed by patients from a prior study32. The LDA generative model assumes that posts contain a combination of topics, and that topics are distributions over words. Since the words in a post are known, the topics, which are latent variables, can be estimated through Gibbs sampling. We used the Mallet implementation of the LDA algorithm, adjusting one parameter (alpha = 5) to favor fewer topics per post. All other parameters were kept at their defaults. An example of such a topic is the set of words (‘tuesday’, ‘monday’, ‘wednesday’,…), which clusters together days of the week by exploiting their similar distributional properties across posts. We calculated the topic distribution of each user aggregated across all posts.
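One common way to obtain a user-level topic distribution from a fitted topic model is to weight each word's topic probabilities by how often the user wrote that word, i.e. usage(topic, user) = Σ_word p(topic | word) · p(word | user). The sketch below shows that aggregation. It is an assumption about the aggregation step, not a description confirmed by the text above, and the p(topic | word) values are made up for illustration.

```python
from collections import Counter, defaultdict


def user_topic_distribution(tokens, topic_given_word):
    """Aggregate topic usage for one user across all of their posts.

    tokens: list of all tokens the user wrote (all posts pooled).
    topic_given_word: dict word -> {topic_id: p(topic | word)}, taken from a
        previously fitted topic model (hypothetical values in the demo below).
    Returns dict topic_id -> sum over words of p(topic | word) * p(word | user).
    """
    counts = Counter(tokens)
    total = sum(counts.values())
    usage = defaultdict(float)
    for word, n in counts.items():
        # Words unknown to the topic model contribute nothing.
        for topic, p in topic_given_word.get(word, {}).items():
            usage[topic] += p * n / total
    return dict(usage)


if __name__ == "__main__":
    # Made-up conditional probabilities for two illustrative topics.
    toy_model = {
        "monday": {"days": 1.0},
        "tuesday": {"days": 0.9, "sleep": 0.1},
        "sleep": {"sleep": 1.0},
        "bed": {"sleep": 0.8},
    }
    print(user_topic_distribution(["monday", "tuesday", "sleep", "bed"], toy_model))
```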

Mental well-being attributes

We used automatic text-regression methods developed in previous work to assign each user scores for depression35, anxiety37, anger, stress34, and lonely expressions33.

Identifying differentially expressed language features in the super-utilizer group

Posts from the same years (2009–2016) were used for both case and control groups. We designed this as a person-level analysis in which each individual was counted only once: 109 cases and 109 controls. All language features were extracted and compared at the individual level. Each linguistic attribute and mental well-being attribute was used as input to a logistic regression model. The models were set up to predict super-utilizer status (i.e., group was the dependent variable). In accordance with conventional linguistic analysis, we used a p-value threshold of <.05 for LIWC and mental health attributes and p < .01 for topics, after adjusting for multiple comparisons using Benjamini–Hochberg correction, to identify potentially meaningful associations. For each retained attribute, we calculated Cohen’s D for the super-utilizer group with the control group as reference38.
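The two statistical steps named above, Benjamini–Hochberg correction and Cohen’s D, can be sketched as follows. This is a minimal illustration rather than the analysis code: the pooled-standard-deviation form of Cohen’s D is an assumption (the text does not state which variant was used), and the per-attribute logistic regression that would produce the p-values is not shown.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(cases, controls):
    """Standardized mean difference, with controls as the reference group.

    Uses the pooled sample standard deviation (an assumed variant).
    """
    n1, n2 = len(cases), len(controls)
    pooled_var = (
        (n1 - 1) * variance(cases) + (n2 - 1) * variance(controls)
    ) / (n1 + n2 - 2)
    return (mean(cases) - mean(controls)) / sqrt(pooled_var)


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, aligned with `p_values`, that is True where
    the hypothesis is rejected while controlling the false discovery rate
    at `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(cohens_d([2, 4, 6], [1, 3, 5]))
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6]))
```

Passing `alpha=0.01` would correspond to the stricter threshold applied to topics.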

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.