Multiscale classification of heart failure phenotypes by unsupervised clustering of unstructured electronic medical record data

As a leading cause of death and morbidity, heart failure (HF) is responsible for a large portion of healthcare and disability costs worldwide. Current approaches to define specific HF subpopulations may fail to account for the diversity of etiologies, comorbidities, and factors driving disease progression, and therefore have limited value for clinical decision making and development of novel therapies. Here we present a novel and data-driven approach to understand and characterize the real-world manifestation of HF by clustering disease and symptom-related clinical concepts (complaints) captured from unstructured electronic health record clinical notes. We used natural language processing to construct vectorized representations of patient complaints followed by clustering to group HF patients by similarity of complaint vectors. We then identified complaints that were significantly enriched within each cluster using statistical testing. Breaking the HF population into groups of similar patients revealed a clinically interpretable hierarchy of subgroups characterized by similar HF manifestation. Importantly, our methodology revealed well-known etiologies, risk factors, and comorbid conditions of HF (including ischemic heart disease, aortic valve disease, atrial fibrillation, congenital heart disease, various cardiomyopathies, obesity, hypertension, diabetes, and chronic kidney disease) and yielded additional insights into the details of each HF subgroup’s clinical manifestation of HF. Our approach is entirely hypothesis free and can therefore be readily applied for discovery of novel insights in alternative diseases or patient populations.

Heart failure (HF) is a leading cause of death and morbidity worldwide and is responsible for a large portion of healthcare and disability costs every year [1]. HF is challenging to treat because it can have various causes, is impacted by a wide array of patient genetic and lifestyle factors and comorbidities, and can manifest, progress, and respond to treatment differently among individuals [1-3]. Classification schemes for HF help clinicians determine the disease phenotype, select appropriate treatments, and define study populations for randomized controlled trials (RCTs) of HF interventions. Such schemes are typically defined in a top-down manner based on HF etiology [4-6], functional assessments (e.g., the New York Heart Association Functional Classification), imaging-based lab values (such as ejection fraction), and biomarkers (e.g., NT-proBNP, cardiac troponin) [1,3,7]. However, there is a consensus that these existing classification schemes for HF are coarse and often do not account for the heterogeneity stemming from a wide range of patient factors and comorbidities, which may have a large impact on outcomes [3,8].
Although HF classifications have provided the framing and structure of most research in the field, the development of novel, data-driven classifications of disease that capture heterogeneous clinical presentations has the potential to improve the understanding and care of HF patients, in particular to implement precision medicine and drive better outcomes [9,10]. With the rise of real-world data (RWD) and machine learning, a new line of research has developed that attempts to use these tools to inform these existing schemas [11,12]. The increasing availability of electronic health records (EHR) supplies valuable RWD for analysis of HF subpopulation characteristics and treatment performance, which can provide insights in care settings that may not be accurately represented by the highly controlled care and selective inclusion/exclusion criteria of RCTs. Importantly, EHRs contain a wealth of rich, real-world information about patient disease captured in the expressive nature of unstructured clinical narratives, in particular regarding the diversity of the conditions in multimorbid patients and patient-reported or non-billable symptoms [13,14], which are typically not used in existing classification schemes and phenotyping algorithms. This can be of particular interest in heart failure and especially HF with preserved ejection fraction (HFpEF), where there is recognition of a large amount of phenotypic diversity and a lack of interventions shown to improve outcomes [15]. At the same time, there is a growing body of work that aims to use a data-driven approach to phenotype disease in a variety of conditions, including chronic obstructive pulmonary disease (COPD) [16,17], asthma [18], sepsis [19], gout [20], Parkinson's disease [21], and heart failure [22-24]. In particular, clustering has emerged as a paradigm for inferring phenotypes from data, rather than relying on top-down classification schemes [17,20-22,24,25].
So far, most clustering studies have been performed on a limited, fixed set of top-down clinical or domain-specific definitions (e.g., specific biomarkers or diagnostic codes) to make the problem of finding subpopulations tractable. This hypothesis-driven approach may limit their ability to fully capture the diversity of the disease states of the population and ultimately prevent discovery of novel factors contributing to phenotypic variation. We thus theorized that a data-driven, unsupervised phenotype discovery methodology that groups HF patients according to similarity of disease manifestation (phenotype) in RWD unstructured clinical text can provide insights into HF subpopulations' disease etiology and defining characteristics that may not be apparent in classically defined HF subgroups.
In this study, we present a hypothesis-free, data-driven clustering approach to understand real-world manifestations of heart failure in a large population of HF patients. Specifically, we construct groups of similar patients via unsupervised clustering of HF patients' symptoms and complaints mentioned in unstructured EHR data. These clusters and their distinctive pattern of disease manifestation (i.e., clinical complaints) can be understood as HF patient disease phenotypes. Importantly, we find that the resultant HF phenotypes correspond to clinically meaningful etiologies and endpoints of heart failure, which can be interpreted within a hierarchical framework and explored at various levels of granularity. In particular, the ability to reconstruct the pattern of disease subtypes can be beneficial in understanding the diversity of real-world patient populations in complex syndromes like HF. Such an approach can provide a complementary perspective to HF and may ultimately inform and contribute to a more precise HF classification scheme, especially if applied to very large heart failure populations. Finally, because the method is entirely unsupervised and does not require HF-specific domain expertise or definitions, this general methodology can be readily applied to gain insights into real-world manifestations of other complex diseases.

Materials and methods
In this study, we employed a clustering methodology to partition a large HF population into groups of similar patients. Using statistical testing to find significantly overrepresented patient complaints in the resultant clusters allows these HF patient subgroups to be interpreted as data-driven HF phenotypes. This section presents the methods employed to construct and interpret cluster-based HF phenotypes using a large repository of EHRs.
Description of dataset. Electronic health record dataset. In this study, we used the EHRs from a national medical research center located in a major metropolitan center in western Russia [14]. The center provides the full cycle of medical services, including inpatient and outpatient departments, imaging, rehabilitation services, perinatal care (including pediatric intensive care and surgery), and dentistry. Inpatient services are spread across various institutes and departments, and include, among others, internal medicine, functional diagnostics, intensive care units (ICU), including neonatal ICU (NICU), surgery (including cardiovascular, oncology, neurology, robotic surgery, etc.), clinical pharmacology, and chemotherapy. The longitudinal records used in this study were collected over a 10-year time span (2008-2018). Use of de-identified data for research purposes was approved by the institution.
Heart failure cohort definition. The heart failure analysis cohort was defined using the International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM). We included any patient who was diagnosed with an ICD-10 code for heart failure (I50), cardiomyopathy (I42), or hypertensive disease with heart failure (I11.0, I13.0, and I13.2). Patients of any age were included in the cohort. Except for the HF diagnosis, no a priori inclusion/exclusion criteria were used.
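The inclusion rule above can be illustrated with a minimal sketch. The input layout (pairs of patient identifier and ICD-10 code) and the function names are assumptions made for illustration only; the decision to match I50 and I42 by prefix and the hypertensive codes exactly is likewise an assumption about how subcodes were handled.

```python
# Inclusion codes taken from the cohort definition in the text.
HF_PREFIXES = ("I50", "I42")              # heart failure, cardiomyopathy (any subcode; assumed)
HF_EXACT = {"I11.0", "I13.0", "I13.2"}    # hypertensive disease with heart failure


def is_hf_code(code: str) -> bool:
    """Return True if an ICD-10 code satisfies the HF inclusion criteria."""
    code = code.strip().upper()
    return code.startswith(HF_PREFIXES) or code in HF_EXACT


def select_cohort(diagnoses):
    """diagnoses: iterable of (patient_id, icd10_code) pairs (hypothetical layout).

    A patient enters the cohort if any one of their codes matches.
    """
    return {pid for pid, code in diagnoses if is_hf_code(code)}


# Toy records, not real data.
diagnoses = [
    ("p1", "I50.9"),
    ("p2", "I10"),
    ("p3", "I13.2"),
    ("p2", "E11.9"),
    ("p4", "i42.0"),
]
cohort = select_cohort(diagnoses)
```

Because no further inclusion or exclusion criteria are applied, a single matching code is sufficient for a patient to enter the cohort.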
After applying the diagnostic inclusion criteria, the resultant heart failure cohort consisted of 25,952 patients (Fig. 1). The number of patients matching each ICD-10 code in the inclusion criteria is shown in Table 1. A majority of the cohort (79.12%) had an ICD-10 code for heart failure (I50), while 26.24% had a cardiomyopathy code (I42) and 13.59% had a hypertensive heart disease with heart failure code. A majority of the cohort was male (57.4%), and the median age of adults in the dataset was 58 for males and 63 for females (interquartile range 48 to 67 for male and 48 to 72 for female patients), which suggests a relatively young heart failure population and is consistent with expected values of life expectancy and cardiovascular mortality and morbidity in the Russian Federation [26-29]. Table 1 also characterizes the incidence of selected comorbid conditions within the cohort. Patients were labeled with comorbid phenotypes using an ICD-10 code and text-based approach [14].
Discovering heart failure phenotypes via clustering. EHR processing and feature extraction. To cluster patients into HF phenotypes, we first needed to convert the patients in the HF cohort into a vectorized representation suitable as input to a clustering algorithm. We chose to use the clinical notes found in each patient's EHR as the data source for clustering, since unstructured clinical narratives contain detailed textual descriptions of a patient's diagnoses, comorbid conditions, unbilled complaints, and other rich descriptors of disease that are often missing from structured data elements. We extracted complaints from all of the unstructured text in each patient's EHR for analysis. Medical concepts from the clinical notes of the heart failure patient cohort were identified using the Russian-language clinical named entity recognition (NER) system as described in Ref. 14 (Fig. 1A). This system extracts mentions of clinical concepts from several clinically relevant ontologies included in the Unified Medical Language System (UMLS) [30] and maps them to a concept unique identifier (CUI), which allows different strings to be matched to the same concept (e.g., "Type 2 diabetes" and "DM2" will both be assigned the same CUI). In this study, we extracted entities in SNOMED CT [31] and represented each entity by its normalized CUI.
Within UMLS, each CUI has one or more semantic types. We limited our analysis to entities corresponding to patient complaints (e.g., diseases, signs, symptoms, conditions; for a full list of UMLS semantic types, see Supplementary Table S1) and discarded entities corresponding to interventions (e.g., medications, procedures) and anatomy. We also removed from the analysis all entities with negative polarity (e.g., "Patient denies headache"). We aggregated all positively mentioned complaints for each patient over the entirety of the EHR timeline to generate a vector of counts of each complaint for each patient.
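The filtering and aggregation steps just described can be sketched as follows. The tuple layout of an extracted mention, the placeholder concept identifiers, and the three semantic types listed are assumptions for illustration; the study's actual list of semantic types is given in its Supplementary Table S1.

```python
from collections import Counter, defaultdict

# Hypothetical subset of UMLS semantic types treated as "complaints".
COMPLAINT_TYPES = {"Disease or Syndrome", "Sign or Symptom", "Finding"}


def complaint_counts(mentions):
    """Aggregate positive complaint mentions per patient.

    mentions: iterable of (patient_id, cui, semantic_type, negated) tuples
    (hypothetical layout of the NER output).
    Returns a dict mapping patient_id -> Counter of CUI counts.
    """
    counts = defaultdict(Counter)
    for patient_id, cui, semantic_type, negated in mentions:
        if negated:
            continue  # drop negative-polarity mentions such as "denies headache"
        if semantic_type not in COMPLAINT_TYPES:
            continue  # drop interventions, anatomy, and other non-complaints
        counts[patient_id][cui] += 1
    return counts


# Toy mentions with placeholder identifiers, not real CUIs.
mentions = [
    ("p1", "CUI_DYSPNEA", "Sign or Symptom", False),
    ("p1", "CUI_DYSPNEA", "Sign or Symptom", False),
    ("p1", "CUI_HEADACHE", "Sign or Symptom", True),
    ("p1", "CUI_ASPIRIN", "Pharmacologic Substance", False),
    ("p2", "CUI_T2DM", "Disease or Syndrome", False),
]
counts = complaint_counts(mentions)
```

Each patient's counter plays the role of one "document" in the TF-IDF step that follows.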
The vector of counts of complaints of each patient was then transformed using the term frequency-inverse document frequency (TF-IDF) approach, a standard method for text representation in information retrieval and other NLP algorithms. TF-IDF provides a measure of importance of a word or term to a document within a corpus [32]. The TF for a given complaint counts the number of times that the complaint occurs within the entire EHR of the patient (document), while the IDF penalizes terms that occur in many patients in the cohort (corpus).
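As a rough illustration of this weighting, the sketch below computes TF-IDF over per-patient complaint counts. The exact TF-IDF variant used in the study is not stated; the smoothed IDF and the L2 normalization here are assumptions, chosen because they are a common default, and the complaint names are toy values.

```python
import math


def tfidf(counts):
    """counts: dict patient_id -> dict complaint -> raw count.

    Returns dict patient_id -> dict complaint -> L2-normalized TF-IDF weight.
    """
    n_patients = len(counts)
    # Document frequency: number of patients mentioning each complaint.
    df = {}
    for vec in counts.values():
        for complaint in vec:
            df[complaint] = df.get(complaint, 0) + 1
    # Smoothed IDF (assumed variant): complaints seen in many patients get lower weight.
    idf = {c: math.log((1 + n_patients) / (1 + d)) + 1 for c, d in df.items()}
    weighted = {}
    for pid, vec in counts.items():
        w = {c: tf * idf[c] for c, tf in vec.items()}
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        weighted[pid] = {c: v / norm for c, v in w.items()}
    return weighted


# Toy counts: "dyspnea" occurs in every patient, "edema" in only one.
toy_counts = {
    "p1": {"dyspnea": 2, "edema": 2},
    "p2": {"dyspnea": 2},
    "p3": {"dyspnea": 1, "angina": 4},
}
weights = tfidf(toy_counts)
```

In the toy data, dyspnea and edema are mentioned equally often for patient p1, but edema receives the larger weight because it is rarer across the cohort.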
Inferring HF phenotypes using clustering. We defined heart failure phenotypes by grouping aggregated patient complaint TF-IDF vectors into clusters using K-means clustering [33] (Fig. 1B). The resultant clusters contain patients grouped by similar patterns of complaints and comorbidities, which can then be interpreted as phenotypes. We applied K-means clustering for K ∈ [2, 3, ..., 30] and utilized a cluster bootstrapping method to determine the values of K resulting in stable clusters, which can then be interpreted as reproducible, data-driven HF phenotypes. After finding a set of viable clusters via cluster bootstrapping, we aimed to visualize the hierarchical structure of the clustering result. The resulting phenotype dendrogram allows us to understand the hierarchical relationship between clusters at different values of K and provides a visualization of the phylogenetic tree of complaints and symptoms.
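The text does not specify the bootstrapping procedure, so the following is only a sketch of one common scheme: re-cluster bootstrap resamples and score each original cluster by the Jaccard index with its best-matching bootstrap cluster. The plain Lloyd's K-means with farthest-point initialization, the function names, and the toy 2D points are all assumptions for illustration; real inputs would be the TF-IDF complaint vectors.

```python
import random


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def bootstrap_stability(points, k, n_boot=20, seed=0):
    """Mean best-match Jaccard index per original cluster over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(points)
    base = kmeans(points, k)
    base_sets = [{i for i in range(n) if base[i] == j} for j in range(k)]
    scores = [0.0] * k
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        labels = kmeans([points[i] for i in idx], k)
        boot_sets = [{idx[t] for t in range(n) if labels[t] == j} for j in range(k)]
        sampled = set(idx)
        for j in range(k):
            reference = base_sets[j] & sampled  # original cluster restricted to the resample
            scores[j] += max(jaccard(reference, b) for b in boot_sets) / n_boot
    return scores


# Two well-separated toy groups of points.
blob_a = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
blob_b = [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5)]
points = blob_a + blob_b
stability = bootstrap_stability(points, k=2)
```

Under this scheme, a value of K would be considered stable when all of its clusters score close to 1; the threshold used in the study is not given in the text.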
To create a clinical interpretation of each phenotype cluster, we used statistical testing to find complaints that were significantly overrepresented within the cluster as compared to the rest of the heart failure cohort. Doing so allows us to determine the distinguishing medical concepts, or features, associated with each cluster. We employed Bonferroni correction for multiple comparisons. Samples were tested and confirmed to be normally distributed. Finally, we performed an analysis to quantify the co-occurrence rate of important (significantly overrepresented) concepts associated with each cluster. To quantify the co-occurrence of concepts associated with each phenotype, we considered the top 10 most significantly associated (smallest p value) concepts in each cluster. Thus, for a given value of K, we consider 10 × K concepts; we then calculate a concept association score a_ij using the Jaccard index.
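The enrichment test can be sketched as below. The study reports a one-sided t-test with Bonferroni correction; this sketch instead uses a Welch-type statistic with a normal approximation for the upper-tail p value so that it runs with the standard library only, which is an assumption that is reasonable only for large clusters. Correcting over the number of concepts tested in a single cluster is also a simplification, and all names and values are hypothetical.

```python
import math
from statistics import mean, variance


def upper_tail_p(in_cluster, rest):
    """One-sided p value for mean(in_cluster) > mean(rest).

    Welch-type statistic with a normal approximation (assumed; not the exact t-test).
    """
    se = math.sqrt(variance(in_cluster) / len(in_cluster) + variance(rest) / len(rest))
    diff = mean(in_cluster) - mean(rest)
    if se == 0.0:
        return 0.0 if diff > 0 else 1.0
    return 0.5 * math.erfc(diff / se / math.sqrt(2))


def enriched(tfidf_vectors, cluster, alpha=0.05):
    """Return (concept, p) pairs significantly overrepresented in `cluster`.

    tfidf_vectors: dict patient_id -> dict concept -> TF-IDF value (missing means 0).
    cluster: set of patient ids belonging to the cluster under test.
    """
    concepts = sorted({c for vec in tfidf_vectors.values() for c in vec})
    threshold = alpha / len(concepts)  # Bonferroni: one test per concept
    hits = []
    for c in concepts:
        inside = [v.get(c, 0.0) for pid, v in tfidf_vectors.items() if pid in cluster]
        outside = [v.get(c, 0.0) for pid, v in tfidf_vectors.items() if pid not in cluster]
        p = upper_tail_p(inside, outside)
        if p < threshold:
            hits.append((c, p))
    return sorted(hits, key=lambda h: h[1])  # smallest p value first


# Toy TF-IDF values: "angina" is concentrated in the cluster, "dyspnea" is not.
toy_vectors = {
    "p1": {"angina": 0.9, "dyspnea": 0.4},
    "p2": {"angina": 0.8, "dyspnea": 0.5},
    "p3": {"angina": 0.85, "dyspnea": 0.45},
    "p4": {"angina": 0.1, "dyspnea": 0.5},
    "p5": {"dyspnea": 0.4},
    "p6": {"angina": 0.05, "dyspnea": 0.45},
}
hits = enriched(toy_vectors, cluster={"p1", "p2", "p3"})
```

Sorting by p value mirrors the "top 10 most significantly associated concepts" selection described above.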

Results
K-means clustering of the medical complaint vectors of the HF cohort for K ∈ [2, 3, ..., 30] revealed stable clusters for K ∈ [2, 3, 4, 6, 8, 10, 13, 15, 17, 27] via cluster bootstrapping (starred values of K in Supplementary Fig. S1), which we consider the hierarchy of data-driven HF phenotypes. In the following sections, we visualize and describe in depth the resultant phenotypes for one value of K (K = 15) and interpret these phenotypes in the context of the overall derived data-driven hierarchy of HF.
Discovering complaints-driven heart failure phenotypes. Table 2 shows the top ten most over-represented complaints and symptoms (smallest p values) in each respective HF phenotype cluster for K = 15. We further characterized each cluster by examining descriptive statistics of several clinical characteristics (Table 2). The clinical characteristics we used were number of patients, patient age, sex, body mass index (BMI), in-hospital mortality rate, and structured diagnosis codes (ICD-10). For age, we used the patient's age at his or her last encounter. For in-hospital mortality rate, we utilized a structured data field in the EHR; actual mortality rates are almost certainly higher. The names chosen for the different clusters reflect the significantly overrepresented complaints in the respective cluster, as well as the descriptive statistics for each cluster. Figure 2 shows an exemplary 2D visualization using a t-Distributed Stochastic Neighbor Embedding (t-SNE) based mapping of the HF cohort. Each point represents a single HF patient; the color indicates the cluster assignment for K = 15. From this, we can visualize the relative distance between individual patients in the HF cohort, as well as their respective HF phenotypes.
Reconstructing a hierarchy of heart failure classification. Examining Table 2 and Fig. 2, it is apparent that for K = 15 the population of heart failure patients is grouped into clusters with shared clinical characteristics. Intuitively, we can see that some clusters are more similar to each other than others. To quantify and visualize the natural hierarchy of heart failure within the cohort, we constructed a phenotype dendrogram for K = 15 (Fig. 3A). The stable clusters used in constructing the dendrogram are K ∈ [2, 3, 4, 6, 8, 10, 13] (marked with green corridors in Supplementary Fig. S1). All patients are aggregated at the left side of the dendrogram; each successive branch point shows the value of K at which a cluster splits into two smaller clusters. As K increases, branch points are emphasized with colored highlights; branch points further to the right on the dendrogram represent clusters that are more similar to each other as quantified by their Jaccard index. Thus, we can interpret that at K = 13 versus K = 15, Congenital heart defects and NICU are merged into one cluster, and Myocardial infarction and Unstable angina are also one cluster. Branches are labeled using a clinical interpretation of the hierarchical structure of the clusters. Figure 3B shows the same t-SNE visualization found in Fig. 2, with cluster assignment colored for values of K ∈ [2, 4, 8, 15]. This allows us to visualize the same information contained in the dendrogram for selected values of K.
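One way to picture how clusters at two values of K are connected in such a dendrogram is to link each finer cluster to the coarser cluster with which it shares the highest Jaccard index of patient membership. The sketch below assumes that linking rule; the function names, cluster labels, and patient identifiers are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets of patient identifiers."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def link_levels(coarse, fine):
    """Link each cluster at the larger K to its best-matching cluster at the smaller K.

    coarse, fine: dict cluster_name -> set of patient ids.
    Returns dict fine_cluster -> (parent_cluster, jaccard_index).
    """
    links = {}
    for child, members in fine.items():
        parent = max(coarse, key=lambda name: jaccard(members, coarse[name]))
        links[child] = (parent, jaccard(members, coarse[parent]))
    return links


# Toy memberships at two levels of the hierarchy.
level_k2 = {"ischemic": {1, 2, 3, 4}, "non_ischemic": {5, 6, 7, 8}}
level_k3 = {"ischemic": {1, 2, 3, 4}, "congenital": {5, 6}, "acquired": {7, 8}}
links = link_levels(level_k2, level_k3)
```

Repeating this between successive stable values of K yields the parent-child edges of a tree; a cluster that persists unchanged links to its parent with an index of 1.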
From the first branch point in Fig. 3A at K = 2, we can see the highest level of hierarchy within HF occurs with splitting the HF cohort into groups corresponding to ischemic and non-ischemic heart disease. Next, the non-ischemic heart disease group splits into subgroups that represent congenital vs. acquired and genetic etiologies of heart failure. Finally, at higher levels of K, patients within the acquired and genetic non-ischemic heart disease group further fragment into HF subgroups containing atrial fibrillation, dilated cardiomyopathy, aortic valve disease, and decompensated heart failure (which are predominantly composed of male patients), hypertensive and cerebrovascular disease (predominantly female), and various cardiomyopathies.
Characterizing properties of discovered phenotypes. Examining the characteristics of each cluster in Table 2 and the dendrogram in Fig. 3, we find well-known interpretable causes and manifestations of heart failure. In the following section, we provide a clinical interpretation of selected clusters for K = 15. For analysis, the full set of significantly associated concepts (Supplementary Table S2) and descriptive statistics (Table 2) for each cluster were used.
Ischemic heart disease. There are four clusters of heart failure patients associated with ischemic heart disease (Fig. 3A, top branch; Unstable angina, History of myocardial infarction, Acute myocardial infarction, and Cardiac surgery), the dominant etiology of congestive heart failure [34]. All of these clusters are dominated by males (63.6%, 69.9%, 73.4%, and 79.4%, respectively) and contain patients in their early 60s. Additionally, these clusters all have a similar chronic disease profile based on their ICD-10 codes, and include high prevalence of ischemic heart disease and associated concepts (including coronary heart disease, angina pectoris, myocardial infarction, myocardial ischemia, and stenosis), hypertension, peripheral artery disease, diabetes, stomach disease, and COPD. Within this group, the two most similar clusters are Acute myocardial infarction and Unstable angina, both of which contain patients undergoing acute ischemic events.
Table 2. Cluster characteristics for K = 15. Top ten most significant concepts for each phenotype, ranked by p value (smallest to largest). Significance was determined using a one-sided (greater) t-test with Bonferroni correction testing the null hypothesis that the distribution of values of TF-IDF features for a medical entity in cluster i are drawn from the same distribution as the same entity in all other clusters. At right are shown characteristics of heart failure phenotypes, including number of patients, age, sex breakdown, and body mass index (BMI). The "Mortality" statistic denotes the percentage of patients in the cluster that expired within the hospital, as recorded in their EHRs. The "ICD-10" column shows the six most frequent ICD-10 codes and/or groups of codes with more than 5% incidence within the cluster.
The Myocardial infarction cluster contains patients who experienced an MI (99.1% of patients have the term myocardial infarction mentioned in their notes, and 91% of patients have an ICD-10 code for acute myocardial infarction) and have a higher in-hospital mortality (3.01%); we also observe that these patients have a high prevalence of coronary heart disease (99%), and previous studies have shown a link between coronary heart disease and an increased risk of heart failure after myocardial infarction [35]. Similarly, patients in the Unstable angina cluster also have a relatively high rate of ICD-10 codes signifying myocardial infarction (14.5%) but have a higher prevalence of acute coronary syndrome (71.0% of patients).
The next branch point includes patients in the cluster named History of myocardial infarction. We observe that myocardial infarction is highly associated with this cluster (78.2% of these patients have mentions of myocardial infarction) but the EHRs from patients in this cluster have a very low rate of diagnostic codes signifying myocardial infarction. We can interpret this cluster as patients with a history of myocardial infarction and coronary heart disease (mentioned in 99.8% of patients) complicated by heart failure. Finally, the Cardiac surgery cluster contains patients who suffer from ischemic heart disease (98.9% of the patients in this cluster contained complaints of myocardial ischemia, and 99.5% had complaints of coronary heart disease) and underwent cardiac surgery (as evidenced by concepts such as postpericardiotomy syndrome in 58.6%, surgical fistula in 40.2%, and wound healing in 23.7% of patients). Additionally, high prevalence of concepts such as central venous pressure finding and pulmonary artery pressure indicate the placement of an arterial line, indicating a surgical or other high acuity setting. This interpretation was confirmed via service codes, which reveal that 53.5% of these patients received coronary artery bypass grafting (CABG).
Non-ischemic heart disease. Etiologies with high morbidity and mortality. We also observe clusters with other well-known non-ischemic etiologies of heart failure, for example Atrial fibrillation [36] and Heart valve disease [37,38]. These clusters occur within the same dendrogram branch as Decompensated CHF and Dilated cardiomyopathy. These clusters share the common characteristic that they are associated with high morbidity and mortality. Decompensated CHF, Dilated cardiomyopathy, and Atrial fibrillation are the only clusters in which mentions of decompensation occur significantly more frequently than in other clusters (prevalence of 63.3%, 19.5%, and 24.3% of patients, respectively), which is also reflected in the fact that these clusters group together within the dendrogram; Decompensated CHF and Aortic valve disease patients have high in-hospital mortality rates of 14.5% and 3.5%, respectively.
Decompensated CHF is the cluster with the highest in-hospital mortality rate.
Figure 2. Exemplary 2D visualization of the relative distances between all patients' EHRs in the heart failure cohort using t-SNE. Colors show cluster assignment using K-means clustering (K = 15). Each cluster is shown with an interpretable name defining the heart failure phenotype.
Patients with dilated cardiomyopathy (DCM) were well clustered, with 97.2% of patients in the Dilated cardiomyopathy cluster also containing the complaint (as compared to 6.16% of patients outside the cluster), and 93.9% of patients receiving the I42 ICD-10 code for cardiomyopathy. These patients also exhibited typical complaints for DCM, including dyspnea (90.6% of patients); edema (67.0%) and swelling (41.2%); valve insufficiencies such as mitral valve insufficiency (82.8%) and tricuspid valve insufficiency (76.0%); arrhythmias, including ventricular tachycardia (37.7%), premature ventricular contractions (67.3%), interventricular desynchrony (18.9%), and atrial fibrillation (39.8%), among others; and pulmonary embolism (17.2%) [41]. Although incidence of DCM is typically biased towards men [42], there is a disproportionately low proportion of females in the cohort (21.6%). This was hypothesized to be due to the Russian origin of the dataset, where rates of alcohol consumption in males and corresponding alcoholic cardiomyopathy are high [43,44]. Subsequent analysis identified EHR templates documenting lifestyle risk factors; within the DCM group, 33.12% of the patients had documentation of alcohol consumption as a bad habit, compared to 1.7% of patients outside of the DCM cluster. Thus, this cluster can also be interpreted as Alcoholic cardiomyopathy.
The Heart valve disease cluster contains patients with high prevalence of heart valve disease, including aortic valve disease, aortic valve stenosis, heart valve disease, mitral valve insufficiency, tricuspid valve insufficiency, and chronic rheumatic heart disease, among others. Findings of central venous pressure finding, cardiac index, and cardiac activity, and wound healing, as well as complaints such as postpericardiotomy syndrome, show that these [...]
Figure 3. (A) [...] is the Jaccard index between cluster assignment for cluster i for K = K1 (e.g., 15) and cluster j for K = K2 (e.g., 8). Branches are labeled using a clinical interpretation of the hierarchical structure of the clusters (see discussion). (B) t-SNE plots showing cluster assignment for K in [2, 4, 8, 15], which are marked with black arrows in (A).
Female heart failure. Within Fig. 3A, there is a branch containing clusters labeled Hypertensive heart disease and Cerebrovascular disease. In these clusters, we observe other well-known comorbidities and etiologies of heart failure including hypertension [45,46] and vascular disease (including cerebrovascular disease and stroke) concomitant with conditions such as diabetes and chronic kidney disease (CKD) [47]. Both clusters have a high percentage of female patients (72.3% in Hypertensive heart disease and 58.2% in Cerebrovascular disease).
Patients within the Hypertensive heart disease cluster have the highest BMI of any cluster, with a median value of 30.1, as well as text mentions of complaints such as hyperlipidemia (43.1% of patients) and obesity (46.4% of patients). The high fraction of females is supported by literature, as progression of hypertension as the primary etiology of heart failure is up to 50% more common in women than men [45]. The Cerebrovascular disease cluster contains the oldest patients (median age 67.3); within this cluster complaints include mentions of vascular conditions such as vascular diseases (29.8% of patients), cerebral atherosclerosis (40.1%), peripheral arterial disease (13.7%), ischemic stroke (14.3%), cerebral infarction (10.0%), and cerebrovascular accident (65.5%); neurological symptoms such as encephalopathies (67.2%), nystagmus (34.7%), and dysarthria (52.5%); common complications of stroke such as hemiparesis (12.9%); and chronic conditions including diabetes mellitus (46.1%), diabetic polyneuropathies (21.0%), chronic kidney diseases (17.8%), and diabetic nephropathy (14.4%). Together, the predominance of overweight or obese female patients with a heavy burden of comorbid conditions is consistent with characteristics of HFpEF [48-50].
Validating and discovering relations between HF patient complaints. The previous sections show that the proposed clustering framework recovers well-known findings. To further quantify these interpretations, we used the concept association score a_ij to quantify how frequently concepts co-occur for concept pairs where both concepts are significantly associated with the same cluster, as well as for concept pairs that are significantly associated with different clusters. Our analysis shows that the concept association scores within clusters are higher than those between clusters (mean 0.3016 vs. 0.1143, p = 2.00 × 10^-168, t-test).
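The within-cluster versus between-cluster comparison can be illustrated with a short sketch. It assumes that the association score for a pair of concepts is the Jaccard index of the sets of patients who mention each concept; the text names the Jaccard index but does not spell out the sets, so this reading, the function names, and the toy data are assumptions. Assigning each concept to a single cluster is a further simplification.

```python
from itertools import combinations


def association(mentions, a, b):
    """Jaccard index of the sets of patients mentioning concepts a and b."""
    pa, pb = mentions[a], mentions[b]
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def within_between(mentions, top_concepts):
    """Split pairwise association scores by whether both concepts share a cluster.

    mentions: dict concept -> set of patient ids mentioning it.
    top_concepts: dict cluster_name -> list of its top concepts.
    """
    owner = {c: cluster for cluster, cs in top_concepts.items() for c in cs}
    within, between = [], []
    for a, b in combinations(sorted(owner), 2):
        target = within if owner[a] == owner[b] else between
        target.append(association(mentions, a, b))
    return within, between


# Toy patient sets, not real data.
toy_mentions = {
    "angina": {1, 2, 3},
    "myocardial ischemia": {1, 2, 3, 4},
    "stroke": {5, 6},
    "dysarthria": {5, 6, 7},
}
toy_top = {
    "ischemic": ["angina", "myocardial ischemia"],
    "cerebrovascular": ["stroke", "dysarthria"],
}
within, between = within_between(toy_mentions, toy_top)
```

The two resulting lists of scores are what a two-sample test, such as the t-test reported above, would then compare.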
Furthermore, these results are replicated in PubMed, where we observe the same pattern (mean 0.0341 vs. 0.0054, p = 1.58 × 10^-23, t-test). Table 3 shows concept association scores for selected concept pairs. In several cases, we observe that the association score is lower in PubMed than in our data. This suggests that our method can be an approach to gather evidence for or discover lesser known associations between medical concepts. One such example is the co-occurrence of decompensated heart failure and lymphadenopathy. The association of hilar and mediastinal lymphadenopathy as a finding of decompensated heart failure has been established but not well studied [51-53]; as of 2016, the largest study on the association between acute heart failure and lymphadenopathy contained a cohort of 215 patients (original cohort of 500 HF patients, with 285 excluded for lack of CT scans or possible confounding diagnoses that can cause lymphadenopathy), of which 68% exhibited CT signs of lymphadenopathy [54]. Within the cluster of "Decompensated CHF", 1368 patients had mentions of decompensation, with 486 of these patients (35.5%) having mentions of lymphadenopathy, mediastinal lymphadenopathy, or hilar lymphadenopathy (see Supplementary Table S2), all of which were significantly associated with the cluster. This result shows the power of RWD analysis to provide additional support for furthering our understanding of HF and its pathophysiology, which may aid in differential diagnosis and improved quality of care (e.g., reduction in unnecessary lymph node biopsies).
Table 3. Exemplary association scores between pairs of medical concepts that co-occur within cluster phenotypes from Table 2. Comparable association scores within the HF cohort and the scientific literature (PubMed) indicate that co-occurrences are already known. Significantly higher association scores in the HF cohort indicate potentially novel associations.

Discussion
In this study, we presented a novel and data-driven approach to constructing HF phenotypes based on the real-world disease manifestation found in the unstructured clinical notes of a large EHR database. These phenotypes were discovered by utilizing an NLP-based information extraction and unsupervised clustering approach to group HF patients into subcohorts. For K = 15, we interpreted each subcohort by employing a statistical testing methodology in which we found medical concepts and complaints that are overrepresented in each group. We also characterized descriptive statistics of each subcohort, including demographic information such as age and sex, BMI, in-hospital mortality rates, and most frequent diagnosis codes. Additionally, we used clustering at different values of K, which revealed the hierarchy of HF phenotypes. Finally, after finding the significant concepts and descriptive statistics for each cluster, we provided a medical analysis and found that subcohorts correspond to clinically meaningful etiologies and endpoints of heart failure.
Clinical notes are a rich source of patient information but are underutilized due to the major challenges involved in extracting and normalizing medical concepts found in unstructured free text. Here we found that hierarchical, data-driven phenotypes for heart failure could be constructed solely by clustering of complaints (disease and symptom mentions) extracted from unstructured notes. The resulting HF phenotypes are clinically informative with respect to comorbid conditions, symptoms, and other complaints that may be missing from traditional HF classifications, providing a more complete picture of a patient's disease state. These results are illuminating with regard to both the etiology and severity of HF across the cohort and provide a snapshot of disease characteristics across a large population.
In contrast to top-down approaches that use predefined criteria to classify HF disease states, this unsupervised approach using unstructured clinical notes offers the ability to uncover HF patient subgroups across multiple scales of medical concept granularity based on real-world data. Such an approach is data-driven and flexible: because subcohorts are discovered from the similarity of patients' complaints, there is no need to specify complex inclusion/exclusion criteria a priori; instead, dominant patterns emerge from the data itself.
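One simple way to relate subgroups found at two scales is to link each fine-grained cluster to the coarse cluster that contains most of its patients. The sketch below shows this majority-overlap linking; the function name `link_levels` and the toy labels are hypothetical, and this is not necessarily how the hierarchy was derived in this study.

```python
from collections import Counter, defaultdict


def link_levels(coarse_labels, fine_labels):
    """Map each fine cluster to the coarse cluster holding most of its patients.

    Both inputs give one cluster label per patient, in the same patient order.
    Returns {fine_cluster: (coarse_cluster, fraction_of_fine_cluster_in_it)}.
    """
    if len(coarse_labels) != len(fine_labels):
        raise ValueError("label lists must describe the same patients")
    overlap = defaultdict(Counter)
    for coarse, fine in zip(coarse_labels, fine_labels):
        overlap[fine][coarse] += 1
    links = {}
    for fine, counts in overlap.items():
        parent, shared = counts.most_common(1)[0]
        links[fine] = (parent, shared / sum(counts.values()))
    return links


# Hypothetical labels for eight patients clustered at K = 2 and K = 4.
coarse = [0, 0, 0, 0, 1, 1, 1, 1]
fine = [0, 0, 1, 1, 2, 2, 3, 0]
links = link_levels(coarse, fine)
```

A fraction close to 1 means the fine cluster is nested almost entirely inside its parent; lower fractions flag subgroups that straddle coarse clusters and may deserve closer inspection.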
Potential applications. Development of more effective methods for understanding heart failure in its various clinical manifestations, its symptoms, and their management is vital to improving treatment strategies and ultimately the quality of life of HF patients. The ability of our approach to automatically reveal patterns of real-world disease manifestations can aid in understanding complex syndromes like HF and the phenotypic heterogeneity in its patient populations. Importantly, we demonstrate that our methodology is able to produce an automated and scalable understanding of a large population of HF patients using a health system's routinely collected clinical data, which can serve as a foundation for practice-based medicine in which real-world insights relevant to a patient can be generated and provided to a clinician at the point of care 55 . For example, it is challenging for providers to understand which care regimens should be used in each patient subpopulation, particularly when dealing with older, chronic disease patients with multiple comorbidities. Phenotype clusters built from retrospective data can be a powerful tool to drive better treatment decisions; through analysis of which treatments have been successful for a given cluster in the past, providers may gain insight into which care regimens would have the best chance of success for new patients who map to existing clusters, thus enabling cluster-specific personalization of care.
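Mapping a new patient to an existing cluster could, for instance, be done by comparing the patient's complaint vector with each cluster centroid. The following sketch assumes cosine similarity and precomputed centroids; the similarity measure, the function names, the cluster names, and the three-dimensional vectors are all hypothetical and chosen only to keep the example small.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_cluster(patient_vector, centroids):
    """Return the name of the centroid most similar to the patient vector."""
    return max(centroids, key=lambda name: cosine(patient_vector, centroids[name]))


# Hypothetical centroids over three complaint dimensions.
centroids = {
    "ischemic": [0.9, 0.1, 0.0],
    "renal": [0.1, 0.8, 0.3],
}
new_patient = [0.7, 0.2, 0.1]
assigned = assign_cluster(new_patient, centroids)
```

Any such assignment would only be a starting point: whether historical treatment outcomes in the matched cluster transfer to the new patient would still require clinical judgment and validation.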
Analogously, this technique can be applied on large cross-provider patient populations for epidemiology or health economics and outcomes research. Having phenotypes that more accurately reflect disease manifestations in real patient populations can improve the precision of disease burden assessments, which in turn can help healthcare practitioners or policy makers understand likely outcomes for large segments of the population and better perform resource allocation.
Finally, the patient representations built using this method present a unique opportunity to extract insights that can be shared between hospitals, because they extract high-level complaints without using patient identifiers. Additionally, because the method utilizes a clinical NLP system that extracts language-independent medical concepts from clinical text, such an approach can allow for scalable comparison of patient populations across different regions and languages without building word or term mappings to standard controlled terminologies, which is often prohibitively time-consuming in practice.
Limitations and future directions. Although a relatively large number of patient records were used in this study (n = 25,952), it remains to be determined whether the HF phenotypes reported here will remain comprehensive across larger patient populations and geographies; this is a direction for future study. Additionally, results in this study were generated using complaints extracted from clinical text in the EHR without using (1) any structured data or (2) unstructured data corresponding to clinical interventions (e.g., medications, procedures) or numerical value extraction for lab and imaging measurements. Future avenues of research can explore utilizing structured sources of information in the EHR (e.g., diagnosis codes, labs) to enrich or further inform cluster phenotypes. Additionally, our general approach can be supplemented with other healthcare data sources, including other regularly collected information (e.g., administrative data or claims) or data sources used in precision medicine, if they are available (e.g., omics data). While the current approach identifies HF phenotypes by clustering on aggregated complaints extracted from entire patient timelines, an important direction for future research is to (1) analyze the progression of HF patients and the evolution of their disease state over time, and (2) study the interplay between phenotype, clinical interventions, and ultimately patient outcomes. Finally, in the current study we demonstrate the viability of data-driven phenotypes in heart failure, but the approach is condition-agnostic and can be easily applied to other disease areas in the future.