Background & Summary

Heart failure (HF) affects over 6 million people in the United States, with an estimated incidence of 21 per 1000 people in the elderly people1. By using mathematical prediction models, heart failure is estimated to affect 8 million people over the age of 182. Heart failure is one of the most important reasons for hospitalization among elderly individuals and is also associated with significant mortality and morbidity. It has been reported that the mortality ranges from 20% to 60% one year after hospitalization for acute HF3,4,5,6, depending on comorbidities and coexisting medical conditions. Many cohort studies have been carried out for epidemiological investigations of hospitalized patients with HF. For instance, the Cleveland Heart Disease dataset, which contains 75 variables for 303 patients, is mainly used for practising machine learning algorithms7. The Nationwide Inpatient Sample (NIS) is a publicly available database from the Healthcare Utilization Project (HCUP) that is supported by the Agency for Healthcare Research and Quality. This dataset contains many patients with heart diseases, but the variables/attributes included in this dataset are not specifically designed for HF8,9,10. The Medical Information Mart for Intensive Care (MIMIC) database contains data associated with >60,000 distinct hospital admissions to critical care units between 2001 and 2012. Many of the patients have a HF diagnosis, and thus MIMIC is a good resource for testing research hypotheses related to critically ill HF patients11. However, these studies are either designed with attributes that are limited in number or not specific for HF. In other words, data collection was dictated by expert knowledge, and only variables deemed important were entered into the data collection form. The determination of feature variable inclusion/exclusion is largely driven by expertise and previous studies. Such a dataset can be used only to address a limited number of clinical questions. For example, the ESC-Heart Failure Association (HFA) EURObservational Research Programme (EORP) generated a large dataset that contained specifically HF patients, and a large amount of data that are routinely collected during clinical practice were abandoned. To the best of our knowledge, this is the largest HF dataset in the world, including 337 cardiology centres from 33 ESC Member countries12. In essence, many trivial attributes may work together to influence the clinical outcome. Thus, a dataset including all aspects of individual patient-level data can help disentangle complex relationships among attributes. In the era of big data, the electronic healthcare records are able to produce a large amount of data related to a given HF patient. These multiparameter relational databases may or may not be related to a given research question. Different studies and analyses require different variables. Making such a publicly available dataset can help to encourage data reuse, thereby promoting more medical knowledge discovery.

Our study aimed to establish a HF database based on electronic healthcare records. Data on subsequent hospital admissions and mortality were obtained at mandatory follow‐up visits at 28 days, 3 months and 6 months (if the patient was unable to reach the clinical centre, the follow‐up visit was replaced by a telephone call). The study was a retrospective study enrolling hospitalized patients with heart failure from December 2016 to June 2019. Patients were enrolled from Zigong Fourth People’s Hospital. Data were extracted from electronic healthcare records. However, this is a single-centre dataset, covering only Chinese patients. Findings with these data alone may not have convincing generalizability. Researchers may combine this dataset with other heart failure cohort data for a larger-scale study.


Study setting and population

The study was conducted at Zigong Fourth People’s Hospital, Sichuan, China from December 2016 to June 2019, and was approved by the ethics committee of Zigong Fourth People’s Hospital (Approval Number: 2020-010). Informed consent was waived due to the retrospective design of the study. The study complies with the Declaration of Helsinki.

Electronic healthcare records of consecutive patients with a diagnosis of HF were reviewed. We included all types of heart failure including acute HF, chronic HF, left HF, right HF, or a mixture of all. Heart failure was defined according to the European Society of Cardiology (ESC) criteria13:

  1. 1)

    The presence of symptoms and/or signs of HF. Typical symptoms include breathlessness, orthopnoea, paroxymal nocturnal dyspnea, reduced exercise tolerance, fatigue, tiredness, increased time to recover after exercise and ankle swelling. Typical signs include elevated jugular venous pressure, hepatojugular reflux, third heart sound (gallop rhythm) and laterally displaced apical impulse.

  2. 2)

    Elevated levels of BNPs (BNP >35 pg/mL and/or NT‐proBNP >125 pg/mL)

  3. 3)

    Objective evidence of other cardiac functional and structural alterations underlying HF.

  4. 4)

    In case of uncertainty, a stress test or invasively measured elevated LV filling pressure may be needed to confirm the diagnosis.

Patients who had a diagnosis of heart failure on hospital admission were enrolled in our study. The diagnosis was recorded with ICD- 9 in the EHR (Table 1).

Table 1 ICD-9 code for the diagnosis of heart failure.

Variables and attributes

Data collected for the dataset included three broad categories: demographic data, baseline clinical characteristics, comorbidities, laboratory findings, drugs and outcomes. Demographic data were entered manually into the EMR system by the nurses on admission if a patient first visited our hospital. Otherwise, demographic data could be automatically extracted from previous visits. Some missing or error data were checked if they were identified by the nurses. To ensure the accuracy and consistency of data entry, a drop-down list was used for some variables in our EMR system, such as sex, department of admission and occupation. Laboratory tests and drugs were electronically entered by physicians and/or lab workers. Data in the EMR were extracted by SQL query to establish the current database. The accuracy of the SQL query was then checked manually by randomly selecting 50 patients. Many data items were recorded in Chinese in the electronic healthcare record database, thus the largest challenge is the language barrier. All the lab test items, examinations, drug names and diagnoses were recorded in Chinese in the electronic healthcare record database. To address this problem, all Chinese terms were translated to English by the principal investigators (Z.Z., P.X. and L.C.).

The demographic data were obtained from the first sheet of the medical records and included age, sex, height, body weight, admission ward, type of admission (emergency vs. nonemergency), occupation, discharge department, admission date, visit times, and marital status.

Baseline clinical characteristics were measured on the day of hospital admission and included body temperature, pulse, respiration rate, systolic blood pressure, diastolic blood pressure, mean arterial blood pressure, weight, height, body mass index (BMI), type of heart failure, New York Heart Association (NYHA) cardiac function, Killip Grade (Class 1 No rales, no 3rd heart sound; Class 2 Rales in <1⁄2 lung field or presence of a 3rd heart sound; Class 3 Rales in >1⁄2 lung field–pulmonary oedema; Class 4 Cardiogenic shock–determined clinically), and Glasgow Coma Scale (GCS) score. Echocardiographic findings included left ventricular ejection fraction (LVEF), left ventricular end diastolic diameter, mitral valve peak E wave velocity (m/s), mitral valve peak A wave velocity (m/s), E/A, tricuspid valve regurgitation velocity, and tricuspid valve regurgitation pressure.

Comorbidities included a medical history of myocardial infarction, congestive heart failure, peripheral vascular disease, cerebrovascular disease, dementia, chronic obstructive pulmonary disease (COPD), connective tissue disease, peptic ulcer disease, diabetes, moderate-to-severe chronic kidney disease, hemiplegia, leukaemia, malignant lymphoma, solid tumour, liver disease and AIDS. The Charlson Comorbidity Index (CCI) was calculated by summing all comorbidity points described above14. A minority of patients were not coded as having a diagnosis of “congestive heart failure” in the comorbidity list because they did not have a past history of congestive heart failure on admission. They were diagnosed with HF for the first time in the index hospitalization. The comorbidities were taken from the admission notes.

Laboratory findings were obtained from day one of hospital admission, including serum creatinine, urea, uric acid, glomerular filtration rate, cystatin, white blood cell count, monocyte ratio, monocyte count, red blood cell count, coefficient of variation of red blood cell distribution width, standard deviation of red blood cell distribution width, mean corpuscular volume, haematocrit, lymphocyte count, mean haemoglobin volume, mean haemoglobin concentration, mean platelet volume, basophil ratio, basophil count, eosinophil ratio, eosinophil count, haemoglobin, platelet, platelet distribution width, platelet haematocrit, neutrophil ratio, neutrophil count, D-dimer, international normalized ratio, activated partial thromboplastin time, thrombin time, prothrombin activity, prothrombin time ratio, fibrinogen, high sensitivity troponin, myoglobin, carbon dioxide binding capacity, calcium, potassium, chloride, sodium, inorganic phosphorus, serum magnesium, creatine kinase isoenzyme to creatine kinase, hydroxybutyrate dehydrogenase to lactate dehydrogenase, hydroxybutyrate dehydrogenase, glutamic oxaloacetic transaminase, creatine kinase, creatine kinase isoenzyme, lactate dehydrogenase, brain natriuretic peptide, high sensitivity protein, nucleotidase, fucosidase, albumin, albumin/globulin ratio, cholinesterase, glutamyltranspeptidase, glutamic pyruvic transaminase, glutamic oxaliplatin, indirect bilirubin, alkaline phosphatase, globulin, direct bilirubin, total bilirubin, total bile acid, total protein, erythrocyte sedimentation rate, cholesterol, low-density lipoprotein cholesterol, triglyceride, high-density lipoprotein cholesterol, homocysteine, apolipoprotein A, apolipoprotein B, lipoprotein, pH, standard residual base, standard bicarbonate, partial pressure of carbon dioxide, total carbon dioxide, methemoglobin, haematocrit blood gas, reduced haemoglobin, potassium ion, chloride ion, sodium ion, glucose blood gas, lactate, measured residual base, measured bicarbonate, carboxyhemoglobin, body temperature blood gas, oxygen saturation, partial oxygen pressure, oxyhemoglobin, anion gap, free calcium, and total haemoglobin.

Primary drug categories included in our dataset were diuretics, inotropes, and vasodilators. The diuretics included furosemide, torasemide and spironolactone. Inotropes included deslanoside, dobutamine, digoxin, isoprenaline and milrinone. Vasodilators included isosorbide mononitrate and nitroglycerin.

Outcome variables included discharge date of the index hospital, vital status at hospital discharge, death within 28 days, readmission within 28 days, death within 3 months, readmission within 3 months, death within 6 months, readmission within 6 months, time to death (days from index hospital admission), time to readmission (days from index hospital admission), return to emergency department within 6 months, and time to visit emergency department within 6 months. The variable “DestinationDischarge” was recorded after hospital discharge, and the variable “outcome.during.hospitalization” was recorded after the decision to discharge was made.

Data Records

The study generated a single dataset, that contained information on 166 attributes of 2008 hospitalized patients from December 2016 to June 2019. The dataset is available at PhysioNet ( Missing values are indicated with blanks. Detailed information on variable specifications is included in a variable description file.

Technical Validation

The present study was a retrospective design. Information on eligible patients was collected at Zigong Fourth People’s Hospital. First, the required data were exported from the electronic healthcare database with the assistance of the information technology technician. The exported data were then checked by expert emergency and critical care physicians; if outliers in each variable and contradictions within data were detected, data were validated by another investigator. The outliers and contradictions were judged by expert emergency and critical care physicians. Data on subsequent hospital admissions and mortality were obtained at mandatory follow‐up visit at 28 days, 3 months and 6 months (if the patient was unable to reach the clinical centre, the follow‐up visit was replaced by a telephone call).

Data were finalized and fully anonymized on June 8, 2020.

Baseline characteristics of included patients

The overall mortality rate at hospital discharge was 1% (14/2008). A total of 212 patients were discharged to unknown places (212/2008, 11%), 1344 patients were discharged home (67%) and 438 patients were discharged to healthcare facilities (22%). Most patients were admitted to the department of cardiology (1547/2008, 77%), followed by the general ward (265/2008, 13%), others (181/2008, 9%) and the ICU (15/2008, 1%). There was also a significant difference between emergency and nonemergency patients (Online-only Table 1). The distributions of the baseline characteristics are shown in Fig. 1, Fig. 2 and Online-only Table 1.

Fig. 1
figure 1

Histogram showing the distribution of numeric attributes at baseline.

Fig. 2
figure 2

Bar chart showing the distribution of discrete attributes at baseline.