Background & Summary

Critically ill patients managed in the intensive care unit (ICU) are usually monitored closely for organ dysfunction and treated intensively with a variety of supportive modalities1,2. Vital signs, laboratory tests, and medical treatments are recorded at a higher frequency than for patients treated in the general ward. Such daily intensive management produces a large amount of information, including medical orders, imaging studies, laboratory findings, and waveform signals. The data generation mechanisms may reflect key factors related to the healthcare system, the pathophysiology of the underlying disease, and patients’ preferences and cultures3. Thus, in-depth mining of such large databases, through risk factor analysis, predictive analytics, and causal inference4,5,6, can provide more insight into clinical research questions. Translating the knowledge gained from data mining into clinical practice may improve clinical outcomes7,8.

Most published reports in the critical care research community do not make their original raw data freely accessible, partly because of confidentiality issues. This unwillingness to share data makes it difficult to reproduce the reported results. Furthermore, the exploration of such a large database by a single research group could be biased and limited. Thus, strenuous efforts have been made to encourage the scientific community to share raw data, an aim that is also supported by the open data campaign9,10. Several openly accessible critical care databases have been established, mainly reflecting the healthcare systems of western countries11,12,13. China is a large country with a huge patient population. For example, there were an estimated 3 million incident sepsis cases in 2017, accounting for nearly 10% of global incident cases14. Chinese hospitals also have hospital information systems that are distinct from those of western countries. However, these systems are mainly used for clinical practice and are far less developed for research purposes. Data sharing is still in its infancy in the Chinese critical care community, which significantly impairs the transparency of scientific work and international collaboration. To the best of our knowledge, two critical care databases are being established in China, focusing on pediatric critically ill patients and on patients with infections15,16. Here, we report the establishment of a large critical care database comprising high-granularity data generated from the information system of a tertiary care university hospital. Details of the database are reported in this paper to encourage new research through secondary analysis of the database.

Methods

Study setting and population

The study was conducted in Zhejiang Provincial People’s Hospital, Zhejiang, China from January 2012 to May 2022. All patients admitted to the ICU of the hospital were eligible. There were two ICUs in the hospital: one was the comprehensive central ICU and the other was the emergency ICU (EICU). There was no exclusion criterion in enrolling subjects because we believed that patients who were excluded by a particular study might be eligible for another study. Thus, we included all records in the information system related to ICU stays. The study was approved by the ethics committee of Zhejiang Provincial People’s Hospital (approval number: QT2022185). Informed consent was waived as determined by the institutional review board, due to the retrospective design of the study. The study was conducted in accordance with the Declaration of Helsinki.

Database structure and development

The database is distributed as comma-separated value (CSV) files that can be imported into any relational database system. Each file contains a single table, which is explained in the subsequent sections. Each individual subject is identified by a serial number (patient_SN) composed of digits and letters, such as “3c74cf74c36241b7082ec35e458279dc”. Each hospital stay is denoted by a Hospital_ID, with examples such as “9432117” and “336688072433”. All tables use Hospital_ID to identify an individual hospital stay. ICU stays are not given a separate identifier; they can be determined from the HospitalTransfer table, which contains the intrahospital transfer events for each subject and thereby links ICU stays to the same patient and/or hospitalization.

We recommend the R package tidyverse for the management of the relational database because of its capability to streamline the workflow from data management to statistical analysis and to the training of machine learning models17. For large files, we recommend the data.table package to process the tabular data.
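As an illustration of how the tables link through Hospital_ID, the following is a minimal sketch written with the Python standard library rather than the R packages recommended above. The two in-memory CSV fragments stand in for the patient admission record and Diagnosis tables; the patient_SN values and ICD-10 codes are invented for the example, and real files would contain many more columns.

```python
import csv
import io
from collections import defaultdict


def read_table(handle):
    """Read one CSV table into a list of dict rows keyed by column name."""
    return list(csv.DictReader(handle))


def group_by_hospital_id(rows):
    """Index rows by Hospital_ID so tables can be linked per hospital stay."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["Hospital_ID"]].append(row)
    return grouped


# Tiny in-memory stand-ins for two tables; all values are invented.
admissions_csv = io.StringIO(
    "patient_SN,Hospital_ID\n"
    "a1,9432117\n"
    "b2,336688072433\n"
)
diagnosis_csv = io.StringIO(
    "Hospital_ID,ICD10_code\n"
    "9432117,I10\n"
    "9432117,I50.9\n"
    "336688072433,J18.9\n"
)

admissions = read_table(admissions_csv)
diagnoses = group_by_hospital_id(read_table(diagnosis_csv))

# Left join: every admission keeps its (possibly empty) list of diagnosis codes.
joined = {
    adm["Hospital_ID"]: [d["ICD10_code"] for d in diagnoses.get(adm["Hospital_ID"], [])]
    for adm in admissions
}
print(joined)
```

For the full-size files, a data frame library or a relational database would likely be more practical than plain dictionaries; the join key is the same.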

Deidentification

All tables are deidentified according to the Health Insurance Portability and Accountability Act (HIPAA). All protected information is removed, including addresses, date of birth, date of hospital admission, date of discharge, date of medical order, personal identifiers (e.g. name, phone number, social security number, and hospital number), and exact age on admission (age is discretized into bins). When the dataset was created, patients and hospital stays were randomly assigned unique identifiers (patient_SN and Hospital_ID) and the original hospital identifiers were not retained. As a result, the identifiers in the database cannot be linked back to the original, identifiable data. All doctor, nurse, and pharmacist identifiers have also been removed to protect the privacy of contributing providers.

Data Records

The database comprises 8180 unique hospital admissions for 7638 individual patients from January 2012 to May 2022 and is available at the PhysioNet repository18. Table 1 shows the baseline demographics of the hospital admissions. Of the 8180 admissions, 2965 involve female patients and 5215 involve male patients. The median length of hospital stay was 17 days (Q1 to Q3: 10 to 28). Male patients had a slightly longer hospital stay.

Table 1 Demographics and discharge status of the 8180 hospital admissions in the database.

The number of hospital admissions for ICU patients increased markedly after 2018 because the number of beds in both the comprehensive ICU and the emergency ICU was expanded that year (Fig. 1). The distributions of hospital length of stay (LOS) are shown in Fig. 2, restricted to patients with an LOS <60 days.

Fig. 1 The number of admissions from the year 2012 to 2022.

Fig. 2 The distributions of hospital length of stay in male and female patients.

We then grouped specific diagnoses into 31 categories to explore the characteristics of the population in the dataset19. The co-occurrences of the diseases are shown in Fig. 3. Pulmonary diseases are among the most common reasons for admission, followed by congestive heart failure (CHF). CHF usually coexists with valvular disorders, and pulmonary diseases usually coexist with cardiac arrhythmia (Fig. 3). Figure 4 shows the frequency of these diseases. Hypertension is among the most frequent diseases in the study population, followed by CHF and arrhythmia.

Fig. 3 The co-occurrence network shows the frequency of diagnosis categories in the dataset. The size of each circle represents the number of diagnoses, and the transparency of each line represents the frequency of co-occurrence. Abbreviations: PUD = Peptic ulcer disease; DM = Diabetes without chronic complication; DMcx = Diabetes with chronic complication; PVD = Peripheral vascular disorders; CHF = Congestive heart failure; HTN = Hypertension; HTNcx = Hypertension, complicated; PHTN = pulmonary hypertension; Mets = Metastatic solid tumor.

Fig. 4 Dot chart showing the frequency of commonly encountered diseases in the dataset. Abbreviations: PUD = Peptic ulcer disease; DM = Diabetes without chronic complication; DMcx = Diabetes with chronic complication; PVD = Peripheral vascular disorders; CHF = Congestive heart failure; HTN = Hypertension; HTNcx = Hypertension, complicated; PHTN = pulmonary hypertension; Mets = Metastatic solid tumor.

Classes of data

The data are organized into tables. There are a total of 17 tables comprising patient demographic data, medical orders, laboratory findings, imaging studies, microbiology and hospital transfer events (Table 2). The following sections describe each table in more detail to promote the reuse of the database.

Table 2 A general description of the tables in the database.

Patient admission record table

The patient admission record table describes the baseline patient demographics, past history, chief complaint, and length of stay in the hospital. The patient_SN is a unique ID for an individual patient and Hospital_ID is a unique ID for a hospital admission. If a patient was discharged or died within 24 hours, the data were recorded in a separate table in the source system, so there are separate columns describing the chief complaint and admission status for those short hospital stays. We provide both English and Chinese descriptions of the chief complaint. The present history recorded in the Med_history column is longer free text, and the original Chinese descriptions are kept so that natural language processing algorithms can be applied. The StatusOnDischarge variable includes several categories such as Cured, Not cured, Unknown and Dead. These categories are recorded as they appear in the original electronic system. The “Not cured” status refers to the situation in which a patient was discharged against medical advice and might have been transferred to a primary care service center or gone home for palliative care. The “Unknown” label is also entered by the clinicians and should be considered a separate type of status (Table 3).

Table 3 Variables in the patient admission record table.

Electronic medical record (First note table)

The FirstNote.csv table contains data related to the progress note recorded on the admission day (Table 4), which includes free text data such as the reasons for diagnosis, differential diagnosis and care plan. The diagnosis in this table is the initial diagnosis made on the day of admission and is subject to modification.

Table 4 Variables in the FirstNote table.

Progress note table

The progress note table (ProgressNote.csv) contains information on a variety of daily progress notes such as Daily course record, Blood transfusion record, and record for bedside procedures (Table 5).

Diagnosis table

The diagnosis table contains information related to the diagnoses for a hospital stay (Table 6). The Diagnosis_Desc column provides a free text description of the diagnosis. ICD10_code is the corresponding code from the International Classification of Diseases, 10th revision (ICD-10). The codes can be processed with the icd package in R (https://github.com/cran/icd). The functionality of the package includes, but is not limited to, finding comorbidities of patients based on ICD-10 codes, calculating Charlson and Van Walraven scores, and a comprehensive test suite to increase confidence in the accurate processing of ICD codes.
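For a quick overview of the ICD10_code column before any comorbidity mapping, the codes can be truncated to their three-character ICD-10 category and tallied. The sketch below uses the Python standard library and is not a substitute for the comorbidity functions of the icd package; the function names and the example codes are assumptions made for illustration.

```python
from collections import Counter


def icd10_category(code):
    """Return the three-character ICD-10 category, e.g. 'I50.9' -> 'I50'."""
    cleaned = code.strip().upper().replace(".", "")
    return cleaned[:3]


def count_categories(codes):
    """Tally three-character categories, skipping empty or blank codes."""
    return Counter(icd10_category(c) for c in codes if c and c.strip())


# Invented example values standing in for the ICD10_code column.
codes = ["I50.9", "i10", "I50.1", "", "J18.9"]
counts = count_categories(codes)
print(counts.most_common())
```

Truncation only groups codes by their leading category; grouping into clinically defined comorbidity classes still requires a published mapping.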

Hospital transfer table

The HospitalTransfer table contains information related to intrahospital transfer events (Table 7). The time and department of each transfer event are given in the respective columns. Each row represents one transfer event, including the department the patient leaves (TransferFrom_Dept_Eng) and the department the patient is transferred to (TransferTo_Dept_Eng). One episode of hospitalization may contain multiple transfer events. To protect patients’ privacy, all date and time information is recorded as days relative to hospital admission. Since the EICU is in the emergency department, the department names “Emergency medical department” and “Emergency Department” refer to the EICU.
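One possible way to derive ICU stays from these transfer events is sketched below with the Python standard library. Several names are assumptions rather than documented facts: the time column is called Transfer_Day here, the central ICU is assumed to be labelled “ICU”, and the example rows are invented. Only the two EICU department names and the TransferFrom_Dept_Eng and TransferTo_Dept_Eng columns come from the description above.

```python
import csv
import io
from collections import defaultdict

# "ICU" is an assumed label for the central ICU; the two emergency names
# are the EICU labels mentioned in the text.
ICU_DEPTS = {"ICU", "Emergency medical department", "Emergency Department"}


def icu_stays(rows, time_col="Transfer_Day"):
    """Return (Hospital_ID, start_day, end_day) tuples for each ICU stay.

    end_day is None when no transfer out of the ICU is recorded.
    time_col is a hypothetical column holding days relative to admission.
    """
    by_stay = defaultdict(list)
    for row in rows:
        by_stay[row["Hospital_ID"]].append(row)

    stays = []
    for hospital_id, events in by_stay.items():
        events.sort(key=lambda r: float(r[time_col]))
        icu_start = None
        for ev in events:
            t = float(ev[time_col])
            # Leaving an ICU department closes the open ICU stay.
            if ev["TransferFrom_Dept_Eng"] in ICU_DEPTS and icu_start is not None:
                stays.append((hospital_id, icu_start, t))
                icu_start = None
            # Entering an ICU department opens a new ICU stay.
            if ev["TransferTo_Dept_Eng"] in ICU_DEPTS and icu_start is None:
                icu_start = t
        if icu_start is not None:
            stays.append((hospital_id, icu_start, None))
    return stays


transfer_csv = io.StringIO(
    "Hospital_ID,Transfer_Day,TransferFrom_Dept_Eng,TransferTo_Dept_Eng\n"
    "9432117,0.5,Cardiology,ICU\n"
    "9432117,3.0,ICU,Cardiology\n"
    "9432117,7.25,Cardiology,Emergency Department\n"
)
stays = icu_stays(list(csv.DictReader(transfer_csv)))
print(stays)
```

An open-ended stay (end_day of None) would typically be closed with the discharge time from the admission record; that step is left out because the relevant column is not named in the text.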

Table 5 Variables in the ProgressNote table.
Table 6 Variables in the Diagnosis table.

Surgery information table

Surgical operation information is recorded in a separate table (SurgeryTab.csv). The table records the scheduled time of the operation and a description of the operation. The name of the operation can be extracted from the text descriptions (Oper_Scheduled). The medical order for a planned operation is usually prescribed 1 day prior to the operation. If the planned date has a negative value, the operation can be regarded as having been performed on the day of hospital admission (Table 8).

The Lab table

The lab table contains data related to the laboratory findings (Table 9). There are 11,082,482 laboratory records in the dataset, covering 214 types of laboratory items. Seventeen types of samples are tested: whole blood, plasma, urine, serum, arterial blood, stool, venous blood, catheter orifice, ascites, bile, dialysate, CK blood sample (kaolin-activated TEG channel), cerebrospinal fluid, bone marrow, deep venous catheter, sputum, and gastric juice. The sample collection time is recorded in days relative to the hospital admission time. The Lab_category column may contain missing values for the following reasons: (1) the laboratory category is missing for some laboratory items that are derived from other values, such as INR, Urea: creatinine, and Arterial alveolar oxygen partial pressure ratio; (2) some laboratory items are exported from bedside point-of-care machines, such as troponin and blood gas items in an acute care setting, and their laboratory category is not integrated into the laboratory system; and (3) some values are not directly assayed by the machine, such as the fraction of inspired oxygen (FiO2) and the prothrombin time control. Since the missing information in the laboratory category will not influence the research outcome, we did not populate these missing cells.
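Users who want to see which laboratory items lack a category can tally the rows with an empty Lab_category. The following is a minimal sketch using the Python standard library; the item-name column is assumed to be called Lab_item, and the rows and the “Biochemistry” label are invented for the example.

```python
import csv
import io
from collections import Counter


def missing_category_counts(rows, item_col="Lab_item"):
    """Count rows per lab item whose Lab_category is empty or blank.

    item_col is a hypothetical name for the column holding the item name.
    """
    missing = Counter()
    for row in rows:
        if not (row.get("Lab_category") or "").strip():
            missing[row[item_col]] += 1
    return missing


# Invented rows; only the Lab_category column name comes from the text.
lab_csv = io.StringIO(
    "Hospital_ID,Lab_item,Lab_category\n"
    "9432117,INR,\n"
    "9432117,Sodium,Biochemistry\n"
    "9432117,FiO2,\n"
    "336688072433,INR,\n"
)
missing = missing_category_counts(csv.DictReader(lab_csv))
print(missing.most_common())
```

Because the Lab file holds more than 11 million records, the function consumes the reader row by row instead of loading the whole table into memory.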

The Lab dictionary

To facilitate the use of the Lab table, we generated a lab dictionary table (Table 10) which included the unique names of lab items and the lab category.

Microbiology culture table

The MicrobiologyCulture table contains information related to microbiology culture results (Table 11). Conventional information regarding sample, culture finding, culture time and description of microbiology culture are provided in the table.

Table 7 Explanation for variables in the HospitalTransfer table.
Table 8 Explanation for variables in the SurgeryTab table.
Table 9 Explanations for variables in the Lab table.
Table 10 Dictionary for laboratory items.
Table 11 Explanation for variables in the Microbiology culture table.

Drug sensitivity table

The DrugSens table contains information related to the drug sensitivity of cultured bacteria (Table 12). Conventional information including sample, microorganism, culture time, and drug name is available in the table. The negative and positive values in the DrugSens_result column refer to the results of the extended-spectrum β-lactamase (ESBL) test or the D-test.

Table 12 The explanation for variables in the Drug sensitivity table.

Examination report table

The ExamReport table contains information related to a variety of medical examinations, including computed tomography (CT), X-ray and ultrasound (Table 13). The images are not available in the current dataset; instead, we include the free text descriptions and conclusions for these examinations.

Table 13 Explanation for variables in the ExamReport table.

Medical order table

The MedOrder table contains information related to the medical order prescribed by clinicians (Table 14). The table provides both regular and stat medical orders (MedOrder_Type). The contents of the medical order can be found in the MedOrder_DESC column.

Table 14 Explanation for variables in the MedOrder table.

Medication table

The medication table provides data on the medication orders prescribed by clinicians (Table 15). This table is designed specifically for medication orders, containing columns for drug dose, frequency, unit of drug dose and route of administration.

Table 15 The explanation for variables in the Medication table.

Medication dictionary

The Medication_Dictionary table lists the unique medication names, so that records for a given medication can be located easily. We provide a DrugName column in which users can look up unified pharmaceutical names irrespective of specification, formulation, and route of administration. For example, to extract sodium chloride injection, users can look for sodium chloride in the DrugName column. Alternatively, users may search the Med_DESC_Eng column with the keywords “Sodium chloride”. This can be easily achieved with a stringr pipeline in R (Table 16).
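The keyword lookup described above can be sketched as follows. The example uses the Python standard library in place of stringr, and the three dictionary rows are invented; only the DrugName and Med_DESC_Eng column names come from the text.

```python
import csv
import io


def find_medications(rows, keyword, column="Med_DESC_Eng"):
    """Return rows whose `column` contains `keyword`, ignoring case."""
    needle = keyword.casefold()
    return [row for row in rows if needle in (row.get(column) or "").casefold()]


# Invented dictionary entries for illustration only.
dictionary_csv = io.StringIO(
    "Med_DESC_Eng,DrugName\n"
    "0.9% Sodium chloride injection 500 ml,sodium chloride\n"
    "Sodium Chloride injection 10 ml,sodium chloride\n"
    "Norepinephrine bitartrate injection 2 mg,norepinephrine\n"
)
rows = list(csv.DictReader(dictionary_csv))

by_description = find_medications(rows, "Sodium chloride")
by_name = find_medications(rows, "sodium chloride", column="DrugName")
print(len(by_description), len(by_name))
```

Case-insensitive matching is used because free-text descriptions may capitalize drug names inconsistently. A plain substring search can also match compound preparations that merely contain the keyword, so the results may need manual review.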

Table 16 Medication dictionary table.

Vital sign table

The VitalSign table provides vital sign data for each hospital admission (Table 17). The VitalSign_DESC column provides categories of vital signs including diastolic blood pressure, temperature, heart rate and respiratory rate.

Table 17 Explanation for variables in the VitalSign table.

Technical Validation

Data were verified for integrity during the transfer from the hospital information system to the database platform using MD5 checksums (Table 2). The MD5 hashes presented in Table 2 can also be used to verify the integrity of the downloaded datasets. All text information extracted from our medical information system is in Chinese. In establishing our data warehouse, we translated some metadata and short text fields to facilitate the use of the data by researchers outside China. The translation was first performed using the paid Baidu academic translation service (service number: MPE2022102608424528825) and then checked by two authors of the project (Senjun Jin and Zhongheng Zhang). However, to maintain data fidelity, very little post-processing was performed for long text fields such as present history, progress notes, and text reports of imaging studies. These natural language contents were not translated into English because translation may change the results of natural language processing or text mining20,21. Users can employ academic language translation services (including APIs) to translate large volumes of text if needed.
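Checking a downloaded file against its published MD5 hash can be sketched with Python's hashlib as follows. The function names are illustrative, and the demonstration writes a small temporary file instead of using a real dataset file or a real hash from Table 2.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path, expected_hex):
    """Return True when the file's MD5 digest matches the expected hash."""
    return md5_of_file(path) == expected_hex.strip().lower()


# Demonstration on a temporary file with invented content.
content = b"Hospital_ID,ICD10_code\n9432117,I10\n"
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name

expected = hashlib.md5(content).hexdigest()
ok = verify_md5(demo_path, expected)
os.remove(demo_path)
print(ok)
```

Reading in chunks keeps memory use constant, which matters for the larger tables. In practice, expected_hex would be the hash listed for the corresponding file in Table 2.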

The medical data archived within the database were not originally intended for secondary analysis. Thus, some missing values and inconsistencies may occur due to technical errors, system integration, and data preprocessing. In particular, the electronic critical care nursing chart system was launched in 2018, and thus the current database contains no nursing chart information before that time. Nursing chart data from before 2018 were recorded manually and are archived as paper documents. We are planning to convert these data into electronic form in a future project.

Usage Notes

Data access

Data are deposited in the PhysioNet repository (https://physionet.org/) and can be accessed after completion of an online course (e.g. from the Collaborative Institutional Training Initiative)22. Data access also requires a data use agreement to be signed, which stipulates that the user will not try to re-identify any subjects, will not share the data, and will release code associated with any publication using the data. Once approved, the plain CSV files can be directly downloaded from the project on PhysioNet22.