The eICU Collaborative Research Database, a freely available multi-center database for critical care research

Critical care patients are monitored closely through the course of their illness. As a result of this monitoring, large amounts of data are routinely collected for these patients. Philips Healthcare has developed a telehealth system, the eICU Program, which leverages these data to support management of critically ill patients. Here we describe the eICU Collaborative Research Database, a multi-center intensive care unit (ICU) database with high granularity data for over 200,000 admissions to ICUs monitored by eICU Programs across the United States. The database is deidentified, and includes vital sign measurements, care plan documentation, severity of illness measures, diagnosis information, treatment information, and more. Data are publicly available after registration, including completion of a training course in research with human subjects and signing of a data use agreement mandating responsible handling of the data and adhering to the principle of collaborative research. The freely available nature of the data will support a number of applications including the development of machine learning algorithms, decision support tools, and clinical research.


Background & Summary
Intensive care units (ICUs) provide care for severely ill patients who require invasive life-saving treatment. Critical care as a subspecialty of medicine began during a polio epidemic in which large numbers of patients required mechanical ventilation for many weeks 1 . Since then, the field of critical care has grown, and continues to evolve as demographics shift toward older and chronically sicker populations 2 . Patients in ICUs are monitored closely to detect physiologic changes associated with deteriorating illness that might require reassessment of the treatment regimen as appropriate. Close observation of ICU patients is facilitated by bedside monitors which continuously stream huge quantities of data, but relatively small portions of these data are archived for clinical documentation 3 . Challenges of archiving these data include integration of disparate information systems and building a comprehensive system to handle all types of data 4 . A telehealth ICU, or teleICU, is a centralized model of care where remote providers monitor ICU patients continuously, providing both structured consultations and reactive alerts 5 . TeleICUs allow caregivers from remote locations to monitor treatments for patients, alert local providers to sudden deterioration, and supplement care plans. Philips Healthcare, a major vendor of ICU equipment and services, provides a teleICU service known as the eICU program. Care providers primarily access and document data in an information management system called eCareManager and additionally have access to the other information systems present in the hospital. After implementation of the eICU program, large amounts of data are collected and streamed for real-time monitoring by a remote ICU team. These data are archived by Philips and transformed into a research database by the eICU Research Institute (eRI) 6 .
The Laboratory for Computational Physiology (LCP) at MIT partnered with the eRI to produce the eICU Collaborative Research Database (eICU-CRD), a publicly available database sourced from the eICU Telehealth Program (Data Citation 1). The LCP has previously shared the Medical Information Mart for Intensive Care (MIMIC) database 7,8 . The latest version, MIMIC-III, contains rich deidentified data for over 60,000 ICU admissions to the Beth Israel Deaconess Medical Center in Boston, MA. MIMIC-III has been used for educational purposes, to investigate novel clinical relationships, and to develop new algorithms for patient monitoring. The source hospital of MIMIC-III does not participate in the eICU program, so eICU-CRD is a completely independent set of data collected from a large number of hospitals located within the United States. The release of eICU-CRD is intended to build upon the success of MIMIC-III and expand the scope of studies possible by making data available from multiple centers.

Database structure and development
The eICU Collaborative Research Database is distributed as a set of comma separated value (CSV) files which can be loaded into any relational database system. Each file contains data for a single table, and we denote references to tables by using italicized font. Similarly, we denote references to columns using monospace font.
All tables are deidentified to meet the safe harbor provision of the US Health Insurance Portability and Accountability Act (HIPAA) 9 . These provisions include the removal of all protected health information (PHI), such as personal numbers (e.g. phone, social security), addresses, dates, and ages over 89. When creating the dataset, patients were randomly assigned a unique identifier and a lookup key was not retained. As a result the identifiers in eICU-CRD cannot be linked back to the original, identifiable data. All hospital and ICU identifiers have also been removed to protect the privacy of contributing institutions and providers.
The schema was established in collaboration with Privacert (Cambridge, MA), who certified the reidentification risk as meeting safe harbor standards (HIPAA Certification no. 1031219-2). Subsequent to this certification, free-text fields were scanned for personal information using a previously published rule-based approach 10 . Briefly, this approach scans text for known patterns indicating presence of PHI (e.g. words following "Mr." are frequently names, such as "Mr. Smith"). The approach also detects words which are commonly used as places or names. The output of this algorithm was reviewed, and rows containing potential PHI were deleted. Finally, large portions of all tables were manually reviewed by at least three personnel to verify all data had been deidentified. Frequently, due to a low number of unique entries (e.g. when a table stored the results of a drop-down menu), the entire table was reviewed.
The schema of eICU-CRD is highly denormalized. All tables can be accessed independently and linked to a single patient tracking table, patient, using patientUnitStayId. The only exception to this is the hospital table, which links to the patient table using hospitalId. All tables, other than patient and hospital, have a randomly generated primary key with the suffix 'id' (for example, the diagnosis table has diagnosisId as a primary key). This column has no physical meaning, being used only to constrain uniqueness on rows and ensure integrity of the data when loading into a database system.
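The linking pattern described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under stated assumptions, not the official loading procedure: the key columns (patientUnitStayId, hospitalId, diagnosisId) come from the text, while the other column names (region, diagnosisString) and every row value are invented placeholders.

```python
import csv
import io
import sqlite3

# Toy CSV extracts; all values are invented for illustration only.
PATIENT_CSV = """patientUnitStayId,hospitalId
1001,10
1002,10
1003,20
"""
HOSPITAL_CSV = """hospitalId,region
10,Midwest
20,South
"""
DIAGNOSIS_CSV = """diagnosisId,patientUnitStayId,diagnosisString
1,1001,acute respiratory failure
2,1001,diabetes
3,1003,acute renal failure
"""


def load_csv(conn, table, text):
    """Create `table` from CSV text, using the header row as column names."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f"CREATE TABLE {table} ({columns})")
    conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", data)


conn = sqlite3.connect(":memory:")
load_csv(conn, "patient", PATIENT_CSV)
load_csv(conn, "hospital", HOSPITAL_CSV)
load_csv(conn, "diagnosis", DIAGNOSIS_CSV)

# diagnosis -> patient via patientUnitStayId; patient -> hospital via hospitalId.
QUERY = """
SELECT d.diagnosisId, p.patientUnitStayId, h.region
FROM diagnosis AS d
JOIN patient AS p ON p.patientUnitStayId = d.patientUnitStayId
JOIN hospital AS h ON h.hospitalId = p.hospitalId
ORDER BY d.diagnosisId
"""
linked = conn.execute(QUERY).fetchall()
```

Values are stored as text here for brevity; a real load would likely declare column types so that identifiers compare as integers.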

Patient identifiers
Unit stays, where the primary unit of care is the ICU, are identified by a single integer: the patientUnitStayId. Each unique hospitalization is also assigned a unique integer, known as the patientHealthSystemStayId. Finally, patients are identified by a uniquePid. Unlike the other identifiers, uniquePid is generated using an algorithm based upon prior work on linking disparate patient medical records 11 . Each patientHealthSystemStayId has one or more patientUnitStayId, and each uniquePid can have multiple hospital and/or unit stays. Figure 1 visualizes this hierarchy. All tables use patientUnitStayId to identify an individual unit stay, and the patient table can be used to determine unit stays linked to the same patient and/or hospitalization.
www.nature.com/sdata/ SCIENTIFIC DATA | 5:180178 | DOI: 10.1038/sdata.2018.178
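The three-level identifier hierarchy (patient, hospitalization, unit stay) can be illustrated by nesting rows of the patient table. The sketch below uses invented identifier values and a hypothetical helper, build_hierarchy; only the three column names are taken from the text.

```python
from collections import defaultdict

# Toy rows from the patient table: (uniquePid, patientHealthSystemStayId,
# patientUnitStayId). All values are invented for illustration.
PATIENT_ROWS = [
    ("pid-A", 500, 1001),
    ("pid-A", 500, 1002),  # second unit stay in the same hospitalization
    ("pid-A", 501, 1003),  # a later hospitalization for the same patient
    ("pid-B", 502, 1004),
]


def build_hierarchy(rows):
    """Nest unit stays under hospitalizations under patients."""
    hierarchy = defaultdict(lambda: defaultdict(list))
    for unique_pid, health_system_stay_id, unit_stay_id in rows:
        hierarchy[unique_pid][health_system_stay_id].append(unit_stay_id)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {pid: dict(stays) for pid, stays in hierarchy.items()}


hierarchy = build_hierarchy(PATIENT_ROWS)
```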

Sample selection
The eICU Collaborative Research Database is a subset of a research data repository maintained by eRI. A stratified random sample of patients was used to select patients for inclusion in the public dataset. The selection was done as follows: first, all hospital discharges between 2014 and 2015 were identified, and a single index stay for each unique patient was extracted. The proportion of index stays in each hospital from the eRI data repository was used to perform a stratified sample of patient index stays based upon hospital; the aim was to maintain the distribution of first ICU stays across the hospitals in the dataset. After a patient index stay was selected, all subsequent stays for that patient were also included in the dataset, regardless of the admitting hospital. A small proportion of patients only had stays in step down units or low acuity units, and these stays were removed.
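The idea of a sample stratified by hospital can be sketched as follows. This is not the code used to build the dataset: the sampling fraction, the random seed, and the function name stratified_sample are assumptions made purely for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(index_stays, fraction, seed=0):
    """Sample roughly `fraction` of index stays within each hospital.

    index_stays: list of (stay_id, hospital_id) tuples, one per patient.
    Returns a sorted list of the selected stay_ids.
    """
    rng = random.Random(seed)
    by_hospital = defaultdict(list)
    for stay_id, hospital_id in index_stays:
        by_hospital[hospital_id].append(stay_id)

    selected = []
    for hospital_id in sorted(by_hospital):
        stays = by_hospital[hospital_id]
        # Sampling the same fraction in every stratum keeps each hospital's
        # share of the sample close to its share of the repository.
        n_take = round(fraction * len(stays))
        selected.extend(rng.sample(stays, n_take))
    return sorted(selected)


# Toy repository: hospital 1 has 60 index stays, hospital 2 has 40.
index_stays = [(i, 1) for i in range(60)] + [(i, 2) for i in range(60, 100)]
sample = stratified_sample(index_stays, fraction=0.5)
```

In this toy case the 60:40 split between the two hospitals is preserved in the sample, which is the property the selection procedure aims for.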

Code availability
A Jupyter Notebook containing the code used to generate the tables and descriptive statistics included in this paper is available online 12 .
The code that underpins the eICU-CRD website and documentation is openly available and contributions from the research community are encouraged 13 .

Data Records
The database comprises 200,859 patient unit encounters for 139,367 unique patients admitted between 2014 and 2015. Patients were admitted to one of 335 units at 208 hospitals located throughout the US. Table 1 provides demographics of the dataset, including hospital level characteristics 14 . Table 2 highlights the top 10 most frequent admission diagnoses in the dataset as coded by trained eICU clinicians using the APACHE IV diagnosis system 15 . Table 3 collapses APACHE diagnoses into 21 groups which are more clinically intuitive. Patients who are missing APACHE IV hospital mortality predictions are excluded from both tables (N = 64,623). Patients will not have an APACHE IV hospital mortality prediction if they satisfy exclusion criteria for APACHE IV (burns patients, in-hospital readmissions, some transplant patients), or if their diagnosis is not documented within the first day of their ICU stay.

Classes of data
Data include vital signs, laboratory measurements, medications, APACHE components, care plan information, admission diagnosis, patient history, time-stamped diagnoses from a structured problem list, and similarly chosen treatments. The data are organized into tables which broadly correspond to the type of data contained within the table. Table 4 gives an overview of tables available in the dataset.

Administrative data
Hospital level information is available in the hospital table, and includes regional location in the USA (midwest, northeast, west, south), teaching status, and the number of hospital beds. Hospital information is the result of a survey and is sometimes incomplete: 12.5% have unknown region and 20.1% have unknown bed capacity. Table 5 shows the percentage of hospital data in each category. Patient information is recorded in the patient table. The three identifiers described earlier (patientUnitStayId, patientHealthSystemStayId, uniquePid) are present in this table. Administrative information recorded in the patient table includes: admission and discharge time, unit type, admission source, discharge location, and patient vital status on discharge. Patient demographics are also present in the patient table, including age (with ages >89 grouped into '>89'), ethnicity, height, and weight.
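Because ages above 89 are grouped into a single category, the age field cannot be read directly as a number. A minimal sketch of one way to handle this is shown below; the exact string stored in the CSV files is an assumption (both ">89" and "> 89" are tolerated), and parse_age is a hypothetical helper.

```python
def parse_age(raw):
    """Return (age_in_years, is_top_coded) for a raw age string.

    Ages above 89 are top-coded, so the numeric value 89 is only a lower
    bound when is_top_coded is True. Empty strings yield (None, False).
    """
    text = raw.strip()
    if not text:
        return None, False
    if text.startswith(">"):
        # Tolerates both ">89" and "> 89".
        return int(text[1:].strip()), True
    return int(text), False


ages = [parse_age(value) for value in ["45", ">89", "> 89", ""]]
```

Keeping the top-coded flag alongside the number lets an analysis treat these patients as censored rather than as exactly 89 years old.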

APACHE data
The Acute Physiology, Age, and Chronic Health Evaluation (APACHE) IV system 15 is a tool used to risk-adjust ICU patients for ICU performance benchmarking and quality improvement analysis. The APACHE IV system, among other predictions, provides estimates of the probability that a patient dies given data from the first 24 hours. These predictions, on aggregate across many patients, can be used to benchmark hospitals and subsequently identify policies from hospitals which may be beneficial for patient outcomes. In order to make these predictions, care providers must collect a set of parameters regarding the patient: physiologic measurements, comorbid burden, treatments given, and admission diagnosis. These parameters are used in a logistic regression to predict mortality. eICU-CRD contains all parameters used in the APACHE IV equations: physiologic parameters are primarily stored in apacheApsVar, and other parameters are stored in apachePredVar. These data provide an informative estimate of patient severity of illness on admission to the ICU, though it should be noted that these predictions are not available for every patient, in particular: those who stay less than four hours, burns patients, certain transplant patients, and in-hospital readmissions. See the original publication for more detail 15 .
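To make the role of logistic regression concrete, the sketch below shows how a set of parameters is mapped to a predicted probability of death. The feature names, coefficients, and intercept are hypothetical and are not the published APACHE IV values; only the general form of the model is taken from the text.

```python
import math


def predicted_mortality(features, coefficients, intercept):
    """Logistic regression: p = 1 / (1 + exp(-(intercept + sum(b_i * x_i))))."""
    linear = intercept + sum(
        coefficients[name] * value for name, value in features.items()
    )
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients and features, NOT the published APACHE IV values.
coefficients = {"acute_physiology_score": 0.05, "age_points": 0.02}
intercept = -4.0
low_risk = predicted_mortality(
    {"acute_physiology_score": 20, "age_points": 5}, coefficients, intercept
)
high_risk = predicted_mortality(
    {"acute_physiology_score": 100, "age_points": 20}, coefficients, intercept
)
```

Averaging such predictions over the patients of a hospital and comparing them with observed mortality is the basis of the benchmarking use described above.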

Care plan
The care plan is a section of eCareManager which is primarily used for intraprofessional communication.
The data are documented using structured multiple choice lists and the care plan is used to communicate care provider type, provider specialty, code status, prognosis, treatment status, goals of care, healthcare proxies, and end-of-life discussion.

Care documentation
Drop down lists available in eCareManager allow for structured documentation of active problems and active treatments for a patient. It is also possible for care staff to enter short free-text entries. Eighteen tables are available in eICU-CRD which document various aspects of each patient's care including measurements made, active problems, treatments planned, and more.
admissionDrug. This table contains details of medications that a patient was taking prior to admission to the ICU. Information available includes the drug name, dosage, time frame during which the drug was administered, the user type and specialty of the clinician entering the data, and the note type where the information was entered.
allergy. Allergies were documented in the allergy table and sourced from patient note forms. Allergy information is available with a free text allergy name, type of documenting caregiver, whether the allergy is a drug, a standardized code for the drug (if applicable), and the time at which the allergy was documented.
customLab. Laboratory measurements that are not configured within the standard interface are included in the customLab table. These laboratory measurements are infrequently measured but may provide useful information for a small subset of patients. The most frequently measured test in the customLab table is glomerular filtration rate (GFR), and the table contains data for less than 1% of all patients in eICU-CRD v2.0.
diagnosis. Active problems were documented in the diagnosis table, with 86% of patients having a documented active problem during the first 24 h of their unit stay. There were a total of 3,933 unique active problems; the most common was acute respiratory failure (11.15% of patients), followed by acute renal failure (8.15% of patients) and diabetes (7.28% of patients). Problems are hierarchically categorized, and Table 6 shows the proportion of patients with an active problem for each organ system. Note that a patient can have problems documented for multiple organ systems. Most problems are mapped to International Classification of Disease (ICD) codes to facilitate identification of specific diseases using a well-established ontology. However, it was not possible to map some diagnoses to ICD codes. For example, "endocrine|glucose metabolism|diabetes mellitus|Type II|controlled" is mapped to ICD-9 code 250.00 (Diabetes mellitus without mention of complication, type II or unspecified type, not stated as uncontrolled) and ICD-10 code E11.9 (Type 2 diabetes mellitus without complications). However, "endocrine|glucose metabolism|diabetes mellitus" is not mapped to an ICD code, as it is not clear whether this is type I or type II.
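The hierarchical problem strings in the diagnosis table can be split on the "|" separator to recover each level. The sketch below assumes, based on the example quoted above, that the first level corresponds to the organ system; apart from that quoted string, the rows and the helper names are invented for illustration.

```python
from collections import defaultdict

# Toy diagnosis rows: (patientUnitStayId, hierarchical problem string).
# Only the first string is quoted from the text; the rest are invented.
DIAGNOSIS_ROWS = [
    (1001, "endocrine|glucose metabolism|diabetes mellitus|Type II|controlled"),
    (1001, "pulmonary|respiratory failure|acute respiratory failure"),
    (1002, "pulmonary|respiratory failure|acute respiratory failure"),
    (1003, "renal|disorder of kidney|acute renal failure"),
]
N_PATIENTS = 4  # includes stay 1004, which has no documented problem


def organ_system(problem):
    """The top level of the hierarchy is the text before the first '|'."""
    return problem.split("|", 1)[0]


def proportion_by_organ_system(rows, n_patients):
    """Fraction of unit stays with at least one problem per organ system."""
    stays = defaultdict(set)
    for unit_stay_id, problem in rows:
        stays[organ_system(problem)].add(unit_stay_id)
    return {system: len(ids) / n_patients for system, ids in stays.items()}


proportions = proportion_by_organ_system(DIAGNOSIS_ROWS, N_PATIENTS)
```

Because a stay is counted once per organ system, the proportions can sum to more than one when patients have problems in several systems.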
infusionDrug. Details of drug infusions are recorded within the infusionDrug table. These infusions are entered by care staff manually or interfaced from an electronic health record system from the hospital. Continuous infusions documented include vasopressors, antibiotics, anticoagulation, insulin, sedatives, analgesics, and so on. Of the 208 hospitals in eICU-CRD, 152 (73%) have data recorded in the infusionDrug table. Recorded information includes the name of the drug, a standardized code for the drug (using Hierarchical Ingredient Code List or HICL codes), the amount of drug in the carrying solution, the total volume of the carrier, the rate of the drug infusion, and the patient weight (if applicable for dosing). All records are stored with a single offset representing the time of the infusion.
intakeOutput. The intake and output of any volume for patients is stored in the intakeOutput table. Entries exist with non-specific names such as "Crystalloids (ml)|Continuous infusion meds". Overall fluid balance is an important aspect of patient health, and running totals for intake, output, dialysis, and net (intake minus output) are recorded. The most frequent records in the intakeOutput table include urine output, infusion of normal saline, oral fluid intake, non-saline fluid administration (e.g. dextrose based), enteral feeding, parenteral feeding, and more.
medication. Active medication orders for patients are stored in the medication table. When a medication order is made by a physician, a pharmacist will review and verify the order in their corresponding pharmacy system. This order verification is interfaced into eCareManager and stored in the medication table. Free text instructions and comments are removed during the deidentification process. In eICU-CRD, two tables focus on recording patient medication: medication and infusionDrug. There are two key differences between these tables: (1) only continuous infusions are present in infusionDrug (e.g. intravenously infused normal saline but not orally prescribed acetaminophen), and (2) compounds described in medication are orders, and while these orders are usually fulfilled and administered, this cannot be guaranteed. Information available for each order includes: the start time, end time, name of the compound, HICL code, dosage, route of administration, frequency of administration, loading dose, whether the drug is given pro re nata (PRN), and whether the drug is an IV admixture.
microLab. Microbiology information from patient derived specimens is made available in the microLab table. Presence of bacteria in specimens such as blood or sputum provides useful information for treatment planning and selection of antibiotic regimen. Each record includes the time of specimen collection (e.g. blood draw), site of the culture, organism found (if any), and sensitivity to various antibiotics (if any are tested). As microbiology is documented manually by care providers, and not directly interfaced from local hospital information systems, the table is not populated for a significant number of hospitals.
note. Notes are generally entered by the physician or physician extender primarily responsible for the documentation of the patient's unit care. There are several types of notes which can be entered in the system including admission, progress, patient medical history, procedure, catheterization, and consultation. Free-text notes were removed during the deidentification process. Highly structured text notes which are selected from drop down menus are retained within the database and present in the note table.
nurseCare. Patient care information is documented in the nurseCare table for the following categories: nutrition, activity, hygiene, wound care, line care, drain status, patient safety, alarms, isolation precautions, equipment, restraints, and other nursing care data. Each record is stored with an entry time (nurseCareEntryOffset) and a relevant time (nurseCareOffset). A custom hierarchy is used to group and store data.
nurseCharting. The majority of bedside documentation is entered into a "flowsheet", a tabular style interface with time in columns (usually hourly) and observations in rows. The nurseCharting table contains this information using an entity-attribute-value model, where the entity is a patient identifier, the attribute is the type of data recorded (e.g. heart rate), and the value is the measurement made (e.g. 80 beats per minute). Each charted item is stored with a "chart time" (nursingChartOffset), which specifies when the measurement was relevant, and a "validation time" (nursingChartEntryOffset), which indicates when the measurement was verified by staff. Vital signs available include: heart rate, heart rhythm, blood pressure, respiratory rate, peripheral oxygen saturation, temperature, location of temperature measurement, central venous pressure, oxygen flow in liters, oxygen device used for oxygen flow, and end tidal CO2. Less frequently documented vital signs available include: pulmonary artery pressure (PA), stroke volume (SV), cardiac output (CO), systemic vascular resistance (SVR), intracranial pressure (ICP), cardiac index (CI), systemic vascular resistance index (SVRI), cerebral perfusion pressure (CPP), central venous oxygen saturation (SVO2), pulmonary artery occlusion pressure (PAOP), pulmonary vascular resistance (PVR), pulmonary vascular resistance index (PVRI), and intra-abdominal pressure (IAP). Other data elements available in nurseCharting include assessments made, commonly tabulated scores (neurological function scales, sedation scales, pain scales), and other physiologic measurements or device settings.
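Data stored in an entity-attribute-value model usually need to be pivoted back into a flowsheet-like layout before analysis. The following sketch shows one way to do this in plain Python; the attribute labels and values are invented, pivot_wide is a hypothetical helper, and only patientUnitStayId and nursingChartOffset are column names taken from the text.

```python
from collections import defaultdict

# Toy entity-attribute-value rows:
# (patientUnitStayId, nursingChartOffset, attribute, value).
# The attribute labels and values are invented for illustration.
EAV_ROWS = [
    (1001, 60, "heart rate", "80"),
    (1001, 60, "respiratory rate", "18"),
    (1001, 120, "heart rate", "92"),
]


def pivot_wide(rows):
    """Pivot EAV rows into one dict per (patient, chart time)."""
    wide = defaultdict(dict)
    for unit_stay_id, chart_offset, attribute, value in rows:
        # A later row for the same key and attribute overwrites an earlier one.
        wide[(unit_stay_id, chart_offset)][attribute] = value
    return dict(wide)


flowsheet = pivot_wide(EAV_ROWS)
```

Attributes that were not charted at a given time are simply absent from that row, which mirrors the sparsity of the original flowsheet.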
pastHistory. Information related to a patient's relevant past medical history is stored in the pastHistory table. Providing a detailed past history is not common, but items such as AIDS, cirrhosis of the liver, hepatic failure, chronic renal failure, transplant, pre-existing cancers, and immunosuppression are more reliably documented due to their importance in severity of illness scoring. Elements of past medical history are documented using a custom hierarchical coding system and stored with the charted time (pastHistoryOffset) and with the entry time (pastHistoryEntryOffset).
physicalExam. Results of physical exams performed are stored in the physicalExam table. Data for physical exams are entered directly into eCareManager. The choices for the physical exam include "Not Performed", "Performed-Free Text", and "Performed-Structured". Free text sections are not included in the database. There is a large variety of drop-down menus for the physical exams recorded, with specific text entry boxes allowing for the creation of a structured physical exam.
respiratoryCare. This table contains information related to respiratory care. Patient data include respiratory care times, sequence of records for historical ordering, airway type/size/position, cuff pressure and various other ventilation details. Unlike other tables, the respiratoryCare table does not use an entity-attribute-value model, but instead has many columns for each setting, most of which are empty for a given time of data recording.
respiratoryCharting. Charted data which relate to a patient's ventilation status, including the configuration of the bedside mechanical ventilator, are stored in the respiratoryCharting table. Each setting is stored with an entry time (respChartEntryOffset) and an observation time (respChartOffset). Examples of settings include the percentage of oxygen inspired, tidal volumes, pressure settings, and other ventilator parameters.

treatment. A custom hierarchical coding system is used to record active treatments, and there are 2,711 unique treatments documented in eICU-CRD. The most frequent treatments explicitly documented in the table across patients were mechanical ventilation (16.96% of patients), chest x-rays (8.79% of patients), oxygen therapy via a nasal cannula with a low fraction of oxygen (6.93% of patients), and normal saline administration (7.57%).

Bedside monitor data
Large quantities of data are continuously recorded on ICU patients and displayed via bedside monitors. The vitalPeriodic and vitalAperiodic tables contain data derived directly from these bedside monitors.
Unlike other data elements in the database, the data collected in these tables are not entered or validated by providers of care: the periodic and aperiodic vital sign data have been automatically derived and archived with no human verification.
vitalAperiodic. Aperiodic vital signs are collected at various times and include non-invasive blood pressure, pulmonary artery occlusion pressure (PAOP), cardiac output, cardiac input, systemic vascular resistance (SVR), SVR index (SVRi), pulmonary vascular resistance (PVR), and PVR index (PVRi). The most frequent aperiodic vital sign is blood pressure (available for 94% of patients), and the least frequent is PVRi (available for 0.93% of patients).

Technical Validation
Data were verified for integrity during the data transfer process from Philips to MIT using MD5 checksums. In order to maintain data fidelity, very little post-processing has been performed. Each participant hospital in the database has customized workflows and clinical documentation processes, and as a result, the reliability and completion of data elements vary on a hospital and/or ICU level. Table 8 presents data completion across tables, showing the number of hospitals with low, medium, and high data completion.
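A checksum comparison of the kind mentioned above can be sketched with Python's hashlib module. This is an illustration of the general technique, not the transfer tooling that was actually used; the file contents, the function names, and the temporary file are placeholders.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_transfer(path, expected_md5):
    """True if the file's checksum matches the one recorded by the sender."""
    return md5_of_file(path) == expected_md5.lower()


# Demonstration with a temporary stand-in for a transferred CSV file.
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"patientUnitStayId\n1001\n")
    demo_path = tmp.name

sender_md5 = hashlib.md5(b"patientUnitStayId\n1001\n").hexdigest()
transfer_ok = verify_transfer(demo_path, sender_md5)
os.remove(demo_path)
```

Reading in chunks keeps memory use constant, which matters for the larger tables in a database of this size.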
The data archived within eICU-CRD were intended for use during routine clinical care, and not for secondary analysis. Thus, care must be taken when using the data, as inconsistencies which are inconsequential for clinical care may impact analyses performed.
A public issue tracker is used as a forum for reporting technical issues and describing solutions 13 . The correction of technical errors will be made with updated data releases.

Data access
Data can be accessed via a PhysioNet repository 16 . Details of the data access process are available online 17 . Use of the data requires proof of completion of a course on human subjects research (e.g. from the Collaborative Institutional Training Initiative 18 ). Data access also requires a data use agreement that stipulates, among other items, that the user will not share the data, will not attempt to re-identify any patients or institutions, and will release code associated with any publication using the data.
Future updates are planned for eICU-CRD. Updates which change the schema for currently available data, and as such break code syntactically, will result in a major version change. Release of new tables, correction of issues found in currently released data, and insertion of additional data into currently available tables will result in an increment in the minor version. Due to the complexity of the deidentification process and the high sensitivity required, not all data could be made available in the current version of eICU-CRD. Updates to the current dataset will be made as data are certified safe for release. Finally, eICU-CRD v2.0 contains data for patients admitted between 2014 and 2015. Future updates will be made to ensure data remain contemporary.
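The versioning policy described above can be expressed as a small rule. The sketch below is an interpretation rather than an official tool: the change labels, the function next_version, and the choice to reset the minor number after a major change are all assumptions.

```python
MAJOR_CHANGES = {"schema_change"}
MINOR_CHANGES = {"new_table", "data_correction", "additional_data"}


def next_version(current, change):
    """Return the next 'major.minor' version string for a kind of update."""
    major, minor = (int(part) for part in current.split("."))
    if change in MAJOR_CHANGES:
        # Assumption: the minor counter resets on a major bump.
        return f"{major + 1}.0"
    if change in MINOR_CHANGES:
        return f"{major}.{minor + 1}"
    raise ValueError(f"unknown change type: {change!r}")
```

Under this reading, code written against v2.0 should keep running on any later 2.x release, while a 3.0 release signals that queries may need to be updated.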

Collaborative code and documentation
A core aim in publicly releasing the eICU-CRD is to foster collaboration in secondary analysis of electronic health records, so we have created an openly available repository for sharing code 13 . We believe that publicly accessible code to extract reliable and consistent definitions for key clinical concepts is of utmost importance, both to accelerate research in the field and to ensure reproducibility of future studies 19,20 . Detailed documentation is available online 17 and includes information regarding data access, table contents, and a schematic of the relationships between tables in the data. The documentation is source controlled within the code repository allowing for collaborative development 13 . Discussion around data usage, highlighting of issues, and best practices can be made via the issues panel of the GitHub repository.

Example usage
We have provided publicly accessible Jupyter Notebooks 21,22 to demonstrate usage of the data 12 . These notebooks supplement online documentation and include a detailed review of each table, with commentary on best practices when working with the data. More general notebooks are available in the code repository referenced earlier, and include notebooks for cohort extraction, summary of demographic characteristics, and visualization of time-series data. Figure 2 visualizes a subset of variables available during a single patient stay and can be generated using a notebook provided online 12 .