Background & Summary

The International Severe Acute Respiratory and Emerging Infection Consortium (ISARIC) is a global federation of clinical research networks collaborating to prevent illness and death from infectious disease outbreaks through proficient and agile research response1. In January 2020, ISARIC launched a research response to the emergence of a novel severe acute respiratory syndrome coronavirus (SARS-CoV-2), detected weeks earlier in Wuhan, China2,3. The initial focus was on the clinical characterisation of COVID-19, the disease caused by SARS-CoV-2, which mainly affects the respiratory system2. The fatality rate of COVID-19 varies substantially across locations, which may reflect differences in population age, comorbidities, vaccination status, and other factors4. By June 2022, more than 500 million cases and more than 6 million deaths had been reported. Despite unprecedented success in the rapid generation of vaccines and effective treatments, COVID-19 continues to cause severe and widespread health consequences5,6. Therefore, the continuation of high-quality, globally representative research is critical – as are the data required to deliver it.

At the beginning of the COVID-19 outbreak, ISARIC adapted the ISARIC-WHO Clinical Characterization Protocol and data tools7 to facilitate global research collaboration and accelerate the understanding of COVID-19 as part of the public health response to the pandemic1,8,9. Between January 2020 and September 2021, information about the clinical presentation, treatment, and outcomes of more than 705,000 patients with COVID-19, hospitalized across 62 countries, was aggregated to form the ISARIC-COVID-19 dataset. Clinical teams in 1,559 participating institutions collected the data. Figure 1 shows the number of patients per country included in the database as of September 2021 (refs. 1, 4, 10). The number of patients included in the dataset continues to grow as data collection continues across the globe.

Fig. 1

The number of patients per country included in the ISARIC COVID-19 database.

The objective of the dataset is to accelerate understanding of COVID-19 through access to detailed clinical information on infected patients from a range of settings. Access to data facilitates science, improves scientific transparency and integrity, and has played a substantial role in the generation of knowledge that has led to better patient management and vaccine production for COVID-1911. The diversity of populations, regions, and resource levels from which the data originate increases the generalizability of the evidence generated and supports comparisons across them. Because large volumes of disparate data are collated, standardized, and shared centrally, curation and governance efforts are invested once by a specialised team, enabling efficient data access and analysis by many researchers focused on the questions most relevant to the patients in their settings. This approach accelerates pandemic response by promoting locally driven, locally relevant knowledge generation, which is most likely to have an impact on public health policy and drive societal benefits beyond health12,13.

Methods

Data collection

Standardized clinical data of patients with suspected or confirmed COVID-19 are collected on the ISARIC-WHO case report forms (CRFs) (https://isaric.org/research/covid-19-clinical-research-resources/covid-19-crf/) or site-specific iterations of these forms. These forms are available in multiple languages to support accessibility for a global response.

Sites implement data collection contemporaneously with clinical care. Data are collected through direct observation and/or by reviewing and extracting electronic health records or patient registries. Data can be submitted to ISARIC by completing the CRF on the Research Electronic Data Capture platform (REDCap version 10.6, Vanderbilt University14) hosted by the University of Oxford. Alternatively, institutions using other data collection forms and/or a different data management system can share patient data in any format via the ISARIC COVID-19 data platform, hosted by the Infectious Diseases Data Observatory (IDDO, www.iddo.org). Data were prospectively collected on patients with clinical suspicion or laboratory confirmation of SARS-CoV-2 infection who were admitted to a participating hospital or ward. Recruitment aimed to include all identified patients; however, resource constraints limited enrolment when patient numbers surged and health systems became overwhelmed. In such cases, or in sites where prospective data collection was impossible, data were extracted from electronic health records. Ethics approval and informed consent were obtained according to local regulations, which included a waiver of consent to collect de-identified data at several sites due to the burden on front-line workers and the data protection framework in place. The WHO-ISARIC Clinical Characterization Protocol was approved by the WHO Ethics Committee (RPC571 and RPC572).

Data standardization

The ISARIC COVID-19 dataset is a large, clinically comprehensive, international resource. The diversity of data aggregated to create this resource required a uniform data model to standardize the structures and ontologies to a harmonized format. Thus, all data are standardized to the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) to facilitate pooled analyses. While there is no perfect data model, the CDISC SDTM was chosen to allow maximum flexibility to accommodate the diverse data types collected by different groups. This was preferred over other options, such as the Observational Medical Outcomes Partnership (OMOP) model, which was more rigid, with a fixed number of possible tables and variables. The use of SDTM also allows for greater interoperability to enable integration with COVID-19 clinical trial data that may be added to the dataset in the future. This data model is designed for data tabulation and storage. Using the dataset therefore requires processing to create an analysis dataset from which results can be derived. Here we present a complete description of the available data and their format, and describe a generalizable strategy to use the dataset and maximize its utility in research.

Data standardization - de-identification

Data entered in the ISARIC REDCap database or uploaded to the IDDO data platform are reviewed to ensure no direct identifiers are included. Direct identifiers, including those listed in the UK General Data Protection Regulation (https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/) and the US Health Insurance Portability and Accountability Act (https://www.hhs.gov/hipaa/index.html), are permanently deleted before data are curated through various processes.

Data standardization - pre-mapping

Data and all documentation shared with the data, such as dictionaries, protocols, publications, and data collection forms, are reviewed by the data curator to fully understand the contents of the dataset. Queries are raised with the data contributor when required. Each variable in the dataset is assigned to the appropriate SDTM domain(s), variable(s), and controlled vocabulary (if applicable) according to the rules found within the IDDO SDTM Implementation Manual (https://www.iddo.org/tools-and-resources/data-tools). The implementation manual chronicles each type of data curated to the platform and is consulted and updated with each new dataset to ensure consistency across the repository. An audit trail of the assignments is also recorded in a dataset-specific SDTM mapping guide.

Data standardization - data wrangling

For formatting and coding, the contributed datasets are loaded into Trifacta®, a data wrangling programme. Wrangling can include merging files, splitting variables into separate domains, applying controlled terminology to variables, and adding created variables as required. IDDO-defined standardization, conversion, and categorization formulas are also used as described in the IDDO SDTM Implementation Manual. Transformations on the contributed data (in the interests of standardization) are recorded and stored in a form that documents each transformation and enables it to be reproduced.

Data standardization - review and edit checks

Data are run through Pinnacle 21® (community version) software, a CDISC standards compliance-verification tool that checks the standard SDTM implementation guide rules and requirements for regulatory submission. The resulting checks and warnings are assessed for applicability to the individual dataset. The data are also run through standard edit checks to identify possible mapping errors separate from SDTM conformance. The curator adjusts the mapping as needed to make corrections.

Figure 2 describes the workflow from data acquisition to the final, pooled dataset that researchers can access to conduct their research.

Fig. 2

Overview of the ISARIC COVID-19 Database.

Data Records

The dataset is available from the Infectious Diseases Data Observatory (IDDO) at https://doi.org/10.48688/nx85-bv30 (ref. 15). The ISARIC-COVID-19 dataset is a relational database consisting of 16 tables, each representing a domain of information set out in the CDISC SDTM data model. The tables are linked by unique identifiers, which carry the suffix ‘ID’. For example, USUBJID refers to the subject’s unique identifier, which is the primary key for accessing individual-level data; STUDYID contains the unique identifier for an individual hospital or network of hospitals. Each table defines and tracks different aspects of illness and treatment.

Data tables

The tables (i.e., domains) currently included in the dataset are Demographics (DM), Disposition (DS), Environmental Risk (ER), Healthcare Encounters (HO), Inclusion/Exclusion Criteria (IE), Treatments and Interventions (IN), Laboratory Results (LB), Microbiology Specimen (MB), Reproductive System Findings (RP), Disease Response and Clinical Classification (RS), Clinical and Adverse Events (SA), Subject Visits (SV), Vital Signs (VS), COVID-19 Follow-Up questionnaire (CQ), Subject Characteristics (SC), and Pregnancy Outcomes (PO) (Supplementary Table 1). Most of these tables are at the patient level, so each has a subject identifier (USUBJID) that relates the information on a single patient distributed across multiple tables. The Trial Summary (TS), Trial Inclusion Exclusion Criteria (TI), and Device Identifiers (DI) are study-level domains; thus, there are no individual patient-level data in those domains. Instead, they hold information specific to each institution, for instance, the inclusion/exclusion criteria or the devices used at each hospital. Data collection times for each data type are presented in Fig. 3 (refs. 16, 17, 18). As an example, Fig. 4 shows a synthetic, representative subset of the available data for a female patient.
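Because the patient-level domains share USUBJID, the records for one patient can be gathered across tables with a simple keyed lookup. The sketch below is illustrative only: the two in-memory CSV snippets, their values, and the column subset are synthetic assumptions rather than an extract of the dataset, and the helper names `read_rows` and `collect_patient` are hypothetical.

```python
import csv
import io

# Synthetic stand-ins for two domain tables (DM and VS); all values are invented.
DM_CSV = """STUDYID,USUBJID,AGE,SEX
SITE_A,SITE_A-001,67,F
SITE_A,SITE_A-002,54,M
"""

VS_CSV = """STUDYID,USUBJID,VSTESTCD,VSORRES
SITE_A,SITE_A-001,TEMP,38.4
SITE_A,SITE_A-001,HR,102
SITE_A,SITE_A-002,TEMP,37.1
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def collect_patient(usubjid, domains):
    """Return {domain name: rows} holding every record for one subject."""
    return {
        name: [row for row in rows if row["USUBJID"] == usubjid]
        for name, rows in domains.items()
    }


domains = {"DM": read_rows(DM_CSV), "VS": read_rows(VS_CSV)}
patient = collect_patient("SITE_A-001", domains)
```

Here `patient["DM"]` holds the single demographics row and `patient["VS"]` the two vital-sign rows for the same subject. For the full dataset a database join or an indexed lookup would be more appropriate than a linear scan.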

Fig. 3

Data collection points for each data type.

Fig. 4

A synthetic, representative subset of the available data for a female patient.

The CDISC SDTM data model has several advantages. For example:

  1. It can adapt to any number of events. Frequently recorded events such as vital signs, laboratory tests, and patient status scores are stored as a series of events. The order is recorded in the variables with the suffix ‘DY’, which describes the day of the observation relative to the patient’s hospital admission date. For example, the variable ‘VSDY’ indicates the day when a particular vital sign was measured. Events occurring within the same day can be further ordered using the variables with the suffix ‘SEQ’, which capture the sequence of events independently of the day on which they occurred.

  2. It captures whether or not a variable was collected for a given patient (this is critical to count denominators accurately in an aggregated collection of many different datasets). The model enables this by recording the existence of a variable separately from the occurrence or completion of that variable. For example, if the CRF for a dataset includes data on fever, the model shows that this question was prespecified as FEVER_PRESP = Yes; if the patient had a fever, it is captured as FEVER_OCCUR = Yes; if the patient was afebrile, it is registered as FEVER_OCCUR = No. Combining these two variables makes it possible to accurately quantify how many patients were evaluated for fever and how many had a fever. This distinction is found in the ER, HO, IN, and SA tables.

A full description of how SDTM is implemented for these data, Frequently Asked Questions, and other data tools are available within the IDDO suite of curation and data resources (https://www.iddo.org/tools-and-resources/data-tools) to assist analysts in understanding these nuances. The remaining tables contain study-level data (e.g., Trial Inclusion Exclusion Criteria and Device Identifiers); thus, there are no individual-level data in these domains.
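The two properties above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the rows are synthetic, the column names (SATERM, SAPRESP, SAOCCUR, SADY, SASEQ) follow the general SDTM naming pattern for the SA domain, and the "Y"/"N" coding is assumed; the actual variable names and coded values should be checked against the IDDO SDTM Implementation Manual. The function names are hypothetical.

```python
# Synthetic SA-style records; column names and "Y"/"N" coding are assumptions.
rows = [
    {"USUBJID": "P1", "SATERM": "FEVER", "SAPRESP": "Y", "SAOCCUR": "Y", "SADY": 1, "SASEQ": 2},
    {"USUBJID": "P2", "SATERM": "FEVER", "SAPRESP": "Y", "SAOCCUR": "N", "SADY": 1, "SASEQ": 1},
    {"USUBJID": "P3", "SATERM": "COUGH", "SAPRESP": "Y", "SAOCCUR": "Y", "SADY": 2, "SASEQ": 1},
    {"USUBJID": "P1", "SATERM": "COUGH", "SAPRESP": "Y", "SAOCCUR": "Y", "SADY": 1, "SASEQ": 1},
]


def occurrence_counts(rows, term):
    """Return (subjects evaluated for `term`, subjects in whom it occurred)."""
    evaluated = set()
    occurred = set()
    for row in rows:
        if row["SATERM"] == term and row["SAPRESP"] == "Y":
            evaluated.add(row["USUBJID"])
            if row["SAOCCUR"] == "Y":
                occurred.add(row["USUBJID"])
    return len(evaluated), len(occurred)


def timeline(rows, usubjid):
    """Order one subject's events by study day, then by sequence number."""
    subject_rows = [row for row in rows if row["USUBJID"] == usubjid]
    return sorted(subject_rows, key=lambda row: (row["SADY"], row["SASEQ"]))


evaluated, occurred = occurrence_counts(rows, "FEVER")
p1_events = [row["SATERM"] for row in timeline(rows, "P1")]
```

In this toy example, P3 has no fever record at all, so the fever denominator is two patients (P1 and P2) rather than three, and one of them had a fever. Using the total number of patients as the denominator would understate the prevalence.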

The dataset also contains a rich repository of free-text entries that capture more fine-grained information not included in the CRF solicited entries. Such information can be identified by applying simple search functions or Natural Language Processing (NLP) techniques to the --TERM variables (where ‘--’ stands for the two-letter domain code). Supplementary Table 1 describes how data are distributed across the domain data tables and how many unique patients are included in each table.
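A simple search of this kind might look like the sketch below, which uses a case-insensitive regular expression. The rows, the column name SATERM, the search pattern, and the function name `subjects_matching` are all assumptions made for illustration; real free text would need a broader vocabulary, handling of misspellings, and possibly multiple languages.

```python
import re

# Synthetic free-text entries; the column name SATERM is assumed.
rows = [
    {"USUBJID": "P1", "SATERM": "Loss of smell and taste"},
    {"USUBJID": "P2", "SATERM": "anosmia since admission"},
    {"USUBJID": "P3", "SATERM": "Dry cough"},
    {"USUBJID": "P4", "SATERM": None},
]

# Hypothetical pattern for smell-related complaints.
PATTERN = re.compile(r"\b(anosmia|loss of smell)\b", re.IGNORECASE)


def subjects_matching(rows, pattern, term_column="SATERM"):
    """Return the sorted subject IDs whose free-text term matches the pattern."""
    return sorted(
        {row["USUBJID"] for row in rows if pattern.search(row.get(term_column) or "")}
    )


matches = subjects_matching(rows, PATTERN)
```

The `or ""` guard treats empty or absent free text as a non-match instead of raising an error.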

Patient characteristics

Among the 708,158 patients whose data were entered as of September 2021, 552,366 (78%) had laboratory confirmation of SARS-CoV-2 infection, and 50,426 (7.1%) were clinically diagnosed (where testing was not available or results were not reported). Of these patients, the median age was 58 years (interquartile range, Q1–Q3: 44–72); 48.9% were male and 50.9% were female (the sex of 0.1% of the patients is unknown). A total of 126,069 (20.9%) patients were admitted to a critical care unit (ICU or HDU), and in-hospital mortality was 23.5% (ref. 5). Table 1 provides a breakdown of the population by continent, and Supplementary Table 1 shows the number of unique patients with data reported for each domain.
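The percentages above use two different denominators, which the short calculation below makes explicit. The counts are taken from the paragraph; the denominator for the critical-care percentage is not stated explicitly, and the reading used here (patients with laboratory-confirmed or clinically diagnosed COVID-19) is an inference that happens to reproduce the reported 20.9%.

```python
# Counts reported in the text (as of September 2021).
total_entered = 708_158
lab_confirmed = 552_366
clinically_diagnosed = 50_426
critical_care = 126_069


def pct(numerator, denominator):
    """Percentage rounded to one decimal place."""
    return round(100 * numerator / denominator, 1)


# Denominator: all patients entered.
pct_confirmed = pct(lab_confirmed, total_entered)
pct_clinical = pct(clinically_diagnosed, total_entered)

# Assumed denominator: confirmed plus clinically diagnosed patients.
covid_patients = lab_confirmed + clinically_diagnosed
pct_critical = pct(critical_care, covid_patients)

# For comparison, the same count against all patients entered.
pct_critical_all = pct(critical_care, total_entered)
```

With all entered patients as the denominator, the critical-care proportion would be 17.8% rather than 20.9%, which is why denominators need to be stated when reusing these figures.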

Table 1 Details of the ISARIC-COVID-19 patient population by continent.

The most frequently reported comorbidities, symptoms at hospital admission, and complications during hospital admission are presented in Fig. 5. Among comorbid conditions, hypertension (30.7%), diabetes mellitus (29.6%), and chronic cardiac disease (10.5%) were the most frequently reported. The top five symptoms at admission were cough (23.7%), shortness of breath (19.8%), fever (17.5%), fatigue (11.5%), and altered consciousness (6.1%). Regarding complications, viral pneumonia (16.2%), acute respiratory distress syndrome (6.6%), acute kidney injury (5.5%), anaemia (4.3%), and bacterial pneumonia (3.8%) were the most frequently identified.

Fig. 5

Distribution of primary symptoms, comorbidities, and treatments. (A) shows the prevalence of comorbidities; (B) shows the prevalence of symptoms at admission; (C) shows the proportion of patients receiving each treatment.

Technical Validation

Data submitted via the ISARIC REDCap system are subjected to a series of field-specific data quality checks designed by ISARIC. These checks trigger error alerts that inform users of issues: they apply value limits, validate dates, flag missing variables, and perform logic checks that compare related variables. Data are further reviewed by a data manager, who sends data quality reports and queries to sites when critical data are missing or outside expected values. Staff at data collection sites review the alerts and make the necessary corrections to their data in the REDCap system.
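The four kinds of check described above can be sketched as a single record-level function. This is not ISARIC's rule set: the field names, the value limits, the list of required fields, and the function names are hypothetical and chosen only to show one instance of each check type.

```python
from datetime import date

# Hypothetical limits and required fields, for illustration only.
LIMITS = {"age_years": (0, 120), "temperature_c": (25.0, 45.0)}
REQUIRED = ("admission_date", "age_years")
DATE_FIELDS = ("admission_date", "outcome_date")


def parse_date(value):
    """Return a date for a valid ISO string (YYYY-MM-DD), otherwise None."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def check_record(record):
    """Return a list of human-readable issues found in one record."""
    issues = []
    # Missing-variable check.
    for field in REQUIRED:
        if record.get(field) in (None, ""):
            issues.append(f"missing: {field}")
    # Value-limit check (only applied to numeric values).
    for field, (low, high) in LIMITS.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not low <= value <= high:
            issues.append(f"out of range: {field}")
    # Date validation.
    dates = {}
    for field in DATE_FIELDS:
        raw = record.get(field)
        if raw in (None, ""):
            continue
        dates[field] = parse_date(raw)
        if dates[field] is None:
            issues.append(f"invalid date: {field}")
    # Logic check comparing two related variables.
    admitted, outcome = dates.get("admission_date"), dates.get("outcome_date")
    if admitted and outcome and outcome < admitted:
        issues.append("logic: outcome_date precedes admission_date")
    return issues


suspect = {"admission_date": "2020-04-10", "outcome_date": "2020-04-02",
           "age_years": 130, "temperature_c": 38.2}
clean = {"admission_date": "2020-04-10", "outcome_date": "2020-04-20",
         "age_years": 67, "temperature_c": 38.2}
suspect_issues = check_record(suspect)
clean_issues = check_record(clean)
```

Returning a list of issues rather than raising on the first problem mirrors the alert-and-query workflow, in which a site receives all outstanding queries for a record at once.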

Data uploaded to the IDDO platform are verified during the ‘pre-mapping’ and ‘data review and edit checks’ processes described above. Interpretation of the data dictionary (for sites that used a unique data collection tool) and any missing values are queried directly with staff at the data collection sites. Results are charted per variable to identify and query outlier values. Where correction is suggested, the contributing site is contacted and asked to correct the data as needed before re-uploading them to the data platform.
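The text does not specify how outliers are identified once results are charted. As one common possibility, the sketch below flags values outside Tukey's fences (1.5 times the interquartile range beyond the quartiles); this rule, the multiplier, the example values, and the function name are assumptions and not the platform's documented method.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return values lying outside Tukey's fences for the given sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    lower, upper = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < lower or v > upper]


# Synthetic body temperatures in degrees Celsius; the last value looks like a
# Fahrenheit reading entered in a Celsius field.
temperatures = [36.5, 36.8, 37.0, 37.2, 37.5, 38.0, 38.4, 39.0, 98.6]
suspect_values = flag_outliers(temperatures)
```

A flagged value is a prompt for a query to the contributing site, not an automatic correction, since extreme values can be clinically genuine.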

Usage Notes

The utility of the data collected is optimised by issuing regular open-access ISARIC COVID-19 Clinical Data Reports (https://isaric.org/research/covid-19-clinical-research-resources/evidence-reports/) and periodic updates to the ISARIC COVID-19 Dashboard (https://livedataoxford.shinyapps.io/CovidClinicalDataDashboard/). Data are available for analysis through two mechanisms to maximize uptake: a collaborative mechanism for ISARIC partners who contribute data to the dataset and a data-sharing platform for external researchers. The sites that contribute to the data retain ownership and decision-making authority on their data at all times.

It is worth highlighting that more countries worldwide are transitioning to digital healthcare systems. During this transition, quality control measures are necessary to enhance the effectiveness of healthcare-related communication and data quality19. The ISARIC-COVID-19 dataset can generate insights that facilitate such quality control measures, especially in developing countries where scientific resources are scarce.

Data access

Staff from sites that contribute data to the dataset may access data for collaborative analysis via the ISARIC Partner Analysis scheme (https://isaric.org/research/isaric-partner-analysis-frequently-asked-questions/). Proposals for these analyses are governed and supported by ISARIC and executed with the contribution, oversight, and accreditation of all data contributors4,10,20. ISARIC provides statistical, clinical, and administrative support to promote analyses by partners who contribute the data, especially those based in low-resource settings.

External researchers who have not contributed to the dataset are also welcome to submit a data access and analysis proposal via the IDDO platform (https://www.iddo.org/covid19). An independent Data Access Committee reviews these requests according to the Data Access Guidelines of the platform (https://www.iddo.org/covid19/data-sharing/accessing-data). Statistical analysis plans and outputs from both types of access can be viewed at https://www.iddo.org/covid19/research/approved-uses-platform-data.

Data management, curation, governance, and the data-sharing platform are free to use and supported by the ISARIC and IDDO data management teams. When shared through the governed data access mechanisms, the ISARIC COVID-19 database is provided as a collection of comma-separated value (CSV) files (i.e., tables), along with scripts to help import the data into PostgreSQL and code that enables the reuse of the data. Notably, where data transformations are made during the database construction process, care is taken not to modify raw study data. The teams performing analyses can develop analytic code based on the assumptions they deem appropriate.
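To illustrate the CSV-to-relational-database step without reproducing the provided PostgreSQL import scripts (which are not shown here), the sketch below loads two synthetic CSV tables into an in-memory SQLite database from the Python standard library and counts unique patients per domain. SQLite is a stand-in chosen so the example is self-contained; the CSV contents, the column subset, and the helper name `load_csv` are assumptions.

```python
import csv
import io
import sqlite3

# Synthetic CSV content standing in for two delivered domain files.
DM_CSV = """STUDYID,USUBJID,AGE,SEX
SITE_A,SITE_A-001,67,F
SITE_B,SITE_B-001,54,M
"""

LB_CSV = """STUDYID,USUBJID,LBTESTCD,LBORRES
SITE_A,SITE_A-001,CRP,112
SITE_A,SITE_A-001,WBC,9.8
"""


def load_csv(conn, table, text):
    """Create a table named `table` from CSV text, using the header as columns."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, "DM", DM_CSV)
load_csv(conn, "LB", LB_CSV)

# Unique patients per domain, in the spirit of Supplementary Table 1.
patients_per_domain = {
    table: conn.execute(
        f'SELECT COUNT(DISTINCT USUBJID) FROM "{table}"'
    ).fetchone()[0]
    for table in ("DM", "LB")
}
```

All columns are loaded as text here; a real import would declare column types and indexes on USUBJID and STUDYID, which the provided PostgreSQL scripts may already handle.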

Data use

The breadth of analyses published to date demonstrates the diversity of science that can be generated from these data. Examples include identifying unique COVID-19 symptomatology at the extremes of age21; developing the ISARIC 4C Mortality Score, which outperformed existing scores and showed utility to directly inform clinical decision making22; identifying temporal trends in inpatient journeys to inform resource needs in an evolving pandemic10; and improving the diagnosis of acute kidney injury23. Further analyses aim to develop natural language processing methods, understand neurological outcomes in COVID-19, and develop models that predict a range of outcomes.

The use of such a large and diverse dataset is not without challenges. Robust interpretation of analytic outputs requires an understanding of the variation in recruitment practices between sites and during the course of the outbreak, and of the availability of treatments and facilities (e.g., ICUs and ventilators) across the range of resource settings. ISARIC’s collaborative approach to research outputs addresses these challenges by involving all staff who contributed to the collection of data in the review of the analysis plans and manuscripts. When designing an analysis plan, researchers must also consider which data are and are not available from each site and account for high levels of missingness, particularly during regional peaks in COVID-19 transmission. The CDISC SDTM data model was selected for harmonisation of these data specifically because it captures these aspects of data provenance. Those using the dataset benefit from the richness of the model; however, they will need to master the challenges of its complexity. Tools to support understanding of the data model can be found at https://www.iddo.org/tools-and-resources/data-tools.
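One way to see which data are available from each site is to tabulate missingness per STUDYID before fixing an analysis plan. The sketch below is a minimal example on synthetic rows; treating empty strings and absent values as missing is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict

# Synthetic DM-style rows; an empty string stands for a value that was not recorded.
rows = [
    {"STUDYID": "SITE_A", "USUBJID": "A1", "AGE": "67"},
    {"STUDYID": "SITE_A", "USUBJID": "A2", "AGE": ""},
    {"STUDYID": "SITE_B", "USUBJID": "B1", "AGE": ""},
    {"STUDYID": "SITE_B", "USUBJID": "B2", "AGE": ""},
]


def missing_fraction_by_site(rows, column):
    """Return {STUDYID: fraction of rows with no value in `column`}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        site = row["STUDYID"]
        totals[site] += 1
        if row.get(column) in (None, ""):
            missing[site] += 1
    return {site: missing[site] / totals[site] for site in totals}


age_missingness = missing_fraction_by_site(rows, "AGE")
```

A site with a missing fraction of 1.0 most likely did not collect the variable at all, which is a different situation from sporadic missingness and may call for excluding that site from analyses of the variable rather than imputing it.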

Collaborative research

The ISARIC-WHO Clinical Characterization Protocol has proven to be a successful strategy for generating standardized data from multiple sites that international researchers can access for analysis18,21,22,24,25,26,27. Having a pre-prepared protocol for clinical investigation of an emerging infectious disease established before the beginning of the COVID-19 pandemic allowed us to gather patient data very early in the pandemic. As a result, contributors benefited from clinical data captured in other regions before they experienced cases themselves, and from the improved confidence that a larger dataset affords. By implementing systems to harmonize global data, ISARIC and IDDO have made international collaboration more efficient1. The evolution of these systems, including integrating epidemiological and genomic data to address new types of research questions, is in progress. Finally, ISARIC’s data governance model allows members and non-members to propose research questions that could be answered using this dataset, which has helped advance science and empower scientists worldwide4,10,20. This open and collaborative approach maximizes the scientific utility and public health impact of global data. With a focus on ensuring the representation of patient data and researchers from lower-resourced settings, the ISARIC network has accelerated understanding of COVID-19, advanced preparedness for future pandemics, and raised the bar on global collaboration for health.