ISARIC-COVID-19 dataset: A Prospective, Standardized, Global Dataset of Patients Hospitalized with COVID-19

The International Severe Acute Respiratory and Emerging Infection Consortium (ISARIC) COVID-19 dataset is one of the largest international databases of prospectively collected clinical data on people hospitalized with COVID-19. This dataset was compiled during the COVID-19 pandemic by a network of hospitals that collect data using the ISARIC-World Health Organization Clinical Characterization Protocol and data tools. The database includes data from more than 705,000 patients, collected in more than 60 countries and 1,500 centres worldwide. Patient data are available from acute hospital admissions with COVID-19 and outpatient follow-ups. The data include signs and symptoms, pre-existing comorbidities, vital signs, chronic and acute treatments, complications, dates of hospitalization and discharge, mortality, viral strains, vaccination status, and other data. Here, we present the dataset characteristics, explain its architecture and how to gain access, and provide tools to facilitate its use.
Measurement(s) CDISC SDTM Vital Sign Test Name Terminology Technology Type(s) CDISC Define-XML General Observation Class Terminology Factor Type(s) Hospital Mortality • COVID-19 Sample Characteristic - Organism Homo sapiens Sample Characteristic - Environment hospital


Background & Summary
The International Severe Acute Respiratory and Emerging Infection Consortium (ISARIC) is a global federation of clinical research networks collaborating to prevent illness and death from infectious disease outbreaks through proficient and agile research response 1 . In January 2020, ISARIC launched a research response to the emergence of a novel severe acute respiratory syndrome coronavirus (SARS-CoV-2), detected weeks earlier in Wuhan, China 2,3 . The initial focus was on the clinical characterisation of COVID-19, the disease caused by SARS-CoV-2, which mainly affects the respiratory system 2 . The fatality rate of COVID-19 varies substantially across different locations, which may reflect differences in population age, comorbidities, vaccination status, and other factors 4 . By June 2022, there were more than 500 million reported cases and more than 6 million deaths. Despite unprecedented success in the rapid generation of vaccines and effective treatments, COVID-19 continues to cause severe and widespread health consequences 5,6 . Therefore, the continuation of high-quality, globally-representative research is critical, as are the data required to deliver it.
At the beginning of the COVID-19 outbreak, ISARIC adapted the ISARIC-WHO Clinical Characterization Protocol and data tools 7 to facilitate global research collaboration and accelerate the understanding of COVID-19 as part of the public health response to the pandemic 1,8,9 . Between January 2020 and September 2021, information about the clinical presentation, treatment, and outcomes of more than 705,000 patients with COVID-19, hospitalized across 62 countries, was aggregated to form the ISARIC-COVID-19 dataset. Clinical teams in 1,559 participating institutions collected the data. Figure 1 shows the number of patients per country included in the database as of September 2021 1,4,10 . The number of patients included in the dataset continues to grow as data collection continues across the globe.
The objective of the dataset is to accelerate understanding of COVID-19 through access to detailed clinical information on infected patients from a range of settings. Access to data facilitates science, improves scientific transparency and integrity, and has played a substantial role in the generation of knowledge that has led to better patient management and vaccine production for COVID-19 11 . The diversity of populations, regions, and resource levels from which the data originate increases the generalizability of the evidence generated and supports comparisons across them. Because large volumes of disparate data are collated, standardized, and shared, curation and governance efforts are invested centrally by a specialised team, enabling efficient data access and analysis by many researchers focused on the questions most relevant to the patients in their settings. This approach accelerates pandemic response by promoting locally-driven, locally-relevant knowledge generation, which is most likely to have an impact on public health policy and drive societal benefits beyond health 12,13 .

Methods
Data collection. Standardized clinical data of patients with suspected or confirmed COVID-19 are collected on the ISARIC-WHO case report forms (CRFs) (https://isaric.org/research/covid-19-clinical-research-resources/covid-19-crf/) or site-specific iterations of these forms. These forms are available in multiple languages to support accessibility for a global response.
Sites implement data collection contemporaneously with clinical care. Data are collected through direct observation and/or reviewing and extracting electronic health records or patient registries. Data can be submitted to ISARIC by completing the CRF on the Research Electronic Data Capture platform (REDCap version 10.6, Vanderbilt University 14 ) hosted by the University of Oxford. Alternatively, institutions using other data collection forms and/or a different data management system can share patient data in any format to the ISARIC COVID-19 data platform, hosted by the Infectious Diseases Data Observatory (IDDO, www.iddo.org). Data were prospectively collected on patients with clinical suspicion or laboratory confirmation of SARS-CoV-2 infection and admitted to a participating hospital or ward. Recruitment aimed to include all identified patients; however, resource constraints limited enrolment when patient numbers surged and health systems became overwhelmed. In such cases, or in sites where prospective data collection was impossible, data were extracted from electronic health records. Ethics approval and informed consent were obtained according to local regulations, which included a waiver of consent to collect de-identified data at several sites due to the burden on front-line [...]
Data standardization: review and edit checks. Data are run through Pinnacle 21® (community version) software, a CDISC standards compliance-verification tool that checks the standard SDTM implementation guide rules and requirements for regulatory submission. The resulting checks and warnings are assessed for applicability to the individual dataset. The data are also run through standard edit checks to identify possible mapping errors separate from SDTM conformance. The curator adjusts the mapping as needed to make corrections. Figure 2 describes the workflow from data acquisition to the final, pooled dataset that researchers can access to conduct their research.

Data Records
The dataset is available from the Infectious Diseases Data Observatory (IDDO) at https://doi.org/10.48688/nx85-bv30 15 . The ISARIC-COVID-19 dataset is a relational database consisting of 16 tables, each representing a domain of information set out in the CDISC SDTM data model. Variables with the suffix 'ID' hold the unique identifiers that link these tables. For example, USUBJID is the subject's unique identifier and the primary key for accessing individual-level data; STUDYID contains the unique identifier for an individual hospital or network of hospitals. Each table defines and tracks different aspects of illness and treatment (Table 1). The majority of the tables are at the patient level, so they carry a subject identifier (USUBJID) that relates the information on a single patient distributed across multiple tables. The Trial Summary (TS), Trial Inclusion Exclusion Criteria (TI), and Device Identifiers (DI) tables are study-level domains; thus, they contain no individual patient-level data. Instead, they hold information specific to each institution, for instance, the inclusion/exclusion criteria or the devices used at each hospital. Data collection times for each data type are presented in Fig. 3 16-18 . As an example, Fig. 4 shows a synthetic, representative subset of the available data for a female patient.
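As an illustration of how USUBJID links patient-level tables, the following minimal Python sketch groups rows from two domains by subject. It assumes that the shared files follow standard SDTM naming for the Demographics (DM) and Vital Signs (VS) domains; the rows, identifiers, and values are invented for illustration and are not drawn from the dataset.

```python
from collections import defaultdict

# Invented rows. Column names follow standard SDTM conventions, which is an
# assumption about the layout of the shared files.
dm_rows = [
    {"STUDYID": "S1", "USUBJID": "S1-001", "AGE": "64", "SEX": "F"},
    {"STUDYID": "S1", "USUBJID": "S1-002", "AGE": "47", "SEX": "M"},
]
vs_rows = [
    {"USUBJID": "S1-001", "VSTESTCD": "TEMP", "VSORRES": "38.4", "VSDY": "1"},
    {"USUBJID": "S1-001", "VSTESTCD": "HR", "VSORRES": "102", "VSDY": "1"},
    {"USUBJID": "S1-002", "VSTESTCD": "TEMP", "VSORRES": "37.1", "VSDY": "2"},
]


def link_by_subject(dm, vs):
    """Attach each subject's vital-sign rows to their demographics row."""
    vitals = defaultdict(list)
    for row in vs:
        vitals[row["USUBJID"]].append(row)
    return {
        row["USUBJID"]: {
            "demographics": row,
            "vitals": vitals.get(row["USUBJID"], []),
        }
        for row in dm
    }


patients = link_by_subject(dm_rows, vs_rows)
print(len(patients["S1-001"]["vitals"]))
```

Study-level tables such as TS, TI, and DI have no USUBJID and would instead be linked through STUDYID.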
The CDISC SDTM data model has several advantages. For example: (1) It can adapt to any number of events. Frequently recorded events such as vital signs, laboratory tests, and patient status scores are stored as a series of events. The order is recorded in the variables with the suffix 'DY', which describes the day of the observation relative to the patient's hospital admission date. For example, the variable 'VSDY' indicates the day when a particular vital sign was measured. Events occurring within the same day can be further ordered using the variables with the suffix 'SEQ', which captures the sequence of events independently of the day on which they occurred. (2) It captures whether or not a variable was collected for a given patient (this is critical to count denominators accurately in an aggregated collection of many different datasets). The model enables this by collecting the existence of a variable separately from the occurrence or completion of that variable. For example, if the CRF for a dataset includes data on fever, the model shows that this question was prespecified as FEVER_PRESP = Yes; if the patient had a fever, it is captured as FEVER_OCCUR = Yes; if the patient was afebrile, it is registered as FEVER_OCCUR = No. Combining these two variables makes it possible to accurately quantify how many patients were evaluated for fever and how many had a fever. This distinction is found in the ER, HE, IN, and SA tables. A full description of how SDTM is implemented for these data, Frequently Asked Questions, and other data tools are available within the IDDO suite of curation and data resources (https://www.iddo.org/tools-and-resources/data-tools) to assist analysts in understanding these nuances. The remaining tables contain study-level data (e.g., Study Inclusion Exclusion Criteria and Device Identifiers); thus, there are no individual-level data in these domains.
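The two advantages described above can be sketched in a few lines of Python. The example below reuses the illustrative FEVER_PRESP and FEVER_OCCUR names from the text and the VSDY and VSSEQ variables of the vital signs table; the function names and all rows are hypothetical, and the actual column layout of the shared tables may differ.

```python
def fever_denominators(rows):
    """Separate 'was fever asked about?' from 'did fever occur?'."""
    # Only patients whose CRF prespecified the fever question count
    # towards the denominator.
    asked = [r for r in rows if r.get("FEVER_PRESP") == "Yes"]
    occurred = sum(1 for r in asked if r.get("FEVER_OCCUR") == "Yes")
    absent = sum(1 for r in asked if r.get("FEVER_OCCUR") == "No")
    return {
        "evaluated": len(asked),
        "fever": occurred,
        "no_fever": absent,
        "not_recorded": len(asked) - occurred - absent,
    }


def order_vitals(vs_rows):
    """Sort vital-sign rows by study day (VSDY), then sequence (VSSEQ)."""
    return sorted(vs_rows, key=lambda r: (int(r["VSDY"]), int(r["VSSEQ"])))


# Invented rows for illustration only.
fever_rows = [
    {"USUBJID": "S1-001", "FEVER_PRESP": "Yes", "FEVER_OCCUR": "Yes"},
    {"USUBJID": "S1-002", "FEVER_PRESP": "Yes", "FEVER_OCCUR": "No"},
    {"USUBJID": "S1-003", "FEVER_PRESP": "Yes", "FEVER_OCCUR": None},
    {"USUBJID": "S2-001"},  # fever was not on this site's CRF
]
vitals = [
    {"VSTESTCD": "HR", "VSDY": "2", "VSSEQ": "5"},
    {"VSTESTCD": "TEMP", "VSDY": "1", "VSSEQ": "2"},
    {"VSTESTCD": "HR", "VSDY": "1", "VSSEQ": "1"},
]

summary = fever_denominators(fever_rows)
ordered = order_vitals(vitals)
print(summary)
print([(r["VSDY"], r["VSSEQ"], r["VSTESTCD"]) for r in ordered])
```

In this sketch the patient from the site that did not ask about fever is excluded from the denominator rather than counted as afebrile, which is the distinction the model is designed to preserve.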
The dataset also contains a rich repository of free-text entries that capture more fine-grained information not included in the CRF solicited entries. Such information can be identified by applying simple search functions or Natural Language Processing (NLP) techniques to the **TERM variable. Supplementary Table 1 describes how data are distributed across the domain data tables and how many unique patients are included in each table.
Patient characteristics. Among the 708,158 patients whose data were entered as of September 2021, 552,366 (78%) had laboratory confirmation of SARS-CoV-2 infection, and 50,426 (7.1%) were clinically diagnosed (where testing was not available or results were not reported). Of these patients, the median age is 58 years (interquartile range, first quartile (Q1) to third quartile (Q3): 44-72), 48.9% are male, and 50.9% are female (the sex of 0.1% of the patients is unknown). A total of 126,069 (20.9%) patients were admitted to a critical care unit (ICU or HDU), and in-hospital mortality was 23.5% 5 . Table 1 provides a breakdown of the population by continent, and Supplementary Table 1 shows the number of unique patients with data reported for each domain.
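Returning to the free-text entries in the **TERM variables (written --TERM in SDTM documentation), a simple search of the kind mentioned above might look like the following Python sketch. The column name SATERM, the function name, the search pattern, and the rows are assumptions made for illustration; a real analysis would need to handle spelling variants, abbreviations, and multiple languages.

```python
import re


def find_subjects_by_term(rows, pattern, term_column="SATERM"):
    """Return sorted subject IDs whose free-text term matches the pattern."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return sorted(
        {r["USUBJID"] for r in rows if regex.search(r.get(term_column) or "")}
    )


# Invented rows; SATERM is assumed to hold the free-text term.
sa_rows = [
    {"USUBJID": "S1-001", "SATERM": "Loss of smell"},
    {"USUBJID": "S1-002", "SATERM": "ANOSMIA"},
    {"USUBJID": "S1-003", "SATERM": "Dry cough"},
    {"USUBJID": "S1-004", "SATERM": None},
]

matches = find_subjects_by_term(sa_rows, r"anosmia|loss of smell")
print(matches)
```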

Technical Validation
Data submitted via the ISARIC REDCap system are subjected to a series of field-specific data quality checks designed by ISARIC. These checks trigger error alerts that inform users of issues: they apply value limits, validate dates, flag missing variables, and perform logic checks that compare related variables. Data are further reviewed by a data manager who sends data quality reports and queries to sites when critical data are missing or outside expected values. Staff at data collection sites review the alerts and make the necessary corrections to their data in the REDCap system.
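The four kinds of check named above (value limits, date validation, missing variables, and logic checks) can be sketched as follows. This is not the ISARIC REDCap rule set: the field names, the limits, and the message wording are hypothetical and chosen only to show the pattern.

```python
from datetime import date

# Hypothetical fields and limits, not taken from the ISARIC CRF.
REQUIRED_FIELDS = ("age", "admission_date", "outcome_date")
LIMITS = {"age": (0, 120), "temperature_c": (25.0, 45.0)}


def check_record(record, limits=LIMITS):
    """Return a list of data-quality issues found in one record."""
    issues = []

    # Missing-variable check.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing: {field}")

    # Value-limit check on numeric fields that are present.
    for field, (low, high) in limits.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not low <= value <= high:
            issues.append(f"out of range: {field}={value}")

    # Date validation (ISO 8601 calendar dates are assumed).
    parsed = {}
    for field in ("admission_date", "outcome_date"):
        raw = record.get(field)
        if raw:
            try:
                parsed[field] = date.fromisoformat(raw)
            except ValueError:
                issues.append(f"invalid date: {field}")

    # Logic check comparing two related variables.
    if len(parsed) == 2 and parsed["outcome_date"] < parsed["admission_date"]:
        issues.append("logic: outcome_date before admission_date")

    return issues


example = {
    "age": 130,
    "admission_date": "2020-03-10",
    "outcome_date": "2020-03-01",
    "temperature_c": 37.0,
}
print(check_record(example))
```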
Data uploaded to the IDDO platform are verified during the 'pre-mapping' and 'data review and edit checks' processes described above. Interpretation of the data dictionary (for sites that used a unique data collection tool) and any missing values are queried directly with staff at the data collection sites. Results are charted per variable to identify and query outlier values. Where correction is suggested, the contributing site is contacted and asked to correct the data as needed before re-uploading them to the data platform.
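The text describes charting results per variable to spot outliers. As a programmatic stand-in for that visual review, the sketch below flags values outside Tukey's fences (1.5 times the interquartile range beyond the quartiles). The choice of rule, the function name, and the temperature values are assumptions for illustration and do not describe the curation team's actual procedure.

```python
import statistics


def flag_outliers(values_by_subject, k=1.5):
    """Return subject IDs whose value lies outside Tukey's fences."""
    values = sorted(values_by_subject.values())
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return sorted(
        subject
        for subject, value in values_by_subject.items()
        if value < low or value > high
    )


# Invented temperatures in degrees Celsius; 73.0 mimics a transposed 37.0.
temperatures_c = {
    "S1-001": 36.5, "S1-002": 36.8, "S1-003": 37.0, "S1-004": 37.2,
    "S1-005": 37.5, "S1-006": 38.0, "S1-007": 38.4, "S1-008": 73.0,
}

to_query = flag_outliers(temperatures_c)
print(to_query)
```

A flagged value is a candidate for a query to the contributing site, not an automatic correction.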

Usage Notes
The utility of the data collected is optimised by issuing regular open-access ISARIC COVID-19 Clinical Data Reports (https://isaric.org/research/covid-19-clinical-research-resources/evidence-reports/) and periodic updates to the ISARIC COVID-19 Dashboard (https://livedataoxford.shinyapps.io/CovidClinicalDataDashboard/). Data are available for analysis through two mechanisms to maximize uptake: a collaborative mechanism for ISARIC partners who contribute data to the dataset and a data-sharing platform for external researchers. The sites that contribute to the data retain ownership and decision-making authority on their data at all times.
It is essential to highlight that more countries worldwide are transitioning to digital healthcare systems. During this transition, quality control measures are necessary to enhance the effectiveness of healthcare-related communication and data quality 19 . Thus, the ISARIC-COVID-19 dataset can generate insights that facilitate quality control measures, especially in developing countries where scientific resources are scarce.
Data access. Staff from sites that contribute data to the dataset may access data for collaborative analysis via the ISARIC Partner Analysis scheme (https://isaric.org/research/isaric-partner-analysis-frequently-asked-questions/). Proposals for these analyses are governed and supported by ISARIC and executed with all data contributors' contributions, oversight, and accreditation 4,10,20 . ISARIC provides statistical, clinical, and administrative support to promote analyses by partners who contribute the data, especially those based in low-resource settings.
External researchers who have not contributed to the dataset are also welcome to submit a data access and analysis proposal via the IDDO platform (https://www.iddo.org/covid19). An independent Data Access Committee reviews these requests according to the Data Access Guidelines of the platform (https://www.iddo.org/covid19/data-sharing/accessing-data). Statistical analysis plans and outputs from both types of access can be viewed at: https://www.iddo.org/covid19/research/approved-uses-platform-data.
Data management, curation, governance, and the data-sharing platform are free to use and supported by the ISARIC and IDDO data management teams. When shared through the governed data access mechanisms, the ISARIC COVID-19 database is provided as a collection of comma-separated value (CSV) files (i.e., tables), along with scripts to help import the data into PostgreSQL and code that enables the reuse of the data. Notably, where data transformations are made during the database construction process, care is taken not to modify raw study data. The teams performing analyses can develop analytic code based on assumptions they deem appropriate.
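The import scripts distributed with the data target PostgreSQL and are not reproduced here. As a rough, self-contained illustration of the same idea, the sketch below loads CSV tables into an in-memory SQLite database using only the Python standard library and then queries across two tables. SQLite is a stand-in chosen so the example runs without a server; the helper name, the CSV content, and the all-text column typing are assumptions.

```python
import csv
import io
import sqlite3


def load_csv_table(conn, table, csv_file):
    """Create a table from a CSV header and insert all rows as text."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


# Invented CSV content standing in for the shared DM and VS files.
dm_csv = io.StringIO("USUBJID,AGE,SEX\nS1-001,64,F\nS1-002,47,M\n")
vs_csv = io.StringIO(
    "USUBJID,VSTESTCD,VSORRES,VSDY\n"
    "S1-001,TEMP,38.4,1\n"
    "S1-001,HR,102,1\n"
    "S1-002,TEMP,37.1,2\n"
)

conn = sqlite3.connect(":memory:")
load_csv_table(conn, "DM", dm_csv)
load_csv_table(conn, "VS", vs_csv)

# Combine demographics with temperature readings through USUBJID.
rows = conn.execute(
    "SELECT DM.USUBJID, DM.AGE, VS.VSORRES "
    "FROM DM JOIN VS ON DM.USUBJID = VS.USUBJID "
    "WHERE VS.VSTESTCD = 'TEMP' "
    "ORDER BY DM.USUBJID"
).fetchall()
print(rows)
conn.close()
```

Loading every column as text leaves the raw values untouched, in keeping with the principle stated above that transformations should not modify raw study data; type conversion can then be done explicitly at analysis time.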
Data use. The breadth of analyses published to date demonstrates the diversity of science that can be generated from these data. Examples include identifying unique COVID-19 symptomology at the extremities of age 21 ; developing the ISARIC 4C mortality score, which outperformed existing scores and showed utility to directly inform clinical decision making 22 ; identifying temporal trends in inpatient journeys to inform resource needs in an evolving pandemic 10 ; and improving the diagnosis of acute kidney injury 23 . Further analyses aim to develop natural language processing, understand neurological outcomes in COVID-19, and develop models that predict a range of outcomes.
The use of such a large and diverse dataset is not without challenges. Robust interpretation of analytic outputs requires an understanding of the variation in recruitment practices between sites and during the course of the outbreak and the availability of treatments and facilities (e.g., ICUs and ventilators) across the range of resource settings. ISARIC's collaborative approach to research outputs addresses these challenges by involving all staff who contributed to the collection of data in the review of the analysis plans and manuscripts. When designing an analysis plan, researchers must also consider which data are and are not available from each site and account for high levels of missingness, particularly during regional peaks in COVID-19 transmission. The CDISC SDTM data model was selected for harmonisation of these data, specifically because it captures these aspects of data provenance. Those using the dataset benefit from the richness of the model; however, they will need to master the challenges of its complexity. Tools to support understanding of the data model can be found at https://www.iddo.org/tools-and-resources/data-tools.
Collaborative research. The ISARIC WHO characterization protocol has proven to be a successful strategy for generating standardized data from multiple sites that international researchers can access for analysis 18,21,22,24-27 . Having a pre-prepared protocol for clinical investigation of an emerging infectious disease established before the beginning of the COVID-19 pandemic allowed us to gather patient data very early in the pandemic. As a result, contributors benefited from clinical data captured in other regions before they experienced cases and improved confidence in a larger dataset. By implementing systems to harmonize global data, ISARIC and IDDO have made international collaboration more efficient 1 . The evolution of these systems, including integrating epidemiological and genomic data to address new types of research questions, is in progress. Finally, ISARIC's data governance model allows members and non-members to propose research questions that could be answered using this dataset, which has helped advance science and empowers scientists worldwide 4,10,20 . This open and collaborative approach maximizes the scientific utility and public health impact of global data. With a focus on ensuring the representation of patient data and researchers from lower-resourced settings, the ISARIC network has accelerated understanding of COVID-19, advanced preparedness for future pandemics, and raised the bar on global collaboration for health.

Code availability
Processing code for the ISARIC COVID-19 database is openly available online, and contributions from the research community are encouraged. For this reason, a public code repository has been created along with this manuscript to develop and share code collectively: https://github.com/ISARICDataPlatform/ISARICBasics.git. The content of this repository is under continuous development. Still, it has been seeded with code to generate patient-level datasets suitable for statistics and machine learning research, such as patient demographics, comorbid conditions at the time of admission, application of treatments, and severity scores, among others. The research community can directly submit updates, improvements, and additions to the repository via GitHub. Moreover, a Jupyter Notebook containing the code used to generate the tables and descriptive statistics included in this paper is openly available on GitHub.

Competing interests
Allavena, C. declares personal fees from ViiV Healthcare, MSD, Janssen, and Gilead, all outside the submitted work. Andréjak, C. declares personal fees for lectures from AstraZeneca, outside the submitted work. Antonelli, M. declares unrestricted research grants from GE and Estor/Toray, and Board participation from Pfizer and Shionogi, all unrelated to the present work. Beltrame, A. has nothing to declare concerning the current work. Borie, R. declares personal fees for lectures from Roche, Sanofi, and Boehringer Ingelheim outside the submitted work. Bosse, Hans Martin is co-investigator for placebo studies in infants and children in clinical trials by Actelion/Janssen (Johnson & Johnson), outside the submitted work. Cheng, M. declares grants from McGill Interdisciplinary Initiative in Infection and Immunity, grants from Canadian Institutes of Health Research, during the conduct of the study; personal fees from GEn1E Lifesciences (as a member of the scientific advisory board), personal fees from nplex biosciences (as a member of the scientific advisory board), outside the submitted work. He is the co-founder of Kanvas Biosciences and owns equity in the company. In addition, M. Cheng reports a patent "Methods for detecting tissue damage, graft versus host disease, and infections using cell-free DNA profiling" pending, and a patent "Methods for assessing the severity and progression of SARS-CoV-2 infections using cell-free DNA" pending. Cholley, B. declares personal fees (for lectures and participation in advisory boards) from Edwards, Amomed, Nordic Pharma, and Orion Pharma. Claure-Del Granado, R. declares individual fees (for lectures and participation in advisory boards) from Nova Biomedical, Medtronic, and Baxter, all outside the submitted work. Cruz-Bermúdez J.L. declares personal fees from Elsevier for advice outside the submitted work. Cummings, M. and O'Donnell, M. participated as investigators for clinical trials evaluating the efficacy and safety of remdesivir (sponsored by Gilead Sciences) and convalescent plasma (sponsored by Amazon) in hospitalized patients with COVID-19. Support for this work is paid to Columbia University. Dalton, H. declares personal fees for medical director of Innovative ECMO Concepts and honorarium from Abiomed/BREETHE Oxi-1 and Instrumentation Labs. Consultant fee, Entegrion Inc. [...] (CTN-2014-012), an unrestricted grant from BAXTER for the TAME trial kidney substudy, and consultancy fees paid to his institution from AM-PHARMA. Nseir S. declares lectures for Gilead, Pfizer, MSD, Biomérieux, Fischer and Paykel, and Bio-Rad, outside the submitted work. Openshaw, P. has served on scientific advisory boards for Janssen/J&J, Oxford Immunotech Ltd, GSK, Nestle, and Pfizer (fees to Imperial College). He is Imperial College lead investigator on EMINENT, a consortium funded by the MRC and GSK. He is a member of the RSV Consortium in Europe (RESCEU) and Inno4Vac, Innovative Medicines Initiatives (IMI) from the European Union. Peltan, I.D. declares grant support from the National Institutes of Health and, outside the submitted work, grant support from Centers for Disease Control and Prevention, National Institutes of Health, and Janssen and payments to his institution from Regeneron and Asahi Kasei Pharma. Pesenti, A. declares personal fees from Maquet, Novalung/Xenios, Baxter, and Boehringer Ingelheim. Peytavin G. declares consulting fees (for lectures and/or participation in advisory boards) and travel grants from Gilead Sciences, Janssen, Merck, Takeda, Theratechnologies, and ViiV Healthcare. Poissy, J. declares personal fees from Gilead for lectures outside the submitted work. Povoa, P. declares personal fees (for lectures and advisory boards) from MSD, Technophage, Sanofi, and Gilead. Póvoas, D. declares consulting fees (for lectures and/or participation in advisory boards) from Roche and ViiV Healthcare; and travel/accommodation/meeting expenses from Abbvie, Gilead Sciences, Janssen Cilag, Merck Sharp & Dohme, and ViiV Healthcare. Rewa, O. declares honoraria from Baxter Healthcare Inc and Leading Biosciences Inc. Rössler, B. declares grants from CytoSorbent Inc. Rossanese, A. declares consulting fees (for lectures and/or participation in advisory boards) from Emergent BioSolutions and Sanofi Pasteur, all outside the frame of the submitted work. Săndulescu, O. has been an investigator in COVID-19 clinical trials by Algernon Pharmaceuticals, Atea Pharmaceuticals, Regeneron Pharmaceuticals, Diffusion Pharmaceuticals, Celltrion, Inc., and Atriva Therapeutics, outside the scope of the submitted work. Semple, M.G. reports grants from DHSC National Institute of Health Research UK, from the Medical Research Council UK, and from the Health Protection Research Unit in Emerging & Zoonotic Infections, University of Liverpool, supporting the conduct of the study; other interest in Integrum Scientific LLC, Greensboro, NC, USA, outside the submitted work. Serpa Neto, A. declares personal lecture fees from Drager outside the submitted work. Serrano-Balazote, P. declares funding via his Institution from Novartis and Janssen, and personal fees or participation in advisory boards or participation in the speaker's bureau of Roche, all outside of the submitted work. Shrapnel, S. participated as an investigator for an observational study analysing ICU patients with COVID-19 (for the Critical Care Consortium including ECMOCARD) funded by The Prince Charles Hospital Foundation during the conduct of this study. S. Shrapnel reports in-kind support from the Australian Research Council Centre of Excellence for Engineered Quantum Systems (CE170100009).
Streinu-Cercel, Adrian has been an investigator in COVID-19 clinical trials by Algernon Pharmaceuticals, Atea Pharmaceuticals, Regeneron Pharmaceuticals, Diffusion Pharmaceuticals, and