Harmonization and standardization of data for a pan-European cohort on SARS-CoV-2 pandemic

The European project ORCHESTRA intends to create a new pan-European cohort to rapidly advance the knowledge of the effects and treatment of COVID-19. Establishing processes that facilitate the merging of heterogeneous clusters of retrospective data was an essential challenge. In addition, data from new ORCHESTRA prospective studies have to be compatible with earlier collected information to be efficiently combined. In this article, we describe how we utilized and contributed to existing standard terminologies to create consistent semantic representation of over 2500 COVID-19-related variables taken from three ORCHESTRA studies. The goal is to enable the semantic interoperability of data within the existing project studies and to create a common basis of standardized elements available for the design of new COVID-19 studies. We also identified 743 variables that were commonly used in two of the three prospective ORCHESTRA studies and can therefore be directly combined for analysis purposes. Additionally, we actively contributed to global interoperability by submitting new concept requests to the terminology Standards Development Organizations.


International standard terminologies
Semantic interoperability plays a central role when trying to merge data from different sources, each describing concepts in their own way. It is therefore important to identify the best-suited and worldwide recognized terminologies for the different categories of healthcare concepts.
The standard terminology maintained by SNOMED International, called SNOMED Clinical Terms (CT), is particularly well-suited as a general-purpose language for advancing semantic interoperability in medicine and healthcare due to its vast repertoire of available codes, its hierarchical structuring of concepts, and the definition of relationships between them. For example, the subtype relationship defines one concept as a subtype of another. This enables efficient classification of a clinical condition by including references to a SNOMED CT concept with all its hierarchical "children" and further descendant subtype concepts [ 1,2 ]. SNOMED CT can be complemented by more domain-specific terminologies. LOINC, for example, is typically used for laboratory observations and assessment tools. Each LOINC code represents the "question" that forms the basis of a test or measurement.
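The subtype expansion described above for SNOMED CT can be illustrated with a minimal sketch. The concept identifiers and is-a edges below are hypothetical placeholders, not real SNOMED CT codes, and the traversal is only one simple way to collect a concept together with all of its descendant subtypes.

```python
from collections import defaultdict

# Hypothetical (child, parent) is-a edges; placeholder names, not real SNOMED CT codes.
IS_A = [
    ("viral_pneumonia", "pneumonia"),
    ("bacterial_pneumonia", "pneumonia"),
    ("covid19_pneumonia", "viral_pneumonia"),
    ("pneumonia", "lung_disease"),
]


def build_children(edges):
    """Index the is-a edges as parent -> set of direct subtypes."""
    children = defaultdict(set)
    for child, parent in edges:
        children[parent].add(child)
    return children


def descendants_or_self(concept, children):
    """Return the concept plus all of its direct and indirect subtypes."""
    found = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


CHILDREN = build_children(IS_A)


def matches(recorded_concept, query_concept):
    """True if a recorded concept is the query concept or one of its subtypes."""
    return recorded_concept in descendants_or_self(query_concept, CHILDREN)
```

Under these assumptions, a record coded with the most specific concept is still found by a query for the broader condition, for example `matches("covid19_pneumonia", "pneumonia")`, while a broader concept such as `lung_disease` is not returned for that query.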
A LOINC term is made up of several components representing different information: the substance entity or specimen that is being measured, characteristics of the analyte, the time interval over which the observation was made, the type of value (nominal, ordinal or quantitative) and optionally the method used for analysis.
The Unified Code for Units of Measure (UCUM) [ 3 ] is used for measurement units.
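As a rough illustration of how the LOINC components and a UCUM unit fit together, the following sketch models a LOINC term and an observation that uses it. The class and field names are assumptions made for this example. The split into separate analyte and specimen fields follows LOINC's six-part naming model, and the glucose term is shown only as a familiar example that should be verified against the current LOINC release before use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoincTerm:
    """Illustrative container for the parts of a LOINC term (names are assumptions)."""
    code: str
    component: str                 # what is measured (the analyte)
    prop: str                      # characteristic of the analyte, e.g. mass concentration
    time: str                      # time interval of the observation, e.g. point in time
    system: str                    # specimen or system observed
    scale: str                     # type of value: nominal, ordinal or quantitative
    method: Optional[str] = None   # optional analysis method

    def fully_specified_name(self) -> str:
        """Join the parts with colons, omitting the method when it is absent."""
        parts = [self.component, self.prop, self.time, self.system, self.scale]
        if self.method:
            parts.append(self.method)
        return ":".join(parts)


@dataclass(frozen=True)
class Observation:
    """A measured value annotated with a LOINC term and a UCUM unit string."""
    term: LoincTerm
    value: float
    ucum_unit: str


# Example term: glucose mass concentration in serum or plasma (check against LOINC).
glucose = LoincTerm("2345-7", "Glucose", "MCnc", "Pt", "Ser/Plas", "Qn")
observation = Observation(term=glucose, value=95.0, ucum_unit="mg/dL")
```

Keeping the code, its parts and the unit in one record means two studies can be compared on the question that was asked and the unit in which it was answered, rather than on free-text variable labels.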
ICD-10, the 10th revision of the International Classification of Diseases (ICD), is published by the World Health Organization (WHO) and used for reporting diseases and health conditions [ 4 ] in both clinical and research efforts. ICD codes define classifications not only for diseases but also for injuries and disorders and are ordered hierarchically. Although a revision, ICD-11, is already available from the WHO platform, it will officially come into effect only on 1 January 2022, and it will take quite some time before it becomes broadly used.
Medications are best represented by the WHO's Anatomical Therapeutic Chemical (ATC) classification [ 5 ]. ATC is also used by the surveillance system of the European Centre for Disease Prevention and Control (ECDC), for example to monitor antimicrobial consumption [ 6 ].
For specific genetics investigations, following the approach of the Global Alliance for Genomics and Health (GA4GH) [ 7,8 ], the National Cancer Institute's (NCI) thesaurus (NCIt) [ 9 ] was found to offer the most complete terminology resources.
The use of plasma products is one of the many treatments [ 10,11 ] that have been investigated in an effort to combat severe COVID-19 infections. To represent the concept of convalescent plasma therapy, the ISBT 128 standard for medical products of human origin was selected [ 12 ].
All of these terminologies together can ensure that health data have a clear structure and unambiguous semantics.

GECCO Data Set
The GECCO (German Corona Consensus Data Set) research dataset on COVID-19 [ 13,14,15 ], standardized by the Charité, was selected as the reference for creating a core data set in ORCHESTRA. GECCO is part of the German COVID-19 Research Network of University Medicine (https://www.netzwerk-universitaetsmedizin.de), which aims to bundle the resources of German university hospitals to improve diagnostics and treatment of COVID-19 patients.
GECCO was developed using international health IT standards and terminologies for interoperable data exchange. In the development process of GECCO, the international project ISARIC-WHO CRF [ 16,17 ] was taken into account. Data elements and the corresponding value sets from relevant German projects were also considered, such as the German Pa-COVID-19 study [ 18 ], which investigates the pathophysiology of COVID-19 in a prospective patient cohort. The LEOSS [ 19,20 ] case registry was taken into account as well; it is a clinical registry for patients infected with SARS-CoV-2, initiated by the ESCMID Emerging Infections Task Force (EITaF), the German Center for Infection Research (DZIF) and the German Society for Infectiology (DGI). The GECCO dataset was originally developed for use by the German university hospitals that partner to share their data for common analysis within a centralized platform as part of the CODEX (COVID-19 Data Exchange Platform) project (https://www.netzwerk-universitaetsmedizin.de/projekte/codex). However, since international standards and terminologies were used, the GECCO dataset can be extended to use cases beyond its original intention and also be applied in international contexts [ 21 ]. The GECCO FHIR profiles are likewise based on international work, for example the International Patient Summary (IPS) [ 22 ]. This ensures that the GECCO dataset can be re-used internationally and thus supports interoperability. For this reason, it was possible to take it as the starting point for ORCHESTRA.

ORCHESTRA studies
In order to maximize potential insights that could be gained, the project ORCHESTRA includes SARS-CoV-2 infected and non-infected individuals of all ages and pre-existing conditions, focusing on at-risk populations of vulnerable individuals and healthcare workers.
Patients with a history of COVID-19 will be followed in the ORCHESTRA studies for the assessment of clinical, radiological, and psychological consequences up to 18 months after diagnosis of COVID-19 [ 23 ]. The inclusion of the fragile population will offer an opportunity to explore the impact of COVID-19 on frail or at-risk populations, who are usually not [...] on morbidity and mortality has been described in elderly and immunocompromised hosts [ 30,31 ]. Optimization of prevention strategies, screening practices and therapeutic management is therefore recommended when dealing with fragile patients [ 32,33 ]. Epidemiological data are strongly needed to design further intervention trials and health policies.

Genomics study
Several different sample types will be analyzed within the ORCHESTRA study to identify human and viral genetic markers indicative of disease severity and to study immune responses over time following infection and immunization.
Specifically, samples will be collected from patients with COVID-19 (including breakthrough and reinfection) to study both short- and long-term effects of infection on host immunity, respiratory and intestinal microbiome dynamics, as well as host and viral genetic determinants underlying infection. Additionally, samples will be collected from vaccinated fragile populations as well as vaccinated healthcare workers to study effects of vaccination on host immunity and respiratory and intestinal microbiome dynamics.
Samples collected within the framework of the ORCHESTRA study will in many cases be subjected to more than one type of analysis.

REDCap®
Study data were collected and managed using REDCap electronic data capture tools hosted within ORCHESTRA [ 34,35 ]. REDCap (Research Electronic Data Capture) is a secure, web-based software platform designed to support data capture for research studies, providing 1) an intuitive interface for validated data capture; 2) audit trails for tracking data manipulation and export procedures; 3) automated export procedures for seamless data downloads to common statistical packages; and 4) procedures for data integration and interoperability with external sources. The REDCap® project was developed to provide scientific research teams intuitive and reusable tools for collecting, storing and disseminating project-specific clinical and translational research data. The following key features were identified as critical components for supporting research projects: 1) collaborative access to data across academic departments and institutions; 2) user authentication and role-based security; 3) intuitive electronic case report forms (CRFs); 4) real-time data validation, integrity checks and other mechanisms for ensuring data quality (e.g. double-data entry options); 5) data attribution and audit capabilities; 6) protocol document storage and sharing; 7) central data storage and backups; 8) data export functions for common statistical packages; and 9) data import functions to facilitate bulk import of data from other systems. Given the quantity and diversity of research projects within academic medical centers, we determined two additional critical features for the REDCap® project: 10) a software generation cycle sufficiently fast to accommodate multiple concurrent projects without the need for custom project-specific programming; and 11) a model capable of meeting disparate data collection needs of projects across a wide array of scientific disciplines.
REDCap® accomplishes key functions through use of a single study metadata table referenced by presentation-level operational modules. Based on this abstracted programming model, studies are developed in an efficient manner with little resource investment beyond the creation of a single data dictionary. Supplementary Figure 1 shows a section of the REDCap 'read-only' version of the much larger Data Dictionary for the project.
The concept of metadata-driven application development is well established, so early in the project it was agreed that the critical factor for success would lie in creating a simple workflow methodology allowing research teams to autonomously develop study-related metadata in an efficient manner [ 36,37 ].
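The metadata-driven model described above can be sketched in a few lines: a single data dictionary defines the fields, and generic code validates any record against it without study-specific programming. The column names, field types and example rows below are simplified assumptions made for illustration. They do not reproduce the actual REDCap data dictionary format or the ORCHESTRA dictionary.

```python
import csv
import io

# Hypothetical, simplified data dictionary; columns and rows are illustrative assumptions.
DATA_DICTIONARY_CSV = """field_name,form_name,field_type,field_label,choices,min,max
age,demographics,integer,Age in years,,0,120
sex,demographics,radio,Sex at birth,1=Female|2=Male,,
temperature,vitals,number,Body temperature (Cel),,30,45
"""


def load_dictionary(text):
    """Parse the CSV text into a mapping of field name -> metadata row."""
    return {row["field_name"]: row for row in csv.DictReader(io.StringIO(text))}


def parse_choices(raw):
    """Turn '1=Female|2=Male' into {'1': 'Female', '2': 'Male'}."""
    return dict(item.split("=", 1) for item in raw.split("|")) if raw else {}


def validate(record, dictionary):
    """Check one record against the dictionary and return a list of error messages."""
    errors = []
    for name, value in record.items():
        meta = dictionary.get(name)
        if meta is None:
            errors.append(f"{name}: not defined in data dictionary")
            continue
        if meta["field_type"] == "radio":
            if str(value) not in parse_choices(meta["choices"]):
                errors.append(f"{name}: {value!r} is not an allowed choice")
        elif meta["field_type"] in ("integer", "number"):
            try:
                number = float(value)
            except (TypeError, ValueError):
                errors.append(f"{name}: {value!r} is not numeric")
                continue
            if meta["field_type"] == "integer" and not number.is_integer():
                errors.append(f"{name}: {value!r} is not an integer")
                continue
            if meta["min"] and number < float(meta["min"]):
                errors.append(f"{name}: below minimum {meta['min']}")
            if meta["max"] and number > float(meta["max"]):
                errors.append(f"{name}: above maximum {meta['max']}")
    return errors


DICTIONARY = load_dictionary(DATA_DICTIONARY_CSV)
```

In this sketch, adding a variable to a study only requires a new row in the dictionary. The validation function itself stays unchanged, which is the property that lets research teams develop study metadata autonomously.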