Introduction

Numerous epidemiological studies have examined risk factors of chronic diseases such as cancer, cardiovascular diseases, and diabetes, which represent a high burden of disease globally1,2. In Germany, where these three disease groups account for 44% of the total disability-adjusted life years (19.5%, 18.8%, and 5.8% for cancer, cardiovascular diseases, and diabetes, respectively, in 2019)3, several population-based observational studies are dedicated to the study of risk factors of chronic diseases. The potential of research data derived from these studies to improve our understanding of health and disease can be substantially enhanced by following the FAIR principles (Findability, Accessibility, Interoperability, Reusability), which optimize interpretability and reproducibility of results as well as reuse of data4. While it is increasingly accepted that all research data should follow the FAIR principles, implementation is not ubiquitous and interoperability across data sources is still limited5,6. In Germany, the consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health, https://www.nfdi4health.de/en/), with the participation of 26 observational studies, seeks to increase the value of research in epidemiology, public health, and clinical trial-based medicine by making high-quality personal health research data from Germany internationally accessible according to the FAIR principles7.

Research data in population-based observational studies usually refers to the sum of data that characterize each participant in that study, or parts thereof (i.e., personal data, unless anonymized). However, an important step in achieving FAIR data is the availability of rich metadata describing these research data8. The assessment of chronic diseases in observational studies is challenging, and each disease can be assessed in many different ways. Thus, the methods used to assess diseases differ between studies, depending on study aims, design, study population, and resources available9. In the case of chronic diseases, metadata include, among others, information on whether the outcome is prevalent or incident, on disease subtypes assessed and classification system(s) used, how data were collected (i.e., questionnaires, interviews, study examinations, administrative databases, or through a combination of sources), and whether and how self-reported diseases were verified (i.e., confirmed on a case basis) or validated (i.e., plausibility of prevalence or incidence observed in a study population evaluated based on a reference population)10.

Differences in the assessment methods used have implications for how the data can be reused and how they should be interpreted. Knowledge of how data are collected is important not only for the scientific community to gain awareness of contextual constraints impacting interpretation, but also to enable reuse of data, for example in meta-analyses or pooled analyses. However, details on chronic disease assessment methods – hereafter referred to as chronic disease outcome metadata (CDOM) – are often difficult to find. Therefore, there is a need for specific reporting guidelines using a common metadata schema that captures the wide range of chronic disease assessment and ascertainment methods used in epidemiological studies.

This study proposes a schema for CDOM in epidemiological studies and applies it to population-based observational studies in Germany, describing the current status of CDOM public availability and findability. Additionally, it assesses perceived consistency of CDOM with FAIR principles within identified studies.

Results

Summary of included studies

Sixteen observational studies participating in NFDI4Health collected chronic disease data (i.e., data on cardiovascular diseases, cancer, and/or type 2 diabetes mellitus). Of these, most had a cohort design (n = 13, with sample sizes ranging from 1,779 to ~205,000 participants), one was a cross-sectional study (7,124 participants), one had a mixed design with both cross-sectional and cohort characteristics (8,152 participants), and one comprised multiple cross-sectional surveys (four samples ranging from 19,294 to 24,016 participants). An overview of the included studies is shown in Table 1. CDOM for these studies were searched according to the search strategy and criteria described in the Methods section and in Supplementary Table 1.

Table 1 Overview of included German population-based observational studies (n = 16).

Publication of chronic disease outcome metadata: evaluation based on proposed schema

A metadata schema with all relevant CDOM was developed within NFDI4Health (see Table 2). CDOM were evaluated in each study per source and outcome and considered to be complete when information about all CDOM fields was available (metadata sources and the metadata completeness evaluation scheme are described in Tables 3 and 4, respectively). For this, an in-depth search within the identified sources of metadata was performed and the identified metadata were recorded in detail by source and metadata field in Supplementary Table 2. This information was then used to summarize our findings in Tables 5 and 6, described in the following results subsections. More details are provided in the Methods section. Out of the sixteen included studies, publicly available CDOM were complete for all outcomes for six studies (CARLA, GEDA, NAKO, KORA, lidA, SHIP/SHIP-Trend), complete for some outcomes for four studies (EPIC-Heidelberg, EPIC-Potsdam, GHS, IDEFICS/I.Family), and partial for the remaining six studies. Table 5 shows the overall status of publicly available CDOM in each study.

Table 2 Chronic disease outcome metadata schema.
Table 3 Sources of published outcome metadata.
Table 4 Evaluation scheme of studies’ completeness of publicly accessible chronic disease outcome metadata.
Table 5 Published chronic disease outcome metadata in the included studies (n = 16).
Table 6 Completeness of publicly available chronic disease outcome metadata (by source and overall).

Public availability by source

Overall, scientific publications were the most frequent source of publicly available CDOM (n = 16), followed by study websites (n = 15; excluding links and references), study/trial registry databases (n = 11; excluding links and references), and data documentation (n = 10) (Fig. 1). Among the six studies with complete publicly available outcome metadata, the main sources of CDOM were scientific publications alone (GEDA, NAKO, SHIP/SHIP-Trend) or scientific publications complemented by data documentation (CARLA, KORA, lidA) (Table 5). Eleven studies had a (meta-)data access infrastructure. Of these, seven offered access without registration, three required registration (by signing up or by sending a request by email), and one required credentials but offered no registration option (Fig. 2).

Fig. 1

Proportion and number of included studies with publicly available chronic disease outcome metadata (CDOM), by source (total n = 16). Only direct sources of metadata (i.e., links and references not included).

Fig. 2

Proportion and number of included studies with available and accessible (meta-)data infrastructure (total n = 16). Study-specific internet-accessible portals (through which data documents are often accessible) were considered as (meta-)data infrastructure. Available if the existence of a (meta-)data infrastructure was identified through the study website and/or data document search; accessible if contents could be viewed without registration or after registration. (a) Credentials needed, no registration option. (b) Corresponding to (meta-)data access infrastructure not available.

Public availability by metadata field

All publicly available CDOM found were recorded in detail in Supplementary Table 2. Table 6 summarizes this information and rates completeness of CDOM to examine what kind of outcome metadata are more often publicly available or more often missing. A score was applied within each study to evaluate public availability of each metadata field (see evaluation scheme in Table 4: “3”, complete for all outcomes; “2”, complete for some outcomes; “1”, partial; “0”, missing/no metadata). Based on these scores, ICD-10 code was the field most often missing, with a median score of 2. All other metadata fields were more often publicly available, with a median score of 3. Similarly, Fig. 3 reflects the lower availability of information on whether codes of the International Classification of Diseases, Tenth Revision (ICD-10) were used, followed by the fields self-report: reference period and self-report: verification/validation. Conversely, data on prevalent/incident outcome and primary/secondary outcome show the highest proportion of completeness.

Fig. 3

Proportion and number of included studies with complete, partial, and missing publicly available chronic disease outcome metadata, by metadata field (total n = 16). na, does not apply (not part of study design). Metadata considered complete if all aspects of the chronic disease outcome metadata schema (Table 2) are covered for all examined cardiovascular diseases, type 2 diabetes, and cancers.

Perceived consistency with FAIR principles and perceived barriers

Principal investigators from ten of the sixteen included studies (one principal investigator per study; N = 10 principal investigators) completed a survey including the CDOM-adapted checklist of the criteria to meet the FAIR guiding principles (Supplementary Table 3) and shared their perceived main barriers to consistency with the FAIR principles. Principal investigators were prompted to answer yes or no to each item. Perceived consistency of CDOM with FAIR principles ranged from 40% to 70% for findability criteria, from 40% to 60% for accessibility criteria (items A1. and A2.), and from 50% to 70% for interoperability criteria, and was 60% for reusability criteria (item R1.) (Fig. 4). The main perceived barrier was limited human resources (80% very important barrier, 10% moderately important barrier, 10% not an important barrier), followed by limited financial resources (60% very important barrier, 30% moderately important barrier, 10% not an important barrier) (Fig. 5). Other barriers mentioned by principal investigators were related to unavailability of adequate harmonization tools, organizational barriers, legal barriers, and limited data quality (Supplementary Table 4).

Fig. 4

Principal investigators’ (n = 10) perceived consistency of CDOM in their study with FAIR principles. (a) Applies only if yes to A1. (b) Applies only if yes to R1.

Fig. 5

Principal investigators’ (n = 10) perceived barriers to achieve (meta-)data consistency with FAIR principles.

Discussion

Based on the proposed CDOM schema, our findings reveal that CDOM from German observational studies are often not fully described in publicly available metadata sources. Among the sixteen included observational studies, six had complete publicly available CDOM. The main source of publicly available CDOM was scientific publications, and the most frequently missing metadata were whether ICD-10 codes were available, followed by the reference period for the questions on self-reported outcomes and whether and how self-reported outcomes were verified and/or validated.

While CDOM seem to be only partly publicly available, the majority of studies had a (meta-)data access infrastructure accessible without registration, or registration was possible by requesting access. However, about a third of the included studies did not have such infrastructure or it was not publicly accessible. In such cases, data reuse is mostly limited to scientists within specific networks or to those who are already familiar with the studies in question. Rich CDOM that can be found by external parties would substantially assist the scientific community by increasing data interpretability and reusability and thus the value of data and the range of scientific questions on chronic disease risk and progression that could be addressed within and across existing observational studies. Having access to CDOM before the submission of an analysis request would also facilitate study selection and clarify harmonization needs (e.g., for pooled analysis of multiple studies)11. For example, knowing whether two studies used different disease classification systems could help the planning of the data harmonization process.

It may not be surprising that our findings suggest the richest source of publicly available CDOM is scientific publications, but it highlights a problem for findability. Publications are the traditional way in which scientists make research results publicly available. However, these publications usually focus on addressing scientific research questions, rather than on publishing metadata. Although some epidemiological journals also allow the publication of papers on study or cohort profiles12, the focus is usually on study design aspects and instruments, rather than on metadata. As a result, metadata are spread across separate documents, which often provide only the information necessary to make sense of the research question(s) addressed in the publication. While finding scientific publications – a time-consuming task that is dependent on search engine and search strategy – may be difficult, finding the metadata within the scientific publications poses another hurdle, as they are not indexed and searchable within the documents5. Ideally, CDOM (together with all study metadata) should be centralized (e.g., in a metadata catalogue on the study’s website) and accessible, and should be linked to publications, data repositories, and other sources of study metadata. By reducing the number of sources repeating the same information and instead linking to a central metadata catalogue or repository, there is a lower risk of inconsistencies (e.g., updating metadata in the primary source but forgetting secondary sources).

There are various reasons why CDOM are often not all publicly available and consistent with FAIR principles. While the concept of FAIR (meta-)data is fairly new8, the observational studies included in our evaluation date back as far as the 1980s, and implementing post hoc classifications of data elements according to some standard is difficult and would require considerable resources (i.e., financial, human, and technical) that may not be available. This is in line with our observation that most principal investigators in our survey indicated that limited human resources were the main perceived barrier. Despite these difficulties, there is interest from both more recent and longer-running German observational studies in improving consistency with the FAIR (meta-)data principles, reflected by their participation in consortia such as NFDI4Health7. As the efforts of the included studies to improve adherence to the FAIR principles are ongoing, the findings in this paper reflect the status of CDOM public availability at the time of publication.

Another obstacle for FAIR CDOM is the lack of guidelines or standards for CDOM reporting from observational studies. Our proposed CDOM schema outlines the relevant contextual information that should be included in CDOM reporting to improve interpretability and interoperability. Additionally, it is not clear how FAIRness of CDOM in observational studies should be evaluated. While other evaluation tools based on the FAIR guiding principles have been applied in fields such as physics and education13,14,15,16, we considered the checklist we implemented – the FAIR guiding principles8 applied to CDOM – to be the most appropriate approach to evaluate the principal investigators’ perception of CDOM FAIRness in their respective observational studies. For this purpose, the breadth of the FAIR guiding principles allows the principal investigators to consider different implementations of the FAIR principles in their studies. However, comments submitted with the surveys showed that some respondents still found some items difficult to evaluate in the context of CDOM in their study. Other scientists have also found the interpretation challenging and state that the principles should serve as guidelines rather than as standards5. Existing standards and classifications such as ICD-1017, SNOMED CT18, and MIABIS19 could be used to establish a specific vocabulary to report CDOM guided by the FAIR principles. As these standards and classifications were developed for use in a clinical or health care setting (biomedical research in the case of MIABIS) – although ICD-10 is frequently implemented in epidemiological research – they cover only some CDOM fields (e.g., disease classification in ICD-10 and SNOMED CT; some disease domains and reference periods asked for self-reported outcomes in SNOMED CT; study examinations in SNOMED CT and MIABIS). However, different standards and classifications may be used to complement each other and improve CDOM interoperability, for example, by using the Unified Medical Language System (UMLS)20, which supports the use of multiple vocabularies. To achieve a standard approach, agreements on which standards to use for which metadata fields and on a standard CDOM-reporting template are warranted. Maelstrom Research (https://www.maelstrom-research.org/), which was developed to facilitate epidemiological research collaborations, developed a catalogue21 displaying some of the relevant metadata fields for chronic diseases (including ICD-10 disease group classifications); however, it remains mostly at the level of study metadata and lacks outcome-specific metadata. The CDOM schema proposed here offers a blueprint for a more comprehensive metadata model. The resulting comparable contextual information across studies could then be integrated into a common framework such as the ISA framework in metadata repositories (improved interoperability)22.

Our findings should be interpreted in consideration of the study’s strengths and limitations. While there are no guidelines for CDOM reporting in observational studies, we developed a metadata schema for chronic diseases within a large consortium with many participating large German observational studies. We also identified the status of public availability of CDOM among German observational studies, contributing knowledge that can be used to target gaps in CDOM findability and accessibility and improve external collaborations in the scientific community. One limitation of our study is that public availability was conditional on finding the CDOM based on our search criteria; however, the risk of missing important publicly available CDOM was mitigated by requesting feedback from principal investigators about additional internet-available CDOM. Finally, we cannot generalize about the current status of publicly available CDOM across all observational studies, as all the studies included were from Germany and had already expressed an interest in FAIR data by joining the NFDI4Health consortium; however, most large observational studies conducted in Germany were included.

In summary, CDOM from many population-based observational studies in Germany are not completely publicly available. Those CDOM that are available stem mostly from scientific publications. As studies do not rely on single papers to publish CDOM, findability of these data is limited. There is a need to shift publicly available CDOM from scientific publications to publicly accessible platforms such as easily findable (e.g., visible on the study’s website and linked elsewhere) metadata catalogues (indexed and searchable), where centralization would support data management efforts and completeness of information. This shift requires the availability of the necessary resources for running these platforms, gathering the necessary information, and continuously managing this information to keep it up to date at the study level. Furthermore, guidelines or a common approach for how to achieve FAIR CDOM and how to make them publicly available are warranted; for example, a standardised approach to providing data dictionaries and to how CDOM are displayed within them. Our findings provide valuable information for the German scientific community and may help justify and stimulate efforts to make CDOM fully available in consolidated metadata platforms.

Methods

Study selection

This study was conducted within the framework of NFDI4Health. In 2018, the German Federal Ministry of Education and Research (BMBF) and state governments commissioned the German Research Foundation (DFG) to establish a National Research Data Infrastructure (NFDI); in 2019, the DFG launched a first call to form consortia that aim to improve management, accessibility, storage, and sustainability of scientific and research data in all areas of science23. NFDI4Health was one of the consortia that successfully applied to the first DFG call and, starting in 2020, was selected to be funded for 5 years24. A total of 15 observational studies participated in the funding application for NFDI4Health (i.e., co-applicant studies). NFDI4Health initiated several community workshops to invite potential partners and users from the scientific community to participate in the consortium. Based on this activity, 11 additional observational studies have submitted letters of commitment to participate in the consortium (i.e., participating studies)24.

For the current analysis, we selected studies meeting the following inclusion criteria: 1) observational, population-based co-applicant or participating study in NFDI4Health; and 2) collecting information on cardiovascular diseases, cancer, and/or type 2 diabetes mellitus.

Chronic disease outcome metadata (CDOM) schema

We developed a list of relevant contextual information about chronic disease outcomes for interpretation and reuse of data pertaining to the collection of chronic disease assessment-related information from observational, population-based studies participating in NFDI4Health. A final list of CDOM – a metadata schema specific to chronic disease ascertainment in epidemiological studies (Table 2) – includes general information about the outcome collected (i.e., prevalent or incident case, specific disease name/classification code, primary or secondary outcome) and the assessment method or data source (i.e., from self-report, from study examinations, from administrative databases), with additional levels of detail pertaining to the assessment method. Data pertaining to these metadata fields were searched for each of the eligible studies.
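As a minimal sketch, the metadata fields named above could be held in a machine-readable record so that unfilled fields are easy to spot. The class names, attribute names, and the `missing_fields` helper below are illustrative assumptions; they do not reproduce the exact field labels of Table 2.

```python
from dataclasses import dataclass, field, asdict
from typing import List, Optional


@dataclass
class AssessmentMethod:
    # Hypothetical attribute names; Table 2 defines the actual schema.
    source: str  # e.g. "self-report", "study examination", "administrative database"
    reference_period: Optional[str] = None  # relevant for self-report only
    verified_or_validated: Optional[bool] = None  # relevant for self-report only


@dataclass
class OutcomeMetadata:
    disease_name: str
    case_type: str  # "prevalent" or "incident"
    outcome_role: str  # "primary" or "secondary"
    icd10_code: Optional[str] = None
    methods: List[AssessmentMethod] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of top-level fields that have not been filled in."""
        record = asdict(self)
        return [name for name, value in record.items() if value in (None, "", [])]


# Example record with no ICD-10 code documented.
example = OutcomeMetadata(
    disease_name="type 2 diabetes mellitus",
    case_type="incident",
    outcome_role="primary",
    methods=[AssessmentMethod(source="self-report")],
)
print(example.missing_fields())
```

A record of this kind only checks top-level fields; the additional levels of detail per assessment method would need their own checks.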

Sources of chronic disease outcome metadata (CDOM)

Based on an adaptation of previously defined sources contributing to (meta-)data discoverability25, the following sources were considered to provide CDOM from epidemiological studies: 1) scientific publications, 2) study websites, 3) study registry databases, and 4) data documents. Table 3 lists these sources in detail. Completeness of published CDOM for all eligible studies was evaluated based on screening of these four metadata sources. Databases used for searching scientific publications were PubMed and Google Scholar, without language restriction. All other sources of metadata were searched using Google including the following predefined keywords: study name, German city/region of the study, and other metadata-source describing keywords. Study/trial registries were searched additionally within websites of the following study registry databases: DRKS (German Clinical Trials Register, https://www.drks.de/), clinicaltrials.gov (https://clinicaltrials.gov/), ISRCTN (International Standard Randomised Controlled Trial Number, https://www.isrctn.com/), Maelstrom Research (https://www.maelstrom-research.org/), re3data.org (https://www.re3data.org/), ICTRP (International Clinical Trials Registry Platform, https://trialsearch.who.int/), euCanSHare (https://eucanshare.bsc.es/platform/), MDM Portal (https://medical-data-models.org/), and German Central Health Study Hub NFDI4Health (https://csh.nfdi4health.de/). Additionally, data documents were searched through the studies’ (meta-)data access infrastructure, if available. Different searches were carried out using terms in English and in German language between January and March 2022. The searches were repeated between August and September 2022 to include newly published CDOM. More details about the search criteria are described in Supplementary Table 1.

Evaluation of public availability of chronic disease outcome metadata (CDOM)

Public availability of CDOM was evaluated based only on publicly available information from the four aforementioned sources and was defined in terms of findability and accessibility. In a first step, metadata for all included studies were searched by screening all the predefined metadata sources according to the search criteria detailed in Supplementary Table 1. To be publicly accessible, CDOM had to be both findable and freely accessible on the internet. Availability and accessibility of a (meta-)data access infrastructure were evaluated separately, for which we considered only internet-accessible portals. The existence of such portals was explored within the study website and the search for data documents. After recording all the identified publicly available CDOM by study, principal investigators from all included studies were invited to provide feedback on any missed publicly available CDOM. Any additional CDOM indicated by the principal investigators were added to the results as long as they were available online.

Evaluation of publicly available CDOM by study

Public availability of CDOM was evaluated overall for each study, and was considered to be complete if a detailed list of all the outcomes of interest that were collected in a study was publicly available and data on all the metadata fields listed in Table 2 was available for each corresponding chronic disease outcome. If data were complete for some outcomes only, published CDOM was considered to be complete for some outcomes. If only some of the outcome metadata fields could be filled for one or more chronic disease outcomes, published CDOM was considered to be partial. If no metadata fields could be filled based on publicly available information, published CDOM was considered to be missing. Table 4 details this evaluation scheme.
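The four categories described above can be sketched as a small decision function. This is an illustration under stated assumptions, not the evaluation procedure actually used: the function name and input layout are hypothetical, the input is assumed to already list every outcome of interest for a study, and fields that do not apply to a study design are assumed to have been left out of the input.

```python
from typing import Dict


def classify_study(outcomes: Dict[str, Dict[str, bool]]) -> str:
    """Classify a study's published CDOM.

    `outcomes` maps each outcome name to a mapping of
    metadata field name -> whether that field could be filled
    from publicly available information.
    """
    # An outcome counts as complete only if it has fields and all are filled.
    complete = [bool(fields) and all(fields.values()) for fields in outcomes.values()]
    any_filled = any(any(fields.values()) for fields in outcomes.values())

    if complete and all(complete):
        return "complete for all outcomes"
    if any(complete):
        return "complete for some outcomes"
    if any_filled:
        return "partial"
    return "missing"


# Hypothetical study: one outcome fully described, one only partly.
print(classify_study({
    "type 2 diabetes": {"prevalent_or_incident": True, "icd10_code": True},
    "myocardial infarction": {"prevalent_or_incident": True, "icd10_code": False},
}))
```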

Evaluation of publicly available CDOM by metadata source and by metadata field

Publicly available CDOM were also recorded in more detail, distinguishing what kind of metadata were found in what source. Based on this information, we calculated a score summarizing public availability of CDOM across all included studies for each metadata field to examine what kind of outcome metadata are more often publicly available or more often missing. Separately for each study and source of metadata, the following rating scheme was used to evaluate each metadata field: “3”, complete for all outcomes; “2”, complete for some outcomes; “1”, partial; “0”, missing/no metadata (see Table 4). A score of 1 instead of 2 was given when some details about the metadata field were missing, e.g., if there was an indication that a study collected both prevalent and incident outcome data, but only a list of the prevalent outcomes was found (i.e., information about this metadata field was partial). This rating was applied to each outcome metadata field found in each metadata source. As the metadata sources study website and study/trial registries may serve both as direct sources (i.e., embedded metadata) and indirect sources (i.e., links and references), we evaluated them both as direct sources only and as direct plus indirect sources of metadata. For the overall rating, the highest metadata field score across metadata sources within each study was taken as the overall rating for that metadata field, which was then used to compute the median score per metadata field (range 0–3). For instance, if a study obtained a “3” for the metadata field “prevalent or incident outcome” based on data documents, but obtained a “2” based on the other metadata sources, the overall score for “prevalent or incident outcome” would be the highest score, i.e., “3”, and it would be considered as complete for all outcomes.
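The two aggregation steps described above (highest score across sources within a study, then the median across studies) can be sketched as follows. The function names, the nested-dictionary layout, and the study labels and scores in the example are hypothetical and chosen only for illustration; they are not data from the evaluated studies.

```python
from statistics import median
from typing import Dict

# Layout assumed here: scores[source][field] -> rating from 0 to 3.
SourceScores = Dict[str, Dict[str, int]]


def overall_field_scores(by_source: SourceScores) -> Dict[str, int]:
    """Take the highest rating per metadata field across all sources of one study."""
    overall: Dict[str, int] = {}
    for source_scores in by_source.values():
        for field_name, score in source_scores.items():
            overall[field_name] = max(score, overall.get(field_name, 0))
    return overall


def median_by_field(studies: Dict[str, SourceScores]) -> Dict[str, float]:
    """Compute the median of the overall ratings per metadata field across studies."""
    collected: Dict[str, list] = {}
    for by_source in studies.values():
        for field_name, score in overall_field_scores(by_source).items():
            collected.setdefault(field_name, []).append(score)
    return {name: median(values) for name, values in collected.items()}


# Hypothetical ratings for three made-up studies.
studies = {
    "Study A": {
        "data documents": {"prevalent or incident outcome": 3, "ICD-10 code": 1},
        "scientific publications": {"prevalent or incident outcome": 2, "ICD-10 code": 2},
    },
    "Study B": {
        "scientific publications": {"prevalent or incident outcome": 3, "ICD-10 code": 0},
    },
    "Study C": {
        "study website": {"prevalent or incident outcome": 2, "ICD-10 code": 3},
    },
}
print(overall_field_scores(studies["Study A"]))
print(median_by_field(studies))
```

Taking the maximum across sources means a field counts as complete if any single source documents it fully, which matches the worked example in the text but does not reward consistency between sources.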

Perceived consistency with FAIR principles by the Principal Investigators

Perceived consistency of CDOM with FAIR principles by the principal investigators was assessed based on the previously published criteria for each of the FAIR guiding principles8 with regard to CDOM (see Supplementary Table 3). These criteria were circulated as a checklist to the principal investigators of each of the included studies (one principal investigator representing one study), who returned the completed templates for their respective study (see Supplementary Fig. 1). For each criterion, principal investigators had the option of writing a comment, e.g., to express lack of clarity or to provide a more specific answer. Additionally, respondents were asked to provide feedback on their perceived barriers to achieving FAIR (meta-)data for their respective study. The following potential barriers were rated as “very important barrier”, “moderately important barrier” or “not an important barrier”: limited financial resources, limited human resources, limited technical resources, limited incentives. Additional barriers could be entered as free text and were rated in the same way.
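The barrier ratings can be summarized as percentages per rating category. The sketch below assumes one rating per respondent for a given barrier; the function name is hypothetical, and the input list is constructed to mirror the split reported for limited human resources (80%, 10%, and 10% of ten respondents), since individual responses are not part of this paper.

```python
from collections import Counter
from typing import Dict, List

RATINGS = (
    "very important barrier",
    "moderately important barrier",
    "not an important barrier",
)


def rating_shares(responses: List[str]) -> Dict[str, float]:
    """Return the percentage of respondents choosing each rating category."""
    if not responses:
        raise ValueError("at least one response is required")
    counts = Counter(responses)
    return {rating: 100 * counts[rating] / len(responses) for rating in RATINGS}


# Constructed input matching the reported split for limited human resources (n = 10).
human_resources = [RATINGS[0]] * 8 + [RATINGS[1]] + [RATINGS[2]]
print(rating_shares(human_resources))
```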