Chronic disease outcome metadata from German observational studies – public availability and FAIR principles

Metadata from epidemiological studies, including chronic disease outcome metadata (CDOM), must be findable for the underlying data to be interpretable and reusable. We propose a comprehensive metadata schema and use it to assess public availability and findability of CDOM from German population-based observational studies participating in the consortium National Research Data Infrastructure for Personal Health Data (NFDI4Health). Additionally, principal investigators from the included studies completed a checklist evaluating consistency with FAIR principles (Findability, Accessibility, Interoperability, Reusability) within their studies. Overall, six of sixteen studies had complete publicly available CDOM. The most frequent CDOM source was scientific publications, and the most frequently missing metadata field was the availability of codes of the International Classification of Diseases, Tenth Revision (ICD-10). Principal investigators' main perceived barriers to consistency with FAIR principles were limited human and financial resources. Our results reveal that CDOM from German population-based studies have incomplete availability and limited findability. There is a need to make CDOM publicly available in searchable platforms or metadata catalogues to improve their FAIRness, which requires human and financial resources.


Introduction
Numerous epidemiological studies have been examining risk factors of chronic diseases such as cancer, cardiovascular diseases, and diabetes, which represent a high burden of disease globally 1,2. In Germany, where these three disease groups account for 44% of the total disability-adjusted life years (19.5%, 18.8%, 5.8% for cancer, cardiovascular diseases and diabetes, respectively, in 2019) 3, several population-based observational studies are dedicated to the study of risk factors of chronic diseases. The potential of research data derived by these studies to improve our understanding of health and disease can be substantially enhanced by following the FAIR prin-
Publication of chronic disease outcome metadata: evaluation based on proposed schema. A metadata schema with all relevant CDOM was developed within NFDI4Health (see Table 2). CDOM were evaluated in each study per source and outcome and considered to be complete when information about all CDOM fields was available (metadata sources and metadata completeness evaluation scheme described in Tables 3, 4, respectively). For this, an in-depth search within the identified sources of metadata was performed and the identified metadata was recorded in detail by source and metadata field in Supplementary Table 2. This information was then used to summarize our findings in Tables 5, 6, described in the following results subsections. More details are provided in the methods section. Out of the sixteen included studies, publicly available CDOM were complete for all outcomes for 6 studies (CARLA, GEDA, NAKO, KORA, lidA, SHIP/SHIP-Trend), complete for some outcomes for 4 studies (EPIC-Heidelberg, EPIC-Potsdam, GHS, IDEFICS/I.Family), and partial for the remaining 6 studies. Table 5 shows the overall status of publicly available CDOM in each study.
Public availability by source. Overall, scientific publications were the most frequent source of publicly available CDOM (n = 16), followed by study websites (n = 15; excluding links and references), study/trial registry databases (n = 11; excluding links and references), and data documentation (n = 10) (Fig. 1). Among the six studies with complete publicly available outcome metadata, the main sources of CDOM were scientific publications (GEDA, NAKO, SHIP/SHIP-Trend) and complementary information obtained both through scientific publications and data documentation (CARLA, KORA, lidA) (Table 5). Eleven studies had a (meta-)data access infrastructure. Of these, seven offered access without registration, three allowed users to register by signing up or by sending a request by email, and one had no registration option (Fig. 2).
Public availability by metadata field. All publicly available CDOM found were recorded in detail in Supplementary Table 2. Table 6 summarizes this information and rates completeness of CDOM to examine what kind of outcome metadata are more often publicly available or more often missing. A score was applied within each study to evaluate public availability of each metadata field (see evaluation scheme in Table 4: "3", complete for all outcomes; "2", complete for some outcomes; "1", partial; "0", missing/no metadata). Based on these scores, ICD-10 code was the field most often missing, with a median score of 2. All other metadata fields were more often publicly available, with a median score of 3. Similarly, Fig. 3 reflects the lower availability of information on whether codes of the International Classification of Diseases, Tenth Revision (ICD-10) were used, followed by the fields self-report: reference period and self-report: verification/validation. Conversely, data on prevalent/incident outcome and primary/secondary outcome show the highest proportion of completeness.
Perceived consistency with FAIR principles and perceived barriers. Principal investigators from ten out of the sixteen included studies (one principal investigator per study; N = 10 principal investigators) filled out a survey including the CDOM-adapted checklist of the criteria to meet the FAIR guiding principles (Supplementary Table 3) and shared their perceived main barriers to consistency with the FAIR principles. Principal investigators were prompted to answer yes or no to every item. Perceived consistency of CDOM with FAIR principles ranged from 40% to 70% for findability criteria, from 40% to 60% for accessibility criteria (items A1. and A2.), from 50% to 70% for interoperability criteria, and was 60% for reusability criteria (item R1.) (Fig. 4). The main perceived barrier was limited human resources (80% very important barrier, 10% moderately important barrier, 10% not an important barrier), followed by limited financial resources (60% very important barrier, 30% moderately important barrier, 10% not an important barrier) (Fig. 5). Other barriers mentioned by principal investigators were related to unavailability of adequate harmonization tools, organizational barriers, legal barriers, and limited data quality (Supplementary Table 4).

Discussion
Based on the proposed CDOM schema, our findings reveal that CDOM from German observational studies are often not fully described in publicly available metadata sources. Among the sixteen included observational studies, six studies had complete publicly available CDOM. The main source of publicly available CDOM was scientific publications, and the most frequently missing metadata were whether ICD-10 codes were available, followed by the reference period for the questions from self-reported outcomes and whether and how self-reported outcomes were verified and/or validated.
While CDOM seem to be only partly publicly available, the majority of studies had a (meta-)data access infrastructure accessible without registration, or registration was possible by requesting access. However, about a third of the included studies did not have such infrastructure or it was not publicly accessible. In such cases, data reuse is mostly limited to scientists within specific networks or to those who are already familiar with the studies in question. Rich CDOM that can be found by external parties would substantially assist the scientific community by increasing data interpretability and reusability and thus the value of data and the range of scientific questions on chronic disease risk and progression that could be addressed within and across existing observational studies. Having access to CDOM before the submission of an analysis request would also facilitate study selection and clarify harmonization needs (e.g., for pooled analysis of multiple studies) 11. For example, knowing whether two studies used different disease classification systems could help the planning of the data harmonization process.

Table 3. Sources of published outcome metadata a.

Study websites
Descriptions of the study and procedures.

Data documents
Study reports, data dictionaries, lists of variables, questionnaires, etc. Data documents are often available through (meta-)data access infrastructure (i.e., web portals).

a Adapted from previously defined sources contributing to (meta-)data discoverability (McMahon 2017, https://discovery.ucl.ac.uk/id/eprint/10025205) 25.

It may not be surprising that our findings suggest the richest source of publicly available CDOM is scientific publications, but it highlights a problem for findability: publications are the traditional way in which scientists make research results publicly available. However, these publications usually focus on addressing scientific research questions, rather than on publishing metadata. Although some epidemiological journals also allow the publication of papers on study or cohort profiles 12, the focus is usually on study design aspects and instruments, rather than on metadata. As a result, metadata are spread across separate documents, often only addressing the necessary information to make sense of the research question(s) addressed in the publication. While finding scientific publications (a time-consuming task that depends on search engine and search strategy) may be difficult, finding the metadata within the scientific publications poses another hurdle, as they are not indexed and searchable within the documents 5. Ideally, CDOM (together with all study metadata) should be centralized (e.g., metadata catalogue on the study's website) and accessible, and should be linked to publications, data repositories, and other sources of study metadata. By reducing the number of sources repeating the same information and instead linking to a central metadata catalogue or repository, there is a lower risk of inconsistencies (e.g., updating metadata in the primary source but forgetting secondary sources).
There are various reasons why CDOM are often not all publicly available and consistent with FAIR principles. While the concept of FAIR (meta-)data is fairly new 8, the observational studies included in our evaluation date back as far as the 1980s, and implementing post hoc classifications of data elements to some standard is difficult and would require considerable resources (i.e., financial, human, and technical) that may not be available. This is in line with our observation that most principal investigators in our survey indicated that limited human resources were the main perceived barrier. Despite these difficulties, there is interest from both more recent and longer-existing German observational studies to improve consistency with the FAIR (meta-)data principles, reflected by their participation in consortia such as NFDI4Health 7. As the efforts of the included studies to improve adherence to the FAIR principles are ongoing, the findings in this paper reflect the status of CDOM public availability at the time of publication.
Another obstacle for FAIR CDOM is the lack of guidelines or standards for CDOM reporting from observational studies. Our proposed CDOM schema outlines the relevant contextual information that should be included in CDOM reporting to improve interpretability and interoperability. Additionally, it is not clear how FAIRness of CDOM in observational studies should be evaluated. While other evaluation tools based on the FAIR guiding principles have been applied in other fields such as physics and education 13-16, we considered the checklist we implemented (the FAIR guiding principles 8 applied to CDOM) to be the most appropriate approach to evaluate the principal investigators' perception of CDOM FAIRness in their respective observational studies. For this purpose, the breadth of the FAIR guiding principles can allow the principal investigators to consider different implementations of the FAIR principles in their studies. However, comments submitted with the surveys showed that some respondents still found some items difficult to evaluate in the context of CDOM in their study. Other scientists have also found the interpretation challenging and state that the principles should serve as guidelines rather than as standards 5. Existing standards and classifications such as ICD-10 17, SNOMED CT 18, and MIABIS 19 could be used to establish a specific vocabulary to report CDOM guided by the FAIR principles. As these standards and classifications were developed for use in a clinical or health care setting (biomedical research in the case of MIABIS), although ICD-10 is frequently implemented in epidemiological research, they cover only some CDOM fields (e.g., disease classification in ICD-10 and SNOMED CT, some disease domains and reference periods asked for self-reported outcomes in SNOMED CT, study examinations in SNOMED CT and MIABIS). However, different standards and classifications may be used to complement each other and improve CDOM interoperability, for example, by using the Unified Medical Language System (UMLS) 20, which supports the use of multiple vocabularies. To achieve a standard approach, agreements on what standards to use for which metadata fields and on a standard CDOM-reporting template are warranted. Maelstrom Research (https://www.maelstrom-research.org/), which was developed to facilitate epidemiological research collaborations, developed

Complete metadata for all outcomes
By study: All metadata fields from Table 2 can be obtained for all examined chronic disease outcomes based on publicly accessible metadata.
By metadata field: A complete description of this metadata field was found for all examined chronic disease outcomes.

Complete metadata for some outcomes
By study: All metadata fields from Table 2 can be obtained for some but not all examined chronic disease outcomes based on publicly accessible metadata.
By metadata field: A complete description of this metadata field was found for some but not all examined chronic disease outcomes.

Partial metadata
By study: Some metadata fields from Table 2 can be obtained for all or some of the examined chronic disease outcomes based on publicly accessible metadata.
By metadata field: A partial description of this metadata field was found for all or some of the examined chronic disease outcomes (details are missing).

Metadata missing
By study: None of the metadata fields from Table 2 can be obtained based on publicly accessible metadata.
By metadata field: Nothing describing this metadata field was found for any of the examined chronic disease outcomes.

The CDOM schema proposed here offers a blueprint for a more comprehensive metadata model. The resulting comparable contextual information across studies could then be integrated into a common framework such as the ISA-framework in metadata repositories (improved interoperability) 22.
Our findings should be interpreted in consideration of the study's strengths and limitations. While there are no guidelines for CDOM reporting in observational studies, we developed a metadata schema for chronic diseases within a large consortium with many participating large German observational studies. We also identify the status regarding public availability of CDOM among German observational studies, contributing knowledge that can be used to target gaps in CDOM findability and accessibility and improve external collaborations in the scientific community. Some limitations of our study include that public availability was conditional on finding the CDOM based on our search criteria; however, the risk of missing important publicly available CDOM was mitigated by requesting feedback from principal investigators about additional internet-available CDOM. Finally, we cannot generalize about the current status of publicly available CDOM across all observational studies, as all the studies included were from Germany and had already expressed an interest in FAIR data by joining the NFDI4Health consortium; however, most large observational studies conducted in Germany were included.
In summary, CDOM from many population-based observational studies in Germany are not completely publicly available. Those CDOM that are available stem mostly from scientific publications. As studies do not rely on single papers to publish CDOM, findability of these data is limited. There is a need to shift publicly available CDOM from scientific publications to publicly accessible platforms such as easily findable (e.g., visible on the study's website and linked elsewhere) metadata catalogues (indexed and searchable), where centralization would support data management efforts and completeness of information. This shift requires the availability of the necessary resources for running these platforms, gathering of necessary information, as well as continuous management to keep this information up-to-date on the study level. Furthermore, guidelines or a common approach for how to achieve FAIR CDOM and how to make them publicly available are warranted; for example, a standardised approach to providing data dictionaries and how CDOM are displayed within them. Our findings provide valuable information for the German scientific community and may help justify and drive efforts to make CDOM fully available in consolidated metadata platforms.

Table 6. Completeness of publicly available chronic disease outcome metadata (by source and overall) a. "3", complete metadata for all outcomes; "2", complete metadata for some outcomes; "1", partial metadata for some or all outcomes; "0", missing metadata; "na.", not applicable (due to study design or absence of metadata source). Numbers in parentheses represent metadata availability from both direct sources of metadata (embedded in the corresponding source) and indirect sources of metadata (available through links and references). a Metadata considered complete if all aspects of the chronic disease outcome metadata schema (Table 2) are covered for all examined cardiovascular diseases, type 2 diabetes, and cancers; metadata complete for "all outcomes" refers to the evaluation of these diseases only. b Not considered if consulted for case verification only; considered if may be consulted for or complemented disease ascertainment (e.g., cause of death from death certificates to complement disease incidence data). c Median score (range 0-3) per study, to be interpreted as median public availability of chronic disease outcome metadata in the included studies; e.g., 3 = complete for all outcomes, 2 = complete for some outcomes.

Methods
Study selection. This study was conducted within the framework of NFDI4Health. In 2018, the German Federal Ministry of Education and Research (BMBF) and state governments commissioned the German Research Foundation (DFG) to establish a National Research Data Infrastructure (NFDI); in 2019, the DFG launched a first call to form consortia that aim to improve management, accessibility, storage, and sustainability of scientific and research data in all areas of science 23. NFDI4Health was one of the consortia that successfully applied to the first DFG call, and was selected to be funded for 5 years, starting in 2020 24. A total of 15 observational studies participated in the funding application for NFDI4Health (i.e., co-applicant studies). NFDI4Health initiated several community workshops to invite potential partners and users from the scientific community to participate in the consortium. Based on this activity, 11 additional observational studies submitted letters of commitment to participate in the consortium (i.e., participating studies) 24.
For the current analysis, we selected studies meeting the following inclusion criteria: 1) observational, population-based co-applicant or participating study in NFDI4Health; and 2) collecting information on cardiovascular diseases, cancer, and/or type 2 diabetes mellitus.
Chronic disease outcome metadata (CDOM) schema. We developed a list of relevant contextual information about chronic disease outcomes for interpretation and reuse of data pertaining to the collection of chronic disease assessment-related information from observational, population-based studies participating in NFDI4Health. A final list of CDOM, a metadata schema specific to chronic disease ascertainment in epidemiological studies (Table 2), includes general information about the outcome collected (i.e., prevalent or incident case, specific disease name/classification code, primary or secondary outcome) and the assessment method or data source (i.e., from self-report, from study examinations, from administrative databases), with additional levels of detail pertaining to the assessment method. Data pertaining to these metadata fields were searched for each of the eligible studies.
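As an illustration only, one CDOM record for a single outcome could be represented as follows. This is a minimal sketch: the class name and all field names are hypothetical stand-ins for the fields described above, not the actual labels used in Table 2, and the example values are made up.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class OutcomeMetadata:
    """One CDOM record for one chronic disease outcome (hypothetical field names)."""

    # General information about the outcome collected
    disease_name: Optional[str] = None
    classification_code: Optional[str] = None    # e.g., an ICD-10 code
    prevalent_or_incident: Optional[str] = None  # "prevalent", "incident", or "both"
    primary_or_secondary: Optional[str] = None   # "primary" or "secondary"
    # Assessment method or data source, plus further detail on that method
    assessment_method: Optional[str] = None      # e.g., "self-report"
    assessment_details: Optional[str] = None     # e.g., reference period, validation

    def missing_fields(self) -> List[str]:
        """Return the names of fields for which no public description was found."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Made-up example: an outcome described without a code or assessment details.
record = OutcomeMetadata(
    disease_name="type 2 diabetes mellitus",
    prevalent_or_incident="incident",
    primary_or_secondary="primary",
    assessment_method="self-report",
)
print(record.missing_fields())
```

Listing the empty fields per outcome mirrors the way metadata gaps were recorded by field during the search.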

Sources of chronic disease outcome metadata (CDOM). Based on an adaptation of previously defined sources contributing to (meta-)data discoverability 25, the following sources were considered to provide CDOM from epidemiological studies: 1) scientific publications, 2) study websites, 3) study registry databases, and 4) data documents. Table 3 lists these sources in detail. Completeness of published CDOM for all eligible studies was evaluated based on screening of these four metadata sources. Databases used for searching scientific publications were PubMed and Google Scholar, without language restriction. All other sources of metadata were searched using Google including the following predefined keywords: study name, German city/region of the study, and other metadata-source describing keywords. Study/trial registries were searched additionally within websites of the following study registry databases: DRKS (

Evaluation of public availability of chronic disease outcome metadata (CDOM). Public availability of CDOM was evaluated based only on publicly available information from the four aforementioned sources and was defined in terms of findability and accessibility. In a first step, metadata for all included studies were searched by screening all the predefined metadata sources according to the search criteria detailed in Supplementary Table 1. To be publicly accessible, CDOM had to be both findable and freely accessible on the internet. Availability and accessibility of a (meta-)data access infrastructure was evaluated separately, for which we considered only internet-accessible portals. The existence of such portals was explored within the study website and the search for data documents. After recording all the identified publicly available CDOM by study, principal investigators from all included studies were invited to provide feedback on any missed publicly available CDOM. Any additional CDOM indicated by the principal investigators were added to the results as long as they were available online.
Evaluation of publicly available CDOM by study. Public availability of CDOM was evaluated overall for each study, and was considered to be complete if a detailed list of all the outcomes of interest that were collected in a study was publicly available and data on all the metadata fields listed in Table 2 were available for each corresponding chronic disease outcome. If data were complete for some outcomes only, published CDOM was considered to be complete for some outcomes. If only some of the outcome metadata fields could be filled for one or more chronic disease outcomes, published CDOM was considered to be partial. If no metadata fields could be filled based on publicly available information, published CDOM was considered to be missing. Table 4 details this evaluation scheme.
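The study-level rules above can be sketched in code as follows. This is an illustrative sketch rather than the procedure actually used for the evaluation: the field names are hypothetical placeholders for the fields of Table 2, the example data are made up, and the input is assumed to already list every outcome of interest collected in the study.

```python
from typing import Dict, List

# Hypothetical field names standing in for the metadata fields of Table 2.
SCHEMA_FIELDS: List[str] = [
    "prevalent_or_incident",
    "disease_name_or_code",
    "primary_or_secondary",
    "assessment_method",
]


def classify_study(outcomes: Dict[str, Dict[str, bool]]) -> str:
    """Rate one study's publicly available CDOM.

    `outcomes` maps each outcome of interest to a dict stating, per
    metadata field, whether a public description was found.
    """
    found_per_outcome = [
        [availability.get(field, False) for field in SCHEMA_FIELDS]
        for availability in outcomes.values()
    ]
    complete = [all(found) for found in found_per_outcome]
    if complete and all(complete):
        return "complete for all outcomes"
    if any(complete):
        return "complete for some outcomes"
    if any(any(found) for found in found_per_outcome):
        return "partial"
    return "missing"


# Made-up example: one fully described outcome, one partly described.
example_study = {
    "type 2 diabetes": dict.fromkeys(SCHEMA_FIELDS, True),
    "cancer": {"prevalent_or_incident": True},
}
print(classify_study(example_study))
```

A study with no outcome fully described but at least one field found falls through to "partial", matching the third rule of the scheme.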
Evaluation of publicly available CDOM by metadata source and by metadata field. Publicly available CDOM was also recorded in more detail, distinguishing what kind of metadata were found in what source. Based on this information, we calculated a score summarizing public availability of CDOM across all included studies for each metadata field to examine what kind of outcome metadata are more often publicly available or more often missing. Separately for each study and source of metadata, the following rating scheme was used to evaluate each metadata field: "3", complete for all outcomes; "2", complete for some outcomes; "1", partial; "0", missing/no metadata (see Table 4). A score of 1 instead of 2 was given when some details about the metadata field were missing, e.g., if there was an indication that a study collected both prevalent as well as incident outcome data, but only a list of the prevalent outcomes was found (i.e., information about this metadata field was partial). This rating was applied to each outcome metadata field found in each metadata source. As the metadata sources study website and study/trial registries may serve both as direct sources (i.e., embedded metadata) and indirect sources (i.e., links and references), we evaluated them both as direct sources only and as direct plus indirect sources of metadata. For the overall rating, the highest metadata field score across metadata sources within each study made up the overall rating for a metadata field, which was then used to compute the median score per metadata field (range 0-3). For instance, if a study obtained a "3" for the metadata field "prevalent or incident outcome" based on data documents, but obtained a "2" based on the other metadata sources, the overall score for "prevalent or incident outcome" would be the highest score, i.e., "3", and it would be considered as complete for all outcomes.
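The overall rating and the median score per metadata field can be sketched as below. This is a minimal illustration under stated assumptions: the study names, source labels, and scores are made up, `None` is used here as a hypothetical marker for "not applicable", and the function names are not taken from the original analysis.

```python
from statistics import median
from typing import Dict, Optional

# Scores per metadata source for one metadata field: 3, 2, 1, 0, or None (na).
SourceScores = Dict[str, Optional[int]]


def overall_field_score(scores_by_source: SourceScores) -> Optional[int]:
    """Overall rating of a field within one study: the highest score across sources."""
    rated = [score for score in scores_by_source.values() if score is not None]
    return max(rated) if rated else None


def median_field_score(studies: Dict[str, SourceScores]) -> float:
    """Median of the overall ratings of one metadata field across studies."""
    overall = [overall_field_score(scores) for scores in studies.values()]
    return median(score for score in overall if score is not None)


# Made-up ratings of the field "prevalent or incident outcome" in three studies.
prevalent_or_incident = {
    "study A": {"publications": 2, "website": 2, "registry": 2, "data documents": 3},
    "study B": {"publications": 2, "website": 0, "registry": None, "data documents": 1},
    "study C": {"publications": 3, "website": 1, "registry": 1, "data documents": None},
}

print(overall_field_score(prevalent_or_incident["study A"]))
print(median_field_score(prevalent_or_incident))
```

Study A reproduces the worked example in the text: a "3" from data documents outweighs the "2" from the other sources, so the field counts as complete for all outcomes in that study.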
Perceived consistency with FAIR principles by the Principal Investigators. Perceived consistency of CDOM with FAIR principles by the principal investigators was assessed based on the previously published criteria for each of the FAIR guiding principles 8 with regard to CDOM (see Supplementary Table 3). These criteria were circulated as a checklist to the principal investigators of each of the included studies (one principal investigator representing one study), who returned the completed templates for their respective study (see Supplementary Fig. 1). For each criterion, principal investigators had the option of writing a comment, e.g., to express lack of clarity or to provide a more specific answer. Additionally, respondents were asked to provide feedback on their perceived barriers to achieving FAIR (meta-)data for their respective study. The following potential barriers were rated as "very important barrier", "moderately important barrier" or "not an important barrier": limited financial resources, limited human resources, limited technical resources, limited incentives. Additional barriers could be entered as free text and were rated in the same way.

Fig. 1 Proportion and number of included studies with publicly available chronic disease outcome metadata (CDOM), by source (total n = 16). Only direct sources of metadata (i.e., links and references not included).

Fig. 2 Proportion and number of included studies with available and accessible (meta-)data infrastructure (total n = 16). Study-specific internet-accessible portals (through which data documents are often accessible) were considered as (meta-)data infrastructure. Available if the existence of a (meta-)data infrastructure was identified through the study website and/or data document search; accessible if contents could be viewed without registration or after registration. (a) Credentials needed, no registration option. (b) Corresponding to (meta-)data access infrastructure not available.

Fig. 3 Proportion and number of included studies with complete, partial, and missing publicly available chronic disease outcome metadata, by metadata field (total n = 16). na, does not apply (not part of study design). Metadata considered complete if all aspects of the chronic disease outcome metadata schema (Table 2) are covered for all examined cardiovascular diseases, type 2 diabetes, and cancers.

Fig. 4 Principal investigators' (n = 10) perceived consistency of CDOM in their study with FAIR principles. (a) Applies only if yes to A1. (b) Applies only if yes to R1.

Table 1. Overview of included German population-based observational studies (n = 16). mo, months; y, years. a GNHIES98 participants were invited again for DEGS1. b Ongoing recruitment. c Exact number of participants not yet published.

Table 2. Chronic disease outcome metadata schema.

Table 4. Evaluation scheme of studies' completeness of publicly accessible chronic disease outcome metadata.