The variable quality of metadata about biological samples used in biomedical experiments

Gonçalves, Rafael S.; Musen, Mark A.

doi:10.1038/sdata.2019.21

Analysis
Open access
Published: 19 February 2019

Scientific Data volume 6, Article number: 190021 (2019)

Abstract

We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample—a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples—a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study revealed multiple anomalies in the metadata. Most metadata field names and their values are not standardized or controlled. Even simple binary or numeric fields are often populated with inadequate values of different data types. By clustering metadata field names, we discovered there are often many distinct ways to represent the same aspect of a sample. Overall, the metadata we analyzed reveal that there is a lack of principled mechanisms to enforce and validate metadata requirements. The significant aberrancies that we found in the metadata are likely to impede search and secondary use of the associated datasets.

Introduction

The metadata about scientific experiments are essential for finding, retrieving, and reusing the scientific data stored in online repositories. Finding relevant scientific data requires not only that the data be accompanied by metadata, but also that the metadata be of sufficient quality for the corresponding datasets to be discovered and reused. When the quality of the metadata is poor, software systems that index and make use of the experimental data may fail to return search results that would otherwise be appropriate for the given search criteria. In addition, significant metadata post-processing efforts may be required to facilitate data analysis.

The literature on metadata quality generally points to the need for better practices and infrastructure for authoring metadata. Bruce et al.¹ define various metadata quality metrics, such as completeness (e.g., all necessary fields should be filled in), accuracy (e.g., the values filled in should be specified as appropriate for the field), and provenance (e.g., information about the metadata author). Park et al.^2,3 specified several high-level principles for the creation of good-quality metadata. The metrics mentioned in these works have been recently supplemented by the FAIR data principles⁴. The FAIR principles specify desirable criteria that metadata and their corresponding datasets should meet to be Findable, Accessible, Interoperable, and Reusable.

Several empirical studies suggest that metadata quality needs to be significantly improved. Infrequent use of ontologies to control field names and values and lack of validation have been identified as key problems. For example, Zaveri et al. reported that metadata records in the Gene Expression Omnibus (GEO) suffer from redundancy, inconsistency, and incompleteness⁵. This problem occurs because GEO allows users to create arbitrary fields that are not predefined by the GEO data dictionary and it also does not validate the values of those fields. Hu et al. developed an agglomerative clustering algorithm to clean metadata field names—cutCluster⁶. Hu et al. tested whether a sample of 359 out of 11,000 field names in GEO metadata were clustered similarly to a gold standard clustering crafted by the authors. Hu et al. concluded that multiple field names would require human verification to determine their correct cluster. Park⁷ examined the use of Dublin Core (DC) elements in field names, in a corpus of 659 metadata records sampled from digital image collections on the Web. Park identified various problems with the representation of DC elements that could have been prevented with better infrastructure to map metadata field names to DC elements. Bui et al.⁸ conducted a similar study to investigate the use of DC elements in metadata fields in a larger corpus of around 1 million records. The authors found that 6 DC fields are “rather well populated,” while the other 10 fields that they analyzed were poorly populated. However, none of these authors investigated whether the content of metadata fields is appropriately specified according to the fields’ expected values. For example, the studies checked whether DC fields are populated but not whether the values for the dc:date field are dates formatted according to some standard, or whether the values for the language field resolve to controlled terms in an ontology about languages or a language value set. 
For data to be FAIR, the value of each metadata field needs to be accurate and uniform (e.g., relying on controlled terms where possible), and to adhere to the field specification. Using controlled terms as a means to standardize metadata field names and field values allows users to be able to find data in a principled way, without having to cater to ad hoc representation mechanisms.

In this paper, we present an analysis of the quality of metadata in two online databases: the NCBI BioSample⁹, which is maintained by the U.S. National Center for Biotechnology Information (NCBI), and the EBI BioSamples^10,11, which is maintained by the European Bioinformatics Institute (EBI). These databases store metadata that describe the biological materials (samples) under investigation in a wide range of projects. We selected the NCBI BioSample as it represents the most recent NCBI metadata repository initiative, and because the NCBI BioSample repository was designed to standardize sample descriptions across all NCBI repositories (including GEO) with a focus on the use of controlled terms from ontologies. The EBI BioSamples was selected because it is EBI’s equivalent repository to the NCBI BioSample, and because the EBI curates the metadata in its repository, contrary to the NCBI. The curation process used with the EBI BioSamples repository involves mapping values to controlled terms, and it results in simpler metadata (e.g., values such as “N/A” or “missing” are pruned) that are presumably closely aligned with ontologies. The metadata in the EBI BioSamples are partially-curated descriptions of sample-related data hosted in databases such as ArrayExpress¹². ArrayExpress includes all the microarray data in the GEO database. Thus, in our study, we expect to find that there is overlap in the contents of the EBI BioSamples and the NCBI BioSample repositories, and that the curated EBI BioSamples metadata are of higher quality.

Methods

Our goal is to measure the quality of metadata records based on whether the fields that the records describe comply with their specification. We consider metadata to be of good quality if the metadata fields use controlled terms when indicated, if their values are parseable, and if the values match the expectations of the database designers. We analyzed metadata fields (so-called attributes) that have computationally verifiable expectations for their values in the two repositories for metadata about biomedical samples. For example, a field that is expected to be populated with numeric values can be unambiguously verified, while a field that is populated with free-text values would pose a non-trivial challenge for automated verification. A metadata attribute comprises a pair consisting of an attribute name and an attribute value. For example, values for the attribute named disease of human samples in the NCBI BioSample should correspond to terms in the Human Disease Ontology (DOID), according to BioSample documentation. We used the BioPortal repository of publicly available biomedical ontologies¹³ to identify correspondences between metadata values and ontology terms.

We acquired a copy of the NCBI BioSample database from the central NCBI FTP archive¹⁴ on June 25, 2017. The BioSample database was distributed as an XML file, with no explicit versioning information. Our copy of NCBI’s BioSample contained 6,615,347 metadata records. A typical BioSample record appears in Fig. 1.

**Figure 1: Example metadata record from the NCBI BioSample.**

The EBI software infrastructure did not have a downloadable archive containing the entire BioSamples database. We obtained a snapshot of the database on November 15, 2017 by contacting the EBI IT Helpdesk. Our copy of EBI’s BioSamples contained 4,793,915 metadata records. In this paper, when we refer to the metadata in either NCBI BioSample or EBI BioSamples, our comments are necessarily based on the snapshots that we obtained of the two repositories in 2017.

We built a software tool to extract key bits of information about each metadata record in the samples databases and to determine whether the attributes of each sample record were filled in and well-specified. Our tool collects the following data: sample identifier, accession number, publication date, last update date, submission date, identifier and name of the sample organism, owner name, and package name. Then, for each attribute within our tested attributes, the software records the attribute name, its value, and verifies whether it is filled in according to the attribute’s specification. An attribute specification describes the format and content of the expected attribute value. Each repository defines and documents its own attribute specifications. We developed our software to determine whether these specifications hold in the metadata that we processed.

We built a second tool to cluster a list of given strings according to their similarity using the affinity propagation clustering algorithm¹⁵. Affinity propagation is a machine learning algorithm that identifies exemplars among data points and creates clusters of data points around the exemplars. This clustering technique is desirable for our study because it does not require specifying the number of clusters upfront (which is unknown in our case), and because it computes a representative value for each cluster (the exemplar). We used the implementation of the affinity propagation algorithm in the scikit-learn Python package¹⁶. To compute the similarity between strings we use the Levenshtein edit distance. The Levenshtein distance between two strings s and t is the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform s into t. We chose this distance metric because it is widely used in spell-checkers and search systems, it accounts for simple typing errors, and it is not restricted to strings of equal length.
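As an illustration of the string similarity described above, the following minimal sketch computes the Levenshtein distance with the standard dynamic-programming recurrence and builds a matrix of negated distances. The function names and the sample attribute names are illustrative assumptions, not taken from the authors' tool. The paper does not state how similarities were passed to scikit-learn; one plausible route, assumed here, is to supply such a matrix to `sklearn.cluster.AffinityPropagation` with `affinity="precomputed"`.

```python
from typing import List


def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn s into t."""
    # previous[j] is the distance between the prefix of s processed
    # so far (excluding the current character) and the first j chars of t.
    previous = list(range(len(t) + 1))
    for i, s_char in enumerate(s, start=1):
        current = [i]
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity_matrix(names: List[str]) -> List[List[int]]:
    """Negated pairwise distances, so larger values mean more similar."""
    return [[-levenshtein(a, b) for b in names] for a in names]


# Hypothetical attribute names, used only to exercise the functions.
names = ["altitude", "Altitude (m)", "altitude_m"]
for row in similarity_matrix(names):
    print(row)
```

Negating the distance turns it into a similarity, which is the form that affinity propagation expects; other transformations (for example, normalizing by string length) would be equally reasonable choices.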

Code availability

All code used for the quality assessment of metadata and the clustering of metadata keys is available at https://github.com/metadatacenter/metadata-analysis-tools.

Data availability

The data used and generated throughout the study described in this paper are available in Figshare¹⁷ at https://doi.org/10.6084/m9.figshare.6890603.

NCBI BioSample Overview

Officially launched in 2011, the NCBI BioSample repository accepts submissions of metadata through a Web-based portal that guides users through a series of metadata-entry forms. The first form prompts users to choose a package. A package represents a type of sample and it specifies a set of attributes that should be used to describe samples of a particular type. For instance, the Human.1.0 package requires its records to have the attributes age, sex, tissue, biomaterial provider, and isolate. This package also lists other attributes that can be optionally provided. Each of the 104 BioSample package types has a different set of rules regarding which attributes are required and which are optional¹⁸. A notable exception is the Generic package, which has no requirements at all. This package is not listed in the online package documentation and it is not an option in BioSample’s Web forms.

A metadata record defines multiple attributes, each of them composed of an attribute name and a value. BioSample provides a dictionary of 452 metadata attribute names¹⁹ that can be used to describe the samples that form the substrates of experiments. Metadata authors can, however, provide additional attributes with arbitrary names with no guidance or control from BioSample. Each metadata record describing a sample can contain multiple attributes. Given the biomedical domain, we expect BioSample metadata to use terms from ontologies in BioPortal—a repository that currently hosts over 700 publicly available biomedical ontologies.

Analysis of NCBI BioSample metadata

Our study assesses the quality of metadata in BioSample according to whether the attributes in the metadata records specify (1) a controlled attribute name (i.e., provided by an ontology or other controlled term source), (2) an attribute name that is in BioSample’s attribute dictionary, and (3) a valid value according to the attribute specification. We analyzed all the BioSample attributes and categorized those attributes that have the same type of expected values into the groups described in the following subsections.

Ontology-term attributes

There are 9 BioSample attributes that dictate the use of term values from specific ontologies. For example, the attribute phenotype, representing the phenotype of the sampled organism, should have input values that are terms from the Phenotypic Quality Ontology (PATO), according to the BioSample documentation. To verify whether ontology terms supply values for attributes in BioSample when appropriate, we performed searches in BioPortal for exact matches of the possible values for each relevant BioSample attribute field within the ontology that the BioSample attribute documentation indicates should provide values for that field. We indicate that an ontology-term attribute is well-specified if its value matches a term in the designated ontology. When matching terms, the algorithm implemented in BioPortal takes into consideration the term names, synonyms, and term identifiers.
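The matching rule can be sketched locally without calling BioPortal. The example below is a simplified stand-in, not the BioPortal search algorithm: it assumes that "exact match" means equality after case folding and whitespace normalization, and it uses a small made-up term list with placeholder identifiers (`EX:...`) rather than real ontology content.

```python
def normalize(text):
    """Collapse whitespace and ignore case (an assumed notion of exact match)."""
    return " ".join(text.split()).casefold()


def build_index(terms):
    """Map every identifier, name, and synonym to its term identifier."""
    index = {}
    for term in terms:
        labels = [term["id"], term["name"]] + list(term.get("synonyms", []))
        for label in labels:
            index.setdefault(normalize(label), term["id"])
    return index


def is_well_specified(value, index):
    """True if the value matches a term name, synonym, or identifier."""
    return normalize(value) in index


# Hypothetical terms for illustration only; not real ontology entries.
TOY_TERMS = [
    {"id": "EX:0000001", "name": "increased size", "synonyms": ["enlarged"]},
    {"id": "EX:0000002", "name": "decreased size", "synonyms": ["small"]},
]

index = build_index(TOY_TERMS)
for value in ["Enlarged", "EX:0000002", "very big"]:
    print(value, is_well_specified(value, index))
```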

Value set attributes

There are 32 attributes whose values are constrained to value sets specified in the BioSample documentation. For example, the attribute dominant hand takes values from a value set composed of the terms left, right, and ambidextrous. For each of these attributes, we tested whether the values found in BioSample records match terms in the corresponding value set defined in the BioSample documentation.
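A value set check of this kind reduces to a membership test. The sketch below uses the dominant hand value set given above; trimming surrounding whitespace and requiring an exact, case-sensitive match are assumptions, since the text does not say how strictly values were compared.

```python
# Value set for the "dominant hand" attribute, as listed in the text.
DOMINANT_HAND = frozenset({"left", "right", "ambidextrous"})


def in_value_set(value, value_set):
    """True if the trimmed value is exactly one of the permitted terms."""
    return value.strip() in value_set


for value in ["left", " right ", "Left-handed", "n/a"]:
    print(repr(value), in_value_set(value, DOMINANT_HAND))
```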

Boolean attributes

We tested 4 attributes in BioSample packages that require a Boolean value. We indicate that a Boolean attribute is well-specified if its value is true or false, regardless of capitalization. We consider values such as f or yes to be invalid.
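The Boolean rule stated above can be written as a one-line predicate. This is a minimal sketch; the function name is illustrative, and stripping surrounding whitespace is an assumption not stated in the text.

```python
def is_valid_boolean(value):
    """True only for 'true' or 'false', ignoring capitalization."""
    return value.strip().lower() in {"true", "false"}


for value in ["TRUE", "false", "f", "yes"]:
    print(value, is_valid_boolean(value))
```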

Integer attributes

We tested 4 attributes that require an integer value. An integer attribute is well-specified if the given value can be parsed as an integer.
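What counts as "parseable as an integer" depends on the parser. The sketch below assumes a strict reading: an optional sign followed by ASCII digits only, so values with units, decimals, or thousands separators are rejected. The pattern and function name are illustrative and may differ from the rule applied in the authors' tool.

```python
import re

# Optional sign followed by one or more ASCII digits (assumed definition).
_INTEGER = re.compile(r"[+-]?[0-9]+")


def is_valid_integer(value):
    """True if the trimmed value consists solely of an integer literal."""
    return _INTEGER.fullmatch(value.strip()) is not None


for value in ["42", "-7", "4.2", "12 years", ""]:
    print(repr(value), is_valid_integer(value))
```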

Timestamp attributes

We tested 11 attributes that require a timestamp value. A timestamp attribute is well-specified if the given value is in the format “DD-Mmm-YYYY”, “Mmm-YYYY” or “YYYY” (e.g., 20-Nov-2000, Nov-2000 or 2000), or adheres to the ISO 8601 standard for timestamps: “YYYY-mm-dd”, “YYYY-mm” or “YYYY-mm-ddThh:mm:ss” (e.g., 2000-11-20, 2000-11 or 2000-11-20T17:30:20).
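The accepted timestamp formats listed above can be checked by trying each one in turn. The following sketch uses Python's `datetime.strptime`; the mapping from the formats in the text to `strptime` directives is our reading of them, and `%b` (abbreviated month name) assumes an English or C locale. Note that `strptime` is slightly lenient, for example accepting a one-digit day.

```python
from datetime import datetime

# strptime equivalents of the six formats named in the text.
ACCEPTED_FORMATS = (
    "%d-%b-%Y",           # DD-Mmm-YYYY, e.g. 20-Nov-2000
    "%b-%Y",              # Mmm-YYYY,    e.g. Nov-2000
    "%Y",                 # YYYY,        e.g. 2000
    "%Y-%m-%d",           # ISO 8601 date
    "%Y-%m",              # ISO 8601 year and month
    "%Y-%m-%dT%H:%M:%S",  # ISO 8601 date and time
)


def is_valid_timestamp(value):
    """True if the trimmed value parses under any accepted format."""
    for fmt in ACCEPTED_FORMATS:
        try:
            datetime.strptime(value.strip(), fmt)
            return True
        except ValueError:
            continue
    return False


for value in ["20-Nov-2000", "2000-11", "2000-11-20T17:30:20", "11/20/2000"]:
    print(value, is_valid_timestamp(value))
```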

We gathered similar information about other structured attributes, although we did not test the validity of those values in the BioSample data. For example, there are 161 attributes that require a unit of measure, 21 attributes that require a PubMed ID, and so on.

We chose to validate the 5 groups above because the characteristics of these groups are easily tested and because the expected values of the attributes are straightforward for users to specify (e.g., compared to attributes such as those that require a value to be composed of a floating-point number followed by a special symbol).

EBI BioSamples Overview

The EBI BioSamples repository stores metadata about biological samples used in experiments registered, for example, in ArrayExpress. Human curators use a software system known as Zooma²⁰ to standardize the metadata and to add them to BioSamples. Zooma maps free text annotations to terms in ontologies hosted in the EBI’s Ontology Lookup Service (OLS)^21–23. The tool applies these mappings based on rules that are learned from the manual curation carried out in the ArrayExpress repository.

Metadata authors can submit metadata to EBI BioSamples in the form of SampleTab files. The SampleTab file format is a tab-delimited, spreadsheet-like format composed of two sections: Meta-Submission Information (MSI) and Sample Characteristics Description (SCD). The MSI section contains information about the submission (e.g., title, identifier, description, version), about the submitting organization (e.g., name, address), and about database links (e.g., the name of the database and the identifier within that database). The SCD section describes the sample characteristics via attributes of the form of name–value pairs. For the purposes of our study, we focused on the SCD section. Metadata can be submitted to the EBI BioSamples via Web forms or via REST APIs.

BioSamples specifies only 3 attribute names (so-called “named attributes”) that should be used to describe samples²⁴, whereas NCBI BioSample specifies 452. The named attributes that have a definition in the EBI repository documentation are:

Organism – “Value should be scientific name and have NCBI Taxonomy as a Term Source REF with associated Term Source ID.”
Sex – “Prefer ‘male’ or ‘female’ over synonyms. May have other values in some cases e.g. yeast mating types.”

There is no definition for the expected values of attributes using the Material named attribute. In addition to named attributes, BioSamples allows metadata submitters to use “free-form attributes” to describe samples (i.e., attributes containing ad hoc attribute names other than the 3 discussed above).

We carried out the same analysis of the EBI BioSamples repository that we performed for NCBI BioSample. In our study of the EBI BioSamples, we assess the quality of the 3 metadata attributes defined in the BioSamples documentation as follows:

An Organism attribute is well-specified if the value corresponds to a term in the NCBI Taxonomy.
A Material attribute is well-specified if the value corresponds to a term in a biomedical ontology.
A Sex attribute is well-specified if the value is in the NCBI value set for the sex attribute, which includes the EBI-preferred terms “male” and “female”.

In the case of the Sex attribute, since there is no pre-defined range for the values in EBI BioSamples, we used the value set defined in the NCBI BioSample documentation. We used this value set to be able to compare results between the two repositories.

Results

We analyzed whether the values of metadata attributes comply with the specifications set out by the developers of each of the two hosting databases, NCBI BioSample and EBI BioSamples. We evaluated the quality of metadata records that exist in both databases, which we determined according to their accession identifiers. Finally, we clustered the metadata attribute names to identify redundant attribute names that represent the same aspect of a sample and thus ideally could be denoted by a single attribute name.

NCBI BioSample

The metadata records in BioSample represent 94 unique package types. Thus, not all of the 104 BioSample package types are used. The Generic package accounts for the bulk of the BioSample database: 85% of the records use this package definition (Fig. 2). The next most populated package is Pathogen, which accounts for 3.2% of the records.

We examined the evolution of the number of Generic versus non-Generic submissions to BioSample over the years, to determine whether the metadata records adhering to the Generic package were legacy submissions potentially imported from other databases. In Fig. 3, we show the total number of metadata record submissions to NCBI BioSample from 2009 to 2017. Nearly all of the submissions until 2013 used the Generic package. After 2013, one observes some adoption of packages other than the Generic one, although most metadata (between 75 and 80% of all records) were still submitted using the Generic package between 2014 and 2017.

**Figure 3: Metadata submissions to NCBI BioSample from 2009–2017.**

BioSample records contain a total of 82,360,966 attributes (name–value pairs). Attribute names either are selected from the BioSample dictionary or are user-defined. A total of 12,284,229 pairs (15% of the total), spread across 2,303,021 metadata records (35% of all records), use attribute names that are not specified in the BioSample attribute dictionary. Among these attributes, we identified 18,198 syntactically unique custom attribute names specified by submitters. For example, some records use the name Altitude (m) instead of altitude, the attribute name defined by BioSample. The records that contain these attributes have been submitted by 313 different laboratories. Overall, there are 18,650 different attribute names used in BioSample metadata records: 452 are BioSample-specified (2.4%), and the remainder are user-specified (97.6%). Only 9 of the 452 BioSample-specified attribute names are terms that are taken from standard ontologies. It is unclear whether any of the user-defined attribute names corresponds to ontology terms; in our analysis, we did not find any values for user-defined attribute names that correspond to ontology term IRIs (either in their full or prefixed form, e.g., ENVO:00000428). Of all BioSample records, only 197,123 Generic-package records (3%) do not specify any attributes. On average, each BioSample metadata record specifies 12 attributes. The vast majority of BioSample records (97%) specify at least one attribute.
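The proportions in this paragraph can be recomputed from the reported counts. The short check below does only that arithmetic, using the record and attribute totals given in the text; the constant and function names are illustrative.

```python
# Counts reported in the text for the NCBI BioSample snapshot.
TOTAL_RECORDS = 6_615_347
TOTAL_ATTRIBUTES = 82_360_966
CUSTOM_NAME_PAIRS = 12_284_229
RECORDS_WITH_CUSTOM_NAMES = 2_303_021
DICTIONARY_NAMES = 452
CUSTOM_NAMES = 18_198
RECORDS_WITHOUT_ATTRIBUTES = 197_123


def percent(part, whole):
    """Share of `part` in `whole`, expressed as a percentage."""
    return 100 * part / whole


all_names = DICTIONARY_NAMES + CUSTOM_NAMES
print(f"distinct attribute names: {all_names}")
print(f"pairs with custom names: {percent(CUSTOM_NAME_PAIRS, TOTAL_ATTRIBUTES):.1f}%")
print(f"records with custom names: {percent(RECORDS_WITH_CUSTOM_NAMES, TOTAL_RECORDS):.1f}%")
print(f"dictionary share of names: {percent(DICTIONARY_NAMES, all_names):.1f}%")
print(f"records without attributes: {percent(RECORDS_WITHOUT_ATTRIBUTES, TOTAL_RECORDS):.1f}%")
print(f"mean attributes per record: {TOTAL_ATTRIBUTES / TOTAL_RECORDS:.1f}")
```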

The primary results of our study of NCBI BioSample metadata are presented in Fig. 4. We now explain each of the columns in the figure in the order of their appearance.

**Figure 4: Quality of dictionary attributes in NCBI BioSample according to their type.**