Introduction

Human biological samples are widely used in clinical trials, observational studies and personalized medicine. Their value can be expected to increase even more in the coming years, as next-generation sequencing technologies will bring forth omics data from sample derivatives at a faster pace.1, 2 Consequently, the importance of well-organized and well-maintained storage facilities for the biological samples with the possibility to compare specimens across different storage facilities, or biobanks, should be given a high priority. The European and global biobank community is currently in the process of establishing common infrastructures to promote harmonization3, 4 to make visible both samples and data and provide a standardized way for sharing these resources. Already, sharing high-level information about the organizational structure of biobanks and non-sensitive data about the stored samples is the focus of several national and international initiatives.5, 6

To overcome barriers to data sharing and to encourage the process of harmonization,7 the biobank community should support and use a common terminology. Because it deals with both biological samples and potentially sensitive data, the field of biobanking relates to several knowledge domains: biology, to describe the properties of a sample; medicine, to annotate associated clinical information; computer science, for management of sample data; and law, to provide the framework for donor informed consent and control of personal data. Hence, terms relevant for the biobank community may be found in general thesauri for medicine and biology8, 9 as well as in legal instruments.10 In addition, specific glossaries for biobanking have been developed by regional and international organizations already recognized for supporting harmonization efforts in the biobank community.4, 11, 12, 13 To our knowledge, a survey comparing the use of these sources has not so far been conducted, and it would be a useful instrument to further promote harmonization and data sharing. Thus, the purpose of the present study was to investigate the preference of definitions for 10 terms often used in biobanking. We used questionnaires to determine which definition is preferred for each of the terms, followed by a discussion guided by the quantitative results and comments from responders.

Materials and methods

Ten terms were selected on the basis of being important for information sharing about biological samples – for instance, in the implementation of a query system. For such systems, the first five terms, described in Table 1, are often used in an explicit way as data variables or attributes, describing what information and samples are being shared. On the other hand, the last five terms, described in Table 2, are highly relevant for the process of sharing, describing the conditions for sharing information about samples. The selected terms were [HUMAN] BIOBANK, SAMPLE and/or SPECIMEN, SAMPLE COLLECTION, STUDY, ALIQUOT, CODED/CODING, IDENTIFYING INFORMATION/IDENTIFIABILITY, ANONYMISED/ANONYMISATION, PERSONAL DATA and INFORMED CONSENT. Definitions were collected from P3G,11 ISBER,12 OECD,13, 14 Medical Subject Headings (MeSH),8 Statutes for the Biobanking and Biomolecular Resources Research Infrastructure — European Research Infrastructure Consortium (BBMRI-ERIC),15 The National Cancer Institute Thesaurus (NCI),9 the German Ethics Council,16 the Swedish Association of Local Authorities and Regions,17 the Oxford English Dictionary18 and the Directive 95/46/EC of the European Parliament and of the Council.10 The questionnaire emphasized that all terms should be considered in the context of biobanks.

Table 1 Definitions for [HUMAN] BIOBANK, SAMPLE and/or SPECIMEN, SAMPLE COLLECTION, STUDY and ALIQUOT
Table 2 Definitions for CODED/CODING, IDENTIFYING INFORMATION/IDENTIFIABILITY, ANONYMISED/ANONYMISATION, PERSONAL DATA and INFORMED CONSENT

The questionnaire was designed using Websurvey (Textalk, Mölndal, Sweden) with a predefined set of two to five definitions for each term, depending on the number of relevant definitions that could be found in the literature. As an alternative to selecting a definition, the respondent could choose to enter a comment.

For SAMPLE COLLECTION, only two definitions could be found in the literature from sources relating to biobanks: one from the Swedish Association of Local Authorities and Regions and one from ISBER. To create more alternatives, the definition of COLLECTION from the Oxford English Dictionary was adapted and used in this context. Of the three definitions, only the one from the Swedish Association of Local Authorities and Regions explicitly defines how the samples in a collection are related, namely by at least one common characteristic. Proper definitions for STUDY in the context of biobanks were, as for SAMPLE COLLECTION, difficult to find in the literature. Of the two given definitions, the one by MeSH defines STUDY in the context of epidemiology, whereas the definition given by NCI is generic. The selection thus offered two contrasting definitions, where the use of the first one can be motivated by the fact that epidemiology is a research field closely linked with biobanking. In some cases, sources did not use semantically or syntactically identical terms (for example, CODED vs CODING), but because the definitions were not strict in that sense, these terms were grouped for comparison.

To avoid, as far as possible, bias caused by the responders being more familiar with a particular organization, no sources were included in the questionnaire, and the definitions were also not put in a particular order.

The questionnaire was sent by e-mail to a European group (N=438), comprising one or two contact persons per biobank or biobank network in the Catalog of European Biobanks,5 and to a Swedish group (N=122), drawn from a Swedish e-mailing list for biobanks and registries. The questionnaires were accompanied by a cover letter and header in English or Swedish, respectively, and were otherwise identical. The survey period lasted from 28 June to 7 September 2012, with two reminders after 1 and 2 months.

Results

Of the 438 European biobank contacts, 92 responded, giving a response rate of 22% once the 21 addresses to which the e-mail was permanently undeliverable are excluded. The ‘type of biobank’, classified in the Catalog of European Biobanks as (1) ‘Clinical biobank/study’, (2) ‘Population-based biobank/study’ or (3) ‘Non-human biobank/study’, was retrieved for the 92 responders. Fourteen responders were pairwise affiliated to the same biobank organization. Eight responders were affiliated to a network of biobanks rather than a specific organization, and one responder could not be traced back to a particular biobank organization or network. The distribution of European responders among clinical biobanks, population-based biobanks and biobank networks is presented in Figure 1. In a similar manner, the ‘country of biobank’ for each responder was retrieved from the Catalog of European Biobanks. The country of affiliation for the 346 non-responders was determined using the country domain of their respective e-mail address, or retrieved from the Catalog of European Biobanks for biobanks and networks categorized as EU or when a country domain was not part of the e-mail address. The number of responders versus invited participants for each country is presented in Figure 2. Invited participants with permanently undeliverable e-mail addresses have been excluded.

Figure 1

Type of biobank affiliated with responders from the European group using the classification of clinical and population-based biobanks according to the Catalog of European Biobanks, with the addition of responders who are affiliated to a biobank network instead of a specific biobank.

Figure 2

Country of biobank for European responders (dark grey bars with numerical labels) compared with the number of invited participants (light grey bars without labels).

In the Swedish group, 31 out of 122 responded, giving a response rate of 25%. A retrospective categorization by affiliation of type of biobank was not possible for the Swedish respondents. Taken together (All), 123 people participated in the survey, giving a total response rate of 23%. The results for each term are presented in Tables 3 and 4.
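The response rates reported above can be reproduced with a short calculation. The following is a minimal sketch, assuming that permanently undeliverable addresses are subtracted from the number of invitations before dividing (stated for the European group, and assumed here for the combined figure because it matches the reported 23%). The function and variable names are illustrative and do not come from the study.

```python
def response_rate(responders: int, invited: int, undeliverable: int = 0) -> float:
    """Return responders divided by the invitations that could be delivered."""
    reachable = invited - undeliverable
    if reachable <= 0:
        raise ValueError("no reachable invitees")
    return responders / reachable


# Counts taken from the Results section.
european = response_rate(92, 438, undeliverable=21)   # 21 hard bounces excluded
swedish = response_rate(31, 122)
# Assumption: the combined rate also excludes the 21 hard bounces.
combined = response_rate(92 + 31, 438 + 122, undeliverable=21)

for label, rate in [("European", european), ("Swedish", swedish), ("All", combined)]:
    print(f"{label}: {rate:.0%}")
```

Rounded to whole percentages, the three values agree with the 22%, 25% and 23% given in the text.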

Table 3 Results for [HUMAN] BIOBANK, SAMPLE and/or SPECIMEN, SAMPLE COLLECTION, STUDY and ALIQUOT
Table 4 Results for CODED/CODING, IDENTIFYING INFORMATION/IDENTIFIABILITY, ANONYMISED/ANONYMISATION, PERSONAL DATA and INFORMED CONSENT

[HUMAN] BIOBANK

Of the five definitions for [HUMAN] BIOBANK the one by P3G got the highest rating, although closely followed by the definition used in the BBMRI-ERIC statutes. Four of the European respondents chose to enter a comment instead of selecting one of the specified definitions. Three respondents made a reference to the definitions of EuroBioBank,19 the Marble Arch International Working Group on Biobanking for Biomedical Research20 and the Norwegian body of law.21 One respondent emphasized that the clinical use of biobanks should also be part of a definition.

SAMPLE and/or SPECIMEN

The most popular definition for SAMPLE and/or SPECIMEN was the one issued by the P3G consortium.11 One of the Swedish respondents chose to enter a comment, suggesting that SAMPLE and SPECIMEN are two different concepts, and that SPECIMEN seems to imply a sample from a sample.

SAMPLE COLLECTION

For the term SAMPLE COLLECTION, the definition by the Swedish Association of Local Authorities and Regions17 received the highest score. One European respondent chose to enter a comment regarding the definition by ISBER12 and the interpretation of the term ‘isolated’ in this definition.

STUDY

Of the two definitions for STUDY, the more general definition by NCI9 was favored over the definition of STUDY in the epidemiological context provided by MeSH.8 Two European respondents and one Swedish respondent commented that neither of the definitions is correct, that they are too narrow, or that biobanks should be regarded as a service for studies and that the concept of STUDY should not be related to biobanks per se. One European respondent commented that biobanks can serve various types of research but can also be used for diagnosis.

ALIQUOT

Of all terms, ALIQUOT received the best consensus among respondents, where the P3G definition11 was favored in all groups. One of the Swedish respondents made a comment that ALIQUOT corresponds to a sample from a sample according to the Glossary by the Swedish Association of Local Authorities and Regions.17

CODED/CODING

Of the two alternative definitions for CODED/CODING, there was a tie between the definitions from OECD13 and P3G11 in the group comprising all respondents. One EU respondent commented that neither of the specified definitions was satisfactory, but did not know of a better one either. A Swedish respondent stated that the term CODED/CODING should be replaced with the term pseudonymization.

IDENTIFYING INFORMATION/IDENTIFIABILITY

For the term(s) IDENTIFYING INFORMATION/IDENTIFIABILITY, the definition by OECD13 was the most popular among respondents. One European respondent stressed the difference between intentionally trying to identify a specific individual, and information linkage for a specific donor in order to create a valuable research asset but without any interest in revealing the identity of the donor. A Swedish respondent commented that the definition no. 3 (by ISBER12) corresponds to information that may directly or indirectly identify an individual, whereas definition no. 1 (by OECD13) is more related to a key that can be used to link the individual and data before and after pseudonymization. The same respondent also argued that the term IDENTIFIABILITY is something different than IDENTIFYING INFORMATION.

ANONYMISED/ANONYMISATION

Of the two given definitions for ANONYMISED/ANONYMISATION, the one given by P3G11 was favored by approximately two-thirds in both groups of respondents. Comments were provided by two European respondents, who stated that the definition by OECD13 is the correct definition for ANONYMISED data, whereas the definition by P3G is the correct definition for the process, and also that ANONYMISATION means that the ability to identify the subject in terms of civil state from any type of measurement or combination of measurements has been lost. A Swedish respondent commented that in some Swedish basic legal documents the term is used for coded information.

PERSONAL DATA

The two given definitions for PERSONAL DATA were about equally favored when all respondents are considered, with a small advantage for the definition given by P3G.11 There was, however, a considerable difference between the groups: the European group preferred the P3G definition, whereas the Swedish group favored the definition given in the current European data protection directive.10 One European respondent referred to comments given earlier and did not select a particular definition.

INFORMED CONSENT

For INFORMED CONSENT, all groups preferred the definition given by P3G,11 although the definition by ISBER12 was almost as popular. One Swedish respondent pointed out that there might be a difference in the meaning of INFORMED CONSENT and the decision of an INFORMED CONSENT.

Discussion

All in all, 123 persons participated in the survey. For European responders, the moderate response rate may be partially explained by 47 contacts who did not respond themselves but who had a responding co-contact with the same biobank affiliation. It is plausible that contacts connected to the same organization communicated and decided who should respond on behalf of their biobank, although the survey was aimed at individuals rather than organizations. In addition, for European contacts, accounting not only for permanently undeliverable mails (that is, hard bounces) but also for so-called soft bounces (N=43), caused, for example, by an overfull mailbox, increases the European response rate to 25%.
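The adjusted European response rate can be checked with a short worked calculation. This sketch assumes that both the 21 hard bounces and the 43 soft bounces are removed from the 438 invitations before dividing, which the text implies but does not state explicitly.

```latex
\[
\text{adjusted response rate}
  = \frac{92}{438 - 21 - 43}
  = \frac{92}{374}
  \approx 0.246 \approx 25\%
\]
```

For comparison, excluding only the hard bounces gives 92/417, or approximately 22%, as reported in the Results.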

The survey demonstrated variability in preference of definitions for most terms. In this section, we have analyzed this variability from different perspectives, guided by the quantitative result and comments from responders. We have aimed to compare the definitions by reasoning, while accounting for the outcome of the survey, and try to suggest how definitions may be improved.

Scope of definitions

At least four types of biobanks have previously been identified: (1) biobanks established as part of the health-care process; (2) biobanks established in the context of clinical trials; (3) biobanks comprising samples that were collected in a specific research project and could be re-used for other research; and (4) population-based biobanks, which may have a more general research purpose.4 Hence, it is desirable that a definition for the term [HUMAN] BIOBANK is general enough to cover all four categories, in line with one comment that ‘the clinical use’ should be included in the definition. However, the most popular definition for [HUMAN] BIOBANK was the one given by P3G, even though the P3G definition exclusively relates biobanks to population-based research. In a similar manner, the definition by ISBER for the term SAMPLE COLLECTION explicitly mentions research as a purpose. In contrast to SAMPLE COLLECTION, we argue that the term STUDY is firmly linked to a research question and may hence be thought of as a SAMPLE COLLECTION for which an ethical study permit exists.

With regard to the definition of what is intended as a ‘study’, the one from NCI was favored by respondents. Although we can support this definition in the scope of clinical trials, where ‘detailed examination’ of a subject is the starting point to gather information and data, it does not really fit what is expected from biobanks, in the sense that biobanks are mainly created as a resource aiming at contributing to various projects.22 As a consequence, the scope and expected functions of informed consent could vary considerably from one design to another. If we agree that informed consent is a process (see OECD definition), as it has to be continuous for the whole duration of the research program, it cannot be reduced to a simple procedure. That is why the definition from P3G, retained by respondents, is broader and is in accordance with what is expected from informed consent in the context of biobanks: expression of a will depending on the nature of the biobanks.

In the case of SAMPLE and/or SPECIMEN, the definition by P3G may be challenged in popularity by the preference for OECD (2009) and ISBER combined. The latter two definitions differ only in three aspects for the SPECIMEN part: an addition of ‘urine sample’, and replacements of ‘taken’ with ‘obtained’, and ‘subject or donor’ with ‘participant’. Hence, a combination of these definitions may be the preferable one.

Potential regional differences

The preferred definitions for the terms [HUMAN] BIOBANK, SAMPLE and/or SPECIMEN and PERSONAL DATA seem to differ to a larger degree between the European and Swedish groups than the rest of the terms. The P3G definition for [HUMAN] BIOBANK was especially popular among Swedish respondents, which was also the reason that it scored highest among the total respondents. Contributing to the popularity of the definition by BBMRI-ERIC among European responders may be a higher awareness among European researchers about ongoing international infrastructure collaborations.

Differences in semantics

For the case of PERSONAL DATA above, the two definitions are actually semantically different; the definition from the Directive 95/46/EC does not state that PERSONAL DATA per se lead to the identification of a natural person, only that it is the ‘information relating to an identified or identifiable natural person’. In article 2.1.a of the Directive, personal data is defined as ‘any information relating to an identified or identifiable natural person; an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity’. Data that cannot be connected to an individual person therefore fall outside the scope of the Directive (Article 3 of the Directive). This current definition includes health data, which are in addition considered sensitive data.

The definition by P3G, on the other hand, makes PERSONAL DATA synonymous with IDENTIFYING INFORMATION (see discussion above). We cannot be certain whether the responders noted this difference in semantics and reacted upon it or not. At least for European purposes we would at present, even if this might change, recommend using the definition as stated in the Directive 95/46/EC, so as to make biobank terminology consistent with legal terminology as far as possible. In the context of the revision of the Directive on Data protection (to be turned into a European regulation), the definitions of Personal Data and sensitive Data (which will probably include Genetic Data) will be harmonized across all European Union member states.23 This will facilitate a common understanding of this terminology and will improve communication between research teams.

Three definitions were given for the terms IDENTIFYING INFORMATION/IDENTIFIABILITY, of which two, by OECD (2009) and ISBER, had been given in the context of the first term, whereas one, P3G, was given in the context of the latter, see Table 2. The definitions were found to be semantically comparable, although use of the term IDENTIFIABILITY itself was questioned by one respondent. The OECD (2009)16 definition was considered most popular among responders regardless of group. The definition by P3G also brings up the concepts of CODING and pseudonymization. Although there is a relation between all these terms, it is possible that the inclusion makes the definition appear less straightforward in comparison. For the P3G Lexicon, we propose that IDENTIFIABILITY is replaced with IDENTIFYING INFORMATION and that the definition for PERSONAL DATA is used instead of the current one.

Definitions in the context of ontologies

The potential of an ontology for the biobank-administration domain has recently been described by Brochhausen et al.,24 where the major benefit of an ontology in this context is presented as minimizing the effort of querying multiple databases for the same kind of samples of interest. The ontology, Ontologized Minimum Information About Biobank data Sharing (OMIABIS), uses the definition for [HUMAN] BIOBANK adapted from the German Ethics Council, see Table 1, which received only 8% of the total votes. This highlights an important difference between ontologies and terminologies: ontologies are designed to fulfill different requirements than terminologies and therefore follow different design principles.25 Mainly, definitions in ontologies are written in a way that refers to the taxonomy underlying the ontology, which facilitates understanding by ontologists and thus fosters coordination of modular ontologies. Typically, a definition should be authored following this pattern: ‘An A is a B with property C’, where A are the entities defined, B is the immediate superclass and C is what makes the members of A different from all other members of B. This kind of definition is called an Aristotelian definition.26 Although Aristotelian definitions might not be intuitively descriptive for domain experts, ontologies and the entities represented in them should be presented in a manner that is understandable to those experts. To achieve that, we suggest, firstly, ensuring coextensive reference between the favored definition and the definition provided by OMIABIS and, secondly, adding an annotation (an rdfs:comment) containing the favored definition.
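The Aristotelian pattern and the suggested annotation can be illustrated with a small sketch. It is not how OMIABIS is implemented. The `ex:` prefix, the class names, the example differentia for ALIQUOT and the placeholder comment text are all hypothetical and serve only to show the shape of ‘An A is a B with property C’ together with an rdfs:comment carrying a community-favored definition.

```python
from dataclasses import dataclass


def _article(word: str) -> str:
    """Pick 'a' or 'an' from the first letter (a simple heuristic)."""
    return "an" if word[:1].lower() in "aeiou" else "a"


@dataclass(frozen=True)
class AristotelianDefinition:
    term: str         # A: the entity being defined
    genus: str        # B: the immediate superclass
    differentia: str  # C: what separates A from the other members of B

    def sentence(self) -> str:
        """Render the definition as 'An A is a B that C.'"""
        lead = _article(self.term).capitalize()
        return (f"{lead} {self.term} is {_article(self.genus)} "
                f"{self.genus} that {self.differentia}.")


def to_turtle(defn: AristotelianDefinition, favored_definition: str,
              prefix: str = "ex") -> str:
    """Emit a Turtle fragment: the subclass link plus an rdfs:comment.

    The prefix and local names are hypothetical, and a real file would
    also need @prefix declarations for them and for rdfs.
    """
    def local(name: str) -> str:
        return "".join(part.capitalize() for part in name.split())

    # Escape backslashes and double quotes for a Turtle string literal.
    comment = favored_definition.replace("\\", "\\\\").replace('"', '\\"')
    return (f"{prefix}:{local(defn.term)} rdfs:subClassOf "
            f"{prefix}:{local(defn.genus)} ;\n"
            f'    rdfs:comment "{comment}" .')


# Illustrative example only; the differentia is not taken from any glossary.
aliquot = AristotelianDefinition(
    term="aliquot",
    genus="sample",
    differentia="has been obtained by dividing a parent sample",
)
turtle = to_turtle(aliquot, "PLACEHOLDER: definition favored by respondents")
print(aliquot.sentence())
print(turtle)
```

Keeping the formal definition and the favored wording side by side in this way lets the taxonomy stay strict while domain experts still see the text they voted for.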

Conclusions

With the domestic and international proliferation of biobanks and their associated data, a common language for biobanks is essential. At present there is considerable confusion about some of the terms used in the biobank community.

The survey indicates a risk of focusing only on the research aspect of biobanking in definitions. If the clinical area of application is not also included, the likelihood of separated communities increases. Hence, we recommend that important terms be formulated in such a way that all areas of biobanking are covered, at least if the aim is to improve the bridges between research and clinical application. The generalizability of a term will of course depend on the scope of its definition. There is, however, nothing stopping us from using a hierarchical level to define different subclasses of the terms, and how they relate to different types of biobanks. Here, the semantic structure of an ontology will help.

In general, the outcome of this survey, which was mainly targeted at associated members of the European BBMRI, favors the glossary of the P3G consortium whose definitions were voted most popular for seven of the eight terms where it was represented, all responders considered. The outcome of the survey should in the short run be accounted for by the related organizations whenever an update of their respective vocabularies is pending. With the risk of considering definitions out of their context, and only acknowledging the European perspective, the results could be used in the long run for the creation of a global biobank data dictionary, supporting information sharing about biological samples. In addition, the creation and maintenance of a machine-interpretable ontology representing the biobank domain would be beneficial.