Introduction

In information infrastructure projects and initiatives one aspiration is to develop data sharing as a common part of scientific culture and practice. Achieving this goal is largely dependent on having internationally compatible infrastructures that facilitate sustainable data references, as well as integrated search and retrieval capabilities within research data. There is an obvious need for a comprehensive service that unifies data sources and allows for retrieving relevant and reliable search results as quickly as possible. Archives and repositories, such as ICPSR (https://www.icpsr.umich.edu), GESIS (https://www.gesis.org), UK Data (http://www.data-archive.ac.uk/ and the Dataverse Project1, provide social science and economic data on their websites, (https://dataverse.org/). For example, the r3data.org database lists 201 social science repositories and 146 economics repositories (http://www.re3data.org/browse/by-subject)2. When searching for appropriate data, social scientists must use distributed services that are based on different systems and retrieval techniques.

Dedicated, discipline-specific social sciences and the economics search facilities are still missing. There are some initiatives underway to support the research community in this respect, but none with its sole focus on social sciences datasets and none with the purpose of providing advanced searches on high quality metadata by means of a curated set of harvested repositories (see Table 1). Advanced searches include the choice of search term operators (or vs. and) as well as searching at the field level, in contrast to searching for term in all fields.

Table 1 Comparison of portals offering search for datasets.

To address this demand, the gesisDataSearch project (http://datasearch.gesis.org/start) was initiated. Its purpose is to create a central search point, enabling social scientists to look up or filter potential datasets quickly, to access dataset metadata and decide on its relevance for their work, and for citation purposes or reusing a dataset. This aim is achieved through a faceted search interface.

The point of departure, and core of the project, was the DOI Registration agency for social and economic data database of the DOI Registration agency for social and economic data (da|ra, https://www.da-ra.de)3 that already includes searchable metadata from registered data centers, among them the considerable holdings of the German GESIS Data Archive (https://www.gesis.org/en/services/data-analysis/), and the US American Data Archive ICPSR. Together with data references of other relevant international data providers the content of this database was included in the search index after a systematic assessment. For this assessment, we harvested metadata in both standards, Dublin Core (DC, see http://dublincore.org/documents/DCes/) and Data Documentation Initiative (DDI, see http://www.DDIalliance.org/).

The assessment of the availability and quality of metadata records on datasets in different metadata standards showed that the more detailed DDI standard is not yet adopted by many social science institutions, resulting in lower numbers compared to records available in Dublin Core.

To provide a user-friendly search of a comprehensive social science research data collection, the search scope is more important than its depth. Further, DDI offers hundreds of elements, which differ and do not necessarily overlap across different DDI versions. Concerning the search interface, the choice of facets as the least common denominator of all available representations would have required an additional abstraction from various DDI versions and DC to a sensible set of aggregations for the search interface. We therefore use the metadata standard Dublin Core element set version 1.1, as well as three additional fields: OAI set (http://www.openarchives.org/OAI/openarchivesprotocol.html#Set), data provider and metadata provider.

Results

Metadata assessment

We started the metadata collection via the OAI-protocol and collected both the Dublin Core and Data Documentation Initiative metadata formats in the latest versions available (Box 1). This first harvesting (January 2016) resulted in about 470,000 DC v 1.1 and about 67,500 DDI metadata records in versions ranging from 1.1 to 3.1. The harvesting routine for DC metadata took about seven days to complete one full harvesting.

Results also revealed that DDI records often contained only little, if any, more detail than DC records. This is because adopting DDI standards and produced rich metadata using DDI requires more time, effort, and expertise; additionally tool support for DDI is still in its infancy. In order to include as many datasets as possible on the index while creating an index for a faceted search interface, DC was chosen as the minimal metadata standard.

Harvesting

At the time of writing, the gesisDataSearch production system periodically harvests DC metadata from 120 OAI-PMH sets from 58 different data providers, which distribute their metadata through eight different metadata providers (see Table 2, Table 3 (available online only), and Table 4 (available online only)). After a first review of the DC metadata records, some sets were excluded from harvesting as they describe few or no relevant datasets.

Table 2 Number of datasets in the index by metadata provider.
Table 3 Number of datasets in the index by data provider.
Table 4 List of Metadata Provider and OAI-PMH set names used for a preselection of metadata.

Harvesting is executed according to the following automated schedule:

  1. 1

    Initial full harvesting of all OAI-PMH sets for all metadata providers after system setup

  2. 2

    Daily incremental harvesting of metadata records recently added to the sets. A time range starting from the past 48 h to the moment of harvesting covers short-term corrections of erroneously published datasets.

  3. 3

    Yearly full harvesting of all OAI-PMH set for all metadata providers

This produced a total of about 295,000 metadata records that were filtered during the following steps in the processing chain (Data Citation 1).

Indexing

Dublin Core v1.1 is a universal metadata standard aiming at maximum interoperability. It can be applied in various ways to describe objects. Different providers comply differently with the implementation guidelines. Not all providers follow the recommendations and use controlled vocabularies. Others provide substructures, such as key-value pairs in simple text fields. These variations had to be addressed during the creation of the gesisDataSearch index, and are further explained below.

Filtering datasets

The selection of OAI-PMH sets is the first step in filtering metadata records that describe datasets in the social sciences and related fields. Many of the selected sets also contain metadata on objects other than datasets, such as documents, audio files, etc. Therefore, we applied the second level of filtering by excluding those metadata records from our index, that have at least dc:type element (http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=elements#elements-type) with a value matching terms on a curated exclusion list. The list of values excluded by default (currently 483 terms; https://bitbucket.org/cessda/cessda.pasc.indexer/src/e5941c0d9bc4ab5cec86b4bf9c7285de6cf688b8/src/main/resources/application.yml?at=master&fileviewer=file-view-default#application.yml-562) can be extended during runtime using the web-based admin interface. Combined with a re-indexing of parts of the metadata or the whole corpus, the index can be iteratively curated.

Handling multiple languages

DC v.1.1 includes a ‘dc:language’ element http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=elements#language) that should name the language of the resource; the described dataset in this case. Some providers, however, use the ‘language’ element to indicate the language of the metadata.

Further, each element might contain a ‘lang’ attribute, indicating the language of the value of that particular field (http://dublincore.org/schemas/xmls/simpledc20021212.xsd).

As gesisDataSearch should contain as much information as possible, we applied a simple procedure for handling language in our index:

  • If a ‘lang’ attribute of any DC element indicates a language, save the element content as sub-field (e.g., title.en=x, title.fr=y)

  • If no ‘lang’ attribute is given, store the element content as ‘nn’ e.g., title.nn

  • Store all elements’ contents in an additional ‘all’ field e.g., title.all

This makes it possible to let the faceted search interface users choose their preferred language and while still showing metadata content if it is not present in the desired language. The ‘all’ field is used for a per field search (Fig. 1).

Figure 1: The gesisDataSearch user interface.
figure 1

The user interface to the social science data index is implemented as a faceted search portal. Free term search is available either across all fields of the index or within one field, such as subject. All facet search inputs include autocomplete functionality.

Metadata enrichment

The DC elements ‘dc:coverage’ and ‘dc:subject’ have a high topical overlap (http://dublincore.org/usage/decisions/2012/dcterms-changes/); for instance, subject elements often contain location names such as countries. The usage of the ‘coverage’ element – intended to denote spatial and temporal applicability – is very diverse and ranges from standardized dates with milliseconds granularity to relative-time indications such as ‘Early Middle Ages’, and can contain both instants and time ranges.

We addressed this semantic problem by introducing a set of experimental, non-validated fields whose content is the result of a named entity recognition and geocoding. For named entity recognition, the Stanford CoreNLP library v3.6 was used4. Entities that are recognized as locations are forwarded to a geocoding service based on photon (https://github.com/komoot/photon), which uses OpenStreetMap data to provide coordinates to location names (Table 5). The current index contains 76,600 descriptions of datasets for the social sciences and related fields (Fig. 2).

Table 5 Experimental enrichment of metadata using named entity recognition and geocoding.
Figure 2: Visualisation of the contents of metadata.
figure 2

The visualization of the indexed social science metadata was helpful particularly in the early stages, when different metadata standards and versions were compared with respect to their application and use of fields across different metadata providers.

Managing the processing chain

Processing involves a number of services, some of which were developed by GESIS. The harvester is responsible for fetching DC metadata records from various metadata providers via their OAI-PMH endpoints. The DC metadata is stored as XML files in a folder. The indexer application runs on the same machine and processes all files in the metadata folder. The CoreNLP entity recognition is embedded into the indexer application. The indexer further uses the photon geocoding service to retrieve geo coordinates from place names detected by the CoreNLP entity recognition and associates these geo coordinates to indexed metadata record. The Worldbank application integrates both, fetching data from the Worldbank Data Catalog API (http://api.worldbank.org/v2/datacatalog) and indexing into the search index following the same document model that is used by the indexer (Fig. 3).

Figure 3: The gesisDataSearch architecture.
figure 3

The services required to create the index and provide the faceted search interface are either internal and developed or operated by GESIS (within the dotted line) or external (OAI endpoints or World Bank Data Catalog). While it is possible to run all required internal components on one physical machine, the architecture supports well scaling and operation on a cloud infrastructure. CESSDA operates the stack on Google Cloud Platform.

As the index is continuously growing and being curated, we created a possibility to intervene in the procession chain when need, e.g., to get basic statistics, re-index or re-harvest particular OAI-PMH sets, to add or remove selected metadata from the index, or to change the execution schedule (Fig. 4).

Figure 4: Web-based administration tool for spring boot based microservices.
figure 4

The web-based administration tools helps to manage and adapt the long running and repetitive processes. Users can see the resource consumption (network, storage, RAM, CPU), watch the log output to control operation, adapt log level, see and change configuration of each service at runtime, e.g., to adapt the harvesting schedule, to add new OAI endpoints to the harvesting process or to add or remove terms from filter lists.

We developed a remote control (https://github.com/codecentric/spring-boot-admin) for the spring boot based microservices (Fig. 4) which allows us to:

  • review log files and keep track of what is currently being harvested or indexed

  • change configuration during runtime, e.g., add new metadata provider or change data provider labels

  • get e-mails in case of problems

  • review uptime of components

  • execute dedicated functions, e.g., add new OAI-PMH repositories for harvesting or re-index all or selected metadata records during runtime

The index is created with elasticsearch. The open source technology stack elasticsearch v2.4.4 (https://www.elastic.co/downloads/past-releases/elasticsearch-2-4-4), based on Apache Lucene, was chosen as search engine framework, for being scalable and for its good tool support, e.g., with the spring-data-elasticsearch libraries (https://projects.spring.io/spring-data-elasticsearch/). The user interface datasearch.gesis.org (http://datasearch.gesis.org/start; Fig. 1) is based on searchkit (https://github.com/searchkit/searchkit), a collection of user interface components built using the react library (https://facebook.github.io/react/).

Usage Notes

The elasticsearch index ‘DC’ is available as elasticsearch snapshot, created with elasticsearch v 2.4.4 (Data Citation 1). It can be easily restored into an existing elasticsearch instance using the restore snapshot feature (https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-snapshots.html).

Discussion

As of the time of writing (1/2018), the gesisDataSearch production system harvests DC metadata from 120 OAI-PMH sets from 58 different data providers, which distribute their metadata through eight different metadata providers. This results in about 295,000 metadata records. After filtering, the gesisDataSearch index provides 76,600 descriptions of datasets for the social sciences and related fields. This index is a comprehensive service that allows for obtaining relevant and reliable search results in one place.

Table 1 compares the gesisDataSearch index with alternative approaches to searching datasets in the social sciences. This shows that the presented index, accessible through datasearch.gesis.org, is currently the most comprehensive dataset focussing on the social sciences, with the most advanced search and filter possibilities combined with the possibility to review metadata on research data in different languages.

We tried to improve gesisDataSearch through several measures: First, by expanding the number of relevant data providers included in the harvesting process; second, enriching the DC metadata with elements from the DDI standard family; and finally, improving NLP accuracy for place names and date and time detection5.

Methods

To identify relevant metadata providers the available data resources were analysed. Our starting point was information on data archives in archival networks such as CESSDA European Research Infrastructure Consortium (ERIC) (https://www.cessda.eu). Furthermore, we consulted the inventory of data repositories ‘r3data.org’ as well as the metadata portals of DataCite, EUDAT, OpenAIRE, and Dataverse (https://dataverse.org/). As an outcome, 50 repositories were evaluated concerning technological premises, availability of metadata, used metadata formats and documentation level. Among them were archives and research centers, research projects, infrastructure projects, and service providers.

After a manual review of the wide variety of different technical systems for metadata publication a systematic assessment of the provided metadata was made on four levels:

  • Metadata formats

  • Quality of metadata

  • Cross-disciplinary metadata

  • Multilingualism of metadata

Further, the ‘terms of use’ of metadata also varied and needed to be taken into account when creating the index. We only included metadata that is publicly available. However, the referenced datasets themselves might be subject to other terms of use.

We decided to use the open archives initiative protocol for metadata harvesting (OAI-PM) for the retrieval of metadata from data providers. We dismissed alternatives to OAI-PMH, such as web scraping or API, which required both further clarifications of terms of use of scraped data and more human resources for implementing and operating many different APIs. One exception was the Worldbank Data Catalog that provides a basic API to its contents (http://api.worldbank.org/v2/datacatalog). It was included in the index because of the relevance of its content.

Source Code availability

The source code of the core components is publicly available at Bitbucket (https://bitbucket.org/cessda) as part of the CESSDA ERIC:

Additional information

How to cite this article: Krämer, T. et al. A data discovery index for the social sciences. Sci. Data 5:180064 doi: 10.1038/sdata.2018.64 (2018).

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.