This paper describes a novel search index for social and economic research data, one that enables users to search up-to-date references for data holdings in these disciplines. The index can be used for comparative analysis of publication of datasets in different areas of social science. The core of the index is the da|ra registration agency’s database for social and economic data, which contains high-quality searchable metadata from registered data publishers. Research data’s metadata records are harvested from data providers around the world and included in the index. In this paper, we describe the currently available indices on social science datasets and their shortcomings. Next, we describe the motivation behind and the purpose for the data discovery index as a dedicated and curated platform for finding social science research data and gesisDataSearch, its user interface. Further, we explain the harvesting, filtering and indexing procedure and give usage instructions for the dataset index. Lastly, we show that the index is currently the most comprehensive and most accessible collection of social science data descriptions available.
In information infrastructure projects and initiatives one aspiration is to develop data sharing as a common part of scientific culture and practice. Achieving this goal is largely dependent on having internationally compatible infrastructures that facilitate sustainable data references, as well as integrated search and retrieval capabilities within research data. There is an obvious need for a comprehensive service that unifies data sources and allows for retrieving relevant and reliable search results as quickly as possible. Archives and repositories, such as ICPSR (https://www.icpsr.umich.edu), GESIS (https://www.gesis.org), UK Data (http://www.data-archive.ac.uk/ and the Dataverse Project1, provide social science and economic data on their websites, (https://dataverse.org/). For example, the r3data.org database lists 201 social science repositories and 146 economics repositories (http://www.re3data.org/browse/by-subject)2. When searching for appropriate data, social scientists must use distributed services that are based on different systems and retrieval techniques.
Dedicated, discipline-specific social sciences and the economics search facilities are still missing. There are some initiatives underway to support the research community in this respect, but none with its sole focus on social sciences datasets and none with the purpose of providing advanced searches on high quality metadata by means of a curated set of harvested repositories (see Table 1). Advanced searches include the choice of search term operators (or vs. and) as well as searching at the field level, in contrast to searching for term in all fields.
To address this demand, the gesisDataSearch project (http://datasearch.gesis.org/start) was initiated. Its purpose is to create a central search point, enabling social scientists to look up or filter potential datasets quickly, to access dataset metadata and decide on its relevance for their work, and for citation purposes or reusing a dataset. This aim is achieved through a faceted search interface.
The point of departure, and core of the project, was the DOI Registration agency for social and economic data database of the DOI Registration agency for social and economic data (da|ra, https://www.da-ra.de)3 that already includes searchable metadata from registered data centers, among them the considerable holdings of the German GESIS Data Archive (https://www.gesis.org/en/services/data-analysis/), and the US American Data Archive ICPSR. Together with data references of other relevant international data providers the content of this database was included in the search index after a systematic assessment. For this assessment, we harvested metadata in both standards, Dublin Core (DC, see http://dublincore.org/documents/DCes/) and Data Documentation Initiative (DDI, see http://www.DDIalliance.org/).
The assessment of the availability and quality of metadata records on datasets in different metadata standards showed that the more detailed DDI standard is not yet adopted by many social science institutions, resulting in lower numbers compared to records available in Dublin Core.
To provide a user-friendly search of a comprehensive social science research data collection, the search scope is more important than its depth. Further, DDI offers hundreds of elements, which differ and do not necessarily overlap across different DDI versions. Concerning the search interface, the choice of facets as the least common denominator of all available representations would have required an additional abstraction from various DDI versions and DC to a sensible set of aggregations for the search interface. We therefore use the metadata standard Dublin Core element set version 1.1, as well as three additional fields: OAI set (http://www.openarchives.org/OAI/openarchivesprotocol.html#Set), data provider and metadata provider.
We started the metadata collection via the OAI-protocol and collected both the Dublin Core and Data Documentation Initiative metadata formats in the latest versions available (Box 1). This first harvesting (January 2016) resulted in about 470,000 DC v 1.1 and about 67,500 DDI metadata records in versions ranging from 1.1 to 3.1. The harvesting routine for DC metadata took about seven days to complete one full harvesting.
Results also revealed that DDI records often contained only little, if any, more detail than DC records. This is because adopting DDI standards and produced rich metadata using DDI requires more time, effort, and expertise; additionally tool support for DDI is still in its infancy. In order to include as many datasets as possible on the index while creating an index for a faceted search interface, DC was chosen as the minimal metadata standard.
At the time of writing, the gesisDataSearch production system periodically harvests DC metadata from 120 OAI-PMH sets from 58 different data providers, which distribute their metadata through eight different metadata providers (see Table 2, Table 3 (available online only), and Table 4 (available online only)). After a first review of the DC metadata records, some sets were excluded from harvesting as they describe few or no relevant datasets.
Harvesting is executed according to the following automated schedule:
Initial full harvesting of all OAI-PMH sets for all metadata providers after system setup
Daily incremental harvesting of metadata records recently added to the sets. A time range starting from the past 48 h to the moment of harvesting covers short-term corrections of erroneously published datasets.
Yearly full harvesting of all OAI-PMH set for all metadata providers
Dublin Core v1.1 is a universal metadata standard aiming at maximum interoperability. It can be applied in various ways to describe objects. Different providers comply differently with the implementation guidelines. Not all providers follow the recommendations and use controlled vocabularies. Others provide substructures, such as key-value pairs in simple text fields. These variations had to be addressed during the creation of the gesisDataSearch index, and are further explained below.
The selection of OAI-PMH sets is the first step in filtering metadata records that describe datasets in the social sciences and related fields. Many of the selected sets also contain metadata on objects other than datasets, such as documents, audio files, etc. Therefore, we applied the second level of filtering by excluding those metadata records from our index, that have at least dc:type element (http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=elements#elements-type) with a value matching terms on a curated exclusion list. The list of values excluded by default (currently 483 terms; https://bitbucket.org/cessda/cessda.pasc.indexer/src/e5941c0d9bc4ab5cec86b4bf9c7285de6cf688b8/src/main/resources/application.yml?at=master&fileviewer=file-view-default#application.yml-562) can be extended during runtime using the web-based admin interface. Combined with a re-indexing of parts of the metadata or the whole corpus, the index can be iteratively curated.
Handling multiple languages
DC v.1.1 includes a ‘dc:language’ element http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=elements#language) that should name the language of the resource; the described dataset in this case. Some providers, however, use the ‘language’ element to indicate the language of the metadata.
Further, each element might contain a ‘lang’ attribute, indicating the language of the value of that particular field (http://dublincore.org/schemas/xmls/simpledc20021212.xsd).
As gesisDataSearch should contain as much information as possible, we applied a simple procedure for handling language in our index:
If a ‘lang’ attribute of any DC element indicates a language, save the element content as sub-field (e.g., title.en=x, title.fr=y)
If no ‘lang’ attribute is given, store the element content as ‘nn’ e.g., title.nn
Store all elements’ contents in an additional ‘all’ field e.g., title.all
This makes it possible to let the faceted search interface users choose their preferred language and while still showing metadata content if it is not present in the desired language. The ‘all’ field is used for a per field search (Fig. 1).
The DC elements ‘dc:coverage’ and ‘dc:subject’ have a high topical overlap (http://dublincore.org/usage/decisions/2012/dcterms-changes/); for instance, subject elements often contain location names such as countries. The usage of the ‘coverage’ element – intended to denote spatial and temporal applicability – is very diverse and ranges from standardized dates with milliseconds granularity to relative-time indications such as ‘Early Middle Ages’, and can contain both instants and time ranges.
We addressed this semantic problem by introducing a set of experimental, non-validated fields whose content is the result of a named entity recognition and geocoding. For named entity recognition, the Stanford CoreNLP library v3.6 was used4. Entities that are recognized as locations are forwarded to a geocoding service based on photon (https://github.com/komoot/photon), which uses OpenStreetMap data to provide coordinates to location names (Table 5). The current index contains 76,600 descriptions of datasets for the social sciences and related fields (Fig. 2).
Managing the processing chain
Processing involves a number of services, some of which were developed by GESIS. The harvester is responsible for fetching DC metadata records from various metadata providers via their OAI-PMH endpoints. The DC metadata is stored as XML files in a folder. The indexer application runs on the same machine and processes all files in the metadata folder. The CoreNLP entity recognition is embedded into the indexer application. The indexer further uses the photon geocoding service to retrieve geo coordinates from place names detected by the CoreNLP entity recognition and associates these geo coordinates to indexed metadata record. The Worldbank application integrates both, fetching data from the Worldbank Data Catalog API (http://api.worldbank.org/v2/datacatalog) and indexing into the search index following the same document model that is used by the indexer (Fig. 3).
As the index is continuously growing and being curated, we created a possibility to intervene in the procession chain when need, e.g., to get basic statistics, re-index or re-harvest particular OAI-PMH sets, to add or remove selected metadata from the index, or to change the execution schedule (Fig. 4).
review log files and keep track of what is currently being harvested or indexed
change configuration during runtime, e.g., add new metadata provider or change data provider labels
get e-mails in case of problems
review uptime of components
execute dedicated functions, e.g., add new OAI-PMH repositories for harvesting or re-index all or selected metadata records during runtime
The index is created with elasticsearch. The open source technology stack elasticsearch v2.4.4 (https://www.elastic.co/downloads/past-releases/elasticsearch-2-4-4), based on Apache Lucene, was chosen as search engine framework, for being scalable and for its good tool support, e.g., with the spring-data-elasticsearch libraries (https://projects.spring.io/spring-data-elasticsearch/). The user interface datasearch.gesis.org (http://datasearch.gesis.org/start; Fig. 1) is based on searchkit (https://github.com/searchkit/searchkit), a collection of user interface components built using the react library (https://facebook.github.io/react/).
The elasticsearch index ‘DC’ is available as elasticsearch snapshot, created with elasticsearch v 2.4.4 (Data Citation 1: Zenodo http://doi.org/10.5281/zenodo.896431). It can be easily restored into an existing elasticsearch instance using the restore snapshot feature (https://www.elastic.co/guide/en/elasticsearch/reference/2.4/modules-snapshots.html).
As of the time of writing (1/2018), the gesisDataSearch production system harvests DC metadata from 120 OAI-PMH sets from 58 different data providers, which distribute their metadata through eight different metadata providers. This results in about 295,000 metadata records. After filtering, the gesisDataSearch index provides 76,600 descriptions of datasets for the social sciences and related fields. This index is a comprehensive service that allows for obtaining relevant and reliable search results in one place.
Table 1 compares the gesisDataSearch index with alternative approaches to searching datasets in the social sciences. This shows that the presented index, accessible through datasearch.gesis.org, is currently the most comprehensive dataset focussing on the social sciences, with the most advanced search and filter possibilities combined with the possibility to review metadata on research data in different languages.
We tried to improve gesisDataSearch through several measures: First, by expanding the number of relevant data providers included in the harvesting process; second, enriching the DC metadata with elements from the DDI standard family; and finally, improving NLP accuracy for place names and date and time detection5.
To identify relevant metadata providers the available data resources were analysed. Our starting point was information on data archives in archival networks such as CESSDA European Research Infrastructure Consortium (ERIC) (https://www.cessda.eu). Furthermore, we consulted the inventory of data repositories ‘r3data.org’ as well as the metadata portals of DataCite, EUDAT, OpenAIRE, and Dataverse (https://dataverse.org/). As an outcome, 50 repositories were evaluated concerning technological premises, availability of metadata, used metadata formats and documentation level. Among them were archives and research centers, research projects, infrastructure projects, and service providers.
After a manual review of the wide variety of different technical systems for metadata publication a systematic assessment of the provided metadata was made on four levels:
Quality of metadata
Multilingualism of metadata
Source Code availability
The source code of the core components is publicly available at Bitbucket (https://bitbucket.org/cessda) as part of the CESSDA ERIC:
How to cite this article: Krämer, T. et al. A data discovery index for the social sciences. Sci. Data 5:180064 doi: 10.1038/sdata.2018.64 (2018).
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Krämer, T. Zenodo http://doi.org/10.5281/zenodo.896431 (2017)
The project was funded by the German Research Foundation (DFG) under the grant number SU 647/132. It runs from 2014–2018.