Survey and Demographics
To develop useful metrics, we must understand the needs and values of both the researchers who create and use data, and of the data managers who preserve and publish it. Before starting work on Making Data Count, we had conducted a survey of researcher perspectives on data publication that touched briefly on metrics of impact4. Respondents to that survey found citation and download counts much more useful than search rank or altmetrics. For this project, we expanded on that work with new surveys for researchers and data managers about data sharing, discovery, and metrics.
In November and December of 2014, we solicited responses to a pair of online surveys via social media, listservs, and posts to CDL and PLOS blogs—ultimately hearing from 247 researchers and 73 data managers.
Data managers represented primarily academic (64%) and government-run (22%) repositories—in the United States (72%) or United Kingdom (11%).
Most (78%) of the researchers were academics—working in the United States (57%) or United Kingdom (14%).
Researchers covered the academic career spectrum: from principal investigators (42%), to postdocs (21%), and graduate students (19%).
More than half (53%) were biologists, but environmental (17%) and social (10%) science were also well represented.
The survey was approved by the University of California, Berkeley Committee for Protection of Human Subjects/Office for the Protection of Human Subjects (protocol ID 2014-10-6794). Anonymized individual responses are available in the University of California's Merritt repository5.
Researchers who share data want to know who is using their data and for what purpose6,10, and this understandable wish may contribute to the popularity of sharing on request. Public repositories could at least partially satisfy depositor interest while still making data openly available by collecting some amount of information about data users. When we asked researchers what they want to know about users of their data, nearly half selected the most detailed option, ‘name and contact information,’ as their first choice (Fig. 1d). Researchers expressed little interest in communities defined by geography or affiliation, but average interest in knowing a user's scientific discipline was just as high as in knowing their name. Discipline was even more clearly a priority for data managers; 67% of them ranked it first.
For practical purposes, we wanted to know how these preferences compare to current repository practices, so we asked data managers what information their repositories collect. We found that collection of user information is highly polarized. Half of the repositories require users to supply detailed contact information (names, 47%; email addresses, 44%), and half (47%) do not collect any information at all. On one hand, in cases involving sensitive data, repositories may need to know users' identities to verify that their requests are legitimate and to hold them to data use agreements; on the other, open and frictionless access to data makes it easier to use. As a compromise, repositories might ask users about the discipline in which they plan to use a dataset. This would preserve user anonymity while providing data managers and creators with a valuable sense of how the data is being used.
Metrics of Impact
Although there are many potential audiences for data metrics, including administrators, funders, and data managers11, the most invested are undoubtedly the researchers who create the data. Even if no information about data users is collected, metrics of use can demonstrate to researchers that their data is of interest and that the effort to share it was not wasted. Data managers also have a stake in knowing about use of their data to tailor services and justify funding. To learn about the immediate interests of these groups, we asked them to rank several potential metrics of impact (Fig. 1e).
Majorities of both researchers and data managers ranked landing page views as the least interesting metric. Researchers consistently ranked downloads as the second most interesting, while data managers (who were given a longer list to rank) put them in the middle. Resoundingly, both researchers and data managers measure scholarly prestige in citations; 85% of researchers and 61% of data managers ranked citations as the most interesting metric. This ordering (viewing, downloading, citing) is consistent with the results of our previous researcher survey4. It depicts a ladder of increasing engagement with the data over time and suggests an entirely sensible weighting of metrics.
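As an illustration of how first-choice percentages of this kind can be derived from ranked responses, the following is a minimal Python sketch. The function name `first_choice_shares` and the sample responses are hypothetical and invented for this example; they are not the survey data or the authors' analysis code.

```python
from collections import Counter


def first_choice_shares(rankings):
    """Return the percentage of respondents who ranked each metric first.

    `rankings` is a list of lists; each inner list holds one respondent's
    metrics ordered from most to least interesting. Empty responses are
    skipped. Metrics that nobody ranked first do not appear in the result.
    """
    firsts = Counter(r[0] for r in rankings if r)
    total = sum(firsts.values())
    if total == 0:
        return {}
    return {metric: 100.0 * n / total for metric, n in firsts.items()}


# Hypothetical responses, invented for illustration only.
sample = [
    ["citations", "downloads", "page views"],
    ["citations", "downloads", "page views"],
    ["downloads", "citations", "page views"],
    ["citations", "page views", "downloads"],
]
shares = first_choice_shares(sample)
print(shares)
```

With these made-up responses, three of four respondents place citations first, so the sketch reports a 75% first-choice share for citations.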
We asked data managers what metrics or statistics their repositories already track (Fig. 1f). A majority track landing page views and almost all track downloads. Despite high interest in citations, relatively few repositories track them, presumably because they are much more challenging to capture. Surprisingly few repositories report the metrics they track. Approximately one-third of the repositories that track each metric expose it through a programming interface or display it to site visitors. Limiting metrics to internal use shuts out many potential audiences and frustrates any cross-repository comparison of impact.
Citations are the coin of the academic realm, but their present usefulness for data is limited because datasets are rarely cited formally12. A 2011 survey of social science papers best illuminates current practice: only 17% of papers that used published data cited it in the reference list, roughly the same percentage as in 1995 (refs 13,14). However, we do believe that the situation will improve.
Researchers themselves strongly favor formal data citation. In a cross-disciplinary 2011 survey, 95% of respondents agreed that formal citation is a fair condition for data sharing, as did 87% of astrobiologists in a follow-up survey6,9. Citation ‘in the references like normal publications’ is the preferred method of receiving credit for data sharing by 71% of biodiversity researchers and by 75% of respondents to our earlier survey4,15.
In 2014, the scholarly communication community arrived at a Joint Declaration of Data Citation Principles with formal citation at its core16. The Joint Declaration has since been endorsed by 94 repositories, publishers, and scholarly organizations—including DataCite, CODATA, and Nature Publishing Group (http://go.nature.com/NHOqUp). Scientific Data data descriptors, for example, are formatted in accordance with the Joint Declaration, and each includes at least one formal data citation.
In contrast to citations, repositories can easily track data landing page views and downloads today. Neither researchers nor data managers put any weight on page views but, for researchers, downloads are a highly regarded second-choice metric. Responses to our previous survey suggest that the gap in perceived value between citations and downloads is surprisingly narrow4. Most repositories already track downloads, and we strongly recommend that more of them make download counts public.
This survey provides several clear points of guidance for Making Data Count and other data metrics initiatives. In the short term, page views and social media activity can be de-emphasized because of low status and lack of data-related activity, respectively. Although challenging to capture, citations should be emphasized and collected as completely as possible. Downloads should be emphasized as, at present, a happy medium: both reasonably valuable and reasonably easy to measure.
In response to background research—including the results we have presented here—we are adapting the existing PLOS Article Level Metric tool (http://alm.plos.org) to capture and present metrics for datasets. This tool currently collects data from 13 sources, including bookmarking services (e.g., CiteULike, Mendeley), social media (e.g., Facebook, Twitter), and DataCite metadata (http://mds.datacite.org/). One previously successful approach to capturing informal citations17 that we are taking in Making Data Count is to search the full text of articles in several open access corpora (BioMed Central, Europe PubMed Central, and PLOS) for dataset identifiers. We are also using the Nature OpenSearch API as an additional source of dataset citations in the scholarly literature.
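The full-text search for dataset identifiers described above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: it assumes the identifiers are DOIs, uses a commonly cited DOI-matching regular expression, and the function name `find_dataset_dois` and the example identifiers are made up for this sketch.

```python
import re

# A widely used pattern for modern DOIs; an assumption of this sketch,
# since the source does not say which identifier schemes are searched.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")


def find_dataset_dois(full_text, dataset_dois):
    """Return the known dataset DOIs that are mentioned in `full_text`.

    Matching is case-insensitive, and trailing punctuation that the
    pattern may swallow at the end of a sentence is stripped.
    """
    known = {doi.lower() for doi in dataset_dois}
    found = set()
    for match in DOI_PATTERN.finditer(full_text):
        candidate = match.group(0).rstrip(".,;:)").lower()
        if candidate in known:
            found.add(candidate)
    return found


# Hypothetical article text and identifiers, for illustration only.
article = (
    "Data were obtained from a public repository "
    "(doi:10.5061/dryad.abc12). "
    "See also https://doi.org/10.1234/example.5678."
)
mentions = find_dataset_dois(article, ["10.5061/dryad.abc12"])
print(mentions)
```

Because the search matches identifiers anywhere in the text, it can pick up informal mentions in methods sections or acknowledgements as well as entries in a reference list.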
As of this writing, we have imported 94,752 DataONE datasets; new datasets are being added as they are published. Modifications to the DataONE network now enable it to track COUNTER (http://www.projectcounter.org/) compliant download statistics and expose them via API for the DLM tool to collect. In the future, we plan to also import datasets published by the Dryad general-purpose repository (http://datadryad.org/). Once these steps are completed and the tool is able to compile the full range of metrics discussed here, the final phase of development—to be completed in the Fall of 2015—will be to present the outcome via a web-friendly reporting and visualization tool that gives users easy access to the data for further analysis.
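Standards-compliant download counting typically involves more than tallying raw requests; for example, rapid repeat requests from the same user are usually collapsed into one. The sketch below illustrates that single idea only. It is not an implementation of the COUNTER Code of Practice: the 30-second window, the event format, and the function name `count_downloads` are assumptions made for this example, and robot filtering is omitted.

```python
def count_downloads(events, window_seconds=30):
    """Count downloads per dataset, collapsing rapid repeat requests.

    `events` is an iterable of (timestamp_seconds, user_id, dataset_id)
    tuples. A request is not counted if the same user requested the same
    dataset within `window_seconds` of their previous request for it.
    The window length is an assumed parameter, not taken from the source.
    """
    last_seen = {}  # (user_id, dataset_id) -> timestamp of latest request
    counts = {}     # dataset_id -> number of counted downloads
    for timestamp, user, dataset in sorted(events):
        key = (user, dataset)
        previous = last_seen.get(key)
        if previous is None or timestamp - previous > window_seconds:
            counts[dataset] = counts.get(dataset, 0) + 1
        last_seen[key] = timestamp
    return counts


# Hypothetical request log, invented for illustration only.
log = [
    (0, "u1", "d1"),
    (10, "u1", "d1"),   # repeat within the window, not counted
    (100, "u1", "d1"),  # outside the window, counted again
    (5, "u2", "d1"),
    (0, "u1", "d2"),
]
totals = count_downloads(log)
print(totals)
```

Applying the same filtering rules at every repository is what makes download counts comparable across repositories.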
While we would be pleased to see more sophisticated schemes to apportion scholarly credit and facilitate knowledge discovery18,
How to cite this article: Kratz, J. & Strasser, C. Making data count. Sci. Data 2:150039 doi: 10.1038/sdata.2015.39 (2015).
Making Data Count is funded by National Science Foundation (NSF) grant number 1448821.
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0
Metadata associated with this Data Descriptor is available at http://www.nature.com/sdata/ and is released under the CC0 waiver to maximize reuse.