The reality of the genomics age is that there are many very large data sets that are most usefully saved and manipulated in electronic form. Many journals add online ‘supplementary material’ to articles as a service to authors wishing to publish volumes of such data that cannot be accommodated within the body of an article.
Supplementary-material collections maintained by publishers serve as archival repositories directly connected with the peer-reviewed scientific literature, often competing with or substituting for the deposition of data in public repositories.
To assess how useful these collections are, we investigated supplementary-data archives for data from gene-expression profiling, a widely used experimental protocol for which international standards for data representation have been developed.
We anticipated that such archives might be a useful source of data. But to our dismay, it was impossible to systematically analyse our sample, taken from 10,128 papers in 139 journals. No standards for organizing supplementary-data collections have been adopted either across journals or even for supplementary-data collections associated with articles in the same journal.
Data are represented in an enormous range of different file formats, from raw data files (such as Affymetrix .cel files) to spreadsheets (xls file extensions), documents (doc and pdf) and text files (txt and csv). Within documents there are no standards for data organization: different documents provide different numbers of columns, contain both differential and absolute expression values, and often have few details about the signal processing applied to obtain data. We also encountered a significant number of typographic errors in gene names, database accession numbers and data-set identifiers.
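The spread of formats described above can be made concrete with a simple tally of file extensions across a collection. The following is a minimal sketch in Python, not the procedure used for the survey reported here; the function name `tally_extensions` and the example filenames are hypothetical.

```python
from collections import Counter
from pathlib import PurePosixPath


def tally_extensions(filenames):
    """Count files by extension, ignoring case (hypothetical helper)."""
    counts = Counter()
    for name in filenames:
        # ".CEL" and ".cel" are treated as the same format.
        suffix = PurePosixPath(name).suffix.lower().lstrip(".")
        # Files with no extension are grouped under a placeholder label.
        counts[suffix or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Illustrative filenames only; not taken from any real archive.
    sample = ["chipA.CEL", "chipB.cel", "table1.xls", "methods.pdf", "readme"]
    for ext, n in tally_extensions(sample).most_common():
        print(ext, n)
```

A tally of this kind shows only how many formats are in use; it says nothing about how the data inside each file are organized.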
There are public repositories for gene-expression profile data (Stanford MicroArray Database, the US National Center for Biotechnology Information Gene Expression Omnibus and the ArrayExpress repository at the European Bioinformatics Institute). We compared the accessibility of gene-expression profile data in public repositories with accessibility of data in supplementary-data archives. The public repositories provide numerous search and retrieval tools, including unique accession numbers and the ability to search by specimen, platform and profile data. Publishers' supplementary-materials archives provided none of these features. As a result, relevant data are far harder to locate than in public repositories.
These findings are not limited to gene-expression data. Even within the same journal, there is no consistency in reporting or format among bioinformatics resources. File extensions for documents, figures and movies include xls, doc, eps, jpg, tif, gif, pdf, ppt, qt, asf, wma and wmv. Supplementary collections may or may not include long lists of links, be compressed into zip files or offer the option of including the supplementary material as part of the downloadable document containing the printed version of the article.
Supplementary data often represent the raw experimental values and are especially important for researchers in the same field. Among the advantages of storing these data in public repositories are the integration of information with the community knowledge resources and the ability to track and maintain computer-readable associations between data sets.
On the basis of our analysis, we recommend that scientific journals adopt a policy, similar to Nature's (see www.nature.com/nature/authors/policy/), of requiring that authors submit data to public repositories, if relevant repositories exist, and that the journal version should contain accession numbers, URLs and other appropriate specific indicators to the data source in the repositories.
Journals' supplementary-data archives should be restricted to idiosyncratic and nonstandard data types for which no public repository exists. Only then can community standards emerge.
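One practical benefit of citing repository accession numbers is that they can be checked mechanically, which would catch some of the typographic errors noted earlier. The sketch below, in Python, assumes two identifier shapes: Gene Expression Omnibus accessions of the form GSE, GSM, GPL or GDS followed by digits, and ArrayExpress accessions of the form E-XXXX-n. These patterns, and the function name `classify_accession`, are illustrative assumptions rather than official repository specifications.

```python
import re

# Assumed identifier shapes, for illustration only.
ACCESSION_PATTERNS = {
    "GEO": re.compile(r"^G(SE|SM|PL|DS)\d+$"),
    "ArrayExpress": re.compile(r"^E-[A-Z]{4}-\d+$"),
}


def classify_accession(token):
    """Return the repository whose assumed pattern the token matches, or None."""
    cleaned = token.strip()
    for repository, pattern in ACCESSION_PATTERNS.items():
        if pattern.match(cleaned):
            return repository
    return None


if __name__ == "__main__":
    # "GSE12O4" contains the letter O in place of a zero, so it is rejected.
    for token in ["GSE1234", "E-MEXP-123", "GSE12O4"]:
        print(token, classify_accession(token))
```

A pattern match confirms only that an identifier is well formed; confirming that the record exists would still require a query to the repository itself.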
Santos, C., Blake, J. & States, D. Supplementary data need to be kept in public repositories. Nature 438, 738 (2005). https://doi.org/10.1038/438738a