Quantifying the impact of public omics data

The amount of omics data in the public domain is increasing every year. Modern science has become a data-intensive discipline. Innovative solutions for data management, data sharing, and for discovering novel datasets are therefore increasingly required. In 2016, we released the first version of the Omics Discovery Index (OmicsDI) as a light-weight system to aggregate datasets across multiple public omics data resources. OmicsDI aggregates genomics, transcriptomics, proteomics, metabolomics and multiomics datasets, as well as computational models of biological processes. Here, we propose a set of novel metrics to quantify the attention and impact of biomedical datasets. A complete framework (now integrated into OmicsDI) has been implemented in order to provide and evaluate those metrics. Finally, we propose a set of recommendations for authors, journals and data resources to promote an optimal quantification of the impact of datasets.

The revised manuscript regarding updates to the Omics Discovery Index has important improvements compared to the prior version of the manuscript. The development and inclusion of the OmicsDI score is a strong addition to the manuscript and is a good starting point for future considerations regarding how to evaluate omics datasets. This manuscript is appropriate for Nature Communications after carrying out important revisions.
The main issue is with the figures and figure legends associated with the main body of the manuscript. Each of the five figure legends is far too brief and does a very poor job describing the content of each figure. Following standard practice, each figure should have a title followed by a detailed description of several sentences. Figures 2-4 are also problematic in general, with small fonts and difficult-to-discern color schemes, for example. Finally, as a minor revision, figure 1 in the supplement is a good figure and helps frame the manuscript. It would be valuable to move this to the main body of the manuscript and then reorder/renumber the current figures.
Editorial Note: This manuscript has been previously reviewed at another journal that is not operating a transparent peer review scheme. This document only contains reviewer comments and rebuttal letters for versions considered at Nature Communications. Mentions of prior referee reports have been redacted.

Reviewer #2 (Remarks to the Author):
The main issue is with the figures and figure legends associated with the main body of the manuscript. Each of the five figure legends is far too brief and does a very poor job describing the content of each figure. Following standard practice, each figure should have a title followed by a detailed description of several sentences. Figures 2-4 are also problematic in general, with small fonts and difficult-to-discern color schemes, for example.

R/ We thank the reviewer for this comment. We did not add more information to the figure legends because we thought the figures were self-explanatory. We agree with the reviewer, and several sentences have been added to the legends of all figures. In addition, we have reviewed the font and resolution of all figures.

Figure 2 contains three different types of reanalysis, which is interesting, but needs reconsideration as a presentation style and careful description of what each panel represents in the legend. Figure 3 has odd positioning of both panels and again suffers from small fonts.

R/ We have added a few sentences to the legend to explain each reanalysis pattern. In addition, we have improved the quality of the figure and the font size. We removed one of the proteomics datasets because its reanalysis pattern is similar to the previous one.

Figure 3 would also be enhanced with panels that show the number of reanalyses of selected individual datasets from year to year. It would be interesting to see if there are differences in times to reanalysis for particular fields and/or datasets. Are there certain datasets that are far ahead of their time and only get reanalyzed several years later?

R/ The main idea we want to convey with this plot is that transcriptomics datasets get reanalyzed even when they are more than 10 years old. We also want to make clear that datasets get attention and impact more quickly than citations to the corresponding manuscript accumulate. We also want to motivate the fields of proteomics and metabolomics to reuse more data. The current plot has been separated by database (GEO, ArrayExpress and PRIDE) to make the trends for the different omics fields clearer.

Finally, as a minor revision, figure 1 in the supplement is a good figure and helps frame the manuscript. It would be valuable to move this to the main body of the manuscript and then reorder/renumber the current figures.

R/ We had the architecture/workflow figure in the main manuscript in the first review iteration, but one of the reviewers suggested moving it into the Supplementary Information. We appreciate this suggestion and agree that the figure should be in the main text to support the claims and explain how we compute each metric.