Indexing and searching petabase-scale nucleotide resources


Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees, and ranks them using the informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at We show that, for a wide range of query lengths, Pebblescout provides a data-driven way of finding relevant subsets of large nucleotide resources, substantially reducing the effort for downstream analysis. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
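To make the indexing and search idea in the abstract concrete, the following is a minimal Python sketch, not Pebblescout's implementation. It assumes minimizer-style sampling as a stand-in for the dense sampling mentioned above, and a simple logarithmic rarity weight as a stand-in for match informativeness. The function names, the toy parameters `k` and `w`, the scoring formula and the example sequences are all illustrative assumptions. With this scheme, any exact match of at least `k + w - 1` bases between a query and a subject is guaranteed to share at least one sampled k-mer.

```python
from collections import defaultdict
import math


def sample_kmers(seq, k=5, w=4):
    """Return sampled k-mers: the lexicographically smallest k-mer in each
    window of w consecutive k-mers (a minimizer scheme, assumed here)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        return set()
    if len(kmers) < w:
        # Sequence too short for a full window: keep its single smallest k-mer.
        return {min(kmers)}
    return {min(kmers[i:i + w]) for i in range(len(kmers) - w + 1)}


def build_index(subjects, k=5, w=4):
    """Map each sampled k-mer to the set of subject names containing it."""
    index = defaultdict(set)
    for name, seq in subjects.items():
        for kmer in sample_kmers(seq, k, w):
            index[kmer].add(name)
    return index


def search(index, query, n_subjects, k=5, w=4):
    """Rank subjects by summed weights of shared sampled k-mers.
    K-mers found in fewer subjects get a larger (hypothetical) weight."""
    scores = defaultdict(float)
    for kmer in sample_kmers(query, k, w):
        hits = index.get(kmer, ())
        for name in hits:
            scores[name] += math.log(1 + n_subjects / len(hits))
    # Highest score first; ties broken by subject name for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy subjects standing in for runs or assemblies (made-up sequences).
subjects = {
    "run_A": "ACGTTGCATGCCGATAGCTA",
    "run_B": "TTTTTTTTTTTTTTTTTTTT",
    "run_C": "GGGACGTTGCATGCCGATCC",
}
query = "ACGTTGCATGCCGAT"

index = build_index(subjects)
results = search(index, query, len(subjects))
for name, score in results:
    print(name, round(score, 3))
```

In this toy example the query occurs verbatim in `run_A` and `run_C`, so both are reported, while `run_B` shares no sampled k-mer with the query and is absent from the ranking. A real index over petabases would need a compact on-disk representation rather than in-memory Python sets.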

Fig. 1: Simplified index creation flowchart.
Fig. 2: Pebblescout and MetaGraph results using HIV-1 genome as the query and the SRA–Microbe database.

Data availability

All runs and assemblies mentioned in the manuscript were publicly available at NCBI on 1 July 2023. Access to the databases is through the search functionality available on the web page Data analysis utilized Nucleotide–Nucleotide BLAST 2.14.1+, SKESA 2.5.1, SAUTE 1.3.2, SPAdes genome assembler v3.15.5 (coronaSPAdes mode), Bowtie 2 version 2.2.6, and websites. Source data are provided with this paper.

Code availability

The metadata for all databases built (WGS, RefSeq and SRA subsets as mentioned in Table 1), queries and output for all applications presented, the software for Pebblescout and an example for building a small database and searching the built database are available at Zenodo (


This research work was supported by the NCBI of the National Library of Medicine, National Institutes of Health. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank E. Yaschenko for navigating the issues related to making the Pebblescout search publicly available and S. Ponomarov for developing its web page. NCBI’s systems team, specifically R. Patterson, was very helpful in providing access to tools for monitoring system loads. We thank D. Lipman for his interest, for suggestions for improving the manuscript and for introducing us to A. Fire and C. Lareau, who tested and used the web page in their work, and to S. Preheim, who suggested the similar metagenomic runs application. We also thank P. Ghosh, B. Robbertse and I. Tolstoy for taking an active interest as early users of Pebblescout, A. Souvorov for independently assessing alignments and assemblies for several runs, V. Schneider for providing extensive editorial comments on drafts of the manuscript and J. Asherman for his comments on the readability of the manuscript from a layman’s perspective.

S.A.S. proposed the presented solution and designed and implemented the software. R.A. managed the project, did testing and found applications. Both authors contributed to building databases, data interpretation and writing the manuscript.

Correspondence to Richa Agarwala.

The authors declare no competing interests.

Nature Methods thanks Artem Babaian, Andre Kahles and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Lei Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.


Supplementary Information

Supplementary dataset discussion, output formats, supplementary applications and assessments, References 41–57, Tables 1–9 and Figs. 1–5.

Source Data Fig. 2

Statistical source data.

Shiryev, S.A., Agarwala, R. Indexing and searching petabase-scale nucleotide resources. Nat Methods 21, 994–1002 (2024).

