  • Brief Communication

Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA

A preprint version of the article is available at bioRxiv.

Abstract

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.
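The abstract does not detail the data structure, so the following is only a minimal sketch of the general idea behind k-mer indexes built on Bloom filters (refs. 11 and 21): each sample is summarized by one Bloom filter of its k-mers, and a query reports the fraction of its k-mers found in each sample. All names (`BloomFilter`, `build_index`, `query`), the filter size, the number of hash functions and the 0.5 threshold are illustrative assumptions, not the kmindex implementation or its API; canonical (reverse-complement) k-mers and on-disk layout are deliberately ignored.

```python
import hashlib


def kmers(seq, k):
    """Yield every k-long substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class BloomFilter:
    """Toy Bloom filter: false positives are possible, false negatives are not."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # One salted hash per hash function (an assumption made for simplicity).
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def build_index(samples, k, n_bits=1 << 16, n_hashes=3):
    """Build one Bloom filter per sample from its reads."""
    index = {}
    for name, reads in samples.items():
        bf = BloomFilter(n_bits, n_hashes)
        for read in reads:
            for km in kmers(read, k):
                bf.add(km)
        index[name] = bf
    return index


def query(index, seq, k, threshold=0.5):
    """Return {sample: fraction of query k-mers present} for samples above threshold."""
    query_kmers = list(kmers(seq, k))
    if not query_kmers:
        return {}
    hits = {}
    for name, bf in index.items():
        ratio = sum(km in bf for km in query_kmers) / len(query_kmers)
        if ratio >= threshold:
            hits[name] = ratio
    return hits


if __name__ == "__main__":
    samples = {"s1": ["ACGTACGTGGA"], "s2": ["TTTTTTTTTTT"]}
    index = build_index(samples, k=5)
    print(query(index, "ACGTACGTGG", k=5))
```

Because a Bloom filter never misses an inserted item, a query taken from a sample always scores 1.0 against that sample; spurious hits come only from false positives, which is why the false positive rate quoted above matters for precision.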

Fig. 1: kmindex: an overview of the data structure and the query process.

Data availability

A list of publicly available data used in this work is presented in the https://github.com/pierrepeterlongo/kmindex_benchmarks repository (ref. 40).

Code availability

kmindex is open-source software available at https://github.com/tlemane/kmindex (ref. 41). The documentation is available at https://tlemane.github.io/kmindex/. The exhaustive list of tool versions and commands used is presented on a companion website (ref. 40), which also reports the false-positive (FP) computation protocols and a detailed description of the dataset considered for this benchmark. The ORA server code is available through a GitLab repository (ref. 27).

References

  1. Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).

  2. Paoli, L. et al. Biosynthetic potential of the global ocean microbiome. Nature 607, 111–118 (2022).

  3. Katz, K. et al. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 50, D387–D390 (2022).

  4. Chikhi, R., Holub, J. & Medvedev, P. Data structures to represent a set of k-long DNA sequences. ACM Comput. Surv. 54, 1–22 (2021).

  5. Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).

  6. Pierce, N. T., Irber, L., Reiter, T., Brooks, P. & Brown, C. T. Large-scale sequence comparisons with sourmash. F1000Research 8, 1006 (2019).

  7. Darvish, M., Seiler, E., Mehringer, S., Rahn, R. & Reinert, K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics 38, 4100–4108 (2022).

  8. Karasikov, M. et al. Metagraph: indexing and analysing nucleotide archives at petabase-scale. Preprint at bioRxiv https://doi.org/10.1101/2020.10.01.322164 (2020).

  9. Holley, G. & Melsted, P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 21, 249 (2020).

  10. Cracco, A. & Tomescu, A. I. Extremely fast construction and querying of compacted and colored de Bruijn graphs with ggcat. Genome Res. 33, 1198–1207 (2023).

  11. Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).

  12. Bingmann, T., Bradley, P., Gauger, F. & Iqbal, Z. COBS: a Compact Bit-Sliced Signature index. In String Processing and Information Retrieval, SPIRE 2019, Lecture Notes in Computer Science Vol. 11811 (Springer, Cham, 2019).

  13. Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. J. Comput. Biol. 25, 755–765 (2018).

  14. Harris, R. S. & Medvedev, P. Improved representation of sequence Bloom trees. Bioinformatics 36, 721–727 (2020).

  15. Srikakulam, S. K., Keller, S., Dabbaghie, F., Bals, R. & Kalinina, O. V. Metaprofi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants. Bioinformatics 39, btad101 (2023).

  16. The Ocean Read Atlas. OSU Institut Pytheas https://ocean-read-atlas.mio.osupytheas.fr/ (2023).

  17. Sunagawa, S. et al. Tara Oceans: towards global ocean ecosystems biology. Nat. Rev. Microbiol. 18, 428–445 (2020).

  18. Alanko, J. N., Vuohtoniemi, J., Mäklin, T. & Puglisi, S. J. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 39, i260–i269 (2023).

  19. Mehringer, S. et al. Hierarchical interleaved Bloom filter: enabling ultrafast, approximate sequence queries. Genome Biol. 24, 131 (2023).

  20. Marchet, C. & Limasset, A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 39, i252–i259 (2023).

  21. Lemane, T., Medvedev, P., Chikhi, R. & Peterlongo, P. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinform. Adv. 2, vbac029 (2022).

  22. Villar, E. et al. The Ocean Gene Atlas: exploring the biogeography of plankton genes online. Nucleic Acids Res. 46, W289–W295 (2018).

  23. Vernette, C. et al. The Ocean Gene Atlas v2.0: online exploration of the biogeography and phylogeny of plankton genes. Nucleic Acids Res. 50, W516–W526 (2022).

  24. Acinas, S. G. et al. Deep ocean metagenomes provide insight into the metabolic architecture of bathypelagic microbial communities. Commun. Biol. 4, 604 (2021).

  25. Robidou, L. & Peterlongo, P. findere: fast and precise approximate membership query. In International Symposium on String Processing and Information Retrieval 151–163 (Springer, 2021).

  26. fio. GitHub https://github.com/axboe/fio (2023).

  27. DOI of the provided ORA server GitLab code. Zenodo https://doi.org/10.5281/zenodo.10462412 (2024).

  28. European Nucleotide Archive. European Bioinformatics Institute https://www.ebi.ac.uk/ena/ (2023).

  29. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Registry of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875582 (2017).

  30. Guidi, L., Gattuso, J.-P. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about carbonate chemistry in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875567 (2017).

  31. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Biodiversity context of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.853809 (2015).

  32. Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about pigment concentrations (HPLC) in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875569 (2017).

  33. Ardyna, M. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about mesoscale features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875577 (2017).

  34. Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about nutrients in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875575 (2017).

  35. Guidi, L., Picheral, M. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about sensor data in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875576 (2017).

  36. Alberti, A. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Methodology used in the lab for molecular analyses and links to the Sequence Read Archive of selected samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875581 (2017).

  37. Speich, S. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about the water column features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875579 (2017).

  38. Overview. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/ (2023).

  39. Interfaces. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/interfaces.html (2023).

  40. pierrepeterlongo/kmindex_benchmarks: initial release. Zenodo https://doi.org/10.5281/zenodo.10462379 (2024).

  41. DOI of the kmindex GitHub repository. Zenodo https://doi.org/10.5281/zenodo.10462427 (2024).

Acknowledgements

We acknowledge the GenOuest core facility (https://www.genouest.org) and the TGCC (https://www-hpc.cea.fr/index-en.html) for providing the computing infrastructure, as well as France Génomique for funding of the TGCC computing resources used to process data used in this article. The authors thank J.-M. Aury for his help regarding the usage of the Tara Oceans datasets. Tara Oceans (which includes both the Tara Oceans and Tara Oceans Polar Circle expeditions) would not exist without the leadership of the Tara Ocean Foundation and the continuous support of Tara Oceans consortium members. The authors also thank K. Andre and M. Harun for their help regarding the usage of MetaGraph, A. Cracco and A. Tomescu for their help using ggcat, and C. Marchet and A. Limasset for their support using PAC. The web server is hosted by the OSU Pythéas cluster with the help of C. Blanpain and SIP members. A. Malgoyre from SIP is thanked for the development of the OSU Pythéas GitLab. The work was funded by ANR SeqDigger (ANR-19-CE45-0008) and the IPL Inria Neuromarkers, and received some support from the French government under the France 2030 investment plan, as part of the Initiative d’Excellence d’Aix-Marseille Université - A*MIDEX - Institute of Ocean Sciences (AMX-19-IET-016). This work is part of the ALPACA project that has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement nos. 956229 and 872539 (PANGAIA). R.C. was supported by ANR Full-RNA, Inception and PRAIRIE grants (ANR-22-CE45-0007, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

T.L., E.P., R.C. and P.P. conceptualized the project. T.L., R.C. and P.P. developed the methodology. T.L. implemented the software. T.L. and P.P. conceived and conducted the experiments. M.L. and E.P. provided the data. T.L., R.C. and P.P. wrote the manuscript. N.L., J.L. and M.L. implemented and deployed the ORA server. R.C. and P.P. supervised the work. M.L., R.C. and P.P. obtained the funding. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Téo Lemane or Pierre Peterlongo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Natapol Pornputtapong, Guohua Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1: Description of the indexed dataset organized by size fraction. The “Fraction size” column indicates the size range of the target sequenced species.

Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2, Sections 1 and 2 and Tables 1–6.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lemane, T., Lezzoche, N., Lecubin, J. et al. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA. Nat Comput Sci 4, 104–109 (2024). https://doi.org/10.1038/s43588-024-00596-6

  • DOI: https://doi.org/10.1038/s43588-024-00596-6
