Brief Communication
Published: 26 February 2024

Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA

Nature Computational Science volume 4, pages 104–109 (2024)

A preprint version of the article is available at bioRxiv.

Abstract

Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.


**Fig. 1: kmindex: an overview of the data structure and the query process.**

Data availability

A list of publicly available data used in this work is presented in the https://github.com/pierrepeterlongo/kmindex_benchmarks repository⁴⁰.

Code availability

kmindex is open-source software available at https://github.com/tlemane/kmindex (ref. ⁴¹). The documentation is available at https://tlemane.github.io/kmindex/. The exhaustive list of tool versions and commands used is presented in a companion website⁴⁰, which also reports the FP computation protocols and a detailed description of the dataset considered for this benchmark. The ORA server code is available through a GitLab repository²⁷.

References

Edgar, R. C. et al. Petabase-scale sequence alignment catalyses viral discovery. Nature 602, 142–147 (2022).
Paoli, L. et al. Biosynthetic potential of the global ocean microbiome. Nature 607, 111–118 (2022).
Katz, K. et al. The Sequence Read Archive: a decade more of explosive growth. Nucleic Acids Res. 50, D387–D390 (2022).
Chikhi, R., Holub, J. & Medvedev, P. Data structures to represent a set of k-long DNA sequences. ACM Comput. Surv. 54, 1–22 (2021).
Marchet, C. et al. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res. 31, 1–12 (2021).
Pierce, N. T., Irber, L., Reiter, T., Brooks, P. & Brown, C. T. Large-scale sequence comparisons with sourmash. F1000Research 8, 1006 (2019).
Darvish, M., Seiler, E., Mehringer, S., Rahn, R. & Reinert, K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics 38, 4100–4108 (2022).
Karasikov, M. et al. Metagraph: indexing and analysing nucleotide archives at petabase-scale. Preprint at bioRxiv https://doi.org/10.1101/2020.10.01.322164 (2020).
Holley, G. & Melsted, P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol. 21, 249 (2020).
Cracco, A. & Tomescu, A. I. Extremely fast construction and querying of compacted and colored de Bruijn graphs with ggcat. Genome Res. 33, 1198–1207 (2023).
Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).
Bingmann, T., Bradley, P., Gauger, F. & Iqbal, Z. COBS: a Compact Bit-Sliced Signature index. String Processing and Information Retrieval, SPIRE 2019. In Lecture Notes in Computer Science, Vol. 11811 (Springer, Cham, 2019).
Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. J. Comput. Biol. 25, 755–765 (2018).
Harris, R. S. & Medvedev, P. Improved representation of sequence Bloom trees. Bioinformatics 36, 721–727 (2020).
Srikakulam, S. K., Keller, S., Dabbaghie, F., Bals, R. & Kalinina, O. V. Metaprofi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants. Bioinformatics 39, btad101 (2023).
The Ocean Read Atlas. OSU Institut Pytheas https://ocean-read-atlas.mio.osupytheas.fr/ (2023).
Sunagawa, S. et al. Tara Oceans: towards global ocean ecosystems biology. Nat. Rev. Microbiol. 18, 428–445 (2020).
Alanko, J. N., Vuohtoniemi, J., Mäklin, T. & Puglisi, S. J. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 39, i260–i269 (2023).
Mehringer, S. et al. Hierarchical interleaved Bloom filter: enabling ultrafast, approximate sequence queries. Genome Biol. 24, 131 (2023).
Marchet, C. & Limasset, A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 39, i252–i259 (2023).
Lemane, T., Medvedev, P., Chikhi, R. & Peterlongo, P. kmtricks: efficient and flexible construction of Bloom filters for large sequencing data collections. Bioinform. Adv. 2, vbac029 (2022).
Villar, E. et al. The Ocean Gene Atlas: exploring the biogeography of plankton genes online. Nucleic Acids Res. 46, W289–W295 (2018).
Vernette, C. et al. The Ocean Gene Atlas v2.0: online exploration of the biogeography and phylogeny of plankton genes. Nucleic Acids Res. 50, W516–W526 (2022).
Acinas, S. G. et al. Deep ocean metagenomes provide insight into the metabolic architecture of bathypelagic microbial communities. Commun. Biol. 4, 604 (2021).
Robidou, L. & Peterlongo, P. findere: fast and precise approximate membership query. In International Symposium on String Processing and Information Retrieval 151–163 (Springer, 2021).
fio. GitHub https://github.com/axboe/fio (2023).
DOI of the provided ORA server GitLab code. Zenodo https://doi.org/10.5281/zenodo.10462412 (2024).
European Nucleotide Archive. European Bioinformatics Institute https://www.ebi.ac.uk/ena/ (2023).
Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Registry of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875582 (2017).
Guidi, L., Gattuso, J.-P. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about carbonate chemistry in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875567 (2017).
Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Biodiversity context of all samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.853809 (2015).
Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about pigment concentrations (HPLC) in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875569 (2017).
Ardyna, M. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about mesoscale features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875577 (2017).
Guidi, L. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about nutrients in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875575 (2017).
Guidi, L., Picheral, M. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about sensor data in the targeted environmental feature. PANGAEA https://doi.org/10.1594/PANGAEA.875576 (2017).
Alberti, A. & Pesant, S. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Methodology used in the lab for molecular analyses and links to the Sequence Read Archive of selected samples from the Tara Oceans Expedition (2009–2013). PANGAEA https://doi.org/10.1594/PANGAEA.875581 (2017).
Speich, S. et al. Tara Oceans Consortium, Coordinators; Tara Oceans Expedition, Participants. Environmental context of all samples from the Tara Oceans Expedition (2009–2013), about the water column features at the sampling location. PANGAEA https://doi.org/10.1594/PANGAEA.875579 (2017).
Overview. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/ (2023).
Interfaces. Ocean Read Atlas https://ora.mio.osupytheas.fr/manual/pages/interfaces.html (2023).
pierrepeterlongo/kmindex_benchmarks: initial release. Zenodo https://doi.org/10.5281/zenodo.10462379 (2024).
DOI of the kmindex GitHub repository. Zenodo https://doi.org/10.5281/zenodo.10462427 (2024).


Acknowledgements

We acknowledge the GenOuest core facility (https://www.genouest.org) and the TGCC (https://www-hpc.cea.fr/index-en.html) for providing the computing infrastructure, as well as France Génomique for funding of the TGCC computing resources used to process data used in this article. The authors thank J.-M. Aury for his help regarding the usage of the Tara Oceans datasets. Tara Oceans (which includes both the Tara Oceans and Tara Oceans Polar Circle expeditions) would not exist without the leadership of the Tara Ocean Foundation and the continuous support of Tara Oceans consortium members. The authors also thank K. Andre and M. Harun for their help regarding the usage of MetaGraph, A. Cracco and A. Tomescu for their help using ggcat, and C. Marchet and A. Limasset for their support using PAC. The web server is hosted by the OSU Pythéas cluster with the help of C. Blanpain and SIP members. A. Malgoyre from SIP is thanked for the development of the OSU Pythéas GitLab. The work was funded by ANR SeqDigger (ANR-19-CE45-0008) and the IPL Inria Neuromarkers, and received some support from the French government under the France 2030 investment plan, as part of the Initiative d’Excellence d’Aix-Marseille Université - A*MIDEX - Institute of Ocean Sciences (AMX-19-IET-016). This work is part of the ALPACA project that has received funding from the European Union’s Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant agreement nos. 956229 and 872539 (PANGAIA). R.C. was supported by ANR Full-RNA, Inception and PRAIRIE grants (ANR-22-CE45-0007, PIA/ANR16-CONV-0005 and ANR-19-P3IA-0001). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France
Téo Lemane & Pierre Peterlongo
Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France
Téo Lemane & Eric Pelletier
Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France
Nolan Lezzoche & Magali Lescot
SIP, OSU PYTHEAS, Marseille, France
Julien Lecubin
Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France
Eric Pelletier & Magali Lescot
Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France
Rayan Chikhi

Authors

Téo Lemane
Nolan Lezzoche
Julien Lecubin
Eric Pelletier
Magali Lescot
Rayan Chikhi
Pierre Peterlongo

Contributions

T.L., E.P., R.C. and P.P. conceptualized the project. T.L., R.C. and P.P. developed the methodology. T.L. implemented the software. T.L. and P.P. conceived and conducted the experiments. M.L. and E.P. provided the data. T.L., R.C. and P.P. wrote the manuscript. N.L., J.L. and M.L. implemented and deployed the ORA server. R.C. and P.P. supervised the work. M.L., R.C. and P.P. obtained the funding. All authors reviewed the manuscript.

Corresponding authors

Correspondence to Téo Lemane or Pierre Peterlongo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Computational Science thanks Natapol Pornputtapong, Guohua Wang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Table 1: Description of the indexed dataset organized by size fraction. The “Fraction size” column indicates the size range of the target sequenced species.


Supplementary information

Supplementary Information

Supplementary Figs. 1 and 2, Sections 1 and 2 and Tables 1–6.

Reporting Summary

Peer Review File

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Lemane, T., Lezzoche, N., Lecubin, J. et al. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA. Nat Comput Sci 4, 104–109 (2024). https://doi.org/10.1038/s43588-024-00596-6


Received: 12 July 2023
Accepted: 16 January 2024
Published: 26 February 2024
Issue Date: February 2024
DOI: https://doi.org/10.1038/s43588-024-00596-6
