Ultrafast search of all deposited bacterial and viral genomic data

Bradley, Phelim; den Bakker, Henk C.; Rocha, Eduardo P. C.; McVean, Gil; Iqbal, Zamin

doi:10.1038/s41587-018-0010-1

Article
Published: 04 February 2019

Nature Biotechnology volume 37, pages 152–159 (2019)

Abstract

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.

**Fig. 3: Speed and space trade-offs as index grows.**

**Fig. 4: Phylogenetic distribution of plasmid sequences.**

**Fig. 5: Plasmid spread and antibiotic resistance genes.**

**Fig. 6: Antibiotic resistance gene prevalence in ENA over time.**

Code availability

An open source implementation of BIGSI can be found at https://github.com/phelimb/BIGSI. BIGSI v0.3.0 supports disk-based indexing via Berkeley-DB or rocksDB, as well as distributed in-memory key-value stores via redis (https://redis.io), and can be extended to any key-value store. The benchmarking uses the rocksDB key-value store and BIGSI v0.2.0, and the all-microbial index uses Berkeley-DB and BIGSI v0.1.7.

Data availability

All of the underlying genomic data for this study are publicly available at the ENA, and Supplementary Data can be found in the directory http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018. Supplementary Data 1–9 can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/supp or at https://figshare.com/s/b365381fcd9550e361da. Contents are as follows: Supplementary Data 1, MCR search results; Supplementary Data 2, plasmid search results; Supplementary Data 3, counts of five specific plasmids across genera; Supplementary Data 4, counts of MOB types across genera; Supplementary Data 5, CARD antibiotic resistance gene search results (T = 70%); Supplementary Data 6, benchmarking results; Supplementary Data 7, Bracken taxonomic results; Supplementary Data 8, MOB type definition fasta; Supplementary Data 9, MOB and T4SS search results (T = 100%). The all-microbial index itself is available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/all-microbial-index/. In order to facilitate reproducibility for others without having to download and process 170 Tb of raw data, we made the 26 Tb of cleaned de Bruijn (binary) graph files for the entire all-microbial index snapshot of the ENA available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/ctx. An archive of computational code can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/bigsi.tar.gz. An archive of the data underlying the figures can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data.zip and http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/ or as Supplementary Data 10 http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/.zip. We have also made a public instance of our index of the ENA available at http://bigsi.io, where the user can paste sequence and search. This instance uses BIGSI v0.1.7 (using berkeleyDB) and is hosted by CLIMB (http://www.climb.ac.uk/) on a 3 Tb RAM server.

References

Bradley, P. et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat. Commun. 6, 10063 (2015).
Brown, A. C. et al. Rapid whole-genome sequencing of Mycobacterium tuberculosis isolates directly from clinical samples. J. Clin. Microbiol. 53, 2230–2237 (2015).
Quick, J. et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232 (2016).
Schmidt, K. et al. Identification of bacterial pathogens and antimicrobial resistance directly from clinical urines by nanopore-based metagenomic sequencing. J. Antimicrob. Chemother. 72, 104–114 (2017).
Votintseva, A. A. et al. Same-day diagnostic and surveillance data for tuberculosis via whole-genome sequencing of direct respiratory samples. J. Clin. Microbiol. 55, 1285–1298 (2017).
Shea, J. et al. Comprehensive whole-genome sequencing and reporting of drug resistance profiles on clinical cases of Mycobacterium tuberculosis in New York state. J. Clin. Microbiol. 55, 1871–1882 (2017).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Kent, W. J. BLAT-the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Zhang, Z. et al. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).
Arredondo-Alonso, S., Willems, R. J., van Schaik, W. & Schürch, A. C. On the (im)possibility of reconstructing plasmids from whole-genome short-read sequencing data. Microb. Genom. 3, e000128 (2017).
Bradnam, K. R. et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2, 10 (2013).
Earl, D. et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2011).
Solomon, B. & Kingsford, C. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302 (2016).
Pandey, P. et al. Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 7, 201–207 e204 (2018).
Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. in International Conference on Research in Computational Molecular Biology 257–271 (Springer, 2017).
Sun, C., Harris, R., Chikhi, R. & Medvedev, P. AllSome Sequence Bloom Trees. in International Conference on Research in Computational Molecular Biology 272–286 (Springer, 2018).
Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012).
Muggli, M. D. et al. Succinct colored de Bruijn graphs. Bioinformatics 33, 3181–3187 (2017).
Turner, I., Garimella, K., Iqbal, Z. & McVean, G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34, 2556–2565 (2017).
Almodaresi, F., Pandey, P. & Patro, R. Proc. 17th International Workshop on Algorithms in Bioinformatics. (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017).
Wong, H. K. T. L. H., Olken F., Rotem D., Wong L. In: Proc. 11th International Conference on Very Large Data Bases Vol. 11 (ed Pirotte, A. and Vassiliou, Y.) 448–457 (Stockholm, Sweden, 1985).
Shepherd, M. A. P. W. & Chu, C. K. A fixed-size Bloom filter for searching textual documents. Comput. J. 32, 212–219 (1989).
Zobel, J., Moffat, A. & Ramamohanarao, K. Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23, 453–490 (1998).
Goodwin, B. H. M. et al. Proc. 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM, 2017).
Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).
Pell, J. et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl Acad. Sci. USA 109, 13272–13277 (2012).
Walker, T. M. et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect. Dis. 15, 1193–1202 (2015).
Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Inouye, M. et al. SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 6, 90 (2014).
Pandey, P. et al. Mantis: a fast, small, and exact large-scale sequence search index. Cell Syst. 7, 201–207 (2017).
Lu, J., Breitwieser, F., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
Hu, Y., Liu, F., Lin, I. Y., Gao, G. F. & Zhu, B. Dissemination of the mcr-1 colistin resistance gene. Lancet Infect. Dis. 16, 146–147 (2016).
Lu, X. et al. MCR-1.6, a new MCR variant carried by an incp plasmid in a colistin-resistant Salmonella enterica serovar typhimurium isolate from a healthy individual. Antimicrob. Agents Chemother. 61, e02632–16 (2017).
Matamoros, S. et al. Global phylogenetic analysis of Escherichia coli and plasmids carrying the mcr-1 gene indicates bacterial diversity but plasmid restriction. Sci. Rep. 7, 15364 (2017).
Xavier, B. B. et al. Identification of a novel plasmid-mediated colistin-resistance gene, mcr-2, in Escherichia coli, Belgium, June 2016. Euro. Surveill. https://doi.org/10.2807/1560-7917.ES.2016.21.27.30280 (2016).
Yin, W. et al. Novel plasmid-mediated colistin resistance gene mcr-3 in Escherichia coli. mBio https://doi.org/10.1128/mBio.00543-17 (2017).
Ciric, L. J. A., Elvira de Vries L., Agerso Y., Mullany P., Roberts A. P. In: Madame Curie Bioscience Database [Internet] (Landes Bioscience, Austin, TX, 2013).
Guglielmini, J., Quintais, L., Garcillán-Barcia, M. P., de la Cruz, F. & Rocha, E. P. The repertoire of ICE in prokaryotes underscores the unity, diversity, and ubiquity of conjugation. PLoS Genet. 7, e1002222 (2011).
Jia, B. et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573 (2017).
Eldholm, V. & Balloux, F. Antimicrobial resistance in Mycobacterium tuberculosis: the odd one out. Trends Microbiol. 24, 637–648 (2016).
World Health Organization. Global Tuberculosis Report 2017. https://www.who.int/tb/publications/global_report/gtbr2017_main_text.pdf (2017).
Gardy, J. L. & Loman, N. J. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat. Rev. Genet. 19, 9–20 (2018).
Schatz, M. C. & Phillippy, A. M. The rise of a digital immune system. Gigascience 1, 4 (2012).
Eyre, D. W. et al. WGS to predict antibiotic MICs for Neisseria gonorrhoeae. J. Antimicrob. Chemother. 72, 1937–1947 (2017).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
Broder, A. & Mitzenmacher, M. Network applications of Bloom filters: a survey. Internet Math. 1, 485–509 (2004).

Acknowledgements

We would like to thank, for critical reading and helpful suggestions: G. Blackwell, R. Ffrancon, M. Hunt, J. Kelleher, J. Thornton, R. Patro, K. Malone, J. Marioni, A. Page, S. Gog, T. Bellman, F. Gauger. For enormous assistance with data download from EBI: R. Esnouf, G. Cochrane. For hosting our BIGSI demonstration: CLIMB. We acknowledge funding from Wellcome Trust Core Award Grant Number 203141/Z/16/Z. Z.I. was funded by a Wellcome Trust/Royal Society Sir Henry Dale Fellowship, grant number 102541/A/13/Z. G.M. was funded by Wellcome Trust grant 100956/Z/13/Z. E.P.C.R. was funded by the ANR MAGISBAC grant number ANR-14-CE10-0007-02. P.B. was funded by Wellcome Trust Studentship H5RZCO00.

Author information

Authors and Affiliations

Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
Phelim Bradley & Zamin Iqbal
EMBL-EBI, Hinxton, UK
Phelim Bradley & Zamin Iqbal
Center for Food Safety, Department of Food Science and Technology, University of Georgia, Griffin, GA, USA
Henk C. den Bakker
UMR 3525, CNRS, Paris, France
Eduardo P. C. Rocha
Microbial Evolutionary Genomics, Institut Pasteur, Paris, France
Eduardo P. C. Rocha
Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
Gil McVean

Authors

Phelim Bradley
Henk C. den Bakker
Eduardo P. C. Rocha
Gil McVean
Zamin Iqbal

Contributions

Z.I. and G.M. designed and oversaw the study; P.B. invented the method, developed software and performed analyses; H.C.d.B. performed analyses for the plasmid study; E.P.C.R. codesigned and analyzed the conjugative system and plasmid analysis; Z.I. wrote the paper; all authors gave detailed feedback on the paper.

Corresponding author

Correspondence to Zamin Iqbal.

Ethics declarations

Competing interests

G.M. is a cofounder of, holder of shares in, and consultant to Genomics PLC.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Cartoon comparison between human and E. coli genomes.

Cartoon comparison of human genomes (above) and E. coli (below) as a representative bacterium. In humans, genetic variation is dominated by relatively sparse single nucleotide polymorphisms (SNPs), nucleotide diversity π = 0.001, and less than 1% of a typical genome lies in a structural variant (SV) [1]. Human genomes are therefore relatively compressible. In stark contrast, genes make up around 88% of an E. coli genome [2], yet two E. coli genomes may only share around 60% of their genes [3], and conserved genes have much higher nucleotide diversity (0.02) [4]. Thus, bacterial genomes present different compression and indexing challenges to human genomes.
1. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74, doi:10.1038/nature15393 (2015).
2. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462 (1997).
3. Touchon, M. et al. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet. 5, e1000344 (2009).
4. Kaas, R. S. et al. Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. BMC Genomics 13 (2012).

Supplementary Figure 2 k-mer identity vs. percent nucleotide identity.

We ran 1,000 simulations in which each iteration introduced one more random SNP into a sequence of length 1,000 bp, and calculated the k-mer similarity between the original sequence and the sequence with introduced SNPs. Here, we plot the mean k-mer similarity observed for each percent identity. The grey area shows three times the standard deviation around this mean.
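The simulation described in this legend can be sketched roughly as follows. This is a minimal illustration, not the authors' script: the k-mer length (k = 31), the uniformly random starting sequence, and the definition of k-mer similarity as the fraction of the original sequence's k-mers still present after mutation are all assumptions, and the sketch follows a single trajectory rather than averaging over repeats.

```python
import random


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(original, mutated, k):
    """Fraction of the original's k-mers still present in mutated (assumed definition)."""
    orig = kmers(original, k)
    return len(orig & kmers(mutated, k)) / len(orig)


def simulate(length=1000, k=31, max_snps=100, seed=0):
    """Add one SNP at a time and record (percent identity, k-mer similarity)."""
    rng = random.Random(seed)
    original = "".join(rng.choice("ACGT") for _ in range(length))
    mutated = list(original)
    results = []
    # Distinct positions, so n SNPs give exactly (length - n) matching bases.
    for n_snps, pos in enumerate(rng.sample(range(length), max_snps), start=1):
        mutated[pos] = rng.choice([b for b in "ACGT" if b != original[pos]])
        identity = 100.0 * (length - n_snps) / length
        results.append((identity, kmer_similarity(original, "".join(mutated), k)))
    return results


results = simulate()
```

Because a single SNP can disrupt up to k overlapping k-mers, k-mer similarity in such a sketch falls much faster than percent nucleotide identity.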

Supplementary Figure 3 BIGSI scores vs. megaBLAST scores.

megaBLAST scores for a search of 100 antimicrobial resistance genes in a BLAST database of RefSeq-81 vs. the equivalent BIGSI scores in a search of a BIGSI of RefSeq-81. Pearson correlation of the scores was r = 0.998.

Supplementary Figure 4 Speed–space tradeoffs for exact-match queries.

We show query time for 2,157 antimicrobial resistance genes with T = 100% vs. peak disk size when searching databases of sizes from 10–10,000 microbial datasets. BIGSI’s query time does not increase significantly with N (the number of datasets), as in this regime the query time is dominated by the constant-time row lookups, rather than the bit-wise AND calculations.
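The row lookups and bit-wise AND operations mentioned here can be illustrated with a toy bit-sliced Bloom filter index. This is a minimal sketch under stated assumptions and not the BIGSI implementation: the class name `BitslicedIndex`, the salted SHA-256 hashing, the default sizes, and the use of Python integers as row bitmasks are all hypothetical choices, and reverse complements are ignored.

```python
import hashlib


class BitslicedIndex:
    """Toy index: one column per dataset, one row per Bloom filter bit position."""

    def __init__(self, num_bits=10_000, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.names = []
        # rows[r] is an integer bitmask; bit j is set when dataset j's
        # Bloom filter has bit r set.
        self.rows = [0] * num_bits

    def _positions(self, kmer):
        # Row indices from salted SHA-256 digests (illustrative hash choice).
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{kmer}".encode()).digest()[:8], "big")
            % self.num_bits
            for i in range(self.num_hashes)
        ]

    def insert(self, name, sequence, k):
        """Add a dataset as a new column and set the bits for each of its k-mers."""
        column = len(self.names)
        self.names.append(name)
        for i in range(len(sequence) - k + 1):
            for row in self._positions(sequence[i:i + k]):
                self.rows[row] |= 1 << column

    def lookup(self, kmer):
        """Fetch one row per hash and AND them: set bits mark candidate datasets."""
        mask = (1 << len(self.names)) - 1
        for row in self._positions(kmer):
            mask &= self.rows[row]
        return mask

    def search(self, query, k, threshold=1.0):
        """Return datasets containing at least `threshold` of the query's k-mers."""
        kmers = [query[i:i + k] for i in range(len(query) - k + 1)]
        masks = [self.lookup(kmer) for kmer in kmers]
        hits = {}
        for column, name in enumerate(self.names):
            score = sum(mask >> column & 1 for mask in masks) / len(kmers)
            if score >= threshold:
                hits[name] = score
        return hits


index = BitslicedIndex()
index.insert("sample_a", "ACGTACGTTGCAAGCTTAGC", k=5)
index.insert("sample_b", "TTTTTGGGGGCCCCCAAAAA", k=5)
hits = index.search("ACGTTGCAAG", k=5, threshold=1.0)
```

In this layout the number of rows fetched per k-mer depends only on the number of hash functions, so adding datasets widens each row without adding lookups. As with any Bloom filter, false positives are possible but false negatives are not.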

Supplementary Figure 5 Simulated scaling to 1 million datasets.

We simulated the peak data-structure storage requirements of BIGSI and SBT-fast when scaling to 1 million datasets, comparing performance with high and low proportions of k-mer sharing between datasets (note that the y axis is on a log scale). In the high k-mer sharing regime only 100 new k-mers are introduced per dataset, whereas the low k-mer sharing regime introduces 10,000 new k-mers per dataset. Since BIGSI scales linearly with the number of datasets and independently of the number of k-mers, it uses the same storage per dataset in each regime. However, SBT-fast scales super-linearly with N, since its Bloom filter size depends on the total number of k-mers. For 1 million genomes with low k-mer sharing (right), which is the case we care about for global indexing, BIGSI would use 3.1 terabytes whereas SBT-fast would use 25 petabytes. When we index the ENA, we find that each dataset adds 100,000 new k-mers on average, 10 times more than in the low k-mer sharing regime simulated here, which would further penalize SBT.
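The linear scaling claim can be checked with a few lines of arithmetic. In the sketch below, the figure of 25 million bits per dataset is an assumption back-calculated from the quoted 3.1 terabytes for 1 million datasets, not a value stated in this legend, and decimal units (1 terabyte = 10^12 bytes) are assumed throughout.

```python
def bigsi_storage_bytes(num_datasets, bits_per_dataset):
    """Linear model: one fixed-size Bloom filter column per dataset."""
    return num_datasets * bits_per_dataset // 8


# Assumption: back-calculated from the quoted 3.1 TB figure, not given in the text.
BITS_PER_DATASET = 25_000_000

bigsi_bytes = bigsi_storage_bytes(1_000_000, BITS_PER_DATASET)
sbt_bytes = 25 * 10**15  # 25 petabytes, as quoted for SBT-fast
ratio = sbt_bytes / bigsi_bytes

print(f"BIGSI: {bigsi_bytes / 1e12:.3f} TB, SBT-fast / BIGSI ratio: {ratio:.0f}")
```

Under these assumptions the model gives about 3.1 terabytes for BIGSI and a ratio of roughly 8,000 between the two quoted storage figures.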

Supplementary Figure 6 Counts of the most frequent bacterial genera in the SRA/ENA data set.

Over 90% of the datasets were isolates of these 20 genera and 65% from the top 5 most prevalent genera.

Supplementary Figure 7 Permutation test for difference in phylogenetic spread of plasmids with ≥3 resistance genes versus those with 0.

We took the set of plasmids from Fig. 5 with at least 3 resistance genes (abbreviation 3G) and those with zero (abbrev. 0G). We defined “phylogenetic spread” of a plasmid as the median of the pairwise distances along a large subunit rRNA tree (incorporating branch lengths) between all pairs of genera in which the plasmid is detected, and calculated the 95% quantile of this distribution (red line). We then permuted the assignment of each plasmid to the classes 3G and 0G one million times (maintaining the class counts), each time calculating the difference in 95% quantiles for the two categories (3G and 0G). We show here the histogram of that statistic (i.e. the difference). This corresponds to a permutation test P value of 0.0024.
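A permutation test of this general shape can be sketched as follows. This is an illustrative sketch rather than the authors' code: the nearest-rank quantile, the one-sided comparison, the add-one P value correction, and the reduced number of permutations are assumptions, and the input values are made-up numbers, not data from the study.

```python
import random


def quantile(values, q):
    """Nearest-rank quantile of a list of numbers (assumed estimator)."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


def permutation_test(group_a, group_b, q=0.95, n_perm=2000, seed=0):
    """One-sided permutation test on the difference in q-quantiles (a minus b)."""
    rng = random.Random(seed)
    observed = quantile(group_a, q) - quantile(group_b, q)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        # Shuffle class labels while keeping the two class sizes fixed.
        rng.shuffle(pooled)
        stat = quantile(pooled[:n_a], q) - quantile(pooled[n_a:], q)
        if stat >= observed:
            extreme += 1
    # Add-one correction keeps the estimated P value above zero.
    return observed, (extreme + 1) / (n_perm + 1)


# Made-up "phylogenetic spread" values for two hypothetical plasmid classes.
spread_3g = list(range(10, 20))
spread_0g = list(range(10))
observed, p_value = permutation_test(spread_3g, spread_0g)
```

The legend reports one million permutations; the smaller default here only keeps the sketch quick to run.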

Supplementary Figure 8 Distribution of MOB types among phyla.

We show the proportion of each MOB type associated with different phyla based on a search of all known MOB types from Guglielmini et al. in the all-microbial-index.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8 and Supplementary Tables 1–3

Reporting Summary

Supplementary Data

Supplementary Data 10

About this article

Cite this article

Bradley, P., den Bakker, H.C., Rocha, E.P.C. et al. Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol 37, 152–159 (2019). https://doi.org/10.1038/s41587-018-0010-1

Received: 26 December 2017
Accepted: 20 December 2018
Published: 04 February 2019
Issue Date: February 2019
DOI: https://doi.org/10.1038/s41587-018-0010-1