  • Article

Ultrafast search of all deposited bacterial and viral genomic data

Abstract

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.

Fig. 1: Sequence matching methods.
Fig. 2: BIGSI encoding.
Fig. 3: Speed and space trade-offs as index grows.
Fig. 4: Phylogenetic distribution of plasmid sequences.
Fig. 5: Plasmid spread and antibiotic resistance genes.
Fig. 6: Antibiotic resistance gene prevalence in ENA over time.

Code availability

An open source implementation of BIGSI can be found at https://github.com/phelimb/BIGSI. BIGSI v0.3.0 supports disk-based indexing via Berkeley-DB or rocksDB, as well as distributed in-memory (via redis (https://redis.io)) key-value stores, and can be extended to any key-value store. The benchmarking uses the rocksDB key-value store and v0.2.0 of BIGSI, and the all-microbial index uses Berkeley-DB and BIGSI version v0.1.7.

Data availability

All of the underlying genomic data for this study are publicly available at the ENA, and Supplementary Data can be found in the directory http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018. Supplementary Data 1–9 can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/supp or at https://figshare.com/s/b365381fcd9550e361da. Contents are as follows: Supplementary Data 1, MCR search results; Supplementary Data 2, plasmid search results; Supplementary Data 3, counts of five specific plasmids across genera; Supplementary Data 4, counts of MOB types across genera; Supplementary Data 5, CARD antibiotic resistance gene search results (T = 70%); Supplementary Data 6, benchmarking results; Supplementary Data 7, Bracken taxonomic results; Supplementary Data 8, MOB type definition fasta; Supplementary Data 9, MOB and T4SS search results (T = 100%). The all-microbial index itself is available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/all-microbial-index/. In order to facilitate reproducibility for others without having to download and process 170 Tb of raw data, we made the 26 Tb of cleaned de Bruijn (binary) graph files for the entire all-microbial index snapshot of the ENA available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/ctx. An archive of computational code can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/bigsi.tar.gz. An archive of the data underlying the figures can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data.zip and http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/ or as Supplementary Data 10 http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/.zip. We have also made a public instance of our index of the ENA available at http://bigsi.io, where the user can paste sequence and search. This instance uses BIGSI v0.1.7 (using berkeleyDB) and is hosted by CLIMB (http://www.climb.ac.uk/) on a 3 Tb RAM server.

References

  1. Bradley, P. et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat. Commun. 6, 10063 (2015).

  2. Brown, A. C. et al. Rapid whole-genome sequencing of Mycobacterium tuberculosis isolates directly from clinical samples. J. Clin. Microbiol. 53, 2230–2237 (2015).

  3. Quick, J. et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232 (2016).

  4. Schmidt, K. et al. Identification of bacterial pathogens and antimicrobial resistance directly from clinical urines by nanopore-based metagenomic sequencing. J. Antimicrob. Chemother. 72, 104–114 (2017).

  5. Votintseva, A. A. et al. Same-day diagnostic and surveillance data for tuberculosis via whole-genome sequencing of direct respiratory samples. J. Clin. Microbiol. 55, 1285–1298 (2017).

  6. Shea, J. et al. Comprehensive whole-genome sequencing and reporting of drug resistance profiles on clinical cases of Mycobacterium tuberculosis in New York state. J. Clin. Microbiol. 55, 1871–1882 (2017).

  7. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

  8. Kent, W. J. BLAT-the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).

  9. Zhang, Z. et al. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).

  10. Arredondo-Alonso, A. W. R., Schaik, W. V. & Schurch, C. On the (im)possibility of reconstructing plasmids from whole-genome short-read sequencing data. Microb. Genom. 3, e000128 (2017).

  11. Bradnam, K. R. et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2, 10 (2013).

  12. Earl, D. et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2011).

  13. Solomon, B. & Kingsford, C. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302 (2016).

  14. Pandey, P. et al. Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 7, 201–207 e204 (2018).

  15. Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. in International Conference on Research in Computational Molecular Biology 257–271 (Springer, 2017).

  16. Sun, C., Harris, R., Chikhi, R. & Medvedev, P. AllSome Sequence Bloom Trees. in International Conference on Research in Computational Molecular Biology 272–286 (Springer, 2018).

  17. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012).

  18. Muggli, M. D. et al. Succinct colored de Bruijn graphs. Bioinformatics 33, 3181–3187 (2017).

  19. Turner, I., Garimella, K., Iqbal, Z. & McVean, G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34, 2556–2565 (2017).

  20. Almodaresi, F., Pandey, P. & Patro, R. Proc. 17th International Workshop on Algorithms in Bioinformatics. (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017).

  21. Wong, H. K. T. L. H., Olken F., Rotem D., Wong L. In: Proc. 11th International Conference on Very Large Data Bases Vol. 11 (ed Pirotte, A. and Vassiliou, Y.) 448–457 (Stockholm, Sweden, 1985).

  22. Shepherd, M. A. P. W. & Chu, C. K. A fixed-size Bloom filter for searching textual documents. Comput. J. 32, 212–219 (1989).

  23. Zobel, J., Moffat, A. & Ramamohanarao, K. Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23, 453–490 (1998).

  24. Goodwin, B. H. M. et al. Proc. 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM, 2017).

  25. Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).

  26. Pell, J. et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl Acad. Sci. USA 109, 13272–13277 (2012).

  27. Walker, T. M. et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect. Dis. 15, 1193–1202 (2015).

  28. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

  29. Inouye, M. et al. SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 6, 90 (2014).

  30. Pandey, P. et al. Mantis: a fast, small, and exact large-scale sequence search index. Cell Syst. 7, 201–207 (2017).

  31. Lu, J., Breitwieser, F., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).

  32. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).

  33. Hu, Y., Liu, F., Lin, I. Y., Gao, G. F. & Zhu, B. Dissemination of the mcr-1 colistin resistance gene. Lancet Infect. Dis. 16, 146–147 (2016).

  34. Lu, X. et al. MCR-1.6, a new MCR variant carried by an IncP plasmid in a colistin-resistant Salmonella enterica serovar Typhimurium isolate from a healthy individual. Antimicrob. Agents Chemother. 61, e02632–16 (2017).

  35. Matamoros, S. et al. Global phylogenetic analysis of Escherichia coli and plasmids carrying the mcr-1 gene indicates bacterial diversity but plasmid restriction. Sci. Rep. 7, 15364 (2017).

  36. Xavier, B. B. et al. Identification of a novel plasmid-mediated colistin-resistance gene, mcr-2, in Escherichia coli, Belgium, June 2016. Euro. Surveill. https://doi.org/10.2807/1560-7917.ES.2016.21.27.30280 (2016).

  37. Yin, W. et al. Novel plasmid-mediated colistin resistance gene mcr-3 in Escherichia coli. mBio https://doi.org/10.1128/mBio.00543-17 (2017).

  38. Ciric, L. J. A., Elvira de Vries L., Agerso Y., Mullany P., Roberts A. P. In: Madame Curie Bioscience Database [Internet] (Landes Bioscience, Austin, TX, 2013).

  39. Guglielmini, J., Quintais, L., Garcillán-Barcia, M. P., de la Cruz, F. & Rocha, E. P. The repertoire of ICE in prokaryotes underscores the unity, diversity, and ubiquity of conjugation. PLoS Genet. 7, e1002222 (2011).

  40. Jia, B. et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573 (2017).

  41. Eldholm, V. & Balloux, F. Antimicrobial resistance in Mycobacterium tuberculosis: the odd one out. Trends Microbiol. 24, 637–648 (2016).

  42. World Health Organization. Global Tuberculosis Report 2017. https://www.who.int/tb/publications/global_report/gtbr2017_main_text.pdf (2017).

  43. Gardy, J. L. & Loman, N. J. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat. Rev. Genet. 19, 9–20 (2018).

  44. Schatz, M. C. & Phillippy, A. M. The rise of a digital immune system. Gigascience 1, 4 (2012).

  45. Eyre, D. W. et al. WGS to predict antibiotic MICs for Neisseria gonorrhoeae. J. Antimicrob. Chemother. 72, 1937–1947 (2017).

  46. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  47. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).

  48. Broder, A. & Mitzenmacher, M. Network applications of Bloom filters: a survey. Internet Math. 1, 485–509 (2004).

Acknowledgements

We would like to thank, for critical reading and helpful suggestions: G. Blackwell, R. Ffrancon, M. Hunt, J. Kelleher, J. Thornton, R. Patro, K. Malone, J. Marioni, A. Page, S. Gog, T. Bellman, F. Gauger. For enormous assistance with data download from EBI: R. Esnouf, G. Cochrane. For hosting our BIGSI demonstration: CLIMB. We acknowledge funding from Wellcome Trust Core Award Grant Number 203141/Z/16/Z. Z.I. was funded by a Wellcome Trust/Royal Society Sir Henry Dale Fellowship, grant number 102541/A/13/Z. G.M. was funded by Wellcome Trust grant 100956/Z/13/Z. E.P.C.R. was funded by the ANR MAGISBAC grant number ANR-14-CE10-0007-02. P.B. was funded by Wellcome Trust Studentship H5RZCO00.

Author information

Contributions

Z.I. and G.M. designed and oversaw the study; P.B. invented the method, developed software and performed analyses; H.C.d.B. performed analyses for the plasmid study; E.P.C.R. codesigned and analyzed the conjugative system and plasmid analysis; Z.I. wrote the paper; all authors gave detailed feedback on the paper.

Corresponding author

Correspondence to Zamin Iqbal.

Ethics declarations

Competing interests

G.M. is a cofounder of, holder of shares in, and is a consultant to Genomics PLC.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Cartoon comparison between human and E. coli genomes.

Cartoon comparison of human genomes (above) and E. coli (below) as a representative bacterium. In humans, genetic variation is dominated by relatively sparse single nucleotide polymorphisms (SNPs), nucleotide diversity π = 0.001, and less than 1% of a typical genome lies in a structural variant (SV) [1]. Human genomes are therefore relatively compressible. In stark contrast, genes make up around 88% of an E. coli genome [2], yet two E. coli genomes may only share around 60% of their genes [3], and conserved genes have much higher nucleotide diversity (0.02) [4]. Thus, bacterial genomes present different compression and indexing challenges to human genomes.

  1. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015). https://doi.org/10.1038/nature15393

  2. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453–1462 (1997).

  3. Touchon, M. et al. Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths. PLoS Genet. 5, e1000344 (2009).

  4. Kaas, R. S. et al. Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes. BMC Genomics 13 (2012).

Supplementary Figure 2 k-mer identity vs. percent nucleotide identity.

We ran 1,000 simulations in which, on each iteration, we introduced one more random SNP into a sequence of length 1,000 bp and calculated the k-mer similarity between the original sequence and the sequence with introduced SNPs. Here, we plot the mean k-mer similarity observed for each percent identity. The grey area shows three times the standard deviation around this mean.
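One replicate of such a simulation can be sketched as follows. This is an illustration under stated assumptions, not the code used for the figure: the legend does not give k or a formula for k-mer similarity, so `k=31` and the definition used here (the fraction of the original sequence's distinct k-mers that are still present after mutation) are assumptions, and the function names are hypothetical.

```python
import random


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(original, mutated, k):
    """Assumed definition: share of the original's k-mers found in the mutated copy."""
    ref = kmers(original, k)
    return len(ref & kmers(mutated, k)) / len(ref)


def simulate(length=1000, k=31, max_snps=100, seed=0):
    """One replicate: add one more SNP per iteration and record (percent identity, similarity)."""
    rng = random.Random(seed)
    original = "".join(rng.choice("ACGT") for _ in range(length))
    mutated = list(original)
    # Distinct sites, so that n iterations give exactly n mismatches.
    positions = rng.sample(range(length), max_snps)
    results = [(100.0, 1.0)]
    for n, pos in enumerate(positions, start=1):
        # Substitute a base that differs from the original at this site.
        mutated[pos] = rng.choice([b for b in "ACGT" if b != original[pos]])
        identity = 100.0 * (length - n) / length
        results.append((identity, kmer_similarity(original, "".join(mutated), k)))
    return results
```

Repeating `simulate` with different seeds and averaging the similarity at each percent identity would give a curve of the kind the legend describes. Because a single SNP can disrupt up to k overlapping k-mers, k-mer similarity falls much faster than nucleotide identity.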

Supplementary Figure 3 BIGSI scores vs. megaBLAST scores.

megaBLAST scores for a search of 100 antimicrobial resistance genes in a BLAST database of RefSeq-81 vs. the equivalent BIGSI scores in a search of a BIGSI of RefSeq-81. Pearson correlation of the scores was r = 0.998.

Supplementary Figure 4 Speed–space tradeoffs for exact-match queries.

We show query time for 2,157 antimicrobial resistance genes with T = 100% vs. peak disk size when searching databases ranging in size from 10 to 10,000 microbial datasets. BIGSI’s query time does not increase significantly with the number of datasets, N, because in this regime the query time is dominated by the constant-time row lookups rather than the bit-wise AND calculations.
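The row-lookup and bit-wise AND steps mentioned here can be illustrated with a toy bit-sliced index. This is a minimal sketch and not the BIGSI implementation: the class name, the tiny filter size `m`, and the salted BLAKE2b hashing are all assumptions made for the example. Each row corresponds to one Bloom filter bit position and each column to one dataset, and rows are stored as Python integers used as bit vectors.

```python
import hashlib


class TinyBitslicedIndex:
    """Toy index: rows are Bloom filter bit positions, columns are datasets."""

    def __init__(self, m=1024, num_hashes=3):
        self.m = m                    # bits per Bloom filter (illustrative value)
        self.num_hashes = num_hashes  # hash functions per k-mer (illustrative value)
        self.rows = [0] * m           # rows[r] has bit c set if dataset c sets bit r
        self.names = []

    def _positions(self, kmer):
        # Hypothetical hash scheme: one salted BLAKE2b digest per hash function.
        positions = []
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            positions.append(int.from_bytes(digest, "big") % self.m)
        return positions

    def insert(self, name, kmers):
        """Add a dataset as a new column."""
        column = len(self.names)
        self.names.append(name)
        for kmer in kmers:
            for r in self._positions(kmer):
                self.rows[r] |= 1 << column

    def query(self, kmers):
        """Exact-match query: datasets whose filters contain every k-mer."""
        hits = (1 << len(self.names)) - 1  # start with all columns set
        for kmer in kmers:
            for r in self._positions(kmer):
                hits &= self.rows[r]       # one row lookup, one bit-wise AND
        return [name for c, name in enumerate(self.names) if (hits >> c) & 1]
```

In this sketch the number of row lookups depends only on the query length and the number of hash functions, not on how many datasets are indexed; only the width of each AND grows with the number of columns. As with any Bloom filter, a query can return false positives but never misses a dataset that contains all the queried k-mers.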

Supplementary Figure 5 Simulated scaling to 1 million datasets.

We simulated how the peak data-structure storage requirements of BIGSI and SBT-fast scale to 1 million datasets, comparing performance with a high or low proportion of k-mers shared between datasets (note that the y axis is on a log scale). In the high k-mer sharing regime only 100 new k-mers are introduced per dataset, whereas the low k-mer sharing regime introduces 10,000 new k-mers per dataset. Since BIGSI scales linearly with the number of datasets and independently of the number of k-mers, it uses the same storage per dataset in each regime. However, SBT-fast scales super-linearly with N since its Bloom filter size depends on the total number of k-mers. For 1 million genomes with low k-mer sharing (right), which is the case we care about for global indexing, BIGSI would use 3.1 Terabytes whereas SBT-fast would use 25 Petabytes. When we index the ENA, we find each dataset adds 100,000 new k-mers on average, 10 times more than in the low k-mer sharing regime simulated here, which would further penalize SBT.
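The linear scaling of BIGSI can be checked with simple arithmetic. The sketch below assumes a fixed Bloom filter of 25 million bits per dataset; this legend does not state that value, and it is chosen here only because it reproduces the 3.1 Terabyte figure quoted above. The function names are hypothetical.

```python
def bigsi_storage_bytes(num_datasets, bits_per_dataset=25_000_000):
    """Storage for one fixed-size Bloom filter per dataset (assumed size), in bytes."""
    return num_datasets * bits_per_dataset // 8


def total_new_kmers(num_datasets, new_kmers_per_dataset):
    """Total distinct k-mers if every dataset contributes a fixed number of new ones."""
    return num_datasets * new_kmers_per_dataset


n = 1_000_000
print(bigsi_storage_bytes(n) / 1e12)  # about 3.1 (decimal terabytes)
print(total_new_kmers(n, 100))        # high k-mer sharing regime
print(total_new_kmers(n, 10_000))     # low k-mer sharing regime
```

The storage function takes no k-mer count as input, which is the point made in the legend: under this model the footprint is the same in both sharing regimes, while any structure whose filter size tracks the total number of k-mers must grow with the second quantity.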

Supplementary Figure 6 Counts of the most frequent bacterial genera in the SRA/ENA data set.

Over 90% of the datasets were isolates of these 20 genera and 65% from the top 5 most prevalent genera.

Supplementary Figure 7 Permutation test for difference in phylogenetic spread of plasmids with ≥3 resistance genes versus those with 0.

We took the set of plasmids from Fig. 5 with at least 3 resistance genes (abbreviation 3G) and those with zero (abbrev. 0G). We defined “phylogenetic spread” of a plasmid as the median of the pairwise distances along a large subunit rRNA tree (incorporating branch lengths) between all pairs of genera in which the plasmid is detected, and calculated the 95% quantile of this distribution (red line). We then permuted the assignment of each plasmid to the classes 3G and 0G one million times (maintaining the class counts), each time calculating the difference in 95% quantiles for the two categories (3G and 0G). We show here the histogram of that statistic (i.e. the difference). This corresponds to a permutation test P value of 0.0024.
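A permutation test of this form can be sketched as follows. This is an illustration under assumptions rather than the analysis code: the inputs are taken to be one phylogenetic-spread value per plasmid, the quantile uses a simple nearest-rank convention, the test is one-sided, and the P value uses the common add-one correction. None of these details are specified in the legend, and the function names are hypothetical.

```python
import math
import random


def quantile(values, q):
    """Nearest-rank quantile (one simple convention among several)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, max(0, math.ceil(q * len(ordered)) - 1))
    return ordered[index]


def permutation_test(spread_3g, spread_0g, n_perm=10_000, q=0.95, seed=0):
    """Permute class labels, keeping class counts, and compare quantile differences."""
    rng = random.Random(seed)
    observed = quantile(spread_3g, q) - quantile(spread_0g, q)
    pooled = list(spread_3g) + list(spread_0g)
    n_3g = len(spread_3g)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        # First n_3g values play the role of 3G, the rest 0G.
        stat = quantile(pooled[:n_3g], q) - quantile(pooled[n_3g:], q)
        if stat >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value
```

For example, `permutation_test([10, 11, 12, 13, 14], [1, 2, 3, 4, 5], n_perm=2000)` returns an observed difference of 9 and a small P value, since very few relabelings separate the two groups as cleanly as the original assignment.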

Supplementary Figure 8 Distribution of MOB types among phyla.

We show the proportion of each MOB type associated with different phyla based on a search of all known MOB types from Guglielmini et al. in the all-microbial-index.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–8 and Supplementary Tables 1–3

Reporting Summary

Supplementary Data

Supplementary Data 10

About this article

Cite this article

Bradley, P., den Bakker, H.C., Rocha, E.P.C. et al. Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol 37, 152–159 (2019). https://doi.org/10.1038/s41587-018-0010-1

  • DOI: https://doi.org/10.1038/s41587-018-0010-1
