Article

Ultrafast search of all deposited bacterial and viral genomic data

Nature Biotechnology volume 37, pages 152–159 (2019)

Abstract

Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
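The bit-sliced signature idea named in the abstract can be illustrated with a toy index. This is a minimal sketch under stated assumptions, not the BIGSI implementation or its API: the class name `ToyBitslicedIndex`, the parameters `num_bits`, `num_hashes` and `k`, and the use of seeded SHA-256 as the hash family are all hypothetical choices made for illustration. Each dataset gets one fixed-size Bloom filter, and the filters are stored row-wise so that a k-mer lookup reads a few rows and combines them with a bit-wise AND. Reverse complements are ignored for simplicity.

```python
import hashlib


class ToyBitslicedIndex:
    """Toy bit-sliced index: one fixed-size Bloom filter per dataset,
    stored row-wise so that rows[r] holds bit r of every dataset's filter."""

    def __init__(self, num_bits=1024, num_hashes=3, k=5):
        # Illustrative parameters only; not the values used in the paper.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.k = k
        self.rows = [0] * num_bits  # bit j of rows[r] set => dataset j has bit r set
        self.names = []

    def _kmers(self, seq):
        # Distinct substrings of length k (empty set if seq is shorter than k).
        return {seq[i:i + self.k] for i in range(len(seq) - self.k + 1)}

    def _positions(self, kmer):
        # Assumption: seeded SHA-256 stands in for the Bloom filter hash family.
        return [
            int.from_bytes(
                hashlib.sha256(f"{seed}:{kmer}".encode()).digest()[:8], "big"
            ) % self.num_bits
            for seed in range(self.num_hashes)
        ]

    def add_dataset(self, name, seq):
        # Adding a dataset appends one column; existing columns are untouched.
        col = len(self.names)
        self.names.append(name)
        for kmer in self._kmers(seq):
            for r in self._positions(kmer):
                self.rows[r] |= 1 << col

    def search(self, query):
        """Return datasets whose filter contains every k-mer of the query.
        Bloom filters can give false positives but never false negatives."""
        query_kmers = self._kmers(query)
        if not query_kmers:
            return []
        hits = (1 << len(self.names)) - 1  # start with all datasets as candidates
        for kmer in query_kmers:
            for r in self._positions(kmer):
                hits &= self.rows[r]  # row lookup followed by bit-wise AND
        return [n for j, n in enumerate(self.names) if (hits >> j) & 1]
```

For example, after `idx.add_dataset("a", "ACGTACGTTAGCCGATAGGCTA")`, the call `idx.search("CGTTAGCCGAT")` is guaranteed to include `"a"`, because every k-mer of the query was inserted into that dataset's filter. Other datasets may also be returned as Bloom filter false positives.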

Code availability

An open source implementation of BIGSI can be found at https://github.com/phelimb/BIGSI. BIGSI v0.3.0 supports disk-based indexing via Berkeley-DB or RocksDB, as well as distributed in-memory key-value stores (via Redis, https://redis.io), and can be extended to any key-value store. The benchmarking uses the RocksDB key-value store and v0.2.0 of BIGSI, and the all-microbial index uses Berkeley-DB and BIGSI v0.1.7.

Data availability

All of the underlying genomic data for this study are publicly available at the ENA, and Supplementary Data can be found in the directory http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018. Supplementary Data 1–9 can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/supp or at https://figshare.com/s/b365381fcd9550e361da. Contents are as follows: Supplementary Data 1, MCR search results; Supplementary Data 2, plasmid search results; Supplementary Data 3, counts of five specific plasmids across genera; Supplementary Data 4, counts of MOB types across genera; Supplementary Data 5, CARD antibiotic resistance gene search results (T = 70%); Supplementary Data 6, benchmarking results; Supplementary Data 7, Bracken taxonomic results; Supplementary Data 8, MOB type definition fasta; Supplementary Data 9, MOB and T4SS search results (T = 100%). The all-microbial index itself is available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/all-microbial-index/. To facilitate reproducibility for others without having to download and process 170 Tb of raw data, we made the 26 Tb of cleaned de Bruijn (binary) graph files for the entire all-microbial index snapshot of the ENA available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/ctx. An archive of computational code can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/bigsi.tar.gz. An archive of the data underlying the figures can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data.zip and http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/ or as Supplementary Data 10 http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/.zip. We have also made a public instance of our index of the ENA available at http://bigsi.io, where the user can paste a sequence and search. This instance uses BIGSI v0.1.7 (using Berkeley-DB) and is hosted by CLIMB (http://www.climb.ac.uk/) on a 3 Tb RAM server.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Bradley, P. et al. Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis. Nat. Commun. 6, 10063 (2015).
  2. Brown, A. C. et al. Rapid whole-genome sequencing of Mycobacterium tuberculosis isolates directly from clinical samples. J. Clin. Microbiol. 53, 2230–2237 (2015).
  3. Quick, J. et al. Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232 (2016).
  4. Schmidt, K. et al. Identification of bacterial pathogens and antimicrobial resistance directly from clinical urines by nanopore-based metagenomic sequencing. J. Antimicrob. Chemother. 72, 104–114 (2017).
  5. Votintseva, A. A. et al. Same-day diagnostic and surveillance data for tuberculosis via whole-genome sequencing of direct respiratory samples. J. Clin. Microbiol. 55, 1285–1298 (2017).
  6. Shea, J. et al. Comprehensive whole-genome sequencing and reporting of drug resistance profiles on clinical cases of Mycobacterium tuberculosis in New York state. J. Clin. Microbiol. 55, 1871–1882 (2017).
  7. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
  8. Kent, W. J. BLAT-the BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
  9. Zhang, Z. et al. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214 (2000).
  10. Arredondo-Alonso, A. W. R., Schaik, W. V. & Schurch, C. On the (im)possibility of reconstructing plasmids from whole-genome short-read sequencing data. Microb. Genom. 3, e000128 (2017).
  11. Bradnam, K. R. et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. Gigascience 2, 10 (2013).
  12. Earl, D. et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2011).
  13. Solomon, B. & Kingsford, C. Fast search of thousands of short-read sequencing experiments. Nat. Biotechnol. 34, 300–302 (2016).
  14. Pandey, P. et al. Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 7, 201–207 e204 (2018).
  15. Solomon, B. & Kingsford, C. Improved search of large transcriptomic sequencing databases using split sequence Bloom trees. in International Conference on Research in Computational Molecular Biology 257–271 (Springer, 2017).
  16. Sun, C., Harris, R., Chikhi, R. & Medvedev, P. AllSome Sequence Bloom Trees. in International Conference on Research in Computational Molecular Biology 272–286 (Springer, 2018).
  17. Iqbal, Z., Caccamo, M., Turner, I., Flicek, P. & McVean, G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 44, 226–232 (2012).
  18. Muggli, M. D. et al. Succinct colored de Bruijn graphs. Bioinformatics 33, 3181–3187 (2017).
  19. Turner, I., Garimella, K., Iqbal, Z. & McVean, G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 34, 2556–2565 (2017).
  20. Almodaresi, F., Pandey, P. & Patro, R. Proc. 17th International Workshop on Algorithms in Bioinformatics. (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, 2017).
  21. Wong, H. K. T. L. H., Olken F., Rotem D., Wong L. In: Proc. 11th International Conference on Very Large Data Bases Vol. 11 (ed Pirotte, A. and Vassiliou, Y.) 448–457 (Stockholm, Sweden, 1985).
  22. Shepherd, M. A. P. W. & Chu, C. K. A fixed-size Bloom filter for searching textual documents. Comput. J. 32, 212–219 (1989).
  23. Zobel, J., Moffat, A. & Ramamohanarao, K. Inverted files versus signature files for text indexing. ACM Trans. Database Syst. 23, 453–490 (1998).
  24. Goodwin, B. H. M. et al. Proc. 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (ACM, 2017).
  25. Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 422–426 (1970).
  26. Pell, J. et al. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc. Natl Acad. Sci. USA 109, 13272–13277 (2012).
  27. Walker, T. M. et al. Whole-genome sequencing for prediction of Mycobacterium tuberculosis drug susceptibility and resistance: a retrospective cohort study. Lancet Infect. Dis. 15, 1193–1202 (2015).
  28. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
  29. Inouye, M. et al. SRST2: Rapid genomic surveillance for public health and hospital microbiology labs. Genome Med. 6, 90 (2014).
  30. Pandey, P. et al. Mantis: a fast, small, and exact large-scale sequence search index. Cell Syst. 7, 201–207 (2017).
  31. Lu, J., Breitwieser, F., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. PeerJ Comput. Sci. 3, e104 (2017).
  32. Wood, D. E. & Salzberg, S. L. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 15, R46 (2014).
  33. Hu, Y., Liu, F., Lin, I. Y., Gao, G. F. & Zhu, B. Dissemination of the mcr-1 colistin resistance gene. Lancet Infect. Dis. 16, 146–147 (2016).
  34. Lu, X. et al. MCR-1.6, a new MCR variant carried by an incp plasmid in a colistin-resistant Salmonella enterica serovar typhimurium isolate from a healthy individual. Antimicrob. Agents Chemother. 61, e02632–16 (2017).
  35. Matamoros, S. et al. Global phylogenetic analysis of Escherichia coli and plasmids carrying the mcr-1 gene indicates bacterial diversity but plasmid restriction. Sci. Rep. 7, 15364 (2017).
  36. Xavier, B. B. et al. Identification of a novel plasmid-mediated colistin-resistance gene, mcr-2, in Escherichia coli, Belgium, June 2016. Euro. Surveill. https://doi.org/10.2807/1560-7917.ES.2016.21.27.30280 (2016).
  37. Yin, W. et al. Novel plasmid-mediated colistin resistance gene mcr-3 in Escherichia coli. mBio https://doi.org/10.1128/mBio.00543-17 (2017).
  38. Ciric, L. J. A., Elvira de Vries L., Agerso Y., Mullany P., Roberts A. P. In: Madame Curie Bioscience Database [Internet] (Landes Bioscience, Austin, TX, 2013).
  39. Guglielmini, J., Quintais, L., Garcillán-Barcia, M. P., de la Cruz, F. & Rocha, E. P. The repertoire of ICE in prokaryotes underscores the unity, diversity, and ubiquity of conjugation. PLoS Genet. 7, e1002222 (2011).
  40. Jia, B. et al. CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res. 45, D566–D573 (2017).
  41. Eldholm, V. & Balloux, F. Antimicrobial resistance in Mycobacterium tuberculosis: the odd one out. Trends Microbiol. 24, 637–648 (2016).
  42. World Health Organization. Global Tuberculosis Report 2017. https://www.who.int/tb/publications/global_report/gtbr2017_main_text.pdf (2017).
  43. Gardy, J. L. & Loman, N. J. Towards a genomics-informed, real-time, global pathogen surveillance system. Nat. Rev. Genet. 19, 9–20 (2018).
  44. Schatz, M. C. & Phillippy, A. M. The rise of a digital immune system. Gigascience 1, 4 (2012).
  45. Eyre, D. W. et al. WGS to predict antibiotic MICs for Neisseria gonorrhoeae. J. Antimicrob. Chemother. 72, 1937–1947 (2017).
  46. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
  47. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
  48. Broder, A. & Mitzenmacher, M. Network applications of Bloom filters: a survey. Internet Math. 1, 485–509 (2004).

Acknowledgements

We would like to thank, for critical reading and helpful suggestions: G. Blackwell, R. Ffrancon, M. Hunt, J. Kelleher, J. Thornton, R. Patro, K. Malone, J. Marioni, A. Page, S. Gog, T. Bellman, F. Gauger. For enormous assistance with data download from EBI: R. Esnouf, G. Cochrane. For hosting our BIGSI demonstration: CLIMB. We acknowledge funding from Wellcome Trust Core Award Grant Number 203141/Z/16/Z. Z.I. was funded by a Wellcome Trust/Royal Society Sir Henry Dale Fellowship, grant number 102541/A/13/Z. G.M. was funded by Wellcome Trust grant 100956/Z/13/Z. E.P.C.R. was funded by the ANR MAGISBAC grant number ANR-14-CE10-0007-02. P.B. was funded by Wellcome Trust Studentship H5RZCO00.

Author information

Affiliations

  1. Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK

    • Phelim Bradley & Zamin Iqbal
  2. EMBL-EBI, Hinxton, UK

    • Phelim Bradley & Zamin Iqbal
  3. Center for Food Safety, Department of Food Science and Technology, University of Georgia, Griffin, GA, USA

    • Henk C. den Bakker
  4. UMR 3525, CNRS, Paris, France

    • Eduardo P. C. Rocha
  5. Microbial Evolutionary Genomics, Institut Pasteur, Paris, France

    • Eduardo P. C. Rocha
  6. Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK

    • Gil McVean

Authors

  1. Phelim Bradley

  2. Henk C. den Bakker

  3. Eduardo P. C. Rocha

  4. Gil McVean

  5. Zamin Iqbal

Contributions

Z.I. and G.M. designed and oversaw the study; P.B. invented the method, developed software and performed analyses; H.C.d.B. performed analyses for the plasmid study; E.P.C.R. codesigned and analyzed the conjugative system and plasmid analysis; Z.I. wrote the paper; all authors gave detailed feedback on the paper.

Competing interests

G.M. is a cofounder of, holder of shares in, and is a consultant to Genomics PLC.

Corresponding author

Correspondence to Zamin Iqbal.

Integrated supplementary information

  1. Supplementary Figure 1 Cartoon comparison between human and E. coli genomes.

    Cartoon comparison of human genomes (above) and E. coli (below) as a representative bacterium. In humans, genetic variation is dominated by relatively sparse single nucleotide polymorphisms (SNPs), nucleotide diversity π = 0.001, and less than 1% of a typical genome lies in a structural variant (SV) [1]. Human genomes are therefore relatively compressible. In stark contrast, genes make up around 88% of an E. coli genome [2], yet two E. coli genomes may only share around 60% of their genes [3], and conserved genes have much higher nucleotide diversity (0.02) [4]. Thus, bacterial genomes present different compression and indexing challenges to human genomes. 1. The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68-74, doi:10.1038/nature15393 (2015). 2. Blattner, F. R. et al. The complete genome sequence of Escherichia coli K-12. Science 277, 1453-1462 (1997). 3. Touchon, Marie, et al. "Organised genome dynamics in the Escherichia coli species results in highly diverse adaptive paths." PLoS genetics 5.1 (2009): e1000344. 4. Kaas, Rolf S., et al. "Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes." BMC genomics 13.1 (2012).

  2. Supplementary Figure 2 k-mer identity vs. percent nucleotide identity.

    We ran 1,000 simulations in which, on each iteration, we introduced one more random SNP into a sequence of length 1,000 bp and calculated the k-mer similarity between the original sequence and the sequence with introduced SNPs. Here, we plot the mean k-mer similarity observed for each percent identity. The grey area shows three standard deviations around this mean.

  3. Supplementary Figure 3 BIGSI scores vs. megaBLAST scores.

    megaBLAST scores for a search of 100 antimicrobial resistance genes in a BLAST database of RefSeq-81 vs. the equivalent BIGSI scores in a search of a BIGSI of RefSeq-81. Pearson correlation of the scores was r = 0.998.

  4. Supplementary Figure 4 Speed–space tradeoffs for exact-match queries.

    We show query time for 2,157 antimicrobial resistance genes with T = 100% vs. peak disk size when searching databases of sizes from 10–10,000 microbial datasets. BIGSI’s query time does not increase significantly with N, as in this regime the query time is dominated by the constant time row lookups, rather than the bit-wise AND calculations.

  5. Supplementary Figure 5 Simulated scaling to 1 million datasets.

    We simulated the peak data-structure storage requirements of BIGSI and SBT-fast when scaling to 1 million datasets, comparing performance with a high or low proportion of k-mers shared between datasets (note that the y axis is on a log scale). In the high k-mer sharing regime only 100 new k-mers are introduced per dataset, whereas the low k-mer sharing regime introduces 10,000 new k-mers per dataset. Since BIGSI scales linearly with the number of datasets and independently of the number of k-mers, it uses the same storage per dataset in each regime. However, SBT-fast scales super-linearly with N since its Bloom filter size depends on the total number of k-mers. For 1 million genomes with low k-mer sharing (right), which is the case we care about for global indexing, BIGSI would use 3.1 terabytes whereas SBT-fast would use 25 petabytes. When we index the ENA, we find each dataset adds 100,000 new k-mers on average, 10× more than the low k-mer sharing regime simulated here, which would further penalize SBT.

  6. Supplementary Figure 6 Counts of the most frequent bacterial genera in the SRA/ENA data set.

    Over 90% of the datasets were isolates of these 20 genera and 65% from the top 5 most prevalent genera.

  7. Supplementary Figure 7 Permutation test for difference in phylogenetic spread of plasmids with ≥3 resistance genes versus those with 0.

    We took the set of plasmids from Fig. 5 with at least 3 resistance genes (abbreviated 3G) and those with zero (abbreviated 0G). We defined “phylogenetic spread” of a plasmid as the median of the pairwise distances along a large subunit rRNA tree (incorporating branch lengths) between all pairs of genera in which the plasmid is detected, and calculated the 95% quantile of this distribution (red line). We then permuted the assignment of each plasmid to the classes 3G and 0G one million times (maintaining the class counts), each time calculating the difference in 95% quantiles for the two categories (3G and 0G). We show here the histogram of that statistic (i.e. the difference). This corresponds to a permutation test P value of 0.0024.

  8. Supplementary Figure 8 Distribution of MOB types among phyla.

    We show the proportion of each MOB type associated with different phyla based on a search of all known MOB types from Guglielmini et al. in the all-microbial-index.
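The simulation described for Supplementary Figure 2 can be sketched as follows. This is a hedged illustration rather than the authors' code: the function names, the default `k=31`, the cap of 100 SNPs, and the definition of k-mer similarity as the fraction of the original sequence's distinct k-mers that survive in the mutated sequence are all assumptions made here. One call performs a single run, adding one SNP at a time to a random 1,000 bp sequence.

```python
import random


def kmers(seq, k):
    """Set of distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(original, mutated, k):
    """Assumed definition: fraction of the original's distinct k-mers
    that also occur in the mutated sequence."""
    ref = kmers(original, k)
    return len(ref & kmers(mutated, k)) / len(ref)


def snp_series(length=1000, k=31, max_snps=100, seed=0):
    """Introduce one more SNP per step, each at a previously unmutated
    position, and record (percent identity, k-mer similarity) per step."""
    rng = random.Random(seed)
    original = "".join(rng.choice("ACGT") for _ in range(length))
    mutated = list(original)
    positions = rng.sample(range(length), max_snps)  # distinct SNP sites
    series = []
    for n, pos in enumerate(positions, start=1):
        # Substitute a base that differs from the original at this site.
        mutated[pos] = rng.choice([b for b in "ACGT" if b != original[pos]])
        identity = 100.0 * (length - n) / length
        series.append((identity, kmer_similarity(original, "".join(mutated), k)))
    return series
```

Repeating `snp_series` with different seeds and averaging the similarity at each percent identity would give a mean curve and standard deviation of the kind the caption describes. A single SNP can remove at most k of the original k-mers, which is why k-mer similarity falls much faster than nucleotide identity.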

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–8 and Supplementary Tables 1–3

  2. Reporting Summary

  3. Supplementary Data

    Supplementary Data 10

About this article

DOI

https://doi.org/10.1038/s41587-018-0010-1