Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
An open-source implementation of BIGSI can be found at https://github.com/phelimb/BIGSI. BIGSI v0.3.0 supports disk-based indexing via Berkeley DB or RocksDB, as well as distributed in-memory key-value stores via Redis (https://redis.io), and can be extended to any key-value store. The benchmarking used the RocksDB key-value store and BIGSI v0.2.0; the all-microbial index used Berkeley DB and BIGSI v0.1.7.
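To illustrate why any key-value store can serve as a backend, the following is a minimal sketch of a bit-sliced signature index. It is not the BIGSI code. It assumes one Bloom filter per dataset, stored as a column, so that each key is a row (bit position) and each value is a bit-vector over datasets. The class `BitslicedIndex`, the helpers `kmers` and `row_ids`, the toy parameters `K`, `M` and `H`, and the use of SHA-256 as the hash are all hypothetical choices made for illustration, and a plain Python `dict` stands in for the key-value store. The `threshold` argument is assumed to correspond to the proportion T of query k-mers that must be present.

```python
import hashlib

K = 5      # k-mer length (toy value, chosen only for this illustration)
M = 1024   # Bloom filter size in bits, i.e. the number of rows
H = 3      # number of hash functions per k-mer


def kmers(seq, k=K):
    """Return the set of k-mers in a sequence (empty if len(seq) < k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def row_ids(kmer, m=M, h=H):
    """Map a k-mer to h row indices using salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{kmer}".encode()).digest()[:8], "big") % m
        for i in range(h)
    ]


class BitslicedIndex:
    """Rows are keys; each value is an int used as a bit-vector over datasets."""

    def __init__(self):
        self.names = []  # column j corresponds to dataset self.names[j]
        self.rows = {}   # stand-in for a key-value store: row index -> bit-vector

    def insert(self, name, seq):
        """Append a new dataset as a new column, leaving existing columns untouched."""
        col = len(self.names)
        self.names.append(name)
        for kmer in kmers(seq):
            for r in row_ids(kmer):
                self.rows[r] = self.rows.get(r, 0) | (1 << col)

    def search(self, query, threshold=1.0):
        """Return datasets containing at least `threshold` of the query k-mers."""
        query_kmers = kmers(query)
        if not query_kmers:
            return []
        n = len(self.names)
        counts = [0] * n
        for kmer in query_kmers:
            hits = (1 << n) - 1  # start with every dataset as a candidate
            for r in row_ids(kmer):
                hits &= self.rows.get(r, 0)  # AND the rows selected by the hashes
            for col in range(n):
                if (hits >> col) & 1:
                    counts[col] += 1
        return [
            name for name, c in zip(self.names, counts)
            if c / len(query_kmers) >= threshold
        ]
```

In this sketch a query touches only the rows selected by its k-mer hashes, and adding a dataset only sets bits in a new column, which is one way an index of this shape could grow incrementally. As with any Bloom filter, a search can return false positives but not false negatives.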
All of the underlying genomic data for this study are publicly available at the ENA, and Supplementary Data can be found in the directory http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018. Supplementary Data 1–9 can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/supp or at https://figshare.com/s/b365381fcd9550e361da. Contents are as follows: Supplementary Data 1, MCR search results; Supplementary Data 2, plasmid search results; Supplementary Data 3, counts of five specific plasmids across genera; Supplementary Data 4, counts of MOB types across genera; Supplementary Data 5, CARD antibiotic resistance gene search results (T = 70%); Supplementary Data 6, benchmarking results; Supplementary Data 7, Bracken taxonomic results; Supplementary Data 8, MOB type definition FASTA; Supplementary Data 9, MOB and T4SS search results (T = 100%). The all-microbial index itself is available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/all-microbial-index/. To let others reproduce this work without downloading and processing 170 Tb of raw data, we have made the 26 Tb of cleaned (binary) de Bruijn graph files for the entire all-microbial index snapshot of the ENA available at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/ctx. An archive of computational code can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/bigsi.tar.gz. An archive of the data underlying the figures can be found at http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data.zip and http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/ or as Supplementary Data 10 http://ftp.ebi.ac.uk/pub/software/bigsi/nat_biotech_2018/figure-data/.zip. We have also made a public instance of our index of the ENA available at http://bigsi.io, where users can paste a sequence and search. This instance uses BIGSI v0.1.7 (with Berkeley DB) and is hosted by CLIMB (http://www.climb.ac.uk/) on a server with 3 Tb of RAM.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We would like to thank, for critical reading and helpful suggestions: G. Blackwell, R. Ffrancon, M. Hunt, J. Kelleher, J. Thornton, R. Patro, K. Malone, J. Marioni, A. Page, S. Gog, T. Bellman, F. Gauger. For enormous assistance with data download from EBI: R. Esnouf, G. Cochrane. For hosting our BIGSI demonstration: CLIMB. We acknowledge funding from Wellcome Trust Core Award Grant Number 203141/Z/16/Z. Z.I. was funded by a Wellcome Trust/Royal Society Sir Henry Dale Fellowship, grant number 102541/A/13/Z. G.M. was funded by Wellcome Trust grant 100956/Z/13/Z. E.P.C.R. was funded by the ANR MAGISBAC grant number ANR-14-CE10-0007-02. P.B. was funded by Wellcome Trust Studentship H5RZCO00.