Supplementary Figure 5: Simulated scaling to 1 million datasets. | Nature Biotechnology


From: Ultrafast search of all deposited bacterial and viral genomic data


We simulated the peak data-structure storage requirements of BIGSI and SBT-fast when scaling to 1 million datasets, comparing a high and a low proportion of k-mer sharing between datasets (note that the y axis is on a log scale). In the high k-mer sharing regime, only 100 new k-mers are introduced per dataset, whereas in the low k-mer sharing regime 10,000 new k-mers are introduced per dataset. Since BIGSI scales linearly with the number of datasets, N, and independently of the number of k-mers, it uses the same storage per dataset in each regime. SBT-fast, however, scales super-linearly with N, since its Bloom filter size depends on the total number of k-mers. For 1 million genomes with low k-mer sharing (right), which is the case that matters for global indexing, BIGSI would use 3.1 terabytes whereas SBT-fast would use 25 petabytes. When we index the ENA, we find that each dataset adds 100,000 new k-mers on average, 10x more than in the low k-mer sharing regime simulated here, which would further penalize SBT.
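The scaling argument in the caption can be illustrated with a minimal back-of-the-envelope sketch. This is not the authors' simulation. It assumes a fixed-size Bloom filter per dataset for BIGSI, and an SBT-style binary tree of 2N - 1 nodes in which every node holds an uncompressed Bloom filter sized to the total number of distinct k-mers. The function names and the constants (`bits_per_dataset`, `bits_per_kmer`, `base_kmers`) are hypothetical, chosen only so that the totals land near the figures quoted above.

```python
# Toy storage model: every constant below is an illustrative assumption,
# not a value taken from the caption.

def bigsi_bits(n_datasets, bits_per_dataset=25_000_000):
    """One fixed-size Bloom filter per dataset: linear in N and
    independent of how many distinct k-mers exist overall."""
    return n_datasets * bits_per_dataset


def sbt_bits(n_datasets, new_kmers_per_dataset, bits_per_kmer=10, base_kmers=0):
    """SBT-style model: a binary tree with N leaves has 2N - 1 nodes, and
    each node's filter is sized to the total number of distinct k-mers."""
    total_kmers = base_kmers + n_datasets * new_kmers_per_dataset
    filter_bits = bits_per_kmer * total_kmers
    n_nodes = 2 * n_datasets - 1
    return n_nodes * filter_bits


def to_terabytes(bits):
    """Convert bits to decimal terabytes (10**12 bytes)."""
    return bits / 8 / 1e12


if __name__ == "__main__":
    regimes = {"high sharing": 100, "low sharing": 10_000}
    for n in (1_000, 10_000, 100_000, 1_000_000):
        for name, new_kmers in regimes.items():
            print(
                f"N={n:>9,} {name:>12}: "
                f"BIGSI {to_terabytes(bigsi_bits(n)):.4g} TB, "
                f"SBT-style {to_terabytes(sbt_bits(n, new_kmers)):.4g} TB"
            )
```

Under these assumed constants, 1 million datasets in the low-sharing regime come out at roughly 3.1 terabytes for the BIGSI model and roughly 25 petabytes for the SBT-style model. Because the node count and the per-node filter size both grow with N, doubling N roughly quadruples the SBT-style total, while the BIGSI total only doubles. Agreement with the quoted figures does not by itself confirm that this is the model used in the original simulation.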
