Abstract
BLAST is arguably the single most important piece of software ever written for the biological sciences. It is the core of most bioinformatics workflows, being a critical component of genome homology searches and annotation. It has influenced the landscape of biology by aiding in everything from functional characterization of genes to pathogen detection to the development of novel vaccines. While BLAST is very popular, it is also often one of the most computationally intensive parts of bioinformatics analysis. In our workflows, BLAST typically takes the majority of CPU time, and we need to parallelize to finish in a reasonable time frame. Waiting for BLAST to finish without any clue of how long it's going to take is kind of depressing, and you could waste a day of work trying to run a job that would never finish. If you feel the same way we do, then check out Cunningham, a tool we designed to estimate BLAST runtimes for shotgun sequence datasets using sequence composition statistics. We've trained its models on real metagenomic sequence data using the Amazon EC2 cloud, and it will provide a relatively quick estimate for datasets with up to tens of millions of sequences. It's not perfect, but it'll give you at least some idea of expected runtime, how large a cluster you're going to need, how much you'll need to partition your data, etc. We use it all the time now, so we hope it'll be useful to someone else out there. Cunningham has been implemented in CloVR for efficient autoscaling in the cloud and is freely available at http://clovr.org.
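To make the idea concrete, here is a minimal sketch of runtime estimation from sequence composition statistics, in the spirit of what the abstract describes. The feature set (sequence count, total residues, mean length, GC fraction) and the linear model coefficients below are illustrative assumptions, not Cunningham's published model; a real estimator would fit its coefficients on benchmark BLAST runs.

```python
# Hypothetical sketch: estimate BLAST runtime from sequence composition
# statistics. All coefficients are made up for illustration and are NOT
# the model used by Cunningham.

def composition_features(sequences):
    """Summarize a shotgun dataset: sequence count, total residues,
    mean length, and overall GC fraction."""
    n = len(sequences)
    total = sum(len(s) for s in sequences)
    gc = sum(s.upper().count("G") + s.upper().count("C") for s in sequences)
    return {
        "n_seqs": n,
        "total_residues": total,
        "mean_length": total / n if n else 0.0,
        "gc_fraction": gc / total if total else 0.0,
    }

def estimate_runtime_seconds(features, coef=None):
    """Linear model: runtime ~ intercept + sum(coef_i * feature_i).
    Coefficients here are placeholders; in practice they would be
    fitted against timed BLAST jobs on known hardware."""
    coef = coef or {"intercept": 5.0, "total_residues": 0.002, "n_seqs": 0.01}
    est = coef["intercept"]
    est += coef["total_residues"] * features["total_residues"]
    est += coef["n_seqs"] * features["n_seqs"]
    return est

# Toy usage: three short reads.
reads = ["ACGTACGTGG", "TTGCAACGTT", "GGGCCCATAT"]
feats = composition_features(reads)
print(round(estimate_runtime_seconds(feats), 3))
```

An estimate like this is what lets a pipeline decide, before launching anything, how many nodes to request and how finely to partition the input.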
Cite this article
White, J., Matalka, M., Fricke, W. et al. Cunningham: a BLAST Runtime Estimator. Nat Prec (2011). https://doi.org/10.1038/npre.2011.5593.1