Imagine putting together a jigsaw puzzle with no reference picture. Now imagine doing the same task with many duplicates of some pieces and others that are unique or missing entirely.

Scientists face a similar challenge in attempting genomic analysis of newly discovered bacterial species. However, a new computational strategy devised by scientists at the J. Craig Venter Institute (JCVI) and the University of California at San Diego has proven adept at turning such confusing mountains of fragments into a coherent whole.

Unlike laboratory-friendly bugs such as Escherichia coli, most bacterial species are intractable to cultivation, and scientists must make do with limited genetic material. Several years ago, Roger Lasken and colleagues helped usher in a new era of bacterial genomics with a technique known as multiple displacement amplification (MDA). MDA efficiently generates a plethora of amplified sequence fragments from the DNA contents of a single cell; where these fragments overlap, scientists can assemble 'contigs' that span larger genomic segments.
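To make the idea of building contigs from overlapping fragments concrete, here is a minimal sketch in Python. It is a toy greedy overlap merger, not the method used by the researchers or by production assemblers, and it ignores reverse complements and sequencing errors. The function names, the `min_len` parameter and the example reads are all hypothetical.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical reads: three overlap one another, one overlaps nothing.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA", "CCCGGGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

The three overlapping reads collapse into one longer contig, while the read with no overlapping partner stays on its own, which is the situation a poorly amplified region creates.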

Conventional sequence-assembly algorithms can find MDA-derived data difficult to work with, however. “You get very biased amplification,” says Lasken, who is now at the JCVI. “You may have thousands of copies of one part of the sequence and few or no copies of other parts.” The sparsely covered parts pose the tougher problem for assembly, and most existing algorithms simply skip genomic segments for which the number of overlapping reads is too low for confident contig building. However, this wastes potentially valuable data. Lasken therefore teamed up with University of California at San Diego bioinformatician Pavel Pevzner to design a more efficient algorithm for MDA data analysis.

Instead of starting with a strict threshold for sequence coverage, Pevzner's algorithm begins assembly with a much lower cutoff, which allows low-coverage regions to be included, and then raises the bar for inclusion at subsequent stages of analysis, as groups of smaller contigs are being assembled into larger contigs. “The real breakthrough is on the informatics side,” says Lasken. “Instead of losing those rare reads, this software makes use of them and gets a much more complete assembly.”
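The contrast between a single strict cutoff and a cutoff that rises in stages can be illustrated with a small Python sketch. This is a simplified toy under stated assumptions, not the published algorithm: contigs are nodes in a small directed graph, each with a length and an average coverage, and two contigs are joined only when the link between them is unambiguous. The `Contig` class, the function names and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contig:
    length: int
    coverage: float  # average read depth over the contig


def remove_below(contigs, edges, cutoff):
    """Drop contigs whose coverage is below `cutoff`, plus their edges."""
    kept = {n: c for n, c in contigs.items() if c.coverage >= cutoff}
    kept_edges = {(u, v) for (u, v) in edges if u in kept and v in kept}
    return kept, kept_edges


def merge_unbranched(contigs, edges):
    """Join u -> v whenever u has one successor and v one predecessor."""
    contigs, edges = dict(contigs), set(edges)
    while True:
        out_deg, in_deg = {}, {}
        for u, v in edges:
            out_deg[u] = out_deg.get(u, 0) + 1
            in_deg[v] = in_deg.get(v, 0) + 1
        pair = next(((u, v) for (u, v) in sorted(edges)
                     if u != v and out_deg[u] == 1 and in_deg[v] == 1), None)
        if pair is None:
            return contigs, edges
        u, v = pair
        a, b = contigs[u], contigs.pop(v)
        total = a.length + b.length
        # Coverage of the joined contig is the length-weighted average.
        avg = (a.length * a.coverage + b.length * b.coverage) / total
        contigs[u] = Contig(total, avg)
        # Remove the merged edge and re-attach v's outgoing edges to u.
        edges = {(u if x == v else x, y) for (x, y) in edges if (x, y) != (u, v)}


def fixed_filter(contigs, edges, cutoff):
    """One strict cutoff applied before any merging."""
    contigs, edges = remove_below(contigs, edges, cutoff)
    contigs, _ = merge_unbranched(contigs, edges)
    return contigs


def progressive_filter(contigs, edges, final_cutoff):
    """Raise the cutoff step by step, merging after every step."""
    for cutoff in range(1, final_cutoff + 1):
        contigs, edges = remove_below(contigs, edges, cutoff)
        contigs, edges = merge_unbranched(contigs, edges)
    return contigs


# Hypothetical graph: A -> B -> C is real sequence, with B poorly amplified;
# E is a short, barely covered spurious branch off A.
contigs = {"A": Contig(1000, 50), "B": Contig(200, 3),
           "C": Contig(1000, 40), "E": Contig(30, 1)}
edges = {("A", "B"), ("B", "C"), ("A", "E")}

fixed = fixed_filter(contigs, edges, cutoff=5)
progressive = progressive_filter(contigs, edges, final_cutoff=5)
print(len(fixed), sum(c.length for c in fixed.values()))
print(len(progressive), sum(c.length for c in progressive.values()))
```

On this toy input, the fixed cutoff of 5 discards segment B and leaves two separate 1,000-base contigs. The stepwise version joins B to C at the first pass, removes the spurious branch E at the second, and ends with a single 2,200-base contig, because B's low coverage is averaged with that of its well-covered neighbors before the stricter cutoffs are applied.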

As proof of concept, the researchers showed that, compared with existing algorithms, this approach generates assemblies from individual E. coli and Staphylococcus aureus cells that contain a higher proportion of complete genes and operons, with a lower error rate and fewer misassembled contigs.

Their approach also performed well with individual cells of a previously uncharacterized marine bacterium, generating a genome assembly with larger, higher-quality contigs compared with those produced by older algorithms. Most of the expected metabolic genes appeared to be represented in their assembly, suggesting a high degree of completeness, and the researchers made preliminary deductions about the physiology of this bacterium based on some of the pathways that they identified.

Lasken's team is continuing to improve MDA while Pevzner and colleagues work toward a more streamlined analytical process. “You could conceivably go from finding an organism to having its assembled genome in a week,” says Lasken. Such speed should prove useful as he and his colleagues at the JCVI continue their efforts to catalog and characterize the numerous bacterial species that make their home in the gut, mouth and other reservoirs of the human body. “We have a huge number of bacteria we know almost nothing about,” says Lasken. “If we could even get 5 or 10% of their genome, it could be tremendously interesting.”