The complexity of microbial communities can vary drastically, from a couple of microorganisms to thousands or even millions, making the reconstruction of whole genomes from some samples tricky. “If the community is low in complexity, it should allow one to reconstruct genomes with high accuracy,” says Isidore Rigoutsos, manager of the bioinformatics and pattern-discovery group at IBM's Thomas J. Watson Research Center in Yorktown Heights, New York. But when it comes to highly complex communities, things are less straightforward.
Rigoutsos and his team have tested several genome assemblers and gene-prediction tools on simulated metagenomic data sets with varying degrees of complexity. Knowing the composition of the community allowed the team to benchmark and evaluate the tools.
“We found that as the complexity increased, many of the computational tools had an increasingly hard time,” says Rigoutsos. For most high-complexity samples, he says, the genome assemblers could not generate larger contigs, and several of the contigs that were assembled were actually chimaeric, mixing sequences from different organisms.
For metagenomic analysis, smaller contigs and single reads make assigning the sequence to a specific microorganism difficult. “We want to be able to assign a read of less than 1,000 nucleotides,” says Rigoutsos, which might allow researchers to determine species composition from high-complexity samples without the need to generate larger contigs.
Rigoutsos and his colleagues have made three simulated data sets available to researchers interested in testing assembly and prediction programs.
The problem of data analysis is not restricted to metagenomics. A growing number of researchers are using next-generation sequencing platforms and generating quantities of data that in the past might have been possible only at large genome centres. Several companies are developing software to address this issue.
CLC bio in Cambridge, Massachusetts, offers the CLC Genomics Workbench, which provides reference assemblies of data from various next-generation sequencing systems as well as mutation detection. A future version of the program will incorporate algorithms for the de novo assembly of Sanger as well as next-generation sequence data. Meanwhile, Geospiza in Seattle, Washington, and GenomeQuest in Westborough, Massachusetts, are developing software to analyse data generated by Applied Biosystems' SOLiD next-generation sequencing platform.
The combination of assembly software and data sets to benchmark results should help solve some of the complexity problems associated with metagenomics. “If you sequence sufficiently, even 200-base-pair reads are enough,” says Rigoutsos. But he adds that the real question is how many 200-base-pair reads will be needed before we can truly understand complex communities.
Others are finding that with enough reads, fewer than 200 base pairs might be sufficient. Jens Stoye from Bielefeld University in Germany has compared a data set of 35-base-pair reads generated on the Genome Analyzer from Illumina in San Diego, California, with a 454 data set for the same low-complexity sample. Although 99% of the Genome Analyzer's sequence data were discarded, the system generates up to 50 million reads, so he could assign the species in the sample with the same efficiency from both data sets.
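To put those figures in perspective, a rough back-of-envelope calculation helps. The sketch below is illustrative only: it assumes the run reached the full 50 million reads quoted as the upper limit, and that the 99% discard rate applies evenly to reads rather than to bases. The function name `usable_yield` is hypothetical and not taken from any tool mentioned here.

```python
# Back-of-envelope arithmetic using the figures quoted in the text.
# Assumptions (not from the source): the run produced the full 50 million
# reads, and "99% discarded" is applied to the read count.

def usable_yield(total_reads: int, read_length_bp: int, discard_percent: int) -> tuple[int, int]:
    """Return (reads kept, bases kept) after discarding a percentage of reads."""
    # Integer arithmetic avoids floating-point rounding on large counts.
    reads_kept = total_reads * (100 - discard_percent) // 100
    bases_kept = reads_kept * read_length_bp
    return reads_kept, bases_kept


reads_kept, bases_kept = usable_yield(50_000_000, 35, 99)
print(f"reads kept: {reads_kept:,}")
print(f"bases kept: {bases_kept:,}")
```

Under these assumptions, even the 1% of data that survives filtering amounts to half a million reads, or 17.5 million base pairs, which suggests why such short reads could still be informative for a low-complexity sample.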