Mitch Sogin tested the accuracy of 454 sequencing. Credit: G. SOGIN

Mitch Sogin, director of the Josephine Bay Paul Center for Comparative Molecular Biology and Evolution at Woods Hole Marine Biological Laboratory in Massachusetts, performs environmental sampling of nucleic-acid sequences. “Every sequence has the potential to tell us an important story,” says Sogin, so highly accurate analysis techniques are needed.

But when Sogin's lab switched over from traditional Sanger-based sequencing to the next-generation sequencing system of 454 Life Sciences in Branford, Connecticut, to study these environmental samples, something strange happened. “The diversity was between ten- and a hundred-fold more divergent than we expected,” recalls Sogin.

So Sogin and his colleagues needed to determine whether the unexpected findings represented true biological diversity, or just errors caused by the new sequencing technology. “We had to explore just how good the sequencing technology actually was,” he says.

Sogin and his colleagues ran a straightforward experiment: using a 454 Life Sciences Genome Sequencer 20, they resequenced more than 50 templates and cloned sequences that they had previously sequenced with the Sanger method. The work showed that the 454 system was 98% accurate when no culling was used to remove bad bases or reads (ref. 6).

However, applying a very simple set of rules, which discarded between 10% and 20% of the data, pushed the accuracy up to 99.75%. Sogin is comfortable with giving up as much as 20% of the data for this level of accuracy, because the latest 454 system, the Genome Sequencer FLX, can produce up to 400,000 reads per run.
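The arithmetic behind this trade-off can be made explicit. The following is only a rough sketch using the figures quoted above. It assumes that the accuracy values are per-base figures and that the results measured on the Genome Sequencer 20 carry over to the Genome Sequencer FLX, neither of which is stated here. The function names are illustrative.

```python
# Figures quoted in the text
RAW_ACCURACY = 0.98        # no culling of bad bases or reads
CULLED_ACCURACY = 0.9975   # after applying the simple filtering rules
READS_PER_RUN = 400_000    # upper figure quoted for the Genome Sequencer FLX


def reads_kept(total_reads, discard_fraction):
    """Reads remaining after a given fraction is thrown away."""
    return round(total_reads * (1 - discard_fraction))


def error_reduction(accuracy_before, accuracy_after):
    """Fold reduction in error rate, where error rate = 1 - accuracy."""
    return (1 - accuracy_before) / (1 - accuracy_after)


# Assumption: GS20 accuracy figures are applied to FLX read counts.
kept_worst_case = reads_kept(READS_PER_RUN, 0.20)  # discard 20%
kept_best_case = reads_kept(READS_PER_RUN, 0.10)   # discard 10%
fold = error_reduction(RAW_ACCURACY, CULLED_ACCURACY)

print(f"Reads kept per run: {kept_worst_case:,} to {kept_best_case:,}")
print(f"Error rate falls roughly {fold:.0f}-fold")
```

Under these assumptions, a run still yields between 320,000 and 360,000 reads, while the error rate drops from 2% to 0.25%, about an eightfold improvement.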

Others agree that for some applications the large amount of data generated by the next-generation sequencing systems could trump the accuracy produced through Sanger sequencing. “For applications such as ChIP-sequencing you can use the 454 or Solexa 1G data even though they have lower base accuracy because you do not need it. What you need is the volume for the experiment,” says Chad Nusbaum of the Broad Institute in Cambridge, Massachusetts.

Sogin now thinks that traditional sequencing methods had been underestimating the biological diversity of the environmental nucleic-acid samples. “Turns out that the diversity is coming from low-abundance nucleic-acid populations that you are not likely to encounter if you sequence only a few hundred molecules. You see these low-abundance molecules only if you sequence many tens of thousands of molecules,” he says.
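Sogin's point about sampling depth can be illustrated with a simple probability estimate. The sketch below assumes that each sequenced molecule is an independent random draw from the sample, so the chance of seeing a given type at least once in n reads is 1 - (1 - f)^n, where f is its relative abundance. The abundance of 1 in 10,000 and the read counts are hypothetical values chosen for illustration, not figures from Sogin's data.

```python
def detection_probability(abundance, n_reads):
    """Chance of seeing a molecule type at least once in n_reads draws.

    Assumes independent random sampling, with each read having
    probability `abundance` of coming from the type of interest.
    """
    return 1 - (1 - abundance) ** n_reads


# Hypothetical rare population: 1 molecule in 10,000
RARE_ABUNDANCE = 1e-4

p_small_survey = detection_probability(RARE_ABUNDANCE, 300)      # a few hundred reads
p_deep_survey = detection_probability(RARE_ABUNDANCE, 30_000)    # tens of thousands

print(f"300 reads:    {p_small_survey:.1%} chance of detection")
print(f"30,000 reads: {p_deep_survey:.1%} chance of detection")
```

With these illustrative numbers, a survey of 300 molecules has only about a 3% chance of picking up the rare population, whereas 30,000 reads would find it about 95% of the time.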

N.B.