Simpson, J.T. & Durbin, R. Genome Res. advance online publication (7 December 2011).

Sequencing technology no longer limits the generation of DNA sequence; instead, researchers run into computational challenges when trying to make sense of up to hundreds of gigabases of fragmented information from a single genome. For certain applications, such as assessing individual differences among human genomes or analyzing genomes that lack a reference, de novo assembly is the only way forward. Simpson & Durbin now lower the memory requirements of assembly by using compressed data structures in their string graph assembler (SGA). After filtering out low-quality reads, SGA creates a compressed index of the intact reads and queries it to build a graph from the overlaps between reads. This graph is the basis for longer contiguous sequences (also known as contigs), which in turn are used to construct larger scaffolds. The researchers apply SGA to a human genome, and the resulting assembly covers 95% of the reference genome.
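
To make the idea of a graph built from read overlaps concrete, here is a minimal Python sketch. It is not SGA's method: it compares every pair of reads directly, which takes quadratic time, whereas SGA queries a compressed index to find overlaps. The function names, the toy reads and the minimum overlap length are all hypothetical, and a real string graph would also remove contained reads and transitive edges.

```python
def suffix_prefix_overlap(a, b, min_len):
    """Return the length of the longest suffix of `a` that equals a prefix
    of `b`, or 0 if no such overlap is at least `min_len` long."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def build_overlap_graph(reads, min_len=3):
    """Naive all-pairs overlap graph: graph[i][j] = overlap length of the
    edge from read i to read j. (Illustrative only; no index is used.)"""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            k = suffix_prefix_overlap(a, b, min_len)
            if k:
                graph.setdefault(i, {})[j] = k
    return graph


def greedy_contig(reads, graph, start):
    """Walk from `start`, always following the longest overlap to a read not
    yet used, and merge the reads along the way into one contig."""
    contig = reads[start]
    seen = {start}
    node = start
    while True:
        candidates = {j: k for j, k in graph.get(node, {}).items()
                      if j not in seen}
        if not candidates:
            return contig
        nxt = max(candidates, key=candidates.get)
        contig += reads[nxt][candidates[nxt]:]  # append the non-overlapping tail
        seen.add(nxt)
        node = nxt


# Toy reads (hypothetical data, chosen so that each overlaps the next by 4 bases)
reads = ["ACGTAC", "GTACGG", "ACGGTT"]
graph = build_overlap_graph(reads, min_len=3)
contig = greedy_contig(reads, graph, start=0)
print(graph)
print(contig)
```

On these three toy reads the graph has two edges (read 0 to read 1, and read 1 to read 2), and walking them merges the reads into a single contig. At the scale of a human genome the all-pairs comparison is impractical, which is why an index over the reads matters.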