Two recent papers suggest that a revolution in DNA sequencing is at hand1, 2 and describe compelling alternative DNA sequencing technologies that promise to change the way human genomes are resequenced.

Scientific revolutions, nearly by definition, change the research landscape in unanticipated ways. Completing the Human Genome Project required multiple innovations,3 not the least of which was the large-scale application of gel electrophoresis and Sanger sequencing chemistry4 in a small number of highly automated industrial genome sequencing centers. One upshot of the Human Genome Project is the future vision of ‘individualized medicine.’ Genomics technologies that could resequence individual genomes might provide genetic explanations for phenotypic variation in disease susceptibility and drug response, leading to improved patient care.

In the course of sequencing a human reference genome, the traditional industrial model has consistently produced ever-greater quantities of data at ever-lower cost. However, it seems unlikely that current approaches are sufficiently scalable to fulfill the promise of individual genomic medicine.5 Realizing this possibility will seemingly require another revolution in DNA sequencing technology: a revolution that these new papers indicate might be upon us.

Genome sequencing can be divided into four steps: (1) break a large DNA polymer into smaller fragments, (2) isolate and amplify single fragments, (3) determine the fragment sequence and (4) perform automated data quality assessment and sequence assembly to reconstruct the original DNA polymer sequence. Traditional DNA sequencing protocols use libraries of cloned fragments and Sanger sequencing chemistry to accomplish the first three steps. Alternative approaches accomplishing these same tasks are presented in these two papers.1, 2
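As a toy illustration of how these four steps fit together, the Python sketch below shears a random reference string, 'reads' a fixed number of bases from each fragment, and tallies per-base coverage in place of true assembly. It is entirely illustrative: the function names, fragment counts and read lengths are invented here and are not drawn from either paper.

```python
import random

random.seed(0)

def shear(genome, n_fragments, frag_len):
    """Step 1: break the genome into random fragments (toy stand-in for shearing)."""
    starts = [random.randrange(0, len(genome) - frag_len) for _ in range(n_fragments)]
    return [(s, genome[s:s + frag_len]) for s in starts]

def amplify(fragments):
    """Step 2: clonal amplification; a no-op here, since a Python string copies perfectly."""
    return fragments

def sequence(fragments, read_len):
    """Step 3: 'read' the first read_len bases of each fragment."""
    return [(s, frag[:read_len]) for s, frag in fragments]

def coverage(reads, genome_len):
    """Step 4, greatly simplified: tally per-base coverage using the known fragment
    positions instead of performing a real alignment or assembly."""
    depth = [0] * genome_len
    for start, read in reads:
        for i in range(start, min(start + len(read), genome_len)):
            depth[i] += 1
    return depth

genome = "".join(random.choice("ACGT") for _ in range(2000))
reads = sequence(amplify(shear(genome, n_fragments=400, frag_len=100)), read_len=30)
depth = coverage(reads, len(genome))
print("mean coverage:", sum(depth) / len(depth))
print("fraction of bases with >=4x coverage:", sum(d >= 4 for d in depth) / len(depth))
```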

Shendure et al1 employed a multiplex polymerase colony, or polony, protocol to generate approximately 1.6 million fragments, each 135 basepairs (bp) in length (step 1). Each fragment shares 100 bp of common sequence and contains two mate-pair tags of 17 and 18 bp drawn from the genome being sequenced. The tags consist of random genomic sequences selected to lie approximately 1000 bp apart. Each fragment is attached to a separate 1 μm bead, amplified using a water-in-oil emulsion PCR protocol (step 2), and immobilized in a 1.5 cm² acrylamide gel. The fragments are sequenced in parallel via a ligation protocol that uses four dyes to identify each possible base (step 3). In all, 13 bp from each tag, or a total of 26 bp, are determined for each fragment. Strikingly, the protocols described were implemented with off-the-shelf instrumentation and reagents, suggesting that, in principle, single laboratories could perform these assays.

Margulies et al2 also avoid traditional library construction. They shear an entire genome into 300 bp DNA fragments (step 1), add specialized common adaptors, capture individual fragments on beads, and clonally amplify each fragment within an emulsion (step 2). The beads are then distributed across the open wells of a fiber-optic slide, and pyrosequencing chemistries are used to determine the sequence of each fragment (step 3). Average read lengths of 100 bp were reported. The authors suggest that mate-pair reads are possible by sequencing the same bead-bound fragment from both directions. A commercial system based on this approach, which requires limited laboratory space and personnel to operate, is currently available.

High-throughput methods of data generation require quantitative measures of data quality (step 4). Traditional DNA sequencing uses Phred scores to quantify the probability of a basecalling error.6, 7 A similar approach can be used to evaluate these two technologies. The Shendure/Porreca group estimated data quality by resequencing an Escherichia coli MG1655 genome expected to differ from the reference sequence at a number of known and unknown sites. In all, 70% of the 3.3 Mb genome had 4× or greater coverage. No substitution errors were observed, implying an error rate of <1 per 3.3 million bases sequenced, or a Phred score of 65 (Table 1). Margulies et al sequenced a Mycoplasma genitalium genome (580 kb) to estimate data quality. At high-coverage sites (98.2% of the genome), the error rate was 3.0 × 10⁻⁶, corresponding to a Phred score of 55 (Table 1). Remarkably, Margulies et al were able to resequence the genome eight times, for 40-fold coverage, in only 243 min of instrument run time. Both methods, it seems, can produce very high-quality data.
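For readers less familiar with Phred scores, the conversion from a per-base error probability P to a quality score is Q = −10 log₁₀ P. The minimal Python sketch below simply reproduces the two scores quoted above from the reported error rates; nothing here comes from the papers beyond those two numbers.

```python
import math

def phred(error_probability):
    """Convert a per-base error probability into a Phred quality score."""
    return -10 * math.log10(error_probability)

# Shendure et al.: no substitution errors seen, so the rate is bounded by <1 per 3.3 million bases
print(round(phred(1 / 3.3e6)))   # ~65

# Margulies et al.: observed error rate of 3.0e-6 at high-coverage sites
print(round(phred(3.0e-6)))      # ~55
```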

Table 1 Representative error rates and their quantification

What do these approaches have in common? Both perform sample preparation directly on entire genomes, avoiding slower and more expensive clone-based methodologies. Both use similar emulsion PCR protocols to clonally amplify single fragments. While both papers report raw accuracies and read lengths (26 bp,1 100 bp2) significantly lower than those of Sanger sequencing (∼700 bp), they compensate by generating many more sequences per run (∼1 600 000,1 ∼300 0002) than Sanger sequencing (96). Perhaps most impressive, the cost per high-quality base with either technology is roughly an order of magnitude lower than that of conventional sequencing. If one also considers the lower costs associated with limited infrastructure and personnel, these approaches become even more attractive. Future improvements in library density and read length for both technologies will further reduce cost while increasing throughput.
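To make the 'many more sequences' point concrete, the back-of-the-envelope calculation below multiplies the read lengths and read counts quoted in this paragraph to estimate raw bases generated per run; it is purely illustrative and ignores mate-pairing, quality filtering and run-to-run variation.

```python
# (read length in bp, reads per run), using only the figures quoted above
platforms = {
    "Shendure et al. (polony ligation)": (26, 1_600_000),
    "Margulies et al. (pyrosequencing)": (100, 300_000),
    "Sanger capillary run (96 samples)": (700, 96),
}

for name, (read_len, n_reads) in platforms.items():
    raw_mb = read_len * n_reads / 1e6
    print(f"{name}: ~{raw_mb:.2f} Mb of raw sequence per run")
```

On these rough figures, a single run of either new platform yields tens of megabases of raw sequence, versus well under 0.1 Mb for a 96-sample Sanger run.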

These studies focused on resequencing relatively small bacterial genomes. However, the methods of library construction and sequencing are general, so they are relevant to human genetics. Human genomic regions containing putative disease-causing alleles are typically identified through family-based linkage or case–control whole-genome association studies, and these regions are often roughly the size of bacterial genomes. In the near term, given approaches that enable specific isolation of DNA from localized regions of the human genome, such as the region under a 5 Mb linkage peak, single laboratories could use these technologies to sequence such regions efficiently and accurately (see the expected number of errors in Table 1). Since variation detection in human genetics is often rate limiting, these advances have great potential to speed the identification of disease-causing variants.
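For a rough sense of what the error rates in Table 1 imply for such a project, the sketch below multiplies a 5 Mb candidate region (the hypothetical linkage-peak example above) by the two per-base error rates quoted earlier; the resulting counts are illustrative expectations, not figures from either paper.

```python
region_size = 5_000_000  # bp under a hypothetical 5 Mb linkage peak

# per-base error rates quoted earlier (roughly Phred 55 and Phred 65)
error_rates = {"Phred 55 (3.0e-6)": 3.0e-6, "Phred 65 (~3.0e-7)": 1 / 3.3e6}

for label, rate in error_rates.items():
    print(f"{label}: ~{region_size * rate:.1f} expected erroneous bases in 5 Mb")
```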

Meeting the longer-term goal of a $1000 genome will require further improvements in scaling these or other technologies whose cost per high-quality base is far lower than that of traditional sequencing. The notion that thousands of laboratories could generate genomic sequence at rates meeting or exceeding those of a conventional sequencing center will itself surely cause a revolution. For example, many traditional software and statistical packages used to map human disease traits already struggle with genomic data sets, and will face similar problems with large genome-sequencing data sets. Developing efficient algorithms and computing infrastructure that can meet the challenges of handling, storing, exploring and analyzing such enormous data sets will prove formidable. Clearly, even after the completion of the Human Genome Project, the genomics revolution continues to advance rapidly, changing the perception and practice of human genetics and the potential role of genomics technologies in medical practice.