The seminal importance of DNA sequencing to the life sciences, biotechnology and medicine has driven the search for more scalable and lower-cost solutions. Here we describe a DNA sequencing technology in which scalable, low-cost semiconductor manufacturing techniques are used to make an integrated circuit able to directly perform non-optical DNA sequencing of genomes. Sequence data are obtained by directly sensing the ions produced by template-directed DNA polymerase synthesis using all-natural nucleotides on this massively parallel semiconductor-sensing device or ion chip. The ion chip contains ion-sensitive, field-effect transistor-based sensors in perfect register with 1.2 million wells, which provide confinement and allow parallel, simultaneous detection of independent sequencing reactions. Use of the most widely used technology for constructing integrated circuits, the complementary metal-oxide semiconductor (CMOS) process, allows for low-cost, large-scale production and scaling of the device to higher densities and larger array sizes. We show the performance of the system by sequencing three bacterial genomes, its robustness and scalability by producing ion chips with up to 10 times as many sensors and sequencing a human genome.
DNA sequencing and, more recently, massively parallel DNA sequencing1,2,3,4 have had a profound impact on research and medicine. The reductions in cost and time for generating DNA sequence have resulted in a range of new sequencing applications in cancer5,6, human genetics7, infectious diseases8 and the study of personal genomes9,10,11, as well as in fields as diverse as ecology12,13 and the study of ancient DNA14,15. Although de novo sequencing costs have dropped substantially, there is a desire to continue to drop the cost of sequencing at an exponential rate consistent with the semiconductor industry’s Moore’s Law16 as well as to provide lower cost, faster and more portable devices. This has been operationalized by the desire to reach the $1,000 genome17.
To date, DNA sequencing has been limited by its requirement for imaging technology, electromagnetic intermediates (either X-rays18, or light19) and specialized nucleotides or other reagents20. To overcome these limitations and further democratize the practice of sequencing, a paradigm shift based on non-optical sequencing on newly developed integrated circuits was pursued. Owing to their scalability and low power requirements, CMOS processes are dominant in modern integrated circuit manufacturing21. The ubiquitous nature of computers, digital cameras and mobile phones has been made possible by the low-cost production of integrated circuits in CMOS.
Leveraging advances in the imaging field—which has produced large, fast arrays for photonic imaging22—we sought a suitable electronic sensor for the construction of an integrated circuit to detect the hydrogen ions that would be released by DNA polymerase23 during sequencing by synthesis, as opposed to a sensor designed for the detection of photons. Although a variety of electrochemical detection methods have been studied24,25, the ion-sensitive field-effect transistor (ISFET)26,27 was most applicable to our chemistry and scaling requirements because of its sensitivity to hydrogen ions, and its compatibility with CMOS processes28,29,30,31. Previous attempts to detect both single-nucleotide polymorphisms (SNPs)32 and DNA synthesis33 as well as sequence DNA electronically34 have been made. However, none of them produced de novo DNA sequence, addressed the issue of delivering template DNA to the sensors, or scaled to large arrays. In addition, previous efforts in ISFETs were limited in the number of sensors per array, the yield of working independent sensors and readout speed35,36, and encountered difficulty in exposing the sensors to fluids while protecting the electronics37.
Here, we overcome previous limitations with electronic detection and enable the production of chips with a large number of fast, uniform, working sensors. Our focus has been on the development of these ion chips, as well as the biochemical methods, supporting instrumentation and software needed to enable de novo DNA sequencing for applications requiring millions to billions of bases (Supplementary Fig. 1). A typical 2-h run using an ion chip with 1.2 M sensors generates approximately 25 million bases. The performance of the ion chips and overall sequencing platform is demonstrated through whole-genome sequencing of three bacterial genomes. The scalability of our chip architecture is demonstrated by producing chips with up to 10 times the number of sensors and producing a low-coverage sequence of the genome of Gordon Moore, author of Moore’s law16.
A CMOS integrated circuit for sequencing
We have developed a simple, scalable ISFET sensor architecture using electronic addressing common in modern CMOS imagers (Supplementary Fig. 2). Our integrated circuit consists of a large array of sensor elements, each with a single floating gate connected to an underlying ISFET (Fig. 1a). For sequence confinement we rely on a 3.5-μm-diameter well formed by adding a 3-μm-thick dielectric layer over the electronics and etching to the sensor plate (Fig. 1b). A tantalum oxide layer provides proton sensitivity (58 mV pH−1; ref. 38). High-speed addressing and readout are accomplished by the semiconductor electronics integrated with the sensor array (Fig. 1c). The sensor and underlying electronics provide a direct transduction from the incorporation event to an electronic signal. Unlike light-based sequencing technology, we do not use the elements of the array to collect photons and form a larger image to detect the incorporation of a base; instead we use each sensor to independently and directly monitor the hydrogen ions released during nucleotide incorporation.
Ion chips are manufactured on wafers (Fig. 2a), cut into individual die (Fig. 2b) and robotically packaged with a disposable polycarbonate flow cell that isolates the fluids to regions above the sensor array and away from the supporting electronics to provide convenient sample loading as well as electrical and fluidic interfaces to the sequencing instrument (Fig. 2c). Chips were designed and fabricated with 1.5 M, 7.2 M and 13 M ISFETs (Supplementary Fig. 3). On the basis of the placement of the flow cell on the sensor array, 1.2 M, 6.1 M and 11 M wells and sensors are exposed to fluids, with 99.9% of the sensors sensitive to pH and usable for DNA sequencing (Supplementary Fig. 4). Increasing the numbers of sensors per chip was first achieved by increasing the die area, from 10.6 mm × 10.9 mm to 17.5 mm × 17.5 mm, and then by increasing the density of the sensors by reducing the number of transistors per sensor from three to two. Chip density is limited by the selection of the CMOS node and the number of transistors per sensing element. Using a 0.35 μm CMOS node the minimum spacing for a three-transistor sensor is 5.1 μm and for a two-transistor sensor it is 3.8 μm (Supplementary Fig. 5). To understand further the limits on density, we show that 1.3 μm wells are readily manufactured, can be aligned to sensors, enable the generation of high-quality sequence (Supplementary Fig. 6) and can, using a 110 nm node, be fabricated with a spacing as small as 1.68 μm (Supplementary Fig. 7).
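The relationship between sensor spacing and sensor count quoted above can be checked with a short calculation. The following is a minimal sketch, assuming an idealized square grid in which the number of sensors per unit area scales as the inverse square of the spacing, and assuming the sensor array area is unchanged between the 7.2 M and 13 M designs; the function names are illustrative only.

```python
# Sketch: on an idealized square grid, sensor density scales as 1 / pitch^2.
# Pitch values are those quoted in the text; the grid model is an assumption.

def density_gain(old_pitch_um: float, new_pitch_um: float) -> float:
    """Factor by which sensors per unit area grow when the pitch shrinks."""
    return (old_pitch_um / new_pitch_um) ** 2


def sensors_per_mm2(pitch_um: float) -> float:
    """Sensors per square millimetre on an idealized square grid."""
    return (1000.0 / pitch_um) ** 2


# Three-transistor (5.1 um) to two-transistor (3.8 um) sensors at the 0.35 um node.
gain_3t_to_2t = density_gain(5.1, 3.8)

# Applying that gain to the 7.2 M sensor chip, assuming the same array area.
scaled_sensor_count = 7.2e6 * gain_3t_to_2t

# Three-transistor 0.35 um design to the 1.68 um spacing at the 110 nm node.
gain_to_110nm = density_gain(5.1, 1.68)

print(f"3T -> 2T density gain: {gain_3t_to_2t:.2f}x")
print(f"7.2 M sensors scaled:  {scaled_sensor_count / 1e6:.1f} M")
print(f"5.1 um -> 1.68 um:     {gain_to_110nm:.1f}x")
```

Under these assumptions the move from three to two transistors gives a density gain of about 1.8, which takes 7.2 M sensors to roughly 13 M, in line with the chip sizes reported above.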
Sequencing on a semiconductor device
The all-electronic detection system used by the ion chip simplifies and greatly reduces the cost of the sequencing instrument (Supplementary Fig. 8). The instrument has no optical components, and consists primarily of an electronic reader board to interface with the chip, a microprocessor for signal processing, and a fluidics system to control the flow of reagents over the chip (Supplementary Fig. 9).
Genomic DNA is prepared for sequencing as described in Supplementary Methods. Briefly, DNA is fragmented, ligated to adapters, and adapter-ligated libraries are clonally amplified onto beads. Template-bearing beads are enriched through a magnetic-bead-based process. Sequencing primers and DNA polymerase are then bound to the templates, and the template-bearing beads are pipetted into the chip’s loading port. Individual beads are loaded into individual sensor wells by spinning the chip in a desktop centrifuge. A 2 μm acrylamide bead was chosen to deliver sufficient copies of the template to the sensor well to achieve a high signal-to-noise ratio (SNR) (800 K copies, SNR of 10; Supplementary Methods and Supplementary Fig. 10), while well depth was selected to allow only a single bead to occupy a well.
In ion sequencing, all four nucleotides are provided in a stepwise fashion during an automated run (Supplementary Methods). When the nucleotide in the flow is complementary to the template base directly downstream of the sequencing primer, the nucleotide is incorporated into the nascent strand by the bound polymerase. This increases the length of the sequencing primer by one base (or more, if a homopolymer stretch is directly downstream of the primer) and results in the hydrolysis of the incoming nucleotide triphosphate, which causes the net liberation of a single proton for each nucleotide incorporated during that flow. The release of the proton produces a shift in the pH of the surrounding solution proportional to the number of nucleotides incorporated in the flow (0.02 pH units per single base incorporation). This is detected by the sensor on the bottom of each well, converted to a voltage and digitized by off-chip electronics (Fig. 3). Signal generation and detection occur over 4 s (Fig. 3b). After the flow of each nucleotide, a wash is used to ensure nucleotides do not remain in the well. The small size of the wells allows diffusion into and out of the well on the order of one-tenth of a second and eliminates the need for enzymatic removal of reagents1.
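As a rough illustration of the signal size involved, the pH shift per incorporation can be combined with the 58 mV per pH unit sensitivity of the tantalum oxide layer quoted earlier. This is a minimal sketch that assumes the response stays linear in the number of bases incorporated and that the full pH shift reaches the sensor without attenuation; both are simplifications, and the function name is illustrative.

```python
# Sketch: expected magnitude of the sensor voltage change per nucleotide flow.
# Both constants are taken from the text; linearity is an assumption.
SENSITIVITY_MV_PER_PH = 58.0  # proton sensitivity of the tantalum oxide layer
PH_SHIFT_PER_BASE = 0.02      # pH units per single-base incorporation


def expected_signal_mv(homopolymer_length: int) -> float:
    """Magnitude of the voltage change for a flow incorporating n bases."""
    return homopolymer_length * PH_SHIFT_PER_BASE * SENSITIVITY_MV_PER_PH


for n in range(4):
    print(f"{n} base(s) incorporated: {expected_signal_mv(n):.2f} mV")
```

Under these assumptions a single-base incorporation corresponds to a change of roughly 1.2 mV, which gives a sense of why a high signal-to-noise ratio per well matters.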
Signal processing and base calling
To change raw voltages into base calls, signal-processing software converts the raw data into measurements of incorporation in each well for each successive nucleotide flow using a physical model. Sampling the signal at high frequency relative to the time of the incorporation signal allows signal averaging to improve the SNR. The physical model takes into consideration diffusion rates, buffering effects and polymerase rates (Supplementary Fig. 11). The model is applied and fit to the raw trace from each well and the incorporation signals are extracted. A base caller corrects the signals for phase and signal loss, normalizes to the key, and generates corrected base calls for each flow in each well to produce the sequencing reads (Fig. 3c and Supplementary Fig. 12).
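The final steps of this pipeline, normalizing to the key and turning per-flow signals into bases, can be sketched as follows. This is a simplified illustration only: the flow order `TACG`, the key flows chosen and the raw values are hypothetical, and the phase and signal-loss corrections described above are omitted entirely.

```python
# Sketch: key normalization and naive base calling from per-flow signals.
# FLOW_ORDER, the key flow indices and the raw values are hypothetical.
FLOW_ORDER = "TACG"  # assumed cyclic order in which nucleotides are flowed


def normalize_to_key(signals, one_mer_key_flows):
    """Scale signals so that key flows known to hold one base average 1.0."""
    scale = sum(signals[i] for i in one_mer_key_flows) / len(one_mer_key_flows)
    return [s / scale for s in signals]


def call_bases(normalized, flow_order=FLOW_ORDER):
    """Round each normalized flow value to a homopolymer length and emit bases."""
    read = []
    for flow_index, value in enumerate(normalized):
        n_bases = max(0, round(value))
        read.append(flow_order[flow_index % len(flow_order)] * n_bases)
    return "".join(read)


# Hypothetical incorporation signals for eight flows (arbitrary units).
raw = [102.0, 3.0, 98.0, 1.0, 4.0, 100.0, 2.0, 205.0]

# Flows 0, 2 and 5 are assumed to be single-base key incorporations.
normalized = normalize_to_key(raw, one_mer_key_flows=[0, 2, 5])
read = call_bases(normalized)
print(read)
```

Rounding to the nearest integer is the simplest possible caller; it is used here only to show how a value near 2.0 in one flow becomes a two-base homopolymer.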
Next, each read is sequentially passed through two signal-based filters to exclude low-accuracy reads. The first filter measures the fraction of flows in which an incorporation event was measured. When this value is unusually large (greater than 60% of the first 60 flows) the read is not clonal. The second filter measures the extent to which the observed signal values match those predicted by the phasing model. When there is poor agreement (median absolute difference more than 0.06 over the first 60 flows) between the two, it corresponds to higher error rates. Lastly, per-base quality values are predicted using an adaptation of the Phred method39 that quantifies the concordance between the phasing model predictions and the observed signal. These ab initio scores track closely with post-alignment derived quality scores, and are used to trim back low-quality sequence from the 3′ end of a read (Supplementary Fig. 13).
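The two read filters and the quality-score scale can be sketched in a few lines. The thresholds (60%, 0.06, first 60 flows) come from the text; the 0.5 cut-off used to decide whether a flow counts as an incorporation, the reading of "median absolute difference" as the median of per-flow absolute differences, and the function names are assumptions made for illustration. The Phred relation Q = -10 log10(p) is the standard definition, not the adapted predictor described above.

```python
# Sketch: the two signal-based read filters and the Phred quality scale.
# Assumes non-empty signal lists; call_threshold is a hypothetical parameter.
import math
from statistics import median

N_FLOWS = 60                       # filters use the first 60 flows
MAX_INCORPORATION_FRACTION = 0.60  # above this the read is treated as not clonal
MAX_MEDIAN_ABS_DIFFERENCE = 0.06   # above this the phasing-model fit is poor


def passes_clonality_filter(normalized_signals, call_threshold=0.5):
    """Keep reads whose fraction of incorporating flows is not unusually large."""
    first = normalized_signals[:N_FLOWS]
    fraction = sum(1 for s in first if s >= call_threshold) / len(first)
    return fraction <= MAX_INCORPORATION_FRACTION


def passes_fit_filter(observed, predicted):
    """Keep reads whose observed signals agree with the phasing-model prediction."""
    diffs = [abs(o - p) for o, p in zip(observed[:N_FLOWS], predicted[:N_FLOWS])]
    return median(diffs) <= MAX_MEDIAN_ABS_DIFFERENCE


def phred_quality(error_probability):
    """Standard Phred score for a given per-base error probability."""
    return -10.0 * math.log10(error_probability)


clonal_read = [1.0, 0.0, 0.0, 1.0] * 15  # incorporation in 50% of flows
mixed_read = [1.0, 1.0, 0.0, 1.0] * 15   # incorporation in 75% of flows
print(passes_clonality_filter(clonal_read), passes_clonality_filter(mixed_read))
```

Applying the filters in sequence, as the text describes, simply means a read must pass the first check before the second is evaluated.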
Sequencing bacterial genomes
Bacterial genome sequencing and signal processing were performed as described earlier. We succeeded in sequencing all three genomes fivefold to tenfold in individual runs using the small ion chip, covering 96.80% to 99.99% of each genome, with genome-wide consensus accuracies as high as 99.99% (Table 1 and Supplementary Fig. 14). Escherichia coli sequencing with three successively larger ion chips produced 46 to over 270 megabases of sequence (Table 1).
To characterize run quality, we aligned each read to the corresponding reference genome (Supplementary Fig. 15). The per-base accuracy was observed to be 99.569% ± 0.001% within the first 50 bases and 98.897% ± 0.001% within the first 100 bases (Supplementary Fig. 16a). This accuracy is similar to that of light-based methods using modified nucleotides at 50 bases and higher at 100 bases (1.1% versus 5% error40). The per-base accuracy in calling a homopolymer of length 5 is 97.328% ± 0.023% (Supplementary Fig. 16b), which is higher than that of pyrosequencing-based sequencing methods1,41. For each genome, the observed distribution of per-base coverage matches closely with the theoretical Poisson distribution, reflecting the uniform nature of the coverage (Supplementary Fig. 17). The distribution of coverage was also relatively unbiased across GC content (Supplementary Fig. 18).
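The Poisson comparison can be made concrete with a short calculation. This is a minimal sketch of the idealized model, assuming reads land uniformly at random so that per-base depth follows a Poisson distribution with mean equal to the mean coverage; it illustrates the theoretical expectation and does not reproduce the analysis behind the supplementary figures.

```python
# Sketch: idealized Poisson model of per-base sequencing depth.
import math


def poisson_pmf(depth: int, mean_depth: float) -> float:
    """Probability that a base is covered exactly `depth` times."""
    return math.exp(-mean_depth) * mean_depth ** depth / math.factorial(depth)


def expected_fraction_covered(mean_depth: float) -> float:
    """Probability that a base is covered at least once."""
    return 1.0 - poisson_pmf(0, mean_depth)


for mean_depth in (5.0, 10.0):
    covered = expected_fraction_covered(mean_depth)
    print(f"{mean_depth:.0f}-fold mean coverage: {covered:.4%} of bases covered")
```

Under this model, fivefold to tenfold mean coverage would leave less than 1% of bases uncovered, so coverage figures well below that ideal point to factors other than sampling depth.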
Ion sequencing technology has allowed the routine acquisition of 100-base read lengths, and perfect read lengths exceeding 200 bases (Supplementary Fig. 19). At present, 20–40% of the sensors in a given run yield mappable reads. The gap between the number of sensors on a chip and the number yielding sequence is primarily the result of incomplete loading of the chip, poor amplification of a fragment onto the bead, and lack of clonality of the template. With continued improvements in loading and template preparation, along with improvements in signal processing and base calling, it is expected that the percentage of sensors yielding reads, the average read length and read accuracy will all improve significantly, as they have for other sequencing technologies1,2,3,4,9,10,11.
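These figures can be combined into a back-of-the-envelope throughput estimate. The sketch below assumes a mean mappable read length of 100 bases, which is a simplification of the read-length distribution, and uses the 1.2 M sensor chip; the function name is illustrative.

```python
# Sketch: rough per-run throughput from sensor count, yield and read length.
# The 100-base mean read length is an assumption made for illustration.

def expected_bases_per_run(sensors: float,
                           fraction_yielding_reads: float,
                           mean_read_length: float) -> float:
    """Approximate number of mappable bases produced in one run."""
    return sensors * fraction_yielding_reads * mean_read_length


low = expected_bases_per_run(1.2e6, 0.20, 100)
high = expected_bases_per_run(1.2e6, 0.40, 100)
print(f"Estimated output: {low / 1e6:.0f} to {high / 1e6:.0f} million bases per run")
```

The lower end of this range, about 24 million bases, is close to the approximately 25 million bases quoted earlier for a typical run on the 1.2 M sensor chip.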
‘Post-light’ sequencing of G. Moore
To illustrate the scalability of semiconductor sequencing we produced whole-genome sequence data from an individual, G. Moore42 (Fig. 4). Written consent was provided by G. Moore to sequence and publish his genome and resulting findings. Reads from his genome were deposited in the European Nucleotide Archive's Sequence Read Archive (SRA) under accession number ERP000682. The mean coverage of the G. Moore genome was 10.6-fold (Table 1). The degree to which the observed distribution of reads conforms to a Poisson distribution is indicative of a general lack of bias in coverage depth (Fig. 4b).
We found 2,598,983 SNPs in the G. Moore genome, of which 3.08% were found to be novel, consistent with previous reports4,9,11 (Supplementary Methods). To confirm the accuracy of our analysis, we also sequenced the G. Moore genome using ABI SOLiD Sequencing43 to 15-fold coverage and validated 99.95% of the heterozygous and 99.97% of the homozygous genotypes (Supplementary Tables 1 and 2).
We used the Online Mendelian Inheritance in Man database44 and the 23andMe functional SNP collection (https://www.23andme.com) to identify a subset of validated SNPs known to be involved in human disease and interesting phenotypes (Supplementary Table 3). We also examined the G. Moore sequence for the 7,693 deletions and inversions discovered by the 1000 Genomes Consortium and computationally found 3,413 of them in the G. Moore genome at a 99.94% positive predictive value (Supplementary Methods, Supplementary Table 4 and Supplementary Fig. 20). To determine G. Moore’s maternal ancestry, reads were also mapped to human mitochondrial DNA45 for a mean coverage of 732-fold. G. Moore’s mitochondria belong to haplogroup H, the most common in Europe46.
We have demonstrated the ability to produce and use a disposable integrated circuit fabricated in standard CMOS foundries to perform, for the first time, ‘post-light’ genome sequencing of bacterial and human genomes. With fifty billion dollars spent per year on CMOS semiconductor fabrication and packaging technologies, our goal was to leverage that investment to make a highly scalable sequencing technology. Using the G. Moore genome we demonstrated the feasibility of sequencing a human genome. The G. Moore genome sequence required on the order of a thousand individual ion chips comprising about one billion sensors. By demonstrating the ability to make larger and denser arrays, use fewer transistors per sensor, and sequence from wells as small as 1.3 μm, our work suggests that readily available CMOS nodes should enable the production of one-billion-sensor ion chips and low-cost routine human genome sequencing.
Sequences for Homo sapiens, Escherichia coli, Vibrio fischeri and Rhodopseudomonas palustris were deposited in the European Nucleotide Archive's Sequence Read Archive (SRA) under accession numbers ERP000682, ERP000541, ERP000542 and ERP000543, respectively.
We want to thank G. Moore for his willingness to participate in this study. We thank G. Fergus, M. Jain, J. Kole, L. Stevens and the ION team for supporting our efforts, and H. Peckman, V. Tadigotla, D. Holloway and S. Mclaughlin for help on the variant analysis, and M. Ross of the Broad Institute for help on quality scores. This research was supported, in part, by a grant from the National Human Genome Research Institute (NHGRI), RFA-HG-08-008, Revolutionary Genome Sequencing Technologies—The $1000 Genome. Grant number: R01 HG005094.
This table shows the human structural variations with associated annotations.