Single-molecule sequencing of an individual human genome

Journal name: Nature Biotechnology
Volume: 27
Pages: 847–850
DOI: 10.1038/nbt.1561

Recent advances in high-throughput DNA sequencing technologies have enabled order-of-magnitude improvements in both cost and throughput. Here we report the use of single-molecule methods to sequence an individual human genome. We aligned billions of 24- to 70-bp reads (32 bp average) to ~90% of the National Center for Biotechnology Information (NCBI) reference genome, with 28× average coverage. Our results were obtained on one sequencing instrument by a single operator with four data collection runs. Single-molecule sequencing enabled analysis of human genomic information without the need for cloning, amplification or ligation. We determined ~2.8 million single nucleotide polymorphisms (SNPs) with a false-positive rate of less than 1% as validated by Sanger sequencing and 99.8% concordance with SNP genotyping arrays. We identified 752 regions of copy number variation by analyzing coverage depth alone and validated 27 of these using digital PCR. This milestone should allow widespread application of genome sequencing to many aspects of genetics and human health, including personal genomics.
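As a rough illustration of what 28× average coverage implies, the sketch below evaluates an idealized Poisson model of coverage depth, in which the number of reads covering each base is Poisson-distributed around the mean. The function names and the depth cutoff of 10 are illustrative assumptions, not values from the paper, and observed coverage will generally deviate from this ideal.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability that a base is covered by exactly k reads under a Poisson model."""
    # Work in log space so that large k and mean do not overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def fraction_below(depth: int, mean: float) -> float:
    """Fraction of bases expected to have coverage strictly below `depth`."""
    return sum(poisson_pmf(k, mean) for k in range(depth))


MEAN_COVERAGE = 28.0  # average coverage reported in the abstract
MIN_DEPTH = 10        # hypothetical cutoff, chosen only for illustration

if __name__ == "__main__":
    print(f"P(depth = 0)  = {poisson_pmf(0, MEAN_COVERAGE):.3e}")
    print(f"P(depth < {MIN_DEPTH}) = {fraction_below(MIN_DEPTH, MEAN_COVERAGE):.3e}")
```

Under this model the chance of a base receiving no reads at all is exp(-28), which is why gaps in real data are better explained by alignment and sequence-context effects than by sampling alone.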

Figures

  1. Figure 1: P0 genome sequencing metrics.

    (a) Read length distributions for raw reads (blue) and uniquely aligned reads (red) from Helicos single-molecule sequencing of the genome of Patient Zero (P0). Filtered reads tend to be shorter because a larger proportion of the long reads are instrument artifacts related to the base addition order. (b) Coverage depth for sequence data of the P0 genome, computed over repeat-masked regions (ENSEMBL, blue) compared to the theoretical Poisson limit (red). (c) Error rate as a function of sequence coverage depth. Error rates are determined by comparison with an independent measurement of SNPs using the Illumina Human610-Quad SNP BeadArray (see Online Methods for details). Above 30× coverage, sampling noise from the limited number of BeadArray results begins to dominate the error rate, and error rate measurements are not accurate. (d) Quality score (QS) tradeoffs between sensitivity and accuracy. High sensitivity is obtained by using a QS threshold of 0, which results in calls for all comparison BeadArray locations, with an accuracy of 98.3%. Raising the QS threshold to 1 results in 97% of comparison BeadArray locations being called, thereby lowering the sensitivity but increasing the accuracy of those calls to 99.2%. Numbers next to each data point indicate accuracy (percentages) and cutoff score (in brackets).

  2. Figure 2: SNP discovery in P0.

    (a) SNP distribution in the P0 genome as a function of quality score. Putative SNPs are 'validated' or 'nonvalidated' if they are annotated as such in dbSNP. Putative SNPs not found in dbSNP are 'novel'. SNPs with larger quality scores are called with higher confidence. A substantial decrease in the proportion of validated SNPs is seen as the quality score drops below 2.8, suggesting that 2.8 is a reasonable threshold for identifying high-quality SNPs. (b) Distribution of high-quality SNP calls (quality score >2.8) for the P0 human genome. Validated, nonvalidated and novel SNPs are defined as in a. (c) Overlap in SNP locations between the genomes of P0, James Watson and Craig Venter (in thousands). In this panel the quality-score cutoff was moved to the second plateau in a (QS >1.9), increasing the sensitivity and resulting in a total of 3,263,470 SNPs in the P0 genome. The increase comes from a further 389,736 novel SNPs, 18,495 nonvalidated SNPs and 49,768 validated SNPs. The ratio of validated to novel SNPs can be used to estimate that this improvement in sensitivity comes at the cost of an increased overall false-positive rate (from 1% to 10%). Even with this less restrictive cutoff, the SNP proportions shared with Venter and Watson remain consistent.

  3. Figure 3: Copy number variation in the P0 human genome.

    Blue, signal from simulated dataset (simulated reads per 1 kb bin). Magenta, CNV estimate. Green, raw signal (actual reads mapped per 1 kb bin). (a) Heterozygous deletion. (b) Homozygous deletion. (c) Homozygous duplication. (d) Heterozygous deletion.
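
The comparison in Figure 3 between actual and simulated reads per 1 kb bin can be sketched roughly as follows. This is a minimal illustration, not the authors' algorithm: it assumes a diploid baseline of two copies, treats read start positions as plain integers, and uses hypothetical names (`bin_counts`, `copy_number_estimate`).

```python
from collections import Counter
from typing import Dict, Iterable

BIN_SIZE = 1000  # 1 kb bins, as in Figure 3


def bin_counts(read_starts: Iterable[int], bin_size: int = BIN_SIZE) -> Dict[int, int]:
    """Count reads per bin, keyed by bin index (position // bin_size)."""
    return dict(Counter(pos // bin_size for pos in read_starts))


def copy_number_estimate(
    observed: Dict[int, int],
    simulated: Dict[int, int],
    ploidy: int = 2,  # assumed diploid baseline
) -> Dict[int, float]:
    """Scale the observed/simulated read ratio by the assumed baseline ploidy.

    Bins with no simulated reads are skipped, since the ratio is undefined there.
    """
    return {
        b: ploidy * observed.get(b, 0) / expected
        for b, expected in simulated.items()
        if expected > 0
    }


if __name__ == "__main__":
    # Toy data: bin 0 at expected depth, bin 1 at half depth, bin 2 empty.
    simulated = {0: 4, 1: 4, 2: 4}
    observed = bin_counts([10, 200, 450, 900, 1100, 1800])
    print(copy_number_estimate(observed, simulated))
```

In this toy input the three bins come out at 2.0, 1.0 and 0.0 copies. Under this simple model, those values would correspond to a normal region, a heterozygous deletion and a homozygous deletion respectively.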

Author information

  1. These authors contributed equally to this work.

    • Dmitry Pushkarev &
    • Norma F Neff

Affiliations

  1. Department of Bioengineering, Stanford University and Howard Hughes Medical Institute, Stanford, California, USA.

    • Dmitry Pushkarev,
    • Norma F Neff &
    • Stephen R Quake

Contributions

N.F.N. prepared the libraries, performed the sequencing and wrote the manuscript. D.P. developed the data analysis algorithms, performed the computations and wrote the manuscript. S.R.Q. designed the research and wrote the manuscript.

Competing financial interests

D.P. owns shares of Helicos. S.R.Q. is a founder, shareholder and consultant for Helicos and Fluidigm.

Corresponding author

Correspondence to:

Supplementary information

PDF files

  1. Supplementary Text and Figures (128 KB)

    Supplementary Figures 1–5 and Supplementary Tables 1–3
