Improved data analysis for the MinION nanopore sequencer

Jain, Miten; Fiddes, Ian T; Miga, Karen H; Olsen, Hugh E; Paten, Benedict; Akeson, Mark

doi:10.1038/nmeth.3290

Article
Published: 16 February 2015

Nature Methods volume 12, pages 351–356 (2015)

Abstract

Speed, single-base sensitivity and long read lengths make nanopores a promising technology for high-throughput sequencing. We evaluated and optimized the performance of the MinION nanopore sequencer using M13 genomic DNA and used expectation maximization to obtain robust maximum-likelihood estimates for insertion, deletion and substitution error rates (4.9%, 7.8% and 5.1%, respectively). Over 99% of high-quality 2D MinION reads mapped to the reference at a mean identity of 85%. We present a single-nucleotide-variant detection tool that uses maximum-likelihood parameter estimates and marginalization over many possible read alignments to achieve precision and recall of up to 99%. By pairing our high-confidence alignment strategy with long MinION reads, we resolved the copy number for a cancer-testis gene family (CT47) within an unresolved region of human chromosome Xq24.

**Figure 1: Molecular events and ionic-current trace for a 2D read of an M13 phage dsDNA molecule.**

**Figure 2: Read-length distributions and identity plots for M13.**

**Figure 3: Maximum-likelihood alignment parameters derived using expectation maximization (EM).**

**Figure 5: Exploring SNV calling with MinION reads.**

**Figure 6: Resolution of CT47 repeat copy-number estimate on human chromosome Xq24.**

Accession codes

Primary accessions

European Nucleotide Archive

Acknowledgements

Research reported in this publication was supported by the National Human Genome Research Institute of the US National Institutes of Health under award numbers HG006321 (M.A.), HG007827 (M.A.) and U54HG007990 (B.P.). The authors thank Oxford Nanopore Technologies for their gift to the UCSC Nanopore Group. The authors also thank D. Deamer for support, reading of the manuscript and helpful discussion. The authors gratefully acknowledge D. Haussler and J. Kent for their support.

Author information

Authors and Affiliations

UC Santa Cruz Genomics Institute, Santa Cruz, California, USA
Miten Jain, Ian T Fiddes, Karen H Miga, Hugh E Olsen, Benedict Paten & Mark Akeson
Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
Miten Jain, Ian T Fiddes, Karen H Miga, Hugh E Olsen, Benedict Paten & Mark Akeson

Contributions

M.A. conceived experiments and directed research. B.P. conceived and directed bioinformatics analysis. B.P., M.J., I.T.F. and K.H.M. were responsible for bioinformatics analysis and software development. M.J. and H.E.O. were responsible for the completion of sequencing experiments and data processing. M.J. and H.E.O. were responsible for preparing DNA sequencing standards. H.E.O. was responsible for Sanger sequencing of M13 dsDNA. B.P. and I.T.F. were responsible for k-mer and BLAST analysis. B.P. and M.J. were responsible for SNV analysis. B.P. developed and implemented expectation-maximization and realignment strategies. K.H.M. conceived and directed BAC experiments and data analysis. All authors contributed to the writing, editing and completion of the manuscript.

Corresponding authors

Correspondence to Benedict Paten or Mark Akeson.

Ethics declarations

Competing interests

M.A. is a consultant to Oxford Nanopore Technologies.

Integrated supplementary information

Supplementary Figure 1 MinION technology progression.

Progression of read identity distributions with MinION versions since June 2014.

Supplementary Figure 2 Venn diagram representing read mappability for MinION reads across three replicate M13 experiments using R7.3 chemistry.

Mappability is the proportion of reads that can be aligned to either the M13 or the phage λ DNA control using the tuned parameters for each mapper. In our analysis, 2D reads had the highest mappability (99%), followed by complement reads (98%) and template reads (95%). Among the four aligners used, LAST and LASTZ performed best for M13, with LAST capturing the greatest proportion of mappable reads on its own.
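As a rough illustration of how a mappability figure of this kind could be tallied, the sketch below counts a read as mappable if at least one aligner places it, and also lists the reads that only one aligner recovers. The read IDs, the per-aligner hit sets, the placeholder names `aligner_c` and `aligner_d`, and both function names are invented for illustration; they are not the authors' pipeline or data.

```python
def mappability(all_reads, mapped_by_aligner):
    """Fraction of reads aligned by at least one aligner (union over aligners)."""
    all_reads = set(all_reads)
    if not all_reads:
        raise ValueError("no reads supplied")
    mapped_any = set().union(*mapped_by_aligner.values()) & all_reads
    return len(mapped_any) / len(all_reads)


def unique_to(aligner, mapped_by_aligner):
    """Reads mapped by `aligner` and by no other aligner."""
    others = set().union(
        *(hits for name, hits in mapped_by_aligner.items() if name != aligner)
    )
    return mapped_by_aligner[aligner] - others


# Hypothetical toy data: four reads, four aligners.
reads = ["r1", "r2", "r3", "r4"]
hits = {
    "LAST": {"r1", "r2", "r3"},
    "LASTZ": {"r1", "r2"},
    "aligner_c": {"r2"},
    "aligner_d": set(),
}

print(mappability(reads, hits))   # 0.75
print(unique_to("LAST", hits))    # {'r3'}
```

Each region of a Venn diagram such as the one described here corresponds to one such set expression over the per-aligner hit sets.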

Supplementary Figure 3 Venn diagram representing read mappability to a reversed reference for MinION reads from three replicate M13 experiments using R7.3 chemistry.

Because the reference was reversed, effectively no reads should map; this is thus a proxy measure of specificity. Results were obtained using the tuned alignment parameters.
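A reversed reference keeps the length and base composition of the original but should share no genuine homology with the reads, which is why hits against it can serve as a false-positive proxy. The sketch below shows one way to build such a decoy from FASTA text. It assumes plain reversal without complementing (the caption says only "reversed"), and the function name `reverse_fasta` is hypothetical.

```python
def reverse_fasta(text):
    """Reverse the sequence of every FASTA record, keeping headers unchanged.

    The sequence is read back to front and is NOT complemented.
    """
    records = []  # each entry is [header_line, [sequence chunks]]
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line, []])
        elif records:
            records[-1][1].append(line)
    return "".join(f"{h}\n{''.join(c)[::-1]}\n" for h, c in records)


# Toy record; the sequence is made up.
decoy = reverse_fasta(">toy_reference\nACGT\nTTA\n")
print(decoy)
```

Under this setup, the fraction of reads that still map to the decoy gives an estimate of how often an aligner reports a spurious alignment.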

Supplementary Figure 4 Structure for the hidden Markov model (HMM) used for expectation maximization (EM).

Structure of HMM used for EM, along with the estimated parameters for transition probabilities for template, complement and 2D reads. For each transition in order, the mean estimates and standard error across all experiments for that read type are shown.

Supplementary Figure 5 Convergence of log-likelihood ratios achieved using expectation maximization.

Convergence of the log-likelihood for three independent runs of expectation maximization, each started from a randomly parameterized model and run for 100 training iterations. The y-axis gives likelihood normalized by the highest log-likelihood found. The training used 2D reads from one MinION run of the M13 data using release R7.3 chemistry and a guide alignment generated by tuned LAST.

Supplementary Figure 6 Frequency plots for insertions and deletions in MinION read alignments.

Representative insertion and deletion plot for reads (fitted with an exponential distribution) from one M13 experiment using R7.3 chemistry, aligned using expectation maximization–trained LAST.
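The caption does not say how the exponential fit was obtained. One common approach, sketched here as an assumption rather than the authors' procedure, is an ordinary least-squares line through log(count) versus indel length, which yields the model count ≈ a · exp(−rate · length). The function name `fit_exponential` and the synthetic counts are invented for illustration.

```python
import math


def fit_exponential(lengths, counts):
    """Fit count = a * exp(-rate * length) by least squares on log(count).

    Bins with zero count are skipped because their logarithm is undefined.
    Returns (a, rate).
    """
    pts = [(x, math.log(c)) for x, c in zip(lengths, counts) if c > 0]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty length bins")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    if sxx == 0:
        raise ValueError("all lengths are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic indel-length histogram generated from a known rate of 0.7.
lengths = [1, 2, 3, 4, 5, 6]
counts = [1000 * math.exp(-0.7 * x) for x in lengths]
a, rate = fit_exponential(lengths, counts)
print(round(a, 3), round(rate, 3))
```

A log-linear fit weights rare long indels as heavily as common short ones; a maximum-likelihood fit on the raw lengths would be a reasonable alternative.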

Supplementary Figure 7 Substitution matrices from alignments using expectation maximization–trained model.

Maximum-likelihood estimates and standard-error parameters for substitution matrices show trends across template, complement and 2D reads across three M13 experiments using R7.3 chemistry. The top row illustrates the average maximum-likelihood estimates for these substitutions, with the standard error represented in the lower row. For all aligners, thymine-to-adenosine and adenosine-to-thymine substitution rates were low, indicating that the device rarely miscalled one as the other. The color scheme is fitted on a log scale, and the substitution values are on an absolute scale.
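The two rows of this figure (mean estimate and standard error across experiments) can be reproduced element by element from per-experiment matrices. The sketch below assumes the standard error of the mean with the sample standard deviation (n − 1 in the denominator); the caption does not state which estimator was used, and the function name and numbers are made up.

```python
import math


def mean_and_sem(matrices):
    """Element-wise mean and standard error of the mean across matrices.

    `matrices` is a list of equally sized nested lists, one per experiment.
    """
    n = len(matrices)
    if n < 2:
        raise ValueError("need at least two experiments for a standard error")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    mean = [
        [sum(m[i][j] for m in matrices) / n for j in range(cols)]
        for i in range(rows)
    ]
    sem = [
        [
            math.sqrt(
                sum((m[i][j] - mean[i][j]) ** 2 for m in matrices) / (n - 1)
            ) / math.sqrt(n)
            for j in range(cols)
        ]
        for i in range(rows)
    ]
    return mean, sem


# Hypothetical 2x2 excerpts of a substitution matrix from three experiments.
experiments = [
    [[0.90, 0.10], [0.20, 0.80]],
    [[0.92, 0.08], [0.20, 0.80]],
    [[0.94, 0.06], [0.20, 0.80]],
]
mean, sem = mean_and_sem(experiments)
print(mean[0][0], sem[0][0])
```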

Supplementary Figure 8 Substitution matrices from alignments using tuned parameters.

Substitution matrices for each of the four tuned aligners across three M13 experiments using R7.3 chemistry. For all aligners, thymine-to-adenosine and adenosine-to-thymine substitution rates were low, indicating that the device rarely miscalled one as the other. The color scheme is fitted on a log scale, and the substitution values are on an absolute scale.

Supplementary Figure 9 Realignment improves read identity.

Read identity for template, complement and 2D reads for three M13 replicate experiments using R7.3 chemistry, aligned using LAST. Three versions of the LAST alignment are shown: tuned LAST, trained LAST realignments and naive LAST realignments.

Supplementary Figure 10 An alignment quality measurement for 2D reads across three M13 replicate experiments.

Alignments were obtained using expectation maximization–trained LAST realignments. The two density clusters correspond to M13 and phage λ DNA control.

Supplementary Figure 11 Error profiles for 2D reads after realigning using expectation maximization–trained model.

Error profile analysis of 2D reads aligned using expectation maximization–trained LAST realignments indicates a moderate correlation between mismatches and indels per aligned base, and a weak correlation between insertions per aligned base and deletions per aligned base.

Supplementary Figure 12 The coverage and percentage of GC across the M13 genome.

(a–c) Coverage, smoothed by binning over a sliding 5-bp window, matching the k-mer length used in base calling. The GC content was calculated by binning over a 50-bp sliding window. Halving and doubling this window size did not drastically alter the result. (d–f) Coverage histograms across three M13 replicate experiments using R7.3 chemistry and aligned using expectation maximization–trained LAST realignments. About 2.1%, 2.0% and 2.6% of the M13 genome was underrepresented in template, complement and 2D reads, respectively.
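The two smoothing steps in this caption, a 5-bp sliding window for coverage and a 50-bp sliding window for GC content, can be sketched with a single running-mean helper. How window edges were handled is not stated, so this version simply emits one value per full window; the function names and toy inputs are illustrative only.

```python
def sliding_mean(values, window):
    """Mean of every full window of width `window`, stepping one position."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    total = sum(values[:window])
    out = [total / window]
    for i in range(window, len(values)):
        # Slide right by one: add the new value, drop the oldest.
        total += values[i] - values[i - window]
        out.append(total / window)
    return out


def gc_fraction(seq, window=50):
    """GC fraction in each full sliding window of the sequence."""
    is_gc = [1 if base in "GCgc" else 0 for base in seq]
    return sliding_mean(is_gc, window)


# Toy per-base coverage smoothed over 5 bp, and GC content over 4 bp.
print(sliding_mean([1, 2, 3, 4, 5, 6], 5))   # [3.0, 4.0]
print(gc_fraction("GGCCAATT", window=4))     # [1.0, 0.75, 0.5, 0.25, 0.0]
```

Rerunning `gc_fraction` with `window=25` and `window=100` mirrors the halving and doubling check mentioned in the caption.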

Supplementary Figure 13 MinION data for full-length (48-kb) λ phage dsDNA.

Data for a 2D read of a full-length λ phage dsDNA from the MinION. (a) Molecular events for translocation of a single 48-kb λ dsDNA molecule through the MinION nanopore sequencer. DNA length and conformation are simplified for purposes of illustration. (i) Open channel. (ii) dsDNA with ligated loading (blue and brown) and hairpin adaptors (red) captured by the nanopore with the aid of a membrane anchor and an applied voltage across the membrane. (iii) Translocation of the 5′ end of the loading adaptor through the nanopore under control of a molecular motor and driven by the applied potential across the membrane. DNA translocation through the nanopore starts. (iv) Translocation of the template strand of DNA (gold). (v) Translocation of the hairpin adaptor (red). (vi) Translocation of the complement strand (blue). (vii) Translocation of the 3′ portion of the loading adaptor. (viii) Return to open-channel nanopore. (b) Raw current trace for the entire passage of the DNA construct through the nanopore (approximately 2,789 s). Regions of the ionic current trace corresponding to steps i–viii are labeled. (c) Expanded 1-s time scale of raw current traces for DNA capture and translocation of 5′ loading adaptors (i–iii), template strand (iv), hairpin adaptor (v), complement strand (vi), 3′ loading adaptor and return to open channel (vii–viii). Each adaptor generates a unique signal used for position reference in base determination. The FASTA sequence is available at http://figshare.com/articles/UCSC_Full_Length_Lambda_2D_Read/1209636.

Supplementary Figure 14 Pulsed-field gel electrophoresis of RP11-482A22 BAC DNA to determine insert length.

The span of BAC end sequences relative to the GRCh38 reference assembly provides estimates of 57 kb to the right of the repeats and 76 kb to the left of the repeats (depicted in black). To determine the length of the repeats, we performed NotI and AatII digests on RP11-482 DNA. The NotI digest isolates the insert DNA in its entirety from the cloning vector insert, pBACe3.6, providing evidence for a cloned insert in the range of 170–175 kb (blue) and an 11.6-kb cloning vector band (red). After subtraction of the known flanking region sizes, this estimate provides a repeat region in the range of 36.7–41.7 kb, or 7.5 to 8.5 copies of the CT47 repeat. The AatII digest was expected to cut the BAC three times, as illustrated in the schematic, providing three resulting fragments: (a) 108 kb including the upstream flanking region (50 kb), the downstream flanking region (46 kb) and the cloning vector insert (11.6 kb), shown in purple; (b) a 23-kb region directly downstream from the repeat array (blue); and (c) a region observed by PFGE to be ≈50 kb that spans the CT47 repeat cluster (providing evidence for a 37-kb repeat region after subtraction of 12 kb of known flanking sequence, marked with gray shading). Regions providing evidence for repeat copy number are highlighted in yellow.
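The NotI arithmetic in this caption can be retraced with a few lines of code. The caption does not state the length of one CT47 repeat unit; the value of 4.9 kb used below is an assumption inferred from the caption's own pairing of 36.7–41.7 kb with 7.5–8.5 copies. Note also that the rounded flank sizes (57 kb and 76 kb) give 37–42 kb rather than 36.7–41.7 kb, presumably because the caption rounds the flanks. The function names are hypothetical.

```python
ASSUMED_UNIT_KB = 4.9  # assumption: inferred from 36.7 kb ~ 7.5 copies


def repeat_region_kb(insert_kb, flanks_kb):
    """Repeat-array length: cloned insert minus the known flanking regions."""
    return insert_kb - sum(flanks_kb)


def copy_number(region_kb, unit_kb=ASSUMED_UNIT_KB):
    """Number of repeat copies that fit in the repeat region."""
    return region_kb / unit_kb


flanks = (57, 76)  # kb to the right and left of the repeats, as rounded in the caption

for insert in (170, 175):
    region = repeat_region_kb(insert, flanks)
    print(insert, region, round(copy_number(region), 1))

# Using the repeat-region range quoted in the caption directly:
for region in (36.7, 41.7):
    print(region, round(copy_number(region), 1))
```

The same subtraction applies to the AatII fragment: roughly 50 kb minus 12 kb of flanking sequence leaves a repeat region close to the 37 kb quoted in the caption.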

About this article

Cite this article

Jain, M., Fiddes, I., Miga, K. et al. Improved data analysis for the MinION nanopore sequencer. Nat Methods 12, 351–356 (2015). https://doi.org/10.1038/nmeth.3290

Received: 12 December 2014
Accepted: 20 January 2015
Published: 16 February 2015
Issue Date: April 2015
DOI: https://doi.org/10.1038/nmeth.3290
