Article

Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome

Nature Biotechnology volume 34, pages 64–69 (2016)



Identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequences remains a difficult problem. Here, we present an analysis of a human gut microbiome using TruSeq synthetic long reads combined with computational tools for metagenomic long-read assembly, variant calling and haplotyping (Nanoscope and Lens). Our analysis identifies 178 bacterial species, of which 51 were not found using shotgun reads alone. We recover bacterial contigs that comprise multiple operons, including 22 contigs of >1 Mbp. Furthermore, we observe extensive intraspecies variation within microbial strains in the form of haplotypes that span up to hundreds of Kbp. Incorporation of synthetic long-read sequencing technology with standard short-read approaches enables more precise and comprehensive analyses of metagenomic samples.



Primary accessions

Sequence Read Archive




This work was supported by US National Institutes of Health/National Human Genome Research Institute (NIH/NHGRI) grant T32 HG000044. V.K. was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) post-graduate fellowship. We thank Illumina, Inc. for their assistance in sample preparation.

Author information

Author notes

    • Serafim Batzoglou & Michael Snyder

    These authors contributed equally to this work.


  1. Department of Computer Science, Stanford University, Stanford, California, USA.

    • Volodymyr Kuleshov & Serafim Batzoglou

  2. Department of Genetics, Stanford University School of Medicine, Stanford, California, USA.

    • Volodymyr Kuleshov, Chao Jiang, Wenyu Zhou, Fereshteh Jahanbani & Michael Snyder




S.B. and M.S. conceived the study. W.Z. and F.J. performed library preparation. V.K. developed the Nanoscope pipeline and the Lens algorithm. V.K. and C.J. performed computational analyses. V.K., C.J., S.B. and M.S. wrote the paper. S.B. and M.S. supervised the study.

Competing interests

V.K. serves as a consultant for Illumina Inc. S.B. is a co-founder of DNAnexus and a member of the scientific advisory boards of 23andMe and Eve Biomedical. M.S. is a co-founder of Personalis and a member of the scientific advisory boards of Personalis, AxioMx and Genapsys.

Corresponding authors

Correspondence to Volodymyr Kuleshov or Serafim Batzoglou or Michael Snyder.

Integrated supplementary information

Supplementary figures

  1. Histogram of long read lengths for the mock metagenome

  2. Histogram of long read lengths for the real metagenome

  3. Fraction of genome covered with short and long reads, per organism, given an equal number of bases sequenced with each technology.

  4. Estimated abundance using short and long reads.

  5. Comparison of contig lengths obtained from short and long sequencing (real metagenome).

  6. Recovery of operons from the assemblies obtained from short reads, long reads, and from the joint assembly (mock metagenome).

  7. Recovery of genes from the assemblies obtained from short reads, long reads, and from the joint assembly (mock metagenome).

  8. Fragment of 110 kbp genomic region in which there is variation between several bacterial subspecies.

  9. Genomic region 50 kbp in length in which there is variation between several bacterial subspecies.

  10. Percentage of genomic regions where all haplotypes are in perfect phylogeny, as a function of the percentage of positions that have to be corrected to ensure phylogeny.

  11. Summary of the length and depth of genomic regions at which there is variation among bacteria.

  12. Recovery of a 2.3 Mbp long contig from a species belonging to the genus Acinetobacter for which no finished genome was previously available.

  13. Abundance estimates in the mock metagenome obtained from Nanoscope, compared to the abundances obtained from mapping short reads to the 20 known genome references.

  14. Genomic variation statistics for 10 gut microbial species selected from our gut metagenome sample (at least 40% genomes were covered by reads).

Supplementary information

PDF files

  1. Supplementary Text and Figures

    Supplementary Figures 1–14, Supplementary Tables 1–33 and Supplementary Methods

Tape archive files

  1. Supplementary Code
