Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome



Identifying bacterial strains in metagenome and microbiome samples using computational analyses of short-read sequences remains a difficult problem. Here, we present an analysis of a human gut microbiome using TruSeq synthetic long reads combined with computational tools for metagenomic long-read assembly, variant calling and haplotyping (Nanoscope and Lens). Our analysis identifies 178 bacterial species, of which 51 were not found using shotgun reads alone. We recover bacterial contigs that comprise multiple operons, including 22 contigs of >1 Mbp. Furthermore, we observe extensive intraspecies variation within microbial strains in the form of haplotypes that span up to hundreds of Kbp. Incorporation of synthetic long-read sequencing technology with standard short-read approaches enables more precise and comprehensive analyses of metagenomic samples.


Figure 1: The Nanoscope pipeline and the Lens algorithm.
Figure 2: Long reads aligned to assembled metagenomic contigs reveal extensive variation within bacterial strains.
Figure 3: Bacterial species identified only by long reads (blue), only by short reads (magenta), ordered by abundance.

Accession codes

Primary accessions

Sequence Read Archive




This work was supported by US National Institutes of Health/National Human Genome Research Institute (NIH/NHGRI) grant T32 HG000044. V.K. was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) post-graduate fellowship. We thank Illumina, Inc. for their assistance in sample preparation.

Author information




S.B. and M.S. conceived the study. W.Z. and F.J. performed library preparation. V.K. developed the Nanoscope pipeline and the Lens algorithm. V.K. and C.J. performed computational analyses. V.K., C.J., S.B. and M.S. wrote the paper. S.B. and M.S. supervised the study.

Corresponding authors

Correspondence to Volodymyr Kuleshov or Serafim Batzoglou or Michael Snyder.

Ethics declarations

Competing interests

V.K. serves as a consultant for Illumina Inc. S.B. is a co-founder of DNAnexus and a member of the scientific advisory boards of 23andMe and Eve Biomedical. M.S. is a co-founder of Personalis and a member of the scientific advisory boards of Personalis, AxioMx and Genapsys.

Integrated supplementary information

Supplementary Figure 1 Histogram of long read lengths for the mock metagenome

Supplementary Figure 2 Histogram of long read lengths for the real metagenome

Supplementary Figure 3 Fraction of genome covered with short and long reads, per organism, given an equal number of bases sequenced with each technology.

For several organisms, the percent coverage varies greatly between the two technologies, indicating different types of bias.

Supplementary Figure 4 Estimated abundance using short and long reads.

For several organisms, the estimated abundances vary significantly.

Supplementary Figure 5 Comparison of contig lengths obtained from short and long sequencing (real metagenome).

About twenty contigs obtained from long read sequencing are longer than 1 Mbp.

Supplementary Figure 6 Recovery of operons from the assemblies obtained from short reads, long reads, and from the joint assembly (mock metagenome).

Short reads were assembled using SOAPdenovo2 and long reads with Celera; the two assemblies were merged with Minimus2. The joint assembly recovers more than half of all operons, twice as many as short reads alone. Interestingly, long and short reads seem to recover different types of operons.

Supplementary Figure 7 Recovery of genes from the assemblies obtained from short reads, long reads, and from the joint assembly (mock metagenome).

Short reads were assembled using SOAPdenovo2 and long reads with Celera; the two assemblies were merged with Minimus2. The joint assembly recovers more than half of all genes, twice as many as short reads alone. Interestingly, long and short reads seem to recover different types of genes.

Supplementary Figure 8 Fragment of a 110 kbp genomic region in which there is variation between several bacterial subspecies.

The contig belongs to the bacterium Parabacteroides distasonis.

Supplementary Figure 9 Genomic region 50 kbp in length in which there is variation between several bacterial subspecies.

The contig belongs to the bacterium Odoribacter splanchnicus.

Supplementary Figure 10 Percentage of genomic regions where all haplotypes are in perfect phylogeny, as a function of the percentage of positions that have to be corrected to ensure phylogeny.

More than 85% of positions are in perfect phylogeny, and by correcting fewer than 5% of positions, we can increase this number to more than 92%.
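The notion of "perfect phylogeny" used in this caption can be illustrated with the classic four-gamete condition: for binary (two-allele) sites, a set of haplotypes fits a perfect phylogeny exactly when no pair of sites shows all four allele combinations (00, 01, 10, 11). The sketch below is only an illustration under that assumption; the function names are hypothetical, and this page does not state whether Lens applies this exact test.

```python
from itertools import combinations


def violates_four_gametes(site_a, site_b):
    """Return True if two binary sites together show all four allele pairs."""
    return len(set(zip(site_a, site_b))) == 4


def is_perfect_phylogeny(haplotypes):
    """Check the four-gamete condition on a set of haplotypes.

    haplotypes: equal-length strings over {'0', '1'}, one per haplotype,
    where each position is one variant site (assumed biallelic).
    """
    if not haplotypes:
        return True
    # Transpose so that each element is one site across all haplotypes.
    sites = list(zip(*haplotypes))
    return not any(
        violates_four_gametes(a, b) for a, b in combinations(sites, 2)
    )


if __name__ == "__main__":
    print(is_perfect_phylogeny(["00", "01", "11"]))        # compatible sites
    print(is_perfect_phylogeny(["00", "01", "10", "11"]))  # all four gametes
```

A region that fails the check can be made to pass by changing a small number of alleles, which corresponds to the "positions that have to be corrected" on the figure's horizontal axis.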

Supplementary Figure 11 Summary of the length and depth of genomic regions at which there is variation among bacteria.

Blue regions are in perfect phylogeny, and red regions are not.

Supplementary Figure 12 Recovery of a 2.3 Mbp long contig from a species belonging to the genus Acinetobacter for which no finished genome was previously available.

We mapped contigs from an earlier fragmented assembly (bottom) to a 2.3 Mbp contig that we assembled (top). Most of the long contig appears to be covered by shorter contigs from the fragmented assembly.

Supplementary Figure 13 Abundance estimates in the mock metagenome obtained from Nanoscope, compared to the abundances obtained from mapping short reads to the 20 known genome references.

Supplementary Figure 14 Genomic variation statistics for 10 gut microbial species selected from our gut metagenome sample (at least 40% of each genome was covered by reads).

There is no obvious correlation between genome size/coverage and SNP density and π, which may be due to the limited number of genomes analyzed.
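The caption does not define π; assuming it denotes nucleotide diversity in the usual population-genetics sense (the average number of differences per site between two sequences drawn from the sample), a minimal sketch of the calculation from aligned haplotypes looks like this. The function name and the toy input are hypothetical, and the authors may have estimated π differently, for example from per-site allele frequencies in the reads.

```python
from itertools import combinations


def nucleotide_diversity(sequences):
    """Average pairwise differences per site over aligned sequences.

    sequences: list of equal-length strings (assumed already aligned,
    with no gaps or ambiguous bases).
    """
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if length == 0 or any(len(s) != length for s in sequences):
        raise ValueError("sequences must be non-empty and of equal length")

    # Count mismatching positions for every unordered pair of sequences.
    total_differences = sum(
        sum(a != b for a, b in zip(s1, s2))
        for s1, s2 in combinations(sequences, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_differences / (n_pairs * length)


if __name__ == "__main__":
    # 4 differences in total over 3 pairs and 4 sites.
    print(nucleotide_diversity(["AAAA", "AAAT", "AATT"]))
```

SNP density, the other statistic named in the caption, would simply be the number of variant sites divided by the number of covered bases.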

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14, Supplementary Tables 1–33 and Supplementary Methods (PDF 3325 kb)

Supplementary Code (TAR 96160 kb)


Cite this article

Kuleshov, V., Jiang, C., Zhou, W. et al. Synthetic long-read sequencing reveals intraspecies diversity in the human microbiome. Nat Biotechnol 34, 64–69 (2016).
