Phased diploid genome assembly with single-molecule real-time sequencing

Abstract

While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.

Figure 1: Overview of FALCON and FALCON-Unzip.
Figure 2: SNP density and structural variation in the FALCON-Unzip F1 Arabidopsis assembly.

Accession codes

Accessions

Sequence Read Archive


Acknowledgements

The sequencing of the Cabernet Sauvignon genome was supported in part by a gift from J. Lohr Vineyards and Wines to D.C. We would also like to thank F. Neto for providing an early-release BUSCO plant data set. Clavicorona pyxidata DNA was provided by L. Nagy (Institute of Biochemistry, Biological Research Centre of the Hungarian Academy of Sciences). We thank J. Puglisi, F. Jupe, A. Copeland, and A. Wenger for reading and critiquing the manuscript. The project was supported in part by a National Institutes of Health award (R01-HG006677 to M.C.S.) and by National Science Foundation awards (DBI-1350041 and IOS-1237880 to M.C.S.; MCB 0929402 and MCB 1122246 to J.R.E.). J.R.E. is an investigator at the Howard Hughes Medical Institute and Gordon and Betty Moore Foundation (GBMF 3034).

Author information

Contributions

C.-S.C., P.P., A.C., D.R.R., and M.C.S. conceived the idea of the FALCON–FALCON-Unzip assembler. C.-S.C., P.P., F.J.S., M.N., G.T.C., D.R.R., D.C., and M.C.S. designed the experiments and performed the analysis. P.P., D.C., D.R.R., and M.C.S. collected the sequencing data. R.O'M., C.L., and J.R.E. constructed the Col-0 x Cvi-1 F1 hybrid. A.C., R.O'M., R.F.-B., A.M.-C., G.R.C., M.D., C.L., J.R.E., and D.C. collected the samples and prepared DNA for sequencing. C.-S.C., P.P., F.J.S., M.N., G.T.C., D.C., D.R.R., and M.C.S. wrote the manuscript. C.-S.C. and C.D. implemented the computer code.

Corresponding authors

Correspondence to Chen-Shan Chin or Michael C Schatz.

Ethics declarations

Competing interests

C.-S.C., P.P., G.T.C., C.D., and D.R.R. are employees and shareholders of Pacific Biosciences, a company commercializing DNA sequencing technology.

Integrated supplementary information

Supplementary Figure 1 Schematics of the software and data process modules and the FALCON-Unzip assembly graph process for resolving haplotypes.

(a) Data dependence flow and software modules inside FALCON and FALCON-Unzip.

(b) Left: Initial assembly graph of a contig in the Arabidopsis F1 hybrid assembly. The different colors represent different haplotype blocks and phases. Right: The assembly graph after “unzipping”. Conceptually, the unzipping step identifies the heterozygous SNPs and uses them to remove overlaps between reads from different haplotypes. After removing such overlaps, nodes from the different haplotypes in the assembly graph no longer have edges between them. This allows FALCON-Unzip to identify long haplotype-specific paths and construct haplotigs from them. The dashed circle marks a region where haplotype blocks can be extended through a bubble region.
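The overlap-removal idea in panel (b) can be illustrated with a minimal sketch. It is not the FALCON-Unzip implementation: the function name, the representation of overlaps as read-ID pairs, and the `(block, haplotype)` phase labels are all assumptions made for illustration. Unphased reads are assumed to keep all their overlaps.

```python
def remove_cross_haplotype_overlaps(overlaps, read_phase):
    """Drop overlaps that join reads phased to different haplotypes.

    overlaps   -- list of (read_a, read_b) pairs (hypothetical representation)
    read_phase -- dict mapping read ID -> (block_id, haplotype); reads
                  missing from the dict are treated as unphased
    """
    kept = []
    for read_a, read_b in overlaps:
        phase_a = read_phase.get(read_a)
        phase_b = read_phase.get(read_b)
        if phase_a is not None and phase_b is not None:
            block_a, hap_a = phase_a
            block_b, hap_b = phase_b
            # Same phasing block but opposite haplotypes: cut this edge.
            if block_a == block_b and hap_a != hap_b:
                continue
        kept.append((read_a, read_b))
    return kept


# Toy example: r1 is on haplotype 0, r2 and r3 on haplotype 1, r4 is unphased.
overlaps = [("r1", "r2"), ("r2", "r3"), ("r3", "r4"), ("r1", "r4")]
read_phase = {"r1": (0, 0), "r2": (0, 1), "r3": (0, 1)}
print(remove_cross_haplotype_overlaps(overlaps, read_phase))
```

In this toy case only the r1/r2 overlap is removed, so the phased reads of the two haplotypes are no longer directly connected, while the unphased read r4 still overlaps both.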

Supplementary Figure 2 Reverse accumulative read length distribution of the three diploid genome datasets

Supplementary Figure 3 SOAPdenovo assembly sizes and N50 and NG50 sizes of the three genomes for different values of k, using raw reads and reads corrected by Lighter.

Supplementary Figure 4 Assemblytic analysis comparison of the Arabidopsis F1 assemblies from FALCON-Unzip, Platanus, and SOAPdenovo.

(a) Cumulative sequence length of three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo compared to the TAIR10 reference. (b) Variants called using Assemblytics from three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo.

Supplementary Figure 5 Variation comparison between the inbred line assemblies and the F1 hybrid for all Arabidopsis chromosomes, along with the TAIR10 reference.

Supplementary Figure 6 Homopolymer length and frequency in the TAIR10 Assembly.

Supplementary Figure 7 Assembly comparison: FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus V. vinifera reference genome

(a) MUMmerplot of FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus V. vinifera reference genome. For clarity, only alignments >= 10,000 bp long to the primary chromosomes are displayed. (b) The synteny between PN40024 Chr1 from the 5′ telomere to the centromere (green line) and the longest contig 000000F (black line) and its associated haplotigs (blue lines). The vertical green and blue lines indicate homologous coding sequences between the sequences. The cyan lines at the bottom indicate the synteny between the primary contig and other primary contigs. (c) Synteny alignment between two primary contigs, 000334F vs. 000000F. (d) Synteny alignment between two primary contigs, 000057F vs. 000075F.

Supplementary Figure 8 Comparison of the distribution of the het-SNP site density of the three genomes

(a) The distribution of the number of het-SNPs observed in the reads used for phasing the longest contig of each genome, shown as a semi-log plot. (b) Fits of the distributions to an exponential function (density ~ c * exp(-a * het-SNP count)). We pick het-SNP count ranges of 10 to 200 for Arabidopsis, 50 to 200 for Vitis, and 10 to 100 for Clavicorona to capture the exponentially decaying part. The fitted parameters are a = 0.0222, 0.0216, and 0.0412 for Arabidopsis, Vitis, and Clavicorona, respectively. The fastest decay rate, for Clavicorona, indicates that it has the least variation between the haplotypes among the three genomes. From this fitting, we expect to see about 45 (Arabidopsis), 46 (Vitis), and 24 (Clavicorona) het-SNPs per 10 kb in the regions of interest.
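The arithmetic behind these expected counts can be checked with a short sketch. It rests on an assumption not stated in the caption: that the expected count is the mean of an exponential distribution, 1/a, with a taken as a positive decay rate. Under that reading the quoted values of a give roughly 45, 46, and 24. The log-linear least-squares fit is likewise only one plausible way to obtain a from binned densities; the function names are hypothetical and the fitting procedure actually used is not described here.

```python
import math


def fit_exponential_decay(counts, densities):
    """Least-squares fit of log(density) = log(c) - a * count.

    Returns (c, a) for the model density ~ c * exp(-a * count).
    """
    logs = [math.log(d) for d in densities]
    n = len(counts)
    mean_x = sum(counts) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(counts, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def expected_count(a):
    """Mean of an exponential distribution with decay rate a (assumed reading)."""
    return 1.0 / a


# Decay rates quoted in the caption, taken as positive values.
FITTED_A = {"Arabidopsis": 0.0222, "Vitis": 0.0216, "Clavicorona": 0.0412}
for genome, a in FITTED_A.items():
    print(genome, round(expected_count(a)))
```

A larger a means a shorter tail of reads carrying many het-SNPs, which is why the largest value (Clavicorona) corresponds to the smallest expected count.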

Supplementary Figure 9 Example of a low heterozygosity region observed in Clavicorona genome.

The het-SNPs are called with FreeBayes on the alignments of the short read data to only the primary contigs. The contig 00003F has a low heterozygosity region from ~1.2Mb to ~2.7Mb.

Supplementary Figure 10 General schematic of how different levels of heterozygosity can affect the contig layout.

Supplementary Figure 11 Candidates for differentially expressed alleles from RNA-seq data.

(a)(b) We mapped both genomic reads (middle panel) and cDNA reads (lower panel) to the primary contigs from our Clavicorona pyxidata assembly. We also show curated CDS sequences mapped to the contig (top panel). The genomic reads show both alleles mapped, while we observe only one major allele in the transcript reads.

Supplementary Figure 12 An example of how the FALCON-sense algorithm generates consensus sequence.

Supplementary Figure 13 (a) Summary of the graph reduction from sequence overlaps to contigs. (b) Example of constructing haplotigs in the Clavicorona pyxidata assembly.

Supplementary Figure 14 Summary of the graph reduction from sequence overlaps to contigs.

Supplementary Figure 15 Summary of the greedy SNP phasing algorithm.

(a) All pairs of het-SNPs that are covered by multiple reads are evaluated. A “coupling score” is calculated from the number of reads that support the current haplotype assignment of the het-SNPs. (b)(c) We linearly scan through the het-SNP positions. If the total score is improved by flipping the haplotype assigned at one location, then we flip the assignment. (d) An example showing the “coupling score” before the flipping process (unphased het-SNP assignment) and afterward (phased het-SNP assignment).
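The scan-and-flip procedure described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the FALCON-Unzip code: reads are represented as dicts from het-SNP index to observed allele (0 or 1), and the coupling score is taken to be supporting minus conflicting read-level SNP pairs, whereas the caption only says it is calculated from the number of supporting reads. The function names and the `max_passes` parameter are hypothetical.

```python
from itertools import combinations


def coupling_score(reads, phase):
    """Total score: +1 for each SNP pair in a read that agrees with the
    current assignment, -1 for each pair that conflicts with it."""
    score = 0
    for read in reads:
        for i, j in combinations(sorted(read), 2):
            # allele XOR phase gives the haplotype this observation lands on.
            same_haplotype = (read[i] ^ phase[i]) == (read[j] ^ phase[j])
            score += 1 if same_haplotype else -1
    return score


def greedy_phase(reads, n_snps, max_passes=10):
    """Linearly scan SNP positions, keeping a flip only if the score improves."""
    phase = [0] * n_snps
    best = coupling_score(reads, phase)
    for _ in range(max_passes):
        improved = False
        for i in range(n_snps):
            phase[i] ^= 1  # tentatively flip the assignment at position i
            trial = coupling_score(reads, phase)
            if trial > best:
                best = trial
                improved = True
            else:
                phase[i] ^= 1  # no improvement: undo the flip
        if not improved:
            break
    return phase, best


# Toy data: two haplotypes with alleles 0,1,0,1 and 1,0,1,0 at four het-SNPs.
reads = [
    {0: 0, 1: 1}, {1: 1, 2: 0}, {2: 0, 3: 1},
    {0: 1, 1: 0}, {1: 0, 2: 1}, {2: 1, 3: 0},
]
phase, score = greedy_phase(reads, 4)
print(phase, score)
```

Recomputing the full score after every trial flip keeps the sketch short but is quadratic in practice; a real implementation would update only the pairs involving the flipped position. As with any greedy scan, the result can be a local optimum.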

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–15, Supplementary Tables 1–10 and Supplementary Note 1 (PDF 4833 kb)

Supplementary Data 1

SNPs identified by nucmer between the FALCON Col-0 assembly and the TAIR10 reference (TXT 1920 kb)

Supplementary Data 2

List of syntenic regions identified between different primary contigs of the FALCON-Unzip Arabidopsis thaliana Col-0 x Cvi-1 assembly. (CSV 21668 kb)

Supplementary Data 3

List of syntenic regions identified between different primary contigs of the FALCON-Unzip Vitis vinifera assembly. (CSV 2183 kb)

Supplementary Data 4

Example of starting an AWS instance to run FALCON/FALCON-Unzip for Clavicorona pyxidata assembly (PDF 2523 kb)

Cite this article

Chin, CS., Peluso, P., Sedlazeck, F. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13, 1050–1054 (2016). https://doi.org/10.1038/nmeth.4035

  • DOI: https://doi.org/10.1038/nmeth.4035
