While genome assembly projects have been successful in many haploid and inbred species, the assembly of noninbred or rearranged heterozygous genomes remains a major challenge. To address this challenge, we introduce the open-source FALCON and FALCON-Unzip algorithms (https://github.com/PacificBiosciences/FALCON/) to assemble long-read sequencing data into highly accurate, contiguous, and correctly phased diploid genomes. We generate new reference sequences for heterozygous samples including an F1 hybrid of Arabidopsis thaliana, the widely cultivated Vitis vinifera cv. Cabernet Sauvignon, and the coral fungus Clavicorona pyxidata, samples that have challenged short-read assembly approaches. The FALCON-based assemblies are substantially more contiguous and complete than alternate short- or long-read approaches. The phased diploid assembly enabled the study of haplotype structure and heterozygosities between homologous chromosomes, including the identification of widespread heterozygous structural variation within coding sequences.
The sequencing of the Cabernet Sauvignon genome was supported in part by a gift from the J. Lohr Vineyards and Wines to D.C. We would also like to thank F. Neto for providing an early-release BUSCO plant data set. Clavicorona pyxidata DNA was provided by L. Nagy (Institute of Biochemistry Biological Research Centre of the Hungarian Academy of Sciences). We thank J. Puglisi, F. Jupe, A. Copeland, and A. Wenger for reading and critiquing the manuscript. The project was supported in part by National Institutes of Health award (R01-HG006677 to M.C.S.) and by National Science Foundation awards (DBI-1350041 and IOS-1237880 to M.C.S.; MCB 0929402; and MCB 1122246 to J.R.E.). J.R.E. is an investigator at the Howard Hughes Medical Institute and Gordon and Betty Moore Foundation (GBMF 3034).
C.-S.C., P.P., G.T.C., C.D., and D.R. are employees and shareholders of Pacific Biosciences, a company commercializing DNA sequencing technology.
Integrated supplementary information
Supplementary Figure 1 Schematics of the software and data process modules and the FALCON-Unzip assembly graph process for resolving haplotypes.
(a) Data dependence flow and software modules inside FALCON and FALCON-Unzip
(b) Left: Initial assembly graph of a contig in the Arabidopsis F1 hybrid assembly. The different colors represent different haplotype blocks and phases. Right: The assembly graph after “unzipping”. Conceptually, the unzipping step identifies the heterozygous SNPs and uses them to remove overlaps between reads from different haplotypes. After these overlaps are removed, nodes from different haplotypes in the assembly graph no longer have edges between them. This allows FALCON-Unzip to identify long haplotype-specific paths and construct haplotigs from them. The dashed circle marks haplotype blocks that can be extended through a bubble region.
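The unzipping idea described in panel (b) can be illustrated with a toy example. The sketch below is not FALCON-Unzip code: the function names, the read identifiers, and the representation of a phase as a (block, haplotype) pair are all assumptions made for illustration. It drops overlap edges that join reads assigned to different haplotypes of the same block and then checks that the remaining graph separates into haplotype-specific groups.

```python
from collections import defaultdict


def unzip_overlaps(overlaps, phase):
    """Keep only overlaps that do not join reads from different haplotypes.

    overlaps: iterable of (read_a, read_b) pairs (hypothetical read names).
    phase: dict mapping a phased read to (block_id, haplotype); reads that
           carry no het-SNP information are simply absent from the dict.
    """
    kept = []
    for a, b in overlaps:
        pa, pb = phase.get(a), phase.get(b)
        # Remove the edge only when both reads are phased in the same
        # block and were assigned to opposite haplotypes.
        if pa is not None and pb is not None and pa[0] == pb[0] and pa[1] != pb[1]:
            continue
        kept.append((a, b))
    return kept


def connected_components(nodes, edges):
    """Group nodes that are still linked by at least one path of edges."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, groups = set(), []
    for node in nodes:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(adjacency[current] - group)
        seen |= group
        groups.append(group)
    return groups


# Toy data: three reads per haplotype in one block, plus two overlaps
# that cross haplotypes (r2-r5 and r1-r4).
reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
phase = {"r1": (0, 0), "r2": (0, 0), "r3": (0, 0),
         "r4": (0, 1), "r5": (0, 1), "r6": (0, 1)}
overlaps = [("r1", "r2"), ("r2", "r3"), ("r4", "r5"),
            ("r5", "r6"), ("r2", "r5"), ("r1", "r4")]

kept = unzip_overlaps(overlaps, phase)
haplotype_groups = connected_components(reads, kept)
print(kept)
print(haplotype_groups)
```

In this toy graph the two cross-haplotype overlaps are removed, leaving one connected group of reads per haplotype, which is the property the caption describes as the basis for building haplotigs.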
Supplementary Figure 2 Reverse accumulative read length distribution of the three diploid genome datasets
Supplementary Figure 3 SOAPdenovo assembly sizes and N50 and NG50 sizes for the three genomes at different values of k, using raw reads and reads corrected by Lighter.
Supplementary Figure 4 Assemblytics comparison of the Arabidopsis F1 assemblies from FALCON-Unzip, Platanus, and SOAPdenovo.
(a) Cumulative sequence length of three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo compared to the TAIR10 reference. (b) Variants called using Assemblytics from three Arabidopsis F1 assemblies created by FALCON-Unzip, Platanus, and SOAPdenovo.
Supplementary Figure 5 Variation comparison between the inbred line assemblies and the F1 hybrid for all Arabidopsis chromosomes, along with the TAIR10 reference.
Supplementary Figure 7 Assembly comparison: FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus V. vinifera reference genome
(a) MUMmerplot of the FALCON-Unzip V. vinifera cv. Cabernet Sauvignon assembly versus the V. vinifera reference genome. For clarity, only alignments >= 10,000 bp long to the primary chromosomes are displayed. (b) The synteny between PN40024 Chr1 from the 5′ telomere to the centromere (green line) and the longest contig 000000F (black line) and its associated haplotigs (blue lines). The vertical green and blue lines indicate homologous coding sequences between the sequences. The cyan lines at the bottom indicate the synteny between the primary contig and other primary contigs. (c) Synteny alignment between two primary contigs, 000334F vs. 000000F. (d) Synteny alignment between two primary contigs, 000057F vs. 000075F.
(a) Semi-log plot of the distribution of the number of het-SNPs observed in the reads used for phasing the longest contig of each genome. (b) Fit of the distributions to an exponential function (density ~ c * exp(-a * het-SNP count)). We use het-SNP count ranges of 10 to 200 for Arabidopsis, 50 to 200 for Vitis, and 10 to 100 for Clavicorona to capture the exponentially decaying part. The fitted parameter a = 0.0222, 0.0216, and 0.0412 for Arabidopsis, Vitis, and Clavicorona, respectively. The fastest decay rate, for Clavicorona, indicates that it has the least variation between haplotypes among the three genomes. From this fit, we expect about 45 (Arabidopsis), 46 (Vitis), and 24 (Clavicorona) het-SNPs per 10 kb in the regions of interest.
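The relationship between the fitted decay rate and the expected het-SNP counts can be checked with a few lines of arithmetic. The sketch below assumes that the reported expectations are the mean of the fitted exponential, 1/a, and that all three decay rates are taken as positive magnitudes (0.0222, 0.0216, 0.0412) so that they agree with the stated form density ~ c * exp(-a * count); the log-linear least-squares fit is one common way to estimate a, not necessarily the procedure the authors used, and the function name and synthetic data are hypothetical.

```python
import math


def fit_exponential_decay(counts, densities):
    """Fit density ~ c * exp(-a * count) by least squares on log(density).

    Returns (a, c). Assumes every density is strictly positive.
    """
    n = len(counts)
    logs = [math.log(d) for d in densities]
    mean_x = sum(counts) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in counts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(counts, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The slope of log(density) against count is -a.
    return -slope, math.exp(intercept)


# Decay rates as reported in the caption (taken as positive magnitudes).
reported_a = {"Arabidopsis": 0.0222, "Vitis": 0.0216, "Clavicorona": 0.0412}

# Mean of an exponential distribution with rate a is 1 / a.
expected_counts = {name: 1.0 / a for name, a in reported_a.items()}
for name, value in expected_counts.items():
    print(f"{name}: {value:.1f}")

# Synthetic check: noise-free data generated with a = 0.0222 over the
# 10 to 200 range used for Arabidopsis should recover the same rate.
synthetic_counts = list(range(10, 201, 10))
synthetic_density = [0.5 * math.exp(-0.0222 * x) for x in synthetic_counts]
a_fit, c_fit = fit_exponential_decay(synthetic_counts, synthetic_density)
```

Under these assumptions, 1/a rounds to 45, 46, and 24 for the three genomes, matching the expected counts quoted in the caption.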
The het-SNPs are called with FreeBayes on alignments of the short-read data to the primary contigs only. The contig 00003F has a low-heterozygosity region from ~1.2 Mb to ~2.7 Mb.
Supplementary Figure 10 General schematic of how different levels of heterozygosity can affect the contig layout.
(a) (b) We mapped both genomic reads (middle panel) and cDNA reads (lower panel) to the primary contigs from our Clavicorona pyxidata assembly. We also show curated CDS sequences mapped to the contig (top panel). The genomic reads show both alleles, whereas we observe only one major allele in the transcript reads.
Supplementary Figure 13 (a) Summary of the graph reduction from sequence overlaps to contigs. (b) Example of constructing haplotigs in the Clavicorona pyxidata assembly.
(a) All pairs of het-SNPs that are covered by multiple reads are evaluated. A “coupling score” is calculated from the number of reads that support the current haplotype assignment of the het-SNPs. (b) (c) We scan linearly through the het-SNP positions. If the total score is improved by flipping the haplotype assigned at one location, we flip the assignment. (d) An example showing the “coupling score” before the flipping process (unphased het-SNP assignment) and afterward (phased het-SNP assignment).
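The scan-and-flip procedure in this caption can be sketched on toy data. The code below is an illustration under stated assumptions, not the FALCON-Unzip implementation: the score is assumed to be the number of read-supported het-SNP pairs minus the number of contradicted pairs, the scan is repeated until no flip helps (capped by a hypothetical max_passes parameter), and reads are represented as dictionaries from het-SNP index to observed allele (0 or 1).

```python
from itertools import combinations


def coupling_score(reads, phase):
    """Total score over all het-SNP pairs seen together on a read.

    phase[i] is the allele assumed to sit on haplotype 0 at het-SNP i.
    A read supports the assignment for a pair when both of its alleles
    fall on the same haplotype; otherwise it contradicts it.
    """
    score = 0
    for read in reads:
        for i, j in combinations(sorted(read), 2):
            same_haplotype = (read[i] ^ phase[i]) == (read[j] ^ phase[j])
            score += 1 if same_haplotype else -1
    return score


def phase_by_flipping(reads, n_snps, max_passes=10):
    """Greedy phasing: flip one het-SNP at a time, keep flips that help."""
    phase = [0] * n_snps  # unphased starting assignment
    best = coupling_score(reads, phase)
    for _ in range(max_passes):
        improved = False
        for i in range(n_snps):
            phase[i] ^= 1  # tentatively flip the assignment at SNP i
            trial = coupling_score(reads, phase)
            if trial > best:
                best = trial
                improved = True
            else:
                phase[i] ^= 1  # no gain, so undo the flip
        if not improved:
            break
    return phase, best


# Toy reads drawn from two haplotypes with alleles 0101 and 1010,
# each read spanning two adjacent het-SNPs.
toy_reads = [
    {0: 0, 1: 1}, {1: 1, 2: 0}, {2: 0, 3: 1},  # haplotype with alleles 0101
    {0: 1, 1: 0}, {1: 0, 2: 1}, {2: 1, 3: 0},  # haplotype with alleles 1010
]

initial_score = coupling_score(toy_reads, [0, 0, 0, 0])
final_phase, final_score = phase_by_flipping(toy_reads, n_snps=4)
print(initial_score, final_phase, final_score)
```

In this example every read contradicts the unphased starting assignment, and after the scan every read supports the final one. Recomputing the full score after each flip keeps the sketch short; a practical implementation would update only the pairs that involve the flipped het-SNP.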
Supplementary Figures 1–15, Supplementary Tables 1–10 and Supplementary Note 1 (PDF 4833 kb)
SNPs identified by nucmer between the FALCON Col-0 assembly and the TAIR10 reference (TXT 1920 kb)
List of syntenic regions identified between different primary contigs of the FALCON-Unzip Arabidopsis thaliana Col-0 x Cvi-1 assembly. (CSV 21668 kb)
List of syntenic regions identified between different primary contigs of the FALCON-Unzip Vitis vinifera assembly. (CSV 2183 kb)
Example of starting an AWS instance to run FALCON/FALCON-Unzip for Clavicorona pyxidata assembly (PDF 2523 kb)
About this article
Cite this article
Chin, CS., Peluso, P., Sedlazeck, F. et al. Phased diploid genome assembly with single-molecule real-time sequencing. Nat Methods 13, 1050–1054 (2016). https://doi.org/10.1038/nmeth.4035