Despite advances in sequencing technologies, assembly of complex plant genomes remains elusive due to polyploidy and high repeat content. Here we report PolyGembler for grouping and ordering contigs into pseudomolecules by genetic linkage analysis. Our approach also provides an accurate method with which to detect and fix assembly errors. Using simulated data, we demonstrate that our approach is highly accurate and outperforms three existing state-of-the-art genetic mapping tools. In particular, our approach is more robust to missing genotype data and genotyping errors. We used our method to construct pseudomolecules for allotetraploid lawn grass utilizing PacBio long reads in combination with restriction site-associated DNA sequencing, and for diploid Ipomoea trifida and autotetraploid potato utilizing contigs assembled from Illumina reads in combination with genotype data generated by single-nucleotide polymorphism arrays and genotyping by sequencing, respectively. We resolved 13 assembly errors for a published I. trifida genome assembly and anchored eight unplaced scaffolds in the published potato genome.
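The idea of grouping contigs by genetic linkage can be illustrated with a minimal sketch. This is not PolyGembler's algorithm: it assumes that pairwise recombination frequency (RF) estimates between contigs are already available, links any two contigs whose RF falls below a threshold, and reports the connected components as linkage groups. The function name `group_contigs`, the threshold `max_rf=0.3` and the toy RF values are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def group_contigs(contigs, rf, max_rf=0.3):
    """Group contigs into linkage groups from pairwise RF estimates.

    Illustrative only (not PolyGembler's method): two contigs are linked
    when their estimated recombination frequency is below max_rf, and
    linkage groups are the connected components of that graph.
    """
    parent = {c: c for c in contigs}

    def find(c):
        # Union-find root lookup with path halving.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in combinations(contigs, 2):
        # Pairs without an estimate are treated as unlinked (RF = 0.5).
        r = rf.get((a, b), rf.get((b, a), 0.5))
        if r < max_rf:
            parent[find(a)] = find(b)

    groups = {}
    for c in contigs:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical RF estimates for five contigs (assumed values, not real data).
toy_rf = {
    ("c1", "c2"): 0.05,
    ("c2", "c3"): 0.10,
    ("c1", "c3"): 0.14,
    ("c4", "c5"): 0.08,
    ("c1", "c4"): 0.48,
}
linkage_groups = group_contigs(["c1", "c2", "c3", "c4", "c5"], toy_rf)
print(linkage_groups)
```

In this toy input, c1, c2 and c3 are tightly linked to one another, c4 and c5 are linked to each other, and the RF between c1 and c4 is close to 0.5, the value expected for unlinked loci, so two groups are reported. Ordering contigs within each group and detecting misassemblies are separate steps that this sketch does not attempt.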
Data for the simulation studies, including comparisons with other methods and studies of M9 × M19 I. trifida and the B2721 potato, are available from http://data.genomicsresearch.org/Projects/polyGembler. Data for the 12601ab1 × Stirling potato mapping population were provided by C. Hackett. Data for the Z. japonica mapping population Carrizo × El Toro are available from the NCBI repository under the accession number SRP055007. The whole-genome PacBio sequence data for the Z. japonica cultivar Yaji are available from the NCBI repository under the accession number SRP110561. Data related to the PGSC version 4.03 pseudomolecules are available from http://solanaceae.plantbiology.msu.edu. The I. trifida de novo genome assembly ITR_r1.0 is available from http://sweetpotato-garden.kazusa.or.jp. The I. trifida de novo genome assembly NCNSP0306 is available from http://sweetpotato.plantbiology.msu.edu. Release 7 of the O. sativa reference genome is available from http://phytozome.jgi.doe.gov. The genome assembly of the Z. japonica accession Nagirizaki is available from http://zoysia.kazusa.or.jp. Source data are provided with this paper.
The software PolyGembler, presented in this article, and its documentation are publicly available at GitHub (https://github.com/c-zhou/polyGembler).
We thank F. Diaz for developing the M9 × M19 I. trifida mapping population and M. David for extracting and quantifying DNA from the M9 × M19 cross. The 12601ab1 × Stirling Infinium 8303 potato array data were provided by C. A. Hackett. This research was supported by grants from the Bill & Melinda Gates Foundation (OPP1052983) and Australian Research Council (DP170102626 awarded to L.J.M.C.). The work at the International Potato Center (CIP) was carried out as part of the Consultative Group for International Agricultural Research (CGIAR) Research Program on Roots, Tubers and Bananas, which is supported by CGIAR Fund Donors (http://www.cgiar.org/about-us/our-funders/). This research was also supported by use of the NeCTAR Research Cloud, by QCIF and by the University of Queensland’s Research Computing Centre. The NeCTAR Research Cloud is a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy.
The authors declare no competing interests.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
A total of 42,715 SNPs located on 678 scaffolds were used for linkage analysis. These scaffolds, totaling ~482 Mb, covered approximately 99.6% of the genome. a, Dot plot for the RF estimations for scaffold pairs mapped to the same reference chromosome. The x- and y-axes represent the physical distances and the estimated RFs, respectively. b, Histogram of the RF estimations for scaffold pairs mapped to different reference chromosomes. c, Collinear plots of pseudomolecules mapped to reference chromosomes. The x- and y-axes represent physical positions (Mb) on the reference chromosomes and pseudomolecules, respectively. Each line represents a collinear block between the reference chromosome and the pseudomolecule. The diagonal line in each plot indicates a high correlation between the reference chromosome and the pseudomolecule constructed from scaffolds.
Extended Data Fig. 2 Collinear plots between the Ipomoea nil reference chromosomes and pseudomolecules constructed from the Ipomoea trifida genotype data.
The x- and y-axes represent the physical positions (Mb) on the reference chromosomes and pseudomolecules, respectively. Each line represents a collinear block between the Ipomoea nil reference chromosome and the pseudomolecule.
Extended Data Fig. 3 Genetic linkage map construction from the Infinium 8303 SNP array data of the Stirling × 12601ab1 mapping population.
a, Dot plot for RF estimations between scaffold pairs mapped to the same PGSC v4.03 pseudomolecule. The x- and y-axes represent the physical distances and the estimated RFs, respectively. b, Histogram of the RF estimations for scaffold pairs mapped to different PGSC v4.03 pseudomolecules. c, Comparison between the genetic linkage map constructed by the proposed method and the PGSC v4.03 pseudomolecules. Twelve genetic linkage groups corresponding to 12 pseudomolecules were constructed. In each plot, the x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map.
Extended Data Fig. 4 Genetic linkage map constructed from the Infinium 8303 SNP array data of the B2721 mapping population with TetraploidSNPMap.
Each dot represents a SNP. The x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map. The genetic linkage map comprises a total of 4,745 SNPs including 56 SNPs located on the unplaced PGSC v4.03 scaffolds (red) and 76 SNPs placed in incorrect PGSC v4.03 pseudomolecules (blue). Since the physical positions of the red and blue dots cannot be determined, they were set to zero in the plots.
Extended Data Fig. 5 Genetic linkage map constructed from the Infinium 8303 SNP array data of the Stirling × 12601ab1 mapping population with TetraploidSNPMap.
Each dot represents a SNP. The x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map. The genetic linkage map comprises a total of 3,593 SNPs including 54 SNPs located on the unplaced PGSC v4.03 scaffolds (red) and 35 SNPs placed in incorrect PGSC v4.03 pseudomolecules (blue). Since the physical positions of the red and blue dots cannot be determined, they were set to zero in the plots.
Extended Data Fig. 6 Collinear plots between the pseudomolecules of Zoysia japonica accessions Yaji and Nagirizaki.
The x- and y-axes represent the positions (Mb) on the pseudomolecules. Each line represents a collinear block between the pseudomolecules.
Extended Data Fig. 7 Relationship between the number of genetic markers and computational resources required for the haplotype phasing algorithm.
The x- and y-axes (on a logarithmic scale) represent the number of genetic markers and the consumption of resources, respectively. a, CPU time. b, Memory. Each point in the plot was averaged over 30 independent experiments (Intel® Xeon® Processor E5-2667 v3 CPU, 3.20 GHz). Error bars indicate one standard deviation at each point.
About this article
Cite this article
Zhou, C., Olukolu, B., Gemenet, D.C. et al. Assembly of whole-chromosome pseudomolecules for polyploid plant genomes using outbred mapping populations. Nat Genet 52, 1256–1264 (2020). https://doi.org/10.1038/s41588-020-00717-7