Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

Assembly of whole-chromosome pseudomolecules for polyploid plant genomes using outbred mapping populations

Abstract

Despite advances in sequencing technologies, assembly of complex plant genomes remains elusive due to polyploidy and high repeat content. Here we report PolyGembler for grouping and ordering contigs into pseudomolecules by genetic linkage analysis. Our approach also provides an accurate method with which to detect and fix assembly errors. Using simulated data, we demonstrate that our approach is of high accuracy and outperforms three existing state-of-the-art genetic mapping tools. Particularly, our approach is more robust to the presence of missing genotype data and genotyping errors. We used our method to construct pseudomolecules for allotetraploid lawn grass utilizing PacBio long reads in combination with restriction site-associated DNA sequencing, and for diploid Ipomoea trifida and autotetraploid potato utilizing contigs assembled from Illumina reads in combination with genotype data generated by single-nucleotide polymorphism arrays and genotyping by sequencing, respectively. We resolved 13 assembly errors for a published I. trifida genome assembly and anchored eight unplaced scaffolds in the published potato genome.

Main

High-quality genome assembly plays an essential role in plant genomic and genetic analyses. Even though the recent advances in third-generation long-read sequencing technologies significantly improved the contiguity of genome assemblies1, it is so far not possible to construct a complete polyploid plant genome with sequence data alone due to polyploidy and high repeat content. Long-distance linkage information, such as physical maps, genetic maps, optical maps, syntenic maps, chromatin interactions and Hi-C contact maps, is crucial in constructing complete genome assemblies. Genetic maps have been widely adopted as they provide chromosomal-scale linkage information. The construction of many high-quality chromosomal-level genome assemblies for plants involved scaffolding with genetic linkage maps as the last step for building pseudomolecules2,3. The idea is straightforward: a genetic map is constructed for genetic markers associated with scaffolds and the map is then used to anchor scaffolds to construct pseudomolecules4. The completeness of the pseudomolecules largely depends on the density of the genetic linkage map. To obtain a full map of chromosomes, genetic markers should cover as much of the genome as possible. However, genome-wide genetic marker discovery for a mapping population with a considerable number of individuals is nontrivial. Several methods have been proposed for high-throughput genetic marker discovery with cost-effective next-generation sequencing (NGS) technology, such as reduced-representation libraries5, restriction site-associated DNA sequencing (RAD-seq)6 and genotyping by sequencing (GBS)7.

The genotype data generated with NGS-based high-throughput genetic marker discovery methods, including reduced-representation libraries, RAD-seq and GBS, are usually of large scale with an abundance of missing values and genotyping errors. Genetic linkage analysis for such datasets is challenging. The conventional genetic mapping tools, such as MAPMAKER8 and R/qtl9, have been optimized for relatively small but high-quality marker sets. Moreover, these tools were designed for inbred lines. The development of inbred lines, however, could be difficult, expensive and time consuming, especially for polyploids10. Several methods have since been proposed for outbred lines, including OneMap10, JoinMap11 and Lep-MAP2 (ref. 12), but only for diploids. TetraploidMap was the first tool specifically designed for mapping outbred tetraploid species based on dominant and codominant marker information13. This tool was later extended to TetraploidSNPMap to make full use of the single-nucleotide polymorphism (SNP) dosage data14. polymapR is another genetic mapping tool for outbred polyploids utilizing SNP dosage data15. Due to the underlying design, however, these methods are intrinsically sensitive to missing values and genotyping errors, and thus require stringent quality control on the input genetic markers. For whole-chromosome pseudomolecule construction, this could dramatically reduce the proportion of the genome covered by the genetic linkage map.

Here we describe a novel method for the construction of whole-chromosome pseudomolecules for polyploids by genetic mapping. This method relies on the availability of a high-density marker set on an F1 outbred population and reference contigs or scaffolds. We first perform haplotype phasing for the mapping population at the scaffold level. We then estimate the genetic distance between each pair of scaffolds with the phasing results and further perform a linkage analysis to construct genetic linkage maps for scaffolds, which are used to build whole-chromosome pseudomolecules. This method is computationally efficient and robust to the presence of substantial amounts of missing genotype data and genotyping errors, and thus can well handle data generated using NGS-based high-throughput genetic marker discovery methods. Using simulated datasets, we demonstrate substantial improvements of our approach over existing genetic mapping algorithms and the ability to construct whole-chromosome pseudomolecules for both diploid and tetraploid genomes. We applied our method to the construction of pseudomolecules for several real datasets, including GBS data for the diploid I. trifida, SNP array data for the autotetraploid potato, and RAD-seq data for the allotetraploid Zoysia japonica. The resultant pseudomolecules showed high collinearity with the reference genomes.

Results

Overview of method

We have developed a new method called PolyGembler (polyploid genetic linkage assembler) for the assembly of polyploid genomes by genetic linkage analysis. Figure 1 provides an overview of PolyGembler. The method assumes the availability of genome-wide data for genotyping, such as GBS, RAD-seq and array data, collected on an F1 outbred family, as well as high coverage (that is, greater than 30×) whole-genome sequence data on a reference sample, or alternatively a set of reference contigs or scaffolds. First, the genotyping data of the mapping population are mapped to the reference scaffolds to call variants (Fig. 1a). The resultant genotype data or allele depth data are then used to infer haplotypes for each scaffold. The haplotypes are utilized to detect assembly errors and to calculate recombination fractions (RFs) between each pair of scaffolds (Fig. 1b–d). Next, the linkage information between the scaffold pairs is used for a graph-based clustering algorithm for linkage group construction (Fig. 1e). Ideally, the scaffolds in each linkage group are originated from the same chromosome. The order of the scaffolds in each linkage group is determined by running a multidimensional scaling (MDS) algorithm and the directions of the scaffolds are calculated by solving a carefully designed traveling salesman problem (TSP) to minimize the sum of adjacent RFs along the linkage map (Fig. 1f). Finally, the scaffold-based genetic linkage maps are used to construct pseudomolecules.

Fig. 1: PolyGembler framework.
figure1

a, Construction of reference contigs and scaffolds and genotyping of mapping populations with high-throughput genome-wide genetic marker discovery technologies. Although we show the genotype data here, PolyGembler can handle allele depth data. b, Construction of haplotypes for each scaffold and for each sample with a HMM. A well-trained HMM for a diploid genotype is depicted in the top panel. The scaffold consists of ten markers illustrated as ellipses. The numbers within the ellipses indicate the probabilities of the parental haplotypes associating with the A and B allele. The width of the line connecting two ellipses is proportional to the RF between parental haplotypes at that position. Recombinations are prohibited between paternal and maternal haplotypes. The model assumes that each F1 progeny inherits one haplotype from each parent’s gamete and maximizes the likelihood of the marker data. This model can identify assembly errors. The RF between the fifth and sixth markers is abnormally large, as indicated by the numeric vaules in the bottom panel, suggesting a misassembly. In the middle panel, we show the positions of markers along the scaffold. c, RF estimation between each pair of scaffolds by counting the mismatches of parental haplotypes. d, The estimated RFs reflect the physical distances. The dot plot and histogram are for scaffold pairs from the same chromosome and different chromosomes, respectively. e, Linkage group construction for the reference scaffolds with a graph partitioning algorithm. f, The order of the scaffolds within each linkage group is determined by the MDS algorithm. The directions of the scaffolds are further determined by solving a modified TSP.

Application to simulated data

Simulated data were used to evaluate PolyGembler. First, we simulated a diploid genome and a tetraploid genome. Genome-wide Illumina short reads and nanopore long reads were then simulated, respectively, from the diploid and tetraploid genome for genome assembly. To construct pseudomolecules from the genome assemblies, we also simulated outbred mapping populations consisting of two parents and 190 F1 progeny and then further simulated GBS data (Methods).

Diploid simulation

The genome assembly constructed from the simulated data consisted of 83,325 scaffolds of ~493 megabases (Mb). The longest scaffold was ~625 kilobases (kb) and the N50 statistic was ~90 kb. The GBS data were simulated at ~10× around restriction recognition sites (Supplementary Note 1). We identified 24,488 SNPs from the GBS data for pseudomolecule construction (Methods). A total of 3,348 scaffolds of ~338 Mb were covered by these SNPs, constituting ~69% of the genome. Among the 20 assembly errors confirmed with the Quality Assessment Tool for Genome Assemblies (QUAST; Methods), PolyGembler identified three without false positives. To investigate the accuracy of RF estimation, we mapped the scaffolds back to reference chromosomes to determine the physical positions (Methods). For scaffold pairs that located on the same chromosomes, the estimated RFs correlated with the physical distances very well, while for scaffold pairs that mapped to different chromosomes the RFs were consistently large (Fig. 2a,b). The linkage groups constructed by this method perfectly recognized 15 reference chromosomes. Within each linkage group, the order of the mapped scaffolds demonstrated high correlation with the true order on the corresponding reference chromosome (Fig. 2e). The Spearman’s rank correlation coefficients between them ranged from 0.967–1.000.

Fig. 2: Pseudomolecule construction for simulated datasets.
figure2

a,c, Dot plots of the physical distances against the estimated RFs for scaffold pairs mapped to the same reference chromosome for diploid (a) and tetraploid (c) mapping populations. b,d, Histograms of the estimated RFs for scaffold pairs mapped to different chromosomes for diploid (b) and tetraploid (d) mapping populations. e,f, Collinear plots between the reference chromosomes and the pseudomolecules constructed from the diploid (e) and tetraploid (f) mapping populations.

Tetraploid simulation

A total of ~10.9 million nanopore reads of ~48.5 gigabases (Gb) with an N50 of ~6.7 kb were simulated, representing ~100× depth of coverage of the genome (Methods). The resultant genome assembly consisted of 722 scaffolds of ~484 Mb with an N50 of ~1.1 Mb. The GBS data were simulated at ~40× around restriction recognition sites (Supplementary Note 1). We identified 58,733 SNPs from the GBS data located on 707 scaffolds for pseudomolecule construction. These scaffolds covered nearly the whole genome. Out of the 135 misassembled scaffolds confirmed with QUAST (Methods), PolyGembler identified 86 without false positives. The results for RF estimation were similar to those for diploid simulation (Fig. 2c,d). Fifteen pseudomolecules corresponding to reference chromosomes were constructed without misgrouping. The resultant pseudomolecules were highly collinear with reference chromosomes (Fig. 2f). The Spearman’s rank correlation coefficients between the scaffold order determined by PolyGembler and the true order fell within the range of 0.956–1.000. To investigate the proposed method thoroughly, we also simulated GBS data at ~20×. With the lower-depth data, the number of assembly errors detected by PolyGembler significantly decreased (19 out of 135). However, the results for linkage mapping and pseudomolecule construction were still quite accurate (Extended Data Fig. 1).

Application to I. trifida GBS data

The I. trifida mapping population consists of two parents and 210 F1 progeny. It was derived from the crossing of two diploid I. trifida heterozygous genotypes and is called M9 × M19. We ran PolyGembler with the GBS data of this mapping population to construct pseudomolecules from the de novo genome assembly ITR_r1.0 (ref. 16). The ITR_r1.0 genome assembly consists of 77,400 scaffolds. The total size is ~512 Mb and the N50 statistic is ~43 kb. We identified 35,646 SNPs covering 2,469 scaffolds of ~231 Mb, constituting ~45% of the genome. We used another I. trifida genome assembly, NCNSP0306 (ref. 3), as a reference genome to validate the results.

We detected 13 misassembled scaffolds in ITR_r1.0. Figure 3a depicts the RF estimations along the scaffold Itr_sc000015 with 30 independent runs. There was a potential misassembly around 177 kb, where the mean RF estimation was 0.432 (s.d. = 0.012). By mapping it to NCNSP0306 chromosomes, we found that while the first ~166 kb was mapped to chromosome 4, the last ~138 kb was mapped to chromosome 8 (Fig. 3b,c). The mapping results supported the misassembly of scaffold Itr_sc000015, as well as the misassembly position. The other misassemblies detected by PolyGembler were also confirmed by mapping to NCNSP0306 chromosomes (Supplementary Table 1). We split the potentially misassembled ITR_r1.0 scaffolds to generate new scaffolds and excluded the original ones from linkage analysis.

Fig. 3: The ITR_r1.0 scaffold Itr_sc000015 is a misassembly.
figure3

a, Estimated RFs along the scaffold based on 30 independent runs. Each gray line plot represents an independent run. The x and y axes represent the SNP positions on scaffold Itr_sc000015 (n = 91) and the estimated RFs between the adjacent SNPs, respectively. Solid dots indicate mean RFs. There is a misassembly around 177 kb, where the mean estimated RFs is 0.432 and the standard deviation is 0.012. b,c, The first ~166 kb of scaffold Itr_sc000015 are collinear with NCNSP0306 chromosome 4 (b), whereas the last ~138 kb are collinear with the NCNSP0306 chromosome 8 (c).

Source data

We mapped the scaffolds to the NCNSP0306 chromosomes to investigate the accuracy of RF estimation. When the physical distance was small, it showed a strong correlation with the estimated RFs. As the physical distance increased, the estimated RFs clearly deviated from the expected values calculated from conventional genetic mapping functions (Fig. 4a). One possible explanation is that the calculation of physical distances between scaffolds was not accurate due to structural variations between the two genome varieties or misassemblies in the reference chromosomes. Another possible reason, and probably the major reason, is the absence of genetic recombinations in large pericentromeric heterochromatin blocks in the I. trifida genome, as has been widely reported for the potato genome17. The RFs for scaffolds mapped to different chromosomes were consistently large (Fig. 4b).

Fig. 4: Pseudomolecule construction for M9 × M19 and B2721 mapping populations.
figure4

a,c, Dot plots of the physical distances against the estimated RFs for scaffold pairs mapped to the same reference chromosome for the M9 × M19 (a) and B2721 (c) mapping populations. b,d, Histograms of the estimated RFs for scaffold pairs mapped to different chromosomes for the M9 × M19 (b) and B2721 (d) mapping populations. e, Collinear plots between the NCNSP0306 reference chromosomes and the pseudomolecules constructed from the M9 × M19 mapping population. f, Genetic linkage map constructed from the B2721 mapping population.

Fifteen pseudomolecules in high concordance with the reference chromosomes were constructed (Fig. 4e). The pseudomolecules were not well assembled in several pericentromeric regions. This was mainly because these regions are highly repetitive and the original genome assembly ITR_r1.0 is extremely fragmented in these regions. To complete the genome assembly, we probably need to improve the quality of scaffolds in these regions; for example, with long reads. There was a clear discrepancy between the pseudomolecule and NCNSP0306 chromosome 3 (Fig. 4e). It is not known whether it was a misassembly by our approach, a misassembly in NCNSP0306 chromosome 3 or a structure variation between the two genome varieties. We mapped the pseudomolecule to the genome assembly of Ipomoea nil18—another species from the genus Ipomoea. The result showed that the pseudomolecule represented a better collinearity with I. nil chromosome 2, suggesting that NCNSP0306 chromosome 3 might be misassembled (Extended Data Fig. 2).

Application to Infinium 8303 potato array data

Two outbred mapping populations genotyped with the Infinium 8303 SNP array were used to construct pseudomolecules from Potato Genome Sequencing Consortium (PGSC) doubled-monoploid version 3 scaffolds17. The B2721 mapping population was derived from the cross Atlantic × B1829-5, consisting of 156 F1 progeny. A total of 5,736 SNPs were identified covering 721 scaffolds, of which 175 scaffolds with fewer than three SNPs were excluded from the further analysis. The other mapping population was derived from the cross 12601ab1 × Stirling, consisting of 190 F1 progeny. The variant calling ended up with 4,109 SNPs from 679 scaffolds, of which 455 scaffolds with at least three SNPs were retained. The phasing algorithm failed to generate any feasible haplotype calls for 79 and 100 scaffolds, respectively, for the two mapping populations due to a high level of homozygosity (Supplementary Note 2). The pairwise RF estimations showed very similar patterns to those for the I. trifida dataset (Fig. 4c,d and Extended Data Fig. 3a,b). For genetic linkage map construction, scaffolds were assigned to 12 linkage groups for both dataset, corresponding to the 12 PGSC version 4.03 chromosomes, without misassignment. The order of scaffolds within each linkage group was consistent with the order on chromosomes despite a few incorrect orderings for closely located scaffolds (Fig. 4f and Extended Data Fig. 3c). The shapes of the genetic linkage maps were nearly identical to those constructed from TetraploidSNPMap (Extended Data Figs. 4 and 5). The total sizes of the anchored scaffolds were ~429 and ~366 Mb, constituting ~59 and ~53% of the genome, respectively. We anchored eight scaffolds of ~2.8 Mb, which were not placed in any PGSC version 4.03 chromosomes, from the two mapping populations (Supplementary Table 2).

Application to Z. japonica RAD-seq data

Z. japonica has an allotetraploid genome with 20 chromosomes. A total of ~4.2 million PacBio reads of ~38.6 Gb with an N50 of ~12.2 kb were assembled for the Z. japonica cultivar Yaji (Methods). The resultant genome assembly consisted of 1,075 scaffolds of ~329 Mb with an N50 of ~2.6 Mb. The RAD-seq data of a mapping population comprising 60 F1 progeny derived from a cross between Carrizo and El Toro19 were used for pseudomolecule construction with PolyGembler. We identified 25,080 SNPs covering 297 scaffolds of ~303 Mb, constituting ~92% of the genome. PolyGembler identified six misassembled contigs of approximately 24 Mb, which were split before linkage analysis. The final genetic map consisted of 21 linkage groups, of which a small linkage group comprising two contigs of ~222 kb in total was excluded from the pseudomolecule construction. The resultant pseudomolecules can be grouped into ten homoeologous pairs, each with similar length, ranging from approximately 10–23 Mb. We mapped the pseudomolecules to the rice reference genome (Oryza sativa release 7)20 for validation (Fig. 5). The pseudomolecules were mapped in pairs and showed clear collinearity with the rice chromosomes, suggesting a high degree of authenticity.

Fig. 5: Collinear plots between the Z. japonica pseudomolecules constructed from PolyGembler and the O. sativa chromosomes.
figure5

The Z. japonica pseudomolecules labeled with asterisks were converted to reverse complement before performing alignment.

The dataset was also used for genetic linkage map construction in the originally study19. The genetic linkage maps reported for the parents Carrizo and El Toro consisted of 2,408 and 1,230 RAD markers, distributing across 21 and 20 linkage groups, respectively. Compared with the genetic linkage maps constructed using our approach, the genetic markers included were significantly fewer. The genetic linkage map for the parent El Toro was further used in another study for genome assembly of Z. japonica accession Nagirizaki for pseudomolecule construction21. In that study, total sequences of ~273 Mb were anchored in pseudomolecules (30 Mb fewer than the pseudomolecules constructed in the current study). We performed pairwise comparisons between the pseudomolecules from the two genome assemblies. The results showed that the two genome assemblies were highly consistent (Extended Data Fig. 6).

Comparison with well-known genetic mapping tools

We compared our genetic mapping method with three existing tools designed for outbred mapping populations, including OneMap (version 2.0-4)10 and Lep-MAP2 (ref. 12) for diploid genomes and TetraploidSNPMap14 for tetraploid genomes. The marker sets used for comparisons consisted of 1,000 SNPs sampled from a diploid and a tetraploid simulated outbred mapping population comprising 190 F1 progeny with known linkage groups and order (Methods). The missing data rates ranged from 0–0.8, while the genotyping error rates ranged from 0–0.4. For each configuration, we conducted 30 independent experiments for PolyGembler, OneMap and Lep-MAP2 and five experiments for TetraploidSNPMap, for which only a graphical user interface was provided by the author. We compared the accuracy for grouping and ordering, as well as the running time. The accuracy for grouping was measured using the normalized mutual information (NMI) between the constructed linkage groups and true groups22. NMI ranges from 0–1. Larger NMI values suggest more similar groupings, and specifically NMI = 1 indicates identical groupings. The accuracy for ordering was measured by the switch error rate (Re) of the marker order on the genetic linkage map compared with the true order on chromosomes23. We also investigated the phasing accuracy of PolyGembler using the correct phasing rate (Rc) as an evaluation, which was calculated as the proportion of the correctly phased loci with respect to the true haplotypes (Methods).

The results are reported in Table 1, with details in Supplementary Table 3. Evidently, PolyGembler was never worse and often better than the other methods in accuracy for grouping, and consistently outperformed the other methods in accuracy for ordering. Besides, PolyGembler can handle datasets with much higher missing and error rates. Here we call a linkage map with NMI ≥ 0.9 and Re ≤ 0.1 acceptable. For the diploid simulation, we gained acceptable results for datasets constituting up to 70% missing genotypes from PolyGembler and 40% missing genotypes from Lep-MAP2. With OneMap, we obtained accurate groupings for datasets with up to 50% missing data, but the orderings were consistently error prone. The performance on the genotyping errors showed a similar trend. The largest error rate that PolyGembler could handle reached 30%, while for Lep-MAP2 it dropped to 5%. For OneMap, the groupings were perfect for datasets with up to 10% genotyping errors; however, the orderings were inaccurate. PolyGembler also outperformed the other two methods for the data with mixtures of missing data and genotyping errors. PolyGembler produced acceptable results for up to 30% missing data and 10% genotyping errors, while for Lep-MAP2 acceptable results were generated for up to 20% missing data and a 5% error rate. Lep-MAP2 filtered out a significant proportion of SNPs for high missing data or genotyping error rates. Regarding the running time, PolyGembler was the fastest, followed by Lep-MAP2, and both were much faster than OneMap. Lep-MAP2 was occasionally faster than PolyGembler as it discarded many SNPs. For tetraploid datasets, PolyGembler consistently outperformed TetraploidSNPMap in both grouping and ordering. TetraploidSNPMap generated acceptable results for datasets of missing data up to 40% or genotyping errors up to 5%, or a combination of them up to 10% missing data and a 2% error rate. In contrast, PolyGembler can handle datasets of missing data up to 50% or genotyping errors up to 20%, or a combination of 30% missing data and a 10% error rate. For speed, TetraploidSNPMap had an average running time of less than 1 min, which was substantially faster (40–50×) than PolyGembler. A high level of phasing accuracy was observed for PolyGembler. The Rc was consistently >95% with diploid datasets of missing data up to 70% or genotyping errors up to 20%, or a combination of them up to 30% missing data and a 10% error rate. For tetraploid datasets, the missing and error rates that could be well handled by PolyGembler slightly decreased. To gain an Rc of the same level, the datasets needed to be with a missing rate no greater than 40% or an error rate no greater than 10%, or a combination of them no greater than 20% and 5%, respectively. Since we included the SNP loci for introducing missing data and genotyping errors in the calculation of Rc (Methods), the results indicated that PolyGembler imputed the missing data as well as corrected genotyping errors with high accuracy.

Table 1 Results for comparisons of PolyGembler with OneMap, Lep-MAP2 and TetraploidSNPMap on simulated genotype data

Discussion

We have described a genetic anchoring method that harnesses genotype data from a mapping population and reference genome assembly to construct chromosomal-scale pseudomolecules. Several methods have been proposed under this framework, such as POPSEQ24 and RPGC25. These methods have mainly focused on integration of established tools for read mapping, variant calling and genetic mapping, to build computational pipelines for organizing contigs or scaffolds. Since they largely depend on conventional genetic mapping algorithms, these methods have several limitations. First, they require high-quality genetic markers and lack robustness to genotyping errors and missing data. However, variant calling for such high-quality genetic markers from genome-wide genotyping sequence data, especially from low-coverage sequence data, is nontrivial. Second, these methods do not scale to large datasets. The conventional genetic mapping tools have been designed to handle datasets with up to several thousand genetic markers but do not scale to hundreds of thousands of genetic markers generated with the current whole-genome genotyping approaches. Lastly, these methods can rarely be used for polyploid genomes. To solve these problems, we used a divide-and-conquer strategy for genetic anchoring. We first performed haplotype phasing using a hidden Markov model (HMM) at the scaffold level. The genetic distances between each pair of scaffolds were then estimated from the haplotypes and further used to perform a scaffold-level linkage analysis. Finally, the genetic linkage maps of scaffolds were converted to whole-chromosome pseudomolecules.

Scaffold-level linkage analysis effectively reduces the computational complexity. Presumably, a well-designed heuristic algorithm is required to order the markers within each linkage group as it is a non-deterministic polynomial hard problem. However, in our method, the size of the problem is dramatically reduced, which enables us to combine the MDS algorithm with the exact TSP solver CONCORDE to order markers efficiently and accurately. The scaffold-level design also enables us to perform haplotype phasing in a highly parallel manner. In the present study, the method was used to handle datasets with up to 3,348 scaffolds. Moreover, the computational time and resources required for haplotype phasing were almost linearly related to the number of genetic markers, and the method can easily handle scaffolds with up to 10,000 genetic markers (Extended Data Fig. 7). The high scalability makes it possible to apply the method to other more extensive genotyping techniques, such as whole-genome or transcriptome re-sequencing approaches, which could have millions of genetic markers. The generalization of the proposed method to high-ploidy genomes is straightforward. In this study, we focused on diploids and tetraploids. However, this method can construct genetic linkage maps for higher-ploidy genomes. Our biggest challenge when dealing with higher-ploidy species is computational, especially in the haplotype phasing step. For hexaploids, the number of hidden states for HMM increases to 14,400. This is a large number of states but remains computationally tractable. However, for even higher levels of ploidy, computation becomes difficult.

The fundamental idea for haplotype phasing is to model the process of gamete formation of the parents with a Markov chain that starts with the first and ends with the last genetic marker along a chromosome. Similar models were used in PolyHap26 and TetraOrigin27. PolyHap assumed a given number of ancestral haplotypes shared by the target population. The recombinations were allowed between any ancestral haplotypes, leading to an intractably large state space as the number of ancestral haplotypes and ploidy increased. The state spaces for TetraOrigin and our method are nearly identical except that TetraOrigin also considered double reductions. Double reduction was not considered in our method mainly because the accurate identification of them requires high-quality genotyping data. TetraOrigin used SNP array data to detect double reductions27. However, with low-coverage GBS or RAD-seq data, it is difficult to obtain such high-quality genotypes without filtering out considerable markers, which is undesirable if we want to cover the whole genome. Even though double reduction could introduce some bias in RF estimation, it can be safely ignored if quadrivalent pairing is rare. For organizing the assembly scaffolds, the bias is negligible.

We have demonstrated the ability of our method in the construction of whole-chromosome pseudomolecules for genome assembly. For the real datasets, the genome coverage of the pseudomolecules ranged from 45% for the diploid I. trifida to 92% for the allotetraploid Z. japonica. The huge difference arose because of the quality of the input genome assemblies, especially the contiguity. The scaffold N50 statistics of the input genome assemblies were approximately 43 kb and 2.6 Mb, respectively, for I. trifida and Z. japonica. Even though generating high-quality plant genome assemblies remains challenging, the recent advances in the sequencing technologies28,29 and related computational tools facilitate the construction of highly contiguous genome assemblies even for very complex plant genomes30. In the context of the rapid development of genome assembly techniques, it is hoped that our method could serve as an option among the others to finish a genome assembly.

Methods

Data simulation

We simulated a reference genome with the sequence data of I. trifida16. The sequence data were first assembled to generate contigs (Supplementary Table 4). Contigs were then used to generate complex repeats considering the copy number of the contigs inferred from the sequence depth (Supplementary Table 5 and Supplementary Note 1). Finally, we randomly concatenated these repetitive sequences and generated a reference genome comprising 15 chromosomes with a total size of ~482 Mb (Supplementary Table 6). We simulated the genotypes of a diploid and a tetraploid outbred mapping population both consisting of two parents and 190 F1 progeny using PedgreeSim (version 2.0)31. The genotype data for comparison studies were randomly sampled from the mapping populations (Supplementary Note 1). To simulate GBS data, we mapped the genotypes back to the reference genome to generate individual genomes and simulated the GBS protocol including sequence errors to produce GBS sequences7 (Supplementary Note 1). To simulate the reference genome assemblies for diploid genomes, we simulated Illumina reads with ART (version 2.5.8)32, followed by DISCOVAR de novo (version 52488)33 assembly, whereas for tetraploid genomes, we simulated nanopore reads with DeepSimulator (version 1.5)34 and used Flye (version 2.5)35 for assembly (Supplementary Note 1).

Construction of Z. japonica genome assembly

The publicly available whole-genome PacBio sequence data for the Z. japonica cultivar Yaji were downloaded from the National Center for Biotechnology Information (NCBI) repository under the accession number SRP110561. The sequence data consist of ~4.2 million reads, with a total size of ~38.6 Gb and an N50 read length of ~12.2 kb. Flye (version 2.5)35 was used to assemble the sequence data with default parameters.

Variant calling for mapping populations

The variant calling for GBS and RAD-seq data was performed with TASSEL-GBS (version 5.2.43)36 and Stacks (version 2.41)37, respectively, using default parameters. Then, we extracted the allele depth of each SNP and performed a new round of genotyping with the R package updog (version 1.1.3)38, taking into account that the mapping population was a full-sib family resulting from one generation of a biparental cross. The R package fitPoly (version 3.0.0) was used to call genotypes from the signal intensities for Infinium 8303 SNP array data. We filtered out SNPs represented significant distortions to the expected segregation ratio. Precisely, for each SNP and for each possible combination of parental genotypes, we first removed F1 progeny with no genotypes or unexpected genotypes with respect to the parental genotypes (that is, genotyping errors). Then, if we had at least 50 F1 progeny retained, we performed an exact multinomial test using the xmulti function from the R package XNomial (version 1.0.4) to measure the segregation. We kept a SNP site if there existed a combination of parental genotypes such that the genotyping error rate was no greater than 5% and the P value for the exact multinomial test was no smaller than 0.001. We chose a small P value threshold to include more SNPs for linkage analysis. The PGSC provided a coordinate file describing the positions of Infinium 8303 potato array SNPs on the PGSC version 4.03 pseudomolecules. This was used to anchor the SNPs to PGSC doubled-monoploid version 3 scaffolds17.

Scaffold haplotype phasing

Here we describe our model for haplotype phasing. By solving this model, we can reconstruct the underlying inheritance pattern of the outbred mapping population, to infer the parental haplotypes and which of these haplotypes have been inherited by each F1 progeny.

Notation

We write a collection in square brackets to indicate an ordered list, while the use of wavy brackets indicates an unordered list. We write π = [π1,...,πu], a permutation of the sequential number 1 to u. Then, for an unordered list d′ = {d1,...,du}, we write π(d′) = [\(d_{\pi_{1}}\),...,\(d_{\pi_{u}}\)] for a permutation of d′ and Π(d′) for the collection of all such permutations. Thus, for example, if d′ = {1,2}, there are two permutations; namely, Π(d′) = {[1,2],[2,1]}. Here we use a letter with a prime symbol to represent an unordered list. We write d = [d1,...,du] for a row vector and dT = [d1,...,du]T for a column vector that is the transpose of d. Clearly, D = [d1,...,du]T is then a u × v matrix, where d1,...,du are 1 × v row vectors. Specifically, we write d×v for a 1 × v vector with the repeated element d, and Dv for the vth column vector of the matrix D.

Consider an outbred mapping population derived from a cross between two individuals P and Q with ploidy level h (so h = 2 for diploid genomes), and h is an even number greater than zero. Denote by F an F1 progeny derived from P and Q and we have n F1 progeny. For a chromosome with m markers, we assume the availability of allele depth data or genotype data collected on each individual at each marker. We only consider biallelic markers. Denote by xj and yj the total and reference allele count, respectively, for an individual at the jth marker, j = 1,...,m. The individual could be either a parent or F1 progeny. Here we drop the subscripts of individuals for convenience as we only need one individual to demonstrate the algorithm.

We use a h × m matrix to represent the h copies of the chromosome for an individual. Two parents are indexed (that is, P = [1×m,...,h×m]T and Q = [(h + 1)×m,...,(2h)×m]T). For the F1 progeny, write F = [f1,...,fh]T, where fi are 1 × m vectors, representing the ith copy of the chromosome, i = 1,...,h. The chromosome fi is a recombinant of parental chromosomes inherited from a gamete. Assume without loss of generality that f1,...,fh/2 and fh/2+1,...,fh were passed to F by parents P and Q, respectively. For example, fi could be [1×4 2×6], which indicates that the chromosome is a recombinant of parental haplotypes 1 and 2 inherited from a gamete of P, and the recombination occurred between the fourth and fifth marker. Now, consider all chromosomes together at the jth marker. Obviously, we have Pj = [1,...,h] and Qj = [h + 1,...,2h], irrespective of j. There are Perm(h,h/2)2 possible values for Fj (that is, the product of types of gametes the two parents can produce), where Perm calculates the partial permutations. Without considering the order of chromosomes within gametes, there are Comb(h,h/2)2 possible values for \({{\mathbf{F}}^{\prime}_j}\), where Comb calculates the binomial coefficients. Denote by Sj a collection of all such possible combinations of parental haplotypes in the mapping population at marker j. It is obvious that all Sj are identical irrespective of j, so we drop the j subscript for convenience. For example, with a diploid genome (that is, h = 2), we have S = {{1,2},{3,4},{1,3},{1,4},{2,3},{2,4}}, where the first two are for parents and the last four are for F1 progeny. Denote by θij the probability of parental chromosome i being associated with the reference allele at the jth marker, and therefore 1 − θij the probability of the alternative allele, where θij [0,1], i = 1,...,2h and j = 1,...,m.

Modeling the F1 progeny

An F1 progeny can be decomposed to \({\mathbf{F}} = [{\tilde{\mathbf P}}\;{\tilde{\mathbf Q}}]^T\), where \({\tilde{\mathbf P}} = [{\mathbf{f}}_1,...,{\mathbf{f}}_{h/2}]^T\) and \({\tilde{\mathbf Q}} = [{\mathbf{f}}_{h/2 + 1},...,{\mathbf{f}}_h]^T\), representing the gametes inherited from P and Q, respectively. The formation of gametes for two parents are the same. Here we only consider \({\tilde{\mathbf P}}\). We further write fi = [fi1,...,fim], where fij {1,...,h}, i = 1,...,h/2 and j = 1,...,m. Now, consider \({\tilde{\mathbf P}}_j\) and \({\tilde{\mathbf P}}_{j + 1}\) (that is, the jth and (j + 1)th markers on the chromosome). We introduce a parameter set α = [α1,...,αm−1] to model the transitions between the homologous chromosomes during the meiosis process. Precisely, αj is the probability of a jump occurring between different homologous chromosomes between the jth and (j + 1)th markers, j = 1,...,m − 1. This can be written as:

$$\Pr \left( {f_{i\left( {j + 1} \right)} = l{\mathrm{|}}f_{ij} = k} \right) = \left\{ {\begin{array}{*{20}{l}} {1 - \alpha _j} \hfill & {{\mathrm{if}}\,k = l} \hfill \\ {\frac{{\alpha _j}}{{h - 1}}} \hfill & {{\rm{otherwise}}} \hfill \end{array}} \right.$$
(1)

The parameter αj is positively related to the physical distance between the jth and (j + 1)th markers. When the two markers are tightly linked, αj should be extremely small to prevent haplotype states from jumping frequently. Actually, αj could be used to approximate the RF between the two markers. The transition probability between \({\tilde{\mathbf P}}_j\) and \({\tilde{\mathbf P}}_{j + 1}\) is then written as:

$${\mathrm{Pr}}\left( {{\tilde{\mathbf P}}_{j + 1} = \left[ {l_1, \ldots ,l_{\frac{h}{2}}} \right]{\mathrm{|}}{\tilde{\mathbf P}}_j = \left[ {k_1, \ldots ,k_{\frac{h}{2}}} \right]} \right) = \mathop {\prod }\limits_{\begin{array}{*{20}{c}} {i = 1, \ldots ,\frac{h}{2}} \end{array}} {\mathrm{Pr}}\left( {f_{i\left( {j + 1} \right)} = l_i{\mathrm{|}}f_{ij} = k_i} \right)$$
(2)

For unphased \({\tilde{\mathbf P}}^{\prime}_j\) and \({\tilde{\mathbf P}}_{j + 1}^\prime\), the transition probability is written as:

$$\begin{array}{rcl}{\mathrm{Pr}}\left( {{\tilde{\mathbf P}}_{j + 1}^\prime = \left\{ {l_1, \ldots ,l_{\frac{h}{2}}} \right\}{\mathrm{|}}{\tilde{\mathbf P}}_j^\prime = \left\{ {k_1, \ldots ,k_{\frac{h}{2}}} \right\}} \right) = \\ \mathop {\sum }\limits_{\begin{array}{*{20}{c}} {{\tilde{\mathbf P}}_j \in {\Pi}\left( {{\tilde{\mathbf P}}_j^\prime } \right)} \end{array}} \mathop {\prod }\limits_{\begin{array}{*{20}{c}} {i = 1, \ldots ,\frac{h}{2}} \end{array}} {\mathrm{Pr}}\left( {f_{i\left( {j + 1} \right)} = l_i{\mathrm{|}}f_{ij} = k_{\pi _i}} \right)\end{array}$$
(3)

Since the inheritance of chromosomes from two parents is independent, the transition probability between Fj and Fj+1 in the unphased mode is then written as:

$${\mathrm{Pr}}\left( {{\mathbf{F}}_{j + 1}^\prime {\mathrm{|}}{\mathbf{F}}_j^\prime } \right) = {\mathrm{Pr}}\left( {{\tilde{\mathbf P}}_{j + 1}^\prime {\mathrm{|}}{\tilde{\mathbf P}}_j^\prime } \right){\mathrm{Pr}}\left( {{\tilde{\mathbf Q}}_{j + 1}^\prime {\mathrm{|}}{\tilde{\mathbf Q}}_j^\prime } \right)$$
(4)

Likelihood of allele depth data

For an individual, the probability of observing the reference allele at a marker with parental chromosome subset sS is calculated as:

$$\bar \theta _j = \frac{1}{h}\mathop {\sum }\limits_{\begin{array}{*{20}{c}} {i \in s} \end{array}} \theta _{ij}$$
(5)

We drop the subscript for s, and \(\bar \theta _j\) is calculated for each s. The count of the reference allele at the jth marker, yj, follows a binomial model of probability \(\bar \theta _j\) with xj trials; that is:

$$y_j \sim{\rm{Binomial}}\left( {x_j,\bar \theta _j} \right)$$
(6)

Likelihood of genotype data

With genotype data, xj and yj represent the total allele count and reference allele dosage, respectively. Apparently, we have xj = h, irrespective of j, and yj = 0,...,h. For a chromosome set sS, yj follows a Poisson binomial distribution (that is, the sum of h independent Bernoulli trials with success probabilities associated with s). The probability of yj = g, where g = 0,...,h, is calculated as:

$${\mathrm{Pr}}\left( {y_j = g{\mathrm{|}}s} \right) = \mathop {\sum }\limits_{\begin{array}{*{20}{c}} {A \in s_g} \end{array}} \mathop {\prod }\limits_{\begin{array}{*{20}{c}} {a \in A} \end{array}} \theta _{aj}\mathop {\prod }\limits_{\begin{array}{*{20}{c}} {b \in A^{\rm{c}}} \end{array}} \theta _{bj}$$
(7)

where sg is the collection of all subsets of g elements that can be selected from s. Ac is the complement of A with respect to s (that is, Ac = s\A). For example, if s = {1,2,6,8} and g = 3, s3 = {{1,2,6},{1,2,8},{1,6,8},{2,6,8}}, and for A = {1,2,6}, Ac = {8}.

More computational details

The model described above can be summarized as a HMM. Each marker is a time step; the hidden states are the underlying parental chromosomes (S) and the observations are the allele depth or the genotype data collected on the mapping population. The transition probabilities are calculated with equation (4) and the emission probabilities are calculated with equation (6) or (7). The parameters for the model include α = [α1,...,αm−1] and {θij}, where i = 1,...,2h and j = 1,...,m. The Baum–Welch algorithm was used to estimate these parameters. We used Dirichlet priors on all parameters. Let \(\alpha _j \sim \frac{1}{2}{\rm{Beta}}\left( {\mu _\alpha \left( {1 - e^{ - 2\rho \delta _j}} \right),\mu _\alpha e^{ - 2\rho \delta _j}} \right)\), where δj is the physical distance between the jth and (j + 1)th markers along the chromosome, and ρ = 10−8 per base pair, reflecting the background recombination rate. Let θij Beta(μθ(1 − βj),μθβj), where βj is the B allele frequency estimated from the genotype data for the jth marker. We use μα = 105 and μθ = 10 for initialization of the expectation–maximization algorithm and μα = μθ = 0.1 for the maximization step.

RF estimation

The RF between two markers was estimated by the proportion of the number of recombinants to the total number of haplotypes in the F1 progeny. Assume a genome of ploidy h and a mapping population of n F1 progeny, and r recombinants between the two markers in the F1 haplotypes. The RF was then calculated as:

$$\gamma = \frac{r}{{nh}}$$
(8)

Two kinds of RFs were estimated in the proposed method: (1) to detect assembly errors, we calculated RFs between adjacent markers along the scaffold; and (2) to estimate genetic distance between two scaffolds, we calculated RFs between the outermost markers. The RF calculations within a scaffold were straightforward. We counted the number of jumps from one parental haplotype to another in F1 haplotypes. However, the calculation of RF between two scaffolds needed to consider all four possible directions as the orientations of the scaffolds on the chromosome were unknown. Besides, as we ran the haplotype phasing algorithm for each scaffold independently, we considered all possible correspondences between parental haplotypes. See Supplementary Note 3 for more computational details and an example.

Assembly error detection

To detect misassemblies, RFs between the adjacent markers along a scaffold were calculated. If the RF between two adjacent markers was larger than the predefined threshold 0.1, we considered it a misassembly at that position. Apparently, the physical distance between two markers should be taken into consideration. However, it is difficult to derive a universal relationship between the physical distances and RFs, especially when the physical distance becomes large. Therefore, we did not consider positions where the physical distance between the adjacent markers was larger than 1 Mb. The incorrect scaffolds detected by the algorithm were split at the misassembled positions to generate new scaffolds. We ran haplotype phasing analysis on the new scaffolds and recalculated the pairwise RFs. For the genome assemblies generated from the simulated data, QUAST (version 5.0.2)39 was employed to detect the genuine misassemblies by comparing the scaffolds with the reference chromosomes with default parameters. The genuine misassemblies identified by QUAST were used to validate the misassemblies detected by the proposed method.

Superscaffold construction and multipoint analysis

We introduced multipoint haplotype phasing analysis to improve the accuracy of RF estimations. The multipoint analysis ran the haplotype phasing algorithm on superscaffolds, which were constructed by nearest neighbor joining. Precisely, we chose the scaffold pair that represented the smallest RF. The scaffolds were combined to form a superscaffold if the RF was smaller than the predefined threshold 0.27 (approximately 30 cM with the Kosambi mapping function). The gap between the two scaffolds was measured by the RF. The RFs between the superscaffold and the other scaffolds were then updated. This process was repeated until there was no such closely linked scaffold pair. Multipoint analysis is crucial for short scaffolds as the number of markers that mapped to them could be small, which makes haplotype phasing solely based on them inaccurate.

Linkage group construction

We constructed a graph for linkage group construction with scaffolds as vertices and genetic linkages as edges. To simplify the graph, we excluded edges with RFs greater than the predefined threshold 0.38 (approximately 50 cM with the Kosambi mapping function). The retained edges were weighted based on the RFs. For an edge with RF γ, the weight was calculated as K(0.5) − K(γ), where K(.) is a function that converts the RFs to genetic distances with the Kosambi mapping function. For graph clustering, an information-theoretic approach, as implemented in the R package igraph40, was employed, which detects the modularity structure of a graph by minimizing the expected description length of the trajectory of a random walk41.

Scaffold ordering within linkage groups and pseudomolecule construction

The scaffold ordering within each linkage group was performed with the MDS algorithm, as implemented in the R package MDSMap (version 1.1)42. Once the order of the scaffolds was determined, we further performed a TSP to find the optimum directions of the scaffolds to minimize the sum of adjacent RFs. The start and end of a scaffold were treated as two independent points in the TSP. The TSP was carefully designed so that in the optimum solution: (1) the Hamiltonian circuit can be equivalently converted to a Hamiltonian path, as is required for scaffold ordering; (2) the order of scaffolds determined by the MDS algorithm remains unchanged; and (3) the start and end of a scaffold are adjacent (Supplementary Note 3). The exact TSP solver CONCORDE (release 03.12.19) was employed in this research to find exact solutions of TSPs to guarantee feasible scaffold orderings. The genetic linkage maps of scaffolds were converted to pseudomolecules. The adjacent scaffolds were joined with 10-kb gaps.

True linkage groups, scaffold orders and distances, and collinear blocks

We mapped scaffolds to the reference chromosomes using Minimap2 (version 2.17)43. A scaffold was considered as being mapped to a reference chromosome if at least 50% of the base pairs were collinear with that, and only that, chromosome. The true linkage groups of scaffolds were determined by the reference chromosomes they mapped to. The order of the scaffolds within a linkage group and the pairwise distances were determined by the positions on the chromosome. We used MUMmer (version 3.23)44 for mapping pseudomolecules to reference chromosomes to detect collinear blocks.

Calculation of the correct phasing rate

The correct phasing rate is calculated as the proportion of the correctly phased loci with respect to the true haplotypes45. Denote by H = {H1,...,Hh} and \({\mathbf{H}}^ \ast = \{ H_1^ \ast ,...,H_h^ \ast \}\) the haplotypes constructed by PolyGembler and the true haplotypes, respectively. Denote by \({\cal{H}}\) a one-to-one mapping between the haplotypes in H and H*. The number of correctly phased loci is then calculated as \(M({\cal{H}}) = \mathop {\sum}\nolimits_{i = 1}^h {\mathop {\sum}\nolimits_{j = 1}^m {{\Bbb{1}}({\cal{H}}(H_i)[j],H_i^ \ast [j])} }\), where m is the number of genetic markers and \(\Bbb {1}\) is the indicator function. The correct phasing rate Rc is defined as:

$$R_{\rm{c}}\left( {\mathbf{H}} \right) = \frac{{\mathop {{{\mathrm{max}}}}\limits_{\cal{H}} M\left( {\cal{H}} \right)}}{{mh}}$$
(9)

For a dataset of multiple individuals and multiple scaffolds, we calculated Rc as the total number of matched loci divided by the total number of loci. To evaluate the power of imputing missing data and the robustness in handling genotyping errors for PolyGembler, the haplotypes before introducing missing data and genotyping errors were used as true haplotypes.

Statistical testing

For SNP filtering, we performed exact multinomial tests with the xmulti function from the R package XNomial (version 1.0.4) to check whether the segregation of a SNP followed the law of segregation. The log-likelihood ratio was used as the test statistic to calculate P values. The original sample sizes for the M9 × M19, B2721, 12601ab1 × Stirling, Carrizo × El Toro and simulated mapping populations were 210, 156, 190, 60 and 190, respectively. We required a minimum sample size of 50 for testing after removing samples with no genotypes or unexpected genotypes with respect to the parental genotypes. We rejected the null hypothesis if the P value was <0.001. To compare PolyGembler with OneMap and Lep-MAP2, we performed two-tailed paired-sample t-tests to compare the accuracy for grouping and ordering and the central processing unit (CPU) time (Supplementary Table 3). The sample size was 30.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

Data for the simulation studies, including comparisons with other methods and studies of M9 × M19 I. trifida and the B2721 potato, are available from http://data.genomicsresearch.org/Projects/polyGembler. Data for the 12601ab1 × Stirling potato mapping population were provided by C. Hackett. Data for the Z. japonica mapping population Carrizo × El Toro are available from the NCBI repository under the accession number SRP055007. The whole-genome PacBio sequence data for the Z. japonica cultivar Yaji are available from the NCBI repository under the accession number SRP110561. Data related to the PGSC version 4.03 pseudomolecules are available from http://solanaceae.plantbiology.msu.edu. The I. trifida de novo genome assembly ITR_r1.0 is available from http://sweetpotato-garden.kazusa.or.jp. The I. trifida de novo genome assembly NCNSP0306 is available from http://sweetpotato.plantbiology.msu.edu. Release 7 of the O. sativa reference genome is available from http://phytozome.jgi.doe.gov. The genome assembly of the Z. japonica accession Nagirizaki is available from http://zoysia.kazusa.or.jp. Source data are provided with this paper.

Code availability

The software PolyGembler, presented in this article, and its documentation are publicly available at GitHub (https://github.com/c-zhou/polyGembler).

References

  1. 1.

    Kyriakidou, M., Tai, H. H., Anglin, N. L., Ellis, D. & Strömvik, M. V. Current strategies of polyploid plant genome sequence assembly. Front. Plant Sci. 9, 1660 (2018).

    PubMed  PubMed Central  Google Scholar 

  2. 2.

    Bancroft, I. et al. Dissecting the genome of the polyploid crop oilseed rape by transcriptome sequencing. Nat. Biotechnol. 29, 762–766 (2011).

    CAS  PubMed  Google Scholar 

  3. 3.

    Wu, S. et al. Genome sequences of two diploid wild relatives of cultivated sweetpotato reveal targets for genetic improvement. Nat. Commun. 9, 4580 (2018).

    PubMed  PubMed Central  Google Scholar 

  4. 4.

    Fierst, J. L. Using linkage maps to correct and scaffold de novo genome assemblies: methods, challenges, and computational tools. Front. Genet. 6, 220 (2015).

    PubMed  PubMed Central  Google Scholar 

  5. 5.

    Altshuler, D. et al. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407, 513–516 (2000).

    CAS  PubMed  Google Scholar 

  6. 6.

    Baird, N. A. et al. Rapid SNP discovery and genetic mapping using sequenced rad markers. PLoS ONE 3, e3376 (2008).

    PubMed  PubMed Central  Google Scholar 

  7. 7.

    Elshire, R. J. et al. A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE 6, e19379 (2011).

    CAS  PubMed  PubMed Central  Google Scholar 

  8. 8.

    Lander, E. S. & Green, P. Construction of multilocus genetic linkage maps in humans. Proc. Natl Acad. Sci. USA 84, 2363–2367 (1987).

    CAS  PubMed  Google Scholar 

  9. 9.

    Broman, K. W., Wu, H., Sen, S. & Churchill, G. A. R/qtl: QTL mapping in experimental crosses. Bioinformatics 19, 889–890 (2003).

    CAS  PubMed  Google Scholar 

  10. 10.

    Margarido, G., Souza, A. & Garcia, A. OneMap: software for genetic mapping in outcrossing species. Hereditas 144, 78–79 (2007).

    CAS  PubMed  Google Scholar 

  11. 11.

    Van Ooijen, J. Multipoint maximum likelihood mapping in a full-sib family of an outbreeding species. Genet. Res. 93, 343–349 (2011).

    CAS  Google Scholar 

  12. 12.

    Rastas, P., Calboli, F. C., Guo, B., Shikano, T. & Merila¨, J. Construction of ultradense linkage maps with Lep-MAP2: stickleback F2 recombinant crosses as an example. Genome Biol. Evol. 8, 78–93 (2016).

    CAS  Google Scholar 

  13. 13.

    Hackett, C. & Luo, Z. TetraploidMap: construction of a linkage map in autotetraploid species. J. Hered. 94, 358–359 (2003).

    CAS  PubMed  Google Scholar 

  14. 14.

    Hackett, C. A., Boskamp, B., Vogogias, A., Preedy, K. F. & Milne, I. TetraploidSNPMap: software for linkage analysis and QTL mapping in autotetraploid populations using SNP dosage data. J. Hered. 108, 438–442 (2017).

    CAS  Google Scholar 

  15. 15.

    Bourke, P. M. et al. polymapR—linkage analysis and genetic map construction from F1 populations of outcrossing polyploids. Bioinformatics 34, 3496–3502 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  16. 16.

    Hirakawa, H. et al. Survey of genome sequences in a wild sweet potato, Ipomoea trifida (H. B. K.) G. Don. DNA Res. 22, 171–179 (2015).

    CAS  PubMed  PubMed Central  Google Scholar 

  17. 17.

    Consortium, P. G. S. et al. Genome sequence and analysis of the tuber crop potato. Nature 475, 189–195 (2011).

    Google Scholar 

  18. 18.

    Hoshino, A. et al. Genome sequence and analysis of the Japanese morning glory Ipomoea nil. Nat. Commun. 7, 13295 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  19. 19.

    Wang, F. et al. Sequence-tagged high-density genetic maps of Zoysia japonica provide insights into genome evolution in Chloridoideae. Plant J. 82, 744–757 (2015).

    CAS  PubMed  Google Scholar 

  20. 20.

    Ouyang, S. et al. The TIGR Rice Genome Annotation Resource: improvements and new features. Nucleic Acids Res. 35, D883–D887 (2006).

    PubMed  PubMed Central  Google Scholar 

  21. 21.

    Tanaka, H. et al. Sequencing and comparative analyses of the genomes of zoysiagrasses. DNA Res. 23, 171–180 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  22. 22.

    Strehl, A. & Ghosh, J. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3, 583–617 (2003).

    Google Scholar 

  23. 23.

    Wu, Y., Bhat, P. R., Close, T. J. & Lonardi, S. Efficient and accurate construction of genetic linkage maps from the minimum spanning tree of a graph. PLoS Genet. 4, e1000212 (2008).

    PubMed  PubMed Central  Google Scholar 

  24. 24.

    Mascher, M. et al. Anchoring and ordering NGS contig assemblies by population sequencing (POPSEQ). Plant J. 76, 718–727 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  25. 25.

    Hahn, M. W., Zhang, S. V. & Moyle, L. C. Sequencing, assembling, and correcting draft genomes using recombinant populations. G3 (Bethesda) 4, 669–679 (2014).

    Google Scholar 

  26. 26.

    Su, S.-Y., White, J., Balding, D. J. & Coin, L. J. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics 9, 513 (2008).

    PubMed  PubMed Central  Google Scholar 

  27. 27.

    Zheng, C. et al. Probabilistic multilocus haplotype reconstruction in outcrossing tetraploids. Genetics 203, 119–131 (2016).

    CAS  PubMed  PubMed Central  Google Scholar 

  28. 28.

    Jiao, W.-B. & Schneeberger, K. The impact of third generation genomic technologies on plant genome assembly. Curr. Opin. Plant Biol. 36, 64–70 (2017).

    CAS  PubMed  Google Scholar 

  29. 29.

    Zhang, X., Zhang, S., Zhao, Q., Ming, R. & Tang, H. Assembly of allele-aware, chromosomal-scale autopolyploid genomes based on Hi-C data. Nat. Plants 5, 833–845 (2019).

    CAS  PubMed  Google Scholar 

  30. 30.

    Kyriakidou, M., Anglin, N. L., Ellis, D., Tai, H. H. & Strömvik, M. V. Genome assembly of six polyploid potato genomes. Sci. Data 7, 88 (2020).

    CAS  PubMed  PubMed Central  Google Scholar 

  31. 31.

    Voorrips, R. E. & Maliepaard, C. A. The simulation of meiosis in diploid and tetraploid organisms using various genetic models. BMC Bioinformatics 13, 248 (2012).

    PubMed  PubMed Central  Google Scholar 

  32. 32.

    Huang, W., Li, L., Myers, J. R. & Marth, G. T. Art: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).

    PubMed  Google Scholar 

  33. 33.

    Love, R. R., Weisenfeld, N. I., Jaffe, D. B., Besansky, N. J. & Neafsey, D. E. Evaluation of DISCOVAR de novo using a mosquito sample for cost-effective short-read genome assembly. BMC Genomics 17, 187 (2016).

    PubMed  PubMed Central  Google Scholar 

  34. 34.

    Li, Y. et al. DeepSimulator: a deep simulator for nanopore sequencing. Bioinformatics 34, 2899–2908 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  35. 35.

    Kolmogorov, M., Yuan, J., Lin, Y. & Pevzner, P. A.Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 37, 540–546 (2019).

    CAS  PubMed  Google Scholar 

  36. 36.

    Glaubitz, J. C. et al. TASSEL-GBS: a high capacity genotyping by sequencing analysis pipeline. PLoS ONE 9, e90346 (2014).

    PubMed  PubMed Central  Google Scholar 

  37. 37.

    Rochette, N. C., Rivera-Colón, A. G. & Catchen, J. M. Stacks 2: analytical methods for paired-end sequencing improve RADseq-based population genomics. Mol. Ecol. 28, 4737–4754 (2019).

    CAS  PubMed  Google Scholar 

  38. 38.

    Gerard, D., Ferrão, L. F. V., Garcia, A. A. F. & Stephens, M. Genotyping polyploids from messy sequencing data. Genetics 210, 789–807 (2018).

    PubMed  PubMed Central  Google Scholar 

  39. 39.

    Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).

    CAS  PubMed  PubMed Central  Google Scholar 

  40. 40.

    Csardi, G. & Nepusz, T. The igraph software package for complex network research. Int. J. Complex Syst. 1695, 1–9 (2006).

    Google Scholar 

  41. 41.

    Rosvall, M. & Bergstrom, C.Maps of information flow reveal community structure in complex networks. Proc. Natl Acad. Sci. USA 105, 1118–1123 (2008).

    CAS  PubMed  Google Scholar 

  42. 42.

    Preedy, K. & Hackett, C. A rapid marker ordering approach for high-density genetic linkage maps in experimental autotetraploid populations using multidimensional scaling. Theor. Appl. Genet. 129, 2117–2132 (2016).

    CAS  PubMed  Google Scholar 

  43. 43.

    Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).

    CAS  PubMed  PubMed Central  Google Scholar 

  44. 44.

    Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).

    PubMed  PubMed Central  Google Scholar 

  45. 45.

    Xie, M., Wu, Q., Wang, J. & Jiang, T. H-PoP and H-PoPG: heuristic partitioning algorithms for single individual haplotyping of polyploids. Bioinformatics 32, 3735–3744 (2016).

    CAS  PubMed  Google Scholar 

Download references

Acknowledgements

We thank F. Diaz for developing the M9 × M19 I. trifida mapping population and M. David for extracting and quantifying DNA from the M9 × M19 cross. The 12601ab1 × Stirling Infinium 8303 potato array data were provided by C. A. Hackett. This research was supported by grants from the Bill & Melinda Gates Foundation (OPP1052983) and Australian Research Council (DP170102626 awarded to L.J.M.C.). The work at the International Potato Center (CIP) was carried out as part of the Consultative Group for International Agricultural Research (CGIAR) Research Program on Roots, Tubers and Bananas, which is supported by CGIAR Fund Donors (http://www.cgiar.org/about-us/our-funders/). This research was also supported by use of the NeCTAR Research Cloud, by QCIF and by the University of Queensland’s Research Computing Centre. The NeCTAR Research Cloud is a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy.

Author information

Affiliations

Authors

Contributions

C.Z. and L.J.M.C. designed the study and wrote the software. A.K., D.C.G. and W.G. developed and provided the I. trifida mapping population materials. B.O., D.C.G., S.W. and W.G. generated data for the M9 × M19 I. trifida mapping population. C.Z. performed the analysis. C.Z. and L.J.M.C. wrote the manuscript. L.J.M.C., G.C.Y., A.K., M.D.C., A.W.G., Z.-B.Z. and Z.F. supervised the project. All authors contributed to editing the final manuscript.

Corresponding author

Correspondence to Lachlan J. M. Coin.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Pseudomolecule construction for 20× tetraploid simulated GBS data.

A total of 42,715 SNPs located on 678 scaffolds were used for linkage analysis. These scaffolds of ~482Mb covered approximately 99.6% of the genome. a, Dot plot for the RF estimations for scaffold pairs mapped to the same reference chromosome. The x- and y-axis represents the physical distances and the estimated RFs, respectively. b, Histogram of the RF estimations for scaffold pairs mapped to different reference chromosomes. c, Collinear plots of pseudomolecules mapped to reference chromosomes. The x- and y-axis represents physical positions (Mb) on the reference chromosomes and pseudomolecules, respectively. Each line represents a collinear block between the reference chromosome and the pseudomolecule. The diagonal line in each plot indicates a high correlation between the reference chromosome and the pseudomolecule constructed from scaffolds.

Extended Data Fig. 2 Collinear plots between the Ipomoea nil reference chromosomes and pseudomolecules constructed from the Ipomoea trifida genotype data.

The x- and y-axis represents the physical positions (Mb) on the reference chromosomes and pseudomolecules, respectively. Each line represents a collinear block between the Ipomoea nil reference chromosome and the pseudomolecules.

Extended Data Fig. 3 Genetic linkage map construction from the Infinium 8303 SNP array data of the Stirling×12601ab1 mapping population.

a, Dot plot for RF estimations between scaffold pairs mapped to the same PGSC v4.03 chromosomes. The x- and y-axis represents the physical distances and the estimated RFs, respectively. b, Histogram of the RF estimations for scaffold pairs mapped to different PGSC v4.03 pseudomolecules. c, Comparison between the genetic linkage map constructed by the proposed method and the PGSC v4.03 pseudomolecules. Twelve genetic linkage groups corresponding to 12 pseudomolecules were constructed. In each plot, the x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map.

Extended Data Fig. 4 Genetic linkage map constructed from the Infinium 8303 SNP array data of the B2721 mapping population with TetraploidSNPMap.

Each dot represents a SNP. The x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map. The genetic linkage map comprises a total of 4,745 SNPs including 56 SNPs located on the unplaced PGSC v4.03 scaffolds (red) and 76 SNPs placed in incorrect PGSC v4.03 pseudomolecules (blue). Since the physical positions of the red and blue dots cannot be determined, they were set to zero in the plots.

Extended Data Fig. 5 Genetic linkage map constructed from the Infinium 8303 SNP array data of the Stirling×12601ab1 mapping population with TetraploidSNPMap.

Each dot represents a SNP. The x-axis represents the positions (Mb) on the PGSC v4.03 pseudomolecules, and the y-axis represents the positions (cM) on the genetic linkage map. The genetic linkage map comprises a total of 3,593 SNPs including 54 SNPs located on the unplaced PGSC v4.03 scaffolds (red) and 35 SNPs placed in incorrect PGSC v4.03 pseudomolecules (blue). Since the physical positions of the red and blue dots cannot be determined, they were set to zero in the plots.

Extended Data Fig. 6 Collinear plots between the pseudomolecules of Zoysia japonica accession Yaji and Nagirizaki.

The x- and y-axis represent the positions (Mb) on the pseudomolecules. Each line represents a collinear block between the pseudomolecules.

Extended Data Fig. 7 Relationship between the number of genetic markers and computational resources required for the haplotype phasing algorithm.

The x- and y-axis (in logarithm scale) represents the number of genetic markers and the consumption of resources, respectively. a, CPU time and b, Memory. Each point in the plot was averaged over 30 independent experiments (Intel® Xeon® Processor E5-2667 v3 CPU, 3.20GHz). The error bar for one standard deviation was included at each point.

Source data

Supplementary information

Supplementary Information

Supplementary Notes 1–3, Figs. 1 and 2 and Tables 1–6.

Reporting Summary

Source data

Source Data Fig. 3

Statistical source data.

Source Data Extended Data Fig. 7

Statistical source data.

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Zhou, C., Olukolu, B., Gemenet, D.C. et al. Assembly of whole-chromosome pseudomolecules for polyploid plant genomes using outbred mapping populations. Nat Genet 52, 1256–1264 (2020). https://doi.org/10.1038/s41588-020-00717-7

Download citation

Further reading

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing