Allele-defined genome of the autopolyploid sugarcane Saccharum spontaneum L.

Modern sugarcanes are polyploid interspecific hybrids, combining high sugar content from Saccharum officinarum with hardiness, disease resistance and ratooning of Saccharum spontaneum. Sequencing of a haploid S. spontaneum, AP85-441, facilitated the assembly of 32 pseudo-chromosomes comprising 8 homologous groups of 4 members each, bearing 35,525 genes with alleles defined. The reduction of basic chromosome number from 10 to 8 in S. spontaneum was caused by fissions of 2 ancestral chromosomes followed by translocations to 4 chromosomes. Surprisingly, 80% of nucleotide binding site-encoding genes associated with disease resistance are located in 4 rearranged chromosomes and 51% of those in rearranged regions. Resequencing of 64 S. spontaneum genomes identified balancing selection in rearranged regions, maintaining their diversity. Introgressed S. spontaneum chromosomes in modern sugarcanes are randomly distributed in AP85-441 genome, indicating random recombination among homologs in different S. spontaneum accessions. The allele-defined Saccharum genome offers new knowledge and resources to accelerate sugarcane improvement. Sequencing of haploid sugarcane, Saccharum spontaneum, allows assembly of a prototypical version of the sugarcane chromosome set. This new reference genome will serve as a resource to accelerate sugarcane improvement.

C ultivated sugarcanes (Saccharum spp., Poaceae) are unusual among leading crops in that they are polyploid interspecific hybrids, with singularly complex genomes. Domesticated in New Guinea ~10,000 years ago, 'reeds that produce honey without bees' were considered a luxury and an expensive spice from the sixth to fourth centuries bc. After the introduction of sugarcane to the Old World around the eighth century 1 , its spread to Caribbean, South American, Indian Ocean and Pacific island nations drove large human migrations, including slave labor 2 . Now the world's number one crop by harvested tonnage and its fifth most valuable crop (FAO, 2012), sugarcane is cultivated on ~26 million hectares of land in > 90 countries, and 1.83 billion metric tonnes are harvested annually with a gross production value approaching $57 billion, providing 80% of the world's sugar and 40% of its ethanol as the primary sugar and biofuel feedstock crop.
While the high sugar content of modern sugarcane cultivars derives from cultivated 'noble' forms of Saccharum officinarum, their hardiness, disease resistance and ratooning capacity were obtained during 'nobilization' , specifically backcrossing into S. officinarum selected traits from a sugar-poor relative, Saccharum spontaneum 3 . 'Noble' S. officinarum cultivars, typically 2n = 8x = 80 (n is the haploid chromosome number in eukaryotic organisms and x is the monoploid chromosome number in polyploid organisms), accumulate sucrose in the stem reaching up to 50% of the dry weight but are vulnerable to biotic and abiotic stresses. Dutch breeders in Java made interspecific crosses between S. officinarum and a wild relative, S. spontaneum, to obtain disease resistance and stress tolerance traits of S. spontaneum while backcrossing to S. officinarum to recover high biomass and high sugar content 4 . Consequently, modern sugarcane cultivars are interspecific hybrids with approximately 80% chromosomes from S. officinarum, 10-15% chromosomes from S. spontaneum, and 5-10% recombinant chromosomes 5 .
The lowest chromosome number recorded for natural Saccharum accession is a 2n = 5x = 40 S. spontaneum that no longer exists; however, exactly one haploid (1n = 4x = 32) S. spontaneum, AP85-441, generated from a culture of octoploid SES208 6 , provides a foundation for assembly of a prototypical version of the sugarcane chromosome set. This study illuminates the hereditary blueprint and evolutionary history of one of our most important, and most complex, crop genomes.

Results
Genome sequencing and assembly. The genome size of AP85-441 was estimated at 3.36 Gbp by flow cytometry 7 . From a BAC library of AP85-441, 35,156 BAC clones were pooled into 712 libraries (mostly of 48 BACs; Supplementary Table 1), and individual BAC pools were sequenced independently by a Hiseq 2500 with PE250 (paired-end model and 250-bp read length), yielding 267. 5 Gbp of data that were assembled using three different assemblers: ALLPATHS-LG 8 , SPAdes 9 and SOAPdenovo2 10 , yielding a 2.56-Gbp assembly with contig N50 of 7.4 kb (Supplementary Tables 2 and 3). To reduce fragmentation, 295 Gbp of data from the Pacific Biosciences (PacBio) RS II system (Supplementary Table 4) were used for self-correction and assembly by Canu 11 , incorporating assembled BAC sequences and correcting and polishing with 90× Illumina paired-end sequences, yielding 3.13 Gbp with contig N50 of 45 kb (Supplementary Table 5). The hybrid assembled contigs and BAC contigs correspond with ~99.72% accuracy (Supplementary Table 6).
High-throughput chromatin conformation capture (Hi-C) is an extension of chromosome conformation capture (3C) technology, in which cross-linked chromatin is digested with an appropriate restriction enzyme and then ligated to obtain an interacting fragment 12 . This approach, which was pioneered by Lieberman-Aiden et al. 13 and Burton et al. 14 , was used previously in grasses in the assemblies of barley 15 and wild emmer wheat 16 . To provide a scaffold for contig assembly, four Hi-C libraries were constructed from young leaves of AP85-441. Chimeric fragments representing the original cross-linked long-distance physical interactions were processed into paired-end sequencing libraries, then 1 billion 150-bp pairedend Illumina reads were produced and uniquely mapped onto the draft assembly contigs. Due to polyploidy, existing Hi-C scaffolding programs such as LACHESIS 14 and SALSA 17 link S. spontaneum allelic haplotypes together and are no longer suitable for this autopolyploid genome. We developed a Hi-C-based scaffolding algorithm (ALLHIC) that integrates four functions-pruning, partition, optimization and building-to select contigs specific for polyploid genome assembly (see Online Methods and Supplementary Figs. 1-3). The quality of Hi-C sequencing was evaluated using HiC-Pro 18 (Supplementary Table 7 and Supplementary Fig. 4).
A Hi-C-based physical map was used to assemble 32 pseudochromosomes that anchor 2.9 Gbp of the genome, including 97% of the gene content. A high-density genetic map of 998,370 SNPs was used to verify the Hi-C assembly, supporting that the two methods are consistent in both chromosomal assignment and order for 89% of contigs (Supplementary Table 8). The 32 pseudo-chromosomes comprise 8 homologous groups with 4 sets of monoploid chromosomes: A, B, C and D (Fig. 1). A total of 219 (88.3 %) complete gene models among 248 ultraconserved core eukaryotic genes (CEGs) in CEGMA 19 and 1,397 (97.01 %) among 1,440 conserved genes in BUSCO 20 were recalled in our assembly (Supplementary Tables 9  and 10). Further, 1,624 million (97.01%) of 1,674 million Illumina short reads were alignable and covered 97.3% of the assembly (Supplementary Table 11). The assembly allowed us to predict 28 potential centromeric regions along the 32 chromosomes, with length ranging from 0.25 to 11.85 Mbp (Supplementary Table 12).
Allele-specific annotation. A high level of homologous gene retention was detected from sequencing multiple haplotypes in sugarcane, despite extreme autopolyploid redundancy 21 . In autopolyploid genomes, homologous genes at the same locus on homologous chromosomes are defined as alleles 22 . Using two rounds of MAKER followed by manual annotation to separate genes and alleles, we annotated 35,525 genes with alleles defined, including 4,289 (12.7%) genes with four alleles, 9,792 (27.6%) with three, 14,797 (41.7%) with two, and 6,647 (18.7%) with one. The total number of alleles was 82,773, with an average 2.3 alleles per gene. In unanchored sequences, 3,130 gene/alleles were annotated. We annotated 1,256 tandemly duplicated genes and 3,375 dispersedly duplicated paralogs (Table 1). The cytochrome P450 gene families illustrated the importance of annotating alleles in polyploid genomes, with a total of 1,465 manually annotated alleles in 387 genes ( Supplementary  Fig. 5). Without allele-specific annotation, the number of P450 genes in this genome would be 1,465, not 387.
Among the predicted gene models, 90.0% could be found in the sorghum genome 23 and 80% in collinear positions. Comparison with rice, sorghum, maize and Arabidopsis ( Supplementary Fig. 6) showed that among 21,661 gene families, 1,278 (6%) were unique to S. spontaneum. Gene Ontology (GO) enrichment analysis showed that these S. spontaneum-specific genes were enriched in a list of GO categories, including response to wounding/external stimulus, serine-type endopeptidase/peptidase inhibitor activity and ribosomal subunit (both false discovery rate (FDR) and P < 0.01, Fisher's exact test; Supplementary Table 13).

Polyploidization in S. spontaneum.
We assessed whether the two rounds of whole-genome duplication (WGD) affecting sugarcane were allopolyploidization followed by autopolyploidization, as proposed 26 , or just two rounds of autopolyploidization. Although the sequenced genome is haploid, each gamete contains four sets of homologous or hom(e)ologous chromosomes, representing two WGDs (that is, from one to two to four). Comparison among hom(e)ologous haplotypes A, B, C and D revealed 7.7 million SNPs, 1.03 million short indels and 3,637 structural variations (SVs), accounting for 11.2 Mbp of sequence and indicating heterozygosity of 0.98% in the S. spontaneum AP85-441 genome ( Fig. 3 and Supplementary Table 15). To exploit the fact that paralogs are often located on all four of a set of hom(e)ologous chromosomes, we developed a framework to carry out a more sensitive study of gene pair similarities (Supplementary Note). However, no clear partition reflecting two events could be inferred, with each of three complementary approaches suggesting random association among the four members of most homologous series. It is clear, however, from comparisons of chromosomal rearrangements that there were two discrete WGDs, rather than a single event. Two fissions in ancestral homologs of sorghum chromosomes SbChr05 and SbChr08 that resulted in translocations to a set of two chromosomes each occurred before the two rounds of WGDs in Saccharum and after the divergence of Saccharum and Miscanthus (Fig. 2c). However, inversions in ancestral SbChr08S and SbChr08L in two pairs, SsChr2AB and SsChr7AB, indicate that these occurred after the first WGD but before the second (Fig. 2c, event 2). Among three regions showing collapsing of homologous sequences (upper region of SsChr1C, middle region of SsChr3D and upper region of SsChr8C), SsChr3B and SsChr8A have about 2× greater depth of Illumina short reads, suggesting that they are the collapsed homologs. The SsChr1C region showed equal distribution among three homologs, indicating a deletion in SsChr1C (Supplementary  Table 16 and Supplementary Fig. 8).
Two inversions involving single chromosomes, ancestral SbChr05L (A12L) (homologous chromosome C of SsChr6) (Fig. 2c, event 3) and the bottom of SsChr5C (Fig. 2c, event 3), presumably occurred after the two rounds of WGD. Chromosome reduction in Miscanthus was caused by fusion of one set of chromosomes homologous to SbChr04 and SbChr07 27 . Inversions involving all four homologous chromosomes between SsChr4ABCD appear to have occurred before the two rounds of WGD, but it is actually an inversion that occurred in SbChr04 after Saccharum and Sorghum diverged from a common ancestor (Fig. 2c, event 5). These analyses indicate that the two rounds of WGD are autopolyploidization and that they occurred with a brief time in between.
Allelic expression dominance. The homologous genome expression levels of the four homologous genomes were similar in examined tissues (Supplementary Fig. 9), indicating no significant global homologous genome dominance in S. spontaneum. To mitigate differential gene content among the homologous genomes, we further extracted 4,289 sets of genes with four alleles in high collinearity from AP85-441 ( Supplementary Fig. 10), but overall gene expression level from each haplotype was similar for the four homologous genomes ( Supplementary Fig. 9). These results are not surprising-even recent allopolyploids such as Brassica napus 28 , Gossypium hirsutum 29 , Triticum aestivum 30 and Brassica juncea 31 displayed no homologous chromosome dominance. S. spontaneum is a recent autopolyploid, and homologous genomes are fluid and changing after each meiosis.
For breeding polyploid crops such as sugarcane, the segregation of alleles with different expression levels may contribute to the segregation of traits in a breeding population. To simplify the analysis of the allelic expression, we sorted the expression levels of four alleles for 4,289 genes in descending order from I to IV. Three allele pairs (I/II, II/III and III/IV) were compared for analyzing the differentially expressed alleles (Supplementary Figs. 11 and 12). We defined the pairs with less than twofold difference in expression within a pair as neutral and others as non-neutral. Of 4,289 genes, on average, 37.6% displayed neutral expression and 62.4% displayed   non-neutral expression, suggesting that the expression of alleles varied. We further analyzed the variations in gene number among tissues; the numbers of both neutral and non-neutral genes were similar among the examined tissues. However, the genes of these two expressional patterns varied among examined tissues. On average, 36.3% of the neutral and 56.4% of the non-neutral genes were conserved across all the examined tissues ( Supplementary Fig. 13).
Sugar transporters. Sucrose transporters (SUTs) are hypothesized to load sucrose into the phloem of leaf minor veins and also function to retrieve sucrose from the apoplasm during transport [36][37][38] . In the step prior to phloem loading, SWEETs (sugars will eventually be exported transporters) are potentially responsible for sucrose efflux into the cell wall space from phloem parenchyma cells and the bundle sheath 39,40 . SWEETs play various important roles in multiple physiological processes 41 . In sugarcane and sweet sorghum, the stems are the principal sink tissues that store very high concentrations of sugars within the parenchyma cells [42][43][44] . Tonoplast sugar transporters (TSTs) have been characterized as sucrose transporters highly associated with vacuolar sucrose accumulation from sugar beet taproot 45 , sugarcane 46 and sweet sorghum stems 47 and watermelon fruit 48 . Whereas there are 3 TST genes in the sorghum genome 47 , the family has expanded in the S. spontaneum genome, which has 4 genes consisting of 13 homologs. Hence, it is reasonable to hypothesize that TSTs are the most promising players to sequester sucrose into the vacuoles of the sugarcane stem 46,47,49 .
In the S. spontaneum genome, we identified 123 sugar transporters from 9 subfamilies, including 4 in the TST family, 4 in the vacuolar glucose transporter (VGT) family, 3 in the plastidic glucose transporter (pGlcT) family, 4 in the inositol transporters (INT) family, 31 in the polyol transporter (PLT) family, 14 in the early response to dehydration 6-like (SFP) family, 6 in the SUT family, 22 in the SWEET family, and 35 in the sugar transporters family or hexose transporter family (STP) (Supplementary Table 18). Phylogenetic analysis of those sugar transporters showed gene family expansion in the STP and PLT families compared with sorghum (22 in STP and 17 in PLT), rice (21 in STP and 11 in PLT) and Arabidopsis (14 in STP and 9 in PLT) ( Supplementary Fig. 15). Tandem duplication analysis indicated that 19 and 23 genes of STP and PLT, respectively, could be assigned to tandem duplication, compared with 11 and 9 genes in sorghum. The cause of STP and PLT family expansions in S. spontaneum is tandem duplication.
Disease resistance genes. S. spontaneum contributed disease resistance genes to modern sugarcane hybrid cultivars. We identified 361 sequences putatively encoding nucleotide-binding site (NBS) proteins, including 22 N-type, 169 NL-type, 68 CN-type and 102 CNL-type. The number of NBS-encoding genes is larger than that in sorghum 50 , owing to the species-specific tandem duplication in S. spontaneum. Surprisingly, 80% of the NBS-encoding genes located in the four rearrangement chromosomes (SsChr02, SsChr05, SsChr06 and SsChr07) and 51% of those were in the rearranged regions, including SsChr5 (Sb05S) 57. 6

S. spontaneum fraction in hybrid sugarcane cultivars.
Modern sugarcane cultivars are the product of complex and repeated hybridization between S. officinarum and S. spontaneum, resulting in complex hybrids with chromosome numbers and morphologies that differ from those of their progenitors. Previous studies estimated that the S. spontaneum genome contributed approximately 10-20% of the modern hybrid sugarcane genome. In the modern hybrid sugarcane SP80-3280, approximately 12.25% of sequences are contributed by S. spontaneum. We mapped the sequences back to the AP85-441 genome, and they were randomly and evenly distributed, not constituting a set or sets of chromosomes as expected (Fig. 4). Analysis of the integrated S. spontaneum fraction in 15 resequenced hybrid genomes also yielded random distribution throughout the genome (Supplementary Fig. 16).
Origin and genetic diversity of S. spontaneum. Most of the genetic diversity found within S. spontaneum has not been introgressed into commercial sugarcane, and in principle, this germplasm represents a rich source of desirable agronomic traits related to stress tolerance and biomass accumulation 51 . S. spontaneum has a broad natural range extending throughout Asia, the Indian subcontinent, the Mediterranean and Africa 52 , and natural populations display a wide range of phenotypic, genetic and ploidy-level diversity. In practice, however, nucleotide diversity (π) across S. spontaneum was estimated to be 0.00021 ± 0.000002 (Supplementary  Tables 21 and 22 and Supplementary Fig. 17), much lower than that in other clonally propagated crops such as potato 53  Both principal component analysis (PCA) and admixture-based analyses clustered the 64 S. spontaneum accessions into three distinct groups (Fig. 5a,c; for other K-values, see Supplementary Fig.  18a,b) that were also supported by phylogenetic relationships among the 64 accessions inferred by bootstrapping and geographic origins (Supplementary Table 23), with group 1 originating from China, the Philippines, Indonesia and Papua New Guinea and groups 2 and 3 originating from India, Pakistan and Iran. The regions of Pan-Malaysia might be the ancient hybrid zones among three groups. Ploidy varies widely within the three groups, from 6x to 16x. By mapping the ploidy levels on the bootstrapped tree (Fig. 5b), the topology shows that the accessions of different ploidy levels (from hexaploid to hexadecaploid) diverged independently from ancestors in three groups, suggesting that the fluid ploidy levels may have independently evolved from ancestral progenitors.
Regions of S. spontaneum with larger-scale chromosomal rearrangements compared with sorghum have higher genetic diversity (higher π value) than non-rearranged regions and may have undergone much stronger balancing selection (Supplementary Table 22 and Supplementary Fig. 19). Although several individual chromosomes do not show significant differences, comparisons averaging values on all chromosomes show nucleotide diversity (π) in rearranged regions (0.00025 ± 0.00003) to be much higher than in nonrearranged regions (0.00021 ± 0.00001, P = 0.000234). The Tajima's D in rearranged regions (− 0.659 ± 0.052) is much higher than in non-rearranged regions (− 0.720 ± 0.011, P = 0.005013). SNP density is also higher in rearranged regions (360.27 ± 48.41) than in nonrearranged regions (297.46 ± 12.65, P = 0.001798). In addition, the GO enrichment analyses showed that the non-rearranged regions are enriched in GOs related to basic life cycles, primarily in photosynthesis, respiration and ATP synthesis (both FDR and P < 0.05, Fisher's exact test; Supplementary Table 23), whereas the rearranged regions were enriched in many GOs related to secondary life processes, whole celluar process and intracellular anabolic and catabolic processes, transmembrane transport and ion binding (both FDR and P < 0.05, Fisher's exact test; Supplementary Table 24).
An intriguing question is whether genomic rearranged regions might have had a role in adaptation to different habitats. The rearranged regions in S. spontaneum with high levels of genomic diversity (π value) might result from a preponderance of adaptive genes related to habitat or stress adaptation, such as responses to various abiotic stresses (drought, salinity, alkaline, metal ions and so on), which are controlled by genes of whole celluar process and intracellular anabolic and catabolic processes, transmembrane transport and ion binding, as detected in these regions. Conversely, it is also meaningful that the non-rearranged regions mainly controlling basic life cycles maintain a lower level of genomic diversity. The rearranged regions have undergone stronger balancing selection after the polyploidization events. Adding further support to this notion is that 80% of the NBS-encoding genes are located on the four rearrangement chromosomes.

Discussion
The identification of 80% of disease resistance genes on rearranged chromosomes suggests that reduction of basic chromosome number might have contributed to the retention of disease-resistance genes. Following chromosome fissions and translocations in a diploid ancestor, translocated fragments may have undergone little recombination. Following WGD, additional chromosomal rearrangements in these translocated regions may have further suppressed recombination (Fig. 1). Population genomic analyses detected balancing selection in these rearranged regions, a mechanism to maintain genetic diversity. It is likely an unintended consequence that these rearranged chromosome arms are enriched with NBSencoding genes, resulting in more disease-resistance genes being retained in S. spontaneum, which leads to higher resistance to disease and abiotic stresses in S. spontaneum than in other Saccharum species and makes S. spontaneum the source of disease and stress tolerance in sugarcane breeding program.
Integration of S. spontaneum chromosome segments into modern sugarcane hybrid cultivars by three to four generations of backcrossing at random would result in about one set of monoploid S. spontaneum chromosomes. The S. spontaneum fraction of the sugarcane hybrid cultivar SP80-3280 and of 15 resequenced hybrid genomes each appear randomly distributed in the reference AP85-441 genome, indicating random recombination of homologous chromosomes in different accessions that have undergone many rounds of meiosis after their separation. This is indirect evidence that S. spontaneum is autopolyploid, and it reinforces the importance of allele-specific annotation for mining effective alleles of resistance genes in hybrid cultivars.
Defining alleles in an autopolyploid genome clarifies gene or gene family analysis, as demonstrated in P450 and other gene families. This reference genome offers substantial new knowledge and unprecedented genomic resources for sugarcane breeders and researchers to mine disease resistance and other alleles in rearranged chromosomes from historic hybrid cultivars, and to track them in breeding populations to shorten the 13-year breeding cycle.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/ s41588-018-0237-2.

Methods
Genome sequencing. Illumina short reads sequencing. DNA was extracted from leaf tissue of a single soil-grown plant using the Qiagen DNeasy Plant Mini Kit and applied to 280-bp and 500-bp paired-end library construction using the NEBNext Ultra DNA Library Prep Kit for Illumina sequencing. Sequencing was performed using the Illumina HiSeq 2500 platform.

Construction of BAC libraries and sequencing.
Nuclei were isolated from the young leaf tissues of AP85-441 following the method described by Ming et al. 57 . The highmolecular-weight DNA embedded in agarose was partially digested using HindIII. Fractions at approximately 100 kb were recovered and cloned into the pSMART BAC vector (Lucigen). A total of 38,400 BAC clones were constructed and selected for sequencing; 48 BAC clones were pooled together, and DNA libraries were prepared with the PhasePrep BAC DNA Kit (Sigma) following the manufacturer's protocols. BAC DNA libraries were sequenced using the Illumina HiSeq 2500 with a 250-bp paired-end sequencing strategy.
PacBio library construction and sequencing. More than 5 µ g of sheared and concentrated DNA was applied to size selection by the BluePippin system. Approximately 20-kb SMRTbell libraries were prepared according to the released protocol from PacBio. A total of 176 Single-Molecule, Real-Time (SMRT) cells were run on the PacBio RS II system with P6-C4 chemistry.
Hi-C library construction and sequencing. Four Hi-C libraries were created from tender leaves of AP85-441 at BioMarker Technologies Company as described previously 58 . Briefly, the leaves were fixed with formaldehyde and lysed, and then the cross-linked DNA was digested with HindIII overnight. Sticky ends were biotinylated and proximity-ligated to form chimeric junctions that were enriched for and then physically sheared to a size of 500-700 bp. Chimeric fragments representing the original cross-linked long-distance physical interactions were then processed into paired-end sequencing libraries, and 1,001 million 150-bp pairedend reads were produced on the Illumina HiSeq X Ten platform.
Genome assembly overview. The sugarcane AP85-441 contig-level assembly incorporated sequencing data from a mixture of sequencing technologies ( Supplementary Fig. 1), including BAC pools sequenced with Illumina HiSeq 2500 and whole-genome shotgun sequencing with PacBio RS II as well as Hi-C reads, followed by Illumina short reads polishing. Each BAC pool was independently assembled using ALLPATHS-LG 8 , SPAdes 9 and SOAPdenovo2 10 , and best results were retained. For PacBio assembly, Canu v1.5 11 was used, as it is capable of avoiding collapsed repetitive regions and haplotypes. Self-correction was performed with parameter corOutCoverage = 100, which allowed us to correct all of the input PacBio reads. The corrected reads, along with BAC-assembled contigs, were imported to the assembly step. Chromosomal assembly was constructed based on proximity-guided assembly using our newly developed program, ALLHIC, which is designed for polyploid genome scaffolding (see Supplementary Note for details).
Genome annotation. Repeat prediction. We first customized a de novo repeat library of the genome using RepeatModeler (see URLs), which can automatically execute two de novo repeat finding programs, including RECON (version 1.08) 59 and RepeatScout (version 1.0.5) 60 . The consensus transposable element (TE) sequences generated above were imported to RepeatMasker (version 4.05) 61 to identify and cluster repetitive elements. Unknown TEs were further classified using TEclass (version 2.1.3) 62 . To identify tandem repeats within the genome, the Tandem Repeat Finder (TRF) package (version 4.07) 63 was used with the modified parameters of '1 1 2 80 5 200 2,000 -d -h' to find high-order repeats. Telomeres and centromeres were identified based on the .dat output files above. Repeat sequences with more than ten monomers ' AAACCT' were identified as telomeres. For centromere identification, we used a similar method described in the Oropetium thomaeum genome 64 . The largest repeat arrays were identified and clustered as centromeres. To further investigate LTRs, we applied the LTR_retriever pipeline 65 , which can integrate results from public programs such as LTR_FINDER 66 and LTRharvest 67 and efficiently remove false positives from the initial predictions. The predicted LTRs were further classified into intact and non-intact LTRs, and the insertion time was estimated as T = K/2μ (K is the divergence rate, and μ is the neutral mutation rate; the default is 1.38 × 10 −8 in LTR_retriever) using the scripts implemented in the LTR_retriever package 65 .
Gene annotation. To get high-quality annotation of protein-coding genes, we carried out two rounds of MAKER running, following extensive and careful manual inspections in JBrowse 68 .
In the first round of MAKER running, ten selected RNA sequencing (RNAseq) samples were imported into Trinity de novo assembly and genome-guided assembly pipelines with default parameters 69 . RSEM was used to calculate transcript abundance 70 . Transcripts with FPKM (fragments per kilobase of exon per million fragments mapped) < 1 and iso-percentage < 3% were removed from further analysis. The filtered transcripts were imported to the PASA program for construction of comprehensive transcripts, as PASA is able to take advantage of the high sensitivity of reference-based assembly while leveraging the ability of de novo assembly to detect novel transcripts 71 . The PASA-assembled transcripts described above were used for training. The nearly 'full-length' transcripts were evaluated by comparing with the UniProt plant protein database (last accessed on 8 December 2016), and proteins that were covered at least 95% were retained as candidates. Then ab initio gene predictors, including SNAP 72 , GENEMARK 73 and AUGUSTUS 74 , were each trained with those selected proteins. After that, the MAKER pipeline was used to integrate multiple tiers of coding evidence, including ab initio gene prediction, transcript evidence and protein evidence and generate a comprehensive set of protein-coding genes.
In the second round of MAKER running, the predicted gene models with AED score equal to 0 were extracted for retraining using SNAP 72 , GENEMARK 73 and AUGUSTUS 74 . In addition, the RNA-seq reads were mapped to the AP85-441 genome using HiSAT2 75 version 2.10 and reassembled using StringTie 76 version 1.3.4, which is a reference-based RNA assembler. Meanwhile, published full-length transcripts based on IsoSeq in sugarcane were also recruited for annotation 77 . The four haplotypes (A, B, C and D) were split into four sub-genomes, each containing eight pseudo-molecules.
Gene structures were visualized in JBrowse 68 along with RNA-seq-assembled transcripts and homologs from the sorghum, maize and rice genomes. We compared the two rounds of MAKER annotation and selected the better ones if their structures were better supported by homologous proteins or RNA-seqassembled transcripts. Genes in the first round of annotation were kept if their structures did not improve significantly in the second round.
Extensive manual inspection of the annotation identified that 28,306 gene models had a significant difference in protein length or sequence similarity compared with other reported protein sequences in the NCBI NR database. The corresponding genomic DNA sequences of these genes were extracted for further careful annotation using the online version of FGENESH (see URLs) with pretrained parameters for sorghum genes. Genes with significant improvement were replaced with the FGENESH annotation. BUSCO 20 version 3 was used for evaluation of annotation completeness. Out of 1,440 conserved genes, 1,397 (97.1%) were re-annotated in the AP85-441 genome, among which 1,101 (76.5%) were complete and duplicated BUSCO genes.

Allelic variation analysis. Construction of a monoploid genome.
To compare the allelic variations among the four haplotypes, we first generated a monoploid genome. The concept of the monoploid genome is aimed at retaining consensus sequences among four haplotypes and covering as many genetic materials as possible. The longest pseudo-molecule was used as reference for each set of haplotypes, and the other three haplotypes were mapped against the reference for SNP/indel and SV calling using the nucmer 78 program. Mapping results were filtered, and only the best hits were retained. The program show-snps, implemented in the MUMmer package 78 , was used to identify SNPs and indels with parameters -Clr, which means only SNPs/indels from ambiguous mapping were reported. Consensus sequences were extracted using a homemade PERL script. Insertions larger than 50 bp were identified on Assemblytics 79 , a Web-based SV analytics tool, and further inserted into the reference genome. Finally, a monoploid genome, containing eight representative pseudomolecules and 785 Mbp of sequence, was generated for further analysis (Fig. 3).
Identification of alleles. Identification of alleles in the AP85-441 genome was based on two strategies: (1) synteny-based and (2) coordinate-based approaches. Interhaplotype syntenic blocks were identified by MCScanX 80 and organized into a four-column table containing allele A, B, C or D. In addition, genes that were not shown in that table were mapped to the monoploid genome using GMAP 81 , and those with at least 50% overlaps on coordinates were considered as potential alleles. Sequence similarities were checked among alleles on the basis of reciprocal blast, and genes without significant similarities to any other allele were removed from the table.
Analysis of allelic variations. We use a reference-based strategy to identify SNPs/ indels and SVs. Similar to the approach described in the previous section, the nucmer 78 program was used to map haplotypes A, B, C and D to the monoploid genome and SNPs were extracted from ambiguous best mapping. Short indels (1-10 bp) and large structural variations were recalled by Assemblytics 79 on the basis of the alignments above.
Identification of the sequences in hybrid sugarcane that originate from S. spontaneum. The SP80-3280 genome was first masked using the customized TE library and then split into 1-kb fragments. Each of the fragments was blast against the AP85-441 and LA-purple (unpublished) masked genomes, respectively, and the mapping score was calculated for each blast hit using the following formula: where S indicates mapping score, N indicates the number of matched bases and I indicates identity in each blast hit. Fragments were further classified as sequences from AP85-441 and sequences from LA-purple if they had a best mapping score in the corresponding category.
Sequences were classified as fragments from both if they had similar mapping scores (< 5% difference) in the LA-purple and AP85-441 genomes.
Allelic expression dominance. Tissues including leaves, stems and roots were collected from mature plants, and RNA-seq analysis in this part was based on these three samples. RNA-seq reads were trimmed by Trimmomatic and then mapped to the AP85-441 genome by HiSAT2 75 . FPKM was calculated on the basis of unique mapping reads using the StringTie package 76 . To analyze allelic expression dominance, we extracted 4,289 genes with full of four alleles from the AP85-441 annotation files. To simplify the analysis of the allelic expression, the expression levels of the genes were sorted in descending order from I to IV. Three allele pairs (I/II, II/III and III/IV) were compared to analyze the differentially expressed alleles. Allele pairs with less than twofold difference in expression were defined as a neutral pair and all others as non-neutral.
Resequencing and population analysis. Reads mapping and variants calling. The raw pair-end reads of 64 S. spontaneum accessions were trimmed to remove the adaptors and low-quality bases using Trimmomatic 82 after quality control by FastQC 83 . The reads were filtered with a sliding window of size 7, with average Phred score scale of 20 within the window. The trimmed reads were mapped to the S. spontaneum genome using Bowtie2 84 with default parameters. The mapped reads were sorted, and duplicated reads were removed using SAMtools 85 . We estimated the rate of uniquely mapped reads outputted from both BWA 86 and Bowtie2 84 . Bowtie2 generated ten times as many uniquely mapped reads as BWA did ( Supplementary Fig. 21).
The Realigner Target Creator and Indel Realigner programs from the Genome Analysis Toolkit (GATK) package 87 were used for global realignment of reads around indels from the sorted BAM files. The HaplotypeCaller was used to estimate the SNPs and indels for putative diploids using the default parameters. The HaplotypeCaller outputted 42,585,337 unfiltered variants (SNPs and indels). The distribution of calling depths (DP) of each raw variant was estimated as a criterion for variants filtering. Low depths and repetitive variants were removed from the raw VCF file if they had DP < 2 or DP > 45, minQ < 30. Variants with more than 15% missing data were removed. These filtering strategies reduced the raw unfiltered set of 42.59 million variants (SNPs and Indels) to the working set of 4.48 million (4,476,608) variants. SnpEff (v3.6c) 88 was used to assign variants effects on the basis of gene models from S. spontaneum genome annotation. The variants sites were annotated as the SNPs and Indels, as well as intergenic and genic regions (including the synonymous, nonsynonymous, intronic, upstream and downstream variants).
Genome-wide genetic diversity estimation. Population genetic statistics of SNP density, π and Tajima's D were calculated directly from the filtered VCF file in 1,000-kb window and 500-kb step for π , and non-overlapping intervals for SNPs density and Tajima's D in VCFtools 89 . The high confidence 4,476,608 variant set was used for statistical estimations.
PCA. PCA was performed using the GCTA software on the filtered 4,476,608 variants. The input Plink binary files are transformed from the filtered VCFs file using VCFtools 89 and PLINK 90 . The top three principal components were used for assigning the 64 accessions and downstream population structure analysis.
Phylogeny. Bi-allelic and polymorphic SNPs (3,969,408) were used for reconstructing the phylogenetic relationships among 64 accessions. Before tree construction, we filtered and pruned the SNPs with MAF (minor allele frequency) < 0.2, missing rate > 0.15, and LD (linkage disequilibrium) threshold = 0.2. Finally, a total of 37,617 SNPs were selected for the constructed tree using SNPhylo software. The multiple consensus sequences were aligned using MUSCLE 91 . Maximum likelihood trees were constructed using the maximum likelihood method by running DNAML programs in the PHYLIP package 92 . In addition, a bootstrapped tree was constructed by bootstrapping (bootstrap = 10,000) analysis using the PHANGORN package 93 . Figtree v.1.4 (see URLs) was used to visualize the trees.
Population genetic structure. Ancestral population stratification among 64 accessions was inferred using Admixture software. The optimal ancestral population structure was estimated from the same variants set with STRUCTURE 94 using ancestral population sizes K = 1-20 and choosing the population with the smallest cross-validation error. The parameter standard errors were estimated using bootstrapping (bootstrap = 200) when doing the admixture analyses. DISTRUCT 95 was used to plot the population stratification results for K = 1 through K = 20 ( Supplementary Fig. 18).
Differentiation of genomic diversity among four homologous haploid sets. The reads mapped to each of four homologous haploid sets (A, B, C and D) of the AP85-441 genome were retrieved for each of 64 accessions using SAMtools 85 and Bedtools 96 . The four sets of retrieved reads for each of 64 accessions were mapped separately to each of eight chromosomes in a consensus monoploid genome using Bowtie2 with default parameters. The variants were called from a cohort of 256 BAM files generated from the previous step for each of the eight chromosomes. The HaplotypeCaller of GATK was used to estimate the SNPs and indels for putative diploids using the default parameters. The HaplotypeCaller outputted 17,531,765 unfiltered variants (SNPs and indels). The distribution of calling depths (DP) of each raw variant was estimated as a criterion for variant filtering. Low depths and repetitive variants were removed from the raw VCF file if they had DP < 1 or DP > 5, minQ < 20. We allowed the variant sites with max-missing rate as 50%. These filtering strategies reduced the raw unfiltered set of variants (SNPs and indels) to the working set of 68,911 variants. These filtering strategies reduced the raw unfiltered set of variants (SNPs and indels) to the working set of 68,911 variants. The working variant set was then used for estimating the population genetics statistics π among four homologous haploid sets.
Genomic diversity of genomic rearranged regions. To test whether genomic rearranged regions (RAR) have a genetic difference from non-rearranged regions (non-RAR), we compared the population genetic statistics π, SNP density, and Tajima's D between rearranged and non-rearranged regions in each of four sets (A, B, C and D) of chromosomes 2, 5, 6 and 7. The genomic rearranged regions inferred by collinear dot plot and alleles phasing are shown Supplementary Table  21. We used the T-test and Mann-Whitney U test with the one-tailed hypothesis to compare the differences of statistics (π, SNP density and Tajima's D) between RAR and non-RAR. To find the difference in gene functions between RAR and non-RAR, we conducted GO enrichment analysis for the gene models in RAR and non-RAR. We first blastX the S. spontaneum gene models in the NCBI NR database of Oryza sativa (see URLs). Then, the functional annotation and GO enrichment analyses of gene models were conducted in Blast2Go v4.1 software 97 . We used gene models of RAR or non-RAR as tested gene sets and the whole gene models as reference. The significance of enrichment was valued using the Fisher's exact test.
Genomic diversity among different polyploidy accessions. To test the effects of polyploidization on genetic diversity, we compared the population nucleotide diversity (π) among accessions with different ploidy levels. We used a 1,000-kb sliding window and a 500-kb step to calculate the values of each statistic. In addition, we divided the 64 accessions into four groups (ploidy 6, 8, 10 and 13-16) depending on their ploidy level. The four groups were used to calculate the pairwise Weir and Cockerham's F ST between the two of them using VCFtools version 0.1.12b 46 with a 1,000-kb sliding window and a 500-kb step.

Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main text, or Methods section).

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted

Software and code
Policy information about availability of computer code

Data collection
No commercial and custom code were used for data collection.

Data analysis
We provides a newly developed Hi-C scaffolding program, namely ALLHIC (version 0.1), which is designed for scaffolding polyploid genome using Hi-C reads. The program and test datasets are avaiable in the following link: https://github.com/tangerzhang/ALLHiC For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability Provide your data availability statement here.