The reference genome of Miscanthus floridulus illuminates the evolution of Saccharinae

Miscanthus, a member of the Saccharinae subtribe that includes sorghum and sugarcane, has been widely studied as a feedstock for cellulosic biofuel production. Here, we report the sequencing and assembly of the Miscanthus floridulus genome by the integration of PacBio sequencing and Hi-C mapping, resulting in a chromosome-scale, high-quality reference genome of the genus Miscanthus. Comparisons among Saccharinae genomes suggest that Sorghum split first from the common ancestor of Saccharum and Miscanthus, which subsequently diverged from each other, with two successive whole-genome duplication events occurring independently in the Saccharum genus and one whole-genome duplication occurring in the Miscanthus genus. Fusion of two chromosomes occurred during rediploidization in M. floridulus and no significant subgenome dominance was observed. A survey of cellulose synthases (CesA) in M. floridulus revealed quite high expression of most CesA genes in growing stems, which is in agreement with the high cellulose content of this species. Resequencing and comparisons of 75 Miscanthus accessions suggest that M. lutarioriparius is genetically close to M. sacchariflorus and that M. floridulus is more distantly related to other species and is more genetically diverse. This study provides a valuable genomic resource for molecular breeding and improvement of Miscanthus and Saccharinae crops.

respectively [16][17][18][19][20] . In the CesA lineage, OsCesA4, OsCesA7 and OsCesA9 are mainly required for cellulose synthesis in the secondary cell wall of rice, whereas OsCesA1, OsCesA3 and OsCesA8 are strongly co-expressed in tissues typical of primary cell walls 21,22 , with the function of OsCesA5/OsCesA6 being partially redundant with that of OsCesA3. Because of their great importance in cell wall synthesis and bioenergy production, gene expression and phylogeny of all genes belonging to the CesA/Csl superfamily have been extensively studied since their identification [16][17][18][21][22][23][24][25] . In Miscanthus, however, only a handful of CesA genes were identified and analysed on the basis of transcriptome sequencing and quantitative polymerase chain reaction (qPCR), and the comprehensive genome-wide identification and expression profiling of CesA/Csl genes superfamily are still lacking 26,27 .
The reference genomes of sorghum and sugarcane (S. spontaneum) have been previously reported 15,28,29 , given their economic importance as major sources of food and livestock feed. However, despite it being one of the most promising bioenergy crops, Miscanthus still does not have a high-quality reference genome available. Here, we report the genome assembly of M. floridulusthe chromosome-scale reference genome for the genus Miscanthus. The assembly of the M. floridulus genome and comparative genomic analysis of this genome with those of sorghum and sugarcane, provide an opportunity to systematically understand the evolutionary history of the Saccharinae group, the diversification of the Miscanthus clade and the underlying genetic basis for the high cellulose content of Miscanthus.

Results
Genome sequencing and assembly. Due to the highly heterozygous nature of the M. floridulus genome, we first performed an initial genome survey to determine genome size and composition via Illumina sequencing data. The k-mer analysis (k = 17; Supplementary Fig. 1) revealed a heterozygosity rate of 1.89% and an estimated genome size of 2,462.35 megabases (Mb) (Supplementary Table 1), which was slightly smaller than the previous estimation of 2,596.59 Mb on the basis of cytometry data 30 . In addition, repeat sequences were estimated to constitute 73.42% of the genome.
We then combined four different technologies for the de novo assembly of the M. floridulus genome: Illumina short-read sequencing, PacBio long-read sequencing, 10x Genomics technology and Hi-C technology. After the initial assembly was achieved via FALCON with PacBio data, FALCON-Unzip was applied to the initial assembly to produce a phased diploid assembly 31 , which included primary contigs and haplotigs to address the issue of genomic heterozygosity. After phased polishing was performed, FALCON-Phase combined the primary contigs and haplotigs together with Hi-C data to extend the phasing between the unzipped haplotig blocks 32 : that is, the heterozygous regions of the genome as bubbles in the assembly graphs, resulting in the generation of two new and better phased assemblies (Supplementary Note 1). One of the phased assemblies was chosen for further polishing with PacBio reads and Illumina short reads, and yielded an assembly consisting of 4,522 contigs with an N50 length of 820,435 base pairs (bp). The phased assembly was then integrated together with the 10x Genomics data to create a hybrid assembly consisting of 2,803 scaffolds totalling 2.68 gigabases (Gb) with an N50 length of 1,644,988 bp. At the final step, a Hi-C-based physical map ( Supplementary Fig. 2) was used to assemble 19 pseudochromosomes that anchored 2.44 Gb of the genome, accounting for 91.03% of the whole-genome assembly (Supplementary Table 2). A high-density genetic map comprising 3,799 single-nucleotide polymorphisms (SNPs) was used to verify the Hi-C assembly, providing evidence supporting the consistency between the two methods in terms of both chromosomal assignment and order ( Supplementary Fig. 3). Additionally, the Hi-C assembly was verified by Bionano optical mapping data. The scaffolds assembled by the resulting Bionano optical maps covered 91.40% of the assembled genomes (Supplementary Table 3) and were highly consistent with the Hi-C assembly with only 186 conflicts identified between the Hi-C assembled genomes and Bionano optical maps (Supplementary Data 1). These conflicts were subsequently corrected manually ( Supplementary Fig. 4).
The completeness of the assembled M. floridulus genome was validated using the core eukaryotic genes mapping approach (CEGMA) 33 and benchmarking universal single-copy orthologues (BUSCO) 34 . CEGMA analysis revealed that the assembled genome covered 224 (90%) complete gene models of the 248 core eukaryotic genes (CEGs) and partially covered 14 additional CEG models (Supplementary Table 4). When BUSCO was used, among 1,440 conserved genes in our assembly, 1,378 complete single-copy plant orthologues (96%) were recalled (Supplementary Table 5). The draft assembly was further evaluated by mapping short reads to the genome assembly, which revealed a mapping rate and genome coverage of 98.4% and 99.8%, respectively (Supplementary Table 6). In addition, long terminal repeat retrotransposons assembly index (LAI), a reference-free genome metric for assessment of the assembly of repeat sequences 35 , was used to evaluate the continuity of the genome assembly. This analysis resulted in an LAI assembly score of 9.32, which is close to the quality of a reference genome according to the classification system 35 . Collectively, these results validated the high level of completeness and reliability of our M. floridulus genome assembly.
Genome annotation. In total, we identified 63.6% of the genome as transposable elements (TEs), which were categorized as long terminal repeat retrotransposons (LTR-RTs) (55.8%), non-LTR-RTs (1.2%) and DNA transposons (6.6%) (Supplementary Table 7). Among the LTR-RTs, Gypsy elements (31.3%) were much more abundant than Copia elements (14.2%). In the M. floridulus genome, similar to those of other species, TEs tend to accumulate in intergenic regions particularly recombination-suppressed pericentromeric regions. In addition, the latest LTR-RT proliferation occurred within the last 4 million years ( Supplementary Fig. 5).
A set of 76,913 genes was annotated with high confidence using a comprehensive strategy that combined de novo gene prediction, protein-based homology searches and transcriptome-based predictions, and the results showed that protein-coding genes were distributed mainly on the ends of chromosome arms (Fig. 1). Functional annotations of all predicted genes resulted in the assignment of putative functional annotations for 71,637 (93.14%) genes. Comparison among Zea mays, Oryza sativa, Sorghum bicolor, S. spontaneum and M. floridulus revealed that a core set of 13,539 gene families were shared among all five grass genomes and that 2,219 gene families were unique to M. floridulus ( Supplementary  Fig. 6). Gene ontology (GO) enrichment analysis showed that these M. floridulus-specific genes were enriched in GO categories such as postembryonic development, phosphoprotein phosphatase activity, regulation of root development, defence response to oomycetes and other basic molecular functions (Supplementary Table 8).
Comparison among the three Saccharinae species revealed that they shared 18,995 gene families ( Supplementary Fig. 7) and GO enrichment analysis notably revealed enrichment for the GO term 'xyloglucan metabolic process' (Supplementary Table 9), echoing the common view that these grasses are highly promising biofuel feedstocks 3 .

Genome synteny and the evolutionary history of Saccharinae.
Given that the annotated genome sequences of both sorghum and sugarcane are currently available 15,28,29 , we performed genomic synteny analyses among the three species of Saccharinae. We identified 2:1 and 2:4 syntenic depth ratios for alignments between M. floridulus-S. bicolor (Fig. 2a) and M. floridulus-S. spontaneum ( Fig. 2b), respectively, thereby suggesting that the ancestor of Miscanthus had experienced one round of WGD since its divergence from sorghum. In addition, we concurrently identified the two subgenomes of Miscanthus ( Supplementary Fig. 8). Furthermore, as revealed by a previously constructed genetic map of M. sinensis 36 , the ancestor of Sbchr7 was split around the centromere region into two segments. These two segments joined together at their telomere ends and formed a new, larger segment, which was then translocated into the centromeric region of the ancestral homoeologous chromosome of Sbchr4 to form MfChr8 (Fig. 2c and Supplementary Table 10). MfChr8 presented the lowest gene density in the second region connecting the split segments of ancestors of Sbchr4 and Sbchr7, suggesting that this region was most probably the newly evolved centromere of MfChr8 ( Fig. 2c and d). This fusion resulted in the reduction from two sets of ten duplicated chromosomes produced by WGD to 19 chromosomes. In addition, Sbchr4 carried an inversion (Fig. 2c) specific to the sorghum lineage that was absent in both Miscanthus and sugarcane, which is consistent with previous observations 15 .
Previous studies have proposed that Miscanthus and Saccharum share a common allopolyploid ancestor that diverged from the Sorghum lineage ~3.8-4.6 million years ago (Ma) 12 . However, a recent study demonstrated that S. spontaneum experienced two rounds of WGD autopolyploidization, which occurred during only a brief interval of time 15 . To elucidate the genomic changes that occurred preceding and following the divergence of these three lineages, we calculated the Ks (synonymous substitutions per site) values between all homoeologous gene pairs in the intergenomic or intragenomic synteny blocks and plotted the Ks distributions of all the syntelogues. This analysis revealed that the Saccharum-Sorghum divergence (Ks = 0.079) and Miscanthus-Sorghum divergence (Ks = 0.072) occurred at approximately the same time (Fig. 3a). Although the Miscanthus-Saccharum divergence (Ks = 0.058) and the divergence of the Miscanthus subgenomes (Ks = 0.051) occurred at a similar time, the divergence of Miscanthus subgenomes occurred approximately half-a-million years later. Since the two rounds of S. spontaneum autopolyploidization occurred within a relatively narrow timeframe, only one Ks peak was detected to represent the two events. Compared with the Ks peak value for syntelogues of Miscanthus-Saccharum, the peak Ks value (Ks = 0.012) for syntelogues between the S. spontaneum subgenomes was very small, indicating that the autopolyploidization event within the Saccharum lineage occurred much later than the Miscanthus-Saccharum divergence.
Given that the Miscanthus subgenomes diverged from each other at a similar time as each of them diverged from Saccharum, which type of polyploidization the Miscanthus genome has undergone remains unclear. Previous studies suggested that the subgenomes of allopolyploids tend to exhibit subgenome dominance after genome rediploidization 13,14,[37][38][39] . To determine if such dominance exists in Miscanthus and, if so, which subgenome predominated over the other, we first analysed gene losses in the synteny blocks between the two Miscanthus subgenomes through comparisons with their syntenic blocks in Sorghum. The results (Supplementary Table 11) showed that the average gene loss was ~25.8%, with no significant difference in overall gene loss between the two sets of homoeologous chromosomes (that is, two subgenomes; binomial test, P = 0.11) except for MfChr14-Sbchr8-MfChr15 (18.9% versus 58.3%). Notably, MfChr8, the fused chromosome derived from the ancestral Sbchr4 and Sbchr7 chromosomes, appeared to have lost fewer genes than did MfChr7 (20.9% versus 27.1%) and MfChr13 (19.2% versus 23.5%). These observations suggest that the Miscanthus subgenomes did not experience biased gene loss.
We then compared the gene expression levels of the syntelogues between the two subgenomes of Miscanthus (Supplementary Note 2) and found that there was no significant difference in the average expression level between genes in any two homoeologous chromosomes in any of the sampled tissues ( Supplementary Fig. 9), including the fused chromosome MfChr8 and its non-fused chromosome homoeologues MfChr7 or MfChr13. These observations, together with unbiased gene losses from the two subgenomes suggest a lack of substantial global subgenome dominance in Miscanthus.
On the basis of previous reports, our own findings and the assumption that the synonymous substitution rate for coding genes  is 6.5 × 10 −9 synonymous substitutions yr −1 for grasses 12,15,36,40 , we were able to deduce the evolutionary history of the Saccharinae clade (Fig. 3b). The sorghum lineage diverged from the common ancestor of Miscanthus and Saccharum between 5.5 and 6.1 Ma, which then diverged from each other between 4.3 and 4.6 Ma. The ancestor of Saccharum then maintained a diploid state (2n = 20) for ~3.0 Myr, during which the ancestor of S. spontaneum underwent genome rearrangement and chromosome number reduction that resulted in a diploid state of 2n = 16. Approximately 1.0 Ma, the ancestor of S. spontaneum experienced the first round of autopolyploidization. Further genome rearrangement and a second round of WGD occurred sequentially and shortly after the first WGD. All these genomic changes followed by another slight genome rearrangement led to the speciation of S. spontaneum. Alternatively, two rounds of WGD occurred in the last 4 Myr and presumably gave rise to the basic chromosomes specific to the modern S. officinarum. In contrast, the ancestor of Miscanthus was most likely to experience an autopolyploidization event shortly after its divergence from the ancestor of Saccharum. When the tetraploid Miscanthus ancestor underwent the rediploidization process, chromosome fusion and gene losses occurred, contributing to further evolution towards the present forms of the Miscanthus genome.

Cellulose synthases in Miscanthus.
Miscanthus, which has relatively high cellulose/hemicellulose content, is promising for use as a bioenergy feedstock 3,4,6,11 . To explore the molecular basis underlying this valuable trait, we identified the members of the CesA/Csl gene superfamily (except CslA/CslC; Methods) critical for cellulose/ hemicellulose synthesis 16,[18][19][20] Fig. 10). In light of these results, a phylogenetic tree of the CesA subfamily was then constructed and ten groups were identified with CesA3/CesA5 belonging to a same group and CesAX forming a new group not present in rice ( Fig. 4a and Supplementary Note 3). On the basis of the synteny analysis and genomic coordinates, the CesA/Csl genes generally followed a 1:2 ratio in each single synteny block in S. bicolor versus two corresponding synteny blocks in M. floridulus, with a few exceptions. For example, there were five CslH genes on MfChr11 and only two CslH genes on MfChr12 syntenic to three tandem CslH genes on SbChr6 (Fig. 4b), suggestive of gene duplication and gene family expansion of the CslH subfamily in M. floridulus.
We then obtained the gene expression data of CesA/Csl for different tissues of mature M. floridulus plants from the RNA sequencing (RNA-seq) data (Supplementary Note 2). In general, the members of all CesA subfamilies except CesA10 and CesA11, including CesA1/3/5/6/8 which are expressed predominantly in primary cell walls and CesA4/7/9 which are responsible mainly for cellulose synthesis in secondary cell walls, were expressed at the highest level in growing stems of all collected tissues (Fig. 4c). With respect to Csls, one or several members of the CslD, CslE, CslF and CslH subfamilies were also highly expressed in growing stems and leaves (Fig. 4c), suggesting that active hemicellulose synthesis is occurring in these tissues. According to the histogram of expression level of all expressed genes (average transcripts per million mapped reads (TPM) > 1) in the stems ( Supplementary Fig. 11), most expressed CesAs (11 of 17 genes) can be ranked among the top most expressed genes, suggestive of highly active cellulose synthesis; the same was true for several Csls in the stems (one CslD and two CslFs) and leaves (one CslH) (Fig. 4c). Taken together, these observations hint that Miscanthus had a strong ability for cellulose/hemicellulose synthesis, which is in agreement with its promising use as a bioenergy feedstock.  (Fig. 5a). The PCA results were supported by the phylogenetic tree constructed based on the SNP data (Fig. 5b), which showed ambiguity in the distinction between M. lutarioriparius and M. sacchariflorus, regardless of their differences in ploidy. Furthermore, M. × giganteus and other hybrids clustered very close to M. sinensis and M. sacchariflorus, which is consistent with the results of admixture-based analysis ( Fig. 5c and Supplementary Fig. 12; for other K-values, see Supplementary Fig. 13). Notably, several accessions from M. sinensis and one accession from M. lutarioriparius did not cluster within their labelled species, although they were observed to be morphologically similar with their respective labelled species. Taken together, these observations suggest that most hybrids were probably derived from M. sinensis and M. sacchariflorus, whereas M. lutarioriparius is genetically similar to M. sacchariflorus. In addition, the fixation index (F ST ) calculation of the different species showed that M. sacchariflorus and M. lutarioriparius were closely related, while M. floridulus was more distant from either of the two species (Supplementary Table 13). This increased phylogenetic distance from other accessions may be related to our observations of higher genetic diversity (higher π value) among the accessions of M. floridulus compared to other species ( Supplementary Fig. 14    these two species were acquired from geographically distant locations (Supplementary Table 12).

Discussion
After their divergence from the sorghum lineage, both Miscanthus and Saccharum underwent one or more rounds of independent WGD. Previous studies based on sequences from next-generation sequencing (NGS) of Miscanthus and expressed sequence tags (ESTs) sequences for S. officinarum suggested that an ancestor of these two genera also experienced a WGD before their evolutionary bifurcation. However, by identifying the paralogues in a recently released S. spontaneum reference genome and by plotting the distributions of their Ks values, we found that the Ks peak (0.012) of S. spontaneum was different from that of S. officinarum previously reported (0.04, on the basis of the So1-So2 paralogues). By using the high-quality reference genomes of S. spontaneum 15 and M. floridulus, we found it implausible that S. spontaneum and Miscanthus shared a common tetraploid ancestor. Moreover, the difference in basic chromosome number between S. officinarum and S. spontaneum suggests that a very long time was needed between the divergence of these species and their ancestral divergence from Miscanthus; otherwise, there would not be high conservation of morphological and physiological traits of S. officinarum and S. spontaneum.
Subgenome dominance has often been reported in the genomes of allopolyploids especially in allopalaeopolyploids 14,37-39 . However, some recent allopolyploids, such as Brassica napus 41 , Capsella bursa-pastoris 42 and Triticum aestivum 43 , display indistinguishable or slight subgenome dominance. In M. floridulus, only negligible subgenome dominance was detected, including insignificant differences in the ratio of gene loss between subgenomes and unbiased expression of syntelogues, both of which can make it difficult to draw a convincing inference of the nature of its polyploidy. On the basis of our present analyses, it is most likely that the ancestor Miscanthus underwent autopolyploidization. However, we can also suggest that the ancestor of Miscanthus may have experienced allopolyploidization at ~3.8-4.1 Ma or long after that time but the differentiation between the two subgenomes in M. floridulus, as observed in other allopolyploids [41][42][43] , appears to be undetectable due to the relatively short timeframe.
CesA/Csl genes play central roles in cellulose/hemicellulose synthesis, the speed and ability of which are positively correlated with their expression 19,44,45 . We measured their expression in different tissues of Miscanthus and found that, among all the expressed genes, most CesAs and a subset of Csls showed extremely high expression in growing stems, which are a major source of biomass in Miscanthus 46 . Since WGD led to the duplication of most CesA genes and since each of the duplicated CesA genes was still highly expressed, the combined expression level of CesAs from the same subfamily and for the same function would be much higher than that of single CesA, resulting in increased synthesis of cellulose, as evidenced by the   final content of cellulose in Miscanthus. Nevertheless, to reveal the relationship of the expression and number of CesA/Csl with the total cellulose/hemicellulose content, more well-designed experiments combining genetics, comparative genomics and transcriptomics are needed because of their importance in biofuel production. By resequencing the 75 accessions from the genus Miscanthus, we provided a valuable genetic resource for Miscanthus breeding and improvement. Through the PCA, phylogenetic and structure analyses, this study also established that M. sacchariflorus and M. lutarioriparius are genomically indistinguishable from each other, validating the recent taxonomic evaluation and molecular results favouring the subspecies status of M. lutarioriparius belonging to M. sacchariflorus 47 . Moreover, there were some disagreements between morphology-based classification and phylogenetic and structure analysis-based classification of several accessions from M. sinensis and M. lutarioriparius, indicating that there was some hybridization or introgression among them, similar with the introgression among different subpopulations of another important bioenergy crop switchgrass 48 . These findings highlight the broad ability of different Miscanthus accessions to hybridize and further suggest that the boundaries between species may be less clear than previously thought. Additionally, crosses between Miscanthus and any member of the Saccharinae are also feasible and the 'miscanes' hybrids of Saccharum and Miscanthus have been used for sugarcane improvement 49 . With the completion of the reference genomes of three of the most economically important members of the Saccharinae, together with new tools for molecular breeding and transgenic research, we can envision the generation of a new hybrid that carries all of the desirable traits for bioenergy production from each species, such as high cellulose content and cold resistance from Miscanthus, drought tolerance from sorghum and disease resistance from sugarcane 3,4,6 . Genome sequencing. For PacBio single-molecule real-time (SMRT) sequencing, at least 10 μg of sheared DNA was used to construct 20-kb-insert-size libraries in accordance with the PacBio protocol. The libraries were then sequenced on a PacBio Sequel instrument. For the polishing of genome assembly and whole-genome resequencing using Illumina NGS, genomic DNA was extracted from young leaf tissue of the different accessions. Afterwards, 500-bp paired-end libraries were constructed using a NEBNext Ultra DNA Library Prep Kit and subsequently sequenced with the Illumina HiSeq 2500 platform.

Methods
10x Genomics library construction and sequencing. DNA sample was prepared, indexed and barcoded using a GemCode instrument (10x Genomics). Briefly, ~0.7 ng of 50-kb-long input DNA was used in GEM PCR together with 16-bp-long barcodes introduced in droplets. The droplets were then fractured and the intermediate DNA was purified and sheared to 500 bp for library construction. Libraries were subsequently sequenced via the Illumina HiSeq X Ten platform.
Hi-C library construction and sequencing. The construction and sequencing of four Hi-C libraries was performed by the Annoroad Gene Technology Company. Briefly, young leaves of M. floridulus (accession 'MF70') were fixed with formaldehyde and then lysed, after which the cross-linked DNA was digested with MboI enzymes overnight. The enzymes were inactivated and cohesive ends were filled in by adding biotin-labelled dCTP. After proximity ligation in blunt-end ligation buffer, the cross-linking was reversed and DNA was purified for Hi-C library construction. The final library was sequenced on an Illumina HiSeq 2500 platform in 150-bp paired-end mode.
Bionano optical genome maps construction. Bionano optical maps were generated using a Saphyr Genome Imaging Instrument in conjunction with DLE1 non-nicking enzyme (CTTAAG). Extracted high-molecular-weight DNA was nicked and labelled using the SaphyrPrep Kit, and the labelled DNA was then loaded for imaging into the Saphyr system (Bionano Genomics). Imaged molecules were subsequently assembled using Bionano Solve (v.3.3) and the Bionano alignment images of the assembly were visualized with Bionano Access (v.1.5.2, https://bionanogenomics.com/support/software-downloads/). Genome survey. The short reads used for polishing the genome assembly were quality filtered, after which the genome size, repeat sequence ratio and genome heterozygous rate were calculated using the k-mer method via Jellyfish 50 . The k-mer value used for the genome survey was 17.
Genome assembly. Before assembly, PacBio reads that were longer than 7,000 bp were used as seed reads to correct read errors and generate consensus sequences; this was accomplished with the 'daligner' function in the main script of the FALCON assembler. After error correction produced highly accurate (up to 99.999%) consensus sequences, FALCON identified the overlaps between all the pairs of the pre-assembled error-corrected reads. The read overlaps were used to construct a directed string graph by following Myers' algorithm. After the initial assembly, FALCON-Unzip was used to produce primary contigs (p-contigs) and haplotigs 31 . After phased polishing, FALCON-Phase 32 then combined the primary contigs and haplotigs with the Hi-C data to extend phasing between unzipped haplotig blocks and to generate two phased assemblies (Supplementary Note 1). One of the phased assemblies was further polished with PacBio reads by the consensus-calling algorithm Quiver2 (ref. 31 ). The Illumina reads were also used to correct the contigs via Pilon3 (ref. 51 ). Heterozygosity was then removed from the error-corrected contigs using purge_haplotigs together with the Illumina clean reads. Afterwards, with the default settings, BWA software 52 was used to align the 10x Genomics data to the assembly and fill-in gaps. Scaffolding was performed by fragScaff with the barcoded sequencing reads and the three-stages option as follows: stage1 "-m 3000 -q 30"; stage2 "-C 5"; stage3 "-j 1 -u 3" (ref. 53 ). These processes ultimately yielded the final draft of the genome assembly. For Hi-C-assisted genome assembly, the Hi-C clean data were aligned to the preceding assembly using HiC-Pro 54 and Bowtie2 (ref. 55 ) software. Only read pairs with both reads in the pair-aligned contigs were considered valid gene pairs for scaffolding. By the use of linkage information and restriction enzyme sites, a string graph was used to construct the scaffold graph with LACHESIS4 (ref. 56 ).
Linkage map construction. The mapping population comprised 116 F 1 progeny generated from an interspecific cross between M. floridulus (MF70, 2n = 2x = 38) and M. sacchariflorus (MS62, 2n = 2x = 38). Genomic DNA was extracted from leaf tissue to construct restriction site-associated DNA sequencing (RAD-seq) libraries, which were then sequenced following standard protocols 57 . The generated reads were subsequently mapped to our genome assembly by BWA-MEM 52 and variants were called by freebayes 58 and then filtered by SnpSift 59 . A total of 6,472 SNPs were ultimately identified, including 5,827 parental testcross markers that were heterozygous only in M. floridulus (lm × ll) and 645 biparental markers that were heterozygous in both parents (hk × hk) according to cross-pollinator (CP) population type. A linkage map was constructed with JoinMap5 using the multipoint maximum likelihood mapping method 60 . A minimum independence logarithm of odds (LOD) score of 10 and a maximum recombination frequency of 0.4 were used to define linkage groups. In total, 3,799 SNPs clustered into 19 linkage groups covering 2,371.76 cM, with an average intermarker spacing of 0.65 cM. To validate the chromosome-scale assembly by Hi-C, the comparison of the genetic location and corresponding physical position of all SNP markers of the genetic map were assessed. In total, 92.74% of mapped markers were detected to be located at the same M. floridulus assembled chromosome. Of the markers, 5.35% were mapped equally well to homoeologous locations on the assembly (Supplementary Fig. 3).
Repeat annotation. Tandem repeats were extracted using TRF (http://tandem. bu.edu/trf/trf.html) 61 by ab initio prediction. For TE annotations, a combined strategy based on de novo structural identification and on a homology-based alignment search for whole-genome repeats was applied. A de novo repetitive elements database was built by combining the results of LTR_FINDER (http:// tlife.fudan.edu.cn/tlife/ltr_finder/) 62 and RepeatScout 63 (http://www.repeatmasker. org/), with those of RepeatModeler (http://www.repeatmasker.org/RepeatModeler), set to default parameters. All repeat sequences with lengths >100 bp and <5% 'N' gaps constituted the raw TE library. Repeat regions were extracted with a homology-based repeat prediction using the Repbase (http://www.girinst.org/ repbase) database using RepeatMasker (http://www.repeatmasker.org/) software and its in-house scripts (RepeatProteinMask) with default parameters. A custom non-redundant library that was generated by a combination of Repbase-based search and a de novo TE library, which was processed by uclust, was subjected to RepeatMasker for final repeat identification.
Specifically, to investigate LTR-RTs in detail, highly confident intact LTR-RTs were identified using the LTR_retriever 64 pipeline, which integrated the results from LTR_FINDER 62 and LTRharvest 65 and efficiently removed false positives. The insertion time was then estimated as T = K/2μ (where K is the divergence distance between the 5′ LTR and 3′ LTR of intact LTR-RTs and where μ is 1.38 × 10 −8 for grasses 66 ).

Gene model prediction and functional annotations.
A combination of homology-based, ab initio and transcriptome-based gene prediction methods was used for protein-coding gene annotation. The protein sequences of six species  71 and SNAP (https://github.com/KorfLab/SNAP) were used in our automated gene prediction pipeline. For genome annotation, RNA was extracted from five different tissues (root, leaf, stem, inflorescence and bud tissues) and mixed together equally for library construction. A de novo transcriptome assembly was then generated with Trinity (v.2.1.1) 72 , used for genome annotation and to optimize gene predictions. RNA-seq reads from different tissues were aligned to the genome using HISAT (v.2.0.4) 73 with default parameters to identify exon regions and splice positions. The alignment results were then used as input by StringTie (v.1.3.3) 73 with default parameters for genome-based transcript assembly. Finally, the non-redundant reference gene set was generated by merging genes predicted by the three methods above with EVidenceModeler (EVM; v.1.1.1) 74 using PASA terminal exon support and including masked TEs as input for gene prediction.
Gene functions were assigned according to the best matches with Swiss-Prot protein sequences identified using BLASTP (with a threshold of E value ≤1 × 10 -5 ). The motifs and domains were annotated using InterProScan (v.5.31) 75 by searching against publicly available databases, including ProDom, PRINTS, Pfam, SMRT, PANTHER and PROSITE. The GO IDs for each gene were assigned according to the corresponding InterPro entry.
Further, transfer RNAs were predicted using the program tRNAscan-SE (http:// lowelab.ucsc.edu/tRNAscan-SE/) 76 . Because ribosomal RNAs are highly conserved, we predicted rRNA sequences using BLAST with rRNAs from S. bicolor used as references. Other non-coding RNAs, including microRNAs and small nuclear RNAs, were identified by searching against the Rfam 77 database via infernal software (http://infernal.janelia.org/) 78 with the default parameters.
Genome assembly assessment. Both BUSCO 34 (http://busco.ezlab.org/) analysis and the CEGMA 33 (http://korflab.ucdavis.edu/Datasets/cegma/) pipeline were used to assess the completeness of the genome assembly and annotations. Further, the LAI, which was implemented in the LTR_retriever package, was also applied to assess the genome assembly quality 35 .
Gene family analysis. For gene family analysis, the online web server OrthoVenn2 (ref. 79 ) was used. Maize, sorghum, rice, S. spontaneum and M. floridulus protein sequences were used as inputs for online orthogroup clustering and annotations and GO enrichment.
Genome synteny. Protein sequences from M. floridulus, S. bicolor and S. spontaneum were blasted against themselves using BLASTP (E value <1 × 10 -5 ) to identify intergenome synteny blocks. The BLAST results were subjected to c-score filtering (c-score = 0.99) and the Python version JCVI (https://github.com/tanghaibao/ jcvi) of MCScan was used to generate dot plots. MCScanX 80 was used to determine synteny blocks with default parameters and at least five genes were required to define a synteny block. To identify intragenome synteny blocks, protein sequences from two homoeologous chromosomes were extracted to query against each other and after c-score filtering by JCVI, MCScanX was used to determine synteny blocks with default parameters. The Ks value was calculated via KaKs_Calculator 81 for each gene pair in the aligned blocks. The distributions of all Ks values were plotted via the R software and ggplot2 package 82 . The peak Ks values were converted to divergence time 12 according to the formula T = Ks/2λ (T, time; λ, average substitution rate) by using an average substitution rate of 6.5 × 10 −9 for grasses 40 to infer speciation or WGD events that occurred during the evolutionary history.

Subgenome analysis.
To analyse the gene loss of the two M. floridulus subgenomes, protein sequences of homoeologous chromosomes from M. floridulus and their corresponding homoeologous chromosome in S. bicolor were queried via BLAST against each other and synteny blocks were identified by MCScanX. Blocks shared by the three homoeologous chromosomes were then identified and syntelogues between any two of them were extracted for statistical analysis. For syntelogue expression analysis, RNA was extracted from leaf, growing stem, root and inflorescence tissues of mature M. floridulus plants and RNA-seq analysis was performed on these four tissues. There were three replicates for each tissue. The RNA-seq reads were quality filtered by fastp 83  Analysis of CesA/Csl genes. All genes from the three species S. bicolor, S. spontaneum and M. floridulus were first annotated using InterProScan. The CesA/ Csl genes were then retrieved from the files generated from the Interproscan-based functional annotation with Pfam accession ID 'PF03552' . Because there were only five and nine CslA/CslC genes (described by two Pfams 'PF00535' and 'PF13632') identified in S. bicolor and M. floridulus and these genes distantly diverged from other CesA/Csl genes 16 , those genes were excluded from the downstream analysis. The gene coordinates were retrieved from the GFF files of the three genomes and the expression level of CesA/Csl genes in M. floridulus were obtained from the RNA-seq results of the previous subgenome analysis. The expression of CesA/ Csl genes were retrieved and heatmap was drawn using the pheatmap package (https://CRAN.R-project.org/package=pheatmap) of R software with values of log 2 (TPM + 0.5).
Resequencing and diversity analysis. DNA from different accessions of multiple Miscanthus species and hybrids was extracted for library construction and resequencing. The paired-end reads were quality filtered by fastp 83 and subsequently mapped to the M. floridulus genome by BWA-MEM 52 . Via Picard, mapped reads were sorted and duplicated reads were marked. Uniquely mapped reads were then extracted from the sorted and duplicate-marked bam file with Sambamba 85 using the parameter "-F ′′mapping_quality >= 30 and not (unmapped or secondary_alignment) and not ([XA] != null or [SA] != null)′′". Then, SNPs and insertions/deletions (InDels) were called by freebayes 58 using parameters "-standard-filters -min-coverage 4 -use-best-n-alleles 4 --genotype-qualities" and the polyploidy level of the different accessions were specified with "-A Miscanthus_ freebayes_cnv_map.txt". The variants were then filtered by a custom script to change low quality data (GQ < 30 or DP < 4 or DP > 50) to null and further filtered by SnpSift 59 with the parameters "QUAL > 30 & NS . Afterwards, the multi-allelic genotype of polyploids were filtered and converted to bi-allelic genotype by our custom script and the remaining 6,200,617 variants were used in downstream analysis. PCA of the filtered variants was performed using the SMARTPCA program from EIGENSOFT (v.6.1.4) 86 . For phylogenetic analysis, SNPhylo 87 software was used. Before tree construction, we filtered and pruned the SNPs with minor allele frequency (MAF) < 0.1, missing rate >0.1 and linkage disequilibrium (LD) threshold = 0.2. A total of 15,857 SNPs were ultimately selected for the SNPhylo pipeline. iTOL (v.5) 88 was used to visualize the trees. For population structure analysis, the optimal ancestral population structure was estimated for the same variant set by using ADMIXTURE 89 with ancestral population sizes K = 1-7 and choosing the population size with the smallest cross-validation error. On the basis of the K value, admixture analyses were then performed to infer population structure. Fixation index (F ST ) and π value (indicating genetic diversity) were calculated by VCFtools 90 .
Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability
The genome assembly and gene annotation have been deposited in the NCBI database under BioProject number PRJNA598249 and BioSample number SAMN13702268. Source data are provided with this paper.

October 2018
Corresponding author(s): Cuixia Chen Last updated by author(s): Mar 19, 2021 Reporting Summary Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted Data Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability The genome sequencing data, 10X Genomics and Hi-C sequencing data for genome assembly, gene annotation, RNA-seq data, genome resequencing data of 75 acessions, Bionnao optical map have been deposited in the NCBI database under BioProject number PRJNA598249 and BioSample number SUB4330926 with reviewer link https://dataview.ncbi.nlm.nih.gov/object/PRJNA598249?reviewer=gbulrdlhkk9usmmllottvfmmdu.

Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf

Life sciences study design
All studies must disclose on these points even when the disclosure is negative.

Sample size
We sequenced a single Miscanthus floridulus plant, and no statistical methods were used to predetermine sample sizes. For population genetics analyses, 74 diploid and tetraploid accessions distributed across China and one triploid M. x giganteus were resequenced.
Data exclusions During the comparative genomics analysis, we chose the longest transcript to represent each gene and removed mitochondrial and chloroplast genes, since the used genome datasets include multiple transcripts and organellar genes that might complicate the comparative analysis.