Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes

Maize is an important crop with a high level of genome diversity and heterosis. The genome sequence of a typical female line, B73, was previously released. Here, we report a de novo genome assembly of a corresponding male representative line, Mo17. More than 96.4% of the 2,183 Mb assembled genome can be accounted for by 362 scaffolds in ten pseudochromosomes with 38,620 annotated protein-coding genes. Comparative analysis revealed large gene-order and gene structural variations: approximately 10% of the annotated genes were mutually nonsyntenic, and more than 20% of the predicted genes had either large-effect mutations or large structural variations, which might cause considerable protein divergence between the two inbred lines. Our study provides a high-quality reference-genome sequence of an important maize germplasm, and the intraspecific gene order and gene structural variations identified should have implications for heterosis and genome evolution. The de novo genome assembly of maize line Mo17 and comparative analysis with other sequenced maize lines show extensive gene-order variations. This study provides insights into maize evolution and has implications for improving maize hybrid lines.

M aize (Zea mays) is a classical genetic model and an important crop worldwide. The maize genome exhibits high levels of genetic diversity among different inbred lines 1-6 . Owing to the high level of intraspecific genome diversity, hybrid maize resulting from crosses between lines from different heterotic groups shows extremely high levels of hybrid vigor. Consequently, the adoption of hybrid maize has grown rapidly since its introduction nearly a century ago 7 . Currently, most modern maize varieties are hybrids. Reid yellow dent (represented by inbred B73) and Lancaster (represented by inbred Mo17) are the two best-known maize variety groups. The hybrid generated by crossing B73 with Mo17 had long been the most commonly grown hybrid in America and other countries 8 , and the derived materials are still widely used in many maize breeding programs. B73/Mo17 is therefore the most common pair of maize inbred lines in many genetic and molecular studies, such as map-based cloning 9 or exploring the molecular basis of heterosis 10 and genetic imprinting [11][12][13] . The intermated population of B73 and Mo17 is the most prominent maize genetic-mapping population 14 . The release of the draft genome assembly of inbred line B73 in 2009 (ref. 6 ) was an important milestone in the maize community 15 . The draft genome sequences of the maize PH207 inbred line assembled from short reads have also been reported 16 . The precise genomic arrangement of maize has recently been demonstrated to be resolvable through assembly of PacBio long reads 17 , which are up to 40-60 kb long, albeit with a relatively high error rate 18 . Long-read assembly has been used to generate several other plant and animal genomes [19][20][21][22] . Recently, a high-quality whole-genome assembly of B73 (RefGen_v4) has been generated through single-molecule technologies and the BioNano Irys system 23 , thus resulting in major improvements relative to the earlier assembly.
Here, we report the assembly of a high-quality Mo17 reference genome through single-molecule sequencing and BioNano opticalmapping technologies. The generation of an additional reference genome provides an unprecedented opportunity for extensive comparison of intraspecific genome diversity in maize. By aligning the B73 and Mo17 genomes, we found 9,867,466 SNPs; 1,422,446 small insertions/deletions (indels, length shorter than 100 bp); and more than 25 MB of presence/absence-variation (PAV, length longer than 500 bp) sequences between the two representative maize genomes. Notably, our comparative genomics analysis uncovered extensive intraspecific gene-order variation: approximately 10% of genes were mutually nonsyntenic between B73 and Mo17. In addition, more than 20% of the annotated genes had large-effect mutations or large structural variations in B73 compared with Mo17. These large gene-order and gene structural variations were also observed in a comparison of the PH207 genome with the B73 and Mo17 genomes.

Extensive intraspecific gene order and gene structural variations between Mo17 and other maize genomes
Genome annotation. Repetitive elements, major components of complex genomes, are widely dispersed throughout the genome and have multiple roles in driving genome evolution 28 . In total, approximately 83.83% of the Mo17 assembly sequences were annotated as repetitive elements, including retrotransposons (75.24%), DNA transposons (6.12%) and unclassified elements (1.72%) (Supplementary Table 5). The families of Gypsy and Copia retrotransposons represented approximately 48.63% and 25.57% of the Mo17 assembly sequences, respectively. The composition of the different classes of repetitive DNA in Mo17 was highly similar to that in the B73 and PH207 genomes (Supplementary Table 5).
To examine transposon activity, we identified a total of 73,459, 74,160 and 50,402 high-confidence full-length long terminal repeat (LTR) retrotransposons in the B73, Mo17 and PH207 genomes, respectively. The expansion of LTR retrotransposons in maize occurred mainly within the past 1 million years in both the B73 and Mo17 genomes ( Supplementary Fig. 2), in agreement with previous estimates based on analysis of selected regions of the maize genome 29,30 . A relatively lower percentage of young LTR retrotransposons was seen in the PH207 genome assembly, because highly similar copies were collapsed when assembled from short reads ( Supplementary Fig. 2). Compared with the older LTR retrotransposons, which were more abundant in pericentromeric regions, the younger LTR retrotransposons were enriched in euchromatic regions ( Supplementary Fig. 3), in agreement with findings from previous studies 31,32 .
To annotate the protein-coding genes in the Mo17 genome, we combined results obtained from protein-homology-based prediction, RNA-seq-based prediction and ab initio prediction (Methods), an approach similar to that used for the annotation of the B73 (refs 23,33 ) and PH207 (ref. 16 ) genomes. In total, 38,620 high-confidence protein-coding genes were predicted in the Mo17 genome, and 55% of the exons of the predicted genes were supported by RNA-seq data from five different tissues with at least 90% coverage (Supplementary  Tables 6 and 7). In total, 37,830 (97.95%) of the Mo17 predicted genes were allocated in ten pseudochromosomes (Supplementary Table 4). Protein-coding genes were primarily within chromosome arms and were inversely correlated with transposable-element density ( Fig. 1).
Global genome comparison of B73, Mo17 and PH207. When the pseudochromosomes of Mo17 were aligned to the pseudochromosomes of B73 (ref. 34 ), approximately 61.18% of the Mo17 genome sequence matched in one-to-one syntenic blocks with 61.13% of the B73 genome sequence (Supplementary Fig. 4 and Supplementary Table 8). The genome-wide proportion of the regions that nearly matched between B73 and Mo17 was slightly higher than that estimated from a previous analysis of sequenced BAC clones of four selected regions 35 . The nonsyntenic sequences between the two genomes were mostly transposable elements, and the remainder comprised dispersed genes and inbred line-specific low-copy sequences. Similarly, we found 1,071.7 Mb (50.93%) of the Mo17 genome sequence and 1,101.7 Mb (52.3%) of the B73 genome sequence matching in syntenic blocks with 1,071.6 Mb (52.01%) and 1,101.4 Mb (53.46%) of the current PH207 genome sequences, respectively (Supplementary Table 8).
By comparing the two genomes, we identified 12,936 B73specific genomic segments (12.96 Mb in total) and 12,939 Mo17specific genomic segments (12.2 Mb in total) longer than 500 bp. Most (98.7%) of these PAV sequences were shorter than 5 kb ( Supplementary Fig. 5). We found 200 and 126 PAV sequences that were longer than 5 kb in B73 and Mo17, respectively. These PAV sequences were unevenly distributed across the genome (Fig. 1), and some were located in clusters (Supplementary Table 9). The longest PAV sequence segment was a 2.9-Mb B73-specific segment containing 66 predicted genes, found from 22.5 Mb to 25.4 Mb on chromosome 6. The length of this PAV segment was slightly longer than previously reported 3 . The longest Mo17-specific segment was a 2.5-Mb segment on chromosome 6 from 64.0 Mb to 66.5 Mb. This Mo17-specific segment was close to the centromere and contained only ten predicted genes. In addition, we found two Mo17-specific segments located close together on chromosome 2, with lengths of 752.6 kb (chromosome 2: 235809501-236562100) and 635.2 kb (chromosome 2: 237529501-238164700), containing 20 and 23 predicted genes, respectively (Supplementary Table 9). With the criterion requiring at least 75% of coding sequences to overlap with PAV sequences, and validation through alignment of resequencing reads, we identified 72 B73-specific PAV genes and 50 Mo17-specific PAV genes. The number of PAV genes identified with this stringent criterion was smaller than the earlier estimation using only resequencing data or comparative genomic hybridization arrays 2,3,5 .
We also identified 22 PH207-specific PAV genes, as compared with B73, and 75 PH207-specific PAV genes, as compared with Mo17, by using the same method (Supplementary Table 10); only four genes overlapped between the two sets. We found that only ~25% of these PAV genes in the B73, Mo17 and PH207 genomes have likely orthologs in sorghum (Supplementary Table 10). To further trace the origin of these PAV genes, we aligned resequencing reads of 19 wild relatives, 23 landraces and 60 modern inbred lines from maize Hapmap2 (ref. 36 ) projects to the B73, Mo17 and PH207 genomes. Closely related homologs of more than 95% of these PAV genes were detected in at least one of the wild relatives, thus indicating that most of these PAV genes might have already existed in their direct ancestors (Supplementary Table 10). In summary, most of the PAV genes are present in the wild relatives of maize. These PAV genes might have arisen during the process of rediploidization of maize ancestors, which occurred before the completion of maize domestication but after the divergence of sorghum and maize. Thus, these PAV genes are now dispersed in landraces and modern maize lines ( Supplementary Fig. 6).
A total of 9,867,466 SNPs and 1,422,446 indels were identified in the aligned syntenic blocks between the B73 and Mo17 genomes, with an average of 7.66 SNPs and 1.11 indels per kilobase (Supplementary Table 11). The distribution of SNPs and indels was positively correlated (Pearson's correlation R = 0.76, P < 8.7 × 10 −17 ; Fig. 1). Compared with indels genome wide, indels with multiples of 3 bp in length were more abundant in coding regions ( Supplementary Fig. 7). We also identified 8,598,810 SNPs and 1,190,826 indels in the aligned syntenic sequences between the Mo17 and PH207 genomes, and 8,290,900 SNPs and 1,078,722 indels in the aligned syntenic sequences between the B73 and PH207 genomes (Supplementary Table 8).
Maize is an ancient tetraploid, and its two subgenomes have undergone extensive gene fractionation. Using fractionation bias estimation with sorghum as a reference, we found that the subgenome organizations of B73 and Mo17 were nearly identical ( Supplementary Fig. 8), thus indicating that B73 and Mo17 share the same tetraploidization and a large part of the subsequent fractionation events. Fractionation of genes was similarly biased toward subgenome 1 in both B73 (  To detect genes that might be under selection, we first calculated the neutral mutation rate (Ks) between orthologous genes for each pair between any two genomes of B73, Mo17 and PH207. We found two peaks in the Ks distribution: one corresponded to a group of genes that might derive from recent genetic exchanges (Ks < 0.0028), and another represented most of the remaining genes that may have diverged from the common ancestors of maize approximately 2.1 Ma (Ks ~0.025) (Fig. 2a). We then calculated the Ka/Ks ratio for genes with Ks between 0.0028 and 0.25 (Fig. 2b). As expected, the Ka/Ks ratios for these genes were highly skewed toward zero, because most nonsynonymous mutations were deleterious and experienced strong purifying selection. Approximately 7,000 genes in each of the three genomes were identified to be likely to be evolutionary constrained (Ka/Ks < 0.1). In contrast, relatively few genes (> 1,000) were detected to be under positive selection (Ka/Ks > 1).

Extensive gene-order and structural variations.
Comparative analysis revealed 33,681 B73 and 33,597 Mo17 genes with corresponding orthologous genes or gene fragments in their syntenic regions. However, we found 5,105 B73 genes and 4,008 Mo17 genes that were nonsyntenic, because no homologous genes or gene fragments were found within 10 Mb of their corresponding positions in their respective counterpart genome ( Table 2); these nonsyntenic genes accounted for 13.16% and 10.66% of the total analyzed genes in the B73 and Mo17 genomes, respectively. Similarly, 3,284 (9.02%) and 4,472 (12.27%) PH207 genes were defined as nonsyntenic genes when compared with B73 and Mo17 genomes, respectively (Supplementary Tables 12 and 13). Notably, 2,112 PH207 genes were nonsyntenic with both the B73 and Mo17 genomes.
Among the syntenic genes in B73 and Mo17 were 12,167 B73 and 12,674 Mo17 genes without any amino acid changes, and approximately 80% and 57% of genes showed no variation in coding sequences (CDSs) and gene bodies (CDS and intron), respectively ( Table 2). We identified 2,498 B73 and 2,458 Mo17 highly conserved genes with no genetic variation in their entire genic regions, including within 2 Kb upstream and downstream (Table 2). There were 15,955 B73 and 15,512 Mo17 genes with only missense mutations and/or nonframeshift indels; those genes, together with the genes without amino acid changes, were classified as structurally conserved genes. These structurally conserved genes accounted for approximately 84% of the syntenic genes on the basis of either the B73 or the Mo17 gene annotation (Table 2). However, approximately 12% of the syntenic genes (3,947 in B73; 4,020 in Mo17) had large-effect mutations such as start-or stop-codon mutations, splice-donor or splice-acceptor mutations, frameshift mutations or premature-stop-codon mutations ( Table 2). In addition, 1,612 B73 and 1,391 Mo17 genes had large structural variations, as compared with their corresponding syntenic genes in the Mo17 and B73 genomes, respectively ( Table 2).
We then conducted additional analysis of the nonsyntenic genes between B73 and Mo17. On the basis of the analysis of a simple best hit in the counterpart genome, we classified 1,534 B73 and 1,216 Mo17 nonsyntenic genes as structurally conserved (Table 2). 1,387 B73 and 977 Mo17 nonsyntenic genes had large-effect mutations. In addition, 2,112 B73 and 1,765 Mo17 nonsyntenic genes had large structural variations, and 87 B73 genes and 253 Mo17 genes had no homologs identified in the Mo17 or B73 genome. Clustering of all annotated B73 and Mo17 genes revealed 320 B73-specific gene families (830 total gene members) as compared with Mo17, and 170 Mo17-specific gene families (578 gene members) as compared with B73. Among the 5,105 nonsyntenic genes in B73, 4,285 genes belonged to 2,114 gene families, including 294 B73-specific gene families (465 gene members). Among the 4,008 nonsyntenic genes in the Mo17 genome, 3,225 genes belonged to 1,631 gene families, including 119 Mo17-specific gene families (208 gene members).
In summary, a total of 9,058 B73 genes and 8,153 Mo17 genes had either large-effect mutations or large structural variations, as compared with their syntenic or best-hit Mo17 and B73 counterparts (Supplementary Table 13). Similarly, 8,278 and 9,738 PH207 genes had either large-effect mutations or large structural variations, as compared with the corresponding genes in B73 and Mo17, respectively (Supplementary Table 13). More than 20% of all predicted genes showed considerable protein sequence variations between any two inbred lines among B73, Mo17 and PH207, thus suggesting a potential functional complementation among these three representative maize inbred lines.
Notably, the proportion of genes with large-effect mutations and large structural variations within nonsyntenic genes was significantly higher than that in syntenic genes (chi-square test, P < 2 × 10 −16 ). For example, the proportion of genes with large structural variations in the nonsyntenic gene group was with approximately ten times greater than that of the syntenic genes (  and Supplementary Table 12). Interestingly, some of the nonsyntenic genes in the B73 genome had very high sequence homology with genes in Mo17 that were classified as syntenic genes. In fact, only 971 B73 nonsyntenic genes that had their best hits in the Mo17 genome were also identified as nonsyntenic genes. Among them, 479 genes were identified as mutual nonsyntenic best hits between B73 and Mo17. Our results showed that 1,529 B73 nonsyntenic genes with their best-hit genes in Mo17 had syntenic homologs in B73. Similarly, 1,301 Mo17 nonsyntenic genes with their best-hit genes in B73 had syntenic homologs in Mo17, thus indicating that a large proportion of these nonsyntenic genes are members of multiple gene families.
The percentage of transcription factors in whole-genome-duplicated and segmental-duplicated genes was approximately tenfold higher than that in singleton genes, thus indicating that transcription factors tend to be retained after whole-genome duplication or segmental duplication (Supplementary Table 14), similarly to previous findings in Arabidopsis 38 . On the basis of the gene-clustering analysis, approximately 75% of nonsyntenic genes among B73, Mo17 and PH207 were members of gene families amplified via dispersed duplication (Supplementary Table 15). Further analysis showed that only approximately 11-23% of nonsyntenic genes among B73, Mo17 and PH207 had high-confidence syntenic orthologous genes in sorghum. In contrast, more than half (approximately 58%) of all annotated genes had syntenic orthologs in sorghum (Supplementary Table 16), thus suggesting that most of the nonsyntenic genes among B73, Mo17 and PH207 arose largely from gene amplification instead of differential fractionation between two subgenomes after tetraploidization.
To investigate the relationship between genomic variation and transcriptomic differences, we generated RNA-seq data for bract, root, stem, seedling and endosperm tissues from both B73 and Mo17. In total, 24,209 B73 genes and 23,947 Mo17 genes were expressed in at least one of these tissues (Supplementary Table 17), and 859 B73 genes and 770 Mo17 genes showed specific expression in at least one of the tissues, including 25 B73-specific and 16 Mo17-specific PAV genes (Supplementary Table 18). We also found significant differences in expression profiles among different categories of genes that varied in their level of conservation. 66% of structurally conserved genes were expressed, a higher value than the proportion of genes with large-effect mutations and large structural variation (average 48%). Moreover, approximately 1.4% of structurally conserved genes were specifically expressed in B73 or Mo17, a value significantly lower than the proportion of genes with large-effect mutations and large structural variations (average 4.5%) (Supplementary Table 17).

Discussion
The availability of high-quality assembled genomes for both B73 and Mo17 provides an unprecedented opportunity for extensive intraspecific genome comparison in maize. Direct comparison between the two maize genomes uncovered extensive SNP and indel variation, as well as a large extent of gene-order and structural variations. We demonstrated that more than 10% of B73 and Mo17 genes were mutually nonsyntenic, a value approximately two-to threefold higher than the proportion of nonsyntenic genes between R498 (indica) and Nipponbare (japonica) Asian rice ( Supplementary  Tables 13 and 19). Although some of the nonsyntenic genes have closely related homologs between these two maize lines, the positional changes of these genes may introduce different chromatin contexts affecting the expression of the genes themselves or their neighboring genes. The exact functions of these nonsyntenic genes are largely unknown, because they are underrepresented in classical genetics studies in maize 39 . These nonsyntenic genes may play important roles in some lineage-specific functions, as demonstrated in a study on root development 40 . Evaluating the contribution of these nonsyntenic genes to quantitative phenotypic variations of agronomic traits would be an interesting future pursuit.
In addition to the gene-order variations, there were several other types of intraspecific structural variations. Only 60% of the B73 and Mo17 genomes were able to be aligned as one-to-one blocks. Although the remaining 40% of the variable genome largely comprised repetitive elements, the B73 and Mo17 genomes each contained approximately 12 Mb of unique low-copy sequences, including 122 (72 in B73 and 50 in Mo17) high-stringency PAV genes. Furthermore, more than 20% of the annotated genes had large-effect mutations or large structural variations, which could potentially lead to protein sequence changes and potential functional Only genes and their best hits in the counterpart genome anchored in ten pseudomolecules were included for the analysis. b Genic regions include 2 kb upstream and downstream of the gene body.
divergence between the two maize lines. Even after exclusion of potential redundancy resulting from multigene families, there were 320 B73-specific (and 170 Mo17-specific) gene families. In addition, there were 859 B73 genes and 770 Mo17 genes that showed specific expression in at least one of the tissues tested. Several factors can contribute to the extensive intraspecific genome diversity observed. Transposable elements, such as helitrons, have been reported to be able to introduce gene movement or exon shuffling in the maize genome 41,42 . The extensive genome diversity found suggests that the hybrids generated between two different maize lines may have a dramatically different complement of proteins or regulatory sequences, thus supporting the complementation hypothesis explaining the exceptional heterosis observed in maize nearly a century ago 7
Assembly evaluation. We used published Mo17 BAC sequences to assess the quality of the hybrid assembly. Nine Mo17 BAC sequences (1,528 kb) downloaded from GenBank (see URLs) were mapped to the scaffolds with BLAT 26 , and the alignment results were manually checked. All nine Mo17 BACs were covered by a single scaffold, with 100% coverage and 99.97% identity of the total BAC sequences, thus suggesting that the genome assembly was of high quality.
BWA mem 47 was used to map the ~4.4 million maize GBS-tag sequences 27 to the genome sequences to evaluate the assembly. The alignments were further filtered by mapping quality (MAQ = 60), and 36.68% were kept. Scaffolds with more than ten tags aligned were evaluated, and three scaffolds with tags aligned at disparate chromosome locations were split at the appropriate gap positions. After this correction, there were 2,560 scaffolds with an N50 length of 10.2 Mb, and the total genome length was 2.18 Gb. BUSCO 25 was further used to evaluate the genome-assembly completeness. 'Embryophyta_odb9' , which contained 1,440 single-copy orthologous genes was used as a searching dataset, and both Mo17 and B73 genomes were assessed. 27 were also used to anchor the assembled scaffolds onto Mo17 chromosomes. According to the mapping results above, scaffolds with more than ten tags aligned were further used to construct the pseudomolecules. The order and orientation of scaffolds were determined according to the physical positions of GBS tags. In total, 2.1-Gb scaffold sequences accounting for 96.4% of the Mo17 assembled genome were anchored onto the ten Mo17 chromosomes, which contained 97.95% of the annotated genes. We also aligned all GBS-tag sequences to the B73 genome and compared the alignments in these two genomes. The alignments showed high consistency between two genomes except for some pericentromere regions, owing to the lack of GBS tags.

Construction of pseudomolecules. The maize pangenome GBS tags
Analysis of repetitive elements. We identified repetitive elements in the Mo17 genome through a combination of homolog-based and de novo approaches. RepeatModeler 48 was first used to build TE consensus sequences as a de novo TE library on the basis of the Mo17 genome sequence. RepeatMasker 49 was then used to discover and identify repeats in the Mo17 genome with the combined library of the de novo TEs of Mo17 and Repbase 50 . Repetitive elements in the B73 and PH207 genomes were identified through the same method.
Transposon activity analysis. Full-length LTR retrotransposons were first identified in the assembled sequences of the B73, Mo17 and PH207 genomes with LTRharvest 51 with the following parameters: '-longoutput -motif tgca -minlenltr 100 -maxlenltr 7000 -mindistltr 1000 -maxdistltr 20000 -similar 85 -motifmis 1 -mintsd 5 -xdrop 5 -overlaps best' . All predicted LTR retrotransposons were further annotated for protein domains with LTRdigest 52 with GyDB (see URLs) as a search database. Candidates with no typical protein domains (for example, GAG, INT, RT and RT) or a tandem-repeat content greater than 20% were filtered. This procedure resulted in a final set of 73,459, 74,160 and 50,402 high-confidence full-length LTR retrotransposons in the B73, Mo17 and PH207 genomes, respectively. To calculate the insertion age of each LTR retrotransposons, 5′ and 3′ LTRs of the same element were aligned with MUSCLE 53 , and the distmat utility in the EMBOSS 54 software package was used to calculate the accumulated divergence 'K' between 5′ and 3′ LTRs. Insertion times (T) of the LTR retrotransposons were calculated with the formula T = K/2 × r, where r is the TE-specific mutation rate of 1.3 × 10 −8 per site per year 55 .

RNA-seq data collection and generation.
To aid in genome annotation and to perform transcriptome analysis, we generated RNA-seq data for five different tissues from B73 and Mo17: endosperm 12 d after pollination, 14-d seedlings, bracts, roots and stems harvested in the silking stage. For each sample, two independent biological replicates were generated. All fresh tissues were frozen in liquid nitrogen and stored at -80 °C before processing. Total RNA of each sample was extracted with TRIzol according to the manufacturer's instructions. RNA-seq libraries were prepared with the Illumina standard mRNA-seq library preparation kit and sequenced on the Illumina platform with paired-end sequencing strategy.
Gene annotation. MAKER-P version 3.1 (ref. 56 ) was used to annotate genes in the Mo17 genome, through a comprehensive strategy combining results obtained from protein-homology-based prediction, RNA-seq-based prediction and ab initio prediction. We used the same evidence that was used for the previous B73 gene annotation, with the addition of Mo17-specific RNA-seq datasets. All annotated proteins from Sorghum bicolor, Oryza sativa, Setaria italica, Brachypodium distachyon and Arabidopsis thaliana, downloaded from http://gramene.org/ release 48 (ref. 57 ), were used for protein-homology-based prediction. 74,471 assembled transcripts from multiple Mo17 tissues, full-length transcripts from B73 Iso-seq 58 , another set of 69,163 publicly available full-length cDNAs from B73 deposited in GenBank 59 , a total of 1,574,442 Trinity-assembled transcripts from 94 B73 RNAseq experiments 33 and 112,963 transcripts assembled from deep sequencing of a B73 seedling 60 were collected and included as transcript evidence. Augustus 61 and FGENESH (see URLs) were used for ab initio prediction of gene models in the TE-masked Mo17 genome. 44,747 genes (53,021 transcripts) were identified in the Mo17 genome and are referred to as the working gene set. This working set of gene annotations was expected to contain TEs that were not masked before annotation or annotations with poor supporting evidence. We further filtered this working set according to AED scores generated in MAKER-P software, then confirmed splice sites and performed transposon screening. Finally, 38,620 high confidence genes remained and are referred to as the filter gene set.

Identification of SNPs and indels.
We identified SNPs and insertion/deletion polymorphisms (indels, length < 100 bp) between the B73 and Mo17 genomes with Mummer 34 as follows: (i) The Mo17 pseudochromosome sequence was mapped to its corresponding B73 pseudochromosome with nucmer with the parameters '-mumreference -g 1000 -c 90 -l 40' . (ii) The delta-filter was used to filter mapping noise and determine the one-to-one alignment blocks with parameters '-r -q' . Alignments with aligned positions in one genome that were located more than 10 Mb away in another genome were further filtered. The aligned blocks between these two genomes were identified, and blank regions on the chromosomes that might be low-similarity regions or multiple aligned regions were filtered. (iii) Show-snps was used to obtain SNPs and small indels (< 100 bp). B73-genomebased SNPs and indels were detected with the parameter '-ClrTH' , and Mo17genome-based parameters were detected with the parameter '-ClqTH' . SNPs and indels shared between the Mo17 and PH207 genomes, or the B73 and PH207 genomes, were processed with the same method used for SNPs shared between the B73 and Mo17 genomes. The genome distributions of SNPs and indels between the B73 and Mo17 genomes were also determined.
Identification of PAV sequences, PAV clusters and PAV genes. PAV sequences in the B73 and Mo17 genomes were identified through a sliding-window method. To identify B73 specific sequences, we divided the B73 genome into 500-bp overlapping windows with a step size of 100 bp and then aligned each 500-bp window against the B73 and Mo17 genomes with BWA mem 47 with options '-w 500 --M' . The sequences of windows that could not be aligned or that aligned to the Mo17 genome with a primary alignment coverage less than 25% but could be properly aligned to the B73 genome were defined as B73-specific sequences.
Overlapping windows that could not be aligned were merged. Mo17-and PH207specific sequences were identified through the same method. Most of these PAV sequences had a relatively short length ( Supplementary Fig. 5), possibly as a result of our stringent calling criteria. We further merged PAV sequences that were within 100 kb of the physical coordinates to identify PAV clusters. If a merged region had more than 10% PAV sequences, we defined this region as a PAV cluster. We listed some large PAV clusters (> 500 kb) in both the B73 and Mo17 genomes.
For the identification of B73-, Mo17-and PH207-specific genes, the CDS of different transcripts were merged to represent a single gene, and genes with more than 75% of the CDS regions covered by PAV sequences were defined as PAV genes. We further aligned B73 resequencing reads from maize HapMap2 (ref. 36 ) projects to the Mo17 genome and Mo17 Illumina PCR-free reads to the B73 genome with BWA mem 47 to exclude potential false positives. For the B73/Mo17specific genes above, we filtered those with more than 50% CDS regions covered by Mo17/B73 reads to obtain the final PAV genes.
Resequencing reads of 19 wild relatives, 23 landraces and 60 modern inbred lines from maize Hapmap2 (ref. 36 ) projects were aligned to the B73, Mo17 and PH207 genomes with Bwa mem 47 . The mapping depth across the PAV gene regions was calculated with samtools 62 . If a gene had more than 90% of the coding sequences covered by resequencing reads, we defined it as trackable in that line.

Detection of gene structural variations among B73, Mo17 and PH207.
To survey gene-structure variation between B73 and Mo17 genomes, we extended the longest transcript of each B73/Mo17 gene 2 kb upstream and downstream, and then aligned it to both Mo17/B73 genomes and the sequence from the Mo17/B73 syntenic region (in which the locations in the two genomes were less than 10 Mb apart), respectively, with Bwa mem 47 . The alignments separated by less than 20 kb were merged. For the best-hit-based method, the genome-wide best hits were used to assess gene-structure variation. Genes without amino acid substitutions or with only missense mutations and/or nonframeshift indels (length = 3n bp) were classed as structurally conserved genes. Genes with complete CDSs but containing SNPs or indels (3 ± 1 nt) that might produce initiation codons, termination codons, premature termination, splicing-donor-site or splicing-acceptor-site mutations, and ORF frameshifts were classified as genes with large-effect mutations. The remaining genes not identified as PAV genes were classified as genes with large structural variation.
For the synteny-based method, the best hits located in the syntenic regions were used to assess gene-structure variation. Structurally conserved genes and genes with large-effect mutations were defined according to the same criteria above. The remaining genes with more than 75% CDS missed in the syntenic regions were classed as genes with syntenic information not established or were otherwise classified as genes with large structural variation. Notably, only genes and their best hits in the counterpart genome anchored in ten pseudomolecules were included in the analysis. The comparisons of PH207 genes to the B73 and Mo17 genomes were performed in the same way.

Identification of duplicated genes and gene families.
To identify gene duplications, BLASTP 63 was used to calculate pairwise similarities (e value < 1 × 10 −20 ), and MCscanX 64 with default parameters was then used for classification.
To identify gene families, we merged annotated genes from Mo17, B73 and three other grasses from the Phytozome database (see URLs), including S. bicolor (33,032 genes), O. sativa (39, 049 genes), B. distachyon (31,694 genes) and A. thaliana (27,416 genes). The longest proteins for each gene were aligned to one another. BLASTP 63 was used to calculate pairwise similarities (e value < 1 × 10 −20 ), and OrthoMCL 65 was used to identify gene families with an inflation value of 2 and percent-match cutoff of 50.
Comparative genomic analysis among B73, Mo17, PH207 and sorghum. To perform the comparative genomic analysis, we used the Synmap pipeline (see URLs). In brief, we used last 66 to blast the CDS sequence, then detected syntenic blocks with DAGchainer 67 with options -D 20 -A 5. Quota Align 68 was further used to merge adjacent syntenic blocks. The syntenic depth was set to 2:1 for maize and sorghum and 1:1 for B73 and Mo17; B73 and PH207; and Mo17 and PH207 comparisons, and the overlapped distance was set to 40 to permit overlapped syntenic regions. Fractionation bias was applied to determine subgenome organization in maize compared with sorghum. The CodeML utility in the PAML 69 software package was used to calculate the Ka and Ks rates between orthologous genes. The time of divergence from the common ancestors of maize (~2. 1 Ma) was inferred on the basis of the Ks of 0.025, because maize and sorghum shared a common ancestor with ~11.9 Ma (Ks ~0.139).
Transcriptome comparison between B73 and Mo17. The RNA-seq data for the bract, root, stem, seedling and endosperm tissues from the B73 and Mo17 lines were used to perform transcriptome comparison between B73 and Mo17. The RNA-seq data were aligned to both the B73 and Mo17 genomes with Hisat2 (ref. 70 ). All aligned reads were used to calculate the fragments per kilobase per million (FPKM) values with Cufflinks 71 . Genes with FPKM value greater than 1 for tissues of B73 (Mo17) and lower than 0.1 in corresponding tissues of Mo17 (B73) were deemed B73 (Mo17) specifically expressed genes.
Reporting Summary. Further information on experimental design is available in the Nature Research Reporting Summary linked to this article.

Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main text, or Methods section).

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistics including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted DAGchainer r020208; Quota Align; PAML v4.9; All custom codes used in this study will be made available according to the request.
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

April 2018
Data Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A list of figures that have associated raw data -A description of any restrictions on data availability The genome assembly and gene annotation can be achieved from NCBI with BioProject number PRJNA358298 and BioSample number SAMN06169745. The GenBank accession number of the data above is NCVQ00000000. Raw PacBio SMRT reads, Illumina data, and RNA-seq can be downloaded from NCBI SRA database with accession number SRP111315.
Field-specific reporting Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

Life sciences
Behavioural & social sciences Ecological, evolutionary & environmental sciences