Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza

The genus Oryza is a model system for the study of molecular evolution over time scales ranging from a few thousand to 15 million years. Using 13 reference genomes spanning the Oryza species tree, we show that despite few large-scale chromosomal rearrangements rapid species diversification is mirrored by lineage-specific emergence and turnover of many novel elements, including transposons, and potential new coding and noncoding genes. Our study resolves controversial areas of the Oryza phylogeny, showing a complex history of introgression among different chromosomes in the young ‘AA’ subclade containing the two domesticated species. This study highlights the prevalence of functionally coupled disease resistance genes and identifies many new haplotypes of potential use for future crop protection. Finally, this study marks a milestone in modern rice research with the release of a complete long-read assembly of IR 8 ‘Miracle Rice’, which relieved famine and drove the Green Revolution in Asia 50 years ago. Genome assemblies of 13 domesticated and wild rice relatives reveal salient features of genome evolution across the genus Oryza, especially rapid species diversification and turnover of transposons. This study also releases a complete long-read assembly of IR 8 ‘Miracle Rice’.

As the global population is projected to increase by almost 3 billion by 2050, rice breeders urgently need to develop new and sustainable cultivars with higher yields, healthier grains and reduced environmental footprints. Because the wild relatives of rice are adapted to different biogeographic ranges (Fig. 1a) and can tolerate many biotic and abiotic stresses 1-6 , they constitute an important reservoir for crop improvement. Strategies to harness such traits for crop improvement show clear promise, as exemplified by the introgression of bacterial blight resistance (Xa21) from the wild species Oryza longistaminata 7 .
The 27 Oryza species span ~15 million years of evolution, with 11 genome types, 6 of which are diploid (n = 12: AA, BB, CC, EE, FF and GG) and 5 of which are polyploid (n = 24: BBCC, CCDD, HHJJ, HHKK and KKLL). Cultivated rice belongs to the AA genome group, the primary germplasm pool for rice improvement 8 . To harness more distantly related species as genetic resources, a better understanding of genomic differences and similarities is essential.
As a first step toward a deeper evolutionary understanding of the genus, we developed a large array of publicly available genomic tools from BAC libraries 9,10 and BAC-end sequences to physical maps 11,12 and subgenome assemblies 13 . Comparisons of select orthologous regions have addressed the origin and evolutionary fate of genes [14][15][16] , as well as transposable element (TE) dynamics [17][18][19][20][21][22] and the effects of polyploidy 22,23 . In addition, assemblies of several AA genomes [24][25][26][27] and the O. brachyantha FF genome 28 have already pointed to the power of complete genome comparisons.
Here we detail the generation and analysis of nine new reference assemblies that span cultivated and wild species in the Oryza genus and extend to the African outgroup species Leersia perrieri. Our comparative evolutionary genomics analysis includes these and four previously published genomes (O. sativa vg. japonica (AA) 29 and indica (AA) 30 , O. glaberrima (AA) 25 and O. brachyantha (FF) 28 ), which were together reannotated using a common pipeline with the integration of baseline RNA-seq and bisulfite sequencing data.
Our analysis of this 13-genome dataset revealed several salient features of genome evolution in Oryza: 1. A single-species phylogeny, supported by over 6,000 singlecopy orthologs, places the 'crown' age of the AA clade at ~2.5 million years, implying a diversification rate (~0.50 net new species/million years) that is comparable to the high rates estimated for diverse plants in a variety of continental and island biodiversity hotspots 31 ; 2. Extensive interspecific gene flow, particularly between the South American Oryza glumaepatula and African AA species, underscores the semipermeable nature of Oryza species boundaries, at least within AA-genome species; 3. The emergence of many putative taxonomically restricted gene families corresponds to higher lability, reduced coding length and higher rates of evolution than in ancient families. Along with long intergenic noncoding RNA (lincRNA) genes, members of new families are more likely to localize to repetitive regions of the genome, suggesting a possible role of TEs and reduced meiotic recombination in their origin; 4. Rapid diversification and turnover of LTR-retrotransposons largely differentiate the AA, BB, FF and Leersia genome types; 5. A significant preference of deletion over insertion mutations was detected across eight AA Oryza genomes, similar to in metazoan lineages; 6. Analysis of thousands of candidate disease resistance genes shows an overabundance of heterogeneous gene pairs organized in head-to-head fashion and with atypical domains, extending support for the integrated decoy model of disease resistance function.

Results
The 13-genome data package. Chromosome-level reference assemblies for two domesticated (O. sativa vg. indica cv. IR 8 (also known as Miracle Rice) and the drought-tolerant aus variety N 22) and seven wild species (Oryza rufipogon, Oryza nivara, Oryza barthii, O. glumaepatula, Oryza meridionalis, Oryza punctata and L. perrieri) were generated using either long-read or short-read technologies, with extensive scaffold support from long-insert library reads, including BAC-ends, as summarized in Table 1 (Supplementary Note). Extensive assessments of the quality of each assembly (for example, sequence accuracy, genome completeness and gene-space coverage) are described in the Supplementary Note (Supplementary Fig. 1 and Supplementary Tables 1-22) and Supplementary Data 1 and 2. These assemblies, plus the four additional published assemblies described above, were annotated using a uniform annotation pipeline to minimize methodological artifacts ( Table 1, Supplementary Figs. 2 and 3, Supplementary Tables 14-25 and Supplementary Note).
Species tree. The outline of the Oryza phylogeny is clear, but the exact relationship between AA genomes, including the two domesticated rice species, has remained elusive 32-34 owing to extensive gene tree discordance 35 . Previous work has rarely used 'species tree' inference methods 35 that explicitly incorporate gene tree discordance caused by factors such as incomplete lineage sorting, and, if it did, did so with only a few genes 36 . Using 6,015 singlecopy orthologs from ten Oryza genomes including L. perrieri, we inferred strong support for a single-species phylogeny (Fig. 1b  Bootstrap proportions were obtained from 400 supermatrix bootstrap replicates, and MP-EST support values were obtained as described in the Methods. All clades were supported by both methods at 100%, except as indicated. Divergence times within the Oryza genus were estimated by PATHd8, assuming an age of 15 million years for the Oryza crown group. Bars on nodes represent the range of PATHd8 age estimates across all 12 chromosomes. Mya, million years ago. Note), which supports two independent origins of japonica and indica rice. We note, however, that some gene tree conflicts cannot be explained by incomplete lineage sorting alone. For example, data for chromosomes 6 and 12 suggested that the South American AA species O. glumaepatula is the sister group only of African O. barthii and O. glaberrima, rather than of these two species plus the O. sativa complex, as inferred from the whole-genome tree ( Fig. 1b and Supplementary Fig. 5).
With an inferred crown age of the AA species of less than 2.5 million years (Fig. 1b, Supplementary Fig. 4, Supplementary  Table 27 and Supplementary Note), footprints of past gene flow among AA species might still be detectible. Regions with evidence of introgression based on a positive D statistic 37 were found on all chromosomes, with ten having significant introgressions between O. glumaepatula and the African but not the Asian species (Fig. 2  and Supplementary Table 28).
Another source of topological incongruence is gene conversion between duplicated regions retained after the pan-grass polyploidy event. Such regions are particularly common on the distal regions of chromosomes 11 and 12 14 . Recent (< 4.9 million years in age) largescale conversion events were apparent on the distal 2.2-Mb regions of both of these chromosomes ( Supplementary Fig. 6). From inter-and intraspecies comparisons, we inferred four independent conversion events, one in the L. perrieri lineage and one each in the AA, BB and FF lineages (Supplementary Tables 29 and 30; also see the Supplementary Note), with no further events since the divergence of the AA species. Dates calculated for O. meridionalis, the sister group to the other AA species analyzed, suggest that this common event occurred almost concomitantly with the divergence of the AA clade.
Karyotype conservation. The 12-chromosome karyotype in rice resulted from the pan-grass polyploidy event and has deviated little in rice in comparison to other extant grasses 38-40 . Using synteny maps and whole-genome alignments, we found that the outgroup species L. perrieri has many small inversions of five or more genes when compared to O. sativa 29 (Supplementary Fig. 7). To estimate the rate of such events within Oryza, we focused on inversions shared by at least two consecutive species in the species tree, while discounting species-specific events as possible assembly errors. With Brachypodium distachyon as an additional outgroup, we identified nine paracentric inversions, ranging from 60 to 300 kb in length and involving up to 19 genes ( Fig. 3 and Supplementary  Table 31), from which we estimate that such inversions arise about once every 1.6 million years. However, this rate is not constant, as there was only one event between the divergence of FF genomes and the AA-BB ancestor. In contrast, there were at least six inversions in the branch leading to the common ancestor of the AA-type genomes, more than double the average rate.
TE diversification and rapid elimination. Selective amplification and loss of TEs has had a key role in Oryza genome and chromosome evolution 17,20,21,41 . Repeats, including TEs, constitute 27-50% of the genome assemblies (Table 1 and Supplementary Table 20), providing a unique basis for a comprehensive genus-wide comparison of TE dynamics. Each species was distinguished by its diversity of long terminal repeat-retrotransposons (LTR-RTs), the most abundant TE class 26 . Comparative annotation and phylogenetic analysis of LTR-RTs identified several lineage-specific transpositional bursts (Supplementary Figs. 8 and 9, and Supplementary    [44][45][46] . Tracking 75 orthologous LTR-RT loci across the eight most closely related AA species (Supplementary Data 3) yielded an average loss rate of 3.62 kb/million years per element. Applying this ratio to the whole genome predicts that approximately one-quarter of the O. sativa vg. japonica genome will be eliminated within the next 3-4 million years unless new waves of transposition occur during the same period. Similar to previous estimates [44][45][46] , our analysis extends these findings across the AA genomes of the Oryza genus.
These results reinforce the notion that TEs are important drivers of genome evolution because they fuel the rapid turnover of intergenic regions. However, not all TEs behave equally: the terminalrepeat retrotransposons in miniature (TRIMs), which are mostly located near genes, were highly conserved among all 13 genomes (Supplementary Fig. 11 and Supplementary Table 33).

Indels as drivers of genome evolution and domestication.
Pairwise comparisons of the eight most closely related AA genomes identified 216,059-699,587 indels, depending on the genome comparison, with the ancestral state determined for about half of these (Supplementary Table 34). In agreement with previous work in Oryza 26 and Arabidopsis 47 , short indels were strongly favored. In coding regions, in-frame indels at multiples of 3 bp were overrepresented, as expected ( Supplementary Fig. 12). There was a significant bias toward deletions (chi-squared test, P < 8.96 × 10 -11 ; Supplementary  Fig. 13), indicating a much higher fixation rate of polymorphic indels in the domesticated species than in its progenitor. O. glaberrima was previously shown to have been derived from subpopulations of O. barthii more than 3,000 years ago, with an attendant population bottleneck during early cultivation and domestication 49     Oryzeae-and species-specific gene families. We identified 21,448 putative gene families, which were assigned to four age bins ( Fig. 5a and Supplementary Fig. 14). The most ancient group, dated to at least the common ancestor of angiosperms, included 62% of genes but only 29% of families. The Poaceae age group, comprising a further 13% of genes and 9% of families, was enriched in defense response (GO:0006952) and wound response (GO:0009611) genes, including those encoding disease resistance receptors (see below) and enzyme inhibitors (Supplementary Table 36 and Supplementary Data 4). In addition, these families were enriched in genes encoding components of the ubiquitin-proteasome pathway 51 , including SKP1/BTB/ POZ and F-box proteins, the latter of which are known to undergo lineage-specific expansions and rapid evolution in plants 52 . Also enriched in Poaceae were genes encoding two classes of proteins abundant in late embryogenesis, which function in dehydration and other stress tolerance 53 . Genes that emerged within the Oryzeae accounted for 60% of families and 19% of annotated loci and were enriched in seed storage proteins, defense-related genes and nucleic-acid-binding functions (GO:0003676), including various DNA-binding domains, histones and methyl-CpG DNA-binding proteins (Supplementary Table 37 and Supplementary Data 4). Species-specific families and orphans accounted for an average of 6% of annotated loci per species.
While these analyses show that lineage-specific gene families can encode proteins with recognized biological functions, only 6% of species-specific and 10% of Oryzeae-specific loci have predicted InterPro domains, as compared to 63% in Poaceae and 91% in angiosperm-derived families. Despite some methodological concerns 54 , it is widely recognized that new genes arise continuously across all domains of life [55][56][57] . The fraction of Oryzeae-specific genes with syntenic orthologs in a second species was not substantially lower than for the two older groups, with 86% as compared to 93% and 97% for the Poaceae and angiosperm-derived families, respectively (Supplementary Table 38). This in turn provides unique opportunities to study patterns and mechanisms of putative new gene origination within a phylogenetic framework.
New functional genes likely evolve via transitory protogenes; while most are rapidly lost, a subset will be positively selected and gradually gain the characteristics of deeply rooted genes [58][59][60] . A key prediction is that properties such as coding length, expression, substitution rate and evolutionary stability change in an age-dependent manner 58,61 .
Coding length showed clear age dependence, with the median length increasing several fold from the youngest to the oldest group (Wilcoxon rank-sum test, P < 2.2 × 10 −16 ) ( Supplementary Fig. 15). This held true even when excluding genes shared throughout the angiosperms (Kendall's tau, P < 2.2 × 10 −16 ) (Fig. 5b). This observed age dependence, also found in Arabidopsis 56,62 , may indicate that longer initial coding lengths are more likely to survive over time or that coding sequences that survive gradually lengthen over time through chance removal of stop codons 58 .
Gene expression was also positively correlated with age, both when assessed qualitatively ( Supplementary Fig. 14) and quantitatively (Fig. 5b). Again, the exclusion of angiosperm-derived genes from the analysis did not change this finding (Kendall's tau, P < 2.2 × 10 −16 ) (Fig. 5b). These findings are consistent with lower and more restricted expression of lineage-specific loci in plants 63,64 .
Relaxed negative selection is a common characteristic of lineagespecific genes 57 . Comparisons of syntenic orthologs confirmed a trend of relaxed selection in younger genes (for example, median K a /K s for Oryzeae (0.59) > Poaceae (0.42) > angiosperms (0.27)), concomitant with higher rates for both nonsynonymous and synonymous substitutions (Wilcoxon rank-sum test, P < 2.2 × 10 −16 ) ( Supplementary Fig. 16). Nevertheless, the median K a /K s value for Oryzeae-specific genes was well below neutral rates, suggesting that many encode functional proteins 65 . These observed trends also held true after fractionating over all nodes in the species tree. Using measurements for O. nivara (Fig. 5c), we found a significant negative correlation for K a , K s and K a /K s with age, even when restricting to nodes within the Oryzeae (Kendall's tau, P < 1.7 × 10 −6 ). A final expectation is greater evolutionary stability of older genes. On average, any one species had 96% of families shared through the angiosperms but only 76% of Poaceae-specific and 15% of Oryzeaespecific families. Similarly, 75% of families shared through the angiosperms were present in all species but only 34% of Poaceae-specific and 15% of Oryzeae-specific families were ( Supplementary  Fig. 14). The oldest families were also the largest, consistent with expectations that fixed families diversify over time. Finally, Oryzeaespecific loci were more likely to be found near centromeres over low-recombination regions, and thus more likely to reside closer to LRT-RTs ( Fig. 6 and Supplementary Fig. 17).
Chimeric genes are common in O. sativa, suggesting that new genes originate at a much higher frequency than in animals 66,67 . One mechanism of chimeric gene formation is the transposition of genic sequences captured by flanking MULE DNA transposons in rice 68,69 . MULE-derived loci (MDLs) accounted for 7% of Oryzeaeand species-specific genes. While this is a small minority, it is clearly much higher than for all genes (chi-squared test, P < 1 × 10 −32 ) (Supplementary Table 39). MDLs in AA genomes showed high rates   of microsynteny (68-92%), indicative of conserved genomic position over periods of up to 2.4 million years. MDLs showed different characteristics from non-gene-containing MULEs, having for example higher GC content 70 (Supplementary Fig. 18). Methylation levels of MDLs were also lower and changed with evolutionary age 71 ( Supplementary Fig. 19).
LincRNA genes as a source of genetic novelty. We found thousands of lincRNAs that did not overlap protein-coding genes. L. perrieri had more lincRNAs than any Oryza species (q ≤ 3 × 10 -31 ) ( Supplementary Fig. 20 Fig. 21). Most lincRNA families have evolved very recently, with the majority being species specific ( Supplementary Fig. 22). Of the 23,633 families identified in nine species, over 91% were species specific and only 101 lincRNA families (< 0.5%) were detectable as far back as the common ancestor of Oryza and L. perrieri, indicating rapid emergence and turnover.
Similarly to vertebrates 72,73 , lincRNAs were enriched with TE-derived sequences, especially ones from DNA transposons and LTR-RTs (found in > 59%; Supplementary Fig. 23 and Supplementary Data 5). Whether this tolerance for TE sequence derives from nonfunctionality of the lincRNA transcripts or a modular structure that allows for large indels remains to be explored.

Evolution of disease resistance genes.
Most plant disease resistance genes encode intracellular nucleotide-binding, leucinerich repeat (NLR) receptors that directly or indirectly recognize pathogen effector proteins. NLR content is highly variable among rice populations, accounting for the majority of copy number and presence/absence polymorphisms affecting genes [74][75][76] , explained by rapid evolution and balancing selection 77 Fig. 24). In addition, NLR expansion was markedly higher in both the indica and japonica varietal groups as compared to their wild progenitors, consistent with recent artificial selection for increased NLR diversity.
As previously reported 26,79,80 , most NLR genes are positionally clustered (Supplementary Table 40), with the greatest density on chromosome 11, a finding now extended to the Leersia genus (Supplementary Table 43). One-third of the 1,046 clusters across 13 genomes had mixed subfamily composition (Supplementary Table 40) indicating that they did not arise simply by local duplications of single progenitor genes. Recent studies have shown that disease resistance sometimes requires the joint action of two side-byside NLR genes 78,81-83 , suggesting that formation and maintenance of complex clusters may be driven by such functionally coupled pairs. As typified by rice RGA4-RGA5 78 and Pik-1-Pik-2 84 , coupled NLRs are frequently distantly related, arranged in head-to-head formation (divergently transcribed) and include atypical domains that function as integrated decoy domains in pathogen recognition 84 .
To investigate the prevalence of these characteristics, we studied all 1,386 instances of adjacent NLR genes, of which over one-quarter were classified as heterogeneous (Supplementary Tables 44-46). Among these, the head-to-head configuration was significantly   85,86 , but many more rare types were also found. The presence of specific domains in different NLR subfamilies point to multiple, independent acquisitions of decoys by NLRs (Supplementary Table 42). Among adjacent NLR genes, putative integrated decoy domains were significantly more prevalent in heterogeneous pairs arranged in headto-head configuration as compared to other classes (Supplementary  Table 45; chi-squared test, P < 0.0001). These trends with the Pik-1-Pik-2 locus and adjacent cluster of RGA4-RGA5-like pairs are illustrated in Fig. 7b, with evidence for evolution of new specificities Osi 9311 Orufi  both within species and between species, by acquisition of WRKY, NAM, protein kinase, U-box and VQ domains. Finally, an analysis of the Pi-ta region ( Supplementary Fig. 26) identifies an excellent candidate for the tightly linked R gene Pi-ta2, whose broader resistance spectrum encompasses that of Pi-ta 87 but which has not been positively identified because of very low recombination in the Pi-ta region 87 .

Discussion
With two independently domesticated species, Oryza has had a significant role in advancing human civilization, leading some ricedependent cultures to revere 'rice is life' , as exemplified in dramatic fashion ~50 years ago with the release of IR8 in Asia. As the world population approaches 10 billion by 2050, rice breeders are faced with the challenge of producing crops that are high yielding and more nutritious while at the same time being more sustainable. How can we meet the challenge of producing more food and at the same time protecting the environment?
The premise of the International Oryza Map Alignment Project (IOMAP), initiated in 2003, was to develop a set of high-quality genomic resources for the wild relatives of rice that could be used as a foundational resource to discover and utilize novel genes, traits and/or genomic regions for crop improvement and basic research. IOMAP began with the generation of genomic tools (BAC libraries and physical maps) derived from wild accessions that have been actively used in breeding programs to introgress new traits into cultivated rice 88 N 22)) and the closely related outgroup species L. perrieri. In combination with four previously published Oryza genomes, our dataset represents a genome-wide vista of the results of ~15 million years of both natural and artificial selection on a single genus. Over this time period, the Oryzeae have maintained a base chromosome number of 12, despite having adapted to different ecological conditions associated with their global distribution and withstanding bursts of TE diversification that, in some cases, led to doubling of genome sizes, as in O. australiensis 20 and O. granulata 21 .
Our phylogenomic work illustrates both the challenges of inferring species phylogenies in closely related plant taxa-incomplete lineage sorting 89 , hybridization and introgression 90 -and the power of whole-genome sequences to untangle the resulting phylogenetic discordance. Combining recent tools for species tree inference with this massive dataset permitted us to construct a much more nuanced view of the species phylogeny in Oryza that reflects the mosaic history of different parts of the genome 91 . Not only does this allow a more complete framework for studies of the evolution of genes, chromosomes and genomes in Oryza, but it also leads to more accurate placement of Oryza species in an evolutionary and ecological context. For example, our genome-based estimate of the age of the AA genome clade implies a remarkably rapid diversification rate of ~0.50 net new species/million years, placing it on par with many rapidly diversifying taxa in island and continental hotspots 31,33 . This rapid species diversification among Oryza species in the AA clade likely provided special opportunities for cultivation and improvement by humans, resulting in two independent domestication events in this group.
Our analyses of TE-driven genome dynamics in Oryza point to a very high rate of TE turnover over long evolutionary periods. While this has been previously demonstrated in cultivated rice, it was unclear whether this pattern holds for the entire genus. We show that this process results from both recent lineage-specific TE amplifications and a high deletion rate of TE-related sequences. Even if Oryza genomes share many TE families, one could expect that the vast majority of TE insertions are polymorphic among wild rice species. Several recent studies have demonstrated the functional impact of TE-related sequences in rice 92,93 and other crops 94,95 on gene expression. TE polymorphisms in wild Oryza species could thus be exploited for breeding in addition to the diversity found in the genes themselves.
Comparative analysis of gene annotations across the Oryzeae revealed many apparent lineage-specific families that emerged in the common ancestor of grasses or within the Oryzeae clade. The latter exhibited characteristics typical of taxonomically restricted loci in plants and other domains of life, including shorter coding sequences, low expression, rapid evolution rates and family instability. Lack of recognizable functional domains also characterized these annotations, consistent with a de novo origin. Although such qualities can also be found in annotation artifacts 54,65 , the Oryzeaespecific loci (as distinguished from species-specific loci) showed a surprising degree of conserved synteny, even among more distantly related species, and we therefore focused on this subset to examine the effects of age on these characteristics. Our finding of a clear correlation toward more 'gene-like' characteristics with increasing age is consistent with Oryzeae-specific loci representing a pool of rapidly evolving sequences with adaptive potential. The observed continuum of characteristics over evolutionary time extended to the older, more established gene families, with characteristics of Poaceaederived families intermediate to those of the very new and very old. However, the relatively high proportion of genes in Poaceae-derived families with conserved functional domains (63% as compared to 91% in ancient families) suggests that de novo origin accounts for a minority of these families. Indeed, among Oryza-derived loci, the small proportion (~10%) with recognizable domains would have greater potential for future survival as compared to loci from other families. In both Poaceae-and Oryzeae-specific families, genes that do have conserved domains are enriched for defense and stress response functions, suggesting that their rapid evolution was driven by adaptation to varied environmental pressures. Although few putative de novo-originated genes have so far been functionally characterized in plants, the rice OsDR10 gene, conserved only in the Oryzeae tribe, is notable in having apparent function in the regulation of pathogen defense responses 96 .
TEs make an important contribution to the formation of lineage-specific loci in plants, by exaptation 63,64 or exon capture 68 . Here MDLs accounted for almost 7% of Oryzeae-derived and speciesspecific loci. We also found that lincRNA genes, which mostly arose in a species-specific manner, are associated with both DNA and LTR-RT-type transposons. Aside from these, we found a clear positional bias of putative new loci over retrotransposon-rich pericentromeric regions, suggesting that LTR-RT activity may contribute to the origination of new genes. Although yet to be confirmed, the observed rapid rates of insertion and excision of LTR-RTs may drive the generation of novel sequences at these sites with chance generation of ORFs 97 . An unanticipated finding is that the rate of silent site substitutions was elevated in new gene families as compared to older ones. Elevated synonymous substitutions may be caused by higher background mutation rates in pericentromeric regions as compared to gene-rich regions, as has been found in Arabidopsis 98 . A second hypothesis is that background mutations accumulate more rapidly owing to suppressed meiotic recombination in pericentromeres. High recombination rates in euchromatin provide opportunities for allelic gene conversion during meiosis, correcting mismatches caused by spontaneous mutation.
Disease pathogens remain a major threat to rice harvests worldwide, with yearly losses from the most serious disease, blast (Magnaporthe oryzae), having the potential to feed an estimated 60 million people 99 . Future forecasts may become dire as climate change alters or expands the geographic range of pathogens. Risk-reducing growing techniques and breeding of natural resistance have historically provided the most practical and effective means of control, with chemical fungicides as a last resort in some countries 100 . Yet, natural resistance is fragile owing to the rapid evolution of new pathovars that evade host recognition. Stacking of traits from the large array of resistance genes and haplotypes in existing rice germplasm is a realistic approach to build durable resistance 101 . Our sequencing of seven wild relatives of crop species opens a treasure trove of novel resistance haplotypes and loci to sustain this strategy. The practical utility of our resources is directly demonstrated by our identification of a strong candidate for the long-sought Pi-ta2 locus, which in combination with Pi-ta provides broad-specificity resistance to M. oryzae 87 .
Sequencing of the first two Asian rice genomes over a decade ago 29,30,102 greatly accelerated the identification and characterization of genes important to breeders, as well as insights into selective pressures acting on their evolution. The availability of 13 high-quality reference genome assemblies, presented here, now permits exploration of the majority of orthologous loci and genomic regions at will for the AA, BB and FF genome types and L. perrieri. As more wild and cultivated Oryza genomes become available, this resource will become even more powerful.

Methods
Methods, including statements of data availability and any associated accession codes and references, are available at https://doi. org/10.1038/s41588-018-0040-0.   O. punctata and L. perrieri) were generated using either long-or short-read technologies; most of these have extensive scaffold support from long-insert library reads, including BAC-ends, as summarized in Table 1. These assemblies, plus the four additional published assemblies 25,28-30 , were annotated using a uniform annotation pipeline to minimize biases associated with different methods for gene and repeat identification (Table 1).

Phylogenetic inference. Data preparation.
For each chromosome, we used the BLAST-Overlap-Synteny (BOS) filtering protocol 34 to identify clusters of singlecopy orthologous loci. Only clusters containing a single sequence from each genome were used in further analyses (note: the IR 8 and N 22 genomes were not used in this analysis as they did not provide any meaningful phylogenetic signal beyond the Nipponbare RefSeq sequence). The coding regions of each locus were aligned at the nucleotide level using PRANK v.140110 105 with the -F setting. For each chromosome, alignments containing sequences from all 11 genomes were concatenated to create 12 'supermatrix' alignments.
Maximum-likelihood tree inference. Maximum-likelihood phylogenies were inferred for individual locus alignments and supermatrices using GARLI version 2.01 106 (default settings except genthreshfortopoterm = 5000 and startoptprec = 0.01). For each individual gene alignment, the best fitting substitution model was chosen using ModelTest v3.7 107 with the Akaike information criterion 108 . Supermatrix analyses were run using a partitioned likelihood substitution model that allowed each locus an independent relative substitution rate. A single general time-reversible nucleotide substitution model with gamma-distributed rate heterogeneity and an estimated proportion of invariable sites was shared by all loci within each supermatrix. Confidence estimates were obtained from 400 bootstrap search replicates computed on each supermatrix. In addition, 200 bootstrap tree searches were performed on each individual locus and used as input to species tree analyses. During bootstrap analyses, substitution model parameter values were fixed at the values estimated during the maximum-likelihood tree search, for computational efficiency. Branches with negligible maximum-likelihood branch length estimates (1 × 10 −8 ) were collapsed, creating polytomies.
Species tree analyses. The MP-EST program v1.0 109 was used to infer species trees from collections of gene trees using a multispecies coalescent model. Analyses were run independently for each chromosome. Polytomies in locus trees were first randomly resolved with zero-length branches. To account for uncertainty in the inference of gene trees for each locus and in the arbitrary resolution of polytomies, MP-EST was run multiple times using gene trees selected from bootstrap samples. For each chromosome, one tree was randomly selected for each locus from the bootstrap samples, and this collection of trees was used as input to MP-EST. This procedure was repeated 400 times. MP-EST support values for species tree branches were calculated as the proportion of replicates returning a species tree containing that branch.
Oryza divergence time estimation. Divergence times within Oryza were estimated using the supermatrix phylogenies inferred for each chromosome. The phylogenies were rooted using the outgroup L. perrieri, which was subsequently pruned from the trees. These pruned trees, with maximum-likelihood branch length estimates obtained from GARLI, were used as input to PATHd8 v1.0 110 . PATHd8 smooths substitution rates across branches to transform a tree with branch lengths measured as substitutions per site to an ultrametric tree with relative node ages. To convert the relative ages to absolute ages, the age of the crown group of Oryza (the root of the pruned tree) was fixed using a calibration consistent with several recent studies 14,111,112 (15 million years).
Introgression among AA genomes. We sought evidence of introgression within the AA-genome species to understand further some of the gene tree discordance observed in the chromosome-level analyses. We used the D statistic 113 , which examines site patterns (so-called ABBA/ABAB patterns) in genome alignments for a specified four-taxon tree. A significant departure from zero of the difference in numbers of the two kinds of 'introgression' sites was tested with a blockjackknife procedure 37,114 , which is aimed at overcoming autocorrelations within the sequence. We focused on the case that seemed problematic in gene/species tree construction: the position of O. glumaepatula. We used the tree (O. punctata, (O. glumaepatula, (O. barthii, O. rufipogon))) to see whether there was evidence that O. glumaepatula has undergone introgression with either the African O. barthii or Asian O. rufipogon. As a 'control' , we also considered the tree (O. punctata, (O. meridionalis, (O. barthii, O. rufipogon))), which replaces the problematic O. glumaepatula with another taxon in the AA clade, O. meridionalis, which is more distantly related to the Asian/African pair. Small alignment blocks sampled across the sequences of each chromosome were assembled using the program hakmer 115 , which identifies exactly matching syntenic single-copy k-mers (k = 32) found in all four taxa, extended by an ungapped 25-nt region up-and downstream of each k-mer. Typically, the final datasets included 10-15% of the original sequence data after this culling. D statistics were computed with a Perl script (available upon request), and block-jackknife estimates of the standard error of D were obtained using m = 20 blocks. brachyantha and L. perrieri were carried out using a modification of the method described in Jacquemin et al. 14 (note: the IR 8 and N 22 genomes were not used in this analysis as they did not provide any meaningful phylogenetic signal beyond the Nipponbare RefSeq sequence). Initial alignment was made with Mugsy 116 , using the parameters -d 100 -c 200 -nucmeropts -l 20, and alignments were refined with Mugsy itself. All subsequent analyses were encapsulated by in-house scripts. Multiple-sequence alignments (300 bp) were generated directly from the MAF file and converted into a nexus format. Sequences with a majority of undefined bases or that were < 50 bp in length were eliminated. Indels were coded with Indelcoder 117 as described, and trees were generated with MRBAYES 118 using the following parameters: Nst = 2 and Rates = Invgamma. The data were partitioned according to type: DNA and gap binary information. For the gap binary data partition, we selected the option coding = variable. The four chains were run for 100,000 generations, sampling trees every 100 generations, with the first 25% of trees sampled discarded as burn-in. Posterior probabilities were calculated from the remaining samples. Intra-and interspecies distances were calculated from the output consensus file. Trees containing sequences from at least five species were retained for analysis. Phylogenetic (n = 5,782) trees were retained and used for calculation of intra-and interspecies Bayesian distances using BioPerl modules. Synteny maps and screen for chromosomal inversions. Syntenic relationships between orthologous genes were mapped for all 78 pairwise combinations of the 13 reference assemblies. We used DAGchainer 119 to identify collinear gene pairs within syntenic blocks, with parameters requiring neighboring genes to be no more than ten genes apart and a minimum chain length of five collinear genes. As this method is strict, we additionally identified 'in-range' syntenies of orthologs that mapped no more than five genes distant from the expected collinear position 120 . Clustered sets of syntenic genes encompassing all species were identified by single-linkage clustering over the pairwise relationships. As a second approach, we conducted pairwise whole-genome alignments using LASTZ 121 and applied a suite of programs to filter maximum-scoring chains and identify non-overlapping 'nets' corresponding to regions of synteny 122 .

Concerted evolution in chromosomes
To screen for chromosomal inversions, we first identified in each species the set of syntenic blocks (consisting of five or more collinear orthologs) having reverse orientation relative to the Nipponbare RefSeq sequence (note: the IR 8 and N 22 genomes were not used in this analysis as they did not provide any meaningful phylogenetic signal beyond the Nipponbare RefSeq sequence). Identified blocks were further screened to remove blocks associated with duplication or movement, by ensuring that the immediate upstream and downstream syntenic blocks were contiguous and in the forward orientation in both genomes. To increase confidence in our identification, we next screened for putative inversion events shared by two or more species. Putative inversions that overlapped with respect to IRGSP-1.0 coordinates were identified using BEDTools v2.17.0 123 . Shared events were also required to have LASTZ 121 alignment breakpoints within 20 kb of one another in the IRGSP reference. The set of candidate inversion events was further refined to include only those that could be logically placed at a discrete internal branch of the species tree. Thus, the final set of inversion events was identified as synteny blocks found in reverse orientation in all outgroup species and in the forward orientation in all in-group species. A synteny map of IRGSP-1.0 with B. distachyon (built in the same manner as described above) was used as an additional outgroup to confirm the ancestral state of the region.

LTR-retrotransposon deletion rate.
To estimate the deletion rate of LTR-RTs in rice, we first identified loci of the genomes that were absent from O. meridionalis but partly conserved in the remaining eight sequenced AA genomes and that contained deleted LTR-RTs. This was achieved in the following way. (1) All AA genomes were analyzed with RepeatMasker (see URLs) using a database of only full-length LTR-RTs to identify regions containing LTR-RT sequences. (2) To identify orthologous regions between the Nipponbare RefSeq sequence 29,126 and all the other AA genomes (query), sequences were aligned with NUCmer from the MUMmer package (version 3.1) 125 using the Nipponbare RefSeq sequence 29 (IRGSP1.0 assembly) as the reference: nucmer -g 500 < reference> < query> , default for all other options. The NUCmer outputs were then parsed with the utility show-coords from the same package: show-coords -r -g -I 70 -T -H < nucmer output> . The -g option was used to secure the orthologous relationship between the conserved blocks, thereby allowing the identification of indels by the break between two adjacent alignments. These analyses allowed indica were identical to the IR 8 RefSeq data and thus were not used in this analysis) and to extract the sequences corresponding to these orthologous regions in all AA genomes, including 500 bp before and after the break position in O. meridionalis, to create multifasta files for each selected locus.
The whole set of sequences from one multifasta file were aligned using MAFFT v6.940b 126 as follows: mafft -op 5 -ep 0 < multifasta file > . Increasing the opening penalty (-op) as compared to the default options allowed identification of the deleted parts from each sequence. Custom Perl scripts were then used to remove alignments that contained at least one sequence displaying stretches of 100 or more N's that could represent assembly gaps, to measure deletion lengths in each aligned sequence and to compare them to the length of the presumed initial complete element so that a deletion rate in kb/million years per element and a half-life could be calculated.
Assembly-based indel detection. We detected indels from eight comparisons among closely related species with easily alignable genomes (Supplementary  Supplementary Table 34). Genome assemblies were first soft masked with RepeatMasker. Chromosome allto-all alignments were performed with LASTZ 121 between the target and query genomes, with the following parameters: --strand= both --ambiguous= n --inner= 2000 --gappedthresh= 6000 --gapped --identity= 90. The resultant alignment blocks were further parsed with the CHAINNET package 122 . Syntenic alignments between intact contigs in the top chain were used to detect indels. We further inferred indels by allowing a gap to cover adjacent alignments with Perl scripts, which permitted us to detect additional indel events. to remove duplicated reads that might be due to PCR amplification and reads that were mapped to multiple loci. Pindel (version 0.2.5b8) 127 was then used to call indels with default parameters. Only indels detected in at least two individuals and without sequence gaps were subjected to further analysis. O. sativa vg. japonica and O. glumaepatula were used as outgroups to determine the ancestral state for each indel. If an indel was present in all O. glaberrima individuals, it was regarded as a 'fixed indel' . If an indel was present between individuals in a species, it was defined as a 'within-species indel' . Genome-wide genetic diversity (π) was calculated with indel markers using VCFtools v0.1.12 128 . Comparison of indels called using resequencing data versus the assembly-alignment-based approach identified a total of 23,372 ( > 1-bp) high-confidence indels that overlapped using both methods (1-bp indels were excluded to avoid false calls caused by sequencing errors) ( Supplementary  Table 35), a number much lower (~50%) than that estimated using the genomealignment-based approach alone (46,307). This difference may be due to the low efficiency of the short-read mapping-based method to detect indels in repetitive genomic regions and the fact that many indels detected from single-genome comparisons were only detected in a single accession (Supplementary Table 35).
Comparative methods using Ensembl gene trees. The Ensembl Compara pipeline 131 was implemented to define gene families, construct phylogenetic gene trees, and infer orthologs and paralogs. Updated protocols used in Ensembl version 75 software are detailed elsewhere (see URLs). Briefly, the longest encoded protein for each gene was subjected to an all-versus-all NCBI BLAST + alignment followed by clustering with hcluster_sg 130 . Each cluster, regarded as a gene family, was subjected to multiple-sequence alignment using several independent methods [131][132][133][134] , with a consensus alignment produced with M-Coffee 135 . TreeBest software was used to construct consensus phylogenetic gene trees incorporating five different tree-building methods: two were maximum-likelihood methods based on either protein alignment (WAG model) or codon alignment (phymal HKY model), and three were neighbor-joining methods that used p distance, dN distance or dS distance within codon alignments. The pipeline was run on the MAKER-P 136 annotations for ten Oryza species and L. perrieri (Table 1). Also included were annotations of three outgroup species, Arabidopsis thaliana (version TAIR10) 137 , Sorghum bicolor (version Sbi1.4) 138 and B. distachyon (version Brachy1.2) 139 , loaded from Gramene v40 Ensembl core databases 140 . Tree reconciliation to classify duplication and speciation nodes, and the assignment of taxon levels to nodes, made use of the following input species tree, which was devised on the basis of results described here (Fig. 1)  Because NCBI taxon identifiers were not available for most nodes of the species tree, we assigned each node an arbitrary letter name. Where available, the corresponding NCBI Taxonomies were as follows: N, Magnoliophyta (tax_id:3398); M, Poaceae (tax_id:4479); L, Pooideae (tax_id:147368); P, Oryzeae (tax_id: 147380); and O, Oryza (tax_id:4527).
Putative falsely split gene models were detected by an automated pipeline that identifies adjacent 'paralogs' that do not overlap in multiple-sequence alignments 140 .
Rates of synonymous and nonsynonymous substitutions relative to the Nipponbare RefSeq sequence 29 were determined for the subset of annotated loci having syntenic orthologs in O. sativa vg. japonica. Amino acid sequences were aligned in a pairwise fashion using CLUSTALW version 1.83 142 . Alignments were back-translated to coding sequences, and gaps were removed using BioPerl 143 . K a and K s substitution rates were estimated using Codeml (F3X4 codon frequency model) within PAML package version 4.3 144 .
Repeat coverage in 2-kb upstream and downstream regions flanking genes was calculated using the intersectBED utility from BEDTools v2.17.0 123 and evaluated for differences using the t.test function in R 145 .
GO functional enrichment of gene families grouped into the 'Poaceae bin' (families with root nodes of M and/or L) and the 'Oryzeae bin' (all other root nodes except N and leaf nodes) was analyzed using the online AgriGO Singular Enrichment Analysis tool 146 . For each taxon bin, corresponding genes in O. sativa vg. japonica with GO assignments were used as queries and tested for enrichment against a reference of all genes with GO assignments, using the hypergeometric statistical test option, applying the Hochberg FDR test for multi-test adjustment and requiring a minimum number of mapping entries of five. For InterPro domain enrichment, the number of genes possessing each InterPro domain was tallied in each taxon bin and among all genes overall across all 11 species. Enrichment was tested using R 145 with the hypergeometric distribution function, phyper and Bonferroni correction using p.adjust.
For analysis of correlation of gene coding length, expression, and K a , K s and K a /K s with gene family age, we used O. sativa vg. japonica as the focal species. Age ranks for gene families were assigned, from youngest to oldest, on the basis of the root node taxon of the corresponding gene tree for each family, as follows: 0, species-specific families and orphans; 1, families at node D; 2, node A; 3, node E; 4, node F; 5, node I; 6, node J; 7, node O; 8, node P; 9, node L; 10, node M; and 11, node N (see "Compara input species tree" for node positions and Fig. 5). Because older gene families tended to have more genes (data not shown), which could bias results, we selected a single value to represent each family for each parameter. For coding length, we averaged the coding lengths of the longest predicted transcript for all genes in the family. For expression, we selected the FPKM value for the most highly expressed gene across tissues from within each family. For K a , K s and K a /K s , we used syntenic O. nivara loci (see below) as the reference, taking the lowest value to represent each family. Correlation between assigned age rank and each parameter was tested using Kendall's tau rank correlation coefficient, as implemented using the cor.test function in R 145 .
The integrated genetic/physical map (constructed using data from Harushima et al. 147 and McCouch et al. 148 ) for O. sativa vg. japonica was downloaded from Gramene (see URLs) 140 . Recombination rates were calculated as cM/Mb over 2-Mb windows across each chromosome.
Ancestral state reconstruction of lincRNA families. LincRNA loci within and between species were grouped into families on the basis of sequence similarity.
A set of non-redundant exons for each locus was generated from transcript GTF files using an R script (available upon request). Repeats and TEs were masked in genomes using the maskFastaFromBed utility in BEDTools v2.17.0 123 . Fasta files for non-redundant exons at each locus were then generated from repeatmasked genomes using the gffread utility in Cufflinks 147 . LincRNA families were first identified within each species using BLASTN v2.2.28 148 (E value < 1 × 10 -5 ). Families were then grouped across species using a best hit reciprocal BLASTN approach, with the full transcriptomes (protein-coding and all unannotated, intergenic loci) for each species used as the queried databases. Multi-locus families with linkage to multiple families in the same species were all grouped into a single family. Ancestral state reconstruction of lincRNA families within the Oryza phylogeny was performed with Mesquite v3.02 using a parsimony approach and an unordered evolutionary model for gains and losses 149 . When ambiguous, the lincRNA family at a particular node was assumed to be absent.
Transposable element content of lincRNAs. TE content for lincRNA loci was generated by cross-referencing a GFF file with all annotated TEs and a GFF file with non-redundant exons for lincRNA loci in each species using the intersectBed utility from BEDTools v2.17.0 123 .
Comparative analysis of resistance genes. A seed set of putative disease resistance genes was identified by screening for the NB-ARC nucleotide-binding adaptor domain (IPR002182) in predicted proteins. The list of candidate genes was further expanded to include any gene from the Compara-designated family from which the seed gene originated. These totaled 5,104 genes in 36 gene families. Among them, 381 falsely split gene models were reannotated, consolidating them into 180 genes. More consistent annotation of domain structure was accomplished by running PfamScan 150 , initially accepting domain hits with a p-score threshold of less than 0.001. We noted that this threshold was too stringent for a number of relevant Pfam domains, for example, LRRs and RPW8, resulting in false-negative detection. To increase sensitivity, we relaxed the stringency to a p score of 1 for these and any domain that, in the initial high-stringency screen, was detected in at least six genes and at least two species. Domains that met these criteria included the following: NB-ARC, LRR_1, LRR_2, LRR_3, LRR_4, LRR_5, LRR_6, LRR_8, LRR_9, LRRNT_2, RPW8, AAA_16, TIR_2, Jacalin, Pkinase, PPR_2, WRKY, zf-BED, Thioredoxin, B3, TAXi_C, TAXi_N, RPW8, UPF0261, TIR, PP2C, DUF3681, Myb_DNA-bind_4, DUF3877, C1_3, C1_2, BTB, AvrRpt-cleavage and ALIX_LYPXL_bnd. Coiled-coil domains were annotated using the Coils output of InterProScan.
The above procedure resulted in identification of 5,749 genes in 36 Compara families, of which 5,408 (94%) had detectable NB-ARC domains.
A phylogenetic analysis was performed using the NB-ARC domains extracted from the genes of L. perrieri, O. punctata and O. sativa vg. indica . Domains less than 200 amino acids in length were excluded as being likely fragments, resulting in 1,000 sequences (306 L. perrieri, 273 O. punctata and 421 O. sativa vg. indica ). To construct a multiple-sequence alignment, we first built a hidden Markov model (HMM) from 12 complete NB-ARC domains selected from genes of different families. A ClustalW alignment of these was fed into the hmmbuild program from the HMMER 3.0 package 151 . The resulting HMM model was then used to align the 1,000 sequences using hmmalign with the -trim option 151 . A maximum-likelihood phylogenetic tree was built using FastTree 2.0 152 employing the WAG + CAT model and branch support was reported using the Shimodaira-Hasegawa method on 1,000 resampled alignments. Tree visualization used FigTree (see URLs).
To estimate rates of gene duplication, we focused on a subset of 28 Compara gene families having representatives from all species. Using Ensembl API scripts, Compara trees were traversed and nodes classified as duplication nodes were tallied according to the assigned taxonomy level (based on the most recent common ancestor). To reduce false positives, only nodes with a duplication consistency score 129 greater than 0.5 were collected.
Visualization of R-gene loci for preparation of figures was made using the genoPlotR package 165 .
Accession codes. All custom codes used in this study will be made available upon request.
Life Sciences Reporting Summary. Further information on experimental design is available in the Life Sciences Reporting Summary.
Data availability. Supplementary Data 1 (also available at the CyVerse Data Commons; https://doi.org/10.7946/P2FC9Z) contains FPC maps, BES alignment data, gene and repeat annotations and fasta, Trinity assemblies of RNA-seq data, Compara gene trees and gene family and synteny relationships, chromosomefeature distribution plots and quantitative gene expression data. All sequence data, except for the leaf and root RNA-seq data from O. brachyantha, have been deposited in GenBank. Additional O. brachyantha RNA-seq data from root and leaves, and all other data, are available from the corresponding authors upon reasonable request.

Replication
Describe whether the experimental findings were reliably reproduced. NA

Randomization
Describe how samples/organisms/participants were allocated into experimental groups.

Blinding
Describe whether the investigators were blinded to group allocation during data collection and/or analysis.
NA Note: all studies involving animals and/or human research participants must disclose whether blinding and randomization were used.

Statistical parameters
For all figures and tables that use statistical methods, confirm that the following items are present in relevant figure legends (or the Methods section if additional space is needed).

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement (animals, litters, cultures, etc.) A description of how samples were collected, noting whether measurements were taken from distinct samples or whether the same sample was measured repeatedly.
A statement indicating how many times each experiment was replicated The statistical test(s) used and whether they are one-or two-sided (note: only common tests should be described solely by name; more complex techniques should be described in the Methods section) A description of any assumptions or corrections, such as an adjustment for multiple comparisons The test results (e.g. p values) given as exact values whenever possible and with confidence intervals noted A summary of the descriptive statistics, including central tendency (e.g. median, mean) and variation (e.g. standard deviation, interquartile range)

Clearly defined error bars
See the web collection on statistics for biologists for further resources and guidance.

Software
Policy information about availability of computer code

Software
Describe the software used to analyze the data in this study. We used publicly available and appropriately cited software as described.