Relationships of wild and domesticated rices (Oryza AA genome species) based upon whole chloroplast genome sequences

Rice is the most important crop in the world, acting as the staple food for over half of the world’s population. The evolutionary relationships of cultivated rice and its wild relatives have remained contentious and inconclusive. Here we report on the use of whole chloroplast sequences to elucidate the evolutionary and phylogenetic relationships in the AA genome Oryza species, representing the primary gene pool of rice. This is the first study that has produced a well resolved and strongly supported phylogeny of the AA genome species. The pan tropical distribution of these rice relatives was found to be explained by long distance dispersal within the last million years. The analysis resulted in a clustering pattern that showed strong geographical differentiation. The species were defined in two primary clades with a South American/African clade with two species, O glumaepatula and O longistaminata, distinguished from all other species. The largest clade was comprised of an Australian clade including newly identified taxa and the African and Asian clades. This refined knowledge of the relationships between cultivated rice and the related wild species provides a strong foundation for more targeted use of wild genetic resources in rice improvement and efforts to ensure their conservation.

be attributed to a relatively rapid diversification of the AA genome, differential choice of genes, use of insufficient data, incomplete lineage sorting, misidentification of accessions and introgression.
Previously, phylogenetic analysis 1,4,5 has followed the laborious process of amplifying selected loci, some of which unfortunately have not provided sufficient phylogenetic resolution. Recent advances in next generation sequencing have led to the sequencing of various chloroplast genomes which have continued to find utility in various areas of plant science. The use of whole chloroplast genomes, whose potential as a universal barcode has recently been demonstrated 6 , helps to overcome the previously laborious process of data generation. While nuclear data may lead to inconsistent phylogenies due to recombination, plastids have particular advantages in phylogenetic reconstruction as they are structurally stable, generally uniparental, haploid and non-recombinant 7 . Therefore, the aim of this study was to use massively parallel sequencing to reconstruct the evolutionary and phylogenetic relationships of the Oryza AA genome species from whole chloroplast genome sequences.
Assembly of paired end reads resulted in nine chloroplast genomes whose sizes ranged from 134563 to 134911 bp with O. longistaminata having the smallest genome ( Table 1). The assembled genomes had a quadripartite structure which is typical of angiosperms ( Fig. 1 and Supplementary results online). Aligning of full chloroplast sequences resulted in an alignment that was 143331bb in length with 134353 characters being constant. A total of 484 characters were variable out of which 221 were parsimony informative (see Supplementary Table S2 online). We obtained a well resolved and strongly supported phylogeny with a strong hierarchy of clades (Fig. 2). The resultant phylogenetic tree had two main distinct clades where the primary division was into a clade including O. glumaepatula and O. longistaminata which formed the basal clade and one containing all the other species. The large clade in turn contained two well resolved clades, the Australian clade and another one with Asian and African species. The various regions/continents which are used to refer to the different taxa in this study represent their provenance, and in all cases were also the source of collection.
The different phylogenetic criteria namely maximum likelihood, maximum parsimony and neighbour joining produced trees that were congruent, showing the robustness of this approach. There are varied and sometimes contradictory reports on the effect of indels on phylogenetic analysis, with different recommendations being advanced on how they should be treated 8,9 . In our study, deletion of indels as is sometimes recommended, resulted in generally poorly resolved and inconsistent phylogeny. Neighbour joining analysis yielded a poorly resolved tree while maximum parsimony and maximum likelihood analysis resulted in a tree that is almost similar to the one we have obtained without deleting the indels, except that O. glaberrima 2 moved next to the basal clade. Overall, our study shows that deleting the indels prior to phylogenetic analysis may not be an effective approach. Our results are consistent with those of Dessimoz and Gil 10 who concluded that indels carry substantial phylogenetic signal and deleting them can be detrimental in phylogenetic analysis. Our alignment had long indels and perhaps it is the size of the indels that plays a role in determining their phylogenetic effect.
Among the inconsistencies observed in previous studies, the issue of the radiation of AA genome, basal taxa and divergence time have perhaps been the most outstanding. This study shows that the basal clade consists of O. longistaminata and O. glumaepatula which are found in Africa and South America respectively. This divergence could be as a result of long distance dispersal through movement of animals especially birds, followed by ecological differentiation. These species are too close to have a divergence due to a pan-Gondwanaland distribution, and the continental drift between Africa and South America which is estimated to have occurred approximately 100-120 Mya 11,12 . These two species show a genetic relationship that is common in many other species of African and South American origin 13,14 . The presence of the most primitive extant grasses in South America 15 may also seem to suggest the Oryza genus may have had some ancestry here. Molecular divergence dating was conducted using both relaxed and strict clock approaches. The results obtained using the strict clock approach were generally closer to the estimates reported in previous studies, as compared to those obtained using the relaxed clock approach which were generally much older. Strict clock is known to be superior to relaxed clock in case of shallow phylogenies 16 . With the Oryza AA genome species having diverged only recently, we believe the strict clock approach provides better divergence date estimates and hence only these estimates will be highlighted in this paper. Based on whole chloroplast genome sequences, this study estimates that the AA genome may have diverged about 0.69-1.11million years ago (Mya) ( Table 2 and Fig. 3), a period that seems to correspond to the Pleistocene. This estimate is less than the estimates of 2 Mya given in an earlier study based on intron sequences of four nuclear genes 5 . The results also suggest that the AA genome group diverged from other Oryza species around 3.36-5.22 Mya. Relatively recent long distance dispersal remains the most reliable explanation for the current distribution of AA genome species. Though chloroplast genomes may have particular advantages in resolving phylogenetic relationships, Middleton et al. 17 , noted that their use in divergence dating may be problematic especially in species with short evolutionary times. This is due to the fact that presence of multiple chloroplast haplotypes may not correspond with species divergence since they may have diverged long before actual species divergence. This may therefore lead to inaccurate divergence estimates and may explain the difference between nuclear based and chloroplast based divergence estimates. Increasing the amount of data analysed in divergence dating has been found to result in more accurate estimates 18 and we therefore believe that due to the use of a large data set, our estimates are robust and reliable.
The origin of the AA genome has also remained quite contentious with different arguments and opinions being advanced. Previous studies suggest the AA genome group may have originated from Africa since the majority of the studies have reported O. longistaminata as the most ancestral species 19,20 . Furthermore, O. longistaminata has unique morphological features that are not present in other AA genome species such as self-incompatibility, rhizomatous and unique characteristics of the ligules, all consistent with it being significantly differentiated. The conventional wisdom in evolutionary biology is that annual species are derived from perennial ancestors. This is consistent with our findings which The basal split in this study was followed by the divergence of the Australian species from the Asian and African taxa and is estimated to have occurred about 0.54-0.86 Mya. Contrary to some studies which show that the spread of Oryza A genome species into Australia may have occurred earlier 4 , this study shows that spread to all the regions seems to have occurred just around the same time, an observation earlier made by Vaughan et al. 21 . The sharing of some taxa such as some types of Sorghum,  Nodes are labelled in the phylogenetic tree shown in Fig. 3. b Divergence estimates are in millions of years. c 95% HPD (Highest Posterior Density) is indicated in parenthesis.
Vigna radiata var. sublobata and Gossypium species between Australia and Africa suggests that there is a phytogeographic link between the two regions 22 . This study shows that the divergence between the African and Asian species occurred twice, with the first one being the divergence between ancestors of O. longistaminata and those of the Asian species while the second one is the radiation of the ancestors of the two cultivated taxa. The divergence between the two cultivated taxa is estimated to have occurred about 0.51-0.65 Mya (Table 2 and Fig. 3). The timing of this divergence corresponds to estimates given in previous studies of 0.64-0.7 Mya 5,23 . However, this estimated time of divergence is much earlier than the time of domestication of Asian and African rice, which is reported to have taken place about 10,000 years and 3500 years ago respectively. Many conflicting theories have been advanced on the origin and domestication of Asian cultivated rice and despite great research efforts, the debate rages on. Traditionally, these theories were broadly classified as supporting either monophyletic 24,25 or polyphyletic origin 26,27 . Recently however, some domestication pathways that deviate from these previously suggested models have been proposed 28 . In one of these models, Huang et al. 24 , suggested that japonica was domesticated from O. rufipogon in Southern China while indica was the product of a cross between japonica and local wild rice in South East Asia and South Asia. Analysis of genetic polymorphisms between the O. sativa subspecies shows some genetic differentiation with indica 1 and indica 2 having 296 and 247 nucleotide differences respectively between them and the japonica reference (GU592207). This genetic differentiation is confirmed by their phylogenetic placement where both japonica and indica formed distinct clades, associated with O. rufipogon and O. nivara respectively. This genetic differentiation between the two subspecies of O. sativa and the associated clustering pattern seems to support polyphyletic evolution, and suggests independent domestication of Asian rice as opposed to the single origin theory. Proponents of the independent domestication theory, posit that the diverged genomic backgrounds of japonica and indica were derived from pre-differentiated ancestral gene pools during separate and geographically isolated domestication events. This theory has gained support from several phylogenetic, genetic distance and genomic palaeontology analyses. Our findings suggest the possibility that the maternal genome of indica may have been derived from O. nivara while the maternal genome of japonica may have originated from O. rufipogon. The nuclear genomes may have a more complex origin. Similar to Asian rice, the origin of African rice has also been controversial and remains grossly understudied. Although the widely held proposition is that African rice was domesticated from O. barthii 29,30 , it has also been hypothesized that it evolved from Asian rice through sympatric speciation before it was later domesticated in West Africa 31 . According to this theory, upon introduction of Asian rice to Africa, drastic changes in climate and especially the high temperatures in the Sahel may have led to mutations which may have subsequently been selected for by farmers. This latter theory has however not gained much support and the arguments presented in its support seem less convincing. An Asian origin of African rice has also been proposed but dismissed 29 . The shattering nature of O. glaberrima and presence of red pericarp in some varieties seems to suggest that the process of domestication is incomplete.  30 and their results supported the domestication theory originally proposed by Portères 32 which was later supported by Li et al. 29 . Portères suggested that African rice was first domesticated in the Inland Delta of the Upper Niger River before spreading to two secondary centres, one located along the Senegambian coast and the other in the Guinea highlands 32 . The wide inconsistencies in evolutionary and phylogenetic relationships may be attributed to the use of germplasm that has been collected from areas that have been disturbed through human activity as well as those that are misidentified. Such human activities may have eroded important genetic signatures thus leading to contradictory theories and conclusions.
The strong geographic differentiation among the "O. rufipogon" accessions from either Asia or Australia was evidently clear. Information on such genetic patterns is valuable in guiding both germplasm use and conservation efforts particularly when devising sampling strategies. Though the observed genetic differentiation in "O. rufipogon" accessions from different regions is consistent with previous analysis conducted using molecular 33 and hybridization data 34 which showed them as distinctly different taxa, it should be interpreted with caution. The clustering of "O. rufipogon" in different clades may be due to incomplete lineage sorting or introgressive hybridization which may have led to chloroplast capture 35,36 . O. rufipogon is an outcrossing species 34 and so chloroplast capture is possible especially with species with which it has sympatric distribution 37 . Introgression is more frequent in chloroplast than in nuclear genomes and this phenomenon of reticulation and chloroplast capture usually leads to incongruence between nuclear and chloroplast based phylogenies 35 . It has generally been demonstrated that chloroplast based phylogenies usually reflect geographical distribution as compared to nuclear phylogenies which normally correspond with taxonomic relationships [37][38][39] . As recommended by Rieseberg and Soltis 35 , conducting phylogenetic analysis using both nuclear and chloroplast genomes is an important strategy to avoid erroneous phylogenies. The use of unlinked nuclear genes would be useful in testing for incomplete lineage sorting and this study therefore serves as a useful starting point, whose findings can be compared with the analysis of selected nuclear loci.
Recent studies using both molecular and morphological data have discovered two distinct perennial taxa in Northern Queensland, Australia which are believed to be new Oryza gene pools 40,41 . One of these taxa has previously been identified as O. rufipogon 40 . With the chloroplast genome of one of the two perennial taxa closely resembling that of the annual O. meridionalis 33 , there is need for further work to assess the possibility of introgressive hybridization resulting in chloroplast capture as already discussed above. The discovery of these gene pools has a great impact on global food security as it provides the breeders with critical genetic resources. The Oryza gene pools in South America/Africa and Australia are likely to possess important traits as they are genetically isolated from domesticated rice and hence remain genetically uncontaminated by gene flow from the large Asian domesticated rice populations. DNA extraction and sequence assembly. DNA was extracted from approximately 3 gm of plant tissue using a modified cetyltrimethylammonium bromide (CTAB) method that was originally published by Carroll, et al. 42 The DNA quality and quantity was assessed by agarose gel electrophoresis (0.7%, 103 V for 45 minutes) as well as by an Agilent BioAnalyzer 2100 (Agilent Technologies). Approximately 3 μ g of total DNA from each accession was indexed and pooled together in one lane of an Illumina Genome Analyzer and sequenced. In total, we sequenced 9 Oryza chloroplast genomes representing five species. The raw reads were assembled into whole chloroplast genomes in a multi-step approach employing a modified pipeline that involved a combination of both reference guided and de novo assembly approaches 43 . First, trimming was done with a quality score limit of 0.05 on CLC Genomics workbench 6.5.1 before undertaking sequence assembly. In order to discard non-CP reads, the trimmed reads were assembled by mapping to the published O. sativa chloroplast reference sequence (GU592207) using CLC Genomic Workbench 6.5.1 under the following parameters: length fraction 0.5, similarity fraction 0.8, mismatch cost 2, deletion and insertion costs 3. In order to allow assembly of the inverted repeats as well as other repetitive elements, reads showing non-specific matches were mapped randomly. The vote majority conflict resolution mode was used in order to ensure inclusion of only chloroplast specific reads thus avoiding contribution of nuclear and mitochondria reads to the consensus sequences. With the aim of correcting any assembly errors especially relating to possible misalignment due to insertions and deletions, the read mapping consensus sequences obtained were subjected to two iterations of forced realignment using the respective InDel variant tracks.

Methods
After reference guided assembly, de novo assembly was undertaken with a minimum contig length of 1000 bp. The resultant de novo contigs were aligned to the reference sequence using BLAST (http:// blast.ncbi.nlm.nih.gov/) as implemented in CLC Genomic Workbench 6.5.1 and the matching cp contigs ordered by alignment to the reference sequence using clone manager 9.0. The trimmed reads were later mapped to the de novo assembled sequence as reference and the resultant consensus sequence subjected to one cycle of forced realignment using InDel guidance variant tracks. The de novo and reference guided consensus were then compared and any conflicts resolved by reference to the reads. Phylogenetic analysis. The  The alignment was physically inspected and confirmed to be correct. The number of variable characters and parsimony informative sites were analysed in MEGA6 (ref. 45) with the out-group excluded. The aligned sequences were analysed by maximum parsimony (MP), neighbour joining (NJ), maximum likelihood (ML) and Bayesian Inference criteria using PAUP 4.0b10 (ref. 46) implemented through Geneious v 7.0.5. The Neighbour joining criterion used the general time-reversible (GTR) distance measure with rates assumed to follow gamma distribution. The maximum parsimony analysis was performed using heuristic searches with the 'MulTrees' option followed by tree bisection-reconnection branch swapping. Topological robustness was assessed using bootstrap method with 1000 random addition replicates. All characters were unordered and were accorded equal weight with gaps being treated as missing data. A separate phylogenetic assessment was also conducted with the gaps being deleted manually but resulted in a poorly resolved and inconsistent tree. Appropriate nucleotide substitution models were determined using Modeltest 3.7 (ref. 47). The models were chosen using Akaike Information Criterion (AIC) with the models being used for subsequent maximum likelihood and Bayesian Inference analysis. Maximum likelihood analysis was undertaken using the GTR+ G+ I model. Bayesian phylogenetic inference was conducted using MrBayes 3.1 (ref. 48) with Monte Carlo Markov Chains (MCMC) estimation of posterior probability distributions. The analysis used the GTR model with the rate variation assumed to follow gamma distribution. Four independent runs of 1,100,000 MCMC were performed with trees sampled after every 200 runs followed by a burn in length of 100,000 MCMC. O. officinalis acted as the out group. Molecular divergence dating. Molecular clock analysis was conducted using Bayesian method implemented in BEAST program 49 . The analysis was conducted using both strict and relaxed clock approaches with the data being partitioned into coding, non-coding and intergenic sequences. Calibration was done based on the divergence time between Zea mays and Triticum aestivum which is estimated to have occurred about 50 Mya 50 . The HKY and four gamma categories substitution model was used with the Calibrated Yule model priors being used to model speciation. TreeAnnotator (pre-release version 2.1.2) was used to calculate maximum probability clade tree and the final tree was visualized in FigTree v1.4.2.
Genome annotation. Annotation of the sequenced genomes was done using CpGAVAS 51 (http:// www.herbalgenomics.org/0506/cpgavas) followed by manual adjustments of start and stop codons as well as intron/exon boundaries which was done using Apollo 52 . GenBank files were created using sequin and were subsequently used to draw chloroplast gene maps using OGDraw v1.2 (ref. 53).