The draft genome sequence of an upland wild rice species, Oryza granulata

Exploiting novel gene sources from wild relatives has proven to be an efficient approach to advance crop genetic breeding efforts. Oryza granulata, with the GG genome type, occupies the basal position of the Oryza phylogeny and has the second largest genome (~882 Mb). As an upland wild rice species, it possesses renowned traits that distinguish it from other Oryza species, such as tolerance to shade and drought, immunity to bacterial blight and resistance to the brown planthopper. Here, we generated a 736.66-Mb genome assembly of O. granulata with 40,131 predicted protein-coding genes. With Hi-C data, for the first time, we anchored ~98.2% of the genome assembly to the twelve pseudo-chromosomes. This chromosome-length genome assembly of O. granulata will provide novel insights into rice genome evolution, enhance our efforts to search for new genes for future rice breeding programmes and facilitate the conservation of germplasm of this endangered wild rice species.

not anchored to the chromosomes 11 . This undoubtedly limits the use of O. granulata as a basic Oryza lineage to accurately infer the genome evolution of O. granulata compared to other rice species at the chromosome level.
O. granulata is naturally distributed in South Asia, including China, India, Cambodia, Indonesia, Laos, Myanmar, Nepal, the Philippines, Sri Lanka, and Thailand 18 . It is seriously threatened due to ongoing human disturbance and rapid deforestation 19 . Previous population genetic studies revealed that this species possesses fairly low levels of genetic diversity within populations but high genetic differentiation among populations 20,21 . The considerable genomic diversity detected through pan-genome analysis demonstrates that de novo assembly of more than one genome helps reveal the origin and evolutionary forces of population structure and levels of genomic diversity 22 . Thus, sequencing an additional genome of O. granulata from a genetically different population compared with the previously sequenced accession collected in India 11 is needed.
The availability of a chromosome-scale genome of O. granulata will lay the foundation for further evolutionary studies as well as the improvement of desired agronomic traits relevant to rice breeding programmes. Here, we present a new chromosome-scale genome of O. granulata assembled de novo using the Illumina and Hi-C sequencing platforms. In contrast to the previously sequenced O. granulata accession (IRGC Acc. No. 102117) from India, the sequenced plant was collected in Yunnan, China, and thus, the plants were geographically separated. The obtained genome assembly will provide novel insights into the genomic diversity and genome evolution of the genus Oryza and enhance the exploration of precious wild rice germplasm resources.

Methods
Plant material collection, total DNA isolation and genome sequencing. For genome sequencing, we collected dozens of O. granulata plants from Menghai County, Yunnan Province, China, which were planted in the greenhouse of the Kunming Institute of Botany, Chinese Academy of Sciences. Fresh and healthy leaves were harvested from the best-growing individual and immediately frozen in liquid nitrogen, followed by preservation at −80 °C in the laboratory prior to DNA extraction. High-quality genomic DNA was extracted from leaves using a modified CTAB method 23 . RNase A was used to remove RNA contaminants. The quality and quantity of the extracted DNA were examined using a NanoDrop 2000 spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and electrophoresis on a 0.8% agarose gel, respectively.
A total of three 260-bp short-insert libraries and five long-insert libraries (3 kb, 10 kb and 20 kb) were prepared following Illumina's instructions. Then, the Illumina HiSeq. 2000 (PE100 and PE101) and HiSeq. 2500 (PE125 and PE150) platforms were employed for whole-genome sequencing according to the standard Illumina protocols (Illumina, San Diego, CA, USA). In total, we generated approximately 133.38 Gb (~168.41×) of raw data (Table 1).
Hi-C data preparation. We constructed Hi-C libraries using young leaves collected from the same individual plant of O. granulata for high-quality DNA isolation by following the standard protocol described previously with certain modifications 24 . Approximately 5-g leaf samples were cut into minute pieces and cross-linked by 2% formaldehyde solution at room temperature for 15 minutes. Then, the sample was mixed with excess 2.5 M glycine to stop the cross-linking reaction and neutralize the remaining formaldehyde. The Hi-C library was constructed and sequenced by Annoroad Genomics (Beijing, China) with the standard procedure described as follows. The cross-linked DNA was extracted and then digested with MboI restriction enzyme. The sticky ends of the digested fragments were biotinylated and proximity ligated to form ligation junctions that were enriched for and then ultrasonically sheared to a size of 200-500 bp. The biotin-labelled DNA fragments were pulled down and ligated with Illumina paired-end adapters and then amplified by PCR to produce the Hi-C sequencing library. The library was sequenced with the Illumina HiSeq. X Ten (PE150) platform, and a total of ~109.41 Gb (~138.15×) of raw sequencing data was produced (Table 1).

RNA isolation and transcriptome sequencing.
A total of seven tissues representing different developmental stages of O. granulata were sampled to generate the RNA-Seq data needed for subsequent genome annotation. These tissues included panicles at three different stages of flower development, flag leaves, and stems and the shoots and roots of three-leaf seedlings. Because of the low germination rate of O. granulata, seedlings were germinated from seeds harvested from multiple plants, while the remaining tissues were sampled from the www.nature.com/scientificdata www.nature.com/scientificdata/ individual used for genome sequencing. All collected samples were quickly frozen in liquid nitrogen and stored in a refrigerator at −80 °C before RNA extraction.
RNA was individually extracted from each tissue using TRI reagent (Molecular Research Centre, Inc., Cincinnati, OH, USA), according to the instructions provided with the reagents. Seven libraries were constructed and sequenced by Biomarker Technologies (Beijing, China) on the Illumina HiSeq. 2500 platform with a read length of 126 bp. In total, ~21.8 Gb of high-quality data was obtained and used for subsequent assembly after filtering the low-quality and duplicated reads caused by PCR amplification ( Table 2).
Estimation of genome size. The genome size of O. granulata was estimated using two methods, including k-mer frequency distribution and flow cytometric analysis. We first estimated and validated the genome size of O. granulata using flow cytometric analysis. A total of 40-50 mg of fresh leaves was collected for sample preparation using the OTTO method 25,26 . Nuclear samples were analysed using a BD FACSCalibur (BD Biosciences, USA) flow cytometer. CellQuest software (BD Biosciences, USA) was used to analyse the flow cytometry results and gate all cells of interest. Here, CV = D/M × 100%, where D is the standard deviation of the cell distribution and M is the average of the cell distribution. The average coefficient of variation (CV) was used to evaluate the results, with CV < 5% considered reliable. Nuclear DNA content was calculated as a linear relationship between the ratios of 2C-value peaks of the sample and standard.
When O. sativa ssp. japonica cv. Nipponbare (~389 Mb) 9,12 and Zea mays ssp. mays var. B73 (2,300 Mb) 27 were employed as inner standards, the estimated genome size of O. granulata was ~672 Mb and ~707 Mb, respectively, both of which were smaller than the previous estimate (882 Mb) (Fig. 1). Meanwhile, we generated the 17-mer occurrence distribution of sequencing reads from short libraries using the k-mer method (Fig. 2). Then, we estimated the genome size to be ~792 Mb, and the proportion of repeat sequences and heterozygosity rate of the genome were determined to be approximately 70.7% and 0.76%, respectively, using GCE 28 .
Genome assembly. We assembled the O. granulata genome using ALLPATHS-LG 29 and SSPACE 30 . First, the high-quality paired-end Illumina reads from short-insert-size libraries were assembled into contig sequences using ALLPATHS-LG. This process yielded assembly results with a contig N50 value of 22,359 bp and total length of ~732.33 Mb. Second, all mate-pair reads with large insert sizes (≥2 kb) were aligned onto the preassembled   www.nature.com/scientificdata www.nature.com/scientificdata/ contigs. According to the order and distance information, the assembled contigs were further elongated and eventually combined into scaffolds using SSPACE. We closed the gaps that might be repeat sequences masked during the construction of scaffolds using GapCloser 31 . Briefly, all paired-end sequencing reads were first mapped onto the assembled scaffolds, and then those read pairs with one read well aligned to the contigs and another located in a gap region were retrieved and locally assembled to close gaps. Consequently, the O. granulata genome assembly had a total length of ~736.66 Mb, which accounted for ~93% of the genome size estimated by k-mer analysis, containing 2,393 scaffolds (N50 = 916.3 kb; N90 = 239.8 kb) and 29,963 contigs (N50 = 43.9 kb; N90 = 12.0 kb). There were 1,146 scaffolds with lengths >100 kb, among which the largest scaffold had a sequence length of 4,040,447 bp (Table 3).
Three approaches were used to evaluate the completeness and accuracy of this genome assembly. First, we mapped all high-quality reads (~186.8 million, ~87×) from short-insert-size libraries back to the assembly using BWA (Burrows-Wheeler Aligner) 32 , showing good alignments with an average mapping rate of ~99.46%. Second, the completeness of genome assembly and gene prediction was assessed with BUSCO (Benchmarking Universal Single-Copy Orthologs) 33 according to collections from the Embryophyta lineage. Our gene predictions revealed 1,390 (96.53%) of the 1,440 highly conserved core proteins in the Embryophyta lineage. Third, the RNA sequencing reads generated in this study were assembled into a total of 137,380 transcripts using Trinity 34 , which had an N50 value of 1,035 bp and a total length of ~88.8 Mb. Then, they were aligned back to the genome assembly using GMAP 35 . Our results showed that a total of 89,977 transcripts could be successfully aligned to the genome assembly with a mapping rate of 65.5%. After filtering the low-quality reads using Trimmomatic 36 , clean paired-end reads of Hi-C data were mapped to the assembled scaffolds by BWA-MEM 32 . Finally, 1,265 (723.2 Mb, 98.2% of the assembled length) of 2,393 scaffolds were mapped, grouped and ordered into 12 chromosomes using LACHESIS 37 (Table 4).
Annotation of protein-coding genes. We predicted protein-coding genes of the O. granulata genome using three methods, including ab initio gene prediction, homology-based gene prediction and RNA-Seq-aided gene prediction. Prior to gene prediction, the assembled O. granulata genome was hard and soft masked using RepeatMasker 38 . We adopted Augustus [39][40][41] and SNAP 42 to perform ab initio gene prediction. Models used for each gene predictor were trained from a set of high-quality proteins generated from the RNA-Seq dataset. We    44,45 to conduct homology-based gene prediction. First, the protein sequences were aligned to the O. granulata genome assembly using Exonerate with the default parameters. Second, given that GeneWise is a time-consuming program to run, we mapped the protein sequences from O. sativa ssp. japonica cv. Nipponbare (MSU 7.0) to the O. granulata genome using GenBlastA 46 prior to GeneWise prediction. Homologous genomic fragments of the target genes together with their 5-kb upstream and downstream flanking sequences were then extracted using an in-house Perl script. Finally, GeneWise was used to align them against the corresponding proteins to determine gene structures. To carry out RNA-Seq-aided gene prediction, we first assembled clean RNA-Seq reads into transcripts using Trinity 34 , which were then aligned to our genome assembly using PASA 47 . The output included a set of consistent and non-overlapping sequence assemblies, which were used to describe the gene structures.
We combined all gene structures obtained from the three above-mentioned sets of predictions, including ab initio gene predictions and protein and transcript alignments, with the weighted consensus gene set using EVidenceModeler (EVM) 48 . To perform further filtering, the genes with peptide lengths shorter than 50 amino acids and/or containing inner stop codons were removed. In total, 40,131 protein-coding genes with an average length of 3,152 bp were predicted in the assembled O. granulata genome.
To assess the quality of gene prediction, we compared the length distributions of protein-coding genes, coding sequences (CDS), exons and introns with those from the other four species (Arabidopsis thaliana, Sorghum bicolor, Z. mays and O. sativa), among which we did not observe any obvious differences in the length distribution of gene features ( Fig. 3; Table 5). Then, we surveyed the proportion of our predicted O. granulata gene sets supported by RNA-Seq and homologous proteins. We aligned the assembled transcripts against our gene predictions using the BLAST program 49,50 . Only hits with a coverage ≥80% and an identity ≥90% were retained. Our analysis showed that approximately 47.58% (19,094) of the predicted gene models were supported by RNA-Seq data. Next, we downloaded protein sequences of O. sativa ssp. japonica cv. Nipponbare and aligned them to the predicted gene models using BLAST. We filtered those hits with an identity <30% or a gene coverage <80%. We found that 23,871 gene models, accounting for approximately 59.48% of the total genes, were supported by evidence of homologous proteins in rice. Combining genes validated by the two above-described methods, 28,823 genes, representing ~71.82% of the total O. granulata gene set, were supported by RNA-Seq and/or homologous proteins ( Table 6).
Gene functions were inferred according to the best match of the alignments to the National Center for Biotechnology Information (NCBI) Non-Redundant (NR) and Swiss-Prot protein databases using BLASTP 49,50 and the Kyoto Encyclopedia of Genes and Genomes (KEGG) database with an E-value threshold of 1E-5. The motifs and domains within gene models were identified by PFAM databases 51 . Gene Ontology (GO) IDs for each gene were obtained from Blast2GO 52 . In total, approximately 85.81% of the predicted protein-coding genes of O. granulata could be functionally annotated with known genes, conserved domains, and Gene Ontology terms ( Table 6).
Annotation of non-coding RNA genes. Five different types of non-coding RNA genes, namely, transfer RNA (tRNA) genes, ribosomal RNA (rRNA) genes, small nucleolar RNA (snoRNA) genes, small nuclear RNA (snRNA) genes and microRNA (miRNA) genes, were predicted using de novo and homology search methods. We used tRNAscan-SE algorithms 53 with default parameters to identify the genes associated with tRNA, which is an adaptor molecule composed of RNA used in biology to bridge the three-letter genetic code in messenger RNA (mRNA) with the twenty-letter code of amino acids in proteins. The rRNA genes (8S, 18S, and 28S), which are the RNA components of the ribosome and associated with the enzyme representing the site of protein synthesis in all living cells, were predicted using RNAmmer algorithms 54 with default parameters. snoRNAs are a class of small RNA molecules that guide chemical modifications of other RNAs, mainly ribosomal RNAs, transfer RNAs and small nuclear RNAs. The snoRNA genes were annotated using Snoscan 55 with the yeast rRNA methylation sites and yeast rRNA sequences provided by the Snowscan distribution. snRNA is a class of small RNA molecules  www.nature.com/scientificdata www.nature.com/scientificdata/ that are found within the nucleus of eukaryotic cells. They are involved in a variety of important processes, such as RNA splicing (removal of introns from hnRNA), regulation of transcription factors (7SK RNA) or RNA polymerase II (B2 RNA), and maintenance of telomeres. The snRNA genes were identified by Infernal software against the Rfam database with default parameters 56,57 .
The miRNA genes were annotated in two steps. First, we downloaded the existing rice miRNA entries from miRBase 58 . Then, the conserved miRNAs were identified by mapping all miRBase-recorded rice miRNA precursor sequences against the assembled O. granulata genome using BLASTN with cut-offs at an identity >60% and a query coverage >60%. Second, when a miRNA was mapped to the target O. granulata genome, the surrounding sequence was checked for hairpin structures. The loci with miRNA precursor secondary structures were annotated as miRNA genes.
We annotated a total of 1,003 tRNA genes, 221 rRNA genes, 295 snoRNA genes, 101 snRNA genes and 257 miRNA genes belonging to 50 miRNA families in the O. granulata genome (Table 7). To investigate miRNA-target genes involved in important biological pathways, the target genes of miRNAs were predicted using the psRNA-Target server with default parameters. Finally, 963 miRNA-target sites were identified. The protein sequences of   www.nature.com/scientificdata www.nature.com/scientificdata/ these target genes were blasted against O. sativa proteins in the Rice Genome Annotation Project Database 59 using the BLASTP program. The results were imported into the agriGO 60 server by comparing them with the whole set of protein-coding genes of O. sativa as a background. KO (KEGG Orthology) annotation of target genes was implemented using the BlastKOALA program 61 with the eukaryote gene database.

Annotation of repeat sequences. We identified the known TEs within the O. granulata genome using
RepeatMasker with the Repbase TE library 62,63 . RepeatProteinMask searches were also conducted using the TE protein database as a query library. The annotation of repeat sequences of the O. granulata genome is summarized in Table 8. The annotation showed that approximately 61.98% (456.6 Mb) of the assembled genome consisted of repeat sequences, and the proportion of repeat sequences varied largely from one type to another. DNA transposons and other repeats contributed only ~9.83% and ~1.1% to the assembled genome, respectively. In contrast, retrotransposons represented half (~51.05%) of the genome assembly.
We constructed a de novo repeat library of the O. granulata genome using RepeatModeler, which can automatically execute two core de novo repeat-finding programs, namely, RECON 64 and RepeatScout 65 , to comprehensively conduct, refine and classify consensus models of putative interspersed repeats for the O. granulata genome. Furthermore, we performed a de novo search for long terminal repeat (LTR) retrotransposons against the O. granulata genome sequences using LTR_STRUC 66 . All intact LTR retrotransposons were classified into Ty1/copia, Ty3/gypsy and unclassified groups according to both reverse transcriptase (RT) sequence similarity and the order of ORFs using Pfam 51 . The RT sequences were retrieved from each retrotransposon element and further checked by homology searches using ClustalW 67 against the published RTs that were downloaded from the Gypsy Database (GyDB) 68 . LTR retrotransposons (~50.83%) represented most of the RNA transposons in the O. granulata genome, accounting for approximately 43.41% of the assembly. They belonged to two types of LTR retrotransposon superfamilies: Ty1/copia and Ty3/gypsy (~5.58% and ~37.83%, respectively) ( Table 8).
We also identified tandem repeats using the Tandem Repeat Finder (TRF) package 69 and the non-interspersed repeat sequences, including low-complexity repeats, satellites and simple repeats, using RepeatMasker (Table 8). A total of six types of simple sequence repeats (SSRs), from mono-to hexa-nucleotides, were identified using the MISA (MIcroSAtellite) identification tool 70 . The minimum repeat unit size was set at twelve for mono-nucleotides, at six for di-nucleotides, at four for tri-nucleotides, and at three for tetra-to hexa-nucleotides. As a result, a total of 183,339 SSRs were detected in the O. granulata genome. Of these, tri-nucleotide SSRs accounted for the largest proportion, both in quantity and sequence length, followed by tetra-nucleotides, di-nucleotides and other types (Table 9). These SSRs will provide valuable molecular markers to assist rice breeding programmes.

Data records
All sequencing reads have been deposited into the NCBI Sequence Read Archive (SRA) 71 and BIG Genome Sequence Archive 72 . The assembled genome sequence is available from the NCBI 73,74 and BIG Genome Warehouse 75 . The protein-coding gene, non-coding gene, and repeat sequence annotation results and functional prediction results are available from the Figshare database 76 .  www.nature.com/scientificdata www.nature.com/scientificdata/ technical Validation RNA integrity. Before constructing RNA-Seq libraries, the concentration and amount of total RNA were separately evaluated using a NanoDrop 2000 UV-VIS spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA), and the rRNA ratio and RNA integrity were estimated using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA). For each tissue, only total RNAs with a total amount ≥15 μg, a concentration ≥400 ng/μl, an RNA integrity number (RIN) ≥7, and an rRNA ratio ≥1.4 were used to construct a cDNA library according to the manufacturer's instructions (Illumina, USA).
Quality filtering of Illumina sequencing raw reads. To eliminate adapter contaminants and potential sequencing errors, using Trimmomatic 36 , we removed the following five types of reads: (1) reads with ≥10 bp derived from the adapter sequences (allowing ≤10% mismatches); (2) reads with unidentified bases (Ns) constituting ≥10% of their length; (3) reads with ≥40% low-quality bases (Phred score ≤5); (4) reads caused by PCR duplications (i.e., read 1 and read 2 of two paired-end reads that were completely identical); and 5) reads with a k-mer frequency ≤3 (aiming to minimize the influences of sequencing errors). These five filtering processes resulted in a total of ~108.72 Gb (~137.27×) of high-quality data, which were retained and used for subsequent analysis ( Table 1).
Comparisons of the genome assemblies and annotation. We produced high-depth sequencing data for O. granulata using the Illumina and Hi-C sequencing platforms. Then, we de novo assembled an ~736. 66 Table 9. Occurrence of simple sequence repeats (SSRs) in the O. granulata genome. Note that the minimum repeat unit size was set at twelve for mono-nucleotides, at six for di-nucleotides, at four for tri-nucleotides, and at three for tetra-to hexa-nucleotides.
www.nature.com/scientificdata www.nature.com/scientificdata/ Table 1). The contig N50 was ~43.9 kb, which was higher than that obtained for the genome assemblies of other Oryza species with similar second-generation sequencing technology 17 . With Hi-C data, for the first time, we anchored approximately 98.2% of the genome assembly into the twelve pseudo-chromosomes. Due to the short reads sequenced by the Illumina platform and a large number of repeat sequences, the total lengths of genome assembly (~736.66 Mb) and repetitive sequences (~456.57 Mb) are shorter than those in the previous genome assembly (~776.96 Mb and ~528.04 Mb, respectively) 11 (Online-only Table 1). This may be attributed to the ~21 × PacBio data overcoming the above-mentioned difficulties to some extent, resulting in the assembly of an additional portion of repetitive sequences. However, we obtained fewer scaffolds but a longer scaffold N50 compared to those in the previous genome assembly 11 . We predicted 40,131 protein-coding genes and observed that the gene and ncRNA annotations were somewhat better than those for the previous genome assembly (Online-only Table 1). This was also evidenced by the evaluation using BUSCO, showing that 1,390 genes (~96.53%) were completely identified, which is somewhat better than the number for the previous genome assembly 11 . Thus, the newly released genome assembly, which has good continuity and integrity, is comparable to other sequenced Oryza genomes.

Code availability
The sequence data were generated using the software provided by the sequencing platform manufacturer and processed with commands provided for the public software cited in the manuscript. No custom computer code was generated in this work. The following bioinformatic tools and versions were used to generate all results as described in the main text. Default parameters were used if not stated. (2020) 7:131 | https://doi.org/10.1038/s41597-020-0470-2 www.nature.com/scientificdata www.nature.com/scientificdata/ site for target accessibility analysis: 17 bp upstream and 13 bp downstream, and range of central mismatch leading to translational inhibition = 9~11 bp, http://plantgrn.noble.org/psRNATarget/. 30