A high-quality chromosome-level wild rice genome of Oryza coarctata

Oryza coarctata (2n = 4X = 48, KKLL) is an allotetraploid, undomesticated relative of rice and the only species in the genus Oryza with tolerance to high salinity and submergence. Therefore, it contains important stress and tolerance genes/factors for rice. The initial draft genome published was limited by data and technical restrictions, leading to an incomplete and highly fragmented assembly. This study reports a new, highly contiguous chromosome-level genome assembly and annotation of O. coarctata. PacBio high-quality HiFi reads generated 460 contigs with a total length of 573.4 Mb and an N50 of 23.1 Mb, which were assembled into scaffolds with Hi-C data, anchoring 96.99% of the assembly onto 24 chromosomes. The genome assembly comprises 45,571 genes, and repetitive content contributes 25.5% of the genome. This study provides the novel identification of the KK and LL genome types of the genus Oryza, leading to valuable insights into rice genome evolution. The chromosome-level genome assembly of O. coarctata is a valuable resource for rice research and molecular breeding.


Background & Summary
Oryza coarctata is the only halophyte species in the genus Oryza 1 , exhibiting distinct natural traits, including high tolerance to submersion and salinity 2 (Fig. 1), as well as unique leaf anatomical features, such as the presence of Kranz anatomy (Fig. 2).These features are a result of high selection pressure, allowing its propagation in a wide range of ecological conditions, from submerged saline to non-saline terrestrial soils 3 .Oryza coarctata is primarily found in coastal areas across Bangladesh, India, Sri Lanka, and Malaysia 1, 2,4,5 .
The history of research on Oryza coarctata is complex.Until 1999, it was excluded from the rice genus due to some morphological differences and was classified as Porteresia coarctata.However, a study on the evolutionary relationship between various species and genome types of the rice genus demonstrated that it belongs to the genus Oryza, and it was named Oryza coarctata.It was also determined to be allotetraploid 6 .A subsequent study identified the genome types, KK and LL, from its allotetraploid genome 7 .
Numerous studies have demonstrated that Oryza coarctata offers a wealth of genetic resources for rice breeding research, including salt resistance, drought resistance, and improved photosynthetic efficiency 2,3,8,9 .Therefore, sequencing a high-quality chromosome-level genome of Oryza coarctata is essential for genomics research and can provide new insights into the evolutionary studies of rice.In our study, we sequenced a high-quality chromosome-level genome of Oryza coarctata using PacBio HiFi reads (~59.99X) and Hi-C data.We also identified the genome types, KK and LL, from its allotetraploid genome, which can provide new insights into the evolution of genus Oryza.

Methods
Staining method of O. coarctata leaf.Fresh leaf samples from three-leaf-stage plants were selected and snipped into 1cm-by-1cm squares.Immediately, these samples were placed in Carnoy's fixative (a mixture of ethanol and acetic acid in a 3:1 ratio).After being fixed at room temperature for 48 hours, the samples were shifted to 75% ethanol for permanent preservation.If proceeding with subsequent experiments, slices were manually prepared using a double blade, perpendicular to the leaf veins.The prepared filamentous sections were stained using 0.1% methylene blue for 3 minutes.Once stained, excess dye was rinsed off, and the samples were set on microscope slides for observation under a light microscope.repair of damaged DNA, end repair, and ligation with dumbbell-shaped adapters.After an exonuclease digestion, we screened the DNA fragments using BluePippin, forming the PacBio sequencing library.For the Hi-C library, we employed formaldehyde for crosslinking cells, thereby maintaining both intra-and intermolecular interactions, and preserving the cell's 3D structure.Following crosslinking, we employed the restriction enzyme HindIII for DNA digestion and incorporated biotin-labeled nucleotides during the end repair stage.After ligation of the repaired ends, we circularized the DNA, which enabled the identification of interactive DNA positions in further sequencing and analyses.We then decrosslinked and purified the DNA, fragmenting it into 300-700 bp lengths.Interaction-representing biotin-labeled DNA fragments were captured with streptavidin magnetic beads, thereby facilitating library construction.We sequenced the PacBio library on the PacBio Sequel II system (CCS mode), generating ~34.4 Gb clean data (~59.99× ), and all the CCS reads exhibited an N50 of ~15.2 kb.The Hi-C library, sequenced on the Illmina NovaSeq 6000 (PE150), produced ~73.76 Gb clean data (Table 1).Following extraction, these RNA samples were combined in equal measures, from which an RNA-seq library was prepared.The transcriptomes were sequenced on the Illmina NovaSeq.6000 platform, operated by the Biomarker Technology Company, Beijing, China.The sequencing process produced 13.23 Gb of short-read RNA-seq data (Table 1), which was used for predicting whole-genome protein-coding genes.

Genome sequencing. We began with fresh
Genome assembly.We used hifiasm software 10 to assemble the high-quality HiFi reads, which yielded a total of 460 contigs with a total length of 573.4 Mb and an N50 of 23.1 Mb (Table 2).Using Hi-C data, more than 96% of the contigs have been anchor to 24 chromosomes (Fig. 3).Subsequently, we joined contigs into scaffolds using Hi-C clean data.The 46.55% of unique mapped read pairs were valid interaction pairs and were used for correction of scaffolds and clustered, ordered and orientated scaffolds onto chromosomes by LACHESIS 11 .Before the assembly of chromosomes, we first executed a preassembly phase to correct errors in scaffolds, requiring the division of scaffolds into average segments of 50 kb.The Hi-C data were then mapped to these segments using the BWA (version 0.7.10-r789) 12 software.We preserved uniquely mapped data for assembly operations using LACHESIS software.We manually checked any pair of segments that exhibited inconsistent connection with the raw scaffold data.These corrected scaffolds were subsequently assembled using LACHESIS.After this process, we Fig. 4 Hi-C contact map of the chromosome-level assembly of O. coarctata.The intensity of interactions was calculated using a bin size of 300,000 bp.
manually adjusted any placement and orientation errors that exhibited distinct chromatin interaction patterns.
repeat annotation.Transposon element (TE) and tandem repeat were masked and annotated by the following workflows.TE were identified by a combination of homology-based and de novo approaches.We first customized a de novo repeat library of the genome using RepeatModeler 13 , which can automatically execute two de novo repeat finding programs, including RECON (version 1.08) 14 and RepeatScout 15 .Then full-length long terminal repeat retrotransposons (fl-LTR-RTs) were identified using both LTRharvest 16 and LTR_finder 17 .
The high-quality intact fl-LTR-RTs and non-redundant LTR library were then produced by LTR_retriever 18 .Non-redundant species-specific TE library was constructed by combining the de novo TE sequences library above with the known Repbase (version 19.06) 19 , REXdb (V3.0) 20 and Dfam (v3.2) 21 database.Final TE sequences in the Oryza coarctata genome were identified and classified by homology search against the library using RepeatMasker v4.10 22 .Tandem repeats were annotated by Tandem Repeats Finder 23 and MIcroSAtellite identification tool (MISA v2.1) 24 (Tables 4, 5).
In order to evaluate the accuracy of gene prediction, we compared the length distributions of protein-coding genes, coding sequences (CDS), exons, and introns of our study species with those from four additional reference species (A.thaliana 34 , O. brachyantha 35 , O. punctata 36 , and O. sativa 37 ).Notably, we did not observe any significant differences in the length distribution of gene features among these species (Fig. 6, Table 6).

Noncoding rNas annotation.
Non-coding RNA (ncRNA) refers to RNA that does not encode proteins, including various types of functional RNAs such as microRNA, rRNA, and tRNA.Different strategies were used to predict different ncRNAs based on their structural characteristics.The tRNA was identified using tRNAscan-SE v1.3.1 38 .The rRNA prediction was mainly based on the Rfam(v 12.0) 39   (v 0.9) 40 .miRNA was identified using the miRBase 41 database, while snoRNA and snRNA were predicted based on the Rfam(v 12.0) database and using Infenal 1.1 42 .A total of 2,804 tRNAs, 9,075 rRNAs, and 157 miRNAs were predicted (Table 7).

Pseudogene prediction.
Pseudogenes are sequences similar to functional genes, but they have lost their original function due to mutations such as insertions or deletions.We used GenBlastA v1.0.4 43  genome after masking the loci of real genes, in order to identify homologous gene sequences (putative genes).
We then used GeneWise v2.4.1 44 to detect premature stop codons and frameshift mutations in these sequences, and ultimately predicted 28 pseudogenes (Table 8).
Fig. 5 The genes that are integrated originated from the distribution maps of three prediction methods.

Discovery of genomic variations among K and L.
We utilized the syntenic blocks between Oryza coarctata (KKLL) and its related species Oryza puctata (BB) (Fig. 7) to uncover a pairing relationship among the 24 chromosomes.Then, using Subphaser 49 based on the principle of K-mer frequency difference between genomes of different species, we successfully separated the heterozygous chromosomes KK (~271 Mb) and LL (~261 Mb) from the Oryza coarctata genome (Fig. 8).A whole-genome synteny analysis was conducted between Subgenome K and Subgenome L using MUMmer, which, as shown in Fig. 9a, revealed a high-level of overall concordance between the K type and L type genomes.To further investigate genomic variations and local differences between the two assemblies, we employed SyRI v1.5 software 50 .This analysis led to the identification of several Mb-sized structural variations such as inversions, translocations, and duplications (Fig. 9b)

Data Records
The sequencing data, genome assembly and annotation data reported in this paper have been deposited in the Genome Warehouse in National Genomics Data Center (NGDC), Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation 51 under the BioProject accession number PRJCA016514 that is publicly accessible at https://ngdc.cncb.ac.cn/gwh.All the clean genome sequencing data including PacBio long-read data 52 , Illumina short-read DNA-seq 53,54 .and Hi-C data 55 , as well as Illumina short-read RNA sequencing data 56 were deposited in the Genome Sequence Archive (GSA) 57 of    NGDC under the accession number CRA011195.The genome assembly and annotation data have been deposited in the Genome Assembly Sequences and Annotations (GWH) of NGDC under accession number GWHCBHR00000000.The assembled genome has also been deposited in the NCBI assembly with the accession number JAULJY000000000 58 .The annotation results of repeated sequences, gene structure and functional prediction were deposited in the Figshare database 59 .
technical Validation assessment of the genome assembly.To evaluate the quality of the assembly, we assessed it from three different perspectives: second-generation data alignment rate, CEGMA evaluation, and BUSCO evaluation.The second-generation data alignment rate was over 99%, indicating the high accuracy of our assembly.Furthermore, the CEGMA evaluation showed that over 98% of the genes and more than 95% of the highly conserved genes were present in the assembly.The BUSCO evaluation also demonstrated the completeness of the assembly, with a score of 97.83% (Tables 10-12).Moreover, we evaluated the result of Hi-C based pseudo-chromosomes construction.LACHESIS software was utilized to divide and sequence the genome sequences into groups, while also orienting them.Manual mapping and inspection were then performed to obtain the chromosome level genome version.Our manual checks entailed re-examining the raw Hi-C data, confirming the inconsistency, and determining the correct alignment or orientation based on the highest number of supporting read pairs.Furthermore, the adjustment of placement and orientation errors exhibiting obvious discrete chromatin interaction patterns was performed when the chromatin interaction patterns indicated an arrangement inconsistent with the majority of the data.These adjustments were made based on the same principle of choosing the alignment or orientation that was supported by the highest number of read pairs.After the Hi-C assembly and manual heat map adjustments, it was determined that the 24 chromosomes contained a total of 556,116,023 bp genome sequence, accounting for 96.99% of the sequences located on the chromosomes.Among those sequences located on the chromosomes, the sequence and direction could be determined in 554,379,116 bp, accounting for 99.69% of the total sequence located on the chromosomes.assessment of the genome annotation.The number of genes supported by each prediction method was counted, and the majority of the genes were predicted using transcriptome-based and homology-based methods, indicating the high quality of the predictions.The embryophyta database of BUSCO contains 1,614 conserved core genes.We used BUSCO v5.0 software to evaluate the completeness of gene prediction, and 96.22% of BUSCO genes were found in our predicted genes, indicating high completeness (Table 13).The accuracy and completeness of gene prediction were evaluated from the overall level by mapping RNA-seq clean data to the assembled genome using Hisat2 software and calculating and summarizing the coverage of annotated exons, introns, and intergenic regions.In this genome, 87.64% of the transcriptome data mapped to the annotated exons, demonstrating the high accuracy of our prediction model (Fig. 10).
Oryza coarctata seedlings, sourced from the Koyra Riverbank in Khulna district, Bangladesh (22.77N latitude and 89.48 E longitude), and utilized them for superior DNA extraction.Our extraction protocol involved initial fragmentation of DNA samples via a g-TUBE, subsequent

Fig. 6
Fig. 6 Comparisons of gene features among O. coarctata and the four other species (A.thaliana, O. brachyantha, O. punctata and O. sativa).Gene features include gene length, CDS length, exon length and intron length.

Fig. 7
Fig. 7 Syntenic blocks between O. coarctata and O. punctata, represented through a linear collinear graph (a) and a dot plot (b).

Fig. 9
Fig. 9 Whole-genome comparison of the Subgenome1 with Subgenome2 assembly.(a) Dot plot for the syntenic blocks.(b) Chromosome-level local sequence differences.

Table 1 .
Sequencing data for Oryza coarctata genome assembly.Note that the sequencing coverage is calculated by the genome size of 573Mb.

Table 2 .
The Assembly Results for Oryza coarctata genome assembly.
database and predicted using barrnap

Table 3 .
Statistics for Chromosome-level assembly of the Oryza coarctata genome.

Table 4 .
to compare the Summary of the TE sequences in the Oryza coarctata genome.

Table 5 .
Summary of the tandem repeat sequences in the Oryza coarctata genome.

Table 6 .
Statistics for Gene prediction annotation in the Oryza coarctata genome.

Table 7 .
Statistics for Noncoding RNAs annotation in the Oryza coarctata genome.

Table 8 .
Statistics for Pseudogene prediction in the Oryza coarctata genome.

Table 9 .
Statistics for Functional annotation in the Oryza coarctata genome.

Table 10 .
Statistics of Second Generation Data Alignment in the Oryza coarctata genome.

Table 12 .
BUSCO assessment results in the Oryza coarctata contig-level genome.

Table 13 .
BUSCO assessment results in the Oryza coarctata chromosome-level genome.