A haplotype-resolved genome assembly of Rhododendron vialii based on PacBio HiFi reads and Hi-C data

Rhododendron vialii (subgen. Azaleastrum) is an evergreen shrub with high ornamental value. This species has been listed as a plant species with extremely small populations (PSESP) for urgent protection by China’s Yunnan provincial government in 2021, due to anthropogenic habitat fragmentation. However, limited genomic resources hinder scientifically understanding of genetic threats that the species is currently facing. In this study, we assembled a high-quality haplotype-resolved genome of R. vialii based on PacBio HiFi long reads and Hi-C reads. The assembly contains two haploid genomes with sizes 532.73 Mb and 521.98 Mb, with contig N50 length of 35.67 Mb and 34.70 Mb, respectively. About 99.92% of the assembled sequences could be anchored to 26 pseudochromosomes, and 14 gapless assembled chromosomes were included in this assembly. Additionally, 60,926 protein-coding genes were identified, of which 93.82% were functionally annotated. This is the first reported genome of R. vialii, and hopefully it will lay the foundations for further research into the conservation genomics and horticultural domestication of this ornamentally important species.

We ran a Benchmarking Universal Single-Copy Orthologs assessment (BUSCO 11 , v. 5.3.2) with the lineage dataset embryophyta_odb10. The complete BUSCOs (including single-copy and multi-copy) of the two haplotypes accounted for 98.5% and 98.1%, respectively, indicating good completeness of the genome. In addition, we annotated 551.06 Mb (52.19% of the whole genome) corresponding to repeat elements. A total of 60,926 protein-coding genes were identified, of which 93.82% could be functionally annotated. This is the first report of a R. vialii genome sequence, and we believe that it will provide an important resource allowing us to explore the mechanisms underlying threats to this species, as well as its evolutionary history and further utilization on ornamental horticulture.

Methods
Sampling. For genomic DNA extraction, fresh young leaves of R. vialii were collected from a single adult plant in Kunming Botanical Garden, Kunming Institute of Botany, Chinese Academy of Sciences. We also collected roots, branches, leaves, buds and fruits for transcriptome sequencing. These materials were frozen directly in liquid nitrogen and were then transferred to -80°C for preservation. The related sequencing services were performed by a commercial sequencing provider (Wuhan Benagen Technology Co. Ltd. Wuhan, China). Genome sequencing. A modified CTAB methods was performed to extract total DNA from young R. vialii leaves 12 . The concentration of DNA was assessed using NanoDrop (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 fluorometer (Life Technologies, Carlsbad, CA, USA). 1% agarose gel electrophoresis was then used to assess the purity and integrity of the resulting DNA. The short-read library with a DNA-fragment insert size of 200-400 bp was prepared using 1 μg genomic DNA following the manufacturer's instructions (BGI) and was subject to paired-end (PE) sequencing on a DNBSEQ-T7 platform (BGI lnc., Shenzhen, China) using a PE 150 model. This produced 97.39 Gb (~649 M reads) of raw data, meaning approximately 150× genome coverage (Supplementary Table 1).
Before long-read sequencing, the DNA was purified using a DNeasy Plant Mini Kit (Qiagen, Germantown, MD, USA), and the integrity of the DNA was evaluated with a Femto Pulse (Agilent Technologies, Santa Clara, CA, USA). Megaruptor 3 (Diagenode SA., Seraing, Belgium) was used to shear 8 μg genomic DNA, and these fragments were then concentrated using AMPure PB magnetic beads (Pacific Biosciences, Menlo Park, CA, USA). Each PacBio single molecule real-time (SMRT) library was prepared using a SMRT bell express template prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA, USA), with insert sizes of 15 kb selected using BluePippin system (Sage Science, Beverly, MA, USA). The library was sequenced on a Pacific Bioscience Sequel II platform in CCS mode, and the raw data was converted into high-precision HiFi reads using the CCS workflow 13 (v. 6.3.0) with the standard parameters. From this process, we obtained 32.88 Gb (~60×) of HiFi data with an average read length and N50 read length of about 18 kb (Supplementary Table 2).
Hi-C library construction and sequencing. Hi-C libraries were prepared following a modified protocol from Belton et al. 14 . Fresh leaf tissue was fixed in 2% formaldehyde solution, and the cross-linked DNA was then digested with DpnII. Biotin-labeled adapters were attached at the sticky ends of the digested fragments to form chimeric junctions, which were enriched and trimmed to fragments of about 450 bp for pair-end sequencing on Transcriptome sequencing. Materials for transcriptome sequencing were homogenized and total RNA was extracted using R6827 Plant RNA Kit (Omega Bio-Tek, Norcross, GA, USA) following the manufacturer's instructions. Subsequently, SQK-PCS109 and SQK-PBK004 Kits (Oxford Nanopore Technologies, Oxford, UK) were combined to prepare the library, and the library was sequenced using a Nanopore PromethION sequencer. Finally, a total of 10.36 Gb (~10.89 M reads) full-length RNA-seq data were obtained for genome annotation (Supplementary Table 4).
Preliminary genome survey. The preprocessor, fastp 15 (v. 0.19.3) was used to filter out the adapter sequences, overly short reads and low-quality reads from the next generation sequencing data using the default parameters. Jellyfish 16 (v. 2.2.10) was then used to calculate the frequency distribution of the depth of clean data with 19-mers, and the basic features of the genome were estimated with the software GCE 17 Table 5 and Supplementary Fig. 1).
Chromosome-level genome assembly. HiFi reads and Hi-C short reads were used as a combined input for the genome assembler Hifiasm 18 (v. 0.16.1-r375) and were assembled into a pair of haplotype-resolved assembly contigs (haplotype A and haplotype B) in Hi-C mode with the default parameters. Juicer 19 (v. 1.5.6) was then used to map clean Hi-C reads to the contigs, and Hi-C-assisted initial chromosome assembly was conducted using the 3D-DNA 20 (v. 180922) algorithm with the standard procedure. Chromosome boundaries were adjusted and the scaffold corrected using the manually operated Juicebox 21 (v. 1.11.08) module, and the generated file was used as input for 3D-DNA for re-scaffolding by chromosome. Juicebox was used again for re-quality control and adjustment of mis-joins and orientation to generate chromosome scaffolds and un-anchored sequences. Additionally, TGS-GapCloser 22 (v. 1.1.1) was employed to fill gaps of the genome based on the HiFi reads.
Because some of the telomere assemblies were incomplete or missing, Minimap2 23 (v. 2.24-r1122) was used to map the HiFi reads to the chromosome, and the reads aligned to the positions of the telomeres were assembled into contigs using Hifiasm. These contigs were then mapped to the chromosomes to extend the chromosomal ends. GetOrganelle 24 (v. 1.7.5) was used to assemble the chloroplast and mitochondrial genomes. After the above steps were completed, the software Nextpolish 25 (v. 1.3.1) was employed to polish the assembly based on the short reads from two iterations, and Redundans 26 (v. 0.13c) was used to remove redundancies such as www.nature.com/scientificdata www.nature.com/scientificdata/ haplotigs and rDNA fragments. Overall, about 99.92% of the assembled data was anchored to pseudochromosomes in the two haplotypes (Supplementary Table 6), and the chromosome number was set based on the results of the karyotype analysis (2n = 26). Finally, we obtained a high-quality haplotype-resolved gapless genome of R. vialii.
The assembly (approx. 1.05 Gb) contained two complete haplotypes, haplotype A and haplotype B, with genome sizes of 532.73 Mb and 521.98 Mb, respectively ( Table 1). The genome size previously estimated based on K-mers was similar to these assemblies, with the main deviations in rDNA array. The contig N50 length of haplotype A and haplotype B were 35.67 Mb and 34.70 Mb, respectively. The number of gaps was fewer than 10 in both haplotypes (Table 1), and 14 gapless chromosomes were assembled (Supplementary Table 6), indicating good continuity of assembly.
Identification of repetitive elements. EDTA 27 (v. 1.9.9; parameters: --sensitive 1 --anno 1) was used for de novo identification of transposable elements to generate the TE library, and RepeatMasker 28 (v. 4.0.7) was employed to detect repetitive elements in the assembled genome of R. vialii with the default parameters. We identified a total of 1,534,208 repetitive sequences (~ 551.06 Mb), accounting for 52.19% of the assembled genome, of which long terminal repeats (LTRs) and terminal inverted repeats (TIRs) had the highest proportions, accounting for 27.18% and 17.25% of the genome, respectively (Supplementary Table 7).
Gene identification and functional annotation. Transcriptome assembly was based on multiple strategies: (i) next-generation RNA-seq data was downloaded from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database (SRR13338561) 29 for de novo assembly using Trinity 30 (v. 2.0.6); (ii) short-reads were aligned to the reference genome using Hisat2 31 (v. 2.1.0) and then assembled using StringTie 32 (v. 1.3.5); (iii) Minimap2 was used to map the long-reads to the genome and StringTie was employed for further assembly. Ultimately, PASA 33 (v. 2.4.1) was used to combine and optimize the transcriptomes obtained by the above methods and to generate a high-quality transcriptome with a total length of 122.34 Mb and 114,558 transcripts (Supplementary Table 8 45 and Vitis vinifera 46 ) were used as homologous protein evidence for this gene annotation. The PASA process was used to annotate the structure of the genomic genes based on transcript evidence, and the full-length genes were detected by comparison with reference proteins. The parametric model of AUGUSTUS 47 (v. 3.4.0) was trained with the full-length gene set for five rounds of optimization.
The MAKER2 48 (v. 2.31.9) pipeline was used for annotation based on ab initio prediction, transcript evidence and homologous protein evidence. Briefly, AUGUSTUS was used for ab initio protein-coding gene prediction after masking the repetitive regions of the genome with RepeatMasker. The transcripts were then aligned to the repeat-masked genome using BLASTN and TBLASTX, while BLASTX was employed to aligned protein sequences to the genome. The previous results were then optimized using Exonerate 49 (v. 2.2.0) and the hints files were generated based on the results. Integration of the predicted gene models was conducted using AUGUSTUS and UTR annotations were added according to the EST evidence.   www.nature.com/scientificdata www.nature.com/scientificdata/ Considering the relatively low accuracy of annotation results from the MAKER2 process, we also used EVidenceModeler 50 (EVM, v. 1.1.1) to integrate the predictive results obtained from the PASA and MAKER2 annotations. To avoid introducing TE coding regions, TEsorter 51 (v. 1.4.1) was used to identify the TE protein domains on the genome and EVM was used to mask them. Additionally, PASA was used to optimize the results obtained by EVM, and UTR sequences and alternative splicing were added, and overly short and abnormal gene annotations were removed. For non-coding RNAs, we used tRNAScan-SE 52 (v. 2.0.7) to identify the tRNAs and Barrnap (v. 0.9) 53 to detect the rRNAs. RfamScan 54 (v. 14.2) was used to annotate various non-coding RNAs. Finally, all annotation results were merged to remove the redundancy and a complete gene set was obtained.
Overall, a total of 60,926 protein-coding genes have been successfully predicted, with an average length of 5,378.1 bp. Among them, there were 79,618 coding sequences (CDS) and 460,225 exons, and the mean length was 1,275.7 bp and 328.4 bp, respectively ( Table 2). In addition, we also identified 5,538 non-coding genes, which contained 3,039 rRNAs, 770 tRNAs and 1,729 other ncRNAs.
For the functional prediction of protein-coding genes, three strategies were used. The predicted genes were aligned with the eggNOG v. 5.0 homologous gene database using eggNOG-mapper 55 (v. 2.0.1) for Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEEG) annotation. The protein sequences were matched to the protein databases, including Swiss_Prot, TrEMBL, NR (non-redundant protein) and the Arabidopsis database using DIAMOND 56 (v. 2.0.4; Identity > 30%, E-value < 1e-5) to determine the best alignment of the genes. To obtain the conserved amino acid sequences, motifs and domains of the predicted proteins, InterProScan 57 (v. 5.14-53.0) was used to search for similarity of domain according to the sub-databases PRINTS, Pfam, SMART, PANTHER and CDD of the InterPro database. Finally, 57,225 genes were functionally annotated in at least one of the above databases, accounting for 93.82% of the predicted protein-coding genes (Supplementary Table 9). www.nature.com/scientificdata www.nature.com/scientificdata/ Chromosomal synteny analysis. Minimap2 was used to compare the assembly of R. vialii with the previously published R. griersonianum genome, which showed that they had essentially the same order of chromosomes ( Supplementary Fig. 2a). Meanwhile, the dot-plot of syntenic blocks between the two haplotypes showed a similar result (Supplementary Fig. 2b).
To explore the differences between the two haplotypes, the NUCmer module embedded in MuMmer 4 58 was used to align the whole genome, and the command delta-filter was used to remove short and low-quality alignments. After the subprogram show-coords was used for format conversion, we ran SyRI 59 (v. 1.6) to detect variations. A total of 3,695 syntenic regions (~458 Mb) were detected, indicating high similarity between the two haplotypes. However, we also identified many variations, including 3,152,185 SNPs and 214,391 small insertions/deletions (indels, < 500 bp). We additionally identified 2,157 and 1,486 duplications in haplotypes A and B, respectively. Other differences between the two assemblies included 1,108 translocations and 48  www.nature.com/scientificdata www.nature.com/scientificdata/ inversions. Notably, two relatively large inversions of more than 1 Mb were found on Chr1 and Chr10, respectively ( Supplementary Fig. 3).

Data Records
The relevant data reported in this paper have been deposited in the National Genomics Data Center (NGDC) 60,61 , Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under the BioProject accession number PRJCA015878 that is publicly accessible at https://ngdc.cncb.ac.cn/gwh. BGI short-reads, PacBio HiFi long-reads, Hi-C reads and Iso-Seq data have been deposited in the Genome Sequence Archive (GSA) in NGDC under the accession number CRR719647 62 , CRR719646 63 , CRR719645 64 and CRR719648 65 . The final chromosome assembly and annotation data were deposited in the Genome Warehouse (GWH) in NGDC under the accession number GWHCAXW00000000 66 . The sequence data were also deposited in the SRA database with accession number SRR24501948 67 SRR24501949 68 SRR24501947 69 and SRR24501946 70 under the BioProject accession number PRJNA971245. And the final genome assembly has also stored in the GenBank with accession number GCA_030253575.1 71 and GCA_030253555.1 72 .

technical Validation
Evaluation of the assembled genome. The BUSCO analysis with the lineage dataset embryophyta_ odb10, which was conducted to assess genome completeness, showed that, for 1,614 expected genes from the embryophyta, the proportions of complete BUSCOs (including single-copy and multi-copy) of these two haplotypes were 98.5% and 98.1%, respectively, indicating good genomic completeness (Table 3). Short-reads and long-reads were mapped to the genome with BWA 73 (v. 0.7.17-r1188) and Minimap2, and the short RNA-seq reads were mapped to the assembly using Hisat2. After filtering out non-primary alignments, the map ratio and coverage of reads were calculated. We found that different types of sequencing data had a high genome coverage ( Table 4). The distribution of coverage depth of these sites on the genome matched the Poisson distribution and there was no obvious heterozygous peak. The BUSCO single-copy and multi-copy genes have approximately the same depth, indicating that the assembly had no redundancy (Fig. 2).
To assess single base error rate and heterozygosity, the next-generation reads were mapped to the genome using BWA, and the upstream data was input to bcftools 74 (v. 1.11) for variant detection. The calculated heterozygosity was approximately 0.0038% based on the heterozygous sites, and the single base error rate was about 0.00021% based on the homozygous loci. There was no obvious guanine-cytosine (GC) bias in the coverage depth analysis based on second and third generation data under different GC content (Fig. 3). Hi-C reads were mapped to the final version of the assembly using Juicer. In the Hi-C heatmap, there were strong interactive signals of the 13 chromosomes around the diagonal, suggesting that the two assemblies were without obvious chromosome assembly errors (Fig. 4). www.nature.com/scientificdata www.nature.com/scientificdata/ The repetitive sequences identified by RepeatMasker were mapped to the genome to determine the position of the telomeres and other characteristic sequences on the chromosomes. Most of the chromosomes assembled complete telomere sequences (TTTAGGG), and only a few telomeres were missing or incomplete. All chromosomes contained a high tandem repeat (TGGTACCGTATGGATGTACTCGTACGGTATTGTACCGTTTTGGTGTGGTT), which is probably the centromere. In addition, the 18-5.8-28S rDNA and 5S rDNA arrays were detected on Chr11 and Chr10 respectively (Fig. 5). In summary, this assembly can by described as a nearly telomere-to-telomere genome.
Evaluation of the gene annotation. The annotated and integrated proteins were also evaluated using BUSCO with the lineage dataset embryophyta_odb10. Briefly the proportion of complete core gene coverage was 98.1% (including 4.3% single-copy genes and 93.8% duplicated genes) and there were only a few fragmented (0.4%) and missing (1.5%) genes (Table 3), indicating that this annotation was of high quality.

Code availability
All software and pipelines were executed according to the manual and protocols of the published bioinformatic tools. The version and code/parameters of the software have been described in the Methods section.