A chromosome-level genome assembly of Knoxia roxburghii (Rubiaceae)

Knoxia roxburghii is a well-known medicinal plant that is widely distributed in southern China and Southeast Asia. Its dried roots, known as hongdaji in traditional Chinese medicine, are used to treat a range of diseases, including cancers, carbuncles, and ascites. In this study, we report a de novo chromosome-level genome assembly for this diploid plant, with a total length of approximately 446.30 Mb, a contig N50 of 42.26 Mb, and a scaffold N50 of 44.38 Mb. Approximately 99.78% of the assembled sequences were anchored to 10 pseudochromosomes, 3 of which were assembled without gaps. A total of 24,507 genes were annotated, and repetitive elements accounted for 68.92% of the genome. Overall, our results will facilitate further studies of active component biosynthesis in K. roxburghii and provide insights for future functional genomic studies and DNA-informed breeding.


Background & Summary
Knoxia roxburghii (Sprengel) M. A. Rau (2n = 20, homotypic synonym: Knoxia valerianoides Thorel ex Pitard), a perennial herb naturally distributed in southern China and Southeast Asia, is a member of the genus Knoxia in the family Rubiaceae 1. The dried roots of K. roxburghii, known as hongdaji in Chinese medicine, exhibit a significant therapeutic effect in treating cancer, carbuncles, diarrhoea, ascites, chronic pharyngitis, and schizophrenia 2. Additionally, the plant is a crucial ingredient in various Chinese herbal formulations, such as ZiJinDing, which has been shown by modern pharmacology to possess antitumour properties 3. Phytochemical studies have revealed that K. roxburghii is rich in anthraquinones, triterpenoids, lignans, coumarins, sitosterols, and other important compounds 4,5. Anthraquinones, such as 3-hydroxymoridone, knoxiadin, and damnacanthal, are considered key active components of K. roxburghii, exhibiting diverse biological activities including anticancer, antibacterial, anticoagulant, and antiviral effects 6,7. Triterpenoids, another significant component of K. roxburghii, have anti-inflammatory, anticancer, and antioxidant effects. They are primarily responsible for the inflammation- and swelling-reducing effects of K. roxburghii 8,9.
In recent years, wild populations of K. roxburghii in China have faced an increased risk of extinction due to a surge in market demand 10. Additionally, seed germination and emergence rates for this species are less than 1% under natural conditions, and the plant has a protracted maturation period 11. K. roxburghii has been categorized as a first-class protected wild Chinese herbal medicine, and utilization of its production area has been prohibited 12. As a result, artificially cultivated K. roxburghii has become the primary source of medicinal material. Nevertheless, cultivation is plagued by southern blight and leaf spot, which have severely limited production 13. Therefore, there is an urgent need to breed promising new K. roxburghii varieties to address this issue.
Whole-genome-level studies can provide insights for enhancing medicinal material quality, molecular breeding, wild resource conservation, and functional gene discovery and utilization in plants 14-16. However, to date, no whole-genome sequence of K. roxburghii has been reported. In the present study, by using DNBSEQ sequencing, single-molecule real-time sequencing, and high-throughput chromosome conformation capture (Hi-C) sequencing technologies, we provide a de novo high-quality chromosome-level genome sequence for K. roxburghii. Of the genome sequence, 99.78% is anchored to 10 chromosomes, with a total length of 446.30 Mb and a scaffold N50 of 44.38 Mb. Transposable elements accounted for 68.92% (307.60 Mb) of the assembled genome sequence, with long terminal repeats (LTRs) being the dominant type. The LTR retrotransposon burst was estimated to have occurred approximately 0.2 million years ago. Phylogenetic analysis revealed that Copia and Gypsy elements could be grouped into eight and five lineages, respectively. The reference genome information obtained herein constitutes a valuable resource for promoting genetic improvement and elucidating the biosynthesis of active ingredients in this medicinal plant.

Methods
Sample collection and sequencing. For genomic DNA extraction, fresh leaves of K. roxburghii were collected from Chuxiong (N24°58′, E101°28′) in Yunnan Province, China. Additionally, stems, roots, buds, and leaves were gathered for transcriptome sequencing. The materials were immediately preserved in liquid nitrogen, transported to the laboratory, and stored at −80 °C. High-quality genomic DNA was extracted from leaves using the DNeasy Plant Mini Kit (QIAGEN, Valencia, California, USA). Total RNA was extracted from each sample using the Direct-zol RNA kit (Zymo Research, Irvine, CA, USA) following the manufacturer's instructions.
For short-read sequencing, paired-end DNBSEQ libraries were constructed using the NextEra DNA Flex Library Prep Kit (Illumina, San Diego, CA, USA) with an insert size of 350 bp and sequenced on the DNBSEQ-T7 platform (MGI Tech, Shenzhen, China). The short reads were quality-filtered using fastp v. 0.21.0 17 with default parameters. This process removed adapter sequences, contaminants, PCR duplicates, and reads in which more than 30% of the bases were of low quality. A total of 107.86 Gb of clean short reads (251.78 × coverage) were generated and used for subsequent data processing. The genome size was estimated to be 428.39 Mb, with a heterozygosity of 1.23% and a repetitive content of 46.86%, based on K-mer distribution analysis as previously described 18.
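The coverage figure quoted above follows from dividing the total clean bases by the estimated genome size. A minimal sketch of that arithmetic, assuming decimal units (1 Gb = 1,000 Mb); the function name is illustrative and not part of any pipeline used in this study:

```python
def sequencing_depth(total_bases_gb: float, genome_size_mb: float) -> float:
    """Fold coverage = total sequenced bases / genome size (decimal units assumed)."""
    if genome_size_mb <= 0:
        raise ValueError("genome size must be positive")
    return (total_bases_gb * 1000.0) / genome_size_mb


# Values reported in the text: 107.86 Gb of clean short reads and a
# K-mer-based genome size estimate of 428.39 Mb.
depth = sequencing_depth(107.86, 428.39)
print(f"Short-read depth: {depth:.2f}x")
```

The same function can be applied to the other libraries by substituting their clean-data yields.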
For PacBio sequencing, the libraries were constructed with an insert size of 15 kb using the SMRTbell Template Prep Kits (Pacific Biosciences of California, Inc., CA, USA) and sequenced in circular consensus sequencing (CCS) mode on the PacBio Sequel II platform. After trimming the low-quality reads and adaptor sequences from the raw data, approximately 52.85 Gb of long reads were generated, covering approximately 124 × of the estimated genome size.
For Hi-C sequencing, the library was prepared according to the protocol described by Lieberman-Aiden et al. 19. DNA was purified from proteins and randomly sheared into fragments of 300-700 bp in size. The resulting Hi-C library was sequenced on the Illumina NovaSeq 6000 sequencing platform using paired-end 150 bp reads. The raw data from Hi-C sequencing were processed using fastp. A total of 36.14 Gb (84.36 × coverage) of clean reads were obtained.
For Oxford Nanopore Technologies (ONT) sequencing, equal quantities of all RNA samples were pooled for PCR-cDNA library construction using the Ligation Sequencing Kit (SQK-LSK109) and sequenced on the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK). NanoFilt v. 2.8.0 20 (parameters: -q 7 -l 100 --headcrop 30 --minGC 0.3) was used to process the RNA-seq data. Finally, a total of 6.2 Gb of full-length RNA-seq data were obtained for genome annotation.
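The four filtering thresholds above (minimum mean quality 7, minimum length 100, 30 bases cropped from the read start, minimum GC fraction 0.3) can be illustrated with a small sketch. This is not NanoFilt's implementation; the function names are hypothetical. Two choices here are assumptions: mean quality is averaged over error probabilities rather than raw Phred scores, and the length threshold is applied to the read after cropping.

```python
import math


def mean_quality(phred_scores):
    """Mean read quality, averaged over error probabilities and returned as Phred."""
    probs = [10 ** (-q / 10) for q in phred_scores]
    return -10 * math.log10(sum(probs) / len(probs))


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def filter_read(seq, phred_scores, min_q=7, min_len=100, headcrop=30, min_gc=0.3):
    """Return the head-cropped (seq, quals) if the read passes all filters, else None.

    Assumption: the length threshold applies to the read after cropping.
    """
    if not seq or len(seq) != len(phred_scores):
        return None
    if len(seq) - headcrop < min_len:
        return None
    if mean_quality(phred_scores) < min_q:
        return None
    if gc_fraction(seq) < min_gc:
        return None
    return seq[headcrop:], phred_scores[headcrop:]


# Toy read: 200 bases, 50% GC, uniform quality 20.
kept = filter_read("ACGT" * 50, [20] * 200)
print(len(kept[0]) if kept else "discarded")
```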
Genome and chromosome assembly. The contig-level genome of K. roxburghii was assembled using Hifiasm v. 0.14.2 21 with default parameters. Two rounds of error correction were performed based on PacBio sequencing and Illumina NovaSeq sequencing data using NextPolish v. 1.3.1 22 (parameters: sgs_options = -max_depth 200; lgs_options = -min_read_len 1k -max_read_len 100k -max_depth 100; lgs_minimap2_options = -x map-ont) and Pilon v. 1.23 23 (parameters: --fix all --changes), respectively. Heterozygous sequences were removed using the Purge_haplotigs pipeline v. 1.0.4 24, with the high, mid, and low read-depth cut-offs set to 170, 55, and 5, respectively. Consequently, the genome assembly contained 446.30 Mb in 19 contigs with a contig N50 of 42.26 Mb, and the GC content of the genome was 35.98% (Table 1).
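For readers unfamiliar with the N50 statistic reported here, the following sketch shows the standard definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The contig lengths in the example are toy values, not the lengths of the 19 contigs in this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("cannot compute N50 of an empty collection")


# Toy example: total is 150, so the half-way point (75) is reached at the
# second-longest contig.
toy_contigs = [50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

The same function gives the scaffold N50 when it is applied to scaffold lengths instead of contig lengths.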
The Hi-C clean data were mapped to the draft genome using HiCUP v. 0.8.2 25 (parameters: -format sanger -longest 800 -shortest 150 -nofill N), followed by filtration to remove unmapped reads, invalid pairs, and duplicate sequences induced by PCR amplification. ALLHiC v. 0.9.8 26 (parameters: -e GATC -k 10) was used to cluster the contigs into chromosomal groups, which were then ordered and oriented. The interactions between contigs were converted into a binary file using 3D-DNA v. 180419 27 and Juicer v. 1.6 28. The assembly was then manually corrected in JuiceBox v. 1.11.08 29 based on the intensity of chromosome interactions. Very short contigs without any interaction signal were placed in the "unassigned" category. The final chromosome-level genome sequence was obtained by filling each gap with 100 Ns. Finally, 99.78% of the initial assembled sequences were anchored to 10 pseudochromosomes with lengths ranging from 42.02 Mb to 48.32 Mb (Fig. 1a, Table 2). The total length of the genome assembly was 446.30 Mb, with a scaffold N50 of 44.38 Mb (Table 1).
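The final joining step, in which ordered and oriented contigs are concatenated with a fixed run of 100 Ns between neighbours, can be sketched as follows. This is an illustration of the idea only, not the code used by ALLHiC or 3D-DNA; the function names, contig names, and sequences are hypothetical.

```python
GAP = "N" * 100  # fixed-length gap placed between adjacent contigs

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T/N, either case)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pseudochromosome(contigs, ordering):
    """Join contigs in the given order and orientation, separated by 100 Ns.

    contigs:  dict mapping contig name -> sequence
    ordering: list of (contig name, strand) tuples, strand being "+" or "-"
    """
    parts = []
    for name, strand in ordering:
        if strand not in ("+", "-"):
            raise ValueError(f"unknown strand for {name}: {strand}")
        seq = contigs[name]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return GAP.join(parts)


# Toy example with two short contigs, the second placed in reverse orientation.
toy = {"ctg1": "ACGT", "ctg2": "AAGG"}
chrom = build_pseudochromosome(toy, [("ctg1", "+"), ("ctg2", "-")])
print(len(chrom), chrom.count("N"))
```

A pseudochromosome built from a single contig contains no gap, which is how a gapless chromosome arises in this scheme.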
Based on the high-quality reference genome in this study, 307.60 Mb of repetitive sequences of K. roxburghii were predicted (Table 6). Among the integrated results, 33.56% (149.76 Mb) of the sequences were long terminal repeat (LTR) retrotransposons, with LTR/Copia elements being the dominant class (28.71% of the whole genome, 128.15 Mb), followed by LTR/Gypsy elements (2.79% of the whole genome, 12.47 Mb). To investigate the evolutionary history of transposable elements (TEs) in the K. roxburghii genome, a distribution plot of identity values between genomic copies and their consensus sequences was generated. The distribution of LTRs showed a peak at 89% identity, which was larger than the peaks of the other TE types, indicating that LTR retrotransposons were recently transposed in the genome of K. roxburghii (Fig. 2a). Additionally, the genome contained 3,394 LTR retrotransposons (LTR-RTs), and the LTR retrotransposon burst was estimated to have occurred approximately 0.2 million years ago (Fig. 2b). For LTR/Gypsy and LTR/Copia, phylogenetic trees revealed that repeat elements were organized into different clades and expanded in clusters (Fig. 2c,d).
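The text does not state how the burst time was calculated. One widely used approach dates each intact LTR-RT from the divergence between its two LTRs: the observed mismatch proportion is converted to a Jukes-Cantor distance K, and the insertion time is T = K / (2r), where r is the substitution rate per site per year. The sketch below follows that approach under stated assumptions. The identity value (0.9948) and the rate (1.3e-8, a value often used for rice) are illustrative placeholders and are not taken from this study.

```python
import math


def jukes_cantor_distance(identity):
    """Jukes-Cantor corrected distance from the identity between the two LTRs."""
    p = 1.0 - identity  # observed proportion of differing sites
    if not 0.0 <= p < 0.75:
        raise ValueError("mismatch proportion must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def insertion_time_years(identity, rate_per_site_per_year):
    """Estimated insertion time T = K / (2r) for one LTR retrotransposon."""
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    return jukes_cantor_distance(identity) / (2.0 * rate_per_site_per_year)


# Hypothetical element whose two LTRs are 99.48% identical, with an assumed
# substitution rate of 1.3e-8 per site per year (placeholder value).
t = insertion_time_years(0.9948, 1.3e-8)
print(f"Estimated insertion time: {t / 1e6:.2f} million years ago")
```

The factor of 2 reflects that both LTRs of an element accumulate substitutions independently after insertion. The resulting estimate scales inversely with the assumed rate, so any dating of this kind is only as reliable as the rate chosen.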

Data Records
The BGI short reads, PacBio HiFi long reads, Hi-C reads, and RNA-Seq data have been deposited in the NCBI Sequence Read Archive under accession numbers SRR25777372 58, SRR25787934 59, SRR24958413 60, and SRR25775167 61, respectively. The genome assembly has been deposited in DDBJ/ENA/GenBank under the accession number JAUECX000000000 62. The chromosomal assembly and gene annotation dataset have been deposited in the FigShare database at https://doi.org/10.6084/m9.figshare.23542566 63.

Technical Validation
The integrity of the genome assembly was assessed using the sequence identity method. Reads from a small-fragment library were aligned to the assembled genome using BWA v. 0.7.17-r1188 64. The alignment rate of all small-fragment reads to the genome was approximately 99.60%, and the coverage rate was approximately 99.49%, indicating consistency between the reads and the assembled genome. We performed a Benchmarking Universal Single-Copy Orthologs (BUSCO) v. 4.1.4 65 analysis based on the embryophyta_odb10 database to assess the completeness of the assembly, which indicated that 97.50% of the BUSCOs were present as complete genes in the assembly (Table 5). Furthermore, 99.78% of the assembled sequences were successfully anchored to the 10 chromosomes. The accuracy of the chromosome assembly was indirectly confirmed by the Hi-C heatmap, which revealed a well-organized interaction pattern along the diagonals within and around the chromosome regions (Fig. 1b). This observation provides additional support for the accuracy of the chromosome assembly.
To validate the predicted genes, we performed a BUSCO analysis. The analysis indicated that the annotation was highly reliable, as approximately 98.40% of the BUSCOs were identified as complete (Table 5). The annotation results were considered acceptable because the number of predicted genes and the structural characteristics of the K. roxburghii genome were consistent with those of the genomes of closely related species.

Fig. 1 Overview of the genomic features of Knoxia roxburghii. (a) Genomic features of K. roxburghii. Tracks from outside to inside (a-e) are as follows: chromosomes, gene density, repeat sequence density, GC content, and collinearity between the chromosomes. (b) Hi-C interaction heatmap for the K. roxburghii genome showing interactions among the ten chromosomes.

Fig. 2 Repeat sequence analysis of the Knoxia roxburghii genome. (a) Distribution of sequence identity values between genomic copies and consensus repeats in the K. roxburghii genome. (b) Distribution of estimated insertion times of LTR retrotransposons in the K. roxburghii genome. (c) Phylogenetic tree of Ty1/Copia-type LTR retrotransposons. (d) Phylogenetic tree of Ty3/Gypsy-type LTR retrotransposons.

Table 1. Global statistics for the Knoxia roxburghii genome assembly.

Table 2. Statistics of the pseudochromosome lengths obtained by Hi-C-assisted assembly of Knoxia roxburghii.

Table 3. Statistical results for the gene structure of Knoxia roxburghii.

Table 4. Statistics of non-coding RNA prediction in the Knoxia roxburghii genome.

Table 5. Statistics of BUSCO estimation for the Knoxia roxburghii genome assembly and annotation.

Table 6. Statistics of repeat elements in the Knoxia roxburghii assembly.