Background & Summary

Microcos paniculata Linnaeus (Fig. 1a), known in Chinese as Buzhaye, is a shrub commonly used in traditional Chinese medicine and herbal cooling teas1, including Wanglaoji, Huoqizheng2 and Jiaduobao, with an annual demand of about 250 tons (http://bk.cnpharm.com/zgyyb/2008/04/28/246974.html). The leaves of M. paniculata are also commonly used in ethnomedicinal treatments for food stagnation, damp-heat jaundice and fever3. To date, numerous studies have investigated the phytochemical composition and pharmacological properties of this species, identifying bioactive secondary metabolites such as flavonoids, alkaloids, triterpenoids and organic acids1,4 in M. paniculata extracts. However, owing to the lack of a high-quality reference genome, the molecular basis and evolution of secondary metabolite biosynthesis in M. paniculata have rarely been reported5.

Fig. 1

Morphological characters (a) and the landscape of genome assembly and annotation of M. paniculata (b). The tracks from outside to inside are: pseudo-chromosomes, density of class I TEs, density of class II TEs, density of protein-coding genes, proportion of tandem repeats, GC content and collinear blocks.

In the present study, we assembled the genome of M. paniculata using 106 × short reads (42 Gb), 35 × HiFi reads (14 Gb), 75 × Hi-C reads (30 Gb) and 50 × iso-seq reads (20 Gb). The final assembly (~792 Mb) consisted of two complete haplotypes, haplotype A (399.43 Mb) and haplotype B (393.10 Mb), with contig N50 lengths of 43.44 Mb and 30.17 Mb, respectively (Table 1). About 99.93% of the assembled sequences were anchored onto 18 (2n) pseudo-chromosomes (Fig. 1b). The chloroplast and mitochondrial genomes were 159,456 bp and 380,905 bp, respectively. A total of 1,080,648 repeat sequences with a combined length of approximately 482 Mb were identified, accounting for 60.76% of the assembled genome. Of the identified repeats, long terminal repeats (LTRs) constituted the largest proportion, with 394,112 elements and a cumulative length of 321,160,287 bp, accounting for 40.52% of the M. paniculata genome assembly (Table 2). The genome contained 65,874 genes, including 49,439 protein-coding genes and 16,435 non-coding genes (Table 3). A total of 48,979 genes were functionally annotated, accounting for 99% of the identified protein-coding genes (Table 4). Of these, 44,971 genes were annotated by all three methods (Fig. 2). In particular, 639 genes were annotated as being related to the biosynthesis or metabolism of flavonoids, alkaloids and triterpenoids (Table S1). The resulting high-quality reference genome and annotation of M. paniculata will be a valuable resource for improving our understanding of the evolutionary relationships within the Malvales, for studying the molecular basis and biosynthetic mechanisms of phytochemical compounds, and for further study and exploitation of M. paniculata.
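For readers unfamiliar with the contig N50 statistic quoted above, the following minimal Python sketch illustrates the standard definition (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length). The function name and the toy contig lengths are hypothetical and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are sorted from longest to shortest; the N50 is the length of the
    contig at which the running total first reaches half of the summed length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must contain at least one contig")


# Hypothetical toy contig lengths, for illustration only.
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print(toy_n50)
```

In this toy case the total length is 150, so the running total passes 75 at the second-longest contig.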

Table 1 Summary of M. paniculata genome assembly.
Table 2 Summary of repeat elements.
Table 3 Summary of M. paniculata genome annotations.
Table 4 Functional annotation of protein-coding genes in M. paniculata.
Fig. 2

Venn diagram showing the unique and shared functionally annotated protein-coding genes in M. paniculata using the three strategies.

Methods

Sample collection and genome sequencing

Samples of M. paniculata were collected at Xishuangbanna Tropical Botanical Garden (XTBG), Chinese Academy of Sciences, Mengla, Yunnan Province, China. Genomic DNA was extracted using a modified CTAB method6. DNA quality was assessed using a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE, USA) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). Whole genome sequencing, PacBio sequencing, Hi-C (high-throughput chromosome conformation capture) sequencing and full-length isoform sequencing (iso-seq) were performed at Wuhan Benagen Technology Co. Ltd. (Wuhan, China).

For whole genome sequencing, 1 μg of genomic DNA was sonicated to an approximate size range of 200–400 bp using a sonicator (Covaris, Brighton, UK). The short-read libraries were constructed following the manufacturer’s instructions and then sequenced on the DNBSEQ-T7 platform (BGI Inc., Shenzhen, China) in PE150 (paired-end 150 bp) mode.

For long-read sequencing, genomic DNA was sheared using the Megaruptor 3 shearing kit (Diagenode SA., Seraing, Belgium). The AMPure PB beads size selection kit (PacBio, Menlo Park, CA, USA) was used to selectively deplete DNA fragments smaller than 5 kb. The libraries were prepared using the SMRTbell® prep kit 3.0 (PacBio, Menlo Park, CA, USA) and then sequenced on a Revio system (PacBio, Menlo Park, CA, USA). Raw sequencing data were converted to HiFi (high fidelity) reads using the CCS workflow 7.0.07 with parameters (--streamed --log-level INFO --stderr-json-log --kestrel-files-layout --min-rq 0.9 --non-hifi-prefix fail --knrt-ada --pbdc-model).

For Hi-C sequencing, leaf material from young shoots was fixed in 2% formaldehyde solution, and the Hi-C library was generated following a published protocol8. Briefly, the cross-linked material was digested with 400 units of MboI, marked with biotin-14-dCTP, and then subjected to blunt-end ligation of the cross-linked fragments. After re-ligation, reverse cross-linking and purification, the chromatin DNA was sheared to a size of 200–600 bp by sonication. The biotin-labelled Hi-C fragments were then enriched using streptavidin magnetic beads. After A-tailing and adapter ligation, the Hi-C libraries were PCR-amplified (12–14 cycles) and then sequenced on the DNBSEQ-T7 platform (BGI Inc., Shenzhen, China) in PE150 mode.

Full-length isoform sequencing (iso-seq) was used to obtain high quality transcriptomic data. RNA was extracted from leaves, flowers and stems of M. paniculata using the R6827 Plant RNA Kit (Omega Bio-Tek, Norcross, GA, USA) following the manufacturer’s instructions. The cDNA-PCR Sequencing kit SQK-PCS109 by Oxford Nanopore (Oxford Nanopore Technologies, Oxford, UK) was used to prepare full-length cDNA libraries. The libraries were then sequenced on the PromethION sequencer (Oxford Nanopore Technologies, Oxford, UK).

Genome assembly

PacBio HiFi reads and Hi-C short reads were combined as input to Hifiasm v0.19.5-r5929 using the default parameters to generate haplotype-resolved contigs for subsequent analysis. Hi-C reads were mapped to the assembled haplotype contigs using Juicer v1.5.610, and a Hi-C-assisted initial chromosome assembly was then performed using the 3D-DNA v18092211 pipeline (with the parameters --early-exit -m haploid -r 0). Chromosome boundaries were then adjusted and the misjoins and switch errors were corrected manually using Juicebox v1.11.0812. This process generated chromosome-scale scaffolds and un-anchored contig sequences.

LR_Gapcloser v1.1.113 was used to fill gaps in the chromosome assembly based on HiFi reads (with the parameters -s p -r 2 -g 500 -v 500 -a 0.25). HiFi reads were then re-mapped to the chromosome scaffolds. The mapped reads located around the telomere repeat sequences (TTTAGGG)n14 were then extracted and assembled into contigs using Hifiasm v0.19.5-r592 with the default parameters. The resulting contigs were aligned back to the chromosome scaffolds to extend the chromosome ends with telomere sequences, and a total of 28 telomere sequences were obtained (Fig. 3a). In addition, GetOrganelle v1.7.515 was used to assemble the chloroplast and mitochondrial genomes.
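As an illustration of how telomere repeats of the form (TTTAGGG)n can be recognised at sequence ends, the sketch below scans the first and last bases of a sequence for tandem runs of the repeat unit or its reverse complement. It is a simplified stand-in, not the read-extraction and reassembly procedure described above; the window size, the minimum number of repeat units and the function name are assumptions chosen for the example.

```python
import re

TELOMERE_UNIT = "TTTAGGG"         # plant-type telomere repeat unit from the text
TELOMERE_UNIT_REVCOMP = "CCCTAAA"  # its reverse complement


def telomere_ends(seq, window=500, min_repeats=5):
    """Report whether each end of `seq` carries a tandem telomere run.

    `window` and `min_repeats` are illustrative thresholds (assumptions),
    not values taken from the study.
    """
    seq = seq.upper()

    def run(unit):
        # Regex for at least `min_repeats` consecutive copies of `unit`.
        return "(?:" + unit + "){" + str(min_repeats) + ",}"

    pattern = re.compile(run(TELOMERE_UNIT) + "|" + run(TELOMERE_UNIT_REVCOMP))
    return {
        "start": bool(pattern.search(seq[:window])),
        "end": bool(pattern.search(seq[-window:])),
    }


# Hypothetical toy chromosome with telomere runs at both ends.
toy_chromosome = "CCCTAAA" * 10 + "ACGT" * 300 + "TTTAGGG" * 10
toy_result = telomere_ends(toy_chromosome)
print(toy_result)
```

A chromosome with both ends flagged would correspond to a telomere-to-telomere scaffold under these assumed thresholds.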

Fig. 3

Telomere distribution (a) and comparison of genome structure between haplotype A and haplotype B (b).

Nextpolish2 v0.1.016 was used to polish the above assembly based on HiFi reads and short reads with default parameters. Redundant haplotigs and rDNA fragments were removed using the Redundans v0.13c17 pipeline (with the parameters -identity 0.98 -overlap 0.8), followed by manual curation. This yielded a high-quality, haplotype-resolved genome assembly of M. paniculata.

Repeat annotation

The EDTA (Extensive de novo TE Annotator) program v1.9.918 (with the parameters --sensitive 1 --anno 1) was used for the de novo identification of transposable elements (TE), generating a TE library. RepeatMasker v4.0.719 was utilized to identify repeat elements (with the parameters -no_is -xsmall).
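Because RepeatMasker's -xsmall option returns repeats as lowercase (soft-masked) bases rather than replacing them with N, the repeat content of a masked assembly can be estimated by counting lowercase characters. The sketch below shows this idea on FASTA-formatted lines; the function name and the toy records are hypothetical, and the calculation is only an approximation of the repeat statistics reported by RepeatMasker itself.

```python
def masked_fraction(fasta_lines):
    """Return the fraction of bases that are lowercase (soft-masked).

    Header lines starting with '>' and blank lines are skipped. Lowercase
    'n' characters, if present, are counted as masked in this simple sketch.
    """
    masked = 0
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    return masked / total if total else 0.0


# Hypothetical toy soft-masked FASTA content.
toy_fasta = [">chr1", "ACGTacgt", "", ">chr2", "GGGGaaaa"]
toy_fraction = masked_fraction(toy_fasta)
print(toy_fraction)
```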

Annotation of protein-coding genes and noncoding RNAs

A total of 314,962 publicly available non-redundant protein sequences from Theobroma cacao20, Durio zibethinus21, Corchorus capsularis22, Gossypium raimondii23, Heritiera littoralis24, Dipterocarpus turbinatus25, Aquilaria sinensis26, Arabidopsis thaliana27, Carica papaya28, Vitis vinifera29, and Bombax ceiba30 were used as homologous protein evidence for gene annotation. Iso-seq data were mapped to the genome using Minimap2 v2.2431 (with the parameters -a -x splice --end-seed-pen 60 -G 200k), then assembled in StringTie v1.3.532 (with the parameters -L -t -f 0.05), and the resulting sequences were used as transcript evidence.

PASA (Program to Assemble Spliced Alignments) v2.4.133 was used to annotate the genomic structure based on transcript evidence with the default parameters. Then, full-length gene sequences were identified by aligning with homologous protein evidence using BLAT34 (-prot) and removing the hits with query or target coverage <95%. The gene model was trained and optimized for five rounds in AUGUSTUS v3.4.035 using the full-length gene set with the default parameters.
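The 95% coverage criterion used to select full-length genes can be sketched as a simple predicate. The sketch assumes that coverage is the aligned length divided by the full sequence length, computed separately for the query and the target; this definition, the function name and the argument names are assumptions, since the text does not specify how coverage was calculated from the BLAT output.

```python
def passes_coverage(aligned_query, query_len, aligned_target, target_len, min_cov=0.95):
    """Keep a hit only if both query and target coverage reach `min_cov`.

    Coverage is taken here as aligned length / full length (an assumption).
    """
    query_cov = aligned_query / query_len
    target_cov = aligned_target / target_len
    return query_cov >= min_cov and target_cov >= min_cov


# Hypothetical hits: (aligned_query, query_len, aligned_target, target_len).
toy_hits = [(96, 100, 97, 100), (96, 100, 90, 100), (50, 100, 99, 100)]
kept = [hit for hit in toy_hits if passes_coverage(*hit)]
print(kept)
```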

The MAKER2 v2.31.936 pipeline was used to perform annotation based on ab initio prediction, the transcript evidence and the homologous protein evidence. Briefly: (1) RepeatMasker v4.0.719 was used to mask repeat sequences in the genome; (2) AUGUSTUS v3.4.035 was used for ab initio prediction based on the genomic sequence; (3) BLASTN was used to align the transcript evidence to the repeat-masked genome, and BLASTX was employed to align the homologous protein evidence to the genome. Exonerate v2.2.037 was used to realign the BLAST hits to the genome; (4) Finally, the predicted gene models were integrated using MAKER2 based on the hints generated from the above alignments.

EvidenceModeler (EVM) v1.1.138 was further employed to merge the annotation results obtained from PASA v2.4.1 and MAKER2 v2.31.9, generating consensus annotations. TEsorter v1.4.139 was utilized to identify TE protein domains on the genome (with the parameters -genome -db rexdb -cov 30 -eval 1e-5 -prob 0.9), and these domains were masked in the EVM process. The results obtained from EVM were refined by incorporating UTR sequences and alternative splicing using PASA v2.4.1 with the default parameters. Annotations that were too short (<50 amino acids), lacked start or stop codons, contained an internal stop codon, or had ambiguous bases were excluded. All annotations were then merged, and redundant annotations were removed.
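The exclusion criteria listed above (fewer than 50 amino acids, missing start or stop codon, internal stop codon, ambiguous bases) can be expressed as a single check on a coding sequence. The sketch below is one possible reading of those criteria, not the authors' script: it assumes ATG as the only start codon, the standard stop codons, and additionally rejects sequences whose length is not a multiple of three.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def keep_gene_model(cds, min_aa=50):
    """Return True if a coding sequence passes the filters sketched here.

    Assumptions: ATG is the only accepted start codon, the standard stop
    codons apply, and the CDS length must be a multiple of three.
    """
    cds = cds.upper()
    if len(cds) % 3 != 0:
        return False
    if set(cds) - set("ACGT"):  # ambiguous bases such as N
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if len(codons) < 2:
        return False
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False
    if any(codon in STOP_CODONS for codon in codons[1:-1]):  # internal stop
        return False
    # Protein length excludes the terminal stop codon.
    return len(codons) - 1 >= min_aa


# Hypothetical toy coding sequences.
toy_models = {
    "ok": "ATG" + "GCT" * 49 + "TAA",
    "too_short": "ATG" + "GCT" * 10 + "TAA",
    "internal_stop": "ATG" + "GCT" * 30 + "TAG" + "GCT" * 30 + "TAA",
}
toy_kept = [name for name, cds in toy_models.items() if keep_gene_model(cds)]
print(toy_kept)
```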

In addition, for non-coding RNA (ncRNA) annotations, tRNAScan-SE v1.3.140 was used to identify transfer RNA (tRNA), and Barrnap v0.9 (https://github.com/tseemann/barrnap) was used to identify ribosomal RNA (rRNA). To ensure accuracy, partial rRNA annotations were excluded. Furthermore, RfamScan v14.241 was used to identify other ncRNA.

We employed three strategies to predict the function of the protein-coding genes: (1) eggNOG-mapper v2.0.042 (--target_taxa Viridiplantae -m diamond) was utilized to search for homologous genes in the eggNOG database, enabling Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) annotation; (2) DIAMOND v0.9.2443 (--evalue 1e-5 --max-target-seqs 5) was employed to align protein-coding genes with the Swiss-Prot, TrEMBL, NR (non-redundant protein in NCBI), and the TAIR10 protein databases; (3) InterProScan v5.27-66.044 was used to annotate protein domains and motifs by searching multiple publicly available databases, such as PRINTS, Pfam, SMART, PANTHER, and CDD of the InterPro database. TBtools v1.13245 was then used to draw a Venn diagram to show unique and shared protein-coding genes annotated using the three described strategies.
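The counts behind a three-way Venn diagram of this kind are plain set operations on the gene identifiers annotated by each strategy. The following sketch shows how the seven regions could be computed; the gene identifiers are hypothetical placeholders, and the code does not reproduce the TBtools workflow.

```python
def venn_counts(a, b, c):
    """Return the sizes of the seven regions of a three-set Venn diagram."""
    return {
        "a_only": len(a - b - c),
        "b_only": len(b - a - c),
        "c_only": len(c - a - b),
        "a_b_only": len((a & b) - c),
        "a_c_only": len((a & c) - b),
        "b_c_only": len((b & c) - a),
        "all_three": len(a & b & c),
    }


# Hypothetical gene IDs annotated by each of the three strategies.
eggnog_genes = {"g1", "g2", "g3", "g5"}
diamond_genes = {"g1", "g2", "g4", "g5"}
interpro_genes = {"g1", "g3", "g4", "g5"}

toy_counts = venn_counts(eggnog_genes, diamond_genes, interpro_genes)
print(toy_counts)
```

The seven region sizes always sum to the number of genes annotated by at least one strategy, which gives a quick consistency check against totals such as those in Table 4.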

Comparison between haplotype assemblies

SyRI (Synteny and Rearrangement Identifier) v1.646 was used to detect synteny and genomic structural variations (≥50 bp in size) between the two haplotypes, with the default parameters. In total, our analysis identified 3,011 syntenic regions (350 Mb), 768 translocations (45 Mb), 20 inversions (2 Mb), 2,175 duplications in haplotype A (~15 Mb) and 1,686 duplications in haplotype B (~8 Mb). Most duplications were found on chromosomes 4 and 8, and most inversions were found on chromosome 7 (Fig. 3b). SyRI v1.6 was also used to identify SNPs, small InDels (insertions and deletions, <50 bp in size) and tandem repeats. Finally, 1,264,264 SNPs (1 Mb), 105,563 insertions (2 Mb in haplotype B), 100,073 deletions (2 Mb in haplotype A) and 282 tandem repeats (1 Mb) were identified.

Data Records

The BGI short reads, PacBio HiFi long reads, Hi-C reads and Iso-Seq data have been deposited at the Sequence Read Archive database of NCBI (National Center for Biotechnology Information) under accession numbers SRR25456891-SRR2545689447,48,49,50. The final genome assembly has been deposited at the GenBank database under the accession numbers GCA_030664735.151 and GCA_030664755.152. The genome annotations are available from the Figshare repository53. The AUGUSTUS model trained and optimized for this genome and the configuration files for MAKER are available from the Figshare repository54.

Technical Validation

We first calculated the mapping rate as a measure of assembly accuracy. The short reads and the long reads were re-mapped to the assembly using BWA-MEM v0.7.17-r118855 and Minimap2 v2.2431, respectively, with the default parameters. The mapping rates were calculated after filtering out non-primary alignments. In total, 99.89% of HiFi reads, 97.75% of iso-seq reads and 99.81% of short reads were mapped (Table 5). Moreover, the read coverage depth of both short and long read data was evenly distributed along each phased chromosome, indicating high quality of our haplotype-resolved assembly (Figure S1).
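A mapping rate computed after removing non-primary alignments can be derived directly from the FLAG field of SAM records. The sketch below assumes that "non-primary" means records with the secondary (0x100) or supplementary (0x800) bit set, and that the rate is the share of remaining records without the unmapped bit (0x4); this interpretation and the function name are assumptions, and the toy records carry only the first three SAM columns.

```python
SECONDARY = 0x100
SUPPLEMENTARY = 0x800
UNMAPPED = 0x4


def mapping_rate(sam_lines):
    """Return mapped primary records / all primary records from SAM text lines."""
    mapped = 0
    total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # drop non-primary alignments
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / total if total else 0.0


# Hypothetical, truncated SAM records (QNAME, FLAG, RNAME only).
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1",
    "r1\t256\tchr2",
    "r2\t4\t*",
    "r3\t16\tchr1",
    "r3\t2048\tchr1",
]
toy_rate = mapping_rate(toy_sam)
print(toy_rate)
```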

Table 5 Summary of mapping rates.

We evaluated the completeness of the genome assembly using BUSCO (Benchmarking Universal Single-Copy Orthologs) v5.3.256 based on the embryophyta_odb10 ortholog database. The BUSCO evaluation of haplotype A identified 1,591 complete BUSCOs (1,561 single-copy and 30 duplicated), accounting for 98.6% of the BUSCO set, while missing BUSCOs represented only 0.7% (Table 6). Similarly, the BUSCO assessment of haplotype B identified 1,588 complete BUSCOs (1,560 single-copy and 28 duplicated), accounting for 98.4% of the BUSCO set, while missing BUSCOs represented only 0.9% (Table 6). This indicates a relatively complete assembly. We used Merqury v1.357 to estimate the consensus accuracy and completeness of the genome assembly. The consensus quality value (QV) of the assembly was 73.38, and the completeness value was 99.19% (Table 6). We also used KAT (K-mer Analysis Toolkit) v2.4.058 to assess the quality of the genome assembly by comparing k-mers in the HiFi reads with those in the assembly. The results show high consistency between the reads and the genome assembly (Fig. 4a), with each haplotype representing approximately half of the heterozygous peak and nearly all of the homozygous peak (Fig. 4b,c).
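To help interpret these figures, the sketch below converts a Phred-scaled consensus QV into a per-base error probability, using the standard relation QV = -10 log10(error), and recomputes the BUSCO completeness percentages. The total of 1,614 BUSCO groups is an assumption about the embryophyta_odb10 set that is consistent with the percentages reported above; the function names are illustrative.

```python
import math


def qv_to_error_rate(qv):
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate):
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def busco_percent(count, total):
    """Return the percentage of BUSCO groups in a category."""
    return 100 * count / total


# Assumed size of the embryophyta_odb10 BUSCO set (see lead-in).
TOTAL_BUSCOS = 1614

error_rate = qv_to_error_rate(73.38)
hap_a_complete = round(busco_percent(1591, TOTAL_BUSCOS), 1)
hap_b_complete = round(busco_percent(1588, TOTAL_BUSCOS), 1)
print(error_rate, hap_a_complete, hap_b_complete)
```

Under this relation, a QV of 73.38 corresponds to an estimated error probability on the order of a few errors per hundred million bases.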

Table 6 Evaluation of M. paniculata genome assembly.
Fig. 4

Copy number spectra plots for genome (a), haplotype A (b) and haplotype B (c) using KAT (K-mer Analysis Toolkit). The k-mers from HiFi reads display two dominant heterozygous (multiplicity = 18) and homozygous (multiplicity = 34) peaks, and those from assemblies have 0–6×+ copy numbers.

In addition, we used BUSCO to evaluate the completeness of the genome annotation by retaining only the longest protein sequence for each gene, and found that the annotation of haplotype A was 97.6% complete, with only 17 (1.1%) genes missing, and the annotation of haplotype B was 97.1% complete, with only 19 (1.2%) genes missing (Table 7), indicating that the annotation was of high quality.

Table 7 BUSCO evaluation of M. paniculata genome annotation.

The Hi-C reads were aligned to the genome assembly using Juicer v1.5.610 with the default parameters. The Juicebox12 tools pre command (pre -n -q 0 or 1) was used to convert the raw file generated by Juicer into hic format, and the dump command (dump observed BP 100000) was used to extract a 100-kb contact matrix from the hic file. The hic file was visualized in Juicebox. Strong interaction signals were observed along the diagonal of the pseudo-chromosomes, and there was no obvious noise away from the diagonal (Fig. 5a), indicating the high quality of the chromosome assembly. In addition, no anomalies were observed across any homologous chromosome pair when duplicated reads were excluded (Fig. 5b), suggesting that there are no switch errors between the phased haplotypes.
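The visual impression of a strong diagonal can also be summarised numerically from a dumped contact matrix. The sketch below assumes the sparse three-column text layout (bin 1 start, bin 2 start, contact count) for an intra-chromosomal dump and reports the share of contacts that fall within a given number of bins of the diagonal; the layout, the function name, the threshold and the toy records are assumptions made for illustration.

```python
def near_diagonal_fraction(dump_lines, resolution=100_000, max_bins=1):
    """Return the share of contacts within `max_bins` bins of the diagonal.

    Each line is expected to hold: bin1_start, bin2_start, count
    (whitespace-separated). Lines with another layout are ignored.
    """
    near = 0.0
    total = 0.0
    for line in dump_lines:
        parts = line.split()
        if len(parts) != 3:
            continue
        start1, start2, count = int(parts[0]), int(parts[1]), float(parts[2])
        total += count
        if abs(start1 - start2) // resolution <= max_bins:
            near += count
    return near / total if total else 0.0


# Hypothetical sparse contact records at 100-kb resolution.
toy_dump = ["0\t0\t100", "0\t100000\t50", "0\t1000000\t10"]
toy_near = near_diagonal_fraction(toy_dump)
print(toy_near)
```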

Fig. 5

Hi-C interaction heatmap of haplotype A and haplotype B with reads mapping quality ≥0 (including duplicated reads) (a) and mapping quality ≥1 (excluding duplicated reads) (b). The colour bar indicates the strength of the interaction, with yellow representing low and red representing high.