Chromosome-scale genomes of five Hongmu species in Leguminosae

The legume family (Leguminosae or Fabaceae) is one of the largest and most economically important families of flowering plants. Heartwood, the core of a tree trunk or branch, is a valuable and renewable resource that has been used for centuries to build sturdy and sustainable structures. Hongmu refers to a category of precious timber trees in China, encompassing 29 woody species, primarily from legume genera. Due to the lack of genome data, detailed studies on their economic and ecological importance are limited. Therefore, this study generates chromosome-scale assemblies of five Hongmu species in Leguminosae: Pterocarpus santalinus, Pterocarpus macrocarpus, Dalbergia cochinchinensis, Dalbergia cultrata, and Senna siamea, using a combination of short-read, Nanopore long-read, and Hi-C data. We obtained pseudochromosome-level assemblies of 623.86 Mb, 634.58 Mb, 700.60 Mb, 645.98 Mb, and 437.29 Mb, with scaffold N50 lengths of 63.1 Mb, 63.7 Mb, 70.4 Mb, 61.1 Mb and 32.2 Mb, for P. santalinus, P. macrocarpus, D. cochinchinensis, D. cultrata and S. siamea, respectively. These genome data will serve as a valuable resource for studying crucial traits such as wood quality, disease resistance, and environmental adaptation in Hongmu.


Background & Summary
Leguminosae (Fabaceae) is the third-largest plant family, with 770 genera and 19,500 species, and has substantial economic value 1. The wood of a tree is divided into the outer layers of sapwood (SW) and the inner core of heartwood (HW). HW is the inner, dark-colored wood of a tree and has long been a valuable commodity, particularly in the timber industry, owing to its high durability, exquisite color, special scent, and resistance to rot and insects 2. Hongmu is a special term for precious timber trees in China, comprising 29 woody species, mainly from legume genera such as Pterocarpus, Dalbergia, Senna, and Millettia of the Fabaceae family 3. Despite their high economic value and medicinal properties, the lack of genome data hampers in-depth understanding of their genetic architecture and heartwood formation mechanisms 4,5. Therefore, in this study, we selected five highly priced, high-quality heartwood-producing trees, namely Pterocarpus santalinus, Pterocarpus macrocarpus, Dalbergia cochinchinensis, Dalbergia cultrata and Senna siamea, to generate genomic resources.
Pterocarpus santalinus (2n = 20) 6, commonly known as zitan, red sandalwood or red sanders, is mainly distributed in India and in South and Southwest China. The plant is valued for its heartwood, which has an excellent red color and texture and is resistant to decay and insects 7,8. It is classified as "Endangered" on the IUCN Red List of Threatened Species because of illegal overharvesting. The heartwood has medicinal properties, including the ability to alleviate fever, reduce inflammation, combat microbes, and act as an antioxidant, all of which are harnessed in traditional medicine. The major bioactive phytocompounds extracted from the heartwood are santalins, flavonoids, terpenoids, phenolic compounds, alkaloids, saponins, tannins, and glycosides 9.
Pterocarpus macrocarpus (2n = 22) 10, commonly known as Burma padauk, is also an important timber species of Southeast Asia, distributed in Myanmar and Thailand 11. Its reddish HW is expensive and is used for making furniture and handicrafts because of its superior wood properties, including high density and resistance to termite attack 12.
Dalbergia cochinchinensis (2n = 20) 13 (Thai rosewood), distributed in Thailand, Cambodia, Vietnam, and Laos, is listed as Critically Endangered on the IUCN Red List of Threatened Species. Its reddish heartwood is valuable due to its high density, unique aroma and resistance to termites 14.
Dalbergia cultrata (2n = 20) is a rosewood species, also known as Burmese blackwood, that is distributed in the tropical and subtropical zones of the Indo-China peninsula and in the south of Yunnan province in China. Its heartwood is also valued for its quality, dark purplish-brown color, special scent and resistance to insects and disease. However, this species is threatened by overexploitation and is listed on the IUCN Red List of Threatened Species 15.
Senna siamea (2n = 28) 16, commonly known as kassod tree, cassod tree, or cassia tree in South and Southeast Asia, is widely planted throughout the tropics. The HW is black-brown, with high density and resistance to termites 17. In Thailand, the young leaves and fruits are used as vegetables or in traditional medicine 18.
The heartwood properties of rot and insect resistance, durability and color are largely determined by secondary metabolites 19. Increasing the content of secondary metabolites in the heartwood through breeding could therefore help reduce decay, increase resistance to insects, and enhance durability. Despite their economic and ecological importance, relatively little is known about the genetics and genomics of these heartwood-producing Leguminosae species. Most previous studies have focused on molecular markers, such as microsatellites and amplified fragment length polymorphisms (AFLPs), which provide limited information about genome structure and function 20-22. High-quality genome data provide a valuable resource for studying the genetic basis of important traits, such as wood quality, disease resistance, and environmental adaptation 23-27. In this study, we provide chromosome-level genomes of five heartwood-producing Hongmu species in Leguminosae: Pterocarpus santalinus, Pterocarpus macrocarpus, Dalbergia cochinchinensis, Dalbergia cultrata, and Senna siamea, using a combination of short-read, Nanopore long-read, and Hi-C data. This information can be used to improve breeding and conservation efforts for these species, as well as to develop new biotechnological applications. Additionally, the genome data can help shed light on the evolutionary history of, and relationships within, the Leguminosae family, which is one of the largest and most diverse families of flowering plants.
For the RNA-seq experiment, total RNA was extracted from fresh leaves and stems using a TIANGEN kit. After quality control, libraries were constructed and sequenced. The Illumina platform generated 11, 14, 22, 13, and 11 Gb of raw data for P. santalinus, P. macrocarpus, D. cochinchinensis, D. cultrata and S. siamea, respectively, and the BGI-DIPSEQ platform generated 74, 73, 69 and 77 Gb of raw data for stem samples of P. santalinus, D. cochinchinensis, D. cultrata and S. siamea, respectively (Table S2).
De novo genome assembly and evaluation. The Nanopore long reads of P. santalinus, P. macrocarpus, D. cochinchinensis and S. siamea were assembled using NECAT 31, while NextDenovo 32 was used for D. cultrata. All five assemblies were then polished with short reads using NextPolish 33. Finally, haplotigs and contig overlaps were removed from the assemblies using purge_dups (v.1.2.3) 34 (Table S4).
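Assembly contiguity in this study is summarized by N50 lengths. As a minimal sketch of how such a statistic can be derived from sequence lengths alone (not the evaluation code used by the authors; the function names `n50` and `fasta_lengths` are hypothetical), the N50 is the length of the sequence at which the running total of lengths, sorted from longest to shortest, first reaches half of the assembly size:

```python
def fasta_lengths(lines):
    """Yield the length of each record from an iterable of FASTA lines."""
    length = None  # None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    lengths = list(lengths)
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Toy example with made-up lengths (not data from this study).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))
```

For a real assembly, the lengths could be read with `n50(fasta_lengths(open(path)))`, where `path` is a placeholder for a scaffold or contig FASTA file.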
Further, we used the Hi-C data to anchor the contig assemblies to chromosomes. Juicer 35 was used to extract the uniquely mapped, non-PCR-duplicated Hi-C contact reads, and 3D-DNA 36 was then used to scaffold the assembled genome into a pseudochromosome-level assembly. Finally, the Hi-C assembly was visualized in Juicebox and manually corrected according to the Hi-C contact map. As a result, we obtained pseudochromosome-level assemblies of 623.86 Mb, 634.58 Mb, 700.60 Mb, 645.98 Mb, and 437.29 Mb, anchored to 10 chromosomes in each of P. santalinus, P. macrocarpus, D. cochinchinensis and D. cultrata and to 14 chromosomes in S. siamea, with scaffold N50 lengths of 63.1 Mb, 63.7 Mb, 70.4 Mb, 61.1 Mb and 32.2 Mb, respectively. More than 96% of scaffolds were anchored into the pseudochromosomes of each species, and the pseudochromosome numbers are consistent with the reported chromosome number of each species (2n = 20 for P. santalinus, P. macrocarpus, D. cochinchinensis and D. cultrata, 2n = 28 for S. siamea) (Fig. 1a-c, Table 1, Tables S4, S5, Fig. 2).
Repeat annotation. We combined de novo and homology-based methods to identify repeat elements in the genomes of the five species. For de novo prediction, we used LTR_FINDER 37 and RepeatModeler 38 to detect repeat elements and then built a non-redundant library, which RepeatMasker 39 used to identify repeat elements. For the homology-based approach, we used TRF to find tandem repeats and RepeatMasker to search for repeat elements against RepBase (v.21.12). In total, 49.07%, 49.49%, 62.58%, 48.88%, and 47.14% of the genome sequences were identified as repetitive sequences in P. santalinus, P. macrocarpus, D. cochinchinensis, D. cultrata and S. siamea, respectively (Table 2, Table S6). Long terminal repeats (LTRs) made up the largest proportion, comprising 39.49%, 39.90%, 55.88%, 43.27%, and 39.97% in P. santalinus, P. macrocarpus, D. cochinchinensis, D. cultrata and S. siamea, respectively. The two Dalbergia species showed higher LTR proportions than the other three legume trees. Among the LTRs, Gypsy elements (28.69%, 28.81%, 38.93%, 30.59%, 26.25%) were the most abundant in P. santalinus, P. macrocarpus, D. cochinchinensis, D. cultrata and S. siamea, respectively. The two Dalbergia species likewise showed the highest proportions of Gypsy elements among the five trees (Table S6).
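The repeat percentages above are fractions of total genome length. As an illustrative sketch only, and assuming a soft-masked FASTA in which repeats are written in lowercase (for example as produced by RepeatMasker's `-xsmall` option; the source does not state which masking mode was used), such a fraction could be computed as follows. The function name `masked_fraction` is hypothetical.

```python
def masked_fraction(lines):
    """Return the fraction of bases written in lowercase (soft-masked).

    `lines` is any iterable of FASTA lines. Header lines are skipped.
    All sequence characters, including N gap bases, count toward the
    denominator; this is a simplifying assumption of the sketch.
    """
    masked = 0
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        total += len(line)
        masked += sum(1 for base in line if base.islower())
    if total == 0:
        raise ValueError("no sequence found")
    return masked / total


# Toy record (made-up sequence, not data from this study):
# 6 lowercase bases out of 16 in total.
toy = [">chr1", "ACGTacgtAC", "ggNNTT"]
print(f"{100 * masked_fraction(toy):.2f}% masked")
```

Per-class figures such as the LTR or Gypsy proportions cannot be recovered from soft masking alone; they require the per-element annotation table that the repeat-finding tools report.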
Ribosomal RNA (rRNA) genes were identified by BLAST searches against the plant rRNA database. MicroRNAs (miRNAs) and small nuclear RNAs (snRNAs) were searched against the Rfam 12.0 database. tRNAscan-SE 41 was also used to scan for tRNAs (Table S10). In particular, the number of rRNAs in S. siamea was higher than in the other four legume wood trees.

Data Records
All the raw genomic sequencing data are deposited in the Genome Sequence Archive (GSA) database of the National Genomics Data Center (NGDC) with the accession number CRA011389 45 under the BioProject accession number PRJCA017486 46. The chromosome-scale genome assemblies were submitted to GenBank at NCBI under the accession numbers GCA_031439595.1 47, GCA_031439585.1 48, GCA_031216125.1 49, GCA_031216105.1 50, and GCA_031216115.1 51.

Fig. 1 Circos plot and phylogenetic tree of five Hongmu species in Leguminosae. (a-c) The distribution of genomic features along the chromosomes (scale in Mb). I, pseudochromosomes. II, gene density. III, GC content. IV, density of transposable elements. V, density of LTR transposable elements. VI, density of LTR/Copia transposable elements. VII, density of LTR/Gypsy transposable elements. VIII, collinearity of the genome. Senna tora is used for comparison only; its genome data were not generated in this study. (d) The phylogenetic tree of 11 representative legume species. All nodes have 100% bootstrap support based on maximum likelihood analysis. The species sequenced in the present study are highlighted in red.

Fig. 2 Hi-C map showing genome-wide all-by-all interactions. The map shows, at high resolution, individual chromosomes that are scaffolded and assembled independently. The heat map colors, ranging from white to dark red, indicate the frequency of Hi-C interaction links from low to high (0-8).

Table 1. Genome assembly and assessment of five Hongmu species in Leguminosae.

Table 2. Genome annotation of five Hongmu species in Leguminosae.