Chromosome-scale genomes of commercially important mahoganies, Swietenia macrophylla and Khaya senegalensis

Sahu, Sunil Kumar; Liu, Min; Wang, Guanlong; Chen, Yewen; Li, Ruirui; Fang, Dongming; Sahu, Durgesh Nandini; Mu, Weixue; Wei, Jinpu; Liu, Jie; Zhao, Yuxian; Zhang, Shouzhou; Lisby, Michael; Liu, Xin; Xu, Xun; Li, Laigeng; Wang, Sibo; Liu, Huan; He, Chengzhong

doi:10.1038/s41597-023-02707-w


Data Descriptor
Open access
Published: 25 November 2023


Scientific Data volume 10, Article number: 832 (2023)



Abstract

Mahogany species (family Meliaceae) are highly valued for their aesthetic and durable wood. Despite their economic and ecological importance, genomic resources for mahogany species are limited, hindering genetic improvement and conservation efforts. Here we perform chromosome-scale genome assemblies of two commercially important mahogany species: Swietenia macrophylla and Khaya senegalensis. By combining 10X sequencing and Hi-C data, we assemble high-quality genomes of 274.49 Mb (S. macrophylla) and 406.50 Mb (K. senegalensis), with scaffold N50 lengths of 8.51 Mb and 7.85 Mb, respectively. A total of 99.38% and 98.05% of the assembled sequences are anchored to 28 pseudo-chromosomes in S. macrophylla and K. senegalensis, respectively. We predict 34,129 and 31,908 protein-coding genes in S. macrophylla and K. senegalensis, respectively, of which 97.44% and 98.49% are functionally annotated. The chromosome-scale genome assemblies of these mahogany species could serve as a vital genetic resource, especially in understanding the properties of non-model woody plants. These high-quality genomes could support the development of molecular markers for breeding programs, conservation efforts, and the sustainable management of these valuable forest resources.


Background & Summary

The stability of forest ecosystems is increasingly threatened by factors such as global climate change and unrestricted anthropogenic exploitation¹. For the conservation and development of timber species, it is therefore important to generate genomic information and decode the underlying genetic architecture and regulatory mechanisms in order to improve forest productivity, adaptation, resilience, and sustainability²,³. In recent years, significant progress has been made in sequencing and analyzing the genomes of timber tree species such as Populus trichocarpa⁴, Eucalyptus grandis⁵, Tectona grandis⁶, Dalbergia sissoo⁷, and Hopea hainanensis³, providing valuable insights into the genetic basis of traits such as wood formation, growth, and adaptation to environmental stress². Genomics-based approaches can improve the productivity and adaptability of timber species directly, either by modifying one or more genes or by identifying effective genetic markers and genes for molecular breeding. Genomic research can also accelerate the generation of knowledge in systems biology, which is important for the development of computational genomics⁸. Computational genomics has opened up new ways of identifying genes that regulate complex traits, and through gene stacking and genome editing, customized timber species for special applications can be designed⁹. Forest trees are essential for maintaining biodiversity in terrestrial ecosystems and for producing fiber, fuel, and biomass¹⁰. Forestry studies, including genomics, are therefore likely to become a higher priority in the future.

Mahogany is a tropical hardwood known for its durability, stability, and attractive reddish-brown color, and is commonly used in the manufacture of fine furniture, cabinetry, flooring, and musical instruments¹¹. Swietenia macrophylla, commonly known as large-leaf mahogany, is a tropical timber species in the Meliaceae family that can tolerate a wide range of soils and environmental conditions. It can grow up to 40 meters tall, reach a diameter of up to two meters, and live for several centuries¹². S. macrophylla is one of three species that produce genuine mahogany timber (Swietenia) and is famous for its high-quality wood, which plays an important role in the international mahogany market. The wood is used principally for furniture, musical instruments, interior fittings, and ship building¹³. Furthermore, S. macrophylla contains a variety of bioactive compounds of medicinal value, such as phenols, flavonoids, terpenoids, and alkaloids¹⁴,¹⁵. Overall, the study of S. macrophylla highlights the urgent need to protect this valuable and threatened species. Through better management practices, forest conservation, and the sustainable use of this resource, we can ensure the long-term survival of S. macrophylla and other important tropical hardwood species.

Khaya senegalensis is another important deciduous tree species in the Meliaceae family and is native to Africa. The wood of K. senegalensis is prized for its beauty and durability, and it is used for a variety of purposes, including carpentry, interior trim, and construction. Traditionally, the wood was also used to make dugout canoes, household implements, and djembe drums, and as fuel wood¹⁶,¹⁷. The species is also used in traditional African folk medicine, and has been shown to be effective in treating a variety of ailments, including malaria, fever, and diarrhea. Overall, K. senegalensis is a valuable source of timber with potential for a variety of medical applications. To date, the genomes of several important tree species of the Meliaceae family have been sequenced, including Toona sinensis¹⁸, Toona ciliata¹⁹, Azadirachta indica²⁰, Xylocarpus rumphii, X. moluccensis and X. granatum²¹.

Here, we construct high-quality genomes of S. macrophylla and K. senegalensis using a combination of 10X reads and Hi-C sequencing data. We predict 34,129 (S. macrophylla) and 31,908 (K. senegalensis) protein-coding genes. We also identify 187 and 123 miRNAs, 648 and 844 tRNAs, and 249 and 186 rRNAs from the S. macrophylla and K. senegalensis genomes, respectively. Although a draft genome of S. macrophylla²¹ has been published previously, it lacks Hi-C data. By adding Hi-C data, our study elevates the genome to chromosome scale with a longer N50, resulting in a higher-quality genome assembly.

Methods

Sample collection, library construction and sequencing, genome size evaluation

Fresh young leaves of Swietenia macrophylla (HCNGB_00002344) and Khaya senegalensis (HCNGB_00002341) were collected from Ruili, Yunnan, China (24°03′04.4″N 97°56′16.9″E), and voucher specimens were stored in the Herbarium of China National GeneBank (HCNGB) (Supplemental Figs. 1–2). DNA was extracted using CTAB (cetyltrimethylammonium bromide)²², then GEM and barcode sequences were generated following the standard protocol (Chromium Genome Chip Kit v1, 10X Genomics, Pleasanton, USA) for S. macrophylla and K. senegalensis. The barcoded libraries were then sequenced on the BGISEQ-500 platform to generate 150 bp read pairs²³. In total, we generated 1283.02 million reads and 192.45 Gb of raw data for S. macrophylla, and 1141.22 million reads and 171.18 Gb of raw data for K. senegalensis (Supplemental Table S1).

We also collected fresh young leaves and branch samples (to obtain xylem and phloem tissues) from each species. RNA was extracted using the PureLink RNA Mini Kit (Thermo Fisher Scientific, Carlsbad, CA, USA) following the standard protocol, and RNA libraries were constructed according to the TruSeq RNA Sample Preparation Kit manual (Illumina, San Diego, CA, USA). The RNA libraries were then sequenced on the BGISEQ-500 platform (paired-end, 100-bp or 150-bp reads). The RNA reads were filtered with Trimmomatic²⁴ using the parameters: ILLUMINACLIP:adapter.fa:2:30:20:8:true HEADCROP:5 LEADING:3 TRAILING:3 SLIDINGWINDOW:5:8 MINLEN:50. This yielded 241.63 million clean reads and 45.88 Gb of clean data for S. macrophylla, and 517.49 million clean reads and 104.53 Gb of clean data for K. senegalensis (Supplemental Table S2).
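The SLIDINGWINDOW:5:8 and MINLEN:50 settings can be illustrated with a simplified sketch. The Python below is not Trimmomatic's implementation and omits the adapter, HEADCROP, LEADING and TRAILING steps; the function names are hypothetical, and the logic only shows the basic idea of cutting a read at the first 5-base window whose mean quality falls below 8 and then discarding reads shorter than 50 bases.

```python
def sliding_window_keep(quals, window=5, min_avg=8):
    """Return the number of leading bases kept by a simple sliding-window cut.

    quals is a list of per-base Phred scores. The read is cut at the start
    of the first window whose mean quality is below min_avg.
    """
    if len(quals) < window:
        return len(quals)  # simplification: too short to evaluate a window
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def passes_minlen(kept_length, min_len=50):
    """Return True if the trimmed read is long enough to be retained."""
    return kept_length >= min_len


# Toy read: 60 good bases followed by a low-quality tail.
toy_quals = [30] * 60 + [2] * 10
kept = sliding_window_keep(toy_quals)
print(kept, passes_minlen(kept))
```

In this toy read the cut lands one base before the low-quality tail, because the window mean only drops below the threshold once four of its five bases are poor.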

Hi-C libraries were constructed according to the in situ ligation protocol²⁵ using the MboI restriction enzyme. The MboI-digested chromatin was end-labelled with biotin-14-dATP (Thermo Fisher Scientific, Waltham, MA, USA) and used for in situ DNA ligation. The DNA was extracted, purified, and then sheared using a Covaris S2 (Covaris, Woburn, MA, USA). After A-tailing, pull-down and adapter ligation, the DNA libraries were sequenced on a BGISEQ-500 to produce 100-bp read pairs. This generated 1483.63 million reads and 148.36 Gb of Hi-C raw data for S. macrophylla, and 1519.79 million reads and 151.98 Gb of Hi-C raw data for K. senegalensis (Supplemental Table S1).
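The reported data volumes follow from read count multiplied by read length. As a quick consistency check, the short Python sketch below recomputes the raw data sizes from the read counts and read lengths given in the text (150 bp for the 10X libraries, 100 bp for Hi-C). The function name is hypothetical, and the calculation assumes every read has the full nominal length.

```python
def gigabases(million_reads, read_len_bp):
    """Convert a read count (in millions) and read length (bp) to gigabases."""
    return million_reads * 1e6 * read_len_bp / 1e9


libraries = {
    "S. macrophylla 10X": (1283.02, 150),
    "K. senegalensis 10X": (1141.22, 150),
    "S. macrophylla Hi-C": (1483.63, 100),
    "K. senegalensis Hi-C": (1519.79, 100),
}

for name, (reads, length) in libraries.items():
    print(f"{name}: {gigabases(reads, length):.2f} Gb")
```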

A k-mer (k = 21) analysis was performed to estimate genome size, the proportion of repeat sequences, and heterozygosity. The analysis used the DNA sequencing reads from the 10X libraries, which were filtered using SOAPnuke²⁶ with the parameters (-l 10 -q 0.1 -n 0.01 -Q 2 -d --misMatch 1 --matchRatio 0.4). The genome size was estimated from the k-mer frequency distribution using the following formula:

$$Gen=Num\ast \left(Len-K+1\right)/K\_Dep$$

Here, Num represents the number of reads used, Len represents the read length, K represents the k-mer length, and K_Dep refers to the depth at which the main peak is located in the distribution. The distribution of 21-mers showed that the heterozygosity and duplication rate of the genome were 1.00% and 20.14%, respectively, in S. macrophylla and 0.73% and 42.60% in K. senegalensis, with genome sizes of 274.49 Mb (S. macrophylla) and 406.50 Mb (K. senegalensis) (Fig. 1 and Supplemental Table S3).
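The genome size estimate can be written as a one-line calculation. The Python sketch below is illustrative only: the function name and the input values are hypothetical (the peak depth for these genomes is not given in the text), and it simply evaluates the number of k-mers contributed by each read, Len - K + 1, times the number of reads, divided by the main-peak depth.

```python
def estimate_genome_size(num_reads, read_len, k, peak_depth):
    """Estimate genome size (bp) from k-mer counts.

    Each read of length read_len contributes (read_len - k + 1) k-mers;
    dividing the total k-mer count by the main-peak depth gives the size.
    """
    kmers_per_read = read_len - k + 1
    return num_reads * kmers_per_read / peak_depth


# Hypothetical example: one million 150-bp reads, k = 21, main peak at depth 50.
print(estimate_genome_size(1_000_000, 150, 21, 50))
```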

Genome assembly, evaluation, and repeat annotation

For genome assembly, Supernova, a de novo assembler designed to assemble diploid germline genomes from Linked-Reads (10X library sequences), was run with default parameters, and the result was exported in fasta format using the ‘pseudohap2’ style. Gaps were then filled using GapCloser²⁷ with the parameter “-l 150”. The Hi-C reads were quality controlled and mapped to the genome assembly of each species using Juicer²⁸ with default parameters. Subsequently, a candidate superscaffold-level assembly was generated automatically using the 3D-DNA pipeline²⁹ with default parameters to correct misjoins and to order, orient, and organize scaffolds from the draft assembly. The draft assembly was checked and refined manually in Juicebox Assembly Tools³⁰ (Fig. 2a). The transcriptome sequences were assembled using Bridger³¹ and then mapped to the scaffold assembly using BLAT³². The 10X clean reads were preliminarily assembled into scaffold sequences of 290.21 Mb with a scaffold N50 of 5.76 Mb for S. macrophylla and 406.50 Mb with a scaffold N50 of 2.53 Mb for K. senegalensis. The scaffold sequences of both mahogany species were further anchored onto 28 pseudochromosomes, accounting for 99.38% and 98.05% of the assembled genome, respectively. The final chromosome-scale genome assembly was 288.41 Mb with a scaffold N50 of 8.51 Mb in S. macrophylla and 370.38 Mb with a scaffold N50 of 7.85 Mb in K. senegalensis (Table 1, Supplemental Tables S4-5).
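Scaffold N50 is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal Python sketch of that definition follows; the function name and the toy lengths are hypothetical and are not taken from the assemblies described here.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("n50 requires at least one scaffold length")


# Toy assembly of five scaffolds (arbitrary units).
print(n50([2, 3, 4, 5, 6]))
```

In the toy example the total length is 20, and the two longest scaffolds (6 and 5) already cover more than half of it, so the N50 is 5.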

Table 1 Genome assembly and assessment statistics.

Full size table

Repetitive elements were identified using a combination of homology-based and de novo approaches with default parameters. For the homology-based approach, we aligned the genome assembly against the known repeat database Repbase v. 21.01³³ using RepeatMasker v. 4.0.6³⁴. For the de novo approach, RepeatModeler v. 1.0.8³⁵ and LTR Finder v. 1.0.6³⁶ were used to construct a new repeat library from the genome assembly, and RepeatMasker v. 4.0.6³⁷ was then used to identify and annotate repeat elements in the genome. Finally, TRF v. 4.07³⁸ was used to annotate tandem repeats in the genomes (Table 2). We identified 85.08 Mb (29.50%) of repetitive sequences in the S. macrophylla genome and 80.85 Mb (21.83%) in the K. senegalensis genome. Most of these repeat sequences are Class I retrotransposons (53.57%), including Copia, Gypsy, LINE and SINE elements, which account for 9.04%, 4.87%, 0.54% and 0.03% of the entire genome in S. macrophylla and 6.24%, 5.19%, 0.48% and 0.08% in K. senegalensis, respectively (Table 2, Supplemental Table S6).
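The repeat percentages can be reproduced from the repeat lengths and assembly sizes reported in the text. The Python sketch below assumes the denominators are the final chromosome-scale assembly sizes (288.41 Mb and 370.38 Mb); that choice is an inference from the arithmetic, and the function name is hypothetical.

```python
def percent_of_genome(repeat_mb, assembly_mb):
    """Return repeat content as a percentage of the assembly size."""
    return 100 * repeat_mb / assembly_mb


# Values taken from the text; denominators are the final assembly sizes.
print(round(percent_of_genome(85.08, 288.41), 2))  # S. macrophylla
print(round(percent_of_genome(80.85, 370.38), 2))  # K. senegalensis
```

With these denominators the results round to the reported 29.50% and 21.83%.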

Table 2 Genome annotation statistics.

Full size table

Gene annotation, functional annotation and noncoding RNAs annotation

The MAKER-P pipeline (version 2.31)³⁹ was used to predict protein-coding gene structures based on RNA, homologous protein and de novo prediction evidence. Clean transcriptome reads were assembled into Inchworm contigs using Trinity (version 2.0.6)⁴⁰ and then submitted to MAKER-P as expressed sequence tags for RNA evidence. Protein sequences from model plants or related species (Supplemental Table S7) were downloaded and used as protein evidence for homology comparisons in the two mahogany species. For de novo prediction, multiple training sets were created for the various ab initio gene predictors. A set of transcripts was first generated using the genome-guided approach of Trinity⁴⁰. Using PASA (version 2.0.2)⁴¹, these transcripts were then aligned back to the genome, creating a collection of gene models with real gene features. Complete gene models were chosen for Augustus⁴² training. Genemark-ES (version 4.21)⁴³ was self-trained with default parameters. Based on these data, the first round of MAKER-P was run with default parameters, except that “est2genome” and “protein2genome” were set to “1”, so that only RNA- and protein-supported gene models, respectively, were produced. These gene models were then used to train SNAP⁴⁴. The second and final rounds of MAKER-P were executed using the default parameters to generate the final gene models. In total, 34,129 and 32,914 protein-coding genes were predicted in S. macrophylla and K. senegalensis, respectively. The average gene length was 3052.92 bp for S. macrophylla and 3068.00 bp for K. senegalensis. The average lengths of exons and introns were 215.60 bp and 402.79 bp, respectively, for S. macrophylla, and 230.06 bp and 431.15 bp, respectively, for K. senegalensis (Table 2, Supplemental Table S8).

Functional annotation of protein-coding genes was based on sequence similarity and domain conservation, by comparing the predicted amino acid sequences against publicly available databases. First, the predicted proteins were searched for best matches against protein sequence databases, including the Kyoto Encyclopaedia of Genes and Genomes (KEGG)⁴⁵, the National Centre for Biotechnology Information (NCBI) non-redundant (NR) and COG databases⁴⁶, SwissProt⁴⁷, and TrEMBL. This search was performed using BLASTP with an E-value cut-off of 1e-5. Subsequently, InterProScan 55.0 was employed to detect and classify domains and motifs using the Pfam⁴⁸, SMART⁴⁹, PANTHER⁵⁰, PRINTS⁵¹, and ProDom⁵² databases. The resulting annotation rates for S. macrophylla and K. senegalensis were 97% and 98%, respectively (Table 2, Supplemental Table S9). In addition, 12,152 genes (35.61% of the S. macrophylla gene set) and 11,954 genes (37.46% of the K. senegalensis gene set) were jointly annotated in five functional databases (Fig. 3a).
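The best-match search with an E-value cut-off can be sketched as a small post-processing step. The Python below is an illustration under stated assumptions, not the pipeline used in the study: it assumes the BLASTP hits are available in the standard 12-column tabular layout (E-value in column 11, bit score in column 12), and the function name and the sample hits are hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Keep, for each query, the highest-scoring hit that passes the E-value cut-off.

    Returns a dict mapping query ID to (subject ID, E-value, bit score).
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the cut-off
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: gene1 has two passing hits, gene2 has none.
sample = (
    "gene1\tP1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "gene1\tP2\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t300\n"
    "gene2\tP3\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t20\n"
)
print(best_hits(sample))
```

Genes with no hit below the cut-off in any database, such as gene2 here, would remain functionally unannotated by this step.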

To annotate non-coding RNAs, the ribosomal RNA (rRNA) genes were queried against the A. thaliana rRNA database using BLASTN v. 2.2.26⁵³ with the parameters (-e 1e-5 -v 10000 -b 10000). The Rfam database⁵⁴ was queried for microRNAs (miRNA) and small nuclear RNAs (snRNA), and tRNAscan-SE⁵⁵ was employed to scan for transfer RNAs (tRNA). In this way, we identified rRNA, miRNA, and tRNA genes in S. macrophylla and K. senegalensis. The numbers obtained for S. macrophylla were 249 rRNAs, 187 miRNAs, and 648 tRNAs, while for K. senegalensis they were 630 rRNAs, 189 miRNAs, and 844 tRNAs (Table 2, Supplemental Table S10).

Genome collinearity and Circos plot construction

MCScanX¹ was used to identify genomic collinearity between the two mahogany species and to obtain their pairs of collinear genes. The collinearity file generated by MCScanX was combined with the genome assembly and annotation files to construct a Circos plot (Fig. 2b). The genomes of the two mahogany species share many structural features: (1) both consist of 28 chromosomes; (2) gene density and GC content are positively correlated; (3) LTR density is negatively correlated with gene density and GC content; and (4) the chromosomes of the two species show a high degree of collinearity, which supports their close affinity. To show the taxonomic position of the sequenced species, a phylogenetic tree was constructed based on 317 single-copy orthologues obtained from OrthoFinder v. 2.3.1⁵⁶ clustering (Fig. 3b). First, MAFFT v. 7.310⁵⁷ was used to perform multiple sequence alignment of the single-copy orthologue protein sequences, and the alignments were input into IQ-TREE v. 1.6.1⁵⁸ with the parameter “-b 100” to construct the phylogenetic tree. The resulting tree was rooted and visualized using FigTree v. 1.4 (http://tree.bio.ed.ac.uk/software/figtree). Second, species divergence times were estimated by combining the MCMCTREE module of PAML v. 4.5⁵⁹ with the TToL5 web portal⁶⁰. Finally, we used CAFE v. 4.2.1⁶¹ to analyze gene family expansion and contraction events. S. macrophylla and K. senegalensis diverged ~13.8 Mya and were closest to the genus Citrus, which was consistent with the results for T. sinensis¹⁸ and T. ciliata¹⁹, which belong to the same genus. The divergence time between T. sinensis and T. ciliata was ~15.3 Mya, which overlapped with the results of Wang et al.¹⁹ In addition, the two mahogany species diverged from A. thaliana ~93.6 Mya and from P. trichocarpa ~99.7 Mya, which was similar to the estimates of He et al.²¹ A total of 1735 and 1543 gene families had expanded and contracted, respectively, in the S. macrophylla genome, while 1537 and 2052 gene families had expanded and contracted, respectively, in the K. senegalensis genome.
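A single-copy orthologue set is an orthogroup in which every species contributes exactly one gene. The Python sketch below illustrates that selection on a per-species gene count table; the function name, orthogroup IDs, species labels and counts are all hypothetical, and the sketch does not reproduce OrthoFinder's own output handling.

```python
def single_copy_orthogroups(gene_counts):
    """Return the IDs of orthogroups with exactly one gene in every species.

    gene_counts maps orthogroup ID -> {species name: number of genes}.
    """
    return [
        orthogroup
        for orthogroup, per_species in gene_counts.items()
        if per_species and all(count == 1 for count in per_species.values())
    ]


# Hypothetical counts for three species.
toy_counts = {
    "OG0001": {"Smac": 1, "Ksen": 1, "Tsin": 1},
    "OG0002": {"Smac": 2, "Ksen": 1, "Tsin": 1},  # duplicated in one species
    "OG0003": {"Smac": 1, "Ksen": 0, "Tsin": 1},  # missing in one species
}
print(single_copy_orthogroups(toy_counts))
```

Only orthogroups of the first kind are suitable for a phylogeny built from one sequence per species, which is why families with duplications or losses are excluded.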

Data Records

All the raw genomic sequencing data were deposited in the Genome Sequence Archive (GSA) database of the National Genomics Data Center (NGDC) with the accession number CRA011793⁶² under the BioProject accession number PRJCA018269⁶³. The assembled scaffold genomes were submitted to the Genome Warehouse under the accession numbers GWHDONZ00000000⁶⁴ and GWHDOOA00000000⁶⁵ for S. macrophylla and K. senegalensis, respectively. The chromosome-scale genome assemblies were also submitted to the NCBI under the accession numbers GCA_032401905.1⁶⁶ and GCA_032402905.1⁶⁷ for S. macrophylla and K. senegalensis, respectively. The raw sequencing data and assembled genomes of S. macrophylla and K. senegalensis that support the findings of this study have also been deposited into the CNGB Sequence Archive (CNSA)⁶⁸ of the China National GeneBank DataBase (CNGBdb)⁶⁹ with accession numbers CNP0004053 and CNP0004052, respectively. The gene annotations, pseudogene predictions, and ncRNA files are available in Figshare⁷⁰.

Technical Validation

Genome assembly and validation of gene prediction

To evaluate the quality of the genome assemblies, we used bwa (version: 0.7.12; mode: aln)⁷¹ to align the Illumina short reads to the chromosome-level genomes. In total, 97.43% and 97.68% of the Illumina short reads were mapped to the S. macrophylla and K. senegalensis genomes, respectively (Supplemental Table S11). BUSCO (version 3.0.1)⁷² was used to assess the completeness of the genome assemblies, giving 97% (S. macrophylla) and 96.2% (K. senegalensis) for the scaffold-scale genomes and 95.8% (S. macrophylla) and 91.6% (K. senegalensis) for the chromosome-scale genomes. To assess the Hi-C assembly, we examined the chromosomal interaction heatmap, in which the intensity of diagonal interactions within each group is higher than that of non-diagonal interactions (Fig. 2a). This is consistent with the principle of Hi-C assisted genome assembly and indicates that the genome assembly is accurate. Taken together, these results show that the genomes of the two mahogany species assembled in this study have a high degree of integrity.

For gene prediction, we used BUSCO (version 3.0.1) to assess how many genes of the core gene set (embryophyta_odb10) were recovered in the annotated gene sets of the two mahogany species. The results showed that S. macrophylla had 1284 genes matched to the core gene set (93.4%), while K. senegalensis had 1268 genes (92.2%), indicating that the annotated gene sets of both mahogany species are highly complete.
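The two matched-gene counts and their percentages can be checked against each other. The Python sketch below back-calculates the size of the core gene set implied by each count and percentage pair; the total is not stated in the text, so the result is an inference from the reported figures, and the function name is hypothetical.

```python
def implied_total(matched_genes, percent):
    """Return the reference set size implied by a count and its percentage."""
    return round(matched_genes / (percent / 100))


# Figures reported in the text for the two annotated gene sets.
print(implied_total(1284, 93.4))  # S. macrophylla
print(implied_total(1268, 92.2))  # K. senegalensis
```

Both pairs imply the same reference set size, which indicates that the two percentages were computed against a common core gene set.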

Code availability

All software used in this work is publicly available, and the parameters used are described in the Methods section. Where no parameters are mentioned for a program, the default parameters suggested by the developer were used.

References

Wang, Y. et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic acids research 40, e49–e49 (2012).
Neale, D. B. & Kremer, A. Forest tree genomics: growing resources and applications. Nature Reviews Genetics 12, 111–122 (2011).
Wang, S. et al. The chromosome‐scale genomes of Dipterocarpus turbinatus and Hopea hainanensis (Dipterocarpaceae) provide insights into fragrant oleoresin biosynthesis and hardwood formation. Plant Biotechnology Journal 20, 538–553 (2022).
Tuskan, G. A. et al. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). science 313, 1596–1604 (2006).
Myburg, A. A. et al. The genome of Eucalyptus grandis. Nature 510, 356–362 (2014).
Sahu, S. K. et al. Chromosome-scale genomes of commercial timber trees (Ochroma pyramidale, Mesua ferrea, and Tectona grandis). Scientific Data 10, 512 (2023).
Sahu, S. K. et al. Chromosome-scale genome of Indian Rosewood (Dalbergia sissoo). Frontiers in Plant Science 14, 1218515 (2023).
Sahu, S. K. & Liu, H. Long-read sequencing (method of the year 2022): the way forward for plant omics research. Molecular Plant 16, 791–793 (2023).
Borthakur, D. et al. Current status and trends in forest genomics. Forestry Research 2, 2–11 (2022).
Brockerhoff, E. G. et al. Forest biodiversity, ecosystem functioning and the provision of ecosystem services. Biodiversity and Conservation 26, 3005–3035 (2017).
Verissimo, A., Barreto, P., Tarifa, R. & Uhl, C. Extraction of a high-value natural resource in Amazonia: the case of mahogany. Forest ecology and Management 72, 39–60 (1995).
Gillies, A. C. M. et al. Genetic diversity in Mesoamerican populations of mahogany (Swietenia macrophylla), assessed using RAPDs. Heredity 83, 722–732 (1999).
Krisnawati, H., Kallio, M. & Kanninen, M. Swietenia Macrophylla King: Ecology, Silviculture And Productivity. (CIFOR, 2011).
Telrandhe, U. B., Kosalge, S. B., Parihar, S., Sharma, D. & Lade, S. N. Phytochemistry and pharmacological activities of Swietenia macrophylla King (Meliaceae). Sch Acad J Pharm 1, 6–12 (2022).
Moghadamtousi, S. Z., Goh, B. H., Chan, C. K., Shabab, T. & Kadir, H. A. Biological activities and phytochemicals of Swietenia macrophylla King. Molecules 18, 10465–10483 (2013).
Zhang, H., Wang, X., Chen, F., Androulakis, X. M. & Wargovich, M. J. Anticancer activity of limonoid from Khaya senegalensis. Phytotherapy Research 21, 731–734 (2007).
Arnold, R., Bevege, D. I., Bristow, M., Nikles, D. G. & Skelton, D. J. Khaya senegalensis - current use from its natural range and its potential in Sri Lanka and elsewhere in. Asia. Journal of Plant Protection 170, 1917–1930 (2004).
Ji, Y. T. et al. Long read sequencing of Toona sinensis (A. Juss) Roem: A chromosome‐level reference genome for the family Meliaceae. Molecular Ecology Resources 21, 1243–1255 (2021).
Wang, X. et al. A chromosome-level genome assembly of Toona ciliata (Meliaceae). Genome Biology and Evolution 14, evac121 (2022).
Du, Y. et al. Genomic analysis based on chromosome-level genome assembly reveals an expansion of terpene biosynthesis of Azadirachta indica. Frontiers in Plant Science 13 (2022).
He, Z. et al. Evolution of coastal forests based on a full set of mangrove genomes. Nature Ecology & Evolution 6, 738–749 (2022).
Kumar, S. S., Muthusamy, T. & Kandasamy, K. DNA Extraction Protocol for Plants with High Levels of Secondary Metabolites and Polysaccharides without Using Liquid Nitrogen and Phenol. Isrn Mol Biol 2012, 205049 (2012).
Huang, J. et al. BGISEQ-500 WGS library construction. protocols. io, 1–10 (2018).
Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30, 2114–2120 (2014).
Belaghzal, H., Dekker, J. & Gibcus, J. H. Hi-C 2.0: An optimized Hi-C procedure for high-resolution genome-wide mapping of chromosome conformation. Methods 123, 56–65 (2017).
Chen, Y. et al. SOAPnuke: a MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience 7, gix120 (2018).
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 2047-2217X–2041-2018 (2012).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell systems 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95 (2017).
Dudchenko, O. et al. The Juicebox Assembly Tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. Biorxiv, 254797 (2018).
Chang, Z. et al. Bridger: a new framework for de novo transcriptome assembly using RNA-seq data. Genome biology 16, 1–10 (2015).
Kent, W. J. BLAT—the BLAST-like alignment tool. Genome research 12, 656–664 (2002).
Jurka, J. Repbase update: a database and an electronic journal of repetitive elements. Trends in genetics 16, 418–420 (2000).
Tarailo-Graovac, M. & Chen, N. Using RepeatMasker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics 25, 4.10. 11–14.10. 14 (2009).
Flynn, J. M. et al. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences 117, 9451–9457 (2020).
Xu, Z. & Wang, H. LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic acids research 35, W265–W268 (2007).
Chen, N. Using Repeat Masker to identify repetitive elements in genomic sequences. Current protocols in bioinformatics 5, 4.10. 11–14.10. 14 (2004).
Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic acids research 27, 573–580 (1999).
Campbell, M. S., Holt, C., Moore, B. & Yandell, M. Genome annotation and curation using MAKER and MAKER-P. Current protocols in bioinformatics 48, 4.11. 11–14.11. 39 (2014).
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis. Nature protocols 8, 1494–1512 (2013).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome biology 9, 1–22 (2008).
Stanke, M., Schöffmann, O., Morgenstern, B. & Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC bioinformatics 7, 1–11 (2006).
Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic acids research 33, 6494–6506 (2005).
Korf, I. Gene finding in novel genomes. BMC bioinformatics 5, 1–9 (2004).
Aoki, K. F. & Kanehisa, M. Using the KEGG database resource. Current protocols in bioinformatics 11, 1.12.11–11.12.54 (2005).
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science 278, 631–637 (1997).
Boeckmann, B. et al. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic acids research 31, 365–370 (2003).
Bateman, A. et al. The Pfam protein families database. Nucleic acids research 32, D138–D141 (2004).
Letunic, I., Doerks, T. & Bork, P. SMART 6: recent updates and new developments. Nucleic acids research 37, D229–D232 (2009).
Mi, H., Muruganujan, A., Casagrande, J. T. & Thomas, P. D. Large-scale gene function analysis with the PANTHER classification system. Nature protocols 8, 1551–1566 (2013).
Attwood, T. K. et al. PRINTS and its automatic supplement, prePRINTS. Nucleic acids research 31, 400–402 (2003).
Corpet, F., Servant, F., Gouzy, J. & Kahn, D. ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic acids research 28, 267–269 (2000).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009).
Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. & Eddy, S. R. Rfam: an RNA family database. Nucleic acids research 31, 439–441 (2003).
Lowe, T. M. & Chan, P. P. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic acids research 44, W54–W57 (2016).
Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol 16, 157 (2015).
Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30, 772–780 (2013).
Minh, B. Q. et al. IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol 37, 1530–1534 (2020).
Yang, Z. PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24, 1586–1591 (2007).
Kumar, S. et al. TimeTree 5: An Expanded Resource for Species Divergence Times. Mol Biol Evol 39 (2022).
De Bie, T., Cristianini, N., Demuth, J. P. & Hahn, M. W. CAFE: a computational tool for the study of gene family evolution. Bioinformatics 22, 1269–1271 (2006).
NGDC Genome Sequence Archive https://bigd.big.ac.cn/gsa/browse/CRA011793 (2023).
NGDC BioProject https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA018269 (2023).
NGDC Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/64341/show (2023).
NGDC Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/64342/show (2023).
NCBI Assembly https://identifiers.org/insdc.gca:GCA_032401905.1 (2023).
NCBI Assembly https://identifiers.org/insdc.gca:GCA_032402905.1 (2023).
Guo, X. et al. CNSA: a data repository for archiving omics data. Database (Oxford) 2020, baaa055 (2020).
Chen, F. Z. et al. CNGBdb: China National GeneBank DataBase. Hereditas 42, 799–809 (2020).
Wang, G. Two mahogany species, Figshare, https://doi.org/10.6084/m9.figshare.23685360.v2 (2023).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Cheng, S. et al. 10KP: A phylodiverse genome sequencing plan. GigaScience 7, giy013 (2018).

Acknowledgements

This work was supported by the Major Science and Technology Projects of Yunnan Province (Digitalization, Development and Application of Biotic Resource, 202002AA100007) and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB27020104). This work is part of the 10KP project (https://db.cngb.org/10kp/)⁷³ and is also supported by China National GeneBank (CNGB; https://www.cngb.org/).

Author information

These authors contributed equally: Sunil Kumar Sahu, Min Liu, Guanlong Wang.

Authors and Affiliations

State Key Laboratory of Agricultural Genomics, Key Laboratory of Genomics, Ministry of Agriculture, BGI Research, Shenzhen, 518083, China
Sunil Kumar Sahu, Min Liu, Guanlong Wang, Yewen Chen, Ruirui Li, Dongming Fang, Durgesh Nandini Sahu, Weixue Mu, Jinpu Wei, Xin Liu, Xun Xu, Sibo Wang & Huan Liu
BGI Life Science Joint Research Center, Northeast Forestry University, Harbin, 150400, China
Min Liu & Huan Liu
College of Science, South China Agricultural University, Guangzhou, 510642, China
Guanlong Wang
College of Life Sciences, Chongqing Normal University, Chongqing, 400047, China
Ruirui Li
Forestry Bureau of Ruili, Yunnan Dehong, Ruili, 678600, China
Jie Liu
State Key Laboratory of Tree Genetics and Breeding, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
Yuxian Zhao
Laboratory of Southern Subtropical Plant Diversity, Fairy Lake Botanical Garden, Shenzhen, Chinese Academy of Sciences, Shenzhen, 518004, China
Shouzhou Zhang
Department of Biology, University of Copenhagen, Copenhagen, DK-2100, Denmark
Michael Lisby
Guangdong Provincial Key Laboratory of Genome Read and Write, BGI-Shenzhen, Shenzhen, 518083, China
Xun Xu
National Key Laboratory of Plant Molecular Genetics and CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai, 200032, China
Laigeng Li
Key Laboratory for Forest Genetic & Tree Improvement and Propagation in Universities of Yunnan Province, Southwest Forestry University, Kunming, 650224, China
Chengzhong He

Authors

Sunil Kumar Sahu
Min Liu
Guanlong Wang
Yewen Chen
Ruirui Li
Dongming Fang
Durgesh Nandini Sahu
Weixue Mu
Jinpu Wei
Jie Liu
Yuxian Zhao
Shouzhou Zhang
Michael Lisby
Xin Liu
Xun Xu
Laigeng Li
Sibo Wang
Huan Liu
Chengzhong He

Contributions

H.L., S.K.S. and C.H. led and designed this project. H.L., S.K.S. and S.W. conceived the study. S.K.S., W.M., J.W., S.Z. and J.L. collected the leaf and tissue samples. S.K.S., M.L., G.W. and Y.C. contributed to the sample preparation and performed the genome and chromosome-scale assembly. S.K.S., M.L., S.W., Y.C., D.F., G.W., D.N.S., W.M. and R.L. performed annotation and comparative genomic analyses. S.K.S., G.W. and M.L. wrote the original draft manuscript. S.W., M.L., S.Z., X.X., J.L., C.H., D.N.S., Y.Z., X.L., L.L. and H.L. revised and edited the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Sibo Wang, Huan Liu or Chengzhong He.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sahu, S.K., Liu, M., Wang, G. et al. Chromosome-scale genomes of commercially important mahoganies, Swietenia macrophylla and Khaya senegalensis. Sci Data 10, 832 (2023). https://doi.org/10.1038/s41597-023-02707-w

Received: 24 April 2023
Accepted: 31 October 2023
Published: 25 November 2023
DOI: https://doi.org/10.1038/s41597-023-02707-w
