Abstract
Camellia crapnelliana Tutch., a member of the family Theaceae, is an excellent landscape tree species of high ornamental value. It is also an important woody oil-bearing plant species of high ecological, economic, and medicinal value. Here, we report, for the first time, the chromosome-scale reference genome of C. crapnelliana, generated by integrating SMRT, Hi-C, and Illumina sequencing. The genome assembly had a total length of ~2.94 Gb with a contig N50 of ~67.5 Mb, and ~96.34% of the contigs were assigned to 15 chromosomes. In total, we predicted 37,390 protein-coding genes, ~99.00% of which could be functionally annotated. The chromosome-scale genome of C. crapnelliana will be a valuable resource for understanding the genetic basis of fatty acid biosynthesis and will greatly facilitate the exploration and conservation of C. crapnelliana.
Background & Summary
Oil-tea camellia trees, among the four largest woody oil plants in the world, are a group of Camellia species of high economic value1. In China, oil-tea camellia trees have a long history of cultivation and are mainly distributed south of the lower reaches of the Yangtze River2,3. Approximately 50 species of oil-tea camellia trees belong to the family Theaceae4, of which C. oleifera, C. chekiangoleosa, C. crapnelliana and C. vietnamensis1,3 are commonly cultivated. They are woody, oil-bearing tree species with a high seed oil content, and the oil is widely processed into skin and health care products and, especially, edible oil4. Camellia oil is remarkably rich in polyphenols, saponins, and other healthy components and is free of cholesterol, erucic acid, and other harmful components5. Thus, the oil has extremely high nutritional and health-beneficial value, strong market competitiveness, and wide market prospects6. The content of unsaturated fatty acids in the edible oil is quite high, reaching approximately 90%, and the content of oleic acid can be approximately 87%5. Tea oil is therefore referred to as “Oriental olive oil”7 and has both health-beneficial and medicinal value8.
Among these oil-tea camellia species, C. crapnelliana, which belongs to Sect. Furfuracea and is naturally distributed in Hong Kong, southern Guangxi, northern Fujian, southern Zhejiang and Jiangxi provinces, China, was listed as a second-class protected plant species in China and recorded in the China Plant Red Data Book (CPRDB) as early as 19929. As an excellent garden greening species with the largest flowers and fruits (Fig. 1) in the genus Camellia, it has great potential for industrial development as an oilseed plant10. Recently, several chromosome-level tea tree genomes have become publicly available11,12,13,14,15,16, but genome information for oil-tea camellia trees is still quite limited11,12,13,14,15,16. Considerable effort has been devoted to the regulation of fatty acid biosynthesis in many plants17,18,19,20,21,22,23,24,25,26,27,28,29; however, an in-depth understanding of the molecular basis and evolution of fatty acid biosynthesis in C. crapnelliana largely relies on a high-quality reference genome.
In this study, we constructed and annotated a high-quality chromosome-level reference genome of C. crapnelliana using integrated sequencing data (~71 × PacBio HiFi reads and ~140 × Hi-C reads) (Fig. 2). K-mer analysis estimated the genome size of C. crapnelliana to be ~3.055 Gb, with a repeat sequence proportion of 76.76% (Supplementary Table S1). The final assembled genome was ~2.94 Gb, with a contig N50 of ~67.50 Mb (Fig. 1d). Based on the karyotype of the species (2n = 30)30, ~96.34% of the contigs were anchored to 15 pseudochromosomes. A total of 37,390 protein-coding genes were predicted, of which 99.00% were functionally annotated. In addition, 176 miRNAs, 7,988 rRNAs, 857 tRNAs, and 485 snRNAs were annotated in the C. crapnelliana genome. The high-quality chromosome-level genome assembly of this oil-tea Camellia species will greatly aid the functional analysis of novel genes for improving oil quality and yield, and will support the conservation and utilization of its wild resources.
Methods
Plant materials, sample collection, and sequencing
For genomic DNA extraction, young healthy leaves of C. crapnelliana were collected from the South China National Botanical Garden, Guangzhou, China. Sampled leaves were immediately flash-frozen in liquid nitrogen and stored at −80 °C until further use. High-molecular-weight genomic DNA (gDNA) was extracted from the leaves using an improved CTAB method31 and evaluated using a NanoDrop One spectrophotometer (NanoDrop Technologies, Wilmington, DE) and a Qubit 3.0 Fluorometer (Life Technologies, Carlsbad, CA, USA). For the genome survey, a paired-end (PE 150 bp) library was generated using the Illumina TruSeq DNA Nano Preparation Kit (Illumina, San Diego, CA, USA) and sequenced on an Illumina HiSeq 2500 platform following the manufacturer’s instructions, yielding ~173.51 Gb of Illumina paired-end reads (Supplementary Table S2). PacBio HiFi sequencing was then performed on the PacBio Sequel II platform (Pacific Biosciences, CA, USA) according to the manufacturer’s instructions. We obtained ~212.87 Gb of HiFi reads with an average read length of ~19,232.96 bp, covering about 71 × of the C. crapnelliana genome (Supplementary Table S2). For Hi-C sequencing, fresh leaves were crosslinked with formaldehyde, and the crosslinking reaction was terminated with glycine solution. Subsequently, the Hi-C library was constructed according to the instructions and sequenced on the Illumina platform (Annoroad Gene Technology Co., Ltd), generating ~429.88 Gb of raw reads (Supplementary Table S2). Young leaves, flowers, young shoots, and seed kernels were collected for transcriptome sequencing. These tissue samples, with three biological replicates, were rinsed with ddH2O, snap-frozen in liquid nitrogen, and stored at −80 °C until use. Total RNA was extracted using the RNeasy Plant Mini Kit (Qiagen, Hilden, Germany). A cDNA library was built following the instructions and subjected to paired-end sequencing on the NovaSeq platform (Illumina). A total of ~30.00 Gb of RNA-seq reads were obtained to assist the subsequent analysis of the C. crapnelliana genome.
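The sequencing depths quoted above follow from dividing the data volume by the genome size. The following is a minimal sketch in Python, assuming the k-mer estimate of ~3.055 Gb as the denominator (the text does not say which genome size was used); the helper name `depth` is hypothetical.

```python
def depth(yield_gb: float, genome_gb: float) -> float:
    """Return sequencing depth (fold coverage) = bases sequenced / genome size."""
    if genome_gb <= 0:
        raise ValueError("genome size must be positive")
    return yield_gb / genome_gb


GENOME_GB = 3.055  # k-mer-based estimate; assumed denominator, not stated in the text

hifi_depth = depth(212.87, GENOME_GB)  # PacBio HiFi yield in Gb
hic_depth = depth(429.88, GENOME_GB)   # Hi-C raw read yield in Gb

print(f"HiFi depth: {hifi_depth:.1f}x")
print(f"Hi-C depth: {hic_depth:.1f}x")
```

Under this assumption both values land within about two units of the reported ~71 × and ~140 ×; using the assembled size (~2.94 Gb) instead would give somewhat higher depths.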
Chromosome-level genome assembly
Genome size of C. crapnelliana was estimated from Hi-C data using k-mer frequency analysis. Jellyfish v2.3.032 was first applied to extract and count canonical k-mers at k = 21. Subsequently, findGSE v1.9433 was used to estimate the genome size from the k-mer count data with parameters of “-k = 21”. As a result, we estimated the genome size of C. crapnelliana to be ~3.055 Gb (Supplementary Table S1). The PacBio HiFi reads were de novo assembled using hifiasm v0.16.134 with default parameters. The genome assembly had a total size of ~2.94 Gb and contained 816 contigs with an N50 of ~67.5 Mb (Supplementary Table S3). The cleaned Hi-C reads were mapped to the corresponding contigs using Juicer v1.9.935. The uniquely mapped reads were taken as input for the 3D-DNA pipeline v18011436 with parameters “-r 0”, and the result was then sorted and corrected manually using Juicebox v1.11.0837. The fifteen pseudochromosomes were identified by distinct interaction signals in the Hi-C interaction heatmap (Supplementary Fig. S1), and the final assembled genome length was ~2.94 Gb (Figs. 1d, 2), with a scaffold N50 of ~67.50 Mb, containing ~96.34% of the assembled contigs for C. crapnelliana (Supplementary Table S4), accounting for ~96.34% of the estimated genome size based on the k-mer analysis (Supplementary Table S1). Compared with the ten other genome assemblies publicly available in the genus Camellia, the chromosome-level genome assembly of C. crapnelliana obtained in this study showed remarkable sequence continuity and genome completeness (Supplementary Table S5).
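For readers unfamiliar with the N50 statistic used above, the following minimal Python sketch shows how it is conventionally computed from a list of contig or scaffold lengths. The function name `n50` and the toy lengths are hypothetical and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly size."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("empty list of lengths")


toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # hypothetical lengths, arbitrary units
print(n50(toy_contigs))
```

Because the statistic depends only on the sorted lengths, the same function applies to contigs and to scaffolds alike.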
Genome annotation and functional prediction
The repetitive elements in the C. crapnelliana genome were identified by combining de novo and homology-based approaches. Tandem repeat sequences were annotated using Tandem Repeats Finder (TRF v4.09)38 with default parameters. A total of six types (mono- to hexa-nucleotides) of simple sequence repeats (SSRs) were identified using the MISA (MIcroSAtellite)39 identification tool with default parameters. For de novo-based searches, RepeatModeler v2.0.2a40, LTR_FINDER v1.0741, LTRharvest v1.5.942, and LTR_retriever v2.9.143 were applied to construct de novo repeat libraries, which RepeatMasker v4.1.3-p144 then used to detect repeat sequences. For homology-based searches, we employed RepeatMasker v4.1.3-p144 against a known repeat library, Repbase v.19.0645. As a result, a total of ~2.44 Gb of repetitive elements occupying ~82.87% of the C. crapnelliana genome were annotated (Fig. 2; Supplementary Table S6). Most of these repeats were long terminal repeat (LTR) retrotransposons (~63.24% of the genome; Supplementary Table S6). The DNA, LINE, and SINE classes accounted for ~10.84%, ~4.19%, and ~0.13% of the genome, respectively (Fig. 2; Supplementary Table S6). Additionally, tRNAscan-SE v2.046 was used to predict tRNA genes. The rRNAs, miRNAs, and snRNAs were predicted using INFERNAL (v1.1.2)47 through searches against the Rfam database v9.148. Finally, we annotated 176 miRNAs, 7,988 rRNAs, 857 tRNAs, and 485 snRNAs in the C. crapnelliana genome (Supplementary Table S7).
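To illustrate what identifying mono- to hexa-nucleotide SSRs means in practice, here is a small regular-expression sketch in Python. It is not MISA and does not reproduce its output format. The minimum repeat counts in `MIN_REPEATS` are illustrative assumptions rather than the settings used in this study, and `find_ssrs` and `is_primitive` are hypothetical helper names.

```python
import re

# Minimum number of repeats per motif length (1-6 bp); assumed thresholds.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit
    (e.g. 'AA' is rejected because it is 'A' repeated twice)."""
    n = len(motif)
    return not any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for k, min_rep in MIN_REPEATS.items():
        # A k-bp motif followed by at least (min_rep - 1) exact copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // k))
    return sorted(hits)


# Hypothetical toy sequence: 12 x A, then 7 x AG, separated by short spacers.
toy = "GGC" + "A" * 12 + "CGT" + "AG" * 7 + "TTC"
print(find_ssrs(toy))
```

The sketch only finds perfect repeats and reports 0-based start positions; compound or interrupted microsatellites would need additional logic.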
To annotate protein-coding genes in the C. crapnelliana genome, gene models were obtained by combining three approaches: ab initio gene prediction, homology-based prediction, and transcriptome-based prediction. The ab initio prediction was performed with AUGUSTUS v3.3.249, SNAP50 v2013-11-29, GeneMark-ES/ET51, and GlimmerHMM52 v3.02. For homology-based prediction, the Exonerate53 v2.2.0 program was used to search against the protein sequences of the Actinidia chinensis54, Arabidopsis thaliana55, Beta vulgaris56, C. oleifera26, DASZ14, C. sinensis var. assamica YK1011, Olea europaea57, C. chekiangoleosa58, C. lanceoleosa59, Vitis vinifera60,61, and Oryza sativa55 genomes. For transcriptome-based prediction, Trinity v2.15.162 was used to assemble transcripts from the RNA-seq data, and PASA63 v2.5.2 was employed for gene structure prediction based on the transcriptome assemblies. Additionally, HISAT2 v2.2.164 was employed to map the RNA-seq reads onto the genome, and StringTie65 v2.2.1 was used to generate transcript structures. The assembled transcripts were subsequently used for open reading frame (ORF) prediction with TransDecoder v5.5.0. All predicted gene structures were integrated into a consensus set with EVidenceModeler (EVM v2.0.0)66. Finally, 37,390 gene models were predicted after integrating the results of the three aforementioned methods.
For the functional annotation of protein-coding genes, we aligned the predicted protein-coding gene sequences against public functional databases, including Swiss-Prot68, NR69, KEGG, and KOG70, using BLAST v2.11.067 (e-value < 1e-5). Gene Ontology (GO) annotation was performed using InterProScan v5.55-88.071,72 (Supplementary Fig. S2). As a result, a total of 37,015 protein-coding genes were annotated for C. crapnelliana, accounting for ~99.00% of all predicted genes (Supplementary Table S10). The predicted gene models were comparable to those of fifteen other species in aspects such as gene number, average gene length, average CDS length, average exons per gene, average introns per gene, average exon length, and average intron length (Supplementary Table S11).
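The e-value cutoff and the annotated fraction reported above can be sketched as follows. This Python example assumes BLAST output in the standard 12-column tabular layout, with the e-value in the 11th column (the text does not state the output format used). The hit lines and the function name `annotated_queries` are hypothetical.

```python
def annotated_queries(tabular_lines, max_evalue=1e-5):
    """Collect query IDs with at least one hit whose e-value is below the cutoff.

    Assumes tab-separated lines in the 12-column tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    kept = set()
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) < max_evalue:
            kept.add(fields[0])
    return kept


# Hypothetical hits, for illustration only.
toy_hits = [
    "geneA\tsp|P1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350",
    "geneB\tsp|P2\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
    "geneC\tsp|P3\t60.0\t120\t48\t0\t1\t120\t1\t120\t2e-06\t90",
]
print(sorted(annotated_queries(toy_hits)))

# Annotated fraction from the counts reported in the text.
annotated_fraction = 37015 / 37390
print(f"{annotated_fraction:.2%}")
```

Taking the union of such query sets across all databases would give the total number of annotated genes.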
Genome synteny analysis and the detection of whole-genome duplication (WGD)
The WGD analyses were performed using all paralogous gene pairs. MAFFT v7.52073 was employed to conduct sequence alignment. The protein sequence alignment was converted into a codon alignment using PAL2NAL v14. Finally, the Ka and Ks values were obtained using yn00 v4.10.0 of PAML74 with the Nei-Gojobori (NG) method. Genes with Ks < 0.1 were excluded from further analyses (Supplementary Table S12)75. WGDI was adopted to mark the Ks values on the syntenic blocks with different colors. The PeaksFit (-pf), Kspeaks (-kp), and KsFigures (-kf) tools of WGDI were used to illustrate the Ks density. The C. crapnelliana genome exhibited two peaks in the Ks density plot (Fig. 3a,b). Our results indicated two polyploidization events in the C. crapnelliana genome: the ancient WGT (γ) event that occurred in grape and eudicots60,61, and a WGD (β) event shared with A. chinensis and other Theaceae species11,54,76 (Fig. 3a,b). We finally verified the occurrence of two WGD events in the C. crapnelliana genome by combining genomic synteny analysis and dot plots (Fig. 3c,d) of C. crapnelliana.
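The Ks filtering step and the idea of locating peaks in a Ks distribution can be illustrated with a deliberately crude Python sketch. It bins Ks values and reports local maxima, which is only a stand-in for the peak fitting done with WGDI. The bin width, the toy Ks values, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def ks_histogram(ks_values, min_ks=0.1, bin_width=0.1):
    """Drop pairs with Ks below min_ks, then count the rest per bin index."""
    kept = [ks for ks in ks_values if ks >= min_ks]
    # The small epsilon guards against floating-point error at bin edges.
    return Counter(int((ks + 1e-9) / bin_width) for ks in kept)


def local_peaks(counts):
    """Return bin indices whose count exceeds both neighbouring bins."""
    peaks = []
    for b, c in sorted(counts.items()):
        # Counter returns 0 for bins that have no entries.
        if c > counts[b - 1] and c > counts[b + 1]:
            peaks.append(b)
    return peaks


# Hypothetical Ks values for paralogous pairs, not taken from the study.
toy_ks = [0.02, 0.05, 0.35, 0.42, 0.44, 0.47, 0.55,
          1.15, 1.22, 1.25, 1.28, 1.36]

hist = ks_histogram(toy_ks)
peak_bins = local_peaks(hist)
print([round((b + 0.5) * 0.1, 2) for b in peak_bins])  # approximate bin centres
```

With real data the histogram is noisy, so a fitted density is preferable to raw local maxima; the sketch is meant only to show why two well-separated Ks modes are read as two duplication events.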
Data Records
The MGI short reads, PacBio HiFi long reads, Hi-C reads, genome assembly and annotation data were deposited in the NCBI SRA database under accession numbers SRR28825902-SRR2882590877,78,79,80,81,82,83 and in the National Genomics Data Center (NGDC)84, Beijing Institute of Genomics, Chinese Academy of Sciences/China National Center for Bioinformation, under BioProject accession number PRJCA02251685. The genome sequencing data were deposited in the Genome Sequence Archive (GSA) of NGDC under accession number CRA01427286. The genome assembly has been deposited in DDBJ/ENA/GenBank under accession number JBDORG00000000087. The genome assembly and annotation data were deposited in Genome Assembly Sequences and Annotations (GWH) of NGDC under accession number GWHERAW0000000088. The genome assembly and annotation were also deposited in the figshare database89.
Technical Validation
Assessment of the genome assembly
The completeness of the assembled genome was evaluated using BWA (v0.7.17)90 and Benchmarking Universal Single-Copy Orthologs (BUSCO, v5.4.4)91 with the embryophyta_odb10 lineage dataset. Approximately 99.67% of the Illumina short reads were aligned to the genome, of which ~93.97% were properly mapped. The BUSCO analysis showed that the assembled genome sequences contained 1,600 (~99.2%) complete BUSCOs, including 1,405 (~87.1%) single-copy BUSCOs and 195 (~12.1%) duplicated BUSCOs, as well as 8 (~0.5%) fragmented BUSCOs (Supplementary Table S8).
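The BUSCO categories reported above can be cross-checked with simple arithmetic. The Python sketch below assumes that the embryophyta_odb10 lineage set contains 1,614 BUSCO groups; that total is not stated in the text and is treated here as an assumption. The `percent` helper is hypothetical.

```python
TOTAL_BUSCOS = 1614  # assumed size of the embryophyta_odb10 set; not given in the text

# Counts reported in the text for the genome assembly.
counts = {"single": 1405, "duplicated": 195, "fragmented": 8}
counts["complete"] = counts["single"] + counts["duplicated"]
counts["missing"] = TOTAL_BUSCOS - counts["complete"] - counts["fragmented"]


def percent(n, total=TOTAL_BUSCOS):
    """Express a BUSCO count as a percentage of the lineage set."""
    return 100.0 * n / total


for name in ("complete", "single", "duplicated", "fragmented", "missing"):
    print(f"{name:>10}: {counts[name]:>5} ({percent(counts[name]):.1f}%)")
```

Under that assumption, a percentage computed directly from a count can differ in the last digit from the sum of separately rounded sub-category percentages.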
Assessment of the gene annotation
The annotated and integrated proteins were also evaluated using BUSCO v5.4.491 with the embryophyta_odb10 lineage dataset. Briefly, the proportion of complete core genes was ~96.2% (including ~87.3% single-copy genes and ~8.9% duplicated genes), and only a few genes were fragmented (~1.4%) or missing (~2.4%) (Supplementary Table S9), indicating high-quality annotation of the predicted gene models.
Genome synteny analysis and WGD detection
The Whole-Genome Duplication Integrated analysis tool (WGDI v0.6.5)92 was used for the detection of WGDs, intragenomic collinearity analysis, Ks estimation and peak fitting in C. crapnelliana (CCRA), C. sinensis var. assamica11 (CSA-YK10), C. sinensis var. sinensis12 (CSS-BY), Actinidia chinensis54 (ACH), and Vitis vinifera60,61 (VVI). JCVI93 v1.3.6 was further employed to draw the collinearity diagram across these species.
Code availability
All software and pipelines were executed according to the manuals and protocols of the published bioinformatic tools. All software used in this work is publicly available, with versions and parameters described in Methods. Where no detailed parameters are mentioned for a tool, the default parameters suggested by the developer were used. No custom code was used during this study for the curation and/or validation of the datasets.
References
Yang, C., Liu, X., Chen, Z., Lin, Y. & Wang, S. Comparison of oil content and fatty acid profile of ten new Camellia Oleifera cultivars. J. Lipids. 2016, 1–6 (2016).
Feng, J., Yang, Z., Chen, S., El-Kassaby, Y. A. & Chen, H. High throughput sequencing of small RNAs reveals dynamic microRNAs expression of lipid metabolism during Camellia Oleifera and C. Meiocarpa seed natural drying. BMC Genomics. 18 (2017).
Yu, J., Yan, H., Wu, Y., Wang, Y. & Xia, P. Quality evaluation of the oil of Camellia Spp. Foods. 11, 2221 (2022).
Chen, J., Guo, Y., Hu, X. & Zhou, K. Comparison of the chloroplast genome sequences of 13 oil-tea Camellia samples and identification of an undetermined oil-tea Camellia species from Hainan province. Front. Plant Sci. 12 (2022).
Ma, J., Ye, H., Rui, Y., Chen, G. & Zhang, N. Fatty acid composition of Camellia Oleifera oil. Journal Für Verbraucherschutz Und Lebensmittelsicherheit. 6, 9–12 (2011).
Bin, Z., Hai-yan, Z., Qing-ming, C. & Qi-zhi, L. Advance in research on bioactive compounds in Camellia Spp. Nonwood Forest Research. 28, 140–145 (2010).
Zhenghai, L. & Daoping, W. Chemical constituents of olive oil and from Camellia Oleifera seed oil. Journal of the Chinese Cereals and Oils Association. 23, 121–123 (2008).
Li, T. et al. Anticancer activity of a novel glycoprotein from Camellia Oleifera abel seeds against hepatic carcinoma in vitro and in vivo. Int. J. Biol. Macromol. 136, 284–295 (2019).
Likuo, F. & Jianming, J. China Plant Red Data Book: Rare and Endangered Plants (Science Press, Beijing, 1992).
Xiong, J. et al. Camellianols a–g, barrigenol-like triterpenoids with Ptp1B inhibitory effects from the endangered ornamental plant Camellia Crapnelliana. J. Nat. Prod. 80, 2874–2882 (2017).
Xia, E. et al. The tea tree genome provides insights into tea flavor and independent evolution of caffeine biosynthesis. Mol. Plant. 10, 866–877 (2017).
Zhang, Q. et al. The chromosome-level reference genome of tea tree unveils recent bursts of non-autonomous LTR retrotransposons in driving genome size evolution. Mol. Plant. 13, 935–938 (2020).
Zhang, X. et al. Haplotype-resolved genome assembly provides insights into evolutionary history of the tea plant Camellia Sinensis. Nat. Genet. 53, 1250–1259 (2021).
Zhang, W. et al. Genome assembly of wild tea tree DASZ reveals pedigree and selection history of tea varieties. Nat. Commun. 11 (2020).
Xia, E. et al. The reference genome of tea plant and resequencing of 81 diverse accessions provide insights into its genome evolution and adaptation. Mol. Plant. 13, 1013–1026 (2020).
Chen, J. et al. The chromosome-scale genome reveals the evolution and diversification after the recent tetraploidization event in tea plant. Hortic. Res. 7 (2020).
He, Z. et al. A chromosome-level genome assembly provides insights into cornus wilsoniana evolution, oil biosynthesis and floral bud development. Hortic. Res. (2023).
Yuan, J. et al. Genomic basis of the giga-chromosomes and giga-genome of tree peony Paeonia Ostii. Nat. Commun. 13, 7328 (2022).
Song, J. et al. Eight high-quality genomes reveal pan-genome architecture and ecotype differentiation of Brassica Napus. Nat. Plants. 6, 34–45 (2020).
Zhang, L. et al. Tung tree (Vernicia Fordii) genome provides a resource for understanding genome evolution and improved oil production. Genomics, Proteomics & Bioinformatics. 17, 558–575 (2019).
Unver, T. et al. Genome of wild olive and the evolution of oil biosynthesis. Proceedings of the National Academy of Sciences. 114, E9413–E9422 (2017).
Badouin, H. et al. The sunflower genome provides insights into oil metabolism, flowering and asterid evolution. Nature. 546, 148–152 (2017).
Chen, X. et al. Draft genome of the peanut a-genome progenitor (Arachis Duranensis) provides insights into geocarpy, oil biosynthesis, and allergens. Proceedings of the National Academy of Sciences. 113, 6785–6790 (2016).
Wang, L. et al. Genome sequencing of the high oil crop sesame provides insight into oil biosynthesis. Genome Biol. 15, R39 (2014).
Xia, E. H. et al. Transcriptome analysis of the oil-rich tea plant, Camellia Oleifera, reveals candidate genes related to lipid metabolism. Plos One. 9, e104150 (2014).
Lin, P. et al. The genome of oil-Camellia and population genomics analysis provide insights into seed oil domestication. Genome Biol. 23, 14 (2022).
Zhang, K. et al. The genome of Orychophragmus Violaceus provides genomic insights into the evolution of Brassicaceae Polyploidization and its distinct traits. Plant Commun. 4, 100431 (2023).
Huang, F. et al. Genome assembly of the brassicaceae diploid Orychophragmus Violaceus reveals complex whole-genome duplication and evolution of dihydroxy fatty acid metabolism. Plant Commun. 4, 100432 (2023).
Tang, S. et al. Genome- and transcriptome-wide association studies provide insights into the genetic basis of natural variation of seed oil content in Brassica Napus. Mol. Plant. 14, 470–487 (2021).
Tianling, L. & Hanren, L. Morphology of the somatic chromosomes of Camellia Crapnelliana. Acta Botanica Yunnanica. 8, 319–321 (1986).
Porebski, S., Bailey, L. G. & Baum, B. R. Modification of a CTAB DNA extraction protocol for plants containing high polysaccharide and polyphenol components. Plant Mol. Biol. Rep. 15, 8–15 (1997).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 27, 764–770 (2011).
Sun, H., Ding, J., Piednoël, M. & Schneeberger, K. Findgse: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics. 34, 550–557 (2018).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with Hifiasm. Nat. Methods. 18, 170–175 (2021).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes Aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 356, 92–95 (2017).
Dudchenko, O. et al. The Juicebox assembly tools module facilitates de novo assembly of mammalian genomes with chromosome-length scaffolds for under $1000. Cold Spring Harbor: Cold Spring Harbor Laboratory Press, 2018.
Benson, G. Tandem Repeats Finder: a program to analyze DNA sequences. Nucleic. Acids. Res. 27, 573–580 (1999).
Beier, S., Thiel, T., Münch, T., Scholz, U. & Mascher, M. Misa-Web: a web server for microsatellite prediction. Bioinformatics. 33, 2583–2585 (2017).
Flynn, J. M. et al. Repeatmodeler2 for automated genomic discovery of transposable element families. Proc. Natl. Acad. Sci. USA 117, 9451–9457 (2020).
Xu, Z. & Wang, H. LTR_Finder: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic. Acids. Res. 35, W265–W268 (2007).
Ellinghaus, D., Kurtz, S. & Willhoeft, U. Ltrharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 9, 18 (2008).
Ou, S. & Jiang, N. LTR_Retriever: a highly accurate and sensitive program for identification of long terminal repeat retrotransposons. Plant Physiol. 176, 1410–1422 (2018).
Tempel, S. Using and understanding Repeatmasker. Totowa, NJ: Humana Press, 2012:29-51.
Jurka, J. et al. Repbase update, a database of eukaryotic repetitive elements. Cytogenet. Genome Res. 110, 462–467 (2005).
Lowe, T. M. & Eddy, S. R. TRNAscan-Se: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic. Acids. Res. 25, 955–964 (1997).
Nawrocki, E. P. & Eddy, S. R. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 29, 2933–2935 (2013).
Griffiths-Jones, S. Rfam: annotating non-coding RNAs in complete genomes. Nucleic. Acids. Res. 33, D121–D124 (2004).
Stanke, M. et al. Augustus: ab initio prediction of alternative transcripts. Nucleic. Acids. Res. 34, W435–W439 (2006).
Johnson, A. D. et al. Snap: a web-based tool for identification and annotation of proxy SNPs using hapmap. Bioinformatics. 24, 2938–2939 (2008).
Lomsadze, A. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic. Acids. Res. 33, 6494–6506 (2005).
Majoros, W. H., Pertea, M. & Salzberg, S. L. Tigrscan and Glimmerhmm: two open source ab initio eukaryotic gene-finders. Bioinformatics. 20, 2878–2879 (2004).
Slater, G. S. C. & Birney, E. Automated Generation of Heuristics for Biological sequence comparison. BMC Bioinformatics. 6, 31 (2005).
Han, X. et al. Two haplotype-resolved, gap-free genome assemblies for Actinidia Latifolia and Actinidia Chinensis shed light on the regulatory mechanisms of vitamin c and sucrose metabolism in kiwifruit. Mol. Plant. 16, 452–470 (2023).
Goodstein, D. M. et al. Phytozome: a comparative platform for green plant genomics. Nucleic. Acids. Res. 40, D1178–D1186 (2012).
McGrath, J. M. et al. A contiguous de novo genome assembly of sugar beet el10 (Beta Vulgaris L.). DNA Res. 30 (2023).
Rao, G. et al. De novo assembly of a new Olea Europaea genome accession using Nanopore sequencing. Hortic. Res. 8 (2021).
Shen, T. et al. The reference genome of Camellia Chekiangoleosa provides insights into Camellia evolution and tea oil biosynthesis. Hortic. Res. 9 (2022).
Gong, W. et al. Chromosome-level genome of Camellia Lanceoleosa provides a valuable resource for understanding genome evolution and self‐incompatibility. The Plant Journal. 110, 881–898 (2022).
Shi, X. et al. The complete reference genome for grapevine (Vitis Vinifera L.) genetics and breeding. Hortic. Res. 10 (2023).
Magris, G. et al. The genomes of 204 Vitis Vinifera accessions reveal the origin of european wine grapes. Nat. Commun. 12 (2021).
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-seq data. Nat. Biotechnol. 29, 644–652 (2011).
Haas, B. J. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic. Acids. Res. 31, 5654–5666 (2003).
Kim, D., Langmead, B. & Salzberg, S. L. Hisat: a fast spliced aligner with low memory requirements. Nat. Methods. 12, 357–360 (2015).
Pertea, M., Kim, D., Pertea, G. M., Leek, J. T. & Salzberg, S. L. Transcript-level expression analysis of RNA-seq experiments with Hisat, Stringtie and Ballgown. Nat. Protoc. 11, 1650–1667 (2016).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using evidencemodeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).
Boratyn, G. M. et al. Blast: a more efficient report with usability improvements. Nucleic. Acids. Res. 41, W29–W33 (2013).
Bateman, A. et al. Uniprot: the universal protein knowledgebase in 2021. Nucleic. Acids. Res. 49, D480–D489 (2021).
Coordinators, N. R. Database resources of the national center for biotechnology information. Nucleic. Acids. Res. 44, D7–D19 (2016).
Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science. 278, 631–637 (1997).
Jones, P. et al. Interproscan 5: Genome-scale protein function classification. Bioinformatics. 30, 1236–1240 (2014).
Blum, M. et al. The Interpro protein families and domains database: 20 years on. Nucleic. Acids. Res. 49, D344–D354 (2021).
Katoh, K. & Standley, D. M. Mafft multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
Yang, Z. Paml 4: phylogenetic analysis by maximum likelihood. Mol. Biol. Evol. 24, 1586–1591 (2007).
Zhang, F. Chromosome-scale genome assembly of oil-tea tree Camellia crapnelliana. figshare. Dataset. https://doi.org/10.6084/m9.figshare.25680105.v1 (2024).
Wu, H. et al. A high-quality Actinidia Chinensis (kiwifruit) genome. Hortic. Res. 6, 117 (2019).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28825902 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28825903 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28825904 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28825905 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28825906 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28825907 (2024).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR28825908 (2024).
Xue, Y. et al. Database resources of the national genomics data center, china national center for bioinformation in 2023. Nucleic. Acids. Res. 51, D18–D28 (2023).
National Genomics Data Center (NGDC) BioProject https://ngdc.cncb.ac.cn/bioproject/browse/PRJCA022516 (2024).
National Genomics Data Center (NGDC) Genome Sequence Archive https://ngdc.cncb.ac.cn/search/all?&q=CRA014272 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc:JBDORG000000000 (2024).
NGDC Genome Warehouse, https://ngdc.cncb.ac.cn/search/all?q=GWHERAW00000000 (2024).
Zhang, F. Camellia crapnelliana genome assembly and annotation. figshare. Dataset. https://doi.org/10.6084/m9.figshare.25209830.v2 (2024).
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler Transform. Bioinformatics. 26, 589–595 (2010).
Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A. & Zdobnov, E. M. Busco update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
Sun, P. et al. Wgdi: a user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol. Plant. 15, 1841–1851 (2022).
Tang, H. et al. Synteny and collinearity in plant genomes. Science. 320, 486–488 (2008).
Acknowledgements
We thank Zi-ting Yu for her assistance in collecting samples.
Author information
Contributions
Li-zhi Gao conceived and designed the study and revised the manuscript. Fen Zhang executed data analysis and drafted the manuscript. Li-ying Feng and Pei-fan Lin collected samples and performed experiments. Ju-jin Jia contributed to data analyses. All authors read, edited, and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, F., Feng, Ly., Lin, Pf. et al. Chromosome-scale genome assembly of oil-tea tree Camellia crapnelliana. Sci Data 11, 599 (2024). https://doi.org/10.1038/s41597-024-03459-x
DOI: https://doi.org/10.1038/s41597-024-03459-x