Abstract
Ehretia macrophylla Wall, known as wild loquat, is an ecologically, economically, and medicinally significant tree species widely grown in China, Japan, Vietnam, and Nepal. In this study, we have successfully generated a haplotype-resolved chromosome-scale genome assembly of E. macrophylla by integrating PacBio HiFi long-reads, Illumina short-reads, and Hi-C data. The genome assembly consists of two haplotypes, with sizes of 1.82 Gb and 1.58 Gb respectively, and contig N50 lengths of 28.11 Mb and 21.57 Mb correspondingly. Additionally, 99.41% of the assembly was successfully anchored into 40 pseudo-chromosomes. We predicted 58,886 protein-coding genes, of which 99.60% were functionally annotated from databases. We furthermore detected 2.65 Gb repeat sequences, 659,290 rRNAs, 4,931 tRNAs and 4,688 other ncRNAs. The high-quality assembly of the genome offers a solid basis for furthering the fields of molecular breeding and functional genomics of E. macrophylla.
Background & Summary
Ehretia macrophylla Wall is a perennial shrub tree belonging to the genus Ehretia in the Boraginaceae family. It can reach 15 m in height and is widely distributed in the southwest, south, and east of China, as well as in certain regions of Japan, Vietnam, and Nepal1,2,3. E. macrophylla, also known as wild loquat in China, is a rare tree with diverse applications, including ecological, gardening, ornamental, and medicinal value. To date, no species within the genus Ehretia has had its genome completely sequenced. Despite the multifarious applications of E. macrophylla, genetic studies of the species are impeded by the absence of a high-quality reference genome sequence.
E. macrophylla is an excellent tree species for urban greening and as a border tree, especially where dust retention is necessary, owing to its tall trunk, strong dust absorption ability, and resistance to pests and diseases2. The foliage of E. macrophylla serves as both a potential food source and a medicinal resource4. It is used to activate the meridians, treat rheumatism, dispel wind and dampness, and relieve joint pain. The bark of E. macrophylla is used to dissipate blood stasis and swelling, making it suitable for treating fall injuries3. The fruit of E. macrophylla serves as a functional food supplement, consumed as a traditional fruit and utilized in herbal tea. It can help soothe the throat and alleviate coughs, and is commonly used to treat diseases such as bronchitis, acute and chronic pharyngitis, cough, and asthma2,5. As a prominent species within the genus Ehretia, E. macrophylla owes its diverse range of applications to the abundance of bioactive compounds in its fruit and other tissues. These bioactive substances exhibit remarkable antioxidant, antitumor, anti-inflammatory, antiviral, and antibacterial properties. Key compounds found in E. macrophylla include quercetin, flavonoids, kaempferol, rosmarinate, caffeic acid, and pectin polysaccharide2,4,5.
High-quality genomes are of profound significance for in-depth research, rational development, and adequate protection of plants. Here, we present a high-quality genome assembly of E. macrophylla produced with an integrated approach that combines PacBio HiFi long-read sequencing, short-read Illumina sequencing, and Hi-C sequencing. The assembled genome (~3.40 Gb) comprises haplotype a (1.82 Gb) and haplotype b (1.58 Gb), with contig N50 lengths of 28.11 Mb and 21.57 Mb, respectively. The assembled scaffolds were anchored to 40 pseudochromosomes with an anchoring rate of 99.41%. We predicted a total of 58,886 protein-coding genes, with 29,805 for haplotype a and 29,081 for haplotype b. Among these genes, 99.60% were functionally annotated. In addition, we identified 2.65 Gb of repeat sequences (1.44 Gb for haplotype a and 1.21 Gb for haplotype b), and annotated a total of 668,909 non-coding RNA genes, including 659,290 rRNA genes (415,016 for haplotype a and 244,274 for haplotype b), 4,931 tRNA genes (2,522 for haplotype a and 2,409 for haplotype b) and 4,688 other ncRNA genes (2,428 for haplotype a and 2,260 for haplotype b). Our data will serve as a valuable genetic resource for revealing the genetic mechanisms behind the special properties of E. macrophylla, conducting evolutionary studies of the genus Ehretia and the family Boraginaceae, and supporting molecular breeding of E. macrophylla.
Methods
Plant materials, library construction, and genome size estimation
Fresh leaf tissue for genome and RNA sequencing was sampled in 2022 from a mature E. macrophylla individual growing in Luoyang, Henan Province, China (34.663041 N, 112.434468 E) (Fig. 1a). Superior-quality genomic DNA was isolated using the Plant Genomic DNA Kit (Tiangen, China). The concentration and purity of the genomic DNA were assessed using a NanoDrop 8000 spectrophotometer (Thermo Fisher Scientific, USA). Total RNA was extracted from E. macrophylla samples utilizing TRIzol reagent. Subsequently, RNase-free DNase I was employed to treat the isolated RNA, followed by elution with RNase-free water. RNA integrity was measured using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA).
The DNA that met the quality requirements was used to construct a genome library with the Pacific Biosciences SMRTbell Express Template Prep Kit. A 20-kb insert library was processed using a BluePippin system. Sequencing was carried out on the Pacific Biosciences Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA). We obtained ~152.18 Gb of PacBio HiFi raw data (~84×) with an average read length of 16.05 kb (Table 1). Illumina sequencing was performed on the HiSeq X Ten platform (Illumina) in 150-bp paired-end mode, yielding approximately 54.03 Gb of raw data (~30×). The Hi-C libraries were constructed, enriched, and sheared according to methods described previously6,7. Hi-C sequencing was conducted on the Illumina HiSeq X Ten platform, and approximately 208.36 Gb (~114×) of raw Hi-C data were acquired. For RNA sequencing, a cDNA library was constructed using an RNA Library Prep Kit (NEB, UK). Approximately 9.66 Gb of raw data were obtained from the HiSeq X Ten platform (Illumina).
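The approximate depths quoted above follow from dividing each data volume by a haploid genome size. The following is a minimal sketch of that arithmetic, assuming the 1.82 Gb haplotype a assembly size as the divisor (the text does not state which genome size was used; the function and variable names are illustrative only).

```python
# Sequencing depth = total bases sequenced / haploid genome size.
# ASSUMPTION: 1.82 Gb (the haplotype a assembly size) is the divisor;
# the paper does not state which genome size underlies its depth values.
HAPLOID_GENOME_GB = 1.82

# Raw data volumes reported in the text, in gigabases.
RAW_DATA_GB = {
    "PacBio HiFi": 152.18,
    "Illumina": 54.03,
    "Hi-C": 208.36,
}


def depth(total_gb: float, genome_gb: float = HAPLOID_GENOME_GB) -> float:
    """Return the fold coverage for a data set of total_gb gigabases."""
    return total_gb / genome_gb


if __name__ == "__main__":
    for library, gb in RAW_DATA_GB.items():
        print(f"{library}: ~{depth(gb):.0f}x")
```

Under this assumption the rounded values agree with the reported ~84×, ~30×, and ~114×.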
Genome survey and assembly
Before assembly, adaptor sequences, low-quality regions, and overly short sequences were removed using fastp v0.19.3 (ref. 8). Jellyfish v2.3.0 (ref. 9) was employed to determine the k-mer depth frequency distribution of the clean data with a k-mer size of 17, and GenomeScope v2.0 (ref. 10) was used to estimate the genome size. The estimated haploid genome size of E. macrophylla is approximately 1.84 Gb (Fig. 2). HiFi reads and Hi-C short reads were used together as input for the genome assembler Hifiasm v0.16.1 (ref. 11). The assembly, conducted in Hi-C mode with default settings, generated two sets of contigs representing haplotype a and haplotype b, respectively. For chromosome assembly, we first aligned the Hi-C reads to the assembly using Juicer v1.6 software12. Next, the draft genome assembly was scaffolded using 3D-DNA13 with Hi-C reads. Then, we manually adjusted the chromosome construction using the Juicebox tool14, which involved removing incorrect insertions and adjusting orientations to correct visible errors as far as possible. To further optimize the genome assembly, three rounds of correction were performed using Illumina reads with NextPolish v1.4.0 (ref. 15), and redundant sequences were removed using Redundans v0.14a2716. In total, approximately 99.41% of the assembled data was anchored onto 40 pseudochromosomes across the two haplotypes (Supplementary Table 1). Finally, we obtained a high-quality haplotype-resolved chromosome-level genome of E. macrophylla (Fig. 1b, Fig. 3). The assembly (~3.40 Gb) comprised two haplotypes, namely haplotype a and haplotype b, with respective genome sizes of 1.82 Gb and 1.58 Gb (Table 2). Since the genome assembly was haplotype-resolved and lacked parental information for subgenome phasing, we designated the longer chromosome of each homologous pair as haplotype a and the other as haplotype b.
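The k-mer-based size estimate in the genome survey rests on a simple relationship: genome size is roughly the total number of k-mer occurrences divided by the depth of the main peak. The sketch below illustrates only that relationship on a toy histogram. It is not GenomeScope's model, which fits a mixture distribution and accounts for heterozygosity and repeats; the function name, the error cutoff, and the histogram values are all hypothetical.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> float:
    """Rough genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an illustrative assumption).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Depth of the tallest remaining peak.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences among the retained k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical values): an error tail at depths 1-2,
# a main peak at depth 20, and a few repeat k-mers at depth 40.
TOY_HISTOGRAM = {1: 900, 2: 300, 18: 40, 19: 70, 20: 100, 21: 70, 22: 40, 40: 10}

if __name__ == "__main__":
    print(estimate_genome_size(TOY_HISTOGRAM))
```

In a heterozygous genome the tallest peak can be the heterozygous rather than the homozygous one, which is one reason a model-based tool is preferable to this shortcut.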
The contig N50 and the scaffold N50 lengths for haplotype a were 28.11 Mb and 92.55 Mb, respectively, whereas for haplotype b, they were 21.57 Mb and 83.31 Mb, respectively. A total of 307 gaps were identified in the current genome assembly (Table 2). Utilizing PacBio HiFi reads, the LR_Gapcloser17 software was employed for gap filling, with two iterations executed. Furthermore, we assembled a chloroplast genome with a length of 156,639 bp and a mitochondrial genome with a length of 702,890 bp using GetOrganelle v1.7.5.0 (ref. 18).
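For readers unfamiliar with the N50 statistic reported above, it is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly. A minimal sketch of the standard definition follows; the example lengths are made up and are not taken from this assembly.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first sequence that pushes the running sum to half the
        # total assembly length defines the N50.
        if running >= half_total:
            return length
    raise ValueError("lengths must not be empty")


if __name__ == "__main__":
    # Hypothetical contig lengths (arbitrary units).
    toy_contigs = [80, 70, 50, 40, 30, 20]
    print(n50(toy_contigs))
```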
Genomic repeat annotation
To annotate the repeat sequences in the E. macrophylla genome, a transposable element (TE) library was first constructed by running the Extensive de novo TE Annotator (EDTA) pipeline to identify TEs from scratch. The parameters used were --sensitive 1 --anno 1 (ref. 19). Then, we used RepeatMasker v4.1.3 (http://www.repeatmasker.org/RepeatMasker/) to mask repeats with the repeat library acquired from the Repbase database (https://www.girinst.org/repbase/). For E. macrophylla haplotype a, a total of 2,751,291 repetitive sequences, constituting approximately 79.18% of the genome, were identified with a cumulative length of 1.44 Gb. Among them, long terminal repeats (LTRs) were the main repeats, totaling 851,702, with a size of 790.85 Mb, accounting for 43.48% of the assembled genome. This was followed by DNA transposable elements (TIRs) at 29.36%. The sizes of the copia- and gypsy-like LTRs were 109.80 Mb and 351.66 Mb, respectively, which accounted for 6.04% and 19.33% of haplotype a (Table 3). For E. macrophylla haplotype b, a total of 2,258,809 repetitive sequences (76.29% of the genome) were identified with a cumulative length of 1.21 Gb. Of these, the primary repetitive elements were also LTRs, which numbered 788,470 and occupied a total size of 713.16 Mb, representing 45.13% of the assembled genome. This was followed by TIRs, accounting for 24.32%. The copia- and gypsy-like LTRs had sizes of 109.44 Mb and 314.71 Mb, respectively, making up 6.93% and 19.92% of haplotype b (Table 3).
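The repeat sizes and percentages above can be cross-checked against the haplotype assembly sizes, since each percentage is a repeat length divided by the haplotype length. A small sketch of that check, using only figures quoted in the text (the dictionary layout and function name are illustrative):

```python
# (size in Mb, percent of haplotype) pairs quoted in the text.
REPEATS = {
    "haplotype a": {"LTR": (790.85, 43.48), "copia": (109.80, 6.04), "gypsy": (351.66, 19.33)},
    "haplotype b": {"LTR": (713.16, 45.13), "copia": (109.44, 6.93), "gypsy": (314.71, 19.92)},
}


def implied_assembly_mb(size_mb: float, percent: float) -> float:
    """Assembly length implied by a repeat class of size_mb making up percent."""
    return size_mb / (percent / 100)


if __name__ == "__main__":
    for haplotype, classes in REPEATS.items():
        for name, (size_mb, percent) in classes.items():
            print(f"{haplotype} {name}: ~{implied_assembly_mb(size_mb, percent):.0f} Mb")
```

Each class implies an assembly length close to the reported 1.82 Gb (haplotype a) or 1.58 Gb (haplotype b), so the sizes and percentages are mutually consistent within rounding.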
Gene identification and functional annotations
To annotate high-quality protein-coding genes, a comprehensive approach encompassing homology-based, de novo, and transcriptome-based predictions was employed. A total of 319,767 non-redundant protein sequences from related species (Echium plantagineum20, Solanum lycopersicum21, Coffea canephora22, Eucommia ulmoides23, Tectona grandis24, Daucus carota25, Nyssa sinensis26, Rhododendron simsii27, Lonicera japonica28, Lactuca saligna29, Vitis vinifera30, and Arabidopsis thaliana31) were gathered as evidence for protein homology using Exonerate v2.4.0 (ref. 32). The RNA-seq data were aligned to the genome sequences using HISAT2 v2.2.0 (ref. 19) with default parameters, followed by assembly of the aligned reads using StringTie 2 v2.1.2 (ref. 33). Subsequently, all splicing variations were identified and classified through alignment of full-length transcripts using the PASA v2.3.3 (ref. 34) pipeline. All complete gene structures predicted by the PASA v2.3.3 pipeline were used to train a model with AUGUSTUS v3.3.3 (ref. 35), employing default parameters.
In addition, the putative protein-coding gene structure was predicted using MAKER2 (ref. 36). The ab initio predictions of gene structure were conducted using AUGUSTUS v3.3. We aligned the transcript evidence with the genome using BLAST+37 and finally optimized it with Exonerate v2.4.0 (ref. 32). To increase the accuracy of the annotation, we integrated and updated the gene prediction results using EVidenceModeler (EVM)38 and PASA. In total, we annotated 29,805 protein-coding genes in E. macrophylla haplotype a with an average length of 4,956.40 bp. Among them, there are a total of 36,131 coding DNA sequences (CDSs), 200,786 exons, and 164,655 introns. The average lengths were 1,243.30 bp for CDSs, 281 bp for exons and 803 bp for introns (Table 4). Additionally, we identified 29,081 protein-coding genes in haplotype b with an average length of 5,199.10 bp. A total of 34,686 CDSs, 191,925 exons, and 157,239 introns were detected, with average lengths of 1,248.6 bp, 279.1 bp and 854.6 bp, respectively (Table 4).
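The exon, intron, and CDS counts above are linked by a simple identity: a transcript with n exons has n − 1 introns, so the number of introns equals the number of exons minus the number of transcripts. The sketch below checks that identity on the reported figures, assuming the CDS counts correspond to the number of transcript models (the text does not state this explicitly; the names are illustrative).

```python
# Counts quoted in the text. ASSUMPTION: "CDS" counts are treated as the
# number of transcript models, which the paper does not state explicitly.
GENE_MODELS = {
    "haplotype a": {"transcripts": 36_131, "exons": 200_786, "introns": 164_655},
    "haplotype b": {"transcripts": 34_686, "exons": 191_925, "introns": 157_239},
}


def expected_introns(transcripts: int, exons: int) -> int:
    """Each transcript with n exons contributes n - 1 introns."""
    return exons - transcripts


if __name__ == "__main__":
    for haplotype, counts in GENE_MODELS.items():
        predicted = expected_introns(counts["transcripts"], counts["exons"])
        exons_per_transcript = counts["exons"] / counts["transcripts"]
        print(
            f"{haplotype}: expected introns = {predicted}, "
            f"reported = {counts['introns']}, "
            f"mean exons per transcript = {exons_per_transcript:.2f}"
        )
```

Under this assumption the reported intron counts match the identity exactly for both haplotypes.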
Functional annotation of protein-coding genes was carried out using three strategies. First, we mapped gene sequences against the eggNOG 5.0 (ref. 39) database using eggNOG-mapper v2.16 (ref. 40), and annotated 97.94% of the genes. Of these, 48.80% and 47.94% were annotated with Gene Ontology (GO, http://geneontology.org/) and Kyoto Encyclopedia of Genes and Genomes (KEGG, https://www.genome.jp/kegg) terms, respectively. Second, 98.40% of genes were annotated using DIAMOND v2.0.12 (ref. 41) against four protein databases: Swiss_Prot42 (78.96%), TrEMBL42 (98.39%), NR43 (98.23%), and Arabidopsis thaliana genes (91.53%). Finally, InterProScan v5.5.2-86.0 (ref. 44) was used to annotate 98.74% of the genes against 14 databases (Table 5).
For the annotation of non-coding RNA genes, we detected a total of 415,016 rRNA genes, 2,522 tRNA genes, and 2,428 other ncRNA genes in haplotype a using Barrnap (https://github.com/tseemann/barrnap), tRNAscan-SE45, and Rfam46, respectively. In haplotype b, a total of 244,274 rRNA genes, 2,409 tRNA genes, and 2,260 other ncRNA genes were detected (Table 6).
Genome comparison between haplotype assemblies
The haplotype alignments were conducted using minimap2 (ref. 47), while the identification of syntenic regions and structural variations was performed using SyRI v1.6 (ref. 48). The structural rearrangements identified between the haplotype genomes were visualized using Plotsr v0.5.4 (ref. 49) (Fig. 4). Chr 01, 02, and 04 to 10 exhibit more structural variation than the other chromosomes (Fig. 4a). A total of 13,045 syntenic regions (~953 Mbp) were detected, indicating high similarity between the two haplotypes (Fig. 4b). Numerous variations were also detected, including minor insertions/deletions and SNPs (Fig. 4c,d); two relatively large inversions were found on chr07 and chr10, respectively (Fig. 4a). We compared the dot plot of syntenic blocks generated using minimap2 and found that the two haplotypes were very similar, with essentially the same chromosome order (Fig. 5).
Data Records
The sequencing data for this study have been uploaded to the NCBI database under BioProject number PRJNA945189. The genomic PacBio sequencing data can be found in the NCBI Sequence Read Archive (SRA) under accession numbers SRR23907027 (ref. 50), SRR23907028 (ref. 51), SRR23907029 (ref. 52), and SRR23907030 (ref. 53). The Hi-C sequencing data are available in the SRA under accession numbers SRR23907031 (ref. 54) and SRR23907036 (ref. 55). The genomic Illumina sequencing data are available under accession numbers SRR23907047 (ref. 56) and SRR23907058 (ref. 57). The final genome assemblies were deposited in GenBank under accession numbers GCA_037974685.1 (ref. 58) and GCA_037974665.1 (ref. 59). In addition, the final chromosome assembly and annotation data were deposited in the Genome Warehouse (GWH) of the National Genomics Data Center (NGDC) under accession number GWHEQHN00000000 (ref. 60) and BioProject number PRJCA021125.
Technical Validation
To evaluate the completeness and accuracy of the genome, we employed BWA61, minimap2 (ref. 47), and HISAT2 (ref. 19) to align Illumina reads, HiFi reads, and RNA-Seq reads to our reference genome, respectively. In addition, BUSCO v5.2.2 (ref. 62) was used to evaluate the genome completeness using the embryophyta_odb10 and eukaryota_odb10 databases. The genomic completeness of the two haplotypes was found to be satisfactory, with proportions of complete BUSCOs (including both single-copy and multi-copy) of 98.1% and 97.1% of the expected embryophyta genes, respectively (Table 7). The E. macrophylla genome size was evaluated using k-mer analysis (Fig. 2). After filtering out non-primary alignments, we calculated the mapping ratio and coverage percentage, and found that the genome coverage from the sequencing data is relatively high (Table 8). We conducted additional quality control analysis on the genome assembly using Merqury63 (at K = 16) based on PacBio HiFi reads (Fig. 6, Table 9). The consensus quality values (QVs) of the separate haplotypes a and b, as well as their shared genome, were 34.98, 34.74, and 34.87, respectively. The k-mer completeness scores of haplotypes a and b, along with their shared genome, were approximately 82.08%, 81.07%, and 94.46%, respectively. Further BUSCO analysis showed that the single-copy and multi-copy genes have approximately the same depth, indicating that the assembly had no redundancy (Fig. 7).
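To put the consensus QVs above in more familiar terms, a Phred-scaled QV converts to a per-base error probability as 10^(−QV/10). The sketch below shows that conversion together with the k-mer-based QV definition given in the Merqury paper. The k-mer counts used in the round-trip example are hypothetical, and this is not Merqury's code.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def kmer_qv(asm_only_kmers: int, total_asm_kmers: int, k: int) -> float:
    """Consensus QV from k-mer counts, following the definition in the Merqury paper.

    asm_only_kmers: k-mers found in the assembly but not in the reads.
    total_asm_kmers: all k-mers found in the assembly.
    """
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers, so no detectable errors
    # Probability that a base is correct, from the fraction of supported k-mers.
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)


if __name__ == "__main__":
    for label, qv in [("haplotype a", 34.98), ("haplotype b", 34.74), ("combined", 34.87)]:
        print(f"{label}: QV {qv} -> error rate {qv_to_error_rate(qv):.2e}")
    # Hypothetical counts, for illustration only.
    print(f"toy k-mer QV: {kmer_qv(5_000, 1_000_000, k=16):.2f}")
```

A QV of about 35 therefore corresponds to roughly three errors per 10,000 bases.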
To evaluate the single-base error rate and heterozygosity, next-generation reads were mapped to the genome using BWA, and the variant loci were detected using BCFtools v1.11 (ref. 64). Heterozygous sites were used to compute the heterozygosity rate, whereas homozygous sites were used to determine the error rate. We found that the heterozygosity rate was approximately 0.19%, and the error rate was approximately 0.012%. By analyzing the coverage depth and GC content distribution of the second- and third-generation data, we found that the second-generation data had a significant guanine-cytosine (GC) bias (Fig. 8). Juicer12 was used to map the Hi-C data to the final genome assembly. The chromosome clustering was normal, with no obvious chromosome assembly errors, but there were abnormal signals in some regions (Fig. 3). The chromatin interaction data from the Hi-C map revealed that only low-level interactions occurred between pseudochromosomes, supporting the quality and reliability of our chromosome-level anchoring (Supplementary Table 1).
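The logic of the heterozygosity and error-rate estimate above is that, when reads are mapped back to the assembly built from them, heterozygous calls reflect genuine allelic differences while homozygous non-reference calls reflect bases where the assembly disagrees with the reads. A minimal sketch of that bookkeeping on VCF-style genotype strings follows; the genotype list and the number of callable bases are toy values, and the function names are illustrative rather than part of any tool.

```python
def classify_genotype(gt: str) -> str:
    """Classify a VCF GT string such as '0/1', '1|1' or './.'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return "missing"
    if len(set(alleles)) > 1:
        return "het"
    return "hom_ref" if alleles[0] == "0" else "hom_alt"


def heterozygosity_and_error_rate(genotypes: list[str], callable_bases: int) -> tuple[float, float]:
    """Return (heterozygosity %, error rate %) over callable_bases positions."""
    het = sum(classify_genotype(gt) == "het" for gt in genotypes)
    hom_alt = sum(classify_genotype(gt) == "hom_alt" for gt in genotypes)
    return 100 * het / callable_bases, 100 * hom_alt / callable_bases


# Toy data (hypothetical): 19 heterozygous calls, 2 homozygous
# non-reference calls and 1 missing call over 10,000 callable bases.
TOY_GENOTYPES = ["0/1"] * 19 + ["1/1"] * 2 + ["./."]
TOY_CALLABLE_BASES = 10_000

if __name__ == "__main__":
    het_pct, err_pct = heterozygosity_and_error_rate(TOY_GENOTYPES, TOY_CALLABLE_BASES)
    print(f"heterozygosity: {het_pct:.2f}%  error rate: {err_pct:.3f}%")
```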
The chromosomal locations of specific characteristic sequences, such as telomeres, rDNA, and tandem repeats, were determined by mapping repetitive sequences onto the genome. The majority of chromosome telomere sequences were completely assembled; however, a few were partial or missing. We detected a highly abundant tandem repeat on the chromosomes (Supplementary txt 1). This sequence contains 5S rDNA, and its distribution is essentially consistent, suggesting that this sequence represents 5S rDNA and its adjacent regions. In addition, the 18S-5.8S-28S rDNA and 5S rDNA arrays are very abundant and widely distributed (Supplementary Fig. 1).
BUSCO v5.2.2 (ref. 62) was employed to assess the annotated and integrated proteins using the embryophyta_odb10 and eukaryota_odb10 databases. The proportion of complete core genes was 96.4% (Table 7), which included 7.1% single-copy genes and 89.3% duplicated genes. Only 0.9% fragmented and 2.7% missing genes were detected, indicating that the genome annotation is of high quality.
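The BUSCO categories above are related by two identities: complete equals single-copy plus duplicated, and complete, fragmented, and missing sum to 100%. The sketch below parses a one-line summary in the style BUSCO prints and checks both. The summary string is assembled here from the percentages in the text, not copied from an actual BUSCO output file, and the parser is illustrative.

```python
import re

# Built from the percentages reported in the text (not a real BUSCO output line).
SUMMARY = "C:96.4%[S:7.1%,D:89.3%],F:0.9%,M:2.7%"


def parse_busco_summary(summary: str) -> dict[str, float]:
    """Extract the C/S/D/F/M percentages from a BUSCO-style one-line summary."""
    return {key: float(value) for key, value in re.findall(r"([CSDFM]):([\d.]+)%", summary)}


if __name__ == "__main__":
    scores = parse_busco_summary(SUMMARY)
    print(scores)
    print("S + D =", round(scores["S"] + scores["D"], 1))
    print("C + F + M =", round(scores["C"] + scores["F"] + scores["M"], 1))
```

The high duplicated fraction is expected here because the protein set combines both haplotypes, so most core genes are present in two copies.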
Code availability
All software and pipelines were executed in accordance with the manual and protocols of the published bioinformatics tools, adhering to the specified versions and meticulously documenting the code/parameters used, as elaborated in the Methods section.
References
Gottschling, M., Mai, D. H. & Hilger, H. H. The systematic position of Ehretia fossils (Ehretiaceae, Boraginales) from the European Tertiary and implications for character evolution. Review of Palaeobotany and Palynology 121, 149–156, https://doi.org/10.1016/S0034-6667(01)00147-6 (2002).
Deng, N., Zheng, B., Li, T., Hu, X. & Liu, R. H. Phenolic profiles, antioxidant, antiproliferative, and hypoglycemic activities of Ehretia macrophyla Wall. (EMW) fruit. J Food Sci 85, 2177–2185, https://doi.org/10.1111/1750-3841.15185 (2020).
Xu, X., Cheng, Y., Tong, L., Tian, L. & Xia, C. The complete chloroplast genome sequence of Ehretia dicksonii Hance (Ehretiaceae). Mitochondrial DNA B Resour 7, 661–662, https://doi.org/10.1080/23802359.2022.2061873 (2022).
Dong, M., Oda, Y. & Hirota, M. (10E,12Z,15Z)-9-hydroxy-10,12,15-octadecatrienoic acid methyl ester as an anti-inflammatory compound from Ehretia dicksonii. Biosci Biotechnol Biochem 64, 882–886, https://doi.org/10.1271/bbb.64.882 (2000).
Xu, D. et al. Potential prebiotic functions of a characterised Ehretia macrophylla Wall. fruit polysaccharide. Int J Food Sci Tech 57, 35–47, https://doi.org/10.1111/ijfs.15005 (2022).
Wang, C. et al. Genome-wide analysis of local chromatin packing in Arabidopsis thaliana. Genome Res 25, 246–256, https://doi.org/10.1101/gr.170332.113 (2015).
Niu, S. et al. The Chinese pine genome and methylome unveil key features of conifer evolution. Cell 185, 204–217 e214, https://doi.org/10.1016/j.cell.2021.12.006 (2022).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890, https://doi.org/10.1093/bioinformatics/bty560 (2018).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770, https://doi.org/10.1093/bioinformatics/btr011 (2011).
Ranallo-Benavidez, T. R., Jaron, K. S. & Schatz, M. C. GenomeScope 2.0 and smudgeplot for reference-free profiling of polyploid genomes. Nat Commun 11, 1432, https://doi.org/10.1038/s41467-020-14998-3 (2020).
Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods 18, 170–175, https://doi.org/10.1038/s41592-020-01056-5 (2021).
Durand, N. C. et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst 3, 95–98, https://doi.org/10.1016/j.cels.2016.07.002 (2016).
Dudchenko, O. et al. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95, https://doi.org/10.1126/science.aal3327 (2017).
Durand, N. C. et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell Syst 3, 99–101, https://doi.org/10.1016/j.cels.2015.07.012 (2016).
Hu, J., Fan, J., Sun, Z. & Liu, S. NextPolish: a fast and efficient genome polishing tool for long-read assembly. Bioinformatics 36, 2253–2255, https://doi.org/10.1093/bioinformatics/btz891 (2020).
Pryszcz, L. P. & Gabaldon, T. Redundans: an assembly pipeline for highly heterozygous genomes. Nucleic Acids Res 44, e113, https://doi.org/10.1093/nar/gkw294 (2016).
Xu, G. C. et al. LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly. Gigascience 8, https://doi.org/10.1093/gigascience/giy157 (2019).
Jin, J. J. et al. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol 21, 241, https://doi.org/10.1186/s13059-020-02154-5 (2020).
Kim, D., Paggi, J. M., Park, C., Bennett, C. & Salzberg, S. L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol 37, 907–915, https://doi.org/10.1038/s41587-019-0201-4 (2019).
Tang, C. Y., Li, S., Wang, Y. T. & Wang, X. Comparative genome/transcriptome analysis probes Boraginales’ phylogenetic position, WGDs in Boraginales, and key enzyme genes in the alkannin/shikonin core pathway. Mol Ecol Resour 20, 228–241, https://doi.org/10.1111/1755-0998.13104 (2020).
Hosmani, P. S. et al. An improved de novo assembly and annotation of the tomato reference genome using single-molecule sequencing, Hi-C proximity ligation and optical maps. bioRxiv, 767764, https://doi.org/10.1101/767764 (2019).
Denoeud, F. et al. The coffee genome provides insight into the convergent evolution of caffeine biosynthesis. Science 345, 1181–1184, https://doi.org/10.1126/science.1255274 (2014).
Li, Y. et al. High-quality de novo assembly of the Eucommia ulmoides haploid genome provides new insights into evolution and rubber biosynthesis. Hortic Res-England 7, https://doi.org/10.1038/s41438-020-00406-w (2020).
Zhao, D. et al. A chromosomal-scale genome assembly of Tectona grandis reveals the importance of tandem gene duplication and enables discovery of genes in natural product biosynthetic pathways. Gigascience 8, https://doi.org/10.1093/gigascience/giz005 (2019).
Iorizzo, M. et al. A high-quality carrot genome assembly provides new insights into carotenoid accumulation and asterid genome evolution. Nature Genetics 48, 657–666, https://doi.org/10.1038/ng.3565 (2016).
Yang, X. et al. A chromosome-level genome assembly of the Chinese tupelo Nyssa sinensis. Sci Data 6, 282, https://doi.org/10.1038/s41597-019-0296-y (2019).
Yang, F. S. et al. Chromosome-level genome assembly of a parent species of widely cultivated azaleas. Nat Commun 11, 5269, https://doi.org/10.1038/s41467-020-18771-4 (2020).
Pu, X. D. et al. The honeysuckle genome provides insight into the molecular mechanism of carotenoid metabolism underlying dynamic flower coloration. New Phytologist 227, 930–943, https://doi.org/10.1111/nph.16552 (2020).
Reyes-Chin-Wo, S. et al. Genome assembly with in vitro proximity ligation data and whole-genome triplication in lettuce. Nat Commun 8, 14953, https://doi.org/10.1038/ncomms14953 (2017).
Jaillon, O. et al. The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature 449, 463–467, https://doi.org/10.1038/nature06148 (2007).
Cheng, C. Y. et al. Araport11: a complete reannotation of the Arabidopsis thaliana reference genome. Plant J 89, 789–804, https://doi.org/10.1111/tpj.13415 (2017).
Slater, G. S. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31, https://doi.org/10.1186/1471-2105-6-31 (2005).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295, https://doi.org/10.1038/nbt.3122 (2015).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31, 5654–5666, https://doi.org/10.1093/nar/gkg770 (2003).
Stanke, M., Diekhans, M., Baertsch, R. & Haussler, D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644, https://doi.org/10.1093/bioinformatics/btn013 (2008).
Cantarel, B. L. et al. MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18, 188–196, https://doi.org/10.1101/gr.6743907 (2008).
Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421, https://doi.org/10.1186/1471-2105-10-421 (2009).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biology 9 (2008).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res 47, D309–D314, https://doi.org/10.1093/nar/gky1085 (2019).
Huerta-Cepas, J. et al. Fast Genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Molecular Biology and Evolution 34, 2115–2122, https://doi.org/10.1093/molbev/msx148 (2017).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Methods 12, 59–60, https://doi.org/10.1038/nmeth.3176 (2015).
UniProt, C. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49, D480–D489, https://doi.org/10.1093/nar/gkaa1100 (2021).
Sayers, E. W. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res 50, D20–D26, https://doi.org/10.1093/nar/gkab1112 (2022).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240, https://doi.org/10.1093/bioinformatics/btu031 (2014).
Chan, P. P., Lin, B. Y., Mak, A. J. & Lowe, T. M. tRNAscan-SE 2.0: improved detection and functional classification of transfer RNA genes. Nucleic Acids Res 49, 9077–9096, https://doi.org/10.1093/nar/gkab688 (2021).
Kalvari, I. et al. Rfam 14: expanded coverage of metagenomic, viral and microRNA families. Nucleic Acids Res 49, D192–D200, https://doi.org/10.1093/nar/gkaa1047 (2021).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100, https://doi.org/10.1093/bioinformatics/bty191 (2018).
Goel, M., Sun, H. Q., Jiao, W. B. & Schneeberger, K. SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies. Genome Biology 20, https://doi.org/10.1186/s13059-019-1911-0 (2019).
Goel, M. & Schneeberger, K. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics 38, 2922–2926, https://doi.org/10.1093/bioinformatics/btac196 (2022).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907027 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907028 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907029 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907030 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907031 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907036 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907047 (2023).
NCBI Sequence Read Archive https://identifiers.org/ncbi/insdc.sra:SRR23907058 (2023).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_037974685.1 (2024).
NCBI GenBank https://identifiers.org/ncbi/insdc.gca:GCA_037974665.1 (2024).
NGDC Genome Warehouse https://ngdc.cncb.ac.cn/gwh/Assembly/83111/show (2023).
Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv: Genomics (2013).
Manni, M., Berkeley, M. R., Seppey, M. & Zdobnov, E. M. BUSCO: Assessing genomic data quality and beyond. Curr Protoc 1, e323, https://doi.org/10.1002/cpz1.323 (2021).
Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol 21, 245, https://doi.org/10.1186/s13059-020-02134-9 (2020).
Narasimhan, V. et al. BCFtools/RoH: a hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics 32, 1749–1751, https://doi.org/10.1093/bioinformatics/btw044 (2016).
Acknowledgements
This study was supported by the Foundation for the invigorating forestry through science and technology (YLK202216), Central Plain Scholar’s workstation of Henan province (ZYGZZ2021048), the scientific and technological research project of Henan province (222102110480, 222102110444, 222102110448).
Author information
Authors and Affiliations
Contributions
Cheng S.P. and Feng S.G. conceived and designed the study; Cheng S.P. collected the samples; Zhang Q.K., Geng X.N., Xie L.H., Chen M.H., Jiao S.Q., Qi S.Z., Yao P.Q., Lu M.L., Zhang M.R., Zhai W.S., and Yun Q.Z. performed bioinformatics; Feng S.G., Cheng S.P. and Zhang Q.K. participated in the manuscript writing and revisions. Cheng S.P. and Zhang Q.K. contributed equally to this work. All the authors have read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Cheng, S., Zhang, Q., Geng, X. et al. Haplotype-resolved chromosome-level genome assembly of Ehretia macrophylla. Sci Data 11, 589 (2024). https://doi.org/10.1038/s41597-024-03431-9