Haplotype-resolved chromosome-level genome assembly of Ehretia macrophylla

Ehretia macrophylla Wall, known as wild loquat, is an ecologically, economically, and medicinally significant tree species widely grown in China, Japan, Vietnam, and Nepal. In this study, we have successfully generated a haplotype-resolved chromosome-scale genome assembly of E. macrophylla by integrating PacBio HiFi long-reads, Illumina short-reads, and Hi-C data. The genome assembly consists of two haplotypes, with sizes of 1.82 Gb and 1.58 Gb respectively, and contig N50 lengths of 28.11 Mb and 21.57 Mb correspondingly. Additionally, 99.41% of the assembly was successfully anchored into 40 pseudo-chromosomes. We predicted 58,886 protein-coding genes, of which 99.60% were functionally annotated from databases. We furthermore detected 2.65 Gb repeat sequences, 659,290 rRNAs, 4,931 tRNAs and 4,688 other ncRNAs. The high-quality assembly of the genome offers a solid basis for furthering the fields of molecular breeding and functional genomics of E. macrophylla.


Background & Summary
Ehretia macrophylla Wall is a perennial shrub tree belonging to the genus Ehretia in the Boraginaceae family.It can arrive at 15 m and is widely distributed in the southwest, south, and east of China, as well as in certain regions of Japan, Vietnam, and Nepal [1][2][3] .E. macrophylla, also known as wild loquat in China, is a rare tree with diverse applications, including ecological, gardening, ornamental, and medicinal value.To date, the complete sequencing of any species within the genus Ehretia remains unaccomplished.The genetic studies of E. macrophylla are impeded due to the absence of high-quality reference genome sequences, despite its multifarious applications.
E. macrophylla is an excellent tree species for urban greening and as a border tree, especially when dust retention is necessary.This is due to its high trunk, strong dust absorption ability, and resistance to pests and diseases 2 .Furthermore, the foliage of E. macrophylla serves a dual purpose as both a potential food source and medicinal resource, highlighting its multifaceted utility in various fields 4 .It has the effect of activating the meridians and treating rheumatism, dispelling wind and dampness, and relieving joint pain.Furthermore, the bark of E. macrophylla has the effect of dissipating blood stasis and swelling, making it suitable for treating fall injuries 3 .Of additional interest, the fruit of E. macrophylla serves as a functional food supplement, consumed as a traditional fruit and utilized in herbal tea.It can help soothe the throat and alleviate coughs.The fruit is usually used to treat diseases such as bronchitis, acute and chronic pharyngitis, cough, and asthma 2,5 .As a prominent species within the genus Ehretia, E. macrophylla is renowned for its diverse range of applications attributed to the copious presence of bioactive compounds in its fruit and other tissues.These bioactive substances remarkable antioxidant, antitumor, anti-inflammatory, antiviral, and antibacterial properties.Some of the key compounds found in E. macrophylla include quercetin, flavonoids, kaempferol, rosmarinate, caffeic acid, and pectin polysaccharide 2,4,5 .
High-quality genomes are of profound significance for in-depth research, rational development, and adequate protection of plants.Here, we present a high-quality genome assembly of E. macrophylla using an integrated approach, which includes PacBio HiFi long-read sequencing, short-read Illumina sequencing, and Hi-C sequencing.The assembled genome (~3.40 Gb) comprises haplotype a (1.82 Gb) and haplotype b (1.58 Gb), with contig N50 lengths of 28.11 Mb and 21.57Mb, respectively.Furthermore, the assembled scaffolds were meticulously anchored to 40 pseudochromosomes with an exceptional anchoring rate of 99.41%.We predicted a total of 58,886 protein-coding genes, with 29,805 for haplotype a and 29,081 for haplotype b.Among these genes, 99.60% were functionally annotated.In addition, we identified 2.65 Gb repeat sequences (1.44 Gb for haplotype a and 1.21 Gb for haplotype b), and annotated a total of 668,909 non-coding RNA genes, including 659,290 rRNA (415,016 for haplotype a and 244,274 for haplotype b), 4,931 tRNA genes (2,522 for haplotype a and 2,409 for haplotype b) and 4,688 other ncRNA genes (2,428 for haplotype a and 2,260 for haplotype b).Our data will serve as a valuable genetic resource, enabling us to reveal the genetic mechanisms behind special properties, conduct evolutionary studies of the genus Ehretia and family Boraginaceae, and elucidate the molecular breeding of E. macrophylla.

Methods
Plant materials, library construction, and genome size estimation.Fresh leaf tissue for genome and RNA sequencing was sampled in 2022 from a mature E. macrophylla individual growing in Luoyang, Henan Province, China (34.663041N, 112.434468E) (Fig. 1a).Superior-quality genomic DNA was isolated using the Plant Genomic DNA Kit (Tiangen, China).The concentration and purity of the genomic DNA were assessed using a NanoDrop 8000 spectrophotometer (Thermo Fisher Scientific, USA).Total RNA was extracted from E. macrophylla samples utilizing TRIzol reagent.Subsequently, RNase-free DNase I was employed to treat the isolated RNA, followed by elution with RNase-free water.RNA integrity was measured using an Agilent 2100 Bioanalyzer (Agilent Technologies, Palo Alto, CA, USA).
The DNA that met the required qualifications was utilized to construct a genome library using the Pacific Biosciences SMRTbell Express Template Prep Kit.A 20-kb insert library was processed using a BluePippin system.The sequencing was carried out using the Pacific Bioscience Sequel II platform (Pacific Biosciences, Menlo Park, CA, USA).We obtained ~152.18Gb of PacBio HiFi raw data (~84 × ) with an average length of 16.05 kb (Table 1).For Illumina sequencing, the sequencing was performed on the HiSeq X Ten platform (Illumina) with model of 150 PE.Finally, we obtained approximately 54.03 Gb of Illumina raw data (~30 × ).The Hi-C libraries were constructed, enriched, and sheared according to methods described previously 6,7 .The Hi-C sequencing was conducted using the Illumina HiSeq X Ten platform.A total of approximately 208.36 Gb (~114 × ) of raw sequences that were overly short were removed using the fastp v0.19.3 8 software.Jellyfish v2.3.0 9 was employed for determining the frequency distribution of the depth of clean data with 17 K-mers, and GenomeScope v2.0 10 was utilized to estimate the genome size.The estimated haplotype genome size for E. macrophylla is approximately 1.84 Gb (Fig. 2).A combination of HiFi reads and Hi-C short reads was employed as input for the genome assembler Hifiasm v0.16.1 11 .The assembly process, conducted in Hi-C mode with default settings, resulted in the generation of two contigs representing haplotype a and haplotype b, respectively.For chromosome assembly, we first aligned the Hi-C reads to the assembly using Juicer v1.6 software 12 .Next, the draft genome assembly was scaffolded using 3D-DNA 13 with Hi-C reads.Then, we manually adjusted the chromosome construction using the Juicebox tool 14 , which involved removing incorrect insertions and adjusting the orientation to correct visible errors to the best extent possible.For further optimization of the genome assembly, three rounds of corrections were performed on the assembly using Illumina reads with NextPolish v1.4.0 15 , and the redundant sequences were removed using Redundans v0.14a27 16 .In total, approximately 99.41% the assembled data was anchored onto 40 pseudochromosomes in the two haplotypes (Supplementary Table 1).Finally, we obtained a high-quality haplotype-resolved chromosomal-level genome of E. macrophylla (Fig. 1b, Fig. 3).The assembly (~3.40 Gb) comprised two haplotypes, namely haplotype a and haplotype b, with respective genome sizes of 1.82 Gb and 1.58 Gb (Table 2).Since the genome assembly was haplotype-resolved and lacked parental information for subgenome phasing, we designated the long one chromosome from each homologous pair as haplotype a and the other as haplotype b.The contig N50 and the scaffold N50 lengths for haplotype a were 28.11 Mb and 92.55 Mb, respectively, whereas for haplotype b, they were 21.57Mb and 83.31 Mb, respectively.A total of 307 gaps were identified in the current genome assembly (Table 2).Utilizing PacBio HiFi reads, the LR_Gapcloser 17 software was employed for gap filling, with two iterations executed.Furthermore, we assembled a chloroplast genome with a length of 156,639 bp and a mitochondrial genome with a length of 702,890 bp using GetOrganelle v1.7.5.0 18 .

Genomic repeat annotation.
To annotate the repeat sequences in the E. macrophylla genome, a transposable element (TE) library was first constructed by running the extensive de novo TE Annotator (EDTA) pipeline to identify TEs from scratch.The parameters used were-Sensitive 1-ANNO 1 19 .Then, we used RepeatMasker  3).In term of E. macrophylla haplotype b, a total of 2,258,809 repetitive sequences (76.29% of the genome) were identified with a length of 1.21 Gb.Of these, the primary repetitive elements were also LTRs, which amounted to 788,470 and occupied a total size of 713.16 Mb, representing 45.13% of the genome that was assembled.This was followed by TIRs, accounting for 24.32%.The copia-and gypsy-like LTRs had sizes of 109.44 Mb and 314.71Mb, respectively, making up 6.93% and 19.92% of haplotype b (Table 3).
Fig. 3 The Hi-C interaction heatmap of chromosome interactions in E. macrophylla chromosomes.
In addition, the putative protein-coding gene structure was predicted utilizing MAKER2 36 .The ab initio predictions of gene structure were conducted using AUGSTUS v3.3.We aligned the transcript evidence with the genome using BLAST+ 37 and finally optimized it with Exonerate v2.4.0 32 .In order to increase the accuracy of the annotation, we integrated and updated the gene prediction results using EVidenceModeler51 (EVM) 38 and PASA.In total, we annotated 29,805 protein-coding genes in E. macrophylla haplotype a with an average length of 4,956.40bp.Among them, there are a total of 36,131 coding DNA sequence (CDS), 200,786 exons, and 164,655 introns.The average lengths were 1,243.30bp for CDS, 281 bp for exons and 803 bp for introns (Table 4).Additionally, we identified 29,081 protein-coding genes in haplotype b a with an average length of 5,199.10bp.A total of 34,686 CDS, 191,925 exons, and 157,239 introns were detected, with the average lengths of 1,248.6 bp, 279.1 bp and 854.6 bp respectively (Table 4).

Data Records
The sequencing data for this study have been uploaded to the NCBI database with the BioProject number PRJNA945189.The genomic PacBio sequencing data can be found in the NCBI Sequence Read Archive (SRA) database with accession numbers SRR23907027 50 , SRR23907028 51 , SRR23907029 52 , and SRR23907030 53 .For Hi-C sequencing data, specifically referring to accession numbers SRR23907031 54 and SRR23907036 55 in the SRA database.The genomic Illumina sequencing data are available under accession numbers SRR23907047 56 and SRR23907058 57 .The final genome assembly was deposited in the GenBank with accession number: GCA_037974685.1 58 and GCA_037974665.1 59 .In addition, the final chromosome assembly and annotation data were deposited in the Genome Warehouse (GWH) of the National Genomics Data Center (NGDC) with the accession number GWHEQHN00000000 60 and under the BioProject number PRJCA021125.

Technical Validation
To evaluate the completeness and accuracy of the genome, we employed BWA 61 , minimap2 47 , and HISAT2 19 to align Illumina reads, HiFi reads, and RNA-Seq reads to our reference genome respectively.In addition, BUSCO    www.nature.com/scientificdatawww.nature.com/scientificdata/v5.2.2 62 was used to evaluate the genome completeness using the embryophyta_odb10 and eukaryota_odb10 databases.The genomic completeness of these two haplotypes was found to be satisfactory, with proportions of complete BUSCOs (including both single-copy and multi-copy) at 98.1% and 97.1% for the expected genes from embryophyta, respectively (Table 7).The E. macrophylla genome size was evaluated using k-mer analysis (Fig. 2).After filtering out non-primary alignments, we proceed to calculate the mapping ratio and coverage percentage.We found that the genome coverage from sequencing data is relatively high (Table 8).We conducted additional quality control analysis on the genome assembly using Merqury 63 (at K = 16) based on PacBio HiFi reads (Fig. 6, Table 9).The consensus quality values (QVs) of the separate haplotypes a and b, as well as their shared genome, are recorded as 34.98, 34.74, and 34.87 correspondingly.The k-mer completeness scores of the distinct haplotypes a and b, along with their shared genome, amount to approximately 82.08%, 81.07%, and 94.46% accordingly.The further BUSCO analysis showed that the single-copy and multi-copy genes have approximately the same depth, indicating that the assembly had no redundancy (Fig. 7).
To evaluate the single-base error rate and heterozygosity, next-generation reads were mapped to the genome using BWA, and the variant loci were detected using bcftool v 1.11 64 .Heterozygous sites were utilized for the computation of heterozygosity rates, whereas homozygous sites were employed for the determination of error rates.We found that the heterozygosity rate was approximately 0.19%, and the error rate was approximately 0.012%.By evaluating the coverage depth and GC content distribution analysis of the second and third generation data, we found that the second-generation data had a significant guanine-cytosine (GC) bias (Fig. 8).Juicer 12 was used to map the Hi-C data to the final genome assembly.It was found that the chromosome clustering was normal, with no obvious chromosome assembly errors, but there were abnormal signals in some regions (Fig. 3).The chromatin interaction data from the Hi-C map revealed low-level interactions occurred between pseudochromosomes, confirming the high quality and reliability of our chromosome-level anchoring (Supplementary Table 1).
The chromosomal locations of specific characteristic sequences, such as telomeres, rDNA, and tandem repeats, were determined through the mapping of repetitive sequences onto the genome.The majority of chromosome telomere sequences were completely assembled; however, a few exhibited partial or missing regions.We detected a high tandem repeat on chromosomes (Supplementary txt 1).This sequence contains 5 S rDNA, and its distribution is essentially consistent, suggesting that this sequence represents 5 S rDNA and its adjacent regions.In addition, the 18-5.8-28S rDNA and 5 S rDNA arrays are very abundant and widely distributed (Supplementary Fig. 1).BUSCO v5.2.2 62 was employed to assess the annotated and integrated proteins utilizing the embryophyta_ odb10 and eukaryota_odb10 databases.The proportion of complete core gene coverage was 96.4% (Table 7), which included 7.1% single-copy genes and 89.3% duplicated genes.Only 0.9% fragmented and 2.7% missing genes were detected, indicating that the genome annotation is of superior quality.

Fig. 1
Fig. 1 The morphological characteristics and features of the haplotype-resolved genome assembly of E. macrophylla.(a) The overall morphological characteristics of E. macrophylla; (b) An overview of haplotyperesolved genome assembly of E. macrophylla.(i) The distribution of pseudochromosomes, (ii) GC content, (iii) gene density, (iv) repeat density, (v) ncRNA, and (vi) analysis of collinearity.

Fig. 4
Fig. 4 Structural variation and statistics between two haplotype genome assemblies of E. macrophylla.(a) Structural variation between haplotype genomes (haplotype a and haplotype b).(b) Size distributions of different types of structural variation between haplotype a and haplotype b.(c) Number of sequence differences in the syntenic region for each pair of chromosomes.(d) Size of sequence differences in the syntenic region for each pair of chromosomes.

Fig. 5
Fig. 5 Dot-plot of synteny blocks between the two haplotypes within E. macrophylla.

Fig. 6
Fig. 6 Evaluation of genome quality utilizing the Merqury spectrum plot.(a) Spectrum plot demonstrating copy number variations in haplotype assemblies of E. macrophylla.(b) Spectrum plot for assessing the completeness of K-mer assembly.

Fig. 7
Fig. 7 The coverage depth of the genome (left) and the distribution of BUSCO core regions (right) assessed using the next-generation data (upper) and HiFi data (lower).

Fig. 8
Fig.8The coverage depth for HiFi data (right) and next-generation data (left).

Table 1 .
The statistics of the genome sequencing data of E. macrophylla.
Fig.2Overview of the 17-mer frequency distribution in the genome of E. macrophylla.ing for 43.48% of the assembled genome.This was followed by DNA transposable elements (TIRs) at 29.36%.The sizes of the copia-and gypsy-like LTRs were 109.80Mb and 351.66 Mb, respectively, which accounted for 6.04% and 19.33% of haplotype a (Table

Table 2 .
Summary of the E. macrophylla genome assembly data.

Table 3 .
The repetitive sequences identified in the genome of E. macrophylla.

Table 4 .
Statistical analysis of gene annotations.

Table 5 .
Statistics of protein-coding gene functional annotation for E.macrophylla.

Table 6 .
Statistics for non-coding RNA genes in the genome of E. macrophylla.

Table 7 .
Statistical analysis of BUSCO for both haplotypes and proteins.

Table 8 .
Statistics of map rate and coverage of three types of sequencing reads.

Table 9 .
Statistical analysis of Merqury for evaluating the quality of haplotypes.