Chromosome-scale genome assembly of sweet tea (Lithocarpus polystachyus Rehder)

Lithocarpus, with >320 species, is the second largest genus of Fagaceae. However, the lack of a reference genome limits the molecular biology and functional study of Lithocarpus species. Here, we report the chromosome-scale genome assembly of sweet tea (Lithocarpus polystachyus Rehder), the first Lithocarpus species to be sequenced to date. Sweet tea has a 952-Mb genome, with a 21.4-Mb contig N50 value and 98.6% complete BUSCO score. In addition, the per-base consensus accuracy and completeness of the genome were estimated at 60.6 and 81.4, respectively. Genome annotation predicted 37,396 protein-coding genes, with repetitive sequences accounting for 64.2% of the genome. The genome did not undergo whole-genome duplication after the gamma (γ) hexaploidy event. Phylogenetic analysis showed that sweet tea diverged from the genus Quercus approximately at 59 million years ago. The high-quality genome assembly and gene annotation resources enrich the genomics of sweet tea, and will facilitate functional genomic studies in sweet tea and other Fagaceae species.


Background & Summary
Lithocarpus is the second largest genus of the Fagaceae family, comprising more than 320 species 1 .Species belonging to the genus Lithocarpus, commonly known as stone oaks, are dominant canopy trees in the tropical and subtropical forests of East Asia 2 .Lithocarpus polystachyus Rehder (2n = 2x = 24; syn.Lithocarpus litseifolius [Hance] Chun) is an evergreen plant widely distributed in south China and the adjacent southeast Asian countries 3 .The L. polystachyus is commonly known as "sweet tea", because its leaves have a sweet taste when brewed.The leaves of L. polystachyus also have medicinal properties and have long been used as herbal tea to prevent and manage diabetes.
The leaves of L. polystachyus contain high concentration of dihydrochalcones (DHCs), which is the main source of its sweet taste.DHCs is a class of minor flavonoids (e.g.trilobatin and neohesperidin), which have been reported to act as flavor sweeteners and bitterness blockers [4][5][6] .DHCs such as phloretin, phlorizin, and sieboldin have also been demonstrated to play important roles in human health by providing a wide range of beneficial effects against diabetes, cardiovascular, cancer, and free radical-involving diseases [7][8][9][10] .In addition, phloretin exhibits strong broad-range bactericidal and fungicidal activities, and sieboldin is a powerful multipotent antioxidant 9,11 .DHCs have been isolated from many medicinal plants belonging to different families, but their contents vary significantly both among and within species [12][13][14][15] .Four DHCs (phloretin, phlorizin, trilobatin, and sieboldin) have been reported in sweet tea, and the biosynthesis pathway of the first three DHCs have been proposed 14,16 .However, our knowledge of DHC biosynthesis and regulatory mechanisms is limited in sweet tea.Specifically, candidate genes and transcription factors involved in the DHCs biosynthesis pathway remain to be investigated.
Here, we assembled a high-quality chromosome-scale genome of sweet tea, the first ever in the genus Lithocarpus, using PacBio HiFi and Hi-C data.The assembled sweet tea genome had a total length of 952 Mb, with a contig N50 of 21.4 Mb and a complete BUSCO score of 98.6%.A total of 922.6 Mb (96.9%) of the sequences were anchored to the 12 chromosomes.Genome annotation predicted 37,396 protein-coding genes and 597.5 Mb repetitive sequences.The high-quality sweet tea genome provides a valuable resource for exploring key genes and molecular regulatory mechanisms involved in the biosynthesis of important compounds, including DHCs, and will further serve as a strong foundation for trait improvement in this species.

Methods
Plant material and sequencing.A healthy sweet tea tree growing in the South China National Botanical Garden (accession number: 19940074) was selected for de novo genome assembly (Fig. 1).Young leaves were collected from the selected individual for whole-genome sequencing.The leave, stem, and root were collected for RNA-sequencing (RNA-seq) for the transcriptome assembly.All samples were immediately flash-frozen in liquid nitrogen after harvest, and stored at −80 °C for subsequent nucleic acid extraction.
Genomic DNA and total RNA were isolated from the leaves using DNeasy Plant MiniKit (Qiagen, Germany) and RNAprep Pure Plus Kit (Tiangen, China), respectively.The mRNA was purified from total RNA using poly-T oligo-attached magnetic beads for subsequent sequencing.To perform short-read sequencing for genome survey, transcriptome assembly, and transcriptomic profiling, libraries with an insert size of ~350 bp was constructed and sequenced on the Illumina NovaSeq 6000 platform to generate 150-bp paired-end reads.To perform de novo genome assembly, a 15-20-kb PacBio HiFi library was prepared using the SMRTbell Express Template Preparation Kit 2.0 (Pacific Biosciences, USA) and sequenced on the PacBio Sequel IIe platform to produce PacBio HiFi long reads.To generate the high-throughput chromatin conformation capture (Hi-C) data, DNA was isolated from the leaves and fixed with paraformaldehyde.The genomic DNA was then enzymatically digested with DnpII, generating fragments with sticky ends.These sticky ends were repaired by ' A' or 'C' deoxynucleotides with biotin by using DNA polymerase.Subsequently, the DNA fragments were ligated together to form chimeric circles using DNA ligase.The ligated DNAs were then uncrosslinked, purified, and sheared into 300-500 bp in size.Finally, the Hi-C sequencing library was sequenced on the Illumina NovaSeq 6000 platform, generating 150-bp paired-end reads.Genome survey.The 21-bp K-mers with Illumina reads were counted using Jellyfish v1.1.11 17, with default parameters.The genome size, the level of heterozygosity, and repeat content were estimated using GenomeScope v1.0 18 .The estimated genome was 874 Mb in length and a heterozygosity of 1.5% (Fig. 2).
De novo genome assembly.The HiFi long reads were initially assembled into contigs using hifiasm v0.18.6-r513 19 with default parameter to generate one primary haplotype-collapsed assembly and a pair of partially phased assemblies.The primary assembly was selected for further scaffolding because it was much more contiguous than the partially phased assemblies.To anchor the contigs onto scaffolds, the Hi-C reads were mapped on to the contigs using Juicer v2.0 20 with DpnII (GATC) as the restriction enzyme site, and scaffolds were generated using the 3D-DNA (v201008) pipeline 21 with the following parameters: "-m haploid -i 150000 -r 0--editor-repeat-coverage 5".Based on the chromosome number of sweet tea (n = 12) determined previously 22,23 and the interaction information of Hi-C reads generated with the 3D-DNA pipeline, the chromosome segmentation boundaries and assembly errors were manually checked and adjusted using Juicebox v1.11.08 24 .
The total length of the final sweet tea genome assembly was 952.3 Mb, which is slightly larger than the genome size estimated by K-mer analysis (Fig. 2 and Table 1), and smaller than the size measured by flow cytometry (ca.1,149 Mb) 22 .The contig and scaffold N50 values of the sweet tea genome were 21.4 and 78.6 Mb, respectively, which are comparable with those of recently published Fagaceae genome assemblies (Table 2).A total of 922.6 Mb (96.9%) of the sequences were anchored to the 12 chromosomes (Table 1).The Hi-C interaction map showed a strong intrachromosomal interactive signal along the diagonal (Fig. 3).

Identification and characterization of repetitive elements.
Tandem duplications were identified using TRF 4.09 25 , and both structurally intact and fragmented transposable elements (TEs) were annotated using EDTA (Extensive de-novo TE Annotator) v1.9.6 26 .Overlapping regions of each class of repetitive elements were counted only once when calculating their total size.The divergence (K) of intact LTRs identified was estimated by Kimura two-parameter distance (K2P) 27 .The insertion time was calculated by the formula: T = K/(2 × r), where r refers to a substitution rate of 1.3 × 10 -8 per site per year 28 .A total of 597.5 Mb (62.74%) of the assembled sequences were annotated as TEs, with LTR (33.49%),TIR (14.27%), and Helitron (10.82%) being the three most abundant TE superfamilies (Fig. 4 and Table 3).We found most of the LTRs have been accumulated recently over a short time span with the peak of 0.3 million years ago (Ma), suggesting an expansion event (Fig. 5).In addition, TEs were unevenly distributed along the genome, with greater accumulation in regions with low gene density (Fig. 4).

Gene prediction and functional annotation.
Protein-coding genes in the sweet tea repeat-masked genome were predicted by applying MAKER v3.01.04 29 to the combined results of RNA-seq-based prediction, protein-homology-based prediction, and ab initio prediction.Trinity v2.14.0 30 was used to de novo assemble the transcriptome for RNA-seq-based gene prediction.HiSat2 v2.2.1 31 was used to align the RNA-seq reads against the sweet tea genome, and then Trinity and StringTie v2.2.1 32 were used to assemble the genome-guided transcriptomes.Then, the de novo and genome-guided transcriptome assemblies were merged and refined using CD-HIT-EST v4.8.1 33 , generating the transcript-based gene prediction.Plant protein sequences from the Swiss-Prot database (https://www.uniprot.org/downloads)and annotated proteins from Arabidopsis thaliana Araport11 34 and Populus trichocarpa v4.1 35 were used for protein-homology-based prediction.AUGUSTUS v3.4.0 36,37 with BUSCO single-copy genes as training data, GeneMark-ES v4.69_lic 38 with a self-training algorithm, and SNAP v2006-07-28 39 with the output of MAKER as training data were used for ab initio prediction.MAKER was run for a total of three rounds to obtain high-quality gene models.
To further improve the gene prediction, genes predicted by MAKER were integrated into consensus gene models using EVidenceModeler v1.1.1 40 and then further polished using PASA v2.5.2 41 .The longest transcript of each predicated gene model was considered as the representative gene models.The completeness of gene models was assessed by searching the gene content of the embryophyta_odb10 database using BUSCO v5.4.3 42 .
A total of 37,396 protein-coding genes were predicted, with 97.1% of the complete BUSCO genes covered (Table 1).The average length of genic regions, coding sequences (CDSs), and intron sequences was 4,903 bp, 1,163 bp, and 948 bp, respectively.The average length of genes and introns in sweat tea was comparable to that of Fagaceae species (Table 2).Among the predicted protein-coding genes, 36,096 (96.5%) were annotated in functional databases (Tables 1), and 3,200 were identified as transcription factor (TF) genes belonging to 99 different families.

Comparative genomic analysis.
To investigate the syntenic relationships between the protein-coding genes of sweet tea and those of the other Fagaceae species, collinear blocks between sweet tea and the representative species of each genus or section were identified based on protein sequences using MCScan implemented in jcvi v1.2.7 48 .The synonymous substitution rate (Ks) was estimated using KaKs_Calculator v3.0 49 based on paralogous gene pairs extracted from the collinear blocks.
The syntenic gene blocks showed 1:1 syntenic patterns between sweet tea and other Fagaceae species (Fig. 6a), suggesting a conserved genome structure in Fagaceae.The Ks distribution pattern of the sweet tea syntenic genes presented the same signature of Ks peaks only at ~1.2 as that observed in the other Fagaceae and Vitis vinifera genomes, indicating that the genome of sweet tea did not undergo WGD after the gamma (γ) hexaploidy event (Fig. 6b).

Phylogenetic analyses.
We conducted phylogenetic analyses to infer the relationship between sweet tea and other Fagaceae species.To do that, we utilized OrthoFinder v2.5.4 50 to identify orthologous genes between Fig. 3 Genome-wide chromatin interaction heatmap of sweet tea based on Hi-C data.
The divergence times among Fagaceae species were estimated using MCMCTree in the PAML v4.9j package 55 , based on four-fold degenerate sites.Two fossil calibrations were used to constrain the age of nodes: (1) the split between the Fagus genus and the rest of the Fagaceae species at 82-81 Ma 56 , and (2) the divergence between the Castanopsis and Castanea genera at 53-52 Ma 57 .The phylogenetic analyses showed that sweet tea was sister to the genus Quercus, with strong bootstrap support (100%).Calibration of the phylogenetic tree showed that sweet tea diverged from the genus Quercus ~59 Ma (95% HPD: 41.05-77.38Ma) (Fig. 6c).

Data Records
The genome assembly have been deposited in the GenBank database of NCBI with accession number JAWTZU000000000 58 .The raw sequence data have been deposited in the Genome Sequence Archive (GSA) in National Genomics Data Center (NGDC) database (https://ngdc.cncb.ac.cn/) under the accession number CRA012397 59 .The genome annotation files and synteny data were deposited in the Figshare database 60 .

technical Validation
Genome completeness was assessed by searching the gene content of the embryophyta_odb10 database with Benchmarking Universal Single-Copy Orthologs (BUSCO) v5.4.3 42 ; the quality of repetitive genomic regions was assessed using the LAI vbeta3.2program (Ou et al., 2018); per-base consensus accuracy (QV) and k-mer completeness was estimated with Merqury v1.3 61 using PacBio HiFi long reads with a K-mer value of 20-bp; and PacBio HiFi long reads were mapped on to the genome using minimap2 v2.24-r1122 62 to calculate the mapping rate.The telomeric sequences in the sweet tea genome assembly were identified using quartet v1.1.3 63with "-c plant".
The sweet tea genome assembly showed a high degree of completeness and accuracy at the chromosome scale as indicated by the following statistics (Table 1): (1) the complete BUSCO score was 98.6%, which suggests

Fig. 1
Fig. 1 The sweet tea tree sequenced in this study.(a) The sequenced tree planted in South China National Botanical Garden (accession number: 19940074) (b) The fruits of sequenced tree.

Fig. 4
Fig. 4 Genomic features of sweet tea.The tracks from the outer to the inner circle represent the 12 chromosomes (Chr1-Chr12), gene density, trnasposable element (TE; Copia, Gypsy, TIR, and Helitron) density, tandem repeat density, and GC content.The connecting lines in the center of the Circos plot indicate the syntenic gene blocks.

Fig. 5
Fig.5 The distribution of insertion time of intact LTRs in sweet tea.Ma, million years ago.

Fig. 6
Fig. 6 Syntenic anasis between sweet tea and other Fagaceae species.(a) Syntenic blocks between sweet tea and other Fagaceae species.(b) Distribution of the synonymous substitution (Ks) rates of paralogous gene pairs within syntenic blocks.The blue dashed line represents the 1.2 of Ks.(c) Phylogenetic tree of sweet tea and 11 other Fagaceae species.The numbers in square brackets indicate the 95% confidence intervals of the divergence time, and two fossil calibrations are indicated in blue.The red solid circle represents support lower than 100%, based on 1,000 bootstrap replicates.

Fig. 7
Fig. 7 Coverage of HiFi long reads mapped across the 12 chromosomes of sweet tea.

Table 1 .
Statistics of the sweet tea genome assembly and annotation.

Table 2 .
Summary of the genomic features of 12 Fagaceae species.

Table 3 .
Summary of the repetitive sequences in the sweet tea genome assembly.