Background & Summary

Lonicera japonica (Caprifoliaceae), is perennial and evergreen twining vine, commonly known as Jinyinhua or Rendong, plays a very important role in traditional Chinese medicine1. Meanwhile, L. japonica has been cultivated as an ornamental plant in many areas because of its varying colors and attractive smell2. L. japonica firstly recorded as medicine can be traced back to the Jin dynasty in China, then it also recorded in ‘Ben Cao Gang Mu’ which is famous classical book of Chinese material medica1. The first place in China where L. japonica was planted massively is Fengqiu county of Henan province, and it has been used to treat exogenous wind-heat, febrile disease, sore, carbuncle, furuncle and some infectious diseases3. It was found that the extracts of L. japonica and its chemical components have a variety of pharmacological effects, including anti-inflammatory, antibacterial, antiviral, antioxidant, liver protection and anti-tumor1. As L. japonica being used and cultivated in more and more areas, its chemical constituents have been studied widely. L. japonica contains a variety of medicinal ingredients, such as flavonoids, essential oil, triterpenoid soap and organic acids4,5, which allows honeysuckle to perform many pharmacological functions.

L. japonica has always been the object of researches and many scientists mainly focused on the biosynthesis of active medicinal ingredients and differential gene expression patterns between different tissues with the help of transcriptomics6,7,8. However, with the development of sequencing technology and the reduction of sequencing cost, progressively plant genomes are available, including L. japonica, whose Institute of Medicinal Plant Development (IMPLAD) germplasm registration number is 101074289. The study about Lj10107428 provided proof for a whole genome duplication (WGD) event and showed the expression of related biosynthetic genes was correlated with the accumulation of carotenoids and suggested the role of carotenoid degradation in the dynamic coloring of L. japonica by assembling the whole genome and transcriptomic analysis9. Nevertheless, a genome of one variety is not enough to represent the genetic resources of the species. The availability of the genomes of different varieties can help to improve the genetic resources of the species and understand the reasons for the trait alternations between different varieties. Therefore, we also sequenced and de novo assembled genome of a L. japonica variety that is called ‘Sijihua’, which is largely planted in Pingyi county, Shandong province. It’s reported that Sijihua are the most cold-resistant variety by studying different varieties of L. japonica in different regions10.

Here, we generated a chromosome-scale of the genome assembly of the variety of Sijihua using the combination of PacBio long reads, Illumina reads and the Hi-C sequencing data. Approximately 886.04 Mb genome was assembled with the contig N50 length of 1.58 Mb. A total of 826.50 Mb (93.28%) of the assembled sequences were anchored to 9 pseudo-chromosomes (Table 1). We predicted 39,320 protein-coding genes, and 92.87% of each gene were assigned by BLASTP against NR, GO, KOG, KEGG and other databases. We identified 644 tRNAs, 2,156 rRNAs, 109 miRNAs and 5,502 pseudogenes (Table 2). We also identified 255,264 simple sequence repeats (SSRs), and 40,252 are polymorphic SSRs (Table 3). The genome assembly of variety Sijihua is a valuable material to the germplasm resources of L. japonica, and helps researchers to explore the specificity of different varieties. The results also provide valuable clues to the molecular basis of cold-resistance traits of Sijihua and will facilitate further genetic improvements.

Table 1 Genome assembly and assessment of Sijihua and Lj10107428 genomes.
Table 2 Genome annotation of Sijihua and Lj10107428 genomes.
Table 3 SSRs annotation of Sijihua and Lj10107428 genomes.

Methods

Sample collection, library construction and genome size estimation

High-quality genomic DNA was extracted from young-fresh leaf tissue of Sijihua using CTAB (cetyl trimethylammonium bromide) method, and the samples were collected from the Zhongke Honeysuckle Planting Cooperative in Pingyi County, Shandong, China (Fig. 1). The qualified genomic DNA was broken to the target fragment (350 bp) by ultrasonic shock and the Illumina library was constructed through end repairing, adding 3’ A tail, ligating adapters and enriching with PCR. The fragment size and quality of the library were detected by 2100 and Q-PCR. Next, the sequencing of the library were performed by using Illumina Novaseq 6000, which finally generated 31.48 million reads, 50.26 Gb of raw data, which covered 61.48 × of the genome. PacBio library were constructed by using BulePippin to screen the target fragment that was interrupted by g-tube, subsequently was sequenced with PacBio Sequel II system. Consequently, the two SMRT-cells generated a total of 4,461,375 reads with N50 size of 29,096 bp. Totally we got 87.61 Gb sequencing data, accounting for 98.88 × of the entire genome. Fresh leaf tissue of honeysuckle was used to construct a library for Hi-C analysis. The fresh tissue was fixed with formaldehyde, attaining interacting loci to be bound to one another, and then cross-linked DNA was digested by restriction enzyme Hind III. Sticky ends were labeled with biotin during its repairing. Next, the interacting DNA fragments were ligated, purified and finally broke them into 300 bp ~ 700 bp fragments. Each ligated DNA fragment was marked with biotin and streptavidin beads were used to pull-down the interacting DNA fragments to complete the Hi-C library construction. The libraries were then sequenced on Illumina Novaseq 6000 platform, which generated 307,357,239 pairs of reads and 91.84 Gb clean data, which cover 103.65x of the genome.

Fig. 1
figure 1

The five growth stages of honeysuckle. (a) The juvenile bud stage. (b) The third green stage. (c) The complete white stage. (d) The silver flowering stage. (e) The gold flowering stage.

A k-mer (k = 19) analysis was constructed using 61.48 × Illumina data to estimate the genome size, proportion of repeat sequence and heterozygosity11. From the 19-kmers distribution, we could estimate the heterozygosity and repeat ratio of the Sijihua genome to be 0.74% and 51.8%, respectively and the estimated genome size was 817.45 Mb (Fig. 2 and Table 1).

Fig. 2
figure 2

19-kmer distribution in the honeysuckle genome.

RNA sequencing and analysis

Total RNA was extracted by using RNeasy Plus Mini Kit (Qiagen) from different development stages, including juvenile bud (JB), green bud (GB), white bud (WB), silver flower (SF), and golden flower (GF). Libraries were constructed by TruSeq RNA Library Prep Kit v.2 (Illumina, San Diego, CA, USA). Each library was constructed with three biological replicates, and 150 bp paired-end reads were sequenced by using the Illumina NovaSeq 6000 platform. Raw reads were trimmed by using TRIMMOMATIC (v.0.39)12, and the clean reads were aligned to the reference genome by HISAT2 (v. 2.2.1)13 with default parameters, and only uniquely mapped reads were kept. Expression value was estimated by StringTie (v. 2.1.5)14 as FPKM (Fragments per kilobase of exon model per million reads mapped). Genes with FPKM > 0.5 were considered as expressed, and used for further analysis. Differentially expressed genes (DEGs) were identified by DESeq2 (v 1.28.1)15 with default parameters. RNA-seq data of LJ10107428 were downloaded from National Genomics Data Center (https://ngdc.cncb.ac.cn) with BioProject ID (PRJCA001719).

De novo genome assembly

PacBio long reads were error corrected using Canu v1.516 and the top 40 x coverage of the longest corrected reads were subsequently assembled by SMARTdenovo17. To improve the accuracy of the assembly, the 61.48x Illumina short reads used in the genome survey were applied to three rounds of correction by Pilon v1.2218. The high-quality Hi-C reads were used to cluster, order and orient the contigs onto pseudo-chromosomes by using the LACHESIS19. We preliminary assembled the PacBio long reads into contig sequences of 886.04 Mb, including 1,159 contigs with N50 of 1.58 Mb, and the longest contig is 12.45 Mb. These contigs were further anchored onto 9 pseudo-chromosomes, accounting for 93.28% of the assembled genome. The final chromosome-scale genome assembly of Sijihua was 886.13 Mb with a scaffold N50 of 79.57 Mb (Table 1).

Repeat annotation

De novo and structure-based predictions were integrated to annotate repetitive sequences. LTR_FINDER v1.0520 and RepeatScout v1.0521 were primarily used to build a de novo repeat sequences library of Sijihua genome, which was classified by using PASTEClassifier v1.022 and merged with Repbase v19.0623 database as the final repeat sequences database. Structure-based predictions were performed by using RepeatMasker v4.0524 based on the constructed repeat sequences database. We identified 573.89 Mb (64.76%) of repetitive sequences in Sijihua genome. Most of these repeat sequences are Class I (53.57%) retrotransposons, including Copia, Gypsy, LINE and SINE, accounted for 15.93%, 19.14%, 2.61% and 0.34% of the entire genome, respectively. In addition, Class II DNA transposons make up 5.82% of the genome (Table 2).

Protein-coding genes prediction and other annotations of the genome

Prediction of protein-coding genes was based on ab initio gene predictions, homology-based predictions and transcriptome-based predictions. Ab initio predictions were performed by Genscan v3.125, Augustus v3.126, GlimmerHMM v1.227 v3.0.4, GeneID28 v1.4, and SNAP (version 2006–07–28)29. For homology-based prediction, GeMoMa v1.3.130 was used to annotate the gene models in Sijihua using amino acid sequences from Daucus carota, Helianthus annuus, Lactuca sativa, Lonicera japonica and Arabidopsis thaliana genome. For RNA-seq-based prediction, the clean RNA-seq reads were aligned to the assembled genome using Hisat31 v2.0.4 and Stringtie32 v1.2.3, and then TransDecoder33 v2.0, GenemarkS-T34 v5.1 and PASA35 v2.0.2 were jointly used for final coding-gene prediction. Finally, 39,320 gene models were predicted after integrating results of the three methods of predictions by EVidenceModeler36 v1.1.1 (Table 2). For non-coding RNAs annotation, microRNA and rRNA were detected by aligning the assembled genome against the to Rfam37 database using BLASTN. tRNA was identified by tRNAscan-SE38. Finally, we totally identified 2,909 non-coding RNAs, including 109 miRNAs, 2,156 rRNAs and 644 tRNAs.

The sequence of pseudogenes is similar to that of functional genes, but whose original function is lost due to insertions, deletions and other variants. The predicted protein sequences were used to search for homologous gene sequences on the genome through BLAT39 alignment, and then GeneWise39 was used to search for immature stop codons and frameshift mutations in gene sequences. In total, 5,502 pseudogenes were predicted (Table 2).

For gene functional annotation, we aligned the predicted protein-coding gene sequences against public functional databases using BLAST40 v2.2.31 (-evalue 1E-5), such as NR41, KOG42, GO43, KEGG44 and TrEMBL45. In addition, the motif and domain information were annotated using InterProScan46 through searching against public databases, including PROSITE47, HAMAP48, Pfam49, PRINTS50, ProDom51, SMART52, TIGRFAMs53, PIRSF54, SUPERFAMILY55, CATH-Gene3D56 and PANTHER57. As a result, more than 92% of protein-coding genes were annotated, and 1,376 conserved motifs and 36,282 domains were identified.

Identification of SSRs and polymorphism

Reference genomes of Sijihua and Lj10107428 were analyzed for the identification of simple sequence repeats (SSRs) and polymorphism by using CandiSSR58 with default parameters. We identified 255,264 and 182,207 SSRs in Sijihua and Lj10107428 genomes, respectively. Moreover, we found 40,252 SSRs are polymorphic between Sijihua and Lj10107428 genomes (Table 3). In addition, among all the SSRs identified in Sijihua genome, the most abundant SSRs motifis are di-nucleotide (192,362), followed by tri-nucleotide (54,009), tetra-nucleotide (6,395), penta-nucleotide (1,526), and hexa-nucleotide (972).

Global genome comparison of the Lj10107428 and Sijihua

Genome comparison between Sijihua and Lj10107428 was performed by using the NUCmer program embedded in MUMmer4 with parameters –mum –l 40 –c 100, then the delta alignment file was filtered by delta-filter with parameters -1, and finally show-snps was used to identify single nucleotide polymorphisms (SNPs) with parameters -ClrT. SyRI v1.559 was used to extract structure variations (SVs) based on the alignment file with tab-delimited text format performed by show-coords program. MCScanX60 was used to identify gene collinearity, including within and between genomes. In total, we identified 2,996,015 SNPs between Sijihua and Lj10107428. Among them, 14.44% of the SNPs were located in the genic region. Comparison between Lj10107428 and Sijihua genomes, we found 5,150 small insertions/deletions (indels, length shorter than 500 bp), and more than 4.5 Mb of presence/absence variation (PAV, length longer than 500 bp). Notably, we identified 895 Sijihua-specific genomic sequences (2.61 Mb in total), and 418 Lj10107428-specific genomic sequence (1.84 Mb in total) longer than 500 bp. These PAV segments were unevenly distributed across the chromosomes, and the longest PAV sequence was a 261 Kb segment on chromosome 2 (Fig. 3a). In addition, we also found 156 inversions ( >1,000 bp) and several translocations ( >10 kb) between these two genomes. The genome sequence comparison between Lj10107428 and Sijihua reveals high collinearity. Although some structure variations were detected between these two genomes, they primarily consist of large syntenic block with high degrees of collinearity (Fig. 3b). In addition, we identified 301 large syntenic blocks between Sijihua and Lj10107428, containing 25,128 syntenic genes. Specificity, there is an inverted region of approximate 19 Mb on chromosome 8 between Sijihua and Lj10107428 genomes (Fig. 3b). We also noticed that 561 and 302 NBS-LRR genes located in to Sijihua and Lj10107428 genomes, respectively (Fig. 3b). By comparing the annotated protein-coding genes, we found that 26,937 genes of Sijihua and Lj10107428 were shared by reciprocal best hit of BLAST algorithm with parameter (E-value < 1e-10). More species-specific genes were found in Sijihua (12,383) than that in Lj10107428 (6,361) (Fig. 3c).

Fig. 3
figure 3

Comparative genomic analysis between Sijihua and Lj10107428 varieties of honeysuckle. (a) Genomic features landscape of the Sijihua genome. Density of genes, TEs, SNPs, indels, PAVs, inversions and translocations were calculated in a 500 Kb sliding window. (b) Gene collinearity between Sijihua and Lj10107428 varieties. NBS-LRR genes were annotated as yellow dot across genome. (c) Venn diagram of the overlapped genes between Sijihua and Lj10107428 genomes. (d) Expression level comparison of shared genes in Sijihua and Lj10107428 varieties.

From the 18,958 overlapped expressed (FPKM > 0.5) genes between Sijihua and Lj10107428, most of them (>93.4%) did not show differential expression variation at juvenile bud stage. However, hundreds of genes were differential expressed, including 16 NBS-LRR genes. Notably, 11 of 16 NBS-LRR genes were highly expressed in Sijihua at juvenile bud stage. This observation is consistent across other flower development stages (Fig. 3d).

Data Records

The raw data of PacBio, Illumina and Hi-C sequencing were submitted to the National Center for Biotechnology Information (NCBI) SRA database with accession number SRP35369861 under BioProject accession number PRJNA794868. RNA-seq data were deposited into the NCBI (accession number PRJNA813701)62. The assembled genome had been deposited at GenBank with accession number SAMN2466218463. In addition, the genome annotation file had been submitted at the Figshare64.

Technical Validation

Evaluation of the genome assembly

To evaluate the quality of genome assembly, bwa (version: 0.7.10-R789; mode: aln) was used to align the Illumine short reads with the reference genome, and 99.75% of the Illumina short reads were mapped to the reference genome CEGMA65 v2.5 was used to assess the integrity of the final genome assembly. The CEGMA database contained 458 conserved core eukaryotic genes, while our assembled genome contained 439 (95.85%), which suggested that our assembled genome contains most of the core eukaryotic genes. BUSCO66 v4.0 was used to assess the integrity of our genome assembly by using the Embryophyta database of OrthoDB v10. Of the 1614 expected embryophyta genes, our genome contains 1,556 (97.03%). Together, these three evaluation systems demonstrate the high integrity of our assembled genome (Table 1).

Furthermore, to assess the result of Hi-C assembly, and the number of Hi-C read pairs coverage between any two bins acts as a strength signal of interaction between the two bins. As chromosomal interaction heatmap shown, within each group, it was found that the intensity of interaction at the diagonal position was higher than that at the non-diagonal position (Fig. 4), which was consistent with the principle of Hi-C-assisted genome assembly and proved that the genome assembly was accurate.

Fig. 4
figure 4

Hi-C contact map of the chromosome-scale assembly of Sijihua. Hi-C interaction matrix shows the pairwise correlations among 9 pseudomolecules. The intensity of the dark color is scaled to the strength of the correlation.