Chromosome-level genome assembly of Platycarya strobilacea

Platycarya strobilacea belongs to the walnut family (Juglandaceae), is a species endemic to East Asia, and is an ecologically important, wind-pollinated, woody deciduous tree. To support research on the ecological value and conservation of this ancient tree, we report a new high-quality genome assembly of P. strobilacea. The genome size was 677.30 Mb, with a scaffold N50 size of 45,791,698 bp, and 98.43% of the assembly was anchored to 15 chromosomes. We annotated 32,246 protein-coding genes in the genome, of which 96.30% were functionally annotated in six databases. This new high-quality assembly of P. strobilacea provides a valuable resource for phylogenetic and evolutionary analyses of the walnut family and angiosperms.


Background & Summary
Platycarya strobilacea belongs to the walnut family (Juglandaceae), is a species endemic to East Asia, and is an ecologically important, wind-pollinated, woody deciduous tree [1-3]. It is known as a Tertiary relict tree and is native to sunny mountainous regions across East Asia (China, Japan, Korea, and Vietnam) [1-5]. P. strobilacea is considered to have the widest geographic distribution in the genus Platycarya, mainly occurring in East Asia [3,6,7]. It is also known for its systematically and evolutionarily ancient features, such as its unique systematic position in Juglandaceae [2,4], its wingnuts, and its bisexual inflorescences aggregated at the apices of branches [5-8]. Based on morphological and molecular evidence, P. strobilacea is considered to occupy a unique phylogenetic position as a sister group between Engelhardioideae and Juglandoideae [5,9,10]. Species within the Juglandaceae can be divided into three subfamilies, namely Juglandoideae, Engelhardioideae, and Rhoipteleoideae, as supported by previous studies [6,11]. Fossil, morphological, and molecular data give conflicting results regarding the phylogenetic placement of P. strobilacea in Juglandaceae [6,9-12]. Within the subfamily Juglandoideae, P. strobilacea has been considered a sister group between Carya and Cyclocarya, and most of the ancient wingnut groups are closely related to Cyclocarya [6,11-13].
P. strobilacea is an ancient tree with the widest distribution in the genus Platycarya in eastern Asia, especially in subtropical China [14]. According to the fossil record, it previously occupied a large range across the Northern Hemisphere, but it now survives only in East Asia [7,14,15]. The bark, root bark, leaves, and infructescences of P. strobilacea are raw materials for tannin extraction. The bark can also be used for its fibers, the leaves can be used as pesticides, the roots and old trees contain aromatic oil, and the seeds contain extractable oil. The morphology, biogeography, and population genetics of P. strobilacea have been described [3,5,12]. Previous studies on Platycarya using chloroplast DNA and nuclear SNP data detected significant population structure and multiple glacial refugia across most of the current geographic range in China [2,14]. The complex evolutionary history of P. strobilacea indicates that its morphology and genome might have been influenced by climate change and environmental adaptation. High-quality whole-genome sequence data are therefore an essential genetic resource for the ecological conservation of this ecologically important woody deciduous tree [2,4,9,14,15]. Useful genetic and genomic data for other species in the Juglandaceae were recently published [4,16-21].
Here, we report a new high-quality chromosome-level genome assembly of P. strobilacea (NWU2021168). The genome was assembled from short- and long-read sequencing data generated with Illumina HiSeq, PacBio single-molecule real-time (SMRT), and Hi-C sequencing technologies. We also produced transcriptome expression profiles of different tissues for genes related to flowering and stress in P. strobilacea. The genome sequence reported here is a new genomic resource for genetic studies of P. strobilacea, for analyses of genome evolution in the walnut family and angiosperms, and for exploring the ecological value of the species.

Methods
Sample and whole genome sequencing. In 2021, we collected young and healthy leaves from a single individual of P. strobilacea (genotype NWU2021168) growing in the Qinling Mountains, Shaanxi, China (altitude: 1268 m, 33°68′N, 107°35′E). High-quality total genomic DNA of NWU2021168 was prepared from the fresh leaf samples using a kit (TIANGEN, Beijing, China). For the genome survey, a short-read DNA library (350 bp) was constructed and sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). PacBio Sequel II HiFi long-read (20 kb) libraries were constructed and sequenced (Novogene, Beijing). For chromosome-level scaffolding, a Hi-C library was prepared and sequenced on the Illumina NovaSeq 6000 platform (Illumina, San Diego, CA, USA). The genome sequencing thus combined Illumina, PacBio, and Hi-C sequencing technologies (Fig. 1a). After filtering out the low-quality reads, we obtained a total of 155.
Genome de novo assembly and assessment. The assembly of the P. strobilacea genome and its subsequent assessment followed the pipeline shown in Fig. 1a. The raw Illumina reads were evaluated with SOAPnuke v1.5.6 [22]. We generated 17-mer statistics from the short-read library (350 bp) and estimated the genome size from the 17-mer frequency distribution (Fig. 2a) [23]. The estimated genome size was about 677.30 Mb, and the GC content and genome heterozygosity rate were approximately 34.12% and 1.13%, respectively (Table 1). De novo assembly of P. strobilacea was performed using Falcon v1.87 [24]. The PacBio and Hi-C sequencing reads were then mapped to the assembled scaffolds using BWA-aln [25]. Based on the Hi-C sequencing reads, the scaffolds were anchored to 15 pseudomolecules using LACHESIS [26]. The interaction heatmap of P. strobilacea chromosome pairs was produced using HiC-Pro (Fig. 2b) [27]. The scaffolds anchored onto the fifteen chromosomes covered ~98.43% of the assembled sequences (Fig. 3). The final genome assembly was 677.30 Mb with an N50 of 43.67 Mb (Tables 1 and 2). Self-alignment analysis found that duplications were present within chromosomes (Fig. 3b). The lengths of the fifteen assembled chromosomes of P. strobilacea ranged from 19,447,442 bp to 61,544,683 bp, with an average length of 42,331,493 bp (Fig. 3b).
The completeness of the final P. strobilacea genome assembly was evaluated using BUSCO v3.0.2 [28]. Of a total of 1,614 BUSCO groups, we identified 1,598 (99.0%) complete BUSCOs (1,469 single-copy and 129 duplicated) and 8 fragmented BUSCOs in the NWU2021168 P. strobilacea assembly. Based on CEGMA (Core Eukaryotic Genes Mapping Approach), 93.95% of the 248 core eukaryotic genes were verified in the NWU2021168 assembly. We aligned the Illumina short-read data (24.0 Gb) to the completed genome assembly, and 98.53% of the clean reads were mapped. The LTR Assembly Index (LAI) of our P. strobilacea genome was 21.97 (Fig. 4a). These assessments validated the quality of the NWU2021168 assembly, showing that the P. strobilacea genome assembly is of good quality in both genic and intergenic regions.
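The contiguity and completeness figures above rest on simple arithmetic, which the following minimal Python sketch illustrates. It is not the authors' pipeline: the function names and the toy scaffold lengths are hypothetical, and only the BUSCO counts are taken from the text.

```python
def n50(lengths):
    """Return the N50: the largest length L such that sequences of
    length >= L together cover at least half of the total length."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


def busco_summary(single, duplicated, fragmented, total):
    """Derive complete and missing BUSCO counts from the reported tallies."""
    complete = single + duplicated
    missing = total - complete - fragmented
    return {
        "complete": complete,
        "missing": missing,
        "complete_pct": round(100 * complete / total, 1),
    }


# Hypothetical scaffold lengths, used only to show how N50 is computed.
toy_lengths = [10, 20, 30, 40, 50]
print(n50(toy_lengths))

# BUSCO counts as reported in the text for the NWU2021168 assembly.
print(busco_summary(single=1469, duplicated=129, fragmented=8, total=1614))
```

Under these definitions, the reported single-copy and duplicated counts sum to the 1,598 complete BUSCOs, which leaves 8 groups missing.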
Genome annotation of protein-coding genes and repeats. Genes were predicted using multiple methods, including transcriptome-based, de novo, and homology-based annotation [19]. The details of the genome annotation pipeline are shown in Fig. 1b. To ensure accurate gene annotation, RNA sequences from eight tissues (female flower, male flower, mix female and flower inflorescence, axillary bud, new branch, stem, stem bark, and leaf) were used to annotate genes using AUGUSTUS (Table 3) [29]. These eight tissues were collected from the same P. strobilacea individual (genotype NWU2021168) that was used for whole-genome sequencing (some tissues are shown in Fig. 3a). For transcriptome sequencing, we extracted RNA from three biological replicates of each tissue and pooled the three RNA samples per tissue into one for RNA sequencing on the Illumina HiSeq 2500 platform (Illumina, San Diego, CA, USA). We obtained a total of 369,124,704 bp of clean data from the eight tissues. The average amount of clean sequencing data was 46,140,588 bp, ranging from 44,367,760 bp (stem bark) to 47,703,130 bp (mix female and flower inflorescence). The mean mapping rate of clean reads was 90.99%, ranging from 69.78% (stem bark) to 95.27% (stem) (Table 3). For homology-based annotation, gene structures of protein-coding genes were annotated with reference to four species (Juglans regia, Juglans sigillata, Carya illinoinensis, and Castanea mollissima) using Exonerate v2.2.0 [30]. The final annotation of the protein-coding genes was determined using MAKER2 [31]. The final protein-coding genes were functionally annotated using six databases: SwissProt [32], Nr [33], KEGG [34], InterPro [35], GO [36], and Pfam [37] (Fig. 4b and Table 4). Combining the multiple methods, we detected a total of 32,246 protein-coding gene models in the P. strobilacea NWU2021168 genome, with a mean coding sequence (CDS) length of 1,175 bp, an average exon length of 235 bp, and a mean of five exons per gene (Table 1). Among the 32,246 predicted genes, 30,480 (94.52%) were annotated in the Nr database, 29,935 (92.83%) in InterPro, 24,250 (75.20%) in KEGG, 23,644 (73.32%) in Pfam, and 18,140 in GO (Table 4).
To identify transposable elements (TEs) and long terminal repeat retrotransposons (LTR-RTs), the P. strobilacea genome sequence was searched using Repbase v.20.05 [38], RepeatMasker v.4.0.7 [39], Tandem Repeats Finder (TRF) v4.09 [40], PILER [41], and LTRharvest v.1.5.10 [42] with the default parameters. The syntenic relationships within P. strobilacea were obtained using MCScanX [43]. The features of the P. strobilacea genome assembly were visualized using Circos [44]. We identified a total of 271,999,812 bp (41.72% of the assembled genome length) of TE repetitive sequences in the genome assembly of P. strobilacea (NWU2021168) (Fig. 5a; Table 5). Retroelements occupied 31.24% of the genome length, constituting the predominant repeat type. The LTR superfamilies Copia and Gypsy and the DNA TEs constituted 223,145,245, 105,125,800, and 439,275,540 bp, corresponding to 32.95%, 15.52%, and 64.86% of the genome length, respectively. The density of Copia elements was approximately twice as high as that of Gypsy elements in the P. strobilacea (NWU2021168) genome (Fig. 3b).
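The functional annotation rates quoted above are each database's annotated gene count divided by the 32,246 predicted genes. A small Python sketch of that calculation follows; the dictionary and function names are illustrative assumptions, and the counts are those given in the text. The text gives no percentage for GO, so the value computed here is derived, not reported.

```python
# Total number of predicted protein-coding genes, as reported in the text.
TOTAL_GENES = 32_246

# Annotated gene counts per database, as reported in the text.
ANNOTATED = {
    "Nr": 30_480,
    "InterPro": 29_935,
    "KEGG": 24_250,
    "Pfam": 23_644,
    "GO": 18_140,
}


def annotation_rates(counts, total):
    """Return the percentage of genes annotated in each database."""
    return {db: round(100 * n / total, 2) for db, n in counts.items()}


rates = annotation_rates(ANNOTATED, TOTAL_GENES)
for db, pct in rates.items():
    print(f"{db}: {pct:.2f}%")
```

The same function can be reused for any per-database count in Table 4, provided the denominator is the full set of predicted genes.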
We also annotated non-coding RNAs, including transfer RNA (tRNA), ribosomal RNA (rRNA), small nuclear RNA (snRNA), and microRNA (miRNA) (Table 6). A total of 6,766 rRNAs, 636 tRNAs, 2,042 snRNAs, and 463 miRNAs were identified (Table 6). To validate the genome annotation, we compared the structure and number of genes in P. strobilacea with those of four other species (C. illinoinensis, C. mollissima, J. regia, and J. sigillata) based on protein annotations from NCBI (Fig. 5b). A total of 32,246, 36,444, 31,074, 30,624, and 30,387 protein-coding genes were identified in P. strobilacea, C. mollissima, C. illinoinensis, J. regia, and J. sigillata, respectively. The average lengths of the CDS, exon, gene, and intron in P. strobilacea were 1,175.97 bp, 235.46 bp, 4,799.56 bp, and 902.18 bp, respectively (Fig. 5b). In addition, the average number of exons per gene was similar across the five species.
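As a rough plausibility check on the gene-structure averages above, the mean exon count can be approximated as the mean CDS length divided by the mean exon length, and a gene span as the CDS plus the intervening introns. The Python sketch below does this under simplifying assumptions that are not from the source: it ignores untranslated regions and treats averages as if they combined exactly. The function names are hypothetical.

```python
def mean_exons_per_gene(mean_cds_bp, mean_exon_bp):
    """Approximate exons per gene as mean CDS length / mean exon length."""
    return mean_cds_bp / mean_exon_bp


def approx_gene_span(mean_cds_bp, mean_exon_bp, mean_intron_bp):
    """Approximate gene length as CDS plus (exons - 1) introns.

    Simplification: untranslated regions are ignored.
    """
    exons = mean_exons_per_gene(mean_cds_bp, mean_exon_bp)
    introns = max(exons - 1, 0)
    return mean_cds_bp + introns * mean_intron_bp


# Mean lengths for P. strobilacea as reported in the text (bp).
exons = mean_exons_per_gene(1175.97, 235.46)
span = approx_gene_span(1175.97, 235.46, 902.18)
print(f"{exons:.2f} exons per gene, ~{span:.0f} bp approximate gene span")
```

Under these assumptions the estimate is close to five exons per gene, and the approximate span falls within about 1% of the reported mean gene length of 4,799.56 bp.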

Data records
The raw data (Illumina reads, PacBio HiFi reads, and Hi-C sequencing reads) used for genome assembly were deposited in the Sequence Read Archive (SRA) at the National Center for Biotechnology Information (NCBI) [47-49]. The RNA-seq data of eight tissues and organs (female flower, male flower, mix female and flower inflorescence, axillary bud, new branch, stem, stem bark, and leaf) were deposited in the SRA at NCBI under accession numbers SRR26346274-SRR26346281 [50-57]. The final genome assembly files are deposited in NCBI GenBank [58], and the final genome assembly and annotation files are available in Figshare [59].

Fig. 1 The Platycarya strobilacea genome sequencing, assembly, and annotation pipeline. (a) Genome assembly with a combination of Illumina, PacBio, and Hi-C sequencing technologies. (b) The Platycarya strobilacea genome annotation workflow, including repeat annotation, gene annotation, and noncoding RNA (ncRNA) annotation.

Fig. 4 LTR Assembly Index (LAI) assessment and gene function annotations of the assembled Platycarya strobilacea genome. (a) LAI assessment for each assembled P. strobilacea chromosome. The average LAI is about 21.97, indicating the high quality of our assembly. Dashed line (LAI = 21.97) indicates the gold standard quality level of the assembly. (b) Venn diagram showing the shared and unique genes among the four gene function annotation databases. Swiss-Prot = Swiss Institute of Bioinformatics and Protein Information Resource, InterPro = Protein sequence analysis and classification, NR = non-redundant, and KEGG = Kyoto Encyclopedia of Genes and Genomes.

Fig. 5 TE divergence distribution and genetic components of the Platycarya strobilacea genome and four other species. (a) TE sequence divergence distribution diagram. LINE = Long interspersed nuclear elements, LTR = Long terminal repeats, SINE = Short interspersed nuclear elements. (b) Comparison chart of CDS length, exon length, exon number, gene length, and intron length of the Platycarya strobilacea, Carya illinoinensis, Castanea mollissima, Juglans regia, and Juglans sigillata genomes.

Table 1. Summary of sequencing data of Platycarya strobilacea.

Table 2. Statistical summary of the Platycarya strobilacea genome assembly and annotation.

Table 3. Statistical summary of transcriptome sequencing data from eight tissues for the Platycarya strobilacea genome annotation.

Table 4. Statistical summary of the annotation of the Platycarya strobilacea genome using six databases (SwissProt, Nr, KEGG, InterPro, GO, and Pfam).

Table 5. The statistical results of repeat sequences in the Platycarya strobilacea genome.

Table 6. Abundance and size of noncoding RNA in Platycarya strobilacea.