Genome assembly of two diploid and one auto-tetraploid Cyclocarya paliurus genomes

Cyclocarya paliurus, an endemic species in the genus Juglandaceae with the character of heterodichogamy, is one of triterpene-rich medicinal plants in China. To uncover the genetic mechanisms behind the special characteristics, we sequenced the genomes of two diploid (protandry, PA-dip and protogyny, PG-dip) and one auto-tetraploid (PA-tetra) C. paliurus genomes. Based on 134.9 (~225x), 75.5 (~125x) and 271.8 Gb (~226x) subreads of PacBio platform sequencing data, we assembled 586.62 Mb (contig N50 = 1.9 Mb), 583.45 Mb (contig N50 = 1.4 Mb), and 2.38 Gb (contig N50 = 430.9 kb) for PA-dip, PG-dip and PA-tetra genome, respectively. Furthermore, 543.53, 553.87, and 2168.65 Mb in PA-dip, PG-dip, and PA-tetra, were respectively anchored to 16, 16, and 64 pseudo-chromosomes using over 65.4 Gb (~109x), 68 Gb (~113x), and 264 (~220x) Hi-C sequencing data. Annotation of PA-dip, PG-dip, and PA-tetra genome assembly identified 34,699, 35,221, and 34,633 protein-coding genes (90,752 gene models) or allele-defined genes, respectively. In addition, 45 accessions from nine locations were re-sequenced, and more than 10 × coverage reads were generated.


Background & Summary
Cyclocarya paliurus (Batal.) Iljinskaja (wheel wingnut) is an important medicinal woody plant, which widely distribute in mountanous areas of south and southeast China 1 . Its leaves are often used in traditional Chinese medicine to treat hypertension and diabetes due to its multiple bioactive compounds triterpenoids, especially species-specific triterpenoids, e.g. cyclocaric acid B (CA-B). However, evolutionary history and functions of triterpenoid-biosynthetic-related genes remain unknown. Moreover, the character of heterodichogamy with two temporally complementary genetic morphs in C. paliurus, significantly affects seed filling index and quality. However, the published C. paliurus genome 2 with lower quality can't support to expound above biological issues, further imped its widely application, such as metabolite yields and medicinal value. Our survey found that most of natural populations of C. paliurus are auto-tetraploid (2n = 4x = 64) with a base chromosome number of 16. However, populations distributed in areas along with 30° N in China possess diploid (2n = 2x = 32) individuals 3 . The C. paliurus genome, especially auto-tetraploid is characterized by great heterozygosity (1.97%), due likely to the fact that C. paliurusis is a outcrossing species.
Hence, we present chromosome-scale genome assemblies for auto-tetraploid (PA-tetra), and two diploid (PA-dip and PG-dip) C. paliurus that represent two flower morphs, protandry (PA) and protogyny (PG). We extracted total DNA from C. paliurus tissues, sequenced in Illumina Novaseq. 6000 with 150-bp paired-end (PE) reads length. After data filtering and preprocessing, 106.7, 86, and 291.7 Gb raw data were generated for PA-dip, PG-dip, and PA-tetra C. paliurus, respectively. We also sequenced constructed libraries using a PacBio Sequel II with P6-C4 chemistry, finally, 134.9, 75.5, and 271.8 Gb raw data were generated for PA-dip, PG-dip, and PA-tetra C. paliurus, respectively. In addition, High-throughput Chromatin Conformation Capture (Hi-C) sequencing was used for the chromosomal level assemblies. A total of 196, 190, and 687 million of 150-bp paired-end reads were produced on Illumina Novaseq. 6000 system. The final genome assembly size in PA-dip, PG-dip, and PA-tetra C. paliurus were 586.62 Mb, 583.45 Mb, and 2.38 Gb, containing 1,101 contigs (N50 = 1.9 Mb), 921 contigs (N50 = 1.4 Mb), and 9,744 contigs (N50 = 430.9 kb), respectively. Structural www.nature.com/scientificdata www.nature.com/scientificdata/ annotation of three genomes yielded 34,699 (PA-dip), 35,221 (PG-dip), and 34,633 (PA-tetra) protein-coding genes, and no less than 94.4% of highly conserved embryophyta genes completely presented in the genome assembly. To explore the population structure and evolutionary history of C. paliurus population, More than 10 × coverage of per sample for 35 tetraploid and 10 diploid C. paliurus growing in nine locations were generated from 150-bp Illumina short reads, respectively. A total of 3.89 and 26.67 million variants were respectively identified from diploid and tetraploid populations.
These three chromosome-level genomes of C. paliurus will greatly facilitate researchers to uncover the genetic mechanisms behind the special characteristics of C. paliurus species, including heterodichogamy, the origination and polyploidization, and the pathway of triterpenoids biosynthesis. In addition, the genome assembled in this study provides a valuable genomics resources for future efforts to protect the vulnerable C. paliurus and identify key functional genes.  Genome size estimation. To estimate the genome size for three C. paliurus, Illumina short reads were recruited to determine the distribution of K-mer values using published Estimate_genome_size.pl scripts (https:// bioinformatics.uconn.edu/genome-size-estimation-tutorial/). The genome size was estimated using the following formula: the number of K-mers divided by the depth of peaks (K-mer-number/depth). The size of the haploid genomes for PA-dip and PG-dip C. paliurus were estimated to be approximately 538. 16 Mb and 574.97 Mb (Fig. 1), which are similar with the genome size of Pterocarya stenoptera 6 , the species in the same family with  www.nature.com/scientificdata www.nature.com/scientificdata/ C. paliurus. In addition, the estimated genome size of PA-tetra C. paliurus is 2065.41 Mb and is likely a high heterozygosity plant (Fig. 1).
Preprocessing and genome assembly. All sequencing subreads generated from Pacbio SMRT Sequencer were self-corrected and trimming using CANU (v.1.7) 7 with optimized parameters (corOutCoverage = 100). Then, the overlaps distributed in all preassembled error-corrected reads were identified using FALCON  Table 3. C. paliurus and outgroup (Juglans regia) species used for the population genomics analysis. X-mean represents the fluorescence intensity of the sample peak which is proportional to the DNA content of the cells. Reference mean of internal reference (diploid or tetraploid C. paliurus) for flow cytometry analysis. The data from Outgroup was referred to Zhang's report 30 .
The quality of paired-end Hi-C reads were firstly evaluated using HiC-Pro 5 , and the Hi-C data was aligned to the contig-level assembly using BWA. After data analyzed by juicer 10 , the mis-joined and long contigs were interrupted using the 3D-DNA pipeline 4 , then iterative scaffolding was carried out following the several rounds of error correction from 3D-DNA. Properly paired reads chosen for the reassembly, should satisfy the condition that both paired ends are properly aligned according to the aligner. The contigs corrected by Hi-C interactions     www.nature.com/scientificdata www.nature.com/scientificdata/ were successfully linked into 16 pseudo-chromosomes in PG-dip and PA-dip, and 64 pseudo-chromosomes with four sets of monoploid chromosomes in PA-tetra C. paliurus using the ALLHiC pipeline 11 . repeat annotation. Repetitive sequences in three C. paliurus genomes were identified using same pipeline.

Gene annotation.
To obtain high quality protein-coding genes for three C. paliurus genomes, we employed a comprehensive strategy from ab initio gene prediction, RNA-seq-based prediction, and homology-based prediction. RNA-seq-based prediction was performed with the Trinity 17 software, which was used to predict protein-coding genes based on RNA-seq data from four different tissues (stems, female floral buds, male floral  www.nature.com/scientificdata www.nature.com/scientificdata/ buds, and leaf buds) of C. paliurus, genome-guided assembly and de novo assembly. RSEM 18 was used to quantify the fragments per kilobase per million (FPKM) expression level of assembled transcripts, and low quality transcripts were filtered with default criterion. The comprehensive transcript library from the filtered transcripts were constructed by PASA 19 program. Then, the filtered transcripts screened from PASA were aligned to protein sequences of six species (Oryza sativa, Vitis vinifera, Carica papaya, Morus notablis, Solanum tuberosum and Arabidopsis thaliana). For protein-coding genes annotation, MAKER2 20 pipeline run by two rounds. For the first round, MAKER2 was employed to generate consensus genes by integrating three annotation strategies (SNAP 21 , GENEMARK 22 and AUGUSTUS 23 ). After that, the MAKER2 was used to integrate multiple tiers of coding evidence, including ab initio gene prediction, transcript and protein evidence and generate a comprehensive set of protein-coding genes. After the first round, the predicated gene models with annotation edit distance (AED) values < 0.2 were selected for model re-training. Finally, the gene annotation improvement was generated by the second round of MAKER2. After filtering, a total of 34,699, 35,221, and 90,752 gene models were annotated in PA-dip, PG-dip and PA-tetra C. paliurus, respectively.
Samples re-sequencing and variant calling. In 2019 and 2020, we visited nine populations of C. paliurus distributed across south of China, and sampled 45 wild individuals including 10 diploid and 35 tetraploid accessions (Table 3). In addition, one walnut species (Juglans regia) was performed as an outgroup. All samples containing diploid and tetraploid C. paliurus were sequenced on Illumina Novaseq. 6000, and generated the expected more than 10 × coverage short reads (150 bp paired-end). Clean data was obtained by trimming adapters and low-quality reads (Q < 30) using Trimmomatic (v0.32) 24 , followed by aligning against the reference genome of PA-dip C. paliurus by BWA. The variant calling was carried out by GATK (v4.0.3.0) 25 . Firstly, GATK HaplotypeCaller was performed to identify general variants from each individual, and a single variant calling file www.nature.com/scientificdata www.nature.com/scientificdata/ was generated by GenotypeGVCFs software. This two-step approach was carried out to ensure variant accuracy, which included re-genotyping and quality re-calibration in the combined vcf file.

Data records
The whole genome sequencing raw data including Illumina short reads, PacBio long reads, Hi-C interaction reads, and re-sequencing data (SRP421615 26

Technical Validation
Quality assessment of the genome assembly. The assemblies presented here are the two diploid (two mating types with PA and PG), and one haplotype-resolved assembly of auto-tetraploid C. paliurus genomes. The contig N50 for PA-dip, PG-dip, and PA-tetra C. paliurus were 1.93 Mb, 1.39 Mb, and 430.92 kb, respectively. Besides, 543.53 (92.65%), 553.87 (94.93%), and 2168.65 Mb (91.08%) of PA-dip, PG-dip, and PA-tetra, were respectively anchored to 16, 16, and 64 pseudo-chromosomes (Table 4). For completeness assessment of PA-dip, PG-dip, and PA-tetra C. paliurus genomes, Illumina short reads were mapped to the assembly genome. The global mapping ratio and properly paired ratio reached at least 98.89% and 93.72%, respectively (Table 5). Further, we also assessed the completeness of the genome assembly with the BUSCO version 3 based on embryophyta_odb10 (https://busco.ezlab.org/list_of_lineages.html). More than 95.2% of complete BUSCOs and less than 3.9% of missing BUSCOs were identified in assembly (Table 6). Finally, we evaluated the assembly of the 16 chromosomes for two diploid genomes and 64 chromosomes for tetraploid genome (Figs. 2-4). The anchored genome was split into bins of 150 kb in length. The number of Hi-C read pairs covered by any two bins was used to define the signal for the interaction between those bins, and these signal intensities were plotted in the form of a heat map. The signal intensities clearly divided the bins into 16 distinct groups for both two diploid genomes and concentrated on diagonal, demonstrating the high consistency of genome structure and quality.
Gene annotation validation. Gene annotation in three genomes were performed using a combination of RNA-seq-based, homology-based, and ab initio gene prediction. To improve the quality of gene prediction, the transcripts were removed if FPKM less than 1, while the protein sequences with coverage greater than 95% were reserved as candidate sequences. After the first round, the predicated gene models with annotation edit distance (AED) values < 0.2 were selected for model re-training. BUSCO (version 3) analysis for PA-dip, PG-dip, and PA-tetra C. paliurus were used to evaluate the completeness of protein-coding annotation using embryophyta_ odb10 (https://busco.ezlab.org/list_of_lineages.html), showing more than 94.4% of BUSCOs value (Table 7).   www.nature.com/scientificdata www.nature.com/scientificdata/ Validation population structure variant. SNPs were further filtered according to the following parameters: (1) SNPs presented only in one of the two pipelines (SAMtools/BCFtools and GATK); (2) SNPs in repeat regions; (3) non-biallelic SNPs; (4) SNPs with missing rate more than 40%; (5) SNPs were removed if the distance with nearby variant sites less than 5 bp; (6) SNPs with read depth more than 1,000 or less than 5. A total of 3.89 and 26.67 million variants were respectively generated from diploid and tetraploid populations, containing 3.55 and 23.08 million SNPs, and 0.34 and 3.60 million indels. We also identified 3,845 and 38,899 variants in genic regions, including 3,753 and 23,764 synonymous, 5,503 and 35,241 nonsynonymous, respectively (Table 8).

Code availability
The sofware versions, settings and parameters used are described below.