A chromosome-level genome assembly of the Chinese tupelo Nyssa sinensis

The deciduous Chinese tupelo (Nyssa sinensis Oliv.) is a popular ornamental tree known for its spectacular autumn leaf color. Here, using single-molecule sequencing and chromosome conformation capture data, we report a high-quality, chromosome-level genome assembly of N. sinensis. PacBio long reads were de novo assembled into 647 polished contigs with a total length of 1,001.42 megabases (Mb) and an N50 size of 3.62 Mb; the total length is in line with genome sizes estimated using flow cytometry and k-mer analysis. These contigs were further clustered and ordered into 22 pseudo-chromosomes based on Hi-C data, matching the chromosome count in Nyssa obtained from previous cytological studies. In addition, a total of 664.91 Mb of repetitive elements were identified and a total of 37,884 protein-coding genes were predicted in the genome of N. sinensis. All data have been deposited in publicly available repositories and should be a valuable resource for genomics, evolution, and conservation biology.

were obtained after removing adapters in polymerase reads (Table 1). The N50 read length reached 22.26 kb and 14.53 kb for polymerase reads and subreads, respectively. A total of 58.92 Gb of 150-bp paired-end reads were generated using the Illumina platform, and a total of 58.83 Gb (coverage of 55.97×) of reads were obtained after adapter trimming and quality filtering (Table 1). In addition, the Hi-C library was constructed using young leaf tissue from the same individual of N. sinensis and sequenced using the Illumina platform. A total of 126.81 Gb (coverage of 120.63×) of 150-bp paired-end reads were obtained after adapter trimming and quality filtering (Table 1); these reads were later used to extend the contiguity of the genome assembly to the chromosomal level. Furthermore, leaves and flowers were collected from the same individual of N. sinensis, and RNA-Seq reads were generated for genome annotation using the Illumina platform. A total of 17.36 Gb of 150-bp paired-end reads were obtained after adapter trimming and quality filtering (Table 1).
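The coverage values quoted above can be reproduced approximately with simple arithmetic. The sketch below assumes that coverage was computed as sequenced bases divided by the k-mer-based genome size estimate of 1,051.16 Mb reported in the next section; the text does not state the denominator explicitly, so this is an inference, and small differences in the last digit may reflect rounding of the inputs.

```python
def sequencing_coverage(total_bases_gb: float, genome_size_mb: float) -> float:
    """Return fold coverage: sequenced bases divided by genome size."""
    return (total_bases_gb * 1000.0) / genome_size_mb  # convert Gb to Mb first


# Assumption: the k-mer-based estimate (1,051.16 Mb) is the denominator.
GENOME_SIZE_MB = 1051.16

illumina_cov = sequencing_coverage(58.83, GENOME_SIZE_MB)   # filtered short reads
hic_cov = sequencing_coverage(126.81, GENOME_SIZE_MB)       # filtered Hi-C reads

print(f"Illumina short reads: {illumina_cov:.2f}x")
print(f"Hi-C reads: {hic_cov:.2f}x")
```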
Genome size and heterozygosity estimation. The genome size of N. sinensis was first estimated using k-mer analysis with Jellyfish 9. The 17-mer frequency of Illumina short reads followed a Poisson distribution, with the highest peak occurring at a depth of 45 (Fig. 1). The estimated genome size was 1,051.16 Mb, and the heterozygosity rate of the genome was 0.87% (Table 2). In addition, we performed flow cytometry analysis using Vigna radiata as the internal standard, and the genome size of N. sinensis was estimated at 992 Mb.
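A common way to turn a k-mer depth histogram into a genome size estimate is to divide the total number of k-mers by the depth of the main peak. The sketch below illustrates that calculation on a toy histogram; it is not necessarily the exact procedure used here. The `min_depth` cutoff for discarding error-dominated low-depth k-mers and the toy counts are assumptions made for illustration.

```python
def estimate_genome_size(histogram: dict[int, int], min_depth: int = 5) -> tuple[int, float]:
    """Estimate genome size from a k-mer depth histogram.

    histogram maps k-mer depth -> number of distinct k-mers seen at that depth.
    Returns (peak_depth, estimated_genome_size_in_bases).
    """
    # Assumption: k-mers below min_depth are mostly sequencing errors and are ignored.
    usable = {depth: count for depth, count in histogram.items() if depth >= min_depth}
    if not usable:
        raise ValueError("histogram has no entries at or above min_depth")
    # The main peak is the depth shared by the largest number of distinct k-mers.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of (depth * distinct k-mers).
    total_kmers = sum(depth * count for depth, count in usable.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (hypothetical counts) with an error tail and a main peak at depth 45.
toy_histogram = {1: 1000, 2: 300, 44: 90, 45: 100, 46: 90}
peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```

In practice the histogram would be parsed from the two-column depth/count output of a k-mer counter rather than typed in by hand.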
Fig. 1 The k-mer analysis (k = 17) for estimating the genome size of Nyssa sinensis. The x-axis refers to the k-mer depth; the y-axis refers to the frequency of the k-mer for a given depth.
Table 2. Summary of the k-mer analysis for estimating the genome size of Nyssa sinensis. 150-bp paired-end reads were generated using the Illumina platform, and a total of 58.83 Gb of reads were obtained after adapter trimming and quality filtering. The frequency of each k-mer was calculated and plotted in Fig. 1.
www.nature.com/scientificdata
De novo genome assembly and pseudo-chromosome construction. After the self-error correction step, the PacBio long reads were assembled into contigs using the hierarchical genome assembly process (HGAP) 10 as implemented in the FALCON assembler 11,12. In addition, two rounds of polishing were applied to the assembled contigs using the Quiver algorithm 10 with the PacBio long reads, and another round of genome-wide base-level correction was performed using Pilon 13 with the Illumina short reads. Finally, the Purge Haplotigs pipeline 14 was run to produce an improved, deduplicated assembly. The resulting genome assembly of N. sinensis contained 1,001.42 Mb of sequences in 647 polished contigs with an N50 size of 3.62 Mb (contigs shorter than 100 bp were discarded; Table 3), and the overall GC-content was 35.98%.
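For readers unfamiliar with the contig N50 statistic used above, the following minimal sketch shows the standard definition: the length of the contig at which the cumulative length, counted from the longest contig downward, first reaches half of the total assembly length. The contig lengths are toy values, and the `min_length` parameter simply mirrors the 100-bp cutoff mentioned in the text.

```python
def n50(lengths, min_length=100):
    """Return the N50 of a collection of contig lengths (in bp).

    Contigs shorter than min_length are discarded before the calculation.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no contigs at or above min_length")
    total = sum(kept)
    running = 0
    for length in kept:
        running += length
        # Stop at the first contig where the running sum reaches half the total.
        if running * 2 >= total:
            return length


# Toy contig lengths (hypothetical); the 50-bp contig is dropped by the cutoff.
toy_contigs = [200, 300, 400, 500, 600, 700, 800, 900, 1000, 50]
print(n50(toy_contigs))
```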
Construction of pseudo-chromosomes followed a previous study 15 using the Hi-C library. Briefly, the clean Hi-C reads were mapped to the assembled contigs using the Burrows-Wheeler Aligner 16 (BWA), and only uniquely mapped read pairs were considered for downstream analysis. Duplicate removal, sorting, and quality assessment were performed using HiC-Pro 17. The assembled contigs were then clustered, ordered, and oriented into pseudo-chromosomes using LACHESIS 18. A total of 585 contigs spanning 1,000.96 Mb (i.e., 99.95% of the assembly) were clustered into 22 chromosome groups (Fig. 2), matching the chromosome count in Nyssa (n = 22) based on cytological studies 19-21. In addition, of the clustered contigs, 382 contigs spanning 968.49 Mb (i.e., 96.71% of the assembly) were successfully ordered and oriented (Online-only Table 1).
The annotation of repetitive elements. To annotate repetitive elements in the genome of N. sinensis, we utilized a combination of evidence-based and de novo approaches. The genome assembly was first searched using RepeatMasker (http://www.repeatmasker.org) against the Repbase database 22. Next, a de novo repetitive element library was constructed using RepeatModeler (http://www.repeatmasker.org/RepeatModeler/), which employed results from RECON 23 and RepeatScout 24. This de novo repetitive element library was then utilized by RepeatMasker to annotate repetitive elements. Results from these two runs of RepeatMasker were merged. A total of 664.91 Mb of repetitive elements (i.e., 66.40% of the assembly) were identified in the genome of N. sinensis (Table 4), including retroelements (32.51%), DNA transposons (11.23%), tandem repeats (2.95%), and unclassified elements (19.71%). Thus, the percentage of predicted repetitive elements in the genome of N. sinensis is much higher than that in the closely related species C. acuminata (35.6% 5).
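When annotations from two RepeatMasker runs are combined, overlapping hits need to be counted only once in the total repeat length. The text does not describe how the merge was performed, so the sketch below is only one plausible approach: take the union of half-open intervals on a single sequence and report the covered fraction. The coordinates and the 1,000-bp sequence length are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open intervals [start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def masked_fraction(intervals, sequence_length):
    """Return the fraction of the sequence covered by the union of intervals."""
    covered = sum(end - start for start, end in merge_intervals(intervals))
    return covered / sequence_length


# Hypothetical repeat hits on one 1,000-bp contig from the two runs.
library_based_hits = [(0, 100), (250, 400)]
de_novo_hits = [(50, 150), (400, 450), (900, 1000)]

combined = library_based_hits + de_novo_hits
print(merge_intervals(combined))
print(masked_fraction(combined, 1000))
```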
Table 3. Summary of genome assemblies of Nyssa sinensis created at different stages of the assembly process.
Long terminal repeat (LTR) retrotransposons are prevalent in plant genomes 25. In order to develop high-quality gene annotation, we additionally identified LTR retrotransposons in the genome of N. sinensis using a combination of four programs (i.e., LTR_FINDER 26, LTRharvest 27, LTR_retriever 25, and RepeatMasker). Here, LTR_FINDER and LTRharvest were used for initial identification of LTR retrotransposons; LTR_retriever was then used to filter out false positives and estimate the insertion time for each intact LTR retrotransposon; finally, RepeatMasker was used for annotation of LTR retrotransposons. Our results suggested that, compared with C. acuminata, LTR retrotransposons in the genome of N. sinensis had recently undergone rapid proliferation, particularly the Ty3-gypsy family (Fig. 3).
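Insertion times of intact LTR retrotransposons are commonly estimated from the divergence between the two LTRs of an element, which are identical at the time of insertion. A widely used formulation is T = K / (2μ), where K is the Jukes-Cantor-corrected distance and μ is the substitution rate per site per year. The sketch below illustrates that formula and is not a reimplementation of LTR_retriever; it assumes the two LTRs are already aligned without gaps, and the rate of 1.3e-8 is a placeholder, since the text does not state the value used.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Correct an observed proportion of differing sites for multiple hits."""
    if not 0.0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def insertion_time_years(ltr_5prime: str, ltr_3prime: str, mutation_rate: float) -> float:
    """Estimate insertion time as K / (2 * mu) for two pre-aligned LTRs."""
    if len(ltr_5prime) != len(ltr_3prime) or not ltr_5prime:
        raise ValueError("LTRs must be non-empty, aligned, and of equal length")
    mismatches = sum(a != b for a, b in zip(ltr_5prime, ltr_3prime))
    k = jukes_cantor_distance(mismatches / len(ltr_5prime))
    return k / (2.0 * mutation_rate)


# Assumption: placeholder substitution rate (per site per year), not from the text.
ASSUMED_RATE = 1.3e-8

# Toy 100-bp LTR pair differing at a single site.
left_ltr = "ACGT" * 25
right_ltr = "T" + left_ltr[1:]
print(round(insertion_time_years(left_ltr, right_ltr, ASSUMED_RATE)))
```

Younger elements have more similar LTRs, so an excess of elements with small estimated T is what a recent burst of proliferation would look like under this model.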

Table 4. Summary of repetitive elements annotated in the genome of Nyssa sinensis.
Protein-coding gene prediction and functional annotation. The identification of protein-coding genes in the assembled genome of N. sinensis was based on transcriptome data and ab initio prediction. First, two strategies (i.e., de novo and genome-guided assembly) were applied to assemble RNA-Seq reads into transcripts using Trinity 28. In order to use Trinity in genome-guided mode, RNA-Seq reads were first aligned to the assembled genome of N. sinensis using HISAT2 29. These two transcriptome assemblies were then merged. To generate the initial gene models for training AUGUSTUS 30, our assembled transcripts were processed and utilized to identify open reading frames (ORFs) by the Program to Assemble Spliced Alignments 31 (PASA). AUGUSTUS was then utilized for ab initio gene prediction based on (i) a generalized hidden Markov model (HMM) and (ii) a semi-Markov conditional random field (CRF). In addition, extrinsic evidence was incorporated into AUGUSTUS using a hints file, which was generated by aligning RNA-Seq reads to the hard-masked genome assembly with HISAT2. Lastly, untranslated regions (UTRs) and alternative splicing variations were annotated using PASA. A total of 37,884 protein-coding genes were predicted in the genome of N. sinensis (Table 5).
For functional annotation, our predicted protein-coding genes were searched against the Swiss-Prot and TrEMBL databases 32 using BLAST+ 33 with an E-value threshold of 1e-05, as well as the InterPro database using InterProScan 5 34. In addition, for predicted protein-coding genes, gene ontology (GO) annotations were performed using Blast2GO 35, and KEGG orthology (KO) identifiers were assigned using KEGG Automatic Annotation Server 36 (KAAS). A total of 36,185 genes (i.e., 95.52% of all predicted protein-coding genes) were successfully annotated by at least one database (Table 6).
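The count of genes "annotated by at least one database" is a set union over the per-database hit lists. The sketch below shows that bookkeeping on toy data; the gene identifiers and per-database hits are hypothetical, and only the final percentage check uses numbers from the text.

```python
def annotation_summary(all_genes, hits_by_database):
    """Return (number annotated, percent annotated) across all databases."""
    gene_set = set(all_genes)
    # A gene counts once, no matter how many databases it hits.
    annotated = set().union(*hits_by_database.values()) & gene_set
    return len(annotated), 100.0 * len(annotated) / len(gene_set)


# Hypothetical gene IDs and hits, for illustration only.
genes = ["g1", "g2", "g3", "g4", "g5"]
hits = {
    "Swiss-Prot": {"g1", "g2"},
    "InterPro": {"g2", "g3"},
    "KEGG": {"g3", "g5"},
}
print(annotation_summary(genes, hits))

# Percentage reported in the text: 36,185 of 37,884 predicted genes.
print(round(100.0 * 36185 / 37884, 2))
```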

Technical Validation
Total RNA quality assessment. The quality of total RNA was evaluated using (i) agarose gel electrophoresis for RNA degradation and potential contamination, (ii) a NanoDrop spectrophotometer for preliminary quantitation, and (iii) an Agilent 2100 Bioanalyzer for RNA integrity and quantitation. Total RNA samples included in this study had an RNA integrity number (RIN) of 9.7-10 and an rRNA ratio of 1.5; these samples were then enriched for mRNA via an oligo(dT)-magnetic bead method.
Quality filtering of Illumina data. Illumina raw data were first filtered using Trimmomatic 45 to remove paired-end reads if either of the reads contained (i) adapter sequences, (ii) more than 10% of N bases, and (iii) more than 20% of bases with a Phred quality score less than 5.
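The three filtering criteria above can be expressed as a simple predicate on a read pair. The sketch below is a plain-Python illustration of the rules, not a description of how Trimmomatic is invoked. It assumes that a pair is discarded when either read meets any one of the criteria, that quality scores are already decoded to integers, and that adapters are detected by exact substring match (real tools tolerate mismatches). The adapter sequence is included for illustration only.

```python
def read_fails_filter(seq, quals, adapters,
                      max_n_fraction=0.10, max_lowq_fraction=0.20, lowq_threshold=5):
    """Return True if a single read meets any of the removal criteria."""
    if not seq:
        return True
    # (i) adapter contamination (simplified: exact substring match)
    if any(adapter in seq for adapter in adapters):
        return True
    # (ii) more than 10% of bases are N
    if seq.count("N") / len(seq) > max_n_fraction:
        return True
    # (iii) more than 20% of bases have a Phred score below 5
    low_quality = sum(q < lowq_threshold for q in quals)
    return low_quality / len(seq) > max_lowq_fraction


def keep_pair(read1, read2, adapters):
    """Keep a pair only if neither read fails the filter."""
    return not (read_fails_filter(*read1, adapters) or read_fails_filter(*read2, adapters))


# Illustrative adapter prefix and toy reads as (sequence, quality list) tuples.
ADAPTERS = ["AGATCGGAAGAGC"]
good = ("ACGTACGTAC", [30] * 10)
n_rich = ("ACNNACGTAC", [30] * 10)
low_quality = ("ACGTACGTAC", [2, 2, 2] + [30] * 7)

print(keep_pair(good, good, ADAPTERS))
print(keep_pair(good, n_rich, ADAPTERS))
print(keep_pair(low_quality, good, ADAPTERS))
```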
Table 6. Summary of functional annotation of protein-coding genes in the genome of Nyssa sinensis.
Assessing the completeness and accuracy of the genome assembly. We first evaluated the completeness of the assembly using CEGMA 46 and BUSCO 47,48. Out of the 248 core eukaryotic genes in CEGMA, 235 (94.8%) complete matches and 244 (98.4%) complete plus partial matches were found in the assembled genome of N. sinensis. In addition, of the 1,440 plant-specific BUSCO genes, 93.4% were identified as complete and 2.2% as partial in the assembly. Second, the accuracy of the assembly was assessed using our Illumina short reads. In total, 94.51% of the filtered short reads (58.83 Gb, Table 1) were mapped to the assembled genome of N. sinensis using BWA, which covered 99.89% of the assembly. Furthermore, single-nucleotide polymorphisms (SNPs) were called and filtered using SAMtools 49, and a total of 5,046,556 SNPs with a sequencing depth between 10× and 100× were identified, consisting of 5,040,788 heterozygous SNPs and 5,768 homozygous SNPs. The low rate of homozygous SNPs (0.0006% of the assembled genome) suggested high accuracy of the assembly. Finally, the assembled genome of N. sinensis was divided into 10-kb non-overlapping windows, and the scatter plot of the sequencing depth versus the GC-content based on 10-kb windows indicated no contamination of foreign DNA in the assembly.
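The windowed GC-content calculation behind the contamination check can be sketched as follows. This covers only the GC half of the scatter plot; per-window sequencing depth would come from the read alignments and is not shown. Counting only A, C, G, and T in the denominator and keeping a final partial window are choices made for this illustration. The toy sequence uses a 4-bp window so the result is easy to follow.

```python
def gc_content_by_window(sequence: str, window: int = 10_000):
    """Return (window_start, GC fraction) for non-overlapping windows."""
    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        gc = chunk.count("G") + chunk.count("C")
        # Assumption: ambiguous bases (e.g., N) are excluded from the denominator.
        called = gc + chunk.count("A") + chunk.count("T")
        fraction = gc / called if called else float("nan")
        results.append((start, fraction))
    return results


# Toy sequence with a 4-bp window instead of the 10-kb windows used in the text.
print(gc_content_by_window("GGCCAATTGCAT", window=4))
```

Windows from a contaminant with a different base composition would typically form a separate cluster in the depth versus GC plot, which is what the check looks for.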