A chromosome-level reference genome of the hornbeam, Carpinus fangiana

Betulaceae, the birch family, comprises six living genera and over 160 species, many of which are economically valuable. To deepen our knowledge of Betulaceae species, we have sequenced the genome of a hornbeam, Carpinus fangiana, which belongs to the most species-rich genus of the Betulaceae subfamily Coryloideae. Based on over 75 Gb (~200x) of high-quality next-generation sequencing data, we assembled a 386.19 Mb C. fangiana genome with contig N50 and scaffold N50 sizes of 35.32 kb and 1.91 Mb, respectively. Furthermore, 357.84 Mb of the genome was anchored to eight chromosomes using over 50 Gb (~130x) Hi-C sequencing data. Transcriptomes representing six tissues were sequenced to facilitate gene annotation, and over 5.50 Gb high-quality data were generated for each tissue. The structural annotation identified a total of 27,381 protein-coding genes in the assembled genome, of which 94.36% were functionally annotated. Additionally, 4,440 non-coding genes were predicted.

www.nature.com/scientificdata

To enrich the available genomic resources for Betulaceae, we sequenced the whole genome of Carpinus fangiana (Fig. 1), a member of the most species-rich genus in Coryloideae 10. A total of 77.85 Gb (~200x) of next-generation data and 52.19 Gb (~130x) of Hi-C data were used to assemble the genome. The assembly had a total length of 386.19 Mb, of which 357.84 Mb was anchored to eight chromosomes. To our knowledge, this is the first reported chromosome-level Coryloideae genome assembly. The contig N50 and scaffold N50 were 35.32 kb and 1.91 Mb, respectively. Structural annotation of the genome revealed a total of 27,381 protein-coding genes, of which 94.36% were functionally annotated. The genome was also predicted to contain 4,440 non-coding genes based on a comprehensive annotation. This chromosome-level genome of C. fangiana will greatly facilitate further biological studies on Betulaceae as well as the development and commercial exploitation of the genus.

Methods
Sampling, library construction and sequencing. Fresh leaves were collected from a wild C. fangiana tree in Ebian, Sichuan, China (N: 29° 1′44″; E: 102°59′30″; Fig. 1) and immediately dried over silica gel. Genomic DNA was then extracted from the dried leaves using a modified cetyltrimethylammonium bromide (CTAB) method 11. Sequencing libraries with different insert sizes were constructed using a library construction kit (Illumina). Short paired-end libraries were constructed with insert sizes of 230, 500, and 800 bp, while the insert sizes used to construct mate pair libraries were 2, 5, 10, and 20 kb. The Illumina HiSeq 2000 platform was used to sequence 150 bp paired-end reads for all these libraries in accordance with the manufacturer's instructions. These procedures generated a total of 115.12 Gb (~200x) raw data for C. fangiana genome assembly (Table 1).
A high-throughput chromosome conformation capture (Hi-C) library for the C. fangiana genome was also constructed. To this end, fresh leaves were fixed with formaldehyde to induce DNA cross-linking, after which the DNA was digested with HindIII. The resulting sticky ends were biotinylated and proximity-ligated to form chimeric junctions, which were enriched and then physically sheared into 300-700 bp fragments. These chimeric fragments were sequenced on the Illumina HiSeq platform, generating 52.54 Gb (~130x) of Hi-C data (Table 1).
We also harvested six tissues (bark, branch, bract, flower, fruit, leaf) for total RNA sequencing. These samples were flash frozen in liquid nitrogen, and total RNA was extracted using the modified CTAB method 12. cDNA libraries were then constructed using the NEBNext Ultra RNA Library Prep Kit for Illumina (NEB). The Illumina HiSeq 2500 platform was used to sequence these libraries with a read length of 2 × 150 bp, generating over 5.50 Gb of raw data for each tissue (Table 2).
Preprocessing and genome size estimation. Quality control checks on the raw genome data were performed using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Potential adapters in reads were removed using Scythe (http://github.com/vsbuffalo/scythe) and low-quality reads were discarded by Sickle (https://github.com/najoshi/sickle). The program Lighter 13 was then used to correct sequence errors in the remaining reads. For mate pair reads, we also used FastUniq 14 to remove duplicates. In total, 77.85 Gb (~200x) of high-quality next-generation sequencing data and 52.19 Gb (~130x) of high-quality Hi-C data were obtained for de novo assembly of the C. fangiana genome (Table 1).
Quality control of transcriptome data was performed using a custom Perl script. Reads were discarded if (1) the proportion of unidentified nucleotides in the read exceeded 5%, or (2) over 65% of the read's bases had a Phred quality below 8. After eliminating low-quality reads, the quantity of retained data for each tissue was above 5.50 Gb (Table 2). The RNA-seq reads were then assembled using Trinity 15. CD-Hit 16 was used to eliminate redundant transcript sequences, and candidate coding regions in the transcript sequences were identified by TransDecoder (https://transdecoder.github.io).
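The two read-level criteria above can be illustrated with a short Python sketch. This is not the authors' Perl script (which is deposited at figshare); the function names, the Phred+33 encoding, and the per-read treatment shown here are assumptions made for illustration only.

```python
def phred33(qual_string):
    """Decode a FASTQ quality string, assuming Phred+33 encoding."""
    return [ord(c) - 33 for c in qual_string]


def keep_read(seq, quals, max_n_frac=0.05, low_q=8, max_low_frac=0.65):
    """Return True if a read passes both filters described in the text.

    A read is discarded if more than 5% of its bases are N, or if more
    than 65% of its bases have a Phred quality below 8.
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    low_frac = sum(q < low_q for q in quals) / len(quals)
    return n_frac <= max_n_frac and low_frac <= max_low_frac
```

For paired-end data, one plausible design is to drop a pair when either mate fails; the text does not state which convention the authors used.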
Before genome assembly, we estimated the C. fangiana genome's size by performing a combined analysis using Jellyfish 17 and GenomeScope 18. Reads from the short-insert libraries were first processed by Jellyfish to assess their k-mer distribution, using a k value of 17. Then, GenomeScope was used to estimate the genome size based on the k-mer distribution (Fig. 2). The genome was thereby estimated to be around 396.74 Mb long.
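The idea behind k-mer based size estimation can be sketched in a few lines of Python. This is a deliberately simplified peak-based estimate (total k-mers divided by the depth of the main peak), not the mixture-model fit that GenomeScope performs, and it does not collapse reverse-complement k-mers as a production counter usually would. The function names and the `min_depth` error cutoff are assumptions for illustration.

```python
from collections import Counter


def kmer_histogram(reads, k=17):
    """Map k-mer multiplicity -> number of distinct k-mers seen that often."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(hist, min_depth=2):
    """Peak-based estimate: total k-mer occurrences / depth of the main peak.

    Multiplicities below min_depth are treated as sequencing errors
    and ignored (an assumed, adjustable cutoff).
    """
    usable = {d: n for d, n in hist.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(usable, key=lambda d: usable[d])
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth
```

In a heterozygous genome the histogram typically shows a second peak at about half the main depth, which is one reason a model-based tool is preferable to this simple ratio.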
Table 2. Illumina RNA sequencing metrics, before and after quality control.
Genome assembly. Preliminary de novo assembly of the C. fangiana genome was performed with Platanus 19, which can effectively manage high-throughput data from heterozygous samples. Assembly using Platanus proceeded in three steps: (1) contig assembly, in which de Bruijn graphs were constructed from the clean reads of the short paired-end libraries and contig sequences were read out from the graphs; (2) scaffolding, in which reads from all next-generation libraries (short paired-end and mate pair) were mapped to contigs, after which contigs considered to be linked were combined into scaffolds; and (3) gap closing, in which reads that mapped onto scaffolds were collected to cover the gaps within them. GapCloser 20 was used to further close the gaps based on reads from all the paired-end libraries, after which the automated HaploMerger2 pipeline 21 was used to rebuild the above assembly and implement flexible and sensitive error detection. After discarding scaffolds smaller than 1 kb, a high-quality de novo assembled C. fangiana genome was obtained. The size of this genome (386.19 Mb) was 97.34% of the estimated value (396.74 Mb) and its GC content was 37.59%. The scaffold N50 and N90 values were 1.91 Mb and 0.43 Mb, while the contig N50 and N90 were 35.32 kb and 8.54 kb (Table 3).
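For readers unfamiliar with the N50 and N90 statistics reported above, the following Python sketch shows the standard definition: the length of the sequence at which the cumulative length, counted from the longest sequence downwards, first reaches the given fraction of the total. The function names are illustrative, and splitting scaffolds into contigs at runs of N is a common convention assumed here, not a detail stated in the text.

```python
import re


def nxx(lengths, fraction=0.5):
    """Return the Nxx statistic (N50 by default) for a list of lengths."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    threshold = fraction * sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= threshold:
            return length


def contig_lengths(scaffold_seq):
    """Split a scaffold at runs of N (gaps) and return contig lengths."""
    return [len(c) for c in re.split(r"[Nn]+", scaffold_seq) if c]
```

Calling `nxx(lengths, 0.9)` gives N90; applying `nxx` to scaffold lengths and to the pooled output of `contig_lengths` yields the scaffold-level and contig-level values, respectively.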
The HiC-Pro 22 program was used for quality assessment of the Hi-C data. Valid interaction pairs were mapped to the contigs and scaffolds assembled from the next-generation sequencing data and used to correct errors in them. Next, the contigs and scaffolds were anchored to chromosomes using LACHESIS 23. In total, 357.84 Mb of scaffolds were assembled into eight chromosomes (Table 4). Finally, we obtained a high-quality chromosome-level genome with a total size of 386.25 Mb. The contig N50 and scaffold N50 values of this chromosome-level assembly were 34.85 kb and 37.11 Mb, respectively (Table 3).

Heterozygosity assessment and repeat annotation.
To assess the heterozygosity of the C. fangiana genome, we first mapped reads from the 500 bp library to the assembled genome using the BWA-MEM algorithm from the Burrows-Wheeler Aligner (BWA) package 24. SAMtools 25 was used to convert the mapping results to BAM format, sort them, and remove duplicates. The Picard package (http://broadinstitute.github.io/picard/) was used to replace read groups in the BAM file. Two programs (RealignerTargetCreator and IndelRealigner) from the Genome Analysis ToolKit (GATK) 26 package were used to realign reads around indels and thereby reduce misalignments. The SAMtools command 'mpileup' was used to generate a VCF format file, and the program bcftools from the SAMtools package was used to detect single nucleotide polymorphisms (SNPs). Finally, based on the SNPs, the heterozygosity was calculated to be 0.38% using a custom Perl script.
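The final step, turning SNP calls into a heterozygosity value, can be sketched as follows. This is not the authors' Perl script; it assumes a single-sample VCF with a GT field, counts biallelic or multiallelic single-base sites whose two alleles differ, and divides by a denominator (`callable_bases`) whose exact definition in the original analysis is not given in the text.

```python
def count_het_snps(vcf_lines):
    """Count heterozygous SNP sites in a single-sample VCF (lines as strings)."""
    n_het = 0
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample column
        ref, alts = fields[3], fields[4].split(",")
        if len(ref) != 1 or any(len(a) != 1 for a in alts):
            continue  # skip indels and symbolic alleles
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue
        genotype = fields[9].split(":")[fmt.index("GT")].replace("|", "/")
        alleles = genotype.split("/")
        if len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]:
            n_het += 1
    return n_het


def heterozygosity(n_het, callable_bases):
    """Fraction of assessed positions that are heterozygous."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return n_het / callable_bases
```

Whether the denominator is the full assembly length or only positions with adequate read depth changes the result slightly, so the choice should be reported alongside the value.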
Repetitive sequences and transposable elements (TEs) in the C. fangiana genome were identified using a combined procedure incorporating de novo and homology-based approaches at the DNA and protein levels. Tandem repeats were annotated using Tandem Repeats Finder (TRF) 27. A repeat library for the C. fangiana genome was generated using RepeatModeler (http://www.repeatmasker.org) to facilitate de novo annotation. RepeatMasker 28 (http://www.repeatmasker.org) was used to identify and classify the TEs at the DNA level. We also used RepeatProteinMasker to perform a WU-BLASTX search against the TE protein database in order to identify and classify TEs at the protein level. Finally, long terminal repeats (LTRs) were identified using LTR-FINDER 29. In total, the C. fangiana genome was found to contain 158.69 Mb of repetitive sequences, accounting for 41.08% of its length (Table 5). The most common classifications assigned to these repetitive elements were Unknown (15.97% of the assembled genome) and LTRs (14.57% of the assembled genome).
Gene annotation. Structural annotation of gene models was performed by applying a combination of de novo, homology-based, and transcriptome-based methods to the repeat-masked genome. The de novo approach was implemented using Augustus 30, Geneid 31, GeneMark 32, GlimmerHMM 33, and SNAP 34. For homology-based prediction, TBLASTN 35 was used to align predicted protein sequences from Arabidopsis thaliana, Vitis vinifera, Prunus persica, Ostrya chinensis, Ostrya rehderiana and Juglans regia to the C. fangiana genome with an E-value threshold of 1E-05. Then, GeneWise 36 was used to obtain accurate spliced alignments by aligning the matched proteins to the homologous genomic sequences. Transcriptome-based prediction was performed with the Program to Assemble Spliced Alignments (PASA) 37, which was used to predict protein-coding regions based on the assembled transcripts of the six different C. fangiana tissues. The gene models obtained from the de novo, homology-based, and transcriptome-based annotations were combined to form a consensus gene set using EVidenceModeler (EVM) 38. After strict filtering, a total of 27,381 non-redundant protein-coding genes were annotated in the C. fangiana genome (Table 6).
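The filtering criteria applied to the consensus gene set are listed in the Technical Validation section (CDS shorter than 150 bp, coding regions that cannot be translated, premature termination codons, and support from de novo prediction only). The Python sketch below shows one way such a filter could look. It is not the authors' implementation: the function name, the evidence labels, and the interpretation of "cannot be accurately translated" as a length that is not a multiple of three or the presence of non-ACGT bases are all assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard nuclear genetic code


def passes_filters(cds, evidence, min_len=150):
    """Return True if a gene model survives the four filters.

    cds      -- coding sequence as a DNA string
    evidence -- iterable of labels, e.g. {"de_novo", "homology", "transcript"}
                (hypothetical labels chosen for this sketch)
    """
    cds = cds.upper()
    if len(cds) < min_len:
        return False  # (1) CDS too short
    if len(cds) % 3 != 0 or set(cds) - set("ACGT"):
        return False  # (2) not cleanly translatable (assumed definition)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        return False  # (3) in-frame stop before the final codon
    if set(evidence) <= {"de_novo"}:
        return False  # (4) no homology or transcript support
    return True
```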
Functional annotation of the predicted protein-coding genes was performed by using BLASTP with an E-value threshold of 1E-05 to search for homologous sequences in the SwissProt (http://www.gpmaw.com/html/swiss-prot.html), TrEMBL (http://www.uniprot.org) 39, and KEGG (http://www.genome.jp/kegg/) protein databases 40. The program hmmscan from the HMMER package (http://hmmer.org) was used to search for Pfam domains. InterProScan 41 was used to annotate the protein motifs and domains, and the Blast2GO pipeline 42 was used to obtain Gene Ontology (GO) 43 IDs for each gene based on the NCBI NR database. In total, 25,836 protein-coding genes, corresponding to 94.36% of the total predicted gene models in the C. fangiana genome, were successfully functionally annotated (Table 7).
We also annotated non-coding RNAs in the C. fangiana genome. tRNAscan-SE 44 was used to detect putative transfer RNAs (tRNAs) with eukaryotic parameters, resulting in the identification of 632 tRNAs. To identify other non-coding RNAs, INFERNAL 45 was used to perform searches against the Rfam 46 database, resulting in the identification of 936 ribosomal RNAs (rRNAs), 197 microRNAs (miRNAs), 117 small nuclear RNAs (snRNAs), and 232 small nucleolar RNAs (snoRNAs) (Table 8).

Data records
The sequencing data, including the Illumina genome data (SRA accession: SRX6070999-SRX6071006), Hi-C data (SRA accession: SRX6071007), and Illumina transcriptome data (SRA accession: SRX6070994-SRX6070998, SRX6071008), were submitted to the NCBI Sequence Read Archive (SRA) database under BioProject accession number PRJNA548027 47. The assembled genome was deposited at DDBJ/ENA/GenBank under accession number VIBQ00000000 48. Repeat annotations, gene model annotations and non-coding RNA annotations, the CDS sequences for the coding and non-coding genes, the protein sequences for the coding genes, as well as two custom Perl scripts were deposited at figshare 49.

Technical Validation
Assessment of the genome assembly. We evaluated the completeness of the C. fangiana genome assembly in two ways. First, all the paired-end reads were mapped to the assembled genome with BWA. The aligned outputs were then analyzed using SAMtools. The mapping rate for each library was above 90% (Table 9). Furthermore, the coverage of the genome after gap elimination was 99.74%, with 95.05% having at least 100x coverage. Second, Benchmarking Universal Single-Copy Orthologs (BUSCO) 50 was used to evaluate the completeness of the genome assembly. Of the BUSCO groups searched, 95.30% were identified as "complete BUSCOs" in the assembly, and the proportion of "missing BUSCOs" was only 4.10% (Table 10). These results demonstrate the high reliability and completeness of the reported genome assembly.
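The coverage figures quoted above (fraction of non-gap positions covered, and fraction covered at 100x or more) correspond to a simple calculation over per-base read depth. The Python sketch below illustrates it under stated assumptions: per-base depths are available as a list (for example, parsed from a depth table), gap positions are flagged separately, and the function name and thresholds are illustrative, not taken from the authors' pipeline.

```python
def coverage_breadth(depths, is_gap, thresholds=(1, 100)):
    """Fraction of non-gap positions with depth >= each threshold.

    depths -- per-base read depth, one integer per assembly position
    is_gap -- booleans of the same length, True where the base is a gap (N)
    """
    kept = [d for d, gap in zip(depths, is_gap) if not gap]
    if not kept:
        raise ValueError("no non-gap positions to assess")
    return {t: sum(d >= t for d in kept) / len(kept) for t in thresholds}
```

For a genome-sized input, the same logic would normally be applied in a streaming fashion rather than by holding all depths in memory.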
Finally, we evaluated the assembly of the eight chromosomes. To this end, the anchored genome was split into bins of 100 kb in length. The number of Hi-C read pairs linking any two bins was used to define the signal for the interaction between those bins, and these signal intensities were plotted in the form of a heat map. The signal intensities clearly divided the bins into eight distinct groups, demonstrating the high quality of the chromosome assembly (Fig. 3).
Improvement of gene annotation quality. To maximize the reliability of the gene annotation process, repeat regions in the assembled genome were masked before gene annotation. In the filtering procedure, EVM was first used to merge the results obtained by de novo, homology-based, and transcriptome-based predictions. Genes were then discarded if: (1) their CDS length was below 150 bp; (2) their putative coding regions could not be accurately translated into protein sequences; (3) they possessed premature termination codons; or (4) they were only supported by de novo predictions. In addition, PASA was used to identify untranslated regions (UTRs).

Code availability
This work relied on many software tools. The versions, settings and parameters of these tools are given below.