Chromosome-scale genome assembly of the high royal jelly-producing honeybees

A high royal jelly-producing strain of honeybees (HRJHB) has been obtained by successive artificial selection of Italian honeybees (Apis mellifera ligustica) in China. The HRJHB can produce amounts of royal jelly that are dozens of times greater than their original counterparts, which has promoted China to be the largest producer of royal jelly in the world. In this study, we generated a chromosome-scale of the genome sequence for the HRJHB using PacBio long reads and Hi-C technique. The genome consists of 16 pseudo-chromosomes that contain 222 Mb of sequence, with a scaffold N50 of 13.6 Mb. BUSCO analysis yielded a completeness score of 99.3%. The genome has 12,288 predicted protein-coding genes and a rate of 8.11% of repetitive sequences. One chromosome inversion was identified between the HRJHB and the closely related Italian honeybees through whole-genome alignment analysis. The HRJHB’s genome sequence will be an important resource for understanding the genetic basis of high levels of royal jelly production, which may also shed light on the evolution of domesticated insects.

www.nature.com/scientificdata www.nature.com/scientificdata/ develop an entirely clear picture of what causes the complex royal jelly-producing trait. In recent years, honeybee selection programs for high RJ production have also been implemented in Brazil and France beekeeping 17,18 . Additionally, further breeding of HRJHB for improving general resistance to disease is being carried out in China.
In this study, we generated a chromosome-scale of the genome assembly of the HRJHB using PacBio long-reads, Illumina short reads, and the Hi-C chromosome conformation capture technique (Table 1; Fig. 2a). The resultant genome has a total length of 222 Mb with 16 chromosomes, and the scaffold N50 was 13.6 Mb ( Table 1). One chromosome inversion was identified between HRJHB and the closely related Italian honeybees via whole-genome alignment analysis (Fig. 2b). Moreover, through a combination of ab initio gene predictions, transcript evidence and homologous protein evidence, 12,288 protein coding genes were identified in this genome, therein 6,615 genes were assigned a GO term and 8,614 genes were assigned a protein domain ( Table 2). Repetitive elements are made of 8.11% of the HRJHB genome sequence, but transposable elements (TEs) only occupy 2.15% (Table 2). Among those TEs, DNA transposons represented the most abundant TE class, which make up the majority of the total TE content (1.68% out of 2.15%). Furthermore, Tc1-mariner (TcMar) is the most abundant TE superfamily in the genome. The genome sequence provides a valuable resource for exploring the molecular basis of the high royal jelly-producing trait in honeybee and will facilitate further genetic improvements. The HRJHB may even represent a novel animal model for studying the effects of artificial selection on insects.

Methods
Sample collection and genome sequencing. Samples of the HRJHB for genome and transcriptome sequencing were collected in 2019 from Zhejiang Province, China, where the HRJHB was originated and primarily distributed (Fig. 3).
Newly emerged drone bees (n = 20), that are descendants of the queen bee, were collected from a single colony (Fig. 1a). The thoraxes were pooled for PacBio single molecule real-time (SMRT) sequencing and Illumina HiSeq sequencing. Genomic DNA was extracted using the Gentra Puregene Tissue Kit (Qiagen) and was sequenced in accordance with the standard protocols. Newly emerged worker bees (n = 20) were collected from the same colony and their thoraxes were pooled for Hi-C sequencing. Hi-C library preparation was performed by Frasergen (http://www.frasergen.com/), which mainly followed a protocol described previously 19 . The obtained Hi-C sequencing libraries were sequenced on the Illumina HiSeq X Ten platform. Worker bees that  www.nature.com/scientificdata www.nature.com/scientificdata/ were excreting royal jelly (n = 20) were collected from the same colony and their heads, thoraxes and abdomens (excluding the mid-gut tissues) were pooled for RNA-seq on the Illumina HiSeq X Ten platform.
De novo genome assembly for HRJHB. A total of 33.37 Gb of long reads were generated by the PacBio Sequel platform (Table 1), which were self-corrected and assembled into contigs using Canu v2.1 20 , with default parameters. The obtained contigs were parsed by Purge Haplotigs v1.1.1 21 to get rid of the redundancies caused by the heterozygosity of the pooled honeybee samples. Then, the remaining non-redundant contigs were polished with Illumina HiSeq reads (Table 1) three times by utilizing software Pilon v1.23 22 . Finally, the Juicer tool 23 was applied to map Hi-C reads (Table 1) against the polished contig sequences of HRJHB using the BWA algorithm 24 . The 3D-DNA pipeline 25 was applied to scaffold the contig sequences in relation to the chromosome-scale of genome assembly. annotation of repeat sequences. TEs were de novo identified by RepeatModeler2 26 , in line with default parameters. Using the obtained repeat library, each honeybee genome assembly was analyzed with RepeatMasker (http://www.repeatmasker.org) to yield a comprehensive summary of the TE landscape in each assembly. The The HRJHB's genome sequence was aligned with a closely related honeybee genome (NCBI assembly: Amel_HAv3). The red arrow indicates the chromosome inversion between the two genomes on LG7. www.nature.com/scientificdata www.nature.com/scientificdata/ annotation files produced by RepeatMasker were processed by in-house scripts to eliminate redundancies. In addition, refined annotation files were used to determine the TE diversity and abundance within each assembly and tandem repeats were identified with the Tandem Repeat Finder 27 , which was implemented in RepeatMasker.

Protein-coding genes
Prediction and functional annotation of protein-coding genes. Annotation of protein-coding genes was based on ab initio gene predictions, transcript evidence, and homologous protein evidence, which were all applied in the MAKER computational pipeline 28 . Meanwhile, RNA-seq reads obtained in this study were assembled using Trinity 29 . The assembled RNA-seq transcripts, along with proteins from bees (superfamily Apoidea) that are available in the National Center for Biotechnology Information (NCBI) GenBank (last accessed on 01/28/2020), were imported into the MAKER pipeline to generate gene models. To obtain functional clues for the predicted gene models, protein sequences encoded by them were searched against the Uniprot-Swiss-Prot protein databases (last accessed on 01/28/2020) using the BLASTp algorithm implemented in BLAST suite v2.28 30 . In addition, protein domains and GO terms associated with gene models were identified by InterproScan-5 31 .

Data Records
The raw data was submitted to the National Center for Biotechnology Information (NCBI) SRA database (Experiments for SRP300170) under BioProject accession number PRJNA689474 32 . The assembled genome has been deposited at DDBJ/ENA/GenBank under the accession GCA_019321825.1 33 . Moreover, the genome annotation results have been deposited at the Figshare database 34 .  www.nature.com/scientificdata www.nature.com/scientificdata/ technical Validation Evaluation of the genome assembly. The completeness of the genome assembly was evaluated using a set of 4,415 hymenopteran benchmarking universal single-copy orthologs (BUSCOs) using software BUSCO v3 35 . The results indicated that 99.3% of these BUSCOs were present in the genome assembly (Table 1), suggesting a remarkably complete assembly of the HRJHB genome. Furthermore, the chromosome-level structural accuracy was assessed by performing whole-genome alignments between HRJHB genome and a closely related honeybee genome (GenBank assembly: Amel_HAv3) using software D-GENIES 36 . The alignment results revealed a highly conserved chromosome structure between the two genomes, indicating an accurate scaffolding of contigs in the HRJHB genome. Nevertheless, we did find one inversion on LG7 (Fig. 2b). The Hi-C heatmap revealed a well-organized interaction contact pattern along the diagonals within/around the chromosome inversion region in HRJHB (Fig. 4), which rules out the possibility that the structural variation was derived from unreliable Hi-C signals in the HRJHB assembly. In addition, as chromosome inversion has been found to be associated with honeybee adaptations 37 , the inversion identified in the HRJHB genome will guarantee that further analysis will be carried out to investigate its association with high royal jelly production.

Code availability
All software used in this work is in the public domain, with parameters being clearly described in Methods. If no detail parameters were mentioned for a software, default parameters were used as suggested by developer.