Hybrid genome assembly and annotation of Danionella translucida

Studying neuronal circuits at cellular resolution is very challenging in vertebrates due to the size and optical turbidity of their brains. Danionella translucida, a close relative of zebrafish, was recently introduced as a model organism for investigating neural network interactions in adult individuals. Danionella remains transparent throughout its life, has the smallest known vertebrate brain and possesses a rich repertoire of complex behaviours. Here we sequenced, assembled and annotated the Danionella translucida genome employing a hybrid Illumina/Nanopore read library as well as RNA-seq of embryonic, larval and adult mRNA. We achieved high assembly continuity using low-coverage long-read data and annotated a large fraction of the transcriptome. This dataset will pave the way for molecular research and targeted genetic manipulation of this novel model organism.

Sequencing on HiSeq 4000 generated 1.347 billion paired-end reads. A long ~10 kb mate-pair library was prepared using the Nextera Mate Pair Sample Prep Kit and sequenced on HiSeq 4000, resulting in 554 million paired-end reads. Raw read library quality was assessed using FastQC v0.11.8 10 .
A Nanopore sequencing high-molecular-weight gDNA library was prepared from 3 months post fertilisation (mpf) DT tails. We used 400 ng of DNA with the 1D Rapid Sequencing Kit (SQK-RAD004) according to the manufacturer's instructions to produce the longest possible reads. This library was sequenced with the MinION sequencer on a single R9.4 flowcell using MinKNOW v1.11.5 software for sequencing and base-calling, producing a total of 4.3 Gb of sequence over 825k reads. The read library N50 was 11.6 kb, with the longest read being approximately 200 kb. Sequencing data statistics are summarised in Table 1.
Genome assembly. The genome assembly and annotation pipeline is shown in Fig. 2. We estimated the genome size using the k-mer histogram method with Kmergenie v1.7016 on the paired-end Illumina library preprocessed with fast-mcf v1.04.807 11,12 , which produced a putative assembly size of approximately 744 Mb. This translates into 186-fold Illumina and 5.8-fold Nanopore sequencing depths.
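The depth figures follow from dividing the sequenced bases by the estimated genome size. As a minimal sketch of that arithmetic (the function and variable names are ours, and the Illumina base count is back-calculated from the reported 186-fold depth, not taken from the read tables):

```python
def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Average sequencing depth = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


GENOME_SIZE = 744e6     # Kmergenie estimate quoted in the text (bp)
NANOPORE_BASES = 4.3e9  # total Nanopore yield quoted in the text (bp)

nanopore_depth = sequencing_depth(NANOPORE_BASES, GENOME_SIZE)

# Assumption: inverting the same formula gives the Illumina yield
# implied by the reported 186-fold depth.
illumina_bases_implied = 186 * GENOME_SIZE

print(f"Nanopore depth: {nanopore_depth:.1f}x")                     # 5.8x
print(f"Implied Illumina yield: {illumina_bases_implied / 1e9:.1f} Gb")  # 138.4 Gb
```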
Multiple published assembly pipelines utilise a combination of short- and long-read sequencing. Our assembler of choice was MaSuRCA v3.2.6 13 , since it has already been used to generate high-quality assemblies of fish genomes 8,9 , providing a large continuity boost even with a small amount of long-read input 14 . Briefly, Illumina paired-end shotgun reads were non-ambiguously extended into superreads, which were mapped to Nanopore reads for error correction, resulting in megareads. These megareads were then fed to the modified CABOG assembler, which assembled them into contigs, and, ultimately, mate-pair reads were used for scaffolding and gap repair.
Following the MaSuRCA authors' recommendation 8 , we turned off the frgcorr module and provided raw paired-end and mate-pair read libraries for in-built preprocessing with the QuorUM error corrector 13,15 . The initial genome assembly size estimated with the Jellyfish assembler module was 938 Mb. After the MaSuRCA pipeline processing we polished the assembly with one round of Pilon v1.22, which attempts to resolve assembly errors and fill scaffold gaps using preprocessed reads mapped to the assembly 16 . Leftover contaminants were filtered during the processing of the genome submission to the NCBI database. Statistics of the resulting assembly were generated using bbmap stats toolkit v37.32 17 and are presented in Table 2.
The resulting 735 Mb assembly had a scaffold N50 of 341 kb, the longest scaffold being more than 3 Mb. To assess the completeness of the assembly we used BUSCO v3 18 with the Actinopterygii ortholog dataset. In total, 91.5% of the orthologs were found in the assembly.
Transcriptome sequencing and annotation. We used three sources of transcriptome evidence for the DT genome annotation: (i) assembled poly-A-tailed short-read and raw Nanopore cDNA sequencing libraries, (ii) protein databases from sequenced and annotated fish species and (iii) trained gene prediction software. For Nanopore cDNA sequencing we extracted total nucleic acids from 1-2 dpf embryos using phenol-chloroform-isoamyl alcohol extraction followed by DNA digestion with DNase I. The resulting total RNA was converted to double-stranded cDNA using poly-A selection at the reverse transcription step with the Maxima H Minus Double-Stranded cDNA Synthesis Kit (ThermoFisher). The double-stranded cDNA sequencing library was prepared and sequenced in the same way as the genomic DNA with MinKNOW v1.13.1, resulting in 190 Mb of sequence data distributed over 209k reads. These reads were filtered to remove the shortest 10%. For short-read RNA sequencing, we extracted total RNA with the TRIzol reagent (Invitrogen) from 3 dpf larvae and from adult fish. RNA was poly-A enriched and sequenced as 100 bp paired-end reads on the BGISEQ-500 platform. After preprocessing, the library sizes were 65.4 million read pairs for 3 dpf larvae and 64.3 million read pairs for adult fish specimens (Table 1). We first assembled the 100 bp paired-end RNA-seq reads de novo using the Trinity v2.8.4 assembler 19 . This produced 222,448 contigs with an N50 length of 3,586 bp, clustered into 146,103 "genes". BUSCO transcriptome analysis revealed 96% of complete Actinopterygii orthologs in the Trinity assembly.
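The read, scaffold and contig N50 values quoted above all share one definition: the length L such that sequences of length at least L contain half or more of the total bases. A minimal sketch of that calculation, using hypothetical toy lengths and not the bbmap or Trinity implementations:

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Sequences are visited from longest to shortest; the N50 is the length
    of the sequence at which the running sum first reaches half of the
    total number of bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths, for illustration only.
toy_scaffolds = [2, 3, 4, 5, 6, 10]
print(n50(toy_scaffolds))  # 6: the two longest (10 + 6) cover 16 of 30 bases
```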
These contigs, together with the Nanopore cDNA reads and proteomes of 11 fish species from Ensembl 20 , were used as evidence in the MAKER v2.31.10 annotation pipeline 21 . Repetitive regions were masked using a de novo generated DT repeat library (RepeatModeler v1.0.11) 22 . The highest quality annotations with annotation edit distance (AED) < 0.25 were used to train the SNAP 23 and Augustus 24 gene predictors. Gene models were then polished over two additional rounds of re-training and re-annotation. The final set of annotations consisted of 24,097 protein-coding gene models with an average length of 13.4 kb and an average AED of 0.18 (Table 3). Using MAKER, we added putative protein functions from the UniProt database 25 and protein domains from InterProScan v5.30-69.0 26 . tRNAs were searched for and annotated using tRNAscan-SE v1.4 27 . The BUSCO transcriptome completeness search found 86% of complete Actinopterygii orthologs in the annotation set. An example Integrative Genomics Viewer (IGV) v2.4.3 28 window with the dnmt1 gene is shown in Fig. 3, demonstrating the annotation and RNA-seq coverage.
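Selecting the AED < 0.25 training set amounts to filtering transcript features by a score stored in the annotation file. The sketch below illustrates the idea under the assumption that each mRNA feature carries its score as an `_AED=` key in the GFF3 attribute column; the helper names and the example records are hypothetical, and this is not the procedure used inside MAKER itself.

```python
def parse_attributes(column9):
    """Turn a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def low_aed_transcripts(gff_lines, max_aed=0.25):
    """Yield (transcript_id, aed) for mRNA features with AED below max_aed."""
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments, directives and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = parse_attributes(cols[8])
        if "_AED" not in attrs:  # assumed attribute key, see lead-in
            continue
        aed = float(attrs["_AED"])
        if aed < max_aed:
            yield attrs.get("ID", ""), aed


# Hypothetical records, for illustration only.
toy_gff = [
    "##gff-version 3",
    "scaf1\tmaker\tgene\t100\t900\t.\t+\t.\tID=g1",
    "scaf1\tmaker\tmRNA\t100\t900\t.\t+\t.\tID=g1-RA;Parent=g1;_AED=0.10",
    "scaf1\tmaker\tmRNA\t2000\t2900\t.\t-\t.\tID=g2-RA;Parent=g2;_AED=0.40",
]
selected = list(low_aed_transcripts(toy_gff))
print(selected)  # [('g1-RA', 0.1)]
```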

Data Records
Raw sequencing libraries and genome and transcriptome assemblies have been deposited at the NCBI SRA as part of the BioProject SRP136594 29 .
The genome assembly with gene and transcript annotations has been deposited at GenBank under the accession number SRMA00000000 30 (the version described in this paper is SRMA01000000), as well as on figshare in FASTA/GFF3 format 31 . The Trinity transcriptome assembly has been deposited at NCBI TSA under accession number GHNV00000000 32 (the version described in this paper is GHNV01000000), as well as on figshare 31 .
Kmergenie-generated k-mer abundance histograms and a summary report together with the genome size estimation are deposited at figshare 31 .
The MAKER pipeline annotation output GFF3 file containing evidence mapping, identified repetitive elements and gene models, MAKER-predicted transcripts and proteins, IGV-compatible short-read and long-read RNA-seq coverage, the raw sequencing read library FastQC quality analysis report and intron orthology data together with their custom analysis code are available on figshare 31 .

Technical Validation
DT and zebrafish intron size distributions. The predicted genome size of DT is around one half of the zebrafish reference genome 33 . Danionella dracula, a close relative of DT, possesses a unique developmentally truncated morphology 34 and has a genome of a similar size (ENA Accession Number GCA_900490495.1). In order to validate our genome assembly, we set out to compare the compact genome of DT to the zebrafish reference genome.
Changes in intron length have been shown to be a significant part of genomic truncations and expansions, such as a severe intron shortening in another miniature fish species, Paedocypris 35 , or an intron expansion in zebrafish 36 . We therefore compared the distribution of total intron sizes from the combined Ensembl/Havana zebrafish annotation 20 to the MAKER-produced DT annotation (Fig. 4a). We found that the DT intron size distribution is similar to that of the other fish species investigated in ref. 35 , which stands in stark contrast to the large tail of long introns in zebrafish. The difference in median intron length is in the range of the observed genome size difference (462 bp in DT as compared to 1,119 bp in zebrafish).
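Intron sizes of this kind can be derived from the exon coordinates of each annotated transcript, since an intron is the gap between two consecutive exons. The following is a minimal sketch of that derivation and of the median calculation, assuming 1-based inclusive coordinates as in GFF3; the transcript names and coordinates are hypothetical, and this is not the custom MATLAB code deposited on figshare.

```python
from statistics import median


def intron_lengths(exons):
    """Return intron lengths for one transcript.

    `exons` is a list of (start, end) pairs in 1-based inclusive
    coordinates; introns are the gaps between consecutive exons.
    """
    ordered = sorted(exons)
    lengths = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        gap = next_start - prev_end - 1
        if gap > 0:  # ignore abutting or overlapping exon records
            lengths.append(gap)
    return lengths


# Hypothetical transcripts, for illustration only.
toy_transcripts = {
    "tx1": [(100, 200), (301, 400), (1001, 1100)],  # introns of 100 and 600 bp
    "tx2": [(181, 250), (50, 80)],                  # one intron of 100 bp
}
all_introns = [
    length
    for exons in toy_transcripts.values()
    for length in intron_lengths(exons)
]
print(median(all_introns))  # 100
```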
To investigate the difference in intron sizes at the transcript level, we compared average intron sizes for orthologous protein-coding transcripts in DT and zebrafish. We identified orthologs in the DT and zebrafish protein databases with the help of the conditional reciprocal best BLAST hit algorithm (CRB-BLAST) 37 . In total, we identified 19,192 unique orthologous protein pairs. For the 16,751 of those orthologs with complete protein-coding transcript exon annotation in both fish, we calculated their respective average intron lengths (Fig. 4b). The distribution was again skewed towards long zebrafish introns in comparison to DT. As an example, Fig. 4c shows the dnmt1 locus for the zebrafish and DT orthologs.
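The core idea behind reciprocal-best-hit orthology can be sketched as follows. Note that this is the plain reciprocal best hit criterion, a simplification: CRB-BLAST additionally recovers further hits using a fitted e-value cutoff, which is not reproduced here. The hit tuples and protein identifiers are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    `hits` is an iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical search results, for illustration only.
dt_vs_zf = [
    ("dt1", "zf1", 250.0),
    ("dt1", "zf2", 90.0),
    ("dt2", "zf2", 120.0),
    ("dt3", "zf1", 200.0),  # zf1 prefers dt1, so dt3 has no reciprocal partner
]
zf_vs_dt = [
    ("zf1", "dt1", 245.0),
    ("zf1", "dt3", 190.0),
    ("zf2", "dt2", 118.0),
]
pairs = reciprocal_best_hits(dt_vs_zf, zf_vs_dt)
print(pairs)  # [('dt1', 'zf1'), ('dt2', 'zf2')]
```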

Code Availability
Software used for read preprocessing, genome and transcriptome assembly and annotation is described in the Methods section together with the versions used. Custom MATLAB code used for orthology analysis is deposited on figshare 31 .