The survey and reference assisted assembly of the Octopus vulgaris genome

The common octopus, Octopus vulgaris, is an active marine predator known for the richness and plasticity of its behavioral repertoire, and remarkable learning and memory capabilities. Octopus and other coleoid cephalopods, cuttlefish and squid, possess the largest nervous system among invertebrates, both for cell counts and body to brain size. O. vulgaris has been at the center of a long-tradition of research into diverse aspects of its biology. To leverage research in this iconic species, we generated 270 Gb of genomic sequencing data, complementing those available for the only other sequenced congeneric octopus, Octopus bimaculoides. We show that both genomes are similar in size, but display different levels of heterozygosity and repeats. Our data give a first quantitative glimpse into the rate of coding and non-coding regions and support the view that hundreds of novel genes may have arisen independently despite the close phylogenetic distance. We furthermore describe a reference-guided assembly and an open genomic resource (CephRes-gdatabase), opening new avenues in the study of genomic novelties in cephalopods and their biology.


Methods
Genomic DNa preparation. An adult male belonging to the species O. vulgaris Cuvier, 1797 (450 g body weight) was caught by fishermen from the Bay of Naples in 2011 1,2 and immediately humanely-killed 63,64 . Given the high rate of heterozygosity in marine organisms 65,66 , tissue from a single individual was used to extract the genomic DNA (to avoid contamination, spermatophores were used). Spermatophores in octopus are stored within the Needham's sac, structure that was dissected following Chapko and coworkers 67 . Tissue (124 mg) was used to extract the genomic DNA following the recommended phenol-chloroform extraction protocol by the Beijing Genomics Institute (BGI)-Shenzhen. Briefly, tissue lysis occurred overnight at 56 °C after adding 3.0 ml of lysis buffer containing proteinase K (300 μg; Sigma-Aldrich, Saint Louis, Missouri, United States) and RNase A (100 μg; Sigma-Aldrich, Saint Louis, Missouri, United States). DNA was then extracted with phenol (2X), phenol:chloroform, chloroform and was subsequently precipitated. Genomic DNA was dissolved in TE buffer to reach a final concentration of 1 μg/μl. Genome sequencing and quality control. A total of four genomic DNA libraries (with different insert sizes: 170, 250, 500 and 800 bp) were constructed following the Illumina library preparation protocols. Briefly, to construct the paired-end libraries DNA was fragmented by Adaptive Focused Acoustics technology (Covaris) and tested via gel-electrophotometry, the fragmented DNA combined with End Repair Mix (20 °C for 30 min). After purification, DNA ends were blunted and an A base was added to the 3′ ends. DNA adaptors with a single T-base 3′-end overhang were ligated to the above products. Ligation products were purified on 2% agarose gels to recover the target fragments and were purified from the gels (Qiagen Gel Extraction kit, 28704). Several rounds of PCR amplification with PCR Primer Cocktail and PCR Master Mix were performed to enrich the Adapter-ligated DNA fragments. Then the PCR products selected by running another 2% agarose gel to recover the target fragments and the gel purified (QIAquick Gel Extraction kit, QUIAGEN). The final library was quantified by assessing the average molecule length (Agilent 2100 Bioanalyzer), and by Real-Time qRT-PCR. A total of 277 Gb of raw data were generated by Illumina Hiseq 2000 at BGI.
All libraries were sequenced in a paired-end mode with read lengths of 100 bp or 150 bp. Reads were filtered and trimmed (100 bp to 95 bp, 150 bp to 145 bp) using SOAPnuke software (https://github.com/BGI-flexlab/ SOAPnuke) 68 which yielded 250 Gb of data. Low-quality reads, reads with adaptor sequences and duplicated reads were filtered, and if the quality of bases at the head or tail of the reads was low, we directly trimmed them from 100 bp to 95 bp (PE100) or form 150 bp to 145 bp (PE150). The remaining high-quality data were used in the further analysis. SGA PreQC v0.10.14 69 modules were run per library and on the combined libraries to estimate various genome parameters (Table 1 and Table 2). www.nature.com/scientificdata www.nature.com/scientificdata/ Draft genome assembly. We applied Assembly By Short Sequencing 2.0.2 (ABySS 70,71 ) for both k-mer sizes that were suggested by SGA PreQC. The quality of assemblies (ABySS kmer41 and ABySS kmer81) was evaluated by QUAST 4. 3 72 . A summary of various statistics is shown in Table 3. Based on the QUAST analysis the optimal kmer size for the ABySS assembly was estimated to be 81. Since a higher heterozygosity rate of the genome was predicted based on these initial results, the Redundans 0.13 c 73 tool was used to reduce the number of ABySS contigs from the initial assemblies. Redundans reduces contigs by removing highly similar contigs. These highly similar contigs are originally the different alleles of the same genomic position, but are too different for the De Brujin graph method to be assembled into the same contig (too much variation inside one kmer). Redundans collapses and scaffolds these reduced contigs into single genomic locations. Redundans reduced the number of scaffolds of the draft genome over seven (7) times, while improving assembly statistics (see Table 3).

Reference Assisted Scaffolding.
Given the availability of a relatively good reference genome of a related species (O. bimaculoides) 43 , a reference assisted scaffolding tool was used to optimize the genome. The reduced scaffolds were aligned to the O. bimaculoides genome using blastn 74 of the blast+ toolkit 2.8.0-alpha. These alignments were used by chromosomer 0.1.3 (https://github.com/gtamazian/Chromosomer) to scaffold the reduced scaffolds according to the given genome. assessment of draft genomes. An assessment of the draft genomes (ABySS, Redundans and chromosomer) was performed by looking for the highly conserved genes using BUSCO 3.0. 2 75 . The Metazoa odb9 database was used, supplying 978 orthologs. The number of complete orthologs increased with each improvement of the assembly (Table 3), confirming the gain in assembly quality of the final chromosomer version. The final genome build has over 50% complete BUSCOs, and 10% fragmented BUSCOs (orthologs found, but scattered over multiple scaffolds).

Data Records
The draft genome(s) of O. vulgaris as shown in Table 3 has been made publicly available on the genome browser and data repository of the Association for Cephalopod Research that initiated this work (http://www.cephalopodresearch.org/ceph_gdatab/) in collaboration with the Department of Molecular Evolution and Development, University of Vienna. This web resource is based on the browser originally designed by University of California, Santa Cruz (UCSC) 76 and will be maintained and curated to keep track of all present and upcoming octopus genomes. It includes comparative genomics tracks such as read mapping and whole genome alignment between the two octopus species. Raw reads have also been deposited to the NCBI SRA 77 . The reference-guided assembly has been deposited at GenBank 78 and its original version is also provided in the associated FigShare record (chromosomer.fa) together with its annotation (gene_models.chromosomer.gff), and other assemblies listed in Table 3 (Octopus vulgaris genome assemblies 79 . Table 2 and Table 3 summarize statistics about O. vulgaris genome as deduced from our current sequencing data and Fig. 1 shows the kmer (17mer) distribution determining the overall sequencing depth (Table 1 and 2).

Quality control DNa library.
To assess the quality of Illumina reads FastQC (www.bioinformatics.babraham.ac.uk/projects/fastqc) was performed on all raw data. Trimmomatic v0. 36 80 was was not able to identify any significant adaptor sequence contamination within the raw data. The data were mapped to the PhiX control library (Illumina, Inc) using Bowtie2 v2.3.4 81 and no matches were found. Sequencing depth assessment. We used jellyfish 2.2. 10 82 on the raw read data using kmer size of 17 bp.
This resulted in a depth of sequencing histogram (Fig. 1) showing sequencing depth peak of around 76x. Using the kmer depth curve and the cumulative read depth (Fig. 1), repetitiveness, and heterozygosity was conducted independent of the genome assemblies (see Tables 2 and 3). The genome was estimated to be around 2.4 Gb in length with a relatively high heterozygosity rate (>1.1%) and large repetitiveness (>50%).

Genome properties and future steps
To gain information on the genetic distance between the two closely related species O. vulgaris and O. bimaculoides, we mapped all the available raw sequence data from O. vulgaris against the genome of O. bimaculoides 83 and found that 74-84% of the data aligned, but that a high percentage (20-50%) was able to align multiple times. The significant proportion of multiple mapping reads suggests that, similar to the O. bimaculoides genome, O. vulgaris genome has a large number (at least 50%) of repetitive elements, confirmed by the cumulative read depth analysis  www.nature.com/scientificdata www.nature.com/scientificdata/ (Fig. 1). Ab initio repeat analysis using dnaPipeTE 84 revealed similar classes of octopus specific short interspersed nuclear elements (SINE) to be over-represented (Fig. 2), yet the proportions were strikingly different, despite the close phylogenetic distance. This indicates high activity of repetitive elements in the common octopus genome.
Profiling O. bimaculoides regions with read coverage from O. vulgaris, we found that 23,509 O. bimaculoides genes were covered at 90% or more of their coding sequence length by O. vulgaris reads (Fig. 3). Approximately 50% of those genes had a Pfam annotation, including gene families previously reported to have undergone major expansions in the O. bimaculoides genome, such as zinc fingers and protocadherins. This is in strong contrast to only 1,570 O. bimaculoides genes with no O. vulgaris read coverage, with just 14% of those having a Pfam annotation. Those candidates represent very recent novel or highly diverged genes and their number indicates  Only the longest scoring alignment between any given pair of two scaffolds or contigs was considered. Red: percentage nucleotide identity between Callistoctopus minor to Octopus bimaculoides. Blue: percentage nucleotide identity between Octopus vulgaris to O. bimaculoides.
www.nature.com/scientificdata www.nature.com/scientificdata/ a relatively high rate of novel gene formation in octopus genomes. To investigate non-coding evolution among cephalopods, we furthermore compared the mapping rates to non-repetitive non-coding regions of 100 bp and longer. Again, we found the majority of those loci are covered at 90% length or higher. However, the relative proportion of O. bimaculoides regions not covered by any reads was higher than for the genes, indicating a higher turnover rate for the non-coding, potentially regulatory, sequences (Fig. 3).
To evaluate the completeness of our assemblies, raw reads were mapped using Bowtie2 v2.3.4 against both ABySS kmer81 and kmer41 assemblies. For ABySS kmer 41, at least 99.94% of all the reads were mapped while the percentage of uniquely mapped reads was only around 33-50%. For the ABySS kmer81 assembly, percentages were at least 98% and between 31 and 57%, respectively.
We used our assemblies to estimate whole-genome divergences between the available octopod genomes. Mapping of the scaffolds of 10 kb and longer against the O. bimaculoides genome using MEGABLAST resulted in the overall sequence similarity of 92.4% in the aligned regions of 1 kb and above (Fig. 4). This divergence of around 8% between the two species is higher than the estimated heterozygosity rate of 1.1% in O. vulgaris and lower than the divergence between O. bimaculoides 83 and the recently released data of C. minor (82.4% similarity) (Fig. 4, and ref. 85 ) from a different genus, providing for the first whole-genome divergence estimates within this clade.
Our assemblies confirm that abundant repeat regions make it difficult to improve the genome based on the currently available sequence data. Future steps will include long read sequencing technology such as proximity-ligation based assemblies (e.g., Dovetail, PhaseGenomics) or longer read technologies (e.g., PacBio) to optimize the current assemblies.