Draft genome assembly and transcriptome data of the icefish Chionodraco myersi reveal the key role of mitochondria for a life without hemoglobin at subzero temperatures

Antarctic fish belonging to Notothenioidei represent an extraordinary example of radiation in the cold. In addition to the absence of hemoglobin, icefish show a number of other striking peculiarities including large-diameter blood vessels, high vascular densities, mitochondria-rich muscle cells, and unusual mitochondrial architecture. In order to investigate the bases of icefish adaptation to the extreme Southern Ocean conditions we sequenced the complete genome of the icefish Chionodraco myersi. Comparative analyses of the icefish genome with those of other teleost species, including two additional white-blooded and five red-blooded notothenioids, provided a new perspective on the evolutionary loss of globin genes. Muscle transcriptome comparative analyses against red-blooded notothenioids as well as temperate fish revealed the peculiar regulation of genes involved in mitochondrial function in icefish. Gene duplication and promoter sequence divergence were identified as genome-wide patterns that likely contributed to the broad transcriptional program underlying the unique features of icefish mitochondria.


RNA-seq libraries preparation and sequencing
Total RNA was extracted from spleen, kidney, liver, brain, and skeletal muscle using the RNAeasy  Table 8.

Reads quality analysis and filtering
For both species, the quality of the raw Illumina reads was assessed with FastQC v0.11.6 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). Low-quality regions and adapters were then trimmed using Trimmomatic v0.36 5. To improve the quality of the C. myersi PacBio sequences, hybrid error correction was performed with LoRDEC v0.3 (Long Read DBG Error Correction) 6, using the Illumina short reads from the same individual as the reference set.

Chionodraco myersi
Reads obtained from the Illumina and PacBio RS II platforms for C. myersi were assembled with a hybrid strategy using the MaSuRCA v3.2.8 genome assembler 7. Scaffolds shorter than 500 bp were removed, as they are of limited use and are probably artifacts.
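The length filter described above can be illustrated with a minimal sketch. This is not the authors' script: the function names `parse_fasta` and `filter_scaffolds` are hypothetical, and the only value taken from the text is the 500 bp threshold.

```python
from typing import Iterable, Iterator, List, Tuple

MIN_SCAFFOLD_LEN = 500  # threshold stated in the text (bp)


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_scaffolds(
    records: Iterable[Tuple[str, str]], min_len: int = MIN_SCAFFOLD_LEN
) -> List[Tuple[str, str]]:
    """Keep scaffolds of at least min_len bp; shorter ones are dropped."""
    return [(h, s) for h, s in records if len(s) >= min_len]
```

In practice the input would be the assembler's scaffold FASTA file, read line by line; here any iterable of strings is accepted so that the sketch stays self-contained.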

Chionodraco hamatus
Global assembly of the Illumina reads obtained for C. hamatus was accomplished with the software

Genomes quality assessment
To obtain genome assembly statistics, the Assemblathon2 script was used 8  well-conserved genes, was employed to investigate the completeness of the assembly.
To estimate the size, repeat content, and heterozygosity of the genomes, a K-mer analysis (K=21) was conducted using GenomeScope 10 with the DNA PE libraries. myersi) of genome masked. This difference in the percentage of masked sequence is probably due to a difference in genome assembly quality: the use of PacBio long reads in the C. myersi assembly likely helped to resolve repetitive elements.
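Genome profiling of this kind starts from a K-mer count histogram, i.e. the number of distinct K-mers observed at each multiplicity. The sketch below shows how such a histogram could be built for K=21; it is a toy illustration, not the pipeline used in the study. Counting canonical K-mers (a K-mer and its reverse complement treated as one) and skipping K-mers that contain ambiguous bases are assumptions, and the function names are hypothetical. Real data sets require a dedicated K-mer counter.

```python
from collections import Counter
from typing import Dict, Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID = set("ACGT")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its
    reverse complement, so both strands are counted together."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def kmer_histogram(reads: Iterable[str], k: int = 21) -> Dict[int, int]:
    """Map each multiplicity to the number of distinct canonical
    k-mers seen that many times across all reads."""
    counts: Counter = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= _VALID:  # skip k-mers containing N etc.
                counts[canonical(kmer)] += 1
    return dict(Counter(counts.values()))
```

In the resulting histogram, the position of the main peak reflects sequencing depth, which is what allows genome size and heterozygosity to be modeled from short reads alone.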

Gene prediction
Gene prediction was performed considering several sources of evidence: i) RNA-seq data; ii) nucleotide and protein alignments; iii) de novo gene training and prediction. A total of nine RNA-seq libraries from five tissues were used for gene prediction.
1) Genes predicted only by ab initio programs: these genes were retained only if they were confirmed by at least four different ab initio programs, were complete (with a start and a stop codon), and were longer than 300 base pairs.
2) Genes supported only by external evidence (e.g. proteins/RNA-seq): these had to be confirmed by at least two different lines of evidence, or by one line of external evidence and at least three different ab initio gene predictors.
3) Predicted genes with low ab initio support (those failing the filter described in step 1) were processed further. Genes supported by fewer than four ab initio programs were searched against a database of teleost protein sequences. The fish database was downloaded from the NCBI GenBank repository by selecting all the proteins belonging to the Teleostei class.
Proteins with a sequence coverage match higher than 70% and an e-value lower than 1E-20 were recovered. Genome annotation was based on similarity and experimental evidence from RNA-seq data produced in the present study (Supplementary Table 8). The total number of estimated protein-coding genes was 38,127 and 27,111 for C. myersi and C. hamatus, respectively.
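The three filtering rules used during gene prediction can be summarized as a single decision function. The sketch below is one possible reading of those rules, not the authors' implementation: the `GeneModel` fields and the `keep_gene` function are hypothetical, and how the rules combine (in particular, that the homology rescue of step 3 applies only to ab initio-only genes with fewer than four supporting programs) is an assumption. The thresholds themselves (four programs, 300 bp, 70% coverage, e-value 1E-20) come from the text.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    n_ab_initio: int                 # ab initio programs predicting the gene
    n_external: int                  # lines of external evidence (protein, RNA-seq)
    complete: bool                   # has both a start and a stop codon
    length_bp: int                   # gene length in base pairs
    best_hit_coverage: float = 0.0   # coverage of best teleost protein hit (0..1)
    best_hit_evalue: float = 1.0     # e-value of that hit


def keep_gene(g: GeneModel) -> bool:
    """Apply the three filtering rules to one predicted gene."""
    if g.n_external == 0:
        # Rule 1: ab initio only, needs >= 4 programs, completeness, > 300 bp.
        if g.n_ab_initio >= 4 and g.complete and g.length_bp > 300:
            return True
        # Rule 3: low ab initio support, rescued by a strong teleost protein hit.
        return (
            g.n_ab_initio < 4
            and g.best_hit_coverage > 0.70
            and g.best_hit_evalue < 1e-20
        )
    # Rule 2: two lines of external evidence, or one line plus >= 3 predictors.
    return g.n_external >= 2 or g.n_ab_initio >= 3
```

For example, `keep_gene(GeneModel(3, 0, True, 900, 0.8, 1e-30))` returns `True` under this reading, because the gene fails rule 1 but is recovered by the protein match of rule 3.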