Genome sequences of Tropheus moorii and Petrochromis trewavasae, two eco-morphologically divergent cichlid fishes endemic to Lake Tanganyika

With more than 1000 species, East African cichlid fishes represent the fastest and most species-rich vertebrate radiation known, providing an ideal model to tackle molecular mechanisms underlying recurrent adaptive diversification. We add high-quality genome reconstructions for two phylogenetic key species of a lineage that diverged about ~ 3–9 million years ago (mya), representing the earliest split of the so-called modern haplochromines that seeded additional radiations such as those in Lake Malawi and Victoria. Along with the annotated genomes we analysed discriminating genomic features of the study species, each representing an extreme trophic morphology, one being an algae browser and the other an algae grazer. The genomes of Tropheus moorii (TM) and Petrochromis trewavasae (PT) comprise 911 and 918 Mbp with 40,300 and 39,600 predicted genes, respectively. Our DNA sequence data are based on 5 and 6 individuals of TM and PT, and the transcriptomic sequences of one individual per species and sex, respectively. Concerning variation, on average we observed 1 variant per 220 bp (interspecific), and 1 variant per 2540 bp (PT vs PT)/1561 bp (TM vs TM) (intraspecific). GO enrichment analysis of gene regions affected by variants revealed several candidates which may influence phenotype modifications related to facial and jaw morphology, such as genes belonging to the Hedgehog pathway (SHH, SMO, WNT9A) and the BMP and GLI families.

. Time calibrated phylogeny of cichlid species with publicly available complete genomes. The phylogeny is based on 519 anchored genome loci either sequenced or extracted from the genome sequence 6 . The study species are highlighted in bold, species with annotated high quality genomes are in black and species with high quality assemblies (but not yet annotated genomes) are indicated in grey (Figure created with Inkscape https ://inksc ape.org). In the following, we compare our results to published genomes and annotations of several cichlid fish with emphasis on O. niloticus and M. zebra due to their well-developed state. The latest versions (v4) of O. niloticus (44 × PacBio, newly anchored) and M. zebra (now 65 × PacBio and anchored) were published by Conte et al. 27 ; the tendency with respect to earlier versions is clear, qualities of sequences and annotations are improved and the numbers of annotated structures were further increased. With respect to the gene length distributions (Supplementary Table S1), the contiguity measures achieved for PT and TM are satisfying and fall in the typical range, given the applied sequencing technologies and coverage (Table 1; for a comparison with O. niloticus versions  see Supplementary Table S2, and for a general comparison with published fish genomes see Supplementary  Table S23 of Vij et al. 31 ).
Annotations. Structural annotation yielded ~ 40,300 (PT) and 39,600 (TM) genes and ~ 54,200 (PT) and 56,800 (TM) transcripts, respectively ( Table 2); this is in line with the results of different annotation versions of ON (~ 30,200 to 42,600 genes). As to annotated features, PT and TM show similar numbers which often lie between those of version 2 and 3 of the respective ON annotations. For comparison, statistics for ON v2-v4 (the latest) are added, as ON received the most community effort and data for genome assembly and annotation of all cichlids (Supplementary Table S2). Prediction of long non-coding RNAs yielded 2782 and 2112 lncRNAs for PT and TM, respectively. With 57.7% and 63.2% a slight preference for the sense strand could be observed (Supplementary Table S3). Homology based functional annotation could be made for 41,970 (PT) and 43,918 (TM) of the coding sequences (CDSs); putative secretory signals were predicted for 5899 (PT) and 6016 (TM) of them, respectively (Table 3). Pfam domain mapping yielded 78,900 (PT) and 84,158 (TM) hits, respectively. RepeatMasker 27 identified 31.1% (PT) and 30.0% (TM) of the genomes as repetitive, respectively; the largest proportions of classified repeat types were held by DNA transposons, LINEs and LTR transposons with ~ 13%, ~ 7% and ~ 2% (Table 4).
Data availability and visualization. The genome and transcriptome assemblies (FASTA), the structural and functional annotations (GFF3), read mappings (BAM) and additional Integrative Genomics Viewer (IGV) 33 track files (short and long non-coding RNAs, repeats, ORFs, CpG islands, microsatellites, IPR and eggNOG domains, variant calls, read mappings, alternative splicing, and REAPR error calls; Fig. 2) are available at https ://cichl idgen omes.tugra z.at. Quality evaluation. Assembly quality was assessed with BUSCO 34 and CEGMA 35 . BUSCO identified 98.3% and 98% of the 4584 proteins in the Actinopterygii database in complete form for PT and TM, respectively; 1.7% and 2% of the benchmarking universal single-copy orthologs (BUSCOs) were either fragmented or missing. These results compare well with those of published genomes and are generally on a par with those of the later versions of the O. niloticus genome drafts (Table 5). CEGMA identified all of the 248 core eukaryotic genes (CEGs) for both PT and TM (Table 6); CEGMA results for PT and TM transcriptome assemblies can be found in Supplementary Table S6. However, REAPR reports 17,166/11,992 (PT/TM) likely assembly errors (Supplementary Table S10); there are IGV tracks highlighting questionable regions to guide caution when analyzing in the vicinity (see Fig. 2). Completeness of conserved protein domains was assessed with DOGMA 36 . DOGMA found 91.8% and 90.5% of the 1051 expected conserved domains at a conserved domain arrangement size of 1 for PT and TM, respectively (Table 7).
Comparative analysis. We compared the genomes of PT and TM by mapping the raw reads of one species to the genome of the other species. This yielded 4,105,604 and 4,178,777 small variants (SMV; SNPs and InDels) for PT and TM, respectitvely. Furthermore, 356,428 and 577,124 SMVs were identified for PT and TM, when mapping the reads of the same species to the respective genomes. On average 1 variant per ~ 220 bp (interspecies), and 1 variant per 2540 bp (PT vs PT)/1561 bp (TM vs TM) (intraspecies) has been called (Table 8). For the two species, 93,842 and 89,489 large structural variants (SV; insertions, deletions, duplications, inversions and translocations) between species were detected, the majority being deletions with 60% and 65.6%, respectively ( Table 8).
The distribution of SMV and SV observed by the comparative analysis largely follows the genome coverage of particular structural/functional regions. There are small, but noticable devations: (1) SMV are (slightly) underrepresented in promoter, 5′ UTR, coding, splice site, 3′ UTR and intergenic regions; they are overrepresented in introns; (2) SV are (slightly) underrepresented in promoter, 5′ UTR, coding, 3′ UTR and intergenic regions; they are overrepresented in introns and splice sites (via overlap) (Fig. 3A).
SNPeff 37 categorises variant effects on the gene into four groups based on the location and nature of the variant: 'HIGH' , 'MODERATE ' ,'LOW' , or 'MODIFIER' , where the latter denotes non-coding variants or variants affecting non-coding genes, where predictions are difficult or there is no evidence of impact. In our analysis, more than 97% of the identified variants are classified as 'MODIFIER' ( Table 9).
Genes of interest, which are possibly affected by mutations, are highlighted in Table 10 (see Supplementary  Information for details on the gene selection and GO enrichment analysis, respectively). Here, we started to analyze genes which are related to the development of the viscerocranium (untargeted gene selection) and the pharyngeal system (targeted, i.e., biased gene selection based on literature-see Table S17); however, there are other GO terms of interest which are consistently enriched over different analysis approaches such as BMP signaling, for instance. A condensed GO analysis result for an untargeted approach (A2, see Table S14b) is shown in Table 11; here, gene categories are based on variant comparison groups (within and between species groups) combined with quantile ranking and thresholding (p = 0.5, i.e., median), and variant counts ('mutation loads') were used as criterion. The term 'embryonic viscerocranium morphogenesis' is enriched in the within and the  Table S16; genes belonging to this term were combined with genes from the targeted approach and used for further downstream analyses (see Supplementary  Table S14a, Table S14b, Table S15, Table S16, Table S18 and Table S21). In the comparative analysis, biological species are coded as A (PT) and B (TM) ( Table 11). The categories (AA, AB, BA and BB) refer to the within and between group comparisons. That is, there are mutations at the same genomic locations (nucleotides) which are either identical within and between species (referred to as, e.g., identical (AA) and redundantly identical (AB)) or nonidentical (referred to as, e.g., nonidentical (BB) and nonidentical (BA)); moreover, there are mutations which are unique to a group (referred to as, e.g., unique (AA) and unique (AB), i.e., at the genomic location there is only a variant in species A (unique (AA)) or there is only a variant between species A and B (unique (AB)), respectively). In the shown example, for SMV the calls for viscerocranium morphogenesis are symmetric except for the AB category (which fell below the threshold), i.e., the GO term is consistently enriched within and between species. Further analyses on the genes belonging to the term clearly verify the presence of shared and species-specific mutations in these genes (see example in Supplementary section Identification of genes putatively related to facial and jaw morphology). Hence, there is substantial variation in these genes which may drive changes in the manifestation of morphology. However, we cannot yet delineate possible effects from shared and non-shared variants. Besides variants in the DNA structure, alternative splicing (AS) was analyzed. There are ~ 6200 AS events in ~ 2600 genes between sexes of each species and ~ 39,000 AS events in ~ 9400 genes between the two species (see Supplementary Table S13). Table 3. Functional annotation statistics: The number of proteins found in UniProt and NR are given. Furthermore, the table contains the number of proteins with putative protease (Merops) and carbohydrate activity (CAZymes), the number of orthologs in fiNOG, the number of proteins matching the BUSCO vertrebrate models and the number of proteins with putative secretory signals (SignalP). Finally, the number of hits of the protein sequences for the various InterPro domain databases are presented.

Discussion
Assembly and annotation. Meta-assembly of a set of primary assemblies yielded high quality genome drafts with 918 and 911 Mbp for Petrochromis trewavasae and Tropheus moori, respectively. This is in line with the sizes between 900 and 1000 Mbp reported for other cichlid genomes 21,30 and with the ~ 940 Mbp estimated by our assembly validation with REAPR 38 . The latest Oreochromis niloticus assembly spans ~ 1 Gbp;-this variation in genome size may be due to biological differences but could also indicate that some portions of the repetitive DNA in the respective genome reconstruction (PT/TM) are not included in the assembly as a consequence of sequence collapse (e.g., collapsed repeats) 39 .
Typically, assemblies are based on genomic data from a single individual, which ideally stems from an inbred line. In this project, we assembled from 6 (PT) and 5 (TM) non-inbred individuals, respectively; this called for a more complex assembly approach. Furthermore, we performed de novo sequencing without any linkage or optical map data (as seen in the latest genome drafts of O. niloticus, and A. calliptera) and the PacBio coverage was (with ~ 9-10 ×) considerably lower than that used for the assemblies of O. niloticus (44 × for v3 30 and v4 27 ) and M. zebra (16.5 × for v3 40 , added on top of already high Illumina PE and MP coverages, and 65 × for v4 27 ). Still, both assemblies compare well with published genome drafts of comparable species with respect to typical metrics regarding gene content. BUSCO results (Table 5) show a low rate of ~ 4-8% (depending on database) of duplicated BUSCOs in PT and TM. This is slightly higher than the ~ 2-4% reported for other cichlid genomes, which may be a consequence of incorrect assembly of haplotypes from the non-inbred individuals. With respect to total BUSCOs identified, fragmented and missing BUSCOs, both the PT and TM genome reconstruction perform very well.
For both species, the annotations proved valid for first sensible downstream analyses, but certainly there are some gene models which may need further improvement, e.g., by repeated training of gene predictors; however, for AUGUSTUS, the central predictor, model evaluations already show good training states (see Supplementary  Table S12). The most relevant sources of insufficient gene models might include gene fusions, splits and especially truncations, which are obvious under closer inspection-this is typical for early annotations, especially when the annotation pipeline is still under development. We observe a relatively low mean and median length of protein sequences (see Table 2) in both assemblies/annotations. This may reflect a systematic error in the generation process, e.g., InDels leading to frame shifts and, hence, wrong translations and premature stop codons. Investigation of this phenomenon showed non-triplet InDels; however, these are also found between, e.g., O. niloticus and M. zebra transcript models. Moreover, the rate of identified nonsense mutations in PT and TM is low ( Table 9). The NCBI and Ensembl annotation pipelines are state-of-the-art; additionally, the amount and diversity of RNAseq data used for the annotation of, e.g., O. niloticus was much larger than was the case for either species in this project. Hence, the larger number of identified transcript isoforms (as well as the higher average numbers of exons per transcript) may be seen as straight-forward consequences. However, the total number of exons in both species is on a par with the ON annotations. Interestingly, the number of gene models in PT and TM are also on a par with ON v3. As there is no well-established method to score the correctness of gene models (perhaps by a general structure check and a database-based similarity majority scoring), this is merely a comparison of  www.nature.com/scientificreports/ numbers of elements, though. Moreover, there are, as mentioned, some gene fusions and splits in the PT and TM gene model sets, which will distort the gene count to some degree. As another quality measure for the annotated protein-coding genes, DOGMA 36 and PfamScan 41 were used; the results support the notion of bad gene models in the set, which do not contain certain protein domains or only fragments thereof (Table 7).

Comparative analysis.
We picked the two study species for the following reasons. Tropheus moorii is a highly successful algae browser found in large numbers in all types of rocky shore, while Petrochromis trewavasae is an algae grazer distributed at rocky shores on the western side of the lake, living in sympatry with Tropheus. The Tropheini comprise 3 predatory species, one omnivore, 10 algae browsers and 15 algae grazers. Algae grazers have chisel-like teeth to bite off filamentous algae from the rocky substrate, while algae grazers have comb-   www.nature.com/scientificreports/ like teeth in multiple rows to comb off unicellular algae and detritus from the rocks. Due to the old age of the tribe Tropheini, amounting to about 2-6.5 myr for the onset of their radiation 6 , the degree of eco-morphological divergence is greater than in the much younger eco-morphological equivalents in Lake Victoria, but comparable with the eco-morphospace covered by the entire Lake Malawi flock. Interestingly, the genus Tropheus comprises about 120 mostly allopatric and in terms of color distinct populations and sister species that are morphologically similar. They all remained in the same trophic niche at all rocky shorelines throughout the lake. Petrochtomis trewavasae does not show much color variation, has a restricted distribution at the southwestern shoreline of the lake and is a member of a complex and morphologically distinct grazer lineage including the much more diverse P. polyodon species complex. When considering the entire lineage, it underwent a similar evolutionary trajectory as Tropheus. It should be noted here that the generally much lower species number in Lake Tanganyika when compared to Lakes Malawi and Victoria also results from the different species concepts employed, in that several allopatric entities are treated as species in Lakes Victoria and Malawi, whereas as geographical varieties in the older Lake Tanganyika radiation. The comparative analysis presented here yielded, as expected, a large number of variant regions between the two species and even a considerable amount within each species. The large amount of variation at the intraspecific level may in fact be owed to our approach of using several non-inbred F1 individuals of a single population sampled in the natural environment, but better reflects intra-population diversity and ultimately the old evolutionary age of the lineage. We used GATK 42 and DELLY 43 , two well established tools, for variant calling; however, the calling of variants is still not a well solved problem with often little overlap between results of different algorithmic routes (e.g., see 44,45 ). As to the reported statistics on variant effects, it is known that the state of the structural annotation and the used variant effect annotator strongly influence the results 46 . The analysis results presented here reflect the state of the genome reconstructions (v1).
The relatively large number of reciprocally sorted SV and SMV among the two study species is remarkable and might reflect the relative old divergence time among the two study species amounting to about 2.5-6 Mya for the two clades 6 . In fact, it is expected that structural mutations affecting coding information need more time to evolve than regulatory mutations. Thus, when comparing species from the much younger Lake Victoria and Malawi, one would not expect such a marked degree in reciprocally distinct coding variation. The SV and SMV can also be interpreted in the light of the flexible stem hypothesis 4,18 . The flexible stem of cichlid radiations is formed by ecologically and phenotypically flexible species adapted to seasonally unstable river habitats. Once they seed lacustrine radiations, they can rapidly accommodate empty niche space in this more stable environment due to their large scope of phenotypic plasticity 18 . Subsequently, the phenotypically plastic population is subdivided into alternative adaptive phenotypes and subsequently adaptive genetic factors are sorted during speciation to proceed further via genetic accommodation and genetic assimilation 47 . Phenotypic or developmental plasticity refers to the ability of a single genotype to produce multiple phenotypes under different environmental conditions. The flexible stem hypothesis postulates that plasticity in a population can influence the direction of evolution by exposing cryptic genetic variation to selection in a novel environment. Under this model, subsets Table 8. Overview on inter-and intraspecies variant analysis result: The numbers represent heterogeneity between species (for PT vs TM and TM vs PT) and heterogeneity within species (for PT vs PT and TM vs TM). These numbers may include the net effect of technical issues (e.g., with assembly, annotation, mapping and calling algorithms). 1 as determined by SNPeff 37 .  www.nature.com/scientificreports/ of an ancestral population exploit distinct ecological niches in a new habitat, such as different food types. Within a single generation, plasticity in anatomy may lead to a fitness increase, e.g., more efficient food capture or processing, in each niche. Newly exposed phenotypic variation will be targeted by selection, and if the new environment is stable, the plastic phenotypes may be canalized through genetic assimilation. The assumption is that the molecular mechanisms for the plastic response also underlie the evolution of key phenotypes, i.e., genetic variation in the same molecules/signaling pathways, which enable plasticity, is targeted by selection and fixed in order to canalize the phenotype. In a recent study, the role of hedgehog (Hh) signaling in the craniofacial plasticity in teleosts has been highlighted, demonstrating that Hh levels tune the sensitivity to mechanical signals related to foraging conditions-where adaptive morphological changes in immediately affected structures, e.g., the pharyngeal bones, may propagate morphological changes to other craniofacial structures 48 .
Variants have been called in virtually all gene regions. About 99% have at least one-under the applied parameter settings-possible variant in the gene body or 5 kb up/downstream (Fig. 3B). Genes with at least one mutation were subjected to Gene Ontology (GO) analysis to get hints on possible interesting functional groups affected by more variants-i.e., the number of variants (or 'mutation load') was used as pointer for the probability of effective changes. The rationale behind this approach was the assumption of correctness of the infinitesimal model or the omnigenic model 49 , respectively. One may expect that the observed phenotype shifts are not due to few high impact (usually coding region) variants but rather due to several 'lower impact' variants (in the used categories probably the 'modifier variants' which typically represent > 90% of the mutation load). Even if at this stage the relevance of the variation in the selected genes is not clear, all listed genes have multiple calls regarding SMV and SV (Fig. 3B) which may increase chances of effective influences on phenotypes. Given their assigned functions reported in other organisms (Table 10), however, these genes are well worth being probed. For instance, five genes being related to nose and chin shape definition (DCHS2, RUNX2, GLI3, PAX1 and EDAR) have recently been identified in a human GWAS study 50 ; several variants in all these genes have also been found between the two species. Additionally, PAX3, KCTD15 and TBX family members (TBX1 and TBX10, but not TBX15 as previously reported) are in the result set; these genes have been related to facial morphology in humans in two other recent GWAS studies 51,52 (Table 10). Particular focus of future downstream analyses should be on genes with stable differences in gene expression among the study species. As stated earlier, our focus lies on the differences in facial and pharyngeal shapes (see Supplementary Fig. S1). It is interesting that this simple method of unbiased variant counting ('mutation load') output the GO terms related to the morphogenesis of the viscerocranium reproducibly (see Supplementary Information), without giving a rather unspecific long list of GO terms. From the GO result follows the highlighting of several important signaling pathways: BMP signaling (e.g., bmp2, bmp4), Hedgehog (Hh) signaling (e.g., Shh, Gli family, Sec family, smo, med12, plcb3), endothelin signaling (e.g., edn1, furin, dlx family), retinoic acid (RA) signaling (e.g., rere, rerea), and fibroblast growth factor (FGF) signaling (e.g., fgf8, fgf20b) (see Table 10 and Supplementary Table S21). All of these signaling networks are known to play roles in the regulation of vertebrate facial morphogenesis, and they interact. There are, for instance, strong co-operative and functional interactions between Shh and retinoic acid [53][54][55][56][57][58] . A more in-depth comparative analysis of the observed gene variant distribution across the two species and its respective phenotypes was not carried out at this stage; this will be an important task for follow-up studies.
To summarize, the two new draft genomes add two monophyletic and eco-morphologically divergent key species that fill an important phylogenetic gap. Moreover, they represent the earliest offshoot of the so-called modern haplochromine cichlids, the most species-rich lineage of East African cichlids. While the Tropheini radiated within the confines of Lake Tanganyika, their allies spread over several rivers to seed additional radiations such as those in Lake Malawi and Victoria, where those reached comparable eco-morphological diversity.

Methods
Study species. The sampled specimens of T. moorii are F2 offspring of wild caught individuals from the Zambian section of the southwestern shore of Lake Tanganyika (08°38′ S 30°52′ E) near the village Nakaku, which were brought to the University of Graz in 2005. The P. trewavasae specimens used in this study are F1 Table 9. Overview on putative effects of intra-and interspecies variants: Shown are variant effect annotations as determined by SNPeff 37 . The numbers represent heterogeneity between species (for PT vs TM and TM vs PT) and heterogeneity within species (for PT vs PT and TM vs TM). These numbers may include the net effect of technical issues (e.g., with assembly, annotation, mapping and calling algorithms). SPARC is a cysteine-rich acidic matrix-associated protein which is required for the collagen in bone to become calcified, and it is also involved in extracellular matrix synthesis and promotion of changes to cell shape. SPARC is required for normal growth of zebrafish otoliths 141     www.nature.com/scientificreports/ offspring of wild fish also from the southwestern shore, but further northeast near the village Katete (08°20′S 30°30′E) and were obtained from an ornamental fish importer. Sequencing and laboratory procedures. We sequenced the genomic DNA extracted from the specimens above using several sequencing technologies: Illumina HiSeq paired-end 2 × 101 bp (300 bp and 600 bp fragment size), Illumina Nextera mate-pair 2 × 100 bp (1-6 kbp fragment size), 454 Life Sciences (~ 350 bp average read length; 8 and 20 kbps fragment size) and single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) (~ 8000-9000 bp average read length after correction). Laboratory-related methods (DNA extraction, library preparation and sequencing) have, in part, been previously described in the accompanying paper on the mitochondrial genomes 59 . In addition, we carried out two sequencing runs using second-generation Pacific Biosciences sequencing technology based upon one individual per species. DNA extraction was carried out in Graz, library preparation and sequencing at the Lausanne Genomic Technologies Facility: DNA was sheared in a Covaris g-TUBE (Covaris, Woburn, MA, USA) to obtain 20 kbp fragments. After shearing the DNA size distribution was checked on a Fragment Analyzer (Advanced Analytical Technologies, Ames, IA, USA). 5 μg of the sheared DNA was used to prepare one SMRTbell library with the PacBio SMRTbell Template Prep Kit 1 (Pacific Biosciences, Menlo Park, CA, USA) according to the manufacturer's recommendations. The resulting library was size selected on a BluePippin system (Sage Science, Inc.; Beverly, MA, USA) for molecules larger than 11 kbp. The recovered library was sequenced on thirteen/ sixteen (TM/PT) SMRT cells with P6/C4 chemistry and MagBeads on a PacBio RSII system (Pacific Biosciences, Menlo Park, CA, USA) at 240 min movie length.
For RNA-seq, total RNA from one male and female individual per species (pooled from the following tissues: liver, spleen, brain, heart and skeletal muscle) was extracted with Trizol as follows: tissue was homogenized with MagnaLyser and incubated with Trizol-tube 5 min at room temperature (RT); 200 µl Chloroform (/ml of Trizol) was added and shaken vigorously for 15 s, incubated for 2-3 min/RT and centrifuged at 12,200 rpm/4 °C/15 min; supernatant was transferred to a new 1.5 ml tube and 500 µl isopropanol (/ml of Trizol) were added; after vortexing, incubation for 10 min/RT, centrifugation at 12,200 rpm/4 °C/10 min supernatant was discarded and the pellet placed on ice immediately. The pellets were washed 2 times: add 1 ml EtOH 80% (-20 °C), centrifuge: full speed/4 °C/5 min discard supernatant and finally dried at 37 °C. Dried pellets were resuspended in 20 µl distilled water. RNA-seq libraries were derived from total RNA which was rRNA-depleted, normalized and sequenced on a single Illumina HiSeq 2500 lane per species.
General data (pre)processing. All pipelining and higher-level processing was done with R/Bioconductor, some minor pipelining in Bash and some workhorse functionality was written in C + + (called from R). For details on parameter settings for important steps/tools see Supplementary Table S22. FastQC v0.10.1 60 was used for basic read quality evaluation. A custom k-mer spectrum-based approach using JELLYFISH v2.0 61 (in conjunction with a database of known technical sequences) and a De Bruijn-based approach (implemented in Minion from the Kraken v13-274 package 62 ) were used for the automatic identification of technical contaminants and suspicious sequences (based on expected frequencies). In addition, FastQScreen v0.4.4 63 was utilized for the species-specific identification of biological contamination and DeconSeq v0.4.3 64 for its removal. Cutadapt v1.5 65 76 and Rcorrector v1.0.2 77 were used for RNA-seq and Musket v1.1 78 for DNA-seq base-call correction. DNA-seq and RNA-seq datasets were preprocessed using the same pipeline (with different parameter settings); in general, two filter regimes were applied to each data set ('stringent'/'standard' and 'relaxed') in preparation for different downstream use cases (see Supplementary Table S22). Genome sizes were estimated by a k-mer spectrum-based approach implemented in GCE v1.0.2 79 .
Genome assembly. From the perspective of the conducted meta-assembly, the algorithm implemented in MaSuRCA v2.3.2 75 (utilizes Celera Assembler v6.5 80 ) served as the core assembly procedure; all at this time available data sets (i.e., Illumina PE and MP, Illumina Nextera MP and 454 MP and SE) were used. Celera Assembler v8.3rc2 (CA) 80 was used for the 'PacBio only' assemblies. As several individuals per species (all non-inbred diploids) have been sequenced in this project, heterozygosity was a concern. Hence, assembly algorithms specifically designed to better handle divergence were incorporated into the reconstruction approach: Platanus v1.2.1 81 is a recent assembler tailored to more sensibly deal with heterozygosity issues in genomic data ( Genome annotation. Structural annotations were performed based on experimental data from mRNA-Seq datasets. Additionally, information was drawn from transcript and protein models from selected publicly available datasets (Danio rerio, H. burtoni, M. zebra, N. brichardi, O. niloticus, and P. nyererei) and from further models in UniProt|Swiss-Prot, nr/nt and UniRef90|teleost. Functional annotation was primarily conducted via BLAST-based comparisons against mentioned databases and via a host of databases coordinated by InterProScan 5 (see Table 2). Structural annotation of coding genes and tRNAs was generated using the pipelines MAKER v3.0 107 (utilising the gene finders GeneMark-ES v4.32 109  The mRNA training set for FEELnc was derived from the FA/MAKER annotation data, where presumed 'good' gene models with similar structure to previously published models were selected; the lncRNA training set was generated by shuffling of the mRNA sequences. Microsatellites were called with MISA v1.0 121 , CpG islands with EMBOSS v6.6.0 122 cpgplot and ORFs with EMBOSS v6.6.0 getorf (and R post-processing). Repeats were determined using RepeatMasker v4.0.6 32 (with RepBase v20160321 123 and species-specific libraries generated with RepeatModeler v1.0.8 124 ), RepeatScout v1.0.5 125 and TRF v406 126 .
Functional annotation was conducted using InterProScan v5.24-63.0 127 (utilizing the databases CDD-3.14, Coils-2. Comparative analysis-small (SMV) and structural variant (SV) calling-variant effect prediction. The Genome Analysis Toolkit (GATK) v3.7 was used for local realignment of reads and the detection and filtering of SNP/InDel variants (referred to as small variant/s, SMV) 42 as recommended by the GATK documentation; the HaplotypeCaller was applied-with a minimum score for variant emission of 10, for calling of 30, and a minimum pruning of 10. SMV with a quality score ≥ 30 were included in further analyses. DELLY v0.7.7 43 was applied to call structural variant/s (SV, insertions, deletions, duplications, inversions and translocations) with an insert size cut-off of 3 (for deletions) and a minimum paired-end mapping quality of 20