Genome sequences of Tropheus moorii and Petrochromis trewavasae, two eco-morphologically divergent cichlid fishes endemic to Lake Tanganyika

Fischer, C.; Koblmüller, S.; Börger, C.; Michelitsch, G.; Trajanoski, S.; Schlötterer, C.; Guelly, C.; Thallinger, G. G.; Sturmbauer, C.

doi:10.1038/s41598-021-81030-z

Article
Open access
Published: 22 February 2021

Scientific Reports volume 11, Article number: 4309 (2021)

Abstract

With more than 1000 species, East African cichlid fishes represent the fastest and most species-rich vertebrate radiation known, providing an ideal model to tackle molecular mechanisms underlying recurrent adaptive diversification. We add high-quality genome reconstructions for two phylogenetic key species of a lineage that diverged ~ 3–9 million years ago (mya), representing the earliest split of the so-called modern haplochromines that seeded additional radiations such as those in Lakes Malawi and Victoria. Along with the annotated genomes we analysed discriminating genomic features of the study species, each representing an extreme trophic morphology, one being an algae browser and the other an algae grazer. The genomes of Tropheus moorii (TM) and Petrochromis trewavasae (PT) comprise 911 and 918 Mbp with ~ 39,600 and ~ 40,300 predicted genes, respectively. Our DNA sequence data are based on 5 and 6 individuals of TM and PT, respectively, and the transcriptomic sequences on one individual per species and sex. Concerning variation, on average we observed 1 variant per 220 bp (interspecific), and 1 variant per 2540 bp (PT vs PT)/1561 bp (TM vs TM) (intraspecific). GO enrichment analysis of gene regions affected by variants revealed several candidates which may influence phenotype modifications related to facial and jaw morphology, such as genes belonging to the Hedgehog pathway (SHH, SMO, WNT9A) and the BMP and GLI families.

Introduction

With 1727 described species¹, cichlid fishes are among the most species-rich teleost fish families. Their hotspot of biodiversity lies in East Africa, and in particular the three Great Lakes, Victoria, Malawi and Tanganyika². Despite a large degree of similarity pointing to recurrent evolution of eco-morphologically equivalent species³, the three cichlid radiations show important differences with respect to species numbers, evolutionary age of lineages, diversity of parental care patterns and the degree of morphological divergence^2,3,4. This is likely due to different sets of colonizing species and most importantly due to their different evolutionary age.

With an age of 9–12 million years (myr)^5,6, Lake Tanganyika is by far the oldest of these lakes. Due to its old age, the Lake Tanganyika species assemblage is at a mature stage, so that it comprises the largest genetic and phenotypic diversity among the East African cichlid radiations, but further diversification proceeds predominantly without much eco-morphological innovation². Upon colonization of the emerging lake, the cichlids took advantage of the window of ecological opportunity and rapidly diversified⁴. In fact, two colonizing lineages underwent hybridization at the very onset of the radiation, an event that might have triggered or boosted the start⁶. The Lake Tanganyika radiation holds a key position for the entire modern African cichlid fauna, in that three of the newly emerging lacustrine lineages managed to colonize surrounding rivers, so that the radiation repeatedly swept over the boundaries of the maturing lake^7,8,9,10. These three lineages, the non-mouthbrooding Lamprologini, the mouthbrooding Orthochromini and some early Haplochromini such as the ancestors of the genera Pseudocrenilabrus and Serranochromis, left the lake at various stages of lake maturation to colonize particular surrounding water bodies^7,8,9,11,12,13. One group of early haplochromines, the so-called modern haplochromines, continued to evolve in the lake-swamp-river interface towards more elaborate maternal mouthbrooders, demarcated by increased sexual dimorphism and eggspots on the anal fin^6,9. These modern haplochromines not only colonized most river systems all over southern and eastern Africa but also re-entered the Lake Tanganyika ecosystem, which by this time was already much deeper and more mature, to evolve into the endemic Lake Tanganyika tribe Tropheini^9,14. Thus, the Tropheini managed to break into an ongoing and already complex lacustrine radiation, while its non-lacustrine sisters spread across several river systems to seed radiations in emerging lakes along their routes of riverine dispersal^6,8,9,15,16.

The Lake Tanganyika-endemic tribe Tropheini represents the sister group of all modern haplochromines outside the lake and diverged from these ~ 3–9 mya⁶. That five out of the 29 species of the Tropheini occur both in the lake itself and upstream in tributary rivers and/or parts of the Lukuga River, the lake’s only outflow, might be owed to their swamp-river origin¹⁷. This is why we decided to sequence and compare the genomes of two ecologically divergent species of the endemic Lake Tanganyika tribe Tropheini. In terms of genetics, the modern haplochromines, including the Tropheini, are iconic as their generalist riverine-adapted genomes repeatedly underwent adaptive modifications upon ecological opportunity provided by newly emerging lakes⁴. It has been suggested that ecologically and phenotypically flexible species adapted to seasonally unstable river habitats can outcompete other colonizers in seeding lacustrine radiations, as they can rapidly accommodate empty niche space via phenotypic plasticity¹⁸. According to the flexible stem hypothesis, a phenotypically plastic population is subdivided into alternative adaptive phenotypes, and adaptive genetic factors are subsequently sorted during speciation to proceed further via genetic accommodation and genetic assimilation. In the course of adaptive divergence during repeated adaptive radiations, genomic evolution was likely shaped by ecological opportunity, in combination with geographic fragmentation events, episodes of bottlenecks and population expansions, as well as repeated admixture or fusion in hybridization events caused by climate-induced lake level fluctuations^4,19. Along with divergence and incidental gene flow^6,20, gene duplication and selection^6,21 events apparently reshaped the genotypes. On the phenotype level, the evolutionary success of East African cichlids has been attributed to particular key innovations including (1) the functional decoupling of oral and pharyngeal jaws facilitating the exploitation of diverse trophic niches²², (2) the adaptation of the visual system to different water turbidity²³, and (3) parental care and male mating coloration driven by sexual selection facilitating reproductive isolation²⁴. At this stage, the suite of genetic mechanisms modifying the genomic substrate underlying the enormous phenotypic eco-morphospace covered by cichlids remains largely unknown (see²⁵ for a recent review).

The first major steps towards understanding the molecular mechanisms behind those divergent morphologies were taken by elucidating the genomes and transcriptomes of five cichlid species: Oreochromis niloticus representing an outgroup lineage, Neolamprologus pulcher representing a Tanganyikan substrate brooder lineage, and three modern haplochromines, namely Astatotilapia burtoni representing a riverine lineage, Maylandia zebra representing Lake Malawi and Pundamilia nyererei representing Lake Victoria. This study found evidence for an excess of gene duplications in the East African lineage compared to Oreochromis and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence along with transposable element insertions, and regulation by novel microRNAs²¹. The study also revealed genome-wide diversifying selection on coding and regulatory variants, some of which recruited from ancient polymorphisms.

High quality (HQ) genome drafts based on Pacific Biosciences (PacBio) data became available especially in the last two years. HQ drafts of Simochromis diagramma (East Africa, Lake Tanganyika) and Astatotilapia calliptera (East Africa, Lake Malawi) were generated by the Sanger Institute (2018) and a HQ draft of Archocentrus centrarchus (Central America) was generated by the G10K-VGP group (2019); assemblies of the South American cichlids Amphilophus citrinellus (2014, University of Konstanz) and Andinoacara coeruleopunctatus (2015, Sanger Institute)²⁶ are also available. The O. niloticus (ON) and M. zebra (MZ) genomes have recently (2019) been newly assembled and anchored with a high-coverage PacBio + genetic map approach²⁷; the genomes of A. calliptera, A. centrarchus and S. diagramma (not anchored) were reconstructed similarly. Oreochromis niloticus, M. zebra, A. calliptera and A. centrarchus are the only reconstructions on chromosome level (linkage groups). Seven HQ drafts received annotations from the NCBI Annotation Pipeline²⁸ (S. diagramma not yet); O. niloticus, A. calliptera and A. citrinellus received annotations from Ensembl²⁹ as well. These genomes cover species from the Great Lakes and rivers in Africa and from crater lakes in Central and South America (Fig. 1).

We present reconstructions of the two genomes of Tropheus moorii (TM) and Petrochromis trewavasae (PT), as well as sets of structural and functional annotations. The two species belong to two sublineages within the Tropheini that diverged ~ 2–6.5 mya⁵ and represent the deepest split within the tribe Tropheini. Aside from generating the genomic and transcriptomic data basis for two key species representing the modern haplochromines in Lake Tanganyika, the main interest in this study was in the genetic origin of the divergent facial and jaw morphologies of these morphologically diverse species. To this end we provide first insights on genetic variants potentially being involved in the morphological differentiation of the two study species.

Results

Assemblies

Based on the estimated genome sizes of ~ 900 Mbp (Supplementary Table S11), our sequencing efforts yielded sequence data with an average base coverage of ~ 1.5×, ~ 88×, ~ 34× and ~ 10.5× (PT) and ~ 1.2×, ~ 38×, ~ 29× and ~ 9.1× (TM) for Roche 454, Illumina PE, Illumina MP and PacBio, respectively (see Supplementary Table S23). The filtered sequence data was used to generate primary assemblies derived from different reconstruction algorithms (assemblers) and data combinations (see Methods). The final genome reconstructions of the two species are based on meta-assemblies of these sets of primary assemblies. The meta-assemblies with the best scores based on misassemblies, contiguity and gene predictions were used in subsequent analyses.
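As a rough illustration of how such coverage figures relate to sequencing yield, average base coverage is the number of sequenced bases divided by the genome size. The following Python sketch is not part of the original analysis: the function names are hypothetical, and the yields it derives are back-calculated from the reported PT coverages and the ~ 900 Mbp estimate, not taken from the supplementary tables.

```python
GENOME_SIZE_BP = 900_000_000  # ~900 Mbp estimate quoted in the text

# Reported average base coverage for P. trewavasae (PT), per technology.
PT_COVERAGE = {
    "Roche 454": 1.5,
    "Illumina PE": 88.0,
    "Illumina MP": 34.0,
    "PacBio": 10.5,
}


def mean_coverage(total_sequenced_bases, genome_size_bp):
    """Average base coverage = sequenced bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome size must be positive")
    return total_sequenced_bases / genome_size_bp


def bases_for_coverage(coverage, genome_size_bp):
    """Inverse relation: sequenced bases implied by a given coverage."""
    return coverage * genome_size_bp


if __name__ == "__main__":
    for technology, coverage in PT_COVERAGE.items():
        implied_gbp = bases_for_coverage(coverage, GENOME_SIZE_BP) / 1e9
        print(f"{technology}: ~{coverage}x implies ~{implied_gbp:.1f} Gbp of reads")
```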

Petrochromis trewavasae

The primary assemblies exhibit assembly sizes from ~ 779 Mbp to ~ 966 Mbp (907 Mbp PacBio only; see Supplementary Table S11). The final assembly consists of 7261 scaffolds with an N50 of 1.84 Mbp; 1.44% of nucleotides are undetermined (N), and 90% of the assembled genome is contained in 885 fragments longer than 70 kbp. The total assembly size is 917.57 Mbp (Table 1).

Table 1 Assembly contiguity and size statistics: The assembled genomes consist of 917.57 and 911.13 Mbp for P. trewavasae and T. moorii, respectively. Count and number of bases for scaffolds and contigs are reported. Scaffolds were broken to contigs at stretches of Ns of length ≥ 10. Statistics on O. niloticus were obtained from NCBI and extended as necessary (in blue); technology-wise version 2 is comparable, version 4 is based on high-coverage PacBio and optical mapping data.

Tropheus moorii

The primary assemblies exhibit assembly sizes from ~ 754 Mbp to ~ 952 Mbp (879 Mbp PacBio only; see Supplementary Table S11). The final assembly consists of 7662 scaffolds with an N50 of 1.64 Mbp; 1.29% of nucleotides are undetermined (N), and 90% of the assembled genome is contained in 657 fragments longer than 192 kbp. The total assembly size is 911.13 Mbp (Table 1). Both assembly sizes are in the expected range; k-mer spectra-based predictions point to genome sizes of close to 900 Mbp (see Supplementary Table S11), and 900–1000 Mbp have also been reported for other cichlid genomes^21,30.
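The contiguity statistics quoted for both assemblies (N50, and the number and minimum length of fragments holding 90% of the genome) follow from the sorted list of sequence lengths. The Python sketch below shows one common way to compute them, together with breaking scaffolds into contigs at runs of at least 10 Ns as described for Table 1. It is an illustrative sketch with hypothetical function names and toy data, not the tooling used in the study.

```python
import re


def split_into_contigs(scaffold, min_gap=10):
    """Break a scaffold into contigs at runs of N of length >= min_gap.

    Shorter runs of N stay inside the contig they occur in.
    """
    pieces = re.split(r"[Nn]{%d,}" % min_gap, scaffold)
    return [piece for piece in pieces if piece]


def nx_lx(lengths, fraction=0.5):
    """Return (Nx, Lx) for a collection of sequence lengths.

    Nx is the length of the shortest sequence, and Lx the number of
    sequences, in the smallest set of longest sequences that together
    cover at least `fraction` of all bases (fraction=0.5 gives N50/L50).
    """
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return length, count
    raise ValueError("no sequences given")


if __name__ == "__main__":
    toy_lengths = [80, 70, 50, 40, 30, 20, 10]  # toy data, not from the paper
    print("N50, L50:", nx_lx(toy_lengths, 0.5))
    print("N90, L90:", nx_lx(toy_lengths, 0.9))
    print(split_into_contigs("ACGT" + "N" * 10 + "GG" + "N" * 3 + "TT"))
```

Read this way, the statement that 90% of the PT assembly is contained in 885 fragments longer than 70 kbp corresponds to an L90 of 885 and an N90 of about 70 kbp.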

In the following, we compare our results to published genomes and annotations of several cichlid fish with emphasis on O. niloticus and M. zebra due to their well-developed state. The latest versions (v4) of O. niloticus (44× PacBio, newly anchored) and M. zebra (now 65× PacBio and anchored) were published by Conte et al.²⁷; the trend relative to earlier versions is clear: the quality of sequences and annotations improved and the numbers of annotated structures increased further. With respect to the gene length distributions (Supplementary Table S1), the contiguity measures achieved for PT and TM are satisfactory and fall in the typical range, given the applied sequencing technologies and coverage (Table 1; for a comparison with O. niloticus versions see Supplementary Table S2, and for a general comparison with published fish genomes see Supplementary Table S23 of Vij et al.³¹).

Annotations

Structural annotation yielded ~ 40,300 (PT) and 39,600 (TM) genes and ~ 54,200 (PT) and 56,800 (TM) transcripts, respectively (Table 2); this is in line with the results of different annotation versions of ON (~ 30,200 to 42,600 genes). As to annotated features, PT and TM show similar numbers which often lie between those of version 2 and 3 of the respective ON annotations. For comparison, statistics for ON v2–v4 (the latest) are added, as ON received the most community effort and data for genome assembly and annotation of all cichlids (Supplementary Table S2). Prediction of long non-coding RNAs yielded 2782 and 2112 lncRNAs for PT and TM, respectively. With 57.7% and 63.2%, a slight preference for the sense strand could be observed (Supplementary Table S3). Homology-based functional annotation could be made for 41,970 (PT) and 43,918 (TM) of the coding sequences (CDSs); putative secretory signals were predicted for 5899 (PT) and 6016 (TM) of them, respectively (Table 3). Pfam domain mapping yielded 78,900 (PT) and 84,158 (TM) hits, respectively. RepeatMasker³² identified 31.1% (PT) and 30.0% (TM) of the genomes as repetitive, respectively; the largest proportions of classified repeat types were held by DNA transposons, LINEs and LTR transposons with ~ 13%, ~ 7% and ~ 2% (Table 4).

Table 2 Structural annotation statistics of PT and TM in comparison with ON: Structural annotation yielded ~ 40,300 and 39,600 genes, respectively. This is in line with the results of different annotation versions of ON (~ 30,200 to 42,600).

Table 3 Functional annotation statistics: The numbers of proteins found in UniProt and NR are given. Furthermore, the table contains the number of proteins with putative protease (Merops) and carbohydrate activity (CAZymes), the number of orthologs in fiNOG, the number of proteins matching the BUSCO vertebrate models and the number of proteins with putative secretory signals (SignalP). Finally, the numbers of hits of the protein sequences for the various InterPro domain databases are presented.

Table 4 Repeat annotation statistics as determined by RepeatMasker³².

Data availability and visualization

The genome and transcriptome assemblies (FASTA), the structural and functional annotations (GFF3), read mappings (BAM) and additional Integrative Genomics Viewer (IGV)³³ track files (short and long non-coding RNAs, repeats, ORFs, CpG islands, microsatellites, IPR and eggNOG domains, variant calls, read mappings, alternative splicing, and REAPR error calls; Fig. 2) are available at https://cichlidgenomes.tugraz.at.

Quality evaluation

Assembly quality was assessed with BUSCO³⁴ and CEGMA³⁵. BUSCO identified 98.3% and 98% of the 4584 proteins in the Actinopterygii database in complete form for PT and TM, respectively; 1.7% and 2% of the benchmarking universal single-copy orthologs (BUSCOs) were either fragmented or missing. These results compare well with those of published genomes and are generally on a par with those of the later versions of the O. niloticus genome drafts (Table 5). CEGMA identified all of the 248 core eukaryotic genes (CEGs) for both PT and TM (Table 6); CEGMA results for PT and TM transcriptome assemblies can be found in Supplementary Table S6. However, REAPR reports 17,166/11,992 (PT/TM) likely assembly errors (Supplementary Table S10); IGV tracks highlight these questionable regions so that analyses in their vicinity can be interpreted with caution (see Fig. 2). Completeness of conserved protein domains was assessed with DOGMA³⁶. DOGMA found 91.8% and 90.5% of the 1051 expected conserved domains at a conserved domain arrangement size of 1 for PT and TM, respectively (Table 7).

Table 5 BUSCO results: Identified genes are classified as ‘complete’ when their lengths are within two standard deviations of the BUSCO group mean length (i.e., within ∼95% expectation). ‘Complete’ genes found with more than one copy are classified as ‘duplicated’; BUSCOs are expected to evolve under single-copy control, hence recovery of many duplicates may indicate erroneous assembly of haplotypes. Genes only partially recovered are classified as ‘fragmented’, and genes not recovered are classified as ‘missing’³⁴. The latest versions of assemblies were used in all cases (i.e., V4 of O. niloticus and M. zebra). See BUSCO results for PT and TM transcriptome assemblies in Supplementary Table S5. Values are color coded according to the rank: Dark green, best; dark red, worst. BUSCO stands for benchmarking universal single-copy ortholog.
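The classification rule described in this caption can be written down compactly. The Python sketch below is a simplified illustration only: it looks solely at match lengths, treats any match outside the two-standard-deviation window as a partial recovery, and uses hypothetical function names and toy numbers. The actual BUSCO software also applies score thresholds that are not modelled here.

```python
from collections import Counter


def classify_busco(hit_lengths, group_mean, group_sd, n_sd=2.0):
    """Classify one BUSCO group from the lengths of its candidate matches.

    Assumption of this sketch: a match counts as 'complete' when its length
    lies within n_sd standard deviations of the group mean length; any other
    match is treated as a partial (fragmented) recovery.
    """
    if not hit_lengths:
        return "missing"
    complete = [
        length for length in hit_lengths
        if abs(length - group_mean) <= n_sd * group_sd
    ]
    if len(complete) > 1:
        return "duplicated"
    if len(complete) == 1:
        return "complete"
    return "fragmented"


def summarize(groups):
    """Tally classifications; `groups` holds (hit_lengths, mean, sd) tuples."""
    return Counter(classify_busco(hits, mean, sd) for hits, mean, sd in groups)


if __name__ == "__main__":
    toy_groups = [  # toy data, not taken from Table 5
        ([310], 300, 20),
        ([310, 290], 300, 20),
        ([150], 300, 20),
        ([], 300, 20),
    ]
    print(summarize(toy_groups))
```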

Table 6 CEGMA results: Shown are the latest versions in all cases; for ON and MZ additionally to v4 (PacBio-based) v2 (Illumina PE + MP-based) is listed for comparison (as PT and TM were primarily constructed using the same technologies). Values are color coded according to the rank: Dark green, best; dark red, worst. CEG stands for core eukaryotic gene.

Table 7 DOGMA results: DOGMA³⁶ scores a sample transcriptome/proteome regarding its completeness of conserved protein domains provided as percentage of a defined core set (conserved domains are structural and functional building blocks of proteins). The analysis supports the notion (see mean and median protein lengths in Table 2) that gene models of protein-coding genes need improvement. Values are color coded according to the rank: Dark green, best; dark red, worst. CDA stands for conserved domain arrangement.

Comparative analysis

We compared the genomes of PT and TM by mapping the raw reads of one species to the genome of the other species. This yielded 4,105,604 and 4,178,777 small variants (SMV; SNPs and InDels) for PT and TM, respectively. Furthermore, 356,428 and 577,124 SMVs were identified for PT and TM when mapping the reads of the same species to the respective genomes. On average, 1 variant per ~ 220 bp (interspecies) and 1 variant per 2540 bp (PT vs PT)/1561 bp (TM vs TM) (intraspecies) were called (Table 8). For the two species, 93,842 and 89,489 large structural variants (SV; insertions, deletions, duplications, inversions and translocations) between species were detected, the majority being deletions with 60% and 65.6%, respectively (Table 8).
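The variant densities can be approximated from the assembly sizes and SMV counts reported in the text. The Python sketch below makes two assumptions that the text does not confirm: that undetermined bases (N) are excluded from the denominator, and that each count is paired with the assembly of the species it is listed under. Under these assumptions the results land close to, but not exactly on, the reported values, so the sketch should be read as a plausibility check and not as the authors' calculation.

```python
# Assembly sizes (bp) and fractions of undetermined bases, from the text.
ASSEMBLIES = {
    "PT": {"size_bp": 917_570_000, "n_fraction": 0.0144},
    "TM": {"size_bp": 911_130_000, "n_fraction": 0.0129},
}

# Small-variant (SMV) counts, from the text.
SMV_COUNTS = {
    "interspecific": {"PT": 4_105_604, "TM": 4_178_777},
    "intraspecific": {"PT": 356_428, "TM": 577_124},
}


def bp_per_variant(size_bp, n_variants, n_fraction=0.0):
    """Average number of determined bases per called variant.

    Assumption: bases recorded as N are excluded from the denominator.
    """
    if n_variants <= 0:
        raise ValueError("variant count must be positive")
    return size_bp * (1.0 - n_fraction) / n_variants


if __name__ == "__main__":
    for comparison, counts in SMV_COUNTS.items():
        for species, n_variants in counts.items():
            assembly = ASSEMBLIES[species]
            density = bp_per_variant(
                assembly["size_bp"], n_variants, assembly["n_fraction"]
            )
            print(f"{comparison} {species}: 1 variant per ~{density:.0f} bp")
```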

Table 8 Overview on inter- and intraspecies variant analysis result: The numbers represent heterogeneity between species (for PT vs TM and TM vs PT) and heterogeneity within species (for PT vs PT and TM vs TM). These numbers may include the net effect of technical issues (e.g., with assembly, annotation, mapping and calling algorithms).

The distribution of SMV and SV observed by the comparative analysis largely follows the genome coverage of particular structural/functional regions. There are small but noticeable deviations: (1) SMV are (slightly) underrepresented in promoter, 5′ UTR, coding, splice site, 3′ UTR and intergenic regions; they are overrepresented in introns; (2) SV are (slightly) underrepresented in promoter, 5′ UTR, coding, 3′ UTR and intergenic regions; they are overrepresented in introns and splice sites (via overlap) (Fig. 3A).

SnpEff³⁷ categorises variant effects on the gene into four groups based on the location and nature of the variant: ‘HIGH’, ‘MODERATE’, ‘LOW’ or ‘MODIFIER’, where the latter denotes non-coding variants or variants affecting non-coding genes, for which predictions are difficult or there is no evidence of impact. In our analysis, more than 97% of the identified variants are classified as ‘MODIFIER’ (Table 9).
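A tally of impact categories like the one summarised in Table 9 can be derived from the annotated VCF. The Python sketch below assumes the standard ANN annotation field, in which each annotation is a pipe-separated record whose third entry is the putative impact, and it takes only the first annotation per variant on the assumption that annotations are listed with the most severe effect first. The function names, gene names and INFO strings are invented for illustration.

```python
from collections import Counter

IMPACTS = ("HIGH", "MODERATE", "LOW", "MODIFIER")


def first_impact(info):
    """Return the impact of the first ANN annotation in a VCF INFO string.

    Returns None when no ANN entry with a recognised impact is present.
    """
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            first_annotation = entry[len("ANN="):].split(",")[0]
            fields = first_annotation.split("|")
            if len(fields) > 2 and fields[2] in IMPACTS:
                return fields[2]
    return None


def impact_fractions(info_fields):
    """Fraction of annotated variants falling into each impact category."""
    impacts = (first_impact(info) for info in info_fields)
    counts = Counter(impact for impact in impacts if impact is not None)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {impact: counts[impact] / total for impact in IMPACTS}


if __name__ == "__main__":
    example_info = [  # invented INFO strings, not taken from the data set
        "DP=31;ANN=T|intergenic_region|MODIFIER|geneA-geneB|",
        "DP=18;ANN=A|intron_variant|MODIFIER|geneC|",
        "DP=25;ANN=G|missense_variant|MODERATE|geneD|,G|upstream_gene_variant|MODIFIER|geneE|",
        "DP=40;ANN=C|synonymous_variant|LOW|geneF|",
    ]
    print(impact_fractions(example_info))
```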

Table 9 Overview on putative effects of intra- and interspecies variants: Shown are variant effect annotations as determined by SnpEff³⁷. The numbers represent heterogeneity between species (for PT vs TM and TM vs PT) and heterogeneity within species (for PT vs PT and TM vs TM). These numbers may include the net effect of technical issues (e.g., with assembly, annotation, mapping and calling algorithms).

Genes of interest, which are possibly affected by mutations, are highlighted in Table 10 (see Supplementary Information for details on the gene selection and GO enrichment analysis, respectively). Here, we started to analyze genes which are related to the development of the viscerocranium (untargeted gene selection) and the pharyngeal system (targeted, i.e., biased gene selection based on literature, see Table S17); however, there are other GO terms of interest which are consistently enriched over different analysis approaches, such as BMP signaling. A condensed GO analysis result for an untargeted approach (A2, see Table S14b) is shown in Table 11; here, gene categories are based on variant comparison groups (within and between species groups) combined with quantile ranking and thresholding (p = 0.5, i.e., median), and variant counts (‘mutation loads’) were used as criterion. The term ‘embryonic viscerocranium morphogenesis’ is enriched in the within and the between species gene sets over all approaches (see Supplementary Table S16); genes belonging to this term were combined with genes from the targeted approach and used for further downstream analyses (see Supplementary Tables S14a, S14b, S15, S16, S18 and S21). In the comparative analysis, biological species are coded as A (PT) and B (TM) (Table 11). The categories (AA, AB, BA and BB) refer to the within and between group comparisons. That is, there are mutations at the same genomic locations (nucleotides) which are either identical within and between species (referred to as, e.g., identical (AA) and redundantly identical (AB)) or nonidentical (referred to as, e.g., nonidentical (BB) and nonidentical (BA)); moreover, there are mutations which are unique to a group (referred to as, e.g., unique (AA) and unique (AB), i.e., at the genomic location there is only a variant in species A (unique (AA)) or there is only a variant between species A and B (unique (AB)), respectively). In the shown example, for SMV the calls for viscerocranium morphogenesis are symmetric except for the AB category (which fell below the threshold), i.e., the GO term is consistently enriched within and between species. Further analyses on the genes belonging to the term clearly verify the presence of shared and species-specific mutations in these genes (see example in Supplementary section Identification of genes putatively related to facial and jaw morphology). Hence, there is substantial variation in these genes which may drive changes in the manifestation of morphology. However, we cannot yet delineate possible effects from shared and non-shared variants.
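The identical, nonidentical and unique categories can be pictured as a comparison of two call sets keyed by genomic position. The Python sketch below illustrates this for one species (within-species calls AA against between-species calls AB); the data structure, function name and toy calls are assumptions made for illustration and do not reproduce the authors' pipeline.

```python
def categorize_sites(within, between):
    """Sort variant sites into the four comparison categories.

    `within` and `between` map a site, here (scaffold, position), to the
    alternative allele called in the within-species (e.g. AA) and the
    between-species (e.g. AB) comparison, respectively.
    """
    categories = {
        "identical": set(),       # same site, same variant in both call sets
        "nonidentical": set(),    # same site, different variant
        "unique_within": set(),   # called only within the species
        "unique_between": set(),  # called only between the species
    }
    for site in within.keys() | between.keys():
        if site in within and site in between:
            same = within[site] == between[site]
            key = "identical" if same else "nonidentical"
        elif site in within:
            key = "unique_within"
        else:
            key = "unique_between"
        categories[key].add(site)
    return categories


if __name__ == "__main__":
    # Toy calls, not taken from the data set.
    calls_aa = {("scf1", 100): "T", ("scf1", 250): "T", ("scf2", 40): "C"}
    calls_ab = {("scf1", 100): "T", ("scf1", 250): "G", ("scf3", 7): "A"}
    for name, sites in categorize_sites(calls_aa, calls_ab).items():
        print(name, sorted(sites))
```

Counting the sites per gene in each category would then give the per-gene variant counts ('mutation loads') on which the ranking and thresholding operate.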

Table 10 Selected genes affected by variants. To narrow down the list of genes carrying variants, a targeted approach and GO enrichment analysis were performed; this table lists genes related to facial and jaw morphology. Shown results are filtered and simplified: (1) variant types and locations have been unified for transcript isoforms and (annotated) gene duplicates, and (2) they have been intersected between species comparisons. SMV, small variant(s); SV, structural variant(s).

Table 11 GO enrichment analysis result—biological process terms (condensed). This table shows results from approach A2 (see Supplementary Table S14b). Enrichment was assessed via a Fisher’s exact test with a cutoff of p ≤ 0.001 and GO topology was accounted for (R package topGO, method weight). In the Type column biological species are coded as A (PT) and B (TM); identical and nonidentical variants at same nucleotide positions, and unique variants are indicated. The categories (AA, AB, BA and BB) refer to the within and between comparison: identical (AA) means that the intraspecific variant(s) (SMV and SV) in this group have also been called in the related interspecific (AB) comparison at the same location, with nonidentical (AA) a different variant has been called at the same location (e.g., A → T within and A → G between species), with unique (AA) only within species A and with unique (AB) only between species A and B a variant was called at that position; the same holds for species B and the BB and BA categories. Comparisons have been conducted two-way, i.e., A vs B and B vs A; the groups were tested against a gene universe containing all genes with GO information (the dataset contains 7905 (PT) and 7688 (TM) GO terms in total). SMV: small variant(s) (SNPs and InDels); SV: structural variant(s) (insertions, deletions, duplications, inversions and translocations). See Supplementary Table S15 for detailed lists.
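For orientation, the core of such an enrichment test is a one-sided Fisher's exact test, which asks how likely it is to see at least the observed number of term-annotated genes in a gene set drawn from the gene universe. The Python sketch below implements only this plain test via the hypergeometric tail; it does not reproduce the GO-topology weighting of the topGO weight method, and the function name and toy numbers are hypothetical.

```python
from math import comb

P_CUTOFF = 0.001  # cutoff quoted in the Table 11 caption


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test p-value for over-representation.

    N: genes in the universe (all genes with GO information)
    K: universe genes annotated with the GO term
    n: genes in the tested set (e.g. one variant category)
    k: tested-set genes annotated with the GO term
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


if __name__ == "__main__":
    # Toy numbers, not taken from the data set.
    p_value = fisher_enrichment_p(k=12, n=200, K=40, N=20_000)
    print(f"p = {p_value:.3g}, enriched: {p_value <= P_CUTOFF}")
```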

Besides variants in the DNA structure, alternative splicing (AS) was analyzed. There are ~ 6200 AS events in ~ 2600 genes between sexes of each species and ~ 39,000 AS events in ~ 9400 genes between the two species (see Supplementary Table S13).

Discussion

Assembly and annotation

Meta-assembly of a set of primary assemblies yielded high quality genome drafts with 918 and 911 Mbp for Petrochromis trewavasae and Tropheus moorii, respectively. This is in line with the sizes between 900 and 1000 Mbp reported for other cichlid genomes^21,30 and with the ~ 940 Mbp estimated by our assembly validation with REAPR³⁸. The latest Oreochromis niloticus assembly spans ~ 1 Gbp; this variation in genome size may be due to biological differences but could also indicate that some portions of the repetitive DNA are not included in the respective genome reconstructions (PT/TM) as a consequence of sequence collapse (e.g., collapsed repeats)³⁹.

Typically, assemblies are based on genomic data from a single individual, which ideally stems from an inbred line. In this project, we assembled from 6 (PT) and 5 (TM) non-inbred individuals, respectively; this called for a more complex assembly approach. Furthermore, we performed de novo sequencing without any linkage or optical map data (as seen in the latest genome drafts of O. niloticus, and A. calliptera) and the PacBio coverage was (with ~ 9–10 ×) considerably lower than that used for the assemblies of O. niloticus (44 × for v3³⁰ and v4²⁷) and M. zebra (16.5 × for v3⁴⁰, added on top of already high Illumina PE and MP coverages, and 65 × for v4²⁷). Still, both assemblies compare well with published genome drafts of comparable species with respect to typical metrics regarding gene content. BUSCO results (Table 5) show a low rate of ~ 4–8% (depending on database) of duplicated BUSCOs in PT and TM. This is slightly higher than the ~ 2–4% reported for other cichlid genomes, which may be a consequence of incorrect assembly of haplotypes from the non-inbred individuals. With respect to total BUSCOs identified, fragmented and missing BUSCOs, both the PT and TM genome reconstruction perform very well.

For both species, the annotations proved valid for first sensible downstream analyses, but some gene models may certainly need further improvement, e.g., by repeated training of gene predictors; however, for AUGUSTUS, the central predictor, model evaluations already show good training states (see Supplementary Table S12). The most relevant sources of insufficient gene models might include gene fusions, splits and especially truncations, which are obvious under closer inspection; this is typical for early annotations, especially when the annotation pipeline is still under development. We observe a relatively low mean and median length of protein sequences (see Table 2) in both assemblies/annotations. This may reflect a systematic error in the generation process, e.g., InDels leading to frame shifts and, hence, wrong translations and premature stop codons. Investigation of this phenomenon showed non-triplet InDels; however, these are also found between, e.g., O. niloticus and M. zebra transcript models. Moreover, the rate of identified nonsense mutations in PT and TM is low (Table 9). The NCBI and Ensembl annotation pipelines are state-of-the-art; additionally, the amount and diversity of RNA-seq data used for the annotation of, e.g., O. niloticus was much larger than was the case for either species in this project. Hence, the larger number of identified transcript isoforms (as well as the higher average numbers of exons per transcript) in those annotations may be seen as straightforward consequences. However, the total number of exons in both species is on a par with the ON annotations. Interestingly, the number of gene models in PT and TM is also on a par with ON v3. This is merely a comparison of numbers of elements, though, as there is no well-established method to score the correctness of gene models (one option might be a general structure check combined with a database-based similarity majority scoring). Moreover, there are, as mentioned, some gene fusions and splits in the PT and TM gene model sets, which will distort the gene count to some degree. As another quality measure for the annotated protein-coding genes, DOGMA³⁶ and PfamScan⁴¹ were used; the results support the notion that the set contains deficient gene models, which lack certain protein domains or contain only fragments thereof (Table 7).

Comparative analysis

We picked the two study species for the following reasons. Tropheus moorii is a highly successful algae browser found in large numbers in all types of rocky shore, while Petrochromis trewavasae is an algae grazer distributed at rocky shores on the western side of the lake, living in sympatry with Tropheus. The Tropheini comprise 3 predatory species, one omnivore, 10 algae browsers and 15 algae grazers. Algae browsers have chisel-like teeth to bite off filamentous algae from the rocky substrate, while algae grazers have comb-like teeth in multiple rows to comb off unicellular algae and detritus from the rocks. Due to the old age of the tribe Tropheini, amounting to about 2–6.5 myr for the onset of their radiation⁶, the degree of eco-morphological divergence is greater than in the much younger eco-morphological equivalents in Lake Victoria, but comparable with the eco-morphospace covered by the entire Lake Malawi flock. Interestingly, the genus Tropheus comprises about 120 mostly allopatric populations and sister species that are distinct in color but morphologically similar. They all remained in the same trophic niche at all rocky shorelines throughout the lake. Petrochromis trewavasae does not show much color variation, has a restricted distribution at the southwestern shoreline of the lake and is a member of a complex and morphologically distinct grazer lineage including the much more diverse P. polyodon species complex. When considering the entire lineage, it underwent a similar evolutionary trajectory as Tropheus. It should be noted here that the generally much lower species number in Lake Tanganyika when compared to Lakes Malawi and Victoria also results from the different species concepts employed, in that several allopatric entities are treated as species in Lakes Victoria and Malawi, whereas they are treated as geographical varieties in the older Lake Tanganyika radiation.

The comparative analysis presented here yielded, as expected, a large number of variant regions between the two species and even a considerable amount within each species. The large amount of variation at the intraspecific level may in fact be owed to our approach of using several non-inbred F1 individuals of a single population sampled in the natural environment, but better reflects intra-population diversity and ultimately the old evolutionary age of the lineage. We used GATK⁴² and DELLY⁴³, two well established tools, for variant calling; however, the calling of variants is still not a well solved problem with often little overlap between results of different algorithmic routes (e.g., see^44,45). As to the reported statistics on variant effects, it is known that the state of the structural annotation and the used variant effect annotator strongly influence the results⁴⁶. The analysis results presented here reflect the state of the genome reconstructions (v1).

The relatively large number of reciprocally sorted SV and SMV between the two study species is remarkable and might reflect the relatively old divergence time between them, amounting to about 2.5–6 Mya for the two clades⁶. In fact, it is expected that structural mutations affecting coding information need more time to evolve than regulatory mutations. Thus, when comparing species from the much younger Lake Victoria and Malawi, one would not expect such a marked degree of reciprocally distinct coding variation. The SV and SMV can also be interpreted in the light of the flexible stem hypothesis^4,18. The flexible stem of cichlid radiations is formed by ecologically and phenotypically flexible species adapted to seasonally unstable river habitats. Once they seed lacustrine radiations, they can rapidly accommodate empty niche space in this more stable environment due to their large scope of phenotypic plasticity¹⁸. Subsequently, the phenotypically plastic population is subdivided into alternative adaptive phenotypes, and adaptive genetic factors are then sorted during speciation, which proceeds further via genetic accommodation and genetic assimilation⁴⁷. Phenotypic or developmental plasticity refers to the ability of a single genotype to produce multiple phenotypes under different environmental conditions. The flexible stem hypothesis postulates that plasticity in a population can influence the direction of evolution by exposing cryptic genetic variation to selection in a novel environment. Under this model, subsets of an ancestral population exploit distinct ecological niches in a new habitat, such as different food types. Within a single generation, plasticity in anatomy may lead to a fitness increase, e.g., more efficient food capture or processing, in each niche. Newly exposed phenotypic variation will be targeted by selection, and if the new environment is stable, the plastic phenotypes may be canalized through genetic assimilation.
The assumption is that the molecular mechanisms for the plastic response also underlie the evolution of key phenotypes, i.e., genetic variation in the same molecules/signaling pathways, which enable plasticity, is targeted by selection and fixed in order to canalize the phenotype. In a recent study, the role of hedgehog (Hh) signaling in the craniofacial plasticity in teleosts has been highlighted, demonstrating that Hh levels tune the sensitivity to mechanical signals related to foraging conditions—where adaptive morphological changes in immediately affected structures, e.g., the pharyngeal bones, may propagate morphological changes to other craniofacial structures⁴⁸.

Variants have been called in virtually all gene regions. About 99% of genes have at least one possible variant (under the applied parameter settings) in the gene body or within 5 kb up/downstream (Fig. 3B). Genes with at least one mutation were subjected to Gene Ontology (GO) analysis to get hints on potentially interesting functional groups affected by more variants, i.e., the number of variants (or ‘mutation load’) was used as a pointer for the probability of effective changes. The rationale behind this approach was the assumption of correctness of the infinitesimal model or the omnigenic model⁴⁹, respectively. One may expect that the observed phenotype shifts are not due to a few high-impact (usually coding region) variants but rather due to several ‘lower impact’ variants (in the used categories probably the ‘modifier variants’, which typically represent > 90% of the mutation load). Even if at this stage the relevance of the variation in the selected genes is not clear, all listed genes have multiple calls regarding SMV and SV (Fig. 3B), which may increase the chances of effective influences on phenotypes. Given their assigned functions reported in other organisms (Table 10), however, these genes are well worth being probed. For instance, five genes related to nose and chin shape definition (DCHS2, RUNX2, GLI3, PAX1 and EDAR) have recently been identified in a human GWAS⁵⁰; several variants in all these genes have also been found between the two species. Additionally, PAX3, KCTD15 and TBX family members (TBX1 and TBX10, but not TBX15 as previously reported) are in the result set; these genes have been related to facial morphology in humans in two other recent GWAS^51,52 (Table 10). Particular focus of future downstream analyses should be on genes with stable differences in gene expression among the study species. As stated earlier, our focus lies on the differences in facial and pharyngeal shapes (see Supplementary Fig. S1).
It is interesting that this simple method of unbiased variant counting (‘mutation load’) reproducibly output the GO terms related to the morphogenesis of the viscerocranium (see Supplementary Information), without giving a rather unspecific long list of GO terms. The GO result highlights several important signaling pathways: BMP signaling (e.g., bmp2, bmp4), Hedgehog (Hh) signaling (e.g., Shh, Gli family, Sec family, smo, med12, plcb3), endothelin signaling (e.g., edn1, furin, dlx family), retinoic acid (RA) signaling (e.g., rere, rerea), and fibroblast growth factor (FGF) signaling (e.g., fgf8, fgf20b) (see Table 10 and Supplementary Table S21). All of these signaling networks are known to play roles in the regulation of vertebrate facial morphogenesis, and they interact. There are, for instance, strong co-operative and functional interactions between Shh and retinoic acid^53,54,55,56,57,58. A more in-depth comparative analysis of the observed gene variant distribution across the two species and their respective phenotypes was not carried out at this stage; this will be an important task for follow-up studies.
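The variant counting per gene region described above can be illustrated with a minimal sketch. It assumes 1-based inclusive gene coordinates and point-like variant positions; the function name `mutation_load`, the tuple layouts and the toy scaffold names are hypothetical and not taken from the study's R code.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

FLANK = 5_000  # 5 kb up/downstream of the gene body, as stated in the text


def mutation_load(genes, variants, flank=FLANK):
    """Count variants falling in each gene body plus/minus `flank` bp.

    genes:    iterable of (gene_id, chrom, start, end), 1-based inclusive
    variants: iterable of (chrom, pos)
    Both layouts are assumptions made for this illustration.
    """
    # Sort variant positions per scaffold once so each gene is a binary search.
    positions = defaultdict(list)
    for chrom, pos in variants:
        positions[chrom].append(pos)
    for chrom in positions:
        positions[chrom].sort()

    load = {}
    for gene_id, chrom, start, end in genes:
        pos = positions.get(chrom, [])
        first = bisect_left(pos, max(1, start - flank))
        last = bisect_right(pos, end + flank)
        load[gene_id] = last - first
    return load


# Toy data (invented for illustration only)
genes = [
    ("geneA", "scf1", 10_000, 12_000),
    ("geneB", "scf1", 40_000, 41_000),
    ("geneC", "scf2", 500, 900),
]
variants = [
    ("scf1", 5_000), ("scf1", 11_000), ("scf1", 17_000),
    ("scf1", 17_001), ("scf1", 40_500),
]

load = mutation_load(genes, variants)
hit_fraction = sum(1 for n in load.values() if n > 0) / len(load)
print(load, hit_fraction)
```

In this sketch a gene's count plays the role of the ‘mutation load’, and `hit_fraction` corresponds to the share of genes with at least one variant in the flanked region.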

To summarize, the two new draft genomes add two monophyletic and eco-morphologically divergent key species that fill an important phylogenetic gap. Moreover, they represent the earliest offshoot of the so-called modern haplochromine cichlids, the most species-rich lineage of East African cichlids. While the Tropheini radiated within the confines of Lake Tanganyika, their allies spread over several rivers to seed additional radiations such as those in Lake Malawi and Victoria, where those reached comparable eco-morphological diversity.

Methods

Study species

The sampled specimens of T. moorii are F2 offspring of wild-caught individuals from the Zambian section of the southwestern shore of Lake Tanganyika (08°38′ S 30°52′ E) near the village Nakaku, which were brought to the University of Graz in 2005. The P. trewavasae specimens used in this study are F1 offspring of wild fish also from the southwestern shore, but further northeast near the village Katete (08°20′ S 30°30′ E), and were obtained from an ornamental fish importer. Collection of the parental generation of fish was carried out in the framework of a Memorandum of Understanding between the Department of Fisheries, Ministry of Agriculture and Cooperatives, Zambia, the Department of Biological Sciences at the University of Zambia in Lusaka, the Department of Zoology at the University of Graz, Austria, the Department of Behavioural Ecology at the University of Bern, Switzerland, and the Department of Zoology at the University of Basel, Switzerland, under the research permit issued to CSt by the Zambian Ministry of Home Affairs (permit number: SP006515). Sequence data presented here are based on DNA extractions of 6 P. trewavasae and 5 T. moorii individuals; the specimens included both sexes and were about one year old.

Sequencing and laboratory procedures

We sequenced the genomic DNA extracted from the specimens above using several sequencing technologies: Illumina HiSeq paired-end 2 × 101 bp (300 bp and 600 bp fragment size), Illumina Nextera mate-pair 2 × 100 bp (1–6 kbp fragment size), 454 Life Sciences (~ 350 bp average read length; 8 and 20 kbp fragment size) and single-molecule real-time (SMRT) sequencing technology from Pacific Biosciences (PacBio) (~ 8000–9000 bp average read length after correction).

Laboratory-related methods (DNA extraction, library preparation and sequencing) have, in part, been previously described in the accompanying paper on the mitochondrial genomes⁵⁹. In addition, we carried out two sequencing runs using second-generation Pacific Biosciences sequencing technology based upon one individual per species. DNA extraction was carried out in Graz, and library preparation and sequencing at the Lausanne Genomic Technologies Facility: DNA was sheared in a Covaris g-TUBE (Covaris, Woburn, MA, USA) to obtain 20 kbp fragments. After shearing, the DNA size distribution was checked on a Fragment Analyzer (Advanced Analytical Technologies, Ames, IA, USA). 5 μg of the sheared DNA was used to prepare one SMRTbell library with the PacBio SMRTbell Template Prep Kit 1 (Pacific Biosciences, Menlo Park, CA, USA) according to the manufacturer's recommendations. The resulting library was size selected on a BluePippin system (Sage Science, Inc.; Beverly, MA, USA) for molecules larger than 11 kbp. The recovered library was sequenced on thirteen/sixteen (TM/PT) SMRT cells with P6/C4 chemistry and MagBeads on a PacBio RSII system (Pacific Biosciences, Menlo Park, CA, USA) at 240 min movie length.

For RNA-seq, total RNA from one male and one female individual per species (pooled from the following tissues: liver, spleen, brain, heart and skeletal muscle) was extracted with Trizol as follows: tissue was homogenized with a MagnaLyser and incubated in the Trizol tube for 5 min at room temperature (RT); 200 µl chloroform (per ml of Trizol) was added, and the tube was shaken vigorously for 15 s, incubated for 2–3 min at RT and centrifuged at 12,200 rpm and 4 °C for 15 min; the supernatant was transferred to a new 1.5 ml tube and 500 µl isopropanol (per ml of Trizol) was added; after vortexing, incubation for 10 min at RT and centrifugation at 12,200 rpm and 4 °C for 10 min, the supernatant was discarded and the pellet was placed on ice immediately. The pellets were washed twice with 1 ml of 80% EtOH (–20 °C), each time centrifuging at full speed and 4 °C for 5 min and discarding the supernatant, and were finally dried at 37 °C. Dried pellets were resuspended in 20 µl distilled water. RNA-seq libraries were derived from total RNA which was rRNA-depleted, normalized and sequenced on a single Illumina HiSeq 2500 lane per species.

General data (pre)processing

All pipelining and higher-level processing was done with R/Bioconductor, some minor pipelining in Bash and some workhorse functionality was written in C++ (called from R). For details on parameter settings for important steps/tools see Supplementary Table S22.

FastQC v0.10.1⁶⁰ was used for basic read quality evaluation. A custom k-mer spectrum-based approach using JELLYFISH v2.0⁶¹ (in conjunction with a database of known technical sequences) and a De Bruijn-based approach (implemented in Minion from the Kraken v13-274 package⁶²) were used for the automatic identification of technical contaminants and suspicious sequences (based on expected frequencies). In addition, FastQScreen v0.4.4⁶³ was utilized for the species-specific identification of biological contamination and DeconSeq v0.4.3⁶⁴ for its removal. Cutadapt v1.5⁶⁵ was used for the removal of technical contaminants, Scythe v0.994⁶⁶ for additional 3′ adapter trimming, CLC quality trim v4.2⁶⁷ for quality-score-based read trimming and Reaper v13-274⁶² for further quality and complexity-based filtering. BBmerge v33.40⁶⁸ was used for overlapping paired-end read merging and FastUniq v1.1⁶⁹ for duplicate removal. Nextclip v1.2⁷⁰ was used for Nextera mate-pair read filtering and classification. 454 datasets were additionally filtered with sffToCA (Celera Assembler utility). BAMtools v2.4.0⁷¹, SAMtools/BCFtools/HTSlib v1.4⁷² and Picard tools v1.119⁷³ were used for mapping and sequence file manipulations such as indexing, merging, sorting, and generation of subsets, removal of duplicate reads, and removal of PE contamination from MP libraries in sequence files. Proovread v2.13.10⁷⁴ was used for PacBio read correction utilizing all available Illumina PE data and the unitigs created by MaSuRCA v2.3.2⁷⁵. SEECER v0.1.3⁷⁶ and Rcorrector v1.0.2⁷⁷ were used for RNA-seq and Musket v1.1⁷⁸ for DNA-seq base-call correction. DNA-seq and RNA-seq datasets were preprocessed using the same pipeline (with different parameter settings); in general, two filter regimes were applied to each data set (‘stringent’/’standard’ and ‘relaxed’) in preparation for different downstream use cases (see Supplementary Table S22). 
Genome sizes were estimated by a k-mer spectrum-based approach implemented in GCE v1.0.2⁷⁹.
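The principle behind a k-mer spectrum-based genome size estimate can be sketched as follows. This is a deliberately simplified illustration and not the GCE algorithm, which additionally models sequencing errors, heterozygosity and repeats; the function name, the `min_depth` cut-off and the toy histogram are assumptions made for this example.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Rough genome size estimate from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers
               observed at that depth.
    min_depth: depths below this value are treated as sequencing-error
               k-mers and ignored (an assumed, illustrative cut-off).
    Returns (peak_depth, estimated_size) with
        estimated_size = total k-mer occurrences / peak depth.
    """
    solid = {d: n for d, n in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak of the spectrum, taken as the typical k-mer coverage.
    peak_depth = max(solid, key=lambda d: solid[d])
    total_kmers = sum(d * n for d, n in solid.items())
    return peak_depth, total_kmers / peak_depth


# Toy histogram (invented numbers): an error peak at depth 1-2
# and a main peak around depth 20.
histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 500, 21: 300, 22: 100}
peak, size = estimate_genome_size(histogram)
print(peak, size)
```

With real data the histogram would come from a k-mer counter such as JELLYFISH, and the error cut-off would be chosen from the valley between the error peak and the main peak rather than fixed in advance.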

Genome assembly

From the perspective of the conducted meta-assembly, the algorithm implemented in MaSuRCA v2.3.2⁷⁵ (utilizes Celera Assembler v6.5⁸⁰) served as the core assembly procedure; all data sets available at the time (i.e., Illumina PE and MP, Illumina Nextera MP and 454 MP and SE) were used. Celera Assembler v8.3rc2 (CA)⁸⁰ was used for the ‘PacBio only’ assemblies. As several individuals per species (all non-inbred diploids) have been sequenced in this project, heterozygosity was a concern. Hence, assembly algorithms specifically designed to better handle divergence were incorporated into the reconstruction approach: Platanus v1.2.1⁸¹ is a recent assembler tailored to more sensibly deal with heterozygosity issues in genomic data (5 iterations; all Illumina data sets were used); Redundans v0.12c⁸² (utilizes SSPACE3⁸³, GapCloser⁸⁴, bwa⁸⁵ and last⁸⁶) also aims at providing more accurate and contiguous assemblies of highly heterozygous genomes (5 iterations; all Illumina data sets were used). The PBJelly Suite v15.8.24⁸⁷ (utilizes BLASR⁸⁸) was used to incorporate the long sequence reads (PacBio) in a reference-guided assembly process into the established drafts (5 iterations). The diverse set of generated genome drafts was subjected to Metassembler⁸⁹ in an attempt to generate high quality consensus sequences. A custom algorithm, which takes into account several measures on probable misassemblies, contiguity and gene predictions (drawing information from QUAST⁹⁰ and REAPR³⁸), was applied to determine the best order of successive meta-assemblies.

Genome finishing

For another round of inter-scaffold gap closing, GMcloser⁹¹ (utilizes Nucmer⁹² / BLAST⁹³ and Bowtie2⁹⁴) was applied on the meta-assemblies with PacBio and Illumina PE data. Finally, Sealer⁹⁵ (utilizes Konnector, a part of the ABYSS assembler pipeline⁹⁶) was used with the Illumina PE (liberal) libraries for final gap filling and a custom GATK-based⁴² genome finishing (via Illumina PE back mapping and consensus recalling) was applied.

Genome validation

REAPR v1.0.18³⁸ (utilizing SMALT v0.7.0.1⁹⁷) was used with the Illumina Nextera mate-pair (6 kbp) and Illumina PE (600 bp) libraries to evaluate the correctness of assemblies and QUAST v4.1⁹⁰ was applied for contiguity and gene prediction statistics. Completeness of the assemblies was assessed using CEGMA v2.5³⁵ (utilizing GeneWise v2.4.1⁹⁸, HMMER v3.0⁹⁹ and NCBI BLAST+ v2.2.29+⁹³) with parameter optimization for vertebrate genomes (--vrt) and BUSCO v3.0.2³⁴ (utilizing NCBI BLAST+ v2.2.29+, HMMER v3.1⁹⁹ and AUGUSTUS v3.2.1¹⁰⁰).
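As an illustration of what a contiguity statistic measures, the following sketch computes N50, one of the standard values reported by tools such as QUAST. The text does not state which contiguity statistics were used, so the choice of N50 here is an assumption; the function name and the toy scaffold lengths are likewise hypothetical.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


# Toy scaffold lengths in bp (invented for illustration)
scaffolds = [100, 80, 60, 40, 20]
print(n50(scaffolds))
```

For the toy input the total is 300 bp; the two longest scaffolds (100 + 80) already cover more than half of it, so the N50 is 80.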

Transcriptome assembly and RNA-seq read mapping

The transcriptome assemblies were conducted with Trinity v2.3.2^101,102 and the PASA2 v2.0.2 pipeline¹⁰³ (utilizing GMAP v2014-12-06¹⁰⁴, BLAT v36.1¹⁰⁵ and MySQL v5.7.12¹⁰⁶); Transdecoder v3.0.1¹⁰² was also applied to identify candidate coding regions (used with MAKER3¹⁰⁷). RNA-seq read alignments for other analyses were generally conducted with STAR v2.4.2a¹⁰⁸ using default parameters.

Genome annotation

Structural annotations were performed based on experimental data from mRNA-Seq datasets. Additionally, information was drawn from transcript and protein models from selected publicly available datasets (Danio rerio, H. burtoni, M. zebra, N. brichardi, O. niloticus, and P. nyererei) and from further models in UniProt|Swiss-Prot, nr/nt and UniRef90|teleost. Functional annotation was primarily conducted via BLAST-based comparisons against mentioned databases and via a host of databases coordinated by InterProScan 5 (see Table 2).

Structural annotation of coding genes and tRNAs was generated using the pipelines MAKER v3.0¹⁰⁷ (utilising the gene finders GeneMark-ES v4.32¹⁰⁹, AUGUSTUS v3.2.1¹⁰⁰, SNAP v2013-11-29¹¹⁰ and tRNAscan v1.3.1¹¹¹), Funannotate v0.5.5-v0.7.0¹¹² (FA) and BRAKER1 v1.9¹¹³ (utilising GeneMark-ET v4.32¹¹⁴ and AUGUSTUS v3.2.1¹⁰⁰); BRAKER1 was also used for AUGUSTUS training. In addition, gene models were created with StringTie v1.3.2d¹¹⁵ and Cufflinks v2.2.1¹¹⁶. All models were combined by EVidenceModeler v1.1.1¹¹⁷ (EVM) under the control of MAKER3. For non-coding RNAs, Infernal v1.1.2¹¹⁸, Rfam v12.1¹¹⁹ and FEELnc v0.1.0¹²⁰ were utilized. The mRNA training set for FEELnc was derived from the FA/MAKER annotation data, where presumed ‘good’ gene models with similar structure to previously published models were selected; the lncRNA training set was generated by shuffling of the mRNA sequences. Microsatellites were called with MISA v1.0¹²¹, CpG islands with EMBOSS v6.6.0¹²² cpgplot and ORFs with EMBOSS v6.6.0 getorf (and R post-processing). Repeats were determined using RepeatMasker v4.0.6³² (with RepBase v20160321¹²³ and species-specific libraries generated with RepeatModeler v1.0.8¹²⁴), RepeatScout v1.0.5¹²⁵ and TRF v406¹²⁶.

Functional annotation was conducted using InterProScan v5.24-63.0¹²⁷ (utilizing the databases CDD-3.14, Coils-2.2.1, Gene3D-3.5.0, Hamap-201605.11, MobiDBLite-1.0, PANTHER-11.1, Pfam-30.0, PIRSF-3.01, PRINTS-42.0, ProDom-2006.1, ProSitePatterns-20.119, ProSiteProfiles-20.119, SFLD-2, SMART-7.1, SUPERFAMILY-1.75, TIGRFAM-15.0 and TMHMM-2.0c). Furthermore, under the control of FA the databases eggNOG v4.5.1¹²⁸ (fiNOG), MEROPS v12.0¹²⁹, dbCAN v5.0¹³⁰ and BUSCO vertebrata v3³⁴ were used for similarity searches and SIGNALP v4.1¹³¹ for identification of target location signal sequences.

Final integration of all annotations was done with R 3.4.3/Bioconductor 3.6 using the packages data.table 1.12.2, GenomicFeatures 1.30.3, VariantAnnotation 1.24.5 and their dependencies.

DNA-seq read mapping

Preprocessed reads were aligned in paired-end mode with BWA mem⁸⁵ using the default parameters with the -M and -R flags. Aligned reads were coordinate sorted with Picard SortSam v1.119⁷³ and indexed with SAMtools index v1.4⁷². Duplicates were removed with Picard MarkDuplicates v1.119. The quality of the mappings was assessed with QualiMap v2.0¹³².

Comparative analysis—small (SMV) and structural variant (SV) calling—variant effect prediction

The Genome Analysis Toolkit (GATK) v3.7 was used for local realignment of reads and the detection and filtering of SNP/InDel variants (referred to as small variant/s, SMV)⁴² as recommended by the GATK documentation; the HaplotypeCaller was applied—with a minimum score for variant emission of 10, for calling of 30, and a minimum pruning of 10. SMV with a quality score ≥ 30 were included in further analyses. DELLY v0.7.7⁴³ was applied to call structural variant/s (SV, insertions, deletions, duplications, inversions and translocations) with an insert size cut‐off of 3 (for deletions) and a minimum paired‐end mapping quality of 20. All variants with a minimum of 5 broken read pairs supporting the variant as well as with a minimum length of 300 bp (for deletions, inversions, and duplications) were included in further analyses, as recommended by the DELLY documentation. Presumed variant effects were called with SNPeff v4.3r³⁷. Whippet v0.11.1¹³³ was used for the calling of alternative splicing events. The comparative analyses were conducted in R 3.4.3/Bioconductor 3.6 using the packages data.table 1.12.2, GenomicFeatures 1.30.3, VariantAnnotation 1.24.5 and their dependencies.
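The inclusion criteria stated above (SMV quality score ≥ 30; SV supported by at least 5 read pairs and, for deletions, inversions and duplications, at least 300 bp long) can be written down as a small post-filter. The sketch below operates on already parsed records and is not the authors' R code; the record classes, field names and the definition of SV length as end minus start are assumptions made for this illustration.

```python
from dataclasses import dataclass

MIN_SMV_QUAL = 30.0    # minimum quality score for small variants
MIN_SV_SUPPORT = 5     # minimum number of supporting read pairs
MIN_SV_LENGTH = 300    # minimum length in bp, applied to the types below
LENGTH_FILTERED_TYPES = {"DEL", "INV", "DUP"}


@dataclass
class SmallVariant:
    chrom: str
    pos: int
    qual: float


@dataclass
class StructuralVariant:
    chrom: str
    start: int
    end: int
    sv_type: str      # e.g. DEL, DUP, INV, INS, TRA (labels are assumed)
    pe_support: int   # supporting read pairs


def keep_smv(v: SmallVariant) -> bool:
    return v.qual >= MIN_SMV_QUAL


def keep_sv(v: StructuralVariant) -> bool:
    if v.pe_support < MIN_SV_SUPPORT:
        return False
    if v.sv_type in LENGTH_FILTERED_TYPES:
        # Length taken as end - start; this convention is an assumption.
        return (v.end - v.start) >= MIN_SV_LENGTH
    return True


# Toy records (invented for illustration)
smvs = [
    SmallVariant("scf1", 100, 29.9),
    SmallVariant("scf1", 200, 30.0),
    SmallVariant("scf1", 300, 812.5),
]
svs = [
    StructuralVariant("scf1", 1000, 1200, "DEL", 12),  # too short
    StructuralVariant("scf1", 5000, 5400, "DEL", 4),   # too little support
    StructuralVariant("scf1", 9000, 9600, "INV", 5),   # kept
    StructuralVariant("scf2", 100, 100, "INS", 7),     # kept, no length filter
]

kept_smvs = [v for v in smvs if keep_smv(v)]
kept_svs = [v for v in svs if keep_sv(v)]
print(len(kept_smvs), len(kept_svs))
```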

GO analysis

To narrow down the candidate gene list, GO enrichment analysis was performed on the gene regions carrying variants using the R package topGO v2.30.1¹³⁴; the custom GO annotations were generated based on the InterProScan mappings. GO topology was accounted for (method weight) and enrichment was assessed via a Fisher’s exact test with a cutoff of p ≤ 0.001. See details on the GO analysis in the Supplementary Information.
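For readers unfamiliar with the underlying test, the sketch below computes a one-sided Fisher's exact test for over-representation of a single GO term as a hypergeometric upper tail. It corresponds to a classic per-term test only: the topology-aware weighting applied by topGO (method weight) is not reproduced here, and the function name and the toy counts are hypothetical.

```python
from math import comb

P_CUTOFF = 0.001  # significance cut-off stated in the text


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test (over-representation).

    N: number of genes in the annotated universe
    K: genes in the universe annotated with the GO term
    n: genes in the selected (e.g. variant-carrying) set
    k: selected genes annotated with the GO term
    Returns P(X >= k) under the hypergeometric distribution.
    """
    if not (0 <= k <= n <= N and 0 <= K <= N):
        raise ValueError("inconsistent counts")
    upper = min(n, K)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible table configurations contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


# Toy counts (invented): 20 genes, 5 with the term, 4 selected, 3 of them hits.
p = fisher_enrichment_p(k=3, n=4, K=5, N=20)
print(p, p <= P_CUTOFF)
```

In the toy case the tail sums two tables (3 and 4 annotated genes among the 4 selected), giving 155/4845, which is well above the cut-off of 0.001 and would therefore not be reported as enriched.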

Ethics approval and consent to participate

Animal treatment reported in this paper complies with the standards of the Animal Welfare Act in Austria and the European Community Directive 86/609. Fish were kept in our certified aquarium facility at the Institute of Biology, University of Graz. Individuals were sampled by CSt and SK, euthanized using an overdose of clove oil and decapitated, conforming to the Austrian Animal Welfare legislation. According to the Austrian Animal Experiments Act (TVG, BGBl. Nr. 501/1989, last changed by BGBl. I Nr. 162/2005), approval was not required because no experimental treatment was performed.

Consent for publication

Not applicable.

Data availability

The genome drafts were uploaded to EBI, TM: [GCA_902810505], PT: [GCA_902810495]; the genome and transcriptome assemblies (FASTA), the structural and functional annotations (GFF3), read mappings (BAM) and additional IGV³³ track files are available at https://cichlidgenomes.tugraz.at.

References

Van der Laan, R. & Fricke, R. Eschmeyer's Catalog of Fishes Family Group Names. http://www.calacademy.org/scientists/catalog-of-fishes-family-group-names (2020).
Greenwood, P. H. African cichlids and evolutionary theories. In Evolution of Fish Species Flock (eds Echelle, A. A. & Kornfield, I.) 141–154 (University of Maine at Orono Press, Orono, 1984).
Muschick, M., Indermaur, A. & Salzburger, W. Convergent evolution within an adaptive radiation of cichlid fishes. Curr. Biol. 22, 2362–2368 (2012).
Wagner, C. E., Harmon, L. J. & Seehausen, O. Ecological opportunity and sexual selection together predict adaptive radiation. Nature 487, 366–369 (2012).
Tiercelin, J.-J. & Mondeguer, A. The geology of the Tanganyika trough. In Lake Tanganyika and its Life (ed. Coulter, G. W.) 7–48 (Oxford University Press, Oxford, 1991).
Irisarri, I. et al. Phylogenomics uncovers early hybridization and adaptive loci shaping the radiation of Lake Tanganyika cichlid fishes. Nat. Commun. 9, 3159 (2018).
Salzburger, W., Meyer, A., Baric, S., Verheyen, E. & Sturmbauer, C. Phylogeny of the Lake Tanganyika Cichlid species flock and its relationship to the Central and East African Haplochromine Cichlid Fish Faunas. Syst. Biol. 51, 113–135 (2002).
Salzburger, W., Mack, T., Verheyen, E. & Meyer, A. Out of Tanganyika: genesis, explosive speciation, key-innovations and phylogeography of the haplochromine cichlid fishes. BMC Evol. Biol. 5, 17 (2005).
Koblmüller, S. et al. Age and spread of the haplochromine cichlid fishes in Africa. Mol. Phylogenet. Evol. 49, 153–169 (2008).
Sturmbauer, C., Salzburger, W., Duftner, N., Schelly, R. & Koblmüller, S. Evolutionary history of the Lake Tanganyika cichlid tribe Lamprologini (Teleostei: Perciformes) derived from mitochondrial and nuclear DNA data. Mol. Phylogenet. Evol. 57, 266–284 (2010).
Sturmbauer, C., Levinton, J. S. & Christy, J. Molecular phylogeny analysis of fiddler crabs: test of the hypothesis of increasing behavioral complexity in evolution. Proc. Natl. Acad. Sci. U. S. A. 93, 10855–10857 (1996).
Joyce, D. A. et al. An extant cichlid fish radiation emerged in an extinct Pleistocene lake. Nature 435, 90–95 (2005).
Katongo, C., Koblmüller, S., Duftner, N., Mumba, L. & Sturmbauer, C. Evolutionary history and biogeographic affinities of the serranochromine cichlids in Zambian rivers. Mol. Phylogenet. Evol. 45, 326–338 (2007).
Sturmbauer, C., Koblmüller, S., Sefc, K. M. & Duftner, N. Phylogeographic history of the genus Tropheus, a lineage of rock-dwelling cichlid fishes endemic to Lake Tanganyika. Hydrobiologia 542, 335–366 (2005).
Meier, J. I. et al. Ancient hybridization fuels rapid cichlid fish adaptive radiations. Nat. Commun. 8, 14363 (2017).
Svardal, H. et al. Ancestral hybridization facilitated species diversification in the Lake Malawi Cichlid fish adaptive radiation. Mol. Biol. Evol. 37, 1100–1113 (2020).
Kullander, S. O. & Roberts, T. R. Out of Tanganyika: endemic lake fishes inhabit rapids of the Lukuga River. Ichthyol. Explor. Freshw. 22, 355–376 (2011).
West-Eberhard, M.-J. Developmental Plasticity and Evolution (Oxford University Press, Oxford, 2003).
Rossiter, A. The Cichlid fish assemblages of Lake Tanganyika: ecology, behaviour and evolution of its species flocks. In Advances in Ecological Research (eds Begon, M. & Fitter, A. H.) 187–252 (Academic Press Ltd., London, 1995).
Malinsky, M. et al. Whole-genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow. Nat. Ecol. Evol. 2, 1940–1955 (2018).
Brawand, D. et al. The genomic substrate for adaptive radiation in African cichlid fish. Nature 513, 375–381 (2014).
Liem, K. F. Evolutionary strategies and morphological innovations: Cichlid Pharyngeal Jaws. Syst Biol. 22, 425–441 (1973).
Carleton, K. L., Dalton, B. E., Escobar-Camacho, D. & Nandamuri, S. P. Proximate and ultimate causes of variable visual sensitivities: Insights from cichlid fish radiations. Genesis 54, 299–325 (2016).
Maan, M. E. & Sefc, K. M. Colour variation in cichlid fish: Developmental mechanisms, selective pressures and evolutionary consequences. Semin. Cell. Dev. Biol. 24, 516–528 (2013).
Salzburger, W. Understanding explosive diversification through cichlid fish genomics. Nat. Rev. Genet. 19, 705–717 (2018).
Malinsky, M. Andinoacara coeruleopunctatus Genome Browser Gateway. http://em-x1.gurdon.cam.ac.uk/cgi-bin/hgGateway?hgsid=6400&clade=vertebrate&org=A.+coeruleopunctatus&db=0 (2015).
Conte, M. A. et al. Chromosome-scale assemblies reveal the structural evolution of African cichlid genomes. GigaScience 8, giz030 (2019).
Thibaud-Nissen, F. et al. P8008 the NCBI eukaryotic genome annotation pipeline. J. Anim. Sci. 94, 184 (2016).
Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).
Conte, M. A., Gammerdinger, W. J., Bartie, K. L., Penman, D. J. & Kocher, T. D. A high quality assembly of the Nile Tilapia (Oreochromis niloticus) genome reveals the structure of two sex determination regions. bioRxiv https://doi.org/10.1101/099564 (2017).
Vij, S. et al. Chromosomal-level assembly of the Asian Seabass genome using long sequence reads and multi-layered scaffolding. PLoS Genet. 12, e1005954 (2016).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. http://www.repeatmasker.org (2015).
Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Parra, G., Bradnam, K. & Korf, I. CEGMA: A pipeline to accurately annotate core genes in eukaryotic genomes. Bioinformatics 23, 1061–1067 (2007).
Dohmen, E., Kremer, L. P. M., Bornberg-Bauer, E. & Kemena, C. DOGMA: Domain-based transcriptome and proteome quality assessment. Bioinformatics 32, 2577–2581 (2016).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff. Fly 6, 80–92 (2012).
Hunt, M. et al. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 14, R47 (2013).
Asalone, K. C. et al. Regional sequence expansion or collapse in heterozygous genome assemblies. PLoS Comput. Biol. 16, e1008104 (2020).
Conte, M. A. & Kocher, T. D. An improved genome reference for the African cichlid Metriaclima zebra. BMC Genomics 16, 724 (2015).
Finn, R. D. et al. The Pfam protein families database. Nucleic Acids Res. 38, D211–D222 (2010).
McKenna, A. et al. The genome analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
Rausch, T. et al. DELLY: Structural variant discovery by integrated paired-end and split-read analysis. Bioinformatics 28, i333–i339 (2012).
Liu, Y. et al. Comparison of multiple algorithms to reliably detect structural variants in pears. BMC Genomics 21, 61 (2020).
Supernat, A., Vidarsson, O. V., Steen, V. M. & Stokowy, T. Comparison of three variant callers for human whole genome sequencing. Sci. Rep. 8, 17851 (2018).
McCarthy, D. J. et al. Choice of transcripts and software has a large effect on variant annotation. Genome Med. 6, 26 (2014).
Gunter, H. M., Schneider, R. F., Karner, I., Sturmbauer, C. & Meyer, A. Molecular investigation of genetic assimilation during the rapid adaptive radiations of East African cichlid fishes. Mol. Ecol. 26, 6634–6653 (2017).
Navon, D. et al. Hedgehog signaling is necessary and sufficient to mediate craniofacial plasticity in teleosts. Proc. Natl. Acad. Sci. U. S. A. 117, 19321–19327 (2020).
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: From polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Adhikari, K. et al. A genome-wide association scan implicates DCHS2, RUNX2, GLI3, PAX1 and EDAR in human facial variation. Nat. Commun. 7, 11616 (2016).
Liu, F. et al. A genome-wide association study identifies five loci influencing facial morphology in Europeans. PLoS Genet. 8, e1002932 (2012).
Claes, P. et al. Genome-wide mapping of global-to-local genetic effects on human facial shape. Nat. Genet. 50, 414–423 (2018).
Lupo, G., Harris, W. A. & Lewis, K. E. Mechanisms of ventral patterning in the vertebrate nervous system. Nat. Rev. Neurosci. 7, 103–114 (2006).
Dworkin, S., Boglev, Y., Owens, H. & Goldie, S. J. The role of sonic hedgehog in craniofacial patterning, morphogenesis and cranial neural crest survival. J. Dev. Biol. 4, 24 (2016).
Szabo-Rogers, H. L., Smithers, L. E., Yakob, W. & Liu, K. J. New directions in craniofacial morphogenesis. Dev. Biol. 341, 84–94 (2010).
Zhou, H., Kim, S., Ishii, S. & Boyer, T. G. Mediator modulates Gli3-dependent Sonic hedgehog signaling. Mol. Cell Biol. 26, 8667–8682 (2006).
Vilhais-Neto, G. C. et al. Rere controls retinoic acid signalling and somite bilateral symmetry. Nature 463, 953–957 (2010).
Clouthier, D. E., Garcia, E. & Schilling, T. F. Regulation of facial morphogenesis by endothelin signaling: Insights from mice and fish. Am. J. Med. Genet. A 152A, 2962–2973 (2010).
Fischer, C. et al. Complete mitochondrial DNA sequences of the Threadfin Cichlid (Petrochromis trewavasae) and the Blunthead Cichlid (Tropheus moorii) and patterns of mitochondrial genome evolution in cichlid fishes. PLoS ONE 8, e67048 (2013).
Andrews, S. FastQC: A quality control tool for high throughput sequence data. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ (2016).
Marçais, G. & Kingsford, C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
Davis, M. P. A., van Dongen, S., Abreu-Goodger, C., Bartonicek, N. & Enright, A. J. Kraken: A set of tools for quality control and analysis of high-throughput sequence data. Methods 63, 41–49 (2013).
Wingett, S. W. & Andrews, S. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Res. 7, 1338 (2018).
Schmieder, R. & Edwards, R. Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS ONE 6, e17288 (2011).
Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet J. 17, 10–12 (2011).
Buffalo, V. Scythe. https://github.com/vsbuffalo/scythe (2014).
CLCbio Assembly Cell. https://www.quiagenbioinformatics.com/products/clc-assembly-cell (2015).
Bushnell, B., Rood, J. & Singer, E. BBMerge—Accurate paired shotgun read merging via overlap. PLoS ONE 12, e0185056 (2017).
Xu, H. et al. FastUniq: A fast de novo duplicates removal tool for paired short reads. PLoS ONE 7, e52249 (2012).
Leggett, R. M., Clavijo, B. J., Clissold, L., Clark, M. D. & Caccamo, M. NextClip: An analysis and read preparation tool for Nextera Long Mate Pair libraries. Bioinformatics 30, 566–568 (2014).
Barnett, D. W., Garrison, E. K., Quinlan, A. R., Strömberg, M. P. & Marth, G. T. BamTools: a C++ API and toolkit for analyzing and managing BAM files. Bioinformatics 27, 1691–1692 (2011).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Broad Institute Picard Tools. https://github.com/broadinstitute/picard (2016).
Hackl, T., Hedrich, R., Schultz, J. & Förster, F. proovread: large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 30, 3004–3011 (2014).
Zimin, A. V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669–2677 (2013).
Le, H. S., Schulz, M. H., McCauley, B. M., Hinman, V. F. & Bar-Joseph, Z. Probabilistic error correction for RNA sequencing. Nucleic Acids Res. 41, e109 (2013).
Song, L. & Florea, L. Rcorrector: efficient and accurate error correction for Illumina RNA-seq reads. GigaScience 4, 48 (2015).
Liu, Y., Schröder, J. & Schmidt, B. Musket: A multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 29, 308–315 (2013).
Liu, B. et al. Estimation of genomic characteristics by analyzing k-mer frequency in de novo genome projects. arXiv:1308.2012 (2013).
Denisov, G. et al. Consensus generation and variant detection by Celera Assembler. Bioinformatics 24, 1035–1040 (2008).
Kajitani, R. et al. Efficient de novo assembly of highly heterozygous genomes from whole-genome shotgun short reads. Genome Res. 24, 1384–1395 (2014).
Pryszcz, L. P. & Gabaldón, T. Redundans: An assembly pipeline for highly heterozygous genomes. Nucleic Acids Res. 44, e113 (2016).
Boetzer, M., Henkel, C. V., Jansen, H. J., Butler, D. & Pirovano, W. Scaffolding pre-assembled contigs using SSPACE. Bioinformatics 27, 578–579 (2011).
Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 1, 18 (2012).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Frith, M. C., Wan, R. & Horton, P. Incorporating sequence quality data into alignment improves DNA read mapping. Nucleic Acids Res. 38, e100 (2010).
English, A. C. et al. Mind the Gap: Upgrading genomes with Pacific Biosciences RS long-read sequencing technology. PLoS ONE 7, e47768 (2012).
Chaisson, M. J. & Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): application and theory. BMC Bioinform. 13, 238 (2012).
Wences, A. H. & Schatz, M. C. Metassembler: Merging and optimizing de novo genome assemblies. Genome Biol. 16, 207 (2015).
Gurevich, A., Saveliev, V., Vyahhi, N. & Tesler, G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 29, 1072–1075 (2013).
Kosugi, S., Hirakawa, H. & Tabata, S. GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments. Bioinformatics 31, 3733–3741 (2015).
Kurtz, S. et al. Versatile and open software for comparing large genomes. Genome Biol. 5, R12 (2004).
Camacho, C. et al. BLAST+: Architecture and applications. BMC Bioinform. 10, 421 (2009).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Paulino, D. et al. Sealer: A scalable gap-closing application for finishing draft genomes. BMC Bioinform. 16, 230 (2015).
Simpson, J. T. et al. ABySS: A parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).
Ponstingl, H. & Ning, Z. SMALT. https://www.sanger.ac.uk/science/tools/smalt-0 (2018).
Birney, E., Clamp, M. & Durbin, R. GeneWise and genomewise. Genome Res. 14, 988–995 (2004).
Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: Interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011).
Stanke, M. & Morgenstern, B. Augustus: A web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Res. 33, W465–W467 (2005).
Grabherr, M. G. et al. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nat. Biotechnol. 29, 644–652 (2011).
Haas, B. J. et al. De novo transcript sequence reconstruction from RNA-Seq: reference generation and analysis with Trinity. Nat. Protoc. 8, 1494–1512 (2013).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Wu, T. D. & Watanabe, C. K. GMAP: A genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
Kent, W. J. BLAT—The BLAST-like alignment tool. Genome Res. 12, 656–664 (2002).
Oracle Inc. MySQL. https://www.mysql.com (2016).
Cantarel, B. L. et al. MAKER: An easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196 (2008).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Lomsadze, A., Ter-Hovhannisyan, V., Chernoff, Y. O. & Borodovsky, M. Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494–6506 (2005).
Korf, I. Gene finding in novel genomes. BMC Bioinform. 5, 59 (2004).
Schattner, P., Brooks, A. N. & Lowe, T. M. The tRNAscan-SE, snoscan and snoGPS web servers for the detection of tRNAs and snoRNAs. Nucleic Acids Res. 33, W686–W689 (2005).
Palmer, J. M. Funannotate: a fungal genome annotation and comparative genomics pipeline. https://github.com/nextgenusfs/funannotate (2016).
Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M. & Stanke, M. BRAKER1: Unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767–769 (2016).
Lomsadze, A., Burns, P. D. & Borodovsky, M. Integration of mapped RNA-Seq reads into automatic training of eukaryotic gene finding algorithm. Nucleic Acids Res. 42, e119 (2014).
Pertea, M. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295 (2015).
Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).
Haas, B. J. et al. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biol. 9, R7 (2008).
Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
Griffiths-Jones, S., Bateman, A., Marshall, M., Khanna, A. & Eddy, S. R. Rfam: an RNA family database. Nucleic Acids Res. 31, 439–441 (2003).
Wucher, V. et al. FEELnc: A tool for long non-coding RNAs annotation and its application to the dog transcriptome. bioRxiv https://doi.org/10.1101/064436 (2016).
Thiel, T., Michalek, W., Varshney, R. K. & Graner, A. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L.). Theor. Appl. Genet. 106, 411–422 (2003).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).
Jurka, J. W. RepBase. https://www.girinst.org/server/RepBase (2016).
Smit, A. F. A. & Hubley, R. RepeatModeler Open-1.0. http://www.repeatmasker.org (2014).
Price, A. L., Jones, N. C. & Pevzner, P. A. De novo identification of repeat families in large genomes. Bioinformatics 21, i351–i358 (2005).
Benson, G. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).
Jones, P. et al. InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240 (2014).
Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).
Rawlings, N. D., Barrett, A. J. & Finn, R. Twenty years of the MEROPS database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 44, D343–D350 (2016).
Yin, Y. et al. dbCAN: A web resource for automated carbohydrate-active enzyme annotation. Nucleic Acids Res. 40, W445–W451 (2012).
Petersen, T. N., Brunak, S., von Heijne, G. & Nielsen, H. SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat. Methods 8, 785–786 (2011).
Okonechnikov, K., Conesa, A. & García-Alcalde, F. Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics 32, 292–294 (2016).
Sterne-Weiler, T., Weatheritt, R. J., Best, A. J., Ha, K. C. H. & Blencowe, B. J. Efficient and accurate quantitative profiling of alternative splicing patterns of any complexity on a laptop. Mol. Cell 72, 187–200 (2018).
Alexa, A., Rahnenführer, J. & Lengauer, T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22, 1600–1607 (2006).
Li, Y., Xiang, J. & Duan, C. Insulin-like growth factor-binding protein-3 plays an important role in regulating pharyngeal skeleton and inner ear formation and differentiation. J. Biol. Chem. 280, 3613–3620 (2005).
Lin, J. M. et al. Actions of fibroblast growth factor-8 in bone cells in vitro. Am. J. Physiol. Endocrinol. Metab. 297, E142–E150 (2009).
Nichols, J. T., Pan, L., Moens, C. B. & Kimmel, C. B. barx1 represses joints and promotes cartilage in the craniofacial skeleton. Development 140, 2765–2775 (2013).
Bush, J. O., Lan, Y. & Jiang, R. The cleft lip and palate defects in Dancer mutant mice result from gain of function of the Tbx10 gene. Proc. Natl. Acad. Sci. U. S. A. 101, 7022–7027 (2004).
Vieira, A. R. et al. Medical sequencing of candidate genes for nonsyndromic cleft lip and palate. PLoS Genet. 1, e64 (2005).
Papaioannou, V. E. The T-box gene family: Emerging roles in development, stem cells and cancer. Development 141, 3819–3833 (2014).
Kang, Y. J., Stevenson, A. K., Yau, P. M. & Kollmar, R. Sparc protein is required for normal growth of zebrafish otoliths. J. Assoc. Res. Otolaryngol. 9, 436–451 (2008).
Rosset, E. M. & Bradshaw, A. D. SPARC/osteonectin in mineralized tissue. Matrix Biol. 52–54, 78–87 (2016).
Zarelli, V. E. & Dawid, I. B. Inhibition of neural crest formation by Kctd15 involves regulation of transcription factor AP-2. Proc. Natl. Acad. Sci. U. S. A. 110, 2870–2875 (2013).
Zhang, Z., Huynh, T. & Baldini, A. Mesodermal expression of Tbx1 is necessary and sufficient for pharyngeal arch and cardiac outflow tract development. Development 133, 3587–3595 (2006).
Yutzey, K. E. DiGeorge syndrome, Tbx1, and retinoic acid signaling come full circle. Circ. Res. 106, 630–632 (2010).
Ghassibe-Sabbagh, M. et al. FAF1, a gene that is disrupted in cleft palate and has conserved function in Zebrafish. Am. J. Hum. Genet. 88, 150–161 (2011).
Wilm, T. P. & Solnica-Krezel, L. Essential roles of a zebrafish prdm1/blimp1 homolog in embryo patterning and organogenesis. Development 132, 393–404 (2005).
Wang, L., Rajan, H., Pitman, J. L., McKeown, M. & Tsai, C. C. Histone deacetylase-associating Atrophin proteins are nuclear receptor corepressors. Genes Dev. 20, 525–530 (2006).
Plaster, N., Sonntag, C., Schilling, T. F. & Hammerschmidt, M. REREa/Atrophin-2 interacts with histone deacetylase and Fgf8 signaling to regulate multiple processes of zebrafish development. Dev. Dyn. 236, 1891–1904 (2007).
Jordan, V. K. et al. Genotype–phenotype correlations in individuals with pathogenic RERE variants. Hum. Mutat. 39, 666–675 (2018).
Diepeveen, E. T., Kim, F. D. & Salzburger, W. Sequence analyses of the distal-less homeobox gene family in East African cichlid fishes reveal signatures of positive selection. BMC Evol. Biol. 13, 153 (2013).
Stock, D. W. et al. The evolution of the vertebrate Dlx gene family. Proc. Natl. Acad. Sci. U. S. A. 93, 10858–10863 (1996).
Mark, M., Ghyselinck, N. B. & Chambon, P. Function of retinoic acid receptors during embryonic development. Nucl. Recept. Signal. 7, e002 (2009).
Linville, A., Radtke, K., Waxman, J. S., Yelon, D. & Schilling, T. F. Combinatorial roles for zebrafish retinoic acid receptors in the hindbrain, limbs and pharyngeal arches. Dev. Biol. 325, 60–70 (2009).
Swartz, M. E., Sheehan-Rooney, K., Dixon, M. J. & Eberhart, J. K. Examination of a palatogenic gene program in Zebrafish. Dev. Dyn. 240, 2204–2220 (2011).
Iwata, J. et al. Transforming growth factor-beta regulates basal transcriptional regulatory machinery to control cell proliferation and differentiation in cranial neural crest-derived osteoprogenitor cells. J. Biol. Chem. 285, 4975–4982 (2010).
Prochazkova, M., Prochazka, J., Marangoni, P. & Klein, O. D. Bones, glands, ears and more: The multiple roles of FGF10 in craniofacial development. Front. Genet. 9, 542 (2018).
Du, J. et al. Different expression patterns of Gli1-3 in mouse embryonic maxillofacial development. Acta Histochem. 114, 620–625 (2012).

Acknowledgements

We thank Viola Nolte for expert technical support during sequencing, and Wolfgang Gessl for fish keeping and photographs.

Funding

This work was supported by the Austrian Science Fund projects [FWF Grants P22737 and P29838]. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. The purchase of the Pacific Biosciences RS II instrument at the University of Lausanne was financed in part by the Loterie Romande through the Fondation pour la Recherche en Médecine Génétique.

Author information

Authors and Affiliations

Institute of Biology, University of Graz, Graz, Austria
C. Fischer, S. Koblmüller, C. Börger & C. Sturmbauer
Institute of Biomedical Informatics, Graz University of Technology, Graz, Austria
C. Fischer & G. G. Thallinger
Center for Medical Research, Medical University of Graz, Graz, Austria
G. Michelitsch, S. Trajanoski & C. Guelly
Institut für Populationsgenetik, Vetmeduni Vienna, Vienna, Austria
C. Schlötterer
BioTechMed-Graz, Graz, Austria
G. G. Thallinger & C. Sturmbauer

Contributions

Conceived the study: C.St. Designed the experiments: C.St., G.G.T., C.G. and C.Sch. Performed the experiments: S.K., C.B., G.M. and S.T. Analyzed the data: C.F. Wrote the paper: C.F., G.G.T., C.St. Contributed to the manuscript: S.K., C.G., C.Sch. Approved the final manuscript: all authors.

Corresponding authors

Correspondence to G. G. Thallinger or C. Sturmbauer.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Fischer, C., Koblmüller, S., Börger, C. et al. Genome sequences of Tropheus moorii and Petrochromis trewavasae, two eco-morphologically divergent cichlid fishes endemic to Lake Tanganyika. Sci Rep 11, 4309 (2021). https://doi.org/10.1038/s41598-021-81030-z

Received: 05 May 2020
Accepted: 28 December 2020
Published: 22 February 2021
DOI: https://doi.org/10.1038/s41598-021-81030-z
