Cichlid fishes are famous for large, diverse and replicated adaptive radiations in the Great Lakes of East Africa. To understand the molecular mechanisms underlying cichlid phenotypic diversity, we sequenced the genomes and transcriptomes of five lineages of African cichlids: the Nile tilapia (Oreochromis niloticus), an ancestral lineage with low diversity; and four members of the East African lineage: Neolamprologus brichardi/pulcher (older radiation, Lake Tanganyika), Metriaclima zebra (recent radiation, Lake Malawi), Pundamilia nyererei (very recent radiation, Lake Victoria), and Astatotilapia burtoni (riverine species around Lake Tanganyika). We found an excess of gene duplications in the East African lineage compared to tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs. In addition, we analysed sequence data from sixty individuals representing six closely related species from Lake Victoria, and show genome-wide diversifying selection on coding and regulatory variants, some of which were recruited from ancient polymorphisms. We conclude that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.
Wide variation in the rates of diversification among lineages is a feature of evolution that has fascinated biologists since Darwin1,2. With approximately 2,000 known species, hundreds of which coexist in individual African lakes, cichlid fish are amongst the most striking examples of adaptive radiation, the phenomenon whereby a single lineage diversifies into many ecologically varied species in a short span of time3 (Fig. 1). The largest radiations, in Lakes Victoria, Malawi and Tanganyika, have generated between 250 (Tanganyika) and 500 (Malawi and Victoria) species per lake, and took no more than 15,000 to 100,000 years for Victoria and less than 5 million years for Malawi3,4,5, but 10–12 million years for Lake Tanganyika6. The radiations in Lake Victoria and Malawi thus display the highest sustained rates of speciation known to date in vertebrates7. The evolution of these lineages and their genomes has presumably been shaped by cycles of population expansion, fragmentation and contraction as lineages colonized lakes, diversified, collapsed when lakes dried up, and re-colonized lakes, and by episodic adaptation to a multitude of ecological niches coupled with strong sexual selection. Genetic diversity within lake radiations has been influenced by admixture following multiple colonization events and periodic infusions through hybridization8,9.
Cichlid phenotypic diversity encompasses variation in behaviour, body shape, coloration and ecological specialization. The frequent occurrence of convergent evolution of similar ecotypes (Fig. 1) suggests a primary role of natural selection in shaping cichlid phenotypic diversity10,11. In addition, the importance of sexual selection is demonstrated by a profusion of exaggerated sexually dimorphic traits like male nuptial colour and elaborate bower building by males3. Ecological and sexual selection converge in the cichlid visual system, where trichromatic colour vision, eight different opsin genes and novel spherical lenses promote sensitivity in the highly dimensional visual world of clear-water lakes12,13,14. Rapidly evolving sex determination systems, often linked to male and female colour patterns, may also speed cichlid diversification15,16. Ecological, social and behavioural variation correlates with striking diversity in brain structures17 that appears early in development18.
Exceptional phenotypic variation, even among closely related species, makes cichlids different from most other fish groups, including those that share the same habitats with them but have not diversified as much, as well as those that have radiated into much smaller species flocks in northern temperate lakes19. However, how cichlids evolve in this exceptionally highly dimensional phenotype space remains unexplained.
We sequenced the genomes of five representative cichlid species from throughout the East African haplo-tilapiine lineage (Extended Data Fig. 1a), which gave rise to all East African cichlid radiations. These five lineages diverged primarily through geographical isolation, and three of them subsequently underwent adaptive radiations in the three largest lakes of Africa (Fig. 1). Here we describe the comparative analyses of the five genomes coupled with an analysis of the genetic basis of species divergence in the Lake Victoria species flock to examine the genomic substrate for rapid evolutionary diversification.
Accelerated gene evolution
To assess whether accelerated sequence evolution was a general feature of East African cichlids, we annotated the genomes of all five cichlids (Extended Data Fig. 1a) and estimated the nonsynonymous/synonymous nucleotide substitution (dN/dS) ratio by sampling the concatenated alignments of all genes annotated with particular gene ontology (GO) terms. An elevated rate of nonsynonymous nucleotide substitutions can indicate accelerated evolution (either due to relaxed constraint or positive selection); this approach has been applied previously in the context of cichlid vision13 and morphology20,21. We obtained significantly higher dN/dS ranks in O. niloticus (89 terms) compared to stickleback (11 terms), but considerably higher ranks still in the lineages of the East African radiation, haplochromines (299 terms) and N. brichardi (254 terms) (Extended Data Fig. 1b). In general, terms involved in morphological and developmental processes ranked significantly higher in haplochromines than in O. niloticus (P value = 0.036, Mann–Whitney U-test).
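As a rough illustration of the rank comparison described above, the following minimal sketch computes a Mann–Whitney U statistic for two groups of values. The function names and the example dN/dS values are hypothetical and are not taken from the study; the sketch only shows how the statistic is built from ranks, with tied values sharing their average rank.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """U statistic for sample x: count of (x, y) pairs with x > y, ties count 0.5."""
    ranks = average_ranks(list(x) + list(y))
    rank_sum_x = sum(ranks[:len(x)])
    return rank_sum_x - len(x) * (len(x) + 1) / 2


# Hypothetical per-term dN/dS values for two lineages (illustrative only).
lineage_a = [0.30, 0.25, 0.40]
lineage_b = [0.10, 0.20, 0.25]
u_a = mann_whitney_u(lineage_a, lineage_b)
```

A U value close to the product of the two sample sizes indicates that values in the first group tend to outrank those in the second; converting U to a P value requires an exact or normal-approximation step that is omitted here.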
Amongst protein-coding genes with an increased number of nonsynonymous variants in haplochromines compared to N. brichardi and O. niloticus, two developmental genes, nog2 and bmpr1b, emerged showing haplochromine-specific substitutions. This result is notable given that three genes, a ligand (bmp4)21, a receptor (bmpr1b) and an antagonist (nog2) in the BMP pathway, all known to influence cichlid jaw morphology, show accelerated rates of protein evolution in haplochromine cichlids.
Of 22 candidate genes previously identified in teleost morphogenesis, vision and pigmentation, three are predicted to have undergone accelerated evolution in the common ancestors of the East African radiations suggesting a role in the diversification of cichlids: endothelin receptor type B1 (ednrb1) affects colour patterning22 and perhaps pharyngeal jaw development (Extended Data Fig. 2); green-sensitive opsin (kfh-g) and Rhodopsin (rho) are proteins important in vision.
Gene duplication allows for subsequent divergent evolution of the resultant gene copies, enabling functional innovation of the proteins and/or expression patterns23. East African cichlids, including Oreochromis niloticus, possess an unexpectedly large number of gene duplicates. We find 280 duplications in the lineage leading to the common ancestor of the lake radiations and 148 events in the common ancestor of the haplochromines. When normalizing for branch lengths this corresponds to an approximately 4.5- to 6-fold increase in gene duplications that occurred in the common ancestor of the East African lake radiations relative to older clades, and an even higher duplication rate in the common ancestor of just the haplochromines (Fig. 2, Extended Data Fig. 3a–c).
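The branch-length normalization mentioned above amounts to dividing the number of inferred duplication events by the length of the branch on which they occurred, then comparing rates between branches. A minimal sketch follows; the count of 280 events comes from the text, whereas the branch length and the background rate are made-up placeholders, since the underlying values are not given in this section.

```python
def duplication_rate(events, branch_length):
    """Duplication events per unit of branch length."""
    if branch_length <= 0:
        raise ValueError("branch_length must be positive")
    return events / branch_length


def fold_increase(focal_rate, background_rate):
    """Ratio of a focal branch's rate to a background (older clade) rate."""
    if background_rate <= 0:
        raise ValueError("background_rate must be positive")
    return focal_rate / background_rate


# 280 events is reported in the text; 0.5 and 112.0 are hypothetical placeholders.
ancestor_rate = duplication_rate(280, 0.5)
fold = fold_increase(ancestor_rate, 112.0)
```

The units of branch length (substitutions per site or time) do not matter for the fold change as long as the focal and background branches use the same units.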
Inferred duplication rates in ancestral populations exceeded those in the extant taxa (Fig. 2). This could reflect the technical challenge of separating young, near-identical gene paralogues or true reduced rates in each lake radiation. Additionally, we could be underestimating lineage-specific rates of duplication owing to the sampling of a single species per radiation, if duplications accumulate during speciation but only some become fixed.
Cichlid-specific gene duplicates do not show statistically significant enrichment for particular gene categories (Supplementary Information). Expansion of the olfactory receptor gene family, which is a frequent feature of vertebrate evolution24, was also seen in O. niloticus, but not in any of the lake cichlids (Extended Data Fig. 4; Supplementary Information). Retained duplicated genes are known to often diverge in function through neo- or subfunctionalization25, and this has been suggested as part of the reason why bony fish generally are so species-rich (more than 50% of all known species of vertebrates are fish). Moreover, differential retention of alternative copies of duplicated genes through the process of divergent resolution has been suggested to promote speciation rates directly26.
Differences in the expression patterns of duplicate genes may contribute to evolutionary divergence of species. The expression patterns of 888 duplicate gene pairs from the common ancestor of the East African cichlids were categorized according to whether they are expressed widely among tissues (52.8%), are similarly restricted in their expression patterns for both gene copies (26.6%), or, in at least one gene copy, have newly gained expression in one or more tissues (20.6%). In total, 7.5% of duplicates lost or gained complete tissue specificity, and many of these (43%) have gained specific expression in the testis. In each of the stomatin and RNF141 gene pairs, one gene copy is broadly expressed whereas expression of the other is restricted to the testis (Extended Data Fig. 3d). RNF141 is the zebrafish orthologue of the human ZNF230, a transcription factor suggested to have a role during spermatogenesis. This observation is particularly interesting in the context of strong sexual selection14 observed in many East African cichlids15,16, including our sequenced species with the exception of N. brichardi.
Transposable element insertions alter gene expression
As in other teleosts, approximately 16–19% of the four East African cichlid genomes consist of transposable elements (TEs), and over 60% of cichlid TEs are DNA transposons (Extended Data Fig. 5; Supplementary Information). Three waves of TE insertions were detected in each of the cichlid genomes (Extended Data Fig. 6a–f), including a cichlid-specific burst of the Tigger family27. Notably, this TE family has continued expanding in the youngest radiation, Lake Victoria (Extended Data Fig. 6a).
We analysed the distribution of TE insertions near the 5′ untranslated region (5′ UTR; 0–20 kilobases upstream), or 3′ UTR (0–20 kb downstream) of orthologous gene pairs. We find that genes with TE insertions near the 5′ UTRs are significantly associated with increased gene expression in all tissues (false discovery rate (FDR) < 0.05, Mann–Whitney test, Extended Data Fig. 7a) compared to genes without TE insertions. In contrast, TE insertions near 3′ UTRs are significantly associated with increased gene expression in all tissues except brain and skeletal muscle (FDR < 0.05, Mann–Whitney U-test).
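Because the association is tested separately in each tissue, the resulting P values need a multiple-testing correction before applying the FDR < 0.05 threshold. The sketch below implements the Benjamini–Hochberg procedure, which is one common way to control the FDR; the text does not state which procedure was used, so this choice is an assumption, and the per-tissue P values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw P values, one per tissue (illustrative only).
raw = [0.01, 0.04, 0.03, 0.20]
q_values = benjamini_hochberg(raw)
significant = [q < 0.05 for q in q_values]
```

With these placeholder inputs only the first test remains below 0.05 after adjustment, which shows how a raw P value under 0.05 can fail the FDR threshold.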
Generally, when inserted within or near genes in the transcriptional sense orientation, TE insertions show the expected pattern of purifying selection. Such TEs often contain polyadenylation signals that result in transcriptional arrest27. In all five cichlid species, intronic TE insertions occur preferentially in the antisense orientation of protein-coding genes, with the strongest bias being observed for long terminal repeats (LTRs) or long interspersed nucleotide repetitive elements (LINEs) (Extended Data Fig. 7b). As expected, intronic DNA transposons and LINEs or LTRs present in intergenic regions fail to show a significant orientation bias, and short interspersed nucleotide repetitive elements (SINE) show a moderate bias for sense insertions (Extended Data Fig. 7c).
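An orientation bias of this kind can be assessed by asking whether sense and antisense insertion counts depart from the 1:1 ratio expected if orientation were random. The following minimal sketch uses an exact two-sided binomial test; the text does not specify the test applied, so this is an assumed approach, and the counts are hypothetical.

```python
from math import comb


def orientation_bias_p(sense, antisense):
    """Exact two-sided binomial P value against a 50:50 orientation ratio."""
    n = sense + antisense
    k = min(sense, antisense)
    # Probability of a count at least as extreme in the smaller tail.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null distribution is symmetric, so doubling one tail is exact.
    return min(1.0, 2 * tail)


# Hypothetical counts of intronic insertions (illustrative only).
p_biased = orientation_bias_p(sense=2, antisense=8)
p_balanced = orientation_bias_p(sense=5, antisense=5)
```

Doubling one tail is valid only because the null probability is 0.5; a test against any other expected ratio would need a different two-sided definition.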
Surprisingly, none of the five cichlid genomes showed any deficit of sense-oriented LINE insertions with approximately 15% divergence, which correspond to a time of transposable element insertions in the common ancestor of the haplo-tilapiine cichlids (Extended Data Fig. 7d). This suggests that ancestral East African cichlids went through an extended period of relaxed purifying selection during which overall TE activity increased (Extended Data Fig. 6a–f). However, in more recent history, haplochromine cichlids showed an increased efficiency in purging potentially deleterious TE insertions (Extended Data Fig. 7d).
Divergence of regulatory elements
To identify potential regulatory sequences that have diverged among the East African cichlids, we first predicted conserved noncoding elements (CNEs)28 in Nile tilapia and eight other teleosts using a 9-way alignment of teleost genomes (zebrafish, Tetraodon, stickleback, medaka and the five cichlids; Supplementary Information). We then identified 13,053 highly conserved noncoding elements (hCNEs) in tilapia and medaka. These are expected to be similarly conserved among the four East African lake cichlids as they shared a common ancestor with Nile tilapia more recently than with medaka. Among these hCNEs we searched for CNEs that exhibited significant changes (accelerated CNEs, aCNEs) (FDR-adjusted P < 0.05). A total of 625 such aCNEs (4.8%) were found to have diverged in one or more of the East African lake cichlids. Whereas the majority of aCNEs (93%) have experienced a higher rate of nucleotide substitutions, approximately a quarter have also experienced insertions (23%) and/or deletions (32%), again suggesting relaxed purifying selection. The aCNEs are distributed in intergenic regions (70%), introns (28%) and UTRs (2%) of protein-coding genes (Supplementary Information).
The largest number of aCNEs is found in N. brichardi (n = 214), with lower numbers found in A. burtoni (n = 140), P. nyererei (n = 129) and M. zebra (n = 142). Approximately 60% of the aCNEs (n = 370) are accelerated in only one lineage. The remaining aCNEs have either accumulated mutations independently in several lineages, or their accelerated evolution was initiated in a common ancestor.
The majority of aCNEs in lake cichlids showed enrichment for nearby genes involved in ‘homophilic cell adhesion’ (P = 5.8 × 10−4) and ‘G-protein coupled receptor activity’ (P = 6.4 × 10−4). To verify the cis-regulatory function of these aCNEs, we assayed the ability of six selected aCNEs and their corresponding O. niloticus hCNEs to drive reporter gene expression in transgenic zebrafish. The assays not only indicated their potential to function as enhancers, but also demonstrated that aCNEs have altered the expression pattern compared to their homologous hCNEs, indicating their potential for altering expression of their target genes in a tissue-specific manner. We illustrate this with an example in Extended Data Fig. 8 (additional examples in Extended Data Fig. 9).
Novel microRNAs alter gene expression
MiRNAs offer yet another effective way of altering gene expression programs. We identified 1,344 miRNA loci (259–286 per cichlid species) from deep sequencing of small RNAs in late stage embryos (Extended Data Fig. 10a). By comparing these loci with known teleost microRNAs (Supplementary Information) we discovered: (1) 40 cases of de novo miRNA emergence and nine cases of apparent miRNA loss; (2) four distinct mature miRNAs with mutation(s) in the seed sequence; (3) at least nine cases of arm switching29; (4) one case of seed shifting29; and (5) 92 distinct miRNAs with mutation(s) outside the seed sequence.
We explored miRNA spatial expression patterns in one case of arm switching (t_mze-miR-7132a-5p and t_mze-miR-7132a-3p) and for four de novo miRNAs (Fig. 3 and Extended Data Fig. 10). In the case of arm switching, spatial expression of the miRNA is clearly differentiated between the two pairs, consistent with results described previously30. The spatial expression of the four de novo miRNAs (miR-10029, miR-10032, miR-10044, miR-10049) is confined to specific tissues (for example, fins, facial skeleton, brain) and is strikingly complementary to genes predicted to contain target sites for these miRNAs (miR-10032 targets neurod2, and miR-10029 targets bmpr1b). The neurod2 gene is known to be involved in brain development and neural differentiation whereas bmpr1b, previously described amongst the fast evolving genes, is implicated in the development and morphogenesis of nearly all organ systems.
Lake Victoria, a recent evolutionary radiation
Cichlid fish adaptive radiation is characterized by rapid speciation without geographical isolation. In Lake Victoria, several hundred endemic species emerged within the past 15,000–100,000 years34. We analysed patterns of genome-wide genetic variation in six sympatric and closely related species of the genera Pundamilia, Mbipia and Neochromis, all of which are endemic to Lake Victoria. We used the P. nyererei genome to investigate the pattern and magnitude of genomic differentiation in pairwise species comparisons. We then further characterized the regions of genomic differentiation to learn about: (1) the genomic distribution of divergent sites putatively under selection; (2) their nature (coding vs regulatory); (3) whether diversification occurred by selection on old standing variation, newer mutations or both.
Divergent selection on many genes
Analyses of restriction-site-associated DNA (RAD) data showed that the average genome-wide divergence was significant in all pairwise species comparisons (P < 0.001). In each pairwise comparison, we find many SNPs with high fixation index (FST) values distributed across all chromosomes (Fig. 4c). In each pair, 250 to 439 of these SNPs constitute significant outliers from the FST distribution (FDR < 5%; Fig. 4c), and BAYESCAN results indicate numerous loci under selection. Phylogenetic trees reconstructed from the concatenated RAD sequence data resolve species with high bootstrap support35, and loci putatively under selection play a strong role in differentiating species (Fig. 4b). Taken together, these results suggest that even the most recent rapid speciation in African lake cichlids is associated with genomically widespread divergence. Fixation of alternative alleles between species happens but is restricted to a minority of the many divergent loci, consistent with models of polygenic adaptation from standing genetic variation36.
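For readers less familiar with the fixation index, the sketch below computes a per-SNP FST for two populations from allele frequencies, using the basic heterozygosity-based definition FST = (HT − HS) / HT with equal population weights. This is an illustrative simplification: the text does not state which estimator was used, sample-size corrections are ignored, and the allele frequencies are hypothetical.

```python
def fst_two_populations(p1, p2):
    """Per-SNP FST from the frequency of one allele in each of two populations."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("allele frequencies must lie in [0, 1]")
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)                       # pooled heterozygosity
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    if h_total == 0:
        return 0.0  # site is monomorphic across both populations
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies (illustrative only).
fixed_difference = fst_two_populations(1.0, 0.0)  # alternative alleles fixed
partly_divergent = fst_two_populations(0.8, 0.2)
undifferentiated = fst_two_populations(0.3, 0.3)
```

Under this definition a value of 1 corresponds to fixation of alternative alleles, which the text reports for only a minority of divergent loci, while intermediate values reflect frequency shifts without fixation.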
We used the annotated P. nyererei reference genome to identify genes that diverged during and soon after speciation for three sister species pairs and two pairs of more distant relatives (Fig. 4c). We annotated all SNPs according to their positions in exons and potential cis-regulatory elements (in introns and 25 kb either side of genes), and analysed the proportion of SNPs in each category over increasing FST. In both pairs of sister species that differ primarily in male breeding coloration, the proportion of SNPs in exons increases from <10% in the full set of SNPs, to >18% at highly divergent SNPs. In the species that have diverged primarily in morphology, we find no exonic variants among highly divergent SNPs, and an increasing proportion of SNPs in introns with increasing FST (Fig. 4c).
These data suggest contrasting genomic mechanisms underlying phenotypic evolution depending on whether speciation is driven primarily by divergence of coloration and associated traits or by divergence of morphology associated with feeding ecology. This supports two predictions from evolutionary developmental biology37: (1) variation in coding sequence is most likely to be involved in the divergence of physiological and/or terminally differentiated traits like colour; (2) regulatory variation is more important in morphological changes involving genes that have pleiotropic effects in developmental networks.
For the Pundamilia species pair, putative regulatory SNPs with FST values significantly greater than zero show enrichment in conserved transcription factor binding sites and PhastCon elements (conserved elements across 46 vertebrate species), supporting a regulatory role for these variants. GO term enrichment analyses indicate that exonic SNPs are associated with metabolism and biosynthesis processes, while putative regulatory SNPs are associated with terms related to morphogenesis and development.
Comparing FST for each SNP in all six pairwise comparisons of the Mbipia and Pundamilia species revealed 3 candidate regulatory SNPs on LG6, 7 and 22 that are highly divergent in all comparisons of species with different colours, but not significantly differentiated between species with similar colours (Fig. 4c). The SNP on LG7 falls within a known quantitative trait locus (QTL) interval for yellow versus blue colour (and sex determination) in Malawi cichlids15. None of these SNPs are fixed differences between species, suggesting polygenic adaptation.
Sorting of ancient polymorphisms
To investigate whether ancient genetic variation, predating the origin of the Lake Victoria species flock, was an important source of alleles that are divergently sorted during speciation, for SNPs in each of the three Victoria sister species pair comparisons, we identified orthologous sites among the four other cichlid genomes. We find 14–15% of all Victoria SNPs are also variable among the other cichlid genomes. Among these ‘ancient variants’, the proportion of SNPs in exons increases from 9–15% among all sites to 30–100% at highly divergent SNPs in both pairs of sister species that differ primarily in male breeding coloration (Fig. 4c). Among the ancient exonic variants that became fixed in the red/blue Pundamilia speciation event is srd5a2b, a teleost-specific duplicate of srd5a2 which, in mammals, converts testosterone to dihydrotestosterone and has been implicated in sexual differentiation38. In the blue sister species that have diverged primarily in morphology, two ancient variants in potential cis-regulatory regions are highly divergent despite incomplete reproductive isolation among these incipient species39 (Fig. 4b). We compared the proportions of putative ancient variants to all SNPs between annotation categories, and find evidence for higher proportions of ancient variants in gene-associated regions than in non-genic regions (likelihood ratio tests on 2 × 2 contingency tables; exons: Pundamilia P = 0.016, Neochromis P = 0.015; flanking regions: Pundamilia P = 0.020; all other P > 0.1).
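A likelihood ratio test on a 2 × 2 contingency table, as referred to above, can be sketched as a G-test of independence. In the minimal example below, the rows might stand for ancient versus other SNPs and the columns for exonic versus non-genic positions; the function name and the counts are hypothetical, and the P value uses the chi-square distribution with one degree of freedom.

```python
from math import erfc, log, sqrt


def g_test_2x2(a, b, c, d):
    """Likelihood ratio (G) test of independence for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing to G
                expected = rows[i] * cols[j] / n
                g += 2 * o * log(o / expected)
    g = max(g, 0.0)  # guard against tiny negative rounding error
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(g / 2))
    return g, p_value


# Hypothetical counts (illustrative only).
g_stat, p_val = g_test_2x2(20, 10, 10, 20)
```

The G statistic is asymptotically chi-square distributed, so for tables with very small counts an exact test would be a safer choice than this approximation.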
These analyses suggest that the genomic substrate for adaptive radiation includes ample coding and regulatory polymorphism, likely to be present well before the start of the radiations, some of which became subsequently sorted during species divergence.
In African lakes, nearly 1,500 new species of cichlid fish evolved in a few million years when environmentally determined opportunity for sexual selection and ecological niche expansion4 was met by an evolutionary lineage with unusual potential to adapt, speciate and diversify. Our analyses of five cichlid species representing five different lineages in the haplo-tilapiine clade, some of which gave rise to radiations, and of six closely related species from the most recent radiation, shed light on the complex genomic mechanisms that may give East African cichlids their unusual propensity for diversification.
We provide evidence for accumulation of genetic variation under relaxed constraint preceding radiation and involving multiple evolutionary mechanisms, including accelerated evolution of regulatory and coding sequence, increased gene duplication, TE insertions, novel microRNAs and retention of ancient polymorphisms, possibly including interspecific hybridization. In addition, our data on genomic divergence within the Lake Victoria species flock suggest that adaptive radiation within the lakes is associated with divergent selection on many regions in the genome, both coding and regulatory, often recruiting old alleles from standing variation.
We conclude that neutral and adaptive processes both make important contributions to the genetic basis of cichlid radiations, but their roles are distinct and their relative importance has changed through time: neutral (and non-adaptive) processes seem to have been crucial to amassing genomic variation, whereas selection subsequently sorted some of this variation. The interaction of both is likely to have been necessary for generating many and diverse new species in very short periods of time.
We would like to thank the Broad Institute Genomics Platform for sequencing of the 5 cichlid genomes and transcriptomes. Sequencing, assembly, annotation and analysis by Broad Institute were supported by grants from the National Human Genome Research Institute (NHGRI). Genome evolution, duplication and TE analysis, ILS and ancient variant analyses were also supported by Swiss National Science Foundation grant PBLAP3-142774 awarded to D.B. and by University of Oxford Nuffield Department of Medicine Prize Studentship to Y.I.L. TE and copy number variation analyses were supported by the German Science Foundation (DFG), and advanced grant 29700 (“GenAdap”) by the European Research Council (ERC). CNE analysis and zebrafish functional assays were supported by the Biomedical Research Council of A*STAR, Singapore. MicroRNA sequencing and annotation was supported by ERC Starting Grant to E.A.M.; M.M. was supported by a fellowship from the Wellcome Trust. MicroRNA and target in situ hybridization was supported by grant 2R01DE019637-04 to J.T.S. Population genomics analyses were supported by Swiss National Science Foundation grants 31003A-118293 and 31003A-144046 to O.S.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported licence. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons licence, users will need to obtain permission from the licence holder to reproduce the material. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-sa/3.0/.