Whole-genome sequences of Malawi cichlids reveal multiple radiations interconnected by gene flow

The hundreds of cichlid fish species in Lake Malawi constitute the most extensive recent vertebrate adaptive radiation. Here we characterize its genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. The average sequence divergence between species pairs is only 0.1–0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times. Common signatures of selection on visual and oxygen transport genes shared by distantly related deep-water species point to both adaptive introgression and independent selection. These findings enhance our understanding of genomic processes underlying rapid species diversification, and provide a platform for future genetic analysis of the Malawi radiation. Genome-sequencing data of all major cichlid fish lineages in Lake Malawi show genomic footprints of rapid species diversification.

Abstract: The hundreds of cichlid fish species in Lake Malawi constitute the most extensive recent vertebrate adaptive radiation. Here we characterize its genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times. Common signatures of selection on visual and oxygen transport genes shared by distantly related deep water species point to both adaptive introgression and independent selection. These findings enhance our understanding of genomic processes underlying rapid species diversification, and provide a platform for future genetic analysis of the Malawi radiation.
One Sentence Summary: The genomes of 73 cichlid fish species from Lake Malawi uncover evolutionary processes underlying a large adaptive evolutionary radiation.

Main Text:
The formation of every lake or island represents a fresh opportunity for colonization, proliferation and diversification of living forms. In some cases, the ecological opportunities presented by underutilized habitats facilitate adaptive radiation -rapid and extensive diversification of the descendants of the colonizing lineages [1][2][3] . Adaptive radiations are thus exquisite examples of the power of natural selection, as seen for example in Darwin's Galapagos finches 4,5 , Anolis lizards of the Caribbean 6 and in East African cichlid fishes 7,8 .
Cichlids are one of the most species-rich and diverse families of vertebrates, and nowhere are their radiations more spectacular than in the Great Lakes of East Africa: Malawi, Tanganyika, and Victoria 2 , each of which contains several hundred endemic species, with the largest number in Lake Malawi 9 . Molecular genetic studies have made major contributions to reconstructing the evolutionary histories of these adaptive radiations, especially in terms of the relationships between the lakes 10,11 , between some major lineages in Lake Tanganyika 12 , and in describing the role of hybridization in the origins of the Lake Victoria radiation 13 . However, the task of reconstructing within-lake relationships and of identifying sister species remains challenging due both to retention of large amounts of ancestral genetic polymorphism (i.e. incomplete lineage sorting) and to evidence suggesting gene flow between taxa 12,[14][15][16][17][18][19] .
Initial genome assemblies of cichlids from East Africa suggest that an increased rate of gene duplications together with accelerated evolution of some regulatory elements and protein coding genes may have contributed to the radiations 11 . However, understanding of the genomic mechanisms contributing to adaptive radiations is still in its infancy 3 . Here we provide an overview of and insights into the genomic signatures of the haplochromine cichlid radiation of Lake Malawi.
Previous work on the phylogeny of Lake Malawi haplochromine cichlid radiation, mainly based on mitochondrial DNA (mtDNA), has divided the species into six groups with differing ecology and morphology 20 : 1) the rock-dwelling 'mbuna'; 2) Rhamphochromis -typically midwater pelagic piscivores; 3) Diplotaxodon -typically deep-water pelagic zooplanktivores and piscivores; 4) deep-water and twilight feeding benthic species; 5) 'utaka' feeding on zooplankton in the water column but breeding on or near the benthic substrate; 6) a diverse group of benthic species, mainly found in shallow non-rocky habitats. In addition, Astatotilapia calliptera is a closely related generalist that inhabits shallow weedy margins of Lake Malawi, and other lakes and rivers in the catchment and more widely. To characterize the genetic diversity, species relationships, and signatures of selection across the whole radiation, we obtained paired-end Illumina whole-genome sequence data from 134 individuals of 73 species distributed broadly across the seven groups (Fig. 1a) 21 . This includes 102 individuals at ~15× coverage and 32 additional individuals at ~6× (Table S1).

Low genetic diversity and species divergence
Sequence data were aligned to the Metriaclima zebra reference assembly version 1.1 11 , followed by variant calling restricted to the 'accessible genome' (the portion of the genome where variants can be determined confidently with short read alignment), which comprised 653Mb or 91.5% of the assembly excluding gaps 21 . Average divergence from the reference per sample was 0.19% to 0.27% (Fig. S1). Across all samples, after filtering and variant refinement, we obtained 30.6 million variants of which 27.1 million were single nucleotide polymorphisms (SNPs) and the rest were short insertions and deletions (indels) 21 . Unless otherwise indicated, the following analyses are based on SNPs.
To estimate nucleotide diversity (π) within the sampled species we measured the frequency of heterozygous sites in each individual. The estimates are distributed within a relatively narrow range between 0.7 and 1.8×10 -3 per bp (Fig. 1b). The mean π estimate of 1.2×10 -3 per bp is at the low end of values found in other animals 22 , and there appears to be little relationship between π and the rate of speciation: individuals in the species-rich mbuna and shallow benthic groups show levels of π comparable to the relatively species-poor utaka, Diplotaxodon, and Rhamphochromis (Fig. S1).
Despite their extensive phenotypic differentiation, all species within the Lake Malawi haplochromine cichlid radiation are genetically closely related 23,24 . However, genome-wide genetic divergence has never been quantified. To examine the extent of genetic differentiation between species across the radiation we calculated the average pairwise differences between all pairs of sequences from different species (d XY ). A comparison of d XY against heterozygosity reveals that the two distributions partially overlap (Fig. 1b)

Low per-generation mutation rate
It has been suggested that the species richness and morphological diversity of teleosts in general and of cichlids in particular might be explained by elevated mutation rates compared to other vertebrates 27,28 . To obtain a direct estimate of the per-generation mutation rate, we reared offspring of three species from three different Lake Malawi groups (Astatotilapia calliptera, Aulonocara stuartgranti, and Lethrinops lethrinus). We sequenced both parents and one offspring of each to high coverage (40x), applied stringent quality filtering, and counted variants present in each offspring but absent in both its parents (Fig. S2) 21 . There was no evidence for significant difference in mutation rates between species. The overall mutation rate (μ) was estimated at 3.5×10 -9 (95%CI: 1.6×10 -9 to 4.6×10 -9 ) per bp per generation, approximately three to four times lower than the rate obtained in similar studies using human trios 29 , although given much shorter mean generation times the per-year rate is still expected to be higher in cichlids than in humans. We note that Recknagel 32 ), and the hundreds of species present, this result suggests that alleles at a locus will only rarely coalesce within the time between successive speciation events, consistent with high sharing of heterozygosity and ILS. This is because both the mean and standard deviation in the time to the common ancestor of a pair of alleles are expected to be in the order of 2Ne generations, or hundreds of thousands of years.

Between-species relationships
To obtain a first estimate of between-species relationships we divided the genome into 2543 nonoverlapping windows, each comprising 8000 SNPs (average size: 274kb), and constructed a Maximum Likelihood (ML) phylogeny separately for the full sequences within each window, obtaining trees with 2542 different topologies. We also calculated the maximum clade credibility (MCC) summary tree 33 and an ML phylogeny based on the full mtDNA genome (Figs. 1C, S4) 21 .
Despite extensive variation among the individual trees, it is apparent that there is some general consensus. Individuals from the previously identified eco-morphological groups tend to cluster together, with Rhamphochromis, Diplotaxodon, mbuna and A. calliptera forming well supported (in >80% of trees) apparently reciprocally monophyletic groups (Fig. S4B), while individuals from the utaka, and deep and shallow benthic were clustered within their respective groups, but with lower support. In the subset of 12 individuals shown in Fig. 1c, the pelagic clades Diplotaxodon and Rhamphochromis tend to cluster together and form a sister group to the rest of the radiation. Perhaps surprisingly, the majority of the trees place the widely-distributed lake/river-dwelling A. calliptera as the sister taxon to the specialized rocky-shore mbuna group in a position that is nested within the radiation (Figs. 1c, S4B). The overall sub-structuring is also apparent from patterns of linkage disequilibrium (LD). Mean LD decays within a few hundred base-pairs across the set of all species, in a few kilobases (kb) for subsets of species from within eco-morphological subgroups (mbuna, Diplotaxodon etc.), and extends beyond 10kb within species (Fig. S3) 21 . Because the extent of LD is substantially shorter than the window size of the ML phylogenies, we expect extensive ILS also within the windows, and it is established that ML may be inappropriate for building trees from large regions with extensive ILS [34][35][36] . Therefore we note that even in the absence of introgression the ML tree results above would be merely illustrative of overall genetic similarities rather than providing definitive reconstructions of the branching order of taxa.
It is also clear that the mtDNA phylogeny is an outlier, being substantially different both from the MCC summary and from the majority of the individual trees (Figs. 1c, S5). Discordances between mtDNA and nuclear phylogenies in Lake Malawi has been reported previously in refs. 15,19 , and interpreted as a signature of past hybridization events. However, large discrepancies between mitochondrial and nuclear phylogenies have been shown previously in a large number of other systems, reflecting both that mtDNA as a single locus is not expected to reflect the consensus under ILS, and that it often does not evolve neutrally (e.g. refs. [37][38][39]. In particular, the high incidence of mitochondrial selection underlines the importance of evaluating the Lake Malawi radiation from a genome-wide perspective rather than drawing conclusions regarding species relationships based on mtDNA signals alone. If the observed high levels of phylogenetic incongruence were due to ILS alone, it should still be possible to resolve a cleanly bifurcating species tree by applying either the multispecies coalescent model 40,41 , or the distance based Neighbor-Joining method which has been shown to be a statistically consistent and accurate species tree estimator under ILS 42,43 . We constructed a whole genome NJ tree using the Dasarathy et al. algorithm (Fig. 2A). A similar overall topology was produced from applying the Bayesian multispecies coalescent method SNAPP 44 to the subset of 48,922 unlinked SNPs in 12 individuals representing the eco-morphological groups ( Fig. S6; analysis of the full dataset was not computationally feasible using this method).
Overall, the resulting phylogenies share many similarities that reflect features of previous taxonomic assignment, but some currently-recognized genera are clearly polyphyletic, including Placidochromis, Lethrinops, and Mylochromis.

Violations of the species tree concept
Previous studies have suggested that hybridization and introgression subsequent to initial separation of species may have played a significant role in cichlid radiations, including in Lakes Tanganyika 12,14,16,17 and Malawi 15,19 . Consistent with this, we found some variation in species tree topologies depending on which inference method was used and also within the Bayesian MCMC samples (Figs. 2A, S4B, S6). Furthermore, we contrasted the pairwise genetic distances used to produce the raw NJ tree ( Fig. 2B; above the diagonal) against the distances between samples along the tree branches ( Fig. 2A), calculating the residuals ( Fig. 2B; below the diagonal). If the tree were able to perfectly capture all the genetic relationships in our sample, we would expect the residuals to be zero. However, we found a large number of differences, with some standout cases. The fact that we are using over 25 million variable sites, and the high bootstrap values on all branches of the NJ tree (mean support is 90%, median is 100%; Fig. 2C), suggest these differences are not due to sampling noise, but may reflect violations of the bifurcating species tree model.
The patterns of differences between the true genetic distance and the NJ tree distance affect both groups of species and individual species. Among the strongest signals on individual species, we see that: 1) Copadichromis trimaculatus, which in the NJ tree clusters within the shallow benthic clade, is genetically much closer to other utaka, and in particular to C. quadrimaculatus, than its placement in the tree would suggest; 2) Placidochromis cf. longimanus is genetically closer to the deep benthic clade and to a subset of the shallow benthic (mainly Lethrinops species) than the tree suggests; and 3) our sample of Otopharynx tetrastigma is much closer to Astatotilapia calliptera from Lake Kingiri (and to a lesser degree to other A. calliptera) than would be expected from the tree. The O. tetrastigma specimen comes from Lake Ilamba, a satellite crater lake of Lake Malawi that also harbours a population of A. calliptera and is geographically close (3.2 km) to Lake Kingiri. an observation we will return to below in the context of shared mechanisms of depth adaptation.
To gather further evidence regarding potential gene flow between the 'tree-violating' taxa identified above, we computed the f 4 admixture ratio [47][48][49] (f statistic in the following). The f statistic is closely related to Patterson's D (ABBA-BABA test) 47 , and when elevated due to introgression is expected to be linear in relation to the proportion of introgressed material. For each of the three cases discussed in the analysis of Figure 2, we found strong signals of non-treelike relatedness (Fig. 3B). Specifically, there are two very high f statistics involving C.
trimaculatus and shallow benthic and utaka species; it is also notable that the position of C.
trimaculatus within the shallow benthic group is an unusual feature of the NJ tree in Fig. 2a; it clusters with the other utaka in all other phylogenies. We interpret the evidence as suggesting that the gene-pool of C. trimaculatus is a product of hybridization between an utaka lineage and a shallow benthic lineage. More comparisons can be found in Table S8. 3C.
The f statistic tests are robust to the occurrence of incomplete lineage sorting, in the sense that ILS alone can not generate a significant test result 48 . We note, however, that pronounced population structure within ancestral species, coupled with rapid succession of speciation events, can also substantially violate the assumptions of a strictly bifurcating species tree and lead to significantly elevated f scores 48,51 . Nevertheless, to be able to explain many of the patterns reported here, ancestral population structure would have needed to segregate through multiple speciation events without affecting sister lineages, a scenario that is not credible in general.
Therefore, we suggest that there is strong evidence for multiple cross-species gene flow events.
Not only are there many significant f scores, they also are unusually large: 519 out of the 1725 significant scores (4.5% of the total 11452; Fig. S10) are larger than 3%, corresponding to inferred human-neanderthal introgression 47 . Across the subgroups in scores as a measure of violation of a tree-like branching model is more robust in this respect 21 .
Overall, the NJ tree residuals, haplotype sharing patterns, and the many elevated f b (C) scores in Fig. 3C reveal extensive violations of the bifurcating species tree model both across and within major groups. We therefore confirm that the evolutionary history of the Lake Malawi radiation is characterized by multiple gene flow events at different times during its evolutionary history, and that no single phylogenetic tree can adequately capture the evolutionary relationships within the species flock.

Origins of the radiation
Astatotilapia calliptera, although showing strong preference for shallow weedy habitats, is abundant and widespread in both flowing and still waters, and is thus considered a generalist. It has often been referred to as the 'prototype' for the closely-related endemic genera of Lake Malawi cichlids 26 . Therefore, discussions concerning the origin of the Malawi radiation often rely on ascertaining the relationship of this species to the rest of the Malawi radiation 15 To explore the origins of the Lake Malawi radiation in greater detail, we obtained whole genome sequences from 19 individuals from seven Astatotilapia species not found in Lake Malawi (Table S2) and generated new variant calls 21 . In addition, we sequenced five more A. calliptera specimens from Indian Ocean catchments, thus covering most of the geographical distribution of the species. We constructed NJ trees based on genetic distances and found that even with these additional data all the A. calliptera (including samples from outside the Lake Malawi catchment) continue to cluster as a single group at the same place nested within the radiation, whereas the other Astatotilapia species clearly branched off well before the lake radiation ( Fig. 4a,  Therefore, we propose here a model which reconciles the nested phylogenetic position of A. calliptera in Fig. 4b with its ancestral riverine phenotype. We suggest a model in which the Lake Malawi species flock consists of three separate radiations splitting off the ancestral lineage leading to A. calliptera; the pelagic radiation was seeded first, then the benthic + utaka, and finally the rock-dwelling mbuna, all in a relatively quick succession, followed by subsequent gene flow as described above (Fig. 4f). Using our per-generation mutation rate we obtained mean separation time estimates for these lineages between 460 thousand years ago (ka) [95%CI: (350ka to 990ka)] and 390ky [95%CI: (300ka to 860ka)] (Fig. 4f), assuming three years per generation as in ref. 56. The point estimates all fall within the second most recent prolonged deep lake phase as inferred from the Lake Malawi paleoecological record 30 while the upper ends of the confidence intervals cover the third deep lake phase. We also note that split times estimated from sequence divergence are likely to be reduced by subsequent gene-flow, leading to underestimates. Therefore we conclude that the data are consistent with the previous reports based on fossil time calibration which put the origin of the Lake Malawi radiation at 700-800ka 12 .
Focusing on the A. calliptera individuals, we found they cluster by geography (Fig. 4b,c), except for the specimen from crater Lake Kingiri, whose position in the clade is likely a result of the admixture signals shown in Fig. 3b. Indeed, the Kingiri individual clusters according to geography with the specimens from the nearby crater Lake Massoko and Mbaka River if a NJ tree is built with A. calliptera samples only (Fig. S14). Applying the same logic, we tested whether the position of the A. calliptera group in the NJ tree changes when the tree is built without mbuna (as would be expected if the A. calliptera position were affected by hybridization with mbuna). However, we found that the position of A. calliptera is not affected by the removal of mbuna (Fig. S15), suggesting that the nested position is not due to later hybridization. The f statistic analysis in Fig. 3c further supports this claim, because the signals involving the whole mbuna or A. calliptera groups are modest and do not suggest erroneous placement of these whole groups in all phylogenetic analyses.
Furthermore, the nested position of A. calliptera is also supported by the vast majority of the genome. We specifically searched for the basal branch in a set of 2638 local ML phylogenies for non-overlapping genomic windows and found results that are consistent with the whole genome NJ tree: the most common basal branches are the pelagic groups Rhamphochromis and Diplotaxodon (in 42.12% of the genomic windows). In comparison, A. calliptera (including all of the Indian Ocean catchment samples) were found to be basal only in 5.99% of the windows (Fig. S16).
Finally, we note that all the A. calliptera have a relatively recent common ancestor, with divergence at ~75% of the most distinct species in the Malawi radiation and corresponding to 340ka [95%CI: (260ka, 740ka)], suggesting that the Lake Malawi population has been a reservoir that has repopulated the river systems and more transient lakes following dry-wet transitions in East African hydroclimate 30,57 . Our results do not fully resolve whether the lineage leading from the common ancestor to A. calliptera retained its riverine generalist phenotype throughout or whether a lacustrine species evolved at some point (e.g. the common ancestor of A. calliptera and mbuna) and later de-specialized again to recolonize the rivers. However, while it is a possibility, we suggest that is it not likely that the many strong phenotypic affinities of A.
calliptera to the basal Astatotilapia [see refs. 58,59; Fig. 4e], would be reinvented from a lacustrine species. After all, A. calliptera is widespread and abundant in its preferred shallow weedy habitat throughout present-day Lake Malawi, despite the presence of hundreds of closelyrelated endemic lacustrine specialist species.

Signatures and consequences of selection
To gain insight into the functional basis of diversification and adaptation in Lake Malawi cichlids, we next turned our attention to protein coding genes. We compared the between-species  Fig. 5a).  without homologs tend to be short (median coding length is 432bp) and some of the signal may be explained by a component of gene prediction errors. However a comparison of short genes (≤450bp) without homologs to a set of random noncoding sequences (Fig. S18) showed significant differences (p < 2.2×10 -16 , Mann-Whitney test), with both a substantial component of genes with low ‫‬ ҧ ே , reflecting genes under purifying selection, and also an excess of genes with high ‫‬ ҧ ே (Fig. S19).
Cichlids have an unexpectedly large number of gene duplicates and it has been suggested that this phenomenon has contributed to their extensive adaptive radiations 3,11 . To investigate the extent of divergent selection on gene duplicates, we examined how the non-synonymous excess scores are related to gene copy numbers in the reference genomes. Focusing on homologous genes annotated both in the Malawi reference (M. zebra) and in the zebrafish genome, we found that the highest proportion of candidate genes was among genes with two or more copies in both genomes (N -N). The relative enrichment in this category is both substantial and highly significant (Fig. 5b). On the other hand, the increase in proportion of candidate genes in the N -1 category (multiple copies in the M. zebra genome but only one copy in zebrafish) relative to 1 -1 genes, is of a much lesser magnitude and is not significant (߯ ଶ test ‫‬ ൏ 0 . 1 8 ), suggesting that selection is occurring more often within ancient multi-copy gene families, rather than on genes with cichlid-specific duplications.
Next we used Gene Ontology (GO) annotation of zebrafish homologs to test whether candidate genes are enriched for particular functional categories. We found significant enrichment for 30 GO terms [range: 1.6×10 -8 < p < 0.01, weigh algorithm 21,60 ; Table S3] (iii) the immune system, especially inflammatory response and cytokine activity (Fig. 5c). It has been previously suggested that evolution of genes in these functional categories has contributed to cichlid radiations (as discussed below); it is nevertheless interesting to see that these categories stand out in an unbiased analysis of all 20,000+ genes in the genome.

Shared mechanisms of depth adaptation
To gain insight into the distribution of adaptive alleles across the radiation, we examined the haplotype genealogies for amino acid sequences of candidate genes, focusing on the genes in significantly enriched GO categories. It became apparent that many of the genealogies in the 'visual perception' category have common features that are unusual in the broader dataset: the haplotypes from the deep benthic group and the deep-water pelagic Diplotaxodon tend to be disproportionally diverse when compared with the rest of the radiation, and tend to group together despite these two groups being relatively distant in the whole-genome phylogenetic reconstructions.
Sharply decreasing levels of dissolved oxygen and low light intensities with narrow short wavelength spectra are the hallmarks of the habitats at below ~50 meters to which the deep benthic and pelagic Diplotaxodon groups have both adapted, either convergently or in parallel 61 .
Signatures of selection on similar haplotypes in the same genes involved in vision and in oxygen transport would therefore point to shared molecular mechanisms underlying this ecological parallelism.
To obtain a quantitative measure of shared molecular mechanisms, we calculated for each gene a similarity score for deep benthic and Diplotaxodon amino acid sequences and also compared the amounts of non-coding variation in these depth-adapted groups against the rest of the radiation 21 .
Both measures are elevated for candidate genes in the 'visual perception' category ( Fig. 6a; p=0.007 for similarity, p=0.08 for shared diversity, and p=0.003 when similarity and diversity

RH2B
the radiation differ between the two RH2 copies, with only one shared mutation out of a possible fourteen (Fig. S20B). RH2Aβ and RH2B are located within 40kb from each other on the same chromosome (Fig. 6c); a third paralog, RH2Aα, is located between them, but it has very little coding diversity specific to deep benthic and Diplotaxodon (Fig. S21). This finding is consistent with previous reports suggesting functional divergence between RH2Aα and RH2Aβ following the duplication of RH2A early in the cichlid lineage 62,63 . A similar, albeit weaker signature of shared depth-related selection is apparent in rhodopsin, which is known to play a role in deepwater adaptation in cichlids 64 . Previously, we discussed the role of coding variants in rhodopsin in the early stages of speciation of A. calliptera in the crater Lake Massoko 56 . The haplotype genealogy presented here for the broader radiation strongly suggests that the Massoko alleles did not originate by mutation in that lake but were selected out of ancestral variation (Fig. 6a).
The long wavelength, red-sensitive opsin (LWS) has been shown to play a role in speciation along a depth gradient in Lake Victoria 65 . While it is not particularly diverse in Diplotaxodon and deep benthics, it is interesting to note that Diplotaxodon have haplotypes that are clearly distinct from those in the rest of the radiation, while the majority of deep benthic haplotypes are their nearest neighbours (Fig. S21). The short-wavelength opsin SWS1 is among the genes with high Δ N-S scores but it does not exhibit shared selection between Diplotaxodon and deep benthics -it is most variable within the shallow benthic group. Finally, the short-wavelength opsins SWS2A and SWS2B have negative Δ N-S scores in our Lake Malawi dataset and thus are not among the candidate genes.
There have been many previous studies of selection on opsin genes in fish [e.g. reviewed in [66][67][68], including selection associated with depth preference, but having whole genome coverage allows us to investigate other components of primary visual perception in an unbiased fashion.
We found shared patterns of selection between deep benthics and Diplotaxodon in the genealogies of six other vision associated candidate genes: a homolog of the homeobox protein six7, the G-protein-coupled receptor kinase GRK7, two copies of the retinal cone arrestin-C, the α -subunit of cone transducin GNAT2, and peripherin-2 (Fig. 6a). The functions of these genes suggest a prominent role of cone cell vision in depth adaptation. The homeobox protein six7 governs the expression of RH2 opsins and is essential for the development of green cones in zebrafish 69 . One of the variants in this gene that distinguishes deep benthic and Diplotaxodon is just a residue away from the DNA binding site of the HOX domain, while another is located in the SIX1_SD domain responsible for binding with the transcriptional activation co-factor of six7 70 (Fig. S22C). The kinase GRK7 and the retinal cone arrestin-C genes have complementary roles in photoresponse recovery, where arrestin produces the final shutoff of the cone pigment following phosphorylation by GRK7, thus determining the temporal resolution of motion vision 71 . Note that bases near to the C-terminus in RH2Aβ mutated away from serine (S290Y and S292G), thus reducing the number of residues that can be modified by GRK7 (Fig. S22B).
The transducin subunit GNAT2 is located exclusively in the cone receptors and is a key component of the pathway which converts light stimulus into electrical response in these cells 72 .
The final gene, peripherin-2, is essential to the development and renewal of the membrane system in the outer cell segments that hold the opsin pigments in both rod and cone cells 73 .
Cichlid green-sensitive opsins are expressed exclusively in double-cone photoreceptors and the wavelength of maximum absorbance in cells expressing a mixture of RH2Aβ with RH2B (λ max = 498nm) corresponds to the part of light spectrum that transmits the best into deep water in Lake Malawi 68 . Figure 6C illustrates the possible interactions of all the above genes in a double-cone photoreceptor of the cichlid retina.
Haemoglobin genes in teleost fish are located in two separate chromosomal locations: the minor 'LA' cluster and the major 'MN' cluster 74 . The region around the LA cluster has been highlighted by selection scans among four Diplotaxodon species by Hahn et al. 75 , who also noted the similarity of the haemoglobin subunit beta (HBβ) haplotypes between Diplotaxodon and deep benthic species. We confirmed signatures of selection in the two annotated LA cluster haemoglobins. In addition, we found that four haemoglobin subunits (HBβ1, HBβ2, HBα2, HBα3) from the MN cluster are also among the genes with high selection scores (Fig. S22). It appears that shared patterns of depth selection may be particular to the β -globin genes (Fig.   S22B), although this hypothesis must remain tentative due to the highly repetitive nature of the MN cluster limiting our ability to confidently examine variation in all the haemoglobin genes in the region.
A key question concerns the mechanism leading to the similarity of haplotypes in Diplotaxodon and deep benthics. Possibilities include parallel selection on variation segregating in both groups due to common ancestry, selection on the gene flow that we described in a previous section, or independent selection on new mutations. From considering the haplotype genealogies and f dM statistics summarizing local patterns of excess allele sharing, there is evidence for each of these processes acting on different genes. The haplotype genealogies for rhodopsin and HBβ have outgroup taxa appearing at multiple locations on their haplotype networks, while A. calliptera specimens also appear at divergent positions (Fig. 6a). This suggests that the haplotype diversity of these genes may reflect ancient differences in the founders. In contrast, networks for the green cone genes show patterns more consistent with the Malawi radiation all being derived with respect to outgroups (or with us not having sampled a source of ancestral variation) and we found substantially elevated f dM scores extending for around 40kb around the RH2 cluster (Fig.   6d), consistent with adaptive introgression in a pattern reminiscent of mimicry loci in Heliconius butterflies 76 . In contrast, the peaks in f dM scores around peripherin-2 and one of the arrestin-C genes are relatively narrow, with boundaries that correspond almost exactly to the gene boundaries. Furthermore, these two genes have elevated f dM scores only for non-synonymous variants Fig. S23, while synonymous variants do not show any excess allele sharing between Diplotaxodon and deep benthics. Due to the close proximity of non-synonymous and synonymous sites within the same gene, this suggests that for these two genes there may have been independent selection on the same de novo mutations.

Discussion
Genome sequences form the substrate for evolution. Here we have described genome variation at the full sequence level across the Lake Malawi haplochromine cichlid radiation. We focused on ecomorphological diversity, representing more than half the genera from each major group rather than obtaining deep coverage of any particular group. Therefore, we have more samples from the morphologically highly diverse benthic lineages than, for example, the mbuna where many species are largely recognised by colour differences.
The observation that cichlids within an African Great Lake radiation are genetically very similar is not new 77 , but we now quantify the relationship of this to within-species variation, and the consequences for variation in local phylogeny across the genome. The observation of withinspecies diversity being relatively low for vertebrates, at around 0.1%, suggests that low genomewide nucleotide diversity levels do not necessarily limit rapid adaptation and speciation. This contrasts with the suggestion that high diversity levels may have been important for rapid adaptation in Atlantic killifish 78 . One possibility is that in cichlids repeated selection has maintained diversity in adaptive alleles for a range of traits that support ecological diversification, as appears to be the case for sticklebacks 79 .
We provide evidence that gene flow during the radiation, although not ubiquitous, has certainly been extensive. Overall, the numerous violations of the bifurcating species tree model suggest that full phylogenomic resolution of interspecies relationships in this system will require network approaches (see e.g. ref. 41; section 6.2). However, the majority of the signals affect groups of species, suggesting events involving their common ancestors, or are between closely-related species within the major ecological groups. We see only one strong and clear example of recent gene flow between individual more distantly-related species, not within Lake Malawi itself but between Otopharynx tetrastigma from crater Lake Ilamba and local A. calliptera. Lake Ilamba is very turbid and the scenario is reminiscent of cichlid admixture in low visibility conditions in Lake Victoria 80 . It is possible that some of the earlier signals of gene flow between lineages we observed in Lake Malawi may have happened during low lake level periods when the water is known to have been more turbid 30 .
Our suggested model of the early stages of radiation in Lake Malawi (Fig. 4f) is broadly consistent with the model of initial separation by major habitat divergence 24 , although we propose a refinement in which there were three relatively closely-spaced separations from a generalist Astatotilapia type lineage, initially of pelagic genera Rhamphochromis and Diplotaxodon, then of shallow-and deep-water benthics and utaka [a clade which includes Kocher's sand dwellers 24,26 ], and finally of mbuna (Fig. 4f). Thus, we suggest that Lake Malawi contains three separate haplochromine cichlid radiations, stemming from the Astatotilapia calliptera riverine generalist lineage, and interconnected by subsequent gene flow.
The finding that cichlid-specific gene duplicates do not tend to diverge particularly strongly in coding sequences (Fig. 5b) suggests that other mechanisms of diversification following gene duplications may be more important in cichlid radiations. Divergence via changes in expression patterns has been illustrated and discussed in ref. 11. Future studies that address larger scale structural variation between cichlid genomes will be able to assess the contribution of differential retention of duplicated genes.
The evidence concerning shared adaptation of the visual and oxygen-transport systems to deepwater environments between deep benthics and Diplotaxodon suggests different evolutionary mechanisms acting on different genes, even within the same cellular system. It will be interesting to see whether the same genes or even specific mutations underlie depth adaptation in Lake Tanganyika, which harbours specialist deep water species in least two different tribes 81 and has a similar light attenuation profile but a steeper oxygen gradient than Lake Malawi 61 .
Overall, our data and results provide unprecedented information about patterns of sequence sharing and adaptation across one of the most dramatic adaptive radiations, providing insights into mechanisms of rapid phenotypic diversification. The data sets we have generated are openly available (see Acknowledgements) and will underpin further studies on specific taxa and molecular systems. Given the extent of shared variation, we suggest that future studies that take into account variation within as well as between species will be important to help reveal finerscale details of adaptive selection.