The genomic timeline of cichlid fish diversification across continents

Cichlid fishes are celebrated for their vast taxonomic, phenotypic, and ecological diversity; however, a central aspect of their evolution — the timeline of their diversification — remains contentious. Here, we generate draft genome assemblies of 14 species representing the global cichlid diversity and integrate these into a new phylogenomic hypothesis of cichlid and teleost evolution that we time-calibrate with 58 re-evaluated fossil constraints and a new Bayesian model accounting for fossil-assignment uncertainty. Our results support cichlid diversification long after the breakup of the supercontinent Gondwana and lay the foundation for precise temporal reconstructions of the exceptional continental cichlid adaptive radiations.


Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided. Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested
A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted. Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated

Our web collection on statistics for biologists contains articles on many of the points above.

Software and code
Policy information about availability of computer code

Data collection
No software was used in the data collection process. Sequencing data were generated through Illumina HiSeq sequencing at the Norwegian Sequencing Centre, Oslo, Norway.

Data analysis
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.

Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A list of figures that have associated raw data
- A description of any restrictions on data availability

Michael Matschiner, Walter Salzburger
Jul 9, 2020

[...] Table 7 for details). Sequence alignments used for phylogenomic inference are available from http://evoinformatics.eu/continental.htm. Figure 2, Supplementary Figures 1-4, and Supplementary Figures 6-7 have associated raw data available from http://evoinformatics.eu/continental.htm.

October 2018
Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.

In this study, we investigate the timeline of cichlid fish diversification across continents with newly generated genome sequences and a new model for phylogenetic divergence-time estimation.
Sampling for whole-genome sequencing included one specimen of each of 14 species of cichlid fishes. Samples used for this study were selected from a larger set of available samples so that all major geographic and taxonomic groups of cichlids were represented (see below for details on the sampling strategy). All samples were obtained from zoos, collaborators, field excursions described in other publications, and the aquarium trade, as listed in Supplementary [...].

In addition to these 14 species, we included publicly available data for 77 further fish species. These datasets were taken either from the Ensembl (Ensembl.org), NCBI (ncbi.nlm.nih.gov), or EBI (ebi.ac.uk) databases or from deposits on datadryad.org, figshare.com, parrot.genomics.cn, surfdrive.surf.nl, cichlid.gurdon.cam.ac.uk, efishgenomics.integrativebiology.msu.edu, or creskolab.uoregon.edu (full links to all datasets are provided in Supplementary Table 7).
The 14 cichlid species for whole-genome sequencing were selected to cover a wide range of the native cichlid distribution worldwide, including South and Central America, India, Madagascar, and Western and Eastern Africa, and to represent all cichlid subfamilies and multiple tribes of the subfamilies Cichlinae and Pseudocrenilabrinae. Where multiple samples were available for a clade, so that their joint inclusion would have been redundant, the sample with the highest-quality genomic data was selected. The sample size was determined to be sufficient because the selected lineages included all cichlid clades with fossil records for time calibration with our new Bayesian approach.
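The selection rule described above (one sample per clade, preferring the sample with the best genomic data) can be illustrated with a minimal Python sketch. This is not the authors' code: the `Sample` class, the `quality` score, and the species and clade names are hypothetical placeholders, and the source does not state which quality measure was used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    species: str
    clade: str
    quality: float  # hypothetical quality score; higher means better data


def pick_best_per_clade(samples):
    """Return a dict mapping each clade to its highest-quality sample."""
    best = {}
    for sample in samples:
        current = best.get(sample.clade)
        # Keep the first sample seen for a clade, then replace it only
        # if a later sample has a strictly higher quality score.
        if current is None or sample.quality > current.quality:
            best[sample.clade] = sample
    return best


# Made-up candidates: two redundant samples in clade_1, one in clade_2.
candidates = [
    Sample("species_A", "clade_1", 0.82),
    Sample("species_B", "clade_1", 0.91),
    Sample("species_C", "clade_2", 0.77),
]
selected = pick_best_per_clade(candidates)
```

With these placeholder values, `selected` holds one entry per clade, so redundant samples within a clade are dropped while every clade stays represented.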
Whole-genome sequencing data were generated on an Illumina HiSeq2500 machine with v4 chemistry at the Norwegian Sequencing Centre (NSC), Oslo, Norway. Genome sequencing libraries were prepared by authors A.B. and F.R.; the Illumina machine was operated by NSC staff members.
Not applicable as no field work was conducted for this study; all samples were already available before the start of the study.
Whole-genome sequencing data were filtered to identify the most suitable sequences for phylogenomic inference. The criteria for the selection of genomic regions for phylogenomic analyses were pre-established in earlier studies (e.g. Roth O, Solbakken MH, Tørresen OK et al. (2020) Evolution of male pregnancy associated with remodeling of canonical vertebrate immunity in seahorses and pipefishes. Proceedings of the National Academy of Sciences USA, 117, 9431-9439). All steps of the filtering pipeline are described in Supplementary Information 6, and the applied custom code is available from https://github.com/mmatschiner/continental.
To enable the reproduction of our results by other researchers, we provide all datasets, analysis code, and input files for certain programs on https://github.com/mmatschiner/continental and http://evoinformatics.eu/continental.htm. As part of our study, reproducibility was confirmed for specific analyses, such as all BEAST 2 analyses, for which we performed three replicate analyses with each dataset, and additionally sets of analyses with different datasets, all of which supported the same result. Overall, all attempts at replication were successful, and no result failed to be reproduced.
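As an illustration of what agreement among replicate analyses can mean in practice, the following Python sketch compares the post-burn-in mean of one sampled parameter (for example, a node age) across replicate runs. It is an assumption-laden example, not the convergence check used in the study: the function names, the 10% burn-in, the 5% tolerance, and the sample values are all hypothetical.

```python
from statistics import mean


def replicate_means(replicates, burnin_fraction=0.1):
    """Mean of each replicate's samples after discarding the burn-in."""
    means = []
    for samples in replicates:
        start = int(len(samples) * burnin_fraction)
        means.append(mean(samples[start:]))
    return means


def replicates_agree(replicates, rel_tol=0.05, burnin_fraction=0.1):
    """True if every replicate mean lies within rel_tol of the grand mean."""
    means = replicate_means(replicates, burnin_fraction)
    centre = mean(means)
    return all(abs(m - centre) <= rel_tol * abs(centre) for m in means)


# Made-up samples of one parameter from three replicate runs; the first
# value of each run mimics an unconverged starting state.
runs = [
    [120.0, 80.0, 82.0, 79.0, 81.0, 80.0, 78.0, 82.0, 80.0, 81.0],
    [60.0, 81.0, 79.0, 80.0, 82.0, 80.0, 79.0, 81.0, 80.0, 79.0],
    [95.0, 79.0, 80.0, 81.0, 80.0, 82.0, 78.0, 80.0, 81.0, 80.0],
]
consistent = replicates_agree(runs)
```

A comparison of means is only a coarse check; in a real analysis it would complement, not replace, standard MCMC diagnostics such as effective sample sizes.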