Abstract
The Zoonomia Project is investigating the genomics of shared and specialized traits in eutherian mammals. Here we provide genome assemblies for 131 species, of which all but 9 are previously uncharacterized, and describe a whole-genome alignment of 240 species of considerable phylogenetic diversity, comprising representatives from more than 80% of mammalian families. We find that regions of reduced genetic diversity are more abundant in species at a high risk of extinction, discern signals of evolutionary selection at high resolution and provide insights from individual reference genomes. By prioritizing phylogenetic diversity and making data available quickly and without restriction, the Zoonomia Project aims to support biological discovery, medical research and the conservation of biodiversity.
Main
The genomics revolution is enabling advances not only in medical research1, but also in basic biology2 and in the conservation of biodiversity, where genomic tools have helped to apprehend poachers3 and to protect endangered populations4. However, we have only a limited ability to predict which genomic variants lead to changes in organism-level phenotypes, such as increased disease risk—a task that, in humans, is complicated by the sheer size of the genome (about three billion nucleotides)5.
Comparative genomics can address this challenge by identifying nucleotide positions that have remained unchanged across millions of years of evolution6 (suggesting that changes at these positions will negatively affect fitness), focusing the search for disease-causing variants. In 2011, the 29 Mammals Project7 identified 12-base-pair (bp) regions of evolutionary constraint that in total comprise 4.2% of the genome, by measuring sequence conservation in humans plus 28 other mammals. These regions proved to be more enriched for the heritability of complex diseases than any other functional mark, including coding status8. By expanding the number of species and making an alignment that is independent of any single reference genome, the Zoonomia Project was designed to detect evolutionary constraint in the eutherian lineage at increased resolution, and to provide genomic resources for over 120 previously uncharacterized species.
Designing a comparative-genomics multitool
When selecting species, we sought to maximize evolutionary branch length, to include at least one species from each eutherian family, and to prioritize species of medical, biological or biodiversity conservation interest. Our assemblies increase the percentage of eutherian families with a representative genome from 49% to 82%, and include 9 species that are the sole extant member of their family and 7 species that are critically endangered9 (Fig. 1): the Mexican howler monkey (Alouatta palliata mexicana), hirola (Beatragus hunteri), Russian saiga (Saiga tatarica tatarica), social tuco-tuco (Ctenomys sociabilis), indri (Indri indri), northern white rhinoceros (Ceratotherium simum cottoni) and black rhinoceros (Diceros bicornis).
We collaborated with 28 institutions to collect samples, nearly half (47%) of which were provided by The Frozen Zoo of San Diego Zoo Global (Supplementary Table 1). Since 1975, The Frozen Zoo has stored renewable cell cultures for about 10,000 vertebrate animals that represent over 1,100 taxa, including more than 200 species that are classified as vulnerable, endangered, critically endangered or extinct by the International Union for Conservation of Nature (IUCN)10. For 36 target species we were unable to acquire a DNA sample of sufficient quality, even though our requirements were modest (Methods), which highlights a major impediment to expanding the phylogenetic diversity of genomics.
We used two complementary approaches to generate genome assemblies (Extended Data Table 1). First, for 131 genomes we generated assemblies by performing a single lane of sequencing (2×250-bp reads) on PCR-free libraries and assembling with DISCOVAR de novo11 (referred to here as ‘DISCOVAR assemblies’). This method does not require intact cells and uses less than two micrograms of medium-quality DNA (most fragments are over 5 kilobases (kb) in size), which allowed us to include species that are difficult to access (Extended Data Figs. 1, 2) while achieving contig lengths (a contig is a contiguous sequence constructed from overlapping short reads) comparable to those of existing assemblies (median contig N50 of 46.8 kb, compared to 47.9 kb for RefSeq genome assemblies).
For nine DISCOVAR assemblies and one pre-existing assembly (the lesser hedgehog tenrec (Echinops telfairi)), we increased contiguity 200-fold (the median scaffold length increased from 90.5 kb to 18.5 megabases (Mb)) through proximity ligation, which uses chromatin interaction data to capture the physical relationships among genomic regions12. Unlike short-contiguity genomes, these assemblies capture structural changes such as chromosomal rearrangements13. The upgraded assemblies increase the number of eutherian orders that are represented by a long-range assembly (contig N50 > 20 kb and scaffold N50 > 10 Mb) from 12 to 18 (out of 19). We are working on upgrading the assembly of the large treeshrew (Tupaia tana) for the remaining order (Scandentia).
Comparative power of 240 species
The Zoonomia alignment includes 120 newly generated assemblies and 121 existing assemblies, representing a total of 240 species (the dataset includes assemblies for two different dogs) and spanning about 110 million years of mammalian evolution (Supplementary Table 2). With a total evolutionary branch length of 16.6 substitutions per site, we expect only 191 positions in the human genome (0.000006%) to be identical across the aligned species owing to chance (false positives) rather than evolutionary constraint (Extended Data Table 2). We applied this same calculation to data from The Exome Aggregation Consortium (ExAC)—who analysed exomes for 60,706 humans14—and estimated that 88% of positions would be expected to have no variation. This illustrates the potential for relatively small cross-species datasets to inform human genetic studies—even for diseases driven by high-penetrance coding mutations, for which ExAC data are optimally powered15.
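The chance-identity estimate above can be reproduced with a short calculation. The following is a minimal sketch, assuming that substitutions at each site follow a Poisson distribution with mean equal to the total branch length, that sites are independent, and that the human genome contains about 3.1 × 10⁹ positions (the genome size is our assumption; the text states only "about three billion nucleotides"). The function name is hypothetical.

```python
import math

def expected_chance_identical(branch_length, genome_size):
    """Expected number of sites that show zero substitutions by chance.

    Under a Poisson model with mean `branch_length` (substitutions per
    site summed over the tree), the probability that a neutral site is
    unchanged in every species is exp(-branch_length).
    """
    p_unchanged = math.exp(-branch_length)
    return p_unchanged, p_unchanged * genome_size

# Branch length from the text; genome size is an assumed value.
p, n_sites = expected_chance_identical(16.6, 3.1e9)
print(f"P(site unchanged by chance) = {p:.2e} ({p * 100:.6f}%)")
print(f"Expected chance-identical sites = {n_sites:.0f}")
```

With these inputs the expected count comes out close to the 191 positions quoted in the text; a smaller assumed genome size lowers the count proportionally.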
Biological insights from additional assemblies
The scope and species diversity in the Zoonomia Project supports evolutionary studies in many lineages. Previously published papers (discussed in the subsections below), and the demonstrated utility of existing comparative genomics resources16,17, illustrate the benefits of making newly generated genome assemblies and alignments accessible to all researchers without restrictions on use.
Speciation
Comparing our assembly for the endangered Mexican howler monkey (Alouatta palliata mexicana, a subspecies of the mantled howler monkey) with the Guatemalan black howler monkey (Alouatta pigra), which has a neighbouring range, suggests that different forms of selection shape the reproductive isolation of the two species18. Initial divergence in allopatry was followed by positive selection on postzygotic isolating mechanisms, which offers empirical support for a speciation process that was first outlined by Dobzhansky in 1935 (ref. 19).
Protection from cancer
Using our assembly for the capybara (Hydrochoerus hydrochaeris) (a giant rodent), a previous publication20 has identified positive selection on anti-cancer pathways, echoing previous reports21 that other large mammal species, the African and Asian elephants (Loxodonta africana and Elephas maximus indicus, respectively), carry extra copies (retrogenes) of the tumour-suppressor gene TP53. This offers a possible resolution to Peto’s paradox (the observation that cancer in large mammals is rarer than expected) and could reveal anti-cancer mechanisms.
Convergent evolution of venom
A previous publication22 has used our assembly for the Hispaniolan solenodon (Solenodon paradoxus) (Extended Data Fig. 2) to investigate venom production—a trait that is found in only a few eutherian lineages, including shrews and solenodons. They identified paralogous copies of a kallikrein 1 serine protease gene (KLK1) that together encode solenodon venom, and showed that the KLK1 gene was independently co-opted for venom production in both solenodons and shrews, in an example of molecular convergence.
Informing biodiversity conservation strategies
A previous analysis23 of our giant otter (Pteronura brasiliensis) assembly found low diversity and an elevated burden of putatively deleterious genetic variants, consistent with the recent population decline of this species through overhunting and habitat loss. The giant otter had fewer putatively deleterious variants than either the southern or northern sea otter (Enhydra lutris nereis and E. lutris kenyoni, respectively), which suggests that it has the highest potential for recovery among these species if populations are protected.
Rapid assessment of species infection risk
Using the Zoonomia alignment and public genomic data from hundreds of other vertebrates, a previous publication24 compared the structure of ACE2—the receptor for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of coronavirus disease 2019 (COVID-19)—and identified 47 mammals that have a high or very high likelihood of being virus reservoirs, intermediate hosts or good model organisms for the study of COVID-19, and detected positive selection in the ACE2 receptor-binding domain that is specific to bats.
Genetic diversity and extinction risk
We next asked whether a reference genome from a single individual can help to identify populations with low genetic diversity to prioritize in efforts to conserve biodiversity. Diversity metrics reflect demographic history25,26, and heterozygosity is lower in threatened species27. This analysis was feasible because we used a single sequencing and assembly protocol for all DISCOVAR assemblies, which minimized variation in accuracy, completeness and contiguity due to the sequencing technology and the assembly process that would otherwise confound species comparisons.
We estimated genetic diversity for 130 of our DISCOVAR assemblies, each of which represented a different species (Supplementary Table 3). Four of these estimates failed during analysis. For the remaining 126 DISCOVAR assemblies, we calculated two metrics: (1) the fraction of sites at which the sequenced individual is heterozygous (overall heterozygosity); and (2) the proportion of the genome that resides in an extended region without any variation (segments of homozygosity (SoH)). The SoH measurement is designed for short-contiguity assemblies, in which scaffolds are potentially shorter than runs of homozygosity. Overall heterozygosity and SoH values are negatively correlated (Pearson correlation r = −0.56, P = 1.8 × 10⁻⁹, n = 98). Although overall heterozygosity is correlated with contig N50 values (Pearson correlation r_het = −0.39, P_het = 4 × 10⁻⁵, n_het = 105) (probably owing to the difficulty of assembling more heterozygous genomes28), SoH values are not (Pearson correlation r_SoH = 0.09, P_SoH = 0.38, n_SoH = 98). Overall heterozygosity and SoH values are highly correlated between the low- and high-contiguity versions of the upgraded assemblies (Pearson correlation r_het = 0.999, P_het = 5 × 10⁻⁷, n_het = 7; r_SoH = 0.996, P_SoH = 1.4 × 10⁻⁶, n_SoH = 7).
Genomic diversity varies significantly among species in different IUCN conservation categories, as measured by overall heterozygosity (Fig. 2a) and SoH values (Fig. 2b). SoH values increase (P = 0.024, R² = 0.055, n = 94) with increasing levels of conservation concern, whereas heterozygosity decreases (P = 0.011, R² = 0.064, n = 101). There is no significant difference between wild and captive populations in overall heterozygosity (Fig. 2c) or SoH values (Fig. 2d).
Unusual diversity values can suggest particular population demographics, although data from more than a single individual are needed to confirm these inferences. All seven critically endangered species have SoH values that are higher than the median for species categorized as of least concern (Fig. 2e). The genomes with the lowest heterozygosity and highest SoH values were the social tuco-tuco (heterozygosity = 0.00063 and SoH = 78.7%), which was sampled from a small laboratory colony with only 12 founders29, and the eastern mole (Scalopus aquaticus) (heterozygosity = 0.0008 and SoH = 81.3%), which was supplied by a professional mole catcher and was probably from a population that had experienced a bottleneck owing to pest control measures.
The correlation between diversity metrics and IUCN category is not explained by other species-level phenotypes. For species of least concern (n = 75), we assessed 21 phenotypes that are catalogued in the PanTHERIA30 database for correlation with heterozygosity or SoH values. The most significant was between SoH value and litter size, a trait that has previously been shown to predict extinction risk31 (P_SoH = 0.02), but none is significant after Bonferroni correction (Extended Data Table 3).
Our inference that diversity trends lower in species at a higher risk of extinction comes from a small fraction (2.6%) of threatened mammals9. Whether this is a direct correlation with extinction risk or arises from an association between diversity and species-level phenotypes such as litter size, it suggests that valuable information can be gleaned from sequencing only a single individual. Should this pattern prove robust across more species, diversity metrics from a single reference genome could help to identify populations that are at risk—even when few species-level phenotypes are documented—and to prioritize species for follow-up at the population level.
Resources for biodiversity conservation
For each genome assembly, we catalogued all high-confidence variant sites (http://broad.io/variants) to support the design of cost-effective and accurate genetic assays that are usable even when the sample quality is low32; such assays are often preferable to designing expensive custom tools, relying on tools from related species or sequencing random regions33. The reference genomes themselves support the development of technologies such as using gene drives to control invasive species or pursuing ‘de-extinction’ through cloning and genetic engineering34.
Our genomes have two notable limitations. We sequenced only a single individual for each species, which is insufficient for studying population origins, population structure and recent demographic events35,36, and the shorter contiguity of our assemblies prevented us from analysing runs of homozygosity26. This highlights a dilemma that faces all large-scale genomics initiatives: determining when the value of sequencing additional individuals exceeds the value of improving the reference genome itself.
Whole-genome alignment
We aligned the genomes of 240 species (our assemblies and other mammalian genomes that were released when we started the alignment) as part of a 600-way pan-amniote alignment using the Cactus alignment software37 (Supplementary Table 2). Rather than aligning to a single anchor genome, Cactus infers an ancestral genome for each pair of assemblies (Fig. 3a). Consistent with our predictions, we have increased power to detect sequence constraint at individual bases relative to previous studies7,38. We detect 3.1% of bases in the human genome to be under purifying selection in the eutherian lineage (false-discovery rate (FDR) < 5%), without using windowing or other means to integrate contextual information across neighbouring bases. This is more than double the number from the largest previous 100-vertebrate alignment38 (Fig. 3b), with improvements being most notable in the non-coding sequence (Fig. 3c) and in the increased resolution of individual features (Fig. 3d). This represents a substantial proportion—but not all—of the 5 to 8% of the human genome that has previously been suggested to be under purifying selection7,39.
Next steps
Using our alignment of 240 mammalian genomes, we are pursuing four key strategies of analysis. First, we aim to provide the largest eutherian phylogeny based on nuclear genomes by building a comprehensive phylogeny and time tree, including trees partitioned by functional annotations, mode of inheritance and long-term recombination rates. Second, we will produce a detailed map of evolutionary constraint, identifying highly conserved genomic regions, regions under accelerated evolution in particular lineages and changes that probably affect phenotype, leveraging functional data from ENCODE40, GTEx41 and the Human Cell Atlas42. Third, we will use genotype–phenotype correlations to investigate patterns of constraint in regions associated with disease in humans, identify patterns of convergent adaptive evolution2 and apply a forward genomics strategy to link functional elements to traits. Finally, we will explore the evolution of genome structure by mapping syntenic regions between genomes, identifying evolutionary breakpoints and characterizing the repeat landscape.
Conclusion
The Zoonomia Project has captured mammalian diversity at a high resolution, and is among the first of many projects that are underway to sequence, catalogue and characterize whole branches of the eukaryotic biodiversity of the Earth. On the basis of our experience, we propose the following principles for realizing the full value of large-scale comparative genomics.
First, we should prioritize sample collection. We must support field researchers who collect samples and understand species ecology and behaviour, develop strategies for sample collection that do not rely on bulky laboratory equipment or cold chains, develop technology for using non-invasive types of sampling and establish more repositories of renewable cell cultures10.
Second, we need accessible and scalable tools for computational analysis. Few research groups have access to the computational resources necessary for work with massive genomic datasets. We must address the shortage of skilled computational scientists, and design software and data-storage systems that make powerful computational pipelines accessible to all researchers.
Finally, we should promote rapid data-sharing. Data embargoes must not be permitted to delay analyses that directly benefit the conservation of endangered species, human health or progress in basic science. Genomic data should be shared as quickly as possible and without restrictions on use.
Numerous large-scale genome-sequencing efforts are now underway, including the Earth BioGenome Project (ref. 43), Genome 10K (ref. 44), the Vertebrate Genomes Project, Bat 1K (ref. 45), Bird 10K (ref. 46) and DNA Zoo. As the number of genomes grows, so too will the usefulness of comparative genomics in disease research and the development of therapeutic strategies. Preserving, rather than merely recording, the biodiversity of the Earth must be a priority. Through global scientific collaborations, and by making genomic resources available and accessible to all research communities, we can ensure that the legacy of genomics is not a digital archive of lost species.
Methods
The number of samples (species) required to detect evolutionary conservation at a single base was estimated by applying a Poisson model of the distribution of substitution counts in the genome.
Species selection, sample shipping and regulatory approvals
Species were selected to maximize branch length across the eutherian mammal phylogeny, and to capture genomes of species from previously unrepresented eutherian families. Of 173 species initially selected for inclusion, we obtained sufficiently high-quality DNA samples for genome sequencing for 137. DNA samples from collaborating institutions were shipped to the Broad Institute (n = 69) or Uppsala University (n = 68). For samples received at the Broad Institute that were then sent to Uppsala, shipping approval was secured from the US Fish and Wildlife Service. Institutional Animal Care and Use Committee approval was not required.
Sample quality control, library construction and sequencing
DNA integrity for each sample was visualized via agarose gel (at the Broad Institute) or an Agilent TapeStation (at Uppsala University). Samples passed quality control if the bulk of DNA fragments were greater than 5 kb. DNA concentration was then determined using the Invitrogen Qubit dsDNA HS assay kit. For each of the samples that passed quality control, 1–3 μg of DNA was fragmented on the Covaris E220 Instrument using the 400-bp standard programme (10% duty cycle, 140 PIP, 200 cycles per burst, 55 s). Fragmented samples underwent SPRI double-size selection (0.55×, 0.7×) followed by PCR-free Illumina library construction following the manufacturer’s instructions (Kapa no. KK8232) using PCR-free adapters from Illumina (no. FC-121-3001). Final library fragment size distribution was determined on an Agilent 2100 Bioanalyzer with High Sensitivity DNA Chips. Paired-end libraries were pooled, and then sequenced on a single lane of the Illumina HiSeq2500, set for Version 2 chemistry and 2×250-bp reads. This yielded a mean of 375 million (s.d. = 125 million) reads per sample.
Assembly and validation
For each species, we applied DISCOVAR de novo11 (discovardenovo-52488) (ftp://ftp.broadinstitute.org/pub/crd/DiscovarDeNovo/) to assemble the 2×250-bp read group, using the following command: DiscovarDeNovo READS=[READFILE] OUT_DIR=[SPECIES_ID]//[SPECIES_ID].discovar_files NUM_THREADS=24 MAX_MEM_GB=200G.
Coverage for each genome was automatically calculated by DISCOVAR, with a mean coverage of 40.1× (s.d. = 14×). We assessed genome assembly, gene set and transcriptome completeness using Benchmarking Universal Single-Copy Orthologs (BUSCO), which provides quantitative measures on the basis of gene content from near-universal single-copy orthologues50. BUSCO was run with default parameters, using the mammalian gene model set (mammalia_odb9, n = 4,104), using the following command: python ./BUSCO.py -i [input fasta] -o [output_file] -l ./mammalia_odb9/ -m genome -c 1 -sp human.
Median contig N50 for existing RefSeq assemblies was calculated using the assembly statistics for the most recent release of 118 eutherian mammals with RefSeq assembly accession numbers. Assemblies were all classified as either reference genome or representative genome. Assembly statistics were downloaded from the NCBI on 10 April 2019.
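For readers unfamiliar with the N50 statistic used throughout, the following is a minimal sketch of how a contig N50 and a median across assemblies can be computed from lists of contig lengths. The function name and the toy lengths are hypothetical and are not data from this study; in practice the values reported here were taken from NCBI assembly statistics rather than recomputed.

```python
from statistics import median

def n50(lengths):
    """Return the N50 of a set of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    contain at least half of the total assembled bases.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must contain at least one contig")

# Toy contig lengths for two made-up assemblies (illustrative only).
toy_assemblies = {
    "assembly_a": [100, 80, 60, 40, 20],
    "assembly_b": [50, 50, 50, 50],
}
per_assembly = {name: n50(v) for name, v in toy_assemblies.items()}
print(per_assembly, "median N50:", median(per_assembly.values()))
```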
Genome upgrades
We selected genomes from each eutherian order without a pre-existing long-contiguity assembly on the basis of (1) whether the underlying assembly met the minimum quality threshold needed for HiRise upgrades; and (2) whether a second sample of sufficient quality could be obtained from that individual. All upgrades were done with Dovetail Chicago libraries and assembled with HiRise 2.1, as previously described51.
Estimating heterozygosity
Selection of assemblies for heterozygosity analysis
Heterozygosity statistics were calculated for all but four of our short-read assemblies (n = 126) as well as eight Dovetail-upgraded genomes. Four failed either because they were too fragmented to analyse (n = 3) or because of undetermined errors (n = 1). One assembly was excluded because it was a second individual from a species that was already represented.
Heterozygosity calls
We applied the standard GATK pipeline with genotype quality banding to identify the callable fraction of the genome52,53. First, we used samtools to subsample paired reads from the unmapped .bam files54. After removing adaptor sequences from the selected reads, we used BWA-MEM to map reads to the reference genome scaffolds of >10 kb, removing duplicates using the PicardTools MarkDuplicates utility55. We then called heterozygous sites using standard GATK-Haplotypecaller specifications, and with additional gVCF banding at 0, 10, 20, 30, 40, 50 and 99 qualities. We used the fraction of the genome with genotype quality >15 for subsequent analyses. For the lists of high-confidence variant sites, we include only heterozygous positions after filtering at GQ >20, maximum DP <100, minimum DP >6, as described in the README file at http://broad.io/variants.
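The filter applied to the high-confidence variant lists can be illustrated with a small predicate. This is a sketch only: the function name and the representation of a genotype as a pair of allele indices are hypothetical, and reading the thresholds as strict inequalities (GQ > 20 and 6 < DP < 100) is our interpretation of the sentence above; the README at http://broad.io/variants remains the authoritative description.

```python
def is_high_confidence_het(genotype, gq, dp, min_gq=20, min_dp=6, max_dp=100):
    """Return True for a heterozygous call that passes the quality filters.

    genotype: pair of allele indices, for example (0, 1)
    gq:       genotype quality of the call
    dp:       read depth at the site
    Thresholds are treated as strict inequalities (an assumption).
    """
    allele_a, allele_b = genotype
    is_het = allele_a != allele_b
    return is_het and gq > min_gq and min_dp < dp < max_dp

# Made-up calls for illustration: (genotype, GQ, DP)
calls = [((0, 1), 35, 40), ((0, 0), 99, 40), ((0, 1), 20, 40), ((0, 1), 60, 150)]
kept = [c for c in calls if is_high_confidence_het(*c)]
print(len(kept), "of", len(calls), "calls retained")
```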
Inferring overall heterozygosity
To avoid confounding by sex chromosomes or complex regions, we excluded all scaffolds with less than 0.5× or greater than 2× the average sample read depth, then calculated global heterozygosity as the fraction of heterozygous calls over the whole callable genome.
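The depth filter and the heterozygosity ratio can be sketched as follows. The per-scaffold record layout (keys depth, het_sites and callable_sites), the function name and the toy numbers are hypothetical, and the genome-wide mean depth is passed in as an argument because the text does not specify how it was computed.

```python
def overall_heterozygosity(scaffolds, mean_depth, low=0.5, high=2.0):
    """Fraction of callable sites that are heterozygous.

    Scaffolds whose read depth is below low * mean_depth or above
    high * mean_depth are excluded before the ratio is taken.
    """
    kept = [
        s for s in scaffolds
        if low * mean_depth <= s["depth"] <= high * mean_depth
    ]
    callable_sites = sum(s["callable_sites"] for s in kept)
    if callable_sites == 0:
        raise ValueError("no callable sites remain after depth filtering")
    het_sites = sum(s["het_sites"] for s in kept)
    return het_sites / callable_sites

# Toy scaffolds (illustrative values only, not study data).
toy_scaffolds = [
    {"depth": 40, "het_sites": 50, "callable_sites": 100_000},
    {"depth": 18, "het_sites": 5, "callable_sites": 50_000},    # under 0.5x, dropped
    {"depth": 95, "het_sites": 900, "callable_sites": 30_000},  # over 2x, dropped
    {"depth": 42, "het_sites": 30, "callable_sites": 100_000},
]
print(overall_heterozygosity(toy_scaffolds, mean_depth=40))
```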
Calling SoH
We estimated the proportion of the genome within SoH using a metric designed for genomes with scaffold N50 shorter than the expected maximum length of runs of homozygosity (our median scaffold N50 is 62 kb). We first split all scaffolds into windows with a maximum length of 50 kb, with windows ranging from 20 to 50 kb for scaffolds <50 kb. For each window, we calculated the average number of heterozygous sites per bp. We discriminated windows with extremely low heterozygosity by using the Python 3.5.2 pomegranate package to fit a two-component Gaussian mixture model to the joint distribution of window heterozygosity, forcing the first component to be centred around the lower tail of the distribution and allowing the second to freely capture all the remaining heterozygosity variability56,57. As heterozygosity cannot be negative, and normal distributions near zero can cross into negative values, we used the normal cumulative distribution function to correct the posterior distribution by the negative excess—effectively fitting a truncated normal to the first component. The final SoH value was calculated using the posterior maximum likelihood classification between both components. We saw no significant correlation between contig N50 and SoH (Pearson correlation = 0.055, P = 0.57, n = 112).
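The windowing step that precedes the mixture model can be sketched as follows. This is one possible reading of the description, stated as an assumption: scaffolds are cut into consecutive windows of at most 50 kb, a trailing piece is kept only if it is at least 20 kb, and scaffolds shorter than 20 kb yield no window. Function names are hypothetical, and the Gaussian mixture fit itself is not reproduced here.

```python
def split_into_windows(scaffold_length, max_window=50_000, min_window=20_000):
    """Split a scaffold into half-open (start, end) windows.

    Windows are at most max_window long; a remainder shorter than
    min_window is discarded (an assumed rule, see lead-in).
    """
    windows = []
    start = 0
    while scaffold_length - start >= min_window:
        end = min(start + max_window, scaffold_length)
        windows.append((start, end))
        start = end
    return windows

def window_heterozygosity(het_positions, windows):
    """Average number of heterozygous sites per bp in each window."""
    return [
        sum(start <= pos < end for pos in het_positions) / (end - start)
        for start, end in windows
    ]

windows = split_into_windows(120_000)
rates = window_heterozygosity([10, 49_999, 50_000, 119_999], windows)
print(windows)
print(rates)
```

The resulting per-window rates are the values to which a two-component mixture would then be fitted in order to separate near-zero windows from the rest.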
Assessing the effect of the percentage of callable genome
We assessed whether the percentage of the genome that was callable (Supplementary Table 3) was likely to affect our analysis. The callable percentage was correlated with heterozygosity (r = −0.80, P < 2.2 × 10⁻¹⁶, n = 130), and weakly with SoH values (r = 0.18, P = 0.06, n = 112). There is no significant difference in callable percentage among IUCN categories (analysis of variance P = 0.98, n = 122) or between captive and wild populations (t-test P = 0.81, n = 120).
Analysing patterns of diversity
We excluded two genomes with exceptionally high heterozygosity (heterozygosity >0.02; >5 s.d. above the mean). Both were of non-endangered species, and thus removing them made our determination of lower heterozygosity in endangered species more conservative. Of the remaining 124 genomes, we excluded 19 with allelic balance values that were more than one s.d. above the mean (>0.36). Abnormally high allelic balance can indicate sequencing biases with potential for artefacts in estimates of heterozygosity and/or SoH. Our final dataset contains heterozygosity values for 105 genomes and SoH values for 98 genomes (Supplementary Table 3). For seven genomes, we were unable to estimate SoH because the two components of the Gaussian mixture model overlapped completely. To ask about a possible directional relationship between level of IUCN concern and overall heterozygosity or SoH, we applied regression using the IUCN category as an ordinal predictor. We also asked about the relationship of diversity metrics to a set of species-level phenotypes for which correlations were previously reported (Extended Data Table 3).
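A regression with IUCN category as an ordinal predictor can be illustrated by scoring the categories as consecutive integers and fitting an ordinary least-squares line. This scoring is an assumption made for the sketch (the text does not state how the ordinal predictor was encoded), the function name is hypothetical, and the heterozygosity values below are invented for illustration and are not data from this study.

```python
# Assumed integer scores for IUCN categories, ordered by level of concern.
IUCN_RANK = {"LC": 0, "NT": 1, "VU": 2, "EN": 3, "CR": 4}

def ordinal_regression(categories, values):
    """Least-squares fit of values on integer-scored IUCN categories.

    Returns (slope, intercept, r_squared).
    """
    x = [IUCN_RANK[c] for c in categories]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(values) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in values)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Invented toy values, for illustration only.
toy_categories = ["LC", "LC", "NT", "VU", "EN", "CR"]
toy_heterozygosity = [0.004, 0.003, 0.003, 0.002, 0.002, 0.001]
slope, intercept, r2 = ordinal_regression(toy_categories, toy_heterozygosity)
print(f"slope = {slope:.5f}, intercept = {intercept:.5f}, R^2 = {r2:.3f}")
```

A negative slope corresponds to lower heterozygosity at higher levels of concern; testing the slope for significance would additionally require its standard error, which is omitted here.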
Alignment
The alignment was generated using the progressive mode of Cactus37,58. The topology used for the guide tree of the alignment was taken from TimeTree47; the branch lengths of the guide tree were generated by a least-squares fit from a distance matrix. The distance matrix was based on the UCSC 100-way phyloP fourfold-degenerate site tree38 for those species that had corresponding entries in the 100-way tree. For species not present in the 100-way tree, distance matrix entries were more coarsely estimated using the distance estimated from Mash59 to the closest relative included in the 100-way data.
Cactus does not attempt to fully resolve the gene tree when multiple duplications take place along a single branch, as there is an implicit restriction in Cactus that a duplication event be represented as multiple regions in the child species aligned to a single region in the parent species. This precludes representing discordance between the gene tree and species tree that could occur with incomplete lineage-sorting or horizontal transfer. However, the guide tree has a minimal effect on the alignment, with little difference between alignments with different trees—even when using a tree that is purposely wrong37. Phenomena such as incomplete lineage sorting that affect a subset of species are unlikely to substantially affect the detection of purifying selection across the whole eutherian lineage described in Fig. 3.
The alignment was generated in several steps, on account of its large scale. First, a backbone alignment of several long-contiguity assemblies was generated, using the genomes of two non-placental mammals (Tasmanian devil (Sarcophilus harrisii) and platypus (Ornithorhynchus anatinus)) to inform the reconstruction of the placental root. Next, separate clade alignments were generated for each major clade in the alignment: Euarchonta, Glires, Laurasiatheria, Afrotheria and Xenarthra. The roots of these clade alignments were then aligned to the corresponding ancestral genomes from the backbone, stitching these alignments together to create the final alignment. The process of aligning a genome to an existing ancestor is complex and further described in an accompanying Article that introduces the progressive mode of Cactus37.
We created a neutral model for the conservation analysis using ancestral repeats detected by RepeatMasker60 on the eutherian ancestral genome produced in the Cactus alignment (tRNA and low-complexity repeats were removed). To fit the neutral model, we used phyloFit from the PHAST61 package, using the REV (generalized reversible) model and EM optimization method. The training input was a MAF exported on columns from the set of ancestral repeats mentioned above. Because phyloFit does not support alignment columns that contain duplicates, if a genome had more than one sequence in a single alignment block, these were replaced with a single entry representing the consensus base at each column.
We extracted initial conservation scores using phyloP from the PHAST61 package on a MAF exported using human as a reference. We converted the phyloP scores (which represent log-scaled P values of acceleration or conservation) into P values, then into q values using the FDR-correction of Benjamini and Hochberg62. Any column with a resulting q value less than 0.05 was deemed significantly conserved or accelerated.
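The conversion from phyloP scores to q values can be sketched as follows. We assume here that a phyloP score is a signed −log10 P value (positive for conservation, negative for acceleration), so that P = 10^(−|score|); the Benjamini and Hochberg step-up procedure is implemented directly rather than taken from a library. Function names and the example scores are hypothetical.

```python
def phylop_to_pvalue(score):
    """Convert a signed -log10(P) phyloP score to a P value (assumed convention)."""
    return 10.0 ** (-abs(score))

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotone q values.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        qvalues[index] = running_min
    return qvalues

# Made-up scores for four alignment columns (illustrative only).
scores = [4.2, 0.3, -2.5, 1.1]
pvals = [phylop_to_pvalue(s) for s in scores]
qvals = benjamini_hochberg(pvals)
significant = [q < 0.05 for q in qvals]
print(significant)
```

The sign of the original score can then be used to label each significant column as conserved or accelerated.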
The alignment and the conservation annotations are available at https://cglgenomics.ucsc.edu/data/cactus/.
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this paper.
Data availability
The project website is http://zoonomiaproject.org/. Details of each Zoonomia genome assembly—including NCBI GenBank63 accession numbers—are provided in Supplementary Table 1. Sequence data and genome assemblies are available at https://www.ncbi.nlm.nih.gov/. Variant lists for each species are provided at http://broad.io/variants. Further source data for Fig. 2 are provided in the Zoonomia GitHub repository (https://doi.org/10.5281/zenodo.3887432). The Cactus alignment is provided at https://cglgenomics.ucsc.edu/data/cactus/. A visualization of the alignments and phyloP data is available by loading our assembly hub into the UCSC browser64 by copying the hub link https://comparative-genomics-hubs.s3-us-west-2.amazonaws.com/200m_hub.txt into the Track Hubs page. There are no restrictions on use. Source data are provided with this paper.
Code availability
The DISCOVAR de novo assembly code is available at https://github.com/broadinstitute/discovar_de_novo/releases/tag/v52488 (https://doi.org/10.5281/zenodo.3870889), the Cactus pipeline is available at https://github.com/ComparativeGenomicsToolkit/cactus (https://doi.org/10.5281/zenodo.3873410) and code for other analyses is available at https://github.com/broadinstitute/Zoonomia/ (https://doi.org/10.5281/zenodo.3887432).
References
Claussnitzer, M. et al. A brief history of human disease genetics. Nature 577, 179–189 (2020).
Hiller, M. et al. A “forward genomics” approach links genotype to phenotype using independent phenotypic losses among related species. Cell Rep. 2, 817–823 (2012).
Wasser, S. K. et al. Genetic assignment of large seizures of elephant ivory reveals Africa’s major poaching hotspots. Science 349, 84–87 (2015).
Wright, B. et al. Development of a SNP-based assay for measuring genetic diversity in the Tasmanian devil insurance population. BMC Genomics 16, 791 (2015).
Lappalainen, T., Scott, A. J., Brandt, M. & Hall, I. M. Genomic analysis in the age of human genome sequencing. Cell 177, 70–84 (2019).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315 (2014).
Lindblad-Toh, K. et al. A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476–482 (2011).
Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228–1235 (2015).
IUCN. The IUCN Red List of Threatened Species. Version 2019-2 http://www.iucnredlist.org (2019).
Ryder, O. A. & Onuma, M. Viable cell culture banking for biodiversity characterization and conservation. Annu. Rev. Anim. Biosci. 6, 83–98 (2018).
Weisenfeld, N. I. et al. Comprehensive variation discovery in single human genomes. Nat. Genet. 46, 1350–1355 (2014).
Putnam, N. H. et al. Chromosome-scale shotgun assembly using an in vitro method for long-range linkage. Genome Res. 26, 342–350 (2016).
Kim, J. et al. Reconstruction and evolutionary history of eutherian chromosomes. Proc. Natl Acad. Sci. USA 114, E5379–E5388 (2017).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Balasubramanian, S. et al. Using ALoFT to determine the impact of putative loss-of-function variants in protein-coding genes. Nat. Commun. 8, 382 (2017).
Meadows, J. R. S. & Lindblad-Toh, K. Dissecting evolution and disease using comparative vertebrate genomics. Nat. Rev. Genet. 18, 624–636 (2017).
Cooper, G. M. & Shendure, J. Needles in stacks of needles: finding disease-causal variants in a wealth of genomic data. Nat. Rev. Genet. 12, 628–640 (2011).
Baiz, M. D., Tucker, P. K., Mueller, J. L. & Cortés-Ortiz, L. X-linked signature of reproductive isolation in humans is mirrored in a howler monkey hybrid zone. J. Hered. 111, 419–428 (2020).
Dobzhansky, T. & Dobzhansky, T. G. Genetics and the Origin of Species (Columbia Univ. Press, 1937).
Herrera-Álvarez, S., Karlsson, E., Ryder, O. A., Lindblad-Toh, K. & Crawford, A. J. How to make a rodent giant: genomic basis and tradeoffs of gigantism in the capybara, the world’s largest rodent. Preprint at https://doi.org/10.1101/424606 (2018).
Abegglen, L. M. et al. Potential mechanisms for cancer resistance in elephants and comparative cellular response to DNA damage in humans. J. Am. Med. Assoc. 314, 1850–1860 (2015).
Casewell, N. R. et al. Solenodon genome reveals convergent evolution of venom in eulipotyphlan mammals. Proc. Natl Acad. Sci. USA 116, 25745–25755 (2019).
Beichman, A. C. et al. Aquatic adaptation and depleted diversity: a deep dive into the genomes of the sea otter and giant otter. Mol. Biol. Evol. 36, 2631–2655 (2019).
Damas, J. et al. Broad host range of SARS-CoV-2 predicted by comparative and structural analysis of ACE2 in vertebrates. Proc. Natl Acad. Sci. USA 117, 22311–22322 (2020).
Xue, Y. et al. Mountain gorilla genomes reveal the impact of long-term population decline and inbreeding. Science 348, 242–245 (2015).
Ceballos, F. C., Joshi, P. K., Clark, D. W., Ramsay, M. & Wilson, J. F. Runs of homozygosity: windows into population history and trait architecture. Nat. Rev. Genet. 19, 220–234 (2018).
Spielman, D., Brook, B. W. & Frankham, R. Most species are not driven to extinction before genetic factors impact them. Proc. Natl Acad. Sci. USA 101, 15261–15264 (2004).
Vinson, J. P. et al. Assembly of polymorphic genomes: algorithms and application to Ciona savignyi. Genome Res. 15, 1127–1135 (2005).
MacManes, M. D. & Lacey, E. A. The social brain: transcriptome assembly and characterization of the hippocampus from a social subterranean rodent, the colonial tuco-tuco (Ctenomys sociabilis). PLoS ONE 7, e45524 (2012).
Jones, K. E. et al. PanTHERIA: a species-level database of life history, ecology, and geography of extant and recently extinct mammals. Ecology 90, 2648 (2009).
Cardillo, M. Biological determinants of extinction risk: why are smaller species less vulnerable? Anim. Conserv. 6, 63–69 (2003).
Natesh, M. et al. Empowering conservation practice with efficient and economical genotyping from poor quality samples. Methods Ecol. Evol. 10, 853–859 (2019).
Lowry, D. B. et al. Breaking RAD: an evaluation of the utility of restriction site-associated DNA sequencing for genome scans of adaptation. Mol. Ecol. Resour. 17, 142–152 (2017).
Shapiro, B. Pathways to de-extinction: how close can we get to resurrection of an extinct species? Funct. Ecol. 31, 996–1002 (2017).
Benazzo, A. et al. Survival and divergence in a small group: the extraordinary genomic history of the endangered Apennine brown bear stragglers. Proc. Natl Acad. Sci. USA 114, E9589–E9597 (2017).
Saremi, N. F. et al. Puma genomes from North and South America provide insights into the genomic consequences of inbreeding. Nat. Commun. 10, 4769 (2019).
Armstrong, J. et al. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature https://doi.org/10.1038/s41586-020-2871-y (2020).
Haeussler, M. et al. The UCSC genome browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).
Rands, C. M., Meader, S., Ponting, C. P. & Lunter, G. 8.2% of the human genome is constrained: variation in rates of turnover across functional element classes in the human lineage. PLoS Genet. 10, e1004525 (2014).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
GTEx Consortium. Genetic effects on gene expression across human tissues. Nature 550, 204–213 (2017).
Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
Lewin, H. A. et al. Earth BioGenome project: sequencing life for the future of life. Proc. Natl Acad. Sci. USA 115, 4325–4333 (2018).
Koepfli, K.-P., Paten, B., the Genome 10K Community of Scientists & O’Brien, S. J. The Genome 10K project: a way forward. Annu. Rev. Anim. Biosci. 3, 57–111 (2015).
Teeling, E. C. et al. Bat biology, genomes, and the Bat1K project: to generate chromosome-level genomes for all living bat species. Annu. Rev. Anim. Biosci. 6, 23–46 (2018).
Feng, S. et al. Dense sampling of bird diversity increases power of comparative genomics. Nature https://doi.org/10.1038/s41586-020-2873-9 (2020).
Kumar, S., Stecher, G., Suleski, M. & Hedges, S. B. TimeTree: a resource for timelines, timetrees, and divergence times. Mol. Biol. Evol. 34, 1812–1819 (2017).
Wilson, D. E. & Reeder, D. M. (eds) Mammal Species of the World. A Taxonomic and Geographic Reference 3rd edn (Johns Hopkins Univ. Press, 2005).
Vlieghe, D. et al. A new generation of JASPAR, the open-access repository for transcription factor binding site profiles. Nucleic Acids Res. 34, D95–D97 (2006).
Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).
Farré, M. et al. A near-chromosome-scale genome assembly of the gemsbok (Oryx gazella): an iconic antelope of the Kalahari desert. Gigascience 8, giy162 (2019).
McKenna, A. et al. The genome analysis toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20, 1297–1303 (2010).
DePristo, M. A. et al. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat. Genet. 43, 491–498 (2011).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Benaglia, T., Chauveau, D., Hunter, D. & Young, D. mixtools: an R package for analyzing finite mixture models. J. Stat. Softw. 32, 1–29 (2009).
R Core Team. R: A Language and Environment for Statistical Computing. https://www.R-project.org/ (2019).
Paten, B. et al. Cactus: algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011).
Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. http://www.repeatmasker.org/ (2013–2015).
Hubisz, M. J., Pollard, K. S. & Siepel, A. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief. Bioinform. 12, 41–51 (2011).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).
Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).
Nguyen, N. et al. Comparative assembly hubs: web-accessible browsers for comparative genomics. Bioinformatics 30, 3293–3301 (2014).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Pinheiro, E. C., Taddei, V. A., Migliorini, R. H. & Kettelhut, I. C. Effect of fasting on carbohydrate metabolism in frugivorous bats (Artibeus lituratus and Artibeus jamaicensis). Comp. Biochem. Physiol. B Biochem. Mol. Biol. 143, 279–284 (2006).
Gordon, L. M. et al. Amorphous intergranular phases control the properties of rodent tooth enamel. Science 347, 746–750 (2015).
Hindle, A. G. & Martin, S. L. Intrinsic circannual regulation of brown adipose tissue form and function in tune with hibernation. Am. J. Physiol. Endocrinol. Metab. 306, E284–E299 (2014).
Stanford, K. I. et al. Brown adipose tissue regulates glucose homeostasis and insulin sensitivity. J. Clin. Invest. 123, 215–223 (2013).
Chondronikola, M. et al. Brown adipose tissue improves whole-body glucose homeostasis and insulin sensitivity in humans. Diabetes 63, 4089–4099 (2014).
Saito, M. et al. High incidence of metabolically active brown adipose tissue in healthy adult humans: effects of cold exposure and adiposity. Diabetes 58, 1526–1531 (2009).
Acknowledgements
We thank the many individuals who provided samples and advice, including C. Adenyo, C. Avila, E. Baitchman, R. Behringer, A. Boyko, M. Breen, K. Campbell, P. Campbell, C. J. Conroy, K. Cooper, L. M. Dávalos, F. Delsuc, D. Distel, C. A. Emerling, J. Fronczek, N. Gemmel, J. Good, K. He, K. Helgen, A. Hindle, H. Hoekstra, R. Honeycutt, P. Hulva, W. Israelsen, B. Kayang, R. Kennerley, M. Korody, D. N. Lee, E. Louis, M. MacManes, A. Misuraca, A. Mitelberg, P. Morin, A. Mouton, M. Murayama, M. Nachman, A. Navarro, R. Ogden, B. Pasch, S. Puechmaille, T. J. Robinson, S. Rossiter, M. Ruedi, A. Seifert, S. Thomas, S. Turvey, G. Verbeylen and the late R. J. Baker. We also thank the Broad Institute Genomics Platform and SNP & SEQ Technology Platform (part of the National Genomics Infrastructure (NGI) Sweden and Science for Life Laboratory) and Swedish National Infrastructure for Computing (SNIC) at Uppmax. This project was funded by NIH NHGRI R01HG008742 (E.K.K., B.B., D.P.G., R.S., J.T.-M., J.J., H.J.N., B.P. and J. Armstrong), Swedish Research Council Distinguished Professor Award (K.L.-T., V.D.M., E.M. and J.R.S.M.), Swedish Research Council grant 2018-05973 (K.L.-T.), Knut and Alice Wallenberg Foundation (K.L.-T., V.D.M., E.M. and J.R.S.M.), Uppsala University (K.L.-T., V.D.M., E.M., J.R.S.M., J.J., J. Alfoldi and L.G.), Broad Institute Next10 (L.G.), Gladstone Institutes (K.S.P.), NIH NHGRI 5R01HG002939 (A.F.A.S. and R.H.), NIH NHGRI 5U24HG010136 (A.F.A.S. and R.H.), NIH NHGRI 5R01HG010485 (B.P. and M.D.), NIH NHGRI 2U41HG007234 (B.P., M.D. and J. Armstrong), NIH NIA 5PO1AG047200 (V.N.G.), NIH NIA 1UH2AG064706 (V.N.G.), BFU2017-86471-P MINECO/FEDER, UE (T.M.-B.), Secretaria d’Universitats i Recerca and CERCA Programme del Departament d’Economia i Coneixement de la Generalitat de Catalunya GRC 2017 SGR 880 (T.M.-B.), Howard Hughes International Early Career (T.M.-B.), European Research Council Horizon 2020 no. 864203 (T.M.-B.), Obra Social ‘La Caixa’ (T.M.-B.), BBSRC BBS/E/T/000PR9818, BBS/E/T/000PR9783 (W.H. and W.N.), BBSRC Core Strategic Programme Grant BB/P016774/1 (W.H., W.N. and F.D.), Sir Henry Dale Fellowship 200517/Z/16/Z jointly funded by the Wellcome Trust and the Royal Society (N.R.C.), FJCI-2016-29558 MICINN (D.J.), Prince Albert II Foundation of Monaco and Canada, Global Genome Initiative, Smithsonian Institution (M.N.), European Research Council Research Grant ERC-2012-StG311000 (E.C.T.), Irish Research Council Laureate Award (E.C.T.), UK Medical Research Council MR/P026028/1 (W.H. and W.N.), National Science Foundation DEB-1457735 (M.S.S.), National Science Foundation DEB-1753760 (W.J.M.), National Science Foundation IOS-2029774 (E.K.K. and D.P.G.), Robert and Rosabel Osborne Endowment (H.A.L. and J.D.), Swedish Research Council, FORMAS 221-2012-1531 (J.R.S.M.), NSF RoL: FELS: EAGER: DEB 1838283 (D.A.R.) and Academy of Finland grant to Center of Excellence in Tumor Genetics Research no. 312042 (T.K. and J.T.).
Author information
Contributions
K.L.-T. conceived the project. J.J., V.D.M., E.M., N.R.C., L.G.C., J.D., V.N.G., M.L.H., K.-P.K., J.R.S.M., W.J.M., M.N., D.A.R., R.S., E.C.T., J. Alfoldi, O.A.R., H.A.L., K.L.-T. and E.K.K. contributed to the acquisition of the samples. J.J., V.D.M., E.M., J.D., L.G., K.-P.K., H.J.N., C.C.S., R.S., J.T.-M., J. Alfoldi, O.A.R., H.A.L., K.L.-T. and E.K.K. contributed to the production of the genome assemblies. D.P.G., A.S., J. Armstrong, J.J., D.J., I.T.F., L.F.K.K., H.A.L., T.M.-B., K.L.-T. and E.K.K. contributed to the data analysis. D.P.G., J.J., V.D.M., G.B., F.D.P., M.D., I.T.F., M.G., V.N.G., W.H., R.H., T.K., E.S.L., J.R.S.M., A.R.P., K.S.P., A.F.A.S., M.S.S., J.T., J. Alfoldi, B.B., O.A.R., H.A.L., B.P., T.M.-B., K.L.-T. and E.K.K. contributed to the design and conduct of the project. D.P.G., E.S.L., W.N., B.S., O.A.R., K.L.-T. and E.K.K. wrote the manuscript, with input from all other authors.
Ethics declarations
Competing interests
L.G. is a co-founder of, equity owner in and chief technical officer at Fauna Bio Incorporated.
Additional information
Peer review information Nature thanks Chris Ponting, Steven Salzberg, Guojie Zhang and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Extended data figures and tables
Extended Data Fig. 1 Notable traits in non-human mammals.
Sequences from species with notable phenotypes can inform human medicine, basic biology and biodiversity conservation, but sample collection can be challenging. a, The Jamaican fruit bat (Artibeus jamaicensis) maintains constant blood glucose across intervals of fruit-eating and fasting66, achieving homeostasis to a degree that is unknown in the treatment of human diabetes. b, The North American beaver (Castor canadensis) avoids tooth decay by incorporating iron rather than magnesium into tooth enamel, which yields an orange hue67. c, The thirteen-lined ground squirrel (Ictidomys tridecemlineatus) prepares for hibernation by rapidly increasing the thermogenic activity of brown fat68, a process that, in humans, is connected to improved glucose homeostasis and insulin sensitivity69,70,71. d, The tiny bumblebee bat (Craseonycteris thonglongyai) is among the smallest of mammals, making it a sparse source of DNA. e, The remote habitat of the very rare Amazon River dolphin (Inia geoffrensis) precludes collection of high-molecular-weight DNA. Image sources: Merlin D. Tuttle/Science Source (a); Stephen J. Krasemann/Science Source (b); Allyson Hindle (c); Sébastien J. Puechmaille (CC BY-SA) (d); M. Watson/Science Source (e).
Extended Data Fig. 2 Sample collection can be challenging, and sequencing methods must be selected to handle the sample quality.
To enable the inclusion of species from across the eutherian tree (including from the 50% of mammalian families not represented in existing genome databases), the Zoonomia Project needed sequencing and assembly methods that produce reliable data from DNA collected in remote locations, sometimes in only modest quantities and often without the benefit of cold chains for transport. a, For marine species such as the narwhal (Monodon monoceros), simply accessing an individual in the wild can prove challenging. For example, to sample DNA from the near-threatened narwhal, M.N. and Inuit guide D. Angnatsiak camped on the edge of an ice floe between Pond Inlet and Bylot Island, at the northeastern tip of Baffin Island. After a narwhal was collected by Inuit hunters as part of an annual hunt, hours of flensing were necessary for the collection of tissue samples. From left to right, F. McCann, H. C. Schmidt, F. Eichmiller, M.N., J. Orr (facing backward) and J. Orr (standing). b, For endangered species such as the Hispaniolan solenodon (S. paradoxus), sample collection must be designed to minimize stress to the individual, which limits the amount of DNA that can be collected22. To collect DNA from the endangered solenodon without imposing stress on individuals in the wild, N.R.C. turned to the world’s only captive solenodons, which are housed off-exhibit at ZOODOM in the Dominican Republic. With help from veterinarians at the zoo, N.R.C. collected a small amount of blood from the rugged tail of the solenodon. Narwhal photograph by G. Freund, and courtesy of M.N. Solenodon photograph courtesy of L. Emery.
Supplementary information
Supplementary Tables
This file contains Supplementary Tables 1-3.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zoonomia Consortium. A comparative genomics multitool for scientific discovery and conservation. Nature 587, 240–245 (2020). https://doi.org/10.1038/s41586-020-2876-6
DOI: https://doi.org/10.1038/s41586-020-2876-6