Dense sampling of bird diversity increases power of comparative genomics

An Author Correction to this article was published on 08 April 2021

This article has been updated

Abstract

Whole-genome sequencing projects are increasingly populating the tree of life and characterizing biodiversity (refs. 1–4). Sparse taxon sampling has previously been proposed to confound phylogenetic inference (ref. 5), and captures only a fraction of the genomic diversity. Here we report a substantial step towards the dense representation of avian phylogenetic and molecular diversity, by analysing 363 genomes from 92.4% of bird families, including 267 newly sequenced genomes produced for phase II of the Bird 10,000 Genomes (B10K) Project. We use this comparative genome dataset in combination with a pipeline that leverages a reference-free whole-genome alignment to identify orthologous regions in greater numbers than has previously been possible and to recognize genomic novelties in particular bird lineages. The densely sampled alignment provides a single-base-pair map of selection, has more than doubled the fraction of bases that are confidently predicted to be under conservation and reveals extensive patterns of weak selection in predominantly non-coding DNA. Our results demonstrate that increasing the diversity of genomes used in comparative studies can reveal more shared and lineage-specific variation, and improve the investigation of genomic characteristics. We anticipate that this genomic resource will offer new perspectives on evolutionary processes in cross-species comparative analyses and assist in efforts to conserve species.

Fig. 1: Newly sequenced genomes densely cover the bird tree of life.
Fig. 2: Improved orthologue distinction and detection of lineage-specific sequences.
Fig. 3: Denser phylogenomic sequencing increases the power to detect selective constraints.

Data availability

All data released with this Article can be freely used. The B10K consortium is organizing phylogenomic analyses and other analyses with the whole-genome alignment, and we encourage others to contact us for collaboration. Genome sequencing data, the genome assemblies and annotations of 267 species generated in this study have been deposited in the NCBI SRA and GenBank under accession PRJNA545868. The above data have also been deposited in the CNSA (https://db.cngb.org/cnsa/) of CNGBdb with accession number CNP0000505. The mitochondrial genomes and annotations of 336 species have been deposited in the NCBI GenBank under PRJNA545868. Sample information for each genome and the genome statistics can also be viewed online at https://b10k.scifeon.cloud/. The whole-genome alignment of the 363 birds in HAL format, along with a UCSC browser hub for all 363 species, is available at https://cglgenomics.ucsc.edu/data/cactus/. The Supplementary Data, which contains the tree file in Newick format for all 10,135 species of birds, is also available on Mendeley Data (https://doi.org/10.17632/fnpwzj37gw). The tree was pruned from the synthesis tree by excluding all subspecies, operational taxonomic units and unaccepted species as described in the Supplementary Information. Other data generated and analysed during this study, including Supplementary Tables 1–15, are also available on Mendeley Data (https://doi.org/10.17632/fnpwzj37gw). The study used publicly available data for species confirmation from the Barcode of Life Data (BOLD) (http://www.barcodinglife.org) and NCBI (https://www.ncbi.nlm.nih.gov/). The reference genomes, gene sets and published RNA-sequencing data used in the gene annotation and alignment construction of this study are available from Ensembl (http://www.ensembl.org) and NCBI. The databases used in functional annotation are available in InterPro (https://www.ebi.ac.uk/interpro), SwissProt (https://www.uniprot.org) and KEGG (https://www.genome.jp/kegg). The database used in the transposable elements annotation is available online (http://www.repeatmasker.org). The 77-way MULTIZ alignment, RefSeq genes and lncRNA gene set used in the selection analysis are available in the UCSC Genome Browser (http://www.genome.ucsc.edu) and the NONCODEv.5 database (http://www.noncode.org). The JASPAR2020 CORE vertebrate database used to identify transcription factor binding motifs is available online (http://jaspar2020.genereg.net).
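As a quick sanity check on a downloaded Newick tree file, one might list its tip labels and count them. The following is a minimal sketch, not part of the B10K tooling: the function name `newick_tip_labels` and the toy tree are hypothetical, and the regular expression assumes unquoted labels (a label counts as a tip when it directly follows `(` or `,` rather than `)`).

```python
import re

# Hypothetical helper: a label is treated as a tip when it directly follows
# '(' or ',' (internal-node labels follow ')'). Quoted labels are not handled.
TIP_PATTERN = re.compile(r"(?<=[(,])\s*([^\s(),:;]+)")

def newick_tip_labels(newick: str) -> list[str]:
    """Return the tip (leaf) labels of a Newick string, in file order."""
    return TIP_PATTERN.findall(newick)

# Toy tree for illustration only; it is not taken from the Supplementary Data.
toy = ("((Gallus_gallus:1.0,Meleagris_gallopavo:1.0)Phasianidae:2.0,"
       "Taeniopygia_guttata:3.0);")
tips = newick_tip_labels(toy)
print(len(tips), tips)
```

For real analyses a dedicated phylogenetics library would be a safer choice, because it also handles quoted labels and comments.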

Code availability

Scripts to run the annotation pipeline and the orthologue assignment pipeline can be found on the B10K GitHub repository at https://github.com/B10KGenomes/annotation. Scripts to estimate the neutral model can be found at https://github.com/ComparativeGenomicsToolkit/neutral-model-estimator.

References

  1. Lewin, H. A. et al. Earth BioGenome project: sequencing life for the future of life. Proc. Natl Acad. Sci. USA 115, 4325–4333 (2018).

  2. Genome 10K Community of Scientists. Genome 10K: a proposal to obtain whole-genome sequence for 10,000 vertebrate species. J. Hered. 100, 659–674 (2009).

  3. i5K Consortium. The i5K initiative: advancing arthropod genomics for knowledge, human health, agriculture, and the environment. J. Hered. 104, 595–600 (2013).

  4. Cheng, S. et al. 10KP: a phylodiverse genome sequencing plan. Gigascience 7, 1–9 (2018).

  5. Prum, R. O. et al. A comprehensive phylogeny of birds (Aves) using targeted next-generation DNA sequencing. Nature 526, 569–573 (2015).

  6. Zhang, G. et al. Bird sequencing project takes off. Nature 522, 34 (2015).

  7. Boomsma, J. J. et al. The Global Ant Genomics Alliance (GAGA). Myrmecol. News 25, 61–66 (2017).

  8. Chen, L. et al. Large-scale ruminant genome sequencing provides insights into their evolution and distinct traits. Science 364, eaav6202 (2019).

  9. Jarvis, E. D. et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346, 1320–1331 (2014).

  10. Zhang, G. et al. Comparative genomics reveals insights into avian genome evolution and adaptation. Science 346, 1311–1320 (2014).

  11. Dickinson, E. C. & Remsen, J. V. (eds) The Howard and Moore Complete Checklist of the Birds of the World Volume 1: Non-passerines 4th edn (Aves, 2013).

  12. Dickinson, E. C. & Christidis, L. (eds) The Howard and Moore Complete Checklist of the Birds of the World Volume 2: Passerines 4th edn (Aves, 2014).

  13. BirdLife International. Leucopsar rothschildi. https://doi.org/10.2305/IUCN.UK.2018-2.RLTS.T22710912A129874226.en (The IUCN Red List of Threatened Species, 2018).

  14. Meredith, R. W., Zhang, G., Gilbert, M. T. P., Jarvis, E. D. & Springer, M. S. Evidence for a single loss of mineralized teeth in the common avian ancestor. Science 346, 1254390 (2014).

  15. Deutekom, E. S., Vosseberg, J., van Dam, T. J. P. & Snel, B. Measuring the impact of gene prediction on gene loss estimates in Eukaryotes by quantifying falsely inferred absences. PLOS Comput. Biol. 15, e1007301 (2019).

  16. Plotkin, J. B. & Kudla, G. Synonymous but not the same: the causes and consequences of codon bias. Nat. Rev. Genet. 12, 32–42 (2011).

  17. Armstrong, J. et al. Progressive Cactus is a multiple-genome aligner for the thousand-genome era. Nature https://doi.org/10.1038/s41586-020-2871-y (2020).

  18. Armstrong, J. Enabling Comparative Genomics at the Scale of Hundreds of Species. PhD thesis, Univ. California Santa Cruz (2019).

  19. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708–715 (2004).

  20. Pegueroles, C., Laurie, S. & Albà, M. M. Accelerated evolution after gene duplication: a time-dependent process affecting just one copy. Mol. Biol. Evol. 30, 1830–1842 (2013).

  21. Yuri, T., Kimball, R. T., Braun, E. L. & Braun, M. J. Duplication of accelerated evolution and growth hormone gene in passerine birds. Mol. Biol. Evol. 25, 352–361 (2008).

  22. Armstrong, J., Fiddes, I. T., Diekhans, M. & Paten, B. Whole-genome alignment and comparative annotation. Annu. Rev. Anim. Biosci. 7, 41–64 (2019).

  23. Schusdziarra, C., Blamowska, M., Azem, A. & Hell, K. Methylation-controlled J-protein MCJ acts in the import of proteins into human mitochondria. Hum. Mol. Genet. 22, 1348–1357 (2013).

  24. Zhang, B., Peñagaricano, F., Driver, A., Chen, H. & Khatib, H. Differential expression of heat shock protein genes and their splice variants in bovine preimplantation embryos. J. Dairy Sci. 94, 4174–4182 (2011).

  25. Mlitz, V. et al. Trichohyalin-like proteins have evolutionarily conserved roles in the morphogenesis of skin appendages. J. Invest. Dermatol. 134, 2685–2692 (2014).

  26. Riede, T., Suthers, R. A., Fletcher, N. H. & Blevins, W. E. Songbirds tune their vocal tract to the fundamental frequency of their song. Proc. Natl Acad. Sci. USA 103, 5543–5548 (2006).

  27. Drake, J. A. et al. Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat. Genet. 38, 223–227 (2006).

  28. McLean, C. Y. et al. Human-specific loss of regulatory DNA and the evolution of human-specific traits. Nature 471, 216–219 (2011).

  29. Mank, J. E., Axelsson, E. & Ellegren, H. Fast-X on the Z: rapid evolution of sex-linked genes in birds. Genome Res. 17, 618–624 (2007).

  30. Axelsson, E., Webster, M. T., Smith, N. G. C., Burt, D. W. & Ellegren, H. Comparison of the chicken and turkey genomes reveals a higher rate of nucleotide divergence on microchromosomes than macrochromosomes. Genome Res. 15, 120–125 (2005).

  31. Haeussler, M. et al. The UCSC Genome Browser database: 2019 update. Nucleic Acids Res. 47, D853–D858 (2019).

  32. Cooper, G. M., Brudno, M., Green, E. D., Batzoglou, S. & Sidow, A. Quantitative estimates of sequence divergence for comparative analyses of mammalian genomes. Genome Res. 13, 813–820 (2003).

  33. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).

  34. Gelabert, P. et al. Evolutionary history, genomic adaptation to toxic diet, and extinction of the Carolina parakeet. Curr. Biol. 30, 108–114.e5 (2020).

  35. Feng, S. et al. The genomic footprints of the fall and recovery of the crested ibis. Curr. Biol. 29, 340–349.e7 (2019).

  36. Brown, J. W., Wang, N. & Smith, S. A. The development of scientific consensus: analyzing conflict and concordance among avian phylogenies. Mol. Phylogenet. Evol. 116, 69–77 (2017).

  37. Luo, R. et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. Gigascience 1, 18 (2012).

  38. Gnerre, S. et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proc. Natl Acad. Sci. USA 108, 1513–1518 (2011).

  39. Simão, F. A., Waterhouse, R. M., Ioannidis, P., Kriventseva, E. V. & Zdobnov, E. M. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31, 3210–3212 (2015).

  40. Dierckxsens, N., Mardulyn, P. & Smits, G. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 45, e18 (2017).

  41. Meng, G., Li, Y., Yang, C. & Liu, S. MitoZ: a toolkit for animal mitochondrial genome assembly, annotation and visualization. Nucleic Acids Res. 47, e63 (2019).

  42. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573–580 (1999).

  43. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0. http://www.repeatmasker.org/ (2013–2015).

  44. Smit, A. F. A. & Hubley, R. RepeatModeler Open-1.0. http://www.repeatmasker.org/RepeatModeler/ (2008–2015).

  45. Revell, L. J. phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol. Evol. 3, 217–223 (2012).

  46. Faircloth, B. C. et al. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Syst. Biol. 61, 717–726 (2012).

  47. Faircloth, B. C. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics 32, 786–788 (2016).

  48. Kozlov, A. M., Aberer, A. J. & Stamatakis, A. ExaML version 3: a tool for phylogenomic analyses on supercomputers. Bioinformatics 31, 2577–2579 (2015).

  49. Fitch, W. M. Distinguishing homologous from analogous proteins. Syst. Zool. 19, 99–113 (1970).

  50. Fitch, W. M. Homology: a personal view on some of the problems. Trends Genet. 16, 227–231 (2000).

  51. Dewey, C. N. Positional orthology: putting genomic evolutionary relationships into context. Brief. Bioinform. 12, 401–412 (2011).

  52. Fernández, R., Gabaldon, T. & Dessimoz, C. in Phylogenetics in the Genomic Era (eds. Scornavacca, C. et al.) 2.4:1–2.4:14 (2020).

  53. Jolliffe, I. T. & Greenacre, M. J. Theory and applications of correspondence analysis. Biometrics 42, 223 (1986).

  54. Wright, F. The ‘effective number of codons’ used in a gene. Gene 87, 23–29 (1990).

  55. Bao, W., Kojima, K. K. & Kohany, O. Repbase update, a database of repetitive elements in eukaryotic genomes. Mob. DNA 6, 11 (2015).

  56. Hubisz, M. J., Pollard, K. S. & Siepel, A. PHAST and RPHAST: phylogenetic analysis with space/time models. Brief. Bioinform. 12, 41–51 (2011).

  57. Charlesworth, B., Coyne, J. A. & Barton, N. H. The relative rates of evolution of sex chromosomes and autosomes. Am. Nat. 130, 113–146 (1987).

  58. Pollard, K. S., Hubisz, M. J., Rosenbloom, K. R. & Siepel, A. Detection of nonneutral substitution rates on mammalian phylogenies. Genome Res. 20, 110–121 (2010).

  59. Zerbino, D. R., Johnson, N., Juettemann, T., Wilder, S. P. & Flicek, P. WiggleTools: parallel processing of large collections of genome-wide datasets for visualization and statistical analysis. Bioinformatics 30, 1008–1009 (2014).

  60. Fang, S. et al. NONCODEV5: a comprehensive annotation database for long non-coding RNAs. Nucleic Acids Res. 46, D308–D314 (2018).

  61. Fornes, O. et al. JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Res. 48, D87–D92 (2020).

  62. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

  63. R Core Team. R: a language and environment for statistical computing. http://www.R-project.org/ (R Foundation for Statistical Computing, 2013).

  64. Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).

  65. Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).

Acknowledgements

The B10K Project would not be possible without the efforts of field collectors, curators and staff at the institutions listed in Supplementary Table 1. We thank J. Klicka (Burke Museum), J. B. Kristensen (Natural History Museum of Denmark), A. T. Peterson (Biodiversity Institute of the University of Kansas), M. B. Robbins (Biodiversity Institute of the University of Kansas), F. Robertson (University of Otago), T. King (University of Otago), K. C. Rowe (Museums Victoria), K. Winker (University of Alaska Museum) and the late A. Baker (Royal Ontario Museum) for providing tissue samples; B. J. Novak for sample coordination; Dovetail Genomics for the assembly of Caloenas nicobarica; T. Riede for helpful discussions of the mechanism and evolution of the vocal tract filter in songbirds; and China National Genebank at BGI for contributing to the sequencing for the B10K Project. The final version of the manuscript was approved by H. G. Spencer (University of Otago), in place of the late I.G.J. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB31020000), the International Partnership Program of the Chinese Academy of Sciences (no. 152453KYSB20170002), the Carlsberg Foundation (CF16-0663) and the Villum Foundation (no. 25900) to G.Z. This work was also supported in part by National Natural Science Foundation of China no. 31901214 to S.F., ERC Consolidator Grant 681396 to M.T.P.G., Howard Hughes Medical Institute funds to E.D.J. and the National Institutes of Health (award numbers 5U54HG007990, 5T32HG008345-04, 1U01HL137183, R01HG010053, U01HL137183 and U54HG007990) to B. Paten. Supercomputing was partially performed using the DeiC National Life Science Supercomputer, Computerome, at the Technical University of Denmark. Portions of this research were also conducted with high-performance computing resources provided by Louisiana State University (http://www.hpc.lsu.edu). Parts of this work and its text were included in J.A.’s PhD thesis (ref. 18).

Author information

Contributions

C.R., M.T.P.G., G.R.G., F.L., E.D.J. and G.Z. initiated the B10K Project. S.F., J.S., Y.D., J.A., B. Paten and G.Z. conceived the current study. S.F., J.S., Y.D., A.H.R., G. Chen, C.G., J.T.H., G.P., E.C., J. Fjeldså, P.A.H., R.T.B., L.C., M.F.B., D.T.T., B.C.R., G.S., G.B., S.C., I.J.L., S.J.C., P.N., J.P.D., O.A.R., J. Fuchs, M.B., J.C., G.M., S.J.H., P.G.R., K.A.J., I.G.J., F.L., C.R., M.T.P.G., G.R.G., E.D.J. and G.Z. coordinated samples, including collection, shipping and permits. S.F., J.S., Y.D., Q.F., B.C.F., J.T.H., C.P., G.P., E.C., M.-H.S.S., Â.M.R., L.P., G.S., S.J.C., D.W.B., J.C., Q.L., H.Y., J.W., F.L., M.T.P.G., E.D.J. and G.Z. were involved in DNA extraction, sequencing or barcode confirmation. S.F., Y.D., B. Petersen, T.S.-P., Z.W. and Q.Z. performed the genome assemblies. S.F., J.S., Y.D., W.C., S.A.-S. and A.M. performed the mitochondrial genome assemblies and annotation. B.C.F., J.T.H., E.C., Â.M.R., R.T.B., D.T.T., I.J.L., A.S., M.S., P.B.F., B.H., H.S., S.P., H.v.d.Z., R.v.d.S., C.V., C.N.B., A.G.C., J.W.F., R.B., N.C., A. Cloutier, T.B.S., S.V.E., D.J.F., S.B.S., F.H.S., A.V., A.E.R.S., B.S., J.G.-S., J.F.-O., J.R., M.R., A.T., V.F., L.D., A.O.U., T.S., Y.L., M.G.C., A. Corvelo, R.C.F., K.M.R., N.J.G., N.D., H.M., N.T., K.D., M.L., A.F., M.P.H., O.K., A.M.F., B.M., E.D.K., A.E.F., G.F., Á.M.P.-M., P.F.B., M.P.C., N.C.B.L., F.P., T.L.P., B.A.S., B.A.L., J.G.B., H.C.L., L.B.D., M.J.F., M.W.B., M.J.B., M.W., R.B.D., T.B.R., G. Camenisch, L.F.K., J.M.D.C., M.E.H., M.I.M.L., C.C.W., J.A.M.G., J.M., L.C.M., M.D.C., B.W., S.A.T., G.D.-R., A.A., A.T.R.V., C.V.M., J.T.W., M.T.P.G. and E.D.J. supplied genome assemblies for additional species. S.F., J.S., Y.D., J.A., Q.F., D.X., G. Chen, B.C.F., L.E., D.W.B., R.R.d.F., E.L.B., P.H., S.M., A.S., D.H., M.T.P.G., E.D.J., B. Paten and G.Z. developed and improved annotation and orthologue identification pipelines, and analysed orthologues. S.F., J.S., Y.D., J.A., Q.F., D.X., B.C.F., M.D., D.H., B. Paten and G.Z. produced and analysed whole-genome alignments. J. Fjeldså illustrated the birds in Fig. 1. S.F., J.S., Y.D., J.A., Q.F., B. Paten and G.Z. wrote the manuscript, with input from all authors.

Corresponding authors

Correspondence to Benedict Paten or Guojie Zhang.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature thanks Javier Herrero, Sushma Reddy and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Sampling and processing of the 363 genomes.

a, Sources of the 363 genomes. Each genome is a square; colour indicates the data source. Newly published genomes from the B10K Project phase II are red; unpublished genomes contributed by external labs are yellow; published genomes from phase I are orange; genomes contributed by the community that have since been published are dark blue; and other genomes available on NCBI are light blue. b, Map (ref. 63) of the geographical origin of the 281 bird samples for which geographical coordinates are available. c, Summary of the species confirmation of 236 B10K Project newly sequenced species. The downward arrows are excluded genomes. d, Summary of mitochondrial genome assembly and annotation for 336 species. The downward arrows are excluded mitochondrial genomes.

Extended Data Fig. 2 Distribution of transposable elements.

a, Percentage of the genome that is a transposable element (TE). Box plots are shown for groups with at least three sequenced species. b, Per cent base pairs of the genome that are long interspersed nuclear elements (LINEs), grouped by orders. Box plots are shown for groups with at least three sequenced species. c, S.d. of the transposable element content for orders with at least three sequenced species. d, S.d. of the per cent LINE content for orders with at least three sequenced species. e, Ancestral state reconstruction of total transposable elements. The branch colour from blue to red indicates an increase in transposable elements. Two orders with noticeable patterns—Piciformes and Bucerotiformes—are labelled on the tree. A zoomable figure with labels for all terminals is available at www.doi.org/10.17632/fnpwzj37gw.

Extended Data Fig. 3 Patterns of the presence and absence of 5 visual opsins in 363 bird species.

This figure shows patterns for the visual opsins encoded by RH1, RH2, OPN1sw1, OPN1sw2 and OPN1lw. Colours correspond to five annotated states of opsin sequences. A zoomable figure with labels for all terminals is available at www.doi.org/10.17632/fnpwzj37gw.

Extended Data Fig. 4 GC content and codon use.

a, Principal component analysis (PCA) of GC content in the coding regions of orthologues with conserved synteny with chicken for 340 bird species, including 164 Passeriformes species. b, Correspondence analysis of relative synonymous codon usage (RSCU) for all 363 birds. The primary and secondary axes account for 78.18% and 14.82% of the total variation, respectively. c, The distribution of codons on the same two axes as shown in b, with each codon coloured according to its ending nucleotide. This shows that the axis-1 score of a species is primarily determined by differences in frequencies of codons ending in G, C, A or T. d, RSCU analysis of 59 codons across avian genomes (n = 363 biologically independent species for each box plot). The horizontal lines indicate thresholds of under-represented codons (<0.6, blue box plots), average representation (1.0, white box plots) and over-represented codons (>1.6, orange box plots). e, Pearson correlation between GC content of the third codon position and the primary axis in b, colour-coded to distinguish Passeriformes and non-Passeriformes. The strong correlation (R² = 0.9, P = 4.1 × 10⁻¹⁸⁴) indicates that the frequency of codons ending in G or C is the main driver of the codon bias in Passeriformes. f, Comparison of the mean Nc values between the Passeriformes and other species for orthologues with conserved synteny with chicken (Supplementary Table 12). Each dot represents the mean Nc value of an orthologue in the Passeriformes and other species, respectively. Orthologues with at least 20 individuals in both the Passeriformes and the non-Passeriformes were included in this analysis.
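To make the RSCU and third-position GC quantities in panels d and e concrete, the sketch below computes both for a single coding sequence. It is an illustration under stated assumptions rather than the pipeline used in the study: it uses the standard nuclear genetic code, the usual RSCU definition (observed codon count divided by the mean count over the synonymous family), and hypothetical function names (`rscu`, `gc3`, `classify`). The 0.6 and 1.6 cut-offs are the thresholds named in the caption.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Synonymous families without stop codons, Met and Trp: 59 codons in total.
FAMILIES = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa not in "*MW":
        FAMILIES[aa].append(codon)

def split_codons(cds: str) -> list[str]:
    """Split an in-frame coding sequence into complete codons."""
    cds = cds.upper()
    return [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]

def rscu(cds: str) -> dict[str, float]:
    """RSCU per codon: count * family size / total count in the family."""
    counts = Counter(split_codons(cds))
    values = {}
    for codons in FAMILIES.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values

def gc3(cds: str) -> float:
    """Fraction of codons whose third position is G or C."""
    codons = split_codons(cds)
    return sum(c[2] in "GC" for c in codons) / len(codons)

def classify(value: float, low: float = 0.6, high: float = 1.6) -> str:
    """Label an RSCU value using the thresholds given in the caption."""
    if value < low:
        return "under-represented"
    if value > high:
        return "over-represented"
    return "average"

# Toy sequence for illustration only (3x GCC, 1x GCT, 2x AAG, 1x AAA).
toy = "GCCGCCGCCGCTAAGAAGAAA"
values = rscu(toy)
print({c: (round(v, 2), classify(v)) for c, v in values.items()})
print(round(gc3(toy), 3))
```

In practice the counts would be pooled over all coding sequences of a species before computing RSCU; a single short sequence is used here only to keep the example small.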

Extended Data Fig. 5 Overview of the pipelines for identifying genomic regions.

a, Assignment of orthologous protein-coding regions. All pairwise relationships between homologous regions obtained from the Cactus alignment (4 species shown here in different colours) were used to construct the homologous groups across all 363 birds. Using chicken as the reference, we further generated a table containing homologues with conserved synteny to chicken. b, Annotation of conserved orthologous intron regions on the basis of Cactus whole-genome alignments. The credible intron fragments in chicken were picked out after filtering out regions mapped by RNA sequences, and chicken-specific or repetitive regions. Orthologous relationships of intron fragments were detected on the basis of the aligned Cactus hits and the orthologues with conserved synteny with chicken. The non-intron regions of each bird in the alignments were masked as gaps.
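One way to picture how pairwise homology relationships can be merged into groups is single-linkage clustering with a union-find structure, sketched below. This is an assumption made for illustration: the caption does not state which grouping algorithm the pipeline uses, and the function name `homologous_groups`, the species names and the gene identifiers are all hypothetical.

```python
def homologous_groups(pairs):
    """Merge pairwise homology links into groups (connected components).

    `pairs` is an iterable of ((species, gene), (species, gene)) tuples.
    Returns a sorted list of sorted member lists.
    """
    parent = {}

    def find(x):
        # Register unseen nodes, then walk to the root with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(members) for members in groups.values())

# Toy pairwise relationships; identifiers are made up for illustration.
toy_pairs = [
    (("chicken", "g1"), ("zebra_finch", "g7")),
    (("zebra_finch", "g7"), ("emu", "g3")),
    (("chicken", "g2"), ("emu", "g9")),
]
groups = homologous_groups(toy_pairs)
for group in groups:
    print(group)
```

A reference-based table such as the one described for chicken could then be derived by keeping only groups that contain a chicken member, subject to whatever synteny filter the pipeline applies.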

Extended Data Fig. 6 Gene tree for copies of the growth hormone gene GH.

The tree was generated by maximum likelihood phylogenetic analysis (ref. 64) of avian GH gene copies. Only nodes with bootstrap support >80 are marked with dots; the larger the dot, the higher the bootstrap support. All Passeriformes sequences are clustered in a single clade and there are two sister gene clades within Passeriformes, corresponding to the GH_S gene copy (blue) and the GH_L gene copy (orange). Twelve species with only one copy are indicated by green stars. A zoomable figure with labels for all terminals and the tree file is available at www.doi.org/10.17632/fnpwzj37gw.

Extended Data Fig. 7 Identification of lineage-specific sequences.

a, An example of a 36-bp insertion (red) identified by Cactus in the southern cassowary (Casuarius casuarius) compared to the Okarito brown kiwi (Apteryx rowi) (both in Palaeognathae) with mapped sequence reads shown as lines. b, Proportion of lineage-specific sequence for each order correlated with the distance from parent node to MRCA node (branch length). c, Presence and absence of the DNAJC15-like gene (DNAJC15L), and its surrounding genes, in all 363 birds. Upstream: KLHL1 and DACH1; downstream: MZT1, BORA, RRP44, PIBF1 and KLF5. The state is shown for each bird in three ways: multiple copies (filled shapes), one copy (empty shapes) and no gene (blank). Passeriformes are highlighted in red. A zoomable figure with labels for all terminals is available at www.doi.org/10.17632/fnpwzj37gw. d, Exon fusion patterns of the DNAJC15-like gene (DNAJC15L) in three Passeriformes, compared to exon structure of the ancestral DNAJC15. For L. aspasia, gene models for the ancestral and novel copy are shown. The structure of the ancestral copy is highly conserved across all bird species with five introns. The Passeriformes-specific copy has no intron or newly derived minor intron and includes a poly-(A) at the 5′ end, which implies that this new gene was derived from retroduplication of DNAJC15.

Extended Data Fig. 8 The evolution of songbirds was associated with the loss of the cornulin gene.

a, Presence and absence of the cornulin gene (CRNN) and its surrounding genes (EDDM and S100A11) in all 363 birds. Branches are coloured as oscine Passeriformes (blue), non-oscine Passeriformes (green) and non-Passeriformes (black). The states of genes are shown in three ways: functional gene (filled box), pseudogene (empty box) and gene not found (blank). Genes were identified by Exonerate (ref. 65) using phylogenetically diverse EDDM, CRNN and S100A11 sequences as queries. A zoomable figure with labels for all terminals is available at www.doi.org/10.17632/fnpwzj37gw. b, Hypothesis on the evolutionary loss of cornulin and the appearance of a fine-tuned extensibility of the oesophagus as a vocal tract filter in songbirds.

Extended Data Fig. 9 Acceleration and conservation scores.

Results are shown from 3 alignments for 53 birds, 77 vertebrates, and 363 birds. a, Acceleration (left) and conservation (right) within alignment columns on chicken. This panel is similar to Fig. 3a, but includes accelerated columns. b, Proportion of chicken functional regions covered by significantly accelerated or conserved sites. This panel is similar to Fig. 3c, but includes accelerated columns.

Extended Data Fig. 10 Distribution of acceleration and conservation scores.

a, Distribution of conservation and acceleration scores within different functional region types across alignments. Lines mark quartiles of the density estimates. b, Larger histogram of chicken column rates. This panel is similar to Fig. 3b, but includes accelerated columns ending at a rate of 10× the neutral rate. c, Difference in PhyloP scores (compared to original scores) after realignment with MAFFT for a random sample of significantly conserved sites. d, Comparison of the distribution of PhyloP scores across alignments. Scores indicate log-scaled probabilities of conservation (positive values) or acceleration (negative values) for each base in the genome. a and d show results from three alignments for 53 birds, 77 vertebrates and 363 birds.
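Because the scores in panel d are log-scaled probabilities with a sign, a short sketch may help show how a signed score maps back to a P value and a direction. The conversion below assumes the common PhyloP convention that the absolute score equals −log10(P). The Benjamini–Hochberg step is included only as an illustrative way to call significant sites; how significance was actually determined is described in the paper's methods, not in this caption, and the function names and toy scores are hypothetical.

```python
def phylop_to_pvalue(score: float) -> float:
    """Convert a signed PhyloP-style score to a P value (|score| = -log10 P)."""
    return 10.0 ** (-abs(score))

def direction(score: float) -> str:
    """Positive scores indicate conservation, negative scores acceleration."""
    return "conserved" if score > 0 else "accelerated"

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans: True where the BH procedure rejects."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Largest rank k with p_(k) <= (k / m) * alpha; reject ranks 1..k.
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * alpha:
            max_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[i] = True
    return rejected

# Toy per-base scores for illustration only.
scores = [3.0, 0.2, -2.5, 1.0, -0.1]
pvalues = [phylop_to_pvalue(s) for s in scores]
significant = benjamini_hochberg(pvalues, alpha=0.05)
calls = [direction(s) if sig else "not significant"
         for s, sig in zip(scores, significant)]
print(calls)
```

With genome-scale data the same logic would normally be applied with vectorised array operations rather than Python loops.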

Supplementary information

Supplementary Information

This file contains Supplementary Notes, Supplementary Methods and Supplementary Results regarding species selection, genome sequencing, assembly, annotation and orthologue identification, and whole-genome alignment. It also contains legends for Supplementary Tables 1–15. An interactive supplementary figure is available at https://genome-b10k.herokuapp.com/main. It is an interactive plot of assembly and annotation statistics for all 363 bird genomes; data can be shown by species, by taxonomy or by the source of the genome sequence. This figure visualises data from Supplementary Table 1.

Reporting Summary

Supplementary Data

The tree file in newick format for all 10,135 species of birds. The tree was pruned from the synthesis tree by excluding all subspecies, operational taxonomic units and unaccepted species as described in the Supplementary Information. Also available on Mendeley Data (doi:10.17632/fnpwzj37gw).

Supplementary Tables

This file contains Supplementary Tables 1-15 – see Supplementary Information document for legends. Also available on Mendeley Data (doi:10.17632/fnpwzj37gw). Sample information for each genome and genome statistics (Supplementary Table 1) can also be viewed online at https://b10k.scifeon.cloud/.

Cite this article

Feng, S., Stiller, J., Deng, Y. et al. Dense sampling of bird diversity increases power of comparative genomics. Nature 587, 252–257 (2020). https://doi.org/10.1038/s41586-020-2873-9
