Endosymbiotic origin and differential loss of eukaryotic genes

Journal name: Nature
Volume: 524
Pages: 427–432
DOI: 10.1038/nature14963

Abstract

Chloroplasts arose from cyanobacteria, mitochondria arose from proteobacteria. Both organelles have conserved their prokaryotic biochemistry, but their genomes are reduced, and most organelle proteins are encoded in the nucleus. Endosymbiotic theory posits that bacterial genes in eukaryotic genomes entered the eukaryotic lineage via organelle ancestors. It predicts episodic influx of prokaryotic genes into the eukaryotic lineage, with acquisition corresponding to endosymbiotic events. Eukaryotic genome sequences, however, increasingly implicate lateral gene transfer, both from prokaryotes to eukaryotes and among eukaryotes, as a source of gene content variation in eukaryotic genomes, which predicts continuous, lineage-specific acquisition of prokaryotic genes in divergent eukaryotic groups. Here we discriminate between these two alternatives by clustering and phylogenetic analysis of eukaryotic gene families having prokaryotic homologues. Our results indicate (1) that gene transfer from bacteria to eukaryotes is episodic, as revealed by gene distributions, and coincides with major evolutionary transitions at the origin of chloroplasts and mitochondria; (2) that gene inheritance in eukaryotes is vertical, as revealed by extensive topological comparison, sparse gene distributions stemming from differential loss; and (3) that continuous, lineage-specific lateral gene transfer, although it sometimes occurs, does not contribute to long-term gene content evolution in eukaryotic genomes.

At a glance

Figures

  1. Figure 1: Distribution of taxa in EPCs.

    Each black tick indicates gene presence in a taxon. The 2,585 EPCs (x axis) are ordered first according to their distribution across six eukaryotic supergroups with clusters specific to lineages with photosynthetic eukaryotes (blocks A–C) on the left, then according to the number of supergroups within which the clusters occur. Clusters most densely distributed in archaea among prokaryotes (block D) and others (block E) are indicated. Lower-case letters label clusters whose distribution is suggestive of recent lineage-specific acquisitions. The numbers of protein sequences and EPCs per genome are shown on the right. Taxon abbreviations are given in Supplementary Tables 1 and 3.

  2. Figure 2: Occurrence in the sister group versus proteome size.

    Prokaryotic taxa are plotted according to how frequently they are found in the sister group (defined as the nearest neighbour group) to a monophyletic group of eukaryotes in 1,933 trees against their proteome size. A two-sided Wilcoxon signed-rank test compares these frequencies with those generated by randomly selecting prokaryotic operational taxonomic units (OTUs) into the sister group (100 replicates). Upward and downward arrows indicate higher and lower frequencies in the real data set than in the randomized version, respectively. The test was adjusted for multiple comparisons. For complete statistics, see Supplementary Table 8.
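
As a rough illustration of the randomization described in this caption, the following Python sketch draws random sister groups of the observed size from the prokaryotic OTUs of each tree and tallies how often each taxon is drawn. The input format, the function name `randomized_sister_counts`, and the toy data are assumptions made for illustration and are not taken from the paper; the subsequent Wilcoxon signed-rank test and the adjustment for multiple comparisons are not shown.

```python
import random
from collections import Counter


def randomized_sister_counts(trees, n_replicates=100, seed=0):
    """Tally random sister-group membership per taxon.

    trees: list of (prokaryote_otus, sister_group_size) pairs, one per
    gene tree (a hypothetical input format chosen for this sketch).
    Returns one Counter per replicate mapping taxon -> number of trees
    in which that taxon was drawn into the random sister group.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_replicates):
        counts = Counter()
        for otus, size in trees:
            # Draw `size` distinct OTUs without replacement from this tree.
            counts.update(rng.sample(sorted(otus), size))
        replicates.append(counts)
    return replicates


# Toy data (invented): three trees with sister groups of size 2, 1 and 3.
toy_trees = [
    (["A", "B", "C", "D"], 2),
    (["A", "B"], 1),
    (["B", "C", "D"], 3),
]
toy_replicates = randomized_sister_counts(toy_trees, n_replicates=5, seed=1)
```

Each replicate can then be compared with the observed per-taxon frequencies; fixing the seed keeps the toy run reproducible.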

  3. Figure 3: Comparison of sets of trees for single-copy genes in eukaryotic groups.

    Cumulative distribution functions (y axis) for scores of minimal tree compatibility with the vertical reference data set (x axis). Values are number of species, sample sizes, and P values of the two-tailed Kolmogorov–Smirnov two-sample goodness-of-fit test in the comparison of the ESC (blue) data sets against the EPC (green) data set and a synthetic data set simulating one LGT (red). Dashed lines delineate the range of distributions in 100 replicates of random down-sampling. See also Extended Data Fig. 7.
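
The two-sample Kolmogorov–Smirnov comparison named in this caption rests on the largest vertical gap between two empirical cumulative distribution functions. A minimal Python sketch of that statistic follows; the function names and the toy scores are hypothetical, and the calculation of P values is omitted.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return the empirical cumulative distribution function of a sample."""
    xs = sorted(sample)
    n = len(xs)
    # Fraction of observations less than or equal to x.
    return lambda x: bisect_right(xs, x) / n


def ks_statistic(a, b):
    """Largest absolute difference between the two empirical CDFs.

    Both CDFs are step functions that change only at observed values,
    so it is enough to evaluate the gap at every pooled data point.
    """
    fa, fb = ecdf(a), ecdf(b)
    return max(abs(fa(x) - fb(x)) for x in set(a) | set(b))


# Toy compatibility scores (invented for illustration).
gap = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

Identical samples give a statistic of 0 and completely separated samples give 1; the toy samples above overlap in half of their values.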

  4. Figure 4: Eukaryote–prokaryote sequence identities for genes with a tip distribution in eukaryotes versus those whose distributions trace their presence to a more ancient ancestor.

    a–e, Genes denoted by lower-case letters in Fig. 1 and those found in at least three of five major supergroups. The mean of the average pairwise identities is shown in parentheses. At P = 0.05, a two-sided Wilcoxon rank-sum test either did not reject the null hypotheses that the two sets of genes are not different (a, c) or suggested the tip-specific eukaryotic genes are less similar to their prokaryotic homologues (b, d, e). See also Extended Data Fig. 9.

  5. Extended Data Fig. 1: Additional gene distribution patterns.

    a, Distribution of ESCs. Each black tick indicates the presence of a cluster in a taxon. The 26,117 ESCs (x axis) from 55 eukaryotic genomes (Supplementary Table 1) are sorted according to their distribution across the six eukaryotic supergroups. b, Distribution of taxa in EPCs and monophyly of eukaryotes. Each black tick indicates the presence of a cluster in a taxon. The 2,585 EPCs (x axis) are separated into three sets according to the monophyly of eukaryotes and the results of the AUT and, within each set, are ordered according to their distribution across the six eukaryotic supergroups. Clusters where eukaryotes were resolved as non-monophyletic in the maximum likelihood tree tend to occur more frequently in bacterial taxa. Archaep., Archaeplastida; Opisth., Opisthokonta; Chl., Chloroplastida; Rho., Rhodophyta; Gla., Glaucophyta; Str., Stramenopila; De., Deinococcus-Thermus; oP., other Proteobacteria; Ch., Chlamydiae; Pl., Planctomycetes; Ve., Verrucomicrobia; Spi., Spirochaetae; The., Thermotogae; oB., other Bacteria. For abbreviations of eukaryotes, see Supplementary Table 1.

  6. Extended Data Fig. 2: Clustering, monophyly, and gene sharing.

    a, b, Monophyly of eukaryotes in maximum likelihood trees, cluster size, and alignment quality. Cumulative frequency of clusters with different cluster size (a) or different HoT (ref. 72) column scores (b) is plotted for three sets of EPCs that differ in terms of the monophyly of eukaryotes in the maximum likelihood trees (monophyletic: resolved as monophyletic in the original tree; passed AUT: resolved as non-monophyletic in the original tree, but at least one alternative tree with eukaryote monophyly (see Methods) was as likely at P = 0.05 in an AUT; failed AUT: alternative trees were not as likely as the original tree where eukaryotes were resolved as non-monophyletic). One-sided Kolmogorov–Smirnov two-sample goodness-of-fit test (cluster size/HoT column scores): monophyletic versus passed AUT, 1.04 × 10^−13/7.9 × 10^−3; monophyletic versus failed AUT, 1.45 × 10^−61/2.04 × 10^−10; passed AUT versus failed AUT, 3.40 × 10^−13/4.00 × 10^−3. c, d, Prokaryotic monophyly and gene sharing. c, Proportion of trees showing monophyly for taxonomic group. Prokaryotic phyla and classes (Supplementary Tables 3 and 4) that are monophyletic in the reference trees and that have at least five taxa (genomes in archaea or species in bacteria) are plotted according to the number of taxa and the proportion of EPC trees with at least two sequences from a prokaryotic group where it forms a monophyletic group. The proportion of eukaryote monophyly trees is higher than that of any prokaryotic group, including those with many fewer taxa. d, Gene sharing between a prokaryotic group and other prokaryotes. Using the same procedure for the generation of EPCs, 55 genomes were randomly sampled from a group of bacteria and the number of clusters (EPCs) they shared with prokaryotes not from this group was counted. The average number of shared clusters was mapped for each taxonomic group with 55–150 genomes (error bar, s.d.; number of genomes in parentheses). For E. coli and the eukaryotes (shown for comparison), there was only one sample. Colour coding for taxonomic levels: red, phylum; blue, class; green, order; magenta, family; cyan, genus; orange, species.

  7. Extended Data Fig. 3: Effect of taxon sampling on eukaryote monophyly in phylogenetic trees.

    After ten sequences (bold) were added to the original data set (EPC E1689_B206_A295), the relationships among Archaeplastida taxa (highlighted in green) changed from non-monophyly (a) to monophyly (b). Abbreviations are shown for eukaryotic sequences (Supplementary Table 2) and NCBI GI numbers for cyanobacterial sequences (Supplementary Table 3; RefSeq accessions are shown for the added sequences).

  8. Extended Data Fig. 4: Distribution of prokaryotic taxa in the sister group to eukaryotes, with EPCs sorted by eukaryotic supergroups.

    Top: each black tick indicates the presence of a eukaryote taxon in one of the 2,585 EPCs. Bottom: each red tick indicates the presence of a prokaryote taxon in the sister group to eukaryotes in one of the 1,933 EPC maximum likelihood trees where eukaryotes were resolved to be monophyletic. The 2,585 EPCs, proteome size, and cluster size are as in Fig. 1. The number of EPCs present and the frequency of occurrence in the sister group to eukaryotes (‘clusters’) are shown for eukaryotes and prokaryotes, respectively. Archaep., Archaeplastida; Opisth., Opisthokonta; Chl., Chloroplastida; Rho., Rhodophyta; Gla., Glaucophyta; Str., Stramenopila; De., Deinococcus-Thermus; oP., other Proteobacteria; Ch., Chlamydiae; Pl., Planctomycetes; Ve., Verrucomicrobia; Spi., Spirochaetae; The., Thermotogae; oB., other Bacteria. For abbreviations of eukaryotes, see Supplementary Table 1.

  9. Extended Data Fig. 5: Distribution of prokaryotic taxa in the sister group to eukaryotes, with EPCs sorted by prokaryotic groups.

    Top: each black tick indicates the presence of a eukaryote taxon in one of the 1,933 EPC maximum likelihood trees where eukaryotes were resolved to be monophyletic. Bottom: each red tick indicates the presence of a prokaryote taxon in the sister group to eukaryotes in one of those 1,933 EPC trees. The EPCs (x axis) are ordered according to the taxonomic groups to which the prokaryotes in the sister group to eukaryotes belong (separated into three blocks where only bacteria (1,586 EPCs), only archaea (314 EPCs), or both bacteria and archaea (33 EPCs) are found in the sister group). There are 16 bacterial groups (including ‘other Bacteria’; Firmicutes, Proteobacteria, and the PVC superphylum (Planctomycetes, Verrucomicrobia, and Chlamydiae) are regarded as single groups) and five archaeal groups (the five phyla). The number of EPCs present and the frequency of occurrence in the sister group to eukaryotes are shown for eukaryotes and prokaryotes, respectively. Archaep., Archaeplastida; Opisth., Opisthokonta; Chl., Chloroplastida; Rho., Rhodophyta; Gla., Glaucophyta; Str., Stramenopila; De., Deinococcus-Thermus; oP., other Proteobacteria; Ch., Chlamydiae; Pl., Planctomycetes; Ve., Verrucomicrobia; Spi., Spirochaetae; The., Thermotogae; oB., other Bacteria. For abbreviations of eukaryotes, see Supplementary Table 1.

  10. Extended Data Fig. 6: Distribution of taxa in the sister groups consisting purely of cyanobacteria, alphaproteobacteria, or archaea.

    Each black tick indicates the presence of a prokaryotic taxon in the sister group to eukaryotes in an EPC tree. a–c, Distributions of taxa in all pure-cyanobacterial (a), pure-alphaproteobacterial (b), and pure-archaeal (c) sister groups. The clusters are ordered alphanumerically according to the eukaryotic cluster numbers (Supplementary Table 5), whereas for archaea (c) the taxa are further sorted by the five archaeal phyla.

  11. Extended Data Fig. 7: Comparison of sets of trees for single-copy genes in eukaryotic groups, with more inclusive criteria.

    a–f, Cumulative distribution functions (y axis) for scores of minimal tree compatibility with the vertical reference data set (x axis). Values are number of species, sample sizes, and P values of the two-tailed Kolmogorov–Smirnov two-sample goodness-of-fit test in the comparison of the ESC (blue) data sets against the EPC (green) data set and a synthetic data set simulating one LGT (red). Dashed lines delineate the range of distributions in 100 replicates of random down-sampling. The criteria for tree inclusion were less stringent than those for Fig. 3 (see Methods).

  12. Extended Data Fig. 8: Overview of eukaryote gene content evolution.

    a, Eukaryotic evolution by gene loss. Genome sizes (number of EPCs present) were mapped onto the eukaryotic reference tree. Ancestral genome size in each eukaryotic ancestral node was calculated using a loss-only model, with all EPCs in blocks A–C and those in blocks D and E (Fig. 1) entering the eukaryotic lineage via the plastid ancestor (green) or the eukaryote ancestor (wheat colour). Plastid-derived genes are not shown for the ancestral nodes within SAR and Hacrobia, because of current debates about the number and nature of secondary symbioses, but are indicated by the greenish shading. b, Endosymbiotic gene transfer network. The network connecting apparent gene donors to the common ancestor of eukaryotes and Archaeplastida is mapped onto the reference phylogeny (vertical edges) of bacteria (left), eukaryotes (middle), and archaea (right). Grey shading (white to black) in the prokaryote reference trees (70 for archaea and 32 for bacteria) indicates how often a branch associated with a particular node was recovered within the trees of individual genes that were concatenated for inferring the reference topology. Lateral edges indicate gene influx at the origin of eukaryotes and at the origin of plastids. Edge colour corresponds to the frequencies with which a prokaryotic group appears in the sister group to eukaryotes. The archaeal reference tree was rooted between euryarchaeotes and other taxa, and the bacterial tree with Thermotogae. Secondary endosymbiotic transfers are indicated in light green and red. That members of both the Crenarchaeota and the Euryarchaeota are implicated as host relatives is probably because of the small archaeon sample (refs 34–36).
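
Under a loss-only model, a gene family present in any descendant must have been present in the ancestor, so the ancestral gene set at a node is the union of the sets found in its descendants. The sketch below illustrates this idea in Python under simplifying assumptions that are not from the paper: the tree is given as nested tuples, every cluster is assumed to enter at the root of that tree (the caption distinguishes entry via the plastid ancestor from entry via the eukaryote ancestor), and the taxon names and cluster numbers are invented.

```python
def ancestral_sizes(tree, presence):
    """Number of clusters at every node under a loss-only model.

    tree: a leaf name (str) or a tuple of subtrees.
    presence: dict mapping each leaf name to the set of cluster IDs
    found in that genome.
    Returns a dict keyed by node (leaf name or subtree tuple).
    """
    sizes = {}

    def visit(node):
        if isinstance(node, str):
            genes = set(presence[node])
        else:
            # With losses only, the ancestor carried every cluster
            # that survives in at least one descendant.
            genes = set().union(*(visit(child) for child in node))
        sizes[node] = len(genes)
        return genes

    visit(tree)
    return sizes


# Toy example (invented): two sister taxa plus an outgroup.
toy_tree = (("taxonA", "taxonB"), "taxonC")
toy_presence = {"taxonA": {1, 2}, "taxonB": {2, 3}, "taxonC": {3, 4}}
toy_sizes = ancestral_sizes(toy_tree, toy_presence)
```

In the toy example the ancestor of taxonA and taxonB carries three clusters and the root carries four, so every inferred ancestor is at least as large as its largest descendant.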

  13. Extended Data Fig. 9: Apparent gene transfers and eukaryote–prokaryote sequence identities.

    a, Patterns suggestive of LGT from prokaryotes inferred from EPC trees. All EPC trees were searched for phylogenetic patterns suggestive of gene acquisitions by the common ancestor of each eukaryote lineage within the six supergroups (see Methods). The size of each circle is proportional to the number of such putative acquisitions, with the total number of putative acquisitions shown for each supergroup. The colour shows the age of nodes according to a eukaryotic time tree (blue, younger than 800 million years; red, older than 800 million years). For the four lineages with an asterisk, phylogenetic patterns where SAR/Hacrobia are nested within a clade formed by Archaeplastida were also counted as putative acquisitions to take into account secondary plastid endosymbioses. The numbers of acquisitions without such patterns are indicated in parentheses (and shown as inner circles). b, Eukaryote–prokaryote sequence identities for genes apparently acquired more recently and more anciently in eukaryotes (a). The mean of the average pairwise identities is shown in parentheses. At P = 0.05, a two-sided Wilcoxon rank-sum test either did not reject the null hypotheses that the two sets of genes are not different or suggested the tip-specific eukaryotic genes are less similar to their prokaryotic homologues.

  14. Extended Data Fig. 10: Distribution of ESCs and EPCs across eukaryotes under different criteria.

    Different thresholds were applied to find eukaryote clusters with prokaryote homologues, including BLAST local identity for each eukaryote–prokaryote hit (30% or 20%) and levels of best-hit correspondence (10–50%) for identifying reciprocal pairs of eukaryote and prokaryote clusters. Distributions of ESCs and EPCs are drawn as in Extended Data Fig. 1a and Fig. 1, respectively.

References

  1. Koonin, E. V., Makarova, K. S. & Aravind, L. Horizontal gene transfer in prokaryotes: quantification and classification. Annu. Rev. Microbiol. 55, 709–742 (2001)
  2. Doolittle, W. F. Phylogenetic classification and the universal tree. Science 284, 2124–2128 (1999)
  3. Ochman, H., Lawrence, J. G. & Groisman, E. A. Lateral gene transfer and the nature of bacterial innovation. Nature 405, 299–304 (2000)
  4. Lang, A. S., Zhaxybayeva, O. & Beatty, J. T. Gene transfer agents: phage-like elements of genetic exchange. Nature Rev. Microbiol. 10, 472–482 (2012)
  5. Rasko, D. A. et al. The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. J. Bacteriol. 190, 6881–6893 (2008)
  6. Lobkovsky, A. E., Wolf, Y. I. & Koonin, E. V. Gene frequency distributions reject a neutral model of genome evolution. Genome Biol. Evol. 5, 233–242 (2013)
  7. Szathmáry, E. & Maynard Smith, J. The major evolutionary transitions. Nature 374, 227–232 (1995)
  8. Nei, M. Mutation-Driven Evolution (Oxford Univ. Press, 2013)
  9. Timmis, J. N., Ayliffe, M. A., Huang, C. Y. & Martin, W. Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nature Rev. Genet. 5, 123–135 (2004)
  10. Lane, C. E. & Archibald, J. M. The eukaryotic tree of life: endosymbiosis takes its TOL. Trends Ecol. Evol. 23, 268–275 (2008)
  11. Archibald, J. M. One plus One Equals One: Symbiosis and the Evolution of Complex Life (Oxford Univ. Press, 2014)
  12. Andersson, J. O. Lateral gene transfer in eukaryotes. Cell. Mol. Life Sci. 62, 1182–1197 (2005)
  13. Keeling, P. J. & Palmer, J. D. Horizontal gene transfer in eukaryotic evolution. Nature Rev. Genet. 9, 605–618 (2008)
  14. Price, D. C. et al. Cyanophora paradoxa genome elucidates origin of photosynthesis in algae and plants. Science 335, 843–847 (2012)
  15. Boto, L. Horizontal gene transfer in the acquisition of novel traits by metazoans. Proc. R. Soc. B 281, 20132450 (2014)
  16. Huang, J. L. Horizontal gene transfer in eukaryotes: the weak-link model. Bioessays 35, 868–875 (2013)
  17. Crisp, A., Boschetti, C., Perry, M., Tunnacliffe, A. & Micklem, G. Expression of multiple horizontally acquired genes is a hallmark of both vertebrate and invertebrate genomes. Genome Biol. 16, 50 (2015)
  18. Gould, S. B., Waller, R. R. & McFadden, G. I. Plastid evolution. Annu. Rev. Plant Biol. 59, 491–517 (2008)
  19. Curtis, B. A. et al. Algal genomes reveal evolutionary mosaicism and the fate of nucleomorphs. Nature 492, 59–65 (2012)
  20. Alsmark, C. et al. Patterns of prokaryotic lateral gene transfers affecting parasitic microbial eukaryotes. Genome Biol. 14, R19 (2013)
  21. Keeling, P. J. & Inagaki, Y. A class of eukaryotic GTPase with a punctate distribution suggesting multiple functional replacements of translation elongation factor 1α. Proc. Natl Acad. Sci. USA 101, 15380–15385 (2004)
  22. Steel, M., Penny, D. & Lockhart, P. J. Confidence in evolutionary trees from biological sequence data. Nature 364, 440–442 (1993)
  23. Lockhart, P. J. et al. A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol. Biol. Evol. 15, 1183–1188 (1998)
  24. Guo, Z. H. & Stiller, J. W. Comparative genomics and evolution of proteins associated with RNA polymerase II C-terminal domain. Mol. Biol. Evol. 22, 2166–2178 (2005)
  25. Semple, C. & Steel, M. Phylogenetics (Oxford Univ. Press, 2003)
  26. Hughes, A. L. & Friedman, R. Loss of ancestral genes in the genomic evolution of Ciona intestinalis. Evol. Dev. 7, 196–200 (2005)
  27. Müller, M. et al. Biochemistry and evolution of anaerobic energy metabolism in eukaryotes. Microbiol. Mol. Biol. Rev. 76, 444–495 (2012)
  28. Kondo, N., Nikoh, N., Ijichi, N., Shimada, M. & Fukatsu, T. Genome fragment of Wolbachia endosymbiont transferred to X chromosome of host insect. Proc. Natl Acad. Sci. USA 99, 14280–14285 (2002)
  29. Husnik, F. et al. Horizontal gene transfer from diverse bacteria to an insect genome enables a tripartite nested mealybug symbiosis. Cell 153, 1567–1578 (2013)
  30. Mi, S. et al. Syncytin is a captive retroviral envelope protein involved in human placental morphogenesis. Nature 403, 785–789 (2000)
  31. Derelle, R. et al. Bacterial proteins pinpoint a single eukaryotic root. Proc. Natl Acad. Sci. USA 112, E693–E699 (2015)
  32. Rivera, M. C., Jain, R., Moore, J. E. & Lake, J. A. Genomic evidence for two functionally distinct gene classes. Proc. Natl Acad. Sci. USA 95, 6239–6244 (1998)
  33. Lane, N. & Martin, W. The energetics of genome complexity. Nature 467, 929–934 (2010)
  34. Williams, T. A., Foster, P. G., Cox, C. J. & Embley, T. M. An archaeal origin of eukaryotes supports only two primary domains of life. Nature 504, 231–236 (2013)
  35. Guy, L., Saw, J. H. & Ettema, T. J. G. The archaeal legacy of eukaryotes: a phylogenomic perspective. Cold Spring Harb. Perspect. Biol. 6, a016022 (2014)
  36. Koonin, E. V. & Yutin, N. The dispersed archaeal eukaryome and the complex archaeal ancestor of eukaryotes. Cold Spring Harb. Perspect. Biol. 6, a016188 (2014)
  37. Cotton, J. A. & McInerney, J. O. Eukaryotic genes of archaebacterial origin are more important than the more numerous eubacterial genes, irrespective of function. Proc. Natl Acad. Sci. USA 107, 17252–17255 (2010)
  38. Moran, N. A., McCutcheon, J. P. & Nakabachi, A. Genomics and evolution of heritable bacterial symbionts. Annu. Rev. Genet. 42, 165–190 (2008)
  39. John, P. & Whatley, F. R. Paracoccus denitrificans and the evolutionary origin of the mitochondrion. Nature 254, 495–498 (1975)
  40. Koonin, E. V. & Wolf, Y. I. Genomics of bacteria and archaea: the emerging dynamic view of the prokaryotic world. Nucleic Acids Res. 36, 6688–6719 (2008)
  41. Parfrey, L. W., Lahr, D. J. G., Knoll, A. H. & Katz, L. A. Estimating the timing of early eukaryotic diversification with multigene molecular clocks. Proc. Natl Acad. Sci. USA 108, 13624–13629 (2011)
  42. Margulis, L., Dolan, M. F. & Guerrero, R. The chimeric eukaryote: origin of the nucleus from the karyomastigont in amitochondriate protists. Proc. Natl Acad. Sci. USA 97, 6954–6959 (2000)
  43. Fuerst, J. A. & Sagulenko, E. Keys to eukaryality: Planctomycetes and ancestral evolution of cellular complexity. Front. Microbiol. 3, 167 (2012)
  44. Domman, D., Horn, M., Embley, T. M. & Williams, T. A. Plastid establishment did not require a chlamydial partner. Nature Commun. 6, 6421 (2015)
  45. Hug, L. A., Stechmann, A. & Roger, A. J. Phylogenetic distributions and histories of proteins involved in anaerobic pyruvate metabolism in eukaryotes. Mol. Biol. Evol. 27, 311–324 (2010)
  46. Kleine, T., Maier, U. G. & Leister, D. DNA transfer from organelles to the nucleus: the idiosyncratic genetics of endosymbiosis. Annu. Rev. Plant Biol. 60, 115–138 (2009)
  47. Yue, J. P., Hu, X. Y., Sun, H., Yang, Y. P. & Huang, J. L. Widespread impact of horizontal gene transfer on plant colonization of land. Nature Commun. 3, 1152 (2012)
  48. Wolf, Y. I. & Koonin, E. V. Genome reduction as the dominant mode of evolution. Bioessays 35, 829–837 (2013)
  49. Hao, W. L. & Golding, G. B. The fate of laterally transferred genes: life in the fast lane to adaptation or death. Genome Res. 16, 636–643 (2006)
  50. Treangen, T. J. & Rocha, E. P. C. Horizontal transfer, not duplication, drives the expansion of protein families in prokaryotes. PLoS Genet. 7, e1001284 (2011)
  51. Nelson-Sathi, S. et al. Origins of major archaeal clades correspond to gene acquisitions from bacteria. Nature 517, 77–80 (2015)
  52. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997)
  53. Tatusov, R. L., Koonin, E. V. & Lipman, D. J. A genomic perspective on protein families. Science 278, 631–637 (1997)
  54. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000)
  55. Enright, A. J., Van Dongen, S. & Ouzounis, C. A. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575–1584 (2002)
  56. Apic, G., Gough, J. & Teichmann, S. A. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325 (2001)
  57. Powell, S. et al. eggNOG v4.0: nested orthology inference across 3686 organisms. Nucleic Acids Res. 42, D231–D239 (2014)
  58. Tatusov, R. et al. The COG database: an updated version includes eukaryotes. BMC Bioinform. 4, 41 (2003)
  59. Yoon, H. S., Muller, K. M., Sheath, R. G., Ott, F. D. & Bhattacharya, D. Defining the major lineages of red algae (Rhodophyta). J. Phycol. 42, 482–492 (2006)
  60. James, T. Y. et al. Reconstructing the early evolution of fungi using a six-gene phylogeny. Nature 443, 818–822 (2006)
  61. Okamoto, N., Chantangsi, C., Horak, A., Leander, B. S. & Keeling, P. J. Molecular phylogeny and description of the novel katablepharid Roombia truncata gen. et sp. nov., and establishment of the Hacrobia taxon nov. PLoS ONE 4, e7080 (2009)
  62. Hampl, V. et al. Phylogenomic analyses support the monophyly of Excavata and resolve relationships among eukaryotic “supergroups”. Proc. Natl Acad. Sci. USA 106, 3859–3864 (2009)
  63. Janouškovec, J., Horák, A., Oborník, M., Lukeš, J. & Keeling, P. J. A common red algal origin of the apicomplexan, dinoflagellate, and heterokont plastids. Proc. Natl Acad. Sci. USA 107, 10949–10954 (2010)
  64. Lahr, D. J. G., Grant, J., Nguyen, T., Lin, J. H. & Katz, L. A. Comprehensive phylogenetic reconstruction of Amoebozoa based on concatenated analyses of SSU-rDNA and actin genes. PLoS ONE 6, e22780 (2011)
  65. Adl, S. M. et al. The revised classification of eukaryotes. J. Eukaryot. Microbiol. 59, 429–493 (2012)
  66. Leliaert, F. et al. Phylogeny and molecular evolution of the green algae. Crit. Rev. Plant Sci. 31, 1–46 (2012)
  67. Keeling, P. J. The number, speed, and impact of plastid endosymbioses in eukaryotic evolution. Annu. Rev. Plant Biol. 64, 583–607 (2013)
  68. Jackson, C. J. & Reyes-Prieto, A. The mitochondrial genomes of the glaucophytes Gloeochaete wittrockiana and Cyanoptyche gloeocystis: multilocus phylogenetics suggests a monophyletic Archaeplastida. Genome Biol. Evol. 6, 2774–2785 (2014)
  69. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013)
  70. Stamatakis, A. RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22, 2688–2690 (2006)
  71. Yutin, N. & Galperin, M. Y. A genomic update on clostridial phylogeny: Gram-negative spore formers and other misplaced clostridia. Environ. Microbiol. 15, 2631–2641 (2013)
  72. Landan, G. & Graur, D. Heads or tails: a simple reliability check for multiple sequence alignments. Mol. Biol. Evol. 24, 1380–1383 (2007)
  73. Landan, G. & Graur, D. Local reliability measures from sets of co-optimal multiple sequence alignments. Pacif. Symp. Biocomput. 13, 15–24 (2008)
  74. Shimodaira, H. & Hasegawa, M. CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics 17, 1246–1247 (2001)
  75. Robinson, D. F. & Foulds, L. R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981)
  76. Felsenstein, J. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266, 418–427 (1996)
  77. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010)
  78. Whelan, S. & Goldman, N. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol. Biol. Evol. 18, 691–699 (2001)
  79. Ku, C. et al. Endosymbiotic gene transfer from prokaryotic pangenomes: inherited chimerism in eukaryotes. Proc. Natl Acad. Sci. USA (2015)
  80. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57, 289–300 (1995)
  81. Benjamini, Y. & Yekutieli, D. The control of the false discovery rate in multiple testing under dependency. Ann. Stat. 29, 1165–1188 (2001)
  82. Zar, J. H. Biostatistical Analysis Ch. 22 (Pearson, 2014)
  83. Dagan, T. & Martin, W. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proc. Natl Acad. Sci. USA 104, 870–875 (2007)
  84. Petitjean, C., Deschamps, P., Lopez-Garcia, P. & Moreira, D. Rooting the domain Archaea by phylogenomic analysis supports the foundation of the new kingdom Proteoarchaeota. Genome Biol. Evol. 7, 191–204 (2015)
  85. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinform. 10, 421 (2009)
  86. Hazkani-Covo, E. & Graur, D. A comparative analysis of numt evolution in human and chimpanzee. Mol. Biol. Evol. 24, 13–18 (2007)
  87. Martin, W. & Schnarrenberger, C. The evolution of the Calvin cycle from prokaryotic to eukaryotic chromosomes: a case study of functional redundancy in ancient pathways through endosymbiosis. Curr. Genet. 32, 1–18 (1997)
  88. Maier, U. G. et al. Massively convergent evolution for ribosomal protein gene content in plastid and mitochondrial genomes. Genome Biol. Evol. 5, 2318–2329 (2013)
  89. de Vries, J. & Wackernagel, W. Integration of foreign DNA during natural transformation of Acinetobacter sp. by homology-facilitated illegitimate recombination. Proc. Natl Acad. Sci. USA 99, 2094–2099 (2002)
  90. Putnam, N. H. et al. Sea anemone genome reveals ancestral eumetazoan gene repertoire and genomic organization. Science 317, 86–94 (2007)
  91. Artamonova, I. I. & Mushegian, A. R. Genome sequence analysis indicates that the model eukaryote Nematostella vectensis harbors bacterial consorts. Appl. Environ. Microbiol. 79, 6868–6873 (2013)
  92. Srivastava, M. et al. The Amphimedon queenslandica genome and the evolution of animal complexity. Nature 466, 720–726 (2010)
  93. Hentschel, U., Piel, J., Degnan, S. M. & Taylor, M. W. Genomic insights into the marine sponge microbiome. Nature Rev. Microbiol. 10, 641–654 (2012)
  94. McCutcheon, J. P. & Moran, N. A. Extreme genome reduction in symbiotic bacteria. Nature Rev. Microbiol. 10, 13–26 (2012)
  95. Wenger, Y. & Galliot, B. RNAseq versus genome-predicted transcriptomes: a large population of novel transcripts identified in an Illumina-454 Hydra transcriptome. BMC Genom. 14, 204 (2013)
  96. Langdon, W. B. Mycoplasma contamination in the 1000 Genomes Project. BioData Min. 7, 3 (2014)
  97. Lang, D., Zimmer, A. D., Rensing, S. A. & Reski, R. Exploring plant biodiversity: the Physcomitrella genome and beyond. Trends Plant Sci. 13, 542–549 (2008)
  98. Maere, S. et al. Modeling gene and genome duplications in eukaryotes. Proc. Natl Acad. Sci. USA 102, 5454–5459 (2005)
  99. Lockhart, P. J., Larkum, A. W. D., Steel, M. A., Waddell, P. J. & Penny, D. Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc. Natl Acad. Sci. USA 93, 1930–1934 (1996)
  100. Lockhart, P. J. et al. How molecules evolve in eubacteria. Mol. Biol. Evol. 17, 835–838 (2000)
  101. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501–D504 (2005)
  102. Zwickl, D. J. & Hillis, D. M. Increased taxon sampling greatly reduces phylogenetic error. Syst. Biol. 51, 588–598 (2002)
  103. Alvarez-Ponce, D., Lopez, P., Bapteste, E. & McInerney, J. O. Gene similarity networks provide tools for understanding eukaryote origins and evolution. Proc. Natl Acad. Sci. USA 110, E1594–E1603 (2013)

Author information

Affiliations

  1. Institute of Molecular Evolution, Heinrich-Heine University, 40225 Düsseldorf, Germany

    • Chuan Ku,
    • Shijulal Nelson-Sathi,
    • Mayo Roettger,
    • Filipa L. Sousa &
    • William F. Martin
  2. Institute of Fundamental Sciences, Massey University, Palmerston North 4474, New Zealand

    • Peter J. Lockhart
  3. Department of Mathematics and Statistics, University of Otago, Dunedin 9054, New Zealand

    • David Bryant
  4. Department of Natural and Life Sciences, The Open University of Israel, Ra’anana 43107, Israel

    • Einat Hazkani-Covo
  5. Department of Biology, National University of Ireland, Maynooth, County Kildare, Ireland

    • James O. McInerney
  6. Michael Smith Building, The University of Manchester, Oxford Rd, Manchester M13 9PL, UK

    • James O. McInerney
  7. Genomic Microbiology Group, Institute of Microbiology, Christian-Albrechts-University of Kiel, 24118 Kiel, Germany

    • Giddy Landan
  8. Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, 2780-157 Oeiras, Portugal

    • William F. Martin

Contributions

C.K., G.L., S.N.-S., E.H.-C., D.B., M.R., P.J.L., J.O.M., and W.F.M. designed experiments. C.K., G.L., S.N.-S., M.R., F.L.S., and E.H.-C. performed analyses. C.K., S.N.-S., F.L.S., P.J.L., D.B., E.H.-C., J.O.M., G.L., and W.F.M. wrote the paper.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:

Extended data figures and tables

Extended Data Figures

  1. Extended Data Figure 1: Additional gene distribution patterns. (574 KB)

    a, Distribution of ESCs. Each black tick indicates the presence of a cluster in a taxon. The 26,117 ESCs (x axis) from 55 eukaryotic genomes (Supplementary Table 1) are sorted according to their distribution across the six eukaryotic supergroups. b, Distribution of taxa in EPCs and monophyly of eukaryotes. Each black tick indicates the presence of a cluster in a taxon. The 2,585 EPCs (x axis) are separated into three sets according to the monophyly of eukaryotes and the results of the AUT and, within each set, are ordered according to their distribution across the six eukaryotic supergroups. Clusters where eukaryotes were resolved as non-monophyletic in the maximum likelihood tree tend to occur more frequently in bacterial taxa. Archaep., Archaeplastida; Opisth., Opisthokonta; Chl., Chloroplastida; Rho., Rhodophyta; Gla., Glaucophyta; Str., Stramenopila; De., Deinococcus-Thermus; oP., other Proteobacteria; Ch., Chlamydiae; Pl., Planctomycetes; Ve., Verrucomicrobia; Spi., Spirochaetae; The., Thermotogae; oB., other Bacteria. For abbreviations of eukaryotes, see Supplementary Table 1.

  2. Extended Data Figure 2: Clustering, monophyly, and gene sharing. (229 KB)

    a, b, Monophyly of eukaryotes in maximum likelihood trees, cluster size, and alignment quality. Cumulative frequency of clusters with different cluster size (a) or different HoT (ref. 72) column scores (b) is plotted for three sets of EPCs that differ in terms of the monophyly of eukaryotes in the maximum likelihood trees (monophyletic: resolved as monophyletic in the original tree; passed AUT: resolved as non-monophyletic in the original tree, but at least one alternative tree with eukaryote monophyly (see Methods) was as likely at P = 0.05 in an AUT; failed AUT: alternative trees were not as likely as the original tree where eukaryotes were resolved as non-monophyletic). One-sided Kolmogorov–Smirnov two-sample goodness-of-fit test (cluster size/HoT column scores): monophyletic versus passed AUT, 1.04 × 10^−13/7.9 × 10^−3; monophyletic versus failed AUT, 1.45 × 10^−61/2.04 × 10^−10; passed AUT versus failed AUT, 3.40 × 10^−13/4.00 × 10^−3. c, d, Prokaryotic monophyly and gene sharing. c, Proportion of trees showing monophyly for taxonomic group. Prokaryotic phyla and classes (Supplementary Tables 3 and 4) that are monophyletic in the reference trees and that have at least five taxa (genomes in archaea or species in bacteria) are plotted according to the number of taxa and the proportion of EPC trees with at least two sequences from a prokaryotic group where it forms a monophyletic group. The proportion of eukaryote monophyly trees is higher than that of any prokaryotic group, including those with many fewer taxa. d, Gene sharing between a prokaryotic group and other prokaryotes. Using the same procedure for the generation of EPCs, 55 genomes were randomly sampled from a group of bacteria and the number of clusters (EPCs) they shared with prokaryotes not from this group was counted. The average number of shared clusters was mapped for each taxonomic group with 55–150 genomes (error bar, s.d.; number of genomes in parentheses). For E. coli and the eukaryotes (shown for comparison), there was only one sample. Colour coding for taxonomic levels: red, phylum; blue, class; green, order; magenta, family; cyan, genus; orange, species.

  3. Extended Data Figure 3: Effect of taxon sampling on eukaryote monophyly in phylogenetic trees. (528 KB)

    After ten sequences (bold) were added to the original data set (EPC E1689_B206_A295), the relationships among Archaeplastida taxa (highlighted in green) changed from non-monophyly (a) to monophyly (b). Abbreviations are shown for eukaryotic sequences (Supplementary Table 2) and NCBI GI numbers for cyanobacterial sequences (Supplementary Table 3; RefSeq accessions are shown for the added sequences).

  4. Extended Data Figure 4: Distribution of prokaryotic taxa in the sister group to eukaryotes, with EPCs sorted by eukaryotic supergroups. (931 KB)

    Top: each black tick indicates the presence of a eukaryote taxon in one of the 2,585 EPCs. Bottom: each red tick indicates the presence of a prokaryote taxon in the sister group to eukaryotes in one of the 1,933 EPC maximum likelihood trees where eukaryotes were resolved to be monophyletic. The 2,585 EPCs, proteome size, and cluster size are as in Fig. 1. The number of EPCs present and the frequency of occurrence in the sister group to eukaryotes (‘clusters’) are shown for eukaryotes and prokaryotes, respectively. Archaep., Archaeplastida; Opisth., Opisthokonta; Chl., Chloroplastida; Rho., Rhodophyta; Gla., Glaucophyta; Str., Stramenopila; De., Deinococcus-Thermus; oP., other Proteobacteria; Ch., Chlamydiae; Pl., Planctomycetes; Ve., Verrucomicrobia; Spi., Spirochaetae; The., Thermotogae; oB., other Bacteria. For abbreviations of eukaryotes, see Supplementary Table 1.

  5. Extended Data Figure 5: Distribution of prokaryotic taxa in the sister group to eukaryotes, with EPCs sorted by prokaryotic groups. (411 KB)

    Top: each black tick indicates the presence of a eukaryote taxon in one of the 1,933 EPC maximum likelihood trees where eukaryotes were resolved to be monophyletic. Bottom: each red tick indicates the presence of a prokaryote taxon in the sister group to eukaryotes in one of those 1,933 EPC trees. The EPCs (x axis) are ordered according to the taxonomic groups to which the prokaryotes in the sister group to eukaryotes belong (separated into three blocks where only bacteria (1,586 EPCs), only archaea (314 EPCs), or both bacteria and archaea (33 EPCs) are found in the sister group). There are 16 bacterial groups (including ‘other Bacteria’; Firmicutes, Proteobacteria, and the PVC superphylum (Planctomycetes, Verrucomicrobia, and Chlamydiae) are regarded as single groups) and five archaeal groups (the five phyla). The number of EPCs present and the frequency of occurrence in the sister group to eukaryotes are shown for eukaryotes and prokaryotes, respectively. Archaep., Archaeplastida; Opisth., Opisthokonta; Chl., Chloroplastida; Rho., Rhodophyta; Gla., Glaucophyta; Str., Stramenopila; De., Deinococcus-Thermus; oP., other Proteobacteria; Ch., Chlamydiae; Pl., Planctomycetes; Ve., Verrucomicrobia; Spi., Spirochaetae; The., Thermotogae; oB., other Bacteria. For abbreviations of eukaryotes, see Supplementary Table 1.

  6. Extended Data Figure 6: Distribution of taxa in the sister groups consisting purely of cyanobacteria, alphaproteobacteria, or archaea. (484 KB)

    Each black tick indicates the presence of a prokaryotic taxon in the sister group to eukaryotes in an EPC tree. a–c, Distributions of taxa in all pure-cyanobacterial (a), pure-alphaproteobacterial (b), and pure-archaeal (c) sister groups. The clusters are ordered alphanumerically according to the eukaryotic cluster numbers (Supplementary Table 5), whereas for archaea (c) the taxa are further sorted by the five archaeal phyla.

  7. Extended Data Figure 7: Comparison of sets of trees for single-copy genes in eukaryotic groups, with more inclusive criteria. (479 KB)

    a–f, Cumulative distribution functions (y axis) for scores of minimal tree compatibility with the vertical reference data set (x axis). Values are number of species, sample sizes, and P values of the two-tailed Kolmogorov–Smirnov two-sample goodness-of-fit test in the comparison of the ESC (blue) data sets against the EPC (green) data set and a synthetic data set simulating one LGT (red). Dashed lines delineate the range of distributions in 100 replicates of random down-sampling. The criteria for tree inclusion were less stringent than those for Fig. 3 (see Methods).

  8. Extended Data Figure 8: Overview of eukaryote gene content evolution. (447 KB)

    a, Eukaryotic evolution by gene loss. Genome sizes (number of EPCs present) were mapped onto the eukaryotic reference tree. Ancestral genome size in each eukaryotic ancestral node was calculated using a loss-only model, with all EPCs in blocks A–C and those in blocks D and E (Fig. 1) entering the eukaryotic lineage via the plastid ancestor (green) or the eukaryote ancestor (wheat colour). Plastid-derived genes are not shown for the ancestral nodes within SAR and Hacrobia, because of current debates about the number and nature of secondary symbioses, but are indicated by the greenish shading. b, Endosymbiotic gene transfer network. The network connecting apparent gene donors to the common ancestor of eukaryotes and Archaeplastida is mapped onto the reference phylogeny (vertical edges) of bacteria (left), eukaryotes (middle), and archaea (right). Grey shading (white to black) in the prokaryote reference trees (70 for archaea and 32 for bacteria) indicates how often a branch associated with a particular node was recovered within the trees of individual genes that were concatenated for inferring the reference topology. Lateral edges indicate gene influx at the origin of eukaryotes and at the origin of plastids. Edge colour corresponds to the frequencies with which a prokaryotic group appears in the sister group to eukaryotes. The archaeal reference tree was rooted between euryarchaeotes and other taxa, and the bacterial tree with Thermotogae. Secondary endosymbiotic transfers are indicated in light green and red. That members of both the Crenarchaeota and the Euryarchaeota are implicated as host relatives is probably because of the small archaeon sample (refs 34–36).

  9. Extended Data Figure 9: Apparent gene transfers and eukaryote–prokaryote sequence identities. (255 KB)

    a, Patterns suggestive of LGT from prokaryotes inferred from EPC trees. All EPC trees were searched for phylogenetic patterns suggestive of gene acquisitions by the common ancestor of each eukaryote lineage within the six supergroups (see Methods). The size of each circle is proportional to the number of such putative acquisitions, with the total number of putative acquisitions shown for each supergroup. The colour shows the age of nodes according to a eukaryotic time tree (blue, younger than 800 million years; red, older than 800 million years). For the four lineages with an asterisk, phylogenetic patterns where SAR/Hacrobia are nested within a clade formed by Archaeplastida were also counted as putative acquisitions to take into account secondary plastid endosymbioses. The numbers of acquisitions without such patterns are indicated in parentheses (and shown as inner circles). b, Eukaryote–prokaryote sequence identities for genes apparently acquired more recently and more anciently in eukaryotes (a). The mean of the average pairwise identities is shown in parentheses. At P = 0.05, a two-sided Wilcoxon rank-sum test either did not reject the null hypotheses that the two sets of genes are not different or suggested the tip-specific eukaryotic genes are less similar to their prokaryotic homologues.

  10. Extended Data Figure 10: Distribution of ESCs and EPCs across eukaryotes under different criteria. (312 KB)

    Different thresholds were applied to find eukaryote clusters with prokaryote homologues, including BLAST local identity for each eukaryote–prokaryote hit (30% or 20%) and levels of best-hit correspondence (10–50%) for identifying reciprocal pairs of eukaryote and prokaryote clusters. Distributions of ESCs and EPCs are drawn as in Extended Data Fig. 1a and Fig. 1, respectively.
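
As an illustration of the loss-only reconstruction mentioned for Extended Data Fig. 8a, the following minimal Python sketch assumes that a cluster entered the lineage at the root of the tree and can afterwards only be lost, so that it is inferred to be present at an ancestral node whenever at least one descendant genome carries it. The toy tree, the taxon and cluster labels, and the function name loss_only_sizes are hypothetical and are not taken from the study, whose actual calculation may differ (for example, in how plastid-derived clusters are restricted to lineages below the plastid acquisition).

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name (str) or a tuple of child subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def loss_only_sizes(tree: Tree, leaf_clusters: Dict[str, Set[str]]) -> Dict[str, int]:
    """Number of clusters inferred present at every node under a loss-only model.

    Assumption (not from the source): every cluster originated at the root,
    so a cluster is present at a node if any descendant leaf still has it.
    """
    sizes: Dict[str, int] = {}

    def visit(node: Tree) -> Tuple[Set[str], str]:
        if isinstance(node, str):
            # Leaf: the observed cluster content of one genome.
            present, label = set(leaf_clusters[node]), node
        else:
            # Internal node: union of everything found in the descendants.
            results = [visit(child) for child in node]
            present = set().union(*(clusters for clusters, _ in results))
            label = "(" + ",".join(name for _, name in results) + ")"
        sizes[label] = len(present)
        return present, label

    visit(tree)
    return sizes


# Hypothetical toy data: three genomes and four clusters.
toy_tree = (("taxonA", "taxonB"), "taxonC")
toy_clusters = {
    "taxonA": {"c1", "c2"},
    "taxonB": {"c2", "c3"},
    "taxonC": {"c4"},
}
sizes = loss_only_sizes(toy_tree, toy_clusters)
for node, count in sizes.items():
    print(node, count)
```

In this toy case the inferred ancestor of taxonA and taxonB carries three clusters although each descendant retains only two, which is the pattern of larger ancestral and smaller descendant repertoires that a loss-only model produces by construction.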

Supplementary information

Excel files

  1. Supplementary Table 1 (4 MB)

    List of 55 eukaryote genomes organized by six eukaryotic supergroups, sources of genome sequences, 28,702 eukaryotic protein clusters with at least two sequences and maximum-likelihood trees of eukaryote-specific clusters (ESCs) with at least four sequences.

  2. Supplementary Table 2 (27.9 MB)

    List of 956,053 eukaryotic protein sequences in the sequence abbreviations used in this study and the original headers in the downloaded files.

  3. Supplementary Table 3 (21.9 MB)

    List of 1,847 bacterial genomes, their taxonomical groupings and 102,089 clusters with at least five sequences, as well as a maximum likelihood reference tree based on 32 nearly universal single-copy genes.

  4. Supplementary Table 4 (1.2 MB)

    List of 134 archaeal genomes, their taxonomical groupings and 11,992 clusters with at least five sequences.

  5. Supplementary Table 5 (5.2 MB)

    List of 2,585 eukaryote-prokaryote clusters (EPCs).

  6. Supplementary Table 6 (1.6 MB)

    Annotations of the functions of the 28,702 eukaryotic clusters and eukaryote monophyly in EPC trees.

  7. Supplementary Table 8 (68 KB)

    Frequency of occurrence of prokaryotic taxa in the sister group to eukaryotes and a two-sided Wilcoxon signed rank test comparing the original frequencies and those after randomizations.

Text files

  1. Supplementary Table 7 (27.5 MB)

    Maximum-likelihood trees with at least four sequences reconstructed from eukaryote-prokaryote clusters (EPCs).

PDF files

  1. Supplementary Table 9 (100 KB)

    BLAST analysis of bacterial, mitochondrial and plastid genomes against the nuclear genomes.

Additional data