Timing the origin of eukaryotic cellular complexity with ancient duplications


Eukaryogenesis is one of the most enigmatic evolutionary transitions, during which simple prokaryotic cells gave rise to complex eukaryotic cells. While evolutionary intermediates are lacking, gene duplications provide information on the order of events by which eukaryotes originated. Here we use a phylogenomics approach to reconstruct successive steps during eukaryogenesis. We find that gene duplications roughly doubled the proto-eukaryotic gene repertoire, with families inherited from the Asgard archaea-related host being duplicated most. By relatively timing events using phylogenetic distances, we inferred that duplications in cytoskeletal and membrane-trafficking families were among the earliest events, whereas most other families expanded predominantly after mitochondrial endosymbiosis. Altogether, we infer that the host that engulfed the proto-mitochondrion had some eukaryote-like complexity, which drastically increased upon mitochondrial acquisition. This scenario bridges the signs of complexity observed in Asgard archaeal genomes to the proposed role of mitochondria in triggering eukaryogenesis.

Fig. 1: Characterization of duplications during eukaryogenesis.
Fig. 2: Contribution of different phylogenetic origins to duplications during eukaryogenesis.
Fig. 3: Timing of acquisitions and duplications from different phylogenetic origins during eukaryogenesis.
Fig. 4: Timing of duplications during eukaryogenesis according to function and localization.

Data availability

FASTA files, phylogenetic trees and their annotations are available on figshare (ref. 53).

Code availability

The code used to annotate the phylogenetic trees is available on GitHub.


References

  1. Dacks, J. B. et al. The changing view of eukaryogenesis—fossils, cells, lineages and how they all come together. J. Cell Sci. 129, 3695–3703 (2016).
  2. Shiratori, T., Suzuki, S., Kakizawa, Y. & Ishida, K. Phagocytosis-like cell engulfment by a planctomycete bacterium. Nat. Commun. 10, 5529 (2019).
  3. Koumandou, V. L. et al. Molecular paleontology and complexity in the last eukaryotic common ancestor. Crit. Rev. Biochem. Mol. Biol. 48, 373–396 (2013).
  4. Szathmáry, E. Toward major evolutionary transitions theory 2.0. Proc. Natl Acad. Sci. USA 112, 10104–10111 (2015).
  5. Spang, A. et al. Complex archaea that bridge the gap between prokaryotes and eukaryotes. Nature 521, 173–179 (2015).
  6. Zaremba-Niedzwiedzka, K. et al. Asgard archaea illuminate the origin of eukaryotic cellular complexity. Nature 541, 353–358 (2017).
  7. Roger, A. J., Muñoz-Gómez, S. A. & Kamikawa, R. The origin and diversification of mitochondria. Curr. Biol. 27, R1177–R1192 (2017).
  8. Martijn, J., Vosseberg, J., Guy, L., Offre, P. & Ettema, T. J. G. Deep mitochondrial origin outside the sampled alphaproteobacteria. Nature 557, 101–105 (2018).
  9. Poole, A. M. & Gribaldo, S. Eukaryotic origins: how and when was the mitochondrion acquired? Cold Spring Harb. Perspect. Biol. 6, a015990 (2014).
  10. Pittis, A. A. & Gabaldón, T. Late acquisition of mitochondria by a host with chimaeric prokaryotic ancestry. Nature 531, 101–104 (2016).
  11. Makarova, K. S., Wolf, Y. I., Mekhedov, S. L., Mirkin, B. G. & Koonin, E. V. Ancestral paralogs and pseudoparalogs and their role in the emergence of the eukaryotic cell. Nucleic Acids Res. 33, 4626–4638 (2005).
  12. Jékely, G. Small GTPases and the evolution of the eukaryotic cell. Bioessays 25, 1129–1138 (2003).
  13. Wickstead, B., Gull, K. & Richards, T. A. Patterns of kinesin evolution reveal a complex ancestral eukaryote with a multifunctional cytoskeleton. BMC Evol. Biol. 10, 110 (2010).
  14. Elias, M., Brighouse, A., Gabernet-Castello, C., Field, M. C. & Dacks, J. B. Sculpting the endomembrane system in deep time: high resolution phylogenetics of Rab GTPases. J. Cell Sci. 125, 2500–2508 (2012).
  15. Dacks, J. B. & Field, M. C. Evolutionary origins and specialisation of membrane transport. Curr. Opin. Cell Biol. 53, 70–76 (2018).
  16. Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 44, D279–D285 (2016).
  17. Fritz-Laylin, L. K. et al. The genome of Naegleria gruberi illuminates early eukaryotic versatility. Cell 140, 631–642 (2010).
  18. Derelle, R. et al. Bacterial proteins pinpoint a single eukaryotic root. Proc. Natl Acad. Sci. USA 112, E693–E699 (2015).
  19. Burki, F., Roger, A. J., Brown, M. W. & Simpson, A. G. B. The new tree of eukaryotes. Trends Ecol. Evol. 35, 43–55 (2020).
  20. Tria, F. D. K. et al. Gene duplications trace mitochondria to the onset of eukaryote complexity. Preprint at bioRxiv (2019).
  21. Esser, C. et al. A genome phylogeny for mitochondria among α-proteobacteria and a predominantly eubacterial ancestry of yeast nuclear genes. Mol. Biol. Evol. 21, 1643–1660 (2004).
  22. Pisani, D., Cotton, J. A. & McInerney, J. O. Supertrees disentangle the chimerical origin of eukaryotic genomes. Mol. Biol. Evol. 24, 1752–1760 (2007).
  23. Narrowe, A. B. et al. Complex evolutionary history of translation elongation factor 2 and diphthamide biosynthesis in archaea and parabasalids. Genome Biol. Evol. 10, 2380–2393 (2018).
  24. Williams, T. A., Cox, C. J., Foster, P. G., Szöllősi, G. J. & Embley, T. M. Phylogenomics provides robust support for a two-domains tree of life. Nat. Ecol. Evol. 4, 138–147 (2020).
  25. Martin, W. F. et al. Late mitochondrial origin is an artifact. Genome Biol. Evol. 9, 373–379 (2017).
  26. Lane, N. Serial endosymbiosis or singular event at the origin of eukaryotes? J. Theor. Biol. 434, 58–67 (2017).
  27. Pittis, A. A. & Gabaldón, T. On phylogenetic branch lengths distribution and the late acquisition of mitochondria. Preprint at bioRxiv (2016).
  28. Lane, N. & Martin, W. The energetics of genome complexity. Nature 467, 929–934 (2010).
  29. Lane, N. Bioenergetic constraints on the evolution of complex life. Cold Spring Harb. Perspect. Biol. 6, a015982 (2014).
  30. Klinger, C. M., Spang, A., Dacks, J. B. & Ettema, T. J. G. Tracing the archaeal origins of eukaryotic membrane-trafficking system building blocks. Mol. Biol. Evol. 33, 1528–1541 (2016).
  31. Martijn, J. & Ettema, T. J. G. From archaeon to eukaryote: the evolutionary dark ages of the eukaryotic cell. Biochem. Soc. Trans. 41, 451–457 (2013).
  32. Akıl, C. & Robinson, R. C. Genomes of Asgard archaea encode profilins that regulate actin. Nature 562, 439–443 (2018).
  33. Imachi, H. et al. Isolation of an archaeon at the prokaryote–eukaryote interface. Nature 577, 519–525 (2020).
  34. Deutekom, E. S., Vosseberg, J., van Dam, T. J. P. & Snel, B. Measuring the impact of gene prediction on gene loss estimates in Eukaryotes by quantifying falsely inferred absences. PLoS Comput. Biol. 15, e1007301 (2019).
  35. Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).
  36. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).
  37. Hauser, M., Mayer, C. E. & Söding, J. kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinform. 14, 248 (2013).
  38. van Wijk, L. M. & Snel, B. The first eukaryotic kinome tree illuminates the dynamic history of present-day kinases. Preprint at bioRxiv (2020).
  39. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
  40. Shah, N., Nute, M. G., Warnow, T. & Pop, M. Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics 35, 1613–1614 (2019).
  41. González-Pech, R. A., Stephens, T. G. & Chan, C. X. Commonly misunderstood parameters of NCBI BLAST and important considerations for users. Bioinformatics 35, 2697–2698 (2019).
  42. Adl, S. M. et al. Revisions to the classification, nomenclature, and diversity of eukaryotes. J. Eukaryot. Microbiol. 66, 4–119 (2019).
  43. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).
  44. Capella-Gutiérrez, S., Silla-Martínez, J. M. & Gabaldón, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
  45. Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
  46. Le, S. Q., Dang, C. C. & Gascuel, O. Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol. Biol. Evol. 29, 2921–2936 (2012).
  47. Minh, B. Q., Nguyen, M. A. T. & von Haeseler, A. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30, 1188–1195 (2013).
  48. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
  49. Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol. Biol. Evol. 33, 1635–1638 (2016).
  50. Steinegger, M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinform. 20, 473 (2019).
  51. Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
  52. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
  53. Vosseberg, J. et al. Data for: timing the origin of eukaryotic cellular complexity with ancient duplications. figshare (2020).



We thank K. S. Makarova and E. V. Koonin for sharing their KOG-to-COG protein clusters with us. We are grateful to T. J. P. van Dam, E. S. Deutekom and G. J. P. L. Kops for useful advice and discussions. This work is part of the research programme VICI with project number 016.160.638, which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO). T.G. acknowledges support from the Spanish Ministry of Science and Innovation for grant PGC2018-099921-B-I00 and from the European Union’s Horizon 2020 research and innovation programme under grant agreement ERC-2016-724173.

Author information




J.J.E.v.H., T.G. and B.S. conceived the study. J.V. and J.J.E.v.H. performed the research. J.V., J.J.E.v.H., T.G. and B.S. analysed and interpreted the results. M.M.-H. performed the analysis on the human phylome. M.M.-H. and A.v.V. aided in the development of the tree analysis pipeline. L.M.v.W. implemented the ScrollSaw-based method. J.V., J.J.E.v.H. and B.S. wrote the manuscript, which was edited and approved by all authors.

Corresponding authors

Correspondence to Toni Gabaldón or Berend Snel.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Peer reviewer reports are available.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Estimating the number of LECA genes from the number of Pfam domains with linear regression.

Scatter plot showing the number of Pfam domains and protein-coding genes in present-day eukaryotes, with each dot representing one genome. The regression line (black) and its 95% confidence (filled grey) and prediction intervals (dashed grey) are depicted. The vertical line corresponds to the obtained number of LECA Pfam domains.
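
The regression behind this figure can be sketched roughly as follows. This is a minimal illustration in Python with NumPy, not the authors' code: the function names (`fit_line`, `predict_with_intervals`), the toy numbers and the caller-supplied critical value `t_crit` are all assumptions made for the example. It fits gene count against Pfam domain count by ordinary least squares and then reports a point estimate with a confidence interval (uncertainty of the fitted line) and a wider prediction interval (uncertainty of a single new genome).

```python
import numpy as np


def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    x_mean = x.mean()
    sxx = np.sum((x - x_mean) ** 2)
    slope = np.sum((x - x_mean) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x_mean
    residuals = y - (intercept + slope * x)
    # Residual standard error with n - 2 degrees of freedom.
    s = np.sqrt(np.sum(residuals ** 2) / (n - 2))
    return {"intercept": intercept, "slope": slope, "s": s,
            "n": n, "x_mean": x_mean, "sxx": sxx}


def predict_with_intervals(fit, x0, t_crit):
    """Point estimate at x0 plus confidence and prediction intervals.

    t_crit is the two-sided critical value of Student's t with n - 2
    degrees of freedom; it is passed in by the caller (hypothetical
    parameter, chosen to keep the sketch free of extra dependencies).
    """
    y0 = fit["intercept"] + fit["slope"] * x0
    leverage = 1.0 / fit["n"] + (x0 - fit["x_mean"]) ** 2 / fit["sxx"]
    half_ci = t_crit * fit["s"] * np.sqrt(leverage)
    half_pi = t_crit * fit["s"] * np.sqrt(1.0 + leverage)
    return y0, (y0 - half_ci, y0 + half_ci), (y0 - half_pi, y0 + half_pi)


# Toy data, invented for illustration only (not values from the paper):
# x = Pfam domains per genome, y = protein-coding genes per genome.
pfam_counts = [2000, 2500, 3000, 3500, 4000, 4500]
gene_counts = [5200, 7100, 8900, 11300, 12800, 15100]

fit = fit_line(pfam_counts, gene_counts)
# 2.776 is the 97.5% quantile of Student's t with 4 degrees of freedom.
estimate, conf_int, pred_int = predict_with_intervals(fit, 3200, t_crit=2.776)
print(f"estimate: {estimate:.0f}")
print(f"95% confidence interval: {conf_int[0]:.0f} to {conf_int[1]:.0f}")
print(f"95% prediction interval: {pred_int[0]:.0f} to {pred_int[1]:.0f}")
```

In this sketch the value passed as `x0` plays the role of the vertical line in the figure (the inferred number of LECA Pfam domains), and the prediction interval is the more relevant one when estimating the gene count of a single ancestral genome.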

Extended Data Fig. 2 Effect of a different phylogenetic position of the eukaryotic root.

a, Number of inferred LECA families considering different root positions. These numbers are based on phylogenetic trees from Pfams that are only present in eukaryotes. Besides the Opimoda and Diphoda groups, two other group definitions were used to identify bidirectional best hits (BBHs) and select sequences for tree inference. Names of root positions indicate either the lineage at one side of the root or the position of the split (ADis-DiaM: Amorphea+Discoba – Diaphoretickes+Metamonada; AM-DiaDis: Amorphea+Metamonada – Diaphoretickes+Discoba). Excavate sequences, especially from Metamonada species, are rarely involved in BBHs, unless specifically searched for (Excavata in BBHs 5 groups; Discoba and Metamonada in BBHs 4 groups). b, Distribution of duplication lengths obtained using different root positions for eukaryote-only trees based on the four group BBHs. The difference between distributions is not statistically significant according to the Kruskal-Wallis test.

Extended Data Fig. 3 Fraction of LECA families resulting from inventions.

a, Contribution of inventions to LECA families performing different functions. 82% of pairwise comparisons were significantly different (Supplementary Fig. 3). b, Fraction of LECA families resulting from either an invention or duplication – a eukaryotic innovation – according to functional category. 84% of pairwise comparisons were significantly different (Supplementary Fig. 5). c, Contribution of inventions to LECA families performing their function in different cellular components. 51% of pairwise comparisons were significantly different (Supplementary Fig. 4). d, Fraction of LECA families resulting from an innovation according to cellular localisation. 74% of pairwise comparisons were significantly different (Supplementary Fig. 6). a–d, Dashed lines indicate the overall invented or innovated fraction.

Extended Data Fig. 4 Phylogenetic origin of acquired Pfams.

a, b, Phylogeny of the prokaryotes (a) and Asgard archaea (b) present in our dataset based on the NCBI taxonomy. The branch widths and numbers indicate the number of acquisitions from a group. c, Number of acquisitions from different alphaproteobacterial orders or a combination of multiple orders (‘Alphaproteobacteria’).

Extended Data Fig. 5 Effect of duplications on branch lengths.

a, b, Distribution of alphaproteobacterial (a) and Asgard archaeal (b) stem lengths (sl’s) for acquisitions without and with duplications. Two alphaproteobacterial sl’s from acquisitions with Magnetococcales as sister group were removed based on the previously inferred phylogenetic position of mitochondria (ref. 8). c, d, Distribution of Asgard archaeal sl’s for information storage and processing (c) and cellular processes and signalling families (d), comparing those without and with duplications. Upon removal of the outliers, the difference in cellular processes and signalling families no longer reached statistical significance. e, Distribution of Asgard archaeal sl’s for duplicated acquisitions, in which homomer-to-heteromer transitions had occurred compared to the other duplicated acquisitions. f, Distribution of vertebrate sl’s for families without and with duplications. g, Distribution of duplication lengths (dl’s) grouped according to the lineage in which the duplication occurred. All pairwise comparisons were significantly different (Mann-Whitney U tests). h, Distribution of differences in log-transformed dl values for all pairwise comparisons between chordate duplications according to age and functional annotation. All groups were significantly different (Mann-Whitney U tests). a–f, P values of Mann-Whitney U tests are shown. c–e, The minimal sl via each duplication node is plotted.

Extended Data Fig. 6 Effect of branch length normalisation and functional divergence.

a, Ridgeline plot showing the distribution of uncorrected stem (rsl) or duplication lengths (rdl). Numbers indicate the number of acquisitions or duplications for which the branch lengths were included. The low peaks at very short branch lengths are an artefact from near-zero branch lengths. Groups are ordered based on the median value of rsl’s and rdl’s. b, Ridgeline plot showing the distribution of sl’s for non-duplicated acquisitions that share the same functional annotation of the prokaryotic sister group and are therefore expected to have undergone little functional divergence during eukaryogenesis. a, b, Branch lengths are depicted as the additive inverse of the log-transformed values. Pairwise comparisons that did not give a significant P value (Mann-Whitney U tests) are shown.

Supplementary information

Supplementary Information

Supplementary Methods, Discussion, Tables 1–3 and Figs. 1–8.

Reporting Summary

Peer Review Information


Cite this article

Vosseberg, J., van Hooff, J.J.E., Marcet-Houben, M. et al. Timing the origin of eukaryotic cellular complexity with ancient duplications. Nat Ecol Evol 5, 92–100 (2021).
