Divergent genomic trajectories predate the origin of animals and fungi

Ocaña-Pallarès, Eduard; Williams, Tom A.; López-Escardó, David; Arroyo, Alicia S.; Pathmanathan, Jananan S.; Bapteste, Eric; Tikhonenkov, Denis V.; Keeling, Patrick J.; Szöllősi, Gergely J.; Ruiz-Trillo, Iñaki

doi:10.1038/s41586-022-05110-4

Article
Open access
Published: 24 August 2022

Nature volume 609, pages 747–753 (2022)

Abstract

Animals and fungi have radically distinct morphologies, yet both evolved within the same eukaryotic supergroup: Opisthokonta^1,2. Here we reconstructed the trajectory of genetic changes that accompanied the origin of Metazoa and Fungi since the divergence of Opisthokonta with a dataset that includes four novel genomes from crucial positions in the Opisthokonta phylogeny. We show that animals arose only after the accumulation of genes functionally important for their multicellularity, a tendency that began in the pre-metazoan ancestors and later accelerated in the metazoan root. By contrast, the pre-fungal ancestors experienced net losses of most functional categories, including those gained in the path to Metazoa. On a broad-scale functional level, fungal genomes contain a higher proportion of metabolic genes and diverged less from the last common ancestor of Opisthokonta than did the gene repertoires of Metazoa. Metazoa and Fungi also show differences regarding gene gain mechanisms. Gene fusions are more prevalent in Metazoa, whereas a larger fraction of gene gains were detected as horizontal gene transfers in Fungi and protists, in agreement with the long-standing idea that transfers would be less relevant in Metazoa due to germline isolation^3,4,5. Together, our results indicate that animals and fungi evolved under two contrasting trajectories of genetic change that predated the origin of both groups. The gradual establishment of two clearly differentiated genomic contexts thus set the stage for the emergence of Metazoa and Fungi.

Main

One of the most surprising early insights of molecular phylogenetics was the close evolutionary relationship between animals and fungi⁶, which was unexpected because of the enormous differences in their morphology, ecology, life history and behaviour. This relationship has stood the test of time, and now animals and fungi are members of Holozoa and Holomycota, respectively, which are the two major divisions of the eukaryotic supergroup Opisthokonta¹. Pinpointing how animals and fungi evolved to be so different requires a detailed reconstruction of the evolutionary changes leading up to the two lineages. This demands not only genomic data from diverse animals and fungi but also from the protist opisthokont groups that branch between them (Fig. 1d), which are underrepresented in genomic databases⁷.

**Fig. 1: Lineages leading to modern Metazoa and Fungi experienced sharply contrasting trajectories of genetic changes.**

Four new genomes of protist opisthokonts

The closest known groups to Metazoa within Holozoa are Choanoflagellatea, Filasterea and Teretosporea (Fig. 1d). Within Holomycota, the closest known groups to Fungi (here defined as the least inclusive clade including Chytridiomycota and Blastocladiomycota based on the absence of phagotrophy in all the members of this clade⁸) are Opisthosporidia (a paraphyletic group^9,10, which in our genomic dataset is represented by Rozella allomycis and Mitosporidium daphniae, the RM clade) and Nucleariidae (Fig. 1d). To improve the limited genome sampling for the protist opisthokont groups⁷, we sequenced, assembled and annotated the genomes of three filastereans (Ministeria vibrans¹¹, Pigoraptor vietnamica¹² and Pigoraptor chileana¹²) and one nucleariid (Parvularia atlantis¹³) from metagenomic data produced from cultures of these species (Supplementary Information 1). Given that Filasterea and Nucleariidae were previously represented by only a single whole-genome-sequenced species each, the four newly sequenced species represent a substantial increase in the diversity of genomic data available for the protist opisthokont groups (Fig. 1d). This can be expected to minimize the negative impact of poor taxon sampling in ancestral reconstructions (see an example of this issue in Extended Data Fig. 1a).

The four sequenced genomes present high completeness and contiguity metrics, which are in the range of those from the previously sequenced protist opisthokont species (Fig. 23 in Supplementary Information 1). With regard to genome size and gene content metrics, the sequenced species are not different from most unicellular eukaryotes and fungi (Extended Data Figs. 2 and 3), with the exception of P. atlantis. Despite having a compact genome (19.24 Mb), this nucleariid presents 8.58 introns per gene (Extended Data Fig. 3a). This ratio is almost identical to that of Homo sapiens, even though the introns of P. atlantis are approximately 86 times shorter (mean size of 60.67 bp) (Extended Data Fig. 3b), giving it an intron density (approximately four introns per kilobase) more than twice that of any other genome explored (Extended Data Fig. 1b).

Large differences in gene content

We explored whether the gene contents of Metazoa and Fungi present broad-scale functional differences, as this would indicate that, at some point after the divergence of their last common ancestor, a substantial genetic turnover occurred (that is, the remodelling of the gene content as a result of gene gains and losses, with gains including the origination of novel gene families and the expansion of ancestral families). In a multivariate analysis of the relative genomic representation of each Clusters of Orthologous Groups functional category¹⁴ (hereafter referred to as functional categories), Metazoa and Fungi cluster separately in the dimension accounting for the largest variance explained (68.1%) (Fig. 2a). Functional categories of signal transduction (T), transcription (K) and extracellular structures (W), which are particularly relevant for animal multicellularity^15,16, are among the most differentially represented in animal genomes (particularly T and W; Extended Data Fig. 5a). Other categories that are more represented in Metazoa include cytoskeleton (Z) and cell motility (N) (Fig. 2a). By contrast, the vast majority of metabolic functional categories (C, E, F, G, H, I and Q; see Fig. 1c) are proportionally more represented in Fungi (Fig. 2a).
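The "relative genomic representation" used above can be illustrated with a minimal sketch. It assumes that each annotated gene carries exactly one functional category letter and that the representation of a category is its share of all annotated genes; the input format, the function name and the toy genome are hypothetical and are not taken from the study.

```python
from collections import Counter


def relative_representation(gene_categories):
    """Return the share of each functional category among annotated genes.

    gene_categories: iterable of single-letter category codes, one per
    annotated gene (a hypothetical input format).
    """
    counts = Counter(gene_categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no annotated genes")
    return {category: n / total for category, n in counts.items()}


# Toy genome, not data from the study: 10 genes in six categories.
toy_genome = ["T", "T", "T", "K", "K", "W", "C", "C", "E", "G"]
shares = relative_representation(toy_genome)
print(shares["T"])  # 0.3
```

Working with shares rather than raw counts is what makes genomes of very different sizes comparable in a multivariate analysis.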

**Fig. 2: Gradual compositional change at the gene function level predated the origin of Metazoa and Fungi.**

Greater divergence of metazoan gene sets

From an evolutionary perspective, the large genetic differences between Metazoa and Fungi might have arisen because either both or just one of the two groups experienced substantial genetic changes after diverging from their last common ancestor. Furthermore, this divergence could be due either to an abrupt genetic turnover, in which changes would have occurred specifically at the root of both groups, or to a gradual process in which the preceding ancestors of each group were already accumulating changes in the direction of the differences observed in extant Metazoa and Fungi (Fig. 2a). To distinguish between these alternative scenarios, we took two complementary approaches to reconstruct the tempo and mode of the genetic divergence that occurred. In the first approach, we split the functional categories into two groups, Metazoa-related or Fungi-related, based on the results from the multivariate analysis on extant species from Metazoa and from Fungi (Fig. 2a). Then, we computed the relative representation of each group of functional categories in every ancestral node of Opisthokonta (Fig. 1a) based on the gene contents inferred with our ancestral reconstruction pipeline (see Methods). In the second approach, we trained a series of machine learning classifiers to find their own functional category-based definition of Metazoa and Fungi from the gene contents of extant species of both groups (see Methods). Then, we scored the ancestral nodes, which were not used to train the classifiers, according to how metazoan-like and fungal-like the relative compositions of functional categories of their inferred gene contents were (Extended Data Fig. 4d).

Not surprisingly, Fungi-related functional categories are more represented in Fungi (particularly in Basidiomycota and Ascomycota groups), but for most of the non-metazoan and non-fungal opisthokonts, the relative genomic representation of functional categories is more Fungi-like than Metazoa-like (Fig. 1d). As a result, Fungi does not separate from the protist opisthokont groups as distinctly as Metazoa (Extended Data Fig. 6b). These results are consistent with the fact that the machine learning classifiers differentiate the functional category compositions of Metazoa more strongly than those of Fungi (Extended Data Fig. 4d), as shown by the lower probabilities retrieved for the inner nodes of Fungi (43.7% for F3, root of Fungi) than those retrieved for Metazoa (81.7% for M4, root of Metazoa). Together, these results indicate that Metazoa experienced a broader differentiation at the gene function level than Fungi, with fungal gene contents being more similar to those of the protist opisthokonts, including the root of Opisthokonta (Fig. 1d and Extended Data Fig. 6c).

Gradual process, punctuated acceleration

Our ancestral reconstruction shows the genetic differences between Metazoa and Fungi (Fig. 2a) stemming from a divergence that started early after the split of Opisthokonta and continued up to the origin of the two groups (Fig. 2b,c). In the path to Metazoa, the changes that occurred in the three pre-metazoan ancestors (M1–M3) together contributed to shifting the composition of the lineage towards Metazoa-related functional categories to a similar extent as the changes that occurred in the metazoan root (3.7% versus 3.5%; Fig. 1d). Among the pre-metazoan ancestors, the changes in M2 and M3 contributed more than the changes in M1 despite both nodes showing fewer net gene gains (Fig. 1a). This is because gains in M1 were distributed across a wider set of functional categories, whereas gains in M2 occurred particularly in Metazoa-related functional categories, and the net losses in M3 were more prevalent in Fungi-related functional categories (Fig. 1a). Notwithstanding the contribution of the pre-metazoan ancestors, at the root of Metazoa (M4) there is also evidence for a substantial burst of net gains from a subset of functional categories (Fig. 1b), including transcription (K), signal transduction (T) and extracellular structures (W), which are particularly relevant for the animal multicellular genetic toolkit¹⁵. Although in the pre-genomic era the animal multicellular genetic toolkit was largely expected to be the outcome of metazoan-specific genetic innovations (that is, gene families that originated at the metazoan root), comparative genomics has revealed orthologues of many toolkit components in the unicellular relatives of animals^15,17,18,19. This finding highlighted the importance that the co-option of ancestral gene originations had for multicellularity, although those same studies, as well as more recent studies^19,20,21, also reported remarkable gene originations at the metazoan root. To quantify which contributed more to the pool of gene families involved in functions that are particularly important for multicellularity (K, T and W), pre-metazoan gene originations from Holozoa or those that occurred at the metazoan root, we traced the evolutionary trajectories of these three categories after the divergence of Opisthokonta.
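The comparison between the cumulative pre-metazoan shift and the shift at the metazoan root can be sketched as follows. This assumes that the shift of a node is the difference, in percentage points, between its share of Metazoa-related functional categories and that of its parent; the exact metric is described in the Methods and may differ. The node shares below are invented for illustration and are not the reconstructed values.

```python
def branch_shifts(path_shares):
    """Map each node to its change in share relative to its parent.

    path_shares: ordered list of (node, Metazoa-related share in %) pairs
    along a single root-to-tip path (a hypothetical input format).
    """
    return {
        child: share - parent_share
        for (_, parent_share), (child, share) in zip(path_shares, path_shares[1:])
    }


# Invented shares along the path O -> M1 -> M2 -> M3 -> M4.
toy_path = [("O", 40.0), ("M1", 41.0), ("M2", 42.5), ("M3", 43.5), ("M4", 47.0)]
shifts = branch_shifts(toy_path)

pre_root = sum(shifts[node] for node in ("M1", "M2", "M3"))
root = shifts["M4"]
print(pre_root, root)  # 3.5 3.5
```

In this toy path, three ancestors together shift the composition as much as the single root node does, which is the kind of pattern the text describes as gradual change followed by acceleration.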

Of the gene gains observed at the metazoan root for the K, T and W categories, 42.8% correspond to gene families that originated in this same ancestor (M4), whereas 21.2% of gains in M4 correspond to the expansion of gene families that originated in the pre-metazoan holozoan ancestors (Extended Data Fig. 6d). This difference (42.8% to 21.2%) is much greater than that observed for the other functional categories (19.2% to 15.9%), indicating that among the gene gains that occurred at M4, gene originations were particularly relevant for K, T and W. An inspection of the ancestral contribution to the gene content of H. sapiens (Extended Data Fig. 6e) illustrates the same trend: genes from families that originated in M4, a single ancestral node, contributed to the repertoire of genes involved in K, T and W in H. sapiens to a similar extent (mean of 13.9%) as genes from families that originated in the three pre-metazoan ancestral nodes (M1–M3) (mean of 12.5%). From this, we conclude that gene originations at M4 have been quantitatively more important (13.9% versus 12.5%) to functional categories related to animal multicellularity than the gene originations coming from any of the preceding holozoan ancestors. As a result, the metazoan root experienced a substantial increment in the relative genomic representation of K, T and W (+1.35%, +1.16% and +0.35%, respectively, from M3 to M4) (Extended Data Fig. 6f). Notwithstanding this, the tendency towards increasing the relative genomic representation of these functional categories was already ongoing in the pre-metazoan holozoan ancestors (+1.73%, +0.66% and +0.24%, respectively, from O to M3) and hence predated the origin of animals (Extended Data Fig. 6f).

Main genetic changes in Fungi

Similar to Metazoa, the genetic changes that occurred in the preceding ancestors of Fungi from Holomycota (F1 and F2) contributed more to shifting the gene content (1.8% together)—in this case, towards Fungi-related functional categories—than the root of the group (0.07%) (Figs. 1d and 2c). However, whereas the ancestral path to Metazoa from M1 to M3 accumulated net gains of Metazoa-related functional categories, F1 and F2 did not accumulate gains but rather losses of Metazoa-related functional categories, particularly signal transduction (Fig. 1a).

The two fungal nodes that present the largest compositional shift towards Fungi-related functional categories are, on the one hand, the stem node of Dikarya (Ascomycota + Basidiomycota) (+1.9%; Fig. 1d), which experienced genetic changes that could have predisposed the evolution of complex multicellularity in some members of this group (see Supplementary Information 4), and on the other hand, the last common ancestor of Zoopagomycota, Mucoromycotina and Dikarya (+1.5%), which experienced important morphological adaptations such as the ancestral loss of the flagellum that is characteristic of most fungal groups²². On average, and in contrast to animals, Fungi retained gene contents of a similar size to their ancestors and the protist opisthokonts (Extended Data Fig. 7). Still, some fungal nodes showed substantial net gains, particularly the fungal root (F3; Fig. 1b). Similar to the animal root in Holozoa, F3 was the node in Holomycota with the largest fraction of gene gains being explained by gene originations (Extended Data Fig. 8). Nevertheless, the changes seen at the fungal root made a low contribution to the compositional shift of Fungi (0.07%; Fig. 1d) because this node accumulated net gains of both Metazoa and Fungi-related functional categories (Fig. 1b).

The main characteristic of the genetic turnover that occurred in the path to extant Fungi was a specialization towards metabolism (Fig. 2d), whereas animal genomes specialized towards other functional categories (Fig. 2a). In agreement with this, the metazoan root experienced a net loss of metabolic genes (Extended Data Fig. 5d), despite this node presenting an overall net gain of gene content (Fig. 1b), whereas the fungal root experienced net metabolic gene gains (Extended Data Fig. 5c). (Note that an additional supplementary analysis with a dataset that includes transcriptomic data from the aphelid Paraphelidium tribonemae⁹, which is the closest known group to Fungi, suggests that half of the net gene gains originally detected at the fungal root, including the metabolic ones, could have also predated the origin of Fungi; see Supplementary Fig. 2).

The metabolic changes at the gene content level that we described for the root of Metazoa and Fungi did not become a tendency that continued during the diversification of both groups, as we detected a net accumulation of metabolic genes in Metazoa, but not in Fungi (Extended Data Fig. 5c,d). The larger representation of metabolism in fungal genomes is thus explained because the gene turnover that occurred during the diversification of Fungi benefitted the metabolic over the non-metabolic functions (Fig. 2d). By contrast, Metazoa accumulated more genes of every category, but gains were not particularly biased towards metabolic functions (Fig. 1c).

Differences in gene gain mechanisms

Metazoa and Fungi also differ in their preferences among the mechanisms that can be sources of gene gains. Although no significant differences between groups were found in the relative contribution of gene originations to gene gains, gene duplications were found to be significantly more prevalent specifically among metazoan gains (Fig. 3a,b), in accordance with previous studies that highlighted the importance of duplications in the origin and diversification of animals²¹. Besides originations and duplications, the gene tree–species tree reconciliation software²³ used in our ancestral reconstruction framework also estimates putative horizontal gene transfer events as sources of gene gains. Despite being originally described in Bacteria, horizontal gene transfer has been documented across a wide range of eukaryotes and is known to have led to significant functional changes^24,25,26,27. However, the relative contribution of transfers to gene gains in eukaryotes, and whether this contribution is homogeneous across the phylogeny, remain uncertain^28,29,30. In this regard, the fact that the reconciliation software recovered a significantly lower fraction of gene gains as being explained by transfers in Metazoa than in Fungi and in the other opisthokonts (Fig. 3c) is compatible with the historical consideration that transfers should contribute less to gene gains in animals due to germline isolation^3,4,5.

**Fig. 3: Taxonomic differences in the relative contribution of gene originations, gene duplications, horizontal gene transfers and gene fusions to gene gains.**

Our ancestral reconstruction pipeline also detects originations that occurred due to gene fusion events. Previous studies^17,18 have described multiple instances of genes in the animal multicellular toolkit that originated through gene fusions (here defined as the merging of partial or complete sequences from older genes). Our results indicate that fusions contributed significantly more to gene gains in Metazoa than in Fungi (Fig. 3d). This is explained not only by Metazoa having experienced more gene gains than Fungi (Extended Data Fig. 7), but also by the fraction of originations detected as fusions being greater in Metazoa (Extended Data Fig. 9). The lower prevalence of fusions in Fungi agrees with a previous study that reported a particularly low rate of fusions compared with fissions³¹. Because fusions seem to be particularly relevant sources of transcription and signal transduction genes (Extended Data Fig. 5e,f), this gene gain mechanism could have been more prevalent in Metazoa due to the excess of gains of these two categories (Fig. 1a,b), which are particularly relevant for multicellularity¹⁵.

Two divergent genomic trajectories

Together, the emerging picture from our ancestral reconstruction indicates that animals and fungi have been evolving under sharply contrasting trajectories of genomic changes that predated the origin of both groups (Fig. 4). Fungal gene contents remained relatively constant in size (Extended Data Fig. 7) and specialized into metabolism (Fig. 2d). By contrast, animals accumulated net gains of most functional categories, although the unequal distribution of gene gains across categories led some categories to increase their relative genomic representation over the others, particularly those that are important for multicellularity (Extended Data Fig. 6f). Although both groups experienced substantial gains and losses during their divergence (Extended Data Fig. 10), the lineage leading to extant Metazoa experienced a larger compositional change in gene function (Fig. 2b,c). As a result, metazoan gene contents are more diverged than the fungal gene contents from those of the other opisthokonts at both the broad-scale functional level and the gene family content level (Extended Data Fig. 6c,g). Given that the latter result is independent of gene function annotation, Metazoa being more differentiated than Fungi from the rest of opisthokonts from a gene content perspective is robust to potential inequalities that may exist between groups at the level of biological knowledge or in the availability of functional information. This indeed agrees with the fact that there are more evident morphological discontinuities between protists and animals than between protists and some groups of Fungi⁸. Neither the hypha nor the cell wall characteristic of Fungi, which is also present in some of their protist relatives, are fungal synapomorphies⁸. Only the abandonment of phagotrophy for an osmotrophic lifestyle seems to be a common although not exclusive feature of Fungi³². 
Although animals are distinguished from protists by the fact that all of them are multicellular, in Fungi, complex multicellularity is probably the outcome of convergent evolution, as it is found only in some particular groups, which present important differences in the gene contents involved in it³³ (see Supplementary Information 4 for further information on the evolution of multicellularity in Opisthokonta and particularly in Fungi).

**Fig. 4: The large genetic differences between modern animals and fungi are the outcome of two contrasting trajectories of genetic changes that preceded the origin of both groups.**

From a genomic perspective, the origin of Metazoa and Fungi is better described as a gradual rather than an episodic process given the contribution of their preceding ancestors (M1–M3 and F1–F2) to the cumulative changes at the level of gene function that occurred in the lineages leading to the extant representatives of both groups (Fig. 2b,c). Notwithstanding this, substantial quantitative changes in gene content also occurred concomitantly with the origin of the two groups (Fig. 1b). In particular, the genetic changes at the metazoan root represent an acceleration of a trend that was already ongoing in the pre-metazoan ancestors to accumulate genes of functional categories that are important for animal multicellularity (Extended Data Fig. 6f). These same categories underwent losses in the pre-fungal ancestors (Fig. 1a), situating the immediate ancestors of Fungi and Metazoa in substantially different latent potentials from a genomic perspective. This is especially relevant for the case of animals. Had not animal ancestors experienced a continuous and long-standing evolutionary trajectory that had a compounding effect on the genomic potential for multicellularity, metazoans could not have arisen. The origin of animals may be seen as a drastic evolutionary event, but our taxon-rich analysis shows how the potential for that to happen was generated gradually on a genomic level. Our results illustrate the importance of analysing evolutionary transitions in the light of their evolutionary prehistory.

Methods

Methodological pipeline for genomic data acquisition

We sequenced a series of culture lines, each including one of the four species of interest (M. vibrans, P. atlantis, P. vietnamica and P. chileana). The cultures of M. vibrans and P. atlantis (formerly Nuclearia sp.) were purchased from ATCC (M. vibrans Tong. ATCC 50519 and Nuclearia sp. ATCC 50694, respectively). The cultures of P. vietnamica (formerly Opistho-1) and P. chileana (formerly Opistho-2) descend from the environmental isolates (P. vietnamica from a freshwater lake, Vietnam; and P. chileana from a temporary freshwater body, Chile) used in ref. ¹². As expected, the starting cultures included an uncertain fraction of contaminant species. In particular, the cultures of M. vibrans and P. atlantis included an uncertain diversity of bacterial contamination, whereas the cultures of each Pigoraptor species also included contamination from the eukaryote Parabodo caudatus. The sequenced metagenomic data were submitted to a bioinformatic decontamination pipeline that consisted of two to three rounds of detection and removal of contaminant fragments based on taxonomic and tetranucleotide composition information. All steps were thoroughly supervised to maximize the retention of bona fide genomic fragments from our species of interest and the removal of contaminant sequences. Decontaminated genomes were annotated by combining the RNA sequencing-based BRAKER1 v1.9 (ref. ³⁴) and PASA v2.0.2 (ref. ³⁵) automatic annotation pipelines, the results of which were processed to correct erroneous gene predictions that might lead to the inference of false gene fusions. See Supplementary Information 1 for a detailed explanation of the nature of the sequenced data and of the decontamination and genome annotation processes (see Fig. 1 in Supplementary Information 1 for a summary illustration).

Clustering sequences into orthogroups

A dataset of 1,463,920 protein sequences from 83 eukaryotic species, 59 from Opisthokonta (including the four genomes produced) and 24 from other eukaryotic groups, was constructed (draft_euk_db; see Supplementary Table 4). Protein sequences were aligned all-against-all using BLASTp³⁶ v2.5 [-seg yes, -soft_masking true, -evalue 1e-3]. On the basis of the alignments, proteins were clustered into orthogroups (OGs) with OrthoFinder³⁷ v2.7 [-I 2]. We treat OGs as proxies of gene families. The OGs produced by OrthoFinder were processed with the MAPBOS pipeline to fix protein domain heterogeneity problems that would compromise downstream analyses (see Supplementary Information 2 for a discussion of this issue, and for an explanation of the algorithm that we developed to correct it).

Species tree reconstruction

Ancestral gene contents were inferred by means of a gene tree–species tree reconciliation software. We thus needed to reconstruct a phylogenetic tree for every gene family and a species tree of the whole eukaryotic supergroup Opisthokonta. The results from the species tree reconstruction analyses are available in Supplementary Information 3. We first selected 342 OGs present in >77% of draft_euk_db taxa and with no more than an average of 1.16 copies per taxon. We measured alignment instability of the 342 OGs using COS.pl and msa_set_score v2.02, which are based on the Heads-or-Tails approach^38,39, keeping only those OGs with >0.70 mean column score (MCs). We manually curated the 69 OGs that passed this filter by performing individual phylogenies for each one, using MAFFT⁴⁰ v7.123b [-einsi] for sequence alignment, trimAl⁴¹ v1.4.rev15 [-gappyout] for alignment trimming and IQ-TREE⁴² v1.6.7 for maximum-likelihood (ML) phylogenetic inference, using ModelFinder⁴³ for model selection. Three of these 69 OGs were discarded because the topology was strongly in disagreement with the expected species topology. For the remaining 66 OGs (hereafter referred to as the MCs70 dataset), we removed sequences whose branching pattern suggested that they were most likely misclassified as OG members. In addition, to keep only one sequence per taxon in every OG, for inparalogue cases, we kept the least divergent sequence according to branch length. We removed a total of 630 sequences from the MCs70 dataset, including likely misclassified OG members but also contaminant sequences. Most contamination cases found correspond to contamination from Stramenopiles in the proteome of Syssomonas multiformis, probably from Spumella sp.¹². However, we also detected Pirum gemmata contamination in the proteome of Abeoforma whisleri, and a few sequences from Ichthyophonus hoferi in Sphaerothecum destruens, indicating cross-contamination problems between these ichthyosporean datasets. Still, these cases of contamination affected neither the phylogenetic inference, as they were removed during the screening, nor the downstream analyses, as these species were only used for species tree reconstruction purposes.
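The first marker-selection step (occupancy above 77% and a mean of at most 1.16 copies per taxon) can be sketched as a simple filter. This is an illustrative reading, not the authors' code: it assumes that the mean copy number is computed over the taxa in which the OG is present, and the function name and input format are hypothetical.

```python
def passes_marker_filter(copies_per_taxon, n_taxa_total,
                         min_occupancy=0.77, max_mean_copies=1.16):
    """Decide whether an orthogroup is a candidate phylogenetic marker.

    copies_per_taxon: dict mapping taxon name -> number of sequences of
    that taxon in the orthogroup (a hypothetical input format).
    """
    present = [n for n in copies_per_taxon.values() if n > 0]
    if not present:
        return False
    occupancy = len(present) / n_taxa_total
    # Assumption: the mean is taken over taxa that have at least one copy.
    mean_copies = sum(present) / len(present)
    return occupancy > min_occupancy and mean_copies <= max_mean_copies


# Toy orthogroups over 10 taxa (not data from the study).
single_copy = {f"sp{i}": 1 for i in range(9)}      # 9/10 taxa, 1.00 copies
sparse = {f"sp{i}": 1 for i in range(7)}           # 7/10 taxa
multi_copy = {**single_copy, "sp0": 4}             # 9/10 taxa, 12/9 copies

print(passes_marker_filter(single_copy, 10))  # True
print(passes_marker_filter(sparse, 10))       # False
print(passes_marker_filter(multi_copy, 10))   # False
```

The alignment-stability filter and the manual curation described in the text would then be applied to the orthogroups that pass this step.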

We created two distinct versions of the MCs70 dataset: the first dataset including all sequences from Holozoa (ingroup) and from three Holomycota taxa (outgroup) (Holozoa MCs70), and the second dataset including all sequences from Holomycota (ingroup) and from three Holozoa taxa (outgroup) (Holomycota MCs70). An alignment supermatrix was created for each dataset, first aligning and trimming each OG separately [MAFFT -einsi, trimAl -gappyout], and later concatenating the alignments into a supermatrix (Holozoa MCs70: 37 taxa, 17,475 sites and 9.27% of missing data; Holomycota MCs70: 28 taxa, 17,409 sites and 7.81% of missing data). We constructed a phylogenetic tree for both MCs70 datasets using ML and Bayesian inference. ML inferences were done with IQ-TREE, and the models chosen for the Holozoa and Holomycota MCs70 datasets were LG+C50+F+R7 and LG+C30+F+R6, respectively. Despite ModelFinder suggesting the usage of C60 (ref. ⁴⁴) for Holomycota MCs70, we used mixture models with fewer profiles to avoid potential model overfitting, as some optimized mixture weights were estimated close to zero. Nodal supports for the ML trees consisted of 1,000 IQ-TREE ultrafast bootstrap replicates (UFBoot) and 100 standard non-parametric bootstrap replicates. Non-parametric bootstraps were computed under the PMSF model⁴⁵. We used the previously inferred ML trees as guide trees to infer mixture model parameters and site-specific frequency profiles, as implemented in IQ-TREE v1.6.7. Bayesian phylogenies were done under the CAT+GTR+Gamma(4) model in PhyloBayes-MPI⁴⁶ v1.8. Two chains were run for the Holozoa MCs70 and Holomycota MCs70 supermatrices, and convergence was assessed using the bpcomp and tracecomp programs in the PhyloBayes-MPI package. Consensus trees were built when the maximum between-chain discrepancy in bipartition frequencies fell below 0.1 (burn-in 33%). We also performed three additional analyses (increasing number of positions in the supermatrix, compositional recoding and fastest-evolving sites removal) to test the robustness of the topological relationships found (see Supplementary Information 3).
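The concatenation step and the "missing data" percentage reported for each supermatrix can be illustrated with a minimal sketch. It assumes that a taxon absent from an OG is padded with gap characters and that missing data is the fraction of gap or unknown characters in the whole matrix; how the reported percentages were actually computed is not stated here, and the function names and toy alignments are hypothetical.

```python
def concatenate(alignments, taxa, gap="-"):
    """Concatenate per-OG alignments into one supermatrix.

    alignments: list of dicts mapping taxon -> aligned sequence; all
    sequences within one dict must have the same length.
    taxa: ordered list of every taxon that should appear in the supermatrix.
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in an alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Taxa without a sequence in this OG are padded with gaps.
            parts[taxon].append(alignment.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def missing_fraction(supermatrix, missing_chars="-?"):
    """Fraction of supermatrix cells that are gaps or unknown characters."""
    total = sum(len(seq) for seq in supermatrix.values())
    missing = sum(seq.count(c) for seq in supermatrix.values() for c in missing_chars)
    return missing / total


# Toy alignments (not data from the study); taxon C lacks the second OG.
og1 = {"A": "MK-L", "B": "MKAL", "C": "MRAL"}
og2 = {"A": "GG", "B": "GA"}
supermatrix = concatenate([og1, og2], taxa=["A", "B", "C"])
print(supermatrix["C"])               # MRAL--
print(missing_fraction(supermatrix))  # 3 of 18 cells are missing
```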

Incorporation of prokaryotic homologues into the OGs

We incorporated prokaryotic homologues into the clusters before the MAPBOS processing step. For the incorporation of prokaryotic (and viral) homologues into the clusters, we first used DIAMOND⁴⁷ v0.8.22.84 [--more-sensitive, -e 1e-05] to align all eukaryotic sequences from euk_db (a subset of draft_euk_db, which includes the species labelled in bold in Supplementary Table 4) to a database including 8,231,104 bacterial, 331,476 archaeal and 20,955 viral sequences from UniProt reference proteomes (release 2016_02; prok_db) (forward alignment approach). The aligned sequences from prok_db were aligned back against euk_db sequences (reverse alignment approach). Hits with query and target alignment coverages lower than 75% were discarded, as well as hits in which the best-scoring euk_db target of a given prok_db query was a member of a different cluster from the best-scoring euk_db query for that prok_db sequence in the forward alignment. After discarding the hits not satisfying these conditions, we incorporated into the clusters only the best-scoring prok_db query of each euk_db target sequence (that is, if a cluster has 300 sequences and the best-scoring query of all of them was the same prok_db sequence, only that sequence will be incorporated into the cluster, which will then have 300 euk_db sequences and 1 prok_db sequence). Prok_db sequences were incorporated into OrthoFinder -I 2 clusters before these were processed by the MAPBOS pipeline (Supplementary Information 3). After MAPBOS, clusters included 1,117,614 eukaryotic sequences and 58,017 non-eukaryotic sequences (53,168, 4,301 and 548 from Bacteria, Archaea and viruses, respectively). All these 1,175,631 sequences were distributed among 413,445 clusters, 370,686 of which are singletons. Among eukaryotic sequences, on a taxonomic level, clusters included sequences mostly from Opisthokonta (50 species), but also from 18 representatives of other major eukaryotic groups (euk_db dataset).
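The forward and reverse filtering rules described above can be sketched as follows. This is one possible reading of the text rather than the authors' implementation: hits are represented as hypothetical tuples, the cluster-consistency rule is applied per prok_db sequence, and all identifiers and scores are invented.

```python
from collections import defaultdict

MIN_COVERAGE = 0.75  # minimum query and target alignment coverage


def best_by(hits, key_idx, value_idx):
    """For each key, return the value carried by its highest-scoring hit."""
    best = {}
    for hit in hits:
        key, score = hit[key_idx], hit[2]
        if key not in best or score > best[key][0]:
            best[key] = (score, hit[value_idx])
    return {key: value for key, (_, value) in best.items()}


def prokaryotic_additions(forward, reverse, cluster_of):
    """Return {cluster: set of prok_db sequences to add}.

    forward: (euk_id, prok_id, score, query_cov, target_cov) tuples.
    reverse: (prok_id, euk_id, score, query_cov, target_cov) tuples.
    cluster_of: dict mapping euk_id -> cluster identifier.
    """
    fwd = [h for h in forward if h[3] >= MIN_COVERAGE and h[4] >= MIN_COVERAGE]
    rev = [h for h in reverse if h[3] >= MIN_COVERAGE and h[4] >= MIN_COVERAGE]
    best_fwd_euk = best_by(fwd, key_idx=1, value_idx=0)  # prok -> best euk query
    best_rev_euk = best_by(rev, key_idx=0, value_idx=1)  # prok -> best euk target
    # Keep prok sequences whose best forward and reverse partners share a cluster.
    consistent = {
        prok for prok, euk in best_rev_euk.items()
        if prok in best_fwd_euk
        and cluster_of[best_fwd_euk[prok]] == cluster_of[euk]
    }
    rev_kept = [h for h in rev if h[0] in consistent]
    best_prok_for_euk = best_by(rev_kept, key_idx=1, value_idx=0)
    additions = defaultdict(set)
    for euk, prok in best_prok_for_euk.items():
        additions[cluster_of[euk]].add(prok)  # a set stores each prok once
    return dict(additions)


# Toy hits (invented identifiers and scores).
cluster_of = {"e1": "c1", "e2": "c1", "e3": "c2"}
forward = [("e1", "p1", 200, 0.9, 0.9), ("e2", "p1", 180, 0.9, 0.8),
           ("e3", "p2", 150, 0.9, 0.9), ("e3", "p3", 90, 0.5, 0.9)]
reverse = [("p1", "e1", 210, 0.9, 0.9), ("p1", "e2", 190, 0.85, 0.9),
           ("p2", "e1", 160, 0.9, 0.9), ("p3", "e3", 95, 0.9, 0.9)]
additions = prokaryotic_additions(forward, reverse, cluster_of)
print(additions)  # {'c1': {'p1'}}
```

In the toy data, p1 is the best-scoring query of both e1 and e2 but is added to cluster c1 only once, mirroring the 300-sequence example in the text; p2 is rejected for cluster inconsistency and p3 for low forward coverage.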

Gene tree inference and gene tree–species tree reconciliation analyses

We submitted every post-MAPBOS OG (or cluster) to a gene tree inference pipeline consisting of MAFFT-linsi for the alignment step, trimAl [-gappyout] for alignment trimming and IQ-TREE for the phylogenetic inference. In particular, IQ-TREE was run using the LG+G4 model and sampling 1,000 optimized [-bnni] UFBoot replicates for every gene tree.
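
As an illustration, the three steps could be chained per OG roughly as follows. This is a sketch under stated assumptions, not the pipeline used in the study: the file naming, the thread option and the helper name gene_tree_commands are hypothetical; the IQ-TREE options follow the 1.x command-line syntax (the version is not stated here); and -wbtl, which writes the bootstrap trees to a .ufboot file, is our assumption about how the replicate trees could be retained.

```python
def gene_tree_commands(fasta, threads=1):
    """Build the alignment, trimming and tree-inference commands for one OG."""
    stem = fasta.rsplit(".", 1)[0]
    aln = stem + ".aln"
    trimmed = stem + ".trim.aln"
    return [
        # MAFFT L-INS-i writes to stdout; the caller redirects it to `aln`.
        ["mafft-linsi", "--thread", str(threads), fasta],
        # Gap-based automated trimming.
        ["trimal", "-in", aln, "-out", trimmed, "-gappyout"],
        # LG+G4 model with 1,000 UFBoot replicates optimized by NNI (-bnni).
        # Assumption: -wbtl keeps the replicate trees with branch lengths.
        ["iqtree", "-s", trimmed, "-m", "LG+G4", "-bb", "1000", "-bnni",
         "-wbtl", "-nt", str(threads)],
    ]


commands = gene_tree_commands("OG0000001.fasta", threads=4)
```

Each list can be passed to subprocess.run, with the standard output of the first command redirected to the alignment file.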

For the gene tree–species tree reconciliation analyses, we used ALEml_undated from ALE v0.4 (https://github.com/ssolo/ALE). ALEml_undated requires a distribution of phylogenetic trees for every gene family (the UFBoot replicates in our case) and a species tree. The Opisthokonta fraction of the species tree consisted of the most favoured topology according to our analyses, which only included Opisthokonta taxa (Fig. 1 in Supplementary Information 3). The phylogenetic relationships between the non-Opisthokonta taxa were directly determined from a consensus of currently available bibliographical references^{48,49,50,51,52,53,54,55,56} (all euk_db species were included in the reconciliation analyses). Reconciliation analyses also incorporated non-eukaryotic sequences (see above), which, for practical reasons, were assigned to the same terminal node in the species tree (named ‘Prokaryotes’ in Fig. 7 in Supplementary Information 3). Eukaryotes with only transcriptomic or poor-quality genomic data were excluded from the reconciliation analyses (those labelled in grey in Fig. 1 in Supplementary Information 3). Note that the inclusion of transcriptomic data would have been particularly problematic for our study for the following reasons. (1) Gene content predictions from transcriptomic data tend to present inflated gene counts. For example, the proteomes that were previously produced based solely on transcriptomic data for P. atlantis² and for P. vietnamica and P. chileana¹² include many more sequences (29,620, 46,018 and 37,783) than the proteomes that we predicted from the genome sequences of these species (9,028, 14,822 and 14,510), with the genome-based proteomes showing even better completeness metrics (Fig. 23 in Supplementary Information 1). Inflated gene counts are expected to produce an excess of duplication inferences in the reconciliations. (2) Unexpressed genes may be mistaken for gene losses. (3) Transcriptomes are harder to decontaminate owing to the lack of genomic context information regarding neighbouring genes, intron sequences or compositional features of the coding sequence. (4) Sequences predicted from partial isoforms are expected to cause inaccuracies in the software used to detect gene fusions (see below). (5) Accurate gene contents were also important given that the reconciliation software used (see above) infers the values for parameters such as gene duplication and loss rates from the data.

Inference of gene fusion events

We used CompositeSearch⁵⁷ to identify composite gene families, that is, families of genes whose protein sequence is composed of parts (for example, protein domains) that are separately found in other, component, gene families. CompositeSearch requires as input all-against-all sequence alignments, for which we used the same BLASTp results used for OrthoFinder (see above), although alignment hits corresponding to draft_euk_db species not represented in euk_db were removed. Before being used as input for CompositeSearch, BLASTp results were preprocessed with cleanBlastp (included in CompositeSearch) to retain only the hit with the highest score among all hits involving the same query–target pair. CompositeSearch was run with the default parameters and forcing the software [-f] to work on the clusters resulting from the processing of the OGs from OrthoFinder by the MAPBOS pipeline. Families with only one sequence were discarded as potential components [-y]. Prok_db sequences were not included in composite inferences because alignments between prok_db and euk_db sequences were done with DIAMOND instead of BLASTp owing to computational time limitations. Because we work at the gene family level (clusters), we only considered as composites those clusters in which >50% of members were detected as composite sequences. This includes 48,066 clusters, 3,229 of which are not singletons.
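
The >50% rule used to call a cluster composite can be expressed in a few lines of Python. This is only an illustrative sketch; the function name and the input structures are hypothetical and do not correspond to CompositeSearch output formats.

```python
def composite_clusters(cluster_members, composite_sequences):
    """Return clusters in which more than half of the members are composites."""
    selected = set()
    for cluster, members in cluster_members.items():
        n_composite = sum(1 for seq in members if seq in composite_sequences)
        # Strict majority: exactly 50% is not enough.
        if n_composite > 0.5 * len(members):
            selected.add(cluster)
    return selected


# Hypothetical toy data, for illustration only.
clusters = {"c1": ["a", "b", "c"], "c2": ["d", "e"], "c3": ["f"]}
composites = {"a", "b", "d", "f"}
selected = composite_clusters(clusters, composites)
```

Here c1 (two of three members) and the singleton c3 qualify, whereas c2 (one of two members) does not.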

CompositeSearch detects as a composite any sequence that matches distinct subsets of sequences (components, from other OGs) in different regions of its sequence. Whereas fusion events may lead to composite sequences, not all sequences detected as composites necessarily originated from a gene fusion process. For example, a sequence found to be composite by the software could have originated de novo in a given ancestral lineage (gene X, with domains A and B), and then, in a descendant lineage, that gene could have been split into two separate genes (gene Y with domain A and gene Z with domain B). In such a case of gene fission, the software would detect gene X as a composite because one part of its sequence would align to gene Y (first component) and the other to gene Z (second component). To retain only bona fide fusion composite sequences, we only considered those composite sequences in which all components were inferred to have a more ancestral origin than the composite. This was done to minimize the false-positive inferences of fusions, at the expense of losing potential fusion events in which, for example, both the composite and the components may have originated in the same node of the phylogeny.
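
The ancestry criterion can be sketched as follows. This is one possible reading, not the implementation used in the study: we assume that 'more ancestral' means that the node of origin of every component is a strict ancestor of the node of origin of the composite in the species tree. The parent map and function names are hypothetical; the node labels follow the abbreviations used in Extended Data Fig. 1.

```python
def strict_ancestors(node, parent_of):
    """Yield every node on the path from `node` to the root, excluding `node`."""
    while node in parent_of:
        node = parent_of[node]
        yield node


def is_bona_fide_fusion(composite_origin, component_origins, parent_of):
    """True if all components originated in strict ancestors of the composite's origin."""
    older = set(strict_ancestors(composite_origin, parent_of))
    return bool(component_origins) and all(o in older for o in component_origins)


# Simplified backbone: O -> M1 -> M2 -> M3 -> M4 (child: parent).
parent_of = {"M1": "O", "M2": "M1", "M3": "M2", "M4": "M3"}

fusion = is_bona_fide_fusion("M4", ["M2", "O"], parent_of)      # components are older
fission = is_bona_fide_fusion("M1", ["M3", "M3"], parent_of)    # components are younger
same_node = is_bona_fide_fusion("M4", ["M4", "O"], parent_of)   # one component is not older
```

The second call corresponds to the fission scenario described above, and the third to the conservative exclusion of composites that share their node of origin with a component.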

Functional annotation of sequences and OGs

Protein domain architectures of euk_db sequences and of the prok_db sequences captured (see above) were determined with PfamScan⁵⁸ using Pfam A v29. Clusters of Orthologous Groups functional categories (hereafter, functional categories) and KEGG Orthology Groups (KOs)⁵⁹ were annotated to euk_db sequences with eggNOG-mapper⁶⁰ v1.0.3-3-g3e22728, using DIAMOND for the alignments of euk_db sequences against the eggNOG database (the functional category ‘S: unknown function’ was ignored as it does not include functional information). Once sequences were annotated, the functional categories and KO annotations of every cluster were determined by averaging the annotations of the corresponding cluster members. For example, if a cluster includes two sequences (SeqA and SeqB), and SeqA was annotated with the functional category K and SeqB with the functional categories B and K, that cluster would be annotated as 0.75K and 0.25B (0.5K from SeqA + 0.25K from SeqB, and 0.25B from SeqB).
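
The averaging can be reproduced with a short Python sketch. The function name is hypothetical, and skipping members without any annotation is our assumption, as the text does not state how such members were handled.

```python
from collections import defaultdict


def annotate_cluster(member_annotations):
    """Average per-sequence annotations into fractional cluster annotations."""
    # Assumption: members without any annotation are ignored.
    annotated = [cats for cats in member_annotations.values() if cats]
    totals = defaultdict(float)
    for cats in annotated:
        for cat in cats:
            # Each annotated sequence has weight 1/len(annotated),
            # split evenly among its own categories.
            totals[cat] += 1.0 / (len(annotated) * len(cats))
    return dict(totals)


cluster = annotate_cluster({"SeqA": ["K"], "SeqB": ["B", "K"]})
```

For the SeqA and SeqB example given in the text, this yields 0.75 for K and 0.25 for B.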

Inference of gains, losses and counts of functional categories and metabolic gene contents

From the reconciliation analyses (see ‘Gene tree inference and gene tree–species tree reconciliation analyses’), we retrieved the number of gains, losses and gene contents of every OG in every node in the phylogeny. For every given node, we determined the absolute representation of all functional categories by combining the number of copies of every OG in the node with the relative representation of every functional category among the functional annotations of the OGs. The same was done to determine the KO contents of every node. The percentage of metabolic genes of every node was determined by dividing the number of KOs with metabolic annotations by the total number of genes in the node (besides KOs belonging to the ‘metabolic category’, those belonging to the category ‘membrane transport’ were also considered as metabolic genes). The relative representation of every functional category in every node was determined by dividing the absolute value of every category in the node by the sum of the absolute values of all functional categories in the node. Gains and losses of functional categories and KOs were determined by comparing the contents of every node with those of its immediately preceding node.
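
These calculations can be illustrated with a minimal Python sketch. The function names and toy values are hypothetical, and expressing gains and losses as a signed difference between a node and its preceding node is one simple reading of the comparison described above.

```python
def category_content(og_copies, og_categories):
    """Absolute representation of each functional category in one node."""
    absolute = {}
    for og, copies in og_copies.items():
        for cat, fraction in og_categories.get(og, {}).items():
            # Copy number of the OG weighted by the fractional annotation of the OG.
            absolute[cat] = absolute.get(cat, 0.0) + copies * fraction
    return absolute


def relative_content(absolute):
    """Divide each category by the sum over all categories in the node."""
    total = sum(absolute.values())
    return {cat: value / total for cat, value in absolute.items()}


def net_change(node_content, parent_content):
    """Positive values are net gains relative to the preceding node; negative values are net losses."""
    cats = set(node_content) | set(parent_content)
    return {c: node_content.get(c, 0.0) - parent_content.get(c, 0.0) for c in cats}


# Hypothetical toy data, for illustration only.
og_categories = {"OG1": {"K": 0.75, "B": 0.25}, "OG2": {"T": 1.0}}
parent = category_content({"OG1": 2.0, "OG2": 1.0}, og_categories)
child = category_content({"OG1": 4.0, "OG2": 0.0}, og_categories)
child_relative = relative_content(child)
change = net_change(child, parent)
```

In this toy case the child node gains 1.5 units of K and 0.5 of B and loses 1.0 of T relative to its preceding node.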

Statistical analyses

Statistical analyses were carried out either in Python, mainly with the libraries Pandas⁶¹ and NumPy⁶², or in R. All descriptive statistics plots (with the exception of those including phylogenetic trees, which were constructed with iTOL⁶³) were done in R, particularly with the ggplot2 package⁶⁴. Mann–Whitney U-tests (one-tailed) were done in Python with SciPy⁶⁵ (scipy.stats.mannwhitneyu). More specific statistical analyses are detailed below.
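
A one-tailed Mann–Whitney U-test with SciPy can be run as in the following sketch. The two samples are hypothetical values chosen for illustration, not data from this study, and the direction of the alternative hypothesis is an assumption that depends on each comparison.

```python
from scipy.stats import mannwhitneyu

# Hypothetical values, for illustration only (not data from the study).
group_a = [0.31, 0.35, 0.38, 0.40, 0.42]
group_b = [0.18, 0.21, 0.22, 0.25, 0.27]

# alternative="greater" tests whether values in group_a tend to exceed those in group_b.
statistic, p_value = mannwhitneyu(group_a, group_b, alternative="greater")
```

Setting alternative="less" would test the opposite direction.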

Correspondence analyses of relative functional category compositions

The relative genomic representations of functional categories are examples of compositional data (CoDa)⁶⁶, in which every column (a functional category) is represented by a relative fraction and the sum of all values is the same for every row (genome). Because non-orthogonality and collinearity are properties of CoDa, the most commonly used multivariate analysis techniques, such as principal component analysis, are inappropriate for CoDa and alternatives such as correspondence analysis are recommended instead⁶⁶. Correspondence analyses were done in R⁶⁷ with the FactoMineR package⁶⁸ and the plots were constructed with the factoextra package⁶⁹.
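
For readers unfamiliar with the method, the core of a correspondence analysis can be sketched with NumPy as a singular value decomposition of the standardized residuals of the table. This is a textbook illustration in Python rather than the FactoMineR implementation used here; the function name and the toy compositions are hypothetical.

```python
import numpy as np


def correspondence_analysis(table):
    """Return row principal coordinates and principal inertias of a table."""
    table = np.asarray(table, dtype=float)
    P = table / table.sum()           # correspondence matrix
    r = P.sum(axis=1)                 # row masses
    c = P.sum(axis=0)                 # column masses (must be non-zero)
    expected = np.outer(r, c)         # independence model
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    row_coords = (U * sing) / np.sqrt(r)[:, None]
    return row_coords, sing ** 2


# Hypothetical relative compositions (rows: genomes; columns: categories).
compositions = [
    [0.5, 0.3, 0.2],
    [0.4, 0.4, 0.2],
    [0.2, 0.3, 0.5],
]
coords, inertias = correspondence_analysis(compositions)
```

The principal inertias sum to the total inertia of the table (the chi-squared statistic divided by the grand total), and the leading columns of coords are the components that can be used as features downstream.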

Machine learning classifiers

For the classifiers of metazoan and fungal functional category compositions, we benchmarked five widely used learning models: logistic regression, k-nearest neighbours classifier, support vector classifier, Random Forest and artificial neural network, fine-tuning in every case the model hyperparameters using fivefold cross-validation. In total, we generated two classifiers for every learning model: one trained to distinguish between the functional category compositions of metazoan versus the other terminal nodes in Opisthokonta; and another doing the same but for Fungi instead of Metazoa. Relative functional category compositions were not used as features to train the models because they are correlated with one another. Instead, the models were trained with the components retrieved from the correspondence analyses on the relative functional category compositions of opisthokont terminal nodes (relative compositions were computed excluding the S ‘unknown function’ category and doing first a column-wise and then a row-wise normalization before the correspondence analysis was performed). Once models were trained, we computed the probability of belonging to the given class (Metazoa or Fungi, depending on the model) for every opisthokont node, including both terminal (used for model training) and internal (not used for model training) nodes (see values in Supplementary Table 5). The probabilities represented in Extended Data Fig. 4d correspond to a weighted average over the probabilities retrieved from every classifier (excluding logistic regression, which disagreed with and showed worse predictions than the other classifiers). The weights were determined in the following manner: for every node, the average probability was computed, and then we computed the variance of the four models with respect to that average. The weight of every model corresponds to the inverse of its relative variance, that is, the variance of that model divided by the sum of the variances of the four models. The code is available at https://doi.org/10.6084/m9.figshare.13140191.v1 (‘fungiMetazoa_predModels’ in Code.300322.zip). We expect the predictors to capture the genomic compositional features well, as, for example, in the case of Metazoa, Trichoplax adhaerens, the animal with the lowest degree of phenotypic complexity among the sampled species, is the node with the lowest probability (Extended Data Fig. 4d). All of these analyses were carried out in Python using packages from the scikit-learn⁷⁰, TensorFlow⁷¹ and Keras⁷² libraries.
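
The weighting scheme can be illustrated with the following NumPy sketch. It is not the deposited code: we assume that the variance of each model is its mean squared deviation from the per-node average taken over all nodes, and that the weights are rescaled to sum to 1 before averaging. The function name and the probabilities are hypothetical.

```python
import numpy as np


def consensus_probability(probs):
    """probs: array of shape (n_nodes, n_models) with class probabilities."""
    probs = np.asarray(probs, dtype=float)
    node_mean = probs.mean(axis=1, keepdims=True)
    # Variance of each model around the per-node average (one possible reading).
    variance = ((probs - node_mean) ** 2).mean(axis=0)
    relative_variance = variance / variance.sum()
    weights = 1.0 / relative_variance       # undefined if a model has zero variance
    weights = weights / weights.sum()       # assumption: weights rescaled to sum to 1
    return probs @ weights, weights


# Hypothetical probabilities for three nodes (rows) and four classifiers (columns).
probs = [
    [0.90, 0.80, 0.85, 0.60],
    [0.20, 0.10, 0.15, 0.40],
    [0.50, 0.55, 0.50, 0.30],
]
consensus, weights = consensus_probability(probs)
```

Under this scheme, a classifier that deviates most from the others (the fourth column in the toy data) contributes least to the consensus probability.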

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

The raw sequence data and assembled genomes generated in this study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB52884 (https://www.ebi.ac.uk/ena/browser/view/PRJEB52884). The genome assemblies are also available in figshare (https://doi.org/10.6084/m9.figshare.19895962.v1). Protein sequences of the species used in this study were downloaded from the GenBank public databases (https://www.ncbi.nlm.nih.gov/protein/), UniProt (https://www.uniprot.org/), JGI genome database (https://genome.jgi.doe.gov/portal/) and Ensembl genomes (https://www.ensembl.org). The following specific databases were also used in this study: Pfam A v29 (https://pfam.xfam.org/), EggNOG emapperdb-4.5.1 (http://eggnog5.embl.de) and UniProt reference proteomes release 2016_02 (https://www.uniprot.org/). The supporting data files of this study are available in the following repository: https://doi.org/10.6084/m9.figshare.13140191.v1.

Code availability

The most relevant custom code developed for this study (the MAPBOS pipeline and the machine learning classifiers) is available at https://doi.org/10.5281/zenodo.6586559.

References

Adl, S. M. et al. Revisions to the classification, nomenclature, and diversity of eukaryotes. J. Eukaryot. Microbiol. 66, 4–119 (2019).
Torruella, G. et al. Phylogenomics reveals convergent evolution of lifestyles in close relatives of animals and fungi. Curr. Biol. 25, 2404–2410 (2015).
Andersson, J. O. Lateral gene transfer in eukaryotes. Cell. Mol. Life Sci. 62, 1182–1197 (2005).
Keeling, P. J. & Palmer, J. D. Horizontal gene transfer in eukaryotic evolution. Nat. Rev. Genet. 9, 605–618 (2008).
Doolittle, W. F. Phylogenetic classification and the universal tree. Science 284, 2124–2128 (1999).
Wainright, P., Hinkle, G., Sogin, M. L. & Stickel, S. K. Monophyletic origins of the metazoa: an evolutionary link with fungi. Science 260, 340–342 (1993).
Del Campo, J. et al. The others: our biased perspective of eukaryotic genomes. Trends Ecol. Evol. 29, 252–259 (2014).
Richards, T. A., Leonard, G. & Wideman, J. G. What defines the “kingdom” Fungi? Microbiol. Spectr. 5, 3 (2017).
Torruella, G. et al. Global transcriptome analysis of the aphelid Paraphelidium tribonemae supports the phagotrophic origin of fungi. Commun. Biol. 1, 231 (2018).
Galindo, L. J., López-García, P., Torruella, G., Karpov, S. & Moreira, D. Phylogenomics of a new fungal phylum reveals multiple waves of reductive evolution across Holomycota. Nat. Commun. 12, 4973 (2021).
Tong, S. M. Heterotrophic flagellates and other protists from Southampton Water, U.K. Ophelia 47, 71–131 (1997).
Hehenberger, E. et al. Novel predators reshape Holozoan phylogeny and reveal the presence of a two-component signaling system in the ancestor of animals. Curr. Biol. 27, 2043–2050 (2017).
López-Escardó, D., López-García, P., Moreira, D., Ruiz-Trillo, I. & Torruella, G. Parvularia atlantis gen. et sp. nov., a Nucleariid Filose Amoeba (Holomycota, Opisthokonta). J. Eukaryot. Microbiol. 65, 170–179 (2018).
Tatusov, R. L. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28, 33–36 (2000).
Suga, H. et al. The Capsaspora genome reveals a complex unicellular prehistory of animals. Nat. Commun. 4, 2325 (2013).
Ros-Rocher, N., Pérez-Posada, A. & Leger, M. M. The origin of animals: an ancestral reconstruction of the unicellular-to-multicellular transition. Open Biol. 11, 200359 (2021).
King, N. et al. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature 451, 783–788 (2008).
Grau-Bové, X. et al. Dynamics of genomic innovation in the unicellular ancestry of animals. eLife 6, e26036 (2017).
Richter, D. J., Fozouni, P., Eisen, M. B. & King, N. Gene family innovation, conservation and loss on the animal stem lineage. eLife 7, e34226 (2018).
Paps, J. & Holland, P. W. H. Reconstruction of the ancestral metazoan genome reveals an increase in genomic novelty. Nat. Commun. 9, 1730 (2018).
Fernández, R. & Gabaldón, T. Gene gain and loss across the metazoan tree of life. Nat. Ecol. Evol. 4, 524–533 (2020).
Stajich, J. E. et al. The Fungi. Curr. Biol. 19, R840–R845 (2009).
Szöllősi, G. J., Davín, A. A., Tannier, E., Daubin, V. & Boussau, B. Genome-scale phylogenetic analysis finds extensive gene transfer among fungi. Phil. Trans. R. Soc. B 370, 20140335 (2015).
Ocaña-Pallarès, E., Najle, S. R., Scazzocchio, C. & Ruiz-Trillo, I. Reticulate evolution in eukaryotes: origin and evolution of the nitrate assimilation pathway. PLoS Genet. 15, e1007986 (2019).
Boto, L. Horizontal gene transfer in the acquisition of novel traits by metazoans. Proc. R. Soc. B 281, 20132450 (2014).
Irwin, N. A. T., Pittis, A. A., Richards, T. A. & Keeling, P. J. Systematic evaluation of horizontal gene transfer between eukaryotes and viruses. Nat. Microbiol. 7, 327–336 (2021).
Bock, R. The give-and-take of DNA: horizontal gene transfer in plants. Trends Plant Sci. 15, 11–22 (2010).
Martin, W. F. Too much eukaryote LGT. BioEssays 39, 1700115 (2017).
Leger, M. M., Eme, L., Stairs, C. W. & Roger, A. J. Demystifying eukaryote lateral gene transfer. BioEssays 40, 1700242 (2018).
Roger, A. J. Reply to ‘Eukaryote lateral gene transfer is Lamarckian’. Nat. Ecol. Evol. 2, 755 (2018).
Leonard, G. & Richards, T. A. Genome-scale comparative analysis of gene fusions, gene fissions, and the fungal tree of life. Proc. Natl Acad. Sci. USA 109, 21402–21407 (2012).
Richards, T. A. & Talbot, N. J. Osmotrophy. Curr. Biol. 28, R1179–R1180 (2018).
Nagy, L. G., Kovács, G. M. & Krizsán, K. Complex multicellularity in fungi: evolutionary convergence, single origin, or both? Biol. Rev. 93, 1778–1794 (2018).
Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M. & Stanke, M. BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 32, 767–769 (2015).
Haas, B. J. et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666 (2003).
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Emms, D. M. & Kelly, S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 16, 157 (2015).
Landan, G. & Graur, D. Heads or tails: a simple reliability check for multiple sequence alignments. Mol. Biol. Evol. 24, 1380–1383 (2007).
Landan, G. & Graur, D. Local reliability measures from sets of co-optimal multiple sequence alignments. Pacific Symp. Biocomput. 13, 15–24 (2008).
Chatzou, M. et al. Multiple sequence alignment phylogenetic tree reconstruction bootstrap analysis evolutionary analysis. Syst. Biol. 67, 997–1009 (2018).
Capella-Gutierrez, S., Silla-Martinez, J. M. & Gabaldon, T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25, 1972–1973 (2009).
Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K. F., von Haeseler, A. & Jermiin, L. S. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589 (2017).
Quang, L. S., Gascuel, O. & Lartillot, N. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics 24, 2317–2323 (2008).
Wang, H. C., Minh, B. Q., Susko, E. & Roger, A. J. Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Syst. Biol. 67, 216–235 (2018).
Lartillot, N., Rodrigue, N., Stubbs, D. & Richer, J. PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol. 62, 611–615 (2013).
Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 12, 59–60 (2015).
Brown, M. W. et al. Phylogenomics places orphan protistan lineages in a novel eukaryotic super-group. Genome Biol. Evol. 10, 427–433 (2018).
Janouškovec, J. et al. A new lineage of eukaryotes illuminates early mitochondrial genome reduction. Curr. Biol. 27, 3717–3724 (2017).
Parfrey, L. W., Lahr, D. J. G., Knoll, A. H. & Katz, L. A. Estimating the timing of early eukaryotic diversification with multigene molecular clocks. Proc. Natl Acad. Sci. USA 108, 13624–13629 (2011).
Karnkowska, A. et al. A eukaryote without a mitochondrial organelle. Curr. Biol. 26, 1274–1284 (2016).
Derelle, R. et al. Bacterial proteins pinpoint a single eukaryotic root. Proc. Natl Acad. Sci. USA 112, E693–E699 (2015).
Derelle, R., López-García, P., Timpano, H. & Moreira, D. A phylogenomic framework to study the diversity and evolution of stramenopiles (=heterokonts). Mol. Biol. Evol. 33, 2890–2898 (2016).
Strassert, J. F. H., Jamy, M., Mylnikov, A. P., Tikhonenkov, D. V. & Burki, F. New phylogenomic analysis of the enigmatic phylum Telonemia further resolves the eukaryote tree of life. Mol. Biol. Evol. 36, 757–765 (2019).
Burki, F. et al. Untangling the early diversification of eukaryotes: a phylogenomic study of the evolutionary origins of Centrohelida, Haptophyta and Cryptista. Proc. R. Soc. B 283, 20152802 (2016).
Betts, H. C. et al. Integrated genomic and fossil evidence illuminates life’s early evolution and eukaryote origin. Nat. Ecol. Evol. 2, 1556–1562 (2018).
Pathmanathan, J. S., Lopez, P., Lapointe, F.-J. & Bapteste, E. CompositeSearch: a generalized network approach for composite gene families detection. Mol. Biol. Evol. 35, 252–255 (2017).
Sonnhammer, E. L., Eddy, S. R. & Durbin, R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins 28, 405–420 (1997).
Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27–30 (2000).
Huerta-Cepas, J. et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol. Biol. Evol. 34, 2115–2122 (2017).
Pandas Development Team. pandas-dev/pandas: Pandas. Zenodo https://doi.org/10.5281/zenodo.6702671 (2020).
Harris, C. R. et al. Array programming with NumPy. Nature 585, 357–362 (2020).
Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics 23, 127–128 (2007).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
Virtanen, P. et al. Fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
R Core Team. R: A Language and Environment for Statistical Computing. https://www.r-project.org/ (R Foundation for Statistical Computing, 2017).
Lê, S., Josse, J. & Husson, F. FactoMineR: an R package for multivariate analysis. J. Stat. Softw. 25, 1–18 (2008).
Kassambara, A. & Mundt, F. factoextra: Extract and visualize the results of multivariate data analyses. Version 1.0.6 https://CRAN.R-project.org/package=factoextra (2019).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/abs/1603.04467 (2015).
Chollet, F. et al. Keras. GitHub https://github.com/fchollet/keras (2015).

Acknowledgements

E.O.-P. was supported by a predoctoral FPI grant from MINECO (BES-2015-072241) and by ESF Investing in your future. E.O.-P., D.L.-E., A.S.A. and I.R.-T. received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7-2007-2013) (Grant agreement No. 616960) and also from grants (BFU2014-57779-P and PID2020-120609GB-I00) by MCIN/AEI/10.13039/501100011033 and ‘ERDF A way of making Europe’. E.O.-P. and G.J.Sz. received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 714774). T.A.W. was supported by a Royal Society University Research Fellowship (URF\R\201024) and NERC standard grant NE/P00251X/1. This work was supported by the Gordon and Betty Moore Foundation through grant GBMF9741 to T.A.W. and G.J.Sz. J.S.P. and E.B. received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7-2007-2013) (Grant agreement No. 615274). D.V.T. and cell culturing were supported by the Russian Science Foundation grant no. 18-14-00239, https://rscf.ru/project/18-14-00239/. The culture of P. vietnamica was obtained as the result of field work in Vietnam as part of the project ‘Ecolan 3.2’ of the Russian–Vietnam Tropical Center. P.J.K. is supported by an Investigator Award from the Gordon and Betty Moore Foundation (https://doi.org/10.37807/GBMF9201). We thank the CRG/UPF FACS Unit, the CRG Genomics Unit and M. Antó-Subirats for technical assistance; D. J. Richter, M. M. Leger and I. Patten for the feedback provided on the manuscript; and M. J. Greenacre for the feedback provided on multivariate statistics.

Author information

Authors and Affiliations

Institut de Biologia Evolutiva (CSIC-Universitat Pompeu Fabra), Barcelona, Spain
Eduard Ocaña-Pallarès, David López-Escardó, Alicia S. Arroyo & Iñaki Ruiz-Trillo
Department of Biological Physics, Eötvös Lorand University, Budapest, Hungary
Eduard Ocaña-Pallarès & Gergely J. Szöllősi
School of Biological Sciences, University of Bristol, Bristol, UK
Tom A. Williams
Ecology of Marine Microbes, Institut de Ciències del Mar (ICM‐CSIC), Barcelona, Spain
David López-Escardó
Equipe AIRE, UMR 7138, Laboratoire Evolution Paris-Seine, Université Pierre et Marie Curie, Paris, France
Jananan S. Pathmanathan
Institut de Systématique, Evolution, Biodiversité (ISYEB), Sorbonne Université, CNRS, Museum National d’Histoire Naturelle, EPHE, Université des Antilles, Paris, France
Eric Bapteste
Laboratory of Microbiology, Papanin Institute for Biology of Inland Waters, Russian Academy of Sciences, Borok, Russia
Denis V. Tikhonenkov
AquaBioSafe Laboratory, University of Tyumen, Tyumen, Russia
Denis V. Tikhonenkov
Department of Botany, University of British Columbia, Vancouver, British Columbia, Canada
Patrick J. Keeling
MTA-ELTE “Lendület” Evolutionary Genomics Research Group, Budapest, Hungary
Gergely J. Szöllősi
Institute of Evolution, Center for Ecological Research, Budapest, Hungary
Gergely J. Szöllősi
Departament de Genètica, Microbiologia i Estadística, Facultat de Biologia, Institut de Recerca de la Biodiversitat (IRBio), Universitat de Barcelona (UB), Barcelona, Spain
Iñaki Ruiz-Trillo
ICREA, Barcelona, Spain
Iñaki Ruiz-Trillo

Contributions

E.O.-P. conceptualized the study and wrote the draft of the manuscript under the supervision of I.R.-T. E.O.-P., D.L.-E. and A.S.A. generated the material for sequencing. E.O.-P. and A.S.A. made the figures for the manuscript. E.O.-P. performed all bioinformatic analyses (except those specified below). T.A.W. performed the gene tree–species tree reconciliation analyses and the Bayesian species tree reconstruction, and provided feedback about the project. G.J.Sz. contributed substantially to reviewing the manuscript and providing feedback about the project. J.S.P. and E.B. adapted the CompositeSearch software and provided feedback about the project. D.V.T. and P.J.K. provided polyxenic cultures of P. vietnamica, P. chileana and P. caudatus. All authors contributed to the review of the manuscript before submission for publication and approved the final version.

Corresponding authors

Correspondence to Eduard Ocaña-Pallarès or Iñaki Ruiz-Trillo.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature thanks Maja Adamska, James McInerney and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 The importance of taxon sampling in ancestral gene content reconstructions and intron density across eukaryotes.

(A) Influence of taxon sampling on the ancestral reconstruction of protein domain innovations (Pfam domains). Note that with the addition of taxon sampling from unicellular relatives of animals (Choanoflagellatea -C-, Filasterea -F-, Teretosporea -T-), the number of pre-metazoan protein domain originations increases at the expense of originations that were originally detected at M4 in the 'No unicell. Holozoa' condition. The origin of every protein domain was inferred at the last common ancestor of all the species in which the domain is represented. This analysis was carried out with the taxon sampling euk_db, first excluding all representatives from the C, F and T groups ('No unicell. Holozoa'), and then progressively adding data from these groups in the chronological order in which genomic data from their representatives became publicly available. Ancestral node abbreviations: M4 = last common ancestor (LCA) of Metazoa. M3 = LCA of Choanoflagellatea and M4. M2 = LCA of Filasterea and M3. M1 = LCA of Teretosporea and M2. O = LCA of Opisthokonta. (See Fig. 1d for an illustration of the phylogenetic context of these ancestral nodes). (B) Distribution of introns per kb in a eukaryotic dataset including the four genomes sequenced for this manuscript as well as the metrics included in the Fig. 1—source data 1 of ref. ¹⁸.
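The inference rule used in panel A (a domain originates at the last common ancestor of all species that carry it) can be illustrated with a minimal Python sketch. The toy tree below is an assumption made for illustration only: it encodes the ancestral nodes named in the legend (M4, M3, M2, M1, O), while the leaf labels `Metazoa_a`, `Metazoa_b` and `Holomycota` and the function names are hypothetical and do not come from the authors' pipeline.

```python
# Toy child -> parent map for the ancestral nodes named in the legend.
# Leaf labels are hypothetical placeholders, not real taxa from euk_db.
PARENT = {
    "Metazoa_a": "M4", "Metazoa_b": "M4",
    "Choanoflagellatea": "M3", "M4": "M3",
    "Filasterea": "M2", "M3": "M2",
    "Teretosporea": "M1", "M2": "M1",
    "Holomycota": "O", "M1": "O",
}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root (inclusive)."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def domain_origin(species_with_domain):
    """Place a domain's origin at the LCA of all species that carry it."""
    species = list(species_with_domain)
    # Nodes shared by every species' path to the root.
    common = set(path_to_root(species[0]))
    for sp in species[1:]:
        common &= set(path_to_root(sp))
    # The first shared node met when walking rootwards is the LCA.
    for node in path_to_root(species[0]):
        if node in common:
            return node


# Without unicellular holozoans, the domain appears to originate at M4 ...
metazoa_only = domain_origin(["Metazoa_a", "Metazoa_b"])
# ... but sampling a choanoflagellate that carries it moves the origin to M3.
with_choano = domain_origin(["Metazoa_a", "Metazoa_b", "Choanoflagellatea"])
```

In this toy case the same domain is assigned to M4 or to M3 depending only on whether a choanoflagellate carrying it is sampled, which is the effect that panel A quantifies.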

Extended Data Fig. 2 Genome size and gene count metrics in eukaryotes.

Distribution of (A) 'Genome size (Mb)' and (B) 'Number of genes' in a eukaryotic dataset including the four genomes produced as well as the metrics included in the Fig. 1—source data 1 of ref. ¹⁸.

Extended Data Fig. 3 Introns per gene and mean intron size metrics in eukaryotes.

Distribution of (A) 'Introns per gene' and (B) 'Mean intron size (bp)' in a eukaryotic dataset including the four genomes produced as well as the metrics included in the Fig. 1—source data 1 of ref. ¹⁸. Whereas a potential loss of non-coding regions in the P. atlantis genome during the metagenome decontamination could have led to an underestimation of the genome size metric, the high ratio of introns per gene and the small size of the introns found strongly suggest that the intron-richness of this nucleariid is not an artefactual result.

Extended Data Fig. 4 Evolution of functional category composition in Opisthokonta.

(A–C) Net gains and losses of functional categories in those ancestral nodes that are not represented in Fig. 1. (D) Consensus phylogeny of Opisthokonta as reconstructed from the phylogenetic analyses (Supplementary Information 3). Genomic data were produced for the four species in bold. Branch colors correspond to the weighted average probability retrieved for every ancestor (internal branches) by the machine-learning classifiers that were trained to detect differential COG-compositional features of extant Metazoa and of Fungi (see Methods). Branch colors in the Holozoa clade represent the weighted averages from the metazoan predictors, and in the Holomycota clade the weighted averages from the fungal predictors (Supplementary Table 5). (E) Clusters of Orthologous Groups (COG) categories with functional information (referred to as functional categories throughout the manuscript).

Extended Data Fig. 5 Differences in functional category composition, metabolic gene content changes and differential contribution of gene fusion originations vs non-fusion originations to each functional category in Opisthokonta.

(A) Relative and (B) absolute counts of functional categories in the opisthokont species from euk_db (Supplementary Tables 7 and 8, respectively). (C) Gains and losses of metabolic genes (KEGG orthology groups) in the Opisthokonta nodes preceding H. sapiens and (D) in the Opisthokonta nodes preceding N. crassa (Supplementary Table 9). (E) Differential representation of functional categories among fusion originations vs non-fusion originations in the Opisthokonta nodes preceding H. sapiens and (F) in the Opisthokonta nodes preceding N. crassa (Supplementary Table 10).

Extended Data Fig. 6 Correspondence analyses contribution biplots on functional category compositions in Opisthokonta, phylostratigraphic analyses of functional category changes in the evolutionary path towards extant Metazoa and clustering of Opisthokonta species based on gene family content composition.

Correspondence analysis contribution biplots for the relative representation of functional categories (Supplementary Table 7) in the species from the euk_db dataset representing (A) Metazoa and Fungi, (B) Opisthokonta (i.e., Metazoa, Fungi, and also the other Holozoa and Holomycota sampled, Supplementary Table 4), and (C) every ancestor represented by an internal node in the Opisthokonta phylogeny (see Fig. 7 in Supplementary Information 3 for a mapping of every ancestral lineage to the phylogeny). (D) Phylostratigraphic origin of each functional category for those gene families that experienced increments in copy number (either gene gains or gene originations) in the last common ancestor of Metazoa for each functional category (Supplementary Table 12). (E) Phylostratigraphy of the ancestral gene content of Homo sapiens for each functional category (Supplementary Table 11). (F) Increment in the relative representation of functional categories that are particularly important for animal multicellularity since the divergence of Opisthokonta (Supplementary Table 13). (G) Similarities in gene family (orthogroup) composition between all the Opisthokonta species included in our study. We first computed the raw similarity value for each pair of species by inspecting those gene families found in both species and adding up, for each of these families, the lower of the two copy numbers found in the two species. Each raw similarity value was then normalized by multiplying it by two and dividing it by the maximum possible similarity value that could have been found for that pair of species, which corresponds to the sum of members that every gene family has in the two species (species-specific families were not considered) (Supplementary Table 14). The dendrogram was reconstructed using the 'ward.D' method of the R function hclust.
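The pairwise similarity described for panel G can be sketched in a few lines of Python. This is one reading of the legend rather than the authors' code: the function name, the dictionary representation and the orthogroup identifiers are hypothetical, and the sketch assumes that species-specific families have already been removed from both inputs, as the legend indicates.

```python
def family_similarity(counts_a, counts_b):
    """Normalized gene family similarity between two species.

    `counts_a` and `counts_b` map gene family identifiers to copy numbers.
    Assumption: species-specific families were filtered out beforehand.
    """
    # Ignore families with zero copies: they are not "found" in a species.
    a = {fam: n for fam, n in counts_a.items() if n > 0}
    b = {fam: n for fam, n in counts_b.items() if n > 0}

    # Raw similarity: for each family found in both species, add the
    # lower of the two copy numbers.
    raw = sum(min(a[fam], b[fam]) for fam in a.keys() & b.keys())

    # Maximum possible value: total family members across both species.
    max_possible = sum(a.values()) + sum(b.values())
    if max_possible == 0:
        return 0.0
    return 2 * raw / max_possible


# Hypothetical copy-number profiles for two species.
species_1 = {"OG1": 2, "OG2": 1, "OG3": 4}
species_2 = {"OG1": 3, "OG3": 1, "OG4": 2}

# Shared families: OG1 (min 2) and OG3 (min 1), so raw = 3.
# Total members: 7 + 6 = 13, so similarity = 2 * 3 / 13.
similarity = family_similarity(species_1, species_2)
```

Under this reading the value ranges from 0 (no shared families) to 1 (identical copy-number profiles), which makes pairs of species with very different gene counts comparable.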

Extended Data Fig. 7 Gene content size changes in Opisthokonta evolution.

Gene content size inferred for every ancestral node of the Opisthokonta phylogeny, as shown by the size of the corresponding pie chart (values are shown for some nodes to illustrate the proportionality between the diameter and the numeric values).

Extended Data Fig. 8 Relative contribution of gene originations to gene gains in Opisthokonta evolution.

Percentages of gene gains corresponding to gene originations (including gene fusions) inferred for every ancestral node of the Opisthokonta phylogeny, as shown by the size of the corresponding pie chart (values are shown for some nodes to illustrate the proportionality between the diameter and the numeric values).

Extended Data Fig. 9 Relative contribution of gene fusions to gene gains in Opisthokonta evolution.

Percentages of gene gains corresponding to gene fusions inferred for every ancestral node of the Opisthokonta phylogeny, as shown by the size of the corresponding pie chart (values are shown for some nodes to illustrate the proportionality between the diameter and the numeric values).

Extended Data Fig. 10 Gene gains and losses in Opisthokonta evolution.

Sum of gene gains and gene losses (and the fraction of the sum corresponding to each one) inferred for the internal nodes of the Opisthokonta phylogeny, as shown by the size of the corresponding pie chart (values are shown for some nodes to illustrate the proportionality between the diameter and the numeric values).

Supplementary information

Supplementary Figures

Supplementary Figs. 1–3 show a full representation of the boxplots shown in Fig. 1c; a comparison of the ancestral gene content reconstruction analyses using the same dataset as the original analysis as well as the proteomes of Paraphelidium tribonemae and Olpidium bornovanus; and original source images for the PCR results shown in Supplementary Information 1 – Fig. 3a,b.

Reporting Summary

Supplementary Information 1

Detailed explanation of the methodological pipeline followed to produce genomic data for Ministeria vibrans, Parvularia atlantis, Pigoraptor vietnamica and Pigoraptor chileana.

Supplementary Information 2

Explanation of the MAPBOS pipeline.

Supplementary Information 3

Phylogenetic analyses for the species tree reconstruction.

Supplementary Information 4

Brief introduction to multicellularity and complex multicellularity in the context of the eukaryotic supergroup Opisthokonta with a series of analyses in relation to the origin of complex multicellularity in Fungi.

Supplementary Tables

Supplementary Tables 1–15. Details for each table are shown at the top of each worksheet.

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ocaña-Pallarès, E., Williams, T.A., López-Escardó, D. et al. Divergent genomic trajectories predate the origin of animals and fungi. Nature 609, 747–753 (2022). https://doi.org/10.1038/s41586-022-05110-4

Received: 08 February 2022
Accepted: 14 July 2022
Published: 24 August 2022
Issue Date: 22 September 2022
DOI: https://doi.org/10.1038/s41586-022-05110-4
