Main

One of the most surprising early insights of molecular phylogenetics was the close evolutionary relationship between animals and fungi6, which was unexpected because of the enormous differences in their morphology, ecology, life history and behaviour. This relationship has stood the test of time, and now animals and fungi are members of Holozoa and Holomycota, respectively, which are the two major divisions of the eukaryotic supergroup Opisthokonta1. Pinpointing how animals and fungi evolved to be so different requires a detailed reconstruction of the evolutionary changes leading up to the two lineages. This demands not only genomic data from diverse animals and fungi but also from the protist opisthokont groups that branch between them (Fig. 1d), which are underrepresented in genomic databases7.

Fig. 1: Lineages leading to modern Metazoa and Fungi experienced sharply contrasting trajectories of genetic changes.

a,b, Net gains and losses of ‘Cluster of Orthologous Groups’ categories with functional information (hereafter referred to as functional categories) since the divergence of Opisthokonta to the emergence of both groups. See Extended Data Fig. 4 for full category names and for information on the other ancestral nodes. c, Boxplot distribution of the cumulative net gains and losses of functional categories that occurred in each of the ancestral paths leading to the extant representatives of Metazoa (n = 15) and of Fungi (n = 21) since the origin of both groups (Supplementary Tables 1 and 2). Outliers are not represented, but a fully displayed version of c is available in Supplementary Fig. 1. Note that, on average, Metazoa tended to accumulate genes for every functional category, whereas only a few categories experienced net gains in the path to modern Fungi. d, Changes in functional category composition during the evolution of Opisthokonta, with percentages indicating the magnitude of change in each ancestor (Supplementary Table 3). Metazoa-related and Fungi-related categories are indicated in Fig. 2a. The cladogram shown was reconstructed based on the most supported topologies found for Holozoa and Holomycota in the phylogenetic analyses (Supplementary Information 3). Genomic data were produced for the four species in bold.

Four new genomes of protist opisthokonts

The closest known groups to Metazoa within Holozoa are Choanoflagellatea, Filasterea and Teretosporea (Fig. 1d). Within Holomycota, the closest known groups to Fungi (here defined as the least inclusive clade including Chytridiomycota and Blastocladiomycota based on the absence of phagotrophy in all the members of this clade8) are Opisthosporidia (a paraphyletic group9,10, which in our genomic dataset is represented by Rozella allomycis and Mitosporidium daphniae, hereafter the RM clade) and Nucleariidae (Fig. 1d). To improve the limited genome sampling for the protist opisthokont groups7, we sequenced, assembled and annotated the genomes of three filastereans (Ministeria vibrans11, Pigoraptor vietnamica12 and Pigoraptor chileana12) and one nucleariid (Parvularia atlantis13) from metagenomic data produced from cultures of these species (Supplementary Information 1). Given that Filasterea and Nucleariidae were previously represented by only a single whole-genome-sequenced species, the four newly sequenced species represent a substantial increase in the diversity of genomic data available for the protist opisthokont groups (Fig. 1d). This can be expected to minimize the negative impact of poor taxon sampling in ancestral reconstructions (see an example of this issue in Extended Data Fig. 1a).

The four sequenced genomes show high completeness and contiguity metrics, which are in the range of those from the previously sequenced protist opisthokont species (Fig. 23 in Supplementary Information 1). With regard to genome size and gene content metrics, the sequenced species do not differ from most unicellular eukaryotes and fungi (Extended Data Figs. 2 and 3), with the exception of P. atlantis. Despite having a compact genome (19.24 Mb), this nucleariid has 8.58 introns per gene (Extended Data Fig. 3a). This ratio is almost identical to that of Homo sapiens, even though the introns of P. atlantis are approximately 86 times shorter (mean size of 60.67 bp) (Extended Data Fig. 3b), giving it an intron density (approximately four introns per kilobase) more than twice that of any other genome explored (Extended Data Fig. 1b).

Large differences in gene content

We explored whether the gene contents of Metazoa and Fungi present broad-scale functional differences, as this would indicate that, at some point after the divergence of their last common ancestor, a substantial genetic turnover occurred (that is, the remodelling of the gene content as a result of gene gains and losses, with gains including the origination of novel gene families and the expansion of ancestral families). In a multivariate analysis of the relative genomic representation of the Cluster of Orthologous Groups functional categories14 (hereafter referred to as functional categories), Metazoa and Fungi cluster separately in the dimension accounting for the largest variance explained (68.1%) (Fig. 2a). Functional categories of signal transduction (T), transcription (K) and extracellular structures (W), which are particularly relevant for animal multicellularity15,16, are among the most differentially represented in animal genomes (particularly T and W; Extended Data Fig. 5a). Other categories that are more represented in Metazoa include cytoskeleton (Z) and cell motility (N) (Fig. 2a). By contrast, the vast majority of metabolic functional categories (C, E, F, G, H, I and Q; see Fig. 1c) are proportionally more represented in Fungi (Fig. 2a).

Fig. 2: Gradual compositional change at the gene function level predated the origin of Metazoa and Fungi.

a, Correspondence analysis on the functional category compositions of modern metazoan and fungal gene contents (see species names in Supplementary Table 4). Amphimedon queenslandica was excluded because its outlier behaviour impairs proper data visualization (Extended Data Fig. 6a). Metazoa and Fungi cluster separately in dimension 1, the axis concentrating the largest fraction of variability (68.1%). Functional categories were grouped as Fungi-related or Metazoa-related from their contribution to dimension 1. b,c, Evolution of the functional category compositions in the ancestral paths leading to the species that got the highest scores by the machine learning classifiers that were trained to detect functional category compositions characteristic of Metazoa (b) and Fungi (c) (Supplementary Table 5). See the functional category composition of each ancestral node in Fig. 1d. d, Evolution of metabolic genomic representation in Opisthokonta, measured as the percentage of gene content represented by Kyoto Encyclopedia of Genes and Genomes (KEGG) Orthology Groups related to metabolism (Supplementary Table 3). Fungi have a larger fraction of their gene content involved in metabolism.

Greater divergence of metazoan gene sets

From an evolutionary perspective, the large genetic differences shown between Metazoa and Fungi might be explained because either both or just one of the two groups experienced substantial genetic changes after diverging from their last shared common ancestor. Furthermore, this divergence could either be due to an abrupt genetic turnover in which changes would have occurred specifically in the root of both groups, or by a gradual process in which the preceding ancestors of each group were already accumulating changes in the direction of the differences observed in extant Metazoa and Fungi (Fig. 2a). To distinguish between these alternative scenarios, we took two complementary approaches to reconstruct the tempo and modes of the genetic divergence that occurred. In the first approach, we split the functional categories into two groups based on the results from the multivariate analysis on extant species from Metazoa and from Fungi (Fig. 2a): Metazoa-related or Fungi-related. Then, we computed the relative representation of each group of functional categories in every ancestral node of Opisthokonta (Fig. 1a) based on the gene contents inferred with our ancestral reconstruction pipeline (see Methods). In the second approach, we trained a series of machine learning classifiers to find their own functional category-based definition based on the gene contents from extant Metazoa and Fungi (see Methods). Then, we scored the ancestral nodes—which were not used to train the classifiers—according to how metazoan-like and fungal-like the relative compositions of functional categories of their inferred gene contents were (Extended Data Fig. 4d).
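The first approach can be illustrated with a minimal sketch. It assumes that each inferred gene carries a single functional category letter and that the two category groups are fixed sets; the sets below contain only the categories named in the text as more represented in each group and are illustrative, not the full grouping derived from Fig. 2a. The function names are hypothetical.

```python
from collections import Counter

# Illustrative (assumed) grouping: only the categories explicitly named in the
# text; the actual grouping comes from the correspondence analysis (Fig. 2a).
METAZOA_RELATED = {"T", "K", "W", "Z", "N"}
FUNGI_RELATED = {"C", "E", "F", "G", "H", "I", "Q"}


def group_representation(gene_categories):
    """Percentage of annotated genes falling in each group of categories.

    gene_categories: iterable of single-letter category labels, one per gene.
    """
    counts = Counter(gene_categories)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no annotated genes")
    metazoa = sum(counts[c] for c in METAZOA_RELATED)
    fungi = sum(counts[c] for c in FUNGI_RELATED)
    return {
        "metazoa_related": 100.0 * metazoa / total,
        "fungi_related": 100.0 * fungi / total,
    }


def compositional_shift(parent, child):
    """Change, in percentage points, of the Metazoa-related share
    between a parent node and its descendant node."""
    return (group_representation(child)["metazoa_related"]
            - group_representation(parent)["metazoa_related"])
```

Under these assumptions, applying `compositional_shift` along consecutive ancestral nodes would give per-node values comparable in spirit to the percentages shown in Fig. 1d, although the exact computation used for that figure may differ.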

Not surprisingly, Fungi-related functional categories are more represented in Fungi (particularly in Basidiomycota and Ascomycota groups), but for most of the non-metazoan and non-fungal opisthokonts, the relative genomic representation of functional categories is more Fungi-like than Metazoa-like (Fig. 1d). As a result, Fungi does not separate from the protist opisthokont groups as distinctly as Metazoa (Extended Data Fig. 6b). These results are consistent with the fact that the machine learning classifiers differentiate the functional category compositions of Metazoa more strongly than those of Fungi (Extended Data Fig. 4d), as shown by the lower probabilities retrieved for the inner nodes of Fungi (43.7% for F3, root of Fungi) than those retrieved for Metazoa (81.7% for M4, root of Metazoa). Together, these results indicate that Metazoa experienced a broader differentiation at the gene function level than Fungi, with fungal gene contents being more similar to those of the protist opisthokonts, including the root of Opisthokonta (Fig. 1d and Extended Data Fig. 6c).

Gradual process, punctuated acceleration

Our ancestral reconstruction shows the genetic differences between Metazoa and Fungi (Fig. 2a) stemming from a divergence that started early after the split of Opisthokonta and continued up to the origin of the two groups (Fig. 2b,c). In the path to Metazoa, the changes that occurred in the three pre-metazoan ancestors (M1–M3) together contributed to shifting the composition of the lineage towards Metazoa-related functional categories to a similar extent as the changes that occurred at the metazoan root (3.7% versus 3.5%; Fig. 1d). Among the pre-metazoan ancestors, the changes in M2 and M3 contributed more than the changes in M1 despite both nodes showing fewer net gene gains (Fig. 1a). This is because gains in M1 were distributed across a wider set of functional categories, whereas gains in M2 occurred particularly in Metazoa-related functional categories, and the net losses in M3 were more prevalent in Fungi-related functional categories (Fig. 1a). Notwithstanding the contribution of the pre-metazoan ancestors, at the root of Metazoa (M4) there is also evidence for a substantial burst of net gains from a subset of functional categories (Fig. 1b), including transcription (K), signal transduction (T) and extracellular structures (W), which are particularly relevant for the animal multicellular genetic toolkit15. Although in the pre-genomic era the animal multicellular genetic toolkit was largely expected to be the outcome of metazoan-specific genetic innovations (that is, gene families that originated at the metazoan root), comparative genomics has revealed orthologues of many toolkit components in the unicellular relatives of animals15,17,18,19. This finding highlighted the importance that the co-option of ancestral gene originations had for multicellularity, although those same studies, as well as more recent studies19,20,21, also reported remarkable gene originations at the metazoan root. To quantify which contributed more to the pool of gene families involved in functions that are particularly important for multicellularity (K, T and W), pre-metazoan gene originations from Holozoa or those that occurred at the metazoan root, we traced the evolutionary trajectories of these three categories after the divergence of Opisthokonta.

Of gene gains observed at the metazoan root for K, T and W categories, 42.8% correspond to gene families that originated in this same ancestor (M4), whereas 21.2% of gains in M4 correspond to the expansion of gene families that originated in the pre-metazoan holozoan ancestors (Extended Data Fig. 6d). This difference (42.8% to 21.2%) is much greater than that observed for the other functional categories (19.2% to 15.9%), indicating that, among the gene gains that occurred at M4, gene originations were particularly relevant for K, T and W. An inspection of the ancestral contribution to the gene content of H. sapiens (Extended Data Fig. 6e) illustrates the same trend: genes from families that originated in M4, a single ancestral node, contributed to a similar extent to the ancestral repertoire of the genes involved in K, T and W in H. sapiens (mean of 13.9%) as genes from families that originated in the three pre-metazoan ancestral nodes (M1–M3) (mean of 12.5%). From this, we conclude that gene originations at M4 have been quantitatively more important (13.9% versus 12.5%) to functional categories related to animal multicellularity than the gene originations coming from any of the preceding holozoan ancestors. As a result, the metazoan root experienced a substantial increment in the relative genomic representation of K, T and W (+1.35%, +1.16% and +0.35%, respectively, from M3 to M4) (Extended Data Fig. 6f). Notwithstanding this, the tendency towards increasing the relative genomic representation of these functional categories was already ongoing in the pre-metazoan holozoan ancestors (+1.73%, +0.66% and +0.24%, respectively, from O to M3) and hence predated the origin of animals (Extended Data Fig. 6f).

Main genetic changes in Fungi

Similar to Metazoa, the genetic changes that occurred in the preceding ancestors of Fungi from Holomycota (F1 and F2) contributed more to shifting the gene content (1.8% together)—in this case, towards Fungi-related functional categories—than the root of the group (0.07%) (Figs. 1d and 2c). However, whereas the ancestral path to Metazoa from M1 to M3 accumulated net gains of Metazoa-related functional categories, F1 and F2 did not accumulate gains but rather losses of Metazoa-related functional categories, particularly signal transduction (Fig. 1a).

The two fungal nodes that present the largest compositional shift towards Fungi-related functional categories are, on the one hand, the stem node of Dikarya (Ascomycota + Basidiomycota) (+1.9%; Fig. 1d), which experienced genetic changes that could have predisposed the evolution of complex multicellularity in some members of this group (see Supplementary Information 4), and on the other hand, the last common ancestor of Zoopagomycota, Mucoromycotina and Dikarya (+1.5%), which experienced important morphological adaptations such as the ancestral loss of the flagellum that is characteristic of most fungal groups22. On average, and in contrast to animals, Fungi retained gene contents of a similar size to their ancestors and the protist opisthokonts (Extended Data Fig. 7). Still, some fungal nodes showed substantial net gains, particularly the fungal root (F3; Fig. 1b). Similar to the animal root in Holozoa, F3 was the node in Holomycota with the largest fraction of gene gains being explained by gene originations (Extended Data Fig. 8). Nevertheless, the changes seen at the fungal root made a low contribution to the compositional shift of Fungi (0.07%; Fig. 1d) because this node accumulated net gains of both Metazoa-related and Fungi-related functional categories (Fig. 1b).

The main characteristic of the genetic turnover that occurred in the path to extant Fungi was a specialization towards metabolism (Fig. 2d), whereas animal genomes specialized towards other functional categories (Fig. 2a). In agreement with this, the metazoan root experienced a net loss of metabolic genes (Extended Data Fig. 5d), despite this node presenting an overall net gain of gene content (Fig. 1b), whereas the fungal root experienced net metabolic gene gains (Extended Data Fig. 5c). (Note that an additional supplementary analysis with a dataset that includes transcriptomic data from the aphelid Paraphelidium tribonemae9, which is the closest known group to Fungi, suggests that half of the net gene gains originally detected at the fungal root, including the metabolic ones, could have also predated the origin of Fungi; see Supplementary Fig. 2).

The metabolic changes at the gene content level that we described for the root of Metazoa and Fungi did not become a tendency that continued during the diversification of both groups, as we detected a net accumulation of metabolic genes in Metazoa, but not in Fungi (Extended Data Fig. 5c,d). The larger representation of metabolism in fungal genomes is thus explained because the gene turnover that occurred during the diversification of Fungi benefitted the metabolic over the non-metabolic functions (Fig. 2d). By contrast, Metazoa accumulated more genes of every category, but gains were not particularly biased towards metabolic functions (Fig. 1c).

Differences in gene gain mechanisms

Metazoa and Fungi also differ in their preferences among the mechanisms that can be sources of gene gains. Although no significant differences between groups were found in the relative contribution of gene originations to gene gains, gene duplications were found to be significantly more prevalent specifically among metazoan gains (Fig. 3a,b), in accordance with previous studies that highlighted the importance of duplications in the origin and diversification of animals21. Besides originations and duplications, the gene tree–species tree reconciliation software23 used in our ancestral reconstruction framework also estimates putative horizontal gene transfer events as sources of gene gains. Despite being originally described in Bacteria, horizontal gene transfer has been documented across a wide range of eukaryotes and is known to have led to significant functional changes24,25,26,27. However, the relative contribution of transfers to gene gains in eukaryotes, and whether this contribution is homogeneous across the phylogeny, remain uncertain28,29,30. In this regard, the fact that the reconciliation software recovered a significantly lower fraction of gene gains as being explained by transfers in Metazoa than in Fungi and in the other opisthokonts (Fig. 3c) is compatible with the historical consideration that transfers should contribute less to gene gains in animals due to germline isolation3,4,5.

Fig. 3: Taxonomic differences in the relative contribution of gene originations, gene duplications, horizontal gene transfers and gene fusions to gene gains.

a–d, Dots correspond to the percentage of gene gains explained by each mechanism in every ancestral lineage of Opisthokonta (Supplementary Table 6; values were normalized to the maximum value found in each plot for a better representation of differences between groups). For every plot, the asterisks indicate the groups that present a significantly lower (b and d) or higher (c) distribution of values than Metazoa (Holozoa), according to one-tailed Mann–Whitney U-test results. *P ≤ 0.05, **P ≤ 0.01 and ***P ≤ 0.001 (see exact P values in Supplementary Table 6).
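As an illustration of the comparison summarized in this legend, the sketch below implements a one-tailed Mann–Whitney U-test and the per-plot normalization in plain Python. It is an assumption-laden sketch: it uses the normal approximation with tie and continuity corrections, which is rough for small samples, and the software actually used to obtain the P values in Supplementary Table 6 is not stated here. All function names are hypothetical.

```python
import math


def midranks(values):
    """Return 1-based ranks (ties get the mean of their positions)
    and the list of tie-group sizes."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    tie_sizes = []
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        tie_sizes.append(j - i + 1)
        i = j + 1
    return ranks, tie_sizes


def mann_whitney_less(x, y):
    """One-tailed test of whether values in x tend to be lower than in y.

    Returns (U statistic of x, approximate P value from the normal
    approximation). Both samples must be non-empty.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, ties = midranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in ties) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sigma == 0:  # all values identical: no evidence either way
        return u, 1.0
    z = (u - n1 * n2 / 2 + 0.5) / sigma  # +0.5 is the continuity correction
    p = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return u, p


def normalise_to_max(values):
    """Scale values so that the largest one equals 1 (as done per plot)."""
    top = max(values)
    return [v / top for v in values]
```

To test whether a group has a higher distribution than Metazoa (as in panel c), the two arguments can simply be swapped, for example `mann_whitney_less(metazoa_values, other_group_values)`.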

Our ancestral reconstruction pipeline also detects originations that occurred due to gene fusion events. Previous studies17,18 have described multiple instances of genes in the animal multicellular toolkit that originated through gene fusions (here defined as the merging of partial or complete sequences from older genes). Our results indicate that fusions contributed significantly more to gene gains in Metazoa than in Fungi (Fig. 3d). This is explained not only because Metazoa experienced more gene gains than Fungi (Extended Data Fig. 7), but also because the fraction of originations detected as fusions is also greater in Metazoa (Extended Data Fig. 9). Fusions being less prevalent in Fungi agrees with a previous study that reported a particularly low rate of fusions compared with fissions31. Because fusions seem to be particularly relevant sources of transcription and signal transduction genes (Extended Data Fig. 5e,f), this gene gain mechanism could have been more prevalent in Metazoa due to the excess of gains of these two categories (Fig. 1a,b), which are particularly relevant for multicellularity15.

Two divergent genomic trajectories

Together, the emerging picture from our ancestral reconstruction indicates that animals and fungi have been evolving under sharply contrasting trajectories of genomic changes that predated the origin of both groups (Fig. 4). Fungal gene contents remained relatively constant in size (Extended Data Fig. 7) and specialized into metabolism (Fig. 2d). By contrast, animals accumulated net gains of most functional categories, although the unequal distribution of gene gains across categories led some categories to increase their relative genomic representation over the others, particularly those that are important for multicellularity (Extended Data Fig. 6f). Although both groups experienced substantial gains and losses during their divergence (Extended Data Fig. 10), the lineage leading to extant Metazoa experienced a larger compositional change in gene function (Fig. 2b,c). As a result, metazoan gene contents are more diverged than the fungal gene contents from those of the other opisthokonts at both the broad-scale functional level and the gene family content level (Extended Data Fig. 6c,g). Given that the latter result is independent of gene function annotation, Metazoa being more differentiated than Fungi from the rest of opisthokonts from a gene content perspective is robust to potential inequalities that may exist between groups at the level of biological knowledge or in the availability of functional information. This indeed agrees with the fact that there are more evident morphological discontinuities between protists and animals than between protists and some groups of Fungi8. Neither the hypha nor the cell wall characteristic of Fungi, which is also present in some of their protist relatives, are fungal synapomorphies8. Only the abandonment of phagotrophy for an osmotrophic lifestyle seems to be a common although not exclusive feature of Fungi32. Although animals are distinguished from protists by the fact that all of them are multicellular, in Fungi, complex multicellularity is probably the outcome of convergent evolution, as it is found only in some particular groups, which present important differences in the genetic contents involved in it33 (see Supplementary Information 4 for further information on the evolution of multicellularity in Opisthokonta and particularly in Fungi).

Fig. 4: The large genetic differences between modern animals and fungi are the outcome of two contrasting trajectories of genetic changes that preceded the origin of both groups.

These divergent trajectories started immediately after the split of their last common ancestor (Opisthokonta) into Holozoa and Holomycota and continued during the emergence and diversification of Metazoa and Fungi.

From a genomic perspective, the origin of Metazoa and Fungi is better described as a gradual rather than an episodic process given the contribution of their preceding ancestors (M1–M3 and F1–F2) to the cumulative changes at the level of gene function that occurred in the lineages leading to the extant representatives of both groups (Fig. 2b,c). Notwithstanding this, substantial quantitative changes in gene content also occurred concomitantly with the origin of the two groups (Fig. 1b). In particular, the genetic changes at the metazoan root represent an acceleration of a trend that was already ongoing in the pre-metazoan ancestors to accumulate genes of functional categories that are important for animal multicellularity (Extended Data Fig. 6f). These same categories underwent losses in the pre-fungal ancestors (Fig. 1a), situating the immediate ancestors of Fungi and Metazoa in substantially different latent potentials from a genomic perspective. This is especially relevant for the case of animals. Had not animal ancestors experienced a continuous and long-standing evolutionary trajectory that had a compounding effect on the genomic potential for multicellularity, metazoans could not have arisen. The origin of animals may be seen as a drastic evolutionary event, but our taxon-rich analysis shows how the potential for that to happen was generated gradually on a genomic level. Our results illustrate the importance of analysing evolutionary transitions in the light of their evolutionary prehistory.

Methods

Methodological pipeline for genomic data acquisition

We sequenced a series of culture lines, each including one of the four species of interest (M. vibrans, P. atlantis, P. vietnamica and P. chileana). The cultures of M. vibrans and P. atlantis (formerly Nuclearia sp.) were purchased from ATCC (M. vibrans Tong. ATCC 50519 and Nuclearia sp. ATCC 50694, respectively). The cultures of P. vietnamica (formerly Opistho-1) and P. chileana (formerly Opistho-2) descend from the environmental isolates (P. vietnamica from a freshwater lake in Vietnam, and P. chileana from a temporary freshwater body in Chile) used in ref. 12. As expected, the starting cultures included an uncertain fraction of contaminant species. In particular, the cultures of M. vibrans and P. atlantis included an uncertain diversity of bacterial contamination, whereas the cultures of each Pigoraptor species also included contamination from the eukaryote Parabodo caudatus. The sequenced metagenomic data were submitted to a bioinformatic decontamination pipeline that consisted of two to three rounds of detection and removal of contaminant fragments based on taxonomic and tetranucleotide composition information. All steps were thoroughly supervised to maximize the retention of bona fide genomic fragments from our species of interest and the removal of contaminant sequences. Decontaminated genomes were annotated by combining the RNA sequencing-based BRAKER1 v1.9 (ref. 34) and PASA v2.0.2 (ref. 35) automatic annotation pipelines, the results of which were processed to correct erroneous gene predictions that might lead to the inference of false gene fusions. See Supplementary Information 1 for a detailed explanation of the nature of the sequenced data and of the decontamination and genome annotation processes (see Fig. 1 in Supplementary Information 1 for a summary illustration).

Clustering sequences into orthogroups

A dataset of 1,463,920 protein sequences from 83 eukaryotic species, 59 from Opisthokonta (including the four genomes produced) and 24 from other eukaryotic groups, was constructed (draft_euk_db; see Supplementary Table 4). Protein sequences were aligned all-against-all using BLASTp36 v2.5 [-seg yes, -soft_masking true, -evalue 1e-3]. On the basis of the alignments, proteins were clustered into orthogroups (OGs) with OrthoFinder37 v2.7 [-I 2]. We treat OGs as proxies of gene families. The OGs produced by OrthoFinder were processed with the MAPBOS pipeline to fix protein domain heterogeneity problems that would compromise downstream analyses (see Supplementary Information 2 for a discussion of this issue, and for an explanation of the algorithm that we developed to correct it).

Species tree reconstruction

Ancestral gene contents were inferred by means of a gene tree–species tree reconciliation software. We thus needed to reconstruct a phylogenetic tree for every gene family and a species tree of the whole eukaryotic supergroup Opisthokonta. The results from the species tree reconstruction analyses are available in Supplementary Information 3. We first selected 342 OGs present in >77% of draft_euk_db taxa and with an average of no more than 1.16 copies per taxon. We measured alignment instability of the 342 OGs using COS.pl and msa_set_score v2.02, which are based on the Heads-or-Tails approach38,39, keeping only those OGs with a mean column score (MCs) >0.70. We manually curated the 69 OGs that passed this filter by performing individual phylogenies for each one, using MAFFT40 v7.123b [-einsi] for sequence alignment, trimAl41 v1.4.rev15 [-gappyout] for alignment trimming and IQ-TREE42 v1.6.7 for maximum-likelihood (ML) phylogenetic inference, using ModelFinder43 for model selection. Three of these 69 OGs were discarded because the topology was strongly in disagreement with the expected species topology. For the remaining 66 OGs (hereafter referred to as the MCs70 dataset), we removed sequences whose branching pattern suggested that they were most likely misclassified as OG members. In addition, to keep only one sequence per taxon in every OG, for inparalogue cases, we kept the least divergent sequence according to branch length. We removed a total of 630 sequences from the MCs70 dataset, including likely misclassified OG members but also contaminant sequences. Most contamination cases found correspond to contamination from Stramenopiles in the proteome of Syssomonas multiformis, probably from Spumella sp.12. However, we also detected Pirum gemmata contamination in the proteome of Abeoforma whisleri, and a few cases of contamination from Ichthyophonus hoferi in Sphaerothecum destruens, indicating cross-contamination problems between these ichthyosporean datasets. Still, these cases of contamination affected neither the phylogenetic inference, as they were removed during the screening, nor the downstream analyses, as these species were only used for species tree reconstruction purposes.
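The initial marker selection step (taxon occupancy and copy-number thresholds) can be sketched as follows. This is a minimal illustration, not the pipeline used here: the input layout and the function name are hypothetical, and it assumes that the average copy number is computed over the taxa in which the OG is present.

```python
def select_marker_ogs(og_copies, n_taxa, min_occupancy=0.77, max_mean_copies=1.16):
    """Select OGs that are widespread and close to single copy.

    og_copies: dict mapping OG id -> dict mapping taxon -> number of copies.
    n_taxa:    total number of taxa in the dataset.
    An OG is kept when it is present in more than `min_occupancy` of the taxa
    and has, on average, no more than `max_mean_copies` copies per taxon
    (averaged over the taxa in which it is present; an assumption).
    """
    selected = []
    for og, copies in og_copies.items():
        present = {taxon: c for taxon, c in copies.items() if c > 0}
        if not present:
            continue
        occupancy = len(present) / n_taxa
        mean_copies = sum(present.values()) / len(present)
        if occupancy > min_occupancy and mean_copies <= max_mean_copies:
            selected.append(og)
    return sorted(selected)
```

The subsequent steps described above (alignment stability scoring, single-gene phylogenies and manual curation) are not covered by this sketch.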

We created two distinct versions of the MCs70 dataset: the first dataset including all sequences from Holozoa (ingroup) and from three Holomycota taxa (outgroup) (Holozoa MCs70), and the second dataset including all sequences from Holomycota (ingroup) and from three Holozoa taxa (outgroup) (Holomycota MCs70). An alignment supermatrix was created for each dataset, first aligning and trimming each OG separately [MAFFT -einsi, trimAl -gappyout], and later concatenating the alignments into a supermatrix (Holozoa MCs70: 37 taxa, 17,475 sites and 9.27% of missing data; Holomycota MCs70: 28 taxa, 17,409 sites and 7.81% of missing data). We constructed a phylogenetic tree for both MCs70 datasets using ML and Bayesian inference. ML inferences were done with IQ-TREE, and the models chosen for Holozoa and Holomycota MCs70 datasets were LG+C50+F+R7 and LG+C30+F+R6, respectively. Despite ModelFinder suggesting the usage of C60 (ref. 44) for Holomycota MCs70, we used mixture models with fewer profiles to avoid potential model overfitting, as some optimized mixture weights were estimated close to zero. Nodal supports for the ML trees consisted of 1,000 IQ-TREE ultrafast bootstrap replicates (UFBoot) and 100 standard non-parametric bootstrap replicates. Non-parametric bootstraps were computed under the PMSF model45. We used the previously inferred ML trees as guide trees to infer mixture model parameters and site-specific frequency profiles, as implemented in IQ-TREE v1.6.7. Bayesian phylogenies were done under the CAT+GTR+Gamma(4) model in PhyloBayes-MPI46 v1.8. Two chains were run for Holozoa MCs70 and for Holomycota MCs70 supermatrices, and convergence was assessed using the bpcomp and tracecomp programs in the PhyloBayes-MPI package. Consensus trees were built when the maximum between-chain discrepancy in bipartition frequencies fell below 0.1 (burn-in 33%). We also performed three additional analyses (increasing the number of positions in the supermatrix, compositional recoding and removal of the fastest-evolving sites) to test the robustness of the topological relationships found (see Supplementary Information 3).
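The concatenation step and the reported percentage of missing data can be illustrated with a short sketch. It assumes that each trimmed alignment is available as a mapping from taxon to aligned sequence, that taxa absent from an OG are padded with gaps, and that gaps and `?` characters count as missing data; how missing data were counted for the figures above is not specified, and the function names are hypothetical.

```python
def build_supermatrix(alignments, taxa, gap="-"):
    """Concatenate per-OG alignments into one sequence per taxon.

    alignments: list of dicts mapping taxon -> aligned sequence; all sequences
                within one alignment must have the same length.
    taxa:       ordered list of all taxa to include in the supermatrix.
    Taxa missing from an alignment are padded with gap characters.
    """
    parts = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("alignment is empty or has unequal sequence lengths")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


def missing_fraction(matrix, missing="-?"):
    """Fraction of supermatrix cells that hold a missing-data character."""
    total = sum(len(seq) for seq in matrix.values())
    absent = sum(ch in missing for seq in matrix.values() for ch in seq)
    return absent / total
```

For example, concatenating a two-column alignment of taxa A and B with a three-column alignment of taxa A and C yields five sites per taxon, with B and C padded where they are absent.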

Incorporation of prokaryotic homologues into the OGs

We incorporated prokaryotic homologues into the clusters before the MAPBOS processing step. To incorporate prokaryotic (and viral) homologues, we first used DIAMOND47 v0.8.22.84 [--more-sensitive, -e 1e-05] to align all eukaryotic sequences from euk_db (a subset of draft_euk_db, which includes the species labelled in bold in Supplementary Table 4) to a database including 8,231,104 bacterial, 331,476 archaeal and 20,955 viral sequences from UniProt reference proteomes (release 2016_02; prok_db) (forward alignment approach). The aligned sequences from prok_db were aligned back against euk_db sequences (reverse alignment approach). Hits with query and target alignment coverages lower than 75% were discarded, as well as hits in which the best-scoring euk_db target of a given prok_db query was a member of a different cluster from the best-scoring euk_db query for that prok_db sequence in the forward alignment. After discarding the hits not satisfying these conditions, we incorporated into the clusters only the best-scoring prok_db query of each euk_db target sequence (that is, if a cluster has 300 sequences and the best-scoring query of all of them was the same prok_db sequence, only that sequence will be incorporated into the cluster, which will then have 300 euk_db sequences and 1 prok_db sequence). Prok_db sequences were incorporated into OrthoFinder -I 2 clusters before these were processed by the MAPBOS pipeline (Supplementary Information 3). After MAPBOS, clusters included 1,117,614 eukaryotic sequences and 58,017 non-eukaryotic sequences (53,168, 4,301 and 548 from Bacteria, Archaea and viruses, respectively). All these 1,175,631 sequences were distributed among 413,445 clusters, 370,686 of which are singletons. Among eukaryotic sequences, on a taxonomic level, clusters included sequences mostly from Opisthokonta (50 species), but also from 18 representatives of other major eukaryotic groups (euk_db dataset).
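The forward and reverse filtering logic can be sketched as follows. This is an illustrative reading of the procedure, not the code used in the study: the Hit record, MIN_COV, best_by and prok_to_add are hypothetical names, the coverage rule is assumed to require both query and target coverage of at least 75%, and a prok_db sequence whose forward and reverse best hits fall in different clusters is assumed to be dropped entirely.

```python
from collections import namedtuple

# Hypothetical record for one alignment hit (coverages as fractions).
Hit = namedtuple("Hit", "query target score qcov tcov")
MIN_COV = 0.75  # assumption: both coverages must reach this value


def best_by(hits, key):
    """Map key(hit) to the highest-scoring hit sharing that key."""
    best = {}
    for h in hits:
        k = key(h)
        if k not in best or h.score > best[k].score:
            best[k] = h
    return best


def prok_to_add(forward, reverse, cluster_of):
    """forward: euk query -> prok target; reverse: prok query -> euk target.
    Returns {cluster id: set of prok_db sequence ids to incorporate}."""
    ok = lambda h: h.qcov >= MIN_COV and h.tcov >= MIN_COV
    forward = [h for h in forward if ok(h)]
    reverse = [h for h in reverse if ok(h)]
    best_fwd = best_by(forward, key=lambda h: h.target)  # best euk per prok
    best_rev = best_by(reverse, key=lambda h: h.query)   # best euk per prok
    consistent = [
        h for h in reverse
        if h.query in best_fwd
        and cluster_of[best_fwd[h.query].query] == cluster_of[best_rev[h.query].target]
    ]
    # Keep only the best-scoring prok query of each euk target sequence.
    additions = {}
    for euk, h in best_by(consistent, key=lambda h: h.target).items():
        additions.setdefault(cluster_of[euk], set()).add(h.query)
    return additions


# Toy data (hypothetical identifiers and scores).
cluster_of = {"e1": "C1", "e2": "C1", "e3": "C2"}
forward = [Hit("e1", "p1", 100, 0.9, 0.9), Hit("e2", "p1", 90, 0.9, 0.8),
           Hit("e3", "p2", 80, 0.8, 0.8), Hit("e1", "p2", 50, 0.4, 0.9)]
reverse = [Hit("p1", "e1", 100, 0.9, 0.9), Hit("p1", "e2", 90, 0.8, 0.9),
           Hit("p2", "e1", 70, 0.8, 0.8), Hit("p2", "e3", 60, 0.8, 0.8)]
additions = prok_to_add(forward, reverse, cluster_of)
```

Here p1 is the best prok_db query of both e1 and e2, so cluster C1 receives p1 only once, mirroring the 300-sequence example in the text. p2 is discarded because its best forward and reverse partners belong to different clusters.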

Gene tree inference and gene tree–species tree reconciliation analyses

We submitted every post-MAPBOS OG (or cluster) to a gene tree inference pipeline, consisting of MAFFT-linsi for the alignment step, trimAl [-gappyout] for alignment trimming and IQ-TREE for the phylogenetic inference. In particular, IQ-TREE was run using the LG+G4 model and sampling 1,000 optimized [-bnni] UFBoot replicates for every gene tree.

For the gene tree–species tree reconciliation analyses, we used ALEml_undated from ALE v0.4 (https://github.com/ssolo/ALE). ALEml_undated requires a distribution of phylogenetic trees for every gene family (the UFBoot replicates in our case) and a species tree. The Opisthokonta fraction of the species tree consisted of the most favoured topology according to our analyses, which only included Opisthokonta taxa (Fig. 1 in Supplementary Information 3). The phylogenetic relationships between the non-Opisthokonta taxa were directly determined from a consensus of currently available bibliographical references48,49,50,51,52,53,54,55,56 (all euk_db species were included in the reconciliation analyses). Reconciliation analyses also incorporated non-eukaryotic sequences (see above), which, for practical reasons, were assigned to the same terminal node in the species tree (named ‘Prokaryotes’ in Fig. 7 in Supplementary Information 3). Eukaryotes with only transcriptomic or poor-quality genomic data were excluded from the reconciliation analyses (those labelled in grey in Fig. 1 in Supplementary Information 3). Note that the inclusion of transcriptomic data would have been particularly problematic for our study for the following reasons: (1) gene content predictions from transcriptomic data tend to present inflated gene counts. For example, the proteomes that were previously produced based solely on transcriptomic data for P. atlantis2 and for P. vietnamica and P. chileana12 include many more sequences (29,620, 46,018 and 37,783) than the proteomes that we predicted from the genome sequences of these species (9,028, 14,822 and 14,510), with the genome-based proteomes showing even better completeness metrics (Fig. 23 in Supplementary Information 1). Inflated gene counts are expected to produce an excess of duplication inferences in the reconciliations, whereas (2) unexpressed genes may be confused with gene losses.
(3) Transcriptomes are harder to decontaminate due to the lack of genomic context information regarding neighbouring genes, intron sequences or compositional features of the coding sequence, whereas (4) those sequences predicted from partial isoforms are expected to lead to inaccuracies in the software used to detect gene fusions (see below). (5) Accurate gene contents were also important given that the reconciliation software used (see above) infers the values for parameters such as gene duplication and loss rates from the data.

Inference of gene fusion events

We used CompositeSearch57 to identify composite gene families, that is, families of genes whose protein sequence is composed of fractions (for example, protein domains) that are separately found in other, component, gene families. CompositeSearch requires as input all-against-all sequence alignments, for which we used the same BLASTp results used for OrthoFinder (see above), although alignment hits corresponding to draft_euk_db species not represented in euk_db were removed. Before being used as input for CompositeSearch, BLASTp results were preprocessed with cleanBlastp (included in CompositeSearch) to retain only the hit with the highest score among all hits involving the same query–target pair. CompositeSearch was run with the default parameters and forcing the software [-f] to work on the clusters resulting from the processing of the OGs from OrthoFinder by the MAPBOS pipeline. Families with only one sequence were discarded as potential components [-y]. Prok_db sequences were not included in composite inferences as alignments between prok_db and euk_db sequences were done with DIAMOND instead of BLASTp due to computational time limitations. Because we work at the gene family level (clusters), we only considered as composites those clusters in which >50% of members were detected as composite sequences. This includes 48,066 clusters, 3,229 of which are not singletons.

CompositeSearch detects as a composite any sequence that matches distinct subsets of sequences (components, from other OGs) in different regions of its sequence. Whereas fusion events may lead to composite sequences, not all sequences detected as composites necessarily originated from a gene fusion process. For example, a sequence found to be composite by the software could have originated de novo in a given ancestral lineage (gene X–domains A and B), and then, in a descendant lineage, that gene could have been split into two separate genes (gene Y–domain A and gene Z–domain B). In such a case of gene fission, the software would detect gene X as a composite because one part of the sequence would align to gene Y (first component) and the other to gene Z (second component). To retain only bona fide fusion composite sequences, we only considered those composite sequences in which all components were inferred to have a more ancestral origin than the composite. This was done to minimize false-positive inferences of fusions, at the expense of losing potential fusion events in which, for example, both the composite and the components may have originated in the same node of the phylogeny.
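The two filters described above (the >50% rule for calling a cluster composite, and the requirement that every component originated in a more ancestral node than the composite) can be sketched as below. The function names and the toy species tree are hypothetical, and the origin node of each family is assumed to be already known from the reconciliation analyses.

```python
def is_composite_cluster(members, composite_seqs):
    """True if more than 50% of the cluster members are composite sequences."""
    n_composite = sum(1 for m in members if m in composite_seqs)
    return n_composite / len(members) > 0.5


def is_strict_ancestor(parent, ancestor, node):
    """True if `ancestor` lies on the path from `node` to the root, excluding `node`."""
    node = parent.get(node)
    while node is not None:
        if node == ancestor:
            return True
        node = parent.get(node)
    return False


def is_bona_fide_fusion(composite_origin, component_origins, parent):
    """Keep a composite only if all components arose in strictly older nodes."""
    return all(
        is_strict_ancestor(parent, origin, composite_origin)
        for origin in component_origins
    )


# Toy species tree as a child -> parent map (hypothetical, heavily simplified).
parent = {
    "Opisthokonta": None,
    "Holozoa": "Opisthokonta",
    "Holomycota": "Opisthokonta",
    "Metazoa": "Holozoa",
}

fusion = is_bona_fide_fusion("Metazoa", ["Holozoa", "Opisthokonta"], parent)
same_node = is_bona_fide_fusion("Metazoa", ["Metazoa", "Holozoa"], parent)
fission_like = is_bona_fide_fusion("Holozoa", ["Metazoa"], parent)
```

The second and third cases are rejected: a component that originated in the same node as the composite, or in a descendant node (as in the gene fission scenario), does not satisfy the strict-ancestor condition.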

Functional annotation of sequences and OGs

Protein domain architectures of euk_db sequences and of prok_db captured sequences (see above) were determined with PfamScan58 using Pfam A v29. Cluster of Orthologous Groups functional categories (functional categories) and KEGG Orthology Groups (KOs)59 were annotated to euk_db sequences with eggNOG-mapper60 v1.0.3-3-g3e22728, using DIAMOND for the alignments of euk_db sequences against the eggNOG database (the functional category ‘S: unknown function’ was ignored as it does not include functional information). Once sequences were annotated, the functional categories and KO annotations of every cluster were determined by averaging the annotations of the corresponding cluster members. For example, if a cluster includes two sequences (SeqA and SeqB), and SeqA was annotated with the functional category K and SeqB with the functional categories B and K, that cluster would be annotated as 0.75K and 0.25B (0.5K from SeqA + 0.25K from SeqB, and 0.25B from SeqB).
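The averaging scheme in the SeqA and SeqB example can be written as a short sketch. The function name is hypothetical, and skipping sequences without any annotation is an assumption that the text does not state.

```python
from collections import defaultdict


def cluster_annotation(seq_annotations):
    """Average per-sequence annotations into fractional cluster annotations.

    seq_annotations: one list of categories per cluster member. Each annotated
    sequence contributes a total weight of 1 / (number of annotated sequences),
    split evenly among its own categories.
    """
    annotated = [cats for cats in seq_annotations if cats]  # assumption: skip unannotated
    totals = defaultdict(float)
    for cats in annotated:
        for cat in cats:
            totals[cat] += 1.0 / len(cats) / len(annotated)
    return dict(totals)


# SeqA is annotated with K; SeqB with B and K.
example = cluster_annotation([["K"], ["B", "K"]])
```

For this input the result is 0.75 for K and 0.25 for B, matching the worked example in the text.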

Inference of gains, losses and counts of functional categories and metabolic gene contents

From the reconciliation analyses (see ‘Gene tree inference and gene tree–species tree reconciliation analyses’), we retrieved the number of gains, losses and gene contents of every OG in every node in the phylogeny. For every given node, we determined the absolute representation of all functional categories by combining the number of copies of every OG in the node with the relative representation of every functional category among the functional information of the OGs. The same was done to determine the KO contents of every node. The percentage of metabolic genes of every node was determined by dividing the number of KOs with metabolic annotations by the total number of genes in the node (besides KOs belonging to the ‘metabolic category’, those belonging to the category ‘membrane transport’ were also considered as metabolic genes). The relative representation of every functional category in every node was determined by dividing the absolute value of every category in the node by the sum of the absolute values of all functional categories in the node. Gains and losses of functional categories and KOs were determined by comparing the contents of every node with those of its immediately preceding node.
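As a minimal sketch of these calculations, assuming that the copy number of every OG in each node and the fractional annotation of every OG are already available, the absolute and relative category contents of a node and the net change relative to its parent node could be computed as follows. All names and numbers are hypothetical.

```python
def node_category_counts(og_copies, og_annotation):
    """Absolute category content: OG copy number times the OG's category fractions."""
    counts = {}
    for og, copies in og_copies.items():
        for cat, fraction in og_annotation.get(og, {}).items():
            counts[cat] = counts.get(cat, 0.0) + copies * fraction
    return counts


def relative_composition(counts):
    """Each category divided by the sum over all categories in the node."""
    total = sum(counts.values())
    return {cat: value / total for cat, value in counts.items()}


def net_change(child_counts, parent_counts):
    """Net gain (positive) or loss (negative) per category versus the parent node."""
    cats = set(child_counts) | set(parent_counts)
    return {c: child_counts.get(c, 0.0) - parent_counts.get(c, 0.0) for c in cats}


# Toy input: two OGs with fractional annotations and their copy numbers.
og_annotation = {"OG1": {"K": 0.75, "B": 0.25}, "OG2": {"T": 1.0}}
parent_counts = node_category_counts({"OG1": 2, "OG2": 1}, og_annotation)
child_counts = node_category_counts({"OG1": 4, "OG2": 0}, og_annotation)

parent_relative = relative_composition(parent_counts)
change = net_change(child_counts, parent_counts)
```

In this toy case the child node shows net gains in K and B (from the duplication of OG1) and a net loss in T (from the loss of OG2).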

Statistical analyses

Statistical analyses were carried out either in Python, mainly with the libraries Pandas61 and NumPy62, or in R. All descriptive statistics plots (with the exception of those including phylogenetic trees, which were constructed with iTOL63) were done in R, particularly with the ggplot2 package64. Mann–Whitney U-tests (one-tailed) were done in Python with SciPy65 (scipy.stats.mannwhitneyu). More specific statistical analyses are detailed below.

Correspondence analyses of relative functional category compositions

The relative genomic representations of functional categories are examples of compositional data (CoDa)66, in which every column (a functional category) is represented by a relative fraction and the sum of all values is the same for every row (genome). Because non-orthogonality and collinearity are properties of CoDa, most commonly used multivariate analysis techniques such as principal component analysis are inappropriate for CoDa, and alternatives such as correspondence analysis are recommended instead66. Correspondence analyses were done in R67 with the FactoMineR package68 and the plots were constructed with the factoextra package69.

Machine learning classifiers

For the classifiers of metazoan and fungal functional category compositions, we benchmarked five widely used learning models: logistic regression, k-nearest neighbours classifier, support vector classifier, Random Forest and artificial neural network, fine-tuning in every case the model hyperparameters using fivefold cross-validation. In total, we generated two classifiers for every learning model: one trained to distinguish between the functional category compositions of Metazoa versus the other terminal nodes in Opisthokonta; and another doing the same but for Fungi instead of Metazoa. Relative functional category compositions were not used as features to train the models because they are correlated with one another. Instead, the models were trained with the components retrieved from the correspondence analyses on the relative functional category compositions of opisthokont terminal nodes (relative compositions were computed excluding the S ‘unknown function’ category and doing first a column-wise and then a row-wise normalization before the correspondence analysis was performed). Once models were trained, we computed the probability of belonging to the given class (Metazoa or Fungi, depending on the model) for every opisthokont node, including both terminal (used for model training) and internal (not used for model training) nodes (see values in Supplementary Table 5). The probabilities represented in Extended Data Fig. 4d correspond to a weighted average over the probabilities retrieved from every classifier (excluding logistic regression for being in disagreement and showing worse predictions than the other classifiers). The weights were determined in the following manner: for every node, the average probability was computed, and then we computed the variance of the four models with respect to that average. The weight of every model corresponds to the inverse of the relative variance of that model divided by the sum of the variances of the four models.
The code is available at https://doi.org/10.6084/m9.figshare.13140191.v1 (‘fungiMetazoa_predModels’ in Code.300322.zip). We expect the predictors to capture the genomic compositional features well, as, for example, in the case of Metazoa, Trichoplax adhaerens, the animal with the lowest degree of phenotypic complexity among the sampled species, is the node with the lowest probability (Extended Data Fig. 4d). All of these analyses were carried out in Python using packages from the scikit-learn70, TensorFlow71 and Keras72 libraries.
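The weighting scheme for combining the four classifiers admits more than one reading. The sketch below shows one possible interpretation and is not the deposited code: each model's variance is taken as its mean squared deviation from the per-node average probability, its relative variance is that value divided by the sum over the four models, and the weights are the inverses of the relative variances, rescaled here to sum to 1 (the rescaling is an assumption). Model names and probabilities are hypothetical.

```python
def inverse_variance_weights(probs):
    """probs: {model name: [probability for each node]}. Returns {model: weight}.

    Note: a model that matches the per-node average exactly at every node would
    have zero variance and raise ZeroDivisionError in this simple sketch.
    """
    models = list(probs)
    n_nodes = len(probs[models[0]])
    node_mean = [sum(probs[m][i] for m in models) / len(models) for i in range(n_nodes)]
    variance = {
        m: sum((probs[m][i] - node_mean[i]) ** 2 for i in range(n_nodes)) / n_nodes
        for m in models
    }
    total = sum(variance.values())
    raw = {m: 1.0 / (variance[m] / total) for m in models}  # inverse relative variance
    norm = sum(raw.values())
    return {m: raw[m] / norm for m in models}  # assumption: rescale to sum to 1


def weighted_probability(probs, weights, node_index):
    """Weighted average of the model probabilities for one node."""
    return sum(weights[m] * probs[m][node_index] for m in probs)


# Toy probabilities for three nodes from four classifiers (hypothetical values).
probs = {
    "knn": [0.90, 0.20, 0.80],
    "svc": [0.80, 0.10, 0.70],
    "rf": [0.95, 0.30, 0.60],
    "ann": [0.85, 0.20, 0.90],
}
weights = inverse_variance_weights(probs)
combined = [weighted_probability(probs, weights, i) for i in range(3)]
```

Under this reading, classifiers that deviate least from the consensus of the four models receive the largest weights, so a single divergent model has limited influence on the combined probability.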

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.