Deep Evolutionary History of the Phox and Bem1 (PB1) Domain Across Eukaryotes

Protein oligomerization is a fundamental process to build complex functional modules. Domains that facilitate the oligomerization process are diverse and widespread in nature across all kingdoms of life. One such domain is the Phox and Bem1 (PB1) domain, which is functionally well-studied in the animal kingdom. However, beyond animals, neither the origin nor the evolutionary patterns of PB1-containing proteins are understood. While PB1 domain proteins have been found in other kingdoms including plants, it is unclear how these relate to animal PB1 proteins. To address this question, we utilized large transcriptome datasets along with the proteomes of a broad range of species. We discovered eight PB1 domain-containing protein families in plants, along with four each in Protozoa and Fungi and three families in Chromista. Studying the deep evolutionary history of PB1 domains throughout eukaryotes revealed the presence of at least two, but likely three, ancestral PB1 copies in the Last Eukaryotic Common Ancestor (LECA). These three ancestral copies gave rise to multiple orthologues later in evolution. Analyzing the sequence and secondary structure properties of plant PB1 domains from all the eight families showed their common ubiquitin β-grasp fold, despite poor sequence identity. Tertiary structural models of these plant PB1 families, combined with Random Forest based classification, indicated family-specific differences attributed to the length of PB1 domain and the proportion of β-sheets. Thus, this study not only identifies novel PB1 families, but also provides an evolutionary basis to understand their diverse functional interactions.


Results
Identification and evolution of PB1 domain-containing proteins in various kingdoms. Animalia. Based on literature, we extracted protein sequences of all PB1 domain-containing proteins in the human genome from the Uniprot database. Nine gene families were found to encode the PB1 domains as a part of their protein architecture (Fig. 1A). aPKC and M3K2/3 both contain PB1 and Kinase domains in their Nand C-terminus, respectively, while aPKC contains an extra diacylglycerol-binding (kDAG) domain in the middle. NCF2/p67 Phox and NCF4/p40 Phox both contain SRC Homology 3 (SH3) and PB1 domains in the C-terminus, where NCF2 contains Tetratricopeptides and NCF4 contains Phox homologous domain (PX) in their N-terminus (Fig. 1A). The other three protein families, Par6, TFG and p62/SQSTM1, are in general shorter than other PB1 domain-containing proteins, with a PB1 domain in the amino-end. p62 contains a Ubiquitin-associated domain (UBA), whereas Par6 contains a PSD95-Dlg1-Zo1 (PDZ) domain in the carboxy-end (Fig. 1A). The full name or description of all the domains along with a link to the InterPro domain database are provided in Supplementary Table S1. PB1 sequences from all the above-mentioned proteins were used as queries to retrieve orthologues from ten species across various phyla in Animalia (see Additional File 3 for the list of species used). All retrieved orthologous sequences were used in a phylogenetic analysis along with the respective human counterparts. The PB1 domain-based phylogeny reflected the monophyly of each gene family ( Fig. 2 and Supplementary Fig. S2). PB1 domains of M3K2-M3K3 and aPKC-M2K5 form paralogous pairs, indicating the common ancestry of PB1 for each pair at the emergence of the kingdom Animalia. Interestingly, the paralogous pairs M3K2-M3K3 and aPKC-M2K5 PB1 domains are closer to the respective orthologues from other kingdoms than the other PB1 domains in the same kingdom, Animalia. A similar trend is observed with NBR1, however, surprisingly NCF2/ p67 Phox is placed as the sister clade to the NBR1. p62 does not show any close relationship with other PB1 domains, neither paralogous nor orthologous, from the same kingdom or the other kingdoms ( Fig. 2 and Supplementary  Fig. S2). In a similar way, Par6 and TFG also appear to be Animalia-specific clades (Figs. 2 and 3A,B).
Protozoa. From UniProt, at least six reference proteomes and other individual Protozoan sequences from various species across Apusozoa, Amoebozoa and Excavata were used to identify the PB1 domains (Additional File 3). Few PB1 domain-containing proteins were identified along with a large number of partial (or truncated) proteins with either only a PB1 domain or a large unknown flanking sequence ( Fig. 1 and Supplementary Fig. S1). Among the (full length) PB1 domain-containing proteins, orthologs of Animalia M3K2/3 as well as Plantae Phox were identified, and named as Kinase and Phox respectively ( Fig. 1B and Supplementary Fig. S2). Unlike the animals, the Protozoan kinases contain WD40 repeats at their C-terminus. Moreover, the PB1 domain is also adjacent to a kinase domain (Fig. 1B). Orthologues of the NBR1 protein with all the four (known) domains were also found, along with the sequences of various domain combinations i.e. either PB1 with ZZ-type Zn-finger, with the NBR1 central domain, or with the UBiquitin Associated (UBA) domains (Fig. 1). We also identified many PB1 domain-containing proteins either associated with a Sterile Alpha Motif (SAM), Per-Arnt-Sim (PAS), EF-hand or Cystathionine Beta Synthase (CBS) domains ( Supplementary Fig. S1). However, the majority of these are identified in only one sequence or one species, and also represented in polyphyletic groups spread across the phylogenetic tree, making it difficult to classify them into a certain clade ( Supplementary Fig. S2).
Chromista. To identify the PB1 domains in Chromista i.e. SAR (Stramenopila, Alveolata and Rhizaria) group and Cryptophyta, all the transcriptomes from the MMETSP database were used (Additional File 3). Well-annotated fungal (Bem1 and Cdc24), plant (Arabidopsis and Marchantia) and animal (Human and Mouse) PB1 sequences were used to query the database, and processed the data with a pipeline developed earlier 31 . Two gene families, Kinase and NBR1 were identified in Chromista, with a similar domain architecture, being orthologous to the respective gene families in Animalia ( Fig. 1C and Supplementary Fig. S2). Interestingly, like Protozoan kinase proteins, these carry WD40 repeats. However, the PB1 domain is located far towards the N-terminus as in Plantae and Animalia (Fig. 1C). Orthologues of NBR1 are identified as multiple (partial) proteins represented in polyphyletic groups, similar to Protozoa (Supplementary Fig. S2). However, as a third gene family, we identified the CBS domain-containing proteins, where the PB1 domain is associated in their carboxy terminus www.nature.com/scientificreports www.nature.com/scientificreports/ ( Fig. 1C). Few other PB1 domain proteins were identified as single copies in only one species that host either a Tetratricopeptide (TPR) repeat or an EF-hand domain, which were also represented in polyphyletic groups ( Fig. S1 and Supplementary Fig. S2).
Fungi. For Fungi, we have selected 12 reference proteomes from the MycoCosm database (Additional File 3). Well-annotated plant and animal PB1 sequences were used as query sequences. Four PB1 domain-containing protein families were identified (Fig. 1D). The widely known Bem1 and Cdc24, were identified as a monophyletic paralogous pair in our study (Fig. 2). Along with the PB1 domain, Bem1 contains SH3 and PX domains, whereas Cdc24 contains Dbl homology (DH) and Plextrin homology (PH) domains. Interestingly, NCF4/p40 Phox , the animal ortholog of Bem1, contains the PX domain in its N-terminus, not in the middle as in Bem1 (Fig. 1D). CBS domain containing proteins (referred as Mug70) were also identified, with a similar domain architecture like in other kingdoms (Fig. 1D). Further, NoxR, an ortholog of Animalia and Plantae Phox, was also identified as a sister clade to this pair ( Fig. 2 and Supplementary Fig. S2).
In summary, all the four gene families form a respective individual monophyletic group with all the paralogs, indicating their presence across major phyla in Fungi (Figs. 2 and 3B). It is worth noting that an ortholog of p62 and a PB1 domain associated with a SAM domain were identified. However, each was found in only one species and a single copy ( Supplementary Fig. S1). Hence, we discarded them for further analysis as they are considered low confidence and may not represent any phylum or the kingdom itself.
Plantae. To identify all the PB1 domains in the kingdom Plantae (Fig. 1E), we have adapted a similar pipeline as mentioned above 31 , using 485 transcriptomes, that belong to multiple phyla in the kingdom Plantae, from the OneKP database 33 (Additional File 3). We identified eight gene families that encode PB1 domain-containing proteins in plants (Fig. 1E). Among these, NBR1 and Kinase orthologs are placed in the same clade as their counterparts from other kingdoms, and also contain the same domain architecture as their animal orthologs www.nature.com/scientificreports www.nature.com/scientificreports/ (Figs. 1E and 2). ARF and Aux/IAA families form a distinct monophyletic clade indicating a common ancestry at the base of the streptophytes. ARFs contain B3 and dimerization domains (together referred as DNA-binding domain (DBD)) at the N-terminus and a PB1 domain at the C-terminus, like the Aux/IAAs. In addition to a PB1 domain, Aux/IAAs also contain an EAR-motif and a degron motif (domain I and II respectively; Fig. 1E). Phox proteins, having the same domains as animal counterparts, form a sister clade to the respective orthologous proteins from other kingdoms (Fig. 2). CBS domain-containing proteins were also identified in plants, placed in the same clade as fungal Mug70.
Kinase-derived, ARF, Aux/IAA and NLP are Plantae-specific families that are not identified in any other kingdom (Fig. 3A). All these plant-specific gene families were discovered before, except Kinase-derived, which has only a PB1 domain in its N-terminus with a large flanking sequence without any known domains. It is worth mentioning that the Kinase-derived PB1 domains resemble the Kinase PB1 domains and appear to have been duplicated in the ancestors of angiosperms (Figs. 2 and 3B). NLPs contain an RWP-RK domain, in association with the PB1 domain and they are placed as sister clade to the CBS domain-containing proteins (Figs. 1E and 2). Interestingly, among all the families identified so far across all the kingdoms, there do not appear to be any constraints on either the position of the PB1 domain in the protein, or the category of domains it is associated with (DNA binding, oligomerization, phosphorylation etc.; Fig. 1). An overview of all the identified gene families and their existence across the major species in the kingdoms Animalia, Fungi and Plantae is summarized in Fig. 3.

Ancestral copy number in LECA.
To better understand the origin and evolutionary patterns of all the PB1 domains across the five kingdoms in eukaryotes, two phylogenetic trees were constructed using only the PB1 domain protein sequences. One is based on the PB1 domains from only three kingdoms (Animalia, Fungi and Plantae; Fig. 2), whereas another one is constructed based on all the sequences from five kingdoms ( Supplementary Fig. S2). The detailed versions of both the phylogenetic trees are available online in the iTOL webserver: https://itol.embl.de/shared/dolfweijers. All the previously mentioned pairs that form the monophyletic groups of individual families in each kingdom are well supported with good bootstrap values (>75), especially in Fungi, Animalia and Plantae (Fig. 2). The branches representing PB1 domains in Apusozoa, Amoebozoa, Excavata, SAR group and Cryptophyta are highly unreliable due to the polyphyletic nature and their random distribution across the phylogenetic tree (Fig. S2). Overall, the recently evolved clades in the phylogeny that are either gene family-specific or kingdom-specific, are generally monophyletic in nature. We have observed a decrease in the bootstrap support of early branches in the phylogeny based on all the five kingdoms (Supplementary Fig. S2; https://itol.embl.de/shared/dolfweijers). Monophyletic grouping, as well as the presence in multiple kingdoms, support the notion that there would have been at least two common ancestral copies of PB1 domains, each corresponding to Kinase and Phox orthologues across eukaryotes. Even though the Plantae NBR1 PB1 domains, along with the animal orthologues (and the similar proteins p62) are not monophyletic in origin, they are still placed in the phylogeny as sister clades (Fig. 2). This distribution of orthologues from the various kingdoms hint at a third common ancestor of PB1 in LECA. This analysis has failed to predict the order of evolution because of the lack of sufficient phylogenetic signal due to poorly conserved sequences, a relatively small domain (in general) and poor bootstraps in the early branches in the tree with all five kingdoms. The use of bacterial outgroup sequences could not improve resolution, leading to mixing in the phylogeny along with ingroup sequences. Hence, no outgroup was used and the tree is unrooted. Because of these drawbacks, this study could not identify the order of events, but could predict the copy number in LECA, based on both the monophyletic nature of Kinase and Phox groups as well as presence of NBR1 orthologous sister clades across multiple kingdoms.
(Dis)similarities in the plant PB1 domains. After identifying the majority, if not all, of the PB1 domain-containing proteins and understanding their evolution patterns across major phyla in all five kingdoms in eukaryotes, we further investigated the plant PB1 domains in detail at the amino acid level. To achieve this, we gathered the PB1 domain sequences from four whole genome-sequenced land plants, one species each from liverworts (Marchantia polymorpha), mosses (Physcomitrella patens), basal angiosperms (Amborella trichopoda) and a core eudicot (Arabidopsis thaliana). All PB1 domain protein sequences that belong to the eight families were aligned, and an individual sequence logo was derived for each family (Fig. 4). The well-conserved (group of) residues across the majority of the families are the positive residues lysine (K) in β1 and arginine (R) in β2 that together represent the positive surface. However, the lysine of β1 that makes contact with the OPCA motif on the negative face of another PB1, is not conserved in Kinase and Kinase-derived PB1 domains, indicating that these could be type-1 PB1 domains with only a conserved negative face (Fig. 4).
In general, the negative face represented by the OPCA motif is relatively well-conserved in all the gene families, despite a strong conservation of three amino acids (QLP) just before the OPCA motif in Kinase and Kinase-derived PB1 domains (Fig. 4). Interestingly, the tyrosine (Y) in β3 is relatively well conserved in all the gene families. Apart from these generally conserved residues across multiple families, there are various single amino acids that are specifically conserved in each gene family. For example, tyrosine (Y) and phenylalanine (F) of the α1, glycine (G) in β4 and phenylalanine (F) in α2 are specific to ARF and Aux/IAA PB1 domains. In a similar way, two phenylalanine (F) in and before the β1 and a G-x-L-x-L-x-L motif in β5 are specific to PB1 domains associated with CBS domains (Fig. 4). The tryptophan (W) in β4 is specific to NLPs. Despite NBR1 being a single-copy gene in the kingdom Plantae, there do not seem to be any constrains on the domain itself, as there is less than 20% identity among them. This provides a basic understanding of relaxed evolutionary pressure in the PB1 domain, providing opportunities for many gene family-specific changes. This makes it difficult not only to predict general sequence patterns that are important for function, but also to estimate the domain properties specific to each family purely based on the primary sequence and its poorly conserved amino acids.
Classification using random forests. Since there are no clear patterns in the PB1 domains that set apart the protein it belongs to and because it is also not possible to identify important features of a specific PB1 domain based on the sequence alignment, one might detect patterns based on the secondary structure composition along with the amino acid properties. Random forest (RF) based classification was performed with 28 amino www.nature.com/scientificreports www.nature.com/scientificreports/ acid descriptors as variables. After bootstrap aggregating (bagging) all the decision trees from the RF, the mean out-of-bag (OOB) error rate is only 6% which indicates the high reliability of the RF model (Fig. 5). The classification error rate is the highest (~14%) for Kinase-derived and the least (~2%) for ARF PB1 domains (Fig. 5A and Supplementary Table S3). On average, the majority of PB1 families were resolved well, indicating the high reliability of classification using these descriptors.
The importance of each variable is evaluated through the mean decrease in accuracy (MDA) and the mean decrease in Gini (MDG). Higher values of both MDA and MDG indicate the most important variables. In this case, the top 10 important variables are shown in Fig. 5B, with mHbeta being the most important variable to differentiate the different classes of PB1 domains. Hydrophobic moment of β-sheets, mHbeta, indicates the strength of periodicity in the hydrophobicity of the β-sheets, also indicating the formation of more β-sheets (Eisenberg 1984). The next most important variables are composition of proline (P) and length of the PB1 domain (Fig. 5B). We further analysed how these three important variables differ between the gene families (Fig. 5C). mHbeta is low for ARFs and Aux/IAAs, slightly higher for Phox, but even higher for the rest of the gene families. On the other hand, the composition of proline is lowest in CBS, but shows a very broad distribution in the Kinase-derived family. However, the length of the PB1 domain is strongly constrained for the majority of the families (>90 for www.nature.com/scientificreports www.nature.com/scientificreports/ NBR1 and <90 for ARFs), except Aux/IAA and Kinase-derived, and to a certain extent for Phox (Fig. 5C). To correlate the contribution of mHbeta to the β-sheets in the secondary structure, we performed homology modelling of at least one randomly selected Arabidopsis orthologue for each protein family and we indeed found that higher mHbeta represents secondary structures with more β-sheets (Fig. 6). For example, IAA17 shows ~18% of the residues in β-sheets, whereas CBS36500 has ~34%, correlating with lower and higher mHbeta values observed for Aux/IAA and CBS PB1s, respectively ( Fig. 6 and Supplementary Fig. S3). Taken together, these results clearly indicate that there is a difference between the gene families that can be explained from the mHbeta, the composition of proline and by keeping the length unique/constrained for that respective family.

Discussion
The PB1 domain is widespread in nature, throughout all the kingdoms in the eukaryotic tree of life. It is diversified to a great extent in organisms with complex body plans like animals and even more so in land plants (Fig. 1). As a result, the human genome encodes 13 PB1 domain-containing proteins, whereas a model angiosperm, Arabidopsis thaliana, encodes more than 80 PB1 domain copies grouped into eight families (Fig. 3B). The proteins with a PB1 domain also feature various domains, representing a manifold association ranging from DNA/protein binding, catalytic function, scaffolding to membrane association (Fig. 1). However, the PB1 domain is (mostly) found at either terminus of the protein, preferably facilitating these to perform their native function, scaffolding or oligomerization, without the hindrance of other domains.
The evolutionary patterns of the PB1 domain showed that there are multiple families shared across multiple kingdoms (Fig. 2). Kinase and NBR1 are present in all the five kingdoms, while Phox is found in four kingdoms (except Chromista) with a similar domain architecture (Fig. 1). The phylogenetic placement of Kinase and Phox PB1 domains and their orthologs indicate the presence of two ancestral copies in LECA and presumably a third copy might be represented by the NBR1 and/or p62 group. It is known that orthologous proteins may perform similar functions by interacting with similar proteins across kingdoms. An example of common functionality of orthologous domains across multiple kingdoms is the recently studied DIX domain in plant cell polarity protein SOSEKI 34 . The DIX head-to-tail oligomerization domain is conserved across multiple kingdoms (e.g. Dishevelled in animals; DIX-like in SAR group), and functions in cell polarity by forming oligomers in plants and animals. In a similar way, the PB1 orthologs across multiple kingdoms may share common functionality in similar pathways. One such is the PB1 domain-containing protein NBR1, which serves as an autophagy cargo receptor in both plants and animals, where homodimerization of the PB1 domain is also conserved as a part of its function 8,35 . Phox orthologues in animals (p67 Phox ) and fungi (NoxR) play a key role in NADPH oxidation pathway by interacting with the membrane associated proteins (gp91 Phox and NoxA/B respectively) as well as through PB1 domain with p40 Phox and Cdc24 respectively 23,36 . Similarly, in Arabidopsis, Phox4 was shown to interact with membrane-associated proteins KNOLLE, SYP22 and PEN1, which belong to the SNARE family 37 . However, it is unknown which PB1 domain protein interacts with Phox4-PB1 in plants. In another study, Phox proteins (referred as MadB) were shown to be involved in the myosin-driven interactions, preferably through their PB1 domain 38 . Hence, discovering these unknown and novel interactions may provide a link to the existence of common pathways across kingdoms controlled by PB1-dependent interactions. www.nature.com/scientificreports www.nature.com/scientificreports/ Apart from the proteins that are shared across multiple kingdoms, some are specific to each group (Fig. 3A). ARF, Aux/IAA and NLP families are specific to plants, whereas TFG, Par6, M2K5 and aPKC are specific to animals. ARFs and Aux/IAAs are involved the Nuclear Auxin Pathway, controlling transcriptional regulation of downstream targets with multiple functions in response to the phytohormone auxin 26 . NLPs are master regulators of nitrate-inducible gene regulation in higher plants 28 . On the other hand, animal Par6 and aPKC PB1 domains are known to interact with each other playing a key role in epithelial cell polarity 39 . Thus, as far as can be inferred, these kingdom-specific PB1 domain-containing proteins appear to regulate processes that are specific to that kingdom.
It is interesting to see that the key interacting PB1 domains have also evolved in pairs. Some such pairs are: aPKC-M2K5 (Animalia), Bem1-Cdc24 (Fungi) and ARF-Aux/IAA (Plantae) (Fig. 2). The interacting pairs (for example ARF-Aux/IAA) seem to maintain pairs of amino acids specific to those classes (Fig. 4). Hence, based on this 'paired' conservation pattern, it is enticing to speculate that the Kinase and Kinase-derived PB1 domains might form interacting pairs (Fig. 4). Despite the overall poor sequence conservation, it is clear that PB1 domains are maintaining a flexible (global β-grasp fold) yet specific (local conserved residues) sequence context in each family may provide specificity in function. Adding to the complexity in specificity of each interaction, the PB1 domains can also undergo non-canonical interactions. In plants, PAL OF QUIRKY (POQ), a Kinase-derived PB1 domain, interacts with QUIRKY 29 . The PB1 domain of NLP interacts with HQ domain of TCP20 27 . In animals, the M2K5 PB1 interacts with ERK5, among many others 6 . However, the structural and/or mechanistic basis of any of these non-canonical interactions is currently unknown.
In various kingdoms, the PB1 domain-containing proteins have expanded to various complexities/copies. For example, NBR1 in plants is (mostly) a single copy gene, where ARFs and Aux/IAAs are represented by large gene families with more than 20 copies (Supplementary Table S2). This clearly shows varying duplication rates in different gene families. However, whether it is a single-or a multi-copy gene family, there is hardly any conservation in the PB1 domain among the members of the same gene family outside of key residues: lysine in β1, tyrosine in β3 and the OPCA motif (Fig. 4). Despite their low conservation, all the PB1 domain families identified in plants can potentially form a β-grasp ubiquitin fold (Fig. 6). Thus, for the PB1 domain it is evident that sequence conservation seems to be a less important factor than maintaining the overall β-grasp structure itself.
This poor sequence conservation is never a bottleneck to identify the most important features, as there are efficient machine learning-based classification programs like Random Forests (RF). RF has been very successful in classification with highly correlated variables at low error rate 40 . The classification error rate is as low as 2% (in ARFs), but up to 14% in Kinase-derived, which could be due to the broader distribution of all three most important variables (Fig. 5C). This clearly defines that the more specific the variables are, the lower the error rate is. RF also provides the relative importance of each variable with the precision. Hydrophobic moment of β-sheets, mHbeta, the most important variable, is low for Aux/IAA but high for CBS, correlating with the increased β-sheets in CBS (Fig. 6). How this increase of β-sheets could bring a change in function needs to be elucidated. Another interesting observation is that some variables are highly constrained for each family. For example, the length of the PB1 domain is always above 90 amino acids in the NBR1 family, where as it is always below 90 for ARFs. Hence, it is evident that PB1 domains are constrained in different ways to maintain the uniqueness of that family. Moreover, using more (specific) parameters in the future, one should be able to distinguish PB1 domains to a much broader extent, even across multiple kingdoms, and including homotypic and heterotypic interactions.
The β-grasp fold, to which the PB1 domain belongs, is widespread across eukaryotes, with at least 20 different families that are represented by this fold in the common ancestor of eukaryotes 13 . The DIX domain is one the recently studied families of β-grasp fold domains that is conserved across multiple kingdoms and structurally highly similar to PB1 domain 34,41 . However, despite their highly homologous structural fold, they appear to be evolved independently from a common ancestor in eukaryotes 34 (Supplementary Fig. S4). Apart from DIX and PB1 domains, the SAM domain also undergoes head-to-tail oligomerization, but this domain is structurally different from both others 41 . It is unclear why the PB1 domains are much more widespread compared to DIX or SAM domains. The latter are only limited to few families and few members in each family. One reason could be that, as discussed above, the PB1 domain might contribute additionally by a wide range of non-canonical interactions and its abundance across multiple kingdoms.

Search for PB1 domains in Animalia, Protozoa and Fungi. To study the PB1 domains in the kingdom
Animalia, based on the literature, we first extracted all the PB1 domain sequences of Human proteome from the UniProt database (https://www.uniprot.org/proteomes/). To find other PB1 domains, we then used ten proteomes from the kingdom across various phyla (Additional File 2). A protein database has been created with all these proteomes and queried this database with the PB1 domain sequences from already known plant (Arabidopsis and Marchantia) and animal (Human) species. BLASTP module in NCBI BLAST 2.7.1+ 42 was employed for this search and InterPro domain database v5.30-69.0 (https://www.ebi.ac.uk/interpro/) was used for domain identification in the BLAST hits. All the sequences that have a PB1 domain have been used for further phylogenetic analysis.
A similar procedure was used to obtain the PB1 sequences from Protozoa (Apusozoa, Amoebozoa and Excavata) and Fungi. However, the proteomes of twelve fungi across multiple phyla have been obtained from MycoCosm database at JGI (https://mycocosm.jgi.doe.gov). For the Protozoa, we have used the six reference proteomes from UniProt (Additional File 3).

Identification of the PB1 domains in Chromista and Plantae.
To identify the PB1 domains in the kingdom Plantae, we employed a large transcriptome resource, 1000 plant transcriptomes (OneKP) database 43 (www.onekp.com). Out of nearly 1300 transcriptomes in the database, we have used 485 transcriptomes in this study, covering all the phyla in the kingdom Plantae. We have adapted a protocol that was developed earlier 31 . In brief, the query PB1 sequences from Arabidopsis were searched against each transcriptome, where the resulting scaffold hits were translated using TransDecoder (v2.0.1; https://transdecoder.github.io). All these translated sequences were checked for the presence of PB1 domains using InterProScan 44 and only those protein sequences with a PB1 domain identified were used for further analysis. In a similar way, for SAR group in Chromista, we have employed another transcriptome dataset, Marine Micro Eukaryote Transcriptome Sequencing Project (MMETSP) database 45 . We have used all the available transcriptomes and adapted a similar protocol as mentioned above (Additional File 2). phylogeny construction and visualization. Using all the PB1 sequences that were identified in all the five kingdoms of eukaryotes, we performed the phylogenetic analysis (Additional Files 1,3,4). The protein sequences were aligned with MAFFT G-INS-i algorithm using default parameters (v7 46 ). Alignment was cleaned up further, where the positions with more than 20% gaps were removed with trimAl, prior to phylogeny construction 47 . ModelFinder (accessed through IQ-TREE) indicated 'LG' as the best model of evolution, of all the 462 models tested 48 . Further, the Maximum Likelihood (ML) method, employed in the IQ-TREE program was used for the phylogenetic tree construction, with 1000 rapid bootstrap replicates and tree branches tested by SH-aLRT method 49 . The resulting tree was manually curated further for some misplaced taxa. An unrooted tree was visualized in iTOL v4 (https://itol.embl.de/shared/dolfweijers). In a similar way, another phylogenetic tree was generated using the PB1 sequences only from three kingdoms (Animalia, Fungi and Plantae).

Alignment of the plant PB1 domains.
To understand the PB1 domains in the plant kingdom further, we have taken the PB1 sequences of all the eight families from four species (Marchantia, Physcomitrella, Amborella and Arabidopsis), aligned them using ClustalOmega 50 . After the alignment, the domains from each family were separated and a sequence logo was generated. All the gene identifiers from these four species are available in Supplementary Table S2. LogOddsLogo server was used for logo generation, with the colour codes for specific amino acids (https://www.ncbi.nlm.nih.gov/CBBresearch/Yu/logoddslogo/proteins.cgi). Amino acid groups 'PAGFLIMV' , 'KRH' and 'DE' were shown in purple, blue and red colours respectively. All other amino acids were shown in black colour.

Random forest (RF) based plant PB1 classification. The Random Forest (RF) method was used to
identify key amino acid descriptors to differentiate and classify each of the PB1 domains into eight plant PB1 families 40 . To make this classification an extensive one, we extracted all the PB1 domains from Plaza Monocots database v4.5, that includes genomes from all the major phyla in Embryophytes 51 . Since the size of each family is different, and to make the analysis uniform and comparable, we have extracted 100 PB1 sequences randomly for each gene family (except 78 for NBR1 as it is a single copy gene). We have used 28 amino acid descriptors (variables) calculated either with 'protr' or 'peptides' R packages 52,53 . Among these, 20 variables correspond to the composition of 20 amino acids, and the remaining eight correspond to the general parameters such as length, molecular weight, hydrophobicity, net charge, isoelectric point (pI), aliphatic index, hydrophobic moment of alpha and beta sheets. We used 'RandomForest' R package to build a maximum of 500 decision trees with 5 variables being tried at each step (www.r-project.org 54 ). Confusion matrix and variable importance plots showing mean decrease in accuracy and gini were obtained. Descriptive plots and other graphs shown were obtained using 'ggplot2' R package. Additional File 5 provides the complete R script that has been used for the RF analysis.

Data availability
All data generated or analyzed during this study are included in this published article and its supplementary information files.