Evolution of networks of protein domain organization

Aziz, M. Fayez; Caetano-Anollés, Gustavo

doi:10.1038/s41598-021-90498-8

Article
Open access
Published: 08 June 2021

M. Fayez Aziz¹ &
Gustavo Caetano-Anollés¹

Scientific Reports volume 11, Article number: 12075 (2021)

Abstract

Domains are the structural, functional and evolutionary units of proteins. They combine to form multidomain proteins. The evolutionary history of this molecular combinatorics has been studied with phylogenomic methods. Here, we construct networks of domain organization and explore their evolution. A time series of networks revealed two ancient waves of structural novelty arising from ancient ‘p-loop’ and ‘winged helix’ domains and a massive ‘big bang’ of domain organization. The evolutionary recruitment of domains was highly modular, hierarchical and ongoing. Domain rearrangements elicited non-random and scale-free network structure. Comparative analyses of preferential attachment, randomness and modularity showed yin-and-yang complementary transition and biphasic patterns along the structural chronology. Remarkably, the evolving networks highlighted a central evolutionary role of cofactor-supporting structures of non-ribosomal peptide synthesis pathways, likely crucial to the early development of the genetic code. Some highly modular domains featured dual response regulation in two-component signal transduction systems with DNA-binding activity linked to transcriptional regulation of responses to environmental change. Interestingly, hub domains across the evolving networks shared the historical role of DNA binding and editing, an ancient protein function in molecular evolution. Our investigation unfolds historical source-sink patterns of evolutionary recruitment that further our understanding of protein architectures and functions.

Introduction

The biological functions of genes manifest through the proteins or functional RNA molecules they encode. In evolution, novel functions appear when genes produce new genes by duplication, mutation, recombination, fusion and fission, or when genes are generated de novo. Research has attempted to quantitatively describe the origins of these processes of molecular diversification and how they increase molecular complexity over the course of evolution, for instance through pathways of protein domain organization^1,2. Protein domains are structural and functional units of evolution that make up proteins^3,4,5, sometimes in unusually complex arrangements^6,7. They fold into compact 3-dimensional (3D) atomic structures that arrange alpha-helical and beta-sheet structure elements into tightly packed conformations of the polypeptide chain⁸. The Structural Classification of Proteins (SCOP)⁹ and its extended version SCOPe¹⁰ are popular taxonomy gold standards of domain structure. SCOP definitions can be used to scan genome sequences for motifs of domains and study how they combine in proteins⁶. In SCOP, domain structures exhibiting similar 3D arrangements of secondary structures, and thus identical topologies, have been classified as folds (F)⁹. Within folds, protein domains whose structural and functional features indicate a common evolutionary origin are further grouped into fold superfamilies (FSF). These FSFs sometimes hold multiple evolutionarily related families, which unify domains with pairwise amino acid identities of more than 30% (Supplementary Fig. S1A). As of March 9, 2021, 276,231 annotated SCOPe domains populate the 175,282 protein structures of the Research Collaboratory for Structural Bioinformatics Protein Data Bank (RCSB-PDB). We note that the cornerstone of the SCOP domain hierarchy is common ancestry, i.e. the existence of shared-and-derived features in domain sequence, structure and function.
Homology is also central to many other domain taxonomies, including CATH¹¹, CDD¹², ProDom¹³, Pfam¹⁴, and the meta-database InterPro¹⁵. Most databases benefit from machine learning. For example, SCOP and Pfam depend on the identification of conserved regions in protein sequences through sequence alignment and background knowledge, which are then used to build probabilistic hidden Markov models (HMMs) of linear sequence analysis. In particular, SCOP uses HMMs of structural recognition to recurrently enrich the database¹⁶ in a framework that increases alignment quality and the stability of family and superfamily relationships. A similar framework drives the Pfam database but focuses exclusively on sequence information. One difficulty is that not all domains fold into discrete structural entities within the space of possible folds¹⁷. Some popular domains overlap within a continuum. This ‘gregariousness’ complicates domain classification, demanding the exploration of super-secondary structural motifs as candidate lower-level units of structure, function and evolution¹⁸.

Domain structures appear repeatedly in the protein molecules, singly or in combination with other domains⁷. More than two-thirds of protein sequences are longer than an average domain length, a vast majority of which are multidomain proteins¹⁹. A study of protein structures in 749 genomes showed that the lengths of orthologous protein families in Eukarya were almost double the lengths found in Bacteria and Archaea²⁰. This variance among lengths results from shorter prokaryotic nondomain sequences that link domains to each other in proteins and have evolved reductively in prokaryotes but not in eukaryotes. The arrangement of domains along the sequence of multimeric proteins is referred to as ‘domain organization.’ Both the structure and organization of domains, which have been collectively termed protein domain ‘architecture’, are considered far more evolutionarily conserved than protein sequence^7,21,22,23. In addition, some domain combinations make up functional units that recur in different protein contexts²⁴. They have been termed supradomains (Supplementary Fig. S1B). Thus, domains and supradomains behave as modules, parts that interact more often with each other than with other parts or modules of the system.

Comparative genomic approaches allow the study of the modular landscape of domain organization. For example, the evolutionary placement of domains in multiple architectural contexts can be quantified by counting distinct neighbors²⁵, domain adjacencies²⁶, or consecutive domain triplets²⁷ in proteins. These measures of ‘versatility’, ‘promiscuity’ or ‘mobility’ (reuse) of domain building blocks depend on both domain size and abundance. Smaller domains are more likely to be used in multidomain proteins and are therefore more mobile^27,28, an observation supported by a Menzerath-Altmann law of domain organization driven by an economy of scale²⁹. Similarly, highly abundant domains appear more versatile, prompting abundance-based normalization of domain versatility measurements when studying intrinsic combinatorial properties and variation across lineages and biological functions^30,31,32.

In order to retrace past events in architectural evolution, statements of history (phylogenies) proposed directly from genomic data must be used to build chronologies of first evolutionary appearance of domains and domain architectures. Unfortunately, protein sequence has limited power in deep retrodictive exploration^7,21,22,23. Furthermore, while structure is conserved over longer evolutionary timescales, a general metric for global pairwise comparison of structures does not yet exist³³. Thus, the systematic classification of protein structure has been unable to unify the widely divergent folded structures at any level of abstraction (e.g. FSFs in a ‘galaxy’ of folds³⁴), likely because different neighborhoods in protein sequence space contain different structures and functions³⁵. Construction of a ‘periodic table’ of idealized structural representations of folds³⁶ has not alleviated this difficulty due to an absence of rules of structural transformation that would explain the comparative framework. Numerous efforts to dissect the evolution of domain architectures have recently been reviewed³⁷. To overcome limitations and produce global evolutionary views of the protein world, the focus shifted from molecules to proteomes. Trees of proteomes were first reconstructed from a proteomic census of structural domains (beginning with ref.³⁸), and were later used to trace character-state changes along their branches to establish possible domain origins^39,40. This approach, however, was restricted to domain structures and architectures appearing after the common ancestor of the proteomes surveyed in the trees. A much more effective way to create truly global chronologies of the protein world was the reconstruction of phylogenomic trees of domain organization (beginning with ref.⁴¹). These phylogenies take advantage of powerful serial homologies defined by the proteomic abundance of domains⁴² or architectures^7,43 defined at F and FSF levels. 
Phylogenomic trees of domain structures helped uncover the natural history of biocatalysis by tracing chemical mechanisms in enzymatic reactions⁴⁴, analyze the optimization and increase of protein folding speed derived from a flexibility-correlated factor known as contact order (the average relative distance of amino acid contacts in the tertiary structure of proteins)⁴⁵, and study the history of an ‘elementary functionome’ with a bipartite network of elementary functional loop sequences and structural domains of proteins⁴⁶. This last study revealed two initial waves of functional innovation involving founder ‘p-loop’ and ‘winged helix’ domain structures, and the emergence of hierarchical modularity and power law behavior in network evolution. Phylogenomic trees of domain architectures and their associated chronologies of molecular accretion showed that architectural diversification evolved through gradual accumulation of domains (singly occurring domains), domain pairs (two different domains), multidomains (numerous domains, with occasional repetition) and domain repeats (domains of one type that are repeated)⁷. The diversification began with a few single-domain architectures earlier in the timeline, followed by an increasing rate of accretion that culminated in a massive “big bang” of domain organization. The accumulation of architectures continued to date but with a decreasing rate^7,46.

Here, we explore the evolving interactome of protein domain organization. We generate a chronology that captures the historical development of domain and multidomain interactions with a graph theoretical approach⁶ of time-varying (evolving) network structure. The chronology was calibrated with a molecular clock of protein structures, which transforms times of origin of domain architectures into geological timescales of billions of years (Gy)⁴⁷. Five distinct composition- and topology-based ‘operative’ criteria of connectivity defined nodes and links of the evolving networks. This strategy identified connectivity distributions in a series of 169 growing networks, hubs of evolutionary recruitment acting as donors and acceptors, and structural adaptations of evolving networks to modular, random and scale-free properties. In particular, we discover a pattern of connectivity driven by fusions and fissions, respectively, with densely linked older and younger architectures from the evolutionary timeline sandwiching a period of sparse connectivity. This supports a biphasic or hourglass pattern previously observed in protein evolution⁴⁸ and follows a model of module emergence⁴⁹. We thus reveal remarkable patterns of emergence of hierarchy, modularity and structural cooption in evolving networks.

Results and discussion

Construction of evolving networks

We build a time series of networks of domain organization embedding evolutionary information derived from the sequence and structure of millions of protein sequences encoded in hundreds of genomes. The goal is to unfold the history of how single-domain and multidomain proteins share domain make-up and how recruitment processes shape protein evolution. An ‘entity set’ of domains, supradomains, and multidomains was first extracted from the genomic census of fold structure and domain organization. This set of component parts of proteins, mostly recurrent, defined the nodes of the networks, which were labeled with concise classification strings (ccs) describing SCOP domain constituents (Fig. 1A). We define supradomains as sub-combinations of domains that appear in the census and are often used as evolutionary building blocks of multidomains. The definition is more inclusive than that of ref.²⁴.

The growing interactions among contemporary architectures are constrained by domain make-up and domain arrangement in the protein chain. These evolving interactions were captured with five different operative criteria for timed network generation defined by composition, pairwise occurrence, adjacency, and splicing of domain parts in a protein molecule, where: (1) composition describes the make-up (component parts) of the molecular whole; (2) pairwise occurrence describes the appearance of parts in sets of two; (3) adjacency refers to their geometrical or spatial arrangement (topology); and (4) splicing refers to the rearrangement of parts by operations of joining and excision that decompose structures (Fig. 1B). The Composition Network (CX) linked domain and supradomain to multidomain nodes (in a partially bimodal fashion) when proteins shared compositional make-up. The Pairwise Network (PX) connected domain to supradomain nodes when components occurred in pairs in a protein. The Pairwise Adjacency Network (PAX) connected domain to supradomain nodes when components occurred in pairs that were adjacent. The Spliced Pairwise Network (SPX) linked domain nodes to each other when their pairs were present in domain-spliced proteins. Lastly, the Spliced Pairwise Adjacency Network (SPAX) linked domain nodes to each other when their adjacent pairs were present in the domain-spliced proteins (Fig. 2).
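The distinction between composition, pairwise and adjacency criteria can be illustrated with a deliberately simplified Python sketch. It assumes architectures are written as ‘|’-delimited strings of domain labels (as in c.2.1|a.100.1), ignores supradomain nodes, and uses hypothetical function names; the exact linking rules are those of Sect. 1 of Supplementary Text, not this code.

```python
from itertools import combinations


def split_architecture(arch):
    """Split a '|'-delimited architecture string into its domain labels."""
    return arch.split("|")


def composition_links(arch):
    """CX-like links: each distinct component domain -> the whole architecture."""
    domains = split_architecture(arch)
    if len(domains) < 2:
        return set()  # a single domain has no multidomain whole to link to
    return {(d, arch) for d in set(domains)}


def pairwise_links(arch):
    """SPX-like links: every unordered pair of distinct domains in one protein."""
    distinct = sorted(set(split_architecture(arch)))
    return set(combinations(distinct, 2))


def adjacent_links(arch):
    """SPAX-like links: only pairs of distinct domains that are sequence neighbors."""
    domains = split_architecture(arch)
    return {tuple(sorted(p)) for p in zip(domains, domains[1:]) if p[0] != p[1]}


# Hypothetical three-domain architecture assembled from labels used in the text.
example = "c.37.1|d.14.1|c.23.16"
print(sorted(pairwise_links(example)))   # three unordered pairs
print(sorted(adjacent_links(example)))   # two neighboring pairs
```

In this simplification every adjacent pair is also a pairwise pair, so the adjacency links form a subset of the pairwise links for any given protein.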

Finally, we mapped the time of origin (age) of individual architectures onto the nodes of networks built using these five operative criteria (Supplementary Fig. S2). We did so for each of the 169 time-events of the timeline. Network construction has been illustrated with connectivity details of the most ancient domains (Supplementary Fig. S3) and further described in Sect. 1 of Supplementary Text. Networks showcased time directionality, connectivity distributions, and network layouts:

1. Time Directionality: Mapping ages onto networks helped follow their evolutionary growth, as nodes and links accumulated over time since the origin of proteins to the present. The timeline of networks imposed a time directionality on network links, making them arcs (directed edges with arrows pointing from older to younger nodes) of directed graphs (Fig. 1C). The ages of arcs were borrowed from the youngest of the component nodes involved in a link (Supplementary Fig. S3B).
2. Degree Distributions: The number of links connected to a node defines that node’s ‘degree’. The degree distribution is a ‘composability’ attribute of a network and the entity set represented by its nodes, a design principle describing the inter-relationship of components of a system. In network evolution, the appearance of a new node may trigger establishment of one or more arcs from existing (older) nodes. Furthermore, outdegree describes the number of outward links from a node and indegree the number of inward links to it. As the timeline progresses, older nodes gain higher outdegrees as compared to the higher indegrees of recent nodes (Fig. 1C), polarizing the network with arcs depicting ‘arrows of time’ (Supplementary Figs. S2 and S3). The chronological appearance of architectures (domains, supradomains and multidomains) as network connectivity expands along the timeline causes degree to accumulate in the evolving networks (Fig. 2). Multiple interactions of nodes along the timeline diversified connectivity, a feature captured and quantified by weighted degree. Interestingly, box-and-whisker plots of weighted outdegree and indegree demonstrate bimodal degree distributions typical of biological systems^49,50 (Supplementary Fig. S4). The yin-yang patterns of contractions and expansions of architectural innovation are evident from the distributions of modern outdegrees and indegrees (Supplementary Fig. S5). In particular, the cumulative outdegree and indegree scattergrams demonstrate an hourglass (or bimodal) pattern of linkage development unfolding in evolution (Supplementary Fig. S6).
3. Time Event-based ‘Radial’ and ‘Waterfall’ Layouts: The growth of a network evolving at discrete temporal intervals can be modeled with Discrete Event Simulation (DES) tools^51,52,53. Borrowing the DES rationale, we modeled the evolution of directed networks of domain organization with time flowing from one event to another as discrete evolutionary ‘time steps’, typical of a step function. The progression of events was visualized with two types of layouts, a vertical representation we termed the ‘waterfall’ layout that had nodes arranged top-down by age and a concentric ‘radial’ representation of growing networks that unfolded time-events of protein evolution from center to periphery (Fig. 1C). Network clusters comprising hubs and their cohesive neighbors were segregated to improve differentiation along the horizontal axis. The waterfall and radial layouts made evolutionary recruitment evident as time events progressed downward or outward, respectively (Figs. 2 and 3).
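The time directionality and degree accumulation described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: node ages are nd values in [0, 1] with 0 the oldest, the function names are hypothetical, and the toy ages are invented for the example rather than taken from the census.

```python
from collections import defaultdict


def orient_links(links, age):
    """Turn undirected links into arcs pointing from the older to the younger node.

    The arc inherits the age of its youngest endpoint, as described in the text.
    """
    arcs = []
    for u, v in links:
        older, younger = (u, v) if age[u] <= age[v] else (v, u)
        arcs.append((older, younger, age[younger]))
    return arcs


def degrees_at(arcs, t):
    """Outdegree and indegree of each node, counting only arcs of age <= t."""
    outdeg, indeg = defaultdict(int), defaultdict(int)
    for source, sink, arc_age in arcs:
        if arc_age <= t:
            outdeg[source] += 1
            indeg[sink] += 1
    return dict(outdeg), dict(indeg)


# Toy data: three nodes with made-up ages (not values from the study).
age = {"A": 0.0, "B": 0.2, "C": 0.5}
arcs = orient_links([("B", "A"), ("A", "C"), ("C", "B")], age)
print(degrees_at(arcs, 0.3))  # only the A -> B arc exists this early
print(degrees_at(arcs, 1.0))  # the oldest node A is the main donor
```

Evaluating `degrees_at` at successive time-events reproduces, in miniature, how older nodes gather outdegree while younger nodes gather indegree.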

Early history of modern domain organization

The accumulation of links connecting domain, supradomain and multidomain proteins in evolving CX, PX, PAX, SPX and SPAX networks played back the complicated history of domain recruitments that drive the evolution of domain organization. Figure 2 shows networks in radial layout at representative time-events defining boundaries of the three epochs of the evolving protein world (‘architectural diversification’, ‘superkingdom specification’ and ‘organismal diversification’, sensu^7,42). Networks grew in time and became increasingly complicated tangles, massively expanding after a “big bang” of domain combinations during the organismal diversification epoch. Movies described the evolutionary growth of these networks (Supplementary Video 1).

To illustrate the versatility of the waterfall visualization strategy, we dissected the early origin of proteins with the SPX network. Two major waves of structural innovation arising from ancient ‘p-loop’ and ‘winged helix’ domains were observed in the waterfall diagrams of a highly connected (reduced) subnetwork visualization of the SPX network (Fig. 3), matching similar recruitment waves observed in the study of evolutionary networks of elementary functionomes⁴⁶ and metabolites⁵⁴. Waves originated in primordial α/β/α-layered sandwich, β-barrel and helical bundle structures identified in an earlier structural phylogenomic study as part of the most ancient 54 protein domain families⁵⁵. However, most of the connectivity of these major pathways was established during the organismal diversification epoch less than 1.5 Gy ago (nd ≥ 0.6) and hence was fully developed relatively recently in evolution. The ‘p-loop’ and ‘winged helix’ waves embedded the major gateways of enzymatic recruitment we previously reported for metabolism⁵⁴. The first gateway was mediated by the c.37 P-loop hydrolase fold and originated in the energy interconversion pathways of the purine metabolism subnetwork. The second pathway was mediated by the a.4 winged helix fold and originated in the biosynthesis of cofactors and the metabolic subnetwork of porphyrin and chlorophyll^54,56,57. The congruent realization of these evolutionary patterns with data sources of different types is remarkable (Supplementary Video 2). It strongly supports the historical statements we propose. Further information can be found in Sect. 2 of Supplementary Text.

Network analysis of cooption mechanisms of recruitment

The networks of domains (SPX and SPAX) elicited 161 unique time-events along the evolutionary timeline, out of a total 169 events expected for networks of domains, supradomains and domain combinations (CX, PX and PAX) (Supplementary Tables 1–5). The node and connectivity distributions among the time-event bins of the evolving networks highlight the widespread, growing and recurrent combinatorial recruitment process that incorporates domains and their combinations into protein scaffolds and drives structural evolution (Fig. 2). Indeed, the largest hubs representing the most popular domains in the highly connected SPX subnetwork appeared not only early in evolution but also in the modern protein world (Fig. 3). Similar to the evolution of elementary functions⁴⁶, domain innovation also developed early during the first ~ 1.8 Gy of protein history (Fig. 3). The combinatorial recruitment process however spanned the entire timeline (Supplementary Fig. S2). In terms of origins, the first donor and acceptor composition event occurred in protein evolution with the appearance of a link in the CX network connecting domain c.2.1 to domain combination c.2.1|a.100.1, ~ 3.54 Gya (nd = 0.069). The first donor and acceptor pair occurred in the pairwise PX and SPX networks ~ 3.12 Gya (nd = 0.179), ~ 0.42 Gy later (Δnd = 0.11). The pairing event involved domains c.37.1 and d.14.1. The first adjacent donor and acceptor pair of the adjacency-based PAX and SPAX networks appeared ~ 2.90 Gya (nd = 0.237), ~ 0.22 Gy later (Δnd = 0.06). The adjacently paired nodes were domains c.37.1 and c.23.16. These observations highlight a remarkable tendency of domain organization to gradually but recurrently constrain pairwise occurrences in multidomain proteins. The evolutionary history of donors and acceptors of domain organization is hence associated with a highly optimized process of cooption. 
To explore this combinatorics, we first dissected the network connectivity with bar plots that describe the chronological accumulation of links along the evolutionary timeline (Supplementary Fig. S7). This made general patterns quantitative and source-sink relationships explicit. Second, we analyzed the per unit donor/acceptor ratio in the evolving networks to highlight pairwise cooption and composability, respectively (Supplementary Fig. S8). Specifically, domain acceptors (represented by network indegree) of SPX increased in number to a global average of 8.63 (± 0.15) sinks per domain in evolution. Domain donors (represented by network outdegree) of SPX reached a higher global average of 9.7 (± 0.56) sources per domain, indicating significant reutilization of relatively ancient domains. In contrast, the average number of donors and acceptors in the evolving CX network plateaued at 3.41 ± 0.34 sources and 3.43 ± 0.05 sinks per domain/multidomain, respectively. This showed uniform source/sink evolutionary rates as proteins acquired higher composability with time. Third, an inferential analysis of cooption-based source-sink relationships maturing at modern times revealed an independence of patterns from the selected network generation criteria (Supplementary Fig. S9). Primarily, the composition events yielding source domains and supradomains were dominant, with the number of events almost doubling in the CX network from the origin to the organismal diversification epoch ~ 1.5 Gya (nd = 0.6). However, the pairwise cooption events of the SPX domain network, e.g., doubled in number and reached relatively comparable levels in evolution only after delays of ~ 0.6 Gy (Δnd = 0.15) and ~ 2.1 Gya (nd = 0.75), respectively. Moreover, the number of cooption events yielding sink domains in SPX almost tripled by the beginning of the organismal diversification epoch. In contrast, the number of CX sinks reached that level only halfway along that evolutionary epoch. 
These divergent patterns indicate a frustrated dynamics of network growth. The early adoption of composability of domains and supradomains in multidomains seems to have preceded the pairwise cooption of domains in protein history, leading to the numerous recruitment pathways of the modern protein world. A discussion on the source-sink relationships impacted by domain fusion and fission processes can be found in Sect. 3 of Supplementary Text.
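Several of the age pairs quoted in this section (nd = 0.069 at ~ 3.54 Gya, nd = 0.179 at ~ 3.12 Gya, nd = 0.237 at ~ 2.90 Gya and nd = 0.6 at ~ 1.5 Gya) are consistent with a roughly linear relationship between nd and geological time. The Python sketch below makes that arithmetic explicit. The 3.8 Gy span and the function names are assumptions read off those quoted pairs for illustration only; they are not the molecular clock calibration of ref. 47.

```python
# Assumed span of the timeline in Gy, inferred from the (nd, Gya) pairs quoted
# in the text; this is an illustrative value, not the published calibration.
ASSUMED_SPAN_GY = 3.8


def nd_to_gya(nd):
    """Approximate age in Gya for a node distance nd in [0, 1] (0 = oldest)."""
    return ASSUMED_SPAN_GY * (1.0 - nd)


def delta_nd_to_gy(delta_nd):
    """Approximate elapsed time in Gy for a difference in nd."""
    return ASSUMED_SPAN_GY * delta_nd


for nd in (0.069, 0.179, 0.237, 0.6):
    print(f"nd = {nd:.3f} -> ~{nd_to_gya(nd):.2f} Gya")
```

Under this assumption, the Δnd = 0.11 separating the first composition event from the first pairing event corresponds to about 0.42 Gy.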

Hubs in network evolution

Network hubs are at the heart of network connectivity and could chaperone network evolution²⁶. We ranked modern domains and domain combinations of age nd = 1 as hubs based on the 99.9th percentile of indegree and outdegree. Hubs were annotated with domain organization attributes, including SCOP domain descriptions, age, fusional/fissional information, and GO terms. We also associated hubs with age ranks reflecting their order of evolutionary appearance in the timeline.
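A percentile-based hub cutoff of this kind can be sketched as follows. The nearest-rank percentile convention and the function names are assumptions made for illustration; the text does not state which percentile estimator was used.

```python
import math


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def hubs(degree, p=99.9):
    """Nodes whose degree is at or above the p-th percentile of all degrees."""
    cutoff = percentile_nearest_rank(degree.values(), p)
    return {node for node, k in degree.items() if k >= cutoff}


# Toy degree table: 1001 hypothetical nodes with degrees 1..1001.
toy_degree = {f"n{i}": i for i in range(1, 1002)}
print(sorted(hubs(toy_degree)))  # the two best-connected nodes
```

Applied separately to indegree and outdegree tables, the same function would return acceptor and donor hubs, respectively.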

The most notable donor hubs for all network types were the carrier protein domains e.23.1, a.28.1 and c.69.1, which are involved in Non-Ribosomal Peptide Synthesis (NRPS), whether directly or indirectly through other pathways (Table 1). These domains diversified later in evolution yielding cofactor-binding molecular switches and barrel structures⁵⁵. Ancient NRPS pathways of domain accretion have been associated with a model that not only described stabilization and decoration of membranes by primordial alpha-helical bundles and beta-sheets, but also explained primordial protein synthesis and genetic code specificity chaperoned by ancient forms of aminoacyl-tRNA synthetase (aaRS) catalytic domains and NRPS modules. NRPS even preceded the emergence of the ribosome, acting as scaffold for nucleic acids and the modern translation function. In particular, the PX and PAX networks highlight the central evolutionary role of these novel emerging cofactor structures in the NRPS pathways. Thus, our findings made explicit that our connectivity criteria for generating networks of domain organization were at the cornerstone of the early development of the genetic code and supported the evolutionary model of early biochemistry based on phylogenomic information and network structure.

Table 1 Domains and domain combinations scoring ≥ the 99.9th percentile values of 249.916, [63] and {23}, based on combined outdegrees of the five networks at time points 1.0, [0.676] and {0.671}, respectively. The square and curly brackets denote values from the events after and before the big bang, respectively. N/A, not applicable.

Domains c.30.1, b.1.1, d.142.1 and g.3.11 (0.723 < nd < 0.977) were the most prominent acceptor hubs (Table 2). These structures are integral parts of two-component signal transduction systems that are common in microbes. The highly modular domains feature dual response regulator proteins involved in the two-component signal transduction system comprising an N-terminal response regulator receiver domain and a variable C-terminal effector domain with DNA-binding activity. These proteins are transcriptional regulators in bacteria and some protozoa, detecting and responding to environmental changes, e.g. nitrogen fixation. These evolving interactions of microbes with the environment mediated by two-component systems have apparently influenced the evolutionary process of cooption. Three acceptor hubs that were significant in PX with indegree > 250 (trailing the 99.9th percentile in other networks) were Nucleotide cyclase (d.58.29), Spermadhesin, CUB domain (b.23.1), and Fibronectin type III (b.1.2) (nd = 0.723–0.809). See Sect. 4 of Supplementary Text for additional donor/acceptor hub information, and Sect. 5 for cooption events occurring during the ‘big bang’ of domain organization.

Table 2 Domains and domain combinations scoring ≥ the 99.9th percentile values of 247.977, [20] and {5}, based on combined indegrees of the five networks at time points 1.0, [0.676] and {0.671}, respectively. The square and curly brackets denote values from the events after and before the big bang, respectively. N/A, not applicable.

Emergence of preferential attachment in network evolution

Genomic-centric processes such as duplication, recombination, fusion and fission shape patterns of molecular complexity². Many of these patterns can be explained with large ‘scale-free’ networks that grow by following the preferential attachment principle⁵⁸. These self-organizing and highly inhomogeneous networks attach links to highly connected hub-like nodes in a ‘rich-get-richer’ fashion, lacking a characteristic scale, irrespective of the properties of individual nodes or systems⁵⁹. This pattern of network expansion, which is remarkably popular in biology⁶⁰, is sharply distinct from that of the Erdős–Rényi random network model^61,62. In a scale-free network, the probability P(k) that a node connects to k neighboring nodes (i.e. the fraction of nodes with k links) decays as a power law, P(k) ~ k^(−γ), with γ defined as the exponent of power law decay. The frequency distributions of node-connectivity in biomolecular networks have γ typically ranging 2.1–2.4⁶³. Thus, scale-free properties drive degree distributions with heavy tails, where very few nodes have high degree values.
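Because P(k) ~ k^(−γ) becomes a straight line of slope −γ on log-log axes, the exponent can be estimated by ordinary least squares on log-transformed values. The Python sketch below illustrates only that idea; the function names are hypothetical, it does not reproduce the KS statistics reported in this study, and least-squares fits on log-log data are known to be less reliable than maximum-likelihood estimators.

```python
import math
from collections import Counter


def degree_distribution(degrees):
    """Empirical P(k): fraction of nodes having each positive degree k."""
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}


def fit_gamma(pk):
    """Estimate gamma as minus the least-squares slope of log P(k) versus log k."""
    xs = [math.log(k) for k in pk]
    ys = [math.log(p) for p in pk.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return -sxy / sxx


# Synthetic check: an exact power law with exponent 2.5 is recovered.
synthetic = {k: k ** -2.5 for k in range(1, 51)}
print(round(fit_gamma(synthetic), 6))
```

The fit needs at least two distinct degree values; with real, noisy degree counts the estimate also depends on how the sparse tail is binned.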

Our statistical analyses of the featured indegree distributions along the timeline of growing networks uncovered interesting patterns of power law dynamics (Fig. 4). The scale-free patterns were established early on in protein evolution, primarily evident in the CX composition network. These patterns were remarkably divergent from evolving networks connected at random (RVN p value > 0.05). While power law behavior generally declined as the networks evolved (KS p value < 0.05, α < 2.5), it was somewhat sustained after the ‘big bang’, but only in CX and not in the pairwise networks (KS fit and γ closer to 0 and 2 in CX, respectively). A log-linear regression model of CX produced the highest absolute value for γ of 3.81 among the five networks, which was achieved early along the evolutionary timeline (nd ~ 0.25). This value of γ was much higher than values reported for metabolic networks (γ ~ 2.2)⁶⁰. Remarkably, γ was maintained at ~ 3 before and after the ‘big bang’, while remaining at ~ 2 until modern times with a minimum value of 1.7. The other four networks generated primarily with the pairwise criterion apparently deviated from the power-law behavior, especially after the ‘big bang’. For instance, the γ of PX and PAX peaked at 2.4 (nd ~ 0.35) and 3.2 (nd ~ 0.38), respectively, slightly later than CX. We also noted a transition in γ from 2.1 in PX and 2.7 in PAX prior to the ‘big bang’ to 1.6 in both after the big bang, plateauing at ~ 1 until the present. In the SPX and SPAX networks, γ reached a peak even later in time than PX and PAX with values of 2.8 (nd ~ 0.54) and 3.4 (nd ~ 0.66), respectively. These values transitioned from 2.4 in SPX and 2.8 in SPAX from before the big bang to 1.6 and 1.7 after the big bang, respectively, plateauing at ~ 1 in both networks. As expected, the average γ based on the less representative outdegree of each of the five networks remained low (1 ± 0.05).

We noticed biphasic patterns when -γ was plotted over network connectivity, with two minima at nd ~ 0.37 and ~ 0.67. Moreover, the scale-free tendency of adjacency networks seemed comparatively higher than that of networks lacking the adjacency restriction. For instance, the average values of γ for the PAX and SPAX networks (1.87 ± 0.06 and 2.13 ± 0.07, respectively) were relatively higher than those for the corresponding parent PX and SPX networks (1.61 ± 0.05 and 1.89 ± 0.06, respectively). This suggests that proximity in amino acid sequence plays a major role in rendering the power-law behavior of evolving networks of domain organization. Overall, the average γ of CX (2.56 ± 0.06) remained the highest along the evolutionary timeline, indicating that composition strongly elicits the preferential attachment property. A complementary transition from random to non-random behavior (RVN p value: 1 → 0) in ancient networks (nd ~ 0.3) implies deviation from randomness as biological networks evolve. Remarkably, this transition event coincides with the origin of a processive ribosome. Such biphasic patterns are common in biology and have explained the emergence of biological modules⁴⁹ in metabolic networks of Escherichia coli⁵⁰, networks of elementary functionomes⁴⁶, and molecular ancestry networks of enzymes⁶⁴. Section 6 of Supplementary Text further discusses scale-freeness and randomness of networks.

Emergence of hierarchical modularity

Modular networks embed sets of communities (closely-knit modules) that establish links preferentially within themselves and do so sparsely with the rest⁶⁵. Network modularity usually offsets the power-law behavior of biological networks by distributing node degrees within communities^66,67,68. However, both scale-free properties and modular structure may co-exist in a network when modules coalesce hierarchically⁶⁰. A primary index of modularity is the average clustering coefficient (C), defined as a node-averaged ratio of triangles (graph cycles of length 3) to triads (the connected graph triples) of the network, not taking into account the weights or direction of the node-links^60,69,70 (Fig. 5). The adjacency PAX and SPAX networks both showed the lowest C (averaged over nd) with a value of 0.09 ± 0.009. The composition CX network had a relatively higher C of 0.2 ± 0.009. However, the non-adjacency pairwise PX and SPX networks had the highest C values of 0.5 ± 0.02 and 0.32 ± 0.014, respectively. These values were still lower than those reported for metabolic networks (C ~ 0.6)^60,68,71. Hence, the networks appear to have evolved smaller, more random modules connected by various inter-modular links, rather than stronger, larger modules with few interconnections. Also, the evolution of modular structure appeared better consolidated by pairwise (PX and SPX) and, to a lesser degree, composability (CX) constraints rather than by adjacency (PAX and SPAX). Comparing patterns of modularity of evolving networks to those of randomness (given by the RVN p value) indicated complementary transitions between the two behaviors over the evolutionary timeline (Figs. 4 and 5).
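As a minimal illustration of the definition of C given above, the following Python sketch computes the node-averaged ratio of triangles to connected triples on a simplified (undirected, unweighted) graph. It is not the igraph routine used in the study; the function name `average_clustering`, the edge-list input and the choice to count nodes with fewer than two neighbors as zero are assumptions made only for this example.

```python
from itertools import combinations

def average_clustering(edges):
    """Average local clustering coefficient of a simplified graph."""
    # Build neighbour sets, ignoring direction, weights, self-loops and duplicates.
    neighbours = {}
    for u, v in edges:
        if u == v:
            continue
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    total = 0.0
    for node, nbrs in neighbours.items():
        k = len(nbrs)
        if k < 2:
            continue  # assumption: a node without any connected triple contributes 0
        # Triangles through the node equal the linked pairs among its neighbours.
        triangles = sum(1 for a, b in combinations(nbrs, 2) if b in neighbours[a])
        triples = k * (k - 1) / 2  # connected triples centred on the node
        total += triangles / triples
    return total / len(neighbours) if neighbours else 0.0

# Hypothetical toy graph: a triangle (a, b, c) with one pendant node d.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(average_clustering(toy_edges))
```

In the toy graph, nodes a and b each close their single triple, node c closes one of three, and node d has no triple, so the average is (1 + 1 + 1/3 + 0)/4.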

To dissect the modular behavior of evolving networks, we studied the regression patterns of C against network size N and evolutionary age nd. For typical scale-free models, C declines sharply with increasing N (C ~ N^-coefficient), with coefficients as high as 0.75⁷². In contrast, C in highly modular networks is typically independent of N⁶⁰. In our networks, C regressed on N with very low coefficients (CX, 0.000036; PX, 0.00007; PAX, 0.000035; SPX, 0.00016; SPAX, 0.00016). In contrast, the regression of C on age (C ~ nd^-coefficient) produced much higher coefficients (CX, 0.39; PX, 0.85; PAX, 0.39; SPX, 0.35; SPAX, 0.41) (Fig. 5). As expected⁷³, the reference power-law (Barabási) networks that were used as controls showed a C of zero. Our data strongly suggest the existence of a highly modular structure that is independent of network growth but is strongly constrained by history, especially when considering the pairwise interactions of the PX network. The rise of the modularity index with emerging power-law degree distribution during certain periods of network evolution indicated a parallel formation of complex hierarchical module clusters with scale-free properties, not unlike those present in metabolic networks⁶⁰. Our networks of domain organization showed a slight lag between an onset of scale-free organization (measured with KS fit and γ indegree statistics) and a delayed emergence of modular behavior (measured with C), occurring during early protein evolution. This was followed by intermittent periods of hierarchical modularity spanning the middle of the evolutionary timeline. Remarkably, the evolving networks again showed a prominent biphasic pattern of hierarchical modularity involving two peaks of modularity (higher statistic C) coinciding with increased power-law behavior (valleys of KS fit and −γ curves), at nd ~ 0.37 and nd ~ 0.67, respectively (Figs. 4 and 5).
The modularity heatmaps and dendrograms of select phases of network evolution confirm these biphasic patterns (Fig. 6), which were markedly distinct from the long-tailed clustering patterns of preferential attachment (Supplementary Fig. S10). As identified earlier⁴⁶, the timing of this switch coincides with the early development of genetic code specificity in the emerging ribosomal aaRS catalytic domains, which was facilitated by the OB-fold structure⁷⁴. These counteracting and delicately balanced trends of modularity and preferential attachment suggest that the emergence of scale-free behavior of the partial bipartite CX network must have impacted the hierarchical modular structure of the modern pairwise networks of domain organization (PX, PAX, SPX, SPAX) (Supplementary Video 3). A detailed account of our testing and verification of this conjecture is given in Section 7 of Supplementary Text.

Conclusions

Here we present, for the first time, an evolutionary chronology of networks of domain organization. Tracing the time events of origination of protein domain architectures in these growing networks revealed major evolutionary pathways of molecular recruitment of domains and functions. Two prominent ancestral waves of structural novelty involved ancient domain innovations and founder ‘p-loop’ and ‘winged helix’ domain structures. We found that evolutionary recruitment in proteins is ongoing and highly modular. Remarkably, the networks highlighted the role of cofactor-supporting structures of non-ribosomal peptide synthesis (NRPS) pathways, which formed the backbone of the early evolution of the genetic code. The evolving domain rearrangements featured multitier evolutionary episodes of scale-free network structure, hierarchy and modular behavior. Our analyses also support biphasic patterns of diversification and module emergence that we have observed earlier^46,49. In an initial phase, at the cusp of architectural diversification, the modular components of emerging domain organization associated through weak linkages of recruitment. The second phase was massive and prolonged, with a multitude of modules appearing after the ‘big bang’ of the protein world, supporting the onset of organismal diversification. Such biphasic patterns are prevalent in biology and impact size, dipeptide makeup, and loop-mediated flexibility of proteins, possibly due to their intrinsic disorder^45,74. We propose that biphasic patterns in evolving networks are integral to module emergence in biological history. We encourage further study of their structure and origin.

Methods

Experimental design

Phylogenomic analysis of the entity set of protein domain architectures

We explore the evolution of networks describing how structural domains combine and split to form single domain and multidomain proteins, i.e. the domain organization of proteins. The definition of protein domain structures followed the FSF level of SCOP version 1.75⁹ (Fig. 1). Domain interactions were studied along an evolutionary timeline of structural and architectural innovation directly derived from a phylogenomic tree of architectures reconstructed from an HMM-based census of structural domain organization of 1,730 FSF structures from ~ 3 million protein sequences encoded in 749 genomes of 52 archaeal, 478 bacterial and 219 eukaryal organisms (dataset A749)²⁰ (Supplementary Fig. S1). The percentage of proteins with structural assignments was 62.2 ± 0.09% (SD) (see Table S1 and discussion in ref.²⁰). The tree was generated using maximum parsimony as the optimality criterion in PAUP* following the parsimony ratchet search strategy described in ref.⁷. Data matrix and tree files are provided as Supplementary Files 1–3 at https://github.com/gcalab/SciRep. The phylogeny represents a reconstruction of the “natural history” of proteins that is supported by a model of protein structural growth⁷⁵ and is carefully indexed with various evolutionary epochs of the protein world⁷.

Calculation of the ages of domain organization

The ages of domains and domain combinations were calculated as node distance (nd) values, which were derived directly from the rooted phylogenomic tree of protein domain organization⁷. nd values describe relative ages (in a relative 0–1 scale) of first appearances of 6,162 domains and domain combinations (multidomains) defined at SCOP FSF level (the extant ‘entity set’ sampled by our study; Fig. 2). Collectively, ages defined an evolutionary timeline embodying architectural transformations and molecular transitions mediated by fusion and fission processes in the form of 169 unique ‘time events’ (age groups or time slivers) (Supplementary Fig. S2). A Python script was used to count the number of nodes from the root (base) of the tree to each leaf node and present the distance matrix of nodes in a relative zero-to-one scale⁴⁶. The script utilized the high imbalance of phylogenomic trees as a fundamental feature to derive the relative ages of domain organization⁷. The tree imbalance resulted from the accumulation of structures and their combinations in proteins and proteomes and not from node density, thus representing a true evolutionary process⁴⁷.
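The node-counting step described above can be sketched as follows. This is not the authors' script: the child-to-parent dictionary `parent`, the function name `node_distances`, and the convention of counting internal nodes between the root and a leaf (excluding the root itself, so that the most basal leaf receives nd = 0) are assumptions made for illustration.

```python
def node_distances(parent):
    """Relative age (nd) of every leaf of a rooted tree, scaled to 0-1.

    `parent` maps each node to its parent; the root maps to None.
    """
    internal = {p for p in parent.values() if p is not None}
    leaves = [node for node in parent if node not in internal]

    def nodes_from_root(leaf):
        # Walk up to the root, counting internal nodes crossed (root excluded).
        count = 0
        node = parent[leaf]
        while parent[node] is not None:
            count += 1
            node = parent[node]
        return count

    raw = {leaf: nodes_from_root(leaf) for leaf in leaves}
    deepest = max(raw.values())
    if deepest == 0:
        return {leaf: 0.0 for leaf in raw}
    return {leaf: count / deepest for leaf, count in raw.items()}

# Hypothetical, highly imbalanced (comb-like) tree with four leaves A-D.
toy_tree = {
    "root": None,
    "A": "root", "n1": "root",
    "B": "n1", "n2": "n1",
    "C": "n2", "D": "n2",
}
print(node_distances(toy_tree))
```

On such an imbalanced tree the most basal leaf sits at nd = 0 and the most derived leaves at nd = 1, which mirrors how tree imbalance is used to order first appearances.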

The timeline was calibrated with a molecular clock of FSF structures (t = –3.831nd + 3.628) used to calculate geological age in Gy through calibration points of FSF domains associated with microfossil, fossil and biogeochemical evidence, biomarkers, and first-appearance of clade-specific domains⁴⁷. The RCSB PDB count was determined by following the hyperlink associated with the number of entries or structures (which is updated weekly) and selecting ‘Customizable Table’ from the ‘Reports’ menu above the results section. Subsequently, SCOP, CATH, and PFAM ID options were selected as domain information under the ‘Domain Details’ section and domain counts data were exported as a comma-separated value (.csv) file report. Supplementary Tables 1–5 provide an exhaustive summary of various connectivity categories of evolving networks based on this ‘entity set’ of domain organization. The extraction pipeline of SPX/SPAX domain units from the original data set can be found in Supplementary Table 6.
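As a worked example of the calibration formula t = –3.831nd + 3.628, the short Python sketch below converts a relative age into a geological age in Gy and back. The function and constant names are hypothetical; only the slope and intercept come from the text.

```python
SLOPE = -3.831     # Gy per unit of nd, from the molecular clock quoted above
INTERCEPT = 3.628  # Gy at nd = 0

def geological_age_gy(nd):
    """Convert a relative age nd (0-1 scale) into a geological age t in Gy."""
    if not 0.0 <= nd <= 1.0:
        raise ValueError("nd is defined on a relative 0-1 scale")
    return SLOPE * nd + INTERCEPT

def relative_age(t_gy):
    """Invert the clock: recover nd from a geological age in Gy."""
    return (t_gy - INTERCEPT) / SLOPE

# Ages of the two modularity peaks reported at nd ~ 0.37 and nd ~ 0.67.
for nd in (0.37, 0.67):
    print(nd, round(geological_age_gy(nd), 3))
```

Because the relation is linear, the inverse function can be used to place a dated calibration point back onto the nd timeline.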

Indexing domain attributes

Domain ages and assignment of fusional/fissional properties followed ref.⁷. SCOP concise classification strings (ccs) of domain descriptions⁹ were downloaded from http://scop.mrc-lmb.cam.ac.uk/scop/parse/index.html for SCOP version 1.75 as the file dir_des_scop_txt_1_75.txt. Available descriptions for 2,223 single domains were obtained from SCOP unique identifiers (sunID). The Gene Ontology (GO) specifications were recorded from the Superfamily Database (SUPFAM) available at http://supfam.cs.bris.ac.uk/SUPERFAMILY/GO.html. High-coverage domain-centric GO annotations that were supported only by all UniProts (including multidomain UniProts) were downloaded as the file Domain2GO_supported_only_by_all.txt. High-quality truly domain-centric GO annotations that were supported by both single domain UniProts and all UniProts (including multidomain UniProts) were downloaded as the file Domain2GO_supported_by_both.txt. We reported only the GO annotations ‘by all’ to capture higher coverage. Also, the GO terms were reported only for the 2,223 single domains with descriptions available. Specialized GO annotations from two levels of hierarchy downstream were taken from files Domain2GO-Hie-Dist1.csv and Domain2GO-Hie-Dist2.csv. Structural domain functional ontology (SDFO) terms, which were mapped from an information-theoretic analysis of Domain2GO annotation profiles, were reported from the file SDFO.txt.

Network construction, visualization and analysis

Mathematical definitions for construction of networks can be found in Supplementary Materials and Methods. The social network analysis tool Pajek⁷⁶ and the igraph package⁷⁷ of the R statistical environment were used to visualize and analyze the networks, respectively. The collective impact of events was made explicit by Pajek’s Visualization of Similarity (VOS) clustering method^78,79. VOS helped reveal communities and design layouts of networks with nodes separated into network modules, with high modularity indices ranging from 94 to 95%. The number of clusters varied among networks (CX, 691; PX, 3,886; PAX, 4,126; SPX, 607; SPAX, 620). Network clusters were visually compacted to hubs and their cohesive neighbors with the energy-optimizing Kamada-Kawai ‘separate components’ algorithm⁸⁰. Pajek allowed us to proportionally reduce the size of highly connected nodes by a scaling factor for optimally uncluttered visualization. Waterfall and radial network layouts were designed with node size scaled down by factors of 0.1 and 0.25, respectively. R packages equipped with specialized code constructs to draw graphs and derive statistics were used to analyze network properties^81,82. We also wrote custom scripts in the PHP scripting language (PHP: Hypertext Preprocessor) that generated radial visualizations of the networks and helped conduct housekeeping data management⁸³. The PHP scripts were executed from the command line. Results of these scripts were input into Pajek’s and R’s analytical procedures. We used the open-source software ImageMagick (www.imagemagick.org) for batch conversion, captioning, and appending of network images (to represent legends and scales). A detailed description of partition and data files, list of network data analysis functions, charting and graphing procedures, methods to generate power law statistics, modularity indices and randomness checks, and the method pipeline used to achieve waterfall diagrams can be found in Supplementary Materials and Methods.

Statistical analysis

Scale-free network behavior

Linear regression models of P(k) given k (i.e. the probability of having k neighbors) were used to derive the γ coefficient of the power law distribution and the determination coefficient, R². The value of γ represents the absolute slope of the log-linear model of P(k) versus k; the slope itself is usually ≤ 0. γ >> 1 indicates a strong tendency towards preferential attachment. R² indicates the proportion of variance in the data that is explained by the linear model. High values of both γ and R² suggest strong scale-free behavior. Additional power law statistics were calculated as: (1) the exponent of the fitted power law distribution, α, with an assumption that P(X = x) is proportional to x^–α; (2) the KS fit statistic to compare the input degree distribution with that of the fitted power law; and (3) the KS p value of a statistical test, with the null hypothesis that the data are drawn from a power law distribution^84,85. α >> 1, 0 < KS fit scores << 1, and KS p values ≥ 0.05 suggest that degree data were derived from a fitted power law distribution. Maximum log likelihood of the fitted scale-free parameters was also determined. Control networks generated with the ‘Barabási’ methods⁵⁸ of the igraph package from R⁷⁷ were included for reference. These controls simulated basic and extended age-dependent power law graph models given varying sizes of the evolving networks.
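The log-linear regression that yields γ and R² can be illustrated with a small, dependency-free Python sketch. It assumes an ordinary least-squares fit of log10 P(k) on log10 k over the observed degrees with k > 0; the function name `fit_loglinear`, the base of the logarithm and the toy degree list are assumptions, and the KS-based statistics (α, KS fit, KS p value) are not reproduced here.

```python
import math
from collections import Counter

def fit_loglinear(degrees):
    """Least-squares fit of log10 P(k) on log10 k; returns (gamma, r_squared)."""
    counts = Counter(k for k in degrees if k > 0)  # log is undefined at k = 0
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    n = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / n) for c in counts.values()]  # empirical P(k)

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # usually <= 0 for degree data
    r_squared = sxy * sxy / (sxx * syy) if syy > 0 else 1.0
    return abs(slope), r_squared           # gamma is the absolute slope

# Hypothetical indegree sample whose frequencies follow k^-2 exactly.
toy_degrees = [1] * 400 + [2] * 100 + [4] * 25
gamma, r2 = fit_loglinear(toy_degrees)
print(round(gamma, 3), round(r2, 3))
```

The slope does not depend on the base of the logarithm, so the same γ is obtained with natural logarithms.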

Network modularity

We investigated modularity using six indices: (1) The VOS Quality index (VQ) was determined using the Pajek VOS algorithm by considering the number or weights of the links (arcs) between the nodes as similarities. Clusters or communities that were deemed ‘similar’ were iteratively drawn closer to each other until a final layout was achieved with least crossings and closest clusters. The quality index VQ was thus calculated for this final layout as $\sum_{i=1}^{c}\sum_{j=i+1}^{c}(e_{ij} - a_i^2)$, where c is the number of communities; $e_{ij}$ is the fraction of edges with one node v in the community i ($c_i$) and the other node w in the community j ($c_j$), defined as $\sum_{vw} A_{vw}/(2m)$ where $v \in c_i$, $w \in c_j$, m is the sum of weights in the graph and $A_{vw}$ is the weighted value or 0, indicating presence or absence of an edge between nodes v and w in the adjacency matrix A of the network; and $a_i$ is the fraction of weighted k neighbors attached to the nodes in community i, i.e. $k_i/(2m)$ (refs.^78,79). (2) The Clustering Ratio (C-ratio) is the ratio of the number of network clusters to the count of the connected nodes in the network. (3) The average Clustering Coefficient (C) is defined as the ratio of the triangles incident on a node to the connected triples, determined as a global average over all nodes in a simplified (undirected/unweighted) network^60,69,70. C is not meaningful for strictly bipartite or scale-free graphs⁷³. We also report coefficients of linear regression of C over the age and size of the networks of domain organization. (4) The Fast-Greedy Community (FGC) agglomerative hierarchical algorithm detects community structure for networks with m edges, n nodes, and a depth d of the dendrogram describing the community structure, with an optimized, near-linear running time of $O(m\,d\,\log n) \sim O(n \log^2 n)$⁸⁶. The Newman-Girvan algorithm index (NG) was computed with two different input partitions, the first (5) defined by age (NG_age) and the second (6) defined by VOS clustering (NG_vos).
NG calculates the modularity of a network given a predefined division or partition to measure the influence of the partition in separating the different node types. This indicates either assortative (positive) or disassortative (negative) mixing across modules⁶⁵. The NG algorithm computes an index as $\frac{1}{2m}\sum_{ij}\left(A_{ij} - \frac{k_i k_j}{2m}\right)\Delta(c_i, c_j)$, where m is the sum total of weights in the graph and $A_{ij}$ are weighted entries in the adjacency matrix of the network; $k_i$, $k_j$ and $c_i$, $c_j$ are the weighted degrees and the components (numeric partitions) of the nodes i and j, respectively; finally, $\Delta(x, y)$ equals 1 if x = y and 0 otherwise⁶⁵. The VQ, C-ratio, C and FGC indices each range from 0 to 1, while the NG indices range from −1 to 1. In all cases, higher values represent stronger modularity of the network at an event of evolutionary history. Heatmaps of modularity were constructed using log10-scaled modularity matrices, with each map element given as $\left(A_{ij} - \frac{k_i k_j}{2m}\right) M_{nd}$, where $A_{ij}$, $k_i$, $k_j$ and m were the same as defined for NG⁶⁵, while $M_{nd}$ was the network’s modularity index at event nd. Cladistic representations of modularity were visualized with dendrograms whose metrics were calculated from squared Euclidean distance matrices, which indicate dissimilarities between cluster means⁸⁷. The dissimilarity or distance matrices were clustered hierarchically using Ward's minimum variance method, which seeks compact and spherical clusters⁸⁸.
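To make the NG index concrete, the following Python sketch evaluates the standard Newman-Girvan modularity of an undirected weighted graph for a predefined partition, such as one defined by age or by VOS clusters. It is an illustrative re-implementation rather than the igraph routine used in the study; the function name `ng_modularity`, the edge-triple input and the toy partition are assumptions, and self-loops are assumed to be absent.

```python
from collections import defaultdict

def ng_modularity(edges, community):
    """Q = 1/(2m) * sum_ij (A_ij - k_i*k_j/(2m)) * delta(c_i, c_j).

    `edges` holds (u, v, weight) triples; `community` maps node -> label.
    """
    strength = defaultdict(float)  # weighted degree k_i of each node
    internal = defaultdict(float)  # sum of A_ij over node pairs inside a community
    m = 0.0                        # total edge weight
    for u, v, w in edges:
        m += w
        strength[u] += w
        strength[v] += w
        if community[u] == community[v]:
            internal[community[u]] += 2 * w  # A_uv and A_vu both enter the sum
    if m == 0:
        return 0.0

    community_strength = defaultdict(float)  # sum of k_i within each community
    for node, k in strength.items():
        community_strength[community[node]] += k

    two_m = 2 * m
    return sum(internal[c] / two_m - (community_strength[c] / two_m) ** 2
               for c in community_strength)

# Hypothetical toy network: two triangles joined by a single bridge.
toy_edges = [(1, 2, 1.0), (2, 3, 1.0), (1, 3, 1.0),
             (4, 5, 1.0), (5, 6, 1.0), (4, 6, 1.0), (3, 4, 1.0)]
toy_partition = {1: "old", 2: "old", 3: "old", 4: "young", 5: "young", 6: "young"}
print(ng_modularity(toy_edges, toy_partition))
```

A partition that keeps each triangle together gives a positive value (assortative mixing), whereas a partition that separates every linked pair gives a negative one (disassortative mixing).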

Quantifying randomness in networks

The Bartels rank test of randomness, which primarily offers a rank version of von Neumann's Ratio Test for Randomness⁸⁹, was used to measure random network behavior. The resultant test statistic RVN is defined as $RVN = \sum_{i=1}^{n-1}(R_i - R_{i+1})^2 \big/ \sum_{i=1}^{n}\left(R_i - (n+1)/2\right)^2$, where $R_i = \mathrm{rank}(X_i)$ with i = 1…n, (RVN − 2)/σ is asymptotically standard normal, and $\sigma^2 = 4(n-2)(5n^2 - 2n - 9) \big/ \left[5n(n+1)(n-1)^2\right]$. The null hypothesis of this method was randomness, which was tested against the alternative hypothesis of non-randomness, given a trend of RVN values. A p value was computed from a two-sided beta distribution approximation test. Random graph controls were created by following the Erdős–Rényi graph model^61,90.
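The RVN statistic and its variance, as defined above, can be computed with a few lines of Python. This sketch uses the asymptotic normal approximation mentioned in the text to obtain a two-sided p value, which is a simplification: the study reports p values from a beta distribution approximation, which is not reproduced here. The function name `bartels_rvn` and the average-rank treatment of ties are assumptions.

```python
import math

def bartels_rvn(values):
    """Return (RVN, z, p) using the normal approximation (RVN - 2) / sigma."""
    n = len(values)
    if n < 3 or len(set(values)) < 2:
        raise ValueError("need at least three values that are not all equal")

    # Ranks 1..n; tied values receive the average of the ranks they span.
    order = sorted(range(n), key=lambda idx: values[idx])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1

    numerator = sum((ranks[t] - ranks[t + 1]) ** 2 for t in range(n - 1))
    denominator = sum((r - (n + 1) / 2) ** 2 for r in ranks)
    rvn = numerator / denominator

    sigma2 = 4 * (n - 2) * (5 * n * n - 2 * n - 9) / (5 * n * (n + 1) * (n - 1) ** 2)
    z = (rvn - 2) / math.sqrt(sigma2)
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))  # normal, not beta, approximation
    return rvn, z, p_two_sided

# A strictly increasing (trending, hence non-random) hypothetical series.
print(bartels_rvn([1, 2, 3, 4, 5]))
```

Values of RVN far below 2 point to a trend in the series, while values near 2 are compatible with the null hypothesis of randomness.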

References

Chothia, C. & Gough, J. Genomic and structural aspects of protein evolution. Biochem. J. 419, 15–28 (2009).
Chothia, C., Gough, J., Vogel, C. & Teichmann, S. A. Evolution of the protein repertoire. Science 300, 1701 (2003).
Wetlaufer, D. B. Nucleation, rapid folding, and globular intrachain regions in proteins. Proc. Natl. Acad. Sci. USA 70, 697–701 (1973).
Janin, J. & Wodak, S. J. Structural domains in proteins and their role in the dynamics of protein function. Prog. Biophys. Mol. Biol. 42, 21–78 (1983).
Han, J. H., Batey, S., Nickson, A. A., Teichmann, S. A. & Clarke, J. The folding and evolution of multidomain proteins. Nat. Rev. Mol. Cell Biol. 8, 319–330 (2007).
Apic, G., Gough, J. & Teichmann, S. A. Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol. 310, 311–325 (2001).
Wang, M. & Caetano-Anollés, G. The evolutionary mechanics of domain organization in proteomes and the rise of modularity in the protein world. Structure 17, 66–78 (2009).
Caetano-Anollés, G., Wang, M., Caetano-Anollés, D. & Mittenthal, J. E. The origin, evolution and structure of the protein world. Biochem. J. 417, 621–637 (2009).
Murzin, A. G. et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540 (1995).
Chandonia, J. M., Fox, N. K. & Brenner, S. E. SCOPe: manual curation and artifact removal in the structural classification of proteins—extended database. J. Mol. Biol. 429, 348–355 (2017).
Dawson, N. L. et al. CATH: an expanded resource to predict protein function through structure and sequence. Nucleic Acids Res. 45, D289–D295 (2017).
Marchler-Bauer, A. et al. CDD: NCBI’s conserved domain database. Nucleic Acids Res. 43, D222–D226 (2015).
Bru, C. et al. The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res. 33, D212–D215 (2005).
Finn, R. D. et al. The Pfam protein families database. Nucleic Acids Res. 36, D281–D288 (2008).
Finn, R. D. et al. InterPro in 2017—beyond protein family and domain annotations. Nucleic Acids Res. 45, D190–D199 (2017)
Wilson, D. et al. SUPERFAMILY—sophisticated comparative genomics, data mining, visualization and phylogeny. Nucleic Acids Res. 37, D380–D386 (2009).
Harrison, A., Pearl, F., Mott, R., Thornton, J. & Orengo, C. Quantifying the similarities within fold space. J. Mol. Biol. 323, 909–926 (2002).
Berezovsky, I. N., Guarnera, E. & Zheng, Z. Basic units of protein structure, folding, and function. Prog. Biophys. Mol. Biol. 128, 85–99 (2017).
Gerstein, M. How representative are the known structures of the proteins in a complete genome? A comprehensive structural census. Fold. Des. 3, 497–512 (1998).
Wang, M., Kurland, C. G. & Caetano-Anollés, G. Reductive evolution of proteomes and protein structures. Proc. Natl. Acad. Sci. USA 108, 11954–11958 (2011).
Illergård, K., Ardell, D. H. & Elofsson, A. Structure is three to ten times more conserved than sequence—a study of structural response in protein cores. Proteins Struct. Funct. Bioinforma. 77, 499–508 (2009).
Bashton, M. & Chothia, C. The geometry of domain combination in proteins. J. Mol. Biol. 315, 927–939 (2002).
Chothia, C. & Lesk, A. M. The relation between the divergence of sequence and structure in proteins. EMBO J. 5, 823–826 (1986).
Vogel, C., Berzuini, C., Bashton, M., Gough, J. & Teichmann, S. A. Supra-domains: evolutionary units larger than single protein domains. J. Mol. Biol. 336, 809–823 (2004).
Wuchty, S. Scale-free behavior in protein domain networks. Mol. Biol. Evol. 18, 1694–1702 (2001).
Apic, G., Huber, W. & Teichmann, S. A. Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J. Struct. Funct. Genom. 4, 67–78 (2003).
Tordai, H., Nagy, A., Farkas, K., Bányai, L. & Patthy, L. Modules, multidomain proteins and organismic complexity. FEBS J. 272, 5064–5078 (2005).
Weiner, J., Moore, A. D. & Bornberg-Bauer, E. Just how versatile are domains?. BMC Evol. Biol. 8, 285 (2008).
Shahzad, K., Mittenthal, J. E. & Caetano-Anollés, G. The organization of domains in proteins obeys Menzerath-Altmann’s law of language. BMC Syst. Biol. 9, 44 (2015).
Vogel, C., Teichmann, S. A. & Pereira-Leal, J. The relationship between domain duplication and recombination. J. Mol. Biol. 346, 355–365 (2005).
Basu, M. K., Carmel, L., Rogozin, I. B. & Koonin, E. V. Evolution of protein domain promiscuity in eukaryotes. Genome Res. 18, 449–461 (2008).
Xie, X., Jin, J. & Mao, Y. Evolutionary versatility of eukaryotic protein domains revealed by their bigram networks. BMC Evol. Biol. 11, 244 (2011).
Taylor, W. R. Evolutionary transitions in protein fold space. Curr. Opin. Struct. Biol. 17, 354–361 (2007).
Alva, V., Remmert, M., Biegert, A., Lupas, A. N. & Söding, J. A galaxy of folds. Protein Sci. 19, 124–130 (2010).
Ferrada, E. & Wagner, A. Evolutionary innovations and the organization of protein functions in genotype space. PLoS ONE 5, e14172 (2010).
Taylor, W. R. A ‘periodic table’ for protein structures. Nature 416, 657–660 (2002).
Forslund, S. K., Kaduk, M. & Sonnhammer, E. L. L. Evolution of protein domain architectures. In Evolutionary Genomics, Methods in Molecular Biology Vol. 1910 (ed. Anisimova, M.) 469–504 (Humana, 2019).
Gerstein, M. Patterns of protein-fold usage in eight microbial genomes: a comprehensive structural census. Proteins Struct. Funct. Genet. 33, 518–534 (1998).
Abeln, S. & Deane, C. M. Fold usage on genomes and protein fold evolution. Proteins Struct. Funct. Genet. 60, 690–700 (2005).
Edwards, H., Abeln, S. & Deane, C. M. Exploring fold space preferences of new-born and ancient protein superfamilies. PLoS Comput. Biol. 9, e1003325 (2013).
Caetano-Anollés, G. & Caetano-Anollés, D. An evolutionarily structured universe of protein architecture. Genome Res. 13, 1563–1571 (2003).
Wang, M., Yafremava, L. S., Caetano-Anollés, D., Mittenthal, J. E. & Caetano-Anollés, G. Reductive evolution of architectural repertoires in proteomes and the birth of the tripartite world. Genome Res. 17, 1572–1585 (2007).
Wang, M. & Caetano-Anollés, G. Global phylogeny determined by the combination of protein domains in proteomes. Mol. Biol. Evol. 23, 2444–2454 (2006).
Nath, N., Mitchell, J. B. O. & Caetano-Anollés, G. The natural history of biocatalytic mechanisms. PLoS Comput. Biol. 10, e1003642 (2014).
Debès, C., Wang, M., Caetano-Anollés, G. & Gräter, F. Evolutionary optimization of protein folding. PLoS Comput. Biol. 9, e1002861 (2013).
Aziz, M. F., Caetano-Anollés, K. & Caetano-Anollés, G. The early history and emergence of molecular functions and modular scale-free network behavior. Sci. Rep. 6, 25058 (2016).
Wang, M. et al. A universal molecular clock of protein folds and its power in tracing the early history of aerobic metabolism and planet oxygenation. Mol. Biol. Evol. 28, 567–582 (2011).
Caetano-Anollés, D., Kim, K. M., Mittenthal, J. E. & Caetano-Anollés, G. Proteome evolution and the metabolic origins of translation and cellular life. J. Mol. Evol. 72, 14–33 (2011).
Mittenthal, J. E., Caetano-Anollés, D. & Caetano-Anollés, G. Biphasic patterns of diversification and the emergence of modules. Front. Genet. 3, 147 (2012).
Aziz, M. F. et al. Stress induces biphasic-rewiring and modularization patterns in the metabolomic networks of Escherichia coli. IEEE Intl. Conf. Bioinf. Biomed. https://doi.org/10.1109/BIBM.2012.6392626 (2012).
MacDougall, M. H. Simulating Computer Systems: Techniques and Tools (MIT Press, 1987).
Delaney, W. & Vaccari, E. Dynamic Models and Discrete Event Simulation (CRC Press, 1989).
Pidd, M. Computer simulation in management science. J. Oper. Res. Soc. 57, 327 (2006).
Caetano-Anollés, G., Kim, H. S. & Mittenthal, J. E. The origin of modern metabolic networks inferred from phylogenomic analysis of protein architecture. Proc. Natl. Acad. Sci. USA 104, 9358 (2007).
Caetano-Anollés, G., Kim, K. M. & Caetano-Anollés, D. The phylogenomic roots of modern biochemistry: origins of proteins, cofactors and protein biosynthesis. J. Mol. Evol. 74, 1–34 (2012).
Caetano-Anollés, K. & Caetano-Anollés, G. Structural phylogenomics reveals gradual evolutionary replacement of abiotic chemistries by protein enzymes in purine metabolism. PLoS ONE 8, e59300 (2013).
Caetano-Anollés, G. et al. The origin and evolution of modern metabolism. Int. J. Biochem. Cell Biol. 41, 285–297 (2009).
Barabási, A. L. & Albert, R. Emergence of scaling in random networks. Science 286, 509 (1999).
Pang, T. Y. & Maslov, S. Universal distribution of component frequencies in biological and technological systems. Proc. Natl. Acad. Sci. USA 110, 6235–6239 (2013).
Ravasz, E., Somera, A. L., Mongru, D. A., Oltvai, Z. N. & Barabási, A. L. Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555 (2002).
Erdős, P. & Rényi, A. Connectivity of random nets. Publ. Math. Inst. Hungarian Acad. Sci. 5, 17–61 (1960).
Bollobas, B. Random Graphs (Academic Press, 1985).
Strogatz, S. H. Exploring complex networks. Nature 410, 268–276 (2001).
Mughal, F. & Caetano-Anollés, G. MANET 3.0: hierarchy and modularity in evolving metabolic networks. PLoS ONE 14, e0224201 (2019).
Newman, M. E. J. & Girvan, M. Finding and evaluating community structure in networks. Phys. Rev. E 69, 26113 (2004).
Jeong, H., Tombor, B., Albert, R., Oltvai, Z. N. & Barabási, A. L. The large-scale organization of metabolic networks. Nature 407, 651–654 (2000).
Overbeek, R. et al. WIT: integrated system for high-throughput genome sequence analysis and metabolic reconstruction. Nucleic Acids Res. 28, 123–125 (2000).
Wagner, A. & Fell, D. A. The small world inside large metabolic networks. Proc. R. Soc. Lond. Ser. B Biol. Sci. 268, 1803–1810 (2001).
Wasserman, S. & Faust, K. Social Network Analysis: Methods and Applications (MIT Press, 1994).
Barrat, A., Barthelemy, M., Pastor-Satorras, R. & Vespignani, A. The architecture of complex weighted networks. Proc. Natl. Acad. Sci. USA 101, 3747–3752 (2004).
Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’ networks. Nature 393, 440–442 (1998).
Albert, R. & Barabási, A. L. Statistical mechanics of complex networks. Rev. Mod. Phys. 74, 47 (2002).
Newman, M. E. J., Strogatz, S. H. & Watts, D. J. Random graphs with arbitrary degree distributions and their applications. Phys. Rev. E 64, 26118 (2001).
Caetano-Anollés, G., Wang, M. & Caetano-Anollés, D. Structural phylogenomics retrodicts the origin of the genetic code and uncovers the evolutionary impact of protein flexibility. PLoS ONE 8, e72225 (2013).
Tal, G., Boca, S. M., Mittenthal, J. & Caetano-Anollés, G. A dynamic model for the evolution of protein structure. J. Mol. Evol. 82, 230–243 (2016).
Mrvar, A. & Batagelj, V. Analysis and visualization of large networks with program package Pajek. Complex Adapt. Syst. Model. 4, 1–8 (2016).
Csardi, G. & Nepusz, T. The igraph software package for complex network research. Int. J. Complex Syst. 1695, 1–9 (2006).
Van Eck, N. J. & Waltman, L. VOS: a new method for visualizing similarities between objects. In Advances in Data Analysis: Proceedings of the 30th Annual Conference of the German Classification Society 299–306 (Heidelberg: Springer Verlag, 2007).
Waltman, L., van Eck, N. J. & Noyons, E. C. M. A unified approach to mapping and clustering of bibliometric networks. J. Informetr. 4, 629–635 (2010).
Kamada, T. & Kawai, S. An algorithm for drawing general undirected graphs. Inf. Process. Lett. 31, 7–15 (1989).
Ihaka, R. & Gentleman, R. R: A language for data analysis and graphics. J. Comput. Graph. Stat. 5, 299–314 (1996).
R Core Team. R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria (2014).
PHP-Group & others. PHP: Hypertext PreProcessor. Internet http://www.php.net (2012).
Newman, M. E. J. Power laws, Pareto distributions and Zipf’s law. Contemp. Phys. 46, 323–351 (2005).
Clauset, A., Shalizi, C. R. & Newman, M. E. J. Power-law distributions in empirical data. SIAM Rev. 51, 661–703 (2009).
Clauset, A., Newman, M. E. J. & Moore, C. Finding community structure in very large networks. Phys. Rev. E 70, 66111 (2004).
Borg, I. & Groenen, P. Modern multidimensional scaling: theory and applications. J. Educ. Meas. 40, 277–280 (2003).
Murtagh, F. & Legendre, P. Ward’s hierarchical agglomerative clustering method: which algorithms implement Ward’s criterion? J. Classif. 31, 274–295 (2014).
Bartels, R. The rank version of von Neumann’s ratio test for randomness. J. Am. Stat. Assoc. 77, 40–46 (1982).
Erdős, P. & Rényi, A. On random graphs I. Publ. Math. 6, 290–297 (1959).

Acknowledgements

Research was supported by grants from the National Science Foundation (MCB-0749836 and OISE-1132791) and the United States Department of Agriculture (ILLU-802-909 and ILLU-483-625) to GCA. Materials and data necessary to interpret the findings of this paper have been included in the manuscript.

Author information

Authors and Affiliations

Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, University of Illinois, Urbana, IL, 61801, USA
M. Fayez Aziz & Gustavo Caetano-Anollés

Authors

M. Fayez Aziz
Gustavo Caetano-Anollés

Contributions

G.C.-A. conceptualized the study. M.F.A. generated primary data, conducted network analysis, and generated figures and written documentation. Both authors interpreted results and wrote and revised the manuscript.

Corresponding author

Correspondence to Gustavo Caetano-Anollés.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Figure S1.

Supplementary Figure S2.

Supplementary Figure S3.

Supplementary Figure S4.

Supplementary Figure S5.

Supplementary Figure S6.

Supplementary Figure S7.

Supplementary Figure S8.

Supplementary Figure S9.

Supplementary Figure S10.

Supplementary File 1.

Supplementary File 2.

Supplementary File 3.

Supplementary Information.

Supplementary Video 1.

Supplementary Video 2.

Supplementary Video 3.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Aziz, M.F., Caetano-Anollés, G. Evolution of networks of protein domain organization. Sci Rep 11, 12075 (2021). https://doi.org/10.1038/s41598-021-90498-8

Received: 02 December 2020
Accepted: 11 May 2021
Published: 08 June 2021
DOI: https://doi.org/10.1038/s41598-021-90498-8
