Ongoing shuffling of protein fragments diversifies core viral functions linked to interactions with bacterial hosts

Biological modularity enhances evolutionary adaptability. This principle is vividly exemplified by bacterial viruses (phages), which display extensive genomic modularity. Phage genomes are composed of independent functional modules that evolve separately and recombine in various configurations. While genomic modularity in phages has been extensively studied, less attention has been paid to protein modularity—proteins consisting of distinct building blocks that can evolve and recombine, enhancing functional and genetic diversity. Here, we use a set of 133,574 representative phage proteins and highly sensitive homology detection to capture instances of domain mosaicism, defined as fragment sharing between two otherwise unrelated proteins, and to understand its relationship with functional diversity in phage genomes. We discover that unrelated proteins from diverse functional classes frequently share homologous domains. This phenomenon is particularly pronounced within receptor-binding proteins, endolysins, and DNA polymerases. We also identify multiple instances of recent diversification via domain shuffling in receptor-binding proteins, neck passage structures, endolysins and some members of the core replication machinery, often transcending distant taxonomic and ecological boundaries. Our findings suggest that ongoing diversification via domain shuffling is reflective of a co-evolutionary arms race, driven by the need to overcome various bacterial resistance mechanisms against phages.

(3) A representative sequence from each cluster (with no more than 10 unknown characters) was taken and we aligned the UniClust30 database against it using hhblits to create 133,574 rHMMs (i.e., HMM profiles of representative protein sequences).All rHMMs were then compared against (4) each other using hhblits to generate the all-by-all comparison table, (5) PHROGs HMM database to assign a functional class to each rHMM, enhanced by hits to a database of anti-defence functions (Samuel & Burstein 2023), with conflicting assignments discarded, and (6) ECOD HMM database to predict protein domain architectures in each rHMM.Colours correspond to different PHROG functional categories (same as in Figure 2 and Figure 4).The association is significant (Spearman two-sided correlation test; p = 8.9 × 10 −12 , S = 49046, n = 99).Source data are provided as a Source Data file.rHMM: representative HMM profile.Coverage was calculated as a total coverage based on all hits between a given rHMM pair using hit probability threshold of 95%, and a maximum of query and sequence coverage.Only pairs that overlap on a fragment of ⩾ 50aa are shown.Colour corresponds to the number of rHMM pairs with a given set of parameters, (B) Same as A but with excluded pairs of rHMMs that are detected as mosaic using the domain composition approach, (C) Same as (A) and (B) but only for pairs of rHMMs that are detected as mosaic using the domain composition approach.Note that the vast majority of such pairs shares only very remote homology (no pairwise hit) and is not included in this plot.Source data are provided as a Source Data file.rHMM: representative HMM profile.The center line within boxplots corresponds to median, lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles).The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles).The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge.The difference between the groups is statistically significant (Wilcoxon one-side rank sum test, p = 2.8 × 10 −45 , W = 963570, n virulent = 1866, ntemperate = 1444).Source data are provided as a Source Data file.rHMM: representative HMM profile.-axis).Phage genera are grouped into families, and for each genus we show the number of genomes and the average size of these genomes (i.e., the median number of proteins predicted per genome, rounded up to the nearest integer).Transparency denotes the proportion of genomes that include a protein within any rHMM of a given function (note that the function assignment is done very conservatively and many functions will be under-detected).Out of these genomes, we calculated the proportion of those whose protein of a given function was mosaic.This information is represented by the colour (red: very frequently mosaic, beige: rarely mosaic).A protein was defined mosaic if it belonged to an rHMM that was defined mosaic by either domain or sequence-based definition.Only functional classes with genetic diversity of at least 20 families are shown.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S1 .
Figure S1.Summary of methodology.(1) We predicted 462,721 protein sequences in all phage genomes downloaded from NCBI RefSeq.(2) These protein sequences were clustered into 133,624 high-coverage clusters with mmseqs2.(3)A representative sequence from each cluster (with no more than 10 unknown characters) was taken and we aligned the UniClust30 database against it using hhblits to create 133,574 rHMMs (i.e., HMM profiles of representative protein sequences).All rHMMs were then compared against (4) each other using hhblits to generate the all-by-all comparison table, (5) PHROGs HMM database to assign a functional class to each rHMM, enhanced by hits to a database of anti-defence functions (Samuel & Burstein 2023), with conflicting assignments discarded, and (6) ECOD HMM database to predict protein domain architectures in each rHMM.

Figure S2 .
Figure S2.Ambiguity of functional annotation increases with decreasing coverage.Impact of hhblits parameters on functional annotation.(A) Functional uniqueness (percentage of annotated rHMMs with hits to a single function; blue) and functional coverage (percentage of rHMMs with hits to any known function; orange) as a function of sequence (both query and subject) coverage threshold, for different values of the probability hit threshold (solid vs. dashed line).(B) Functional uniqueness and functional coverage as a function of probability threshold, for two extreme values of sequence coverage threshold (both query and subject; solid vs. dashed line).Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S3 .
Figure S3.Functional co-occurrence at a high and low sequence coverage threshold.Each node represents a PHROG functional class.Nodes are linked an edge if they co-occurred at 80% (A) or 30% (B) coverage in at least 20 rHMMs at minimum 95% probability thresholds in hhblits.Colours refer to PHROG categories and node sizes refer to the number of rHMMs annotated by each PHROG functional class.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S4 .
Figure S4.Occurrence of different topologies in common functional classes.Heatmap showing which folds (Y-axis), defined by ECOD domains (T-groups), are found in which common PHROG classes (X-axis).Colours refer to PHROG categories.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S5 .
Figure S5.Diversity and frequency of domains.(A) Number of distinct domain families included in each H-group and detected within our rHMMs with valid functional annotations.Only the groups shown in Figure 1 are considered; (B) Number of distinct domain families included in each H-group and detected within rHMMs with known annotations vs. the number of rHMMs where a domain from given H-group is detected.Each point is an H-group present in our data, including the groups not shown in Figure 1.The association is significant (Spearman two-sided correlation test, p = 0.0001194, S = 188905, n = 120).Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S6 .
Figure S6.Domain architectures of representative members of five DNA polymerase families: A, B, C, X and Y. Schematic drawing of domain architectures which were obtained with HHpred using ECOD as the database.The UniProt accessions of each protein are provided on the left-hand side.Number IDs correspond to ECOD T-groups with unique colours provided in the legend.

Figure S7 .
Figure S7.Domain network of proteins in the 'DNA, RNA and nucleotide metabolism' category.Nodes in the network represent domains (ECOD T-groups) and are linked by an edge if they co-occur in one rHMM.Size of the node represents its frequency (i.e., number of rHMMs in which it occurs).Colour of the node represents whether the domain occurs in rHMMs assigned to a single functional class (blue) or in rHMMs assigned to multiple functional classes (grey).Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S8 .
Figure S8.Domain network of proteins in the 'tail' category.The network is analogous to the one in Figure S7 with the difference that domains which occur in rHMMs assigned to a single functional class are green instead of blue.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S9 .
Figure S9.Domain network of proteins in the 'lysis' category.The network is analogous to the one in Figure S7 with the difference that domains which occur in rHMMs assigned to a single functional class are red instead of blue.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure
Figure S10.ECOD coverage per functional class.Left to right: (i) mean protein length, (ii) maximum number of different domains (H-groups) detected with a single rHMM, (iii) mean number of folds (T-groups) per fold combination, (iv) proportion of proteins with at least one ECOD domain detected at ⩾95% probability and ⩾70% subject coverage, and (iv) proportion of proteins with ECOD domains from at least two distinct X groups, detected at ⩾95% probability and ⩾70% subject coverage.Only the most diverse functional classes (those with ⩾6 rHMM families) are shown.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S11 .
Figure S11.Fold diversity vs genetic diversity.Diversity of all functional classes as measured by the number of rHMMs families (X-axis) and the number of fold architectures (i.e., unique combination of ECOD T-groups; Y-axis).Colours correspond to different PHROG functional categories (same as in Figure2and Figure4).The association is significant (Spearman two-sided correlation test; p = 8.9 × 10 −12 , S = 49046, n = 99).Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure
Figure S12.All-by-all comparison of HMM profiles of rHMMs.(A) Pairwise sequence coverage (Y-axis) of 133k rHMMs (over 10 10 HMM-HMM comparisons) as a function of pairwise percentage identity.Coverage was calculated as a total coverage based on all hits between a given rHMM pair using hit probability threshold of 95%, and a maximum of query and sequence coverage.Only pairs that overlap on a fragment of ⩾ 50aa are shown.Colour corresponds to the number of rHMM pairs with a given set of parameters, (B) Same as A but with excluded pairs of rHMMs that are detected as mosaic using the domain composition approach, (C) Same as (A) and (B) but only for pairs of rHMMs that are detected as mosaic using the domain composition approach.Note that the vast majority of such pairs shares only very remote homology (no pairwise hit) and is not included in this plot.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S13 .
Figure S13.Networks of functional classes with recent mosaicism.Nodes represent various PHROG functional classes and the edges connect classes between which there is at least one rHMM pair with evidence of recently emerged mosaicism.Such recently emerged mosaicism between an rHMM pair is defined when two rHMMs show no signal of homology over the majority of their length (at least 50% pairwise coverage) but a strong similarity over a fragment shorter than 50% of their length.By 'strong similarity' we mean percentage identity >= 70% (A) or >= 90% (B).Colour denotes a PHROG category.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S14 .
Figure S14.Proportion of mosaic proteins per genome within temperate and virulent phages.A protein was defined as mosaic if it belonged to an rHMM that was defined mosaic by ECOD-based or sequence-based definition.The center line within boxplots corresponds to median, lower and upper hinges correspond to the first and third quartiles (the 25th and 75th percentiles).The upper whisker extends from the hinge to the largest value no further than 1.5 * IQR from the hinge (where IQR is the inter-quartile range, or distance between the first and third quartiles).The lower whisker extends from the hinge to the smallest value at most 1.5 * IQR of the hinge.The difference between the groups is statistically significant (Wilcoxon one-side rank sum test, p = 2.8 × 10 −45 , W = 963570, n virulent = 1866, ntemperate = 1444).Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S15 .
Figure S15.Proportion of mosaic proteins per genome within phages belonging to different taxonomic families.The difference between the groups is statistically significant according to Kruskal-Wallis test (Kruskal-Wallis chi-squared = 926.51,df = 40, p = 2.5 × 10 −168 , total sample size of n = 1565 across 41 taxonomic groups; only groups with ⩾ 50 genomes are shown).Average genomes sizes represent the median number of genes predicted within the family (rounded up a nearest integer).'Mosaic protein' is defined the same as in Figure S14.Boxplots are plotted in the same way as in Figure S14.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S16 .
Figure S16.Proportion of mosaic proteins per genome within phages belonging to different bacterial hosts.The difference between the groups is statistically significant according to Kruskal-Wallis test (Kruskal-Wallis chi-squared = 878.95,p = 6.4 × 10 −99 , total sample size of n = 4377 across 163 host groups; only groups with ⩾ 50 genomes are shown).Average genomes sizes represent the median number of genes predicted within the family (rounded up a nearest integer).'Mosaic protein' is defined the same as in Figure S14.Boxplots are plotted in the same way as in Figure S14.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S17 .
Figure S17.Mosaicism across taxa.Mosaicism within multiple protein classes (X-axis) divided into phages of different genera and taxonomic families (Y-axis).Phage genera are grouped into families, and for each genus we show the number of genomes and the average size of these genomes (i.e., the median number of proteins predicted per genome, rounded up to the nearest integer).Transparency denotes the proportion of genomes that include a protein within any rHMM of a given function (note that the function assignment is done very conservatively and many functions will be under-detected).Out of these genomes, we calculated the proportion of those whose protein of a given function was mosaic.This information is represented by the colour (red: very frequently mosaic, beige: rarely mosaic).A protein was defined mosaic if it belonged to an rHMM that was defined mosaic by either domain or sequence-based definition.Only functional classes with genetic diversity of at least 20 families are shown.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S18 .
Figure S18.Mosaicism across lifesyles.Mosaicism within multiple protein classes (Y-axis) divided into temperate and virulent phages (X-axis, with 1866 virulent and 1444 temperate phages).The length of each bar denotes the proportion of genomes that include a protein within any rHMM of a given function (note that the function assignment is done very conservatively, and many functions will be under-detected).Out of these genomes, we calculated the proportion of those whose protein of a given function was mosaic.This information is represented by the colour (red: very frequently mosaic, beige: rarely mosaic).Only functional classes with genetic diversity of at least 20 families are shown.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S19 .
Figure S19.Recent mosaicism between groups.Number of mosaic pairs of rHMMs subdivided into old and recently emerged mosaicism (X-axis) that are associated with the same or different trait: taxonomic family, genus, host and lifestyle (Y-axis).A pair of rHMMs was defined mosaic if it was mosaic by either ECOD-based or sequence-based definition.Note that rHMMs may be composed of proteins coming from phages of different groups, e.g.there might be very similar proteins in temperate and virulent phages that are grouped in one rHMM.Such cases are coloured in brown.Additionally, rHMMs composed only of proteins of unknown group (e.g., unclassified taxonomic family) are coloured in grey.Source data are provided as a Source Data file.rHMM: representative HMM profile.

Figure S20 .
Figure S20.Recent mosaicism vs. functions.Number of rHMM mosaic pairs within various functional classes (Y-axis) that has emerged recently (sequence-based mosaicism and pident ⩾70%) and that are associated with the same or different trait: taxonomic family, genus, host and lifestyle (X-axis).Only functional classes with genetic diversity of at least 20 families are shown.Colour legend is the same as in Figure S19.Source data are provided as a Source Data file.rHMM: representative HMM profile.