Date: 2024-06-19

SynTracker compares strains within species using metagenome or genome data

SynTracker identifies synteny blocks in pairs of homologous genomic regions derived from isolate genomes, metagenomic assemblies or metagenome-assembled genomes (MAGs). As input, the pipeline accepts one genome per species of interest (bacterium, phage or plasmid), either fully or partially assembled, to be used as a reference and a collection of metagenomic assemblies (or genomes, if genomes are to be compared).

Step 1: identification of homologous regions

The reference genome is fragmented to create a collection of 1-kbp genomic regions, located 4 kbp apart (‘central regions’; Fig. 1a). Next, we convert the collection of per-sample metagenomic assemblies (or genomes) to a basic local alignment search tool (BLAST)20 database and use the central regions as queries for a high-stringency nucleotide BLAST (BLASTn) search (identity = 97%, minimal query coverage = 70%; Fig. 1b) to minimize the possibility of receiving multispecies hits or hits located within regions with high copy-number variation. For each BLAST hit, we then retrieve the target sequence and the flanking 2-kbp regions upstream and downstream of the target sequence. This strategy results in high specificity when identifying homologs to the central regions, while allowing for high variance in the sequence composition of the flanking regions. These parameters can be modified by the user, according to preferences.
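The coordinate arithmetic of this step can be sketched as follows. This is a minimal illustration, not SynTracker's own code: the function names are hypothetical, and "located 4 kbp apart" is read here as a 4-kbp gap between consecutive central regions (an assumption, chosen so that each 1-kbp region plus its two 2-kbp flanks spans about 5 kbp without overlap).

```python
def central_regions(genome_length, region_len=1000, gap=4000):
    """Return half-open (start, end) coordinates of the central regions.

    Assumption: consecutive regions are separated by a `gap`-bp stretch,
    so region starts are spaced region_len + gap bp apart.
    """
    step = region_len + gap
    return [(start, start + region_len)
            for start in range(0, genome_length - region_len + 1, step)]


def hit_with_flanks(hit_start, hit_end, contig_length, flank=2000):
    """Extend a BLAST hit by `flank` bp on each side, clipped to the contig."""
    return max(0, hit_start - flank), min(contig_length, hit_end + flank)


if __name__ == "__main__":
    print(central_regions(12_000))
    print(hit_with_flanks(2_500, 3_500, contig_length=10_000))
```

Hits close to a contig edge yield regions shorter than 5 kbp, which is one reason the overlap in Step 2 is defined relative to the shorter of the two sequences.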

Fig. 1: Illustration of the SynTracker algorithm. a, The reference genome is fragmented to yield central regions, that is, 1-kbp-long regions located 4 kbp apart. b, Each central region is used as a query for a BLAST search against a collection of sample-specific assemblies (or genomes, as appropriate). c, BLAST hits are retrieved with 2 kbp on each side of the hit; however, this flank length can be modified by the user. All hits resulting from the same BLAST search are placed in the same region-specific bin. d, Within each bin, an all-versus-all pairwise alignment is performed to identify synteny blocks in pairs of sequences. Synteny scores are calculated on the basis of the number of blocks and the sum of the length of the blocks. e, For each pair of samples (or genomes) n regions are sampled and their synteny scores are averaged to yield the APSS.

Step 2: calculation of region-specific synteny scores

Each collection of homologous ~5-kbp regions (that is, derived from a BLAST search using the same central region query) is assigned to a region-specific bin (Fig. 1c). Within each bin, we perform an all-versus-all pairwise sequence alignment to identify synteny blocks (Fig. 1d) using the DECIPHER R package21. Then, for each pairwise alignment, we calculate the region-specific pairwise synteny score. This score is based on two parameters: the number of synteny blocks identified in each pairwise sequence alignment and the overlap between the two sequences. The synteny score is inversely proportional to the first and directly proportional to the second (Extended Data Fig. 1).

A single synteny block in a pairwise alignment can stem from two genomic regions with a high sequence similarity. A high number of synteny blocks can result from insertions, deletions, recombination events or several SNPs located in very close proximity in just one of the two sequences. The sequence overlap is defined as the ratio of the cumulative length of all blocks to the length of the shorter DNA region in each pairwise comparison. The region-specific pairwise synteny score has a maximal value of 1, reflecting identification of a single synteny block and overlap of 100% (Fig. 1d).
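The two ingredients of the score can be illustrated with a short sketch. The overlap follows the definition given above; the score itself is an illustrative stand-in (overlap divided by the number of blocks), chosen only because it decreases with the number of blocks, increases with overlap and reaches 1 for a single block with 100% overlap. The exact functional form used by SynTracker is the one shown in Extended Data Fig. 1 and is not reproduced here; all function names are hypothetical.

```python
def sequence_overlap(block_lengths, len_a, len_b):
    """Cumulative length of all synteny blocks / length of the shorter region."""
    return sum(block_lengths) / min(len_a, len_b)


def synteny_score(block_lengths, len_a, len_b):
    """Illustrative stand-in for the region-specific pairwise synteny score.

    Assumption: score = overlap / number of blocks. This is NOT taken from
    the SynTracker source; it only mimics the stated proportionalities.
    """
    if not block_lengths:
        return 0.0
    overlap = sequence_overlap(block_lengths, len_a, len_b)
    return overlap / len(block_lengths)


if __name__ == "__main__":
    # One block spanning both 5-kbp regions entirely.
    print(synteny_score([5000], 5000, 5000))
    # The same pair split into two blocks by an indel.
    print(synteny_score([2000, 2000], 5000, 5000))
```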

Step 3: calculation of the APSS

After calculating the per-region synteny scores in all bins, we randomly subsample n regions for each pairwise comparison of metagenomic samples (or genomes) and calculate the APSS by averaging the per-region pairwise synteny scores. Pairs of samples or genomes with fewer than n regions per comparison are excluded from downstream analysis (Fig. 1e). By default, the APSS is calculated for n = 40, 60, 80, 100 and 200 regions per pairwise comparison.
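The subsampling and averaging can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name and the `seed` parameter are hypothetical, and the per-region scores of one sample pair are assumed to be available as a plain list.

```python
import random


def apss(region_scores, n, seed=None):
    """Average of n randomly subsampled per-region synteny scores.

    Returns None when fewer than n regions are available, mirroring the
    exclusion of such pairs from downstream analysis.
    """
    if len(region_scores) < n:
        return None
    rng = random.Random(seed)  # seeded only to make the sketch reproducible
    sampled = rng.sample(region_scores, n)  # sampling without replacement
    return sum(sampled) / n


if __name__ == "__main__":
    scores = [1.0, 0.97, 0.5, 1.0, 0.92] * 20  # 100 hypothetical regions
    for n in (40, 60, 80, 100, 200):
        print(n, apss(scores, n, seed=1))
```

With 100 regions available, the comparison at n = 200 returns None and would be dropped, whereas the smaller subsampling depths still produce a score.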

SynTracker is sensitive to structural variants, not to SNPs

We examined SynTracker’s performance and estimated the effect of different genomic variations on the synteny scores. In a first test, we used Bacmeta22 to generate in silico simulations of the evolution of bacterial populations. We performed two types of simulations: (1) the population was evolved by introducing SNPs exclusively and (2) only insertions and deletions were introduced. At each time point, for each genomic region, we sampled 20 ‘bacteria’ and calculated all pairwise synteny scores in addition to all pairwise sequence identities (that is, three sets of 190 pairwise comparisons at each time point; Fig. 2a and Extended Data Fig. 2). In simulations using SNPs, the minimal average BLAST identities were 99.48%, 99.46% and 99.5%, for regions 1, 2 and 3. The lowest average BLAST identities in simulations based on insertions and deletions were higher, at 99.79%, 99.79% and 99.84% (P < 2.2 × 10⁻¹⁶). The minimal average synteny scores in SNP-based simulations were higher (0.995, 0.995 and 0.981) than in indel-based simulations (0.103, 0.103 and 0.206; P < 2.2 × 10⁻¹⁶) even though the mutation frequency in SNP-based simulations was tenfold higher than in the indel-based simulations. The lower synteny scores of genomic regions in the indel-based simulations highlight the higher sensitivity of the synteny-based approach to indels. These results show that populations evolving exclusively through the introduction of SNPs have a marginal reduction in the synteny scores compared to populations that evolve through the introduction of insertions and deletions at a lower mutation frequency.

Fig. 2: SynTracker shows robust strain-resolving performance using a small fraction of the genome length. a, Analysis of the genomic diversity of in silico evolved bacterial populations. Simulations were carried out for 3,000 generations through the exclusive introduction of SNPs at a frequency of 1 × 10⁻⁶ substitutions per nucleotide per generation. At each time point, 20 genomes were sampled and a pairwise comparison of the same 20-kbp region was performed using BLASTn (top) and SynTracker (bottom) (that is, 190 pairwise comparisons per time point). Horizontal black lines mark the group median and the red lines connect the group means (red dots). Boxes correspond to the interquartile range (IQR) and whiskers are extended to the largest and smallest observations within the first and third quartiles ± 1.5 × IQR. b, Same as in a but with simulations based on the introduction of indels at a frequency of 1 × 10⁻⁷ events per nucleotide per generation. c, Phylogenetic trees for 140 E. coli genomes belonging to 14 different phylogroups based on APSS. Left, tree based on 200 randomly selected 5-kbp regions per pairwise comparison. Right, Mash distance tree derived from the literature24. Colored lines connect the same genomes on both trees and lines are colored by phylogroup. P < 1 × 10⁻⁵ based on 100,000 randomizations of the synteny-based tree. d, Same as for c but using 40 regions per pairwise comparison. P < 1 × 10⁻⁵. e, Heat map showing the APSS of comparisons of five N. oceani strains, reflecting previously published synteny-based strain similarities27.

SynTracker classifies strains using a fraction of the genome

We examined the performance of SynTracker when comparing closely related genomes and assessed APSS values as the basis for the clustering of genomes into phylogenetic groups. We used a published classification of >10,000 Escherichia coli genomes based on whole-genome nucleotide content (Mash23), which identified 14 distinct phylogroups24. We randomly selected ten genomes per phylogroup and analyzed these 140 genomes eight times, randomly selecting 20–200 regions per pairwise comparison (representing ~1.8–18.5% of the E. coli O157:H7 genome length). We used the APSS values (Fig. 1e) to construct phylogenies.

The published Mash tree demonstrates the classification of E. coli genomes into 14 phylogroups; we checked whether we could recapitulate these phylogroups using APSS. We compared the Mash-based tree and the synteny-based trees using two methods: (1) calculating the Robinson–Foulds distance (RFD) between trees25 and (2) using phylogenetic information content26. To determine the statistical significance of these distances, we further calculated the distances (both RFDs and phylogenetic information content) between the published Mash tree and 1 × 10⁵ randomly generated trees (Fig. 2c,d and Extended Data Fig. 4). We observed that the number of sampled regions per pairwise comparison was inversely correlated with the RFDs and positively correlated with the phylogenetic information (Extended Data Figs. 3 and 4). These results indicate that, when a greater proportion of the genome was sampled by SynTracker, the Mash-based and APSS-based trees became more similar. Regardless of the proportion of the genome sampled, for both tree comparison methods, the resulting P value was smaller than 1 × 10⁻⁵ for all subsampling values. This indicates that, even with less than 2% of the genome sampled, the APSS-based tree recapitulated the published tree. Importantly, this result indicates that synteny can be used to classify E. coli strains into the ‘correct’ phylogroup using a very small fraction of the whole genome.
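For readers unfamiliar with the Robinson–Foulds distance, the following sketch computes it for small trees written as nested tuples of leaf names, and derives an empirical P value from a set of random-tree distances. It is an illustration of the general technique, not the analysis code used here: the tree encoding, the function names and the add-one correction in the P value are assumptions.

```python
def splits(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples."""
    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        return frozenset().union(*(leaves(child) for child in node))

    all_leaves = leaves(tree)
    anchor = min(all_leaves)  # fixed leaf used to pick one side of each split
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        # Store the side of the split that does not contain the anchor leaf.
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


def empirical_p(observed, null_distances):
    """Share of random-tree distances at least as small as the observed one.

    Assumption: an add-one correction is used so that P is never exactly 0.
    """
    hits = sum(1 for d in null_distances if d <= observed)
    return (hits + 1) / (len(null_distances) + 1)


if __name__ == "__main__":
    mash_like = ((("A", "B"), "C"), ("D", "E"))
    apss_like = ((("A", "C"), "B"), ("D", "E"))
    print(robinson_foulds(mash_like, apss_like))
```

With 1 × 10⁵ random trees and none of them as close to the reference as the observed tree, this estimator gives a value just below 1 × 10⁻⁵.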

In a second analysis, we used data for five Nitrosococcus oceani strains isolated in different global locations27. The authors used a gene synteny-based analysis using genome content and alignments. Their analysis revealed that the genomes of four of the strains (C-107, NS58, C-27 and AFC27) were highly conserved in content and gene synteny, while strain AFC132 contained an additional gene repertoire and differed in gene synteny from the other four genomes. Using ~28% of the genome, our SynTracker analysis corroborated these findings; pairwise comparisons of strains C-107, NS58, C-27 and AFC27 resulted in high APSSs (0.968–0.998), while AFC132 showed a much lower APSS to the others (0.82–0.87) (Fig. 2e).

Setting APSS thresholds for same-strain designations

In most strain-tracking software, score thresholds are used to determine whether the same strain is present in multiple samples28,29. To apply this concept to SynTracker results, we sought to determine an APSS threshold to use for designating two strains as the same. We note that the definition of a strain is ambiguous and highly subjective. Here, we assumed that two members of a species identified in the same individual over a time period of months likely belong to the same strain, while two members of a species colonizing different individuals likely belong to different strains. To establish an APSS threshold for the same strain for species of the human gut microbiome, we based our analysis on the longitudinal human gut metagenome30. We divided the dataset into training and testing sets, consisting of 117 and 106 metagenomic samples, respectively, and calculated the APSS scores for all strain pairs from 33 species in the training set. We then used the Youden statistic31 to determine the APSS value that optimally classified strain pairs as the same or different, which served as the basis for the thresholds used in subsequent analyses (Methods).
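The threshold selection can be illustrated with a minimal sketch of the Youden statistic, J = sensitivity + specificity − 1, evaluated at every observed score. The function name, the input layout and the convention that a pair is called the same strain when its APSS is at or above the threshold are assumptions made for illustration; the procedure actually used is described in Methods.

```python
def youden_threshold(scores, same_strain):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    scores: APSS value per strain pair.
    same_strain: True if the pair comes from the same individual (assumed
    same strain), False if it comes from different individuals.
    """
    positives = sum(1 for y in same_strain if y)
    negatives = len(same_strain) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both same-strain and different-strain pairs")

    best_threshold, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, same_strain) if y and s >= t)
        tn = sum(1 for s, y in zip(scores, same_strain) if not y and s < t)
        j = tp / positives + tn / negatives - 1
        if j > best_j:  # keeps the lowest threshold among ties
            best_threshold, best_j = t, j
    return best_threshold, best_j


if __name__ == "__main__":
    apss_values = [0.99, 0.98, 0.97, 0.90, 0.85, 0.80]
    labels = [True, True, True, False, False, False]
    print(youden_threshold(apss_values, labels))
```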

To check how well SynTracker performed when tracking strains, we first used it to track strains in the Poyet et al.30 testing set and determined the average specificity and sensitivity with each number of subsampled regions per pairwise comparison (Methods, Supplementary Table 8 and Extended Data Fig. 7). We then applied the APSS thresholds to a separate dataset to identify strain sharing between mothers and infants32. We observed a high proportion of strains shared between mothers and infants when infants were very young. Interestingly, as infants aged, the total number of shared strains increased, while their proportion of all pairwise comparisons decreased (Extended Data Fig. 8). These results highlight SynTracker’s utility for uncovering biologically meaningful phenomena when used as a standalone strain-comparison tool.

Synteny-based and SNP-based analyses reveal modes of genome evolution

Given SynTracker’s high sensitivity to structural genomic variations and low sensitivity to SNPs, we aimed to use it in combination with an SNP-based strain-tracking tool to examine how each of these tools captures different types of within-species genomic diversity. We applied SynTracker in combination with inStrain, a widely used strain-tracking tool with high sensitivity to SNPs28. The datasets we used here were (1) a collection of 12 Neisseria gonorrhoeae clinical isolates harboring antibiotic resistance (Supplementary Table 2 and Methods); (2) a population of hypermutator E. coli isolates that emerged after the colonization of four mice with two ancestral E. coli substrains33 (Supplementary Table 3); (3) 77 H. pylori clinical isolates obtained from six individuals2 (Supplementary Table 4); and (4) a collection of Streptomyces rimosus M527 from different fermentations (Supplementary Table 5 and Methods).

First, we performed 66 pairwise comparisons for the N. gonorrhoeae isolates. This species yielded a very high correlation between the SNP-based and synteny-based strain similarity scores (Spearman’s ρ = 0.985), suggesting that the genomic diversity of this species is achieved through both point mutations and structural genomic variation (Fig. 3a). SynTracker identified a larger proportion of the comparisons as the same strain compared to inStrain. Second, for the E. coli population, we performed 185 pairwise comparisons using inStrain; none of them were classified as the same strain using inStrain’s default same-strain threshold. In contrast, the SynTracker analysis classified all as belonging to the same strain using an APSS threshold of 0.955 (slightly more stringent than the threshold used above). Third, when applied to the H. pylori genome data, we performed 21–91 pairwise comparisons per participant and observed that, in three of the participants (194, 249 and 295; Extended Data Fig. 5), both tools assigned a majority of the genome pairs to the same strains. In the three other participants (322, 326 and 439), however, we detected additional subpopulations: subsets of genome pairs classified as belonging to the same strain by only one of the tools or neither (Fig. 3 and Extended Data Fig. 5). These results indicate that, within a single species and within a single host, subsets of H. pylori are generating genomic diversity using two very different modes, sometimes together and sometimes independently. Fourth, for the S. rimosus samples, we performed 185 pairwise comparisons and observed that, while all strain pairs appeared as clonal in the SNP-based analysis, the SynTracker analysis resulted in a wide range of APSS values, with some classified as different strains. This result suggests that the within-species genomic variation of S. rimosus is achieved through structural differences but not SNPs (Fig. 3d).
These combined results are consistent with inStrain’s sensitivity to SNPs and SynTracker’s sensitivity to genomic structural differences. Each method provides a different view of the within-species genomic diversity. When used in combination, they powerfully synergize to generate a more comprehensive view of the modes of evolution underway.
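The four outcomes of a combined analysis can be expressed as a simple two-threshold rule. The sketch below uses the same-strain cutoffs quoted in the Fig. 3 legend (APSS > 0.955 and popANI > 0.99999); the function name and the category labels are hypothetical.

```python
APSS_CUTOFF = 0.955      # SynTracker same-strain cutoff (Fig. 3 legend)
POPANI_CUTOFF = 0.99999  # inStrain same-strain cutoff (Fig. 3 legend)


def classify_pair(apss, popani):
    """Assign one pairwise comparison to one of the four Fig. 3 categories."""
    same_by_synteny = apss > APSS_CUTOFF
    same_by_snps = popani > POPANI_CUTOFF
    if same_by_synteny and same_by_snps:
        return "same strain by both tools"
    if same_by_synteny:
        # Structure conserved but SNPs present, as in the hypermutator E. coli.
        return "same strain by SynTracker only"
    if same_by_snps:
        # Clonal by SNPs but structurally different, as in S. rimosus.
        return "same strain by inStrain only"
    return "different strains by both tools"


if __name__ == "__main__":
    print(classify_pair(0.98, 0.9995))
    print(classify_pair(0.90, 0.999995))
```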

Fig. 3: Combined synteny-based and SNP-based analysis of simple microbial populations. a–d, Points represent specific pairwise comparisons; popANI, SNP-based strain similarity, as calculated by inStrain. Red and blue lines show the same-strain cutoffs for SynTracker (>0.955) and inStrain (>0.99999); the same cutoffs are used in each panel. Purple points mark comparisons classified as the same strain by both tools, red points mark comparisons classified as the same strain by SynTracker only, blue points mark comparisons classified as the same strain by inStrain only and gray points are pairs marked as belonging to different strains by both tools. N. gonorrhoeae, n = 66 comparisons (a); hypermutator E. coli, n = 185 comparisons (b); H. pylori clinical isolates (participant 326), n = 91 comparisons (c); S. rimosus M527, n = 185 comparisons (d).

To validate our findings, we performed genome alignments for randomly selected isolates from the E. coli, H. pylori and S. rimosus datasets. In agreement with our results, we observed that the overall genome order was more uniform in E. coli compared to H. pylori and S. rimosus (Methods and Extended Data Fig. 6).

SynTracker applied to microbiomes reveals evolutionary patterns

We applied SynTracker to fecal metagenomes obtained from 1,133 individuals (747 adults and 386 related infants) residing in three countries34. In a first analysis, we used mother–infant pairs (MIPs; both related and unrelated) and combined the SynTracker analysis with inStrain analysis to identify species with distinct modes of accumulating genomic diversity, that is, point mutations or insertions, deletions and recombination. In a second analysis, we excluded true MIPs and tracked strains across locations to characterize spatial patterns exhibited by different species. In both analyses, we used the same set of MAGs as a reference for both SynTracker and inStrain.

1. Detection of hypermutators and hyper-recombinators from metagenome data

Overall, SynTracker made twice as many strain comparisons as inStrain (40,000 versus 19,000), of which ~12,000 pairwise comparisons were performed by both tools. Although this subset of comparisons was made by both tools, each tool classified a different set of pairs as having the highest strain similarities (Fig. 4a). To identify which taxa comprised the sets of strains flagged as most similar by each of the tools, we compared the enrichment P values of each species in the two subsets (that is, most similar 5% of the strain comparisons according to each tool). Most species showed similar enrichment in both sets, suggesting that their within-species genomic diversity originated from both SNPs and structural differences. However, a subset of species showed differential enrichment in one or the other set, indicating that their genomic diversity originated preferentially from point mutations (hypermutators) or from structural differences (hyper-recombinators or species with an increased indel rate). Hypermutators included Phocaeicola vulgatus, B. fragilis and Alistipes putredinis, while hyper-recombinators included Phocaeicola massiliensis, Streptococcus thermophilus, Prevotella spp., Streptococcus gallolyticus and others (Fig. 4b).

Fig. 4: Combined synteny-based and SNP-based analysis of human gut microbiome reveals patterns of within-species genomic diversity. a, Synteny-based (APSS) and SNP-based (popANI) strain similarities in human gut metagenomes collected in Gabon, Germany and Vietnam. Dots represent pairwise comparisons within a species. Dots are colored when they are in the top 5% most similar: blue by inStrain, red by SynTracker and purple by both (all others are gray). Density plots show the distribution of scores for each tool. b, Species showing enrichment of SNPs or structural differences. Points represent species. The x axis shows the ratio between the enrichment of each species (that is, the hypergeometric P value) in the most similar 5% of the comparisons by tool. Right, species enriched in SNPs; left, species enriched in structural differences. The y axis indicates the degree of enrichment.

2. Networks built from APSS values allow the visualization of strain patterns

To visualize strain relatedness, we used APSS scores to build networks, where nodes represent hosts and edges are weighted by their APSS. In this analysis, we used SynTracker to assess whether conspecific strains sampled from persons living in geographic proximity are more similar to each other compared to strains obtained from people living further apart. In total, we were able to assess 145 species (that is, those identified with a sufficient genome coverage in >5 hosts), 61 of which showed significantly higher APSS values in individuals living in the same province (in Gabon or Vietnam) or in the same federal state (Germany), compared to individuals living in different provinces or states (Fig. 5a). For 39 species, the difference in mean APSS between strains obtained from the same or different province or state was medium to large (|d| > 0.5; Fig. 5a).
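The effect size quoted here can be illustrated with a short sketch of Cohen's d for two groups of APSS values (same versus different province or state). The pooled-standard-deviation form and the conventional magnitude cutoffs of 0.2, 0.5 and 0.8 are assumptions; the text only states that |d| > 0.5 was treated as medium to large. Function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def magnitude(d):
    """Conventional labels (assumed cutoffs: 0.2, 0.5 and 0.8)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


if __name__ == "__main__":
    same_region = [0.97, 0.95, 0.96, 0.98]       # hypothetical APSS values
    different_region = [0.90, 0.93, 0.91, 0.94]  # hypothetical APSS values
    d = cohens_d(same_region, different_region)
    print(round(d, 2), magnitude(d))
```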

Fig. 5: Strain analysis of human gut metagenomes collected in Gabon, Germany and Vietnam. a, Distribution of strain similarities for species within gut microbiomes of individuals living within the same (yellow box) or different (maroon box) provinces (Gabon and Vietnam) or federal states (Germany). Stars correspond to Benjamini–Hochberg-corrected P values (one-sided Wilcoxon–Mann–Whitney test). *q < 5 × 10⁻², **q < 5 × 10⁻³ and ***q < 5 × 10⁻⁵. Species are sorted according to the effect size and rightmost bars denote the effect size magnitude (large, medium, small and negligible). b, Species-specific strain-similarity networks. Nodes represent hosts and are colored by country (red, Gabon; blue, Vietnam; green, Germany) and edges represent strain comparisons, with APSS values as edge weights. Nodes are clustered by edge weights. Colored squares point to the magnitude category of the effect size of the species, as given in a.

The networks highlighted strain patterns by species by country. For instance, Lachnospira rogosae and Agathobacter rectalis showed distinct strain clusters within countries (Cohen’s d = 1.09 and 0.99, respectively; Fig. 5b). Similarly, the strain comparisons for Lachnospira sp003537285 resulted in two unconnected network clusters, one from Vietnam and one from Germany (with one sole carrier from Gabon). These results suggest the within-population evolution of strains with limited dispersal. In contrast, the genomes of two E. coli subspecies (Cohen’s d = 0.09) formed a network composed of distinct clusters, each containing participants from all three countries (Fig. 5b). This pattern suggests a high degree of geographic exchange across large distances. The species A. putredinis also yielded a network suggestive of cosmopolitan strains.
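The notion of unconnected network clusters can be made concrete with a small sketch that treats hosts as nodes and keeps only edges whose APSS meets a cutoff, then reports the connected components. Thresholding the edges is an assumption made for illustration (the networks described here use APSS as edge weights for clustering rather than a hard cutoff), and the host labels, the function name and the 0.955 value in the example are hypothetical choices.

```python
def strain_clusters(edges, min_apss):
    """Connected components of hosts linked by edges with APSS >= min_apss.

    edges: iterable of (host_a, host_b, apss) tuples for one species.
    """
    parent = {}

    def find(host):
        parent.setdefault(host, host)
        while parent[host] != host:
            parent[host] = parent[parent[host]]  # path halving
            host = parent[host]
        return host

    for host_a, host_b, apss in edges:
        root_a, root_b = find(host_a), find(host_b)  # registers both hosts
        if apss >= min_apss:
            parent[root_a] = root_b

    clusters = {}
    for host in parent:
        clusters.setdefault(find(host), set()).add(host)
    return sorted(clusters.values(), key=sorted)


if __name__ == "__main__":
    comparisons = [
        ("VN1", "VN2", 0.99),
        ("DE1", "DE2", 0.98),
        ("VN1", "DE1", 0.80),
    ]
    print(strain_clusters(comparisons, min_apss=0.955))
```

In this toy input the Vietnamese and German hosts fall into separate components, the pattern interpreted above as limited dispersal; a cosmopolitan species would instead yield components mixing hosts from several countries.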

The network diagrams also allow subtle patterns of strain distribution to be readily identified. For instance, Bifidobacterium longum is primarily a cosmopolitan species (according to Cohen’s d value) and its strain comparisons resulted in three clusters, each composed of participants living in a single country. Surprisingly, a fourth cluster was made up of strains obtained from participants from all three countries. This pattern suggests the coexistence of geographically constrained and cosmopolitan strains within B. longum. Taken together, the network visualization of APSS scores allows a holistic view of strain patterns to emerge from highly complex data and enables the simultaneous assessment of strain dynamics. Although we applied this concept to spatial patterns here, any other type of metadata could be applied to the networks to aid interpretation of the patterns.

Benchmarking SynTracker against other strain-comparison tools

To benchmark SynTracker against three state-of-the-art strain-comparison tools (MIDAS35, StrainPhlAn36 and inStrain28), we used a test performed in the literature28, in which these tools were used to classify strains as shared or nonshared at different average nucleotide identity (ANI) cutoffs. In this test, gut metagenomic samples obtained from three pairs of premature infant twins were used. As the ground truth, conspecific strains residing in unrelated infants were assumed to be different, while those residing in twin infant pairs were considered the same, based on previous findings37. In a similar manner, we used SynTracker to identify shared strains in these metagenomic samples over a range of APSS values. The reference genomes used in this analysis were the species-representative genomes (SRGs; MAGs assembled from the same metagenomic samples and clustered on the basis of ANI) obtained from the literature28.

To determine the performance of each of the methods, we generated a receiver operating characteristic (ROC) curve for each tool and calculated the area under the curve (AUC), which is an indicator of its performance. SynTracker achieved an AUC of 0.93 when using 40 regions per pairwise comparison, higher than that of any of the previously published tools (Fig. 6a). SynTracker also had the highest sensitivity, detecting 112 strain pairs compared to 85, 92 and 101 for StrainPhlAn, inStrain and MIDAS, respectively. Additionally, SynTracker was the most effective tool for tracking plasmid and phage strains, outperforming inStrain (Fig. 6b).
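As a reminder of what the AUC measures, the sketch below computes it directly as the probability that a truly shared strain pair (twins) receives a higher score than a nonshared pair (unrelated infants), with ties counted as one half. This rank-based formulation is equivalent to the area under the ROC curve; the function name and the example values are hypothetical and do not reproduce the benchmark data.

```python
def roc_auc(scores, is_shared):
    """AUC as P(score of a shared pair > score of a nonshared pair).

    scores: similarity score per strain pair (for example, APSS).
    is_shared: True for pairs assumed shared (twins), False otherwise.
    """
    shared = [s for s, y in zip(scores, is_shared) if y]
    nonshared = [s for s, y in zip(scores, is_shared) if not y]
    if not shared or not nonshared:
        raise ValueError("need at least one shared and one nonshared pair")

    wins = 0.0
    for p in shared:
        for q in nonshared:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5  # ties contribute half a win
    return wins / (len(shared) * len(nonshared))


if __name__ == "__main__":
    apss_values = [0.99, 0.80, 0.97, 0.60]
    truth = [True, False, True, False]
    print(roc_auc(apss_values, truth))
```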