Stochasticity, determinism, and contingency shape genome evolution of endosymbiotic bacteria

Evolution results from the interaction of stochastic and deterministic processes that create a web of historical contingency, shaping gene content and organismal function. To understand the scope of this interaction, we examine the relative contributions of stochasticity, determinism, and contingency in shaping gene inactivation in 34 lineages of endosymbiotic bacteria, Sodalis, found in parasitic lice, Columbicola, that are independently undergoing genome degeneration. Here we show that the process of genome degeneration in this system is largely deterministic: genes involved in amino acid biosynthesis are lost while those involved in providing B-vitamins to the host are retained. In contrast, many genes encoding redundant functions, including components of the respiratory chain and DNA repair pathways, are subject to stochastic loss, yielding historical contingencies that constrain subsequent losses. Thus, while selection results in functional convergence between symbiont lineages, stochastic mutations initiate distinct evolutionary trajectories, generating diverse gene inventories that lack the functional redundancy typically found in free-living relatives.


Fig. S2. Maximum-likelihood phylogeny of louse endosymbionts and representative
Enterobacterales based on 241 single copy orthologs using the GTR-G model of base substitutions.Tree also presented in fig.1a, but shown here with host and collection location described at tree tips.Tree with all bootstrap values and underlying DNA sequence alignments can be found in the data repository.Source data are provided as a Source Data file.

Fig. S3. Maximum-likelihood phylogeny of louse endosymbionts and representative
Enterobacterales based on 241 single copy orthologs using selected time-reversible base substitution models.Tree also presented in fig.1a, but shown here with host and collection location described at tree tips.Tree with all bootstrap values and underlying DNA sequence alignments can be found in the data repository.Source data are provided as a Source Data file.

Fig. S4. Maximum-likelihood phylogeny of louse endosymbionts and representative
Enterobacterales based on 241 single copy orthologs using selected amino acid base substitution models.Tree also presented in fig.1a, but shown here with host and collection location described at tree tips.Tree with all bootstrap values and underlying amino acid sequence alignments can be found in the data repository.Source data are provided as a Source Data file.3a, depicting the near-identical gene inventories derived from the novel alignmentbased approach, developed for this study (pipeline), versus the metagenomic assembly approach (manual).Genes that are determined by either approach to be intact are shown in green.Genes having ORFs spanning more than one contig or locally colinear block that cannot be evaluated (using the assembly-based approach) are shown in pink.Discrepancies in the annotation of intact genes between the two approaches are shown in black on a separate track labeled "Discrepancy" e, Representation of sequences derived from metagenomic assembly that were not aligned to S. praecaptivus by the Mauve algorithm, categorized as "fragments" (remnants of genes that are small in size), "mobile DNA" (corresponding to IS-element and phage sequences), and "intact genes".Abbreviations: C. = Columbicola.Source data are provided as a Source Data file.

Table S1: Endosymbiont identification, sample collection location, and host associations.
*C. adamsi and C. tasmaniensis were each sampled from two different dove species.Endosymbiont classification is based on preliminary pairwise comparisons with representative bacterial genomes.

Phylogenomics:
Data sources: Predicted protein coding sets were downloaded from NCBI Genome Assembly database (https://www.ncbi.nlm.nih.gov/assembly).We favored CDS described in the RefSeq annotation, but accepted the GenBank annotation when the RefSeq annotation was not available.To sample critical taxa for which predicted CDS were not available on NCBI, we downloaded either un-annotated contigs from NCBI or the raw sequence reads from the NCBI short-read archive (https://www.ncbi.nlm.nih.gov/sra).In the case where we obtained an assembly lacking an annotation, we generated an annotation using RAST [1][2] .When raw reads were utilized for whole genome assembly, we prepared the reads using fastp v0.23.2 and preformed a de novo assembly using metaSPAdes (v.3.14.0) [3][4] .We then identified candidate contigs belonging to the endosymbiont genome using NCBI BLASTx v2.10.0+,comparing to a custom database composed of representative Enterobacteriaceae 5 .These putative endosymbiont contigs were then annotated using RAST or MiGa 6 .
Ortholog identification: OrthoFinder v2.3.14 was used to identify orthologous gene groups among the newly assembled and downloaded gene sets 7 .From the resulting orthogroups, we identified each group that contained one of the 297 single-copy-orthologs used in our original Sodalis specific phylogenomic analysis.We then examined summary statistics, removing two newly downloaded species, due to poor representation in the focal 297 gene set.This left us with 48 newly added taxa, in addition to the 32 existing endosymbionts and Sodalis praecaptivus used in the targeted phylogenomic analysis of Sodalis.Next, we filtered the orthogroups, rejecting any individual species contribution to orthogroup that contained a paralog (i.e. one species contributed more than one locus).Finally, if four or more taxa were not represented in the filtered groups (i.e. they never contributed a gene or had their contribution removed due to paralogy), whole ortholog groups were rejected.This left us with 241 single-copy-orthologs, representing 95% or more of the taxa; effectively minimizing missing data, while providing many loci for phylogenetic estimation.

Comparative Genomics:
Calculation of Relationship Strength: The strength of direct or reciprocal/alternating relationships between strings was computed using the function 0 where 0.5 is the lesser of two Shannon Entropies (having a maximum possible value of 1) multiplied by 0.5 so that it contributes to one half of the relationship strength value.The other half of the relationship strength value is derived from the Hamming distance between strings, which corresponds to the strength of the direct or reciprocal/alternating match.However, because Hamming distances can be used to determine the strength of both direct and reciprocal/alternating matches *( − 17) 1 is used to transform values to a scale from 0 to 17, regardless of whether the relationship is direct or reciprocal/alternating.Validation of Genome Assembly and Annotation Pipeline: The alignment and annotation pipeline was validated by generating de novo assemblies using metaSPAdes from sequence reads from the C. claytoni, C. fradei and C. rodmani endosymbionts using default parameters 10 .The C. claytoni sample was chosen because it has the largest endosymbiont genome and is therefore anticipated to have a large quotient of pseudogenes that maintain the lowest level of sequence divergence from S. praecaptivus orthologs.Additionally, C. fradei was selected because its sequence reads have the lowest ratio of endosymbiont:host representation in the read library, resulting in a low depth of coverage in an assembly.Finally, C. rodmani was selected because its genes are anticipated to have the highest overall level of sequence divergence relative to S. praecaptivus orthologs, consistent with the fact that phylogenetic reveal that it maintains the highest patristic distance from S. praecaptivus.The C. claytoni, C. fradei and C. rodmani assemblies yielded 55,314, 17,650 and 35,211 contigs, with median sizes of 2,084 bp, 6,351 bp and 2,647 bp respectively.BLASTN and BLASTX 11 searches were then performed, using the nr/nt and nr databases respectively, to identify all contigs sharing sequence identity with S. praecaptivus.In total, 17, 18 and 13 contigs were identified with size ranging from 1-728 kbp, 1-1,169 kbp and 1-247 kbp, comprising the genomes of the C. claytoni, C. fradei and C. rodmani Sodalis-allied endosymbionts (Fig. S8a-b).These contigs were concatenated and then aligned using the Mauve algorithm with default parameters to the genome of S. praecaptivus (Fig. S8c).Intact protein-coding genes were identified by manual inspection, (guided by the Mauve alignments) on the basis of similarity of the best predicted translation of the region aligning to each gene in S. praecaptivus including up to 25 bp of flanking sequence at the 3' and 5' ends of each candidate ORF (to account for alternative start and stop codons).Genes were designated as intact if their best fit translation shared at least 89.79%, 83.13%, and 64.75% similarity for C. claytoni, C. fradei and C. rodmani endosymbionts, respectively, consistent with cutoff values estimated previously using the expectation maximization approach described in Methods (see section "Genome Sequences Annotation" in the associated publication).Genes having ORFs that spanned more than one contig or locally colinear block (as defined in Mauve) were treated as missing data.In cases where the status of a gene could be identified in the metagenomic assembly-based approach, there was greater than 99% agreement between the inventories of intact genes derived from the alignment-based pipeline and the metagenomic assembly-based approach (Fig. S8d).A small number of orphaned regions within the metagenomic contigs did not align to the S. praecaptivus genome using Mauve.We compared these regions to proteins available in the NCBI non-redundant (nr) protein database (database downloaded on March 9, 2022) using BLASTX, finding 47%, 31%, and 83%, in C. claytoni, C. fradei and C. rodmani endosymbiont respectively, of orphaned regions returned a significant hit (e<0.1 -5 ) to an S. praecaptivus protein.We filtered out sequences that returned a significant hit to S. praecaptivus genome, assuming that the regions share orthologs with S. praecaptivus, but failed to align.Of the remaining fragments, we identified open reading frames (ORFs) and isolated the longest ORFs (up to a total of 10 ORFs).We then compared these ORFs to proteins available in the nr database (as translated amino acids) using BLASTP.Again, some of the ORFs returned a significant hit to a S. praecaptivus protein and were excluded.We then evaluated the remaining ORFs based on the BLAST hit.The results indicate that the gene inventories of the C. claytoni, C. fradei and C. rodmani endosymbionts are largely subsets of S. praecaptivus, with the exception of small gene fragments (likely pseudogenes) and sequences derived from mobile genetic elements (IS-elements and phages), which are known to accumulate in the genomes of endosymbionts 12 (Fig. S8e).Aside from those, only two novel proteins were discovered (two copies of one in the endosymbiont of C. claytoni and one in the C. fradei endosymbiont), predicted to encode a metallochaperone and a PhoX-family phosphatase, respectively.Taken together, these validation efforts provide support for the notion that our custom (alignment-based) pipeline yields consistent and accurate annotation of the feather louse endosymbiont genomes in our study, regardless of their patristic distance from S. praecaptivus and the depth of sequencing coverage of the endosymbiont genome.Furthermore, based on these validations the feather louse endosymbionts do not maintain significant genetic functionality beyond that encoded by S. praecaptivus.
Finally, to provide maximum accessibility our data, we generated metagenomic assemblies using the sequence reads derived from all remaining lice used in our study, employing the same metaSPAdes approach.All contigs from each assembly sharing sequence identity with the S. praecaptivus whole genome sequence were identified and binned, yielding at least 1-Mb of sequence for each sample with the exception of C. adamsi, whose assembly yielded only a few contigs of small size due to low sequence coverage.Therefore, symbionts contigs from all samples except C. adamsi, were uploaded (https://www.ncbi.nlm.nih.gov/genome/).
Color Palettes and Maps: Figures in both the main paper and supplementary information use color palettes and maps derived from ColorBrewer 2.0 13 , an online CIELAB gradient making tool (https://davidjohnstone.net/lch-lab-colour-gradient-picker), colorCET (colorcet.com) 14and Scientific colour maps (version 8.01) 15 to minimize visual distortion and increase accessibility.

Fig. S5 :
Fig. S5: MutL linker sequence and annotation.a, Alignment of the MutL disordered linker sequences found in S. praecaptivus (top) and those louse endosymbionts predicted to maintain intact copies of MutH.Green bars depict regions of sequence coverage.b, AlphaFold prediction of the structure of MutL in S. praecaptivus (left) and the C. claytoni symbiont (right).In each, the two domains of the MutL monomer are connected by a largely unstructured linker that is substantially larger in the S. praecaptivus homolog.Abbreviations: C. = Columbicola.

Fig. S7 .
Fig. S7.Fraction of patristic distance consisting of branch following basal cospeciation event in louse endosymbionts.Distances obtained from phylogenetic tree present in Fig. 1b.

Fig S8 :
Fig S8: Relationships between different methods for genome assembly.a, Relationships between G+C base content and contig size, comparing metagenomic assemblies of C. claytoni, C. fradei and C. rodmani.Predicted endosymbiont contigs have relatively similar G+C base content (for each endosymbiont) and are highlighted in red, while other contigs (comprising host sequences) are shown in grey.b, Relationship between sequencing depth and length of contigs from endosymbiont contigs shown in red in a. c, alignments of endosymbiont metagenomic contigs (top) and the genome of S. praecaptivus (bottom, plasmid shown in green).d, Matrices, analogous to those in Fig.3a, depicting the near-identical gene inventories derived from the novel alignmentbased approach, developed for this study (pipeline), versus the metagenomic assembly approach (manual).Genes that are determined by either approach to be intact are shown in green.Genes having ORFs spanning more than one contig or locally colinear block that cannot be evaluated (using the assembly-based approach) are shown in pink.Discrepancies in the annotation of intact genes between the two approaches are shown in black on a separate track labeled "Discrepancy" e, Representation of sequences derived from metagenomic assembly that were not aligned to S. praecaptivus by the Mauve algorithm, categorized as "fragments" (remnants of genes that are small in size), "mobile DNA" (corresponding to IS-element and phage sequences), and "intact genes".Abbreviations: C. = Columbicola.Source data are provided as a Source Data file.