Pseudogenes act as a neutral reference for detecting selection in prokaryotic pangenomes

Abstract

A long-standing question is to what degree genetic drift and selection drive the divergence in rare accessory gene content between closely related bacteria. Rare genes, including singletons, make up a large proportion of pangenomes (all genes in a set of genomes), but it remains unclear how many such genes are adaptive, deleterious or neutral to their host genome. Estimates of species’ effective population sizes (Ne) are positively associated with pangenome size and fluidity, which has independently been interpreted as evidence for both neutral and adaptive pangenome models. We hypothesized that pseudogenes, used as a neutral reference, could be used to distinguish these models. We find that most functional categories are depleted for rare pseudogenes when a genome encodes only a single intact copy of a gene family. In contrast, transposons are enriched in pseudogenes, suggesting they are mostly neutral or deleterious to the host genome. Thus, even if individual rare accessory genes vary in their effects on host fitness, we can confidently reject a model of entirely neutral or deleterious rare genes. We also define the ratio of singleton intact genes to singleton pseudogenes (si/sp) within a pangenome, compare this measure across 668 prokaryotic species and detect a signal consistent with the adaptive value of many rare accessory genes. Taken together, our work demonstrates that comparing with pseudogenes can improve inferences of the evolutionary forces driving pangenome variation.

Fig. 1: Distributions of gene or pseudogene sequence clusters by species and frequency in the pangenome, restricted to clusters that could be COG annotated.
Fig. 2: Most COG functional categories are depleted in pseudogenes when there is no redundant gene copy in the genome.
Fig. 3: Distributions of singleton-based pangenome diversity and molecular evolution metrics.
Fig. 4: Associations between pangenome diversity metrics and dN/dS, a proxy for the efficacy of selection.
Fig. 5: Spearman’s correlations between molecular evolution and pangenome diversity metrics.
Fig. 6: Weak relationships between the mean percent of each species’ genome covered by pseudogenes, within-species dN/dS and within-species dS.

Data availability

Key data files are openly available on Zenodo (ref. 48; https://doi.org/10.5281/zenodo.7942836). All analysed genomes are publicly available as part of NCBI RefSeq/GenBank (with accession IDs listed in the Zenodo repository). Additional databases used in this study include the eggNOG 5 database for eggNOG-mapper (http://eggnog5.embl.de) and UniProt KB release 2022_01 (https://www.uniprot.org/release-notes/2022-02-23-release).

Code availability

The code used for the analyses in this paper is openly available at GitHub (https://github.com/gavinmdouglas/pangenome_pseudogene_null).

References

  1. Innamorati, K. A., Earl, J. P., Aggarwal, S. D., Ehrlich, G. D. & Hiller, N. L. in The Pangenome: Diversity, Dynamics and Evolution of Genomes (eds Tettelin, H. & Medini, D.) 51–87 (Springer, 2020); https://doi.org/10.1007/978-3-030-38281-0_3

  2. Sela, I., Wolf, Y. I. & Koonin, E. V. Theory of prokaryotic genome evolution. Proc. Natl Acad. Sci. USA 113, 11399–11407 (2016).

  3. Bobay, L. M. & Ochman, H. Factors driving effective population size and pan-genome evolution in bacteria. BMC Evol. Biol. 18, 153 (2018).

  4. McInerney, J. O., McNally, A. & O'Connell, M. J. Why prokaryotes have pangenomes. Nat. Microbiol. 2, 17040 (2017).

  5. Kimura, M. & Crow, J. F. The number of alleles that can be maintained in a finite population. Genetics 49, 725–738 (1964).

  6. Andreani, N. A., Hesse, E. & Vos, M. Prokaryote genome fluidity is dependent on effective population size. ISME J. 11, 1719–1721 (2017).

  7. Vos, M. & Eyre-Walker, A. Are pangenomes adaptive or not? Nat. Microbiol. 2, 1576 (2017).

  8. Danneels, B., Pinto-Carbó, M. & Carlier, A. Patterns of nucleotide deletion and insertion inferred from bacterial pseudogenes. Genome Biol. Evol. 10, 1792–1802 (2018).

  9. Kuo, C.-H. & Ochman, H. The extinction dynamics of bacterial pseudogenes. PLoS Genet. 6, e1001050 (2010).

  10. Wolf, Y. I., Makarova, K. S., Lobkovsky, A. E. & Koonin, E. V. Two fundamentally different classes of microbial genes. Nat. Microbiol. 2, 16208 (2016).

  11. Galperin, M. Y. et al. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 49, D274–D281 (2021).

  12. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).

  13. Kislyuk, A. O., Haegeman, B., Bergman, N. H. & Weitz, J. S. Genomic fluidity: an integrative view of gene diversity within microbial populations. BMC Genomics 12, 32 (2011).

  14. Rocha, E. P. C. et al. Comparisons of dN/dS are time dependent for closely related bacterial genomes. J. Theor. Biol. 239, 226–235 (2006).

  15. Kryazhimskiy, S. & Plotkin, J. B. The population genetics of dN/dS. PLoS Genet. 4, e1000304 (2008).

  16. Boucher, Y. et al. Local mobile gene pools rapidly cross species boundaries to create endemicity within global Vibrio cholerae populations. mBio 2, e00335-10 (2011).

  17. Niehus, R., Mitri, S., Fletcher, A. G. & Foster, K. R. Migration and horizontal gene transfer divide microbial genomes into multiple niches. Nat. Commun. 6, 8924 (2015).

  18. Smillie, C. S. et al. Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480, 241–244 (2011).

  19. Hottes, A. K. et al. Bacterial adaptation through loss of function. PLoS Genet. 9, e1003617 (2013).

  20. Oren, Y. et al. Transfer of noncoding DNA drives regulatory rewiring in bacteria. Proc. Natl Acad. Sci. USA 111, 16112–16117 (2014).

  21. Peng, T., Lin, J., Xu, Y.-Z. & Zhang, Y. Comparative genomics reveals new evolutionary and ecological patterns of selenium utilization in bacteria. ISME J. 10, 2048–2059 (2016).

  22. Schlüter, A. et al. Erythromycin resistance-conferring plasmid pRSB105, isolated from a sewage treatment plant, harbors a new macrolide resistance determinant, an integron-containing Tn402-like element, and a large region of unknown function. Appl. Environ. Microbiol. 73, 1952–1960 (2007).

  23. Bobay, L.-M., Rocha, E. P. C. & Touchon, M. The adaptation of temperate bacteriophages to their host genomes. Mol. Biol. Evol. 30, 737–751 (2013).

  24. McKerral, J. C. et al. The promise and pitfalls of prophages. Preprint at bioRxiv https://doi.org/10.1101/2023.04.20.537752 (2023).

  25. Giovannoni, S. J., Cameron Thrash, J. & Temperton, B. Implications of streamlining theory for microbial ecology. ISME J. 8, 1553–1565 (2014).

  26. Daubin, V. & Moran, N. A. Comment on ‘The Origins of Genome Complexity’. Science 306, 978 (2004).

  27. Lynch, M. & Conery, J. S. Response to comment on ‘The Origins of Genome Complexity’. Science 306, 978 (2004).

  28. Sharp, P. M., Bailes, E., Grocock, R. J., Peden, J. F. & Sockett, R. E. Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res. 33, 1141–1153 (2005).

  29. Koonin, E. V. Splendor and misery of adaptation, or the importance of neutral null for understanding evolution. BMC Biol. 14, 114 (2016).

  30. Rocha, E. P. C. Neutral theory, microbial practice: challenges in bacterial population genetics. Mol. Biol. Evol. 35, 1338–1347 (2018).

  31. Li, W. & Godzik, A. CD-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

  32. Cantalapiedra, C. P., Hernández-Plaza, A., Letunic, I., Bork, P. & Huerta-Cepas, J. eggNOG-mapper v2: functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).

  33. Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5,090 organisms and 2,502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).

  34. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).

  35. Brooks, M. E. et al. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. R J. 9, 378–400 (2017).

  36. Bowers, R. M. et al. Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea. Nat. Biotechnol. 35, 725–731 (2017).

  37. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).

  38. Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).

  39. Tonkin-Hill, G. et al. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biol. 21, 180 (2020).

  40. Syberg-Olsen, M. J., Garber, A. I., Keeling, P. J., McCutcheon, J. P. & Husnik, F. Pseudofinder: detection of pseudogenes in prokaryotic genomes. Mol. Biol. Evol. 39, msac153 (2022).

  41. The UniProt Consortium. The universal protein resource. Nucleic Acids Res. 36, D190–D195 (2008).

  42. Tange, O. GNU parallel: the command-line power tool. Login USENIX Mag. 36, 42–47 (2011).

  43. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).

  44. Kosakovsky Pond, S. L. et al. HyPhy 2.5—a customizable platform for evolutionary hypothesis testing using phylogenies. Mol. Biol. Evol. 37, 295–299 (2020).

  45. Nei, M. & Gojobori, T. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3, 418–426 (1986).

  46. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).

  47. Gu, Z. Complex heatmap visualization. iMeta 1, e43 (2022).

  48. Douglas, G. M. & Shapiro, B. J. Data and code for ‘Pseudogenes act as a neutral reference for detecting selection in prokaryotic pangenomes’. Zenodo https://doi.org/10.5281/zenodo.7942836 (2023).

Acknowledgements

We would like to thank W. F. Doolittle for providing motivating ideas and for advice and feedback throughout this project. We would also like to thank L.-M. Bobay for reading a draft of this paper and providing feedback and A. Eyre-Walker for providing constructive comments. G.M.D. was supported by a Natural Sciences and Engineering Research Council of Canada (NSERC) Postdoctoral Fellowship, and B.J.S. is supported by an NSERC Discovery Grant.

Author information

Contributions

Both G.M.D. and B.J.S. designed the study and wrote the paper. G.M.D. conducted all analyses.

Corresponding authors

Correspondence to Gavin M. Douglas or B. Jesse Shapiro.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Ecology & Evolution thanks Franz Baumdicker, James McInerney and Maria Rosa Domingo Sananes for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Frequency distributions of clusters by species, pangenome partitions, and element type.

Mixed elements are those that include both pseudogene and intact gene sequences in the same cluster. (a) Distribution for all clusters, including those that could not be annotated with a Clusters of Orthologous Genes (COG) identifier. Percentages correspond to the breakdown per species within a given element type (that is, intact, mixed, or pseudogene). (b) Breakdown of the numbers and percentages of clusters that could not be COG annotated (that is, the percentages correspond to how many of the clusters per cell in panel a could not be COG annotated).

Extended Data Fig. 2 Akaike Information Criterion values for each generalized linear mixed model across the three tested pangenome partitions.

These partitions contained 213,912, 3,650,010, and 12,234,597 separate elements for the ultra-rare, other-rare, and shell partitions, respectively. The model formulas are indicated on the y-axis (see Online Methods for explanation). AIC values were normalized to range from 0 to 1 for better visualization of relative differences, and the raw AIC is indicated beside each bar.
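The rescaling of AIC values described in this caption can be illustrated with a minimal sketch. It assumes a plain min-max rescaling, which the caption does not state explicitly; the function name and the AIC values below are hypothetical and are not taken from the paper.

```python
def min_max_normalize(values):
    """Linearly rescale values so the smallest maps to 0 and the largest to 1.

    Assumption: this is the normalization meant by "normalized to range
    from 0 to 1"; the caption does not give the exact procedure.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All models fit equally well: no relative difference to display.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw AIC values for three candidate models (lower is better).
raw_aic = [1200.0, 1150.0, 1100.0]
normalized_aic = min_max_normalize(raw_aic)
```

Because the rescaling is linear, the ordering of models and the relative spacing between them are preserved; only the axis range changes.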

Extended Data Fig. 3 Intercepts for each species random effect level in the generalized linear mixed models.

Each model is labelled by the corresponding pangenome partition. These partitions contained 213,912, 3,650,010, and 12,234,597 separate elements for the ultra-rare, other-rare, and shell partitions, respectively. Estimates correspond to logit (log-odds) values: estimates > 0 indicate an increased probability of an element being classified as a pseudogene. Note that each model has a different overall intercept value, meaning that the relative differences among species within each model are more informative to compare across models than the absolute intercept values (for instance, Enterococcus faecalis has the highest-magnitude negative intercept in the shell model, but its intercept is of relatively low magnitude in the other models).

Extended Data Fig. 4 Summary of significant coefficients in generalized linear mixed models fit to other-rare and shell pangenome partition elements.

Summary of significant coefficients (P < 0.05) in generalized linear mixed models fit to other-rare and shell pangenome partition elements, which corresponded to 3,650,010 and 12,234,597 separate elements, respectively. The model response was element state (intact or pseudogene). The predictors were each element’s annotated Clusters of Orthologous Genes (COG) category, whether the element is redundant with an intact gene of the same COG ID (that is, gene family, not COG category) in the same genome, and the interaction between these variables. The non-redundant coefficients represent the sum of the overall non-redundant coefficient and the interaction of non-redundancy and each COG category. Bars represent the estimated logit (log-odds) coefficient values: estimates > 0 indicate an increased probability of an element being classified as a pseudogene. Error bars represent one standard error, which is a point estimate per coefficient (rather than reflecting a distribution of coefficients).
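To make the interpretation of these logit coefficients concrete, the following sketch shows how a non-redundant coefficient is formed as a sum and how a log-odds value maps to a probability. All numeric values and function names are hypothetical; the sketch ignores the COG category main effect and the species random effect, so it illustrates the arithmetic only and not the fitted models.

```python
import math


def inverse_logit(log_odds):
    """Convert a log-odds value into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def non_redundant_effect(overall_non_redundant, interaction):
    """Sum of the overall non-redundant coefficient and the
    non-redundancy x COG category interaction, as described in the caption."""
    return overall_non_redundant + interaction


# Hypothetical coefficients on the logit scale (not values from the paper).
intercept = -2.0
overall_non_redundant = -0.5
interaction_for_category = -0.3

combined = non_redundant_effect(overall_non_redundant, interaction_for_category)

# Probability that an element is a pseudogene, before and after adding the effect.
p_baseline = inverse_logit(intercept)
p_non_redundant = inverse_logit(intercept + combined)
```

A negative combined coefficient lowers the probability of pseudogene status relative to the baseline, which is the direction described as depletion in pseudogenes.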

Extended Data Fig. 5 Log-odds ratios for the 156 significant COG identifiers in the mobilome COG category.

Log-odds ratios for the 156 significant (Fisher’s exact test, false discovery rate < 0.05) Clusters of Orthologous Genes (COG) identifiers in the mobilome COG category. Only COG IDs that are significantly enriched or depleted in pseudogenes vs. intact genes across one of the ten tested species are shown (focused on the ultra-rare pangenome partition). These tests were run per species and restricted to redundant elements. Log-odds ratios > 0 indicate that a COG ID is enriched in pseudogenes vs. intact genes. The boxplot features are defined as follows: the centre line represents the median; the lower and upper hinges of the boxplots correspond to the 25th and 75th percentiles; the lower and upper boxplot whiskers extend to the lowest and highest points, respectively, to a limit of 1.5 multiplied by the interquartile range from the closest hinge. The sample sizes per sub-type are 11, 57, 6, and 82 for the mixed/other, phage, plasmid, and transposon categories, respectively.
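The quantities named in this caption (a log-odds ratio, a two-sided Fisher’s exact test and a false discovery rate correction) can be sketched with the standard library alone. The 2x2 table layout, the function names and the counts are assumptions made for illustration. The Benjamini-Hochberg procedure is used here as one common false discovery rate method; the caption does not say which procedure was applied.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Natural-log odds ratio for the table [[a, b], [c, d]].

    Assumed layout: a = pseudogenes with the COG ID, b = pseudogenes without,
    c = intact genes with it, d = intact genes without. Values > 0 indicate
    enrichment in pseudogenes. Zero cells are not handled in this sketch.
    """
    return math.log((a * d) / (b * c))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test P-value from the hypergeometric distribution."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Probability of observing x in the top-left cell with fixed margins.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P-value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts for one COG ID in one species.
table = (3, 1, 1, 3)
lor = log_odds_ratio(*table)
p_value = fisher_exact_two_sided(*table)
fdr = benjamini_hochberg([0.01, 0.04, 0.03])
```

In practice one test would be run per COG ID and species, and the adjusted values would then be compared against the 0.05 threshold.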

Extended Data Fig. 6 Association of pangenome diversity metrics across 668 prokaryotic species.

Panels a–c: Associations among different pangenome diversity metrics, where each point corresponds to one of the 668 species. Singleton-based metrics were estimated based on repeated subsampling to nine genomes per species. Two-tailed Spearman correlation coefficients and P-values are indicated. (d) Gene frequency distribution for 2,187 genes encoded by Mycoplasmopsis bovis genomes. This species is highlighted as it exhibited the highest genomic fluidity (right-most point in panel c), which is driven by population substructure (that is, most genes are present at intermediate frequency, in 10/20 genomes). The P-values reported in panels a and c are the closest approximation provided in the statistical test output.
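Genomic fluidity, one of the metrics compared in this figure, follows the definition of Kislyuk et al. (ref. 13): the mean, over all pairs of genomes, of the number of gene families unique to either genome divided by the total number of gene families in both. The sketch below implements that definition on hypothetical toy genomes; the function name and input format (one set of gene family identifiers per genome) are assumptions for illustration.

```python
from itertools import combinations


def genomic_fluidity(genomes):
    """Mean pairwise ratio of unique gene families to total gene families.

    genomes: list of sets, one set of gene family identifiers per genome.
    """
    pairs = list(combinations(genomes, 2))
    if not pairs:
        raise ValueError("at least two genomes are required")
    total = 0.0
    for g1, g2 in pairs:
        unique = len(g1 ^ g2)  # families present in exactly one of the two genomes
        total += unique / (len(g1) + len(g2))
    return total / len(pairs)


# Hypothetical toy genomes: two share all families, one differs by a single family.
toy_genomes = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b", "c"}]
fluidity = genomic_fluidity(toy_genomes)
```

Under this definition, a species in which most genes sit at intermediate frequency because of population substructure yields many pairs with large unique fractions, which is consistent with the high fluidity noted for Mycoplasmopsis bovis.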

Extended Data Fig. 7 Associations between dN/dS and pangenome diversity, as represented by repeated subsampling to three and 20 genomes separately.

Associations between dN/dS and pangenome diversity, as represented by repeated subsampling to three genomes to compute (a) si and (c) si/sp, and repeated subsampling to 20 genomes to compute (b) si and (d) si/sp. In the main text, si/sp is based on subsampling to nine genomes. Each point is one of 668 prokaryotic species. The two-tailed Spearman correlation coefficients and P-values are indicated. The P-values reported in panels a and b are the closest approximation provided in the statistical test output.
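The singleton-based metrics in this caption can be illustrated with a minimal sketch of repeated subsampling. It assumes that si and sp are each the mean, over the subsampled genomes, of the percentage of a genome’s intact genes (or pseudogenes) found in no other subsampled genome; the exact denominator, the number of replicates and all names below are assumptions, not the authors’ implementation.

```python
import random
from collections import Counter


def percent_singletons(genomes):
    """Mean percent of each genome's elements that occur in no other genome."""
    counts = Counter(e for g in genomes for e in g)
    percents = [100.0 * sum(1 for e in g if counts[e] == 1) / len(g)
                for g in genomes if g]  # skip genomes with no elements
    return sum(percents) / len(percents) if percents else 0.0


def si_sp(intact, pseudo, n_genomes=9, n_reps=100, seed=0):
    """Estimate si, sp and si/sp by repeatedly subsampling n_genomes genomes.

    intact, pseudo: parallel lists with one set of cluster IDs per genome.
    """
    rng = random.Random(seed)
    si_vals, sp_vals = [], []
    for _ in range(n_reps):
        idx = rng.sample(range(len(intact)), n_genomes)
        si_vals.append(percent_singletons([intact[i] for i in idx]))
        sp_vals.append(percent_singletons([pseudo[i] for i in idx]))
    si = sum(si_vals) / n_reps
    sp = sum(sp_vals) / n_reps
    ratio = si / sp if sp > 0 else float("nan")
    return si, sp, ratio


# Hypothetical toy species with three genomes (subsampled to all three).
toy_intact = [{"a", "b", "x"}, {"a", "b", "y"}, {"a", "b"}]
toy_pseudo = [{"p", "q"}, {"p", "r"}, {"p"}]
si, sp, ratio = si_sp(toy_intact, toy_pseudo, n_genomes=3, n_reps=10)
```

Subsampling every species to the same number of genomes keeps the singleton counts comparable across species with different numbers of sequenced genomes, which is why the caption reports results at three, nine and 20 genomes.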

Extended Data Fig. 8 Distributions of pangenome diversity and molecular evolution metrics stratified by taxonomic class.

Distributions of pangenome diversity and molecular evolution metrics stratified by taxonomic class (with classes with <= five species collapsed into ‘Other’). Each point is a separate species. Sample sizes per class: 51 (Actinomycetia); 62 (Alphaproteobacteria); 161 (Bacilli); 33 (Bacteroidia); 7 (Chlamydiia); 37 (Clostridia); 286 (Gammaproteobacteria); 31 (Other). The boxplot features are defined as follows: the centre line represents the median; the lower and upper hinges of the boxplots correspond to the 25th and 75th percentiles; the lower and upper boxplot whiskers extend to the lowest and highest points, respectively, to a limit of 1.5 multiplied by the interquartile range from the closest hinge.
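The boxplot features defined in this caption can be reproduced with a short sketch. It assumes linear interpolation between data points for the 25th and 75th percentiles (Python’s "inclusive" quantile method); the caption does not specify the interpolation rule, and the data and function name are hypothetical.

```python
import statistics


def boxplot_stats(values):
    """Median, hinges and whisker ends following the caption's definition."""
    # Quartiles with linear interpolation over the observed range (assumption).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr
    upper_limit = q3 + 1.5 * iqr
    # Whiskers end at the most extreme data points that lie within the limits.
    lower_whisker = min(v for v in values if v >= lower_limit)
    upper_whisker = max(v for v in values if v <= upper_limit)
    outliers = sorted(v for v in values if v < lower_limit or v > upper_limit)
    return {
        "median": median,
        "lower_hinge": q1,
        "upper_hinge": q3,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
        "outliers": outliers,
    }


# Hypothetical per-species values with one extreme point.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

Points beyond the whisker limits are reported separately rather than stretching the whiskers, so a single extreme species does not distort the summary of the rest of its class.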

Extended Data Fig. 9 Summaries of four pangenome diversity linear models.

One model was fit for each pangenome diversity metric: the mean number of genes, genomic fluidity, the percentage of singleton intact genes (si), and the ratio of the percentages of singleton intact genes vs. pseudogenes (si/sp). All continuous response and predictor variables were standardized (that is, converted to z-scores) prior to building models. Continuous variables were also transformed to normal distributions prior to this standardization (see Online Methods). Coefficients are displayed for each model, split by those that affect the intercept vs. the slope. The adjusted R² is also indicated for each model, and the cell colouring indicates whether each value is statistically significant (P < 0.05). The number of species per taxonomic class is indicated by the blue bar. The category used to infer the overall intercept was based on a combination of all classes with <= 5 species present. Note that Chlamydiia is the class, not the common genus Chlamydia. These models were built based on 667 species, after excluding one species with no singleton intact genes, and contained 657 degrees of freedom.
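The standardization step in this caption can be sketched as follows. The example assumes the sample standard deviation (n − 1 denominator) and does not reproduce the preceding transformation to normality, whose method is described only in the Online Methods; the input values and function name are hypothetical.

```python
import statistics


def z_scores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD with n - 1 denominator (assumption)
    return [(v - mean) / sd for v in values]


# Hypothetical values of one continuous predictor across three species.
standardized = z_scores([2.0, 4.0, 6.0])

# Arithmetic implied by the caption: 667 species and 657 residual degrees
# of freedom correspond to 10 estimated coefficients per model.
n_coefficients = 667 - 657
```

Standardizing both responses and predictors puts the coefficients of the four models on a common scale, so their magnitudes can be compared directly.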

Extended Data Table 1 Summary of 10 species used for in-depth pangenome analysis

Supplementary information

Reporting Summary

Peer Review File

Source Data Figs. 1–6, Extended Data Figs. 1–9 and Extended Data Table 1

Data for each display item are provided in separate sheets, as described in the ‘Descriptions’ sheet.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Douglas, G.M., Shapiro, B.J. Pseudogenes act as a neutral reference for detecting selection in prokaryotic pangenomes. Nat Ecol Evol 8, 304–314 (2024). https://doi.org/10.1038/s41559-023-02268-6

  • DOI: https://doi.org/10.1038/s41559-023-02268-6
