Major radiations of enigmatic Bacteria and Archaea with large inventories of uncharacterized proteins are a striking feature of the Tree of Life (refs 1–5). The processes that led to functional diversity in these lineages, which may contribute to a host-dependent lifestyle, are poorly understood. Here, we show that diversity-generating retroelements (DGRs), which guide site-specific protein hypervariability (refs 6–8), are prominent features of genomically reduced organisms from the bacterial candidate phyla radiation (CPR) and as-yet uncultivated phyla belonging to the DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaea) archaeal superphylum. From reconstructed genomes we have defined monophyletic bacterial and archaeal DGR lineages that expand the known DGR range by 120% and reveal a history of horizontal retroelement transfer. Retroelement-guided diversification is further shown to be active in current CPR and DPANN populations, with an assortment of protein targets potentially involved in attachment, defence and regulation. Based on observations of DGR abundance, function and evolutionary history, we find that targeted protein diversification is a pronounced trait of CPR and DPANN phyla compared to other bacterial and archaeal phyla. This diversification mechanism may provide CPR and DPANN organisms with a versatile tool that could be used for adaptation to a dynamic, host-dependent existence.
Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
Castelle, C. J. et al. Genomic expansion of domain Archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, 690–701 (2015).
Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).
Liu, M. et al. Reverse transcriptase-mediated tropism switching in Bordetella bacteriophage. Science 295, 2091–2094 (2002).
Doulatov, S. et al. Tropism switching in Bordetella bacteriophage defines a family of diversity-generating retroelements. Nature 431, 476–481 (2004).
Guo, H., Arambula, D., Ghosh, P. & Miller, J. F. Diversity-generating retroelements in phage and bacterial genomes. Microbiol. Spectr. http://dx.doi.org/10.1128/microbiolspec.MDNA3-0029-2014 (2014).
Comolli, L. R., Baker, B. J., Downing, K. H., Siegerist, C. E. & Banfield, J. F. Three-dimensional analysis of the structure and ecology of a novel, ultra-small archaeon. ISME J. 3, 159–167 (2009).
Baker, B. J. et al. Enigmatic, ultrasmall, uncultivated Archaea. Proc. Natl Acad. Sci. USA 107, 8806–8811 (2010).
Luef, B. et al. Diverse uncultivated ultra-small bacterial cells in groundwater. Nat. Commun. 6, 6372 (2015).
Gong, J., Qing, Y., Guo, X. & Warren, A. Candidatus Sonnebornia yantaiensis, a member of candidate division OD1, as intracellular bacteria of the ciliated protist Paramecium bursaria (Ciliophora, Oligohymenophorea). Syst. Appl. Microbiol. 37, 35–41 (2014).
Nelson, W. C. & Stegen, J. C. The reduced genomes of Parcubacteria (OD1) contain signatures of a symbiotic lifestyle. Front. Microbiol. 6, 713 (2015).
Valentine, D. L. Adaptations to energy stress dictate the ecology and evolution of the Archaea. Nat. Rev. Microbiol. 5, 316–323 (2007).
Paul, B. G. et al. Targeted diversity generation by intraterrestrial Archaea and archaeal viruses. Nat. Commun. 6, 6585 (2015).
Le Coq, J. & Ghosh, P. Conservation of the C-type lectin fold for massive sequence variation in a Treponema diversity-generating retroelement. Proc. Natl Acad. Sci. USA 108, 14649–14653 (2011).
Arambula, D. et al. Surface display of a massively variable lipoprotein by a Legionella diversity-generating retroelement. Proc. Natl Acad. Sci. USA 110, 8212–8217 (2013).
Pfreundt, U., Kopf, M., Belkin, N., Berman-Frank, I. & Hess, W. R. The primary transcriptome of the marine diazotroph Trichodesmium erythraeum IMS101. Sci. Rep. 4, 6187 (2014).
Miller, J. L. et al. Selective ligand recognition by a diversity-generating retroelement variable protein. PLoS Biol. 6, e131 (2008).
Handa, S., Paul, B. G., Valentine, D. L., Miller, J. F. & Ghosh, P. Conservation of the C-type lectin fold for accommodating massive sequence variation in archaeal diversity-generating retroelements. BMC Struct. Biol. 16, 13 (2016).
Nimkulrat, S. et al. Genomic and metagenomic analysis of diversity-generating retroelements associated with Treponema denticola. Front. Microbiol. 7, 852 (2016).
Guo, H. et al. Diversity-generating retroelement homing regenerates target sequences for repeated rounds of codon rewriting and protein diversification. Mol. Cell 31, 813–823 (2008).
Sharon, I. et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).
Guo, H. et al. Target site recognition by a diversity-generating retroelement. PLoS Genet. 7, e1002414 (2011).
Minot, S., Grunberg, S., Wu, G. D., Lewis, J. D. & Bushman, F. D. Hypervariable loci in the human gut virome. Proc. Natl Acad. Sci. USA 109, 3962–3966 (2012).
Schillinger, T., Lisfi, M., Chi, J., Cullum, J. & Zingler, N. Analysis of a comprehensive dataset of diversity generating retroelements generated by the program DiGReF. BMC Genomics 13, 430 (2012).
Ye, Y. Identification of diversity-generating retroelements in human microbiomes. Int. J. Mol. Sci. 15, 14234–14246 (2014).
Zimmerly, S. & Wu, L. An unexplored diversity of reverse transcriptases in bacteria. Microbiol. Spectr. https://dx.doi.org/10.1128/microbiolspec.MDNA3-0058-2014 (2015).
Xu, Q. et al. A distinct type of pilus from the human microbiome. Cell 165, 690–703 (2016).
Anantharaman, V., Koonin, E. V. & Aravind, L. Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J. Mol. Biol. 307, 1271–1292 (2001).
Peng, Y., Leung, H. C., Yiu, S. M. & Chin, F. Y. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).
Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011).
Kelley, L. A. & Sternberg, M. J. Protein structure prediction on the Web: a case study using the Phyre server. Nat. Protoc. 4, 363–371 (2009).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Ultsch, A. & Moerchen, F. ESOM-maps: Tools for Clustering, Visualization, and Classification with Emergent SOM, Technology Report, Department of Mathematics and Computer Science No. 46 (University of Marburg, 2005).
Dick, G. J. et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, 1–16 (2009).
Raes, J., Korbel, J. O., Lercher, M. J., von Mering, C. & Bork, P. Prediction of effective genome size in metagenomic samples. Genome Biol. 8, R10 (2007).
Zuker, M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31, 3406–3415 (2003).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Burge, S. W. et al. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226–D232 (2013).
Simon, D. M. & Zimmerly, S. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 36, 7219–7229 (2008).
Eddy, S. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
Thompson, J. D., Gibson, T. & Higgins, D. G. Multiple sequence alignment using ClustalW and ClustalX. Curr. Protoc. Bioinformatics http://dx.doi.org/10.1002/0471250953.bi0203s00 (2002).
Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).
This research was funded by National Science Foundation grant no. OCE-1046144 to D.L.V., National Institutes of Health grant no. R01 AI096838 to J.F.M. and P.G., and by the US Department of Energy (DOE), Office of Science, Office of Biological and Environmental Research under award no. DE-AC02-05CH11231 (Sustainable Systems Scientific Focus Area; Lawrence Berkeley National Laboratory operated by the University of California) and award no. DE-SC0004918 (Systems Biology Knowledge Base Focus Area). Sequencing was performed at the US DOE Joint Genome Institute, a DOE Office of Science User Facility, supported under contract no. DE-AC02-05CH11231. Metatranscriptomes were sequenced at the DOE-supported Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory. B.G.P. was supported by a postdoctoral fellowship from the Center for Dark Energy Biosphere Investigations (C-DEBI). D.B. was supported by a long-term EMBO fellowship. The authors thank K. Anantharaman for assistance with genome binning, A. Singh and C.T. Brown, who aided in examining CPR and DPANN genomes, and C. Magnabosco for offering insights on phylogenetic reconstruction. This is C-DEBI contribution no. 361.
J.F.M. is a cofounder, equity holder and chair of the scientific advisory board of AvidBiotics Inc., a biotherapeutics company in San Francisco. No other authors declare competing financial interests.
Supplementary Figures 1–9 (PDF 1731 kb)
Supplementary Table 1: DGRs that appear to be active based on readmapping and a stemloop-like sequence. Ns substitutions linked to TR adenines were inferred from VR-read-mapping and putative DGR stemloops were predicted using the Mfold DNA folding server (see Methods). The number of stemloops is shown incrementally for the same DGR. (3'-) distance from VR to the beginning of the stemloop is given in nucleotides. (XLSX 211 kb)
Supplementary Table 2: Metatranscriptomic readmapping analysis for DGRs that recruited at least ten perfect-matching transcripts. Relative proportions are given for transcripts mapping to DGRs versus the whole contig, and separately for transcripts mapping to TR versus the sum for all other DGR features.
Supplementary Table 3: Annotation details for DUF1566 (PF07603) containing DGR variable proteins. Variable protein length is given in amino acids. Transmembrane (TM) predictions are shown as “yes”, “no”, or “signal peptide”. The best hit from HMMER is listed with its corresponding e-value. Phyre2 values are given as per cent confidence (conf) and per cent coverage of the variable protein (covg).
Supplementary Table 4: Taxonomic affiliations of AAA_5 ATPase (PF07728) domain-containing DGR variable proteins. Rows are coloured by domain. Best hits were retrieved using pHMMER searches against the Uniprot database.
Supplementary Table 5: DGR-containing scaffolds and feature coordinates, including RT, VP (up to three), VR (up to three), and TR. Genome bin affiliations are given for each scaffold.
Supplementary Table 6: DGR-containing scaffolds and feature coordinates for scaffolds with more than one DGR cassette (up to three distinct DGRs for a single scaffold).
Supplementary Table 7: DGR-containing scaffolds and feature annotations for DGRs with split/interrupted RT open reading frames.
Supplementary Table 8: Variable proteins with homology to known Pfam families or UniProtKB database representatives. NA (not applicable) indicates that no significant hit was returned from the database.
Supplementary Table 9: Index of reverse transcriptase (RT) tree labels. Representatives listed under Database as “Genbank” have tree labels that are NCBI accession numbers.
Supplementary Table 10: DGR-containing scaffolds and corresponding Genbank accession codes.
All DGR-containing sequences that are described in this study, which were derived from draft genomes. (TXT 44046 kb)
Reverse transcriptase protein sequences for all DGRs from draft genomes. (TXT 259 kb)
All DGR targeted variable protein sequences. (TXT 255 kb)
Reverse transcriptase tree that corresponds to Fig. 2. (TXT 20 kb)
The reverse transcriptase multiple sequence alignment used to construct the phylogenetic tree in Fig. 2. (TXT 157 kb)
DGR-containing sequences as assembled metagenomic fragments, which are not contained in a draft genome. (TXT 36182 kb)
Paul, B., Burstein, D., Castelle, C. et al. Retroelement-guided protein diversification abounds in vast lineages of Bacteria and Archaea. Nat Microbiol 2, 17045 (2017). https://doi.org/10.1038/nmicrobiol.2017.45