
Retroelement-guided protein diversification abounds in vast lineages of Bacteria and Archaea


Major radiations of enigmatic Bacteria and Archaea with large inventories of uncharacterized proteins are a striking feature of the Tree of Life (refs 1–5). The processes that led to functional diversity in these lineages, which may contribute to a host-dependent lifestyle, are poorly understood. Here, we show that diversity-generating retroelements (DGRs), which guide site-specific protein hypervariability (refs 6–8), are prominent features of genomically reduced organisms from the bacterial candidate phyla radiation (CPR) and as yet uncultivated phyla belonging to the DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaea) archaeal superphylum. From reconstructed genomes we have defined monophyletic bacterial and archaeal DGR lineages that expand the known DGR range by 120% and reveal a history of horizontal retroelement transfer. Retroelement-guided diversification is further shown to be active in current CPR and DPANN populations, with an assortment of protein targets potentially involved in attachment, defence and regulation. Based on observations of DGR abundance, function and evolutionary history, we find that targeted protein diversification is a pronounced trait of CPR and DPANN phyla compared to other bacterial and archaeal phyla. This diversification mechanism may provide CPR and DPANN organisms with a versatile tool that could be used for adaptation to a dynamic, host-dependent existence.


Figure 1: Prevalence of DGRs identified in groundwater metagenomes.
Figure 2: Phylogeny of DGRs and radiation of novel lineages.
Figure 3: Putative functional classes of DGR variable proteins.


  1. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).

  2. Castelle, C. J. et al. Genomic expansion of domain Archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, 690–701 (2015).

  3. Brown, C. T. et al. Unusual biology across a group comprising more than 15% of domain bacteria. Nature 523, 208–211 (2015).

  4. Hug, L. A. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).

  5. Anantharaman, K. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).

  6. Liu, M. et al. Reverse transcriptase-mediated tropism switching in Bordetella bacteriophage. Science 295, 2091–2094 (2002).

  7. Doulatov, S. et al. Tropism switching in Bordetella bacteriophage defines a family of diversity-generating retroelements. Nature 431, 476–481 (2004).

  8. Guo, H., Arambula, D., Ghosh, P. & Miller, J. F. Diversity-generating retroelements in phage and bacterial genomes. Microbiol. Spectr. (2014).

  9. Comolli, L. R., Baker, B. J., Downing, K. H., Siegerist, C. E. & Banfield, J. F. Three-dimensional analysis of the structure and ecology of a novel, ultra-small archaeon. ISME J. 3, 159–167 (2009).

  10. Baker, B. J. et al. Enigmatic, ultrasmall, uncultivated Archaea. Proc. Natl Acad. Sci. USA 107, 8806–8811 (2010).

  11. Luef, B. et al. Diverse uncultivated ultra-small bacterial cells in groundwater. Nat. Commun. 6, 6372 (2015).

  12. Gong, J., Qing, Y., Guo, X. & Warren, A. Candidatus sonnebornia yantaiensis, a member of candidate division OD1, as intracellular bacteria of the ciliated protist Paramecium bursaria (Ciliophora, oligohymenophorea). Syst. Appl. Microbiol. 37, 35–41 (2014).

  13. Nelson, W. C. & Stegen, J. C. The reduced genomes of parcubacteria (OD1) contain signatures of a symbiotic lifestyle. Front. Microbiol. 6, 713 (2015).

  14. Valentine, D. L. Adaptations to energy stress dictate the ecology and evolution of the Archaea. Nat. Rev. Microbiol. 5, 316–323 (2007).

  15. Paul, B. G. et al. Targeted diversity generation by intraterrestrial Archaea and archaeal viruses. Nat. Commun. 6, 6585 (2015).

  16. Le Coq, J. & Ghosh, P. Conservation of the C-type lectin fold for massive sequence variation in a treponema diversity-generating retroelement. Proc. Natl Acad. Sci. USA 108, 14649–14653 (2011).

  17. Arambula, D. et al. Surface display of a massively variable lipoprotein by a Legionella diversity-generating retroelement. Proc. Natl Acad. Sci. USA 110, 8212–8217 (2013).

  18. Pfreundt, U., Kopf, M., Belkin, N., Berman-Frank, I. & Hess, W. R. The primary transcriptome of the marine diazotroph Trichodesmium erythraeum IMS101. Sci. Rep. 4, 6187 (2014).

  19. Miller, J. L. et al. Selective ligand recognition by a diversity-generating retroelement variable protein. PLoS Biol. 6, e131 (2008).

  20. Handa, S., Paul, B. G., Valentine, D. L., Miller, J. F. & Ghosh, P. Conservation of the C-type lectin fold for accommodating massive sequence variation in archaeal diversity-generating retroelements. BMC Struct. Biol. 16, 13 (2016).

  21. Nimkulrat, S. et al. Genomic and metagenomic analysis of diversity-generating retroelements associated with Treponema denticola. Front. Microbiol. 7, 852 (2016).

  22. Guo, H. et al. Diversity-generating retroelement homing regenerates target sequences for repeated rounds of codon rewriting and protein diversification. Mol. Cell 31, 813–823 (2008).

  23. Sharon, I. et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).

  24. Guo, H. et al. Target site recognition by a diversity-generating retroelement. PLoS Genet. 7, e1002414 (2011).

  25. Minot, S., Grunberg, S., Wu, G. D., Lewis, J. D. & Bushman, F. D. Hypervariable loci in the human gut virome. Proc. Natl Acad. Sci. USA 109, 3962–3966 (2012).

  26. Schillinger, T., Lisfi, M., Chi, J., Cullum, J. & Zingler, N. Analysis of a comprehensive dataset of diversity generating retroelements generated by the program DiGReF. BMC Genomics 13, 430 (2012).

  27. Ye, Y. Identification of diversity-generating retroelements in human microbiomes. Int. J. Mol. Sci. 15, 14234–14246 (2014).

  28. Zimmerly, S. & Wu, L. An unexplored diversity of reverse transcriptases in bacteria. Microbiol. Spectr. (2015).

  29. Xu, Q. et al. A distinct type of pilus from the human microbiome. Cell 165, 690–703 (2016).

  30. Anantharaman, V., Koonin, E. V. & Aravind, L. Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J. Mol. Biol. 307, 1271–1292 (2001).

  31. Peng, Y., Leung, H. C., Yiu, S. M. & Chin, F. Y. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).

  32. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).

  33. Finn, R. D., Clements, J. & Eddy, S. R. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011).

  34. Kelley, L. A. & Sternberg, M. J. Protein structure prediction on the Web: a case study using the Phyre server. Nat. Protoc. 4, 363–371 (2009).

  35. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  36. Ultsch, A. & Moerchen, F. ESOM-maps: Tools for Clustering, Visualization, and Classification with Emergent SOM, Technology Report, Department of Mathematics and Computer Science No. 46 (University of Marburg, 2005).

  37. Dick, G. J. et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, 1–16 (2009).

  38. Raes, J., Korbel, J. O., Lercher, M. J., von Mering, C. & Bork, P. Prediction of effective genome size in metagenomic samples. Genome Biol. 8, R10 (2007).

  39. Zuker, M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31, 3406–3415 (2003).

  40. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

  41. Burge, S. W. et al. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226–D232 (2013).

  42. Simon, D. M. & Zimmerly, S. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 36, 7219–7229 (2008).

  43. Eddy, S. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).

  44. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).

  45. Thompson, J. D., Gibson, T. & Higgins, D. G. Multiple sequence alignment using ClustalW and ClustalX. Curr. Protoc. Bioinformatics (2002).

  46. Guindon, S. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).



This research was funded by National Science Foundation grant no. OCE-1046144 to D.L.V., National Institutes of Health grant no. R01 AI096838 to J.F.M. and P.G., and by the US Department of Energy (DOE), Office of Science, Office of Biological and Environmental Research under award no. DE-AC02-05CH11231 (Sustainable Systems Scientific Focus Area; Lawrence Berkeley National Laboratory operated by the University of California) and award no. DE-SC0004918 (Systems Biology Knowledge Base Focus Area). Sequencing was performed at the US DOE Joint Genome Institute, a DOE Office of Science User Facility, supported under contract no. DE-AC02-05CH11231. Metatranscriptomes were sequenced at the DOE-supported Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory. B.G.P. was supported by a postdoctoral fellowship from the Center for Dark Energy Biosphere Investigations (C-DEBI). D.B. was supported by a long-term EMBO fellowship. The authors thank K. Anantharaman for assistance with genome binning, A. Singh and C.T. Brown, who aided in examining CPR and DPANN genomes, and C. Magnabosco for offering insights on phylogenetic reconstruction. This is C-DEBI contribution no. 361.

Author information




B.G.P. and D.L.V. developed the project. B.G.P., D.B., C.J.C., B.C.T. and J.F.B. performed reassembly, read mapping and annotation of the metagenomic and metatranscriptomic data sets. B.G.P., D.B., C.J.C., E.C., D.A., S.H., P.G., J.F.M., J.F.B. and D.L.V. conducted bioinformatic analyses on DGR sequences. B.G.P., D.B., C.J.C., J.F.B. and D.L.V. wrote the manuscript.

Corresponding author

Correspondence to David L. Valentine.

Ethics declarations

Competing interests

J.F.M. is a cofounder, equity holder and chair of the scientific advisory board of AvidBiotics Inc., a biotherapeutics company in San Francisco. No other authors declare competing financial interests.

Supplementary information

Supplementary Information

Supplementary Figures 1–9 (PDF 1731 kb)

Supplementary Tables 1–10

Supplementary Table 1: DGRs that appear to be active based on read mapping and a stem-loop-like sequence. Ns substitutions linked to TR adenines were inferred from VR read mapping, and putative DGR stem-loops were predicted using the Mfold DNA folding server (see Methods). The number of stem-loops is shown incrementally for the same DGR. The 3' distance from VR to the beginning of the stem-loop is given in nucleotides. (XLSX 211 kb)

Supplementary Table 2: Metatranscriptomic read-mapping analysis for DGRs that recruited at least ten perfect-matching transcripts. Relative proportions are given for transcripts mapping to DGRs versus the whole contig, and separately for transcripts mapping to TR versus the sum for all other DGR features. 

Supplementary Table 3: Annotation details for DUF1566 (PF07603) containing DGR variable proteins. Variable protein length is given in amino acids. Transmembrane (TM) predictions are shown as “yes”, “no”, or “signal peptide”. The best hit from HMMER is listed with its corresponding e-value. Phyre2 values are given as per cent confidence (conf) and per cent coverage of the variable protein (covg). 

Supplementary Table 4: Taxonomic affiliations of AAA_5 ATPase (PF07728) domain-containing DGR variable proteins. Rows are coloured by domain. Best hits were retrieved using pHMMER searches against the Uniprot database. 

Supplementary Table 5: DGR-containing scaffolds and feature coordinates, including RT, VP (up to three), VR (up to three), and TR. Genome bin affiliations are given for each scaffold. 

Supplementary Table 6: DGR-containing scaffolds and feature coordinates for scaffolds with more than one DGR cassette (up to three distinct DGRs for a single scaffold). 

Supplementary Table 7: DGR-containing scaffolds and feature annotations for DGRs with split/interrupted RT open reading frames. 

Supplementary Table 8: Variable proteins with homology to known pfams or database UniProtKB representatives. NA, or not applicable, indicates that no significant hit was returned from the database. 

Supplementary Table 9: Index of reverse transcriptase (RT) tree labels. Representatives listed under Database as “Genbank”, have tree labels that are NCBI accession numbers. 

Supplementary Table 10: DGR-containing scaffolds and corresponding Genbank accession codes. 

Supplementary Data 1

All DGR-containing sequences that are described in this study, which were derived from draft genomes. (TXT 44046 kb)

Supplementary Data 2

Reverse transcriptase protein sequences for all DGRs from draft genomes. (TXT 259 kb)

Supplementary Data 3

All DGR targeted variable protein sequences. (TXT 255 kb)

Supplementary Data 4

Reverse transcriptase tree that corresponds to Fig. 2. (TXT 20 kb)

Supplementary Data 5

The reverse transcriptase multiple sequence alignment used to construct the phylogenetic tree in Fig. 2. (TXT 157 kb)

Supplementary Data 6

DGR-containing sequences as assembled metagenomic fragments, which are not contained in a draft genome. (TXT 36182 kb)


About this article


Cite this article

Paul, B., Burstein, D., Castelle, C. et al. Retroelement-guided protein diversification abounds in vast lineages of Bacteria and Archaea. Nat Microbiol 2, 17045 (2017).
