Retroelement-guided protein diversification abounds in vast lineages of Bacteria and Archaea

  • Nature Microbiology 2, Article number: 17045 (2017)
  • doi:10.1038/nmicrobiol.2017.45


Major radiations of enigmatic Bacteria and Archaea with large inventories of uncharacterized proteins are a striking feature of the Tree of Life [1–5]. The processes that led to functional diversity in these lineages, which may contribute to a host-dependent lifestyle, are poorly understood. Here, we show that diversity-generating retroelements (DGRs), which guide site-specific protein hypervariability [6–8], are prominent features of genomically reduced organisms from the bacterial candidate phyla radiation (CPR) and as yet uncultivated phyla belonging to the DPANN (Diapherotrites, Parvarchaeota, Aenigmarchaeota, Nanoarchaeota and Nanohaloarchaea) archaeal superphylum. From reconstructed genomes we have defined monophyletic bacterial and archaeal DGR lineages that expand the known DGR range by 120% and reveal a history of horizontal retroelement transfer. Retroelement-guided diversification is further shown to be active in current CPR and DPANN populations, with an assortment of protein targets potentially involved in attachment, defence and regulation. Based on observations of DGR abundance, function and evolutionary history, we find that targeted protein diversification is a pronounced trait of CPR and DPANN phyla compared to other bacterial and archaeal phyla. This diversification mechanism may provide CPR and DPANN organisms with a versatile tool that could be used for adaptation to a dynamic, host-dependent existence.


  1. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
  2. et al. Genomic expansion of domain Archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, 690–701 (2015).
  3. et al. Unusual biology across a group comprising more than 15% of domain Bacteria. Nature 523, 208–211 (2015).
  4. et al. A new view of the tree of life. Nat. Microbiol. 1, 16048 (2016).
  5. et al. Thousands of microbial genomes shed light on interconnected biogeochemical processes in an aquifer system. Nat. Commun. 7, 13219 (2016).
  6. et al. Reverse transcriptase-mediated tropism switching in Bordetella bacteriophage. Science 295, 2091–2094 (2002).
  7. et al. Tropism switching in Bordetella bacteriophage defines a family of diversity-generating retroelements. Nature 431, 476–481 (2004).
  8. Diversity-generating retroelements in phage and bacterial genomes. Microbiol. Spectr. (2014).
  9. Three-dimensional analysis of the structure and ecology of a novel, ultra-small archaeon. ISME J. 3, 159–167 (2009).
  10. et al. Enigmatic, ultrasmall, uncultivated Archaea. Proc. Natl Acad. Sci. USA 107, 8806–8811 (2010).
  11. et al. Diverse uncultivated ultra-small bacterial cells in groundwater. Nat. Commun. 6, 6372 (2015).
  12. Candidatus Sonnebornia yantaiensis, a member of candidate division OD1, as intracellular bacteria of the ciliated protist Paramecium bursaria (Ciliophora, Oligohymenophorea). Syst. Appl. Microbiol. 37, 35–41 (2014).
  13. The reduced genomes of Parcubacteria (OD1) contain signatures of a symbiotic lifestyle. Front. Microbiol. 6, 713 (2015).
  14. Adaptations to energy stress dictate the ecology and evolution of the Archaea. Nat. Rev. Microbiol. 5, 316–323 (2007).
  15. et al. Targeted diversity generation by intraterrestrial Archaea and archaeal viruses. Nat. Commun. 6, 6585 (2015).
  16. Conservation of the C-type lectin fold for massive sequence variation in a Treponema diversity-generating retroelement. Proc. Natl Acad. Sci. USA 108, 14649–14653 (2011).
  17. et al. Surface display of a massively variable lipoprotein by a Legionella diversity-generating retroelement. Proc. Natl Acad. Sci. USA 110, 8212–8217 (2013).
  18. The primary transcriptome of the marine diazotroph Trichodesmium erythraeum IMS101. Sci. Rep. 4, 6187 (2014).
  19. et al. Selective ligand recognition by a diversity-generating retroelement variable protein. PLoS Biol. 6, e131 (2008).
  20. Conservation of the C-type lectin fold for accommodating massive sequence variation in archaeal diversity-generating retroelements. BMC Struct. Biol. 16, 13 (2016).
  21. et al. Genomic and metagenomic analysis of diversity-generating retroelements associated with Treponema denticola. Front. Microbiol. 7, 852 (2016).
  22. et al. Diversity-generating retroelement homing regenerates target sequences for repeated rounds of codon rewriting and protein diversification. Mol. Cell 31, 813–823 (2008).
  23. et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).
  24. et al. Target site recognition by a diversity-generating retroelement. PLoS Genet. 7, e1002414 (2011).
  25. Hypervariable loci in the human gut virome. Proc. Natl Acad. Sci. USA 109, 3962–3966 (2012).
  26. Analysis of a comprehensive dataset of diversity generating retroelements generated by the program DiGReF. BMC Genomics 13, 430 (2012).
  27. Identification of diversity-generating retroelements in human microbiomes. Int. J. Mol. Sci. 15, 14234–14246 (2014).
  28. An unexplored diversity of reverse transcriptases in bacteria. Microbiol. Spectr. (2015).
  29. et al. A distinct type of pilus from the human microbiome. Cell 165, 690–703 (2016).
  30. Regulatory potential, phyletic distribution and evolution of ancient, intracellular small-molecule-binding domains. J. Mol. Biol. 307, 1271–1292 (2001).
  31. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).
  32. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
  33. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 39, W29–W37 (2011).
  34. Protein structure prediction on the Web: a case study using the Phyre server. Nat. Protoc. 4, 363–371 (2009).
  35. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
  36. ESOM-maps: Tools for Clustering, Visualization, and Classification with Emergent SOM, Technology Report, Department of Mathematics and Computer Science No. 46 (University of Marburg, 2005).
  37. et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, 1–16 (2009).
  38. Prediction of effective genome size in metagenomic samples. Genome Biol. 8, R10 (2007).
  39. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 31, 3406–3415 (2003).
  40. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
  41. et al. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226–D232 (2013).
  42. A diversity of uncharacterized reverse transcriptases in bacteria. Nucleic Acids Res. 36, 7219–7229 (2008).
  43. Profile hidden Markov models. Bioinformatics 14, 755–763 (1998).
  44. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol. Biol. Evol. 26, 1641–1650 (2009).
  45. Multiple sequence alignment using ClustalW and ClustalX. Curr. Protoc. Bioinformatics (2002).
  46. et al. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst. Biol. 59, 307–321 (2010).


This research was funded by National Science Foundation grant no. OCE-1046144 to D.L.V., National Institutes of Health grant no. R01 AI096838 to J.F.M. and P.G., and by the US Department of Energy (DOE), Office of Science, Office of Biological and Environmental Research under award no. DE-AC02-05CH11231 (Sustainable Systems Scientific Focus Area; Lawrence Berkeley National Laboratory operated by the University of California) and award no. DE-SC0004918 (Systems Biology Knowledge Base Focus Area). Sequencing was performed at the US DOE Joint Genome Institute, a DOE Office of Science User Facility, supported under contract no. DE-AC02-05CH11231. Metatranscriptomes were sequenced at the DOE-supported Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory. B.G.P. was supported by a postdoctoral fellowship from the Center for Dark Energy Biosphere Investigations (C-DEBI). D.B. was supported by a long-term EMBO fellowship. The authors thank K. Anantharaman for assistance with genome binning, A. Singh and C.T. Brown for help in examining CPR and DPANN genomes, and C. Magnabosco for offering insights on phylogenetic reconstruction. This is C-DEBI contribution no. 361.

Author information


  1. Marine Science Institute, University of California, Santa Barbara, California 93106, USA

    • Blair G. Paul & David L. Valentine
  2. Department of Earth and Planetary Science, University of California, Berkeley, California 94720, USA

    • David Burstein, Cindy J. Castelle, Brian C. Thomas & Jillian F. Banfield
  3. Department of Chemistry and Biochemistry, UC San Diego, La Jolla, California 92093, USA

    • Sumit Handa & Partho Ghosh
  4. Department of Microbiology, Immunology and Molecular Genetics, University of California, Los Angeles, California 90095, USA

    • Diego Arambula, Elizabeth Czornyj & Jeff F. Miller
  5. Molecular Biology Institute, University of California, Los Angeles, California 90095, USA

    • Jeff F. Miller
  6. California NanoSystems Institute, University of California, Los Angeles, California 90095, USA

    • Jeff F. Miller
  7. Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA

    • Jillian F. Banfield
  8. Department of Environmental Science, Policy, and Management, University of California, Berkeley, California 94720, USA

    • Jillian F. Banfield
  9. Department of Earth Science, UC Santa Barbara, Santa Barbara, California 93106, USA

    • David L. Valentine



B.G.P. and D.L.V. developed the project. B.G.P., D.B., C.J.C., B.C.T. and J.F.B. performed reassembly, read mapping and annotation of the metagenomic and metatranscriptomic data sets. B.G.P., D.B., C.J.C., E.C., D.A., S.H., P.G., J.F.M., J.F.B. and D.L.V. conducted bioinformatic analyses on DGR sequences. B.G.P., D.B., C.J.C., J.F.B. and D.L.V. wrote the manuscript.

Competing interests

J.F.M. is a cofounder, equity holder and chair of the scientific advisory board of AvidBiotics Inc., a biotherapeutics company in San Francisco. No other authors declare competing financial interests.

Corresponding author

Correspondence to David L. Valentine.

Supplementary information

PDF files

  1. Supplementary Information

    Supplementary Figures 1–9

Excel files

  1. Supplementary Tables 1–10

    • Supplementary Table 1: DGRs that appear to be active based on readmapping and a stemloop-like sequence. Ns substitutions linked to TR adenines were inferred from VR-read-mapping and putative DGR stemloops were predicted using the Mfold DNA folding server (see Methods). The number of stemloops is shown incrementally for the same DGR. (3'-) distance from VR to the beginning of the stemloop is given in nucleotides.
    • Supplementary Table 2: Metatranscriptomic readmapping analysis for DGRs that recruited at least ten perfect-matching transcripts. Relative proportions are given for transcripts mapping to DGRs versus the whole contig, and separately for transcripts mapping to TR versus the sum for all other DGR features.
    • Supplementary Table 3: Annotation details for DUF1566 (PF07603) containing DGR variable proteins. Variable protein length is given in amino acids. Transmembrane (TM) predictions are shown as “yes”, “no”, or “signal peptide”. The best hit from HMMER is listed with its corresponding e-value. Phyre2 values are given as per cent confidence (conf) and per cent coverage of the variable protein (covg).
    • Supplementary Table 4: Taxonomic affiliations of AAA_5 ATPase (PF07728) domain-containing DGR variable proteins. Rows are coloured by domain. Best hits were retrieved using pHMMER searches against the Uniprot database.
    • Supplementary Table 5: DGR-containing scaffolds and feature coordinates, including RT, VP (up to three), VR (up to three), and TR. Genome bin affiliations are given for each scaffold.
    • Supplementary Table 6: DGR-containing scaffolds and feature coordinates for scaffolds with more than one DGR cassette (up to three distinct DGRs for a single scaffold).
    • Supplementary Table 7: DGR-containing scaffolds and feature annotations for DGRs with split/interrupted RT open reading frames.
    • Supplementary Table 8: Variable proteins with homology to known pfams or database UniProtKB representatives. NA, or not applicable, indicates that no significant hit was returned from the database.
    • Supplementary Table 9: Index of reverse transcriptase (RT) tree labels. Representatives listed under Database as “Genbank”, have tree labels that are NCBI accession numbers.
    • Supplementary Table 10: DGR-containing scaffolds and corresponding Genbank accession codes.
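
The description of Supplementary Table 1 refers to substitutions linked to TR adenines. As a rough illustration only, the sketch below counts mismatches between a template repeat (TR) and a variable repeat (VR) and reports how many fall at TR adenine positions. It assumes the two sequences are already aligned and of equal length; the function name and the toy sequences are hypothetical and are not taken from the study, which inferred substitutions from read mapping rather than from a pairwise comparison like this one.

```python
def adenine_mismatch_summary(tr: str, vr: str) -> dict:
    """Summarize TR/VR mismatches for a pre-aligned pair of equal length.

    Assumption: no gaps; position i in TR corresponds to position i in VR.
    """
    if len(tr) != len(vr):
        raise ValueError("TR and VR must be pre-aligned to the same length")
    tr, vr = tr.upper(), vr.upper()

    # Positions where the variable repeat differs from the template repeat.
    mismatches = [i for i, (t, v) in enumerate(zip(tr, vr)) if t != v]
    # Subset of those positions where the template base is an adenine.
    at_adenine = [i for i in mismatches if tr[i] == "A"]

    fraction = len(at_adenine) / len(mismatches) if mismatches else 0.0
    return {
        "mismatches": len(mismatches),
        "at_tr_adenine": len(at_adenine),
        "fraction_at_adenine": fraction,
    }


# Hypothetical toy sequences for illustration only.
toy_tr = "GCAACTGGAACGTTAAC"
toy_vr = "GCTGCCGGCACGTTGTC"
summary = adenine_mismatch_summary(toy_tr, toy_vr)
print(summary)
```

A fraction close to 1 in such a comparison would be consistent with adenine-linked variation; a real analysis would also need to account for sequencing error and alignment gaps.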

Text files

  1. Supplementary Data 1

    All DGR-containing sequences that are described in this study, which were derived from draft genomes.

  2. Supplementary Data 2

    Reverse transcriptase protein sequences for all DGRs from draft genomes.

  3. Supplementary Data 3

    All DGR targeted variable protein sequences.

  4. Supplementary Data 4

    Reverse transcriptase tree that corresponds to Fig. 2.

  5. Supplementary Data 5

    The reverse transcriptase multiple sequence alignment used to construct the phylogenetic tree in Fig. 2.

  6. Supplementary Data 6

    DGR-containing sequences as assembled metagenomic fragments, which are not contained in a draft genome.