Unusual biology across a group comprising more than 15% of domain Bacteria

Journal name: Nature
Volume: 523
Pages: 208–211
DOI: 10.1038/nature14486

A prominent feature of the bacterial domain is a radiation of major lineages that are defined as candidate phyla because they lack isolated representatives. Bacteria from these phyla occur in diverse environments (ref. 1) and are thought to mediate carbon and hydrogen cycles (ref. 2). Genomic analyses of a few representatives suggested that metabolic limitations have prevented their cultivation (refs 2–6). Here we reconstructed 8 complete and 789 draft genomes from bacteria representing >35 phyla and documented features that consistently distinguish these organisms from other bacteria. We infer that this group, which may comprise >15% of the bacterial domain, has shared evolutionary history, and describe it as the candidate phyla radiation (CPR). All CPR genomes are small and most lack numerous biosynthetic pathways. Owing to divergent 16S ribosomal RNA (rRNA) gene sequences, 50–100% of organisms sampled from specific phyla would evade detection in typical cultivation-independent surveys. CPR organisms often have self-splicing introns and proteins encoded within their rRNA genes, a feature rarely reported in bacteria. Furthermore, they have unusual ribosome compositions. All are missing a ribosomal protein often absent in symbionts, and specific lineages are missing ribosomal proteins and biogenesis factors considered universal in bacteria. This implies different ribosome structures and biogenesis mechanisms, and underlines unusual biology across a large part of the bacterial domain.

At a glance

Figures

  1. Figure 1: Phylogeny and genomic sampling of the CPR.

    a, b, Subsets of a maximum-likelihood 16S rRNA gene phylogeny (Supplementary Fig. 1) showing the CPR, a monophyletic radiation of candidate phyla (a), and genomic sampling of candidate phyla (b). Proposed names for phyla within the superphyla Parcubacteria and Microgenomates are explained in Extended Data Table 1. Many CPR 16S rRNA genes encode insertions (length shown by blue bars, combined length for multiple insertions).

  2. Figure 2: Features of insertions encoded within CPR 16S rRNA genes.

    Insertions identified in assembled, unique bacterial 16S rRNA genes occur in conserved and variable (V; red bars) regions (Supplementary Table 5). Histograms show the frequency of insertions. Insertions are of several types distinguishable by catalytic RNA introns and/or ORFs. IVP, intervening sequence protein.

  3. Figure 3: Intron-encoding 16S rRNA gene from complete Microgenomates genome.

    a, Stringent mapping of paired-read metagenome sequences confirms the assembly. b, 16S rRNA encoding regions, but not insertions, are covered by perfectly matched metatranscriptome sequences. The absence of RNA sequences for insertions indicates that they are introns. Shown are regions corresponding to Escherichia coli K12 gene positions, RNA catalytic introns, ORFs and insertions. c, Structural models of encoded proteins (1, 2 and 4: coloured by the colours of the rainbow from the amino to the carboxy terminus) and predicted structure for a catalytic RNA intron (3: coloured by base-pairing probability; red is high, green is moderate, and blue is low). Protein Data Bank structures were used as templates for structural modelling (1: accession 1R7M; 2: 1B24; 4: 1B24).

  4. Extended Data Fig. 1: Sampling and geochemical measurements from acetate amendment field experiment conducted in aquifer well CD-01 at the Rifle IFRC site.

    a, b, Samples were collected for metagenomics and metatranscriptomics at six time points (A–F) spanning several redox transitions during acetate stimulation of groundwater microbial communities. a, Groundwater was pumped from the alluvial aquifer and filtered through serial 1.2, 0.2 and 0.1 μm filters. DNA was extracted and sequenced from both the 0.2 and 0.1 μm filters, and RNA extracted and sequenced from the 0.2 μm filters (aerial image provided by S. M. Stoller for the US DOE under contract DE-AM01-07LM00060). b, Geochemical measurements were taken throughout the time series, showing a transition from dominant iron reduction to sulfate reduction through to methane production in the sampling environment.

  5. Extended Data Fig. 2: Validation of 20 draft-quality genomes by ESOM clustering of genome fragments based on tetranucleotide sequence composition.

    For validation, 20 draft genomes from a sample with a high proportion of CPR genomes (GWA2) were chosen at random. Each data point represents a 5–10 kb genome fragment. The ESOM was trained for 100 epochs with normalized tetranucleotide frequencies. Dark lines between data points indicate strong separation between regions. Data points are coloured based on the genome the fragment originated from. The ESOM shows well-delineated clusters for most of the 20 draft genomes, with few sequence fragments falling outside of these clusters. Two genomes from the same Microgenomates (OP11) phylum were not well delineated in the tetranucleotide-based ESOM (genomes 18 and 19). This shows how the method we used for binning, which takes into account abundance patterns in addition to sequence signatures, provides more accurate genome reconstructions. The white box distinguishes a single period on the repeating map. Genomes split into multiple clusters are labelled in red.
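
The legend above describes clustering fragments by normalized tetranucleotide frequencies. As a minimal sketch of how such an input vector might be computed, the Python below counts overlapping 4-mers in a fragment and normalizes them to sum to 1. The function names (`tetranucleotide_frequencies`, `split_into_fragments`) and the fixed 5 kb window are illustrative assumptions, not the authors' pipeline; in particular, this sketch does not merge reverse-complement k-mers or train the ESOM itself.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the unambiguous DNA alphabet.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(fragment: str) -> list[float]:
    """Return the 256 tetranucleotide frequencies of one fragment (sum to 1)."""
    seq = fragment.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Windows containing N or other ambiguity codes are ignored.
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def split_into_fragments(contig: str, size: int = 5000) -> list[str]:
    """Cut a contig into consecutive windows; a short trailing piece is dropped."""
    return [contig[i:i + size] for i in range(0, len(contig) - size + 1, size)]
```

Each fragment then contributes one 256-dimensional vector, which is the kind of data point a self-organizing map would be trained on.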

  6. Extended Data Fig. 3: Relative abundance of bacterial community members during acetate amendment.

    a, b, Relative abundance was calculated based on stringent mapping of paired-read sequences from each sample to 16S rRNA gene sequences assembled from all samples. Relative abundance of cells from 0.2 μm filters (a) and from 0.1 μm filters (b). Enrichment of CPR organisms in the 0.2 μm filtrate indicates that these organisms have ultra-small cell sizes.

  7. Extended Data Fig. 4: Features of insertion sequences encoded within 16S rRNA genes from the Silva database.

    The non-redundant Silva 16S rRNA gene database (v. 115) was analysed to assess the prevalence of insertions. Only 761 of the 418,498 16S rRNA gene sequences from bacteria encode insertions. While many small insertions were identified, unlike the 16S rRNA gene sequences assembled from groundwater, these sequences (1) rarely encode large insertions, (2) do not contain both ORFs and introns, (3) do not encode ORFs that could be assigned to Pfam families, and (4) may be found in one of multiple copies of the 16S rRNA gene.

  8. Extended Data Fig. 5: 16S rRNA gene copy number estimations for genomes reconstructed from groundwater metagenomics.

    a, b, 16S rRNA gene copy number was estimated for all draft CPR genomes and genome bins for organisms outside the CPR. This was achieved by comparing the coverage of 16S rRNA gene regions to the coverage of the rest of the genome. Importantly, coverage was calculated only with stringently mapped reads (no mismatches were allowed) to improve the accuracy of coverage calculations. a, Histogram of the number of 16S rRNA gene sequence copies estimated for each genome by calculating (16S rRNA gene coverage)/(genome coverage). Several WWE3 genomes were estimated to have high 16S rRNA gene copy number (Supplementary Table 7), but it was later determined that these estimates were skewed by the presence of a highly abundant closely related strain. The complete WWE3 genome assembled previously (ref. 3) has an identical 16S rRNA gene and confirms that it is found in only one copy for this genotype. Thus, we removed these estimates from subsequent copy number analysis. b, Density plot comparing estimated copy number of genomes for organisms found within and outside the CPR, where the longer tail for non-CPR genomes depicts the propensity for multiple 16S rRNA copies, a trait absent from the CPR.
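
The ratio (16S rRNA gene coverage)/(genome coverage) given in this legend can be illustrated with a short Python sketch. The helper names (`mean_coverage`, `estimate_16s_copy_number`) and the per-base depth lists are hypothetical; the sketch assumes depths were already computed from stringently mapped reads, as the legend describes, and does not reproduce the authors' actual tooling.

```python
def mean_coverage(depths: list[int]) -> float:
    """Average per-base read depth over a region (0.0 for an empty region)."""
    return sum(depths) / len(depths) if depths else 0.0


def estimate_16s_copy_number(rrna_coverage: float, genome_coverage: float) -> float:
    """Copy number estimate: 16S rRNA gene coverage divided by genome coverage."""
    if genome_coverage <= 0:
        raise ValueError("genome coverage must be positive")
    return rrna_coverage / genome_coverage


# Hypothetical per-base depths, for illustration only.
rrna_depths = [38, 40, 42]
genome_depths = [19, 20, 21]
estimate = estimate_16s_copy_number(
    mean_coverage(rrna_depths), mean_coverage(genome_depths)
)
```

A ratio near 1 is what a single-copy gene would be expected to give. As the WWE3 case in the legend shows, a closely related abundant strain can inflate the numerator, so such estimates need to be checked against other evidence.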

  9. Extended Data Fig. 6: Features of insertion sequences encoded within 23S rRNA genes recovered from groundwater-associated bacteria.

    Bacteria associated with the CPR encode insertions within their 23S rRNA genes (Supplementary Table 5). These insertions share many features with those identified in 16S rRNA gene sequences from CPR bacteria. Taxonomy was determined by inclusion in a genome with an established phylogeny.

  10. Extended Data Fig. 7: Analysis of the ability of PCR primers 515F and 806R to bind to recovered groundwater-associated 16S rRNA gene sequences.

    a, b, PrimerProspector was used to assess the ability of primers 515F and 806R to bind a non-redundant set of assembled near-complete 16S rRNA gene sequences (clustered at 97% sequence identity). The percentage of sequences that would be amplified by these primers is shown on the left axis, the total number of sequences analysed is on the top of each bar, and the number of sequences these primers would not bind to is indicated by the shading. Many assembled groundwater-associated 16S rRNA gene sequences would evade amplification by PCR primers 515F and 806R. Results of the analysis are shown at the domain (a) and superphylum or phylum (b) levels.
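
To make the primer-binding question concrete, the Python sketch below checks whether a 16S rRNA gene sequence contains sites matching 515F and 806R, treating IUPAC degenerate codes as sets of allowed bases. It is a simplified stand-in, not PrimerProspector: it counts plain mismatches and does not reproduce that tool's scoring. The primer sequences used are the commonly published Earth Microbiome Project versions, an assumption because the text does not list them, and the function names are hypothetical.

```python
# IUPAC nucleotide codes mapped to the bases each one permits.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

PRIMER_515F = "GTGCCAGCMGCCGCGGTAA"   # assumed sequence, 5' to 3'
PRIMER_806R = "GGACTACHVGGGTWTCTAAT"  # assumed sequence, 5' to 3'


def reverse_complement(seq: str) -> str:
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def min_mismatches(primer: str, target: str) -> int:
    """Fewest mismatches between a degenerate primer and any window of target."""
    target = target.upper().replace("U", "T")
    best = len(primer)  # returned unchanged if target is shorter than primer
    for start in range(len(target) - len(primer) + 1):
        window = target[start:start + len(primer)]
        mismatches = sum(base not in IUPAC[code] for code, base in zip(primer, window))
        best = min(best, mismatches)
    return best


def would_amplify(gene: str, max_mismatches: int = 0) -> bool:
    """True if both primer sites are found within the mismatch allowance."""
    forward_ok = min_mismatches(PRIMER_515F, gene) <= max_mismatches
    # The reverse primer anneals to the opposite strand, so search the gene
    # for the reverse complement of the primer.
    reverse_ok = min_mismatches(reverse_complement(PRIMER_806R), gene) <= max_mismatches
    return forward_ok and reverse_ok
```

An insertion that falls inside a primer site, or a divergent site with several mismatches, would make `would_amplify` return `False` under this simplified model. It does not check primer orientation, spacing or amplicon length.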

  11. Extended Data Fig. 8: Metabolic potential and ribosomal protein analysis of genomes from CPR and TM6 organisms.

    Assembled genomes were analysed using ggKbase (Supplementary Data 4). Shown here is a non-redundant set of complete and near-complete genomes (≥75% of single copy genes, ≤1.125 copies) organized based on a subset of a maximum-likelihood 16S rRNA gene phylogeny (Supplementary Fig. 1). CPR organisms have partial tricarboxylic acid (TCA) cycles and lack electron transport chain (ETC) complexes. In addition, they have incomplete biosynthetic pathways for nucleotides and amino acids. The Peregrinibacteria are a notable exception to some of these limitations. Several Parcubacteria exhibit a complete ubiquinol (cytochrome bo) oxidase operon, as previously seen in Saccharibacteria (ref. 3). However, lack of NADH dehydrogenase and other ETC components suggests that this enzyme is involved in oxygen scavenging/detoxification rather than energy production. AA Syn., amino acid synthesis; PP, pentose phosphate pathway.

Tables

  1. Extended Data Table 1: Proposed names for CPR phyla based on microbiology lifetime achievement award recipients

Accession codes

Primary accessions

BioProject: PRJNA273161

Sequence Read Archive: SRP050083

Change history

Corrected online 29 January 2016
Extended Data Table 1 was corrected on 25 January 2016

References

  1. Harris, J. K., Kelley, S. T. & Pace, N. R. New perspective on uncultured bacterial phylogenetic division OP11. Appl. Environ. Microbiol. 70, 845–849 (2004).
  2. Wrighton, K. C. et al. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science 337, 1661–1665 (2012).
  3. Kantor, R. S. et al. Small genomes and sparse metabolisms of sediment-associated bacteria from four candidate phyla. MBio 4, e00708–e00713 (2013).
  4. Wrighton, K. C. et al. Metabolic interdependencies between phylogenetically novel fermenters and respiratory organisms in an unconfined aquifer. ISME J. 8, 1452–1463 (2014).
  5. Rinke, C. et al. Insights into the phylogeny and coding potential of microbial dark matter. Nature 499, 431–437 (2013).
  6. Albertsen, M. et al. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nature Biotechnol. 31, 533–538 (2013).
  7. Castelle, C. J. et al. Genomic expansion of domain archaea highlights roles for organisms from new phyla in anaerobic carbon cycling. Curr. Biol. 25, 690–701 (2015).
  8. Luef, B. et al. Diverse, uncultivated ultra-small bacterial cells in groundwater. Nature Commun. 6, 6372 (2015).
  9. Burt, A. & Koufopanou, V. Homing endonuclease genes: the rise and fall and rise again of a selfish element. Curr. Opin. Genet. Dev. 14, 609–615 (2004).
  10. Salman, V., Amann, R., Shub, D. A. & Schulz-Vogt, H. N. Multiple self-splicing introns in the 16S rRNA genes of giant sulfur bacteria. Proc. Natl Acad. Sci. USA 109, 4203–4208 (2012).
  11. Quast, C. et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 41, D590–D596 (2013).
  12. Evguenieva-Hackenberg, E. Bacterial ribosomal RNA in pieces. Mol. Microbiol. 57, 318–325 (2005).
  13. Raghavan, R., Hicks, L. D. & Minnick, M. F. Toxic introns and parasitic intein in Coxiella burnetii: legacies of a promiscuous past. J. Bacteriol. 190, 5934–5943 (2008).
  14. Baker, B. J., Hugenholtz, P., Dawson, S. C. & Banfield, J. F. Extremely acidophilic protists from acid mine drainage host Rickettsiales-lineage endosymbionts that have intervening sequences in their 16S rRNA genes. Appl. Environ. Microbiol. 69, 5512–5518 (2003).
  15. Gong, J., Qing, Y., Guo, X. & Warren, A. ‘Candidatus Sonnebornia yantaiensis’, a member of candidate division OD1, as intracellular bacteria of the ciliated protist Paramecium bursaria (Ciliophora, Oligohymenophorea). Syst. Appl. Microbiol. 37, 35–41 (2014).
  16. Caporaso, J. G. et al. Ultra-high-throughput microbial community analysis on the Illumina HiSeq and MiSeq platforms. ISME J. 6, 1621–1624 (2012).
  17. Nawrocki, E. P. in Structural RNA Homology Search and Alignment using Covariance Models (ed. Eddy, S. R. et al.) (Washington Univ. in Saint Louis, 2009).
  18. Baker, B. J. & Dick, G. J. Omic approaches in microbial ecology: charting the unknown. Microbe 8, 353–360 (2013).
  19. Yarza, P. et al. Uniting the classification of cultured and uncultured bacteria and archaea using 16S rRNA gene sequences. Nature Rev. Microbiol. 12, 635–645 (2014).
  20. Akanuma, G. et al. Inactivation of ribosomal protein genes in Bacillus subtilis reveals importance of each ribosomal protein for cell proliferation and cell differentiation. J. Bacteriol. 194, 6282–6291 (2012).
  21. Lecompte, O. Comparative analysis of ribosomal proteins in complete genomes: an example of reductive evolution at the domain scale. Nucleic Acids Res. 30, 5382–5390 (2002).
  22. Lagkouvardos, I., Jehl, M.-A., Rattei, T. & Horn, M. Signature protein of the PVC superphylum. Appl. Environ. Microbiol. 80, 440–445 (2014).
  23. Yutin, N., Puigbò, P., Koonin, E. V. & Wolf, Y. I. Phylogenomics of prokaryotic ribosomal proteins. PLoS ONE 7, e36972 (2012).
  24. Nowotny, V. & Nierhaus, K. H. Initiator proteins for the assembly of the 50S subunit from Escherichia coli ribosomes. Proc. Natl Acad. Sci. USA 79, 7238–7242 (1982).
  25. Atkins, J. F. & Björk, G. R. A gripping tale of ribosomal frameshifting: extragenic suppressors of frameshift mutations spotlight P-site realignment. Microbiol. Mol. Biol. Rev. 73, 178–210 (2009).
  26. Schuwirth, B. S. Structures of the bacterial ribosome at 3.5 Å resolution. Science 310, 827–834 (2005).
  27. Nevskaya, N. Ribosomal protein L1 recognizes the same specific structural motif in its target sites on the autoregulatory mRNA and 23S rRNA. Nucleic Acids Res. 33, 478–485 (2005).
  28. Shajani, Z., Sykes, M. T. & Williamson, J. R. Assembly of bacterial ribosomes. Annu. Rev. Biochem. 80, 501–526 (2011).
  29. Luef, B. et al. Iron-reducing bacteria accumulate ferric oxyhydroxide nanoparticle aggregates that may support planktonic growth. ISME J. 7, 338–350 (2013).
  30. Williams, K. H. et al. Acetate availability and its influence on sustainable bioremediation of uranium-contaminated groundwater. Geomicrobiol. J. 28, 519–539 (2011).
  31. Peng, Y., Leung, H. C. M., Yiu, S. M. & Chin, F. Y. L. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics 28, 1420–1428 (2012).
  32. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359 (2012).
  33. Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
  34. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
  35. Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
  36. Kanehisa, M., Goto, S., Sato, Y., Furumichi, M. & Tanabe, M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, D109–D114 (2012).
  37. Kanehisa, M. & Goto, S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 28, 27 (2000).
  38. Hug, L. A. et al. Community genomic analyses constrain the distribution of metabolic traits across the Chloroflexi phylum and indicate roles in sediment carbon cycling. Microbiome 1, 22 (2013).
  39. Castelle, C. J. et al. Extraordinary phylogenetic diversity and metabolic versatility in aquifer sediment. Nature Commun. 4, 2120 (2013).
  40. Dick, G. J. et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85 (2009).
  41. Raes, J., Korbel, J. O., Lercher, M. J., von Mering, C. & Bork, P. Prediction of effective genome size in metagenomic samples. Genome Biol. 8, R10 (2007).
  42. Altschul, S. F., Gish, W., Miller, W., Meyers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
  43. McLean, J. S. et al. Candidate phylum TM6 genome recovered from a hospital sink biofilm provides genomic insights into this uncultivated phylum. Proc. Natl Acad. Sci. USA 110, E2390–E2399 (2013).
  44. Podar, M. et al. Targeted access to the genomes of low-abundance organisms in complex microbial communities. Appl. Environ. Microbiol. 73, 3205–3214 (2007).
  45. Marcy, Y. et al. Dissecting biological ‘dark matter’ with single-cell genetic analysis of rare and uncultivated TM7 microbes from the human mouth. Proc. Natl Acad. Sci. USA 104, 11889–11894 (2007).
  46. Nawrocki, E. P., Kolbe, D. L. & Eddy, S. R. Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337 (2009).
  47. Cannone, J. J. et al. The Comparative RNA Web (CRW) Site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 3, 2 (2002).
  48. Burge, S. W. et al. Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 41, D226–D232 (2013).
  49. Andronescu, M., Condon, A., Hoos, H. H., Mathews, D. H. & Murphy, K. P. Efficient parameter estimation for RNA secondary structure prediction. Bioinformatics 23, i19–i28 (2007).
  50. Kearse, M. et al. Geneious Basic: an integrated and extendable desktop software platform for the organization and analysis of sequence data. Bioinformatics 28, 1647–1649 (2012).
  51. Finn, R. D. et al. Pfam: the protein families database. Nucleic Acids Res. 42, D222–D230 (2014).
  52. Kelley, L. A. & Sternberg, M. J. E. Protein structure prediction on the Web: a case study using the Phyre server. Nature Protocols 4, 363–371 (2009).
  53. Gilbert, J. A. et al. Meeting report: the terabase metagenomics workshop and the vision of an Earth microbiome project. Stand. Genomic Sci. 3, 243–248 (2010).
  54. Walters, W. A. et al. PrimerProspector: de novo design and taxonomic analysis of barcoded polymerase chain reaction primers. Bioinformatics 27, 1159–1161 (2011).
  55. Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).
  56. Eddy, S. R. Accelerated profile HMM searches. PLOS Comput. Biol. 7, e1002195 (2011).
  57. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).
  58. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).
  59. Abascal, F., Zardoya, R. & Posada, D. ProtTest: selection of best-fit models of protein evolution. Bioinformatics 21, 2104–2105 (2005).
  60. Huson, D. H. & Scornavacca, C. Dendroscope 3: an interactive tool for rooted phylogenetic trees and networks. Syst. Biol. 61, 1061–1067 (2012).
  61. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
  62. Ultsch, A. & Moerchen, F. ESOM-Maps: tools for clustering, visualization, and classification with Emergent SOM. Technical Report no. 46 (Dept. of Mathematics and Computer Science, University of Marburg, Germany, 2005).

Author information

Affiliations

  1. Department of Plant and Microbial Biology, University of California, Berkeley, California 94720, USA

    • Christopher T. Brown
  2. Department of Earth and Planetary Science, University of California, Berkeley, California 94720, USA

    • Laura A. Hug,
    • Brian C. Thomas,
    • Itai Sharon,
    • Cindy J. Castelle,
    • Andrea Singh &
    • Jillian F. Banfield
  3. School of Earth Sciences, The Ohio State University, Columbus, Ohio 43210, USA

    • Michael J. Wilkins
  4. Department of Microbiology, The Ohio State University, Columbus, Ohio 43210, USA

    • Michael J. Wilkins &
    • Kelly C. Wrighton
  5. Earth Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA

    • Kenneth H. Williams &
    • Jillian F. Banfield
  6. Department of Environmental Science, Policy, and Management, University of California, Berkeley, California 94720, USA

    • Jillian F. Banfield

Contributions

Samples and geochemical measurements were taken by M.J.W., K.C.W. and K.H.W. B.C.T. assembled the metagenome data. I.S. implemented the ABAWACA algorithm. C.T.B. and J.F.B. binned the data and carried out the ESOM binning validation. J.F.B. closed and curated the complete genomes. C.T.B., L.A.H. and B.C.T. conducted the rRNA gene insertion analysis. C.T.B. and L.A.H. performed phylogenetic analyses. M.J.W. and K.C.W. conducted the RNA sequencing. C.T.B. carried out the 16S rRNA gene copy number, primer binding and transcript analyses. C.T.B. and J.F.B. carried out the ribosomal protein analyses. C.T.B., L.A.H., C.J.C. and J.F.B. conducted the metabolic analysis. A.S. and B.C.T. provided bioinformatics support. C.T.B. and J.F.B. drafted the manuscript. All authors reviewed the results and approved the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Corresponding author

Correspondence to:

DNA and RNA sequences have been deposited in the NCBI Sequence Read Archive under accession number SRP050083, and genome sequences have been deposited in NCBI BioProject under accession number PRJNA273161 (first versions described here). Genomes are also available through ggKbase: http://ggkbase.berkeley.edu/CPR-complete-draft/organisms. ggKbase is a ‘live data’ site, thus annotations and genomes may be improved after publication.

Supplementary information

PDF files

  1. Supplementary Information (129 KB)

    This file contains a guide to Supplementary Figure 1, Supplementary Tables 1-10 and the Supplementary Data (see separate files).

  2. Supplementary Figure (1.3 MB)

    This file contains Supplementary Figure 1 (see the Supplementary Information file for details).

Excel files

  1. Supplementary Tables (3.1 MB)

    This file contains Supplementary Tables 1-10 (see the Supplementary Information file for details).

Zip files

  1. Supplementary Data (13.7 MB)

    This zipped file contains the Supplementary Data (see the Supplementary Information file for details).