Average genome size: a potential source of bias in comparative metagenomics

Beszteri, Bánk; Temperton, Ben; Frickenhaus, Stephan; Giovannoni, Stephen J

doi:10.1038/ismej.2010.29

Download PDF

Short Communication
Published: 25 March 2010

Integrated Genomics and Post-Genomics Approaches in Microbial Ecology

Average genome size: a potential source of bias in comparative metagenomics

Bánk Beszteri¹,
Ben Temperton²,
Stephan Frickenhaus³ &
…
Stephen J Giovannoni¹

The ISME Journal volume 4, pages 1075–1077 (2010)Cite this article

3681 Accesses
52 Citations
3 Altmetric
Metrics details

Subjects

Abstract

In gene-centric comparative metagenomics, differences in observed relative gene abundances among samples are often assumed to reflect the biological importance of individual genes in different habitats. Statistical tests and data mining for genes that represent habitat-specific adaptations are frequently based on this measure. We demonstrate that this measure is biased by the average genome size of the communities sampled. Average genome sizes can be estimated from the metagenomic data themselves, and taken into account in comparative analyses. We suggest that this would enable ecologically more meaningful comparisons, especially when the average genome sizes of compared communities differ substantially. We illustrate the influence of average genome-size differences on comparative analyses, with an example to highlight the need for further exploration of this bias.

Trait biases in microbial reference genomes

Article Open access 09 February 2023

Sage Albright & Stilianos Louca

Towards the biogeography of prokaryotic genes

Article 15 December 2021

Luis Pedro Coelho, Renato Alves, … Peer Bork

KAUST Metagenomic Analysis Platform (KMAP), enabling access to massive analytics of re-annotated metagenomic data

Article Open access 01 June 2021

Intikhab Alam, Allan Anthony Kamau, … Vladimir B. Bajic

Main

Shotgun sequencing of total community DNA is becoming an established part of the methodological arsenal of microbial ecology. Owing to its relative novelty and the high complexity of the data it produces, metagenomics is still in an explorative phase in many ways, and several aspects of data acquisition and analysis are just gaining proper attention and treatment (Gomez-Alvarez et al., 2009; Kunin et al., 2009). Here we draw attention to a currently underappreciated aspect of gene-centric comparative metagenomic data analyses: the effect of average genome sizes on sampling probabilities of gene fragments.

In gene-centric comparative metagenomic analyses, sequences are classified into functional categories, and the relative abundances of reads belonging to different categories are compared across samples. The statistical approaches proposed for analyzing these data usually test the null hypothesis that sampling probabilities of genes are equal across samples (Rodriguez-Brito et al., 2006; Kristiansson et al., 2009; Li, 2009). Such analyses often account for differences in sampling effort between samples, and sometimes for average gene length, although the latter property is usually assumed to be relatively constant across samples. Using a simple mathematical model of metagenomic sampling, we demonstrate that sampling probabilities of individual genes (p_m), pooled across all taxa present in a community, are expected to depend on the average genome sizes of the samples compared (Ḡ), and on the gene length (l_m) and average copy number of the gene concerned (C̄_m):

Detailed arguments supporting this equation can be found in the Supplementary Information. Previously, Raes et al. (2007a) noted that this issue was likely to be important, without attempting to analyze its impact on comparative sequence analysis.

In practical terms, this means that even universally distributed single-copy genes will show apparent differences in relative abundances among samples if average genome sizes differ substantially across samples. Thus, in principle, statistically significant, but biologically uninteresting, variation can emerge from analyses that fail to account for differences in the average genome sizes of communities. A number of recent publications have described methods for estimating the average genome size of communities and have demonstrated substantial genome-size variation among communities, for example at different geographical locations or depths in marine environments (Raes et al., 2007b; Angly et al., 2009; Konstantinidis et al., 2009).

To illustrate, we considered a metagenomic data set from the Northern Pacific (DeLong et al., 2006). The number of reads sequenced from each sample varied between 6812 and 11 479 (Supplementary Figure 1a). The estimated genome sizes of these samples ranged from 3.36 to 5.56 million base pairs (Supplementary Figure 1b), somewhat higher than the values previously estimated for other oceanic metagenomic samples (Raes et al., 2007a). Notably, the sample from 10 m depth had a markedly lower estimated average genome size than the rest of the samples, increasing the probability of sampling any gene category in this sample compared with the others. We combined the effects of total numbers of sequences and estimated average genome sizes into a composite measure, which we termed as the effective sequence count (ESC, see Supplementary Information for mathematical details). ESCs ranged from 6305 to 11 393 ‘pseudo-reads’ (see Supplementary Figure 1c), suggesting that an approximately 1.8-fold difference is expected in the sampling probabilities of universal single-copy genes between the 70- and 200-m samples, representing the two extremes. Although the data set concerned is too small for confidently detecting sampling probability differences of this order of magnitude, when testing the above 35 universal single-copy cluster of orthologous groups (COGs) individually, six of them even showed significant trends with depth at an (uncorrected) α-level of 5% when average genome sizes were ignored.

To illustrate the potential effect of correcting for average genome-size differences on apparent trends in gene abundance, we used COG0415 (deoxyribopyrimidine photolyase), which was featured previously to illustrate a significant trend with depth in the same data set (DeLong et al., 2006; Kristiansson et al., 2009). Notably, the equal total counts of fragments annotated as COG0415 at the surface and at 70 m depth (Figure 1a) turn into a decrease at 70 m when a normalization by sequencing effort is applied (Figure 1b), and into an increase when both sequencing effort and community-averaged genome size are taken into account for normalization (Figure 1c). Based on the mathematical arguments detailed in the Supplementary Information, we suggest that the trend observed in Figure 1b confounds differences among communities in average genome size (that is, differences that are not specific to the gene of interest) with differences in the average copy number of COG0415 per cell. Figure 1c gives a biologically more meaningful picture about the differential presence of this gene in genomes of organisms occurring at different depths by accounting for the former, not gene-specific, factor. We further explore the possible effects of community average genome-size differences on comparative analyses in the Supplementary Information.

In conclusion, we propose to add community genome size to the list of potential biasing factors for gene-centric comparative metagenomics (ranging from sampling and sample preparation issues, and sequencing technologies, to annotation protocols, methods for measuring sampling effort, and gene length effects; see, for example, Raes et al., 2007a). Based on the simple mathematical considerations represented in Equation (1) and detailed in the Supplementary Information, we propose that decomposing relative gene abundance differences among metagenomic samples into ‘metagenome-wide’ and gene specific components—that is, differences in average genome sizes vs differences in relative gene-copy numbers—is expected to improve the biological relevance of inferences.

References

Angly FE, Willner D, Prieto-Davo A, Edwards RA, Schmieder R, Vega-Thurber R et al. (2009). The GAAS metagenomic tool and its estimations of viral and microbial average genome size in four major biomes. PLOS Comp Biol 5: e1000593.
Article Google Scholar
DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N-U et al. (2006). Community genomics among stratified microbial assemblages in the ocean's interior. Science 311: 496–503.
Article CAS Google Scholar
Gomez-Alvarez V, Teal TK, Schmidt TM . (2009). Systematic artifacts in metagenomes from complex microbial communities. ISME J 3: 1314–1317.
Article Google Scholar
Konstantinidis KT, Braff J, Karl DM, DeLong EF . (2009). Comparative metagenomic analysis of a microbial community residing at a depth of 4000 meters at station ALOHA in the North Pacific subtropical gyre. Appl Environ Microbiol 75: 5345–5355.
Article CAS Google Scholar
Kristiansson E, Hugenholtz P, Dalevi D . (2009). ShotgunFunctionalizeR: an R-package for functional comparison of metagenomes. Bioinformatics 25: 2737–2738.
Article CAS Google Scholar
Kunin V, Engelbrektson A, Ochman H, Hugenholtz P . (2009). Wrinkles in the rare biosphere: pyrosequencing errors lead to artificial inflation of diversity estimates. Environ Microbiol. doi: 10 1111/j.1462-2920.2009.02051.x.
Li W . (2009). Analysis and comparison of very large metagenomes with fast clustering and functional annotation. BMC Bioinformatics 10: 359.
Article Google Scholar
Raes J, Foerstner KU, Bork P . (2007a). Get the most out of your metagenome: computational analysis of environmental sequence data. Curr Opin Microbiol 10: 490–498.
Article CAS Google Scholar
Raes J, Korbel JO, Lercher MJ, von Mering C, Bork P . (2007b). Prediction of effective genome size in metagenomic samples. Genome Biol 8: R10.
Article Google Scholar
Rodriguez-Brito B, Rohwer F, Edwards RA . (2006). An application of statistics to comparative metagenomics. BMC Bioinformatics 7: 162.
Article Google Scholar

Download references

Acknowledgements

This work was supported by a Marine Microbiology Initiative Investigator award from the Gordon and Betty Moore Foundation and the Society of General Microbiology President's Fund.

Author information

Authors and Affiliations

Department of Microbiology, Oregon State University, Corvallis, OR, USA
Bánk Beszteri & Stephen J Giovannoni
Plymouth Marine Laboratory, Plymouth, UK
Ben Temperton
Alfred Wegener Institute for Polar and Marine Research, Bremerhaven, Germany
Stephan Frickenhaus

Authors

Bánk Beszteri
View author publications
You can also search for this author in PubMed Google Scholar
Ben Temperton
View author publications
You can also search for this author in PubMed Google Scholar
Stephan Frickenhaus
View author publications
You can also search for this author in PubMed Google Scholar
Stephen J Giovannoni
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Bánk Beszteri.

Additional information

Supplementary Information accompanies the paper on The ISME Journal website

Supplementary information

Supplementary Information (PDF 281 kb)

Supplementary Data 1 (DOC 70 kb)

Supplementary Data 2 (DOC 81 kb)

Supplementary Data 3 (DOC 104 kb)

Supplementary Data 4 (DOC 131 kb)

Supplementary Data 5 (DOC 68 kb)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Beszteri, B., Temperton, B., Frickenhaus, S. et al. Average genome size: a potential source of bias in comparative metagenomics. ISME J 4, 1075–1077 (2010). https://doi.org/10.1038/ismej.2010.29

Download citation

Received: 26 November 2009
Revised: 15 February 2010
Accepted: 15 February 2010
Published: 25 March 2010
Issue Date: August 2010
DOI: https://doi.org/10.1038/ismej.2010.29