Inferring the functional capabilities of bacteria from metagenome-assembled genomes (MAGs) is becoming a central process in microbiology. Here we show that the completeness of genomes has a significant impact on the recovered functional signal, spanning all domains of metabolic functions. We identify factors that affect this relationship between genome completeness and function fullness, and provide baseline knowledge to guide efforts to correct for this overlooked bias in metagenomic functional inference.
Genome-resolved metagenomics enables draft bacterial genomes to be reconstructed from DNA sequence data derived from complex microbial mixtures. The metagenome-assembled genomes (MAGs) derived from such a process can be annotated to predict their functional toolbox, upon which microbiome-level functional analyses can be conducted [2, 3]. While long-read sequencing technologies can recover circularised genomes from metagenomic mixtures, one of the main issues of short-read sequencing-based approaches is that MAGs usually display different levels of genome completeness, i.e., the entirety of a microbe’s DNA is not always captured in the reconstructed genome. Genome completeness is primarily estimated through the presence of single-copy core genes (SCCGs), which are expected to be found in most bacteria. It is common to use MAGs with completeness values as low as 70% for the functional analyses of microbial communities. However, if a genome is estimated to be 70% complete, it is probable that many of the functions encoded in the actual genome will not be captured in the MAG, and thus the functional capacity of the genome will be underestimated [3, 8]. Not accounting for the level of completeness of MAGs could therefore lead researchers to incorrect interpretations of results, such as an artifactual deficit of functions being misinterpreted as real biological signal.
A major challenge of metagenomic research is correcting or accounting for biases in statistical analyses and modelling. However, we currently do not know how the loss of functional capacities is correlated with genome completeness, or whether these relationships are constant or variable across microbial phylogeny and metabolic domains. To address these issues, we conducted two complementary analyses. First, we investigated the relationship between estimated genome completeness and metabolic function fullness (defined as the proportion of biochemical reactions enabled by the genes present in a genome to accomplish a metabolic function) using 11,842 genomes from diverse origins publicly accessible in the GTDB database. Genome completeness was estimated using CheckM, while functional fullness of KEGG modules was estimated using DRAM [2, 10]. To ensure robust statistical modelling based on unbiased data, only MAGs belonging to the four most diverse bacterial phyla were considered, namely Actinobacteriota, Bacteroidota, Firmicutes and Proteobacteria (around 3000 genomes each). Genome completeness was evenly represented across 70–100%, with each 1% window containing ca. 100 genomes from each phylum (Fig. 1a), and only genomes with contamination/redundancy values under 10% were considered (Fig. 1b). We also filtered the KEGG modules used for the modelling by only considering the functions represented in at least 5% of the genomes (i.e., a minimum representation of 592 data points). Second, using a mock community of eight bacterial species, we compared 240 incomplete genomes (subsampled from MAGs) to their circularised genome counterparts, and applied a naive correction to their functional profiles using the previously trained models to showcase the possibility of improving functional inferences from incomplete bacterial genomes.
Results and discussion
We employed generalised linear models with a binomial distribution to understand the association between genome completeness and function fullness in a filtered set of 195 KEGG metabolic modules across 11,842 genomes (Fig. 1c). The models estimated a positive relationship between genome completeness and metabolic function fullness for 94% of the studied modules, spanning all functional domains and levels of complexity (i.e., number of enzymatic steps). Overall, the increase of completeness from 70 to 100% was associated with a 15 ± 10% (mean ± sd) increase in module fullness. This relationship held across the entire completeness gradient, with a slight tendency for the slope to increase with completeness (Fig. 2a). This indicates that, while raising the threshold to exclude MAGs with low completeness from functional analyses minimises the issue, the problem persists even when only ‘high quality’ (>90%) MAGs are considered. We also found evidence for significant differences in the fullness-completeness relationship among bacterial phyla. Considering all KEGG modules analysed, Proteobacteria showed the strongest overall fullness-completeness relationship, followed by Firmicutes, Actinobacteriota and Bacteroidota (Fig. 2b).
As with the taxonomic differences, the fullness-completeness relationship was not uniform across metabolic domains. The fullness of the modules belonging to the ‘nucleotide metabolism’ and ‘biosynthesis of other secondary metabolites’ domains was the most affected by completeness, while ‘energy metabolism’ showed the weakest fullness-completeness association (Fig. 2c). In addition, the complexity of the modules was negatively associated with the fullness-completeness relationship (Fig. 2d). This suggests that the modules with the fewest steps are the ones whose fullness is most severely affected by genome incompleteness.
We then implemented a complementary approach to assess the magnitude of the issue, by comparing the functional profiles of eight circularised genomes to their respective MAGs with ca. 70, 80 and 90% completeness, as reconstructed and subsampled using genome-resolved metagenomics. The principal coordinate analysis (PCoA) of the genomes from the mock community revealed that reducing genome completeness from 100 to 70% introduced a systematic bias to their functional profiles, with a magnitude proportional to the level of subsampling (Fig. 3a). A striking example is the Pseudomonas aeruginosa genome, whose functional profile shifts towards that of Bacillus subtilis as genome completeness decreases.
Finally, as an initial attempt to showcase that completeness biases can be minimised, we used the module-specific binomial generalised linear models trained using the MAGs from the GTDB database to correct the functional profiles of incomplete MAGs. PCoA ordination showed that our models consistently reduced the functional bias introduced by genome incompleteness (Fig. 3b). The correction was successful for most genomes (the 95% confidence interval ellipses of the MAGs encompassed the complete genome), although it tended to ‘overcorrect’ others (e.g., Enterococcus faecalis and Lactobacillus fermentum).
Our results highlight the need to consider genome completeness when comparing the functional capacities between microbial genomes or metagenomes. Although chromosomal information does not encode the entire functional toolbox of microorganisms, linking plasmids and other extrachromosomal elements to bacterial strains in metagenomic data faces other challenges that are beyond the scope of this article. Currently, the focus of most genome-resolved metagenomic studies is limited to chromosomal DNA, and our results show that incorrect conclusions can be drawn if completeness biases are not considered. We argue that completeness biases should be accounted for in functional analyses based on metagenomics, analogously to how DNA sequencing depth biases are considered in diversity modelling approaches. Although the aim of this study was to showcase the bias introduced by genome incompleteness rather than to correct it, our simple correction approach considerably reduced the bias and helped recover less distorted functional profiles of incomplete MAGs. However, we believe there is ample room for improvement by including more variables that help explain the link between genome completeness and function fullness. Only through the correction and mitigation of the functional biases introduced by uneven genome completeness will researchers be able to robustly characterise, model, and assess the functional capabilities of microbial communities.
Materials and methods
Genome retrieval, annotation and distillation
We browsed the GTDB database, which contains complete and draft bacterial genomes with associated CheckM completeness scores, for bacterial phyla with at least 100 genomes in each of the 1% windows spanning 70–100% genome completeness with <10% contamination. These criteria were met by four phyla, namely Actinobacteriota, Bacteroidota, Firmicutes (sensu lato, including Bacilli and Clostridia) and Proteobacteria, which were considered for analysis. Genomes within each completeness window and phylum were randomly selected, and their sequences retrieved from the NCBI database using their assembly accession codes. For the subsampled complete genome analysis, a sequenced ZymoBIOMICS® Microbial Community Standard from Olm et al. was downloaded using kingfisher download (https://github.com/wwood/kingfisher-download). This mock community contains 8 bacterial species that have complete reference genomes available. The downloaded reads were randomly subsampled to different depths (1 million, 2 million, 5 million, and all reads) using seqtk sample with a seed of 1337 (https://github.com/lh3/seqtk), and each subsample was processed through the ‘Individual_Assembly_Binning.snakefile’ pipeline (https://github.com/earthhologenome/EHI_bioinformatics). dRep was then used on all resulting MAGs, together with the complete bacterial reference genomes, to determine which MAG derived from which bacterial genome. A representative MAG from each species (CheckM completeness >90%, contamination <5%, at least 42 contigs) had its contigs randomly subsampled ten times at three different rates (70, 80 and 90% of contigs remaining) using BBMap’s reformat.sh (a total of 240 MAG subsamplings). CheckM was then run on these subsampled MAGs to estimate completeness.
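The contig subsampling design described above can be illustrated with a minimal Python sketch. This is not the procedure used in the study, which relied on BBMap’s reformat.sh; the function names, the placeholder contig names and the seed value are assumptions made for illustration only.

```python
import random


def subsample_contigs(contigs, keep_fraction, rng):
    """Return a random subset holding roughly keep_fraction of the contigs."""
    n_keep = max(1, round(len(contigs) * keep_fraction))
    return rng.sample(contigs, n_keep)


def build_subsamples(mags, fractions=(0.7, 0.8, 0.9), replicates=10, seed=0):
    """Subsample every MAG `replicates` times at each retention rate.

    The seed is an arbitrary illustrative value, not one reported in the text.
    """
    rng = random.Random(seed)
    subsamples = []
    for name, contigs in mags.items():
        for fraction in fractions:
            for replicate in range(replicates):
                kept = subsample_contigs(contigs, fraction, rng)
                subsamples.append((name, fraction, replicate, kept))
    return subsamples


# Hypothetical input: eight MAGs, each with 42 placeholder contig names.
mags = {
    f"species_{i}": [f"contig_{j}" for j in range(42)]
    for i in range(1, 9)
}
subsamples = build_subsamples(mags)
print(len(subsamples))  # 8 species x 3 rates x 10 replicates = 240
```

Because contigs differ in length and gene content, retaining 70% of contigs does not guarantee 70% completeness, which is why completeness is re-estimated on each subsampled MAG.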
Genomes were subsequently annotated and distilled to KEGG pathway fullness values (defined as the proportion of biochemical reactions enabled by the genes present in a genome to accomplish a metabolic function) using DRAM (v1.2.4). The summarise_genomes.py script was modified to output all modules and module numbers (https://github.com/EisenRa/DRAM_more_modules/tree/1.2.4_more_modules).
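As a rough illustration of the fullness definition, the Python sketch below treats a module as an ordered list of steps, each of which can be enabled by any one of several alternative genes. This is a simplification and does not reproduce DRAM’s actual module parsing; the gene identifiers and the helper name are hypothetical.

```python
def module_fullness(steps, genome_genes):
    """Proportion of module steps with at least one enabling gene present."""
    if not steps:
        raise ValueError("a module needs at least one step")
    present = sum(1 for alternatives in steps if alternatives & genome_genes)
    return present / len(steps)


# Hypothetical three-step module; each set lists alternative genes for a step.
module = [{"gene_a"}, {"gene_b", "gene_c"}, {"gene_d"}]

complete_genome = {"gene_a", "gene_c", "gene_d", "gene_x"}
incomplete_mag = {"gene_a", "gene_x"}  # two module genes were not recovered

print(module_fullness(module, complete_genome))  # 1.0
print(module_fullness(module, incomplete_mag))   # 0.3333333333333333
```

The second call shows how genes missing from a MAG lower the apparent fullness of a module that is complete in the real genome.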
Statistics and visualisation
Statistical analyses were conducted on KEGG module fullness data. Only widespread KEGG modules present in at least 5% of MAGs were used for statistical modelling. This filtering resulted in 195 KEGG modules observed across 11,842 MAGs. Statistical analyses were conducted in three consecutive steps.
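The prevalence filter can be sketched in Python as follows. The sketch assumes that a module counts as present in a genome when its fullness is above zero; the text does not state the exact criterion, and the table layout and names are hypothetical.

```python
def filter_widespread_modules(fullness_table, min_prevalence=0.05):
    """Keep modules detected in at least min_prevalence of the genomes.

    Assumption: a module is 'present' in a genome when its fullness is > 0.
    """
    n_genomes = len(fullness_table)
    counts = {}
    for profile in fullness_table.values():
        for module, fullness in profile.items():
            if fullness > 0:
                counts[module] = counts.get(module, 0) + 1
    return sorted(
        module for module, count in counts.items()
        if count / n_genomes >= min_prevalence
    )


# Hypothetical toy table with 40 genomes:
# module_A occurs in all, module_B in 4 (10%), module_C in 1 (2.5%).
table = {}
for i in range(40):
    table[f"genome_{i}"] = {
        "module_A": 0.8,
        "module_B": 0.5 if i < 4 else 0.0,
        "module_C": 1.0 if i == 0 else 0.0,
    }

kept = filter_widespread_modules(table)
print(kept)  # ['module_A', 'module_B']
```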
In the first step, generalised linear models were used to estimate the relationship between fullness of KEGG modules and completeness of genomes. A binomial distribution was used with the logit link function, since function fullness represents the proportion of enzymatic reactions (or steps) of a module present in a genome. The total number of steps of each module was used as weights in the models. Genome completeness (numeric variable), the bacterial Phylum (categorical variable with four levels) and their interaction were used as fixed explanatory variables, thus a Phylum-specific slope was estimated for each module.
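A minimal Python sketch of such a model for a single module and a single phylum is given below. It fits a binomial regression with a logit link by Newton-Raphson, using the number of module steps as the binomial denominator. The data are synthetic, completeness is expressed as a proportion, and the function names are assumptions; the scripts actually used in the study are those archived with the article. Because the full model includes the completeness-by-Phylum interaction, fitting each phylum separately yields the same point estimates of the Phylum-specific slopes.

```python
import math


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_binomial_glm(x, successes, trials, n_iter=50, tol=1e-10):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi, ni in zip(x, successes, trials):
            p = expit(b0 + b1 * xi)
            w = ni * p * (1.0 - p)          # binomial variance weight
            g0 += yi - ni * p               # score for the intercept
            g1 += (yi - ni * p) * xi        # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det    # solve the 2x2 Newton system
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


# Synthetic data: one hypothetical 10-step module observed in 31 genomes
# whose completeness ranges from 0.70 to 1.00.
steps = 10
completeness = [0.70 + 0.01 * i for i in range(31)]
fullness = [round(expit(-2.0 + 3.0 * c), 1) for c in completeness]

# Fullness is a proportion of steps, so convert it back to step counts.
successes = [round(f * steps) for f in fullness]
trials = [steps] * len(completeness)

b0, b1 = fit_binomial_glm(completeness, successes, trials)
fullness_at_70 = expit(b0 + b1 * 0.70)
fullness_at_100 = expit(b0 + b1 * 1.00)
print(b1 > 0, fullness_at_70 < fullness_at_100)
```

A positive slope `b1` corresponds to module fullness increasing with genome completeness.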
In the second step, linear mixed effect modelling, as implemented in the R package lme4, was used to explore predictors of the strength of the fullness-completeness relationship across modules. The Phylum-specific slopes estimated in the aforementioned binomial models were used as response variables, and bacterial Phylum (categorical variable with four levels), the KEGG domain of each module (categorical variable with ten levels), and the number of steps involved in a module (numeric variable) were used as fixed explanatory variables. Since four slopes were estimated for each module (one for each bacterial Phylum) and they were included in the response variable for this model, a module-level random effect was included in the model as a random intercept (random = 1|module). We built bootstrap confidence intervals around the levels of the categorical variables through the bootMer() function with 999 simulations, and around the slope of the numeric variable using the function confint.merMod(), both included in the lme4 package. To make the marginal predictions for each categorical variable, the non-focal categorical variables (Phylum or domain) were kept at their reference level and the numeric variable (steps) at its mean value. Non-overlapping 95% confidence intervals between the levels of the categorical factors Phylum and domain were considered as evidence against the null hypothesis of no differences between groups. Similarly, for the numeric variable number of steps, a confidence interval of the slope not overlapping zero was considered as evidence against the null hypothesis of no association.
In the third step, we first used the functional profiles of 8 genomes from a mock community and their subsampled replicates to perform a Principal Coordinate Analysis (PCoA) for visual assessment of the bias introduced by the reduction of genome completeness to around 90, 80 and 70%. Then, we used the module-specific binomial generalised linear models constructed in step one (trained with the MAGs retrieved from the GTDB database) to correct the functional profiles of the subsampled genomes. For this, we generated two predictions for each focal genome based on its Phylum: the predicted module fullness given its observed completeness, and the predicted module fullness if the genome were 100% complete. The difference between both predictions was added to the observed module fullness in each focal genome. If the corrected module fullness was larger than 1, it was set to 1. Lastly, a joint PCoA was conducted using the functional profiles of complete genomes, the raw subsampled genomes and the corrected subsampled genomes, and displayed in six plots for the sake of visualisation.
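The correction rule can be sketched in Python as below. The coefficients are hypothetical placeholders standing in for the fitted, Phylum-specific intercept and slope of each module, completeness is expressed as a proportion, and only the upper bound described in the text is enforced.

```python
import math


def expit(z):
    """Inverse of the logit link."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_fullness(coefs, completeness):
    """Predicted module fullness from (intercept, slope) on the logit scale."""
    b0, b1 = coefs
    return expit(b0 + b1 * completeness)


def correct_fullness(observed, completeness, coefs):
    """Add the predicted gain from observed to 100% completeness, capped at 1."""
    gain = predict_fullness(coefs, 1.0) - predict_fullness(coefs, completeness)
    return min(1.0, observed + gain)


# Hypothetical coefficients for two modules in one phylum.
models = {"module_A": (-2.0, 3.0), "module_B": (0.5, 4.0)}

# Hypothetical observed fullness in a MAG estimated to be 70% complete.
observed = {"module_A": 0.50, "module_B": 0.98}
corrected = {
    module: correct_fullness(value, 0.70, models[module])
    for module, value in observed.items()
}
print(corrected)
```

In this toy case the fullness of `module_A` rises by the predicted gain, whereas the corrected value of `module_B` would exceed 1 and is therefore set to 1.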
Genome sequences and metadata employed in this study were retrieved from the NCBI and GTDB databases. A table containing the accession codes and metadata of the genomes, and the scripts employed for generating the results, are archived in Zenodo: https://zenodo.org/record/7584430 (https://doi.org/10.5281/zenodo.7584429).
1. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
2. Shaffer M, Borton MA, McGivern BB, Zayed AA, La Rosa SL, Solden LM, et al. DRAM for distilling microbial metabolism to automate the curation of microbiome function. Nucleic Acids Res. 2020;48:8883–900.
3. Belcour A, Frioux C, Aite M, Bretaudeau A, Hildebrand F, Siegel A. Metage2Metabo, microbiota-scale metabolic complementarity for the identification of key species. Elife. 2020;9:e61968.
4. Meyer F, Fritz A, Deng Z-L, Koslicki D, Lesker TR, Gurevich A, et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat Methods. 2022;19:429–40.
5. Meziti A, Rodriguez-R LM, Hatt JK, Peña-Gonzalez A, Levy K, Konstantinidis KT. The reliability of metagenome-assembled genomes (MAGs) in representing natural populations: insights from comparing MAGs against isolate genomes derived from the same fecal sample. Appl Environ Microbiol. 2021;87:e02593–20.
6. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25:1043–55.
7. Levin D, Raab N, Pinto Y, Rothschild D, Zanir G, Godneva A, et al. Diversity and functional landscapes in the microbiota of animals in the wild. Science. 2021;372:eabb5352.
8. Zhou Z, Tran PQ, Breister AM, Liu Y, Kieft K, Cowley ES, et al. METABOLIC: high-throughput profiling of microbial genomes for functional traits, metabolism, biogeochemistry, and community-scale functional networks. Microbiome. 2022;10:33.
9. Parks DH, Chuvochina M, Rinke C, Mussig AJ, Chaumeil P-A, Hugenholtz P. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genome-based taxonomy. Nucleic Acids Res. 2022;50:D785–94.
10. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.
11. Acman M, van Dorp L, Santini JM, Balloux F. Large-scale network analysis captures biological features of bacterial plasmids. Nat Commun. 2020;11:2452.
12. Kalmar L, Gupta S, Kean IRL, Ba X, Hadjirin N, Lay EM, et al. HAM-ART: an optimised culture-free Hi-C metagenomics pipeline for tracking antimicrobial resistance genes in complex microbial communities. PLoS Genet. 2022;18:e1009776.
13. McMurdie PJ, Holmes S. Waste not, want not: why rarefying microbiome data is inadmissible. PLoS Comput Biol. 2014;10:e1003531.
14. Kitts PA, Church DM, Thibaud-Nissen F, Choi J, Hem V, Sapojnikov V, et al. Assembly: a resource for assembled genomes at NCBI. Nucleic Acids Res. 2016;44:D73–80.
15. Olm MR, Crits-Christoph A, Bouma-Gregson K, Firek BA, Morowitz MJ, Banfield JF. inStrain profiles population microdiversity from metagenomic data and sensitively detects shared microbial strains. Nat Biotechnol. 2021;39:727–36.
16. Olm MR, Brown CT, Brooks B, Banfield JF. dRep: a tool for fast and accurate genomic comparisons that enables improved genome recovery from metagenomes through de-replication. ISME J. 2017;11:2864–8.
17. Bates D, Sarkar D, Bates MD, Matrix L. The lme4 package. R package version. 2007;2:74.
The authors acknowledge the H2020 project Holofood (Grant No. 817729), the Danish National Research Foundation award DNRF143 ‘A Center for Evolutionary Hologenomics’, and the Carlsberg Foundation grant CF20-0460.
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Eisenhofer, R., Odriozola, I. & Alberdi, A. Impact of microbial genome completeness on metagenomic functional inference. ISME COMMUN. 3, 12 (2023). https://doi.org/10.1038/s43705-023-00221-z