Trimethoprim (TMP) is a bacteriostatic antibiotic of the diaminopyrimidine class. Of synthetic origin, as opposed to naturally derived antibiotics such as penicillins, it inhibits the growth of a range of aerobic bacteria including several Gram-positive and -negative pathogens [1]. First introduced in the 1960s [2], TMP is mainly employed in the treatment of urinary tract infections [3] as a single medication or in combination with sulfamethoxazole (co-trimoxazole) to treat a wide range of bacterial infections. In susceptible bacteria, TMP inhibits the enzymatic reduction of dihydrofolate to tetrahydrofolate by the bacterial FolA dihydrofolate reductase (DHFR) [1]. This in turn disrupts DNA synthesis through the suppression of purine and pyrimidine production [4], thus abrogating microbial proliferation. TMP and co-trimoxazole are included in the recent list of essential medicines maintained by the World Health Organization [5].

According to recent studies, TMP resistance is common in urinary tract infections caused by, for example, Escherichia coli or Klebsiella spp. [6]. The observed prevalence of resistance may reflect the selective pressure due to frequent TMP administration [7]. However, other factors can promote the prevalence of TMP resistance, such as co-selection [8] in cases where resistance genes targeted at other antibiotics are found on the same genetic element such as transposons [9] or multi-resistance plasmids [2]. An increase in the abundance of TMP resistance in clinically relevant pathogens has been identified by several studies [10, 11] and a reversal of resistance through the sole reduction of TMP administration is unlikely to be achievable [12].

At the mechanistic level, resistance may be due to mutations in the primary folA gene resulting in reduced TMP susceptibility [13], overproduction of such enzymes, selection toward a limited uptake or binding of TMP [14], or the action of efflux pumps [15]. The most important mechanism, however, appears to be the acquisition of type A dihydrofolate reductases (DfrA) which represent TMP-resistant variants of FolA [16, 17]. Unlike the primary FolA enzyme found in wild-type populations, DfrA variants allow bacterial cells to sustain DNA synthesis in the presence of moderate TMP levels [18]. Genes coding for DfrA were found on mobile genetic elements as early as the 1970s [9, 19, 20] and the clinical relevance of horizontal transmission of dfrA has been confirmed by case reports [21]. The dfrA genes emerged from both recent mutations in TMP-sensitive folA genes and pre-existing TMP-resistant folA genes, followed by mobilization in the resistome [17]. The CARD database [22] currently lists 57 members of the dfr gene family coding for TMP-resistant dihydrofolate reductases, many of which have been found on plasmids and within integron gene cassettes [23,24,25]. Most of these genes fall into the dfrA sub-family.

The family of type B dihydrofolate reductases (DfrB) provides an alternative mechanism for TMP resistance. The dfrB genes share no significant sequence nor structural similarities with FolA and DfrA, indicative of a distinct evolutionary origin for this family [26, 27]. Although dfrB were originally identified in clinical samples [28, 29], their prevalence is currently unknown, as they have not been routinely searched for whether by PCR methods in earlier decades or more recently by genomic identification [6]. Hence, the limited amount of available data on the distribution of dfrB genes has impeded our understanding of their emergence.

Only recently, two new variants of the dfrB family, dfrB10 and dfrB11 have been discovered and confirmed in their provision of high-level phenotypic TMP resistance [28]. Since then, ten additional variants labeled dfrB12 to dfrB21 have been identified in genome sequences of isolates and metagenomes based on homology, followed by experimental validation [30]. Some of these dfrB sequences originate from environmental samples where TMP exposure is supposed to be marginal. This gave rise to the hypothesis that the original selective advantage conferred by dfrB genes may be unrelated to TMP resistance [30]. However, this hypothesis was based on identification of dfrB sequences in few samples, which is why the possibility of contamination effects [31,32,33] cannot be excluded. Verification of the hypothesis was further hampered by data heterogeneity with regard to the origin of samples, the source of DNA (isolates vs. metagenomes), and sparse metainformation.

In this study, we provide the first large-scale, statistically verified analysis of the prevalence of this emerging source of high TMP resistance by evaluating shotgun metagenomic sequences of both freshwater and wastewater environments (324 samples, 7 ×109 high-quality sequences). We provide quantitative estimates on the abundance of dfrB genes in aquatic environments with and without strong human impact along with indicators of gene mobility. By comparing the distribution of particular gene variants across contrasting environments, our work contributes to the identification of the origin of dfrB which is a key to understanding the epidemiology of novel TMP resistance. In particular, our study highlights the possible role of environmental bacteria in the emergence of resistance to synthetic antimicrobials where evolutionary adaptation to naturally produced analogues did not take place.

Materials and methods

Origin and characteristics of the analyzed metagenomes

For this study, we analyzed publicly available metagenomic DNA sequences of freshwater and wastewater bacterial communities downloaded from SRA (sequence read archive; The picked sequences exclusively represent DNA fragments generated by the shotgun technique which does not involve amplification of particular targets, that is, the “selection” attribute of the sequencing library had to be “random”. For consistent processing and comparability, only samples that were sequenced on an Illumina instrument in paired-end layout were included. The latter integrate wastewater and freshwater samples from the temperate zone such that, in each category, at least five different countries are represented. The analyzed data consist of 324 distinct samples from the US, UK, New Zealand, China, Canada, Sweden, and Germany (Table 1). Full accession and metainformation is provided in the supplement Table S1. In total, the data comprise about 7 × 109 high quality read pairs. Freshwater samples were subdivided into two categories (unpolluted, polluted) based on the exposure of the respective sampling sites to effluents from municipal wastewater treatment plant effluents where sufficient on-site information was available (dataset “q” in Table 1). A further distinction was made between water and sediment samples.

Table 1 Characteristics of the analyzed data sets.


Processing of raw sequence data and short read analyses

All samples were processed through the same bioinformatics pipeline involving the simultaneous removal of purely technical sequences and the suppression of low quality reads using Trim Galore [34] (requested phred score: 28, minimum read length: 100 bp). Paired-end reads were subsequently merged with pandaseq [35] employing the default configuration to obtain sequences with a typical average length of 262 bp (see Table 1 for variation). Using BLASTN (, the merged sequences were aligned against a collection of the currently known dfrB sequences (Table S2) and against the resfinder [36] database of acquired antibiotic resistance genes available from the Center for Genomic Epidemiology ( Only high quality alignments were retained by requesting a minimum match length of 50 bp and a sequence identity ≥95%. The merged reads were further processed through metaxa2 [37] to analyze the composition of the bacterial community based on 16 S rRNA gene fragments. Reads being tested positive for dfrB were also scanned for the signatures of integrons using the IntegronFinder [38] application provided by the Pasteur Institute, France, as a galaxy-based web service ( The same reads were also aligned against the ISfinder [39] database of insertion sequences maintained by the Laboratoire de Microbiologie et Génétique Moléculaires, Toulouse, France, using their web service at

Assembly and analysis of the flanking regions of dfrB genes

Because of the limited read length, it is difficult to identify the genetic context of dfrB genes exclusively from original short read data. Consequently, gene context analyses call for metagenome assembly which is challenging, not least due to the exceptionally high demand for computer memory (RAM). Even if the assembly succeeds technically, a substantial extension of the sequences of primary interest is not necessarily guaranteed. This is especially so for low-coverage metagenomes of highly diverse microbial communities [40]. Consequently, we implemented an algorithm that specifically targets the assembly of the flanking regions of dfrB-like sequences. Briefly, a collection of the reads giving specific hits for dfrB variants was built by an initial run of BLASTN. Those reads served as “seeds” in a subsequent process of iterative sequence extension. In each iteration, the algorithm searches the remaining pool of yet “unused” reads for candidates allowing for a plausible extension of the seed sequence (in iteration 1) or the outcome of earlier iterations according to sequence similarity in overlapping parts (min. 50 bp overlap). The algorithm relies on BLASTN as the workhorse for sequence alignments and, due to the targeted focus on specific seeds, RAM usage remains very low. The source code of the seed-based assembler was made publicly available at together with basic documentation and a minimum working example. Overall, our seed-based approach turned out to be very close to the one implemented in GenSeed [41], a software which we only detected later.

From the assembled contigs, the seeds (i.e., the dfrB) were finally removed and the remaining flanking sequences were fed into BLASTX ( to identify the genetic context of dfrB at protein level. The flanking sequences were also scanned for insertion sequences as outlined for the original short reads. Very short assemblies of <250 bp after removal of the dfrB seed sequence were excluded from downstream analysis.

Statistical analysis

General methods

Statistical data analysis and visualization was performed in R [42] version 4.2.1. Principal component analysis was performed in base R (prcomp) using power-transformed relative abundances as input (power 0.2). Shifts in locations were assessed by means of the non-parametric Wilcoxon rank sum test (wilcox.test). The significance of differences in odds ratios was analyzed with Fisher’s exact test (fisher.test). Finally, confidence of relative abundance estimates were obtained from an exact binomial test (binom.test). Whenever multiple, logically connected hypotheses were tested simultaneously, the corresponding p values were adjusted to control the false discovery rate [43].

Correlations between ARGs and 16S rRNA-based species markers

Demonstrating statistical associations between ARG abundances and bacterial community composition is difficult when sample sizes are unequal. The intuitive solution is to employ normalization so as to achieve comparable counts of ARGs and taxonomic markers across samples, that is, the counts are divided by the total number of 16 S rRNA gene copies in the respective sample. While such normalization is very common, it is often overlooked that the obtained relative abundances are unsuitable for correlation analysis [44]. Unfortunately, valid alternatives to correlation like, for example, log ratio analyses, are not applicable either because the number of detected ARG copies is zero in a notable fraction of samples. In view of this challenge, we performed correlation analyses on the original count data (i.e., without normalization) using a downsampling approach to account for unequal sample sizes. Specifically, we identified the 10% quantile of 16 S rRNA gene copies across the considered samples (Q10). Samples with fewer than Q10 copies were subsequently discarded. For the remaining samples, we studied the correlation between ARG counts and 16 S rRNA gene counts attributed to a certain bacterial group after random-driven truncation to the common length Q10. The process was repeated 250 times and the median of the correlation coefficient (Spearman’s rho) was finally evaluated.


Unequal distribution of dfrB gene variants in river and wastewater

A principal component analysis suggests the separation of the data into two major subsets based on the abundance of distinct dfrB variants (Fig. 1). In particular, the established gene variants with smaller numeric indices (dfrB1 to dfrB7), that were the first to be identified, appear to be associated with wastewater while the recently discovered variants rather cluster with river samples. However, the underlying data structure appears to be more complex since only about 25% of the total variance is resolved by the first two principal components.

Fig. 1: Principal component analysis of the relative abundance of dfrB gene variants.
figure 1

Dots indicate individual samples colored by their origin. Integers represent gene variants (e.g., ‘10’ denotes dfrB10). See Fig. S1 for the corresponding scree plot.

Another graphical representation of the distribution of dfrB gene variants across different groups of samples exhibit a number of characteristic patterns (Fig. 2). Whereas some variants were primarily detected in wastewater-borne samples (e.g., dfrB1, dfrB2, dfrB4, dfrB7) others were found primarily or even exclusively in surface water and sediment samples (dfrB9, dfrB10, dfrB13, and dfrB19). Four out of the 20 known variants (dfrB6, dfrB11, dfrB12, dfrB18) could not be recovered from any of the screened metagenomic DNA reads based on the criterion of 95% sequence identity. For the recovered gene variants, estimated relative abundances frequently exceeded 10−5 copies × (16 S rRNA gene copies)−1 and four variants (dfrB3, dfrB9, dfrB10, dfrB14) occurred in relative abundances >10−4.

Fig. 2: Relative abundance of dfrB gene variants (copies/16 S rRNA gene copies) in distinct subsets of samples.
figure 2

Dot size encodes the width of the 95% confidence interval around empirical estimates according to a binomial model (high: <0.5 log units, moderate: <1 log unit, low & very low: >1 and >2 log units, respectively). Capital letters encode countries, superscripts denote studies according to Table 1.

The estimated relative abundances of some dfrB variants are subject to uncertainty as it is common for rare resistance genes (Fig. 2). In particular, the failure to detect any dfrB gene copies in some individual small-sized data sets is most likely a result of undersampling given the limited sensitivity of shotgun metagenomics as compared to quantitative PCR-based approaches. Nevertheless, for all of the groups of samples distinguished in Table 1 and Fig. 2, the number of 16 S rRNA gene copies used as a reference exceeds 1 × 106 such that reported relative abundances are unlikely to suffer from systematic bias (see Fig. S2 for details).

When the data are broadly divided by compartments into just two subsets (wastewater vs. river) many of the contrasts visible in Figs. 1 and 2 pass the threshold of statistical significance (Table 2). This applies to all gene variants which were more frequently found in wastewater communities, that is, dfrB1 to dfrB7. Likewise, the elevated prevalence in environmental samples was confirmed for dfrB9, dfrB10, and dfrB13. To exclude the possibility that the statistics are generally flawed due to global undersampling, we performed significance tests also for split samples. For all but two gene variants (dfrB5, dfrB13), the reported contrasts in relative abundance between wastewater and river-borne samples were consistently detected in data subsets, each representing 50% of the total analyzed material (Table 2).

Table 2 Comparison of the relative abundance of dfrB gene variants (copies per 16 S rRNA gene copies) in WWTP effluents and river samples.

Statistical association of dfrB with bacterial community composition

Using the subsampling technique outlined in the methods section, we scanned for empirical associations between the frequency of dfrB gene variants and the abundance of taxonomically defined bacterial groups (orders). The analysis was carried out separately for river-borne samples and for samples of treated and untreated wastewater. For the river-borne samples, no statistically significant association between the abundance of the predominant gene variants dfrB9 or dfrB10 and the abundance of bacterial orders could be identified (Spearman’s rho rarely exceeding 0.15; all p values > 0.1). In wastewater samples, however, the abundance of dfrB3 was found to be most closely associated with five orders of Gram-negative bacteria (Aeromonadales, Chromatiales, Pseudomonadales, Enterobacterales, Campylobacterales) all but the last belonging to the class of γ-Proteobacteria. The corresponding rank correlation coefficients (Spearman’s rho) reached values in range 0.5–0.8 (p < 0.05 each).

Physical association of dfr genes with other genetic markers

Analysis of the original short reads

The prospects of full metagenome assembly are limited in the case of short-read environmental samples subject to high microbial diversity [40] while the computational effort is very high. In particular, the probability of detecting ARGs and species markers (such as variable regions of the 16 S rRNA gene) on the same contig are very low and the possibility of assembly errors may limit the confidence in the validity of metagenome-assembled genomes [45]. Consequently, in a first step, we analyzed the immediate genetic neighborhood of the dfr fragments on the original (merged) reads. In particular, we scanned all reads containing signatures of dfr for the presence of other ARGs which would indicate embedding of the former in resistance gene cassette arrays.

The vast majority of reads where dfrB occurred in direct proximity of other ARGs originated from wastewater samples (61 out of 65; Table 3). By contrast, only a minor fraction (4/65) of the reads containing dfrB and another ARG was attributed to river-borne samples. Considering the total number of dfrB gene copies in both groups (wastewater: 357, river: 550), the likelihood of dfrB being found in neighborship with other ARGs was significantly higher in wastewater samples as compared to river samples (OR 0.036 [0.0093–0.097], p < 10−15, Fisher’s exact test, see Table S3). The majority of such co-occurrences (16 of the 21) was due to the variants dfrB1, dfrB3, and dfrB4 being found in direct proximity of ant genes coding for aminoglycoside resistance.

Table 3 Co-occurrence of dfrB and dfrA gene fragments with signatures of other antibiotic resistance genes on one and the same read (see Table 1 for average read lengths).

For the purpose of comparison, we performed the same analysis for the dfrA gene family which is responsible for moderate TMP resistance. The overall results were very similar to those obtained for dfrB (Table 3, Table S4). Specifically, dfrA was way more likely to be found in neighborship with other ARGs in samples of wastewater as compared to river samples (OR 0.11 [0.070–0.18, p < 10−15). Likewise, dfrA was most commonly associated with genes conferring resistance to aminoglycosides, namely aadA and ant.

The alignment of dfrB-positive short reads against databases of mobile genetic elements yielded a similar outcome as the analysis of resistance gene co-occurrences. In total, six short reads were identified which harbored both a sequence of dfrB and a cluster of attC recombination sites lacking an integron-integrase. The latter are commonly referred to as CALINs [46] representing degraded or incompletely sequenced integrons [30]. All of these reads originate from samples of treated or untreated wastewater and five out of six cases are attributable to the variants dfrB2 and dfrB3 (Table S5).

According to insertion sequence analysis, dfrB signatures were almost exclusively found in proximity to a single type of transposable element, TnAs3 (Table S6), originally discovered in the fish pathogen Aeromonas salmonicida. The apparent primary association of dfrB with this element is of particular relevance with regard to mobility. According to a recent study [47], an estimated proportion of 70% of TnAs3 have integrons within their genetic context of 10 ORF and they belong to the most abundant insertion sequences found on plasmids. In our dataset, most of the reads giving a simultaneous hit for both dfrB and TnAs3 originated from wastewater samples (85 out of 90 cases; Table S6) with dfrB3 again being predominant. We did not observe a single case of TnAs3 in direct proximity of a dfrB with a numeric index >7.

Analysis of the assembled flanking regions of dfrB genes

The analysis of the assembled flanking regions of dfrB genes overall confirms the outcome of the above evaluations based on short reads (Table 4 and S7). For the gene variants dfrB1 to dfrB7, a total of 33 flanking regions with a length >250 bp were recovered (length range 259–1228 bp, excluding the dfrB itself). In 27 of 33 cases (80%), significant BLASTX alignments were obtained for either integron integrases (8 cases) or resistance determinants (19 cases), including beta-lactamases, enzymes mediating aminoglycoside resistance, and multi-drug efflux pumps. The clear majority of those flanking regions (31/33) were recovered from wastewater-borne datasets.

Table 4 Analysis of the flanking regions of dfrB1 to dfrB7 (left panel) and dfrB variants with indices >8 (right panel).

By contrast, integron integrases or ARGs were not identified in any of the 30 assembled flanking regions of dfrB9 (5 cases), dfrB10 (22 cases), or variants with a greater numeric index (3 cases). All but one of these assemblies were recovered from river-borne samples. If conserved domains were detectable at all, they were generally indicative of enzymes not directly related to antimicrobial resistance or ARG mobility. For dfrB10, our assemblies suggest a number of typical genetic framings as flanks with similar contents were recovered from multiple independent samples (Table S7). Namely xanthine dehydrogenases, phosphotransferases, carboxymethyltransferases, as well as two different components of toxin-antitoxin (TA) systems were observed in multiple contigs originating from distinct samples.

The observed differences in potential mobility of dfrB genes as inferred from the detection of integron integrases or ARG in the assembled flanking regions are highly significant. The fact that dfrB genes being strongly associated with wastewater metagenomes (dfrB1 to dfrB7) frequently showed indications of mobility (27 of 33 cases) while other dfrB variants associated with river metagenomes did not (0 of 30 cases) is very unlikely to be by chance (OR > 23, p < 10−11; Fisher’s exact test). Strong significance (OR > 5.3, p < 10−6) remains even if the presence of TA systems in five of the flanking regions of dfrB10 is generally counted as an indication of mobility, based on the historical notion that TA systems are often plasmid-encoded, which is not necessarily the case [48, 49].


Prevalence and hosts of the “classical” clinical dfrB gene variants

Our data suggest that treated wastewater is a primary source of the “classical” variants of dfrB with numeric indices in the range of 1–5 but also 7, which comprise the earliest dfrB identified [29, 50, 51]. The relative abundance in wastewater-borne metagenomes was significantly increased for all of those variants as compared to freshwater-borne metagenomes. Some of the variants (dfrB5, dfrB7) were exclusively recovered from wastewater samples, a fact that should not be overinterpreted, however, considering the limited sensitivity of metagenomics as compared to PCR in the detection of low abundance genes [52]. Overall, the analyzed dataset supports the notion that the classical variants of dfrB are primarily harbored by human gut-associated bacteria.

In correlation analysis, members of the phylum γ-Proteobacteria, including, for example, Aeromonadales and Pseudomonadales, emerged as the most likely candidate host of one of the gene variants, dfrB3. As usual, however, correlation provides indications rather than rigorous evidence. In particular, the identification of definite candidate hosts is naturally limited by collinearities, arising from the natural structure of microbial communities. The sharing of common niches or the formation of metabolic networks, for example, inevitably lead to substantial correlations between distinct bacterial groups which may not be disentangled in host identification analysis. Nevertheless, the proposed primary association of the classical dfrB variants with γ-Proteobacteria is very plausible, as it perfectly agrees with the outcome of whole genome-based studies [28].

In the analysis of the original short read data, most of the classical dfrB variants, especially dfrB1, dfrB3, and dfrB4, were also found in direct proximity to other ARGs (Table 3). Considering the type of the linked ARGs, our findings corroborate the outcome of earlier WGS-based studies according to which the earliest reported dfrB genes are mostly (but not exclusively) associated with genes mediating resistance to aminoglycosides [28, 53]. In this respect, these dfrB variants are similar to genes of the dfrA family. The rare linkage of dfrB with genes mediating resistance to phenicol, betalactam (blaOXA), and rifampicin (arr) found in the analyzed metagenomes is also consistent with former WGS-based findings [28]. The direct neighborship of dfrB1 to dfrB7 to other ARGs was further confirmed by the evaluation of the assembled flanking regions. In addition to enzymes providing resistance against aminoglycosides and betalactams, multi-drug efflux pumps were frequently detected in proximity to the dihydrofolate reductases (Table 4).

The frequent co-occurrence of the classical dfrB variants with other ARGs suggests that a notable percentage of the gene copies is embedded in resistance cassette arrays as has been demonstrated by WGS analysis [28, 29, 53]. It can further be hypothesized that the respective dfrB genes are potentially part of mobile genetic elements like integrons and transposons facilitating their proliferation, including plasmid-based horizontal transfer. The alignment of dfrB-positive short reads against integron and insertion sequences databases strongly supports this hypothesis of mobility. Specifically, all of the 90 cases where a dfrB gene was found in direct proximity to an insertion sequence, most typically TnAs3, were attributable to the gene variants dfrB1 to dfrB7 (Table S6). Similarly, integron signatures were almost exclusively found in neighborship with these wastewater associated dfrB variants (Tables 4 and S5).

Occurrence of recently discovered dfrB genes

The recently described gene variants, especially dfrB9 to dfrB21, were discovered in a heterogeneous set of sequences comprising both bacterial isolate genomes [28] and metagenomes of environmental origin [30]. However, owing to the limited number of dfrB-positive sequences studied so far, it was previously impossible to infer quantitative information on the occurrence and distribution of these gene variants. Based on consistently processed metagenomic datasets, our study provides quantitative information on the abundance of dfrB9 to dfrB21 in complex environmental communities, which is a key to uncovering possible anthropogenic impacts.

According to our analysis, the variants dfrB9, dfrB10, dfrB13, and dfrB14 were shown to be significantly more prevalent in freshwater environments than in wastewater. Hence, very likely, these gene variants are primarily hosted by environmental bacteria which do not play a predominant role in municipal wastewater systems. This finding is consistent with the observation that many of the genes mediating clinically relevant antibiotic resistance have their origin in environmental bacteria [54] which has now been confirmed for a broad variety of genes and hosts [55,56,57]. However, the original role of many ancestral antibiotic resistance genes is still unknown and different factors may have shaped their evolution in environmental bacteria [58, 59]. From the anthropocentric perspective, the resistance trait is often at the center of interest and the presence of ARGs in antibiotic producers or their interspecific competitors is indeed expected from an ecological point of view. However, in the particular case of TMP, no natural analogues are known to date. Hence, the original selective advantage associated with environmental dfrB genes like, for instance, dfrB9 and dfrB10 is unrelated to self-resistance or microbial chemical warfare. The ecological advantage leading to the maintenance of these genes is yet to be identified. For example, the encoded DfrB could exhibit a favorable metabolic costs profile in the host compared to alternatives dihydrofolate reductases like the ubiquitous FolA. However, the dfrB genes could also be subject to selection by yet unknown factors unrelated to dihydrofolate reduction.

So far, information on actual bacterial hosts of dfrB genes is largely confined to the variants dfrB1 to dfrB5 and mostly pertains to genomes of human-related isolates [28]. The variant dfrB10 was originally discovered in a typical soil bacterium, Pseudomonas putida, involved in the infection of a patient in China [60]. Beyond that, we are not aware of successful attempts to disclose the bacterial hosts of, for example, dfrB9 to dfrB21 in environmental communities which inspired us to screen for statistical correlations between dfrB abundance and community composition. However, even the most prominent environmental gene variant in our data set, dfrB10, was not significantly associated with any bacterial order such that no host candidates could be identified. This could reflect the presence of dfrB10 in a variety of species due to horizontal transfer of, for example, plasmid-borne instances or the conservation over a long evolutionary history. However, a lack of close statistical associations would also be observed if dfrB10 was actually linked to a primary host which is just too rare to be adequately represented in communities inferred from amplification-free shotgun metagenomics.

According to our analyses, the genetic context of the recently described dfrB genes appears to deviate substantially from the one identified for dfrB1 to dfrB7. For example, none of the sequences hosting dfrB9 to dfrB21 was associated to insertion sequences, neither in the analysis of short reads, nor in the evaluation of assembled flanks. Similarly, according to our data set, genotypic multiple-resistance involving dfrB9 to dfrB21 appears to be a rare exception as it was observed in a single metagenome only where, as opposed to the general trend, dfrB10 was recovered from a wastewater sample. These findings agree with genetic context analyses of environmental sequences in which dfrB11 to dfrB21 were originally identified [30]. We conclude from the available direct and indirect evidence that, as opposed to dfrB1 - dfrB7, the recently discovered dfrB variants, are less likely associated with mobile elements which are recognized as major facilitators of the spread of clinically relevant antimicrobial resistance [61]. Nevertheless, this must not be misinterpreted in the sense that the newly identified dfrB variants are generally immobile and thus pose a lower risk compared to other TMP resistance genes. In particular, dfrB10 was originally discovered on a plasmid [60] and our data set demonstrates rare cases of combined TMP-aminoglycoside resistance involving dfrB10. Also dfrB9 was earlier detected within a plasmid-based integron in a clinical isolate classified as Enterobacteriaceae and other novel dfrB variants were found in the context of CALINs previously [30] (dfrB12, dfrB19) or in the present study (dfrB15). Hence, clear indications exist for actual and potential mobility of the recently discovered dfrB variants. Consequently, we have to expect that, under ambient conditions selecting for TMP resistance, these variants can undergo a fast spread by horizontal gene transfer in addition to vertical proliferation.

Besides emissions from sewage disposal, many surface waters receive additional anthropogenic inputs of bacteria linked to livestock farming and runoff from organically fertilized soils [62]. We cannot exclude that the occurrence of the environment-associated variants like dfrB9 or dfrB10 reflects such agricultural inputs. However, our data do not comprise any evidence for such a hypothesis. First, if the detection of the mentioned dfrB variants was reflecting emissions from livestock farming under TMP-selective conditions, we would expect them to commonly appear in neighborship with other resistance genes or mobile elements just like their wastewater-borne counterparts dfrB1 to dfrB7. This was not the case. Moreover, copies of dfrB10, dfrB13, dfrB16, and dfrB19 were also recovered from sequences which are very unlikely to be impacted by intensive farming. This applies in particular to the river sediment metagenomes from the US-based Great Smoky Mountains national park and unpolluted sites considered in one the datasets from Germany. Finally, if the occurrences of dfrB genes in river-borne samples was determined by contamination from external sources, we would expect to find the genes preferably in the water phase mediating the transport. However, in the most comprehensive dataset (label “q” in Table 1) which comprises water and sediment-borne samples, we rather see the opposite trend (Fig. 3A). The relative abundance of dfrB genes is significantly higher in sediment-borne DNA compared to the water-borne DNA at both polluted and unpolluted sampling sites. This is in contrast to dfrA where the higher relative abundance in the water phase as compared to sediment reflects a contamination of sampling sites exposed to WWTP effluents (green boxes Fig. 3B).

Fig. 3: Relative abundance of dfrB and dfrA genes aggregated at family level in samples from the Lockwitzbach River Basin, Germany (study label “q” in Table 1).
figure 3

Panel A refers to the dfrB gene and panel B to dfrA. The p values refer to a shift in location between water and sediment-borne samples (Wilcoxon rank sum test). Note the custom y-axis resulting from power transformation with an exponent of 0.2.

Implications for AMR research and management

Our findings suggest that the classical dfrB gene variants like dfrB1 or dfrB4 are established in human-associated bacterial communities as signaled by their primary association with wastewater metagenomes. Hence, together with members of the dfrA family, these dfrB variants appear to actively contribute to contemporary, clinically relevant bacterial resistance against TMP. By contrast, gene variants like dfrB9, dfrB10, or dfrB13 were rarely detected in an anthropogenic context, yet they appear to be present in many freshwater systems around the world. The respective aquatic bacterial communities thus represent a reservoir of additional TMP resistance that is potentially mobilizable and could further compromise the treatment of bacterial infections in the future.

The full spectrum of dfrB genes present in environmental bacterial communities is yet to be disclosed. A list of 20 new candidate dfrB genes was released only recently [30] and the metagenomes screened in this study suggest a considerable potential for new gene discoveries as well. This can be demonstrated by comparing the abundance of the presently known dfrB to the abundance of dfrB-like sequences satisfying a relaxed criterion of sequence identity (Fig. 4). Especially river-borne community DNA contains a notable amount of additional dfrB-like sequences, some of which may turn out to be novel TMP resistance genes in future analyses.

Fig. 4: Effect of the chosen similarity threshold on the relative abundance of dfrB-like sequences identified in datasets of different origin.
figure 4

Bars represent averages, whiskers illustrate maximum values.

From a research-centered perspective, we propose to work toward a more comprehensive understanding of the distribution of dfrB across ecosystems or geographic regions. Ideally, the respective samples would be acquired, processed, and sequenced according to common protocols so as to exclude methodology-borne biases in comparisons between data subsets. Standardization would need to cover, for example, DNA extraction protocols, library preparation, read length, or sequencing depth and should also consider sample barcoding schemes that reduce the chance of misclassification (e.g., through unique dual indexing). Metagenomic sequencing should further be complemented by a more sensitive quantification of dfrB via quantitative or digital PCR to better discriminate actually low gene abundances from undersampling effects. Likewise, we suggest fostering the identification of the respective bacterial hosts and the genetic framing to improve our understanding of the evolutionary trajectories of the different dfrB variants as well as possible pathways of past and future proliferation. The assembly of long-read metagenomic sequences [63, 64] might be a viable approach toward improved host characterization but it certainly remains challenging in view of the low relative abundance of dfrB and the substantial diversity of natural bacterial communities. Plasmid-capturing [65] and subsequent sequencing would be a possible means to specifically target dfrB genes of high lateral mobility. Our study demonstrates that the prevalence of antimicrobial resistance genes must be examined at the level of particular gene variants. A more aggregated analysis at the dfrB gene family level, for instance, would not have resolved the existing contrasts in gene distributions.

Overall, our study highlights the possibility of omnipresent environmental genes being promoted into a threat for human society. Thus, the concentrations of antibiotics to which environmental communities are exposed must be held below selective thresholds [66] not only to control the spread of established ARGs but also to prevent the rise of yet unknown genes hidden in environmental communities. Our study demonstrates that a promotion of such unknown resistance genes can be triggered even by synthetic drugs, like TMP, where natural analogs had not been an original selection factor.