A comprehensive evaluation of every patient with a bloodstream infection includes an attempt to identify the infectious source. Pathogens can originate from various places, such as the gut microbiota, skin and the external environment. Identifying the definitive origin of an infection would enable precise interventions focused on management of the source1,2. Unfortunately, hospital infection control practices are often informed by assumptions about the source of various specific pathogens; if these assumptions are incorrect, they lead to interventions that do not decrease pathogen exposure3. Here, we develop and apply a streamlined bioinformatic tool, named StrainSifter, to match bloodstream pathogens precisely to a candidate source. We then leverage this approach to interrogate the gut microbiota as a potential reservoir of bloodstream pathogens in a cohort of hematopoietic cell transplantation recipients. We find that patients with Escherichia coli and Klebsiella pneumoniae bloodstream infections have concomitant gut colonization with these organisms, suggesting that the gut may be a source of these infections. We also find cases where typically nonenteric pathogens, such as Pseudomonas aeruginosa and Staphylococcus epidermidis, are found in the gut microbiota, thereby challenging the existing informal dogma of these infections originating from environmental or skin sources. Thus, we present an approach to distinguish the source of various bloodstream infections, which may facilitate more accurate tracking and prevention of hospital-acquired infections.
Clinical management of infection involves the evaluation and elimination of infectious sources. Epidemiologically, bloodstream infections (BSIs) are common in hospitalized patients and contribute substantially to patient morbidity and mortality4. Thus, identifying the source of BSIs is critical in both clinical care and hospital epidemiology. BSIs are particularly common in immunocompromised patients who are hospitalized for extended periods of time, such as hematopoietic cell transplantation (HCT) recipients5,6,7. Primary BSIs with enteric organisms often arise as a result of translocation from the intestinal microbial reservoir across a damaged gastrointestinal barrier into the bloodstream8. By contrast, nonenteric commensal and environmental bacteria can access the bloodstream through intravenous lines and sites where skin epithelial integrity has been compromised. Existing methods for identifying the origins of BSIs in HCT patients include pulsed-field gel electrophoresis and multilocus sequence typing (MLST)9,10. Although rapid, affordable and standardized across many organisms, these methods are not ideal for distinguishing bacterial strains. Yet, microbial pathogenicity and transmission depend in part on strain-level variability, as different strains of the same species can vary widely in their ability to cause disease11,12. Whole-genome sequencing (WGS) has facilitated the exploration of strain-level determinants of virulence and has enabled precise tracking of pathogens11,13,14.
Although comparisons of strain genomes have primarily been performed on bacterial isolates, newer computational tools (such as metaSNV, MIDAS and StrainPhlAn)15,16,17 profile strain variation between metagenomes. These careful strain-level analyses allow us to understand when and how bacteria are transmitted and how they may change over time. However, bioinformatic tools have not been developed for identifying specific sources of infection by comparing disease-causing bacterial isolates to complex microbiome samples, such as human stool. In this work, we present StrainSifter, a bioinformatics pipeline for matching pathogens to potential sources. We then apply this tool to compare bacterial strains between the gut and the bloodstream in HCT patients, with the goal of better understanding the origin of BSIs in this population.
We performed a retrospective cohort study of autologous and allogeneic HCT recipients at Stanford University Hospital (CA, USA). We carried out weekly stool sampling for all subjects who consented to a tissue biobanking protocol between 5 October 2015 and 9 June 2017. We included patients if a stool sample had been collected in the 30 days preceding an episode of BSI and if a bloodstream isolate meeting standard BSI criteria had also been saved18. Thirty patients (32 bloodstream isolates) met these criteria. We sequenced all bloodstream isolates as well as stool samples (n = 82) collected between 60 days before and 31 days after the date of the BSI. Clinical characteristics of the cohort are listed in Table 1 (individual patient data are shown in Supplementary Table 1); select antibiotics and total parenteral nutrition use were identified within 30 days prior to BSI.
We sequenced a median of two stool samples per patient (range: 1–8), collected a median of 9 days prior to BSI (range: −58 to +31) (Supplementary Fig. 1; read counts in Supplementary Tables 2 and 3). Stool sequence data were taxonomically classified using the One Codex platform19. We observe the BSI species in the gut at a threshold of ≥0.1% relative abundance for 15 of 32 (47%) unique organisms, 10 of which are of expected enteric origin (8 typically intestinal and 2 typically oral). One patient developed a BSI with two species, both of which are present in the stool above the threshold level (Supplementary Table 4; full taxonomic classifications in Supplementary Table 5).
We next investigated whether BSI organisms are present at a higher relative abundance in the gut prior to infection, as has been reported20,21. Of the 15 BSIs in which the organism was detected in the stool, we observe intestinal dominance by the BSI pathogen in two instances (Fig. 1). In both cases, the BSI species are expected to be enteric in origin (E. coli from patient 3; Enterococcus faecium from patient 25) (Fig. 1 and Supplementary Table 4). By contrast, other enteric bacteria are present at low relative abundance (K. pneumoniae and Enterobacter cloacae from patient 2 at 2.8% and 0.6% relative abundance, respectively) (Fig. 1 and Supplementary Table 4). All typically nonenteric organisms are either present at low relative abundance (0.01–2%; P. aeruginosa from patient 19, S. epidermidis from patient 13) or not detected in the gut prior to BSI (Staphylococcus aureus from multiple patients) (Fig. 1 and Supplementary Table 4). In the stool samples of several patients, we observe a high relative abundance of candidate pathogens that did not cause BSI in those individuals. Specifically, patient 14 experienced a K. pneumoniae BSI, yet stool samples at two time points are dominated by other potential pathogens: E. coli at 64% relative abundance 9 days prior to BSI and E. faecium at 82% relative abundance 19 days after BSI (Supplementary Table 5).
Although taxonomic concordance suggests BSI organism presence in the gut microbiota, we sought to test this hypothesis with greater precision. To do so, we developed StrainSifter (Supplementary Fig. 2), a bioinformatic pipeline that detects whether an organism is present with sufficient abundance in short-read data sets, and outputs phylogenetic trees and single-nucleotide variant (SNV) counts between samples. We used StrainSifter to investigate the relatedness of strains of each BSI species in our metagenomes and isolates. Isolate reads were assembled into draft genomes using a short-read genome assembly tool (assembly statistics are shown in Supplementary Table 6 and CheckM assessment in Supplementary Table 7). We compared the phylogenetic relatedness of all BSI and stool strains in our sample collection to one another (Fig. 2) and to publicly available data (Supplementary Fig. 3) and counted SNVs using StrainSifter (Supplementary Tables 8 and 9). Of note, none of the 30 patients included in our study had a sufficient abundance of S. aureus in their stool samples to profile with StrainSifter, indicating that this organism probably infrequently colonizes the gut of HCT patients.
In general, we find that BSI and gut metagenomic strains from the same patient are more closely related than strains from unrelated patients. As expected, BSI and intestinal strains of typically enteric species, such as E. coli (patients 3 and 7), E. faecium (patient 25), K. pneumoniae (patient 2) and Streptococcus mitis (patient 22), are closely phylogenetically related (Fig. 2), supporting the longstanding dogma that these organisms are gut derived9,10. On one extreme, we observe zero SNVs between BSI and stool strains of patient 3 at time points 33, 32 and 27 days prior to BSI, indicating that the identical E. coli strain is present in the gut over 1 month before the onset of infection (Supplementary Table 9). On the other extreme, we measured 259 SNVs between the E. coli BSI and the stool sample for patient 7. This surprising observation suggests the possibility of a population of closely related strains, in which the dominant strain is varying over time. Alternatively, the E. coli strain that resulted in BSI may have been acquired elsewhere.
Unexpectedly, we observe that gut and BSI strains are closely related in samples from the same patient for typically nonenteric taxa, including S. epidermidis (patient 13) and P. aeruginosa (patient 19, not pictured) (Fig. 2). We find one SNV (0.4 SNVs per megabase) between BSI and gut S. epidermidis strains of patient 13, indicating that the bloodstream strain is highly concordant with the strain found in the gut 1 day before (Supplementary Table 9). Furthermore, we observe zero discriminating SNVs between the P. aeruginosa strains in the blood and stool specimens of patient 19, indicating that they are identical. Although P. aeruginosa can exist in the gut microbiota22, S. epidermidis is typically thought to originate from the skin23,24,25,26. As further evidence that the S. epidermidis bacteremia was not clearly line associated, the blood cultures of patient 13 cleared within 2 days, despite retention of the line (Supplementary Table 1). Interestingly, patient 7 did not develop an S. epidermidis BSI despite high relative abundance of this organism (>60%) over two sequential time points (Supplementary Table 5). Finally, to compare WGS-based approaches to traditional strain typing, we performed in silico MLST (Supplementary Table 10). In the four instances in which an MLST type was resolved for both gut and BSI strains, results were concordant with StrainSifter.
For the gut microbiota to be a contributing source of pathogens, the organisms must be alive. However, it is not possible to ascertain whether these organisms are alive using StrainSifter. A surrogate for measurement of a living organism is the rate of DNA replication. We used an available bioinformatic tool27 to assess replication rates for 11 stool samples from 9 patients in which gut and BSI strains were concordant and found that all had rates suggestive of active replication (Supplementary Table 11).
We observe relatively few events of potential pathogen transmission between individuals despite overlapping hospital admissions during the 20-month study period, based on sequence relatedness of bloodstream isolates or of candidate pathogens measured in the gut microbiome reservoir. For example, stool samples from patients 12 and 14 reveal E. faecium strains that differ by 49–76 SNVs (18–26 SNVs per megabase) relative to the bloodstream isolate of patient 25 (Supplementary Table 8). Similarly, several S. aureus BSI strains seem related: 710 SNVs (250 per megabase) between BSIs from patients 10 and 21, 166 SNVs (58 per megabase) between patients 3 and 5 and 729 SNVs (263 per megabase) between patients 1 and 12. However, it is important to note that StrainSifter profiles the dominant strain in each sample. Thus, true transmission events may be missed if different strains dominate in different individuals.
Finally, we asked whether closely related strains from different patients are also functionally related. We compared computationally predicted (Fig. 3 and Supplementary Table 12) and clinical antibiotic resistance (Supplementary Table 13) for individual patient BSIs. We find that predicted and clinical antibiotic resistance results are highly concordant. As noted previously, E. coli bloodstream isolates from patients 3 and 11 are phylogenetically related, differing by relatively few SNVs (60 SNVs per megabase). Functional analysis reveals that the BSI in patient 3 contains a gene encoding CTX-M, whereas the BSI in patient 11 does not. CTX-M is an extended-spectrum β-lactamase that confers resistance to most penicillins and cephalosporins. As predicted, clinical testing confirmed that the BSI in patient 3 was resistant to most penicillins and cephalosporins, whereas the BSI in patient 11 was not. By contrast, the E. coli BSI strain from patient 7 differs from that of patient 3 by 24,088 SNVs (7,390 SNVs per megabase), but demonstrates similar predicted and clinical extended-spectrum β-lactamase activity, also probably conferred by CTX-M. Phylogenetically related S. aureus BSIs exhibit similar predicted and clinical phenotypes. For example, MecR1-mediated methicillin resistance is predicted and present in S. aureus BSIs 3 and 5, which are closely related (Fig. 3 and Supplementary Tables 12 and 13). S. aureus BSIs 1 and 12, which are closely related to each other but distant from BSIs 3 and 5, lack a gene encoding MecR1 and are methicillin sensitive.
In conclusion, a detailed analysis using StrainSifter allowed us to precisely and comprehensively identify the candidate source of various BSIs. Although there is great enthusiasm for the incorporation of WGS into real-time patient management, at present, challenges in sample preparation and sequencing turnaround time limit the incorporation of such approaches into clinical care. Nevertheless, WGS is playing a growing role in hospital epidemiological studies. Characterization of gut microbiota dynamics that occur prior to infection may help us to precisely identify potential reservoirs of pathogens, thus enabling improved hospital infection prevention and management strategies.
The results presented are suggestive of a gut microbiota source for both enteric and nonenteric organisms. However, given that the present study sampled only stool microbiota, we cannot exclude the possibility of the same pathogenic strain colonizing multiple body sites from which the infection may have originated instead. In addition, although StrainSifter can precisely identify shared variants between genomes and metagenomes, it is limited to profiling only the dominant strain of a given organism in a community. However, it has been shown that gut metagenomes frequently contain only one predominant strain of each species, so StrainSifter is likely to function well under many circumstances17.
In the future, we anticipate that high-resolution WGS-based strain comparisons will facilitate the discovery of additional instances where typically nonenteric organisms are found in the gut microbiota, a model supported here. This knowledge may complement the growing body of research on therapies to improve gut microbiota diversity and may inform attempts to bolster colonization resistance against pathogens. Furthermore, more precisely identifying the origins of BSIs may influence how hospitals and health care providers can most effectively work to prevent infections. With these powerful genomic tools, we anticipate that precision source identification and strain tracking will lead us to a new, sharpened model of infectious disease.
A retrospective cohort study, approved by the institutional review board under the IRB protocol no. 42053 (principal investigator: A.S.B.), was performed at Stanford Hospital. Informed consent was obtained from all individuals whose samples were collected. At the time of cohort identification (July 2017), a stool biospecimen collection containing 964 stool samples from 402 patients was available for investigation. This collection consisted of convenience samples collected from autologous and allogeneic HCT patients at Stanford University Hospital between 5 October 2015 and 9 June 2017. Patients were included in this study if a stool sample had been collected within 30 days prior to an episode of BSI for which a blood isolate was also available. From this final cohort, we sequenced all stool samples in our collection within 60 days prior to and 31 days after BSI.
Bloodstream isolate identification
Bloodstream isolates from HCT patients who received medical care at Stanford University Hospital were obtained from the Stanford Hospital Clinical Microbiology Laboratory. All isolates considered typical bloodstream pathogens by National Healthcare Safety Network guidelines were stored in a glycerol suspension at −80 °C for up to 12 months18. Blood culture isolates considered to be skin-associated bacteria (including viridans group Streptococcus spp. and coagulase-negative Staphylococcus spp.) were saved if they were recovered in two or more blood culture sets as per National Healthcare Safety Network criteria18. Isolates were identified by standard biochemical testing and matrix-assisted laser desorption/ionization–time-of-flight mass spectrometry (Bruker Daltonics).
Bacterial bloodstream isolates were plated on brain heart infusion agar with 10% horse blood. DNA was extracted from isolates using the Gentra Puregene Yeast/Bact. Kit per manufacturer’s instructions. Stool samples were collected and stored at 4 °C for up to 24 h prior to homogenization, aliquoting and storage at −80 °C. DNA was extracted from stool using the QIAamp DNA Stool Mini Kit (Qiagen) per manufacturer’s instructions, with an initial bead-beating step prior to extraction using the Mini-Beadbeater-16 (BioSpec Products) and 1-mm diameter zirconia/silica beads (BioSpec Products). Bead-beating consisted of 7 rounds of alternating 30-s bead-beating bursts followed by 30 s of cooling on ice. The DNA concentration for all samples was measured using Qubit Fluorometric Quantitation (Life Technologies). DNA sequencing libraries from both isolates and stool were prepared using the Nextera XT DNA Library Prep Kit (Illumina), with isolates and stool microbiome libraries prepared at separate times following DNA decontamination of all laboratory surfaces and pipets (DNAZap, Ambion). Library concentration was measured using Qubit Fluorometric Quantitation (Life Technologies), and library quality and size distributions were analyzed with the Bioanalyzer 2100 (Agilent). Prepared libraries were multiplexed and subjected to 100-bp paired-end sequencing on the HiSeq 4000 platform (Illumina).
Sequence data were demultiplexed by unique barcodes (bcl2fastq v2.20.0.422, Illumina). Reads were deduplicated to remove PCR and optical duplicates using SuperDeduper v1.4 with the start location in the read at 5 bp (-s 5) and minimum length of 50 bp (-l 50)28. Deduplicated reads were trimmed using TrimGalore v0.4.4, a wrapper for CutAdapt v1.16, with a minimum quality score of 30 for trimming (-q 30), minimum read length of 50 (--length 50) and the ‘--nextera’ flag to remove Illumina Nextera adapter sequences29,30. Draft genomes of bacterial isolates were assembled using SPAdes v3.11.0 (ref. 31) with default parameters. Summary statistics for each BSI assembly were generated using ‘basic_assembly_stats.py’ from GAEMR v1.0.1 (ref. 32). Draft genome completeness was assessed with CheckM v1.0.11 ‘lineage_wf’33. Draft genomes were filtered to remove contigs smaller than 1 kb for downstream analyses.
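The final filtering step can be illustrated with a minimal Python sketch. It is not the code used in this study; the function names are hypothetical, and it assumes that 1 kb means 1,000 bases and that contigs of exactly 1,000 bases are retained.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str], min_length: int = 1000) -> List[Tuple[str, str]]:
    """Keep contigs with at least min_length bases (assumed cutoff: 1 kb = 1,000 bases)."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) >= min_length]


if __name__ == "__main__":
    # Toy assembly: contig_1 spans two sequence lines, contig_2 is just under 1 kb.
    demo = [">contig_1", "A" * 600, "C" * 600, ">contig_2", "G" * 999, ">contig_3", "T" * 1000]
    for name, seq in filter_contigs(demo):
        print(name, len(seq))
```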
Gut metagenomic reads were taxonomically classified via the One Codex platform, a web-based tool for assigning read-level classifications based on unique k-mer signatures relative to a curated reference database (database v2017)19.
Phylogenetic tree building and variant identification with the StrainSifter pipeline
StrainSifter is a pipeline deployed as a Snakemake34 workflow packaged with conda, available at GitHub (https://github.com/bhattlab/strainsifter). Snakemake v5.1.4 and conda v4.5.9 were used. StrainSifter source code can be found in the Supplementary Information. StrainSifter contains modules for variant calling and phylogenetic tree building. StrainSifter accepts as input an assembled bacterial draft genome, designated as the reference, and two or more short-read data sets (isolate or metagenomic), and can report a phylogenetic tree of input samples as well as pairwise SNV counts.
To build the phylogenetic trees reported in this paper, the most contiguous and complete genome from our isolate collection was chosen as the reference genome for each infectious species (based on clinical laboratory taxonomic classifications). For the variant counting reported herein, BSI isolate draft genomes were supplied to StrainSifter. For both analyses, all stool and BSI short-read data sets were provided as input. We also used StrainSifter to evaluate the phylogenetic relatedness of our BSI strains to those available in a published database of pathogenic isolates from an intensive-care setting (BioProject PRJNA267549)35. For both phylogeny and SNV-counting modules, preprocessed short reads are first aligned to the reference genome using the Burrows–Wheeler Aligner v0.7.10 (ref. 36). Alignments are filtered to include only high-confidence alignments with mapping quality of at least 60 using the ‘view’ tool from the SAMtools suite (v1.7)37 (samtools view -b -q 60), and further filtered using BamTools ‘filter’ (v2.4.0) to include only reads with the desired number or fewer mismatches (for example, for five or fewer mismatches: bamtools filter -tag ‘NM:<=5’)38. For phylogenetic tree construction, reads with five or fewer mismatches were included; for determining strain SNVs, reads were limited to at most one mismatch. Per-base coverage is calculated from each resulting BAM file using bedtools genomecov (v2.26.0)39 and processed with custom Python scripts to identify samples meeting a minimum average coverage of 5× across at least 40% of the genome15,40. Only samples meeting the coverage requirement are continued through the pipeline. Pileup files are created from BAM files using SAMtools ‘mpileup’ and are analyzed using custom Python scripts to identify bases occurring with at least 0.8 frequency at positions covered 5× or greater (‘Computational methods supplement’ in the Supplementary Information). Only bases with a Phred score of at least 20 are considered. Consensus sequences for each sample are created, in which bases that cannot be confidently determined given the described parameters are called as ‘N’.
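The filtering and base-calling rules described above can be sketched in Python as follows. This is an illustration only and not the StrainSifter source code. All function names are hypothetical, and several details are assumptions: the coverage rule is read as a mean depth of at least 5× together with at least 40% of reference positions covered by one or more reads, the 5× depth requirement for base calling is applied after the Phred filter, and alignments lacking an NM tag are discarded.

```python
from collections import Counter
from typing import Sequence, Tuple


def keep_alignment(sam_line: str, min_mapq: int = 60, max_mismatches: int = 5) -> bool:
    """Return True if a SAM record passes the MAPQ and NM (edit distance) filters."""
    fields = sam_line.rstrip("\n").split("\t")
    if int(fields[4]) < min_mapq:  # column 5 of a SAM record is MAPQ
        return False
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NM:i:"):
            return int(tag[5:]) <= max_mismatches
    return False  # assumption: records without an NM tag are dropped


def passes_coverage(depths: Sequence[int], min_mean_depth: float = 5.0,
                    min_fraction_covered: float = 0.4) -> bool:
    """Assumed reading of the rule: mean depth >= 5x and >= 40% of positions covered."""
    if not depths:
        return False
    covered = sum(1 for d in depths if d > 0)
    mean_depth = sum(depths) / len(depths)
    return mean_depth >= min_mean_depth and covered / len(depths) >= min_fraction_covered


def call_consensus_base(calls: Sequence[Tuple[str, int]], min_depth: int = 5,
                        min_freq: float = 0.8, min_qual: int = 20) -> str:
    """Call one position from (base, Phred) observations; return 'N' if not confident."""
    good = [base.upper() for base, qual in calls if qual >= min_qual]
    if len(good) < min_depth:
        return "N"
    base, count = Counter(good).most_common(1)[0]
    return base if count / len(good) >= min_freq else "N"


if __name__ == "__main__":
    record = "r1\t0\tctg\t1\t60\t100M\t*\t0\t0\tACGT\tIIII\tNM:i:2"
    print(keep_alignment(record, max_mismatches=5))  # tree-building threshold
    print(keep_alignment(record, max_mismatches=1))  # SNV-counting threshold
    print(call_consensus_base([("A", 30)] * 4 + [("C", 30)]))
```

In this sketch the same alignment can pass the looser tree-building threshold and fail the stricter SNV-counting threshold, which mirrors the two mismatch settings described in the text.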
To create a phylogenetic tree, core positions are identified on a per-species basis, in which core positions are defined as positions in the reference genome where a base could be confidently called for all samples meeting the coverage requirements. To generate phylogenetic trees, core positions with variants in at least one sample are identified and concatenated into one FASTA file per sample. FASTA files are aligned using MUSCLE v3.8.31 (ref. 41) and a maximum-likelihood phylogenetic tree is computed using FastTree v2.1.7 (ref. 42). Phylogenetic trees are visualized in R using the ape v5.1 (ref. 43), phangorn v2.4.0 (ref. 44) and ggtree v1.10.5 (ref. 45) packages. Pairwise SNVs are determined from the consensus sequences using a custom python script.
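As an illustration of the core-position and SNV-counting logic, the following Python sketch operates on per-sample consensus strings of equal length in which unresolved bases are ‘N’. It is not the custom script used in this study. Function names are hypothetical, and the per-megabase normalization (SNVs divided by the number of positions confidently called in both samples) is an assumption about how such rates could be computed.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def core_positions(consensus: Dict[str, str]) -> List[int]:
    """Indices at which every sample has a confidently called base (not 'N')."""
    seqs = list(consensus.values())
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("consensus sequences must share the reference length")
    return [i for i in range(length) if all(s[i] != "N" for s in seqs)]


def variable_core_positions(consensus: Dict[str, str]) -> List[int]:
    """Core positions at which at least one sample differs from the others."""
    seqs = list(consensus.values())
    return [i for i in core_positions(consensus) if len({s[i] for s in seqs}) > 1]


def pairwise_snvs(consensus: Dict[str, str]) -> Dict[Tuple[str, str], Tuple[int, int]]:
    """For each sample pair, return (SNV count, positions called in both samples)."""
    result = {}
    for a, b in combinations(sorted(consensus), 2):
        compared = snvs = 0
        for x, y in zip(consensus[a], consensus[b]):
            if x == "N" or y == "N":
                continue
            compared += 1
            snvs += x != y
        result[(a, b)] = (snvs, compared)
    return result


def snvs_per_megabase(snvs: int, compared: int) -> float:
    """Assumed normalization: SNVs per million positions compared."""
    return snvs / compared * 1e6 if compared else float("nan")


if __name__ == "__main__":
    toy = {"blood": "ACGTAC", "stool_a": "ACGTNC", "stool_b": "ACTTAC"}
    print(core_positions(toy))
    print(variable_core_positions(toy))
    print(pairwise_snvs(toy))
```

Under this assumed normalization, a single SNV reported at 0.4 SNVs per megabase would correspond to roughly 2.5 Mb of positions compared between the two samples.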
Metagenomic short reads were assembled using metaSPAdes v3.11.0 (ref. 46). MLST schemes and sequences were downloaded from the PubMLST database47. MLST gene sequences were aligned to metagenome assemblies using nucleotide BLAST v2.2.31 (ref. 48) and the top hit for each alignment was chosen based on the E-value, percent identity and alignment length. Only MLST sequences that were present in the metagenomic assembly with 100% identity across the entire length of the sequence were reported. MLST types generated by our in-house analysis were confirmed with the SRST2 synthetic MLST tool (v0.2.0)49.
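The hit-selection rule can be sketched in Python as follows. This is a hypothetical illustration rather than the in-house analysis itself. It assumes tab-separated BLAST output with six columns in the order query allele, subject contig, percent identity, alignment length, query length and E-value; the allele and contig names in the example are invented, and the tie-breaking order (lowest E-value, then highest identity, then longest alignment) is an assumption.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    allele: str    # MLST allele used as the query (hypothetical naming, e.g. "adk_10")
    contig: str    # metagenomic contig that was hit
    pident: float  # percent identity
    length: int    # alignment length
    qlen: int      # full length of the allele sequence
    evalue: float


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated hits in the assumed six-column layout."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        q, s, pident, length, qlen, evalue = line.rstrip("\n").split("\t")[:6]
        hits.append(Hit(q, s, float(pident), int(length), int(qlen), float(evalue)))
    return hits


def exact_alleles(hits: Iterable[Hit]) -> List[str]:
    """Pick the top hit per allele, then keep only full-length, 100%-identity matches."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        rank = (hit.evalue, -hit.pident, -hit.length)
        current = best.get(hit.allele)
        if current is None or rank < (current.evalue, -current.pident, -current.length):
            best[hit.allele] = hit
    return sorted(a for a, h in best.items() if h.pident == 100.0 and h.length == h.qlen)


if __name__ == "__main__":
    demo = [
        "adk_10\tcontig_1\t100.000\t536\t536\t0.0",
        "adk_10\tcontig_7\t98.500\t536\t536\t1e-150",
        "fumC_11\tcontig_2\t100.000\t400\t469\t1e-120",  # identical but partial
        "gyrB_4\tcontig_3\t99.780\t460\t460\t0.0",       # full length but not identical
    ]
    print(exact_alleles(parse_hits(demo)))
```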
Antibiotic resistance gene annotation
Putative protein sequences were identified in BSI draft genomes using Prodigal v2.6.3 (ref. 50). Antibiotic resistance genes were annotated from protein sequences by searching the Resfams antibiotic resistance protein family database (v1.2)51 using hmmscan from the hmmer package with the ‘--cut_ga’ and ‘--tblout’ flags52.
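As a rough illustration of how the tabular hmmscan output could be summarized, the Python sketch below keeps the highest-scoring family for each protein. It relies on the standard layout of the table, in which comment lines begin with ‘#’ and the first, third and sixth whitespace-separated columns hold the target (family) name, the query (protein) name and the full-sequence bit score. The choice to keep a single best family per protein is an assumption and is not described in this study; the family and protein names in the example are placeholders.

```python
from typing import Dict, Iterable, Tuple


def best_family_per_protein(lines: Iterable[str]) -> Dict[str, Tuple[str, float]]:
    """Map each query protein to its highest-scoring family as (family, bit score)."""
    best: Dict[str, Tuple[str, float]] = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and footer comment lines
        fields = line.split()
        family, protein, score = fields[0], fields[2], float(fields[5])
        if protein not in best or score > best[protein][1]:
            best[protein] = (family, score)
    return best


if __name__ == "__main__":
    # Placeholder rows in the tabular layout; names and scores are invented.
    demo = [
        "# target name accession query name accession E-value score bias ...",
        "familyA - prot_1 - 1e-50 170.2 0.1 2e-50 169.9 0.1 1.0 1 0 0 1 1 1 1 example",
        "familyB - prot_1 - 1e-20 70.5 0.0 3e-20 69.0 0.0 1.0 1 0 0 1 1 1 1 example",
        "familyB - prot_2 - 1e-30 101.0 0.2 2e-30 100.1 0.2 1.0 1 0 0 1 1 1 1 example",
    ]
    for protein, (family, score) in sorted(best_family_per_protein(demo).items()):
        print(protein, family, score)
```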
Determination of bacterial replication rates within metagenomic samples
Bacterial replication rates were assessed using the iRep v1.10 software27. Gut metagenomic samples were aligned to the BSI draft genome from the same patient using StrainSifter as described above. The resulting BAM files were converted to SAM format using SAMtools ‘view’, and the resulting SAM file and corresponding BSI draft genome were supplied to iRep as input for each sample.
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
StrainSifter and the associated source code can be found at https://github.com/bhattlab/strainsifter.
All sequencing data sets from the current study have been deposited in the Sequence Read Archive under BioProject PRJNA477326. Accession numbers are listed in Supplementary Table 14.
We thank J. Kang for her assistance with stool sample processing, as well as the other members of the Bhatt laboratory for providing feedback on the study design, bioinformatics pipeline and manuscript revisions. We also thank N. Greenfield and the One Codex team for help with using their platform. We appreciate M. Kelly, C. Severyn and D. Ward for their feedback on the manuscript. We especially thank the patients and nurses on the Blood and Marrow Transplantation service for their enthusiastic participation in this project. This work was supported in part by the National Science Foundation Graduate Research Fellowship (F.B.T.), the National Institutes of Health (NIH), National Center for Advancing Translational Science, Clinical and Translational Science Awards KL2 TR001083 and UL1 TR001085 and the American Society of Blood and Marrow Transplantation New Investigator Award (T.M.A.). A.S.B. was funded in part by the National Cancer Institute NIH K08 award, no. CA184420, the Damon Runyon Clinical Investigator Award and the Amy Strelzer Manasevit Award. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.