Abstract
An important fraction of microbial diversity is harbored in strain individuality, so identification of conspecific bacterial strains is imperative for improved understanding of microbial community functions. Limitations in bioinformatics and sequencing technologies have to date precluded strain identification owing to difficulties in phasing short reads to faithfully recover the original strain-level genotypes, which have highly similar sequences. We present ConStrains, an open-source algorithm that identifies conspecific strains from metagenomic sequence data and reconstructs the phylogeny of these strains in microbial communities. The algorithm uses single-nucleotide polymorphism (SNP) patterns in a set of universal genes to infer within-species structures that represent strains. Applying ConStrains to simulated and host-derived datasets provides insights into microbial community dynamics.
Main
Understanding how individual organisms coexist within a microbial community is crucial to understanding community functions. For example, the study of microbial community dynamics is important in human health, including how to maintain or restore a healthy human microbiome. Metagenomics has revolutionized microbiology by addressing some of these issues in a culture-independent manner. However, state-of-the-art metagenomics approaches are often limited to the species level1,2,3 or to partially assembled population consensus genomes4,5,6. Evidence that the unit of microbial action can fall below the species level comes from multiple sources, including culturing7, single-cell genomics8, redundant sequencing of the bacterial gene encoding 16S rRNA9, sequencing of internal transcribed spacers10, multilocus sequence typing11 and high-resolution analyses of genomic variation12. Therefore methods that enable strain resolution from metagenomics datasets are desirable.
Most existing culture-free approaches to identify bacterial strains in communities have drawbacks, which has limited wide adoption of these approaches. For example, single-cell sequencing requires expensive and laborious efforts in cell sorting and suspension, and thus this approach is not used to analyze large communities. Similarly, Hi-C, a sequencing-based approach13, requires extra steps and budget for cross-linking, library construction and sequencing. Strain typing methods that leverage strain-level gene copy number variations14 or strain-level phylogenetic marker SNPs such as canSNPs15, PathoScope16 and Sigma17 rely on the availability of complete reference strain genomes and, with current limitations on these resources, face challenges in studies of the broader diversity found using metagenomic sequencing approaches. An assembly-based approach is dependent on several factors, including genome structure and intraspecies divergence. With rare exceptions, assemblers usually fail to produce individual strain assemblies, instead creating either highly fragmented contigs or contigs that only represent population consensus sequences18,19; a recent effort in using variation-aware contig graphs for strain identification20 relies on manual inspection, and hence its accuracy is subject to users' experience. In all of these approaches, only a relatively small fraction of strain genomes have been successfully analyzed, and their distribution is usually biased21. On the other hand, methods based on single marker genes such as the gene encoding 16S rRNA often lack the resolution to reliably capture intraspecific genomic differences22.
To overcome these difficulties and increase the utility of metagenomic datasets, we developed Conspecific Strains (ConStrains), an algorithm that exploits the polymorphism patterns in a set of universal bacterial and archaeal genes to infer strain-level structures in species populations. Using both in silico and previously published host-derived datasets, we found that ConStrains recovers intraspecific strain profiles and phylogeny with high accuracy, and captures important features of community dynamics including dominant strain switches and rare strains. The simulated datasets address performance in the context of different within-population diversities, different numbers of strains, the interference from other species within the same community, as well as the scalability of the method using a large in silico cohort with 322 samples. Predicted within-species structures as well as the strain genotypes were highly accurate across these simulated datasets. Applying this method to an infant gut development metagenomic dataset revealed new insights into strain dynamics with functional relevance. ConStrains is implemented in Python; the source code is available as Supplementary Code and, together with full documentation, at https://bitbucket.org/luo-chengwei/constrains.
Results
The ConStrains algorithm
Guided by reference species, the ConStrains algorithm compares raw metagenomic reads to reference genomes and identifies patterns in SNPs as the basis for differentiation and quantification of conspecific strains. This approach is fundamentally different from other reference-dependent methods such as Sigma and PathoScope16,17 that rely on availability of a comprehensive reference strain collection because, unlike those methods, ConStrains can provide reliable predictions for species with only one genome (complete or draft). For confident SNP calling, a species requires a minimum of tenfold coverage (Supplementary Fig. 1) within or across all samples considered, which is obtained for all species with a relative abundance of >1% at typical sequencing depths of 5 Gbp. When applied to multiple samples, for example, a longitudinal time series or otherwise related samples, strain identities can be traced across the different samples. The algorithm achieves this in two operations: (i) identifying species for which SNPs are detected and quantified, and (ii) transforming individual SNPs into SNP profiles that represent individual strains.
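The relationship between sequencing depth, relative abundance and fold coverage quoted above can be checked with simple arithmetic. The sketch below is illustrative only: the function name is hypothetical, relative abundance is treated as the fraction of sequenced bases, and the 5-Mbp genome size is an assumed value typical of many gut bacteria rather than a figure from the text.

```python
def expected_coverage(total_bases, rel_abundance, genome_size):
    """Mean fold coverage of one species' genome in a shotgun sample.

    Assumes relative abundance equals the fraction of sequenced bases
    that derive from the species (an idealization).
    """
    return total_bases * rel_abundance / genome_size


# 5 Gbp of sequence, species at 1% relative abundance,
# assumed (hypothetical) genome size of 5 Mbp -> roughly tenfold coverage
coverage = expected_coverage(5e9, 0.01, 5e6)
print(f"{coverage:.1f}x")
```

Under these assumptions a larger genome at the same relative abundance would fall below the tenfold cutoff, which is why the threshold is best read as approximate.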
The first operation is a two-step process. Because the algorithm identifies strains only for those species with sufficient sequencing depth (≥10-fold coverage in at least one sample; Supplementary Fig. 1), the first step uses MetaPhlAn1 for rapid species composition profiling. For those species with sufficient sequencing depth, a custom database of marker genes is created from the comprehensive PhyloPhlAn marker set23, against which the raw reads are mapped using Bowtie2 (ref. 24). This targeted approach allows for optimized time and computational efficiency. Resulting marker gene alignments are processed with SAMtools25 to generate a table of coverage by base position from which SNPs are identified. It is important to note that in this process the reference sequences are removed, and SNPs are identified de novo to minimize reference dependency (Fig. 1a–d and Online Methods). We verified that such a SNP selection procedure is sufficiently accurate and uniquely sensitive to disentangle intraspecific diversity (Supplementary Note 1 and Supplementary Fig. 2).
In the second operation, individual SNPs are combined into unique SNP profiles from which strains are identified. Previous methods have approached the challenge of identifying individual organisms from microbial communities using SNPs (for example, oligotyping26 and minimum entropy decomposition27), but were limited to SNPs within the span of a sequence read length. ConStrains overcomes this read-length limitation and represents each strain by a barcode-like string of concatenated SNPs spanning hundreds of genes, referred to as the 'uniGcode'. To derive the strain's uniGcodes in a dataset, ConStrains constructs candidate models of strain combinations using a combination of SNP-flow and SNP-type clustering algorithms. The relative abundance of strains in each candidate model across the cohort is estimated sequentially using a Metropolis-Hastings Markov chain Monte-Carlo approach (Fig. 1e–g and Online Methods). Finally, to choose the optimal model with the principle of balancing model fitness and complexity, corrected Akaike information criterion (AICc) is used (Fig. 1h and Online Methods). ConStrains repeats these steps for each species with sufficient coverage, then outputs the number of strains and their respective uniGcodes and relative abundances (Fig. 1i). The uniGcode allows downstream analysis such as cross-sample comparisons and evolutionary studies.
ConStrains identifies strains in large datasets
To validate the performance of ConStrains for strain profiling, we used in silico and host-derived datasets. We generated 36 different sets of k-strain mixtures using in silico genome-based Illumina paired-end read simulation based on ten different Escherichia coli strains whose complete genomes are publicly available, representing real-life scenarios of strain admixtures (k = 2–7; Fig. 2a,b, Supplementary Fig. 3a and Supplementary Table 1). We profiled these 36 sets of reads by ConStrains using default settings, and compared the predicted results with the 'true' strain compositions using Jensen-Shannon divergence (JSD; Fig. 2b and Supplementary Fig. 3b). ConStrains successfully predicted the underlying intraspecies compositions in all 36 datasets (P < 1 × 10−5; two group t-test against random guesses; Fig. 2b), demonstrating a substantial advantage (Supplementary Fig. 4) over reference-based approaches, with an improvement of 0.191 JSD on average (Supplementary Note 1 and Supplementary Fig. 5). In 34 of the 36 sets of reads (94.44%), the numbers of strains inferred exactly matched the ground truth (Fig. 2a), with the remaining two sets of reads having an additional chimeric strain predicted at an extremely low level (<0.1%). We therefore set the recommended detection limit at 0.1% to reduce such errors computationally. As this is a relative abundance threshold, one can still target low-abundance organisms by increasing sequence depth. In similar simulations with up to 30 E. coli strains, ConStrains predicted the strain composition with high confidence when the strain number was less than ten (Fig. 2c), which represents the intraspecific upper bound for more than 95% of metagenomic species (Fig. 2d and Supplementary Note 1). 
To assess the impact of intraspecies recombination on performance, we generated both real sequencing reads from highly recombined Burkholderia pseudomallei strains28 and in silico–simulated recombinant strain–derived reads, and identified no significant adverse impact (Supplementary Note 1). We also tested the performance in a more realistic metagenomic scenario by embedding E. coli strains in communities with various levels of complexity and found that our approach remained robust (Online Methods, Supplementary Note 2 and Supplementary Table 2). We also found no significant correlation between admixture compositions' alpha diversity and prediction accuracy. These results collectively suggested good algorithm performance (Supplementary Note 1).
We then tested ConStrains using a host-derived metagenomic dataset that had previously been analyzed using a manually curated strain-identification approach. Using manual strain curation, the authors had for the first time described the changes in an infant gut microbiome during the first 24 d of life4. All three manually curated Staphylococcus epidermidis strains reported in this study were successfully predicted by ConStrains in a fully automated manner, with the predicted relative abundances of each strain over time having highly similar values to the original compositions quantified from the scaffold coverage (JSD average = 0.024, s.d. = 0.021; Supplementary Fig. 6). Because the performance of ConStrains' fully automated approach matched well with the manually curated, semiautomated approach described previously4, but required far less machine and manual resources (ConStrains completed the infant gut dataset in 8.5 CPU hours with RAM peak footprint of <2 GB on a Linux server with Xeon 2.6 GHz processors, in contrast to days to weeks of manual curation after assembly), we next applied ConStrains to a very large dataset for which a manual effort would not be feasible (for detailed resource usage, see Supplementary Note 5 and Supplementary Table 3).
Because no dataset of that size was available (especially one where true results were known), we used a simulated shotgun dataset with intraspecific structure mimicking the natural relative abundance of taxa informed by a recent gut microbiome collection effort for which samples were collected daily over the course of one year29 (Fig. 3a, Online Methods and Supplementary Note 3). ConStrains analyzed 91 species with sufficient depth in the 322 in silico samples. In total, ConStrains analyzed 3.2 terabases of paired-end reads containing 1,361 strains from 320 species, with minimal runtime and infrastructure requirements (Supplementary Note 3). ConStrains achieved high accuracy for individual samples, and also captured crucial information such as dominant strain type changes, for example in Bacteroides fragilis (Fig. 3a–c, Supplementary Table 4 and Supplementary Note 3). This large cohort also enabled us to test factors that might affect the performance of ConStrains, including population complexity, coverage and relatedness. We found that 10× coverage was necessary for accurate profiling and that strain relatedness could also affect performance (Supplementary Fig. 7 and Supplementary Note 3). With this thorough benchmarking, we next applied ConStrains to two previously published clinical datasets to illustrate the biological insights strain-level analyses can provide.
ConStrains reconstructs strain phylogeny
In a published report on the genetic variation of Burkholderia dolosa in cystic fibrosis patients, a selective culturing step had been combined with a deep population sequencing approach30. We reanalyzed that dataset using our ConStrains algorithm and predicted a total of six B. dolosa strains in the population (strains abbreviated as pop-I to pop-VI; Fig. 4a) with an abundance well above 0.1%. We compared the uniGcodes from the six strains inferred by ConStrains with the isolate genome sequence by building a phylogenetic tree, and found that all of the colony strains and two population strains (pop-I and pop-II) were closely related (Fig. 4a). Moreover, the combined relative abundance of pop-I and pop-II represented the majority of the population (51.3% and 27.9% for pop-I and pop-II, respectively). This finding corroborated observations based on the colony sequencing approach. In addition, the ConStrains algorithm identified four additional, less abundant strains (pop-III to pop-VI). None of these strains could be inferred by the colony sequencing approach alone, likely because of their low abundance (11.2%, 8.1%, 1.0% and 0.5%, respectively). To validate these additional predictions, we further examined the polymorphism patterns in these four strains, and compared them against pop-I and pop-II. We found patterns that are unlikely to have resulted from chimeric mixtures of SNPs from pop-I and pop-II (P < 0.01, permutation test; Fig. 4b). This analysis demonstrated that application of ConStrains to cross-sectional datasets, used in addition to a culture-based approach, allows for a comprehensive and efficient discovery of rare strains.
Uncovering strain dynamics in infant gut development
We next analyzed an infant gut development dataset containing 54 samples from 9 subjects (indexed subjects 1–9; Fig. 5) collected over the first three years of life (Online Methods and Supplementary Fig. 8) to further explore the ability of ConStrains to reveal strain dynamics. We ran a ConStrains analysis on 75 species that had sufficient sequencing depth for analysis (10×; Fig. 5). Because previously reported strain-detection algorithms had been limited to studying the population consensus sequences12, and ConStrains has the capability to untangle intraspecies diversity, we first examined the number of strains observed in each species. Nearly all species (94.66%) had more than two strains, with an average of 4.88 strains per subject (±1.54 s.d.; Supplementary Fig. 9). By tracking the uniGcode of each strain in separate individuals, we identified several unique strain-level longitudinal patterns. For instance, the population of Faecalibacterium prausnitzii usually comprised several strains that maintained a co-dominant profile, in which the strains maintained the same order of abundance; Ruminococcus gnavus had highly variable behaviors over time, with different strains dominating the intraspecies composition at different time points; in contrast, Bacteroides ovatus contained one dominant strain over time, keeping other strains relatively rare. Bifidobacterium bifidum strains showed dynamic patterns similar to those of F. prausnitzii; moreover, the strains reestablished the same intraspecific diversity even after the abundance of the species dropped below the detection limit (Fig. 5). We anticipate that the capability of generating better insights into the intraspecies dynamics of such health-related species31,32,33 will shed light on the role of these organisms in human physiology.
With this goal in mind, we pursued our findings in Bifidobacterium longum, an organism linked to human gut health and applied to prevention and treatment of several diseases33. We first observed that the phylogeny of B. longum strains strongly correlated with their host origins (Fig. 5), which indicated strong individuality of B. longum strains. Moreover, in subjects 4 and 6 (Fig. 6a), we observed switches in dominant strain types that were highly correlated with the overall relative abundance of the B. longum species. As previous work has shown that a single operon can affect the competitiveness of different Bacteroides fragilis strains34, we evaluated functional differences between different dominant strains. In both subjects, the different strains dominating during consecutive phases (period 2 in subject 4 and period 1 for subject 6; Fig. 6a) carried additional functions that might be crucial to B. longum's successful colonization of the host gut. In particular, the presence of the human milk oligosaccharide (HMO) utilization cluster has been shown to result from an adaptation to the human infant gut35 (Fig. 6b). Some additional functions might underlie formation of a B. longum bloom, including the presence of the fructose and L-fucose utilization gene clusters (Fig. 6b). Together, these findings might explain why strains with these functions were associated with higher relative abundance of B. longum in the infant gut microbiome. We also observed functions specific to strains that were dominant in periods when B. longum was less abundant (periods 1 and 3 in subject 4 and period 2 in subject 6; Fig. 6a), most notably that the capsular polysaccharide biosynthesis genes were absent from dominant strains when B. longum was more abundant (Fig. 6b). Taken together, strain-level insights provided by ConStrains, combined with functional analyses, could offer candidate targets and hypotheses for future studies.
Discussion
We show that the ConStrains algorithm accurately predicts strain-level profiles in large cohorts of metagenomic samples, and that the inferred uniGcodes reconstruct strain phylogeny, within or across cohorts, allowing combined cohort studies. ConStrains is scalable and has minimal resource requirements. In contrast, other approaches14,16,17 are largely dependent on prior knowledge of reference strain genomes, with subspecies resolution being directly dependent on the number of available reference strains per species. This greatly limits the application of such methods on real metagenomic data, because for most of the human microbiome species only one reference genome is available14. Current databases are quickly gaining in intraspecies genome representation, but are still far from saturating natural diversity. With just one genome per species, ConStrains can resolve natural diversity occurring within that species, and is therefore agnostic to unknown strains. Future improvements for strain-level analysis include identification of strains in the absence of any reference genomes. It is conceivable that combining ConStrains with de novo genome assembly from metagenomic data could be an appropriate candidate to overcome this hurdle.
ConStrains is particularly effective for obtaining insights that were previously overlooked using species-level findings (Supplementary Note 4 and Supplementary Figs. 10–12), and will thus enable new types of studies. As we showed with the B. longum example, combining strain-level profiles with reference genome–based gene coverage analysis can reveal candidate genes for understanding strain-specific beneficial effects and the functions that might contribute to successful colonization in the human gut. ConStrains could also identify strains or genes associated with disease and link variable genomic regions to individual strains, a major challenge in shotgun metagenomics. Strain-level profiles, together with appropriate metadata, could link reference-based or de novo–assembled genes with individual strains and help interpret unknown strain-specific functions. Our study of the infant gut development cohort captured HMO utilization cluster enrichment shifts in different development periods, which is particularly important because products of the HMO utilization cluster are essential for B. longum to exert its probiotic effects36. Finally, strain phylogeny could be used across cohorts and add metagenomic means to test fundamental ecological hypotheses, including neutral theory and other adaptive and nonadaptive mechanisms for maintaining sympatric diversity among strains. Although we applied ConStrains to human microbiome datasets, it can also be applied to environmental samples to test fundamental hypotheses about the role of microbes in the environment that can only be addressed at the strain level.
Methods
ConStrains algorithm.
Extracting target species and informative SNPs. With raw reads from samples S1, S2, ..., Sn, ConStrains starts with profiling input metagenomes using MetaPhlAn1 (v1.7) with default settings, with the exception that alignment options are set to 'very-sensitive'; species with average coverage higher than a coverage cutoff (default: 10×) in at least one sample are selected for further strain analysis. For each of the selected species, the corresponding set of the universally conserved genes reported in ref. 1 is used as a database, and Bowtie2 (ref. 24) mapping with default settings is carried out to map reads against those reference genes. Only reads with proper pairing and orientation, no indels, >30 mapping quality, >90 length mapped (overhanging part at gene 5′ and 3′ ends is clipped off before calculation) and at least 95% nucleotide identity with the reference gene are used. These reads are then piled up onto their respective reference sequences using SAMtools25, and the reference gene coverage is subsequently calculated on a per-base basis. To filter out genes with spurious mappings due to hypervariable regions or conserved universal motifs, sites with less than the default minimum coverage, as well as those outside of the 1.5 interquartile coverage range across the gene length, are masked. Any gene with more than 30% of its length masked is discarded from further analysis. SNPs are then counted across samples as those unmasked positions where the minor allele had at least two counts or more than 3% in relative abundance.
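The final SNP-calling criterion can be illustrated with a small sketch. This is not the ConStrains source code: the function name and input layout are hypothetical, the "minor allele" is assumed to be the second most frequent base at a position, and the counts are assumed to be pooled over samples after masking.

```python
def is_snp(counts, min_count=2, min_freq=0.03):
    """Decide whether one unmasked position qualifies as a SNP.

    counts: dict mapping base -> read count at this position.
    The minor allele is taken to be the second most frequent base
    (an assumption of this sketch).
    """
    total = sum(counts.values())
    ordered = sorted(counts.values(), reverse=True)
    minor = ordered[1] if len(ordered) > 1 else 0
    if total == 0 or minor == 0:
        return False
    # Criterion from the text: at least two counts OR more than 3% frequency
    return minor >= min_count or minor / total > min_freq


print(is_snp({"A": 98, "C": 2}))   # minor allele has two supporting reads
print(is_snp({"A": 99, "C": 1}))   # single read at 1% frequency
```

Because the two conditions are joined by "or", a single supporting read is enough at low-coverage positions (for example 1 of 20 reads), whereas at high coverage the count condition dominates.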
Strain typing by SNP-flow algorithm. With SNPs extracted, ConStrains first infers the strain composition and their SNP types using the 'SNP-flow' algorithm in per-species, per-sample fashion. In this algorithm, all SNP sites are first hierarchically clustered by the Euclidean distance between the frequencies of different alleles defined as
$$d(a, b) = \sqrt{\sum_{i=1}^{4} (a_i - b_i)^2}$$
where a and b are the frequency vectors of the four bases, sorted in descending order, of the respective SNPs. Clusters that contain less than 5% of the overall SNPs or fewer than ten SNPs are discarded. The centroid of each cluster is selected as representative. These SNP cluster centroids (SCCs) are then ranked in descending order based on their weight quantified as the number of SNPs they represent. Finally, a directed graph is constructed using these SCCs, in which nodes are alleles in these SCCs and each is assigned a 'capacity' defined by the allele frequency, and these alleles from neighboring SCCs are connected by edges (Fig. 1e).
In the directed graph constructed in the previous step, nodes from the same SCC are denoted as a layer. With m layers in the graph, SNP-flow will explore all possible combinations of paths from the first layer to the last. This way, every such path represents a strain genotype, and its relative abundance, c, is defined as the lowest node capacity among all nodes on the path. Once a path is visited, the path's relative abundance c is subtracted from the capacity of every node on the path (Fig. 1e). Such a pathfinding and visiting step is repeated until all nodes' capacities are zero, and the visited paths constitute one combination. ConStrains exhausts all possible SNP-type (strains) combinations β = {β1, β2, ..., βk} in each sample with the ith sample's SNP-type βi = bi1bi2...bih where bij is one of the four bases, A, C, G and T, and the associated strain profile αi = (αi1,αi2,...αih) with
$$\sum_{j=1}^{h} \alpha_{ij} = 1$$
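The path-visiting step can be sketched as follows. The text describes exploring all possible path combinations; the sketch below produces only one such combination, greedily following the highest remaining capacity in each layer, so it is a simplification rather than the full SNP-flow search. The function name, the layer representation and the tolerance `eps` are hypothetical.

```python
def snp_flow_greedy(layers, eps=1e-9):
    """Peel strain genotypes off a layered allele graph.

    layers: list of dicts, one per SNP cluster centroid, mapping
            allele -> capacity (allele frequency).
    Returns a list of (genotype string, relative abundance) pairs.
    """
    remaining = [dict(layer) for layer in layers]  # do not mutate the input
    strains = []
    # Stop as soon as some layer has no capacity left to route through
    while all(max(layer.values()) > eps for layer in remaining):
        # One path: the allele with the largest remaining capacity per layer
        path = [max(layer, key=layer.get) for layer in remaining]
        # The path's abundance c is its lowest node capacity
        c = min(layer[allele] for layer, allele in zip(remaining, path))
        for layer, allele in zip(remaining, path):
            layer[allele] -= c
        strains.append(("".join(path), c))
    return strains


layers = [{"A": 0.7, "G": 0.3}, {"C": 0.7, "T": 0.3}, {"G": 0.7, "A": 0.3}]
for genotype, abundance in snp_flow_greedy(layers):
    print(genotype, round(abundance, 3))
```

Each iteration drives at least one node's capacity to exactly zero, so the loop terminates after at most as many paths as there are nodes.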
For each sample, ConStrains picks the optimal combination that minimizes the fitting error, defined as the discrepancy between expected SNP frequencies and observed frequencies, ɛ, defined as:
where Eij is expected frequency of the ith base at the jth SNP locale; and similarly, Oij is the observed frequency of the ith base at the jth SNP locale in the pileup of aligned reads from the corresponding sample. For instance, C is the second base (i = 2), and if we observed two Cs and eight As at the fifth SNP locale (j = 5) in the pileup of aligned reads against reference, the frequency of C is 0.2 at that position and thus is referred to as O25 = 0.2. Eij is inferred using αi and βi such that
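To make the relationship between strain profiles and expected frequencies concrete, the sketch below computes the expected frequency of each base at each SNP site as the summed abundance of the strains carrying that base, then scores it against observed frequencies. Both choices are assumptions of this sketch: the expected-frequency rule is inferred from the surrounding description, and the discrepancy is taken as a plain sum of squared differences, which may differ from the exact form of ɛ used by ConStrains. All function names are hypothetical.

```python
BASES = "ACGT"


def expected_freqs(snp_types, abundances):
    """Expected base frequencies per SNP site.

    snp_types:  list of equal-length genotype strings, one per strain.
    abundances: relative abundance of each strain (should sum to 1).
    """
    n_sites = len(snp_types[0])
    expected = [dict.fromkeys(BASES, 0.0) for _ in range(n_sites)]
    for genotype, alpha in zip(snp_types, abundances):
        for j, base in enumerate(genotype):
            expected[j][base] += alpha
    return expected


def fitting_error(expected, observed):
    """Sum of squared differences (assumed form of the discrepancy)."""
    return sum(
        (e[b] - o.get(b, 0.0)) ** 2
        for e, o in zip(expected, observed)
        for b in BASES
    )


# The example from the text: eight As and two Cs at one site gives O = 0.2 for C
observed = [{"A": 0.8, "C": 0.2}]
good = fitting_error(expected_freqs(["A", "C"], [0.8, 0.2]), observed)
poor = fitting_error(expected_freqs(["A", "C"], [0.7, 0.3]), observed)
print(good < poor)
```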
Inferring strain compositions. To unify these optimal SNP types into cohort-wide strains, ConStrains next constructs a neighbor-joining tree of the SNP types from different samples based on sequence percentage identity, and utilizes an internal parameter, Δd, defined as the distance between the tree-cutting point and the leaves, to cut the tree. Rather than using a preset value, the algorithm cuts this tree using all possible Δd. Each internal node created by such a cut could be viewed as the representative of all the children nodes (SNP types) on the tree. In doing so, it identifies all possible k clusters defined by the structure of the tree of SNP types (Fig. 1f), which we refer to as candidate strains.
With the proposed k strains from the previous step, in each sample, we need to find a composition, α* = (α*1, α*2, ..., α*k) with
$$\sum_{i=1}^{k} \alpha^{*}_{i} = 1$$
to minimize the discrepancy between expected SNP frequencies and observed frequencies, ɛ, as defined previously. This is carried out by a Metropolis-Hastings Monte Carlo method. ConStrains first generates a number of seeds (default: 1,000) uniformly at random on the (k − 1)-simplex. The top 50 seeds are then selected and each such seed's vicinity on the (k − 1)-simplex is iteratively searched. In iteration t, a new point, $\alpha^{t}_{ik}$, is picked within the 0.01 radius of the previous point, $\alpha^{t-1}_{ik}$; and it is accepted as the new point with probability $\min(1, q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik}))$, where $q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik}) = \varepsilon(\alpha^{t}_{ik})/\varepsilon(\alpha^{t-1}_{ik})$. It repeats the iteration until $|1 - q(\alpha^{t}_{ik}, \alpha^{t-1}_{ik})|$ is smaller than 0.001 or the maximum number of iterations (10,000) is reached. The composition yielding the lowest ɛ is selected as optimal $\alpha^{*}_{ik}$. ConStrains repeats this step for all samples and all k, yielding solutions for each k, α*k = (α*1, α*2, ..., α*n), with corresponding error (Fig. 1g):
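A minimal sketch of this seeded search is given below, written against an arbitrary error function. Several details are assumptions rather than the ConStrains implementation: seeds are drawn as normalized exponential variates (which is uniform on the simplex), the "0.01 radius" move is approximated by a per-coordinate uniform perturbation followed by clipping and renormalization, and the acceptance ratio is taken as previous error over proposed error so that error-reducing moves are always accepted. That direction is a choice of this sketch, made because the goal is to minimize ɛ. All names are hypothetical.

```python
import random


def random_simplex_point(k, rng):
    """Uniform draw on the (k-1)-simplex via normalized exponentials."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def perturb(point, radius, rng):
    """Propose a nearby composition, clipped and renormalized to the simplex."""
    moved = [max(0.0, p + rng.uniform(-radius, radius)) for p in point]
    total = sum(moved)
    return [m / total for m in moved] if total > 0 else list(point)


def search_composition(error, k, n_seeds=1000, n_top=50, radius=0.01,
                       tol=0.001, max_iter=10000, seed=0):
    """Return (best composition, its error) for a k-strain model."""
    rng = random.Random(seed)
    seeds = [random_simplex_point(k, rng) for _ in range(n_seeds)]
    seeds = sorted(seeds, key=error)[:n_top]  # keep the lowest-error seeds
    best, best_err = seeds[0], error(seeds[0])
    for current in seeds:
        cur_err = error(current)
        for _ in range(max_iter):
            proposal = perturb(current, radius, rng)
            prop_err = error(proposal)
            # Assumed direction: q > 1 when the proposal lowers the error
            q = cur_err / prop_err if prop_err > 0 else float("inf")
            if rng.random() < min(1.0, q):
                current, cur_err = proposal, prop_err
                if cur_err < best_err:
                    best, best_err = current, cur_err
            if abs(1.0 - q) < tol:  # successive errors nearly equal: stop
                break
    return best, best_err


# Toy error: squared distance to a known composition (illustration only)
target = [0.6, 0.3, 0.1]
toy_error = lambda a: sum((x - t) ** 2 for x, t in zip(a, target))
composition, err = search_composition(toy_error, k=3, n_seeds=200,
                                      n_top=5, max_iter=500)
print([round(x, 2) for x in composition])
```

In practice the error function would be the per-sample discrepancy between expected and observed SNP frequencies for the candidate strains.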
Selecting the optimal strain model. AICc is employed to select the optimal k. The AICc of each k is calculated as:
where L = 1 − ɛk denotes the model likelihood. ConStrains selects the k with the lowest AICc and outputs the associated SNP types and compositions as final results (Fig. 1h). As noted previously, we suggest filtering strains with less than 0.1% in relative abundance as they present a high probability of being chimeric.
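The model-selection step can be illustrated with the textbook form of the corrected Akaike information criterion, AICc = 2p − 2 ln(L) + 2p(p + 1)/(n − p − 1). In the sketch below, taking the number of parameters p equal to the number of strains k, and the choice of the observation count n, are assumptions made for illustration; only L = 1 − ɛk comes from the text. The error values in the example are invented.

```python
import math


def aicc(n_params, likelihood, n_obs):
    """Textbook corrected AIC for a model with n_params parameters."""
    aic = 2 * n_params - 2 * math.log(likelihood)
    correction = (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return aic + correction


def select_k(errors_by_k, n_obs):
    """Pick the strain number k with the lowest AICc.

    errors_by_k: dict mapping k -> fitting error (L = 1 - error).
    n_obs:       number of observations (hypothetical choice, e.g. SNP sites).
    """
    scores = {
        k: aicc(k, 1.0 - err, n_obs)
        for k, err in errors_by_k.items()
        if err < 1.0 and n_obs - k - 1 > 0
    }
    best_k = min(scores, key=scores.get)
    return best_k, scores


# Invented errors: a large drop from k = 1 to k = 2, little gain afterwards
best_k, scores = select_k({1: 0.90, 2: 0.10, 3: 0.09}, n_obs=50)
print(best_k)
```

The example shows the intended trade-off: adding a third strain reduces the error only slightly, so the complexity penalty outweighs the gain in fit.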
In silico datasets.
To simulate in silico single species datasets, 62 complete E. coli genome sequences were downloaded from an NCBI database. Ten genomes were selected and their relatedness was shown by a maximum likelihood tree (Supplementary Fig. 3a) constructed from concatenated nucleotide sequences of core genes among the 10 strains using a method similar to ref. 19. For each number of strains (N = 2–7), 1,000 random compositions were sampled from a gamma distribution with k = 1 and θ = 0.5. Within each set of 1,000 compositions, the Shannon entropy of every composition was calculated, and the compositions were ranked by entropy. The compositions on the 15th, 30th,..., 90th percentiles were picked to form a gradient of intraspecific diversity for each N. The ART simulator37 was employed to simulate 100× coverage of 100-bp paired-end Illumina reads using these compositions with default settings for Illumina and library settings as “-m 350 -s 50” (Supplementary Fig. 3a). These samples were further grouped together to simulate single strain series samples (Supplementary Table 1).
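The composition-sampling procedure can be sketched with the Python standard library. The sketch assumes that each composition is obtained by drawing one gamma variate per strain and normalizing the draws to sum to 1, and that percentiles are taken by rank position; neither detail is stated in the text, and the function names are hypothetical. Six percentiles for each of six values of N would give 36 compositions, consistent with the 36 simulated sets.

```python
import math
import random


def shannon_entropy(composition):
    """Shannon entropy (natural log) of a relative-abundance vector."""
    return -sum(p * math.log(p) for p in composition if p > 0)


def sample_compositions(n_strains, n_draws=1000, shape=1.0, scale=0.5, seed=0):
    """Draw compositions as normalized gamma variates (assumed procedure)."""
    rng = random.Random(seed)
    compositions = []
    for _ in range(n_draws):
        draws = [rng.gammavariate(shape, scale) for _ in range(n_strains)]
        total = sum(draws)
        compositions.append([d / total for d in draws])
    return compositions


def diversity_gradient(compositions, percentiles=(15, 30, 45, 60, 75, 90)):
    """Pick the compositions at the given entropy percentiles."""
    ranked = sorted(compositions, key=shannon_entropy)
    last = len(ranked) - 1
    return [ranked[round(pct / 100 * last)] for pct in percentiles]


for n_strains in range(2, 8):  # N = 2-7
    gradient = diversity_gradient(sample_compositions(n_strains))
    print(n_strains, [round(shannon_entropy(c), 2) for c in gradient])
```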
These simulated E. coli reads were then spiked into in silico–constructed metagenomes to measure the impact from other species. Three human microbiome-like metagenomes with low, medium, and high complexity level (referred to as LC, MC and HC, respectively) were simulated based on an aggregated MetaPhlAn1 profile over all 690 Human Microbiome Project (HMP) samples38. E. coli and Shigella were excluded from the profile, and the rest of the species were ranked based on their average abundance in the HMP cohort. The top 20, 50, and 100 most abundant species were selected for LC, MC and HC, respectively. The species composition in each in silico metagenome was calculated as their relative abundance in the HMP cohort, normalized by their total sum. Genomes of these species were downloaded from NCBI, and a representative strain was selected at random if multiple strains of the same species were present. A total of 100 million 100-bp paired-end Illumina reads were simulated for each set with the ART simulator37 using the same settings as mentioned previously. Additional datasets for testing the sensitivity and the performance on different numbers of strains and recombined strains were generated in a similar fashion using ART (Supplementary Note 1).
The year-long shotgun metagenome cohort with 322 samples was simulated based on donor A's 16S rRNA amplicon profiles reported in ref. 29. The operational taxonomic unit (OTU) table was used as a guide for community composition in human microbiomes. To allow simulation at the strain level, however, taxonomy in the OTU table was shifted down by one level. For instance, species composition in the original OTU table was shifted to be the strain composition. NCBI draft and complete genomes were used to match as closely as possible the phylogeny of the original OTUs. Reads were then simulated with the ART simulator as previously described. The coverage was set to be 1× per 25 read counts in the 16S OTU table.
Biological datasets.
The two infant gut development longitudinal metagenomic datasets used in this study were from a previous study4 and from our recent effort in tracking nine subjects in a three-year period since birth. For the former set, all metagenomic samples were downloaded from NCBI SRA under accession number SRA052203, and the corresponding assembled S. epidermidis strains and phage genomes were downloaded from ggKBase as described4. For the latter set, 54 stool samples were collected from nine infant subjects between September 2008 and August 2010 in Finland. Samples were first collected by the subjects' parents and stored in the household freezer before being transferred on dry ice to a laboratory −80 °C freezer. Samples were then shipped to the Broad Institute for DNA extraction, in which QIAamp DNA Stool Mini Kit (Qiagen, Inc.) was used as described previously39. Library construction was carried out following Human Microbiome Project's standard protocol (http://hmpdacc.org/resources/tools_protocols.php), and 101-bp paired-end reads were produced on an Illumina HiSeq 2000 platform. The raw sequences of these samples are available at SRA under BioProject accession number PRJNA269305, and the corresponding sample information is available in Supplementary Table 5.
Prediction accuracy measurement.
To measure how close the predicted composition, P, is to the true composition, Q, we applied the Jensen-Shannon divergence with minor modifications. As P and Q may have different dimensions, we first padded the lower-dimensional one with zeros to match the higher-dimensional one, and then defined a composition M based on the sorted P and Q, denoted P′ and Q′, as:
$$M = \frac{1}{2}\left(P' + Q'\right)$$
Therefore the Jensen-Shannon divergence is:
$$\mathrm{JSD}(P' \,\|\, Q') = \frac{1}{2} D(P' \,\|\, M) + \frac{1}{2} D(Q' \,\|\, M)$$
where D(X||Y) is the Kullback-Leibler divergence defined as:
$$D(X \,\|\, Y) = \sum_{i} X(i) \log \frac{X(i)}{Y(i)}$$
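The padded and sorted divergence described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes P and Q are already normalized to sum to 1, that sorting is in descending order, that M is the element-wise mean of the sorted vectors (the standard Jensen-Shannon mixture), and that the natural logarithm is used. The function names are hypothetical.

```python
import math


def kl_divergence(x, y):
    """Kullback-Leibler divergence D(X||Y); terms with X(i) = 0 contribute zero."""
    return sum(xi * math.log(xi / yi) for xi, yi in zip(x, y) if xi > 0)


def padded_sorted_jsd(p, q):
    """Jensen-Shannon divergence between two compositions of possibly unequal length.

    Assumptions (not stated in the source): inputs sum to 1, sorting is
    descending, and the logarithm is natural.
    """
    n = max(len(p), len(q))
    # Pad the shorter composition with zeros, then sort both by abundance.
    p_sorted = sorted(list(p) + [0.0] * (n - len(p)), reverse=True)
    q_sorted = sorted(list(q) + [0.0] * (n - len(q)), reverse=True)
    # M is the element-wise mean; M(i) > 0 wherever P'(i) or Q'(i) > 0.
    m = [(a + b) / 2 for a, b in zip(p_sorted, q_sorted)]
    return 0.5 * kl_divergence(p_sorted, m) + 0.5 * kl_divergence(q_sorted, m)


# One predicted strain versus two true strains at equal abundance.
print(padded_sorted_jsd([1.0], [0.5, 0.5]))
```

Because both vectors are sorted before comparison, the measure is insensitive to strain labels: a prediction of 75%/25% scores zero against a truth of 25%/75%. With the natural logarithm the value is bounded above by ln 2.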
We calculated the SNP typing accuracy as the distance between the inferred SNP tree of strains, Tp, and the true strain tree constructed from concatenated core genes, Tq. First, a distance similar to the symmetric difference introduced by Robinson and Foulds40 was applied to calculate the distance, d, between these two trees. We then normalized d to the expected basal distance from a random tree with the same leaves. The expected basal distance is the mean distance between Tq and 1,000 randomly generated trees with the same leaves.
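A minimal Python sketch of this normalization is given below. It uses the plain Robinson-Foulds symmetric difference on nontrivial bipartitions, whereas the text only says the distance was "similar to" it, so the exact metric here is an assumption. Trees are written as nested tuples of string leaf labels, both trees are assumed to share the same leaves, and random trees are built by joining randomly chosen nodes; the representation, the random-tree model and all function names are hypothetical.

```python
import random


def clades(tree):
    """Return (leaf set, list of leaf sets for every internal node) of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_found = clades(child)
        leaves |= child_leaves
        found += child_found
    found.append(leaves)
    return leaves, found


def splits(tree):
    """Nontrivial bipartitions of the unrooted tree, each stored as the side lacking an anchor leaf."""
    all_leaves, found = clades(tree)
    anchor = min(all_leaves)
    result = set()
    for clade in found:
        side = clade if anchor not in clade else all_leaves - clade
        # Keep only splits with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            result.add(side)
    return result


def rf_distance(tree_a, tree_b):
    """Robinson-Foulds symmetric difference between two trees on the same leaves."""
    return len(splits(tree_a) ^ splits(tree_b))


def random_tree(leaves, rng):
    """Build a random binary tree by repeatedly joining two randomly chosen nodes."""
    nodes = list(leaves)
    while len(nodes) > 1:
        a = nodes.pop(rng.randrange(len(nodes)))
        b = nodes.pop(rng.randrange(len(nodes)))
        nodes.append((a, b))
    return nodes[0]


def normalized_rf(inferred, truth, n_random=1000, seed=0):
    """Distance d between inferred and true trees, divided by the mean distance
    between the true tree and n_random random trees on the same leaves."""
    rng = random.Random(seed)
    leaves = sorted(clades(truth)[0])
    d = rf_distance(inferred, truth)
    baseline = sum(rf_distance(random_tree(leaves, rng), truth)
                   for _ in range(n_random)) / n_random
    if baseline == 0:
        raise ValueError("too few leaves for a meaningful random baseline")
    return d / baseline


truth = ((("a", "b"), "c"), ("d", "e"))
inferred = ((("a", "c"), "b"), ("d", "e"))
print(rf_distance(inferred, truth), normalized_rf(inferred, truth))
```

A normalized value of 0 means the inferred topology matches the true tree, and a value near 1 means it is no closer to the truth than a random tree on the same strains.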
Accession codes
References
Segata, N. et al. Metagenomic microbial community profiling using unique clade-specific marker genes. Nat. Methods 9, 811–814 (2012).
Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).
Darling, A.E. et al. PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ 2, e243 (2014).
Sharon, I. et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 23, 111–120 (2013).
Nielsen, H.B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).
Imelfort, M. et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2, e603 (2014).
Luo, C. et al. Genome sequencing of environmental Escherichia coli expands understanding of the ecology and speciation of the model bacterial species. Proc. Natl. Acad. Sci. USA 108, 7200–7205 (2011).
Kashtan, N. et al. Single-cell genomics reveals hundreds of coexisting subpopulations in wild Prochlorococcus. Science 344, 416–420 (2014).
Faith, J.J. et al. The long-term stability of the human gut microbiota. Science 341, 1237439 (2013).
Maslunka, C., Gifford, B., Tucci, J., Gurtler, V. & Seviour, R.J. Insertions or deletions (Indels) in the rrn 16S–23S rRNA gene internal transcribed spacer region (ITS) compromise the typing and identification of strains within the Acinetobacter calcoaceticus-baumannii (Acb) complex and closely related members. PLoS ONE 9, e105390 (2014).
Han, D. et al. Population structure of clinical Vibrio parahaemolyticus from 17 coastal countries, determined through multilocus sequence analysis. PLoS ONE 9, e107371 (2014).
Schloissnig, S. et al. Genomic variation landscape of the human gut microbiome. Nature 493, 45–50 (2013).
Beitel, C.W. et al. Strain- and plasmid-level deconvolution of a synthetic metagenome by sequencing proximity ligation products. PeerJ 2, e415 (2014).
Greenblum, S., Carr, R. & Borenstein, E. Extensive strain-level copy-number variation across human gut microbiome species. Cell 160, 583–594 (2015).
Karlsson, E. et al. Eight new genomes and synthetic controls increase the accessibility of rapid melt-MAMA SNP typing of Coxiella burnetii. PLoS ONE 9, e85417 (2014).
Hong, C. et al. PathoScope 2.0: a complete computational framework for strain identification in environmental or clinical sequencing samples. Microbiome 2, 33 (2014).
Ahn, T.H., Chai, J. & Pan, C. Sigma: strain-level inference of genomes from metagenomic analysis for biosurveillance. Bioinformatics 31, 170–177 (2015).
Miller, J.R., Koren, S. & Sutton, G. Assembly algorithms for next-generation sequencing data. Genomics 95, 315–327 (2010).
Luo, C., Tsementzi, D., Kyrpides, N.C. & Konstantinidis, K.T. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 6, 898–901 (2012).
Nijkamp, J.F., Pop, M., Reinders, M.J. & de Ridder, D. Exploring variation-aware contig graphs for (comparative) metagenomics using MaryGold. Bioinformatics 29, 2826–2834 (2013).
Lasken, R.S. & McLean, J.S. Recent advances in genomic DNA sequencing of microbial species from single cells. Nat. Rev. Genet. 15, 577–584 (2014).
Ivanova, N. et al. Genome sequence of Bacillus cereus and comparative analysis with Bacillus anthracis. Nature 423, 87–91 (2003).
Segata, N., Bornigen, D., Morgan, X.C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, 2304 (2013).
Langmead, B. & Salzberg, S.L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Eren, A.M. et al. Oligotyping: Differentiating between closely related microbial taxa using 16S rRNA gene data. Methods Ecol. Evol. 4, 1111–1119 (2013).
Eren, A.M. et al. Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences. ISME J. 9, 968–979 (2014).
Nandi, T. et al. Burkholderia pseudomallei sequencing identifies genomic clades with distinct recombination, accessory, and epigenetic profiles. Genome Res. 25, 129–141 (2015).
David, L.A. et al. Host lifestyle affects human microbiota on daily timescales. Genome Biol. 15, R89 (2014).
Lieberman, T.D. et al. Genetic variation of a bacterial pathogen within individuals with cystic fibrosis provides a record of selective pressures. Nat. Genet. 46, 82–87 (2014).
Sokol, H. et al. Faecalibacterium prausnitzii is an anti-inflammatory commensal bacterium identified by gut microbiota analysis of Crohn disease patients. Proc. Natl. Acad. Sci. USA 105, 16731–16736 (2008).
Crost, E.H. et al. Utilisation of mucin glycans by the human gut symbiont Ruminococcus gnavus is strain-dependent. PLoS ONE 8, e76341 (2013).
Di Gioia, D., Aloisio, I., Mazzola, G. & Biavati, B. Bifidobacteria: their impact on gut microbiota composition and their applications as probiotics in infants. Appl. Microbiol. Biotechnol. 98, 563–577 (2014).
Lee, S.M. et al. Bacterial colonization factors control specificity and stability of the gut microbiota. Nature 501, 426–429 (2013).
Schell, M.A. et al. The genome sequence of Bifidobacterium longum reflects its adaptation to the human gastrointestinal tract. Proc. Natl. Acad. Sci. USA 99, 14422–14427 (2002).
Sela, D.A. et al. The genome sequence of Bifidobacterium longum subsp. infantis reveals adaptations for milk utilization within the infant microbiome. Proc. Natl. Acad. Sci. USA 105, 18964–18969 (2008).
Huang, W., Li, L., Myers, J.R. & Marth, G.T. ART: a next-generation sequencing read simulator. Bioinformatics 28, 593–594 (2012).
Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).
Morgan, X.C. et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79 (2012).
Robinson, D.F. & Foulds, L.R. Comparison of phylogenetic trees. Math. Biosci. 53, 131–147 (1981).
Acknowledgements
We thank N. Nedelsky for editorial support. This work was supported in part by the Crohn's and Colitis Foundation of America, the Leona M. and Harry B. Helmsley Charitable Trust, US National Institutes of Health grants U54 DK102557 (R.J.X.) and R01 DK092405 (R.J.X.), the Howard Hughes Medical Institute (R.K.) and the Finnish Centre of Excellence in Molecular Systems Immunology and Physiology Research 2012-17 (Academy of Finland, Decision No. 250114; M.K.).
Author information
Contributions
C.L. and D.G. conceived the project, C.L. designed and implemented the algorithm, C.L., D.G. and R.J.X. designed the experiments, and C.L. performed the analysis. M.K., H.S., D.G. and R.J.X. collected and sequenced the samples. C.L., R.K., R.J.X. and D.G. wrote the paper.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Integrated supplementary information
Supplementary Figure 1 Minimum requirements for sequencing depth and species’ relative abundance for ConStrains analysis based on 10x coverage cutoff
Gray lines mark the boundaries for the associated genome size; the qualified space is to the upper right of each boundary. The boundaries are calculated as the sequencing throughput (in Gbp), S, needed for a 10-fold coverage over a species with relative abundance, r%, and average genome size of g Mbp as S = g/r. Note that sequencing depth across the whole genome is correlated with the depth of the marker genes, such that the larger the genome size, the more sequencing is required to reach 10x coverage in these marker genes (and in the overall genome) when other factors, including relative abundance, are held constant.
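The boundary follows from a unit conversion: S Gbp of sequence is 1,000 × S Mbp, of which a fraction r/100 belongs to the species, so coverage is 10 × S × r / g, and setting this to 10 gives S equal to g divided by r. The short Python sketch below makes the arithmetic explicit; the function and parameter names are hypothetical.

```python
def required_throughput_gbp(genome_size_mbp, relative_abundance_pct, target_coverage=10.0):
    """Sequencing throughput (Gbp) needed to reach target_coverage on one species.

    genome_size_mbp: average genome size g, in Mbp.
    relative_abundance_pct: relative abundance r, in percent of sequenced bases.
    """
    if relative_abundance_pct <= 0:
        raise ValueError("relative abundance must be positive")
    species_fraction = relative_abundance_pct / 100.0
    required_mbp = target_coverage * genome_size_mbp / species_fraction
    return required_mbp / 1000.0  # Mbp -> Gbp


# A 5-Mbp genome at 1% relative abundance needs about 5 Gbp for 10x coverage.
print(required_throughput_gbp(5.0, 1.0))
```

At the default 10x target the function reduces to g divided by r, so halving the relative abundance doubles the required throughput.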
Supplementary Figure 2 ConStrains’ intra-species phylogenetic sensitivity in detecting strains.
(a) A neighbor-joining (NJ) tree is shown based on average nucleotide identity (ANI) of the 62 complete E. coli genomes from NCBI used in assessing ConStrains’ sensitivity towards genome relatedness. (b) The ANI distribution of all genome pairs among the 62 genomes (n = C(62, 2) = 1,891) covers the typical intra-species ANI range, 97%-100%. Representative 62 genome pairs were randomly selected to synthesize Illumina reads of strain mixtures as described in Online Methods. For simplicity, all in silico-synthesized Illumina read sets followed a 3:1 strain mixing ratio. (c) Though the number of informative SNP sites and the genome relatedness are strongly negatively correlated, ConStrains captured sufficient (>10) informative SNPs in 99.5% of all sets. (d) Among the 99.5% of all analyzable sets, ConStrains robustly recovered strain profiles with high precision (low JSD error rate). (e) At each divergence level (highlighted by red solid line along the tree shown in a), we show the three E. coli strains drawn to represent the branch (branches identified by blue circles, selected strains identified by red circles next to strain names in a) along with the corresponding predicted composition and JSD errors in comparison to the true composition.
Supplementary Figure 3 In silico test sets simulated by mixing ART-generated Illumina paired-end reads of E. coli strains at different depths with corresponding ConStrains performance.
(a) Unrooted tree on the left was constructed from concatenated nucleotide sequences of core genes shared between these ten E. coli strains; the heatmap on the right shows the coverage of each strain (rows) in each data set (columns), separated by the number of strains it contains, and ranked by its diversity measured by Shannon entropy (top panel). (b) Corresponding ConStrains performance is shown; graphs show the same data as Figure 2b with different y-axis values to increase readability. No significant differences in accuracy were observed between +LC, +MC, or +HC and pure predictions (two-tailed P = 0.536, 0.319, and 0.48 for +LC, +MC, and +HC, respectively, paired t-test; Supplementary Note 2).
Supplementary Figure 4 Comparisons between the predictions by ConStrains and Sigma over in silico E. coli data sets.
The grid represents the same strain admixtures as described in Figure 2. ConStrains’ predictions were based on a single E. coli reference genome (1-Ref.). Sigma’s predictions are shown based on one reference genome (1-Ref.), three reference genomes (3-Ref.), and all ten E. coli genomes used to simulate the admixtures (10-Ref.).
Supplementary Figure 5 Prediction of strain composition according to the number of core gene SNPs recombined between strains
Numbers next to each circle mark the exact number of SNPs recombined in silico.
Supplementary Figure 7 Factors that affect the accuracy of ConStrains analysis in the year-long simulation.
(a) The green reference strain is significantly more divergent than the other four reference strains (one-tailed P = 3.5 × 10⁻¹², t-test). (b-d) Relationships between prediction accuracy measured by JSD (y-axis of panels b-c) and the complexity of samples (Shannon index, x-axis; panel b) and the coverage of samples (x-axis; panel c). In c, the 10x cutoff for coverage is marked by the vertical dashed green line. A significant (P = 0.452, Pearson’s correlation) negative correlation between JSD and sample Shannon index was observed (b and d).
Supplementary Figure 8 Overview of time-series metagenomes from the infant gut development study.
Nine subjects (horizontal lines) were chosen and the associated sampling points (solid circles) are sorted along the longitudinal axis (with time on the x-axis).
Supplementary Figure 9 Overview of strains inferred by ConStrains for the infant gut microbiome development cohort.
(a) The uncollapsed tree shows strain-level resolution of the same tree shown in Figure 5a. (b) The average number of strains per species per subject is shown.
Supplementary Figure 11 ConStrains analysis reveals associations only observable at the strain level in the infant gut development cohort.
For each subject, significant association networks between strains were identified, whereas no species-level significant associations (FDR < 0.05) were observed. In network diagrams, each node represents a species, and the edges connecting nodes represent strains of the corresponding species that were significantly associated. The width of each edge is proportional to the number of pairs of associated strains, and the edge color indicates the direction of the association. Below each network is the corresponding heatmap, whose upper right triangle represents the number of positively associated strains and whose lower left triangle represents the number of negatively associated strains. The blue dashed oval captures the relationship between R. torques and S. wadsworthensis detailed in Supplementary Figure 12.
Supplementary Figure 12 Examples between R. torques and S. wadsworthensis of subject 8 as highlighted in Supplementary Figure 11.
At strain level (left panels), a significant positive association was found between the two highlighted strains of the two species. In contrast, no significant associations were found at the species level (right panels).
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–12 (PDF 706 kb)
Rights and permissions
About this article
Cite this article
Luo, C., Knight, R., Siljander, H. et al. ConStrains identifies microbial strains in metagenomic datasets. Nat Biotechnol 33, 1045–1052 (2015). https://doi.org/10.1038/nbt.3319
DOI: https://doi.org/10.1038/nbt.3319